# LearnPlatform COVID-19 Impact on Digital Learning

## 1. Introduction

### Problem Statement

The COVID-19 pandemic has disrupted learning for more than 56 million students in the United States. In the spring of 2020, most states and local governments across the U.S. closed educational institutions to stop the spread of the virus. In response, schools and teachers have attempted to reach students remotely through distance learning tools and digital platforms. To this day, concerns about the worsening digital divide and long-term learning loss among America’s most vulnerable learners continue to grow.

### Challenge

1. Explore the state of digital learning in 2020.
2. Examine how engagement with digital learning relates to factors such as district demographics, broadband access, and state/national level policies and events.

## 2. Data Description

The original dataset contains daily edtech engagement data from over 200 school districts in 2020. There are three basic sets of files to get started with:

* The `engagement_data` folder is based on LearnPlatform’s Student Chrome Extension. The extension collects page load events of over 10K education technology products in our product library, including websites, apps, web apps, software programs, extensions, ebooks, hardware, and services used in educational institutions. The engagement data have been aggregated at the school district level, and each file represents data from one school district.
* The `products_info.csv` file includes information about the characteristics of the top 372 products with the most users in 2020.
* The `districts_info.csv` file includes information about the characteristics of school districts, including data from NCES and FCC.

### 2.1 Import modules and setup directories

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
sns.set_style("whitegrid")

# Input data files are available in the read-only "../input/" directory
data_root = '../input/learnplatform-covid19-impact-on-digital-learning'
engagement_data_folder = os.path.join(data_root, 'engagement_data')

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### 2.2 View data sample from `engagement_data` folder

In [None]:
# Read and view first file from engagement_data folder
engagement_sample = pd.read_csv(
    os.path.join(engagement_data_folder, os.listdir(engagement_data_folder)[0])
)
engagement_sample.head()

**Columns description:**

| Name | Description |
| :--- | :----------- |
| time | date in "YYYY-MM-DD" |
| lp_id | The unique identifier of the product |
| pct_access | Percentage of students in the district who have at least one page-load event of a given product on a given day |
| engagement_index | Total page-load events per one thousand students of a given product on a given day |

<font color='green'>**Note:**</font> The engagement data are aggregated at the school district level, and each file in the folder `engagement_data` represents data from one school district. The 4-digit file name represents `district_id`, which can be used to link to district information in `districts_info.csv`. The `lp_id` can be used to link to product information in `products_info.csv`.
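
The linking described in the note can be sketched as follows. This is only an illustration on made-up rows, not the loading code used later in this notebook: the file name `1000.csv`, the product names and every value in the toy frames are hypothetical, and only the column names come from the tables in this section.

```python
import os
import pandas as pd

# Hypothetical file name; each real file is named after its district_id
file_name = "1000.csv"
district_id = int(os.path.splitext(file_name)[0])

# Toy engagement rows (values are made up for illustration)
engagement = pd.DataFrame({
    "time": ["2020-01-01", "2020-01-01"],
    "lp_id": [11111, 22222],
    "pct_access": [0.5, 1.2],
    "engagement_index": [10.0, 35.0],
})
engagement["district_id"] = district_id

# Toy product and district tables using the documented column names
products = pd.DataFrame({
    "LP ID": [11111, 22222],
    "Product Name": ["Product A", "Product B"],
})
districts = pd.DataFrame({"district_id": [1000], "state": ["Utah"]})

# Link each engagement row to its product and its district
linked = (
    engagement
    .merge(products, left_on="lp_id", right_on="LP ID", how="left")
    .merge(districts, on="district_id", how="left")
)
print(linked[["district_id", "state", "lp_id", "Product Name", "engagement_index"]])
```

Using `how="left"` keeps every engagement row even when a product or district has no matching record.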

### 2.3 View data sample from `products_info.csv` file

In [None]:
# Read and view products_info.csv file
products_info = pd.read_csv(os.path.join(data_root, 'products_info.csv'))
products_info.head()

**Columns description:**

| Name | Description |
| :--- | :----------- |
| LP ID| The unique identifier of the product |
| URL | Web Link to the specific product |
| Product Name | Name of the specific product |
| Provider/Company Name | Name of the product provider |
| Sector(s) | Sector of education where the product is used |
| Primary Essential Function | The basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled |


<font color='green'>**Note:**</font> Some products may not have labels because they are duplicates, lack an accurate URL, or for other reasons.

### 2.4 View data sample from `districts_info.csv` file

In [None]:
# Read and view districts_info.csv file
districts_info = pd.read_csv(os.path.join(data_root, 'districts_info.csv'))
districts_info.head()

**Columns description:**

| Name | Description |
| :--- | :----------- |
| district_id | The unique identifier of the school district |
| state | The state in which the district is located |
| locale | NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. See [Locale Boundaries User's Manual](https://eric.ed.gov/?id=ED577162) for more information. |
| pct_black/hispanic | Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data |
| pct_free/reduced | Percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data |
| county_connections_ratio | `ratio` (residential fixed high-speed connections over 200 kbps in at least one direction/households) based on the county level data from FCC Form 477 (December 2018 version). See [FCC data](https://www.fcc.gov/form-477-county-data-internet-access-services) for more information. |
| pp_total_raw | Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools \(NERD\$\) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district. |

<font color='green'>**Note:**</font> There are many missing values marked as `NaN`, indicating that the data was suppressed to maximize anonymization of the dataset.


## 3. The state of digital learning in 2020

Let's explore the state of digital learning in 2020. I'd like to start with `engagement_data` first.

### 3.1 Explore `engagement_data`

#### 3.1.1 Load data

Before starting any data analysis it is good to write a helper function that reads multiple files into one dictionary of dataframes.

In [None]:
# Create function to read engagement data
def read_engagement_data(data_folder=engagement_data_folder):
    """
    Returns dictionary of dataframes with key as filename
    and value as pd.DataFrame.
    >>> read_engagement_data()
    {"6345": pd.DataFrame, ...}
    """
    result = {}
    for i in os.listdir(data_folder):
        result[i[:-4]] = pd.read_csv(os.path.join(data_folder, i), parse_dates=[0,])
    return result

# Read engagement data
engagement_data = read_engagement_data()

# View data sample
engagement_data["6345"].head()

#### 3.1.2 Calculate monthly mean engagement index

Now let's calculate monthly mean `engagement_index` for each district.

In [None]:
# Create function to calculate monthly engagement_index mean
def mean_monthly_engagement_index(data=engagement_data):
    """
    Calculates mean monthly engagement_index, treating 'NaN' values as 0.
    >>> mean_monthly_engagement_index()
    {"6345": pd.DataFrame, ...}
    """
    result = {}
    cols_filter = ["time", "engagement_index"]
    cols_rename = {"time": "month", "engagement_index": "mean_eng_idx"}
    for key, value in data.items():
        new_value = value[cols_filter].fillna(0).copy()
        new_value["time"] = new_value["time"].dt.month
        new_value = new_value.groupby(["time"]).mean().reset_index()
        new_value.rename(columns=cols_rename, inplace=True)
        result[key] = new_value
    return result

# Calculate monthly engagement index mean
mean_eng_idx = mean_monthly_engagement_index()

# View data sample
mean_eng_idx["6345"]
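
A quick aside on the helper above: it replaces missing `engagement_index` values with 0 before averaging, which gives lower monthly means than skipping them would. The sketch below shows the difference on made-up numbers (they are not dataset values); which option is more appropriate depends on whether a missing value is taken to mean "no activity" or "unknown".

```python
import numpy as np
import pandas as pd

# Made-up daily values for one hypothetical product
toy = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-06", "2020-01-07", "2020-01-08", "2020-02-03"]),
    "engagement_index": [100.0, np.nan, 200.0, 50.0],
})
month = toy["time"].dt.month

# Option 1: treat missing values as zero engagement (what the helper above does)
mean_filled = toy["engagement_index"].fillna(0).groupby(month).mean()

# Option 2: ignore missing values (mean() skips NaN by default)
mean_skipped = toy["engagement_index"].groupby(month).mean()

print(pd.DataFrame({"filled_with_0": mean_filled, "nan_skipped": mean_skipped}))
```

In this toy example the January mean is 100 with zeros filled in and 150 with missing values skipped, while February is unaffected.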

#### 3.1.3 Merge monthly districts data
In the code cell below we'll merge the monthly mean `engagement_index` of all districts into one dataframe.

In [None]:
# Create function to merge monthly engagement index mean
def merge_mean_monthly_engagement_index(data=mean_eng_idx):
    """
    Merge mean_eng_idx (mean monthly engagement index) of
    every district into one pd.Dataframe and rename columns
    with mean monthly engagement index values by district
    id number.
    """
    result = pd.DataFrame()
    for key, value in data.items():
        val = value.rename(columns={"mean_eng_idx": key})
        if result.empty:
            result = val
        else:
            result = pd.merge(result, val, how="left", on="month")
    return result.fillna(0)

# Merge monthly engagement index mean
mean_eng_idx_merged = merge_mean_monthly_engagement_index()

# View result
mean_eng_idx_merged
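
One property of the merge above is worth keeping in mind: with `how="left"`, only the months present in the first district processed are kept. The sketch below illustrates this on two toy frames (the district labels `A` and `B` and all values are hypothetical) and compares it with `how="outer"`, which keeps every month seen in any district.

```python
import pandas as pd

# Two hypothetical districts covering different months
first = pd.DataFrame({"month": [1, 2], "A": [10.0, 20.0]})
second = pd.DataFrame({"month": [2, 3], "B": [5.0, 7.0]})

# Left merge: months are limited to those of the first frame
left = pd.merge(first, second, how="left", on="month")

# Outer merge: the union of months from both frames
outer = pd.merge(first, second, how="outer", on="month").sort_values("month")

print(left["month"].tolist())
print(outer["month"].tolist())
```

If the first district has data for all 12 months, both variants give the same set of months.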

#### 3.1.4 View difference between districts in 2020

Let's see if there is any big difference between districts' `engagement_index` in 2020 to identify outliers and understand data distribution:

In [None]:
mean_eng_idx_merged.iloc[:, 1:].mean().describe()

Across the 233 school districts, the mean `engagement_index` in 2020 varies from 3.61 to 1215.50 page-load events per one thousand students per day. Such a wide spread tells us that school districts differ greatly in how much they use distance learning tools and digital platforms. The highest value corresponds to just over 1 page-load event per student per day. The lowest means that in some districts students barely use digital learning platforms at all.
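
Because `engagement_index` is expressed per one thousand students, dividing by 1000 gives page-load events per student. A small sketch of that arithmetic, using the two extreme values quoted above (the helper name `per_student` is just an illustrative choice):

```python
# engagement_index counts page-load events per 1,000 students per day
highest = 1215.50
lowest = 3.61

def per_student(engagement_index):
    """Convert events per 1,000 students into events per student."""
    return engagement_index / 1000

print("Highest, events per student per day:", per_student(highest))
print("Lowest, events per student per day:", per_student(lowest))
print("Ratio between highest and lowest:", round(highest / lowest))
```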

#### 3.1.5 The biggest and the lowest mean engagement index examples
Let's explore the districts with the biggest and the lowest mean engagement index values in more detail in the charts below, to try to identify data patterns for both examples.

In [None]:
# Find max and min index (label)
idx_max = mean_eng_idx_merged.iloc[:, 1:].mean().idxmax()
idx_min = mean_eng_idx_merged.iloc[:, 1:].mean().idxmin()

# Create months values array for x axis
months = mean_eng_idx_merged.month.values

# Create line plots for 2 districts
fig, axes = plt.subplots(1, 2, figsize=(18, 6))
fig.suptitle("Districts with max and min engagement_index in 2020")
axes[0].set_title(f"District {idx_max} (top outlier)")
sns.lineplot(ax=axes[0], x=months, y=mean_eng_idx_merged[idx_max].values)
axes[1].set_title(f"District {idx_min} (bottom outlier)")
sns.lineplot(ax=axes[1], x=months, y=mean_eng_idx_merged[idx_min].values)
plt.show()

For these 2 school districts (top and bottom outliers), the month-to-month patterns of the mean engagement index also look different.

#### 3.1.6 Monthly mean engagement index

Let's look at the distribution of mean engagement index values for all districts. But first we need to prepare the data.

In [None]:
# Create function to concat monthly engagement index mean
def concat_mean_monthly_engagement_index(data=mean_eng_idx, fix_missing=False):
    """
    Concat mean_eng_idx (mean monthly engagement index) of
    every district into one pd.DataFrame and add a new column
    district_id with the district id number. Fix missing
    month values by adding zero (optional).
    """
    result = []
    fix_df = pd.DataFrame({"month": [i for i in range(1, 13)]})
    
    def fix(dataframe):
        """
        Add missing months values.
        """
        if len(dataframe) < 12:
            #print(key, len(new_value), end=" >> ")
            dataframe = pd.merge(fix_df, dataframe, how="left", on="month")
            dataframe = dataframe.fillna(0)
            #print(key, len(new_value))
        return dataframe
    
    for key, value in data.items():
        new_value = value.copy()
        if fix_missing:
            new_value = fix(new_value)
        new_value["district_id"] = key
        result.append(new_value)
    return pd.concat(result, ignore_index=True)

# Concat monthly engagement index mean
mean_eng_idx_concat = concat_mean_monthly_engagement_index(fix_missing=True)

# View result
mean_eng_idx_concat

In [None]:
# Count month values, should be 233 if there are no missing values in dataset
# or fix_missing=True and less than 233 otherwise
mean_eng_idx_concat.month.value_counts()

In [None]:
# Create function to boxenplot monthly mean engagement index for all districts
def plot_monthly_engagement_index(data, stripplot=True):
    plt.figure(figsize=(18, 5))
    plt.title("Monthly mean engagement index in 2020 (all districts)")
    if stripplot:
        sns.stripplot(x="month", y="mean_eng_idx", data=data)
    sns.boxenplot(x="month", y="mean_eng_idx", data=data)
    
# Boxenplot monthly mean engagement index for all districts  
plot_monthly_engagement_index(mean_eng_idx_concat)

# Plot line for mean engagement index (all districts)
plt.figure(figsize=(18, 4))
plt.title("Monthly mean engagement index in 2020 (all districts combined)")
sns.lineplot(x="month", y="mean_eng_idx", data=mean_eng_idx_concat.groupby("month").mean(numeric_only=True).reset_index())
plt.show()

There are some significant outliers in the first half of 2020. Engagement rises in January and February, then declines from May to July, and July is the lowest month of the year. Let's drop some top outliers to see the data distribution on the chart more clearly.

In [None]:
# Create function to boxenplot monthly mean engagement index for all districts w/o outliers
def plot_monthly_engagement_index_wo_outliers(data, top_limit, stripplot=True):
    top_map = data["mean_eng_idx"] < top_limit
    plot_monthly_engagement_index(data[top_map], stripplot)

top_limit = 1000
# Boxenplot monthly mean engagement index for all districts w/o outliers
plot_monthly_engagement_index_wo_outliers(mean_eng_idx_concat, top_limit)

In this chart the data is easier to read. More interestingly, the last 4 months look very similar and show some stability (even with top outliers). Moreover, the lower edge of the largest box segments moved up slightly. Let's plot common boundaries for the last 4 months and the mean engagement index of all districts combined in 2020.

In [None]:
# Create function to find and plot common boundaries for last 4 months
def plot_monthly_engagement_index_with_bound(data, top_limit, stripplot=False,):
    plot_monthly_engagement_index_wo_outliers(data, top_limit, stripplot)
    reg_map = data["mean_eng_idx"] < top_limit
    upper_bound = data[reg_map].groupby("month").describe()[-4:][("mean_eng_idx", "75%")].max()
    lower_bound = data[reg_map].groupby("month").describe()[-4:][("mean_eng_idx", "25%")].min()
    plt.plot([upper_bound for i in range(12)], color="red")
    plt.plot([lower_bound for i in range(12)], color="red")
    
# Boxenplot monthly mean engagement index for all districts with boundaries
plot_monthly_engagement_index_with_bound(mean_eng_idx_concat, top_limit)

# Calculate outliers % input
reg_map = mean_eng_idx_concat["mean_eng_idx"] < top_limit
mean_all = mean_eng_idx_concat.groupby("month")["mean_eng_idx"].mean()
mean_reg = mean_eng_idx_concat[reg_map].groupby("month")["mean_eng_idx"].mean()
outliers_input = ((mean_all - mean_reg) / mean_all * 100)\
    .rename("outliers_percent_input")\
    .reset_index()

# Find top outliers
top_map = (mean_eng_idx_concat["mean_eng_idx"] > top_limit)
july_map = (mean_eng_idx_concat["mean_eng_idx"] > 400)\
            & (mean_eng_idx_concat["month"] == 7)
top_map = top_map | july_map
top_outliers = mean_eng_idx_concat[top_map]

# Plot line for mean engagement index (all districts)
plt.figure(figsize=(18, 4))
plt.title("Monthly mean engagement index in 2020 (all districts combined)")
sns.lineplot(
    x="month", y="mean_eng_idx",
    data=mean_eng_idx_concat[reg_map].groupby("month").mean(numeric_only=True).reset_index(),
    label="Top outliers excluded",
)
sns.lineplot(
    x="month", y="mean_eng_idx",
    data=mean_eng_idx_concat.groupby("month").mean(numeric_only=True).reset_index(),
    label="Top outliers included",
)
ax2 = plt.twinx()
ax2.grid(False)
for i, txt in enumerate(outliers_input.outliers_percent_input.values):
    ax2.annotate(
        round(txt, 2),
        (outliers_input.month.values[i],
        outliers_input.outliers_percent_input.values[i]),
        xytext=(outliers_input.month.values[i] + 0.1,
        outliers_input.outliers_percent_input.values[i] + 0.1),
    )
sns.lineplot(
    x="month", y="outliers_percent_input",
    data=outliers_input,
    label="Top outliers % input",
    ax=ax2,
    linestyle="None",
    marker="o",
    color='k'
)
plt.legend()
plt.show()
plt.figure(figsize=(18, 4))
plt.title("Monthly mean engagement index in 2020 (top outliers)")
sns.lineplot(x="month", y="mean_eng_idx", data=top_outliers.groupby("month").mean(numeric_only=True))
plt.show()

July was the least active month of 2020. Top outliers contributed most to the engagement index at the beginning of the year, and their contribution gradually decreased to 0 in July. After July it peaked in September and then gradually decreased again. Overall, the contribution of top outliers to the monthly mean engagement index dropped almost fivefold compared with February. It is difficult to say at this point what caused this. It needs more study in terms of the characteristics of the top 372 products from the `products_info.csv` file, the `pct_access` engagement data, and state interventions, practices or policies. Hopefully this extra data will help us find a reasonable explanation.
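
The "top outliers % input" figure used here is the share of the monthly mean that disappears when the outliers are excluded: (mean with outliers - mean without outliers) / mean with outliers * 100. A minimal sketch of that calculation on made-up values (the toy frame is hypothetical, and only the column names and the `top_limit` threshold follow the notebook):

```python
import pandas as pd

# Made-up monthly means for three hypothetical districts over two months
toy = pd.DataFrame({
    "month": [1, 1, 1, 2, 2, 2],
    "mean_eng_idx": [100.0, 200.0, 1500.0, 100.0, 200.0, 300.0],
})
top_limit = 1000

# Monthly mean with and without values above the limit
mean_all = toy.groupby("month")["mean_eng_idx"].mean()
mean_regular = toy[toy["mean_eng_idx"] < top_limit].groupby("month")["mean_eng_idx"].mean()

# Share of the monthly mean attributable to the outliers, in percent
outliers_percent_input = (mean_all - mean_regular) / mean_all * 100
print(outliers_percent_input)
```

In this toy example a single district above the limit accounts for 75% of the January mean, and February has no outliers, so its share is 0.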

#### 3.1.7 Top outliers

Let's identify a group of top outliers and districts with the biggest mean engagement index in 2020.

In [None]:
# Find districts with biggest mean engagement index
b_map = mean_eng_idx_concat.groupby("month")["mean_eng_idx"].idxmax()
biggest_outliers = mean_eng_idx_concat.loc[b_map.values]

# Calculate final rating
final_rating = pd.concat(
    [top_outliers, biggest_outliers]
    ).district_id.value_counts()

# Plot results
plt.figure(figsize=(18, 10))
gs = gridspec.GridSpec(2, 6)
gs.update(wspace=0.4, hspace=0.3)
ax1 = plt.subplot(gs[0, :3])
ax2 = plt.subplot(gs[0, 3:])
ax3 = plt.subplot(gs[1, :2])
ax4 = plt.subplot(gs[1, 2:4])
ax5 = plt.subplot(gs[1, 4:]) 
plt.suptitle("Top outliers", fontsize="16")

ax1.set_title("Top outliers in 2020")
sns.stripplot(
    ax=ax1, x="month", y="mean_eng_idx",
    hue="district_id", data=top_outliers
)

ax2.set_title("Top positions in 2020")
sns.stripplot(
    ax=ax2, x="month", y="mean_eng_idx",
    hue="district_id", data=biggest_outliers
)

ax3.set_title("Top outliers scores in 2020")
sns.barplot(
    ax=ax3,
    x=top_outliers.district_id.value_counts().index,
    y=top_outliers.district_id.value_counts().values,
)

ax4.set_title("Top position scores in 2020")
sns.barplot(
    ax=ax4,
    x=biggest_outliers.district_id.value_counts().index,
    y=biggest_outliers.district_id.value_counts().values,
)

ax5.set_title("Final rating scores")
sns.barplot(ax=ax5, x=final_rating.index, y=final_rating.values)
plt.show()

Now we have the top outliers and some information about them. The most interesting for further study are districts 9536, 6418 and probably 9007. District 9536 is the most persistent one. It appears 7 times among the top outliers and holds the top position for 5 months in a row (collecting 12 points in the final rating score), through the decreasing trend and including the lowest month, July. District 6418 is a newcomer among the top outliers since September. It took the position from district 9536 in September and October and held the top position for 4 months in a row until the end of the year. The second newcomer is district 9007. It held the top position in August and the third position among the top outliers in September, then disappeared from the top outliers until the end of the year. Before studying the top outliers further, let's identify the middle segments first.

#### 3.1.8 Middle segments


In [None]:
# Find upper and lower boundaries of middle segments in each month
b_lim = mean_eng_idx_concat["mean_eng_idx"] < top_limit
mid_seg_cols = [("mean_eng_idx", "25%"), ("mean_eng_idx", "75%")]
limits = mean_eng_idx_concat[b_lim].groupby("month").describe()[mid_seg_cols]
limits.columns = ['_'.join(col) for col in limits.columns.values]

# Filter middle segment
raw = mean_eng_idx_concat[b_lim].merge(limits.reset_index(), on="month")
mid_seg_map = (raw.iloc[:, 1] <= raw.iloc[:, 4]) & (raw.iloc[:, 1] >= raw.iloc[:, 3])
mid_seg = raw[mid_seg_map].iloc[:, 0:3].copy()
mid_seg

In [None]:
# Plot monthly mean engagement index of middle segments
plt.figure(figsize=(18, 5))
plt.title("Monthly mean engagement index in 2020 (middle segments)")
sns.lineplot(x="month", y="mean_eng_idx", data=mid_seg.groupby("month").mean(numeric_only=True))
plt.show()

Middle segments demonstrate more stability and improvement in the last 4 months of 2020 compared to the beginning of the year.
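
The middle segment is the set of districts whose monthly value lies between the 25th and 75th percentiles of that month. As a sketch under that assumption, the same filter can be written with `groupby(...).transform`, which avoids selecting columns by position. The data below is made up, and this is an alternative formulation rather than the code used above.

```python
import pandas as pd

# Made-up monthly means for five hypothetical districts over two months
toy = pd.DataFrame({
    "month": [1] * 5 + [2] * 5,
    "mean_eng_idx": [10.0, 20.0, 30.0, 40.0, 50.0, 1.0, 2.0, 3.0, 4.0, 100.0],
})

# Per-month quartiles, broadcast back to every row of the month
grouped = toy.groupby("month")["mean_eng_idx"]
q25 = grouped.transform(lambda s: s.quantile(0.25))
q75 = grouped.transform(lambda s: s.quantile(0.75))

# Keep rows inside the interquartile range of their month
mid = toy[(toy["mean_eng_idx"] >= q25) & (toy["mean_eng_idx"] <= q75)]
print(mid)
```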

### 3.2 Explore `products_info.csv`

#### 3.2.1 Variety of product types

Let's count values in `Primary Essential Function` column, to view variety of product types.

In [None]:
# Count values of Primary Essential Function
count_prod_types = products_info["Primary Essential Function"].value_counts()
print(count_prod_types.shape[0], "types of products in total.")
count_prod_types

There are 35 types of products. That is a somewhat complicated starting point for a data study. Let's make it a bit easier and find the top 10 products with the highest mean engagement index in 2020.

#### 3.2.2 Top 10 products in 2020

Prepare data.

In [None]:
# Create function to merge products and engagement index
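def eng_prod_merge(prod=products_info, eng=engagement_data):
    """
    Merge monthly mean engagement data of every district with
    the 'Primary Essential Function' label from products info
    and concatenate the results into one pd.DataFrame.
    """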
    prd_inf = prod.rename(columns={"LP ID": "lp_id"})
    result = None
    for key, value in eng.items():
        new_val = value.copy()
        #new_val.dropna(inplace=True)
        new_val.rename(columns={"time": "month"}, inplace=True)
        new_val["month"] = new_val["month"].dt.month
        new_val = new_val.groupby(["month", "lp_id"]).mean().reset_index()
        new_val["district_id"] = key
        new_val = new_val.merge(prd_inf[["lp_id", "Primary Essential Function"]], on="lp_id")
        new_val["lp_id"] = new_val["lp_id"].astype(int)
        if result is None:
            result = new_val.copy()
        else:
            result = pd.concat([result, new_val])
    return result

e_p_merged = eng_prod_merge()
e_p_merged

Find top 10 products with highest mean engagement index.

In [None]:
# Find top ten products
e_p_summary = e_p_merged[["lp_id", "pct_access", "engagement_index"]].groupby("lp_id").mean()
top_ten = e_p_summary.sort_values(["engagement_index"], ascending=False)[:10].reset_index()
top_ten = top_ten.merge(products_info.rename(columns={"LP ID": "lp_id"}), on="lp_id")
top_ten

Let's plot the top 10 products in 2020 to see their engagement index trend over all school districts.

In [None]:
# Plot top ten products line charts
fig, axes = plt.subplots(4, 3, figsize=(18, 12))
plt.subplots_adjust(hspace=0.6)
plt.suptitle("Top 10 products in 2020\n with trend line", fontsize="16")
for i in range(12):
    r, c = divmod(i, 3)
    if i < top_ten.shape[0]:
        _map = e_p_merged.lp_id == top_ten.iloc[i].lp_id
        data = e_p_merged[_map].groupby("month").mean(numeric_only=True).reset_index()
        axes[r][c].set_title(f"Product id: {top_ten.iloc[i].lp_id}, rating position # {i + 1}")
        sns.lineplot(ax=axes[r][c], x="month", y="engagement_index", data=data)
        # Plot trend line if no data is missing
        if len(data) == 12:
            sns.lineplot(
                ax=axes[r][c], x=[1, 12],
                y=[data.iloc[:4].mean().engagement_index, data.iloc[-4:].mean().engagement_index]
            )
    else:
        axes[r][c].axis("off")
plt.show()

At a quick glance, product 61292 (LC - Sites, Resources & Reference - Streaming Services) stands out. It looks like it was created in June, grew significantly and took the 3rd rating position (there is no trend line because data is not available for all months). On the other hand, the engagement index of product 24711 (LC - Study Tools) declined by the end of the year, and the negative slope of its trend line is almost identical to that of product 99916 (LC/CM/SDO - Other).

#### 3.2.3 Positive and negative trends

Let's find which products have positive and negative trends.

In [None]:
# Find trend line slopes for all products
trends_dict = {}
for i in e_p_merged.lp_id.unique():
    _map = e_p_merged.lp_id == i
    data = e_p_merged[_map].groupby("month").mean(numeric_only=True)
    if data.shape[0] == 12:
        trends_dict[i] = data.iloc[-4:].mean().engagement_index\
            - data.iloc[:4].mean().engagement_index

product_trends = pd.DataFrame(trends_dict.values(), trends_dict.keys()).reset_index()
product_trends.rename(columns={"index": "lp_id", 0: "trend"}, inplace=True)
pos_trends = (product_trends["trend"] > 0).sum()
neg_trends = (product_trends["trend"] < 0).sum()
print("Positive trend:", pos_trends)
print("Negative trend:", neg_trends)
print("Missing values:", len(e_p_summary) - (pos_trends + neg_trends))
print("Total:", len(e_p_summary))

We have 66 products with positive trends, 256 with negative trends and 47 with missing values, 369 in total. Since `products_info.csv` lists 372 products, 3 products are missing from our summary. Let's identify them to figure out why they were skipped.

In [None]:
# Identify missing products numbers
missing_products = set(products_info["LP ID"]).difference(set(e_p_summary.index))
missing_products

It looks like products 36254, 37805 and 88065 do not have any records in the engagement data.

In [None]:
# View missing products
products_info[products_info["LP ID"].isin(missing_products)]

In [None]:
# Check missing products vs engagement data
all_districts = pd.concat(engagement_data.values())
all_districts[all_districts.lp_id.isin(missing_products)]

In [None]:
# Add trend annotation to product trends
def trend(x):
    if x < -1:
        return "negative"
    elif x > 1:
        return "positive"
    else:
        return "no trend"
product_trends["trend_annot"] = product_trends["trend"].transform(trend)
product_trends
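
The trend measure in this section is the mean of the last 4 months minus the mean of the first 4 months. For comparison, the sketch below also computes the slope of a least-squares line through all 12 months with `np.polyfit`. That is a different technique from the one used in this notebook, shown only as a possible cross-check, and the monthly values are made up.

```python
import numpy as np

months = np.arange(1, 13)
# Made-up monthly mean engagement with a summer dip
engagement = np.array([50, 60, 70, 40, 30, 20, 10, 30, 60, 70, 75, 80], dtype=float)

# Measure used in this notebook: last 4 months minus first 4 months
endpoint_trend = engagement[-4:].mean() - engagement[:4].mean()

# Alternative (not the notebook's method): slope of a least-squares line
slope, intercept = np.polyfit(months, engagement, deg=1)

print("Endpoint difference:", endpoint_trend)
print("Least-squares slope per month:", slope)
```

The two measures are in different units (a difference of index values versus index change per month), so only their signs are directly comparable.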

### 3.3 Explore `districts_info.csv`

#### 3.3.1 Check for missing data
Now let's try to explore districts info data to understand how much we can get from this dataset for our analysis.

In [None]:
# Check what kind of missing values we have
total = len(districts_info)
col_names = districts_info.columns.to_list()[3:]
print("Total number of districts:", total)
print("Districts with missing location data:", total - len(districts_info.iloc[:, :3].dropna()))
for i, col_n in enumerate(col_names, 3):
    print(
        f"Districts with missing {col_n} data:",
        total - len(districts_info.iloc[:, [0, i]].dropna())
    )
print(
    "Districts with missing all the data:",
    districts_info[districts_info.isna().sum(axis=1) == 6].count().sum())

It looks like the 57 districts that have no location data also have no other data in the dataset. Let's drop them, as they will not give us any valuable information.

#### 3.3.2 How many states are represented by school districts

In [None]:
# Drop districts with NaN values in all columns
d_inf_clean = districts_info\
    .set_index("district_id")\
    .dropna(how="all")\
    .reset_index()

# Count districts by state
print("Total number of states:", d_inf_clean["state"].nunique())
print("Total number of districts:", d_inf_clean["state"].count())
d_inf_clean.iloc[:, :2]\
    .groupby("state")\
    .count()\
    .sort_values("district_id", ascending=False)\
    .rename(columns={"district_id": "number_of_districts"})

After removing missing data we have 23 states represented by 176 districts. Connecticut is at the top with 30 school districts in the dataset.

#### 3.3.3 Correlation

Let's combine some of the data we've explored so far into one dataset.

In [None]:
# Combine all districts data with mean_eng_idx
year_mean = mean_eng_idx_concat[["district_id", "mean_eng_idx"]]\
    .groupby("district_id")\
    .mean()\
    .reset_index()
year_mean["district_id"] = year_mean["district_id"].astype(int)
all_districts = d_inf_clean.merge(year_mean, on="district_id")
all_districts.head()

Do the same for top districts.

In [None]:
# Top districts
# Set index as 'district_id' and filter by final_rating.index
top_districts = all_districts\
    .set_index("district_id")\
    .reindex(final_rating.index.astype(int))
top_districts.dropna(how="all", inplace=True)
top_districts

An interesting fact about the top districts: district 9536, in New York (city locale), is at the top of the list. District 9515 is also located in New York state, but in a rural locale.

Split columns to prepare data for pairplot.

In [None]:
# Split columns data
def split(series, col1, col2, dict_, to="int"):
    # Choose the numeric type for the two extracted values
    cast = int if to == "int" else float
    for val in series.values:
        if pd.isna(val):
            val1, val2 = np.nan, np.nan
        else:
            # Drop the enclosing characters, then split on the comma
            first, second = val[1:-1].split(",")
            val1, val2 = cast(first), cast(second)
        # Append to the existing lists or create them on first use
        dict_.setdefault(col1, []).append(val1)
        dict_.setdefault(col2, []).append(val2)
    return dict_

splitted = {}
splitted = split(all_districts["pct_black/hispanic"], "pct_b", "pct_h", splitted, to="float")
splitted = split(all_districts["pct_free/reduced"], "pct_free", "pct_reduced", splitted, to="float")
splitted = split(all_districts["county_connections_ratio"], "conn_r", "conn_rr", splitted, to="float")
splitted = split(all_districts["pp_total_raw"], "pp_loc", "pp_fed", splitted)

In [None]:
# Add splitted data to dataframe
for key, val in splitted.items():
    series = pd.Series(splitted[key], all_districts.index, name=key)
    all_districts = all_districts.merge(series, left_index=True, right_index=True)

all_districts

Plot pair correlations.

In [None]:
# Plot pair correlations
sns.pairplot(all_districts.iloc[:, 7:])

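From the above plot it is possible to see only 2 correlations: between the two columns split from `pct_black/hispanic` (`pct_h` / `pct_b`) and between the two columns split from `pp_total_raw` (`pp_loc` / `pp_fed`). Since each pair appears to hold the two bounds of the same binned range, these correlations are likely an artifact of how the columns were split. No correlation was found between the mean engagement index and the districts info data.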
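
A pairplot shows relationships visually. A correlation matrix gives the same information as numbers. The sketch below uses `DataFrame.corr` (Pearson by default) on a made-up frame that reuses three of the column names from this section. The values are hypothetical and only illustrate the pattern discussed above.

```python
import pandas as pd

# Made-up values: pct_b and pct_h move together, engagement does not follow them
toy = pd.DataFrame({
    "mean_eng_idx": [120.0, 80.0, 200.0, 150.0, 60.0],
    "pct_b": [0.0, 0.2, 0.4, 0.6, 0.8],
    "pct_h": [0.2, 0.4, 0.6, 0.8, 1.0],
})

# Pairwise Pearson correlation coefficients
corr = toy.corr()
print(corr.round(2))
```

Values close to 1 or -1 indicate a strong linear relationship, and values close to 0 indicate none.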

#### 3.3.4 By state rating

Let's find mean engagement index by state and add number of districts in each state.

In [None]:
# Mean by state
mean_dist = all_districts[["state", "mean_eng_idx"]].groupby(
    "state").mean().sort_values("mean_eng_idx", ascending=False)
count = d_inf_clean.iloc[:, :2].groupby(
    "state").count().sort_values("district_id", ascending=False)
count.rename(columns={"district_id": "n_of_districts"}, inplace=True)
mean_dist.merge(count, left_index=True, right_index=True)
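
When reading a by-state rating like this, it helps to keep the number of districts next to the mean, because a state represented by a single district can rank high on one value alone. The sketch below computes both in one step on made-up data. The state labels, the values and the `min_districts` threshold are all hypothetical.

```python
import pandas as pd

# Made-up yearly means for districts in three hypothetical states
toy = pd.DataFrame({
    "state": ["A", "B", "B", "B", "C", "C"],
    "mean_eng_idx": [900.0, 300.0, 200.0, 400.0, 100.0, 150.0],
})

# Mean engagement and number of districts per state
by_state = toy.groupby("state")["mean_eng_idx"].agg(["mean", "count"])

# Hypothetical threshold: keep states with at least 2 districts
min_districts = 2
reliable = by_state[by_state["count"] >= min_districts].sort_values("mean", ascending=False)
print(by_state.sort_values("mean", ascending=False))
print(reliable)
```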

## 4. Conclusion

### Districts
I tried to identify top outliers (districts) in my study as a first step. These top districts are located in the following states, in descending order by mean engagement index:

1. New York (City) - district 9536
2. District of Columbia (City) - district 6418
3. Arizona (City) - district 9007
4. Illinois (Suburb) - district 8815
5. New York (Rural) - district 9515
6. Utah (Suburb) - district 3692

But if we combine the data by mean engagement index in each state, including city and rural locales (all districts), we get the following top 10 results:

1. Arizona - 1 district
2. New York - 8 districts
3. New Hampshire - 2 districts
4. District Of Columbia - 3 districts
5. Connecticut - 30 districts
6. New Jersey - 2 districts
7. Indiana - 7 districts
8. Illinois - 18 districts
9. Massachusetts - 21 districts
10. Utah - 29 districts

The two lists overlap in several states. This can be considered strong evidence that the listed states have a very good engagement index compared with other states. No correlation was found between the engagement index and the following district data:

* pct_black/hispanic
* pct_free/reduced
* county_connections_ratio
* pp_total_raw

### Trends and patterns
All districts had a similar engagement index pattern in 2020: the index dropped in summer and reached its minimum in July. There are 66 products with a positive trend and 256 with a negative trend. 47 products have missing values (mostly at the beginning of the year), which makes it impossible to calculate a trend for them.

### Products

Three most popular products stand out:
* Google Docs - which was probably driven by the need of creating and exchanging documents and information.
* Google Classroom - which was probably driven by the need of LMS, online classes and digital learning.
* YouTube - which was probably driven by the need of educational videos.

## 5. Afterword

Thank you very much for your attention and the time spent reading my study. This is my first analytics competition, and I realize that my notebook is not as perfect as I would like it to be. But I hope it was helpful and that you could get valuable insights from it. Thank you very much for this interesting experience.
