# Exploratory Data Analysis of COVID-19 Impact on Digital Learning

## Problem Statement

Current research shows educational outcomes are far from equitable. The imbalance was exacerbated by the COVID-19 pandemic. There's an urgent need to better understand and measure the scope and impact of the pandemic on these inequities.

The COVID-19 Pandemic has disrupted learning for more than 56 million students in the United States. In the Spring of 2020, most states and local governments across the U.S. closed educational institutions to stop the spread of the virus. In response, schools and teachers have attempted to reach students remotely through distance learning tools and digital platforms.

In this notebook we will try to gain insight into:
* the state of digital learning in 2020
* how engagement with digital learning relates to factors such as district demographics, broadband access, and state/national level policies and events.

## Overview of the data
Which datasets will we use for the analysis?

* engagement_data: each file represents data from one school district; the 4-digit file name is the district_id.
* products_info.csv: information about the characteristics of the top 372 products with the most users in 2020.
* districts_info.csv: information about the characteristics of school districts, including data from NCES and FCC.

## Importing libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import warnings
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import math
import seaborn as sns
import datetime as dt
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from scipy.stats import skew
from IPython.display import Markdown, display, Image, display_html

In [None]:
sns.set()
%matplotlib inline
warnings.filterwarnings("ignore")

## Utility Functions and Classes

In [None]:
def read_csv(csv_path, missing_values=None):
    # `missing_values` lists extra strings to treat as NaN; None keeps the pandas defaults
    try:
        df = pd.read_csv(csv_path, na_values=missing_values)
        print("file read as csv")
        return df
    except FileNotFoundError:
        print("file not found")

def save_csv(df, csv_path):
    try:
        df.to_csv(csv_path, index=False)
        print('File Successfully Saved.!!!')
        return df

    except Exception:
        print("Save failed...")


def percent_missing(df: pd.DataFrame) -> float:
    # np.prod replaces np.product, which was removed in NumPy 2.0
    totalCells = np.prod(df.shape)
    missingCount = df.isnull().sum()
    totalMissing = missingCount.sum()
    return round((totalMissing / totalCells) * 100, 2)


def percent_missing_for_col(df: pd.DataFrame, col_name: str) -> float:
    total_count = len(df[col_name])
    if total_count <= 0:
        return 0.0
    missing_count = df[col_name].isnull().sum()

    return round((missing_count / total_count) * 100, 2)

def drop_duplicates(df):
    old = df.shape[0]
    df.drop_duplicates(inplace=True)
    new = df.shape[0]
    count = old - new
    if (count == 0):
        print("No duplicate rows were found.")
    else:
        print(f"{count} duplicate rows were found and removed.")

def fix_missing_mode(df, cols):
    for col in cols:
        mode = df[col].mode()[0]
        count = df[col].isna().sum()
        df[col] = df[col].fillna(mode)
        if isinstance(mode, str):
            print(f"{count} missing values in the column {col} have been replaced by its mode value \'{mode}\'.")
        else:
            print(f"{count} missing values in the column {col} have been replaced by its mode value {mode}.")

def fix_missing_value(df, col, value):
    count = df[col].isna().sum()
    df[col] = df[col].fillna(value)
    if isinstance(value, str):
        print(f"{count} missing values in the column {col} have been replaced by \'{value}\'.")
    else:
        print(f"{count} missing values in the column {col} have been replaced by {value}.")

In [None]:
class DataOverview:
    """
    A class used to get some information about a given dataframe. 
    It extracts the number of rows and columns and 
    calculates the percent of missing values and unique values of each column.
    """
    
    def __init__(self, df):
        
        self.df = df
    
    def read_head(self, top=5):
        return self.df.head(top)
    
    # returning the number of rows columns and column information
    def get_info(self):
        row_count, col_count = self.df.shape
    
        print(f"Number of rows: {row_count}")
        print(f"Number of columns: {col_count}")
        print("================================")

        return (row_count, col_count), self.df.info()
    
    def get_count(self, column_name):
        return pd.DataFrame(self.df[column_name].value_counts())
    
    # getting the null count for every column
    def get_null_count(self, column_name=None):
        # column_name is unused and kept only for backward compatibility
        null_counts = self.df.isnull().sum()
        print("Null values count")
        print(null_counts)
        return null_counts
    
    # getting the percentage of missing values
    def get_percent_missing(self):
        percent_missing_info = percent_missing(self.df)
        null_percent_df = pd.DataFrame(columns = ['column', 'null_percent'])
        columns = self.df.columns.values.tolist()
        null_percent_df['column'] = columns
        null_percent_df['null_percent'] = null_percent_df['column'].map(lambda x: percent_missing_for_col(self.df, x))
        
        
        return null_percent_df.sort_values(by=['null_percent'], ascending = False), percent_missing_info
    
    def percent_missing_values(self):

        # Calculate total number of cells in dataframe
        totalCells = np.prod(self.df.shape)

        # Count number of missing values per column
        missingCount = self.df.isnull().sum()

        # Calculate total number of missing values
        totalMissing = missingCount.sum()

        # Calculate percentage of missing values
        print("The dataset contains", round(((totalMissing/totalCells) * 100), 2), "%", "missing values.")
    

    def count_missing_rows(self):

        # Calculate total number rows with missing values
        missing_rows = int(self.df.isna().any(axis=1).sum())

        # Calculate total number of rows
        total_rows = self.df.shape[0]

        # Calculate the percentage of missing rows
        print(f"{missing_rows} rows ({round(((missing_rows/total_rows) * 100), 2)}%) contain at least one missing value.")

    def missing_values_table(self):
        
        df = self.df
        # Total missing values
        mis_val = df.isnull().sum()

        # Percentage of missing values
        mis_val_percent = 100 * mis_val / len(df)

        # Data type of missing values
        mis_val_dtype = df.dtypes
        
        # Total unique values in each column
        unique_val = df.nunique()

        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent, unique_val, mis_val_dtype], axis=1)

        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Missing Values', 2: 'Unique Values',  3: 'Dtype',})

        # Sort the table by percentage of missing descending and remove columns with no missing values
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,0] != 0].sort_values(
        '% of Missing Values', ascending=False).round(2)
        
        # Reset the index as a column
        mis_val_table_ren_columns.reset_index(inplace=True)
        mis_val_table_ren_columns.rename(columns={'index': 'Columns'}, inplace=True)

        # Print some summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
            " columns that have missing values.")

        if mis_val_table_ren_columns.shape[0] == 0:
            return

        # Return the dataframe with missing information
        return mis_val_table_ren_columns

## District information data

The district file `districts_info.csv` includes information about the characteristics of school districts, including data from NCES (2018-19), FCC (Dec 2018), and Edunomics Lab. In this data set, identifiable information about the school districts has been removed. An open-source tool, ARX (Prasser et al. 2020), has been used to transform several data fields and reduce the risk of re-identification. For data generalization purposes, some data points are released as a range within which the actual value falls. Additionally, many missing values are marked as 'NaN', indicating that the data was suppressed to maximize anonymization of the dataset.

| Name                   | Description                                                                                                                                                                                                                                                                              |
|------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| district_id            | The unique identifier of the school district                                                                                                                                                                                                                                             |
| state                  | The state where the district resides in                                                                                                                                                                                                                                                  |
| locale                 | NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. See Locale Boundaries User's Manual for more information.                                                                                                          |
| pct_black/hispanic     | Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data                                                                                                                                                                                       |
| pct_free/reduced       | Percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data                                                                                                                                                                              |
| county_connections_ratio | ratio (residential fixed high-speed connections over 200 kbps in at least one direction/households) based on the county level data from FCC Form 477 (December 2018 version). See FCC data for more information.                                                                         |
| pp_total_raw           | Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district. |

In [None]:
base_path = "../input/learnplatform-covid19-impact-on-digital-learning"

In [None]:
district = read_csv(f"{base_path}/districts_info.csv")
district.head()

### General statistics

In [None]:
# rows and columns in the df
district_overview = DataOverview(district)
display(district_overview.get_info())

### Duplicates

In [None]:
drop_duplicates(district)

### Missing values

In [None]:
district_overview.percent_missing_values()
district_overview.count_missing_rows()

In [None]:
district_overview.missing_values_table()

From the missing values table we can observe a pattern: the last 3 columns (`state`, `locale`, and `pct_black/hispanic`) have the same number of missing values. We should therefore check whether the missing values of these 3 columns occur together in the same rows.

In [None]:
# number of rows with missing values for the group('state', 'locale', 'pct_black/hispanic')
DataOverview(district[['state', 'locale', 'pct_black/hispanic']]).count_missing_rows()


As predicted, these 57 rows contain all the missing values from the columns 'state', 'locale', and 'pct_black/hispanic'. Since these rows are missing so many values, we will remove them.
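
The co-occurrence check can also be made explicit by comparing two counts: rows where any of the three columns is missing, and rows where all three are missing. The sketch below uses a small hypothetical frame (`toy`) rather than the real `district` dataframe; if the two counts are equal, the missing values always appear together.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for districts_info.csv
toy = pd.DataFrame({
    "district_id": [1, 2, 3, 4],
    "state": ["Utah", np.nan, "Ohio", np.nan],
    "locale": ["Suburb", np.nan, "City", np.nan],
    "pct_black/hispanic": ["[0, 0.2[", np.nan, "[0.2, 0.4[", np.nan],
})

cols = ["state", "locale", "pct_black/hispanic"]
missing_mask = toy[cols].isna()

# Rows where at least one of the three columns is missing
rows_any_missing = int(missing_mask.any(axis=1).sum())
# Rows where all three columns are missing at once
rows_all_missing = int(missing_mask.all(axis=1).sum())

# Equal counts mean the missing values always occur in the same rows
missing_together = rows_any_missing == rows_all_missing
print(rows_any_missing, rows_all_missing, missing_together)
```

On the real data, the same comparison could be run with `district[cols]` in place of `toy[cols]`.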

In [None]:

cleaned_district = district[district['state'].notna()].reset_index(drop=True)

Let's check the updated missing values table.

In [None]:
DataOverview(cleaned_district).missing_values_table()

Let's look at the count of each unique value in the remaining columns with missing values.

In [None]:
cleaned_district["pp_total_raw"].value_counts()

In [None]:

cleaned_district["pct_free/reduced"].value_counts()


In [None]:

cleaned_district["county_connections_ratio"].value_counts()


We will impute missing values in the columns `pp_total_raw` and `pct_free/reduced` with the mode value of the same `locale`, on the assumption that districts in the same `locale` tend to have similar values for these columns. For the column `county_connections_ratio`, we will impute missing values with the overall mode, since every value except one is equal to the mode.

In [None]:
fix_missing_mode(cleaned_district, ['county_connections_ratio'])

In [None]:
DataOverview(cleaned_district).missing_values_table()

In [None]:
# UTILITY FUNCTIONS

def get_mode_for_locale(df, locale, column):

    filtered_df = df[df["locale"] == locale]
    _mode = filtered_df[column].mode().to_list()
    
    if len(_mode) > 0:
        return _mode[0]
    
    return "[0.0, 0.0["

def handle_missing_for_district(row):
    # Fill each missing range with the mode of the row's locale (pd.isna handles NaN and None)
    if pd.isna(row["pp_total_raw"]):
        row["pp_total_raw"] = get_mode_for_locale(cleaned_district, row["locale"], "pp_total_raw")

    if pd.isna(row["pct_free/reduced"]):
        row["pct_free/reduced"] = get_mode_for_locale(cleaned_district, row["locale"], "pct_free/reduced")

    return row

In [None]:
cleaned_district = cleaned_district.apply(lambda row: handle_missing_for_district(row), axis=1)
cleaned_district_overview = DataOverview(cleaned_district)
cleaned_district_overview.missing_values_table()
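
The row-wise imputation above can also be written as a grouped operation. The following is a minimal sketch, assuming a hypothetical helper `fill_with_group_mode` and a toy frame; unlike `get_mode_for_locale`, it leaves values as NaN when a group has no observed value instead of falling back to a placeholder range.

```python
import numpy as np
import pandas as pd

def fill_with_group_mode(df, group_col, target_col):
    """Return target_col with NaNs replaced by the mode of its group_col group."""
    def _fill(series):
        modes = series.mode()      # mode() ignores NaN by default
        if modes.empty:            # no observed value in this group: leave unchanged
            return series
        return series.fillna(modes.iloc[0])
    return df.groupby(group_col)[target_col].transform(_fill)

# Hypothetical toy data with one missing value per locale
toy = pd.DataFrame({
    "locale": ["City", "City", "City", "Rural", "Rural"],
    "pp_total_raw": ["[8000, 10000[", "[8000, 10000[", np.nan, "[10000, 12000[", np.nan],
})
toy["pp_total_raw"] = fill_with_group_mode(toy, "locale", "pp_total_raw")
print(toy)
```

Grouping once avoids re-filtering the dataframe for every row, which matters little at this size but keeps the intent (mode per locale) visible in a single line.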

### Graphical Analysis

#### Count plots for locale and pct_black/hispanic

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(16,6))

sns.countplot(data=cleaned_district, x='locale', palette='GnBu', ax=ax[0])
sns.countplot(data=cleaned_district, x='pct_black/hispanic', palette='GnBu', ax=ax[1])
plt.show()

As shown above, most of the districts are located in suburban areas and have a low percentage of Black or Hispanic students.

#### Counts of pct_free/reduced and county_connections_ratio values
Unfortunately, the `county_connections_ratio` column is predominantly filled with `[0.18, 1[`.
`[1, 2[` appears in only a single row, so this column is of little use to us.

In [None]:
fig, ax = plt.subplots(1, 2,figsize=(16,6))
sns.countplot(data=cleaned_district, x='pct_free/reduced', palette='GnBu', ax=ax[0])
sns.countplot(data=cleaned_district, x='county_connections_ratio', palette='GnBu', ax=ax[1])


#### Counts of pp_total_raw values

In [None]:
fig, ax = plt.subplots(figsize=(16,6))
sns.countplot(data=cleaned_district, y='pp_total_raw', palette='GnBu')
plt.show()

In [None]:
state_abb = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District Of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}

For further analysis, we derived new numeric columns from the columns whose values are defined as ranges. Each new value is the midpoint of the range: (min_range + max_range) / 2.

In [None]:
def calualte_mean_from_range(value: str):
    # Values look like "[0.2, 0.4["; strip the brackets and parse both bounds as floats
    min_range = float(value.split(", ")[0][1:])
    max_range = float(value.split(", ")[1][:-1])
    return round((min_range + max_range) / 2, 1)

In [None]:
cleaned_district['pct_black/hispanic_mean'] = cleaned_district['pct_black/hispanic'].map(lambda x: calualte_mean_from_range(x))
cleaned_district['pct_free/reduced_mean'] = cleaned_district['pct_free/reduced'].map(lambda x: calualte_mean_from_range(x))
cleaned_district['pp_total_raw_mean'] = cleaned_district['pp_total_raw'].map(lambda x: calualte_mean_from_range(x))
cleaned_district['county_connections_ratio_mean'] = cleaned_district['county_connections_ratio'].map(lambda x: calualte_mean_from_range(x))
cleaned_district.head()
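
As a worked illustration of the midpoint calculation, here is a minimal sketch of a stricter parser. The name `range_midpoint` and the regular expression are assumptions made for this example, not part of the notebook; it rejects strings that do not look like `[low, high[` and, unlike the helper above, does not round the result.

```python
import re

# Matches half-open range strings such as "[0.2, 0.4[" or "[8000, 10000["
RANGE_PATTERN = re.compile(r"^\[\s*([0-9.]+)\s*,\s*([0-9.]+)\s*\[$")

def range_midpoint(value: str) -> float:
    """Parse a range string and return (low + high) / 2."""
    match = RANGE_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"Unrecognized range format: {value!r}")
    low, high = (float(group) for group in match.groups())
    return (low + high) / 2

print(range_midpoint("[0.2, 0.4["))
print(range_midpoint("[8000, 10000["))
```

Raising on unexpected input makes it easier to notice if a column contains a value that is not a range, which a plain string split would silently mis-parse or fail on with a less specific error.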

### Number of school districts in each state
The states with the most school districts are Connecticut, Utah, and Massachusetts.


In [None]:
location_count_df = DataOverview(cleaned_district).get_count("state")

locations = location_count_df.index.map(lambda x: state_abb[x]).to_list()
# Select the counts by position: pandas >= 2.0 names this column 'count' instead of 'state'
district_counts = location_count_df.iloc[:, 0].to_list()

fig = px.choropleth(locations=locations, locationmode="USA-states", color=district_counts, scope="usa", color_continuous_scale='Gnbu')
fig.show()

#### Average percentage of Black or Hispanic students in each state

In [None]:
state_pct_black_hispanic_agg = cleaned_district.groupby("state").agg({"pct_black/hispanic_mean": "mean"})

locations = state_pct_black_hispanic_agg.index.map(lambda x: state_abb[x]).to_list()
colors = state_pct_black_hispanic_agg['pct_black/hispanic_mean'].to_list()
state_pct_black_hispanic_agg.sort_values(by=["pct_black/hispanic_mean"], ascending=False)


fig = px.choropleth(locations=locations, locationmode="USA-states", color=colors, scope="usa", color_continuous_scale='Gnbu')
fig.show()

#### Average per-pupil total expenditure in each state

In [None]:
state_expediutre_agg = cleaned_district.groupby("state").agg({"pp_total_raw_mean": "mean"})

locations = state_expediutre_agg.index.map(lambda x: state_abb[x]).to_list()
colors = state_expediutre_agg['pp_total_raw_mean'].to_list()

fig = px.choropleth(locations=locations, locationmode="USA-states", color=colors, scope="usa",  color_continuous_scale='Gnbu')
fig.show()

#### Average percentage of students eligible for free or reduced-price lunch in each state

In [None]:
pct_free_reduced_mean_agg = cleaned_district.groupby("state").agg({"pct_free/reduced_mean": "mean"})

locations = pct_free_reduced_mean_agg.index.map(lambda x: state_abb[x]).to_list()
colors = pct_free_reduced_mean_agg['pct_free/reduced_mean'].to_list()

fig = px.choropleth(locations=locations, locationmode="USA-states", color=colors, scope="usa",  color_continuous_scale='Gnbu')
fig.show()

In [None]:
def piePlot(data, title, legend):
    # Use a local name that does not shadow the matplotlib alias `plt`
    pie = go.Pie(labels=data.index, values=data.values)
    fig = go.Figure(data=pie)
    fig.update_layout(
        title=title,
        legend_title=legend,
        font=dict(
            family="Roboto, monospace",
            size=18,
            color="Black"
        )
    )

    fig.show()
    fig.data = []
    

#### Locale Breakdown
We can see that the majority of our districts are in suburbs, followed by rural areas, cities, and then towns.

In [None]:
data=cleaned_district.groupby('locale')["district_id"].count()

piePlot(data,"Breakdown of locales of the districts","Locale")

#### Racial identity of students Breakdown
We can see that the majority of our districts fall in the `[0, 0.2[` range (0-20%) of Black or Hispanic students, with the number of districts diminishing as the percentage of Black or Hispanic students goes up.

In [None]:
data=cleaned_district.groupby('pct_black/hispanic')["district_id"].count()

piePlot(data,"Breakdown of demographics of the districts","pct_black/hispanic")

#### Percentage of students eligible for free or reduced-price lunch Breakdown
This can be an indicator of the economic level of the students in these districts and will be used as such in the following sections.

In [None]:
data=cleaned_district.groupby('pct_free/reduced')["district_id"].count()

piePlot(data,"Breakdown of districts by the percentage of students eligible for free or reduced lunch","pct_free/reduced")

#### Breakdown of demographics overall

The plot shows the number of districts by demographic group, economic level, and locale. For all locales except cities, the majority of the districts fall in the 0-0.2 range (0-20%) of Black or Hispanic students. Often we can see that a very high share of Black or Hispanic students (0.8-1) coincides with a higher percentage of students eligible for lunch assistance than in districts with lower shares. This suggests an association between the racial composition of a district and the economic status of its students.

In [None]:
data=cleaned_district.groupby(['locale', 'pct_black/hispanic', 'pct_free/reduced'],as_index=False).count()
data["District Count"]=data["district_id"]
fig = px.sunburst(data, color_continuous_scale='Blues', path=['locale', 'pct_black/hispanic', 'pct_free/reduced'],width=600, height=600, values='District Count')
# fig.update_layout(uniformtext=dict(minsize=10))

fig.update_traces(textinfo="label+percent parent")
fig.update_layout(title="Breakdown locale, pct_black/hispanic and pct_free/reduced in respective rings",
    font=dict(
        family="Roboto, monospace",
        size=14,
        color="Black"
    )
)
fig.show()

In [None]:
data=cleaned_district.groupby(['pct_black/hispanic', 'pct_free/reduced'],as_index=False).count()
data["District Count"]=data["district_id"]
fig = px.sunburst(data, color_continuous_scale='Blues', path=[ 'pct_black/hispanic', 'pct_free/reduced'],color='pct_free/reduced',width=600, height=600, values='District Count')
# fig.update_layout(uniformtext=dict(minsize=10))

fig.update_traces(textinfo="label+percent parent")
fig.update_layout(title="Breakdown pct_black/hispanic, pct_free/reduced in respective rings",
    font=dict(
        family="Roboto, monospace",
        size=14,
        color="Black"
    )
)

fig.show()
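
The association read off the sunburst plots can be cross-checked numerically. The sketch below builds a row-normalized cross-tabulation on a small hypothetical frame (`toy`); each row then shows how districts in one `pct_black/hispanic` range are distributed across the `pct_free/reduced` ranges.

```python
import pandas as pd

# Hypothetical toy data using the same range labels as the district file
toy = pd.DataFrame({
    "pct_black/hispanic": ["[0, 0.2[", "[0, 0.2[", "[0, 0.2[", "[0.8, 1[", "[0.8, 1["],
    "pct_free/reduced": ["[0, 0.2[", "[0, 0.2[", "[0.2, 0.4[", "[0.8, 1[", "[0.6, 0.8["],
})

# normalize="index" turns each row of counts into shares that sum to 1
share = pd.crosstab(toy["pct_black/hispanic"], toy["pct_free/reduced"], normalize="index")
print(share.round(2))
```

Applied to `cleaned_district`, a table like this would show whether the higher `pct_black/hispanic` ranges concentrate in the higher `pct_free/reduced` ranges, without relying on the relative size of plot segments.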

## Product information data

The product file `products_info.csv` includes information about the characteristics of the top 372 products with the most users in 2020. The categories listed in this file are part of LearnPlatform's product taxonomy. Data were labeled by our team. Some products may not have labels because they are duplicates, lack an accurate URL, or for other reasons.

| Name                       | Description                                                                                                                                                                                                                                                                                                                    |
|----------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| LP ID                      | The unique identifier of the product                                                                                                                                                                                                                                                                                           |
| URL                        | Web Link to the specific product                                                                                                                                                                                                                                                                                               |
| Product Name               | Name of the specific product                                                                                                                                                                                                                                                                                                   |
| Provider/Company Name      | Name of the product provider                                                                                                                                                                                                                                                                                                   |
| Sector(s)                  | Sector of education where the product is used                                                                                                                                                                                                                                                                                  |
| Primary Essential Function | The basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled  |

In [None]:
product_df = read_csv(f"{base_path}/products_info.csv")
product_df.head()

### General statistics

In [None]:
# rows and columns in the df
product_overview = DataOverview(product_df)
display(product_overview.get_info())

### Duplicates

In [None]:
drop_duplicates(product_df)

### Missing values

In [None]:
product_overview.percent_missing_values()
product_overview.count_missing_rows()

In [None]:
product_overview.missing_values_table()

In [None]:
product_df[pd.isna(product_df['Provider/Company Name'])]

Since the product with a missing value in the column `Provider/Company Name` also has missing values in the columns `Sector(s)` and `Primary Essential Function`, we will drop it.

In [None]:
cleaned_product = product_df.dropna(subset=['Provider/Company Name'])

Let's check the updated missing values table.

In [None]:
DataOverview(cleaned_product).percent_missing_values()

We will impute missing values in the columns `Sector(s)` and `Primary Essential Function` with the mode value among products from the same `Provider/Company Name`.

In [None]:
def get_mode_for_company(df, comp_name, column):
    filtered_df = df[ df["Provider/Company Name"] == comp_name]

    _mode = filtered_df[column].mode().to_list()
    if len(_mode) > 0:
        return _mode[0]
    
    return None

def handle_missing_for_product(row):
    # Fill each missing label with the mode among products of the same provider
    if pd.isna(row["Sector(s)"]):
        row["Sector(s)"] = get_mode_for_company(product_df, row["Provider/Company Name"], "Sector(s)")

    if pd.isna(row["Primary Essential Function"]):
        row["Primary Essential Function"] = get_mode_for_company(product_df, row["Provider/Company Name"], "Primary Essential Function")

    return row


In [None]:
cleaned_product = cleaned_product.apply(lambda row: handle_missing_for_product(row), axis=1 )

In [None]:
product_overview = DataOverview(cleaned_product)
product_overview.missing_values_table()

The rows that still have missing values do not share a company name with any row that has non-missing values, so no mode is available for them. Therefore, we will drop them.

In [None]:
# since missing values of both columns exist together
cleaned_product.dropna(subset=['Sector(s)'], inplace=True)

In [None]:
product_overview.missing_values_table()

### Splitting Columns
We extract `main_fun` and `sub_fun` from the `Primary Essential Function` column and then drop the original column.

In [None]:
cleaned_product['main_fun'] = cleaned_product['Primary Essential Function'].apply(lambda x: x.split(' - ')[0])
cleaned_product['sub_fun'] = cleaned_product['Primary Essential Function'].apply(lambda x: x.split(' - ')[1])

cleaned_product.drop("Primary Essential Function", axis=1, inplace=True)
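
The same split can be done in one vectorized step with `Series.str.split`. This is a sketch on hypothetical labels (the strings below are illustrative, not taken from `products_info.csv`). Note one behavioral difference: with `n=1`, everything after the first `' - '` is kept in `sub_fun`, whereas `x.split(' - ')[1]` keeps only the second piece.

```python
import pandas as pd

# Hypothetical labels in the "<main> - <sub>" format
toy = pd.DataFrame({
    "Primary Essential Function": [
        "LC - Study Tools",
        "CM - Teacher Resources - Lesson Planning",
        "SDO - Other",
    ]
})

# Split once on the first " - " and expand the pieces into two columns
parts = toy["Primary Essential Function"].str.split(" - ", n=1, expand=True)
toy["main_fun"] = parts[0]
toy["sub_fun"] = parts[1]
print(toy[["main_fun", "sub_fun"]])
```

Rows without a separator would get a missing `sub_fun` here instead of raising an `IndexError`, which can make unexpected labels easier to inspect.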

In [None]:
cleaned_product.head()

In [None]:
display(DataOverview(cleaned_product).get_count("main_fun"))
display(DataOverview(cleaned_product).get_count("sub_fun"))

### Graphical analysis

In [None]:
fig, ax = plt.subplots(figsize=(8,6))
sns.countplot(data=cleaned_product, x='main_fun', palette='GnBu')
plt.show()

In [None]:
# Count products per (main function, sub function) pair for the sunburst chart
sub_df = cleaned_product[['main_fun', 'sub_fun']]
sub_df = sub_df.groupby(['main_fun', 'sub_fun']).size().reset_index(name='count')
fig = px.sunburst(sub_df, path=['main_fun', 'sub_fun'], values='count')
fig.show()

## Engagement data

The engagement data are aggregated at school district level, and each file in the folder `engagement_data` represents data from one school district. The 4-digit file name represents `district_id`, which can be used to link to district information in `districts_info.csv`. The `lp_id` can be used to link to product information in `products_info.csv`.

| Name             | Description                                                                                                    |
|------------------|----------------------------------------------------------------------------------------------------------------|
| time             | date in "YYYY-MM-DD"                                                                                           |
| lp_id            | The unique identifier of the product                                                                           |
| pct_access       | Percentage of students in the district with at least one page-load event of a given product and on a given day |
| engagement_index | Total page-load events per one thousand students of a given product and on a given day                         |

To simplify the analysis, we will concatenate all the dataframes read from the folder `engagement_data` vertically into a single dataframe. Note that since we have dropped some districts from the districts dataframe, engagement data for these districts will also be excluded.

In [None]:
temp = []

# Use `district_id` as the loop variable so the `district` dataframe is not overwritten
for district_id in cleaned_district.district_id.unique():
    df = pd.read_csv(f'{base_path}/engagement_data/{district_id}.csv', index_col=None, header=0)
    df["district_id"] = district_id
    temp.append(df)

engagement_df = pd.concat(temp)
engagement_df = engagement_df.reset_index(drop=True)
engagement_df.head()
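
An alternative is to discover the files on disk and derive `district_id` from each file name. The sketch below assumes a hypothetical helper `load_engagement` and demonstrates it on two tiny temporary files rather than the real `engagement_data` folder; the optional `keep_ids` argument mirrors the exclusion of dropped districts.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_engagement(folder, keep_ids=None):
    """Concatenate every <district_id>.csv in folder, tagging rows with the id."""
    frames = []
    for csv_path in sorted(Path(folder).glob("*.csv")):
        district_id = int(csv_path.stem)  # the file name is the district id
        if keep_ids is not None and district_id not in keep_ids:
            continue  # skip districts that were dropped during cleaning
        frame = pd.read_csv(csv_path)
        frame["district_id"] = district_id
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

# Demonstration on two hypothetical single-row files
with tempfile.TemporaryDirectory() as tmp:
    for name, value in [("1000", 0.5), ("2000", 1.5)]:
        pd.DataFrame({
            "time": ["2020-01-01"],
            "lp_id": [1],
            "pct_access": [value],
            "engagement_index": [10.0],
        }).to_csv(Path(tmp) / f"{name}.csv", index=False)
    demo = load_engagement(tmp, keep_ids={1000})

print(demo)
```

In the notebook's setting, the call might look like `load_engagement(f"{base_path}/engagement_data", keep_ids=set(cleaned_district.district_id))`.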

In [None]:

engagement_df.lp_id.nunique()

In [None]:
cleaned_product['LP ID'].nunique()

In [None]:
print(engagement_df.shape)
engagement_df = engagement_df[engagement_df.lp_id.isin(cleaned_product['LP ID'].unique())]
print(engagement_df.shape)

### General statistics

In [None]:
# rows and columns in the df
engagement_overview = DataOverview(engagement_df)
display(engagement_overview.get_info())

### Duplicates

In [None]:
drop_duplicates(engagement_df)

### Missing values

In [None]:
engagement_overview.percent_missing_values()

In [None]:
engagement_overview.missing_values_table()

In [None]:
engagement_df.dropna(subset=['pct_access'], inplace=True)

In [None]:
engagement_overview.missing_values_table()

In [None]:
missing_engagement_index = engagement_df[pd.isna(engagement_df['engagement_index'])]
missing_engagement_index

In [None]:
print(missing_engagement_index.shape)
temp = missing_engagement_index[missing_engagement_index['pct_access'] == 0.0]
print(temp.shape)

In [None]:
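# Assign the result back instead of calling fillna in place on a column selection
engagement_df['engagement_index'] = engagement_df['engagement_index'].fillna(0.0)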

In [None]:
engagement_overview.missing_values_table()

## Merged Engagement Data

### Merging

In [None]:
cleaned_product.rename(columns={"LP ID": "lp_id"}, inplace=True)

In [None]:
df = engagement_df.merge(cleaned_district, on='district_id')
df.head()

In [None]:
df = df.merge(cleaned_product, on='lp_id')
df.head()

### Analysis

In [None]:
def lineTimePlot(column: str, value: str) -> None:
    # Daily mean of the numeric columns for each group in `column`
    line = df.groupby([column, "time"], as_index=False).mean(numeric_only=True)
    fig = px.line(line, x='time', y=value, color=column, width=1000, facet_col_wrap=1)
#     fig.add_vrect(x0="2020-03-04", x1="2020-03-14",annotation_text="States Announce State of Emergency",fillcolor="yellow", opacity=0.25, annotation_position="top right", line_width=0)
    fig.add_vrect(x0="2020-03-16", x1="2020-04-02", annotation_text="State Closure", fillcolor="red", opacity=0.25, line_width=0, annotation_position="top left")
    fig.add_vrect(x0="2020-06-01", x1="2020-08-23", annotation_text="Summer Holidays", fillcolor="green", opacity=0.25, line_width=0, annotation_position="top right")
    
    fig.add_hline(y=line[value].mean(),annotation_position="top right",annotation_text="Average "+value )
    fig.update_xaxes(rangeslider_visible=True)
    fig.update_layout(title= "Time Analysis of "+value+ " by "+column,
    font=dict(
        family="Roboto, monospace",
        size=14,
        color="Black"
    )
)
    fig.add_annotation(xref='x domain',
    x=0.5,
    yref='y domain',
    y=-0.5, font=dict(size=10),
                       showarrow=False,
    text="State Closure Data taken from <a href='https://www.openicpsr.org/openicpsr/project/119446/version/V75/view;jsessionid=851ECB80E6CB42252D396C29564184DC'>COVID-19 US State Policy Database by openICPSR</a>")
    fig.add_annotation(xref='x domain',
    x=0.5,
    yref='y domain',
    y=-0.55, font=dict(size=10),
                       showarrow=False,
    text="Holiday Data taken from  <a href='https://www.fcps.edu/sites/default/files/media/forms/19-20-standard-school-year-calendar.pdf'>FCPS 2019-2020 Standard School Year Calendar</a>. It is used as a source for an estimate")
    fig.show()
    fig.data = []

#### Percent Access By Locale
The figure shows that rural locales have the highest access percentage, followed by towns, with cities lowest. Cities were also hit hardest by the state closure decisions: their access dropped sharply, while the other locales showed only a small decrease. The access percentage in cities is roughly half that of the best-performing locale (rural).

#### Engagement Index by locale
The engagement index follows a similar pattern to the access percentage: cities are lowest and rural locales highest, and the effect of the state closures is strongest in cities.

In [None]:
local_df = df[["locale", "pct_access","engagement_index"]]
local_df_agg = local_df.groupby("locale").agg({"pct_access": "mean","engagement_index": "mean" })

fig, ax = plt.subplots(1, 2, figsize=(16,4))

sns.barplot(x=local_df_agg.index, y="pct_access", data=local_df_agg, ax=ax[0], palette='GnBu')
sns.barplot(x=local_df_agg.index, y="engagement_index", data=local_df_agg, ax=ax[1], palette='GnBu')

plt.show()

In [None]:
lineTimePlot("locale",'pct_access')
lineTimePlot("locale",'engagement_index')

#### Percentage access by demographics
We can see that districts with a 0.8-1 share of Black/Hispanic students have a higher access percentage than the other groups, while districts in the 0.4-0.6 range have the lowest. Districts in the 0.8-1 range were hit hardest by the state closures but rebounded strongly when schools reopened in the fall, reaching rates higher than before. Districts in the 0.4-0.6 range were also heavily affected, though they too rebounded after the summer holidays.

#### Engagement Index by demographics
As with access, the figure shows that districts with a 0.8-1 share of Black/Hispanic students have a higher engagement index than the other groups, while districts in the 0.4-0.6 range have the lowest. Districts in the 0.8-1 range were hit hardest by the state closures but rebounded strongly when schools reopened in the fall, returning to rates similar to before.

In [None]:
demo_df =  df[["pct_black/hispanic_mean", "pct_access","engagement_index"]]

demo_agg = demo_df.groupby("pct_black/hispanic_mean").agg({"pct_access": "mean","engagement_index": "mean" })

fig, ax = plt.subplots(1, 2, figsize=(16,4))

sns.barplot(x=demo_agg.index, y="pct_access", data=demo_agg, ax=ax[0], palette='GnBu')
sns.barplot(x=demo_agg.index, y="engagement_index", data=demo_agg, ax=ax[1], palette='GnBu')

plt.show()

In [None]:
lineTimePlot("pct_black/hispanic",'pct_access')
lineTimePlot("pct_black/hispanic",'engagement_index')

#### Percentage access and Engagement Index in each state
We can see that North Dakota, Arizona, New York, the District of Columbia and New Hampshire have a higher overall access percentage and engagement index than the other states. For New Hampshire, North Dakota and New York, this may be because their districts are mainly in rural areas, as the following graphs show. Arizona and the District of Columbia, on the other hand, have predominantly Black/Hispanic districts, which may contribute to their higher engagement and access percentage.
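One way to check the "mainly rural" explanation numerically is to compute the share of each locale among a state's districts. The sketch below uses a small hypothetical table; in the notebook the same call would presumably be applied to the `state` and `locale` columns of `cleaned_district`.

```python
import pandas as pd

# Hypothetical district table standing in for `cleaned_district`.
districts = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B", "B", "B"],
    "locale": ["Rural", "Rural", "City", "City", "Suburb", "City", "Rural"],
})

# Row-normalised crosstab: each row sums to 1, so entries are shares per state.
locale_share = pd.crosstab(districts["state"], districts["locale"], normalize="index")

# States ordered by the share of their districts that are rural.
rural_share = locale_share["Rural"].sort_values(ascending=False)
print(rural_share)
```

Shares are easier to compare across states than the raw counts in a stacked histogram, because states contribute very different numbers of districts.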


In [None]:
state_pct_df =  df[["state", "pct_access"]]
state_eng_df =  df[["state", "engagement_index"]]

state_pct_agg = state_pct_df.groupby("state").agg({"pct_access": "mean" })
state_eng_agg = state_eng_df.groupby("state").agg({"engagement_index": "mean" })

state_pct_agg = state_pct_agg.sort_values(by=["pct_access"], ascending=False)
state_eng_agg = state_eng_agg.sort_values(by=["engagement_index"], ascending=False)

fig, ax = plt.subplots(figsize=(16,8))
sns.barplot(x="pct_access", y=state_pct_agg.index, data=state_pct_agg, palette='GnBu')
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.barplot(x="engagement_index", y=state_eng_agg.index, data=state_eng_agg, palette='GnBu')
plt.show()

In [None]:
px.histogram(cleaned_district, x='state', color="locale").update_xaxes(categoryorder='total ascending')

In [None]:
px.histogram(cleaned_district, x='state', color="pct_black/hispanic").update_xaxes(categoryorder='total ascending')

#### Percentage access by pct_free/reduced
We can see that districts with a 0.8-1 share of free/reduced-lunch eligible students (pct_free/reduced) have a higher access percentage than the other groups on average, while districts in the 0.6-0.8 range have the lowest. Districts in the 0.8-1 range were hit hardest by the state closures but rebounded when schools reopened in the fall, to rates only slightly lower than before.

#### Engagement Index by pct_free/reduced
We can see that districts with a 0.8-1 share of free/reduced-lunch eligible students (pct_free/reduced) have a higher engagement index than the other groups on average, while districts in the 0.6-0.8 range have the lowest. Districts in the 0.8-1 range were hit hardest by the state closures; they rebounded when schools reopened in the fall but never recovered to the pre-closure level.

In [None]:
lineTimePlot("pct_free/reduced",'pct_access')
lineTimePlot("pct_free/reduced",'engagement_index')

#### Top 5 engaged products

#### Top 5 products with highest percentage access
The top 5 products by percentage access are:
1. Google Classroom
2. Google Docs
3. Google Drive
4. YouTube
5. ClassLink

#### Top 5 products with highest engagement index
1. Google Docs
2. Google Classroom
3. YouTube
4. Canvas
5. Meet

Most of the top products for online learning, in terms of both percentage access and engagement index, are from Google.
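The observation about providers can be checked by counting how often each provider appears among the top products. This is a minimal sketch on hypothetical values; in the notebook the per-product means would need to be joined with the `Provider/Company Name` column first.

```python
import pandas as pd

# Hypothetical per-product mean engagement with provider names.
products = pd.DataFrame({
    "Product Name": ["P1", "P2", "P3", "P4", "P5", "P6"],
    "Provider/Company Name": ["X", "X", "Y", "X", "Z", "Y"],
    "engagement_index": [900.0, 750.0, 600.0, 500.0, 400.0, 100.0],
})

# Keep the five most engaged products, then count them per provider.
top5 = products.nlargest(5, "engagement_index")
provider_counts = top5["Provider/Company Name"].value_counts()
print(provider_counts)
```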



In [None]:
product_df = df[['Provider/Company Name', 'Product Name', 'pct_access', 'engagement_index']]
product_df_agg = product_df.groupby('Product Name').agg({"pct_access": "mean", "engagement_index": "mean"})
product_pct_access_agg = product_df_agg[['pct_access']].sort_values(by="pct_access", ascending=False)
product_eng_agg =  product_df_agg[['engagement_index']].sort_values(by="engagement_index", ascending=False)

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(24,4))

sns.barplot(x=product_pct_access_agg.head().index, y="pct_access", data=product_pct_access_agg.head(), ax=ax[0], palette='GnBu')
sns.barplot(x=product_eng_agg.head().index, y="engagement_index", data=product_eng_agg.head(), ax=ax[1], palette='GnBu')

plt.show()

#### Top 5 engaged products per main function

In [None]:
main_funcs = list(df['main_fun'].unique())

row = len(main_funcs) // 2 
if len(main_funcs) % 2 != 0:
    row += 1
    
fig, ax = plt.subplots(row, 2, figsize=(24,16))


temp_df = df[['main_fun', 'engagement_index','Provider/Company Name', 'Product Name']]
for i, func in enumerate(main_funcs):
    temp_agg = temp_df[temp_df['main_fun'] == func].groupby('Product Name').agg({'engagement_index': "mean"})
    temp_agg = temp_agg.sort_values(by='engagement_index', ascending=False)
    
    fig.tight_layout()
    ax[i // 2, i%2].set_title(f'Top 5 in {func}', fontsize=16)
    sns.barplot(x=temp_agg.head().index, y="engagement_index", data=temp_agg.head(), ax=ax[i // 2, i%2], palette='GnBu')

Similar to the previous plot, Google products lead on engagement index in each main function category.

#### Products per main function vs. pct_access and engagement_index

In [None]:
lineTimePlot("main_fun",'pct_access')
lineTimePlot("main_fun",'engagement_index')

#### Products per sector vs. pct_access and engagement_index

In [None]:
lineTimePlot("Sector(s)",'pct_access')
lineTimePlot("Sector(s)",'engagement_index')

#### Virtual classroom products vs. pct_access and engagement_index

In [None]:
def lineTimePlotVC(column:str,value:str)->None:

    virtual_classroom_lp_id = cleaned_product[cleaned_product.sub_fun == 'Virtual Classroom']['lp_id'].unique()

    # Remove weekends without adding a helper column to the shared dataframe
    is_weekday = pd.to_datetime(df['time']).dt.weekday < 5
    engagement_without_weekends = df[is_weekday]

    # Keep only the virtual classroom products
    engagement_without_weekends = engagement_without_weekends[engagement_without_weekends['lp_id'].isin(virtual_classroom_lp_id)]

    # Daily mean of the numeric columns for each group in `column`
    line = engagement_without_weekends.groupby([column, "time"], as_index=False).mean(numeric_only=True)
    fig = px.line(line, x='time', y=value, color=column, width=1000, facet_col_wrap=1)
    #     fig.add_vrect(x0="2020-03-04", x1="2020-03-14",annotation_text="States Announce State of Emergency",fillcolor="yellow", opacity=0.25, annotation_position="top right", line_width=0)
    fig.add_vrect(x0="2020-03-16", x1="2020-04-02", annotation_text="State Closure", fillcolor="red", opacity=0.25, line_width=0, annotation_position="top left")
    fig.add_vrect(x0="2020-06-01", x1="2020-08-23",annotation_text="Summer Holidays",fillcolor="green", opacity=0.25, line_width=0, annotation_position="top right")

    fig.add_hline(y=line[value].mean(),annotation_position="top right",annotation_text="Average "+value )
    fig.update_xaxes(rangeslider_visible=True)
    fig.update_layout(title= "Time Analysis of "+value+ " by "+column,
    font=dict(
        family="Roboto, monospace",
        size=14,
        color="Black"
    ),
    legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.0,
        xanchor="right",
        x=1,
        font=dict(
            size=10,
            color="black"
        )
    )
    )
    # Truncate long product names so the legend stays readable
    fig.for_each_trace(lambda trace: trace.update(name=trace.name[:15]))
    fig.add_annotation(xref='x domain',
    x=0.5,
    yref='y domain',
    y=-0.5, font=dict(size=10),
                       showarrow=False,
    text="State Closure Data taken from <a href='https://www.openicpsr.org/openicpsr/project/119446/version/V75/view;jsessionid=851ECB80E6CB42252D396C29564184DC'>COVID-19 US State Policy Database by openICPSR</a>")
    fig.add_annotation(xref='x domain',
    x=0.5,
    yref='y domain',
    y=-0.55, font=dict(size=10),
                       showarrow=False,
    text="Holiday Data taken from  <a href='https://www.fcps.edu/sites/default/files/media/forms/19-20-standard-school-year-calendar.pdf'>FCPS 2019-2020 Standard School Year Calendar</a>. It is used as a source for an estimate")
    fig.show()
    fig.data = []

In [None]:
lineTimePlotVC("Product Name",'pct_access')
lineTimePlotVC("Product Name",'engagement_index')

Weekends are removed from this visualization because students do not attend classes on weekends. Both pct_access and engagement_index rose after the pandemic began, but the increase in engagement_index is larger than that in pct_access. This indicates that even though students accessed these virtual classroom products before the pandemic, they engaged with them much more after it started.
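The weekend filter can be illustrated on its own. The sketch below builds one hypothetical week of data (Monday 2020-03-02 to Sunday 2020-03-08) and keeps only the weekdays using a boolean mask.

```python
import pandas as pd

# Hypothetical engagement rows covering one full week, Monday to Sunday.
toy = pd.DataFrame({
    "time": pd.date_range("2020-03-02", periods=7, freq="D").strftime("%Y-%m-%d"),
    "engagement_index": [120.0, 130.0, 125.0, 140.0, 110.0, 5.0, 3.0],
})

# dayofweek: Monday=0 ... Sunday=6, so values below 5 are weekdays.
is_weekday = pd.to_datetime(toy["time"]).dt.dayofweek < 5
weekdays_only = toy[is_weekday]
print(weekdays_only)
```

Dropping the two low weekend values keeps them from pulling the line down every seventh day, which makes the trend around the closure dates easier to read.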

### Cluster Analysis


In [None]:
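# Daily mean engagement metrics per district
temp = df.groupby(["district_id", "time"])[["engagement_index", "pct_access"]].mean()
index = temp.index.get_level_values("district_id").unique().to_list()
# One (n_days x 2) frame per district, in the same order as `index`
district_engagement_series = []
for ind in index:
    frame = temp.loc[ind, ["engagement_index", "pct_access"]]
    district_engagement_series.append(frame)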

In [None]:
del temp

In [None]:
!pip install tslearn
from tslearn.clustering import TimeSeriesKMeans
from tslearn.utils import to_time_series_dataset

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
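# Scale each district's series to [0, 1] independently so that clustering compares shapes, not magnitudes
for i in range(len(district_engagement_series)):
    scaler = MinMaxScaler()
    district_engagement_series[i] = scaler.fit_transform(district_engagement_series[i])

# Stack the per-district arrays into a single 3-D time series dataset
district_engagement_series = to_time_series_dataset(district_engagement_series)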

In [None]:
ks = range(1, 10)
inertias = []
for k in ks:
    # Create a TimeSeriesKMeans instance with k clusters: model
    model = TimeSeriesKMeans(n_clusters=k, metric="dtw", max_iter=1,n_init=3,random_state=9)
    
    # Fit model to samples
    model.fit(district_engagement_series)
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
    
plt.plot(ks, inertias, '-o', color='black')
plt.xlabel('number of clusters, k')
plt.title("Elbow Method for Finding Optimum Number of Clusters")
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()

Based on the elbow method, the optimum number of clusters for our data set is 5, and we use this value to fit our clustering algorithm. We are using a time-series k-means for our data. The distance metric is dynamic time warping (DTW), which measures the similarity between two temporal sequences that may be shifted or stretched in time relative to each other.
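To give an intuition for why DTW suits this kind of data, here is a plain-Python sketch of the classic dynamic-programming formulation for 1-D sequences. The function `dtw_distance` is a hypothetical helper written for illustration, not the tslearn implementation; it assumes squared point-wise differences and takes the square root of the accumulated cost.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor alignments
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] ** 0.5


# The same bump shifted by one step: DTW can align the bumps, Euclidean distance cannot.
x = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
y = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
euclidean = sum((p - q) ** 2 for p, q in zip(x, y)) ** 0.5
print(dtw_distance(x, y), euclidean)
```

Two districts whose engagement follows the same shape but with a few days' offset, for example because their closures started on different dates, would look far apart under a point-by-point distance and close under DTW.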

In [None]:
model = TimeSeriesKMeans(n_clusters=5, metric="dtw", max_iter=1,n_init=3,random_state=9).fit(district_engagement_series)
clusters=model.predict(district_engagement_series)

In [None]:
copy = df.copy()
# Attach each district's cluster label; `index` and `clusters` share the same order
copy['cluster'] = copy['district_id'].map(dict(zip(index, clusters)))

The clustering seems to have captured some patterns in our time series data sets as demonstrated in the graphs below. 

In [None]:


line = copy.groupby(["time", "district_id"], as_index=False).mean(numeric_only=True)

fig = px.line(line, x ='time', y ="engagement_index", color="district_id",facet_row="cluster",height=1000,facet_col_wrap=1)

fig.show()
fig.data=[]

In [None]:

line = copy.groupby(["time", "district_id"], as_index=False).mean(numeric_only=True)

fig = px.line(line, x ='time', y ="pct_access", color="district_id",facet_row="cluster",height=1000,facet_col_wrap=1)

fig.show()
fig.data=[]

We can see that pct_free/reduced_mean, irrespective of the average spending, can be seen as an indicator of pct_access and engagement_index. This suggests that even where there is some investment in schools, students who are eligible for free or reduced-price lunch have difficulty with access, and that because of COVID-19 they may not be able to use the services at their schools to fill the gap.

In [None]:
copy.groupby("cluster").mean(numeric_only=True).sort_values("pct_access", ascending=True)

### Causal Inference

Now let's explore the impact of school closure on the districts with the highest and lowest pct_black/hispanic, to check whether COVID-19 has disproportionately impacted student engagement with online learning platforms in areas where there are more Black or Hispanic students.

In [None]:
!pip install pycausalimpact
from causalimpact import CausalImpact

We used a public database of state policies from [openICPSR](https://www.openicpsr.org/openicpsr/project/119446/version/V75/view;jsessionid=851ECB80E6CB42252D396C29564184DC?path=/openicpsr/119446/fcr:versions/V75/COVID-19-US-State-Policy-Database-master&type=folder) to identify the dates of school closures in each state.

In [None]:
policy_df = pd.read_csv('../input/covid19-us-state-policy-database/COVID-19_US_state_policy - State policy changes .csv')
policy_df = policy_df[['STATE', 'CLSCHOOL']]
policy_df = policy_df.drop(policy_df.index[[0, 1, 2, 3]])  # drop the first four rows
policy_df = policy_df[policy_df.STATE.isin(df.state.unique())].copy()
policy_df['CLSCHOOL'] = pd.to_datetime(policy_df['CLSCHOOL']).dt.day_of_year
policy_df.set_index('STATE', inplace=True)
policy_df

In [None]:
x = df.groupby(['district_id', 'state']).agg({ 'engagement_index': 'mean', 
                                               'pct_access': 'mean', 
                                               'pct_black/hispanic_mean': 'mean', })
x = x.sort_values(by=['pct_black/hispanic_mean'], ascending = False)
x.head()

In [None]:
# Select district 9536, the district named in the plot title and the discussion
df_9536 = df[df.district_id == 9536].copy()
df_9536['time'] = pd.to_datetime(df_9536['time']).dt.day_of_year
df_9536 = df_9536.groupby(['time']).agg({ 'engagement_index': 'mean'})
df_9536

In [None]:
sns.set_theme(style='darkgrid')
plt.figure(figsize=(15,8))

s = sns.lineplot(data=df_9536, x="time", y='engagement_index', linewidth=1)
s.set_title('Time series data of engagement Index for district 9536', y=1.02, fontsize=15)
s.set_xlabel('Day of year', fontsize=14, labelpad=15)
s.set_ylabel('Engagement', fontsize=14, labelpad=15)
plt.axvline(77, color='r', linewidth=1, linestyle='--')  # day 77 of 2020 is March 17
plt.show()

Looking at graphs and comparing the periods before and after school closure gives a good idea of the impact the closure had on student engagement in district 9536. To increase confidence in our findings, we will use the Causal Impact library for our statistical analysis.
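As a rough sanity check before the model-based analysis, the pre and post periods can be derived from the closure day and the two means compared directly. The helper `split_periods` and the numbers below are hypothetical. This naive comparison is not what Causal Impact computes: the library fits a time series model on the pre-period and compares the observed post-period with the model's forecast.

```python
def split_periods(days, closure_day):
    """Return [first, last] day of the pre-period and of the post-period."""
    days = sorted(days)
    pre = [d for d in days if d <= closure_day]
    post = [d for d in days if d > closure_day]
    if not pre or not post:
        raise ValueError("closure_day must fall strictly inside the observed range")
    return [pre[0], pre[-1]], [post[0], post[-1]]


# Hypothetical daily mean engagement keyed by day of year; closure assumed on day 4.
series = {1: 100.0, 2: 110.0, 3: 90.0, 4: 100.0, 5: 40.0, 6: 50.0, 7: 60.0}
pre_period, post_period = split_periods(series.keys(), closure_day=4)

pre_vals = [v for d, v in series.items() if pre_period[0] <= d <= pre_period[1]]
post_vals = [v for d, v in series.items() if post_period[0] <= d <= post_period[1]]
pre_mean = sum(pre_vals) / len(pre_vals)
post_mean = sum(post_vals) / len(post_vals)

# Naive relative change between the two periods (no trend or seasonality adjustment).
relative_change = (post_mean - pre_mean) / pre_mean
print(pre_period, post_period, relative_change)
```

Deriving the periods from the observed days also avoids specifying a post-period that extends beyond the last day present in the data.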

In [None]:
pre_period = [1, 77]
post_period = [78, 366]
ci = CausalImpact(df_9536, pre_period, post_period)

In [None]:
ci.plot()
print(ci.summary())
print(ci.summary(output='report'))

The report shows that school closure has impacted engagement significantly in district 9536, which has a large percentage of students identified as Black or Hispanic. The probability of obtaining this effect by chance is very small, and therefore the causal effect can be considered statistically significant. Next, let's look at the districts with the lowest share of Black or Hispanic students.

In [None]:
x.tail()

In [None]:
df_1904 = df[df.district_id == 1904].copy()
df_1904['time'] = pd.to_datetime(df_1904['time']).dt.day_of_year
df_1904 = df_1904.groupby(['time']).agg({ 'engagement_index': 'mean'})
df_1904

In [None]:
pre_period = [1, 77]
post_period = [78, 366]
ci = CausalImpact(df_1904, pre_period, post_period)

In [None]:
ci.plot()
print(ci.summary())
print(ci.summary(output='report'))

School closures did not negatively impact student engagement in district 1904; instead, when schools were closed, student engagement increased significantly. Comparing these two districts suggests that COVID-19 and school closures have had a much greater negative effect on student engagement in areas where the population is predominantly Black or Hispanic.