**Problem Statement**
The COVID-19 pandemic has disrupted learning for more than 56 million students in the United States.
In the spring of 2020, most states and local governments across the U.S. closed educational institutions
to stop the spread of the virus. In response, schools and teachers have attempted to reach students
remotely through distance learning tools and digital platforms. Concerns about a widening
digital divide and long-term learning loss among America’s most vulnerable learners continue to grow.

**Challenge**

This notebook explores:
(1) the state of digital learning in 2020, and
(2) how engagement with digital learning relates to factors such as district demographics, broadband access,
    and state/national level policies and events.

**Data Description**

The data set contains daily edtech engagement records from over 200 school districts in 2020.

The engagement_data folder is based on LearnPlatform’s Student Chrome Extension. The extension collects page-load events for over 10K education technology products in LearnPlatform's product library, including websites, apps, web apps, software programs, extensions, ebooks, hardware, and services used in educational institutions.
The districts_info.csv file includes information about the characteristics of school districts, including data from NCES (2018-19), FCC (Dec 2018), and Edunomics Lab:

district_id
state
locale
pct_black/hispanic - percentage of students in the district identified as Black or Hispanic, based on 2018-19 NCES data.
pct_free/reduced - percentage of students in the district eligible for free or reduced-price lunch, based on 2018-19 NCES data.
county_connections_ratio - ratio (residential fixed high-speed connections over 200 kbps in at least one direction/households), based on county-level data from FCC Form 477 (December 2018 version).
pp_total_raw - per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project.





In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import os
import glob
import wandb
import json
import warnings
import imageio
import datetime
import pandas as pd
import numpy as np
from PIL import Image

from tqdm import tqdm
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.core.display import display, HTML, Javascript
import IPython.display as py_display
import plotly.express as px
from pandas_profiling import ProfileReport 
from IPython.display import Image
import missingno as msno
from wordcloud import WordCloud, STOPWORDS
from IPython.display import Markdown, display, Image, display_html
from geopy.geocoders import Nominatim
#from geopy.distance import vincenty
# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

In [None]:
# Import data

districts_info = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")
products_info = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv")

districts_info data set

Name                        Description
district_id              =  The unique identifier of the school district
state                    =  The state in which the district is located
locale                   =  NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. See the Locale Boundaries User's Manual for more information
pct_black/hispanic       =  Percentage of students in the district identified as Black or Hispanic, based on 2018-19 NCES data
pct_free/reduced         =  Percentage of students in the district eligible for free or reduced-price lunch, based on 2018-19 NCES data
county_connections_ratio =  Ratio (residential fixed high-speed connections over 200 kbps in at least one direction/households), based on county-level data from FCC Form 477 (December 2018 version). See FCC data for more information
pp_total_raw             =  Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The expenditure data are school-by-school, and the median value is used to represent the expenditure of a given school district

In [None]:
districts_info.head()


products_info data set

Name                          Description
LP ID                      =  The unique identifier of the product
URL                        =  Web link to the specific product
Product Name               =  Name of the specific product
Provider/Company Name      =  Name of the product provider
Sector(s)                  =  Sector of education where the product is used
Primary Essential Function =  The basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled
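
The two label layers described above can be separated into their own columns. The sketch below is illustrative only: the sample labels are made up, and the " - " separator between the main category and the sub-category is an assumption about the label format that should be checked against the actual column values. The names `sample`, `main_function`, `sub_function`, and `main_function_name` are hypothetical.

```python
import pandas as pd

# Hypothetical sample labels; the " - " separator is an assumption about the label format.
sample = pd.DataFrame({
    "Primary Essential Function": [
        "LC - Digital Learning Platforms",
        "CM - Classroom Engagement & Instruction",
        "SDO - Data, Analytics & Reporting",
        None,
    ]
})

# Split only on the first " - " so sub-categories that contain hyphens stay intact.
parts = sample["Primary Essential Function"].str.split(" - ", n=1, expand=True)
sample["main_function"] = parts[0]
sample["sub_function"] = parts[1]

# Expand the three abbreviations listed in the data description.
category_names = {
    "LC": "Learning & Curriculum",
    "CM": "Classroom Management",
    "SDO": "School & District Operations",
}
sample["main_function_name"] = sample["main_function"].map(category_names)

print(sample[["main_function", "sub_function", "main_function_name"]])
```

Rows with a missing label stay missing in all derived columns, so they can be dropped or counted separately later.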


In [None]:
products_info

In [None]:
districts_info.shape, products_info.shape

In [None]:
#number of missing values
districts_info.isnull().sum().sum()

The engagement data are aggregated at the school district level, and each file in the engagement_data folder represents data from one school district. The 4-digit file name is the district_id, which links to district information in districts_info.csv. The lp_id links to product information in products_info.csv. Loading the engagement data is a little tricky: we want to preserve the file names (since they are district ids) while loading all the data into one data frame.
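
To make the linkage concrete, here is a minimal sketch using toy frames in place of the real files. All values, and the names `districts`, `products`, `per_file`, `engagement_toy`, and `linked`, are made up for illustration; only the key columns (`district_id`, `lp_id`, `LP ID`) follow the description above.

```python
import os
import pandas as pd

# Toy stand-ins for districts_info.csv and products_info.csv (made-up values).
districts = pd.DataFrame({"district_id": [1000, 2000], "state": ["Utah", "Ohio"]})
products = pd.DataFrame({"LP ID": [11, 22], "Product Name": ["Product X", "Product Y"]})

# Pretend these frames were read from one engagement file per district.
per_file = {
    "engagement_data/1000.csv": pd.DataFrame({"lp_id": [11, 22], "engagement_index": [5.0, 7.5]}),
    "engagement_data/2000.csv": pd.DataFrame({"lp_id": [11], "engagement_index": [3.2]}),
}

frames = []
for path, df in per_file.items():
    # The file name without its extension is the district id; cast to int to match `districts`.
    district_id = int(os.path.splitext(os.path.basename(path))[0])
    frames.append(df.assign(district_id=district_id))
engagement_toy = pd.concat(frames, ignore_index=True)

# lp_id links to the product table, district_id links to the district table.
linked = (
    engagement_toy
    .merge(products, left_on="lp_id", right_on="LP ID", how="left")
    .merge(districts, on="district_id", how="left")
)
print(linked)
```

Casting the id to an integer at load time is one option; keeping it as a string and converting just before the merge works equally well, as long as both sides of the join share a type.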

In [None]:
#code from https://www.kaggle.com/dmitryuarov/eda-covid-19-impact-on-digital-learning

eng_path = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data'
eng_files = glob.glob(eng_path + "/*.csv")

files = []

for file in eng_files:
    df = pd.read_csv(file, index_col = None, header = 0)
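    # The file name without its extension is the district id (independent of directory depth)
    districts_id = os.path.splitext(os.path.basename(file))[0]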
    df['district_id'] = districts_id
    files.append(df)
    
engagement = pd.concat(files)
engagement = engagement.reset_index(drop = True)
engagement['time'] = pd.to_datetime(engagement['time'])

In [None]:
engagement.head()

In [None]:
#dropping rows with null values in districts_info
districts_info.dropna(inplace = True)

In [None]:
#no more null values
districts_info.isnull().sum().sum()

In [None]:
#null values in products_info
products_info.isnull().sum().sum()

In [None]:
#dropping null values in products_info
products_info.dropna(inplace = True)

In [None]:
products_info.isnull().sum().sum()

What is an Engagement Index?
The engagement index is the total number of page-load events per one thousand students for a given product on a given day. For example, if district A has an engagement index of 26666.66 for product X on 2021-08-10, there were 26666.66 page-load events per 1000 students for product X on that day.
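
As a quick arithmetic check of this definition, the sketch below computes the index from raw counts. The function name `engagement_index` and the district figures are hypothetical; the data set ships the index already computed, so this is only meant to show how the number is scaled.

```python
def engagement_index(page_loads: float, num_students: int) -> float:
    """Page-load events per 1,000 students for one product on one day."""
    if num_students <= 0:
        raise ValueError("num_students must be positive")
    return page_loads / num_students * 1000


# Hypothetical district: 1,500 students generated 40,000 page loads for product X in one day.
index = engagement_index(40_000, 1_500)
print(round(index, 2))
```

Because the index is normalized per 1,000 students, districts of different sizes can be compared directly.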

In [None]:
engagement.shape

In [None]:
engagement.isnull().sum()

In [None]:
engagement.dropna( inplace = True)

In [None]:
engagement.isnull().sum()

In [None]:
#plotting functions
def plot_hist(df: pd.DataFrame, column: str, color: str) -> None:
    # displot is figure-level and creates its own figure, so no plt.figure() call is needed
    sns.displot(data=df, x=column, color=color, kde=True, height=7, aspect=2)
    plt.title(f'Distribution of {column}', size=20, fontweight='bold')
    plt.show()


def plot_dist(df: pd.DataFrame, column: str) -> None:
    plt.figure(figsize=(9, 7))
    # histplot replaces the deprecated distplot; plot only the requested column
    sns.histplot(data=df, x=column, kde=True).set_title(f'Distribution of {column}')
    plt.show()


def plot_count(df: pd.DataFrame, column: str) -> None:
    plt.figure(figsize=(12, 7))
    sns.countplot(data=df, x=column)
    plt.title(f'Plot count of {column}', size=20, fontweight='bold')
    plt.show()


def plot_bar(df, x_col, y_col, title=''):
    plt.figure(figsize=(20, 7))
    sns.barplot(data = df, x=x_col, y=y_col)
    plt.title(title, size=20)
    plt.xticks(rotation=75, fontsize=14)
    plt.yticks( fontsize=14)
    plt.xlabel(x_col, fontsize=16)
    plt.ylabel(y_col, fontsize=16)
    plt.show()


def plot_heatmap(df: pd.DataFrame, title: str, cbar=False) -> None:
    plt.figure(figsize=(12, 7))
    sns.heatmap(df, annot=True, cmap='viridis', vmin=0,
                vmax=1, fmt='.2f', linewidths=.7, cbar=cbar)
    plt.title(title, size=18, fontweight='bold')
    plt.show()


def plot_box(df: pd.DataFrame, x_col: str, title: str) -> None:
    plt.figure(figsize=(12, 7))
    sns.boxplot(data=df, x=x_col)
    plt.title(title, size=20)
    plt.xticks(rotation=75, fontsize=14)
    plt.show()


def plot_box_multi(df: pd.DataFrame, x_col: str, y_col: str, title: str) -> None:
    plt.figure(figsize=(12, 7))
    sns.boxplot(data=df, x=x_col, y=y_col)
    plt.title(title, size=20)
    plt.xticks(rotation=75, fontsize=14)
    plt.yticks(fontsize=14)
    plt.show()


def plot_scatter(df: pd.DataFrame, x_col: str, y_col: str, title: str, hue: str, style: str) -> None:
    plt.figure(figsize=(10, 8))
    sns.scatterplot(data=df, x=x_col, y=y_col, hue=hue, style=style)
    plt.title(title, size=20)
    plt.xticks(fontsize=14)
    plt.yticks(fontsize=14)
    plt.show()
    
def time_plot(df, x_col, y_col, title=''):
    plt.figure(figsize=(20, 7))
    sns.lineplot(data=df, x=x_col, y=y_col)
    plt.title(title, size=20)
    plt.xticks(rotation=75, fontsize=14)
    plt.yticks( fontsize=14)
    plt.xlabel(x_col, fontsize=16)
    plt.ylabel(y_col, fontsize=16)
    plt.show()
    


In [None]:
plot_count(districts_info, 'locale')

In [None]:
plot_count(districts_info, 'pct_black/hispanic')

In [None]:
def plot_count_hor(df, col, title, hue=None):
    plt.figure(figsize=(20, 7))
    sns.countplot(data = df, y=col, hue=hue, order=df[col].value_counts().index)
    plt.title(title, size=20)
    plt.xticks(rotation=75, fontsize=14)
    plt.yticks( fontsize=14)
    # Horizontal count plot: counts are on the x-axis, categories on the y-axis
    plt.xlabel("Count", fontsize=16)
    plt.ylabel(col, fontsize=16)
    plt.show()

In [None]:
plot_count_hor(districts_info,'state','plot count of state')

In [None]:
plot_hist(districts_info, 'pp_total_raw','orange')

In [None]:
plot_count_hor(districts_info,'pp_total_raw','plot count of pp_total_raw')

In [None]:
profile = ProfileReport( districts_info, title='Pandas profiling report for districts_info ' , html={'style':{'full_width':True}})
profile.to_notebook_iframe()

In [None]:
profile = ProfileReport( products_info, title='Pandas profiling report for products_info ' , html={'style':{'full_width':True}})
profile.to_notebook_iframe()

In [None]:
profile = ProfileReport( engagement, title='Pandas profiling report ' , html={'style':{'full_width':True}})
profile.to_notebook_iframe()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(8,4))

sns.histplot(engagement.groupby('district_id').time.nunique(), bins=30, ax=ax)
ax.set_title('Days of Engagement Data per District')
plt.show()

In [None]:
graph1 = engagement.groupby('time').agg({'engagement_index': 'mean', 'pct_access': 'mean', 'lp_id': 'count'}).\
            reset_index()

time_plot(graph1, "time", "engagement_index", title='Engagement Over Time')

In [None]:
time_plot(graph1, "time", "pct_access", title='Percentage of Access Over Time')

In [None]:
product_engagement_df = pd.merge(engagement, products_info, left_on='lp_id', right_on='LP ID' )

In [None]:
#changing the data type of district_id to int for merging
product_engagement_df[["district_id"]] = product_engagement_df[["district_id"]].apply(pd.to_numeric)

In [None]:
product_engagement_df = pd.merge(product_engagement_df, districts_info, on='district_id')

In [None]:
graph2 = product_engagement_df.groupby(['locale', 'Product Name']).agg({'time': 'count'})
graph2 = graph2.reset_index()

def per_local(locale):
    # Keep only the rows for this locale, then rank products by number of engagement records
    subset = graph2[graph2['locale'] == locale]
    top_10 = subset[['Product Name', 'time']].sort_values(by='time', ascending=False).head(10)
    plot_bar(top_10, "Product Name", "time", title=f'Top 10 Most Used Applications in {locale}')

for locale in graph2.locale.unique():
    per_local(locale)

In [None]:
#credit for this code https://www.kaggle.com/iamleonie/how-to-approach-analytics-challenges
import plotly.graph_objects as go
from plotly.subplots import make_subplots
us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District Of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}

districts_info['state_abbrev'] = districts_info['state'].replace(us_state_abbrev)
districts_info_by_state = districts_info['state_abbrev'].value_counts().to_frame().reset_index(drop=False)
districts_info_by_state.columns = ['state_abbrev', 'num_districts']

fig = go.Figure()
layout = dict(
    title_text = "Number of Available School Districts per State",
    geo_scope='usa',
)

fig.add_trace(
    go.Choropleth(
        locations=districts_info_by_state.state_abbrev,
        zmax=1,
        z = districts_info_by_state.num_districts,
        locationmode = 'USA-states', # set of locations match entries in `locations`
        marker_line_color='white',
        geo='geo',
        colorscale=px.colors.sequential.Teal, 
    )
)
            
fig.update_layout(layout)   
fig.show()

In [None]:
districts_info.pp_total_raw.unique()
temp = districts_info.groupby('locale').pp_total_raw.value_counts().to_frame()
temp.columns = ['amount']

temp = temp.reset_index(drop=False)

temp = temp.pivot(index='locale', columns='pp_total_raw')['amount']
temp = temp[['[4000, 6000[', '[6000, 8000[', '[8000, 10000[', '[10000, 12000[',
       '[12000, 14000[', '[14000, 16000[', '[16000, 18000[', 
       '[18000, 20000[',  '[22000, 24000[' ]]


fig, ax = plt.subplots(1, 2, figsize=(24,4))

sns.countplot(data=districts_info, x='locale', ax=ax[0], palette='GnBu')

sns.heatmap(temp, annot=True,  cmap='GnBu', ax=ax[1])
ax[1].set_title('Heatmap of Districts According To locale and pp_total_raw')
plt.show()

In [None]:

def replace_ranges_pct(range_str):
    if range_str == '[0, 0.2[':
        return 0.1
    elif range_str == '[0.2, 0.4[':
        return 0.3
    elif range_str == '[0.4, 0.6[':
        return 0.5
    elif range_str == '[0.6, 0.8[':
        return 0.7
    elif range_str == '[0.8, 1[':
        return 0.9
    else:
        return np.nan
    
def replace_ranges_raw(range_str):
    if range_str == '[4000, 6000[':
        return 5000
    elif range_str == '[6000, 8000[':
        return 7000
    elif range_str == '[8000, 10000[':
        return 9000
    elif range_str == '[10000, 12000[':
        return 11000
    elif range_str ==  '[12000, 14000[':
        return 13000
    elif range_str ==  '[14000, 16000[':
        return 15000
    elif range_str == '[16000, 18000[':
        return 17000
    elif range_str ==  '[18000, 20000[':
        return 19000
    elif range_str ==  '[20000, 22000[':
        return 21000
    elif range_str ==  '[22000, 24000[':
        return 23000
    else: 
        return np.nan
    
districts_info['pct_black_hispanic_num'] = districts_info['pct_black/hispanic'].apply(lambda x: replace_ranges_pct(x))
districts_info['pct_free_reduced_num'] = districts_info['pct_free/reduced'].apply(lambda x: replace_ranges_pct(x))
districts_info['pp_total_raw_num'] = districts_info['pp_total_raw'].apply(lambda x: replace_ranges_raw(x))

def plot_state_mean_for_var(col):
    temp = districts_info.groupby('state_abbrev')[col].mean().to_frame().reset_index(drop=False)

    fig = go.Figure()
    layout = dict(
        title_text = f"Mean {col} per State",
        geo_scope='usa',
    )

    fig.add_trace(
        go.Choropleth(
            locations=temp.state_abbrev,
            zmax=1,
            z = temp[col],
            locationmode = 'USA-states', # set of locations match entries in `locations`
            marker_line_color='white',
            geo='geo',
            colorscale=px.colors.sequential.Teal, 
        )
    )

    fig.update_layout(layout)   
    fig.show()

plot_state_mean_for_var('pct_black_hispanic_num')
plot_state_mean_for_var('pct_free_reduced_num')
plot_state_mean_for_var('pp_total_raw_num')

In [None]:
cloud = WordCloud(width=1440, height=1080,stopwords={'nan'}).generate(" ".join(products_info['Provider/Company Name'].astype(str)))
plt.figure(figsize=(15, 10))
plt.imshow(cloud)
plt.axis('off')

In [None]:
cloud = WordCloud(width=1440, height=1080,stopwords={'nan'}).generate(" ".join(districts_info['state'].astype(str)))
plt.figure(figsize=(15, 10))
plt.imshow(cloud)
plt.axis('off')

In [None]:
cloud = WordCloud(width=1440, height=1080,stopwords={'nan'}).generate(" ".join(products_info['Product Name'].astype(str)))
plt.figure(figsize=(15, 10))
plt.imshow(cloud)
plt.axis('off')

We've seen how engagement relates to various factors.
This work is still in progress. In the future, adding COVID-19 data sets as supplementary data will allow us to assess the effect of the pandemic on learning.