<center>
    <img src="https://bsmedia.business-standard.com/_media/bs/img/article/2020-04/15/full/1586897152-2436.jpg" width="500" alt="cognitiveclass.ai logo"  />
</center>

# <center>Covid 19 impact on digital learning</center>

## Objectives
Explore: 
1. the state of digital learning in 2020 and how the engagement of digital learning relates to factors such as district demographics, broadband access, and state/national level policies and events.

<font color='orange'>**Below are some examples of questions that relate to our problem statement:**</font>

* What is the picture of digital connectivity and engagement in 2020?
* What is the effect of the COVID-19 pandemic on online and distance learning, and how might this evolve in the future?
* How does student engagement with different types of education technology change over the course of the pandemic?
* How does student engagement with online learning platforms relate to different geography? Demographic context (e.g., race/ethnicity, ESL, learning disability)? Learning context? Socioeconomic status?
* Do certain state interventions, practices or policies (e.g., stimulus, reopening, eviction moratorium) correlate with an increase or decrease in online engagement?

## License

This work is licensed under a
[Creative Commons Attribution-ShareAlike 4.0 International License][cc-by-sa].

[![CC BY-SA 4.0][cc-by-sa-image]][cc-by-sa]

[cc-by-sa]: http://creativecommons.org/licenses/by-sa/4.0/
[cc-by-sa-image]: https://licensebuttons.net/l/by-sa/4.0/88x31.png
[cc-by-sa-shield]: https://img.shields.io/badge/License-CC%20BY--SA%204.0-lightgrey.svg

# <center><font color='orange'>Importing necessary libraries</font></center>

In [None]:

import pandas as pd
import numpy as np 
import os
import re
import glob
import matplotlib as mpl
import matplotlib.pyplot as plt
import plotly
import plotly.express as px
import plotly.graph_objs as go
import seaborn as sns
pd.options.display.float_format = '{:,.2f}'.format

# Loading and Reading the files:
* <font color='lightgreen'>Products_info file</font>
* <font color='lightgreen'>District file</font>
* <font color='lightgreen'>Engagement files</font>
    

# <font color='orange'>Cleaning Product dataset</font>

## Product information data
The product file `products_info.csv` includes information about the characteristics of the top 372 products with most users in 2020. The categories listed in this file are part of LearnPlatform's product taxonomy. Data were labeled by our team. Some products may not have labels due to being duplicate, lack of accurate url or other reasons.

| Name | Description |
| :--- | :----------- |
| LP ID| The unique identifier of the product |
| URL | Web Link to the specific product |
| Product Name | Name of the specific product |
| Provider/Company Name | Name of the product provider |
| Sector(s) | Sector of education where the product is used |
| Primary Essential Function | The basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled |

## Let's load the file and see a sample

In [None]:
df_products = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv")
df_products.sample(5)

## Checking data types and NaN values

In [None]:
print('Product file info:\n', '*'*50)
df_products.info()
print('\n', '*'*50)
print('Total of NaN value in each column is\n')
df_products.isnull().sum()

<!-- #######  YAY, I AM THE SOURCE EDITOR! #########-->
<h3 style="color: #5e9ca0;"><span style="color: #ffffff;">From the <span style="text-decoration: underline;"><strong>sample and info</strong></span>,&nbsp;we can see some issues in the data and the column names that we need to fix first</span></h3>
<ul>
<li>we will <strong>rename the columns</strong> for easier usage.</li>
</ul>
<table style="border-collapse: collapse; width: 100%; height: 126px;" border="1">
<tbody>
<tr style="height: 18px;">
<td style="width: 50%; height: 18px; border-style: dotted;"><strong>Old Column name</strong></td>
<td style="width: 50%; height: 18px; border-style: dotted;"><strong>New Column name</strong></td>
</tr>
<tr style="height: 18px;">
<td style="width: 50%; height: 18px;">LP ID</td>
<td style="width: 50%; height: 18px;">lp_id</td>
</tr>
<tr style="height: 18px;">
<td style="width: 50%; height: 18px;">URL</td>
<td style="width: 50%; height: 18px;">will be removed</td>
</tr>
<tr style="height: 18px;">
<td style="width: 50%; height: 18px;">Product Name</td>
<td style="width: 50%; height: 18px;">product_name</td>
</tr>
<tr style="height: 18px;">
<td style="width: 50%; height: 18px;">Provider/Company Name</td>
<td style="width: 50%; height: 18px;">provider_com_name</td>
</tr>
<tr style="height: 18px;">
<td style="width: 50%; height: 18px;">Sector(s)</td>
<td style="width: 50%; height: 18px;">sectors</td>
</tr>
<tr style="height: 18px;">
<td style="width: 50%; height: 18px;">Primary Essential Function</td>
<td style="width: 50%; height: 18px;">primary_ess_fun</td>
</tr>
</tbody>
</table>
<ul>
<li>we don't need <strong>URL column</strong>, we will <strong><span style="color: #ff0000;">remove</span></strong> it.</li>
<li>there are<strong> NaN</strong> values that need to be replaced with '<strong>NA</strong>' in the following columns:&nbsp;
<ol>
<li><span style="color: #3366ff;">Provider/Company Name (1 NaN)</span></li>
<li><span style="color: #3366ff;">Sector(s) and Primary Essential Function (19 each)</span></li>
</ol>
</li>
<li><span style="color: #ffffff;">One-Hot Encoding the Product Sectors</span></li>
<li>Splitting (primary_ess_fun) column values into 2 columns
<ul>
<li>primary_ess_main</li>
<li>primary_ess_sub</li>
</ul>
</li>
</ul>

In [None]:
# Rename Columns
df_products.rename(columns={ 'LP ID': 'lp_id', 'Product Name' : 'product_name', 'Provider/Company Name' : 'provider_com_name', 'Sector(s)' : 'sectors', 'Primary Essential Function' : 'primary_ess_fun' }, inplace=True)

# Remove URL Column
df_products.drop(['URL'], axis=1, inplace=True)

# Delete the products that don't have Primary Essential function
df_products.dropna(subset=['primary_ess_fun'],axis= 0, inplace=True)

<ul>
<li>Splitting (primary_ess_fun) column values into 2 columns
<ul>
<li>primary_ess_main</li>
<li>primary_ess_sub</li>
</ul>
</li>
<li><span style="color: #ffffff;">One-Hot Encoding the Product Sectors</span></li>
</ul>

In [None]:

# Splitting (primary_ess_fun) column values into 2 columns
primary_essential_main = []
primary_essential_sub = []
for s in df_products["primary_ess_fun"]:
    if(not pd.isnull(s)):
        s1 = s.split("-",1)[0].strip()
        primary_essential_main.append(s1)
    else:
        primary_essential_main.append(np.nan)
    if(not pd.isnull(s)):
        s2 = s.split("-",1)[-1].strip()
        primary_essential_sub.append(s2)
    else:
        primary_essential_sub.append(np.nan)
df_products["primary_ess_main"] = primary_essential_main
df_products["primary_ess_sub"] = primary_essential_sub
################################################################
# One-Hot Encoding the Product Sectors
temp_sectors = df_products['sectors'].str.get_dummies(sep="; ")
temp_sectors.columns = [f"sector_{re.sub(' ', '', c)}" for c in temp_sectors.columns]
df_products = df_products.join(temp_sectors)
df_products.drop(['primary_ess_fun', 'sectors'], axis=1, inplace=True)
df_products.rename(columns=str.lower, inplace=True)
df_products.shape
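
The two cleaning steps above (splitting the function label and one-hot encoding the sectors) can also be sketched with vectorized pandas string methods. This is a minimal illustration on a made-up `toy` frame, not the notebook's own data; the label and sector strings are assumptions chosen to resemble the real columns. One behavioral difference to keep in mind: when a label contains no `-`, this version leaves the sub-category missing, whereas the loop above repeats the main label.

```python
import pandas as pd

# Hypothetical rows that mimic the shape of the products file
toy = pd.DataFrame({
    "primary_ess_fun": [
        "LC - Digital Learning Platforms",
        "CM - Classroom Engagement & Instruction - Classroom Management",
        "SDO - Data, Analytics & Reporting",
    ],
    "sectors": ["PreK-12", "PreK-12; Higher Ed", "PreK-12; Higher Ed; Corporate"],
})

# Split on the first "-" only, so sub-categories that contain "-" stay intact
parts = toy["primary_ess_fun"].str.split("-", n=1, expand=True)
toy["primary_ess_main"] = parts[0].str.strip()
toy["primary_ess_sub"] = parts[1].str.strip()

# One indicator column per sector; a product can belong to several sectors
sector_dummies = toy["sectors"].str.get_dummies(sep="; ")
sector_dummies.columns = [
    "sector_" + c.replace(" ", "").lower() for c in sector_dummies.columns
]
toy = toy.join(sector_dummies)

print(toy[["primary_ess_main", "primary_ess_sub"] + list(sector_dummies.columns)])
```

Because a product can be flagged in more than one sector, the indicator columns should be summed per column rather than treated as mutually exclusive categories.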

# <center><font color='orange'>Product file data Exploratory</font></center>
* The purpose is to find out which products, categories, and sectors we should focus on

# We will explore the most popular products with their producers.

In [None]:
sns.set_theme(style="whitegrid")
fig, axs = plt.subplots(figsize=(15,4))
# Count products per provider; keeps 'counts' numeric instead of object dtype
count_of_products = df_products['provider_com_name'].value_counts().rename_axis('provider_com_name').reset_index(name='counts')
sns.barplot(data=count_of_products.head(10), x='counts', y='provider_com_name', palette='rocket', ax=axs);
axs.set_title('\nTOP 10 companies produced DL Apps.', fontsize=25, fontname='calibri')
axs.set_xlabel('\nNumber of products', fontsize=20, color='green', fontweight='bold', fontname='calibri')
axs.set_ylabel('Provider/Company name\n', fontsize=20, color='red', fontweight='bold', fontname='calibri')
axs.set_xticks(range(0, 29, 2));
del (count_of_products)

## Products usages in different sectors:

In [None]:

def count_of_sectors(colname):
    # Count the products flagged (value 1) in the given one-hot sector column
    return int((df_products[colname] == 1).sum())
labels = ['Sector_prek-12', 'Sector_highered', 'Sector_corporate']
counts = [count_of_sectors('sector_prek-12'),count_of_sectors('sector_highered'),count_of_sectors('sector_corporate')]
explode = (0.1, 0.0, 0) 
fig1, ax1 = plt.subplots()
ax1.pie(counts, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
fig1.suptitle('Percentage of products used in different sectors')
plt.show()

## Number of products in each Category

In [None]:
df_products.primary_ess_main.value_counts().head(10).to_frame()

## Number of products in each SUB-Category

In [None]:
df_products.primary_ess_sub.value_counts().head(10).to_frame()

# <p><font color='orange'>After exploring the products file, we can reach the following:</font></p>
<ol>
<li>Google offers the largest number of products among the most-used digital learning products, followed by Houghton Mifflin Harcourt, but with a huge gap between the number of products they offer&nbsp;<strong>(27 for Google, and 5 for HMH).</strong></li>
<li>The 27 Google products are used across the 3 sectors.&nbsp;</li>
<li>Google was able to take advantage of the pandemic and sought to profit from it by providing its services in various fields, and here we see how Google's share price rose throughout 2020.</li>
</ol>
<img src="https://i.ibb.co/tMpY8gS/google-Chart.jpg" alt="google-Chart" border="0">
<p>-----------------------------------------------------------------------------</p>
<ul>
<li>The category with the most products is LC, and the sub-category with the most products is Digital Learning Platform</li>
<li>PreK-12 is the largest sector using the digital learning products, with a share of up to 54%</li>
</ul>

<p style="text-align: center;"><span style="color: #ff0000;">--------------------------------------------------------------------</span></p>
<p style="text-align: center;"><span style="color: #ff0000;">--------------------------------------------------------------------</span></p>

# <center><font color='orange'>District file data Exploratory</font></center>

### District information data

The district file `districts_info.csv` includes information about the characteristics of school districts, including data from [NCES](https://nces.ed.gov/) (2018-19), [FCC](https://www.fcc.gov/) (Dec 2018), and [Edunomics Lab](https://edunomicslab.org/). In this data set, we removed the identifiable information about the school districts. We also used an open source tool [ARX](https://arx.deidentifier.org/) [(Prasser et al. 2020)](https://onlinelibrary.wiley.com/doi/full/10.1002/spe.2812) to transform several data fields and reduce the risks of re-identification. For data generalization purposes, some data points are released as a range within which the actual value falls. Additionally, many missing values are marked as 'NaN', indicating that the data was suppressed to maximize anonymization of the dataset.

| Name | Description |
| :--- | :----------- |
| district_id | The unique identifier of the school district |
| state | The state where the district is located |
| locale | NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. See [Locale Boundaries User's Manual](https://eric.ed.gov/?id=ED577162) for more information. |
| pct_black/hispanic | Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data |
| pct_free/reduced | Percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data |
| county_connections_ratio | `ratio` (residential fixed high-speed connections over 200 kbps in at least one direction/households) based on the county level data from FCC Form 477 (December 2018 version). See [FCC data](https://www.fcc.gov/form-477-county-data-internet-access-services) for more information. |
| pp_total_raw | Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district. |


## Loading districts file

In [None]:
df_district = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")
df_district.sample(5)

## Checking data types and NaN values

In [None]:

print('\nDistricts file info:\n', '*'*50)
df_district.info()
print('\n', '*'*50)
print('Total of NaN value in each column is\n')
print(df_district.isnull().sum())
print('\n', '*'*50)
print('numbers of rows and columns in Districts dataset is:\n')
print(df_district.shape)


<h3 style="color: #5e9ca0;"><span style="color: #000000;">From the <span style="text-decoration: underline;"><strong>sample and info</strong></span>,&nbsp;we can see some issues that need to be fixed</span></h3>

* Some rows have a missing state; we will delete those rows
<br></br>
* Replace NaN in the county_connections_ratio column with the most common range, `[0.18, 1[`
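
The fill value for `county_connections_ratio` is hard-coded in the cell below. As a hedged alternative, the most frequent range could be computed from the column itself with `mode()`. The sketch uses an invented `toy` frame (its rows are assumptions, not the real districts file) and applies the same two steps: drop rows without a state, then fill the missing ratio.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the districts file
toy = pd.DataFrame({
    "state": ["Utah", None, "Illinois", "Ohio", "Utah"],
    "county_connections_ratio": ["[0.18, 1[", "[0.18, 1[", np.nan, "[1, 2[", "[0.18, 1["],
})

# Step 1: drop districts whose state is missing
toy = toy.dropna(subset=["state"])

# Step 2: fill missing ratios with the most frequent range in the column
most_common = toy["county_connections_ratio"].mode().iloc[0]
toy["county_connections_ratio"] = toy["county_connections_ratio"].fillna(most_common)

print(most_common)
print(toy)
```

Assigning the result of `fillna` back to the column avoids relying on in-place modification of a column selection, which newer pandas versions may not propagate to the parent frame.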

In [None]:
df_district.dropna(subset=['state'], axis=0, inplace=True)
df_district['county_connections_ratio'] = df_district['county_connections_ratio'].fillna('[0.18, 1[')
print('*' * 80)
print('numbers of rows and columns in Districts dataset after fixing NaN values are:')
print(df_district.shape)
print('*' * 80)

## Exploring the states that have the maximum number of school districts.

In [None]:
explode = [0.05,0.05,0.05,0.05,0,0,0,0,0,0]
df_district["state"].value_counts().plot(kind = 'pie', autopct='%1.1f%%', figsize=(16, 10), startangle=0, shadow=True).legend(loc='right', bbox_to_anchor=(1.5, 0.5));

## Connecticut, Utah, Massachusetts, and Illinois are the `TOP` states with the largest number of school districts in the dataset, together accounting for 55.6%

 <font color= 'orange'>External data: based on National Center for Education Statistics, the states that have most districts in 2019-2020 academic year are:</font>
  <p><a title="NCES" href="https://nces.ed.gov">https://nces.ed.gov</a></p>

-----------------------------------------------------------------------------------------------------------------------------------------------

In [None]:
df_extrnal= pd.read_csv('../input/elsi-1/ELSI_1.csv')
df_extrnal_dist= df_extrnal[['state', 'no_districts']]
df_extrnal_dist= df_extrnal_dist.sort_values('no_districts', ascending= False)
df_extrnal_dist.head(10)

In [None]:
# Get TOP 10 States in the dataset
states = ['Connecticut', 'Utah', 'Massachusetts', 'Illinois', 'California', 'Ohio', 'New York', 'Indiana', 'Washington', 'Missouri']
df_top10_dist = df_district[df_district.state.isin(states)].copy()  # copy so new columns can be added safely later

-----------------------------------------------------------------------------------------------------------------------------------------------

# <font color='skyblue'>Dist. of schools in each locale in each State</font>

In [None]:
plt.figure(figsize=(10,10))
g = sns.countplot(data = df_top10_dist, x= df_top10_dist['locale'],palette='tab10', hue='state')
plt.legend(loc='upper right')
g.set_title('\nDist. of schools in each locale in each State\n', fontsize=20, fontweight='bold', fontname='Calibri')
g.set_xlabel('\nLocale\n', fontsize=14, color='red', fontweight='bold', fontname='calibri')
g.set_ylabel('\nSchools Dist.\n', fontsize=20, color='red', fontweight='bold', fontname='calibri')
plt.show()

School districts in the dataset are mainly in Suburb, Rural, and City locales.
## The distribution above shows the number of school districts in our dataset per locale in each state. As we can see, **Connecticut** has the most districts in the Suburb locale, followed by **Massachusetts**, then **Utah** and **Illinois**
## In the City locale: **California** and **Utah**
## In the Rural locale: **Connecticut**, followed by **New York**
## In the Town locale: **Utah** has the most districts, with no close competitor

# Overall, our data is heavily skewed towards Suburb and Rural locales, as we can see below

In [None]:
#plt.figure(figsize=(10,10))
g = sns.countplot(data = df_top10_dist, x= df_top10_dist['locale'],palette='tab10')
g.set_title('\nDist. of schools in each locale in each State\n', fontsize=20, fontweight='bold', fontname='Calibri')
g.set_xlabel('\nLocale\n', fontsize=14, color='red', fontweight='bold', fontname='calibri')
g.set_ylabel('\nSchools Dist.\n', fontsize=20, color='red', fontweight='bold', fontname='calibri')
plt.show();

-------------------------------------------------------------------------------------------

In [None]:
def replace_ranges_pct(range_str):
    if range_str == '[0, 0.2[':
        return 0.1
    elif range_str == '[0.2, 0.4[':
        return 0.3
    elif range_str == '[0.4, 0.6[':
        return 0.5
    elif range_str == '[0.6, 0.8[':
        return 0.7
    elif range_str == '[0.8, 1[':
        return 0.9
    else:
        return np.nan
    
def replace_ranges_raw(range_str):
    if range_str == '[4000, 6000[':
        return 5000
    elif range_str == '[6000, 8000[':
        return 7000
    elif range_str == '[8000, 10000[':
        return 9000
    elif range_str == '[10000, 12000[':
        return 11000
    elif range_str ==  '[12000, 14000[':
        return 13000
    elif range_str ==  '[14000, 16000[':
        return 15000
    elif range_str == '[16000, 18000[':
        return 17000
    elif range_str ==  '[18000, 20000[':
        return 19000
    elif range_str ==  '[20000, 22000[':
        return 21000
    elif range_str ==  '[22000, 24000[':
        return 23000
    else: 
        return np.nan


In [None]:
df_top10_dist['pct_black_hispanic_num'] = df_top10_dist['pct_black/hispanic'].apply(lambda x: replace_ranges_pct(x))
df_top10_dist['pct_free_reduced_num'] = df_top10_dist['pct_free/reduced'].apply(lambda x: replace_ranges_pct(x))
df_top10_dist['pp_total_raw_num'] = df_top10_dist['pp_total_raw'].apply(lambda x: replace_ranges_raw(x))
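
The lookup functions above map each range string to its midpoint by hand. A generic parser is one possible alternative; the sketch below assumes every range has the form `[low, high[` seen in this dataset, and `range_midpoint` is a hypothetical helper name, not part of the original notebook.

```python
import re

import numpy as np

# Matches strings such as "[0.2, 0.4[" or "[22000, 24000["
_RANGE_PATTERN = re.compile(r"^\[\s*([\d.]+)\s*,\s*([\d.]+)\s*\[$")


def range_midpoint(range_str):
    """Return the midpoint of a '[low, high[' range string, or NaN if it cannot be parsed."""
    if not isinstance(range_str, str):
        return np.nan  # covers NaN/None placeholders for suppressed values
    match = _RANGE_PATTERN.match(range_str.strip())
    if match is None:
        return np.nan
    low, high = float(match.group(1)), float(match.group(2))
    return (low + high) / 2


print(range_midpoint("[22000, 24000["))
print(range_midpoint("[0.18, 1["))
print(range_midpoint(np.nan))
```

It could be applied to a column in the same way as the lookup functions, for example with `df_top10_dist['pp_total_raw'].apply(range_midpoint)`, and it would also handle ranges that the hand-written branches do not list.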

In [None]:
def plot_state_mean_for_var(col):
    temp = df_top10_dist.groupby('state')[col].mean().to_frame().reset_index(drop=False)
    temp = temp.sort_values(col, ascending= False).head(10)
    sns.set_theme(style="whitegrid")
    fig, axs = plt.subplots(figsize=(15,4))
    sns.barplot(data=temp, x='state', y=col, ax=axs);
    axs.set_title(f"Mean {col} per State")
    plt.xticks(rotation= 60)


In [None]:
plot_state_mean_for_var('pct_black_hispanic_num')

## School districts in the dataset have a **lower percentage** of both Black and Hispanic students.

Based on our dataset, the states with the highest mean percentage of Black and Hispanic students are:
* `California` 
* `Indiana`
* `Illinois and Washington`

 <font color= 'orange'>External data: based on National Center for Education Statistics, the states that have most black/hispanic students in 2019-2020 academic year are:</font>
  <p><a title="NCES" href="https://nces.ed.gov">https://nces.ed.gov</a></p>

In [None]:
df_extrnal_bh= df_extrnal[['state', 'black/hispanic']]
df_extrnal_bh= df_extrnal_bh.sort_values('black/hispanic', ascending= False)  # sort the two-column subset, not the full frame
df_extrnal_bh.head(10)

In [None]:
plot_state_mean_for_var('pct_free_reduced_num')

In [None]:
plot_state_mean_for_var('pp_total_raw_num')

## The majority of per-pupil spending is concentrated between 8K and 18K.

<font color= 'yellow'>External Data</font>
The following list shows the states with the highest per-pupil spending.
Source: <p><a href="https://worldpopulationreview.com/state-rankings/per-pupil-spending-by-state">https://worldpopulationreview.com/state-rankings/per-pupil-spending-by-state</a></p>

* 1- New York ($24,040)
* 2- Connecticut ($20,635)
* 3- New Jersey ($20,021)
* 4- Alaska ($17,726)
* 5- Massachusetts ($17,058)
* 6- New Hampshire ($16,893)
* 7- Pennsylvania ($16,395)
* 8- Wyoming ($16,224)
* 9- Rhode Island ($16,121)
* 10- Illinois ($15,741)

---------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------

# <center><font color='orange'>Engagement data</font></center>
The engagement data are aggregated at school district level, and each file in the folder `engagement_data` represents data from one school district. The 4-digit file name represents `district_id`, which can be used to link to district information in `districts_info.csv`. The `lp_id` can be used to link to product information in `products_info.csv`.

| Name | Description |
| :--- | :----------- |
| time | date in "YYYY-MM-DD" |
| lp_id | The unique identifier of the product |
| pct_access | Percentage of students in the district who have at least one page-load event of a given product on a given day |
| engagement_index | Total page-load events per one thousand students of a given product and on a given day |

Now we will combine the engagement files and link them with both the districts file and the products file.

We'll use the following metrics as proxies:
**pct_access.max():** Why the max value? Because many products are available to students each day, and taking the mean would underestimate the engagement level.

> For example: on Jan. 1st 2020, school district 1000 has 138 different products, each with its own pct_access. The mean is only 0.145%, which is clearly misleading: if 5% of students are active on just one product, then that 5% is the more reliable estimate of engagement for that day and should not be averaged out. In this case the max value is 3.6% on a certain product. We don't care which product at this stage; we just know that 3.6% of the students were active on some technology product that day.

**engagement_index.sum():** for engagement_index we choose the sum, since it registers the total page-load events per one thousand students of a given product and on a given day.

### according to that we will extract:
* 1- MAX pct_access each day per district
* 2- Sum of engagement_index each day per district
* 3- Sum of engagement_index each day per product
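
To make the choice of aggregates concrete, here is a minimal sketch on invented numbers for a single district (the `toy` values are assumptions, not taken from the engagement files). It computes the daily mean and max of `pct_access` side by side, plus the daily sum of `engagement_index`.

```python
import pandas as pd

# Hypothetical engagement rows for one district: several products per day
toy = pd.DataFrame({
    "time": ["2020-01-01"] * 3 + ["2020-01-02"] * 2,
    "lp_id": [101, 102, 103, 101, 102],
    "pct_access": [3.6, 0.1, 0.05, 5.0, 0.2],
    "engagement_index": [120.0, 4.0, 1.0, 200.0, 10.0],
})

# One row per day: the mean is dragged down by rarely used products,
# while the max reflects the most widely used product that day
daily = toy.groupby("time").agg(
    mean_pct_access=("pct_access", "mean"),
    max_pct_access=("pct_access", "max"),
    total_engagement=("engagement_index", "sum"),
)

print(daily)
```

In this toy example the daily mean sits well below the daily max, which is the underestimation effect described above.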

There are 366 unique days available. However, for 43 school districts fewer than 366 unique days of data are available. For example, district 3670 only has data from 2020-02-15 to 2020-03-02, and district 2872 only has data for January 2020 plus two more single days in February and March.
Therefore, we will only consider districts with full 2020 engagement data and will drop all districts that have fewer than 366 days.
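
The completeness filter can be sketched on a combined frame as follows. `keep_complete_districts` is a hypothetical helper introduced only for illustration, and the toy data uses three expected days instead of 366 so that the example stays small.

```python
import pandas as pd


def keep_complete_districts(df, expected_days=366):
    """Keep only rows of districts that have data for every expected day."""
    days_per_district = df.groupby("district_id")["time"].nunique()
    complete_ids = days_per_district[days_per_district == expected_days].index
    return df[df["district_id"].isin(complete_ids)]


# Hypothetical data: district 1000 is complete, district 3670 is missing a day
toy = pd.DataFrame({
    "district_id": [1000, 1000, 1000, 3670, 3670],
    "time": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-02-15", "2020-02-16"],
})

complete = keep_complete_districts(toy, expected_days=3)
print(complete["district_id"].unique())
```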

In [None]:
path = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data' 
all_files = glob.glob(path + "/*.csv")
li, li2, li3 = [],[], []
temp= []
df1= pd.DataFrame()
df2= pd.DataFrame()
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0) # READ FILES
    # Derive the district id from the file name (e.g. ".../1000.csv" -> 1000)
    district_id = int(os.path.basename(filename).split(".")[0])
    df["district_id"] = district_id
    if df.time.nunique() == 366:
        temp.append(df)
    df1= df.sort_values('pct_access',ascending = False).groupby(['time']).head(1).reset_index(drop= True)
    df2= df.groupby(['time', 'district_id'], as_index= False)['engagement_index'].sum()
    df3= df.groupby(['time', 'lp_id'], as_index= False)['engagement_index'].sum()
    li.append(df1) 
    li2.append(df2)
    li3.append(df3)
engagement = pd.concat(temp)
engagement = engagement.reset_index(drop=True)

df_max_pct_access = pd.concat(li)
df_max_pct_access = df_max_pct_access[df_max_pct_access.district_id.isin(engagement.district_id.unique())].reset_index(drop=True)

df_sum_engagement_index_districts = pd.concat(li2)
df_sum_engagement_index_districts = df_sum_engagement_index_districts[df_sum_engagement_index_districts.district_id.isin(engagement.district_id.unique())].reset_index(drop=True)

df_sum_engagement_index_products = pd.concat(li3)
df_sum_engagement_index_products = df_sum_engagement_index_products[df_sum_engagement_index_products['lp_id'].isin(engagement.lp_id.unique())].reset_index(drop=True)

del([df, df1, df2, df3])

In [None]:
df_max_pct_access.dropna(inplace=True)
df_sum_engagement_index_districts.dropna(inplace= True)
df_sum_engagement_index_products.dropna(inplace= True)
df_max_pct_access['lp_id'] = df_max_pct_access['lp_id'].astype(int)
df_sum_engagement_index_products['lp_id'] = df_sum_engagement_index_products['lp_id'].astype(int)

--------------------------------------------------

## Merge Engagement dataset with Districts and remove weekends

In [None]:

def merge_dfs(df1, df2, left, right, how, x=[], GroupByCol='', col='',sortCols=[], lineY='', title=''):
    global df_total_mean
    df = pd.merge(df1,df2,left_on= left,right_on= right, how= how)
    df['time'] = pd.to_datetime(df['time'])
    df['weekday'] = pd.DatetimeIndex(df['time']).weekday
    df = df[df.weekday < 5]
    df = df[x]
    df = df.groupby([GroupByCol,df['time'].dt.strftime('%B')])[col].mean()
    df = pd.DataFrame(df)
    df.reset_index(inplace=True)
    months = ["January", "February", "March", "April", "May", "June", 
          "July", "August", "September", "October", "November", "December"]
    df['time'] = pd.Categorical(df['time'], categories=months, ordered=True)
    df = df.sort_values(sortCols, ascending= [True,False])
    df_total_mean= df
    g= px.line(df, x='time', y=lineY, title= title ,color='state', markers=True)
    return g

Now we need the mean of pct_access for each state in every month, to see the percentage of students who accessed the digital learning products each month. This also shows how the TOP 10 states compare in the share of students with at least one page-load event.
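
The core of this step (join engagement to districts, drop weekends, average per state and month) can be sketched without the plotting part. The frames and values below are invented for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical daily max pct_access per district
engagement = pd.DataFrame({
    "district_id": [1, 1, 1, 2, 2],
    "time": ["2020-03-02", "2020-03-03", "2020-03-07", "2020-03-02", "2020-04-01"],
    "pct_access": [10.0, 20.0, 99.0, 30.0, 50.0],
})
districts = pd.DataFrame({"district_id": [1, 2], "state": ["Utah", "Ohio"]})

# Attach the state to every engagement row
merged = engagement.merge(districts, on="district_id", how="inner")
merged["time"] = pd.to_datetime(merged["time"])

# Keep Monday (0) to Friday (4); 2020-03-07 is a Saturday and is dropped
weekdays = merged[merged["time"].dt.weekday < 5]

# Mean pct_access per state and calendar month
monthly = (
    weekdays.groupby(["state", weekdays["time"].dt.month_name()])["pct_access"]
    .mean()
    .reset_index()
)

print(monthly)
```

Grouping by month name keeps the result readable, but the names sort alphabetically, which is why the notebook converts them to an ordered categorical before plotting.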

In [None]:
merge_dfs(df_max_pct_access, df_top10_dist, 'district_id', 'district_id', 'inner', ['state', 'time', 'pct_access'], 'state','pct_access', ['time','pct_access'], 'pct_access', 'Percentage of students who accessed at least one-page / month, per state (TOP10 states)')

In [None]:

x = df_total_mean[['state','pct_access']]
x = df_total_mean.groupby(['state'])['pct_access'].mean()
x = pd.DataFrame(x)
x.reset_index(inplace=True)
x= x.sort_values(['pct_access'], ascending= [False])
df_mean_pct_access_districts= x


---------------------------------------------------------------------------------------

Now we will get the mean of engagement_index for each state in every month, to see the total page-load events per one thousand students across the digital learning products each month.

In [None]:
merge_dfs(df_sum_engagement_index_districts, df_top10_dist, 'district_id', 'district_id', 'inner', ['state', 'time', 'engagement_index'], 'state','engagement_index', ['time','engagement_index'], 'engagement_index', 'Total page-load events per one thousand students of all products per state (TOP10 states)')

In [None]:
x = df_total_mean[['state', 'engagement_index']]
x= x.groupby('state', as_index= False)['engagement_index'].mean()
df_mean_engagement_district= x
x= x.sort_values('engagement_index', ascending= False).head(10)


Let's summarize what we see:
* the home schooling phase starts at the beginning of March
* during July and August there are summer holidays and therefore no classes to attend
* after the summer holidays the pct_access increases to a higher level as observed at the beginning of the pandemic and it stays somewhat constant
* there are a few drops in pct_access visible throughout the year - these might be national holidays or other holidays

------------------------------------------------------------------------------------------------------------------------

In [None]:
df_mean_districts= pd.merge(df_mean_pct_access_districts, df_mean_engagement_district, left_on='state', right_on='state')

In [None]:
size = [200, 180, 160, 120, 100, 80, 60, 40, 20, 10]
fig = px.scatter(df_mean_districts, x="engagement_index", y="pct_access", color="state", size= 'pct_access', size_max=20)
fig.show()


<h2><span style="color: #00ff00;">New&nbsp;York,&nbsp;Illinois,&nbsp;and&nbsp;Indiana</span></h2>

are the top 3 states by the share of students with at least one page-load event of a given product on a given day, and also the TOP 3 by total engagement.
<br> </br>

<h1><span style="color: #ff9900;">What&nbsp;happened&nbsp;in&nbsp;New&nbsp;York?</span></h1>

Gov. Cuomo held a press conference on Aug. 7, where he announced all school districts in the state were authorized to open, as long as the rate of positive tests in the district remained below 5 percent. The decisions about in-person learning were left to each district.

### <font color= 'orange'>Based on the above chart we can see the activity of students during the year, and that reflects the decisions that have been taken by the Gov. Andrew Cuomo</font>
> * March 16, 2020: Cuomo announced that schools across the state would close for at least two weeks beginning March 18. Schools subsequently remained closed to in-person instruction for the remainder of the academic year; prior to that later announcement, schools had been closed through May 15.
> * July 13, 2020: The State Department of Education released a framework for school reopening plans. Each school district would be required to submit a district-specific reopening plan based on the template between July 17 and July 31.
> * August 7, 2020: New York Gov. Andrew Cuomo (D) announced schools would reopen to in-person instruction at the start of the school year. Students would be required to wear masks. Parents would retain the option to keep their children home.
> * October 30, 2020: New York Gov. Andrew Cuomo (D) announced schools in the state's red and orange mitigation zones could reopen after all of a school’s students and teachers got tested. Cuomo did not give a timeline for the reopening but said the state would provide the tests.
<br></br>

## District reopening plans

Each school district had until July 31 to submit plans to the New York State Department of Health for three different learning models–all in-person, all remote, and a hybrid of the two. Each plan had to detail how districts would meet state requirements for each model. The plans are required to be made publicly available online.
<br></br>
## In-person, hybrid, and online learning

The decision to reopen schools for in-person learning has been left up to individual districts, and each district has been required to post online plans regarding testing, contact tracing, and remote learning. Cuomo is also requiring districts to host information sessions with parents and the community to discuss their plans.

source: <p><a href="https://ballotpedia.org/School_responses_in_New_York_to_the_coronavirus_(COVID-19)_pandemic">School responses in New York to the coronavirus (COVID-19) pandemic</a></p>

***********************************************
<br> </br>
<h1><span style="color: #ff9900;">What&nbsp;happened&nbsp;in&nbsp;Illinois?</span></h1>

## District reopening plans

School districts are required to develop and publicly post a Remote Learning Days and Blended Remote Learning Day Plan, which the district superintendent must approve. The plans must address the following:

* Accessibility of the remote instruction to all students enrolled in the district;
* When applicable, a requirement that the Remote Learning Day and Blended Remote Learning Day activities reflect the Illinois Learning Standards;
* Means for students to confer with an educator, as necessary;
* How the district will take attendance and monitor and verify each student's remote participation; and
* Transitions from remote learning to on-site learning upon the State Superintendent's declaration that Remote Learning Days and Blended Remote Learning Days are no longer deemed necessary.

## In-person, hybrid, and online learning

In-person operations at schools are encouraged to resume in Phase 4 regions with precautions to allow for social distancing. Schools and districts are allowed to use hybrid schedules and online integration as necessary. According to the plan, “Data and feedback should be analyzed through an equity lens to determine what student groups may need greater supports to meet high standards in a Remote or Blended Remote Learning environment.”

source: <p><a href="https://ballotpedia.org/School_responses_in_Illinois_to_the_coronavirus_(COVID-19)_pandemic">School responses in Illinois to the coronavirus (COVID-19) pandemic</a></p>
<br> </br>
<h1><span style="color: #ff9900;">What&nbsp;happened&nbsp;in&nbsp;Indiana?</span></h1>
Schools in Indiana were closed to in-person instruction on March 19, 2020, and remained closed for the remainder of the 2019-2020 academic year. The state allowed schools to start reopening on July 1, 2020.

> * July 21, 2020: Of the school reopening plans reviewed by WFYI:
>   * Forty districts are offering an option of full-time in-person or remote instruction.
>   * Five districts plan to begin the academic year with full-time remote instruction.
>   * Six districts are following a hybrid model, where students learn in the classroom part of the week, and virtually on other days.
>   * Two districts plan to have full-time in-person instruction.
<br></br>
source: <p><a href="https://ballotpedia.org/School_responses_in_Indiana_to_the_coronavirus_(COVID-19)_pandemic">School responses in Indiana to the coronavirus (COVID-19) pandemic</a></p>

For the full plan, kindly check: <p><a title="covid-school-reopening-plans-for-central-indiana" href="https://www.wfyi.org/news/articles/covid-school-reopening-plans-for-central-indiana">covid-school-reopening-plans-for-central-indiana</a></p>
<br></br>

After analyzing the states, we can say the following:
## Top performing states are: <font color= 'orange'>**Illinois, New York, and Indiana**</font>.
## <font color= 'lightred'>**California and Utah**</font> perform poorly compared to the best. Both metrics, pct_access and engagement_index, are only about half of those at the top of the ranking.
<br></br>
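The "about half" comparison can be made explicit by dividing each state's metrics by the average of the top-ranked states. The sketch below is only an illustration: the `state_metrics` table and its numbers are hypothetical placeholders, not values from the dataset, and `ratio_to_top` is a helper name introduced here.

```python
import pandas as pd

# Hypothetical per-state averages; placeholder numbers, NOT results from the dataset.
state_metrics = pd.DataFrame({
    "state": ["Illinois", "New York", "Indiana", "California", "Utah"],
    "pct_access": [1.0, 0.9, 0.8, 0.45, 0.4],
    "engagement_index": [400.0, 380.0, 360.0, 190.0, 180.0],
}).set_index("state")


def ratio_to_top(metrics: pd.DataFrame, top_states: list) -> pd.DataFrame:
    """Divide every state's metrics by the mean of the top-ranked states."""
    baseline = metrics.loc[top_states].mean()      # one baseline value per metric
    return metrics.div(baseline, axis="columns")   # 1.0 means "on par with the top"


ratios = ratio_to_top(state_metrics, ["Illinois", "New York", "Indiana"])
print(ratios.round(2))
```

A ratio near 0.5 for both columns is what "only about half of those at the top" means in practice. Running the same helper on the real per-state averages would show how close California and Utah actually are to that figure.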
To dig deeper into this finding, we ask the following questions:
- Did the students have computers and internet access?
- How did schools organize distance education (virtual classes, curricula delivered over the internet), and did students actually follow those classes and curricula?
- In states with large Black and Latino populations, were these students able to receive distance education properly, and did they follow the classes and curricula?

The answers are provided by the <p><a href="https://datacenter.kidscount.org/data#USA/2/8/10,11,12,13,15,14,2719/char/0/271">Kids Count Data Center</a></p>
which reports:
- The percentage of households in which internet and a computer or digital device are usually or always available to children for educational purposes in the United States

<a href="https://ibb.co/pKQQcMG"><img src="https://i.ibb.co/NNrr5JQ/percentage-of-households-in-which-internet-and-a-computer-to-digital-device-are-usually-or-always-av.jpg" alt="percentage-of-households-in-which-internet-and-a-computer-to-digital-device-are-usually-or-always-av" border="0" /></a>
<br></br>


* The percentage of households with children that received online education due to the coronavirus pandemic in the United States
<br></br>
<img src="https://i.ibb.co/sF6JByV/households-with-children-received-education-due-to-the-coronavirus-pandemic-in-the-United-States.jpg" alt="households-with-children-received-education-due-to-the-coronavirus-pandemic-in-the-United-States" border="0">

We get the same result from <p><a href="https://usafacts.org/articles/65-of-childrens-education-has-moved-online-during-covid-19/">USAFacts.org</a></p>

<iframe width="750" height="520" frameborder="0" src="https://usafacts.org/embed/chart/?autosize=false&axisTextColor=%23616161&chartType=1&margins=%7B%22top%22%3A0%7D&metrics=%5B%7B%22id%22%3A%22sotu2151%22%2C%22allStates%22%3Atrue%7D%5D&selectableYears=false&sortRows=false&source=SOTU&title=Percentage%20of%20households%20with%20children%20reporting%20use%20of%20online%20distance%20learning%3A"></iframe>

## An interesting result: as we can see, between 50% and 85% of students were able to receive online education.
> * No excuse for California. Despite a large and fairly complete sample, California's pct_access and engagement_index are only about half of those of the states at the top of the ranking, namely Illinois and New York.

> * Utah, one of the states with the most school districts, appears not to have had a plan to deliver online education to its students. Only 51% of students could receive online education, which helps explain why the state's pct_access and engagement_index are so low.

* Percentage of Black/Hispanic households with children that received online education due to the coronavirus pandemic
<br></br>
<img src="https://i.ibb.co/jWTmjT3/households-with-children-received-education-due-to-the-coronavirus-pandemic-by-race-ethnicity.jpg" alt="households-with-children-received-education-due-to-the-coronavirus-pandemic-by-race-ethnicity" border="0">

However, more than 25% of Black/Hispanic students do not have a computer, and more than 15% do not have internet access.

According to usafacts.org, 4.4 million households with children don’t have consistent access to computers for online learning during the pandemic.
Millions of students have no internet while sheltering at home. The chart below shows the percentage of students in households with no internet or computer access, by race.

<img src="https://staticweb.usafacts.org/media/images/Tech_access_by_race_uVEEMGh.width-1200.png" alt="" width="500" height="300" border="0">

Access to computers and home internet varies by geography. While 14% of all households with school-aged children don’t have internet at home, the percent increases by four percentage points to 18% for households in rural areas. The rural/urban internet divide is an issue of concern for governments. The US Department of Agriculture has invested in rural broadband for decades and the federal government announced the American Broadband Initiative in 2019.

Access also varies by race. For both internet and computers, white and Asian children have higher than average access, whereas Black, Hispanic, and American Indian and Native Alaskan children have lower than average access. Access is particularly low for American Indian and Native Alaskan children, with 65% with access to a computer and 63% with home internet.

Technology availability isn’t the only barrier to remote learning. Some children may not have access to a quiet place to learn; remote learning will be particularly difficult for the 1.5 million homeless students in the US. At present, there are no statistics on where children are living during coronavirus shutdowns. However, news outlets reported some children are staying outside their parental home with grandparents or other caretakers who may have slow speeds or no internet altogether. Lastly, the Department of Education is helping schools address remote learning challenges for students with disabilities.

Policies to expand universal home internet access may receive renewed attention in response to coronavirus shelter in place orders. In the meantime, some school districts are getting creative to provide access to digital learning. South Bend, Indiana is deploying buses as wifi hotspots and New York City is distributing 300,000 Apple iPads to students. Responses vary based on a school district’s ability to provide resources.


<font color='yellow'>--------------------------------------------------------------------------------------------------------------------------------------------------</font>
<font color='yellow'>--------------------------------------------------------------------------------------------------------------------------------------------------</font>
<font color='yellow'>--------------------------------------------------------------------------------------------------------------------------------------------------</font>

Now let's look at the top 10 products accessed and engaged with by students, and then at the top 3 products accessed by students in the top 3 states (New York, Indiana, and Illinois).

In [None]:
def merge_products(df1, df2, left, right, how, x=(), GroupCols=(), col='', title=''):
    """Merge engagement data with product info and plot the top 10 products by mean `col`."""
    global df_products_total_mean
    df = pd.merge(df1, df2, left_on=left, right_on=right, how=how)
    # Keep weekdays only (Monday=0 ... Friday=4).
    df['time'] = pd.to_datetime(df['time'])
    df = df[df['time'].dt.weekday < 5]
    # Average the metric per product, then keep the 10 highest values.
    df = df[list(x)].groupby(list(GroupCols), as_index=False)[col].mean()
    df = df.sort_values(col, ascending=False).head(10)
    df['lp_id'] = df['lp_id'].astype(int)
    # Store the result in the module-level variable declared above.
    df_products_total_mean = df
    g = px.line(df, x='product_name', y=col, title=title, markers=True)
    return g

In [None]:
merge_products(df_max_pct_access, df_products, 'lp_id', 'lp_id', 'inner', ['lp_id','product_name', 'pct_access', 'primary_ess_main', 'primary_ess_sub', 'provider_com_name'], ['lp_id','product_name', 'primary_ess_main', 'primary_ess_sub', 'provider_com_name'], 'pct_access', 'Percentage of students in the district have at least one page-load event of a given product' )

In [None]:
merge_products(df_sum_engagement_index_products, df_products, 'lp_id', 'lp_id', 'inner', ['lp_id','product_name', 'engagement_index', 'primary_ess_main', 'primary_ess_sub', 'provider_com_name'], ['lp_id','product_name', 'primary_ess_main', 'primary_ess_sub', 'provider_com_name'], 'engagement_index', 'Total page-load events per one thousand students of a given product (TOP10 Products)')
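To make the ranking logic easier to check, here is a minimal sketch of the same steps (drop weekends, average per product, keep the highest values) on a tiny hand-made table. The `toy` data and the `top_products` helper are hypothetical and exist only for this illustration; plotting is left out so the result can be inspected directly.

```python
import pandas as pd

# Hypothetical engagement rows. 2020-09-05 and 2020-09-06 fall on a weekend.
toy = pd.DataFrame({
    "time": ["2020-09-03", "2020-09-04", "2020-09-05", "2020-09-06", "2020-09-07"],
    "product_name": ["A", "B", "A", "B", "A"],
    "pct_access": [10.0, 4.0, 1.0, 1.0, 20.0],
})


def top_products(df: pd.DataFrame, col: str, n: int = 10) -> pd.DataFrame:
    """Return the n products with the highest weekday mean of `col`."""
    df = df.copy()
    df["time"] = pd.to_datetime(df["time"])
    weekdays = df[df["time"].dt.weekday < 5]          # Monday=0 ... Friday=4
    means = weekdays.groupby("product_name", as_index=False)[col].mean()
    return means.sort_values(col, ascending=False).head(n).reset_index(drop=True)


ranking = top_products(toy, "pct_access", n=2)
print(ranking)
```

Because the weekend rows are removed before averaging, the low weekend values do not pull the product means down, which is the reason for the weekday filter in `merge_products`.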


--------------------------------- 

In [None]:
# Join the access data with the top-10 districts, then with the product metadata.
df_dists_products = pd.merge(df_max_pct_access, df_top10_dist, on='district_id', how='inner')
df_dists_products = pd.merge(df_dists_products, df_products, on='lp_id', how='inner')
df = df_dists_products[['state', 'pct_access', 'product_name', 'primary_ess_main', 'primary_ess_sub']]
# Restrict to the three top-performing states.
states = ['New York', 'Illinois', 'Indiana']
df = df[df.state.isin(states)]
# Mean pct_access per state and product, then the 3 highest products in each state.
df = df.groupby(['state', 'product_name', 'primary_ess_main', 'primary_ess_sub'], as_index=False)['pct_access'].mean()
df = df.sort_values(['state', 'pct_access'], ascending=[True, False]).groupby('state').head(3).reset_index(drop=True)
df

In [None]:
states = ['New York', 'Illinois', 'Indiana']
df_engagement_products_dist = df_dists_products[df_dists_products.state.isin(states)]
df_engagement_products_dist= df_engagement_products_dist.groupby(['state', 'product_name', 'primary_ess_main', 'primary_ess_sub'], as_index=False)['engagement_index'].mean()
df_engagement_products_dist = df_engagement_products_dist.sort_values(['state','engagement_index'], ascending= [True, False]).groupby('state').head(3).reset_index(drop=True)
df_engagement_products_dist
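Both cells above use the same "top N per group" pattern: sort by state and by the metric in descending order, then take the first rows of each state group. A minimal sketch on hypothetical data (the `toy` table and the `top_n_per_state` helper are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical per-state product averages, for illustration only.
toy = pd.DataFrame({
    "state": ["New York"] * 4 + ["Indiana"] * 4,
    "product_name": ["P1", "P2", "P3", "P4"] * 2,
    "engagement_index": [5.0, 9.0, 1.0, 7.0, 2.0, 8.0, 6.0, 4.0],
})


def top_n_per_state(df: pd.DataFrame, col: str, n: int = 3) -> pd.DataFrame:
    """Keep the n rows with the highest `col` within each state."""
    ordered = df.sort_values(["state", col], ascending=[True, False])
    # groupby(...).head(n) keeps the first n rows of each group in sorted order.
    return ordered.groupby("state").head(n).reset_index(drop=True)


top3 = top_n_per_state(toy, "engagement_index")
print(top3)
```

The sort has to come before the `groupby(...).head(n)` call, because `head` simply takes the first rows of each group in whatever order they appear.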

It is clear that schools relied mainly on delivering curricula to students through virtual meeting apps, and relied only a little on other courses and curricula offered by various companies. Google ranks first in use across many educational functions with Meet, Classroom, and Docs, alongside Zoom, which stood out for its virtual meeting service.

-----------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------

# Conclusion

Despite the fact that we do not know exactly how the COVID-19 pandemic is affecting children’s needs and academic performance, we know enough from existing research on learning during somewhat comparable educational experiences, and from news and observations of how education is being produced during the crisis, to assess the likely consequences on educational outcomes both overall and for relatively disadvantaged subgroups.

Research was reviewed on what to expect when children experience a substantial loss of learning time, when schools make a sudden shift to remote learning and home schooling without meeting the conditions for their effectiveness, and when circumstances lead to a massive increase in stress and disruption for children and their families. Evidence that has emerged during the crisis was also reviewed on the multiple challenges that children, their teachers, schools, families, and communities face, all of which exacerbate opportunity gaps. Indeed, the evidence points to disparities in opportunities that exacerbate existing inequities and place major stress on low-income students and their teachers, in particular. Due to the digital divide and many other factors, these children are most likely to lose more substantial learning time. Their families are also most likely to experience compounded stresses, such as job loss, the loss of health care, the lack of paid sick leave, the lack of child care, and the need to work on site in “essential” jobs that put them at health risks. All these factors make it much harder for these families to attend to children who are suddenly home schooling and struggling with ad-hoc efforts at remote learning.

The lessons learned point to the need to enact an agenda that lifts up children and reduces educational inequities after the interruption to schooling due to the coronavirus is over. The agenda must also rebuild the system so that lifting up children and reducing inequities in education become the new norm. To accomplish this, we outline a three-stage response. The first stage is immediate relief for students and educators so they can function better, as remote learning continues in some form for many children. The second stage is significant short-term investments during the recovery that will enable students whose education was interrupted by the coronavirus crisis to catch up and continue their development. The third stage is longer-term reforms to rebuild the education system so that the challenges documented here are corrected and the system finally delivers an excellent, equitable education to all children.

In the rebuilding phase, it is essential to establish an education system that embraces a whole-child approach, addresses the impacts of poverty and inequality on students’ capacity to learn and on teachers’ abilities to do their jobs, offers a flexible set of wraparound supports to mitigate the impacts of the inequities that are built into the system, values education and educators, and creates viable contingency plans for future crises.

In closing, the ultimate consequences of the pandemic for K–12 education in the United States will indeed be a function of the quality, intensity, and comprehensiveness of our response to counter the pandemic’s negative lasting effects. Indeed, our call for relief, recovery, and reform has a historical precedent. As Darren Walker, president of the Ford Foundation, recently noted:

This societal reimagination certainly encompasses a reimagination of our education system. With the right vision, we can actually ensure that public education plays a critical role in restoring the human and social capital in our country and in readying us for the next challenges, big or small, that we may confront in the future. Our children and our future depend on it.