# Table of Contents
<a id="table-of-contents"></a>
- [1 Introduction](#1)
- [2 Preparations](#2)
- [3 Dataset Overview](#3)
    - [3.1 Engagement](#3.1)
    - [3.2 Districts](#3.2)
    - [3.3 Products](#3.3)
    - [3.4 Combine](#3.4)
- [4 Products](#4)
    - [4.1 Accessed Products](#4.1)
    - [4.2 Page Load](#4.2)
    - [4.3 Number of Products](#4.3)
    - [4.4 First Accessed](#4.4)
    - [4.5 Top Products & Providers](#4.5)
    - [4.6 Category & Sectors](#4.6)
- [5 Demographic](#5)
    - [5.1 Geographic](#5.1)
        - [5.1.1 Geographic and page load](#5.1.1)
        - [5.1.2 Geographic and page load correlation](#5.1.2)
    - [5.2 Black / Hispanic](#5.2)
    - [5.3 County Connection Ratio](#5.3)
    - [5.4 Per-pupil Total Expenditure](#5.4)
    - [5.5 Free or Reduced Price](#5.5)
- [References](#ref)

[back to top](#table-of-contents)
<a id="1"></a>
# 1 Introduction

Nelson Mandela believed education was the most powerful weapon to change the world. But not every student has equal opportunities to learn. Effective policies and plans need to be enacted in order to make education more equitable—and perhaps your innovative data analysis will help reveal the solution.

Current research shows educational outcomes are far from equitable. The imbalance was exacerbated by the COVID-19 pandemic. There's an urgent need to better understand and measure the scope and impact of the pandemic on these inequities.

Education technology company LearnPlatform was founded in 2014 with a mission to expand equitable access to education technology for all students and teachers. LearnPlatform’s comprehensive edtech effectiveness system is used by districts and states to continuously improve the safety, equity, and effectiveness of their educational technology. LearnPlatform does so by generating an evidence basis for what’s working and enacting it to benefit students, teachers, and budgets.

This analytics competition expects to uncover trends in digital learning. Accomplish this with data analysis about how engagement with digital learning relates to factors like district demographics, broadband access, and state/national level policies and events.

The submissions will inform policies and practices that close the digital divide. With a better understanding of digital learning trends, you may help reverse the long-term learning loss among America’s most vulnerable, making education more equitable.

**Problem Statement**

The COVID-19 Pandemic has disrupted learning for more than 56 million students in the United States. In the Spring of 2020, most states and local governments across the U.S. closed educational institutions to stop the spread of the virus. In response, schools and teachers have attempted to reach students remotely through distance learning tools and digital platforms. To this day, concerns about the exacerbating digital divide and long-term learning loss among America’s most vulnerable learners continue to grow.

[back to top](#table-of-contents)
<a id="2"></a>
# 2 Preparations
This section loads the packages and source data used in the analysis. The main Python packages are for data manipulation (`numpy` and `pandas`) and data visualization (`matplotlib` and `seaborn`). Engagement data are stored as one file per district, so we merge all the individual files under the `engagement_data` folder into one dataframe.

*(to see the details, please expand)*

In [None]:
import os
import glob
import numpy as np
import pandas as pd
import warnings

import squarify
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib import ticker
import seaborn as sns

# setting up options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('float_format', '{:f}'.format)
warnings.filterwarnings('ignore')

#loading dataset
districts = pd.read_csv('/kaggle/input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')
products = pd.read_csv('/kaggle/input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')

# collect every per-district engagement file (one CSV per district)
engagement_dir = '/kaggle/input/learnplatform-covid19-impact-on-digital-learning/engagement_data'
engagement_files = sorted(glob.glob(os.path.join(engagement_dir, '*.csv')))

# read each file and tag its rows with the district id taken from the file name
frames = []
for file in engagement_files:
    engagement_file = pd.read_csv(file)
    engagement_file['id'] = os.path.splitext(os.path.basename(file))[0]
    frames.append(engagement_file)
engagement = pd.concat(frames, ignore_index=True)

#mapping for districts dataset
mapping_1 = {
    '[0, 0.2[': '0%-20%',
    '[0.2, 0.4[': '20%-40%',
    '[0.4, 0.6[': '40%-60%',
    '[0.6, 0.8[': '60%-80%',
    '[0.8, 1[': '80%-100%'}

mapping_2 = {
    '[4000, 6000[': '4000-6000',
    '[6000, 8000[': '6000-8000',
    '[8000, 10000[': '8000-10000',
    '[10000, 12000[': '10000-12000',
    '[12000, 14000[': '12000-14000',
    '[14000, 16000[': '14000-16000',
    '[16000, 18000[': '16000-18000',
    '[18000, 20000[': '18000-20000',
    '[20000, 22000[': '20000-22000',
    '[22000, 24000[': '22000-24000',
    '[32000, 34000[': '32000-34000'}

mapping_3 = {
    '[0.18, 1[': '18%-100%',
    '[1, 2[': '100%-200%'
}

districts['pct_black/hispanic'] = districts['pct_black/hispanic'].map(mapping_1)
districts['pct_free/reduced'] = districts['pct_free/reduced'].map(mapping_1)
districts['county_connections_ratio'] = districts['county_connections_ratio'].map(mapping_3)
districts['pp_total_raw'] = districts['pp_total_raw'].map(mapping_2)

#separating category
products[['Category', 'Subcategory']] = products['Primary Essential Function'].str.split('-', n=1, expand=True,)
products = products.drop('Primary Essential Function', axis=1)
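
The mappings above turn interval strings such as `[0.2, 0.4[` into readable labels. If numeric bounds are needed instead (for sorting or plotting), a minimal sketch could look like the following. The helper name `parse_interval` is hypothetical and not part of the original notebook; it assumes every label has the form `[lower, upper[`.

```python
import math

def parse_interval(label):
    """Split an interval label such as '[0.2, 0.4[' into (lower, upper) floats."""
    # missing values (NaN) are passed through as a pair of NaNs
    if not isinstance(label, str):
        return (math.nan, math.nan)
    # drop the surrounding brackets, then split on the comma
    lower, upper = label.strip('[]').split(',')
    return (float(lower), float(upper))

print(parse_interval('[0.2, 0.4['))      # (0.2, 0.4)
print(parse_interval('[4000, 6000['))    # (4000.0, 6000.0)
```

The lower bound can then serve as a sort key so that ranges such as `10000-12000` do not sort before `4000-6000` alphabetically.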

[back to top](#table-of-contents)
<a id="3"></a>
# 3 Dataset Overview
This overview gives a feel for the data structure. It also includes a quick look at missing values, basic statistics and data manipulation. There are 3 datasets: `engagement`, `districts` and `products`.

<a id="3.1"></a>
## 3.1 Engagement

The engagement data are aggregated at school district level, and each file represents data from one school district. The 4-digit file name represents the `district_id`, which links to district information in `districts_info`. The `lp_id` links to product information in `products_info`.

This dataset consists of below information:
- **time:** date in "YYYY-MM-DD"
- **lp_id:** The unique identifier of the product
- **pct_access:** Percentage of students in the district who have at least one page-load event of a given product on a given day
- **engagement_index:** Total page-load events per one thousand students of a given product on a given day

**Observations:**
- There are `22,324,190` rows with `5` columns: the four listed above plus the district `id` added while loading the files.
- This dataset contains `5,392,397` missing values: `541` in `lp_id`, `13,447` in `pct_access` and `5,378,409` in `engagement_index`. The missing values in `engagement_index` are substantial, at about `24.09%` of all observations.

### 3.1.1 Quick view
Below are the first 5 rows of the `engagement` dataset:

In [None]:
engagement.head()

In [None]:
print(f'Number of rows: {engagement.shape[0]};  Number of columns: {engagement.shape[1]}; No of missing values: {sum(engagement.isna().sum())}')

In [None]:
print('Number of missing Values in every column:')
print(engagement.isna().sum())
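
To put the missing counts in proportion, the share of rows affected per column can be computed alongside the counts. The sketch below uses a small made-up frame in place of `engagement`; the helper name `missing_summary` is hypothetical.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and percentage of missing values per column."""
    counts = df.isna().sum()
    share = (counts / len(df) * 100).round(2)
    return pd.DataFrame({'missing': counts, 'pct_of_rows': share})

# toy frame standing in for `engagement` (values are made up)
toy = pd.DataFrame({
    'lp_id': [1.0, 2.0, np.nan, 4.0],
    'pct_access': [0.1, 0.0, 0.2, 0.3],
    'engagement_index': [10.0, np.nan, np.nan, 5.0],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(engagement)` on the real data would give the per-column percentages directly.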

### 3.1.2 Basic statistics
Below are the basic statistics for each variable: `count`, `mean`, `standard deviation`, `minimum`, `1st quartile`, `median`, `3rd quartile` and `maximum`.

In [None]:
engagement.describe()

[back to top](#table-of-contents)
<a id="3.2"></a>
## 3.2 Districts

The district file includes information about the characteristics of school districts, including data from NCES (2018-19), FCC (Dec 2018), and Edunomics Lab. In this data set, LearnPlatform removed the identifiable information about the school districts. LearnPlatform also used an open source tool ARX (Prasser et al. 2020) to transform several data fields and reduce the risks of re-identification. To generalize the data, some data points are released as a range that contains the actual value. Additionally, many values are missing (marked as 'NaN'), indicating that the data were suppressed to maximize anonymization of the dataset.

This dataset consists of below information:
- **district_id:** The unique identifier of the school district
- **state:** The state where the district is located
- **locale:** NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural.
- **pct_black/hispanic:** Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data.
- **pct_free/reduced:** Percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data
- **county_connections_ratio:** Ratio (residential fixed high-speed connections over 200 kbps in at least one direction/households) based on the county level data from FCC Form 477 (December 2018 version).
- **pp_total_raw:** Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource - Database on Schools (NERD$) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district.

**Observations:**
- There are `223` rows with `7` columns as mentioned above. 
- This dataset contains `442` missing values, mainly from `pp_total_raw` (`115`), `pct_free/reduced` (`85`) and `county_connections_ratio` (`71`).

### 3.2.1 Quick view
Below are the first 5 rows of the `districts` dataset:

In [None]:
districts.head()

In [None]:
print(f'Number of rows: {districts.shape[0]};  Number of columns: {districts.shape[1]}; No of missing values: {sum(districts.isna().sum())}')

In [None]:
print('Number of missing Values in every column:')
print(districts.isna().sum())

[back to top](#table-of-contents)
<a id="3.3"></a>
## 3.3 Products

The product file includes information about the characteristics of the top 372 products with most users in 2020. The categories listed in this file are part of LearnPlatform's product taxonomy. Data were labeled by LearnPlatform team. Some products may not have labels due to being duplicate, lack of accurate url or other reasons.

This dataset consists of below information:
- **LP ID:** The unique identifier of the product
- **URL:** Web Link to the specific product
- **Product Name:** Name of the specific product
- **Provider/Company Name:** Name of the product provider
- **Sector(s):** Sector of education where the product is used
- **Category:** The basic function of the product. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. 
- **Subcategory:** Each of these categories has multiple sub-categories with which the products were labeled

**Observations:**
- There are `372` rows with `7` columns as mentioned above. 
- This dataset contains `61` missing values: `20` each in `Sector(s)`, `Category` and `Subcategory`, and `1` in `Provider/Company Name`.

### 3.3.1 Quick view
Below are the first 5 rows of the `products` dataset:

In [None]:
products.head()

In [None]:
print(f'Number of rows: {products.shape[0]};  Number of columns: {products.shape[1]}; No of missing values: {sum(products.isna().sum())}')

In [None]:
print('Number of missing Values in every column:')
print(products.isna().sum())

[back to top](#table-of-contents)
<a id="3.4"></a>
## 3.4 Combine
We merge the `engagement`, `districts` and `products` datasets into one big dataset called `combine` that holds the information from all three, then delete the original datasets to free up memory.

*(to see the details, please expand)*

In [None]:
combine = engagement.copy()
combine['id'] = combine['id'].astype('int64') 
combine = combine.merge(districts, left_on='id', right_on='district_id', how='left')
combine = combine.merge(products, left_on='lp_id', right_on='LP ID', how='left')
combine = combine.drop('district_id', axis=1)
combine = combine.drop('LP ID', axis=1)
combine['time'] = pd.to_datetime(combine['time'])
del engagement
del districts
del products
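
Because both merges are left joins, engagement rows whose product has no entry in the products file are kept with empty product columns. One way to check how many rows and products are affected is to let `merge` validate and annotate the join. This is a sketch on made-up frames (`toy_engagement` and `toy_products` are stand-ins, not the real data).

```python
import pandas as pd

# toy stand-ins for `engagement` and `products` (values are made up)
toy_engagement = pd.DataFrame({'lp_id': [10, 10, 20, 30],
                               'engagement_index': [1.0, 2.0, 3.0, 4.0]})
toy_products = pd.DataFrame({'LP ID': [10, 20], 'Product Name': ['A', 'B']})

merged = toy_engagement.merge(
    toy_products,
    left_on='lp_id', right_on='LP ID',
    how='left',
    validate='many_to_one',   # raises if 'LP ID' is duplicated in toy_products
    indicator=True,           # adds a '_merge' column: 'both' or 'left_only'
)

# rows and distinct products that found no match in the products table
unmapped = merged['_merge'] == 'left_only'
unmapped_rows = int(unmapped.sum())
unmapped_products = merged.loc[unmapped, 'lp_id'].nunique()
print(unmapped_rows, unmapped_products)
```

The `validate` argument guards against accidental row duplication, which matters when the left table has millions of rows.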

### 3.4.1 Quick view
Below are the first 5 rows of the `combine` dataset:

In [None]:
combine.head()

In [None]:
print(f'Number of rows: {combine.shape[0]};  Number of columns: {combine.shape[1]}; No of missing values: {sum(combine.isna().sum())}')

In [None]:
print('Number of missing Values in every column:')
print(combine.isna().sum())

In [None]:
combine['Provider/Company Name'].value_counts(dropna=False).shape

[back to top](#table-of-contents)
<a id="4"></a>
# 4 Products
The engagement dataset records which products were accessed by students in each school district on each day of 2020, for a total of about `22 million` product-access records. There are `8,646` products, but only `368` of them could be mapped using the `products_info` dataset; unmapped products are categorized as `Unknown`.

To make this a little clearer:
- The dataset is presented on a daily basis.
- A product appears at most once per school district per day, and only if it was accessed in that district on that day.
- The same product can have 2 or more records on the same day if it was accessed in two or more different school districts.

This part also covers some trend analysis:
- We will look into the mean `accessed products`.
- We will go deeper by looking into the `page-load` provided in the dataset.
- We will count how many products were used on a daily basis.

<a id="4.1"></a>
## 4.1 Accessed Products
The analysis focuses on the mean number of `accessed products`. What is it, and how do we calculate it? Every observation in the engagement dataset represents an `accessed product` in a school district on a given day. We count how many products were used in each school district per day and then average those counts across districts, which gives the average number of products used per school district.
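
The two-step calculation can be illustrated on a tiny made-up frame (`toy` is a stand-in for `combine`, with one row per day, district and product):

```python
import pandas as pd

# toy stand-in for `combine`: one row per (day, district, product)
toy = pd.DataFrame({
    'time': pd.to_datetime(['2020-01-06'] * 4 + ['2020-01-07'] * 2),
    'id': [1000, 1000, 1000, 2000, 1000, 2000],
    'lp_id': [1, 2, 3, 1, 1, 1],
})

# step 1: number of accessed products per district per day
per_district = toy.groupby(['time', 'id']).size()

# step 2: average that count across districts for each day
mean_accessed = per_district.groupby(level='time').mean()
print(mean_accessed)
```

On 2020-01-06 the two districts accessed 3 and 1 products, so the mean is 2.0; on 2020-01-07 both accessed 1 product, so the mean is 1.0.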

**Observations**:
- The mean daily number of `accessed products` is around 300 - 500 per school district in 2020. The daily mean increases after the `summer holiday`. Are students in every school district using more diverse products, or are there new products to support digital learning?
- Every month shows volatility in the trend that follows a `5-2` pattern: `5 days of school` and `2 days of weekend`.
- The number of `accessed products` decreases on weekends but does not drop to zero; it is about `25%-50%` of the weekday level. Does this mean many students are still studying on the weekend?
- In mid-February there were temporary school closures, followed by the WHO characterizing COVID-19 as a pandemic on March 11th 2020. `Accessed products` increase from mid-February, though only marginally.
- The `summer holiday` in the United States differs between school districts, usually starting in `late May / early June` and ending in `late August / early September`. This is consistent with lower `accessed products` on those dates, although the level is still high compared with regular dates. Once again, does this mean many students still study during the holiday? It seems odd.
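
The weekend-to-weekday comparison can be quantified rather than read off the chart. A minimal sketch, using a made-up week of values in place of the real daily means:

```python
import pandas as pd

# toy daily series standing in for the mean daily accessed products
days = pd.date_range('2020-01-06', periods=7, freq='D')   # Monday to Sunday
daily = pd.Series([400, 410, 390, 405, 395, 150, 130], index=days)

# Monday=0, ..., Saturday=5, Sunday=6
is_weekend = daily.index.dayofweek >= 5
weekday_mean = daily[~is_weekend].mean()
weekend_mean = daily[is_weekend].mean()
ratio = weekend_mean / weekday_mean
print(f'weekend / weekday = {ratio:.2f}')
```

Applied to the real series, the same ratio would show whether the `25%-50%` estimate holds across the whole year or only in some months.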

In [None]:
temp = pd.DataFrame(combine.groupby(['time', 'id'])['time'].count())
temp = temp.rename(columns={"time":"amount"})
temp = temp.reset_index(drop=False)
temp = temp.groupby('time')['amount'].mean()

background_color = "#f6f5f5"
sns.set_palette(['dimgray']*400)

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(9, 1), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 1)
gs.update(wspace=0, hspace=0)
ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor(background_color)

#graph
ax0 = sns.barplot(ax=ax0, x=temp.index, y=temp, zorder=2, linewidth=0.8, saturation=1)
summer = np.arange(np.datetime64("1970-06-01"), np.datetime64("1970-08-24"))
ax0.fill_between(summer, np.max(temp), color='#ffd514', alpha=0.5, zorder=2, linewidth=0)
plt.axvline(np.datetime64("1970-02-12"), color='#ffd514', alpha=0.5)
plt.axvline(np.datetime64("1970-03-11"), color='#ffd514', alpha=0.5)
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)
ax0.grid(which='major', axis='x', zorder=0, color='#EEEEEE', lw=0.3)
ax0.grid(which='major', axis='y', zorder=0, color='#EEEEEE', lw=0.3)

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim() 
ax0.text(x0, y1*1.11, 'Mean Daily Accessed Products', color='black', fontsize=7, ha='left', va='bottom', weight='bold')
ax0.text(x0, y1*1.1, 'After the summer holiday, there is an increase in accessed products', 
        color='#292929', fontsize=5, ha='left', va='top')
ax0.annotate("temporary\nschool closures", 
             xy=(np.datetime64("1970-02-12"), 430), 
             xytext=(np.datetime64("1970-01-09"), 380), 
             arrowprops=dict(arrowstyle="->"), fontsize=5)
ax0.annotate("COVID-19\nPandemic", 
             xy=(np.datetime64("1970-03-11"), 430), 
             xytext=(np.datetime64("1970-03-18"), 350), 
             arrowprops=dict(arrowstyle="->"), fontsize=5)
ax0.annotate("Summer Holiday", 
             xy=(np.datetime64("1970-06-27"), 350), 
             xytext=(np.datetime64("1970-06-27"), 350), 
             fontsize=5)

#format axis
ax0.set_xlabel("date",fontsize=5, weight='bold')
ax0.set_ylabel("products",fontsize=5, weight='bold')

#format the ticks
ax0.tick_params('both', length=2, which='major', labelsize=5)
months = mdates.MonthLocator()
ax0.xaxis.set_major_locator(months)
ax0.set_xticklabels(['Jan 2020', 'Feb 2020', 'Mar 2020', 'Apr 2020', 'May 2020', 'Jun 2020', 'Jul 2020', 
                     'Aug 2020', 'Sep 2020', 'Oct 2020', 'Nov 2020', 'Dec 2020'])
y_format = ticker.FuncFormatter(lambda x, p: format(int(x), ','))
ax0.yaxis.set_major_formatter(y_format)

plt.show()
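
The plotting cell above passes 1970 dates to `fill_between`, `axvline` and `annotate` even though the data are from 2020. A plausible reading (an assumption, the notebook does not state it) is that the bar plot places one bar per day at positions 0, 1, 2, ... and that date values are converted to days since 1970-01-01, so a 1970 date lands near the bar with the same day-of-year. The hypothetical helpers below make that offset explicit, assuming every day of 2020 is present in the data.

```python
import numpy as np
import pandas as pd

def bar_position(date, first_day='2020-01-01'):
    """Zero-based position of `date` when every day from `first_day` gets one bar."""
    return (pd.Timestamp(date) - pd.Timestamp(first_day)).days

def days_since_epoch(date):
    """Days between 1970-01-01 and `date`."""
    return int(np.datetime64(date, 'D').astype('int64'))

print(bar_position('2020-03-11'))       # 70
print(days_since_epoch('1970-03-11'))   # 69
```

Because 2020 is a leap year and 1970 is not, markers placed with 1970 dates after February would sit one bar earlier than the intended 2020 date under this reading.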

[back to top](#table-of-contents)
<a id="4.2"></a>
## 4.2 Page Load
The `page-load` analysis supplements our findings from the mean daily accessed products analysis. As a reminder, page-load events are calculated per one thousand students for a given product on a given day. We can see students' learning activity in one day by summing all the page-loads in that day.

**Observations:**
- After the temporary school closures, `page-load` increases, meaning digital learning is used more frequently than before. This increase in student activity is hardly visible in the mean daily accessed products.
- Though the mean of `accessed products` on weekends is about `25%-50%` of the weekday level, it doesn't mean students are studying on the weekend, as shown by the very low `page-load` on every weekend.
- As with weekend activity, `page-load` is also very low during the summer holiday, which means students are not using the products intensively.
- The high number of `accessed products` on weekends and during the summer holiday reflects the variety of products used by students rather than how intensively the products are used.
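
The distinction between variety and intensity can be made concrete by computing both per day and dividing one by the other. A sketch on made-up values (`toy` stands in for `combine`; the column name `page_load_per_product` is introduced here for illustration):

```python
import pandas as pd

# toy stand-in for `combine`: a Friday and a Saturday (values are made up)
toy = pd.DataFrame({
    'time': pd.to_datetime(['2020-01-10'] * 3 + ['2020-01-11'] * 2),
    'lp_id': [1, 2, 3, 1, 2],
    'engagement_index': [900.0, 600.0, 300.0, 30.0, 10.0],
})

daily = toy.groupby('time').agg(
    products=('lp_id', 'nunique'),          # variety: distinct products accessed
    page_load=('engagement_index', 'sum'),  # intensity: total page-load events
)
daily['page_load_per_product'] = daily['page_load'] / daily['products']
print(daily)
```

In this toy example the Saturday still has two of the three products, but page-load per product drops from 600 to 20, which mirrors the pattern described in the observations.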

In [None]:
temp = combine.groupby('time')['engagement_index'].sum()

background_color = "#f6f5f5"
sns.set_palette(['dimgray']*400)

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(9, 1), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 1)
gs.update(wspace=0, hspace=0)
ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor(background_color)

#graph
ax0 = sns.barplot(ax=ax0, x=temp.index, y=temp, zorder=2, linewidth=0.8, saturation=1)
summer = np.arange(np.datetime64("1970-06-01"), np.datetime64("1970-08-24"))
ax0.fill_between(summer, np.max(temp), color='#ffd514', alpha=0.5, zorder=2, linewidth=0)
plt.axvline(np.datetime64("1970-02-12"), color='#ffd514', alpha=0.5)
plt.axvline(np.datetime64("1970-03-11"), color='#ffd514', alpha=0.5)
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)
ax0.grid(which='major', axis='x', zorder=0, color='#EEEEEE', lw=0.3)
ax0.grid(which='major', axis='y', zorder=0, color='#EEEEEE', lw=0.3)

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim() 
ax0.text(x0, y1*1.11, 'Daily Page-Load', color='black', fontsize=7, ha='left', va='bottom', weight='bold')
ax0.text(x0, y1*1.1, 'Following the temporary school closures, there is an increase in page-load', 
        color='#292929', fontsize=5, ha='left', va='top')
ax0.annotate("temporary\nschool\nclosures", 
             xy=(np.datetime64("1970-02-12"), 20000000), 
             xytext=(np.datetime64("1970-01-15"), 15000000), 
             arrowprops=dict(arrowstyle="->"), fontsize=5)
ax0.annotate("COVID-19\nPandemic", 
             xy=(np.datetime64("1970-03-11"), 20000000), 
             xytext=(np.datetime64("1970-03-18"), 16000000), 
             arrowprops=dict(arrowstyle="->"), fontsize=5)
ax0.annotate("Summer Holiday", 
             xy=(np.datetime64("1970-06-27"), 20000000), 
             xytext=(np.datetime64("1970-06-27"), 15000000), 
             fontsize=5)

#format axis
ax0.set_xlabel("date",fontsize=5, weight='bold')
ax0.set_ylabel("page-load",fontsize=5, weight='bold')

#format the ticks
ax0.tick_params('both', length=2, which='major', labelsize=5)
months = mdates.MonthLocator()
ax0.xaxis.set_major_locator(months)
ax0.set_xticklabels(['Jan 2020', 'Feb 2020', 'Mar 2020', 'Apr 2020', 'May 2020', 'Jun 2020', 'Jul 2020', 
                     'Aug 2020', 'Sep 2020', 'Oct 2020', 'Nov 2020', 'Dec 2020'])
y_format = ticker.FuncFormatter(lambda x, p: format(int(x), ','))
ax0.yaxis.set_major_formatter(y_format)

plt.show()

[back to top](#table-of-contents)
<a id="4.3"></a>
## 4.3 Number of Products
There are `8,647` products available in 2020. Let's see how many products were used on a daily basis. Are there any changes in the number of products used?

**Observations:**
- The number of products used per day in 2020 ranges from around `2,000` to `4,000` out of the total of `8,647`.
- The number of products used is below `3,000` before the temporary school closures and rises above `3,000` thereafter. This rise indicates an increase in digital learning, as supported by the daily page-load above.
- The number of products used on weekends and during the summer holiday doesn't decrease as sharply as the mean daily `accessed products`. Product use stays consistent even on weekends and during the summer holiday, but with lower activity.

In [None]:
temp = combine.groupby('time')['lp_id'].nunique()

background_color = "#f6f5f5"
sns.set_palette(['dimgray']*400)

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(9, 1), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 1)
gs.update(wspace=0, hspace=0)
ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor(background_color)

#graph
ax0 = sns.barplot(ax=ax0, x=temp.index, y=temp, zorder=2, linewidth=0.8, saturation=1)
summer = np.arange(np.datetime64("1970-06-01"), np.datetime64("1970-08-24"))
ax0.fill_between(summer, np.max(temp), color='#ffd514', alpha=0.5, zorder=2, linewidth=0)
plt.axvline(np.datetime64("1970-02-12"), color='#ffd514', alpha=0.5)
plt.axvline(np.datetime64("1970-03-11"), color='#ffd514', alpha=0.5)
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)
ax0.grid(which='major', axis='x', zorder=0, color='#EEEEEE', lw=0.3)
ax0.grid(which='major', axis='y', zorder=0, color='#EEEEEE', lw=0.3)

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim() 
ax0.text(x0, y1*1.11, 'No of Products Used', color='black', fontsize=7, ha='left', va='bottom', weight='bold')
ax0.text(x0, y1*1.1, 'There is an increase in the number of products used after the temporary school closures', 
        color='#292929', fontsize=5, ha='left', va='top')
ax0.annotate("temporary\nschool\nclosures", 
             xy=(np.datetime64("1970-02-12"), 4000), 
             xytext=(np.datetime64("1970-01-15"), 3200), 
             arrowprops=dict(arrowstyle="->"), fontsize=5)
ax0.annotate("COVID-19\nPandemic", 
             xy=(np.datetime64("1970-03-11"), 4000), 
             xytext=(np.datetime64("1970-03-18"), 3800), 
             arrowprops=dict(arrowstyle="->"), fontsize=5)
ax0.annotate("Summer Holiday", 
             xy=(np.datetime64("1970-06-27"), 3500), 
             xytext=(np.datetime64("1970-06-27"), 3500), 
             fontsize=5)

#format axis
ax0.set_xlabel("date",fontsize=5, weight='bold')
ax0.set_ylabel("products",fontsize=5, weight='bold')

#format the ticks
ax0.tick_params('both', length=2, which='major', labelsize=5)
months = mdates.MonthLocator()
ax0.xaxis.set_major_locator(months)
ax0.set_xticklabels(['Jan 2020', 'Feb 2020', 'Mar 2020', 'Apr 2020', 'May 2020', 'Jun 2020', 'Jul 2020', 
                     'Aug 2020', 'Sep 2020', 'Oct 2020', 'Nov 2020', 'Dec 2020'])
y_format = ticker.FuncFormatter(lambda x, p: format(int(x), ','))
ax0.yaxis.set_major_formatter(y_format)

plt.show()

[back to top](#table-of-contents)
<a id="4.4"></a>
## 4.4 First Accessed
It would be interesting to know when each product was first accessed in relation to digital learning. We would expect an increase in newly accessed products during the COVID-19 situation. This increase may come from new products (an opportunity for education businesses) or from older products that were only recently accessed. The chart shows the total number of products first accessed on each date.

**Observations:**
* At the start of 2020, there is a high number of newly accessed products. These are likely older products that were already available and in use before the pandemic.
* The interesting part is the jump in newly accessed products after the announcement of the temporary school closures.
* Less visibly, many new products are also accessed during the summer holiday, and there is a small jump at the beginning of November 2020.
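
The "first accessed" count boils down to two steps: find the earliest date per product, then count how many products share each earliest date. A sketch on made-up values (`toy` stands in for `combine`):

```python
import pandas as pd

# toy stand-in for `combine` (values are made up)
toy = pd.DataFrame({
    'time': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-01',
                            '2020-02-13', '2020-02-14']),
    'lp_id': [1, 1, 2, 3, 3],
})

# first day each product shows up in the data
first_seen = toy.groupby('lp_id')['time'].min()

# number of products first seen on each day
new_per_day = first_seen.value_counts().sort_index()
print(new_per_day)
```

Since the data begin on the first day of 2020, any product already in use before then is counted as "new" in early January, which is why the first days dominate the chart.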

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(9, 1), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 1)
gs.update(wspace=0, hspace=0)

##########FIRST ACCESS##########
temp = pd.DataFrame(combine.groupby('lp_id', dropna=False)['time'].min())
temp = temp.fillna('Unknown')
temp = temp.reset_index(drop=False)
temp = pd.DataFrame(temp.groupby(['time'], dropna=False)['lp_id'].count())
temp = temp.reset_index(drop=False)
temp['time'] = pd.to_datetime(temp['time'])

background_color = "#f6f5f5"
sns.set_palette(['dimgray']*400)

ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor(background_color)
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)

#graph
ax0 = sns.barplot(ax=ax0, x=temp['time'], y=temp['lp_id'], zorder=2, linewidth=0.5)
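summer = np.arange(np.datetime64("1970-06-01"), np.datetime64("1970-08-24"))  # same range as the earlier charts
ax0.fill_between(summer, np.max(temp['lp_id']), color='#ffd514', alpha=0.5, zorder=2, linewidth=0)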
plt.axvline(np.datetime64("1970-02-12"), color='#ffd514', alpha=0.5)
plt.axvline(np.datetime64("1970-03-11"), color='#ffd514', alpha=0.5)
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)
ax0.grid(which='major', axis='x', zorder=0, color='#EEEEEE', lw=0.3)
ax0.grid(which='major', axis='y', zorder=0, color='#EEEEEE', lw=0.3)

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim() 
ax0.text(x0, y1*1.11, 'New Products Accessed', color='black', fontsize=7, ha='left', va='bottom', weight='bold')
ax0.text(x0, y1*1.1, 'There is an increase in new products accessed after the temporary school closures', 
        color='#292929', fontsize=5, ha='left', va='top')
ax0.annotate("temporary\nschool\nclosures", 
             xy=(np.datetime64("1970-02-13"), 900), 
             xytext=(np.datetime64("1970-01-19"), 800), 
             arrowprops=dict(arrowstyle="->"), fontsize=5)
ax0.annotate("COVID-19\nPandemic", 
             xy=(np.datetime64("1970-03-11"), 900), 
             xytext=(np.datetime64("1970-03-17"), 900), 
             arrowprops=dict(arrowstyle="->"), fontsize=5)
ax0.annotate("Summer Holiday", 
             xy=(np.datetime64("1970-06-27"), 1000), 
             xytext=(np.datetime64("1970-06-27"), 1000), 
             fontsize=5)

#format axis
ax0.set_xlabel("date",fontsize=5, weight='bold')
ax0.set_ylabel("products",fontsize=5, weight='bold')

#format the ticks
ax0.tick_params('both', length=2, which='major', labelsize=5)
months = mdates.MonthLocator()
ax0.xaxis.set_major_locator(months)
ax0.set_xticklabels(['Jan 2020', 'Feb 2020', 'Mar 2020', 'Apr 2020', 'May 2020', 'Jun 2020', 'Jul 2020', 
                     'Aug 2020', 'Sep 2020', 'Oct 2020', 'Nov 2020', 'Dec 2020'])
y_format = ticker.FuncFormatter(lambda x, p: format(int(x), ','))
ax0.yaxis.set_major_formatter(y_format)

[back to top](#table-of-contents)
<a id="4.5"></a>
## 4.5 Top Products & Providers
As stated before, there are `369` products in 2020 once the `368` mapped products and `Unknown` are counted together. These products are the basis of the analysis. In this section we look at:
1. Most used products by students in 2020 using page-load as the basis.
2. Top providers in 2020 based on the page-load.

**Observations:**
- **Products**
    - `Four` of the top 5 products (excluding `Unknown`) are managed by `Google LLC`: `Google Docs`, `Google Classroom`, `Youtube` and `Meet`. `Google Docs` is the most used product in 2020 with `769.9 million` page-load. `Google Classroom` is in 3rd place overall (2nd when `Unknown` is excluded) with `373.6 million` page-load. `Youtube` and `Meet` are in positions `four` and `five` respectively.
    - One product in the top 5 doesn't belong to `Google LLC`: `Canvas`, which is owned by `Instructure, Inc.` and is in `3rd` position when `Unknown` is excluded.
    - `Unknown` is in second place with `417.1 million` page-load. We can assume it comes from many different products.
- **Providers**
    - As expected, `Google LLC` has the highest page-load, outperforming all other providers.
    - `Instructure, Inc.`, the company that makes `Canvas`, is in third place with `138 million` page-load in 2020.
    - In 4th place is `Kahoot! AS` with `87 million` page-load.
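
Both rankings follow the same recipe: label missing names as `Unknown`, sum `engagement_index` per name, and keep the largest totals. A sketch on made-up values (the helper name `top_by_page_load` is hypothetical and `toy` stands in for `combine`):

```python
import numpy as np
import pandas as pd

def top_by_page_load(df, column, n=10):
    """Sum `engagement_index` per value of `column`; missing labels become 'Unknown'."""
    labels = df[column].fillna('Unknown')
    totals = df.groupby(labels)['engagement_index'].sum()
    return totals.sort_values(ascending=False).head(n)

# toy stand-in for `combine` (values are made up)
toy = pd.DataFrame({
    'Product Name': ['Docs', 'Docs', np.nan, 'Canvas', np.nan],
    'engagement_index': [500.0, 300.0, 200.0, 150.0, 250.0],
})
top = top_by_page_load(toy, 'Product Name', n=2)
print(top)
```

Filling the labels before grouping keeps unmapped rows in the ranking, which is how `Unknown` ends up among the top entries.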

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(4, 3), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 2)
gs.update(wspace=1.1, hspace=1.5)

##########PRODUCT##########
temp = pd.DataFrame(combine.groupby('Product Name', dropna=False)['engagement_index'].sum()/1000000).reset_index()
temp.columns = ['product', 'amount']
temp = temp.fillna('Unknown')
temp = temp.sort_values('amount', ascending=False)
temp = temp[0:10]

background_color = "#f6f5f5"
color_map = ["dimgray" for _ in range(75)]
color_map[0] = "#ffd514"
sns.set_palette(sns.color_palette(color_map))

ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor(background_color)
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)

#graph
ax0_sns = sns.barplot(ax=ax0, y=temp['product'], x=temp['amount'], 
                      zorder=2, linewidth=0, orient='h', saturation=1, alpha=1)
ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.3)
ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.3)

#format axis
ax0_sns.set_xlabel("page-load (million)",fontsize=3, weight='bold')
ax0_sns.set_ylabel("products",fontsize=3, weight='bold')
ax0_sns.tick_params(labelsize=3, width=0.5, length=1.5)

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim()
ax0.text(x0, y1-0.45, 'Top 10 Products', fontsize=4, ha='left', va='top', weight='bold')
ax0.text(x0, y1-0.2, 'Top products are dominated by Google LLC', fontsize=3, ha='left', va='top')

# data label
for p in ax0.patches:
    value = f'{p.get_width():,.0f}'
    x = p.get_x() + p.get_width() + 55
    y = p.get_y() + p.get_height() / 2 
    ax0.text(x, y, value, ha='center', va='center', fontsize=2.7, 
            bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.3))
x_format = ticker.FuncFormatter(lambda x, p: format(int(x), ','))
ax0.xaxis.set_major_formatter(x_format)

##########PROVIDER##########
temp = pd.DataFrame(combine.groupby('Provider/Company Name', dropna=False)['engagement_index'].sum()/1000000).reset_index()
temp.columns = ['product', 'amount']
temp = temp.fillna('Unknown')
temp = temp.sort_values('amount', ascending=False)
temp = temp[0:10]

background_color = "#f6f5f5"
color_map = ["dimgray" for _ in range(75)]
color_map[0] = "#ffd514"
sns.set_palette(sns.color_palette(color_map))

ax0 = fig.add_subplot(gs[0, 1])
ax0.set_facecolor(background_color)
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)

#graph
ax0_sns = sns.barplot(ax=ax0, y=temp['product'], x=temp['amount'], 
                      zorder=2, linewidth=0, orient='h', saturation=1, alpha=1)
ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.3)
ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.3)

#format axis
ax0_sns.set_xlabel("page-load (million)",fontsize=3, weight='bold')
ax0_sns.set_ylabel("providers",fontsize=3, weight='bold')
ax0_sns.tick_params(labelsize=3, width=0.5, length=1.5)

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim()
ax0.text(x0, y1-0.47, 'Top 10 Providers', fontsize=4, ha='left', va='top', weight='bold')
ax0.text(x0, y1-0.2, 'Google LLC is the top provider for digital learning', fontsize=3, ha='left', va='top')

# data label
for p in ax0.patches:
    value = f'{p.get_width():,.0f}'
    x = p.get_x() + p.get_width() + 130
    y = p.get_y() + p.get_height() / 2 
    ax0.text(x, y, value, ha='center', va='center', fontsize=2.7, 
            bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.3))
x_format = ticker.FuncFormatter(lambda x, p: format(int(x), ','))
ax0.xaxis.set_major_formatter(x_format)

[back to top](#table-of-contents)
<a id="4.6"></a>
## 4.6 Category & Sectors 
In this part we would like to see how `category` and `sectors` relate to the `page-load`. From the `sector` perspective, a product can be classified into more than one sector; there are `5 sectors` excluding `Unknown`. There are `3 categories` in the dataset, described below:
1. LC = Learning & Curriculum
2. CM = Classroom Management 
3. SDO = School & District Operations

**Observations:**
- **Category**
    - `Learning & Curriculum` has the highest page-load, reaching `1.6 billion` in 2020. This is a good sign, as these products are used by students to study.
    - `School & District Operations` is in second place with `494.9 million` page-load.
    - `Classroom Management` is in third place (excluding `Unknown`) with `187.4 million` page-load.
- **Sectors**
    - Almost `2 billion` page-load comes from products that are categorized into 3 sectors: `PreK-12; Higher Ed; Corporate`.
    - The `Unknown` sector is in second place with `424.6 million` page-load.
    - In third place is `PreK-12` with `391.9 million` page-load.
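
Because one product can belong to several sectors, the combined label (for example `PreK-12; Higher Ed; Corporate`) can also be split into its individual sectors. The sketch below is an illustration on made-up values, not the notebook's own method: it assumes the labels are separated by `"; "`, and it counts a product's page-load once for every sector it belongs to, so the per-sector totals add up to more than the overall total.

```python
import pandas as pd

# Toy stand-in for `combine` (values are made up).
toy = pd.DataFrame({
    "Sector(s)": ["PreK-12; Higher Ed; Corporate", "PreK-12", None,
                  "PreK-12; Higher Ed"],
    "engagement_index": [10.0, 4.0, 3.0, 2.0],
})

# Split the combined label into a list, then give each sector its own row.
sectors = (
    toy.assign(sector=toy["Sector(s)"].fillna("Unknown").str.split("; "))
       .explode("sector")
       .groupby("sector")["engagement_index"]
       .sum()
       .sort_values(ascending=False)
)
print(sectors)
```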

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(4, 4), facecolor='#f6f5f5')
gs = fig.add_gridspec(2, 2)
gs.update(wspace=0.2, hspace=1.5)

##########CATEGORY##########
temp = pd.DataFrame(combine.groupby(['Category'], dropna=False)['engagement_index'].sum()).reset_index()
temp.columns = ['description', 'amount']
temp = temp.fillna('Unknown')
temp = temp.sort_values('amount', ascending=False)

background_color = "#f6f5f5"
color_map = ["dimgray" for _ in range(75)]
color_map[0] = "#ffd514"
sns.set_palette(sns.color_palette(color_map))

ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor(background_color)
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)

#graph
ax0.plot(temp['description'], temp['amount']/1000000, 'o--', color="#ffd514", markersize=3, markeredgewidth=0, linewidth=0.5, zorder=4)
ax0.fill_between(temp['description'], temp['amount']/1000000, color="#d3d3d3", zorder=3, alpha=0.5, linewidth=0)
ax0.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.2)
ax0.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.2)

#format axis
ax0.set_xlabel("category",fontsize=3, weight='bold')
ax0.set_ylabel("page-load (million)",fontsize=3, weight='bold')
ax0.tick_params(labelsize=3, width=0.5, length=1.5)
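# Define the thousands-separator formatter here so this cell does not depend on an earlier one
y_format = ticker.FuncFormatter(lambda x, p: format(int(x), ','))
ax0.yaxis.set_major_formatter(y_format)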

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim()
ax0.text(x0, y1+285, 'Category & Page Load', fontsize=4, ha='left', va='top', weight='bold')
ax0.text(x0, y1+160, 'Most of the page-load come from Learning & Curriculum', fontsize=3, ha='left', va='top')

##########SECTORS##########
temp = pd.DataFrame(combine.groupby(['Sector(s)'], dropna=False)['engagement_index'].sum()).reset_index()
temp.columns = ['description', 'amount']
temp = temp.fillna('Unknown')
temp = temp.sort_values('amount', ascending=False)

background_color = "#f6f5f5"
color_map = ["dimgray" for _ in range(75)]
color_map[0] = "#ffd514"
sns.set_palette(sns.color_palette(color_map))

ax0 = fig.add_subplot(gs[0, 1])
ax0.set_facecolor(background_color)
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)

#graph
ax0.plot(temp['description'], temp['amount']/1000000, 'o--', color="#ffd514", markersize=3, markeredgewidth=0, linewidth=0.5, zorder=4)
ax0.fill_between(temp['description'], temp['amount']/1000000, color="#d3d3d3", zorder=3, alpha=0.5, linewidth=0)
ax0.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.2)
ax0.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.2)

#format axis
ax0.set_xlabel("sector",fontsize=3, weight='bold')
ax0.set_ylabel("page-load (million)",fontsize=3, weight='bold')
ax0.tick_params(labelsize=3, width=0.5, length=1.5)
ax0.yaxis.set_major_formatter(y_format)
ax0.set_xticklabels(['PreK-12;\nHigher Ed;\nCorporate', 'Unknown', 'PreK-12', 
                     'PreK-12;\nHigher Ed', 'Corporate', 'Higher Ed;\nCorporate'])

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim()
ax0.text(x0, y1+300, 'Sectors & Page Load', fontsize=4, ha='left', va='top', weight='bold')
ax0.text(x0, y1+150, 'PreK-12; Higher Ed; Corporate is dominating', fontsize=3, ha='left', va='top')

plt.show()

[back to top](#table-of-contents)
<a id="5"></a>
# 5 Demographic
We will analyze the data from the perspective of `geographic`, `black/hispanic`, `county connection ratio`, `per-pupil total expenditure` and `free or reduced price`. We would like to see how these demographics relate to the page-load.

<a id="5.1"></a>
## 5.1 Geographic
We will try to see how geography (which includes `state`, `locale` and `district`) relates to `page-load`.

<a id="5.1.1"></a>
### 5.1.1 Geographic and page load
We will look at the relationship between `geographic` and `page-load`. `Page-load` analysis can tell us more about student activity in digital learning. In this analysis we aggregate all daily page-load in 2020 for a given `State`, `Locale` and `School District`; `School District` is also grouped at the `State` level. There are `233` districts in `23` states with `4` locales, excluding `Unknown`.

**Observations:**
- **State**
    - `Page-load` without a corresponding `state` is large, reaching `594 million`. It ranks first and cannot easily be ignored. In the analysis, we call it `Unknown`.
    - There are `3 states` with more than `300 million` `page-load` in 2020: `Connecticut`, `Illinois` and `Massachusetts`. Together they contribute around `40.7%` of total `page-load` in 2020; if we include the `Unknown` state, the contribution increases to `61.7%`.
- **Locale**
    - There are `2.8 billion` page-load in 2020, most of it coming from the `Suburb` area, which contributes `48%` of the total.
    - `Unknown` is the second highest contributor with around `594 million` page-load, a share of `20.9%`.
    - `City` and `Town` are the lowest locales with `308 million` and `99 million` page-load.
- **District**
    - A higher `page-load` in a state isn't always a good thing, especially if the state is supported by many school districts. A better indicator is the mean `page-load` per school district in a state. Only one state, `Illinois`, passed `20 million` page-load in 2020 on this measure.
    - Though `Unknown` has the highest `page-load`, its mean is only around `10 million`, which indicates that there are many school districts that have not been mapped.
    - `Arizona` has the second highest `page-load` and it has only 1 school district.
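
The difference between a state's total and its mean per district can be made concrete with a two-step aggregation. This is a minimal sketch on made-up values, assuming columns named `state`, `id` (the district) and `engagement_index` as in `combine`: it first sums page-load per district, then summarizes the districts within each state and adds each state's share of the total.

```python
import pandas as pd

# Toy stand-in for `combine` (values are made up).
toy = pd.DataFrame({
    "state": ["Illinois", "Illinois", "Arizona", None, None, None],
    "id": [1, 2, 3, 4, 5, 6],
    "engagement_index": [30e6, 20e6, 18e6, 12e6, 10e6, 8e6],
})

# Step 1: total page-load per district, in millions.
per_district = (
    toy.assign(state=toy["state"].fillna("Unknown"))
       .groupby(["state", "id"])["engagement_index"]
       .sum()
       .div(1e6)
)

# Step 2: total, mean per district and district count for each state.
summary = per_district.groupby(level="state").agg(["sum", "mean", "count"])
summary["share_pct"] = 100 * summary["sum"] / summary["sum"].sum()
print(summary.sort_values("sum", ascending=False))
```

In this toy data `Unknown` has a larger total than `Arizona` but a lower mean, because its total is spread over three districts.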

In [None]:
##########STATE##########
temp = pd.DataFrame(combine.groupby('state', dropna=False)['engagement_index'].sum()/1000000).reset_index()
temp.columns = ['state', 'amount']
temp = temp.fillna('Unknown')
temp = temp.sort_values('amount', ascending=False)

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(3, 3), facecolor='#f6f5f5')
gs = fig.add_gridspec(2, 2)
gs.update(wspace=0.7, hspace=0.1)

background_color = "#f6f5f5"
color_map = ["dimgray" for _ in range(75)]
color_map[0] = "#ffd514"
sns.set_palette(sns.color_palette(color_map))

ax0 = fig.add_subplot(gs[0:2, 0])
ax0.set_facecolor(background_color)
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)

#graph
ax0_sns = sns.barplot(ax=ax0, y=temp['state'], x=temp['amount'], 
                      zorder=2, linewidth=0, orient='h', saturation=1, alpha=1)
ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.2)
ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.2)

#format axis
ax0_sns.set_xlabel("page-load (million)",fontsize=3, weight='bold')
ax0_sns.set_ylabel("state",fontsize=3, weight='bold')
ax0_sns.tick_params(labelsize=3, width=0.5, length=1.5)

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim()
ax0.text(x0, y1*3, 'Page-Load by State', fontsize=4, ha='left', va='top', weight='bold')
ax0.text(x0, y1*1.9, 'Unknown state is dominating the page-load', fontsize=3, ha='left', va='top')

# data label
for p in ax0.patches:
    value = f'{p.get_width():,.0f}'
    x = p.get_x() + p.get_width() + 50
    y = p.get_y() + p.get_height() / 2 
    ax0.text(x, y, value, ha='center', va='center', fontsize=2.7, 
            bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.3))
x_format = ticker.FuncFormatter(lambda x, p: format(int(x), ','))
ax0.xaxis.set_major_formatter(x_format)

##########LOCALE##########
temp = pd.DataFrame(combine.groupby('locale', dropna=False)['engagement_index'].sum()/1000000).reset_index()
temp.columns = ['locale', 'amount']
temp = temp.fillna('Unknown')
temp = temp.sort_values('amount', ascending=False)
color_map = ['#ffd514', 'dimgray', 'gray', 'darkgray', 'lightgray']
ax0 = fig.add_subplot(gs[0, 1])
ax0.set_facecolor(background_color)

#graph
ax0.pie(x=temp['amount'], wedgeprops=dict(width=0.2), colors=color_map, 
        textprops={'fontsize': 3}, autopct='%1.1f%%')

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim()
ax0.text(x0, y1*1.49, 'Page-Load Proportion by Locale', fontsize=4, ha='left', va='top', weight='bold')
ax0.text(x0, y1*1.36, 'Suburb is dominating the page-load', fontsize=3, ha='left', va='top')
ax0.legend(temp['locale'], loc="upper left", bbox_to_anchor=(x0*0.03, y1*0.92), prop={'size': 2.5}, frameon=False, ncol=3)

#format tick
ax0.tick_params(labelsize=3, width=0.5, length=1.5)

##########DISTRICT##########
temp = pd.DataFrame((combine.groupby(['state', 'id'], dropna=False)['engagement_index'].sum()/1000000).reset_index())
temp.columns = ['state', 'district', 'amount']
temp = temp.fillna('Unknown')

color_map = ["dimgray" for _ in range(233)]
sns.set_palette(sns.color_palette(color_map))

ax0 = fig.add_subplot(gs[1, 1])
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)
ax0.set_facecolor(background_color)

#graph
ax0_sns = sns.pointplot(x=temp['amount'], y=temp['state'], join=False, 
              ci=None, scale=0.2, color='#ffd514', markers="d", zorder=4, ax=ax0)
plt.setp(ax0_sns.collections, zorder=4, label="")
ax0_sns = sns.stripplot(x=temp['amount'], y=temp['state'], dodge=True, zorder=3, size=1, ax=ax0)

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim()
ax0.text(x0, y1*6, "District's Page-Load by State", fontsize=4, ha='left', va='top', weight='bold')
ax0.text(x0, y1*3.6, 'Yellow diamond is showing the mean on each state', fontsize=3, ha='left', va='top')

#format axis
ax0_sns.set_xlabel("page-load (million)",fontsize=3, weight='bold')
ax0_sns.set_ylabel("state",fontsize=3, weight='bold')
ax0_sns.tick_params(labelsize=3, width=0.5, length=1.5)
ax0_sns.grid(which='major', axis='x', color='#EEEEEE', linewidth=0.2, zorder=0)
ax0_sns.grid(which='major', axis='y', color='#EEEEEE', linewidth=0.2, zorder=0)

# data label
x_format = ticker.FuncFormatter(lambda x, p: format(int(x), ','))
ax0.xaxis.set_major_formatter(x_format)

<a id="5.1.2"></a>
### 5.1.2 Geographic and page load correlation
We will look at the following correlations:
 - Correlation among `state` values and among `locale` values in terms of `page-load`. First we look at the correlation between `state`, followed by the correlation between `locale`.
 - Correlation between the `locale`. There are 5 locales including `Unknown`; the others are `City`, `Rural`, `Suburb` and `Town`.

**Observations:**
- **Correlation between state**
    - The lowest correlation between states is between `Minnesota` and `North Dakota`, at `-0.121`. This means they have an inverse relationship, though it's near zero.
    - `New York` and `North Dakota` have almost no relationship, with a correlation of `-0.033`.
    - The highest positive correlation is between the `Unknown` state and `Missouri`, at `0.932`.
- **Correlation between locale**
    - Most of the locales have a high positive correlation, above `0.8`.
    - `Suburb` and `Rural` have the highest correlation, `0.98`, which is almost 1.
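
The correlations here are computed between daily page-load series, one column per state or locale. As a minimal sketch on made-up values (the states `A` and `B` are placeholders), the code below pivots to one row per day, computes the Pearson correlation matrix, and then picks out the lowest and highest pair, which is how statements such as "the lowest correlation is between X and Y" can be checked.

```python
from itertools import combinations

import pandas as pd

days = ["2020-01-01", "2020-01-02", "2020-01-03"]

# Toy stand-in for `combine` (values are made up).
toy = pd.DataFrame({
    "time": days * 3,
    "state": ["A"] * 3 + ["B"] * 3 + [None] * 3,
    "engagement_index": [1.0, 2.0, 3.0, 3.0, 1.0, 2.0, 2.0, 4.0, 7.0],
})

# One row per day, one column per state; `time` stays in the index so it
# does not enter the correlation.
daily = (
    toy.assign(state=toy["state"].fillna("Unknown"))
       .pivot_table(index="time", columns="state",
                    values="engagement_index", aggfunc="sum")
)
corr = daily.corr()  # Pearson correlation by default

# Compare each unordered pair of states once.
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
lowest = min(pairs, key=pairs.get)
highest = max(pairs, key=pairs.get)
print(lowest, round(pairs[lowest], 3))
print(highest, round(pairs[highest], 3))
```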

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(4, 4), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 1)
gs.update(wspace=0.7, hspace=0.1)

##########CORRELATION STATE##########
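# Copy the slice so that filling missing states does not act on a view of `combine`
temp = combine[['time', 'state', 'engagement_index']].copy()
temp['state'] = temp['state'].fillna('Unknown')
# Keep `time` as the index so that only the state columns enter the correlation
temp = temp.pivot_table(index='time', columns='state', values='engagement_index', 
                        aggfunc='sum', dropna=False)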

background_color = "#f6f5f5"
colors = ["black", "#ffd514"]
colormap = matplotlib.colors.LinearSegmentedColormap.from_list("", colors)

ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor(background_color)

#graph
ax0_sns = sns.heatmap(temp.corr(), ax=ax0, annot=True, square=True, xticklabels=True, yticklabels=True,
            annot_kws={"size": 3}, cbar=False, cmap=colormap, linewidths=0.3, 
            linecolor='black', fmt='.1g')

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim()
ax0.text(x0, y1-1.1, "Correlation Between State", fontsize=5, ha='left', va='top', weight='bold')
ax0.text(x0, y1-0.5, 'Yellow indicates a high positive correlation', fontsize=3, ha='left', va='top')

#axis
ax0_sns.set_xlabel("")
ax0_sns.set_ylabel("")
ax0_sns.tick_params(length=0, labelsize=3)

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(3, 3), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 1)
gs.update(wspace=0.7, hspace=0.1)

##########CORRELATION LOCALE##########
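# Copy the slice so that filling missing locales does not act on a view of `combine`
temp = combine[['time', 'locale', 'engagement_index']].copy()
temp['locale'] = temp['locale'].fillna('Unknown')
# Keep `time` as the index so that only the locale columns enter the correlation
temp = temp.pivot_table(index='time', columns='locale', values='engagement_index', 
                        aggfunc='sum', dropna=False)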

background_color = "#f6f5f5"
colors = ["black", "#ffd514"]
colormap = matplotlib.colors.LinearSegmentedColormap.from_list("", colors)

ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor(background_color)

#graph
#matrix = np.triu(temp.corr())
ax0_sns = sns.heatmap(temp.corr(), ax=ax0, annot=True, square=True, xticklabels=True, yticklabels=True,
            annot_kws={"size": 4}, cbar=False, cmap=colormap, linewidths=0.3, 
            linecolor='black', fmt='.1g')

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim()
ax0.text(x0, y1-0.38, "Correlation Between Locale", fontsize=6, ha='left', va='top', weight='bold')
ax0.text(x0, y1-0.2, 'Yellow indicates a high positive correlation', fontsize=4, ha='left', va='top')

#axis
ax0_sns.set_xlabel("")
ax0_sns.set_ylabel("")
ax0_sns.tick_params(length=0, labelsize=3)

[back to top](#table-of-contents)
<a id="5.2"></a>
## 5.2 Black/Hispanic
Black/Hispanic is the percentage of students in a school district who are identified as Black or Hispanic, based on 2018-19 NCES data.

**Observations:**
- Most school districts have `0%-20%` black/hispanic students, which is consistent with this band also having the highest page-load in 2020.
- From the black/hispanic point of view, a higher number of school districts results in a higher page-load.
- `Unknown` black/hispanic percentage and state has the highest page-load, more than `500 million`.
- Three states are worth mentioning: `Connecticut`, `Massachusetts` and `Illinois`. Each has more than `200 million` page-load in school districts with `0%-20%` black/hispanic students.
- In the `20%-40%` black/hispanic range, there are `Illinois` with `63.9 million` page-load and `Utah` with `51 million` page-load.
- The highest page-load for school districts with `60%-80%` black/hispanic students is in `Illinois`, with `50 million` page-load.
- A `30 million` page-load can be found in `New York` state for `80%-100%` black/hispanic.
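
Per-band, per-state observations like these can be read off a two-way table. The sketch below is an illustration on made-up values, assuming band labels written as in the text (`0%-20%`, `20%-40%`); the labels in the real column may be formatted differently. It builds a band-by-state table of page-load in millions and then finds the leading state within each band.

```python
import pandas as pd

# Toy stand-in for `combine` (values and band labels are made up).
toy = pd.DataFrame({
    "pct_black/hispanic": ["0%-20%", "0%-20%", "20%-40%", "20%-40%", None],
    "state": ["Connecticut", "Illinois", "Illinois", "Utah", None],
    "engagement_index": [25e6, 21e6, 6e6, 5e6, 50e6],
})

filled = toy.fillna({"pct_black/hispanic": "Unknown", "state": "Unknown"})

# Rows are bands, columns are states, cells are page-load in millions.
table = filled.pivot_table(
    index="pct_black/hispanic", columns="state",
    values="engagement_index", aggfunc="sum", fill_value=0,
) / 1e6

# State with the highest page-load in each band.
leaders = table.idxmax(axis=1)
print(table)
print(leaders)
```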

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(3, 3), facecolor='#f6f5f5')
gs = fig.add_gridspec(2, 2)
gs.update(wspace=0.7, hspace=0.8)

##########BLACK HISPANIC & DISTRICT##########
temp = pd.DataFrame(combine.groupby('pct_black/hispanic', dropna=False)['id'].nunique()).reset_index()
temp.columns = ['black/hispanic', 'amount']
temp = temp.fillna('Unknown')
temp = temp.sort_values('amount', ascending=False)

background_color = "#f6f5f5"
color_map = ["dimgray" for _ in range(6)]
color_map[0] = "#ffd514"
sns.set_palette(sns.color_palette(color_map))

ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor(background_color)
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)

#graph
ax0.scatter(x=temp['black/hispanic'], y=temp['amount'], s=6, color=color_map, zorder=3)
ax0.vlines(x=temp['black/hispanic'], ymin=0, ymax=temp['amount'], color=color_map, zorder=3)
ax0.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.2)
ax0.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.2)

#format axis
ax0.set_xlabel("black/hispanic",fontsize=3, weight='bold')
ax0.set_ylabel("school districts",fontsize=3, weight='bold')
ax0.tick_params(labelsize=3, width=0.5, length=1.5)
plt.xticks(rotation=90)

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim()
ax0.text(x0, y1+15.9, 'School District by Black/Hispanic', fontsize=4, ha='left', va='top', weight='bold')
ax0.text(x0, y1+7, '0%-20% has the most school districts', fontsize=3, ha='left', va='top')

##########BLACK HISPANIC & PAGE LOAD##########
temp = pd.DataFrame(combine.groupby('pct_black/hispanic', dropna=False)['engagement_index'].sum()/1000000).reset_index()
temp.columns = ['pct_black/hispanic', 'amount']
temp = temp.fillna('Unknown')
temp = temp.sort_values('amount', ascending=False)

background_color = "#f6f5f5"

ax0 = fig.add_subplot(gs[0, 1])
ax0.set_facecolor(background_color)
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)

#graph
ax0_sns = sns.barplot(ax=ax0, y=temp['pct_black/hispanic'], x=temp['amount'], 
                      zorder=2, linewidth=0, orient='h', saturation=1, alpha=1)
ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.2)
ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.2)

#format axis
ax0_sns.set_xlabel("page-load (million)",fontsize=3, weight='bold')
ax0_sns.set_ylabel("black/hispanic",fontsize=3, weight='bold')
ax0_sns.tick_params(labelsize=3, width=0.5, length=1.5)

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim()
ax0.text(x0, y1-0.78, 'Page-Load by Black/Hispanic', fontsize=4, ha='left', va='top', weight='bold')
ax0.text(x0, y1-0.3, '0%-20% is dominating the page-load', fontsize=3, ha='left', va='top')

# data label
for p in ax0.patches:
    value = f'{p.get_width():,.0f}'
    x = p.get_x() + p.get_width() + 150
    y = p.get_y() + p.get_height() / 2 
    ax0.text(x, y, value, ha='center', va='center', fontsize=2.7, 
            bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.3))
x_format = ticker.FuncFormatter(lambda x, p: format(int(x), ','))
ax0.xaxis.set_major_formatter(x_format)

##########BLACK HISPANIC & STATE##########
temp = pd.DataFrame(combine.groupby(['pct_black/hispanic', 'state'], dropna=False)['engagement_index'].sum()).reset_index(drop=False)
temp = temp.fillna('Unknown')
temp['engagement_index'] = temp['engagement_index'] / 1000000

background_color = "#f6f5f5"
colors = ["dimgray", "#ffd514"]
colormap = matplotlib.colors.LinearSegmentedColormap.from_list("", colors)

ax0 = fig.add_subplot(gs[1, 0:2])
ax0.set_facecolor(background_color)
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)

#graph
ax0_sns = sns.scatterplot(ax=ax0, x=temp['state'], y=temp['pct_black/hispanic'], hue=temp['engagement_index'],
                          linewidth=0, zorder=2, palette=colormap, s=12)
ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.2)
ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.2)

#format axis
ax0_sns.set_xlabel("state",fontsize=3, weight='bold')
ax0_sns.set_ylabel("black/hispanic",fontsize=3, weight='bold')
ax0_sns.tick_params(labelsize=3, width=0.5, length=1.5)
ax0_sns.xaxis.set_minor_locator(ticker.MultipleLocator(50))
plt.xticks(rotation=90)

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim()
ax0.text(x0, y1-0.9, 'Page-Load by Black/Hispanic & State (in million)', fontsize=4, ha='left', va='top', weight='bold')
ax0.text(x0, y1-0.5, 'Connecticut with 0%-20% black/hispanic has the highest page-load', fontsize=3, ha='left', va='top')
ax0.legend(loc="upper left", bbox_to_anchor=(x0+1.4, y1-0.47), 
           prop={'size': 2.5}, frameon=False, ncol=5, markerscale=0.3)

plt.show()

[back to top](#table-of-contents)
<a id="5.3"></a>
## 5.3 County Connection Ratio
County Connection Ratio is the ratio of residential fixed high-speed connections (over 200 kbps in at least one direction) to households.

**Observations:**
- Only 1 state, `North Dakota`, has a connection ratio between `100% - 200%`.
- Most of the states have a county connection ratio between `18% - 100%`.
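
A ratio above `100%` looks odd at first, but it follows from the definition: connections divided by households exceeds 1 whenever a county has more qualifying connections than households. The sketch below illustrates this with hypothetical raw counts (the dataset itself only provides the banded ratio). The band edges are taken from the observations above, and the `connections` and `households` columns are assumptions made for the example.

```python
import pandas as pd

# Hypothetical county-level counts (made up for illustration).
counties = pd.DataFrame({
    "state": ["Utah", "Ohio", "Ohio", "North Dakota"],
    "connections": [800, 450, 300, 1100],
    "households": [1000, 900, 1000, 1000],
})

counties["ratio"] = counties["connections"] / counties["households"]

# Left-closed bins: [0.18, 1.0) and [1.0, 2.0).
counties["band"] = pd.cut(
    counties["ratio"], bins=[0.18, 1.0, 2.0], right=False,
    labels=["18% - 100%", "100% - 200%"],
).astype(str)

# Number of distinct states in each band.
states_per_band = counties.groupby("band")["state"].nunique()
print(states_per_band)
```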

In [None]:
temp = pd.DataFrame(combine.groupby(['county_connections_ratio'])['state'].nunique()).reset_index()
temp.columns = ['County Connections Ratio', 'No of State']
temp = temp.fillna('Unknown')
temp = temp.sort_values('No of State', ascending=False)
temp

[back to top](#table-of-contents)
<a id="5.4"></a>
## 5.4 Per-pupil Total Expenditure
Per-pupil total expenditure (sum of local and federal expenditure) comes from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district. We would like to see how per-pupil total expenditure relates to page-load. There are 2 competing assumptions:
1. Higher total expenditure will lead to higher page-load, since more money has been spent.
2. Higher total expenditure will not lead to higher page-load, as not many school districts have high budget spending.

**Observations:**
- **Page Load**
    - Unfortunately, almost half of the information on per-pupil total expenditure is tagged as unknown.
    - From the remaining information, we see that most of the page-load comes from school districts with `8,000 - 10,000` expenditure.
    - We also see that school districts with expenditure in the range from `10,000` to `18,000` have more than `200 million` page-load in 2020.
    - The two lowest page-load values come from `4,000 - 6,000` and `32,000 - 34,000`, which are the lowest and highest spending ranges in the dataset.

- **School Districts**
    - `Page-load` in a category is in line with the number of `school districts` in that category.
    - `Unknown` has the highest number of `school districts`; it also has the highest `page-load`.
    - There are `30` school districts with `8,000 - 10,000` total expenditure, which also explains the higher `page-load`.
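
Since page-load per band tracks the number of districts in that band, one way to compare bands fairly is to divide by the district count. The sketch below is an illustration on made-up values and is not part of the original analysis. It assumes the `pp_total_raw`, `id` and `engagement_index` columns of `combine`, with band labels written as in the text.

```python
import pandas as pd

# Toy stand-in for `combine` (values and band labels are made up).
toy = pd.DataFrame({
    "pp_total_raw": ["8,000 - 10,000"] * 3 + ["10,000 - 12,000"] * 2 + [None] * 4,
    "id": list(range(1, 10)),
    "engagement_index": [40e6, 50e6, 60e6, 45e6, 55e6, 30e6, 30e6, 30e6, 30e6],
})

by_band = (
    toy.assign(pp_total_raw=toy["pp_total_raw"].fillna("Unknown"))
       .groupby("pp_total_raw")
       .agg(page_load=("engagement_index", "sum"), districts=("id", "nunique"))
)
by_band["page_load"] = by_band["page_load"] / 1e6          # in millions
by_band["per_district"] = by_band["page_load"] / by_band["districts"]
print(by_band.sort_values("page_load", ascending=False))
```

In this toy data the two known bands differ in total page-load only because one has more districts; their page-load per district is the same.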

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(5, 1), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 2)
gs.update(wspace=0.2, hspace=0.5)

##########PPTE##########
temp = pd.DataFrame(combine.groupby(['pp_total_raw'], dropna=False)['engagement_index'].sum()).reset_index()
temp.columns = ['pp_total_raw', 'amount']
temp = temp.fillna('Unknown')
temp = temp.sort_values('amount', ascending=False)

background_color = "#f6f5f5"
color_map = ["dimgray" for _ in range(75)]
color_map[0] = "#ffd514"
sns.set_palette(sns.color_palette(color_map))

ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor(background_color)
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)

#graph
ax0.step(y=temp['amount']/1000000, x=temp['pp_total_raw'], 
        zorder=2, linewidth=0.5, alpha=1)
ax0.plot(temp['pp_total_raw'], temp['amount']/1000000, 'o--', color="#4b4b4c", markersize=2, markeredgewidth=0, alpha=0.3, linewidth=0.2)
ax0.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.2)
ax0.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.2)

#format axis
ax0.set_xlabel("per-pupil total expenditure",fontsize=3, weight='bold')
ax0.set_ylabel("page-load (million)",fontsize=3, weight='bold')
ax0.tick_params(labelsize=3, width=0.5, length=1.5)
plt.xticks(rotation=90)

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim()
ax0.text(x0, y1+280, 'Per-pupil Total Expenditure & Page Load', fontsize=4, ha='left', va='top', weight='bold')
ax0.text(x0, y1+160, 'Almost half of page-load is registered as unknown', fontsize=3, ha='left', va='top')

# data label
for p in ax0.patches:
    value = f'{p.get_height():,.0f}'
    x = p.get_x() + p.get_width() / 2
    y = p.get_y() + p.get_height() + 75
    ax0.text(x, y, value, ha='center', va='center', fontsize=2.7, 
            bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.3))
y_format = ticker.FuncFormatter(lambda x, p: format(int(x), ','))
ax0.yaxis.set_major_formatter(y_format)

##########PPTE##########
temp = pd.DataFrame(combine.groupby(['pp_total_raw'], dropna=False)['id'].nunique()).reset_index()
temp.columns = ['pp_total_raw', 'amount']
temp = temp.fillna('Unknown')
temp = temp.sort_values('amount', ascending=False)
temp = temp[0:10]

background_color = "#f6f5f5"
color_map = ["#ffd514", "#ecc200", "#c5a100", "#9d8100", "#766100", "#4f4100", "#4b4b4c", "#676767",
             "#808080", "#989898", "#c6c6c6", "#d3d3d3"]

ax0 = fig.add_subplot(gs[0, 1])
ax0.set_facecolor(background_color)
for s in ["right", "top", "left", "bottom"]:
    ax0.spines[s].set_visible(False)

#graph
squarify.plot(sizes=temp['amount'], label=temp['pp_total_raw'][:6], color=color_map, ax=ax0, text_kwargs={'fontsize':3, 'wrap':True})

#format axis
ax0.set_yticklabels([])
ax0.set_xticklabels([])
ax0.tick_params(left=False, bottom=False)

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim()
ax0.text(x0, y1+16, 'Per-pupil Total Expenditure & School Districts', fontsize=4, ha='left', va='top', weight='bold')
ax0.text(x0, y1+8, 'Higher number of school districts reflecting a higher page-load', fontsize=3, ha='left', va='top')

plt.show()

[back to top](#table-of-contents)
<a id="5.5"></a>
## 5.5 Free or Reduced Price
This is the percentage of students in a district eligible for free or reduced-price lunch, based on 2018-19 NCES data. We would like to see how `free or reduced price` relates to page-load. There are 2 competing assumptions:
1. A higher reduced price will lead to higher page-load, as cheaper products will increase the number of students that use digital learning products, which will increase the page-load.
2. A higher reduced price will not lead to higher page-load, as not many school districts get a high reduced price.

**Observations:**
- **Free or Reduced Price & Page-Load**
    - Sadly, most of `Free or Reduced Price` is unknown, which contributes `936 million` page-load.
    - From the remaining information, we see that most of the page-load comes from school districts that get `0%-20%` reduced price.
    - We can see that page-load keeps decreasing as the percentage of `reduced price` increases; this may be due to the low number of school districts that get a high reduced price.
- **Reduced Price & Per-pupil Total Expenditure**
    - Before diving into the heat map, note that this analysis excludes the NaN figures, leaving a total of `1.1 billion` page-load instead of `2.8 billion`.
    - The combination of `40%-60%` reduced price and `8,000-10,000` per-pupil total expenditure has the highest page-load, `141.1 million`.
    - The next 2 highest combinations have around `100 million` page-load each:
        1. `0%-20%` reduced price and `10,000-12,000` per-pupil total expenditure
        2. `20%-40%` reduced price and `8,000-10,000` per-pupil total expenditure
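
The top combinations can also be ranked directly instead of being read off the heat map. As a minimal sketch on made-up values, assuming the `pct_free/reduced` and `pp_total_raw` columns of `combine` with band labels written as in the text, the code below drops rows where either band is missing, ranks the remaining combinations, and reports how much of the total page-load they cover.

```python
import pandas as pd

# Toy stand-in for `combine` (values and band labels are made up).
toy = pd.DataFrame({
    "pct_free/reduced": ["40%-60%", "40%-60%", "0%-20%", None, "20%-40%"],
    "pp_total_raw": ["8,000-10,000", "8,000-10,000", "10,000-12,000",
                     "8,000-10,000", None],
    "engagement_index": [80e6, 60e6, 100e6, 500e6, 300e6],
})

keys = ["pct_free/reduced", "pp_total_raw"]

# Keep only rows where both bands are known.
known = toy.dropna(subset=keys)

# Page-load per combination in millions, highest first.
combos = known.groupby(keys)["engagement_index"].sum().div(1e6).nlargest(2)

# Share of the overall page-load that the known rows cover.
covered = known["engagement_index"].sum() / toy["engagement_index"].sum()
print(combos)
print(f"covered share: {covered:.1%}")
```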

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(5, 1), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 2)
gs.update(wspace=0.3, hspace=0.8)

##########Page Load##########
temp = pd.DataFrame(combine.groupby(['pct_free/reduced'], dropna=False)['engagement_index'].sum()).reset_index()
temp.columns = ['description', 'amount']
temp = temp.fillna('Unknown')
temp = temp.sort_values('amount', ascending=False)

background_color = "#f6f5f5"
color_map = ["dimgray" for _ in range(6)]
color_map[0] = "#ffd514"
sns.set_palette(sns.color_palette(color_map))

ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor(background_color)
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)

#graph
ax0 = sns.barplot(ax=ax0, x=temp['description'], y=temp['amount']/1000000, zorder=3, saturation=1)
ax0.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.2)
ax0.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.2)

#format axis
ax0.set_xlabel("Free or Reduced Price",fontsize=3, weight='bold')
ax0.set_ylabel("page-load (million)",fontsize=3, weight='bold')
ax0.tick_params(labelsize=3, width=0.5, length=1.5)
plt.xticks(rotation=90)
y_format = ticker.FuncFormatter(lambda x, p: format(int(x), ','))
ax0.yaxis.set_major_formatter(y_format)

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim()
ax0.text(x0, y1+140, 'Free or Reduced Price & Page Load', fontsize=4, ha='left', va='top', weight='bold')
ax0.text(x0, y1+60, 'Unknown has the highest page-load', fontsize=3, ha='left', va='top')

####################
temp = pd.DataFrame(pd.pivot_table(combine, index=['pct_free/reduced', 'pp_total_raw'], values='engagement_index', aggfunc='sum', dropna=False)).reset_index()
temp = pd.pivot_table(temp, index=['pct_free/reduced'], columns='pp_total_raw', values='engagement_index', aggfunc='sum', dropna=False)/1000000000

background_color = "#f6f5f5"
colors = ["black", "#ffd514"]
colormap = matplotlib.colors.LinearSegmentedColormap.from_list("", colors)

ax0 = fig.add_subplot(gs[0, 1])
ax0.set_facecolor(background_color)

#graph
ax0_sns = sns.heatmap(temp, ax=ax0, annot=True, square=True, xticklabels=True, yticklabels=True,
            annot_kws={"size": 2.5}, cbar=False, cmap=colormap, linewidths=0.3, 
            linecolor=background_color, fmt='.1g')

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim()
ax0.text(x0, y1-0.8, "Reduced Price & Per-pupil Total Expenditure", fontsize=4, ha='left', va='top', weight='bold')
ax0.text(x0, y1-0.4, 'Relation in terms of billion page-load', fontsize=3, ha='left', va='top')

#axis
ax0_sns.set_xlabel("")
ax0_sns.set_ylabel("")
ax0_sns.tick_params(length=0, labelsize=3)

plt.show()

[back to top](#table-of-contents)
<a id="ref"></a>
# References

- https://www.who.int/news/item/27-04-2020-who-timeline---covid-19
- https://www.edweek.org/leadership/the-coronavirus-spring-the-historic-closing-of-u-s-schools-a-timeline/2020/07