In [None]:
# To find all the pathnames that matched the specified pattern
import glob

# To use operating system dependent functionality
import os

# To do statistical calculation
import numpy as np

# To create and manipulate dataset or table
import pandas as pd

# Both modules are used to create visualization
import seaborn as sns
import matplotlib.pyplot as plt

## **Data**

### **Districts Info**

The district data come from the districts_info.csv file, which contains information about the characteristics of school districts, including data from NCES (2018-19), FCC (Dec 2018), and Edunomics Lab. 
* In this data set, identifiable information about the school districts has been removed. 
* The open source tool ARX (Prasser et al. 2020) was also used to transform several data fields and reduce the risk of re-identification. 
* For data generalization purposes, some data points are released as a range that contains the actual value.
* Many values are missing (marked as 'NaN'), indicating that the data were suppressed to maximize anonymization of the dataset.

In [None]:
districts_dt = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')
districts_dt.info()

In [None]:
districts_dt.head()

| Column Name | Description |
| --- | --- |
| district_id | The unique identifier of the school district |
| state | The state where the district is located |
| locale | NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. |
| pct_black/hispanic | Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data |
| pct_free/reduced | Percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data |
| county_connections_ratio | Ratio of residential fixed high-speed connections (over 200 kbps in at least one direction) to households, based on county-level data from FCC Form 477 (December 2018 version). |
| pp_total_raw | Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools project. The expenditure data are school-by-school, and the median value is used to represent the expenditure of a given school district. |

In [None]:
# keep only the rows that have at least 6 non-null values (thresh=6)
districts_dt.dropna(thresh = 6, inplace=True)
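
The effect of `thresh` is easy to misread, so here is a minimal sketch on a toy frame. The rows and values below are made up for illustration; only the column names follow the table above.

```python
import numpy as np
import pandas as pd

# Toy frame with the same 7 columns as the districts table
# (all values are invented for this example).
toy = pd.DataFrame({
    "district_id": [1, 2, 3],
    "state": ["Utah", np.nan, np.nan],
    "locale": ["Suburb", np.nan, "Rural"],
    "pct_black/hispanic": ["[0, 0.2[", np.nan, "[0, 0.2["],
    "pct_free/reduced": [np.nan, np.nan, "[0.2, 0.4["],
    "county_connections_ratio": ["[0.18, 1[", np.nan, "[0.18, 1["],
    "pp_total_raw": ["[8000, 10000[", np.nan, np.nan],
})

# Number of non-null values in each row.
non_null_per_row = toy.notna().sum(axis=1)

# thresh=6 keeps a row only if it has at least 6 non-null values.
kept = toy.dropna(thresh=6)
print(non_null_per_row.tolist(), kept.index.tolist())
```

Only the first toy row survives: it has 6 non-null values, while the other two have 1 and 5.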

In [None]:
districts_dt.isna().sum()

In [None]:
# convert the free/reduced lunch range into a single number:
# the midpoint of the range, with missing bounds filled by the column median

# separate the lower and upper bound of each range
# (despite their names, 'pct_free' holds the lower bound and 'pct_reduced' the upper bound)
free_reduced = districts_dt['pct_free/reduced'].str.split(",", n=1, expand=True)
# regex=False: '[' is a literal character here, not a regular expression
districts_dt['pct_free'] = pd.to_numeric(free_reduced[0].str.replace('[', '', regex=False))
districts_dt['pct_reduced'] = pd.to_numeric(free_reduced[1].str.replace('[', '', regex=False))

# fill the null values of each bound with its median, then take the midpoint
districts_dt['pct_free'] = districts_dt['pct_free'].fillna(districts_dt['pct_free'].median())
districts_dt['pct_reduced'] = districts_dt['pct_reduced'].fillna(districts_dt['pct_reduced'].median())
districts_dt['pct_free_reduced'] = (districts_dt['pct_free'] + districts_dt['pct_reduced']) / 2
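
As an alternative to splitting and stripping brackets by hand, the two bounds can be pulled out with one regular expression. This is only a sketch: the helper name `range_bounds` and the sample strings are assumptions, and it presumes the ranges look like `[low, high[`.

```python
import pandas as pd

def range_bounds(series: pd.Series) -> pd.DataFrame:
    """Split strings like '[0.2, 0.4[' into numeric 'low' and 'high' columns."""
    # Capture the two numbers between the brackets; rows that do not
    # match (including missing values) come back as NaN in both columns.
    pattern = r"\[\s*(?P<low>[\d.]+)\s*,\s*(?P<high>[\d.]+)\s*\["
    return series.str.extract(pattern).astype(float)

# Made-up sample values in the assumed format.
sample = pd.Series(["[0.2, 0.4[", None, "[0, 0.2["])
bounds = range_bounds(sample)

# Midpoint of each range; a missing range stays NaN.
midpoint = bounds.mean(axis=1)
print(midpoint)
```

Unlike the cell above, this keeps missing ranges as NaN instead of filling them with the median, so the two approaches are not interchangeable without an extra `fillna` step.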

In [None]:
districts_dt.head(10)

In [None]:
# convert the county connection ratio range into a single number

# separate the lower and upper bound of each range
conn_ratio = districts_dt['county_connections_ratio'].str.split(",", n=1, expand=True)
# regex=False: '[' is a literal character here, not a regular expression
districts_dt['ratio_c1'] = pd.to_numeric(conn_ratio[0].str.replace('[', '', regex=False))
districts_dt['ratio_c2'] = pd.to_numeric(conn_ratio[1].str.replace('[', '', regex=False))

# fill the null values of each bound with its median
districts_dt['ratio_c1'] = districts_dt['ratio_c1'].fillna(districts_dt['ratio_c1'].median())
districts_dt['ratio_c2'] = districts_dt['ratio_c2'].fillna(districts_dt['ratio_c2'].median())

# take the midpoint of the range
districts_dt['ratio_connect_county'] = (districts_dt['ratio_c1'] + districts_dt['ratio_c2']) / 2
districts_dt.info()

After that, drop the columns that are no longer needed to save memory.

In [None]:
districts_dt.drop(['pct_free/reduced', 'county_connections_ratio', 'pct_free', 'pct_reduced', 'ratio_c1', 'ratio_c2'], axis=1, inplace=True)
districts_dt.info()

### **Products Info**

The product data come from the products_info.csv file, which contains information about the characteristics of the top 372 products with the most users in 2020. 
* The categories listed in this file are part of LearnPlatform's product taxonomy. 
* Data were labeled by the LearnPlatform team. 
* Some products may not have labels because they are duplicates, lack an accurate URL, or for other reasons.

In [None]:
products_dt = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')
products_dt.info()

In [None]:
products_dt.isna().sum()

In the products table, three columns have null values.
* Provider/Company Name : 1 null value
* Sector(s) : 20 null values
* Primary Essential Function : 20 null values

In [None]:
products_dt.head()

| Column Name | Description |
| --- | --- |
| LP ID | The unique identifier of the product |
| URL | Web Link to the specific product |
| Product Name | Name of the specific product |
| Provider/Company Name | Name of the product provider |
| Sector(s)   | Sector of education where the product is used |
| Primary Essential Function | The basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled |
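
The two label layers can be separated into their own columns. The sketch below assumes the layers are joined by `' - '` (as in "School Management Software - Mobile Device Management" further down); the sample labels are illustrative, not read from the file.

```python
import pandas as pd

# Illustrative labels in the assumed 'CATEGORY - sub-category' format.
pef = pd.Series([
    "LC - Digital Learning Platforms",
    "SDO - School Management Software - Mobile Device Management",
    "CM - Classroom Engagement & Instruction - Classroom Management",
])

# Split only on the first ' - ' so sub-categories that contain
# their own ' - ' stay in one piece.
parts = pef.str.split(" - ", n=1, expand=True)
main_category = parts[0]
sub_category = parts[1]
print(main_category.tolist())
```

With the main category in its own column, the later group-by charts could be drawn per category (LC, CM, SDO) instead of per full label.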

Next, deal with the null values.

In [None]:
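# fill the null value in the Provider/Company Name column based on the product's information
# (assign the result back instead of using inplace=True on a column selection)
products_dt['Provider/Company Name'] = products_dt['Provider/Company Name'].fillna('PowerSchool Group LLC')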
products_dt[products_dt['Provider/Company Name']=='PowerSchool Group LLC']

In [None]:
# fill null values in the Sector(s) and Primary Essential Function columns with the mode of each column
products_dt['Sector(s)'] = products_dt['Sector(s)'].fillna(products_dt['Sector(s)'].mode()[0])
products_dt['Primary Essential Function'] = products_dt['Primary Essential Function'].fillna(products_dt['Primary Essential Function'].mode()[0])
products_dt.info()

### **Engagement Data**

In [None]:
files = glob.glob(os.path.join('../input/learnplatform-covid19-impact-on-digital-learning/engagement_data', "*.csv"))
engage_data = []

for a in files:
    dat = pd.read_csv(a)
    # the file name (without extension) is the district id
    fname = os.path.splitext(a)
    dat['district_id'] = os.path.basename(fname[0])
    engage_data.append(dat)

# create dataframe for engagement data (concatenate once, after the loop)
engagement_dt = pd.concat(engage_data)
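
The same loading step can be wrapped in a small function. This is a sketch rather than the notebook's method: `load_engagement` is a hypothetical helper, and it assumes every file is named after its numeric district id (for example `1000.csv`). The demo writes two made-up files to a temporary folder so it runs without the Kaggle input.

```python
from pathlib import Path
import tempfile

import pandas as pd

def load_engagement(folder) -> pd.DataFrame:
    """Read every CSV in `folder` and tag rows with the file stem as district_id."""
    frames = []
    for path in sorted(Path(folder).glob("*.csv")):
        frame = pd.read_csv(path)
        frame["district_id"] = int(path.stem)  # assumes names like '1000.csv'
        frames.append(frame)
    # Concatenate once, after the loop, and rebuild a clean 0..n-1 index.
    return pd.concat(frames, ignore_index=True)

# Self-contained demo with two invented district files.
with tempfile.TemporaryDirectory() as tmp:
    for district, value in [("1000", 0.5), ("2000", 1.5)]:
        pd.DataFrame({
            "time": ["2020-01-01"],
            "lp_id": [1.0],
            "pct_access": [value],
            "engagement_index": [value * 10],
        }).to_csv(Path(tmp) / f"{district}.csv", index=False)
    demo = load_engagement(tmp)

print(demo)
```

Converting the stem to `int` while loading makes `district_id` numeric from the start, and `ignore_index=True` avoids repeated index labels across files.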

In [None]:
engagement_dt.info()

To avoid errors when merging the engagement table with the district table, the district_id column has to be converted to a numeric datatype.

In [None]:
engagement_dt['district_id'] = pd.to_numeric(engagement_dt['district_id'])
engagement_dt.info(show_counts = True)

In [None]:
engagement_dt.isna().sum()

There are missing values in lp_id, which can cause problems when merging the engagement table with the product table. Therefore, that column has to be filled.

In [None]:
# fill null values in lp_id with 0 and store the result as 'LP ID' to match the key in the product table
engagement_dt['LP ID'] = engagement_dt['lp_id'].fillna(0.0).astype(int)
engagement_dt.info(show_counts = True)
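
Filling missing IDs with 0 turns them into a placeholder that matches no product. A possible alternative, sketched here on made-up IDs, is pandas' nullable integer dtype, which keeps the IDs as integers while leaving the missing ones marked as missing.

```python
import numpy as np
import pandas as pd

# Invented product IDs with one missing value.
lp_id = pd.Series([13117.0, np.nan, 29322.0])

# Approach used above: missing IDs become 0.
filled = lp_id.fillna(0).astype(int)

# Alternative: the nullable 'Int64' dtype keeps missing IDs as <NA>.
nullable = lp_id.astype("Int64")
print(filled.tolist(), nullable.isna().tolist())
```

Either way, rows without a real product ID drop out of an inner merge with the product table; the nullable version just makes that explicit.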

In [None]:
engagement_dt.head()

| Column Name | Description |
| --- | --- |
| time | Date in "YYYY-MM-DD" format |
| lp_id | The unique identifier of the product |
| pct_access | Percentage of students in the district that have at least one page-load event of a given product on a given day |
| engagement_index | Total page-load events per one thousand students of a given product and on a given day |
| district_id | The unique identifier of each district |
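
To make the two metrics concrete, the definitions in the table can be read as simple ratios. The functions and numbers below are hypothetical: they only restate the table's wording and are not LearnPlatform's published computation.

```python
def engagement_index(page_loads: int, students: int) -> float:
    """Page-load events per 1,000 students."""
    return 1000 * page_loads / students

def pct_access(students_with_page_load: int, students: int) -> float:
    """Percentage of students with at least one page-load event."""
    return 100 * students_with_page_load / students

# Hypothetical district of 2,000 students, one product, one day:
# 500 page loads generated by 120 distinct students.
example_index = engagement_index(500, 2000)
example_access = pct_access(120, 2000)
print(example_index, example_access)  # 250.0 6.0
```

Because one student can load many pages, the engagement index can be large even when pct_access is small.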

## **Exploration**

Let's explore each dataset.

### **Districts** 

In [None]:
districts_dt.describe(include = 'all')

From the districts table, we can see that:
* The most common locale is Suburb, with 104 data points.
* The most common state is Utah, with 30 data points.
* The most frequent percentage of Black/Hispanic students is in the range 0% to 20%.
* On average, 34% of students are eligible for free or reduced-price lunch.
* The most frequent per-pupil total expenditure range is 8000 - 10000.
* The average county connection ratio is 59.5%.

**Data in Each State and Locale**

In [None]:
sns.catplot(data=districts_dt, y='state', kind='count', height=8, aspect=1)
plt.title('Amount of Data in Each State')

The chart above shows that Utah has the most data points in this table (30), followed by Connecticut (25) and Illinois (20).

In [None]:
sns.displot(data=districts_dt, y='state', hue='locale', col='locale', height=10, aspect=0.6)

The chart above shows that the Suburb locale has the most data points of all locales, followed by Rural, City, and Town.
* In the Suburb locale, Utah and Connecticut have almost the same number of data points, followed by Massachusetts and Illinois.
* Data points in Utah are mostly located in Suburb and Town areas, while data points in Connecticut are mostly located in Suburb and Rural areas.
* Data points in California are mostly located in the City locale.
* In Michigan, the data points are located only in the Suburb locale.

**Percentage of Free and Reduced-Price Lunch in Each State and Locale**

In [None]:
sns.displot(data=districts_dt, x='pct_free_reduced', hue='state')
plt.title('Free and Reduced Lunch Fee Percentage Distribution in Each State')

The chart above shows that the most common free or reduced-price lunch percentage is around 30%. Most of these data points are located in Massachusetts, Wisconsin, and North Carolina.
The second most common percentage is between 10% and 20%.
Only two states, Michigan and Missouri, have data points in the 80%-90% range.

In [None]:
sns.displot(data=districts_dt, x='pct_free_reduced', hue='locale', kind='kde')

The distribution chart above shows that the majority of the data is located in the Suburb locale. However, the average percentage in the Suburb locale is around 40%, whereas City and Town have average percentages of around 50%-60%.


**Ratio of High-Speed Connections in Each State and Locale**

In [None]:
sns.displot(data=districts_dt, x='ratio_connect_county', hue='state')

The chart above shows that most states have a connectivity ratio of around 0.6 to 0.7. Only North Dakota has a connectivity ratio above 1.4.

In [None]:
sns.displot(data=districts_dt, x='ratio_connect_county', hue='locale')

In terms of locales, most locations have a connectivity ratio of around 0.6 to 0.7, and only the Rural locale has a connectivity ratio above 1.4.

**Total Raw Expenditure per Pupil**

In [None]:
sns.catplot(data=districts_dt, y='pp_total_raw', kind='count', height=8, aspect=1)

Based on the expenditure chart above, per-pupil expenditure is most often in the range [8000 - 10000], followed by the range [10000 - 12000].
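
Because the expenditure ranges are stored as strings, a plain alphabetical sort would place a label starting with `[10000` before one starting with `[6000`. A minimal sketch for ordering the categories by their numeric lower bound follows; the helper `lower_bound` and the sample labels are assumptions about the label format.

```python
import pandas as pd

# Made-up labels in the assumed '[low, high[' format.
ranges = pd.Series(["[8000, 10000[", "[10000, 12000[", "[6000, 8000[", "[8000, 10000["])

def lower_bound(label: str) -> float:
    # '[8000, 10000[' -> 8000.0
    return float(label.strip("[").split(",")[0])

# Unique labels sorted by their numeric lower bound.
ordered = sorted(ranges.dropna().unique(), key=lower_bound)
print(ordered)
```

The resulting list could then be passed to the `order` parameter of `sns.catplot` so the bars appear from the lowest to the highest range.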

In [None]:
sns.displot(data=districts_dt, y='pp_total_raw', hue='locale', col='locale', height=10, aspect=0.6)

Split by locale:
* The expenditure range [8000 - 10000] is the most common in Rural, City, and Town areas.
* Meanwhile, the most common expenditure range in the Suburb locale is [14000 - 16000].

### **Engagement**

Merge all data into one table.

In [None]:
# merge engagement table and product table with 'LP ID' as key
engagement_product = pd.merge(engagement_dt, products_dt, on=['LP ID'])
engagement_product.info(show_counts=True)

In [None]:
# merge with district table
learn_dt = pd.merge(engagement_product, districts_dt, on =['district_id'])
learn_dt.info(show_counts=True)
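
Both merges above are inner merges by default, so rows without a match in the other table are silently dropped. One way to see how many rows that affects is a left merge with `indicator=True`, sketched here on small made-up tables.

```python
import pandas as pd

# Invented stand-ins for the engagement and product tables.
engagement = pd.DataFrame({"LP ID": [1, 2, 0], "pct_access": [0.5, 1.2, 0.1]})
products = pd.DataFrame({"LP ID": [1, 2, 3], "Product Name": ["A", "B", "C"]})

# indicator=True adds a '_merge' column: 'both' for engagement rows that
# matched a product, 'left_only' for rows with no matching product.
checked = engagement.merge(products, on="LP ID", how="left", indicator=True)
match_counts = checked["_merge"].value_counts()
print(match_counts)
```

In this toy case the row with `LP ID` 0 (the placeholder for a missing ID) is the one flagged as `left_only`.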

**Statistical Description of All the Data**

In [None]:
learn_dt.describe(include= 'all')

During 2020:
* The average percent access is 0.83 across the states.
* The average engagement index is 0.258.
* The most used product is Google Docs from Google LLC.
* The online services are most often used in the PreK-12 sector (pre-kindergarten through grade 12).
* The most common Primary Essential Function is Digital Learning Platforms.

**Online Engagement and Connectivity throughout the Pandemic in 2020**

In [None]:
#change time column into datetime datatype
learn_dt['time'] = pd.to_datetime(learn_dt['time'])
learn_dt.info()

In [None]:
sns.set(rc={'figure.figsize':(24,8)})
sns.lineplot(x="time", y="pct_access", data=learn_dt, hue='state', ci=None)

# anchor the legend's upper-left corner just outside the right edge of the plot
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)

Based on the time chart above, the overall percentage of access fluctuates throughout the pandemic. In the USA, the pandemic started to affect learning facilities in March 2020. Michigan is an exception: it has a high access percentage during the first two months of 2020, followed by a sharp drop. This suggests that certain events or factors in Michigan affected its access percentage. From March into April 2020, the access percentage increases slightly in most states, with New York at the peak. From May 2020 to July 2020, however, the access percentage decreases in all states and stays low until August 2020, when it rises sharply until September 2020. After that, the access percentage stays stable until the end of the year, with California at the peak. 

This chart suggests that the pandemic slightly increased access to online platforms compared to normal days (before the pandemic), but there is an anomaly that should be researched further to clarify its cause.
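
Daily values are noisy, so trends such as the summer dip can be easier to read after aggregating to weeks. The sketch below is one possible way to do that with `pd.Grouper`; the data are invented and only the column names follow the merged table.

```python
import pandas as pd

# Invented daily values for two states over two weeks.
demo = pd.DataFrame({
    "time": pd.to_datetime(["2020-03-02", "2020-03-03", "2020-03-09", "2020-03-10"] * 2),
    "state": ["Utah"] * 4 + ["Ohio"] * 4,
    "pct_access": [1.0, 3.0, 2.0, 4.0, 0.5, 1.5, 1.0, 2.0],
})

# Mean pct_access per state and calendar week (weeks end on Sunday).
weekly = (
    demo.groupby(["state", pd.Grouper(key="time", freq="W")])["pct_access"]
        .mean()
        .reset_index()
)
print(weekly)
```

The `weekly` frame has the same column names as the daily one, so it could be passed to `sns.lineplot` in place of `learn_dt`.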

In [None]:
sns.set(rc={'figure.figsize':(24,8)})
sns.lineplot(x="time", y="engagement_index", data=learn_dt, hue='state', ci=None, estimator = np.median)

# anchor the legend's upper-left corner just outside the right edge of the plot
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)

Based on the time chart above, the engagement index in all states gradually increases until April 2020, when New York has the highest engagement index, followed by Missouri. After that, the engagement index gradually decreases until July 2020. The index stays flat across all states until August 2020, then increases slightly until September 2020. It remains stable from September 2020 until December 2020. Overall, New York has the highest engagement index of all states, followed by Missouri.

This chart suggests that, in terms of engagement with online platforms, there is no significant change during the pandemic compared to before it. Therefore, the pandemic does not appear to have strongly affected engagement with online platforms.

**Percent Access and Engagement Index in respect of Sectors**

In [None]:
sns.catplot(data=learn_dt,y='Sector(s)', x='pct_access', kind='bar', height=8, aspect=1, estimator=np.median)

This bar chart shows that the highest access is in the Corporate sector, at around 16%, followed by the PreK-12; Higher Education; Corporate sector at around 6%, and then by the PreK-12 only sector and the PreK-12; Higher Education sector.

This means that the most accessed online learning platforms are the ones used in the Corporate sector.

In [None]:
sns.catplot(data=learn_dt,y='Sector(s)', x='engagement_index', kind='bar', height=8, aspect=1, estimator=np.median)

Similar to the access percentage, the highest engagement index is in the Corporate sector, at around 11, followed by the PreK-12; Higher Education; Corporate sector at around 6, and then by the PreK-12 only sector and the PreK-12; Higher Education sector.

This means that the most engaged online platforms are the ones used in the Corporate only sector.
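
Note that the charts treat each combination such as "PreK-12; Higher Education; Corporate" as its own category. If a per-sector view is wanted instead, the combined labels can be split and exploded so that each row counts toward every sector it names. This is a sketch on made-up values, assuming the sectors are separated by `'; '`.

```python
import pandas as pd

# Invented rows using the combined sector labels seen in the charts.
demo = pd.DataFrame({
    "Sector(s)": [
        "PreK-12",
        "PreK-12; Higher Education",
        "PreK-12; Higher Education; Corporate",
        "Corporate",
    ],
    "pct_access": [1.0, 2.0, 6.0, 16.0],
})

# One row per (original row, single sector) pair.
exploded = (
    demo.assign(sector=demo["Sector(s)"].str.split("; "))
        .explode("sector")
)
median_by_sector = exploded.groupby("sector")["pct_access"].median()
print(median_by_sector)
```

A multi-sector product then contributes to several groups, so these medians answer a different question than the combined-label bars above.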

**Engagement Index and Percent Access in respect of Primary Essential Function**

In [None]:
learn_dt.groupby('Primary Essential Function')['engagement_index'].median().plot(kind="bar", color='green')
plt.ylabel('Engagement Index')

Based on the bar chart above, the online platforms with the highest engagement are those whose primary essential function is Learning Management Systems (LMS), followed by those whose primary essential function is School Management Software - Mobile Device Management.

In [None]:
learn_dt.groupby('Primary Essential Function')['pct_access'].median().plot(kind="bar", color='blue')
plt.ylabel('Percent Access')

Meanwhile, based on the bar chart above, the most accessed online platforms are those with a school-oriented primary essential function, namely School Management Software - Mobile Device Management and School Management Software - SSO.

**Percent Access and Engagement Index in respect of district demographics**

In [None]:
access_state = sns.catplot(data=learn_dt,x='pct_access', y='state', col='locale', kind='bar', estimator= np.median, height=10, aspect=0.6)
access_state.set_xlabels("Percent Access")

Based on the chart above, the locale with the most access to online platforms is Rural, even though Suburb has the most data points. Many states have a higher access percentage in rural areas than in other locales, such as Massachusetts, Utah, Ohio, California, New Hampshire, and North Dakota. The highest access percentage is in rural North Dakota, at 70%. Note that Rural is the only locale with access data in North Dakota. This suggests that a larger share of students accessed online platforms in rural areas than in other locales.

Meanwhile, the second highest access percentage is in the City locale of New York, at nearly 40%. This is plausible, as New York is a highly urbanized state.

In [None]:
engage_state = sns.catplot(data=learn_dt,x='engagement_index', y='state', col='locale', kind='bar', estimator= np.median, height=10, aspect=0.6)
engage_state.set_xlabels("Engagement Index")

The chart above shows that the highest engagement index occurs in rural areas, particularly in North Dakota, which has an index of nearly 40. Other high engagement indexes in rural areas follow a distribution similar to percent access, for example in Massachusetts, Ohio, and New Hampshire. This means that, along with the percentage of students who accessed the online platforms, the number of page-load events is highest in rural areas, particularly in North Dakota.

As with the access percentage, the second highest engagement index occurs in the City locale of New York.

**Access Percentage and Engagement Index in respect of Connectivity Ratio**

In [None]:
learn_dt.groupby('ratio_connect_county')['pct_access'].median().plot(kind="bar", color='blue', xlabel='Connectivity Ratio')
plt.ylabel('Percent Access')
plt.title('Access Percentage vs Ratio of Connectivity')

In [None]:
learn_dt.groupby('ratio_connect_county')['engagement_index'].median().plot(kind="bar", color='green', xlabel='Connectivity Ratio')
plt.ylabel('Engagement Index')
plt.title('Engagement Index vs Ratio of Connectivity')

Based on the two charts above:
* The highest connectivity ratio (1.5) has the highest access percentage (near 70%).
* The highest connectivity ratio (1.5) has the highest engagement index (above 35).
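
Since the ratio is a county-level value attached to each district, it helps to check how many districts stand behind each bar before reading too much into it. The sketch below uses invented rows with the column names of the merged table.

```python
import pandas as pd

# Invented engagement rows for three districts.
demo = pd.DataFrame({
    "district_id": [1, 1, 2, 2, 3, 3],
    "ratio_connect_county": [0.59, 0.59, 0.59, 0.59, 1.5, 1.5],
    "pct_access": [0.1, 0.3, 0.2, 0.2, 0.6, 0.8],
})

# Collapse to one row per district so districts with many rows do not dominate.
per_district = demo.groupby("district_id").agg(
    ratio=("ratio_connect_county", "first"),
    access=("pct_access", "median"),
)

# Number of districts behind each connectivity ratio value.
districts_per_ratio = per_district.groupby("ratio").size()
print(districts_per_ratio)
```

If a ratio value is backed by only one or a few districts, as the earlier chart suggests for the ratio above 1.4 in North Dakota, its bar reflects those districts more than connectivity in general.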

**Access Percentage and Engagement Index in respect of Expenditure Per Student**

In [None]:
exp_access = sns.catplot(data=learn_dt,x='pct_access', y='pp_total_raw', col='locale', kind='bar', estimator= np.median, height=10, aspect=0.6)
exp_access.set_xlabels("Access Percentage")
exp_access.set_ylabels("Expenditure Per Pupil")

Based on the chart above:
* The highest access percentage is more than 16%, at the expenditure range [20000 - 22000] in rural areas.
* In many cases, rural areas have a large share of students accessing online learning platforms across low to high expenditure ranges.
* The City locale only has students accessing online platforms in the expenditure ranges within [6000 - 20000], except the range [16000 - 18000].
* Access in the Suburb locale is highest for districts in the expenditure range [20000 - 22000].

In [None]:
exp_engage = sns.catplot(data=learn_dt,x='engagement_index', y='pp_total_raw', col='locale', kind='bar', estimator= np.median, height=10, aspect=0.6)
exp_engage.set_xlabels("Engagement Index")
exp_engage.set_ylabels("Expenditure Per Pupil")

The engagement index in each locale has a distribution similar to the access percentage. The highest engagement index occurs in districts with expenditures of [20000 - 22000] in rural areas.

**Access Percentage and Engagement Index in respect of Percentage of Black/Hispanic**

In [None]:
# create a sorted list to control the order of the categorical data
# (dropna avoids comparing NaN with strings while sorting)
learn_s = sorted(learn_dt['pct_black/hispanic'].dropna().unique())

In [None]:
blh_access = sns.catplot(data=learn_dt,x='pct_access', y='pct_black/hispanic', col='locale', kind='bar', estimator= np.median, order = learn_s, height=10, aspect=0.6)
blh_access.set_xlabels("Access Percentage")
blh_access.set_ylabels("Percentage of Black/Hispanic People")

* Based on the chart above, the overall highest access is in districts with the lowest percentage of Black/Hispanic students.
* The highest access to online learning platforms is in rural districts with the lowest percentage of Black/Hispanic students (0% - 20%). 
* On the other hand, in the City locale, the highest access to online learning platforms is in districts with the highest percentage of Black/Hispanic students (80% - 100%).

In [None]:
blh_engage = sns.catplot(data=learn_dt,x='engagement_index', y='pct_black/hispanic', col='locale', kind='bar', estimator= np.median, order = learn_s, height=10, aspect=0.6)
blh_engage.set_xlabels("Engagement Index")
blh_engage.set_ylabels("Percentage of Black/Hispanic People")

* Based on the chart above, the overall highest engagement index is in districts with the lowest percentage of Black/Hispanic students.
* The highest engagement index occurs in rural districts with the lowest percentage of Black/Hispanic students, and the same holds in the Suburb locale.
* In the City locale, the highest engagement occurs in districts with the highest percentage of Black/Hispanic students.
* In the Town locale, the highest engagement occurs in districts with 20%-40% Black/Hispanic students.

# **Conclusion**

In summary, based on the explored data:
1. The time series charts show a slight increase in **access percentage** and **engagement index** from March 2020 to April 2020, followed by a gradual decrease until July 2020, after which both parameters increase again. Therefore, the use of online learning platforms actually increased during the pandemic in 2020, but there is a period of very low access and engagement that needs to be researched further to understand its cause.

2. **The most accessed** (the largest share of students accessing the platform) and **the most engaged** (the most page-load events per one thousand students) online learning platforms are the ones used in the **Corporate Sector**.

3. **The most accessed online learning platforms** are those whose Primary Essential Function is **School Management Software - Mobile Device Management**, followed by **School Management Software - SSO**. Meanwhile, **the most engaged online learning platforms** are those whose Primary Essential Function is **Learning Management Systems**, followed by **School Management Software - Mobile Device Management**.

4. The highest access percentage and engagement index occur in the Rural locale, with the highest values in North Dakota. The second highest access percentage and engagement index occur in the City locale of New York.

5. The access percentage and the engagement index at a high connectivity ratio (1.5) are **significantly higher** than at a low connectivity ratio (0.59). This suggests that the more connections of at least 200 kbps a county has per household, the more online learning platforms are used.

6. The highest access percentage and engagement index occur in districts with **expenditures of 20000 - 22000 per pupil** in the **Rural Area**. Meanwhile, in the **City Area**, the highest access percentage and engagement index occur in districts with **expenditures of 6000 - 8000 per pupil**.

7. In general, the highest access percentage and engagement index occur in districts with the lowest percentage of Black/Hispanic students.

8. In the **Rural Area**, the highest access percentage and engagement index occur in districts with the **lowest percentage of Black/Hispanic students**. Meanwhile, in the **City Area**, the highest access percentage and engagement index occur in districts with the **highest percentage of Black/Hispanic students**.

9. There is a difference in technological/digital access in a racial context (Black/Hispanic and non-Black/Hispanic students) between the City and Rural areas.