# Introduction

$Problem Statement$

The COVID-19 pandemic has disrupted learning for more than 56 million students in the United States. In the spring of 2020, most state and local governments across the U.S. closed educational institutions to stop the spread of the virus. In response, schools and teachers have attempted to reach students remotely through distance learning tools and digital platforms. To this day, concerns about a widening digital divide and long-term learning loss among America’s most vulnerable learners continue to grow.

The purpose of the current Kaggle challenge is to investigate (1) the state of digital learning in 2020 and (2) how engagement with digital learning relates to factors such as district demographics, broadband access, and state/national level policies and events.

$Questions$

* During 2020, what types of learning platforms were used, and which was dominant?
* Among Google products, which function was used the most?
* How does student engagement with different types of education technology change over the course of the pandemic?
* What does student engagement with online learning platforms look like across different geographic and demographic contexts (e.g., race/ethnicity, ESL, learning disability)? Learning contexts? Socioeconomic status?

Participants of this Notebook
* kaggle id
* imda27(User ID 7381393), Meredith(User ID 8162594), kimyoungwon(User ID 7177865)

* Names
* Dana(Daeun) Im, Meredith Luo, Youngwon Kim  

### Preparations for Data Analysis

In [None]:
# Installing packages (if the following packages are already available on the Kaggle system, this cell can be deleted)
!pip install pdpipe
!pip install geopandas
!pip install us

In [None]:
# Loading packages 
import datetime as dt 
import geopandas as gpd
import glob 
import matplotlib.pyplot as plt 
%matplotlib inline
import numpy as np 
import pandas as pd 
import pdpipe as pdp
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline
import re 
import seaborn as sns 
import us
import warnings
warnings.filterwarnings("ignore")

from plotly.subplots import make_subplots
from sklearn.preprocessing import scale

### Data management and visualization of **district.csv**



In [None]:
# Importing district.csv
df_district1 = pd.read_csv("/kaggle/input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")
df_district1

In [None]:
# Dropping rows with NaN in the state column and keeping all other columns
df_district2 = df_district1.copy()
df_district2.dropna(subset=['state'], inplace=True)

# Deleting the county_connections_ratio (All values are identical)
df_district2.drop(columns='county_connections_ratio', inplace=True)
df_district2

In [None]:
len(pd.unique(df_district1['state']))

In [None]:
len(pd.unique(df_district1['district_id']))

In [None]:
# Checking the state differences between the original dataset and the dataset excluding NaN
len(pd.unique(df_district1['state']))-len(pd.unique(df_district2['state']))

In [None]:
# Checking the district differences between the original dataset and the dataset excluding NaN
len(pd.unique(df_district1['district_id']))-len(pd.unique(df_district2['district_id']))

* The original dataset (districts_info.csv) has 24 states and 233 districts.
* Once we exclude missing values in the 'state' column, we have 23 states and 176 districts.
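
Part of the difference in state counts may come from how missing values are counted: `pd.unique` treats `NaN` as a value of its own, whereas `Series.nunique()` skips it by default. A minimal sketch on made-up data (the `toy` frame is hypothetical and not taken from `districts_info.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two real states plus one row with a missing state
toy = pd.DataFrame({
    "district_id": [1, 2, 3, 4],
    "state": ["Utah", "Ohio", np.nan, "Utah"],
})

n_with_nan = len(pd.unique(toy["state"]))  # NaN is counted as its own value
n_without_nan = toy["state"].nunique()     # NaN is ignored by default

print(n_with_nan, n_without_nan)
```

If only real states should be counted, `nunique()` is probably the safer choice.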

In [None]:
# Turning the two variables (black/hispanic, free/reduced) into a Likert scale
likert_scale = {'[0, 0.2[':'1', '[0.2, 0.4[':'2', '[0.4, 0.6[':'3', '[0.6, 0.8[':'4', '[0.8, 1[':'5'}
# Recoding the interval labels (plain assignment instead of a chained inplace replace)
df_district2['pct_black/hispanic'] = df_district2['pct_black/hispanic'].replace(likert_scale)
df_district2['pct_free/reduced'] = df_district2['pct_free/reduced'].replace(likert_scale)

# Turning the pp_total_raw variable into the mid-point of its interval
pp_total = df_district2['pp_total_raw'].str.replace('[', '', regex=False).str.split(',', expand=True).astype(float)
df_district2['pp_total_raw'] = (pp_total[0]+pp_total[1])/2

# Reordering locale (reorder_categories returns a new Series)
df_district2['locale'] = df_district2['locale'].astype('category')
df_district2['locale'] = df_district2['locale'].cat.reorder_categories(['Rural', 'Town','Suburb','City'])

# Dummy coding - locale
df_district2 = df_district2.join(df_district2['locale'].str.get_dummies())
df_district2

In [None]:
# Adding state abbreviation into the dataset
df_district2['state_abbr'] = [us.states.lookup(s).abbr for s in df_district2['state']]
df_district2

In [None]:
# Inspecting all the district per state
df_district2.groupby("state")["district_id"].apply(set).to_frame()

* Districts are not evenly distributed across states. Some states could be over-represented (e.g., Arizona, Florida, Minnesota, etc.).
Therefore, to find general trends, looking at locale-level data might be more reasonable.
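
One way to check how uneven the coverage is would be to count distinct districts per state and convert the counts to shares. The sketch below uses a hypothetical `toy` frame in place of `df_district2`:

```python
import pandas as pd

# Hypothetical toy data standing in for df_district2
toy = pd.DataFrame({
    "state": ["Utah", "Utah", "Utah", "Ohio", "Texas"],
    "district_id": [10, 11, 12, 20, 30],
})

# Number of distinct districts per state, largest first
districts_per_state = (
    toy.groupby("state")["district_id"].nunique().sort_values(ascending=False)
)

# Share of all districts contributed by each state
district_share = districts_per_state / districts_per_state.sum()
print(district_share)
```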

In [None]:
# Distribution and proportion of locale
fig, (ax1, ax2) = plt.subplots(1,2)
sns.countplot(x="locale", data=df_district2, ax=ax1)
# Take values and labels from the same value_counts() result so they stay aligned
locale_counts = df_district2['locale'].value_counts()
ax2.pie(locale_counts, labels=locale_counts.index,
        colors=['tab:green','tab:blue','tab:red','tab:orange'], autopct='%1.1f%%')
plt.show()

The plot above shows how the districts are distributed across locales. 59.1% of the districts are suburban, followed by rural and city districts, respectively.
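
The percentages can also be computed directly, which keeps each share attached to its label. A minimal sketch, assuming a hypothetical `locale` Series in place of `df_district2['locale']`:

```python
import pandas as pd

# Hypothetical locale column; the real one lives in df_district2['locale']
locale = pd.Series(
    ["Suburb", "Rural", "Suburb", "City", "Suburb", "Town", "Rural", "Suburb"]
)

# Values and labels come from the same Series, so they cannot get out of order
locale_share = locale.value_counts(normalize=True) * 100
print(locale_share.round(1))
```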

In [None]:
# Distribution and proportion of black/hispanic
fig, (ax1, ax2) = plt.subplots(1,2)
bh_levels = ['1','2','3','4','5']
bh_labels = ["0-20%","20-40%","40-60%","60-80%","80-100%"]
# Fix the category order so the tick and wedge labels match the Likert codes 1-5
sns.countplot(x='pct_black/hispanic', data=df_district2, order=bh_levels, ax=ax1).set_xticklabels(labels=bh_labels, rotation=45)
ax2.pie(df_district2['pct_black/hispanic'].value_counts().reindex(bh_levels, fill_value=0), labels=bh_labels, autopct='%1.1f%%')
plt.show()

* In 65.9% of the districts, the proportion of the Black/Hispanic population ranges from 0 to 20%.
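
The bracketed interval labels in `districts_info.csv` (for example `[0, 0.2[`) can be split into numeric bounds. The sketch below applies the same steps as the `pp_total_raw` mid-point conversion above to a hypothetical `bins` Series; the column names `low` and `high` are illustrative assumptions:

```python
import pandas as pd

# Hypothetical values in the same bracket notation as districts_info.csv
bins = pd.Series(["[0, 0.2[", "[0.2, 0.4[", "[0.8, 1["])

# Strip the brackets, then split "low, high" into two float columns
bounds = (
    bins.str.replace("[", "", regex=False)
        .str.split(",", expand=True)
        .astype(float)
)
bounds.columns = ["low", "high"]

# Mid-point of each interval
midpoint = (bounds["low"] + bounds["high"]) / 2
print(midpoint)
```

Passing `regex=False` makes the bracket a literal character regardless of the pandas version.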

In [None]:
#Distribution of race/ethnicity and locale
sns.displot(data=df_district2, y='pct_black/hispanic', hue= 'pct_black/hispanic', col='locale', height=5, aspect=.8).set_yticklabels(labels = ["0-20%","20-40%","40-60%","60-80%","80-100%"])

* Districts located in suburban and city areas are likely to have higher proportions of Black/Hispanic populations.
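
A row-normalised cross-tabulation is one way to put numbers on this pattern. The sketch below uses hypothetical toy rows; the Likert codes follow the 1-5 recoding used above:

```python
import pandas as pd

# Hypothetical toy data; Likert codes follow the 1-5 recoding used above
toy = pd.DataFrame({
    "locale": ["Rural", "Rural", "Suburb", "Suburb", "City", "City"],
    "pct_black/hispanic": ["1", "1", "1", "3", "4", "5"],
})

# Row-normalised table: share of each Likert level within a locale
share_by_locale = pd.crosstab(
    toy["locale"], toy["pct_black/hispanic"], normalize="index"
)
print(share_by_locale.round(2))
```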

### Data management and visualization of **products_info.csv**

In [None]:
# Importing products.csv
df_products = pd.read_csv("/kaggle/input/learnplatform-covid19-impact-on-digital-learning/products_info.csv")
df_products

In [None]:
# Dividing sector(s) into dummy variables
df_sectors = df_products['Sector(s)'].str.get_dummies(sep="; ")
df_sectors.columns = ["Sector_Corporate","Sector_HigherEd","Sector_Prek-12"]
df_products = df_products.join(df_sectors)
df_products

In [None]:
# Splitting 'Primary Essential Function' on ' - ' into a main level and sub levels;
# `x == x` is False only for NaN, so missing values are passed through unchanged
df_products['primary_function_main'] = df_products['Primary Essential Function'].apply(lambda x: x.split(' - ')[0] if x == x else x)
df_products['primary_function_sub1'] = df_products['Primary Essential Function'].apply(lambda x: x.split(' - ')[1] if x == x else x)
df_products['primary_function_sub2'] = df_products['Primary Essential Function'].apply(lambda x: x.split(' - ')[-1] if x == x else x)

df_products

In [None]:
# Blanking out sub2 where it merely repeats sub1 (the function has only two levels)
same_sub = df_products['primary_function_sub1'] == df_products['primary_function_sub2']
df_products.loc[same_sub, 'primary_function_sub2'] = None
df_products

In [None]:
# Deleting a duplicate column from the dataframe
df_products = df_products.drop(["Primary Essential Function"], axis=1)
df_products

In [None]:
# Counting the number of unique product names
len(df_products["Product Name"].unique())

In [None]:
# Counting the number of unique provider/company names
len(df_products["Provider/Company Name"].unique())

In [None]:
# Distribution of the top 10 provider/company names by number of products
sns.countplot(y='Provider/Company Name', data=df_products, order=df_products["Provider/Company Name"].value_counts().index[:10])
plt.title("Top 10 Provider/Company Names")
plt.show()

In [None]:
# Subsetting top 10 provider/company name
df_products2 = df_products[df_products["Provider/Company Name"].isin(list(df_products["Provider/Company Name"].value_counts().index[:10]))].copy()

# Reordering 'Sector(s)' (reorder_categories returns a new Series)
df_products2['Sector(s)'] = df_products2['Sector(s)'].astype('category')
df_products2['Sector(s)'] = df_products2['Sector(s)'].cat.reorder_categories(['PreK-12','PreK-12; Higher Ed; Corporate','PreK-12; Higher Ed'])

* Google LLC is the top provider among 291 companies. We therefore focus on Google data to see what the learning trend looked like during 2020 and what we can learn from this company that might help improve other learning platforms.

### Data management and visualization of **engagement** data sets

In [None]:
# Importing all files in the engagement folder and merging them into one file
all_files = glob.glob("../input/learnplatform-covid19-impact-on-digital-learning/engagement_data" + "/*.csv")

merged_df = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    # add district_id from the data file name
    df["district_id"] = filename.replace("\\", "/").split("/")[-1].split(".")[0]
    merged_df.append(df)

df_engagement = pd.concat(merged_df, axis=0, ignore_index=True)

df_engagement.head()

In [None]:
# Checking the number of district id
len(df_engagement["district_id"].unique())

In [None]:
# Checking the number of lp id (products)
len(df_engagement["lp_id"].unique())

In [None]:
# Checking the types of variables
df_engagement.info()

### Exploratory data analysis of the *merged dataset*




In [None]:
# Converting variable types
convert_dict = {'district_id': 'int64'}
df_engagement = df_engagement.astype(convert_dict)
df_engagement['time'] = pd.to_datetime(df_engagement['time'])

# Checking the types of variables again
df_engagement.info()

In [None]:
# Merging district data with engagement data by district_ID
merge_en_dist = pd.merge(df_engagement, df_district1, on='district_id')
merge_en_dist

In [None]:
merge_en_dist = merge_en_dist.drop(columns='pp_total_raw')

In [None]:
# Renaming 'LP ID' to 'lp_id' so the product data can be merged with the engagement data (Dana)
df_new_pro = df_products.rename(columns={'LP ID':'lp_id'})

# Merging product data with engagement data by LPID(Dana)
merge_all = pd.merge(merge_en_dist, df_new_pro, on='lp_id')

In [None]:
len(merge_all["lp_id"].unique())

In [None]:
len(merge_all["Provider/Company Name"].unique())

In [None]:
len(merge_all["state"].unique())

In [None]:
len(merge_all["district_id"].unique())

* Once we merge all three files, we have 369 different products, 289 providers/companies, 23 states, and 176 districts.
* The total number of rows is 9,139,701.
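
Because `pd.merge` defaults to an inner join, rows without a match are dropped silently. A hedged sketch of how the merges could be audited, using hypothetical toy tables (`engagement`, `districts`) that only mirror the key structure of the real data:

```python
import pandas as pd

# Hypothetical toy tables mirroring the engagement/district key structure
engagement = pd.DataFrame({
    "district_id": [1, 1, 2, 3],
    "pct_access": [0.5, 0.7, 0.1, 0.9],
})
districts = pd.DataFrame({"district_id": [1, 2], "state": ["Utah", "Ohio"]})

# An outer merge with indicator=True labels where each row came from
check = pd.merge(engagement, districts, on="district_id", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]

# validate="many_to_one" raises if district_id is not unique in the right table
merged = pd.merge(engagement, districts, on="district_id", validate="many_to_one")
print(len(merged), len(unmatched))
```

Here the engagement row for district 3 has no district record, so it appears in `unmatched` and is absent from `merged`.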

In [None]:
# Checking the proportions of missingness
merge_all.isnull().sum() / len(merge_all)

Given the proportions of missing values, we are going to rely on pct_access more than engagement_index, since it has fewer missing values.
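
The same comparison can be written with `isna().mean()`, which returns the fraction of missing values per column. The `toy` frame below is hypothetical and only illustrates the pattern:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with more gaps in engagement_index than in pct_access
toy = pd.DataFrame({
    "pct_access": [0.1, 0.2, np.nan, 0.4],
    "engagement_index": [10.0, np.nan, np.nan, 5.0],
})

# Fraction of missing values per column, least missing first
missing_share = toy.isna().mean().sort_values()
print(missing_share)
```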

# Usage Trend over Time

Since the data set contains multiple time stamps, we focused on students' usage over time. 
For the analysis, we looked at the data in several ways to examine students' behavioral trends, and it turned out that they all show a similar trend over time. 

#### Trend of Average Percent Access by State

In [None]:
pct_time_mean_s = merge_all.groupby(['time', 'state'])['pct_access'].mean()
pct_time_mean_s = pct_time_mean_s.reset_index()
pct_time_mean_s

In [None]:
fig = px.line(pct_time_mean_s, x="time", y="pct_access", color='state',title='Percentage of access in a given day in each state',
               template="ggplot2", width=2000, height=800)
fig.show()

#### Trend of Average Engagement Index by State

In [None]:
en_time_mean_s = merge_all.groupby(['time', 'state'])['engagement_index'].mean()
en_time_mean_s = en_time_mean_s.reset_index()
en_time_mean_s

In [None]:
fig = px.line(en_time_mean_s, x="time", y="engagement_index", color='state',title='Average page loads of students on a given day in each state',
               template="ggplot2", width=2000, height=800)
fig.show()

#### Trend of Average Percent Access by Locale

In [None]:
pct_time_mean_l = merge_all.groupby(['time', 'locale'])['pct_access'].mean()
pct_time_mean_l = pct_time_mean_l.reset_index()
pct_time_mean_l

In [None]:
fig = px.line(pct_time_mean_l, x="time", y="pct_access", color='locale',title='percentage of access in a given day by locale',
               template="ggplot2", width=2000, height=800)
fig.show()

#### Trend of Average Engagement Index by Locale

In [None]:
en_time_mean_l = merge_all.groupby(['time', 'locale'])['engagement_index'].mean()
en_time_mean_l = en_time_mean_l.reset_index()
en_time_mean_l

In [None]:
fig = px.line(en_time_mean_l, x="time", y="engagement_index", color='locale',title='Average page loads of students on a given day by locale',
               template="ggplot2", width=2000, height=800)
fig.show()

According to all the above plots, the learning platforms were used mainly during the academic year, and their usage drops over the summer break. From this, we can infer that the platforms were mostly used to support students' learning during the school year.
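
The summer dip can also be summarised numerically by averaging `pct_access` per calendar month. The sketch below uses a made-up daily series; the values are illustrative assumptions, not results from the dataset:

```python
import pandas as pd

# Hypothetical daily series: higher access in spring and fall, lower in summer
days = pd.date_range("2020-05-01", "2020-09-30", freq="D")
toy = pd.DataFrame({"time": days})
toy["pct_access"] = toy["time"].dt.month.map({5: 0.6, 6: 0.2, 7: 0.1, 8: 0.3, 9: 0.7})

# Average by calendar month to compare summer with the school year
monthly_mean = toy.groupby(toy["time"].dt.month)["pct_access"].mean()
print(monthly_mean)
```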

#### Google products

The analysis below uses Google data because Google accounts for the most products among all providers. This can be seen in the bar plot above, which shows the product count of each provider.

In [None]:
# Keep the columns used below: time, pct_access, locale, Provider/Company Name, pct_black/hispanic, pct_free/reduced
df_all = merge_all[['time','pct_access','locale','Provider/Company Name','pct_black/hispanic','pct_free/reduced']].copy()

In [None]:
# For each locale, aggregate their pct_access on each day
df_locale = df_all.groupby(['locale', 'time'], as_index=False).agg({'pct_access':'mean'})

In [None]:
# Google dataset
# na=False keeps rows with a missing provider name out of the Google group
mask = df_all['Provider/Company Name'].str.contains('Google', na=False)
df_all['Google'] = np.where(mask, 1, 0)
# share of Google rows for each locale on each day
df_google = df_all.groupby(['locale', 'time'], as_index=False).agg({'Google':'mean'})

In [None]:
# Plot average daily access by locale
fig = px.line(df_locale, x="time", y="pct_access", color='locale',
              labels = {
                  'pct_access':'Access Index'
              },
              title='Percentage of students with at least one-page load event on a given day',
               template="ggplot2", width=1500, height=500)
fig.show()

Based on the above plot, **rural** districts have the highest percentage of students with at least one page-load event throughout the period, while **city** districts sometimes have the lowest. This result is the opposite of what we initially expected. 

In the following section, we will see a more detailed picture for each locale category based on their percentage of black/hispanic and percentage of free/reduced cost lunch. We want to see if the result will also be unexpected. 

In [None]:
# For each locale, average pct_access per day within each pct_black/hispanic level
df_rural_bh = df_all[df_all['locale'] == 'Rural']
df_rural_bh = df_rural_bh.groupby(['pct_black/hispanic', 'time'], as_index=False).agg({'pct_access':'mean'})

df_suburb_bh = df_all[df_all['locale'] == 'Suburb']
df_suburb_bh = df_suburb_bh.groupby(['pct_black/hispanic', 'time'], as_index=False).agg({'pct_access':'mean'})

df_city_bh = df_all[df_all['locale'] == 'City']
df_city_bh = df_city_bh.groupby(['pct_black/hispanic', 'time'], as_index=False).agg({'pct_access':'mean'})

df_town_bh = df_all[df_all['locale'] == 'Town']
df_town_bh = df_town_bh.groupby(['pct_black/hispanic', 'time'], as_index=False).agg({'pct_access':'mean'})

In [None]:
fig = px.line(df_rural_bh, x="time", y="pct_access", color='pct_black/hispanic',
                            labels = {
                  'pct_access':'Access Index'
              },
              title='Rural Area Percentage of Access on a Given Day',
               template="ggplot2", width=1500, height=500)
fig.update_yaxes(range=[0, 3])
fig.show()

In [None]:
fig = px.line(df_suburb_bh, x="time", y="pct_access", color='pct_black/hispanic',
                            labels = {
                  'pct_access':'Access Index'
              },
              title='Suburb Area Percentage of Access on a Given Day',
               template="ggplot2", width=1500, height=500)
fig.update_yaxes(range=[0, 3])
fig.show()

In [None]:
fig = px.line(df_city_bh, x="time", y="pct_access", color='pct_black/hispanic',
                            labels = {
                  'pct_access':'Access Index'
              },
              title='City Area Percentage of Access on a Given Day',
               template="ggplot2", width=1500, height=500)
fig.update_yaxes(range=[0, 3])
fig.show()

In [None]:
fig = px.line(df_town_bh, x="time", y="pct_access", color='pct_black/hispanic',
                            labels = {
                  'pct_access':'Access Index'
              },
              title='Town Area Percentage of Access on a Given Day',
               template="ggplot2", width=1500, height=500)
fig.update_yaxes(range=[0, 3])
fig.show()

In [None]:
# plot it
fig = px.line(df_google, x="time", y="Google", color='locale',
              labels = {
                  'Google':'Percentage of Google Usage'
              },
              title='Percentage of Google usage on a given day',
               template="ggplot2", width=1500, height=500)
fig.show()

In the previous section, we found that Google is the company that appears most frequently in this dataset. Therefore, we want to explore whether the share of Google product usage changes throughout the pandemic period and whether it differs among locales. 

As the plot above shows, the share of Google products increased substantially from June to August, which corresponds to the summer break. Additionally, compared to city and suburb districts, town and rural districts had a higher share of Google products.
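
The share plotted above is the mean of a 0/1 Google flag within each group. A minimal sketch on hypothetical toy rows, with one provider name left missing on purpose to show the effect of `na=False`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; one provider name is missing on purpose
toy = pd.DataFrame({
    "locale": ["Rural", "Rural", "City", "City"],
    "Provider/Company Name": ["Google LLC", "Google LLC", "Zoom", np.nan],
})

# na=False treats a missing provider as "not Google"
toy["Google"] = (
    toy["Provider/Company Name"].str.contains("Google", na=False).astype(int)
)

# The mean of a 0/1 flag is the share of rows that belong to Google
google_share = toy.groupby("locale")["Google"].mean()
print(google_share)
```

Note that this measures the share of engagement rows tied to Google products, which is not the same as the share of students using them.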

# Interesting fact about Google by its function

In [None]:
google_function = merge_all[merge_all['Provider/Company Name'] == 'Google LLC'] 
avg_google_func = google_function.groupby(['time', 'primary_function_sub1'])['pct_access'].mean()
avg_google_func = avg_google_func.reset_index()
avg_google_func

In [None]:
fig = px.line(avg_google_func, x="time", y="pct_access", color='primary_function_sub1',title='Average Percentage of Access of Google by its Function',
               template="ggplot2", width=2000, height=800)
fig.show()

An interesting aspect of this plot is that Google's products are broken down by function. Among all the functions that Google provides, 'Learning Management System' was the most used. Given that Google products were used mostly during the academic year, this suggests that they were used to facilitate students' learning. It seems that instructors were using Google products to manage students during class sessions. 

According to https://elearningindustry.com/google-classroom-a-free-learning-management-system-for-elearning, Google Classroom, a 'learning management system' (LMS), lets instructors hand out assignments through Google Docs (paperless), set assignment deadlines, and use the collected student data to give students feedback on their learning process. Another interesting feature of this platform is that it can track students' access to the platform, which enables teachers to monitor students' activity and give them personalized guidance.  
This is notable and could be critical for the learning platform industry. Because of COVID-19, students had to take classes from home, so teachers were not able to manage students in person. From this, we can infer that instructors were trying to maintain student management through learning platforms. 
 
What we can take away from this graph is that, among all the functions, LMS shows dominant usage. Since Google is one of the most used learning platforms, other platforms could improve or adopt LMS functions to strengthen their products and possibly increase their number of users as well.  
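
To rank functions by a single number instead of reading the line plot, the mean `pct_access` per function can be sorted. The function labels and values below are hypothetical placeholders, not results from the dataset:

```python
import pandas as pd

# Hypothetical toy rows; labels stand in for values of primary_function_sub1
toy = pd.DataFrame({
    "primary_function_sub1": ["LMS", "Content Creation", "LMS", "Content Creation"],
    "pct_access": [0.8, 0.2, 0.6, 0.4],
})

# Mean access per function, highest first
ranking = (
    toy.groupby("primary_function_sub1")["pct_access"]
       .mean()
       .sort_values(ascending=False)
)
print(ranking)
```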

In [None]:
# Exporting the district_id column of the merged data
data_to_submit = pd.DataFrame({
    'district_id':merge_all['district_id']
})
data_to_submit.to_csv('csv_to_submit.csv', index = False)