# **Problem Statement**

The COVID-19 pandemic has disrupted learning for more than 56 million students in the United States. In the spring of 2020, most states and local governments across the U.S. closed educational institutions to stop the spread of the virus. In response, schools and teachers have attempted to reach students remotely through distance learning tools and digital platforms. To this day, concerns about the worsening digital divide and long-term learning loss among America’s most vulnerable learners continue to grow.

# **Description of the Product Data**

The product file products_info.csv includes information about the characteristics of the top 372 products with the most users in 2020. The categories listed in this file are part of LearnPlatform's product taxonomy. Data were labeled by our team. Some products may not have labels because they are duplicates, lack an accurate URL, or for other reasons.

# **Description of the District Data**

The district file districts_info.csv includes information about the characteristics of school districts, including data from NCES (2018-19), FCC (Dec 2018), and Edunomics Lab. In this data set, we removed the identifiable information about the school districts. We also used an open source tool, ARX (Prasser et al. 2020), to transform several data fields and reduce the risks of re-identification. For data generalization purposes, some data points are released as a range within which the actual value falls. Additionally, many missing values are marked as 'NaN', indicating that the data were suppressed to maximize anonymization of the dataset.
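
Because several district fields are released as ranges, it can help to turn them into numbers before aggregating. Below is a minimal sketch, assuming the ranges are stored as strings such as `[0.2, 0.4[`. That format and the helper name `range_midpoint` are assumptions, not something stated in the description, so check the actual column values first. The helper returns the midpoint of the range and NaN for suppressed values.

```python
import math
import re

# Assumed format: "[low, high[" (hypothetical; verify against the real columns).
_RANGE_RE = re.compile(r"^\[\s*([0-9.]+)\s*,\s*([0-9.]+)\s*\[$")

def range_midpoint(value):
    """Return the midpoint of a "[low, high[" string, or NaN if it cannot be parsed."""
    if not isinstance(value, str):
        return math.nan  # suppressed values are read by pandas as float NaN
    match = _RANGE_RE.match(value.strip())
    if match is None:
        return math.nan
    low, high = float(match.group(1)), float(match.group(2))
    return (low + high) / 2

print(range_midpoint("[0.2, 0.4["))
print(range_midpoint(float("nan")))
```

With a real column this could be applied as `districts['pct_free/reduced'].apply(range_midpoint)`, provided the column really uses the assumed format.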

# **Description of the Engagement Data**

The engagement data are aggregated at the school district level, and each file in the folder engagement_data represents data from one school district. The 4-digit file name represents district_id, which can be used to link to district information in districts_info.csv. The lp_id can be used to link to product information in products_info.csv.

# **Challenge**

# **Importing the libraries**

In [None]:
import numpy as np
import pandas as pd
import pandas_profiling as pp
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import glob 
import gc
%matplotlib inline
sns.set()

# **Reading the data**

In [None]:
products = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv")
products.head()

In [None]:
districts = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")
districts.head()

In [None]:
engagement_path = "../input/learnplatform-covid19-impact-on-digital-learning/engagement_data"
engagement_data = glob.glob(engagement_path + '/*.csv')
frames = []  # not named `list`, which would shadow the built-in

for path in engagement_data:
    data = pd.read_csv(path, index_col = None, header = 0)
    # the file name (without extension) is the district_id
    district_id = path.split('/')[-1].split('.')[0]
    data['district_id'] = district_id
    frames.append(data)

In [None]:
engagement = pd.concat(frames)
engagement = engagement.reset_index(drop = True)

In [None]:
engagement

# **Showing Profile Report of Products data**

In [None]:
pp.ProfileReport(products)

# **Showing Profile Report of Districts data**

In [None]:
pp.ProfileReport(districts)

In [None]:
# rename 'LP ID' to 'lp_id' so the key has the same name in products and engagement
products = products.rename({'LP ID': 'lp_id'}, axis = 1)

In [None]:
# district_id was taken from the file name, so it is a string;
# convert it to int so it matches district_id in the districts data
engagement['district_id'] = engagement['district_id'].astype(int)
engagement['time'] = pd.to_datetime(engagement['time'])

# **Missing Values**

In [None]:
# engagement_index is highly skewed, so the median is a more robust
# fill value than the mean

median = engagement['engagement_index'].median()

In [None]:
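# assign the result back; in-place fillna on a selected column is unreliable in recent pandas
engagement['engagement_index'] = engagement['engagement_index'].fillna(median)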

In [None]:
engagement['engagement_index'].isnull().sum()
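
To illustrate why the median is the safer fill value for a skewed column, here is a small sketch on made-up numbers. The values are invented for illustration and are not taken from the engagement files.

```python
import numpy as np
import pandas as pd

# Toy, right-skewed values with one missing entry (invented for illustration).
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 1000.0, np.nan])

mean_value = toy.mean()      # pulled up by the single large value
median_value = toy.median()  # stays close to the bulk of the data

filled = toy.fillna(median_value)
print(mean_value, median_value, filled.isnull().sum())
```

Filling with the mean would put the missing row far above four of the five observed values, while the median keeps it among the typical ones.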

In [None]:
# I want to aggregate the data by state later,
# so I fill the missing values in state and locale

districts['state'] = districts['state'].fillna('Other')
districts['locale'] = districts['locale'].fillna('unidentified')

In [None]:
# The analysis focuses on educational products, so I drop the products
# with missing values in the last two columns

products = products.dropna(subset = ['Sector(s)','Primary Essential Function'])

In [None]:
# drop the engagement rows that have no pct_access value
engagement = engagement.dropna(subset = ['pct_access'])


# **Question 1. What is the picture of digital connectivity and engagement in 2020? Engagement distribution plot will be the answer**


In [None]:
plt.style.use('dark_background')
plt.plot(engagement['time'], engagement['engagement_index'],color = 'red')
plt.title('Engagement Index in 2020',fontsize = 20)
plt.xlabel('Time', fontsize = 15)
plt.ylabel('Index', fontsize = 15)
plt.show()
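
The line plot above connects every row of every district in file order, which tends to produce a dense band rather than a readable trend. One alternative is to average the index per day first. The sketch below shows the idea on a toy frame. `daily_mean` is a hypothetical helper, the values are invented, and the column names `time` and `engagement_index` are the ones used in this notebook.

```python
import pandas as pd

def daily_mean(df):
    """Average engagement_index per calendar day, sorted by date."""
    return (
        df.groupby('time')['engagement_index']
          .mean()
          .sort_index()
    )

# Toy data standing in for the real engagement frame (values are invented).
toy = pd.DataFrame({
    'time': pd.to_datetime(['2020-01-02', '2020-01-01', '2020-01-01', '2020-01-02']),
    'engagement_index': [10.0, 2.0, 4.0, 30.0],
})

trend = daily_mean(toy)
print(trend)
```

On the real data, `plt.plot(trend.index, trend.values)` would then draw one point per day instead of one per row.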

# **Question 2. What is the effect of the COVID-19 pandemic on online and distance learning, and how might this also evolve in the future?**

In [None]:
gc.collect()

# **Question 3. How does student engagement with different types of education technology change over the course of the pandemic?**

In [None]:
engagement = (
    engagement.merge(districts, on = 'district_id')
              .merge(products, on = 'lp_id')
              .sort_values(by = ['time'])
)
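
Both merges use the pandas default inner join, so engagement rows whose `district_id` or `lp_id` has no match are dropped without warning. Here is a minimal sketch on toy frames (all values invented) of how `indicator = True` can count what an inner join would lose.

```python
import pandas as pd

# Toy frames standing in for engagement and products (values are invented).
toy_engagement = pd.DataFrame({
    'lp_id': [1, 2, 3, 3],
    'engagement_index': [5.0, 6.0, 7.0, 8.0],
})
toy_products = pd.DataFrame({
    'lp_id': [1, 3],
    'Sector(s)': ['PreK-12', 'Corporate'],
})

# A left join with an indicator column marks rows that found no product.
checked = toy_engagement.merge(toy_products, on = 'lp_id', how = 'left', indicator = True)
unmatched = int((checked['_merge'] == 'left_only').sum())

# The default inner join keeps only the matched rows.
inner = toy_engagement.merge(toy_products, on = 'lp_id')
print(unmatched, len(inner))
```

Running the same check on the real frames before merging shows how much engagement data leaves the analysis because a product was unlabeled or dropped earlier.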

In [None]:
engagement.head()

In [None]:
plt.plot(engagement.loc[engagement['Sector(s)'] == 'PreK-12']['time'],
        engagement.loc[engagement['Sector(s)'] == 'PreK-12']['engagement_index'],
        color = 'orange')
plt.title('Engagement Index in 2020 - PreK-12',fontsize = 20)
plt.xlabel("Time",fontsize = 15)
plt.ylabel('Index',fontsize = 15)
plt.show()

In [None]:
plt.style.use('dark_background')
plt.plot(engagement.loc[engagement['Sector(s)'] == 'PreK-12; Higher Ed; Corporate']['time'],
        engagement.loc[engagement['Sector(s)'] == 'PreK-12; Higher Ed; Corporate']['engagement_index'], color = 'darkgreen')
plt.title("Engagement Index in 2020 - PreK-12; Higher Ed; Corporate",fontsize = 20)
plt.xlabel('Time',fontsize = 15)
plt.ylabel('Index',fontsize = 15)
plt.show()

In [None]:
plt.style.use('dark_background')
plt.plot(engagement.loc[engagement['Sector(s)'] == 'PreK-12; Higher Ed']['time'],
        engagement.loc[engagement['Sector(s)'] == 'PreK-12; Higher Ed']['engagement_index'], color = 'darkcyan')
plt.title('Engagement Index in 2020 - PreK-12; Higher Ed',fontsize = 20)
plt.xlabel('Time',fontsize = 15)
plt.ylabel('Index',fontsize = 15)
plt.show()

In [None]:
plt.style.use('dark_background')
plt.plot(engagement.loc[engagement['Sector(s)'] == 'Corporate']['time'],
        engagement.loc[engagement['Sector(s)'] == 'Corporate']['engagement_index'], color = 'firebrick')
plt.title('Engagement Index in 2020 - Corporate',fontsize = 20)
plt.xlabel('Time',fontsize = 15)
plt.ylabel('Index',fontsize = 15)
plt.show()

In [None]:
plt.style.use('dark_background')
plt.plot(engagement.loc[engagement['Sector(s)'] == 'Higher Ed; Corporate']['time'],
        engagement.loc[engagement['Sector(s)'] == 'Higher Ed; Corporate']['engagement_index'], color = 'steelblue')
plt.title('Engagement Index in 2020 - Higher Ed; Corporate',fontsize = 20)
plt.xlabel('Time',fontsize = 15)
plt.ylabel('Index',fontsize = 15)
plt.show()
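
The five cells above repeat the same filter for each sector label. As a rough sketch, the selection could be moved into a small helper and called in a loop. `sector_series` is a hypothetical name and the toy values are invented; the filter matches the exact `Sector(s)` string, as the cells above do.

```python
import pandas as pd

def sector_series(df, sector):
    """Return the time and engagement_index columns for one exact 'Sector(s)' label."""
    subset = df.loc[df['Sector(s)'] == sector]
    return subset['time'], subset['engagement_index']

# Toy data standing in for the merged engagement frame (values are invented).
toy = pd.DataFrame({
    'time': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-01']),
    'engagement_index': [1.0, 2.0, 3.0],
    'Sector(s)': ['PreK-12', 'PreK-12', 'Corporate'],
})

for sector in toy['Sector(s)'].unique():
    times, index = sector_series(toy, sector)
    print(sector, len(times))
```

Inside the loop, `plt.plot(times, index)` followed by the title and axis labels would reproduce one plot per sector.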

# **Question 4. How does student engagement with online learning platforms relate to different geography? Demographic context (e.g., race/ethnicity, ESL, learning disability)? Learning context? Socioeconomic status?**

# **Geography**

In [None]:
plt.style.use('dark_background')
plt.plot(engagement.loc[engagement['state'] == 'Connecticut']['time'],
        engagement.loc[engagement['state'] == 'Connecticut']['engagement_index'], color = 'mediumseagreen')
plt.title('Engagement Index in 2020 - Connecticut',fontsize = 20)
plt.xlabel('Time',fontsize = 15)
plt.ylabel('Index',fontsize = 15)
plt.show()

In [None]:
plt.style.use('dark_background')
plt.plot(engagement.loc[engagement['state'] == 'Utah']['time'],
        engagement.loc[engagement['state'] == 'Utah']['engagement_index'], color = 'royalblue')
plt.title('Engagement Index in 2020 - Utah',fontsize = 20)
plt.xlabel('Time',fontsize = 15)
plt.ylabel('Index',fontsize = 15)
plt.show()

In [None]:
plt.style.use('dark_background')
plt.plot(engagement.loc[engagement['state'] == 'Other']['time'],
        engagement.loc[engagement['state'] == 'Other']['engagement_index'], color = 'slateblue')
plt.title('Engagement Index in 2020 - Other',fontsize = 20)
plt.xlabel('Time',fontsize = 15)
plt.ylabel('Index',fontsize = 15)
plt.show()

## The plots look similar, so the states should be compared using another type of plot

In [None]:
plt.style.use('ggplot')
plt.plot(engagement.groupby('state').agg({'engagement_index' : ['mean']}))
plt.xticks(rotation = 90)
plt.show()
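
A sorted summary table is another way to compare states side by side. The sketch below uses a toy frame with invented values; `state_summary` is a hypothetical helper, and reporting the median and row count next to the mean is my own choice for illustration.

```python
import pandas as pd

def state_summary(df):
    """Mean, median and row count of engagement_index per state, highest mean first."""
    return (
        df.groupby('state')['engagement_index']
          .agg(['mean', 'median', 'count'])
          .sort_values('mean', ascending = False)
    )

# Toy data standing in for the merged engagement frame (values are invented).
toy = pd.DataFrame({
    'state': ['Utah', 'Utah', 'Connecticut', 'Connecticut', 'Connecticut'],
    'engagement_index': [10.0, 20.0, 100.0, 300.0, 200.0],
})

summary = state_summary(toy)
print(summary)
```

The row count helps judge whether a high mean rests on many observations or only a few.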

In [None]:
fig = px.bar(engagement.groupby(['state','pct_black/hispanic'])['district_id'].count().reset_index(name = 'Total count'),
             x = 'state', y = 'Total count', color = 'pct_black/hispanic')
fig.update_layout(legend = dict(orientation = 'h', yanchor = 'bottom', y = 1.0, 
                                xanchor = 'right', x = 0.5))
fig.update_xaxes(categoryorder = 'category descending')
fig.show()

In [None]:
fig = px.bar(engagement.groupby(['state','pct_free/reduced'])['district_id'].count().reset_index(name = 'Total count'),
                               x = 'state', y = 'Total count', color = 'pct_free/reduced')
fig.update_layout(legend = dict(orientation = 'h', yanchor = 'bottom',
                               y = 1.0, xanchor = 'right', x = 0.5))
fig.update_xaxes(categoryorder = 'category ascending')
fig.show()

In [None]:
fig = px.bar(engagement.groupby(['state', 'county_connections_ratio'])['district_id'].count().reset_index(name = 'Total count'),
            x = 'state', y = 'Total count', color = 'county_connections_ratio')
fig.update_layout(legend = dict(orientation = 'v',
                               yanchor = 'bottom',
                               y = 1.09, xanchor = 'right', x = 1.10))
fig.update_xaxes(categoryorder = 'category ascending')
fig.show()

## I tried comparing the mean engagement index aggregated by state with ethnic and socioeconomic status, but these plots did not show any pattern

# **Overall engagement variability: ethnicity**

In [None]:
fig, ax1 = plt.subplots()
ax1.plot(engagement.groupby(['pct_black/hispanic']).agg({'engagement_index' : ['mean']}), color = 'red')
ax1.tick_params(axis = 'y', labelcolor = 'red')
ax2 = ax1.twinx()
ax2.plot(engagement.groupby(['pct_black/hispanic']).agg({'engagement_index': ['count']}), color = 'blue')
ax2.tick_params(axis = 'y', labelcolor = 'blue')
plt.show()

# **Socioeconomics**

In [None]:
fig, ax1 = plt.subplots()
ax1.plot(engagement.groupby(['pct_free/reduced']).agg({'engagement_index' : ['mean']}), color = 'red')
ax1.tick_params(axis ='y', labelcolor = 'red')
ax2 = ax1.twinx()
ax2.plot(engagement.groupby(['pct_free/reduced']).agg({'engagement_index' : ['count']}), color = 'blue')
ax2.tick_params(axis = 'y', labelcolor = 'blue')
plt.show()

# **County Connections**

In [None]:
plt.plot(engagement.groupby(['county_connections_ratio']).agg({'engagement_index' : ['mean']}))
plt.show()

## The county connections plot does not show anything interesting

# **Engagement by state and black/hispanic percentage**

In [None]:
# 'seaborn-whitegrid' is no longer a valid style name in recent matplotlib; set it through seaborn
sns.set_style('whitegrid')
sns.catplot(x = 'pct_black/hispanic',y = 'engagement_index', col = 'state',
           col_wrap = 2, hue = 'pct_black/hispanic',
           data = engagement.groupby(['state','pct_black/hispanic'])['engagement_index'].mean().reset_index(),
           kind = 'bar')
plt.show()

# **Engagement by state and socioeconomic status**

In [None]:
sns.catplot(x = 'pct_free/reduced', y = 'engagement_index',
           col = 'state', col_wrap = 2, hue = 'pct_free/reduced',
           data = engagement.groupby(['state', 'pct_free/reduced'])['engagement_index'].mean().reset_index(), kind = 'bar')
plt.show()