*The state of digital learning in 2020
*How the engagement of digital learning relates to factors such as district demographics, broadband access, and state/national level policies and events.

 Guiding questions for analysis 

*What is the picture of digital connectivity and engagement in 2020?
*How does student engagement with different types of education technology change over the course of the pandemic?
*How does student engagement with online learning platforms relate to geography and demographic context?
*Do certain state interventions, practices, or policies (e.g., stimulus, reopening, eviction moratorium) correlate with an increase or decrease in online engagement?

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import glob
import os
import warnings

In [None]:
districts=pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')
products=pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('float_format', '{:f}'.format)
warnings.filterwarnings('ignore')


In [None]:

path='../input/learnplatform-covid19-impact-on-digital-learning/engagement_data'
# Collect every engagement CSV; each file is named after its district id
engagement_files = sorted(glob.glob(os.path.join(path, '*.csv')))

frames = []
for file in engagement_files:
    # Read the district id from the file name rather than a fixed character offset
    district_id = os.path.splitext(os.path.basename(file))[0]
    engagement_file = pd.read_csv(file)
    engagement_file['id'] = district_id
    frames.append(engagement_file)

# Concatenate once; concatenating inside the loop copies the data on every iteration
engagement = pd.concat(frames, ignore_index=True)
    


In [None]:
engagement['id']=engagement['id'].astype('int64')


In [None]:
mapping_1 = {
    '[0, 0.2[': '0%-20%',
    '[0.2, 0.4[': '20%-40%',
    '[0.4, 0.6[': '40%-60%',
    '[0.6, 0.8[': '60%-80%',
    '[0.8, 1[': '80%-100%'}

mapping_2 = {
    '[4000, 6000[': '4000-6000',
    '[6000, 8000[': '6000-8000',
    '[8000, 10000[': '8000-10000',
    '[10000, 12000[': '10000-12000',
    '[12000, 14000[': '12000-14000',
    '[14000, 16000[': '14000-16000',
    '[16000, 18000[': '16000-18000',
    '[18000, 20000[': '18000-20000',
    '[20000, 22000[': '20000-22000',
    '[22000, 24000[': '22000-24000',
    '[32000, 34000[': '32000-34000'}

mapping_3 = {
    '[0.18, 1[': '18%-100%',
    '[1, 2[': '100%-200%'
}
districts['pct_black/hispanic'] = districts['pct_black/hispanic'].map(mapping_1)
districts['pct_free/reduced'] = districts['pct_free/reduced'].map(mapping_1)
districts['county_connections_ratio'] = districts['county_connections_ratio'].map(mapping_3)
districts['pp_total_raw'] = districts['pp_total_raw'].map(mapping_2)
districts.head()
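
The three mapping dictionaries above list every bracket by hand, so any bracket missing from a dictionary silently becomes NaN after `.map`. As an alternative, the label can be derived from the bracket string itself. The following is a minimal sketch, assuming every bracket has the form `[low, high[`; the helper name `bracket_to_label` and its `as_percent` flag are hypothetical and not part of the original notebook.

```python
import math

def bracket_to_label(bracket, as_percent=False):
    """Turn a half-open bracket string such as '[0.2, 0.4[' into '20%-40%'.

    Hypothetical helper: assumes the '[low, high[' format seen in the mappings.
    Missing values (None or NaN) are returned unchanged.
    """
    if bracket is None or (isinstance(bracket, float) and math.isnan(bracket)):
        return bracket
    # Drop the surrounding '[' characters, then split into the two bounds
    low, high = (part.strip() for part in bracket.strip('[').split(','))
    if as_percent:
        # Convert fractions to whole percentages, e.g. 0.2 -> 20%
        return f'{round(float(low) * 100)}%-{round(float(high) * 100)}%'
    return f'{low}-{high}'

print(bracket_to_label('[0.2, 0.4[', as_percent=True))
print(bracket_to_label('[4000, 6000['))
```

With this helper, a column could be relabelled with something like `districts['pp_total_raw'].map(bracket_to_label)`, and the percentage columns with `lambda b: bracket_to_label(b, as_percent=True)`.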

In [None]:
districts['district_id']=districts['district_id'].astype('int64')


In [None]:

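# Split 'Primary Essential Function' at the first hyphen and trim stray spaces around each part
split_cols=products['Primary Essential Function'].str.split('-',n=1,expand=True)
products[['Category','Sub-Category']]=split_cols.apply(lambda col: col.str.strip())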

products.head()

In [None]:
products['LP ID']=products['LP ID'].astype('float64')


**Number of missing values in Engagement data**

In [None]:
print(engagement.isna().sum())

In [None]:
print(f'Number of rows {engagement.shape[0]}\nNumber of columns {engagement.shape[1]}\nNumber of missing values {sum(engagement.isna().sum())} ')

In [None]:
engagement.describe()

**Districts quick view**

In [None]:
districts.head()

**Number of missing values in Districts data**


In [None]:
print(f'Number of rows {districts.shape[0]}\nNumber of columns {districts.shape[1]}\nNumber of missing values {sum(districts.isna().sum())} ')

**Number of missing values in each column (Districts)**

In [None]:
print(districts.isna().sum())

**Products quick view**


In [None]:
products.head()

In [None]:
# Number of missing values in products 

print(f'Number of rows {products.shape[0]}\nNumber of columns {products.shape[1]}\nNumber of missing values {sum(products.isna().sum())} ')

In [None]:
# Missing values in each column (products)
print(products.isna().sum())

**Main Observations:**

Engagement Data 

* Number of rows 22324190 and Number of columns 4
* Engagement Data has 5392397 missing values, mainly from engagement_index with 5378409 (24% of all observations), followed by pct_access with 13447 and lp_id with 541.

Districts Data

* Number of rows 233 and Number of columns 7
* Districts Data has 590 missing values, mainly from pct_free/reduced (233), pp_total_raw (115), and county_connections_ratio (71).

Products Data

* Number of rows 372 and Number of columns 8
* Products Data has 81 missing values, including Provider/Company Name (1), Sector(s) (20), and Primary Essential Function (20).
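
The counts and percentages quoted in these observations can be produced in one step rather than read off separate printouts. The following is a minimal sketch; the helper name `missing_summary` is hypothetical, and the toy frame only stands in for the real tables (its values are made up).

```python
import pandas as pd

def missing_summary(df):
    """Return missing-value counts and percentages per column, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'pct_missing': (counts / len(df) * 100).round(1),
    })
    return summary.sort_values('missing', ascending=False)

# Toy frame standing in for the engagement data (values are made up)
toy = pd.DataFrame({
    'lp_id': [1.0, 2.0, None, 4.0],
    'engagement_index': [10.0, None, None, 5.0],
})
print(missing_summary(toy))
```

Calling `missing_summary(engagement)`, `missing_summary(districts)`, or `missing_summary(products)` would give the same kind of table for each dataset.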





In [None]:
data=engagement.copy()
data['id']=data['id'].astype('int64')
data=data.merge(products,left_on='lp_id',right_on='LP ID',how='left')
data=data.merge(districts,left_on='id',right_on='district_id',how='left')
data['time']=pd.to_datetime(data['time'])

del engagement
del products
del districts



In [None]:
data = data.drop('district_id', axis=1)
data = data.drop('LP ID', axis=1)
data=data.drop('Primary Essential Function',axis=1)
data.head()

In [None]:
print(f'Number of rows {data.shape[0]}\nNumber of columns {data.shape[1]}\nNumber of missing values {sum(data.isna().sum())}')

In [None]:
print(f'Number of missing values in each column\n{data.isna().sum()}')

In [None]:
def show_values(axs, orient="v", space=.01):
    def _single(ax):
        if orient == "v":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() / 2
                _y = p.get_y() + p.get_height() + (p.get_height()*0.01)
                value = '{:.1f}'.format(p.get_height())
                ax.text(_x, _y, value, ha="center") 
        elif orient == "h":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() + float(space)
                _y = p.get_y() + p.get_height() - (p.get_height()*0.5)
                value = '{:.1f}'.format(p.get_width())
                ax.text(_x, _y, value, ha="left")

    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs):
            _single(ax)
    else:
        _single(axs)

In [None]:
#Top 10 products

products_temp=pd.DataFrame(data.groupby('Product Name',dropna=False)['engagement_index'].sum()/1000000).reset_index()
products_temp.columns=['Product Name','Amount']
products_temp=products_temp.sort_values('Amount',ascending=False)
products_temp=products_temp[0:10]
products_temp=products_temp.fillna('Unknown')

Key Observations Top 10 products

1. Out of the top 5 products, 4 are managed by Google: Google Docs, Google Classroom, YouTube, and Meet.
2. Google Docs has a page load of 769 million and Google Classroom has 373 million.
3. Canvas is in 3rd position.
4. Unknown is in 2nd position; it can be assumed to come from many products.
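
The Unknown bucket appears because the engagement rows are left-merged onto the products table, so any row whose `lp_id` is missing or has no match ends up without a product name. One way to quantify this is the `indicator` option of `merge`. The following is a minimal sketch on toy tables (the values are made up and the `toy_` names are hypothetical).

```python
import pandas as pd

# Toy tables standing in for the engagement and products data (values are made up)
toy_engagement = pd.DataFrame({'lp_id': [1.0, 2.0, 99.0, None],
                               'engagement_index': [10.0, 20.0, 30.0, 40.0]})
toy_products = pd.DataFrame({'LP ID': [1.0, 2.0],
                             'Product Name': ['A', 'B']})

# indicator=True adds a '_merge' column that records where each row came from
merged = toy_engagement.merge(toy_products, left_on='lp_id', right_on='LP ID',
                              how='left', indicator=True)

# 'left_only' rows found no product, so they would be grouped as Unknown
unmatched = merged[merged['_merge'] == 'left_only']
print(len(unmatched))
print(unmatched['engagement_index'].sum())
```

Running the same check on the real merge would show how many distinct `lp_id` values sit behind the Unknown bar.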

In [None]:
plt.figure(figsize=(12,5))
g1=sns.barplot(x=products_temp['Product Name'],y=products_temp['Amount'])
plt.xlabel('Products')
plt.ylabel('Page load (million)')
plt.title('Top 10 Products')
plt.tight_layout()



Key Observations Top 10 providers

1. Google outperforms all other providers with a page load of 1581 million.
2. Instructure (Canvas) is in second place with 138 million.
3. Kahoot is in 3rd place with 87.4 million.

In [None]:
#Top 10 providers
providers_temp=pd.DataFrame(data.groupby('Provider/Company Name',dropna=False)['engagement_index'].sum()/1000000).reset_index()
providers_temp.columns=['Provider/Company Name','Amount']
providers_temp=providers_temp.sort_values('Amount',ascending=False)
providers_temp=providers_temp[0:10]
providers_temp=providers_temp.fillna('Unknown')
#plot
plt.figure(figsize=(15,5))
g2=sns.barplot(x=providers_temp['Amount'],y=providers_temp['Provider/Company Name'],orient='h')
plt.ylabel('Providers')
plt.xlabel('Page load (million)')
plt.title('Top 10 Providers')
plt.tight_layout()
show_values(g2, "h", space=0)



The dataset has 3 categories:

* LC = Learning & Curriculum
* CM = Classroom Management
* SDO = School & District Operations

 

In [None]:
# Category and page load
category_temp=pd.DataFrame(data.groupby('Category')['engagement_index'].sum()/1000000).reset_index()
category_temp.columns=['Category','Amount']
category_temp=category_temp.sort_values('Amount',ascending=False)
category_temp=category_temp[0:10]
category_temp=category_temp.fillna('Unknown')
# category plot 
fig = plt.figure()
axes1 = fig.add_axes([0.1, 0.1, 0.8, 0.8])
axes1.plot(category_temp['Category'],category_temp['Amount'],'g--')
axes1.set_xlabel('Category')
axes1.set_ylabel('Page load (million)')
axes1.set_title('Category and Page Load')

# Sector and page load
sector_temp=pd.DataFrame(data.groupby('Sector(s)',dropna=False)['engagement_index'].sum()/1000000).reset_index()
sector_temp.columns=['Sector','Amount']
sector_temp=sector_temp.sort_values('Amount',ascending=False)
sector_temp=sector_temp[0:10]
sector_temp=sector_temp.fillna('Unknown')
# Break multi-sector labels onto separate lines so the tick labels do not overlap
sector_temp['Sector']=sector_temp['Sector'].str.replace('; ',';\n',regex=False)

# Sector plot (tick labels come from the data, so they always match the plotted order)
axes2= fig.add_axes([1.1, 0.1, 0.8, 0.8])
axes2.plot(sector_temp['Sector'],sector_temp['Amount'],'r*-')
axes2.set_xlabel('Sectors')
axes2.set_ylabel('Page load (million)')
axes2.set_title('Sectors and page load')



Key Observations on Page load by state 

1. In this analysis, page load that does not correspond to any state is labelled as Unknown.
2. Nearly 40% of the total page load comes from three states: Connecticut, Illinois, and Massachusetts.
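
A share such as the one quoted for the three states can be computed directly from the grouped totals. The following is a minimal sketch; the helper name `top_share` is hypothetical and the toy values are made up.

```python
import pandas as pd

def top_share(amounts, k=3):
    """Percentage of the total contributed by the k largest entries of a Series."""
    return amounts.nlargest(k).sum() / amounts.sum() * 100

# Toy totals standing in for page load per state (values are made up)
toy_state = pd.Series({'A': 50.0, 'B': 30.0, 'C': 10.0, 'D': 10.0})
print(top_share(toy_state, k=2))
```

On the real data this could be called as `top_share(state_temp.set_index('state')['Amount'])`. Note that the Unknown row counts toward both the ranking and the total unless it is dropped first.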


In [None]:
#Page load by state
state_temp=pd.DataFrame(data.groupby('state',dropna=False)['engagement_index'].sum()/1000000).reset_index()
state_temp.columns=['state','Amount']
state_temp=state_temp.sort_values('Amount',ascending=False)
state_temp=state_temp.fillna('Unknown')
#plot
plt.figure(figsize=(10,8))
g3=sns.barplot(x=state_temp['Amount'],y=state_temp['state'],orient='h',palette='rainbow')
plt.ylabel('State')
plt.xlabel('Page load (million)')
plt.title('Page load by state')
plt.tight_layout()
show_values(g3, "h", space=0)




Key Observations on page load by locale

1. There are 2.8 billion page loads in 2020; most come from Suburb districts, which contribute 48% of the total.
2. Unknown is the second-highest contributor at around 594 million page loads (20.9%).
3. City and Town are the lowest locales, with 308 million and 99 million page loads.

In [None]:
#Page load by Locale
locale_temp=pd.DataFrame(data.groupby('locale',dropna=False)['engagement_index'].sum()/1000000).reset_index()
locale_temp.columns=['locale','Amount']
locale_temp=locale_temp.sort_values('Amount',ascending=False)
locale_temp=locale_temp.fillna('Unknown')
#plot
plt.figure(figsize=(5,5))
g4=sns.barplot(x=locale_temp['Amount'],y=locale_temp['locale'],orient='h',palette='rainbow')
plt.ylabel('Locale')
plt.xlabel('Page load (million)')
plt.title('Page load by Locale')
plt.tight_layout()
show_values(g4, "h", space=0)





Key Observations on page load by Black/Hispanic
1. Across Black/Hispanic brackets, the more school districts a bracket contains, the greater its page load.
2. Most school districts fall in the 0%-20% Black/Hispanic bracket, which also has the highest page load in 2020.
3. Districts with an unknown Black/Hispanic percentage contribute 594 million page loads.
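
Because page load rises with the number of districts in a bracket, a per-district average makes the brackets easier to compare. The following is a minimal sketch on a toy frame that reuses the notebook's column names; the values are made up and the `page_load_per_district` column is a hypothetical addition.

```python
import pandas as pd

# Toy rows standing in for the merged data (values are made up)
toy = pd.DataFrame({
    'id': [1, 1, 2, 3, 3],
    'pct_black/hispanic': ['0%-20%', '0%-20%', '0%-20%', '20%-40%', '20%-40%'],
    'engagement_index': [100.0, 50.0, 150.0, 200.0, 100.0],
})

# Count distinct districts and total page load per bracket in one pass
per_bracket = toy.groupby('pct_black/hispanic').agg(
    districts=('id', 'nunique'),
    page_load=('engagement_index', 'sum'),
)
# Normalise so that brackets with many districts do not dominate by count alone
per_bracket['page_load_per_district'] = per_bracket['page_load'] / per_bracket['districts']
print(per_bracket)
```

In this toy example both brackets have the same total page load, but the bracket with a single district has twice the per-district value.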


In [None]:
#School District by Black/hispanic
black_hispanic_temp=pd.DataFrame(data.groupby('pct_black/hispanic',dropna=False)['id'].nunique()).reset_index()
black_hispanic_temp.columns=['black/hispanic','Amount']
black_hispanic_temp=black_hispanic_temp.sort_values('Amount',ascending=False)
black_hispanic_temp=black_hispanic_temp.fillna('Unknown')
#plot
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
g5=sns.barplot(x=black_hispanic_temp['Amount'],y=black_hispanic_temp['black/hispanic'],orient='h',palette='BuGn_r')
plt.ylabel('black/hispanic')
plt.xlabel('school districts')
plt.title('School District by Black/hispanic')
plt.tight_layout()
show_values(g5, "h", space=0)

#Page load by hispanic/Black
black_hispanic_temp=pd.DataFrame(data.groupby('pct_black/hispanic',dropna=False)['engagement_index'].sum()/1000000).reset_index()
black_hispanic_temp.columns=['black/hispanic','Amount']
black_hispanic_temp=black_hispanic_temp.sort_values('Amount',ascending=False)
black_hispanic_temp=black_hispanic_temp.fillna('Unknown')
#plot
plt.subplot(1,2,2)
g6=sns.barplot(x=black_hispanic_temp['Amount'],y=black_hispanic_temp['black/hispanic'],orient='h',palette='coolwarm')
plt.ylabel('black/hispanic')
plt.xlabel('Page Load (million)')
plt.title('Page load by Black/Hispanic')
plt.tight_layout()
show_values(g6, "h", space=0)




In [None]:
#Page load by state and Black/hispanic
combine_temp=pd.DataFrame(data.groupby(['pct_black/hispanic','state'],dropna=False)['engagement_index'].sum()/1000000).reset_index()
combine_temp=combine_temp.fillna('Unknown')
#plot
plt.figure(figsize=(10,8))
g5=sns.scatterplot(x=combine_temp['pct_black/hispanic'],y=combine_temp['state'],hue=combine_temp['engagement_index'],palette='Oranges_r')
plt.xlabel('Black/Hispanic')
plt.ylabel('State')
plt.title('Page Load by State and Black Hispanic')
plt.tight_layout()


**County Connection Ratio**

County Connection Ratio is (residential fixed high-speed connections over 200 kbps in at least one direction) / households.

1. North Dakota has a county connection ratio of 100%-200%.
2. The rest of the states have a connection ratio of 18%-100%.
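
To make the definition concrete: the ratio is a plain division, and it can exceed 1 when a county has more qualifying connections than households. The following is a minimal sketch that computes the ratio and assigns it to one of the two brackets used in this notebook; the function names and the connection and household counts are hypothetical.

```python
def connections_ratio(connections, households):
    """Residential fixed high-speed connections divided by households."""
    return connections / households

def ratio_bracket(ratio):
    """Map a ratio to the two brackets used in this notebook, else 'Unknown'."""
    if 0.18 <= ratio < 1:
        return '18%-100%'
    if 1 <= ratio < 2:
        return '100%-200%'
    return 'Unknown'

# Hypothetical counties: one with fewer connections than households, one with more
print(ratio_bracket(connections_ratio(450, 1000)))
print(ratio_bracket(connections_ratio(1200, 1000)))
```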

In [None]:
connection_ratio=pd.DataFrame(data.groupby(['county_connections_ratio'])['state'].nunique()).reset_index()
connection_ratio.columns=['county connections ratio','No of State']
connection_ratio=connection_ratio.fillna('unknown')
connection_ratio=connection_ratio.sort_values('No of State',ascending=False)
connection_ratio

**Total Expenditure Per-pupil**

1. Most of the page load comes from school districts with 8000-10000 per-pupil expenditure.
2. More than half of the information is labelled as Unknown.
3. The graph shows that more than 200 million page loads come from school districts with 10000-18000 per-pupil expenditure.

In [None]:
puple_expense=pd.DataFrame(data.groupby('pp_total_raw',dropna=False)['engagement_index'].sum()/1000000).reset_index()
puple_expense.columns=['Per-Pupil-Expenditure','Amount']
puple_expense=puple_expense.fillna('Unknown')
puple_expense=puple_expense.sort_values('Amount',ascending=False)


#plot
plt.figure(figsize=(15,5))
plt.plot(puple_expense['Per-Pupil-Expenditure'],puple_expense['Amount'],'b*-')
plt.xlabel('Per-Pupil Total Expenditure')
plt.ylabel('Page Load (million)')
plt.title('Per-Pupil Total Expenditure and Page Load')


**Reduced Price or Free**

1. For nearly 936 million page loads, the free/reduced percentage is unknown.
2. The higher the free/reduced percentage, the lower the page load, because fewer school districts fall in the higher brackets.
3. Most of the page load is contributed by school districts where 0%-20% of students are eligible for free or reduced-price lunch.

In [None]:

reduced_free=pd.DataFrame(data.groupby('pct_free/reduced',dropna=False)['engagement_index'].sum()/1000000).reset_index()
reduced_free.columns=['pct_free/reduced','Amount']
reduced_free=reduced_free.sort_values('Amount',ascending=False)
reduced_free=reduced_free.fillna('Unknown')


plt.figure(figsize=(6,5))
sns.barplot(x=reduced_free['pct_free/reduced'],y=reduced_free['Amount'])
plt.xlabel('Free or Reduced Price')
plt.ylabel('Page load (million)')
plt.title('Reduced Price or Free and Page Load')
plt.tight_layout()

*Reduced Price and Per-pupil Expenditure relation with page-load 

1. A combination of 40%-60% reduced price and 8,000-10,000 per-pupil total expenditure has the highest page-load of 141.1 million.
2. The next two highest page loads come from:
   * A combination of 0%-20% reduced price and 10,000-12,000 per-pupil total expenditure, and
   * A combination of 20%-40% reduced price and 8,000-10,000 per-pupil total expenditure


Correlation between Reduced Price and Per-Pupil Total Expenditure in terms of page load


In [None]:
# Page load (million) for each combination of per-pupil expenditure and free/reduced bracket
relation = data.pivot_table(index='pp_total_raw', columns='pct_free/reduced', values='engagement_index', aggfunc='sum', dropna=False)/1000000
relation=relation.fillna(0)
# fmt='.1f' prints plain numbers such as 141.1 instead of scientific notation
sns.heatmap(relation,annot=True,fmt='.1f',cmap='viridis')