**Competition Overview:**
LearnPlatform is a technology company founded in 2014 with a mission to expand access to education technology for all students and teachers. The competition requires data analysis about how engagement with digital learning relates to factors like district demographics, broadband access, and state/national level policies and events. 
*****************************************************************************************************************
**Questions to answer:**
**What is the picture of digital connectivity and engagement in 2020?**

Plotting pct_access and engagement_index over 2020 shows that use of digital learning solutions increased substantially from March to May and again from August through December.
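
One way to check this trend numerically, rather than by eye, is to average pct_access per calendar month. The sketch below uses a small made-up table (the `toy` frame and its values are illustrative assumptions, not competition data); on the real data the same `groupby` could be applied to the `engagement` frame built later in this notebook.

```python
import pandas as pd

# Synthetic stand-in for the engagement data; the values are made up.
toy = pd.DataFrame({
    "time": pd.to_datetime([
        "2020-02-03", "2020-02-04",   # February
        "2020-04-01", "2020-04-02",   # April
        "2020-07-01",                 # July
        "2020-09-01",                 # September
    ]),
    "pct_access": [1.0, 3.0, 6.0, 8.0, 2.0, 9.0],
})

# Group by calendar month and take the mean pct_access in each month.
monthly = toy.groupby(toy["time"].dt.to_period("M"))["pct_access"].mean()
print(monthly)
```

Months without any rows are simply absent from the result, so gaps in the data stay visible instead of being filled with zeros.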
*****************************************************************************************************************
**What is the effect of the COVID-19 pandemic on online and distance learning, and how might this also evolve in the future?**
Usage of various online learning apps increased substantially, first from February to April. Usage dropped during the summer holidays, and when academic activity resumed, utilization of online and distance learning applications rose to levels higher than in February to April. Going forward, as the pace of vaccination increases and vaccines are expected to be authorized even for young children, usage of online learning applications and tools may drop, since these learning methods cannot fully substitute for classroom learning. Several US states have made, or are expected to make, in-classroom learning mandatory for children. That said, the convenience, novel learning experiences, and cost and time savings provided by online learning apps and tools will continue to drive their usage in the future.
*****************************************************************************************************************
**How does student engagement with different types of education technology change over the course of the pandemic?**

Please see the functions getTopProducts() and getAllStatesData(), which list the top products for a selected category.

**Learning Management Systems (LMS)**: Google Classroom is the most popular. Savvas Realize and Schoology come next but are far behind Google Classroom.

**Virtual Classroom:** Meet (key states: Connecticut, Massachusetts, NY) and Zoom (key states: IL, NJ, AZ) are the top 2 products in the majority of states, followed by Google Hangouts and Loom.

**School Management:** Clever and ClassLink are the top 2. ClassLink leads in WA, CT, MA, and IL; Clever leads in the rest of the states.
*****************************************************************************************************************
**How does student engagement with online learning platforms relate to different geography? Demographic context (e.g., race/ethnicity, ESL, learning disability)? Learning context? Socioeconomic status?**

To analyse access to online learning, I prepared catplots of variables such as state, locale, per pupil spending at district level etc. against pct_access for Google Classroom. Google Classroom was chosen as it is amongst the top 3 digital products in terms of pct_access for all states except and hence Max and average pct_access statistics for Google Classroom can be used as uniform and consistent indicators of adoption of learning across the districts.
1) District-level data: pp_total_raw (per-pupil total expenditure) appears to have an impact on access to online learning platforms. For districts with pp_total_raw > 11000, max(pct_access) for Google Classroom is generally > 40%, which indicates higher access to digital learning. Most of the districts with pct_access < 40% also have pp_total_raw <= 11000 (see the catplot below with pp_total_raw on the x-axis).
2) States: Utah has a higher concentration of districts with low pct_access. Districts in Connecticut, Massachusetts, and Ohio have max(pct_access) in higher ranges (60% to 80%).
3) Locale type does not appear to have an impact on pct_access. Surprisingly, rural districts have a higher proportion of high-pct_access districts; however, the number of districts with the Rural locale type is much smaller than for Suburb and City, so no conclusion can be drawn.
4) Districts with City and Suburb locales tend to have higher max(pct_access) in general when the proportion of Black/Hispanic students is >= 40% (see the catplot with locale on the x-axis and pct_black/hispanic as hue).
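
The spending threshold in point 1 can also be checked with a simple two-way count instead of reading it off a catplot. The following is a minimal sketch on made-up numbers (the `toy` frame, its values, and the band labels are assumptions for illustration); with the real data, the same two columns would come from the Google Classroom rows of `district_product_info`.

```python
import pandas as pd

# Synthetic district-level rows; the values are made up for illustration.
toy = pd.DataFrame({
    "pp_total_raw":   [9000, 9000, 11000, 13000, 13000, 15000],
    "pct_access_max": [25.0, 35.0, 45.0, 55.0, 38.0, 70.0],
})

# Split each variable at the thresholds quoted in the text.
toy["spend_band"] = toy["pp_total_raw"].gt(11000).map(
    {True: "> 11000", False: "<= 11000"})
toy["access_band"] = toy["pct_access_max"].gt(40).map(
    {True: "> 40%", False: "<= 40%"})

# Count districts in each (spend band, access band) combination.
table = pd.crosstab(toy["spend_band"], toy["access_band"])
print(table)
```

A table dominated by its diagonal would support the observation; many off-diagonal districts would weaken it.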
*****************************************************************************************************************
**Do certain state interventions, practices or policies (e.g., stimulus, reopening, eviction moratorium) correlate with the increase or decrease in online engagement?**
State interventions help sustain economic activity through support for small businesses, income support for low-income households, and protection from eviction; these measures should therefore support higher student attendance in online lectures and distance learning activities.

Thank you.


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
#plt.rcParams.update({'font.size': 14})
import re
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

For data generalization purposes, some data points are released as a range within which the actual value falls.

locale -  NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. See [Locale Boundaries User's Manual](https://eric.ed.gov/?id=ED577162) for more information. 

pct_black/hispanic - Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data 


pct_free/reduced - Percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data

county_connections_ratio - (residential fixed high-speed connections over 200 kbps in at least one direction/households) based on the county level data from FCC Form 477 (December 2018 version). See [FCC data](https://www.fcc.gov/form-477-county-data-internet-access-services) for more information.

pp_total_raw - Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district. 
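
The range-valued columns described above can be turned into numbers in more than one way. As a minimal sketch, assuming the ranges are written as half-open intervals such as `[0.2, 0.4[` (an assumption about the raw format), the hypothetical helper `parse_range` below returns the lower bound, upper bound, and midpoint. It is not the cleaning used in this notebook, which keeps the lower bound and adds a fixed offset for two of the columns.

```python
import math
import re

# Hypothetical helper, not part of the notebook: parse a half-open range
# string such as "[0.2, 0.4[" into (lower, upper, midpoint).
_RANGE_RE = re.compile(r"^\[\s*([0-9.]+)\s*,\s*([0-9.]+)\s*\[$")


def parse_range(value):
    """Return (lower, upper, midpoint); all NaN when the value is missing."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return (math.nan, math.nan, math.nan)
    match = _RANGE_RE.match(str(value).strip())
    if match is None:
        # Fail loudly on formats this sketch does not anticipate.
        raise ValueError(f"unexpected range format: {value!r}")
    lower, upper = float(match.group(1)), float(match.group(2))
    return (lower, upper, (lower + upper) / 2)


# Example inputs are illustrative, not rows copied from the dataset.
examples = ["[0.2, 0.4[", "[8000, 10000[", float("nan")]
parsed = [parse_range(v) for v in examples]
print(parsed)
```

Keeping both bounds makes it possible to choose later between lower bound, midpoint, or an ordered category without re-reading the file.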

In [None]:
district_info = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv').dropna(how='all')

In [None]:
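# Clean the range-valued columns of district_info: strip the leading bracket and keep the lower bound of each range.
# For county_connections_ratio and pp_total_raw a fixed offset (0.1 and 1000) is added to the lower bound.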
district_info["pct_black/hispanic"] = district_info["pct_black/hispanic"].apply(lambda x: x if pd.isnull(x) else float(str(x).split(',')[0][1:]))
district_info["pct_free/reduced"] = district_info["pct_free/reduced"].apply(lambda x: x if pd.isnull(x) else float(str(x).split(',')[0][1:]))
district_info["county_connections_ratio"] = district_info["county_connections_ratio"].apply(lambda x: x if pd.isnull(x) else float(str(x).split(',')[0][1:])+0.1)
district_info["pp_total_raw"] = district_info["pp_total_raw"].apply(lambda x: x if pd.isnull(x) else float(str(x).split(',')[0][1:])+1000)

| Name | Description |
| :--- | :----------- |
| LP ID| The unique identifier of the product |
| URL | Web Link to the specific product |
| Product Name | Name of the specific product |
| Provider/Company Name | Name of the product provider |
| Sector(s) | Sector of education where the product is used |
| Primary Essential Function | The basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled |

In [None]:
product = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')
temp_sectors = product['Sector(s)'].str.get_dummies(sep="; ")
temp_sectors.columns = [f"sector_{re.sub(' ', '', c)}" for c in temp_sectors.columns]
product = product.join(temp_sectors)
#product.drop("Sector(s)", axis=1, inplace=True)
# Split "MAIN - Sub" labels into their two parts; missing values are passed through unchanged
product['primary_function_main'] = product['Primary Essential Function'].apply(lambda x: x.split(' - ')[0] if pd.notna(x) else x)
product['primary_function_sub'] = product['Primary Essential Function'].apply(lambda x: x.split(' - ')[1] if pd.notna(x) else x)
product['primary_function_sub'] = product['primary_function_sub'].replace({'Sites, Resources & References' : 'Sites, Resources & Reference'})
product.rename(columns = {'LP ID': 'lp_id'}, inplace=True)
product.head(3)

The 4-digit file name represents `district_id`, which can be used to link to district information in `districts_info.csv`. The `lp_id` can be used to link to product information in `products_info.csv`.

| Name | Description |
| :--- | :----------- |
| time | date in "YYYY-MM-DD" |
| lp_id | The unique identifier of the product |
| pct_access | Percentage of students in the district who have at least one page-load event of a given product on a given day |
| engagement_index | Total page-load events per one thousand students of a given product and on a given day |

In [None]:
print(district_info.shape)
print(product.shape)

In [None]:
PATH = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data' 

#del engagement

temp = []
dist_grp_data = []
district_product_info = pd.DataFrame(columns=['district_id', 'lp_id', 'pct_access_mean', 'pct_access_max', 'eng_index_mean', 'eng_index_max'])
for district in district_info.district_id.unique():
    df = pd.read_csv(f'{PATH}/{district}.csv', index_col=None, header=0)
    df["district_id"] = district
    #df_grouped = engagement_all_data[['Product Name', 'primary_function_main', 'primary_function_sub', 'pct_access']].groupby(['Product Name', 'primary_function_main', 'primary_function_sub']).pct_access.mean().to_frame().reset_index().sort_values(by='pct_access', ascending = False)
    if df.time.nunique() >= 250:
        temp.append(df)
        grp_df = df.groupby(['district_id','lp_id']).agg({'pct_access':['mean', 'max'], 'engagement_index':['mean', 'max']}).reset_index()
        grp_df.columns = ['district_id', 'lp_id', 'pct_access_mean', 'pct_access_max', 'eng_index_mean', 'eng_index_max']
        dist_grp_data.append(grp_df)

engagement = pd.concat(temp)
district_product_info = pd.concat(dist_grp_data)
engagement = engagement.reset_index(drop=True)
district_product_info = district_product_info.reset_index(drop=True)  # assign the result; reset_index is not in-place
engagement.time = engagement.time.astype('datetime64[ns]')
# Only consider districts with full 2020 engagement data
district_info = district_info[district_info.district_id.isin(engagement.district_id.unique())].reset_index(drop=True)
product = product[product['lp_id'].isin(engagement.lp_id.unique())].reset_index(drop=True)

In [None]:
print(district_info.shape)
print(product.shape)

**district_product_info: For each district and product combination, this dataframe contains the mean and max of pct_access and engagement index. We will combine this dataframe with the product and district dataframes to plot district-level charts and analyze trends with respect to key variables such as % of Black/Hispanic students, per-pupil expenditure, locale etc.**
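
As an illustration of the shape of this summary, the sketch below builds the same four statistics on a tiny made-up table using pandas named aggregation. The `toy` frame and its values are assumptions for illustration; the notebook itself builds the summary file by file in the loading loop.

```python
import pandas as pd

# Synthetic engagement rows for two districts and two products (made-up values).
toy = pd.DataFrame({
    "district_id":      [1000, 1000, 1000, 2000],
    "lp_id":            [1, 1, 2, 1],
    "pct_access":       [10.0, 30.0, 5.0, 50.0],
    "engagement_index": [100.0, 300.0, 20.0, 900.0],
})

# One row per (district, product) with mean and max of both metrics.
summary = toy.groupby(["district_id", "lp_id"], as_index=False).agg(
    pct_access_mean=("pct_access", "mean"),
    pct_access_max=("pct_access", "max"),
    eng_index_mean=("engagement_index", "mean"),
    eng_index_max=("engagement_index", "max"),
)
print(summary)
```

Named aggregation yields flat column names directly, which avoids renaming a two-level column index afterwards.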

In [None]:
district_product_info.head()

**Plotting the number of districts in various states**

In [None]:
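# Name both columns explicitly so the plot does not depend on the pandas version's value_counts() column naming
districts_info_by_state = district_info['state'].value_counts().rename_axis('state').reset_index(name='n_districts')

fig, ax = plt.subplots(1,1, figsize=(20,5))
sns.barplot(data=districts_info_by_state, x="state", y="n_districts", palette ='BuGn', ax=ax)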
ax.set_title('No of districts in states')
plt.xticks(rotation=90)
plt.tight_layout()

**Findings:**
- Suburb is the most common locale.
- Districts with the Town locale are the fewest and are primarily in Utah.
- Utah contains all 4 locale types.
- California has the highest proportion of City locales.

In [None]:
sns.catplot(x='locale', col='state', data=district_info, col_wrap = 10, height = 3, kind="count" )

In [None]:
sns.catplot(x='pct_free/reduced', col='state', data=district_info, col_wrap = 10, height = 3, kind="count" )

**From the chart below we can see that:**
* Utah has a majority of districts in the low per-pupil expenditure categories, whereas districts in Massachusetts and Illinois seem to have higher per-pupil expenditure in general.
* Per-pupil expenditure is not available for many states.

In [None]:
s=sns.catplot(x='pp_total_raw', col='state', data=district_info, col_wrap = 10, height = 3, kind="count" )
s.set_xticklabels(rotation=90)

In [None]:
fig, ax = plt.subplots(1,4, figsize=(20,9))

sns.countplot(data=product[product['primary_function_main'] == 'LC'], x='primary_function_sub', palette ='GnBu', ax=ax[0])
ax[0].set_title('Sub-Categories in Primary Function LC', fontsize = 9)
ax[0].set_xticklabels(ax[0].get_xticklabels(), fontsize = 10, rotation = 90)

sns.countplot(data=product[product['primary_function_main'] == 'CM'], x='primary_function_sub', palette ='GnBu', ax=ax[1])
ax[1].set_title('Sub-Categories in Primary Function CM', fontsize = 9)
ax[1].set_xticklabels(ax[1].get_xticklabels(), fontsize = 10, rotation=90)

sns.countplot(data=product[product['primary_function_main'] == 'SDO'], x='primary_function_sub', palette ='GnBu', ax=ax[2])
ax[2].set_title('Sub-Categories in Primary Function SDO', fontsize = 9)
ax[2].set_xticklabels(ax[2].get_xticklabels(),fontsize = 10, rotation=90)

sns.countplot(data=product[product['primary_function_main'] == 'LC/CM/SDO'], x='primary_function_sub', palette ='GnBu', ax=ax[3])
ax[3].set_title('Sub-Categories in Primary Function LC/CM/SDO', fontsize = 9)
ax[3].set_xticklabels(ax[3].get_xticklabels(), fontsize = 10, rotation = 90)
plt.show()

In [None]:
virtual_classroom_lp_id = product[product.primary_function_sub == 'Virtual Classroom']['lp_id'].unique()

# Remove weekends from the dataframe
engagement['weekday'] = pd.DatetimeIndex(engagement['time']).weekday
engagement_without_weekends = engagement[engagement.weekday < 5]

# Figure 1
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(24, 6))
for virtual_classroom_product in virtual_classroom_lp_id:
    temp = engagement_without_weekends[engagement_without_weekends.lp_id == virtual_classroom_product].groupby('time').pct_access.mean().to_frame().reset_index(drop=False)
    sns.lineplot(x=temp.time, y=temp.pct_access, label=product[product['lp_id'] == virtual_classroom_product]['Product Name'].values[0])
plt.legend()
plt.show()

# Figure 2
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(24, 6))
for virtual_classroom_product in virtual_classroom_lp_id:
    temp = engagement_without_weekends[engagement_without_weekends.lp_id == virtual_classroom_product].groupby('time').engagement_index.mean().to_frame().reset_index(drop=False)
    sns.lineplot(x=temp.time, y=temp.engagement_index, label=product[product['lp_id'] == virtual_classroom_product]['Product Name'].values[0])
plt.legend()
plt.show()

In [None]:
engagement=pd.merge(engagement, district_info, on="district_id", how="left")
district_product_info = pd.merge(district_product_info, district_info, on="district_id", how="left")

In [None]:
zoom_id = product[product['Product Name'] == 'Zoom']['lp_id'].values[0]

locale = district_info['locale'].unique()

engagement['weekday'] = pd.DatetimeIndex(engagement['time']).weekday
engagement_without_weekends = engagement[engagement.weekday < 5]

# Figure 1
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(24, 6))
for localeType in locale:
    temp = engagement_without_weekends[(engagement_without_weekends.lp_id == zoom_id) & (engagement_without_weekends.locale == localeType)].groupby('time').pct_access.mean().to_frame().reset_index(drop=False)
    sns.lineplot(x=temp.time, y=temp.pct_access, label=localeType)
plt.legend()
plt.show()


f, ax = plt.subplots(nrows=1, ncols=1, figsize=(24, 6))
for localeType in locale:
    temp = engagement_without_weekends[(engagement_without_weekends.lp_id == zoom_id) & (engagement_without_weekends.locale == localeType)].groupby('time').engagement_index.mean().to_frame().reset_index(drop=False)
    sns.lineplot(x=temp.time, y=temp.engagement_index, label=localeType)
plt.legend()
plt.show()

In [None]:
engagement_all_data = pd.merge(engagement, product[["lp_id","Product Name","Provider/Company Name","primary_function_main", "primary_function_sub"]], on="lp_id", how="left")
district_product_info = pd.merge(district_product_info, product[["lp_id","Product Name","Provider/Company Name","primary_function_main", "primary_function_sub"]], on="lp_id", how="left")

In [None]:
district_product_info.isnull().sum()/len(district_product_info)

**Function to list top products by pct_access and engagement index**

In [None]:
#drop = False
def getTopProducts(state, category):
    if state == "ALL":
        grouped_data_pctaccess = engagement_all_data[['Product Name', 'primary_function_main', 'primary_function_sub', 'pct_access']].groupby(['Product Name', 'primary_function_main', 'primary_function_sub']).pct_access.mean().to_frame().reset_index().sort_values(by='pct_access', ascending = False)
        grouped_data_eindex = engagement_all_data[['Product Name', 'primary_function_main', 'primary_function_sub', 'engagement_index']].groupby(['Product Name', 'primary_function_main', 'primary_function_sub']).engagement_index.mean().to_frame().reset_index().sort_values(by='engagement_index', ascending = False)
    else:
        state_data = engagement_all_data[engagement_all_data['state'] == state]
        grouped_data_pctaccess = state_data[['Product Name', 'primary_function_main', 'primary_function_sub', 'pct_access']].groupby(['Product Name', 'primary_function_main', 'primary_function_sub']).pct_access.mean().to_frame().reset_index().sort_values(by='pct_access', ascending = False)
        grouped_data_eindex = state_data[['Product Name', 'primary_function_main', 'primary_function_sub', 'engagement_index']].groupby(['Product Name', 'primary_function_main', 'primary_function_sub']).engagement_index.mean().to_frame().reset_index().sort_values(by='engagement_index', ascending = False)
    
    if category != "ALL":
        grouped_data_pctaccess= grouped_data_pctaccess[grouped_data_pctaccess['primary_function_sub'] == category].reset_index() 
        grouped_data_eindex= grouped_data_eindex[grouped_data_eindex['primary_function_sub'] == category].reset_index() 
    else:
        grouped_data_pctaccess=grouped_data_pctaccess.reset_index()
        grouped_data_eindex = grouped_data_eindex.reset_index()
    return grouped_data_pctaccess, grouped_data_eindex
    

**This function lists top 10 products in all states for a given category by pct_access and engagement index**
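
The ranking idea can also be sketched on toy data with a sort followed by `groupby(...).head(n)`. The frame, product names, and values below are made up for illustration, and this is an alternative formulation rather than the function used in this notebook.

```python
import pandas as pd

# Synthetic rows: state, product and a daily pct_access value (made-up numbers).
toy = pd.DataFrame({
    "state":        ["Utah", "Utah", "Utah", "Ohio", "Ohio"],
    "Product Name": ["A", "B", "A", "A", "B"],
    "pct_access":   [10.0, 30.0, 20.0, 5.0, 1.0],
})

# Mean pct_access per (state, product).
mean_access = toy.groupby(["state", "Product Name"], as_index=False)["pct_access"].mean()

# Sort within each state from highest to lowest, then keep the top n rows per state.
n = 1
top = (
    mean_access.sort_values(["state", "pct_access"], ascending=[True, False])
    .groupby("state")
    .head(n)
    .reset_index(drop=True)
)
print(top)
```

This returns a long table (one row per state and rank) rather than one column per state, which can be easier to filter or plot.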

In [None]:
def getAllStatesData(states, category):
    # One column per state; each cell holds "Product Name (mean value)", ranked from highest to lowest
    product_df_pct_access = pd.DataFrame(columns=states)
    product_df_eng_index = pd.DataFrame(columns=states)
    # Iterate over the states passed in, not over the global all_states
    for state in states:
        list_access, list_index = getTopProducts(state, category)
        product_df_pct_access[state] = list_access['Product Name'].astype(str) + " (" + list_access['pct_access'].round(2).astype(str) + ") "
        product_df_eng_index[state] = list_index['Product Name'].astype(str) + " (" + list_index['engagement_index'].round(2).astype(str) + ") "
    # Keep at most the top 10 rows of each table
    return product_df_pct_access[:10], product_df_eng_index[:10]

In [None]:
all_states = engagement_all_data['state'].dropna().unique()
#all_states = ['Utah']
category = "ALL"
product_df_pct_access, product_df_eng_index = getAllStatesData(all_states, category)
product_df_pct_access

In [None]:
all_states = engagement_all_data['state'].dropna().unique()
#all_states = ['Utah']
category = "Learning Management Systems (LMS)"
product_df_pct_access, product_df_eng_index = getAllStatesData(all_states, category)
product_df_pct_access

In [None]:
all_states = engagement_all_data['state'].dropna().unique()
#all_states = ['Utah']
category = "Virtual Classroom"
product_df_pct_access, product_df_eng_index = getAllStatesData(all_states, category)
product_df_pct_access

In [None]:
all_states = engagement_all_data['state'].dropna().unique()
#all_states = ['Utah']
category = "School Management Software"
product_df_pct_access, product_df_eng_index = getAllStatesData(all_states, category)
product_df_pct_access

In [None]:
all_states = engagement_all_data['state'].dropna().unique()
#all_states = ['Utah']
category = "Courseware & Textbooks"
product_df_pct_access, product_df_eng_index = getAllStatesData(all_states, category)
product_df_pct_access

In [None]:
all_states = engagement_all_data['state'].dropna().unique()
#all_states = ['Utah']
category = "Digital Learning Platforms"
#product_df_pct_access, product_df_eng_index = getAllStatesData(all_states, category)
#product_df_pct_access

In [None]:
def plotStateProductTrend(state, product_id, col):
    state_data = engagement_all_data[engagement_all_data['state'] == state].copy()
    
    state_data = state_data[state_data.weekday < 5]

    proportions = state_data[col].dropna().unique()

    # Figure 1
    f, ax = plt.subplots(nrows=1, ncols=1, figsize=(24, 6))
    for p in proportions:
        temp = state_data[(state_data[col]==p) & (state_data['lp_id']==product_id)].groupby('time').pct_access.mean().to_frame().reset_index(drop=False)
        sns.lineplot(x=temp.time, y=temp.pct_access, label=p)
    plt.legend()
    plt.show()


    f, ax = plt.subplots(nrows=1, ncols=1, figsize=(24, 6))
    for p in proportions:
        temp = state_data[(state_data[col]==p) & (state_data['lp_id']==product_id)].groupby('time').engagement_index.mean().to_frame().reset_index(drop=False)
        sns.lineplot(x=temp.time, y=temp.engagement_index, label=p)
    plt.legend()
    plt.show()

In [None]:
selected_state = 'Connecticut'
product_cd = product[product['Product Name'] == 'Google Classroom']['lp_id'].values[0]
col = 'pct_black/hispanic'
plotStateProductTrend(selected_state, product_cd, col)

In [None]:
selected_state = 'California'
product_cd = product[product['Product Name'] == 'Google Classroom']['lp_id'].values[0]
col = 'pct_free/reduced'
#plotStateProductTrend(selected_state, product_cd, col)

**District Level Analysis**

In [None]:
district_product_google_class = district_product_info[district_product_info["Product Name"]=="Google Classroom"]

In [None]:
district_product_google_class.shape

In [None]:
g=sns.catplot(x="locale", y="pct_access_max", hue="pct_black/hispanic", data=district_product_google_class)
g.fig.set_size_inches(15,4)

In [None]:
g=sns.catplot(x="state", y="pct_access_max", hue="pct_black/hispanic", data=district_product_google_class)
g.fig.set_size_inches(20,6)
g.set_xticklabels(rotation=90)

In [None]:
g=sns.catplot(x="locale", y="pct_access_max", hue="pct_free/reduced", data=district_product_google_class)
g.fig.set_size_inches(15,4)

In [None]:
g=sns.catplot(x="pp_total_raw",  y="pct_access_max", hue="pct_black/hispanic", data=district_product_google_class, 
             palette=sns.color_palette(['orange', 'purple', 'green', 'red', 'blue']))
g.fig.set_size_inches(15,4)

In [None]:
g=sns.catplot(x="pct_black/hispanic",  y="pct_access_max", hue="pp_total_raw", data=district_product_google_class, 
             )
g.fig.set_size_inches(15,4)