<h1>LearnPlatform COVID-19 Impact on Digital Learning </h1>

<div style="background:#abd5f5; color:#069; border:1px solid #b3deff; padding: 20px">
    <h2>Table of Contents</h2>
    <ul>
        <li><a href="#Introduction">Introduction</a></li>
        <li><a href="#Districts-and-states">Districts and states</a></li>
        <li><a href="#Products-statistics">Products statistics</a></li>
        <li><a href="#COVID-19">COVID-19</a></li>
        <li><a href="#Conclusions">Conclusions</a></li>
    </ul>
</div>

<h2 id="Introduction">Introduction</h2>
<p>This notebook analyzes LearnPlatform data to see how COVID-19 affected digital learning.</p>
<p>I split the analysis into three parts.</p>
<ul>
<li>Districts and states: distribution by locale type, percentage of Black and Hispanic students in each state, and so on.</li>
<li>Product statistics: distribution by sector and category.</li>
<li>Changes in engagement on LearnPlatform during COVID-19. I'll show how the engagement index changed during 2020
and check whether it depends on COVID-19 events.</li>
</ul>
<p>Each part includes data visualizations, which I'll use to draw conclusions.</p>

<h2 id="Districts-and-states">Districts and states</h2>

In [None]:
import pandas as pd
import os
from matplotlib import pyplot as plt
import squarify

In [None]:
# load data
districts = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')

Preprocessing the data.
- Most values are given as ranges rather than single numbers. Let's replace each range with the mean of its two bounds.
- If a bound of a ratio column is greater than 1, let's replace it with 0, because a share cannot exceed 100%.
- Also, let's remove all the districts without a state.

In [None]:
# mean of the two bounds of a range value; bounds greater than 1 are treated as 0
def avg_pct(pct):
    if pd.isna(pct):
        return 0
    else:
        first_coef = float(pct[1:pct.find(',')])
        first_coef = 0 if first_coef > 1 else first_coef
        second_coef = float(pct[pct.find(',') + 2:pct.find('[',1)])
        second_coef = 0 if second_coef > 1 else second_coef
        return round((first_coef + second_coef)/ 2, 3)

def avg_total_raw(pct):
    if pd.isna(pct):
        return 0
    else:
        first_coef = float(pct[1:pct.find(',')])
        second_coef = float(pct[pct.find(',') + 2:pct.find('[',1)])
        return round((first_coef + second_coef)/ 2, 3)


# renaming columns for convenience
districts.rename(columns={'pct_black/hispanic':'bl_his',
                          'pct_free/reduced':'free_lunch'},
                 inplace=True)

districts.bl_his= districts.bl_his.apply(avg_pct)
districts.free_lunch = districts.free_lunch.apply(avg_pct)
districts.county_connections_ratio = districts.county_connections_ratio.apply(avg_pct)
districts.pp_total_raw = districts.pp_total_raw.apply(avg_total_raw)

districts.dropna(inplace=True)
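
As a quick sanity check of the range handling described above, here is a minimal standalone sketch. It assumes the raw values look like `[0.2, 0.4[`; the helper name `range_midpoint` and its `cap` parameter are hypothetical and are not part of the notebook.

```python
import math
import re

# Matches strings such as "[0.2, 0.4[" (assumed format of the raw ranges).
RANGE_RE = re.compile(r"\[\s*([0-9.]+)\s*,\s*([0-9.]+)\s*\[")


def range_midpoint(value, cap=None):
    """Return the mean of the two bounds, or 0 for missing values.

    If `cap` is given, any bound above it is replaced with 0 first,
    mirroring the rule used for the ratio columns.
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return 0
    match = RANGE_RE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unexpected range format: {value!r}")
    low, high = (float(bound) for bound in match.groups())
    if cap is not None:
        low = 0 if low > cap else low
        high = 0 if high > cap else high
    return round((low + high) / 2, 3)


print(range_midpoint("[0.2, 0.4[", cap=1))
print(range_midpoint("[14000, 16000["))
print(range_midpoint(float("nan")))
```

Unlike the slicing approach, this version raises an error on an unexpected format instead of silently producing a wrong number, which may be useful when checking new data.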

Suburban districts prevail.

In [None]:
fig, ax = plt.subplots(1, figsize = (8,8))
districts['locale'].value_counts().plot(kind='pie',autopct='%1.1f%%')
plt.title('Categories of Locale',fontsize=20)
plt.show()


Arizona has the highest percentage of Black and Hispanic students. Minnesota has the highest percentage of students with free or
reduced-cost lunches. The internet connection ratio is almost the same in every state. At first sight, there is
no correlation between these coefficients.

In [None]:
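# select the numeric columns before averaging; non-numeric columns such as locale cannot be averaged
bardata = districts.groupby('state')[['bl_his','free_lunch','county_connections_ratio']].mean()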
bardata.rename(columns={'bl_his':'% black and hispanic students',
                        'free_lunch':'% students with free or reduced cost lunch',
                        'county_connections_ratio': 'internet connection ratio'},
               inplace=True)
bardata.plot(kind='barh', figsize=(13,13))
plt.title('Social description of states',fontsize=24)
plt.show()
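
The "no correlation" impression from the chart can be checked numerically. Below is a minimal sketch of the Pearson correlation coefficient in plain Python. The numbers are made up for illustration and the function name `pearson` is hypothetical. In the notebook itself, `bardata.corr()` should give the same coefficient for every pair of columns.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-state values, for illustration only (not taken from the dataset).
bl_his = [0.10, 0.30, 0.50, 0.70]
free_lunch = [0.20, 0.25, 0.45, 0.60]

r = pearson(bl_his, free_lunch)
print(round(r, 3))
```

A coefficient near 0 would support the visual impression; values near 1 or -1 would contradict it.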

As we see, the spread of per-pupil expenditure is not very large. The District of Columbia spends the most and Florida
the least. The maximum is about 2.5 times the minimum.

In [None]:
fig, ax = plt.subplots(1, figsize = (12,12))
treemapdata = districts.groupby('state').pp_total_raw.mean().sort_values(ascending=False)
treemapdata = treemapdata[treemapdata != 0]
squarify.plot(sizes = treemapdata, label=treemapdata.index)
plt.title('Per-pupil total expenditure', fontsize=20)
plt.axis('off')
plt.show()
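
The ratio between the highest and the lowest expenditure can be computed directly rather than read off the treemap. Here is a minimal sketch with made-up numbers and placeholder state names. In the notebook, `treemapdata.max() / treemapdata.min()` should give the actual ratio once the zero rows are filtered out.

```python
# Made-up mean per-pupil expenditure by state, for illustration only.
expenditure = {"State A": 20000.0, "State B": 12500.0, "State C": 8000.0}

highest = max(expenditure, key=expenditure.get)
lowest = min(expenditure, key=expenditure.get)
ratio = expenditure[highest] / expenditure[lowest]

print(highest, lowest, ratio)
```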

<h2 id="Products-statistics">Products statistics</h2>

In [None]:
# load data with products
products = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')

In [None]:
# load engagements data
frames = []
pth = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data/'
for file in os.listdir(pth):
    eng = pd.read_csv(pth + file)
    # the file name without its extension is the district id
    eng['district_id'] = file[:file.find('.')]
    frames.append(eng)
# concatenate once instead of growing a DataFrame inside the loop
engs = pd.concat(frames)
engs.fillna(0, inplace=True)

In [None]:
# compute the mean engagement index for each product
products_count = engs.groupby('lp_id').engagement_index.mean()
products.rename(columns={'LP ID':'lp_id'},inplace=True)
products.set_index('lp_id', inplace=True)
products = products.join(products_count)

In [None]:
top20products = products.sort_values('engagement_index',ascending=False).head(20)
top20products.set_index('Product Name').plot(kind='barh',figsize=(12,12))
plt.title('Top-20 products with the biggest engagement index',fontsize=20)
plt.show()

In [None]:
# one-hot encoding for sectors
products.rename(columns={'Product Name':'product_name',
                         'Provider/Company Name':'company_name',
                         'Sector(s)':'sectors',
                         'Primary Essential Function':'function'},
                inplace=True)
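# assign the result back; an in-place fillna on a selected column may not update the DataFrame
products['sectors'] = products.sectors.fillna('')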
products['prek12'] = products.sectors.map(lambda s:1 if 'PreK-12' in s else 0)
products['higher_ed'] = products.sectors.map(lambda s:1 if 'Higher Ed' in s else 0)
products['corporate'] = products.sectors.map(lambda s:1 if 'Corporate' in s else 0)
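
The substring test above works because none of the three labels is contained in another. A stricter variant splits the string into labels first. The sketch below assumes the labels are separated by `;` (for example `PreK-12; Higher Ed`); the function name `encode_sectors` is hypothetical. In pandas, `Series.str.get_dummies(sep='; ')` should produce similar indicator columns under the same assumption.

```python
def encode_sectors(sectors, known=("PreK-12", "Higher Ed", "Corporate")):
    """Return a dict of 0/1 flags, one per known sector label."""
    # Assumed format: labels separated by ';', e.g. "PreK-12; Higher Ed".
    present = {part.strip() for part in (sectors or "").split(";")}
    return {label: int(label in present) for label in known}


print(encode_sectors("PreK-12; Higher Ed"))
print(encode_sectors(""))
```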

As we see, the corporate sector has the fewest products but the highest mean engagement index, while PreK-12 has the most
products and the lowest index. So working adults are probably more interested in online education, or have more opportunities for it.

In [None]:
fig, ax = plt.subplots(1,2,figsize=(12,3))
ax[0].barh(y=['PreK-12','Higher Ed','Corporate'],
           width=[sum(products.prek12),sum(products.higher_ed),sum(products.corporate)])
ax[0].set_title('Number of products in each sector of education')
ax[1].barh(y=['PreK-12','Higher Ed','Corporate'],
         width=[products[products.prek12 == 1].engagement_index.mean(),
                products[products.higher_ed == 1].engagement_index.mean(),
                products[products.corporate == 1].engagement_index.mean()])
ax[1].set_title('The mean engagement index in each sector of education')
plt.show()

As we see, SDO (School & District Operations) products have a much higher mean engagement index than products with other functions.

In [None]:
products['function'] = products.function.fillna('-')
products['category'] = products.function.map(lambda s: s[0:s.find('-')-1])
products.groupby('category').engagement_index.mean().plot(kind='barh')
plt.title('Product functions and the mean engagement index')
plt.show()

The most popular products are learning management systems (LMS), followed by online course providers.

In [None]:
products['sub_category'] = products.function.map(lambda s: s[s.find('-',0)+2:])
products.sub_category = products.sub_category.map(lambda s: s[:s.find('-',0)-1] if '-' in s else s)
products.groupby('sub_category').engagement_index.mean().plot(kind='barh',figsize=(10,10))
plt.title('Product sub-functions and the mean engagement index')
plt.show()
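
The category and sub-category extraction can also be written as a split on the separator. The sketch below assumes the function strings look like `LC - Digital Learning Platforms` or `LC - Sites, Resources & Reference - Games & Simulations`, that is, parts joined by a hyphen with a space on each side. The function name `split_function` is hypothetical.

```python
def split_function(function):
    """Split 'CAT - Sub - Detail' into (category, sub_category)."""
    # Empty strings and the '-' placeholder carry no category information.
    if not function or function.strip() == "-":
        return "", ""
    # Assumed separator: a hyphen surrounded by single spaces.
    parts = function.split(" - ")
    category = parts[0].strip()
    sub_category = parts[1].strip() if len(parts) > 1 else ""
    return category, sub_category


print(split_function("LC - Digital Learning Platforms"))
print(split_function("LC - Sites, Resources & Reference - Games & Simulations"))
```

Splitting on the spaced separator leaves hyphens inside a name untouched, which index arithmetic on the first `-` does not guarantee.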

<h2 id="COVID-19">COVID-19</h2>
Changes in engagement on LearnPlatform during COVID-19.

<p>I'm going to use the xlsx file from the
<a href="https://www.openicpsr.org/openicpsr/project/119446/version/V75/view;jsessionid=851ECB80E6CB42252D396C29564184DC">COVID-19 US State Policy database</a>.</p>

In [None]:
!pip install openpyxl

In [None]:
# loading file
covid19 = pd.read_excel('../input/covid19-us-state-policy-database-3-29-2021/COVID-19 US state policy database 3_29_2021.xlsx')
covid19.drop(labels=[0,1,2,55,56], axis=0, inplace=True)
# match the spelling used in the districts data
covid19['STATE'] = covid19.STATE.replace('District of Columbia','District Of Columbia')
covid19.set_index('STATE',inplace=True)
covid19

In [None]:
# adding states to engagement dataframe
engs.district_id = engs.district_id.astype(int)
engs = engs.join(districts[['district_id','state']].set_index('district_id'),on='district_id')

In [None]:
engs.groupby('state').engagement_index.mean().sort_values(ascending=False).plot(kind='barh',figsize=(12,6))
plt.title('The mean engagement index in each state',fontsize=20)
plt.show()

The most active states are:
- Arizona
- North Dakota
- New York
- New Hampshire
- District Of Columbia

Let's explore engagement in these states over the whole year. The dashed lines mark the policy dates:
- Red: State of emergency issued
- Green: State of emergency expired
- Blue: Stay at home/ shelter in place
- Orange: End stay at home/shelter in place

In [None]:
fig, ax = plt.subplots(5,figsize=(10,30))
cols = ['STEMERG','STEMERGEND','STAYHOME','END_STHM']
colors = ['red','green','blue','orange']
top5states = ['Arizona','North Dakota','New York','New Hampshire','District Of Columbia']
for i,st in enumerate(top5states):
    data_for_plot = engs[engs.state == st].groupby('time').engagement_index.mean()
    ax[i].plot(data_for_plot.index, data_for_plot)
    ax[i].set_title(st)
    ax[i].axes.xaxis.set_visible(False)
    for col,color in zip(cols,colors):
        if covid19.loc[st][col]!=0:
            ax[i].axvline(covid19.loc[st][col].timetuple().tm_yday, color=color, linestyle="dashed")

plt.show()
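
A note on where the dashed lines land. If the `time` values are plain strings (the default when a CSV is read without date parsing), matplotlib places them at positions 0, 1, 2, and so on, while `tm_yday` is a 1-based day of the year. The marker is then only approximately aligned, and the offset grows if some days are missing from a state's data. The sketch below shows one way to look up the position directly; the function name `marker_position` and all dates in it are hypothetical.

```python
from bisect import bisect_left
from datetime import date, timedelta


def marker_position(dates, event):
    """Index of the first plotted date that is on or after `event`.

    `dates` must be a sorted list of datetime.date objects, one per
    plotted point; the result is a position on a 0-based categorical axis.
    """
    return bisect_left(dates, event)


# Hypothetical series: every day of 2020 except the first week of January.
start = date(2020, 1, 8)
dates = [start + timedelta(days=n) for n in range(359)]

event = date(2020, 3, 13)  # example date, not taken from the policy file
print(event.timetuple().tm_yday)      # 1-based day of the year
print(marker_position(dates, event))  # position among the plotted points
```

An alternative is to convert the index to real dates before plotting, so that both the curve and the markers share a date axis.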

What we can say:
- There is a gap during the summer, as expected during school holidays.
- We don't have enough data from North Dakota for analysis.
- After the state of emergency begins, engagement grows, especially after the summer holidays.

Let's explore PreK-12 products separately. The red line is the date when schools were closed.

In [None]:
engs = engs.join(products[['prek12']],on='lp_id')
fig, ax = plt.subplots(5,figsize=(10,30))
for i,st in enumerate(top5states):
    data_for_plot = engs[(engs.state == st)&(engs.prek12 == 1)].groupby('time').engagement_index.mean()
    ax[i].plot(data_for_plot.index, data_for_plot)
    ax[i].set_title(st)
    ax[i].axes.xaxis.set_visible(False)
    if covid19.loc[st]['CLSCHOOL']!=0:
        ax[i].axvline(covid19.loc[st]['CLSCHOOL'].timetuple().tm_yday, color='red', linestyle="dashed")

plt.show()

After schools close, online engagement grows, especially in New Hampshire and New York.
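
The growth read from the charts could be backed by a simple comparison of mean engagement before and after the closure date. Below is a minimal sketch with made-up values; the function name `before_after_means` is hypothetical. Such a comparison does not separate the effect of the closure from seasonal patterns such as weekends and holidays, so it is only a rough check.

```python
from statistics import mean


def before_after_means(series, event_day):
    """Mean of values before `event_day` and from `event_day` onward.

    `series` maps a day number to an engagement value.
    """
    before = [v for day, v in series.items() if day < event_day]
    after = [v for day, v in series.items() if day >= event_day]
    if not before or not after:
        raise ValueError("need data on both sides of the event")
    return mean(before), mean(after)


# Made-up daily engagement values, for illustration only.
series = {1: 100.0, 2: 110.0, 3: 90.0, 4: 150.0, 5: 170.0, 6: 160.0}
before, after = before_after_means(series, event_day=4)
print(before, after, after / before)
```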

<h2 id="Conclusions">Conclusions</h2>

To summarize this descriptive analysis:
- Suburban districts prevail.
- At first sight, there is no correlation between the percentage of Black/Hispanic students, the percentage
of students with free or reduced-cost lunches, and the internet connection ratio. Of course, I might be wrong. This kind of analysis
demands more data.
- Per-pupil expenditure differs by a factor of about 2.5 between states. The District of Columbia spends the most and Florida
the least.
- SDO (School & District Operations) products have the highest engagement.
- The corporate sector has the fewest products but the highest engagement index; the opposite holds
for PreK-12.
- The most popular products are LMS.
- The most active states are: Arizona, North Dakota, New York, New Hampshire, District Of Columbia.
- After the COVID-19 state of emergency begins, engagement grows, especially after the summer holidays.
- After schools close, online engagement grows.



