# Executive Summary
LearnPlatform (LP henceforth) was founded to give all students equal access to education. LP is concerned that education is not equal across ethnicities and genders, and wants to study how the Covid-19 pandemic has exacerbated this inequality.

Your goal is to look for educational inequalities with respect to district demographics, broadband access, state Covid policies, and then propose a solution to remedy these inequalities.

Why should I even care? Because of the Covid-19 pandemic, education became virtual. So we are trying to identify: (1) how people learned during the 2020 pandemic, and (2) how virtual learning inequality relates to district demographics, broadband access, and state Covid policies.

This notebook looks at the online learning inequality between advantaged (low minority/paid lunch) and disadvantaged (high minority/free lunch) schools within each state. The notebook first looks at the delta before and after K-12 public schools were closed, finding that the online learning gap on average *increased* after K-12 public schools closed. It then proposes four states that merit further analysis to see what we can learn from their educational policies. It ends with a case study on Utah, a state that had high inequality during Covid-19 but reached near-equality in online learning for the 2020-21 school year, and looks at which products drove that change.

In [None]:
import pandas as pd
import numpy as np
import os, gc
from tqdm import tqdm

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"]=20,7
import seaborn as sns
import datetime

import umap

sns_deep10 = ['#4C72B0','#DD8452','#55A868','#C44E52','#8172B3','#937860','#DA8BC3','#8C8C8C','#CCB974','#64B5CD']

pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# Read districts_info:

In [None]:
districts_info = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')

# Data cleaning districts_info:
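
Several columns in this file are stored as range strings of the form `[x, y[`. As a minimal sketch of how such a string can be turned into its midpoint, here is a hypothetical helper, `range_midpoint`, which is not part of this notebook and takes the raw string directly rather than a pre-split list:

```python
def range_midpoint(range_string):
    """Return the midpoint of a half-open range string such as '[0.4, 0.6['."""
    # Drop the leading '[' and the trailing '[' and split on the comma
    lower, upper = range_string.strip()[1:-1].split(',')
    return (float(lower) + float(upper)) / 2


print(range_midpoint('[0.4, 0.6['))      # 0.5
print(range_midpoint('[14000, 16000['))  # 15000.0
```

Taking the midpoint is a simplification: every district in a bin gets the same value, so the converted column is still effectively categorical.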

In [None]:
def convert_range_to_float(the_range):
    '''
    Input:
        the_range (list of str) - a range string "[x, y[" already split on ",", i.e. ["[x", " y["]
    Output:
        avg (float) - the midpoint of x and y
    '''
    return (float(the_range[0][1:]) + float(the_range[1][:-1])) / 2

In [None]:
# The pct_black/hispanic are represented by quintile ranges
# Let's convert them to numeric by taking the middle
mask = districts_info['pct_black/hispanic'].notnull()
districts_info.loc[mask, 'pct_black/hispanic'] = districts_info.loc[mask, 'pct_black/hispanic'].str.split(',').apply(convert_range_to_float)
districts_info['pct_black/hispanic'] = districts_info['pct_black/hispanic'].astype('float64')

# pct_free/reduced refers to what % of students are eligible for a free/reduced lunch
# Therefore it's a signal for economic status; a high percentage implies a very poor area
mask = districts_info['pct_free/reduced'].notnull()
districts_info.loc[mask, 'pct_free/reduced'] = districts_info.loc[mask, 'pct_free/reduced'].str.split(',').apply(convert_range_to_float)
districts_info['pct_free/reduced'] = districts_info['pct_free/reduced'].astype('float64')

# county_connections_ratio measures the number of connections with basic internet (200 kbps) divided by the number of households
# If it's == 1, that means every household has basic internet
# If it's < 1, that means some houses lack internet
# If it's > 1, that means some households have multiple connections (for example, a large apartment building)
# Let's map it into something easier to read: 0 represents < 1, and 1 represents >= 1
districts_info['county_connections_ratio'] = districts_info['county_connections_ratio'].map({'[0.18, 1[': 0,
                                                                                             '[1, 2[': 1})

# pp_total_raw measures how much USD is spent per student within that district
# A possible follow-up notebook: look at districts with low spending yet high online engagement, and diagnose what they are doing differently.
# Anyway, let's convert this from a string to a number
mask = districts_info['pp_total_raw'].notnull()
districts_info.loc[mask, 'pp_total_raw'] = districts_info.loc[mask, 'pp_total_raw'].str.split(',').apply(convert_range_to_float)
districts_info['pp_total_raw'] = districts_info['pp_total_raw'].astype('float64')

# Read products_info:

In [None]:
products_info = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')

# Data cleaning products_info:
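
The `Primary Essential Function` column packs a category code and a sub-category into one string separated by ` - `. The sketch below shows one way to split it on toy rows (the example values are illustrative, and the column names `category_code` and `subcategory` are made up for this example). Unlike the cell in this notebook, it splits only on the first separator, so a sub-category that itself contains ` - ` is kept whole:

```python
import pandas as pd

# Toy rows imitating the 'Primary Essential Function' column (illustrative values)
toy_products = pd.DataFrame({
    'Primary Essential Function': [
        'LC - Digital Learning Platforms',
        'CM - Classroom Engagement & Instruction - Classroom Management',
        None,
    ]
})

# n=1 splits only on the first ' - '; expand=True returns one column per piece
parts = toy_products['Primary Essential Function'].str.split(' - ', n=1, expand=True)
toy_products['category_code'] = parts[0]
toy_products['subcategory'] = parts[1]

print(toy_products[['category_code', 'subcategory']])
```

Rows with a missing value stay missing in both new columns, so no mask is needed.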

In [None]:
# Primary Essential Function has 2 labels: a category and sub-category. Let's split it.
mask = products_info['Primary Essential Function'].notnull()
category_map = {'LC':'Learning and Curriculum', 'CM':'Classroom Management', 'SDO': 'School and District Operations', 'LC/CM/SDO': 'Other'}
products_info.loc[mask, 'product_category'] = products_info.loc[mask,'Primary Essential Function'].str.split(' - ').apply(lambda x: x[0]).map(category_map)
products_info.loc[mask, 'product_subcategory'] = products_info.loc[mask,'Primary Essential Function'].str.split(' - ').apply(lambda x: x[1])

# Read in the engagement data:

In [None]:
base_filepath = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data/'

list_of_dfs = []
for file in tqdm(os.listdir(base_filepath)):
    df = pd.read_csv(base_filepath + file)
    df['district_id'] = file[:-4] # Remove the 4 letters .csv
    list_of_dfs.append(df)

In [None]:
engage = pd.concat(list_of_dfs,axis=0).reset_index(drop=True)

In [None]:
# Sum up the total engagement across all products - this is a measure of the total online learning
engage_summed = engage.groupby(['time','district_id'])['engagement_index'].sum().to_frame('total_pageloads_per_1k_students').reset_index(drop=False)

In [None]:
engage_summed['district_id'] = engage_summed['district_id'].astype('int64')

In [None]:
districts_info['county_connections_ratio'].value_counts(dropna=False)

# Now we want to measure inequality between groups.

Let's plot all the districts based on their inequality metrics - pct_black/hispanic and pct_free/reduced.

Why didn't I include county_connections_ratio? Because there is only 1 district with a high value. So it's not useful for our dataset.

Why didn't I include pp_total_raw? Because I first tried an unsupervised dimensionality-reduction approach called UMAP, and it basically split the dataset into districts where pp_total_raw is NULL vs. NOT NULL. So I didn't think the column was descriptive enough.

In [None]:
def jitter(values):
    return values + np.random.normal(0,0.01,values.shape)

fig, ax = plt.subplots(figsize=(10,5))
sns.scatterplot(x=jitter(districts_info['pct_black/hispanic'].fillna(districts_info['pct_black/hispanic'].mean())), 
                y=jitter(districts_info['pct_free/reduced'].fillna(districts_info['pct_free/reduced'].mean())), ax=ax)

# Based on this, we can create 2 clusters.

1. Districts in the bottom left, where students usually pay for or bring their own lunch and the black/hispanic population is low
2. Districts in the top right, where students need free lunches and the black/hispanic percentage is high
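
The two-group split can be sketched on toy data as follows. This is a simplified illustration, not the exact cell used in this notebook: it uses the same 0.4 threshold but does not fill missing values with the column mean, so a district with only one missing value falls into 'Disadvantaged'.

```python
import numpy as np
import pandas as pd

# Toy districts (made-up values)
toy = pd.DataFrame({
    'pct_black/hispanic': [0.1, 0.7, np.nan, 0.3],
    'pct_free/reduced':   [0.1, 0.5, np.nan, 0.7],
})

no_info = toy['pct_black/hispanic'].isnull() & toy['pct_free/reduced'].isnull()
advantaged = (toy['pct_black/hispanic'] < 0.4) & (toy['pct_free/reduced'] < 0.4)

# np.select uses the first matching condition; everything else gets the default
toy['cluster'] = np.select([no_info, advantaged],
                           ['No info', 'Advantaged'],
                           default='Disadvantaged')
print(toy)
```

Note that a district has to be below the threshold on both axes to count as 'Advantaged'; being low on only one of them is not enough.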

In [None]:
cluster1 = ((districts_info['pct_black/hispanic'].fillna(districts_info['pct_black/hispanic'].mean()) < 0.4) & 
            (districts_info['pct_free/reduced'].fillna(districts_info['pct_free/reduced'].mean()) < 0.4))

cluster_no_info = ((districts_info['pct_black/hispanic'].isnull()) & (districts_info['pct_free/reduced'].isnull()))

# Assign the clusters
districts_info.loc[cluster1, 'cluster'] = 'Advantaged'
districts_info.loc[cluster_no_info, 'cluster'] = 'No info'
districts_info.loc[districts_info['cluster'].isnull(), 'cluster'] = 'Disadvantaged'

# Let's plot it color-coded so you can understand what I did

In [None]:
fig, ax = plt.subplots(figsize=(10,5))
sns.scatterplot(x=jitter(districts_info['pct_black/hispanic'].fillna(districts_info['pct_black/hispanic'].mean())), 
                y=jitter(districts_info['pct_free/reduced'].fillna(districts_info['pct_free/reduced'].mean())), ax=ax,
                hue=districts_info['cluster'])

# Now, let's merge the districts_info with daily engagement.

In [None]:
print('Engage_summed shape before merging with districts_info:', engage_summed.shape)
engage_summed = engage_summed.merge(districts_info,on='district_id')
print('Engage_summed shape after merging with districts_info:', engage_summed.shape)

# Great. Now let's answer the first question: How did online learning change before, during, and after the pandemic?

In [None]:
avg_pageloads_by_state = engage_summed.groupby(['time','state'])['total_pageloads_per_1k_students'].agg(['mean','median','size']).reset_index(drop=False)
avg_pageloads_by_state['time'] = pd.to_datetime(avg_pageloads_by_state['time'])

# In order to have some "story"/context for the engagement, I also read in the state policies. This tells me, e.g., the day when each state issued its State of Emergency, when it was lifted, when K-12 public schools were closed, etc.
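
Policy dates are not always present or parseable, so it helps to convert them defensively. The sketch below uses a toy column (made-up values, not the real policy database) to show how `pd.to_datetime` with `errors='coerce'` turns unparseable entries into `NaT`, which can then be skipped with `pd.notnull`:

```python
import pandas as pd

# Toy policy column: one real date, one unparseable entry, one missing entry
raw_dates = pd.Series(['2020-03-16', 'not a date', None])
parsed = pd.to_datetime(raw_dates, errors='coerce')

# Entries that could not be parsed are NaT and are filtered out here
dates_to_plot = [d for d in parsed if pd.notnull(d)]
print(dates_to_plot)
```

Checking with `pd.notnull` avoids comparing against pandas' private `NaTType` class.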

In [None]:
!pip install openpyxl # For reading excel files
state_policies = pd.read_excel('../input/covid19-us-state-policy-database/COVID-19-US-State-Policy-Database-master/COVID-19 US state policy database 7_23_2021.xlsx',
                               sheet_name='State policy changes ', # They put a space at the end of the sheet name
                               skiprows=1,
                               engine='openpyxl'
                              ).tail(-3)
state_policies.loc[state_policies['State']=='District of Columbia','State'] = 'District Of Columbia' # Rename with capital O

In [None]:
# Convert datetime columns to datetime

date_cols = [
    'State of emergency issued', 'State of emergency lifted', 'Date closed K-12 public schools',
    'Closed day cares', 'Reopen day cares','Date banned visitors to nursing homes',
    'Stay at home/ shelter in place',
    'Stay at home order\' issued but did not specifically restrict movement of the general public',
    'End/relax stay at home/shelter in place','Closed other non-essential businesses', 'Closed businesses overnight',
    'Began to reopen businesses','Ended face mask mandate', 'Ended face mask mandate x2',
    'Face mask mandate for employees of public-facing businesses',
    'Attempted to prevent local governments from implementing face mask orders',
    'Banned local mask mandates','Allowed restaurants to sell takeout alcohol','Allowed restaurants to deliver alcohol',
    'Closed restaurants except take out', 'Reopen restaurants','Closed gyms','Reopened gyms', 
    'Closed movie theaters', 'Reopened movie theaters','Closed Bars', 'Reopen bars', 
    'Reopened hair salons/barber shops','Reopened religious gatherings', 'Reopened other non-essential retail',
    'Allowed businesses to reopen overnight', 'Began to reclose bars','Closed bars (x2)', 
    'Closed movie theaters (x2)','Closed hair salons/barber shops (x2)', 'Closed gyms (x2)',
    'Closed restaurants (x2)', 'Reopened restaurants (x2)','Reopened bars (x2)', 'Reopened gyms (x2)',
    'Reopened hair salons/barber shops (x2)','Reopened movie theaters (x2)', 'Closed bars (x3)',
    'Closed restaurants (x3)', 'Reopened bars (x3)','Reopened restaurants (x3)',
    'Mandate quarantine for those entering the state from specific settings',
    'Mandate quarantine for all individuals entering the state',
    'Date all mandated quarantines ended','Date vaccine allocation plan last updated',
    'Date adults ages 80+ became eligible for COVID-19 vaccination',
       'Date adults ages 75+ became eligible for COVID-19 vaccination',
       'Date adults ages 70+ became eligible for COVID-19 vaccination',
       'Date adults ages 65+ became eligible for COVID-19 vaccination',
       'Date adults ages 60+ became eligible for COVID-19 vaccination',
       'Date adults ages 55+ became eligible for COVID-19 vaccination',
       'Date adults ages 50+ became eligible for COVID-19 vaccination',
       'Date adults ages 45+ became eligible for COVID-19 vaccination',
       'Date adults ages 40+ became eligible for COVID-19 vaccination',
       'Date adults ages 30+ became eligible for COVID-19 vaccination',
       'Date K-12 school employees became eligible for COVID-19 vaccination',
       'Date grocery store workers became eligible for COVID-19 vaccination',
       'Date incarcerated people became eligible for COVID-19 vaccination',
       'Date general public became eligible for COVID-19 vaccination',
    'First overall eviction moratorium start','First overall eviction moratorium end',
    'Second overall eviction moratorium start','Second overall eviction moratorium end',
    'Third overall eviction moratorium start','Third overall eviction moratorium end',
    'First eviction initiation ban start','First eviction initiation ban end','Second Eviction Initiation Ban Start',
    'Second Eviction Initiation Ban End','First eviction hearing ban start', 'First eviction hearing ban end',
    'Second Eviction Hearing Ban Start', 'Second Eviction Hearing Ban End','First eviction enforcement ban start',
    'First eviction enforcement ban end','Second Eviction Enforcement Ban Start','Second Eviction Enforcement Ban End',
    'COVID-19 hardship limitation start','COVID-19 hardship limitation end','Second COVID-19 hardship limitation start',
    'Second COVID-19 hardship limitation end','Non-payment limitation start', 'Non-payment limitation end',
    'Second non-payment limitation start','Second non-payment limitation end', 'CARES Act pleading start',
    'CARES Act pleading end', 'CDC moratorium start', 'CDC moratorium end','Late Fee Ban Start', 
    'Late Fee Ban End', 'Second Late Fee Ban Start','Second Late Fee Ban End', 'Utilities shutoff moratorium start',
    'Utilities shutoff moratorium expiration','Second utilities shutoff moratorium start',
    'Utilities reconnection start', 'Utilities reconnection end',
    'SNAP Waiver - Emergency Allotments to Current SNAP Households',
       'SNAP Waiver - Pandemic EBT during school year 2019-2020',
       'SNAP Waiver - Pandemic EBT during school year 2020-2021',
       'SNAP Waiver - Pandemic EBT during summer 2021',
       'SNAP Waiver - Temporary Suspension of Claims Collection',
    'Modify Medicaid requirements with 1135 waivers (date of CMS approval)',
       'Reopened ACA enrollment using a special enrollment period','Allow audio-only telehealth',
       'Allow/expand Medicaid telehealth coverage',
    'Stopped personal visitation in state prisons','Stopped legal visitation in state prisons',
    'Resumed visitation in state prisons','Stopped visitation in state prisons x2',
    'Resumed visitation in state prisons x2','Suspended elective medical procedures',
    'Resumed elective medical procedures',
    'No order to suspend elective medical procedures but did release guidance or orders to resume',
    'Suspended elective medical procedures x2','Resumed elective medical procedures x2',
    'Waived one week waiting period for unemployment insurance','Extended Benefits program activated',
    'Extended Benefits program deactivated','Extended Benefits program activated x2',
    '20-week Extended Benefits program activated','20-week Extended Benefits program deactivated',
    '20-week Extended Benefits program activated x2',
       'Stopped participating in pandemic-related federal unemployment benefit programs',
        'Use of telemedicine/telephone evaluations to initiate buprenorphine prescribing',
       'Patients can receive 14-28 take-home doses of opioid medication',
       'Home delivery of take-home medication by opioid treatment programs',
       'Use of telemedicine for schedule II-V prescriptions','Exceptions to emergency oral prescriptions',
       'Waive requirement to obtain separate DEA registration to dispense outside home state',
    'Last date of receipt of mail-in ballot request for the general election (by mail or online)',
       'Closed casinos', 'Reopened casinos', 'Closed casinos (x2)','Reopened casinos (x2)',
]

for date_col in tqdm(date_cols):
    state_policies[date_col] = pd.to_datetime(state_policies[date_col], errors='coerce')

# Plot the average engagement for each state.

In [None]:
for state, state_df in avg_pageloads_by_state.groupby('state'):
    state_df = state_df.reset_index(drop=True).reset_index(drop=False)
    xticks = state_df.loc[state_df['time'].dt.day==1,'index'].values
    emergency_start = state_policies.loc[state_policies['State']==state,'State of emergency issued'].iloc[0]
    emergency_end = state_policies.loc[state_policies['State']==state,'State of emergency lifted'].iloc[0]
    k12_closed = state_policies.loc[state_policies['State']==state,'Date closed K-12 public schools'].iloc[0]
    k12_teachers_eligible = state_policies.loc[state_policies['State']==state,'Date K-12 school employees became eligible for COVID-19 vaccination'].iloc[0]
    snap_19_20 = state_policies.loc[state_policies['State']==state,'SNAP Waiver - Pandemic EBT during school year 2019-2020'].iloc[0]
    snap_20_21 = state_policies.loc[state_policies['State']==state,'SNAP Waiver - Pandemic EBT during school year 2020-2021'].iloc[0]
    
    ax = sns.lineplot(x=state_df['time'], y=state_df['mean'])
    sns.scatterplot(x=state_df['time'], y=state_df['mean'], ax=ax)
    ax.set_xlabel('Time')
    ax.set_ylabel('Total Pageloads per 1k students')
    #ax.set_xticks(xticks)
    #ax.set_xticklabels(state_df.loc[state_df['time'].dt.day==1,'time'].dt.date.map(str).values)
    # Draw a vertical line for each policy date that exists (NaT means no date is recorded)
    policy_markers = [
        (emergency_start, 'red', 'State of emergency issued'),
        (emergency_end, 'green', 'State of emergency lifted'),
        (k12_closed, 'orange', 'K-12 public schools closed'),
        (k12_teachers_eligible, 'blue', 'K-12 teachers eligible vaccine'),
        (snap_19_20, 'purple', 'SNAP School Year 2019-2020'),
        (snap_20_21, 'black', 'SNAP School Year 2020-2021'),
    ]
    for policy_date, color, label in policy_markers:
        if pd.notnull(policy_date):
            ax.axvline(policy_date, color=color, label=label)
    ax.set_title(state+': Total # of pageloads per 1k students, averaged across districts')
    plt.legend()
    plt.show()

# Summary of previous plots: What happened right after K-12 public schools were closed?

1. Arizona looks like it had a Spring Break, after which virtual learning slowly decreased until Summer. After Summer, virtual learning was really high, but after a couple of months it went back down.
2. California looks like it had a Spring Break, after which virtual learning slowly decreased until Summer. After Summer, virtual learning was really high, and was sustained.
3. Connecticut increased virtual learning 2x, followed by a slow decline toward Summer (but still higher than before). After Summer, virtual learning was high and increasing.
4. Washington DC had low virtual learning both before and after K-12 schools closed. Then after Summer, it had a roughly 5x to 6x increase in virtual learning.
5. Florida had a Spring Break, then a minor increase in virtual learning which decayed toward Summer. After Summer, there was a 2x increase which dipped a little but was sustained.
6. Illinois had a small bump up in virtual learning which decayed toward Summer. After Summer, there was a small increase in virtual learning which was sustained for months.
7. Indiana had high virtual learning before the pandemic, then it looks like virtual learning went way down during the pandemic (why?). After Summer, virtual learning was high again.
8. Massachusetts had a dip when the K-12 closure was announced, then went back to its regular level of virtual learning, which slowly decayed until Summer. After Summer, virtual learning picked up 2x and was sustained.
9. Michigan had basically 0 virtual learning after K-12 schools shut down. Then after Summer, it increased 4x and was sustained.
10. Minnesota - immediately low after K-12 schools closed, then a period of 2 weeks really high, then it went to 0. (Not enough data, to be honest.)
11. Missouri - it seems they had their Spring Break before K-12 schools were closed. Virtual learning decayed toward Summer, then picked back up and was sustained after Summer.
12. New Hampshire - after K-12 schools shut down, a big increase in virtual learning. Then Summer. Then sustained virtual learning after Summer.
13. New Jersey - not much of a change in virtual learning. It appears virtual learning was increasing before the state of emergency was declared. Then they had a short Summer and had sustained and increasing virtual learning after Summer.
14. New York - it appears virtual learning was increasing before the state of emergency. It declined toward Summer. After Summer, it was strong and sustained.
15. North Carolina - after K-12 schools closed, basically 0 virtual learning. After Summer, it went back to high and sustained.
16. North Dakota - not enough data.
17. Ohio - Spring Break, then virtual learning stayed high. It appears virtual learning increased before the state of emergency was declared. Slow decline to Summer. Then high and sustained after Summer.
18. Tennessee - low virtual learning, then 0 after the State of Emergency. Then a roughly 2.5x increase after Summer.
19. Texas - a week before the state of emergency, virtual learning went to 0. Then in the middle it came back up to its usual level, then back to 0 over the Summer. After Summer it went high, roughly 2x, and was sustained.
20. Utah - after K-12 schools closed, a slight bump in virtual learning which then decayed toward Summer. After Summer, it was high and sustained.
21. Virginia - after K-12 schools closed, virtual learning became lower. It decayed toward Summer. After Summer, it was high and sustained.
22. Washington - after K-12 schools closed, it looks like Spring Break (because it's low). Then it stayed low and decayed until Summer. After Summer, it was high and sustained.
23. Wisconsin - after K-12 schools closed, it looks like about 1 week with no activity. Then it was medium-high and sustained until Summer. After Summer it was medium, with a slight decay.

# This is all fine and dandy, but remember we are focused on inequality. Therefore we should look at the difference between the average pageloads of Advantaged and Disadvantaged schools in the same state.

In [None]:
avg_pageloads_by_state_split_by_cluster = engage_summed.groupby(['time','state','cluster'])['total_pageloads_per_1k_students'].agg(['mean','median','size']).reset_index(drop=False)
avg_pageloads_by_state_split_by_cluster['time'] = pd.to_datetime(avg_pageloads_by_state_split_by_cluster['time'])

In [None]:
# Create advantaged average pageviews minus disadvantaged average pageviews, per state
stat_cols = ['mean','median','size']
advantaged_stats = avg_pageloads_by_state_split_by_cluster.query('cluster=="Advantaged"').set_index(['time','state'])[stat_cols]
disadvantaged_stats = avg_pageloads_by_state_split_by_cluster.query('cluster=="Disadvantaged"').set_index(['time','state'])[stat_cols]
advantaged_minus_disadvantaged_pageviews = (advantaged_stats - disadvantaged_stats).reset_index(drop=False)

In [None]:
# Look at the dataframe to make sure you are comfortable with what it looks like
# You can see sometimes it is NaN. That can occur when we have data for the Advantaged but not for Disadvantaged, or vice-versa.
advantaged_minus_disadvantaged_pageviews.head(10)

# Repeat the same plots state-by-state.

This time, whenever you see a ***positive*** value, it means the Advantaged schools are doing more with online learning.

Whenever you see a ***negative*** value, it means the Disadvantaged schools are doing more with online learning.

If there is true equality, then the y-value should be zero, which means both Advantaged and Disadvantaged schools are having the same amounts of online learning engagement.
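
To make the sign convention concrete, here is a small sketch on made-up numbers (a single toy state over two days). It computes the same Advantaged-minus-Disadvantaged gap with a pivot instead of two indexed frames; the frame `toy` and the column `gap` are names chosen for this example only:

```python
import pandas as pd

# Toy per-cluster daily means (made-up values)
toy = pd.DataFrame({
    'time':    ['2020-03-02', '2020-03-02', '2020-03-03', '2020-03-03'],
    'state':   ['Utah', 'Utah', 'Utah', 'Utah'],
    'cluster': ['Advantaged', 'Disadvantaged', 'Advantaged', 'Disadvantaged'],
    'mean':    [900.0, 1000.0, 1200.0, 1000.0],
})

# One row per (time, state), one column per cluster
wide = toy.pivot_table(index=['time', 'state'], columns='cluster', values='mean')
wide['gap'] = wide['Advantaged'] - wide['Disadvantaged']

print(wide)
```

On the first day the gap is -100 (Disadvantaged schools engage more), and on the second day it is +200 (Advantaged schools engage more).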

In [None]:
for state, state_df in advantaged_minus_disadvantaged_pageviews.groupby('state'):
    state_df = state_df.reset_index(drop=True).reset_index(drop=False)
    xticks = state_df.loc[state_df['time'].dt.day==1,'index'].values
    emergency_start = state_policies.loc[state_policies['State']==state,'State of emergency issued'].iloc[0]
    emergency_end = state_policies.loc[state_policies['State']==state,'State of emergency lifted'].iloc[0]
    k12_closed = state_policies.loc[state_policies['State']==state,'Date closed K-12 public schools'].iloc[0]
    k12_teachers_eligible = state_policies.loc[state_policies['State']==state,'Date K-12 school employees became eligible for COVID-19 vaccination'].iloc[0]
    snap_19_20 = state_policies.loc[state_policies['State']==state,'SNAP Waiver - Pandemic EBT during school year 2019-2020'].iloc[0]
    snap_20_21 = state_policies.loc[state_policies['State']==state,'SNAP Waiver - Pandemic EBT during school year 2020-2021'].iloc[0]
    
    ax = sns.lineplot(x=state_df['time'], y=state_df['mean'])
    sns.scatterplot(x=state_df['time'], y=state_df['mean'], ax=ax)
    ax.set_xlabel('Time')
    ax.set_ylabel('Advantaged minus Disadvantaged pageloads per 1k students')
    #ax.set_xticks(xticks)
    #ax.set_xticklabels(state_df.loc[state_df['time'].dt.day==1,'time'].dt.date.map(str).values)
    # Draw a vertical line for each policy date that exists (NaT means no date is recorded)
    policy_markers = [
        (emergency_start, 'red', 'State of emergency issued'),
        (emergency_end, 'green', 'State of emergency lifted'),
        (k12_closed, 'orange', 'K-12 public schools closed'),
        (k12_teachers_eligible, 'blue', 'K-12 teachers eligible vaccine'),
        (snap_19_20, 'purple', 'SNAP School Year 2019-2020'),
        (snap_20_21, 'black', 'SNAP School Year 2020-2021'),
    ]
    for policy_date, color, label in policy_markers:
        if pd.notnull(policy_date):
            ax.axvline(policy_date, color=color, label=label)
    ax.set_title(state+': Advantaged minus Disadvantaged pageloads per 1k students, averaged across districts')
    plt.legend()
    plt.show()

# Analysis of these plots:

- If one of the state's plots is empty, that means we did not have both an Advantaged and Disadvantaged school for that state. So we couldn't take the difference between the two (it was all NULL values) so nothing was plotted!

1. California had slightly more online learning engagement from Disadvantaged schools, but AFTER the State of Emergency was issued and the K-12 public schools were closed, you can clearly see that the Advantaged schools start having more online learning engagement! After summer vacation, the Disadvantaged schools have more online learning for about 1 month, and then the Advantaged schools have more online learning engagement for the rest of our data period.
2. Connecticut had more online learning engagement for Advantaged schools than Disadvantaged schools both before AND after the pandemic. However, similar to California, you can see that the difference between the two nearly doubled when the State of Emergency was issued and K-12 public schools were closed.
3. Illinois is in a similar situation. You can see that Advantaged schools had more online learning before, but after the State of Emergency was issued and K-12 public schools closed, the gap between Advantaged and Disadvantaged schools grew to nearly 3-4x its earlier size!
4. Massachusetts had nearly perfect equality before the State of Emergency. Then Advantaged schools had more online engagement than Disadvantaged schools. However, after a period of about 2 months, they came back down to rough equality in May 2020.
5. Missouri is biased toward Advantaged schools having more engagement. After the State of Emergency and K-12 schools closing, the level of inequality actually lowered.
6. New York is the opposite of all the other states. Before the K-12 public schools closed, Advantaged schools had 100,000 more pageloads per 1k students on average than Disadvantaged schools. Even after the State of Emergency was issued, this bias still existed. However, right when the K-12 public schools closed, New York does a complete 180: now the Disadvantaged schools have 100,000 more pageloads per 1k students than the Advantaged schools. This slowly tapered down toward equality as summer rolled along, but when school started again in September 2020, Disadvantaged schools were still engaging more with online learning than Advantaged schools.
7. North Carolina - not enough schools to really make a confident statement.
8. Ohio had more engagement by Advantaged schools. After K-12 public schools were closed, the gap dropped for a few weeks but then went back to the pre-pandemic inequality levels.
9. Utah is really different compared to the other states. Pre-pandemic, Utah had Disadvantaged schools engaging in more online learning than Advantaged schools. When K-12 public schools closed, this inequality strengthened, but slowly tapered back to equality during summer 2020. When school resumed in September 2020, we can actually see this equality being roughly maintained, indicating that Utah was able to close the online learning inequality gap for the 2020-21 school year.
10. Virginia doesn't have enough schools to make a confident statement. (But it looks like inequality INCREASED after K-12 public schools were closed).
11. Washington doesn't have enough schools to make a confident statement.

# Can you give me some statistics of the pageloads difference before and after K-12 public schools closed?

Yes, I can. I will take the average 1 month before and the average 1 month after and see the change.
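
As a sketch of the before/after comparison, the example below averages a toy daily gap series over the month before and the month after a hypothetical closure date. All values and the names `closure`, `gap`, `before` and `after` are made up for illustration. Here the closure day is counted only in the "after" window, which is a small deviation from the cell in this notebook, where that day falls in both windows.

```python
import pandas as pd

# Hypothetical closure date and a toy daily gap series (made-up values)
closure = pd.Timestamp('2020-03-16')
days = pd.date_range('2020-02-01', '2020-04-30', freq='D')
gap = pd.Series([100.0 if day < closure else 300.0 for day in days], index=days)

one_month_before = closure - pd.DateOffset(months=1)
one_month_after = closure + pd.DateOffset(months=1)

# Average gap in the month leading up to the closure (closure day excluded)
before = gap[(gap.index >= one_month_before) & (gap.index < closure)].mean()
# Average gap in the month starting on the closure day
after = gap[(gap.index >= closure) & (gap.index <= one_month_after)].mean()

print(before, after)
```

With this toy series the gap goes from 100 to 300 pageloads per 1k students, i.e. an increase of 200.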

In [None]:
list_of_dicts = []
for state, state_df in advantaged_minus_disadvantaged_pageviews.groupby('state'):
    state_df = state_df.reset_index(drop=True).reset_index(drop=False)
    k12_closed = state_policies.loc[state_policies['State']==state,'Date closed K-12 public schools'].iloc[0]
    one_month_before = k12_closed - pd.DateOffset(months=1)
    one_month_after = k12_closed + pd.DateOffset(months=1)
    
    avg_pageloads_difference_before = state_df.loc[(state_df['time']>=one_month_before) & (state_df['time']<=k12_closed),'mean'].mean()
    avg_pageloads_difference_after = state_df.loc[(state_df['time']>=k12_closed) & (state_df['time']<=one_month_after),'mean'].mean()
    
    # Pct difference isn't so meaningful here because we have positive and negative numbers
    pct_difference = (avg_pageloads_difference_after - avg_pageloads_difference_before) / (avg_pageloads_difference_before)
    
    list_of_dicts.append({'state':state, 'avg_pageload_gap_before':avg_pageloads_difference_before, 'avg_pageload_gap_after': avg_pageloads_difference_after})

In [None]:
diffs = pd.DataFrame(list_of_dicts)
diffs = diffs.loc[diffs['avg_pageload_gap_before'].notnull()].copy().reset_index(drop=True) # Remove rows where we don't have a comparison

In [None]:
# Ignore North Carolina because there aren't enough schools
diffs['inequality_increase'] = abs(diffs['avg_pageload_gap_after']) - abs(diffs['avg_pageload_gap_before'])

In [None]:
diffs

You can see that pretty much all states INCREASED their online learning inequality after K-12 public schools were closed, and the average across states below confirms it.
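
Because the gap can be positive or negative, the increase is measured on absolute values. The toy example below (made-up numbers for three hypothetical states) shows what that implies: a gap that flips sign but keeps its size counts as no change in inequality.

```python
import pandas as pd

# Made-up gaps for three hypothetical states
toy_diffs = pd.DataFrame({
    'state': ['A', 'B', 'C'],
    'avg_pageload_gap_before': [100.0, 100.0, -50.0],
    'avg_pageload_gap_after':  [300.0, -100.0, -10.0],
})

# Inequality is the size of the gap, regardless of which group is ahead
toy_diffs['inequality_increase'] = (toy_diffs['avg_pageload_gap_after'].abs()
                                    - toy_diffs['avg_pageload_gap_before'].abs())
print(toy_diffs)
```

State A's inequality grows by 200, state B's is unchanged even though the leading group switched, and state C's shrinks by 40.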

In [None]:
print('Average inequality increase in pageloads per 1k students:', diffs['inequality_increase'].mean())

# Based on those plots, I would encourage further analysis into the education solutions from the following states:

1. Utah. Utah historically had inequality (where Disadvantaged schools had more online learning than Advantaged schools). When Covid struck, that inequality gap only got stronger. What makes Utah a good candidate to study is that when the 2020-21 school year started, the gap had disappeared. This suggests that Utah put some measures in place for the new school year that helped eliminate inequality in online learning.
2. Massachusetts. This state had equality pre-pandemic, and once K-12 public schools were shut down, a gap in online learning opened up between Advantaged and Disadvantaged schools. Massachusetts is probably a case study of what NOT to do, because things were fine before the pandemic and then Disadvantaged schools began having fewer pageloads than Advantaged schools. However, the inequality only lasted for 2 months (or was it because summer came around?), so take this finding with a grain of salt.
3. Missouri. The gap narrowed after K-12 public schools closed, so it is worth checking which measures the state took that may have reduced it.
4. New York. Something unusual happened: the inequality gap reversed between Advantaged and Disadvantaged schools. It might be worth investigating what measures were taken, and whether less intense versions of those measures would have landed closer to equality.

# The most compelling case to me is Utah. Because of that, let's look into what Utah did to reduce their online learning inequality for the 2020-21 school year.

In [None]:
advantaged_utah_schools = districts_info.query('state=="Utah" & cluster=="Advantaged"')['district_id'].map(str).unique().tolist()
disadvantaged_utah_schools = districts_info.query('state=="Utah" & cluster=="Disadvantaged"')['district_id'].map(str).unique().tolist()

In [None]:
# So we know the data size we are dealing with, let's print out the lengths.
# We have a really low sample-size in this whole dataset. Utah is one of the states with larger amount of information. So let's rely on it to make a robust conclusion.
print('Number of advantaged Utah schools:', len(advantaged_utah_schools))
print('Number of disadvantaged Utah schools:', len(disadvantaged_utah_schools))

# Group by the product id so we can find out what product Utah implemented to reduce the online learning gap

In [None]:
# Create a sub-dataframe so there are fewer rows and we can work with it faster
engage_utah = engage.loc[engage['district_id'].isin(advantaged_utah_schools+disadvantaged_utah_schools)].copy().reset_index(drop=True)

In [None]:
engage_utah.loc[engage_utah['district_id'].isin(advantaged_utah_schools), 'cluster'] = 'Advantaged'
engage_utah.loc[engage_utah['district_id'].isin(disadvantaged_utah_schools), 'cluster'] = 'Disadvantaged'

In [None]:
engage_utah['time'] = pd.to_datetime(engage_utah['time'])

In [None]:
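# .dt.week was removed in pandas 2.0; isocalendar().week gives the same ISO week number
engage_utah['week_of_year'] = engage_utah['time'].dt.isocalendar().week.astype(int)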

In [None]:
# View what the dataframe looks like
engage_utah

In [None]:
engage_utah_summed = engage_utah.groupby(['lp_id','week_of_year','cluster'])['engagement_index'].sum().to_frame('sum_pageloads_per_1k_students').reset_index()

# Now we take the Advantaged schools' engagement for each product and subtract the Disadvantaged schools' engagement from it

In [None]:
engage_utah_merged = engage_utah_summed.query('cluster=="Advantaged"').merge(engage_utah_summed.query('cluster=="Disadvantaged"'),
                                                                             on=['lp_id','week_of_year'],
                                                                             how='outer',
                                                                             suffixes=('_adv','_disadv'))

In [None]:
# For any rows that are null, it means that the group had 0 pageloads for that product.
# So let's fillna.
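# Assign the result back instead of using inplace=True on a column, which newer pandas versions warn about
engage_utah_merged['sum_pageloads_per_1k_students_adv'] = engage_utah_merged['sum_pageloads_per_1k_students_adv'].fillna(0)
engage_utah_merged['sum_pageloads_per_1k_students_disadv'] = engage_utah_merged['sum_pageloads_per_1k_students_disadv'].fillna(0)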

In [None]:
engage_utah_merged['pageload_gap'] = engage_utah_merged['sum_pageloads_per_1k_students_adv'] - engage_utah_merged['sum_pageloads_per_1k_students_disadv']
engage_utah_merged['pageload_sum'] = engage_utah_merged['sum_pageloads_per_1k_students_adv'] + engage_utah_merged['sum_pageloads_per_1k_students_disadv']
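
The merge-then-subtract step is easier to follow on a tiny example. This is a minimal sketch with made-up numbers; the `toy_summed` frame and the shortened `pageloads` column name are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical weekly sums per product and cluster (values are made up)
toy_summed = pd.DataFrame({
    'lp_id': [1, 1, 2, 3],
    'week_of_year': [36, 36, 36, 36],
    'cluster': ['Advantaged', 'Disadvantaged', 'Advantaged', 'Disadvantaged'],
    'pageloads': [500.0, 200.0, 80.0, 900.0],
})

adv = toy_summed.query('cluster=="Advantaged"')
disadv = toy_summed.query('cluster=="Disadvantaged"')

# The outer merge keeps products used by only one of the two groups
toy_merged = adv.merge(disadv, on=['lp_id', 'week_of_year'], how='outer', suffixes=('_adv', '_disadv'))

# A missing side means that group had no pageloads for the product
cols = ['pageloads_adv', 'pageloads_disadv']
toy_merged[cols] = toy_merged[cols].fillna(0)

# Positive gap: Advantaged schools lead. Negative gap: Disadvantaged schools lead.
toy_merged['pageload_gap'] = toy_merged['pageloads_adv'] - toy_merged['pageloads_disadv']
toy_merged = toy_merged.sort_values('lp_id').reset_index(drop=True)
print(toy_merged[['lp_id', 'pageloads_adv', 'pageloads_disadv', 'pageload_gap']])
```

Product 3 is used only by the Disadvantaged group here, so it survives the outer merge and ends up with a large negative gap. An inner merge would have dropped it.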

In [None]:
del engage_utah_merged['cluster_adv'], engage_utah_merged['cluster_disadv'] # No longer need these columns

# Now for each week, we see which products contributed the biggest gaps

In [None]:
# Sort the values
engage_utah_merged_sort_sum = engage_utah_merged.sort_values(['week_of_year','pageload_sum']).reset_index(drop=True)
engage_utah_merged_sort_gap = engage_utah_merged.sort_values(['week_of_year','pageload_gap']).reset_index(drop=True)

In [None]:
# Take the top 10 products that all Utah schools engage in
list_of_dfs_sum = []

for week_of_year, week_df in engage_utah_merged_sort_sum.groupby('week_of_year'):
    list_of_dfs_sum.append(week_df.tail(10))
    
# Take the top 5 products from the top and bottom that contribute towards the most inequality
list_of_dfs_gap = []

for week_of_year, week_df in engage_utah_merged_sort_gap.groupby('week_of_year'):
    list_of_dfs_gap.append(week_df.head(5))
    list_of_dfs_gap.append(week_df.tail(5))

In [None]:
most_pageloads = pd.concat(list_of_dfs_sum,axis=0)
most_gap_driving = pd.concat(list_of_dfs_gap,axis=0)
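
As an aside, the per-week loops can also be written with `groupby().head()` and `groupby().tail()`. This is only a sketch on made-up data (the `toy` frame is hypothetical, and I take 1 row from each end instead of 5 to keep it small):

```python
import pandas as pd

# Hypothetical gaps for four products over two weeks (values are made up)
toy = pd.DataFrame({
    'week_of_year': [1, 1, 1, 1, 2, 2, 2, 2],
    'lp_id': [10, 11, 12, 13, 10, 11, 12, 13],
    'pageload_gap': [5.0, -3.0, 9.0, 0.0, -7.0, 2.0, 1.0, 8.0],
})

toy_sorted = toy.sort_values(['week_of_year', 'pageload_gap'])
by_week = toy_sorted.groupby('week_of_year')

# head(n) / tail(n) keep the first / last n rows of every group in sorted order
extremes = pd.concat([by_week.head(1), by_week.tail(1)]).sort_values(['week_of_year', 'pageload_gap'])
print(extremes)
```

One thing to keep in mind with either version: if a week has fewer than 10 products, `head(5)` and `tail(5)` overlap, so the same product shows up twice for that week.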

# So now I am wondering if any new product appeared after the 2020-21 school year started. If so, then this is probably the product that most helped drive Utah toward online learning equality

In [None]:
# Following external data https://www.schools.utah.gov/file/f27e32f6-d5ab-4886-bfda-dfa7f951e878
# Most Utah schools open in late August
# Which corresponds to week_of_year == 36
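
To sanity-check the week cutoffs, here is a small sketch that maps a few 2020 dates to their ISO week numbers. The dates are just ones I picked for illustration; they are not taken from the Utah calendar linked above.

```python
import pandas as pd

# Illustrative dates only: early May, and the last two Mondays of August 2020
dates = pd.to_datetime(pd.Series(['2020-05-03', '2020-08-24', '2020-08-31']))
weeks = dates.dt.isocalendar().week.astype(int).tolist()
print(weeks)  # [18, 35, 36]
```

So ISO week 36 begins on Monday 31 August 2020, and week 18 ends on Sunday 3 May 2020.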

In [None]:
# Look at the lp_id count for the 2020-21 school year vs the 2019-20 school year
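# fill_value=0 keeps products that appear in only one of the two periods (a plain subtraction gives NaN for them)
(most_gap_driving.loc[most_gap_driving['week_of_year']>=36,'lp_id'].value_counts()
 .sub(most_gap_driving.loc[most_gap_driving['week_of_year']<=18,'lp_id'].value_counts(), fill_value=0)
 .sort_values(ascending=False))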
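
When two `value_counts()` results are subtracted, pandas aligns them on the index, and a product that is present on only one side gets NaN unless a fill value is supplied. That matters here because a genuinely new product has no count in the earlier period. A minimal sketch on made-up product ids:

```python
import pandas as pd

# Hypothetical product ids from the top/bottom lists (all made up)
late = pd.Series([1, 1, 1, 2, 3, 3]).value_counts()   # stands in for the 2020-21 weeks
early = pd.Series([1, 2, 2, 4]).value_counts()        # stands in for the 2019-20 weeks

plain = late - early                     # ids missing from one side become NaN
filled = late.sub(early, fill_value=0)   # missing counts are treated as 0

print(plain.sort_index())
print(filled.sort_index())
```

Product 3 exists only in the later period: it is NaN in `plain` but gets a positive count in `filled`, which is what we want when hunting for new products.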

In [None]:
engage_utah_merged.query('lp_id==95253')

In [None]:
engage_utah_merged.query('lp_id==55136')

# Ding ding ding! We see that the new products 95253 and 55136 are the ones which contributed most to Utah's 2020-21 school year having equality in online learning. If you look into the data you can see that the Disadvantaged schools are primarily engaging with these products while the Advantaged schools are not. This evens out the number of pageloads between the two groups of schools (because it was previously biased, with the Advantaged schools having more pageloads). And just what are these products?

In [None]:
products_info.loc[products_info['LP ID'].isin([95253, 55136])]

# It seems that Edgenuity is the main driver of the reduction in Utah's online learning inequality

# So let's make a conclusion.

1. We segmented districts into Advantaged and Disadvantaged schools based on their pct_black/hispanic and pct_free/reduced lunches.
2. Then we looked at the online learning engagement pageloads for Advantaged and Disadvantaged schools for each state.
3. What we saw was that when K-12 public schools were closed by the state, the inequality gap increased for pretty much every state, by over 3,000 pageloads per 1k students on average.
4. From those insights we highlighted four interesting states: Utah (because they had an online learning gap in 2019-20, but then virtually eliminated inequality in the 2020-21 school year), Massachusetts (they had equality pre-pandemic, but after schools closed their inequality gap INCREASED, so we should study them as an example of what not to do), Missouri (this is one of the few states that managed to lower its inequality gap after K-12 public schools closed), and New York (their inequality shifted from Advantaged schools getting more online learning engagement to Disadvantaged schools getting more, and it happened right when K-12 public schools closed; looking into why this happened may yield important discoveries).
5. We focused on Utah for the conclusion because the fact that they were able to reduce their inequality gap for the 2020-21 school year was very attractive to me.
6. By looking at the products which most shrink the gap between Disadvantaged and Advantaged schools in Utah, we found a new product being used a LOT more in the 2020-21 school year: Edgenuity.
7. Therefore, to promote more equality among schools in our post-Covid world, maybe we could rely on a tool like Edgenuity to foster more online engagement.

edgenuity.com/states/utah/