# Introduction

With the sudden declaration of states of emergency across the US (and beyond) in March of 2020, and the closure of nearly all public schools as a result, it is fair to say that most states and school districts were not prepared for the new, lockdown-style way of living, working, and studying. Schools, students, parents and teachers had to rapidly adjust to radically different methods of teaching and learning.

Even before the pandemic began, the US public education system was already unequal and divided along geographic, demographic, and socioeconomic lines. With the closure of schools and the transition to at-home online learning, there was a risk that already disadvantaged students would fall even further behind if they had to study from home without adequate Internet connectivity, regular access to Internet-capable devices, or support from their schools, or if their living situations were, for one reason or another, not conducive to productive study.

The purpose of this study is to answer: If another state of emergency were to arise in the future that caused the widespread closure of in-person classrooms, to be replaced with online classrooms, what can state governments and school districts do to better support their grade-school students next time?

In [None]:
# Standard library
import json
import os

# Data handling and plotting
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.graph_objects as go
import plotly.offline as pyo
from plotly import subplots

# Collect the engagement CSV filenames (one file per school district)
district_files = []
for dirname, _, filenames in os.walk('../input/learnplatform-covid19-impact-on-digital-learning/engagement_data'):
    for filename in filenames:
        district_files.append(filename)

# Silence pandas' chained-assignment warning
pd.options.mode.chained_assignment = None  # default='warn'

In [None]:
pyo.init_notebook_mode()

In [None]:
districts = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')
# drop rows where every column except for district_id is NaN - it provides no info
districts.dropna(subset=['state', 'locale', 'pct_black/hispanic', 'pct_free/reduced', 'county_connections_ratio', 'pp_total_raw'], how='all', inplace=True)

In [None]:
districts_one_hot_ethnicity = pd.get_dummies(districts['pct_black/hispanic'], prefix='Pct_black_hispanic')
districts_one_hot_free = pd.get_dummies(districts['pct_free/reduced'], prefix='Pct_free_reduced')
districts_one_hot_funding = pd.get_dummies(districts.pp_total_raw, prefix='Funding_per_student')
districts_one_hot_locale = pd.get_dummies(districts.locale, prefix='Locale')

districts_one_hot = pd.concat([districts, districts_one_hot_locale, districts_one_hot_ethnicity, districts_one_hot_free, districts_one_hot_funding], axis=1)
districts_one_hot.drop(columns=['locale', 'pct_black/hispanic', 'pct_free/reduced', 'pp_total_raw', 'county_connections_ratio'], axis=1, inplace=True)


In [None]:
products = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')

# Caveats and Assumptions

In [None]:
district_counts_by_state = districts.groupby('state')['district_id'].nunique().to_frame().reset_index(drop=False)
district_counts_by_state.columns = ['state', 'districts_count']
with open('../input/states-abbrevs/states_abbrev.json') as json_file:
    states_abbrev = json.load(json_file)
states_abbrev_lookup = {}
for row in states_abbrev:
    states_abbrev_lookup[row['State'].upper()] = row['Code']
district_counts_by_state['state_abbrev'] = district_counts_by_state['state'].str.upper().replace(states_abbrev_lookup)

It's important to begin this analysis by exploring the provided dataset of student engagement. The map below shows the number of school districts within each US state for which student engagement data is available. As the map shows, some states have no data available at all, and other states only have data for a very small number of school districts. This fact should therefore always be kept in mind when reading through this report: due to the limited sample size of the available data, not all states (and therefore not all demographics) are represented equally in the data, which may affect the analysis within this report.

In [None]:
fig = go.Figure(data=go.Choropleth(
    locations=district_counts_by_state['state_abbrev'],
    z=district_counts_by_state['districts_count'].astype(int),
    locationmode='USA-states',
    colorscale='Purples',
    colorbar_title='Number of districts',
    geo='geo'))
fig.update_layout(title_text='School Districts Data by State', geo_scope='usa')

Furthermore, because all school district information has been anonymized (apart from which state the school district is in), it's impossible to know how many students are within a school district. This report treats every school district with an equal weight, but if there is a large variation in the number of students within school districts, the results of this analysis may be skewed in favour of students in smaller districts. Unfortunately this potential bias is unavoidable given the limitations of the data provided.
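
To illustrate why equal weighting matters, the sketch below compares an unweighted mean across districts with an enrollment-weighted mean. The enrollment figures and percentage-access values are invented for illustration (the dataset contains no enrollment numbers), so this only shows the direction such a bias could take, not its actual size.

```python
import numpy as np
import pandas as pd

# Hypothetical districts: enrollment is NOT in the LearnPlatform data;
# these numbers are made up purely to illustrate the weighting effect.
toy = pd.DataFrame({
    'district_id': [1, 2, 3],
    'enrollment': [500, 1500, 20000],   # assumed student counts
    'pct_access': [12.0, 10.0, 4.0],    # assumed mean daily % access
})

# What this report does: every district counts equally
unweighted_mean = toy['pct_access'].mean()

# What a per-student estimate would need: weight by enrollment
weighted_mean = np.average(toy['pct_access'], weights=toy['enrollment'])

print(f'Unweighted (per-district) mean: {unweighted_mean:.2f}%')
print(f'Enrollment-weighted mean:       {weighted_mean:.2f}%')
```

In this toy case the single large district pulls the weighted mean well below the per-district mean, which is the kind of skew the caveat above describes.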

In [None]:
def get_top_5_most_popular_products_in_district(district_id: int, display_frames: bool = False):
    district_engagement = pd.read_csv(os.path.join('../input/learnplatform-covid19-impact-on-digital-learning/engagement_data', '{}.csv'.format(district_id)))
    district_engagement_summary = district_engagement.groupby('lp_id').agg({'engagement_index': 'mean', 'pct_access': 'mean'})
    district_engagement_summary.sort_values(by='engagement_index', ascending=False, inplace=True)
    top_5_prod_ids = district_engagement_summary.index.to_list()[:5]
    if display_frames:
        display(district_engagement_summary.head(5))
    top_5_products = products.query('`LP ID` in @top_5_prod_ids')
    if display_frames:
        display(top_5_products)
        display(districts_one_hot.query('district_id == @district_id'))
    district_engagement_summary.reset_index(inplace=True)
    return district_engagement_summary.loc[0:4]

In [None]:
temp_dfs = []
for filename in district_files:
    district_id = int(filename.split('.')[0])
    temp_df = get_top_5_most_popular_products_in_district(district_id)
    temp_df['district_id'] = district_id
    temp_df['rank_in_district'] = temp_df.index.astype(int) + 1  # np.int was removed from recent NumPy releases
    temp_df = pd.merge(left=temp_df, right=districts_one_hot, how='inner', on='district_id')
    temp_dfs.append(temp_df)
top_5_all_districts = pd.concat(temp_dfs)
top_5_all_districts['lp_id'] = top_5_all_districts['lp_id'].astype(int)
top_5_all_districts.reset_index(inplace=True)
top_5_all_districts.drop('index', axis=1, inplace=True)

In [None]:
def get_product_name_by_id(product_id: int):
    row = products.query('`LP ID` == @product_id')
    if not row.empty:
        return row['Product Name'].values[0]
    return None

def get_product_essential_function_by_id(product_id: int):
    row = products.query('`LP ID` == @product_id')
    if not row.empty:
        return row['Primary Essential Function'].values[0]
    return None

popular_products = top_5_all_districts.groupby('lp_id').aggregate({'pct_access': 'mean'})
popular_products.reset_index(inplace=True)
popular_products['product_name'] = popular_products.apply(lambda row: get_product_name_by_id(row['lp_id']), axis=1)
popular_products['essential_function'] = popular_products.apply(lambda row: get_product_essential_function_by_id(row['lp_id']), axis=1)
popular_products.dropna(subset=['product_name', 'essential_function'], how='all', inplace=True)

By compiling the engagement data for every available school district, we can calculate the mean proportion of students accessing an online learning product on any given day (subsequently referred to as the "percentage access" in this report). To reiterate, this is the mean percentage access across all available school districts, not across all students.
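
As a rough sketch of that two-stage averaging, the toy example below first averages within each district and then averages the district-level means. The values are invented; only the column names `district_id`, `lp_id`, and `pct_access` mirror the engagement files.

```python
import pandas as pd

# Toy daily records (invented values); columns mirror the engagement files
records = pd.DataFrame({
    'district_id': [1, 1, 1, 1, 2],
    'lp_id':       [10, 10, 20, 20, 10],
    'pct_access':  [10.0, 20.0, 2.0, 4.0, 40.0],
})

# Stage 1: mean daily percentage access per product within each district
per_district = (records
                .groupby(['district_id', 'lp_id'], as_index=False)['pct_access']
                .mean())

# Stage 2: mean of the district-level means, so each district counts once
per_product = per_district.groupby('lp_id')['pct_access'].mean()

# For comparison: pooling all rows lets districts with more rows count more
pooled = records.groupby('lp_id')['pct_access'].mean()

print(per_product)
print(pooled)
```

For product 10 the two approaches disagree because district 1 contributes two rows and district 2 only one, which is why the distinction between "per district" and "per row" is worth keeping in mind.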

Below is a graph showing the mean daily percentage access for the most popular learning products.

In [None]:
popular_products.sort_values(by='pct_access', ascending=False, inplace=True)
fig = go.Figure()
fig.add_trace(go.Bar(y=popular_products.pct_access, x=popular_products.product_name, marker_color='grey'))
fig.update_yaxes(title_text = 'Percentage Access (%)')
fig.update_layout(title='Mean Daily Percentage Access by Students')
fig.show()

To get a better sense of what some of the above products are used for, the pie chart below shows the essential functions of the most common learning products, weighted by the frequency with which each category of learning product was accessed throughout the calendar year 2020.

In [None]:
fig = go.Figure()
fig.add_trace(go.Pie(labels=popular_products.essential_function, values=popular_products.pct_access))
fig.update_layout(title='Mean Percentage Access by Essential Function of Product')
fig.show()

As the final step in the exploratory part of this data analysis (EDA), a correlation map can help guide our analysis by showing which factors are most closely related to each other.

When looking at the correlation map below, remember that not every cell has a meaningful value. For example, there are inevitably going to be negative correlations between the different locales: if a district is in Locale A, it cannot also be in Locale B.
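
The effect can be reproduced on a small, made-up example: one-hot columns derived from the same categorical variable are mutually exclusive, so every pair of them correlates negatively regardless of the underlying data. The locale values below are invented for the demonstration.

```python
import pandas as pd

# Toy districts with made-up locales (not taken from the dataset)
toy = pd.DataFrame({'locale': ['City', 'Suburb', 'Rural', 'City', 'Town', 'Suburb']})

# One-hot encode as integers so the correlation is computed on 0/1 values
dummies = pd.get_dummies(toy['locale'], prefix='Locale').astype(int)

# Off-diagonal cells are all negative: a 1 in one column forces 0 in the others
locale_corr = dummies.corr()
print(locale_corr.round(2))
```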

In [None]:
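# Restrict to numeric/boolean columns: recent pandas versions raise an error
# when corr() meets non-numeric columns such as 'state'
corr = top_5_all_districts.select_dtypes(include=['number', 'bool']).corr()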
corr.style.background_gradient(cmap='coolwarm')

These are some of the points worth noting that have been gleaned from the correlation map:

* districts where 80-100% of students are Black and/or Hispanic correlate strongly and positively with city locales
* districts where 80-100% of students are Black and/or Hispanic correlate strongly and positively with having 80-100% of students in free or reduced-cost lunch programs
* districts where less than 20% of students are Black and/or Hispanic correlate strongly and positively with rural locales
* districts where less than 20% of students are Black and/or Hispanic correlate strongly and positively with having less than 20% of students in free or reduced-cost lunch programs
* districts with \$4,000 - \$8,000 in funding per student are most likely to have less than 20% of their students identifying as Black or Hispanic
* districts with \$20,000 or more in annual per-student funding are also most likely to have less than 20% of their students identifying as Black or Hispanic
* there is a pronounced negative correlation between percentage access and districts where 60-80% of students are in free/reduced lunch programs. However, there is only a very slightly negative correlation between percentage access and districts where 80-100% of students are in those same subsidized lunch programs
* engagement_index and percentage access correlate negatively with city and town locales, and positively with rural and suburban locales. This is possibly counter-intuitive: I personally would have expected that cities and towns would have higher COVID infection rates due to greater population density, and would therefore be more likely to use online learning to avoid the possibility of in-school infections.
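
Reading a large correlation map by eye is error-prone, so one option is to rank the variable pairs programmatically. The sketch below is a hypothetical helper (`strongest_pairs` is not part of the original notebook) run on an invented toy frame; in the notebook it could be applied to `corr` instead.

```python
import numpy as np
import pandas as pd

def strongest_pairs(corr: pd.DataFrame, top_n: int = 5) -> pd.Series:
    """Return the top_n variable pairs ranked by absolute correlation.

    Hypothetical helper, not part of the original notebook.
    """
    # Keep only the upper triangle so each pair appears once and the
    # diagonal (self-correlation of 1.0) is excluded
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top_n)

# Toy numeric frame standing in for top_5_all_districts (invented values)
toy = pd.DataFrame({
    'pct_access': [1.0, 2.0, 3.0, 4.0, 5.0],
    'engagement_index': [2.0, 4.1, 6.2, 7.9, 10.0],
    'Locale_City': [1, 1, 0, 0, 0],
})
top_pairs = strongest_pairs(toy.corr(), top_n=2)
print(top_pairs)
```

Sorting by absolute value keeps strong negative correlations in the ranking alongside strong positive ones.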

# Student Engagement as a Function of Time

I propose that the 2020 calendar year can be categorized into three segments in relation to the US public education system:

* January - mid March 2020: before states of emergency were declared; schools were still open. Student engagement values for this time frame indicate the extent to which school districts were utilizing online learning products before the pandemic began
* Late March - late June 2020: the first few months of the pandemic; most in-person classrooms closed; lots of uncertainty regarding the state of local economies, health care systems, etc. Most states and/or school districts opted to shutter schools, but weren't necessarily well-prepared for online and distance learning.
* Early September through end of 2020: the start of a new school year; states and school districts had the opportunity over the summer months to put in place either in-person protections against the spread of COVID or better digital learning platforms and systems. Student engagement in this segment reflects school districts that opted to remain closed after the summer vacation, or that chose to keep using online learning products even though classes were being conducted in person again.
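
The three segments can be attached to the engagement data as a label column. The sketch below is one way to do it; the exact boundary dates are assumptions, since the segments above are only described approximately ("mid March", "late June", and so on), and `label_segment` is a hypothetical helper.

```python
import pandas as pd

# Boundary dates are assumptions chosen to approximate the segments above
SEGMENTS = [
    ('pre-pandemic',     '2020-01-01', '2020-03-15'),
    ('initial closures', '2020-03-16', '2020-06-30'),
    ('new school year',  '2020-09-01', '2020-12-31'),
]

def label_segment(dates: pd.Series) -> pd.Series:
    """Map each date to a segment name; dates outside all segments
    (e.g. the summer break) are labelled 'other'."""
    dates = pd.to_datetime(dates)
    labels = pd.Series('other', index=dates.index)
    for name, start, end in SEGMENTS:
        in_range = dates.between(pd.Timestamp(start), pd.Timestamp(end))
        labels[in_range] = name
    return labels

sample = pd.Series(['2020-02-03', '2020-04-20', '2020-07-15', '2020-10-01'])
print(label_segment(sample).tolist())
```

With such a column in place, a `groupby` on the segment label would give per-period averages that can be compared directly.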

The chart below tracks the daily student engagement for all school districts for which data is available. "Student engagement" in this context is measured by the total number of page load events that occurred within a school district for each day. Note that these trendlines have not been adjusted for the size of the student population within each school district, so a district with more page load events is not necessarily more "engaged" than another district with fewer page load events - it could just mean that the former has a larger student body than the latter.

In [None]:
districts_list = []
for filename in district_files:
    district_id = int(filename.split('.')[0])
    df = pd.read_csv(os.path.join('../input/learnplatform-covid19-impact-on-digital-learning/engagement_data', filename))
    state = districts.query('district_id == @district_id')['state'].array
    if len(state) > 0:
        by_day_df = df.groupby('time').agg({'engagement_index': 'sum'})
        by_day_df['district_id'] = district_id
        by_day_df['state'] = state[0]
        districts_list.append(by_day_df)

interesting_district_ids = [9536, 8815, 2779, 9553]
neon_colour_palette = ['#72147E', '#F21170', '#FA9905', '#FF5200']

fig = go.Figure()
for dist in districts_list:
    if dist.iloc[0]['district_id'] not in interesting_district_ids:
        fig.add_trace(go.Scatter(x=dist.index, y=dist['engagement_index'], mode='lines', line_color='grey', opacity=0.2, showlegend=False, 
            customdata = dist[['state', 'district_id', 'engagement_index']],
            hovertemplate = '<b>%{customdata[0]}</b> - district %{customdata[1]}<br>Total Page Load Events: %{customdata[2]:.0f}',
            name= ""))
    else:
        fig.add_trace(go.Scatter(x=dist.index, y=dist['engagement_index'], mode='lines', line_color=neon_colour_palette.pop(), showlegend=False,
            customdata = dist[['state', 'district_id', 'engagement_index']],
            hovertemplate = '<b>%{customdata[0]}</b> - district %{customdata[1]}<br>Total Page Load Events: %{customdata[2]:.0f}',
            name=""))

fig.update_layout(dict(title='Daily Page Load Events by School District'))
fig.update_yaxes(dict(title_text='Daily Page Load Events'))
fig.show()

In this chart, four school districts have been selected (mostly at random) and highlighted for easier visibility, in order to recognize varying patterns in online student engagement trends over time. For instance, district 8815 in Illinois (in yellow) shows much higher student engagement in September compared to the last few months of the previous school year. This suggests that the administrative team in this particular school district used the summer months to put in place a plan for managing online learning.

Conversely, we can see that district 9536 in New York (in purple) made very little use of online learning products before the pandemic was declared. In early April 2020, this particular school district quickly adopted online education products and had some of the highest daily page load event numbers of all reporting districts in the country. However, with the start of the new school year in September, its page load events fell somewhat below the levels seen at the end of the 2019-20 school year. This could indicate that schools in this NY district returned to in-person classes (or a hybrid of online and in-person) for the 2020-21 school year, but continued to use online learning platforms much more prominently than they did before the pandemic began.

By looking at the overall trend of student engagement for all school districts reporting data, we can see that even in the later months of 2020, when it is likely that many schools had re-opened for in-person instruction, online learning products were still being used much more commonly than they had been in the first few months of 2020. 
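
One way to condense the many per-district lines into a single overall trend is to take the median across districts for each day. This is only a sketch on invented data shaped like the entries of `districts_list`; the median is my choice here (not something the chart above computes) because it is less sensitive to a handful of very large districts than the mean.

```python
import pandas as pd

# Toy per-district daily series (invented values), indexed by date and
# carrying an 'engagement_index' column like the frames in districts_list
dates = pd.date_range('2020-03-01', periods=4, freq='D')
district_a = pd.DataFrame({'engagement_index': [100, 120, 900, 950]}, index=dates)
district_b = pd.DataFrame({'engagement_index': [10, 15, 40, 60]}, index=dates)
district_c = pd.DataFrame({'engagement_index': [50, 40, 300, 280]}, index=dates)

# Stack the districts, then take the median per day across districts
combined = pd.concat([district_a, district_b, district_c])
overall_trend = combined.groupby(level=0)['engagement_index'].median()

print(overall_trend)
```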

# Factors Influencing Student Engagement

In [None]:
top_5_all_districts.rename(columns={'Pct_black_hispanic_[0, 0.2[': 'Pct_black_hispanic_0-0.2', 'Pct_black_hispanic_[0.2, 0.4[': 'Pct_black_hispanic_0.2-0.4', 'Pct_black_hispanic_[0.4, 0.6[': 'Pct_black_hispanic_0.4-0.6', 'Pct_black_hispanic_[0.6, 0.8[': 'Pct_black_hispanic_0.6-0.8', 'Pct_black_hispanic_[0.8, 1[': 'Pct_black_hispanic_0.8-1', 'Pct_free_reduced_[0, 0.2[': 'Pct_free_reduced_0-0.2', 'Pct_free_reduced_[0.2, 0.4[': 'Pct_free_reduced_0.2-0.4', 'Pct_free_reduced_[0.4, 0.6[': 'Pct_free_reduced_0.4-0.6', 'Pct_free_reduced_[0.6, 0.8[': 'Pct_free_reduced_0.6-0.8', 'Pct_free_reduced_[0.8, 1[': 'Pct_free_reduced_0.8-1'}, inplace=True)

dfs = []
select_by = ['Pct_free_reduced_0-0.2', 'Pct_free_reduced_0.2-0.4', 'Pct_free_reduced_0.4-0.6', 'Pct_free_reduced_0.6-0.8', 'Pct_free_reduced_0.8-1']

for i in range(5):
    df = top_5_all_districts.groupby(select_by[i]).agg({'engagement_index': 'mean', 'pct_access': 'mean'})
    df['pct_free_reduced'] = select_by[i][17:]
    df['num_districts'] = int(top_5_all_districts[select_by[i]].value_counts()[1] / 5)
    dfs.append(df[df.index == 1])

engagement_by_free_lunch = pd.concat(dfs)

In [None]:
dfs = []
select_by = ['Pct_black_hispanic_0-0.2', 'Pct_black_hispanic_0.2-0.4', 'Pct_black_hispanic_0.4-0.6', 'Pct_black_hispanic_0.6-0.8', 'Pct_black_hispanic_0.8-1']

for i in range(5):
    df = top_5_all_districts.groupby(select_by[i]).agg({'engagement_index': 'mean', 'pct_access': 'mean'})
    df['pct_black_hispanic'] = select_by[i][19:]
    df['num_districts'] = int(top_5_all_districts[select_by[i]].value_counts()[1] / 5)
    dfs.append(df[df.index == 1])

engagement_by_ethnicity = pd.concat(dfs)

In [None]:
fig = subplots.make_subplots(
    rows=2, 
    cols=2, 
    horizontal_spacing = 0.15, 
    vertical_spacing = 0.18, 
    shared_yaxes = True
 )

trace = go.Scatter(x=engagement_by_ethnicity['pct_black_hispanic'], y=engagement_by_ethnicity['engagement_index'], mode='lines+markers+text', showlegend=False, marker_color='black', line_color='black')

fig.add_trace(trace, 1, 1)

trace = go.Scatter(x=engagement_by_free_lunch['pct_free_reduced'], y=engagement_by_free_lunch['engagement_index'], mode='lines+markers+text', showlegend=False, marker_color='black', line_color='black')

fig.add_trace(trace, 1, 2)

trace = go.Scatter(x=engagement_by_ethnicity['pct_black_hispanic'], y=engagement_by_ethnicity['pct_access'], mode='lines+markers+text', showlegend=False, marker_color='black', line_color='black')

fig.add_trace(trace, 2, 1)

trace = go.Scatter(x=engagement_by_free_lunch['pct_free_reduced'], y=engagement_by_free_lunch['pct_access'], mode='lines+markers+text', showlegend=False, marker_color='black', line_color='black')

fig.add_trace(trace, 2, 2)

large_title_format = "<span style='font-size:26px; font-family:Times New Roman'>Online Student Engagement by Ethnicity and Socioeconomics</span>"
small_title_format = "<span style='font-size:14px; font-family:Helvetica'>Mean daily page load events and percentage of students accessing online-based school<br>software broken down by the proportions of Black and Hispanic students in each<br>school district, and by the proportions of students on free/reduced lunch programs.</span>"

fig.update_xaxes(title_text="Proportion of Black & Hispanic students", col=1)
fig.update_xaxes(title_text="Proportion of students on free/reduced lunch", col=2)

fig.update_yaxes(title_text="Daily Page Load Events per Student", row=1)
fig.update_yaxes(title_text="Daily Percentage Access", row=2, range=[6, 16])


layout = dict(title = large_title_format + '<br>' + small_title_format,
    margin = dict(t=200, b=60, r=20),
    width = 700,
    height=700,
    yaxis = dict(range=[0,8000], showgrid=True), 
    yaxis2 = dict(range=[0, 15], showgrid=True))

fig.update_layout(layout)
fig.show()

We can see from the chart above that in the 13 school districts in the provided data set where 60-80% of the students are in free/reduced-price lunch programs, online engagement was markedly lower than for any other category of socioeconomic status. In fact, districts with 40-60% and 80-100% of students on free lunch programs showed very similar mean daily percentages of students accessing online learning resources. Strangely, the percentage access drops by 25% in districts where 60-80% of the students are on free lunch programs.
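
A relative drop like this is the percentage change between adjacent bins. The sketch below shows the arithmetic on invented values; in the notebook, `engagement_by_free_lunch['pct_access']` would take the place of the toy series.

```python
import pandas as pd

# Invented percentage-access values per free/reduced-lunch bin, for
# illustration only (these are not the values from the chart)
pct_access_by_bin = pd.Series(
    [14.0, 13.0, 12.0, 9.0, 12.0],
    index=['0-0.2', '0.2-0.4', '0.4-0.6', '0.6-0.8', '0.8-1'],
)

# Relative change from the previous bin, in percent
relative_change = pct_access_by_bin.pct_change() * 100

print(relative_change.round(1))
```

Note that a drop from 12% to 9% is a relative decrease of 25% but an absolute decrease of only 3 percentage points, so it helps to state which of the two is meant.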

Some encouraging news is that districts with 80-100% of their students in free lunch programs had nearly as many page load events per student per day, on average, as districts where less than 20% of students had free lunches. This suggests that in school districts where the large majority of students come from low-income families, there are supports in place to encourage those students to participate in school. Unfortunately, that doesn't appear to be the case in districts where 60-80% of students are in free lunch programs.

We can also see from the spread of mean daily page load events across categories that socioeconomic status had a greater effect on online participation than ethnicity did (both measured at the school district level).
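
"Spread" here can be made concrete as the range (maximum minus minimum) of the bin means for each grouping. The sketch below uses invented values; in the notebook, the `engagement_index` columns of `engagement_by_free_lunch` and `engagement_by_ethnicity` would be used instead, and `spread` is a hypothetical helper.

```python
import pandas as pd

# Invented mean daily engagement values per bin (illustration only)
by_free_lunch = pd.Series([7000, 6000, 5500, 3000, 6800])
by_ethnicity = pd.Series([6000, 5200, 5000, 5600, 5900])

def spread(values: pd.Series) -> float:
    """Range (max minus min) of the bin means."""
    return float(values.max() - values.min())

print('Spread by free/reduced lunch:', spread(by_free_lunch))
print('Spread by ethnicity:', spread(by_ethnicity))
```

A larger range suggests the grouping variable separates districts more strongly, though with so few districts per bin this remains a descriptive comparison rather than a statistical test.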

The wealthiest school districts (less than 20% of students getting free lunches) reported the highest percentage of daily online access. This is to be expected, given that families with more wealth are more likely to be able to afford home Internet connections and personal devices to allow students to access the Internet either from home or at school.

Also noticeable is that school districts where less than 20% of students identify as Black or Hispanic reported similar daily percentage access numbers to school districts where 60% or more of students were Black or Hispanic. It would appear that districts with a more equal spread of ethnicities (20-60% of students Black or Hispanic) were the least likely to access online school resources on a daily basis.

In [None]:
top_5_all_districts.rename({'Funding_per_student_[10000, 12000[': 'Student_funding_10k-12k', 'Funding_per_student_[18000, 20000[': 'Student_funding_18k-20k', 'Funding_per_student_[12000, 14000[': 'Student_funding_12k-14k', 'Funding_per_student_[14000, 16000[': 'Student_funding_14k-16k', 'Funding_per_student_[16000, 18000[': 'Student_funding_16k-18k', 'Funding_per_student_[20000, 22000[': 'Student_funding_20k-22k', 'Funding_per_student_[22000, 24000[': 'Student_funding_22k-24k', 'Funding_per_student_[32000, 34000[': 'Student_funding_32k-34k', 'Funding_per_student_[4000, 6000[': 'Student_funding_4k-6k', 'Funding_per_student_[6000, 8000[': 'Student_funding_6k-8k', 'Funding_per_student_[8000, 10000[': 'Student_funding_8k-10k'}, axis=1, inplace=True)

dfs = []
select_by = ['Student_funding_4k-6k',
       'Student_funding_6k-8k', 'Student_funding_8k-10k', 'Student_funding_10k-12k',
       'Student_funding_12k-14k', 'Student_funding_14k-16k',
       'Student_funding_16k-18k', 'Student_funding_18k-20k',
       'Student_funding_20k-22k', 'Student_funding_22k-24k',
       'Student_funding_32k-34k']

for i in range(len(select_by)):
    df = top_5_all_districts.groupby(select_by[i]).agg({'engagement_index': 'mean', 'pct_access': 'mean'})
    df['per_student_funding'] = select_by[i][16:]
    df['num_districts'] = int(top_5_all_districts[select_by[i]].value_counts()[1] / 5)
    dfs.append(df[df.index == 1])

engagement_by_student_funding = pd.concat(dfs)

In [None]:
fig = subplots.make_subplots(
    rows=2, 
    cols=1, 
    horizontal_spacing = 0.15, 
    vertical_spacing = 0.18, 
    shared_yaxes = True
 )

trace = go.Scatter(x=engagement_by_student_funding['per_student_funding'], y=engagement_by_student_funding['engagement_index'], mode='lines+markers+text', marker_color='black', line_color='black', showlegend=False)

fig.add_trace(trace, 1, 1)

trace = go.Scatter(x=engagement_by_student_funding['per_student_funding'], y=engagement_by_student_funding['pct_access'], mode='lines+markers+text', marker_color='black', line_color='black', showlegend=False)

fig.add_trace(trace, 2, 1)

large_title_format = "<span style='font-size:26px; font-family:Times New Roman'>Online Student Engagement by Funding per Student</span>"
small_title_format = "<span style='font-size:14px; font-family:Helvetica'>Mean daily page load events and percentage of students accessing online-based school<br>software broken down by per-student funding allocated to the school district.</span>"

fig.update_xaxes(title_text="Per-student funding", col=1)

fig.update_yaxes(title_text="Daily Page Load Events per Student", row=1)
fig.update_yaxes(title_text="Daily Percentage Access", row=2)


layout = dict(title = large_title_format + '<br>' + small_title_format,
    margin = dict(t=200, b=60, r=20),
    width = 700,
    height=700,
    yaxis = dict(range=[0,10000], showgrid=True), 
    yaxis2 = dict(range=[0, 20], showgrid=True))

fig.update_layout(layout)
fig.show()

These graphs clearly show that online student engagement is lowest in school districts with the least funding allocated per student. However, the data needs to be investigated further before any inferences can be drawn from this. For example: in school districts with low per-student funding, was there less online engagement because more classes were conducted in-person (and therefore there was less need for online software)? Does low per-student funding in a district equate to low-income communities (and therefore more students living in low-income families, who are less likely to have Internet connections or Internet-accessible devices at home)? How commonplace is it (within the student engagement dataset provided) for districts to have only \$4,000-\$6,000 funding per student?

(Note that there is no student engagement data available for districts with per-student funding between \$24,000 and \$32,000.)

In [None]:
fig = go.Figure()
bar = go.Bar(x=engagement_by_student_funding['per_student_funding'], y=engagement_by_student_funding['num_districts'])
fig.add_trace(bar)
fig.update_layout(dict(title='Counts of School Districts Categorized by Annual Funding per Student'))
fig.update_yaxes(title_text='Number of School Districts')
fig.update_xaxes(title_text='Annual Funding per Student (USD$)')
fig.update_traces(marker_color='grey')
fig.show()

We can see from this that school districts at either end of the funding spectrum (\$4,000-\$6,000 per student, or \$20,000-\$22,000, \$22,000-\$24,000, or \$32,000-\$34,000 per student) are not common in the available dataset, so it should be kept in mind that any inferences drawn about these types of school districts are based on a very small sample size, and therefore may not be representative of all school districts in the US with similar funding levels.
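
To make the small-sample caveat explicit in the data itself, bins with few districts can be flagged before plotting or interpreting them. This is a minimal sketch: the threshold `MIN_DISTRICTS` is an assumed value, and the counts are invented stand-ins for `engagement_by_student_funding`.

```python
import pandas as pd

MIN_DISTRICTS = 5  # assumed threshold; choose to suit the analysis

# Invented counts standing in for engagement_by_student_funding
toy = pd.DataFrame({
    'per_student_funding': ['4k-6k', '8k-10k', '10k-12k', '32k-34k'],
    'num_districts': [2, 30, 17, 1],
})

# Mark bins whose district count is too small to generalise from
toy['small_sample'] = toy['num_districts'] < MIN_DISTRICTS
print(toy)
```

The flag could then drive a different marker colour in the funding charts, so that readers see at a glance which points rest on only one or two districts.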

In [None]:
lowest = top_5_all_districts.query('`Student_funding_4k-6k` == 1')
display(lowest[['lp_id', 'engagement_index', 'pct_access', 'district_id', 'state', 'Pct_black_hispanic_0-0.2',
       'Pct_black_hispanic_0.2-0.4', 'Pct_black_hispanic_0.4-0.6',
       'Pct_black_hispanic_0.6-0.8', 'Pct_black_hispanic_0.8-1',
       'Pct_free_reduced_0-0.2', 'Pct_free_reduced_0.2-0.4',
       'Pct_free_reduced_0.4-0.6', 'Pct_free_reduced_0.6-0.8',
       'Pct_free_reduced_0.8-1']])

In [None]:
display(lowest.groupby('district_id').agg({'engagement_index': 'mean', 'pct_access': 'mean'}))

The code above queries the 2 school districts with annual per-student funding of \$4,000-\$6,000 (the top 5 most popular online products are listed for each district). We can see from the output that both of these school districts are in Utah, and both have less than 20% of students identifying as Black or Hispanic. In one of the school districts, less than 20% of students are in free or reduced lunch programs, and in the other, 20-40% are.

This would indicate that the communities in which these two school districts are located are probably relatively wealthy. According to the "Online Student Engagement by Ethnicity and Socioeconomics" chart, one would therefore expect some of the highest page load events per student and percentage access (approximately 5,300-7,100 page load events per day per student, on average, and 12.5-15% daily access). Instead, the Utah school district with less than 20% of students on free/reduced lunches had an engagement index of 4,700 page load events per student per day, and approximately 10% of students used an online product on any given day. In the Utah school district where 20-40% of students are on free/reduced lunches, there was a very low average of 1,092 page load events per student per day, and on average only 2.9% of students accessed an online product on any given day. Because the school districts have been anonymized in the dataset, there is no additional information available to determine why both of these school districts (and one in particular) had so little online student engagement.

However, it is worth noting (as will be shown in Utah's state chart below) that Utah's overall student engagement was generally lower than that of many other states. It is possible that the state Department of Education for Utah does not encourage or support online schooling to the same extent as other state education departments do. Alternatively (or perhaps additionally), there could be other demographic factors at play, such as Utah's historically homogeneous distributions of political leanings and religion.

In [None]:
series_list = []
states_list = top_5_all_districts['state'].unique()
funding_levels = ['Student_funding_4k-6k',
       'Student_funding_6k-8k', 'Student_funding_8k-10k', 'Student_funding_10k-12k',
       'Student_funding_12k-14k', 'Student_funding_14k-16k',
       'Student_funding_16k-18k', 'Student_funding_18k-20k',
       'Student_funding_20k-22k', 'Student_funding_22k-24k',
       'Student_funding_32k-34k']

for state in states_list:
    state_dict = {}
    state_dict['state'] = state
    for level in funding_levels:
        df = top_5_all_districts[(top_5_all_districts['state'] == state) & (top_5_all_districts[level] == 1)]
        val = df.shape[0]/5
        state_dict[level] = int(val)
    series_list.append(pd.Series(state_dict))

student_funding_by_state = pd.concat(series_list, axis=1)
student_funding_by_state = student_funding_by_state.transpose()

In [None]:
states_dict = {}
for state in states_list:
    row = student_funding_by_state.query('state == @state')
    states_dict[state] = {'state_min': 0, 'state_max': 0}
    # find min
    for level in funding_levels:
        if row[level].values[0] > 0:
            category = level[16:]
            floor = int(category.split('-')[0].replace('k', ''))
            ceil = int(category.split('-')[1].replace('k', ''))
            states_dict[state]['state_min'] = floor
            states_dict[state]['state_max'] = ceil
            break
    # find max
    for i in range(len(funding_levels)-1, 0, -1):
        level = funding_levels[i]
        if row[level].values[0] > 0:
            category = level[16:]
            ceil = int(category.split('-')[1].replace('k', ''))
            states_dict[state]['state_max'] = ceil
            break

In [None]:
engagement_by_state = top_5_all_districts.groupby('state').agg({'engagement_index': ['mean', 'max'], 'pct_access': ['mean', 'max']})
count_districts_in_state = {}
for state_name in engagement_by_state.index:
    count_districts_in_state[state_name] = int(top_5_all_districts.query('state == @state_name').shape[0]/5)
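# Convert the dict view to a list; the order matches engagement_by_state.index because the dict was built from it
engagement_by_state['num_districts'] = list(count_districts_in_state.values())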

In [None]:
fig = go.Figure()

for state in states_dict.keys():
    if states_dict[state]['state_min'] != 0:
        state_label = state + ' (' + str(engagement_by_state[engagement_by_state.index == state]['num_districts'].values[0]) + ')'
        # x and y are passed as one-element lists so each trace plots a single marker
        fig.add_trace(go.Scatter(x=[states_dict[state]['state_min']], y=[state_label], mode='markers', marker_color='grey', name='Minimum per-student funding in state (USD$)'))
        fig.add_trace(go.Scatter(x=[states_dict[state]['state_max']], y=[state_label], mode='markers', marker_color='black', name='Maximum per-student funding in state (USD$)'))
        fig.add_trace(go.Scatter(x=[engagement_by_state[engagement_by_state.index == state]['engagement_index']['mean'].values[0]/1000], y=[state_label], mode='markers', marker_color='green', name='Mean online student engagement'))
        fig.add_shape(type='line', x0=states_dict[state]['state_max'], y0=state_label, x1=states_dict[state]['state_min'], y1=state_label, name='Range in per-student funding (USD$)')

large_title_format = "<span style='font-size:26px; font-family:Times New Roman'>Funding per Student and Mean Engagement by State</span>"
small_title_format = "<span style='font-size:14px; font-family:Helvetica'>The range in funding per student for school districts in each state,<br>and the mean online engagement for all students in their respective states.<br>(Numbers next to state names indicate the number of school districts within each state that have data available.)</span>"

fig.update_yaxes(type='category')
fig.update_xaxes(tick0=4, dtick=2, title_text='scale of thousands (1000)')
fig.update_layout(dict(height=600, title=large_title_format + '<br>' + small_title_format, margin = dict(t=200, b=60, r=20), showlegend=False))
fig.show()

This chart shows that in some states, such as New York and Massachusetts, there is a large in-state disparity between school districts in the amount of per-student funding received annually. The available data for other states, such as Indiana and Wisconsin, suggest that school districts are funded at similar levels throughout their respective states.

This chart also shows a general trend: online student engagement is higher (on average) in states that provide more per-student funding to public schools. (Note that some states had only a small number of school districts reporting student engagement, so the data for these states should be taken with a grain of salt.) For example, Illinois and the District of Columbia had equally high student engagement numbers (over 9,000 page load events per student per day, on average), the highest of all states with available data. These two jurisdictions also had relatively high minimum values for per-student funding (\$10,000 in Illinois, and \$18,000 in DC).

At the opposite end of the spectrum, the two school districts with the lowest per-student funding in the country were both in Utah, and Utah's overall per-student funding was generally lower than in other states. It is therefore not that surprising that the mean student engagement in Utah was one of the lowest of all states; in fact, the only states with lower student engagement values were those with very little engagement data available (Michigan, Tennessee, and North Carolina).
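
The funding-versus-engagement trend described above could also be checked numerically instead of being read off the chart. The following is a minimal sketch, not part of the original analysis: it assumes a hypothetical table with one row per state, holding the midpoint of the state's funding range (`funding_mid_k`) and its mean engagement (`mean_engagement`). Both column names and all values are placeholders made up for illustration. It computes a Spearman rank correlation as the Pearson correlation of the ranks.

```python
import pandas as pd

def funding_engagement_rank_corr(df, funding_col="funding_mid_k",
                                 engagement_col="mean_engagement"):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    ranks = df[[funding_col, engagement_col]].rank()
    return ranks[funding_col].corr(ranks[engagement_col])

# Placeholder values for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "state": ["A", "B", "C", "D"],
    "funding_mid_k": [7, 11, 15, 19],
    "mean_engagement": [1.2, 2.5, 2.1, 4.0],
})

rho = funding_engagement_rank_corr(toy)
print(round(rho, 2))
```

A value near 1 would support the trend and a value near 0 would not. With so few districts in some states, any such coefficient should be read with the same caution as the chart itself.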

# Covid-19 by State

In this section we'll investigate Covid-19 cases by state, state interventions and policies, and look for trends in how student engagement changed over the course of 2020.

Only those states with the largest number of school districts will be examined, since analysis of student engagement within a state will only be meaningful when there is a reasonably large sample size. 
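
One way this selection could be made is sketched below. It assumes a `districts`-style table with `district_id` and `state` columns, as used elsewhere in this notebook. The helper name `states_with_most_districts`, the cutoff `n`, and the toy rows are hypothetical and only illustrate the idea.

```python
import pandas as pd

def states_with_most_districts(districts_df, n=3):
    """Return the n states with the most districts, ignoring rows with no state."""
    counts = districts_df.dropna(subset=["state"])["state"].value_counts()
    return counts.head(n)

# Made-up rows for illustration; the real counts come from the districts file.
toy_districts = pd.DataFrame({
    "district_id": [1, 2, 3, 4, 5, 6, 7],
    "state": ["Connecticut", "Connecticut", "Connecticut",
              "Utah", "Utah", "Texas", None],
})

top_states = states_with_most_districts(toy_districts, n=2)
print(top_states)
```

Dropping rows with a missing state first keeps anonymized districts without location data from affecting the ranking.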

In [None]:
daily_covid_by_state = pd.read_csv('../input/covid19-in-usa/us_states_covid19_daily.csv')


In [None]:
connecticut_districts = districts[districts['state'].str.match('Connecticut')]['district_id'].to_list()
district_summaries = []
for district_id in connecticut_districts:
    district_engagement_breakdown = pd.read_csv(os.path.join('../input/learnplatform-covid19-impact-on-digital-learning/engagement_data', '{}.csv'.format(district_id)))
    district_engagement_summary = district_engagement_breakdown.groupby('time').agg({'engagement_index': 'sum'})
    district_summaries.append(district_engagement_summary)
connecticut_engagement = pd.concat(district_summaries)
connecticut_engagement = connecticut_engagement.groupby('time').agg({'engagement_index': 'sum'})
connecticut_engagement['engagement_index'] = connecticut_engagement['engagement_index'] / 1000
connecticut_engagement.reset_index(inplace=True)
connecticut_engagement['time'] = pd.to_datetime(connecticut_engagement['time'], format='%Y-%m-%d')

In [None]:
# .copy() avoids a SettingWithCopyWarning when the date column is reassigned below
connecticut_covid = daily_covid_by_state.query('state == "CT"').copy()

connecticut_covid['date'] = pd.to_datetime(connecticut_covid['date'], format='%Y%m%d')
connecticut_covid = connecticut_covid.sort_values(by='date')

connecticut_engagement = connecticut_engagement.sort_values(by='time')

fig = go.Figure()
fig.add_trace(go.Scatter(x=connecticut_engagement['time'], y=connecticut_engagement['engagement_index'], mode='lines', name='Avg Page Load Events/Student'))
fig.add_trace(go.Scatter(x=connecticut_covid['date'], y=connecticut_covid['hospitalizedCurrently'], mode='lines', name='Hospitalized Currently'))
fig.add_trace(go.Scatter(x=connecticut_covid['date'], y=connecticut_covid['inIcuCurrently'], mode='lines', name='In ICU Currently'))
fig.add_trace(go.Scatter(x=connecticut_covid['date'], y=connecticut_covid['onVentilatorCurrently'], mode='lines', name='On Ventilator Currently'))
fig.add_trace(go.Scatter(x=connecticut_covid['date'], y=connecticut_covid['positiveIncrease'], mode='lines', name='Positive Increase'))
fig.add_trace(go.Scatter(x=['2020-03-10', '2020-03-10'], y=[0,5500], mode='lines', marker=dict(size=12,line=dict(width=0.8),color='green'), name="State of Emergency Declared"))
fig.add_trace(go.Scatter(x=['2020-03-17','2020-03-17'], y=[0,5500], mode='lines', marker=dict(size=12,line=dict(width=0.8),color='orange'), name="Public K-12 schools closed"))
fig.add_trace(go.Scatter(x=['2020-03-16','2020-03-16'], y=[0,5500], mode='lines', marker=dict(size=12,line=dict(width=0.8),color='purple'), name='Eviction Moratorium started'))
fig.add_trace(go.Scatter(x=['2020-03-12','2020-03-12'], y=[0,5500], mode='lines', marker=dict(size=12,line=dict(width=0.8),color='pink'), name='Utilities Shutoff Moratorium started'))
fig.add_trace(go.Scatter(x=['2020-09-30','2020-09-30'], y=[0,5500], mode='lines', marker=dict(size=12,line=dict(width=0.8),color='magenta'), name='Utilities Shutoff Moratorium ended'))
fig.update_layout(title='Connecticut Data')
fig.show()

The curves for Avg Page Load Events/Student and Positive Increase are not smooth because of weekends: students are less likely to be doing schoolwork on weekends. Because the Positive Increase curve repeatedly shows two consecutive days with values of 0, it can be assumed that the state of Connecticut does not report Covid case statistics on weekends.
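
The weekend-reporting assumption could be tested directly. The sketch below is illustrative only: it assumes a table with a datetime `date` column and a `positiveIncrease` column, like the per-state Covid frames in this notebook, and the toy values are invented. It reports, for each day of the week, the share of days on which zero new cases were recorded.

```python
import pandas as pd

def zero_share_by_weekday(covid_df, date_col="date", value_col="positiveIncrease"):
    """Fraction of zero-valued reports per weekday (Monday=0 ... Sunday=6)."""
    weekday = covid_df[date_col].dt.dayofweek
    is_zero = covid_df[value_col].eq(0)
    return is_zero.groupby(weekday).mean()

# Two invented weeks starting on Monday 2020-06-01, with zeros on Sat/Sun.
dates = pd.date_range("2020-06-01", periods=14, freq="D")
toy = pd.DataFrame({
    "date": dates,
    "positiveIncrease": [50, 40, 30, 20, 60, 0, 0] * 2,
})

shares = zero_share_by_weekday(toy)
print(shares)
```

If the shares for days 5 and 6 are close to 1 while the weekday shares are close to 0, the zeros are most likely a reporting artifact and not genuine days without new cases.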

Note that there is a significant increase in the Avg Page Load Events/Student trend starting in late March 2020. This strongly suggests that many (if not all) school districts in Connecticut switched from face-to-face to online learning at around this time. The page load trend drops off quickly in late June, coinciding with the start of summer vacation, and then rapidly increases at the start of September 2020, coinciding with the start of a new school year. From September to December 2020, the Avg Page Load Events/Student trend is similar to that seen in the spring. This suggests either that Connecticut school districts continued with online learning for the 2020-2021 school year, or that they chose to keep using online learning products in the classroom after reopening. Unfortunately, there is no data available to confirm or rule out either explanation.
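
To compare periods such as spring and autumn without relying on the eye, the daily series could be collapsed to monthly means. This is a minimal sketch under the assumption that the engagement frame has a datetime `time` column and an `engagement_index` column, as built above. The helper name and the toy values are hypothetical.

```python
import pandas as pd

def monthly_mean_engagement(engagement_df, time_col="time",
                            value_col="engagement_index"):
    """Mean of the daily engagement values for each calendar month."""
    series = engagement_df.set_index(time_col)[value_col]
    return series.resample("MS").mean()  # "MS" labels each bin by month start

# Invented data: every day in a month carries that month's number as its value.
dates = pd.date_range("2020-01-01", "2020-03-31", freq="D")
toy = pd.DataFrame({
    "time": dates,
    "engagement_index": dates.month.astype(float),
})

monthly = monthly_mean_engagement(toy)
print(monthly)
```

Because weekends pull every monthly mean down by a similar amount, the month-to-month comparison remains meaningful even without removing them.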

In [None]:
utah_districts = districts[districts['state'].str.match('Utah')]['district_id'].to_list()
utah_district_summaries = []
for district_id in utah_districts:
    district_engagement_breakdown = pd.read_csv(os.path.join('../input/learnplatform-covid19-impact-on-digital-learning/engagement_data', '{}.csv'.format(district_id)))
    district_engagement_summary = district_engagement_breakdown.groupby('time').agg({'engagement_index': 'sum'})
    utah_district_summaries.append(district_engagement_summary)
utah_engagement = pd.concat(utah_district_summaries)
utah_engagement = utah_engagement.groupby('time').agg({'engagement_index': 'sum'})
utah_engagement.reset_index(inplace=True)
utah_engagement['engagement_index'] = utah_engagement['engagement_index'] / 1000
utah_engagement['time'] = pd.to_datetime(utah_engagement['time'], format='%Y-%m-%d')

In [None]:
# .copy() avoids a SettingWithCopyWarning when the date column is reassigned below
utah_covid = daily_covid_by_state.query('state == "UT"').copy()
utah_covid['date'] = pd.to_datetime(utah_covid['date'], format='%Y%m%d')
utah_covid = utah_covid.sort_values(by='date')

utah_engagement = utah_engagement.sort_values(by='time')
utah_engagement['MA14'] = utah_engagement.engagement_index.rolling(14).mean()

fig = go.Figure()
fig.add_trace(go.Scatter(x=utah_engagement['time'], y=utah_engagement['engagement_index'], mode='lines', name='Avg Page Load Events/Student'))
fig.add_trace(go.Scatter(x=utah_engagement['time'], y=utah_engagement['MA14'], mode='lines', name="14-Day Moving Average Page Load Events/Student"))
fig.add_trace(go.Scatter(x=utah_covid['date'], y=utah_covid['hospitalizedCurrently'], mode='lines', name='Hospitalized Currently'))
fig.add_trace(go.Scatter(x=utah_covid['date'], y=utah_covid['inIcuCurrently'], mode='lines', name='In ICU Currently'))
fig.add_trace(go.Scatter(x=utah_covid['date'], y=utah_covid['onVentilatorCurrently'], mode='lines', name='On Ventilator Currently'))
fig.add_trace(go.Scatter(x=utah_covid['date'], y=utah_covid['positiveIncrease'], mode='lines', name='Positive Increase'))
fig.add_trace(go.Scatter(x=['2020-03-06', '2020-03-06'], y=[0,3000], mode='lines', marker=dict(size=12,line=dict(width=0.8),color='green'), name="State of Emergency Declared"))
fig.add_trace(go.Scatter(x=['2020-03-16','2020-03-16'], y=[0,3000], mode='lines', marker=dict(size=12,line=dict(width=0.8),color='orange'), name="Public K-12 schools closed"))
fig.add_trace(go.Scatter(x=['2020-04-01','2020-04-01'], y=[0,3000], mode='lines', marker=dict(size=12,line=dict(width=0.8),color='purple'), name='Eviction Moratorium started'))
fig.add_trace(go.Scatter(x=['2020-05-16','2020-05-16'], y=[0,3000], mode='lines', marker=dict(size=12,line=dict(width=0.8),color='black'), name='Eviction Moratorium ended'))

fig.update_layout(title='Utah Data')
fig.show()

Unlike Connecticut, Utah did not experience significant Covid cases until late 2020.

The average page load events per student roughly doubled around the middle of February 2020, which could suggest that Utah started relying on online resources for instruction around that time; however, February seems somewhat early for a transition to online learning, especially since Utah did not report its first positive case of Covid until the middle of March.

Utah recorded a brief uptick in online student engagement in the last two weeks of March, which coincides with the March 16th closure of public schools in Utah. The numbers then returned to the trend set in February, suggesting that Utah's school districts only used online instruction for a couple of weeks.

In [None]:
texas_districts = districts[districts['state'].str.match('Texas')]['district_id'].to_list()
district_summaries = []
for district_id in texas_districts:
    district_engagement_breakdown = pd.read_csv(os.path.join('../input/learnplatform-covid19-impact-on-digital-learning/engagement_data', '{}.csv'.format(district_id)))
    district_engagement_summary = district_engagement_breakdown.groupby('time').agg({'engagement_index': 'sum'})
    district_summaries.append(district_engagement_summary)
texas_engagement = pd.concat(district_summaries)
texas_engagement = texas_engagement.groupby('time').agg({'engagement_index': 'sum'})
texas_engagement.reset_index(inplace=True)
texas_engagement['engagement_index'] = texas_engagement['engagement_index'] / 1000
texas_engagement['time'] = pd.to_datetime(texas_engagement['time'], format='%Y-%m-%d')

In [None]:
# .copy() avoids a SettingWithCopyWarning when the date column is reassigned below
texas_covid = daily_covid_by_state.query('state == "TX"').copy()
texas_covid['date'] = pd.to_datetime(texas_covid['date'], format='%Y%m%d')
texas_covid = texas_covid.sort_values(by='date')

texas_engagement = texas_engagement.sort_values(by='time')
texas_engagement['MA14'] = texas_engagement.engagement_index.rolling(14).mean()

fig = go.Figure()
fig.add_trace(go.Scatter(x=texas_engagement['time'], y=texas_engagement['engagement_index'], mode='lines', name='Avg Page Load Events/Student'))
fig.add_trace(go.Scatter(x=texas_engagement['time'], y=texas_engagement['MA14'], mode='lines', name="14-Day Moving Average Page Load Events/Student"))
fig.add_trace(go.Scatter(x=texas_covid['date'], y=texas_covid['hospitalizedCurrently'], mode='lines', name='Hospitalized Currently'))
fig.add_trace(go.Scatter(x=texas_covid['date'], y=texas_covid['inIcuCurrently'], mode='lines', name='In ICU Currently'))
fig.add_trace(go.Scatter(x=texas_covid['date'], y=texas_covid['onVentilatorCurrently'], mode='lines', name='On Ventilator Currently'))
fig.add_trace(go.Scatter(x=texas_covid['date'], y=texas_covid['positiveIncrease'], mode='lines', name='Positive Increase'))
fig.add_trace(go.Scatter(x=['2020-03-13', '2020-03-13'], y=[0,2000], mode='lines', marker=dict(size=12,line=dict(width=0.8),color='green'), name="State of Emergency Declared"))
fig.add_trace(go.Scatter(x=['2020-03-21','2020-03-21'], y=[0,2000], mode='lines', marker=dict(size=12,line=dict(width=0.8),color='orange'), name="Public K-12 schools closed"))
fig.add_trace(go.Scatter(x=['2020-03-19','2020-03-19'], y=[0,2000], mode='lines', marker=dict(size=12,line=dict(width=0.8),color='purple'), name='Eviction Moratorium started'))
fig.add_trace(go.Scatter(x=['2020-05-28','2020-05-28'], y=[0,2000], mode='lines', marker=dict(size=12,line=dict(width=0.8),color='black'), name='Eviction Moratorium ended'))
fig.update_layout(title='Texas Data')
fig.show()

Note that due to the (sadly) very large scale of increases in positive Covid cases for the very populous Texas, it is necessary to toggle off most of the Covid-related trend lines or zoom in closer to the bottom of the chart in order to see the trend lines related to daily page load events.

Texas does not appear to have made use of online learning much at all. The largest value of average page load events per student can be seen around the autumn of 2020 at approx. 150 page load events per student per day, which is significantly less than the usage of online products that Connecticut and Utah showed.

# Summary

Based on the data analysis explained above, the following are my responses to the questions posed in the study proposal:

**What is the picture of digital connectivity and engagement in 2020?**

Across the US, the most regularly accessed online software was in the digital learning platforms category. This includes products such as Khan Academy and Study.com. From what could be observed of the available student engagement data, there are substantial differences in online engagement across states. Texas appeared to barely use online education software whatsoever, whereas students in Connecticut were habitual users of learning software.

**What is the effect of the Covid-19 pandemic on online and distance learning, and how might this also evolve in the future?**

As was seen in the day-by-day data for Connecticut, online engagement increased immediately when public schools closed their doors in March 2020, and the trend of increased student engagement seen in the last few months of the 2019-2020 school year continued to roughly the same extent at the start of the new 2020-2021 school year. This might suggest that some school districts continued to use certain online learning products, which were originally adopted as a stopgap replacement for in-person learning, in the longer term.

Indeed, when looking at the daily sums of page load events by school district, it is apparent that online education software was being used much more often in the last three months of 2020 compared to the first three months. My hypothesis is that many school districts continued using online software for their classrooms even if schools were open and classes were being held in-person near the end of 2020.

**How does student engagement with online learning platforms relate to geographic, demographic, and socioeconomic contexts?**

Districts with greater per-student funding generally had higher engagement. This could be because wealthier school districts can better afford in-classroom computers for online access, and/or can afford to buy the necessary licenses and subscriptions for paid education software.

School districts with the fewest students enrolled in free or reduced-cost lunch programs had the best online engagement. The most likely explanation is that students who do not qualify for free school lunches tend to come from families in relatively high income brackets, and their households are consequently more likely to be able to afford an at-home Internet connection and computers or other electronic devices that can be used for accessing online learning.

Data analysis also revealed that the most engaged school districts had relatively homogeneous ethnic distributions, whether that was predominantly Caucasian or predominantly Black/Hispanic. Districts with more equal distributions of ethnicities actually had the lowest online student engagement.

**Do certain state interventions or policies (e.g., stimulus, reopening, eviction moratorium) correlate with the increase or decrease of online engagement?**

In states with sufficient engagement data, it is observable that online engagement immediately increased when schools were closed. 

From the student engagement dataset provided, there is little evidence to suggest that state interventions impacted engagement, although I do not believe there is sufficient engagement data available to be able to conclusively say that state interventions were ineffective.

# Data Sources

* Covid-19 in USA (https://www.kaggle.com/sudalairajkumar/covid19-in-usa) under Apache License 2.0
* USA Statewise Latest Covid-19 Data (https://www.kaggle.com/anandhuh/usa-statewise-latest-covid19-data)
* COVID-19 US State Policy Database (https://www.kaggle.com/cavfiumella/covid19-us-state-policy-database)