In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import itertools
from collections import defaultdict
from collections import Counter
import statsmodels.api as sm
from statsmodels.formula.api import ols
from scipy.stats import linregress
from scipy.stats import ttest_ind
import warnings
warnings.filterwarnings("ignore")

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
#for dirname, _, filenames in os.walk('/kaggle/input'):
#    for filename in filenames:
#        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Motivation, key findings, and recommendations 

The last year and a half has presented significant challenges for all of us, but many of the pandemic's impacts have been more acute for some than for others. While many white-collar workers were able to seamlessly transition to working remotely, a much larger percentage of blue-collar workers lost their jobs [[1](https://www.joblist.com/trends/working-during-the-covid-19-pandemic-class-differences)]. Analogously, recent data from NWEA suggests that while high-income children's performance on reading and math measures grew at a reasonable rate during the pandemic, growth in reading and math performance of low-income children was significantly less than in normal years [[2](https://www.nwea.org/research/publication/learning-during-covid-19-reading-and-math-achievement-in-the-2020-2021-school-year/)]. These impacts of the pandemic, both on low-income children and their parents, have the unfortunate outcome of widening an already-gaping chasm between the rich and poor in America. 

Understanding the sources of these divisions is critical to ameliorating them. This dataset provided the opportunity to gather meaningful insights relevant to that goal. In my investigation of the dataset, I sought to examine how children from higher-income and lower-income backgrounds experienced the pandemic with respect to their use of educational technology, whose already-central importance to the student experience before the pandemic reached new heights when schooling became remote. I did this in three steps.
1. I first examined changes to overall educational technology use by each income group (as indexed by percentage of children receiving free or reduced lunch) over the course of 2020. 
2. After identifying key differences between income groups in changes to overall technology use, I sought to identify whether any individual products or product categories were driving these differences. 
3. Finally, wanting to explore links between impacts on children and impacts on their parents, I examined whether state-level interventions intended to provide economic support during the pandemic were differentially correlated with educational technology use for lower-income vs higher-income children.

In this notebook, I start by presenting the key findings of this investigation, as well as some recommendations based on those findings. I then walk through my exploration of the data for each of the three steps above, presenting relevant figures and statistics. I conclude with a discussion of insights arising from and limitations of this investigation, finishing with suggestions for next steps. As a note to the reader, this is my first Kaggle submission and really my first time trying to make a notebook for others to read, so I apologize if any part of it is confusing. I made the notebook so that it should hopefully compile just by pressing "Run All."

<b>Key findings</b>
* There was a spring peak in engagement and access across all technologies for the wealthiest kids aligning with the timing of school closures in March and early April, which starkly contrasted with an overall decline in technology use through the spring for the two lowest-income groups.
* The largest driver of the peak among the wealthiest group was Google Docs. Other income groups increased their use of this tool, but the increase among the wealthy was by far the greatest, compounding already high levels of usage by this group (especially of Google Docs) before the pandemic. Despite moderate increases in Google Docs usage by other groups, their overall decrease in learning technology use through the spring was driven by decreases in other tools, such as digital learning platforms, that were not heavily used by the wealthiest before the pandemic. 
* Leaving aside one outlier district, the greatest decreases in technology use in the spring were in the lowest income group, driven in large part by decreases in use of digital learning platforms -- specifically reading and math software, such as Lexia, i-Ready, and myON Reader. 
* By the fall, use of reading and math software had largely not resumed for the lowest-income group, reflecting a gap (at least as of late 2020) remaining to be filled. This gap is consistent with independent findings of disproportionate decreases in reading and math among low-income students after a year of the pandemic (NWEA, 2021).
* There were no significant correlations between technology use and a measure of state social support, though this may be due to limited data. The correlation between technology engagement and unemployment insurance did trend toward being greater in the low-income group than in the high-income group, which would be consistent with this support being more critical for blue-collar workers to bolster their children's engagement with technology.

<b>Recommendations</b>
* Conduct qualitative research to understand more deeply the enormous difference in Google Docs usage between the wealthiest group and everyone else. How do different income groups use Google Docs differently, and what kinds of activities are responsible for the large divide in usage? Do the wealthiest use Google Docs in ways that will in the long run further entrench their socioeconomic status? How can insights about that be used to increase educational and societal equity?
* Identify ways to fill the gap left by decreases in use of digital learning platforms, especially reading and math software, that has disproportionately impacted lower income kids. As the pandemic persists and school closures continue to be a regular occurrence, can districts more effectively use software they already have in remote instruction, or should they alternatively adopt technologies that are better suited to learning from home? What other interventions can be used to help get these kids up to speed?
* As time passes and more metrics are collected, look at further correlations between measures of social support and technology engagement, as well as between changes in the use of reading and math technologies and child reading and math performance.

# (1) Overall changes in educational technology use
I first sought to examine how children's overall educational technology use, collapsing across all products measured, differed among income groups. As mentioned above, I took the percentage of children receiving free or reduced-price lunch in a district as a proxy for income level [[3](https://nces.ed.gov/fastfacts/display.asp?id=898)]. Data on this measure was only available for a subset of the 233 districts in the dataset. Because I was interested in looking at change over the course of the year, I also excluded the small number of districts that did not have data for February through November. After including only districts with data for percentage of free or reduced lunch and excluding districts with missing months, there were 139 districts left to analyze. These 139 districts are the focus of the analysis presented in this notebook.

In [None]:
# load and join all individual district engagement data dfs
all_dfs = []
for dirname, _, filenames in os.walk('/kaggle/input/learnplatform-covid19-impact-on-digital-learning/engagement_data/'):
    for filename in filenames:
        currentDf = pd.read_csv(os.path.join(dirname,filename))
        currentDf['district_id'] = int(filename[0:4])
        all_dfs.append(currentDf)
engagement_df = pd.concat(all_dfs)

In [None]:
# load and clean up district data df
district_df = pd.read_csv("/kaggle/input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")
district_df = district_df.rename(columns = {"pct_free/reduced": "pct_free_reduced","pct_black/hispanic": "pct_black_hispanic"})
district_df = district_df.drop("county_connections_ratio", axis = 1)
district_df = district_df.drop("locale", axis = 1)
n_total_districts = len(district_df.district_id.unique())
district_df = district_df.dropna(subset = ['pct_free_reduced'])
n_nonNAdistricts = len(district_df.district_id.unique())
district_df = district_df.replace({'pct_free_reduced': {'[0, 0.2[': '0-20%', '[0.2, 0.4[': '20-40%', 
                    '[0.4, 0.6[': '40-60%', '[0.6, 0.8[': '60-80%', 
                    '[0.8, 1[': '80-100%'}})
district_df = district_df.replace({'pct_black_hispanic': {'[0, 0.2[': '0-20%', '[0.2, 0.4[': '20-40%', 
                    '[0.4, 0.6[': '40-60%', '[0.6, 0.8[': '60-80%', 
                    '[0.8, 1[': '80-100%'}})
district_df['pct_black_hispanic'] = district_df['pct_black_hispanic'].astype('category')
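# reorder_categories returns a new Series, so assign the result back
district_df['pct_black_hispanic'] = district_df['pct_black_hispanic'].cat.reorder_categories(['0-20%', '20-40%', '40-60%', '60-80%', '80-100%'])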
district_df["pct_free_reduced_binned"] = district_df.pct_free_reduced
district_df = district_df.replace({'pct_free_reduced_binned': {'0-20%': '0-20%', '20-40%': '20-60%', 
                    '40-60%': '20-60%', '60-80%': '60-100%', 
                    '80-100%': '60-100%'}})
district_df['pct_free_reduced'] = district_df['pct_free_reduced'].astype('category')
district_df['pct_free_reduced'] = district_df['pct_free_reduced'].cat.reorder_categories(['0-20%', '20-40%', '40-60%', '60-80%', '80-100%'])
district_df['pct_free_reduced_binned'] = district_df['pct_free_reduced_binned'].astype('category')
district_df['pct_free_reduced_binned'] = district_df['pct_free_reduced_binned'].cat.reorder_categories(['0-20%', '20-60%', '60-100%'])

print("{n_inc_dist} / {n_tot_dist} districts had data for percent free or reduced lunch.".format(n_inc_dist = n_nonNAdistricts, n_tot_dist = n_total_districts))
print("Our analysis only uses those districts.")

In [None]:
# join usage and district dfs
engagement_df_raw_district = pd.merge(engagement_df, district_df, on = "district_id")
engagement_df_raw_district["Month"] = engagement_df_raw_district.time.apply(lambda x: x.split('-')[1])
engagement_df_raw_district["Day"] = engagement_df_raw_district.time.apply(lambda x: x.split('-')[2])
engagement_df_raw_district["lp_id"] = engagement_df_raw_district["lp_id"].fillna(0)
engagement_df_raw_district["lp_id"] = engagement_df_raw_district.lp_id.apply(lambda x: str(int(x)))

In [None]:
# remove districts not having data for February through November
desired_months = ["02", "03", "04", "05", "06", "07", "08", "09", "10", "11"]
incomplete_month_districts = []
for district in engagement_df_raw_district.district_id.unique():
    included_months = engagement_df_raw_district["Month"].loc[engagement_df_raw_district["district_id"] == district].unique()
    districtHasDesiredMonths = all(x in included_months for x in desired_months)
    if not districtHasDesiredMonths:
        incomplete_month_districts.append(district)
engagement_df_raw_district_excluded = engagement_df_raw_district[~engagement_df_raw_district["district_id"].isin(incomplete_month_districts)]
included_districts = engagement_df_raw_district_excluded.district_id.unique()
n_excluded_districts = len(incomplete_month_districts)
n_total_districts = len(engagement_df_raw_district.district_id.unique())
income_group_counter = Counter(district_df["pct_free_reduced_binned"][district_df["district_id"].isin(included_districts)])

print("We excluded {n_exc_dist} / {n_tot_dist} remaining districts for not having data for February through November.".format(n_exc_dist = n_excluded_districts, n_tot_dist = n_total_districts))
print("This leaves us with " + str(len(included_districts)) + " districts for our analysis.")

print("""Of these, {HI} districts are in the high-income group (0-20% free/reduced lunch),  
      {MI} districts are in the middle-income group (20-60% free/reduced lunch),
      and {LI} districts are in the low-income group (60-100% free/reduced lunch).""".format(HI = income_group_counter["0-20%"],
                                                           MI = income_group_counter["20-60%"], 
                                                           LI = income_group_counter["60-100%"]))

In [None]:
# create full df with district, usage, and product information
product_df = pd.read_csv("/kaggle/input/learnplatform-covid19-impact-on-digital-learning/products_info.csv")
product_df = product_df.rename(columns = {"LP ID": "lp_id"})
product_df["lp_id"] = product_df.lp_id.apply(lambda x: str(int(x)))

full_df = pd.merge(engagement_df_raw_district_excluded, product_df[['lp_id', 'Product Name', 'Primary Essential Function','Provider/Company Name']], on = "lp_id", how = "left")

In [None]:
# delete intermediate dfs to save RAM
del all_dfs
del engagement_df
del engagement_df_raw_district
del engagement_df_raw_district_excluded

After playing around a bit with the data, I realized that some non-trivial normalization would be required to allow us to make reasonable comparisons across districts and groups. This is because every school uses a different set of products, and products that are not used by a district are not present in the dataset. Thus, for the same total number of page views across all products used by a district on a given day, taking a simple mean across all products would inaccurately report a district that used five products as having twice the usage as a district that used ten products, since the summed engagement of the former would be divided by five and the latter by ten. To address this, in calculating monthly total engagement and percent access, I chose to take the sum of these metrics across all days in a month and all products used by the district. This results in an engagement index that reflects the total number of page views per thousand students over the course of each month. The results for the percentage access and engagement index are highly correlated, so in this investigation I focus on engagement index, as the meaning of the engagement index when summed across products is easier to think about.
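To make the normalization issue concrete, here is a minimal sketch with made-up numbers (the two districts `A` and `B` and their values are hypothetical, not taken from the dataset). Both districts have the same total engagement, but one spreads it over five products and the other over ten.

```python
import pandas as pd

# Hypothetical toy data: district A uses 5 products, district B uses 10,
# and both have the same total engagement (1000) on a given day.
toy = pd.DataFrame({
    "district_id": ["A"] * 5 + ["B"] * 10,
    "engagement_index": [200.0] * 5 + [100.0] * 10,
})

# Compare the per-product mean with the sum across products.
per_district = toy.groupby("district_id")["engagement_index"].agg(["mean", "sum"])
print(per_district)
```

The mean makes district A look twice as engaged as district B, while the sum reports them as equal, which is the behavior I wanted for comparing districts.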

I also noticed that there is one outlier district in the lowest income group (district 9536, in an urban center in New York) that significantly ramped up technology use in the early months of the pandemic. While looking more deeply at this district to understand its approach to the pandemic would be very interesting, I chose to exclude it from this analysis because it very clearly departs from the trend of the low-income groups.

Having settled on using the sum, I summed page views across all products and all days for each district in each month, keeping each district's income group alongside it. As the original metric was page views per 1000 students, I then divided by 1000 to get the total number per student. The resulting graph is shown below.


In [None]:
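# select the value columns with a list (double brackets); tuple-style selection on a groupby is not supported in recent pandas
free_reduced_sum_by_district_df = full_df.groupby(['Month', 'state', 'district_id', 'pct_free_reduced', 'pct_free_reduced_binned'], observed = True, as_index = False)[['pct_access', 'engagement_index']].sum().reset_index()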
free_reduced_sum_by_district_df["engagement_index"] = free_reduced_sum_by_district_df["engagement_index"] / 1000

In [None]:
sns.lineplot(data=free_reduced_sum_by_district_df.loc[free_reduced_sum_by_district_df["district_id"] != 9536], x="Month", y="engagement_index", hue="pct_free_reduced", ci = None, alpha = 0.75)
plt.show()

As is evident from this figure, the highest-income group (0-20% free or reduced price lunch) had a sharp peak in overall technology use that corresponded in timing exactly with school closures in March and April 2020, at the onset of the pandemic. (Here, engagement index represents the total number of page views per student across all products in a given month in a given income category, averaged across all districts in that income category.) The 20%-40% and 40%-60% groups have very similar trends to each other, showing an increase during that period, albeit not as drastic as the highest income group. By contrast, the two lowest-income groups (60-80% and 80-100%) showed a decrease in technology use during the spring as schools closed. Because of the similarity of trends across certain groups, I decided to bin them further for the remainder of the investigation to allow both for greater clarity of presentation and greater statistical power. The three bins I ended up with, reflecting the shared trends noted above, were 0%-20%, 20%-60%, and 60%-100%. The graph below is a version of the graph above but using these new bins, with confidence intervals for each group. 

In [None]:
sns.lineplot(data=free_reduced_sum_by_district_df.loc[free_reduced_sum_by_district_df["district_id"] != 9536], x="Month", y="engagement_index", hue="pct_free_reduced_binned", alpha = 0.75)

To confirm that these visual trends were both significant and reflective of the sample in each group (rather than driven by outliers), I did two things. First, I created a scatterplot reflecting overall technology use by month, with each dot representing a district, and color representing the district's income group, shown below. 

In [None]:
sns.catplot(x="Month", y="engagement_index", hue = "pct_free_reduced_binned", data=free_reduced_sum_by_district_df)

Next, I ran a one-way ANOVA to determine whether there was an overall effect of income group on the change in engagement between February (before the pandemic started) and April (when schools had already closed in nearly all states, as evidenced in the Covid-19 state policy database [[3](https://www.openicpsr.org/openicpsr/project/119446/version/V75/view;jsessionid=851ECB80E6CB42252D396C29564184DC?path=/openicpsr/119446/fcr:versions/V75/COVID-19-US-State-Policy-Database-master&type=folder)]).

In [None]:
total_change_df = free_reduced_sum_by_district_df[free_reduced_sum_by_district_df["Month"] == "02"][["district_id", "state", "pct_free_reduced", "pct_free_reduced_binned", "pct_access", "engagement_index"]]
total_change_df = total_change_df.rename(columns = {"pct_access": "Feb_pct_access", "engagement_index": "Feb_engagement_index"})
total_change_df = pd.merge(total_change_df, 
                 free_reduced_sum_by_district_df[free_reduced_sum_by_district_df["Month"] == "04"][["district_id", "pct_access", "engagement_index"]],
                 on = "district_id")
total_change_df = total_change_df.rename(columns = {"pct_access": "Apr_pct_access", "engagement_index": "Apr_engagement_index"})
total_change_df = pd.merge(total_change_df, 
                 free_reduced_sum_by_district_df[free_reduced_sum_by_district_df["Month"] == "11"][["district_id", "pct_access", "engagement_index"]],
                 on = "district_id")
total_change_df = total_change_df.rename(columns = {"pct_access": "Nov_pct_access", "engagement_index": "Nov_engagement_index"})
total_change_df = pd.merge(total_change_df, 
                 free_reduced_sum_by_district_df[free_reduced_sum_by_district_df["Month"] == "09"][["district_id", "pct_access", "engagement_index"]],
                 on = "district_id")
total_change_df = total_change_df.rename(columns = {"pct_access": "Sep_pct_access", "engagement_index": "Sep_engagement_index"})
total_change_df["FebAprRawChangePctAccess"] = total_change_df["Apr_pct_access"] - total_change_df["Feb_pct_access"]
total_change_df["FebAprRawChangeEngagementIndex"] = total_change_df["Apr_engagement_index"] - total_change_df["Feb_engagement_index"]
total_change_df["FebSepRawChangePctAccess"] = total_change_df["Sep_pct_access"] - total_change_df["Feb_pct_access"]
total_change_df["FebSepRawChangeEngagementIndex"] = total_change_df["Sep_engagement_index"] - total_change_df["Feb_engagement_index"]
total_change_df["FebNovRawChangePctAccess"] = total_change_df["Nov_pct_access"] - total_change_df["Feb_pct_access"]
total_change_df["FebNovRawChangeEngagementIndex"] = total_change_df["Nov_engagement_index"] - total_change_df["Feb_engagement_index"]
# mean of each change metric per income group, excluding the outlier district
total_change_summary_df = total_change_df.loc[total_change_df["district_id"] != 9536].groupby(['pct_free_reduced'])[["Feb_pct_access", "Apr_pct_access", 
                                                                                                                    "FebAprRawChangePctAccess", "FebAprRawChangeEngagementIndex",
                                                                                                                    "FebSepRawChangePctAccess", "FebSepRawChangeEngagementIndex",
                                                                                                                    "FebNovRawChangePctAccess", "FebNovRawChangeEngagementIndex"]].mean().reset_index()

In [None]:
mod = ols('FebAprRawChangePctAccess ~ pct_free_reduced_binned', data=total_change_df.loc[total_change_df["district_id"] != 9536]).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
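# eta squared = SS_effect / (SS_effect + SS_residual); use .iloc for positional access
esq_sm = aov_table['sum_sq'].iloc[0]/(aov_table['sum_sq'].iloc[0]+aov_table['sum_sq'].iloc[1])
aov_table['EtaSq'] = [esq_sm, np.nan]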
print(aov_table)


The ANOVA indicated a large (EtaSq = 0.16) and significant (p < 0.00001) effect of income group, so I did posthoc comparisons to examine differences between individual groups.
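For readers unfamiliar with eta squared, the sketch below shows how it can be computed by hand for a one-way design: the between-group sum of squares divided by the total (between plus within) sum of squares. The helper `eta_squared` and the three toy groups are hypothetical and only illustrate the formula; the values reported above come from the statsmodels table, not from this snippet.

```python
import numpy as np

def eta_squared(groups):
    """Share of total variance explained by group membership (one-way design)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    grand_mean = np.concatenate(groups).mean()
    # Between-group sum of squares: group size times squared distance of group mean from grand mean
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group (residual) sum of squares
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return ss_between / (ss_between + ss_within)

# Made-up change scores for three groups (not real data)
high = [3.0, 4.0, 5.0]
middle = [1.0, 2.0, 3.0]
low = [-1.0, 0.0, 1.0]
eta_sq = eta_squared([high, middle, low])
print(eta_sq)
```

In the ANOVA table above, the first `sum_sq` row plays the role of the between-group term and the residual row plays the role of the within-group term.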

In [None]:
pair_t = mod.t_test_pairwise('pct_free_reduced_binned')
pair_t.result_frame

Post-hoc comparisons indicated that the observed differences between the three bins were significant -- the highest-income group had a significantly greater February-to-April change than both other groups (p < 0.001), and the middle-income group had a significantly greater change than the low-income group (p < 0.01).

While visual inspection indicates that all groups had increased in their technology use by the fall, I wanted to check whether there were still differences across groups in changes from the early part of the year to the later part of the year. I thus did an ANOVA to check whether there was an effect of income group on change from February to November.

In [None]:
mod = ols('FebNovRawChangeEngagementIndex ~ pct_free_reduced_binned', data=total_change_df.loc[total_change_df["district_id"] != 9536]).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
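# eta squared = SS_effect / (SS_effect + SS_residual); use .iloc for positional access
esq_sm = aov_table['sum_sq'].iloc[0]/(aov_table['sum_sq'].iloc[0]+aov_table['sum_sq'].iloc[1])
aov_table['EtaSq'] = [esq_sm, np.nan]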
print(aov_table)

There was a trend toward an effect (p = 0.065), so I did followup posthocs.

In [None]:
pair_t = mod.t_test_pairwise('pct_free_reduced_binned')
pair_t.result_frame

There was a trend toward a significant difference between the lowest-income and the highest-income group in the change in technology engagement from February to November, but no differences between groups were significant. Encouragingly, the differences between groups were much smaller than the change in the spring, indicating that the divide in technology use occasioned by the early part of the pandemic had shrunk by the fall of 2020. Still, this left lower-income kids with a long period of time during which their exposure levels were different -- especially critical when schooling became remote and technology was the way of connecting not just to educational resources but to the stability of a school routine in the midst of societal upheaval. Indeed, an ANOVA with posthocs for the month of April indicates that children from the different groups showed significantly different levels of engagement with technology.

In [None]:
mod = ols('engagement_index ~ pct_free_reduced_binned', data=free_reduced_sum_by_district_df.loc[(free_reduced_sum_by_district_df["district_id"] != 9536) & (free_reduced_sum_by_district_df["Month"] == "04")]).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
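# eta squared = SS_effect / (SS_effect + SS_residual); use .iloc for positional access
esq_sm = aov_table['sum_sq'].iloc[0]/(aov_table['sum_sq'].iloc[0]+aov_table['sum_sq'].iloc[1])
aov_table['EtaSq'] = [esq_sm, np.nan]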
print(aov_table)

pair_t = mod.t_test_pairwise('pct_free_reduced_binned')
pair_t.result_frame

# (2) Changes in use of individual technologies

Having observed that there were significant differences by income group in how children's use of technology changed over the pandemic, I next sought to determine what kinds of technologies were most contributing to this. Were the highest income kids increasing their use of particular products during the peak we saw in the spring, and were the lowest income kids decreasing their use of certain products that contributed heavily to their overall decrease in technology use during that period? To answer this, I calculated the engagement for each product in each month for each income group, and then looked at which products showed the greatest change in different groups. As before, I summed across daily usage for each product in each district to get monthly usage, giving us numbers that reflect total monthly usage per thousand students of a given product.
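As a compact illustration of the per-product comparison described above, the following sketch computes raw and percent change between two months with a pivot table. The data frame `toy`, its column names, and all numbers are made up for illustration; the cells below use explicit loops over the real data instead.

```python
import pandas as pd

# Hypothetical monthly totals per income group and product (not real data)
toy = pd.DataFrame(
    [
        ("0-20%", "Product A", "02", 100.0),
        ("0-20%", "Product A", "04", 250.0),
        ("60-100%", "Product A", "02", 80.0),
        ("60-100%", "Product A", "04", 100.0),
        ("60-100%", "Product B", "02", 50.0),
        ("60-100%", "Product B", "04", 20.0),
    ],
    columns=["income_group", "product", "month", "engagement_index"],
)

# One row per (income group, product), one column per month
wide = toy.pivot_table(
    index=["income_group", "product"],
    columns="month",
    values="engagement_index",
    aggfunc="sum",
    fill_value=0,
)

# Raw change and percent change from February ("02") to April ("04")
wide["raw_change"] = wide["04"] - wide["02"]
wide["pct_change"] = (wide["raw_change"] / wide["02"] * 100).round(1)
print(wide)
```

Raw change is the more useful of the two for ranking products here, since a product with very little February usage can show a huge percent change while contributing little to overall engagement.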


In [None]:
# monthly totals per district and product; value columns are selected with a list (double brackets)
february_product_df = full_df.loc[full_df["Month"] == "02"].groupby(['district_id','pct_free_reduced', 'pct_free_reduced_binned', 'Product Name', 'Primary Essential Function'], observed = True)[['pct_access', 'engagement_index']].sum().reset_index()
april_product_df = full_df.loc[full_df["Month"] == "04"].groupby(['district_id','pct_free_reduced', 'pct_free_reduced_binned', 'Product Name', 'Primary Essential Function'], observed = True)[['pct_access', 'engagement_index']].sum().reset_index()
november_product_df = full_df.loc[full_df["Month"] == "11"].groupby(['district_id','pct_free_reduced', 'pct_free_reduced_binned', 'Product Name', 'Primary Essential Function'], observed = True)[['pct_access', 'engagement_index']].sum().reset_index()

# placeholder column, filled in per income group below
for monthly_df in (february_product_df, april_product_df, november_product_df):
    monthly_df['income_group_size'] = np.nan

# normalize to size of income group (number of districts in the group, counted from the February data)
income_groups = february_product_df.pct_free_reduced_binned.unique()
for income_group in income_groups:
    income_group_districts = february_product_df["district_id"].loc[february_product_df["pct_free_reduced_binned"] == income_group].unique()
    income_group_size = len(income_group_districts)
    # assign through a single .loc call; chained indexing (df[col].loc[mask] = ...) may write to a copy
    for monthly_df in (february_product_df, april_product_df, november_product_df):
        monthly_df.loc[monthly_df["pct_free_reduced_binned"] == income_group, "income_group_size"] = income_group_size
february_product_df["pct_access"] = february_product_df["pct_access"] / february_product_df["income_group_size"]
february_product_df["engagement_index"] = february_product_df["engagement_index"] / february_product_df["income_group_size"]
april_product_df["pct_access"] = april_product_df["pct_access"] / april_product_df["income_group_size"]
april_product_df["engagement_index"] = april_product_df["engagement_index"] / april_product_df["income_group_size"]
november_product_df["pct_access"] = november_product_df["pct_access"] / november_product_df["income_group_size"]
november_product_df["engagement_index"] = november_product_df["engagement_index"] / november_product_df["income_group_size"]

In [None]:
allProducts = april_product_df["Product Name"].unique()
engagement_metrics = ["pct_access", "engagement_index"]
# collect rows in a list and build the DataFrame once (DataFrame.append was removed in pandas 2.0)
rows = []
for engagement_metric in engagement_metrics:
    for product in allProducts:
        for income_group in income_groups:
            febsum = february_product_df.loc[(february_product_df['pct_free_reduced_binned'] == income_group) & (february_product_df['Product Name'] == product)][engagement_metric].sum()
            aprsum = april_product_df.loc[(april_product_df['pct_free_reduced_binned'] == income_group) & (april_product_df['Product Name'] == product)][engagement_metric].sum()
            novsum = november_product_df.loc[(november_product_df['pct_free_reduced_binned'] == income_group) & (november_product_df['Product Name'] == product)][engagement_metric].sum()
            # percent change relative to February (inf/NaN when a product had no February usage)
            febaprchange = round((aprsum - febsum)/febsum * 100, 1)
            febnovchange = round((novsum - febsum)/febsum * 100, 1)
            # raw change in the summed metric
            febaprraw = aprsum - febsum
            febnovraw = novsum - febsum
            rows.append({"IncomeGroup": income_group, "EngagementMetric": engagement_metric, "Product": product, 
                         "Feb": round(febsum, 2), "Apr": round(aprsum, 2), "Nov": round(novsum, 2),
                         "FebApr": febaprchange, "FebNov": febnovchange,
                         "FebAprRaw": febaprraw, "FebNovRaw": febnovraw})
product_analysis_df = pd.DataFrame(rows, columns = ["IncomeGroup", "EngagementMetric", "PrimFunction", "Feb", "Apr", "Nov",
                                                    "FebApr", "FebNov", "FebAprRaw", "FebNovRaw", "Product"])


In [None]:
# convert FebAprRaw from page loads per 1000 students to page loads per student
product_analysis_df["FebAprRaw"] = product_analysis_df["FebAprRaw"] / 1000

I started by looking at which products showed the greatest increase in engagement during the early part of the pandemic, where we saw the peak for high-income kids. The following table shows the 20 largest increases in engagement of any product in any income group between February and April. Again, FebAprRaw represents the change in the number of page loads per student per month. 

In [None]:
round(product_analysis_df[['IncomeGroup', 'FebAprRaw', 'Product']].loc[product_analysis_df.EngagementMetric == "engagement_index"].sort_values(by="FebAprRaw", ascending = False).head(20))

The largest increases are distributed across groups, although the highest income group is disproportionately represented, with half of the largest increases. Google wins the day: the highest increases of any product across all groups come overwhelmingly from Google Docs. However, these increases weren't equal -- for the highest-income kids, the increase was more than double that of the lowest-income kids and nearly double that of the middle-income kids. This can be seen in the figure below. 

In [None]:
google_docs_district_df = full_df[full_df["Product Name"] == "Google Docs"].groupby(['Month', 'district_id', 'pct_free_reduced', 'pct_free_reduced_binned'], observed = True)[['pct_access', 'engagement_index']].sum().reset_index()
google_docs_district_df["engagement_index"] = google_docs_district_df["engagement_index"] / 1000
sns.lineplot(data=google_docs_district_df.loc[google_docs_district_df["district_id"] != 9536], x="Month", y="engagement_index", hue="pct_free_reduced_binned", ci = None, alpha = 0.75)
plt.show()

This is confirmed by an ANOVA, which shows a significant (p < 0.0001) effect of income group on change in Google Docs usage from February to April, as well as posthocs that show the high-income kids increased their usage significantly (p < 0.001) more than the other two groups (which were not significantly different from each other).

In [None]:
google_docs_change_df = google_docs_district_df[google_docs_district_df["Month"] == "02"][["district_id", "pct_free_reduced", "pct_free_reduced_binned", "pct_access", "engagement_index"]]
google_docs_change_df = google_docs_change_df.rename(columns = {"pct_access": "Feb_pct_access", "engagement_index": "Feb_engagement_index"})
google_docs_change_df = pd.merge(google_docs_change_df, 
                 google_docs_district_df[google_docs_district_df["Month"] == "04"][["district_id", "pct_access", "engagement_index"]],
                 on = "district_id")
google_docs_change_df = google_docs_change_df.rename(columns = {"pct_access": "Apr_pct_access", "engagement_index": "Apr_engagement_index"})
google_docs_change_df = pd.merge(google_docs_change_df, 
                 google_docs_district_df[google_docs_district_df["Month"] == "11"][["district_id", "pct_access", "engagement_index"]],
                 on = "district_id")
google_docs_change_df = google_docs_change_df.rename(columns = {"pct_access": "Nov_pct_access", "engagement_index": "Nov_engagement_index"})
google_docs_change_df = pd.merge(google_docs_change_df, 
                 google_docs_district_df[google_docs_district_df["Month"] == "09"][["district_id", "pct_access", "engagement_index"]],
                 on = "district_id")
google_docs_change_df = google_docs_change_df.rename(columns = {"pct_access": "Sep_pct_access", "engagement_index": "Sep_engagement_index"})
google_docs_change_df["FebAprRawChangePctAccess"] = google_docs_change_df["Apr_pct_access"] - google_docs_change_df["Feb_pct_access"]
google_docs_change_df["FebAprRawChangeEngagementIndex"] = google_docs_change_df["Apr_engagement_index"] - google_docs_change_df["Feb_engagement_index"]
google_docs_change_df["FebSepRawChangePctAccess"] = google_docs_change_df["Sep_pct_access"] - google_docs_change_df["Feb_pct_access"]
google_docs_change_df["FebSepRawChangeEngagementIndex"] = google_docs_change_df["Sep_engagement_index"] - google_docs_change_df["Feb_engagement_index"]
google_docs_change_df["FebNovRawChangePctAccess"] = google_docs_change_df["Nov_pct_access"] - google_docs_change_df["Feb_pct_access"]
google_docs_change_df["FebNovRawChangeEngagementIndex"] = google_docs_change_df["Nov_engagement_index"] - google_docs_change_df["Feb_engagement_index"]
# Columns to average within each income grouping. They are passed as a list because
# tuple-style column selection after groupby is not supported in recent pandas versions.
change_cols = ["Feb_pct_access", "Apr_pct_access",
               "FebAprRawChangePctAccess", "FebAprRawChangeEngagementIndex",
               "FebSepRawChangePctAccess", "FebSepRawChangeEngagementIndex",
               "FebNovRawChangePctAccess", "FebNovRawChangeEngagementIndex"]
# Exclude the outlier district before summarizing.
google_docs_no_outlier_df = google_docs_change_df.loc[google_docs_change_df["district_id"] != 9536]
google_docs_change_summary_df = google_docs_no_outlier_df.groupby(['pct_free_reduced'], observed = True)[change_cols].mean().reset_index()

google_docs_binned_change_summary_df = google_docs_no_outlier_df.groupby(['pct_free_reduced_binned'], observed = True)[change_cols].mean().reset_index()
google_docs_change_df["FebAprRawChangeEngagementIndex"] = google_docs_change_df["FebAprRawChangeEngagementIndex"] / 1000
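
The cell above builds the change table by merging one month at a time and then subtracting February. As a rough sketch of the same idea in fewer steps, the long table could instead be pivoted so that each month becomes a column. The helper `month_changes`, the toy data, and the `change_02_to_04`-style column names are illustrative assumptions, not part of the analysis above.

```python
import pandas as pd

def month_changes(district_df, value_col="engagement_index",
                  baseline="02", targets=("04", "09", "11")):
    """Return one row per district with the change in value_col
    from the baseline month to each target month."""
    # One row per district, one column per month.
    wide = district_df.pivot_table(index="district_id", columns="Month",
                                   values=value_col, aggfunc="sum")
    # Keep only districts observed in every month of interest,
    # which mirrors the inner merges on district_id.
    wide = wide.dropna(subset=[baseline, *targets])
    changes = pd.DataFrame(index=wide.index)
    for month in targets:
        changes[f"change_{baseline}_to_{month}"] = wide[month] - wide[baseline]
    return changes.reset_index()

# Toy data (hypothetical values): district 3 has no fall months, so it is dropped.
toy = pd.DataFrame({
    "district_id": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    "Month": ["02", "04", "09", "11"] * 2 + ["02", "04"],
    "engagement_index": [10.0, 25.0, 12.0, 14.0, 8.0, 6.0, 9.0, 7.0, 5.0, 9.0],
})
toy_changes = month_changes(toy)
print(toy_changes)
```

Like the inner merges, this keeps only districts with data in all four months, so the two approaches should agree on which districts enter the ANOVA.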

In [None]:
mod = ols('FebAprRawChangeEngagementIndex ~ pct_free_reduced_binned', data=google_docs_change_df.loc[google_docs_change_df["district_id"] != 9536]).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
print(aov_table)

In [None]:
pair_t = mod.t_test_pairwise('pct_free_reduced_binned')
pair_t.result_frame

What was driving this striking difference between the high-income districts and everyone else? One insight might come from the book <i>Leading Technology-Rich Schools: Award-Winning Models for Success</i>, in which the authors mention that a teacher "also told us she loves using Google Docs because the students will use it in college." Understanding how high-income kids use Google Docs, and how this may be used to preserve the status quo, could be key to dismantling inequality.

Next, to try to understand the decrease in technology engagement for the lower-income group, I sought to look at where the biggest losses to educational technology use were in the spring of the pandemic. The following table shows the largest 20 decreases in engagement across all products and income groups.

In [None]:
round(product_analysis_df[['IncomeGroup', 'FebAprRaw', 'Product']].loc[product_analysis_df.EngagementMetric == "engagement_index"].sort_values(by="FebAprRaw").head(20))

As is evident, 11 out of the 15 largest losses were in the lowest-income group. Looking at this table, and thinking about the low relative growth in reading and math performance for low-income kids as reported by NWEA, reading and math software stood out to me, so I decided to look at these in greater detail. For each district and month, I summed use across all products with "read" in the name; the plot below shows the average across districts in each income group.

In [None]:
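# Collect product names containing "read" or "math" (case-insensitive substring match).
# Missing names are dropped first so that .lower() is only called on strings.
product_names = product_df["Product Name"].dropna()
reading_products = [name for name in product_names if "read" in name.lower()]
math_products = [name for name in product_names if "math" in name.lower()]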
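
One caveat with plain substring matching is that "read" also appears inside unrelated words (for example "spreadsheet" or "thread"). A minimal sketch of a stricter alternative, matching only words that start with the keyword, is below. The function `match_products` and the product names are hypothetical and are not taken from the dataset.

```python
import re

def match_products(names, keyword):
    """Return names containing a word that starts with `keyword` (case-insensitive)."""
    # \b anchors the keyword to the start of a word.
    pattern = re.compile(rf"\b{re.escape(keyword)}", re.IGNORECASE)
    # Skip missing (non-string) names instead of failing on them.
    return [name for name in names if isinstance(name, str) and pattern.search(name)]

# Hypothetical product names, for illustration only.
example_names = ["Reading Rockets", "Spreadsheet Pro", "MathWorks Jr", "Aftermath News", None]
reading_like = match_products(example_names, "read")
math_like = match_products(example_names, "math")
print(reading_like)
print(math_like)
```

The trade-off is that a word-start match would miss names where the keyword is embedded in a compound word, so it is worth inspecting both lists by hand before choosing one rule.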

In [None]:
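# Sum reading-product usage per district and month (columns selected with a list, as recent pandas versions require after groupby).
reading_products_district_df = full_df[full_df["Product Name"].isin(reading_products)].groupby(['Month', 'district_id', 'pct_free_reduced_binned'], observed = True)[['pct_access', 'engagement_index']].sum().reset_index()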
reading_products_district_df["engagement_index"] = reading_products_district_df["engagement_index"] / 1000

In [None]:
sns.lineplot(data=reading_products_district_df.loc[reading_products_district_df["district_id"] != 9536], x="Month", y="engagement_index", hue="pct_free_reduced_binned", ci = None, alpha = 0.75)
plt.show()

As is evident from this graph, there was a sharp decline for low-income students in reading software use, which these children used more heavily than the other groups prior to the pandemic. Unlike the overall technology trend for low-income kids, there was not a recovery in the fall to offset this drop -- leaving low-income kids without critical reading support. ANOVA and posthocs confirm this drop was significantly greater (p < 0.01) for low-income kids than for anyone else.

In [None]:
reading_change_df = reading_products_district_df[reading_products_district_df["Month"] == "02"][["district_id", "pct_free_reduced_binned", "pct_access", "engagement_index"]]
reading_change_df = reading_change_df.rename(columns = {"pct_access": "Feb_pct_access", "engagement_index": "Feb_engagement_index"})
reading_change_df = pd.merge(reading_change_df, 
                 reading_products_district_df[reading_products_district_df["Month"] == "04"][["district_id", "pct_access", "engagement_index"]],
                 on = "district_id")
reading_change_df = reading_change_df.rename(columns = {"pct_access": "Apr_pct_access", "engagement_index": "Apr_engagement_index"})
reading_change_df = pd.merge(reading_change_df, 
                 reading_products_district_df[reading_products_district_df["Month"] == "11"][["district_id", "pct_access", "engagement_index"]],
                 on = "district_id")
reading_change_df = reading_change_df.rename(columns = {"pct_access": "Nov_pct_access", "engagement_index": "Nov_engagement_index"})
reading_change_df = pd.merge(reading_change_df, 
                 reading_products_district_df[reading_products_district_df["Month"] == "09"][["district_id", "pct_access", "engagement_index"]],
                 on = "district_id")
reading_change_df = reading_change_df.rename(columns = {"pct_access": "Sep_pct_access", "engagement_index": "Sep_engagement_index"})
reading_change_df["FebAprRawChangePctAccess"] = reading_change_df["Apr_pct_access"] - reading_change_df["Feb_pct_access"]
reading_change_df["FebAprRawChangeEngagementIndex"] = reading_change_df["Apr_engagement_index"] - reading_change_df["Feb_engagement_index"]
reading_change_df["FebSepRawChangePctAccess"] = reading_change_df["Sep_pct_access"] - reading_change_df["Feb_pct_access"]
reading_change_df["FebSepRawChangeEngagementIndex"] = reading_change_df["Sep_engagement_index"] - reading_change_df["Feb_engagement_index"]
reading_change_df["FebNovRawChangePctAccess"] = reading_change_df["Nov_pct_access"] - reading_change_df["Feb_pct_access"]
reading_change_df["FebNovRawChangeEngagementIndex"] = reading_change_df["Nov_engagement_index"] - reading_change_df["Feb_engagement_index"]
# Average each change metric within income group, excluding the outlier district.
# Columns are passed as a list because tuple-style selection after groupby is not supported in recent pandas versions.
reading_change_cols = ["Feb_pct_access", "Apr_pct_access",
                       "FebAprRawChangePctAccess", "FebAprRawChangeEngagementIndex",
                       "FebSepRawChangePctAccess", "FebSepRawChangeEngagementIndex",
                       "FebNovRawChangePctAccess", "FebNovRawChangeEngagementIndex"]
reading_change_summary_df = reading_change_df.loc[reading_change_df["district_id"] != 9536].groupby(['pct_free_reduced_binned'], observed = True)[reading_change_cols].mean().reset_index()

In [None]:
mod = ols('FebNovRawChangeEngagementIndex ~ pct_free_reduced_binned', data=reading_change_df.loc[reading_change_df["district_id"] != 9536]).fit()
aov_table = sm.stats.anova_lm(mod, typ=2)
print(aov_table)
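# Eta squared: share of total variance explained by income group (effect SS / (effect SS + residual SS)).
esq_sm = aov_table['sum_sq'].iloc[0]/(aov_table['sum_sq'].iloc[0]+aov_table['sum_sq'].iloc[1])
# Use a real missing value for the residual row so the column stays numeric.
aov_table['EtaSq'] = [esq_sm, np.nan]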
print(aov_table)

In [None]:
pair_t = mod.t_test_pairwise('pct_free_reduced_binned')
pair_t.result_frame
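
The `EtaSq` column added to the ANOVA table above is an effect size: the between-group sum of squares divided by the total (between-group plus residual) sum of squares. As a small worked sketch of that ratio for a one-way design, here it is computed by hand on made-up numbers. The function `eta_squared` and the three toy groups are illustrative assumptions, not values from the analysis.

```python
def eta_squared(groups):
    """One-way eta squared: SS_between / (SS_between + SS_within)."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(group) * (mean - grand_mean) ** 2
                     for group, mean in zip(groups, group_means))
    # Variation of individual values around their own group mean (the residual).
    ss_within = sum((value - mean) ** 2
                    for group, mean in zip(groups, group_means)
                    for value in group)
    return ss_between / (ss_between + ss_within)

# Toy groups (hypothetical): means 2, 3, and 7 around a grand mean of 4.
toy_groups = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]
toy_eta_sq = eta_squared(toy_groups)
print(toy_eta_sq)  # 42 / (42 + 6) = 0.875
```

With a single factor, this ratio should match what the `sum_sq` rows of the ANOVA table give, which makes it a convenient sanity check.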

Changes in engagement with math software are shown below, with similar trends for children in the low-income group: both a decrease in the spring and a lack of recovery in the fall. This left kids without critical support for these basic skills, which could help explain the finding of poor math performance in this group. 

In [None]:
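# Sum math-product usage per district and month (columns selected with a list, as recent pandas versions require after groupby).
math_products_district_df = full_df[full_df["Product Name"].isin(math_products)].groupby(['Month', 'district_id', 'pct_free_reduced_binned'], observed = True)[['pct_access', 'engagement_index']].sum().reset_index()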
math_products_district_df["engagement_index"] = math_products_district_df["engagement_index"] / 1000
sns.lineplot(data=math_products_district_df.loc[math_products_district_df["district_id"] != 9536], x="Month", y="engagement_index", hue="pct_free_reduced_binned", ci = None, alpha = 0.75)
plt.show()

# (3) Correlations with a state measure of support

Because the pandemic has affected both children and adults from lower-income backgrounds, I sought to determine whether the support that parents received might be correlated with children's technological engagement on a state-by-state basis. This support was indexed by each state's maximum unemployment amount (the `UIMAXAMT` column), as taken from the COVID-19 US State Policy Database. The thinking was that greater state support could give parents the resources, financial or otherwise, to support their children's engagement with technology. I also thought that this could affect people from different income backgrounds differently: income support might be more likely to correlate positively with engagement in lower-income groups, especially given the disproportionate impact of the pandemic on blue-collar workers. To test this, I averaged engagement by income group and state over the course of the entire year, and correlated it with the maximum unemployment amount in each state, separately for each income group.

In [None]:
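# Average engagement per state and income group (columns selected with a list, as recent pandas versions require after groupby).
total2020_state_df = free_reduced_sum_by_district_df.groupby(['state', 'pct_free_reduced_binned'], observed = True, as_index = False)[['pct_access', 'engagement_index']].mean().reset_index()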
state_covid_df = pd.read_csv("../input/covid19-database/COVID-19 US state policy database 3_29_2021.csv")
social_supports_df = state_covid_df[["STATE", "UIMAXAMT"]].iloc[4:, :]
engagement_and_social_support_df = pd.merge(total2020_state_df, social_supports_df, left_on = "state", right_on = "STATE")
engagement_and_social_support_df["UIMAXAMT"] = pd.to_numeric(engagement_and_social_support_df["UIMAXAMT"])

In [None]:
linregress(engagement_and_social_support_df["engagement_index"][engagement_and_social_support_df["pct_free_reduced_binned"] == "0-20%"],
            pd.to_numeric(engagement_and_social_support_df["UIMAXAMT"][engagement_and_social_support_df["pct_free_reduced_binned"] == "0-20%"]))

In [None]:
linregress(engagement_and_social_support_df["engagement_index"][engagement_and_social_support_df["pct_free_reduced_binned"] == "20-60%"],
            pd.to_numeric(engagement_and_social_support_df["UIMAXAMT"][engagement_and_social_support_df["pct_free_reduced_binned"] == "20-60%"]))

In [None]:
linregress(engagement_and_social_support_df["engagement_index"][engagement_and_social_support_df["pct_free_reduced_binned"] == "60-100%"],
            pd.to_numeric(engagement_and_social_support_df["UIMAXAMT"][engagement_and_social_support_df["pct_free_reduced_binned"] == "60-100%"]))

I did not find a significant correlation between unemployment amount and engagement for any of the groups, though the correlation did appear more positive for the low-income group (r = 0.18) than for the high-income group (r = -0.40), which is consistent with what I had expected. Fewer than twenty states are represented in the data, which is a small sample for a correlation. Further principled exploration of different metrics of change in the data, as well as correlations with other measures of social support, could be interesting next steps.
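
To give a feel for why a correlation of this size is hard to distinguish from zero with so few states, here is a minimal sketch of an approximate 95% confidence interval for a Pearson correlation using the Fisher z-transformation. The function name and the sample size of 15 are assumptions for illustration; the actual number of states per income group comes from the merged data above and may be smaller.

```python
import math

def pearson_r_confidence_interval(r, n, z_crit=1.959964):
    """Approximate confidence interval for a Pearson r via Fisher's z-transformation.

    z_crit defaults to the two-sided 95% critical value of the standard normal.
    """
    if n <= 3:
        raise ValueError("Fisher's approximation needs more than 3 observations")
    z = math.atanh(r)                 # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)       # approximate standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# n = 15 is a hypothetical sample size, not the count from the data.
low, high = pearson_r_confidence_interval(r=-0.40, n=15)
print(round(low, 2), round(high, 2))
```

Under that assumed sample size the interval spans zero, which is in line with the non-significant result reported above.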

# Some final thoughts

This exploration of the data shows that the educational experience of the pandemic was very different for children from different income backgrounds. In my view, the most meaningful findings are the increase in the use of Google Docs by the high-income group, which is worthy of further research to understand what's driving it, and the decrease in the use of reading and math software by the low-income group, which points to the need for further support for these students. 

<b>Limitations</b>
* The number of districts in each group was uneven, with far fewer districts in the lowest income group. A larger and more balanced sample size would be ideal.
* There was a great deal of missingness in the data, including the names of products (nearly half were excluded as a result of this) and districts without income designations. 
* As mentioned, there was an interesting outlier in the low-income group that was excluded for this analysis but would be interesting to try to examine in more detail.

<b>Insights</b>
* The wealthy ramped up their use of the remote-friendly technologies they were already primarily using. At the same time, students from lower-income groups adopted these technologies to a lesser extent while leaving behind others, many of which (like reading and math software) were likely optimized for in-person use.
* Much like their adult counterparts, students in the wealthiest group were already working heavily with technologies that could easily be adapted to a remote setting. Foremost among these was Google Docs, which drove a spring peak in learning technology access and engagement in this group, while the lowest-income group saw declines in access and engagement.
* By contrast, students from lower-income groups relied more heavily pre-pandemic on digital learning platforms like literacy and math tools, at least some of which were likely more difficult to adapt to remote instruction.
* This divide between high-income and low-income children parallels the experience of their parents. While white-collar adults experienced minimal disruption to their work and income as their knowledge-economy jobs near-seamlessly transitioned to being remote, blue-collar adults lost employment in the face of both closures and an understandable unwillingness to tolerate the significant risks associated with in-person jobs. The disproportionate losses faced by adults (of employment and income stability) and children (of critical learning resources) in lower-income groups pose the risk of deepening already existing class divisions. The recommendations presented at the beginning of this document to address inequities in educational access furthered by the pandemic can be a step toward reducing those divisions. 
