In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # pretty plotting
import matplotlib.pyplot as plt
import os

from IPython.display import display

# How did COVID-19 impact digital learning?
Traditional in-person education, with teachers and fellow students, has been greatly disrupted by the COVID-19 pandemic. In response to the fast spread of the disease, personal contacts were reduced to an absolute minimum and schools were closed for extended periods of time, leaving students to learn almost exclusively through digital learning. In this analysis I present some insight into the state of digital learning in 2020 in the United States and how digital learning depends on various factors.

## Summary

The impact of COVID-19 on digital learning is extremely variable among products and districts. While the pandemic did not boost the average access rate, it led to an increase in average student engagement. However, digital learning behaviour depends most on the product, then on the district and less than that, although not negligibly, on the pandemic. Therefore, it is important to look into individual products to investigate the influence of the pandemic as well as district characteristics. I looked into a number of the most widely used products. From those products it becomes clear that virtual classroom learning as well as some individual learning platforms gained impact, whereas other learning platforms and, surprisingly, cyber security lost impact. I found the video conference platform Zoom to be the big winner of the pandemic, which might partly be explained by its value for overcoming social distancing in general, in addition to its value for student-teacher contact.
Overall, I do not find an impactful global dependency of digital learning on any of the district characteristics such as state, share of black or hispanic people, poverty or money invested in students. Rather, the information on digital learning behaviour contained in knowing the district is not present in any of the district characteristics alone. For many products, this independence from district characteristics holds in particular for the engagement of the fraction of students that did access digital learning products. In other words, students who used digital learning platforms were engaged similarly, independent of their sociocultural background, and tended to become more engaged when the pandemic hit.

## Loading and cleaning the data
As a first step, I look at the three datasets provided for analysis in this competition to get an overview and do some cleaning of the data.

In [None]:
path = '/kaggle/input/learnplatform-covid19-impact-on-digital-learning/'
products = pd.read_csv(path + 'products_info.csv')
products = products.astype({'LP ID': 'str'})
products = products.rename({'LP ID': 'lp_id'}, axis=1)
print('Overview over the products data\n')
products.describe(include='all')

In [None]:
districts = pd.read_csv(path + 'districts_info.csv')
districts = districts.astype({'district_id': 'str'})
print('Overview over the districts data\n')
display(districts.describe(include='all'))

print("Values present for pct_black/hispanic and pct_free/reduced, respectively:")
print(districts['pct_black/hispanic'].unique())
print(districts['pct_free/reduced'].unique())

Given that the county connections ratio in the districts dataset takes on only two values, one of them only one single time, there is too little information in the dataset to investigate any influence of this parameter on digital learning behaviour. Therefore, I drop this column. Furthermore, percent black/hispanic, percent free/reduced and per pupil expenditure are provided as intervals. Since this initially was numeric data, I approximate those values by the center value of the interval. Note that the values for percent black/hispanic and percent free/reduced all lie between 0 and 1. This suggests that those numbers are actually not percentages but the corresponding ratios.

It is also noteworthy that for all parameters except unique identifiers the data is not equally distributed.
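To make the interval-to-center approximation described above concrete, here is a minimal sketch. The two example strings are assumptions about how the intervals are written in the dataset (one bracket character on each side, bounds separated by a comma and a space), and `interval_center` is a hypothetical helper name used for illustration only.

```python
def interval_center(interval):
    """Return the midpoint of an interval string such as '[0, 0.2['."""
    # Strip the bracket character on each side, then split the two bounds.
    lower, upper = (float(x) for x in interval[1:-1].split(', '))
    return (lower + upper) / 2

# Assumed example strings, not taken from the dataset itself.
ratio_center = interval_center('[0, 0.2[')
expenditure_center = interval_center('[4000, 6000[')
print(ratio_center, expenditure_center)
```

The approximation discards the spread inside each interval, so two districts in the same bracket become indistinguishable for that feature.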

In [None]:
def interval_to_center_value(interval):
    numbers = interval[1:-1].split(', ')
    center = [float(x) for x in numbers]
    return np.mean(center)
districts = districts.drop('county_connections_ratio', axis=1)
districts[['pct_black/hispanic', 'pct_free/reduced', 'pp_total_raw']] = \
    districts[['pct_black/hispanic', 'pct_free/reduced', 'pp_total_raw']].applymap(
        interval_to_center_value, na_action='ignore')
print('Overview over the districts data\n')
districts.describe(include='all')

In [None]:
# Read one CSV per district; the file name (without extension) is the district id.
dfs = []
for dirname, _, filenames in os.walk(path + 'engagement_data'):
    for filename in filenames:
        district_id = filename.split(sep='.')[0]
        df = pd.read_csv(os.path.join(dirname, filename))
        df['district_id'] = district_id
        dfs.append(df)
engagement = pd.concat(dfs)

a = engagement['lp_id'].unique()
engagement = engagement.astype({'lp_id': 'Int64'})
b = engagement['lp_id'].unique()
print(np.any((a-b) != 0))
engagement = engagement.astype({'lp_id': 'str'})
engagement['time'] = pd.to_datetime(engagement['time'])
print('\nOverview over the engagement data\n')
engagement.describe(include='all', datetime_is_numeric=True)

In [None]:
print('\nNumber of entries with 0 percent access:')
print(engagement[['engagement_index', 'pct_access']][engagement.pct_access == 0].count())

Inspecting the engagement data reveals that there are no entries with an engagement index equal to 0. However, there is a large number of entries (more than there are missing entries for the engagement index) with percent access equal to 0. Since no engagement can happen without anyone accessing the sites, this strongly suggests that for some districts or products 0 has been entered in cases of missing percent access data. Therefore, I drop all rows with 0 access when an engagement index is present, as well as all rows where neither percent access nor engagement index is available.
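The two dropping rules can be checked on a toy table. This is only a sketch with made-up values; the column names match the engagement data, while `toy`, `both_missing` and `zero_access_with_engagement` are illustrative names.

```python
import numpy as np
import pandas as pd

# Four made-up rows, one for each case discussed above.
toy = pd.DataFrame({
    'pct_access':       [0.5,  0.0, 0.0,    np.nan],
    'engagement_index': [12.0, 3.0, np.nan, np.nan],
})

# Rule 1: drop rows where both measures are missing.
both_missing = toy['pct_access'].isnull() & toy['engagement_index'].isnull()
# Rule 2: drop rows that report 0 access although an engagement index exists.
zero_access_with_engagement = (toy['pct_access'] == 0) & toy['engagement_index'].notnull()

kept = toy[~both_missing & ~zero_access_with_engagement]
print(kept)
```

Note that a row with 0 access and a missing engagement index survives both rules, since it does not contradict itself.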

In [None]:
engagement = engagement[(~engagement.pct_access.isnull()) | 
                        (~engagement.engagement_index.isnull())]
engagement = engagement[~(engagement.pct_access == 0) |
                        (engagement.engagement_index.isnull())]
print('Overview over the engagement data\n')
engagement.describe(include='all', datetime_is_numeric=True)

In [None]:
temp = engagement.groupby(['lp_id', 'district_id']).count()
print("Number of entries for each product/district combination:")
temp.describe()

Also, in more than 50% of the cases there is only a small number of entries for a given product in a district, which adds noise to the data but provides very little information on the overall situation. I'll use only product/district combinations that have on average at least 1 entry per month, i.e. at least 12 entries over the year.
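As an illustration of this minimum-entries rule, the sketch below keeps only product/district pairs with more than 11 entries. It uses made-up data and counts a single column, whereas the analysis itself applies the threshold to both measures.

```python
import pandas as pd

# Made-up data: product A has 12 entries in district d1, product B only 3.
toy = pd.DataFrame({
    'lp_id': ['A'] * 12 + ['B'] * 3,
    'district_id': ['d1'] * 15,
    'engagement_index': list(range(15)),
})

# Number of non-missing entries per product/district pair, aligned to each row.
entries = toy.groupby(['lp_id', 'district_id'])['engagement_index'].transform('count')

# Keep pairs with at least 12 entries (on average one per month).
kept = toy[entries > 11]
print(kept['lp_id'].unique())
```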

In [None]:
temp = temp[temp.engagement_index > 11]
temp = temp[temp.pct_access > 11]
print("Number of entries for each product/district combination:")
display(temp.describe())
engagement['week'] = engagement['time'].dt.isocalendar().week
engagement['weekday'] = engagement['time'].dt.isocalendar().day
temp1 = engagement.set_index(['lp_id', 'district_id'])
temp1 = temp1.loc[temp.index]
engagement = temp1.reset_index()

In [None]:
temp3 = temp1.groupby(['lp_id', 'district_id']).agg(
    {'time': [np.min, np.max], 'week': [np.min, np.max]})
temp3 = temp3.sort_values(('time', 'amin')).reset_index()
t = temp3.drop(['lp_id', 'district_id'], axis=1).stack().reset_index()
t.columns = ['Product', 'Extreme Occurrence', 'Date', 'Week']
t['Extreme Occurrence'] = t['Extreme Occurrence'].apply(
    lambda x: 'First' if x == 'amin' else 'Last')

In [None]:
sns.relplot(data=t, y='Product', x='Date', hue='Extreme Occurrence', aspect=2/1);
plt.yticks([]);

Looking at the first and last occurrence date of the entries for a given product, it becomes obvious that many products have not been tracked throughout the year, but only for a limited amount of time. In order to investigate how COVID-19 changed digital learning behaviour overall, it is useful to keep the product pool (feeding into the average) constant over time. Therefore, I'll use only data that has entries starting from the first or second week of the year and ending in the last three weeks of the year (week 51 or later).
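The continuity rule can be sketched on a toy table as follows. The data is made up, and the merge-based selection is just one possible way to express the filter; it is not necessarily how the cell that follows does it.

```python
import pandas as pd

# Made-up data: product A is seen in weeks 1 to 52, product B only from week 10 on.
toy = pd.DataFrame({
    'lp_id': ['A', 'A', 'A', 'B', 'B'],
    'district_id': ['d1'] * 5,
    'week': [1, 26, 52, 10, 52],
})

# First and last week observed for each product/district pair.
span = toy.groupby(['lp_id', 'district_id'])['week'].agg(['min', 'max'])

# Pairs that start in week 1 or 2 and end in week 51 or later.
continuous = span[(span['min'] <= 2) & (span['max'] >= 51)].reset_index()

# Keep only the rows that belong to those pairs.
kept = toy.merge(continuous[['lp_id', 'district_id']], on=['lp_id', 'district_id'])
print(kept)
```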

In [None]:
cont_used = temp3[(temp3[('week', 'amin')] <= 2)]
cont_used = cont_used[(cont_used[('week', 'amax')] >= 51)]
cont_used = cont_used.set_index(['lp_id', 'district_id'])
temp = engagement.set_index(['lp_id', 'district_id'])
engagement = temp.loc[cont_used.index].reset_index()
print('Overview over the engagement data\n')
engagement.describe(include='all', datetime_is_numeric=True)

In [None]:
print('Number of products in the engagement dataset:')
n_e = engagement['lp_id'].unique()
print(len(n_e))
print('Number of products in the products dataset:')
n_p = products['lp_id'].unique()
print(len(n_p))
print('Number of products in the products dataset that are present in the engagement dataset:')
print(len([x for x in n_p if x in n_e]))

It is also noteworthy that there are a lot more products in the engagement dataset than there are in the products dataset, limiting the possibilities to further investigate those products.

## State of digital learning in 2020
### How did student engagement in digital learning evolve over time and particularly in response to the pandemic?

In order to understand how COVID-19 impacted digital learning, we need to look into how digital learning evolved throughout the year and align this with COVID-19 related events, like the arrival of the disease in the US or school closures. Furthermore, breaks are a known cause of variation in learning behaviour, so it is certainly useful to align those, too.
Since all of those events vary slightly from district to district, I used some approximations to assign weeks to these circumstances. The arrival of the disease is approximated as the end of February, when it became clear that COVID-19 was heading towards pandemic status (based on https://www.ajmc.com/view/a-timeline-of-covid19-developments-in-2020). School closures are approximated by when they were ordered by state (based on https://www.ajmc.com/view/a-timeline-of-covid19-developments-in-2020), knowing that districts were free to close longer or earlier. For the breaks I only consider Christmas break, spring break, summer break and Thanksgiving (based on https://www.edarabia.com/school-holidays-united-states/), liberally setting break times so as to cover most districts at the expense of including some school time as well.

Furthermore, in case there is a change in student engagement, it would be interesting to see whether it results from more students being engaged or from more engagement of the students that are already engaged. Therefore, I introduce a new feature, the engaged student engagement index, calculated as engagement_index/(pct_access/100).
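A small worked example may help to read the new feature. The numbers are made up and the function name is a hypothetical helper; only the formula itself comes from the definition above.

```python
def engaged_student_engagement_index(engagement_index, pct_access):
    """engagement_index / (pct_access / 100), as defined above."""
    return engagement_index * 100 / pct_access

# Same overall engagement index, reached with very different access rates.
broad = engaged_student_engagement_index(engagement_index=200.0, pct_access=50.0)
narrow = engaged_student_engagement_index(engagement_index=200.0, pct_access=10.0)
print(broad, narrow)  # 400.0 2000.0
```

With identical engagement indices, the product that is accessed by fewer students gets the higher value, because the same activity is attributed to a smaller group of students.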

In [None]:
engagement['eng_stud_engagement_index'] = engagement['engagement_index']*100/engagement['pct_access']

school_closure = pd.read_csv("../input/ballotpedia/Ballotpedia_SchoolClosuresByState.csv", header = 3)
school_closure['Start_closure'] = pd.to_datetime(school_closure['Start_closure'])
school_closure['Start_week'] = school_closure['Start_closure'].dt.isocalendar().week
school_closure['End_closure'] = pd.to_datetime(school_closure['End_closure'])
school_closure['End_week'] = school_closure['End_closure'].dt.isocalendar().week
# school_closure.describe()

In [None]:
# put labels to different time periods
# week 2-8 -> pre-corona
# week 15,16 -> spring break
# week 20 - 33 -> summer break (range(20, 34) below excludes week 34)
# week 48 -> thanksgiving
# week 52,53,1 -> christmas break

# https://www.edarabia.com/school-holidays-united-states/

break_dict = {i:'pre_corona' for i in range(2,9)}
break_dict.update({i: 'spring_break' for i in range(15,17)})
break_dict.update({i:'summer_break' for i in range(20,34)})
break_dict.update({48:'thanksgiving'})
break_dict.update({i:'christmas' for i in [0, 1, 52, 53]})
not_assigned = [j for j in range(54) if j not in break_dict.keys()]
break_dict.update({i:'school_with_corona' for i in not_assigned})

engagement['circumstances'] = engagement['week'].apply(lambda x: break_dict[x])

In [None]:
def plot_weekly_engagement(engagement, school_closure):
    # Average over products per week and district, so each line is the mean across districts.
    weekly_engagement = engagement.groupby(['week', 'district_id']).mean()
    sns.set_theme()
    colors = {'pre_corona': 'orange', 'spring_break': 'r', 'summer_break': 'purple',
              'thanksgiving': 'brown', 'christmas': 'b', 'school_with_corona': 'g'}
    tick_positions = [0, 1, 5, 10, 15, 20, 25, 30, 40, 48]
    tick_labels = ['0', 'christmas', 'pre corona', '10', 'spring break', '20',
                   'summer break', '30', '40', 'thanksgiving']

    # One figure per measure, all with the same background annotations.
    for measure in ['pct_access', 'engagement_index', 'eng_stud_engagement_index']:
        sns.relplot(data=weekly_engagement, kind='line', x='week', y=measure, aspect=2/1, ci='sd')
        # Dark band: median start week to median end week of the school closures.
        plt.axvspan(school_closure['Start_week'].median(), school_closure['End_week'].median(),
                    facecolor='black', alpha=0.4)
        # Colored bands: one per week, colored by its label in break_dict.
        for week, label in break_dict.items():
            plt.axvspan(week-0.5, week+0.5, facecolor=colors[label], alpha=0.3)
        plt.xticks(tick_positions, tick_labels, rotation=20)
    plt.show()
plot_weekly_engagement(engagement, school_closure)

These plots show the weekly averages for percent access, engagement index and engaged student engagement index, respectively. The line represents the average value across districts, the shaded band represents their standard deviation. The time of school closure is marked by a darker background.

The plots show a large variability in digital learning behaviour between school districts. On average, percent access basically remained at the same level when the pandemic hit, dropped during summer break, recovered after the break and then slowly decreased in fall. The engagement index, however, ramped up early in the year and remained at a higher level throughout the year, except for summer break (which was expected) and a slow decrease in fall. The engaged student engagement index jumped at the end of February and remained at a higher level throughout the year, even through summer break, although it seems to be slightly lower again during fall. For all three measures, school closures did not appear to lead to more use of digital learning platforms. If anything, percent access seems to be slightly lower during that time.

This suggests that when the pandemic hit, students who were already engaged in digital learning got more engaged, whereas students who did not use digital platforms prior to the pandemic did not start doing so after the pandemic hit, not even when schools were closed.

I personally had hoped to find the opposite: that students who used to learn in school independent of digital platforms would switch to digital learning when access to schools got restricted, indicating that they had continued their education. The above findings could mean that students not used to digital learning missed out on their education when they could not be in school. What is not covered by this data, however, are other means of compensating for the lack of school, for example parents or private tutors teaching the students, or learning material on paper given out by schools.

In [None]:
sns.catplot(data=engagement, x='weekday', y='engagement_index', hue='circumstances',
           kind='bar', aspect=2/1, ci='sd');

In [None]:
sns.catplot(data=engagement, x='weekday', y='engagement_index', hue='circumstances',
           kind='bar', aspect=2/1, ci=None);
sns.catplot(data=engagement, x='weekday', y='pct_access', hue='circumstances',
           kind='bar', aspect=2/1, ci=None);
sns.catplot(data=engagement, x='weekday', y='eng_stud_engagement_index', hue='circumstances',
           kind='bar', aspect=2/1, ci=None);

Looking at the data broken down by weekday for the various circumstances (pre-corona, corona, break times), there is a huge standard deviation among districts and/or products (note that here, in contrast to the previous plots, the data has not been averaged over products for each district).

On average, the engagement index clearly increased on all weekdays when the pandemic hit as compared to the time before corona. Interestingly, during spring break the index was comparable to the level during school time with corona and even higher than that on the weekend, whereas during summer break it was closer to pre-corona levels.
Percent access rates, however, were quite similar during the pandemic as compared to the time before and only moderately higher during the weekend. Here too, spring break had much less of an impact than summer break.

Interestingly, the increase of the engaged student engagement index due to the pandemic is much less pronounced than that of the engagement index overall, and it shows in particular that engaged students stayed engaged during summer break as well. Looking at the ratio between average engagement and average percent access, one would intuitively expect the increase to be more pronounced here. The apparent contradiction is easily explained mathematically: the ratio of averages is not the same as the average of ratios, especially in situations with high variability.
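The difference between the two quantities is easy to see with two made-up rows. The values below are invented for illustration and are not taken from the dataset.

```python
# Two made-up district/product rows: (engagement_index, pct_access).
rows = [(100.0, 1.0), (100.0, 50.0)]

mean_engagement = sum(e for e, _ in rows) / len(rows)  # 100.0
mean_access = sum(a for _, a in rows) / len(rows)      # 25.5

# Ratio of the averages: what one would read off the two averaged curves.
ratio_of_averages = mean_engagement * 100 / mean_access

# Average of the per-row ratios: what the engaged student engagement index averages.
average_of_ratios = sum(e * 100 / a for e, a in rows) / len(rows)

print(ratio_of_averages, average_of_ratios)
```

The row with very low access dominates the average of ratios (5100.0) but barely moves the ratio of averages (about 392), so the two summaries can behave quite differently when access rates vary a lot.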

### Which products have been used and how was this influenced by the pandemic?

There are more than 1000 products in the engagement dataset. However, average percent access rates have been found to be below 1 percent. Therefore, it is interesting to see whether this low use is common to all products or if there are products that are much more widely used and others that are basically not used at all.

In [None]:
temp = engagement[engagement['circumstances'].isin(
    ['pre_corona', 'school_with_corona'])]
product_engagement = temp[['lp_id', 
                                 'district_id', 
                                 'circumstances', 
                                 'pct_access', 
                                 'engagement_index', 
                                 'eng_stud_engagement_index'
                                ]].groupby(['lp_id', 
                                            'district_id', 
                                            'circumstances']).mean()

In [None]:
sns.relplot(data=product_engagement, x='lp_id', y='pct_access', hue='circumstances', aspect=2/1);
sns.relplot(data=product_engagement, x='lp_id', y='engagement_index', hue='circumstances', aspect=2/1);
sns.relplot(data=product_engagement, x='lp_id', y='eng_stud_engagement_index', hue='circumstances', aspect=2/1);

Looking at the individual products shows that there are many products that aren't used very much but also that there are products that are used by many students, some of which also have high engagement indices. Comparing the values before the pandemic hit with the values later in the year shows that there are products where the pandemic led to lower percent access rates, showing that not only did digital learning fail to reach more students, but students in fact lost interest in many of the products. Looking at the engagement index, 5 products stand out with very high indices, especially during school time in the pandemic situation. Interestingly, this finding is not replicated by the engaged student engagement index. Rather, this index is more uniformly distributed among the products and wasn't impacted by the pandemic the same way the engagement index was.

In [None]:
# Average each product over districts and circumstances.
temp1 = product_engagement.groupby('lp_id').mean()

# Products whose mean engagement index exceeds mean + 3 standard deviations.
print(temp1.engagement_index.mean())
print(temp1.engagement_index.std())
high_eng_products = temp1[temp1.engagement_index >
                          temp1.engagement_index.mean() + 3 * temp1.engagement_index.std()]
high_eng_products = high_eng_products.index.values

# Products whose mean percent access exceeds mean + 3 standard deviations.
print(temp1.pct_access.mean())
print(temp1.pct_access.std())
high_acc_products = temp1[temp1.pct_access >
                          temp1.pct_access.mean() + 3 * temp1.pct_access.std()]
high_acc_products = high_acc_products.index.values

print(high_eng_products)
# High-engagement products that are not also high-access products.
print([val for val in high_eng_products if val not in high_acc_products])

In [None]:
product_engagement = product_engagement.reset_index()
product_engagement = pd.merge(product_engagement,
                              products,
                              how='left',
                              left_on='lp_id',
                              right_on='lp_id',
                              sort=False
                             )
product_engagement = product_engagement.drop(['URL'], axis=1)

product_engagement = pd.merge(product_engagement,
                              districts,
                              how='left',
                              left_on='district_id',
                              right_on='district_id',
                              sort=False
                             )

In [None]:
temp = product_engagement[product_engagement.lp_id.isin(high_eng_products)]
print('Products with high engagement index')
sns.catplot(data=temp, y='Product Name', x='engagement_index', hue='circumstances',
           kind='box', aspect=2/1);

Looking at products with an engagement index larger than average plus three standard deviations, it is striking that Google Docs and Google Classroom score highest and also clearly gained engagement when the pandemic hit. With Canvas and Schoology there are another two learning platforms among those high engagement products, indicating that virtual classroom learning has indeed been boosted in response to the pandemic. Kahoot!, a learning game platform, however, shows a loss in engagement index in response to the pandemic. This points towards individual learning (independent of school) losing influence during the pandemic.

In [None]:
temp = product_engagement[product_engagement.lp_id.isin(high_acc_products)]
print('Products with high percent access')
sns.catplot(data=temp, y='Product Name', x='pct_access', hue='circumstances',
           kind='box', aspect=2/1);

Looking at products with percent access larger than average plus three standard deviations, we find that Google Classroom and Google Docs (both having very high engagement indices) are widely used by students. However, their access rate remained very similar when the pandemic hit. Zoom and ClassLink, on the other hand, were used by a lot more students after the pandemic hit, pointing again to the importance of virtual classroom learning during the pandemic. With ST Math, we see another platform for game-based individual learning losing user interest in response to the pandemic.

In [None]:
products[products.lp_id.isin(high_acc_products)]

In [None]:
products[products.lp_id.isin(high_eng_products)]

## How does engagement of digital learning relate to various factors?
### Which factors are predictive of engagement in digital learning?

One way to look into how predictive a factor is for engagement in digital learning is to use mutual information analysis. Mutual information quantifies the amount of information that is obtained about one variable by observing the other variable, independent of the kind of relationship between the variables. It is zero if and only if the two random variables are independent, and higher values mean higher dependency.
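The two properties mentioned above can be illustrated with a plug-in estimate for discrete variables. This is a minimal sketch on made-up data, with `discrete_mutual_information` as a hypothetical helper; it is not the nearest-neighbor based estimator that `mutual_info_regression` relies on for continuous targets.

```python
from collections import Counter
from math import log

def discrete_mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # Each observed pair contributes p(x,y) * log(p(x,y) / (p(x) * p(y))).
        mi += p_xy * log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

x = [0, 0, 1, 1]
# y takes both values equally often for each x: knowing x says nothing about y.
independent = discrete_mutual_information(x, [0, 1, 0, 1])
# y is fully determined by x: knowing x removes all uncertainty about y.
dependent = discrete_mutual_information(x, [5, 5, 7, 7])
print(independent, dependent)
```

The second example also shows that the actual values of the variables do not matter, only how they co-occur, which is why the measure works for categorical features such as the product name.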

In [None]:
from sklearn.feature_selection import mutual_info_regression

cat_cols = ['Product Name', 'Provider/Company Name', 
            'Sector(s)', 'Primary Essential Function',
            'district_id', 'circumstances', 'state', 'locale']

data = product_engagement.copy()
data = data.drop('lp_id', axis=1)
data = data.dropna()

def get_mi_scores(data, target, cat_cols):
    data = data[data[target].notna()]
    target_ = data[target]
    data = data.drop(['pct_access', 
                      'engagement_index', 
                      'eng_stud_engagement_index'], axis=1)
    
    for col in cat_cols:
        data[col], _ = data[col].factorize()
    
    discrete_features = [pd.api.types.is_integer_dtype(t) for t in data.dtypes]

    mi_scores = mutual_info_regression(data, target_, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=data.columns).sort_values(ascending=False)
    return mi_scores

def get_mi_dataframe(data, features):
    access = get_mi_scores(data, 'pct_access', features)
    eng = get_mi_scores(data, 'engagement_index', features)
    e_eng = get_mi_scores(data, 'eng_stud_engagement_index', features)
    temp = pd.concat([access, eng, e_eng], axis=1, 
                     keys=['pct_access', 'engagement_index', 'eng_stud_engagement_index'],
                     names=['Measure'])
    temp = temp.stack().reset_index()
    temp.columns = ['Feature', 'Measure', 'MI Score']
    return temp

df = get_mi_dataframe(data, cat_cols)
sns.catplot(data=df, y='Feature', x='MI Score', 
            hue='Measure', kind='bar', orient='h',
            aspect=2/1,
            legend_out=False);

Mutual information analysis reveals that for all three measures (percent access, engagement index and engaged student engagement index) it is most important which product is being looked at. The district is less important compared to the product, yet it is still more important than the circumstances, indicating that it matters more in which school district a student lives than whether or not the corona pandemic had hit. Looking at the various characteristics of the districts, for percent access the state is most important, followed by per pupil expenditure, percent black/hispanic and percent free/reduced. However, all of these characteristics are very close and all have a very low MI score, suggesting that those factors alone, although not irrelevant, do not have a high impact on digital learning.

In [None]:
data = product_engagement[product_engagement.lp_id.isin(high_acc_products) |
                          product_engagement.lp_id.isin(high_eng_products)].copy()
data = data.drop('lp_id', axis=1)
data = data.dropna()
df1 = get_mi_dataframe(data, cat_cols)
sns.catplot(data=df1, y='Feature', x='MI Score', 
            hue='Measure', kind='bar', orient='h',
            aspect=2/1,
            legend_out=False);

Looking only at products with high access rates or high engagement indices, the most predictive feature for digital engagement is again the product itself. The primary essential function contains a comparable amount of information. However, given that only a few products are included in the analysis and that there are many descriptions for primary essential function, the latter might just be unique to each of those products the same way the Product Name is. For these products, the state and per pupil expenditure become much more important for percent access as compared to all products, while the engaged student engagement index surprisingly seems independent of per pupil expenditure as well as district or locale.

The fact that also with this product selection the products themselves are most predictive for digital engagement suggests that the influence of other features on product use might vary depending on the product. Thus, I investigate the MI scores for these products individually.

In [None]:
features = ['district_id', 'circumstances', 'state', 'locale', 
            'pct_black/hispanic', 'pct_free/reduced',
            'pp_total_raw']
prods = data['Product Name'].unique()

mi_dfs = {}
for prod in prods:
    print(prod)
    temp = data[data['Product Name'] == prod]
    temp = temp[features + ['pct_access', 'engagement_index', 'eng_stud_engagement_index']]
    mi_dfs[prod] = get_mi_dataframe(temp, features)

temp = pd.concat(mi_dfs.values(), keys=mi_dfs.keys(), names=['Product', 'Index'])
product_specific_mis = temp.reset_index().drop('Index', axis=1)

In [None]:
def plot_spec_mis(data, measure):
    data = data[data.Measure == measure]
    sns.catplot(data=data, 
            x='Product', y='MI Score', 
            hue='Feature', kind='bar', orient='v',
            aspect=2/1,
            legend_out=False);
    plt.xticks(rotation=45);
    plt.title(measure)

plot_spec_mis(product_specific_mis, 'pct_access')
plot_spec_mis(product_specific_mis, 'engagement_index')
plot_spec_mis(product_specific_mis, 'eng_stud_engagement_index')

Indeed, the picture of mutual information scores is quite different from product to product. It is interesting to see that for the Securly Anywhere Filter the most important feature (besides district_id, which actually contains all the other information except circumstances) is percent black/hispanic. For ST Math and Schoology, the per pupil expenditure is the most important feature for percent access and engagement index.

COVID-19 was highly important for percent access and engagement index only on Zoom, but for the engaged student engagement index it was among the most important features for almost all investigated products, except for Securly Anywhere Filter and Clever. This finding supports the notion that the pandemic mostly influenced the engagement of students that were already engaged in digital learning prior to the pandemic.

### Which relationship do relevant factors have with use of select products?

Mutual information analysis revealed that the influence of the different characteristics of districts is highly dependent on the individual products. To further investigate the way these parameters influence digital learning, I look into how they relate to the use of select products.

In [None]:
names = ['Securly Anywhere Filter', 'Zoom', 'Google Classroom', 'Canvas', 'Google Docs',
         'Kahoot!', 'ST Math']
measures = ['pct_access', 'engagement_index', 'eng_stud_engagement_index']
factors = ['pct_black/hispanic', 'pct_free/reduced', 'pp_total_raw']

def plot_relationship(product_engagement, name, measure, factor):
    data = product_engagement[product_engagement['Product Name'] == name]
    sns.catplot(data=data, x=factor, y=measure,
                hue='circumstances', aspect=2/1, kind='box');
    plt.title(name);
    
for measure in measures:
    plot_relationship(product_engagement, names[0], measure, factors[0])

For the Securly Anywhere Filter, mutual information analysis indicated that percent black/hispanic influences engagement with the product. Indeed, districts with a percent black/hispanic value of 0.3 or more tend to have lower access rates and lower engagement indices. This tendency isn't present for the engaged student engagement index, however. It also seems that when the pandemic hit, the filter had lower access rates and lower engagement indices than before, which, again, isn't reflected in the engaged student engagement index.
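
The tendencies read off the box plots could be backed by numbers, for example with group medians. The following is a minimal sketch under assumptions: the helper `median_by_factor`, the toy values, and the `circumstances` labels `'before'` and `'pandemic'` are hypothetical, and the factor column is assumed to hold one value per bin.

```python
import pandas as pd

# Hypothetical toy rows; only the column names follow the notebook
toy_engagement = pd.DataFrame({
    'Product Name': ['Securly Anywhere Filter'] * 8,
    'pct_black/hispanic': [0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.3, 0.3],
    'circumstances': ['before'] * 4 + ['pandemic'] * 4,
    'pct_access': [2.0, 4.0, 1.0, 2.0, 1.0, 2.0, 0.5, 1.0],
})

def median_by_factor(product_engagement, name, measure, factor):
    """Median of `measure` per factor bin (rows) and circumstance (columns)."""
    data = product_engagement[product_engagement['Product Name'] == name]
    medians = data.groupby([factor, 'circumstances'])[measure].median()
    return medians.unstack('circumstances')

summary = median_by_factor(toy_engagement, 'Securly Anywhere Filter',
                           'pct_access', 'pct_black/hispanic')
print(summary)
```

Each row of the resulting table corresponds to one group of boxes in the plot, so a drop from the first to the second row or from one column to the other mirrors what the eye picks up from the figure.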

In [None]:
for name in names[-2:]:
    for measure in measures:
        plot_relationship(product_engagement, name, measure, factors[1])

for measure in measures:
    plot_relationship(product_engagement, names[-1], measure, factors[2])

Looking into the learning games Kahoot! and ST Math, there seems to be a slight trend towards a lower engagement index with higher percent free/reduced for Kahoot!. What is more striking is that all measures were lower for Kahoot! during the pandemic. For ST Math, on the contrary, percent access and engagement index both increased with increasing percent free/reduced, while the engaged student engagement index was largely independent of percent free/reduced and per pupil expenditure. The relationship between percent access or engagement index and per pupil expenditure turns out to be non-linear. Further analysis would be required to investigate whether confounding factors could explain these results. For ST Math, both the engagement index and, in particular, the engaged student engagement index increased in response to the pandemic.
It would be interesting to understand why ST Math is used more in districts with higher percent free/reduced.
Overall, however, these findings invalidate the theory from above that game-based individual learning was generally reduced due to the pandemic. Instead, they indicate that digital learning behaviour and its susceptibility to the pandemic largely depend on the individual product.
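
One simple way to probe a non-linear relationship such as the one with per pupil expenditure is to compare the Pearson correlation with a rank (Spearman) correlation. This is only a sketch under assumptions: it treats `pp_total_raw` as numeric, the toy values are invented, and the helper `pearson_and_spearman` is hypothetical.

```python
import pandas as pd

def pearson_and_spearman(data, factor, measure):
    """Return (Pearson, Spearman) correlation between two numeric columns."""
    pearson = data[factor].corr(data[measure])
    # Spearman is the Pearson correlation of the ranks (ties get average ranks)
    spearman = data[factor].rank().corr(data[measure].rank())
    return pearson, spearman

# Hypothetical toy values: monotonic but clearly non-linear
toy = pd.DataFrame({
    'pp_total_raw': [5000, 7000, 9000, 11000, 13000],
    'pct_access': [1.0, 1.1, 1.3, 3.0, 9.0],
})

pearson, spearman = pearson_and_spearman(toy, 'pp_total_raw', 'pct_access')
print(pearson, spearman)
```

A rank correlation close to 1 together with a clearly lower Pearson correlation points to a monotonic but non-linear relationship. If both are low, the relationship may be non-monotonic, and per-bin medians would be the more informative summary.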

In [None]:
for measure in measures:
    plot_relationship(product_engagement, names[1], measure, 'state')
    plt.xticks(rotation=70)

It is striking how Zoom got boosted from practically no use (both percent access and engagement index) to very high values, reaching up to almost 30 percent of the students in Illinois accessing the site per day. It is also interesting to see that while Zoom got very popular in some states, it remained largely unused in others.
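
The difference between states could be tabulated by comparing the median per state before and during the pandemic. The sketch below makes several assumptions: the helper `change_by_state`, the toy numbers, and the `circumstances` labels `'before'` and `'pandemic'` are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; only the column names follow the notebook
toy_zoom = pd.DataFrame({
    'Product Name': ['Zoom'] * 6,
    'state': ['Illinois', 'Illinois', 'Utah', 'Utah', 'Illinois', 'Utah'],
    'circumstances': ['before', 'pandemic', 'before', 'pandemic', 'pandemic', 'before'],
    'pct_access': [0.5, 20.0, 0.2, 0.4, 30.0, 0.4],
})

def change_by_state(product_engagement, name, measure, before, during):
    """Median of `measure` per state and circumstance, plus the change between them."""
    data = product_engagement[product_engagement['Product Name'] == name]
    table = data.pivot_table(index='state', columns='circumstances',
                             values=measure, aggfunc='median')
    table['change'] = table[during] - table[before]
    # States with the largest increase come first
    return table.sort_values('change', ascending=False)

result = change_by_state(toy_zoom, 'Zoom', 'pct_access', 'before', 'pandemic')
print(result)
```

Sorting by the change column separates states where the product took off from those where it remained largely unused.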

In [None]:
for name in names[2:5]:
    for measure in measures:
        plot_relationship(product_engagement, name, measure, factors[2])

Looking at the virtual classroom related products Google Classroom, Canvas and Google Docs, we see an increase in percent access and engagement index with increasing per pupil expenditure for both Google products, but rather a decrease for Canvas. This might hint that some of the money was spent on improving access to and use of those Google products. For all three products, the pandemic clearly led to an increase in engagement and particularly in engaged student engagement.

## Conclusion

The impact of COVID-19 on digital learning is extremely variable among products and districts. Digital learning behaviour depends most on the product, then on the district and less than that, although certainly not negligibly, on the pandemic. Access rates were not boosted consistently when the pandemic hit; rather, there was a shift from some products towards others, which averaged out overall. Overall engagement increased in response to the pandemic, and for almost all investigated products the engagement of engaged students clearly increased when the pandemic hit, independent of district properties. Interestingly, while state-ordered school closures seemed to have little influence (on top of the pandemic situation), in terms of digital learning spring break did not happen in 2020, except for a slightly longer weekend. Possibly this results from schools being closed anyhow during this time, so the break ceased being special for the students.

There are only a few products that are widely used and only 5 that stand out with very high engagement indices. Virtual classroom learning clearly became more important during the pandemic, reflecting the situation that students had limited (physical) access to schools. Surprisingly, Securly Anywhere Filter lost impact when the pandemic hit, suggesting that students got somewhat more relaxed about cyber security. This also seems true for districts with more black or hispanic people. While students lost interest in the learning game platform Kahoot!, they gained interest in the individual learning platform ST Math, which is also most popular among districts with a high percentage of free or reduced price lunches (low income). The biggest winner of the pandemic is clearly the video conference platform Zoom. Even though Zoom didn't have a breakthrough in all states, where it did, it experienced a huge boost from almost nothing to being one of the most highly used products in digital learning. Part of this boost probably originates not from learning behaviour, but from people connecting virtually in times of social distancing.

Overall, I do not find a global dependency of digital learning on any of the district characteristics like state, share of black or hispanic people, poverty or money invested in students. Rather, the information contained in knowing the district is not present in any of the district characteristics alone. Either it is some combination of those characteristics that is important for digital learning, or there are relevant properties not covered by the dataset, for example more or less motivated and engaged people (towards digital learning) in relevant positions in the school district. A more thorough comparison of school districts with high access rates and school districts with low access rates would be necessary to investigate what the former did right to achieve such digital learning behaviour.
