In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Starter Notebook: New York Example

In this starter notebook, I first work only at the aggregated state level.

In [None]:
district = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv', index_col=0).dropna(how='all')
district.head()

In [None]:
district.shape

In [None]:
district.loc[district.state == 'New York']

Now I'll include information on the COVID situation. Because cases were only reported properly and extensively from summer 2020 onward (tests were not widely available before then), I will use deaths as the indicator of the severity of the COVID situation.

In [None]:
states = pd.read_csv('../input/us-counties-covid-19-dataset/us-counties.csv', usecols=['date', 'state', 'cases', 'deaths']).groupby(['state', 'date']).sum().sort_values(by=['state', 'date'], ascending=True)
states.loc['New York']

You can see that cases and deaths are cumulative, so we need to derive the daily counts and smooth them over 7 days.
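
As a minimal sketch of that transformation on made-up numbers (the `toy` frame below is hypothetical and not taken from the dataset): `diff()` turns a cumulative series into daily increments, and a 7-day rolling mean then smooths them.

```python
import pandas as pd

# Hypothetical cumulative counts for 10 consecutive days (illustrative only).
toy = pd.DataFrame(
    {"deaths": [0, 1, 3, 6, 10, 15, 21, 28, 36, 45]},
    index=pd.date_range("2020-03-01", periods=10, freq="D"),
)

# Daily increments: difference between consecutive cumulative values.
# The first row has no predecessor, so it becomes NaN and is dropped.
daily = toy.diff().dropna()

# 7-day rolling mean; min_periods=1 keeps the first days instead of NaN.
smoothed = daily.rolling(7, min_periods=1).mean()
print(smoothed)
```

With `min_periods=1`, the first six values are averages over fewer than 7 days, so they are noisier than the rest of the series.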

In [None]:
allstates, _ = zip(*states.index)
allstates = set(allstates)
allstates

In [None]:
states_dic = {
    state: states.loc[state]
    for state in allstates
}

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
for state, df in states_dic.items():
    # Cumulative -> daily counts; diff() avoids modifying a slice of `states` in place.
    # Then smooth with a 7-day rolling mean.
    states_dic[state] = df[['cases', 'deaths']].diff().dropna().rolling(7, min_periods=1).mean()

Now we'll also look at the e-learning products at an aggregated level.

In [None]:
product = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')
product.head()

### First Study: New York

In [None]:
ny_ids = district.loc[district.state == 'New York'].index

Engagement also needs to be smoothed to remove the weekend dips.
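
To illustrate why a 7-day window is a natural choice here, the following sketch uses a made-up series (the weekday value of 100 and weekend value of 0 are assumptions for illustration, not real engagement figures). Every full 7-day window contains exactly five weekdays and two weekend days, so the weekly pattern cancels out.

```python
import pandas as pd

# Hypothetical engagement: 100 on weekdays, 0 on weekends, over 4 full weeks.
days = pd.date_range("2020-03-02", periods=28, freq="D")  # 2020-03-02 is a Monday
raw = pd.Series([100.0 if d.dayofweek < 5 else 0.0 for d in days], index=days)

# 7-day rolling mean, with the same parameters as in the notebook.
smooth = raw.rolling(7, min_periods=1).mean()

# From the 7th day on, every window spans one complete week.
full_windows = smooth.iloc[6:]
print(full_windows.min(), full_windows.max())
```

A window that is not a multiple of 7 days would leave part of the weekly cycle in the smoothed series.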

In [None]:
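engagement_dir = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data'
ny_engagement = (
    pd.concat([pd.read_csv(f'{engagement_dir}/{district_id}.csv') for district_id in ny_ids], axis=0)
    .fillna(0)
    .groupby(['time', 'lp_id']).mean()  # average each product over the NY districts
    .reset_index()
    .groupby('time').sum()              # total over all products per day
    .drop(columns=['lp_id'])
    .rolling(7, min_periods=1).mean()   # 7-day smoothing
)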
ny_engagement = states_dic['New York'][['deaths']].join(ny_engagement, how='right').fillna(0)
ny_engagement

In [None]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(20, 10))
ny_engagement['engagement_index'].plot(ax=ax)
plt.show()

The aggregated engagement index clearly depends on school holidays, and the impact of the start of the pandemic in March is also visible.

In [None]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(20, 10))
ny_engagement[['pct_access', 'deaths']].plot(ax=ax)
plt.show()

In [None]:
ny_engagement.corr()

As expected, engagement is highly correlated with access.
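
To read a single pair out of the correlation matrix instead of scanning the whole table, one option is to index the result of `corr()`, which computes Pearson correlation by default. A minimal sketch on made-up data; the `toy` frame is hypothetical and only reuses the column names from above:

```python
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({
    "pct_access": [1.0, 2.0, 3.0, 4.0, 5.0],
    "engagement_index": [10.0, 22.0, 29.0, 41.0, 50.0],
    "deaths": [5.0, 3.0, 4.0, 1.0, 2.0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = toy.corr()

# Pick out one coefficient by row and column label.
r = corr.loc["pct_access", "engagement_index"]
print(round(r, 3))
```

The same indexing applied to `ny_engagement.corr()` would give the coefficient for the real data.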