# Did COVID-19 and digital learning reinforce inequalities?

The COVID-19 pandemic disrupted many facets of life worldwide, including education. In the US, the majority of states decided to close K-12 schools in mid-March 2020 and adopted digital learning in place of traditional *in-classroom* learning.
It is well documented that demographic and economic factors, as well as social class, create gaps in academic success [1].

**Do COVID-19 and digital learning tend to widen or shrink these inequalities?**

In this notebook we explore correlations between digital learning engagement and socio-economic factors such as parents' education and income, household composition and IT equipment. We also look at correlations between these same factors and academic success.

We will ultimately show that the same factors are correlated with both digital engagement and academic success.

**Therefore, it's likely that COVID-19 reinforced education inequalities.**

Strengthened by more granular data, this kind of analysis can help representatives build more inclusive public policies for education during a pandemic.
In particular, we recommend the following:

* **Direct special financing to "at-risk" school districts**: quality content plays a crucial role in digital engagement [2]. With additional financing, districts can specifically train their students and choose a best-in-class product.
* **Send IT equipment coupons to low-income households**: *[...] many students became disengaged when home technology was lacking or wasn't reliable* [3]. These coupons could be used to upgrade the household internet connection or purchase a new device.
* **Lend IT equipment to low-income households** [4].
* **Engage and train parents (especially fathers)**: parents play a crucial role in digital learning engagement, and the data shows that engagement is higher in states with a higher percentage of highly educated parents. Studies show that women were more affected economically by the pandemic because they were the primary caregiver in many households [5]. Engaging fathers would therefore also be a gender-equality measure.

# Engagement metric

We have two engagement metrics available in the data:
* *pct_access*: percentage of students in the district who have at least one page-load event for a given product on a given day
* *engagement_index*: total page-load events per one thousand students for a given product on a given day

We will use *engagement_index* because it is easier to aggregate than *pct_access*. It also does not depend on district size and, unlike *pct_access*, is not inflated by an "open and quit" behavior (for *pct_access* we would need a companion metric similar to the *bounce rate*, which in web analytics is the percentage of visitors who leave a website after viewing only one page).
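
To make the aggregation argument concrete, here is a minimal sketch on made-up numbers. The product ids and values are hypothetical; only the column names follow the engagement files. Page loads per 1,000 students can be summed across products, whereas access percentages cannot, because the same student may use several products on the same day.

```python
import pandas as pd

# Hypothetical engagement rows for one district: two products over two days.
toy = pd.DataFrame({
    "time": ["2020-03-02", "2020-03-02", "2020-03-03", "2020-03-03"],
    "lp_id": [101, 202, 101, 202],           # made-up product ids
    "pct_access": [40.0, 35.0, 50.0, 30.0],  # % of students with >= 1 page load
    "engagement_index": [1200.0, 800.0, 1500.0, 500.0],  # page loads per 1,000 students
})

# Summing engagement_index over products gives total page loads per 1,000 students.
daily_engagement = toy.groupby("time")["engagement_index"].sum()

# Summing pct_access is not meaningful: students using both products are counted twice.
naive_pct_sum = toy.groupby("time")["pct_access"].sum()

print(daily_engagement)
print(naive_pct_sum)
```

In this toy case the summed percentages could even exceed 100 with more products, which is why the notebook relies on *engagement_index* only.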

We won't look at the product dimension in this analysis, so we simply sum *engagement_index* over all products for each district and each day.
We can now take a first look at the data by plotting the metric for 3 sample districts:

In [None]:
import numpy as np 
import pandas as pd 
import os
import random
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
import matplotlib as mpl, matplotlib.pyplot as plt

mapPalette = sns.cubehelix_palette(start=2.0, rot=-0.4, dark=0.3, light=.8, as_cmap=True)
mapPalette.set_bad((1,1,0.98))
divergingPalette = sns.color_palette("coolwarm", as_cmap=True)
sequentialPalette = sns.cubehelix_palette(start=1, rot=0, as_cmap=True, hue=1)

def reduceEngagement(df, districtName):
    df = df[['time', 'engagement_index']]
    df = df.groupby('time').sum()
    df['district'] = districtName
    df = df.reset_index()
    return df


products_df = pd.DataFrame()
districts_df = pd.DataFrame()
engagement_frames = []  # one reduced frame per district file
for dirname, _, filenames in os.walk('/kaggle/input/learnplatform-covid19-impact-on-digital-learning'):
    for filename in filenames:
        if filename == 'products_info.csv':
            products_df = pd.read_csv(dirname+'/'+filename)
        elif filename == 'districts_info.csv':
            districts_df = pd.read_csv(dirname+'/'+filename)
        elif filename != 'README.md':
            # The first four characters of the file name are used as the district id
            engagement_frames.append(reduceEngagement(pd.read_csv(dirname+'/'+filename), filename[0:4]))
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
engagement_raw = pd.concat(engagement_frames, ignore_index=True)
engagement_df = engagement_raw
engagement_df['district'] = engagement_df['district'].astype(int)
engagement_df['time'] = pd.to_datetime(engagement_df['time'])
sample = [3301, 4165, 7541]
sampledDf = engagement_df[engagement_df.district.isin(sample)]
sampledDf = sampledDf[['time', 'engagement_index', 'district']]
sampledDf.reset_index(inplace=True)
fig, ax = plt.subplots(figsize=(15,7))
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
g=sns.lineplot(data=sampledDf, x='time', y='engagement_index', hue='district', ax=ax)
fig.patch.set_facecolor((1,1,0.98))
ax.set_facecolor((1,1,0.98))
plt.yticks(fontsize=15)
plt.title("Engagement across time" , loc='left', fontdict = {'fontsize':24, 'fontweight':'bold'}, pad=24)
ax.legend().set_visible(False)
plt.show()

We observe a strong weekly seasonality (weekdays versus weekends), so we take a weekly average.
We can also observe:
* A big spike in early March, corresponding to K-12 schools being closed
* A drop in early April, corresponding to spring break
* A drop during the summer break
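
As a minimal sketch of why the weekly average removes the weekday/weekend pattern, consider two made-up weeks of daily values (the numbers are hypothetical, and `dt.isocalendar()` is used here as one possible way to get the ISO week number):

```python
import pandas as pd

# Two hypothetical weeks of daily engagement: high on weekdays, low on weekends.
days = pd.date_range("2020-03-02", periods=14, freq="D")  # 2020-03-02 is a Monday
values = [1000 if d.weekday() < 5 else 50 for d in days]
daily = pd.DataFrame({"time": days, "engagement_index": values})

# ISO week number of each day, then the mean engagement per week.
daily["week"] = daily["time"].dt.isocalendar().week
weekly = daily.groupby("week")["engagement_index"].mean()

print(weekly)
```

Both weeks end up with the same average, so the sawtooth pattern disappears and only week-to-week changes remain visible.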

In [None]:
engagement_df['week'] = engagement_df.time.apply(lambda d : d.isocalendar()[1])
engagement_df = engagement_df[['week', 'engagement_index', 'district']]
engagement_df = engagement_df.groupby(['week','district']).mean()
engagement_df.reset_index(inplace=True)
sampledDf = engagement_df[engagement_df.district.isin(sample)]
fig, ax = plt.subplots(figsize=(15,7))
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
g=sns.lineplot(data=sampledDf, x='week', y='engagement_index', hue='district')
fig.patch.set_facecolor((1,1,0.98))
ax.set_facecolor((1,1,0.98))
plt.yticks(fontsize=15)
plt.title("Engagement across time (cleaned)" , loc='left', fontdict = {'fontsize':24, 'fontweight':'bold'}, pad=24)
ax0tr = ax.transData
arrow1 = mpl.patches.FancyArrowPatch(
    (9,82000), (11.5,80000),  transform=ax0tr,  # Arrow endpoints are in data coordinates
    fc = "g", connectionstyle="arc3,rad=-0.2", arrowstyle='simple', alpha = 0.8, ec='g',
    mutation_scale = 40.
)
arrow2 = mpl.patches.FancyArrowPatch(
     (10,1000), (14.5, 4000), transform=ax0tr,  # Arrow endpoints are in data coordinates
    fc = "g", connectionstyle="arc3,rad=0.3", arrowstyle='simple', alpha = 0.8, ec='g',
    mutation_scale = 40.
)
arrow3 = mpl.patches.FancyArrowPatch(
     (27,19500), (27,5000), transform=ax0tr,  # Arrow endpoints are in data coordinates
    fc = "g", arrowstyle='simple', alpha = 0.8, ec='g',
    mutation_scale = 40.
)
rectangle = mpl.patches.Rectangle(xy=(20, -2000),width=14, height=6000, transform=ax0tr, ec='g', fc='g', alpha=0.3 )

# Axes.patches can no longer be modified in place in recent matplotlib releases
ax.add_patch(arrow1)
ax.add_patch(arrow2)
ax.add_patch(arrow3)
ax.add_patch(rectangle)


ax.text(x=0,y=82000, s='K12 schools closed',fontsize=14)
ax.text(x=3,y=0, s='Spring break',fontsize=14)
ax.text(x=24,y=20000, s='Summer break',fontsize=14)
ax.legend().set_visible(False)
plt.show()


In this analysis, we will also look at the engagement metric state by state.

To do so, we average *engagement_index* across all weeks for each district and take the median value over the districts of each state (taking the median reduces the impact of outliers; see the appendix for a discussion of the variability of school districts within states).
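
The effect of choosing the median can be illustrated with a small sketch on made-up district averages (all values below are hypothetical):

```python
import pandas as pd

# Hypothetical per-district averages; district 5 is an outlier within state "A".
districts = pd.DataFrame({
    "district": [1, 2, 3, 4, 5, 6],
    "state": ["A", "A", "A", "B", "A", "B"],
    "engagement_index": [900.0, 1000.0, 1100.0, 950.0, 20000.0, 1050.0],
})

# Compare the two candidate state-level summaries side by side.
by_state = districts.groupby("state")["engagement_index"].agg(["mean", "median"])
print(by_state)
```

For state "A" the single extreme district pulls the mean far above the typical district, while the median stays close to the other three values.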

Let's visualize this new metric on a "map" :

In [None]:
engagementDistrict = engagement_df.merge(how='inner',right=districts_df, left_on='district', right_on='district_id')
engagementDistrict.drop(['county_connections_ratio'], axis=1,inplace=True)
engagementWithoutDate = engagement_df[['district', 'engagement_index']]
engagementWithoutDate = engagementWithoutDate.groupby('district').mean()
engagementDistrictWithoutDate = engagementWithoutDate.merge(how='inner',right=districts_df, left_on='district', right_on='district_id')
engagementStateWithoutDate = engagementDistrictWithoutDate[['engagement_index', 'state']].groupby('state').median()
usaHeatMap = np.zeros((8,12))
labels = np.empty((8,12), dtype='U2')
helper = pd.read_csv('/kaggle/input/usa-shape-heat-map/usaHeatMap.csv')
for index, row in helper.iterrows():
    if pd.isnull(row['State']):
        usaHeatMap[row['x']-1][row['y']] = -1
        labels[row['x']-1][row['y']] = ''
    else:
        if row['State'] in engagementStateWithoutDate.index:
            usaHeatMap[row['x']-1][row['y']] = engagementStateWithoutDate.loc[row['State']]['engagement_index']
        labels[row['x']-1][row['y']] = row['State2letters']
fig, ax = plt.subplots(figsize=(18,10))
ax.axes.get_yaxis().set_visible(False)
ax.axes.get_xaxis().set_visible(False)
fig.patch.set_facecolor((1,1,0.98))
ax.set_facecolor((1,1,0.98))
ax0tr = ax.transData
arrow1 = mpl.patches.FancyArrowPatch(
    (4,1.5), (6.2,2.2),  transform=ax0tr,  # Arrow endpoints are in data coordinates
    fc = "g", connectionstyle="arc3,rad=-0.2", arrowstyle='simple', alpha = 0.8, ec='g',
    mutation_scale = 40.
)
arrow2 = mpl.patches.FancyArrowPatch(
     (3.5,6.5), (6.2,5.7), transform=ax0tr,  # Arrow endpoints are in data coordinates
    fc = "g", connectionstyle="arc3,rad=0.2", arrowstyle='simple', alpha = 0.8, ec='g',
    mutation_scale = 40.
)
# Axes.patches can no longer be modified in place in recent matplotlib releases
ax.add_patch(arrow1)
ax.add_patch(arrow2)
ax.text(x=1.05,y=1.55, s='Maximum engagement',fontsize=16)
ax.text(x=0.6,y=6.55, s='Minimum engagement',fontsize=16)

g=sns.heatmap(usaHeatMap, annot=labels, fmt='', mask=usaHeatMap < 0, ax=ax, cmap=mapPalette)
ax.set_title('Engagement across USA', loc='left', fontdict = {'fontsize':24, 'fontweight':'bold'}, pad=24)
plt.show()

We can see that engagement is highest in Illinois and lowest in Tennessee.

No clear geographical pattern emerges, because data is missing for many states.

# Correlation between engagement and demographics

We will now compute the correlation between our engagement metric and the demographics available in the provided dataset:

* *pct_black/hispanic*: percentage of students in the district identified as Black or Hispanic, based on 2018-19 NCES data
* *pct_free/reduced*: percentage of students in the district eligible for free or reduced-price lunch, based on 2018-19 NCES data
* *county_connections_ratio*: ratio (residential fixed high-speed connections over 200 kbps in at least one direction / households), based on county-level data from FCC Form 477 (December 2018 version)
* *pp_total_raw*: per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The expenditure data are school-by-school, and the median value is used to represent the expenditure of a given school district.

We drop *county_connections_ratio* because its granularity in the dataset is too coarse for the feature to be meaningful.
The other features are provided as ranges, so we convert them to ordinal features in order to compute a **Pearson's correlation coefficient**.
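
As a minimal sketch of this conversion on made-up values: the helper `interval_rank` below is hypothetical and not part of the notebook, which uses explicit mapping dictionaries instead. It turns a range label into the rank of its lower bound, after which Pearson's coefficient can be computed on the ranks.

```python
import numpy as np

def interval_rank(label, width):
    """Map a label such as '[0.2, 0.4[' to the rank of its lower bound."""
    lower = float(label.strip("[").split(",")[0])
    return int(round(lower / width))

# Hypothetical districts: binned share of free/reduced lunch and mean engagement.
labels = ["[0, 0.2[", "[0.2, 0.4[", "[0.4, 0.6[", "[0.6, 0.8[", "[0.8, 1["]
engagement = np.array([520.0, 480.0, 450.0, 400.0, 390.0])

ranks = np.array([interval_rank(label, width=0.2) for label in labels])

# Pearson's r between the ordinal ranks and the engagement values.
r = np.corrcoef(ranks, engagement)[0, 1]
print(ranks, round(r, 3))
```

Treating the bins as equally spaced ranks is itself an assumption; a rank-based coefficient such as Spearman's would be an alternative if that spacing is in doubt.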

In [None]:
engagementDistrictWithoutDate.dropna(inplace=True)
ppMapping = {}
pctMapping = {'[0, 0.2[':0,
             '[0.2, 0.4[':1,
             '[0.4, 0.6[':2,
             '[0.6, 0.8[':3,
             '[0.8, 1[':4}
lowbound = 0
i = 0
step = 2000
while lowbound < 34000:
    ppMapping['['+str(lowbound)+', '+str(lowbound+step)+'['] = i
    lowbound = lowbound + step
    i = i+1

engagementDistrictTransformed = engagementDistrictWithoutDate.loc[:,['engagement_index', 'pct_black/hispanic', 'pct_free/reduced', 'pp_total_raw', 'district_id']]
engagementDistrictTransformed['pct_black/hispanic'] = engagementDistrictWithoutDate.loc[:,'pct_black/hispanic'].replace(pctMapping)
engagementDistrictTransformed['pct_free/reduced'] = engagementDistrictWithoutDate.loc[:,'pct_free/reduced'].replace(pctMapping)
engagementDistrictTransformed['pp_total_raw'] = engagementDistrictWithoutDate.loc[:,'pp_total_raw'].replace(ppMapping)
engagementDistrictTransformed.head()
corrMatrix = engagementDistrictTransformed[['engagement_index', 'pct_black/hispanic', 'pct_free/reduced', 'pp_total_raw']].corr()
toBeDisplayed = corrMatrix['engagement_index'][1:]
# Sort by absolute correlation, strongest first (positional indexing via .iloc)
toBeDisplayed = toBeDisplayed.iloc[np.argsort(-toBeDisplayed.abs().to_numpy())]
fig, ax = plt.subplots(figsize=(15,7))
ax.axes.set_xlim(xmin=-0.5, xmax=0.5)
ax.axes.get_xaxis().set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
g = sns.barplot(x=toBeDisplayed, y=toBeDisplayed.index, palette=mpl.cm.ScalarMappable(cmap='coolwarm').to_rgba(np.append(toBeDisplayed, [[-0.5,0.5]])), ax=ax)
for p in g.patches:
    x = p.get_width()
    if p.get_width() < 0:
        x = x - 0.05
    g.annotate("%.2f" % p.get_width(), (x, p.get_height()/2+p.get_y()),
                ha='center', va='center', fontsize=14, color='gray', xytext=(20, 0),
                textcoords='offset points', weight='bold')
fig.patch.set_facecolor((1,1,0.98))
ax.set_facecolor((1,1,0.98))
plt.yticks(fontsize=15)
plt.title("Pearson\'s correlation coefficient with engagement" , loc='left', fontdict = {'fontsize':24, 'fontweight':'bold'}, pad=24)
plt.show()


We observe a slight correlation between engagement and expenditure per pupil.

This correlation could be explained by districts having invested more in **products and training**.

We will now explore other features at the state level.

# Correlation between engagement and demographics at state level

We used the NCES (*National Center for Education Statistics*) Education Demographic and Geographic Estimates [6].

In particular, we looked at these tables at the state level:
* **CDP 02.1 :** Household composition
* **CDP 02.11 :** Computers and Internet use
* **CDP 04.10 :** Occupants per room
* **PDP 02.5 :** Educational attainment
* **PDP 03.8 :** Percentage of families and people whose income in the past 12 months is below the poverty level

We transformed this raw data into several metrics:
* *singleParent*: percentage of households with a single parent present
* *withComputer*: percentage of households with a computer
* *withInternetAccess*: percentage of households with a high-speed internet connection
* *morePersonThanRooms*: percentage of households with more occupants than rooms
* *HighSchoolOrHigher*: percentage of children's parents with a high school degree or higher
* *BachelorOrHigher*: percentage of children's parents with a bachelor's degree or higher
* *underPoverty*: percentage of households whose income in the past 12 months is below the poverty level

Let's now look again at **Pearson's correlation coefficient** between these features and the state-wise digital learning engagement:

In [None]:
stateDemoRaw = pd.read_csv('/kaggle/input/state-demographics/stateDemographics.csv', encoding='latin1')
stateDemoRaw = stateDemoRaw.set_index('Unnamed: 0')
cols = stateDemoRaw.columns
num_cols = stateDemoRaw.select_dtypes(include='number').columns  # public API instead of the private _get_numeric_data
cat_cols = list(set(cols) - set(num_cols))
stateDemoRaw[cat_cols] = stateDemoRaw[cat_cols].apply(lambda d : d.apply(lambda s : s.replace(',','')))
stateDemoRaw[cat_cols] = stateDemoRaw[cat_cols].apply(lambda d : d.apply(lambda s : s.replace('%','')))
stateDemoRaw[num_cols] = stateDemoRaw[num_cols]/100
stateDemoRaw = stateDemoRaw.astype(float)
stateDemoRaw['PDP03.8All people'] = stateDemoRaw['PDP03.8All people']/100
stateDemoRaw.reset_index(inplace=True)
stateDemo = pd.DataFrame()
stateDemo['state'] = stateDemoRaw['Unnamed: 0']
stateDemo['singleParent'] = (stateDemoRaw.iloc[:,4]+stateDemoRaw.iloc[:,5])/stateDemoRaw.iloc[:,1]
stateDemo['withComputer'] = stateDemoRaw.iloc[:,10]/stateDemoRaw.iloc[:,9]
stateDemo['withInternetAcces'] = stateDemoRaw.iloc[:,11]/stateDemoRaw.iloc[:,9]
stateDemo['morePersonThanRooms'] = (stateDemoRaw.iloc[:,14]+stateDemoRaw.iloc[:,15])/stateDemoRaw.iloc[:,12]
stateDemo['HighSchoolOrHigher'] = stateDemoRaw.iloc[:,24]/stateDemoRaw.iloc[:,16]
stateDemo['BachelorOrHigher'] = stateDemoRaw.iloc[:,25]/stateDemoRaw.iloc[:,16]
stateDemo['underPoverty'] = stateDemoRaw.iloc[:,26]
engagementStateWithoutDate = engagementDistrictWithoutDate[['engagement_index', 'state']].groupby('state').median().merge(how='inner',right=stateDemo, left_on='state', right_on='state')
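corrMatrixState = engagementStateWithoutDate.corr(numeric_only=True)  # skip the non-numeric 'state' column
toBeDisplayed = corrMatrixState['engagement_index'][1:]
# Sort by absolute correlation, strongest first (positional indexing via .iloc)
toBeDisplayed = toBeDisplayed.iloc[np.argsort(-toBeDisplayed.abs().to_numpy())]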
fig, ax = plt.subplots(figsize=(15,7))
ax.axes.set_xlim(xmin=-0.5, xmax=0.5)
ax.axes.get_xaxis().set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
g = sns.barplot(x=toBeDisplayed, y=toBeDisplayed.index, palette=mpl.cm.ScalarMappable(cmap='coolwarm').to_rgba(np.append(toBeDisplayed, [[-0.5,0.5]])), ax=ax)
for p in g.patches:
    x = p.get_width()
    if p.get_width() < 0:
        x = x - 0.05
    g.annotate("%.2f" % p.get_width(), (x, p.get_height()/2+p.get_y()),
                ha='center', va='center', fontsize=14, color='gray', xytext=(20, 0),
                textcoords='offset points', weight='bold')
fig.patch.set_facecolor((1,1,0.98))
ax.set_facecolor((1,1,0.98))
plt.yticks(fontsize=15)
plt.title("Pearson\'s correlation coefficient with engagement" , loc='left', fontdict = {'fontsize':24, 'fontweight':'bold'}, pad=24)
plt.show()


Pearson's correlation coefficient is extremely sensitive to outliers with so few data points. Let's remove the top two states by engagement (Illinois and Indiana) and rerun the computation.
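
This sensitivity can be illustrated with a short sketch on made-up values (not the notebook's data): eight points with almost no linear trend, plus one extreme point.

```python
import numpy as np

# Hypothetical state-level values: eight states with no clear trend, plus one extreme state.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])
y = np.array([5.0, 3.0, 6.0, 2.0, 6.0, 3.0, 5.0, 4.0, 30.0])

# Pearson's r with the extreme point, then without it.
r_all = np.corrcoef(x, y)[0, 1]
r_trimmed = np.corrcoef(x[:-1], y[:-1])[0, 1]
print(f"with outlier: {r_all:.2f}, without: {r_trimmed:.2f}")
```

A single extreme point is enough to turn a near-zero coefficient into a very strong one, which is why the coefficients are recomputed without the two highest-engagement states.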

In [None]:
engagementStateWithoutDate = engagementStateWithoutDate[engagementStateWithoutDate.state != 'Illinois']
engagementStateWithoutDate = engagementStateWithoutDate[engagementStateWithoutDate.state != 'Indiana']
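corrMatrixStateWithoutOutliers = engagementStateWithoutDate.corr(numeric_only=True)  # skip the non-numeric 'state' column
toBeDisplayed = corrMatrixStateWithoutOutliers['engagement_index'][1:]
# Sort by absolute correlation, strongest first (positional indexing via .iloc)
toBeDisplayed = toBeDisplayed.iloc[np.argsort(-toBeDisplayed.abs().to_numpy())]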
fig, ax = plt.subplots(figsize=(15,7))
ax.axes.set_xlim(xmin=-0.5, xmax=0.5)
ax.axes.get_xaxis().set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
g = sns.barplot(x=toBeDisplayed, y=toBeDisplayed.index, palette=mpl.cm.ScalarMappable(cmap='coolwarm').to_rgba(np.append(toBeDisplayed, [[-0.5,0.5]])), ax=ax)
for p in g.patches:
    x = p.get_width()
    if p.get_width() < 0:
        x = x - 0.05
    g.annotate("%.2f" % p.get_width(), (x, p.get_height()/2+p.get_y()),
                ha='center', va='center', fontsize=14, color='gray', xytext=(20, 0),
                textcoords='offset points', weight='bold')
fig.patch.set_facecolor((1,1,0.98))
ax.set_facecolor((1,1,0.98))
plt.yticks(fontsize=15)
plt.title("Pearson\'s correlation coefficient with engagement" , loc='left', fontdict = {'fontsize':24, 'fontweight':'bold'}, pad=24)
plt.show()


We observe a moderate correlation between digital learning engagement and both parents' education and household composition.

This shows the importance of **parents' engagement**, as outlined in our recommendations.

Pearson's correlation coefficient is a great metric for measuring the linear relationship between two variables, but it cannot capture the joint relationship between several explanatory variables and one target variable. Thus, we fit a **multiple linear regression** to model *engagement_index* from these demographic features.

A multiple linear regression with all the features and so few data points would likely overfit. Therefore, we only consider pairs of features.
We compute the **R²** score for each of these pairs and plot the residuals (*actuals - model predictions*) of the best model:
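
The pair-wise search can be sketched on synthetic data as follows. This is an illustration under stated assumptions rather than the notebook's exact code: it uses NumPy's least squares instead of scikit-learn's `LinearRegression`, iterates over unordered pairs with `itertools.combinations` (the notebook loops over all ordered pairs), and the helper `r_squared` is hypothetical.

```python
from itertools import combinations
import numpy as np

def r_squared(features, target):
    """In-sample R² of an ordinary least squares fit with an intercept."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    total = target - target.mean()
    return 1.0 - (residuals @ residuals) / (total @ total)

# Synthetic data: the target depends on features 0 and 2 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=20)

# Score every unordered pair of features and keep the best one.
scores = {
    pair: r_squared(X[:, list(pair)], y)
    for pair in combinations(range(X.shape[1]), 2)
}
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 3))
```

With four features this evaluates six pairs; with the seven state-level features of the notebook it would evaluate 21.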

In [None]:
X = engagementStateWithoutDate[['singleParent', 'withComputer',
       'withInternetAcces', 'morePersonThanRooms', 'HighSchoolOrHigher',
       'BachelorOrHigher', 'underPoverty']]
y = engagementStateWithoutDate['engagement_index']
scaler = StandardScaler()
scaledX = scaler.fit_transform(X)
score = np.zeros((7,7))
bestScore = 0
bestCoef = 0
for i in range(0,len(X.columns)):
    for j in range(0,len(X.columns)):
        reg = LinearRegression().fit(scaledX[:,[i,j]], y)
        score[i][j] = reg.score(scaledX[:,[i,j]],y)
        if score[i][j] > bestScore:
            bestScore = score[i][j]
            bestI = i
            bestJ = j

mask = np.full((7,7), True)
for i in range(0,7):
    for j in range(i,7):
        mask[i][j] = False
        
fig, ax = plt.subplots(1,2,figsize=(15, 7))
g = sns.heatmap(pd.DataFrame(score, index=X.columns, columns=X.columns), annot=True, cmap=sequentialPalette, mask=mask, ax=ax[0])
fig.patch.set_facecolor((1,1,0.98))
ax[0].set_facecolor((1,1,0.98))
fig.suptitle("Pair-wise linear regression - R² and residuals" , fontsize=24, fontweight='bold')
fig.axes[2].set_visible(False)

# Refit the best pair of features to plot its residuals
reg = LinearRegression().fit(scaledX[:,[bestI,bestJ]], y)
ax[1].set_facecolor((1,1,0.98))
ax[1].spines["right"].set_visible(False)
ax[1].spines["top"].set_visible(False)
ax[1].set_ylabel('Residuals')
sns.scatterplot(x= reg.predict(scaledX[:,[bestI,bestJ]]), y=y-reg.predict(scaledX[:,[bestI,bestJ]]), ax=ax[1])

ax0tr = ax[0].transData 
ax1tr = ax[1].transData 
figtr = fig.transFigure.inverted() 
ptB = figtr.transform(ax0tr.transform((6.5, 0.3)))
ptE = figtr.transform(ax1tr.transform((15000, 8000)))

arrow = mpl.patches.FancyArrowPatch(
    ptB, ptE, transform=fig.transFigure,  # Place arrow in figure coord system
    fc = "g", connectionstyle="arc3,rad=-0.2", arrowstyle='simple', alpha = 0.8, ec='g',
    mutation_scale = 40.
)

rectangle = mpl.patches.Rectangle(xy=(6, 0),width=1, height=1, transform=ax0tr, fill=False, lw=5, ec='g')
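fig.add_artist(arrow)  # figure-level artist, keeps its figure-coordinate transform
ax[0].add_patch(rectangle)  # Axes.patches can no longer be modified in place in recent matplotlib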
plt.show()

The best regressors are :
* *singleParent*
* *underPoverty*

The R² score of 0.5 suggests that the model with these two features is a reasonable fit, keeping in mind that it is computed on the same data used for fitting (human behavior is very hard to predict, and the R² score for this kind of study rarely exceeds 0.5 [7]).
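
Because this R² is measured on the fitting data, one possible sanity check with so few points is a leave-one-out comparison. The sketch below uses 20 synthetic observations rather than the notebook's data, and the helper `fit_predict` is hypothetical.

```python
import numpy as np

def fit_predict(train_X, train_y, test_X):
    """Fit ordinary least squares with an intercept and predict test_X."""
    design = np.column_stack([np.ones(len(train_y)), train_X])
    coef, *_ = np.linalg.lstsq(design, train_y, rcond=None)
    return np.column_stack([np.ones(len(test_X)), test_X]) @ coef

# Synthetic stand-in: 20 observations and two regressors.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=20)

# In-sample predictions versus predictions for each point left out of the fit.
in_sample = fit_predict(X, y, X)
loo = np.array([
    fit_predict(np.delete(X, i, axis=0), np.delete(y, i), X[i:i + 1])[0]
    for i in range(len(y))
])

ss_total = np.sum((y - y.mean()) ** 2)
r2_in = 1 - np.sum((y - in_sample) ** 2) / ss_total
r2_loo = 1 - np.sum((y - loo) ** 2) / ss_total
print(f"in-sample R²: {r2_in:.2f}, leave-one-out R²: {r2_loo:.2f}")
```

The leave-one-out score is never higher than the in-sample one for a least squares fit, so a large gap between the two would hint at overfitting.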

With these models, we highlighted the relationship between digital learning engagement and:
* **Household composition**
* **Economic wealth**
* **Parents' education**

Let's now see how these factors are correlated with academic success.

# Demographics and academic success

We gathered 3 different metrics of academic success by state from the **2020 Kids Count Data Book** auxiliary materials [8]:
* Table 1.6 - Percentage of fourth-graders not proficient in reading, 2009 and 2019 
* Table 1.7 - Percentage of eighth-graders not proficient in math, 2009 and 2019 
* Table 1.8 - Percentage of high school students not graduating on time, 2010–11 and 2017–18

For each metric we used the most recent data available.

For the sake of illustration, we compute correlations with the math proficiency metric.

In [None]:
academics = pd.DataFrame()
academics['state'] = stateDemoRaw.iloc[:,0]
academics['reading proficiency'] = stateDemoRaw.iloc[:,27]
academics['math proficiency'] = stateDemoRaw.iloc[:,28]
academics['graduating on time'] = stateDemoRaw.iloc[:,29]
engagementWithAcademics = engagementStateWithoutDate.merge(how='inner', right=academics, on='state')
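corrMatrixAcademics = engagementWithAcademics.corr(numeric_only=True)  # skip the non-numeric 'state' column

# Keep the demographic features only (drop engagement_index and the three academic metrics)
toBeDisplayed = corrMatrixAcademics['math proficiency'][1:-3]
toBeDisplayed = toBeDisplayed.iloc[np.argsort(-toBeDisplayed.abs().to_numpy())]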
fig, ax = plt.subplots(figsize=(15,7))
ax.axes.set_xlim(xmin=-1, xmax=1)
ax.axes.get_xaxis().set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
g = sns.barplot(x=toBeDisplayed, y=toBeDisplayed.index, palette=mpl.cm.ScalarMappable(cmap='coolwarm').to_rgba(np.append(toBeDisplayed, [[-1,1]])), ax=ax)
for p in g.patches:
    x = p.get_width()
    if p.get_width() < 0:
        x = x - 0.1
    g.annotate("%.2f" % p.get_width(), (x, p.get_height()/2+p.get_y()),
                ha='center', va='center', fontsize=14, color='gray', xytext=(20, 0),
                textcoords='offset points', weight='bold')
fig.patch.set_facecolor((1,1,0.98))
ax.set_facecolor((1,1,0.98))
plt.yticks(fontsize=15)
plt.title("Pearson\'s correlation coefficient and math proficiency" , loc='left', fontdict = {'fontsize':24, 'fontweight':'bold'}, pad=24)
plt.show()

We observe:

* A strong correlation between the poverty index and math proficiency
* A strong correlation between IT equipment and math proficiency
* A strong correlation between parents' education and math proficiency
* A strong correlation between household composition (single-parent households) and math proficiency

As the same factors are associated with both academic success and digital learning engagement, we conclude that **the COVID-19 pandemic likely reinforced education inequalities**.


# Appendix : School district to state assumption

In this analysis, we assumed that aggregating engagement data at the state level and comparing it with state demographics was reasonable.
This choice was made because the anonymization of the school districts made any analysis at the finest granularity impossible.
In this section we explore the variability of school district demographic metrics in Alabama to get a better sense of the validity of our assumption.

School district data was obtained through:
* the ACS-ED District Demographic Dashboard [9]
* the NCES "Search for Public School Districts" tool [10]

We used the following metrics/dimensions:
* *locale type*: NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural
* *Broadband*: percentage of households with a high-speed internet connection
* *singleHousehold*: percentage of households with a single parent present
* *BachelorsUp*: percentage of parents with a bachelor's degree or higher
* *BelowPoverty*: percentage of households whose income in the past 12 months is below the poverty level

In [None]:
alabamaDemoRaw = pd.read_csv('../input/alabamas-schoold-districts-demographics/alabamaSchoolDistricts.csv')
alabamaDemo=alabamaDemoRaw.loc[:,['Broadband', 'locale type', 'MaleOnlyHouseholder', 'FemaleOnlyHouseholder', 'BelowPoverty', 'BachelorsUp']]
alabamaDemo['singleHousehold'] = alabamaDemo.loc[:,'MaleOnlyHouseholder'] + alabamaDemo.loc[:,'FemaleOnlyHouseholder']
alabamaDemo.drop(['MaleOnlyHouseholder', 'FemaleOnlyHouseholder'], axis=1, inplace=True)
alabamaDemo['locale type'] = alabamaDemo.loc[:,'locale type'].replace({'City':'Others', 'Town':'Others', 'Rural':'Others'})
fig, ax = plt.subplots(figsize=(10,7))

fig.patch.set_facecolor((1,1,0.98))
g = sns.violinplot(data=alabamaDemo.melt(id_vars='locale type', value_vars =['Broadband', 'BelowPoverty', 'BachelorsUp', 'singleHousehold']), x='variable', y='value' , hue='locale type', split=True)
ax.set_facecolor((1,1,0.98))
ax.set_title('Variability of district demographics in Alabama', loc='left', fontdict = {'fontsize':24, 'fontweight':'bold'}, pad=24)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.get_xaxis().get_label().set_visible(False)
ax.get_yaxis().get_label().set_visible(False)
plt.show()


These charts show how variable the demographic metrics are across school districts in Alabama. However, we notice that the distributions for districts not classified as *Suburb* generally have a more Gaussian shape, and are generally less flat, than those of their *Suburb* counterparts.

**Excluding these districts from the engagement data and our state demographics might be a good idea for improving the accuracy of the analysis while preserving the anonymity of the school districts on the engagement dataset**

# Appendix : Resources & additional data sources

[1] https://www.epi.org/publication/five-key-trends-in-u-s-student-performance-progress-by-blacks-and-hispanics-the-takeoff-of-asians-the-stall-of-non-english-speakers-the-persistence-of-socioeconomic-gaps-and-the-damaging-effect/#epi-toc-1

[2] https://www.bcg.com/fr-fr/publications/2021/covid-19-advanced-digital-learning-for-lower-income-populations

[3] https://rossier.usc.edu/survey-low-income-families-strained-by-distance-learning/

[4] https://www.placegrenet.fr/2020/12/21/isere-la-prefecture-distribue-1-500-tablettes-aux-eleves-dans-le-cadre-dun-plan-dequipement-numerique/533595

[5] https://www.mckinsey.com/featured-insights/diversity-and-inclusion/seven-charts-that-show-covid-19s-impact-on-womens-employment

[6] https://nces.ed.gov/programs/edge/TableViewer/acsProfile/2019

[7] https://statisticsbyjim.com/regression/interpret-r-squared-regression/

[8] https://www.aecf.org/resources/2020-kids-count-data-book/auxiliary-materials

[9] https://nces.ed.gov/Programs/Edge/ACSDashboard

[10] https://nces.ed.gov/ccd/districtsearch/index.asp