# Pandemic, Digital Learning, and the Widening Digital Gap
- by Shiyun Hu, RuiGuo Yang, QingRong Xin and Defang Cui
*Some of the data-processing code was not run in full, to save time.

## Table of Contents
- Executive Summary
- Data cleaning
- Data Observation
- Prepare community data & Regression Analysis
- Conclusion and policy suggestion

## Executive Summary
This report evaluates the state of digital learning in 2020 and how the engagement of digital learning relates to the pandemic, public policies, and students' socio-economic characteristics. Combining `LearnPlatform` account activity records, pandemic data, and policy records, we employed exploratory data analysis and regression analysis to identify key factors affecting digital learning. We have four main observations:

- We did not find decisive evidence that overall digital learning increased after the pandemic began, although digital learning was expected to substitute for offline learning.
- The utilization of digital learning is highly unequal, and this inequality is strongly associated with socio-economic factors such as race and income.
- While digital learning increased significantly for high socio-economic status groups, it decreased for low socio-economic status groups during the pandemic, which means the pandemic further widened the digital gap.
- The decrease of digital learning among low socio-economic status groups after the outbreak cannot be explained by the entrance of new users or new products. Disruption of normal teaching is the most probable explanation.

These observations suggest that the digital gap is not a technological problem but a social one, and that inequality has hindered the use of education technology. We suggest conducting more detailed surveys to better understand students' behavior, and building an ecosystem in which local authorities, teachers, parents, and students work together to facilitate students' learning.

## Data cleaning
In the data cleaning part, we processed the data in the following ways:
1. Add binary features to the product info CSV to replace string features such as 'PreK-12', for easier processing later.
2. Convert the bracketed range strings in the district info CSV to floats (the midpoint of each range), for easier processing later.
3. Combine all the district engagement CSVs into one dataframe.
4. Add daily COVID-19 data and state information to the combined dataframe.
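
As a minimal sketch of step 2, assume the district file stores ranges as strings such as `[0.2, 0.4[` (the exact bracket style is an assumption here). Each range can then be replaced by its midpoint. The helper name `interval_midpoint` is hypothetical and only illustrates the idea.

```python
import math

def interval_midpoint(value):
    """Return the midpoint of a range string like '[0.2, 0.4[', or pass non-strings through."""
    if not isinstance(value, str):
        return value  # missing values (NaN) are left untouched
    lower, upper = value.split(',')
    # Drop the opening bracket of the lower bound and the closing bracket of the upper bound.
    return (float(lower.strip()[1:]) + float(upper.strip()[:-1])) / 2

print(interval_midpoint('[0.2, 0.4['))            # midpoint of the range
print(math.isnan(interval_midpoint(float('nan'))))  # NaN stays NaN
```

Using the midpoint is a simplification: it treats every district in a bracket as if it sat at the centre of that bracket.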

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os

#### Combine all district info into one dataframe

In [None]:
filedict = {}
for file in os.listdir('../input/learnplatform-covid19-impact-on-digital-learning/engagement_data/'):
    filedict[int(file.split('.')[0])] = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data/' + file

In [None]:
dfdict = {}
for placename in filedict.keys():
    dfdict[placename] = pd.read_csv(filedict[placename])

In [None]:
productdf = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')

In [None]:
districtdf = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')

In [None]:
dfdict[3188].head()

In [None]:
dfdict[3188]['time'].unique().shape

In [None]:
dfdict.keys()

In [None]:
for key,df in dfdict.items():
    df['key'] = key
    dfdict[key] = df

In [None]:
summarydf = pd.concat([df for df in dfdict.values()],axis = 0)

#### Clean product file

In [None]:
productdf

In [None]:
productdf['Sector(s)'].unique()

In [None]:
productdf['PreK-12'] = productdf['Sector(s)'].astype(str).apply(lambda x:'PreK-12' in x)

In [None]:
productdf['Higher Ed'] = productdf['Sector(s)'].astype(str).apply(lambda x:'Higher Ed' in x)

In [None]:
productdf['Corporate'] = productdf['Sector(s)'].astype(str).apply(lambda x:'Corporate' in x)

In [None]:
productdf['Primary Essential Function'].unique()

In [None]:
productdf['Short PEF'] = productdf['Primary Essential Function'].astype(str).apply(lambda x:x.split('-')[0][0:-1])

In [None]:
productdf

In [None]:
productdf[productdf['Short PEF'] == 'LC/CM/SDO']

#### Clean district file

In [None]:
productdf.to_csv('product_clean.csv')

In [None]:
districtdf

In [None]:
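# Column notes for the district file:
# pct_black/hispanic: share of students identified as Black or Hispanic
# pct_free/reduced: share of students eligible for free or reduced-price lunch
# county_connections_ratio: residential fixed high-speed connections (over 200 kbps in at least one direction) per household
# pp_total_raw: per-pupil expenditure; the median value represents the expenditure of a given school district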

In [None]:
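def clean_kuohao(x):
    """Convert a bracketed range string to its midpoint; leave numeric values (including NaN) unchanged."""
    if not isinstance(x, str):
        return x
    ys = x.split(',')
    y1 = float(ys[0][1:])    # lower bound, without the opening bracket
    y2 = float(ys[1][0:-1])  # upper bound, without the closing bracket
    return (y1 + y2) / 2

for col in ['pct_black/hispanic','pct_free/reduced','county_connections_ratio','pp_total_raw']:
    districtdf[col] = districtdf[col].apply(clean_kuohao)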

In [None]:
districtdf.to_csv('clean_district.csv')

#### Add Covid data and state info

In [None]:
US_clean_data = pd.read_csv('../input/covid19data/time_series_covid19_confirmed_US.csv')

In [None]:
temp = US_clean_data.T.iloc[11:]

In [None]:
temp.index = pd.to_datetime(temp.index)

In [None]:
temp.index

In [None]:
cleaned_US_data = US_clean_data.groupby('Province_State').sum().T.iloc[5:350]

In [None]:
cleaned_US_data.index = pd.to_datetime(cleaned_US_data.index)

In [None]:
cleaned_US_data['District Of Columbia'] = cleaned_US_data['District of Columbia']

In [None]:
cleaned_US_data.to_csv('cleaned_US_confirmed.csv')

In [None]:
cleaned_US_data[[x for x in districtdf['state'].dropna().unique()]]#'District of Columbia'

In [None]:
[x for x in districtdf['state'].dropna().unique()]

In [None]:
districtdf['state'].dropna().unique()

In [None]:
districtdf['district_id'] = districtdf['district_id'].astype(int)

In [None]:
cleaned_US_data['Connecticut']

In [None]:
import time
for key,df in dfdict.items():
    print(key)
    start = time.time()
    df['time'] = pd.to_datetime(df['time'])
    state = districtdf[districtdf['district_id'] == key]['state'].iloc[0]
    # Districts without state information get missing values
    if pd.isna(state):
        print('no state')
        df['State'] = np.nan
        df['Covid'] = np.nan
        continue
    df['State'] = state
    df['Covid'] = df['time'].apply(lambda x:cleaned_US_data.loc[x][state] if x in cleaned_US_data.index else np.nan)
    print(time.time() - start)
    dfdict[key] = df

In [None]:
dfdict[3640]

In [None]:
summarydf = pd.concat([df for df in dfdict.values()],axis = 0)

In [None]:
summarydf['Covid'].isna().sum() / summarydf['Covid'].shape[0]

In [None]:
summarydf.to_csv('summarydf_withcovid_withstate.csv')

## Data Observation
In this part, we look at four patterns in the data:
1. The distribution of product usage counts, which shows a heavy-tailed pattern and the dominance of Google LLC.
2. COVID-19 case data and shutdown data, used to determine the break point of the COVID-19 effect.
3. The trend of product `engagement_index` and `pct_access`, to see how usage changed around the COVID-19 break point.
4. The trend in the number of districts using LearnPlatform and in the average number of products used by districts. Neither shows a special trend around the COVID-19 break point, which suggests that COVID-19 changed not the number of products used but how existing products were used.
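
One simple way to quantify the heavy-tailed pattern in point 1 is the share of all records that belongs to the most frequent products. The sketch below uses toy data; the function name `top_share` and the 10% cut-off are illustrative assumptions, not part of the original analysis.

```python
from collections import Counter

def top_share(product_ids, top_fraction=0.1):
    """Share of all records that belong to the most frequent `top_fraction` of products."""
    counts = sorted(Counter(product_ids).values(), reverse=True)
    n_top = max(1, int(len(counts) * top_fraction))
    return sum(counts[:n_top]) / sum(counts)

# Toy records: product 'a' dominates, the other nine products appear once each.
records = ['a'] * 91 + list('bcdefghij')
print(top_share(records))
```

On the real data, the same idea could be applied to the `lp_id` column of `summarydf`; a value close to 1 would indicate that a few products account for most of the activity.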

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
PATH = "../input/covid19/"

In [None]:
summarydf = pd.read_csv(PATH+'summarydf_withcovid_withstate.csv',
                        index_col = 0,
                        low_memory=False)

In [None]:
summarydf.head()

In [None]:
summarydf['time'] = pd.to_datetime(summarydf['time'])

#### The heavy-tailed distribution over all products, and the less heavy-tailed yet still concentrated distribution over the selected products, which fits our understanding of the 'heavy tail internet'

In [None]:
summarydf.groupby('lp_id').count()['time'].hist(bins=50)

In [None]:
print(summarydf.groupby('lp_id').count()['time'].mean())
print(summarydf.groupby('lp_id').count()['time'].std())

In [None]:
productdf = pd.read_csv('product_clean.csv',index_col = 0)

In [None]:
productdf.head()

In [None]:
numobs_in = summarydf[summarydf['lp_id'].isin(productdf['LP ID'])].groupby('lp_id').count()['time']
numobs_notin = summarydf.groupby('lp_id').count()['time']

In [None]:
numobs_in.hist(bins = 50)

In [None]:
numobs_notin.hist(bins = 50)

In [None]:
notin_productlist = pd.merge(numobs_notin, productdf, left_on = numobs_notin.index, right_on = productdf['LP ID']).sort_values(by = 'time', ascending = False)

In [None]:
notin_productlist.head()

#### Dominance of Google and other big companies

In [None]:
pd.DataFrame(notin_productlist.groupby(['Provider/Company Name']).sum()['time'].sort_values(ascending = False))

#### Observe the break point in the cumulative number of confirmed COVID-19 cases: the outbreak starts around March 15, so we define March 15 as our observation point

In [None]:
temp = summarydf.groupby(['time']).mean()['Covid']
temp.index = pd.to_datetime(temp.index)
temp.plot()

#### Watch the trend of product usage: there is not much change around the COVID-19 outbreak, and usage even goes down, which is the main story we need to explain

In [None]:
summarydf_cleaned = summarydf[summarydf['lp_id'].isin(notin_productlist['LP ID'])]
summarydf_cleaned.head()

In [None]:
summarydf_cleaned.shape

In [None]:
summarydf.shape

In [None]:
engagement_product = pd.pivot_table(summarydf_cleaned, index = ['lp_id'], values = 'engagement_index', columns = ['time'])
pctaccess_product = pd.pivot_table(summarydf_cleaned, index = ['lp_id'], values = 'pct_access', columns = ['time'])

In [None]:
def get_lpid(k12,highered,corporate):
    return productdf[(productdf['PreK-12'] == k12) & (productdf['Higher Ed'] == highered) & (productdf['Corporate'] == corporate)]['LP ID']

In [None]:
# Key dates: 1.15 spring semester starts, 3.15 COVID-19 outbreak, 5.20 spring semester ends, 8.20 fall semester starts

In [None]:
def plot_with_covid(x):
    # Weekly mean across products (rows are products, columns are dates)
    weekly = x.T.resample('7D').mean().T.mean()
    plt.plot(weekly.index, weekly)
    # axvline spans the full axis height by default (ymin/ymax are axis fractions, not data values)
    plt.axvline(x = pd.to_datetime('2020-03-18'), color = 'red', label = 'covid start')
    plt.axvline(x = pd.to_datetime('2020-05-20'), color = 'gold', label = 'summer start')
    plt.axvline(x = pd.to_datetime('2020-08-19'), color = 'green', label = 'summer end')
    plt.legend()
    plt.show()

#### mean of engagement index at product level

In [None]:
plot_with_covid(engagement_product)

#### mean of engagement index at product level, with K12-only products

In [None]:
plot_with_covid(engagement_product.loc[get_lpid(True,False,False)])

#### mean of engagement index at product level, with education products only (PreK-12 and Higher Ed, not Corporate)

In [None]:
plot_with_covid(engagement_product.loc[get_lpid(True,True,False)])

#### mean of pct access at product level

In [None]:
plot_with_covid(pctaccess_product)

#### mean of pct access at product level, with K12-only products

In [None]:
plot_with_covid(pctaccess_product.loc[get_lpid(True,False,False)])

#### mean of pct access at product level, with education products only (PreK-12 and Higher Ed, not Corporate)

In [None]:
plot_with_covid(pctaccess_product.loc[get_lpid(True,True,False)])

In [None]:
product_key = pd.pivot_table(summarydf_cleaned,index = ['key'], values = ['lp_id'], columns = 'time', aggfunc = 'count')

#### Number of districts using LearnPlatform (weekly mean)

In [None]:
# Number of districts with at least one product record on each day
howmany_district_use_platform = product_key.notna().sum()
howmany_district_use_platform.index = [x[1] for x in howmany_district_use_platform.index]
weekly_districts = howmany_district_use_platform.resample('7D').mean()
plt.plot(weekly_districts.index, weekly_districts)
# axvline spans the full axis height by default
plt.axvline(x = pd.to_datetime('2020-03-18'), color = 'red', label = 'covid start')
plt.axvline(x = pd.to_datetime('2020-05-20'), color = 'gold', label = 'summer start')
plt.axvline(x = pd.to_datetime('2020-08-19'), color = 'green', label = 'summer end')
plt.legend()
plt.show()

#### Mean number of products used per district

In [None]:
# Average number of product records per district on each day
howmany_product_used = product_key.mean()
howmany_product_used.index =  [x[1] for x in howmany_product_used.index]
weekly_products = howmany_product_used.resample('7D').mean()
plt.plot(weekly_products.index, weekly_products)
# axvline spans the full axis height by default
plt.axvline(x = pd.to_datetime('2020-03-18'), color = 'red', label = 'covid start')
plt.axvline(x = pd.to_datetime('2020-05-20'), color = 'gold', label = 'summer start')
plt.axvline(x = pd.to_datetime('2020-08-19'), color = 'green', label = 'summer end')
plt.legend()
plt.show()

In [None]:
summarydf_add_product = pd.merge(summarydf, productdf, left_on = 'lp_id', right_on = 'LP ID')

In [None]:
summarydf_add_product.head()

### Prepare Community-Level Panel Data

In [None]:
panel_community = summarydf_add_product.groupby(['key','time']).mean()[['pct_access','engagement_index','Covid']]

In [None]:
panel_community

In [None]:
panel_edu_community = summarydf_add_product[(summarydf_add_product['PreK-12'] == True) &(summarydf_add_product['Higher Ed'] == True) & (summarydf_add_product['Corporate'] == False)].groupby(['key','time']).mean()[['pct_access','engagement_index','Covid']]

In [None]:
panel_edu_community

In [None]:
district_info = pd.read_csv('clean_district.csv', index_col = 0)

In [None]:
panel_edu_community['keykey'] = [x[0] for x in panel_edu_community.index]
panel_edu_community['time'] = [x[1] for x in panel_edu_community.index]
# Missing cumulative counts are treated as zero before taking daily differences
panel_edu_community['Covid'] = panel_edu_community['Covid'].fillna(0)
panel_edu_community["dCovid"] = panel_edu_community.groupby("key")["Covid"].diff()
panel_edu_community.head()

In [None]:
panel_community['keykey'] = [x[0] for x in panel_community.index]
panel_community['time'] = [x[1] for x in panel_community.index]
# Missing cumulative counts are treated as zero before taking daily differences
panel_community['Covid'] = panel_community['Covid'].fillna(0)
panel_community["dCovid"] = panel_community.groupby("key")["Covid"].diff()
panel_community.head()

In [None]:
panel_edu_community_add_dis = pd.merge(panel_edu_community, district_info, left_on = 'keykey', right_on = 'district_id',validate = 'm:1')

In [None]:
panel_community_add_dis = pd.merge(panel_community, district_info, left_on = 'keykey', right_on = 'district_id',validate = 'm:1')

In [None]:
panel_community_add_dis.head()

In [None]:
panel_community_add_dis = panel_community_add_dis.rename(columns={"pct_black/hispanic": "pctblack_hispanic", "pct_free/reduced": "pct_free_reduced"})

In [None]:
panel_edu_community_add_dis = panel_edu_community_add_dis.rename(columns={"pct_black/hispanic": "pctblack_hispanic", "pct_free/reduced": "pct_free_reduced"})


## Regression Analysis

Now we turn to a more detailed econometric analysis to figure out how the pandemic has affected digital learning and how this effect varies between groups with different socio-economic status.

In [None]:
!pip install stargazer

In [None]:
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col
from stargazer.stargazer import Stargazer

### Data Transformation and Model Specification
We use `engagement_index` and `pct_access` to measure the outcome. Since the distributions of both variables are extremely skewed, we take the log of each (after adding 1, so that zeros remain defined), generating `logEngage` and `logAccess`.

We use `logCovid`, the log of daily new confirmed cases in each state (plus 1), to measure the intensity of the pandemic. The percentage of Black and Hispanic students, `pctblack_hispanic`, and the percentage of students eligible for free or reduced-price lunch, `pct_free_reduced`, are two proxies for the average socio-economic status of each school district.
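
To illustrate why a log transform helps with skewed variables, the sketch below applies `log(x + 1)` to a few made-up values spanning several orders of magnitude. The values and the helper name `log1p_transform` are assumptions for illustration only.

```python
import math

def log1p_transform(values):
    """Apply log(x + 1) so that zeros stay defined and large values are compressed."""
    return [math.log1p(v) for v in values]

raw = [0, 9, 99, 9999]  # hypothetical, heavily skewed values
print(log1p_transform(raw))
```

Note that `log(x + 1)` is only defined for `x > -1`. If a daily difference in cumulative cases were negative (for instance after a downward revision of the counts), the transformed value would be undefined and would need separate handling.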

In [None]:
panel_community_add_dis['logCovid'] = (panel_community_add_dis['dCovid'] + 1).apply(np.log)
panel_community_add_dis['logEngage'] = (panel_community_add_dis['engagement_index'] + 1).apply(np.log)
panel_community_add_dis['logAccess'] = (panel_community_add_dis['pct_access'] + 1).apply(np.log)

We divide the sample into different time periods: `PrePandemic` covers the period before the COVID-19 outbreak in the United States (before March 15, 2020); `Spring` is the spring semester, roughly January to May; `Fall` is the fall semester, roughly late August to December. `SchoolWeek` combines the spring and fall semesters.
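
The period definitions can be sketched as a small labelling function. The cut-off dates are the ones used in the notebook's filters; the function name `label_periods` and the constant names are hypothetical.

```python
from datetime import date

OUTBREAK = date(2020, 3, 15)    # cut-off for the PrePandemic subsample
SPRING_END = date(2020, 5, 20)  # last day counted as spring semester
FALL_START = date(2020, 8, 20)  # fall semester starts after this day

def label_periods(day):
    """Return the set of subsample names that a given date belongs to."""
    labels = set()
    if day < OUTBREAK:
        labels.add('PrePandemic')
    if day <= SPRING_END:
        labels.add('Spring')
    if day > FALL_START:
        labels.add('Fall')
    if labels & {'Spring', 'Fall'}:
        labels.add('SchoolWeek')
    return labels

print(label_periods(date(2020, 2, 1)))
print(label_periods(date(2020, 7, 1)))   # summer vacation: belongs to no subsample
```

Note that the subsamples overlap: every `PrePandemic` day is also a `Spring` day, and `Spring` includes both pre-outbreak and post-outbreak weeks.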

In [None]:
PrePandemic = panel_community_add_dis[panel_community_add_dis["time"]<pd.to_datetime("2020-03-15")]
SchoolWeek = panel_community_add_dis[(panel_community_add_dis["time"]<=pd.to_datetime("2020-05-20"))|(panel_community_add_dis["time"]>pd.to_datetime("2020-08-20"))]
Spring = panel_community_add_dis[(panel_community_add_dis["time"]<=pd.to_datetime("2020-05-20"))]
Fall = panel_community_add_dis[(panel_community_add_dis["time"]>pd.to_datetime("2020-08-20"))]

### Digital gap before the pandemic

We use the `PrePandemic` subsample to explore the pre-existing digital gap. Regressing `logAccess` or `logEngage` on demographic characteristics, we obtain the following results. Robust standard errors are shown in parentheses.

In [None]:
# Specify the models
model1 = smf.ols(formula = "logAccess~pctblack_hispanic", data = PrePandemic)
model2 = smf.ols(formula = "logAccess~pct_free_reduced", data = PrePandemic)
model3 = smf.ols(formula = "logAccess~pctblack_hispanic+pct_free_reduced", data = PrePandemic)
model4 = smf.ols(formula = "logEngage~pctblack_hispanic", data = PrePandemic)
model5 = smf.ols(formula = "logEngage~pct_free_reduced", data = PrePandemic)
model6 = smf.ols(formula = "logEngage~pctblack_hispanic+pct_free_reduced", data = PrePandemic)
# Fit the models with heteroskedasticity-robust (HC1) standard errors
result1 = model1.fit(cov_type = 'HC1')
result2 = model2.fit(cov_type = 'HC1')
result3 = model3.fit(cov_type = 'HC1')
result4 = model4.fit(cov_type = 'HC1')
result5 = model5.fit(cov_type = 'HC1')
result6 = model6.fit(cov_type = 'HC1')
# Report the results
stargazer = Stargazer([result1,result2,result3,result4,result5,result6])
stargazer.custom_columns(["logAccess","logEngage"],[3,3])
stargazer.covariate_order(["pctblack_hispanic","pct_free_reduced"])
stargazer.show_degrees_of_freedom(False)
stargazer

The table shows that both ethnic composition and economic condition are associated with the use of digital learning.

To understand the result, let's compare a school district with 100% Black or Hispanic students and another with none. Column (1) tells us that the percentage of students who viewed at least one page on a typical day is about 15% lower in the first district. This difference is both statistically and economically significant, suggesting minority groups may not have been fully utilizing digital technology for study even before the pandemic.
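
When the outcome is in logs, a coefficient is measured in log points, and for larger values it differs somewhat from a percent change. The sketch below shows the standard conversion `exp(b) - 1`. The coefficient `-0.15` is a hypothetical value chosen only to echo the magnitude discussed above, not the estimate from the table, and the function name is an assumption.

```python
import math

def log_points_to_percent(coef):
    """Convert a coefficient on a log outcome into the implied percent change."""
    return (math.exp(coef) - 1) * 100

# Hypothetical coefficient on a 0-to-1 share variable, for illustration only.
print(round(log_points_to_percent(-0.15), 1))
```

Because the notebook uses `log(pct_access + 1)` rather than `log(pct_access)`, the percent change strictly applies to `pct_access + 1`, so the conversion is an approximation for the raw variable.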

Column (2) shows that economic condition, measured by the percentage of students who enjoy free lunch at school, is also correlated with digital learning. 

Column (3) puts the two explanatory variables together and finds that racial composition is a stronger predictor of digital learning than economic condition. In this sample, the digital gap is thus more strongly associated with race than with economic status.

Columns (4)-(6) repeat the same exercise using `logEngage` as the dependent variable, and the results are similar to those for `logAccess`. Putting the results together, school districts of low socio-economic status participate less, and less actively, in digital learning.

Please note that the data come from school districts that had already purchased `LearnPlatform` services. Considering the school districts that had not purchased `LearnPlatform` at all, the true digital gap before the pandemic may be even larger.

### Digital learning during the pandemic

Now we explore how the pandemic affected digital learning over the past year. We restrict our sample to `SchoolWeek`, dropping observations from the summer vacation.

We use fixed-effect regression to control for possible confounders. We include state fixed effects, i.e. a dummy variable for each state, to absorb the effect of unobserved state-level time-invariant characteristics, such as culture and habits shared by the state's residents. We also include daily fixed effects, i.e. a dummy variable for each day, to capture the time trend common to all school districts.

The following table displays the result:

In [None]:
# Specify the models
model1 = smf.ols(formula = "logAccess~logCovid+pctblack_hispanic+pct_free_reduced+C(time)+C(state)", data = SchoolWeek)
model2 = smf.ols(formula = "logEngage~logCovid+pctblack_hispanic+pct_free_reduced+C(time)+C(state)", data = SchoolWeek)
# Fit the models with heteroskedasticity-robust (HC1) standard errors
result1 = model1.fit(cov_type = 'HC1')
result2 = model2.fit(cov_type = 'HC1')
# Report the results
stargazer = Stargazer([result1,result2])
stargazer.custom_columns(["logAccess","logEngage"],[1,1])
stargazer.covariate_order(["logCovid","pctblack_hispanic","pct_free_reduced"])
stargazer.show_degrees_of_freedom(False)
stargazer.add_line("Daily Fixed Effect",["Yes","Yes"])
stargazer.add_line("State Fixed Effect",["Yes","Yes"])
stargazer

The pandemic significantly drives up the demand for online learning. Holding other factors constant, an increase in daily confirmed cases is associated not only with a higher percentage of students using digital learning, but also with higher engagement when studying online.

As before, racial composition and economic status also matter for digital learning. School districts with a higher White/Asian student share and higher income have more students using digital learning, and those students are more engaged online.

Please note that even though we have tried our best to rule out confounders, the estimates above may still be biased. On the one hand, since we included daily fixed effects to capture the common time trend, the effect of national public health measures is absorbed, resulting in a downward bias in the estimated effect of the pandemic. On the other hand, there may be other unobserved school district characteristics, such as internet penetration, that make us overestimate the effect of racial composition and economic conditions.
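
The idea behind group and daily fixed effects can be sketched without any regression library. In a balanced panel, including a dummy for every group and every day gives the same slope as regressing the double-demeaned outcome on the double-demeaned regressor. The toy panel and all names below are assumptions for illustration; the notebook itself uses dummy variables through `C(...)` in the formula, and its panel need not be balanced, in which case simple double demeaning is not exact.

```python
def two_way_within_slope(panel):
    """Slope of y on x after removing group and day means (balanced panel only).

    `panel` maps (group, day) -> (x, y).
    """
    groups = sorted({g for g, _ in panel})
    days = sorted({d for _, d in panel})

    def demean(idx):
        # idx = 0 selects x, idx = 1 selects y
        total = sum(v[idx] for v in panel.values()) / len(panel)
        g_mean = {g: sum(panel[g, d][idx] for d in days) / len(days) for g in groups}
        d_mean = {d: sum(panel[g, d][idx] for g in groups) / len(groups) for d in days}
        return {k: panel[k][idx] - g_mean[k[0]] - d_mean[k[1]] + total for k in panel}

    x_t, y_t = demean(0), demean(1)
    return sum(x_t[k] * y_t[k] for k in panel) / sum(x_t[k] ** 2 for k in panel)

# Toy data: y = 2*x + group effect + day effect, with no noise.
group_effect = {'A': 5.0, 'B': -3.0, 'C': 1.0}
day_effect = {0: 0.0, 1: 4.0, 2: -2.0}
panel = {}
for gi, g in enumerate(group_effect):
    for d in day_effect:
        x = float((gi + 1) * d * d + gi)  # varies within groups and within days
        panel[g, d] = (x, 2.0 * x + group_effect[g] + day_effect[d])

print(round(two_way_within_slope(panel), 6))
```

The group and day effects are large relative to the slope, yet the within estimator recovers the slope because both sets of effects are differenced out.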

### Covid and Widening Digital Gap

Digital learning helps students pull through the pandemic, but not everyone enjoys the same access. The results below show that the pandemic has widened the pre-existing digital gap.

We use the two-way fixed effect model common in econometrics, with an additional interaction term. For school district $i$ at date $t$,

$$
\log{\text{Access}}_{it} = \alpha \log{\text{Covid}}_{it} + \beta\log{\text{Covid}}_{it}\times\text{pctblack_hispanic}_{i} + \delta_i +\lambda_t + \epsilon_{it}
$$

$\delta_i$ is the school district fixed effect, which controls the direct effect of all school district specific characteristics, observed or unobserved. Notice that the direct effect of race composition and economic status will also be absorbed. $\lambda_t$ captures national common trends. We are interested in $\alpha$ and $\beta$. $\alpha$ stands for the effect of COVID-19 on digital learning, while $\beta$ estimates how school district characteristics will amplify or reduce the effect of COVID-19. To see this more clearly, rewrite the model as:

$$
\log{\text{Access}}_{it} = (\alpha+\beta\cdot\text{pctblack_hispanic}_{i}) \log{\text{Covid}}_{it} + \delta_i +\lambda_t + \epsilon_{it}
$$

So if a school district is 0% black or hispanic, the effect of COVID-19 will be just $\alpha$; if the district is 100% black or hispanic, then the effect of COVID will be $\alpha + \beta$.
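
The rewritten model implies that the effect of COVID-19 is a linear function of the district's Black/Hispanic share. A minimal sketch with hypothetical coefficients (not the estimates reported in the tables) is shown below; `covid_effect` is an illustrative name, and the share is assumed to be measured on a 0-to-1 scale.

```python
def covid_effect(alpha, beta, share):
    """Marginal effect of logCovid for a district whose Black/Hispanic share is `share` (0 to 1)."""
    return alpha + beta * share

# Hypothetical coefficients for illustration only, not estimates from the tables.
alpha, beta = 0.02, -0.07
for share in (0.0, 0.5, 1.0):
    print(share, round(covid_effect(alpha, beta, share), 3))

# Share at which the effect changes sign (only meaningful when beta is non-zero).
print(round(-alpha / beta, 3))
```

With a positive $\alpha$ and a negative $\beta$, the effect is positive for districts below the break-even share and negative above it.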

In [None]:
# Log Access, Spring
model1 = smf.ols(formula = "logAccess~logCovid+logCovid*pctblack_hispanic+C(time)+C(district_id)", data = Spring)
# Log Access, Fall
model2 = smf.ols(formula = "logAccess~logCovid+logCovid*pctblack_hispanic+C(time)+C(district_id)", data = Fall)
# Log Engagement, Spring
model3 = smf.ols(formula = "logEngage~logCovid+logCovid*pctblack_hispanic+C(time)+C(district_id)", data = Spring)
# Log Engagement, Fall
model4 = smf.ols(formula = "logEngage~logCovid+logCovid*pctblack_hispanic+C(time)+C(district_id)", data = Fall)

# Fit the models with heteroskedasticity-robust (HC1) standard errors
result1 = model1.fit(cov_type = 'HC1')
result2 = model2.fit(cov_type = 'HC1')
result3 = model3.fit(cov_type = 'HC1')
result4 = model4.fit(cov_type = 'HC1')
# Report the results
stargazer = Stargazer([result1,result2,result3,result4])
stargazer.custom_columns(["logAccess, Spring","logAccess, Fall","logEngage, Spring","logEngage, Fall"],[1,1,1,1])
stargazer.covariate_order(["logCovid","logCovid:pctblack_hispanic"])
stargazer.rename_covariates({"logCovid:pctblack_hispanic":"logCovid×pctblack_hispanic"})
stargazer.show_degrees_of_freedom(False)
stargazer.add_line("Daily Fixed Effect",["Yes","Yes","Yes","Yes"])
stargazer.add_line("School District Fixed Effect",["Yes","Yes","Yes","Yes"])
stargazer

The table above shows that all $\alpha$ estimates are significantly positive, while all $\beta$ estimates, and all sums $\alpha+\beta$, are negative. While school districts with fewer Black and Hispanic students increased digital learning in response to the pandemic, districts with more Black and Hispanic students are using digital learning *less* than before.

Take column (3) as an example: During the 2020 fall semester, a one log point increase in COVID-19 cases increases the engagement index by 0.02 log points in 0% Black/Hispanic school districts, but decreases it by 0.054 (= 0.074 - 0.020) log points in 100% Black/Hispanic districts.

Comparing columns (1) and (2) with columns (3) and (4), we can see that the result is robust to different measures of digital learning (`pct_access` or `engagement_index`). Comparing column (1) with (2), and (3) with (4), we find that the widened digital gap seems to be persistent: the difference did not converge to zero as the fall semester began. This is consistent with previous economic literature in the sense that groups with low socio-economic status are more vulnerable to negative shocks.

The following table shows that the results are robust to taking economic condition into account.

In [None]:
# Log Access, Spring
model1 = smf.ols(formula = "logAccess~logCovid+logCovid*pctblack_hispanic+logCovid*pct_free_reduced+C(time)+C(district_id)", data = Spring)
# Log Access, Fall
model2 = smf.ols(formula = "logAccess~logCovid+logCovid*pctblack_hispanic+logCovid*pct_free_reduced+C(time)+C(district_id)", data = Fall)
# Log Engagement, Spring
model3 = smf.ols(formula = "logEngage~logCovid+logCovid*pctblack_hispanic+logCovid*pct_free_reduced+C(time)+C(district_id)", data = Spring)
# Log Engagement, Fall
model4 = smf.ols(formula = "logEngage~logCovid+logCovid*pctblack_hispanic+logCovid*pct_free_reduced+C(time)+C(district_id)", data = Fall)

# Fit the models with heteroskedasticity-robust (HC1) standard errors
result1 = model1.fit(cov_type = 'HC1')
result2 = model2.fit(cov_type = 'HC1')
result3 = model3.fit(cov_type = 'HC1')
result4 = model4.fit(cov_type = 'HC1')
#Report the results
stargazer = Stargazer([result1,result2,result3,result4])
stargazer.custom_columns(["logAccess, Spring","logAccess, Fall","logEngage, Spring","logEngage, Fall"],[1,1,1,1])
stargazer.covariate_order(["logCovid","logCovid:pctblack_hispanic","logCovid:pct_free_reduced"])
stargazer.rename_covariates({"logCovid:pctblack_hispanic":"logCovid×pctblack_hispanic","logCovid:pct_free_reduced":"logCovid×pct_free_reduced"})
stargazer.show_degrees_of_freedom(False)
stargazer.add_line("Daily Fixed Effect",["Yes","Yes","Yes","Yes"])
stargazer.add_line("School District Fixed Effect",["Yes","Yes","Yes","Yes"])
stargazer

To sum up, our regression analysis shows that:
- There was a large pre-existing digital gap even before the pandemic.
- On average, COVID-19 has increased the demand for digital learning.
- However, COVID-19 has also widened the pre-existing digital gap. Groups of low socio-economic status reduced online learning after the outbreak.

Why didn't these less privileged students study more on the Internet during the pandemic? Though important, this question cannot be fully answered with the data currently available. Here are some hypotheses to be tested in the future:
- Poorer areas have worse internet connections, making it harder to use online materials. While this is a plausible explanation for the pre-existing digital gap, it cannot explain the decrease after the outbreak.
- Disruption of normal teaching. When courses are paused, students go online less often to study.
- Students' lack of self-discipline may result in a decrease in online learning activities.
- Economic hardship faced by students' families may drive students out of the online classroom.

## Conclusion

The report investigated digital learning and the digital gap during this unprecedented pandemic. As a substitute for offline learning, digital learning has played a more important role than before during this pandemic. However, we also document that access to and engagement with digital learning are highly unequal. This inequality partly stems from the pre-existing digital gap associated with race and income, and COVID-19 has further widened this gap. Minority and less privileged social groups did not exploit the full potential of digital learning, which may lead to persistent learning loss.

#### Policy Suggestions

We believe the decreased usage of digital learning by minority and low-income students should be viewed not as a technological problem but as a problem of the wider social structure. The platform has always been there; the question is why these students did not choose it during the pandemic. In order to better understand the problem and ultimately address education inequality, we suggest:

- Conducting more detailed user behavior surveys in the less privileged school districts. If the students were not studying online, what did they do, and why?
- Empowering local education authorities and teachers. Efficient digital learning needs cooperation from authorities and educators. We may train the local education authorities so that they can use `LearnPlatform` data to discover problems in time and make better policies.
- Building the ecosystem. Unlike learning in school, where the whole study process is monitored and guided by teachers, digital learning faces inherent challenges of organization. We may need to get local authorities, teachers, parents, and students involved, and build a system in which different players can communicate and cooperate to facilitate students' learning.