Questions are:
+ What is the picture of digital connectivity and engagement in 2020?
+ What is the effect of the COVID-19 pandemic on online and distance learning, and how might this also evolve in the future?
+ How does student engagement with different types of education technology change over the course of the pandemic?
+ How does student engagement with online learning platforms relate to different geography? Demographic context (e.g., race/ethnicity, ESL, learning disability)? Learning context? Socioeconomic status?
+ Do certain state interventions, practices or policies (e.g., stimulus, reopening, eviction moratorium) correlate with increases or decreases in online engagement?

### *Import*

In [None]:
import numpy as np

import pandas as pd
import pandas_profiling as pp

import matplotlib.pyplot as plt
%matplotlib inline

#These parameters are common for all plots, moved them here
plt.rcParams['figure.figsize'] = [19, 5]
plt.rcParams['agg.path.chunksize'] = 5000

import plotly.express as px
import plotly.graph_objects as go

import seaborn as sns
import glob
import gc

### *Data description from the source*

#### Product information data
The product file **products_info.csv** includes information about the characteristics of the top 372 products with most users in 2020. The categories listed in this file are part of LearnPlatform's product taxonomy. Data were labeled by our team. Some products may not have labels due to being duplicate, lack of accurate url or other reasons.

| Name  | Description |
|:-----:|:-----------:|
| LP ID | The unique identifier of the product |
| URL   | Web Link to the specific product |
| Product Name | Name of the specific product |
| Provider/Company Name | Name of the product provider |
| Sector(s) | Sector of education where the product is used |
| Primary Essential Function | The basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: 1. LC = Learning & Curriculum, 2. CM = Classroom Management, and 3. SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled |

#### District information data
The district file **districts_info.csv** includes information about the characteristics of school districts, including data from NCES (2018-19), FCC (Dec 2018), and Edunomics Lab. In this data set, we removed the identifiable information about the school districts. We also used an open source tool ARX (Prasser et al. 2020) to transform several data fields and reduce the risks of re-identification. For data generalization purposes some data points are released with a range where the actual value falls under. Additionally, there are many missing data marked as 'NaN' indicating that the data was suppressed to maximize anonymization of the dataset.

| Name | Description |
|:----:|:-----------:|
| district_id |	The unique identifier of the school district |
| state | The state where the district resides in |
| locale | NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. See [Locale Boundaries User's Manual](https://eric.ed.gov/?id=ED577162) for more information |
| pct_black/hispanic | Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data |
| pct_free/reduced | Percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data |
| county_connections_ratio | Ratio (residential fixed high-speed connections over 200 kbps in at least one direction/households) based on the county level data from FCC Form 477 (December 2018 version). See FCC data for more information |
| pptotalraw | Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district |

#### Engagement data
The engagement data are aggregated at school district level, and each file in the folder **engagement_data** represents data from one school district. The 4-digit file name represents *district_id*, which can be used to link to district information in **districts_info.csv**. The *lp_id* can be used to link to product information in **products_info.csv**.

| Name | Description |
|:----:|:-----------:|
| time | date in "YYYY-MM-DD" |
| lp_id | The unique identifier of the product |
| pct_access | Percentage of students in the district who have at least one page-load event of a given product on a given day |
| engagement_index | Total page-load events per one thousand students of a given product and on a given day |

### *Read the data*

In [None]:
products = pd.read_csv('/kaggle/input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')
districts = pd.read_csv('/kaggle/input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')

path = '/kaggle/input/learnplatform-covid19-impact-on-digital-learning/engagement_data'
engagement_pieces = glob.glob(path + '/*.csv')
li = []

for engagement in engagement_pieces:
    df = pd.read_csv(engagement, index_col = None, header = 0)
    district_id = engagement.split('/')[-1].split('.')[0]
    #for local machine used this version of split
    #district_id = engagement.split('/')[1].split('\\')[1].split('.')[0]
    df['district_id'] = district_id
    li.append(df)

engagement = pd.concat(li)
engagement = engagement.reset_index(drop = True)
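
The two `split` variants above differ only in the path separator. Here is a minimal sketch of a separator-independent alternative, assuming the file names have the form `<district_id>.csv`. The helper name `district_id_from_path` and the example paths are hypothetical.

```python
from pathlib import PureWindowsPath

def district_id_from_path(path: str) -> str:
    """Return the file name without its extension (hypothetical helper)."""
    # PureWindowsPath accepts both '/' and '\\' as separators,
    # so the same call handles POSIX-style and Windows-style paths.
    return PureWindowsPath(path).stem

# Example paths are illustrative only.
print(district_id_from_path('/kaggle/input/engagement_data/1000.csv'))
print(district_id_from_path('engagement_data\\1000.csv'))
```

Inside the notebook itself, `pathlib.Path(engagement).stem` would probably be enough, since it follows the conventions of whichever machine runs the code.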

#### Now we have 3 distinct datasets in memory. Let's look at each and decide what to do next.

In [None]:
pp.ProfileReport(products)

In [None]:
pp.ProfileReport(districts)

In [None]:
pp.ProfileReport(engagement, minimal = True)

#### Pandas profiling is a very useful package, though heavy and resource-consuming. It shows all the statistics about our datasets, so there is no need to add anything else. Just the conclusions:

* **LP ID** has different names in **products** and **engagement**
* **district_id** is represented as float (*real number*) though *int* will be enough - ID can not be a fraction
* **time** is categorical, but must be a datetime format
* there are bad brackets in **districts**, for example in **pct_black/hispanic** column

In [None]:
#rename LP ID
products = products.rename({'LP ID':'lp_id'}, axis = 1)

#change data types
engagement['district_id'] = engagement['district_id'].astype(int)
engagement['time'] = pd.to_datetime(engagement['time'])

#fix brackets direction
#raw string avoids an invalid escape sequence warning for '\['
districts = districts.replace(r'\[$', ']', regex = True)
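
To illustrate what the bracket fix does, here is a small sketch on made-up sample values. The assumption is that the "bad brackets" are range labels ending in `[` instead of `]`.

```python
import pandas as pd

# Assumed sample values: ranges whose closing bracket points the wrong way.
sample = pd.DataFrame({'pct_black/hispanic': ['[0, 0.2[', '[0.8, 1[']})

# r'\[$' matches a '[' only at the very end of the string, so the
# opening bracket at the start of each label is left untouched.
fixed = sample.replace(r'\[$', ']', regex=True)
print(fixed['pct_black/hispanic'].tolist())
```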

### *Missing values*

#### Engagement index

engagement_index is highly skewed (see the profiler's report on engagement), so the median is a better fill value than the mean.

In [None]:
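median = engagement['engagement_index'].median()
#assign back instead of inplace = True, which newer pandas versions may not apply to the frame
engagement['engagement_index'] = engagement['engagement_index'].fillna(median)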
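
The reasoning behind choosing the median can be sketched on synthetic, right-skewed numbers. These are not values from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic values: mostly small, one very large outlier, one missing.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 1000.0, np.nan])

mean_value = s.mean()      # pulled upward by the single outlier
median_value = s.median()  # depends only on the middle of the sorted values

# Fill the gap with the median and assign the result to a new Series.
filled = s.fillna(median_value)
print(mean_value, median_value, filled.isna().sum())
```

With a long right tail, the mean would fill gaps with a value far above what most rows contain, while the median stays close to the typical row.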

#### State and locale

I think missing state and locale values can also disrupt further analysis. These rows must be filled somehow, because I want to aggregate data by state later.

In [None]:
districts['state'] = districts['state'].fillna('Other')
districts['locale'] = districts['locale'].fillna('unidentified')

#### Educational products

Because I am interested in educational products, I will delete rows with NaNs in pct_access (engagement) and in the last two columns of products (Sector(s) and Primary Essential Function).

In [None]:
products = products.dropna(subset = ['Sector(s)', 'Primary Essential Function'])
engagement = engagement.dropna(subset = ['pct_access'])

#### Other columns

I will not touch other columns (percentages, county conn ratios), because this data must be real, not synthetic.

### *Question 1. What is the picture of digital connectivity and engagement in 2020? Engagement distribution plot will be the answer*

In [None]:
plt.plot(engagement['time'], engagement['engagement_index'])
plt.title('Engagement Index in 2020')
plt.xlabel('Time')
plt.ylabel('Index')
plt.show()
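
The plot above draws a line through every row of the table. A possible alternative, sketched here on synthetic data, is to aggregate to one value per day first. The choice of the mean and of a 7-day window are assumptions, not something the dataset prescribes.

```python
import pandas as pd

# Synthetic stand-in for the engagement table: several rows per day.
demo = pd.DataFrame({
    'time': pd.to_datetime(['2020-01-01', '2020-01-01',
                            '2020-01-02', '2020-01-02']),
    'engagement_index': [10.0, 30.0, 100.0, 300.0],
})

# One value per day: the mean across all rows recorded on that day.
daily_mean = demo.groupby('time')['engagement_index'].mean()

# A rolling mean smooths any weekly cycle that may be present.
smoothed = daily_mean.rolling(window=7, min_periods=1).mean()
print(daily_mean)
print(smoothed)
```

Passing `daily_mean.index` and `daily_mean.values` to `plt.plot` would then give a single curve instead of overlapping segments.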

This plot has several zones:
* End of 2020-02 - first peak
* First days of 2020-03 - fall
* From mid-2020-03 to 2020-07 - the index decreases
* From 2020-07 to 2020-08 - a zone with a low index
* From 2020-08 to 2020-09 - an increasing zone ending in a peak
* From 2020-09 to 2021-01 - a period with a consistently high index
* End of 2020 - a local minimum around Christmas and New Year

My next step is to compare this to the course of the COVID-19 pandemic.

#### Some conclusions before the 2nd part of the analysis.

Thanks for the comments; I will answer them directly in the notebook. 
Today I realized that we have no data for the years before COVID-19, so conclusions based only on this dataset may be incorrect. I looked at school schedules, and in general the behavior of this curve can be explained without any influence of COVID-19.
* [United States School Calendar 2021 and 2022](https://publicholidays.us/school-holidays/)

I do not have access to the calendars for previous years, but they are probably similar. So let's look at some districts, chosen at random:
* [Madison City Schools in Alabama](https://publicholidays.us/school-holidays/alabama/madison-city-schools/)
* [Aurora Public Schools in Colorado](https://publicholidays.us/school-holidays/colorado/aurora-public-schools/)
* [Bethel School District in Oregon](https://publicholidays.us/school-holidays/oregon/bethel-school-district/)
* [Washington County Public Schools in Maryland](https://publicholidays.us/school-holidays/maryland/washington-county-public-schools/)

All these schedules have some similarities:
* **Start of school** - from the first ten days of August to mid-September.
* **End of school** - from the last days of May to mid-June. So the low-index period from 2020-07 to 2020-08 can be explained by summer vacation.
* **Thanksgiving Break and Christmas Break** - both are visible in the plot.
* **The period from January to March** can be explained by a lack of interest in online courses after the New Year holidays. This must be compared to other years.
* **Peak at the end of August** - start of school: students are choosing what to learn and may visit the same pages repeatedly until they have made their choices and started learning; after that the plot becomes almost horizontal.

Right now I cannot explain the fall in the first ten days of March.

### *Question 2. What is the effect of the COVID-19 pandemic on online and distance learning, and how might this also evolve in the future?*

This question is skipped for now - I'm not sure there is any relation between COVID-19 and engagement.
Before answering the remaining questions, I'll make some transformations to the data.

In [None]:
#clean up a little
gc.collect()

### *Question 3. How does student engagement with different types of education technology change over the course of the pandemic?*

Here are plots of engagement over time for 5 sector groups.

In [None]:
engagement = engagement.merge(districts, on = 'district_id').merge(products, on = 'lp_id').sort_values(by = ['time'])

In [None]:
engagement.head()
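
`merge` performs an inner join by default, so engagement rows without a matching product or district are dropped without notice. Here is a minimal sketch of how the loss could be counted, using synthetic stand-in tables.

```python
import pandas as pd

# Synthetic stand-ins; the real tables have more columns.
eng = pd.DataFrame({'lp_id': [1, 2, 3],
                    'engagement_index': [5.0, 7.0, 9.0]})
prod = pd.DataFrame({'lp_id': [1, 2],
                     'Sector(s)': ['PreK-12', 'Corporate']})

# how='left' keeps every engagement row; indicator=True adds a '_merge'
# column that records whether a matching product row was found.
checked = eng.merge(prod, on='lp_id', how='left', indicator=True)
unmatched = int((checked['_merge'] == 'left_only').sum())
print(unmatched)
```

Comparing this count with the total row count shows how much of the engagement data the later plots leave out.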

In [None]:
plt.plot(engagement.loc[engagement['Sector(s)'] == 'PreK-12']['time'],
         engagement.loc[engagement['Sector(s)'] == 'PreK-12']['engagement_index'])
plt.title('Engagement Index in 2020 - PreK-12')
plt.xlabel('Time')
plt.ylabel('Index')
plt.show()

In [None]:
plt.plot(engagement.loc[engagement['Sector(s)'] == 'PreK-12; Higher Ed; Corporate']['time'],
         engagement.loc[engagement['Sector(s)'] == 'PreK-12; Higher Ed; Corporate']['engagement_index'])
plt.title('Engagement Index in 2020 - PreK-12; Higher Ed; Corporate')
plt.xlabel('Time')
plt.ylabel('Index')
plt.show()

In [None]:
plt.plot(engagement.loc[engagement['Sector(s)'] == 'PreK-12; Higher Ed']['time'],
         engagement.loc[engagement['Sector(s)'] == 'PreK-12; Higher Ed']['engagement_index'])
plt.title('Engagement Index in 2020 - PreK-12; Higher Ed')
plt.xlabel('Time')
plt.ylabel('Index')
plt.show()

In [None]:
plt.plot(engagement.loc[engagement['Sector(s)'] == 'Corporate']['time'],
         engagement.loc[engagement['Sector(s)'] == 'Corporate']['engagement_index'])
plt.title('Engagement Index in 2020 - Corporate')
plt.xlabel('Time')
plt.ylabel('Index')
plt.show()

In [None]:
plt.plot(engagement.loc[engagement['Sector(s)'] == 'Higher Ed; Corporate']['time'],
         engagement.loc[engagement['Sector(s)'] == 'Higher Ed; Corporate']['engagement_index'])
plt.title('Engagement Index in 2020 - Higher Ed; Corporate')
plt.xlabel('Time')
plt.ylabel('Index')
plt.show()

* The first 3 have similar patterns. They all include **PreK-12**, which influences the shapes of the plots. 
* **Corporate** education differs a little from them.
* The index ranges also differ: a maximum of about 160k for the first type and only about 1000 for the last.
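
Because several labels combine sectors in one string, one option is to split them so that each sector gets its own series. The sketch below uses synthetic data and assumes the separator is `'; '`, as in the labels used above. Note that this counts a multi-sector product once in every sector it lists, which is a different grouping from the five plots.

```python
import pandas as pd

# Synthetic stand-in for the merged engagement table.
demo = pd.DataFrame({
    'time': pd.to_datetime(['2020-01-01'] * 3),
    'Sector(s)': ['PreK-12', 'PreK-12; Higher Ed', 'Corporate'],
    'engagement_index': [100.0, 50.0, 10.0],
})

# Split the combined label into a list, then give each sector its own row.
demo['sector'] = demo['Sector(s)'].str.split('; ')
per_sector = demo.explode('sector')

# One column per sector, one row per day (mean engagement).
by_sector = per_sector.pivot_table(index='time', columns='sector',
                                   values='engagement_index', aggfunc='mean')
print(by_sector)
```

A single `by_sector.plot()` call could then replace the separate plotting cells.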

##### I have no data to check my hypothesis about the school schedule, but the last plot, which contains only Higher Ed and Corporate courses, also has a similar shape. That fact cannot be explained by summer vacations and holidays alone.

### *Question 4. How does student engagement with online learning platforms relate to different geography? Demographic context (e.g., race/ethnicity, ESL, learning disability)? Learning context? Socioeconomic status?*

#### Geography

In [None]:
plt.plot(engagement.loc[engagement['state'] == 'Connecticut']['time'],
         engagement.loc[engagement['state'] == 'Connecticut']['engagement_index'])
plt.title('Engagement Index in 2020 - Connecticut')
plt.xlabel('Time')
plt.ylabel('Index')
plt.show()

In [None]:
plt.plot(engagement.loc[engagement['state'] == 'Utah']['time'],
         engagement.loc[engagement['state'] == 'Utah']['engagement_index'])
plt.title('Engagement Index in 2020 - Utah')
plt.xlabel('Time')
plt.ylabel('Index')
plt.show()

In [None]:
plt.plot(engagement.loc[engagement['state'] == 'Other']['time'],
         engagement.loc[engagement['state'] == 'Other']['engagement_index'])
plt.title('Engagement Index in 2020 - Other states')
plt.xlabel('Time')
plt.ylabel('Index')
plt.show()

##### There is no need to continue with geography this way: the patterns are similar (though the maximum index differs), unless we want detailed information about a particular state. States should be compared using some other type of plot.

In [None]:
plt.plot(engagement.groupby('state').agg({'engagement_index': ['mean']}))
plt.xticks(rotation = 90)
plt.show()

In [None]:
eng_mean = engagement.groupby('state').agg({'engagement_index': ['mean']})
eng_mean.columns = ['engagement_mean']
#codes follow the alphabetical order of the state names returned by groupby
eng_mean['code'] = ['AZ', 'CA', 'CT', 'DC', 'FL', 'IL', 'IN', 'MA', 'MI', 'MN', 'MO', 'NH', 'NJ', 'NY', 'NC', 'ND', 'OH', 'Other', 'TN', 'TX', 'UT', 'VA', 'WA', 'WI']
#drop returns a new frame, so keep the result
eng_mean = eng_mean.drop('Other')
eng_mean
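
The hard-coded list of codes above depends on the states appearing in exactly that order. A name-based mapping is less fragile. The sketch below uses a deliberately partial dictionary and synthetic means; the names `STATE_CODES` and `eng_mean_demo` are hypothetical.

```python
import pandas as pd

# Partial, illustrative mapping; extend it to every state in the data.
STATE_CODES = {'Connecticut': 'CT', 'Utah': 'UT', 'Illinois': 'IL'}

# Synthetic stand-in for the per-state means, in arbitrary order.
eng_mean_demo = pd.DataFrame(
    {'engagement_mean': [150.0, 90.0, 120.0]},
    index=pd.Index(['Utah', 'Other', 'Connecticut'], name='state'),
)

# Map by name, so row order no longer matters; names missing from the
# dictionary (here 'Other') become NaN and are dropped.
eng_mean_demo['code'] = eng_mean_demo.index.map(STATE_CODES)
eng_mean_demo = eng_mean_demo.dropna(subset=['code'])
print(eng_mean_demo)
```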

In [None]:
fig = go.Figure(data = go.Choropleth(locations = eng_mean['code'],
                                     z = eng_mean['engagement_mean'].astype(float),
                                     locationmode = 'USA-states',
                                     colorscale = 'Reds',
                                     colorbar_title = 'Mean'))

fig.update_layout(title_text = 'Engagement mean distribution by state',
                  geo_scope = 'usa')

fig.show()

In [None]:
fig = px.bar(engagement.groupby(['state', 
                                  'pct_black/hispanic'])['district_id'].count().reset_index(name = 'Total count'),
      x = 'state', y = 'Total count', color = 'pct_black/hispanic')
fig.update_layout(legend = dict(orientation = 'h',
                                yanchor = 'bottom',
                                y = 1.02,
                                xanchor = 'right',
                                x = .9))
fig.update_xaxes(categoryorder='category ascending')
fig.show()

In [None]:
fig = px.bar(engagement.groupby(['state', 
                                  'pct_free/reduced'])['district_id'].count().reset_index(name = 'Total count'),
      x = 'state', y = 'Total count', color = 'pct_free/reduced')
fig.update_layout(legend = dict(orientation = 'h',
                                yanchor = 'bottom',
                                y = 1.02,
                                xanchor = 'right',
                                x = .9))
fig.update_xaxes(categoryorder='category ascending')
fig.show()

In [None]:
fig = px.bar(engagement.groupby(['state', 
                                  'county_connections_ratio'])['district_id'].count().reset_index(name = 'Total count'),
      x = 'state', y = 'Total count', color = 'county_connections_ratio')
fig.update_layout(legend = dict(orientation = 'h',
                                yanchor = 'bottom',
                                y = 1.02,
                                xanchor = 'right',
                                x = .9))
fig.update_xaxes(categoryorder='category ascending')
fig.show()

I tried to compare the mean engagement index aggregated by state with race/ethnicity and socioeconomic status, but these plots did not show any pattern either.

#### Race/ethnicity

This time I will compare the two extreme ranges - **0 to 0.2** and **0.8 to 1**

In [None]:
plt.plot(engagement.loc[engagement['pct_black/hispanic'] == '[0, 0.2]']['time'],
         engagement.loc[engagement['pct_black/hispanic'] == '[0, 0.2]']['engagement_index'])
plt.title('Engagement Index in 2020 - Less than 20% black/hispanic students')
plt.xlabel('Time')
plt.ylabel('Index')
plt.show()

In [None]:
plt.plot(engagement.loc[engagement['pct_black/hispanic'] == '[0.8, 1]']['time'],
         engagement.loc[engagement['pct_black/hispanic'] == '[0.8, 1]']['engagement_index'])
plt.title('Engagement Index in 2020 - More than 80% black/hispanic students')
plt.xlabel('Time')
plt.ylabel('Index')
plt.show()

##### In this case both the index values and the patterns are similar. In these plots, the share of Black/Hispanic students shows no visible effect on engagement over time.
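
A visual comparison of two plots could be complemented by a number. One rough approach, sketched on synthetic data, converts each range label to its midpoint and computes a Pearson correlation with engagement. The helper `range_midpoint` is hypothetical, and rows with a missing range would need to be dropped before applying it.

```python
import pandas as pd

def range_midpoint(label):
    """Midpoint of a label such as '[0.2, 0.4]' (hypothetical helper)."""
    low, high = label.strip('[]').split(',')
    return (float(low) + float(high)) / 2

# Synthetic stand-in: one row per range label.
demo = pd.DataFrame({
    'pct_black/hispanic': ['[0, 0.2]', '[0.2, 0.4]', '[0.8, 1]'],
    'engagement_index': [120.0, 110.0, 115.0],
})
demo['pct_mid'] = demo['pct_black/hispanic'].map(range_midpoint)

# Pearson correlation between range midpoint and engagement.
corr = demo['pct_mid'].corr(demo['engagement_index'])
print(demo['pct_mid'].tolist(), corr)
```

Since the ranges are coarse, a coefficient near zero would only suggest the absence of a linear relationship, not prove that there is none.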

### *Some more visualizations*

#### Overall engagement variability: race/ethnicity

In [None]:
fig, ax1 = plt.subplots()
ax1.plot(engagement.groupby(['pct_black/hispanic']).agg({'engagement_index': ['mean']}), color = 'red')
ax1.tick_params(axis='y', labelcolor = 'red')
ax2 = ax1.twinx()
ax2.plot(engagement.groupby(['pct_black/hispanic']).agg({'engagement_index': ['count']}), color = 'blue')
ax2.tick_params(axis='y', labelcolor = 'blue')
plt.show()
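
The mean and the count plotted above come from two separate `groupby` calls. They could also be computed in one pass with named aggregation, sketched here on synthetic data. The output column names `engagement_mean` and `row_count` are arbitrary choices.

```python
import pandas as pd

# Synthetic stand-in for the merged engagement table.
demo = pd.DataFrame({
    'pct_black/hispanic': ['[0, 0.2]', '[0, 0.2]', '[0.8, 1]'],
    'engagement_index': [100.0, 200.0, 90.0],
})

# Named aggregation: one groupby pass, flat column names.
summary = demo.groupby('pct_black/hispanic').agg(
    engagement_mean=('engagement_index', 'mean'),
    row_count=('engagement_index', 'count'),
)
print(summary)
```

Each column of `summary` can then be passed to its own axis, as in the twin-axis plot above.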

#### Socioeconomic

In [None]:
fig, ax1 = plt.subplots()
ax1.plot(engagement.groupby(['pct_free/reduced']).agg({'engagement_index': ['mean']}), color = 'red')
ax1.tick_params(axis='y', labelcolor = 'red')
ax2 = ax1.twinx()
ax2.plot(engagement.groupby(['pct_free/reduced']).agg({'engagement_index': ['count']}), color = 'blue')
ax2.tick_params(axis='y', labelcolor = 'blue')
plt.show()

#### County connections

In [None]:
#This one does not look interesting
plt.plot(engagement.groupby(['county_connections_ratio']).agg({'engagement_index': ['mean']}))
plt.show()

#### Engagement by state and black/hispanic percentage

In [None]:
sns.catplot(x = 'pct_black/hispanic', 
            y = 'engagement_index',
            col = 'state',
            col_wrap = 2,
            hue = 'pct_black/hispanic', 
            data = engagement.groupby(['state', 'pct_black/hispanic'])['engagement_index'].mean().reset_index(),
            kind = 'bar')

#### Engagement by state and socioeconomic status

In [None]:
sns.catplot(x = 'pct_free/reduced', 
            y = 'engagement_index',
            col = 'state',
            col_wrap = 2,
            hue = 'pct_free/reduced', 
            data = engagement.groupby(['state', 'pct_free/reduced'])['engagement_index'].mean().reset_index(),
            kind = 'bar')