# COVID-19 Impact on Digital Learning
#### Analyzing the impact of COVID-19 on student learning
---

<img src = "https://image.freepik.com/vektoren-kostenlos/online-bildungsinfografiken-mit-bearbeitbarer-textillustration_1284-57254.jpg" width='100%'>

# 1. INTRODUCTION

# 1.1 Problem Statement

The COVID-19 pandemic has disrupted learning for more than 56 million students in the United States. In the spring of 2020, most states and local governments across the U.S. closed educational institutions to stop the spread of the virus. In response, schools and teachers have attempted to reach students remotely through distance learning tools and digital platforms. To this day, concerns about an exacerbated digital divide and long-term learning loss among America’s most vulnerable learners continue to grow.

# 1.2 Challenge

**(1) Exploring the state of digital learning in 2020**

**(2) Exploring how the engagement of digital learning relates to factors such as district demographics, broadband access, and state/national level policies and events**

Questions:
   - What is the picture of digital connectivity and engagement in 2020?
   - What is the effect of the COVID-19 pandemic on online and distance learning, and how might this also evolve in the future?
   - How does student engagement with different types of education technology change over the course of the pandemic?
   - How does student engagement with online learning platforms relate to different geography? Demographic context (e.g., race/ethnicity, ESL, learning disability)? Learning context? Socioeconomic status?
   - Do certain state interventions, practices, or policies (e.g., stimulus, reopening, eviction moratorium) correlate with an increase or decrease in online engagement?

# 1.3 Data Description

- **`engagement_data`** - 235 csv files based on LearnPlatform’s Student Chrome Extension. The extension collects page-load events of over 10K education technology products in LearnPlatform’s product library, including websites, apps, web apps, software programs, extensions, ebooks, hardware, and services used in educational institutions. Each file represents data from one school district. The 4-digit file name represents the `district_id`.
    - `time`: date in "YYYY-MM-DD"
    - `lp_id`: The unique identifier of the product
    - `pct_access`: Percentage of students in the district who have at least one page-load event for a given product on a given day
    - `engagement_index`: Total page-load events per one thousand students for a given product on a given day
    
    
- **`districts_info.csv`** - information about the characteristics of school districts, including data from NCES and FCC
    - `district_id`: unique identifier of the school district
    - `state`: The state in which the district is located
    - `locale`: NCES locale classification that categorizes U.S. territory into four types of areas: **City, Suburban, Town, and Rural**.
    - `pct_black/hispanic`: Percentage of students in the districts identified as Black or Hispanic
    - `pct_free/reduced`: Percentage of students in the districts eligible for free or reduced-price lunch 
    - `county_connections_ratio`: ratio (residential fixed high-speed connections over 200 kbps in at least one direction/households)
    - `pp_total_raw`: Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD) project. The expenditure data are school-by-school. The median value is used to represent the expenditure of a given school district.
    

- **`products_info.csv`** - information about the characteristics of the top 372 products with most users in 2020

    - `LP ID`: The unique identifier of the product
    - `URL`: Web Link to the specific product
    - `Product Name`: Name of the specific product
    - `Provider/Company Name`: Name of the product provider
    - `Sector(s)`: Sector of education where the product is used
    - `Primary Essential Function`: The basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories followed by sub-categories: 
        - LC = Learning & Curriculum
        - CM = Classroom Management
        - SDO = School & District Operations

# 2. IMPORTING LIBRARIES

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
sns.set(style="darkgrid")
import matplotlib.pyplot as plt
#plt.style.use('ggplot')
from matplotlib.pyplot import figure
from scipy import stats
import glob

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 3. LOADING AND VIEWING DATA

In [None]:
districts = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")
products = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv")

In [None]:
# Checking number of rows and columns and the first three rows of both dataframes
print('DISTRICTS - Rows and Columns:',districts.shape)
display(districts.head(3))
print('\nPRODUCTS - Rows and Columns:',products.shape)
display(products.head(3))

In [None]:
# Putting all CSV files of the 'engagement_data' folder into one dataframe
# Then checking the number of rows and columns and the first rows of the dataframe
path = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data' 
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    district_id = os.path.splitext(os.path.basename(filename))[0]  # file name without extension, independent of folder depth
    df["district_id"] = district_id
    li.append(df)
    
engagement = pd.concat(li)
engagement = engagement.reset_index(drop=True)
print('ENGAGEMENT DATA - Rows and Columns:', engagement.shape)
engagement.head(3)
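
The loading step above can also be wrapped in a small function that takes the district id from the file name itself rather than from a position in the path. This is a minimal sketch, not the notebook's own method: `load_engagement` is a hypothetical helper, and the two demo files are made up so that the block runs without the Kaggle input folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_engagement(folder):
    """Read every CSV in `folder` and tag its rows with the district id from the file name."""
    frames = []
    for csv_path in sorted(Path(folder).glob("*.csv")):
        frame = pd.read_csv(csv_path)
        frame["district_id"] = int(csv_path.stem)  # file name without ".csv"
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration with two made-up district files.
header = "time,lp_id,pct_access,engagement_index\n"
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "1000.csv").write_text(header + "2020-01-01,1,0.5,10.0\n")
    Path(tmp, "2000.csv").write_text(header + "2020-01-01,2,1.5,20.0\n")
    demo_engagement = load_engagement(tmp)

print(demo_engagement)
```

Converting the id to an integer at load time would also make the later type conversion before merging unnecessary, assuming every file name is purely numeric.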

In [None]:
# Looking at column names, data types and a first overview of missing data
# (info() prints its report directly, so no display() is needed)
print('DISTRICTS:\n')
districts.info()
print('\n\nPRODUCTS:\n')
products.info()
print('\n\nENGAGEMENT:\n')
engagement.info()

In [None]:
print('DISTRICTS:')
display(districts.describe(include = 'all'))
print('PRODUCTS:')
display(products.describe(include = 'all'))
print('ENGAGEMENT:')
display(engagement.describe(include = 'all'))

# 4. DATA WRANGLING

# 4.1 Standardization
#### Unifying the column names by renaming them so that all of them are lowercase and shorter to work with.

In [None]:
products.columns = products.columns.str.lower()
list(products)

In [None]:
products.rename(columns={'lp id':'lp_id','product name':'product_name', 'provider/company name':'company_name','sector(s)': 'sectors', 'primary essential function':'function'}, inplace=True)
list(products)

In [None]:
districts.rename(columns={'pct_black/hispanic':'pct_black_hispanic','pct_free/reduced':'pct_free_reduced'}, inplace=True)
list(districts)

# 4.2 Cleaning Data

## 4.2.1 Duplicate Values

In [None]:
# Identifying duplicate values 
print('DUPLICATES\nDistricts:\t',districts.duplicated().sum(),'\nProducts:\t', products.duplicated().sum(),'\nEngagement:\t', engagement.duplicated().sum())

#### No duplicate values in the dataframes.

## 4.2.2 Missing Values

### 4.2.2.1 Identifying Missing Values

In [None]:
# Finding the amount of missing values in each column
print('\nMISSING VALUES IN PRODUCTS:')
print(products.isnull().sum().sort_values(ascending = False))
print('\nMISSING VALUES IN DISTRICTS:')
print(districts.isnull().sum().sort_values(ascending = False))
print('\nMISSING VALUES IN ENGAGEMENT:')
print(engagement.isnull().sum().sort_values(ascending = False))

In [None]:
print('\nPERCENTAGE OF MISSING VALUES IN PRODUCTS:')
for col in products.columns:
    missing_products = np.mean(products[col].isnull())
    print('{}:  {:.2f}%'.format(col, missing_products*100))

print('\nPERCENTAGE OF MISSING VALUES IN DISTRICTS:')
for col in districts.columns:
    missing_districts = np.mean(districts[col].isnull())
    print('{}:  {:.2f}%'.format(col, missing_districts*100))

print('\nPERCENTAGE OF MISSING VALUES IN ENGAGEMENT:')
for col in engagement.columns:
    missing_engagement = np.mean(engagement[col].isnull())
    print('{}:  {:.2f}%'.format(col, missing_engagement*100))
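
The three loops above compute the mean of a boolean mask per column. As a compact alternative, the same percentages can be obtained for a whole dataframe at once. The sketch uses a toy dataframe and a hypothetical helper name, `missing_percentages`.

```python
import pandas as pd


def missing_percentages(frame: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first."""
    return (frame.isnull().mean() * 100).sort_values(ascending=False)


# Toy frame standing in for one of the notebook's dataframes.
toy = pd.DataFrame({
    "state": ["Utah", None, "Ohio", None],
    "locale": ["City", "Town", None, "Rural"],
})
toy_missing = missing_percentages(toy)
print(toy_missing.round(2))
```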

### 4.2.2.2 Handling Missing Values

***
#### PRODUCTS
***



In [None]:
# Deleting rows with missing values in the 'function' and 'sectors' columns of 'products', as these two values cannot be reconstructed
products.dropna(subset=["function", "sectors"], axis=0, inplace=True)
products.reset_index(inplace=True, drop=True)
products.isnull().sum()

#### There are still missing values in `url`, but this column is not relevant for the analysis.

In [None]:
# Deleting the column 'url'
products.drop("url", axis=1, inplace=True)
list(products)

***
#### DISTRICTS
***

In [None]:
# Rows of missing values in 'state'
districts[districts.state.isnull()][:10]

In [None]:
# Deleting rows with missing values in 'state', 'pp_total_raw' or 'pct_free_reduced', as rows without these values are of no use for the EDA
districts.dropna(subset=["state", "pp_total_raw", "pct_free_reduced"], axis=0, inplace=True)
districts.reset_index(inplace=True, drop=True)
districts.isnull().sum().sort_values(ascending=False)

In [None]:
# Checking the unique values of 'county_connections_ratio' (including missing ones) in order to decide how to replace the missing values
districts['county_connections_ratio'].value_counts(dropna=False)

#### After cleaning the data, there is only one unique value left in `county_connections_ratio`. The missing data could be replaced by it, but a column with a single value adds no information to the analysis, so I decided to delete it.

In [None]:
# Deleting the column
districts.drop("county_connections_ratio", axis=1, inplace=True)
districts.head(3)
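
The decision to drop the column rests on it holding a single distinct value. A minimal sketch of that check on toy data follows; `constant_columns` is a hypothetical helper and the values are made up for illustration.

```python
import pandas as pd


def constant_columns(frame: pd.DataFrame) -> list:
    """Names of columns with at most one distinct non-missing value."""
    return [col for col in frame.columns if frame[col].nunique(dropna=True) <= 1]


# Made-up stand-in for the cleaned 'districts' dataframe.
toy_districts = pd.DataFrame({
    "locale": ["City", "Rural", "Town"],
    "county_connections_ratio": ["[0.18, 1[", None, "[0.18, 1["],
})
uninformative = constant_columns(toy_districts)
print(uninformative)
```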

***
#### ENGAGEMENT
***

In [None]:
engagement.isnull().sum().sort_values(ascending=False)

In [None]:
# Rows of missing values in 'pct_access'
engagement[engagement.pct_access.isnull()][:10]

As the values in `engagement_index` and in the unique identifier `lp_id` are important for the EDA, I decided to delete the rows in which they are missing.

In [None]:
# Deleting rows of missing values in these columns
engagement.dropna(subset=["engagement_index", "lp_id"], axis=0, inplace=True)
engagement.isnull().sum()

## 4.2.3 Simplification and Splitting columns

***
### PRODUCTS
***

In [None]:
# Splitting the column 'function' into two separate columns:
# the main category ('function_cat') and the sub-category ('function_sub')
new = products["function"].str.split("-", n = 1, expand = True)
products["function_cat"]= new[0].str.strip()  # strip the spaces around the separator
products["function_sub"]= new[1].str.strip()
products.drop(columns =["function"], inplace = True)
products.head(3)
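
The `n = 1` argument matters because a label can contain more than one hyphen: only the first hyphen separates the category from the sub-category. The sketch below illustrates this on made-up labels that follow the "category - sub-category" pattern described in the data description.

```python
import pandas as pd

# Made-up labels; the second one contains a second hyphen inside the sub-category.
toy_products = pd.DataFrame({
    "function": [
        "LC - Digital Learning Platforms",
        "CM - Classroom Engagement & Instruction - Classroom Management",
        "LC/CM/SDO - Other",
    ]
})

# Split on the first hyphen only, then trim the surrounding spaces.
parts = toy_products["function"].str.split("-", n=1, expand=True)
toy_products["function_cat"] = parts[0].str.strip()
toy_products["function_sub"] = parts[1].str.strip()
print(toy_products[["function_cat", "function_sub"]])
```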

In [None]:
products.sectors.value_counts()

In [None]:
# Flagging the sectors each product is used in (1 = used in this sector, 0 = not used)
products['prek_12'] = products['sectors'].str.contains('PreK-12', regex=False, na=False).astype(int)
products['higher_ed'] = products['sectors'].str.contains('Higher Ed', regex=False, na=False).astype(int)
products['corporate'] = products['sectors'].str.contains('Corporate', regex=False, na=False).astype(int)

products.head()
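
The indicator columns built above can also be derived in one step with `Series.str.get_dummies`, which splits each cell on a separator and creates one 0/1 column per distinct label. This is a sketch on toy values, assuming the labels are separated by a semicolon followed by a space.

```python
import pandas as pd

# Toy 'sectors' values, assuming "; " as the separator.
toy_sectors = pd.Series(
    ["PreK-12", "PreK-12; Higher Ed", "PreK-12; Higher Ed; Corporate", "Corporate"],
    name="sectors",
)

# One 0/1 column per distinct sector label.
sector_flags = toy_sectors.str.get_dummies(sep="; ")
print(sector_flags)
```

The resulting columns carry the original labels, so they would still need renaming to match names such as `prek_12`.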

***
### DISTRICTS
***

#### As the data in the columns `pct_black_hispanic`, `pct_free_reduced` and `pp_total_raw` are not easily readable or self-explanatory, I change the format.

In [None]:
# Removing the first and last character (the enclosing brackets) from the cells of each column
districts['pct_black_hispanic'] = districts['pct_black_hispanic'].map(lambda x: str(x)[1:-1])
districts['pct_free_reduced'] = districts['pct_free_reduced'].map(lambda x: str(x)[1:-1])
districts['pp_total_raw'] = districts['pp_total_raw'].map(lambda x: str(x)[1:-1])

# new data frame with split value columns
total_raw = districts["pp_total_raw"].str.split(",", n = 1, expand = True)
black_hispanic = districts["pct_black_hispanic"].str.split(",", n = 1, expand = True)
free_reduced = districts["pct_free_reduced"].str.split(",", n = 1, expand = True)
  
# making columns
districts["total_raw_min"]= total_raw[0]
districts["total_raw_max"]= total_raw[1]
# changing data types
districts["total_raw_min"] = pd.to_numeric(districts["total_raw_min"])
districts["total_raw_max"] = pd.to_numeric(districts["total_raw_max"])
# Making a column with the average
sum_total_raw = districts.total_raw_min + districts.total_raw_max
districts['avg_total_raw'] = (sum_total_raw / 2).astype(int)

# making columns
districts["black_hispanic_min"] = black_hispanic[0]
districts["black_hispanic_max"] = black_hispanic[1]
# changing data types
districts["black_hispanic_min"] = pd.to_numeric(districts["black_hispanic_min"]) * 100
districts["black_hispanic_max"] = pd.to_numeric(districts["black_hispanic_max"]) * 100
# Making a column with the average
sum_black_hispanic = districts.black_hispanic_min + districts.black_hispanic_max
districts['avg_black_hispanic'] = (sum_black_hispanic / 2) 

# making columns
districts["free_reduced_min"] = free_reduced[0]
districts["free_reduced_max"] = free_reduced[1]  
# changing data types
districts["free_reduced_min"] = pd.to_numeric(districts["free_reduced_min"]) * 100
districts["free_reduced_max"] = pd.to_numeric(districts["free_reduced_max"]) * 100
# Making a column with the average
sum_free_reduced = districts.free_reduced_min + districts.free_reduced_max
districts['avg_free_reduced'] = (sum_free_reduced / 2)

districts.drop(columns =["pp_total_raw", "pct_black_hispanic", "pct_free_reduced"], inplace = True)
districts.head(3)
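
The transformation above is repeated for three columns: strip the enclosing characters, split at the comma, convert to numbers and average. A minimal sketch of the same idea as one reusable function is shown below. `range_midpoint` is a hypothetical helper, and the bracket style of the toy values ("[low, high[") is an assumption about how the ranges are written.

```python
import pandas as pd


def range_midpoint(text: str) -> float:
    """Midpoint of a range written as "[low, high[" (bracket style is an assumption)."""
    low, high = text.strip("[]").split(",")
    return (float(low) + float(high)) / 2


# Toy values standing in for the three range columns.
toy_ranges = pd.Series(["[0.2, 0.4[", "[0, 0.2[", "[14000, 16000["])
midpoints = toy_ranges.map(range_midpoint)
print(midpoints)
```

Note that the midpoint is only a stand-in for the unknown true value inside each range.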

# 4.3 Merging Dataframes

#### In order to fully analyse the data I decided to merge the dataframes on the unique identifiers `district_id` and `lp_id`. Before doing so, the data types of these columns have to be changed if they are not identical.

In [None]:
# Changing data type from 'object' to 'int64'
engagement["district_id"] = pd.to_numeric(engagement["district_id"])

In [None]:
# Merging dataframes 'district' and 'engagement' on 'district_id'
districts_engagement = pd.merge(districts, engagement, on='district_id', how='left')
districts_engagement.head(3)

In [None]:
# Merging dataframe 'products' and 'districts_engagement' on 'lp_id' to the new dataframe
df = pd.merge(products, districts_engagement, on='lp_id', how='left')
df.head(10)

In [None]:
# Checking if there are any missing values
df.isnull().sum()

In [None]:
# Deleting the rows that have no value for 'district_id'
df.dropna(subset=["district_id"], axis=0, inplace=True)
df.reset_index(inplace=True, drop=True)
df.isnull().sum()
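
Left merges keep every row of the left dataframe, so keys without a partner show up as rows full of missing values, which is why rows had to be dropped afterwards. The sketch below uses the `indicator` option of `pd.merge` on toy data to show where such rows come from; the dataframes and ids are made up.

```python
import pandas as pd

toy_districts = pd.DataFrame({"district_id": [1000, 2000], "state": ["Utah", "Ohio"]})
toy_engagement = pd.DataFrame({"district_id": [1000, 1000, 3000], "lp_id": [1, 2, 1]})

# indicator=True adds a '_merge' column telling where each row came from.
checked = pd.merge(toy_districts, toy_engagement, on="district_id", how="left", indicator=True)

# Districts that have no engagement rows at all.
unmatched_districts = checked.loc[checked["_merge"] == "left_only", "district_id"].tolist()
print(checked)
print(unmatched_districts)
```

Using `how='inner'` instead would drop unmatched keys during the merge itself, at the cost of not seeing which ones were lost.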

# 5. Exploratory Data Analysis (EDA)

## 5.1 Correlation

#### Getting a rough overview of the data by visualizing individual columns and looking for correlations.

In [None]:
df.head()

In [None]:
# Creating dataframe for correlation
df_corr = df[['prek_12', 'higher_ed', 'corporate', 'avg_total_raw', 'avg_black_hispanic', 'avg_free_reduced', 'pct_access', 'engagement_index']]
# Visualisation of the correlation table
correlation = df_corr.corr()
plt.figure(figsize=(14,7))
sns.heatmap(correlation, linecolor='white',linewidths=0.1, annot=True)
plt.title('Correlation Matrix', pad=11, size=17)
plt.xlabel('Digital Learning Data')
plt.ylabel('Digital Learning Data')
plt.show()

In [None]:
# Showing the highest correlations in descending order
correlations = correlation
corr_pairs = correlations.unstack()
sorted_pairs = corr_pairs.sort_values(kind="quicksort", ascending=False).where(corr_pairs < 1.0)
pairs = sorted_pairs[abs(sorted_pairs) >= 0.1]
print(pairs)
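
Because a correlation matrix is symmetric, unstacking it lists every pair twice (a/b and b/a). One way to list each pair only once is to keep just the entries above the diagonal before unstacking. This is a sketch on a toy dataframe, not a change to the analysis above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1.5]})
corr = toy.corr()

# Boolean mask that is True strictly above the diagonal.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Entries outside the mask become NaN and are dropped, so each pair appears once.
unique_pairs = corr.where(upper_mask).unstack().dropna().sort_values(ascending=False)
print(unique_pairs)
```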

## 5.2 Analysing Patterns using Visualisations 


### STATE



#### Where does the data come from? Which state provided the most data?

In [None]:
plt.figure(figsize=(12,10))
sns.countplot(y='state', data=df, order = df['state'].value_counts().index)
plt.title('States and Frequency in Dataset', size=16)
plt.show()

### LOCALE

In [None]:
plt.figure(figsize=(9,7))
sns.countplot(x='locale', data=df, order = df['locale'].value_counts().index)
plt.title('NCES Locale Classification Frequency in Dataset', size=16)
plt.show()

In [None]:
suburb = (df['locale'] == 'Suburb').sum()
rural = (df['locale'] == 'Rural').sum()
city = (df['locale'] == 'City').sum()
town = (df['locale'] == 'Town').sum()
proportions_locale = [suburb, rural, city, town]


plt.figure(figsize=(8,8))
plt.pie(proportions_locale, labels=['Suburb', 'Rural', 'City', 'Town'], startangle=90, autopct='%1.1f%%', shadow=False, explode=(0.03, 0.0, 0, 0))
plt.axis('equal')
plt.title("'NCES' Locale Classification Proportion", size=17)
plt.show()
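
The proportions above are counted on the merged dataframe, where each row is one district/product/day record, so districts with many engagement rows weigh more. The sketch below contrasts row-based shares with district-based shares on made-up data, using `value_counts(normalize=True)`.

```python
import pandas as pd

# Made-up merged data: district 1 contributes three rows, the others one each.
toy = pd.DataFrame({
    "district_id": [1, 1, 1, 2, 3],
    "locale": ["Suburb", "Suburb", "Suburb", "Rural", "City"],
})

# Share of rows per locale (what a count on the merged dataframe measures).
row_share = toy["locale"].value_counts(normalize=True) * 100

# Share of districts per locale (each district counted once).
district_share = toy.drop_duplicates("district_id")["locale"].value_counts(normalize=True) * 100

print(row_share.round(1))
print(district_share.round(1))
```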


### TOTAL RAW

In [None]:
plt.figure(figsize=(12,10))
sns.countplot(y='avg_total_raw', data=df)
plt.title('Expenditure per student and frequency in Dataset', size=17, pad=11)
plt.show()

### BLACK / HISPANIC

In [None]:
plt.figure(figsize=(12,10))
sns.countplot(x='avg_black_hispanic', data=df)
plt.title('Frequency of mean Percentage of black/hispanic pupils in Dataset', size=16)
plt.xlabel('Average Percentage of black/hispanic pupils')
plt.show()

In [None]:
plt.figure(figsize=(21,10))
sns.countplot(x='avg_total_raw', hue='locale',data=df)
plt.title('DISTRIBUTION OF TOTAL MEAN EXPENDITURE PER STUDENT AND AREA (COUNT)', size=16)
plt.xlabel('Total mean expenditure per pupil')
plt.legend(loc='upper right')
plt.show()

In [None]:
# Visualisation of the software's 'function_category' and its amount of usage
plt.figure(figsize=(9,7))
sns.countplot(x='function_cat', data=df)
plt.title('Main Software Functions and Frequency in Dataset', size=16)
plt.xlabel('Software Function Category')
plt.show()


#### Most of the software used falls into the category 'Learning & Curriculum'.

In [None]:
plt.figure(figsize=(12,10))
sns.countplot(y='function_sub', data=df, order = df['function_sub'].value_counts().index[0:10])
plt.title('Top 10 - Software sub-category and Frequency in Dataset', size=19, pad=13)
plt.ylabel('Software function sub-category')
plt.yticks(size=15)
plt.xticks(size=13)
plt.show()

In [None]:
plt.figure(figsize=(9,7))
sns.countplot(x='function_cat', hue='sectors', data=df, order = df['function_cat'].value_counts().index)
plt.title('Software Function Category within the Sectors', size=16)
plt.ylabel('count')
plt.xlabel('Software Function Category')
plt.legend(loc='upper right')
plt.show()

### COMPANY NAME

In [None]:
products_companies = df.groupby('company_name').count()[['product_name']].sort_values(by='product_name', ascending=False)
products_companies.head(10)
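
On the merged dataframe, `count()` returns the number of engagement rows per company, not the number of distinct products a company offers. The sketch below shows the difference on made-up data; the company and product names are placeholders.

```python
import pandas as pd

# Made-up merged data: "Product X" appears in two engagement rows.
toy = pd.DataFrame({
    "company_name": ["Company A", "Company A", "Company A", "Company B"],
    "product_name": ["Product X", "Product X", "Product Y", "Product Z"],
})

# Number of rows per company (what count() measures).
rows_per_company = toy.groupby("company_name")["product_name"].count()

# Number of distinct products per company.
products_per_company = toy.groupby("company_name")["product_name"].nunique()

print(rows_per_company)
print(products_per_company)
```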

In [None]:
plt.figure(figsize=(12,10))
sns.countplot(y='company_name', data=df, order = df['company_name'].value_counts().index[0:10])
plt.title('Frequency of Company Name in Dataset - Top 10', size=17, pad=11)
plt.show()

In [None]:
plt.figure(figsize=(20,10))
sns.countplot(x='sectors', hue='avg_total_raw', data=df, order = df['sectors'].value_counts().index)
plt.title('Total expenditure per pupil and Education Sector', size=19, pad=13)
plt.ylabel('Count', size=15)
plt.yticks(size=13)
plt.xlabel('Sectors', size=15)
plt.xticks(size=13)
plt.legend(loc='upper right')
plt.show()

In [None]:
fig, ax= plt.subplots(1, 2, figsize=(20,7))

sns.countplot(y='state',hue='avg_black_hispanic', data=df, ax=ax[0])
ax[0].set_title('Percentages of black / hispanic pupils', size=17, pad=17)
ax[0].set_ylabel('', size=13, labelpad=11)
ax[0].set_xlabel('count', size=13)
ax[0].legend(loc='lower right', title='Percentages:')


sns.countplot(y='state', hue='avg_free_reduced',data=df, ax=ax[1])
ax[1].set_title('Percentages of pupils with free / reduced lunch', size=17, pad=17)
ax[1].set_ylabel('', size=13, labelpad=11)
ax[1].set_xlabel('count', size=13)
ax[1].legend(loc='lower right', title='Percentages:')

plt.tight_layout()  # keep the labels of the two panels from overlapping
plt.show()