# Data Exploration #

Digital learning has become a popular way for students to learn comfortably from decentralized locations. Its adoption accelerated when COVID-19 brought everything to a standstill in March 2020. However, successful digital learning requires adequate resources, ranging from internet access to online learning materials, which raises the question of equity in education access for all students.
In this project, we examine user engagement in digital learning across all districts based on data collected in 2020. The effects of COVID-19 become notable as we explore the nature of engagement before and after the onset of the pandemic in different districts, while also considering the policies that ensure equity.

In [None]:
# importing the necessary libraries
import pandas as pd
import numpy as np
import os
import glob
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
import wandb

## District Characteristics Data ##
The districts_info.csv file includes information about the characteristics of school districts, including data from NCES and FCC. It has the following columns:

* 'district_id' - the unique identifier of the school district
* 'state' - the state where the district is located
* 'locale' - NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. See the Locale Boundaries User's Manual for more information.
* 'pct_black/hispanic' - percentage of students in the district identified as Black or Hispanic, based on 2018-19 NCES data
* 'pct_free/reduced' - percentage of students in the district eligible for free or reduced-price lunch, based on 2018-19 NCES data
* 'county_connections_ratio' - ratio of residential fixed high-speed connections (over 200 kbps in at least one direction) to households, based on county-level data from FCC Form 477 (December 2018 version)
* 'pp_total_raw' - per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The expenditure data are school-by-school, and the median value is used to represent the expenditure of a given school district.

In [None]:
# reading the datasets
districts = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')
districts.head()

In [None]:
# size of the data
districts.shape

The districts_info dataset has 233 rows and 7 columns.

In [None]:
# List of columns of districts_info dataset
districts.columns.tolist()

## Product Information Data ##
The product file products_info.csv includes information about the characteristics of the top 372 products with the most users in 2020. The categories listed in this file are part of LearnPlatform's product taxonomy.

In [None]:
products = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')
products.head()

In [None]:
# size of the data 
products.shape

The products_info.csv file has 372 rows and 6 columns.

In [None]:
# List of columns
products.columns.tolist()

## Engagement Data ##
The engagement data are aggregated at the school district level, and each file in the engagement_data folder represents data from one school district. The 4-digit file name is the district_id, which can be used to link to district information in districts_info.csv. The lp_id can be used to link to product information in products_info.csv.

In [None]:
# reading the engagement data
path_to_engagement = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data' 
files = glob.glob(path_to_engagement + "/*.csv")

data = []

for filename in files:
    df = pd.read_csv(filename, index_col=None, header=0)
    # the file name without its extension is the district_id (independent of folder depth)
    district_id = os.path.splitext(os.path.basename(filename))[0]
    df["district_id"] = district_id
    data.append(df)
    
engagement_data = pd.concat(data)
engagement_data = engagement_data.reset_index(drop=True)
engagement_data.head()

In [None]:
# size of the data
engagement_data.shape

In [None]:
# the last five rows of the engagement data
engagement_data.tail()

Notably, the engagement data were collected throughout 2020, from January to December.
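
The date range can be checked directly rather than read off the head and tail of the table. The following is a minimal sketch on a made-up frame; it assumes the 'time' column holds date strings in the `YYYY-MM-DD` form, which should be confirmed against the real files.

```python
import pandas as pd

# Toy stand-in for engagement_data; the values are made up for illustration.
toy = pd.DataFrame({
    "time": ["2020-01-01", "2020-06-15", "2020-12-31"],
    "engagement_index": [10.0, 250.0, 40.0],
})

# Parse the strings into datetimes so that min/max compare dates, not text.
toy["time"] = pd.to_datetime(toy["time"])

first_day, last_day = toy["time"].min(), toy["time"].max()
print(first_day.date(), last_day.date())
```

Applying the same two steps to `engagement_data` would give the first and last recorded day of the collection period.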

## Handling Missing Values ##

In [None]:
# how many missing values exist or better still what is the % of missing values in the dataset?
def percent_missing(df):

    # Calculate total number of cells in dataframe
    totalCells = np.prod(df.shape)

    # Count number of missing values per column
    missingCount = df.isnull().sum()

    # Calculate total number of missing values
    totalMissing = missingCount.sum()

    # Calculate percentage of missing values
    print("The dataset contains", round(((totalMissing/totalCells) * 100), 2), "%", "missing values.")
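
The overall percentage hides which columns are responsible for the gaps. A per-column breakdown can be sketched as follows; `percent_missing_by_column` is a hypothetical helper name and the toy frame is made up for illustration.

```python
import pandas as pd

def percent_missing_by_column(df):
    """Return the percentage of missing values in each column, largest first."""
    # isnull().mean() gives the fraction of missing cells per column.
    return (df.isnull().mean() * 100).round(2).sort_values(ascending=False)

# Made-up example data, not taken from the real files.
toy = pd.DataFrame({
    "state": ["Utah", None, "Ohio", None],
    "locale": ["City", "Rural", None, "Town"],
    "district_id": [1, 2, 3, 4],
})

per_column = percent_missing_by_column(toy)
print(per_column)
```

This helps decide, column by column, whether dropping rows or filling values is the more reasonable choice.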

## Districts Dataset ##

In [None]:
percent_missing(districts)

Therefore, we remove all rows with missing values using the dropna() function.

In [None]:
districts_info = districts.dropna()

## Product Dataset ##

In [None]:
percent_missing(products)

In [None]:
import io
products.isna().sum()

The columns with missing values are filled using the ffill (forward fill) method.

In [None]:
# filling missing values in the respective columns with ffill method

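def fix_missing_ffill(df, col):
    # forward fill: copy the last valid value downwards into each gap
    df[col] = df[col].ffill()
    return df[col]


def fix_missing_bfill(df, col):
    # backward fill: copy the next valid value upwards into each gap
    df[col] = df[col].bfill()
    return df[col]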

products['Provider/Company Name'] = fix_missing_ffill(products, 'Provider/Company Name')
products['Sector(s)'] = fix_missing_ffill(products, 'Sector(s)')
products['Primary Essential Function'] = fix_missing_ffill(products, 'Primary Essential Function')
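
Forward fill copies the value from the row above, so a product with a missing sector inherits the sector of whichever product happens to precede it. The sketch below shows that effect on a made-up frame, next to an alternative that marks the gap explicitly; the "Unknown" label is an assumption for illustration, not something the dataset defines.

```python
import pandas as pd

# Made-up rows; only the pattern of gaps matters here.
toy = pd.DataFrame({
    "Product Name": ["A", "B", "C"],
    "Sector(s)": ["PreK-12", None, "Corporate"],
})

# Forward fill: product B silently takes product A's sector.
filled_forward = toy["Sector(s)"].ffill()

# Alternative: keep the gap visible with an explicit placeholder label.
filled_unknown = toy["Sector(s)"].fillna("Unknown")

print(filled_forward.tolist())
print(filled_unknown.tolist())
```

Which option is preferable depends on whether neighbouring rows in products_info.csv are actually related; if they are not, the placeholder avoids attributing a sector the product may not have.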

## Engagement Dataset ##

In [None]:
percent_missing(engagement_data)

In [None]:
# finding the missing values in every column
engagement_data.isna().sum()

In [None]:
# dropping all rows with a missing lp_id, since they cannot be linked to a specific product
# (.copy() gives an independent dataframe, so later column assignments do not raise SettingWithCopyWarning)
engagement_df = engagement_data.dropna(subset=["lp_id"]).copy()

In [None]:
# filling the missing values in 'pct_access' with its mode, and in 'engagement_index' with its median
# (mode() returns a Series, so its first entry is taken to get a single fill value)
engagement_df['pct_access'] = engagement_df['pct_access'].fillna(engagement_df['pct_access'].mode()[0])
engagement_df['engagement_index'] = engagement_df['engagement_index'].fillna(engagement_df['engagement_index'].median())
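
One detail worth noting when filling with the mode: `Series.mode()` returns a Series, because several values can tie, whereas `median()` returns a scalar. A minimal sketch of scalar filling on made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "pct_access": [0.0, 0.0, np.nan, 0.5, np.nan],
    "engagement_index": [1.0, np.nan, 3.0, 100.0, np.nan],
})

# Take the first mode so that fillna receives a single value
# that it can apply to every gap.
mode_value = toy["pct_access"].mode().iloc[0]
median_value = toy["engagement_index"].median()

filled = toy.assign(
    pct_access=toy["pct_access"].fillna(mode_value),
    engagement_index=toy["engagement_index"].fillna(median_value),
)
print(filled)
```

The median is used for 'engagement_index' because a few very large values, like the 100.0 in the toy data, pull the mean upwards far more than they move the median.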

# Data Visualization #

## State Distribution ##

In [None]:
# count the districts in each state (summing the columns would add up district ids instead)
districts_info['state'].value_counts().plot(kind='pie', shadow=True, startangle=90,
                                            figsize=(15,10), autopct='%1.1f%%')

## Locale Distribution ##

In [None]:
# count the districts in each locale type
districts_info['locale'].value_counts().plot(kind='pie', shadow=True, startangle=90,
                                             figsize=(15,10), autopct='%1.1f%%')

## Sector Distribution ##

In [None]:
# count the products in each sector
products['Sector(s)'].value_counts().plot(kind='pie', shadow=True, startangle=90,
                                          figsize=(15,10), autopct='%1.1f%%')

## Primary Essential Function Distribution ##

In [None]:
plt.clf()
# count the products per primary essential function
products['Primary Essential Function'].value_counts().plot(kind='bar')
plt.show()

## Top Company Providers ##

In [None]:
import seaborn as sns
plt.figure(figsize=(16, 10))
sns.countplot(y='Provider/Company Name', data=products, order=products["Provider/Company Name"].value_counts().index[:15])
plt.title("Top 15 Provider/Company Names",font="Cursive", size=20)
plt.show()

# Merging the Datasets #

To explore engagement in the districts and the products used, we have to merge the data. First, we edit the districts data: the range-valued columns are stored as strings, so we parse each one and replace it with the mean of its two bounds.

In [None]:
districts_info_edit = districts_info.copy()

districts_info_edit['pct_black/hispanic'] = districts_info['pct_black/hispanic'].apply(lambda x: (x.replace('[', '')).split(','))
districts_info_edit['pct_free/reduced'] = districts_info['pct_free/reduced'].apply(lambda x: (x.replace('[', '')).split(','))
districts_info_edit['pp_total_raw'] = districts_info['pp_total_raw'].apply(lambda x: (x.replace('[', '')).split(','))
districts_info_edit.drop(columns=['county_connections_ratio'],inplace=True)

for i in ['pct_black/hispanic','pct_free/reduced','pp_total_raw']:
    districts_info_edit[i] = districts_info_edit[i].apply(lambda x: (float(x[0])+float(x[1]))/2)
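
The parsing step can be illustrated on a few made-up strings. The sketch assumes the ranges look like `[0.2, 0.4[`, which matches the stripping of `[` and splitting on `,` in the cell above but should be checked against the real file; `range_midpoint` is a hypothetical helper name.

```python
import pandas as pd

def range_midpoint(text):
    """Turn a range string such as '[0.2, 0.4[' into the midpoint of its bounds."""
    # Remove the brackets, then split the two bounds on the comma.
    lower, upper = text.replace("[", "").replace("]", "").split(",")
    # float() tolerates the space left after the comma.
    return (float(lower) + float(upper)) / 2

# Made-up range strings in the assumed format.
toy = pd.Series(["[0, 0.2[", "[0.2, 0.4[", "[8000, 10000["])
midpoints = toy.apply(range_midpoint)
print(midpoints.tolist())
```

Using the midpoint keeps the ordering of the ranges but treats every district in a range as identical, so the resulting columns are better read as ordered categories than as exact measurements.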

In [None]:
# ensuring lp_id and district_id are converted to int, to enable merging with the products and districts information
engagement_df["lp_id"] = engagement_df["lp_id"].astype(int)
engagement_df["district_id"] = engagement_df["district_id"].astype(int)
engagement_df.head()

In [None]:
# renaming 'LP ID' to 'lp_id' so that the key column has the same name in the products and engagement datasets
products.rename(columns = {'LP ID': 'lp_id'}, inplace = True)

In [None]:
# merging the engagement data with the districts and products datasets
merged_data = pd.merge(engagement_df, districts_info_edit, on="district_id")
merged_data = pd.merge(merged_data, products, on="lp_id")
merged_data.head()
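
By default `pd.merge` performs an inner join, so engagement rows whose district was dropped during cleaning disappear without notice. The following sketch, on made-up frames, shows one way to count such rows using the `indicator` and `validate` parameters of `pd.merge`.

```python
import pandas as pd

# Made-up frames: district 3000 has engagement rows but no district record.
engagement = pd.DataFrame({
    "district_id": [1000, 1000, 2000, 3000],
    "lp_id": [11, 12, 11, 13],
    "engagement_index": [5.0, 7.5, 2.0, 9.0],
})
districts = pd.DataFrame({
    "district_id": [1000, 2000],
    "state": ["Utah", "Ohio"],
})

# how="left" keeps every engagement row; indicator=True adds a '_merge'
# column saying whether a matching district was found; validate raises
# an error if a district_id appears more than once in the districts frame.
checked = pd.merge(engagement, districts, on="district_id",
                   how="left", indicator=True, validate="many_to_one")

unmatched = (checked["_merge"] == "left_only").sum()
print(f"{unmatched} of {len(checked)} engagement rows have no district record")
```

Running the same check on the real frames would show how much engagement data is lost by merging against the cleaned districts table.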

### Geo chart to extract features ###

In [None]:
# aggregation adapted from https://www.kaggle.com/gvyshnya/covid19-impact-on-digital-learning-platforms-usage
digital_learning_plat = merged_data[merged_data["Primary Essential Function"] == 'LC - Digital Learning Platforms']
# as_index=False already keeps 'state' and 'time' as columns, so no reset_index() is needed
agg_engagement_data = digital_learning_plat.groupby(["state", "time"], as_index=False)["engagement_index"].sum()
agg_engagement_data.head()

In [None]:
!pip install pdpipe

In [None]:
def set_size(value):
    '''
    Takes the numeric value of a parameter to visualize on a map (Plotly Geo-Scatter plot).
    Returns a number indicating the size of the bubble for the state whose numeric
    attribute value was supplied as input.
    '''
    result = np.log(1+value/100)
    if result < 0:
        result = 0.001
    return result
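
The logarithm compresses the wide range of engagement totals so that the largest bubbles do not swamp the map. A vectorized sketch of the same rule follows; `bubble_size` is a hypothetical name, and it assumes inputs greater than -100, for which the logarithm is defined.

```python
import numpy as np

def bubble_size(values):
    """Vectorized version of the sizing rule: log(1 + value/100), floored at 0.001."""
    # log1p(x) computes log(1 + x) for the whole array at once.
    sizes = np.log1p(np.asarray(values, dtype=float) / 100)
    # Negative results (from negative inputs) are replaced by a tiny positive size.
    return np.where(sizes < 0, 0.001, sizes)

sizes = bubble_size([0, 100, 10_000, 1_000_000, -50])
print(np.round(sizes, 3))
```

A value of 0 maps to a size of 0, and each hundredfold increase in the engagement total adds only a few units of size, which keeps bubbles for high-engagement states readable.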

In [None]:
import pdpipe as pdp
# getting the state abbreviations to use in plotting graphs
state_abb = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District Of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}

pipeline = pdp.PdPipeline([
    pdp.ApplyByCols('engagement_index', set_size, 'size', drop=False),
    pdp.MapColVals('state', state_abb)
])

agg_engagement_data = pipeline.apply(agg_engagement_data)

agg_engagement_data.fillna(0, inplace=True)

agg_engagement_data = agg_engagement_data.sort_values(by='time', ascending=True)
agg_engagement_data.tail()

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.offline

import category_encoders as ce
fig = px.scatter_geo(
    agg_engagement_data, locations="state", locationmode='USA-states',
    scope="usa",
    color="engagement_index", 
    size='size', hover_name="state", 
    range_color= [0, 100000], 
    projection="albers usa", animation_frame="time", 
    title='Engagement Index: LC - Digital Learning Platforms', 
    color_continuous_scale="portland")

fig.show()

From this chart, several insights can be drawn when assessing the effects of COVID-19 on digital learning in different states:
* UT, IL, IN and CT record the highest engagement index, which reflects how readily digital learning was accessed and embraced there.
* TX and NC, among other states, record a low engagement index.
* Notably, all states record a very low engagement index on weekends, meaning minimal learning takes place then. For instance, in January, the 4th, 5th, 11th, 12th, 18th, and 25th all show the lowest engagement index.
* From March onwards, the engagement index increases in most states, and even the states that recorded the lowest values show a spike. This is due to the government lockdown policies introduced to reduce the spread of COVID-19: most learners turned to remote learning, which raised the engagement index.
* In July, the engagement index decreases across all states, due to the pause in learning for the holiday.
* In August, learning resumes and the engagement index starts increasing across the states.
* From 19th December, there is a gradual decrease in the engagement index, as learners break for the Christmas holiday.
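
The weekend observation above is read off the animation frame by frame; it can also be checked numerically. A minimal sketch on made-up daily totals, assuming a 'time' column that has already been parsed into datetimes:

```python
import pandas as pd

# Hypothetical daily totals for one state over two weeks starting on a Monday;
# the numbers are made up for illustration.
toy = pd.DataFrame({
    "time": pd.date_range("2020-01-06", periods=14, freq="D"),
    "engagement_index": [900, 950, 920, 940, 880, 40, 35,
                         910, 930, 960, 900, 870, 45, 30],
})

# dayofweek is 0 for Monday up to 6 for Sunday, so >= 5 marks the weekend.
is_weekend = toy["time"].dt.dayofweek >= 5
toy["day_type"] = is_weekend.map({True: "weekend", False: "weekday"})

by_day_type = toy.groupby("day_type")["engagement_index"].mean()
print(by_day_type)
```

Grouping by `toy["time"].dt.month` instead of `day_type` would give the corresponding monthly averages for checking the July dip and the August recovery.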

Other specific details will be drawn from the chart as I continue with the analysis.