# Preface

This notebook has two purposes:

- analyze how the use of Digital Learning Platforms (LC) changed across the US states after the Covid-19 pandemic and lockdowns started to affect daily life in the USA
- demonstrate the power and ease of use of geospatial visualizations in EDA for problems of this sort

Digital Learning Platform (LC) products are the principal classroom-level e-learning systems where students receive instruction, complete homework and tests, and do self-learning activities. Changes in the use of such systems are therefore indicative of the overall impact of the Covid-19 pandemic on both the PreK-12 school system and higher education in the USA.

# Pre-Requisites

First of all, we are going to place some preparatory (prerequisite) code into our notebook. It will

- install *pdpipe* (we will use this package to automate feature engineering/data transformation pipelines down the road)
- import the necessary Python packages
- define the auxiliary functions used for data extraction and for preparing the data for the visualizations

In [None]:
!pip install pdpipe

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime as dt
import pdpipe as pdp
from typing import Tuple, List, Dict
import glob

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.offline

import category_encoders as ce

In [None]:
# read data
in_kaggle = True


def get_data_file_path(is_in_kaggle: bool) -> Tuple[str, str, str]:
    '''
    Returns the paths to the districts info file, the products info file,
    and the folder with the per-district engagement files
    '''
    if is_in_kaggle:
        # running in Kaggle, inside the competition
        districts_info_path = '../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv'
        products_info_path = '../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv'
        engagements_path = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data'
    else:
        # running locally
        districts_info_path = 'data/districts_info.csv'
        products_info_path = 'data/products_info.csv'
        engagements_path = 'data/engagement_data'

    return districts_info_path, products_info_path, engagements_path

# set the size of the geo bubble
def set_size(value):
    '''
    Takes the numeric value of a parameter to visualize on a map (Plotly Geo-Scatter plot)
    Returns a number to indicate the size of a bubble for a US state whose numeric attribute value
    was supplied as an input
    '''
    result = np.log(1+value/100)
    if result < 0:
        result = 0.001
    return result
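
To get a feel for the bubble scaling, the sketch below applies the same formula to a few made-up engagement totals (the sample values are assumptions for illustration, not taken from the dataset). The logarithm compresses values that differ by orders of magnitude into a narrow range of bubble sizes.

```python
import numpy as np


def set_size(value):
    """Same scaling as the helper above: log(1 + value / 100), with negative results replaced by 0.001."""
    result = np.log(1 + value / 100)
    if result < 0:
        result = 0.001
    return result


# Hypothetical engagement totals spanning several orders of magnitude
sample_values = [0, 100, 10_000, 1_000_000]
sizes = [round(float(set_size(v)), 3) for v in sample_values]
print(sizes)
```

A value of 0 maps to a bubble size of 0, so states with no recorded engagement on a given day are effectively invisible on the map.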

# Reading and Pre-Processing the Data

In [None]:
districts_info_path, products_info_path, engagements_path = get_data_file_path(in_kaggle)

We are going to read the districts data first. For the purpose of the current analysis, we are not going to preprocess this data file any further.

In [None]:
districts_df = pd.read_csv(districts_info_path)
districts_df.head()

Next, we read the data about e-learning software products. We rename one of its columns (from *'LP ID'* to *'lp_id'*) to make it easier to merge with the other datasets later in this analysis.

For the purpose of the current analysis, we are not going to preprocess this data file any further.

In [None]:
products_df = pd.read_csv(products_info_path)
products_df.rename(columns = {'LP ID': 'lp_id'}, inplace = True)
products_df.head()

Finally, we load the engagement data. We combine the individual school district files into a single dataframe on the fly and add a new feature (*district_id*), taken from each file name, to the combined dataframe.

In [None]:
# read engagement data files
all_engagement_files = glob.glob(engagements_path + "/*.csv")

li = []

for filename in all_engagement_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    # add district_id from the data file name
    
    df["district_id"] = filename.replace("\\", "/").split("/")[-1].split(".")[0]
    li.append(df)

engagements_df = pd.concat(li, axis=0, ignore_index=True)

engagements_df.head()
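
As a possible alternative to the string splitting above, the district id can be read from the file name with `pathlib`. The sketch below is self-contained: it writes two tiny made-up files to a temporary folder (the file names and values are assumptions for illustration) and then loads them the same way the cell above does.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create two throwaway engagement files named after hypothetical district ids
    for district_id in ("1000", "2074"):
        pd.DataFrame(
            {
                "time": ["2020-01-01"],
                "lp_id": [1.0],
                "pct_access": [0.5],
                "engagement_index": [10.0],
            }
        ).to_csv(Path(tmp_dir) / f"{district_id}.csv", index=False)

    frames = []
    for csv_path in sorted(Path(tmp_dir).glob("*.csv")):
        frame = pd.read_csv(csv_path)
        # Path.stem is the file name without its extension, e.g. "1000"
        frame["district_id"] = csv_path.stem
        frames.append(frame)

demo_engagements_df = pd.concat(frames, ignore_index=True)
print(demo_engagements_df)
```

`Path.stem` handles both Windows and POSIX separators, so no manual replacement of backslashes is needed.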

In [None]:
# missing data: engagements_df

total = engagements_df.isnull().sum().sort_values(ascending=False)
percent = (engagements_df.isnull().sum()/engagements_df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

As we can see, a tiny fraction of the engagement records have no *lp_id* recorded. We have to drop such observations since there is no way to map them to any software product listed in products_df.

For *pct_access* and *engagement_index*, we can interpret 'NaN' values as 0.00, based on the definition of the respective attributes.

**Note:** For convenience, below is a refresher on how *pct_access* and *engagement_index* are defined in this project

- *pct_access* - Percentage of students in the district who have at least one page-load event of a given product on a given day
- *engagement_index* - Total page-load events per one thousand students of a given product on a given day


In [None]:
engagements_df = engagements_df.drop(engagements_df.loc[engagements_df['lp_id'].isnull()].index)
engagements_df = engagements_df.fillna(0.0)
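
The same two steps can also be written with `dropna(subset=...)` and a column-wise `fillna`. The sketch below runs on a small made-up dataframe (the values are assumptions for illustration) and is meant as an equivalent formulation, not as a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Toy engagement records: one row lacks lp_id, another lacks both metrics
toy_df = pd.DataFrame(
    {
        "lp_id": [101.0, np.nan, 102.0],
        "pct_access": [0.2, 0.1, np.nan],
        "engagement_index": [5.0, 3.0, np.nan],
    }
)

cleaned_df = (
    toy_df
    .dropna(subset=["lp_id"])  # drop rows that cannot be mapped to a product
    .fillna({"pct_access": 0.0, "engagement_index": 0.0})  # treat missing metrics as zero
)
print(cleaned_df)
```

Filling per column makes it explicit which attributes are treated as zero when missing.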

After handling missing data, we cast two columns of the unified engagements dataframe (*lp_id*, *district_id*) to int. This will be helpful later, when we merge the engagement data with the district and product information.

**Note:** We do not convert the *time* attribute to *datetime* yet, as its string representation will be useful in building the animated geo scatter plots (see below).

In [None]:
# cast lp_id and district_id to int, to enable merging with the products and districts info down the road
engagements_df["lp_id"] = engagements_df["lp_id"].astype(int)
engagements_df["district_id"] = engagements_df["district_id"].astype(int)
#engagements_df["time"] = pd.to_datetime(engagements_df["time"])
engagements_df.tail()

As a final step, we combine the engagement, district, and product information into a single dataframe. This dataframe will then be used as the foundation for further analytical and EDA activities.

In [None]:
# merge districts and products
result_df = pd.merge(engagements_df, districts_df, on="district_id")
result_df = pd.merge(result_df, products_df, on="lp_id")

In [None]:
result_df.head()
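
Note that `pd.merge` performs an inner join by default, so engagement rows whose *district_id* or *lp_id* has no match are dropped. The sketch below, on made-up data, shows one way to check how many rows would be lost, using a left join with `indicator=True`.

```python
import pandas as pd

# Toy data: district 3 has engagement records but no district info
toy_engagements = pd.DataFrame(
    {"district_id": [1, 2, 3], "engagement_index": [10.0, 20.0, 30.0]}
)
toy_districts = pd.DataFrame(
    {"district_id": [1, 2], "state": ["Utah", "Connecticut"]}
)

# Default (inner) merge: unmatched rows disappear
inner_df = pd.merge(toy_engagements, toy_districts, on="district_id")

# Left merge with an indicator column to audit unmatched rows
audit_df = pd.merge(
    toy_engagements, toy_districts, on="district_id", how="left", indicator=True
)
unmatched = audit_df[audit_df["_merge"] == "left_only"]
print(len(inner_df), unmatched["district_id"].tolist())
```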

Two important notes about the data:

- the engagement observations cover the period from Jan 1, 2020 through Dec 31, 2020 inclusive
- only a fraction of the US states is represented in the dataset provided for this project (this is assumed to be a conscious decision of the contest organizers)

# Digital Learning Platform Patterns

We are going to focus on how Covid-19 and the related lockdown actions impacted the use of *Digital Learning Platforms* across the selected school districts represented in the datasets for this project.

## Let's Engage Despite the Covid-19!

As a first step, we create a separate dataframe where we filter the data for *Digital Learning Platforms* only. Then we aggregate the data by observation date (the *time* attribute) and US state (the *state* attribute).

In [None]:
agg_digi_learn_df = result_df[result_df["Primary Essential Function"] == 'LC - Digital Learning Platforms']
# as_index=False already returns the group keys as regular columns, so no reset_index() is needed
agg_engagement_data = agg_digi_learn_df.groupby(["state", "time"], as_index=False)["engagement_index"].sum()
agg_engagement_data.head(10)
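
The aggregation step can be illustrated on a small made-up dataframe (state names, dates, and values are assumptions for illustration). With `as_index=False`, the group keys come back as ordinary columns next to the summed metric.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "state": ["Utah", "Utah", "Utah", "Connecticut"],
        "time": ["2020-01-06", "2020-01-06", "2020-01-07", "2020-01-06"],
        "engagement_index": [100.0, 50.0, 70.0, 30.0],
    }
)

# One row per (state, time) pair, with engagement summed over districts and products
toy_agg = toy_df.groupby(["state", "time"], as_index=False)["engagement_index"].sum()
print(toy_agg)
```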

### Map It All

We are going to provide a comprehensive view of how the engagement patterns for Digital Learning Platforms changed over time across all of the US states represented in the dataset. We will do this with just one, yet powerful, chart. Plotly Express's Geo Scatter plot will help us see the changes in engagement with Digital Learning Platforms throughout 2020 in a single mapping interface.

However, before we display the data on the geo scatter plot, we need to do a couple of small pre-processing steps on the aggregated engagement dataframe:

- replace the US state names with their two-character codes
- add a new attribute that defines the size of the scatter bubble for every state and observation date

**Note:** these steps produce the inputs that Plotly Express's Geo Scatter plot needs

We are going to implement this pre-processing mini-pipeline using *pdpipe*.

In [None]:
# kudos to https://gist.github.com/rogerallen/1583593
us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}

pipeline = pdp.PdPipeline([
    pdp.ApplyByCols('engagement_index', set_size, 'size', drop=False),
    pdp.MapColVals('state', us_state_abbrev)
])

agg_engagement_data = pipeline.apply(agg_engagement_data)

agg_engagement_data.fillna(0, inplace=True)

agg_engagement_data = agg_engagement_data.sort_values(by='time', ascending=True)
agg_engagement_data.tail()
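
For comparison, the state-name mapping step can be sketched in plain pandas with `Series.map`. The dictionary below is a two-entry excerpt and "Atlantis" is a deliberately invalid name (both are assumptions for illustration); the point is that names missing from the dictionary come back as NaN, which is worth checking before plotting.

```python
import pandas as pd

# Excerpt of a state-name-to-code mapping, for illustration only
state_codes = {"Utah": "UT", "Connecticut": "CT"}

toy_states = pd.DataFrame({"state": ["Utah", "Connecticut", "Atlantis"]})
toy_states["state_code"] = toy_states["state"].map(state_codes)

# Names without a code end up as NaN and can be listed explicitly
unmapped = toy_states.loc[toy_states["state_code"].isna(), "state"].tolist()
print(unmapped)
```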

Now we are ready to unlock the power of visualization with the animated geo scatter plot

In [None]:
fig = px.scatter_geo(
    agg_engagement_data, locations="state", locationmode='USA-states',
    scope="usa",
    color="engagement_index", 
    size='size', hover_name="state", 
    range_color= [0, 100000], 
    projection="albers usa", animation_frame="time", 
    title='Engagement Index: LC - Digital Learning Platforms', 
    color_continuous_scale="portland")

fig.show()

It is just one chart, yet it yields a lot of insights and points to directions for further research.

First of all, we can capture the essential 'features' of the time series representing the changes in the engagement index through 2020:
- there is a strong weekly seasonality (weekends display lower engagement than regular business days)
- there is a strong vacation/holiday effect on engagement (see the general decline in the engagement index in all of the states during the summer and Christmas holidays, as well as around the mid-year holiday schedule in different states)
- the unique school year schedules of different states, or even of individual school districts, contribute to the seasonal effects observed
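
The weekly seasonality noted above can be checked numerically by averaging the engagement index per day of the week. The sketch below uses synthetic data (four weeks with made-up weekday and weekend levels, an assumption for illustration); applied to one state's slice of the aggregated dataframe, the same grouping would show whether weekends really sit lower.

```python
import numpy as np
import pandas as pd

# Four full weeks starting on Monday, Jan 6, 2020
dates = pd.date_range("2020-01-06", periods=28, freq="D")
# Made-up values: 1000 on weekdays, 100 on weekends
values = np.where(dates.dayofweek < 5, 1000.0, 100.0)
toy_ts = pd.DataFrame(
    {"time": dates.strftime("%Y-%m-%d"), "engagement_index": values}
)

# Parse the string dates only for the analysis; the original column stays a string
toy_ts["weekday"] = pd.to_datetime(toy_ts["time"]).dt.day_name()
weekday_means = toy_ts.groupby("weekday")["engagement_index"].mean()
print(weekday_means)
```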

We can also identify the states with the highest and lowest average engagement index levels across 2020:
- UT and CT demonstrate the highest average engagement index levels
- FL, TN, TX, AZ, MN, and NJ demonstrate the lowest average engagement index levels
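
A ranking like the one above can be derived by averaging the engagement index per state and sorting. The sketch below uses made-up numbers (the values are assumptions for illustration and do not reproduce the dataset).

```python
import pandas as pd

toy_agg = pd.DataFrame(
    {
        "state": ["UT", "UT", "CT", "CT", "FL", "FL"],
        "engagement_index": [900.0, 1100.0, 800.0, 1000.0, 100.0, 300.0],
    }
)

# Mean engagement per state, highest first
state_means = (
    toy_agg.groupby("state")["engagement_index"]
    .mean()
    .sort_values(ascending=False)
)
print(state_means)
```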

At the same time, observing the changes in the engagement index charts across the year of 2020, we can see the interference of the Covid-19 pandemic and lockdown-driven changes with seasonal and school year schedule effects. The most interesting observations are logged below

- Jan 1, 2020: minimal activity
- Jan 2-3: mid-level activity spikes in CT and MA, moderate spike in UT
- Jan 6: huge engagement spikes in CT and UT, mid-level engagement spikes in MA, OH, IN, and IL
- Jan 7: IL joins CT and UT in the high-spike ‘zone’
- Jan 9: MO enters the mid-spike ‘zone’, in addition to the above-mentioned states
- FL, TN, TX, AZ, MN, NJ, and other states remain on the low end of the engagement scale
- Jan 14: CA enters the mid-spike ‘zone’, in addition to the other states mentioned above; FL and TX outperform ND on the lower end of the scale
- Jan 30: NY enters the mid-level engagement zone, in addition to the other states mentioned above
- As of the start of the large-scale global panic around the Covid-19 outbreak (Feb 24, 2020), more states move to the high-engagement zone (UT, CT, MA, OH, IN, IL), and additional states move to the mid-level engagement zone (NC and VA, in addition to the states already there: CA, NY, and MO)
- The situation and trends detected as of Feb 24, 2020 hold through Mar 18, 2020, and then we see a new shift (NY moves to the high-engagement zone; OH, IL, and MA drop to the mid-level engagement zone; NH becomes more active yet still operates at the low engagement level; activity drops significantly in FL, AZ, NC, and VA)
- Starting Mar 19, 2020, NY goes back to the mid-level engagement zone
- As of Mar 27, only CT and UT remain in the high-engagement zone, with the rest of the states in it dropping down to the mid-level engagement zone
- As of Apr 4, MA joins CT and UT in the high-engagement zone, and FL outperforms all other states in the low-engagement zone
- As of Apr 15, MA goes back to the mid-level engagement zone
- As of Apr 29, IL joins CT and UT in the high-engagement zone
- As of May 6, MA joins CT, UT, and IL in the high-engagement zone
- As of May 8, MA and IL drop to the mid-level engagement zone again
- As of May 11, MA returns to the high-engagement zone, and NJ becomes the leader in the low-engagement zone
- As of May 20, only CT remains in the high-engagement zone (even UT temporarily drops out of it)
- May 22-31: general slow-down across the country, with no states in the high-engagement zone
- Jun 1: CT and MA return to the high-engagement zone
- Jun 5: CT and MA drop down to the mid-level engagement zone
- Jun 13: general slow-down across the country (driven by the summer holidays), with all states dropping to the low-engagement zone
- Mid Aug 2020: as school districts return to school one by one, the engagement level gradually increases again (Aug 14: IN enters the mid-level engagement zone; Aug 18: IN enters the high-engagement zone, and CA and IL enter the mid-level engagement zone; Aug 20: UT and IL enter the high-engagement zone, in addition to IN; Aug 26: OH enters the mid-level engagement zone)
- As of Sep 2, CT returns to the mid-level engagement zone
- As of Sep 10, CT joins UT and MA in the high-engagement zone
- As of Sep 22, IL and IN return to the high-engagement zone to join CT, UT, and MA there, and NY enters the mid-level engagement zone
- As of Oct 2, IL drops to the mid-level engagement zone
- As of Oct 7, IL re-enters the high-engagement zone
- As of Oct 15, UT drops to the mid-level engagement zone
- As of Oct 19, UT comes back to the high-engagement zone
- The fluctuations across the high- and mid-level engagement zones observed in Oct 2020 continue in Nov and Dec (more specifically, on Dec 1-19)
- After Dec 19, we observe a general decline in engagement across the whole country during the Christmas holiday season

The observations above suggest promising extensions of this research:

- detailed time series analysis of the engagement index data in the different states across the US
- measuring the impact of, and correlations with, the lockdowns and other Covid-19-driven policy actions in different states in 2020
- investigating the possible correlation between changes in mobility behaviour and the Digital Learning Platform engagement index

This notebook is therefore by no means complete. It is to be continued and extended in the future.

# References

Kudos to https://gist.github.com/rogerallen/1583593 for putting together the complete Pythonic mapping between the US State names and abbreviations.

