In [None]:
import numpy as np
import pandas as pd
import glob

In [None]:
path = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data' 
all_files = glob.glob(path + "/*.csv")

li = []

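for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    # The file name without its extension is the district id. Taking the last
    # path component avoids depending on how deep the input directory is.
    district_id = filename.split("/")[-1].split(".")[0]
    df["district_id"] = district_id
    li.append(df)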
    
engagement_df = pd.concat(li)
engagement_df = engagement_df.reset_index(drop=True)
engagement_df.head()

In [None]:
# Check Shape of data
engagement_df.shape

In [None]:
#Check null value of data
display(engagement_df.isnull().sum())

In [None]:
#making a copy of engagement_df
df=engagement_df.copy()

In [None]:
# dropping lp_id , engagement_index and district_id.
# Objective is to look at the mean and median of the pct_access regardless of district
df.drop(['lp_id','engagement_index','district_id'],axis=1,inplace=True)
df

In [None]:
# Since the missing value is just 0.06% of the whole dataset (13447 out of 22324190), we can drop the missing data
df.dropna(axis=0,inplace=True)
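
The share of missing rows quoted in the comment above can be checked with a few lines. This is a minimal sketch: `toy` is a hypothetical stand-in for `df` with made-up values, and only the last calculation uses the counts quoted in the comment.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are made up for illustration only.
toy = pd.DataFrame({
    "time": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"],
    "pct_access": [0.5, np.nan, 0.1, 0.2],
})

# Number and percentage of rows with at least one missing value.
missing_rows = toy.isnull().any(axis=1).sum()
missing_pct = 100 * missing_rows / len(toy)
print(f"{missing_rows} of {len(toy)} rows ({missing_pct:.2f}%) have missing values")

# The same arithmetic applied to the counts quoted in the comment above.
quoted_pct = 100 * 13447 / 22324190
print(f"{quoted_pct:.2f}%")
```

Running the same two lines on the real `df` before calling `dropna` would confirm whether the quoted figure still holds for the loaded data.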

In [None]:
# Creating another column to extract out the month from the given date
df['month'] = df['time'].str[5:7] # Slicing out the month

In [None]:
# df["month"].replace(to_replace="01",value= "Jan",inplace=True)

In [None]:
# df_sam=df.copy()
# df_sam.replace(to_replace="Jan",value="01",inplace=True)
# df_sam

In [None]:
# Function to replace a two-digit month string with its three-letter abbreviation
def replace_month(data, month):
    month_dict = {
        "01": "Jan",
        "02": "Feb",
        "03": "Mar",
        "04": "Apr",
        "05": "May",
        "06": "Jun",
        "07": "Jul",
        "08": "Aug",
        "09": "Sep",
        "10": "Oct",
        "11": "Nov",
        "12": "Dec"}

    # Look the month up directly instead of looping over every key
    if month in month_dict:
        data.replace(to_replace=month, value=month_dict[month], inplace=True)
    return data


In [None]:
# Replacing the month value
lst=["01","02","03","04","05","06","07","08","09","10","11","12"]

for month in lst:
    replace_month(df,month)

df
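
As an alternative to slicing the date string and replacing values one month at a time, the month label could be derived in a single pass by parsing the dates. This is a sketch under assumptions: `toy` and `MONTH_ABBR` are hypothetical names, and the `time` column is assumed to use the `YYYY-MM-DD` format that the slicing above implies.

```python
import pandas as pd

# Three-letter month labels, indexed by month number minus one.
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Toy frame with the assumed 'time' format (YYYY-MM-DD); values are made up.
toy = pd.DataFrame({"time": ["2020-01-15", "2020-03-02", "2020-12-31"]})

# Parse the dates once, then map each month number to its label.
month_number = pd.to_datetime(toy["time"], format="%Y-%m-%d").dt.month
toy["month"] = month_number.map(lambda m: MONTH_ABBR[m - 1])
print(toy["month"].tolist())
```

An explicit label list is used here rather than `strftime("%b")` so that the labels do not depend on the system locale.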

In [None]:
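# Daily mean and median of pct_access
# pct_access is selected explicitly so the string 'month' column is not aggregated
df1 = df.groupby(['time'])[['pct_access']].mean()
df2 = df.groupby(['time'])[['pct_access']].median()
print(df1.head())
print(df2.head())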

In [None]:
# Monthly mean and median of pct_access
# sort=False keeps the months in their order of first appearance in df
df3 = df.groupby(['month'], sort=False)[['pct_access']].mean()
df4 = df.groupby(['month'], sort=False)[['pct_access']].median()
print(df3.head())
print(df4.head())
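
Grouping with `sort=False` keeps months in order of first appearance, which is chronological only if the rows happen to start in January. One possible way to make the order explicit is an ordered categorical. The sketch below uses a hypothetical `toy` frame with made-up values and a hypothetical `MONTH_ORDER` list.

```python
import pandas as pd

MONTH_ORDER = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Toy data deliberately out of calendar order; values are made up.
toy = pd.DataFrame({
    "month": ["Mar", "Jan", "Feb", "Jan"],
    "pct_access": [0.3, 0.1, 0.2, 0.3],
})

# An ordered categorical makes groupby sort by calendar position
# rather than alphabetically or by first appearance.
toy["month"] = pd.Categorical(toy["month"], categories=MONTH_ORDER, ordered=True)

# observed=True keeps only the months that actually occur in the data.
monthly_mean = toy.groupby("month", observed=True)["pct_access"].mean()
print(monthly_mean)
```

With this in place the default sorted groupby returns months in calendar order regardless of the order in which the files were read.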

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
fig,axes = plt.subplots(2,2,figsize=(16,10))
sns.lineplot(ax=axes[0,0],data=df1,x='time',y='pct_access').xaxis.set_ticks([]) # daily mean
sns.lineplot(ax=axes[0,1],data=df2,x='time',y='pct_access').xaxis.set_ticks([]) # daily median
sns.lineplot(ax=axes[1,0],data=df3,x='month',y='pct_access') # monthly mean
sns.lineplot(ax=axes[1,1],data=df4,x='month',y='pct_access') # monthly median
axes[0,0].set_title("Daily Mean")
axes[0,1].set_title("Daily Median")
axes[1,0].set_title("Monthly Mean")
axes[1,1].set_title("Monthly Median")

The <b>pct_access</b> column refers to the <b>percentage of students in the district who have at least one page-load event for a given product on a given day</b>.

This may indicate how engaged students across the districts were with the learning tools and digital platforms between $1^{st}$ Jan 2020 and $31^{st}$ Dec 2020, if we focus solely on the time and pct_access columns of the engagement_data dataset.

As shown above, we explore the trends in the daily mean, daily median, monthly mean, and monthly median.
These may indicate the average engagement level (on a daily and monthly basis) of the students across the learning tools and digital platforms listed in the dataset. Both the mean and the median were explored because, taken together, they give a fairer representation of the results if a specific product has anomalous data.
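
To illustrate why both statistics are worth checking, here is a minimal sketch with made-up `pct_access` values for five hypothetical products on one day, where the last value plays the role of an anomalous product.

```python
import pandas as pd

# Hypothetical pct_access values for five products on a single day.
# The last value stands in for an anomalous product.
pct_access = pd.Series([0.10, 0.12, 0.11, 0.09, 25.00])

mean_value = pct_access.mean()      # pulled upward by the anomalous value
median_value = pct_access.median()  # stays close to the typical value

print(f"mean   = {mean_value:.3f}")
print(f"median = {median_value:.3f}")
```

A large gap between the two, as in this toy case, is a hint that a few products dominate the mean.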

It is interesting to see that there was a generally declining trend starting from Jan 2020. A sharp decline was observed from May 2020 until a trend reversal began in Jul 2020.

In [None]:
# Obtaining the states where the districts are located in US
district = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")
# Missing Values in the District dataset
display(district.isnull().sum())
print(district.shape)

In the district_info dataset given to us, 57 values are missing from the state column. Unfortunately, we are unlikely to be able to fill in this information. Because we are going to drop these rows, we lose roughly <b>24.5%</b> (57 out of 233) of the districts provided.
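
The drop described above could look like the following sketch. `toy_district` is a hypothetical stand-in for the `district` frame with made-up ids and states, and only the last line uses the counts quoted in the text.

```python
import numpy as np
import pandas as pd

# Toy stand-in for districts_info.csv; ids and states are made up.
toy_district = pd.DataFrame({
    "district_id": [1000, 1001, 1002, 1003],
    "state": ["Utah", np.nan, "Illinois", "Utah"],
})

# Fraction of districts without a state.
missing_share = toy_district["state"].isnull().mean()
print(f"{missing_share:.1%} of districts have no state")

# Keep only the districts whose state is known.
district_known = toy_district.dropna(subset=["state"])
print(district_known.shape)

# The same ratio for the counts quoted in the text.
quoted_share = f"{57 / 233:.1%}"
print(quoted_share)
```

Using `subset=["state"]` limits the drop to rows missing a state, so districts with gaps in other columns are kept.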

In [None]:
district_state=pd.DataFrame(district.groupby(['state'])['district_id'].nunique().sort_values(ascending=False)).reset_index()
district_state.rename(columns={'state':'State','district_id':'Count'},inplace=True)
district_state

The table above shows the states in which the districts are located. Let's focus on the top 5 states by number of districts in the dataset, namely <b>Connecticut, Utah, Massachusetts, Illinois and California</b>.

Based on the articles listed below, many public schools in the United States were closed in mid to late March 2020 as a mitigation measure against the coronavirus. Schools in all of the top 5 states mentioned in the previous paragraph were closed during this period, and most schools were reported to be operating through online platforms. This tallies with the graphs we plotted, which show a slight increase between March 2020 and April 2020.

As for why there was a slight increase rather than a large spike, one possible reason is a lack of preparation by the schools. According to the NCES Blog, the 2017–18 School Survey on Crime and Safety (SSOCS), 2018, found that fewer than 50% of public schools (46%) had a written plan for a pandemic disease scenario. This may have contributed to the observation above, as schools may not have been ready to prepare education materials for their students when such an event occurred. Setting up infrastructure for students to access materials remotely, or converting conventional offline teaching materials to online ones, may have been challenges for schools that lacked the necessary preparation.

The next interesting observation is the declining trend between April 2020 and July 2020, especially the sharp decline starting from May 2020. One plausible reason is a decrease in the productivity of educators in the United States. According to the EdWeek Research Center Survey, 2020, morale among educators decreased during this period. From the same source, an excerpt from an interview with an educator suggested that conducting lessons remotely may be more taxing than conventional offline teaching. These factors may have reduced the quantity of online education materials available to students, leading to the decline in students' engagement with the learning tools and digital platforms.

While the above may suggest a trend that keeps declining over time, the reality was in fact the reverse. An intriguing trend reversal was spotted starting from July 2020 (monthly mean) and from Aug 2020 (monthly median). One possible explanation is the different mitigation measures implemented by the governments of the respective states. For the top 5 states (July 2020), according to Ballotpedia, Utah and California continued their school closures, Connecticut allowed schools to choose between fully reopening and a hybrid plan of partial reopening (with some lessons still online), while Illinois and Massachusetts allowed their schools to fully reopen. As we can see, from this period onward the states differ in practice. The graphs we plotted include districts whose state is missing. Hence, we will need to do more data manipulation to see whether such a trend still persists.
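
One possible shape for that further manipulation is sketched below: attach the state to each engagement row, drop districts without a state, and compute a monthly mean per state. `toy_engagement` and `toy_district` are hypothetical stand-ins with made-up values. The sketch assumes `district_id` is a string in the engagement data (it is parsed from the file name) and an integer in the district data, so the key is cast before merging.

```python
import pandas as pd

# Toy stand-ins; all ids, states, and values are made up.
toy_engagement = pd.DataFrame({
    "time": ["2020-06-01", "2020-06-02", "2020-07-01", "2020-07-01"],
    "pct_access": [0.2, 0.4, 0.6, 0.1],
    "district_id": ["1000", "1000", "1001", "1002"],  # strings, as parsed from file names
})
toy_district = pd.DataFrame({
    "district_id": [1000, 1001, 1002],  # assumed to be integers in the CSV
    "state": ["Utah", "Illinois", None],
})

# Align the key dtypes before merging.
toy_engagement["district_id"] = toy_engagement["district_id"].astype(int)

# An inner merge against districts with a known state drops the rest.
merged = toy_engagement.merge(
    toy_district.dropna(subset=["state"]), on="district_id", how="inner"
)
merged["month"] = merged["time"].str[5:7]

# Monthly mean of pct_access for each state.
state_monthly = (
    merged.groupby(["state", "month"])["pct_access"].mean().reset_index()
)
print(state_monthly)
```

Filtering `state_monthly` to the top 5 states and plotting one line per state would then show whether the reversal from July 2020 appears in each of them.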


Articles:
- https://www.edweek.org/leadership/the-coronavirus-spring-the-historic-closing-of-u-s-schools-a-timeline/2020/07
- https://nces.ed.gov/blogs/nces/post/the-prevalence-of-written-plans-for-a-pandemic-disease-scenario-in-public-schools
- https://ballotpedia.org/School_responses_in_Connecticut_to_the_coronavirus_(COVID-19)_pandemic
- https://ballotpedia.org/School_responses_in_Utah_to_the_coronavirus_(COVID-19)_pandemic
- https://ballotpedia.org/School_responses_in_Massachusetts_to_the_coronavirus_(COVID-19)_pandemic
- https://ballotpedia.org/School_responses_in_Illinois_to_the_coronavirus_(COVID-19)_pandemic
- https://ballotpedia.org/School_responses_in_California_to_the_coronavirus_(COVID-19)_pandemic
- https://www.ajmc.com/view/a-timeline-of-covid19-developments-in-2020

