# LearnPlatform COVID-19 Impact on Digital Learning

## Problem Statement
The COVID-19 pandemic has disrupted learning for more than 56 million students in the United States. In the spring of 2020, most states and local governments across the U.S. closed educational institutions to stop the spread of the virus. In response, schools and teachers have attempted to reach students remotely through distance learning tools and digital platforms. To date, concerns about a widening digital divide and long-term learning loss among America’s most vulnerable learners continue to grow.

## Objective
Use digital learning data to analyze the impact of COVID-19 on student learning, explore the state of digital learning in 2020, and examine how engagement with digital learning relates to factors such as district demographics, broadband access, and state/national level policies and events.

## Data Overview
* engagement_data: based on LearnPlatform’s Student Chrome Extension. The extension collects page load events for over 10K education technology products in the LearnPlatform product library, including websites, apps, web apps, software programs, extensions, ebooks, hardware, and services used in educational institutions. The engagement data have been aggregated at the school district level, and each file represents data from one school district.
* products_info.csv: includes information about the characteristics of the top 372 products with the most users in 2020.
* districts_info.csv: includes information about the characteristics of school districts, including data from NCES and FCC.

External data:
COVID-19 US State Policy database and KFF

#### Import libraries

In [None]:
import numpy as np 
import pandas as pd 

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots

import warnings
warnings.filterwarnings("ignore")

In [None]:
districts_df = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")
products_df = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv")

#### Data overview and Preprocessing

In [None]:

class GetDfForPreprocessing:
    def __init__(self, df:pd.DataFrame):
        
        self.df = df
    
    def get_info(self):
        print('Info from data...')
        print(f'There are {self.df.shape[0]} rows and {self.df.shape[1]} columns present in this data')
        missing = self.df.isnull().sum().sum()
        print(f'The number of missing value(s): {missing}')
        
        total_cells = np.prod(self.df.shape)  # np.product was removed in NumPy 2.0
        total_missing_count = self.df.isnull().sum().sum()
        
        print('There are', round(((total_missing_count/total_cells) * 100), 2), '%', 'missing values.')
        
        missing_columns = self.df.columns[self.df.isnull().any()]
        print(f'Columns having missing value(s): {missing_columns}')
    
    def get_percentage_missing_columns(self):
        '''a function to check for missing values count and percentage missing'''
    
        count_missing = self.df.isnull().sum() # calculate total sum of missing data
        count_missing_percentage = round(self.df.isnull().sum() * 100 / len(self.df))  # percentage of missing rows per column, rounded to the nearest integer
        missing_column_name= self.df.columns 
        missing_df=pd.DataFrame(zip(count_missing,count_missing_percentage,missing_column_name),
            columns=['Missing Count', '%Missing', 'ColumnName']) # create a dataframe 
        missing_df = missing_df.set_index('ColumnName') # set missing columns as index
        return missing_df
        
    def get_column_with_null(self):
        '''
        Return the list of columns in which at least 30% of the values are null
        '''
        number_of_rows = self.df.shape[0]

        bad_columns = []

        for column in self.df.columns:
            null_per_column = self.df[column].isnull().sum()
            percentage = round((null_per_column / number_of_rows) * 100, 2)

            if percentage >= 30:
                bad_columns.append(column)

        return bad_columns
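
The 30% threshold check can also be expressed without an explicit loop. The following is a minimal sketch on a toy frame (the data and the names `toy_df`, `missing_pct` and `columns_over_threshold` are made up for illustration and are not part of the LearnPlatform files); it assumes the same "at least 30% null" rule as `get_column_with_null`.

```python
import pandas as pd

# Toy frame (hypothetical data, not from the LearnPlatform files)
toy_df = pd.DataFrame({
    "state": ["Utah", None, "Ohio", None],
    "locale": ["Suburb", "Rural", None, "City"],
    "district_id": [1, 2, 3, 4],
})

# isnull().mean() gives the fraction of missing rows per column
missing_pct = toy_df.isnull().mean() * 100

# Keep the columns where at least 30% of the rows are missing
columns_over_threshold = missing_pct[missing_pct >= 30].index.tolist()

print(missing_pct.round(2))
print(columns_over_threshold)
```

Here `state` is missing in 2 of 4 rows (50%) and is flagged, while `locale` (25%) is not.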

* ### Districts data

In [None]:
prep_dist = GetDfForPreprocessing(districts_df)
prep_dist.get_info()

In [None]:
prep_dist.get_percentage_missing_columns()

In [None]:
district_null_columns = prep_dist.get_column_with_null()
print('Columns with at least 30% null values: ')
print(district_null_columns)

In [None]:
# drop school districts with NaN states
districts_df = districts_df[districts_df.state.notna()].reset_index(drop=True)

In [None]:
# check size of data
districts_df.shape

In [None]:
districts_df.isnull().sum()

In [None]:
# Fill null values with the mode (most frequent value) of each column
districts_df = districts_df.copy()

# keep only the rows with a state value (safeguard; these rows were already dropped above)
districts_df = districts_df[districts_df['state'].notna()]

# impute only the columns with at least 30% null values
for col in district_null_columns:
    freq = districts_df[col].mode()[0]
    districts_df[col] = districts_df[col].fillna(freq)
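
To make the effect of mode imputation concrete, here is a small sketch on made-up values (the column content and the names `toy_df`, `most_frequent` and `filled` are illustrative assumptions). It shows that every missing entry takes the most frequent value, which also inflates that category's count in any later count plot.

```python
import pandas as pd

# Toy column with two missing entries (hypothetical values)
toy_df = pd.DataFrame({
    "locale": ["Suburb", "Suburb", None, "Rural", None],
})

# mode() returns a Series of the most frequent value(s); take the first one
most_frequent = toy_df["locale"].mode()[0]

# Replace the missing entries with that value
filled = toy_df["locale"].fillna(most_frequent)

print(toy_df["locale"].value_counts(dropna=False))
print(filled.value_counts())
```

After filling, "Suburb" appears four times instead of two, so distributions read from imputed columns should be interpreted with that in mind.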

* ### Products data

In [None]:
products_df.head()

In [None]:
prep = GetDfForPreprocessing(products_df)
prep.get_info()

* Only about 2% of the data contains missing values
* Data contains only object datatypes

In [None]:
# percentage of missing columns
prep.get_percentage_missing_columns()

In [None]:
products_null_columns = prep.get_column_with_null()
print('Columns with at least 30% null values: ')
print(products_null_columns)

No column has 30% or more missing values.

In [None]:
# Keep only the rows with a non-null 'Sector(s)' value; copy so that later column assignments do not act on a view
products_df = products_df[products_df['Sector(s)'].notna()].copy()
products_df.head()

In [None]:
# Split the 'Primary Essential Function' column into a main category and a sub-category
products_df['primary_function_main'] = products_df['Primary Essential Function'].apply(lambda x: x.split(' - ')[0] if pd.notna(x) else x)
products_df['primary_function_sub'] = products_df['Primary Essential Function'].apply(lambda x: x.split(' - ')[1] if pd.notna(x) else x)

products_df.drop("Primary Essential Function", axis=1, inplace=True)
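
The same split can be written with pandas string methods, which skip missing values on their own. This is a sketch on a toy frame, assuming values follow the "MAIN - Sub" pattern; the frame `toy_products` and the helper variable `parts` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column name as products_info.csv (values chosen for illustration)
toy_products = pd.DataFrame({
    "Primary Essential Function": [
        "LC - Digital Learning Platforms",
        "LC - Sites, Resources & Reference - Games & Simulations",
        np.nan,
    ]
})

# Split each string on ' - '; missing values stay missing
parts = toy_products["Primary Essential Function"].str.split(" - ")

# First element is the main category, second element is the sub-category
toy_products["primary_function_main"] = parts.str[0]
toy_products["primary_function_sub"] = parts.str[1]

print(toy_products[["primary_function_main", "primary_function_sub"]])
```

As with the `apply` version, a value with two separators keeps only the middle part as its sub-category.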

In [None]:
products_df.head()

In [None]:
# normalize the label 'Sites, Resources & References' to 'Sites, Resources & Reference'
products_df['primary_function_sub'] = products_df['primary_function_sub'].replace(
    {'Sites, Resources & References' : 'Sites, Resources & Reference'})

* ### Engagements data

In [None]:
# create engagement dataframe 

PATH = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data' 

temp = []

for district in districts_df.district_id.unique():
    df = pd.read_csv(f'{PATH}/{district}.csv', index_col=None, header=0)
    df["district_id"] = district
    # keep only districts with data for all 366 days of 2020 (a leap year)
    if df.time.nunique() == 366:
        temp.append(df)

engagement_df = pd.concat(temp)
engagement_df = engagement_df.reset_index(drop=True)

The cell above adds the key column 'district_id' to each engagement file and concatenates the files from all remaining districts into one dataframe. Only districts with data for all 366 days of 2020 are kept. The result is shown below.
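
The filter-and-concatenate step can be illustrated without touching the file system. The sketch below uses two in-memory frames in place of the per-district CSV files; the district ids `1000` and `2000`, the dictionary `toy_files`, and the constant engagement values are assumptions for illustration.

```python
import pandas as pd

# 2020 is a leap year, so a full year has 366 daily dates
full_year = pd.date_range("2020-01-01", "2020-12-31", freq="D")

# Hypothetical stand-ins for '<district_id>.csv' files
toy_files = {
    1000: pd.DataFrame({"time": full_year.strftime("%Y-%m-%d"), "engagement_index": 1.0}),
    2000: pd.DataFrame({"time": full_year[:100].strftime("%Y-%m-%d"), "engagement_index": 2.0}),
}

kept = []
for district_id, df in toy_files.items():
    # Tag each row with the district it came from
    df = df.assign(district_id=district_id)
    # Keep the district only if it has a record for every day of the year
    if df["time"].nunique() == 366:
        kept.append(df)

toy_engagement = pd.concat(kept, ignore_index=True)
print(toy_engagement["district_id"].unique())
```

District 2000 covers only 100 days and is dropped, so the combined frame contains district 1000 alone.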

In [None]:
engagement_df.head()

In [None]:
prep_eng = GetDfForPreprocessing(engagement_df)
prep_eng.get_info()

In [None]:
# convert the time column to the datetime datatype
engagement_df.time = engagement_df.time.astype('datetime64[ns]')

In [None]:
# Feature generation - generate more features from the engagement data
engagement_df['month'] = engagement_df['time'].dt.month
engagement_df['day']= engagement_df['time'].dt.day
engagement_df['weekday']= engagement_df['time'].dt.weekday

#### Exploratory Data Analysis:
This section explores trends in digital learning and how engagement with digital learning relates to factors such as district demographics, broadband access, and state/national level policies and events.

In [None]:
# Plotting functions
def bar_plot(df, x_col, y_col, title=''):
    plt.figure(figsize=(20, 7))
    sns.barplot(data = df, x=x_col, y=y_col, palette="Set3")
    plt.title(title, size=16)
    plt.xticks(fontsize=14)
    plt.yticks(fontsize=14)
    plt.show()
    
def count_plot(df, col, title):
    plt.figure(figsize=(20, 7))
    sns.countplot(data = df,y = col, order=df[col].value_counts().index, palette="Set3")
    plt.title(title, size=16)
    plt.xticks(fontsize=14)
    plt.yticks(fontsize=14)
    plt.show()
    
def pie_plot(df, col, title=''):
    count = df[col].value_counts()
    plt.figure(figsize=(20, 7))
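    # plt.pie has no palette argument; pass explicit colors from a seaborn palette instead
    plt.pie(list(count), labels=count.index, colors=sns.color_palette('cubehelix', len(count)), autopct='%1.2f%%')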
    plt.title(title, size=16)
    plt.legend()
    plt.show()
    
def count_plot2(df,col,hue,title):
    plt.figure(figsize=(20,10))
    ax=sns.countplot(data=df,x=col,hue= hue,palette='Set3')
    plt.xticks(rotation=90)
    plt.title(title, size=16)
    plt.show()

def time_plot(df, x_col, y_col, title=''):
    plt.figure(figsize=(20, 7))
    sns.lineplot(data=df, x=x_col, y=y_col)
    plt.title(title, size=16)
    plt.xticks(rotation=90, fontsize=14)
    plt.yticks( fontsize=14)
    plt.xlabel(x_col, fontsize=16)
    plt.ylabel(y_col, fontsize=16)
    plt.show()

* ### Insights on Districts data

In [None]:
# view the data
districts_df.head()

In [None]:
count_plot(districts_df, 'state', 'Count of State Distribution')

The three states with the most school districts in the data are Connecticut, Utah, and Massachusetts.

In [None]:
count_plot(districts_df, 'locale', 'Locale Distribution')

The Suburb locale has the most districts in the collected data, and Town has the fewest.

In [None]:
count_plot(districts_df, 'pct_free/reduced', "Percentage Free or Reduced-price lunch Distribution")

Most districts fall in the bracket where about 20% to 40% of students are eligible for free or reduced-price lunch.

In [None]:
count_plot(districts_df, 'pp_total_raw', 'Total Expenditure Per Pupil')

The most frequent per-pupil total expenditure bracket is about 8,000 to 10,000.

In [None]:
count_plot2(districts_df,'state','locale','Locality in each State')

* Most states are made up mainly of Suburb locales
* Tennessee is made up of mainly two locales
* North Dakota and New Hampshire have Rural locales
* States such as Minnesota, Arizona, District of Columbia, and Michigan have only one locale

* ### Insights on Products data

In [None]:
products_df.head()

In [None]:
# top companies
result = products_df['Provider/Company Name'].value_counts().head(15)
top_comp = pd.DataFrame({'Company': result.index, 'Count': result})
bar_plot(top_comp,"Count" , "Company" , title='Top 15 Companies/Providers')

Google LLC is the top provider with a count of more than 25; all other companies/providers have fewer than 10.

In [None]:
products_sect=products_df['Sector(s)'].value_counts().reset_index()

products_sect.columns = ['Sector(s)','percent']

products_sect['percent'] /= len(products_df)
fig = px.pie(products_sect, names='Sector(s)',values='percent',
             color_discrete_sequence=px.colors.qualitative.Set3,
             title='Distribution of Sectors',width=700,height=500
)
fig.show()

The PreK-12 sector is where the products are most used, with a share of 48.3%.

In [None]:
count_plot(products_df, 'primary_function_main', 'Function of the Products')

The plot above shows the primary function categories of the products:
LC - Learning & Curriculum, CM - Classroom Management, SDO - School & District Operations.

Most products fall in the Learning and Curriculum category. Let's have a look at the sub-categories.

In [None]:
count_plot(products_df, 'primary_function_sub', 'Sub-categories in Primary Function')

The Sites, Resources & Reference sub-category has the most products, followed by Digital Learning Platforms.

* ### Insights on Engagements data

In [None]:
engagement_df.head()

In [None]:
plt.figure(figsize=(15,11))
sns.lineplot(y=engagement_df['pct_access'],x=engagement_df["month"],palette='rocket')
plt.title("Average access per month")

Access drops in the middle of the year, which may be due to the summer break from June to mid-August, and rises again afterwards.

In [None]:
plt.figure(figsize=(15,11))
sns.lineplot(hue=engagement_df['weekday'],y=engagement_df['pct_access'],x=engagement_df["month"],palette='rocket')
plt.title("Average access per month per day")

The line plot above shows the percentage access for each weekday by month. Interactions are quite low on the weekends. Engagement peaked in January and September; the trend starts dipping in February and drops sharply in July.

In [None]:
plt.figure(figsize=(15,10))
sns.lineplot(y=engagement_df['engagement_index'],x=engagement_df["month"])
plt.title("Engagement per month")

Engagement per month follows a trend similar to the percentage access. Engagement rose in March and then dropped in April.

In [None]:
plt.figure(figsize=(15,10))
sns.lineplot(y=engagement_df['engagement_index'],x=engagement_df["weekday"])
plt.title("Engagement index per day")

There is more engagement at the beginning of the week, and it slowly drops towards the weekend. Tuesday has the highest average engagement.
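
The weekday numbers on the x-axis follow the pandas convention where Monday is 0 and Sunday is 6. The sketch below, on a few made-up rows (the frame `toy` and its values are illustrative assumptions), shows how the per-weekday mean behind such a plot can be computed and how weekday names can be attached for readability.

```python
import pandas as pd

# Toy engagement rows (hypothetical values)
toy = pd.DataFrame({
    "time": pd.to_datetime(["2020-03-02", "2020-03-03", "2020-03-07", "2020-03-09"]),
    "engagement_index": [100.0, 140.0, 10.0, 120.0],
})

# dt.weekday: Monday=0 ... Sunday=6; dt.day_name() gives the readable label
toy["weekday"] = toy["time"].dt.weekday
toy["weekday_name"] = toy["time"].dt.day_name()

# Average engagement per weekday
mean_by_weekday = toy.groupby("weekday")["engagement_index"].mean()

print(toy[["time", "weekday", "weekday_name"]])
print(mean_by_weekday)
```

In this toy data the two Mondays average to 110, the single Tuesday is 140, and the Saturday is 10.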


In [None]:
result = engagement_df.groupby('time').agg({'engagement_index': 'mean', 'pct_access': 'mean', 'lp_id': 'count'}).\
            reset_index()

time_plot(result, "time", "engagement_index", title='Engagement Over Time')

Engagement drops from June through August. This is likely because the school year ends in this period and students are on break; they return to school in September.

In [None]:
time_plot(result, "time", "pct_access", title='Percentage of access over time')
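
Because weekend values are much lower than weekday values, daily series like the two above show a strong weekly sawtooth. One possible way to make the longer-term trend easier to read is a 7-day rolling mean; the following is a sketch on synthetic data (the frame `daily` and the column `rolling_7d` are illustrative names, and the values are made up).

```python
import pandas as pd

# Two synthetic weeks starting on a Monday: high on weekdays, low on weekends
dates = pd.date_range("2020-01-06", periods=14, freq="D")
daily = pd.DataFrame({
    "time": dates,
    "engagement_index": [100, 100, 100, 100, 100, 10, 10] * 2,
})

# A 7-day window always contains one full week, so the weekly pattern averages out
daily["rolling_7d"] = daily["engagement_index"].rolling(window=7).mean()

print(daily)
```

The first six rows are NaN because the window is not yet full; after that the smoothed value is constant, since every window holds five weekdays and two weekend days.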

#### Merge datasets

In [None]:
# merge the districts and engagement data
districts_engagement = pd.merge(districts_df, engagement_df, on='district_id')
districts_engagement.head()

In [None]:
# merge the products and engagement data
products_engagement = pd.merge(products_df, engagement_df, left_on='LP ID', right_on='lp_id')
products_engagement.head()
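
`pd.merge` performs an inner join by default, so engagement rows whose `lp_id` has no match in the products table are silently dropped. A quick way to see how many rows that affects is a left join with `indicator=True`. This is a sketch on toy frames (`toy_products`, `toy_engagement` and their values are made up for illustration).

```python
import pandas as pd

# Toy tables using the same key column names as the real data
toy_products = pd.DataFrame({"LP ID": [1, 2], "Product Name": ["A", "B"]})
toy_engagement = pd.DataFrame({"lp_id": [1, 1, 3], "engagement_index": [5.0, 7.0, 9.0]})

# Left join keeps every engagement row; '_merge' records whether a product matched
merged = pd.merge(
    toy_engagement, toy_products,
    left_on="lp_id", right_on="LP ID",
    how="left", indicator=True,
)
unmatched = (merged["_merge"] == "left_only").sum()

# Default inner join for comparison
inner = pd.merge(toy_engagement, toy_products, left_on="lp_id", right_on="LP ID")

print(merged)
print(f"unmatched engagement rows: {unmatched}, inner join rows: {len(inner)}")
```

Here the row with `lp_id` 3 has no product entry, so the inner join returns two rows while the left join keeps all three.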

In [None]:
bar_plot(districts_engagement, 'locale', 'engagement_index', title='Engagements in each locale')

Rural areas show higher engagement than the other locales. This could be because most schools are located outside rural areas, which may make digital learning a more practical route to education there.
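
By default, `sns.barplot` draws the mean of the y values for each category, and a mean can be pulled up by a few very large values. Before reading too much into a comparison like the one above, it may help to look at the median and the row count as well. The sketch below uses made-up numbers (the frame `toy` is illustrative and makes no claim about the real data).

```python
import pandas as pd

# Toy rows: one extreme value in the Rural group (hypothetical values)
toy = pd.DataFrame({
    "locale": ["Rural", "Rural", "Rural", "Suburb", "Suburb", "Suburb"],
    "engagement_index": [10.0, 20.0, 1500.0, 50.0, 60.0, 70.0],
})

# Mean, median, and number of rows per locale
summary = toy.groupby("locale")["engagement_index"].agg(["mean", "median", "count"])

print(summary)
```

In this toy example the Rural mean (510) is far above the Suburb mean (60), while the Rural median (20) is below the Suburb median (60), so the two statistics can point in different directions.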

### More analysis to perform:
* Upload the external datasets: COVID-19 US State Policy database and KFF
* Merge and analyze the external, products, engagement, and districts data
* Uncover more insights into the impact of COVID-19
* Draw conclusions