# Problem Statement
The COVID-19 pandemic has disrupted learning for more than 56 million students in the United States. In the spring of 2020, most states and local governments across the U.S. closed educational institutions to stop the spread of the virus. In response, schools and teachers have attempted to reach students remotely through distance learning tools and digital platforms. To this day, concerns about a worsening digital divide and long-term learning loss among America’s most vulnerable learners continue to grow.

# Challenge
1. Explore the state of digital learning in 2020.
1. Examine how engagement with digital learning relates to factors such as district demographics, broadband access, and state- and national-level policies and events.

# Data Description
There are three datasets, built from daily edtech engagement data from over 200 school districts in 2020:
1. The engagement_data folder is based on LearnPlatform’s Student Chrome Extension. The extension collects page load events of over 10K education technology products in LearnPlatform's product library, including websites, apps, web apps, software programs, extensions, ebooks, hardware, and services used in educational institutions. The engagement data have been aggregated at the school district level, and each file represents data from one school district.
1. The products_info.csv file includes information about the characteristics of the top 372 products with most users in 2020.
1. The districts_info.csv file includes information about the characteristics of school districts, including data from NCES and FCC.

In [None]:
# Import Library 
import matplotlib.pyplot as plt 
import seaborn as sns 
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import glob

In [None]:
# Read the district and product datasets
dt_district = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')
dt_product = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')

In [None]:
# Read every engagement file and tag its rows with the district id taken from the file name
path = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data' 
all_files = glob.glob(os.path.join(path, "*.csv"))

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0, dtype={'lp_id': str})
    # ".../engagement_data/<district_id>.csv" -> "<district_id>", independent of the path depth
    district_id = os.path.splitext(os.path.basename(filename))[0]
    df["district_id"] = district_id
    li.append(df)

# ignore_index=True gives the combined frame a fresh 0..n-1 index
dt_engagement = pd.concat(li, ignore_index=True)
dt_engagement.head()

# Exploration district_info dataset

In [None]:
# Preview the first rows of district_info
dt_district.head()

In [None]:
# Show column types and non-null counts for district_info
dt_district.info()

## Explanation of columns in the district_info dataset
* district_id = The unique identifier of the school district
* state = The state where the district is located
* locale = NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural.
* pct_black/hispanic = Percentage of students in the district identified as Black or Hispanic, based on 2018-19 NCES data
* pct_free/reduced = Percentage of students in the district eligible for free or reduced-price lunch, based on 2018-19 NCES data
* county_connections_ratio = Ratio of residential fixed high-speed connections (over 200 kbps in at least one direction) to households, based on county-level data from FCC Form 477 (December 2018 version)
* pp_total_raw = Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project.

In [None]:
# Count the records in the district_info dataset
record_total = len(dt_district)
print("Total records in district_info:", record_total)

In [None]:
# Check descriptive statistics
dt_district.describe(include='all')

In [None]:
# Check Missing Value in district_info
dt_district.isna().sum()

In [None]:
# Find rows where all six descriptive columns (everything except district_id) are missing
null_data = dt_district.loc[dt_district.isnull().sum(axis=1) > 5].index
dt_district.loc[null_data]

In [None]:
# Number of rows with missing values in all six descriptive columns
print('Rows in district_info with all six columns missing =', len(dt_district.loc[null_data]))

In [None]:
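# Drop rows with too many missing values so they do not distort the analysis.
# thresh=6 keeps only rows with at least 6 non-missing values, so besides the rows
# that are empty in all six descriptive columns, any row missing two or more of the
# seven columns is dropped as well.
dt_district.dropna(thresh=6,inplace=True)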
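
The effect of `thresh` is easy to misread, so here is a minimal sketch on a toy frame. The frame and its shortened column names (`pct_bh`, `pct_frl`, `ratio`, `pp_total`) are made up for illustration and only mirror the seven-column shape of districts_info.csv, not its content.

```python
import numpy as np
import pandas as pd

# Toy frame with 7 columns; the values are placeholders, not real district data.
cols = ["district_id", "state", "locale", "pct_bh", "pct_frl", "ratio", "pp_total"]
toy = pd.DataFrame(
    [
        [1, "Utah", "Suburb", "a", "b", "c", "d"],            # 7 non-missing values
        [2, "Utah", "Suburb", "a", "b", "c", np.nan],         # 6 non-missing values
        [3, "Utah", "Suburb", "a", np.nan, "c", np.nan],      # 5 non-missing values
        [4, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],  # 1 non-missing value
    ],
    columns=cols,
)

# thresh=N keeps a row only if it has at least N non-missing values.
kept_thresh_6 = toy.dropna(thresh=6)["district_id"].tolist()
kept_thresh_2 = toy.dropna(thresh=2)["district_id"].tolist()

print(kept_thresh_6)  # rows 1 and 2 survive
print(kept_thresh_2)  # only the fully empty row 4 is dropped
```

With seven columns, `thresh=2` would remove only the rows that carry nothing but an id, while `thresh=6` also removes partially filled rows. Which of the two is appropriate depends on how much missing data the later plots should tolerate.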

In [None]:
# Check how many missing values remain
dt_district.isna().sum()

In [None]:
#check data district after drop missing value
dt_district.head()

In [None]:
# Drop duplicate rows, if any
dt_district = dt_district.drop_duplicates()

In [None]:
# Relabel the pct_black/hispanic and pct_free/reduced bins so they are easier to read
dt_district = dt_district.replace({
    "[0, 0.2[": "0%-20%",
    "[0.2, 0.4[": "20%-40%",
    "[0.4, 0.6[": "40%-60%",
    "[0.6, 0.8[": "60%-80%",
    "[0.8, 1[": "80%-100%",
})
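
The relabelling above follows one pattern, so it can also be sketched with a small helper. This is a minimal sketch, assuming every bin is written as `[low, high[`; `interval_to_pct_label` is a hypothetical name, not part of the dataset or of pandas.

```python
import re

# Matches half-open interval strings such as "[0.2, 0.4[".
_INTERVAL = re.compile(r"^\[\s*([0-9.]+)\s*,\s*([0-9.]+)\s*\[$")

def interval_to_pct_label(value):
    """Hypothetical helper: "[0.2, 0.4[" -> "20%-40%"; other values pass through."""
    if not isinstance(value, str):
        # Missing values (None, NaN) and numbers are returned unchanged.
        return value
    match = _INTERVAL.match(value)
    if match is None:
        # Ordinary strings such as state names are returned unchanged.
        return value
    low, high = (float(group) for group in match.groups())
    return f"{round(low * 100)}%-{round(high * 100)}%"

labels = [interval_to_pct_label(s) for s in ["[0, 0.2[", "[0.8, 1[", "[0.18, 1[", "Utah"]]
print(labels)
```

The helper is meant for the ratio-valued columns only, for example via `Series.map` on `pct_black/hispanic`. Applied to a column of dollar bins it would produce misleading percentage labels.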

In [None]:
# Relabel the county_connections_ratio bins
dt_district = dt_district.replace({
    "[0.18, 1[": "18%-100%",
    "[1, 2[": "100%-200%",
})

In [None]:
# Explore the State Distribution
dt_district['state'].value_counts()

In [None]:
plt.figure(figsize=(10,10))
dt_district['state'].value_counts().plot(kind='barh')
plt.title('Count Distribution for State')
plt.xlabel('count')
plt.ylabel('State')

In [None]:
# Pie Chart Visualisation of State
dt_district["state"].value_counts().head(10).plot(kind = 'pie', autopct='%1.1f%%', figsize=(10, 10)).legend()
plt.title('Pie Chart State')

In [None]:
#count locale data in district
dt_district['locale'].value_counts()

In [None]:
#visualisation count of locale in district dataset
plt.figure(figsize=(10,10))
dt_district['locale'].value_counts().plot(kind='barh')
plt.title('Count of Locale')
plt.xlabel('Count')
plt.ylabel('Locale')

In [None]:
# pie chart visualisation of percentage locale
plt.title('Plot Percentage Distribution Locale')
dt_district['locale'].value_counts().plot(kind='pie', autopct='%3.1f%%',figsize=(10, 10)).legend()

# Exploration product_info dataset

In [None]:
# Preview the first rows of product_info
dt_product.head()

In [None]:
# Show column types and non-null counts for product_info
dt_product.info()

## Explanation Column in product_info dataset
The product file products_info.csv includes information about the characteristics of the top 372 products with most users in 2020. The categories listed in this file are part of LearnPlatform's product taxonomy.

* LP ID = The unique identifier of the product
* URL = Web Link to the specific product
* Product Name = Name of the specific product
* Provider/Company Name = Name of the product provider
* Sector(s) = Sector of education where the product is used
* Primary Essential Function = The basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled.

In [None]:
# Count the records in the product_info dataset
record_total = len(dt_product)
print("Total records in product_info:", record_total)

In [None]:
# Check descriptive statistics
dt_product.describe(include='all')

In [None]:
# Check Missing Value in product info
dt_product.isna().sum()

In [None]:
# Drop duplicate rows, if any
dt_product = dt_product.drop_duplicates()

In [None]:
plt.title('Distribution of Sector(s) in the Product Information Data')
dt_product["Sector(s)"].value_counts().head(10).plot(kind = 'pie', autopct='%1.1f%%', figsize=(10, 10)).legend()

# Exploration engagement_data dataset

In [None]:
#show a few of data from Engagement_Data
dt_engagement.head()

In [None]:
#show information about Engagement Data
dt_engagement.info()

## Explanation of columns in the engagement_data dataset
* time = Date in "YYYY-MM-DD" format
* lp_id = The unique identifier of the product
* pct_access = Percentage of students in the district who have at least one page-load event for a given product on a given day
* engagement_index = Total page-load events per one thousand students for a given product on a given day
* district_id = The unique identifier of the school district
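
To make the two engagement measures concrete, the definitions above can be worked through with made-up numbers. All three input values below are hypothetical and chosen only to keep the arithmetic simple.

```python
# Hypothetical counts for one product in one district on one day.
students_in_district = 2500
students_with_page_load = 400
total_page_loads = 6000

# pct_access: share of students with at least one page-load event, in percent.
pct_access = 100 * students_with_page_load / students_in_district

# engagement_index: page-load events per 1,000 students.
engagement_index = 1000 * total_page_loads / students_in_district

print(pct_access)
print(engagement_index)
```

Because both measures are normalised by district size, they can be compared across districts with very different enrolment.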

In [None]:
# Drop rows without an lp_id, then treat the remaining missing values as 0
dt_engagement = dt_engagement.dropna(subset=['lp_id'])
dt_engagement = dt_engagement.fillna(0.0)

In [None]:
# Cast lp_id and district_id to int so they match the keys in product_info and district_info
# (pd.to_numeric also copes with ids stored as text with a decimal part)
dt_engagement["lp_id"] = pd.to_numeric(dt_engagement["lp_id"]).astype(int)
dt_engagement["district_id"] = dt_engagement["district_id"].astype(int)
# Rename the product key so both tables share the column name lp_id
dt_product.rename(columns = {'LP ID': 'lp_id'}, inplace = True)

In [None]:
# Merge engagement with district and product information (inner joins on the shared keys)
dt_explore = pd.merge(dt_engagement, dt_district, on="district_id")
dt_explore = pd.merge(dt_explore, dt_product, on="lp_id")
dt_explore
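
Both merges above are inner joins, so engagement rows whose product or district is absent from the info tables disappear from the result. A minimal sketch of how that loss could be measured, using toy frames with invented ids and pandas' `indicator=True` option:

```python
import pandas as pd

# Toy data: product 3 has engagement rows but no entry in the product table.
engagement = pd.DataFrame({
    "lp_id": [1, 1, 2, 3],
    "district_id": [10, 11, 10, 12],
    "engagement_index": [5.0, 7.0, 1.0, 2.0],
})
products = pd.DataFrame({"lp_id": [1, 2], "Product Name": ["A", "B"]})

# A left join with indicator=True adds a _merge column marking unmatched rows.
checked = engagement.merge(products, on="lp_id", how="left", indicator=True)
unmatched = int((checked["_merge"] == "left_only").sum())

# The default inner join silently drops those rows.
inner_rows = len(engagement.merge(products, on="lp_id"))

print(unmatched, inner_rows)
```

Running the same check on the real tables would show how much of the engagement data falls outside the products listed in products_info.csv.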

In [None]:
# gain information from the combine dataset 
dt_explore.info()

In [None]:
# change the Dtype of time because previously it was an object
dt_explore['time']= pd.to_datetime(dt_explore['time'])

In [None]:
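# Group by day and sum the two engagement measures
# (selecting the numeric columns first avoids aggregating the text columns)
df_groupbydays = dt_explore.set_index('time')[['pct_access', 'engagement_index']].groupby(pd.Grouper(freq='D')).sum()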

In [None]:
#show the data of groupbyday
df_groupbydays

In [None]:
sns.set(rc = {'figure.figsize':(15,8)})
sns.lineplot(x = "time", y = "pct_access", data = df_groupbydays,marker='o')

In [None]:
sns.lineplot(x = "time", y = "engagement_index", data = df_groupbydays,marker='o')

In [None]:
# Monthly mean of the two engagement measures (numeric columns only)
dt_month = dt_explore.groupby(pd.Grouper(key='time', freq='M'))[['pct_access', 'engagement_index']].mean()

In [None]:
sns.lineplot(x = "time", y = "engagement_index", data = dt_month,marker='o')

In [None]:
sns.lineplot(x = "time", y = "pct_access", data = dt_month,marker='o')

In [None]:
dt_explore.head()

In [None]:
best_product = dt_explore.groupby(by = 'Product Name', as_index = False)['engagement_index'].agg('mean').sort_values(by ='engagement_index', ascending = False)
sns.barplot(x = 'engagement_index', y ='Product Name', data = best_product[0:15])
plt.xlabel(xlabel = 'Engagement Index')
plt.ylabel(ylabel = 'Product Name')
plt.title(label = 'Top 15 Products by Mean Engagement Index')

In [None]:
best_sectors = dt_explore.groupby(by = 'Sector(s)', as_index = False)['engagement_index'].agg('mean').sort_values(by ='engagement_index', ascending = False)
sns.barplot(x = 'engagement_index', y ='Sector(s)', data = best_sectors)
plt.xlabel(xlabel = 'Engagement Index')
plt.ylabel(ylabel = 'Sectors')
plt.title(label = 'Top Sectors')

In [None]:
best_company = dt_explore.groupby(by = 'Provider/Company Name', as_index = False)['engagement_index'].agg('mean').sort_values(by ='engagement_index', ascending = False)
sns.barplot(x = 'engagement_index', y ='Provider/Company Name', data = best_company[0:5])
plt.xlabel(xlabel = 'Engagement Index')
plt.ylabel(ylabel = 'Provider/Company Name')
plt.title(label = 'Best 5 Provider/Company')

In [None]:
best_primary = dt_explore.groupby(by = 'Primary Essential Function', as_index = False)['engagement_index'].agg('mean').sort_values(by ='engagement_index', ascending = False)
sns.barplot(x = 'engagement_index', y ='Primary Essential Function', data = best_primary[0:10])
plt.xlabel(xlabel = 'Engagement Index')
plt.ylabel(ylabel = 'Primary Essential Function')
plt.title(label = 'Best 10 Primary Essential Function')

In [None]:
dt_state = dt_explore.groupby(by = 'state', as_index = False)['engagement_index'].agg('mean').sort_values(by ='engagement_index', ascending = False)
sns.barplot(x = 'engagement_index', y ='state', data = dt_state)
plt.xlabel(xlabel = 'Engagement Index')
plt.ylabel(ylabel = 'State')
plt.title(label = 'Engagement Index based on State')

In [None]:
dt_locale = dt_explore.groupby(by = 'locale', as_index = False)['engagement_index'].agg('mean').sort_values(by ='engagement_index', ascending = False)
sns.barplot(x = 'engagement_index', y ='locale', data = dt_locale)
plt.xlabel(xlabel = 'Engagement Index')
plt.ylabel(ylabel = 'Locale')
plt.title(label = 'Engagement Index based on Locale')

In [None]:
dt_rural =  dt_explore[dt_explore["locale"] == 'Rural']
dt_rural.head()

In [None]:
rural_engagement = dt_rural.groupby(["locale", "time"],as_index=False)["engagement_index"].mean().reset_index(drop=True)
rural_engagement.head()

In [None]:
sns.lineplot(x = 'time',y='engagement_index',data=rural_engagement)
plt.title(label = 'Engagement Index based from Rural')

In [None]:
dt_suburb =  dt_explore[dt_explore["locale"] == 'Suburb']
dt_suburb.head()

In [None]:
suburb_engagement = dt_suburb.groupby(["locale", "time"],as_index=False)["engagement_index"].mean().reset_index(drop=True)
suburb_engagement.head()

In [None]:
sns.lineplot(x = 'time',y='engagement_index',data=suburb_engagement)
plt.title(label = 'Engagement Index based from Suburb')

In [None]:
dt_town =  dt_explore[dt_explore["locale"] == 'Town']
dt_town.head()

In [None]:
town_engagement = dt_town.groupby(["locale", "time"],as_index=False)["engagement_index"].mean().reset_index(drop=True)
town_engagement.head()

In [None]:
sns.lineplot(x = 'time',y='engagement_index',data=town_engagement)
plt.title(label = 'Engagement Index based from Town')

In [None]:
dt_city =  dt_explore[dt_explore["locale"] == 'City']
dt_city.head()

In [None]:
city_engagement = dt_city.groupby(["locale", "time"],as_index=False)["engagement_index"].mean().reset_index(drop=True)
city_engagement.head()

In [None]:
sns.lineplot(x = 'time',y='engagement_index',data=city_engagement)
plt.title(label = 'Engagement Index based from City')

In [None]:
# Label each line explicitly so the legend cannot be mismatched with the plotted series
sns.lineplot(x = 'time',y='engagement_index',data=rural_engagement,color='blue',label='Rural')
sns.lineplot(x = 'time',y='engagement_index',data=suburb_engagement,color='green',label='Suburb')
sns.lineplot(x = 'time',y='engagement_index',data=city_engagement,color='red',label='City')
sns.lineplot(x = 'time',y='engagement_index',data=town_engagement,color='orange',label='Town')
plt.legend()
plt.title(label = 'Engagement Index over Time by Locale')
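
The four per-locale filters and group-bys can also be collapsed into a single step. The following is a minimal sketch on toy data, assuming the same `time`, `locale`, and `engagement_index` columns as `dt_explore`; the values are invented.

```python
import pandas as pd

# Toy stand-in for dt_explore with two locales and two days.
toy = pd.DataFrame({
    "time": pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"]
    ),
    "locale": ["Rural", "Rural", "City", "Rural", "City"],
    "engagement_index": [10.0, 30.0, 50.0, 40.0, 70.0],
})

# One row per day, one column per locale, each cell the mean engagement index.
by_locale = toy.pivot_table(
    index="time", columns="locale", values="engagement_index", aggfunc="mean"
)
print(by_locale)
```

Calling `by_locale.plot()` on such a table draws one labelled line per locale, which keeps the legend tied to the column names.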

In [None]:
dt_explore.head()

In [None]:
fix_pptotalraw = {
    '[4000, 6000[': '4000-6000',
    '[6000, 8000[': '6000-8000',
    '[8000, 10000[': '8000-10000',
    '[10000, 12000[': '10000-12000',
    '[12000, 14000[': '12000-14000',
    '[14000, 16000[': '14000-16000',
    '[16000, 18000[': '16000-18000',
    '[18000, 20000[': '18000-20000',
    '[20000, 22000[': '20000-22000',
    '[22000, 24000[': '22000-24000',
    '[32000, 34000[': '32000-34000'}
dt_explore['pp_total_raw'] = dt_explore['pp_total_raw'].map(fix_pptotalraw)
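
One property of `Series.map` with a dictionary is worth keeping in mind here: any value that is not a key in the dictionary becomes NaN. A minimal sketch with a deliberately incomplete mapping (the second bin below is invented for the demonstration):

```python
import pandas as pd

# Toy column: one bin covered by the mapping, one bin not covered, one missing value.
spending = pd.Series(["[4000, 6000[", "[24000, 26000[", None])
mapping = {"[4000, 6000[": "4000-6000"}

mapped = spending.map(mapping)
print(mapped.tolist())

# Count how many non-missing inputs were lost because the mapping had no entry for them.
lost = int((mapped.isna() & spending.notna()).sum())
print(lost)
```

Comparing the number of missing values before and after the mapping is a quick way to confirm that the dictionary covers every bin present in the data.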

In [None]:
dt_pctblack = dt_explore.groupby(by = 'pct_black/hispanic', as_index = False)['engagement_index'].agg('mean').sort_values(by ='engagement_index')
sns.barplot(x = 'engagement_index', y ='pct_black/hispanic', data = dt_pctblack)
plt.xlabel(xlabel = 'Engagement Index')
plt.ylabel(ylabel = 'pct_black/hispanic')
plt.title(label = 'Engagement Index by Percentage of Students Identified as Black or Hispanic')

In [None]:
dt_connection_ratio = dt_explore.groupby(by = 'county_connections_ratio', as_index = False)['engagement_index'].agg('mean').sort_values(by ='engagement_index')
sns.barplot(x = 'engagement_index', y ='county_connections_ratio', data = dt_connection_ratio)
plt.xlabel(xlabel = 'Engagement Index')
plt.ylabel(ylabel = 'County Connections Ratio')
plt.title(label = 'Engagement Index based on County Connections Ratio')

In [None]:
dt_explore.head()

In [None]:
dt_explore.tail()

In [None]:
# Period used for the "after Covid19" plots: 13 March to 23 November 2020
start_date = "2020-3-13"
end_date = "2020-11-23"

after_start_date = dt_explore["time"] >= start_date
before_end_date = dt_explore["time"] <= end_date
between_two_dates = after_start_date & before_end_date
dt_filtercovid = dt_explore.loc[between_two_dates]

In [None]:
# Period used for the "before Covid19" plots: 1 January to 12 March 2020
start_date = "2020-1-1"
end_date = "2020-3-12"

after_start_date = dt_explore["time"] >= start_date
before_end_date = dt_explore["time"] <= end_date
between_two_dates = after_start_date & before_end_date
dt_beforecvd19 = dt_explore.loc[between_two_dates]
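
A before/after comparison can be reduced to a single ratio of mean engagement. The sketch below uses toy numbers and the 13 March 2020 cutoff from the cells above; unlike the original filter it applies no end date, and the resulting ratio says nothing about the real data.

```python
import pandas as pd

# Toy stand-in for dt_explore: two rows before the cutoff, two rows from the cutoff on.
toy = pd.DataFrame({
    "time": pd.to_datetime(["2020-02-01", "2020-03-12", "2020-03-13", "2020-09-01"]),
    "engagement_index": [100.0, 200.0, 250.0, 350.0],
})

cutoff = pd.Timestamp("2020-03-13")

# Label every row by period, then average the engagement index per period.
period = toy["time"].ge(cutoff).map({False: "before", True: "after"})
means = toy.groupby(period)["engagement_index"].mean()

ratio = means["after"] / means["before"]
print(means.to_dict(), ratio)
```

Using means rather than sums keeps the comparison fair when the two periods differ in length.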

In [None]:
dt_filtercovid.head()

In [None]:
dt_beforecvd19.head()

In [None]:
dt_state_aftercvd = dt_filtercovid.groupby(by = 'state', as_index = False)['engagement_index'].agg('mean').sort_values(by ='engagement_index', ascending = False)
sns.barplot(x = 'engagement_index', y ='state', data = dt_state_aftercvd)
plt.xlabel(xlabel = 'Engagement Index')
plt.ylabel(ylabel = 'State')
plt.title(label = 'Engagement Index based on State after Covid19')

In [None]:
dt_state_beforecvd = dt_beforecvd19.groupby(by = 'state', as_index = False)['engagement_index'].agg('mean').sort_values(by ='engagement_index', ascending = False)
sns.barplot(x = 'engagement_index', y ='state', data = dt_state_beforecvd)
plt.xlabel(xlabel = 'Engagement Index')
plt.ylabel(ylabel = 'State')
plt.title(label = 'Engagement Index based on State before Covid19')

In [None]:
best_product_beforecvd = dt_beforecvd19.groupby(by = 'Product Name', as_index = False)['engagement_index'].agg('sum').sort_values(by ='engagement_index', ascending = False)
sns.barplot(x = 'engagement_index', y ='Product Name', data = best_product_beforecvd[0:15])
plt.xlabel(xlabel = 'Engagement Index')
plt.ylabel(ylabel = 'Product Name')
plt.title(label = 'Top 15 Products by Total Engagement Index before Covid19')

In [None]:
best_product_aftercvd = dt_filtercovid.groupby(by = 'Product Name', as_index = False)['engagement_index'].agg('sum').sort_values(by ='engagement_index', ascending = False)
sns.barplot(x = 'engagement_index', y ='Product Name', data = best_product_aftercvd[0:15])
plt.xlabel(xlabel = 'Engagement Index')
plt.ylabel(ylabel = 'Product Name')
plt.title(label = 'Top 15 Products by Total Engagement Index after Covid19')

In [None]:
best_company_bfrcvd = dt_beforecvd19.groupby(by = 'Provider/Company Name', as_index = False)['engagement_index'].agg('sum').sort_values(by ='engagement_index', ascending = False)
sns.barplot(x = 'engagement_index', y ='Provider/Company Name', data = best_company_bfrcvd[0:5])
plt.xlabel(xlabel = 'Engagement Index')
plt.ylabel(ylabel = 'Provider/Company Name')
plt.title(label = 'Best 5 Provider/Company before Covid19')

In [None]:
best_company_aftrcvd = dt_filtercovid.groupby(by = 'Provider/Company Name', as_index = False)['engagement_index'].agg('sum').sort_values(by ='engagement_index', ascending = False)
sns.barplot(x = 'engagement_index', y ='Provider/Company Name', data = best_company_aftrcvd[0:5])
plt.xlabel(xlabel = 'Engagement Index')
plt.ylabel(ylabel = 'Provider/Company Name')
plt.title(label = 'Best 5 Provider/Company After Covid19')

In [None]:
dt_pct_black_1 =  dt_explore[dt_explore["pct_black/hispanic"] == '0%-20%']
dt_pct_black_2 =  dt_explore[dt_explore["pct_black/hispanic"] == '20%-40%']
dt_pct_black_3 =  dt_explore[dt_explore["pct_black/hispanic"] == '40%-60%']
dt_pct_black_4 =  dt_explore[dt_explore["pct_black/hispanic"] == '60%-80%']
dt_pct_black_5 =  dt_explore[dt_explore["pct_black/hispanic"] == '80%-100%']

In [None]:
# Daily mean engagement for each pct_black/hispanic band (numeric column only)
dt_pct_black_1 = dt_pct_black_1.set_index('time')[['engagement_index']].groupby(pd.Grouper(freq='D')).mean()
dt_pct_black_2 = dt_pct_black_2.set_index('time')[['engagement_index']].groupby(pd.Grouper(freq='D')).mean()
dt_pct_black_3 = dt_pct_black_3.set_index('time')[['engagement_index']].groupby(pd.Grouper(freq='D')).mean()
dt_pct_black_4 = dt_pct_black_4.set_index('time')[['engagement_index']].groupby(pd.Grouper(freq='D')).mean()
dt_pct_black_5 = dt_pct_black_5.set_index('time')[['engagement_index']].groupby(pd.Grouper(freq='D')).mean()

In [None]:
# Label each line explicitly so the legend cannot be mismatched with the plotted series
sns.lineplot(x = 'time',y='engagement_index',data=dt_pct_black_1,color='blue',linestyle="dashed",label='0%-20%')
sns.lineplot(x = 'time',y='engagement_index',data=dt_pct_black_2,color='green',linestyle="dashed",label='20%-40%')
sns.lineplot(x = 'time',y='engagement_index',data=dt_pct_black_3,color='red',linestyle="dashed",label='40%-60%')
sns.lineplot(x = 'time',y='engagement_index',data=dt_pct_black_4,color='orange',linestyle="dashed",label='60%-80%')
sns.lineplot(x = 'time',y='engagement_index',data=dt_pct_black_5,color='pink',linestyle="dashed",label='80%-100%')
plt.legend()
plt.title(label = 'Engagement Index over Time by pct_black/hispanic')

# Conclusion

Having analysed the provided datasets, I can address the first challenge, which concerns the state of digital learning in 2020. The dataset covers 372 digital products that support learning. Digital connectivity and engagement increased during 2020 because activities, including the all-important education sector, had to move online. The time series of the engagement index shows the use of digital products rising in the months when COVID-19 was spreading. Online and distance learning habits are likely to keep adapting as technology improves over time. According to the student engagement index, engagement rose during the pandemic to as much as twice its earlier level and climbed again after the summer holiday.