# INTRODUCTION

Nelson Mandela believed education was the most powerful weapon to change the world. But not every student has equal opportunities to learn. Effective policies and plans need to be enacted in order to make education more equitable, and perhaps your innovative data analysis will help reveal the solution.

Current research shows educational outcomes are far from equitable. The imbalance was exacerbated by the COVID-19 pandemic. There's an urgent need to better understand and measure the scope and impact of the pandemic on these inequities.

Education technology company LearnPlatform was founded in 2014 with a mission to expand equitable access to education technology for all students and teachers. LearnPlatform’s comprehensive edtech effectiveness system is used by districts and states to continuously improve the safety, equity, and effectiveness of their educational technology. LearnPlatform does so by generating an evidence basis for what’s working and enacting it to benefit students, teachers, and budgets.

In this analytics competition, you’ll work to uncover trends in digital learning. Accomplish this with data analysis about how engagement with digital learning relates to factors like district demographics, broadband access, and state/national level policies and events. Then, submit a Kaggle Notebook to propose your best solution to these educational inequities.

Your submissions will inform policies and practices that close the digital divide. With a better understanding of digital learning trends, you may help reverse the long-term learning loss among America’s most vulnerable, making education more equitable.

# PROBLEM STATEMENT

The COVID-19 pandemic has disrupted learning for more than 56 million students in the United States. In the spring of 2020, most states and local governments across the U.S. closed educational institutions to stop the spread of the virus. In response, schools and teachers have attempted to reach students remotely through distance learning tools and digital platforms. To this day, concerns about the widening digital divide and long-term learning loss among America’s most vulnerable learners continue to grow.

# DATA PREPROCESSING

This section loads the packages and source data used in the analysis. The Python packages are mainly for data manipulation (numpy and pandas) and data visualization (matplotlib and seaborn).

In [None]:
import os
import glob
import numpy as np
import pandas as pd
import warnings

import squarify
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib import ticker
import seaborn as sns

# setting up options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('float_format', '{:f}'.format)
warnings.filterwarnings('ignore')

#loading dataset
districts = pd.read_csv('/kaggle/input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')
products = pd.read_csv('/kaggle/input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')

#collect every per-district engagement file
engagement_dir = '/kaggle/input/learnplatform-covid19-impact-on-digital-learning/engagement_data'
engagement_files = sorted(glob.glob(os.path.join(engagement_dir, '*.*')))

#read each file once, tag its rows with the district id taken from the file name,
#and concatenate a single time instead of growing the frame inside the loop
frames = []
for file in engagement_files:
    engagement_file = pd.read_csv(file)
    engagement_file['id'] = os.path.splitext(os.path.basename(file))[0]
    frames.append(engagement_file)
engagement = pd.concat(frames, axis=0).reset_index(drop=True)

#mapping for districts dataset
mapping_1 = {
    '[0, 0.2[': '0%-20%',
    '[0.2, 0.4[': '20%-40%',
    '[0.4, 0.6[': '40%-60%',
    '[0.6, 0.8[': '60%-80%',
    '[0.8, 1[': '80%-100%'}

mapping_2 = {
    '[4000, 6000[': '4000-6000',
    '[6000, 8000[': '6000-8000',
    '[8000, 10000[': '8000-10000',
    '[10000, 12000[': '10000-12000',
    '[12000, 14000[': '12000-14000',
    '[14000, 16000[': '14000-16000',
    '[16000, 18000[': '16000-18000',
    '[18000, 20000[': '18000-20000',
    '[20000, 22000[': '20000-22000',
    '[22000, 24000[': '22000-24000',
    '[32000, 34000[': '32000-34000'}

mapping_3 = {
    '[0.18, 1[': '18%-100%',
    '[1, 2[': '100%-200%'
}

districts['pct_black/hispanic'] = districts['pct_black/hispanic'].map(mapping_1)
districts['pct_free/reduced'] = districts['pct_free/reduced'].map(mapping_1)
districts['county_connections_ratio'] = districts['county_connections_ratio'].map(mapping_3)
districts['pp_total_raw'] = districts['pp_total_raw'].map(mapping_2)

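#separating category and subcategory at the first hyphen
products[['Category', 'Subcategory']] = products['Primary Essential Function'].str.split('-', n=1, expand=True)
#strip surrounding spaces so that later filters such as Category == 'SDO' match
products['Category'] = products['Category'].str.strip()
products['Subcategory'] = products['Subcategory'].str.strip()
products = products.drop('Primary Essential Function', axis=1)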

In [None]:
#colour palettes reused by the plots below (seaborn is already imported as sns)
colors_blue = ["#132C33", "#264D58", '#17869E', '#51C4D3', '#B4DBE9']
colors_dark = ["#1F1F1F", "#313131", '#636363', '#AEAEAE', '#DADADA']
colors_red = ["#331313", "#582626", '#9E1717', '#D35151', '#E9B4B4']
colors_mix = ["#17869E", '#264D58', '#179E66', '#D35151', '#E9DAB4', '#E9B4B4', '#D3B651', '#6351D3']
colors_div = ["#132C33", '#17869E', '#DADADA', '#D35151', '#331313']

sns.palplot(colors_blue)
sns.palplot(colors_dark)
sns.palplot(colors_red)
sns.palplot(colors_mix)
sns.palplot(colors_div)

# DATA SET OVERVIEW

The overview gives a feel for the data structure. It also includes a quick analysis of missing values, basic statistics and data manipulation. There are three datasets: engagement, districts and products.

# ENGAGEMENT

The engagement data are aggregated at school district level, and each file represents data from one school district. The 4-digit file name represents district_id, which can be used to link to district information in districts_info. The lp_id can be used to link to product information in products_info.
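As a rough illustration of the file layout described above, the sketch below reads a folder of per-district files and tags each row with the id taken from the file name. The helper name `load_engagement`, the `.csv` extension and the two demo district ids are assumptions made for the example, not part of the dataset documentation.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_engagement(folder):
    """Read every <district_id>.csv in `folder` and tag its rows with that id."""
    frames = []
    for path in sorted(Path(folder).glob("*.csv")):
        frame = pd.read_csv(path)
        # The file name without its extension is the district id.
        frame["id"] = int(path.stem)
        frames.append(frame)
    # Concatenate once at the end instead of growing a frame inside the loop.
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration with two made-up district files.
with tempfile.TemporaryDirectory() as tmp:
    header = "time,lp_id,pct_access,engagement_index\n"
    Path(tmp, "1000.csv").write_text(header + "2020-01-01,1,0.5,10.0\n")
    Path(tmp, "2000.csv").write_text(
        header + "2020-01-01,1,0.2,4.0\n2020-01-02,2,0.1,1.5\n"
    )
    demo = load_engagement(tmp)

print(demo)
```

Storing the id as an integer here means it can be joined to `district_id` without a later type conversion.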

This dataset contains the following fields:

--> **time:** date in "YYYY-MM-DD"

--> **lp_id:** The unique identifier of the product

--> **pct_access:** Percentage of students in the district who have at least one page-load event of a given product on a given day

--> **engagement_index:** Total page-load events per one thousand students of a given product and on a given day
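To make the two metrics concrete, here is a small worked example based only on the definitions above. The function names, the enrolment figure and the counts are hypothetical, and the sketch assumes pct_access is expressed on a 0 to 100 scale; the exact denominators LearnPlatform uses are not stated in this notebook.

```python
def pct_access(students_with_page_load, enrolled_students):
    """Share of students with at least one page load, as a percentage."""
    return 100.0 * students_with_page_load / enrolled_students


def engagement_index(page_loads, enrolled_students):
    """Page-load events per one thousand students."""
    return 1000.0 * page_loads / enrolled_students


# Hypothetical district and day: 2,000 students, 300 of them opened the
# product, generating 4,500 page loads in total.
print(pct_access(300, 2000))         # 15.0
print(engagement_index(4500, 2000))  # 2250.0
```

Because both metrics are normalised by student numbers, districts of different sizes can be compared directly.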

**Observations:**

--> There are 22,324,190 rows with 5 columns: the four fields above plus the district id added while loading the files.

--> This dataset contains 5,392,397 missing values: 541 in lp_id, 13,447 in pct_access and 5,378,409 in engagement_index. Missing values in engagement_index are substantial: they affect about 24.1% of all rows.
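The counts and shares quoted above can be derived with a small helper like the one below. This is a minimal sketch on a toy frame; `missing_summary` is a hypothetical name and the toy values are invented for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return missing-value counts and their share of rows, per column."""
    counts = frame.isna().sum()
    summary = pd.DataFrame(
        {"missing": counts, "pct_of_rows": 100 * counts / len(frame)}
    )
    return summary.sort_values("missing", ascending=False)


# Toy data standing in for the engagement table.
toy = pd.DataFrame(
    {
        "lp_id": [1.0, 2.0, np.nan, 4.0],
        "pct_access": [0.1, 0.2, 0.3, 0.4],
        "engagement_index": [5.0, np.nan, np.nan, 2.0],
    }
)
summary = missing_summary(toy)
print(summary)
```

Reporting the share next to the raw count makes it easier to judge whether a column can still be used.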

In [None]:
engagement.head()

In [None]:
print(f'Number of rows: {engagement.shape[0]};  Number of columns: {engagement.shape[1]}; No of missing values: {sum(engagement.isna().sum())}')

In [None]:
print('Number of missing Values in every column:')
print(engagement.isna().sum())

# BASIC STATISTICS

Below are basic statistics for each variable: count, mean, standard deviation, minimum, 1st quartile, median, 3rd quartile and maximum.

In [None]:
engagement.describe()

# DISTRICTS

The district file includes information about the characteristics of school districts, including data from NCES (2018-19), FCC (Dec 2018), and Edunomics Lab. In this data set, LearnPlatform removed the identifiable information about the school districts. LearnPlatform also used the open-source tool ARX (Prasser et al. 2020) to transform several data fields and reduce the risk of re-identification. For generalization purposes, some data points are released as a range that contains the actual value. Additionally, many values are missing and marked as 'NaN', indicating that the data was suppressed to maximize anonymization of the dataset.

This dataset contains the following fields:

--> **district_id:** The unique identifier of the school district

--> **state:** The state where the district is located

--> **locale:** NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural.

--> **pct_black/hispanic:** Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data.

--> **pct_free/reduced:** Percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data

--> **county_connections_ratio:** ratio (residential fixed high-speed connections over 200 kbps in at least one direction/households) based on the county level data from FCC Form 477 (December 2018 version).

--> **pp_total_raw:** Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project.
The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district.
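Since several of these fields are released as ranges, it can help to parse the range strings rather than spell out every label by hand. The sketch below assumes the half-open notation `[lower, upper[` seen in the mapping dictionaries earlier in this notebook; `parse_interval` and `percent_label` are hypothetical helper names.

```python
import re

import pandas as pd

# Matches strings such as "[0.2, 0.4[" or "[4000, 6000[".
INTERVAL = re.compile(r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*\[")


def parse_interval(text):
    """Return (lower, upper) for an interval string, or (nan, nan) otherwise."""
    if not isinstance(text, str):
        return (float("nan"), float("nan"))
    match = INTERVAL.fullmatch(text.strip())
    if match is None:
        return (float("nan"), float("nan"))
    return (float(match.group(1)), float(match.group(2)))


def percent_label(text):
    """Turn a fraction interval into a label such as '20%-40%'."""
    lower, upper = parse_interval(text)
    if lower != lower:  # NaN is the only value not equal to itself
        return None
    return f"{lower * 100:.0f}%-{upper * 100:.0f}%"


raw = pd.Series(["[0, 0.2[", "[0.2, 0.4[", None, "[0.8, 1["])
labels = raw.map(percent_label)
print(labels.tolist())
print(parse_interval("[4000, 6000["))
```

Keeping the numeric bounds also allows a midpoint to be computed when an ordered or numeric version of the field is needed.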

**Observations:**

--> There are 223 rows with 7 columns as mentioned above.

--> This dataset contains 442 missing values, mainly in pp_total_raw (115), pct_free/reduced (85) and county_connections_ratio (71).

In [None]:
districts.head()

In [None]:
print(f'Number of rows: {districts.shape[0]};  Number of columns: {districts.shape[1]}; No of missing values: {sum(districts.isna().sum())}')

In [None]:
print('Number of missing Values in every column:')
print(districts.isna().sum())

In [None]:
plt.figure(figsize=(16, 10))

sns.countplot(y="state",data=districts,order=districts.state.value_counts().index,palette="pastel",linewidth=3)
plt.title("State Distribution",size=18)

sns.despine()
plt.show()

In [None]:
fig, ax  = plt.subplots(figsize=(5, 3))
fig.suptitle('Locale Type Distribution', size = 5)

labels = list(districts.locale.value_counts().index)
sizes = districts.locale.value_counts().values
explode = (0, 0, 0, 0.1)

ax.pie(sizes, explode=explode,startangle=60, labels=labels,autopct='%1.0f%%', pctdistance=0.7, colors=["#FFFF33","#ff9100","#eaaa00","#6d6875"])
ax.add_artist(plt.Circle((0,0),0.3,fc='white'))
plt.show()

# PRODUCTS

The product file includes information about the characteristics of the top 372 products with the most users in 2020. The categories listed in this file are part of LearnPlatform's product taxonomy. Data were labeled by the LearnPlatform team. Some products may not have labels because they are duplicates, lack an accurate URL, or for other reasons.

This dataset contains the following fields:

--> **LP ID:** The unique identifier of the product

--> **URL:** Web Link to the specific product

--> **Product Name:** Name of the specific product

--> **Provider/Company Name:** Name of the product provider

--> **Sector(s):** Sector of education where the product is used

--> **Category:** The basic function of the product. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations.

--> **Subcategory:** Each of these categories has multiple sub-categories with which the products were labeled
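The category and subcategory come from a single source column, Primary Essential Function. The sketch below shows one way to split it at the first hyphen and trim the surrounding spaces; the example strings are illustrative guesses at the format, not values copied from the file.

```python
import pandas as pd

# Illustrative values; the real column may contain other labels.
toy = pd.DataFrame(
    {
        "Primary Essential Function": [
            "LC - Digital Learning Platforms",
            "SDO - Learning Management Systems (LMS)",
            "LC - Sites, Resources & Reference - Games & Simulations",
            None,
        ]
    }
)

# n=1 splits only at the first hyphen, so hyphens inside the
# subcategory are preserved.
parts = toy["Primary Essential Function"].str.split("-", n=1, expand=True)
toy["Category"] = parts[0].str.strip()
toy["Subcategory"] = parts[1].str.strip()

print(toy[["Category", "Subcategory"]])
```

Without the `strip()` calls the category would keep a trailing space (for example `'SDO '`), and an equality filter on `'SDO'` would silently match nothing.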

**Observations:**

--> There are 372 rows with 7 columns as mentioned above.

--> This dataset contains 61 missing values: 20 each in Sector(s), Category and Subcategory, and 1 in Provider/Company Name.

In [None]:
products.head()

In [None]:
print(f'Number of rows: {products.shape[0]};  Number of columns: {products.shape[1]}; No of missing values: {sum(products.isna().sum())}')

In [None]:
print('Number of missing Values in every column:')
print(products.isna().sum())

In [None]:
df = products.groupby('Subcategory').count()[['LP ID']].sort_values(by="LP ID", ascending=True)

fig, ax = plt.subplots(figsize=(8, 14))

bars0 = ax.barh(df.index, df['LP ID'], color="#179E66", alpha=0.8, edgecolor=colors_dark[0])

ax.grid(axis='x', alpha=0.3)
ax.set_axisbelow(True)
ax.set_xlabel("Total Products", fontsize=14, labelpad=10, fontweight='bold', color=colors_dark[0])
ax.set_ylabel("Subcategory", fontsize=14, labelpad=10, fontweight='bold', color=colors_dark[0])
xmin, xmax = ax.get_xlim()
ymin, ymax = ax.get_ylim()

plt.text(s="About The Data | Products", ha='left', x=xmin, y=ymax*1.08, fontsize=24, color=colors_dark[0])
plt.text(s="Count of products by its functions", ha='left', x=xmin, y=ymax*1.04, fontsize=24, fontweight='bold', color=colors_dark[0])
plt.title("Most products in this dataset belong to Digital Learning Platforms", loc='left', fontsize=13, color=colors_dark[1]) 

plt.show()


In [None]:
df = products.groupby('Sector(s)').count()[['LP ID']].sort_values(by="LP ID", ascending=False)

fig, ax = plt.subplots(figsize=(14, 8))

bars0 = ax.bar(df.index, df['LP ID'], color="#9E1717", alpha=0.8, edgecolor="#9E1717")

ax.grid(axis='y', alpha=0.3)
ax.set_axisbelow(True)
ax.set_xlabel("Sector(s)", fontsize=14, labelpad=10, fontweight='bold', color=colors_dark[0])
ax.set_ylabel("Total Products", fontsize=14, labelpad=10, fontweight='bold', color=colors_dark[0])
xmin, xmax = ax.get_xlim()
ymin, ymax = ax.get_ylim()


for i, bar in enumerate(bars0) : 
    x=bar.get_x()
    y=bar.get_height()
    if i < 3 : 
        ax.text(
        s=f"{df.iloc[i].values[0]}\nProducts",
        va='center', ha='center', 
        x=x+0.38, y=y/2,
        color='white',
        fontsize=18,
        fontweight='bold')
    else : 
        ax.text(
        s=f"{df.iloc[i].values[0]}",
        va='center', ha='center', 
        x=x+0.38, y=y+5,
        color=colors_dark[0],
        fontsize=14)
        

plt.text(s="About The Data | Products", ha='left', x=xmin, y=ymax*1.16, fontsize=24, color=colors_dark[0])
plt.text(s="Count of products by its Sector(s)", ha='left', x=xmin, y=ymax*1.1, fontsize=24, fontweight='bold', color=colors_dark[0])
plt.title("Most products in this dataset belong to the PreK-12 sector,\nmeaning that most of the education services in this dataset are for kindergarten to 12th grade students", loc='left', fontsize=13, color=colors_dark[2]) 

plt.show()

# MERGING DATA SETS

We will merge the engagement, districts and products datasets into one dataset called merged that holds the information from all three. The original districts table is kept because it is reused later when geocoding the states.
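The merge can be sanity-checked on a toy example. The sketch below uses two options of pandas `merge`: `validate='many_to_one'` raises an error if a lookup table has duplicate keys, and `indicator` records which rows found a match. The `toy_*` frames and the `product_match` column name are made up for illustration.

```python
import pandas as pd

toy_engagement = pd.DataFrame(
    {"id": [1000, 1000, 2000], "lp_id": [1, 2, 99], "engagement_index": [10.0, 4.0, 1.5]}
)
toy_districts = pd.DataFrame({"district_id": [1000, 2000], "state": ["Utah", "Illinois"]})
toy_products = pd.DataFrame({"LP ID": [1, 2], "Product Name": ["Product A", "Product B"]})

# Left joins keep every engagement row, even when no district or product matches.
toy_merged = toy_engagement.merge(
    toy_districts, left_on="id", right_on="district_id", how="left", validate="many_to_one"
)
toy_merged = toy_merged.merge(
    toy_products,
    left_on="lp_id",
    right_on="LP ID",
    how="left",
    validate="many_to_one",
    indicator="product_match",
)
toy_merged = toy_merged.drop(columns=["district_id", "LP ID"])

# Rows flagged 'left_only' refer to products missing from the product table.
unmatched = (toy_merged["product_match"] == "left_only").sum()
print(toy_merged)
print("engagement rows without product info:", unmatched)
```

Counting the unmatched rows right after the merge shows how much of the engagement data will later fall into the Unknown group.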

In [None]:
merged = engagement.copy()
merged['id'] = merged['id'].astype('int64') 
merged = merged.merge(districts, left_on='id', right_on='district_id', how='left')
merged = merged.merge(products, left_on='lp_id', right_on='LP ID', how='left')
merged = merged.drop('district_id', axis=1)
merged = merged.drop('LP ID', axis=1)
merged['time'] = pd.to_datetime(merged['time'])

In [None]:
merged.head()

In [None]:
print(f'Number of rows: {merged.shape[0]};  Number of columns: {merged.shape[1]}; No of missing values: {sum(merged.isna().sum())}')

In [None]:
print('Number of missing Values in every column:')
print(merged.isna().sum())

In [None]:
merged.describe()

In [None]:
merged['Provider/Company Name'].value_counts(dropna=False).shape

# VISUALIZATION

The engagement dataset records, for each school district and each day of 2020, which products students accessed; in total it holds about 22 million such records. It references 8,646 products, but only 368 of them could be mapped using the products_info dataset; unmapped products are categorized as Unknown.

To make this a little clearer:

--> The dataset is presented on a daily basis.

--> A product appears for a school district on a given day only if it was accessed there on that day.

This part also looks at trends:

--> The mean number of accessed products.

--> How many products are used each day.
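The first of these measures can be sketched on toy data as follows: count the product rows per district and day, then average those counts across districts for each day. The frame and its values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "time": pd.to_datetime(["2020-01-01"] * 3 + ["2020-01-02"] * 4),
        "id": [1000, 1000, 2000, 1000, 1000, 1000, 2000],
        "lp_id": [1, 2, 1, 1, 2, 3, 1],
    }
)

# Step 1: number of products accessed per district and day.
per_district = toy.groupby(["time", "id"]).size().rename("amount").reset_index()

# Step 2: mean of those counts across districts, per day.
mean_daily = per_district.groupby("time")["amount"].mean()
print(mean_daily)
```

On the first day the two districts accessed 2 and 1 products (mean 1.5); on the second day 3 and 1 (mean 2.0).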

In [None]:
#number of product rows per district and day, then the mean across districts for each day
temp = pd.DataFrame(merged.groupby(['time', 'id'])['time'].count())
temp = temp.rename(columns={"time":"amount"})
temp = temp.reset_index(drop=False)
temp = temp.groupby('time')['amount'].mean()

background_color = "#B4DBE9"
sns.set_palette(['dimgray']*400)

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(9, 1), facecolor='#B4DBE9')
gs = fig.add_gridspec(1, 1)
gs.update(wspace=0, hspace=0)
ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor(background_color)

#graph
ax0 = sns.barplot(ax=ax0, x=temp.index, y=temp, zorder=2, linewidth=0.8, saturation=1)
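#note: the bars sit at integer positions (0 = first day of the data), so the 1970 dates below
#appear to serve as day offsets from the Unix epoch rather than as calendar dates
summer = np.arange(np.datetime64("1970-06-01"), np.datetime64("1970-08-24"))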
ax0.fill_between(summer, np.max(temp), color='#ffd514', alpha=0.5, zorder=2, linewidth=0)
plt.axvline(np.datetime64("1970-02-12"), color='#ffd514', alpha=0.5)
plt.axvline(np.datetime64("1970-03-11"), color='#ffd514', alpha=0.5)
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)
ax0.grid(which='major', axis='x', zorder=0, color='#EEEEEE', lw=0.3)
ax0.grid(which='major', axis='y', zorder=0, color='#EEEEEE', lw=0.3)

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim() 
ax0.text(x0, y1*1.11, 'Mean Daily Accessed Products', color='black', fontsize=7, ha='left', va='bottom', weight='bold')
ax0.text(x0, y1*1.1, 'Accessed products increase after the summer holiday', 
        color='#292929', fontsize=5, ha='left', va='top')
ax0.annotate("temporary\nschool closures", 
             xy=(np.datetime64("1970-02-12"), 430), 
             xytext=(np.datetime64("1970-01-09"), 380), 
             arrowprops=dict(arrowstyle="->"), fontsize=5)
ax0.annotate("COVID-19\nPandemic", 
             xy=(np.datetime64("1970-03-11"), 430), 
             xytext=(np.datetime64("1970-03-18"), 350), 
             arrowprops=dict(arrowstyle="->"), fontsize=5)
ax0.annotate("Summer Holiday", 
             xy=(np.datetime64("1970-06-27"), 350), 
             xytext=(np.datetime64("1970-06-27"), 350), 
             fontsize=5)

#format axis
ax0.set_xlabel("date",fontsize=5, weight='bold')
ax0.set_ylabel("products",fontsize=5, weight='bold')

#format the ticks
ax0.tick_params('both', length=2, which='major', labelsize=5)
months = mdates.MonthLocator()
ax0.xaxis.set_major_locator(months)
ax0.set_xticklabels(['Jan 2020', 'Feb 2020', 'Mar 2020', 'Apr 2020', 'May 2020', 'Jun 2020', 'Jul 2020', 
                     'Aug 2020', 'Sep 2020', 'Oct 2020', 'Nov 2020', 'Dec 2020'])
y_format = ticker.FuncFormatter(lambda x, p: format(int(x), ','))
ax0.yaxis.set_major_formatter(y_format)

plt.show()

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(4, 3), facecolor='#DADADA')
gs = fig.add_gridspec(1, 2)
gs.update(wspace=1.1, hspace=1.5)

##########PRODUCT##########
temp = pd.DataFrame(merged.groupby('Product Name', dropna=False)['engagement_index'].sum()/1000000).reset_index()
temp.columns = ['product', 'amount']
temp = temp.fillna('Unknown')
temp = temp.sort_values('amount', ascending=False)
temp = temp[0:10]

background_color = "#f6f5f5"
color_map = ["blue" for _ in range(75)]
color_map[0] = "#ffd514"
sns.set_palette(sns.color_palette(color_map))

ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor(background_color)
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)

#graph
ax0_sns = sns.barplot(ax=ax0, y=temp['product'], x=temp['amount'], 
                      zorder=2, linewidth=0, orient='h', saturation=1, alpha=1)
ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.3)
ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.3)

#format axis
ax0_sns.set_xlabel("page-load (million)",fontsize=3, weight='bold')
ax0_sns.set_ylabel("products",fontsize=3, weight='bold')
ax0_sns.tick_params(labelsize=3, width=0.5, length=1.5)

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim()
ax0.text(x0, y1-0.45, 'Top 10 Products', fontsize=4, ha='left', va='top', weight='bold')
ax0.text(x0, y1-0.2, 'Top 10 products are controlled by Google LLC', fontsize=3, ha='left', va='top')

# data label
for p in ax0.patches:
    value = f'{p.get_width():,.0f}'
    x = p.get_x() + p.get_width() + 55
    y = p.get_y() + p.get_height() / 2 
    ax0.text(x, y, value, ha='center', va='center', fontsize=2.7, 
            bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.3))
x_format = ticker.FuncFormatter(lambda x, p: format(int(x), ','))
ax0.xaxis.set_major_formatter(x_format)

##########PROVIDER##########
temp = pd.DataFrame(merged.groupby('Provider/Company Name', dropna=False)['engagement_index'].sum()/1000000).reset_index()
temp.columns = ['product', 'amount']
temp = temp.fillna('Unknown')
temp = temp.sort_values('amount', ascending=False)
temp = temp[0:10]

background_color = "#f6f5f5"
color_map = ["blue" for _ in range(75)]
color_map[0] = "#ffd514"
sns.set_palette(sns.color_palette(color_map))

ax0 = fig.add_subplot(gs[0, 1])
ax0.set_facecolor(background_color)
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)

#graph
ax0_sns = sns.barplot(ax=ax0, y=temp['product'], x=temp['amount'], 
                      zorder=2, linewidth=0, orient='h', saturation=1, alpha=1)
ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.3)
ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.3)

#format axis
ax0_sns.set_xlabel("page-load (million)",fontsize=3, weight='bold')
ax0_sns.set_ylabel("providers",fontsize=3, weight='bold')
ax0_sns.tick_params(labelsize=3, width=0.5, length=1.5)

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim()
ax0.text(x0, y1-0.47, 'Top 10 Providers', fontsize=4, ha='left', va='top', weight='bold')
ax0.text(x0, y1-0.2, 'Google LLC is the top provider for digital learning', fontsize=3, ha='left', va='top')

# data label
for p in ax0.patches:
    value = f'{p.get_width():,.0f}'
    x = p.get_x() + p.get_width() + 130
    y = p.get_y() + p.get_height() / 2 
    ax0.text(x, y, value, ha='center', va='center', fontsize=2.7, 
            bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.3))
x_format = ticker.FuncFormatter(lambda x, p: format(int(x), ','))
ax0.xaxis.set_major_formatter(x_format)


In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(4, 4), facecolor='#f6f5f5')
gs = fig.add_gridspec(2, 2)
gs.update(wspace=0.2, hspace=1.5)

##########CATEGORY##########
temp = pd.DataFrame(merged.groupby(['Category'], dropna=False)['engagement_index'].sum()).reset_index()
temp.columns = ['description', 'amount']
temp = temp.fillna('Unknown')
temp = temp.sort_values('amount', ascending=False)

background_color = "#331313"
color_map = ["blue" for _ in range(75)]
color_map[0] = "#ffd514"
sns.set_palette(sns.color_palette(color_map))

ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor(background_color)
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)

#graph
ax0.plot(temp['description'], temp['amount']/1000000, 'o--', color="#ffd514", markersize=3, markeredgewidth=0, linewidth=0.5, zorder=4)
ax0.fill_between(temp['description'], temp['amount']/1000000, color="#d3d3d3", zorder=3, alpha=0.5, linewidth=0)
ax0.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.2)
ax0.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.2)

#format axis
ax0.set_xlabel("category",fontsize=3, weight='bold')
ax0.set_ylabel("page-load (million)",fontsize=3, weight='bold')
ax0.tick_params(labelsize=3, width=0.5, length=1.5)
ax0.yaxis.set_major_formatter(y_format)

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim()
ax0.text(x0, y1+285, 'Category & Page Load', fontsize=4, ha='left', va='top', weight='bold')
ax0.text(x0, y1+160, 'Most of the page-load come from Learning & Curriculum', fontsize=3, ha='left', va='top')

##########SECTORS##########
temp = pd.DataFrame(merged.groupby(['Sector(s)'], dropna=False)['engagement_index'].sum()).reset_index()
temp.columns = ['description', 'amount']
temp = temp.fillna('Unknown')
temp = temp.sort_values('amount', ascending=False)

background_color = "#331313"
color_map = ["dimgray" for _ in range(75)]
color_map[0] = "#ffd514"
sns.set_palette(sns.color_palette(color_map))

ax0 = fig.add_subplot(gs[0, 1])
ax0.set_facecolor(background_color)
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)

#graph
ax0.plot(temp['description'], temp['amount']/1000000, 'o--', color="#ffd514", markersize=3, markeredgewidth=0, linewidth=0.5, zorder=4)
ax0.fill_between(temp['description'], temp['amount']/1000000, color="#d3d3d3", zorder=3, alpha=0.5, linewidth=0)
ax0.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.2)
ax0.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.2)

#format axis
ax0.set_xlabel("sector",fontsize=3, weight='bold')
ax0.set_ylabel("page-load (million)",fontsize=3, weight='bold')
ax0.tick_params(labelsize=3, width=0.5, length=1.5)
ax0.yaxis.set_major_formatter(y_format)
ax0.set_xticklabels(['PreK-12;\nHigher Ed;\nCorporate', 'Unknown', 'PreK-12', 
                     'PreK-12;\nHigher Ed', 'Corporate', 'Higher Ed;\nCorporate'])

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim()
ax0.text(x0, y1+300, 'Sectors & Page Load', fontsize=4, ha='left', va='top', weight='bold')
ax0.text(x0, y1+150, 'PreK-12; Higher Ed; Corporate is dominating', fontsize=3, ha='left', va='top')

plt.show()

In [None]:
merged['time'] = pd.to_datetime(merged['time'], errors='coerce')
merged['month'] = merged['time'].dt.month

In [None]:
engagement_per_month=merged.groupby(['month'], as_index=False)['engagement_index'].mean()
engagement_per_month=engagement_per_month.sort_values(by=['month'],ascending=True)

In [None]:
plt.figure(figsize = (7,4))

sns.lineplot(data=engagement_per_month, x="month", y= "engagement_index", color='g')
plt.title('Monthly Average Engagement in 2020 (All Districts)', size=10)
plt.xlabel('Month',size=12)

sns.despine()
plt.show()

In [None]:
#average engagement index per category, sorted in descending order
top_category_platform=merged.groupby(['Category'], as_index=False)['engagement_index'].mean()
top_category_platform=top_category_platform.sort_values(by=['engagement_index'],ascending=False)

In [None]:
top_category_platform.head()

In [None]:
plt.figure(figsize = (10,4))

sns.barplot(data=top_category_platform[:10], y="Category", x= "engagement_index")
plt.title('Categories by Average Daily Engagement in 2020 (All Districts)', size=18)
sns.despine()
plt.show()

In [None]:
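#average engagement index per state for School & District Operations (SDO) products
#note: this filter keeps every SDO product, not only learning management systems;
#str.strip() guards against stray spaces left over from splitting the category label
state_most_visit_lms = merged[merged['Category'].str.strip()=='SDO']
state_most_visit_lms = state_most_visit_lms.groupby(['state'], as_index=False)['engagement_index'].mean()
state_most_visit_lms = state_most_visit_lms.sort_values(by=['engagement_index'],ascending=False)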

In [None]:
plt.figure(figsize = (12,6))

sns.barplot(data=state_most_visit_lms.head(5), x="state", y= "engagement_index")

plt.title('Top 5 States by Average Engagement with SDO Products in 2020',size=18)
plt.xlabel('state',size=14)

locs, labels = plt.xticks()
plt.setp(labels, rotation=45)
sns.despine()
plt.show()

In [None]:
plt.figure(figsize = (12,6))

sns.barplot(data=state_most_visit_lms.tail(5), x="state", y= "engagement_index")

plt.title('Bottom 5 States by Average Engagement with SDO Products in 2020',size=18)
plt.xlabel('State',size=14)

locs, labels = plt.xticks()
plt.setp(labels, rotation=45)
sns.despine()
plt.gca().invert_xaxis()
plt.show()

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(4, 4), facecolor='#DADADA')
gs = fig.add_gridspec(1, 1)
gs.update(wspace=0.7, hspace=0.1)

##########CORRELATION STATE##########
temp = merged[['time', 'state', 'engagement_index']].copy()
temp['state'] = temp['state'].fillna('Unknown')
#keep time as the index so that only the state columns enter the correlation
temp = temp.pivot_table(index='time', columns='state', values='engagement_index', 
                        aggfunc='sum', dropna=False)

background_color = "#f6f5f5"
colors = ["black", "#51C4D3"]
colormap = matplotlib.colors.LinearSegmentedColormap.from_list("", colors)

ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor(background_color)

#graph
ax0_sns = sns.heatmap(temp.corr(), ax=ax0, annot=True, square=True, xticklabels=True, yticklabels=True,
            annot_kws={"size": 3}, cbar=False, cmap=colormap, linewidths=0.3, 
            linecolor='black', fmt='.1g')

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim()
ax0.text(x0, y1-1.1, "Correlation Between State", fontsize=5, ha='left', va='top', weight='bold')
ax0.text(x0, y1-0.5, 'BLUE indicates a high positive correlation', fontsize=3, ha='left', va='top')

#axis
ax0_sns.set_xlabel("")
ax0_sns.set_ylabel("")
ax0_sns.tick_params(length=0, labelsize=3)

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(3, 3), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 1)
gs.update(wspace=0.7, hspace=0.1)

##########CORRELATION LOCALE##########
temp = merged[['time', 'locale', 'engagement_index']].copy()
temp['locale'] = temp['locale'].fillna('Unknown')
#keep time as the index so that only the locale columns enter the correlation
temp = temp.pivot_table(index='time', columns='locale', values='engagement_index', 
                        aggfunc='sum', dropna=False)

background_color = "#f6f5f5"
colors = ["black", "#6351D3"]
colormap = matplotlib.colors.LinearSegmentedColormap.from_list("", colors)

ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor(background_color)

#graph
#matrix = np.triu(temp.corr())
ax0_sns = sns.heatmap(temp.corr(), ax=ax0, annot=True, square=True, xticklabels=True, yticklabels=True,
            annot_kws={"size": 4}, cbar=False, cmap=colormap, linewidths=0.3, 
            linecolor='black', fmt='.1g')

#title
x0, x1 = ax0.get_xlim()
y0, y1 = ax0.get_ylim()
ax0.text(x0, y1-0.38, "Correlation Between Locale", fontsize=6, ha='left', va='top', weight='bold')
ax0.text(x0, y1-0.2, 'Byzantine Night Blue indicates a high positive correlation', fontsize=4, ha='left', va='top')

#axis
ax0_sns.set_xlabel("")
ax0_sns.set_ylabel("")
ax0_sns.tick_params(length=0, labelsize=3)

In [None]:
from geopy.geocoders import Nominatim
from folium.plugins import HeatMap

In [None]:
#drop the missing state so that NaN is never sent to the geocoder
locations=pd.DataFrame({"state":districts['state'].dropna().unique()})

geolocator=Nominatim(user_agent="app")

#we need to get the latitude and longitude data
lat=[]
lon=[]
for location in locations['state']:
    location = geolocator.geocode(location)    
    if location is None:
        lat.append(np.nan)
        lon.append(np.nan)
    else:
        lat.append(location.latitude)
        lon.append(location.longitude)
        
locations['lat']=lat
locations['lon']=lon

In [None]:
state_engagement = merged.groupby(['state'], as_index=False)['engagement_index'].mean()

#merge the state engagement data with latidude and longitude
final_loc = state_engagement.merge(locations,on='state',how="left").dropna()
final_loc

In [None]:
import folium
from folium import plugins

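#'Stamen Terrain' is no longer bundled with recent folium releases; OpenStreetMap is a built-in alternative
us_map = folium.Map(location=[38,-97],zoom_start =5, tiles='OpenStreetMap')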

HeatMap(final_loc[['lat','lon','engagement_index']],zoom=20,radius=20).add_to(us_map)
average_engagement = plugins.MarkerCluster().add_to(us_map)

for lat, long, label, in zip(final_loc.lat, final_loc.lon, final_loc.engagement_index):
    folium.Marker(
        location=[lat,long],
        icon=None,
        popup=label,
    ).add_to(average_engagement)

us_map