# <div style="color:#fff;display:fill;border-radius:10px;background-color:#000000;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%"> - | Notebook summary</div>

<p style="font-size:15px; font-family:verdana; line-height: 1.7em; margin-left: 20px">
In this notebook I analyze how the COVID-19 pandemic affected the learning process in the United States. In the spring of 2020, most states and local governments across the U.S. closed educational institutions to stop the spread of the virus. In response, schools and teachers have attempted to reach students remotely through distance learning tools and digital platforms. To this day, concerns about the exacerbating digital divide and long-term learning loss among America’s most vulnerable learners continue to grow. <br>

# <div style="color:#fff;display:fill;border-radius:10px;background-color:#000000;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%"> - | Table of Contents</div>

* [1-Libraries and data loading](#section-one)
* [2-Dataframes preprocessing](#section-two)
* [3-Features preprocessing](#section-three)
* [4-Exploratory data analysis](#section-four)

<a id="section-one"></a>
# <div style="color:#fff;display:fill;border-radius:10px;background-color:#000000;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%"> 1 | Libraries and data loading</div>


In [None]:
import pandas as pd
import numpy as np
from wordcloud import WordCloud

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode(connected = True)

import glob
import os
import gc

In [None]:
px.defaults.width = 820
px.defaults.height = 600

In [None]:
pd.set_option('display.float_format', lambda x: '%.2f' % x) # Show floats in fixed-point notation with two decimals instead of scientific notation

In [None]:
disctricts = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')
products = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')

In [None]:
files = glob.glob('../input/learnplatform-covid19-impact-on-digital-learning/engagement_data/*.csv')
engagement_df= pd.concat([pd.read_csv(fp).assign(district_id=os.path.basename(fp).split('.')[0]) 
       for fp in files])
gc.collect()
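
The loading step above tags every engagement file with the district id taken from its file name. The sketch below repeats that pattern on two throwaway CSV files so it can be run anywhere. The temporary folder, the file names `1000.csv` and `2000.csv`, and the values are made up for illustration, and `ignore_index=True` is an optional addition that is not in the original cell.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the engagement_data folder: one CSV per district.
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({'lp_id': [1, 2], 'engagement_index': [10.0, 20.0]}).to_csv(
    os.path.join(tmp_dir, '1000.csv'), index=False)
pd.DataFrame({'lp_id': [1], 'engagement_index': [5.0]}).to_csv(
    os.path.join(tmp_dir, '2000.csv'), index=False)

frames = []
for fp in sorted(glob.glob(os.path.join(tmp_dir, '*.csv'))):
    # The file name without its extension is used as the district id.
    district_id = os.path.splitext(os.path.basename(fp))[0]
    frames.append(pd.read_csv(fp).assign(district_id=district_id))

# ignore_index gives the combined frame a fresh 0..n-1 index
# instead of repeating each file's own row labels.
demo_engagement = pd.concat(frames, ignore_index=True)
print(demo_engagement)
```

Note that the id comes out as a string here, which is why the notebook later casts `district_id` to an integer before merging.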

<a id="section-two"></a>
# <div style="color:#fff;display:fill;border-radius:10px;background-color:#000000;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%"> 2 | Dataframes preprocessing</div>

<p style="font-size:15px; font-family:verdana; line-height: 1.7em; margin-left: 20px">
In this part of the preprocessing I work with the different dataframes provided for this competition.<br>

## Products data frame

In [None]:
products.head()

In [None]:
products.info()

In [None]:
# Lowercase the column names and replace spaces with underscores
products.columns = [column.lower().replace(' ', '_') for column in products.columns]

In [None]:
# Checking the null values
products.isna().sum()

In [None]:
products[products['sector(s)'].isna()]

## Districts data frame

In [None]:
disctricts.head()

In [None]:
disctricts.info()

In [None]:
# Checking the null values
disctricts.isna().sum()

In [None]:
# Some rows in the districts data have no state, so I drop them
bad_rows = disctricts[disctricts['state'].isna()].index
disctricts.drop(bad_rows, axis = 0, inplace = True)

In [None]:
disctricts.reset_index(drop = True,inplace=True)

In [None]:
# Dict that maps each state name to its two-letter code, used later for a choropleth map
code = {'Alabama': 'AL',
        'Alaska': 'AK',
        'Arizona': 'AZ',
        'Arkansas': 'AR',
        'California': 'CA',
        'Colorado': 'CO',
        'Connecticut': 'CT',
        'Delaware': 'DE',
        'District of Columbia': 'DC',
        'Florida': 'FL',
        'Georgia': 'GA',
        'Hawaii': 'HI',
        'Idaho': 'ID',
        'Illinois': 'IL',
        'Indiana': 'IN',
        'Iowa': 'IA',
        'Kansas': 'KS',
        'Kentucky': 'KY',
        'Louisiana': 'LA',
        'Maine': 'ME',
        'Maryland': 'MD',
        'Massachusetts': 'MA',
        'Michigan': 'MI',
        'Minnesota': 'MN',
        'Mississippi': 'MS',
        'Missouri': 'MO',
        'Montana': 'MT',
        'Nebraska': 'NE',
        'Nevada': 'NV',
        'New Hampshire': 'NH',
        'New Jersey': 'NJ',
        'New Mexico': 'NM',
        'New York': 'NY',
        'North Carolina': 'NC',
        'North Dakota': 'ND',
        'Ohio': 'OH',
        'Oklahoma': 'OK',
        'Oregon': 'OR',
        'Pennsylvania': 'PA',
        'Rhode Island': 'RI',
        'South Carolina': 'SC',
        'South Dakota': 'SD',
        'Tennessee': 'TN',
        'Texas': 'TX',
        'Utah': 'UT',
        'Vermont': 'VT',
        'Virginia': 'VA',
        'Washington': 'WA',
        'West Virginia': 'WV',
        'Wisconsin': 'WI',
        'Wyoming': 'WY'}

In [None]:
disctricts['state_code'] = disctricts['state'].map(code)
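
Because `.map` returns `NaN` for any name missing from the dictionary, it can be worth checking which names failed to map. This is a minimal sketch on toy data: `demo_code` and `demo_states` are hypothetical, and 'Puerto Rico' is left out of the toy mapping on purpose.

```python
import pandas as pd

# Toy mapping and toy column; 'Puerto Rico' has no entry in the mapping.
demo_code = {'Utah': 'UT', 'Texas': 'TX'}
demo_states = pd.Series(['Utah', 'Texas', 'Puerto Rico', 'Utah'])

demo_state_code = demo_states.map(demo_code)

# Names without a mapping come back as NaN after .map, so list them explicitly.
unmapped = sorted(demo_states[demo_state_code.isna()].unique())
print(unmapped)
```

An empty list means every state name received a code.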

In [None]:
columns = ['district_id', 'state','state_code', 'locale', 'pct_black/hispanic','pct_free/reduced', 'county_connections_ratio', 'pp_total_raw']

In [None]:
disctricts = disctricts[columns]

In [None]:
disctricts.head()

## Engagement dataframe

In [None]:
engagement_df.head()

In [None]:
engagement_df.shape

In [None]:
# Because the data frame is huge, I use a 20% random sample of it for the analysis
engagement_sample = engagement_df.sample(frac = 0.20, random_state = 42)

In [None]:
engagement_sample.shape

In [None]:
engagement_sample['district_id'] = engagement_sample['district_id'].astype('int')

In [None]:
def merge(df, merge_df, on, how):
    df = df.copy()
    merge_df = merge_df.copy()
    
    temp_df = df.merge(merge_df, on = on, how = how)
    
    return temp_df

In [None]:
df = merge(engagement_sample, disctricts,'district_id', 'left' )

In [None]:
complete_df = merge(df, products,'lp_id', 'left' )

In [None]:
complete_df.columns

In [None]:
complete_df.shape

In [None]:
complete_df.isna().sum()
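
Many of the missing values after a left merge come from rows that found no partner in the right-hand table. This sketch on toy frames uses the `validate` and `indicator` arguments of `DataFrame.merge` to make that visible. The frames `demo_eng` and `demo_dist` are hypothetical, and the notebook's own `merge` helper does not use these arguments.

```python
import pandas as pd

# Toy engagement rows; district 3 has no entry in the toy districts table.
demo_eng = pd.DataFrame({'district_id': [1, 1, 2, 3],
                         'engagement_index': [10.0, 12.0, 7.0, 3.0]})
demo_dist = pd.DataFrame({'district_id': [1, 2],
                          'state': ['Utah', 'Texas']})

merged = demo_eng.merge(
    demo_dist,
    on='district_id',
    how='left',
    validate='many_to_one',  # raises if district_id repeats in demo_dist
    indicator=True,          # adds a '_merge' column describing each row's origin
)

# Rows marked 'left_only' found no matching district.
unmatched = int((merged['_merge'] == 'left_only').sum())
print(unmatched)
```

A left merge checked with `many_to_one` keeps the row count of the left frame, so any change in row count points to duplicated keys on the right.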

<a id="section-three"></a>
# <div style="color:#fff;display:fill;border-radius:10px;background-color:#000000;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%"> 3 | Features preprocessing</div>

<p style="font-size:15px; font-family:verdana; line-height: 1.7em; margin-left:20px">
In this part of the preprocessing I work with the different features.<br>

## time

In [None]:
complete_df['time'] = pd.to_datetime(complete_df['time'])

In [None]:
# The .dt accessor extracts date parts for the whole column at once
complete_df['day'] = complete_df['time'].dt.day
complete_df['month'] = complete_df['time'].dt.month
complete_df['year'] = complete_df['time'].dt.year

## pct_black/hispanic
<p style="font-size:15px; font-family:verdana; line-height: 1.7em; margin-left:20px">
Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data.
<br><br>

In [None]:
dict_ratio_1 = {'[0, 0.2[': 'Very Low',
                '[0.2, 0.4[': 'Low',
                '[0.4, 0.6[': 'Mid',
                '[0.6, 0.8[': 'Mid-High',
                '[0.8, 1[': 'High'}

In [None]:
def preprocess_columns(df, column, dict_mapping, loc):
    df = df.copy()

    # pp_total_raw holds dollar ranges, so the percentage labels do not apply to it
    if column != 'pp_total_raw':
        df[f'cat_{column}'] = df[column].map(dict_mapping)

    # Values look like '[0.2, 0.4['; missing values become '[NaN, NaN[' so they parse to NaN
    ranges = df[column].apply(lambda x: str(x).replace('nan', '[NaN, NaN['))
    low = ranges.apply(lambda x: x.split(',')[0].replace('[', ''))
    high = ranges.apply(lambda x: x.split(',')[1].replace('[', ''))
    # Use the midpoint of each range as its numeric value
    mean = (low.astype('float') + high.astype('float')) / 2.0

    df.insert(loc=loc, column=f'mean_pct_{column}', value=mean)

    return df

In [None]:
complete_df = preprocess_columns (complete_df, 'pct_black/hispanic', dict_ratio_1, 9)
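
The helper above turns range labels such as `[0.2, 0.4[` into the midpoint of the range. As an alternative sketch (not the notebook's implementation), the same conversion can be done in one pass with `Series.str.extract`. The series `demo_ranges` is toy data, and the regular expression assumes that both bounds are plain non-negative numbers.

```python
import pandas as pd

# Toy labels in the dataset's half-open range notation, with one missing value.
demo_ranges = pd.Series(['[0, 0.2[', '[0.2, 0.4[', None, '[0.8, 1['])

# Capture the lower and upper bound of each label.
# Rows with a missing label yield NaN in both columns.
bounds = demo_ranges.str.extract(r'\[\s*([\d.]+)\s*,\s*([\d.]+)\s*\[').astype(float)

# The midpoint of each range is the row-wise mean of its two bounds.
midpoint = bounds.mean(axis=1)
print(midpoint.tolist())
```

The same pattern would also parse dollar ranges written in the same notation, since it only relies on the brackets and the comma.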

## pct_free/reduced
<p style="font-size:15px; font-family:verdana; line-height: 1.7em; margin-left:20px">
Percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data.
<br><br>

In [None]:
complete_df = preprocess_columns (complete_df, 'pct_free/reduced', dict_ratio_1, 11)

## county_connections_ratio
<p style="font-size:15px; font-family:verdana; line-height: 1.7em; margin-left:20px">
Ratio (residential fixed high-speed connections over 200 kbps in at least one direction/households) based on the county level data from FCC Form 477 (December 2018 version). See FCC data for more information.
<br><br>

In [None]:
complete_df['county_connections_ratio'].value_counts()

In [None]:
complete_df['county_connections_ratio'] = complete_df['county_connections_ratio'].replace({'[0.18, 1[': 'One', '[1, 2[': 'More'} )

## pp_total_raw
<p style="font-size:15px; font-family:verdana; line-height: 1.7em; margin-left:20px">
Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district.
<br><br>

In [None]:
complete_df = preprocess_columns (complete_df, 'pp_total_raw', dict_ratio_1, 13)

## primary_essential_function
<p style="font-size:15px; font-family:verdana; line-height: 1.7em; margin-left:20px">
The basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled.
Because there are so many sub-categories, I reduce all of them to just LC, CM, SDO or All.
<br><br>

In [None]:
complete_df['primary_essential_function'].unique()

In [None]:
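# Keep only the top-level label (LC, CM, SDO) without trailing spaces; the combined 'LC/CM/SDO' label becomes 'All'
complete_df['cat_p_essential_function'] = complete_df['primary_essential_function'].apply(lambda x: str(x)[:3].strip()).apply(lambda x: 'All' if x == 'LC/' else x)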
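
The reduction to top-level categories can also be written by splitting each label at its separator rather than slicing a fixed number of characters. This is a sketch under the assumption that labels follow the pattern `<category> - <sub-category>`. The example labels below are illustrative and are not guaranteed to match the dataset's exact wording.

```python
import pandas as pd

# Illustrative labels (assumed format: '<category> - <sub-category>').
demo_functions = pd.Series([
    'LC - Digital Learning Platforms',
    'CM - Classroom Engagement & Instruction',
    'SDO - Data, Analytics & Reporting',
    'LC/CM/SDO - Other',
    None,
])

# Take the text before ' - ', then rename the combined label to 'All'.
top_level = demo_functions.str.split(' - ').str[0].replace('LC/CM/SDO', 'All')
print(top_level.tolist())
```

Unlike slicing, this version leaves missing labels as missing values instead of turning them into the string `'nan'`.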

<a id="section-four"></a>
# <div style="color:#fff;display:fill;border-radius:10px;background-color:#000000;text-align:left;letter-spacing:0.1px;overflow:hidden;padding:20px;color:white;overflow:hidden;margin:0;font-size:100%"> 4 | Exploratory Data Analysis</div>

## Engagement index
<p style="font-size:15px; font-family:verdana; line-height: 1.7em; ;margin-left:20px">
The engagement index is the total number of page-load events per one thousand students for a given product on a given day. It is based on LearnPlatform’s Student Chrome Extension, which collects page-load events for over 10K education technology products in LearnPlatform's product library, including websites, apps, web apps, software programs, extensions, ebooks, hardware, and services used in educational institutions. The engagement data have been aggregated at the school district level, and each file represents data from one school district.
A higher value means more engagement.
<br><br>
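
To make the definition concrete, the sketch below computes the index for made-up numbers. The helper `engagement_index` is hypothetical and only restates "page-load events per one thousand students"; it is not LearnPlatform's own code, and the counts are invented.

```python
def engagement_index(page_loads, n_students):
    """Page-load events per 1,000 students (hypothetical helper for illustration)."""
    return 1000 * page_loads / n_students


# Made-up example: 250 page loads of one product on one day
# in a district with 5,000 students.
demo_index = engagement_index(page_loads=250, n_students=5000)
print(demo_index)
```

Normalizing by enrollment in this way is what makes districts of different sizes comparable.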

In [None]:
engagement_df_1 = complete_df.groupby(['state','locale'])['engagement_index'].mean().reset_index(name='mean_engagement_index')

In [None]:
fig = px.bar(engagement_df_1, y='state', x='mean_engagement_index', color='locale', title= 'Mean engagement index by State and Area type (locale)',
             color_discrete_sequence=['#334668','#496595','#6D83AA','#91A2BF','#C8D0DF'])
fig.update_yaxes(showgrid=False, categoryorder='total ascending', ticksuffix=' ', showline=False)
fig.update_layout(legend=dict(title='Locale'))
fig.show()

In [None]:
fig = px.density_heatmap(engagement_df_1, y='state', x='locale', z='mean_engagement_index',height=600,
                         color_continuous_scale=['#334668','#496595','#6D83AA','#91A2BF','#C8D0DF'],
                         title='Engagement index by locale and state')
fig.show()

In [None]:
engagement_df_2 = complete_df.groupby(['state','state_code','locale'])[['engagement_index','mean_pct_pp_total_raw']].mean().reset_index()

In [None]:
fig = px.treemap (data_frame = engagement_df_2, path = ['state', 'locale'], height=480,
                  values = 'mean_pct_pp_total_raw', color = 'engagement_index',  color_continuous_scale = ['#C8D0DF','#6D83AA', '#334668'],
                  title = 'Per pupil mean expenditure (Federal and Local) vs Engagement index',
                  labels = {'engagement_index':'Engagement Index'})
fig.show()

<div class="alert alert-info">
  <strong>Observations</strong>
 <div>In this case there is no clear relationship between per-pupil expenditure and higher engagement (that is, greater use of learning platforms by students), with the exception of New York (Rural, Suburb), Massachusetts and Indiana (Rural).</div>
</div>

In [None]:
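# A choropleth shows one value per state, so average the locale rows of each state (unweighted mean)
choropleth_df = engagement_df_2.groupby(['state', 'state_code'], as_index=False)['mean_pct_pp_total_raw'].mean()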

In [None]:
fig = px.bar(engagement_df_2, y='state', x='mean_pct_pp_total_raw', color='locale', title= 'Per pupil mean expenditure by State and Area type (locale)',
             color_discrete_sequence=['#334668','#496595','#6D83AA','#91A2BF','#C8D0DF'])
fig.update_yaxes(showgrid=False, categoryorder='total ascending', ticksuffix=' ', showline=False)
fig.update_layout(legend=dict(title='Locale'))
fig.show()

<div class="alert alert-info">
  <strong>Observations</strong>
 <div>Massachusetts is the state with the highest expenditure per pupil; New York and Illinois complete the top 3.</div>
</div>

In [None]:
fig = px.choropleth (data_frame= choropleth_df, locations = 'state_code',  locationmode = 'USA-states', scope = 'usa',
                     color = 'mean_pct_pp_total_raw', title = 'Per pupil mean expenditure by State (provided)')
fig.show()
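
A choropleth colors each state from a single value, so a frame with several rows per state (one per locale) is easier to interpret once it is collapsed to one row per state. Below is a minimal sketch of that aggregation on made-up numbers. It uses an unweighted mean across locales, which is an assumption; a mean weighted by the number of districts may be more appropriate.

```python
import pandas as pd

# Toy locale-level frame: Utah appears twice, once per locale.
demo_locale_level = pd.DataFrame({
    'state_code': ['UT', 'UT', 'TX'],
    'locale': ['City', 'Rural', 'Suburb'],
    'mean_pct_pp_total_raw': [9000.0, 11000.0, 13000.0],
})

# One row per state: average the locale-level values.
demo_state_level = demo_locale_level.groupby(
    'state_code', as_index=False)['mean_pct_pp_total_raw'].mean()
print(demo_state_level)
```

The resulting frame has unique `state_code` values and can be passed to `px.choropleth` in the same way as in the cell above.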

## Percentage Black/Hispanic analysis
<p style="font-size:15px; font-family:verdana; line-height: 1.7em; margin-left:20px">
Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data.
</p>

In [None]:
races_df = complete_df.groupby(['state','state_code','locale','cat_pct_black/hispanic'])[['district_id','mean_pct_pp_total_raw','mean_pct_pct_black/hispanic']].agg({'district_id':'count','mean_pct_pp_total_raw':'mean', 'mean_pct_pct_black/hispanic':'mean' }).reset_index()

In [None]:
fig = px.bar(races_df, y='state', x='district_id', color='cat_pct_black/hispanic', title = 'Percentage Of Black and Hispanic by State',
             color_discrete_sequence=['#334668','#496595','#6D83AA','#91A2BF','#C8D0DF'])
fig.update_layout(legend=dict(title='% Black/Hisp'))
fig.update_yaxes(showgrid=False, categoryorder='total ascending', ticksuffix=' ', showline=False)
fig.show()

In [None]:
fig = px.treemap (data_frame = races_df, path = ['state', 'locale'], height=480,
                  values = 'mean_pct_pp_total_raw', color = 'mean_pct_pct_black/hispanic',  color_continuous_scale = ['#C8D0DF','#6D83AA', '#334668'],
                  labels = {'mean_pct_pct_black/hispanic':'Mean % Black/Hisp'})
fig.show()

<div class="alert alert-info">
  <strong>Observations</strong>
 <div>With the exception of Indiana (City and Suburb), the states with the highest per-pupil expenditure are not those with the largest share of Black or Hispanic students. Sadly, data are missing for many states, which prevents a deeper analysis of this topic.</div>
</div>

## Products/Provider/Sectors and other columns analysis

In [None]:
providers = complete_df['provider/company_name'].value_counts()
fig = px.histogram(providers,x= providers.values[:15] , y = providers.index[:15], 
                   color = providers.index[:15], labels= {'y': 'Company'}, title = 'Top 15 providers',
                    color_discrete_sequence=px.colors.sequential.Blues_r,)
fig.update_layout(showlegend=False)
fig.update_yaxes(showgrid=False, categoryorder='total ascending', ticksuffix=' ', showline=False)
fig.show()

<div class="alert alert-info">
  <strong>Observations</strong>
 <div>Google is by far the most popular content provider.</div>
</div>

In [None]:
complete_df.columns

In [None]:
products_df_1 = complete_df.groupby(by= ['state','state_code','locale','product_name','provider/company_name','sector(s)','primary_essential_function'])['cat_p_essential_function'].count().reset_index(name = 'primary_func_count')

In [None]:
sectors = products_df_1['sector(s)'].value_counts()

In [None]:
fig = px.histogram(sectors,x= sectors.values, y = sectors.index, 
                     color = sectors.index, labels = {'index': 'Sector'}, title = 'Most relevant sector',
                     color_discrete_sequence = ['#334668','#496595','#6D83AA','#91A2BF','#C8D0DF'])
fig.show()

<div class="alert alert-info">
  <strong>Observations</strong>
 <div>PreK-12 is the most common sector, and in my opinion that is a good thing, because PreK-12 education helps provide quality schooling to children so that in the future they can have a full life and avoid falling into homelessness or other hardship. It is also closely related to the topic of this analysis: according to the information I was able to gather, 32% of homeless high school students are Latino, and a Latino high school student is 1.7 times more likely to be homeless than a white student.
The following link covers this topic in more depth: <a href="https://schoolhouseconnection.org/learn/k-12/">[link]</a>.</div>
</div>

<p style="font-size:15px; font-family:verdana; line-height: 1.7em; margin-left:20px">
Finally, we build a word cloud of the product names to see which products are the most used (popular).
</p>

In [None]:
cloud = WordCloud(width=1080, height=270,background_color='white').generate(" ".join(complete_df['product_name'].dropna().astype(str)))
plt.figure(figsize=(22, 10))
plt.imshow(cloud)
plt.axis('off');

<div class="alert alert-info">
  <strong>Observations</strong>
 <div>Here we can see why Google is the most popular provider: its services (Drive, Docs, Sheets, Calendar, and others) are the most used. Other popular platforms are Canva and Grammarly.</div>
</div>