<h1 style="font-family:'lucida console';"> <center>📚 DIGITAL LEARNING IN THE PANDEMIC PERIOD (2020) 🖥️</center> </h1>

In 2020, a lethal virus caught the world by surprise, forcing the entire planet to change the way it was used to doing things and accelerating the digitization process that had been under way since the end of the 20th century.

One of the areas most affected by this process was education.
Overnight, teachers needed to reinvent themselves and seek strategies for distance and online learning.

A document published by UNESCO, UNICEF and The World Bank, entitled [What have we learned?](http://uis.unesco.org/sites/default/files/documents/national-education-responses-to-covid-19-web-final_en_0.pdf), presents a survey of responses from ministries of education around the world. It reveals a disparity between the learning environment provided by high-income countries and the one provided by low- and lower-middle-income countries.

In high-income countries, a group that includes the United States, students missed fewer days of learning in the school year (*on average 27 days in high-income countries versus 70 days in low- and lower-middle-income countries*), and teachers were more likely to keep monitoring student learning (*while 26% of low- and lower-middle-income countries did not monitor student learning, this was the case in only 3% of high-income countries*).

<p float="left">

<img src="https://github.com/ac-garcia/kaggle-imgs/blob/main/Screenshot%20from%202021-09-29%2019-52-53.png?raw=true" />
<img src="https://github.com/ac-garcia/kaggle-imgs/blob/main/Screenshot%20from%202021-09-29%2020-24-42.png?raw=true" /> 
</p>
<cite>Images from What have we learned?, by UNESCO, UNICEF and The World Bank, pages 21 and 26.</cite>


With that, many were faced with a world of possibilities: several platforms, different tools, countless providers.

This scenario, while sad for many families, gave us the opportunity to collect data on teaching in online environments on a scale that had never been possible before, mostly in high-income countries.

Different social classes, different levels of education, different audiences: all of them needed, to some extent, to use online tools to maintain their study schedule.

Thus, in this report, we propose to analyze how online engagement changes across different audiences and different tools in the United States.

In [None]:
# Libraries and some environment configurations
try:
    import openpyxl
except ImportError:
    !pip install openpyxl
    import openpyxl
try:
    import censusdata
except ImportError:
    !pip install censusdata
    import censusdata

import pandas as pd
import re
import glob
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import geopandas as gpd
from matplotlib.patches import Rectangle
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from warnings import simplefilter
import datetime as dt
from IPython.display import Markdown as md
# from IPython.display import Image

import gc
gc.collect()

simplefilter(action="ignore", category=pd.errors.PerformanceWarning)

pd.set_option('display.float_format', lambda x: '%.2f' % x)
pd.set_option('display.max_columns', None)

pal2 = ["#ffcbf2","#f3c4fb","#ecbcfd","#e5b3fe","#e2afff","#deaaff","#d8bbff","#d0d1ff","#c8e7ff","#c0fdff"]

In [None]:
# Functions that will be used in the process

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
#     print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            # dtype names look like 'int8' ... 'int64', so compare only the prefix
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
#     print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
#     print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df


def import_data(file):
    """create a dataframe and optimize its memory usage"""
    # parse_dates=True only affects the index and keep_date_col is deprecated, so both are dropped
    df = pd.read_csv(file, index_col=None, header=0)
    df = reduce_mem_usage(df)
    return df


In [None]:
product_df = pd.read_csv("/kaggle/input/learnplatform-covid19-impact-on-digital-learning/products_info.csv")
district_df = import_data("/kaggle/input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")

path = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data' 
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = import_data(filename)
    district_id = filename.split("/")[4].split(".")[0]
    df["district_id"] = district_id
    if df.time.nunique() == 366:
        li.append(df)
    
engagement_df = pd.concat(li, axis=0, ignore_index=True)
engagement_df = engagement_df.reset_index(drop=True)

engagement_df['district_id']=engagement_df['district_id'].astype(str)
district_df['district_id']=district_df['district_id'].astype(str)
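# Strip a trailing ".0" left by float-typed ids; regex=True is required in recent pandas
district_df.loc[:,'district_id'] = district_df['district_id'].str.replace(r'\.0$', '', regex=True)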


del li, path, all_files

gc.collect()

In [None]:
# Shape of the data files ( number of rows and number of columns) 
print('\033[1m'"Shape of the Engagement File "'\033[0m',engagement_df.shape )
print('\033[1m'"Shape of the District File"'\033[0m', district_df.shape)
print('\033[1m'"Shape of the Product File"'\033[0m',product_df.shape)

<div style="font-family:verdana; bold; word-spacing:1.5px;">
    <h1 id="methodology">
    1 Methodology
    </h1>
</div><br>

In this section we explain the process of understanding and transforming the data.
In addition, we present the external data that we incorporated into the database to enrich our analyses.

<div style="font-family:verdana; word-spacing:1.5px;">
    <h1 id="cleaning">
    1.1 Understanding and cleaning data
    </h1>
</div>

As the [challenge documentation](https://www.kaggle.com/c/learnplatform-covid19-impact-on-digital-learning/data) explains, the `districts_info.csv` table contains information about the characteristics of school districts, from which the LearnPlatform team has removed identifiable information.

We have 57 districts without *state* and *locale* information.

Since these districts carry no information, we removed them from the dataset.
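
The filtering step described above can be sketched on a small made-up frame. This is only an illustration: the column names mirror the ones used in this notebook, but the values and the names `toy_districts`, `missing_mask` and `toy_clean` are assumptions, not the real `districts_info.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the district table (values are invented for illustration).
toy_districts = pd.DataFrame({
    "district_id": ["1000", "1001", "1002", "1003"],
    "state": ["Utah", np.nan, "Illinois", np.nan],
    "locale": ["Suburb", np.nan, "City", np.nan],
})

# Flag districts where both descriptive fields are missing, then count them.
missing_mask = toy_districts[["state", "locale"]].isna().all(axis=1)
n_missing = int(missing_mask.sum())

# Keep only the districts that carry state information.
toy_clean = toy_districts[toy_districts["state"].notna()].reset_index(drop=True)

print(n_missing, len(toy_clean))
```

Counting the missing rows before dropping them makes it easy to report how much of the table is lost to the filter.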

In [None]:
show = district_df.head()

show = show.style.format(precision=0, na_rep='MISSING')

cell_hover = {  # for row hover use <tr> instead of <td>
    'selector': 'td:hover',
    'props': [('background-color', '#ffffb3')]
}
index_names = {
    'selector': '.index_name',
    'props': 'font-style: italic; color: darkgrey; font-weight:normal;'
}
headers = {
    'selector': 'th:not(.index_name)',
    'props': 'background-color: #000066; color: white;'
}
show.set_table_styles([cell_hover, headers])
show.set_table_styles([
    {'selector': 'th.col_heading','props': 'text-align: center; font-size:1.25em'},
    {'selector': 'td', 'props': 'text-align: center;'},
], overwrite=False)
show.set_caption("First 5 rows from district table")\
 .set_table_styles([{
     'selector': 'caption',
     'props': 'caption-side: bottom; font-size:1.25em;font-style: italic;'
 }], overwrite=False)

show.set_table_styles([  # create internal CSS classes
    {'selector': '.border-red', 'props': 'border: 2px dashed red;'}
], overwrite=False)

cell_border = pd.DataFrame([[' ', ' ',' ',' ', ' ', ' ',' '],
                           ['border-red ', 'border-red ', 'border-red ', 'border-red ','border-red ','border-red ','border-red '],
                           [' ', ' ', ' ', ' ',' ',' ',' '],
                           ['border-red ', 'border-red ', 'border-red ', 'border-red ','border-red ','border-red ','border-red '],
                           ['border-red ', 'border-red ', 'border-red ', 'border-red ','border-red ','border-red ','border-red ']],
                          index=show.index,
                          columns=show.columns)

show.set_td_classes(cell_border)

In [None]:
district_df.iloc[district_df[(district_df.isnull().sum(axis=1) ==6)].index].count()

In [None]:
del show
gc.collect()

district_df = district_df[['district_id','state','locale','pct_black/hispanic']]
district_df = district_df[district_df.state.notna()]
district_df = district_df[district_df.district_id.isin(engagement_df.district_id.unique())].reset_index(drop=True)
district_df.head()

The `products_info.csv` table includes information about the features of the top 372 products that were most used in 2020. The categories contained in the file are part of LearnPlatform's product taxonomy.

Some products came without sector information. However, since our analysis did not explore this point in depth, we chose not to remove the rows missing this data, keeping a more complete dataset.
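
As a rough illustration of why rows without sector information can be kept, the sketch below one-hot encodes a multi-valued sector column on invented data. The frame `toy_products` and its values are assumptions. A product with a missing sector simply gets zeros in every sector flag.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the product table (values are invented for illustration).
toy_products = pd.DataFrame({
    "LP ID": [1, 2, 3],
    "Sector(s)": ["PreK-12", "PreK-12; Higher Ed", np.nan],
})

# One indicator column per sector; a missing sector yields an all-zero row.
sector_flags = toy_products["Sector(s)"].str.get_dummies(sep="; ")
sector_flags.columns = [f"sector_{c.replace(' ', '')}" for c in sector_flags.columns]

toy_products = toy_products.join(sector_flags)
print(toy_products)
```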

In [None]:
show = product_df.head()

show = show.style.format(precision=0, na_rep='MISSING')

cell_hover = {  # for row hover use <tr> instead of <td>
    'selector': 'td:hover',
    'props': [('background-color', '#ffffb3')]
}
index_names = {
    'selector': '.index_name',
    'props': 'font-style: italic; color: darkgrey; font-weight:normal;'
}
headers = {
    'selector': 'th:not(.index_name)',
    'props': 'background-color: #000066; color: white;'
}
show.set_table_styles([cell_hover, headers])
show.set_table_styles([
    {'selector': 'th.col_heading','props': 'text-align: center; font-size:1.25em'},
    {'selector': 'td', 'props': 'text-align: center;'},
], overwrite=False)
show.set_caption("First 5 rows from product table")\
 .set_table_styles([{
     'selector': 'caption',
     'props': 'caption-side: bottom; font-size:1.25em;font-style: italic;'
 }], overwrite=False)

In [None]:
del show
gc.collect()

temp_sectors = product_df['Sector(s)'].str.get_dummies(sep="; ")
temp_sectors.columns = [f"sector_{re.sub(' ', '', c)}" for c in temp_sectors.columns]
product_df = product_df.join(temp_sectors)

del temp_sectors
gc.collect()

# Split "MAIN - sub" labels into two columns; missing values (NaN) pass through unchanged
product_df['primary_function_main'] = product_df['Primary Essential Function'].apply(lambda x: x.split(' - ')[0] if pd.notna(x) else x)
product_df['primary_function_sub'] = product_df['Primary Essential Function'].apply(lambda x: x.split(' - ')[1] if pd.notna(x) else x)

# Synchronize similar values
product_df['primary_function_sub'] = product_df['primary_function_sub'].replace({'Sites, Resources & References' : 'Sites, Resources & Reference'})
product_df.drop("Primary Essential Function", axis=1, inplace=True)
product_df = product_df[product_df['LP ID'].isin(engagement_df.lp_id.unique())].reset_index(drop=True)

In [None]:
product_df.head()

The engagement table is a compilation of student engagement data, across digital platforms, from schools in numerous US districts.

In addition to the engagement numbers, it carries some relationships, such as the page/platform where the engagement took place and the user's home district.

To select the events relevant to us, we filter the engagement data to keep only the products and districts contained in the tables mentioned above, after the initial cleaning applied to them.

This reduced the dataset from 18,612,528 rows to 7,784,803 rows, a reduction of about 58%.

The rows removed are those for which classification data was missing, which would have made some of our classifications impossible.
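
The filter and the size of the reduction can be sketched as follows. The toy frame and the sets `known_products` and `known_districts` are hypothetical stand-ins for the cleaned product and district tables. Only the two row counts in the last step are taken from the text above.

```python
import pandas as pd

# Toy stand-in for the engagement table (values are invented for illustration).
toy_engagement = pd.DataFrame({
    "lp_id": [10, 10, 20, 30, 40],
    "district_id": ["a", "b", "a", "c", "a"],
    "engagement_index": [1.0, 2.0, 3.0, 4.0, 5.0],
})
known_products = {10, 20}   # ids that survived the product cleaning (assumed)
known_districts = {"a"}     # ids that survived the district cleaning (assumed)

rows_before = len(toy_engagement)
kept = toy_engagement[
    toy_engagement["lp_id"].isin(known_products)
    & toy_engagement["district_id"].isin(known_districts)
]
rows_after = len(kept)

# Share of rows removed by the filter, in percent.
reduction_pct = 100 * (rows_before - rows_after) / rows_before

# The same formula applied to the row counts reported in the text.
reported_pct = round(100 * (18_612_528 - 7_784_803) / 18_612_528)
print(reduction_pct, reported_pct)
```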

In [None]:
show = engagement_df.head()

show = show.style.format(precision=0, na_rep='MISSING')

cell_hover = {  # for row hover use <tr> instead of <td>
    'selector': 'td:hover',
    'props': [('background-color', '#ffffb3')]
}
index_names = {
    'selector': '.index_name',
    'props': 'font-style: italic; color: darkgrey; font-weight:normal;'
}
headers = {
    'selector': 'th:not(.index_name)',
    'props': 'background-color: #000066; color: white;'
}
show.set_table_styles([cell_hover, headers])
show.set_table_styles([
    {'selector': 'th.col_heading','props': 'text-align: center; font-size:1.25em'},
    {'selector': 'td', 'props': 'text-align: center;'},
], overwrite=False)
show.set_caption("First 5 rows from engagement table")\
 .set_table_styles([{
     'selector': 'caption',
     'props': 'caption-side: bottom; font-size:1.25em;font-style: italic;'
 }], overwrite=False)

In [None]:
print(len(engagement_df))
engagement_df = engagement_df[engagement_df.lp_id.isin(product_df['LP ID'].unique())]
print(len(engagement_df))
engagement_df = engagement_df[engagement_df.district_id.isin(district_df['district_id'].unique())]
print(len(engagement_df))

In [None]:
print(engagement_df.isna().sum())

In [None]:
full_table = pd.merge(engagement_df,product_df, "inner", left_on='lp_id', right_on='LP ID')
full_table = pd.merge(full_table,district_df, "inner", on='district_id')
full_table = full_table[full_table.engagement_index.notna()]
full_table.dropna(subset = ["state"], inplace=True)

In [None]:
full_table.shape

In [None]:
# full_table_sample1, full_table_sample2, state_sample1, state_sample2 = train_test_split(full_table.drop('state',axis=1), full_table['state'], stratify=full_table['state'], test_size=0.40)

# del engagement_df, product_df, district_df

# gc.collect()

In [None]:
# import pandas as pd
# import scipy.stats as stats

# tstats = {}
# ix_a = df['group'] == 'A'
# for x in df:
#     if x != 'group':
#         tstats['t_' + x] = stats.ttest_ind(df[x][ix_a], df[x][~ix_a])[0]

# df.groupby('group').mean().assign(**tstats)

In [None]:
# alpha = 0.05 #Or whatever you want your alpha to be.
# p_value = scipy.stats.f.cdf(F, df1, df2)
# if p_value > alpha:
#     # Reject the null hypothesis that Var(X) == Var(Y)

In [None]:
# full_table_sample1.reset_index(drop=True, inplace=True)
# state_sample1.reset_index(drop=True, inplace=True)
# full_table = pd.concat([full_table_sample1.reset_index(drop=True), state_sample1], axis=1)
# del full_table_sample1, full_table_sample2, state_sample1, state_sample2 
# gc.collect()
# full_table.head()

In [None]:
covid_19_state_policy = pd.read_excel("../input/covid19-us-state-policy-database-3-29-2021/COVID-19 US state policy database 3_29_2021.xlsx")
covid_19_state_policy.dropna(subset = ["STATE"], inplace=True)
covid_19_state_policy.fillna(0,inplace=True)
covid_19_state_policy[0:5]

In [None]:
covid_19_state_policy = covid_19_state_policy[['STATE','POSTCODE','STEMERG','STEMERGEND','CLDAYCR',
                                              'OPNCLDCR','CLSCHOOL','CLBSNS','END_BSNS',
                                              'CLGYM','ENDGYM','CLGYM2','END_CLGYM2',
                                              'EMSTART','EMEND','EMSTART2','EMEND2','EMSTART3',
                                              'EMEND3','SNAPALLO','SNAPEBT20','SNAPEBT21',
                                              'RELIGEX','FMFINE','FMCITE','FMNOENF','ALCOPEN',
                                              'GUNOPEN','SMALLBUSMINWAGE','MINWAGE2021',
                                              'TIPMINWAGE2020','VBMEXC','MH19','POV18',
                                              'UNEMP18','HMLS19','POP18','POPDEN18']][4:]
covid_19_state_policy = reduce_mem_usage(covid_19_state_policy)
covid_19_state_policy.head()

In [None]:

full_table_state = pd.merge(full_table,covid_19_state_policy, "left", left_on='state',right_on='STATE')
full_table_state['time'] = pd.to_datetime(full_table_state['time'], format='%Y-%m-%d', errors ='coerce').astype('datetime64[ns]')
# Vectorized formatting; unlike a row-wise strftime, this tolerates NaT left by errors='coerce'
full_table_state['week_year'] = full_table_state['time'].dt.strftime('%Y-%W')
full_table_state['weekday'] = full_table_state['time'].dt.strftime('%Y-%w')
#emergency state
full_table_state['STEMERG'] = pd.to_datetime(full_table_state['STEMERG'], format='%Y-%m-%d')
full_table_state['STEMERGEND'] = np.where(full_table_state['STEMERGEND']!=0,\
                                          pd.to_datetime(full_table_state['STEMERGEND'], format='%Y-%m-%d').astype('datetime64[ns]'),\
                                          pd.to_datetime('today').date())
#day care closed
full_table_state['CLDAYCR'] = np.where(full_table_state['CLDAYCR']!=0,\
                                       full_table_state['CLDAYCR'],\
                                       pd.to_datetime('today').date())
full_table_state['OPNCLDCR'] = np.where(full_table_state['OPNCLDCR']!=0,\
                                          full_table_state['OPNCLDCR'],\
                                          pd.to_datetime('today').date())

# k-12 public school closed
full_table_state['CLSCHOOL'] = np.where(full_table_state['CLSCHOOL']!=0,\
                                          full_table_state['CLSCHOOL'],\
                                          pd.to_datetime('today').date())

#non essential business closed 
full_table_state['CLBSNS'] = np.where(full_table_state['CLBSNS']!=0,\
                                       full_table_state['CLBSNS'],\
                                       pd.to_datetime('today').date())
full_table_state['END_BSNS'] = np.where(full_table_state['END_BSNS']!=0,\
                                          full_table_state['END_BSNS'],\
                                          pd.to_datetime('today').date())

#gym closed 
full_table_state['CLGYM'] = np.where(full_table_state['CLGYM']!=0,\
                                       full_table_state['CLGYM'],\
                                       pd.to_datetime('today').date())
full_table_state['ENDGYM'] = np.where(full_table_state['ENDGYM']!=0,\
                                          full_table_state['ENDGYM'],\
                                          pd.to_datetime('today').date())
full_table_state['CLGYM2'] = np.where(full_table_state['CLGYM2']!=0,\
                                       full_table_state['CLGYM2'],\
                                       pd.to_datetime('today').date())
full_table_state['END_CLGYM2'] = np.where(full_table_state['END_CLGYM2']!=0,\
                                          full_table_state['END_CLGYM2'],\
                                          pd.to_datetime('today').date())

#Overall eviction moratorium start
full_table_state['EMSTART'] = np.where(full_table_state['EMSTART']!=0,\
                                       full_table_state['EMSTART'],\
                                       pd.to_datetime('today').date())
full_table_state['EMEND'] = np.where(full_table_state['EMEND']!=0,\
                                          full_table_state['EMEND'],\
                                          pd.to_datetime('today').date())
full_table_state['EMSTART2'] = np.where(full_table_state['EMSTART2']!=0,\
                                       full_table_state['EMSTART2'],\
                                       pd.to_datetime('today').date())
full_table_state['EMEND2'] = np.where(full_table_state['EMEND2']!=0,\
                                          full_table_state['EMEND2'],\
                                          pd.to_datetime('today').date())
full_table_state['EMSTART3'] = np.where(full_table_state['EMSTART3']!=0,\
                                       full_table_state['EMSTART3'],\
                                       pd.to_datetime('today').date())
full_table_state['EMEND3'] = np.where(full_table_state['EMEND3']!=0,\
                                          full_table_state['EMEND3'],\
                                          pd.to_datetime('today').date())


# SNAP waiver
full_table_state['SNAPALLO'] = pd.to_datetime(full_table_state['SNAPALLO'], format='%Y-%m-%d')
full_table_state['SNAPEBT20'] =  pd.to_datetime(full_table_state['SNAPEBT20'], format='%Y-%m-%d')
full_table_state['SNAPEBT21'] = np.where(full_table_state['SNAPEBT21']!=0,\
                                          full_table_state['SNAPEBT21'],\
                                          pd.to_datetime('today').date())


full_table_state['in_emergstate_period'] = np.where((full_table_state['time']>=full_table_state['STEMERG']) & (full_table_state['time']<=full_table_state['STEMERGEND']), 1,0)
full_table_state['in_closed_daycare_period'] = np.where((full_table_state['time']>=full_table_state['CLDAYCR']) & (full_table_state['time']<=full_table_state['OPNCLDCR']), 1,0)
full_table_state['in_closed_gym_period'] = np.where(((full_table_state['time']>=full_table_state['CLGYM']) & (full_table_state['time']<=full_table_state['ENDGYM'])) |
                                                    ((full_table_state['time']>=full_table_state['CLGYM2']) & (full_table_state['time']<=full_table_state['END_CLGYM2'])), 1,0)
full_table_state['in_eviction_moratorium_period'] = np.where(((full_table_state['time']>=full_table_state['EMSTART']) & (full_table_state['time']<=full_table_state['EMEND'])) |
                                                    ((full_table_state['time']>=full_table_state['EMSTART2']) & (full_table_state['time']<=full_table_state['EMEND2'])) |
                                                    ((full_table_state['time']>=full_table_state['EMSTART3']) & (full_table_state['time']<=full_table_state['EMEND3'])) , 1,0)
full_table_state['in_closed_school_period'] = np.where((full_table_state['time']>=full_table_state['CLSCHOOL']), 1,0)
full_table_state['in_snap_emergency_allotments_period'] = np.where((full_table_state['time']>=full_table_state['SNAPALLO']), 1,0)
full_table_state['in_snap_ebt_school_period2020'] = np.where((full_table_state['time']>=full_table_state['SNAPEBT20']) & (full_table_state['time'] <= pd.to_datetime('2020-12-31')), 1,0)
full_table_state['in_snap_ebt_school_period2021'] = np.where((full_table_state['time']>=full_table_state['SNAPEBT21']), 1,0)

# rename raw and policy-database columns to descriptive names
full_table_state.rename({'Provider/Company Name':'company_name',
                         'Product Name':'product_name',
                         'POSTCODE':'state_short',
                         'time':'date',
                         'locale':'locale_type',
                         'pct_black/hispanic':'pct_black_hispanic',
                         'Sector(s)':'sector_description',
                         'sector_PreK-12':'sector_PreK',
                         'RELIGEX': 'excep_relig_meet', 
                         'FMFINE': 'face_mask_enforced_by_fine',
                         'FMCITE':'face_mask_enforced_by_criminal_charge',
                         'FMNOENF':'face_mask_no_legal_enforcement',
                         'ALCOPEN':'alchool_stores_open',
                         'GUNOPEN':'guns_stores_open',
                         'SMALLBUSMINWAGE':'small_business_min_wage',
                         'MINWAGE2021':'min_wage_2021',
                         'TIPMINWAGE2020':'tip_min_wage_2020',
                         'VBMEXC':'covid_isnt_reason_to_request_vote_by_mail',
                         'MH19':'mental_health_prof_per_100kpop_2019',
                         'POV18':'perc_people_under_poverty_line_2018',
                         'UNEMP18':'perc_unenployed_2018',
                         'HMLS19':'number_homeless_2019',
                         'POP18':'population_2018',
                         'POPDEN18':'pop_density_2018'}, axis=1, inplace = True)
full_table_state = full_table_state[['state','state_short','company_name','product_name','locale_type',
                                     'date','week_year','weekday','sector_description','pct_black_hispanic',
                                     'sector_Corporate','sector_HigherEd','sector_PreK',
                                     'primary_function_main','primary_function_sub',
                                     'engagement_index','pct_access', 'in_emergstate_period',
                                     'in_closed_daycare_period','in_closed_school_period',
                                     'in_closed_gym_period','in_eviction_moratorium_period',
                                     'in_snap_emergency_allotments_period','in_snap_ebt_school_period2020',
                                     'in_snap_ebt_school_period2021',
                                     'excep_relig_meet','face_mask_enforced_by_fine',
                                     'face_mask_enforced_by_criminal_charge',
                                     'face_mask_no_legal_enforcement',
                                     'alchool_stores_open','guns_stores_open',
                                     'small_business_min_wage','min_wage_2021',
                                     'tip_min_wage_2020', 'covid_isnt_reason_to_request_vote_by_mail',
                                     'mental_health_prof_per_100kpop_2019',
                                     'perc_people_under_poverty_line_2018',
                                     'perc_unenployed_2018','number_homeless_2019',
                                     'population_2018','pop_density_2018']].copy()

full_table_state[['excep_relig_meet','face_mask_enforced_by_fine',
                  'face_mask_enforced_by_criminal_charge','face_mask_no_legal_enforcement',
                  'alchool_stores_open','guns_stores_open',
                  'covid_isnt_reason_to_request_vote_by_mail',
                  'population_2018']] = full_table_state[['excep_relig_meet','face_mask_enforced_by_fine',
                          'face_mask_enforced_by_criminal_charge','face_mask_no_legal_enforcement',
                          'alchool_stores_open','guns_stores_open','covid_isnt_reason_to_request_vote_by_mail','population_2018']].astype('int64', errors='ignore')

full_table_state[['small_business_min_wage','min_wage_2021','tip_min_wage_2020',
                  'mental_health_prof_per_100kpop_2019','perc_people_under_poverty_line_2018',
                  'perc_unenployed_2018','number_homeless_2019','pop_density_2018']] = full_table_state[['small_business_min_wage','min_wage_2021','tip_min_wage_2020',
                          'mental_health_prof_per_100kpop_2019','perc_people_under_poverty_line_2018',
                          'perc_unenployed_2018','number_homeless_2019','pop_density_2018']].astype('float32', errors='ignore')


In [None]:

gc.collect()


codes = ['DP05_0003PE', 'DP05_0004PE', 'DP05_0005PE', 'DP05_0006PE', 'DP05_0007PE',
         'DP04_0002PE', 'DP04_0003PE', 'DP04_0028PE', 'DP04_0039PE', 'DP04_0040PE', 
         'DP04_0058PE', 'DP04_0075PE', 'DP04_0081PE','DP04_0091PE', 'DP04_0127E',
         'DP03_0002PE', 'DP03_0006PE', 'DP03_0011PE', 'DP03_0015PE', 
         'DP03_0017PE', 'DP03_0019PE', 'DP03_0020PE', 'DP03_0021PE',  'DP03_0022PE', 
         'DP03_0024PE', 'DP03_0025PE', 'DP02_0015E', 'DP02_0016E', 'DP02_0034PE',
         'DP02_0035PE', 'DP02_0044PE','DP02_0050PE', 'DP02_0051PE', 'DP02_0053PE', 
         'DP02_0054PE', 'DP02_0055PE', 'DP02_0056PE', 'DP02_0057PE', 'DP02_0059PE', 
         'DP02_0060PE', 'DP02_0061PE', 'DP02_0066PE', 'DP02_0073PE', 'DP02_0075PE', 
         'DP02_0077PE', 'DP02_0092PE', 'DP02_0117PE', 'DP02_0119PE', 'DP02_0121PE', 
         'DP02_0150PE', 'DP02_0151PE', 'DP02_0152PE']

cols_names = {
    "DP05_0003PE":  "Total population Female",
    "DP05_0004PE":  "Under 5 years",
    "DP05_0005PE":  "5 to 9 years",
    "DP05_0006PE":  "10 to 14 years",
    "DP05_0007PE":  "15 to 19 years",
    "DP04_0002PE":  "Total housing units Occupied housing units",
    "DP04_0003PE":  "Total housing units Vacant housing units",
    "DP04_0028PE":  "Total housing units 1 room",
    "DP04_0039PE":  "Total housing units No bedroom",
    "DP04_0040PE":  "Total housing units 1 bedroom",
    "DP04_0058PE":  "Occupied housing units No vehicles available",
    "DP04_0075PE":  "Occupied housing units No telephone service available",
    "DP04_0081PE":  "Owner-occupied units Less than $50,000",
    "DP04_0091PE":  "Owner-occupied units Housing units with a mortgage",
    "DP04_0127E":   "Occupied units paying rent Less than $500",
    "DP03_0002PE":  "Population 16 years and over In labor force",
    "DP03_0006PE":  "Population 16 years and over In labor force Armed Forces",
    "DP03_0011PE":  "Females 16 years and over In labor force",
    "DP03_0015PE":  "Children under 6 years with all parents in family in labor force",
    "DP03_0017PE":  "Children 6 to 17 years with all parents in family in labor force",
    "DP03_0019PE":  "Workers drove alone",
    "DP03_0020PE":  "Workers carpooled",
    "DP03_0021PE":  "Workers using Public transportation",
    "DP03_0022PE":  "Workers Walked",
    "DP03_0024PE":  "Workers Worked at home",
    "DP03_0025PE":  "Mean travel time to work (minutes)",
    "DP02_0015E":   "Average household size",
    "DP02_0016E":   "Average family size",
    "DP02_0034PE":  "Females 15 years and over Widowed",
    "DP02_0035PE":  "Females 15 years and over Divorced",
    "DP02_0044PE":  "Grandparents responsible for grandchildren",
    "DP02_0050PE":  "Grandmother responsible for own grandchildren",
    "DP02_0051PE":  "Married grandparents responsible for own grandchildren",
    "DP02_0053PE":  "Population enrolled in school Nursery school, preschool",
    "DP02_0054PE":  "Population enrolled in school Kindergarten",
    "DP02_0055PE":  "Population enrolled in school Elementary school (grades 1-8)",
    "DP02_0056PE":  "Population enrolled in school High school (grades 9-12)",
    "DP02_0057PE":  "Population enrolled in school College or graduate school",
    "DP02_0059PE":  "Adults Less than 9th grade",
    "DP02_0060PE":  "Adults 9th to 12th grade, no diploma",
    "DP02_0061PE":  "Adults High school graduate (includes equivalency)",
    "DP02_0066PE":  "Adults high school graduate or higher",
    "DP02_0073PE":  "Under 18 years With a disability",
    "DP02_0075PE":  "18 to 64 years With a disability",
    "DP02_0077PE":  "65 years and over With a disability",
    "DP02_0092PE":  "Total population Foreign born",
    "DP02_0117PE":  "Indo-European who don't speak english well",
    "DP02_0119PE":  "Asian and Pacific Islander who don't speak english well",
    "DP02_0121PE":  "Other languages who don't speak english well",
    "DP02_0150PE":  "Total households",
    "DP02_0151PE":  "Total households With a computer",
    "DP02_0152PE":  "Total households With a broadband Internet subscription",
}

census_data = censusdata.download('acs5', 2015, censusdata.censusgeo([('county', '*')]),
                                   codes,
                                   tabletype='profile')
def get_state(geo):
  return geo.name.split(', ')[1]
census_data = census_data.rename(index=get_state)  
census_data.index.name='state'
census_data=census_data.reset_index()

census_data = census_data.groupby('state').mean().reset_index()
census_data = census_data.rename(columns=cols_names)

del codes, cols_names

In [None]:
gc.collect()

census_data = census_data.drop(columns=['Mean travel time to work (minutes)', 'Total households', 'Total households With a computer', 'Total households With a broadband Internet subscription'])
census_data.columns = map(str.lower, census_data.columns)
census_data.columns = census_data.columns.str.replace(' ','_')

census_data.head()

In [None]:
del covid_19_state_policy
gc.collect()

census_data.shape

In [None]:
full_table_state = pd.merge(full_table_state, census_data, on='state')

del census_data
gc.collect()

# def cut_engagement(x):
#     if x < 150:
#         return 'low'
#     elif x < 311:
#         return 'medium'
#     else:
#         return 'high'
# def cut_access(x):
#     if x < 0.517:
#         return 'low'
#     elif x < 0.694:
#         return 'medium'
#     else:
#         return 'high'
    
# full_table_state['engagement_categoy'] = full_table_state.engagement_index.apply(cut_engagement)
# full_table_state['access'] = full_table_state.pct_access.apply(cut_access)
full_table_state.head()

In [None]:
# link = 'https://data.cms.gov/provider-data/sites/default/files/resources/f0ac50d7d0a50b3b4bd21668bcc3b24a_1632852401/HH_State_July2021.csv'
# 
# health_care_data = pd.read_csv(link)
# health_care_data.columns = map(str.lower, health_care_data.columns)
# health_care_data.columns = health_care_data.columns.str.replace(' ','_')

# health_care_data.rename({'state':'state_short'}, axis=1, inplace = True)

# link = 'https://data.cms.gov/provider-data/sites/default/files/resources/8afcba3acd6f5791e7dc16df93fc7653_1631721917/NH_StateUSAverages_Sep2021.csv'

# health_care_data2 = pd.read_csv(link)
# health_care_data2.columns = map(str.lower, health_care_data2.columns)
# health_care_data2.columns = health_care_data2.columns.str.replace(' ','_')

# health_care_data2.rename({'state_or_nation':'state_short'}, axis=1, inplace = True)
# health_care_data2 = health_care_data2.drop('processing_date',axis=1)

# health_care_data = pd.merge(health_care_data, health_care_data2, on='state_short')

# health_care_data.head()

In [None]:
# full_table_state = pd.merge(full_table_state, health_care_data, on='state_short')

# del health_care_data
# gc.collect()

# full_table_state.head()

In [None]:
# temp_sectors = pd.get_dummies(full_table_state[['locale_type', 'primary_function_main']]
#                                , prefix=['locale_type', 'primary_function_main'])
# full_table_state = full_table_state.join(temp_sectors)

# del temp_sectors
# gc.collect()
# full_table_state.to_csv('full_table_state.csv')

In [None]:
full_table_state.shape

<div style="font-family:verdana; word-spacing:1.5px;">
    <h1 id="enrichment">
    1.2 Data enrichment
    </h1>
</div>


To bring more insight into our analysis, we pull data from other sources.

One source that brought a lot of relevant information is the [US states' database of policies to combat COVID-19](https://www.openicpsr.org/openicpsr/project/119446/version/V75/view?path=/openicpsr/119446/fcr:versions/V75/COVID-19-US-State-Policy-Database-master/COVID-19-US-state-policy-database-3_29_2021.xlsx&type=file). This database contains a lot of information about the public policies adopted in response to the pandemic in 2020, as well as some demographic data for each state.

We also pulled [state averages of several quality measures for home health agencies](https://data.cms.gov/provider-data/dataset/tee5-ixt5) and other [state health averages data](https://data.cms.gov/provider-data/dataset/xcdc-v8bm).

In addition, we have included Census data, obtained through the [`censusdata` Python package](https://jtleider.github.io/censusdata/), which adds information about each state such as population age and household characteristics.
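
As a rough illustration of the enrichment step, the sketch below joins a small engagement table with a state-level table on the shared `state` column, in the same way the notebook calls `pd.merge(..., on='state')`. The two tables and their values are invented for the example; only the join pattern comes from the notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the real tables; all values are made up.
engagement = pd.DataFrame({
    "state": ["Utah", "Utah", "Ohio"],
    "engagement_index": [120.0, 80.0, 50.0],
})
census = pd.DataFrame({
    "state": ["Utah", "Ohio", "Texas"],
    "total_population_foreign_born": [8.5, 4.6, 17.0],
})

# pd.merge defaults to an inner join: every engagement row gains the census
# columns of its state, and states missing from either table are dropped.
enriched = pd.merge(engagement, census, on="state")
print(enriched)
```

Because the join is an inner join, any state absent from the enrichment table silently disappears from the result, so comparing row counts before and after the merge is a cheap sanity check.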

At the end of the selection and transformation of the most relevant data, we arrive at the following table:

In [None]:
full_table_state.head()

<div style="font-family:verdana; bold;word-spacing:1.5px;">
    <h1 id="overview">
    2 Overview
    </h1>
</div><br>

In [None]:
gc.collect()

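# Correlate numeric (and boolean) columns only; text columns such as state or
# product name cannot be correlated and raise an error in recent pandas versions
corr = full_table_state.select_dtypes(include=['number', 'bool']).corr()

# Rank features by their correlation with engagement_index.
# The first two rows are skipped, as in the original analysis
# (the first one is engagement_index itself).
ranked = corr['engagement_index'].reset_index().\
    sort_values(by='engagement_index', ascending=False)[2:]

df1 = ranked.head(10)  # strongest positive correlations
df2 = ranked.tail(10)  # strongest negative correlations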

fig, ax = plt.subplots(1,2,figsize=(20,8))
fig.suptitle('Top positive and negative correlations with engagement index',fontweight='bold',fontsize=20)

p1 = sns.barplot(x='engagement_index',y='index', data=df1,color = '#57799C', ax=ax[0])
                    
ax[0].set_title('Top 10 Positive Correlations',fontsize=12)
ax[0].set_xlabel('Corr')
ax[0].set_ylabel('Feature')


for x in p1.patches:
    width = x.get_width()  # bar length, i.e. the correlation value
    p1.text(width + 0.01,  # place the label just right of the bar end
            x.get_y() + x.get_height() / 2,  # vertical center of the bar
            '{:1.3f}'.format(width),  # show the correlation with 3 decimals
            ha='left',  # horizontal alignment
            va='center',  # vertical alignment
            color='black',
            fontsize=14)

# ax[0].add_patch(Rectangle((0.01, -0.49), 69.5, 1,fc='#F5A294', ec='#F54434', linewidth=2.0, alpha=0.2,ls="--"))

p2 = sns.barplot(x='engagement_index',y='index', data=df2,color = '#57799C', ax=ax[1])
ax[1].set_title('Top 10 Negative Correlations',fontsize=12)
ax[1].set_xlabel('Corr')
ax[1].set_ylabel('Feature')

for x in p2.patches:
    width = x.get_width()  # bar length, i.e. the correlation value
    p2.text(width - 0.01,  # place the label just left of the bar end
            x.get_y() + x.get_height() / 2,  # vertical center of the bar
            '{:1.3f}'.format(width),  # show the correlation with 3 decimals
            ha='right',  # horizontal alignment
            va='center',  # vertical alignment
            color='black',
            fontsize=14)


plt.tight_layout()
plt.show()

<div style="font-family:verdana; word-spacing:1.5px;">
    <h1 id="ovw-location">
    2.1 Location
    </h1>
</div><br>

In [None]:
gc.collect()


state_code = pd.read_excel("../input/covid19-us-state-policy-database-3-29-2021/COVID-19 US state policy database 3_29_2021.xlsx")
state_code.dropna(subset = ["STATE"], inplace=True)
state_code.fillna(0,inplace=True)
# Keep only the state name and its postal code for the map lookup
state_code = state_code[['STATE','POSTCODE']]


plot_df = full_table_state[['state','engagement_index']].\
    groupby(by=['state']).sum().reset_index()

plot_df = pd.merge(plot_df,state_code, 'left',left_on='state',right_on='STATE')

fig = go.Figure(data=go.Choropleth(
    locations=plot_df['POSTCODE'], # Spatial coordinates
    z = plot_df['engagement_index'].astype(float), # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    colorscale = 'Reds',
    text=plot_df['state'], # hover text
    colorbar_title = "Engagement"
))

fig.update_layout(
    title_text = 'Total Engagement by State',
    geo_scope='usa', # limit map scope to USA
)

fig.show()  # Output the plot to the screen

In [None]:
gc.collect()

df1 = full_table_state[['state','engagement_index']].\
    groupby(by=['state']).sum().reset_index().\
    sort_values(['engagement_index'],ascending=False)
df1['engagement_percent'] = df1.engagement_index.apply(lambda x: 100 * x / df1['engagement_index'].sum())
df1 =  df1.head(10)

df2 = full_table_state[['locale_type','engagement_index']].\
    groupby(by=['locale_type']).sum().reset_index().\
    sort_values(['engagement_index'],ascending=False)
df2['engagement_percent'] = df2.engagement_index.apply(lambda x: 100 * x / df2['engagement_index'].sum())
df2.reset_index(inplace=True,drop=True)

fig, ax = plt.subplots(1,2,figsize=(20,8))
fig.suptitle('Engagement Rate per Location',fontweight='bold',fontsize=20)

p1 = sns.barplot(x='engagement_percent',y='state', data=df1,color = '#57799C', ax=ax[0])
ax[0].set_title('Engagement Rate per State',fontsize=12)
ax[0].set_xlabel('Engagement Rate (%)')
ax[0].set_ylabel('State')
for x in p1.patches:
    width = x.get_width()  # bar length (engagement share in %)
    p1.text(width + 0.05,  # place the label just right of the bar end
            x.get_y() + x.get_height() / 2,  # vertical center of the bar
            '{:1.0f}'.format(width),  # value rounded to a whole number
            ha='left',  # horizontal alignment
            va='center',  # vertical alignment
            color='black',
            fontsize=14)

# ax[0].add_patch(Rectangle((0.01, -0.49), 69.5, 1,fc='#F5A294', ec='#F54434', linewidth=2.0, alpha=0.2,ls="--"))

p2 = sns.barplot(x='engagement_percent',y='locale_type', data=df2,color = '#57799C', ax=ax[1],order=df2['locale_type'])
ax[1].set_title('Engagement Rate per Location Type',fontsize=12)
ax[1].set_xlabel('Engagement Rate (%)')
ax[1].set_ylabel('Location Type')
for x in p2.patches:
    width = x.get_width()  # bar length (engagement share in %)
    p2.text(width + 0.05,  # place the label just right of the bar end
            x.get_y() + x.get_height() / 2,  # vertical center of the bar
            '{:1.0f}'.format(width),  # value rounded to a whole number
            ha='left',  # horizontal alignment
            va='center',  # vertical alignment
            color='black',
            fontsize=14)

# ax[1].add_patch(Rectangle((0.01, -0.49), 35.5, 2, fc='#F5A294', ec='#F54434', linewidth=2.0, alpha=0.2,ls="--"))
# ax[1].add_patch(Rectangle((0.01, 2.5), 35.5, 1, fc='#F5A294', ec='#F54434', linewidth=2.0, alpha=0.2,ls="--"))
# ax[1].add_patch(Rectangle((0.01, 4.5), 35.5, 4, fc='#F5A294', ec='#F54434', linewidth=2.0, alpha=0.2,ls="--"))

plt.tight_layout()
plt.show()

<div style="font-family:verdana; word-spacing:1.5px;">
    <h1 id="ovw-products">
    2.2 Products
    </h1>
</div><br>

Looking at the whole database, we can see that Google dominates interactions overall, accounting for 67% of the interactions observed in 2020.

Of the top 10 products, 7 are from Google. The first two, Google Docs and Google Classroom, together account for 50% of all tracked engagement.
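
The shares quoted above are each product's summed engagement divided by the grand total. A minimal sketch of that calculation follows; the product names are real, but the numbers are invented and only the `product_name` and `engagement_index` column names are taken from the notebook.

```python
import pandas as pd

# Hypothetical rows; the engagement values are made up for illustration.
rows = pd.DataFrame({
    "product_name": ["Google Docs", "Google Classroom", "YouTube", "Google Docs"],
    "engagement_index": [30.0, 20.0, 10.0, 40.0],
})

# Sum engagement per product, then express each sum as a percentage of the total.
totals = rows.groupby("product_name")["engagement_index"].sum()
share = (100 * totals / totals.sum()).sort_values(ascending=False)
print(share.round(1))
```

Dividing the whole series by `totals.sum()` once gives the same result as the row-by-row `apply` used in the cells of this section, without recomputing the total for every row.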

In [None]:
gc.collect()

df1 = full_table_state[['company_name','engagement_index']].\
    groupby(by=['company_name']).sum().reset_index().\
    sort_values(['engagement_index'],ascending=False)
df1['engagement_percent'] = df1.engagement_index.apply(lambda x: 100 * x / df1['engagement_index'].sum())
df1 =  df1.head(10)

df2 = full_table_state[['product_name','engagement_index']].\
    groupby(by=['product_name']).sum().reset_index().\
    sort_values(['engagement_index'],ascending=False)
df2['engagement_percent'] = df2.engagement_index.apply(lambda x: 100 * x / df2['engagement_index'].sum())
df2 = df2.head(10)

fig, ax = plt.subplots(1,2,figsize=(20,8))
fig.suptitle('Top companies and products in total engagement rate',fontweight='bold',fontsize=20)

p1 = sns.barplot(x='engagement_percent',y='company_name', data=df1,color = '#57799C', ax=ax[0])
ax[0].set_title('Top 10 Provider/Company Engagement',fontsize=12)
ax[0].set_xlabel('Engagement Rate (%)')
ax[0].set_ylabel('Company Name')
for x in p1.patches:
    width = x.get_width()  # bar length (engagement share in %)
    p1.text(width + 0.05,  # place the label just right of the bar end
            x.get_y() + x.get_height() / 2,  # vertical center of the bar
            '{:1.0f}'.format(width),  # value rounded to a whole number
            ha='left',  # horizontal alignment
            va='center',  # vertical alignment
            color='black',
            fontsize=14)

ax[0].add_patch(Rectangle((0.01, -0.49), 69.5, 1,fc='#F5A294', ec='#F54434', linewidth=2.0, alpha=0.2,ls="--"))

p2 = sns.barplot(x='engagement_percent',y='product_name', data=df2,color = '#57799C', ax=ax[1])
ax[1].set_title('Top 10 Product Engagement',fontsize=12)
ax[1].set_xlabel('Engagement Rate (%)')
ax[1].set_ylabel('Product Name')
for x in p2.patches:
    width = x.get_width()  # bar length (engagement share in %)
    p2.text(width + 0.05,  # place the label just right of the bar end
            x.get_y() + x.get_height() / 2,  # vertical center of the bar
            '{:1.0f}'.format(width),  # value rounded to a whole number
            ha='left',  # horizontal alignment
            va='center',  # vertical alignment
            color='black',
            fontsize=14)

ax[1].add_patch(Rectangle((0.01, -0.49), 35.5, 2, fc='#F5A294', ec='#F54434', linewidth=2.0, alpha=0.2,ls="--"))
ax[1].add_patch(Rectangle((0.01, 2.5), 35.5, 1, fc='#F5A294', ec='#F54434', linewidth=2.0, alpha=0.2,ls="--"))
ax[1].add_patch(Rectangle((0.01, 4.5), 35.5, 4, fc='#F5A294', ec='#F54434', linewidth=2.0, alpha=0.2,ls="--"))

plt.tight_layout()
plt.show()

In [None]:
gc.collect()


df1 = full_table_state.groupby('company_name').product_name.nunique().reset_index().\
    sort_values(['product_name'],ascending=False).head(10)

companies_more_engag = full_table_state[['company_name','engagement_index']].\
    groupby(by=['company_name']).sum().reset_index().\
    sort_values(['engagement_index'],ascending=False).head(10)

df2 = full_table_state.groupby('company_name').product_name.nunique().reset_index()
# df2 = df2[df2['company_name'].isin(companies_more_engag['company_name'])]
df2 = pd.merge(df2,companies_more_engag, 'inner',on='company_name').reset_index().\
    sort_values(['engagement_index'],ascending=False)

fig, ax = plt.subplots(1,2,figsize=(20,8))
fig.suptitle('Analysis of the number of products from the top 10 companies',fontweight='bold',fontsize=20)

p1 = sns.barplot(x='product_name',y='company_name', data=df1,color = '#57799C', ax=ax[0])
ax[0].set_title('Top 10 companies with the most mapped products',fontsize=12)
ax[0].set_xlabel('Number of products')
ax[0].set_ylabel('')
for x in p1.patches:
    width = x.get_width()  # bar length (number of products)
    p1.text(width + 0.05,  # place the label just right of the bar end
            x.get_y() + x.get_height() / 2,  # vertical center of the bar
            '{:1.0f}'.format(width),  # value shown as a whole number
            ha='left',  # horizontal alignment
            va='center',  # vertical alignment
            color='black',
            fontsize=14)

ax[0].add_patch(Rectangle((0.01, -0.49), 69.5, 1,fc='#F5A294', ec='#F54434', linewidth=2.0, alpha=0.2,ls="--"))

p2 = sns.barplot(x='product_name',y='company_name', data=df2,color = '#57799C', ax=ax[1])
ax[1].set_title('Number of products Top 10 companies in total engagement',fontsize=12)
ax[1].set_xlabel('Number of products')
ax[1].set_ylabel('')

for x in p2.patches:
    width = x.get_width()  # bar length (number of products)
    p2.text(width + 0.05,  # place the label just right of the bar end
            x.get_y() + x.get_height() / 2,  # vertical center of the bar
            '{:1.0f}'.format(width),  # value shown as a whole number
            ha='left',  # horizontal alignment
            va='center',  # vertical alignment
            color='black',
            fontsize=14)

# ax[1].add_patch(Rectangle((0.01, -0.49), 35.5, 2, fc='#F5A294', ec='#F54434', linewidth=2.0, alpha=0.2,ls="--"))
ax[1].add_patch(Rectangle((0.01, 1.5), 35.5, 1, fc='#F5A294', ec='#F54434', linewidth=2.0, alpha=0.2,ls="--"))
ax[1].add_patch(Rectangle((0.01, 3.5), 35.5, 3, fc='#F5A294', ec='#F54434', linewidth=2.0, alpha=0.2,ls="--"))

plt.tight_layout()
plt.show()

In absolute terms, most engagement took place in schools in suburban (63%) and rural (21%) districts.

In [None]:
gc.collect()

df1 = full_table_state[['primary_function_main','engagement_index']].\
    groupby(by=['primary_function_main']).sum().reset_index()
df1['engagement_percent'] = df1.engagement_index.apply(lambda x: 100 * x / df1['engagement_index'].sum())
df1 =  df1.sort_values(['engagement_percent'],ascending=False).reset_index(drop=True).head(10)

df2 = full_table_state[['primary_function_sub','engagement_index']].\
    groupby(by=['primary_function_sub']).sum().reset_index()
df2['engagement_percent'] = df2.engagement_index.apply(lambda x: 100 * x / df2['engagement_index'].sum())
df2 = df2.sort_values(['engagement_percent'],ascending=False).reset_index(drop=True).head(10)

fig, ax = plt.subplots(1,2,figsize=(20,8))
fig.suptitle('Top category and sub-category in total engagement rate',fontweight='bold',fontsize=20)

p1 = sns.barplot(x='engagement_percent',y='primary_function_main', data=df1,color = '#57799C', order=df1['primary_function_main'], ax=ax[0])
ax[0].set_title('Top Main Function Engagement',fontsize=12)
ax[0].set_xlabel('Engagement Rate (%)')
ax[0].set_ylabel(None)
for x in p1.patches:
    width = x.get_width()  # bar length (engagement share in %)
    p1.text(width + 0.05,  # place the label just right of the bar end
            x.get_y() + x.get_height() / 2,  # vertical center of the bar
            '{:1.0f}'.format(width),  # value rounded to a whole number
            ha='left',  # horizontal alignment
            va='center',  # vertical alignment
            color='black',
            fontsize=14)

p2 = sns.barplot(x='engagement_percent',y='primary_function_sub', data=df2, color = '#57799C', order=df2['primary_function_sub'], ax=ax[1])
ax[1].set_title('Top 10 Sub-Category Function Engagement',fontsize=12)
ax[1].set_xlabel('Engagement Rate (%)')
ax[1].set_ylabel(None)
for x in p2.patches:
    width = x.get_width()  # bar length (engagement share in %)
    p2.text(width + 0.05,  # place the label just right of the bar end
            x.get_y() + x.get_height() / 2,  # vertical center of the bar
            '{:1.0f}'.format(width),  # value rounded to a whole number
            ha='left',  # horizontal alignment
            va='center',  # vertical alignment
            color='black',
            fontsize=14)

plt.tight_layout()
plt.show()

In [None]:
# gc.collect()

# df1 = full_table_state[['primary_function_main','date','in_emergstate_period','engagement_index']].\
#     groupby(by=['primary_function_main','in_emergstate_period','date']).sum().reset_index()

# df1 = df1[['primary_function_main','in_emergstate_period','engagement_index']].\
#     groupby(by=['primary_function_main','in_emergstate_period']).mean().reset_index().\
#     sort_values(['engagement_index'],ascending=False).head(10)

# df1 = df1.pivot(index='primary_function_main', columns = 'in_emergstate_period', values = 'engagement_index')
# df1['avg_daily_engag_growth']=(df1[1]-df1[0])/(df1[0])*100
# df1 = df1.sort_values(['avg_daily_engag_growth'],ascending=False)
# df1.reset_index(inplace=True)

# df2 = full_table_state[['primary_function_sub','date','in_emergstate_period','engagement_index']].\
#     groupby(by=['primary_function_sub','in_emergstate_period','date']).sum().reset_index()

# df2 = df2[['primary_function_sub','in_emergstate_period','engagement_index']].\
#     groupby(by=['primary_function_sub','in_emergstate_period']).mean().reset_index().\
#     sort_values(['engagement_index'],ascending=False).head(10)

# df2 = df2.pivot(index='primary_function_sub', columns = 'in_emergstate_period', values = 'engagement_index')
# df2['avg_daily_engag_growth']=(df2[1]-df2[0])/(df2[0])*100
# df2 = df2.sort_values(['avg_daily_engag_growth'],ascending=False)
# df2.reset_index(inplace=True)

# fig, ax = plt.subplots(1,2,figsize=(20,8))
# fig.suptitle('Top states and locale type in total engagement rate',fontweight='bold',fontsize=20)


# p1 = sns.barplot(x='avg_daily_engag_growth',y='primary_function_main', data=df1, order=df1['primary_function_main'], 
#                  ax=ax[0], color = '#57799C')
# ax[0].set_title('Top Main Function Engagement',fontsize=12)
# ax[0].set_xlabel('Engagement Rate (%)')
# ax[0].set_ylabel(None)
# for x in p1.patches:
#     width = x.get_width()    # get bar length
#     p1.text(width + 0.05,       # set the text at 1 unit right of the bar
#             x.get_y() + x.get_height() / 2, # get Y coordinate + X coordinate / 2
#             '{:1.0f}'.format(width), # set variable to display, 2 decimals
#             ha = 'left',   # horizontal alignment
#             va = 'center'
#             , color='black'
#             , fontsize=14)  # vertical alignment

# p2 = sns.barplot(x='avg_daily_engag_growth',y='primary_function_sub', data=df2, order=df2['primary_function_sub'], 
#                  ax=ax[1], color = '#57799C')
# ax[1].set_title('Top 10 Sub-Category Function Engagement',fontsize=12)
# ax[1].set_xlabel('Engagement Rate (%)')
# ax[1].set_ylabel(None)
# for x in p2.patches:
#     width = x.get_width()    # get bar length
#     p2.text(width + 0.05,       # set the text at 1 unit right of the bar
#             x.get_y() + x.get_height() / 2, # get Y coordinate + X coordinate / 2
#             '{:1.0f}'.format(width), # set variable to display, 2 decimals
#             ha = 'left',   # horizontal alignment
#             va = 'center'
#             , color='black'
#             , fontsize=14)  # vertical alignment

# plt.tight_layout()
# plt.show()

When we compare the average daily engagement outside and within the state-of-emergency period decreed by each state, we see that the average grows sharply, from *3,082,485* to *4,331,711*, an increase of *40.5%*.

This likely reflects the closing of schools and the suspension of other leisure activities that students would normally be involved in.
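
The 40.5% figure is the relative change between the two averages quoted above. The short check below reproduces it; the variable names are only illustrative.

```python
# Average daily engagement, as reported in the text above
before = 3_082_485  # outside the state-of-emergency period
after = 4_331_711   # within the state-of-emergency period

# Relative change, expressed as a percentage of the starting value
pct_change = 100 * (after - before) / before
print(f"{pct_change:.1f}%")
```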

<div style="font-family:verdana; word-spacing:1.5px;">
    <h1 id="ovw-holydays">
    2.3 Access and engagement between pre- and post-pandemic holidays
    </h1>
</div><br>

In [None]:
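# Parse timestamps and derive calendar flags
full_table["time"] = pd.to_datetime(full_table.time)
full_table["week"] = full_table.time.dt.dayofweek  # day of week: Monday=0 ... Sunday=6
# Despite its name, "holiday" flags weekends (Saturday=5, Sunday=6)
full_table["holiday"] = full_table.week.apply(lambda x: 1 if x in [5, 6] else 0)
# Dates from 2020-01-01 to 2020-01-19 are treated as pre-pandemic
d = pd.date_range(start="2020-01-01", end="2020-01-19")
full_table["is_pandemic"] = full_table.time.apply(lambda x: 0 if x in d else 1)
full_table.drop("week", axis=1, inplace=True)  # helper column no longer needed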

In [None]:
gc.collect()

plt.figure(figsize=(10, 8))

plot_df = full_table_state[['in_emergstate_period','date','engagement_index']].\
    groupby(by=['in_emergstate_period','date']).sum().reset_index()
plot_df = plot_df[['in_emergstate_period','engagement_index']].\
    groupby(by=['in_emergstate_period']).mean().reset_index().\
    sort_values(['engagement_index'],ascending=False)
# plot_df['engagement_percent'] = plot_df.engagement_index.apply(lambda x: 100 * x / plot_df['engagement_index'].sum())
plot_df =  plot_df.head(10)

sns.barplot(x='in_emergstate_period',y='engagement_index', data=plot_df,color = '#57799C')
plt.title("Avg Daily Engagement in emergency state",font="Serif", size=20)
plt.xlabel('Day in Emergency Period (1 if Yes, 0 if Not)')
plt.ylabel('Average Daily Engagement')
plt.show()

In [None]:
gc.collect()

plt.figure(figsize=(20, 8))

plot_df = full_table_state[['week_year','locale_type','date','in_emergstate_period','engagement_index']].\
    groupby(by=['week_year','locale_type','date']).sum().reset_index()

plot_df = plot_df[['week_year','locale_type','engagement_index']].\
    groupby(by=['week_year','locale_type']).mean().reset_index().\
    sort_values(['week_year'],ascending=True)


sns.lineplot(x='week_year',y='engagement_index', data=plot_df,hue='locale_type')
plt.title("Avg Daily Engagement per Week (2020)",size=20)
plt.xticks(rotation=45)
plt.show()

<div style="font-family:verdana; word-spacing:1.5px;">
    <h1 id="ovw-holydays">
    2.4 Demography
    </h1>
</div><br>

In [None]:
# gc.collect()

# df1 = full_table_state.groupby('pct_black_hispanic')['engagement_index'].sum()

# df1 = pd.DataFrame(df1).reset_index().sort_values(by='engagement_index',ascending=False)[2:]

# df1 =  df1.head(10)

# df2 = full_table_state.groupby('pct_black_hispanic')['pct_access'].sum()

# df2 = pd.DataFrame(df2).reset_index().sort_values(by='pct_access',ascending=False)[2:]

# fig, ax = plt.subplots(1,2,figsize=(20,8))
# fig.suptitle('Top category and sub-category in total engagement rate',fontweight='bold',fontsize=20)

# p1 = sns.barplot(x='engagement_index',y='pct_black_hispanic', data=df1,color = '#57799C', ax=ax[0])
# ax[0].set_title('Top Main Function Engagement',fontsize=12)
# ax[0].set_xlabel('Engagement Rate (%)')
# ax[0].set_ylabel(None)
# for x in p1.patches:
#     width = x.get_width()    # get bar length
#     p1.text(width + 0.05,       # set the text at 1 unit right of the bar
#             x.get_y() + x.get_height() / 2, # get Y coordinate + X coordinate / 2
#             '{:1.0f}'.format(width), # set variable to display, 2 decimals
#             ha = 'left',   # horizontal alignment
#             va = 'center'
#             , color='black'
#             , fontsize=14)  # vertical alignment

# p2 = sns.barplot(x='pct_access',y='pct_black_hispanic', data=df2, color = '#57799C', ax=ax[1])
# ax[1].set_title('Top 10 Sub-Category Function Engagement',fontsize=12)
# ax[1].set_xlabel('Engagement Rate (%)')
# ax[1].set_ylabel(None)
# for x in p2.patches:
#     width = x.get_width()    # get bar length
#     p2.text(width + 0.05,       # set the text at 1 unit right of the bar
#             x.get_y() + x.get_height() / 2, # get Y coordinate + X coordinate / 2
#             '{:1.0f}'.format(width), # set variable to display, 2 decimals
#             ha = 'left',   # horizontal alignment
#             va = 'center'
#             , color='black'
#             , fontsize=14)  # vertical alignment

# plt.tight_layout()
# plt.show()

In [None]:
# # define dataset
# X = full_table_state.drop(columns='engagement_index').copy()
# y = full_table_state['engagement_index']


In [None]:
# from sklearn.ensemble import RandomForestClassifier

# feature_names = [f'feature {i}' for i in range(X.shape[1])]
# forest = RandomForestClassifier(random_state=0)
# forest.fit(X, y)

In [None]:

# importances = forest.feature_importances_
# std = np.std([
#     tree.feature_importances_ for tree in forest.estimators_], axis=0)
# forest_importances = pd.Series(importances, index=feature_names)

In [None]:
# # define the model
# model = LinearRegression()
# # fit the model
# model.fit(X, y)
# # get importance
# importance = model.coef_
# # summarize feature importance
# for i,v in enumerate(importance):
# 	print('Feature: %0d, Score: %.5f' % (i,v))
# # plot feature importance

In [None]:
# pyplot.bar([x for x in range(len(importance))], importance)
# pyplot.show()

In [None]:
# plt.figure(figsize=(20, 20))

# abc = sns.heatmap(data=full_table_state.select_dtypes(include=['float','int']).corr(),annot=True)

# abc.add_patch(Rectangle((3, 0), 1, 40, fc='#49EB61', ec='#49EB61', linewidth=2.0, alpha=0.2,ls="--"))
# abc.add_patch(Rectangle((0, 3), 40, 1, fc='#49EB61', ec='#49EB61', linewidth=2.0, alpha=0.2,ls="--"))

[Supplemental Nutrition Assistance Program (SNAP)](https://www.fns.usda.gov/disaster/pandemic/covid-19/snap-waivers-flexibilities)

[Pandemic EBT](https://www.fns.usda.gov/snap/state-guidance-coronavirus-pandemic-ebt-pebt)

In [None]:
# from sklearn.ensemble import RandomForestRegressor
# from sklearn import preprocessing

# X = full_table_state.drop(['engagement_index','state','state_short','company_name','product_name','locale_type','date','week_year','weekday','sector_description','primary_function_sub','primary_function_main'], axis=1)

# y = full_table_state.engagement_index

# min_max_scaler = preprocessing.MinMaxScaler()
# X = min_max_scaler.fit_transform(X)

# reg = RandomForestRegressor(random_state=0)
# reg.fit(X, y)
# importances =  pd.Series(reg.feature_importances_, index=X.columns, name='Importance')
# importances = importances.sort_values(ascending=False).to_frame()
# # importances.to_csv('importances_access.csv')
# importances

In [None]:

# pyplot.bar([x for x in range(len(importance))], importance)
# pyplot.show()