# Preamble & Key Insights
As a submission to the [Impact on Digital Learning Kaggle Competition](https://www.kaggle.com/c/learnplatform-covid19-impact-on-digital-learning/overview), this notebook explores the state of digital learning in 2020, particularly the disparity between the most- and least-engaged school districts, and how engagement with digital learning relates to factors such as district demographics and learning context. This notebook also aims to provide action items school districts can take to increase their student engagement. Using the engagement, product, and district data provided, I first quantified the disparity in engagement between the most- and least-engaged districts. I then quantified differences in demographic context, learning context, and socioeconomic status between the most- and least-engaged districts, to help understand why this disparity exists. **The main analytical findings of this notebook are as follows:**

1. *The disparity in student engagement is significant between the most and least engaged districts* (difference in mean daily engagement_index between the most and least engaged districts = 80,530, p = 0.00238).
2. *The most engaged districts used significantly more unique products, had significantly higher per-pupil expenditure, and had a smaller proportion of students eligible for a free or reduced-price lunch relative to the least-engaged districts.*
3. *There is a strong positive correlation between the number of unique products and per-pupil expenditure and moderately strong positive correlations between the number of unique products and student engagement as well as between per-pupil expenditure and student engagement.* This result suggests that the more student funding available, the more unique products a district can purchase and use with their students, which leads to increased student engagement.
4. *Although the most-engaged districts used a greater number of unique products, the distribution of product types was nearly the same for the most- and least-engaged districts.*
5. *From an additional data source, the areas where the least-engaged districts are located have a greater mean proportion of adults with bachelor's degrees and a slightly smaller unemployment rate.*

**From these analytical findings, recommendations to increase student engagement were identified throughout this notebook and summarized at the end of this notebook.** The thread of questioning motivating this analysis and methodology followed is outlined below, along with supporting visualizations, and assumptions made throughout the analysis. Within the text, assumptions have been numbered (see notation at the end of this sentence), and the implications of each assumption are discussed at the end of this notebook <span style="color:green;"> (1) </span>.

# Methodology
### 1. Load data, drop rows with missing values
First, the provided districts and product data files were loaded and, knowing that some districts were missing some information to maintain anonymity, the numbers of unique districts and states represented in the data were counted. For the district data, any rows with missing information were dropped and the values in columns "pct_black/hispanic", "pct_free/reduced", and "pp_total_raw" were changed from ranges to floats. For this conversion, the middle value of each range was used as the float (e.g. "[0, 0.2[" became 0.1), to help with quantitative analysis <span style="color:green;"> (1) </span>.
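
The range-to-midpoint conversion described above can also be sketched without a hand-written lookup table. This is only an illustration: the helper name `range_midpoint` is hypothetical, and it assumes every label follows the `[low, high[` pattern seen in the districts file.

```python
import re

def range_midpoint(label):
    """Return the midpoint of a half-open range label such as '[0.2, 0.4['."""
    # Assumption: the label contains exactly two non-negative numbers (low, high)
    low, high = (float(x) for x in re.findall(r"\d+(?:\.\d+)?", label))
    return (low + high) / 2

# Build a label -> midpoint mapping for a few example labels
midpoints = {label: range_midpoint(label)
             for label in ['[0, 0.2[', '[0.8, 1[', '[4000, 6000[']}
```

A mapping built this way could be passed to `replace` in the same way as a manually written dictionary, with the advantage that range labels not anticipated in advance are still converted.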

Rather than investigate all districts included in the provided dataset, the most- and least-engaged districts were investigated and compared. The motivation for doing this was that the original districts dataset included data from only 23 of 50 states, and from only 14 of 50 states once all rows with missing values were omitted. To ensure each district was represented as accurately as possible, districts with missing information were omitted rather than having their missing values replaced. With school district data from only 14 states, the district dataset may not have been a good representation of the entirety of the United States. Instead, it was assumed that the best- and worst-case scenarios in terms of student engagement were captured somewhere within the remaining 88 districts in the 14 states <span style="color:green;"> (2) </span>.

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_ind, pearsonr

# load districts and product data
distr_inf= pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")
prod_inf= pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv")

# count number of distinct states represented in the districts data
print('Number of unique states and districts represented in district data:\n', distr_inf['state'].nunique(), 'states,', distr_inf['district_id'].nunique(), 'districts')

# from districts data, drop districts that have any missing info
# want to appropriately represent each district, especially since we're looking at top/bottom 10% engaged districts
distr_inf.dropna(inplace= True)
print('Unique states and districts in district data after dropping rows with missing values:\n', distr_inf['state'].nunique(), 'states,', distr_inf['district_id'].nunique(), 'districts')

# for pct_black/hispanic, pct_free/reduced, and ppt_total_raw cols in districts data, convert the ranges to be the middle value of the range
# (converts all ranges to numerical values, to help with later analysis)
conv_1= {'[0, 0.2[': 0.1, '[0.2, 0.4[': 0.3, '[0.4, 0.6[': 0.5, '[0.6, 0.8[': 0.7, '[0.8, 1[': 0.9}
conv_2= {'[4000, 6000[': 5000, '[6000, 8000[': 7000, '[8000, 10000[': 9000, '[10000, 12000[': 11000, '[12000, 14000[': 13000, '[14000, 16000[': 15000, '[16000, 18000[': 17000, '[18000, 20000[': 19000, '[22000, 24000[': 23000, '[32000, 34000[': 33000}
for i in ['pct_black/hispanic', 'pct_free/reduced']:
    distr_inf[i]= distr_inf[i].replace(conv_1)
distr_inf['pp_total_raw']= distr_inf['pp_total_raw'].replace(conv_2)

# in products data, separate main and sub-categories within the Primary Essential Function column
prod_inf['Main function']= prod_inf['Primary Essential Function'].apply(lambda x: x.split(' - ')[0] if x == x else x)
prod_inf['Sub-function']= prod_inf['Primary Essential Function'].apply(lambda x: x.split(' - ')[1] if x == x else x)
# two function_sub strings are very similar, so they likely serve the same function. make these strings the same so nothing is double-counted
prod_inf['Sub-function']= prod_inf['Sub-function'].replace({'Sites, Resources & References' : 'Sites, Resources & Reference'})
prod_inf.drop("Primary Essential Function", axis= 1, inplace= True)

# add district ID column to engagement data for each district. combine all engagement data into one dataframe
# drop rows that have any missing info
temp= []
for d in distr_inf.district_id.unique():
    temp_df= pd.read_csv(f"{'../input/learnplatform-covid19-impact-on-digital-learning/engagement_data'}/{d}.csv", index_col= None, header= 0)
    temp_df['District ID']= d
    temp.append(temp_df)
engage= pd.concat(temp)
engage= engage.reset_index(drop= True)
engage.dropna(inplace= True)
engage['time']= pd.to_datetime(engage['time'])

### 2. Identify the most and least engaged districts, quantify the engagement disparity between these districts
To identify the most- and least-engaged districts, the daily average engagement index (i.e. engagement_index) across products used in each district was calculated. It was assumed that this was a more representative and intuitive measure of student engagement than the pct_access measure provided for each product <span style="color:green;"> (3) </span>, <span style="color:green;"> (4) </span>. The 8 (~10%) districts with the largest daily average engagement index were deemed the most-engaged districts and represented the best-case scenario in terms of student engagement in the USA in 2020. Conversely, the 8 districts with the lowest daily average engagement index were deemed the least-engaged districts and represented the worst-case scenario in terms of student engagement in the USA in 2020 <span style="color:green;"> (5) </span>.
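
A minimal sketch of this ranking step, using a made-up engagement log, is shown here. The toy values and the group size `n` are assumptions for illustration only; the notebook itself uses roughly 10% of the districts in each group.

```python
import pandas as pd

# Toy engagement log: one row per (district, day, product); values are made up
toy = pd.DataFrame({
    'District ID': [1, 1, 1, 1, 2, 2, 2, 2],
    'time': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'] * 2,
    'engagement_index': [100.0, 50.0, 200.0, 50.0, 10.0, 5.0, 20.0, 5.0],
})

# Step 1: total engagement per district per day (sum over products)
daily_total = toy.groupby(['District ID', 'time'])['engagement_index'].sum()
# Step 2: average of those daily totals for each district
daily_avg = daily_total.groupby('District ID').mean()

n = 1  # districts per group (assumption for this toy example)
most = daily_avg.nlargest(n)    # highest daily average engagement
least = daily_avg.nsmallest(n)  # lowest daily average engagement
```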

### 3. Quantify the difference in other metrics for the most and least engaged districts
After quantifying the difference in student engagement between the most- and least-engaged districts, the differences between these district groups in the following metrics were quantified: number of unique learning products used, pp_total_raw (per-pupil total expenditure), pct_black/hispanic (percentage of students that identify as Black or Hispanic), and pct_free/reduced (percentage of students eligible for a free or reduced-price lunch). The Pearson correlation between each district group's engagement index and these other metrics was also calculated and visualized using a correlation heatmap. These analyses were conducted to better understand whether there were other differences between the most- and least-engaged districts that could possibly explain any disparity in student engagement.
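
When correlating per-district metrics that come from different tables, each row should describe a single district. The sketch below, with made-up values and hypothetical column names, shows one way to guarantee this by merging on the district ID before computing the Pearson correlation.

```python
import pandas as pd

# Made-up values; the two tables list the same districts in different orders
engagement = pd.DataFrame({'district_id': [3, 1, 2],
                           'engagement_index': [30.0, 10.0, 20.0]})
products = pd.DataFrame({'district_id': [1, 2, 3],
                         'unique_products': [5, 10, 15]})

# Merging on the ID pairs each district's engagement with its own product count
paired = engagement.merge(products, on='district_id')
r_aligned = paired['engagement_index'].corr(paired['unique_products'])  # Pearson by default

# Pairing by row position instead would mix values from different districts
r_positional = engagement['engagement_index'].corr(products['unique_products'])
```

In this toy case the aligned correlation is perfect, while the positional pairing gives a different (here negative) value, which is why the row order matters.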

### 4. Discussion & Additional Data Source Analysis
Once the provided data were analyzed, including the quantification of the difference in engagement and other metrics between the most- and least-engaged districts, a recommendation was made on how to increase student engagement for the least-engaged districts based on the provided data. Then, two additional data sources from the USDA (unemployment rate and education level) were used to create a holistic view of the areas where the most- and least-engaged districts were located, to identify additional recommendations to increase student engagement. More details on the additional data sources can be found [here](https://www.kaggle.com/valbauman/student-engagement-online-learning-supplement) and, using the 2013 Urban Influence Codes provided in this additional data source, each county was assigned to one of the following locale categories: City, Suburb, Town, Rural <span style="color:green;"> (6) </span>. Trends in the unemployment rate and education level in the areas where the most- and least-engaged districts were located were analyzed and, using these trends, an additional suggestion to increase student engagement in the least-engaged districts was made.

### 5. Summary of Recommendations, Assumptions & Implications
Finally, the recommendations that emerged throughout this notebook were summarized, along with the assumptions made throughout this notebook and their implications.

# Results
### Identify the most and least engaged districts

In [None]:
# for each district, get average engagement across all days where engagement data were recorded to identify the most- and least-engaged districts (top/bottom 10%)
# the district with the greatest daily average engagement is the most engaged district
sum_engage= engage[['District ID', 'time', 'pct_access', 'engagement_index']].groupby(['District ID', 'time']).sum().groupby('District ID').mean()
sum_engage.reset_index(level= ['District ID'], inplace= True)
most_e= sum_engage.sort_values('engagement_index', ascending= False).iloc[:int(0.1*len(sum_engage)),:] 
least_e= sum_engage.sort_values('engagement_index', ascending= True).iloc[:int(0.1*len(sum_engage)),:]

# visualize the greatest disparity between the most-engaged district and least-engaged district
fig, ax= plt.subplots(figsize=(11,3))
ax.set_ylabel('Daily average engagement_index')
ax.set_xlabel('Top/Bottom 10% Engaged Districts')
ax.set_title('Daily average engagement index for most-engaged (green bars) and least-engaged districts (red bars)')
ax.set_xticks([])
ax.set_ylim([0,120000])
pp= ax.bar(np.arange(2*int(0.1*len(sum_engage))) - 0.75/2, (np.array([most_e['engagement_index'].values, least_e['engagement_index'].values]).flatten()).astype(int), 0.75, color= ['green', 'green', 'green', 'green', 'green', 'green', 'green', 'green', 'red','red','red', 'red','red','red', 'red','red'])

for p in pp: 
    height= p.get_height()
    ax.annotate('{}'.format(height),
               xy= (p.get_x() + p.get_width() / 2, height),
               xytext= (0,5),
               textcoords= 'offset points',
               ha= 'center',
               va= 'bottom')

In [None]:
print('Difference in mean daily average engagement_index for the most engaged and least engaged districts:', "\033[1m", str(int(np.mean(most_e['engagement_index'].values) - np.mean(least_e['engagement_index'].values[0]))), "\033[0m")
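# compare only the engagement_index column (flattening the whole dataframe would also mix in District ID and pct_access values)
print('P-value for difference in engagement_index for the most and least engaged districts:', "\033[1m", str(np.round(ttest_ind(least_e['engagement_index'].values, most_e['engagement_index'].values).pvalue, 5)), "\033[0m")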

Visually, there was a difference in daily average engagement index between the most-engaged and least-engaged districts. Under the assumption that each of these groups was representative of its respective population (e.g. the engagement indices for the most-engaged districts shown here were representative of the USA's actual most-engaged district), the greatest disparity between the most- and least-engaged districts across the USA was 80,530, meaning that the most-engaged district had an average of 80,530 more page loads per thousand students each day than the least-engaged district. Also, at a 5% level of significance, there was a significant difference in daily average engagement between the most- and least-engaged districts (p = 0.00238).
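
As a rough illustration of what the significance test computes, the sketch below works through the pooled-variance (Student's) two-sample t statistic by hand, which is the form `ttest_ind` uses by default. The sample values are made up and the helper name `pooled_t` is hypothetical; the p-value itself would still come from `ttest_ind`.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """Return (t statistic, degrees of freedom) for two independent samples,
    assuming equal population variances."""
    n1, n2 = len(a), len(b)
    # Pooled sample variance, weighting each group by its degrees of freedom
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    t_stat = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t_stat, n1 + n2 - 2

# Made-up samples standing in for two groups of districts
t_stat, dof = pooled_t([1, 2, 3], [4, 5, 6])
```

With only 8 districts per group, the statistic is sensitive to individual districts, so the equal-variance assumption is worth keeping in mind when reading the p-value.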

### Quantify the difference in other metrics for the most and least engaged districts

In [None]:
# explore factors that may be related to engagement: - test for significant differences between top/bottom engaged 
# quantify correlation between engagement and student metrics

# test for significant difference between the number of unique products used by the least and most engaged districts
LE_prods= engage.loc[engage['District ID'].isin(least_e['District ID'].unique())][['District ID', 'lp_id']].groupby('District ID').nunique().values
M_prods= engage.loc[engage['District ID'].isin(most_e['District ID'].unique())][['District ID', 'lp_id']].groupby('District ID').nunique().values
test= ttest_ind(LE_prods.flatten(), M_prods.flatten())
# print('Mean +/- standard deviation number of unique products for the least-engaged districts: ', np.mean(LE_prods), ' +/- ', np.std(LE_prods))
# print('Mean +/- standard deviation number of unique products for the most-engaged districts: ', np.mean(M_prods), ' +/- ', np.std(M_prods))
# print('P-value: ', test.pvalue, '\n')

# test for significant difference between the spending on each student for the least and most engaged districts
LE_spend= distr_inf.loc[distr_inf['district_id'].isin(least_e['District ID'])]['pp_total_raw'].values
M_spend= distr_inf.loc[distr_inf['district_id'].isin(most_e['District ID'])]['pp_total_raw'].values
test1= ttest_ind(LE_spend, M_spend)
# print('Mean +/- standard deviation per-pupil total expenditure for the least-engaged districts: ', np.mean(LE_spend), ' +/- ', np.std(LE_spend))
# print('Mean +/- standard deviation per-pupil total expenditure for the most-engaged districts: ', np.mean(M_spend), ' +/- ', np.std(M_spend))
# print('P-value: ', test1.pvalue, '\n')

# test for significant difference between the percentage of Black or Hispanic students for the least and most engaged districts
LE_students= distr_inf.loc[distr_inf['district_id'].isin(least_e['District ID'])]['pct_black/hispanic'].values
M_students= distr_inf.loc[distr_inf['district_id'].isin(most_e['District ID'])]['pct_black/hispanic'].values
test2= ttest_ind(LE_students, M_students)
# print('Mean +/- standard deviation percentage of Black or Hispanic students for the least-engaged districts: ', np.mean(LE_students), ' +/- ', np.std(LE_students))
# print('Mean +/- standard deviation percentage of Black or Hispanic students for the most-engaged districts: ', np.mean(M_students), ' +/- ', np.std(M_students))
# print('P-value: ', test2.pvalue, '\n')

# test for significant difference between the percentage of students eligible for free or reduced-price lunch for the least and most engaged districts
LE_lunch= distr_inf.loc[distr_inf['district_id'].isin(least_e['District ID'])]['pct_free/reduced'].values
M_lunch= distr_inf.loc[distr_inf['district_id'].isin(most_e['District ID'])]['pct_free/reduced'].values
test3= ttest_ind(LE_lunch, M_lunch)
# print('Mean +/- standard deviation percentage of students eligible for free or reduced-price lunch for the least-engaged districts: ', np.mean(LE_lunch), ' +/- ', np.std(LE_lunch))
# print('Mean +/- standard deviation percentage of students eligible for free or reduced-price lunch for the most-engaged districts: ', np.mean(M_lunch), ' +/- ', np.std(M_lunch))
# print('P-value: ', test3.pvalue, '\n')

# illustrate the mean +/- std dev for the most and least engaged districts for the following metrics:
# number of unique products, per-pupil expenditure, % Black or Hispanic students, % students with free or reduced-price lunch
# for each of these metrics, illustrate if there was a significant difference between the most and least engaged districts
plt.subplot(1,4,1)
plt.bar(np.arange(2), [np.mean(LE_prods), np.mean(M_prods)], yerr= [np.std(LE_prods), np.std(M_prods)], color= ['red','green'])
plt.xticks([])
plt.title('Unique \n products')
plt.xlabel('SIGNIFICANT DIFFERENCE \n p = ' + str(np.round(test.pvalue,5)))

plt.subplot(1,4,2)
plt.bar(np.arange(2), [np.mean(LE_spend), np.mean(M_spend)], yerr= [np.std(LE_spend), np.std(M_spend)], color= ['red','green'])
plt.xticks([])
plt.title('Per-pupil \n expenditure')
plt.xlabel('SIGNIFICANT DIFFERENCE \n p = ' + str(np.round(test1.pvalue,5)))

plt.subplot(1,4,3)
plt.bar(np.arange(2), [np.mean(LE_students), np.mean(M_students)], yerr= [np.std(LE_students), np.std(M_students)], color= ['red','green'])
plt.xticks([])
plt.title('% Black \n or Hispanic')

plt.subplot(1,4,4)
plt.bar(np.arange(1), [np.mean(LE_lunch)], yerr= [np.std(LE_lunch)], color= 'red', label= 'Least-engaged')
plt.bar(np.arange(1,2), [np.mean(M_lunch)], yerr= [np.std(M_lunch)], color= 'green', label= 'Most-engaged')
plt.xticks([])
plt.title('% Free or \n reduced-price lunch')
plt.xlabel('SIGNIFICANT DIFFERENCE \n p = ' + str(np.round(test3.pvalue,5)))
plt.legend(bbox_to_anchor= (1.55,1))
plt.subplots_adjust(left=1, right= 3.4)

In [None]:
# create correlation heatmap
plt.figure()
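# build one row per district by merging on district ID, so every metric in a row belongs to the same district
# (the separately computed arrays are ordered differently, so appending them would pair values from different districts)
sel_e= pd.concat([least_e, most_e])[['District ID', 'engagement_index']]
n_prods= engage.groupby('District ID')['lp_id'].nunique().rename('Unique products').reset_index()
dispar_df= sel_e.merge(n_prods, on= 'District ID').merge(distr_inf, left_on= 'District ID', right_on= 'district_id')
dispar_df= dispar_df[['Unique products', 'pp_total_raw', 'pct_black/hispanic', 'pct_free/reduced', 'engagement_index']].astype(float)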
sns.heatmap(dispar_df.corr(), annot= True, cmap ='PiYG')
plt.title('Correlation Heatmap')
plt.show()

From the tests for significant differences illustrated in the four bar charts above, **the most engaged districts used significantly more unique products, had significantly higher per-pupil expenditure, and had a smaller proportion of students eligible for a free or reduced-price lunch relative to the least-engaged districts. There was no significant difference between these groups in terms of percentage of Black and Hispanic students.**

From the correlation heatmap, **there was a strong positive correlation between the number of unique products and per-pupil expenditure and moderately strong positive correlations between the number of unique products and student engagement as well as between per-pupil expenditure and student engagement.** Also from the heatmap, there was a moderately strong negative correlation between the percentage of students eligible for a free or reduced-price lunch and student engagement.

# Discussion & Additional Data Source Analysis
**From the positive correlations in the heatmap, it can be inferred that the more student funding available, the more unique products a district can purchase and use with their students, which can lead to increased student engagement.** These findings are consistent with the fact that the most-engaged districts used significantly more unique products and had significantly higher student funding on a per-pupil basis relative to the least-engaged districts. **Based on this, the least-engaged districts should use more unique products to increase their student engagement.** Also, as suggested by the correlations, per-pupil expenditure should increase so that these districts can access and use more products, which may in turn increase student engagement.

To get a better idea as to what kind of products the least-engaged districts should spend money on to increase engagement, the distribution of all products used by the most-engaged districts and least-engaged districts by product main function can be visualized: 

In [None]:
# visualize distribution of products used by the most- and least-engaged districts by product main function
plt.subplot(1,2,1)
LE_prod_distrib= prod_inf.loc[prod_inf['LP ID'].isin(engage.loc[engage['District ID'] == (least_e['District ID'].iloc[0])]['lp_id'].values)].groupby('Main function').size().plot(kind= 'pie', autopct= '%1.0f%%', ylabel= '')
plt.title('Distribution of product types used by the least-engaged (left) and most-engaged (right) districts', loc= 'left')
plt.subplot(1,2,2)
M_prod_distrib= prod_inf.loc[prod_inf['LP ID'].isin(engage.loc[engage['District ID'] == (most_e['District ID'].iloc[0])]['lp_id'].values)].groupby('Main function').size().plot(kind= 'pie', autopct= '%1.0f%%', ylabel= '')
plt.legend(['CM - Classroom management', 'LC - Learning Curriculum', 'LC/CM/SDO', 'SDO - School & District Operations'], bbox_to_anchor= (1.1,1))

From these pie charts, the distribution of product types was nearly the same for the most- and least-engaged districts. Based on this, the least-engaged districts should increase their number of products without drastically changing the distribution of product types. For example, if the least-engaged districts decide to use 100 new products, ~80 of them should have a main function of "learning curriculum", ~8 should have a main function of "classroom management", ~6 should have a main function of "school & district operations" and ~6 should have a main function of "learning curriculum/classroom management/school & district operations". Overall, from the results of the hypothesis tests and correlation heatmap analysis on the provided dataset, there was a difference in student engagement and other factors between the most- and least-engaged districts. **To help close the gap in student engagement between the most- and least-engaged districts, the least-engaged districts should increase the number of products they use without changing the distribution of product main functions.**
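
The arithmetic behind the 100-product example can be sketched as a proportional split. The helper name `allocate` is hypothetical, the shares are the approximate percentages quoted above, and largest-remainder rounding is just one reasonable choice for handling fractional counts.

```python
def allocate(n_new, shares):
    """Split n_new products across categories in proportion to shares,
    using largest-remainder rounding so the counts sum to n_new."""
    total = sum(shares.values())
    exact = {k: n_new * v / total for k, v in shares.items()}
    counts = {k: int(x) for k, x in exact.items()}
    leftover = n_new - sum(counts.values())
    # Hand out the remaining products to the largest fractional parts
    by_remainder = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts

# Approximate shares of product main functions, taken from the text above
shares = {'LC': 80, 'CM': 8, 'SDO': 6, 'LC/CM/SDO': 6}
plan_100 = allocate(100, shares)
plan_25 = allocate(25, shares)
```

For a smaller purchase such as 25 products, the same function keeps the split close to the original distribution while still returning whole numbers.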

### Analysis with Additional Data Sources

Since the state and locale category were known for each district, the average unemployment rate and education level for each state's locale were used to quantify the mean unemployment rate and education level for the areas where the most- and least-engaged districts were located. For example, as shown in the dataframe below, the state-locale combinations for the least-engaged districts were Utah-Suburb, Illinois-Town, Washington-Suburb, North Carolina-City, Indiana-Suburb, Utah-Town, Utah-City. Knowing this, and knowing the unemployment rate and education level for each of these state-locale combinations, the average unemployment rate and education level across all state-locale combinations where the least-engaged districts were located was calculated. From here, trends in unemployment rate and education level over time were visualized and analyzed for the areas where the most- and least-engaged districts were located, as shown in the plots below.
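
The state-locale lookup described above can be sketched with a groupby followed by a merge. All values and column names in this toy example are made up for illustration; the real files use their own column names.

```python
import pandas as pd

# Made-up county-level table: one row per county
counties = pd.DataFrame({
    'state': ['Utah', 'Utah', 'Utah', 'Illinois'],
    'locale': ['Suburb', 'Suburb', 'City', 'Town'],
    'unemployment_rate': [4.0, 6.0, 5.0, 8.0],
})
# Made-up district table: one row per district in the group of interest
districts = pd.DataFrame({
    'district_id': [101, 102],
    'state': ['Utah', 'Illinois'],
    'locale': ['Suburb', 'Town'],
})

# Mean rate for every state-locale combination
area_means = counties.groupby(['state', 'locale'], as_index=False).mean(numeric_only=True)
# Attach each district's state-locale mean, then average across the districts
per_district = districts.merge(area_means, on=['state', 'locale'], how='left')
group_mean = per_district['unemployment_rate'].mean()
```

Using `how='left'` keeps districts whose state-locale combination has no county data, which shows up as a missing value rather than a silently dropped row.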

In [None]:
print('Dataframe with least engaged districts:')
LE_locale= distr_inf.loc[distr_inf['district_id'].isin(least_e['District ID'])][['district_id', 'state', 'locale']]
M_locale= distr_inf.loc[distr_inf['district_id'].isin(most_e['District ID'])][['district_id', 'state', 'locale']]
LE_locale

In [None]:
# load unemployment rate and education data. Add column with full state name, since full state name is what's provided in the provided datasets
unemploy= pd.read_csv("../input/student-engagement-online-learning-supplement/unemployment.csv")
education= pd.read_csv("../input/student-engagement-online-learning-supplement/education.csv")

us_state_full = {
    'AL': 'Alabama',
    'AK': 'Alaska',
    'AS': 'American Samoa',
    'AZ': 'Arizona',
    'AR': 'Arkansas',
    'CA': 'California',
    'CO': 'Colorado',
    'CT': 'Connecticut',
    'DE': 'Delaware',
    'DC': 'District Of Columbia',
    'FL': 'Florida',
    'GA': 'Georgia',
    'GU': 'Guam',
    'HI': 'Hawaii',
    'ID': 'Idaho',
    'IL': 'Illinois',
    'IN': 'Indiana',
    'IA': 'Iowa',
    'KS': 'Kansas',
    'KY': 'Kentucky',
    'LA': 'Louisiana',
    'ME': 'Maine',
    'MD': 'Maryland',
    'MA': 'Massachusetts',
    'MI': 'Michigan',
    'MN': 'Minnesota',
    'MS': 'Mississippi',
    'MO': 'Missouri',
    'MT': 'Montana',
    'NE': 'Nebraska',
    'NV': 'Nevada',
    'NH': 'New Hampshire',
    'NJ': 'New Jersey',
    'NM': 'New Mexico',
    'NY': 'New York',
    'NC': 'North Carolina',
    'ND': 'North Dakota',
    'MP': 'Northern Mariana Islands',
    'OH': 'Ohio',
    'OK': 'Oklahoma',
    'OR': 'Oregon',
    'PA': 'Pennsylvania',
    'PR': 'Puerto Rico',
    'RI': 'Rhode Island',
    'SC': 'South Carolina',
    'SD': 'South Dakota',
    'TN': 'Tennessee',
    'TX': 'Texas',
    'UT': 'Utah',
    'VT': 'Vermont',
    'VI': 'Virgin Islands',
    'VA': 'Virginia',
    'WA': 'Washington',
    'WV': 'West Virginia',
    'WI': 'Wisconsin',
    'WY': 'Wyoming'
}

unemploy['state full'] = unemploy['State'].replace(us_state_full)
education['state full'] = education['State'].replace(us_state_full)

# for each of the state-locale combinations for the least engaged districts (8 total), 
# get the mean unemployment rate and % of adults with 4 years of college or bachelor's degrees
# then take the mean to get an average unemployment rate and education level for the locations where the least engaged districts reside
# repeat for the most engaged districts, to allow for comparisons between the two groups
def area_means(df, locale_col, districts):
    # for each district, average every numeric column over the rows of df that match the district's state and locale
    # numeric_only= True skips text columns such as state and area names
    rows= [df.loc[(df[locale_col] == loc) & (df['state full'] == st)].mean(numeric_only= True)
           for st, loc in zip(districts['state'].values, districts['locale'].values)]
    return pd.DataFrame(rows).reset_index(drop= True)

# one row per district, for the least- and most-engaged groups
LE_unemploy= area_means(unemploy, 'City/Suburb/Town/Rural', LE_locale)
M_unemploy= area_means(unemploy, 'City/Suburb/Town/Rural', M_locale)
LE_edu= area_means(education, 'City/Suburb/Town/Rural 2013', LE_locale)
M_edu= area_means(education, 'City/Suburb/Town/Rural 2013', M_locale)
    
# keep only the yearly unemployment-rate columns (2000-2020), then average across the districts' areas
rate_cols= [f'Unemployment_rate_{yr}' for yr in range(2000, 2021)]
LE_unemploy= LE_unemploy[rate_cols].mean()
M_unemploy= M_unemploy[rate_cols].mean()

LE_edu= LE_edu[["Percent of adults completing four years of college or higher, 1970", "Percent of adults completing four years of college or higher, 1980", 
                "Percent of adults with a bachelor's degree or higher, 1990", "Percent of adults with a bachelor's degree or higher, 2000", 
                "Percent of adults with a bachelor's degree or higher, 2015-19"]].mean()

M_edu= M_edu[["Percent of adults completing four years of college or higher, 1970", "Percent of adults completing four years of college or higher, 1980", 
                "Percent of adults with a bachelor's degree or higher, 1990", "Percent of adults with a bachelor's degree or higher, 2000", 
                "Percent of adults with a bachelor's degree or higher, 2015-19"]].mean()


# create line chart of unemployment rate trend from 2000-2020 for mean unemployment rates from the areas where the most and least engaged districts are located
plt.figure(figsize= (12,5))
plt.plot(range(2000,2021),LE_unemploy, color= 'red', label= 'Areas of LEAST engaged districts')
plt.plot(range(2000,2021), M_unemploy, color= 'green', label= 'Areas of MOST engaged districts')
plt.ylabel('Unemployment rate (%)')
plt.xlabel('Year')
plt.xticks(range(2000,2021))
plt.legend(loc= 'upper left')
plt.title('Trends in mean adult unemployment rate for areas where the most engaged and least engaged districts are located')
plt.show()

# create line chart of mean % of adults with bachelor's degrees or higher for the areas where the most and least engaged districts are located
plt.figure(figsize= (12,5))
plot_time= np.array([1970, 1980, 1990, 2000, 2019])
plt.plot(plot_time,LE_edu, color= 'red', label= 'Areas of LEAST engaged districts')
plt.plot(plot_time, M_edu, color= 'green', label= 'Areas of MOST engaged districts')
plt.ylabel("Adults with a bachelor's degree or higher (%)")
plt.xlabel('Year')
plt.legend(loc= 'upper left')
plt.title("Trends in mean % of adults with a bachelor's degree or more for areas where the most engaged and least engaged districts are located")
plt.show()

From these line charts, the trends in education and unemployment rate over time were similar between the areas of the most- and least-engaged districts. For example, for both areas, the percentage of adults with bachelor's degrees increased over time and the unemployment rates fluctuated similarly over time. **At the time closest to the COVID-19 pandemic (year = 2019/2020), the areas where the least-engaged districts were located had a greater mean proportion of adults with bachelor's degrees and a slightly smaller unemployment rate.** In light of this, a possible explanation for the reduced student engagement in these areas is that, relative to the areas where the most-engaged districts are located, more parents could have been educated and working full-time hours, so they did not have the time to help or encourage their children with their online learning activities. In contrast, the areas of the most-engaged districts had a slightly lower mean proportion of adults with bachelor's degrees or higher and a slightly larger mean unemployment rate, which could suggest that not as many parents were working full-time, so they had the opportunity to help their children with their online learning activities. **Based on this, another suggestion to help increase student engagement for the least-engaged districts is for these districts to organize and run study groups for student peers to work side-by-side in an in-person or virtual environment. By doing this, students can feel encouraged and motivated to participate in their online learning activities by being surrounded by peers doing the same thing, and they can ask each other for help instead of feeling unmotivated without much support when working alone at home with busy working parents.**

# Summary of Recommendations
In light of the analysis of the provided dataset and additional data sources, the following recommendations are intended for the least-engaged districts, to increase their student engagement (i.e. daily average engagement_index), but are applicable to all districts:
1. Increase per-pupil expenditure so that more unique products can be purchased and used, with the aim of increasing student engagement
2. As a follow-up to recommendation 1, increase the number of unique products in such a way that the distribution of product main function remains unchanged
3. The least-engaged districts should organize study groups to give students an opportunity to work side-by-side (in-person or virtually) on their online learning tasks, to help ensure students remain motivated to learn and can receive support from their peers that would otherwise be unavailable at home

The analysis conducted that led to these recommendations relied on assumptions related to the data, indicated by the numbered sentences throughout this notebook. These assumptions and their implications are outlined in the final section of this notebook:

## Assumptions & Implications

*Assumption 1:* Converting each range value in the district data to the float at the middle of that range assumed that this midpoint was representative of that district's particular metric.

*Implication:* The middle value of each range may not have been representative for each district, so any significant differences observed between the most- and least-engaged districts may not have been significant in reality or there may have been significant differences that were missed in this analysis as a result of using the middle range values.
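
As a rough illustration of Assumption 1, the sketch below converts a range to its midpoint and also reports how far the true value could lie from that midpoint. It assumes the ranges are stored as strings such as `[0.2, 0.4[`; the function name `range_midpoint` is hypothetical and is not taken from the notebook's own code.

```python
def range_midpoint(range_str):
    """Return (midpoint, half_width) for a range string such as '[0.2, 0.4['.

    Returns None for a missing value (None or an empty string).
    The string format is an assumption about how the district data is stored.
    """
    if range_str is None or not range_str.strip():
        return None
    # Drop the surrounding brackets and spaces, then split on the comma.
    lower, upper = (float(part) for part in range_str.strip("[] ").split(","))
    midpoint = (lower + upper) / 2
    half_width = (upper - lower) / 2  # largest possible gap to the true value
    return midpoint, half_width


print(range_midpoint("[0.2, 0.4["))
print(range_midpoint("[14000, 16000["))
```

The half-width makes the implication concrete: for an expenditure range that is 2,000 wide, a district's true value could differ from the midpoint by up to 1,000.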

*Assumption 2:* The most- and least-engaged districts were among the remaining 88 districts in 14 states after data cleaning.

*Implication:* The disparity in student engagement, as measured by the daily average engagement_index for each district, was very large (~80,000) between the most- and least-engaged districts identified. However, because data for all districts in all states was not available, it is not certain that the most- and least-engaged districts in the USA were represented in the provided dataset. This means that the calculated disparity could be an underestimate (i.e. the true disparity could be much larger than what was reported in this notebook).

*Assumption 3:* Daily average engagement_index was a more representative measure of engagement compared to daily average pct_access.

*Implication:* pct_access considers individual students, giving the percentage of students with access to a particular product, whereas engagement_index considers a group of 1,000 students. However, because pct_access is a percentage for each product on any given day, calculating the daily average pct_access for a district could have resulted in a value greater than 100%, which is not intuitive for a reader. For this reason, and because pct_access and engagement_index were likely highly correlated anyway, daily average engagement_index rather than pct_access was used to quantify student engagement.
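
The way a district-level daily average pct_access can exceed 100% is easiest to see with a toy calculation. The sketch below assumes the district's daily figure is obtained by summing over products before averaging across days, which is one reading of the text rather than a confirmed description of the notebook's aggregation. The records and product names are made up for illustration.

```python
from collections import defaultdict

# Toy records with made-up values: (date, product, pct_access)
records = [
    ("2020-03-02", "product_a", 60.0),
    ("2020-03-02", "product_b", 55.0),
    ("2020-03-03", "product_a", 70.0),
    ("2020-03-03", "product_b", 45.0),
]


def mean_daily_total(rows):
    """Sum the metric over products for each day, then average across days."""
    totals = defaultdict(float)
    for day, _product, value in rows:
        totals[day] += value
    return sum(totals.values()) / len(totals)


daily_avg_pct_access = mean_daily_total(records)
print(daily_avg_pct_access)  # each day sums to 115.0, so the average is 115.0
```

Each product's percentage is valid on its own, but the per-day total across products is no longer a percentage of anything.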

*Assumption 4:* When calculating the daily average engagement_index for all districts that remained after data cleaning, the number of days with available data varied from district to district, with the smallest number of days being 59. For districts with fewer than 366 days of recorded engagement, it was assumed that data from the available days were representative of the entire year.

*Implication:* The data available for fewer days than the number of days in the year 2020 may not be representative of the entire year, so some districts may be misrepresented in this analysis.
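
To give a sense of scale for Assumption 4, the following sketch computes the share of 2020 covered by a district's recorded days. The helper `coverage` and the example dates are hypothetical; only the figures 59 and 366 come from the text above.

```python
from datetime import date, timedelta

DAYS_IN_2020 = 366  # 2020 was a leap year


def coverage(observed_dates):
    """Fraction of 2020 covered by the distinct dates with engagement data."""
    return len(set(observed_dates)) / DAYS_IN_2020


# A hypothetical district with 59 distinct days of data, the minimum noted above.
sparse_district = [date(2020, 1, 1) + timedelta(days=i) for i in range(59)]
sparse_coverage = coverage(sparse_district)  # 59 / 366, roughly 16% of the year
print(f"{sparse_coverage:.1%}")
```

A district at the minimum contributes data for only about one sixth of the year, so its daily average may reflect a single season or term rather than the whole of 2020.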

*Assumption 5:* The top 10% most-engaged districts and bottom 10% least-engaged districts were a representative random sample that had normally distributed data.

*Implication:* If these assumptions were violated, the calculated correlations and the tests for significant differences in engagement, count of unique products, percentage of Black and Hispanic students, percentage of students eligible for a free or reduced-price lunch, and per-pupil expenditure between the most- and least-engaged districts may be invalid.
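
One way to check how much the results depend on the normality part of Assumption 5 is to compare them against a test that does not require it. The sketch below is a two-sided permutation test on the difference in group means. It is a swapped-in alternative offered for illustration, not the test used in this notebook, and the function name and sample values are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Repeatedly shuffles the pooled values, splits them into groups of the
    original sizes, and counts how often the shuffled difference in means
    is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (extreme + 1) / (n_permutations + 1)


# Made-up values standing in for a metric in the two groups of districts.
most_engaged = [10, 11, 12, 13, 14]
least_engaged = [1, 2, 3, 4, 5]
print(permutation_p_value(most_engaged, least_engaged))
```

If a test like this agreed with the original significance tests, the conclusions would be less sensitive to the normality assumption. It does not address whether the districts are a representative random sample.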

*Assumption 6:* For the additional data sources, using the text description of each of the 2013 Urban Influence Codes and descriptions of the locales "City", "Suburb", "Town" and "Rural", each county was assigned one of these locales. This was done to ensure the additional data sources could have common fields (i.e. state and locale) with the provided dataset, to ensure the datasets could be joined. Since district county was not a provided field in the provided datasets, all unemployment rate and education level metrics were based on the average of the averages of all state-locale combinations for the least-engaged districts (same idea applied to the most-engaged districts).
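
The "average of the averages" described in Assumption 6 can be sketched as follows. The county rows, their values, and the state-locale pairs are made up. Weighting each district's state-locale pair equally is an assumption about how the averaging was done, not a confirmed detail of the notebook.

```python
from statistics import mean

# Made-up county-level rows: (state, locale, unemployment_rate_pct)
county_rows = [
    ("Utah", "Suburb", 3.0),
    ("Utah", "Suburb", 5.0),
    ("Illinois", "City", 4.0),
    ("Illinois", "City", 6.0),
    ("Illinois", "Rural", 9.0),
]

# Hypothetical (state, locale) pairs for the districts in one engagement group.
group_pairs = [("Utah", "Suburb"), ("Illinois", "City")]


def mean_by_state_locale(rows):
    """Average the metric over all counties sharing a (state, locale) pair."""
    grouped = {}
    for state, locale, value in rows:
        grouped.setdefault((state, locale), []).append(value)
    return {key: mean(values) for key, values in grouped.items()}


def group_average(rows, pairs):
    """Average the state-locale means over the pairs of a group of districts."""
    per_pair = mean_by_state_locale(rows)
    return mean(per_pair[pair] for pair in pairs)


print(group_average(county_rows, group_pairs))  # mean of 4.0 and 5.0
```

Because a district is matched only on state and locale, every county in that state with the same locale contributes to its figure, including counties the district is not actually in.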

*Implication:* Assignment of each Urban Influence Code to a locale was subjective, since it was done by a single person who read the text description of each Urban Influence Code and assigned it to one of the locale categories. This means that some counties could have been misclassified as belonging to a certain locale (e.g. City) when, in reality, they were part of a different locale (e.g. Suburb). This inaccurate assignment of locales could result in the areas where the most- and least-engaged districts are located being misrepresented and incorrect conclusions being made based on unemployment rate and education level.

It is also worth recognizing that the conclusions that emerged from this additional data source were somewhat contradictory to the analysis conducted using the provided dataset. From the tests for significant differences in the provided dataset, a greater percentage of students were eligible for a free or reduced-price lunch in the least-engaged districts relative to the most-engaged districts, suggesting that families with children who attended schools in the least-engaged districts were of lower economic status than families with children who attended schools in the most-engaged districts. However, the analysis of the additional data source found that a greater percentage of adults in the areas where the least-engaged districts were located were employed and had bachelor's degrees, suggesting that families in these areas would have sufficient income and that a smaller percentage of students would be eligible for a free or reduced-price lunch relative to the areas where the most-engaged districts are located. Despite this, the recommendation for the least-engaged districts to organize study groups where students can work on online learning activities side-by-side is still a valid suggestion to increase student engagement, potentially without drastically increasing costs.