In [None]:
# import libraries
import pandas as pd
import numpy as np
import datetime
import zipfile

# visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# waffle charts
from pywaffle import Waffle

# map visualization
import folium
from folium import plugins

# import files
import os

In [None]:
# read input datasets

#read .csv files
districts = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv') # information about school districts
products = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')   # information about the top 372 products with most users in 2020

#create list of all files in 'engagement_data' folder
files = next(os.walk('../input/learnplatform-covid19-impact-on-digital-learning/engagement_data'))[2]

engagement_list = []

for file in files:
    district_id = file.split(".")[0]
    file = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data/'+ str(file)
    df = pd.read_csv(file)
    df["district_id"] = district_id
    engagement_list.append(df)
    
engagement = pd.concat(engagement_list)  # information about page load events
engagement = engagement.reset_index(drop=True)


# Table of contents

**Introduction**

**1 Datasets pre-processing**

     1.1 Engagement dataset
     1.2 Districts dataset
     1.3 Products dataset
     
**2 Top digital learning products in 2020**

     2.1 Users of the digital learning products
     2.2 Functions of the digital learning products
     
**3 What is the picture of engagement in digital learning in 2020?**

     3.1 Engagement in digital learning products for each Users' category
     3.2 Engagement in digital learning products with various functionality
     
**4 City, Town, Suburb, Rural**

**5 Additional information about the states**

**6 Engagement in digital learning products vs COVID-19 pandemic**

**7 Influence of social / political / financial aspects on online engagement**

     7.1 Social context
     7.2 Financial context

**Personal view**

**Data**



# Introduction

Much has already been said about digital learning and the influence of the global pandemic on education, and hundreds of articles and research papers will be written in the next few years to evaluate the effect of the COVID-19 pandemic on education and other social institutions worldwide.

So I will proceed with the analysis straight away rather than bore you with a long introduction.
Let's see what the state of digital learning in the USA was in the extraordinary and frightening year of 2020.


# 1 Datasets pre-processing

### 1.1 Engagement dataset


In [None]:
# missing values

def print_missing_values(df):
    # function calculates and displays amount of missing values in the dataset 
    # df - dataframe to analyse
    
    print('Percentage of missing values per column before pre-processing:\n')
    for column in df.columns:
        print('{} - {}%'.format(column,
                                round((df[column].isna().sum() / len(df)) * 100, 3)))
        
print_missing_values(engagement)   

**Missing values:**
- 'engagement_index' is NaN when 'pct_access' is 0 or NaN; in the first case the NaN values can be replaced with 0;
- rows with missing 'lp_id' or 'pct_access' (and therefore no usable 'engagement_index') are useless for the following analysis, so they can be dropped.

In [None]:
# engagement_index = NaN when pct_access = 0, so, I replace NaN values with 0.
engagement.loc[engagement.pct_access == 0, 'engagement_index'] = engagement.loc[
                                                                engagement.pct_access == 0, 'engagement_index'].fillna(0)

# drop lines with missing values in 'pct_access', 'lp_id' columns
# After this step there are no more missing values in the dataframe.
engagement.dropna(subset=['pct_access', 'lp_id'], inplace = True)


#change column types to int and date format
engagement['lp_id'] = engagement['lp_id'].astype(int)
engagement['district_id'] = engagement['district_id'].astype(int)
engagement['time'] = pd.to_datetime(engagement['time'], format = '%Y-%m-%d')

### 1.2 Districts dataset

In [None]:
# missing values
print_missing_values(districts)   

Due to anonymization procedures, the dataset contains many missing values, which limits the analysis. Nevertheless, some information gaps can be filled based on the state where the school district is located.

**Missing values:**
- 'state' - only districts with a known state are valuable for the analysis; rows with a missing 'state' are empty in the other columns too, so rows where 'state' is NaN should be dropped;
- 'county_connections_ratio' - a non-informative column: about 70% of its values are non-missing, and 99% of those (161 out of 162) fall into the same range, '0.18 - 1', so the column cannot be used to analyze the impact of the connections ratio on the learning process.
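
A minimal sketch of how such a near-constant column can be spotted. The series below is a made-up stand-in that mirrors the 161-versus-1 split described above; the number of missing rows (71) is only illustrative.

```python
import pandas as pd

# Toy stand-in for 'county_connections_ratio'; None marks anonymized rows.
# The 71 missing rows are an assumption chosen to give roughly 70% coverage.
ratio = pd.Series(['[0.18, 1['] * 161 + ['[1, 2['] + [None] * 71)

# share of rows that carry any value at all
non_missing_share = ratio.notna().mean()

# share of the dominant category among the non-missing rows
# (value_counts ignores NaN by default)
top_share = ratio.value_counts(normalize=True).iloc[0]

print(f'non-missing: {non_missing_share:.1%}, dominant category: {top_share:.1%}')
```

A column where one category covers almost all non-missing rows carries next to no signal, which is why it is dropped here.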

The **'pp_total_raw'** column deserves special attention. This information was taken from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The column contains the median per-pupil expenditure of a given school district or, more specifically, a range around that median.

Even though I could not find the missing expenditure values for the specific districts, I used the same source, [Edunomics Lab](https://edunomicslab.org/nerds/), to download per-state files with the total expenditure of every school in a state and calculated the median value for each state with missing data. The only state with no available information is New Hampshire.
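
As a rough sketch of the idea, the helper below turns a list of per-school expenditures into a range string around the rounded median. The name `median_range`, its `half_width` parameter and the sample numbers are hypothetical; the ±1000 width and the removal of non-positive values follow the comments in the code cell further down.

```python
import pandas as pd

def median_range(values, half_width=1000):
    """Return 'low - high' around the median, rounded to the nearest thousand."""
    values = pd.Series(values, dtype=float)
    values = values[values > 0]             # negative amounts are treated as errors
    median = round(values.median(), -3)     # round to the nearest thousand
    return f'{int(median - half_width)} - {int(median + half_width)}'

# made-up per-school expenditures, including one erroneous negative value
print(median_range([9800, 12400, -500, 15100, 11900]))
```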

**Interpretation**:
'pct_black/hispanic', 'pct_free/reduced' and 'pp_total_raw' are columns with categorical information (value ranges). For easier interpretation I cleaned up their notation.



In [None]:
# only districts with information about state can be valuable for analysis,
# rows with missing 'state' are empty in other columns too.
districts = districts.loc[districts.state.notna()].copy()   # .copy() avoids SettingWithCopyWarning in the steps below

# drop non informative column: 162 non-null values out of which 161 - '0.18, 1', and 1 - '1, 2'
districts.drop(columns = 'county_connections_ratio', inplace = True)

# replace comma ',' with hyphen '-' for easier interpretation 
# drop punctuation that makes information less interpretable

for column in districts.columns[-3:]:
    districts[column] = districts[column].str.replace(', ', ' - ', regex=False)
    districts[column] = districts[column].str.replace('[', '', regex=False)   # '[' is a literal bracket, not a regex

In [None]:
# fill in per-pupil total expenditure NaN values using Edunomics Lab website: https://edunomicslab.org/nerds/

#identify states with missing information about per-pupil total expenditure
states_missing_exp = list(districts.loc[districts['pp_total_raw'].isna() == True]['state'].unique())

# delete New Hampshire from the list because there is no available information about this state.
states_missing_exp.remove('New Hampshire')

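# state abbreviations, looked up by name so the pairing does not depend on the order returned by unique()
state_abbreviations = {'Connecticut': 'CT', 'Ohio': 'OH', 'California': 'CA',
                       'Arizona': 'AZ', 'North Dakota': 'ND', 'New York': 'NY'}
states_abb = [(state, state_abbreviations[state])
              for state in states_missing_exp if state in state_abbreviations]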

# as I don't have access to information about district locale status (city, suburb etc),
# I will use median total expenditure per state and the range of 2000$ around this median value.

for state, abb in states_abb:
    state_name = state
    file = 'https://raw.githubusercontent.com/ElinaAizenberg/Kaggle-competition/main/'+ abb + '_1819_final_August_1st_21.csv'
    state = pd.read_csv(file, sep = ';', 
                        on_bad_lines='skip',   # replaces error_bad_lines=False, which was removed in pandas 2.0
                        encoding= 'unicode_escape')
    
    # files with financial information per state are standardized and column names are the same + state abbreviation
    column_name = 'pp_total_raw_' + abb                  # the same column name as in the given dataset
    state = state[[column_name]].dropna()
    state = state[[column_name]].replace(',','.', regex=True).astype(float)
    
    # there are negative values in the columns; the documentation notes that these values are mistakes.
    state = state.loc[state[column_name] > 0]            # keep only the rows with positive values
    median = round(state[column_name].median(), -3)      # scalar median, rounded to the nearest thousand
        
    median_left = int(median - 1000)    # start of the range
    median_right = int(median + 1000)   # end of the range
    
    median_range = str(median_left) + ' - ' + str(median_right)  # range to fill in cells with missing information
    
    districts.loc[districts.state == state_name, 'pp_total_raw'] = districts.loc[
                                                                   districts.state == state_name,
                                                                   'pp_total_raw'].fillna(median_range)

### 1.3 Products dataset

In [None]:
# missing values
print_missing_values(products)   

**Missing values:**
'Sector(s)' and 'Primary Essential Function' - I decided to fill in missing values with the most common sector (and function) among the products of the same company. Usually a product with missing sector or function details is a version or a part of a product group that is already in the dataset. When a company has only 1 product and there is no information about its sector or function, I use the following values:
- Sector = 'Higher Ed; Corporate';
- Primary Essential Function = 'LC/CM/SDO - Other'.

**New features:**
- split 'Primary Essential Function' into 2 features: 'Function(abb)' and 'Function(details)'. 'Function(abb)' has 4 possible categories, while 'Function(details)' has 18. The split provides 2 levels for analysing the usage of DL products and makes the analysis easier;
- the 'Users' feature is derived from 'Sector(s)'. It adds no new information but makes the 'Sector(s)' column easier to interpret by naming the groups of users each product targets.
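
A small sketch of the split on a toy frame (the three values are only illustrative). It uses `str.split` with `n=1` and `expand=True`, which is an alternative to calling `str.split` twice; note that it keeps everything after the first ' - ' in the details column, so it is not guaranteed to produce exactly the same categories as the cell below.

```python
import pandas as pd

toy = pd.DataFrame({'Primary Essential Function': [
    'LC - Digital Learning Platforms',
    'CM - Classroom Engagement & Instruction - Classroom Management',
    'LC/CM/SDO - Other',
]})

# n=1 splits only on the first ' - ', so details that themselves contain ' - ' stay intact
parts = toy['Primary Essential Function'].str.split(' - ', n=1, expand=True)
toy['Function(abb)'] = parts[0]
toy['Function(details)'] = parts[1]

print(toy[['Function(abb)', 'Function(details)']])
```

Whether the longer details string is desirable depends on how fine-grained the second level of the analysis should be.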



In [None]:
def fill_missing_sector_function(df, company):
    # function identifies the most frequent sector/function of products of a given company
    # and fills in missing values of 'Sector(s)', 'Primary Essential Function' columns filtered for this company
    # if there is only one product of a given company, then Sector = 'Higher Ed; Corporate',
    # Primary Essential Function = 'LC/CM/SDO - Other'.
    
    # df - dataframe with information about products
    # company - given company with missing values in 'Sector(s)' column
    
    df_company = df.loc[df['Provider/Company Name'] == company]

    if len(df_company) > 1:
        sector_value = df.loc[df['Provider/Company Name'] == company, 'Sector(s)'].mode()[0]
        function_value = df.loc[df['Provider/Company Name'] == company, 'Primary Essential Function'].mode()[0]
    
    else:
        sector_value = 'Higher Ed; Corporate'
        function_value = 'LC/CM/SDO - Other'
        
    df.loc[df['Provider/Company Name'] == company, 'Sector(s)'] = df.loc[
                                                                df['Provider/Company Name'] == company, 'Sector(s)'
                                                                ].fillna(sector_value)

    df.loc[df['Provider/Company Name'] == company, 'Primary Essential Function'] = df.loc[
                                                            df['Provider/Company Name'] == company, 'Primary Essential Function'
                                                            ].fillna(function_value)
    
    
# create a list of companies with missing info about their products
company_list = list(
                products.loc[products['Sector(s)'].isna() == True, 'Provider/Company Name']
                .dropna().unique())

# apply function to fill in missing values
for company in company_list:
    fill_missing_sector_function(products, company)

In [None]:
# instead of sectors I find that users' category is more interpretable and create a new feature based on Sector(s).
sectors = {'PreK-12':'Children under 18 y.o.',
           'PreK-12; Higher Ed': 'Students, all ages',
           'PreK-12; Higher Ed; Corporate': 'Universal products',
           'Higher Ed; Corporate':'Uni.students and corporate users',
            'Corporate': 'Corporate users'}

# create new feature
products['Users'] = products['Sector(s)'].copy()
products.replace({"Users": sectors}, inplace = True)

# split Primary Essential Function column on 2 columns: Function(abb) + Function(details)
products['Function(abb)'] = products['Primary Essential Function'].str.split(' - ').str[0]
products['Function(details)'] = products['Primary Essential Function'].str.split(' - ').str[1]

# drop Primary Essential Function, because it duplicates information
products.drop(columns = 'Primary Essential Function', inplace = True)

# in Function(details) column there are 2 same categories: 'Sites, Resources & Reference' and
# 'Sites, Resources & References'. They should be merged into 1 category.
products.replace({"Function(details)":
                 {'Sites, Resources & Reference':'Sites, Resources & References'}},
                 inplace = True)


# 2 Top digital learning products in 2020

This section gives an overview of the Digital Learning products with the most users in 2020. First of all, the state of the market should be analyzed. I wondered **who the main users of digital learning products are and whether there is any free niche for those who wish to enter this growing market in the near future.**

### 2.1 Users of the digital learning products

Based on my personal experience with various learning platforms and educational sources, I had the impression that the main users of the rapidly growing digital education market are working specialists who wish to change careers and college/university students who wish to expand the knowledge and skills they receive in formal education. To my surprise, the analysis showed a completely different picture.


In [None]:
# parameters for visualizations
# fonts
font_title = 18
font_text = 14
font_labels = 12
font_ticks = 12
font_legend = 10
fontname = 'Times New Roman'

#colors
colors_group_1 =  ['darkcyan', 'limegreen', "gold"]
colors_group_2 =  ["#782B9D", "#EA4F88", "#F98477"]

cmap_1 = 'viridis'
cmap_2 = 'plasma'
waffle_cmap = 'Set1'

#labels
function_abb_labels = {'LC':'Learning & Curriculum',
                       'CM':'Classroom Management',
                       'SDO':'School & District Operations',
                       'LC/CM/SDO':'Universal products'}

In [None]:
# waffle chart to identify the distribution of products' sectors or users categories
df_users = products.Users.value_counts()

fig_0 = plt.figure(
    FigureClass = Waffle, 
    rows = 10,
    values = list(df_users.values),
    interval_ratio_x = 0.5,
    interval_ratio_y = 0.6,
    figsize = (12, 7),
    icons = 'file-alt',
    title = {
        'label': "Products' users",
        'loc': 'left',
        'fontdict': {
            'fontsize': font_title,
            'fontname': fontname
        }},
    labels = [f"{k} ({int(v / sum(df_users.values) * 100)}%)" for k, v in df_users.items()],
    legend = {
        
        'loc': 'lower left',
        'bbox_to_anchor': (0, -0.35),
        'ncol': 3,
        'framealpha': 0,
        'fontsize': font_labels
    },
    vertical = True,
    cmap_name = waffle_cmap
)

**Key insights:**
- almost **half of all products are designed for children under 18 years old** (kindergarten, primary and secondary education);
- specifically corporate users or working specialists not enrolled in universities are almost neglected: less than 1% of all products;
- almost **1/3 of digital learning products can be used universally by users of all ages**;
- university students experience little support from providers of digital learning products: 17% of products can be used by students of all ages including both kindergarten and universities.


There are 3 categories of users or products' sectors: PreK-12 (school students), Higher education (college/university students) and corporate users. I wondered **how these categories intersect and who the major clients of digital education are**.

The chart below shows that school students can use almost all digital learning products from the provided list and are apparently seen as the major group of users of digital learning, while older users have half as many options.

In [None]:
# bar chart that displays how products intersect according to their sectors

# count how many products per each category, including mixed sectors
df_sectors = products['Sector(s)'].value_counts()

# create dictionary to count how many times each sector is mentioned in the 'Sector(s)' column
sectors = {'PreK-12':0, 'Higher Ed':0, 'Corporate':0}

for key, value in sectors.items():
    for index, count in df_sectors.items():
        if key in index:
            sectors[key] += count

# get list of values from the dictionary            
stacked_sectors = list(sectors.values())

#labels for the chart
sectors_names = ['PreK-12 (school students)', 'Higher Education', 'Corporate users' ]

# create a chart
fig_2 = plt.figure(figsize = (15,1.8))
fig_2_ax = fig_2.add_subplot()

# white spaces are necessary to visualize intersection of sectors
white_spaces = [0, 176, 241]
fig_2_ax.barh(sectors_names, white_spaces, color = "white", height = 1)

# colored bars
for i in range(len(stacked_sectors)):
    fig_2_ax.barh(sectors_names[i], stacked_sectors[i],
                  left=white_spaces[i], color=colors_group_1[i],
                  height=1)

# add vertical lines and labels for better understanding how sectors intersect
# 'PreK-12 (school students)'
fig_2_ax.vlines(df_sectors['PreK-12'], -1, 3, colors = 'black', linestyles = '--')
fig_2_ax.text(75, 0.8, str(df_sectors['PreK-12']), fontsize=font_labels, color = 'black')

# 'PreK-12; Higher Ed'
fig_2_ax.vlines(241, -0.8, 3, colors='black', linestyles='--')
fig_2_ax.text(200, 1.6, str(df_sectors['PreK-12; Higher Ed']), fontsize=font_labels, color = 'black')

# 'PreK-12; Higher Ed; Corporate'
fig_2_ax.vlines(363, -1, 3, colors='black', linestyles='--')
fig_2_ax.text(292, 2.6, str(df_sectors['PreK-12; Higher Ed; Corporate']), fontsize=font_labels, color = 'black')

# 'Higher Ed; Corporate'
fig_2_ax.vlines(370, -1, 3, colors='black', linestyles='--')
fig_2_ax.text(364, 2.6, str(df_sectors['Higher Ed; Corporate']), fontsize=font_labels, color = 'black')

# 'Corporate'
fig_2_ax.text(372, 2.6, str(df_sectors['Corporate']), fontsize = font_labels, color = 'black')
    
        
fig_2_ax.set_xlim([0, 400])     # set x-scale
fig_2_ax.yaxis.tick_right()     # put y-ticks on the right side
fig_2_ax.set_yticklabels(sectors_names, fontsize = font_labels)

fig_2_ax.set(frame_on = False)  # get rid of the frame
fig_2_ax.set_xticks([])         # get rid of x-ticks

plt.show()

**Some interesting points to note:**
- there are literally no products aimed only at university students, although, to my mind, this category of users has a direct interest in gaining knowledge before entering a highly competitive job market, has more skills to study independently without guidance from teachers (compared with school students) and has more free time (compared with corporate users);
- there is only 1 product ([Weebly](https://weebly.com/), a web-hosting service) designed specifically for corporate users, although there are many fields and industries where constant learning is of paramount importance for professionals and could be supported by digital products.

**Overall, the charts above demonstrate an outdated attitude towards education in general:**

- people study mostly in schools (for more than 10 years);

- less than half of them continue to study in colleges/universities ([37.5% of the U.S. population who were aged 25 and above had graduated from college or another higher education institution](https://www.statista.com/statistics/184260/educational-attainment-in-the-us/));

- after graduation people stop educating themselves or use the same sources as university students.

I consider this view of education to be outdated because nowadays people tend to change careers and specializations during their lives, and more and more secondary school graduates and university students value education and try to gain new skills online to be more versatile in a competitive world. This tendency did not start with the COVID-19 pandemic, but the worldwide crisis might reinforce it.



### 2.2 Functions of the digital learning products

**How are digital learning products used by students, teachers and other users in the educational process?**

In my view, the chart below not only answers the question but also demonstrates why schools and universities all over the world were not prepared for the forced switch to online lessons.
The products designed to organize the educational process online and to teach effectively and safely together make up only 17% (9% CM + 8% SDO).
Teachers faced the new reality with little technical support, not to mention methodological background.

The vast majority of products belong to the 'Learning & Curriculum' segment, which might indicate that educational content is valued more highly than a smooth and effective educational process.

**An interesting point to note:**
- only 1% of 'Learning & Curriculum' products are dedicated to 'Career Planning & Job Search', although nowadays it is a topic in high demand.

In [None]:
# waffle chart to identify the distribution of products' functions
df_functions = products['Function(abb)'].value_counts()
df_functions.rename(function_abb_labels, inplace = True)

fig_1 = plt.figure(
    FigureClass = Waffle, 
    rows = 6,
    columns = 21,
    rounding_rule = 'floor',
    values = list(df_functions.values),
    interval_ratio_x = 0.5,
    interval_ratio_y = 0.8,
    figsize = (13, 7),
    icons = 'file-alt',
    title = {
        'label': "Products' functions",
        'loc': 'left',
        'fontdict': {
            'fontsize': font_title,
            'fontname': fontname
        }},
    labels = [f"{k} ({int(v / sum(df_functions.values) * 100)}%)" for k, v in df_functions.items()],
    legend = {
        
        'loc': 'upper right',
        'bbox_to_anchor': (1.45, 1.05),
        'ncol': 1,
        'framealpha': 0,
        'fontsize': font_labels
    },
    vertical = False,
    cmap_name= waffle_cmap
)

# 3 What is the picture of engagement in digital learning in 2020?

The most exciting part of the presented analysis is to see how people in the USA studied during such an unusual and difficult year when online education became not a choice, but an obligation.

**Did they actually use 'the gift of time'?**

**What digital learning products were the most popular among users in different categories?**

**Will we see some patterns in users' behavior?**


### 3.1 Engagement in digital learning products for each Users' category

*Methodology:*
1. sum up the total engagement index (number of page-load events per 1000 students) for ALL PRODUCTS per Users' category per day;
2. calculate the average engagement index per week in 2020, per Users' category;
3. display the behavior of the weekly engagement index;
4. for each product in every category of the 'Users' feature, calculate the average engagement index, sort the list and choose the n most popular products.
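
The first two steps can be sketched on a tiny made-up frame (the dates and index values are illustrative only, and the `week` column name is my own choice):

```python
import pandas as pd

toy = pd.DataFrame({
    'time': pd.to_datetime(['2020-01-06', '2020-01-06', '2020-01-07', '2020-01-13']),
    'Users': ['Universal products'] * 4,
    'engagement_index': [100.0, 50.0, 250.0, 400.0],
})

# step 1: total engagement per day and Users' category
daily = toy.groupby(['time', 'Users'])['engagement_index'].sum().reset_index()

# step 2: average of the daily totals per ISO week
daily['week'] = daily['time'].dt.isocalendar().week.astype(int)
weekly = daily.groupby(['week', 'Users'])['engagement_index'].mean().unstack('Users')

print(weekly)
```

Here 6 and 7 January fall into ISO week 2 (daily totals 150 and 250, so the weekly mean is 200), while 13 January is the only day in week 3.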

In [None]:
# merge engagement and products dataframes
product_engagement = pd.merge(engagement[['time', 'lp_id', 'engagement_index']],
                              products[['LP ID', 'Product Name', 'Provider/Company Name',
                                        'Users', 'Function(abb)', 'Function(details)']],
                              how = 'left', left_on = 'lp_id', right_on = 'LP ID')
# drop mutual column 'LP ID'
product_engagement.drop(columns = ['LP ID'], inplace = True)

# calculate total engagement index for ALL PRODUCTS per Users' category per day
# (w/o split per districts and products/ names)
users_df = product_engagement.groupby(
                               by = [product_engagement.time, product_engagement.Users]
                               )['engagement_index'].sum().reset_index()

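# calculate average engagement index per week in 2020, per Users' category
# (Series.dt.week was removed in pandas 2.0; isocalendar().week gives the same ISO week numbers)
week_number = users_df.time.dt.isocalendar().week.astype(int).rename('time')
users_df = users_df.groupby(by = [week_number, users_df.Users]
                            )['engagement_index'].mean(
                            ).unstack(level = 1).reset_index()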

In [None]:
def create_season_chart(title, y_lim, y_text):
    # function creates a chart template with colored seasons and weeks of 2020 on the x-axis
    
    # title - title of the chart
    # y_lim - maximum of y_axis, should be slightly more than maximum value of the Users' column
    # y_text - y-coordinate of seasons' names, should be less than y_lim
    
    # create chart
    plt.style.use('ggplot')
    fig = plt.figure(figsize = (17.5,5.5))
    ax = fig.add_subplot()
    
    #ticks
    x_ticks = [1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 53]
    plt.xticks(x_ticks)
    ax.set_xlim(1, 53)
    ax.set_ylim(1, y_lim)
    
    #background color - seasons
    plt.axvspan(1, 9, facecolor='blue', alpha=0.2)
    plt.axvspan(9, 22, facecolor='limegreen', alpha=0.2)
    plt.axvspan(22, 35, facecolor='gold', alpha=0.2)
    plt.axvspan(35, 48, facecolor='coral', alpha=0.2)
    plt.axvspan(48, 53, facecolor='blue', alpha=0.2)
    
    #text seasons
    ax.text(3.5, y_text, 'Winter', fontsize=14, color = 'blue', alpha = 0.4, fontweight = 'semibold')
    ax.text(14, y_text, 'Spring', fontsize=14, color = 'limegreen', alpha = 0.4, fontweight = 'semibold')
    ax.text(27, y_text, 'Summer', fontsize=14, color = 'orange', alpha = 0.4, fontweight = 'semibold')
    ax.text(40, y_text, 'Autumn', fontsize=14, color = 'coral', alpha = 0.4, fontweight = 'semibold')
    ax.text(49, y_text, 'Winter', fontsize=14, color = 'blue', alpha = 0.4, fontweight = 'semibold')
    
    #grid
    ax.grid(axis='x')
    
    #labels
    ax.set_ylabel('Avg. page-load events per 1000 students, thousands', fontsize = font_labels)
    ax.set_xlabel('Weeks, 2020 year', fontsize = font_labels)

    #title
    ax.set_title(title, fontsize = font_title, fontname = fontname)
       
    return fig, ax

In [None]:
def top_products(df, user_categoty, n_products, axis, y_lim, x_text_coordinate, color_line = 'black'):
    # function calculates average engagement index for each product in a given Users' group
    # for the whole year, sorts products by avg.engagement index and makes a list of n-top used products.
    # Afterwards, function displays the list on a given plot.
    
    # df - merged dataframe to calculate avg. engagement index
    # user_categoty - given Users' category
    # n_products - how many top used products to display
    # axis - axis of the plot, where to display
    # y_lim - feature used in create_season_chart function, use the same value
    # x_text_coordinate - x-coordinate on the list of products on the plot
    # color_line - the same color as Users' category plot or darker color, default = 'black'
    
    top_product = df.loc[df.Users == user_categoty]. \
                  groupby(by= 'Product Name')['engagement_index']. \
                  mean().reset_index()
            
    top_product_list = list(top_product.sort_values(
                       by = 'engagement_index', ascending = False).head(n_products)['Product Name'])
    
    title_coordinate = y_lim * (3/4)
    axis.text(x_text_coordinate, title_coordinate, 'Top used products in 2020:',
              fontsize = font_text,
              color = color_line,
              alpha = 0.8,
              fontname = fontname)
        
    # iterate over the products actually found, which may be fewer than n_products
    for i, product_name in enumerate(top_product_list):
        product_coordinate = title_coordinate - (i+1) * y_lim * (1/15)
        
        axis.text(x_text_coordinate, product_coordinate, str(i+1) + ' - ' + product_name,
                  fontsize = font_text,
                  color = color_line,
                  alpha = 0.9,
                  fontname = fontname)

In [None]:
fig_3, fig_3_ax = create_season_chart('Universal products for all users, all states in 2020',
                                      y_lim = 12000, y_text = 11000)

fig_3_ax.plot(users_df.time, users_df['Universal products'] /1000,
              linewidth = 1.5, color = 'blue',
              label = 'Universal products')

top_products(product_engagement, 'Universal products',
             n_products = 6, axis = fig_3_ax,
             y_lim = 12000, x_text_coordinate = 24)

fig_3_ax.legend(fontsize = font_legend,
                loc = 'lower left')

plt.show()

In [None]:
fig_4, fig_4_ax = create_season_chart('Products that can be used by children under 18 y.o. and university students, all states in 2020',
                                      y_lim = 3000, y_text = 2750)

fig_4_ax.plot(users_df.time, users_df['Children under 18 y.o.'] /1000,
              linewidth = 1.5, color = 'red',
              label = 'Children under 18 y.o.')

fig_4_ax.plot(users_df.time, users_df['Students, all ages'] /1000,
              linewidth = 1.5, color = 'green',
              label = 'Students, all ages')

top_products(product_engagement, 'Children under 18 y.o.',
             n_products = 5,
             axis = fig_4_ax, y_lim = 3000,
             x_text_coordinate = 17,
             color_line = 'darkred')

top_products(product_engagement, 'Students, all ages',
             n_products = 5,
             axis = fig_4_ax, y_lim = 3000,
             x_text_coordinate = 30,
             color_line = 'darkgreen')

fig_4_ax.legend(fontsize = font_legend,
                loc = 'lower left')

plt.show()

In [None]:
fig_5, fig_5_ax = create_season_chart('Products that can be used by corporate users and university students, all states in 2020',
                                      y_lim = 25, y_text = 22.5)

fig_5_ax.plot(users_df.time, users_df['Uni.students and corporate users'] /1000,
              linewidth = 1.5, color = 'darkviolet',
              label = 'Uni.students and corporate users')

fig_5_ax.plot(users_df.time, users_df['Corporate users'] /1000,
              linewidth = 1.5, color = 'orange',
              label = 'Corporate users')

top_products(product_engagement, 'Uni.students and corporate users',
             n_products = 5,
             axis = fig_5_ax, y_lim = 25,
             x_text_coordinate = 17,
             color_line = 'rebeccapurple')

top_products(product_engagement, 'Corporate users',
             n_products = 1,
             axis = fig_5_ax, y_lim = 25,
             x_text_coordinate = 30,
             color_line = 'darkgoldenrod')

fig_5_ax.legend(fontsize = font_legend,
                loc = 'lower left')

plt.show()

**Key insights:**
- the common patterns for all categories are, first of all, a **dramatic decrease in engagement during the summer months** with a subsequent increase starting from the end of August, and a **sharp fall in engagement during the holidays: Thanksgiving Day and Christmas**;
- universal products, which include Google services and YouTube, were used much more intensively than digital learning products for students of all ages, let alone products for corporate users;
- products designed for all students fluctuated less in engagement than the other categories;
- we can notice a more or less **sharp decrease in the weekly engagement index in the 12th week, the week after the WHO declared COVID-19 a pandemic and President Trump declared COVID-19 a national emergency**;
- the engagement index of products for university students and corporate users behaves differently from the other categories. I assume it might be related to the pandemic and the high level of uncertainty for university students and those who were going to apply to colleges in 2020;
- in the second half of the year, engagement in products for children under 18 y.o. followed the universal products line quite closely.
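
The week numbers on the x-axis are ISO calendar weeks, so the timing in the fourth point can be checked with the standard library. The two dates are the WHO pandemic declaration (11 March 2020) and the US national emergency declaration (13 March 2020).

```python
from datetime import date

who_pandemic = date(2020, 3, 11)         # WHO declares COVID-19 a pandemic
national_emergency = date(2020, 3, 13)   # US national emergency declaration

# ISO week number of both events
print(who_pandemic.isocalendar()[1], national_emergency.isocalendar()[1])

# first and last day of ISO week 12 in 2020 (requires Python 3.8+)
week_12_start = date.fromisocalendar(2020, 12, 1)
week_12_end = date.fromisocalendar(2020, 12, 7)
print(week_12_start, week_12_end)
```

Both declarations fall into week 11, and week 12 covers 16 to 22 March 2020.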

To my mind, national holidays influence overall engagement in digital learning products more than crises like a global pandemic or political events like presidential elections.

**I want to point out that we cannot evaluate the influence of the COVID-19 pandemic based solely on the data of 2020. To identify whether the pandemic had any impact and to measure it, we should analyze data for several previous years and at least one academic year after the pandemic.**


### 3.2 Engagement in digital learning products with various functionality

The digital learning products cover 18 different function categories, and analysing each of them seems inefficient to me. So I decided to compare monthly engagement for two groups: the functions that were used the most in 2020 across all states, and the functions whose engagement fluctuated around its average the most.

*Methodology for identifying the most fluctuating functions:*
1. sum up the total engagement index (number of page-load events per 1000 students) for ALL PRODUCTS per Function category per day;
2. calculate the average engagement index per month in 2020, per Function category;
3. scale the monthly means by dividing them by their overall mean for the year, then calculate the standard deviation of the scaled values;
4. choose the top-3 most fluctuating functions and display the monthly average engagement index for these functions.

*Methodology for identifying the most used functions:*
1. sum up the total engagement index (number of page-load events per 1000 students) for ALL PRODUCTS per Function category per day;
2. calculate the average engagement index per day in 2020, per Function category;
3. choose the top-3 functions with the largest average engagement index and display the monthly average engagement index for these functions.
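
The fluctuation measure from the first methodology can be sketched on toy data. This is only an illustration: the two categories (`Steady`, `Seasonal`), their values and the helper `monthly_fluctuation` are hypothetical and are not part of the dataset or of the functions defined in the next cell.

```python
import numpy as np
import pandas as pd

# toy daily data for two hypothetical function categories (not values from the dataset)
days = pd.date_range("2020-01-01", "2020-12-31", freq="D")
toy = pd.DataFrame({
    "time": np.tile(days, 2),
    "Function(details)": np.repeat(["Steady", "Seasonal"], len(days)),
})
toy["engagement_index"] = np.where(
    toy["Function(details)"] == "Steady",
    100.0,                                               # identical value every day
    np.where(toy["time"].dt.month >= 9, 300.0, 10.0),    # jumps in September
)

def monthly_fluctuation(df):
    # step 1: total engagement index per day and per category
    daily = (df.groupby(["time", "Function(details)"])["engagement_index"]
               .sum()
               .reset_index())
    # step 2: average engagement index per month and per category
    monthly = (daily.groupby([daily["time"].dt.month, "Function(details)"])["engagement_index"]
                    .mean()
                    .unstack("Function(details)"))
    # step 3: scale the monthly means by their overall mean and take the
    # population standard deviation (ddof=0, the same convention as np.std)
    scaled = monthly / monthly.mean()
    return scaled.std(ddof=0)

fluctuation = monthly_fluctuation(toy)
print(fluctuation.sort_values(ascending=False))
```

Dividing by the mean before taking the standard deviation makes categories with very different absolute engagement comparable, which is why a flat category scores zero regardless of its level.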


In [None]:
def function_extreme_std(df):
    # function calculates average engagement index per Function(details) category per month
    # and keeps the categories whose usage fluctuated the most during 2020
    # (scaled standard deviation above a manually chosen threshold).
    # function returns dictionary with name of function and list of average engagement indexes per month. 
    
    function_details = list(df['Function(details)'].unique())  #list of all Functions(details)
    
    functions_to_vis = {}
    y_df = df.groupby(by = ['time', 'Function(details)'])['engagement_index'].sum().reset_index()
    
    for category in function_details:
        y = y_df.loc[y_df['Function(details)'] == category].groupby(by = y_df.time.dt.month)['engagement_index'].mean()
        y_std = np.std(y/y.mean())  #scale means per each month by dividing by the total mean
        
        if y_std > 0.9:             # y_std = 0.9 is manually chosen value
            y = list(y)
            functions_to_vis[category] = y
        
    return functions_to_vis


def function_extreme_average(df):
    # function calculates average engagement index per Function(details) category for the whole of 2020,
    # identifies top-3 categories with the highest average engagement index.
    # function returns dictionary with name of function and list of average engagement indexes per month.
    
    # total engagement index per day and per category (computed once and reused in the loop below)
    y_daily = df.groupby(by = ['time', 'Function(details)'])['engagement_index'].sum()
    
    # 'time' is kept in the index so that only the category columns are averaged
    y_mean = y_daily.unstack(level = 1).mean()
    
    function_details = list(y_mean.sort_values(ascending = False)[:3].index)
    functions_to_vis = {}
    y_df = y_daily.reset_index()
    
    for category in function_details:
        y_category = y_df.loc[y_df['Function(details)'] == category]. \
                                          groupby(by = y_df.time.dt.month)['engagement_index'].mean()
        y_category = list(y_category / 1000)
        functions_to_vis[category] = y_category

    return functions_to_vis


def function_extreme_chart(dictionary, figure, colors, *axis):
    # function draws one bar chart per dictionary item
    # on the given figure and axes, using a list of colors.
    
    months = ['January','February','March', 'April','May','June','July', 'August', 'September', 'October', 'November','December']
    
    for ax, item, color in zip(axis, dictionary.items(), colors):
        
        # x and y are passed as keywords: recent seaborn versions no longer accept them positionally
        sns.barplot(x = months, y = item[1], ax = ax, color = color)
        ax.set_xticklabels(months, rotation = 45, fontsize = 10)
        ax.set_xlabel('')
        ax.set_title(item[0], fontsize = font_title, fontname = fontname)
        ax.locator_params(axis='y', nbins=6) 
        
     

In [None]:
# create dictionaries with the most fluctuating and the most used categories in Function(details)
functions_std = function_extreme_std(product_engagement)
functions_average = function_extreme_average(product_engagement)

In [None]:
fig_6, fig_6_ax = plt.subplots(nrows=3, ncols=2, sharex=True, sharey=False, figsize=(18, 15))
plt.style.use('ggplot')

f6_ax1, f6_ax2, f6_ax3 =  fig_6_ax[:,0]
f6_ax4, f6_ax5, f6_ax6 =  fig_6_ax[:,1]

# charts with top-3 fluctuating categories
function_extreme_chart(functions_std, fig_6, colors_group_1, f6_ax1, f6_ax2, f6_ax3)
fig_6.text(0.2, 0.93, 'The most fluctuating functions in 2020',
           fontsize = font_title, fontname = fontname)

# y-label for top-3 fluctuating categories
fig_6.text(0.08, 0.4, 'Average amount of page-load events per 1000 students',
           fontsize = font_labels, alpha = 0.8,
           rotation = 'vertical')

# charts with top-3 used categories
function_extreme_chart(functions_average, fig_6, colors_group_2, f6_ax4, f6_ax5, f6_ax6)
fig_6.text(0.63, 0.93, 'The most used functions in 2020',
           fontsize = font_title, fontname = fontname)

# y-label for top-3 used categories
fig_6.text(0.5, 0.4, 'Average amount of page-load events per 1000 students, .000',
           fontsize = font_labels, alpha = 0.8,
           rotation = 'vertical')



plt.show()

**Key insights:**
- presumably, all three **most fluctuating functions were strongly influenced by the pandemic**;
- engagement in Virtual Classroom products was almost zero in January and February, rose in March (after the COVID-19 pandemic declaration) and tripled after the summer months. This shows that the **need for Virtual Classroom products was created by circumstances** and became entrenched at the beginning of the new academic year in 2020;
- the opposite can be observed in the Admission, Enrollment & Rostering category: with the start of the pandemic, engagement in these products dropped to almost zero;
- **EHS Compliance products showed an increase in March and April**, when schools and universities across all states were closed and moved to online education;
- the most used functions show the same pattern: an increase from the beginning of the year until spring, followed by a decrease to the lowest engagement level in July. Engagement started to rise quickly at the beginning of the new academic year (2020-2021) and decreased slightly by the Christmas holidays;
- I assume that **the pandemic might have influenced the absolute numbers of engagement in the products of the most used functions, but had little effect on the pattern itself**. High usage during the academic year and low interest during holidays seem natural.


# 4 City, Town, Suburb, Rural

A coronavirus moving study conducted by [MYMOVE](https://www.mymove.com/moving/covid-19/coronavirus-moving-trends/) showed that the pandemic tremendously affected moving trends and people's preferences about the type of location to live in. It is interesting to compare engagement in digital learning products with these relocation trends in 2020.

Even though we cannot observe whether engagement from the various locations changed in comparison with the last few years, the charts below partially support some of the findings about migration from cities to suburbs and rural areas.

First of all, **suburbs clearly lead in engagement with digital education, while cities and towns are several times less active in using digital learning products**. This corresponds to the moving trend of pandemic times: from cities to suburbs, from crowded spaces to less densely populated areas.

Although the absolute engagement numbers differ significantly depending on the type of location, the relative shares of engagement in products with various functionality are quite similar for all types.
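
The difference between absolute and relative numbers can be illustrated with a small sketch. It uses toy values and a `groupby` / `unstack` approach that is an alternative to the manual calculation in the following cells, not the code the charts are built with; the category codes `LC` and `CM` are hypothetical placeholders.

```python
import pandas as pd

# toy data with hypothetical values; the column names follow the notebook
toy = pd.DataFrame({
    "locale": ["City", "City", "Suburb", "Suburb", "Suburb", "Suburb"],
    "Function(abb)": ["LC", "CM", "LC", "CM", "LC", "CM"],
    "engagement_index": [60.0, 40.0, 300.0, 200.0, 300.0, 200.0],
})

# absolute numbers: total engagement per locale and function category
totals = (toy.groupby(["locale", "Function(abb)"])["engagement_index"]
             .sum()
             .unstack("Function(abb)", fill_value=0))

# relative numbers: share of each function category within its locale, in percent
shares = totals.div(totals.sum(axis=1), axis=0).mul(100).round()

print(totals)
print(shares)
```

In this toy example the suburb totals are ten times the city totals, yet both locales end up with the same 60/40 split, which is the kind of pattern described above.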

In [None]:
# merge engagement and district dataframes
locale_engagement = pd.merge(engagement[['engagement_index','district_id','lp_id', 'time']],
                         districts[['district_id', 'locale']],
                         how = 'left', on = 'district_id')

# merge new dataframe and products dataframe
locale_engagement = locale_engagement.merge(products[['LP ID', 'Function(abb)', 'Users']],
                                   how = 'left', left_on = 'lp_id', right_on = 'LP ID')

# it is crucial to drop rows with missing information about products and districts for further manual calculations
locale_engagement.dropna(inplace = True)


In [None]:
# calculate total engagement index for ALL PRODUCTS per day and per locale category
# (w/o split per products/ names)
locale_engage_function = locale_engagement.groupby(by = ['time','locale']) \
                                                    ['engagement_index'].sum().reset_index()

# calculate average engagement index for the whole 2020 per locale category
locale_group = locale_engage_function.groupby(by = ['locale']) \
                                                    ['engagement_index'].mean().reset_index()

In [None]:
# list of all unique categories in Function(abb) feature
functions_abb = products['Function(abb)'].unique()[:-1]

# calculate manually the share of each Function(abb) in the engagement index per locale category
for function in functions_abb:
    locale_group[function] = locale_group.apply(lambda row:
                                                locale_engagement.loc[(locale_engagement['Function(abb)'] == function)&
                                                (locale_engagement.locale == row.locale),
                                                'engagement_index'].sum() /
                                                locale_engagement.loc[locale_engagement.locale == row.locale,
                                                'time'].nunique(), axis = 1)

# create a column for custom sorting: City, Town, Suburb, Rural    
locale_group['locale_index'] = locale_group['locale'].map({'City': 0, 'Town':1, 'Suburb':2, 'Rural':3})
locale_group = locale_group.sort_values(by = 'locale_index', ascending = True)

# calculate percentages of Function(abb)'s engagement indexes in total index per locale category 
percentages_locale = []                      # empty list for lists of percentages
for index, row in locale_group.iterrows():
    perc_locale = []
    columns = locale_group.columns[2:6]
    
    for col in columns:
        p = round((row[col] / row['engagement_index'])*100)
        perc_locale.append(p)
    
    percentages_locale.append(perc_locale)

In [None]:
def engagement_group(figsize, results, category_names, labels, cmap, percentages):
    # function creates stacked horizontal bar chart displaying percentages per each stack
    # and returns figure, axis
    
    # figsize - size of the figure
    # results - dataframe with numerical data to visualize
    # category_names - labels for stacked bars: locale names
    # labels - categories that form each bar
    # cmap - color map
    # percentages - values to display on each stack (calculate separately before applying the function)

    data = np.array(results)
    data_cum = data.cumsum(axis=1)
    category_colors = plt.get_cmap(cmap)(np.linspace(0.15, 0.85, data.shape[1]))
    

    fig, ax = plt.subplots(figsize=figsize)  
    
    
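    # iterate over the stacks (columns of data), not over the bars:
    # zipping with category_names would drop stacks when there are more stacks than bars
    for i, (color, label) in enumerate(zip(category_colors, labels)):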
        widths = data[:, i]
        starts = data_cum[:, i] - widths
        xcenters = starts + widths / 2
        rects = ax.barh(category_names, widths, left=starts, height=0.7, label=label, color=color)

      
        r, g, b, _ = color
        text_color = 'white' if r * g * b < 0.1 else 'black'
        for y, (x, c) in enumerate(zip(xcenters, percentages)):
            if c[i] > 4:
                ax.text(x, y, str(c[i])+'%', ha='center', va='center', color=text_color, fontsize = 12)
    
    ax.legend(labels, fontsize = 12, loc='lower right')
    plt.xticks(fontsize = font_ticks)
    plt.yticks(fontsize = font_ticks)

    
    return fig, ax


In [None]:
# numerical data to visualize
results = locale_group.iloc[:, [2,3,4,5]] /1000

# names of locale categories
locale_labels = list(locale_group.locale.unique())

# labels of categories in Function(abb)
function_labels = ['Learning & Curriculum', 'Classroom Management',
                    'School & District Operations', 'Universal products']

fig_7, fig_7_ax = engagement_group(figsize = (18, 5.5), results = results,
                                   category_names = locale_labels,
                                   labels = function_labels,
                                   cmap = cmap_1,
                                   percentages = percentages_locale)

fig_7_ax.set_xlabel('Average amount of page-load events per 1000 students per 1 day in 2020,.000', fontsize = font_labels)

fig_7_ax.set_title("Engagement in digital learning products in various locations - Products' functions",
                   fontsize = font_title, fontname = fontname)
fig_7_ax.ticklabel_format(axis='x', style='plain')

plt.show()

Learning & Curriculum products account for 65-68% of engagement in cities, suburbs and rural areas, while users who live in towns show more interest in this type of product: 74%.

A similar situation is observed with School & District Operations products: 20-21% in all locations except towns, where these products account for 18% of the engagement rate.

**Interesting point to note**:
- even though universal products (from the functions perspective) include Facebook and some of the Microsoft Office and Google products, their share of total engagement does not exceed 5% in any location type.



In [None]:
# calculate total engagement index for ALL PRODUCTS per day and per locale category
# (w/o split per products/ names)
locale_engage_user = locale_engagement.groupby(by = ['time','locale']) \
                                                    ['engagement_index'].sum().reset_index()

# calculate average engagement index for the whole 2020 per locale category
locale_group_user = locale_engage_user.groupby(by = ['locale']) \
                                                    ['engagement_index'].mean().reset_index()

In [None]:
# list of all unique categories in Users feature
users = products['Users'].unique()[:-1]

# calculate manually the share of each Users category in the engagement index per locale category
for user in users:
    locale_group_user[user] = locale_group_user.apply(lambda row:
                                                locale_engagement.loc[(locale_engagement['Users'] == user)&
                                                (locale_engagement.locale == row.locale),
                                                'engagement_index'].sum() /
                                                locale_engagement.loc[locale_engagement.locale == row.locale,
                                                'time'].nunique(), axis = 1)

# create a column for custom sorting: City, Town, Suburb, Rural    
locale_group_user['locale_index'] = locale_group_user['locale'].map({'City': 0, 'Town':1, 'Suburb':2, 'Rural':3})
locale_group_user = locale_group_user.sort_values(by = 'locale_index', ascending = True)

# calculate percentages of Users' engagement indexes in total index per locale category 
percentages_locale_user = []                      # empty list for lists of percentages
for index, row in locale_group_user.iterrows():
    perc_locale = []
    columns = locale_group_user.columns[2:7]
    
    for col in columns:
        p = round((row[col] / row['engagement_index'])*100)
        perc_locale.append(p)
    
    percentages_locale_user.append(perc_locale)

In [None]:
# numerical data to visualize
results_2 = locale_group_user.iloc[:, [2,3,4,5, 6]]/1000

fig_8, fig_8_ax = engagement_group(figsize = (18, 5.5), results = results_2,
                                   category_names = locale_labels,
                                   labels = users,
                                   cmap = cmap_2,
                                   percentages = percentages_locale_user)

fig_8_ax.set_xlabel('Average amount of page-load events per 1000 students per 1 day in 2020,.000', fontsize = font_labels)

fig_8_ax.set_title("Engagement in digital learning products in various locations - Products' users",
                   fontsize = font_title, fontname = fontname)
fig_8_ax.ticklabel_format(axis='x', style='plain')

plt.show()

Analysing engagement in products for different Users' groups, we can see surprising results:
- even though **products for children under 18 y.o. constitute almost 50% of all digital learning products, their share of the total engagement rate does not exceed 20% in any location**;
- the opposite holds for universal products (from the users perspective): 32% of products account for 74-84% of total engagement. **A nice illustration of the [Pareto principle](https://en.wikipedia.org/wiki/Pareto_principle)**;
- rural areas and suburbs have a similar distribution of engagement in products, as do towns and cities;
- in all location types, the share of engagement in products for students of all ages (this category includes university/college students) does not exceed 5%.

Looking at the weekly average engagement rate in the various location types, the first thing to notice is that the Suburb line fluctuates much more than the others. **Starting and ending at the same level of engagement in digital learning products, suburbs show dramatic rises and drops during the year and respond to events like the end of the academic year, holidays or the declaration of the pandemic much more strongly than other locations.**

Presumably, the growth after the 12th week and the higher engagement in the 2020-2021 academic year in suburbs and rural areas might be related to the migration from cities to less densely populated areas mentioned above. On the other hand, during the first 10 weeks of 2020 total engagement in suburbs was already much higher than in other locations.

In [None]:
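# calculate average engagement index per week in 2020 per locale category
# (Series.dt.week was removed in pandas 2.0, so the ISO week number is taken from isocalendar())
week_number = locale_engage_function.time.dt.isocalendar().week.astype(int).rename('time')
locale_group_time = locale_engage_function.groupby(by = [week_number, locale_engage_function.locale]) \
                                         ['engagement_index'].mean().reset_index()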

# list of color
category_colors = plt.get_cmap(cmap_1)(np.linspace(0.15, 0.85, 4))



In [None]:
fig_9 = plt.figure(figsize = (16,5.5))
fig_9_ax = fig_9.add_subplot()
plt.style.use('ggplot')


for local, color in zip(locale_labels, category_colors):
    local_df = locale_group_time.loc[locale_group_time.locale == local]
    fig_9_ax.plot(local_df.time, local_df.engagement_index, label = local, color = color)
 
fig_9_ax.legend(fontsize = 12, loc='upper left')

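#labels & ticks
x_ticks = [1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 53]
plt.xticks(x_ticks)
fig_9_ax.set_xlim(1, 53)
fig_9_ax.set_ylim(0, 7000000)
# fix tick positions explicitly so that the custom labels (in thousands) always match them
fig_9_ax.set_yticks(range(0, 8000000, 1000000))
fig_9_ax.set_yticklabels([0, 1000, 2000, 3000, 4000, 5000, 6000, 7000])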

fig_9_ax.set_xlabel('Weeks, 2020 year', fontsize = font_labels)
fig_9_ax.set_ylabel('Average amount of page-load events per 1000 students, .000', fontsize = 10)

#title
fig_9_ax.set_title('Engagement in digital learning products in various locations',
                   fontsize = font_title, fontname = fontname)

#grid
fig_9_ax.grid(axis='x')

plt.show()

Last but not least in this section is an analysis of engagement in digital learning in different states and of the distribution of the 4 location types in the total engagement rate per state.

I have split the chart into 2 parts because of the high variation across states in the absolute numbers of average engagement per day in 2020. The first chart shows the top-8 states with the highest engagement in digital learning, and the second chart shows the rest of the states.

I should mention that the dataset with information about school districts obviously has a lot of gaps. Arizona shows 100% city population and Florida 100% suburb population, which is far from reality. Nevertheless, let's see the picture based on the available information.
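
One possible way to spot such gaps is to count the districts and the distinct locale types available per state. The sketch below runs on a toy table with hypothetical rows; only the column names (`district_id`, `state`, `locale`) are taken from the districts dataset, and the `coverage` table is an illustration rather than a step of this notebook.

```python
import pandas as pd

# toy districts table with hypothetical rows; column names follow districts_info.csv
toy_districts = pd.DataFrame({
    "district_id": [1, 2, 3, 4, 5, 6],
    "state": ["Arizona", "Florida", "Utah", "Utah", "Utah", None],
    "locale": ["City", "Suburb", "City", "Suburb", "Rural", None],
})

# number of districts and of distinct locale types per state (rows with gaps are ignored)
coverage = (toy_districts.dropna(subset=["state", "locale"])
                         .groupby("state")
                         .agg(n_districts=("district_id", "nunique"),
                              n_locales=("locale", "nunique")))

# a state represented by a single locale type will always show a 100% share for it
single_locale_states = coverage.index[coverage["n_locales"] == 1].tolist()

print(coverage)
print(single_locale_states)
```

Applied to the real `districts` dataframe, a table like this would show which 100% shares come from a state being represented by only one or two districts.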

In [None]:
# merge engagement and district dataframes
state_engagement = pd.merge(engagement[['engagement_index','district_id', 'time']],
                         districts[['district_id', 'locale', 'state']],
                         how = 'left', on = 'district_id')

state_engagement.dropna(inplace = True)

# repeat the same steps that were done to analyse engagement per locale category, but with analytics per state
# sum engagement index for all products per each date, per state
state_group = state_engagement.groupby(by = ['time','state'])['engagement_index'].sum().reset_index()

# calculate average engagement index per state
state_group = state_group.groupby(by = 'state')['engagement_index'].mean().reset_index()


In [None]:
# sort states from smallest to biggest average engagement index in 2020
state_group = state_group.sort_values(by = 'engagement_index', ascending = True)

for locale in locale_labels:
    state_group[locale] = state_group.apply(lambda row:
                                                state_engagement.loc[(state_engagement.locale == locale)&
                                                (state_engagement.state == row.state)]['engagement_index'].sum() /
                                                state_engagement.loc[state_engagement.state == row.state,
                                                'time'].nunique(), axis = 1)
    


In [None]:
# calculate percentages of locale engagement indexes in total index per each state  
percentages_states_locale = []
for index, row in state_group.iterrows():
    perc_locale = []
    columns = state_group.columns[2:6]
    
    for col in columns:
        p = round((row[col] / row['engagement_index'])*100)
        perc_locale.append(p)
    
    percentages_states_locale.append(perc_locale)

In [None]:
# Because of the significant differences between largest and smallest engagement indexes,
# I split the list of state in 2 parts:

# part I - top-8 states with highest average engagement index
states_1 = list(state_group.state.unique()[14:])
results_states_1 = state_group[state_group.columns[2:6]][14:]
percentages_states_1 = percentages_states_locale[14:]

fig_10, f10_ax1 = engagement_group((15, 7), results_states_1,
                                   states_1, locale_labels,
                                    cmap_1, percentages_states_1)
#title
f10_ax1.set_title('Top-8 states with highest engagement in digital learning products',
                   fontsize = font_title, fontname = fontname)

#ticks
f10_ax1.set_xticklabels([0, 200, 400, 600, 800, 1000, 1200, 1400])
f10_ax1.set_xlabel('Average amount of page-load events per 1000 students per 1 day in 2020, .000', fontsize = font_labels)

# part II - states with lowest average engagement index
states_2 = list(state_group.state.unique()[:14])
results_states_2 = state_group[state_group.columns[2:6]][:14]
percentages_states_2 = percentages_states_locale[:14]

_, f10_ax2 = engagement_group((15, 9), results_states_2,
                                   states_2, locale_labels,
                                    cmap_1, percentages_states_2)
#title
f10_ax2.set_title('States with lowest engagement in digital learning products',
                   fontsize = font_title, fontname = fontname)

#ticks
f10_ax2.set_xticklabels([0, 20, 40, 60, 80, 100, 120, 140])
f10_ax2.set_xlabel('Average amount of page-load events per 1000 students per 1 day in 2020, .000', fontsize = font_labels)

plt.show()

I suspect that the level of average engagement per state might be related to the number of large, well-known universities in the state. For instance, Connecticut has Yale; Illinois, the University of Chicago; Massachusetts, MIT and Harvard; California, Stanford; New York, Columbia University and Cornell University, and so on. Although Princeton did not bring New Jersey one of the top places, I assume this is the result of missing information in the districts dataset.

# 5 Additional information about states

There are multiple factors that might influence students' engagement in the digital learning process:
- organization of the educational process;
- teachers' qualifications, background and personal interest in doing their best for students;
- the educational system in the country/state and support of teachers from the authorities (financial/methodological/technical);
- current trends among students concerning the value of education, especially higher education;
- current trends in the job market;
- cultural traditions, family background and so on.

In my view, analyzing all states together is not entirely correct because in many respects the states can be considered 50 different countries. To get a better understanding of the states from the districts dataset, I decided to create a visualization with the following information per state:

1. choropleth (a map colored according to a statistical metric) based on the average engagement rate per day in each state;
2. political orientation of the state based on the 2020 presidential election - I truly believe that the political views of the majority in a state and educational issues are closely related;
3. average annual salary of school teachers - not only per-pupil funding matters in education, but also per-teacher funding;
4. number of universities and the ratio 'citizens per 1 university' in the state - in my view, higher education institutions are like centers of gravity in cities and towns. They play a significant role in the cultural life of citizens and can set cultural trends, especially among young people.

Using this map, you can see how different the states are and, more importantly for a country-level analysis, how many blank spaces there are on the map. Almost half of the states are missing from the data, which makes the results unreliable and the whole analysis quite limited.

Nevertheless, there might be interesting correlations between level of engagement and the features I chose to display.
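
One hedged way to look for such relationships is a rank correlation between the engagement index and the other numeric features. The sketch below uses toy values for five hypothetical states; the column names mirror the `state_data` table built in the following cells, and the rank-based approach is my assumption about a reasonable measure for such a small and skewed sample, not a step of this notebook.

```python
import pandas as pd

# toy table with hypothetical values; column names mirror the state_data dataframe
toy_states = pd.DataFrame({
    "state": ["A", "B", "C", "D", "E"],
    "engagement_index": [10.0, 20.0, 30.0, 40.0, 500.0],
    "salary_2020": [50000, 52000, 58000, 60000, 61000],
    "citizens_per_uni": [90000, 70000, 65000, 40000, 30000],
})

numeric = toy_states[["engagement_index", "salary_2020", "citizens_per_uni"]]

# Pearson correlation of the ranks is the Spearman rank correlation;
# ranks are less sensitive to outliers such as state "E" than the raw values
rank_corr = (numeric.rank()
                    .corr()["engagement_index"]
                    .drop("engagement_index"))

print(rank_corr)
```

With roughly twenty states in the data, any coefficient obtained this way would only be a rough indication, not evidence of a causal link.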

In [None]:
def correct_add_data(df):
    # drop white spaces before and after names of the states
    df['state'] = df['state'].str.strip()
    
    # capitalise 'of' in 'District of Columbia' so that the spelling is the same in every dataset
    df['state'] = df['state'].replace({'District of Columbia':'District Of Columbia'})
    
    return df    
    

In [None]:
# elections
elections = pd.read_csv(r'https://raw.githubusercontent.com/ElinaAizenberg/Kaggle-competition/main/Popular%20vote%20backend%20-%20Sheet1.csv',
                       usecols = [0,1], skiprows = [0,1,2,3], header = 0, names = ['state', 'called'])

# drop sub-states like 'Maine 2nd District'
elections = elections.drop(index = [4, 7, 15, 32, 39, 40])

elections = correct_add_data(elections)

# replace D with Democrats and R with Republicans
elections['called'] = ['Democrats' if x == 'D' else 'Republicans' for x in elections['called']]

# covid_cases
covid_cases = pd.read_csv(r'https://raw.githubusercontent.com/ElinaAizenberg/Kaggle-competition/main/United_States_COVID-19_Cases_and_Deaths_by_State_over_Time.csv',
                         usecols = [0,1, 5], names = ['submission_date', 'state_abb', 'new_case'], header = 0)

#replace NYC with NY
covid_cases = covid_cases.replace({"state_abb": {'NYC':'NY'}})

# abbreviation of states
state_abb = pd.read_csv(r'https://raw.githubusercontent.com/ElinaAizenberg/Vision-Zero-in-USA---project/main/data/name-abbr.csv',
                       header = None, names = ['state', 'state_abb'])

# merge covid_cases with states abbreviations
covid_cases = covid_cases.merge(state_abb, how = 'left', on = 'state_abb')

# drop rows: Republic of Marshall Islands, GU - Guam, VI - Virgin Islands, PR - Puerto Rico,
# MP - Northern Mariana Islands, AS - American Samoa, PW - Palau, FSM - Federated States of Micronesia
covid_cases = covid_cases.loc[~covid_cases.state_abb.isin(['RMI', 'GU', 'VI', 'PR', 'MP', 'AS', 'PW','FSM'])]

# number of universities per state
universities = pd.read_csv(r'https://raw.githubusercontent.com/ElinaAizenberg/Kaggle-competition/main/tabn317.20.csv',
                           header = 1, sep = ';', names = ['state', 'amount_uni'])
universities.dropna(inplace = True)

universities = correct_add_data(universities)

# coordinates of states

states_coordinates = pd.read_csv('https://raw.githubusercontent.com/ElinaAizenberg/Kaggle-competition/main/statelatlong.csv',
                                names = ['Abb', 'Latitude', 'Longitude', 'state'])

states_coordinates = correct_add_data(states_coordinates)

In [None]:
# teachers' salaries
teachers_salaries = pd.read_csv('https://raw.githubusercontent.com/ElinaAizenberg/Kaggle-competition/main/tabn211.60.csv',
                               header = 0, sep = ';')

teachers_salaries = teachers_salaries.iloc[:,[0, 7]]
teachers_salaries = teachers_salaries.rename(columns = {'Unnamed: 0':'state', '2019-20':'salary_2020'})

teachers_salaries = correct_add_data(teachers_salaries)
teachers_salaries.dropna(inplace = True)

# replace 'Columbia' with 'District of Columbia'
teachers_salaries['state'] = teachers_salaries['state'].replace({'Columbia':'District Of Columbia'})

# states' population

population = pd.read_csv('https://raw.githubusercontent.com/ElinaAizenberg/Kaggle-competition/main/nst-est2019-01.csv',
                               header = 0, sep = ';')

population.dropna(inplace = True)

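# regex = False: treat '.' as a literal dot, not as the regex wildcard
population['state'] = population['state'].str.replace('.', '', regex = False)
population['population_2019'] = population['population_2019'].str.replace(' ','', regex = False)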


population['population_2019'] = population['population_2019'].astype(int)
population = correct_add_data(population)

In [None]:
# the base of the new dataframe - state_group with information about engagement index
state_data = state_group[['state', 'engagement_index']]
state_data = pd.merge(state_data, state_abb, how = 'left', on = 'state')

# merge all dataframes: universities, elections '20 results, population, teachers' salaries, coordinates of the states
state_data = pd.merge(state_data, universities, how = 'left', on = 'state')
state_data = pd.merge(state_data, elections, how = 'left', on = 'state')
state_data = pd.merge(state_data, population, how = 'left', on = 'state')
state_data = pd.merge(state_data, teachers_salaries, how = 'left', on = 'state')
state_data = pd.merge(state_data, states_coordinates[['Latitude','Longitude','state']], how = 'left', on = 'state')

# calculate how many citizens per 1 university in each state
state_data['citizens_per_uni'] = round(state_data['population_2019'] / state_data['amount_uni'])

# change data type of salary column to integer
state_data['salary_2020'] = state_data['salary_2020'].str.replace(' ','')
state_data['salary_2020'] = state_data['salary_2020'].astype(int)

In [None]:
# visualize geo-data with choropleth
url = (
    "https://raw.githubusercontent.com/python-visualization/folium/master/examples/data"
)
state_geo = f"{url}/us-states.json"
m = folium.Map(location=[40, -102], zoom_start=4, zoom_control=False,
               scrollWheelZoom=False)

#popup_text = "State: {}, \nElections in 2020: {}, Amount of universities: {}, Citizens per 1 university: {}, Average annual salary of teachers in '19: {}%"

folium.Choropleth(
    geo_data=state_geo,
    name="choropleth",
    data=state_data,
    columns=["state_abb", "engagement_index"],
    key_on="feature.id",
    fill_color="OrRd",
    threshold_scale=[0, 200000, 400000,600000, 800000, 1000000, 1200000, 1400000],
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Average number of page-load events per 1000 students in 2020",
).add_to(m)

folium.LayerControl().add_to(m)

for index, row in state_data.iterrows():
    text = 'State: '+ str(row.state) + '<br>' + 'Elections in 2020: ' + str(row.called) \
            + '<br>' + 'Amount of universities: ' + str(row.amount_uni) + '<br>' \
            + 'Citizens per 1 university: ' + str(row.citizens_per_uni) + '<br>' \
            + "Average annual salary of teachers in 2019-20: " + str(row.salary_2020) + ' $'

    iframe = folium.IFrame(text, width=290, height=130)
    popup = folium.Popup(iframe, max_width=290)
    
    folium.Marker(
        location=[row.Latitude, row.Longitude],
        popup=popup,
        icon=folium.Icon(color="green"),
    ).add_to(m)

m

# 6  Engagement in digital learning products vs COVID-19 pandemic

One of the main questions posed in the description of this analytical challenge, and undoubtedly discussed in teachers' communities all over the world, is **'How did COVID-19 influence online education?'**. A more general and less predictable question is **'What is the future of online education after the pandemic?'**.

Once again, I would like to underline that we cannot draw any conclusions based solely on the data from 2020. First of all, this year should be treated as an outlier because circumstances made the transition from traditional to distance learning obligatory, rapid and unprepared. Secondly, this transition revealed all the disadvantages of distance learning, which might result in further resistance no matter how many great distance learning products appear on the market. And finally, the pandemic did not end in December 2020. It is not over even now, in September 2021. Therefore, it is too early to draw any conclusions about the effect of the pandemic on distance learning.

Nevertheless, we can check whether there was any relationship between engagement in digital learning products and the number of COVID-19 cases, as a measure of the severity of the pandemic, in the USA.

*Methodology:*
1. sum the new COVID-19 cases from all states per day;
2. calculate the 7-day rolling mean of the new cases;
3. sum the engagement index (number of page-load events per 1000 students) for ALL PRODUCTS and all districts per day;
4. calculate the 7-day rolling mean of the engagement index.
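
The four steps above can be sketched on a small toy example. The data below is hypothetical (constant values for two states and two districts) and only illustrates the aggregation; the column names mirror the ones used in this notebook, while the `toy_*` names are made up for the illustration.

```python
import pandas as pd

# Hypothetical toy data: two states and two districts over 10 days.
dates = pd.date_range("2020-03-01", periods=10, freq="D")

toy_cases = pd.DataFrame({
    "submission_date": dates.repeat(2),
    "state": ["CA", "NY"] * 10,
    "new_case": [10, 20] * 10,
})
toy_engagement = pd.DataFrame({
    "time": dates.repeat(2),
    "district_id": ["1000", "2000"] * 10,
    "engagement_index": [100.0, 300.0] * 10,
})

# Steps 1 and 3: daily totals across all states / all districts
cases_per_day = toy_cases.groupby("submission_date")["new_case"].sum()
engagement_per_day_toy = toy_engagement.groupby("time")["engagement_index"].sum()

# Steps 2 and 4: 7-day rolling means (the first 6 days have no full window, so they are NaN)
cases_rolling = cases_per_day.rolling(7).mean()
engagement_rolling = engagement_per_day_toy.rolling(7).mean()

print(cases_rolling.tail(3))
print(engagement_rolling.tail(3))
```

Note that `rolling(7)` uses a trailing window, so each point averages the current day and the six days before it.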



In [None]:
# Prepare dataframes for analysis engagement_index VS Covid-19 cases

# change data type of date column to DateTime
covid_cases['submission_date'] = pd.to_datetime(covid_cases['submission_date'], format = '%m/%d/%Y')

# select only 2020 cases
covid_cases = covid_cases.loc[covid_cases.submission_date.dt.year == 2020]

# sum new cases per day w/o split on states
covid_cases_total = covid_cases.groupby(by = 'submission_date')['new_case'].sum().reset_index()

# sort by date
covid_cases_total = covid_cases_total.sort_values(by = 'submission_date')

# sum engagement index for all products and all districts per each day in 2020
engagement_per_day = engagement.groupby(by = 'time')['engagement_index'].sum().reset_index()

In [None]:
fig_11 = plt.figure(figsize = (16,7))
fig_11_ax = fig_11.add_subplot()

#covid-19 cases each day(bar chart) and rolling average number of cases per 7 days(line chart)
fig_11_ax.bar(covid_cases_total.submission_date, covid_cases_total.new_case,
            color = 'midnightblue', label='New COVID-19 cases', alpha = 0.5)
fig_11_ax.plot(covid_cases_total.submission_date, covid_cases_total.new_case.rolling(7).mean(),
            color = 'midnightblue', label='7-day rolling mean of new COVID-19 cases', linewidth = 2.0)

# twin axis
fig_11_ax2 = fig_11_ax.twinx()

# rolling average engagement index per 7 days
fig_11_ax2.plot(engagement_per_day.time, engagement_per_day.engagement_index.rolling(7).mean(),
            color = 'red', label='7-day rolling mean of page-load events per 1000 students', linewidth = 2.0)

# labels & ticks
fig_11_ax2.set_xlim([datetime.date(2020, 1, 22), datetime.date(2021, 1, 1)])

fig_11_ax.set_ylabel('Amount of new COVID-19 cases', fontsize = font_labels)
fig_11_ax2.set_ylabel('Av. amount of page-load events per 1000 students', fontsize = font_labels)
fig_11_ax2.set_yticklabels([])

# legend
fig_11_ax2.legend(loc = 'upper left', fontsize = 11)
fig_11_ax.legend(loc = 'upper left', fontsize = 11)

# grid
fig_11_ax2.grid(False)  # positional form; the 'b' keyword was removed in newer Matplotlib versions
fig_11_ax.grid(axis='x')

# title
fig_11_ax.set_title('COVID-19 cases VS engagement in digital learning products, 2020',
                   fontsize = font_title, fontname = fontname)

plt.show()

**Some interesting points to note:**
- there is **no obvious, uniform correlation between the progress of the pandemic and engagement in digital learning products over the year**;
- the significant growth of engagement at the beginning of the year is visible before the pandemic was declared and schools were closed;
- there might be **some relationship between the increase in COVID-19 cases and the drop in engagement in digital learning during the summer months**. Possibly, the large number of children and students who were not occupied with distance learning was one of the reasons for the higher 'second wave' of the pandemic;
- it seems that with the start of the new academic year and the rise in usage of digital learning products, the severity of the pandemic slightly decreased;
- even though we can see a sharp decrease in both engagement and COVID-19 cases during Thanksgiving week, it was followed by an increase, probably because families gathered to celebrate the national holiday;
- the situation **during the Christmas holidays resembles the summer months: a decrease in engagement in digital learning was accompanied by a surge in new COVID-19 cases**.

Speaking about the future of distance learning after the global pandemic, I would like to note that education is a social process and a social institution and, as [political scientist Dr. Ekaterina Schulmann](https://en.wikipedia.org/wiki/Ekaterina_Schulmann) said, **crises do not bring fundamental changes to the world. They accelerate the processes that were growing and kill those that were dying**. Therefore, to predict the uncertain future of distance learning we should know what the trend was before the pandemic.

As we have no data from previous years, I decided to get a general understanding of people's interest in digital or online education using Google Trends. I chose **2 topics in Google Web search in the USA and downloaded the data for the last 5 years**:
1. E-learning;
2. Digital learning.

The analysis is based on the value of 'interest over time' which represents search interest relative to the highest point on the chart for the given region (USA) and time. A value of 100 is the peak popularity for the term. A value of 50 means that the term is half as popular. A score of 0 means there was not enough data for this term.
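
To make the 0-100 scale concrete, here is a simplified sketch of such a normalization. The raw weekly volumes are hypothetical (Google does not publish them), and the real Google Trends scaling is based on the share of all searches rather than raw counts, so this is only an illustration of the "relative to the peak" idea.

```python
import pandas as pd

# Hypothetical weekly raw search volumes (assumption, for illustration only).
raw_volume = pd.Series(
    [120, 300, 600, 150],
    index=pd.to_datetime(["2020-01-05", "2020-03-15", "2020-04-05", "2020-07-05"]),
)

# Scale so that the busiest week maps to 100 and other weeks are relative to it.
interest_over_time = (100 * raw_volume / raw_volume.max()).round().astype(int)

print(interest_over_time)
```

In this toy series the week with 300 searches gets a value of 50, i.e. it is half as popular as the peak week.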



In [None]:
# CSV files downloaded from Google trends
e_learning = pd.read_csv(r'https://raw.githubusercontent.com/ElinaAizenberg/Kaggle-competition/main/E_learning.csv',
                        parse_dates = [0])
digital_learning = pd.read_csv(r'https://raw.githubusercontent.com/ElinaAizenberg/Kaggle-competition/main/Digital%20learning.csv',
                              parse_dates = [0])


fig_19, fig_19_ax = plt.subplots(nrows=2, ncols=1, figsize=(15, 10))

fig_19_ax_1 = fig_19_ax[0]
fig_19_ax_2 = fig_19_ax[1]

#E-learning topic
fig_19_ax_1.plot(e_learning.Week, e_learning['E-Learning: (United States)'],
            color = 'blue', linewidth = 2.0, label = 'E-learning')
#Digital learning topic
fig_19_ax_2.plot(digital_learning.Week, digital_learning['Digital learning: (United States)'],
            color = 'darkcyan', linewidth = 2.0, label = 'Digital learning')

fig_19_ax_1.axvspan(pd.Timestamp('2020-01-01'), pd.Timestamp('2021-01-01'),
                    alpha=0.2, color='red')

fig_19_ax_2.axvspan(pd.Timestamp('2020-01-01'), pd.Timestamp('2021-01-01'),
                    alpha=0.2, color='red')

#lines to separate years
years = ['2017-01-01','2018-01-01','2019-01-01','2020-01-01','2021-01-01']

for year in years:
    fig_19_ax_1.vlines(pd.Timestamp(year), 0, 100,
                       colors='black', linestyles='-',
                       linewidth = 1.0, alpha = 0.5)
    fig_19_ax_2.vlines(pd.Timestamp(year), 0, 100,
                       colors='black', linestyles='-',
                       linewidth = 1.0, alpha = 0.5)

# legend    
fig_19_ax_1.legend(fontsize = 12,
                loc = 'upper right')
fig_19_ax_2.legend(fontsize = 12,
                loc = 'upper right')

# labels & ticks
fig_19_ax_1.set_ylabel('Interest rate', fontsize = font_labels)
fig_19_ax_2.set_ylabel('Interest rate', fontsize = font_labels)

# grid
fig_19_ax_1.grid(axis='x')
fig_19_ax_2.grid(axis='x')

# title
fig_19.suptitle('Interest in online education topics in the USA, 2017-2021',
                   fontsize = font_title, fontname = fontname)

plt.show()

**Key insights:**
- **in the 3 years before the pandemic, the interest of Google users in the E-learning topic followed the same pattern**: an increase at the beginning of the year, a reduction by half during the summer months, followed by a rise at the beginning of the new academic year and a decrease by Christmas;
- 2020 showed exceptionally high absolute values, but broadly the same pattern;
- **the most important thing to note: in 2021 interest returned to the level of 2017-2019**;
- the Digital learning topic shows a less clear pattern each year, except for a surge during the first weeks of every year, but a dramatic increase in the middle of 2020 is easy to notice;
- once again, **in 2021 interest in the Digital learning topic is at the pre-pandemic level**.


# 7 Influence of social / political / financial aspects on online engagement

In this section I would like to analyze how engagement in digital learning products relates to the political orientation of the state where the school district is located, and whether online engagement correlates with the ethnic composition of a school district or the financial status of its students.

Also, I believe there might be a correlation between average teachers' salaries in a state, per-pupil expenditure and the level of online engagement: the better a school's financial support and the higher its teachers' salaries, the more opportunities there are to organize online education.

I decided to analyze the impact of social and financial aspects along with the political split of the states because I believe that in many cases each state of the USA can be considered a distinct country. One aspect that can help in clustering school districts is the political orientation of each state. I use the results of the latest presidential election to split school districts into democratic and republican.

### 7.1 Social context

*Methodology*:
1. sum the engagement index for ALL products over the whole year for each school district and each category of the social/financial feature, separately for democratic and republican states;
2. calculate the average engagement index per district for the whole year in each category.
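
The two-step aggregation can be illustrated on a toy example. The numbers and the `"[0, 0.2["` range label below are assumptions made up for the illustration; only the column names follow this notebook.

```python
import pandas as pd

# Hypothetical toy data: three districts, two daily records each.
toy = pd.DataFrame({
    "district_id": ["1", "1", "2", "2", "3", "3"],
    "called": ["Democrats"] * 4 + ["Republicans"] * 2,
    "pct_free/reduced": ["[0, 0.2["] * 6,
    "engagement_index": [100.0, 300.0, 50.0, 150.0, 80.0, 120.0],
})

# Step 1: yearly total per district within each political group and category
per_district = (
    toy.groupby(["called", "pct_free/reduced", "district_id"])["engagement_index"]
    .sum()
    .reset_index()
)

# Step 2: average of those district totals per category and political group
per_category = (
    per_district.groupby(["pct_free/reduced", "called"])["engagement_index"]
    .mean()
    .unstack("called")
)

print(per_category)
```

Averaging district totals, rather than averaging raw daily rows, keeps a district with many daily records from dominating its category.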

In [None]:
# merge engagement and districts dataframes
social_engagement = pd.merge(engagement[['engagement_index','district_id', 'time']],
                         districts[['district_id', 'state', 'pct_black/hispanic','pct_free/reduced', 'pp_total_raw']],
                         how = 'left', on = 'district_id')

# merge new dataframe and information about each state
social_engagement = pd.merge(social_engagement,
                             state_data[['state', 'called', 'salary_2020', 'citizens_per_uni']],
                            how = 'left', on = 'state')

# drop only the rows which has missing information about the state
social_engagement = social_engagement.dropna(subset = ['state'], axis=0, how='any')

In [None]:
def prepare_social_df(df, social_column):
    # this function calculates the average engagement index per category of the feature chosen for analysis
    # df - social_engagement dataframe
    # social_column - feature for analysis
    
    democrats = df.loc[df.called == 'Democrats'].groupby(
                by = ['district_id', social_column])['engagement_index'].sum().reset_index()
    
    republicans = df.loc[df.called == 'Republicans'].groupby(
                by = ['district_id',social_column])['engagement_index'].sum().reset_index()
    
    democrats.dropna(inplace = True)
    republicans.dropna(inplace = True)

    democrats = democrats.groupby(by = social_column)['engagement_index'].mean().reset_index()
    republicans = republicans.groupby(by = social_column)['engagement_index'].mean().reset_index()

    social_df = pd.merge(democrats, republicans, how = 'outer', on = social_column)
    social_df = social_df.rename(columns = {'engagement_index_x':'engagement_D',
                                                    'engagement_index_y':'engagement_R'})

    social_df['engagement_D'] = social_df['engagement_D'] / 1000
    social_df['engagement_R'] = social_df['engagement_R'] / 1000
    
    return social_df
    

In [None]:
# create a categorical feature based on the continuous feature of teachers' salaries
social_engagement['salary_2020_range'] = pd.cut(social_engagement['salary_2020'],
                                                bins=[0, 50000, 60000, 70000, 80000, 90000],
                                                labels=['40-50K $', '50-60K $', '60-70K $', '70-80K $', '80-90K $'])


In [None]:
def butterfly_chart(df, social_column, yticklabels, ylabel):
    # this function creates a butterfly chart with split for republican and democratic states

    hfont = {'fontname':'Calibri'}
    color_red = '#fd625e'
    color_blue = '#01b8aa'
    index = df[social_column]
    column0 = df['engagement_R']
    column1 = df['engagement_D']
    title0 = 'Republicans, thousands of page-load events'
    title1 = 'Democrats, thousands of page-load events'

    fig, ax = plt.subplots(figsize=(10.5,5), ncols=2, sharey=True)
    fig.tight_layout()

    ax[0].barh(index, column0, align='center', color=color_red, zorder=10)
    ax[0].set_title(title0, fontsize=14, pad=14, color=color_red, **hfont)
    ax[1].barh(index, column1, align='center', color=color_blue, zorder=10)
    ax[1].set_title(title1, fontsize=14, pad=14, color=color_blue, **hfont)

    # If you have positive numbers and want to invert the x-axis of the left plot
    ax[0].invert_xaxis() 

    # To show data from highest to lowest
    plt.gca().invert_yaxis()

    ax[0].set(yticks=df[social_column],
                    yticklabels=yticklabels,
                    ylabel = ylabel)


    ax[0].set_xticks([0, 5000, 10000, 15000, 20000])
    ax[1].set_xticks([0, 5000, 10000, 15000, 20000])

    plt.subplots_adjust(wspace=0, top=0.85, bottom=0.1, left=0.18, right=0.95)
    
    fig.suptitle('Average engagement rate per school district in democratic and republican states',
                 fontsize=font_title, fontname = fontname)

    return fig, ax

In [None]:
ethnics_group = prepare_social_df(social_engagement, 'pct_black/hispanic')

yticklabels_ethnics = ['0-20%', '20-40%', '40-60%', '60-80%', '80-100%']
ylabel_ethnics = '% black/hispanic students'

fig_12, fig_12_ax = butterfly_chart(ethnics_group, 'pct_black/hispanic',
                                    yticklabels_ethnics, ylabel_ethnics)


**% of black / hispanic students in a school district:**

- as we can see, average engagement differs between democratic and republican states: in every category of the ethnic feature, **engagement in digital learning is higher in democratic states than in republican ones**;
- in democratic states the highest engagement is reached at both extremes: in districts with the highest and with the lowest percentage of black / hispanic students;
- I assume a similar situation could be observed in republican states, but, probably due to the lack of data, the highest engagement there is reached in the 0-20% and 60-80% groups, while the 80-100% category shows the lowest result.




In [None]:
lunch_group = prepare_social_df(social_engagement, 'pct_free/reduced')

yticklabels_lunch = ['0-20%', '20-40%', '40-60%', '60-80%', '80-100%']
ylabel_lunch = '% of students receiving free/reduced lunches'

fig_13, fig_13_ax = butterfly_chart(lunch_group, 'pct_free/reduced',
                                    yticklabels_lunch, ylabel_lunch)

**% of students receiving free / reduced lunches:** 

It should be noted that the percentage of students in a school who are eligible for free or reduced-price lunch is an indicator of the number of low-income students in that school, according to the [National Center for Education Statistics (NCES)](https://nces.ed.gov/fastfacts/display.asp?id=898). The higher the percentage of eligible students, the higher the level of school poverty.

I initially expected to see a strong correlation between the level of engagement in digital learning and the percentage of students receiving free lunches. As far as I know, social scientists and teachers have discussed that the transition to online education might hit the poorest and most financially vulnerable families hardest, because they start in worse circumstances and have fewer opportunities to fully participate in the education process.

Nevertheless, in republican states the chart does not show a clear relationship between this social feature and the level of engagement: the result is more or less the same in all categories. In contrast, **in democratic states we can observe a negative correlation across the first 4 categories, from 0% to 80% of students with free/reduced lunches**, but the last category, 80-100%, breaks the trend.

**Interesting thing to notice:**

- in the 60-80% category, average engagement rates in democratic and republican districts are very close.



### 7.2 Financial context

For analysis I will consider 2 financial features:
- average salary of school teachers per state;
- total per-pupil expenditure per school district.

I suppose there is little doubt that teachers with higher salaries work better, have more opportunities to study new teaching methods that use digital products, and can be more involved and interested in organizing engaging educational processes.

In the states included into the dataset we can see the following:

- **in all republican states average teachers' salaries are significantly lower than the national average**, which might be related to the lower average engagement in digital learning that we observe in republican states;
- in 29% of democratic states teachers' salaries are significantly lower than the national average (< 60 000 USD), but in 50% of the democratic states salaries are significantly higher than the national average (> 70 000 USD), and on average teachers in democratic states earn considerably more.

In [None]:
fig_15 = plt.figure(figsize = (17,7))
fig_15_ax = fig_15.add_subplot()

fig_15_ax.bar(state_data.loc[state_data.called == 'Republicans']['state'],
        state_data.loc[state_data.called == 'Republicans']['salary_2020'],
        color = '#fd625e', label = 'Republicans')

fig_15_ax.bar(state_data[state_data.called == 'Democrats']['state'],
        state_data[state_data.called == 'Democrats']['salary_2020'],
        color = '#01b8aa', label = 'Democrats')

# ticks
plt.xticks(rotation = 80, fontsize = font_ticks)
fig_15_ax.set_ylim(20000, 90000)

#labels
fig_15_ax.set_ylabel('Average salary, $', fontsize = font_labels)

# display the line of average salary of school teacher in 2019 - 2020: 63 645 $
plt.axhline(y=63645, color="black", linestyle="--")
fig_15_ax.text(6.5, 65000,'Average salary across the states - $63 645 per year', fontsize = font_text)

# legend
fig_15_ax.legend(loc = 'upper left', fontsize = font_legend)

#grid
fig_15_ax.grid(axis='x')

# title
fig_15_ax.set_title('Average salary of school teachers in USA, 2019 - 2020 ', fontsize = font_title, fontname = fontname)

plt.show()

In [None]:
#manual application of the function prepare_social_df
democrats = social_engagement.loc[social_engagement.called == 'Democrats'].groupby(
                by = ['district_id', 'salary_2020_range'])['engagement_index'].sum().reset_index()
    
republicans = social_engagement.loc[social_engagement.called == 'Republicans'].groupby(
            by = ['district_id','salary_2020_range'])['engagement_index'].sum().reset_index()

democrats = democrats.loc[democrats['engagement_index'] > 0]
republicans = republicans.loc[republicans['engagement_index'] > 0]

democrats = democrats.groupby(by = 'salary_2020_range')['engagement_index'].mean().reset_index()
republicans = republicans.groupby(by = 'salary_2020_range')['engagement_index'].mean().reset_index()

salary_group = pd.merge(democrats, republicans, how = 'outer', on = 'salary_2020_range')
salary_group = salary_group.rename(columns = {'engagement_index_x':'engagement_D',
                                                'engagement_index_y':'engagement_R'})

salary_group['engagement_D'] = salary_group['engagement_D'] / 1000
salary_group['engagement_R'] = salary_group['engagement_R'] / 1000
    
    
yticklabels_salary = ['40-50K $', '50-60K $', '60-70K $', '70-80K $', '80-90K $']
ylabel_salary = "Average teacher's salary in 2020"

fig_14, fig_14_ax = butterfly_chart(salary_group, 'salary_2020_range',
                                    yticklabels_salary, ylabel_salary)

plt.show()

**Average teacher's salary:**
- unfortunately, there is no obvious positive correlation between the average salary in a state and engagement in digital learning. In both political categories we can notice a decrease in engagement, and I assume this is an interesting issue to explore in cooperation with teaching methodologists and teachers;

- at the same time, in the 50-60K USD category, which is the only category shared by democratic and republican states, average engagement in democratic states' districts is higher than in republican ones.

The chart below shows the distribution of school districts according to per-pupil expenditure range.

In [None]:
districts_finance = pd.merge(districts[['district_id','state', 'pp_total_raw']],
                            state_data[['state', 'called']],
                            how = 'left', on = 'state')

districts_finance = districts_finance.groupby(
                    by = ['pp_total_raw', 'called'])['district_id'].count(). \
                    unstack(level = 1).reset_index()

# fill missing values with '0'
districts_finance.fillna(0, inplace = True)

# change order of indexes to order costs per students in ascending order
districts_finance = districts_finance.reindex(index = [10,11,12,13,0,1,2,3,4,5,6,7,8,9])

In [None]:
fig_16 = plt.figure(figsize = (10,7))
fig_16_ax = fig_16.add_subplot()

fig_16_ax.barh(districts_finance.pp_total_raw, districts_finance.Republicans,
                label='Republicans', color = '#fd625e')
fig_16_ax.barh(districts_finance.pp_total_raw, districts_finance.Democrats,
                left = districts_finance.Republicans,
                label='Democrats', color = '#01b8aa')


fig_16_ax.set_yticklabels(labels = districts_finance.pp_total_raw, fontsize = 11)
fig_16_ax.set_xlim(0, 50)

# To show data from highest to lowest
plt.gca().invert_yaxis()

# legend
fig_16_ax.legend(loc = 'upper right', fontsize = font_legend)

#grid
fig_16_ax.grid(axis='y')

# labels
fig_16_ax.set_xlabel('Number of districts with the corresponding per-pupil expenditure range', fontsize = font_labels)
fig_16_ax.set_ylabel('Per-pupil expenditure range, $', fontsize = font_labels)

# title
fig_16_ax.set_title('Per-pupil total expenditure in districts', fontsize = font_title, fontname = fontname)


plt.show()

**Key insights:**
- in republican states per-pupil expenditure in school districts ranges from 4000-6000 USD to 11000-13000 USD;
- the **most frequent expenditure range in republican districts is 8000-10000 USD per pupil**;
- in democratic states the lowest expenditure range is 2 times higher than in republican ones: 8000-10000 USD;
- likewise, the most frequent expenditure range **in democratic districts is 2 times higher than in republican ones: 16000-18000 USD**.
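
The expenditure charts in this section put the ranges in ascending order with a hard-coded `reindex` list, which depends on the exact row positions. A hedged alternative is to sort by the numeric lower bound of each range. The sketch below assumes the labels look like `"[4000, 6000["`; the toy values and the `toy_*` and `lower_bound` names are hypothetical.

```python
import pandas as pd

# Hypothetical range labels in the assumed "[lower, upper[" format.
toy_ranges = pd.DataFrame({
    "pp_total_raw": ["[10000, 12000[", "[4000, 6000[", "[16000, 18000[", "[8000, 10000["],
    "n_districts": [5, 2, 7, 9],
})

# Extract the numeric lower bound of each range label.
lower_bound = (
    toy_ranges["pp_total_raw"]
    .str.extract(r"\[(\d+),", expand=False)
    .astype(int)
)

# Sort rows by that bound so the ranges appear in ascending order.
toy_sorted = (
    toy_ranges.assign(lower_bound=lower_bound)
    .sort_values("lower_bound")
    .reset_index(drop=True)
)

print(toy_sorted)
```

Sorting on a derived numeric column avoids the string ordering problem where `"[10000, ..."` sorts before `"[4000, ..."`.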

In [None]:
expenditure_group = prepare_social_df(social_engagement, 'pp_total_raw')
expenditure_group = expenditure_group.reindex(index = [11,12,10,13,0,1,2,3,4,5,6,7,8,9])

ylabel_expenditure = 'Per-pupil expenditure range, $'

fig_17, fig_17_ax = butterfly_chart(expenditure_group, 'pp_total_raw',
                                    expenditure_group.pp_total_raw, ylabel_expenditure)

fig_17.set_size_inches(12, 7.5)

fig_17_ax[0].set_xticks([0, 5000, 10000, 15000, 20000, 25000, 30000])
fig_17_ax[1].set_xticks([0, 5000, 10000, 15000, 20000, 25000, 30000])

plt.show()

**Key insights:**
- on average, the engagement rate increases along with per-pupil expenditure in school districts;
- the highest engagement rate is reached in the third-to-last category in both democratic and republican states. **Probably, there is some level of financial support beyond which a school's efficiency stops rising**;
- another possible explanation is that schools with a high level of expenditure per pupil try to engage students more in offline activities than in online ones;
- in all categories where both democratic and republican districts are present, average engagement in democratic districts is higher than in republican ones.

# Personal view

The organizers offered several questions to answer using the data they provided. The vaguest question, which cannot be answered with this analysis, is **how might online and distance learning evolve in the future?**

I want to share my personal opinion on this problem, which I formed in discussions with secondary school teachers and university professors in Moscow.

Online learning will definitely develop in the near future, and even if skepticism towards digital learning products increases after the pandemic, it will not stop the transition from offline to online education.

Offline education has limits that it cannot overcome and that online education successfully addresses:
- cost of education (primarily for higher education);
- strict discipline rules that young students and their parents may not approve of;
- state-defined educational programs.

Online education offers a candy in the form of a freer timetable, the ability to study from any corner of the Earth, lower prices for higher education degrees and more freedom in choosing the syllabus. What is more, nowadays there is a widespread opinion that children do not need to learn handwriting, history or basic math because everything can be found on Google or calculated on a laptop, and they will not write by hand at all. And this tendency has no geographical borders. A lesson with a laptop seems better than a lesson with a textbook and a whiteboard. To my mind, the lower the level of education in a society, the more this society will support online education.

This will lead to a bigger education gap: offline education for the rich, online education for the poor. I see that during the pandemic the opposite view emerged: poorer and more vulnerable students, with limited access to digital products and a bad Internet connection, suffer more from distance education. But this situation is temporary: Internet access will become more and more available to everybody, while real schools with educated teachers, which are expensive to maintain, will decrease in number.
Universities and schools where teachers and professors continue to communicate with students face to face and teach them to research, defend their views and develop critical thinking will become a privilege with highly competitive entrance examinations.

I hope my predictions and personal view do not seem too pessimistic because there is also a bright side. Even though I assume that the quality of online education is significantly lower in comparison with offline learning, its broad application might offer great opportunities in those countries and regions where there are no or very limited alternatives. 

# Data

1. United States COVID-19 Cases and Deaths by State over Time ([link](https://catalog.data.gov/dataset/united-states-covid-19-cases-and-deaths-by-state-over-time));
2. Edunomics Lab's National Education Resource Database on Schools (NERD) project ([link](https://edunomicslab.org/nerds/));
3. Estimated average annual salary of teachers in public elementary and secondary schools, Digest of Education Statistics (NCES) ([link](https://nces.ed.gov/programs/digest/d20/tables/dt20_211.60.asp));
4. How Teacher Pay Compares to Average Salary in Each State 2021 ([link](https://www.business.org/hr/workforce-management/best-us-states-for-teachers/));
5. 2020 National Popular Vote Tracker ([link](https://cookpolitical.com/2020-national-popular-vote-tracker));
6. Degree-granting postsecondary institutions, by control and classification of institution and state or jurisdiction, Digest of Education Statistics (NCES) ([link](https://nces.ed.gov/programs/digest/d20/tables/dt20_317.20.asp));
7. State Population Totals: 2010-2019, US Census Bureau ([link](https://www.census.gov/data/datasets/time-series/demo/popest/2010s-state-total.html#par_textimage_500989927));
8. USA lat,long for state abbreviations, Kaggle dataset ([link](https://www.kaggle.com/washimahmed/usa-latlong-for-state-abbreviations));
9. Google Trends, Web search topic 'E-learning' ([link](https://trends.google.com/trends/explore?date=today%205-y&geo=US&q=%2Fg%2F121tbbp8));
10. Google Trends, Web search topic 'Digital learning' ([link](https://trends.google.com/trends/explore?date=today%205-y&geo=US&q=%2Fm%2F0113h59z)).
