**Introduction**

Some nations have implemented short-term assistance measures, including the provision of digital learning devices, financial help for students and schools, and funding for safety and cleaning equipment. Countries used online platforms for (a) educational information, (b) live classes on virtual meeting platforms, and (c) self-paced formalized learning.

Two drawbacks of emergency remote learning are the loss of classroom instructional time and the lack of direct personal contact between teacher and student.

Evidence from some of the region's wealthiest countries suggests that the pandemic is causing learning losses and an increase in illiteracy. In this study, we look into the various forms of learning as well as the socioeconomic issues that may influence student engagement. The datasets and data dictionaries can be found on this Kaggle page.

In [None]:
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
%matplotlib inline
import glob
import seaborn as sns
import plotly as py
import statistics as stat
import plotly.express as px
import plotly.graph_objs as go
warnings.filterwarnings("ignore")
from plotly.offline import init_notebook_mode
init_notebook_mode(connected = True)


pd.set_option('display.max_columns', None)



districts = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')
products = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')
engagement_data_path = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data'

engagement_files = glob.glob(engagement_data_path + "/*.csv")
files = []

for file in engagement_files:
    df = pd.read_csv(file, index_col = None, header = 0)
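    # The district id is the file name without its extension; taking the last
    # path component avoids depending on the depth of the input directory
    district_id = file.split('/')[-1].split('.')[0]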
    df['district_id'] = district_id
    files.append(df)
    
engagement = pd.concat(files)
engagement = engagement.reset_index(drop = True)
engagement['time'] = pd.to_datetime(engagement['time'])

districts.info()
#df.head()

**Basic dataset information**
We are given three inputs: the engagement_data folder, districts_info.csv, and products_info.csv.

***Districts_info***
1. district_id
2. state
3. locale
4. pct_black/hispanic
5. pct_free/reduced
6. county_connections_ratio
7. pp_total_raw 



In [None]:
# Percentage of missing values in each column
percent_missing = districts.isnull().sum() * 100 / len(districts)
# Average of the per-column percentages
percent_missing.mean()

Averaged over the columns, about 27% of the values are missing.
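To make the summary figure concrete, here is a minimal sketch on a hypothetical toy frame (the column values are invented, not taken from districts_info.csv). It shows the per-column missing share, the average of those shares, and how many rows survive a plain `dropna()`, which removes every row containing at least one missing value.

```python
import pandas as pd

# Hypothetical toy frame standing in for districts_info.csv
toy = pd.DataFrame({
    "district_id": [1, 2, 3, 4],
    "state": ["Utah", None, "Ohio", None],
    "locale": ["City", "Rural", None, "Town"],
})

# Share of missing values per column, in percent
missing_per_column = toy.isnull().mean() * 100

# Average of the per-column shares: the single summary figure
overall_missing = missing_per_column.mean()

# dropna() removes any row with at least one missing value
rows_kept = len(toy.dropna())

print(missing_per_column)
print(overall_missing, rows_kept)
```

Note that the share of rows lost to `dropna()` can be much larger than the average missing share, because a single missing cell is enough to drop a row.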

In [None]:
districts.dropna(inplace=True)

The columns pct_black/hispanic, pct_free/reduced, county_connections_ratio and pp_total_raw are given as intervals, where "[a, b[" means that a ≤ x < b. All values in pct_black/hispanic and pct_free/reduced span an interval of 20%, so for readability we convert each one to a single value, the midpoint, with a deviation of ±10%. The county_connections_ratio column is dominated by a single very broad interval, from 18% to 100%, so it carries little usable information and we drop it. All values of pp_total_raw span an interval of 2000; as with the first two columns, we convert each one to its midpoint, with a deviation of ±1000.

In [None]:
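# Replace each 20%-wide interval "[a, b[" with its midpoint: lower bound + 0.1
for i in ['pct_black/hispanic', 'pct_free/reduced']:
    districts[i] = districts[i].apply(lambda x: float(x.split(',')[0][1:]) + 0.1)

# Replace each 2000-wide spending interval with its midpoint: lower bound + 1000
districts['pp_total_raw'] = districts['pp_total_raw'].apply(lambda x: int(x.split(',')[0][1:]) + 1000)

# Drop the connection ratio, which is too coarse to be informative
districts.drop('county_connections_ratio', axis = 1, inplace = True)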

districts.head(3)
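The conversion above adds a fixed half-width to the lower bound, which works because every interval in a column has the same width. A slightly more general sketch, assuming only the "[a, b[" notation described earlier, computes the midpoint from both bounds. The helper name `interval_midpoint` and the sample values are illustrative, not part of the original notebook.

```python
import pandas as pd

def interval_midpoint(label):
    """Return the midpoint of a half-open interval written as '[a, b['."""
    # Strip the surrounding brackets and spaces, then split into the two bounds
    lower, upper = label.strip("[] ").split(",")
    return (float(lower) + float(upper)) / 2

# Illustrative sample, including one missing value
sample = pd.Series(["[0, 0.2[", "[0.2, 0.4[", None, "[14000, 16000["])

# na_action="ignore" leaves missing entries untouched instead of raising
midpoints = sample.map(interval_midpoint, na_action="ignore")
print(midpoints)
```

Because missing entries are skipped, this variant could also be applied before deciding whether to drop incomplete rows.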

***PRODUCTS***

The product file includes information about the characteristics of the top 372 products with most users in 2020:

1. ***LP ID*** - the unique identifier of the product.
2. ***URL*** 
3. ***Product Name***
4. ***Provider/Company Name***
5. ***Sector(s)*** - sector of education where the product is used.
6. ***Primary Essential Function*** - the basic function of the product. There are two layers of labels here. Products are first labeled as one of three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products are labeled.
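The two-layer label can be separated into its category and sub-category with a single split. The sketch below uses illustrative label strings (the nested one is hypothetical) and splits only on the first " - ", so that a sub-category which itself contains " - " stays in one piece.

```python
import pandas as pd

# Illustrative labels; only the LC / CM / SDO prefixes come from the description above
labels = pd.Series([
    "LC - Digital Learning Platforms",
    "CM - Virtual Classroom",
    "SDO - Example - Nested Label",   # hypothetical nested sub-category
    None,                             # unlabeled product
])

# n=1 splits on the first separator only; expand=True returns two columns
parts = labels.str.split(" - ", n=1, expand=True)
parts.columns = ["category", "subcategory"]
print(parts)
```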


In [None]:
products.info()

In [None]:
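# Keep the top-level category (LC, CM or SDO), i.e. the text before the first '-'.
# Products without a label get the placeholder 'x'. The vectorized form avoids
# chained assignment, which pandas does not guarantee to write back to the frame.
products['Basic_category'] = (
    products['Primary Essential Function'].str.split('-').str[0].str.strip().fillna('x')
)
products.head()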

In [None]:
products.head(5)

***ENGAGEMENT***

The engagement file includes information about engagement of students with learning products in various school districts for the entire year 2020:

1. ***time*** - date.
2. ***lp_id*** - the unique identifier of the product.
3. ***pct_access*** - percentage of students in the district who have at least one page-load event for a given product on a given day.
4. ***engagement_index*** - total page-load events per one thousand students for a given product on a given day.
5. ***district_id***
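As a worked example of the two metrics defined above, the sketch below computes them from raw counts. All three input numbers are hypothetical; the dataset itself only provides the resulting values, not the underlying counts.

```python
# Hypothetical raw counts for one product, one district and one day
students_in_district = 2000
students_with_page_load = 150
total_page_loads = 900

# Percentage of students with at least one page-load event
pct_access = 100 * students_with_page_load / students_in_district

# Page-load events per one thousand students
engagement_index = 1000 * total_page_loads / students_in_district

print(pct_access, engagement_index)
```

The two metrics answer different questions: pct_access measures how widely a product is used, while engagement_index measures how intensively.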

In [None]:
#engagement.head()
#percent_missing_ = engagement.isnull().sum() * 100 / len(districts)
# Forward-fill missing values; the result has to be assigned back, otherwise it is discarded
engagement = engagement.ffill()
engagement.head()
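A forward fill returns a new DataFrame and leaves the original untouched unless the result is assigned. The following minimal sketch on hypothetical toy data illustrates the difference.

```python
import pandas as pd

# Hypothetical toy frame with gaps in both metric columns
toy = pd.DataFrame({
    "pct_access": [0.5, None, 0.7],
    "engagement_index": [10.0, None, None],
})

# Calling ffill() without assignment discards the result
toy.ffill()
still_missing = int(toy.isna().sum().sum())

# Assigning the result keeps the filled values
filled = toy.ffill()
remaining = int(filled.isna().sum().sum())

print(still_missing, remaining)
```

Because the concatenated engagement frame stacks one district after another, a forward fill can carry a value from one product or district into the next row's gap; whether that is acceptable is a judgment call for the analysis.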

**Districts**

In [None]:
state_abb = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District Of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}

districts['state_abb'] = districts['state'].map(state_abb)

fig = go.Figure()
layout = dict(
    title_text = "Count of districts in the available States",
    title_font = dict(
            family = "monospace",
            size = 25,
            color = "black"
            ),
    geo_scope = 'usa'
)

# Count districts per state once: the index holds the abbreviations, the values the counts.
# Reading both from the Series avoids column names that differ between pandas versions.
state_counts = districts['state_abb'].value_counts()

fig.add_trace(
    go.Choropleth(
        locations = state_counts.index,
        zmax = 1,
        z = state_counts.values,
        locationmode = 'USA-states',
        marker_line_color = 'white',
        geo = 'geo',
        colorscale = "cividis", 
    )
)
            
fig.update_layout(layout)   
fig.show()

plt.figure(figsize = (15, 8))
sns.set_style("white")
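# Build an explicit (state, count) frame so the column names do not depend on the pandas version
a = sns.barplot(data = districts['state'].value_counts().rename_axis('state').reset_index(name = 'count'), x = 'count', y = 'state', color = '#90afc5')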
plt.xticks([])
plt.yticks(fontname = 'monospace', fontsize = 14, color = '#283655')
plt.ylabel('')
plt.xlabel('')

a.spines['left'].set_linewidth(1.5)
for w in ['right', 'top', 'bottom']:
    a.spines[w].set_visible(False)
    
for p in a.patches:
    width = p.get_width()
    plt.text(0.5 + width, p.get_y() + 0.55 * p.get_height(), f'{int(width)}',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 15, color = '#283655')

plt.show()
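One thing worth checking with a dictionary lookup like the state abbreviation mapping in this cell is that `Series.map` silently produces NaN for any name that is not a key. A minimal sketch, using a shortened dictionary and a hypothetical spelling variant, lists the names that failed to map:

```python
import pandas as pd

# Shortened mapping for illustration
abbreviations = {"Utah": "UT", "Ohio": "OH", "District Of Columbia": "DC"}

# Hypothetical input with one spelling that differs from the dictionary key
states = pd.Series(["Utah", "Ohio", "District of Columbia", "Utah"])

mapped = states.map(abbreviations)

# Names that did not match any key end up as NaN after the mapping
unmapped = sorted(states[mapped.isna()].unique())
print(unmapped)
```

An empty list means every state found an abbreviation, so no district is silently missing from the map.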

In [None]:
# Build an explicit (locale, count) frame so the column names do not depend on the pandas version
fig = px.pie(districts['locale'].value_counts().rename_axis('locale').reset_index(name = 'count'), values = 'count', names = 'locale', width = 700, height = 700)

fig.update_traces(textposition = 'inside', 
                  textinfo = 'percent + label', 
                  hole = 0.7, 
                  marker = dict(colors = ['#90afc5','#336b87','#2a3132','#763626'], line = dict(color = 'white', width = 2)))

fig.update_layout(annotations = [dict(text = ' The count of districts <br>in each type <br>of areas', 
                                      x = 0.5, y = 0.5, font_size = 26, showarrow = False, 
                                      font_family = 'monospace',
                                      font_color = '#283655')],
                  showlegend = False)
                  
fig.show()
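The column names produced by `value_counts().reset_index()` differ between pandas versions, which matters for plotting calls that refer to columns by name. A minimal sketch of a version-independent way to get a tidy count table, on hypothetical locale values:

```python
import pandas as pd

# Hypothetical locale values
locale = pd.Series(
    ["Suburb", "Rural", "Suburb", "City", "Suburb", "Rural"], name="locale"
)

# Name the index and the count column explicitly
counts = locale.value_counts().rename_axis("locale").reset_index(name="count")
print(counts)
```

The resulting frame always has the columns `locale` and `count`, so a chart can refer to them without checking which pandas version is installed.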

In [None]:
districts.isna().sum()

Let's look at pct_black/hispanic, i.e. the average percentage of Black or Hispanic students in each 'locale' category. The chart suggests that city districts have the highest average share of Black or Hispanic students.

In [None]:
# Average share of Black or Hispanic students per locale
df_pct_black_hispanic=districts.groupby('locale').agg({'pct_black/hispanic':'mean'})
df_pct_black_hispanic=df_pct_black_hispanic.reset_index()
plt.figure(figsize=(16,8))
sns.barplot(x="locale", y="pct_black/hispanic", data=df_pct_black_hispanic,
                 palette="bwr_r")
plt.title("Average share of Black or Hispanic students by locale")

From the pct_free/reduced graph, we conclude that the average share of students eligible for free or reduced-price lunch is highest in city and town districts.

In [None]:
# Average share of students eligible for free or reduced-price lunch per locale
df_free_reduced=districts.groupby('locale').agg({'pct_free/reduced':'mean'})
df_free_reduced=df_free_reduced.reset_index()
plt.figure(figsize=(16,8))
sns.barplot(x="locale", y="pct_free/reduced", data=df_free_reduced,
                 palette="rocket")
plt.title("Average share of students eligible for free or reduced-price lunch by locale")

The average share of students eligible for free or reduced-price lunch is between 0.25 and 0.4, i.e. 25%-40%.

In [None]:
fig = plt.figure(figsize = (12, 18))
plt.figtext(0.6,0.7,'''The average share of students
eligible for free or reduced-price lunch
is between 0.25 and 0.4, i.e. 25%-40%''')
# histplot with a KDE overlay replaces the deprecated sns.distplot
sns.histplot(districts['pct_free/reduced'], kde = True, stat = 'density')

In [None]:
fig = plt.figure(figsize = (12, 9))
plt.figtext(0.38, 0.65, '''The average share of students who identified themselves
as Black or Hispanic is 23.2%. The most common value is 10%.''')
# histplot with a KDE overlay replaces the deprecated sns.distplot
sns.histplot(districts['pct_black/hispanic'], kde = True, stat = 'density')

**Products**

In [None]:
plt.figure(figsize = (12, 12))
sns.set_style("white")
plt.title('Count of products by subcategory', size = 35, x = 0.2, y = 1.06, fontname = 'monospace', color = '#000000')
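# Build an explicit (function, count) frame so the column names do not depend on the pandas version
a = sns.barplot(data = products['Primary Essential Function'].value_counts().rename_axis('function').reset_index(name = 'count'), x = 'count', y = 'function', color = '#90afc5')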
plt.xticks([])
plt.yticks(fontname = 'monospace', fontsize = 10, color = '#000000')
plt.ylabel('')
plt.xlabel('')

a.spines['left'].set_linewidth(1.5)
for w in ['right', 'top', 'bottom']:
    a.spines[w].set_visible(False)
    
for p in a.patches:
    width = p.get_width()
    plt.text(1 + width, p.get_y() + 0.55 * p.get_height(), f'{int(width)}',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 11, color = '#000000')

plt.show()
##########
# Count products per top-level category, skipping the 'x' placeholder for unlabeled products
fig = px.pie(products.query("Basic_category != 'x'")['Basic_category'].value_counts().rename_axis('category').reset_index(name = 'count'), values = 'count', names = 'category', width = 700, height = 700)

fig.update_traces(textposition = 'inside', 
                  textinfo = 'percent + label', 
                  hole = 0.7, 
                  marker = dict(colors = ['#90afc5','#336b87','#2a3132','#763626'], line = dict(color = 'white', width = 2)))

fig.update_layout(annotations = [dict(text = 'Count of products <br>by category', 
                                      x = 0.5, y = 0.5, font_size = 26, showarrow = False, 
                                      font_family = 'monospace',
                                      font_color = '#000000')],
                  showlegend = False)
                  
fig.show()

In [None]:
plt.figure(figsize=(15,10))
plt.ticklabel_format(style='plain')
sns.countplot(y="Primary Essential Function",data=products,order=products['Primary Essential Function'].value_counts().index,palette="ocean_r")
plt.title("Distribution of primary essential Sub-category",font="Georgia", fontsize=25)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Number of products', fontsize=20)
plt.ylabel('Sub-category', fontsize=20)

plt.show()

In [None]:
plt.rcParams.update({'font.size': 33,})
fig, ax  = plt.subplots(1, 2, figsize=(70, 15))
explode = (0.04, 0.04, 0.04, 0.04, 0.04)
labels = list(products['Sector(s)'].value_counts().index)
sizes = products['Sector(s)'].value_counts().values

patches, texts, autotexts = ax[0].pie(sizes, explode=explode, startangle=60, labels=labels, autopct='%1.0f%%', pctdistance=0.7, colors=["#c5d0bd","#5b863e","#6ebb3a","#d4e5c8", "#1e4104"])

texts[4].set_fontsize(2)
ax[0].add_artist(plt.Circle((0,0),0.6,fc='white'))
font = {'fontname':'Georgia'}
ax[0].title.set_text('Sector-wise % distribution of products')



plt.ticklabel_format(style='plain')
sns.countplot(y="Sector(s)",data=products, ax=ax[1],order=products['Sector(s)'].value_counts().index,palette="husl",linewidth=3)
ax[1].title.set_text("Sector-wise distribution of products")
plt.xticks(fontsize=30)
plt.yticks(fontsize=30)
plt.xlabel('Number of products', fontsize=35)
plt.ylabel('Sector', fontsize=35)

plt.show()


Next, we look at how the "pct_access" and "engagement_index" data vary over time.

In [None]:

products['primary_essential_category'] = products['Primary Essential Function'].str.split(" - ",expand=True)[0]
products['primary_essential_subcategory'] = products['Primary Essential Function'].str.split(" - ",expand=True)[1]
lp_id_virtual = products[products.primary_essential_subcategory == 'Virtual Classroom']['LP ID'].unique()

plt.rcParams.update({'font.size': 14,})
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(24, 6))

for product_id in lp_id_virtual:
    dummy = engagement[engagement.lp_id == product_id].groupby('time').pct_access.mean().to_frame().reset_index(drop=False)
    sns.lineplot(x=dummy.time, y=dummy.pct_access, label=products[products['LP ID'] == product_id]['Product Name'].values[0])
plt.legend()
plt.title('Variation of pct_access over time for Virtual Classroom products', fontsize = 20)
plt.show()

Because the engagement dataset covers weekends as well as weekdays, the curves above show regular ripples: access drops sharply every weekend, when students are not in class. To make the underlying trend easier to read, we remove the weekend data from the dataset.

In [None]:
engagement['weekday'] = pd.DatetimeIndex(engagement['time']).weekday
engagement_updated = engagement[engagement.weekday < 5]

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(24, 6))

for product_id in lp_id_virtual:
    dummy = engagement_updated[engagement_updated.lp_id == product_id].groupby('time').pct_access.mean().to_frame().reset_index(drop=False)
    sns.lineplot(x=dummy.time, y=dummy.pct_access, label=products[products['LP ID'] == product_id]['Product Name'].values[0])
plt.legend()
plt.title('Variation of pct_access over time for Virtual Classroom products', fontsize = 20)
plt.show()

With the weekend data removed, the pct_access curves no longer show the weekly ripples, and the variation over time is easier to read.
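The weekend filter relies on the pandas weekday convention, in which Monday is 0 and Sunday is 6, so `weekday < 5` keeps Monday through Friday. A minimal sketch on one hypothetical week of dates:

```python
import pandas as pd

# One hypothetical week, starting on Monday 2020-03-02
days = pd.DataFrame({"time": pd.date_range("2020-03-02", periods=7, freq="D")})

# Monday = 0, ..., Sunday = 6
days["weekday"] = days["time"].dt.weekday

# Keep Monday through Friday only
weekdays_only = days[days["weekday"] < 5]
print(weekdays_only)
```

This filter removes weekends only; school holidays and breaks remain in the data and can still appear as dips in the curves.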

In [None]:
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(24, 6))
for product_id in lp_id_virtual:
    dummy = engagement_updated[engagement_updated.lp_id == product_id].groupby('time').engagement_index.mean().to_frame().reset_index(drop=False)
    sns.lineplot(x=dummy.time, y=dummy.engagement_index, label=products[products['LP ID'] == product_id]['Product Name'].values[0])
plt.legend()
plt.title('Variation of engagement_index over time for Virtual Classroom products', fontsize = 20)
plt.show()

In [None]:
engagement.lp_id.head()

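# All products in the 'Digital Learning Platforms' subcategory
lp_id_digital = products[products.primary_essential_subcategory == 'Digital Learning Platforms']['LP ID'].unique()

# Override with six hand-picked product ids (the top 6 most accessed, per the plot title)
# to keep the chart readable
lp_id_digital = [36692, 92993, 71279, 25559, 64998, 61441]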

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(24, 6))

for product_id in lp_id_digital:
    dummy = engagement_updated[engagement_updated.lp_id == product_id].groupby('time').pct_access.mean().to_frame().reset_index(drop=False)
    sns.lineplot(x=dummy.time, y=dummy.pct_access, label=products[products['LP ID'] == product_id]['Product Name'].values[0])
plt.legend()
plt.title('Variation of pct_access over time for TOP 6 most accessed Digital Learning platforms', fontsize = 20)
plt.show()
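The six product ids above are hard-coded. As an alternative, a ranking of this kind could be derived from the data itself. The sketch below is one possible approach, assuming "most accessed" means the highest mean pct_access over the period; the toy ids and values are hypothetical.

```python
import pandas as pd

# Hypothetical engagement rows for three products
toy = pd.DataFrame({
    "lp_id": [101, 101, 202, 202, 303, 303],
    "pct_access": [1.0, 3.0, 10.0, 20.0, 0.5, 0.5],
})

# Mean pct_access per product, then the ids of the two highest means
top_ids = (
    toy.groupby("lp_id")["pct_access"]
    .mean()
    .nlargest(2)
    .index
    .tolist()
)
print(top_ids)
```

In the notebook, the same pattern could be applied to the rows whose lp_id belongs to the subcategory of interest, with `nlargest(6)` in place of `nlargest(2)`.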

In [None]:
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(24, 6))

for product_id in lp_id_digital:
    dummy = engagement_updated[engagement_updated.lp_id == product_id].groupby('time').engagement_index.mean().to_frame().reset_index(drop=False)
    sns.lineplot(x=dummy.time, y=dummy.engagement_index, label=products[products['LP ID'] == product_id]['Product Name'].values[0])
plt.legend()
plt.title('Variation of engagement_index over time for TOP 6 most accessed Digital Learning platforms', fontsize = 20)
plt.show()