# Loading Libraries

In [None]:
import glob
import warnings
import numpy as np 
import pandas as pd
import plotly as py
import seaborn as sns
import statistics as stat
import plotly.express as px
import plotly.graph_objs as go
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
from plotly.offline import init_notebook_mode
init_notebook_mode(connected = True)
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
import missingno as msno
%matplotlib inline
from datetime import datetime
import os

# Table of Contents

This notebook is divided into four major segments:

**1. Loading DataSets**

**2. Basic Exploration**

**3. Deriving insights**

**4. Conclusion**

# 1. Load DataSets

* Merging engagement data files for every district into one dataframe.

In [None]:
data_path = "../input/learnplatform-covid19-impact-on-digital-learning/"
files = os.listdir(data_path + 'engagement_data/')
print(len(files))  # number of per-district engagement files
# Read every district file, tag its rows with the district id taken from the
# file name, and concatenate once at the end instead of growing a frame in the loop.
frames = []
for file in files:
    temp_df = pd.read_csv(data_path + 'engagement_data/' + file, parse_dates = ['time'])
    temp_df['district_id'] = file.split('.')[0]
    frames.append(temp_df)
engagement_df = pd.concat(frames, ignore_index = True)

district_df = pd.read_csv(data_path + 'districts_info.csv')
product_df = pd.read_csv(data_path + 'products_info.csv')

**Drop Duplicates**

In [None]:
def my_drop_duplicate(df):
    before_len=len(df)
    df.drop_duplicates(inplace=True)
    after_len=len(df)
    diff_len=before_len-after_len
    diff_ratio=diff_len/before_len
    drop_info = {"before_len":before_len,"after_len":after_len,"diff_len":diff_len,"diff_ratio":diff_ratio}
    drop_info = pd.DataFrame([drop_info])
    return df, drop_info

* Dropping duplicate rows from the engagement, district and product datasets
* The output of the next cell shows the number of rows before and after removing duplicated data from all dataframes.

In [None]:
engagement_df, engagements_drop = my_drop_duplicate(engagement_df)
product_df, products_drop = my_drop_duplicate(product_df)
district_df, districts_drop = my_drop_duplicate(district_df)

drop_info = pd.concat([engagements_drop,products_drop,districts_drop]).set_axis(['engagement_df','product_df','district_df'])
drop_info
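
To make the columns of this summary concrete, here is a small sketch on a made-up frame. It assumes nothing about the real data and, unlike the helper above, leaves the input frame unmodified.

```python
import pandas as pd

# Toy frame with two duplicated rows (values are invented for illustration).
toy = pd.DataFrame({"a": [1, 1, 2, 2, 3], "b": ["x", "x", "y", "y", "z"]})

before_len = len(toy)
deduped = toy.drop_duplicates()  # returns a new frame; `toy` itself is unchanged
after_len = len(deduped)

summary = pd.DataFrame(
    [{
        "before_len": before_len,
        "after_len": after_len,
        "diff_len": before_len - after_len,
        "diff_ratio": (before_len - after_len) / before_len,
    }],
    index=["toy"],
)
print(summary)
```

`diff_ratio` is the share of rows that were exact duplicates of an earlier row.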

# 2. Basic Exploration

## 2.1 engagement_df

In [None]:
engagement_df.info()

In [None]:
print('Number of unique products: ', engagement_df.lp_id.nunique())
print('Number of unique districts: ', engagement_df.district_id.nunique())

In [None]:
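# distplot is deprecated in recent seaborn releases; histplot with kde=True is the suggested replacement
sns.histplot(engagement_df.pct_access, kde = True)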

In [None]:
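# distplot is deprecated in recent seaborn releases; histplot with kde=True is the suggested replacement
sns.histplot(engagement_df.engagement_index, kde = True)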

## 2.2 District Data

In [None]:
district_df.info()

#### 2.2.1 How many School districts per locale

In [None]:
fig, ax  = plt.subplots(figsize=(16, 8))
fig.suptitle('Locale Distribution', size = 20, font="Serif")
explode = (0.05, 0.05, 0.05, 0.05)
labels = list(district_df.locale.value_counts().index)
sizes = district_df.locale.value_counts().values
ax.pie(sizes, explode=explode,startangle=60, labels=labels,autopct='%1.0f%%', pctdistance=0.7, colors=["#FFFF33","#ff9100","#eaaa00","#6d6875"])
ax.add_artist(plt.Circle((0,0),0.4,fc='white'))
plt.show()

* About 59% of the districts are in suburbs, while towns have the fewest districts (approx. 6%).

#### 2.2.2 How many School Districts per State:

In [None]:
sns.countplot(data = district_df, y = 'state', 
              order = district_df.state.value_counts().index)
plt.xticks(rotation = 90)
plt.title('School Districts per State')

* Connecticut, Utah, Massachusetts and Illinois are the top 4 states by number of districts in the dataset, while Arizona, North Dakota, Minnesota and Florida have the fewest districts in the dataset.

#### 2.2.3 How school districts with different black/hispanic ratios are distributed across locales

In [None]:
sns.countplot(data = district_df,  x = 'pct_black/hispanic', 
              order = district_df['pct_black/hispanic'].value_counts().index, hue = 'locale')
plt.legend(loc = 'upper right')

**Remarks:**

* The districts with a higher black/hispanic population ratio are mostly in City or Suburban locales.
* Districts in City locales are spread roughly evenly across all black/hispanic ratio intervals.

#### 2.2.4 How districts with different socio-economic status are distributed across locales

In [None]:
sns.countplot(data = district_df,  x = 'pct_free/reduced', hue = 'locale', order = ['[0, 0.2[', '[0.2, 0.4[', '[0.4, 0.6[', '[0.6, 0.8[', '[0.8, 1['])
plt.legend(loc = 'upper right')

**Remarks:**
* A large chunk of districts in Cities and Towns have a pct_free/reduced ratio greater than 0.4.
* In contrast, a large chunk of districts in Suburban and Rural regions have a pct_free/reduced ratio less than 0.6.

#### 2.2.5 For each black/hispanic ratio what is the district count

In [None]:
fig, ax  = plt.subplots(figsize=(16, 8))
#fig.suptitle('', size = 20, font="Serif")
explode = (0.05, 0.05, 0.05, 0.05, 0.05)
labels = list(district_df['pct_black/hispanic'].value_counts().index)
sizes = district_df['pct_black/hispanic'].value_counts().values
ax.pie(sizes, explode=explode,startangle=60, labels=labels,autopct='%1.0f%%', pctdistance=0.7, colors=["#FFFF33","#ff9100","#eaaa00","#6d6875"])
ax.add_artist(plt.Circle((0,0),0.4,fc='white'))
plt.show()

**Remarks**

* Out of all the districts, 66% have very low black/hispanic ratios.
* Only a small proportion of districts have a high black/hispanic population ratio.

#### 2.2.6 For each free/reduced meal ratio what is the district count

In [None]:
sns.countplot(data = district_df,  x = 'pct_free/reduced', 
              order = district_df['pct_free/reduced'].value_counts().index)
plt.xticks(rotation = 0)
plt.show()

#### 2.2.7 Number of districts for every per person expenditure range

In [None]:
sns.countplot(data = district_df,  y = 'pp_total_raw', 
              order = district_df['pp_total_raw'].value_counts().index).set_title('Distribution of pp_total_raw across overall districts')
plt.xticks(rotation = 0)
plt.tight_layout()

**Remarks:**

* The plot shows that most districts have low or moderate per-person expenditure. One possible explanation is that a large number of districts also have a low or moderate ratio of free/reduced meals (as we saw above), which may go along with lower expenditure in those districts.

## 2.3 Products Data

**Splitting Primary Essential Function into major and sub-categories**

In [None]:
product_df['Primary Essential Function'] = product_df['Primary Essential Function'].astype('str')
product_df['PEFCategory'] = product_df['Primary Essential Function'].apply(lambda x: x.split(' - ')[0])
product_df['PEFSub_category'] = product_df['Primary Essential Function'].apply(lambda x: 'nan' if x=='nan' else ('-'.join(x.split(' - ')[1:]) if len(x.split(' - ')) > 1 else (x.split('-')[1])))
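
The split can be illustrated on a few sample strings. The sketch below is an assumption-laden simplification: the sample values are reconstructed from the sub-category names used later in this notebook, and `split_pef` is a hypothetical helper that mirrors the main branch of the lambdas above (it omits their fallback for values without a `' - '` separator).

```python
# Sample values in the assumed 'MAIN - Sub - Detail' layout of the column.
samples = [
    "CM - Virtual Classroom - Video Conferencing & Screen Sharing",
    "SDO - School Management Software - Mobile Device Management",
    "nan",  # what a missing value looks like after astype('str')
]

def split_pef(value):
    """Return (main_category, sub_category) for one Primary Essential Function string."""
    if value == "nan":
        return "nan", "nan"
    parts = value.split(" - ")
    # The first part is the main category; the rest is re-joined with '-'.
    return parts[0], "-".join(parts[1:])

for value in samples:
    print(split_pef(value))
```

Note that the sub-category is re-joined with a bare `-` (no spaces), so later filters have to use names such as `'Virtual Classroom-Video Conferencing & Screen Sharing'`.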

#### 2.3.1 How many products in every main Primary Essential Function category

In [None]:
fig, ax  = plt.subplots(figsize=(16, 8))
fig.suptitle('Primary Essential Function main', size = 20, font="Serif")
explode = (0.05, 0.05, 0.05, 0.05, 0.05)
labels = list(product_df['PEFCategory'].value_counts().index)
sizes = product_df['PEFCategory'].value_counts().values
ax.pie(sizes, explode=explode,startangle=60, labels=labels,autopct='%1.2f%%', pctdistance=0.7, colors=["#18ff9f","#2cfbff","#ffb703"])
ax.add_artist(plt.Circle((0,0),0.4,fc='white'))
plt.show()

**Remarks**

* About 73% of the products are of the 'Learning and Curriculum' type.
* Products in the main PEF categories 'Classroom Management' and 'SDO' appear in almost the same proportion.


#### 2.3.2 How many products in every Sector

In [None]:
fig, ax  = plt.subplots(figsize=(16, 8))
fig.suptitle('Sector Distribution', size = 20, font="Serif")
explode = (0.05, 0.05, 0.05, 0.1, 0.05)
labels = list(product_df['Sector(s)'].value_counts().index)
sizes = product_df['Sector(s)'].value_counts().values
ax.pie(sizes, explode=explode,startangle=60, labels=labels,autopct='%1.2f%%', pctdistance=0.7, colors=["#ff228a","#20b1fd","#ffb703"])
ax.add_artist(plt.Circle((0,0),0.4,fc='white'))
plt.show()

#### 2.3.4 How many products in every Primary Essential Function sub category

In [None]:
fig, ax = plt.subplots(figsize = [6, 8])
sns.countplot(data = product_df, y = 'Primary Essential Function', order = product_df['Primary Essential Function'].value_counts().index)

#### 2.3.5 Missing Values:

In [None]:
print(engagement_df.shape)

In [None]:
def missing_counts(df):
    temp_df = pd.DataFrame(df.isnull().sum(), columns = ['count'])
    temp_df['ratio'] = temp_df['count'] / len(df)  # share of rows with a missing value
    return temp_df

In [None]:
#engagement_df
missing_counts(engagement_df)

In [None]:
engagement_df[engagement_df.engagement_index.isnull()].pct_access.value_counts(dropna = False, normalize = True)

* In the engagement data, most of the missing values are found in engagement_index.
* Where engagement_index is missing, pct_access is almost always 0 (99.75% of the time) and occasionally missing as well (0.25% of the time).
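
The two checks above can be reproduced on a small made-up frame. The numbers here are invented and only show the mechanics: count missing values per column, then look at how `pct_access` is distributed in the rows where `engagement_index` is missing.

```python
import numpy as np
import pandas as pd

# Toy engagement data (values are illustrative, not taken from the dataset).
toy = pd.DataFrame({
    "pct_access": [0.0, 0.0, 0.0, np.nan, 0.4, 0.7],
    "engagement_index": [np.nan, np.nan, np.nan, np.nan, 12.0, 30.5],
})

# Missing count and ratio per column.
missing = pd.DataFrame({"count": toy.isnull().sum()})
missing["ratio"] = missing["count"] / len(toy)
print(missing)

# Distribution of pct_access (including NaN) where engagement_index is missing.
no_index = toy[toy["engagement_index"].isnull()]
shares = no_index["pct_access"].value_counts(dropna=False, normalize=True)
print(shares)
```

Passing `dropna=False` matters here: without it the rows where both columns are missing would silently disappear from the shares.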

In [None]:
missing_counts(district_df)

In [None]:
msno.heatmap(district_df,figsize=(10,5))

**Remarks:**
* To check whether the missing values are correlated, we can use the missingno heatmap.
* The heatmap shows how strongly the presence or absence of a value in one column is correlated with that in another column.
* It only shows the lower triangle of the correlation matrix and also skips a column's correlation with itself.
* +1 indicates complete positive correlation and -1 indicates complete negative correlation.

**Insights**
* We can see from the heatmap that missing values in each column are highly correlated with missing values in the other columns. For example, the correlation between state and locale is 1, which means that whenever the state value is missing, the locale value is missing too.
* The number of missing values is very large here, so dropping such rows would significantly reduce the data.
* No filling strategy seems appropriate in this case, so we leave those values as they are.
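
The quantity shown in the heatmap can be approximated by hand: turn each column into a 0/1 "is missing" indicator and correlate the indicators. The sketch below uses a made-up district table and plain pandas; it is meant to show the idea, not to reproduce missingno's exact implementation.

```python
import pandas as pd

# Toy district table (values are invented). state and locale are missing together.
toy = pd.DataFrame({
    "state": ["Utah", None, "Illinois", None, "Ohio"],
    "locale": ["City", None, "Suburb", None, "Rural"],
    "pp_total_raw": ["[8000, 10000[", None, None, "[10000, 12000[", "[8000, 10000["],
})

# 1 where a value is missing, 0 where it is present.
nullity = toy.isnull().astype(int)

# Pearson correlation between the missingness indicators of each column pair.
nullity_corr = nullity.corr()
print(nullity_corr.round(2))
```

A value of 1 between two columns means they are always missing in the same rows, which is the pattern described for state and locale.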

In [None]:
missing_counts(product_df)

**Remove rows in engagement data where product id or district id is missing**

In [None]:
engagement_df = engagement_df[~engagement_df.lp_id.isnull()]
engagement_df = engagement_df[~engagement_df.district_id.isnull()]

# 3. Deriving Insights

* **Adding quarter and week as columns for analysis**

In [None]:
#engagement_df['month'] = engagement_df.time.apply(lambda x: x.month if x.year == 2020 else x.month + 12)
engagement_df['quarter'] = engagement_df.time.apply(lambda x: x.quarter)
engagement_df['week'] = engagement_df.time.apply(lambda x: x.weekofyear)

* **Changing DataTypes of product ids and district ids to string for easy merge operations**

In [None]:
engagement_df['lp_id'] = engagement_df.lp_id.apply(lambda x: str(int(x)))
engagement_df['district_id'] = engagement_df.district_id.astype('str')
product_df['LP ID'] = product_df['LP ID'].astype('str')
#engagement_df['primary_key'] = engagement_df.lp_id + '_' + engagement_df.district_id
district_df['district_id'] = district_df.district_id.astype('str')
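
A short sketch of why the key columns are cast before merging. It assumes, as the `str(int(x))` conversion above suggests, that product ids in the engagement data are stored as floats while the product table stores integers; the frames and product names below are made up.

```python
import pandas as pd

# Toy frames: the same ids stored as float on one side and int on the other.
toy_engagement = pd.DataFrame({"lp_id": [11.0, 12.0, 11.0],
                               "engagement_index": [5.0, 7.0, 9.0]})
toy_products = pd.DataFrame({"LP ID": [11, 12],
                             "Product Name": ["Product A", "Product B"]})

# str(11.0) would give '11.0', which never matches '11', so go through int first.
toy_engagement["lp_id"] = toy_engagement["lp_id"].apply(lambda x: str(int(x)))
toy_products["LP ID"] = toy_products["LP ID"].astype(str)

merged = toy_engagement.merge(toy_products, how="left", left_on="lp_id", right_on="LP ID")
print(merged)
```

With both keys as plain strings of the same form, every engagement row finds its product.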

* **Filtering district and product data to retain only those products and districts that have corresponding data available in the engagement dataframe**

In [None]:
district_df = district_df[district_df.district_id.isin(engagement_df.district_id.unique())]
product_df = product_df[product_df['LP ID'].isin(engagement_df.lp_id.unique())]

## 3.1 How total engagement varied throughout the year

**Description**

To analyse how the online engagement across the country varies with time. 

### 3.1.1 Daily Plot

**PLOT**

In [None]:
def lineplot(df, agr_col, target_col, title):
    fig, ax1 = plt.subplots(figsize = [15,5])
    ymin = 0
    ymax = df[target_col].max()
    plt.vlines(x=datetime.strptime('2020-02-10', '%Y-%m-%d'), ymin=ymin, ymax=ymax, color = 'red', lw = 4)
    ax1.fill_between([datetime.strptime('2020-05-15', '%Y-%m-%d'), datetime.strptime('2020-09-15', '%Y-%m-%d')], 0, ymax, alpha = 0.15, color = 'green')
    ax1 = sns.lineplot(data = df, x = agr_col, y = target_col, color = 'red')
    plt.title(title)
    plt.show()

In [None]:
temp_df = engagement_df.groupby('time').agg({'engagement_index':'sum', 'pct_access':'sum'}).reset_index(drop = False)
lineplot(temp_df, 'time', 'engagement_index', 'Total Engagement per day')

**Design Decisions:**

* Here we have aggregated(sum) the engagement index for each day, this should give an idea about total engagement across all districts and products.
* The summer vacation period is marked as the shaded region in the plot.
* The onset of covid impact on the country is represented by the red vertical line on the plot.

**Observations**


* The frequent periodic dips in engagement correspond to weekends.
* Total engagement rises during February, which can also be considered the onset of Covid impact in the country.
* A large dip in engagement can be observed from May to August, which is generally the summer vacation period.
* The aggregated time series plot shows a noticeable increase in digital engagement after the summer vacation.

**Primary Insights**

* Setting the summer vacation aside, total engagement in the country increases steadily as the year progresses. With lockdowns and restricted reopening of schools and academic institutions, students' dependence on online learning platforms is also expected to grow over time. This steady increase is clearer in the next plot, which shows the weekly aggregate of total engagement with the summer vacation period removed from the x-axis.

### 3.1.2 Weekly Plot

In [None]:
def weekly_barplot(df, agr_col, target_col, title):
    fig, ax1 = plt.subplots(figsize = [15,5])
    ymin = 0
    ymax = df[target_col].max()
    # Plot the frame passed in as `df` rather than relying on a global variable
    ax1 = sns.barplot(data = df, x = agr_col, y = target_col, color = 'dodgerblue')
    ax1.set_box_aspect(10/len(ax1.patches)) #change 10 to modify the y/x axis ratio
    # x positions of the vertical lines are 0-based bar positions, not week numbers
    plt.vlines(x=5, ymin=ymin, ymax=ymax, color = 'red', lw = 4)
    plt.vlines(x=31, ymin=ymin, ymax=ymax, color = 'black', lw = 4)
    plt.title(title)
    plt.show()

**PLOT**

In [None]:
temp_df = engagement_df.groupby('week').agg({'engagement_index':'sum', 'pct_access':'mean'}).reset_index(drop = False)
data = temp_df[~temp_df.week.isin([1, 53] + list(range(20, 39)))].sort_values('week')
weekly_barplot(data, 'week', 'engagement_index', 'Weekly Engagement')

**Design Decisions:**

* Here we have aggregated(sum) the engagement index for each week, this should give an idea about total engagement across all districts and products.
* The first and last week in the year 2020 were removed from the plot as these weeks did not have all seven days.
* The summer vacation period has been removed from the x-axis in the plot.
* The onset of covid impact on the country is represented by the red vertical line on the plot.
* The black vertical line marks the second-to-last week of the year.
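
The incomplete first and last weeks can be checked with the standard library. pandas' `weekofyear` follows ISO 8601 week numbering, so the sketch below counts how many calendar days of 2020 fall in a given ISO week; `days_of_2020_in_iso_week` is a hypothetical helper written for this illustration, and it assumes the data covers the calendar year 2020.

```python
from datetime import date, timedelta

def days_of_2020_in_iso_week(week):
    """Count the calendar days of 2020 that fall in the given ISO week of 2020."""
    day = date(2020, 1, 1)
    count = 0
    while day.year == 2020:
        iso_year, iso_week, _ = day.isocalendar()
        if iso_year == 2020 and iso_week == week:
            count += 1
        day += timedelta(days=1)
    return count

for week in (1, 2, 52, 53):
    print(week, days_of_2020_in_iso_week(week))
```

ISO week 1 of 2020 starts on 30 December 2019 and week 53 runs into January 2021, so both contain fewer than seven days of 2020, while the weeks in between are complete.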

**Primary Insights**

* The plot above shows that engagement increases steadily as the year progresses.

**Secondary Insights**

* There are occasional dips in this weekly plot, which might be due to holidays. For example, the second-to-last week of the year falls around Christmas, which seems to explain the decrease in engagement in that week.

## 3.2 Engagement per week for each 'pct_black/hispanic' ratio category

**Description:**

Here we aggregate (sum) the engagement per week for each 'pct_black/hispanic' ratio category and visualize how the engagement varies with time. The purpose of this plot is to compare the time series patterns for different demographics (pct_black/hispanic ratio).


In [None]:
district_counts = engagement_df.merge(district_df, how = 'left', left_on = 'district_id', right_on = 'district_id').groupby('pct_black/hispanic').district_id.nunique().reset_index()
district_counts.columns = ['pct_black/hispanic', 'district_count']
district_counts = district_counts.sort_values('district_count', ascending = False)
district_counts.head()

**PLOT**

In [None]:
temp_df = engagement_df.merge(district_df, how = 'left', left_on = 'district_id', 
                              right_on = 'district_id').groupby(['pct_black/hispanic','week']).agg({'engagement_index':'sum', 'pct_access':'mean'}).reset_index(drop = False).merge(district_counts, how = 'left', left_on = 'pct_black/hispanic', right_on = 'pct_black/hispanic')
temp_df['engagement_index'] = temp_df.engagement_index/temp_df.district_count

In [None]:
fig, ax = plt.subplots(figsize = [15,5])
sns.lineplot(data = temp_df, x = 'week', y = 'engagement_index', hue = 'pct_black/hispanic')
ymin = 0
ymax = temp_df.engagement_index.max()
plt.vlines(x=7, ymin=ymin, ymax=ymax, color = 'red', lw = 4)
ax.fill_between([20,38], 0, ymax, alpha = 0.15, color = 'green')
leg = plt.legend()
for i in range(len(temp_df['pct_black/hispanic'].unique())):
    leg.get_lines()[i].set_linewidth(6)
plt.show()

**Design Decisions:**

* Engagement is aggregated at week level in the plot.
* As the number of districts varies across 'pct_black/hispanic' categories, it seems appropriate to divide each aggregated value by the district count of the respective category to normalize the aggregated values.
* Onset of covid impact on the country is marked with the red vertical line in the plot.
* Shaded region marks the summer vacation period.
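
The normalization step can be sketched on a toy example. All values below are invented, and the column name `engagement_per_district` is introduced only for this illustration (the notebook overwrites `engagement_index` instead).

```python
import pandas as pd

# Toy engagement rows for three districts (values are made up).
toy_engagement = pd.DataFrame({
    "district_id": ["a", "a", "b", "c", "c", "c"],
    "week": [2, 3, 2, 2, 3, 3],
    "engagement_index": [100.0, 120.0, 80.0, 30.0, 20.0, 10.0],
})
toy_districts = pd.DataFrame({
    "district_id": ["a", "b", "c"],
    "pct_black/hispanic": ["[0, 0.2[", "[0, 0.2[", "[0.2, 0.4["],
})

merged = toy_engagement.merge(toy_districts, how="left", on="district_id")

# Number of distinct districts in each ratio category.
counts = (merged.groupby("pct_black/hispanic")["district_id"]
          .nunique().rename("district_count").reset_index())

# Weekly total per category, then divide by the category's district count.
weekly = merged.groupby(["pct_black/hispanic", "week"], as_index=False)["engagement_index"].sum()
weekly = weekly.merge(counts, how="left", on="pct_black/hispanic")
weekly["engagement_per_district"] = weekly["engagement_index"] / weekly["district_count"]
print(weekly)
```

Without the division, a category would look more engaged simply because it contains more districts.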

**Primary Insights**

* The clearest trend in the aggregated plot is that regions with low black/hispanic population ratios show more digital engagement throughout the year. The time series pattern for each black/hispanic ratio category is also similar to the pattern for overall engagement across all categories.

## 3.3 Product Category Analysis:

In [None]:
def lineplot_weekly(df, agr_col, target_col, hue_col):
    fig, ax = plt.subplots(figsize = [15,5])
    sns.lineplot(data = df, x = agr_col, y = target_col, hue = hue_col)
    ymin = 0
    ymax = df[target_col].max()
    plt.vlines(x=7, ymin=ymin, ymax=ymax, color = 'red', lw = 4)
    #plt.vlines(x=11, ymin=ymin, ymax=ymax, color = 'black', lw = 4)
    ax.fill_between([20,38], 0, ymax, alpha = 0.15, color = 'green')
    plt.show()

### 3.3.1  How total engagement varied for PEF main categories throughout the year.

**Description:**

Here we aggregate (sum) the engagement per 'PEFCategory' and visualize how the engagement varies with time. The purpose of this plot is to see the time series patterns for the different major Primary Essential Functions.

In [None]:
product_counts = engagement_df.merge(product_df, how = 'left', left_on = 'lp_id', right_on = 'LP ID').groupby(['PEFCategory']).lp_id.nunique().reset_index(drop = False)
product_counts.columns = ['PEFCategory', 'product_count']
product_counts = product_counts.sort_values('product_count', ascending = False)
print(product_counts.head())

**PLOT**

In [None]:
temp_df = engagement_df.merge(product_df, how = 'left', left_on = 'lp_id', right_on = 'LP ID').groupby(['PEFCategory', 'week']).agg({'pct_access':'mean', 'engagement_index':'sum'}).reset_index(drop = False).merge(product_counts, how = 'left', left_on = 'PEFCategory', right_on = 'PEFCategory')
temp_df['engagement_index'] = temp_df.engagement_index/temp_df.product_count

lineplot_weekly(temp_df, 'week', 'engagement_index', 'PEFCategory')

**Design Decisions**

* We aggregate the engagement per week for each major Primary Essential Function category.

* As the number of products varies across major Primary Essential Function categories, it seems appropriate to divide each aggregated value by the product count of the respective category to normalize the data.

* The summer vacation is marked by the shaded region.
* The red vertical line separates the pre- and post-covid periods.

**Primary Insights**

* Throughout the year, the major PEF category with the highest engagement was 'School and District Operations' (SDO).
* The plot also shows a steady post-covid increase in engagement for the main PEF category 'CM' (Classroom Management), if we ignore the summer vacation.

### 3.3.2  Daily Trend of total engagement for each subcategory within PEF main category 'SDO'.

In [None]:
def lineplot_daily(df, agr_col, target_col, hue_col):
    fig, ax = plt.subplots(figsize = [15,5])
    sns.lineplot(data = df, x = agr_col, y = target_col, hue = hue_col)
    ymin = 0
    ymax = df[target_col].max()
    plt.vlines(x=datetime.strptime('2020-03-13', '%Y-%m-%d'), ymin=ymin, ymax=ymax, color = 'red', lw = 6)
    ax.fill_between([datetime.strptime('2020-05-15', '%Y-%m-%d'), datetime.strptime('2020-09-15', '%Y-%m-%d')], 0, ymax, alpha = 0.15, color = 'green')
    leg = plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    for i in range(len(df[hue_col].unique())):
        leg.get_lines()[i].set_linewidth(6)

**Description:**

In this section we analyze the daily trend in engagement for each sub-category of the PEF main category 'SDO'. The purpose is to see which SDO sub-categories got the most engagement throughout the year.

**PLOT**

In [None]:
temp_df = engagement_df.merge(product_df, how = 'left', left_on = 'lp_id', right_on = 'LP ID')[lambda x: x.PEFCategory == 'SDO'].groupby(['PEFSub_category', 'time']).agg({'pct_access':'sum', 'engagement_index':'sum', 'lp_id':'nunique'}).reset_index(drop = False)
temp_df['engagement_index'] = temp_df.engagement_index/temp_df.lp_id
lineplot_daily(temp_df, 'time', 'engagement_index', 'PEFSub_category')

**Design Decisions:**

* We have divided each aggregated value by the product count of the respective sub-category to normalize the data.
* The summer vacation is marked by the shaded region.
* The red vertical line separates the pre- and post-covid periods.

**Primary Insights**

* Within the 'SDO' PEF main category, the sub-category of products that got the most engagement is 'Learning Management Systems (LMS)'.

* The plot shows a sharp rise in digital engagement for the LMS sub-category after the onset of covid. A learning management system (LMS) is a software application for the administration, documentation, tracking, reporting, automation and delivery of educational courses, training programs, or learning and development programs. The learning management system concept emerged directly from e-Learning. (Source: Wikipedia). This rise seems logical, because schools being shut down all over the country after covid might have resulted in an increased dependency on e-learning and LMS.

* There is a significant decline in engagement for the 'School Management Software-Mobile Device Management' sub-category post covid. The most likely reason seems to be that schools were shut down during lockdowns, and schools and students relied on digital platforms for learning, which reduced the need for school management software. **The daily trend for the sub-category 'School Management Software-Mobile Device Management' is plotted separately below and clearly shows the decrease in total engagement post covid.**

In [None]:
data = temp_df[temp_df.PEFSub_category.isin(['School Management Software-Mobile Device Management'])]
lineplot_daily(data, 'time', 'engagement_index', 'PEFSub_category')

In [None]:
def time_series(df,col1,col2,col3):
    list1 = df[[col1,col2]].groupby([col1])[col2].sum().sort_values(ascending=False).index[:10].tolist()
    
    df = df[df[col1].isin(list1)].reset_index(drop=True)[[col3, col1, col2]]

    df = df.pivot_table(index=col3, columns=col1, values=col2)

    fig = px.line(df, facet_col=col1, facet_col_wrap=1, width=800, height=800)

    fig.show()

### 3.3.3 Daily Trend of total engagement for each subcategory within PEF main category 'CM'.

**Description:**

In this section we will analyze the daily engagement trend for each sub-category in the PEF Main category 'CM'. The purpose of this section is to see why the engagement for PEF main category 'CM' shows a steady increase post covid(ignoring the summer vacation).

**PLOT**

In [None]:
temp_df = engagement_df.merge(product_df, how = 'left', left_on = 'lp_id', right_on = 'LP ID')[lambda x: x.PEFCategory == 'CM'].groupby(['PEFSub_category', 'time']).agg({'pct_access':'sum', 'engagement_index':'sum', 'lp_id':'nunique'}).reset_index(drop = False)
temp_df['engagement_index'] = temp_df.engagement_index/temp_df.lp_id
lineplot_daily(temp_df, 'time', 'engagement_index', 'PEFSub_category')

**Design Decisions:**

* We have divided each aggregated value by the product count of the respective sub-category to normalize the data.
* The summer vacation is marked by the shaded region.
* The red vertical line separates the pre- and post-covid periods.

**Primary Insights**

* The sub-category 'Virtual Classroom-Video Conferencing & Screen Sharing' shows a sharp increase post covid. This is probably because schools were shut down post covid, and schools and students turned to virtual platforms to conduct their classes and studies in an effort to adapt to limited or no access to real classroom environments.

* The sharp increase in engagement for the sub-category 'Virtual Classroom-Video Conferencing & Screen Sharing' appears to be responsible for the post-covid increase in engagement for the PEF main category 'CM' (seen in section 3.3.1).

* No significant increase in engagement is seen in the rest of the sub-categories.

* **The Virtual Classroom products which got maximum engagement after covid were "Meet" and "ZOOM", which can be observed from the plot shown in the next cell.**

In [None]:
temp_df = engagement_df.merge(product_df, how = 'left', left_on = 'lp_id', right_on = 'LP ID')[lambda x: x.PEFSub_category == 'Virtual Classroom-Video Conferencing & Screen Sharing'].groupby(['Product Name', 'time']).agg({'pct_access':'sum', 'engagement_index':'sum', 'lp_id':'nunique'}).reset_index(drop = False)
temp_df['engagement_index'] = temp_df.engagement_index/temp_df.lp_id
lineplot_daily(temp_df,'time','engagement_index','Product Name')

## 3.4 How has student engagement with different products evolved over the course of the pandemic?

**Description**

Here we analyze the change in digital engagement between the 1st and 4th quarter of the year for products belonging to different product categories and sub-categories.

This exercise lets us quantify the change in product usage between the pre-covid period (1st quarter) and the current period (last quarter).

### 3.4.1 Scatter plots for absolute change vs percent change in the quarterly engagement for products belonging to different categories.

In [None]:
temp_df = engagement_df.merge(product_df, how = 'inner', left_on = 'lp_id', right_on = 'LP ID').groupby(['lp_id', 'quarter']).agg({'engagement_index':'sum', 'pct_access':'mean'}).reset_index(drop = False)
keys = temp_df.lp_id.value_counts()[lambda x: x == 4].index
temp_df = temp_df[temp_df.lp_id.isin(keys)]

df = temp_df[['lp_id', 'quarter', 'engagement_index']].set_index(['lp_id', 'quarter'])['engagement_index'].unstack('quarter').add_prefix('q').reset_index()

df['diff'] = df.q4 - df.q1
df['per_diff'] = (df.q4 - df.q1)*100/df.q1

df = df.merge(product_df, how = 'inner', left_on = 'lp_id', right_on = 'LP ID')

**PLOT**

In [None]:
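fig, ax = plt.subplots(1, 2, figsize = [12,5])
# Left: all products. Right: the same data, zoomed in on the cluster near the origin.
sns.scatterplot(ax = ax[0], data = df, x = 'diff', y = 'per_diff', hue = 'PEFCategory')
sns.scatterplot(ax = ax[1], data = df, x = 'diff', y = 'per_diff', hue = 'PEFCategory')
ax[0].legend()
ax[1].legend()
# Set the limits explicitly on the right-hand axes instead of relying on the current axes
ax[1].set_xlim([-1500000, 5000000])
ax[1].set_ylim([-200, 1000])
plt.tight_layout()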

**Design Decisions**

* Once we have the quarterly engagement per product, we calculate the absolute change in the quarterly usage of each product (the difference between the first and last quarter).

* We also look at the percentage change in the quarterly usage of each product. This is important because a high absolute change may not be significant if a product already had high engagement in the first quarter.

* So here, we are analysing on the basis of both absolute and percent change in the engagement. The purpose is to come up with the Product categories whose products gained maximum interest of students in terms of both absolute and percent increase in engagement.

* We are considering only those products in the analysis which have data for all the four quarters.

* In the plot on the left side above, there are some extreme points along both x and y axes, so there is another plot which focuses on the cluster of points close to origin.
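
The two change measures can be sketched on made-up quarterly totals. The products `p1`, `p2`, `p3` and their numbers are invented; the steps mirror the preparation cell above: keep products with all four quarters, pivot quarters into columns, then compute the absolute and percent change.

```python
import pandas as pd

# Toy quarterly engagement per product; p3 only has data for two quarters.
toy = pd.DataFrame({
    "lp_id": ["p1"] * 4 + ["p2"] * 4 + ["p3"] * 2,
    "quarter": [1, 2, 3, 4, 1, 2, 3, 4, 1, 2],
    "engagement_index": [100.0, 150.0, 50.0, 300.0,
                         2000.0, 1800.0, 900.0, 2200.0,
                         10.0, 20.0],
})

# Keep only products that have a row for all four quarters.
complete = toy["lp_id"].value_counts()[lambda s: s == 4].index
toy = toy[toy["lp_id"].isin(complete)]

# One row per product, one column per quarter (q1..q4).
wide = (toy.set_index(["lp_id", "quarter"])["engagement_index"]
        .unstack("quarter").add_prefix("q").reset_index())

wide["diff"] = wide["q4"] - wide["q1"]              # absolute change
wide["per_diff"] = wide["diff"] * 100 / wide["q1"]  # percent change relative to q1
print(wide)
```

In this toy data both remaining products gain the same absolute amount, but the percent change differs by a factor of twenty, which is why the analysis looks at both measures together.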

**Primary Insights**

* The absolute change vs percentage change plot shows that most of the products with both a significant absolute change and a significant percent change in engagement between the first and fourth quarters are of the types 'Learning and Curriculum' and 'Classroom Management'.

### 3.4.2 Average percent change in engagement(between first and last quarter of the year) per product for major Primary Essential Function categories

**Description:**

The following plot shows the percent change in the engagement(between first and last quarter of the year) averaged across products for each Primary Essential Main Category.

**PLOT**

In [None]:
fig, ax = plt.subplots(figsize = [6, 5])
data = df.groupby('PEFCategory').per_diff.mean()
data.plot(kind = 'barh', logx = True)
plt.show()


**Primary Insights**

* Even if we look at the average percent increase in engagement alone, the categories 'Learning and Curriculum' and 'Classroom Management' are at the top. For the other categories, the percent change is significantly smaller.

### 3.4.3 Further digging in to PEF Sub_categories:

**Description:**

In this section we will be looking at the percent change in engagement averaged across products for each Primary Essential Function sub category.

**PLOT**

In [None]:
fig, ax = plt.subplots(figsize = [5, 20])
data = df.groupby('Primary Essential Function').per_diff.mean()
data.plot(kind = 'barh')
plt.xlim([-200,1000])
plt.show()

**Design Decisions:**

* To be able to visualize the small bars properly, we have restricted the x axis limit from -200 to 1000.

**Primary Insights**

* As expected, the sub-categories with the maximum average percent change belong to 'Learning and Curriculum' and 'Classroom Management', which we also highlighted in the earlier plots.


* Learning and Curriculum:
    * The sub-category 'Content Creation and Curation' saw the maximum growth in terms of average percent change in digital engagement between the 1st and 4th quarter.
    * Next was "Digital Learning Platforms", which is also expected because people moved more and more to digital learning after covid.
    * Many other product sub-categories also had increased engagement in the last quarter, such as "Sites, References and Learning Materials", "Courseware and Textbooks", "Streaming services", "Learning Materials and Supplies", etc. This is also an expected trend.


* Classroom Management:
    * The sub-category within 'Classroom Management' which saw the maximum average percent growth is "Virtual Classroom-Video Conferencing and Screen Sharing". This is reasonable, because post covid most educational institutions and schools turned to virtual classroom and video conferencing tools to continue their interactions with students.


* School and District Operations:
    * Most of the sub-categories within this category have seen a negative average percent change in engagement from the 1st quarter to the 4th quarter, which means engagement decreased for them. For example, engagement has decreased for sub-categories like Human Resources, Safety Compliances, Student Information Systems etc. The products in these sub-categories are generally useful when schools are operating, and hence are expected to see a negative impact on their usage post covid.

# 4. Conclusion

There are several expected trends which are seen in the usage of online products as the pandemic progressed throughout the year. Some of these trends are summarized below.

* Overall Digital Engagement increased right after the onset of Covid

* Overall Digital Engagement continued to increase steadily throughout the year.

* The regions with low black/hispanic population ratios are showing more digital engagement throughout the year

* Usage of school management software decreased post covid, as the shutting down of schools during lockdowns reduced the need for such software.

* 'Virtual Classroom' software and platforms showed a sharp increase in engagement post covid, because schools and students turned to virtual platforms to conduct their classes and studies in order to adapt to limited or no access to real classroom environments.

* With people moving to digital learning as the year passed, there was an increase in the usage of products related to Digital Learning Platforms. As part of digital learning, students also engaged more with product sub-categories like "Sites, References and Learning Materials", "Courseware and Textbooks", "Streaming services", "Learning Materials and Supplies", etc., which are all related to Learning and Curriculum.