**Team Number: 01 From KLE Technological University**

**Team Leader: Rohini K Katti**

**Team Members: Shivani C Guranalli, Madhura Nagaraj Nayak, Soumya Jakkali**

This study analyzes how engagement with digital learning relates to factors such as district demographics, broadband access, and state- or national-level policies and events.


**Challenges:**

*  Analyze the picture of digital connectivity and engagement in 2020.
*  What is the effect of the COVID-19 pandemic on online and distance learning, and how might this evolve in the future?
*  How does student engagement with different types of education technology change over the course of the pandemic?
*  How does student engagement with online learning platforms relate to different geography? Demographic context (e.g., race/ethnicity, ESL, learning disability)? Learning context? Socioeconomic status?
*  Do certain state interventions, practices or policies (e.g., stimulus, reopening, eviction moratorium) correlate with an increase or decrease in online engagement?


### **For quick highlights of the notebook [click here](https://drive.google.com/file/d/1qQ76yP1FhZzsYA9PkOqZpFEnjFM9sv85/view?usp=sharing)**

Importing all the necessary libraries

In [None]:
#importing libraries
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly import figure_factory as FF
from plotly.offline import iplot
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
from matplotlib import patches
import plotly.express as px
%pylab inline

Imported all the necessary libraries.


Loading the data files.

# **districts_info.csv**






---
**District information data**      


The district file districts_info.csv includes information about the characteristics of school districts, including data from NCES (2018-19), FCC (Dec 2018), and Edunomics Lab.

---

In [None]:
#Loading district_info.csv
district_info=pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")
district_info.head()

Let's see the number of rows and columns in districts_info.csv.

In [None]:
print("Rows ",district_info.shape[0])
print("Columns ",district_info.shape[1])

The attributes of districts_info.csv are described below.

---



In [None]:
district_attributes = [['Sl No' , 'Attribute' , 'Description'],
[1, 'district_id', 'The unique identifier of the school district'],[2, 'state', 'The state where the district resides'],[3, 'locale', "NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. See Locale Boundaries User's Manual for more information."],[4, 'pct_black/hispanic', 'Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data'],[5, 'pct_free/reduced', 'Percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data'],
[6, 'county_connections_ratio', 'Ratio (residential fixed high-speed connections over 200 kbps in at least one direction/households) based on the county level data from FCC Form 477 (December 2018 version). See FCC data for more information.'],[7, 'pp_total_raw', "Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERDS) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district."]]
figure1 = FF.create_table(district_attributes,height_constant = 30)
iplot(figure1)

In [None]:
district_info.describe(include='all')

---
From the summary above, we can infer that the data is mostly made up of districts in Connecticut and districts in the Suburban locale.

An important part of analyzing data is studying its missing values.

Let's see whether the districts data has any missing values.

---

In [None]:
print("Percentage of Nulls present in the Districts data are as follows")
districtna=((district_info.isnull().sum())/(district_info.shape[0]))*100
districtna
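The null-percentage calculation used in this notebook can also be written more compactly. The sketch below is only an illustration on a made-up toy frame (the `toy` name and its values are assumptions, not the real district data): because `isnull()` returns booleans, the column-wise mean is already the fraction of missing values.

```python
import pandas as pd

# Toy frame standing in for districts_info.csv (values are made up)
toy = pd.DataFrame({
    "state": ["Utah", None, "Ohio", None],
    "locale": ["City", None, "Rural", None],
    "pp_total_raw": [None, None, "[8000, 10000[", None],
})

# isnull() yields booleans; their column-wise mean is the fraction of nulls
null_pct = toy.isnull().mean() * 100
print(null_pct)
```

This gives the same numbers as dividing `isnull().sum()` by the row count.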

The percentage of null values is high, so we have to find ways to handle the missing values.

In the given data, we noticed that wherever the state is NaN in a row, all other attributes are missing as well. So we remove all rows with a NaN state.

In [None]:
#Dropping all NaN state values as they do not contain any information about other attributes
district_info = district_info[district_info.state.notna()].reset_index(drop=True)
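Before dropping rows, it can be worth confirming that rows without a state really carry no other information. The following is a minimal sketch on a hypothetical toy frame (the names `toy`, `missing_state` and `cleaned` are illustrative, and `district_id` is assumed to be the only column that is always present).

```python
import pandas as pd

# Toy frame: the second row has no state and no other attributes
toy = pd.DataFrame({
    "district_id": [1, 2, 3],
    "state": ["Utah", None, "Ohio"],
    "locale": ["City", None, "Rural"],
    "pp_total_raw": ["[8000, 10000[", None, None],
})

missing_state = toy[toy["state"].isna()]

# Ignore the identifier column, then check that every remaining cell is null
all_empty = bool(missing_state.drop(columns="district_id").isna().all().all())
print("Rows without a state are otherwise empty:", all_empty)

# Only if the check passes is dropping these rows lossless
cleaned = toy.dropna(subset=["state"]).reset_index(drop=True)
```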

In [None]:
print("Percentage of Nulls present in the Districts data are as follows")
districtna=((district_info.isnull().sum())/(district_info.shape[0]))*100
districtna

Now we can see that some pct_free/reduced values are missing.

Since only a small amount of data is missing, let's explore ways of filling it.

In [None]:
#No of rows with null values
nulls=district_info[(district_info['pct_free/reduced'].isnull())]
nulls

Most of the missing free/reduced lunch values are in Massachusetts.

Does this mean Massachusetts has not adopted any policy to provide free/reduced-price lunch to students, or is the data about its policy simply missing?



In [None]:
data_massachets=district_info[district_info['state']=='Massachusetts']
data_massachets

According to https://nces.ed.gov/programs/digest/d20/tables/dt20_204.10.asp, the share of students in the USA eligible for free or reduced-price lunch falls roughly in the 40% to 60% range. We therefore fill the missing values with the bin [0.4, 0.6[.


In [None]:
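#Assigning the result back avoids chained-assignment issues with inplace=True on a column
district_info['pct_free/reduced'] = district_info['pct_free/reduced'].fillna('[0.4, 0.6[')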

We'll now look at the missing expenditure per pupil values.

In [None]:
exp_null=district_info[district_info['pp_total_raw'].isnull()]
exp_null

Connecticut and a few other states have a lot of missing expenditure data.

In [None]:
district_info[district_info['state']=='Connecticut']

Because Connecticut has null values for average spending per pupil, we fill those values manually.

According to https://educationdata.org/public-education-spending-statistics#connecticut, Connecticut spends roughly 20,000 dollars per pupil.




In [None]:
#Note: this assigns the bin to every Connecticut row, not only to the rows with missing values
district_info['pp_total_raw']=np.where(district_info['state']=='Connecticut','[18000, 20000[' , district_info.pp_total_raw)
district_info

For the state of California, all pp_total_raw values are null.

In [None]:
district_info[district_info['state']=='California']

Since the average expenditure per pupil is null for California, we fill those values manually.

From https://educationdata.org/public-education-spending-statistics#california we obtained a per-pupil expenditure for California of approximately 12,000 dollars.




In [None]:
district_info['pp_total_raw']=np.where(district_info['state']=='California','[12000, 14000[' , district_info.pp_total_raw)
district_info

In [None]:
district_info[district_info['state']=='Ohio']

We can see that there is no per-pupil total expenditure data for Ohio either.

From https://educationdata.org/public-education-spending-statistics#ohio we obtained a per-pupil expenditure for Ohio of around 12,000 to 14,000 dollars.

In [None]:
district_info['pp_total_raw']=np.where(district_info['state']=='Ohio','[12000, 14000[' , district_info.pp_total_raw)
district_info

In [None]:
print("Percentage of Nulls present in the Districts data are as follows")
districtna=((district_info.isnull().sum())/(district_info.shape[0]))*100
districtna

As the county_connections_ratio value is the same almost everywhere, it is better to drop that column.

In [None]:
district_info.drop('county_connections_ratio',axis = 1, inplace = True)
print("Dropped county_connections_ratio column...")

In [None]:
print("Percentage of Nulls present in the Districts data are as follows")
districtna=((district_info.isnull().sum())/(district_info.shape[0]))*100
districtna

---
## **Summary of handling missing values :**



*   Missing values can be handled in several ways, such as dropping the tuple or filling it with a particular value.
*   Removed all tuples with NaN in the state attribute.
*   Filled pct_free/reduced and pp_total_raw manually by referring to external sources.
*   Removed the county_connections_ratio column, as it has the same value in almost all tuples.

---

# **Exploring district_info data**

In [None]:
#Count of districts per state

fig1 = district_info['state'].value_counts().plot(kind='bar', title='Count of districts per state', figsize=(12,6))

plt.xticks(rotation=90)

fig1.text(9.6, -9.5,'fig 1',style = 'italic',fontsize = 20)


From *fig 1* we can infer that,

Connecticut has the most school districts in the given data, followed by Utah.

The data covers 23 of the 50 US states.

In [None]:
#Plotting locale
fig2 = district_info.groupby('locale').size().plot(kind='pie',figsize=(20,7),shadow=True,autopct='%1.1f%%')
plt.legend()
fig2.text(0.01, -1.3,'fig 2',style = 'italic',fontsize = 20)

About 60 percent of the given data is about Suburban locale (From *fig 2*)

Let's write a function that converts the range values of pct_black/hispanic, pct_free/reduced and pp_total_raw into their averages, which are easier to work with.

In [None]:
#Average function
def avg_ranges(x):
    return np.array(str(x).strip("[").split(",")).astype(float).mean()
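To make the range-to-average step concrete, here is a small standalone sketch. The function name `bin_midpoint` is hypothetical and is not the notebook's `avg_ranges`; it is assumed that every bin is written as `[low, high[`, as in the district data.

```python
import numpy as np

def bin_midpoint(value):
    """Return the midpoint of a '[low, high[' string, or NaN if the value is missing."""
    if value is None or (isinstance(value, float) and np.isnan(value)):
        return np.nan
    # Remove the surrounding brackets and spaces, then split on the comma
    low, high = (float(part) for part in str(value).strip("[] ").split(","))
    return (low + high) / 2

print(bin_midpoint("[0.4, 0.6["))
print(bin_midpoint("[18000, 20000["))
print(bin_midpoint(float("nan")))
```

Handling missing values explicitly keeps the behaviour obvious, whereas `avg_ranges` relies on the string "nan" converting to a float NaN.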

In [None]:
district_info['avg_black_hispanic'] = district_info['pct_black/hispanic'].apply(avg_ranges)
district_info['avg_reduced_lunch'] = district_info['pct_free/reduced'].apply(avg_ranges)

district_info['avg_spent_per_pupil'] = district_info['pp_total_raw'].apply(avg_ranges)

In [None]:
#Filling missing values of expenditure per pupil with the overall median, as the data is skewed
district_info['avg_spent_per_pupil'] = district_info['avg_spent_per_pupil'].fillna(district_info['avg_spent_per_pupil'].median())
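The cell above fills every gap with a single median computed over all districts. If a per-state median is preferred, one possible approach is sketched below on made-up numbers (the `toy` frame and the `filled` column are illustrative assumptions, not part of the notebook).

```python
import numpy as np
import pandas as pd

# Toy data: one missing value per state (numbers are made up)
toy = pd.DataFrame({
    "state": ["Utah", "Utah", "Utah", "Ohio", "Ohio"],
    "avg_spent_per_pupil": [7000.0, 9000.0, np.nan, 13000.0, np.nan],
})

# Median of each state, broadcast back to the original row order
state_median = toy.groupby("state")["avg_spent_per_pupil"].transform("median")

# Fill with the state median first, then fall back to the overall median
# for states where every value is missing
toy["filled"] = (
    toy["avg_spent_per_pupil"]
    .fillna(state_median)
    .fillna(toy["avg_spent_per_pupil"].median())
)
print(toy)
```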

Grouping the states in accordance with other useful parameters.

In [None]:
grouped_districts = district_info.groupby(by=["state"])[['avg_black_hispanic', 'avg_reduced_lunch', 'avg_spent_per_pupil']].mean()
grouped_districts

In [None]:
#grouping above average districts
grouped_districts["above_avg_hispanic"] = grouped_districts.avg_black_hispanic.apply(lambda x: 1 if x>grouped_districts.avg_black_hispanic.mean() else 0)
grouped_districts["above_avg_lunch"] = grouped_districts.avg_reduced_lunch.apply(lambda x: 1 if x>grouped_districts.avg_reduced_lunch.mean() else 0)
grouped_districts["above_avg_pupil"] = grouped_districts.avg_spent_per_pupil.apply(lambda x: 1 if x>grouped_districts.avg_spent_per_pupil.mean() else 0)
grouped_districts
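The above-average flags can also be computed without `apply`, which avoids recomputing the mean for every row. This is a sketch on a toy frame with made-up numbers; only the column names mirror the notebook.

```python
import pandas as pd

# Toy per-state averages (numbers are made up)
toy = pd.DataFrame(
    {"avg_spent_per_pupil": [8000.0, 12000.0, 19000.0]},
    index=["Utah", "Ohio", "Connecticut"],
)

# Compare the whole column against its mean in one vectorized step
mean_spend = toy["avg_spent_per_pupil"].mean()
toy["above_avg_pupil"] = (toy["avg_spent_per_pupil"] > mean_spend).astype(int)
print(toy)
```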

Let's plot how each parameter varies across states.

In [None]:
fig, ax = plt.subplots(3,1,figsize=(16,12), sharex=True)

sns.barplot(ax=ax[0], x=grouped_districts.index, y=grouped_districts["avg_black_hispanic"], hue=grouped_districts.above_avg_hispanic, dodge=False)
ax[0].axhline(grouped_districts["avg_black_hispanic"].mean(), c="black", linestyle="--")
ax[0].set_title("Percentage of Black/Hispanic per State", size=16)
ax[0].set_xlabel(None)
ax[0].set_ylabel("Percentage", size=12)
ax[0].legend("")
ax[0].annotate("USA_Average",size=16, xy=(19, 0.3), xytext=(20, 0.6),
            arrowprops=dict(facecolor='black', shrink=0.05),
            )

sns.barplot(ax=ax[1], x=grouped_districts.index, y=grouped_districts["avg_reduced_lunch"], hue=grouped_districts.above_avg_lunch, dodge=False)
ax[1].axhline(grouped_districts["avg_reduced_lunch"].mean(), c="black", linestyle="--")
ax[1].set_title("Percentage of people eligible for Free/reduced Lunch fees", size=16)
ax[1].set_xlabel(None)
ax[1].set_ylabel("Percentage", size=12)
ax[1].legend("")
ax[1].annotate("USA_Average",size=16, xy=(15, 0.38), xytext=(16, 0.6),
            arrowprops=dict(facecolor='black', shrink=0.05),
            )


sns.barplot(ax=ax[2], x=grouped_districts.index, y=grouped_districts["avg_spent_per_pupil"], hue=grouped_districts.above_avg_pupil, dodge=False)
ax[2].axhline(grouped_districts["avg_spent_per_pupil"].mean(), c="black", linestyle="--")
ax[2].set_title("Expenditure per Pupil - State wise", size=16)
ax[2].set_xlabel(None)
ax[2].set_ylabel("Expenditure($)", size=12)
ax[2].legend("")
ax[2].annotate("USA_Average",size=16, xy=(15, 12500), xytext=(16, 17500),
            arrowprops=dict(facecolor='black', shrink=0.05),
            )

plt.xticks(size=12, rotation=90)

plt.subplots_adjust(hspace=0.4)
fig.show()
fig.text(0.4, -0.1,'fig 3',style = 'italic',fontsize = 20)

From *fig 3*,
we can infer that Arizona has the highest percentage of Black/Hispanic students and a high percentage of students eligible for free/reduced lunch.

Connecticut, New York, and the District of Columbia all spend more per pupil than the other states in the data.
Florida spends the least per pupil in the given data.


In [None]:
district_info.columns

In [None]:
plt.figure(figsize=(9,7))

fig4 = sns.countplot(x='pct_black/hispanic',data=district_info, palette='rainbow',hue='locale')

plt.title("Count of black/hispanic people, Separated by locale")

plt.legend(loc='upper right')

fig4.text(1.6, -9.5,'fig 4',style = 'italic',fontsize = 20)


From *fig 4* we observe that,
* City districts have the highest percentage of Black/Hispanic students.
* Town districts have the lowest percentage of Black/Hispanic students.

In [None]:
plt.figure(figsize=(9,7))

fig5 = sns.countplot(x='pct_free/reduced',data=district_info, palette='rainbow',hue='locale')

plt.title("Count of  people eligible for free/reduced lunch, Separated by locale")

plt.legend(loc='upper right')

fig5.text(1.6, -8.5,'fig 5',style = 'italic',fontsize = 20)

From *fig 5* we can see that city districts have the highest percentage of students eligible for free/reduced lunch, which suggests that the socioeconomic status of city districts might be lower.


In [None]:
plt.figure(figsize=(20,7))
plt.xticks(rotation = 90)
fig6 = sns.countplot(x = 'state', data =district_info, hue = 'locale')
fig6.text(10.6, -9.5,'fig 6',style = 'italic',fontsize = 20)

From *fig 6*,

1) Connecticut mostly consists of districts in the Suburban locale.

2) The District of Columbia and Arizona have only City districts.

In [None]:
grouped_df = district_info.groupby(by=["state"])[['avg_black_hispanic', 'avg_reduced_lunch', 'avg_spent_per_pupil']].mean()

#is there a relationship between the two ratios?
sns.heatmap(grouped_df.corr(),annot=True)

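avg_black_hispanic and avg_reduced_lunch have a strong positive correlation across states. This suggests that states with a higher share of Black/Hispanic students also tend to have a higher share of students eligible for free or reduced-price lunch. A state-level correlation alone does not tell us which individual students are eligible.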

---
## **Summary:**

General Insights:

 

*   The dataset contains the most school districts for Connecticut, followed by Utah.
*   Suburb is the most common (59%) locale type for the districts in the dataset.
*   Arizona has the highest percentage of Black/Hispanic students.
*   The percentage of students receiving free/reduced lunch is highest in Minnesota, and lowest in New Jersey, Arizona and North Dakota.
*   Connecticut, New York, and the District of Columbia all spend more per pupil than the other states in the data.
*   City districts have the highest percentage of Black/Hispanic students and town districts have the lowest.
*   City districts have the highest percentage of students eligible for free/reduced lunch, which suggests that the socioeconomic status of city districts might be lower.

---

# **products_info.csv**

---
**Product information data**
The product file products_info.csv includes information about the characteristics of the top 372 products with most users in 2020. The categories listed in this file are part of LearnPlatform's product taxonomy. Data were labeled by our team. Some products may not have labels due to being duplicate, lack of accurate url or other reasons.

---

In [None]:
products_info=pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')

products_info.head()


In [None]:
products_info.shape

The attributes of products_info.csv are described below.

In [None]:
product_attributes = [['Sl No' , 'Attribute' , 'Description'],
[1, 'LP ID', 'The unique identifier of the product'],[2, 'URL', 'Web Link to the specific product'],[3, 'Product Name', 'Name of the specific product'],[4, 'Provider/Company Name', 'Name of the product provider'],[5, 'Sector(s)', 'Sector of education where the product is used'],[6, 'Primary Essential Function', 'The basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled']]
figure2 = FF.create_table(product_attributes,height_constant = 30)
iplot(figure2)

**Exploring products_info.csv**

In [None]:
print("Percentage of Nulls present in the Products data are as follows")
productna=((products_info.isnull().sum())/(products_info.shape[0]))*100
productna

The percentage of null values is very low. Sector(s) and Primary Essential Function have the same percentage of nulls, which suggests that the same rows are missing both values.

In [None]:
#No of rows with null values
products_info.iloc[np.unique(np.where(products_info.isnull())[0]),:]

In [None]:
#Feature Extraction

#splitting up primary essential function
products_info['primary_function_main'] = products_info['Primary Essential Function'].apply(lambda x: x.split(' - ')[0] if x == x else x)
products_info['primary_function_sub'] = products_info['Primary Essential Function'].apply(lambda x: x.split(' - ')[1] if x == x else x)

# Synchronize similar values
products_info['primary_function_sub'] = products_info['primary_function_sub'].replace({'Sites, Resources & References' : 'Sites, Resources & Reference'})
products_info.drop("Primary Essential Function", axis=1, inplace=True)
products_info
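The split into a main category and a sub-category can also be done with the pandas string accessor, which handles missing values without the `x == x` trick. The sketch below uses a toy frame; the labels in it are illustrative and the frame name `toy` is an assumption.

```python
import numpy as np
import pandas as pd

# Toy labels in the 'MAIN - Sub - Sub-sub' style of products_info.csv
toy = pd.DataFrame({
    "Primary Essential Function": [
        "LC - Digital Learning Platforms",
        "CM - Classroom Engagement & Instruction - Classroom Management",
        np.nan,
    ]
})

# expand=True returns one column per piece; missing labels stay missing
parts = toy["Primary Essential Function"].str.split(" - ", expand=True)
toy["primary_function_main"] = parts[0]
toy["primary_function_sub"] = parts[1]
print(toy[["primary_function_main", "primary_function_sub"]])
```

As in the notebook's lambda version, only the second piece is kept as the sub-category, so any third level of the label is discarded.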

In [None]:
#Dropping all NaN product values as they do not contain any information about other attributes
products_info = products_info[products_info.primary_function_main.notna()].reset_index(drop=True)
print('Dropping nan values of primary essential function')

---
## **Summary of handling missing values in products data :**


*   The 'Sector(s)' and 'Primary Essential Function' columns contain almost all of the missing values.

*   Dropped the tuples with NaN in the primary essential function attribute.

---

In [None]:
#plotting to find top 15 learning providers/companies
fig7 = plt.figure(figsize = (15, 8))
sns.set_style("white")
plt.title('TOP-15 of learning providers/companies',fontname = 'monospace', color = '#283655')
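#value_counts() output column names differ across pandas versions, so name them explicitly
top_providers = products_info['Provider/Company Name'].value_counts().head(15).rename_axis('provider').reset_index(name='n_products')
a = sns.barplot(data = top_providers, x = 'n_products', y = 'provider', color = '#90afc5')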
plt.xticks([])
plt.yticks(fontname = 'monospace', fontsize = 14, color = '#283655')
plt.ylabel('')
plt.xlabel('')
a.spines['left'].set_linewidth(1.5)
for w in ['right', 'top', 'bottom']:
    a.spines[w].set_visible(False)
    
for p in a.patches:
    width = p.get_width()
    plt.text(0.5 + width, p.get_y() + 0.55 * p.get_height(), f'{int(width)}',
             ha = 'center', va = 'center', fontname = 'monospace', fontsize = 15, color = '#283655')


fig7.text(0.4, -0.05,'fig 7',style = 'italic',fontsize = 20)

From *fig 7*,

Google LLC provides the largest number of products that support digital learning.

In [None]:
fig8 = products_info.groupby('Sector(s)').size().plot(kind='pie',figsize=(10,8))

plt.legend( loc='upper left')

fig8.text(0.01, -1.3,'fig 8',style = 'italic',fontsize = 20)

From *fig 8* we can see that the given data mostly contains products in the PreK-12 sector.

In [None]:
# Visualizing the Primary Essential Function LearnPlatform categories
LP1=LP2=LP3=0

for s in products_info["primary_function_main"]:
    if(not pd.isnull(s)):
        LP1 += s.count("CM")
        LP2 += s.count("LC")
        LP3 += s.count("SDO")

fig, ax  = plt.subplots(figsize=(16, 8))
plt.title('Primary essential Function')
explode = (0.02, 0.02, 0.02)
labels = ['CM','LC','SDO']
sizes = [LP1, LP2, LP3]
ax.pie(sizes, explode=explode,startangle=60, labels=labels,autopct='%1.2f%%', pctdistance=0.7, colors=["lightpink",'lavender','thistle'])
ax.add_artist(plt.Circle((0,0),0.4,fc='white'))
fig.text(0.4, -0.05,'fig 9',style = 'italic',fontsize = 20)

From *fig 9* we observe that the LC (Learning & Curriculum) category has the highest percentage of products.

In [None]:
# Visualizing the Primary Essential Function LearnPlatform sub-categories
plt.figure(figsize=(12, 8))
fig10 = sns.countplot(y='primary_function_sub', data=products_info, order=products_info["primary_function_sub"].value_counts().index, color= 'pink')
plt.title("Primary Essential Function(Sub)")
fig10.text(-0.01, 20,'fig 10',style = 'italic',fontsize = 20)

From *fig 10* we observe that, among the LearnPlatform sub-categories, Sites, Resources & Reference is the most common, followed by Digital Learning Platforms and curriculum products.

# **Engagement Data**

---
**Engagement data**
The engagement data are aggregated at school district level, and each file in the folder engagement_data represents data from one school district. The 4-digit file name represents district_id which can be used to link to district information in districts_info.csv. The lp_id can be used to link to product information in products_info.csv.

---

Loading the engagement data files

In [None]:
PATH = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data' 


temp = []

for district in district_info.district_id.unique():
    df = pd.read_csv(f'{PATH}/{district}.csv', index_col=None, header=0)
    df["district_id"] = district
    temp.append(df)
    
    
engagement = pd.concat(temp)
engagement = engagement.reset_index(drop=True)
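The loading loop can be wrapped in a small helper that discovers the per-district files itself. This is a sketch under stated assumptions: the helper name `load_engagement` is hypothetical, every file in the folder is assumed to be named `<district_id>.csv`, and the demo writes two tiny made-up files to a temporary directory instead of reading the real data.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_engagement(folder, district_ids=None):
    """Concatenate per-district CSV files, tagging rows with the id taken from the file name."""
    frames = []
    for csv_path in sorted(Path(folder).glob("*.csv")):
        district_id = int(csv_path.stem)  # file name without ".csv"
        if district_ids is not None and district_id not in district_ids:
            continue  # skip districts that were dropped earlier
        frame = pd.read_csv(csv_path)
        frame["district_id"] = district_id
        frames.append(frame)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)

# Demo on two made-up files in a temporary folder
header = "time,lp_id,pct_access,engagement_index\n"
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "1000.csv").write_text(header + "2020-01-01,1,0.1,5.0\n")
    Path(tmp, "2000.csv").write_text(header + "2020-01-02,2,0.2,7.5\n")
    demo = load_engagement(tmp, district_ids={1000})

print(demo)
```

Passing the set of district ids that survived cleaning keeps the result consistent with the filtered district table, which is what the loop over `district_info.district_id.unique()` achieves.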

In [None]:
engagement.shape

In [None]:
engagement_attributes = [['Sl No' , 'Attribute' , 'Description'],
[1, 'time', 'date in "YYYY-MM-DD"'],[2, 'lp_id' , 'The unique identifier of the product'],[3, 'pct_access', 'Percentage of students in the district have at least one page-load event of a given product and on a given day'],[4, 'engagement_index', '     Total page-load events per one thousand students of a given product and on a given day']]
figure3 = FF.create_table(engagement_attributes,height_constant = 30)
iplot(figure3)

In [None]:
print("Percentage of Nulls present in the Engagement data are as follows")
engagementna=((engagement.isnull().sum())/(engagement.shape[0]))*100
engagementna

In [None]:
engagement['engagement_index'] = engagement['engagement_index'].fillna(engagement['engagement_index'].mean())

In [None]:
engagement['pct_access'] = engagement['pct_access'].fillna(engagement['pct_access'].mean())

In [None]:
#Dropping all NaN values of lp_id
engagement = engagement[engagement.lp_id.notna()].reset_index(drop=True)

In [None]:
print("Percentage of Nulls present in the Engagement data are as follows")
engagementna=((engagement.isnull().sum())/(engagement.shape[0]))*100
engagementna

**Merging Data Values from all the files**

In [None]:
#Merging values of engagement data, district_info and products_info

df = pd.merge(engagement,district_info, on="district_id", how="inner")
df1 = pd.merge(df,products_info, left_on="lp_id", right_on="LP ID", how="inner")
df1.head()
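An inner merge silently drops engagement rows whose product id has no match in the product table. One way to see how many rows are affected is a left merge with `indicator=True`; the sketch below uses two tiny made-up frames (`toy_engagement` and `toy_products` are illustrative names).

```python
import pandas as pd

# Toy engagement rows; product 30 has no entry in the product table
toy_engagement = pd.DataFrame({
    "lp_id": [10.0, 20.0, 30.0],
    "engagement_index": [5.0, 7.0, 9.0],
})
toy_products = pd.DataFrame({
    "LP ID": [10, 20],
    "Product Name": ["A", "B"],
})

# indicator=True adds a _merge column saying where each row came from
checked = toy_engagement.merge(
    toy_products, left_on="lp_id", right_on="LP ID", how="left", indicator=True
)
unmatched = int((checked["_merge"] == "left_only").sum())
print("Engagement rows without product info:", unmatched)
```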

In [None]:
print("Percentage of Nulls present in the DF1 data are as follows")
df1na=((df1.isnull().sum())/(df1.shape[0]))*100
df1na

In [None]:
df1.shape

In [None]:
freq = products_info.groupby(['Sector(s)','Provider/Company Name']).count()
freq.sort_values(by=['Product Name'], ascending=False )

In [None]:
#Dropping redundant columns
#Removing the raw range columns, since their averaged versions are kept
df1.drop('pct_black/hispanic',axis = 1, inplace = True)
print("Dropped pct_black/hispanic columns...")
df1.drop('pct_free/reduced',axis = 1, inplace = True)
print("Dropped pct_free/reduced column...")
df1.drop('pp_total_raw',axis = 1, inplace = True)
print("Dropped pp_total_raw column...")

In [None]:
print("Percentage of Nulls present in the merged data are as follows")
df1na=((df1.isnull().sum())/(df1.shape[0]))*100
df1na

Now we find no missing values

**Adding Extra Features**

In [None]:
df1['month'] = pd.DatetimeIndex(df1['time']).month
print("Month added.")

In [None]:


df1['weekday'] = pd.DatetimeIndex(df1['time']).weekday
print('Weekday added')
df1.head()
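The same date features can be derived with the `.dt` accessor after parsing the column once. This sketch uses two hand-picked dates; the `is_weekend` column is an extra illustration and is not part of the notebook.

```python
import pandas as pd

# 2020-04-06 was a Monday and 2020-09-13 was a Sunday
toy = pd.DataFrame({"time": ["2020-04-06", "2020-09-13"]})
toy["time"] = pd.to_datetime(toy["time"])

toy["month"] = toy["time"].dt.month
toy["weekday"] = toy["time"].dt.weekday   # Monday is 0, Sunday is 6
toy["is_weekend"] = toy["weekday"] >= 5   # Saturday or Sunday
print(toy)
```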

# Challenges

### Q. What is the picture of digital connectivity and engagement in 2020?

**Plotting month vs engagement index**

In [None]:
#subplot

plt.figure(figsize = (20,5))
plt.subplot(1, 2, 1)
fig11_1 = sns.barplot(x ='month', y ='engagement_index', data = df1)
plt.title('Month Vs Engagement index')
plt.legend()
plt.xticks(rotation = 90)
fig11_1.text(1.5, -51,'fig 11.1',style = 'italic',fontsize = 20)

plt.subplot(1, 2, 2)
fig11_2 = sns.lineplot(x ='month', y ='engagement_index', data = df1)
plt.title('Month Vs Engagement index')
plt.legend()
plt.xticks(rotation = 90)
fig11_2.text(0.4, -1.1,'fig 11.2',style = 'italic',fontsize = 20)

From *fig 11.1* and *fig 11.2*,

*    Engagement peaks in April and September.
*    Engagement decreases to its minimum in July and then starts increasing again.

This suggests that schools might reopen in September, hold examinations in April, and have a summer break in between.



**Plotting weekday vs engagement index**

In [None]:
#subplot
plt.figure(figsize = (20,5))
plt.subplot(1, 2, 1)
fig12_1 = sns.barplot(x ='weekday', y ='engagement_index', data = df1)
plt.title('Weekday Vs Engagement index')
plt.legend()
plt.xticks(rotation = 90)
fig12_1.text(1.5, -51,'fig 12.1',style = 'italic',fontsize = 20)
plt.subplot(1, 2, 2)
fig12_2 = sns.lineplot(x ='weekday', y ='engagement_index', data = df1)
plt.title('Weekday Vs Engagement index')
plt.legend()
plt.xticks(rotation = 90)
fig12_2.text(0.4, -1.1,'fig 12.2',style = 'italic',fontsize = 20)

From *fig 12.1* and *fig 12.2*,

*   0 is Monday and 6 is Sunday.
*   Interaction is quite low on the weekends.

*   We also see a large dip in July, where almost all engagement went close to zero.


---
## **Summary:**
* Engagement decreases over the summer months, which may reflect the summer break as well as pandemic-related school closures.
* Engagement decreases on weekends.

---

### Q. How does the student engagement with different types of education technology change over the course of the pandemic?

**Month vs Engagement index with most popular products**

In [None]:
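#Top 5 products by mean engagement index
#Selecting the column before mean() avoids errors on non-numeric columns in newer pandas
l = df1.groupby('lp_id')['engagement_index'].mean().sort_values(ascending = False).head().index
for lp_id in l:
    p1 = df1[df1['lp_id'] == lp_id].groupby('month', sort = False)['engagement_index'].mean().reset_index()
    #One line per product, labelled with its lp_id so the legend is meaningful
    fig13 = sns.lineplot(data=p1, x="month", y="engagement_index", label=str(int(lp_id)))
plt.title("Engagement with Most Popular Products", {'fontsize' : 20} )
plt.legend()
fig13.text(-6, 10,'fig 13',style = 'italic',fontsize = 20)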

In [None]:
fig14 = df1.groupby('Sector(s)')[['engagement_index']].median().plot(kind='bar',figsize=(15,7),color=['green'])
fig14.text(1.8, -9.1,'fig 14',style = 'italic',fontsize = 20)

From *fig 14* we can say that the PreK-12, Corporate, and PreK-12; Higher Ed; Corporate sectors have the highest median engagement index.

In [None]:
sns.catplot(x = 'month',y='engagement_index', 
              data = df1.groupby(by=['month','primary_function_main'])['engagement_index'].mean().reset_index(), 
              edgecolor="white",kind='bar',
              palette="viridis",col='primary_function_main',col_wrap=2,height=4)

plt.show()

In [None]:
lp, name = list(products_info['LP ID']), list(products_info['Product Name'])
lp_to_name = {}
for i in range(len(lp)):
    lp_to_name[int(lp[i])] = name[i]
most_popular_products = df1[['lp_id','time']].groupby('lp_id').count().sort_values('time').tail(20).reset_index()

decoded_ids = []
number_of_rows = []
for i in range(len(most_popular_products['lp_id'])):
    if int(list(most_popular_products['lp_id'])[i]) in lp_to_name:
        decoded_ids.append(lp_to_name[int(list(most_popular_products['lp_id'])[i])])
        number_of_rows.append(list(most_popular_products['time'])[i])

In [None]:
plt.figure(figsize=(12,7))
plt.title("Distribution of Data Points per Product", {'fontsize':20})
fig15 = sns.barplot(y = decoded_ids, x = number_of_rows)
fig15.text(-19, 22,'fig 15',style = 'italic',fontsize = 20)

From *fig 15* we can say that Google products such as Google Docs and Google Drive are widely used.

Now let's look at engagement with the most and least popular products.

In [None]:
product1=pd.DataFrame(df1.groupby('Product Name')['engagement_index'].mean().sort_values(ascending=False))
product1=product1.reset_index()
product_most=product1[["engagement_index","Product Name"]].head(10)
product_least=product1[["engagement_index","Product Name"]].tail(10)
plt.figure(figsize = (20,5))
plt.subplot(1, 2, 1)
fig16_1 = sns.barplot(x ='Product Name', y ='engagement_index', data = product_most)
plt.title('Most engaged products')
fig16_1.text(3.5, -5500,'fig 16.1',style = 'italic',fontsize = 20)

plt.xticks(rotation = 90)
plt.subplot(1, 2, 2)
fig16_2=sns.barplot(x ='Product Name', y ='engagement_index', data = product_least)
plt.title('Least engaged products')

plt.xticks(rotation = 90)
fig16_2.text(3.5, -8.5,'fig 16.2',style = 'italic',fontsize = 20)

In [None]:
producta=pd.DataFrame(df1.groupby('Product Name')['pct_access'].mean().sort_values(ascending=False))
producta=producta.reset_index()
producta_most=producta[["pct_access","Product Name"]].head(10)
producta_least=producta[["pct_access","Product Name"]].tail(10)
plt.figure(figsize = (20,5))
plt.subplot(1, 2, 1)
fig17_1=sns.barplot(x ='Product Name', y ='pct_access', data = producta_most)
plt.title('Most accessed products')
fig17_1.text(3.7, -9,'fig 17.1',style = 'italic',fontsize = 20)

plt.xticks(rotation = 90)
plt.subplot(1, 2, 2)
fig17_2 = sns.barplot(x ='Product Name', y ='pct_access', data = producta_least)
plt.title('Least accessed products')

plt.xticks(rotation = 90)
fig17_2.text(3.7, -0.006,'fig 17.2',style = 'italic',fontsize = 20)

---
## **Summary:**

* The monthly pattern is similar for all primary function types: Fall 2020 engagement was higher than Spring 2020.
* Engagement drops during the summer break, as expected.
* Category LC, which has the most student engagement, showed higher engagement during Fall 2020 than during Spring 2020.
* Engagement with Google products is high.

---

### Q. How does the student engagement with online learning platforms relate to different geography? Demographic context (e.g., race/ethnicity, ESL, learning disability)? Learning context? Socioeconomic status?

**Plotting State vs pct_access**

In [None]:
# Sub plot
plt.figure(figsize = (20,5))
fig18 = sns.barplot(x ='state', y ='pct_access', data = df1)
plt.legend()
plt.title('State V/s pct_access')
plt.xticks(rotation = 90)
fig18.text(8.4, -2.5,'fig 18',style = 'italic',fontsize = 20)

From *fig 18*,

*   North Dakota has the highest pct_access value.

*   Arizona has the second highest pct_access value.

Note that even though Arizona has the highest share of Black/Hispanic students, its engagement index and pct_access are high.

In this data, a high share of Black/Hispanic students therefore does not coincide with lower access to digital learning, although a state-level comparison like this cannot by itself rule out bias.

**Plotting month v/s pct_access**

In [None]:
plt.figure(figsize = (20,5))
fig19 = sns.barplot(x ='month', y ='pct_access', data = df1)
plt.legend()
plt.title('Month V/s pct_access')
plt.xticks(rotation = 90)
fig19.text(10.4, -0.2,'fig 19',style = 'italic',fontsize = 20)

**Plotting Locale v/s pct_access  and locale v/s engagement index**

In [None]:
#subplot
plt.figure(figsize = (20,5))
plt.subplot(1, 2, 1)
fig20_1 = sns.barplot(x ='locale', y ='engagement_index', data = df1)
plt.title('Locale Vs Engagement index')
plt.legend()
plt.xticks(rotation = 90)
fig20_1.text(0.4, -80,'fig 20.1',style = 'italic',fontsize = 20)

plt.subplot(1, 2, 2)
fig20_2 = sns.barplot(x ='locale', y ='pct_access', data = df1)
plt.title('Locale Vs pct_access')
plt.legend()
plt.xticks(rotation = 90)
fig20_2.text(0.4, -0.3,'fig 20.2',style = 'italic',fontsize = 20)

In [None]:
plt.figure(figsize=(12,7))
plt.title("Engagement By Demography: Percentage of Black/Hispanic", {'fontsize':20})
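# Average only engagement_index per group; a bare .mean() fails on text columns in recent pandas
fig21 = sns.barplot(data = df1.groupby('avg_black_hispanic')['engagement_index'].mean().reset_index(), y = 'avg_black_hispanic', x = 'engagement_index')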
fig21.text(0.4, -0.1,'fig 21',style = 'italic',fontsize = 20)

From *fig 21*, the graph appears to show that engagement is high in districts with a large share of Black/Hispanic students.
However, the data contains very few districts with a large Black/Hispanic share.
So we cannot infer from such a small sample that Black/Hispanic students engage heavily with e-learning products.
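
The sample-size caveat can be checked directly by reporting, next to each group mean, how many distinct districts and rows it is based on. The following is a minimal sketch, assuming `df1` carries `district_id`, `avg_black_hispanic` and `engagement_index` columns as used in this notebook. The `toy` frame and the `engagement_by_group` helper are hypothetical, and the numbers are placeholders, not values from the dataset.

```python
import pandas as pd

# Placeholder rows for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    "district_id": [1, 1, 2, 3, 4, 4],
    "avg_black_hispanic": [0.1, 0.1, 0.1, 0.3, 0.9, 0.9],
    "engagement_index": [100.0, 140.0, 60.0, 80.0, 300.0, 500.0],
})

def engagement_by_group(df, group_col):
    """Mean engagement per group, with the number of districts and rows behind it."""
    return df.groupby(group_col).agg(
        mean_engagement=("engagement_index", "mean"),
        n_districts=("district_id", "nunique"),
        n_rows=("engagement_index", "size"),
    )

by_demo = engagement_by_group(toy, "avg_black_hispanic")
print(by_demo)
```

On the real data, `engagement_by_group(df1, 'avg_black_hispanic')` would show whether the tall bars in *fig 21* rest on only one or two districts. The same helper could be applied to `'state'` or `'locale'`.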

In [None]:
plt.figure(figsize=(12,7))
plt.title("Engagement By Demography: Percentage of Students eligible for free/reduced lunch", {'fontsize':20})
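# Average only engagement_index per group; a bare .mean() fails on text columns in recent pandas
fig22 = sns.barplot(data = df1.groupby('avg_reduced_lunch')['engagement_index'].mean().reset_index(), y = 'avg_reduced_lunch', x = 'engagement_index')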
fig22.text(0.4, -0.1,'fig 22',style = 'italic',fontsize = 20)

In [None]:
def plot_time_series(df1,col1,col2,col3):
    """Plot col2 over col3 for the 6 col1 groups with the highest mean col2, one facet per group."""
    max_list = df1[[col1,col2]]\
        .groupby([col1])[col2].mean()\
        .sort_values(ascending=False).index[:6].tolist()

    df1 = df1[df1[col1].isin(max_list)]\
                    .reset_index(drop=True)[[col3, col1, col2]]
    df1 = df1.pivot_table(index=col3, columns=col1, values=col2)

    fig = px.line(df1, facet_col=col1, facet_col_wrap=1, width=800, height=800)
    fig.update_layout(
                      title=(col1 + " , " + col2 + " , " + col3).title(),
                      title_x=0.39,
                      template="plotly_white",
                      paper_bgcolor='#f5f7f8',
                      font = {'family': 'Serif', 'size': 20}
                     )
    fig.show()

In [None]:
plot_time_series(df1,"state","engagement_index","time")

*Fig 23.1*

From *Fig 23.1* (State, Engagement_index, Time),
* Arizona has data for only one district (district_id = 9007), and its engagement increased after August.
* Connecticut shows no noticeable difference in engagement between the pre-COVID period and the period after August.
* Engagement in the District of Columbia increased after August; it was low before COVID and during the vacation.
* North Dakota has engagement data only up to February 4.

In [None]:
plot_time_series(df1,"locale","engagement_index","time")

*Fig 23.2*

From *Fig 23.2* (Locale, Engagement_index , Time),
* Engagement in Rural, Suburban and City locales increased after August, but engagement in Town locales decreased after the vacation.
* Rural locales have the highest engagement of all locales.

In [None]:
plot_time_series(df1,"Product Name","engagement_index","time")

*Fig 23.3*

From *Fig 23.3* (Product Name, Engagement_index, Time),
* Engagement with Google Classroom appears to decrease after the vacation, whereas engagement with Google Meet increased after the vacation.
* Engagement with YouTube started during the vacation and then increased.
* Engagement with Google Docs and Canvas increased slightly after the vacation.


### Q. What is the effect of the COVID-19 pandemic on online and distance learning, and how might this also evolve in the future?
From the available data, the US started imposing lockdowns in the last week of March 2020.

Comparing the pre-COVID period with the post-lockdown period, engagement with e-learning products increased significantly after the lockdown (from September onward).
(Inferred from the graphs for challenge 1 above.)
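
The pre-COVID versus post-lockdown comparison can be made explicit by labelling each day with a period and averaging the engagement index per period. This is only a sketch: the cut-off dates in `label_period` are assumptions chosen for illustration (the lockdown start is taken as 2020-03-23), the `toy` frame holds placeholder values rather than dataset values, and the `time` column is assumed to be a datetime.

```python
import pandas as pd

# Placeholder daily values for illustration only (not from the dataset)
toy = pd.DataFrame({
    "time": pd.to_datetime(["2020-02-03", "2020-03-02", "2020-04-15",
                            "2020-07-10", "2020-09-14", "2020-10-05"]),
    "engagement_index": [100.0, 120.0, 90.0, 20.0, 200.0, 220.0],
})

def label_period(ts):
    """Assign a coarse 2020 period; the cut-off dates are assumptions."""
    if ts < pd.Timestamp("2020-03-23"):
        return "pre-lockdown"
    if ts < pd.Timestamp("2020-06-01"):
        return "spring lockdown"
    if ts < pd.Timestamp("2020-09-01"):
        return "summer break"
    return "fall"

toy["period"] = toy["time"].apply(label_period)

# Mean engagement per period, so the periods can be compared side by side
period_means = toy.groupby("period")["engagement_index"].mean()
print(period_means)
```

Applied to `df1`, the same labelling would give one number per period, which is easier to compare than reading the change off the daily line plots.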

### Q. Do certain state interventions, practices or policies (e.g., stimulus, reopening, eviction moratorium) correlate with the increase or decrease in online engagement?

**Plotting state vs engagement index**

In [None]:
# Mean engagement index per state
plt.figure(figsize = (20,5))
fig24_1 = sns.barplot(x ='state', y ='engagement_index', data = df1)
plt.title('State V/s Engagement Index')
plt.xticks(rotation = 90)
fig24_1.text(1.6, -300,'fig 24.1',style = 'italic',fontsize = 20)

# Mean pct_access per state
plt.figure(figsize = (20,5))
fig24_2 = sns.barplot(x ='state', y ='pct_access', data = df1)
plt.title('State V/s pct_access')
plt.xticks(rotation = 90)
fig24_2.text(1.6, -2,'fig 24.2',style = 'italic',fontsize = 20)

From *fig 24.1*,
* Arizona, North Dakota, New York and New Hampshire are the top 4 states by mean engagement index.

From *fig 24.2*,
* North Dakota, Arizona, New York and District of Columbia are the top 4 states by mean pct_access.

**Plotting state v/s avg_expenditure**

In [None]:
# Mean expenditure per pupil per state
plt.figure(figsize = (20,5))
fig25 = sns.barplot(x ='state', y ='avg_spent_per_pupil', data = df1)
plt.title('State V/s Avg Expenditure per pupil')
plt.xticks(rotation = 90)
fig25.text(1.5, -6500,'fig 25',style = 'italic',fontsize = 20)

From *fig 25*, New York, Connecticut and the District of Columbia have the highest expenditure per pupil.

**Plotting locale v/s avg_expenditure**

In [None]:
# Mean expenditure per pupil per locale
plt.figure(figsize = (20,5))
fig26 = sns.barplot(x ='locale', y ='avg_spent_per_pupil', data = df1)
plt.title('Locale V/s Avg Expenditure per pupil')
plt.xticks(rotation = 90)
fig26.text(0.4, -2000,'fig 26',style = 'italic',fontsize = 20)

The Rural locale has the highest average expenditure per pupil (from *fig 26*).

In [None]:
state_access=pd.DataFrame(df1.groupby('state')[['pct_access','avg_spent_per_pupil','avg_black_hispanic','avg_reduced_lunch']].mean().sort_values(['pct_access','avg_spent_per_pupil','avg_black_hispanic','avg_reduced_lunch'],ascending=False))
state_access=state_access.reset_index()

In [None]:
# lets check the correlation between different features of data or variables , 
# pct access, expenditure, black/hispanic, free/reduced lunch eligibility %
g = sns.pairplot(state_access)

# let's show pearson correlation coefficient (https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) 
#above each pair plot 
from scipy.stats import pearsonr
def corrfunc(x, y, ax=None, **kws):
    """Plot the correlation coefficient in the top left hand corner of a plot."""
    r, _ = pearsonr(x, y)
    ax = ax or plt.gca()
    ax.annotate(f'ρ = {r:.2f}', xy=(.6, .9), xycoords=ax.transAxes)

g.map_offdiag(corrfunc)
plt.show()

*Fig 27.1*

In [None]:
state_engage=pd.DataFrame(df1.groupby('state')[['engagement_index','avg_spent_per_pupil','avg_black_hispanic','avg_reduced_lunch']].mean().sort_values(['engagement_index','avg_spent_per_pupil','avg_black_hispanic','avg_reduced_lunch'],ascending=False))
state_engage=state_engage.reset_index()

In [None]:
# let's check the correlation between the state-level variables:
# engagement index, expenditure, black/hispanic, free/reduced lunch eligibility %
g = sns.pairplot(state_engage)

# let's show pearson correlation coefficient (https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) 
#above each pair plot 
from scipy.stats import pearsonr
def corrfunc(x, y, ax=None, **kws):
    """Plot the correlation coefficient in the top left hand corner of a plot."""
    r, _ = pearsonr(x, y)
    ax = ax or plt.gca()
    ax.annotate(f'ρ = {r:.2f}', xy=(.6, .9), xycoords=ax.transAxes)

g.map_offdiag(corrfunc)
plt.show()

*Fig 27.2*

From *fig 27.1* and *fig 27.2*
* Engagement index has a positive correlation with avg_spent_per_pupil (total expenditure per pupil). This makes sense: if the expenditure is aimed at increasing digital learning, we expect engagement to increase with expenditure.

* We might also observe from the above data that avg_reduced_lunch and avg_spent_per_pupil are negatively correlated, and that the engagement index drops as the percentage of students eligible for free/reduced lunch increases. Eligibility for free/reduced lunch depends on low household income, so this might be because low-income families cannot afford the infrastructure required for digital learning.

* avg_black_hispanic and avg_reduced_lunch have a strong positive correlation. This probably reflects lower average income in Black and Hispanic communities compared with other communities in the USA.

* In these state-level plots, a higher avg_black_hispanic goes with a lower engagement index.
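
The coefficients annotated on the pair plots can also be tabulated in one step with `DataFrame.corr`, which makes the signs easier to compare. The following is a minimal sketch using a hypothetical `toy_state` frame with placeholder numbers (not results from the dataset). On the notebook's data the same call could be applied to `state_engage` or `state_access` after dropping the `state` column.

```python
import pandas as pd

# Placeholder state-level averages for illustration only (not from the dataset)
toy_state = pd.DataFrame({
    "engagement_index": [300.0, 250.0, 200.0, 150.0, 100.0],
    "avg_spent_per_pupil": [15000.0, 13000.0, 14000.0, 11000.0, 9000.0],
    "avg_reduced_lunch": [0.1, 0.3, 0.3, 0.5, 0.7],
})

# Pairwise Pearson correlation between all numeric columns
corr = toy_state.corr(method="pearson")

# Coefficients of every other variable against engagement_index
print(corr["engagement_index"].drop("engagement_index").round(2))
```

With only a small number of states, each coefficient can shift noticeably when a single state is added or removed, so the table is best read as a direction rather than a precise estimate.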


The line plot of avg_spent_per_pupil against engagement_index below suggests that the government could increase expenditure in the regions with the highest share of Black/Hispanic students,
since the pair plots above indicate that the economic condition of these communities is weaker.


Hence the government should implement policies in these regions to increase engagement with e-learning products.
The pandemic should not be a barrier to accessing education.

In [None]:
sns.relplot(x="avg_spent_per_pupil", y="engagement_index", kind="line", data=state_engage,ci=None)

In [None]:
print("Percentage of Nulls present in the DF1 data are as follows")
df1na=((df1.isnull().sum())/(df1.shape[0]))*100
df1na

**Number of school districts per state**

In [None]:
us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District Of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY',
}

district_info['state_abbrev'] = district_info['state'].replace(us_state_abbrev)
district_info_by_state = district_info['state_abbrev'].value_counts().to_frame().reset_index(drop=False)
district_info_by_state.columns = ['state_abbrev', 'num_districts']

fig = go.Figure()
layout = dict(
    title_text = "Number of Available School Districts per State",
    geo_scope='usa',
)

fig.add_trace(
    go.Choropleth(
        locations=district_info_by_state.state_abbrev,
        zmax=1,
        z = district_info_by_state.num_districts,
        locationmode = 'USA-states', # set of locations match entries in `locations`
        marker_line_color='white',
        geo='geo',
        colorscale=px.colors.sequential.Teal, 
    )
)
            
fig.update_layout(layout)   
fig.show()

From the above map, Connecticut (CT) has the highest number of school districts (30), followed by Utah (UT) with 29. Arizona (AZ) has only one school district.

---
## **Summary**



### Engagement level:

* From January to March 2020, there is a steady increase in levels of engagement. This is because education technology was already expanding rapidly before COVID-19.
* Longer closures, uncertainty about reopening dates, a possible tightening of the academic calendar, and the resulting learning discontinuity among students forced states and educational institutions to seek alternative solutions to mitigate the various impacts. This is where digital/online learning becomes extremely important.
* Over the course of 2020, the engagement index improved, and the fall engagement index was higher than the spring engagement index. This is consistent with schools, institutions, and even businesses moving to online platforms to learn and work from home.
* As expected, the data shows a sharp drop in the engagement index over the summer vacation.
* Engagement is higher at the beginning of the week and gradually decreases toward the weekend. The highest average engagement is on Tuesday.
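
The day-of-week pattern in the last bullet can be reproduced by grouping on the weekday name of the date column. This is a sketch under the assumption that `time` is a datetime column; the `toy` frame contains placeholder values, not values from the dataset.

```python
import pandas as pd

# Placeholder rows for illustration only (not from the dataset)
toy = pd.DataFrame({
    "time": pd.to_datetime(["2020-09-07", "2020-09-08", "2020-09-09",
                            "2020-09-12", "2020-09-15"]),
    "engagement_index": [150.0, 200.0, 180.0, 20.0, 220.0],
})

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Mean engagement per weekday, shown in calendar order;
# weekdays with no rows in the data appear as NaN
by_weekday = (
    toy.groupby(toy["time"].dt.day_name())["engagement_index"]
       .mean()
       .reindex(weekday_order)
)
print(by_weekday)
print("Highest average engagement on:", by_weekday.idxmax())
```

Reindexing to `weekday_order` keeps the days in calendar order instead of alphabetical order, which makes the decline toward the weekend easier to see.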


### Most accessed products:

Google Classroom, Google Docs, Google Drive and YouTube (From *fig 17.1*)

### Most engaged products:

Google Docs, Google Classroom, YouTube, Canvas (From *fig 16.1*)

### Primary functions:

* For all primary function types, the monthly pattern is the same: engagement in the fall of 2020 was higher than in the spring of 2020.
* SDO has the highest engagement index.
* Among the different learning tools, Learning & Curriculum (LC) achieved the highest adoption.

* "Sites, Resources and Reference" and "Digital Learning Platform" take the top spots among the primary essential sub-categories.


### States with highest engagement index

Arizona, New York, North Dakota (From *fig 24.1*)

### States with highest pct_access

North Dakota, Arizona, New York (From *fig 24.2*)


### States with highest expenditure

New York, District of Columbia, Connecticut (From *fig 25*)

---
### Following are the questions that we formulated by looking at the data

1. Why do city locales have the highest Black/Hispanic population?


2. What is the effect of COVID-19 on the engagement of economically weak students?


3. Why do city locales have the highest percentage of students eligible for free or reduced lunch? Can we infer anything about the economic condition of Black/Hispanic students?

4. How does expenditure per pupil affect digital learning:
(i) pct_access
(ii) engagement index?

5. Do literacy rate and engagement index have any relation?

6. Do hurricanes in the US affect digital learning?

---

### 1. Why do city locales have the highest Black/Hispanic population?

From *fig 3*,

Most of the city locales are located in California, Utah, District of Columbia and Arizona.
In the given data, these states have a higher share of Black/Hispanic students.

### 2. What is the effect of COVID-19 on the engagement of economically weak students?

From *fig 27.1* and *fig 27.2*, we might observe that avg_reduced_lunch and avg_spent_per_pupil are negatively correlated, and that the engagement index drops as the percentage of students eligible for free/reduced lunch increases. Eligibility for free/reduced lunch depends on low household income, so this might be because low-income families cannot afford the infrastructure required for digital learning.

### 3. Why do city locales have the highest percentage of students eligible for free or reduced lunch? Can we infer anything about the economic condition of Black/Hispanic students?

City locales have the highest percentage of students eligible for free/reduced lunch, which suggests that the socioeconomic status of city districts might be low.

From *Fig 3* we can observe that avg black/hispanic and avg free/reduced have a very strong positive correlation.
This suggests that the economic status of many Black/Hispanic families is low, so Black/Hispanic students may engage less with e-learning products because e-learning devices are difficult for them to afford.

Most Black/Hispanic students are eligible for free/reduced lunch, which further indicates that their economic status is low.


### 4. How does expenditure per pupil affect digital learning: (i) pct_access (ii) engagement index?

From *fig 27.1* and *fig 27.2* we can observe a positive correlation between avg_spent_per_pupil and avg_free_reduced_lunch, and between avg_spent_per_pupil and engagement_index. Hence, if the government increases expenditure, engagement with digital learning platforms may increase. This might reduce the digital divide among students.

### 5. Do literacy rate and engagement index have any relation?

From [https://worldpopulationreview.com/state-rankings/us-literacy-rates-by-state](https://worldpopulationreview.com/state-rankings/us-literacy-rates-by-state) we found the following.

The states with high literacy rates are

New Hampshire 94.20%, North Dakota 93.70%, Indiana 92%, Connecticut 91.40%, Arizona 86.90%

From *fig 24.1* we observe that,
* Arizona, North Dakota, New York and New Hampshire are the top 4 states by mean engagement index.
* Three of these four states also appear in the high-literacy list above, which suggests that literacy rate and engagement are positively associated.
* If the relationship is causal, increasing engagement could be one way to improve literacy rates in US states, but this data alone cannot establish that.

The states with the lowest literacy rates are

California 76.90%, Florida 80.30%, Texas 81%

From *fig 24.1*
* Texas, Florida and California have lower engagement indices.
* So the literacy rates and engagement indices of these states are also consistent with a positive association.
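
A rank correlation over all states that appear in both lists would test this association more directly than comparing the top and bottom few. The sketch below uses the literacy rates quoted above, but the `engagement` values are placeholders invented for illustration, not the notebook's results; `rank_correlation` is a hypothetical helper that computes a Spearman coefficient as the Pearson correlation of ranks.

```python
import pandas as pd

# Literacy rates (%) as quoted in this section
literacy = pd.Series({
    "New Hampshire": 94.2, "North Dakota": 93.7, "Indiana": 92.0,
    "Connecticut": 91.4, "Arizona": 86.9, "Texas": 81.0,
    "Florida": 80.3, "California": 76.9,
}, name="literacy_rate")

# PLACEHOLDER engagement values, for illustration only (not from the dataset)
engagement = pd.Series({
    "Arizona": 80.0, "New York": 75.0, "North Dakota": 70.0,
    "New Hampshire": 60.0, "Connecticut": 50.0, "Indiana": 40.0,
    "Texas": 30.0, "Florida": 20.0, "California": 10.0,
}, name="engagement_index")

def rank_correlation(a, b):
    """Spearman correlation over the states present in both series."""
    joined = pd.concat([a, b], axis=1, join="inner").dropna()
    ranks = joined.rank()
    rho = ranks.iloc[:, 0].corr(ranks.iloc[:, 1])  # Pearson on ranks
    return rho, len(joined)

rho, n_states = rank_correlation(literacy, engagement)
print(f"Spearman rho = {rho:.2f} over {n_states} states")
```

The inner join drops states that lack a literacy figure (New York in this sketch), so the coefficient only reflects states for which both numbers are available.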

### 6. Do hurricanes in the US affect digital learning?

According to an [external source](https://www.finder.com/states-with-the-most-hurricanes), Florida and Texas are the US states most affected by hurricanes. In *fig 24.1* these two states have the lowest engagement, in *fig 25* their expenditure per pupil is also low, and their pct_access is very low as well (from *fig 24.2*).
This suggests that hurricanes may be one factor affecting digital learning in these states, although other causes cannot be ruled out from this data. Hence, the government should take measures to improve digital learning in these states.