In [None]:
!pip install pdpipe

In [None]:
import pandas as pd
import numpy as np  
%matplotlib inline  
from plotly import __version__ 
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
init_notebook_mode(connected=True)
cf.go_offline()
import seaborn as sns
import matplotlib.pyplot as plt 
import glob
import missingno as msno
import datetime as dt
import pdpipe as pdp
from typing import Tuple, List, Dict

import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.offline

# INTRODUCTION AND METHODOLOGY :
Covid-19 has undoubtedly affected many aspects of our daily habits. The crisis has reshaped economic, social and, in our case study, educational life. In this analysis, we examine how the crisis affected e-learning, based on EdTech engagement data. 
We will proceed as follows : 
        
    Data preparation: missing-value analysis and treatment,
    Data exploration and visualisation,
    
From these steps, we will mainly learn : 
       
    How engagement evolved over time in the different states,
    The characteristics of the most used products,
    Any correlations between the states' characteristics and how engaged they are, 
    
Along the way, regardless of engagement levels, we will explore the district and product data to get to know this population better : 
    
    
    Top 10 providers, 
    Top primary essential functions of the products,
    Top locales and states,
    The share of Black and Hispanic students in these districts,
    The share of students eligible for free or reduced-price lunch at school,
    How much is spent per student.

Once all the libraries have been imported, we load the data sets one by one, and for each we do a first exploration and handle the missing values :
    
    Product data first, 
    then the district data, 
    and finally the engagement data.

# Data exploration : 
## Products' Data :

The Product's data is made up of the following columns : 

| Name | Description |
| :--- | :----------- |
| LP ID| The unique identifier of the product |
| URL | Web Link to the specific product |
| Product Name | Name of the specific product |
| Provider/Company Name | Name of the product provider |
| Sector(s) | Sector of education where the product is used |
| Primary Essential Function | The basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled |


In [None]:
Product_data = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv', sep=',')
Product_data.head(5)

In [None]:
Product_data.info()  

In [None]:
# Total number of entries (rows x columns) in the dataset
total = Product_data.size
# Number of missing values per column
missingCount = Product_data.isnull().sum()
# Total number of missing values
missing_tot = missingCount.sum()
# Number of rows containing at least one missing value
rows_with_missing = Product_data.isnull().any(axis=1).sum()
print('Number of missing values for each column of the dataframe:\n', missingCount)
print("The dataset contains", round((missing_tot / total) * 100, 2), "% missing values")
print('Number of rows with at least one missing value:', rows_with_missing)
print('Percentage of rows with missing data:', round((rows_with_missing / Product_data.shape[0]) * 100, 2), '%\n\n')

There is not much missing data in this one. Let's check whether missingness is correlated between the variables before going further.

In [None]:
ax = msno.heatmap(Product_data,figsize=(5,5))
plt.show()

Since missing values in the sector and primary essential function columns are 100% correlated, we know we are about to drop only 21 rows out of 372, which is negligible.
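
To make that reasoning concrete, here is a minimal sketch on a toy table (the values are made up, not the real product data). It compares the number of rows missing *either* column with the number missing *both*: when the two counts match, the columns are always missing together and `dropna` removes exactly that many rows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the product table (hypothetical rows, not the real data)
toy_products = pd.DataFrame({
    "Product Name": ["A", "B", "C", "D"],
    "Sector(s)": ["PreK-12", np.nan, "PreK-12; Higher Ed", np.nan],
    "Primary Essential Function": [
        "LC - Digital Learning Platforms", np.nan,
        "LC - Content Creation & Curation", np.nan,
    ],
})

cols = ["Sector(s)", "Primary Essential Function"]
missing_any = int(toy_products[cols].isna().any(axis=1).sum())  # rows missing at least one of the two
missing_all = int(toy_products[cols].isna().all(axis=1).sum())  # rows missing both

# Rows actually removed when dropping on these two columns
rows_dropped = len(toy_products) - len(toy_products.dropna(subset=cols))
print(missing_any, missing_all, rows_dropped)
```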

In [None]:
Product_data.dropna(inplace=True)
Product_data.isna().sum()
Product_data.rename(columns={'LP ID' : 'lp_id'}, inplace=True) # we will need it later on for the merge

In [None]:
Product_data.info()

Our data is handled for now, and ready for some exploration. 

In [None]:
print("In this data set, the number of unique sectors is", Product_data['Sector(s)'].nunique(), "which are", Product_data['Sector(s)'].unique())
print("The number of unique product names is", Product_data['Product Name'].nunique())
print("The number of unique providers is", Product_data['Provider/Company Name'].nunique(), "(we will get to know their names later on)")
print("The number of unique primary essential functions is", Product_data['Primary Essential Function'].nunique(), "(we will get to know their names later on)")

Let's discover some memorable names : the top ten !

#### Provider/Company Name

In [None]:
# rename_axis + reset_index(name=...) gives the same two columns on pandas 1.x and 2.x
df9 = Product_data['Provider/Company Name'].value_counts().rename_axis('Provider/Company Name').reset_index(name='Number of products')
df9['Percentage'] = (df9['Number of products'] / df9['Number of products'].sum()) * 100
df9.head(10)

In [None]:
df9.head(20).iplot(kind='bar', x = 'Provider/Company Name', y = 'Number of products', xTitle ='Provider/Company Name', yTitle = 'Number of products', orientation = 'v', sortbars=False)

**Google LLC** is by far the most represented provider, which reflects the broad product offering it has built up over the years !
#### Sectors :

In [None]:
# rename_axis + reset_index(name=...) gives the same two columns on pandas 1.x and 2.x
df8 = Product_data['Sector(s)'].value_counts().rename_axis('Sector(s)').reset_index(name='Number of products')
df8['percentage'] = (df8['Number of products'] / df8['Number of products'].sum()) * 100
df8.head(10)

In [None]:
df8.head(20).iplot(kind='bar', x = 'Sector(s)', y = 'Number of products', xTitle ='Sector(s)', yTitle = 'Number of products', orientation = 'v', sortbars=False)


**PreK-12**, **PreK-12; Higher Ed; Corporate** and **PreK-12; Higher Ed** are the main sectors represented. This is consistent with schools being closed or partially closed during the period.

#### Primary Essential Function:

In [None]:
# rename_axis + reset_index(name=...) gives the same two columns on pandas 1.x and 2.x
df7 = Product_data['Primary Essential Function'].value_counts().rename_axis('Primary Essential Function').reset_index(name='Number of products')
df7['Percentage'] = (df7['Number of products'] / df7['Number of products'].sum()) * 100
df7.head(10)

In [None]:
df7.head(20).iplot(kind='bar', x = 'Primary Essential Function', y = 'Number of products', xTitle ='Primary Essential Function', yTitle = 'Number of products', orientation = 'v', sortbars=False)


**LC - Digital Learning Platforms**, LC - Sites, Resources & Reference and LC - Content Creation & Curation are the most represented essential functions, together accounting for 44% of the products.

#### Partial summary 1:

    The missing values represent only 5% of the data set, so we dropped them. 
    We have 5 unique sectors; PreK-12, PreK-12; Higher Ed; Corporate and PreK-12; Higher Ed are the main ones represented.
    We have 284 providers and 7% of the products are provided by Google LLC.
    These products fulfill 35 unique primary essential functions; LC - Digital Learning Platforms, LC - Sites, Resources & Reference and LC - Content Creation & Curation are the functions of 44% of the products.

## Districts' Data : 
    
Let's handle the district data the same way : 
discover it, handle missing values and then explore it a little.

First discovery : 

The district data is made up of the following columns :
    
| Name | Description |
| :--- | :----------- |
| district_id | The unique identifier of the school district |
| state | The state where the district resides in |
| locale | NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. See [Locale Boundaries User's Manual](https://eric.ed.gov/?id=ED577162) for more information. |
| pct_black/hispanic | Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data |
| pct_free/reduced | Percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data |
| county_connections_ratio | `ratio` (residential fixed high-speed connections over 200 kbps in at least one direction/households) based on the county level data from FCC Form 477 (December 2018 version). See [FCC data](https://www.fcc.gov/form-477-county-data-internet-access-services) for more information. |
| pp_total_raw | Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district. |

In [None]:
district_data = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv',sep=',')

district_data.head()

In [None]:
district_data.info()

In [None]:
# Total number of entries (rows x columns) in the dataset
total = district_data.size
# Number of missing values per column
missingCount = district_data.isnull().sum()
# Total number of missing values
missing_tot = missingCount.sum()
# Number of rows containing at least one missing value
rows_with_missing = district_data.isnull().any(axis=1).sum()
print('Number of missing values for each column of the dataframe:\n', missingCount)
print("The dataset contains", round((missing_tot / total) * 100, 2), "% missing values")
print('Number of rows with at least one missing value:', rows_with_missing)
print('Percentage of rows with missing data:', round((rows_with_missing / district_data.shape[0]) * 100, 2), '%\n\n')

We cannot just drop the equivalent of 62% of the data ! Let's check whether there are any correlations before deciding what to do !

In [None]:
ax = msno.heatmap(district_data,figsize=(5,5))
plt.show()

If the state is missing, then the locale and the percentage of Black and Hispanic students are missing too. Since there is no way to infer them from the known variables (that is the reason why they were hidden), we will drop those rows.
In the end, we also drop the other rows with missing data, so that our analysis is not biased by imputed values. 
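
Before committing to a full `dropna`, it can help to see how many rows each option keeps. The sketch below uses a made-up five-row table (hypothetical values, not the real district file) to compare dropping on every column, as this notebook does, with dropping only the rows whose `state` is missing.

```python
import numpy as np
import pandas as pd

# Toy district table (hypothetical values)
toy_districts = pd.DataFrame({
    "district_id": [1, 2, 3, 4, 5],
    "state": ["Utah", np.nan, "Illinois", "Ohio", np.nan],
    "locale": ["Suburb", np.nan, "City", "Rural", np.nan],
    "pp_total_raw": ["[8000, 10000[", np.nan, np.nan, "[10000, 12000[", np.nan],
})

kept_all_cols = toy_districts.dropna()                    # the choice made in this notebook
kept_state_only = toy_districts.dropna(subset=["state"])  # a milder alternative

# Share of rows retained under each option
retention = pd.Series({
    "dropna()": len(kept_all_cols) / len(toy_districts),
    "dropna(subset=['state'])": len(kept_state_only) / len(toy_districts),
})
print(retention)
```

The stricter option keeps fewer districts but guarantees that every remaining row can be used in every later chart without imputation.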

In [None]:
district_data.dropna(axis=0, inplace=True)
district_data.isnull().sum()

In [None]:
print("In this data set, the number of unique districts is", district_data['district_id'].nunique())
print("These districts belong to a number of unique states equal to", district_data['state'].nunique())
print("The number of unique locales is", district_data['locale'].nunique(), "which are", district_data['locale'].unique())

#### Per-pupil total expenditure :

In [None]:
# rename_axis + reset_index(name=...) gives the same two columns on pandas 1.x and 2.x
df6 = district_data['pp_total_raw'].value_counts().rename_axis('interval_pp_total_raw').reset_index(name='Number of districts')
df6['Percentage'] = (df6['Number of districts'] / df6['Number of districts'].sum()) * 100
df6.head(11)

In [None]:
df6.iplot(kind='bar', x='interval_pp_total_raw' , y='Number of districts',  xTitle = 'Interval expenditure per pupil', yTitle = 'Number of occurrences')

Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource 
Database on Schools (NERD$) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district. 

* 30% of the districts spend between $18,000 and $20,000 per pupil and 32% spend between $8,000 and $12,000.

#### county_connections_ratio:

In [None]:
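# rename_axis + reset_index(name=...) gives the same two columns on pandas 1.x and 2.x
df5 = district_data['county_connections_ratio'].value_counts().rename_axis('county_connections_ratio').reset_index(name='Number of districts')
df5['percentage'] = (df5['Number of districts'] / df5['Number of districts'].sum()) * 100

df5.head(10)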

`county_connections_ratio` is the `ratio` (residential fixed high-speed connections over 200 kbps in at least one direction/households) based on the county level data from FCC Form 477 (December 2018 version).

* We can see that this variable is not what differentiates the districts !
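
One quick way to back up a statement like this is to measure how much of a column is taken by its most frequent value. The sketch below does so on a toy frame; the interval strings and the helper `dominant_share` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame with one nearly constant column and one varied column (hypothetical values)
toy = pd.DataFrame({
    "county_connections_ratio": ["[0.18, 1[", "[0.18, 1[", "[0.18, 1[", "[1, 2["],
    "locale": ["Suburb", "City", "Rural", "Town"],
})

def dominant_share(series: pd.Series) -> float:
    """Share of rows taken by the most frequent value (1.0 means the column is constant)."""
    return series.value_counts(normalize=True).iloc[0]

# One share per column; values close to 1.0 flag columns that barely vary
shares = toy.apply(dominant_share)
print(shares)
```

A column whose dominant share is close to 1 carries little information for comparing districts.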

#### Percentage of students having access to free/reduced lunch :

In [None]:
# rename_axis + reset_index(name=...) gives the same two columns on pandas 1.x and 2.x
df4 = district_data['pct_free/reduced'].value_counts().rename_axis('interval_access_pct_free/reduced').reset_index(name='Number of districts')
df4['Percentage'] = (df4['Number of districts'] / df4['Number of districts'].sum()) * 100
df4.head(10)

In [None]:
df4.iplot(kind='bar', x= 'interval_access_pct_free/reduced' , y='Number of districts', theme='white', mode={'Number of districts' : 'bar','Percentage': 'markers'} , xTitle = 'Access_free/reduced lunch', yTitle = 'Number of occurrences')

`pct_free/reduced` is the percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data.

* 80% of the districts have between 0% and 80% of their students eligible for free or reduced-price lunch.

#### Percentage of black/hispanic people :

In [None]:
# rename_axis + reset_index(name=...) gives the same two columns on pandas 1.x and 2.x
df3 = district_data['pct_black/hispanic'].value_counts().rename_axis('interval_pct_black/hispanic').reset_index(name='Number of districts')
df3['percentage'] = (df3['Number of districts'] / df3['Number of districts'].sum()) * 100
df3.head(10)

In [None]:
df3.iplot(kind='bar', x= 'interval_pct_black/hispanic' , y='Number of districts',  xTitle = 'interval_pct_black/hispanic', yTitle = 'Number of occurrences')

`pct_black/hispanic` is the percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data.

* 66% of the districts have between 0% and 20% of their students identified as Black or Hispanic.

#### States :

In [None]:
# rename_axis + reset_index(name=...) gives the same two columns on pandas 1.x and 2.x
df2 = district_data['state'].value_counts().rename_axis('States').reset_index(name='Number of districts')
df2['Percentage'] = (df2['Number of districts'] / df2['Number of districts'].sum()) * 100

k=0
j=0
while k <80 : 
    k=k+df2.at[j,"Percentage"]
    j=j+1
    
df2.head(j)


In [None]:
df2.iplot(kind='bar', x= 'States', y='Number of districts', xTitle = 'States', yTitle = 'Number of districts', orientation ="v")

In [None]:
print(j, 'states account for 80% of the districts represented in this sample and they are', df2["States"][:j].tolist())
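
The `while` loop used above to find how many states are needed to reach 80% can also be written with a cumulative sum. This is only a sketch on made-up counts (the state names `A` to `E` are placeholders), assuming the rows are already sorted in descending order, as `value_counts` returns them.

```python
import pandas as pd

# Toy counts, already sorted from the most to the least represented state
toy_counts = pd.DataFrame({
    "States": ["A", "B", "C", "D", "E"],
    "Number of districts": [40, 30, 15, 10, 5],
})
toy_counts["Percentage"] = toy_counts["Number of districts"] / toy_counts["Number of districts"].sum() * 100
toy_counts["Cumulative"] = toy_counts["Percentage"].cumsum()

# Rows strictly below 80%, plus the one row that crosses the threshold
n_states = int((toy_counts["Cumulative"] < 80).sum()) + 1
print(toy_counts.head(n_states))
```

On this toy data the cumulative shares are 40, 70, 85, 95 and 100, so three states are needed, which is the same answer the loop would give.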

#### Areas :

In [None]:
# rename_axis + reset_index(name=...) gives the same two columns on pandas 1.x and 2.x
df1 = district_data['locale'].value_counts().rename_axis('Locale').reset_index(name='Number of districts')
df1['percentage'] = (df1['Number of districts'] / df1['Number of districts'].sum()) * 100
df1

In [None]:
df1.iplot(kind="bar", x= 'Locale', y= 'Number of districts', xTitle = 'Locale', yTitle = 'Number of districts by locale')

`locale` is the NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural.

* 60% of the districts are in suburban areas 
* only 6% are in towns

#### Partial summary 2 : 
    
The missing values represent more than 60% of the data set. We dropped all the rows that have a NaN value, rather than filling them and biasing the data. 
We have 176 unique districts. These districts belong to 23 states and are spread across 4 types of areas: Suburban, Rural, City and Town.

Some numbers to keep in mind about this data set : 
    
* 30% of the districts spend between $18,000 and $20,000 per pupil and 32% spend between $8,000 and $12,000.
* We can see that county_connections_ratio is not what differentiates the districts !
* 80% of the districts have between 0% and 80% of their students eligible for free or reduced-price lunch.
* 66% of the districts have between 0% and 20% of their students identified as Black or Hispanic.
* 9 states account for 80% of the districts represented in this sample : Connecticut, Utah, Massachusetts, Illinois, California, Ohio, New York, Indiana, Washington.
* 60% of the districts are in suburban areas. 
* only 6% are in towns.
   

## Engagement's Data

To keep the kernel lightweight, we handle missing values as soon as we load the engagement data. For the same reason mentioned above, we drop the NaN values. 

The engagement data has the following columns : 

| Name | Description |
| :--- | :----------- |
| time | date in "YYYY-MM-DD" |
| lp_id | The unique identifier of the product |
| pct_access | Percentage of students in the district who have at least one page-load event of a given product and on a given day |
| engagement_index | Total page-load events per one thousand students of a given product and on a given day |

  

In [None]:
import os

path = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data' 
all_engagement_data_paths = glob.glob(path + "/*.csv")

temporary = []

for csv_path in all_engagement_data_paths:
    df = pd.read_csv(csv_path)
    # The file name without its extension is the district identifier
    district_id = os.path.splitext(os.path.basename(csv_path))[0]
    df["district_id"] = district_id
    df.dropna(axis=0, inplace=True)
    # Split the "YYYY-MM-DD" date into three string columns
    df[['Year', 'Month', 'Day']] = df.time.str.split('-', expand=True)
    temporary.append(df)

# One data frame with the rows of every district
all_engagement_data = pd.concat(temporary)

all_engagement_data.isnull().sum()

In [None]:
all_engagement_data.head()


In [None]:
all_engagement_data["lp_id"]=all_engagement_data["lp_id"].astype('str')
print("In this data set, the number of unique products is", all_engagement_data['lp_id'].nunique())
print("The data covers the period from", all_engagement_data['time'].min(), "to", all_engagement_data['time'].max())
print("and it concerns", all_engagement_data['district_id'].nunique(), "districts")

The number of unique products exceeds the number we have left after cleaning the product and district data. Studying the engagement data alone won't be very helpful. To get more insights, let's merge the 3 data sets we've loaded so far. 
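
Because an inner merge silently discards engagement rows whose `lp_id` has no match in the product table, it can be worth counting them first. Here is a minimal sketch on toy frames (made-up ids and values), using the `indicator=True` option of `merge`.

```python
import pandas as pd

# Toy engagement and product tables (hypothetical ids)
toy_engagement = pd.DataFrame({
    "lp_id": [1.0, 2.0, 3.0, 3.0],
    "engagement_index": [10.0, 5.0, 7.0, 1.0],
})
toy_products = pd.DataFrame({"lp_id": [1.0, 3.0], "Product Name": ["P1", "P3"]})

# A left merge with indicator=True adds a "_merge" column telling where each row was found
checked = toy_engagement.merge(toy_products, on="lp_id", how="left", indicator=True)
unmatched_rows = int((checked["_merge"] == "left_only").sum())

# Product ids present in the engagement data but absent from the product table
unmatched_ids = set(toy_engagement["lp_id"]) - set(toy_products["lp_id"])
print(unmatched_rows, unmatched_ids)
```

Rows flagged `left_only` are exactly the ones an inner merge would drop.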

# Merging time : 

In [None]:
Product_data["lp_id"] = Product_data["lp_id"].astype(float)
district_data["district_id"] = district_data["district_id"].astype(float)
all_engagement_data['district_id']=all_engagement_data['district_id'].astype(float)
all_engagement_data["lp_id"]=all_engagement_data["lp_id"].astype(float)


merge_df = pd.merge(all_engagement_data, district_data, on="district_id")
merge_df = pd.merge(merge_df, Product_data, on="lp_id")

In [None]:
merge_df.head ()

Data is ready !

# Time-based analysis : 

We will focus on the engagement_index only, since pct_access is correlated with it.
Let's see how the engagement index evolved by state throughout the whole studied period (2020). 
To get rid of the dips in engagement_index caused by weekends, we will look at how it evolved from one month to the next, instead of day by day. 
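
To illustrate why aggregating helps, here is a small sketch on a synthetic daily series (hypothetical values: 100 on weekdays, 10 on weekends). It shows the monthly sum used in this notebook and, as an alternative that is not used here, a 7-day rolling mean.

```python
import numpy as np
import pandas as pd

# Synthetic daily engagement for January and February 2020
days = pd.date_range("2020-01-01", "2020-02-29", freq="D")
values = np.where(days.dayofweek < 5, 100.0, 10.0)  # Monday=0 ... Sunday=6
toy = pd.DataFrame({"time": days, "engagement_index": values})

# Monthly totals: weekend dips are absorbed into one number per month
monthly = toy.groupby(toy["time"].dt.month)["engagement_index"].sum()

# Alternative: every 7-day window holds 5 weekdays and 2 weekend days,
# so the rolling mean is flat for this synthetic series
weekly_smooth = toy.set_index("time")["engagement_index"].rolling(7).mean()
print(monthly)
```

Monthly sums are simple to read on a map or a line chart, while a rolling mean would keep a daily resolution.
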
#### Map representation :

In [None]:
#agg_digi_learn_df = result_df[result_df["Primary Essential Function"] == 'LC - Digital Learning Platforms']
agg_engagement_data = merge_df.groupby(["state", "time"],as_index=False)["engagement_index"].sum().reset_index()


def set_size(value):
    '''
    Takes the numeric value of a parameter to visualize on a map (Plotly Geo-Scatter plot).
    Returns a number indicating the size of the bubble for the state whose numeric
    attribute value was supplied as input.
    '''
    result = np.log(1+value/100)
    if result < 0:
        result = 0.001
    return result


us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}

pipeline = pdp.PdPipeline([
    pdp.ApplyByCols('engagement_index', set_size, 'size', drop=False),
    pdp.MapColVals('state', us_state_abbrev)])




agg_engagement_data_map = pipeline.apply(agg_engagement_data)

agg_engagement_data_map.fillna(0, inplace=True)



agg_engagement_data_map = agg_engagement_data_map.sort_values(by='time', ascending=True)
agg_engagement_data_map.drop(['index'], axis = 1, inplace = True)
agg_engagement_data_map.tail()



In [None]:
fig = px.scatter_geo(
    agg_engagement_data_map, locations="state", locationmode='USA-states',
    scope="usa",
    color="engagement_index", 
    size='size', hover_name="state", animation_frame= pd.to_datetime(agg_engagement_data_map["time"]).dt.month, 
    range_color= [0, 100000], 
    projection="albers usa", 
    title='Engagement Index')
fig.show ()
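
The bubble size above is computed one value at a time through `set_size`. As a sketch only, the same transformation can be applied to a whole column at once; `set_size_vectorized` is a hypothetical helper, not part of the original notebook, and the toy values are made up.

```python
import numpy as np
import pandas as pd

def set_size(value):
    # Scalar version, as defined earlier in this notebook
    result = np.log(1 + value / 100)
    if result < 0:
        result = 0.001
    return result

def set_size_vectorized(values: pd.Series) -> pd.Series:
    """Hypothetical column-wise variant: log(1 + x/100), with negative sizes replaced by 0.001."""
    sizes = np.log1p(values / 100)
    return sizes.mask(sizes < 0, 0.001)

# Toy engagement values, including a negative one to exercise the floor
toy = pd.Series([-50.0, 0.0, 50.0, 1000.0, 250000.0])
sizes_scalar = toy.apply(set_size)
sizes_vector = set_size_vectorized(toy)
print(pd.DataFrame({"scalar": sizes_scalar, "vectorized": sizes_vector}))
```

Both columns should agree; the vectorized form mainly saves time on large frames.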

#### Graphic representation :

In [None]:
pipeline = pdp.PdPipeline([
    pdp.ApplyByCols('engagement_index', set_size, 'size', drop=False)])

agg_engagement_data = pipeline.apply(agg_engagement_data)
agg_engagement_data_monthly = agg_engagement_data.drop(['size','index'], axis=1)
agg_engagement_data_monthly['month'] = pd.to_datetime(agg_engagement_data["time"]).dt.month
agg_engagement_data_monthly.drop(['time'], inplace = True, axis =1)
agg_engagement_data_monthly.head()
agg_engagement_data_monthly_plot = agg_engagement_data_monthly.pivot_table(index= agg_engagement_data_monthly["state"], columns= agg_engagement_data_monthly["month"], values='engagement_index', aggfunc=sum)
agg_engagement_data_monthly_plot.rename ({1 : 'January', 2:'February', 3:'March', 4:'April' , 5: 'May', 6:'June', 7:'July', 8:'August', 9:'September', 10:'October', 11:'November', 12:'December'}, axis=1, inplace = True)
agg_engagement_data_monthly_plot.transpose().iplot()

### Partial summary 3 : 
From both the map and the line chart, we can deduce : 

* Engagement rose in early 2020 until April, when it reached its local maximum for that period.
* Engagement decreased during June and July and, for most states, in August too. 
* Engagement started increasing again in September and surpassed April's maximum.
* Engagement then evolved steadily until the end of the year.
* It decreased slightly in December.

As we will see below, the most used products belong to "LC : Learning & Curriculum", followed by "CM: Classroom Management". They are therefore related to distance learning, which is closely affected by these events : 

* A state of emergency was declared at the federal level on 13/03/2020,
* Seasonal holidays and summer holidays (June, July and August),
* Back-to-school season : many districts did not reopen in September, but the trend changed in October. Parents were also given a choice between in-person and distance instruction, knowing that an achievement gap had been identified between these two modes. This could explain the decrease between September and October-November: because of the gap in student results, we can assume that many parents chose in-person instruction.
* In December, the winter break could be the cause of the slight decrease.

These events match the trend and largely explain it. 

It is important to note that the drop in engagement during the winter break did not bring it back down to the level seen before the distance learning policies were implemented. 
So let's uncover how, and by how much !

## Engagement's residual value :    

First, let's check for the whole year : 
* What are the characteristics of the most used products, 
* Which states are the most engaged. 

And then, let's check for the month of December :

* Which products kept being used, 
* Which states kept being engaged. 

#### Engagement by provider/company name :

In [None]:
merge_data_provider_ranking = merge_df.groupby(["Provider/Company Name"],as_index=False)["engagement_index"].sum().reset_index()
merge_data_provider_ranking.drop(['index'], inplace = True, axis =1)
merge_data_provider_ranking['Percentage']= (merge_data_provider_ranking['engagement_index']/merge_data_provider_ranking['engagement_index'].sum())*100
merge_data_provider_ranking.sort_values ('engagement_index', ascending = False).head(5)

In [None]:
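# Strip surrounding whitespace so the match does not depend on a trailing space in the provider name
Products_provided_by_instructure_inc = Product_data[Product_data['Provider/Company Name'].str.strip() == 'Instructure, Inc.']
list(Products_provided_by_instructure_inc['Product Name'])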

In [None]:
merge_data_provider_ranking.sort_values ('engagement_index', ascending = False).head(5).iplot(kind='bar', x = 'Provider/Company Name', y = 'engagement_index', xTitle ='Provider/Company Name', yTitle = 'engagement_index', orientation = 'v', sortbars=False)

In terms of engagement, **Google LLC** remains first by far, followed this time by ***Instructure, Inc***, though it provides few products (less than 0.8%): Canvas and Instructure, both well known. 
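
To compare a provider's share of products with its share of engagement in one table, a sketch like the following could be used. The providers `G`, `I` and `X` and all numbers are made up; only the column names follow the merged data frame.

```python
import pandas as pd

# Toy merged data: one row per engagement record (hypothetical values)
toy_merged = pd.DataFrame({
    "Provider/Company Name": ["G", "G", "G", "I", "I", "X"],
    "lp_id": [1, 2, 2, 3, 3, 4],
    "engagement_index": [100.0, 50.0, 50.0, 150.0, 50.0, 0.0],
})

# Number of distinct products and total engagement per provider
per_provider = toy_merged.groupby("Provider/Company Name").agg(
    n_products=("lp_id", "nunique"),
    engagement=("engagement_index", "sum"),
)
per_provider["product_share_pct"] = per_provider["n_products"] / per_provider["n_products"].sum() * 100
per_provider["engagement_share_pct"] = per_provider["engagement"] / per_provider["engagement"].sum() * 100
per_provider["engagement_per_product"] = per_provider["engagement"] / per_provider["n_products"]
print(per_provider)
```

A provider with a small product share but a large engagement share stands out in the last column.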

#### Engagement by sectors :

In [None]:
merge_data_Sector_ranking = merge_df.groupby(["Sector(s)"],as_index=False)["engagement_index"].sum().reset_index()
merge_data_Sector_ranking.drop(['index'], inplace = True, axis =1)
merge_data_Sector_ranking['Percentage']= (merge_data_Sector_ranking ['engagement_index']/merge_data_Sector_ranking ['engagement_index'].sum())*100
merge_data_Sector_ranking.sort_values ('engagement_index', ascending = False).head(5)

In [None]:
merge_data_Sector_ranking.sort_values ('engagement_index', ascending = False).head(5).iplot(kind='bar', x = 'Sector(s)', y = 'engagement_index', xTitle ='Sector(s)', yTitle = 'engagement_index', orientation = 'v', sortbars=False)

**PreK-12; Higher Ed; Corporate** takes the lead, switching places with PreK-12 ! 

#### Engagement by Primary Essential Function :

In [None]:
merge_data_Primary_Essential_Function_ranking = merge_df.groupby(["Primary Essential Function"],as_index=False)["engagement_index"].sum().reset_index()
merge_data_Primary_Essential_Function_ranking.drop(['index'], inplace = True, axis =1)
merge_data_Primary_Essential_Function_ranking['Percentage']= (merge_data_Primary_Essential_Function_ranking ['engagement_index']/merge_data_Primary_Essential_Function_ranking['engagement_index'].sum())*100
merge_data_Primary_Essential_Function_ranking.sort_values ('engagement_index', ascending = False).head(5)

In [None]:
merge_data_Primary_Essential_Function_ranking.sort_values ('engagement_index', ascending = False).head(5).iplot(kind='bar', x = 'Primary Essential Function', y = 'engagement_index', xTitle ='Primary Essential Function', yTitle = 'engagement_index', orientation = 'v', sortbars=False)

**LC - Content Creation & Curation** takes the lead this time, far ahead of ***LC - Digital Learning Platforms***, though only 10% (36 products) of the products in this data set have it as their primary function. 

#### Engagement by state :

In [None]:
merge_data_State_ranking = merge_df.groupby(["state"],as_index=False)["engagement_index"].sum().reset_index()
merge_data_State_ranking.drop(['index'], inplace = True, axis =1)
merge_data_State_ranking['Percentage']= (merge_data_State_ranking['engagement_index']/merge_data_State_ranking['engagement_index'].sum())*100
# Sort once and reset the index so that the row labels 0, 1, 2... follow the ranking:
# .at[j, ...] looks rows up by label, not by position
state_ranking_sorted = merge_data_State_ranking.sort_values('engagement_index', ascending=False).reset_index(drop=True)

k=0
j=0
while k <80 : 
    k=k+state_ranking_sorted.at[j,"Percentage"]
    j=j+1
    
state_ranking_sorted.head(j)

In [None]:
merge_data_State_ranking.sort_values ('engagement_index', ascending = False).head(5).iplot(kind='bar', x = 'state', y = 'engagement_index', xTitle ='state', yTitle = 'engagement_index', orientation = 'v', sortbars=False)

**Illinois** now takes the lead, switching places with ***Utah***. Indiana keeps its third rank. 

### Let's check if something changed in December : 

In [None]:
agg_eng_data_partially = merge_df [['state', 'engagement_index', 'lp_id', 'Product Name', 'Provider/Company Name', 'Primary Essential Function', 'Month', 'Sector(s)']]
agg_eng_data_december= agg_eng_data_partially[agg_eng_data_partially['Month'] == '12']
agg_eng_data_december.head()

#### Engagement by provider :

In [None]:
agg_eng_data_december_products = agg_eng_data_december.groupby(["Provider/Company Name"],as_index=False)["engagement_index"].sum().reset_index()
agg_eng_data_december_products.drop(['index'], inplace = True, axis =1)
df20 = agg_eng_data_december_products.sort_values ('engagement_index', ascending = False)
df20 ['Percentage']= (df20['engagement_index']/df20['engagement_index'].sum())*100
df20.head (10)


Same trend ! 

#### Engagement by sectors :

In [None]:
agg_eng_data_december_products = agg_eng_data_december.groupby(["Sector(s)"],as_index=False)["engagement_index"].sum().reset_index()
agg_eng_data_december_products.drop(['index'], inplace = True, axis =1)
df21 = agg_eng_data_december_products.sort_values ('engagement_index', ascending = False)
df21 ['Percentage']= (df21['engagement_index']/df21['engagement_index'].sum())*100
df21.head (10)

Same trend again !

#### Engagement by Primary Essential Function :

In [None]:
agg_eng_data_december_products = agg_eng_data_december.groupby(["Primary Essential Function"],as_index=False)["engagement_index"].sum().reset_index()
agg_eng_data_december_products.drop(['index'], inplace = True, axis =1)
df22 = agg_eng_data_december_products.sort_values ('engagement_index', ascending = False)
df22 ['Percentage']= (df22['engagement_index']/df22['engagement_index'].sum())*100
df22.head (10)

* **LC - Content Creation & Curation** remains first, 
* LC - Sites, Resources & Reference - Streaming is now second, up from fifth. Its share in terms of percentage has doubled.
* SDO - Learning Management Systems (LMS) is now third instead of second 
* all the engagement indexes are 5 to 10 times lower than the whole-year totals.
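
Rank changes like these can be read off automatically rather than by comparing two tables by eye. The sketch below uses four placeholder functions (`F1` to `F4`) with made-up totals for the whole year and for December.

```python
import pandas as pd

# Hypothetical engagement totals per function
year = pd.Series({"F1": 500.0, "F2": 300.0, "F3": 200.0, "F4": 100.0}, name="year")
december = pd.Series({"F1": 60.0, "F2": 20.0, "F3": 10.0, "F4": 30.0}, name="december")

comparison = pd.concat([year, december], axis=1)
comparison["rank_year"] = comparison["year"].rank(ascending=False).astype(int)
comparison["rank_december"] = comparison["december"].rank(ascending=False).astype(int)
# Positive values mean the function moved up in December
comparison["rank_change"] = comparison["rank_year"] - comparison["rank_december"]
comparison["share_december_pct"] = comparison["december"] / comparison["december"].sum() * 100
print(comparison.sort_values("rank_december"))
```

Comparing shares rather than raw sums avoids mixing a twelve-month total with a single month.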

#### Engagement by states :

In [None]:
agg_eng_data_december_products = agg_eng_data_december.groupby(["state"],as_index=False)["engagement_index"].sum().reset_index()
agg_eng_data_december_products.drop(['index'], inplace = True, axis =1)
df23 = agg_eng_data_december_products.sort_values ('engagement_index', ascending = False)
df23 ['Percentage']= (df23['engagement_index']/df23['engagement_index'].sum())*100
df23.head (10)

Same trend ! 
All the whole-year engagement indexes were divided by 5 to 10 in December !

Let's check how the engagement index changed from January to December : 

In [None]:
agg_engagement_data_monthly_plot['ratio'] = (agg_engagement_data_monthly_plot['December']/agg_engagement_data_monthly_plot['January'])
print('The engagement_index has been multiplied by' , agg_engagement_data_monthly_plot['ratio'].min(), 'to', agg_engagement_data_monthly_plot['ratio'].max(), 'between January and December')
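
The same December-to-January ratio can be checked on a tiny example to see what the pivot is doing. The values below are made up and the two state names are only labels.

```python
import pandas as pd

# Toy engagement records (hypothetical values)
toy = pd.DataFrame({
    "state": ["Utah", "Utah", "Utah", "Ohio", "Ohio", "Ohio"],
    "month": [1, 1, 12, 1, 12, 12],
    "engagement_index": [100.0, 100.0, 300.0, 50.0, 100.0, 100.0],
})

# One row per state, one column per month, summed engagement in each cell
monthly = toy.pivot_table(index="state", columns="month", values="engagement_index", aggfunc="sum")

# Ratio of December to January engagement for each state
ratio = monthly[12] / monthly[1]
print(ratio)
```

A ratio above 1 means the state ended the year with more engagement than it started with.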

### Partial summary 4: 

In terms of engagement:
* **Google LLC** remains first by far, followed this time by ***Instructure, Inc***, though it provides few products (less than 0.8% of them): Canvas and Instructure, both very well known.
* **PreK-12; Higher Ed; Corporate** takes the lead, switching places with PreK-12.
* **LC - Content Creation & Curation** takes the lead this time, far ahead of ***LC - Digital Learning Platforms***, though only 10% (36 products) of the products in this data set have it as their primary function.
* **Illinois** now takes the lead, switching places with ***Utah***. Indiana keeps its third rank.

There is no particular change between the whole-year trend and December. The residual engagement value is 5 to 10 times lower than in fall 2020, but it was multiplied by 1.14 to 4.75 between January and December.
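
The January-to-December multiplier is a plain column ratio, which breaks down when a state has no January engagement at all. The sketch below shows one way to guard against that; the toy frame `monthly_toy` (one row per state, one column per month) and its numbers are assumptions made for illustration, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy data: made-up monthly engagement values for three hypothetical states
monthly_toy = pd.DataFrame(
    {"January": [10.0, 20.0, 0.0], "December": [25.0, 30.0, 5.0]},
    index=["State A", "State B", "State C"],
)

# Treat a zero January value as missing so the ratio becomes NaN instead of inf
ratio = monthly_toy["December"] / monthly_toy["January"].replace(0, np.nan)
valid = ratio.dropna()

print(
    f"The engagement_index has been multiplied by {valid.min():.2f} "
    f"to {valid.max():.2f} between January and December"
)
```

Formatting the bounds with two decimals also keeps the printed range readable.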


## Are there any valuable correlations ?
In this part we check :

* The possible correlations between the states' characteristics and the engagement index.
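
Before any correlation can be computed, the district characteristics have to be numeric. They appear to be stored as interval strings (the cell in this section splits them on a comma and strips a bracket), so the sketch below assumes values of the form `[0.2, 0.4[` and reduces each one to its midpoint. The helper `interval_midpoint` is a hypothetical name, and the regular expression is only one possible way to parse that assumed format.

```python
import pandas as pd


def interval_midpoint(series: pd.Series) -> pd.Series:
    """Midpoint of interval strings such as '[0.2, 0.4['; missing values stay NaN.

    Hypothetical helper: it assumes a lower bound, a comma and an upper bound
    enclosed between two '[' characters.
    """
    bounds = series.str.extract(r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*\[").astype(float)
    return bounds.mean(axis=1)


# Toy values in the assumed format, including one missing entry
toy_intervals = pd.Series(["[0.2, 0.4[", "[0, 0.2[", None, "[14000, 16000["])
midpoints = interval_midpoint(toy_intervals)
print(midpoints)
```

Using a single helper for every interval column avoids repeating the split, strip and cast steps for each of them.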

#### Correlation table :

In [None]:
# Split the interval columns 'pct_black/hispanic', 'pct_free/reduced' and 'pp_total_raw' into two columns each (lower and upper bound); 'county_connections_ratio' is not split, only dropped below
# Work on a copy so that merge_df itself is left untouched and the cell can be re-run
merge_df_corr = merge_df.copy()
#'pct_black/hispanic'
merge_df_corr[['min_pct_black/hispanic','max_pct_black/hispanic']] = merge_df_corr['pct_black/hispanic'].str.split(",",expand=True)
merge_df_corr['min_pct_black/hispanic'] = merge_df_corr['min_pct_black/hispanic'].str.strip('[')
merge_df_corr['max_pct_black/hispanic'] = merge_df_corr['max_pct_black/hispanic'].str.strip('[')

#'pct_free/reduced'
merge_df_corr[['min_pct_free/reduced','max_pct_free/reduced']] = merge_df_corr['pct_free/reduced'].str.split(",",expand=True)
merge_df_corr['min_pct_free/reduced'] = merge_df_corr['min_pct_free/reduced'].str.strip('[')
merge_df_corr['max_pct_free/reduced'] = merge_df_corr['max_pct_free/reduced'].str.strip('[')

#'pp_total_raw'
merge_df_corr[['min_pp_total_raw','max_pp_total_raw']] = merge_df_corr['pp_total_raw'].str.split(",",expand=True)
merge_df_corr['min_pp_total_raw'] = merge_df_corr['min_pp_total_raw'].str.strip('[')
merge_df_corr['max_pp_total_raw'] = merge_df_corr['max_pp_total_raw'].str.strip('[')

#Drop the original columns
merge_df_corr.drop(['pct_black/hispanic', 'pct_free/reduced', 'county_connections_ratio', 'pp_total_raw'], axis = 1, inplace = True)

#Convert the bounds to floats, then average the lower and upper bound of each interval

merge_df_corr['min_pp_total_raw'] = merge_df_corr['min_pp_total_raw'].astype(float)
merge_df_corr['max_pp_total_raw'] = merge_df_corr['max_pp_total_raw'].astype(float)


merge_df_corr['min_pct_free/reduced'] = merge_df_corr['min_pct_free/reduced'].astype(float)
merge_df_corr['max_pct_free/reduced'] = merge_df_corr['max_pct_free/reduced'].astype(float)

merge_df_corr['min_pct_black/hispanic'] = merge_df_corr['min_pct_black/hispanic'].astype(float)
merge_df_corr['max_pct_black/hispanic'] = merge_df_corr['max_pct_black/hispanic'].astype(float)

merge_df_corr["avg_pct_black/hispanic"] = merge_df_corr[["min_pct_black/hispanic", "max_pct_black/hispanic"]].mean(axis=1)
merge_df_corr["avg_pct_free/reduced"] = merge_df_corr[["min_pct_free/reduced", "max_pct_free/reduced"]].mean(axis=1)
merge_df_corr["avg_pp_total_raw"] = merge_df_corr[["min_pp_total_raw", "max_pp_total_raw"]].mean(axis=1)



In [None]:
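# Keep only the averaged columns, restrict to numeric columns (recent pandas versions raise on text columns) and compute the correlation matrix
merge_df_corr.drop(['max_pct_black/hispanic','min_pct_black/hispanic', 'max_pct_free/reduced', 'min_pct_free/reduced', 'max_pp_total_raw', 'min_pp_total_raw'], axis=1).select_dtypes('number').corr()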

#### Partial summary 5 : 
Let's focus on the engagement_index : 
* Unsurprisingly, it is positively correlated with pct_access.
* It is positively correlated with the expenditure per pupil. The more is spent on a student's education, the more access to digital tools they have, and hence the higher their engagement index.
* It is negatively correlated with the percentage of black or hispanic people: the higher a district/state's avg_pct_black/hispanic, the lower its engagement_index. This percentage is positively correlated with the expenditure per pupil and with the percentage of students having access to free/reduced lunch, which leads us to think that engagement may not be a question of expenditure only but of culture as well.
* It is negatively correlated with the percentage of students having access to free/reduced lunch.
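
To read these relationships directly rather than scanning the full table, the column of the correlation matrix that concerns engagement_index can be extracted and sorted. This is a minimal sketch on made-up toy values; the helper `correlations_with` is a hypothetical name, and the Spearman variant is added only as a rank-based cross-check that the notebook itself does not compute.

```python
import pandas as pd

# Toy data: made-up values that reuse the notebook's column names
toy = pd.DataFrame({
    "engagement_index": [10.0, 20.0, 30.0, 40.0, 50.0],
    "pct_access": [1.0, 2.0, 3.0, 4.0, 5.0],
    "avg_pct_free/reduced": [0.9, 0.7, 0.5, 0.3, 0.1],
    "avg_pp_total_raw": [9000.0, 15000.0, 11000.0, 13000.0, 17000.0],
})


def correlations_with(df, target, method="pearson"):
    """Correlation of every numeric column with `target`, sorted ascending (hypothetical helper)."""
    numeric = df.select_dtypes("number")
    return numeric.corr(method=method)[target].drop(target).sort_values()


pearson = correlations_with(toy, "engagement_index")
spearman = correlations_with(toy, "engagement_index", method="spearman")
print(pd.DataFrame({"pearson": pearson, "spearman": spearman}))
```

The most negative relationships come first and the most positive last. A coefficient only describes how two columns move together, so it cannot by itself separate the effect of expenditure from that of the other characteristics.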

# Summary

For this data set, we dropped all the NaN values to avoid a biased analysis.

Some descriptive statistics : 


* We have 5 unique sectors; the most represented are "PreK-12", "PreK-12; Higher Ed; Corporate" and "PreK-12; Higher Ed".
* We have 284 providers, and 7% of the products are provided by Google LLC.
* These products fulfill 35 unique primary essential functions; LC - Digital Learning Platforms, LC - Sites, Resources & Reference and LC - Content Creation & Curation are the functions of 44% of the products.
* We have 176 unique districts. These districts belong to 23 states and are spread over 4 areas: Suburban, Rural, City and Town.
* 30% of the districts spend between 18,000 and 20,000 NERD$ and 32% spend between 8,000 and 12,000.
* The county_connections_ratio is not what differentiates the districts.
* 80% of the districts have between 0% and 80% of their students eligible for free or reduced-price lunch.
* 66% of the districts have between 0% and 20% of their population identifying as black or hispanic.
* 9 states account for 80% of the districts represented in this sample: Connecticut, Utah, Massachusetts, Illinois, California, Ohio, New York, Indiana, Washington.
* 60% of the districts are in suburban areas.
* Only 6% of the districts are in towns.


Engagement progression through 2020 : 

* Engagement increased in early 2020 until April, when it reached its local maximum for that period.
* Engagement decreased during June and July and, for most of the states, in August too.
* Engagement started increasing again in September and surpassed April's maximum.
* Engagement then remained steady until the end of the year.
* It decreased slightly in December.
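
A progression like this one can be read from a monthly aggregate of the daily engagement records. The sketch below uses made-up toy values, not the notebook's data; the `time` column name and the toy levels are assumptions chosen only to illustrate the aggregation.

```python
import pandas as pd

# Toy data: two made-up daily records per month of 2020
levels = [100, 120, 150, 200, 180, 60, 40, 70, 220, 230, 225, 200]
rows = []
for month, level in enumerate(levels, start=1):
    rows.append({"time": f"2020-{month:02d}-01", "engagement_index": level - 10})
    rows.append({"time": f"2020-{month:02d}-15", "engagement_index": level + 10})
daily_toy = pd.DataFrame(rows)
daily_toy["time"] = pd.to_datetime(daily_toy["time"])

# Mean engagement per calendar month (index 1 to 12)
monthly_mean = daily_toy.groupby(daily_toy["time"].dt.month)["engagement_index"].mean()

# Month-over-month change in percent, to spot the drops and rebounds
monthly_change = monthly_mean.pct_change() * 100

# Local maximum of the first half of the year and maximum of the whole year
peak_first_half = monthly_mean.loc[1:6].idxmax()
peak_year = monthly_mean.idxmax()
print(monthly_mean, monthly_change.round(1), peak_first_half, peak_year, sep="\n")
```

Comparing `peak_first_half` with `peak_year` is what distinguishes a spring local maximum from the higher level reached in the fall.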

The most used products belong to "LC : Learning & Curriculum". They are therefore related to distance learning, which is closely affected by these events:

* The state of emergency declared at the federal level on 13/03/2020,
* Seasonal holidays and summer holidays (June, July and August),
* Back-to-school season: many districts did not reopen in September, but the trend changed in October. Parents were also given a choice between in-person and distance instruction, knowing that an achievement gap had been identified between these two modes. That is what explains the decrease between September and October-November; because of the gap between the students' results, we can assume that many parents chose in-person instruction.
* In December, the winter break could be the cause of the slight decrease.

These events mainly explain the trend.

In terms of engagement throughout the year : 

* Google LLC (27 products) remains first by far, followed this time by Instructure, Inc, though it provides few products (less than 0.8% of them): Canvas and Instructure, both very well known.
* PreK-12; Higher Ed; Corporate takes the lead, switching places with PreK-12.
* LC - Content Creation & Curation takes the lead this time, far ahead of LC - Digital Learning Platforms, though only 10% (36 products) of the products in this data set have it as their primary function.
* Illinois now takes the lead, switching places with Utah. Indiana keeps its third rank.

Something important to notice is that the engagement level, although it dropped because of the winter break, did not fall back to its level from before the distance learning policies were implemented. The analysis showed that consumer behaviour in terms of products downloaded barely changed during December. The residual engagement value in December is 5 to 10 times lower than in fall 2020. However, it is 1.14 to 4.75 times January's engagement value.

* Unsurprisingly, the engagement_index is positively correlated with pct_access.
* It is positively correlated with the expenditure per pupil. The more is spent on a student's education, the more access to digital tools they have, and hence the higher their engagement index.
* It is negatively correlated with the percentage of black or hispanic people: the higher a district/state's avg_pct_black/hispanic, the lower its engagement_index. This percentage is positively correlated with the expenditure per pupil and with the percentage of students having access to free/reduced lunch, which leads us to think that engagement may not be a question of expenditure only but of culture as well.
* It is negatively correlated with the percentage of students having access to free/reduced lunch.