# Challenge

We encourage you to guide the analysis. Below are some examples of questions that relate to our problem statement:

 - What is the picture of digital connectivity and engagement in 2020?
 - What is the effect of the COVID-19 pandemic on online and distance learning, and how might this also evolve in the future?
 - How does student engagement with different types of education technology change over the course of the pandemic?
 - How does student engagement with online learning platforms relate to different geography? Demographic context (e.g., race/ethnicity, ESL, learning disability)? Learning context? Socioeconomic status?
 - Do certain state interventions, practices or policies (e.g., stimulus, reopening, eviction moratorium) correlate with an increase or decrease in online engagement?

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import glob # for assembling multiple CSVs
import missingno as msno

#for visualization
import seaborn as sns 
import matplotlib.pyplot as plt
import matplotlib as mpl
import plotly.express as px
from wordcloud import WordCloud, STOPWORDS

# for ignoring FutureWarning messages
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# **Importing Dataset**

In [None]:
# importing dataset
districts_df = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")
products_df = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv")


# importing all the csv from engagement folder
folder = glob.glob("../input/learnplatform-covid19-impact-on-digital-learning/engagement_data/*.csv")

merged = []

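for CSV in folder:
    # each file holds the engagement records of one district
    df = pd.read_csv(CSV, index_col = None, header = 0)
    # the file name (without ".csv") is the district id; index 4 is the file name for this input path
    district_id = CSV.split("/")[4].split(".")[0]
    df["district_id"] = district_id
    merged.append(df)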
    
engagement_df = pd.concat(merged)
engagement_df = engagement_df.reset_index(drop=True)
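
The district id above is read from a fixed position in the split path, so it depends on how deep the input directory is. A minimal sketch of a depth-independent alternative using `pathlib.Path.stem` from the standard library; the helper name `district_id_from_path` and the file name `1000.csv` are hypothetical and only serve as an illustration:

```python
from pathlib import Path

def district_id_from_path(csv_path):
    """Return the file name without directory or extension (hypothetical helper)."""
    return Path(csv_path).stem

# Illustrative path; the file name 1000.csv is an assumption, not taken from the dataset listing.
example = "../input/learnplatform-covid19-impact-on-digital-learning/engagement_data/1000.csv"
print(district_id_from_path(example))
```

Inside the loop this would replace the `split` expression with `district_id_from_path(CSV)`; the result is still a string.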


In [None]:
df_name = ['districts_df','products_df','engagement_df']
df_list = [districts_df,products_df,engagement_df]
# print the shape of each dataframe and preview its first rows
for name, frame in zip(df_name, df_list):
    print('*****'*12)
    print(f'Dataframe {name} has {frame.shape[0]} Rows and {frame.shape[1]} Columns')
    print('*****'*12)
    display(frame.head(5).style.set_properties(**{'background-color': 'white','color': 'black','border': '1.5px solid black'}))

**By looking at the data:**

- The data covers 1 Jan 2020 to 31 Dec 2020
- 233 school districts
- 372 tech products
- Around 22M engagement records
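
These figures can be checked directly from the loaded frames. The sketch below uses small made-up frames so that it runs on its own; the column names `district_id`, `LP ID` and `time` follow the dataset, while the values and the helper name `summarize` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for the real frames (all values are made up for illustration).
districts_toy = pd.DataFrame({"district_id": [1000, 1039, 1044]})
products_toy = pd.DataFrame({"LP ID": [13117, 66933]})
engagement_toy = pd.DataFrame({
    "time": ["2020-01-01", "2020-06-15", "2020-12-31"],
    "lp_id": [13117, 66933, 13117],
})

def summarize(districts, products, engagement):
    """Collect the date range and basic counts in one dictionary (hypothetical helper)."""
    dates = pd.to_datetime(engagement["time"])
    return {
        "first_day": dates.min().date().isoformat(),
        "last_day": dates.max().date().isoformat(),
        "n_districts": districts["district_id"].nunique(),
        "n_products": products["LP ID"].nunique(),
        "n_records": len(engagement),
    }

summary = summarize(districts_toy, products_toy, engagement_toy)
print(summary)
```

Calling `summarize(districts_df, products_df, engagement_df)` on the real frames should give the numbers listed above, assuming the frames were loaded as in the import cell.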

# Exploratory Data Analysis

## missingno Library

 - Python has a library named missingno that provides a few graphs for visualizing missing data from different perspectives, which helps when deciding how to handle it. missingno is built on matplotlib, so all the graphs it generates are static.

 - missingno currently provides four plots for understanding the distribution of missing data in a dataset: bar chart, matrix, heatmap and dendrogram. Here I'm using the bar chart, which shows the count of non-missing values per column.
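
The numbers behind the bar chart can also be obtained with plain pandas. A small sketch on a made-up frame (the frame `toy` and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing entries.
toy = pd.DataFrame({
    "state": ["Illinois", np.nan, "Utah", "Ohio"],
    "locale": ["Suburb", "City", np.nan, np.nan],
    "district_id": [1, 2, 3, 4],
})

# Non-missing values per column: the quantity the bar chart displays.
present = toy.notna().sum()

# Share of missing values per column, in percent.
missing_pct = toy.isna().mean() * 100

print(present)
print(missing_pct)
```

`isna().mean() * 100` is a shorter equivalent of dividing `isnull().sum()` by the number of rows.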

In [None]:
msno.bar(districts_df,figsize=(12,6), color = 'turquoise');

In [None]:
msno.bar(products_df,figsize=(12,6), color = 'turquoise');

In [None]:
msno.bar(engagement_df,figsize=(12,6), color = 'turquoise');

In [None]:
# checking missing values
print('-'*45)
print("percentage of missing values in DISTRICT DATA")
print('-'*45)
print(districts_df.isnull().sum()/len(districts_df)*100)
print('-'*45)
print("percentage of missing values in PRODUCT DATA")
print('-'*45)
print(products_df.isnull().sum()/len(products_df)*100)
print('-'*45)
print("percentage of missing values in ENGAGEMENT DATA")
print('-'*45)
print(engagement_df.isnull().sum()/len(engagement_df)*100)

# dropping rows where "state" is missing
districts_df.dropna(subset=["state"], inplace=True)

# also dropping the "pp_total_raw" column as it has ~50% null values
districts_df.drop(columns=["pp_total_raw"], inplace=True)

In [None]:
plt.figure(figsize=(16, 10))
sns.countplot(y="state",data=districts_df,order=districts_df.state.value_counts().index,palette="YlOrBr",linewidth=3)
plt.title("State Freq Chart in District Information data",font="Serif", size=20,pad=20)
plt.show()

In [None]:
plt.figure(figsize=(16, 10))
sns.countplot(y="locale",data=districts_df,order=districts_df.locale.value_counts().index,palette="YlOrBr",linewidth=3)
plt.title("Locale Freq Chart in District Information data",font="Serif", size=20,pad=20)
plt.show()

In [None]:
%matplotlib inline
# align the key name with products_df, then attach product details to each engagement record
engagement_df.rename(columns={"lp_id": "LP ID"}, inplace=True)
merged = pd.merge(engagement_df, products_df, on="LP ID")

# top 10 products by mean pct_access and by total engagement_index
m = merged.groupby("Product Name")["pct_access"].mean().sort_values(ascending=False).head(10)
n = merged.groupby("Product Name")["engagement_index"].sum().sort_values(ascending=False).head(10)

# plot
plt.figure(figsize=(15,4))

plt.subplot(121)
plt.bar(m.index, m.values, color=["#6930c3","#5e60ce","#0096c7","#48cae4","#ade8f4","#ff7f51","#ff9b54","#ffbf69"])
plt.xlabel('Product Name')
plt.xticks(rotation=90)
plt.ylabel('Mean percentage of students')
plt.title("With at least one page-load event")

plt.subplot(122)
plt.bar(n.index, n.values, color=["#4f000b","#720026","#ce4257","#ff7f51","#ff9b54"])
plt.xlabel('Product Name')
plt.xticks(rotation=90)
plt.ylabel('Page-loads per 1000 students')
plt.title("With number of page-loads per 1000 students")
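
`pd.merge` defaults to an inner join, so engagement rows whose `LP ID` does not appear in `products_df` are dropped from `merged` without notice. A hedged sketch of how such rows could be counted, using `how="left"` and `indicator=True` on made-up frames (all values and product names are illustrative):

```python
import pandas as pd

# Made-up frames; LP ID 3 has no entry in the product table.
engagement_toy = pd.DataFrame({
    "LP ID": [1, 2, 3, 3],
    "pct_access": [0.5, 1.2, 0.1, 0.3],
})
products_toy = pd.DataFrame({
    "LP ID": [1, 2],
    "Product Name": ["Product A", "Product B"],
})

# A left join keeps every engagement row; the _merge column records where each row matched.
checked = pd.merge(engagement_toy, products_toy, on="LP ID", how="left", indicator=True)
unmatched = (checked["_merge"] == "left_only").sum()

print(f"{unmatched} of {len(checked)} engagement rows have no product match")
```

A large unmatched count on the real data would mean the top-10 charts describe only part of the engagement records.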

In [None]:
class_doc=merged[(merged["Product Name"]=="Google Classroom")|(merged["Product Name"]=="Google Docs")]
pct=class_doc.groupby(["time", "Product Name"])["pct_access"].mean().to_frame().reset_index()
eng=class_doc.groupby(["time", "Product Name"])["engagement_index"].sum().to_frame().reset_index()
# plot
fig = px.line(pct, x="time", y="pct_access", color='Product Name',title='Percentage of students with at least one page-load event on a given day',
              template="ggplot2", width=800, height=400)
fig.show()

fig = px.line(eng, x="time", y="engagement_index",title='Sum of page-loads per 1000 students on a given day', color='Product Name',
              template="seaborn", width=800, height=400)
fig.show()
