# <div style="text-align: Left"><span style="color:#08838b; font-family:Georgia;">Problem Statement</span></div>
<div style="text-align: justify"><span style="color:#000000; font-family:Georgia; font-size:1.2em;">The COVID-19 pandemic has disrupted learning for more than 56 million students in the United States. In the spring of 2020, most state and local governments across the U.S. closed educational institutions to stop the spread of the virus. In response, schools and teachers have attempted to reach students remotely through distance learning tools and digital platforms. To this day, concerns about the exacerbating digital divide and long-term learning loss among America’s most vulnerable learners continue to grow.</span></div>

# <div style="text-align: Left"><span style="color:#08838b; font-family:Georgia;">Questions that tackle the above problem statement </span></div>

<div style="text-align: justify"><span style="color:#000000; font-family:Georgia; font-size:1.2em;">    
<ul>
<li>What is the picture of digital connectivity and engagement in 2020?</li>
<li>What is the effect of the COVID-19 pandemic on online and distance learning, and how might this also evolve in the future?</li>
<li>How does student engagement with different types of education technology change over the course of the pandemic?</li>
<li>How does student engagement with online learning platforms relate to different geography? Demographic context (e.g., race/ethnicity, ESL, learning disability)? Learning context? Socioeconomic status?</li>
<li>Do certain state interventions, practices or policies (e.g., stimulus, reopening, eviction moratorium) correlate with an increase or decrease in online engagement?</li>
</ul></span></div>

# Data we are using
We use the data provided in this Kaggle competition: a set of daily edtech engagement data from over 200 school districts in 2020. We will also leverage other publicly available data sources in our analysis. The provided data includes three basic sets of files:



1.   The engagement_data folder is based on LearnPlatform’s Student Chrome Extension. The extension collects page load events of over 10K education technology products in LearnPlatform's product library, including websites, apps, web apps, software programs, extensions, ebooks, hardware, and services used in educational institutions. The engagement data have been aggregated at school district level, and each file represents data from one school district.
2.   The products_info.csv file includes information about the characteristics of the top 372 products with most users in 2020.
3.   The districts_info.csv file includes information about the characteristics of school districts, including data from NCES and FCC.

The definitions of each column in the three data sets are detailed in the README file.


In addition to the files provided, we intend to use other public data sources such as COVID-19 US State Policy database, KIDS Count, and KFF.

# Basics about the data

*   There are a total of 233 school districts in the data, spread across the USA. A school district is a special-purpose district that operates local public primary and secondary schools.

*   There are a total of 372 distinct educational technology products, such as tools like Canva, educational apps like Duolingo, reading sites like Goodreads, or social pages like Facebook.
*   The data was collected from 01.01.2020 (a few months before the pandemic hit) to 31.12.2020. This gives a full-year overview of usage before and during the pandemic.
*   Summing up the data collected for all districts, we end up with ~22.3M data points, a fairly large dataset.
*   Of these 22.3M data points, around 24% have missing values in the engagement_index feature. Nevertheless, the pct_access feature is fully available.
*   Around half of the pupils have at least one page load on a product and on a given day.
*   There are almost 168 page loads per 1k students on a given product and on a given day. This means activity close to none for some of the pupils.
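
As a rough sketch of how headline figures like these could be derived, the snippet below computes the number of data points, the share of missing `engagement_index` values, and the mean `engagement_index` on a tiny made-up table. The `toy` frame and its values are purely illustrative and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the combined engagement table (values are made up)
toy = pd.DataFrame({
    "pct_access": [0.0, 0.5, 1.2, 0.0],
    "engagement_index": [np.nan, 120.0, 300.0, np.nan],
})

# total number of data points (rows)
n_points = len(toy)
# share of rows where engagement_index is missing, in percent
pct_missing = toy["engagement_index"].isna().mean() * 100
# mean page loads per 1,000 students; pandas skips missing rows by default
mean_engagement = toy["engagement_index"].mean()

print(n_points, pct_missing, mean_engagement)
```

On the real data the same three expressions would be applied to the combined engagement dataframe once it has been loaded.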

## Importing libraries that are required for this analysis

In [None]:
import numpy as np 
import pandas as pd
import os
import glob
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import plotly as py
import plotly.express as px
import plotly.graph_objects as go
import statistics as stat
from wordcloud import WordCloud, STOPWORDS
from plotly.subplots import make_subplots
import folium
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)
%matplotlib inline


## Product information data
> The product file products_info.csv includes information about the characteristics of the top 372 products with most users in 2020. The categories listed in this file are part of LearnPlatform's product taxonomy.

* LP ID - the unique identifier of the product.
* URL
* Product Name
* Provider/Company Name
* Sector(s) - sector of education where the product is used.
* Primary Essential Function - the basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled.
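
The two label layers can be separated with a small helper. The sketch below assumes the labels follow a `MAIN - sub` pattern such as `LC - Digital Learning Platforms`; the function name `split_function_label` is hypothetical and only serves as an illustration.

```python
def split_function_label(label):
    """Split 'LC - Digital Learning Platforms' into ('LC', 'Digital Learning Platforms')."""
    # partition splits on the first ' - ' only, so the sub-category keeps any further text
    main, _, sub = label.partition(" - ")
    return main.strip(), sub.strip()

examples = ["LC - Digital Learning Platforms", "LC - Sites, Resources & Reference"]
for label in examples:
    print(split_function_label(label))
```

Note that `partition` keeps everything after the first separator in the sub-category, whereas taking only the second element of `split(' - ')` would discard any deeper label levels.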

## Loading the dataset and reading its contents
Let's load the district data.

In [None]:
districts_info = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")

In [None]:
# helper functions used throughout the analysis
def null_values(df):
    """Return the number and percentage of missing values per column."""
    print(" #### Calculating Missing Values ##### \n")
    sumofNull = df.isna().sum()
    percentage = sumofNull / len(df) * 100

    print("#### Creating data frame #### \n ")
    valuesdf = pd.DataFrame({"Total Missing": sumofNull,
                             "Percentage Missing": percentage})
    print(" #### Done #### ")
    return valuesdf

def dropDuplicates(data):
    """Report the number of duplicate rows and return the data without them."""
    # count duplicates before dropping them; counting afterwards always gives 0
    dupCount = int(data.duplicated().sum())
    print("There are {} duplicates in the dataset\n".format(dupCount))
    return data.drop_duplicates()

In [None]:
districts_info.info()

From the quick summary above we can tell that there are quite a number of null values. According to the data description, values were removed to keep the districts anonymous, so most of the missing data is structurally missing.

Let's calculate the missing values.

In [None]:
import sys
from tqdm.notebook import tqdm_notebook as tq
null_values(districts_info)

We can see that if the state is missing, then the locale and pct_black/hispanic records are most likely missing as well. We also see that pp_total_raw has roughly 49% null values, which is a limitation for our analysis.
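
One way to check whether missing values occur together is to look only at the rows where `state` is missing and measure how often the other columns are missing in those rows. The following is a minimal sketch on a made-up table; the `toy` frame is an assumption for illustration and does not reproduce the real district data.

```python
import numpy as np
import pandas as pd

# Toy district table with missing values (made up for illustration)
toy = pd.DataFrame({
    "state": ["Utah", np.nan, "Texas", np.nan],
    "locale": ["Suburb", np.nan, "City", np.nan],
    "pp_total_raw": ["[6000, 8000[", np.nan, np.nan, np.nan],
})

missing = toy.isna()
# among rows with a missing state, the share of rows where each column is missing too
co_missing = missing[missing["state"]].mean()
print(co_missing)
```

A value close to 1 for a column indicates that it is almost always missing whenever `state` is missing.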

In [None]:
#we get the state abbreviations to use in plotting graphs
state_abb = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District Of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}

districts_info['state_abb'] = districts_info['state'].map(state_abb)

Unique values

In [None]:
def count_plot(data,colr,title):
    plt.figure(figsize=(10,8))
    ax=sns.countplot(x=data,palette=colr,order=data.value_counts().index)
    plt.xticks(rotation=90)
    plt.title(title)
    for p in ax.patches:
        ax.text (p.get_x() + p.get_width()  / 2,p.get_height()+ 0.75,p.get_height(), fontsize = 11)
#         ax.text('%{:.1f}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+50))

In [None]:
import plotly.graph_objects as go
fig = go.Figure()
layout = dict(
    title_text = "Count of districts in the available States",
    title_font = dict(
            family = "monospace",
            size = 25,
            color = "black"
            ),
    geo_scope = 'usa'
)

# count districts per state once; using the index and values directly avoids
# relying on the column names produced by reset_index(), which differ across pandas versions
state_counts = districts_info['state_abb'].value_counts()

fig.add_trace(
    go.Choropleth(
        locations = state_counts.index,
        zmax = 1,
        z = state_counts.values,
        locationmode = 'USA-states',
        marker_line_color = 'white',
        geo = 'geo',
        colorscale = "cividis", 
    )
)
            
fig.update_layout(layout)   
fig.show()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
count_plot(districts_info['state'],'RdYlGn',"State representation")

Connecticut has the largest representation, with 30 districts in the dataset, closely followed by Utah.

In [None]:
plt.figure(figsize = (15, 8))
sns.set_style("white")
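# name the columns explicitly so the plot does not depend on pandas' default value_counts() column names
state_counts = districts_info['state'].value_counts().rename_axis('state').reset_index(name='n_districts')
a = sns.barplot(data = state_counts, x = 'n_districts', y = 'state', color = '#90afc5')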
plt.xticks([])
plt.yticks(fontsize = 14, color = '#283655')
plt.ylabel('')
plt.xlabel('')

a.spines['left'].set_linewidth(1.5)
for w in ['right', 'top', 'bottom']:
    a.spines[w].set_visible(False)
    
for p in a.patches:
    width = p.get_width()
    plt.text(0.5 + width, p.get_y() + 0.55 * p.get_height(), f'{int(width)}',
             ha = 'center', va = 'center',  fontsize = 15, color = '#283655')

plt.show()

In [None]:
count_plot(districts_info['locale'],'Blues','locale representation') 

A great number of the districts are located in suburbs, but does this result in better education? How do they compare to the other locales?

Next we aggregate by state and look at the percentage features. A quick look at these columns shows that their values are intervals.

A quick intro to interval notation:

$]a,b[ \; := \{x : a < x < b\}$ : open real interval

$[a,b[ \; := \{x : a \le x < b\}$ : half-open on the right

$]a,b] \; := \{x : a < x \le b\}$ : half-open on the left

We address this by splitting each value on the comma and taking the numeric bounds.

However, we first have to handle the missing values (here, by dropping incomplete rows) before carrying on with the analysis.
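
The splitting step described above can be sketched in plain Python. The helpers `parse_interval` and `midpoint` are hypothetical names introduced for illustration, and the sketch assumes every value looks like `[lower, upper[`. It uses the interval midpoint as the representative value, which is one possible choice alongside the lower bound plus a fixed offset.

```python
def parse_interval(text):
    """Parse a string such as '[0.2, 0.4[' into its (lower, upper) bounds."""
    # remove surrounding whitespace and brackets, then split on the comma
    lower, upper = text.strip().strip("[]").split(",")
    return float(lower), float(upper)

def midpoint(text):
    """Return the midpoint of an interval string."""
    lower, upper = parse_interval(text)
    return (lower + upper) / 2

print(parse_interval("[0.2, 0.4["))
print(midpoint("[4000, 6000["))
```

Applied to a column, this would look like `df[col].apply(midpoint)` once rows with missing values have been handled.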

In [None]:
distcrict_copy=districts_info.copy()
districts_info.dropna(inplace=True)
districts_info['pct_black/hispanic']=districts_info['pct_black/hispanic'].apply(lambda x :float(x.split(',')[0][1:])+0.1)

districts_info['pct_free/reduced']=districts_info['pct_free/reduced'].apply(lambda x :float(x.split(',')[0][1:])+0.1)
districts_info=districts_info.reset_index()
districts_info.drop(labels='index',inplace=True,axis=1)
districts_info

In [None]:
districts_info['pp_total_raw']=districts_info['pp_total_raw'].apply(lambda x :float(x.split(',')[0][1:])+1000)
districts_info['county_connections_ratio']=districts_info['county_connections_ratio'].apply(lambda x: float(x.split(',')[0][1:])+0.1)

districts_info

In [None]:
import numpy as np
state_pct=districts_info.groupby('state').agg({'pct_black/hispanic':np.mean,'pct_free/reduced':np.mean,'pp_total_raw':np.mean})
state_pct=state_pct.reset_index()

In [None]:
#is there a relationship between the two ratios?
# drop the non-numeric state column before computing correlations
sns.heatmap(state_pct.drop(columns='state').corr(),annot=True)

In [None]:
# sample funnel graph showing the percentage of black/hispanic students in each state
import plotly.express as px
fig = px.funnel(state_pct, x='pct_black/hispanic', y='state')
fig.show()

Texas has the highest Black/Hispanic share at about 60%, and Florida comes second with about 50%.

In [None]:
# pct_free/reduced is numeric, so it is mapped through a continuous colour scale
fig = px.bar_polar(state_pct, r="pct_black/hispanic", theta="state", color="pct_free/reduced", template="plotly_dark",
            color_continuous_scale= px.colors.sequential.Plasma_r)
fig.show()

A quick review of the polar chart shows no clear relationship between pct_black/hispanic and pct_free/reduced.

In [None]:
fig = px.bar_polar(state_pct, r="pp_total_raw", theta="state", template="plotly_dark",
            color_discrete_sequence= px.colors.sequential.Plasma_r)
fig.show()

New York seems to have the highest expenditure per pupil. We also notice that the states with a higher pct_black/hispanic tend to have a lower expenditure per pupil.
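
A visual impression like this can be backed up with a correlation coefficient. The sketch below uses a made-up state-level table in which expenditure falls as the Black/Hispanic share rises, so the result is strongly negative by construction; `toy_state` and its numbers are assumptions for illustration only.

```python
import pandas as pd

# Made-up state-level aggregates; not taken from the dataset
toy_state = pd.DataFrame({
    "state": ["A", "B", "C", "D"],
    "pct_black/hispanic": [0.1, 0.3, 0.5, 0.7],
    "pp_total_raw": [15000.0, 13000.0, 11000.0, 9000.0],
})

# Pearson correlation between the two state-level averages
r = toy_state["pct_black/hispanic"].corr(toy_state["pp_total_raw"])
print(round(r, 3))
```

With only a handful of states in the real aggregate, such a coefficient should be read as descriptive rather than as evidence of a causal link.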

Could this be affected by the locale of a district? Let's visualize it and see.

In [None]:
locale_df=districts_info.groupby('locale').agg({'pct_black/hispanic':np.mean,'pct_free/reduced':np.mean})
locale_df=locale_df.reset_index()
plt.figure(figsize=(15,9))
sns.barplot(x="locale", y="pct_black/hispanic", data=locale_df,
                 palette="Blues_d")
plt.title("Percentage of Black/Hispanic students by locale")
plt.xticks(rotation=90)

In [None]:
locale_df.head()

Let's load the product data.

In [None]:
product_info = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')
product_info.head()

In [None]:
cloud = WordCloud(width=1440, height=1080,stopwords={'nan'}).generate(" ".join(product_info['Product Name'].astype(str)))
plt.figure(figsize=(15, 10))
plt.imshow(cloud)
plt.axis('off')

In [None]:
# forward-fill missing values in a column (used below for Sector(s) and Primary Essential Function)

def fix_missing_ffill(df, col):
    df[col] = df[col].ffill()
    return df[col]

In [None]:
product_info.info()

In [None]:
product_info.isna().sum()

In [None]:
msno.bar(product_info,color='#924893', sort="ascending", figsize=(10,5), fontsize=12)
plt.show()

In [None]:
# we can use forward fill or back fill to fill the 20 missing values in Sector(s) and Primary Essential Function (main and sub function)
product_info['Sector(s)'] = fix_missing_ffill(product_info, 'Sector(s)')
product_info['Primary Essential Function'] = fix_missing_ffill(product_info,'Primary Essential Function')

In [None]:
product_info['primary_function_main'] = product_info['Primary Essential Function'].apply(lambda x: x.split(' - ')[0] if x == x else x)
product_info['primary_function_sub'] = product_info['Primary Essential Function'].apply(lambda x: x.split(' - ')[1] if x == x else x)

# Synchronize similar values
product_info['primary_function_sub'] = product_info['primary_function_sub'].replace({'Sites, Resources & References' : 'Sites, Resources & Reference'})
product_info.drop("Primary Essential Function", axis=1, inplace=True)

In [None]:
product_info

In [None]:
product_info.isna().sum()

In [None]:
product_info = product_info.dropna()

In [None]:
product_info

In [None]:
def plot_count(df:pd.DataFrame, column:str,title:str) -> None:
    plt.figure(figsize=(12, 7))
    sns.countplot(data=df, x=column) 
    plt.xticks(rotation=75, fontsize=14)
    plt.title(title,font="Serif", size=20)
    plt.show()
def group_donut(grouped_data,title: str):
    grouped_data.plot.pie(subplots=True,figsize=(18, 9),autopct="%.1f%%",pctdistance=0.85)
    # add a circle at the center to transform it in a donut chart
    my_circle=plt.Circle( (0,0), 0.7, color='white')
    p=plt.gcf()
    p.gca().add_artist(my_circle)
    plt.title(title,font="Serif", size=20)
    plt.show()
def bar_p(df:pd.DataFrame, column:str,title:str):
    plt.figure(figsize=(18, 9))
    sns.countplot(y=column, data=df, order=df[column].value_counts().head(10).index,color = "#a265b8")
    plt.title(title,font="Serif", size=20)
    plt.show()

In [None]:
product_info['Provider/Company Name'].value_counts()

In [None]:
bar_p(product_info,'Provider/Company Name','Showing 10 providers')

In [None]:
Data = ['Google LLC','Houghton Mifflin Harcourt','Microsoft']
top3 = product_info.loc[product_info['Provider/Company Name'].isin(Data)]
top3.head()

In [None]:
top = top3.groupby(['Provider/Company Name'])['Product Name'].value_counts().groupby(level=0).head(10)
top

In [None]:
c1=c2=c3=0
for s in product_info["Sector(s)"]:
    if(not pd.isnull(s)):
        s = s.split(";")
        for i in range(len(s)):
            sub = s[i].strip()
            if(sub == 'PreK-12'): c1+=1
            if(sub == 'Higher Ed'): c2+=1
            if(sub == 'Corporate'): c3+=1

fig, ax  = plt.subplots(figsize=(16, 8))
fig.suptitle('Sector Distribution', size = 30, font="Serif")
explode = (0.05, 0.05, 0.05)
labels = ['PreK-12','Higher Ed','Corporate']
sizes = [c1,c2, c3]
ax.pie(sizes, explode=explode,startangle=60, labels=labels,autopct='%1.2f%%', pctdistance=0.7, colors=["#ff228a","#20b1fd","#ffb703"])
ax.add_artist(plt.Circle((0,0),0.4,fc='white'))
plt.show()

In [None]:
c1=c2=c3=0

for s in product_info["primary_function_main"]:
    if(not pd.isnull(s)):
        c1 += s.count("CM")
        c2 += s.count("LC")
        c3 += s.count("SDO")

fig, ax  = plt.subplots(figsize=(16, 8))
fig.suptitle('Primary Essential Function', size = 20, font="Serif")
explode = (0.05, 0.05, 0.05)
labels = ['CM','LC','SDO']
sizes = [c1, c2, c3]
ax.pie(sizes, explode=explode,startangle=60, labels=labels,autopct='%1.2f%%', pctdistance=0.7, colors=["#18ff9f","#2cfbff","#ffb703"])
ax.add_artist(plt.Circle((0,0),0.4,fc='white'))
plt.show()

In [None]:
product_info['primary_function_sub'].value_counts()

In [None]:
bar_p(product_info,'primary_function_sub','plot showing top 10 used sub functions')

## Engagement Data

### The engagement data are aggregated at school district level, and each file in the folder engagement_data represents data from one school district. The 4-digit file name represents the district_id, which can be used to link to district information in districts_info.csv. The lp_id can be used to link to product information in products_info.csv.

   

# Loading the dataset and reading its contents 

### Let's load the engagement data first. Since the engagement_data folder contains 233 csv files, each representing data from one school district, we have to join these csv files into one dataframe so that we can work on it.

### The first step is to create a list of all the csv files stored in the folder engagement_data.

In [None]:
path = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data'
files = glob.glob(os.path.join(path, "*.csv"))

### To keep track of the file names, which are the only reference to the district (the 4-digit file name represents the district_id, which can be used to link to district information in districts_info.csv), we create a list of dataframes with the file name stored in a "district_id" column and then concatenate the items of the list into one dataframe.

### Here, we loop over the district IDs to build the list of dataframes and then use the concat function to combine the list items into one dataframe.
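
The same idea can also be written with `glob` and a list comprehension that reads the district ID from each file name. The sketch below is self-contained: it writes two toy CSV files to a temporary folder instead of reading the real `engagement_data` folder, and the file contents are made up.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as folder:
    # write two toy district files named after a 4-digit district id
    for district_id, loads in [("1000", 10.0), ("2000", 20.0)]:
        pd.DataFrame(
            {"time": ["2020-01-01"], "lp_id": [1], "engagement_index": [loads]}
        ).to_csv(os.path.join(folder, f"{district_id}.csv"), index=False)

    files = sorted(glob.glob(os.path.join(folder, "*.csv")))
    # read each file and store the id taken from its file name in a district_id column
    frames = [
        pd.read_csv(f).assign(district_id=int(os.path.splitext(os.path.basename(f))[0]))
        for f in files
    ]
    combined = pd.concat(frames, ignore_index=True)

print(combined)
```

Taking the ID from the file name means the loop does not depend on the district table being loaded first.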

In [None]:
PATH = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data' 
districts_info = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")

temp = []

for district in districts_info.district_id.unique():
    df = pd.read_csv(f'{PATH}/{district}.csv', index_col=None, header=0)
    df["district_id"] = district
    temp.append(df)
    
    
engagement = pd.concat(temp)
engagement = engagement.reset_index(drop=True)


In [None]:
engagement.shape

In [None]:
engagement.isna().sum()

We can see the size of the dataframe named engagement after combining data from all 233 csv files, along with the number of missing values in each column.

In [None]:
engagement.head(10)

<div style="text-align: justify"><span style="color:#000000; font-family:Georgia; font-size:1.2em;">Let's find out the percentage of missing/null values in our dataset for each column of the dataframe. We will use the missingno library to visualize missing values. </span></div>

In [None]:
percent_missing_val_engagement = (engagement.isnull().sum().sort_values(ascending = False)/len(engagement))*100
percent_missing_val_engagement

<div style="text-align: justify"><span style="color:#000000; font-family:Georgia; font-size:1.2em;">The below bar graph shows the number of non null entries in each column for the "engagement" dataset. </span></div>

In [None]:
msno.bar(engagement, color='#25c28b', sort="ascending", figsize=(10,5), fontsize=12)
plt.title("Number of non-null entries in engagement dataset", size=20)
plt.show()

# <div style="text-align: Left"><span style="color:#08838b; font-family:Georgia;">Exploratory Data Analysis</span></div>

<div style="text-align: justify"><span style="color:#000000; font-family:Georgia; font-size:1.2em;">Let's perform exploratory data analysis on the datasets that we have here. But before we do that, we have to change the data type of the "time" column in the "engagement" dataset from "object" to "datetime" so that we can perform some analysis on the dataset. Moreover, we have to add more columns to the "engagement" dataset and map the "district_id" to "state" and "locale". </span></div>

In [None]:
engagement.info()

<div style="text-align: justify"><span style="color:#000000; font-family:Georgia; font-size:1.2em;">Converting the "district_id" column to "int64" and the "time" column from "object" to "datetime64". </span></div>

In [None]:
convert_dict = {'district_id': 'int64'
               }
engagement = engagement.astype(convert_dict)
engagement['time'] = pd.to_datetime(engagement['time'])

In [None]:
engagement.info()

<div style="text-align: justify"><span style="color:#000000; font-family:Georgia; font-size:1.2em;">Mapping the district_id to get associated state and locale and then adding it to the engagement dataset as individual columns. </span></div>
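
An alternative to building lookup dictionaries is a left merge, which attaches both columns in one step. This is a minimal sketch on made-up tables; `toy_districts` and `toy_engagement` are illustrative stand-ins for the real dataframes.

```python
import pandas as pd

# Made-up district and engagement tables for illustration
toy_districts = pd.DataFrame({
    "district_id": [1000, 2000],
    "state": ["Utah", "Texas"],
    "locale": ["Suburb", "City"],
})
toy_engagement = pd.DataFrame({
    "district_id": [1000, 1000, 2000],
    "pct_access": [0.1, 0.2, 0.3],
})

# a left merge keeps every engagement row and attaches state and locale;
# validate raises an error if a district_id appears more than once in the district table
enriched = toy_engagement.merge(
    toy_districts, on="district_id", how="left", validate="many_to_one"
)
print(enriched)
```

The `validate` argument is a safety check: duplicated district IDs would otherwise silently multiply engagement rows.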

In [None]:
district = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')
mapping_state = dict(district[['district_id', 'state']].values)
mapping_locale = dict(district[['district_id', 'locale']].values)
engagement['state'] = engagement['district_id'].map(mapping_state)
engagement['locale'] = engagement['district_id'].map(mapping_locale)
engagement.head(10)

<div style="text-align: justify"><span style="color:#000000; font-family:Georgia; font-size:1.2em;">Getting a picture of digital connectivity and engagement across states and locale in 2020.</span></div>

In [None]:
engagement_state = engagement[['pct_access', 'engagement_index', 'state']]
engagement_locale = engagement[['pct_access', 'engagement_index', 'locale']]

In [None]:
plt.figure(figsize=(15, 10))
plt.ticklabel_format(style='plain')
sns.countplot(y="state",data=engagement_state,order=engagement_state.state.value_counts().index,palette="rocket_r",linewidth=4)
plt.title("Statewise Digital connectivity distribution in 2020",font="Georgia", size=20)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Count', fontsize=20)
plt.ylabel('State', fontsize=20)
plt.show()

<div style="text-align: justify"><span style="color:#000000; font-family:Georgia; font-size:1.2em;">From the above plot, we can see that 'Connecticut', 'Utah', 'Illinois', 'Massachusetts' and 'California' are the top five states with maximum number of counts/records in the engagement dataset.</span></div>

In [None]:
plt.figure(figsize=(10, 6))
plt.ticklabel_format(style='plain')
sns.countplot(y="locale",data=engagement_locale,order=engagement_locale.locale.value_counts().index,palette="mako",linewidth=3)
plt.title("Locale-wise Digital connectivity distribution in 2020",font="Georgia", size=20)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Count', fontsize=20)
plt.ylabel('Locale', fontsize=20)
plt.show()

<div style="text-align: justify"><span style="color:#000000; font-family:Georgia; font-size:1.2em;">The above plot shows that Suburb records the maximum count of digital activity.</span></div>

In [None]:
fig, ax  = plt.subplots(figsize=(15, 10))
explode = (0.04, 0.04, 0.04, 0.04)
labels = list(engagement_locale.locale.value_counts().index)
sizes = engagement_locale.locale.value_counts().values
patches, texts, autotexts = ax.pie(sizes, explode=explode, startangle=60, labels=labels, autopct='%1.0f%%', pctdistance=0.7, colors=["#367382","#3998af","#16b5dc","#a8c2c9"])
# use the same font size for all slice labels and percentage labels
for text in texts + autotexts:
    text.set_fontsize(20)
ax.add_artist(plt.Circle((0,0),0.6,fc='white'))
font = {'fontname':'Georgia'}
plt.title('Locale-wise Digital connectivity distribution in 2020', fontsize = 20, **font)
plt.show()




In [None]:
engagement.time.nunique()

<div style="text-align: justify"><span style="color:#000000; font-family:Georgia; font-size:1.2em;">We have 366 unique dates when the data was recorded. </span></div>

In [None]:
plt.figure(figsize=(15,5))

sns.histplot(engagement.groupby('district_id').time.nunique(), bins=30, color = '#b3a42f')
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Number of days', fontsize=20)
plt.ylabel('Number of district', fontsize=20)
plt.title('Districts with unique days of engagement ', fontsize = 20)


<div style="text-align: justify"><span style="color:#000000; font-family:Georgia; font-size:1.2em;">From the above histogram, we can see that the majority of districts have engagement data for all 366 days, while some have data for fewer than 10 days.</span></div>
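
If districts with sparse coverage should be excluded, the number of unique days per district can be used as a filter. The sketch below works on a made-up table, and the threshold `min_days` is a hypothetical parameter; a value of 366 would demand full-year coverage on the real data.

```python
import pandas as pd

# Made-up engagement rows: district 1 has three days of data, district 2 only one
toy = pd.DataFrame({
    "district_id": [1, 1, 1, 2],
    "time": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-01"]),
})

# number of distinct days with data for each district
days_per_district = toy.groupby("district_id")["time"].nunique()

min_days = 3  # hypothetical threshold chosen for this toy example
keep = days_per_district[days_per_district >= min_days].index
filtered = toy[toy["district_id"].isin(keep)]
print(filtered)
```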

<div style="text-align: justify"><span style="color:#000000; font-family:Georgia; font-size:1.2em;">Let's understand the variation of "pct_access" data and "engagement_index" data over time.</span></div>

In [None]:
lp_id_virtual = product_info[product_info.primary_function_sub == 'Virtual Classroom']['LP ID'].unique()
plt.rcParams.update({'font.size': 14,})
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(24, 6))

for product_id in lp_id_virtual:
    dummy = engagement[engagement.lp_id == product_id].groupby('time').pct_access.mean().to_frame().reset_index(drop=False)
    sns.lineplot(x=dummy.time, y=dummy.pct_access, label=product_info[product_info['LP ID'] == product_id]['Product Name'].values[0])
plt.legend()
plt.title('Variation of pct_access over time for Virtual Classroom products', fontsize = 20)
plt.show()

<div style="text-align: justify"><span style="color:#000000; font-family:Georgia; font-size:1.2em;">We see that because the engagement dataset contains weekend as well as weekday data, the graph shows ripples like those above. Since little school work happens on weekends, as the graph above suggests, we can remove weekend data from our dataset.</span></div>
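
The weekend filter relies on the pandas convention that Monday is 0 and Sunday is 6, so weekdays are the values below 5. A minimal sketch on four made-up dates (a Friday through a Monday):

```python
import pandas as pd

# Friday, Saturday, Sunday and Monday in March 2020
toy = pd.DataFrame({
    "time": pd.to_datetime(["2020-03-06", "2020-03-07", "2020-03-08", "2020-03-09"])
})

# weekday numbering: Monday=0 ... Sunday=6
toy["weekday"] = toy["time"].dt.weekday
# keep Monday to Friday only
weekdays_only = toy[toy["weekday"] < 5]
print(weekdays_only)
```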

In [None]:
engagement['weekday'] = pd.DatetimeIndex(engagement['time']).weekday
engagement_updated = engagement[engagement.weekday < 5]

In [None]:
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(24, 6))

for product_id in lp_id_virtual:
    dummy = engagement_updated[engagement_updated.lp_id == product_id].groupby('time').pct_access.mean().to_frame().reset_index(drop=False)
    sns.lineplot(x=dummy.time, y=dummy.pct_access, label=product_info[product_info['LP ID'] == product_id]['Product Name'].values[0])
plt.legend()
plt.title('Variation of pct_access over time for Virtual Classroom products', fontsize = 20)
plt.show()

<div style="text-align: justify"><span style="color:#000000; font-family:Georgia; font-size:1.2em;">Now the graph shows better variation of pct_access data without ripples which were earlier present because of weekend data in the dataset.</span></div>

In [None]:
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(24, 6))
for product_id in lp_id_virtual:
    dummy = engagement_updated[engagement_updated.lp_id == product_id].groupby('time').engagement_index.mean().to_frame().reset_index(drop=False)
    sns.lineplot(x=dummy.time, y=dummy.engagement_index, label=product_info[product_info['LP ID'] == product_id]['Product Name'].values[0])
plt.legend()
plt.title('Variation of engagement_index over time for Virtual Classroom products', fontsize = 20)
plt.show()

In [None]:
engagement.lp_id.head()

In [None]:
lp_id_digital = product_info[product_info.primary_function_sub == 'Digital Learning Platforms']['LP ID'].unique()

In [None]:
len(lp_id_digital)

In [None]:
# manually selected product IDs; this overrides the full list computed above
lp_id_digital = [36692, 92993, 71279, 25559, 64998, 61441]

In [None]:
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(24, 6))

for product_id in lp_id_digital:
    dummy = engagement_updated[engagement_updated.lp_id == product_id].groupby('time').pct_access.mean().to_frame().reset_index(drop=False)
    sns.lineplot(x=dummy.time, y=dummy.pct_access, label=product_info[product_info['LP ID'] == product_id]['Product Name'].values[0])
plt.legend()
plt.title('Variation of pct_access over time for TOP 6 most accessed Digital Learning platforms', fontsize = 20)
plt.show()

In [None]:
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(24, 6))

for product_id in lp_id_digital:
    dummy = engagement_updated[engagement_updated.lp_id == product_id].groupby('time').engagement_index.mean().to_frame().reset_index(drop=False)
    sns.lineplot(x=dummy.time, y=dummy.engagement_index, label=product_info[product_info['LP ID'] == product_id]['Product Name'].values[0])
plt.legend()
plt.title('Variation of engagement_index over time for TOP 6 most accessed Digital Learning platforms', fontsize = 20)
plt.show()

## Causal graph

Since the combined engagement and district dataset has 80,000,000 points, we will select around 1.2% of the data points for plotting the causal graph. To investigate the elements that directly affect the engagement index, we build an initial causal tree using a portion of the data (100,000 records). As the number of data points grows, the graph becomes more stable, but it also demands more computing power. Across the three graphs that were plotted, the factors that directly affect engagement stayed the same. When an intervention is placed on the variable pct_free/reduced, we fit a conditional model and examine the change in the engagement index. The structural model is built with a call of the form `sm = construct_structural_model(model_df, tabu_parent_nodes=["engagement_index"])`.
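
Drawing a reproducible subset and preparing it for structure learning could look roughly like the sketch below. It uses a made-up table, a much smaller sample size than the 100,000 records mentioned above, and assumes that the structure-learning step needs numeric inputs, which is why the categorical column is converted to integer codes. The name `model_df` mirrors the call quoted above, but the preparation steps themselves are an assumption, not the notebook's confirmed method.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the merged table
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "engagement_index": rng.random(1000),
    "pct_free/reduced": rng.choice([0.1, 0.3, 0.5], size=1000),
    "locale": rng.choice(["City", "Suburb", "Rural", "Town"], size=1000),
})

n_sample = 100  # toy size; the text above mentions 100,000 records
# fixed random_state so the same rows are drawn on every run
sample = toy.sample(n=n_sample, random_state=42).reset_index(drop=True)

# assumption: the structure learner expects numeric columns,
# so categorical labels are replaced by integer codes
model_df = sample.copy()
model_df["locale"] = model_df["locale"].astype("category").cat.codes
print(model_df.dtypes)
```

Repeating the sampling with different seeds is one way to check whether the learned edges stay stable, in line with the observation that the graph stabilizes as more data is used.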

In [None]:
!pip install causalnex

In [None]:
! apt install python3-dev graphviz libgraphviz-dev pkg-config -y

In [None]:
!pip install pygraphviz

In [None]:
# to save some memory we remove the dataframes
del districts_info
del engagement

In [None]:
districts_info = pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')
path = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data'
temp = []

for district in districts_info.district_id.unique():
    df = pd.read_csv(f'{path}/{district}.csv', index_col=None, header=0)
    df["district_id"] = district
#     if df.time.nunique() == 366:
    temp.append(df)

engagement = pd.concat(temp)
engagement = engagement.reset_index(drop=True)

In [None]:
districts_info.dropna(inplace=True)

In [None]:
districts_info_edited = districts_info.copy()

districts_info_edited['pct_black/hispanic'] = districts_info['pct_black/hispanic'].apply(lambda x: (x.replace('[', '')).split(','))
districts_info_edited['pct_free/reduced'] = districts_info['pct_free/reduced'].apply(lambda x: (x.replace('[', '')).split(','))
districts_info_edited['pp_total_raw'] = districts_info['pp_total_raw'].apply(lambda x: (x.replace('[', '')).split(','))
districts_info_edited.drop(columns=['county_connections_ratio'],inplace=True)
for i in ['pct_black/hispanic','pct_free/reduced','pp_total_raw']:
    districts_info_edited[i] = districts_info_edited[i].apply(lambda x: (float(x[0])+float(x[1]))/2)
districts_info_edited

In [None]:
df_engagement_district = pd.merge(engagement, districts_info_edited, on='district_id')
df_merged = pd.merge(df_engagement_district, product_info, left_on='lp_id',right_on="LP ID")
df_merged.drop(columns=['LP ID'],inplace=True)
df_merged.head()

In [None]:
del engagement

In [None]:
df_merged.head()

In [None]:
df_merged.shape

In [None]:
df_merged = df_merged.dropna().reset_index(drop=True)

In [None]:
df_merged.columns

In [None]:
causal_data = df_merged[['pct_access', 'engagement_index','locale', 
                         'pct_black/hispanic', 'pct_free/reduced',
                        'pp_total_raw', 'Provider/Company Name',
                        'Sector(s)', 'primary_function_main',
                        'primary_function_sub']].copy()
causal_data.head()

In [None]:
p = df_merged[['time','engagement_index']].copy()
p

In [None]:
del df_merged

In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
causal_data[['locale', 'Provider/Company Name', 'Sector(s)', 'primary_function_main',
       'primary_function_sub']] = causal_data[['locale', 'Provider/Company Name', 'Sector(s)', 'primary_function_main',
       'primary_function_sub']].apply(le.fit_transform)
causal_data.head()

In [None]:
col_dict = {'pct_black/hispanic': 'pct_black_hispanic',
        'pct_free/reduced': 'pct_free_reduced',
        'Provider/Company Name': 'Provider_Company_Name',
           'Sector(s)': 'Sector'}
causal_data.rename(columns=col_dict,
          inplace=True)
causal_data.head()

In [None]:
causal_data.shape

In [None]:
from IPython.display import Image
from causalnex.plots import plot_structure, NODE_STYLE, EDGE_STYLE
from causalnex.structure.notears import from_pandas, from_pandas_lasso

In [None]:
sm_1000000 = from_pandas(causal_data.iloc[:100000,:], w_threshold=0.8)
sm_1000000.remove_edge('primary_function_sub','Sector') 
viz = plot_structure(
    sm_1000000,
    graph_attributes={"scale": "2.0", 'size':2.5},
    all_node_attributes=NODE_STYLE.WEAK,
    all_edge_attributes=EDGE_STYLE.WEAK)
Image(viz.draw(format='png'))

### Discretize the data for input in Bayesian model

In [None]:
locale_map = {'City': 0, 'Rural': 1, 'Suburb': 2, 'Town': 3}
Provider_map = {' A&E Television Networks, LLC': 0,
 ' Autodesk, Inc': 1,
 ' Tes Global Ltd ': 2,
 'ABC digital': 3,
 'ABCya.com, LLC': 4,
 'Achieve3000': 5,
 'Actively Learn': 6,
 'Adobe Inc': 7,
 'Adobe Inc.': 8,
 'Age of Learning, Inc ': 9,
 'Amazon.com, Inc. ': 10,
 'Amplify Education, Inc.': 11,
 'Answers': 12,
 'Association for Supervision and Curriculum Development': 13,
 'Bartleby': 14,
 'BetterLesson': 15,
 'Big Ideas Learning': 16,
 'Blackboard Inc': 17,
 'Blindside Networks': 18,
 'Blooket LLC': 19,
 'Boom Learning (a dba of Omega Labs Inc.)': 20,
 'Box': 21,
 'BrainPOP LLC': 22,
 'Brainly': 23,
 'Breakout, Inc': 24,
 'CK-12 Foundation': 25,
 'Cable News Network': 26,
 'Calculator.com': 27,
 'Calendly': 28,
 'Canva': 29,
 'Canvas Talent, Inc. ': 30,
 'Capstone': 31,
 'Cengage Learning': 32,
 'Chegg': 33,
 'Cisco': 34,
 'ClassDojo, Inc.': 35,
 'ClassLink': 36,
 'Clever': 37,
 'Code.org': 38,
 'CommonLit': 39,
 'Constructive Media': 40,
 'ContentKeeper Technologies': 41,
 'CoolMath.com LLC': 42,
 'Course Hero': 43,
 'Cult of Pedagogy': 44,
 'Curiosity Media Inc': 45,
 'Curriculum Associates': 46,
 'DeltaMath': 47,
 'Desmos': 48,
 'Dictionary.com': 49,
 'Didax Education': 50,
 'Discovery Communications': 51,
 'Discovery Education': 52,
 'Disney': 53,
 'DocuSign Inc': 54,
 'Doodle Ltd': 55,
 'DotDash': 56,
 'Dreambox Learning': 57,
 'Dropbox': 58,
 'Duolingo': 59,
 'EBSCO Industries, Inc': 60,
 'EDpuzzle Inc.': 61,
 'Edgenuity Inc.': 62,
 'Edmentum': 63,
 'Education.com': 64,
 'Educational Testing Service': 65,
 'Edulastic': 66,
 'Ellevation': 67,
 'Enchanted Learning': 68,
 'Encyclopaedia Britannica, Inc.': 69,
 'Enotes.com': 70,
 'Epic Creations, Inc.': 71,
 'EqsQuest Ltd': 72,
 'Eventbrite': 73,
 'Evite': 74,
 'ExploreLearning, LLC': 75,
 'Facebook': 76,
 'Flipgrid': 77,
 'Flippity': 78,
 'Flocabulary': 79,
 'Formative': 80,
 'Frontline Education': 81,
 'Funbrain Holdings, LLC': 82,
 'Future US Inc': 83,
 'Gale Cengage': 84,
 'Generation Genius, Inc.': 85,
 'GeoGebra': 86,
 'Gimkit': 87,
 'GitHub': 88,
 'GloWorld': 89,
 'Global Compliance Network Inc': 90,
 'GoGuardian': 91,
 'Goodreads': 92,
 'Google LLC': 93,
 'Grammarly': 94,
 'Hapara': 95,
 'HealthTeacher': 96,
 'Heinemann, a division of Greenwood Publishing Group LLC': 97,
 'Hewlett-Packard': 98,
 'Hobsons': 99,
 'Hooda Math': 100,
 'Houghton Mifflin Harcourt': 101,
 'HowStuffWorks': 102,
 'Hulu, LLC': 103,
 'IAC': 104,
 'ITHAKA': 105,
 'IXL Learning': 106,
 'Imagine Easy Solutions': 107,
 'Imagine Learning': 108,
 'Infinite Campus': 109,
 'InnerSloth': 110,
 'Instagram': 111,
 'Instructure, Inc. ': 112,
 'Issuu': 113,
 'Istation': 114,
 'Jakub Koziol': 115,
 'John Wiley and Sons Inco': 116,
 'KQED': 117,
 'Kahoot! AS': 118,
 'Kaleido AI GmbH': 119,
 'Kami Limited': 120,
 'Khan Academy': 121,
 'Lakeshore Learning Materials': 122,
 'Lazel Inc.': 123,
 'LearnZillion': 124,
 'Learning A-Z': 125,
 'Legends of Learning': 126,
 'Lexia Learning': 127,
 'Library of Congress': 128,
 'LinkedIn': 129,
 'LitCharts LLC': 130,
 'LogMeIn': 131,
 'Loom, Inc': 132,
 'Lumen Learning': 133,
 'MIND Research Institute': 134,
 'MIT Media Lab': 135,
 'MakeMusic, Inc.': 136,
 'MarketWatch': 137,
 'Massachusetts Institute of Technology': 138,
 'MasteryConnect': 139,
 'Math Playground': 140,
 'Math Worksheets 4 Kids': 141,
 'Mathsisfun.com': 142,
 'Mathway': 143,
 'McGraw-Hill PreK-12': 144,
 'Measurement Incorporated ': 145,
 'Merriam-Webster': 146,
 'Michael Dayah': 147,
 'Microsoft': 148,
 'Microsoft Education': 149,
 'MobyMax': 150,
 'Multiplication.com': 151,
 'Mystery Science, Inc': 152,
 'NASA': 153,
 'National Center for Families Learning': 154,
 'Nature America, Inc': 155,
 'Nearpod Inc.': 156,
 'Netflix': 157,
 'Neuron Fuel': 158,
 'New York State Education Department': 159,
 'Newsela': 160,
 'Nitrolabs Limited': 161,
 'NoRedInk': 162,
 'NoodleTools': 163,
 'Online Writing Lab': 164,
 'OverDrive': 165,
 'PBS': 166,
 'Padlet': 167,
 'Pandora Media, LLC': 168,
 'Panorama Education': 169,
 'Pear Deck': 170,
 'Performance Matters': 171,
 'Pew Research Center ': 172,
 'Physics Classroom, LLC': 173,
 'Pixlr': 174,
 'PowerSchool Group LLC': 175,
 'Prezi Inc.': 176,
 'ProQuest': 177,
 'Purch': 178,
 'Qualtrics': 179,
 'Quaver Music': 180,
 'Quill.org': 181,
 'Quizizz': 182,
 'Quizlet': 183,
 'Quora': 184,
 'Read Theory': 185,
 'ReadWorks': 186,
 'ReadWriteThink.org': 187,
 'Remind101': 188,
 'Renaissance Learning': 189,
 'Renaissance Learning, Inc.': 190,
 'RoomRecess.com': 191,
 'SAG-AFTRA Foundation': 192,
 'SMARTeacher Inc.': 193,
 'Sandbox Networks': 194,
 'Savvas Learning Company | Formerly Pearson K12 Learning': 195,
 'Sched LLC': 196,
 'Scholastic Inc': 197,
 'School Loop': 198,
 'School Specialty Inc': 199,
 'SchoolTube': 200,
 'Schoology': 201,
 'Screencast-O-Matic': 202,
 'Screencastify, LLC': 203,
 'Securly Inc ': 204,
 'Seesaw Learning Inc': 205,
 'SharpSchool': 206,
 'Shmoop University, Inc': 207,
 'Showbie Inc ': 208,
 'SignUpGenius': 209,
 'SlidesCarnival': 210,
 'Snap Inc.': 211,
 'SoundCloud': 212,
 'Southern Poverty Law Center': 213,
 'SparkNotes': 214,
 'Spotify Ltd': 215,
 'Spotify USA Inc': 216,
 'Starfall Education': 217,
 'Studies Weekly': 218,
 'Study.com': 219,
 'StudyPad Inc.': 220,
 'SurveyMonkey': 221,
 'TEACHERSPAYTEACHERS': 222,
 'TED Conferences': 223,
 'Teach TCI': 224,
 'TeachMe': 225,
 'Teaching.com': 226,
 'Technological Solutions, Inc. (TSI)': 227,
 'Texthelp, Inc.': 228,
 'The College Board': 229,
 'The Common Application, Inc.': 230,
 'The Internet Archive': 231,
 'The Math Learning Center': 232,
 'The New York Times': 233,
 'The Pennsylvania State Universtity': 234,
 'The University of Utah ': 235,
 'The Wikimedia Foundation': 236,
 'ThingLink': 237,
 'Time USA, LLC ': 238,
 'Tools for Schools, Inc. (Book Creator)': 239,
 'Toy Theater': 240,
 'TumbleBooks': 241,
 'Tumblr ': 242,
 'Turnitin': 243,
 'TurtleDiary LLC': 244,
 'TypingClub': 245,
 'US Geological Survey': 246,
 'US Holocaust Museum': 247,
 'United States National Archives': 248,
 'University of Colorado': 249,
 'Utah Education Network': 250,
 'Vector Solutions': 251,
 'Vespr': 252,
 'ViewPure': 253,
 'Vimeo': 254,
 'Vitzo Ltd': 255,
 'VocabularySpellingCity': 256,
 'Vooks, Inc.': 257,
 'Wakelet': 258,
 'Washington Post': 259,
 'WeAreTeachers': 260,
 'WeVideo, Inc.': 261,
 'Weebly': 262,
 'West Corporation': 263,
 'Whiteboard.fi': 264,
 'Wistia': 265,
 'Wix.com, Inc': 266,
 'WordPress': 267,
 'WordReference.com': 268,
 'World Book, Inc': 269,
 'World Wildlife Fund': 270,
 'WyzAnt': 271,
 'XtraMath': 272,
 'Yegros Educational LLC DBA Conjuguemos': 273,
 'ZOOM VIDEO COMMUNICATIONS, INC.': 274,
 'Zearn': 275,
 'Zendesk': 276,
 'iCivics Inc': 277,
 'iHeartRadio': 278,
 'iStockphoto LP': 279,
 'mrdonn.org': 280,
 'musictheory.net': 281,
 'online-stopwatch.com': 282}
sector_map = {'Corporate': 0,
 'Higher Ed; Corporate': 1,
 'PreK-12': 2,
 'PreK-12; Higher Ed': 3,
 'PreK-12; Higher Ed; Corporate': 4}
primary_function_main_map = {'CM': 0, 'LC': 1, 'LC/CM/SDO': 2, 'SDO': 3}
primary_function_sub_map = {'Admissions, Enrollment & Rostering': 0,
                         'Career Planning & Job Search': 1,
                         'Classroom Engagement & Instruction': 2,
                         'Content Creation & Curation': 3,
                         'Courseware & Textbooks': 4,
                         'Data, Analytics & Reporting': 5,
                         'Digital Learning Platforms': 6,
                         'Environmental, Health & Safety (EHS) Compliance': 7,
                         'Human Resources': 8,
                         'Large-Scale & Standardized Testing': 9,
                         'Learning Management Systems (LMS)': 10,
                         'Online Course Providers & Technical Skills Development': 11,
                         'Other': 12,
                         'School Management Software': 13,
                         'Sites, Resources & Reference': 14,
                         'Study Tools': 15,
                         'Teacher Resources': 16,
                         'Virtual Classroom': 17}

In [None]:
discretised_data = causal_data.iloc[:100000,:].copy()

In [None]:
discretised_data["locale"] = discretised_data["locale"].map({y:x for x,y in locale_map.items()})
discretised_data["Provider_Company_Name"] = discretised_data["Provider_Company_Name"].map({y:x for x,y in Provider_map.items()})
discretised_data["Sector"] = discretised_data["Sector"].map({y:x for x,y in sector_map.items()})
discretised_data["primary_function_main"] = discretised_data["primary_function_main"].map({y:x for x,y in primary_function_main_map.items()})
discretised_data["primary_function_sub"] = discretised_data["primary_function_sub"].map({y:x for x,y in primary_function_sub_map.items()})

In [None]:
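# Split the numeric columns (pct_access, engagement_index, pct_black_hispanic,
# pct_free_reduced, pp_total_raw) into 'low' / 'high' bins.
data_vals = {col: causal_data[col].unique() for col in causal_data.columns}
for col in discretised_data.columns[[0, 1, 3, 4, 5]]:
    # Threshold is half the observed range of the column
    threshold = (discretised_data[col].max() - discretised_data[col].min()) / 2
    level_map = {v: 'low' if v <= threshold else 'high' for v in data_vals[col]}
    discretised_data[col] = discretised_data[col].map(level_map)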

In [None]:
del causal_data

In [None]:
discretised_data.head()
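As a minimal sketch of the low/high binning used above, the helper below splits a numeric column at the midpoint of its observed range, `min + (max - min) / 2`. This is an assumption made for illustration: it matches the notebook's threshold only when the column minimum is zero. The function name `binarise_at_midpoint` and the toy values are hypothetical.

```python
import pandas as pd


def binarise_at_midpoint(values: pd.Series) -> pd.Series:
    """Label each value 'low' or 'high' relative to the midpoint of the observed range."""
    lo, hi = values.min(), values.max()
    midpoint = lo + (hi - lo) / 2
    return values.apply(lambda v: "low" if v <= midpoint else "high")


# Hypothetical per-pupil expenditure values, for illustration only
toy = pd.Series([4000, 8000, 12000, 16000, 20000], name="pp_total_raw")
print(binarise_at_midpoint(toy).tolist())
```

Two bins keep the conditional probability tables small, at the cost of discarding most of the variation within each column.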

### Train Bayesian Model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc, confusion_matrix
from causalnex.inference import InferenceEngine
from causalnex.evaluation import roc_auc
from causalnex.evaluation import classification_report
from causalnex.network import BayesianNetwork

In [None]:
bn = BayesianNetwork(sm_1000000)

In [None]:
bn = bn.fit_node_states(discretised_data)

In [None]:
bn = bn.fit_cpds(discretised_data, method="BayesianEstimator", bayes_prior="K2")

In [None]:
ie1 = InferenceEngine(bn)

In [None]:
print("distribution before do", ie1.query()["pct_free_reduced"])
print("marginal engagement", ie1.query()["engagement_index"])
ie1.do_intervention("pct_free_reduced",
                   {'low': 1.0,
                    'high': 0.0})

print("distribution after do", ie1.query()["pct_free_reduced"])
print("updated marginal engagement", ie1.query()["engagement_index"])

In the original distribution, 55% of the data fall in the low `pct_free_reduced` category (the share of students eligible for free or reduced-price lunch). We apply an intervention that sets this to 100% low.

If `pct_free_reduced` were 100% low rather than 55% low, as in the original distribution, the level of engagement would remain the same. One possible explanation is that in 2020 most students were learning remotely and had no access to the free meals provided in schools.
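To make the effect of such an intervention concrete, here is a small pure-Python sketch of a two-node network, assuming `pct_free_reduced` is a root node with a single edge into `engagement_index`. The conditional probabilities are made-up numbers for illustration only (they are not taken from the fitted model), and `marginal` is a hypothetical helper, not part of CausalNex.

```python
def marginal(parent_dist, cpd):
    """P(child) = sum over parent states of P(parent) * P(child | parent)."""
    child_states = list(next(iter(cpd.values())).keys())
    return {
        c: sum(parent_dist[p] * cpd[p][c] for p in parent_dist)
        for c in child_states
    }


# Observed distribution of the parent (55% low, as stated above)
observed_parent = {"low": 0.55, "high": 0.45}

# Hypothetical P(engagement_index | pct_free_reduced); outer key is the parent state
cpd = {
    "low": {"low": 0.9, "high": 0.1},
    "high": {"low": 0.7, "high": 0.3},
}

before = marginal(observed_parent, cpd)

# do(pct_free_reduced = low): replace the parent's distribution outright
after = marginal({"low": 1.0, "high": 0.0}, cpd)

print("before:", before)
print("after: ", after)
```

If both rows of the conditional table were identical, `before` and `after` would coincide, which is the situation the unchanged engagement marginal suggests.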

In [None]:
ie2 = InferenceEngine(bn)

In [None]:
print("pct_black_hispanic distribution before intervention", ie2.query()["pct_black_hispanic"])
print("marginal engagement", ie2.query()["engagement_index"])
ie2.do_intervention("pct_black_hispanic",
                   {'low': 0.0,
                    'high': 1.0})
print("distribution after do", ie2.query()["pct_black_hispanic"])
print("updated marginal engagement", ie2.query()["engagement_index"])

If the share of Black or Hispanic students were 100% high, as opposed to 18% in the original distribution, the level of engagement would increase by five percent.

### Future Prophet Analysis
Prophet is an open-source library designed for time-series forecasting. We use it to try to forecast online engagement six weeks ahead.
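Prophet expects a frame with a date column `ds` and a value column `y`. The sketch below shows one possible way to prepare it: averaging the engagement index per day so that each date appears once. This aggregation step is an assumption, not what the cells below do (they pass every row as is), and the helper name `to_daily_series` and the toy values are hypothetical.

```python
import pandas as pd


def to_daily_series(df: pd.DataFrame, time_col: str = "time",
                    value_col: str = "engagement_index") -> pd.DataFrame:
    """Average value_col per day and rename the columns to Prophet's ds / y."""
    daily = (
        df.assign(**{time_col: pd.to_datetime(df[time_col])})
          .groupby(time_col, as_index=False)[value_col]
          .mean()
          .rename(columns={time_col: "ds", value_col: "y"})
    )
    return daily.sort_values("ds").reset_index(drop=True)


# Toy rows standing in for district/product level records
toy = pd.DataFrame({
    "time": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "engagement_index": [10.0, 30.0, 50.0],
})
daily = to_daily_series(toy)

# Six weeks ahead, expressed in days for make_future_dataframe(periods=...)
horizon_days = 6 * 7
print(daily)
print(horizon_days)
```

Fitting on one row per day keeps the model small and makes the forecast interpretable as the expected average daily engagement.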

In [None]:
p.rename(columns={'time':'ds','engagement_index':'y'},inplace=True)
p.sample()


In [None]:
p = p.dropna()

In [None]:
!pip install prophet
from prophet import Prophet

In [None]:
# instantiate the model and fit the timeseries
prophet = Prophet()
prophet.fit(p)

# create a future data frame 
future = prophet.make_future_dataframe(periods=42)
forecast = prophet.predict(future)

# display the most critical output columns from the forecast
print(forecast[['ds','yhat','yhat_lower','yhat_upper']].head())

# plot
fig = prophet.plot(forecast)