### Problem Statement

Current research shows educational outcomes are far from equitable. The imbalance was exacerbated by the COVID-19 pandemic. There's an urgent need to better understand and measure the scope and impact of the pandemic on these inequities.

The COVID-19 pandemic has disrupted learning for more than 56 million students in the United States. In the spring of 2020, most state and local governments across the U.S. closed educational institutions to stop the spread of the virus. In response, schools and teachers have attempted to reach students remotely through distance learning tools and digital platforms.

In this notebook we try to gain insight into:
- the state of digital learning in 2020, and
- how engagement with digital learning relates to factors such as district demographics, broadband access, and state- and national-level policies and events.



#### Overview of the data
Which data are we going to use for the analysis?
- engagement_data: each file holds the data for one school district; the 4-digit file name is the district_id
- products_info.csv: characteristics of the top 372 products with the most users in 2020
- districts_info.csv: characteristics of the school districts, including data from NCES and FCC
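
As a minimal sketch of the file-naming convention above, the district id can be recovered from the name of an engagement file. The helper name `district_id_from_path` and the example path are illustrative assumptions, not part of the dataset or of this notebook's code.

```python
from pathlib import PurePosixPath


def district_id_from_path(csv_path: str) -> int:
    """Return the district id encoded in an engagement file name.

    Hypothetical helper: assumes files are named '<district_id>.csv'.
    """
    stem = PurePosixPath(csv_path).stem  # '1000.csv' -> '1000'
    if not stem.isdigit():
        raise ValueError(f"unexpected engagement file name: {csv_path!r}")
    return int(stem)


# Example path, made up for illustration
example_id = district_id_from_path("engagement_data/1000.csv")
print(example_id)
```

Raising on an unexpected name keeps a stray file from silently becoming a bogus district.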

### Importing libs

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

sns.set()
import os



### Functions to read and write CSV files

In [None]:
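def read_csv(csv_path, missing_values=None):
    """Read a CSV file into a DataFrame; return None if the file is missing."""
    try:
        # None (instead of a mutable [] default) keeps pandas' default NA markers
        df = pd.read_csv(csv_path, na_values=missing_values)
        print("file read as csv")
        return df
    except FileNotFoundError:
        print(f"file not found: {csv_path}")
        return None

def save_csv(df, csv_path):
    """Write a DataFrame to CSV without the index; return None on failure."""
    try:
        df.to_csv(csv_path, index=False)
        print('File Successfully Saved.!!!')
        return df

    except Exception as error:
        print(f"Save failed: {error}")
        return None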




### Functions for calculating missing values in a column and dataframe

In [None]:
def percent_missing(df: pd.DataFrame) -> float:
    """Percentage of missing cells in the whole dataframe."""
    # df.size is rows * columns; np.product was removed in NumPy 2.0
    total_cells = df.size
    if total_cells == 0:
        return 0.0
    total_missing = df.isnull().sum().sum()
    return round((total_missing / total_cells) * 100, 2)


def percent_missing_for_col(df: pd.DataFrame, col_name: str) -> float:
    """Percentage of missing values in a single column."""
    total_count = len(df[col_name])
    if total_count <= 0:
        return 0.0
    missing_count = df[col_name].isnull().sum()

    return round((missing_count / total_count) * 100, 2)
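
The arithmetic behind the two helpers can be checked by hand on a toy table. The sketch below is a plain-Python stand-in, not the notebook's pandas code: `None` plays the role of a missing cell, and the rows and the `toy_*` names are invented for illustration.

```python
# Toy table: None marks a missing cell (stand-in for NaN in a DataFrame).
toy_rows = [
    {"state": "Utah", "locale": "Suburb", "pp_total_raw": None},
    {"state": None, "locale": None, "pp_total_raw": None},
    {"state": "Ohio", "locale": "City", "pp_total_raw": "[10000, 12000["},
    {"state": "Ohio", "locale": "Rural", "pp_total_raw": None},
]


def toy_percent_missing(rows):
    """Missing cells as a percentage of all cells."""
    total_cells = sum(len(row) for row in rows)
    if total_cells == 0:
        return 0.0
    missing_cells = sum(value is None for row in rows for value in row.values())
    return round(missing_cells / total_cells * 100, 2)


def toy_percent_missing_for_col(rows, col_name):
    """Missing values in one column as a percentage of the row count."""
    if not rows:
        return 0.0
    missing = sum(row[col_name] is None for row in rows)
    return round(missing / len(rows) * 100, 2)


overall = toy_percent_missing(toy_rows)                      # 5 of 12 cells
state_pct = toy_percent_missing_for_col(toy_rows, "state")   # 1 of 4 rows
spend_pct = toy_percent_missing_for_col(toy_rows, "pp_total_raw")  # 3 of 4 rows
print(overall, state_pct, spend_pct)
```

The overall figure weights every cell equally, so a column that is almost entirely empty can be hidden behind a modest dataframe-level percentage.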

### DataOverview class
The DataOverview class gives a quick overview of a given dataframe:
1. It gets the number of rows and columns
2. It calculates the percentage of missing values
3. It counts the occurrences of each value in a given column

In [None]:
class DataOverview:
    
    def __init__(self, df):
        
        self.df = df
    
    def read_head(self, top=5):
        return self.df.head(top)
    
    # print the number of rows and columns plus the column information
    def get_info(self):
        row_count, col_count = self.df.shape
    
        print(f"Number of rows: {row_count}")
        print(f"Number of columns: {col_count}")
        print("================================")

        self.df.info()  # prints its report and returns None
        return row_count, col_count
    
    def get_count(self, column_name):
        return pd.DataFrame(self.df[column_name].value_counts())
    
    # getting the null count for every column
    def get_null_count(self):
        null_counts = self.df.isnull().sum()
        print("Null values count")
        print(null_counts)
        return null_counts
    
    # getting the percentage of missing values
    def get_percent_missing(self):
        percent_missing_info = percent_missing(self.df)
        null_percent_df = pd.DataFrame(columns = ['column', 'null_percent'])
        columns = self.df.columns.values.tolist()
        null_percent_df['column'] = columns
        null_percent_df['null_percent'] = null_percent_df['column'].map(lambda x: percent_missing_for_col(self.df, x))
        
        
        return null_percent_df.sort_values(by=['null_percent'], ascending = False), percent_missing_info

### Reading the products_info and districts_info dataframes

In [None]:
base_path = "/kaggle/input/learnplatform-covid19-impact-on-digital-learning/"

In [None]:
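# os.path.join avoids the doubled "/" produced by the trailing slash in base_path
product = read_csv(os.path.join(base_path, "products_info.csv"))
district = read_csv(os.path.join(base_path, "districts_info.csv"))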


### Data Preprocessing
In the data preprocessing step, we mainly deal with the missing values and extract some new columns from existing columns.
* Total missing value in district dataframe is : 27.1%
* Total missing value in product dataframe is : 1.84%


In [None]:
print(f"Total missing value in district dataframe is : {percent_missing(district)}%")
print(f"Total missing value in product dataframe is : {percent_missing(product)}%")

### A. Data cleaning for district

#### Overview of district data
The districts file includes information about the characteristics of school districts, including data from NCES (2018-19), FCC (Dec 2018), and Edunomics Lab:

* district_id - the unique identifier of the school district
* state - state of the school district
* locale - the type of the district (City, Suburb, Town or Rural)
* pct_black/hispanic - percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data.
* pct_free/reduced - percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data.
* county_connections_ratio - ratio (residential fixed high-speed connections over 200 kbps in at least one direction/households) based on the county level data from FCC Form 477 (December 2018 version).
* pp_total_raw - per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project.

The district dataframe has 233 rows.

pp_total_raw, county_connections_ratio, pct_free/reduced and pct_black/hispanic hold their values as ranges of the form [a, b[. Each range is an uncertainty interval that represents a <= x < b.
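
To make the `[a, b[` notation concrete, here is a minimal sketch that parses such a string and tests membership in the half-open interval. The function names `parse_range` and `in_range` are assumptions introduced for this illustration only.

```python
def parse_range(value: str) -> tuple[float, float]:
    """Parse a string such as '[0.2, 0.4[' into (lower, upper)."""
    lower, upper = value.strip("[]").split(",")
    return float(lower), float(upper)


def in_range(x: float, value: str) -> bool:
    """True when lower <= x < upper: the lower bound is included, the upper is not."""
    lower, upper = parse_range(value)
    return lower <= x < upper


bounds = parse_range("[0.2, 0.4[")
print(bounds, in_range(0.2, "[0.2, 0.4["), in_range(0.4, "[0.2, 0.4["))
```

Because the upper bound is excluded, adjacent ranges such as `[0.2, 0.4[` and `[0.4, 0.6[` never overlap.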

In [None]:
district_overview = DataOverview(district)
display(district_overview.get_info())
print()
print("Let's get the top 5 rows in the district dataframe")
print()
display(district_overview.read_head())

- state, locale and pct_black/hispanic each have 24.46% missing values. Rows where the state column is missing also have missing values in all of their other columns, so they are not relevant for our analysis and we can drop them. That means we lose 24.46% of the district data.
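
Before dropping rows it is worth confirming that the rows without a state really carry no other information. The sketch below shows one way to check this, assuming pandas is available; `toy_district` and `rows_without_state_are_empty` are made-up names and the data is invented, so this is a pattern rather than the notebook's own check.

```python
import pandas as pd

# Invented stand-in for the district dataframe
toy_district = pd.DataFrame({
    "district_id": [1000, 1001, 1002],
    "state": ["Utah", None, "Ohio"],
    "locale": ["Suburb", None, "City"],
    "pct_black/hispanic": ["[0, 0.2[", None, "[0.2, 0.4["],
})


def rows_without_state_are_empty(df: pd.DataFrame) -> bool:
    """True if every row lacking a state is missing all columns except district_id."""
    no_state = df[df["state"].isna()]
    other_cols = no_state.drop(columns=["district_id"])
    return bool(other_cols.isna().all().all())


safe_to_drop = rows_without_state_are_empty(toy_district)
print(safe_to_drop)
```

If the check returned `False`, some of those rows would still hold usable values and dropping them would discard data.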


In [None]:
district_missing_df, _ = district_overview.get_percent_missing()
display(district_missing_df)

In [None]:

cleaned_district = district[district['state'].notna()].reset_index(drop=True)
district_missing_df, _ = DataOverview(cleaned_district).get_percent_missing()
display(district_missing_df)


For the other columns with missing values (i.e. pp_total_raw, pct_free/reduced, county_connections_ratio), instead of dropping the rows we impute the missing values with the mode of the corresponding column. Replacing them with the overall mode can introduce bias, so to fill a missing value in a given row and column we took the following steps.
* Based on the assumption that districts in the same state with the same locale have similar characteristics, we compute the mode of the column over the districts that share the state and locale of the row.
* If a mode exists, we replace the missing value with it.
* Otherwise we replace it with the placeholder value "[0.0, 0.0[".
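
The steps above can also be sketched in vectorised form with a grouped transform. This is an alternative formulation under stated assumptions, not the notebook's implementation: `fill_with_group_mode` is a hypothetical helper, the toy data is invented, and it groups by state and locale.

```python
import pandas as pd

PLACEHOLDER = "[0.0, 0.0["  # fallback when a group has no observed value


def fill_with_group_mode(df, column, keys=("state", "locale")):
    """Fill missing values in `column` with the mode of the row's (state, locale) group."""
    def fill(group):
        modes = group.mode()  # mode() ignores missing values
        fallback = modes.iloc[0] if not modes.empty else PLACEHOLDER
        return group.fillna(fallback)

    out = df.copy()
    out[column] = out.groupby(list(keys))[column].transform(fill)
    return out


toy = pd.DataFrame({
    "state": ["Utah", "Utah", "Utah", "Ohio"],
    "locale": ["Suburb", "Suburb", "Suburb", "City"],
    "pp_total_raw": ["[6000, 8000[", "[6000, 8000[", None, None],
})
filled = fill_with_group_mode(toy, "pp_total_raw")
print(filled)
```

The Utah row inherits the mode of its group, while the Ohio row has no peers to borrow from and receives the placeholder.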

In [None]:
def get_mode_for_state_locale(df, state, locale, column):
    # Restrict to districts in the same state and locale as the row being filled
    filtered_df = df[(df["state"] == state) & (df["locale"] == locale)]
    _mode = filtered_df[column].mode().to_list()
    if len(_mode) > 0:
        return _mode[0]
    
    # No observed value in the group: fall back to the placeholder range
    return "[0.0, 0.0["

def handle_missing_for_district(row):
    # Fill each range column that is missing in this row
    for column in ["pp_total_raw", "pct_free/reduced", "county_connections_ratio"]:
        if pd.isna(row[column]):
            row[column] = get_mode_for_state_locale(cleaned_district, row["state"], row["locale"], column)
    
    return row
    



In [None]:
cleaned_district = cleaned_district.apply(handle_missing_for_district, axis=1)
district_missing_df, _ = DataOverview(cleaned_district).get_percent_missing()
display(district_missing_df)

In [None]:
DataOverview(cleaned_district).get_info()

In [None]:
DataOverview(cleaned_district).get_count("pp_total_raw")

In [None]:
display(DataOverview(cleaned_district).get_count("county_connections_ratio"))
display(DataOverview(cleaned_district).get_count("pct_free/reduced"))

### B. Data cleaning for product

The product file includes information about the characteristics of the top 372 products with most users in 2020:

* LP ID - the unique identifier of the product.
* URL - web link to the product.
* Product Name
* Provider/Company Name
* Sector(s) - sector of education where the product is used.
* Primary Essential Function - the basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled.

- There are 372 rows, and the Sector(s) and Primary Essential Function columns each have 5.38% missing values.
- The Provider/Company Name column has 0.27% missing values, which is a single row.

In [None]:
product_overview = DataOverview(product)
display(product_overview.get_info())
display(product_overview.read_head())
display(product_overview.get_percent_missing()[0])



1. For the Sector(s) and Primary Essential Function columns, we replaced the missing values with the following steps.
    * Based on the assumption that products with the same 'Provider/Company Name' have similar sectors and functions, we compute the mode of the column over the products that share the 'Provider/Company Name' of the row.
    * If a mode exists, we replace the missing value with it.
    * Otherwise we replace it with None.

2. We then drop all rows with missing values that were not handled in step 1.



In [None]:
def get_mode_for_company(df, comp_name, column):
    # Most frequent observed value among products from the same provider
    filtered_df = df[ df["Provider/Company Name"] == comp_name]

    _mode = filtered_df[column].mode().to_list()
    if len(_mode) > 0:
        return _mode[0]
    
    return None

def handle_missing_for_product(row):
    # Fill each label column that is missing in this row
    for column in ["Sector(s)", "Primary Essential Function"]:
        if pd.isna(row[column]):
            row[column] = get_mode_for_company(product, row["Provider/Company Name"], column)
    
    return row

cleaned_product = product.apply(handle_missing_for_product, axis=1)
# Drop the rows that could not be imputed
cleaned_product = cleaned_product.dropna()

product_overview = DataOverview(cleaned_product)
display(product_overview.get_percent_missing()[0])


Now neither the product nor the district dataframe has any missing values.

#### Extracting main_fun and sub_fun from the "Primary Essential Function" column

In [None]:
cleaned_product["main_fun"] = cleaned_product["Primary Essential Function"].map(lambda x: x.split(" - ")[0])
cleaned_product["sub_fun"] = cleaned_product["Primary Essential Function"].map(lambda x: x.split(" - ")[1])
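
As a small illustration of what the split on `" - "` produces, the sketch below applies it to two label strings. The labels are written out here as examples of the label format and `split_function` is a hypothetical helper; note that for a label with three parts only the second part ends up in the sub-function.

```python
def split_function(label: str):
    """Split 'MAIN - Sub' into (main, sub); sub is None when there is no separator."""
    parts = label.split(" - ")
    main_fun = parts[0]
    sub_fun = parts[1] if len(parts) > 1 else None
    return main_fun, sub_fun


# Example labels for illustration
two_part = split_function("LC - Digital Learning Platforms")
three_part = split_function("LC - Sites, Resources & Reference - Games & Simulations")
no_separator = split_function("LC")
print(two_part, three_part, no_separator)
```

The guard on `len(parts)` matters because indexing `[1]` directly raises an `IndexError` for a label without a separator.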

### Initial exploration

#### District

##### Counts of locale and pct_black/hispanic values

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(16,6))

sns.countplot(data=cleaned_district, x='locale', palette='GnBu', ax=ax[0])
sns.countplot(data=cleaned_district, x='pct_black/hispanic', palette='GnBu', ax=ax[1])
plt.show()


##### Counts of pct_free/reduced and county_connections_ratio values
Unfortunately, the county_connections_ratio column is filled almost entirely with '[0.18, 1['.
'[1, 2[' appears in only a single row, so this column is of no use to us.
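
One way to put a number on "almost entirely" is the share of rows taken by the most common value. The sketch below uses invented counts, not the real column, and `modal_share` is a hypothetical helper.

```python
from collections import Counter


def modal_share(values):
    """Return the most common value and the fraction of entries it accounts for."""
    counts = Counter(values)
    top_value, top_count = counts.most_common(1)[0]
    return top_value, top_count / len(values)


# Invented counts: nine rows in one range, a single row in the other
toy_ratio = ["[0.18, 1["] * 9 + ["[1, 2["]
top_value, share = modal_share(toy_ratio)
print(top_value, share)
```

A share close to 1 means the column barely varies, so it cannot separate districts from one another.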

In [None]:

fig, ax = plt.subplots(1, 2,figsize=(16,6))
sns.countplot(data=cleaned_district, x='pct_free/reduced', palette='GnBu', ax=ax[0])
sns.countplot(data=cleaned_district, x='county_connections_ratio', palette='GnBu', ax=ax[1])



In [None]:
fig, ax = plt.subplots(figsize=(16,6))

sns.countplot(data=cleaned_district, y='pp_total_raw', palette='GnBu')


In [None]:
state_abb = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District Of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}

### Number of school districts in each state
The states with the most school districts are Connecticut, Utah and Massachusetts.


In [None]:
location_count_df = DataOverview(cleaned_district).get_count("state")
display(location_count_df)

locations = location_count_df.index.map(lambda x: state_abb[x]).to_list()
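# Select the counts by position: the column is named "state" in pandas 1.x but "count" in pandas 2.x
district_counts = location_count_df.iloc[:, 0].to_list()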

fig = px.choropleth(locations=locations, locationmode="USA-states", color=district_counts, scope="usa" )
fig.show()
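
The lookup `state_abb[x]` raises a `KeyError` as soon as a state name is spelled differently from the dictionary key. A defensive variant, sketched here with a made-up helper `to_state_codes` and a two-entry toy lookup, collects the unmapped names instead of failing.

```python
def to_state_codes(state_names, lookup):
    """Map state names to codes; return the mapping and the names that had no match."""
    mapped, unmapped = {}, []
    for name in state_names:
        code = lookup.get(name)
        if code is None:
            unmapped.append(name)
        else:
            mapped[name] = code
    return mapped, unmapped


# Toy lookup and names, invented for illustration
toy_lookup = {"Utah": "UT", "District Of Columbia": "DC"}
mapped, unmapped = to_state_codes(["Utah", "District of Columbia"], toy_lookup)
print(mapped, unmapped)
```

Inspecting `unmapped` before plotting shows whether any state would be missing from the map.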

For further analysis we derived new numeric columns from the columns whose values are defined as ranges, taking the midpoint (min_range + max_range) / 2 of each range.

In [None]:
def calculate_mean_from_range(value: str) -> float:
    # "[a, b[" -> (a + b) / 2; float() replaces eval(), which would execute arbitrary text
    min_range, max_range = value.strip("[]").split(", ")
    return (float(min_range) + float(max_range)) / 2.0

cleaned_district['pct_black/hispanic_mean'] = cleaned_district['pct_black/hispanic'].map(calculate_mean_from_range)
cleaned_district['pct_free/reduced_mean'] = cleaned_district['pct_free/reduced'].map(calculate_mean_from_range)
cleaned_district['pp_total_raw_mean'] = cleaned_district['pp_total_raw'].map(calculate_mean_from_range)
cleaned_district.head()
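
One caveat of the midpoint columns is that the imputation placeholder "[0.0, 0.0[" has a midpoint of 0 and therefore pulls averages down. The sketch below shows the effect on invented values; `midpoint` is a stand-alone re-implementation for illustration, not the notebook's function.

```python
from statistics import mean

PLACEHOLDER = "[0.0, 0.0["  # value used when no mode was available


def midpoint(value: str) -> float:
    """Midpoint of a range string such as '[8000, 10000['."""
    lower, upper = (float(part) for part in value.strip("[]").split(","))
    return (lower + upper) / 2.0


# Invented per-pupil expenditure ranges for three districts
toy_spend = ["[8000, 10000[", "[10000, 12000[", PLACEHOLDER]

with_placeholder = mean(midpoint(v) for v in toy_spend)
without_placeholder = mean(midpoint(v) for v in toy_spend if v != PLACEHOLDER)
print(with_placeholder, without_placeholder)
```

Excluding placeholder rows from an average, or treating them as missing, keeps them from being read as real zero values.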

In [None]:
state_pct_black_hispanic_agg = (
    cleaned_district.groupby("state")
    .agg({"pct_black/hispanic_mean": "mean"})
    .sort_values(by=["pct_black/hispanic_mean"], ascending=False)  # sort_values is not in-place, so keep the result
)

locations = state_pct_black_hispanic_agg.index.map(lambda x: state_abb[x]).to_list()
colors = state_pct_black_hispanic_agg['pct_black/hispanic_mean'].to_list()

fig = px.choropleth(locations=locations, locationmode="USA-states", color=colors, scope="usa" )
fig.show()
state_pct_black_hispanic_agg

#### Average total expenditure in each state

In [None]:
state_expenditure_agg = (
    cleaned_district.groupby("state")
    .agg({"pp_total_raw_mean": "mean"})
    .sort_values(by=['pp_total_raw_mean'], ascending=False)  # sort_values is not in-place, so keep the result
)

locations = state_expenditure_agg.index.map(lambda x: state_abb[x]).to_list()
colors = state_expenditure_agg['pp_total_raw_mean'].to_list()
display(state_expenditure_agg)

fig = px.choropleth(locations=locations, locationmode="USA-states", color=colors, scope="usa" )
fig.show()

#### Average percentage of students eligible for free or reduced-price lunch in each state

In [None]:
pct_free_reduced_mean_agg = (
    cleaned_district.groupby("state")
    .agg({"pct_free/reduced_mean": "mean"})
    .sort_values(by=['pct_free/reduced_mean'], ascending=False)  # sort_values is not in-place, so keep the result
)

locations = pct_free_reduced_mean_agg.index.map(lambda x: state_abb[x]).to_list()
colors = pct_free_reduced_mean_agg['pct_free/reduced_mean'].to_list()

display(pct_free_reduced_mean_agg)

fig = px.choropleth(locations=locations, locationmode="USA-states", color=colors, scope="usa" )
fig.show()

#### Product

In [None]:
display(DataOverview(cleaned_product).get_count("main_fun"))
display(DataOverview(cleaned_product).get_count("sub_fun"))



In [None]:
fig, ax = plt.subplots(figsize=(8,6))
sns.countplot(data=cleaned_product, x='main_fun', palette='GnBu')
plt.show()



In [None]:
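# rename_axis/reset_index(name=...) yield the columns 'main_fun' and 'count' on both pandas 1.x and 2.x
fig = px.pie(cleaned_product['main_fun'].value_counts().rename_axis('main_fun').reset_index(name='count'), values = 'count', names = 'main_fun', width = 700, height = 700)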
fig.update_traces(textposition = 'inside', 
                  textinfo = 'percent + label', 
                  hole = 0.7, 
                  marker = dict(colors = ['#90afc5','#336b87','#2a3132','#763626'], line = dict(color = 'white', width = 2)))

fig.update_layout(annotations = [dict(text = ' The count of main primary functions <br> in products', 
                                      x = 0.5, y = 0.5, font_size = 26, showarrow = False, 
                                      font_family = 'monospace',
                                      font_color = '#283655')],
                  showlegend = False)
                  
fig.show()

#### Engagement

In [None]:
engagement_path = os.path.join(base_path, "engagement_data")
district_lst = cleaned_district['district_id'].to_list()

temp = []
for district_id in district_lst:
    temp_df = read_csv(os.path.join(engagement_path, f"{district_id}.csv"))
    if temp_df is None:
        # read_csv returns None when the file is missing; skip that district
        continue
    temp_df['district_id'] = district_id
    temp.append(temp_df)

engagement_df = pd.concat(temp).reset_index(drop=True)
print(engagement_path)

In [None]:
engagement_overview = DataOverview(engagement_df)
display(engagement_overview.get_info())
display(engagement_overview.read_head())

In [None]:
num_products= engagement_df["lp_id"].unique().shape[0]
print(f"Number of unique products is {num_products}")

In [None]:

unique_products = product["LP ID"].to_list()
cleaned_engagement_df = engagement_df[engagement_df["lp_id"].isin(unique_products)]
DataOverview(cleaned_engagement_df).get_percent_missing()[0]


In [None]:

def calculate_product_percentage(prod_id, cleaned_engagement_df):
    # Share (in %) of the engagement rows that belong to the given product
    row = cleaned_engagement_df[cleaned_engagement_df["lp_id"] == prod_id].shape[0]
    total = cleaned_engagement_df.shape[0]
    return (row/total) * 100
    
# Products whose sector is missing in the raw product data
null_sectors_df = product[product["Sector(s)"].isna()]
null_sectors_ids = null_sectors_df["LP ID"].to_list()

temp_res = []
for _id in null_sectors_ids:
    temp_res.append(calculate_product_percentage(_id, cleaned_engagement_df))

missing_prod_df = pd.DataFrame()
missing_prod_df["Product"] = null_sectors_ids
missing_prod_df["Occurrence_pct"] = temp_res
missing_prod_df.sort_values(by=["Occurrence_pct"], ascending=False)  

In [None]:

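# .copy() makes the frame independent of engagement_df, so the column assignment below is unambiguous
cleaned_engagement_df = cleaned_engagement_df.dropna(subset=['pct_access']).copy()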
cleaned_engagement_df["engagement_index"] = cleaned_engagement_df["engagement_index"].fillna(0)
DataOverview(cleaned_engagement_df).get_percent_missing()[0]


In [None]:
cleaned_product = cleaned_product.rename(columns={"LP ID": "lp_id"})
merged_df = pd.merge(cleaned_engagement_df, cleaned_product, on='lp_id')
merged_df = pd.merge(merged_df, cleaned_district, on='district_id')

main_fun_agg = merged_df.groupby("main_fun").agg({"pct_access": "mean"})


In [None]:
local_df = merged_df[["locale", "pct_access","engagement_index"]]
local_df_agg = local_df.groupby("locale").agg({"pct_access": "mean","engagement_index": "mean" })

fig, ax = plt.subplots(1, 2, figsize=(16,4))

sns.barplot(x=local_df_agg.index, y="pct_access", data=local_df_agg, ax=ax[0], palette='GnBu')
sns.barplot(x=local_df_agg.index, y="engagement_index", data=local_df_agg, ax=ax[1], palette='GnBu')

plt.show()


In [None]:
demo_df =  merged_df[["pct_black/hispanic_mean", "pct_access","engagement_index"]]

demo_agg = demo_df.groupby("pct_black/hispanic_mean").agg({"pct_access": "mean","engagement_index": "mean" })

fig, ax = plt.subplots(1, 2, figsize=(16,4))

sns.barplot(x=demo_agg.index, y="pct_access", data=demo_agg, ax=ax[0], palette='GnBu')
sns.barplot(x=demo_agg.index, y="engagement_index", data=demo_agg, ax=ax[1], palette='GnBu')

plt.show()

In [None]:

state_pct_df =  merged_df[["state", "pct_access"]]
state_eng_df =  merged_df[["state", "engagement_index"]]

state_pct_agg = state_pct_df.groupby("state").agg({"pct_access": "mean" })
state_eng_agg = state_eng_df.groupby("state").agg({"engagement_index": "mean" })

state_pct_agg = state_pct_agg.sort_values(by=["pct_access"], ascending=False)
state_eng_agg = state_eng_agg.sort_values(by=["engagement_index"], ascending=False)

fig, ax = plt.subplots(figsize=(16,8))
sns.barplot(x="pct_access", y=state_pct_agg.index, data=state_pct_agg, palette='GnBu')
plt.show()


In [None]:
fig, ax = plt.subplots(figsize=(16,8))
sns.barplot(x="engagement_index", y=state_eng_agg.index, data=state_eng_agg, palette='GnBu')
plt.show()

In [None]:
px.histogram(cleaned_district, x='state', color="locale").update_xaxes(categoryorder='total ascending')

In [None]:
px.histogram(cleaned_district, x='state', color="pct_black/hispanic").update_xaxes(categoryorder='total ascending')

In [None]:
product_df = merged_df[['Provider/Company Name', 'Product Name', 'pct_access', 'engagement_index']]
product_df_agg = product_df.groupby('Product Name').agg({"pct_access": "mean", "engagement_index": "mean"})
product_pct_access_agg = product_df_agg[['pct_access']].sort_values(by="pct_access", ascending=False)
product_eng_agg =  product_df_agg[['engagement_index']].sort_values(by="engagement_index", ascending=False)

In [None]:
display(product_pct_access_agg.head())
display(product_eng_agg.head())

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(24,4))

sns.barplot(x=product_pct_access_agg.head().index, y="pct_access", data=product_pct_access_agg.head(), ax=ax[0], palette='GnBu')
sns.barplot(x=product_eng_agg.head().index, y="engagement_index", data=product_eng_agg.head(), ax=ax[1], palette='GnBu')

plt.show()

In [None]:
main_funcs = list(merged_df['main_fun'].unique())

# Two plots per row, rounding the number of rows up
row = (len(main_funcs) + 1) // 2
    
# squeeze=False keeps ax two-dimensional even when there is only one row
fig, ax = plt.subplots(row, 2, figsize=(24,16), squeeze=False)


temp_df = merged_df[['main_fun', 'engagement_index','Provider/Company Name', 'Product Name']]
for i, func in enumerate(main_funcs):
    temp_agg = temp_df[temp_df['main_fun'] == func].groupby('Product Name').agg({'engagement_index': "mean"})
    temp_agg = temp_agg.sort_values(by='engagement_index', ascending=False)
    
    ax[i // 2, i%2].set_title(f'Top 5 in \n{func}', fontsize=16)
    sns.barplot(x=temp_agg.head().index, y="engagement_index", data=temp_agg.head(), ax=ax[i // 2, i%2], palette='GnBu')

# Adjust the spacing once, after all subplots are drawn
fig.tight_layout()


In [None]:
sectors = list(merged_df['Sector(s)'].unique())
# Two plots per row, rounding the number of rows up
row = (len(sectors) + 1) // 2
    
# squeeze=False keeps ax two-dimensional even when there is only one row
fig, ax = plt.subplots(row, 2, figsize=(24,16), squeeze=False)


temp_df = merged_df[['Sector(s)', 'engagement_index', 'Product Name']]
for i, sec in enumerate(sectors):
    temp_agg = temp_df[temp_df['Sector(s)'] == sec].groupby('Product Name').agg({'engagement_index': "mean"})
    temp_agg = temp_agg.sort_values(by='engagement_index', ascending=False)
    
    ax[i // 2, i%2].set_title(f'Top 5 engaged products in  \n{sec} sector', fontsize=16)
    sns.barplot(x=temp_agg.head().index, y="engagement_index", data=temp_agg.head(), ax=ax[i // 2, i%2], palette='GnBu')

# Adjust the spacing once, after all subplots are drawn
fig.tight_layout()


In [None]:
merged_df['time'] =  pd.to_datetime(merged_df['time'])
import plotly.graph_objects as go

fig = go.Figure()

time_df = merged_df.copy()
time_df['month'] = pd.DatetimeIndex(time_df['time']).month
for fun in list(merged_df['main_fun'].unique()):
    agg_df = time_df[time_df['main_fun'] == fun].groupby('month').agg({"pct_access": "mean"})
    
    fig.add_trace(go.Scatter(x=agg_df.index, y=agg_df["pct_access"],
                        mode='lines',
                        name=fun))
fig.show()

In [None]:
merged_df['time'] =  pd.to_datetime(merged_df['time'])
import plotly.graph_objects as go

fig = go.Figure()

time_df = merged_df.copy()
time_df['month'] = pd.DatetimeIndex(time_df['time']).month
for fun in list(merged_df['main_fun'].unique()):
    agg_df = time_df[time_df['main_fun'] == fun].groupby('month').agg({"engagement_index": "mean"})
    
    fig.add_trace(go.Scatter(x=agg_df.index, y=agg_df["engagement_index"],
                        mode='lines',
                        name=fun))
fig.show()

In [None]:
merged_df['time'] =  pd.to_datetime(merged_df['time'])
import plotly.graph_objects as go

fig = go.Figure()

time_df = merged_df.copy()
time_df['month'] = pd.DatetimeIndex(time_df['time']).month
for sec in list(time_df['Sector(s)'].unique()):
    agg_df = time_df[time_df['Sector(s)'] == sec].groupby('month').agg({"pct_access": "mean"})
    
    fig.add_trace(go.Scatter(x=agg_df.index, y=agg_df["pct_access"],
                        mode='lines',
                        name=sec))
fig.show()

In [None]:
sub_df = merged_df[['main_fun', 'sub_fun']]
sub_df = sub_df.groupby(['main_fun', 'sub_fun']).size().reset_index(name='count')
fig = px.sunburst(sub_df, path=['main_fun', 'sub_fun'], values='count')
fig.show()

In [None]:
merged_df['time'] =  pd.to_datetime(merged_df['time'])

import plotly.graph_objects as go

fig = go.Figure()


vc_df = time_df[time_df['sub_fun'] == "Virtual Classroom"]
vc_products = list(vc_df['Product Name'].unique())

for prod in vc_products:
    
    agg_df = time_df[time_df['Product Name'] == prod].groupby('time').agg({"engagement_index": "mean"})
    
    fig.add_trace(go.Scatter(x=agg_df.index, y=agg_df["engagement_index"], mode='lines',name=prod))
fig.show()