In [None]:
import pandas as pd
import glob
import os

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly as py
import plotly.graph_objs as go
import seaborn as sns

import matplotlib.pyplot as plt
import plotly.express as px

init_notebook_mode(connected=True) 

# Loading Datasets

<p>We first load the files needed for the analysis.</p>

## Engagement data
Each engagement file contains the aggregated engagement data for one district and is named after the district id. Each file has the following columns:

| Name | Description |
| :--- | :----------- |
| time | Date in "YYYY-MM-DD" format |
| lp_id | The unique identifier of the product |
| pct_access | Percentage of students in the district who have at least one page-load event for a given product on a given day |
| engagement_index | Total page-load events per one thousand students for a given product on a given day |


We load all the engagement files into a single dataframe and add the district id to each row for identification.
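To make the two metrics in the table above concrete, here is a minimal sketch of how they could be computed from a raw page-load log. The raw log is not part of this dataset, so `events` and `enrolled_students` are made-up values used only to illustrate the definitions.

```python
import pandas as pd

# Hypothetical page-load log for one district, one product, one day.
# Each row is a single page-load event by one student (made-up data).
events = pd.DataFrame({"student_id": [1, 1, 2, 3, 3, 3]})
enrolled_students = 10  # assumed district enrollment

# pct_access: share of enrolled students with at least one page load, in percent
pct_access = 100 * events["student_id"].nunique() / enrolled_students

# engagement_index: page-load events per 1,000 enrolled students
engagement_index = 1000 * len(events) / enrolled_students

print(pct_access, engagement_index)
```

Three of the ten students loaded a page, and together they produced six events, so the two metrics measure different things: reach versus intensity.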

In [None]:
all_files = glob.glob("../input/learnplatform-covid19-impact-on-digital-learning/engagement_data/*.csv")
engagments=[]
for path in all_files:
    try:
        engagement_frame=pd.read_csv(path)
    except FileNotFoundError:
        print("File Not Found:", path)
        continue  # skip this file instead of reusing the previous frame
    except Exception as err:
        print("Error Reading CSV:", path, err)
        continue
    # The file name without its extension is the district id
    engagement_frame["district_id"]=os.path.splitext(os.path.basename(path))[0]
    engagments.append(engagement_frame)
engagements_frame=pd.concat(engagments)

In [None]:
del engagments

In [None]:
engagements_frame.head()

## District data
The district file districts_info.csv includes information about the characteristics of school districts. Some identifiable information has been removed from this file to reduce the risk of identification. The file has the following columns:

| Name | Description |
| :--- | :----------- |
| district_id | The unique identifier of the school district |
| state | The state in which the district is located |
| locale | NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. See [Locale Boundaries User's Manual](https://eric.ed.gov/?id=ED577162) for more information. |
| pct_black/hispanic | Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data |
| pct_free/reduced | Percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data |
| county_connections_ratio | `ratio` (residential fixed high-speed connections over 200 kbps in at least one direction/households) based on the county level data from FCC Form 477 (December 2018 version). See [FCC data](https://www.fcc.gov/form-477-county-data-internet-access-services) for more information. |
| pp_total_raw | Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district. |

In [None]:
districts_info_frame=pd.read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')

In [None]:
districts_info_frame.head(3)

## Product Info 
The product file products_info.csv includes information about the characteristics of the top 372 products with the most users in 2020. The categories listed in this file are part of LearnPlatform's product taxonomy. The file has the following columns:


| Name | Description |
| :--- | :----------- |
| LP ID | The unique identifier of the product |
| URL | Web link to the specific product |
| Product Name | Name of the specific product |
| Provider/Company Name | Name of the product provider |
| Sector(s) | Sector of education where the product is used |
| Primary Essential Function | The basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled |

In [None]:
products_info_frame=pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv")

## Preprocessing the data
The data is processed as needed for each specific analysis throughout the notebook. This preserves as much information as possible, with changes made only where a particular scenario requires them.

# Analysis

In [None]:
def piePlot(data,title,legend):
    # Named "trace" so the matplotlib alias "plt" is not shadowed inside the function
    trace=go.Pie(labels=data.index,values=data.values)
    fig = go.Figure(data=trace)
    fig.update_layout(
        title=title,
        legend_title=legend,
        font=dict(
            family="Roboto, monospace",
            size=18,
            color="Black"
        )
    )

    py.offline.iplot(fig)
    fig.data=[]

## Locale Breakdown
Most of the districts are suburban, followed by rural, city, and then town districts.

In [None]:
data=districts_info_frame.groupby('locale')["district_id"].count()

piePlot(data,"Breakdown of locales of the districts","Locale")
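The counts behind the pie chart can also be read as shares. The sketch below uses a small made-up frame (`toy_districts` is a stand-in, not the real `districts_info_frame`) to show that `groupby(...).count()` and `value_counts(normalize=True)` both leave out districts whose locale is missing, so the pie chart only describes districts with a known locale.

```python
import pandas as pd

# Toy stand-in for districts_info_frame (values are made up)
toy_districts = pd.DataFrame({
    "district_id": [1, 2, 3, 4, 5],
    "locale": ["Suburb", "Suburb", "Rural", "City", None],
})

# Same aggregation as the cell above: districts per locale, missing locale dropped
counts = toy_districts.groupby("locale")["district_id"].count()

# Shares of districts with a known locale
shares = toy_districts["locale"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```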

## Racial identity of students Breakdown
Most of the districts fall in the 0-0.2 range (that is, 0% to 20%) of Black or Hispanic students, and the number of districts shrinks as the share of Black or Hispanic students goes up.

In [None]:
data=districts_info_frame.groupby('pct_black/hispanic')["district_id"].count()

piePlot(data,"Breakdown of demographics of the districts","pct_black/hispanic")
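The demographic columns hold ranges rather than exact values. As a minimal sketch, and assuming the ranges are stored as strings such as `[0, 0.2[` (an assumption about the file format; check it with `districts_info_frame['pct_black/hispanic'].unique()`), the bounds can be extracted and expressed as percentages like this:

```python
import pandas as pd

# Assumed bucket format, e.g. "[0, 0.2[" meaning 0 <= x < 0.2 (made-up sample)
buckets = pd.Series(["[0, 0.2[", "[0.2, 0.4[", "[0.8, 1[", None])

# Pull the two numbers out of each label; missing labels stay NaN
bounds = buckets.str.extract(r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*\[").astype(float)
bounds.columns = ["lower", "upper"]

# The bounds are fractions, so multiply by 100 to read them as percentages
bounds_pct = bounds * 100
print(bounds_pct)
```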

## Percentage of students eligible for free or reduced-price lunch Breakdown
This can serve as an indicator of the economic level of the students in these districts and is used as such in the following sections.

In [None]:
data=districts_info_frame.groupby('pct_free/reduced')["district_id"].count()

piePlot(data,"Breakdown of districts by the percentage of students eligible for free or reduced lunch","pct_free/reduced")

In [None]:
districts_info_frame['district_id'] = districts_info_frame['district_id'].astype('str')
engagements_frame['district_id'] = engagements_frame['district_id'].astype('str')

In [None]:
merged_data = pd.merge(engagements_frame, districts_info_frame, on='district_id')
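`pd.merge` performs an inner join by default, so engagement rows whose district has no entry in the district file are dropped without any message. A minimal sketch of how this could be checked, using made-up frames (`toy_engagement` and `toy_info` are stand-ins for the real dataframes):

```python
import pandas as pd

# Toy frames standing in for engagements_frame and districts_info_frame
toy_engagement = pd.DataFrame({"district_id": ["1000", "1000", "9999"],
                               "pct_access": [0.5, 0.7, 0.1]})
toy_info = pd.DataFrame({"district_id": ["1000", "2000"],
                         "locale": ["Suburb", "Rural"]})

# how="left" keeps every engagement row; indicator=True adds a "_merge" column;
# validate="many_to_one" raises if a district id is duplicated in toy_info
checked = pd.merge(toy_engagement, toy_info, on="district_id",
                   how="left", indicator=True, validate="many_to_one")

# Rows tagged "left_only" have no matching district record
unmatched = checked.loc[checked["_merge"] == "left_only", "district_id"].unique()
print(unmatched)
```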

In [None]:
merged_data=merged_data.dropna()

In [None]:
merged_data["time"]=pd.to_datetime(merged_data["time"])
# Last expression in the cell, so the per-column missing-value counts are displayed
merged_data.isna().sum()

In [None]:
districts_info_frame.isna().sum()

In [None]:
clean_districts=districts_info_frame.dropna()

## Breakdown of demographics overall

The plot shows the number of districts by demographic bucket, economic level, and locale. In every locale except cities, most districts fall in the 0-0.2 range of Black or Hispanic students. A very high share of Black or Hispanic students (0.8-1) often coincides with a higher share of students eligible for lunch assistance than in districts with lower shares. This points to a racial disparity in the economic status of the students.

In [None]:
data=clean_districts.groupby(['locale', 'pct_black/hispanic', 'pct_free/reduced'],as_index=False).count()
data["District Count"]=data["district_id"]
fig = px.sunburst(data, color_continuous_scale='Blues', path=['locale', 'pct_black/hispanic', 'pct_free/reduced'],width=750, height=750, values='District Count')
# fig.update_layout(uniformtext=dict(minsize=10))

fig.update_traces(textinfo="label+percent parent")
fig.update_layout(title="Breakdown of locale, pct_black/hispanic and pct_free/reduced in respective rings",
    font=dict(
        family="Roboto, monospace",
        size=14,
        color="Black"
    )
)
fig.show()

In [None]:
data=clean_districts.groupby(['pct_black/hispanic', 'pct_free/reduced'],as_index=False).count()
data["District Count"]=data["district_id"]
fig = px.sunburst(data, color_continuous_scale='Blues', path=[ 'pct_black/hispanic', 'pct_free/reduced'],color='pct_free/reduced',width=750, height=750, values='District Count')
# fig.update_layout(uniformtext=dict(minsize=10))

fig.update_traces(textinfo="label+percent parent")
fig.update_layout(title="Breakdown of pct_black/hispanic and pct_free/reduced in respective rings",
    font=dict(
        family="Roboto, monospace",
        size=14,
        color="Black"
    )
)

fig.show()
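The relationship between the two rings can also be put into a table. The sketch below uses made-up rows (`toy` is not the real `clean_districts`, and the bucket labels are assumed to look like `[0, 0.2[`) to show how `pd.crosstab` with `normalize="index"` gives, for each demographic bucket, the distribution of lunch-eligibility buckets.

```python
import pandas as pd

# Made-up districts; each row is one district
toy = pd.DataFrame({
    "pct_black/hispanic": ["[0, 0.2[", "[0, 0.2[", "[0, 0.2[", "[0.8, 1[", "[0.8, 1["],
    "pct_free/reduced":   ["[0, 0.2[", "[0, 0.2[", "[0.2, 0.4[", "[0.8, 1[", "[0.6, 0.8["],
})

# normalize="index" makes every row sum to 1
table = pd.crosstab(toy["pct_black/hispanic"], toy["pct_free/reduced"], normalize="index")
print(table.round(2))
```

Reading the table row by row makes it easier to compare buckets of very different sizes than reading slice widths in the sunburst.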

In [None]:
def lineTimePlot(column:str,value:str)->None:
    # Average only the plotted metric; a mean over every column fails on text columns in recent pandas
    line= merged_data.groupby([column,"time"],as_index=False)[value].mean()
    fig = px.line(line, x ='time', y =value, color=column, width=1000,facet_col_wrap=1)
#     fig.add_vrect(x0="2020-03-04", x1="2020-03-14",annotation_text="States Announce State of Emergency",fillcolor="yellow", opacity=0.25, annotation_position="top right", line_width=0)
    # x0 is the start and x1 the end of the closure band
    fig.add_vrect(x0="2020-03-16", x1="2020-04-02",annotation_text="States Close K-12 Public Schools",fillcolor="red", opacity=0.25, line_width=0, annotation_position="top left")
    fig.add_vrect(x0="2020-06-01", x1="2020-08-23",annotation_text="Summer Holidays",fillcolor="green", opacity=0.25, line_width=0, annotation_position="top right")
    
    fig.add_hline(y=line[value].mean(),annotation_position="top right",annotation_text="Average "+value )
    fig.update_xaxes(rangeslider_visible=True)
    fig.update_layout(
        title="Time Analysis of "+value+" by "+column,
        font=dict(
            family="Roboto, monospace",
            size=14,
            color="Black"
        )
    )
    fig.add_annotation(
        xref='x domain',
        x=0.5,
        yref='y domain',
        y=-0.4,
        font=dict(size=10),
        showarrow=False,
        text="State Closure Data taken from <a href='https://www.openicpsr.org/openicpsr/project/119446/version/V75/view;jsessionid=851ECB80E6CB42252D396C29564184DC'>COVID-19 US State Policy Database by openICPSR</a>")
    fig.add_annotation(
        xref='x domain',
        x=0.5,
        yref='y domain',
        y=-0.45,
        font=dict(size=10),
        showarrow=False,
        text="Holiday Data taken from  <a href='https://www.fcps.edu/sites/default/files/media/forms/19-20-standard-school-year-calendar.pdf'>FCPS 2019-2020 Standard School Year Calendar</a>. It is used as a source for an estimate")
    
    fig.show()
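The aggregation at the heart of `lineTimePlot` averages the chosen metric per group and per day. A minimal sketch on made-up data (`toy_merged` is a stand-in for `merged_data`) shows the shape of the result that is handed to `px.line`:

```python
import pandas as pd

# Made-up rows: one row per district, product and day
toy_merged = pd.DataFrame({
    "locale": ["City", "City", "Rural", "Rural"],
    "time": pd.to_datetime(["2020-03-02", "2020-03-02", "2020-03-02", "2020-03-03"]),
    "pct_access": [0.2, 0.4, 1.0, 0.8],
    "lp_id": [1, 2, 1, 1],
})

# One row per (locale, day) holding the mean of the plotted metric
daily = toy_merged.groupby(["locale", "time"], as_index=False)["pct_access"].mean()
print(daily)
```

Note that this is an unweighted mean over all products and districts in a group, so a group with many rarely used products will show a lower average than one with a few popular products.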


## Percent Access By Locale
The figure shows that rural locales have the highest access percentage, followed by towns, with cities performing the worst. Cities were also hit hardest by the state closure decisions: their access dropped sharply, while the other locales showed only a small decrease. Cities reach roughly half the access percentage of the best-performing locale (rural).

In [None]:
lineTimePlot("locale",'pct_access')
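Statements such as "hit the hardest" can be backed by a number. The sketch below compares the mean access before and after the start of the closure band drawn in `lineTimePlot` (2020-03-16). The data is made up and `toy` is only a stand-in for `merged_data`, so the figures illustrate the method and not the actual result.

```python
import pandas as pd

# Made-up daily values for two locales
toy = pd.DataFrame({
    "locale": ["City"] * 4 + ["Rural"] * 4,
    "time": pd.to_datetime(["2020-03-02", "2020-03-09", "2020-03-23", "2020-03-30"] * 2),
    "pct_access": [0.4, 0.4, 0.1, 0.1, 1.0, 1.0, 0.8, 0.8],
})

closure_start = pd.Timestamp("2020-03-16")  # start of the closure band in lineTimePlot
toy["period"] = toy["time"].ge(closure_start).map({False: "before", True: "after"})

# Mean access per locale and period, then the relative change between the two
summary = toy.pivot_table(index="locale", columns="period",
                          values="pct_access", aggfunc="mean")
summary["relative_change"] = summary["after"] / summary["before"] - 1
print(summary.round(2))
```

The same pattern works for any grouping column and for `engagement_index`.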

## Percentage access by demographics
Districts where 0.8-1 of the students are Black or Hispanic have a higher access percentage than the other groups, while districts in the 0.4-0.6 range performed the worst. The 0.8-1 districts were hit hardest by the state closures but rebounded strongly when schools reopened in the fall, reaching rates higher than before. The 0.4-0.6 districts were also heavily affected, though they too rebounded after the summer holidays.

In [None]:
lineTimePlot("pct_black/hispanic",'pct_access')

## Percentage access by pct_free/reduced
On average, districts where 0.8-1 of the students are eligible for free or reduced-price lunch have a higher access percentage than the other groups, while districts in the 0.6-0.8 range had the worst access percentages. The 0.8-1 districts were hit hardest by the state closures but rebounded when schools reopened in the fall, with rates only slightly lower than before.

In [None]:
lineTimePlot("pct_free/reduced",'pct_access')


## Engagement Index by locale
The pattern is similar to that of the access percentage: cities are the lowest and rural districts the highest, and the state closures hit cities the hardest.

In [None]:
lineTimePlot("locale",'engagement_index')


## Engagement Index by demographics
As with access, the figure shows that districts in the 0.8-1 range of Black or Hispanic students have a higher engagement index than the other groups, while districts in the 0.4-0.6 range performed the worst. The 0.8-1 districts were hit hardest by the state closures but rebounded strongly when schools reopened in the fall, with rates similar to before.

In [None]:
lineTimePlot("pct_black/hispanic",'engagement_index')


## Engagement Index by pct_free/reduced
On average, districts where 0.8-1 of the students are eligible for free or reduced-price lunch have a higher engagement index than the other groups, while districts in the 0.6-0.8 range had the lowest engagement index. The 0.8-1 districts were hit hardest by the state closures and rebounded when schools reopened in the fall, but never recovered to their pre-closure level.

In [None]:
lineTimePlot("pct_free/reduced",'engagement_index')


# To be continued with a spending breakdown and more specific engagement and access analysis