<h1 style='padding:30px; background-color:#5956D6; color:#fff; text-align:center;border-radius:15px'>COVID-19 Impact on Digital Learning</h1>

<center><img src="https://assets.euromoneydigital.com/dims4/default/e321b24/2147483647/strip/true/crop/780x355+0+0/resize/800x364!/quality/90/?url=http%3A%2F%2Feuromoney-brightspot.s3.amazonaws.com%2F50%2Fc4%2Fa213aa9d88223ccedf894858cdd4%2Fcorona-digital-780.jpg"></center>



<h2 style='padding:15px;font-size:20px; background-color:#5956D6; color:#fff'>Welcome</h2>
I highly appreciate you taking the time to read this notebook. 
In this notebook we are going to analyze the impact of COVID-19 on the digital education system.

* What is COVID-19 (just in case)?

Coronavirus disease 2019 (COVID-19) is an infectious disease caused by the novel coronavirus,
SARS-CoV-2, that appeared in late 2019. It is predominantly a respiratory illness that can affect
other organs. People with COVID-19 have reported a wide range of symptoms, ranging from
mild symptoms to severe illness. Symptoms may appear 2 to 14 days after exposure to the virus.
Symptoms may include: fever or chills; cough; shortness of breath; fatigue; muscle and body
aches; headache; new loss of taste or smell; sore throat; congestion or runny nose; nausea or
vomiting; diarrhea. [source](https://www.fda.gov/media/144637/download#:~:text=Coronavirus%20disease%202019%20(COVID,or%20vomiting%3B%20diarrhea.)

* A little background about the company

Education technology company **LearnPlatform** was founded in 2014 with a mission to expand equitable access to education technology for all students and teachers. LearnPlatform’s comprehensive edtech effectiveness system is used by districts and states to continuously improve the safety, equity, and effectiveness of their educational technology. LearnPlatform does so by generating an evidence basis for what’s working and enacting it to benefit students, teachers, and budgets.

* OK, then what?
Well, the core questions we expect to answer are:
1. What does digital learning look like in 2020?
2. How is digital learning affected by
    - District demographics
    - Broadband access
    - State/national policies

<h2 style='padding:15px;font-size:20px; background-color:#5956D6; color:#fff'>General Overview</h2>

- **What data are we going to use for the analysis?** 
    - **engagement_data**: each file represents data from one school district; the 4-digit file name is the district_id.
    - **products_info.csv**: includes information about the characteristics of the top 372 products with the most users in 2020.
    - **districts_info.csv**: includes information about the characteristics of school districts, including data from NCES and FCC.


- **How was some of the data collected?** 
    - The engagement datasets were collected through LearnPlatform's Chrome extension, which records student activity. The extension collects page load events of over 10K education technology products in LearnPlatform's product library, including websites, apps, web apps, software programs, extensions, ebooks, hardware, and services used in educational institutions. The engagement data have been aggregated at the school district level, and each file represents data from one school district.

     

<h2 style='padding:15px;font-size:20px; background-color:#5956D6; color:#fff'>NoteBook Structure and Flow</h2>

1. **EDA (Exploratory Data Analysis)**: this is not directly related to the project, but it will help us explore the dataset in more depth. We can also clean out the empty values that might affect our analysis. 


<h2 style='padding:15px;font-size:20px; background-color:#5956D6; color:#fff'>EDA ( Exploratory Data Analysis )</h2>

#### Important Modules

In [None]:
import os

import pandas as pd

# Plotting
import seaborn as sns
import matplotlib.pyplot as plt

#### Helper functions

In [None]:
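def read_csv(file_path):
    '''Return a data frame read from the CSV file at file_path, or None if the file is missing'''
    try:
        df = pd.read_csv(file_path, na_values=["n/a","null", 'NaN', ' ', "na", "undefined"])
        return df
    except FileNotFoundError:
        # only a missing file is reported here; other errors are raised as usual
        print(f"Error: file doesn't exist: {file_path}")
        return None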


def null_percentage(df, df_name=''):
    '''
    Display Total Null percentage of the Data Frame
    '''

    number_of_rows, number_of_columns = df.shape
    df_size = number_of_rows * number_of_columns

    null_size = (df.isnull().sum()).sum()
    percentage = round((null_size / df_size) * 100, 2)
    print(f"{df_name} data frame contains {percentage}% null values")


def get_column_with_many_null(df):
    '''
    Return List of Columns which contain more than 30% of null values
    '''
    df_size = df.shape[0]
    
    columns_list = df.columns
    bad_columns = []
    
    for column in columns_list:
        null_per_column = df[column].isnull().sum()
        percentage = round( (null_per_column / df_size) * 100 , 2)
        
        if(percentage > 30):
            bad_columns.append(column)
    
    return bad_columns


def clean_district(df):
    '''Drop rows without a state, then fill nulls in sparse columns with the column mode'''
    df = df.copy()
    # columns with more than 30% null values (measured before rows are dropped)
    bad_columns = get_column_with_many_null(df)
    
    df = df[df['state'].notna()].copy() # keep only rows that have a state value
    
    for col in bad_columns:
        the_mode = df[col].mode()[0]
        df[col] = df[col].fillna(the_mode)
    return df


def clean_product(df):
    '''Return Data frame with Not null Sector(s)'''
    return df[df['Sector(s)'].notna()]

In [None]:
## Plotting
def bar_plot(df, x_col, y_col, title=''):
    plt.figure(figsize=(20, 7))
    sns.barplot(data = df, x=x_col, y=y_col)
    plt.title(title, size=20)
    plt.xticks(rotation=75, fontsize=14)
    plt.yticks( fontsize=14)
    plt.xlabel(x_col, fontsize=16)
    plt.ylabel(y_col, fontsize=16)
    plt.show()
    
def count_plot(df, col, title, hue=None):
    plt.figure(figsize=(20, 7))
    sns.countplot(data = df, y=col, hue=hue, order=df[col].value_counts().index)
    plt.title(title, size=20)
    plt.xticks(rotation=75, fontsize=14)
    plt.yticks( fontsize=14)
    # the bars are horizontal (y=col), so the counts are on the x axis
    plt.xlabel("Count", fontsize=16)
    plt.ylabel(col, fontsize=16)
    plt.show()
    
def pie_plot(df, col, title=''):
    count = df[col].value_counts()
    plt.figure(figsize=(20, 7))
    plt.pie(list(count), labels=count.index, autopct='%1.2f%%')
    plt.title(title, size=20)
    plt.show()
    
    
def time_plot(df, x_col, y_col, title=''):
    plt.figure(figsize=(20, 7))
    sns.lineplot(data=df, x=x_col, y=y_col)
    plt.title(title, size=20)
    plt.xticks(rotation=75, fontsize=14)
    plt.yticks( fontsize=14)
    plt.xlabel(x_col, fontsize=16)
    plt.ylabel(y_col, fontsize=16)
    plt.show()

Reading Dataset Files

In [None]:
district_df = read_csv('../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')
product_df = read_csv('../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')

Null percentage of the district and product data

In [None]:
null_percentage(district_df, 'District')
null_percentage(product_df, 'Product')

Do we have columns with more than 30% null values?
Columns like these might not be much help for our analysis.

In [None]:
district_null_columns = get_column_with_many_null(district_df)
product_null_columns = get_column_with_many_null(product_df)

print("District Columns With Null Value more than 30%")
print(district_null_columns)
print()
print("Product Columns With Null Value more than 30%")
print(product_null_columns)
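
The per-column check above can also be written without an explicit loop. Here is a minimal sketch on a small made-up frame (`toy_df` and its values are illustrative assumptions, not rows from the dataset); `isnull().mean()` gives the fraction of nulls in each column.

```python
import pandas as pd

# Hypothetical toy frame standing in for district_df
toy_df = pd.DataFrame({
    "state": ["Utah", None, "Illinois", "Utah"],
    "locale": ["Suburb", "City", None, None],
    "district_id": [1, 2, 3, 4],
})

# Percentage of null values per column
null_pct = toy_df.isnull().mean() * 100

# Columns whose null share exceeds the 30% threshold used above
bad_columns = null_pct[null_pct > 30].index.tolist()

print(null_pct.round(2))
print(bad_columns)
```

Seeing the percentage for every column, not only the flagged ones, also makes it easier to judge whether 30% is a sensible cut-off.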

Clean Up The Data

In [None]:
district_df = clean_district(district_df)
product_df = clean_product(product_df)

<h2 style='padding:15px;font-size:20px; background-color:#5956D6; color:#fff'>Insight About District DataFrame</h2>

In [None]:
district_df.head()

In [None]:
# which school districts do we have in Illinois?
Illinois = district_df[district_df['state'] == 'Illinois']
Illinois

In [None]:
count_plot(district_df, 'state', 'State Distribution')

The three states with the most school districts are Connecticut, Utah, and Massachusetts.

In [None]:
count_plot(district_df, 'locale', 'Locale Distribution')
# pie_plot(district_df, 'locale', 'Locale Distribution')

In [None]:
count_plot(district_df, 'pct_black/hispanic', 'Black/Hispanic Distribution', hue='locale')

In [None]:
count_plot(district_df, 'pct_free/reduced', "Free or Reduced Distribution")

In most districts, 20% to 40% of students are eligible for free or reduced-price lunch, and in some districts the share is higher. So a substantial share of students appears to receive some form of aid through the school, such as a free lunch or a reduced payment.

In [None]:
count_plot(district_df, 'county_connections_ratio', 'Connection Ratio')

Almost all the values in this column are the same except for one, so we might consider dropping this column.

In [None]:
count_plot(district_df, 'pp_total_raw', 'Total Expenditure')

Most of our data comes from suburban districts.

<h2 style='padding:15px;font-size:20px; background-color:#5956D6; color:#fff'>Insight About Products DataFrame</h2>

In [None]:
product_df.head()

Which Google products do the students use?

In [None]:
result = product_df[product_df['Provider/Company Name'] == 'Google LLC']
result.head()

In [None]:
# count_plot(product_df, 'Product Name', 'Most Visited Urls')
result = product_df['Provider/Company Name'].value_counts()
result = result.head(10)
top_providers = pd.DataFrame({'Company': result.index, 'Count': result})
bar_plot(top_providers, "Count", "Company", title='Top 10 Providers/Companies')

In [None]:
product_df['Sector(s)'].value_counts()
count_plot(product_df, 'Sector(s)', 'Sector distribution')

In [None]:
def separate(text):
    '''Return the main category (the part before the first hyphen), e.g. "LC"'''
    if not isinstance(text, str):
        return text  # leave missing values untouched
    return text.split('-')[0].strip()

product_df['Primary Essential Function'].value_counts()
check_product_pdf = product_df.copy()
check_product_pdf['Function'] = check_product_pdf['Primary Essential Function'].apply(separate)
check_product_pdf.head()

In [None]:
pie_plot(check_product_pdf, 'Function', 'Most Used Functionalities')

The graph above shows the most used functionalities of the websites:
- LC :-> Learning & Curriculum
- CM :-> Classroom Management
- SDO :-> School & District Operations
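
The `separate` helper keeps only the main category. If the sub-category is also of interest, both parts can be kept with a vectorized split. This is a minimal sketch that assumes the values look like `MAIN - Sub category`; the sample strings below are illustrative, not quoted from the file.

```python
import pandas as pd

# Illustrative values; the real column may contain other categories
functions = pd.Series([
    "LC - Digital Learning Platforms",
    "CM - Classroom Engagement & Instruction",
    "SDO - Data, Analytics & Reporting",
    None,
])

# Split once, on the first hyphen only, so hyphens inside the sub-category survive
parts = functions.str.split("-", n=1, expand=True)
main = parts[0].str.strip()
sub = parts[1].str.strip()

print(pd.DataFrame({"main": main, "sub": sub}))
```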


<h2 style='padding:15px;font-size:20px; background-color:#5956D6; color:#fff'>Working on Engagement Dataset</h2>

Merge all the datasets into one dataset.

What is the state of digital learning in 2020?

This question is mainly about students' online activity and their engagement with the internet and the products during 2020.

In [None]:
def get_school_engagement(school_id):
    '''
    Return district data frame if school id is present.
    '''
    if(f"{school_id}.csv" in directories):
        file_path = f"{engagement_path}/{school_id}.csv"
        school = read_csv(file_path)
        return school
    else:
        print(f"{school_id} : Data Not found")
        return None

    
def get_avilable_district_engaement_files():
    '''
    Return list of engagement files, which are present in the district list
    '''
    total_districts = [str(x) for x in list(district_df.district_id)]

    files = []
    for file in directories:
        file_name = file.split('.')[0]
        if(file_name in total_districts):

            files.append(file)
            
    return files


def merge_engagement_files():
    '''Concatenate the engagement files of all available districts into one data frame'''
    files = get_avilable_district_engaement_files()
    
    engagement_df_files = []
    for file in files:
        school_id = int(file.split('.')[0])
        
        df = get_school_engagement(school_id)
        if df is None: # skip files that could not be read
            continue
        df['district_id'] = school_id
        
        engagement_df_files.append(df)
        
    return pd.concat(engagement_df_files)
    

Let us examine the engagement in one school district to get a feeling for the data.

In [None]:
engagement_path = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data'
directories = os.listdir(engagement_path)

Illinois_school_id = '8815' # school found in Illinois
illinois_school = get_school_engagement(Illinois_school_id)
illinois_school.head()

Above we see the engagement of district 8815 in the state of Illinois. To get a sense of the total engagement, we need to merge all districts together and get a general overview.

But before we do that, we need to make sure the engagement files we merge belong to districts that are still in our cleaned dataset, so we keep only those districts.

In [None]:
files = get_avilable_district_engaement_files()
print(f'Total Files in the engagement: {len(directories)}')
print(f'Total Files from districts : {len(files)}')
print(files)
    


In [None]:
concated_engagement = merge_engagement_files()
concated_engagement.head()

In [None]:
print(f"Total Shape of our Data set is {concated_engagement.shape}")

In [None]:
# Null Percentage
null_percentage(concated_engagement)
concated_engagement.isnull().sum()

Most null values come from engagement_index. Since our dataset is much larger than the number of null values, we can drop them for our current analysis.

In [None]:
null_removed_enagement = concated_engagement.dropna().copy()
null_percentage(null_removed_enagement)
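
Note that `dropna()` without arguments removes a row if any column is null. If only the engagement_index nulls should be removed, `subset` can limit the check to that column. This is a minimal sketch on made-up rows (the `toy` frame is an assumption for illustration).

```python
import pandas as pd

# Hypothetical rows standing in for concated_engagement
toy = pd.DataFrame({
    "lp_id": [1, 2, 3],
    "pct_access": [0.1, 0.2, None],
    "engagement_index": [5.0, None, 7.0],
})

# Drops every row that has a null in any column
all_dropped = toy.dropna()

# Drops only the rows where engagement_index is null
subset_dropped = toy.dropna(subset=["engagement_index"])

print(len(all_dropped), len(subset_dropped))
```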

In [None]:
null_removed_enagement['time'] = pd.to_datetime(null_removed_enagement['time'])

In [None]:
result = null_removed_enagement.groupby('time').agg({'engagement_index': 'mean', 'pct_access': 'mean', 'lp_id': 'count'}).\
            reset_index()

time_plot(result, "time", "engagement_index", title='Engagement Over Time')

Engagement drops from June through August. This is most likely because the school year ends in this period and students are on their break; they return to school in September.
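
One way to check the summer dip numerically is to average the engagement per calendar month. This is a minimal sketch with made-up daily values (the `toy` frame and its numbers are assumptions, not results from the dataset).

```python
import pandas as pd

# Hypothetical daily records standing in for null_removed_enagement
toy = pd.DataFrame({
    "time": pd.to_datetime([
        "2020-05-04", "2020-05-05", "2020-07-06", "2020-07-07", "2020-09-08",
    ]),
    "engagement_index": [120.0, 80.0, 10.0, 30.0, 150.0],
})

# Mean engagement per calendar month (1 = January, ..., 12 = December)
monthly = toy.groupby(toy["time"].dt.month)["engagement_index"].mean()

print(monthly)
```

On the real data, the same grouping would give one value per month that can be compared directly with the line plot.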

In [None]:
time_plot(result, "time", "pct_access", title='Percentage of Access Over Time')

To get more insights and analyze more data, we can merge the product data with our cleaned engagement data. This will give us a larger dataset to explore.

In [None]:
general_df = pd.merge(null_removed_enagement, product_df, left_on='lp_id', right_on='LP ID' )

In [None]:
general_df = pd.merge(general_df, district_df, left_on='district_id', right_on='district_id' )

In [None]:
general_df.head()

In [None]:
#Do we have null values ? 
null_percentage(general_df)
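
`pd.merge` performs an inner join by default, so engagement rows whose `lp_id` has no match in the product table are dropped silently. This is a minimal sketch of how one could count such rows before merging; the two small frames are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the engagement and product tables
engagement = pd.DataFrame({"lp_id": [1, 2, 3], "engagement_index": [5.0, 7.0, 9.0]})
products = pd.DataFrame({"LP ID": [1, 2], "Product Name": ["A", "B"]})

# A left join with indicator=True flags engagement rows without a product match
checked = pd.merge(engagement, products, left_on="lp_id", right_on="LP ID",
                   how="left", indicator=True)
unmatched = (checked["_merge"] == "left_only").sum()

print(f"{unmatched} of {len(checked)} engagement rows have no product match")
```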

<h2 style='padding:15px; color:#fff; background-color:#5956D6'>Engagement</h2>

How does the engagement with digital learning relate to factors such as district demographics, broadband access, and state/national level policies and events?

Which products were used most in the different locales?

In [None]:
result = general_df.groupby(['locale', 'Product Name']).agg({'time': 'count'})
result = result.reset_index()

def per_local(locale):
    local  =  result[result['locale'] == locale]
    
    new_df = pd.DataFrame({"Product Name": local['Product Name'], "time": local['time']})
    top_10 = new_df.sort_values(by='time', ascending=False).head(10)
    bar_plot(top_10, "Product Name", "time", title=f'Top Used application In {locale}')
    

for local in result.locale.unique():
    per_local(local)

Most locales show very similar application usage, with Google products in the lead.
In towns, however, there is also activity on Netflix.

Now, what does the engagement index tell us about each locale?

In [None]:
def calculate_engagement(col):
    result = general_df.groupby([col]).agg({'engagement_index': 'mean'})
    result = result.reset_index()
    return result

In [None]:
result = calculate_engagement('locale')
result

In [None]:
# import plotly.express as px
# # pie_plot(result, '')
# fig = px.bar_polar(result, r="engagement_index", theta="locale", 
#             color_discrete_sequence= px.colors.sequential.thermal)
# fig.show()
bar_plot(result, "locale", "engagement_index", title=f'Engagement Index per each locale')


Rural areas are more engaged than the others. Does this mean that being in a rural area increases online engagement? It could be the case, since most schools are located outside rural areas, so online learning could be the better means of education there.
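
A mean can be pulled up by a few very large values, so it may help to look at the median and the number of records next to it. This is a minimal sketch with made-up numbers (the `toy` frame is an assumption, not the notebook's result).

```python
import pandas as pd

# Hypothetical records; one very large rural value skews the rural mean
toy = pd.DataFrame({
    "locale": ["Rural", "Rural", "Rural", "Suburb", "Suburb", "Suburb"],
    "engagement_index": [10.0, 20.0, 1000.0, 50.0, 60.0, 70.0],
})

# Mean, median and record count per locale
summary = toy.groupby("locale")["engagement_index"].agg(["mean", "median", "count"])

print(summary)
```

In this toy example the rural mean is higher than the suburban mean while the rural median is lower, which is the kind of pattern worth ruling out before reading much into the bar plot.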

Which states are predominantly Black and Hispanic, and how does that affect the engagement?

In [None]:
general_df['pct_black/hispanic'].value_counts()

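So for this case I will take the districts where Black and Hispanic students make up more than half of the students (the `[0.6, 0.8[` and `[0.8, 1[` buckets) and see how their engagement compares to the rest, where Black and Hispanic students are the minority.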

In [None]:
majority = general_df[general_df['pct_black/hispanic'].isin(['[0.6, 0.8[', '[0.8, 1['])]
majority.head()
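
The bucket labels are strings, so the filter above has to list them by hand. As an alternative, the lower bound can be parsed into a number and compared against a threshold. This is a minimal sketch that assumes every label follows the `[low, high[` pattern; the `buckets` series is made up for illustration.

```python
import pandas as pd

# Bucket labels in the style used by pct_black/hispanic
buckets = pd.Series(["[0, 0.2[", "[0.6, 0.8[", "[0.8, 1[", None])

# Extract the lower bound of each bucket as a float (missing values stay NaN)
lower = buckets.str.extract(r"\[\s*([0-9.]+)\s*,", expand=False).astype(float)

# True where the bucket starts at 60% or above
is_majority = lower >= 0.6

print(is_majority.tolist())
```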

In [None]:
# Note: calculate_engagement() groups general_df, so this covers all districts, not only `majority`
majority_engagement = calculate_engagement('state')
bar_plot(majority_engagement, "state", "engagement_index", title='Engagement Index per State')

In [None]:
# number of records per state (counted on general_df, i.e. all districts)
majort_count = general_df.groupby('state').agg({'time':'count'})
majort_count = majort_count.reset_index()
majort_count = majort_count.sort_values(by='time', ascending=False)


bar_plot(majort_count, "state", "time", title='Number of Records per State')

The graphs above show one thing: Arizona has the smallest number of records but the highest engagement. In contrast, most of the data comes from Connecticut, but its engagement is low. So Connecticut may be less well suited to online education for Black and Hispanic students.
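
`calculate_engagement` groups `general_df`, so the per-state numbers above include every district. To compare states only on the majority Black and Hispanic districts, the grouping would have to run on the filtered frame. This is a minimal sketch with made-up rows (the `toy` frame and the names `majority_toy` and `per_state` are assumptions for illustration).

```python
import pandas as pd

# Hypothetical rows standing in for general_df
toy = pd.DataFrame({
    "state": ["Arizona", "Arizona", "Connecticut", "Connecticut"],
    "pct_black/hispanic": ["[0.6, 0.8[", "[0, 0.2[", "[0.8, 1[", "[0.6, 0.8["],
    "engagement_index": [200.0, 50.0, 80.0, 120.0],
})

# Keep only the majority buckets, then aggregate per state
majority_toy = toy[toy["pct_black/hispanic"].isin(["[0.6, 0.8[", "[0.8, 1["])]
per_state = (
    majority_toy.groupby("state")
    .agg(mean_engagement=("engagement_index", "mean"),
         records=("engagement_index", "count"))
    .reset_index()
)

print(per_state)
```

Having the mean and the record count in one table makes it easy to see when a high average rests on very few records.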