#  **Introduction**

Current research shows educational outcomes are far from equitable. The imbalance was exacerbated by the COVID-19 pandemic. There's an urgent need to better understand and measure the scope and impact of the pandemic on these inequities.

Education technology company LearnPlatform was founded in 2014 with a mission to expand equitable access to education technology for all students and teachers. LearnPlatform’s comprehensive edtech effectiveness system is used by districts and states to continuously improve the safety, equity, and effectiveness of their educational technology. LearnPlatform does so by generating an evidence basis for what’s working and enacting it to benefit students, teachers, and budgets.

The data and feature descriptions for this challenge are available on the Kaggle competition page.

**Business Need**

- What is the state of digital learning in 2020? 

- And how does the engagement of digital learning relate to factors such as district demographics, broadband access, and state/national level policies and events?



## Table of Contents
- Preprocess
    - District dataset
    - Product dataset
    - Engagement dataset
- Visualize
- Train model
- Conclusion

# EDA

In [None]:
# import the required libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")


# District dataset 

District information data

| Name | Description |
| :--- | :----------- |
| district_id | The unique identifier of the school district |
| state | The state where the district resides |
| locale | NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. See [Locale Boundaries User's Manual](https://eric.ed.gov/?id=ED577162) for more information. |
| pct_black/hispanic | Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data |
| pct_free/reduced | Percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data |
| county_connections_ratio | `ratio` (residential fixed high-speed connections over 200 kbps in at least one direction/households) based on the county level data from FCC Form 477 (December 2018 version). See [FCC data](https://www.fcc.gov/form-477-county-data-internet-access-services) for more information. |
| pp_total_raw | Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district. |



In [None]:
#load data
df_dist=pd.read_csv(r'/kaggle/input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')
df_dist.info()

In [None]:
# set up to view all the info of the columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [None]:
df_dist.sample(10)

In [None]:
def assess_NA(data):
    """
    Returns a pandas dataframe denoting the total number of NA values and the percentage of NA values in each column.
    The column names are noted on the index.
    
    Parameters
    ----------
    data: dataframe
    """
    # pandas series with the number of null values per column
    null_sum = data.isnull().sum()
    # sort counts and percentages from most to least missing
    total = null_sum.sort_values(ascending=False)
    percent = ( ((null_sum / len(data.index))*100).round(2) ).sort_values(ascending=False)
    
    # concatenate along the columns to create the complete dataframe
    df_NA = pd.concat([total, percent], axis=1, keys=['Number of NA', 'Percent NA'])
    
    # drop rows that don't have any missing data; omit if you want to keep all rows
    #df_NA = df_NA[ (df_NA.T != 0).any() ]
    
    return df_NA

In [None]:
assess_NA(df_dist)

From the table above, the columns pp_total_raw, pct_free/reduced, and county_connections_ratio each have more than 30% missing values. We could drop these columns, but since the dataset has only a few features, I will fill the missing values using the fillna() function instead.
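
The choice of fill strategy matters because forward fill copies the value of the previous row, so its result depends on row order. The sketch below contrasts forward fill with one possible alternative, filling with the most frequent bin inside each locale. The toy frame and the helper `fill_with_mode` are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the district table (illustrative values, not real data)
toy = pd.DataFrame({
    "locale": ["City", "Rural", "City", "Rural", "City"],
    "pct_free/reduced": ["[0.2, 0.4[", np.nan, "[0.2, 0.4[", "[0.6, 0.8[", np.nan],
})

# Forward fill: each gap takes the value of the row above it
filled_ffill = toy["pct_free/reduced"].ffill()

def fill_with_mode(s):
    """Hypothetical helper: fill gaps with the most frequent value of the group."""
    mode = s.mode()
    return s.fillna(mode.iloc[0]) if not mode.empty else s

# Alternative: fill gaps with the most common bin among districts of the same locale
filled_mode = toy.groupby("locale")["pct_free/reduced"].transform(fill_with_mode)

print(pd.DataFrame({"ffill": filled_ffill, "mode_by_locale": filled_mode}))
```

In this toy example the two strategies disagree on the missing rows, which is a reminder to check whether the row order of the real table carries any meaning before relying on forward fill.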


In [None]:
#unique values in districts columns
print(df_dist['pct_free/reduced'].nunique(dropna = True))
print(df_dist['pp_total_raw'].nunique(dropna = True))
print(df_dist['pct_black/hispanic'].nunique(dropna = True))
print(df_dist['county_connections_ratio'].nunique(dropna = True))
    

Observe that the interval "[a, b[" means a ≤ x < b for real numbers a and b.
The columns pp_total_raw, pct_free/reduced, and county_connections_ratio hold interval values. For instance, [0.2, 0.4[ in pct_free/reduced means that 20-40% of the students in the district are eligible for free or reduced-price lunch. For such columns I plan to replace each interval with a single value, the mean of its two endpoints a and b.
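
As a minimal sketch of this idea, the interval strings can be parsed with a regular expression instead of being listed one by one. The helper name `interval_midpoint` is hypothetical, and it uses the exact upper endpoint b rather than a value just below it, which is an assumption that differs slightly from the numbers used in the cells that follow.

```python
import re

import numpy as np
import pandas as pd

def interval_midpoint(value):
    """Return the midpoint of a string like '[0.2, 0.4[', or NaN if it does not parse."""
    if not isinstance(value, str):
        return np.nan  # missing values stay missing
    match = re.fullmatch(r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*\[", value.strip())
    if match is None:
        return np.nan
    low, high = float(match.group(1)), float(match.group(2))
    return (low + high) / 2

# Illustrative values in the same format as the district columns
toy = pd.Series(["[0.2, 0.4[", "[14000, 16000[", np.nan, "[0.18, 1["])
midpoints = toy.map(interval_midpoint)
print(midpoints)
```

Because the parser does not depend on a fixed list of bins, a bin that was overlooked during inspection is still converted instead of being left behind as a string.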

In [None]:
# midpoints for county_connections_ratio
# this column has two unique intervals, [0.18, 1[ and [1, 2[, so for the open
# upper bound I use numbers just below 1 and 2
# these midpoints will replace the intervals with single values
m=(0.18+0.999999)/2
m1=(1+1.999999)/2
m,m1


In [None]:
# change intervals to single values
# a single vectorised replace covers every row and avoids chained assignment
df_dist['county_connections_ratio'] = df_dist['county_connections_ratio'].replace(
    {'[0.18, 1[': m, '[1, 2[': m1}
)

print(df_dist['county_connections_ratio'])
df_dist['county_connections_ratio'].nunique()

In [None]:
# midpoint for each percentage bin (upper bound taken just below the open end)
pct_midpoints = {
    '[0, 0.2[': (0+0.1999)/2,
    '[0.2, 0.4[': (0.2+0.3999)/2,
    '[0.4, 0.6[': (0.4+0.5999)/2,
    '[0.6, 0.8[': (0.6+0.7999)/2,
    '[0.8, 1[': (0.8+0.9999)/2,
}
# the same bins apply to both columns, including '[0.6, 0.8[' for pct_black/hispanic
for col in ['pct_black/hispanic', 'pct_free/reduced']:
    df_dist[col] = df_dist[col].replace(pct_midpoints)

In [None]:
# pp_total_raw bins are 2000 wide; map each observed bin to its midpoint
pp_lower_bounds = [4000, 6000, 8000, 10000, 12000, 14000, 16000, 18000, 20000, 22000, 32000]
pp_midpoints = {f'[{low}, {low + 2000}[': (low + low + 1999.999)/2 for low in pp_lower_bounds}
df_dist['pp_total_raw'] = df_dist['pp_total_raw'].replace(pp_midpoints)

In [None]:
print(df_dist.isna().sum())

In [None]:
# fill missing values with forward fill (each gap takes the previous row's value)
df_dist=df_dist.ffill()
print(df_dist.isna().sum())
df_dist.head(20)

In [None]:
df1=df_dist.drop("state",axis=1)


In [None]:
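fig, axes = plt.subplots(2, 3, figsize=(10, 6))

# one pie per locale: average the numeric features within each locale first,
# otherwise the loop walks over every district row and runs past the 2x3 grid
num_cols = ['pct_black/hispanic', 'pct_free/reduced', 'county_connections_ratio', 'pp_total_raw']
locale_means = df1[num_cols].apply(pd.to_numeric, errors='coerce').groupby(df1['locale']).mean()

for ax, (idx, row) in zip(axes.ravel(), locale_means.iterrows()):
    row = row[row.gt(row.sum() * .01)]  # keep slices above 1% of the total
    ax.pie(row, labels=row.index, startangle=30)
    ax.set_title(idx)

# hide the axes that have no locale to show
for ax in axes.ravel()[len(locale_means):]:
    ax.set_visible(False)

fig.subplots_adjust(wspace=.2)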

In [None]:
for col in df_dist.columns:
    sb.displot(df_dist[col])
    plt.show()

In [None]:
# import only scatter_matrix; aliasing pandas.plotting as plt would shadow matplotlib's plt
from pandas.plotting import scatter_matrix

scatter_matrix(df_dist)

# Product dataset 

| Name | Description |
| :--- | :----------- |
| LP ID| The unique identifier of the product |
| URL | Web Link to the specific product |
| Product Name | Name of the specific product |
| Provider/Company Name | Name of the product provider |
| Sector(s) | Sector of education where the product is used |
| Primary Essential Function | The basic function of the product. There are two layers of labels here. Products are first labeled as one of these three categories: LC = Learning & Curriculum, CM = Classroom Management, and SDO = School & District Operations. Each of these categories has multiple sub-categories with which the products were labeled |


In [None]:
# Product dataset 
df_prod=pd.read_csv(r'../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv')
df_prod.info()

In [None]:
df_prod.sample(10)

# Engagement dataset information

The engagement data are aggregated at the school district level, and each file in the folder `engagement_data` represents data from one school district. The 4-digit file name is the `district_id`, which can be used to link to district information in `districts_info.csv`. The `lp_id` can be used to link to product information in `products_info.csv`.

| Name | Description |
| :--- | :----------- |
| time | date in "YYYY-MM-DD" |
| lp_id | The unique identifier of the product |
| pct_access | Percentage of students in the district that have at least one page-load event of a given product on a given day |
| engagement_index | Total page-load events per one thousand students of a given product and on a given day |
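
Since the district identifier lives only in the file name, it has to be attached to the rows while loading, or the link to the district table is lost after concatenation. The following is a minimal sketch of that step. The helper `load_engagement` is hypothetical, it assumes every file name is an integer district id, and the demonstration files are made up rather than taken from the competition data.

```python
import glob
import os
import tempfile

import pandas as pd

def load_engagement(folder):
    """Read every CSV in `folder` and tag its rows with the district_id from the file name."""
    frames = []
    for filename in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        df = pd.read_csv(filename)
        # '1000.csv' -> 1000 (assumes the file name is the integer district id)
        df["district_id"] = int(os.path.splitext(os.path.basename(filename))[0])
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

# Tiny demonstration with made-up files
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"time": ["2020-01-01"], "lp_id": [1], "pct_access": [0.5],
                  "engagement_index": [10.0]}).to_csv(os.path.join(tmp, "1000.csv"), index=False)
    pd.DataFrame({"time": ["2020-01-02"], "lp_id": [2], "pct_access": [0.1],
                  "engagement_index": [3.0]}).to_csv(os.path.join(tmp, "2000.csv"), index=False)
    demo = load_engagement(tmp)

print(demo)
```

On Kaggle the same function could be pointed at the `engagement_data` folder; the resulting `district_id` column is what makes a later join with the district table possible.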


In [None]:
import pandas as pd
import glob

path = r'/kaggle/input/learnplatform-covid19-impact-on-digital-learning/engagement_data' # use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

df_eng = pd.concat(li, axis=0, ignore_index=True)

In [None]:
df_eng.sample(10)

In [None]:
df_eng.info()

In [None]:
# list all unique values of each column
for col in df_dist:
    print(df_dist[col].unique())

In [None]:
df_dist['pp_total_raw'].nunique()

In [None]:
df_dist.columns

# Visualizing district data

In [None]:
df_dist.plot(x="locale", y=["pct_black/hispanic", "pct_free/reduced", "county_connections_ratio"], kind="bar",figsize=(9,8))
plt.show()

In [None]:
df_prod.head()

Observe that the three datasets are related: the district_id values in the district dataset are the file names of the engagement datasets, and the product id (lp_id in the engagement data, LP ID in the product data) is shared by the product and engagement datasets. I therefore decided to merge the three into a single dataset and process that.
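
A key-based join can silently lose or duplicate rows, so it may help to check it on a small example first. The sketch below uses made-up miniature tables and assumes the engagement rows already carry a `district_id` column, which is not the case in the raw files. It relies on the `indicator` and `validate` options of `DataFrame.merge`.

```python
import pandas as pd

# Made-up miniature tables; column names follow the dataset descriptions above
eng = pd.DataFrame({"district_id": [1000, 1000, 2000], "lp_id": [1, 2, 9],
                    "engagement_index": [10.0, 4.0, 7.0]})
prod = pd.DataFrame({"LP ID": [1, 2], "Product Name": ["A", "B"]})
dist = pd.DataFrame({"district_id": [1000, 2000], "state": ["Utah", "Ohio"]})

# Left join keeps every engagement row; indicator=True records whether a product matched
merged = eng.merge(prod, left_on="lp_id", right_on="LP ID", how="left", indicator=True)
unmatched_products = int((merged["_merge"] == "left_only").sum())

# validate="m:1" raises if district_id is not unique in the district table
merged = merged.drop(columns="_merge").merge(dist, on="district_id", how="left", validate="m:1")

print(unmatched_products)
print(merged)
```

Counting the `left_only` rows shows how many engagement records refer to a product that is absent from the product table, which is worth knowing before any missing values are filled.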

In [None]:
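# link engagement rows to product info through the shared product id;
# pd.concat would only stack the three tables row-wise instead of joining them
data=df_eng.merge(df_prod, left_on='lp_id', right_on='LP ID', how='left')
# drop the duplicated key column 'LP ID'
data=data.drop(columns='LP ID')
# attach district info when the engagement rows carry a district_id column
if 'district_id' in data.columns:
    data=data.merge(df_dist, on='district_id', how='left')
# fill missing using ffill
data=data.ffill()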


In [None]:
data.isna()

In [None]:
# histograms of the variables
for col in df_dist.columns:
    fig = df_dist[col].hist(xlabelsize=6, ylabelsize=6)
    # show the plot
    plt.show()

In [None]:
import seaborn as sb
# plot the first six columns by name; integer positions are not valid column labels here
for col in df_dist.columns[:6]:
    sb.histplot(df_dist[col])
    plt.show()


In [None]:
df_dist['state'].nunique()

In [None]:
df_eng.columns

In [None]:
df_dist[df_dist.district_id.duplicated()].shape[0]

In [None]:
df_dist.dropna(how="all")

In [None]:
#dealing with missing values


In [None]:
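def save_clean(df, path='../data/clean_data.csv'):
    """Save the cleaned dataframe to CSV; log a message instead of raising on I/O errors."""
    try:
        df.to_csv(path, index=False)
    except OSError:
        print('Log: Error while Saving File')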

In [None]:
df_eng=pd.read_csv(r'/kaggle/input/learnplatform-covid19-impact-on-digital-learning/engagement_data/6345.csv')
df_eng.head(4)
df_eng.info()

In [None]:
df=pd.read_csv(r'/kaggle/input/learnplatform-covid19-impact-on-digital-learning/engagement_data/5882.csv')
df.head()

In [None]:
import pandas as pd
import glob

path = r'/kaggle/input/learnplatform-covid19-impact-on-digital-learning/engagement_data' # use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

In [None]:
frame.describe()

In [None]:
frame.info()

# Districts

In [None]:
#load and see sample district data info
df_dist=pd.read_csv(r'../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv')
df_dist.head()

In [None]:
df_dist.info()

In [None]:
def with_null_column(df):
    '''
    Return List of Columns which contain more than 30% of null/missing values
    '''
    df_size = df.shape[0]
    
    columns_list = df.columns
    columns_null = []
    
    for column in columns_list:
        null_per_column = df[column].isnull().sum()
        percentage = round( (null_per_column / df_size) * 100 , 2)
        
        if(percentage > 30):
            columns_null.append(column)
    
    return columns_null

In [None]:
with_null_column(df_dist)