In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# INTRODUCTION

The COVID-19 pandemic has shut schools across the world, leaving more than 1.2 billion children in 186 countries out of the classroom. As a result, education has changed dramatically, with a distinctive rise in e-learning, whereby teaching is undertaken remotely on digital platforms. Some research suggests that online learning increases retention of information and takes less time, meaning the changes the coronavirus has caused might be here to stay. Countries are at different points in their COVID-19 infection rates: in Denmark, children up to the age of 11 are returning to nurseries and schools after they initially closed on 12 March, while in South Korea students are responding to roll calls from their teachers online. With this sudden shift away from the classroom in many parts of the globe, some are wondering whether the adoption of online learning will persist post-pandemic, and how such a shift would impact the worldwide education market.
Note: Copied from the kaggle source



# PROBLEM STATEMENT

The COVID-19 pandemic has disrupted learning for more than 56 million students in the United States. In the spring of 2020, most states and local governments across the U.S. closed educational institutions to stop the spread of the virus. In response, schools and teachers have attempted to reach students remotely through distance learning tools and digital platforms. To this day, concerns about the worsening digital divide and long-term learning loss among America's most vulnerable learners continue to grow.
Vulnerable learners here are students who do not have access to these distance learning tools and digital platforms.
Note: Copied from the kaggle source

# DATA DESCRIPTION

We have provided a set of daily edtech engagement data from over 200 school districts in 2020, and we encourage you to leverage other publicly available data sources in your analysis. We include three basic sets of files to help you get started:

The engagement_data folder is based on LearnPlatform's Student Chrome Extension. The extension collects page load events of over 10K education technology products in our product library, including websites, apps, web apps, software programs, extensions, ebooks, hardware, and services used in educational institutions. The engagement data have been aggregated at school district level, and each file represents data from one school district.

The products_info.csv file includes information about the characteristics of the top 372 products with most users in 2020. 

The districts_info.csv file includes information about the characteristics of school districts, including data from NCES and FCC. 

The definitions of each column in the three data sets are detailed in the README file.

In addition to the files provided, we encourage you to use other public data sources such as COVID-19 US State Policy database, KIDS Count, and KFF.

Note: Copied from kaggle


# EXPLORATORY DATA ANALYSIS

The engagement data are aggregated at school district level: each file in the engagement_data folder holds the data for one school district. The district_id in the engagement data links to the district information in districts_info.csv, while the lp_id links to the product information in products_info.csv. The columns in the engagement data are described below.

>     time: Date in "YYYY-MM-DD" format.
>     lp_id: The unique identifier of the product, which can be used to link to product information.
>     pct_access: Percentage of students in the district with at least one page-load event for a given product on a given day.
>     engagement_index: Total page-load events per one thousand students for a given product on a given day.
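
As a worked illustration of the two metrics, here is a minimal sketch that derives them from raw counts. The district size and event counts are made up, and the helper names are hypothetical; the source does not show how LearnPlatform actually performs the aggregation.

```python
def pct_access(students_with_page_load: int, total_students: int) -> float:
    """Percentage of students with at least one page-load event."""
    return 100.0 * students_with_page_load / total_students


def engagement_index(page_loads: int, total_students: int) -> float:
    """Page-load events per 1,000 students."""
    return 1000.0 * page_loads / total_students


# Hypothetical district: 2,000 students, 150 of whom produced 900 page loads
print(pct_access(150, 2000))        # 7.5
print(engagement_index(900, 2000))  # 450.0
```

Note that the two metrics can move independently: a handful of heavy users raises the engagement index without changing the access percentage much.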

The district file districts_info.csv includes information about the characteristics of school districts, including data from NCES (2018-19), FCC (Dec 2018), and Edunomics Lab. In this data set, we removed the identifiable information about the school districts. We also used an open source tool ARX (Prasser et al. 2020) to transform several data fields and reduce the risks of re-identification. For generalization purposes, some data points are released as a range within which the actual value falls. Additionally, many values are missing and marked as 'NaN', indicating that the data was suppressed to maximize anonymization of the dataset.

| Name | Description |
|---|---|
| district_id | The unique identifier of the school district |
| state | The state where the district resides |
| locale | NCES locale classification that categorizes U.S. territory into four types of areas: City, Suburban, Town, and Rural. See Locale Boundaries User's Manual for more information. |
| pct_black/hispanic | Percentage of students in the districts identified as Black or Hispanic based on 2018-19 NCES data |
| pct_free/reduced | Percentage of students in the districts eligible for free or reduced-price lunch based on 2018-19 NCES data |
| county_connections_ratio | Ratio (residential fixed high-speed connections over 200 kbps in at least one direction / households) based on the county level data from FCC Form 477 (December 2018 version). See FCC data for more information. |
| pp_total_raw | Per-pupil total expenditure (sum of local and federal expenditure) from Edunomics Lab's National Education Resource Database on Schools (NERD$) project. The expenditure data are school-by-school, and we use the median value to represent the expenditure of a given school district. |
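
Because several district fields are released as ranges rather than exact values, it can help to turn each range into a single number before plotting or correlating. The sketch below assumes the ranges are stored as strings such as `"[0.2, 0.4["`; that encoding is an assumption here, not something the description above states, and `range_midpoint` is a hypothetical helper.

```python
import re

import numpy as np
import pandas as pd


def range_midpoint(value):
    """Return the midpoint of a range string such as "[0.2, 0.4[", or NaN."""
    if pd.isna(value):
        return np.nan
    # Pull out the two numeric bounds, ignoring the bracket characters
    numbers = re.findall(r"\d+(?:\.\d+)?", str(value))
    if len(numbers) != 2:
        return np.nan
    low, high = map(float, numbers)
    return (low + high) / 2


# Made-up sample values in the assumed format
sample = pd.Series(["[0.2, 0.4[", "[14000, 16000[", np.nan])
midpoints = sample.map(range_midpoint)
print(midpoints.tolist())
```

Using the midpoint is only one option; keeping the ranges as ordered categories avoids implying more precision than the anonymized data has.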

In [None]:
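import glob
import os

import pandas as pd

path = "../input/learnplatform-covid19-impact-on-digital-learning/engagement_data"
all_files = glob.glob(os.path.join(path, "*.csv"))

all_df = []
for f in all_files:
    df = pd.read_csv(f)
    # Each file is named after its district, so take the file name without
    # its extension instead of relying on a fixed position in the path
    df['district_id'] = os.path.splitext(os.path.basename(f))[0]
    all_df.append(df)

# Stack the per-district frames into a single DataFrame
engagement_df = pd.concat(all_df, ignore_index=True, sort=True)
engagement_df.head()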

We have imported all the engagement CSV files into a single dataframe. Let's load the other two files into dataframes and take a look.

In [None]:
products_df = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv")
products_df.head()

In [None]:
districts_df = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")
districts_df.head()
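
The linking described earlier (district_id to the district table, lp_id to the product table) can be sketched as two left merges. The frames below are small made-up stand-ins so the example runs on its own; only the key column names follow the data set.

```python
import pandas as pd

# Toy stand-ins for the three tables (all values are made up)
engagement = pd.DataFrame({
    "time": ["2020-01-01", "2020-01-01"],
    "lp_id": [101, 102],
    "pct_access": [1.5, 0.2],
    "engagement_index": [40.0, 3.1],
    "district_id": ["1000", "1000"],  # may be a string if taken from a file name
})
products = pd.DataFrame({"LP ID": [101, 102], "Product Name": ["A", "B"]})
districts = pd.DataFrame({"district_id": [1000], "state": ["Utah"]})

# Align the key dtypes first so both sides of the merge are integers
engagement["district_id"] = engagement["district_id"].astype(int)

merged = (
    engagement
    .merge(products, left_on="lp_id", right_on="LP ID", how="left")
    .merge(districts, on="district_id", how="left")
)
print(merged[["time", "Product Name", "state", "engagement_index"]])
```

A left merge keeps every engagement row even when the product is not among the listed products, which makes unmatched rows easy to spot as NaN.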

# CHECK DATATYPES

In [None]:
# A cell only displays its last expression, so print all three explicitly
print(engagement_df.dtypes, products_df.dtypes, districts_df.dtypes, sep="\n\n")

> In districts_df, district_id is an integer; all the other columns are objects.
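
Columns read in as objects usually need converting before any time-based analysis. As a minimal sketch on made-up rows, the date column (documented above as "YYYY-MM-DD") can be parsed like this:

```python
import pandas as pd

# Made-up rows that mimic the engagement layout
toy = pd.DataFrame({"time": ["2020-01-01", "2020-03-15"], "lp_id": [101.0, 102.0]})

# Parse the date strings so that date arithmetic and resampling work
toy["time"] = pd.to_datetime(toy["time"], format="%Y-%m-%d")
toy["month"] = toy["time"].dt.month
print(toy.dtypes)
```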

# % OF MISSING VALUES

In [None]:
import matplotlib.pyplot as plot

# Percentage of missing values per column
engagement_null = (engagement_df.isnull().sum()/len(engagement_df))*100
print(engagement_null)
# Draw a vertical bar chart; a Series plot ignores x/y, so label the axes directly
ax = engagement_null.plot.bar(rot=70, title="% of null values in engagement data")
ax.set_xlabel("column")
ax.set_ylabel("null values in %")
plot.show()

In [None]:
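# Percentage of missing values per column
products_null = (products_df.isnull().sum()/len(products_df))*100
print(products_null)
# Draw a vertical bar chart; a Series plot ignores x/y, so label the axes directly
ax = products_null.plot.bar(rot=70, title="% of null values in product data")
ax.set_xlabel("column")
ax.set_ylabel("null values in %")
plot.show()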

In [None]:
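# Percentage of missing values per column
districts_null = (districts_df.isnull().sum()/len(districts_df))*100
print(districts_null)
# Draw a vertical bar chart; a Series plot ignores x/y, so label the axes directly
ax = districts_null.plot.bar(rot=70, title="% of null values in district data")
ax.set_xlabel("column")
ax.set_ylabel("null values in %")
plot.show()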

> Engagement data: about 24% of engagement_index values (total page-load events per one thousand students for a given product on a given day) are missing.
>
> Product data: between 2% and 5% of values are missing.
>
> District data: missing values are much more common here, ranging from roughly 24% to 50%.
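
The three per-table checks above can also be folded into one summary table. This is a sketch only: `missing_pct` is a hypothetical helper, and the frames are small made-up stand-ins rather than the real data.

```python
import numpy as np
import pandas as pd


def missing_pct(frames: dict) -> pd.DataFrame:
    """Return one row per (table, column) with the percentage of missing values."""
    parts = []
    for name, frame in frames.items():
        pct = frame.isna().mean().mul(100).round(1)
        parts.append(pd.DataFrame({
            "table": name,
            "column": pct.index,
            "missing_pct": pct.values,
        }))
    # Most incomplete columns first
    return pd.concat(parts, ignore_index=True).sort_values("missing_pct", ascending=False)


toy = {
    "engagement": pd.DataFrame({
        "pct_access": [0.1, 0.2, 0.3, 0.4],
        "engagement_index": [1.0, np.nan, 3.0, 4.0],
    }),
    "districts": pd.DataFrame({"state": ["Utah", None], "locale": ["City", "Rural"]}),
}
summary = missing_pct(toy)
print(summary)
```

With the real data, the call would be `missing_pct({"engagement": engagement_df, "products": products_df, "districts": districts_df})`.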

# EDA IN PRODUCT DATA

In [None]:
products_df.head()
products_df = products_df.rename(columns={"LP ID": "lp_id", "Product Name": "product_name", "Provider/Company Name": "provider_company", 
                        "Sector(s)": "sector", "Primary Essential Function": "primary_essential_function" })
products_df.head()

## Duplicated

In [None]:
products_df.shape

In [None]:
products_df.info()

In [None]:
products_df.nunique()

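> The output above gives the number of unique values in each column. LP ID and Product Name are unique per product. Company Name is not unique per row, since the grouping in the next section counts several products for a single provider.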
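
Comparing unique counts with the row count is an indirect duplicate check. A more direct one, sketched here on a made-up frame that uses the renamed column names, relies on `DataFrame.duplicated`:

```python
import pandas as pd

# Made-up product rows; the last row repeats the third
toy_products = pd.DataFrame({
    "lp_id": [1, 2, 3, 3],
    "product_name": ["A", "B", "C", "C"],
    "provider_company": ["X", "X", "Y", "Y"],
})

full_row_dupes = toy_products.duplicated().sum()          # fully identical rows
id_dupes = toy_products.duplicated(subset="lp_id").sum()  # repeated identifiers
print(full_row_dupes, id_dupes)  # 1 1

# Keep the first occurrence of each identifier
deduped = toy_products.drop_duplicates(subset="lp_id", keep="first")
print(len(deduped))  # 3
```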

## Top 10 providers by number of products

In [None]:
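# Group by provider/company name and count the products of each provider
import matplotlib.pyplot as plt
import seaborn as sns

df = products_df[['provider_company','lp_id']].groupby('provider_company').count().sort_values(by='lp_id',ascending=False)
df = df.iloc[:10]  # keep the ten providers with the most products
_ = sns.barplot(x=df.lp_id, y=df.index)
plt.xlabel('number of products')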

In [None]:
freq = products_df.groupby(['sector','provider_company']).count().sort_values(by='lp_id',ascending=False)[:10]
freq

> Both outputs above show that Google LLC is the top provider company, and that its products mostly fall in the "PreK-12; Higher Ed; Corporate" sector.

In [None]:
# products_df['primary_essential_function'].unique()
freq = products_df.groupby(['primary_essential_function']).count()
freq.sort_values(by=['product_name'], ascending=False )[:10]

> The table above lists the most common primary essential functions among the products, such as digital learning platforms and sites, resources and reference.

In [None]:
# Count products per sector; summing lp_id would weight each sector by its ID values
products_df.groupby('sector')['lp_id'].count().plot(kind='pie', shadow=True, startangle=90,
figsize=(10,6), autopct='%1.1f%%')

> The chart above shows that PreK-12 contributes the largest share (roughly 48%), followed by "PreK-12; Higher Ed; Corporate".
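
The sector labels combine several sectors in one string, for example "PreK-12; Higher Ed; Corporate", so a pie chart of the raw labels treats each combination as its own category. A possible alternative, sketched on made-up rows and assuming ";" is the separator throughout, is to split the labels and count each sector separately:

```python
import pandas as pd

# Made-up product rows with combined sector labels
toy_products = pd.DataFrame({
    "lp_id": [1, 2, 3, 4],
    "sector": ["PreK-12", "PreK-12; Higher Ed; Corporate", "PreK-12; Higher Ed", None],
})

sector_counts = (
    toy_products.dropna(subset=["sector"])
    .assign(sector=lambda d: d["sector"].str.split(";"))  # one list of sectors per product
    .explode("sector")                                    # one row per (product, sector)
    .assign(sector=lambda d: d["sector"].str.strip())     # drop the space after each ";"
    .groupby("sector")["lp_id"].nunique()
    .sort_values(ascending=False)
)
print(sector_counts)
```

Counted this way, a product that serves three sectors contributes to all three, so the counts add up to more than the number of products.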