# LearnPlatform COVID-19 Impact on Digital Learning
Use digital learning data to analyze the impact of COVID-19 on student learning

* Challenge
We challenge the Kaggle community to explore (1) the state of digital learning in 2020 and (2) how the engagement of digital learning relates to factors such as district demographics, broadband access, and state/national level policies and events.

* We encourage you to guide the analysis with questions related to the themes described above. Below are some examples of questions that relate to our problem statement:
    * What is the picture of digital connectivity and engagement in 2020?
    * What is the effect of the COVID-19 pandemic on online and distance learning, and how might this also evolve in the future?
    * How does student engagement with different types of education technology change over the course of the pandemic?
    * How does student engagement with online learning platforms relate to different geography? Demographic context (e.g., race/ethnicity, ESL, learning disability)? Learning context? Socioeconomic status?
    * Do certain state interventions, practices or policies (e.g., stimulus, reopening, eviction moratorium) correlate with an increase or decrease in online engagement?

## Note: This notebook is about learning what data we have, with a little cleaning and re-arranging of the items. Do stay tuned to see more of it. I will first look at the individual datasheets and then try to merge them to start answering the questions asked above.
## Also, please comment or give suggestions on how to improve it further. Thank you.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


# Plotting and visualisations
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py
import plotly.graph_objs as go
import plotly.express as px 
from plotly.offline import download_plotlyjs,init_notebook_mode, iplot
import plotly.tools as tls 
import plotly.figure_factory as ff 
py.init_notebook_mode(connected=True)
# ----------------------- #
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Products
products = pd.read_csv("/kaggle/input/learnplatform-covid19-impact-on-digital-learning/products_info.csv")

# Districts
districts = pd.read_csv("/kaggle/input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")

In [None]:
# Check column types and missing values, if any
# (info() prints its report and returns None, so it is called on its own)
products.info()
display(products.head(), products.isnull().sum())

### Some data points are missing here, but there is still a lot to work with.

In [None]:
# info() prints its report and returns None, so it is called on its own
districts.info()
display(districts.head(), districts.isnull().sum())

### Here too some data points are missing, and a little more work is needed.
### The NaN values need to be removed, but we will deal with that later.

In [None]:
products['Product Name'].unique()

## Check missing values as percentages

In [None]:
# Plot the non-missing count per column, labelled with the percentage missing
def missing_plot(dataset, key):
    n_rows = len(dataset[key])
    null_counts = dataset.isnull().sum()
    non_null = (n_rows - null_counts).to_frame('Count')
    percentage_null = (null_counts / n_rows * 100).round(2).to_frame('Count')

    # Bar height = non-missing values; bar label = % missing
    trace = go.Bar(x=non_null.index, y=non_null['Count'], opacity=0.8,
                   text=percentage_null['Count'], textposition='auto',
                   marker=dict(color='#7EC0EE', line=dict(color='#000000', width=1.5)))

    layout = dict(title="Non-missing values per column (label: % missing)")

    fig = dict(data=[trace], layout=layout)
    py.iplot(fig)

In [None]:
missing_plot(products,'Product Name')

In [None]:
missing_plot(districts,'district_id')
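
The same information can be read as a plain table, which is easier to scan than the bar labels. This is a minimal sketch: `missing_summary` is a hypothetical helper, and the toy frame stands in for `products` or `districts` with made-up values.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    missing = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "pct_missing": (missing / len(df) * 100).round(2),
    })
    # Columns with the most missing data come first
    return summary.sort_values("pct_missing", ascending=False)

# Toy frame (made-up values) standing in for products/districts
toy = pd.DataFrame({
    "a": [1, None, 3, None],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(districts)` would give one row per column.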

## We definitely have to look into the districts dataframe, but we will do that later. First, let's go through the products table.

In [None]:

# Horizontal bar chart of how often each value of `column` occurs
def target_count(data, column):
    # Take labels and counts from the same value_counts() result so they stay aligned
    # (unique() returns values in order of appearance, not in order of frequency)
    counts = data[column].value_counts()
    trace = go.Bar(x=counts.values.tolist(),
                   y=counts.index.tolist(),
                   orientation='h',
                   text=counts.values.tolist(),
                   textfont=dict(size=20),
                   textposition='auto',
                   opacity=0.5,
                   marker=dict(line=dict(color='#000000', width=1.5)))
    layout = dict(title="EDA of {} column".format(column),
                  autosize=True, height=800)
    fig = dict(data=[trace], layout=layout)

    py.iplot(fig)

# --------------- donut chart to show their percentage -------------------- #

def target_pie(data, column):
    # Labels and values come from the same value_counts() result so they stay aligned
    counts = data[column].value_counts()
    trace = go.Pie(labels=counts.index.tolist(), values=counts.values.tolist(),
                   textfont=dict(size=15),
                   opacity=0.5,
                   marker=dict(line=dict(color='#000000', width=1.5)),
                   hole=0.6)

    layout = dict(title="Donut chart showing the percentage share of each value")
    fig = dict(data=[trace], layout=layout)
    py.iplot(fig)


In [None]:
#target_count(products,'Product Name')
#target_pie(products,'Product Name')

### 'Product Name' only tells us about individual products, but looking at 'Provider/Company Name' is beneficial because a company can have several different products, so we will look at that next.

In [None]:
products['Product Name'].value_counts()

In [None]:
# Check the frequency of each 'Product Name'
freq = products.groupby(['Product Name']).count() 
freq

### Note: Google, Houghton, Microsoft, Learning A-Z, etc. are the providers with the most products in the list, as seen below.

In [None]:
#products['Provider/Company Name'].unique()

freq = products['Provider/Company Name'].value_counts()[:20]
freq

### Let's explore more by regrouping the data by 'Provider/Company Name'

In [None]:
freq = products.groupby(['Provider/Company Name']).count()
freq.sort_values(by=['Product Name'], ascending=False )[:10]

In [None]:
target_pie(freq,'Product Name')

## The table and donut chart above show how providers are distributed by the number of products they list. They count products per company, not the number of people who used them.
* ### About 89% of providers list a single product.
* ### Google LLC's slice accounts for 0.69%.
* ### About 6.21% of providers, such as Houghton, Microsoft, and Learning A-Z, list 3 or more products.
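
The percentages in such a donut chart can be reproduced without plotting by counting products per provider and then counting how many providers share each product count. The sketch below uses a made-up toy frame; only the column names come from the notebook.

```python
import pandas as pd

# Toy product list (made-up values): provider A has 3 products, B, C, D have 1 each
toy_products = pd.DataFrame({
    "Provider/Company Name": ["A", "A", "A", "B", "C", "D"],
    "Product Name": ["a1", "a2", "a3", "b1", "c1", "d1"],
})

# Step 1: number of products listed by each provider
products_per_provider = toy_products["Provider/Company Name"].value_counts()

# Step 2: share of providers (in %) for each product count
share = (
    products_per_provider.value_counts(normalize=True) * 100
).round(2).sort_index()
print(share)
```

Here the index of `share` is the number of products and the value is the percentage of providers with that many products.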

## Let's also see which sectors the products belong to.

In [None]:
#Sector(s)
products['Sector(s)'].unique()

In [None]:
freq = products.groupby(['Sector(s)','Provider/Company Name']).count()
freq.sort_values(by=['Product Name'], ascending=False )

* ## Google LLC mainly covers the combined sector 'PreK-12; Higher Ed; Corporate'

## Below I divide the data on the basis of sector and then group it by 'Provider/Company Name' to learn more.
## This shows which companies are focusing on which sectors.

In [None]:
pk = products[products['Sector(s)'] == 'PreK-12']

pk = pk.groupby(['Provider/Company Name']).count()
pk.sort_values(by=['Product Name'], ascending=False )[:5]

### Note: PreK-12 is covered mainly by Learning A-Z, Curriculum Associates, The College Board, IXL Learning, and Houghton Mifflin Harcourt

In [None]:
pkH = products[products['Sector(s)'] == 'PreK-12; Higher Ed']

pkH = pkH.groupby(['Provider/Company Name']).count()
pkH.sort_values(by=['Product Name'], ascending=False )[:5]

### Note: McGraw-Hill PreK-12, Houghton Mifflin Harcourt, and Google LLC cover the 'PreK-12; Higher Ed' sector

In [None]:
pkH = products[products['Sector(s)'] == 'PreK-12; Higher Ed; Corporate']

pkH = pkH.groupby(['Provider/Company Name']).count()
pkH.sort_values(by=['Product Name'], ascending=False )[:5]

### Note: Google LLC; Microsoft; Autodesk, Inc; Adobe Inc.; and ZOOM VIDEO COMMUNICATIONS, INC cover the 'PreK-12; Higher Ed; Corporate' sector

In [None]:
pkH = products[products['Sector(s)'] == 'Corporate']

pkH = pkH.groupby(['Provider/Company Name']).count()
pkH.sort_values(by=['Product Name'], ascending=False )

In [None]:
pkH = products[products['Sector(s)'] == 'Higher Ed; Corporate']

pkH = pkH.groupby(['Provider/Company Name']).count()
pkH.sort_values(by=['Product Name'], ascending=False )

### Note: 'Higher Ed; Corporate' is covered by Weebly and 'Corporate' by Qualtrics
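
Filtering on the exact string (for example `'PreK-12; Higher Ed'`) treats every combination as its own sector. An alternative, sketched here on made-up rows, is to split the `Sector(s)` value on `;` and count each individual sector once per provider. The `sector` column name is an assumption introduced for this example.

```python
import pandas as pd

# Toy rows (made-up providers) using the sector strings seen above
toy_products = pd.DataFrame({
    "Provider/Company Name": ["A", "B", "C"],
    "Sector(s)": ["PreK-12", "PreK-12; Higher Ed", "PreK-12; Higher Ed; Corporate"],
})

# One row per (product, individual sector)
sectors = (
    toy_products
    .dropna(subset=["Sector(s)"])
    .assign(sector=lambda d: d["Sector(s)"].str.split(";"))
    .explode("sector")
)
sectors["sector"] = sectors["sector"].str.strip()

# Number of distinct providers active in each individual sector
providers_per_sector = sectors.groupby("sector")["Provider/Company Name"].nunique()
print(providers_per_sector)
```

This answers "who is present in Higher Ed at all" rather than "who sits in this exact combination", so both views can be useful.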

## Let's also look at which company is focusing on which 'Primary Essential Function'

In [None]:
products['Primary Essential Function'].unique()
freq = products.groupby(['Provider/Company Name','Primary Essential Function']).count()

freq.sort_values(by=['Product Name'], ascending=False )[:10]

## The top company and 'Primary Essential Function' pairs are listed below. They show where each company concentrates its products, which hints at what each company's products are best suited for.
* #### Google LLC: LC/CM/SDO - Other
* #### Google LLC: LC - Content Creation & Curation
* #### Learning A-Z: LC - Courseware & Textbooks
* #### IXL Learning: LC - Digital Learning Platforms
* #### Houghton Mifflin Harcourt: LC - Courseware & Textbooks
* #### Curriculum Associates: LC - Digital Learning Platforms
* #### Google LLC: LC - Sites, Resources & Reference
* #### Google LLC: CM - Classroom Engagement & Instruction - Communication & Messaging
* #### Dictionary.com: LC - Sites, Resources & Reference - Thesaurus & Dictionary
* #### The College Board: LC - Study Tools - Test Prep & Study Skills

## Let's also look at it the other way around, starting from 'Primary Essential Function'

In [None]:
products['Primary Essential Function'].unique()
freq = products.groupby(['Primary Essential Function']).count()
freq.sort_values(by=['Product Name'], ascending=False )

## We can see that most products in the list fall under:
* ### LC - Digital Learning Platforms
* ### LC - Sites, Resources & Reference
* ### LC - Content Creation & Curation
* ### LC - Study Tools
## 'Content Creation & Curation' ranks high as well, possibly because people were locked in at home. That is an interesting thing to see.
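
The function labels carry a prefix (`LC`, `CM`, `LC/CM/SDO`) followed by a more specific name. Splitting on the first `" - "` separates the two, so counts can be taken per main category. This is a sketch on a few label strings copied from the lists above; `main_category` and `sub_category` are assumed names.

```python
import pandas as pd

# A few label strings as they appear in the notebook output
functions = pd.Series([
    "LC - Digital Learning Platforms",
    "LC - Study Tools - Test Prep & Study Skills",
    "CM - Classroom Engagement & Instruction - Communication & Messaging",
    "LC/CM/SDO - Other",
], name="Primary Essential Function")

# Split only on the first " - " so deeper levels stay in the sub-category
parts = functions.str.split(" - ", n=1, expand=True)
parts.columns = ["main_category", "sub_category"]

counts = parts["main_category"].value_counts()
print(parts)
print(counts)
```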

## Now we can also see which function has the most products and which company is providing them.

In [None]:
products['Primary Essential Function'].unique()
freq = products.groupby(['Primary Essential Function','Provider/Company Name']).count()
freq.sort_values(by=['Product Name'], ascending=False )[:10]

## Let's create a sunburst plot for more detail; just click on the element you want to know about.
## The main purpose of this plot is to group the major information together.

In [None]:
# plotly.express is already imported as px above
# Count products per (function, provider) pair for the sunburst chart
sunb_data = products[['Primary Essential Function','Provider/Company Name','Sector(s)']]
sunb_data = sunb_data.dropna()
sunb_data = sunb_data.groupby(['Primary Essential Function','Provider/Company Name']).size().reset_index(name='count')
#sunb_data.sort_values(by=['count'], ascending=False )[:10]
fig = px.sunburst(sunb_data, path=['Primary Essential Function','Provider/Company Name'], values='count')
fig.show()

## Here we can see each 'Primary Essential Function', then the 'Provider/Company Name' values within it, and their product counts.
## From the above graph we can learn which functions have the most products and which companies provide them.
* ### Digital Learning Platforms is the largest 'Primary Essential Function', with companies like Curriculum Associates, IXL Learning, and Teaching.com appearing most often.
* ### Similarly, for Sites, Resources & Reference, Google LLC appears most often.
* ### For Content Creation & Curation, Google, Adobe, and Autodesk appear most often.

In [None]:
df = products[['Product Name','Primary Essential Function','Provider/Company Name','Sector(s)']]
df = df.dropna()
df = df.groupby(['Provider/Company Name','Product Name','Primary Essential Function','Sector(s)']).size().reset_index(name='count')
df.sort_values(by=['count'], ascending=False )

## This is the first part; I still need to go through the other data sheets. I hope the notebook has been interesting so far. If you like the work, do show your support and stay tuned for more detailed exploration. Thank you once again.

## Now let's explore the districts datasheet.
### We already know there are many NaN values, so we have to deal with them. Before that, we need to see whether we can impute them or will simply have to drop them.

In [None]:
## Count the districts per state to see which states are best represented in the data
px.histogram(districts, x='state', barmode='group').update_xaxes(categoryorder='total ascending')

### Connecticut has the most districts in the data, followed by Utah, Massachusetts, and Illinois.

In [None]:
#fig = px.pie(districts, values='state', names='state', title='Which state was most active')
#iplot(fig)
# Donut chart of the share of districts per value of the given column
def PIE(state):
    # Pie values must be numeric, so use the count of districts per state
    counts = districts[state].value_counts()
    trace = go.Pie(labels=counts.index.tolist(), values=counts.values.tolist(),
                   textfont=dict(size=15),
                   opacity=0.5,
                   marker=dict(line=dict(color='#000000', width=1.5)),
                   hole=0.6)

    layout = dict(title="Share of districts per state")
    fig = dict(data=[trace], layout=layout)
    py.iplot(fig)
PIE('state')

## Let's remove the NaN values

In [None]:
# Get the index of rows where districts['state'] is null. The earlier analysis
# showed that state and locale are both empty in the same 57 rows.
index_names = districts[districts['state'].isnull()].index

# Drop these rows from the DataFrame
districts.drop(index_names, inplace=True)

districts

## Similarly, let's look at the other columns.

In [None]:
## Let's also see which locale types are best represented in the data
target_count(districts,'locale')
target_pie(districts,'locale')

## Suburban districts are the most represented on these platforms, followed by Rural, City, and Town districts.

## Similarly, going through the other columns, we can see the following distributions

In [None]:
px.histogram(districts, x='county_connections_ratio', barmode='group')

In [None]:
px.histogram(districts, x='pct_black/hispanic', barmode='group')

In [None]:
px.histogram(districts, x='pct_free/reduced', barmode='group')

In [None]:
px.histogram(districts, x='pp_total_raw', barmode='group')
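
If these columns store ranges as text, converting each range to a number makes the values sortable and usable in correlations. The sketch below assumes the values look like `'[0.2, 0.4['`; that format is not shown in this notebook, so check it with `districts['pct_free/reduced'].unique()` first. `range_midpoint` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def range_midpoint(value):
    """Turn a string such as '[0.2, 0.4[' into the midpoint of the range.

    Assumes two comma-separated numbers wrapped in square brackets.
    Missing values are returned as NaN.
    """
    if pd.isna(value):
        return np.nan
    low, high = value.strip("[]").split(",")
    return (float(low) + float(high)) / 2

# Toy column (made-up values) in the assumed format
toy = pd.Series(["[0, 0.2[", "[0.2, 0.4[", None])
mid = toy.map(range_midpoint)
print(mid)
```

On the real data this would be applied per column, for example `districts['pct_free/reduced'].map(range_midpoint)`.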

## Next, we are going to combine all the dataframes to analyse the data collectively and try to answer the important questions asked. Thanks for staying connected. I will be updating the notebook with further analysis soon.
## Do comment and share your views with me. Thank you once again.

In [None]:
import glob  
path = "/kaggle/input/learnplatform-covid19-impact-on-digital-learning/engagement_data/*.csv"

dfs = []
for fname in glob.glob(path):
    dfs.append(pd.read_csv(fname))

# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)
big_frame
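
Concatenating the files this way loses track of which file each row came from. If each engagement file is named after its district (an assumption about the dataset layout, not something shown here), the file name can be stored as a `district_id` column so the result can later be joined to `districts`. `load_engagement` is a hypothetical helper, demonstrated on two throwaway files.

```python
import glob
import os
import tempfile

import pandas as pd

def load_engagement(pattern):
    """Read every CSV matching `pattern` and tag rows with the file's base name."""
    frames = []
    for fname in sorted(glob.glob(pattern)):
        frame = pd.read_csv(fname)
        # Assumption: the file name without '.csv' is the district id
        frame["district_id"] = os.path.splitext(os.path.basename(fname))[0]
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

# Demo on two temporary files with made-up values
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"lp_id": [1, 2], "engagement_index": [10.0, 20.0]}).to_csv(
        os.path.join(tmp, "1000.csv"), index=False)
    pd.DataFrame({"lp_id": [1], "engagement_index": [5.0]}).to_csv(
        os.path.join(tmp, "1001.csv"), index=False)
    demo = load_engagement(os.path.join(tmp, "*.csv"))

print(demo)
```

The `district_id` values are strings here; cast them with `astype(int)` before merging if the ids in `districts` are integers.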

## The engagement lp_id and the product LP_ID can be used to merge the tables. We can then regroup by date and analyse which company was being favoured as the days passed, using the engagement_index column.
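
A minimal sketch of that merge, on made-up rows. The column names `time` and `LP ID` are assumptions (the note above writes the product key as LP_ID), so adjust them to the real headers. Only `lp_id`, `engagement_index`, and `Provider/Company Name` come from the notebook.

```python
import pandas as pd

# Toy engagement rows (made-up values); 'time' is an assumed column name
toy_engagement = pd.DataFrame({
    "time": ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02"],
    "lp_id": [1, 1, 2, 1],
    "engagement_index": [10.0, 20.0, 30.0, 40.0],
})

# Toy product table; 'LP ID' is an assumed spelling of the product key
toy_products = pd.DataFrame({
    "LP ID": [1, 2],
    "Provider/Company Name": ["A", "B"],
})

# Left merge keeps every engagement row, even if a product is unknown
merged = toy_engagement.merge(
    toy_products, left_on="lp_id", right_on="LP ID", how="left")
merged["time"] = pd.to_datetime(merged["time"])

# Mean engagement per day and provider
daily = (
    merged.groupby(["time", "Provider/Company Name"])["engagement_index"]
    .mean()
    .reset_index()
)
print(daily)
```

Plotting `daily` with `px.line(daily, x='time', y='engagement_index', color='Provider/Company Name')` would then show how each company's engagement moves over time.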