# Preparing Data for RESEARCH ON DIGITAL LEARNING IMPACT BY COVID-19
https://www.kaggle.com/c/learnplatform-covid19-impact-on-digital-learning/overview

* This notebook's code aims to enrich the collective effort to build meaningful information from the data of the [Learning-during-Covid-2020 Competition](https://www.kaggle.com/c/learnplatform-covid19-impact-on-digital-learning/overview).

* The insights drawn from the data here do not go very far; the notebook aims to contribute a small part of the needed research.

* The goal is to create a more valuable data set that others can use for deeper analysis and predictions. Here is what it does:
    * read the files
        * **products_info.csv** (372 entries, 5 columns)
        * **districts_info.csv** (133 entries, 6 columns)
        * **engagement_data/{district}.csv** - 233 files (14,913,939 entries, 5 columns)
    * improve the data and write it out as new files
        * **better_products_df.csv** (369 entries, 12 columns)
        * **better_engagement_data.csv** (7,784,803 entries, 11 columns)

# Load data

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import matplotlib.colors

#plt.style.use('Solarize_Light2')
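# 'seaborn-deep' was renamed to 'seaborn-v0_8-deep' in matplotlib 3.6, so use whichever name is available
plt.style.use('seaborn-deep' if 'seaborn-deep' in plt.style.available else 'seaborn-v0_8-deep') #https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html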

import warnings
warnings.filterwarnings("ignore")

In [None]:
products_df = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv")
products_df_orig = products_df.copy()  # independent copy of the raw table, kept for later comparison
products_df.info()

In [None]:
products_df.sample(12)

**Products table columns**

* **LP ID** - The unique identifier of the product
* **URL** - Web Link to the specific product
* **Product Name** - Name of the specific product
* **Provider/Company Name** - Name of the product provider
* **Sector(s)** - Sector of education where the product is used
* **Primary Essential Function** - Has three main categories
    *   LC = Learning & Curriculum, 
    *   CM = Classroom Management, and 
    *   SDO = School & District Operations

# First (0) Aesthetic column names


In [None]:
products_df = products_df.rename(columns={'LP ID': 'ID',
                            'Product Name': 'Name',
                            'Provider/Company Name': 'Provider_Company',
                            'Primary Essential Function': 'Primary_Essential_Function'})

In [None]:
products_df.info()

In [None]:
products_df_orig.info()

# Dig deeper (1) Sector(s) column


In [None]:
products_df['Sector(s)'].value_counts()

Categories that appear only once spoil the graphs, so I dig deeper to check the data.

In [None]:
print(products_df[products_df['Sector(s)'] == 'Higher Ed; Corporate'])

According to https://www.qualtrics.com/, this product has no real connection to education.
It might be better to drop it, but for the sake of keeping the dataset complete I keep it for this analysis
and relabel its sector as 'Corporate' instead of calling `products_df.drop(185, axis = 'index', inplace = True)`.
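
A minimal sketch of how such one-off categories can be located programmatically. It uses a small made-up frame named `toy` rather than the competition file, so the rows below are assumptions for illustration only:

```python
import pandas as pd

# Toy stand-in for products_df; the values are illustrative, not the real data.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Sector(s)": ["PreK-12", "PreK-12", "PreK-12; Higher Ed", "Higher Ed; Corporate"],
})

counts = toy["Sector(s)"].value_counts()
rare_sectors = counts[counts == 1].index            # categories that occur exactly once
rare_rows = toy[toy["Sector(s)"].isin(rare_sectors)]
print(rare_rows)
```

Listing the rows behind each rare category makes it easier to decide case by case whether to relabel, merge, or drop them.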


In [None]:
products_df.at[185,'Sector(s)'] = 'Corporate'

In [None]:
products_df['Sector(s)'].value_counts()

In [None]:
# Count the rows directly; DataFrame.value_counts() would silently skip rows that contain NaN in any column
only_Pre_K = len(products_df[products_df['Sector(s)'] == 'PreK-12'])
only_Pre_K

In [None]:
# na=False treats rows with a missing sector as "no match"
both_Pre_K_and_Higher_Ed = len(products_df[products_df['Sector(s)'].astype('string').str.contains('Higher', na=False)])
both_Pre_K_and_Higher_Ed

In [None]:
# The 'Corporate'-only category is left out because it is not related to education; a deeper analysis might drop those rows or assign them another category
sectors_size = [only_Pre_K, both_Pre_K_and_Higher_Ed]
sectors_labels = ['PreK-12','PreK-12 & Higher Ed']
sectors_total = sectors_size[0]+sectors_size[1]


In [None]:
plt.pie(sectors_size, labels=sectors_labels)
plt.title('Sectors of the ' + str(sectors_total) + ' Educational products' + '\n')
plt.xlabel(str(round(sectors_size[0] / sectors_total * 100)) + '% are only PreK-12')
plt.show()

del only_Pre_K
del both_Pre_K_and_Higher_Ed

* Approximately 50% of the educational products target both PreK-12 & Higher Ed
* The other 50% target only PreK-12
* All Higher Ed products also target PreK-12

--------------------------------------------------------------------------------------

# Dig deeper (2) Sector(s) Null values


In [None]:
print(products_df[products_df['Sector(s)'].isnull()])

Checking the product links to fill in the missing data.
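
The next cell applies each fix with its own `.at` call. As a rough alternative sketch, the same idea can be kept in one mapping and applied in a loop. The frame `toy` and the dictionary `sector_fixes` below are hypothetical names with made-up contents, not part of the original notebook:

```python
import pandas as pd

# Toy stand-in for products_df with two missing sectors.
toy = pd.DataFrame({"Sector(s)": ["PreK-12", None, None]}, index=[60, 61, 146])

# Hypothetical mapping: row label -> sector decided after checking the product's site.
sector_fixes = {61: "PreK-12", 146: "Corporate"}

for row_label, sector in sector_fixes.items():
    # Only fill values that are still missing, so existing data is never overwritten.
    if pd.isna(toy.at[row_label, "Sector(s)"]):
        toy.at[row_label, "Sector(s)"] = sector

print(toy)
```

Keeping the decisions in a single dictionary makes them easier to review, and the `pd.isna` guard protects against a wrong row label overwriting a valid value.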

In [None]:
#https://www.ixl.com/
products_df.at[61,'Sector(s)'] = 'PreK-12'

#https://www.yelp.com/ - has no connection to learning, but kept instead of products_df.drop(146, axis = 'index', inplace = True)
products_df.at[146,'Sector(s)'] = 'Corporate'

#https://learnplatform.com/ - seems to suit 'PreK-12; Corporate' but there is no such category
products_df.at[158,'Sector(s)'] = 'PreK-12; Corporate'

#https://genius.com/Genius-about-genius-annotated
products_df.at[174,'Sector(s)'] = 'PreK-12'

#https://www.microsoft.com/en-us/education
products_df.at[183,'Sector(s)'] = 'PreK-12; Higher Ed; Corporate'

#https://www.hmhco.com/shop - seems to suit 'PreK-12; Corporate' but there is no such category
products_df.at[210,'Sector(s)'] = 'PreK-12; Corporate'

#https://www.classdojo.com/en-gb/?redirect=true
products_df.at[237,'Sector(s)'] = 'PreK-12'

#https://music.youtube.com/googleplaymusic#/sulp - not available and not connected to learning, but kept instead of products_df.drop(248, axis = 'index', inplace = True)
products_df.at[248,'Sector(s)'] = 'Corporate'

#https://www.google.com/search?q=sciencejournal.withgoogle.com/   
products_df.at[262,'Sector(s)'] = 'PreK-12; Higher Ed; Corporate'

#https://www.google.com/search?q=
products_df.at[265,'Sector(s)'] = 'PreK-12; Higher Ed; Corporate'

#https://www.adobe.com/express/
products_df.at[305,'Sector(s)'] = 'PreK-12; Higher Ed; Corporate'

#https://www.usnews.com/best-colleges/myfit
products_df.at[311,'Sector(s)'] = 'Higher Ed; Corporate'

#https://chrome.google.com/webstore/detail/grammarly-for-chrome/kbfnbcaeplbcioakkpcpgfkobkghlhen
products_df.at[314,'Sector(s)'] = 'PreK-12; Higher Ed'

#https://www.google.com/search?q=maxpreps.com
products_df.at[331,'Sector(s)'] = 'PreK-12'

#https://info.flipgrid.com/
products_df.at[293,'Sector(s)'] = 'PreK-12; Higher Ed; Corporate'

#https://www.ducksters.com/history/
products_df.at[352,'Sector(s)'] = 'PreK-12'

#https://www.google.com/search?q=safeyoutube.net
products_df.at[354,'Sector(s)'] = 'PreK-12'

#https://studio.code.org/courses
products_df.at[356,'Sector(s)'] = 'PreK-12'

#https://edpuzzle.com/
products_df.at[370,'Sector(s)'] = 'PreK-12'

#https://www.google.com/search?q=truenorthlogic.com
products_df.at[371,'Sector(s)'] = 'PreK-12'

In [None]:
products_df.info()

--------------------------------------------------------------------------------------

# Dig deeper (3) Provider column


In [None]:
products_df.isnull().sum()

In [None]:
print(products_df[products_df['Provider_Company'].isnull()])

A search on https://www.google.com/search?q=www.truenorthlogic.com
suggests that the company behind this product is probably "PowerSchool Group LLC".

In [None]:
products_df.at[371,'Provider_Company'] = 'PowerSchool'

In [None]:
products_df.info()

In [None]:
products_of_company = products_df['Provider_Company'].value_counts()

In [None]:
print(products_of_company)

In [None]:
more_than_2_products_per_company = products_of_company[products_of_company > 2]
exactly_2_products_per_company = products_of_company[products_of_company == 2]
exactly_1_products_per_company = products_of_company[products_of_company < 2]
uniq_companies = more_than_2_products_per_company.count() + exactly_2_products_per_company.count() + exactly_1_products_per_company.count()

print('------------------------------------')
print(str(more_than_2_products_per_company.count()) + ' companies have 3 or more products')
print(str(exactly_2_products_per_company.count()) + ' companies have 2 products')
print(str(exactly_1_products_per_company.count()) + ' companies have 1 product')
print('Total: ' + str(uniq_companies) + ' different companies / providers for ' + str(products_df['Name'].count()) + ' Educational products')

In [None]:
more_than_2_products_per_company

In [None]:
products_amount_for_company = [more_than_2_products_per_company.count(), exactly_2_products_per_company.count(), exactly_1_products_per_company.count()]
products_amount_labels = ['3+ products in same provider', '2 products in same provider', '1 product provider']

In [None]:
plt.pie(products_amount_for_company, labels=products_amount_labels)
plt.title(' ' + str(uniq_companies) + ' different companies for ' + str(products_df['Name'].count()) + ' Educational products')
plt.show()

In [None]:
plt.barh(more_than_2_products_per_company.index, more_than_2_products_per_company.values)
plt.title(str(len(more_than_2_products_per_company)) + ' most popular companies/providers (3+ products)')
plt.xlabel('Number of products per company')
plt.xticks([3,4,6,30])
plt.show()

In [None]:
more_than_2_products_per_company

In [None]:
del more_than_2_products_per_company
del exactly_2_products_per_company
del exactly_1_products_per_company

--------------------------------------------------------------------------------------

# Dig deeper (4) Primary Essential Function

In [None]:
products_df.info()

In [None]:
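# Assign the result back; an in-place fillna on a selected column may not update the frame in recent pandas
products_df['Primary_Essential_Function'] = products_df['Primary_Essential_Function'].fillna('Other')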

In [None]:
sub_categories = products_df['Primary_Essential_Function'].value_counts()
sorted(sub_categories.index)

* **Primary Essential Function** - Has three main categories
    *   LC = Learning & Curriculum, 
    *   CM = Classroom Management 
    *   SDO = School & District Operations
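
A small sketch of how values in the `<category> - <sub-category>` style can be split into a main category and the remainder. The three strings below are illustrative examples of that style, and `labels` is a hypothetical helper mapping:

```python
import pandas as pd

# Illustrative values in the "<category> - <sub-category>" style used by the column.
pef = pd.Series([
    "LC - Digital Learning Platforms",
    "CM - Classroom Engagement & Instruction - Classroom Management",
    "SDO - Data, Analytics & Reporting",
])

# Split once on the first " - ": left part is the main category, right part is the rest.
parts = pef.str.split(" - ", n=1, expand=True, regex=False)
parts.columns = ["main", "sub"]

# Hypothetical mapping from abbreviation to a readable label.
labels = {
    "LC": "Learning_and_Curriculum",
    "CM": "Classroom_Management",
    "SDO": "School_and_District_Operations",
}
parts["main"] = parts["main"].map(labels)
print(parts)
```

Compared with slicing a fixed number of characters, splitting on the separator does not depend on the abbreviation length, and it keeps the sub-category available as its own column.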

In [None]:
products_df.replace('LC/CM/SDO - Other','Other', inplace=True)

In [None]:
# Create new column for Primary Function 
# The first 5 characters hold the category prefix: 'LC - ', 'CM - ', 'SDO -' or 'Other'
products_df['Primary_Function'] = products_df['Primary_Essential_Function'].str[:5]
products_df['Primary_Function'].value_counts()


In [None]:
products_df['Primary_Function'] = products_df['Primary_Function'].replace('LC - ','Learning_and_Curriculum')
products_df['Primary_Function'] = products_df['Primary_Function'].replace('CM - ','Classroom_Management')
products_df['Primary_Function'] = products_df['Primary_Function'].replace('SDO -','School_and_District_Operations')

In [None]:
products_df['Primary_Essential_Function'] = products_df['Primary_Essential_Function'].astype('string')
products_df['Primary_Function'] = products_df['Primary_Function'].astype('string')

In [None]:
# Plot Primary Function distribution
pf = products_df['Primary_Function'].value_counts()

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(11, 7))

plt.pie(pf, labels=pf.values)
# Build the legend from pf.index so every label matches its wedge, whatever the order of the counts
plt.legend([name.replace('_and_', ' & ').replace('_', ' ') for name in pf.index], loc='lower left')
plt.title('Primary Function of ' + str(products_df['Name'].count()) + ' Educational products')
plt.show()

In [None]:
products_df.info()

* Created new column for Primary_Function

<hr>

# Dig deeper (5) Split Sectors column to 3 sectors

In [None]:
products_df['Sector(s)'] = products_df['Sector(s)'].astype('string')

In [None]:
products_df.info()

In [None]:
products_df['sector_K_12'] = products_df['Sector(s)'].str.contains('12')
products_df['sector_High_Ed'] = products_df['Sector(s)'].str.contains('Ed')
products_df['sector_Corporate'] = products_df['Sector(s)'].str.contains('Corporate')

In [None]:
products_df.drop(columns = 'Sector(s)', inplace=True)

In [None]:
products_df.head(12)

* Created 3 new columns 'sector_K_12', 'sector_High_Ed' & 'sector_Corporate' with True/False values instead of the 'Sector(s)' column
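
As an alternative sketch, pandas can derive one indicator column per label directly from the `; ` separator with `Series.str.get_dummies`. The sample values below are made up, and the resulting column names are the raw labels rather than the `sector_*` names used in this notebook:

```python
import pandas as pd

# Made-up sector strings in the same "; "-separated style.
sectors = pd.Series([
    "PreK-12",
    "PreK-12; Higher Ed",
    "PreK-12; Higher Ed; Corporate",
    "Corporate",
])

# One indicator column per distinct label found between the "; " separators.
flags = sectors.str.get_dummies(sep="; ").astype(bool)
print(flags)
```

Because this matches whole labels, it avoids accidental hits that a short substring test such as `str.contains('Ed')` could produce if a new label happened to contain the same letters.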

--------------------------------------------------------------------------------------

# Dig deeper (6) A closer look at the URL column

In [None]:
url_research = list(products_df['URL'])

In [None]:
# remove the scheme (http:// or https://) at the beginning
url_clean = [element.split('://')[1] for element in url_research if 'http' in element]

# keep only the host: drop everything after the first /
url_clean = [element.split('/')[0] for element in url_clean]

In [None]:
url_dot_com  = [element for element in url_clean if '.com' in element]
url_other = [element for element in url_clean if not '.com' in element]
url_dot_org = [element for element in url_other if '.org' in element]
url_other = [element for element in url_other if not '.org' in element]
url_dot_edu = [element for element in url_other if '.edu' in element]
url_other = [element for element in url_other if not '.edu' in element]
url_dot_gov = [element for element in url_other if '.gov' in element]
url_other = [element for element in url_other if not '.gov' in element]
url_dot_net = [element for element in url_other if '.net' in element]
url_other = [element for element in url_other if not '.net' in element]

In [None]:
url_types = {
    'com' : len(url_dot_com),
    'org' : len(url_dot_org),
    'edu' : len(url_dot_edu),
    'gov' : len(url_dot_gov),
    'net' : len(url_dot_net),    
    'other' : len(url_other)
}
url_types
url_types.values()

In [None]:
url_other

In [None]:
plt.bar(list(url_types.keys()), list(url_types.values()))
plt.title('URL types')
plt.yticks(list(url_types.values())[:3])
plt.show()
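
The substring tests above (`'.com' in element`) can misclassify hosts that merely contain `.com` somewhere in the middle. A hedged sketch of a stricter variant, using the standard library's `urlparse` and a hypothetical helper `host_suffix`; the URLs are illustrative:

```python
from collections import Counter
from urllib.parse import urlparse

# Illustrative URLs; only the host part matters here.
urls = [
    "https://www.ixl.com/",
    "https://studio.code.org/courses",
    "https://www.commonsense.org/education",   # contains ".com" inside ".commonsense"
    "https://sciencejournal.withgoogle.com/",
]

def host_suffix(url):
    """Return the last dot-separated label of the host, e.g. 'com' or 'org'."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else "other"

suffix_counts = Counter(host_suffix(u) for u in urls)
print(suffix_counts)
```

This only looks at the final label of the host name, so it does not handle multi-part suffixes such as `co.uk`; for the handful of suffixes counted here that is probably sufficient.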

# Dig much deeper (7) Engagement with products

* What rate of engagement can we observe for each product in our data?
* Are there products with no engagement recorded?
* Are there products with low engagement recorded?
* What kind of usage patterns can be observed with
    * different products
    * different product categories: LC = Learning & Curriculum / CM = Classroom Management / SDO = School & District Operations
    * different states - New York / Utah / Arizona / etc.
    * different districts - City / Town / Suburb / Rural
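
A minimal sketch of how the "no engagement recorded" question can be checked, assuming two toy tables shaped like the product and engagement data (the rows are invented for illustration):

```python
import pandas as pd

products = pd.DataFrame({"ID": [1, 2, 3], "Name": ["A", "B", "C"]})
engagement = pd.DataFrame({"lp_id": [1, 1, 3], "engagement_index": [10.0, 5.0, None]})

# Products that never appear in the engagement table at all.
never_seen = products[~products["ID"].isin(engagement["lp_id"])]

# Products that appear, but only with missing engagement_index values.
recorded = engagement.dropna(subset=["engagement_index"])["lp_id"].unique()
only_missing = products[
    products["ID"].isin(engagement["lp_id"]) & ~products["ID"].isin(recorded)
]

print(never_seen["Name"].tolist(), only_missing["Name"].tolist())
```

Separating "never listed" from "listed but always empty" is a design choice here; the notebook itself only filters on whether a product ID appears in the engagement data.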


In [None]:
districts_info = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv")
districts_info = districts_info[districts_info.state.notna()].reset_index(drop=True)

In [None]:
districts_info.info()

In [None]:
districts_info.head(22)

To make the data easier to compare, we will only consider districts with engagement data for every day in 2020 - read [more here](https://www.kaggle.com/iamleonie/how-to-approach-analytics-challenges?scriptVersionId=73481888&cellId=13) 

Code used from: https://www.kaggle.com/iamleonie/how-to-approach-analytics-challenges?scriptVersionId=73481888&cellId=16
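
The "full year only" rule can be illustrated on a single combined table. This is a sketch on made-up data, with `required_days` set to 3 instead of the 366 days of the leap year 2020:

```python
import pandas as pd

# Toy engagement table: district 10 reports on 3 days, district 20 on only 2.
toy = pd.DataFrame({
    "district_id": [10, 10, 10, 20, 20],
    "time": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-01", "2020-01-03"],
})
required_days = 3   # the notebook uses 366 for the leap year 2020

# Number of distinct reporting days per district.
days_per_district = toy.groupby("district_id")["time"].nunique()
complete = days_per_district[days_per_district == required_days].index
toy_complete = toy[toy["district_id"].isin(complete)]
print(toy_complete)
```

The cell below reaches the same result file by file, which avoids holding the incomplete districts in memory at all.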



In [None]:
PATH = '../input/learnplatform-covid19-impact-on-digital-learning/engagement_data' 

temp = []

for district in districts_info.district_id.unique():
    df = pd.read_csv(f'{PATH}/{district}.csv', index_col=None, header=0)
    df["district_id"] = district
    if df.time.nunique() == 366:
        temp.append(df)

engagement = pd.concat(temp)
engagement = engagement.reset_index(drop=True)

# Only consider districts with full 2020 engagement data (drops 43 districts from list of 176)
districts_info = districts_info[districts_info.district_id.isin(engagement.district_id.unique())].reset_index(drop=True)
# Only consider products that have engagement data (drops 3 products with no engagement data: DocHub, Google Slides, True North Logic)
products_df = products_df[products_df['ID'].isin(engagement.lp_id.unique())].reset_index(drop=True)
# Only consider engagement data for products there is info about.
#       Remove engagement data for unknown products. This reduces the engagement data roughly by half.
engagement = engagement[engagement.lp_id.isin(products_df['ID'].unique())]

In [None]:
# Merge engagement info with districts_info
engagement_with_districts_info = pd.merge(engagement, districts_info, how="inner",on='district_id', sort=True)

In [None]:
# Merge engagement info with products_info
engagement_with_districts_info_and_products = pd.merge(engagement_with_districts_info, products_df, how="inner",left_on='lp_id', right_on='ID', sort=True)

In [None]:
engagement_full_df = pd.DataFrame(engagement_with_districts_info_and_products)
better_engagement_data = pd.DataFrame(engagement_with_districts_info)

del engagement_with_districts_info
del engagement_with_districts_info_and_products
del engagement

In [None]:
better_engagement_data.info()

In [None]:
# For each product - Calculate engagement days, engagement_score & engagement_score_percent

# set_index returns a new frame, so nothing is modified in place on a slice of products_df
different_products = products_df[['ID', 'Name']].set_index('ID')

for index, row in different_products.iterrows():          
        temp = engagement_full_df[engagement_full_df['lp_id'] == index]   
        different_products.loc[index, 'engagement_days'] = int(round(temp.shape[0]))       
        different_products.loc[index, 'engagement_score'] = int(round(temp['engagement_index'].sum()))
        
total_engagement_score = different_products['engagement_score'].sum()

different_products['engagement_score_percent'] = different_products['engagement_score'] / total_engagement_score     

different_products.drop(columns=['Name'], inplace = True)
products_full_df = pd.merge(products_df, different_products, how="inner",left_on='ID', right_index=True)

del different_products
del products_df
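
The per-product loop above can also be expressed as a single `groupby`, which is usually much faster on millions of rows. A sketch on a toy table, not the notebook's own code; note that it skips the rounding to whole numbers done in the loop:

```python
import pandas as pd

toy = pd.DataFrame({
    "lp_id": [1, 1, 2, 2, 2],
    "engagement_index": [10.0, 30.0, 20.0, None, 40.0],
})

per_product = toy.groupby("lp_id").agg(
    engagement_days=("engagement_index", "size"),   # row count, like temp.shape[0]
    engagement_score=("engagement_index", "sum"),   # NaN values are skipped by sum
)
per_product["engagement_score_percent"] = (
    per_product["engagement_score"] / per_product["engagement_score"].sum()
)
print(per_product)
```

As in the loop, `engagement_days` here counts rows, so a product used in several districts on the same day is counted once per district.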

In [None]:
products_full_df['engagement_days'] = products_full_df['engagement_days'].astype(int)
products_full_df['engagement_score'] = products_full_df['engagement_score'].astype(int)

In [None]:
products_full_df.sort_values('engagement_days', ascending = False, inplace = True)

In [None]:
df = products_full_df
df.sort_values('engagement_days', ascending = True, inplace = True)

# Figure
f, ax = plt.subplots(nrows=1, ncols=1, figsize=(24, 66))
       
plt.barh(df['Name'], df['engagement_days'])
plt.title('Days of reported engagement in data')

plt.show()

In [None]:
products_full_df.sort_values('engagement_score', ascending = False, inplace = True)

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(24, 56))
plt.barh(products_full_df['Name'], products_full_df['engagement_score'])
plt.title('Engagement score in data')
plt.show()

Obviously, Google products take up a huge share of the engagement data. It might be interesting to look at the data in **products_full_df** without Google products.

In [None]:
products_without_google = products_full_df[~(products_full_df['Provider_Company'] == 'Google LLC')]

# Plot the 33 most popular products, not counting Google products.
import matplotlib.patches as mpatches

# Sort first, then build the colors, so that each color lines up with its bar
df = products_without_google.head(33).sort_values('engagement_score_percent', ascending = True)

category_colors = {'Learning_and_Curriculum': 'cornflowerblue',
                   'Classroom_Management': 'khaki'}
# Every other category (School and District Operations, Other) is drawn in lightcoral
colormap = [category_colors.get(function, 'lightcoral') for function in df['Primary_Function']]


f, ax = plt.subplots(nrows=1, ncols=1, figsize=(24, 22))

plt.barh(df['Name'], df['engagement_score_percent'], color=colormap)
plt.title('Engagement score percent for 33 top non google products')
# Explicit legend handles: one colored patch per category
plt.legend(handles=[mpatches.Patch(color='cornflowerblue', label='Learning and Curriculum'),
                    mpatches.Patch(color='khaki', label='Classroom Management'),
                    mpatches.Patch(color='lightcoral', label='School and District Operations / Other')])

plt.show()

del df
del products_without_google

* 3 products had no engagement data, so they were dropped
* Only complete (full-year) engagement data was used
* Google Docs & Google Classroom have significantly higher engagement than most other products
* The most popular products belong to the 'Learning_and_Curriculum' category

# Dig (8) Engagement with products by category

In [None]:
df = products_full_df

LC_only = df[df['Primary_Function'] == 'Learning_and_Curriculum']
CM_only = df[df['Primary_Function'] == 'Classroom_Management']
SD_only = df[df['Primary_Function'] == 'School_and_District_Operations']
Other_only = df[df['Primary_Function'] == 'Other']

In [None]:
# Plot the engagement score share for each product category - LC / CM / SD / Other
LC_score = LC_only['engagement_score_percent'].sum()
CM_score = CM_only['engagement_score_percent'].sum()
SD_score = SD_only['engagement_score_percent'].sum()
Other_score = Other_only['engagement_score_percent'].sum()


f, ax = plt.subplots(nrows=1, ncols=1, figsize=(11, 7))
score_for_cat = [LC_score, CM_score, SD_score, Other_score]
total_score = sum(score_for_cat)
l = [str(round(e / total_score * 100))+'%' for e in score_for_cat]

plt.pie(score_for_cat, labels = l)
plt.title('Engagement score for different categories')
plt.legend(['Learning and Curriculum', 'Classroom Management', 'School and District Operations', 'Other'])
plt.show()

The engagement score is roughly proportional to the number of products in each category. A slight exception is 'School and District Operations', which has 20% of the engagement volume while only 9% of the products belong to this category:
https://www.kaggle.com/anako2020/zoom-into-products-data?scriptVersionId=75776927&cellId=81
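
One way to make this comparison explicit is to compute both shares per category and their ratio. The sketch below uses a made-up table, so its numbers do not correspond to the 20% and 9% quoted above:

```python
import pandas as pd

toy = pd.DataFrame({
    "Primary_Function": ["Learning_and_Curriculum"] * 3 + ["School_and_District_Operations"],
    "engagement_score": [10, 20, 30, 40],
})

by_cat = toy.groupby("Primary_Function").agg(
    products=("engagement_score", "size"),   # number of products in the category
    score=("engagement_score", "sum"),       # total engagement score of the category
)
by_cat["product_share"] = by_cat["products"] / by_cat["products"].sum()
by_cat["score_share"] = by_cat["score"] / by_cat["score"].sum()
# ratio > 1 means the category draws more engagement than its product count suggests
by_cat["ratio"] = by_cat["score_share"] / by_cat["product_share"]
print(by_cat)
```

A ratio close to 1 corresponds to the "proportional" case, while a value well above 1 flags a category like the exception described above.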


# Dig (9) Engagement patterns with products

In [None]:
engagement_full_df.time = engagement_full_df.time.astype('datetime64[ns]')


In [None]:
engagement_full_df.set_index(['time'], inplace = True)

In [None]:
def engagement_with_product_by_col(df, product, col):
    # Plot the weekly maximum engagement_index of one product, one line per category of `col`
    product_df = df[df['Name'] == product]

    cats = list(product_df[col].value_counts().index)   # e.g. for 'locale': 'Suburb', 'City', 'Rural', 'Town'
    weekly = []

    for cat in cats:
        cat_df = product_df[product_df[col] == cat]
        # The frame is indexed by time, so it can be resampled to weekly bins
        weekly.append(cat_df.resample('W').agg({'engagement_index': 'max'}))

    f, ax = plt.subplots(nrows=1, ncols=1, figsize=(24, 6))
    plt.title('Weekly engagement with '+ str(product))
    for weekly_df in weekly:
        plt.plot(weekly_df.index, weekly_df)

    plt.ylabel('engagement')
    plt.legend(cats)
    plt.show()

In [None]:
engagement_with_product_by_col(engagement_full_df,'Google Classroom','state')

* The engagement pattern across states seems fairly consistent for the 'Google Classroom' product
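
The plots above take the weekly maximum of `engagement_index`. A short sketch on invented daily values shows how that differs from a weekly mean; which aggregate is more suitable is a judgment call, not something the data prescribes:

```python
import pandas as pd

# Two full Monday-to-Sunday weeks of made-up daily values 0..13.
days = pd.date_range("2020-03-02", periods=14, freq="D")
toy = pd.DataFrame({"engagement_index": range(14)}, index=days)

# 'W' bins end on Sunday by default.
weekly_max = toy["engagement_index"].resample("W").max()
weekly_mean = toy["engagement_index"].resample("W").mean()
print(weekly_max.tolist(), weekly_mean.tolist())
```

With the maximum, a single district with one unusually busy day sets the value for the whole week, whereas the mean smooths such spikes out.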

In [None]:
engagement_with_product_by_col(engagement_full_df,'Google Classroom','locale')

In [None]:
engagement_with_product_by_col(engagement_full_df,'Netflix','locale')

In [None]:
engagement_with_product_by_col(engagement_full_df,'Zoom','locale')

In [None]:
engagement_full_df.head()

In [None]:
products_full_df.head()

<hr>

# Output better data csv

In [None]:
products_df_orig.info()

In [None]:
products_df = products_full_df.convert_dtypes()

In [None]:
products_df.info()

In [None]:
products_df.sample(12)

In [None]:
better_engagement_data.info()

In [None]:
better_engagement_data = better_engagement_data.convert_dtypes()

In [None]:
better_engagement_data.info()

In [None]:
better_engagement_data.sample(12)

# Summary - Data overview, cleaning and improving

**Products_df**
1. Updated null values of 'Sector(s)' and other columns with reasonable data
2. Data cleaning - dtypes, lonely categories, names, ...
3. Overview of the df and its columns
    * Sector(s)
    * URL column
    * Provider
    * Primary Essential Function
4. New column: 'Primary_Function' - extracted from 'Primary_Essential_Function'
5. New 3 columns: 'sector_K_12', 'sector_High_Ed' & 'sector_Corporate' with boolean values extracted from 'Sector(s)'
6. New column: engagement_days - number of engagement records for the product (one row per district and day)
7. New column: engagement_score - engagement_index sum for the product
8. New column: engagement_score_percent - share of the product's engagement_index in the total for all products
9. Dropped 3 products that had no engagement data - (DocHub, Google Slides, True North Logic)
10. Overview of engagement data for each category - LC / SD / CM
11. Overview of engagement patterns for a product in different areas - City / Town / Suburb / Rural
12. Output improved data set

**Engagement_df**
1. Dropped 43 districts (from 176) - only districts with full engagement data (366 unique days) are considered
2. Combined all engagement data from 233 files to 1 df
3. Only considered engagement data that was full - for 366 days
4. Removed engagement data for unknown products (not listed in products_df); this reduces the engagement data roughly by half
5. Overview of engagement patterns for all products
6. Output improved dataset

In [None]:
products_df.to_csv('better_products_df.csv')

In [None]:
better_engagement_data.to_csv('better_engagement_data.csv')