## Our problem handling framework

<center><img src="https://imgur.com/DhVOmLe.png" width="400" height="200" /></center>

Our approach to this analytics challenge consists of the following steps:
1. Data Transformation
2. Feature Engineering
3. Analysis, correlation, and regression
4. Predictions
5. Results evaluation and a summary

But before we jump into the technical implementation, let's briefly review the **problem statement**, **datasets**, and **requirements**!

# Overview
## Problem statement

The COVID-19 pandemic has changed the way we live, work, and study. About *56 million US students* have faced a disruption to their usual on-site education. Globally, over *1.2 billion children* are out of the classroom. Online education is growing, and EdTech plays a vital role in it.
However, constant e-learning has brought the following problems for students and educators:
* digital divide
* learning loss
* health and mental issues
* extra load on the infrastructure
* inequitable access to education

## Datasets

The structure of the datasets is summarized below:
1. `engagement_data` - folder that contains digital engagement data for 233 districts, one CSV file per district. Each file follows the naming convention `<district_id>.csv`.
2. `products_info.csv` - data about the top 372 online education products in 2020.
3. `districts_info.csv` - characteristics of the school districts. `district_id` is the key column.
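
Because each engagement file is named `<district_id>.csv`, the district key can be recovered from the file name alone. The snippet below is a minimal sketch of that idea; the helper name `district_id_from_path` and the sample path are hypothetical and only serve as an illustration.

```python
from pathlib import Path


def district_id_from_path(path):
    """Return the integer district id encoded in a file name such as '1000.csv'."""
    # Path.stem drops the directory part and the '.csv' extension
    return int(Path(path).stem)


# Hypothetical path, used only to illustrate the naming convention
print(district_id_from_path("engagement_data/1000.csv"))
```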


<center><img src="https://imgur.com/ya0tdhS.png" /></center>


## Requirements
The competition calls for an exploration of the current and future state of digital learning. The impact of factors such as demographic conditions, network access, and policies at the state, social, and national levels needs to be analyzed. We have posed the following questions and will try to answer them throughout the analysis:
1. How has the digital learning trend changed, and what caused the changes?
2. Does product selection affect the engagement index and other characteristics?
3. What should we expect in the near future in the field of digital learning?

## Data Transformation
To get the most out of the data, we consolidate the three sources into a single dataset. The schema below summarizes the transformation.


![](https://imgur.com/R11WG3R.png)

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
os.chdir("/kaggle/input/learnplatform-covid19-impact-on-digital-learning")

districts_info_df = pd.read_csv("districts_info.csv")
products_info_df = pd.read_csv("products_info.csv")

In [None]:
districts_features_to_transform = {"pct_black/hispanic": ["pct_black", "hispanic"], 
                        "pct_free/reduced": ["pct_free", "pct_reduced"],
                        "county_connections_ratio": ["county_connections_ratio_0", "county_connections_ratio_1"], 
                        "pp_total_raw": ["ppt_total_local", "ppt_total_federal"]}

def features_transformation(dataset=districts_info_df, features_to_transform=districts_features_to_transform):
    # Work on a copy so the caller's DataFrame is not modified in place
    dataset = dataset.copy()

    for old_feature, new_features in features_to_transform.items():
        # Strip the "[" interval brackets (literal match, not a regex)
        cleaned = dataset[old_feature].str.replace("[", "", regex=False)
        # Split the range string into its lower and upper bound columns
        dataset[new_features] = cleaned.str.split(pat=",", expand=True)

    # Drop the original range columns now that they have been split
    dataset = dataset.drop(columns=list(features_to_transform))

    return dataset
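
To make the splitting step concrete, here is a small sketch on a toy column. The sample values are hypothetical and only assume that each range is written as a bracketed "lower, upper" pair; converting the bounds to numbers with `pd.to_numeric` is an extra step added here for illustration and is not part of the function above.

```python
import pandas as pd

# Hypothetical sample values; the real column may hold other ranges or missing values
sample = pd.DataFrame({"pct_free/reduced": ["[0, 0.2[", "[0.4, 0.6[", None]})

# Remove the brackets (literal match), then split on the comma
cleaned = sample["pct_free/reduced"].str.replace("[", "", regex=False)
bounds = cleaned.str.split(",", expand=True)

# Strip stray spaces and convert to floats; missing rows stay NaN
sample["pct_free"] = pd.to_numeric(bounds[0].str.strip())
sample["pct_reduced"] = pd.to_numeric(bounds[1].str.strip())

print(sample[["pct_free", "pct_reduced"]])
```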

In [None]:
def concatenate_file(directory):
    frames = []
    for file in os.listdir(directory):
        # Each engagement file is named <district_id>.csv
        add_file = pd.read_csv(os.path.join(directory, file), header=0)
        add_file["district_id"] = int(file.split(".")[0])
        frames.append(add_file)
    # DataFrame.append was removed in pandas 2.0, so concatenate all frames at once
    concatenated_dataframe = pd.concat(frames, ignore_index=True)
    print("Finished concatenation")

    return concatenated_dataframe
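
The same read, tag, and concatenate pattern can be tried without touching the file system. The sketch below uses two in-memory CSV strings as hypothetical stand-ins for `<district_id>.csv` files; the ids and values are made up.

```python
import io

import pandas as pd

# Hypothetical stand-ins for two engagement files
fake_files = {
    "1000.csv": "time,lp_id,pct_access,engagement_index\n2020-01-01,1,0.5,10.0\n",
    "2000.csv": (
        "time,lp_id,pct_access,engagement_index\n"
        "2020-01-01,1,0.1,2.0\n"
        "2020-01-02,2,0.3,4.0\n"
    ),
}

frames = []
for name, content in fake_files.items():
    frame = pd.read_csv(io.StringIO(content))
    # Tag every row with the district id taken from the file name
    frame["district_id"] = int(name.split(".")[0])
    frames.append(frame)

# One concat call at the end avoids repeatedly copying a growing DataFrame
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)  # (3, 5)
```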
    


In [None]:
# column name changes for products info
products_info_df = products_info_df.rename(columns={"LP ID": "lp_id", "URL": "url", "Product Name": "product_name",
                                                   "Provider/Company Name": "company_name", "Sector(s)": "sectors",
                                                   "Primary Essential Function": "primary_essential_function"}, errors="raise")





def join_datasets(dataset1, dataset2, left_on, right_on):
    return dataset1.merge(dataset2, how='left',left_on=left_on, right_on=right_on)


def transform_data(directory, dataset1, key1, dataset2, key2):
    # Transform the districts dataset that was passed in, not the global default
    dataset1 = features_transformation(dataset1)
    
    transformed_dataset = concatenate_file(directory)
    
    transformed_dataset = join_datasets(transformed_dataset, dataset1, key1, key1)
    transformed_dataset = join_datasets(transformed_dataset, dataset2, key2, key2)
    
    print("Writing data to CSV")
    transformed_dataset.to_csv('/kaggle/working/data.csv', index=False)
    print("Writing the dataset to CSV has been completed")
    
    return transformed_dataset
    

engagement_dataset = transform_data('engagement_data', districts_info_df, 'district_id', products_info_df, 'lp_id')  
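
Both joins above are left joins, so engagement rows whose key has no match in the lookup table are kept with missing values. One possible way to check how many rows are affected is the `indicator` option of `DataFrame.merge`, sketched here on toy frames with made-up values. The `validate` argument additionally asserts that the key is unique in the lookup table.

```python
import pandas as pd

# Toy data: district 3 has engagement rows but no entry in the lookup table
engagement = pd.DataFrame(
    {"district_id": [1, 1, 2, 3], "engagement_index": [5.0, 7.0, 1.0, 2.0]}
)
districts = pd.DataFrame({"district_id": [1, 2], "state": ["Utah", "Ohio"]})

merged = engagement.merge(
    districts,
    how="left",
    on="district_id",
    validate="many_to_one",  # raises if district_id repeats in `districts`
    indicator=True,          # adds a `_merge` column describing each row's origin
)

unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))  # 4 1
```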

In [None]:
fig, ax = plt.subplots(figsize=(11,4))
ax.set(ylabel="Count of districts", xlabel="Mean engagement per district")

# Select the numeric column before averaging so text columns are not aggregated
ax.hist(engagement_dataset.groupby("district_id")["engagement_index"].mean(), bins=10)

In [None]:
products_per_district = engagement_dataset.groupby(['district_id']).agg({'product_name': 'nunique'}).reset_index()
plt.subplots(figsize=(11,4))
plt.hist(products_per_district.product_name, bins=10)
plt.gca().set(title='Frequency Histogram', ylabel='Frequency')

In [None]:
# Only the last expression of a cell is displayed, so print the shape explicitly
print(engagement_dataset.shape)
engagement_dataset.dtypes

## Visualizing Time Series Data

In [None]:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
# Use seaborn style defaults and set the default figure size
sns.set(rc={'figure.figsize':(11, 4)})

In [None]:
# Events trend over time - how many entities were collected on a specific date
engagement_dataset.groupby('time').size().plot(linewidth=0.5)

In [None]:
# Plot all values of the pct_access and engagement_index columns
numeric_features_plot = ['pct_access', 'engagement_index']

axes = engagement_dataset[numeric_features_plot].plot(marker='.', alpha=0.5, linestyle='None', figsize=(11, 9), subplots=True)
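# Label each subplot with its own feature name instead of a shared '%'
for ax, feature in zip(axes, numeric_features_plot):
    ax.set_ylabel(feature)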

In [None]:
# A function to calculate the percentile
def q90(x):
    return x.quantile(0.9)
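
As a quick illustration of how a custom percentile function such as `q90` plugs into a grouped aggregation, here is a small sketch on made-up values. pandas names the resulting column after the function, so it appears as `q90`.

```python
import pandas as pd


def q90(x):
    """90th percentile of a Series (linear interpolation is the pandas default)."""
    return x.quantile(0.9)


# Toy data with two dates and three observations each
toy = pd.DataFrame(
    {
        "time": ["2020-01-01"] * 3 + ["2020-01-02"] * 3,
        "pct_access": [0.0, 1.0, 2.0, 10.0, 20.0, 30.0],
    }
)

# Built-in aggregations are passed by name, custom ones as callables
stats = toy.groupby("time")["pct_access"].agg(["mean", q90])
print(stats)
```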

In [None]:
# Parse the dates so that date-based tick locators work on the time axis
engagement_dataset['time'] = pd.to_datetime(engagement_dataset['time'])

# Per-date row count plus the mean and 90th percentile of each numeric column
engagement_timeseries_dataset = engagement_dataset.groupby('time')
timeseries_stats = engagement_timeseries_dataset.size().to_frame(name='count')
timeseries_stats['mean_pct_access'] = engagement_timeseries_dataset['pct_access'].mean()
timeseries_stats['p90_pct_access'] = engagement_timeseries_dataset['pct_access'].agg(q90)
timeseries_stats['mean_engagement_index'] = engagement_timeseries_dataset['engagement_index'].mean()
timeseries_stats['p90_engagement_index'] = engagement_timeseries_dataset['engagement_index'].agg(q90)
timeseries_stats.head()

In [None]:
timeseries_stats['7d_rolling_count_sum'] = timeseries_stats.rolling(7)['count'].sum()
timeseries_stats['7d_rolling_count_avg'] = timeseries_stats.rolling(7)['count'].mean()

timeseries_stats['7d_rolling_engagement_index_mean'] = timeseries_stats.rolling(7)['mean_engagement_index'].mean()
timeseries_stats['7d_rolling_pct_access_mean'] = timeseries_stats.rolling(7)['mean_pct_access'].mean()

timeseries_stats['7d_rolling_engagement_index_sum'] = timeseries_stats.rolling(7)['mean_engagement_index'].sum()
timeseries_stats['7d_rolling_pct_access_sum'] = timeseries_stats.rolling(7)['mean_pct_access'].sum()
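
A note on the window used above: `rolling(7)` counts rows, not calendar days, so it only equals a 7-day window when there is exactly one row per date. The sketch below, on a made-up daily series, contrasts it with the time-based window `rolling("7D")`, which requires a datetime index.

```python
import pandas as pd

# Toy daily series: values 1..8 on eight consecutive days
daily = pd.Series(
    [1, 2, 3, 4, 5, 6, 7, 8],
    index=pd.date_range("2020-01-01", periods=8, freq="D"),
)

# Row-based window: NaN until seven rows are available
row_based_sum = daily.rolling(7).sum()
row_based_mean = daily.rolling(7).mean()

# Time-based window: covers the last 7 calendar days and starts from the first row
time_based_sum = daily.rolling("7D").sum()

print(row_based_sum.tail(2))
print(time_based_sum.head(2))
```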

The graphs below show the following:
* the lower the event count, the lower the engagement index and access percentage
* there are clearly fewer observations on weekends than on weekdays
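
The weekend observation can also be checked numerically. The following is a minimal sketch on made-up dates, assuming the `time` column has been parsed as datetimes; it compares the average number of rows per day on weekdays and weekends.

```python
import pandas as pd

# Toy events: 2020-01-03 is a Friday, 01-04 a Saturday, 01-05 a Sunday, 01-06 a Monday
events = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["2020-01-03", "2020-01-03", "2020-01-04",
             "2020-01-05", "2020-01-06", "2020-01-06"]
        )
    }
)

# dayofweek is 0 for Monday through 6 for Sunday
is_weekend = events["time"].dt.dayofweek >= 5
events["day_type"] = is_weekend.map({True: "weekend", False: "weekday"})

# Count rows per date first, then average those daily counts per day type
per_day = events.groupby(["time", "day_type"]).size().reset_index(name="count")
avg_per_day_type = per_day.groupby("day_type")["count"].mean()
print(avg_per_day_type)
```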

In [None]:
# timeseries_features_plot = ['count', 'mean_pct_access', 'p90_pct_access', 'mean_engagement_index', 'p90_engagement_index']

fig, ax = plt.subplots()
# ax.plot(timeseries_stats_extra['count'], marker='.', linestyle='-')
ax.plot(timeseries_stats['7d_rolling_count_sum'], marker='.', linestyle='-')
ax.plot(timeseries_stats['7d_rolling_count_avg'], marker='.', linestyle='-')
ax.set_ylabel('Count')
ax.set_title('7d Rolling Events Count')
ax.legend(['Total', 'Average'])
plt.xticks(rotation=90)
# Set x-axis major ticks to weekly interval, on Mondays
ax.xaxis.set_major_locator(mdates.WeekdayLocator(byweekday=mdates.MONDAY))
# Format x-tick labels as 3-letter month name and day number
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'));
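
The locator and formatter lines recur in every time series plot in this section, so they could be wrapped in a small helper. The sketch below is one possible version; the function name `format_weekly_date_axis` is hypothetical, the data is made up, and it assumes the x-values are real datetimes, since the date locators do not interpret plain strings as dates.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd


def format_weekly_date_axis(ax):
    """Put a major tick on every Monday and label it like 'Jan 06'."""
    ax.xaxis.set_major_locator(mdates.WeekdayLocator(byweekday=mdates.MONDAY))
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%b %d"))
    ax.tick_params(axis="x", labelrotation=90)


# Toy daily series with a DatetimeIndex
dates = pd.date_range("2020-01-01", periods=28, freq="D")
series = pd.Series(range(28), index=dates)

fig, ax = plt.subplots(figsize=(11, 4))
ax.plot(series.index, series.values, marker=".", linestyle="-")
format_weekly_date_axis(ax)
```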

In [None]:
fig, ax = plt.subplots()
ax.plot(timeseries_stats['mean_pct_access'], marker='.', linestyle='-')
ax.plot(timeseries_stats['p90_pct_access'], marker='.', linestyle='-')
ax.set_ylabel('Access %')
# Only two series are plotted, so the legend lists two labels
ax.legend(['Mean', '90th Percentile'])
plt.xticks(rotation=90)

# ax.set_title('')
# Set x-axis major ticks to weekly interval, on Mondays
ax.xaxis.set_major_locator(mdates.WeekdayLocator(byweekday=mdates.MONDAY))
# Format x-tick labels as 3-letter month name and day number
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'));

In [None]:
fig, ax = plt.subplots()
ax.plot(timeseries_stats['7d_rolling_pct_access_mean'], marker='.', linestyle='-')
ax.plot(timeseries_stats['7d_rolling_pct_access_sum'], marker='.', linestyle='-')
ax.set_ylabel('Access %')
ax.legend(['7d Rolling Mean', '7d Rolling Sum'])
plt.xticks(rotation=90)

# ax.set_title('')
# Set x-axis major ticks to weekly interval, on Mondays
ax.xaxis.set_major_locator(mdates.WeekdayLocator(byweekday=mdates.MONDAY))
# Format x-tick labels as 3-letter month name and day number
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'));

In [None]:
fig, ax = plt.subplots()
ax.plot(timeseries_stats['mean_engagement_index'], marker='.', linestyle='-')
ax.plot(timeseries_stats['p90_engagement_index'], marker='.', linestyle='-')
ax.set_ylabel('Engagement Index')
ax.legend(['Mean', '90th Percentile'])
ax.set_title('Engagement Index')
plt.xticks(rotation=90)
# Set x-axis major ticks to weekly interval, on Mondays
ax.xaxis.set_major_locator(mdates.WeekdayLocator(byweekday=mdates.MONDAY))
# Format x-tick labels as 3-letter month name and day number
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'));

In [None]:
fig, ax = plt.subplots()
ax.plot(timeseries_stats['7d_rolling_engagement_index_mean'], marker='.', linestyle='-')
ax.plot(timeseries_stats['7d_rolling_engagement_index_sum'], marker='.', linestyle='-')
ax.set_ylabel('Engagement Index')
ax.legend(['7d Rolling Mean', '7d Rolling Sum'])
plt.xticks(rotation=90)
# Set x-axis major ticks to weekly interval, on Mondays
ax.xaxis.set_major_locator(mdates.WeekdayLocator(byweekday=mdates.MONDAY))
# Format x-tick labels as 3-letter month name and day number
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'));