In [None]:
import pandas as pd
import numpy as np
from IPython.display import display
import gc
import os
import joblib as jb
from tqdm.auto import tqdm
import seaborn as sns
from sklearn import model_selection, pipeline, preprocessing, linear_model, metrics

# Introduction

The work presented in this notebook has the following structure:
- **preprocessing**: the available data are cleaned by removing or filling missing values and parsing the remaining ones, then the results are combined into a single dataset;
- **data analysis**: the resulting dataset is analyzed to investigate the relationship between engagement data and products, school district characteristics and the state policies adopted to contain the Covid-19 epidemic;
- **conclusions**: the observations from the data analysis are summarized at the end of the work.

Detailed descriptions of the datasets are not provided here; the discussion focuses on the steps performed.
Further details can be found in the competition data and at the links listed in the **references** section at the end of this notebook.

# Preprocessing

The available data concern the **digital products** offered to students, the **school districts**, the **state policies** adopted to fight the Covid-19 epidemic and the **engagement data** describing how district members use the digital products.

Each dataset is preprocessed by handling null values, encoding labels as numerical values and reducing memory usage by converting columns to smaller data types.
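
As a minimal sketch of the memory-reduction step, the snippet below picks the narrowest unsigned integer type that can hold a column's values. The helper name `smallest_uint` and the toy values are assumptions made for illustration; the conversions actually used appear in the cells of each subsection.

```python
import numpy as np
import pandas as pd

def smallest_uint(col):
    """Return the narrowest unsigned dtype able to hold every value in col (hypothetical helper)."""
    lo, hi = int(col.min()), int(col.max())
    if lo < 0:
        raise ValueError("negative values need a signed dtype")
    for dtype in (np.uint8, np.uint16, np.uint32, np.uint64):
        if hi <= np.iinfo(dtype).max:
            return np.dtype(dtype)
    raise OverflowError("value does not fit in 64 bits")

codes = pd.Series([0, 3, 7, 250])            # toy encoded labels
small = codes.astype(smallest_uint(codes))   # downcast to the narrowest dtype found
print(small.dtype, codes.memory_usage(index=False), small.memory_usage(index=False))
```

The same idea extends to signed types, provided both the minimum and the maximum are checked against the type's range.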

In [None]:
def make_dir(path):
    # create the directory (and any missing parents) only if it does not exist yet
    os.makedirs(path, exist_ok=True)

In [None]:
# encoders will be saved in a separate directory
make_dir("encoders")

## Products

The products dataset identifies the digital content available to students and reports each service's sector of application and primary function.

In [None]:
prod = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/products_info.csv", index_col=0).sort_index() # product id as index
prod.index.name = "id"
prod = prod.rename(columns={
    "URL": "url",
    "Product Name": "name",
    "Provider/Company Name": "provider",
    "Sector(s)": "sector",
    "Primary Essential Function": "function"
})
prod = prod.filter(["sector", "function"])
prod.head()

### Null values

In [None]:
prod.isnull().sum()

In [None]:
prod = prod.fillna(value="null")

In [None]:
prod.isnull().any(axis=None)

### Labels encoding

In [None]:
def encode_labels(df, cols=None):
    # encode every column when no subset is given (avoids a mutable default argument)
    if not cols:
        cols = df.columns.tolist()
    encoders = []
    for col in cols:
        # fit one encoder per column and replace its labels with integer codes
        encoder = preprocessing.LabelEncoder().fit(df[col].values)
        df[col] = encoder.transform(df[col].values)
        encoders.append(encoder)
    return df, encoders

def print_encoders(encoders, cols):
    display(pd.concat([pd.DataFrame({col: encoders[i].classes_}) for i,col in enumerate(cols)], axis=1))
    
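def save_encoders(name, encoders, cols, folder="encoders"):
    # prefix each file with the dataset name so that encoders of different datasets do not overwrite each other
    for i,col in enumerate(cols):
        jb.dump(encoders[i], "{}/{}_{}_encoder.pkl".format(folder,name,col))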

In [None]:
cols = prod.columns.tolist()
prod, encoders = encode_labels(prod, cols)
print_encoders(encoders, cols)
save_encoders("products", encoders, cols)
del encoders

In the previous table `NaN` values are only padding and should be ignored, while `null` is a valid label (the one used to fill missing values in the dataset).
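
The padding can be reproduced on a toy example: when the columns have a different number of classes, concatenating the class lists side by side leaves the shorter ones filled with `NaN`. The dataframe below is made up for illustration and does not come from the products dataset.

```python
import pandas as pd
from sklearn import preprocessing

# toy data: "sector" has three distinct labels, "function" only two
toy = pd.DataFrame({
    "sector": ["PreK-12", "Corporate", "PreK-12", "null"],
    "function": ["LC", "LC", "CM", "LC"],
})

# one fitted encoder per column
encoders = {col: preprocessing.LabelEncoder().fit(toy[col]) for col in toy.columns}

# classes side by side: the shorter column is padded with NaN
table = pd.concat([pd.DataFrame({col: enc.classes_}) for col, enc in encoders.items()], axis=1)
print(table)
```

The row position of each label in the table is the integer code assigned by the encoder.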

### Memory usage

In [None]:
prod.describe(percentiles=[]).loc["max"]

In [None]:
prod = prod.astype("uint8")

In [None]:
_ = gc.collect()

## Districts

This dataset identifies school districts and provides information about location, ethnic diversity, economic conditions, quality of internet access and public funds provided.

In [None]:
dist = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv", index_col=0).sort_index() # district id as index
dist.index.name = "id"
dist.head()

The intervals cannot be used as they are: they need to be parsed into numbers, so each interval is represented by its middle value and its length.
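
As a worked example of this representation, the sketch below converts a single interval string into its center and length. The function name and the sample string are assumptions chosen for illustration.

```python
def interval_to_center_length(text):
    """Parse a string such as "[14000, 16000[" into (center, length)."""
    # drop the brackets, split on the comma and convert both bounds to float
    low, high = (float(part) for part in text.strip("[]").split(","))
    return (low + high) / 2, high - low

center, length = interval_to_center_length("[14000, 16000[")
print(center, length)  # 15000.0 2000.0
```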

In [None]:
def parse_intervals(col):
    arr = []
    for x in col:
        if type(x)!=str:
            arr += [[np.nan] * 2]
            continue
        x = x.strip("[]").split(",")
        x = [float(n.strip(" ")) for n in x]
        center = (x[0] + x[1]) / 2
        length = x[1] - x[0]
        arr += [[center, length]]
    arr = np.array(arr)
    return arr

In [None]:
for col in ["pct_black/hispanic", "pct_free/reduced", "county_connections_ratio", "pp_total_raw"]:
    df = pd.DataFrame(parse_intervals(dist.loc[:,col]), index=dist.index, columns=[col + "_center", col + "_length"])
    dist = dist.join(df).drop(columns=col)
dist.head()

### Null values

In [None]:
dist.isnull().sum()

Rows whose values are all null are dropped, while the remaining null values are filled with the median of the districts sharing the same `state` and `locale`.
If that group does not provide a usable value, the condition is relaxed to the same `locale` only.
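
The two-level fallback can be illustrated on a toy dataframe. This sketch uses `groupby(...).transform("median")`, which is a different technique from the lookup table built in the cells below, and all values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "state":  ["Utah", "Utah", "Utah", "Ohio", "Ohio"],
    "locale": ["City", "City", "Rural", "Rural", "City"],
    "value":  [1.0, np.nan, np.nan, 4.0, 3.0],
})

# first choice: median of the rows sharing both state and locale
by_state_locale = toy.groupby(["state", "locale"])["value"].transform("median")
# fallback: median of the rows sharing only the locale
by_locale = toy.groupby("locale")["value"].transform("median")

toy["value"] = toy["value"].fillna(by_state_locale).fillna(by_locale)
print(toy)
```

Here the second row is filled from its (Utah, City) group, while the third row has no usable (Utah, Rural) median and falls back to the Rural median.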

In [None]:
dist = dist.dropna(how="all")

In [None]:
dist.isnull().sum()

In [None]:
fillna_values = dist.groupby(["state", "locale"]).median().fillna(value=dist.groupby("locale").median())
fillna_values

In [None]:
dist = dist.apply(lambda row: row.fillna(value=fillna_values.loc[(row.state,row.locale),:]), axis=1)
del fillna_values

In [None]:
dist.isnull().any(axis=None)

### Labels encoding

States also appear in the **Covid-19 US States Policy** dataset; for this reason only `locale` is encoded for now, and `state` will be encoded later together with the policy dataset's states.

In [None]:
cols = ["locale"]
dist, encoders = encode_labels(dist, cols)
print_encoders(encoders, cols)
save_encoders("districts", encoders, cols)
del encoders

### Memory usage

In [None]:
dist.describe(percentiles=[])

In [None]:
# pp_total_raw_center can be represented as int?
(dist.pp_total_raw_center*10%10).sum() == 0

In [None]:
dist = dist.astype({
    "locale": "uint8",
    "pp_total_raw_center": "uint16",
    "pp_total_raw_length": "uint16"
})

In [None]:
_ = gc.collect()

## Covid-19 US States Policy

The state policy dataset contains various pieces of information about the states' legislative response to the Covid-19 epidemic.

In [None]:
dtypes = pd.read_csv("../input/covid19-us-state-policy-database/dtypes.csv", index_col=0).squeeze("columns") # single column -> Series
date_cols = dtypes.loc[dtypes == "datetime64[ns]"].index.tolist()
dtypes = dtypes.loc[dtypes != "datetime64[ns]"].to_dict()

pol = pd.read_csv("../input/covid19-us-state-policy-database/data.csv", index_col=0, dtype=dtypes, parse_dates=date_cols)
pol = pol.drop(columns=["POSTCODE", "FIPS"])

del dtypes; del date_cols
pol.head()

### Feature selection

To keep the data manageable and the analysis interpretable, only the features directly related to school activity are considered.

The variables of interest are found by searching for the word _school_ in the feature descriptions.

In [None]:
definitions = pd.read_csv("../input/covid19-us-state-policy-database/definitions.csv", index_col=0)
# keep the features whose description mentions "school" or "School"; missing descriptions count as no match
features = definitions.index[definitions.Description.str.contains("[Ss]chool", na=False)].tolist()
pol = pol.filter(features)

pd.set_option("display.max_colwidth", 400)
display(definitions.loc[pol.columns, ["Description"]])
pd.set_option("display.max_colwidth", 0)
display(pol.head())
del definitions; del features

### Null values

`NaT` values are not a problem: later on, the datetime features are joined to the engagement data using the condition `engagement date >= policy date`, which evaluates to `False` when the policy date is missing.
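
The behaviour of comparisons involving `NaT` can be checked on a small example; the dates below are made up for illustration.

```python
import pandas as pd

engagement_date = pd.Series(pd.to_datetime(["2020-03-01", "2020-09-01", "2020-09-01"]))
policy_date = pd.Series(pd.to_datetime(["2020-03-16", "2020-03-16", None]))  # last policy never enacted

# True only when a policy date exists and the engagement date is on or after it
in_force = engagement_date >= policy_date
print(in_force.tolist())  # [False, True, False]
```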

In [None]:
null_df = pd.concat([pol.isnull().sum(), pol.dtypes], axis=1).rename(columns={0: "null", 1: "type"})
null_df = null_df.loc[null_df.null!=0]
null_df = null_df.loc[null_df.type!="datetime64[ns]"]
display(null_df)
del null_df

None of the non-datetime columns contains `NaN` values.

### Labels encoding

In [None]:
print("Districts states: {}".format(dist.state.sort_values().unique().size))
print("Policies states: {}".format(pol.index.sort_values().unique().size))

The policy dataset contains more states than the districts dataset, so its states can be used to encode both dataframes.

In [None]:
# district states that are missing from the policy dataset
dist.loc[~dist.state.isin(pol.index), "state"].unique().tolist()

The districts dataset lists _District of Columbia_ as a state, while the policy database does not.
To make the two datasets compatible, _District of Columbia_ is replaced with the state of _Virginia_.

In [None]:
dist = dist.replace("District Of Columbia", "Virginia")
dist.state.isin(pol.index.unique().tolist()).all()

In [None]:
pol = pol.reset_index()

cols = pol.dtypes.loc[lambda x: x == "object"].index.tolist()
pol, encoders = encode_labels(pol, cols)
print_encoders(encoders, cols)
save_encoders("policy", encoders, cols)

dist.state = encoders[cols.index("STATE")].transform(dist.state.values)
jb.dump(encoders[cols.index("STATE")], "encoders/districts_state_encoder.pkl")

pol = pol.set_index("STATE")
del encoders

The `NaN` values in the previous table are only padding and can be ignored.

### Memory usage

In [None]:
def reduce_memory(col):
    if col.dtype == "int64":
        x_min = col.min()
        x_max = col.max()
        dtype = "int"
        if x_min >= 0:
            dtype = "u" + dtype
        for b in [8,16,32,64]:
            # signed types must accommodate both the minimum and the maximum
            if (dtype == "uint" and x_max < 2**b) or (dtype == "int" and -2**(b-1) <= x_min and x_max < 2**(b-1)):
                dtype += "{}".format(b)
                break
        if dtype == "int" or dtype == "uint":
            raise OverflowError
        return col.astype(dtype)
    else:
        return col

In [None]:
pol = pol.apply(reduce_memory)
pol.info()

In [None]:
_ = gc.collect()

## Engagement data

The engagement data provide the access percentage and the engagement index, which describe how each district's students use the digital products offered.

In [None]:
dist_id = pd.read_csv("../input/learnplatform-covid19-impact-on-digital-learning/districts_info.csv", usecols=["district_id"]).squeeze("columns") # single column -> Series

eng = jb.Parallel(n_jobs=-1)(jb.delayed(pd.read_csv)("../input/learnplatform-covid19-impact-on-digital-learning/engagement_data/{:.0f}.csv".format(i), parse_dates=["time"]) for i in dist_id.values)
for i,df in enumerate(eng):
    eng[i].insert(loc=1, column="district_id", value=[dist_id.iloc[i]] * df.shape[0])

del dist_id

eng = pd.concat(eng, axis=0, ignore_index=True).rename(columns={"time": "date", "lp_id": "product_id"}).sort_values("date", ignore_index=True)
eng.head()

### Null values

In [None]:
eng.isnull().sum()

In [None]:
eng = eng.dropna().reset_index(drop=True)
eng.product_id = eng.product_id.astype(int)

In [None]:
eng.isnull().any(axis=None)

## Final dataset

At this point the processed data can be joined into a single dataset.

In [None]:
dist = dist.rename(columns={col: "district_" + col for col in dist.columns})
df = eng.join(dist, on="district_id")
del dist; del eng

In [None]:
df.isnull().sum()

Rows without a matching district are dropped.

In [None]:
df = df.dropna(); gc.collect()
df.isnull().any(axis=None)

In [None]:
prod = prod.rename(columns={col: "product_" + col for col in prod.columns})
df = df.join(prod, on="product_id")
del prod

In [None]:
df.isnull().sum()

As for districts, rows without a matching product are dropped.

In [None]:
df = df.dropna(); gc.collect()
df.isnull().any(axis=None)

As anticipated, the datetimes of the policy dataset are converted to boolean values using the condition `engagement date >= policy date`.

In [None]:
pol = pol.rename(columns={col: "policy_" + col for col in pol.columns})

# join datasets
df = df.join(pol, on="district_state")

# parse datetimes: True when the engagement date is on or after the policy date (NaT gives False)
cols = df.dtypes.loc[lambda x: x == "datetime64[ns]"].drop("date").index.tolist()
df[cols] = df[cols].apply(lambda col: df.date >= col)

# date is not necessary for later data analysis
df = df.drop(columns="date")

del pol

In [None]:
df.isnull().sum()

In [None]:
print("Final data shape: {}".format(df.shape))

In [None]:
_ = gc.collect()

# Data analysis

Among the features of the final dataset, `pct_access` and `engagement_index` can be considered the target variables to be predicted by a model.
Building a model that predicts the target values accurately may not be easy, but since the interest of this work lies mostly in how the other features affect the targets, a linear model can be applied and its coefficients read as the contribution of each variable.

In general not all the included predictors contribute significantly to the targets; for this reason **lasso regression** is applied to the two target variables, and feature selection is performed by the model's regularization ($\ell_1$ penalty).
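
For reference, the lasso objective as documented for scikit-learn's estimators can be sketched as follows. The symbols are introduced here only for illustration: $n$ is the number of training rows, $X$ the matrix of scaled features, $y$ one target, $w$ the coefficient vector and $\alpha$ the regularization strength that `LassoCV` selects by cross-validation; the intercept is omitted for brevity.

<center>$ \min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 $</center>

The $\ell_1$ term can shrink coefficients to exactly zero, which is how weakly contributing predictors get excluded.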

The model is fitted on a subset of the original dataset, which is split into training and validation dataframes: $70\%$ of the rows for training and $30\%$ for validation.

Lasso is applied after scaling features as

<center>$ x' = \frac{x-\mu}{\sigma} $</center>

with $\mu = \bar{x}$ the mean and $\sigma$ the standard deviation of $x$.
In this way every scaled feature has unit variance (i.e. $\sigma' = 1$) and the resulting coefficients can be compared directly with one another.
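
The scaling can be checked numerically: `StandardScaler` subtracts the column mean and divides by the population standard deviation. The toy column below is an assumption used only for this check.

```python
import numpy as np
from sklearn import preprocessing

x = np.array([[1.0], [2.0], [3.0], [6.0]])        # toy single-feature column
scaled = preprocessing.StandardScaler().fit_transform(x)

# same transformation written out by hand (np.std defaults to ddof=0)
manual = (x - x.mean()) / x.std()

print(np.allclose(scaled, manual), scaled.mean(), scaled.std())
```

After scaling, the column has zero mean and unit standard deviation up to floating-point error.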

In [None]:
train, test = model_selection.train_test_split(df, train_size=0.7, random_state=1, shuffle=True)
del df
print("Train shape: {}".format(train.shape))
print("Test shape: {}".format(test.shape))

In [None]:
def get_x(df):
    x = df.drop(columns=["pct_access", "engagement_index"])
    return x

def get_y(df, target=None):
    y = df.loc[:,["pct_access", "engagement_index"]]
    if target is not None:
        y = y.iloc[:,target]
    return y

def lasso(df, target=0):
    model = pipeline.Pipeline([
        ("scaler", preprocessing.StandardScaler()),
        ("estimator", linear_model.LassoCV(random_state=1, n_jobs=-1))
    ])
    # fit on the dataframe passed as argument rather than on the global `train`
    model = model.fit(get_x(df), get_y(df,target))
    #print("CV Lasso alpha: {:.2f}".format(model.named_steps["estimator"].alpha_))
    coef = pd.Series(model.named_steps["estimator"].coef_, index=get_x(df).columns.tolist())
    return model, coef

In [None]:
models = []
coefs = []
for i in [0,1]:
    model, coef = lasso(train, target=i)
    models += [model]
    coefs += [coef]
coefs = pd.concat(coefs, axis=1)
coefs.columns = ["pct_access", "engagement_index"]
coefs

Most coefficients are larger than $10^{-2}$ in absolute value; for this reason a threshold $\eta = 10^{-3}$ is used to exclude the predictors that contribute least to the target values.
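
As a small illustration of the thresholding rule, the sketch below keeps a feature only when its coefficient reaches $\eta$ in absolute value for both targets. The feature names and coefficient values are hypothetical.

```python
import pandas as pd

toy_coefs = pd.DataFrame(
    {"pct_access": [0.05, 0.0, -0.02], "engagement_index": [1.2, 0.3, 0.0004]},
    index=["feature_a", "feature_b", "feature_c"],  # hypothetical feature names
)

eta = 1e-3
# a row survives only if |coefficient| >= eta in every column
kept = toy_coefs.loc[(toy_coefs.abs() >= eta).all(axis=1)]
print(kept.index.tolist())  # ['feature_a']
```

Requiring the threshold on both targets is the stricter choice: a feature zeroed by the lasso for one target is discarded even if it matters for the other.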

In [None]:
def plot_coefs_thresh(coefs, thresh, x):
    coefs_thresh = coefs.loc[(coefs.abs() >= thresh).all(axis=1)]
    plt = sns.barplot(data=coefs_thresh, x=x, y=coefs_thresh.index)
    return coefs_thresh, plt

In [None]:
coefs_thresh, plt = plot_coefs_thresh(coefs, thresh=1e-3, x="pct_access")
plt

In [None]:
plot_coefs_thresh(coefs, thresh=1e-3, x="engagement_index")[1]

The trained models are not suitable for predicting engagement data, as shown by their high mean squared errors.

In [None]:
err = pd.DataFrame(
    {
        ("MSE", target): [metrics.mean_squared_error(y_true=get_y(df,target=i), y_pred=models[i].predict(get_x(df))) for df in [train,test]]
        for i,target in enumerate(["pct_access", "engagement_index"])
    },
    index=["train","test"]
)
err

The following subsections analyze the coefficients obtained, considering one group of predictors (districts, products, policies) at a time.

## District

In [None]:
def plot_coefs_subset(coefs, sub, x):
    coefs_sub = coefs.loc[coefs.index.to_series().apply(lambda x: sub in x).loc[lambda x: x].index.tolist()]
    plt = sns.barplot(data=coefs_sub, x=x, y=coefs_sub.index)
    return coefs_sub, plt

def plot_corr(df, coefs):
    plt = sns.heatmap(data=df.loc[:,coefs.index.tolist()].corr(), vmin=-1, vmax=1, cmap="bwr")
    return plt

In [None]:
coefs_sub, plt = plot_coefs_subset(coefs_thresh, sub="district", x="pct_access")
plt

In [None]:
plot_coefs_subset(coefs_thresh, sub="district", x="engagement_index")[1]

In [None]:
plot_corr(train, coefs_sub)

Looking at the previous plots, some observations can be made:
- the public funds provided to districts are the feature that contributes most to the target values;
- a better internet connection comes with better engagement;
- the percentage of black and hispanic students, or of students receiving free or reduced-price meals (i.e. students in a worse economic situation), affects engagement negatively.

The correlation plot also shows that the percentage of black/hispanic students is correlated with the percentage of students receiving free or reduced-price meals.

## Product

In [None]:
coefs_sub, plt = plot_coefs_subset(coefs_thresh, sub="product", x="pct_access")
plt

In [None]:
plot_coefs_subset(coefs_thresh, sub="product", x="engagement_index")[1]

In [None]:
plot_corr(train, coefs_sub)

The previous plots show `product_sector` as the most important of the product features.

## Covid-19 policy

In [None]:
coefs_sub, plt = plot_coefs_subset(coefs_thresh, sub="policy", x="pct_access")
plt

In [None]:
plot_coefs_subset(coefs_thresh, sub="policy", x="engagement_index")[1]

In [None]:
plot_corr(train, coefs_sub)

Definitions of policy features are reported again in the following table.

In [None]:
definitions = pd.read_csv("../input/covid19-us-state-policy-database/definitions.csv", index_col=0)
pd.set_option("display.max_colwidth", 400)
display(definitions.loc[coefs_sub.index.to_series().apply(lambda x: x[len("policy_"):]).tolist(), ["Description"]])
pd.set_option("display.max_colwidth", 0)
del definitions

The lasso coefficients show that:
- school closures increase the access percentage to digital resources but decrease the engagement index;
- vaccination of school employees reduces both the access percentage and the engagement index;
- `policy_UICLDCR` has a positive effect on engagement, probably because it affects the economic condition of families.

In [None]:
# dump results
train.to_csv("train.csv")
test.to_csv("test.csv")
jb.dump(models[0], "model_pct_access.pkl")
jb.dump(models[1], "model_engagement_index.pkl")
coefs.to_csv("model_coefs.csv")

In [None]:
# free memory
del train; del test; del models; del coefs; del coefs_thresh; del coefs_sub
_ = gc.collect()

# Conclusions

The previous analysis highlights that more public funds, better access to internet connections and more social support to improve the economic conditions of families could make a difference in improving student engagement and achieving a better education.

The Covid-19 epidemic had, and continues to have, a strong impact on school activity and efficiency.
School closures increase digital access but decrease students' effort to learn.
Ultimately, vaccination of school staff is the way to restart school activity and bring the situation back to the normal routine.

# References

- [Competition reference](https://www.kaggle.com/c/learnplatform-covid19-impact-on-digital-learning)
- [Covid-19 US States Policy repository](https://github.com/USCOVIDpolicy/COVID-19-US-State-Policy-Database)
- [Lasso regression](https://en.wikipedia.org/wiki/Lasso_(statistics))