# Tabular Playground Series June 2022

This notebook contains:

- A very brief EDA
- Using K-Fold cross-validation to assess the performance of the regression model used for imputation.
- A comparison of imputation techniques for each feature.
- Using LightGBM to impute missing values of F_4_X features.
- Using the mean values to impute F_1_X and F_3_X features.

View my EDA here: https://www.kaggle.com/code/cabaxiom/tps-jun-22-eda

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from lightgbm import LGBMRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

sns.set_style('darkgrid')

In [None]:
df = pd.read_csv("../input/tabular-playground-series-jun-2022/data.csv")

In [None]:
# Group features for later use

feature_cols = [i for i in df.columns if "F" in i]
float_cols = [i for i in df.columns if df[i].dtype == float]
int_cols = [i for i in df.columns if df[i].dtype == int and "F" in i]

F_1_cols = [i for i in df.columns if "F_1" in i]
F_2_cols = [i for i in df.columns if "F_2" in i]
F_3_cols = [i for i in df.columns if "F_3" in i]
F_4_cols = [i for i in df.columns if "F_4" in i]

F_123_cols = F_1_cols + F_2_cols + F_3_cols

In [None]:
print("There are: ", len(float_cols), "float features")
print("There are: ", len(int_cols), "integer features")

print("There are: ", len(F_1_cols), "F1 features")
print("There are: ", len(F_2_cols), "F2 features")
print("There are: ", len(F_3_cols), "F3 features")
print("There are: ", len(F_4_cols), "F4 features")

In [None]:
plt.subplots(figsize=(25,20))
sns.heatmap(df[feature_cols].corr(), annot= True, cmap="RdYlGn", fmt = '0.1f', vmin=-0.6, vmax=0.6, cbar=False);

In [None]:
plt.subplots(figsize=(25,20))
sns.heatmap(df[float_cols].corr(), annot= True, cmap="RdYlGn", fmt = '0.2f', vmin=-1, vmax=1, cbar=False);

**Observations:**

- The F_4_X features are only correlated with other F_4_X features.
- The integer (discrete) F_2_X features are only correlated with other F_2_X features. (Note: there are no missing values in the F_2_X features.)
- F_1_X and F_3_X features are not correlated with any other feature.
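
The heatmaps above are read by eye. As a rough complement, the strongly correlated pairs can also be listed programmatically. The sketch below is a minimal illustration on synthetic data: the helper `correlated_pairs`, the threshold of 0.3 and the toy columns are assumptions made for the example, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.3):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    pairs = pairs[pairs.abs() > threshold]
    return pairs.sort_values(key=np.abs, ascending=False)

# Toy data (hypothetical): two related "F_4" columns and one independent "F_1" column
rng = np.random.default_rng(0)
base = rng.normal(size=1000)
toy = pd.DataFrame({
    "F_4_0": base,
    "F_4_1": base + rng.normal(scale=0.5, size=1000),
    "F_1_0": rng.normal(size=1000),
})

strong_pairs = correlated_pairs(toy, threshold=0.3)
print(strong_pairs)
```

On the competition data, the same helper could be called as `correlated_pairs(df[float_cols])`, with the threshold adjusted to taste.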

**Insight:**
- Consider using a regression model to predict F_1_1 values. No other column is correlated with this feature, which suggests that the other features may be redundant for predicting it. However, it is still possible that:
    1. There is no overall correlation with the target feature, but a feature is still useful for predictions (high mutual information score).
    2. Other features are useful for predictions when we consider feature interactions.
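
To illustrate the first caveat, the sketch below builds a synthetic feature that has almost no linear correlation with a target yet carries a lot of information about it. It uses scikit-learn's `mutual_info_regression`; the data and variable names are made up for the example and are not taken from the competition data.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=2000)       # informative feature
noise = rng.normal(size=2000)           # unrelated feature
y = x ** 2 + rng.normal(scale=0.05, size=2000)  # symmetric, non-linear dependence on x

# Pearson correlation only measures linear association
pearson = np.corrcoef(x, y)[0, 1]

# Mutual information also picks up non-linear dependence
mi_x, mi_noise = mutual_info_regression(
    np.column_stack([x, noise]), y, random_state=0
)

print("Pearson correlation:", round(pearson, 3))
print("Mutual information (x):", round(mi_x, 3))
print("Mutual information (noise):", round(mi_noise, 3))
```

A near-zero correlation next to a clearly positive mutual information score would be a reason not to discard a feature on the basis of the heatmap alone.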

# K-Fold Cross-Validation

We take a single feature (here F_4_1), train a model to predict it from the other columns, and evaluate how good the model's predictions are. Once we have a good model, we can use it to impute the missing values.
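
For a quick check, the same five-fold evaluation can be sketched more compactly with scikit-learn's `cross_val_score`. This is only an illustration under stated assumptions: it uses synthetic data and `LinearRegression` as a stand-in model, and it does not collect feature importances or per-fold predictions the way the hand-written loop in this notebook does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (hypothetical stand-in for one F_4 column)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports negated MSE so that larger scores are always better
neg_mse = cross_val_score(
    LinearRegression(), X_toy, y_toy, cv=kfold, scoring="neg_mean_squared_error"
)
fold_mse = -neg_mse

print("MSE per fold:", fold_mse)
print("Mean MSE:", fold_mse.mean())
```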

In [None]:
y_col = "F_4_1"
y = df[y_col].dropna()
X = df.loc[y.index].drop(columns=[y_col,"row_id"]).reset_index(drop=True)
y = y.reset_index(drop=True)

In [None]:
model = LGBMRegressor(n_estimators = 50, learning_rate = 0.1, random_state=0, n_jobs=-1)

In [None]:
def k_fold_cv(model,X,y):
    kfold = KFold(n_splits = 5, shuffle=True, random_state = 0)

    feature_imp, y_pred_list, y_true_list, mse_list  = [],[],[],[]
    for fold, (train_index, val_index) in enumerate(kfold.split(X, y)):
        print("==fold==", fold)
        # kfold.split yields positional indices, so select rows with iloc
        X_train = X.iloc[train_index]
        X_val = X.iloc[val_index]

        y_train = y.iloc[train_index]
        y_val = y.iloc[val_index]

        model.fit(X_train,y_train)
        y_pred = model.predict(X_val)
            
        y_pred_list = np.append(y_pred_list, y_pred)
        y_true_list = np.append(y_true_list, y_val)

        mse_list.append(mean_squared_error(y_val,y_pred))
        print("MSE", mean_squared_error(y_val,y_pred))

        try:
            feature_imp.append(model.feature_importances_)
        except AttributeError: # if model does not have .feature_importances_ attribute
            pass # returns empty list
    return feature_imp, y_pred_list, y_true_list,mse_list, X_val, y_val

In [None]:
%%time
feature_imp, y_pred_list, y_true_list, mse_list, X_val, y_val = k_fold_cv(model=model,X=X,y=y)

In [None]:
print("Mean MSE:", np.mean(mse_list))

In [None]:
f,ax = plt.subplots(figsize=(8,8))
sns.histplot(y_pred_list, color="blue")
sns.histplot(y_true_list, color="red", alpha=0.2);

In [None]:
def fold_feature_importances(model_importances, column_names, model_name, n_folds = 5, ax=None, boxplot=False):
    importances_df = pd.DataFrame({"feature_cols": column_names, "importances_fold_0": model_importances[0]})
    for i in range(1,n_folds):
        importances_df["importances_fold_"+str(i)] = model_importances[i]
    importances_df["importances_fold_median"] = importances_df.drop(columns=["feature_cols"]).median(axis=1)
    importances_df = importances_df.sort_values(by="importances_fold_median", ascending=False)
    if ax is None:
        f, ax = plt.subplots(figsize=(15, 25))
    if not boxplot:
        ax = sns.barplot(data = importances_df, x = "importances_fold_median", y="feature_cols", color="blue")
        ax.set_xlabel("Median Feature importance across all folds");
    else:
        importances_df = importances_df.drop(columns="importances_fold_median")
        importances_df = importances_df.set_index("feature_cols").stack().reset_index().rename(columns={0:"feature_importance"})
        ax = sns.boxplot(data = importances_df, y = "feature_cols", x="feature_importance", color="blue", orient="h")
        ax.set_xlabel("Feature importance across all folds");
    plt.title(model_name)
    ax.set_ylabel("Feature Columns")
    return ax

In [None]:
f, ax = plt.subplots(figsize=(15, 20))
fold_feature_importances(model_importances = feature_imp, column_names = X_val.columns, model_name = "LGBM", n_folds = 5, ax=ax, boxplot=False);

**Observations:**
- Only the F_4 features are useful for predicting other F_4 columns.

# Comparing imputation methods

We now compare the prediction performance for each feature, investigating whether it is better to:

1. Use the mean value to impute.
2. Use the median value to impute.
3. Use all 0 values to impute.
4. Use a regression model to impute.
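
As background for the three constant baselines: among all constant predictions, the mean is the one that minimises MSE on the data it was computed from, and that minimum equals the variance of the data. The sketch below checks this on synthetic, skewed values (an assumption made for the example). In the comparison in this notebook the constants come from the training fold and are scored on the validation fold, so the ordering is expected but not strictly guaranteed.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
values = rng.exponential(scale=2.0, size=1000)  # skewed, so mean and median differ

def constant_mse(values, constant):
    """MSE of predicting the same constant for every value."""
    return mean_squared_error(values, np.full(len(values), constant))

mse_mean = constant_mse(values, values.mean())
mse_median = constant_mse(values, np.median(values))
mse_zero = constant_mse(values, 0.0)

print("MSE (mean):  ", mse_mean)
print("MSE (median):", mse_median)
print("MSE (zeros): ", mse_zero)
print("Variance:    ", values.var())
```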

In [None]:
def compare_methods_fold(model,X,y):
    kfold = KFold(n_splits = 5, shuffle=True, random_state = 0)

    for fold, (train_index, val_index) in enumerate(kfold.split(X, y)):
        if fold < 1: # only evaluate 1/5 folds to save time
            # kfold.split yields positional indices, so select rows with iloc
            X_train = X.iloc[train_index]
            X_val = X.iloc[val_index]

            y_train = y.iloc[train_index]
            y_val = y.iloc[val_index]

            model.fit(X_train,y_train)

            y_pred_model = model.predict(X_val)
            y_pred_mean = np.full(len(y_val), y_train.mean())
            y_pred_median = np.full(len(y_val), y_train.median())
            y_pred_zeros = np.zeros(len(y_val))
            
            mse_model =  mean_squared_error(y_val,y_pred_model)
            mse_mean =  mean_squared_error(y_val,y_pred_mean)
            mse_median =  mean_squared_error(y_val,y_pred_median)
            mse_zeros =  mean_squared_error(y_val,y_pred_zeros)

    return [mse_model, mse_mean, mse_median, mse_zeros]

In [None]:
def compare_methods():
    mse_lists = []
    for y_col in float_cols:
    
        y = df[y_col].dropna()
        X = df.loc[y.index].drop(columns=[y_col,"row_id"]).reset_index(drop=True)
        y = y.reset_index(drop=True)
        
        mse_lists.append(compare_methods_fold(model,X,y))
        
    new_df = pd.DataFrame(mse_lists, columns=["model","mean","median","zeros"], index=float_cols)
    return new_df

In [None]:
comparison_df = compare_methods()
comparison_df["best_method"] = comparison_df.idxmin(axis=1)
comparison_df

# Inference

Now we perform our actual imputations on the missing values. Based on our experiments we decide to:

- For imputing F_1 and F_3 columns we use the mean value.
- For imputing F_4 columns we use a regression model which is only fit to other F_4 columns.

In [None]:
submission = pd.read_csv("../input/tabular-playground-series-jun-2022/sample_submission.csv")

In [None]:
def impute():
    """
    This function takes each feature one at a time, e.g. F_1_0, F_1_1, ... , F_4_14, to be used as y (the column to predict).
    
    The method of prediction depends on whether the column we are predicting is an F_1, F_3 or F_4 column.
    For predicting F_1 and F_3 columns the mean y value is used as the prediction.
    For F_4 columns we train a model on the relevant columns (other F_4 columns).
    
    Once predictions have been made the relevant rows in the submission DataFrame are set to the predictions
    """
    for y_col in float_cols:
        
        y = df[y_col].dropna() #non-missing y values 
        X = df.loc[y.index].drop(columns=[y_col,"row_id"]).reset_index(drop=True) #corresponding non-missing X values
        y = y.reset_index(drop=True)
        
        # Rows where the target column is missing; predictions are made for these rows
        X_test = df.loc[df[y_col].isna()].drop(columns=[y_col,"row_id"])
        
        if (y_col[2] == "1") or (y_col[2] == "3"):
            # Predictions are the mean value of that column (mean impute)
            preds = np.full(len(X_test), y.mean()) 
            
        elif y_col[2] == '4':
            # Predictions are based on the regression model
            model = LGBMRegressor(n_estimators = 80000, learning_rate = 0.1, random_state=0, max_bins=511, n_jobs=-1)
            X = X.drop(columns = F_123_cols) # Remove redundant columns
            model.fit(X,y)
            preds = model.predict(X_test.drop(columns = F_123_cols ))
        
        submission.loc[submission["row-col"].str.endswith(y_col), "value"] = preds
        

In [None]:
%%time
impute()

In [None]:
submission.to_csv('submission.csv', index = False)
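
The assignment in `impute()` relies on the submission rows for a column appearing in the same order as the missing rows in `df`. A more defensive variant is to build explicit keys and map them onto the submission. The sketch below shows the idea on a toy frame; it assumes that the `row-col` keys have the form `<row_id>-<column>`, which is inferred from the `str.endswith` match above rather than stated in this notebook, and it uses mean imputation for every toy column.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with a few missing cells
data = pd.DataFrame({
    "row_id": [0, 1, 2],
    "F_1_0": [1.0, np.nan, 3.0],
    "F_1_1": [np.nan, 5.0, np.nan],
}).set_index("row_id")

filled = data.fillna(data.mean())  # mean imputation on the toy columns

# Boolean Series indexed by (row_id, column); keep only the originally missing cells
missing = data.isna().stack()
missing = missing[missing]

# Lookup from "<row_id>-<column>" (assumed key format) to the imputed value
lookup = pd.Series(
    [filled.loc[row, col] for row, col in missing.index],
    index=[f"{row}-{col}" for row, col in missing.index],
)

# Toy stand-in for sample_submission.csv, deliberately in a different order
toy_submission = pd.DataFrame({"row-col": ["2-F_1_1", "0-F_1_1", "1-F_1_0"]})
toy_submission["value"] = toy_submission["row-col"].map(lookup)
print(toy_submission)
```

Because the values are matched by key, the result does not depend on the row order of either frame.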