# Basic exploration and submission using a classical multi-output linear model

Within this notebook the basic features of the data are explored, and a baseline model is then produced using linear regression.

Why linear regression? In short, because it is extremely simple and quick to train for this multi-output task, and it gets a first solution in place while we get to know the dataset. Although this is a classification task, the final submission requires output probabilities rather than hard integer labels, so we can make use of regression models instead of classifiers.
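
To illustrate this point on a toy problem, here is a minimal sketch using synthetic data (not the competition data; the names `X_toy`, `y_toy` and `toy_model` are made up for the example). A linear regression fitted on 0/1 targets returns continuous scores, which can be clipped into [0, 1] and read as rough probabilities:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# synthetic stand-in data: 200 samples, 5 features, 3 binary targets
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, :3] + rng.normal(scale=0.5, size=(200, 3)) > 1.0).astype(int)

# LinearRegression accepts a 2-D target matrix and fits all outputs at once
toy_model = LinearRegression().fit(X_toy, y_toy)
raw_scores = toy_model.predict(X_toy)

# raw scores are unbounded, so clip them before treating them as probabilities
toy_probs = np.clip(raw_scores, 0.0, 1.0)
print(raw_scores.shape, toy_probs.min(), toy_probs.max())
```

The clipped scores are not calibrated probabilities, which is one reason to expect a weak score from this baseline.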

This is likely a weak model to choose for obtaining a good score, and we could undoubtedly get much better results using more suitable deep neural network variants and/or ensemble methods.

## 1. Import dependencies and data

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import os
import numpy as np
import seaborn as sns

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_predict, cross_validate, cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline

from xgboost import XGBClassifier
from tqdm import tqdm

In [None]:
input_dir = '/kaggle/input/lish-moa'
train_features = pd.read_csv(os.path.join(input_dir, 'train_features.csv'))
train_targets_scored = pd.read_csv(os.path.join(input_dir, 'train_targets_scored.csv'))
train_targets_nonscored = pd.read_csv(os.path.join(input_dir, 'train_targets_nonscored.csv'))
test_features = pd.read_csv(os.path.join(input_dir, 'test_features.csv'))

In [None]:
train_features.shape, train_targets_scored.shape, train_targets_nonscored.shape, test_features.shape

## 2. Basic Exploratory Data Analysis

In [None]:
train_features.head(3)

We have only three categorical columns at the beginning, with the remainder of the columns being numerical.

In [None]:
cat_cols = ['cp_type', 'cp_time', 'cp_dose']

plt.figure(figsize=(16,4))

for idx, col in enumerate(cat_cols):
    # one panel per categorical column, laid out on a 1 x 3 grid
    plt.subplot(1, 3, idx + 1)
    counts = train_features[col].value_counts()
    sns.barplot(x=counts.index.values, y=counts.values)
    plt.xlabel(col)
    plt.ylabel('Count')
plt.tight_layout()
plt.show()

For 'cp_type', the 'ctl_vehicle' refers to samples treated with a control perturbation. For control perturbations, our targets are all zero, since they have no Mechanism of Action (MoA).

To deal with this, a good strategy is to identify the samples that are ctl_vehicle (by training a classification model or, more simply, by using the feature directly, since it is also present in the test data) and set all of their targets to zero. We can then process the test set accordingly: first set all targets to zero for every test instance that is a ctl_vehicle, then process the remaining instances normally using our trained model.

In [None]:
# select all indices when 'cp_type' is 'ctl_vehicle'
ctl_vehicle_idx = (train_features['cp_type'] == 'ctl_vehicle')

# evaluate number of 1s we have in the total train scores when cp_type = ctl_vehicle
train_targets_scored.loc[ctl_vehicle_idx].iloc[:, 1:].sum().sum()

The total sum is zero, which confirms the statement above that all targets are zero when cp_type is ctl_vehicle. The best thing to do is simply fill our targets with zero whenever this is the case.

We shall also remove all of these from the training set, since there is no need to unnecessarily complicate our model.

In [None]:
# take a copy of all our training sig_ids for reference
train_sig_ids = train_features['sig_id'].copy()

In [None]:
# drop cp_type column since we no longer need it
X = train_features.drop(['sig_id', 'cp_type'], axis=1).copy()
X = X.loc[~ctl_vehicle_idx].copy()

y = train_targets_scored.drop('sig_id', axis=1).copy()
y = y.loc[~ctl_vehicle_idx].copy()

In [None]:
X.shape, y.shape

In [None]:
X.head(3)

In [None]:
plt.figure(figsize=(8,5))
sns.distplot(X.iloc[:, 2:].mean())
plt.show()

In [None]:
plt.figure(figsize=(8,5))
sns.distplot(y.mean())
plt.show()

In [None]:
plt.figure(figsize=(8,5))
# drop the string sig_id column before averaging the target columns
sns.distplot(train_targets_nonscored.drop('sig_id', axis=1).mean())
plt.show()

In [None]:
train_targets_scored.head(3)

In [None]:
train_targets_nonscored.head(3)

In [None]:
y.sum().sort_values()[:30].plot.bar(figsize=(18,6))
plt.show()

Some output classes have only one positive instance in the entire training set. This is problematic: it is nowhere near enough data if we expect our models to make effective predictions across the whole range of targets. Techniques for imbalanced datasets, such as minority class over-sampling, may have to be introduced to help our models generalise better to new data.
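
As a rough illustration of what naive over-sampling could look like for multi-label targets, the sketch below duplicates rows that carry a rare label. Everything here is hypothetical: the toy frames, the column names `common_moa` and `rare_moa`, and the thresholds `min_count` and `n_copies` are assumptions chosen for the example, and this step is not applied later in the notebook.

```python
import pandas as pd

# toy multi-label target frame and a single toy feature (made-up values)
y_toy = pd.DataFrame({
    'common_moa': [1, 0, 1, 1, 0, 1],
    'rare_moa':   [0, 0, 0, 1, 0, 0],
})
X_toy = pd.DataFrame({'g-0': [0.1, -0.3, 0.5, 1.2, 0.0, -0.7]})

min_count = 2   # assumed threshold below which a target counts as rare
n_copies = 3    # assumed number of extra copies added per rare row

# targets with fewer than min_count positives, and the rows that carry them
rare_targets = y_toy.columns[y_toy.sum() < min_count]
rare_rows = y_toy.index[y_toy[rare_targets].any(axis=1)]

# repeat the rare rows and append them to both features and targets
extra_idx = rare_rows.repeat(n_copies)
X_over = pd.concat([X_toy, X_toy.loc[extra_idx]], ignore_index=True)
y_over = pd.concat([y_toy, y_toy.loc[extra_idx]], ignore_index=True)

print(list(rare_targets), len(X_over), y_over['rare_moa'].sum())
```

If something like this were used, the duplication would need to happen inside each cross-validation training fold only, otherwise copies of the same row would leak into the validation data.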

## 3. Preprocessing

Let's create our preprocessing functions, which fold in some of the exploration conducted above and start again from the raw datasets as they were imported.

This will include the following steps:

- Take a reference copy of the sig_id and cp_type values for our data
- Remove the above two features
- Remove all instances where cp_type == ctl_vehicle, since these will result in all zero targets
- Encode our two categorical features cp_time and cp_dose
- Standardise our numerical features prior to training
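
As a point of comparison, the same steps can be sketched with scikit-learn's `ColumnTransformer` on a small made-up frame. This is only an illustrative alternative, not the approach used in the rest of this notebook; the toy values and the names `toy`, `toy_X` and `preprocess` are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy frame mimicking the layout of train_features (values are made up)
toy = pd.DataFrame({
    'sig_id': ['id_a', 'id_b', 'id_c', 'id_d'],
    'cp_type': ['trt_cp', 'ctl_vehicle', 'trt_cp', 'trt_cp'],
    'cp_time': [24, 48, 72, 24],
    'cp_dose': ['D1', 'D2', 'D1', 'D2'],
    'g-0': [0.5, -0.2, 1.3, 0.1],
    'c-0': [-1.0, 0.4, 0.2, 0.9],
})

# keep the ids for reference, then drop control rows and the two id-like columns
is_ctl = toy['cp_type'] == 'ctl_vehicle'
toy_ids = toy['sig_id'].copy()
toy_X = toy.loc[~is_ctl].drop(['sig_id', 'cp_type'], axis=1)

cat_cols = ['cp_time', 'cp_dose']
num_cols = [c for c in toy_X.columns if c not in cat_cols]

# one-hot encode the categorical columns and standardise the numerical ones
preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols),
])
toy_ready = preprocess.fit_transform(toy_X)
print(toy_ready.shape)
```

One advantage of a fitted transformer is that the categories and scaling statistics learned on the training data are reused unchanged on the test data via `preprocess.transform`.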

In [None]:
class DataProcessor(BaseEstimator, TransformerMixin):
    """ Data preprocessor and loader """
    
    def __init__(self, rm_ctl_vehicle=True, std_features=True, encode_cat=True,
                 cat_cols=('cp_time', 'cp_dose')):
        self.rm_ctl_vehicle = rm_ctl_vehicle
        self.std_features = std_features
        self.cat_cols = cat_cols
        self.encode_cat = encode_cat
        
        
    def fit(self, X, y=None):

        return self
                
    
    def transform(self, X):
        """ Process features from our given data """
        
        new_df = self._remove_features(X).copy()
        
        if self.std_features:
            # placeholder: standardisation is currently applied outside this
            # class with a StandardScaler, so nothing happens here yet
            pass
        
        if self.encode_cat:
            
            # one-hot encode each categorical column; iterating in reverse keeps
            # the dummy columns in cat_cols order at the front of the frame
            for col in reversed(self.cat_cols):
                new_df = pd.concat([pd.get_dummies(new_df[col], prefix=col),
                                    new_df.drop(col, axis=1)], axis=1)
            
            
        return new_df
    
    
    def _remove_features(self, dataframe):
        """ Remove unwanted features from our dataframes """
        
        if self.rm_ctl_vehicle:
            new_df = dataframe.drop(['sig_id', 'cp_type'], axis=1)
        
        else:
            new_df = dataframe.drop('sig_id', axis=1)
        
        return new_df
    
    
    def _standardise_features(self, dataframe):
        pass

In [None]:
# take a copy of all our training sig_ids for reference
train_sig_ids = train_features['sig_id'].copy()

# select all indices when 'cp_type' is 'ctl_vehicle'
train_ctl_vehicle_idx = (train_features['cp_type'] == 'ctl_vehicle')

# initialise class and use to transform our dataset
data_processor = DataProcessor()

# remove unwanted feats and encode categorical
X = data_processor.fit_transform(train_features)

# remove cases where cp_type is ctl_vehicle from train features and targets
X = X.loc[~train_ctl_vehicle_idx].copy()
y = train_targets_scored.drop('sig_id', axis=1).copy()
y = y.loc[~train_ctl_vehicle_idx].copy()

# standardise our numerical columns only
std_scaler = StandardScaler()
num_cols = [x for x in X.columns.values if not x.startswith(('cp_time', 'cp_dose'))]
X_std = X.copy()
X_std[num_cols] = std_scaler.fit_transform(X.loc[:, num_cols])

In [None]:
X_std.head(3)

Good, our data is now ready for producing a basic set of predictions on the test set using a baseline model.

## 4. Basic Model Exploration

The problem with using basic machine learning models for this multi-output classification task is that many of them require one model per target. With 200+ targets, that means training 200+ individual models, which can take an extremely long time if we can't parallelise the training.

Conversely, a neural network can handle this type of complexity with just one unified model, which makes it much more preferable for this competition.

For this example, we'll fit a linear regression model and treat its outputs as probabilities. More advanced models would improve on this considerably, but the approach suffices as a baseline for this notebook.
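
The difference between the two approaches can be seen on a small synthetic problem. The sketch below is illustrative only (the data and the names `per_target` and `single` are made up): a `MultiOutputClassifier` wrapper fits one estimator per target column, whereas `LinearRegression` fits all targets in a single call.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(300, 8))

# four synthetic binary targets, each driven by one feature
y_toy = (X_toy[:, :4] > 0.5).astype(int)

# wrapper approach: one independent classifier is fitted per target column
per_target = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X_toy, y_toy)
print(len(per_target.estimators_))

# LinearRegression fits every target in one call, giving a
# coefficient matrix of shape (n_targets, n_features)
single = LinearRegression().fit(X_toy, y_toy)
print(single.coef_.shape)
```

With four targets the wrapper holds four fitted classifiers; with 200+ targets the same pattern would hold 200+.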

In [None]:
lin_reg = LinearRegression()

In [None]:
# evaluate using cross-validation
lin_reg = LinearRegression()
lr_val_preds = cross_val_predict(lin_reg, X_std, y, cv=5)

# to work out the log loss effectively, we flatten both arrays before computing it;
# predictions are clipped to [0, 1] first, since linear regression outputs are unbounded
lr_log_loss = log_loss(np.ravel(y), np.clip(np.ravel(lr_val_preds), 0, 1))
print(f"Log loss for our Linear Regression Model: {lr_log_loss:.5f}\n")
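
As a side note on the flattening step, the sketch below checks on made-up numbers that the log loss of the flattened arrays equals the binary cross-entropy computed per target column and then averaged. The arrays `y_true` and `y_prob` are toy data, not outputs of the model above.

```python
import numpy as np
from sklearn.metrics import log_loss

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=(50, 4))        # toy binary targets
y_prob = rng.uniform(0.05, 0.95, size=(50, 4))   # toy predicted probabilities

# flattened version, in the same style as the cell above
flat_loss = log_loss(np.ravel(y_true), np.ravel(y_prob))

# explicit version: binary cross-entropy per target column, then averaged
per_col = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)).mean(axis=0)
mean_col_loss = per_col.mean()

print(np.isclose(flat_loss, mean_col_loss))
```

The two agree because every column has the same number of rows, so a mean of column means is the same as a mean over all entries.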

The cross-validated log loss above is not too bad for such a simple model. Let's explore the impact of applying dimensionality reduction before the regression. Usually this results in a performance decrease; however, for datasets with a large number of redundant features (which could be the case here), we may actually get a performance increase. Let's see whether this is the case:

In [None]:
n_range = [1, 2, 5, 10, 25, 50, 100, 150, 200, 250]
log_losses = []

lin_reg = LinearRegression()

for n in n_range:
    pca = PCA(n_components=n)
    lr_model = Pipeline(steps=[('pca', pca), ('linear regression', lin_reg)])
    
    # evaluate using cross-validation
    lr_val_preds = cross_val_predict(lr_model, X_std, y, cv=5)

    # to work out the log loss effectively, we flatten both arrays before computing it;
    # predictions are clipped to [0, 1] first, since linear regression outputs are unbounded
    lr_log_loss = log_loss(np.ravel(y), np.clip(np.ravel(lr_val_preds), 0, 1))
    print(f"Log loss for Linear Regression with PCA (n={n}): {lr_log_loss:.5f}\n")
    
    log_losses.append(lr_log_loss)

In [None]:
plt.figure(figsize=(14,5))
sns.lineplot(x=n_range, y=log_losses)
plt.ylabel("Average Log Loss", weight='bold')
plt.xlabel("PCA n components", weight='bold')
plt.grid()
plt.show()

It's interesting in this case that our dimensionality reduction actually appears to improve the generalisation performance of our model. Perhaps we have many redundant features within our high dimensional dataset. Let's now train this on the entire training set and make a prediction on the test set for the competition.
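
One way to probe the redundancy hypothesis mentioned above is to look at how quickly the cumulative explained variance of a PCA saturates. The sketch below uses synthetic data built from three underlying signals, so the numbers say nothing about the competition data; `latent`, `mixing` and `cum_var` are names made up for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 40 observed features that are noisy linear mixtures of only 3 signals
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 40))
X_toy = latent @ mixing + 0.01 * rng.normal(size=(500, 40))

pca_toy = PCA(n_components=10).fit(X_toy)
cum_var = np.cumsum(pca_toy.explained_variance_ratio_)

# with highly redundant features, the first few components carry almost all the variance
print(cum_var[:5].round(3))
```

Running the same check on `X_std` would show how many components are needed before the curve flattens on the real features.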

## 5. Baseline model predictions on the test set

We first need to preprocess our test set inputs so that they are consistent with our training data.

In [None]:
# take a copy of all our test sig_ids for reference
test_sig_ids = test_features['sig_id'].copy()

# select all indices when 'cp_type' is 'ctl_vehicle'
test_ctl_vehicle_idx = (test_features['cp_type'] == 'ctl_vehicle')

X_test = data_processor.transform(test_features)

In [None]:
# standardise our numerical columns only using the training standard scaler
test_num_cols = [x for x in X_test.columns.values if not x.startswith(('cp_time', 'cp_dose'))]
X_test_std = X_test.copy()
X_test_std[test_num_cols] = std_scaler.transform(X_test.loc[:, test_num_cols])

In [None]:
lin_reg = LinearRegression()
pca = PCA(n_components=5)
lr_model = Pipeline(steps=[('pca', pca), ('linear regression', lin_reg)])
%time lr_model.fit(X_std, y)

In [None]:
%time test_preds = lr_model.predict(X_test_std)

We now need to update all of the predictions for cp_type == ctl_vehicle so that they are zero.

In [None]:
# sum of the raw predictions on ctl_vehicle rows, selected with the boolean mask
test_preds[test_ctl_vehicle_idx.values].sum()

In [None]:
# change all cp_type == ctl_vehicle predictions to zero
test_preds[test_ctl_vehicle_idx.values] = 0

# confirm all values now sum to zero for these instances
test_preds[test_ctl_vehicle_idx.values].sum()

In [None]:
# we have some values above 1 and below 0 - this needs amending since probs should only be 0-1
test_preds.max(), test_preds.min()

In [None]:
# in addition, let's set all negative values to 0.0 and all values above 1 to 1.0
test_preds[test_preds > 1.0] = 1.0
test_preds[test_preds < 0.0] = 0.0

# confirm these values are all corrected
test_preds.max(), test_preds.min()

In [None]:
lr_test_submission = pd.DataFrame({'sig_id' : test_sig_ids})
lr_test_submission[train_targets_scored.columns[1:]] = test_preds
lr_test_submission.head(3)

With this in the correct format, we can now save it and make a basic submission for the competition:

In [None]:
# save our submission as csv
lr_test_submission.to_csv('submission.csv', index=False)