# A Visual Guide to Mechanism of Action #


![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fbiologydictionary.net%2Fwp-content%2Fuploads%2F2017%2F05%2FAgonist-and-Antagonist.jpg&f=1&nofb=1)
Source: biologydictionary.net

**Basic Idea**

This competition attempts to predict how drug molecules will affect different proteins on a cell.  As shown in the diagram above, drugs usually work by binding to a receptor and upregulating (agonist) or downregulating (antagonist) the production of some downstream cellular activity.

If we know that a disease affects a particular receptor or downstream cellular activity, we can develop drugs faster when we can predict, from gene expression and cell viability responses, which receptor sites a drug acts on.  

**Get started with these discussions and kernels**
* **"Competition Insights" by Matthew Masters** - This post pulls together some of the best kernels, literature reviews, GitHub pages, and insights available on the competition so far.   https://www.kaggle.com/c/lish-moa/discussion/184005
* **MoA EDA by HeadsorTails** - The master continues to share his inspiring insights and EDA for the rest of us.  Check out his beautiful EDA here - https://www.kaggle.com/headsortails/explorations-of-action-moa-eda
* **Amin's beautiful explanation of the datasets and exploration** - https://www.kaggle.com/amiiiney/drugs-classification-mechanisms-of-action

In [None]:
# Kaggle Comments:

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

SHOW_DIRS = False

if SHOW_DIRS:
    import os
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            print(os.path.join(dirname, filename))

**Pipeline Project Parameters**

In [None]:
# Number these with headings... 1.1, 2.4, 3.6

# Preliminary / Exploratory Data Analysis

LOAD_DATA = True               # 1.0
VERIFY_DATA = True             # 1.1
DO_PANDAS_PROFILING = False    # 1.2
DO_EDA = True                  # 1.4
SHOW_CORR = True               # 1.42


**Imports**

In [None]:
import random
from typing import Callable

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.model_selection import KFold
from sklearn.metrics import log_loss

#
from scipy.stats import spearmanr

# https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.ensemble import GradientBoostingRegressor, RandomForestClassifier, ExtraTreesClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC

# ML Visualization
from yellowbrick.classifier import ConfusionMatrix, ROCAUC, PrecisionRecallCurve, ClassificationReport, ClassPredictionError, DiscriminationThreshold

# Encoders
# Category encoders: https://contrib.scikit-learn.org/category_encoders/
# https://contrib.scikit-learn.org/category_encoders/count.html
from category_encoders import CountEncoder, TargetEncoder, BinaryEncoder

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# import networkx as nx
import matplotlib.pyplot as plt
# import statsmodels
import seaborn as sns
import plotly.express as px

from tqdm import tqdm

import tensorflow as tf


**Project Constants**

In [None]:
# Display to two decimal places
# Change maxrows and maxcols to 50?

# OS Constants
PATH = "/kaggle/input/lish-moa/"

TEST_PATH = PATH + "test_features.csv"
TRAIN_PATH = PATH + "train_features.csv"
TRAIN_Y_PATH = PATH + "train_targets_scored.csv"
TRAIN_UNSCORED_PATH = PATH + "train_targets_nonscored.csv"

SUBMISSION_PATH = PATH + "sample_submission.csv"

# Dataframe Constants
INDEX_COL = "sig_id"
CATEGORICAL_COLS = [""]
NUMERICAL_COLS = []

# Pandas Constants

# Graphical Constants

# sns.set()
sns.set_style('whitegrid')
sns.set_context('poster')

%matplotlib inline
plt.rcParams["figure.figsize"] = (16,12)
plt.rcParams['axes.titlesize'] = 16


# Warnings
IGNORE_WARNINGS = False

if IGNORE_WARNINGS:
    import warnings
    warnings.filterwarnings('ignore')

# Reproducibility
SEED = 42
random.seed(SEED)      # seeds random.sample, used later to pick the target subsample
np.random.seed(SEED)
# tf.random.set_seed(42)

**Import Data**

In [None]:
train_X = pd.read_csv(TRAIN_PATH, index_col = INDEX_COL)
train_y = pd.read_csv(TRAIN_Y_PATH, index_col = INDEX_COL)

train_unscored = pd.read_csv(TRAIN_UNSCORED_PATH, index_col = INDEX_COL)

test_X = pd.read_csv(TEST_PATH, index_col = INDEX_COL)

submission_df = pd.read_csv(SUBMISSION_PATH)

### Tour of Data Files ###

### Y Training Set - The Targets We Have To Predict ###

The targets are binary labels, one column per mechanism of action.  For each sample, we have to predict the probability that each mechanism is active.

* **Agonists**: Increase the production of downstream molecules.
* **Antagonists and Inhibitors**:  Inhibit the production of downstream molecules by competing with other molecules for the binding site.  Antagonists are specific types of inhibitors and the difference between these terms is often domain specific.
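
To make the target format concrete, here is a minimal sketch on a made-up target table. The column names and values are purely illustrative and are not taken from the competition files; the only point is that each row is a sample, each column is a 0/1 mechanism label, and a sample can have zero or several active mechanisms.

```python
import pandas as pd

# Hypothetical toy targets: one row per sample, one 0/1 column per mechanism.
toy_targets = pd.DataFrame(
    {
        "some_agonist": [0, 1, 0, 0],
        "some_antagonist": [0, 0, 0, 1],
        "some_inhibitor": [0, 1, 0, 1],
    },
    index=["id_a", "id_b", "id_c", "id_d"],
)

# Number of active mechanisms per sample (zero or several are both possible).
labels_per_sample = toy_targets.sum(axis=1)

# Fraction of samples in which each mechanism is active.
positive_rate = toy_targets.mean(axis=0)

print(labels_per_sample.to_dict())
print(positive_rate.to_dict())
```

The same two summaries could be computed on `train_y` to see how sparse the real labels are.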


![y_train_moa](https://deeptestprep.com/wp-content/uploads/2020/09/y_train_moa.png)

In [None]:
print("y_train shape: ", train_y.shape)

### X_Train - Cell and Gene Predictor Variables ###



![X_train_moa](https://deeptestprep.com/wp-content/uploads/2020/09/X_train_moa.png)

In [None]:
print("X_train shape: ", train_X.shape)

### A quick look at gene data ###

In [None]:
if DO_EDA:
    # Overlay the distributions of ten gene columns spread across the feature range.
    sample_gene_cols = ['g-0', 'g-100', 'g-150', 'g-200', 'g-300',
                        'g-400', 'g-500', 'g-600', 'g-700', 'g-750']

    sns.set_context('poster')

    for col in sample_gene_cols:
        # Label each curve so that plt.legend() has entries to show.
        ax = sns.distplot(train_X[col], label = col)

    ax.set(title = "Regulation of 10 Sample Genes",
          xlabel = "Upregulation or Downregulation",
          ylabel = "Density")

    plt.annotate("Gene Deeply Downregulated", xy = (-9.9, .01), xytext = (-7.8, 0.21),
                 size = 16,
                 arrowprops = {'facecolor':'grey', 'width':3})

    plt.annotate("Somewhat Downregulated", xy = (-5, 0.05), xytext = (-7.8, 0.11), size = 16,
                arrowprops = {'facecolor':'grey', 'width':3},
                backgroundcolor = 'white')

    plt.annotate("Genes Upregulated.  Slight Right Skew", xy = (2.5, 0.06), xytext = (2.5, 0.11), size = 16,
                arrowprops = {'facecolor':'grey', 'width':3},
                backgroundcolor = 'white')

    plt.legend()
    plt.show()

In [None]:
if DO_EDA:
    # Overlay the distributions of ten cell viability columns.
    sample_cell_cols = ['c-0', 'c-10', 'c-20', 'c-30', 'c-40',
                        'c-50', 'c-60', 'c-70', 'c-80', 'c-90']

    sns.set_context('poster')

    for col in sample_cell_cols:
        # Label each curve so that plt.legend() has entries to show.
        ax = sns.distplot(train_X[col], label = col)

    ax.set(title = "Viability of 10 Sample Cell Lines",
          xlabel = "Increased or decreased viability",
          ylabel = "Density")

    plt.annotate("Drug effective at killing cells / Error?", xy = (-9.9, .08), xytext = (-7.8, 0.21),
                 size = 16,
                 arrowprops = {'facecolor':'grey', 'width':3})

    plt.annotate("More cells are killed in general", xy = (-4, 0.02), xytext = (-7.8, 0.11), size = 16,
                arrowprops = {'facecolor':'grey', 'width':3},
                backgroundcolor = 'white')

    plt.annotate("Cell viability enhanced less often", xy = (1.5, 0.06), xytext = (2.5, 0.11), size = 16,
                arrowprops = {'facecolor':'grey', 'width':3},
                backgroundcolor = 'white')

    plt.legend()
    plt.show()

**Pandas Profiling**

In [None]:
if DO_PANDAS_PROFILING:
    import pandas_profiling as pp
    
    train_features = pd.read_csv("/kaggle/input/lish-moa/train_features.csv")
    train_report = pp.ProfileReport(train_features, title = "train_dataset_profile")

if DO_PANDAS_PROFILING:
    train_report.to_file("train_report.html")
    train_report.to_notebook_iframe()

**Mechanism of Action EDA**

Questions I'd like to answer:
* Do any of the categorical variables (such as time or dosage level) affect gene expression and cell viability outcomes in a predictable / linear / correlated way?
* The nonscored training targets are additional mechanism labels that do not count toward the score, while the scored targets are the ones we have to predict.  What is the relationship between these sets and the test set?

For the following, it may also be useful to generate simple hypotheses using domain knowledge:

* Are certain gene responses correlated with each other?

* Are certain cell viability responses correlated with each other?

* Are certain gene / cell responses correlated?

* What is the relationship between antagonists / agonists with the same treatment protocol?

* What data transformations may make sense here?

* What data denoising do we have to worry about?
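
As a starting point for the first question, a grouped mean is often enough to see whether a treatment setting shifts a response. The sketch below uses synthetic data with a dose effect built in on purpose; the column names mirror the dataset (`cp_time`, `cp_dose`, `g-0`), but the values, the effect size, and the helper name `group_means` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 600

# Synthetic treatment protocol columns (stand-ins for the real ones).
toy = pd.DataFrame({
    "cp_time": rng.choice([24, 48, 72], size=n),
    "cp_dose": rng.choice(["D1", "D2"], size=n),
})

# Hypothetical gene column whose mean is shifted by +1 at the higher dose.
toy["g-0"] = rng.normal(size=n) + np.where(toy["cp_dose"] == "D2", 1.0, 0.0)

# Mean response for each protocol combination.
# Large gaps between columns or rows suggest that the setting matters.
group_means = toy.groupby(["cp_time", "cp_dose"])["g-0"].mean().unstack("cp_dose")
print(group_means.round(2))
```

On the real data, the same `groupby` could be run over all gene or cell columns at once and the gaps ranked to find the most protocol-sensitive features.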

**EDA Helper Functions**

In [None]:
DO_EDA = True

In [None]:
def explore_df(df: pd.DataFrame,
               df_name : str) -> None:
    """Prints the name, shape, summary statistics and info of a dataframe."""
    
    print(df_name, "shape:", df.shape)
    print(df.describe())
    df.info()

In [None]:
print("Training Set X: ", train_X.shape)
print("Training Set y: ", train_y.shape)
print("Nonscored Targets y: ", train_unscored.shape)
print("Testing Set X: ", test_X.shape)

The feature file has 876 columns: the `sig_id` identifier plus 875 features covering gene expression, cell viability, and the treatment protocol (type, time, and dose).  

In [None]:
if DO_EDA:
    display(train_X.head())   # display() is needed because the expression sits inside an if block

## Linear Correlations In the Dataset ##

This section will show linear correlations that are above a certain threshold. You can use the second interactive chart to mouse-over to see the two features that have strong correlations.
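
Besides the heatmaps, it can help to list the strongly correlated pairs directly. The following is a minimal sketch on synthetic data; the helper name `top_correlated_pairs` and the toy columns are assumptions for illustration and are not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df: pd.DataFrame, threshold: float = 0.75) -> pd.Series:
    """Return feature pairs with |Pearson r| above threshold, strongest first."""
    corr = df.corr()
    # Keep only the strict upper triangle so that each pair appears once
    # and self-correlations are excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    strong = pairs[pairs.abs() > threshold]
    order = strong.abs().sort_values(ascending=False).index
    return strong.reindex(order)

rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "g-a": base,
    "g-b": base + rng.normal(scale=0.1, size=500),  # nearly a copy of g-a
    "g-c": rng.normal(size=500),                     # independent noise
})

result = top_correlated_pairs(toy)
print(result)
```

Applied to `gene_train_df`, the returned series would give the same information as the mouse-over chart in a form that is easy to sort and filter.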

In [None]:
def color_above_threshold_green(val):
    """
    Colors any cell above a threshold green.
    """
    
    if np.abs(val) > 0.65:
        color = 'green'
    else:
        color = 'black'
        
    return 'color: %s' % color

In [None]:
gene_cols = [col for col in train_X if col.startswith('g-')]
gene_train_df = train_X[gene_cols]

cell_cols = [col for col in train_X if col.startswith('c-')]
cell_train_df = train_X[cell_cols]

In [None]:

if SHOW_CORR and DO_EDA:
    corr_threshold = 0.75


In [None]:
if SHOW_CORR and DO_EDA:
    gene_corr = gene_train_df.corr()    
    print("Gene correlation matrix shape:", gene_corr.shape)

In [None]:
if SHOW_CORR and DO_EDA:
    gene_has_corr = gene_corr[gene_corr.abs() > corr_threshold]

    # Keep only the rows and columns that have at least one strong partner besides themselves
    gene_has_corr = gene_has_corr.dropna(axis = 0, thresh = 2)
    gene_has_corr = gene_has_corr.dropna(axis = 1, thresh = 2)
    
    print("Strongly correlated genes:", gene_has_corr.shape)

In [None]:
if SHOW_CORR and DO_EDA:
    mask = np.zeros_like(gene_has_corr)
    mask[np.triu_indices_from(mask)] = True

In [None]:
if SHOW_CORR and DO_EDA:
    ax = sns.heatmap(gene_has_corr, cmap = 'vlag', linewidths=4, mask = mask)
    ax.set(title = 'Genes with High Linear Correlation > 0.75')
    plt.show()

In [None]:
if SHOW_CORR and DO_EDA:
    fig = px.imshow(gene_has_corr, template = 'ggplot2', 
                    title = 'Interactive heatmap of linearly correlated genes',
                    width=900, height=600)
    fig.show()
    

------------------------------------------------------------------------------------------------

# Basic Walkthrough - BELOW THIS POINT - IN PROGRESS - #

Thanks for reading my walkthrough - if you like it, please upvote, as it is a good habit that inspires our community to continue publishing its work publicly.

------------------------------------------------------------------------------------------------


In [None]:
if DO_EDA:
    # Use 3 examples to illustrate what this graph means
    # With annotations.

    ax = sns.distplot(train_y.describe().T['mean'], kde = False)
    ax.set(title = "Density of Positive Classifications Across Ligand Bindings",
          xlabel = "Mean value of each ligand in y_train",
          ylabel = "Number of ligands with that mean")
    plt.show()

In [None]:
if DO_EDA:
    ax = sns.distplot(train_y.describe().T['std'], kde = False)
    ax.set(title = "Density of Standard Deviations Across Ligand Bindings",
          xlabel = "Std.Dev of each ligand in y_train",
          ylabel = "Number of ligands with that Std Dev")
    plt.show()

In [None]:
if DO_EDA:
    # As the mean (the share of positives) increases, the standard deviation also increases.
    # This makes sense, but can the shape tell us anything?

    target_stats = train_y.describe().T
    ax = sns.scatterplot(x = target_stats['mean'], y = target_stats['std'])

    ax.set(title = "Mean vs. Std Dev of y_train",
          xlabel = "Mean Value",
          ylabel = "Std Dev")
    plt.show()

In [None]:
train_full = train_X.merge(train_y, left_index = True, right_index = True)
train_full.head()

In [None]:
if SHOW_CORR and DO_EDA:
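    # Correlate the numeric columns only; cp_type and cp_dose are strings.
    full_corr = train_full.select_dtypes(include = 'number').corr()
    display(full_corr.head())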

In [None]:
# Sub Function 

def get_strong_correlations(corr_matrix: pd.DataFrame,
                            corr_threshold : float = 0.70):
    
    # Utility function for other functions.
    
    strong_corr = corr_matrix[corr_matrix.abs() > corr_threshold]
    
    # Keep only the rows and columns that have at least one strong partner besides themselves
    strong_corr = strong_corr.dropna(axis = 0, thresh = 2)
    strong_corr = strong_corr.dropna(axis = 1, thresh = 2)

    return strong_corr

def show_strong_correlations(corr_matrix: pd.DataFrame,
                             corr_threshold : float = 0.70,
                             show_corr : bool = True,
                             interactive : bool = True) -> pd.DataFrame:
    
    strong_corr = get_strong_correlations(corr_matrix, corr_threshold)
    print("Found" , strong_corr.shape[0], "Features with strong Correlations")
    
    mask = np.zeros_like(strong_corr)
    mask[np.triu_indices_from(mask)] = True

    if show_corr:
        
        if interactive:
            fig = px.imshow(strong_corr, template = 'ggplot2', 
                    title = 'Heatmap with High Correlations',
                    width=900, height=600)
            fig.show()
            
        else:
            ax = sns.heatmap(strong_corr, mask = mask)

            ax.set(title = 'Heatmap with Correlations')

            # plt.show()
        
    return strong_corr

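def get_dict_of_correlated_features(corr_matrix: pd.DataFrame,
                                    corr_threshold : float = 0.70) -> dict:
    
    """
    Returns a dictionary of each feature and its associated correlated features
    ranked in order of most correlated to least correlated along with their
    correlation values.
    """
    
    strong_corr = get_strong_correlations(corr_matrix, corr_threshold)

    print("Found" , strong_corr.shape[0], "Features with strong Correlations")
    
    correlated = {}
    
    for feature in strong_corr.columns:
        # Drop below-threshold entries (NaN) and the feature's own self-correlation
        partners = strong_corr[feature].dropna().drop(labels = feature, errors = 'ignore')
        
        # Rank partners by absolute correlation, strongest first
        order = partners.abs().sort_values(ascending = False).index
        correlated[feature] = partners.loc[order].to_dict()
    
    return correlated
    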

In [None]:
if SHOW_CORR and DO_EDA:
    show_strong_correlations(full_corr, corr_threshold = 0.75)

In [None]:
# spearman_corr, pval = spearmanr(train_full)
# show_strong_correlations(pd.DataFrame(spearman_corr))

In [None]:
if SHOW_CORR and DO_EDA:
    sns.distplot(full_corr.vitamin_b)
    sns.distplot(full_corr.kit_inhibitor)
    plt.show()

**Network Analysis**

In [None]:
DO_NETWORK_ANALYSIS = False

if DO_NETWORK_ANALYSIS:
    import networkx as nx

    # Treat below-threshold (NaN) correlations as "no edge"
    G = nx.from_pandas_adjacency(gene_has_corr.fillna(0))

    #positions=nx.circular_layout(G)

    # nx.draw_networkx_nodes(G,positions,node_color='#DA70D6',
    #                           node_size=500,alpha=0.8)

In [None]:
# Use Pandas Styling
# s = gene_corr.style.applymap(color_above_threshold_green)


**Plotter Helper Functions**

In [None]:
if DO_EDA:
    def disp_boxplot(data, title, xlabel, ylabel):
        sns.set_style('whitegrid')
        sns.set_context('poster')
        palette = sns.color_palette("mako_r", 6)

        ax = sns.boxplot(data=data, palette = palette)

        ax.set(title = title,
              xlabel = xlabel,
              ylabel = ylabel)

        try:
            ax.axhline(y = data.mean().mean(), color = 'b', label = 'Mean of all datapoints', linestyle = '--', linewidth = 1.5)
            ax.axhline(y = data.median().median(), color = 'g', label = 'Median of all datapoints', linestyle = '--', linewidth = 1.5)
        except (AttributeError, TypeError):
            # data is not a numeric dataframe, so skip the reference lines
            pass

        ax.set_xticklabels(ax.get_xticklabels(), rotation = 45)

        plt.legend()
        plt.show()

    print('Plotting Helper Functions:')
    print("disp_boxplot() - function will display a nicely formatted box plot")

**Create Statistical Dataframe Summaries**

In [None]:

# Create a helper function
# Graph a few notable genes and cell lines.
# Anything stand out?

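# Summary statistics only make sense for the numeric columns (cp_type and cp_dose are strings)
numeric_train_X = train_X.select_dtypes(include = 'number')

statistical_df = pd.DataFrame()
statistical_df['median'] = numeric_train_X.median(axis = 0)
statistical_df['mean'] = numeric_train_X.mean(axis = 0)
statistical_df['std_dev'] = numeric_train_X.std(axis = 0)
statistical_df['min'] = numeric_train_X.min(axis = 0)
statistical_df['max'] = numeric_train_X.max(axis = 0)

# Each row of statistical_df is one feature, so select features by index label
gene_cols = [col for col in statistical_df.index if col.startswith('g-')]
gene_train_stats_df = statistical_df.loc[gene_cols]

cell_cols = [col for col in statistical_df.index if col.startswith('c-')]
cell_train_stats_df = statistical_df.loc[cell_cols]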

In [None]:
# statistical_df.head(5)

In [None]:
# cell_train_stats_df.head()

In [None]:
if DO_EDA:
    # Kurtosis / Skew
    # Mean vs. Median
    # Make ECDF Comparison next To it too.

    sns.set_context('poster')

    ax = sns.distplot(cell_train_stats_df['mean'], label = 'Mean', kde = False)
    ax2 = sns.distplot(cell_train_stats_df['median'], label = 'Median', kde = False)

    ax.set(title = "Cell Lines: Aggregate Mean vs. Median Distribution",
          xlabel = "Cell Viability",
          ylabel = "Num Samples")

    plt.legend()
    plt.show()

In [None]:
if DO_EDA:
    # Kurtosis / Skew
    # Mean vs. Median

    sns.set_context('poster')

    ax = sns.distplot(gene_train_stats_df['mean'], label = 'Mean', kde = False)
    ax2 = sns.distplot(gene_train_stats_df['median'], label = 'Median', kde = False)

    ax.set(title = "Gene Expression: Aggregate Mean vs. Median Distribution",
          xlabel = "Upregulation & Downregulation",
          ylabel = "Num Samples")

    plt.legend()
    plt.show()

In [None]:
if DO_EDA:
    # Kurtosis / Skew
    # Mean vs. Median

    sns.set_context('poster')

    ax = sns.distplot(cell_train_stats_df['std_dev'], label = 'Cell Std Dev', kde = False)
    ax2 = sns.distplot(gene_train_stats_df['std_dev'], label = 'Gene Std Dev', kde = False)

    ax.set(title = "Cell and Gene: Aggregate Standard Deviations",
          xlabel = "Standard Deviation",
          ylabel = "Num Samples with this Std Dev")

    plt.legend()
    plt.show()

In [None]:
if DO_EDA:
    # Kurtosis / Skew
    # Mean vs. Median

    sns.set_context('poster')

    ax = sns.distplot(cell_train_stats_df['min'], label = 'Cell Min', kde = False)
    ax2 = sns.distplot(gene_train_stats_df['min'], label = 'Gene Min', kde = False)

    ax.set(title = "Cell and Gene Minimums",
          xlabel = "Minimum Value",
          ylabel = "How many Samples")

    plt.legend()
    plt.show()

In [None]:
if DO_EDA:
    # Kurtosis / Skew
    # Mean vs. Median

    sns.set_context('poster')

    ax = sns.distplot(cell_train_stats_df['max'], label = 'Cell Max', kde = False)
    ax2 = sns.distplot(gene_train_stats_df['max'], label = 'Gene Max', kde = False)

    ax.set(title = "Cell and Gene Maximums",
          xlabel = "Maximum Value",
          ylabel = "How many Samples")

    plt.legend()
    plt.show()

In [None]:
if DO_EDA:
    cat_sum = train_X.groupby(['cp_type']).sum(numeric_only = True).T.reset_index(drop = True)

In [None]:
if DO_EDA:
    display(train_unscored.head())

## Preprocessing ##

In [None]:
def clean_for_models(df : pd.DataFrame) -> pd.DataFrame:
    # Categorical Encoding
    # Target Encoding
    
    pass

def remove_non_numerical(df: pd.DataFrame) -> pd.DataFrame:
    """Returns only the numeric columns of the dataframe."""
    return df.select_dtypes(include = 'number')

**Encode Categorical Variables**


In [None]:
encoder = BinaryEncoder(cols=['cp_type', 'cp_dose', 'cp_time'], return_df = True)
train_X_encoded = encoder.fit_transform(train_X)
# Reuse the encoding learned on the training set; do not refit on the test set
test_X_encoded = encoder.transform(test_X)

In [None]:
train_X_encoded.head()

In [None]:
test_X_encoded.head()

## Dimensionality Reduction ##

In [None]:
DO_DIM_REDUCE = True

if DO_DIM_REDUCE:
    from sklearn.decomposition import PCA, SparsePCA, KernelPCA
    
    # Put in preprocessing
    # Fit the scaler on the training set only, then apply it to the test set
    min_max_scaler = MinMaxScaler().fit(train_X_encoded)
    min_max_X_train = min_max_scaler.transform(train_X_encoded)
    min_max_X_test = min_max_scaler.transform(test_X_encoded)
    
    print("Linear PCA explained variance:")
    lin_pca = PCA(n_components = 25).fit(min_max_X_train)
    print(lin_pca.explained_variance_ratio_)
    print("Total Variance:", sum(lin_pca.explained_variance_ratio_))
    
    # Reuse the fitted PCA instead of fitting a second time
    lin_pca_X_train = lin_pca.transform(min_max_X_train)
    print("\n")
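
The choice of 25 components above is arbitrary. One common alternative is to pick the smallest number of components that reaches a target share of the variance. The sketch below illustrates this on synthetic data; the data, the 0.95 target, and the variable names are assumptions for illustration, not results from the competition dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 5 strong latent directions embedded in 40 noisy features.
latent = rng.normal(size=(300, 5))
X_toy = latent @ rng.normal(size=(5, 40)) + 0.05 * rng.normal(size=(300, 40))

# Fit a full PCA and accumulate the explained variance ratios.
pca = PCA().fit(X_toy)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the target.
target = 0.95
n_needed = int(np.searchsorted(cumulative, target) + 1)
print("Components needed for", target, "of the variance:", n_needed)
```

The same calculation on `min_max_X_train` would show how much variance the 25 components actually retain and whether more or fewer are warranted.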
    

## Use Subsample ##
Since there are so many outputs to predict in this model, it may make sense to prototype different models on a subset of the data.  This would allow for faster prototyping while still seeing if some models can work very well on limited portions of the dataset.

In [None]:
USE_SUBSAMPLE_ONLY = True
n_subsample = 50  # Just predict n ligands for faster analysis and prototyping.

if USE_SUBSAMPLE_ONLY:
    
    # Option to use same subset...
    
    # sig_id is the index, so column 0 is a valid target and the range starts at 0
    targets_to_use = random.sample(range(len(train_y.columns)), n_subsample)
    train_X_subsample = train_X_encoded
    train_y_subsample = train_y.iloc[:, targets_to_use]
    
    print("Using subsample of targets for faster exploration")
    print("Using the following target columns \n")
    print(list(train_y_subsample.columns))


## Validation Splits ##

In [None]:
# ONE_VALIDATION SPLIT

from sklearn.model_selection import train_test_split

# SET What you want to split here.
X_to_split = train_X_encoded
y_to_split = train_y_subsample

try:
    X_train_v, X_test_v, y_train_v, y_test_v = train_test_split(X_to_split, 
                                                                y_to_split, 
                                                                test_size=0.33, 
                                                                random_state=42, 
                                                                stratify = y_to_split)
except ValueError:   # Stratify fails when a label combination has fewer than 2 samples.
    X_train_v, X_test_v, y_train_v, y_test_v = train_test_split(X_to_split, 
                                                                y_to_split, 
                                                                test_size=0.33, 
                                                                random_state=42)

print("Validation Data Train Set - X_train_v: ", X_train_v.shape)
print("Validation Data Test Set - X_test_v: ", X_test_v.shape)

print("Validation Target Train Set - y_train_v: ", y_train_v.shape)
print("Validation Target Test Set - y_test_v: ", y_test_v.shape)


In [None]:
# Show Counts of targets in validation sets.

## Anomaly Detection ##

Because the classes are so sparse, it might make sense to try to understand if the data signatures of the 0's and 1's are different. Can anomaly detection detect when an incoming sample will have a MoA?
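
One way to probe this idea is to fit an anomaly detector on samples without any MoA and check whether samples with a MoA score as more anomalous. The sketch below uses scikit-learn's `IsolationForest` on synthetic data in which the two groups are deliberately well separated; the data and the separation are assumptions, so it shows only the mechanics, not whether the approach works on the real features.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical stand-ins: "inactive" samples cluster near zero,
# "active" samples are shifted far away in every feature.
inactive = rng.normal(loc=0.0, scale=1.0, size=(500, 10))
active = rng.normal(loc=6.0, scale=1.0, size=(20, 10))

# Fit the detector on the samples without any MoA only.
detector = IsolationForest(n_estimators=100, random_state=0).fit(inactive)

# score_samples returns lower values for more anomalous samples.
inactive_scores = detector.score_samples(inactive)
active_scores = detector.score_samples(active)

print("Mean score, inactive:", round(float(inactive_scores.mean()), 3))
print("Mean score, active:  ", round(float(active_scores.mean()), 3))
```

On the real data, the rows of `train_y` that sum to zero could play the role of `inactive`, and the score gap would indicate whether the signatures differ at all.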


## Modeling ##

In [None]:
DO_BASELINE = False
LOAD_CLASSIFIER = False
DO_CLASSIFIER = True
DO_NN = True

### Model Helper Functions ###

**Create Submission File**

In [None]:
# Create empty submission file
outputs_df = submission_df.copy()
outputs_df.head()

**Multi Output Target Helper Function**

In [None]:
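from sklearn.base import ClassifierMixin
from sklearn.ensemble import VotingClassifier

Classifier = ClassifierMixin  # Type alias: any scikit-learn classifier.

def use_voting_classifier(X_train : pd.DataFrame, 
                         y_train : pd.DataFrame,
                         X_test: pd.DataFrame,
                         clfs ) -> pd.DataFrame:
    
    """
    Fits a soft-voting ensemble once per target and returns the positive-class
    probabilities.  clfs is a list of (name, estimator) tuples.
    """
    
    voting_clf = VotingClassifier(clfs, voting = 'soft')
    multi_voting_clf = MultiOutputClassifier(voting_clf).fit(X_train, y_train)
    
    # predict_proba returns one (n_samples, n_classes) array per target
    probas = multi_voting_clf.predict_proba(X_test)
    
    # Keep the positive-class column; assumes every target has both classes in y_train
    pos_probas = np.column_stack([p[:, 1] for p in probas])
    
    return pd.DataFrame(pos_probas, columns = y_train.columns, index = X_test.index)
    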

def classify_with_multiclassifier(X_train : pd.DataFrame, 
                         y_train : pd.DataFrame,
                         X_test: pd.DataFrame,
                         clf : Classifier,
                         save_classifier : bool = True) -> np.array:
    
    """Uses SKLearn MultiOutput Classifier instead of a loop"""
    
    multi_clf = MultiOutputClassifier(clf).fit(X_train, y_train)
    
    preds = multi_clf.predict_proba(X_test)
    
    if save_classifier:
        from joblib import dump   # Imported here so the function does not depend on a later cell
        dump(multi_clf, 'model.joblib')
    
    return preds
    

### Baseline Models ###

In [None]:
LOAD_CLASSIFIER = False

if LOAD_CLASSIFIER:
    extra_preds_path = "extra_preds.csv"
    extra_preds = pd.read_csv(extra_preds_path)


**Run Classifier Here**

In [None]:
DO_CLASSIFIER = True

if DO_CLASSIFIER:
    
    from joblib import dump, load   # For Saving Model
    from sklearn.svm import SVC
    
    
    # Create Classifiers Here
    
    xgb_clf = XGBClassifier(n_jobs = -1, max_depth = 5)
    lgb_clf = LGBMClassifier()
    rf_clf = RandomForestClassifier(n_jobs = -1)
    extra_clf = ExtraTreesClassifier(n_jobs = -1, max_depth = 7, min_samples_split = 3, n_estimators = 500,
                                    class_weight = 'balanced')
    sk_gb_reg = GradientBoostingRegressor()
    
    svc_clf = SVC(class_weight='balanced')
    svc_linear_clf = SVC(class_weight='balanced', kernel = 'linear')
    
    svclinear_proba_clf = SVC(class_weight='balanced', kernel = 'linear', probability=True)
    svc_proba_clf = SVC(class_weight='balanced', probability=True)
    

In [None]:
# Set what train and test sets you want to use.
classify_X_train = X_train_v
classify_X_test = X_test_v
classify_y_train = y_train_v
classify_y_test = y_test_v

if DO_CLASSIFIER:
    # CHOOSE CLASSIFIER AND METHOD HERE.
    use_this_classifier = extra_clf
    used_classifier_name = 'extra_clf'
    
    print("Classifier set to: ", used_classifier_name)
    print(use_this_classifier)
    

In [None]:
# BASELINE
# all_preds_gb = classify_prob_all_targets(train_X_encoded, train_y, test_X_encoded, outputs_df, use_this_classifier)
# preds = classify_with_multiclassifier(classify_X_train, classify_y_train, classify_X_test, use_this_classifier)

In [None]:
def convert_preds_to_dataframe(preds: list,
                               clf_name : str,
                              save_array : bool = True,
                              save_df: bool = True,
                              verbose : int = 1) -> pd.DataFrame:
    
    """
    MultiOutputClassifier.predict_proba returns a list with one
    (n_samples, n_classes) array per target.  This function keeps only the
    positive-class probabilities and returns them as a dataframe.
    
    save_array: saves the array in .npy format
    save_df: saves the dataframe as a csv."""
    
    preds_arr = np.array(preds)
    pos_clf_preds = preds_arr[:,:,1]   # Preserve only positive binary prediction.
    pos_clf_preds = pos_clf_preds.T    # The output is the transpose of the shape we want
    
    if save_array:
        np.save(clf_name, pos_clf_preds)
        
        if verbose:
            print("Numpy Array saved: " + clf_name + ".npy")
    
    preds_df = pd.DataFrame(pos_clf_preds)
    
    if save_df:
        csv_name = clf_name + ".csv"
        preds_df.to_csv(csv_name, index = False)
        
        if verbose:
            print("CSV Saved as " + csv_name)
    
    if verbose:
        print("head of new dataframe: ")
        print(preds_df.head())
        print(preds_df.shape)
    
    return preds_df

In [None]:
# BASELINE
# if DO_CLASSIFIER:
#    preds_df = convert_preds_to_dataframe(preds, clf_name = used_classifier_name, verbose = 1)
    

## Validation Set Analysis ##

In [None]:
def fit_for_validation():
    pass

def analyze_validation(y_true, y_pred):
    pass

In [None]:
def score_clfs(true_df, preds_df):
    from sklearn.metrics import accuracy_score

    wrongs = true_df.compare(preds_df)

    # compare() keeps only the rows that differ, so its length is the number of
    # samples with at least one wrong target (no need to halve for 'self' / 'other').
    n_wrongs = len(wrongs)
    n_samples = true_df.shape[0]
    n_correct = n_samples - n_wrongs

    print("Total Samples: ", n_samples)
    print("Number Correct: ", n_correct)
    print("Number Wrong: ", n_wrongs)
    print("Accuracy: ", round(100 * (n_correct / n_samples), 2), "%")
    print("Sklearn Acc: ", round(100 * (accuracy_score(true_df, preds_df)), 2), "%")

    print("Log Loss: ", round(log_loss(true_df, preds_df), 2))

def analyze_clf_results():
    pass

def multiclass_stratified_cv(X : pd.DataFrame, 
                             y : pd.DataFrame, 
                             clf : Callable[[pd.DataFrame, pd.DataFrame], None]) -> None:
    
    skf = IterativeStratification(n_splits=5)
    skf.get_n_splits(X, y)

    for train_index, test_index in skf.split(X, y):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X.iloc[train_index,:], X.iloc[test_index,:]
        y_train, y_test = y.iloc[train_index,:], y.iloc[test_index,:]

        # Fit Function
        clf.fit(X_train, y_train)
        
        # Create Predictions
        y_preds_arr = clf.predict(X_test)
        
        # Convert preds to a dataframe.
        y_preds = pd.DataFrame(y_preds_arr, columns = y_test.columns, index = y_test.index)
        
        # Score Function
        score_clfs(y_test, y_preds)
        
        # Analyze Function
        # analyze_clf_results(y_test, y_preds)
        
        print("\n")




In [None]:
DO_CV = False

if DO_CV:
    # CV_VALIDATION_SPLIT
    from sklearn.model_selection import StratifiedKFold
    from skmultilearn.model_selection import IterativeStratification 
    
    # Use whole training set for cross validation
    X = train_X_encoded
    y = train_y
    
    multiclass_stratified_cv(X, y, use_this_classifier)

In [None]:
# For validation, we actually predict the classes
print("Fitting", use_this_classifier, "to classes for validation analysis.")

preds_arr = use_this_classifier.fit(X_train_v, y_train_v).predict(X_test_v)

preds = pd.DataFrame(preds_arr, columns = y_test_v.columns, index = y_test_v.index)
preds.head()

In [None]:
#ax = sns.distplot(y_test_v, kde = False)
#ax.set(title = "Class Balance of Validation Set")
#plt.show()

### Multi Label Confusion Matrix ###

In this section, we are doing our post prediction analysis on the validation set to see what samples and features the classifier had the hardest time with.   

In the future, we will do this for each classifier and ensemble a set of local models so that each classifier predicts the samples it is best at.  If a classifier performs reasonably well, it may be ensembled with other models later on.  
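
To find which targets a classifier struggles with, a per-target error rate is a simple first step. The following is a minimal sketch on toy labels; the helper name `per_target_error` and the toy frames are assumptions made for illustration.

```python
import pandas as pd

def per_target_error(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> pd.Series:
    """Fraction of samples misclassified for each target, worst first."""
    return (y_true != y_pred).mean(axis=0).sort_values(ascending=False)

# Toy labels: target t1 is predicted perfectly, t2 is wrong for half of the samples.
y_true_toy = pd.DataFrame({"t1": [0, 1, 0, 1], "t2": [0, 0, 1, 1]})
y_pred_toy = pd.DataFrame({"t1": [0, 1, 0, 1], "t2": [1, 0, 0, 1]})

errors = per_target_error(y_true_toy, y_pred_toy)
print(errors.to_dict())
```

Calling the helper with `y_test_v` and `preds` for each candidate classifier would give one error profile per model, which is the information needed to decide which model should handle which targets.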

In [None]:
show_confusion_matrix = False

if show_confusion_matrix:
    from sklearn.metrics import multilabel_confusion_matrix

    conf_mat = multilabel_confusion_matrix(y_test_v, preds)
    conf_mat.shape

    for i in range(y_test_v.shape[1]):
        name = y_test_v.columns[i]

        display(name)
        conf_mat_small = pd.DataFrame(conf_mat[i], index = ["True 0", "True 1"], columns = ["Predicted 0", "Predicted 1"])

        display(conf_mat_small)

In [None]:
# Do post_model_analysis
# Def ___

from sklearn.metrics import accuracy_score, log_loss

# compare() keeps only the rows (and columns) where the two frames differ
wrongs = y_test_v.compare(preds)

# Each remaining row is one sample with at least one wrong label.  The
# 'self'/'other' pairs are columns, so the row count needs no halving.
n_wrongs = len(wrongs)
n_samples = y_test_v.shape[0]
n_correct = n_samples - n_wrongs

print("Total Samples: ", n_samples)
print("Number Correct: ", n_correct)
print("Number Wrong: ", n_wrongs)
print("Accuracy: ", round(100 * (n_correct / n_samples), 2), "%")
print("Sklearn Acc: ", round(100 * (accuracy_score(y_test_v, preds)), 2), "%")

print("Log Loss: ", round(log_loss(y_test_v, preds), 2))
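
The relationship between `DataFrame.compare` and sklearn's multi-label accuracy can be checked on toy data. The sketch below uses made-up frames (`y_true_toy`, `y_pred_toy`) and assumes pandas 1.1 or later, where `compare` is available.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical two-target example.
cols = ["t1", "t2"]
y_true_toy = pd.DataFrame([[1, 0], [0, 1], [0, 0], [1, 1]], columns=cols)
y_pred_toy = pd.DataFrame([[1, 0], [0, 0], [0, 0], [0, 0]], columns=cols)

# compare() drops rows that match completely, so its length is the
# number of samples with at least one wrong label.
diff = y_true_toy.compare(y_pred_toy)
n_rows_wrong = len(diff)

# For multi-label input, accuracy_score is subset accuracy: a sample
# counts as correct only if every label matches.
subset_acc = accuracy_score(y_true_toy.values, y_pred_toy.values)

# Counting individual wrong labels is a different quantity.
n_labels_wrong = int((y_true_toy != y_pred_toy).values.sum())

print(n_rows_wrong, subset_acc, n_labels_wrong)
```

The row count from `compare` and the subset accuracy describe the same thing from two sides, while the per-label count is stricter about how much went wrong within each sample.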


In [None]:
# Do post_model_analysis

sns.heatmap(wrongs, cbar = False)
plt.show()

In [None]:
wrongs.head()

In [None]:
wrong_sums = pd.DataFrame(wrongs.sum(axis = 0))
wrong_sums

In [None]:
# wrong_sums.plot.bar()

In [None]:
def custom_confusion_matrix():
    pass


# Full Set Training #

**Neural Network Model**

In [None]:
train_y

In [None]:
# Compute Class Weights
# def...

# Inverse-frequency weight per target: number of samples / number of positives
sum_classes = np.array(list((train_y.sum())))
n_samples = train_y.shape[0]
weights = n_samples / sum_classes

class_weights_dict = dict(enumerate(weights))
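
A small worked example of the same inverse-frequency weighting, on a made-up target table (`moa_a`, `moa_b`, `moa_c` are hypothetical names). The guard against labels with zero positives is an added assumption, not something the cell above does.

```python
import numpy as np
import pandas as pd

# Hypothetical target matrix: 6 samples, 3 labels, one label never positive.
toy_targets = pd.DataFrame({
    "moa_a": [1, 0, 0, 0, 1, 0],
    "moa_b": [0, 0, 0, 1, 0, 0],
    "moa_c": [0, 0, 0, 0, 0, 0],
})

n_rows = toy_targets.shape[0]
positives = toy_targets.sum().to_numpy(dtype=float)

# Assumption: treat an empty label as if it had one positive, so the
# division never produces inf.
toy_weights = n_rows / np.maximum(positives, 1.0)

# Same dict layout as class_weights_dict: column position -> weight.
toy_weights_dict = dict(enumerate(toy_weights))
print(toy_weights_dict)
```

Rarer labels get larger weights, and the weights scale linearly with the number of rows.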

In [None]:
DO_NN = True

# Imports
if DO_NN:
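    # Import from tensorflow.keras throughout so the layers match the
    # callbacks and the tensorflow_addons wrapper used below
    from tensorflow.keras import Input, layers
    from tensorflow.keras.models import Model
    from tensorflow.keras.layers import Dense, BatchNormalization, Dropout, Embedding
    from tensorflow_addons.layers import WeightNormalization
    from tensorflow.keras.regularizers import l2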
    
# Preprocess Data
# Set the X and y
if DO_NN:
    
    
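    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    std_scaler = StandardScaler()
    mm_scaler = MinMaxScaler()
    
    # Use this scaler
    scaler = std_scaler
    
    # Fit the scaler on the training data only, then reuse it for the test data
    X_nn = scaler.fit_transform(train_X_encoded)
    y_nn = train_y
    X_test_nn = scaler.transform(test_X_encoded)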
    

# Set NN Parameters
if DO_NN:

    EPOCHS = 100
    BATCH_SIZE = 128
    INPUT_SHAPE = X_nn.shape[1]
    OUTPUT_SHAPE = y_nn.shape[1]
    
    NUM_LAYERS = 5  # NUM_HIDDEN - 1  # Rename NUM_HIDDEN
    SIZE_LAYER = 1024 
    STEP_DOWN = 128
    DROPOUT_AMOUNT = 0.5

# Create Model

if DO_NN:
    
    input_tensor = Input(shape = (INPUT_SHAPE, ))
    
    layer = Dense(SIZE_LAYER, activation = 'selu', kernel_initializer = 'he_normal')(input_tensor)
    layer = BatchNormalization()(layer)
    layer = Dropout(0.2)(layer)
    
    # Hidden layers shrink by STEP_DOWN units each step.  Track the width in a
    # local variable instead of mutating the SIZE_LAYER setting.
    layer_size = SIZE_LAYER
    for i in range(NUM_LAYERS - 1): 
        layer = WeightNormalization(Dense(layer_size, activation = 'selu', 
                                          kernel_initializer = 'he_normal',
                                          use_bias = False))(layer)
        layer = BatchNormalization()(layer)
        #layer = Dense(SIZE_LAYER, activation = 'relu',
        #              kernel_regularizer = l2(0.05))(layer)
        layer = Dropout(DROPOUT_AMOUNT)(layer)

        layer_size -= STEP_DOWN

    
    output_tensor = Dense(OUTPUT_SHAPE, activation = 'sigmoid')(layer)
    
    model = Model(input_tensor, output_tensor)
    model.summary()
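
On the scaling step in the cell above: a scaler is normally fit on the training data and then reused, unchanged, on the test data. The sketch below uses random toy arrays (not the competition data) to show how the two options differ.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: the test set is deliberately shifted relative to the training set.
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
toy_test = rng.normal(loc=8.0, scale=2.0, size=(50, 3))

# Fit on train, then apply the same means and scales to test.
toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

# Refitting on test forces the test set back to zero mean, which hides
# the shift and puts it on a different scale from what the model saw.
toy_test_refit = StandardScaler().fit_transform(toy_test)

print("test mean, train-fitted scaler:", toy_test_scaled.mean(axis=0))
print("test mean, refit on test:      ", toy_test_refit.mean(axis=0))
```

With the train-fitted scaler, a given raw value maps to the same scaled value in both sets, which is what the network's learned weights assume.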
    

In [None]:
if DO_NN:
    from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping
    from tensorflow.keras.optimizers import Adam, Nadam, SGD
    
    # Optimizer is stochastic gradient descent with Nesterov acceleration and momentum.
    sgd_opt = SGD(learning_rate = 0.01, momentum = 0.9, nesterov = True, decay = 1e-4)
    
    # Callbacks
    reduce_lr = ReduceLROnPlateau(patience=2, mode='min', monitor='val_loss', factor = 0.5)
    early_stop = EarlyStopping(patience = 10, monitor = 'val_loss')
    
    model.compile(optimizer = sgd_opt,
                 loss = 'binary_crossentropy',
                 metrics = ['acc'])
    
    # First pass: fit with the class weights.  The next cell continues
    # training the same model without class weights.
    
    model.fit(X_nn, 
              y_nn,
              epochs = 50,
              batch_size = BATCH_SIZE,
              class_weight = class_weights_dict,
              validation_split = 0.3,
              callbacks = [reduce_lr, early_stop])
    

In [None]:
# Running again without class_weight after partial training.
# Real effect or leakage into validation set?

if DO_NN:
    # Optimizer is stochastic gradient descent with Nesterov acceleration and momentum.
    sgd_opt = SGD(learning_rate = 0.01, momentum = 0.9, nesterov = True, decay = 1e-4)
    
    # Callbacks
    reduce_lr = ReduceLROnPlateau(patience=2, mode='min', monitor='val_loss', factor = 0.5)
    early_stop = EarlyStopping(patience = 10, monitor = 'val_loss')
    
    model.compile(optimizer = sgd_opt,
                 loss = 'binary_crossentropy',
                 metrics = ['acc'])
    
    model.fit(X_nn, 
              y_nn,
              epochs = 50,
              batch_size = BATCH_SIZE,
              # class_weight = class_weights_dict,
              validation_split = 0.3,
              callbacks = [reduce_lr, early_stop])

In [None]:
if DO_NN:
    model.save('keras_model.h5')

In [None]:
if DO_NN:
    preds_nn = model.predict(X_test_nn)

In [None]:
if DO_NN:
    # THIS IS NOW AT END.  
    
    preds_nn_df = pd.DataFrame(preds_nn)
    print(preds_nn_df.shape)

    new_submission = submission_df.copy()  # This is a duplicate
    print(new_submission.shape)
    print(preds_nn_df.head())

In [None]:
if DO_NN:
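    # Flatten so every prediction goes into one histogram instead of one series per column
    ax = sns.distplot(preds_nn_df.values.ravel(), label = 'NN Preds', kde = False)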
    ax.set(title = "Predictions", 
          xlabel = "Prediction Probability",
          ylabel = "Number of samples")

    plt.show()

## Submission File ##

In [None]:
SUBMIT_FILE = True
SUBMIT_FROM_KERNEL = True

if SUBMIT_FILE and SUBMIT_FROM_KERNEL:
    # Set module parameters here.
    
    preds_df = preds_nn_df
    submission_df = submission_df

In [None]:
def df_to_submission(pred_csv: pd.DataFrame, 
                     submission_df: pd.DataFrame,
                     verbose : bool = True,
                     export_df : bool = True) -> pd.DataFrame:
    """
    
    Combine a DataFrame of predicted probabilities with the sample
    submission's sig_id column so the result is in the correct format
    for a Kaggle submission.  Rows of pred_csv are assumed to be in the
    same order as the rows of submission_df.
    """

    new_submission = submission_df.copy()
    
    # Target column names, in submission order
    column_labels = new_submission.drop(columns = 'sig_id').columns
    
    if verbose:
        print("Submission Shape: ")
        print(new_submission.shape)
        print("Preds Shape: ")
        print(pred_csv.shape)
    
    # Work on a copy so the caller's DataFrame is not modified, and attach
    # sig_id by position instead of merging on the float prediction columns
    merged_df = pred_csv.copy()
    merged_df.columns = column_labels
    merged_df.insert(0, 'sig_id', new_submission['sig_id'].values)
    
    if verbose:
        print("Merged Shape: ")
        print(merged_df.shape)
        print("Preview of Merged DF: ")
        print(merged_df.head(1))
    
    if export_df:
        merged_df.to_csv('submission.csv', index=False)
        
        if verbose:
            print("Submission file created: submission.csv")
    return merged_df
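
To make the expected shapes concrete, here is a minimal sketch of the same idea on toy data. The ids and target names (`id_001`, `target_a`, and so on) are invented; the real sample submission has one `sig_id` column followed by the scored target columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample submission: ids plus placeholder values per target.
toy_sample_sub = pd.DataFrame({
    "sig_id": ["id_001", "id_002", "id_003"],
    "target_a": 0.5,
    "target_b": 0.5,
})

# Hypothetical model output: one row per id, one unnamed column per target.
toy_preds = pd.DataFrame(np.array([[0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]))

# Name the prediction columns after the targets, then put sig_id first.
target_cols = toy_sample_sub.columns.drop("sig_id")
toy_submission = pd.DataFrame(toy_preds.to_numpy(), columns=target_cols)
toy_submission.insert(0, "sig_id", toy_sample_sub["sig_id"].to_numpy())

print(toy_submission)
```

This relies on the predictions being in the same row order as the sample submission, which holds when they come straight from `model.predict` on the unshuffled test features.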



In [None]:
if SUBMIT_FILE:
    final_submission_csv = df_to_submission(preds_df, submission_df)