# Mechanisms of Action: EDA for starters

Welcome to the Mechanisms of Action competition! Here we have to predict drug mechanisms of action, and since this is a tabular competition we can make full use of the hallowed **LightGBM.** So let's get started with a simple EDA to get you "active" in this competition.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
df = pd.read_csv('../input/lish-moa/train_features.csv')
plt.style.use('seaborn-darkgrid')

In [None]:
df

In [None]:
# Print a message for every column that contains at least one NaN
for col in df.columns[df.isnull().any()]:
    print("MISSING VALUE in", col)

The check prints nothing, so there are no NaNs in the data, which is very good for us. Now we'll need to explore the columns a bit further before moving on.

In [None]:
targs = pd.read_csv('../input/lish-moa/train_targets_scored.csv')
targs.head()

The targets are all plain binary (0/1) columns, nothing new or unique here.
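
One way to back up the "binary" claim is to check it directly rather than eyeballing `head()`. The sketch below runs on a tiny made-up frame (`toy_targs`, with hypothetical target names, and assuming the targets file carries a `sig_id` column like the features file) so it works without the competition files. On the real data the same check would be applied to `targs`.

```python
import pandas as pd

# Toy stand-in for train_targets_scored.csv; names and values are made up.
toy_targs = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c", "id_d"],
    "target_1": [0, 1, 0, 0],
    "target_2": [0, 0, 0, 1],
})

# Drop the identifier and keep only the label columns.
labels = toy_targs.drop(columns="sig_id")

# True only if every value in every label column is 0 or 1.
all_binary = bool(labels.isin([0, 1]).all().all())

# Fraction of positive labels per target, which shows how sparse each one is.
positive_rate = labels.mean()

print(all_binary)
print(positive_rate)
```

The per-target positive rate is worth a glance too, since very rare labels tend to need extra care when validating a model.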

Let's just quickly take a look at a few of the features we have here, plotted alongside their rolling means.

In [None]:
fig, axes = plt.subplots(8, 2, figsize=(14, 30), dpi=100)
# Plot the first 16 g- features against row index, each with a 4-row rolling mean
for i in range(16):
    ax = axes[i % 8][i // 8]
    df[f"g-{i}"].plot(ax=ax, alpha=0.8, label='Feature', color='tab:blue')
    df[f"g-{i}"].rolling(window=4).mean().plot(ax=ax, alpha=0.8, label='Rolling mean', color='tab:orange')
    ax.legend()
    ax.set_title(f'Feature {i}', fontsize=13)
# Adjust the vertical spacing once, after all panels are drawn
plt.subplots_adjust(hspace=0.45)


Almost every one of these features seems to follow basically the same distribution with a few variations, and they all look fairly even across the rows. Let's try a beginner-level dimensionality reduction technique on the data: Principal Component Analysis, or PCA.
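
Line plots against the row index only hint at the distributions, so a side-by-side table of summary statistics can make the "same distribution" impression easier to verify. This is a minimal sketch on synthetic data (the `toy` frame and its values are made up). With the real data you would select the `g-` columns of `df` instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a few g- columns; the values are made up.
rng = np.random.default_rng(1)
toy = pd.DataFrame(
    rng.normal(size=(1000, 4)),
    columns=[f"g-{i}" for i in range(4)],
)

# One row per feature: centre, spread, and range side by side.
summary = toy.describe().T[["mean", "std", "min", "50%", "max"]]
print(summary)
```

If the means and standard deviations in such a table line up closely across rows, that supports treating the features as similarly scaled.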

In [None]:
from sklearn.decomposition import PCA
df = df.drop(["sig_id","cp_type","cp_time","cp_dose"],axis=1)
pca = PCA(n_components=6)
pca.fit(df)
pca_samples = pca.transform(df)
ps = pd.DataFrame(pca_samples)
ps.head()

We have reduced a dataset with over 800 features into a six-column dataframe, which will hopefully pay off in the future. For now, let's have a look-see at this new PCA-ified dataset and its six components:
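
Before leaning on six components, it helps to know how much of the original variance they actually keep. The sketch below shows the idea with scikit-learn's `explained_variance_ratio_` attribute on synthetic data (the matrix `X` and its shape are assumptions for illustration). On the notebook's data you would read the same attribute from the fitted `pca` object.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the feature matrix: 500 rows, 50 columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))

pca_demo = PCA(n_components=6)
pca_demo.fit(X)

# Share of the total variance captured by each component,
# and by all six components together.
per_component = pca_demo.explained_variance_ratio_
retained = per_component.sum()

print(per_component.round(3))
print(f"Variance retained by 6 components: {retained:.1%}")
```

A low retained share would suggest that six components discard most of the signal, in which case a larger `n_components` may be worth trying.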

In [None]:
fig, axes = plt.subplots(3, 2, figsize=(7, 15), dpi=100)
# Plot each of the six PCA components against row index, with a 4-row rolling mean
for i in range(6):
    ax = axes[i % 3][i // 3]
    ps[i].plot(ax=ax, alpha=0.8, label='Feature', color='tab:blue')
    ps[i].rolling(window=4).mean().plot(ax=ax, alpha=0.8, label='Rolling mean', color='tab:orange')
    ax.legend()
    ax.set_title(f'Feature {i}', fontsize=13)
# Adjust the vertical spacing once, after all panels are drawn
plt.subplots_adjust(hspace=0.45)


PCA does not split the columns into six groups of roughly 140. Each of the six components is a weighted combination of all 800+ input columns, and this smooshing-together has produced some pronounced extreme values in the data. Taking a look at correlations, we get:

In [None]:
fig = plt.figure(figsize=(8, 8))
sns.heatmap(ps.corr(), annot=True, cmap=plt.cm.magma);

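Apart from the diagonal, the correlations in our PCA-ified data are essentially zero, and any tiny negative values are just floating-point noise. That is expected: principal components are uncorrelated by construction, so the dark off-diagonal cells do not indicate strong negative correlations.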
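
A note on reading this heatmap: the scores PCA produces for the data it was fitted on are uncorrelated by construction, so the off-diagonal entries should sit near zero. Here is a minimal sketch that checks this on synthetic, deliberately correlated data (the matrix `X` is made up for illustration and is not the competition data).

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic, deliberately correlated data: 300 rows, 20 columns.
rng = np.random.default_rng(42)
base = rng.normal(size=(300, 5))
X = base @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(300, 20))

# Project onto six components, mirroring the PCA cell above.
scores = pd.DataFrame(PCA(n_components=6).fit_transform(X))
corr = scores.corr().values

# Largest absolute correlation between two different components.
off_diag = corr[~np.eye(6, dtype=bool)]
max_off_diag = np.abs(off_diag).max()
print(max_off_diag)
```

Even though the input columns are strongly correlated, the printed value should be tiny, at the level of floating-point error.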

In [None]:
import plotly.graph_objs as go
import plotly.offline as py
# Enable plotly rendering inside the notebook
py.init_notebook_mode(connected=True)
data = [
    go.Heatmap(
        z= ps.corr().values,
        x=ps.columns.values,
        y=ps.columns.values,
        colorscale='Viridis',
        reversescale = False,
        opacity = 1.0 )
]

layout = go.Layout(
    title='Pearson correlation of PCA components',
    xaxis = dict(ticks='', nticks=36),
    yaxis = dict(ticks='' ),
    width = 900, height = 700)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='labelled-heatmap')

Adding interactivity lets the reader play around and have fun with your dataviz, which is why I personally like it. This is just another view of the same correlation matrix of the PCA components.

Now let's check the rolling standard deviation instead of the rolling mean:

In [None]:
fig, axes = plt.subplots(8, 2, figsize=(14, 30), dpi=100)
# Plot the first 16 g- features against row index, each with a 4-row rolling std
for i in range(16):
    ax = axes[i % 8][i // 8]
    df[f"g-{i}"].plot(ax=ax, alpha=0.8, label='Feature', color='tab:blue')
    df[f"g-{i}"].rolling(window=4).std().plot(ax=ax, alpha=0.8, label='Rolling std', color='tab:orange')
    ax.legend()
    ax.set_title(f'Feature {i}', fontsize=13)
# Adjust the vertical spacing once, after all panels are drawn
plt.subplots_adjust(hspace=0.45)


This seems nice for now. Let's quickly check the counts of each treatment type (`cp_type`) we have here:

In [None]:
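# Reload the features, since the categorical columns were dropped before the PCA
df = pd.read_csv('../input/lish-moa/train_features.csv')

# Pass the column by keyword so this works across seaborn versions
sns.countplot(x='cp_type', data=df)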

In [None]:
print("WORK IN PROGRESS")