# Introduction

In this notebook I do some exploratory data analysis for the recent [Mechanisms of Action (MoA) Prediction](https://www.kaggle.com/c/lish-moa/) competition. It's still in progress and I'll try to update it whenever I can. I hope you find it useful!

In this task we predict multiple Mechanism of Action (MoA) targets for different samples, given inputs such as gene expression data and cell viability data. In short, a mechanism of action describes the process by which a molecule, such as a drug, functions to produce a pharmacological effect. A drug's mechanism of action may refer to its effects on a biological readout such as cell growth, or to its interaction with and modulation of its direct biomolecular target, for example a protein or nucleic acid.

We are given information such as:
-  Gene Expression Data (g-)
-  Cell Viability Data (c-)
-  Cp Type (indicates samples treated with a compound (trt_cp) or with a control perturbation (ctl_vehicle))
-  Duration and Dose of the Treatment
-  MoA labels for prediction

Just to be clear, this is a multi-label classification problem: one sample might have more than one label. I tried to explain the basics [here](https://www.kaggle.com/c/lish-moa/discussion/180500).
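To make the multi-label setup concrete, here is a minimal sketch on a made-up target table. The sample IDs and values are hypothetical and the column names are only illustrative. Each row is a sample, each column is a binary MoA label, and a row can contain zero, one, or several ones.

```python
import pandas as pd

# Toy target table: hypothetical sample IDs and illustrative MoA columns.
toy_targets = pd.DataFrame(
    {
        'sig_id': ['id_a', 'id_b', 'id_c'],
        'nfkb_inhibitor': [1, 0, 0],
        'proteasome_inhibitor': [1, 0, 0],
        'cyclooxygenase_inhibitor': [0, 0, 1],
    }
)

# Number of active labels per sample (row-wise sum over the label columns).
labels_per_sample = toy_targets.drop(columns='sig_id').sum(axis=1)
print(labels_per_sample.tolist())
```

The first toy sample carries two labels at once, which is exactly what separates this from ordinary multi-class classification.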

Let's get started...

# Loading Libraries

In [None]:
# Some basic stuff for EDA:

import pandas as pd
import numpy as np

# for visualizing

import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
import plotly.express as px

# for adding extra statistical stuff

from scipy.stats import skew, norm

In [None]:
# Styling graphs with customized color palette.

cust_palt = [
    '#111d5e','#c70039','#37b448','#B43757', '#ffbd69', '#ffc93c','#FFFF33','#FFFACD',
]

plt.style.use('ggplot')

## Loading Data

In [None]:
# Train, test, targets and submission file:

train_feat = pd.read_csv('../input/lish-moa/train_features.csv')
train_target = pd.read_csv('../input/lish-moa/train_targets_scored.csv')

test_feat = pd.read_csv('../input/lish-moa/test_features.csv')
sample_sub = pd.read_csv('../input/lish-moa/sample_submission.csv')

# First Look and Overview of the Data

### We have several datasets: train features and train targets for training our model, test features for predicting, and a sample submission file for sending our predictions.

In [None]:
print('Train Feature Samples:')
display(train_feat.sample(3))
print('Test Feature Samples:')
display(test_feat.sample(3))
print('Train Target Samples:')
display(train_target.sample(3))

## Quality of the Data, Unique Observations and Categorical Features

### Here we take a look at the general quality of the data we are given. We are going to answer questions such as: do we have any missing data, how many features and observations do we have, are there any categorical variables and, if so, which ones?

In [None]:
# Checking train and test columns/rows.

print(
    f'Train data has {train_feat.shape[1]} features, {train_feat.shape[0]} observations. {train_feat.sig_id.nunique()} of these are unique.\nTest data has {test_feat.shape[1]} features, {test_feat.shape[0]} observations. {test_feat.sig_id.nunique()} of these are unique.'
)



In [None]:
# Checking missing values.

train_miss = train_feat.isnull().sum().sum()
test_miss = test_feat.isnull().sum().sum()

# Both counts must be zero; a bitwise & of two non-zero counts can still be 0.
if train_miss == 0 and test_miss == 0:
    print('There are no missing values in either dataset!')
else:
    print('There are missing values, you should check them individually!')
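If a check like this ever reports missing values, a per-column count narrows down where they are. A minimal sketch on a toy frame (the column names and values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with a single missing entry in 'g-0'.
toy = pd.DataFrame({'g-0': [0.1, np.nan, 0.3], 'c-0': [1.0, 2.0, 3.0]})

# Count missing values per column and keep only the affected columns.
missing_per_column = toy.isnull().sum()
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)
```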

# Categorical Data

### These are the treatment dose, duration and control group indicators for our samples; we might encode them for modelling later.

### We can observe:
- Most of the observed treatments are compound treatments in both datasets, while control perturbations make up about 7-8% of the train and test sets. The two sets look balanced in this respect.

- Treatment durations are evenly distributed, with 48-hour treatments slightly (~2%) more common than the rest. Again, this is well balanced across both datasets.

- Doses are evenly distributed; D1 is slightly more common than D2 in both datasets (~2%). Both datasets are balanced.
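These percentages can also be compared as plain numbers rather than read off a plot. The sketch below uses toy stand-ins for the train and test frames, and `category_shares` is a hypothetical helper name, not something from the competition data or a library:

```python
import pandas as pd

# Toy stand-ins for the train and test feature frames (values are made up).
toy_train = pd.DataFrame({'cp_dose': ['D1', 'D1', 'D2', 'D1']})
toy_test = pd.DataFrame({'cp_dose': ['D1', 'D2', 'D2', 'D1']})

def category_shares(train_df, test_df, col):
    """Return the percentage share of each category in train and test, side by side."""
    shares = pd.concat(
        [
            train_df[col].value_counts(normalize=True).rename('train_pct'),
            test_df[col].value_counts(normalize=True).rename('test_pct'),
        ],
        axis=1,
    )
    return (shares * 100).round(2)

dose_shares = category_shares(toy_train, toy_test, 'cp_dose')
print(dose_shares)
```

On the real data, the same call could be repeated for cp_type, cp_time and cp_dose.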

In [None]:
# Checking categorical features
print(f'Categorical features on dataset are:\n {train_feat.loc[:,train_feat.nunique()<=10].columns.tolist()}')

In [None]:
# Displaying categorical distribution:

fig = plt.figure(constrained_layout=True, figsize=(20, 12))
grid = gridspec.GridSpec(ncols=6, nrows=3, figure=fig)

datasets = [('Train', train_feat), ('Test', test_feat)]

# One row per categorical feature: train on the left half, test on the right half.
for row, col in enumerate(['cp_type', 'cp_time', 'cp_dose']):
    # Use the train ordering for both panels so the bars line up.
    order = train_feat[col].value_counts().index

    for side, (name, df) in enumerate(datasets):
        ax = fig.add_subplot(grid[row, side * 3:(side + 1) * 3])
        ax.set_title(f'{name} {col} Distribution', weight='bold')

        sns.countplot(x=col,
                      data=df,
                      palette=cust_palt,
                      ax=ax,
                      order=order)

        # Annotate each bar with its share of the dataset.
        total = float(len(df[col]))
        for p in ax.patches:
            height = p.get_height()
            ax.text(p.get_x() + p.get_width() / 2.,
                    height + 2,
                    '{:1.2f}%'.format((height / total) * 100),
                    ha='center')

In [None]:
# Label encoding categorical data

train_feat['cp_type'] = train_feat['cp_type'].map({'trt_cp':0,'ctl_vehicle':1})
train_feat['cp_time'] = train_feat['cp_time'].map({24:0,48:1,72:2})
train_feat['cp_dose'] = train_feat['cp_dose'].map({'D1':0,'D2':1})

test_feat['cp_type'] = test_feat['cp_type'].map({'trt_cp':0,'ctl_vehicle':1})
test_feat['cp_time'] = test_feat['cp_time'].map({24:0,48:1,72:2})
test_feat['cp_dose'] = test_feat['cp_dose'].map({'D1':0,'D2':1})
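Mapping the categories to integers is one option. Another common choice, shown here only as an alternative sketch and not as what this notebook does, is one-hot encoding with `pd.get_dummies`, which avoids implying an order between categories. The toy frame below is made up:

```python
import pandas as pd

# Toy frame with the three categorical columns (values are illustrative).
toy = pd.DataFrame({
    'cp_type': ['trt_cp', 'ctl_vehicle', 'trt_cp'],
    'cp_time': [24, 48, 72],
    'cp_dose': ['D1', 'D2', 'D1'],
})

# One indicator column per category value.
one_hot = pd.get_dummies(toy, columns=['cp_type', 'cp_time', 'cp_dose'])
print(one_hot.columns.tolist())
```

Whether integer codes or indicator columns work better depends on the model, so it is worth trying both.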

# Targets

### It looks like the most common MoA labels are nfkb_inhibitor and proteasome_inhibitor, followed by cyclooxygenase_inhibitor; all three have more than 400 instances. If we check the total label count per sample, we see that most of our observations have one MoA, while ~9k of them have none. Samples with multiple labels are much rarer but can still affect our final model performance...

### There might be some correlation between MoA labels in samples with more than one label, which is worth investigating in the future. Models that account for these relations might give better results...
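One simple way to start looking at relations between labels is a co-occurrence matrix: for a binary target matrix Y, the product Y.T @ Y counts how many samples carry each pair of labels. A minimal sketch on a made-up target table (`moa_a`, `moa_b`, `moa_c` are hypothetical label names):

```python
import numpy as np
import pandas as pd

# Toy binary target matrix: rows are samples, columns are labels.
toy_targets = pd.DataFrame(
    {'moa_a': [1, 1, 0, 0], 'moa_b': [1, 0, 0, 1], 'moa_c': [0, 0, 1, 0]}
)

# Y.T @ Y counts, for every pair of labels, how many samples carry both.
co_occurrence = toy_targets.T.dot(toy_targets)

# Zero the diagonal, which only holds each label's own frequency.
counts = co_occurrence.to_numpy().copy()
np.fill_diagonal(counts, 0)
pair_counts = pd.DataFrame(counts, index=co_occurrence.index, columns=co_occurrence.columns)
print(pair_counts)
```

Applied to the real targets (without the sig_id column), the largest off-diagonal entries would point to the label pairs that most often appear together.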

In [None]:
# Counting target values.

targ_cts=train_target.iloc[:,1:].sum(axis=0)
fig = plt.figure(figsize=(20,15))
sns.barplot(y=targ_cts.sort_values(ascending=False)[:30].index, x=targ_cts.sort_values(ascending=False)[:30].values, palette='inferno')
plt.show()

In [None]:
# Labels per sample.

plt.figure(figsize=(16,6))
features = train_target.columns.values[1:]
plt.title('Total Target Score Counts', weight='bold')
sns.countplot(x=train_target[features].sum(axis=1), palette=cust_palt)
plt.xlabel('Total Number of Targets per Sample')
plt.show()

# Meta Feature Distribution

### Here I wanted to take a look at some summary statistics of our train and test data. These are meta values, but they can give us insights for the next steps we are going to take and can reveal meta differences between train and test data.

- There are small differences between the distributions of mean values, but generally they look balanced. We can't say the same for the median.
- The min and max values follow each other closely, but we see some differences between train and test data.
- The std and var of both datasets look nice and balanced.
- Judging by skewness, both datasets are similar, but their distributions are generally worth a deeper look.
- Kurtosis indicates that test samples have slightly longer tails. That's interesting and worth another look...
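Visual comparisons like these can be backed up with a number per feature. One option (my assumption, not something this notebook runs) is the two-sample Kolmogorov-Smirnov statistic from `scipy.stats.ks_2samp`, which measures the largest gap between two empirical distributions. The sketch uses synthetic data where one column is deliberately shifted in the test set:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic stand-ins: 'g-0' has the same distribution in both sets, 'g-1' is shifted in test.
toy_train = pd.DataFrame({'g-0': rng.normal(0, 1, 2000), 'g-1': rng.normal(0, 1, 2000)})
toy_test = pd.DataFrame({'g-0': rng.normal(0, 1, 2000), 'g-1': rng.normal(1, 1, 2000)})

# KS statistic per column, largest train/test gap first.
ks_stats = pd.Series(
    {col: ks_2samp(toy_train[col], toy_test[col]).statistic for col in toy_train.columns}
).sort_values(ascending=False)
print(ks_stats)
```

Features at the top of such a ranking would be the first candidates for a closer look.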

In [None]:
# Displaying meta distribution:

fig = plt.figure(constrained_layout=True, figsize=(20, 12))

features = train_feat.columns.values[1:]

grid = gridspec.GridSpec(ncols=4, nrows=4, figure=fig)

ax1 = fig.add_subplot(grid[0, :2])

ax1.set_title('Distribution of Mean Values per Column', weight='bold')

sns.kdeplot(train_feat[features].mean(axis=0),color=cust_palt[0], shade=True, label='Train')
sns.kdeplot(test_feat[features].mean(axis=0),color=cust_palt[1], shade=True, label='Test')


ax2 = fig.add_subplot(grid[0, 2:])

ax2.set_title('Distribution of Median Values per Column', weight='bold')

sns.kdeplot(train_feat[features].median(axis=0),color=cust_palt[0], shade=True, label='Train')
sns.kdeplot(test_feat[features].median(axis=0),color=cust_palt[1], shade=True, label='Test')

ax3 = fig.add_subplot(grid[1, :2])

ax3.set_title('Distribution of Minimum Values per Column', weight='bold')

sns.kdeplot(train_feat[features].min(axis=0),color=cust_palt[0], shade=True, label='Train')
sns.kdeplot(test_feat[features].min(axis=0),color=cust_palt[1], shade=True, label='Test')


ax4 = fig.add_subplot(grid[1, 2:])

ax4.set_title('Distribution of Maximum Values per Column', weight='bold')

sns.kdeplot(train_feat[features].max(axis=0),color=cust_palt[0], shade=True, label='Train')
sns.kdeplot(test_feat[features].max(axis=0),color=cust_palt[1], shade=True, label='Test')


ax5 = fig.add_subplot(grid[2, :2])

ax5.set_title('Distribution of Std\'s per Column', weight='bold')

sns.kdeplot(train_feat[features].std(axis=0),color=cust_palt[0], shade=True, label='Train')
sns.kdeplot(test_feat[features].std(axis=0),color=cust_palt[1], shade=True, label='Test')

ax6 = fig.add_subplot(grid[2, 2:])

ax6.set_title('Distribution of Variances per Column', weight='bold')

sns.kdeplot(train_feat[features].var(axis=0),color=cust_palt[0], shade=True, label='Train')
sns.kdeplot(test_feat[features].var(axis=0),color=cust_palt[1], shade=True, label='Test')

ax7 = fig.add_subplot(grid[3, :2])

ax7.set_title('Distribution of Skew Values per Column', weight='bold')

sns.kdeplot(train_feat[features].skew(axis=0),color=cust_palt[0], shade=True, label='Train')
sns.kdeplot(test_feat[features].skew(axis=0),color=cust_palt[1], shade=True, label='Test')

ax8 = fig.add_subplot(grid[3, 2:])

ax8.set_title('Distribution of Kurtosis Values per Column', weight='bold')

sns.kdeplot(train_feat[features].kurtosis(axis=0),color=cust_palt[0], shade=True, label='Train')
sns.kdeplot(test_feat[features].kurtosis(axis=0),color=cust_palt[1], shade=True, label='Test')

plt.suptitle('Meta Distributions of Train/Test Set', fontsize=25, weight='bold')

plt.show()

# Distribution of Some Features

### There is a high number of features and we don't need to visualize all of them for now. Instead of choosing features randomly, I went with some *irregular* variables, such as highly skewed or high-std features. I also fitted a normal distribution curve so we can see how our samples differ from a Gaussian distribution. Our irregular features are quite similar around the center, but some differences are visible in the tails. We will check those next.
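As a quick illustration of the two ingredients used here, `scipy.stats.skew` and `scipy.stats.norm.fit`, the sketch below compares a symmetric sample with a right-skewed one. The data is synthetic and only meant to show what the numbers look like:

```python
import numpy as np
from scipy.stats import skew, norm

rng = np.random.default_rng(0)
symmetric = rng.normal(0, 1, 5000)         # roughly zero skew
right_skewed = rng.exponential(1.0, 5000)  # long right tail, positive skew

for name, sample in [('symmetric', symmetric), ('right_skewed', right_skewed)]:
    # norm.fit returns the maximum-likelihood mean and std of a Gaussian fitted to the sample.
    mu, sigma = norm.fit(sample)
    print(f'{name}: skew={skew(sample):.2f}, mu={mu:.2f}, sigma={sigma:.2f}')
```

The fitted Gaussian only matches mean and spread, so for the skewed sample the fitted curve and the actual density disagree most in the tails.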

In [None]:
# Listing the most skewed and the highest standard deviation features:

# Top 20 features by standard deviation.
features_std = train_feat.iloc[:,1:].std().sort_values(ascending=False)
f_std = train_feat[features_std.iloc[:20].index.tolist()]

# Top 20 features by signed skewness, i.e. the most right-skewed ones.
# The sort is on the signed value, so strongly left-skewed features are not selected here.
features_skew = train_feat.iloc[:,1:].apply(skew).sort_values(ascending=False)
skewed = train_feat[features_skew.iloc[:20].index.tolist()]

In [None]:
def feat_dist(df, df2, cols, rows=3, columns=3, title=None):
    
    '''Plot train/test KDEs for the given columns, with a normal fit on the train data'''
    
    fig, axes = plt.subplots(rows, columns, figsize=(30, 25), constrained_layout=True)
    axes = axes.flatten()

    for i, j in zip(cols, axes):
        sns.distplot(
                    df[i],
                    ax=j,
                    fit=norm,
                    hist=False,
                    color='#111d5e',
                    label=f'Train {i}',
                    kde_kws={'alpha':0.9})        
        
        sns.distplot(
                    df2[i],
                    ax=j,
                    hist=False,
                    color = '#c70039',
                    label=f'Test {i}',
                    kde_kws={'alpha':0.7})
        
        (mu, sigma) = norm.fit(df[i])
        # Raw string keeps the LaTeX backslashes from being read as escape sequences.
        j.set_title(r'Train Test Dist of {0} Norm Fit: $\mu=${1:.2g}, $\sigma=${2:.2f}'.format(i.capitalize(), mu, sigma), weight='bold')

    fig.suptitle(f'{title}', fontsize=24, weight='bold')

# Distribution of High Std Features

In [None]:
# Creating distplot of features which has high std

feat_dist(train_feat, test_feat, f_std.columns.tolist(), rows=5, columns=4, title='Distribution of High Std Train/Test Features')

# Distribution of High Skew Features

In [None]:
# Creating distplot of features which highly skewed

feat_dist(train_feat, test_feat, skewed.columns.tolist(), rows=5, columns=4, title='Distribution of Highly Skewed Train/Test Features')

In [None]:
def tail_dist(df, df2, cols, rows=3, columns=3, title=None):
    
    '''Plot train/test KDEs for the given columns, zoomed in on the tails'''
    
    fig, axes = plt.subplots(rows, columns, figsize=(30, 25), constrained_layout=True)
    axes = axes.flatten()

    for i, j in zip(cols, axes):
        sns.distplot(
                    df[i],
                    ax=j,                    
                    hist=False,
                    color='#111d5e',
                    label=f'Train {i}',
                    kde_kws={'alpha':0.9})        
        
        sns.distplot(
                    df2[i],
                    ax=j,
                    hist=False,
                    color = '#c70039',
                    label=f'Test {i}',
                    kde_kws={'alpha':0.7})        

        j.set_title(f'Train Test Dist of {i.capitalize()}', weight='bold')
        # Zoom in on the left tail for the high-std features, on the right tail otherwise.
        if cols == f_std.columns.tolist():
            j.axis([-11,-0.5,0,0.1])
        else:
            j.axis([1,11,0,0.1])

    fig.suptitle(f'{title}', fontsize=24, weight='bold')

# Tails of the Interesting Features

Here we can see where the train/test outlier differences lie. Even though they make up a small part of the data, they can have a noticeable impact on our model performance. Especially on G-229 we can see a huge gap between the train and test sets around -10; these are worth digging deeper into in the future...
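A tail gap like this can also be expressed as a number: the share of values beyond a threshold in each set. The sketch below uses made-up series, and `tail_share` is a hypothetical helper introduced only for illustration:

```python
import pandas as pd

def tail_share(series, threshold, side='left'):
    """Fraction of values beyond `threshold` on the chosen side."""
    if side == 'left':
        return float((series < threshold).mean())
    return float((series > threshold).mean())

# Toy stand-ins for one feature in train and test (values are made up).
toy_train = pd.Series([-10.0, -9.5, 0.1, 0.2, 0.3, 0.0, -0.1, 0.4, 0.5, 0.6])
toy_test = pd.Series([-9.8, 0.1, 0.2, 0.3, 0.0, -0.1, 0.4, 0.5, 0.6, 0.7])

train_tail = tail_share(toy_train, -5)
test_tail = tail_share(toy_test, -5)
print(train_tail, test_tail)
```

Comparing the two shares per feature would show which tails differ most between train and test.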

In [None]:
# Creating distplot of features which has high std / tail part

tail_dist(train_feat, test_feat, f_std.columns.tolist(), rows=5, columns=4, title='Distribution of High Std Train/Test Feature Tails')

In [None]:
# Creating distplot of features which highly skewed / tail part

tail_dist(train_feat, test_feat, skewed.columns.tolist(), rows=5, columns=4, title='Distribution of Highly Skewed Train/Test Feature Tails')

# Correlations

### In this part we inspect correlations between features to see if there are any linear relations between their values. We take the absolute value of each correlation, so it only shows how strong the relation is, regardless of whether it's positive or negative. We see there are lots of highly correlated features; with this info we might think about dropping some features or reducing dimensions in some other way later...
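Since a correlation matrix is symmetric, every pair shows up twice in it. A small sketch of one way to list each pair only once, using the strict upper triangle as a mask; the data is synthetic and the column names are made up:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    'c-0': base,
    'c-1': base + rng.normal(scale=0.1, size=500),  # nearly a copy of c-0
    'g-0': rng.normal(size=500),                     # independent noise
})

abs_corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair appears once and the diagonal is dropped.
upper = np.triu(np.ones(abs_corr.shape, dtype=bool), k=1)
pairs = abs_corr.where(upper).stack().dropna().sort_values(ascending=False)
print(pairs)
```

With n features this leaves n * (n - 1) / 2 pairs, half of what the full unstacked matrix contains once the diagonal is removed.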

In [None]:
correlations = train_feat.iloc[:,1:].corr().abs().unstack().sort_values(kind="quicksort",ascending=False).reset_index()
correlations = correlations[correlations['level_0'] != correlations['level_1']] #preventing 1.0 corr
corr_max=correlations.level_0.head(150).tolist()
corr_max=list(set(corr_max)) #removing duplicates

corr_min=correlations.level_0.tail(34).tolist()
corr_min=list(set(corr_min)) #removing duplicates

### Features with highest correlations

c-52 and c-42 seem to have the highest correlation in our data, with a value of 0.9246 (92.46%), followed by c-13 and c-73.

In [None]:
# top corrs
display(correlations.head(5))

### Features with lowest correlations

g-179 and g-44 seem to have the lowest correlation, followed by g-363 and c-91.

In [None]:
display(correlations.tail(5))

In [None]:
correlation_train = train_feat.loc[:,corr_max].corr()
mask = np.triu(np.ones_like(correlation_train, dtype=bool))  # hide the upper triangle, diagonal included

plt.figure(figsize=(30, 12))
sns.heatmap(correlation_train,
            mask=mask,
            annot=True,
            fmt='.3f',
            cmap='Wistia',
            linewidths=0.01,
            cbar=True)


plt.title('Features with Highest Correlations',  weight='bold')
plt.show()

In [None]:
correlation_train = train_feat.loc[:,corr_min].corr()
mask = np.triu(np.ones_like(correlation_train, dtype=bool))  # hide the upper triangle, diagonal included
plt.figure(figsize=(30, 12))

sns.heatmap(correlation_train,
            mask=mask,
            annot=True,
            fmt='.3f',
            cmap='Wistia',
            linewidths=0.01,
            cbar=True)


plt.title('Features with Lowest Correlations',  weight='bold')
plt.show()

# Dimension Reduction

### We have quite a high number of features, and since getting a good model is an iterative process, you might want to reduce dimensions to get faster results when producing your first working baseline models. If you inspect the visuals below, you'll see that the first component explains 40% of the variance and the first 30 components explain around 70% of the variance. With these reduced dimensions you can take your first modelling steps faster, which I found quite useful.
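If you would rather pick the number of components from a target share of variance than read it off a plot, the cumulative explained variance ratio can be searched directly. A minimal sketch on synthetic data; the 95% threshold and the data itself are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 3 latent factors spread over 20 observed columns, plus small noise.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 20))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 95%.
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_components_95, cumulative[:5].round(3))
```

Because the toy data is built from three factors, the first three components capture almost all of the variance here; real data will usually need more.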

In [None]:
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(train_feat.iloc[:,1:])
pca_train = pca.transform(train_feat.iloc[:,1:])
pca_test = pca.transform(test_feat.iloc[:,1:])

In [None]:
# Explaining variance ratio:

fig, ax = plt.subplots(2,1,figsize=(28, 10))
ax[0].plot(range(train_feat.iloc[:,1:].shape[1]), pca.explained_variance_ratio_.cumsum(), linestyle='--', drawstyle='steps-mid', color=cust_palt[1],
         label='Cumulative Explained Variance')
sns.barplot(x=np.arange(1,train_feat.iloc[:,1:].shape[1]+1), y=pca.explained_variance_ratio_, alpha=0.85, color=cust_palt[0],
            label='Individual Explained Variance', ax=ax[0])


ax[0].set_title('Explained Variance', fontsize = 20, weight='bold')
ax[0].set_ylabel('Explained Variance Ratio', fontsize = 14)
ax[0].set_xlabel('Number of Principal Components', fontsize = 14)
# Attach the legend to the axes that holds the labelled artists.
ax[0].legend(loc='center right', fontsize = 13)

ax[0].set_xticks([])

ax[1].plot(range(train_feat.iloc[:,1:].shape[1]), pca.explained_variance_ratio_.cumsum(), linestyle='--', drawstyle='steps-mid', color=cust_palt[1],
         label='Cumulative Explained Variance')
sns.barplot(x=np.arange(1,train_feat.iloc[:,1:].shape[1]+1), y=pca.explained_variance_ratio_, alpha=0.85, color=cust_palt[0],
            label='Individual Explained Variance', ax=ax[1])

ax[1].axis([0,29,0,1])
ax[1].set_title('First 30 Explained Variances', fontsize = 20, weight='bold')
ax[1].set_ylabel('Explained Variance Ratio', fontsize = 14)
ax[1].set_xlabel('Number of Principal Components', fontsize = 14)
plt.tight_layout()


In [None]:
train_temp = pd.read_csv('../input/lish-moa/train_features.csv')
pca = PCA(4)
pca.fit(train_feat.iloc[:,1:])
pca_samples = pca.transform(train_feat.iloc[:,1:])

### We can observe that some of the groups are easily separable along the components; the control group in particular can be detected easily.

In [None]:
# Displaying 50% of the variance:

total_var = pca.explained_variance_ratio_.sum() * 100

labels = {
    str(i): f"PC {i+1} ({var:.1f}%)"
    for i, var in enumerate(pca.explained_variance_ratio_ * 100)
}


fig = px.scatter_matrix(
    pca_samples,
    color=train_temp.iloc[:,1:].cp_type,
    dimensions=range(4),
    labels=labels,
    title=f'Total Explained Variance: {total_var:.2f}% vs cp_type',
    opacity=0.5,
    color_discrete_sequence=cust_palt[:4],
)
fig.update_traces(diagonal_visible=False)
fig.show()

In [None]:
# Displaying 50% of the variance:

total_var = pca.explained_variance_ratio_.sum() * 100

labels = {
    str(i): f"PC {i+1} ({var:.1f}%)"
    for i, var in enumerate(pca.explained_variance_ratio_ * 100)
}

fig = px.scatter_matrix(
    pca_samples,
    color=train_temp.iloc[:,1:].cp_dose,
    dimensions=range(4),
    labels=labels,
    title=f'Total Explained Variance: {total_var:.2f}% vs cp_dose',
    opacity=0.5,
    color_discrete_sequence=cust_palt[:4]
)
fig.update_traces(diagonal_visible=False)
fig.show()

In [None]:
train_temp['number_of_moas'] = train_target[list(train_target.columns[1:])].sum(axis=1)
train_temp['number_of_moas'] = train_temp['number_of_moas'].map({0:'No MoA',1:'One MoA',2:'Multiple MoAs', 3:'Multiple MoAs', 4:'Multiple MoAs', 5:'Multiple MoAs', 6:'Multiple MoAs'
                                                                , 7:'Multiple MoAs'})
train_temp['cp_time'] = train_temp['cp_time'].astype('str')

In [None]:
# Displaying 50% of the variance:

total_var = pca.explained_variance_ratio_.sum() * 100

labels = {
    str(i): f"PC {i+1} ({var:.1f}%)"
    for i, var in enumerate(pca.explained_variance_ratio_ * 100)
}


fig = px.scatter_matrix(
    pca_samples,
    color=train_temp.iloc[:,1:].cp_time,
    dimensions=range(4),
    labels=labels,
    title=f'Total Explained Variance: {total_var:.2f}% vs cp_time',
    opacity=0.3,
    color_discrete_sequence=cust_palt[:3],
)

fig.update_traces(diagonal_visible=False)
fig.show()

In [None]:
# Displaying 50% of the variance:

total_var = pca.explained_variance_ratio_.sum() * 100

labels = {
    str(i): f"PC {i+1} ({var:.1f}%)"
    for i, var in enumerate(pca.explained_variance_ratio_ * 100)
}


fig = px.scatter_matrix(
    pca_samples,
    color=train_temp['number_of_moas'],
    dimensions=range(4),
    labels=labels,
    title=f'Total Explained Variance: {total_var:.2f}% vs number of MoA\'s',
    opacity=0.3,
    color_discrete_sequence=cust_palt[:3],
)

fig.update_traces(diagonal_visible=False)
fig.show()

# Adversarial Validation

Alright, since we tested for train/test sampling differences in the previous parts, I also wanted to implement what is called 'Adversarial Validation'. This method helped me in a previous competition while building my model. Basically, we replace the targets with a dataset label (0 for train and 1 for test), then build a classifier that tries to predict which observations belong to the train set and which to the test set. If the datasets were randomly sampled from the same source, it should be really hard for the classifier to separate them. But if there are systematic selection differences between the train and test sets, the classifier should be able to capture them. So we want our models to score low in the next section (close to 0.50 AUC), because a higher detection rate means a bigger difference between the train and test datasets. Let's get started...
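Before running it on the real features, here is the idea in a compact form on synthetic data. This is only a sketch under my own assumptions: both toy sets are drawn from the same distribution, and `cross_val_score` with `scoring='roc_auc'` stands in for the plotting code used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(42)
cols = [f'f{i}' for i in range(5)]

# Synthetic "train" and "test" sets drawn from the same distribution.
toy_train = pd.DataFrame(rng.normal(size=(600, 5)), columns=cols)
toy_test = pd.DataFrame(rng.normal(size=(400, 5)), columns=cols)

# Label the origin of each row: 0 for train, 1 for test.
X = pd.concat([toy_train, toy_test], ignore_index=True)
y = np.r_[np.zeros(len(toy_train), dtype=int), np.ones(len(toy_test), dtype=int)]

# Cross-validated AUC of a classifier that tries to tell the two sets apart.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring='roc_auc')
mean_auc = scores.mean()
print(f'Adversarial AUC: {mean_auc:.3f}')
```

Since the two toy sets share one distribution, the classifier has nothing to learn and the AUC stays near chance level.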


In [None]:
# duplicating train test features

train_adv = train_feat.iloc[:,1:].copy()
test_adv = test_feat.iloc[:,1:].copy()

#labelling train test data

train_adv['dataset_label'] = 0
test_adv['dataset_label'] = 1

# merging train test data
adv_master = pd.concat([train_adv, test_adv], axis=0).reset_index(drop=True)

# creating x and y's for adversarial validation
adv_X = adv_master.drop('dataset_label', axis=1)
adv_y = adv_master['dataset_label']

In [None]:
# loading some basic packages for testing

from sklearn.linear_model import LogisticRegression
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
import math
# RocCurveDisplay.from_estimator (scikit-learn >= 1.0) replaces plot_roc_curve, which was removed in 1.2
from sklearn.metrics import RocCurveDisplay, auc, roc_auc_score


# setting 3 fold cv

cv = StratifiedKFold(3, shuffle=True, random_state=42)


# models:

lr_adv = LogisticRegression(
    random_state=42,
    n_jobs=-1,
)

lg_adv = lgb.LGBMClassifier(
    random_state=42,
    n_jobs=-1,
)

estimators = [lr_adv, lg_adv]

In [None]:
def adv_roc(estimators, cv, X, y):
    
    ''' Plot per-fold and mean ROC curves for each estimator '''

    fig, axes = plt.subplots(math.ceil(len(estimators) / 2),
                             2,
                             figsize=(16, 6))
    axes = axes.flatten()

    for ax, estimator in zip(axes, estimators):
        tprs = []
        aucs = []
        mean_fpr = np.linspace(0, 1, 100)

        for i, (train, test) in enumerate(cv.split(X, y)):
            estimator.fit(X.loc[train], y.loc[train])
            viz = RocCurveDisplay.from_estimator(estimator,
                                                 X.loc[test],
                                                 y.loc[test],
                                                 name='ROC fold {}'.format(i),
                                                 alpha=0.3,
                                                 lw=1,
                                                 ax=ax)
            interp_tpr = np.interp(mean_fpr, viz.fpr, viz.tpr)
            interp_tpr[0] = 0.0
            tprs.append(interp_tpr)
            aucs.append(viz.roc_auc)

        ax.plot([0, 1], [0, 1],
                linestyle='--',
                lw=2,
                color='r',
                label='Chance',
                alpha=.8)

        mean_tpr = np.mean(tprs, axis=0)
        mean_tpr[-1] = 1.0
        mean_auc = auc(mean_fpr, mean_tpr)
        std_auc = np.std(aucs)
        ax.plot(mean_fpr,
                mean_tpr,
                color='b',
                label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' %
                (mean_auc, std_auc),
                lw=2,
                alpha=.8)

        std_tpr = np.std(tprs, axis=0)
        tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
        tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
        ax.fill_between(mean_fpr,
                        tprs_lower,
                        tprs_upper,
                        color='grey',
                        alpha=.2,
                        label=r'$\pm$ 1 std. dev.')

        ax.set(xlim=[-0.02, 1.02],
               ylim=[-0.02, 1.02],
               title=f'{estimator.__class__.__name__} ROC for Adversarial Val.')
        ax.legend(loc='lower right', prop={'size': 10})
    plt.show()

## Results

We just fitted two basic classifiers to tell our train and test sets apart: LightGBM and LogisticRegression. When we check the results we can see:

- Both models have a hard time distinguishing whether an observation comes from the train set or the test set.
- With a score of almost 0.50, we can say the train and test datasets look randomly sampled and balanced. Which is good!

If we had gotten higher scores, we would have to inspect which features cause it and then try to eliminate that effect for more robust results...
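For the case where the score does come out high, one way to find the responsible features is to look at the feature importances of the adversarial classifier. The sketch below is an illustration on synthetic data, where I deliberately shift one column in the toy test set; the RandomForestClassifier is my choice for the example, not a model used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
cols = ['f0', 'f1', 'f2', 'f3']

toy_train = pd.DataFrame(rng.normal(size=(500, 4)), columns=cols)
toy_test = pd.DataFrame(rng.normal(size=(500, 4)), columns=cols)
toy_test['f2'] += 2.0  # inject a shift so that f2 differs between the two sets

# Label the origin of each row: 0 for train, 1 for test.
X = pd.concat([toy_train, toy_test], ignore_index=True)
y = np.r_[np.zeros(len(toy_train), dtype=int), np.ones(len(toy_test), dtype=int)]

# Features the classifier relies on most are the ones that separate train from test.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(clf.feature_importances_, index=cols).sort_values(ascending=False)
print(importances)
```

Dropping or transforming the top-ranked features and re-running the adversarial check is a common next step.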

In [None]:
adv_roc(estimators, cv, adv_X, adv_y)

# Work in Progress!

### There are loads of missing comments, some graphs I want to add later, and maybe a simple model, but it's mainly EDA work for now. I'll be updating it whenever possible. Thanks, and I hope you enjoyed reading it. Happy coding!