<h1><center>Mechanisms of Action (MoA) Prediction. Data analysis and visualization</center></h1>

<center><img src="https://pharmacyinnovations.net/wp-content/uploads/pillsdrugs.png"></center>


<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:blue; border:0; color:white' role="tab" aria-controls="home"><center>Quick navigation</center></h2>

* [1. Basic Data Overview](#1)
* [2. Categories visualization](#2)
* [3. Gene and cell features distribution](#3)
* [4. Training features correlation](#4)
* [5. Targets analysis](#5)
* [6. Train & Targets correlations](#6)
* [7. Targets & Train features dependencies](#7)
    
#### I also started a second <a href="https://www.kaggle.com/isaienkov/moa-prediction-interesting-findings/">kernel</a>, where I will incrementally add interesting findings extracted from this dataset.

In [None]:
import numpy as np
import pandas as pd
import random

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import matplotlib.pyplot as plt

In [None]:
pd.options.display.max_columns = None

<a id="1"></a>
<h2 style='background:blue; border:0; color:white'><center>1. Basic Data Overview</center></h2>

In [None]:
train = pd.read_csv('/kaggle/input/lish-moa/train_features.csv')
test = pd.read_csv('/kaggle/input/lish-moa/test_features.csv')

train['dataset'] = 'train'
test['dataset'] = 'test'

df = pd.concat([train, test])

Take a look into training and test sets.

**train_features.csv** - Features for the training set. Features prefixed with g- are gene expression data, and features prefixed with c- are cell viability data. cp_type indicates whether a sample was treated with a compound or with a control perturbation (the competition description names these cp_vehicle and ctrl_vehicle; in the data the values are `trt_cp` and `ctl_vehicle`); control perturbations have no MoAs. cp_time and cp_dose indicate treatment duration (24, 48 or 72 hours) and dose (high or low).

In [None]:
train.head()

**test_features.csv** - Features for the test data. You must predict the probability of each scored MoA for each row in the test data.

In [None]:
test.head()

In [None]:
print('Number of rows in training set:', train.shape[0])
print('Number of columns in training set:', train.shape[1] - 1)

print('Number of rows in test set:', test.shape[0])
print('Number of columns in test set:', test.shape[1] - 1)

In [None]:
df.info()

We can see that we have 872 float features, 1 integer feature (cp_time) and 3 categorical ones (sig_id, cp_type and cp_dose). The helper `dataset` column added above also shows up as a non-numeric column.
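The same tally can be computed instead of being read off `df.info()`. Below is a minimal sketch on a toy frame; the frame `toy` and its values are made up for illustration, and only the column names mirror the real data.

```python
import pandas as pd

# Toy stand-in for the combined frame (values are invented for illustration)
toy = pd.DataFrame({
    "sig_id": ["id_a", "id_b"],
    "cp_type": ["trt_cp", "ctl_vehicle"],
    "cp_time": [24, 48],
    "cp_dose": ["D1", "D2"],
    "g-0": [0.5, -1.2],
    "c-0": [0.1, 0.3],
})

def kind(dtype):
    """Map a column dtype to a coarse family name."""
    if pd.api.types.is_float_dtype(dtype):
        return "float"
    if pd.api.types.is_integer_dtype(dtype):
        return "integer"
    return "categorical"

# Count how many columns fall into each family
counts = toy.dtypes.map(kind).value_counts().to_dict()
print(counts)
```

Applied to the real `df`, the same call should reproduce the counts quoted above, with `dataset` landing in the categorical bucket.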

In [None]:
sample_submission = pd.read_csv('/kaggle/input/lish-moa/sample_submission.csv')
sample_submission

<a id="2"></a>
<h2 style='background:blue; border:0; color:white'><center>2. Categories Visualization</center></h2>

Here we are going to check categorical features: cp_type, cp_time, cp_dose.

In [None]:
cp_width = 500
cp_height = 400
scatter_size = 600
WIDTH=800

In [None]:
ds = df.groupby(['cp_type', 'dataset'])['sig_id'].count().reset_index()

ds.columns = [
    'cp_type', 
    'dataset', 
    'count'
]

fig = px.bar(
    ds, 
    x='cp_type', 
    y="count", 
    color='dataset',
    barmode='group',
    orientation='v', 
    title='cp_type train/test counts', 
    width=cp_width,
    height=cp_height
)

fig.show()

In [None]:
ds = df.groupby(['cp_time', 'dataset'])['sig_id'].count().reset_index()

ds.columns = [
    'cp_time', 
    'dataset', 
    'count'
]

fig = px.bar(
    ds, 
    x='cp_time', 
    y="count", 
    color='dataset',
    barmode='group',
    orientation='v', 
    title='cp_time train/test counts', 
    width=cp_width,
    height=cp_height
)

fig.show()

In [None]:
ds = df.groupby(['cp_dose', 'dataset'])['sig_id'].count().reset_index()

ds.columns = [
    'cp_dose', 
    'dataset', 
    'count'
]

fig = px.bar(
    ds, 
    x='cp_dose', 
    y="count", 
    color='dataset',
    barmode='group',
    orientation='v', 
    title='cp_dose train/test counts', 
    width=cp_width,
    height=cp_height
)

fig.show()

In [None]:
ds = df[df['dataset']=='train']
ds = ds.groupby(['cp_type', 'cp_time', 'cp_dose'])['sig_id'].count().reset_index()

ds.columns = [
    'cp_type', 
    'cp_time', 
    'cp_dose', 
    'count'
]

fig = px.sunburst(
    ds, 
    path=[
        'cp_type',
        'cp_time',
        'cp_dose' 
    ], 
    values='count', 
    title='Sunburst chart for all cp_type/cp_time/cp_dose',
    width=500,
    height=500
)

fig.show()

<a id="3"></a>
<h2 style='background:blue; border:0; color:white'><center>3. Gene and cell features distribution</center></h2>

Distributions of some randomly selected columns.

In [None]:
train_columns = train.columns.to_list()

g_list = [i for i in train_columns if i.startswith('g-')]

c_list = [i for i in train_columns if i.startswith('c-')]

In [None]:
def plot_set_histograms(plot_list, title):
    fig = make_subplots(
        rows=4, 
        cols=3
    )
    
    traces = [
        go.Histogram(
            x=train[col], 
            nbinsx=100, 
            name=col
        ) for col in plot_list
    ]

    for i in range(len(traces)):
        fig.append_trace(
            traces[i], 
            (i // 3) + 1, 
            (i % 3) + 1
        )

    fig.update_layout(
        title_text=title,
        height=1000,
        width=WIDTH
    )
    fig.show()

In [None]:
# Draw 12 distinct gene expression columns. random.sample never repeats a column
# and, unlike np.random.randint(0, len(g_list)-1), can also pick the last one.
plot_list = random.sample(g_list, 12)
plot_set_histograms(plot_list, 'Randomly selected gene expression features distributions')

In [None]:
# Draw 12 distinct cell viability columns (same approach as for the gene features)
plot_list = random.sample(c_list, 12)
plot_set_histograms(plot_list, 'Randomly selected cell viability features distributions')

<a id="4"></a>
<h2 style='background:blue; border:0; color:white'><center>4. Training features correlation</center></h2>

Let's see some correlation between randomly selected variables.

In [None]:
columns = g_list + c_list
for_correlation = random.sample(columns, 50)
data = df[for_correlation]

f = plt.figure(
    figsize=(18, 18)
)

plt.matshow(
    data.corr(), 
    fignum=f.number
)

plt.xticks(
    range(data.shape[1]), 
    data.columns, 
    fontsize=14, 
    rotation=50
)

plt.yticks(
    range(data.shape[1]), 
    data.columns, 
    fontsize=14
)

cb = plt.colorbar()
cb.ax.tick_params(
    labelsize=13
)

Time to find pairs of features with high correlation.

In [None]:
%%time

cols = ['cp_time'] + columns
all_columns = list()
for i in range(0, len(cols)):
    for j in range(i+1, len(cols)):
        if abs(train[cols[i]].corr(train[cols[j]])) > 0.9:
            all_columns = all_columns + [cols[i], cols[j]]

In [None]:
all_columns = list(set(all_columns))
print('Number of columns:', len(all_columns))

In total, 35 columns have an absolute correlation above 0.9 with at least one other column. Let's visualize them.
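The pairwise loop above calls `Series.corr` once per pair, which is slow for several hundred columns. A possible alternative is to build the full correlation matrix once and filter its upper triangle. This is only a sketch on synthetic data; the frame `toy` and the names `high_pairs` and `high_columns` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Synthetic data: "b" is almost a copy of "a", "c" is independent noise
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep the strict upper triangle so each pair appears once and the diagonal is ignored
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pairs above the threshold, and the set of columns involved in them
high_pairs = pairs[pairs > 0.9]
high_columns = sorted(
    set(high_pairs.index.get_level_values(0)) | set(high_pairs.index.get_level_values(1))
)
print(high_columns)
```

On the real data, `toy` would be replaced by `train[cols]`; the threshold of 0.9 is the one used in the loop.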

In [None]:
data = df[all_columns]

f = plt.figure(
    figsize=(18, 18)
)

plt.matshow(
    data.corr(), 
    fignum=f.number
)

plt.xticks(
    range(data.shape[1]), 
    data.columns, 
    fontsize=14, 
    rotation=50
)

plt.yticks(
    range(data.shape[1]), 
    data.columns, 
    fontsize=14
)

cb = plt.colorbar()

cb.ax.tick_params(
    labelsize=14
)

Now let's look at the distributions of these highly correlated features.

In [None]:
fig = make_subplots(
    rows=12, 
    cols=3
)

traces = [
    go.Histogram(
        x=train[col], 
        nbinsx=100, 
        name=col
    ) for col in all_columns
]

for i in range(len(traces)):
    fig.append_trace(
        traces[i], 
        (i // 3) + 1, 
        (i % 3) + 1
    )

fig.update_layout(
    title_text='Highly correlated features',
    height=1200
)

fig.show()

<a id="5"></a>
<h2 style='background:blue; border:0; color:white'><center>5. Targets analysis</center></h2>

Let's check targets.

In [None]:
train_target = pd.read_csv("../input/lish-moa/train_targets_scored.csv")

print('Number of rows: ', train_target.shape[0])
print('Number of cols: ', train_target.shape[1])

train_target.head()

In [None]:
x = train_target.drop(['sig_id'], axis=1).sum(axis=0).sort_values().reset_index()

x.columns = [
    'column', 
    'nonzero_records'
]

x = x.tail(50)

fig = px.bar(
    x, 
    x='nonzero_records', 
    y='column', 
    orientation='h', 
    title='Columns with the highest number of positive samples (top 50)', 
    width=WIDTH,
    height=1000
)

fig.show()

In [None]:
x = train_target.drop(['sig_id'], axis=1).sum(axis=0).sort_values(ascending=False).reset_index()

x.columns = [
    'column', 
    'nonzero_records'
]

x = x.tail(50)

fig = px.bar(
    x, 
    x='nonzero_records', 
    y='column', 
    orientation='h', 
    title='Columns with the lowest number of positive samples (bottom 50)', 
    width=WIDTH,
    height=1000 
)

fig.show()

We can see that at least 50 target columns have fewer than 20 positive samples (about 0.1% of the rows)!

In [None]:
x = train_target.drop(['sig_id'], axis=1).sum(axis=0).sort_values(ascending=False).reset_index()

x.columns = [
    'column', 
    'count'
]

x['count'] = x['count'] * 100 / len(train_target)

fig = px.bar(
    x, 
    x='column', 
    y='count', 
    orientation='v', 
    title='Percent of positive records for every column in target', 
    width=1200,
    height=800 
)

fig.show()

The largest share of positive samples for a single target column is about 3.5%, so we are dealing with highly imbalanced data.
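Because the targets are 0/1 flags, the percentage of positives per column is simply the column mean times 100. A small sketch on invented data (the frame `toy_targets` and its `moa_*` columns are hypothetical):

```python
import pandas as pd

# Invented 0/1 targets for four samples
toy_targets = pd.DataFrame({
    "sig_id": ["s1", "s2", "s3", "s4"],
    "moa_a": [1, 0, 0, 0],
    "moa_b": [0, 0, 0, 0],
    "moa_c": [1, 1, 0, 0],
})

# Mean of a 0/1 column is its positive rate; scale to percent and rank
positive_pct = (
    toy_targets.drop(columns="sig_id")
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)
print(positive_pct)
```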

In [None]:
data = train_target.drop(['sig_id'], axis=1).astype(bool).sum(axis=1).reset_index()

data.columns = [
    'row', 
    'count'
]

data = data.groupby(['count'])['row'].count().reset_index()

fig = px.bar(
    data, 
    y=data['row'], 
    x="count", 
    title='Number of activations in targets for every sample', 
    width=WIDTH, 
    height=500
)

fig.show()

In [None]:
data = train_target.drop(['sig_id'], axis=1).astype(bool).sum(axis=1).reset_index()

data.columns = [
    'row', 
    'count'
]

data = data.groupby(['count'])['row'].count().reset_index()

fig = px.pie(
    data, 
    values=100 * data['row'] / len(train_target), 
    names="count", 
    title='Number of activations in targets for every sample (Percent)', 
    width=WIDTH, 
    height=500
)

fig.show()

We can see here that about 40% of samples have zeros in all target columns and more than 50% have exactly one active target column.
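The shares quoted above can also be obtained directly with `value_counts(normalize=True)`. The following is a sketch on invented data; `toy_targets` is hypothetical.

```python
import pandas as pd

# Invented 0/1 targets for four samples
toy_targets = pd.DataFrame({
    "sig_id": ["s1", "s2", "s3", "s4"],
    "moa_a": [1, 0, 0, 0],
    "moa_b": [0, 0, 0, 0],
    "moa_c": [1, 1, 0, 0],
})

# Number of active targets in each row
labels_per_sample = toy_targets.drop(columns="sig_id").sum(axis=1)

# Fraction of samples with 0, 1, 2, ... active targets
share = labels_per_sample.value_counts(normalize=True).sort_index()
print(share)
```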

In [None]:
train_target.describe()

<a id="6"></a>
<h2 style='background:blue; border:0; color:white'><center>6. Train & Targets correlations</center></h2>

Time to find the most correlated features for every target column.

In [None]:
%%time

correlation_matrix = pd.DataFrame()

for t_col in train_target.columns:
    corr_list = list()
    if t_col == 'sig_id':
        continue
    for col in columns:
        res = train[col].corr(train_target[t_col])
        corr_list.append(res)
    correlation_matrix[t_col] = corr_list

The result is a large correlation matrix with one row per training feature and one column per target.
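The nested loop computes one Pearson correlation at a time. A possible faster route is to standardize both frames and take a single matrix product. This is a sketch, not the notebook's method: the helper `cross_corr` and the synthetic `features` and `targets` frames are hypothetical, and it assumes there are no missing values and no constant columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic features and two binary targets derived from them
features = pd.DataFrame(rng.normal(size=(300, 4)), columns=["g-0", "g-1", "c-0", "c-1"])
targets = pd.DataFrame({
    "moa_a": (features["g-0"] > 1).astype(int),
    "moa_b": (features["c-1"] < -1).astype(int),
})

def cross_corr(x: pd.DataFrame, y: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation of every column of x with every column of y."""
    # z-scores with population std, so the mean of products equals the correlation
    xz = (x - x.mean()) / x.std(ddof=0)
    yz = (y - y.mean()) / y.std(ddof=0)
    return xz.T.dot(yz) / len(x)

fast = cross_corr(features, targets)

# Reference result computed pair by pair, as in the loop above
slow = pd.DataFrame(
    {t: [features[f].corr(targets[t]) for f in features] for t in targets},
    index=features.columns,
)
print(np.allclose(fast.to_numpy(), slow.to_numpy()))
```

The result has training features as rows and targets as columns, the same layout as `correlation_matrix`.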

In [None]:
correlation_matrix['train_features'] = columns
correlation_matrix = correlation_matrix.set_index('train_features')

correlation_matrix

For every target column, let's find the correlation with the largest absolute value across all training features. Each bar in the chart is the strongest correlation between that target column and any column of the training set.

In [None]:
maxCol=lambda x: max(x.min(), x.max(), key=abs)

high_scores = correlation_matrix.apply(maxCol, axis=0).reset_index()

high_scores.columns = [
    'column', 
    'best_correlation'
]

fig = px.bar(
    high_scores, 
    x='column', 
    y="best_correlation", 
    orientation='v', 
    title='Best correlation with train columns for every target column', 
    width=1200,
    height=800
)

fig.show()

Now let's see which training columns are most often the best-correlated feature for a target column. Each row of the chart means that column `A` has the strongest correlation for `N` different target columns.
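One compact way to get the best feature per target is `idxmax` on the absolute correlations. Below is a sketch on a tiny made-up matrix; `toy_corr` and its values are invented, laid out like `correlation_matrix` with features as rows and targets as columns.

```python
import pandas as pd

# Made-up correlations: rows are training features, columns are targets
toy_corr = pd.DataFrame(
    {
        "moa_a": [0.10, -0.40, 0.05],
        "moa_b": [0.30, 0.02, -0.20],
        "moa_c": [-0.15, -0.35, 0.10],
    },
    index=pd.Index(["g-0", "g-1", "c-0"], name="train_features"),
)

# Feature with the largest absolute correlation for each target
best_feature = toy_corr.abs().idxmax()

# Signed correlation value at that feature
best_value = pd.Series({t: toy_corr.loc[f, t] for t, f in best_feature.items()})

# How many targets each feature "wins"
wins = best_feature.value_counts()
print(best_feature, best_value, wins, sep="\n")
```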

In [None]:
col_df = pd.DataFrame()
tr_cols = list()
tar_cols = list()

for col in correlation_matrix.columns:
    tar_cols.append(col)
    tr_cols.append(
        correlation_matrix[col].abs().sort_values(ascending=False).reset_index()['train_features'].head(1).values[0]
    )

col_df['column'] = tar_cols
col_df['train_best_column'] = tr_cols

total_scores = pd.merge(high_scores, col_df)

total_scores

In [None]:
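# Sort the counts before reset_index so the result does not depend on how
# the pandas version names the columns produced by value_counts()
count_features = total_scores['train_best_column'].value_counts().sort_values().reset_index()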

count_features.columns = [
    'column', 
    'count'
]

count_features = count_features.tail(33)

fig = px.bar(
    count_features, 
    x='count', 
    y="column", 
    orientation='h', 
    title='Columns from training set with number of high correlations with target columns', 
    width=WIDTH,
    height=700
)

fig.show()

Target columns and pairs of highly correlated features.

Let's select a few random target columns and plot each one against the two training features most correlated with it.

In [None]:
target_columns = train_target.columns.tolist()
target_columns.remove('sig_id')
# Sample without replacement: duplicate target names would produce duplicate
# columns in current_corr and break the per-column sorting below
for_analysis = random.sample(target_columns, 5)

current_corr = correlation_matrix[for_analysis]

In [None]:
col_df = pd.DataFrame()
tr_first_cols = list()
tr_second_cols = list()
tar_cols = list()

for col in current_corr.columns:
    tar_cols.append(col)
    tr_first_cols.append(
        current_corr[col].abs().sort_values(ascending=False).reset_index()['train_features'].head(2).values[0]
    )
    tr_second_cols.append(
        current_corr[col].abs().sort_values(ascending=False).reset_index()['train_features'].head(2).values[1]
    )

col_df['column'] = tar_cols
col_df['train_1_column'] = tr_first_cols
col_df['train_2_column'] = tr_second_cols

col_df

In [None]:
def plot_scatter(col_df, index):
    analysis = pd.DataFrame()
    analysis['color'] = train_target[col_df.iloc[index]['column']]
    analysis['x'] = train[col_df.iloc[index]['train_1_column']]
    analysis['y'] = train[col_df.iloc[index]['train_2_column']]
    analysis.columns = [
        'color', 
        col_df.iloc[index]['train_1_column'], 
        col_df.iloc[index]['train_2_column']
    ]
    analysis['size'] = 1
    analysis.loc[analysis['color'] == 1, 'size'] = 12

    fig = px.scatter(
        analysis, 
        x=col_df.iloc[index]['train_1_column'], 
        y=col_df.iloc[index]['train_2_column'], 
        color="color", 
        size='size', 
        width=scatter_size,
        height=scatter_size,
        title='Scatter plot for ' + col_df.iloc[index]['column']
    )
    fig.show()

In [None]:
plot_scatter(col_df, 0)

In [None]:
plot_scatter(col_df, 1)

In [None]:
plot_scatter(col_df, 2)

Let's do the same but for 3d plots.

In [None]:
# Sample without replacement so the five target columns are distinct
for_analysis = random.sample(target_columns, 5)

current_corr = correlation_matrix[for_analysis]

col_df = pd.DataFrame()
tr_first_cols = list()
tr_second_cols = list()
tr_third_cols = list()
tar_cols = list()

for col in current_corr.columns:
    tar_cols.append(col)
    tr_first_cols.append(
        current_corr[col].abs().sort_values(ascending=False).reset_index()['train_features'].head(3).values[0]
    )
    tr_second_cols.append(
        current_corr[col].abs().sort_values(ascending=False).reset_index()['train_features'].head(3).values[1]
    )
    tr_third_cols.append(
        current_corr[col].abs().sort_values(ascending=False).reset_index()['train_features'].head(3).values[2]
    )

col_df['column'] = tar_cols
col_df['train_1_column'] = tr_first_cols
col_df['train_2_column'] = tr_second_cols
col_df['train_3_column'] = tr_third_cols

col_df

In [None]:
def plot_3dscatter(col_df, index):
    analysis = pd.DataFrame()
    analysis['color'] = train_target[col_df.iloc[index]['column']]
    analysis['x'] = train[col_df.iloc[index]['train_1_column']]
    analysis['y'] = train[col_df.iloc[index]['train_2_column']]
    analysis['z'] = train[col_df.iloc[index]['train_3_column']]
    analysis.columns = [
        'color', 
        col_df.iloc[index]['train_1_column'], 
        col_df.iloc[index]['train_2_column'], 
        col_df.iloc[index]['train_3_column']
    ]
    analysis['size'] = 1
    analysis.loc[analysis['color'] == 1, 'size'] = 20

    fig = px.scatter_3d(
        analysis, 
        x=col_df.iloc[index]['train_1_column'], 
        y=col_df.iloc[index]['train_2_column'],
        z=col_df.iloc[index]['train_3_column'], 
        color="color", 
        size='size', 
        height=scatter_size,
        width=scatter_size,
        title='Scatter plot for ' + col_df.iloc[index]['column']
    )
    fig.show()

In [None]:
plot_3dscatter(col_df, 0)

In [None]:
plot_3dscatter(col_df, 1)

In [None]:
plot_3dscatter(col_df, 2)

We can extract several group names from the target column names. It looks like the last term of a column name defines its group. Let's extract these terms and visualize the groups that contain more than one column.

In [None]:
last_term = dict()

for item in target_columns:
    # The group is the last underscore-separated term of the column name
    group = item.split('_')[-1]
    last_term[group] = last_term.get(group, 0) + 1

last_term = pd.DataFrame(last_term.items(), columns=['group', 'count'])
last_term = last_term.sort_values('count')
last_term = last_term[last_term['count']>1]
last_term['count'] = last_term['count'] * 100 / len(target_columns)

fig = px.bar(
    last_term, 
    x='count', 
    y="group", 
    orientation='h', 
    title='Groups in target columns (Percent from all target columns)', 
    width=WIDTH,
    height=500
)

fig.show()

Can a single sample have more than one active target within the same group?

In [None]:
answer = list()

for group in last_term.group.tolist():
    agent_list = list()
    for item in target_columns:
        if item.split('_')[-1] == group:
            agent_list.append(item)
    agent_df = train_target[agent_list]
    data = agent_df.astype(bool).sum(axis=1).reset_index()
    answer.append(data[0].max())

In [None]:
ds = pd.DataFrame()
ds['group'] = last_term.group.tolist()
ds['max_value'] = answer

fig = px.bar(
    ds, 
    x='max_value', 
    y="group", 
    orientation='h', 
    title='Maximum number of active columns in 1 sample for every group', 
    width=WIDTH,
    height=500
)

fig.show()

We can see that for the groups activator, agent and blocker, the maximum number of active columns in a single sample is 1.
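The per-group maximum can also be computed without an explicit loop, by grouping the target columns on their suffix. This is a sketch on invented data; `toy_targets` and its column names are hypothetical.

```python
import pandas as pd

# Invented targets: two "inhibitor" columns and two "agonist" columns
toy_targets = pd.DataFrame({
    "a_inhibitor": [1, 0, 0],
    "b_inhibitor": [1, 0, 1],
    "c_agonist": [0, 1, 0],
    "d_agonist": [0, 0, 1],
})

# Group label of each column: the last underscore-separated term
suffix = toy_targets.columns.str.split("_").str[-1].to_numpy()

# Transpose so columns become rows, sum the flags per group, transpose back:
# the result has one row per sample and one column per group
active_per_group = toy_targets.T.groupby(suffix).sum().T

# Largest number of simultaneously active columns per group
max_active = active_per_group.max()
print(max_active)
```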

<a id="7"></a>
<h2 style='background:blue; border:0; color:white'><center>7. Targets & Train features dependencies</center></h2>

Let's check target columns with categorical columns from training set.

In [None]:
categories = train[['cp_type', 'cp_time', 'cp_dose']]
tar = train_target.copy()
tar = tar.drop(['sig_id'], axis=1)
analysis = pd.concat([categories, tar], axis=1)

In [None]:
for category in analysis['cp_dose'].unique().tolist():
    
    number = 0
    cols = list()
    
    for col in analysis.columns:
        if col in ['cp_type', 'cp_time', 'cp_dose']:
            continue
        if len(analysis[analysis['cp_dose'] == category][col].value_counts()) == 1:
            number += 1
            cols.append(col)

    print(category, '. Number of columns with 1 unique value: ', number, '. Columns: ', cols)

Let's check the problematic columns for cp_dose = D2.

In [None]:
analysis[analysis['cp_dose'] == 'D2']['atp-sensitive_potassium_channel_antagonist'].value_counts()

In [None]:
analysis[analysis['cp_dose']=='D2']['erbb2_inhibitor'].value_counts()

In [None]:
for category in analysis['cp_time'].unique().tolist():
    
    number = 0
    cols = list()
    
    for col in analysis.columns:
        if col in ['cp_type', 'cp_time', 'cp_dose']:
            continue
        if len(analysis[analysis['cp_time']==category][col].value_counts()) == 1:
            number += 1
            cols.append(col)

    print(category, '. Number of columns with 1 unique value: ', number, '. Columns: ', cols)

Let's check problematic columns for cp_time = 24 and 72.

In [None]:
analysis[analysis['cp_time'] == 24]['erbb2_inhibitor'].value_counts()

In [None]:
analysis[analysis['cp_time'] == 72]['erbb2_inhibitor'].value_counts()

In [None]:
analysis[analysis['cp_time'] == 24]['atp-sensitive_potassium_channel_antagonist'].value_counts()

In [None]:
analysis[analysis['cp_time'] == 72]['atp-sensitive_potassium_channel_antagonist'].value_counts()

In [None]:
for category in analysis['cp_type'].unique().tolist():
    
    number = 0
    cols = list()
    
    for col in analysis.columns:
        if col in ['cp_type', 'cp_time', 'cp_dose']:
            continue
        if len(analysis[analysis['cp_type'] == category][col].value_counts()) == 1:
            number += 1
            cols.append(col)

    print(category, '. Number of columns with 1 unique value: ', number, '. Columns: ', cols)

In [None]:
analysis[analysis['cp_type']=='ctl_vehicle']['igf-1_inhibitor'].value_counts()

We can see that every record with ```cp_type``` == ```ctl_vehicle``` has 0 in all target columns. The same holds for ```cp_time``` == 72 and == 24, but only for 2 target columns, and for ```cp_dose``` == D2, also for 2 target columns.
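The three loops in this section follow the same pattern, so they could be expressed with one `groupby(...).nunique()` call per categorical column. A sketch on invented data follows; the frame `toy` and its `moa_*` columns are hypothetical.

```python
import pandas as pd

# Invented data: control rows never have an active target
toy = pd.DataFrame({
    "cp_type": ["trt_cp", "trt_cp", "ctl_vehicle", "ctl_vehicle"],
    "moa_a": [1, 0, 0, 0],
    "moa_b": [0, 1, 0, 0],
})

# Number of distinct values of each target within each category
n_unique = toy.groupby("cp_type")[["moa_a", "moa_b"]].nunique()

# Targets that take a single value inside a category
constant = {cat: row.index[row == 1].tolist() for cat, row in n_unique.iterrows()}
print(constant)
```

On the real data, the same call with `cp_time` or `cp_dose` as the grouping key would cover the other two loops.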
