In this kernel, we are going to explore:

* The association between the response and the explanatory variables.
* The correlation among the explanatory variables.

We are going to use seaborn, a statistical plotting library for Python.

In [None]:
# Loading libraries 
import numpy as np
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt 
import warnings
warnings.filterwarnings('ignore')

sns.set(style="ticks", color_codes=True)

In [None]:
df_train = pd.read_csv("../input/train.csv")

In [None]:
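# Drop the identifier column; it is not an explanatory variable
df_train.pop("id");
# Store the target as the strings "0"/"1" so seaborn treats it as a category
df_train["target"] = df_train["target"].astype(int).astype(str)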

# Association between response variable and explanatory variables 

To compare the distribution of each explanatory variable conditioned on the target variable, we are going to use the violin plot, which displays numerical data with a rotated kernel density plot on each side. A kernel density plot is a smooth estimate of the distribution of a numeric variable. In the graph below you can see the kernel density plot (the curve) and a histogram for the variable x (a hypothetical variable).

![](https://blogs.sas.com/content/iml/files/2016/07/kdecomponents2.png)
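To make the idea concrete, here is a minimal sketch of what a Gaussian kernel density estimate computes, written with NumPy only. The sample `x`, the helper name `gaussian_kde_1d`, and the normal-reference rule of thumb used for the bandwidth are assumptions made for illustration; seaborn's own estimate may differ in details such as bandwidth selection.

```python
import numpy as np

def gaussian_kde_1d(sample, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of `sample` on `grid`."""
    sample = np.asarray(sample, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = sample.size
    if bandwidth is None:
        # Normal-reference rule of thumb (an assumed default for this sketch)
        bandwidth = 1.06 * sample.std(ddof=1) * n ** (-1 / 5)
    # Standardized distance from every grid point to every observation
    z = (grid[:, None] - sample[None, :]) / bandwidth
    # Average the Gaussian bumps centered on each observation
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (n * bandwidth)

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=500)  # hypothetical variable x
grid = np.linspace(-8, 8, 801)
density = gaussian_kde_1d(x, grid)

# A density should integrate to roughly 1 over a wide enough grid
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

Each observation contributes one small bell curve, and the estimate is their average. A larger bandwidth gives a smoother curve, a smaller one follows the histogram more closely.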

We are going to explore the violin plots for the variables 0, 1, 2, 3, and 4 conditioned on the target variable. Each of the following graphs is a bivariate plot.

In [None]:
g = sns.FacetGrid(df_train, col="target")
g.map(sns.violinplot, "0")

In [None]:
g = sns.FacetGrid(df_train, col="target")
g.map(sns.violinplot, "1")

In [None]:
g = sns.FacetGrid(df_train, col="target")
g.map(sns.violinplot, "2")

In [None]:
g = sns.FacetGrid(df_train, col="target")
g.map(sns.violinplot, "3")

In [None]:
g = sns.FacetGrid(df_train, col="target")
g.map(sns.violinplot, "4")

The distributions for the 0's and 1's are similar for each of the previous explanatory variables, which suggests that these variables, taken individually, are not good predictors of the target variable. This is not the case for other explanatory variables, such as 7 and 220, as shown below.

In [None]:
g = sns.FacetGrid(df_train, col="target")
g.map(sns.violinplot, "7")

In [None]:
g = sns.FacetGrid(df_train, col="target")
g.map(sns.violinplot, "220")

We are going to create a function to explore an arbitrary subset of explanatory variables.

In [None]:
def plot_violin(variables):
    """Draw one violin plot per target class for each variable name in `variables`."""
    for x in variables:
        g = sns.FacetGrid(df_train, col="target")
        g.map(sns.violinplot, x)

In [None]:
plot_violin(["33", "65", "91"])

The graphs above show that the distributions of these explanatory variables differ between the 0's and 1's, so these variables have some discriminatory power.
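As a rough numerical complement to the violin plots, one could rank variables by how far apart the two class means are relative to the spread within the classes. The sketch below is an illustration under stated assumptions, not part of the original analysis: the helper `standardized_mean_difference` and the synthetic data frame `demo` are hypothetical, and the score only captures differences in location, not in shape.

```python
import numpy as np
import pandas as pd

def standardized_mean_difference(df, features, target="target"):
    """Absolute difference of class means divided by the pooled standard deviation."""
    group0 = df.loc[df[target] == "0", features]
    group1 = df.loc[df[target] == "1", features]
    pooled_std = np.sqrt((group0.var() + group1.var()) / 2)
    score = (group1.mean() - group0.mean()).abs() / pooled_std
    return score.sort_values(ascending=False)

# Synthetic stand-in for df_train: "a" shifts with the target, "b" does not
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=400)
demo = pd.DataFrame({
    "target": target.astype(str),
    "a": rng.normal(loc=target * 1.5, scale=1.0),
    "b": rng.normal(loc=0.0, scale=1.0, size=400),
})
scores = standardized_mean_difference(demo, ["a", "b"])
print(scores)
```

With the notebook's data this could be called as `standardized_mean_difference(df_train, ["33", "65", "91"])`, since `target` was converted to the strings "0" and "1" earlier.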

# Scatter plot

We are going to draw a pair plot, which shows a scatter plot for every pair of variables in a set and helps identify which pairs of features separate the two classes. To illustrate the concept of a scatter plot and of variables with discriminatory power, the following graph is shown:

![](https://www.bonaccorso.eu/wp-content/uploads/2017/09/ikm_1-768x514.png)



In the graph above we can separate the two classes (red = 1 and green = 0) with a line. If you are able to separate the classes with a line (or a more complex curve) to a reasonable degree of accuracy, then these variables have discriminatory power.
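A small sketch of this idea, using synthetic data rather than the competition data: two clouds of points are generated, and the line through the midpoint of the class centroids, perpendicular to the segment joining them, is used as the boundary. The data, the variable names, and the choice of this particular line (a nearest-centroid rule) are assumptions made for illustration; it is one simple way to draw a separating line, not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical explanatory variables for each class
class0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
class1 = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(200, 2))
X = np.vstack([class0, class1])
y = np.array([0] * 200 + [1] * 200)

# Line through the midpoint of the centroids, perpendicular to their difference
centroid0, centroid1 = class0.mean(axis=0), class1.mean(axis=0)
w = centroid1 - centroid0
midpoint = (centroid0 + centroid1) / 2

# Points on the class-1 side of the line get a positive score
predicted = ((X - midpoint) @ w > 0).astype(int)
accuracy = (predicted == y).mean()
print(f"accuracy of the linear rule: {accuracy:.3f}")
```

The closer the two clouds are to each other, the lower this accuracy becomes, which is what low discriminatory power looks like in a scatter plot.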

In [None]:
def plot_pair_plot(columns):
    """Draw pairwise scatter plots of `columns` (must include "target"), colored by target."""
    df = df_train.loc[:, columns]
    sns.pairplot(df, hue="target")

In [None]:
lista_variables = ['target', '0', '1', '2', '3']
plot_pair_plot(lista_variables)

It seems that the variables 0, 1, 2, and 3 have low discriminatory power. We are going to explore other variables.

In [None]:
lista_variables = ['target', '33', '65', '91']
plot_pair_plot(lista_variables)

It seems that the variables 33, 65, and 91 have some discriminatory power.