In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

**Problem Statement**

We found a USB stick in a bottle on the shore of our nearby beach. The USB stick holds some interesting data. There are no details about the columns or what the data represents, but a look at the data shows that each row belongs to one of two classes, so this is a binary classification problem. Along with the dataset we got the following poem:

> *Silly column names abound, 
> but the test set is a mystery. 
> Careful how you pick and slice, 
> or be left behind by history.*

A first reading of the poem suggests a few caveats to guide our analysis:
1. Do not rely on the column names (they are silly).
2. The test set is a mystery.
3. Be cautious with data cleaning and variable reduction, since a mistake there could lead to large errors.
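
One way to act on these caveats is to confirm which columns train and test actually share before slicing anything. The sketch below is only an illustration: `toy_train` and `toy_test` are made-up frames with invented column names, not the data from the USB stick.

```python
import pandas as pd

# Toy stand-ins for train and test; column names and values are invented.
toy_train = pd.DataFrame({"id": ["a", "b"], "f1": [0.1, 0.2], "target": [0, 1]})
toy_test = pd.DataFrame({"id": ["c", "d"], "f1": [0.3, 0.4]})

# Columns present in one frame but not in the other.
only_in_train = set(toy_train.columns) - set(toy_test.columns)
only_in_test = set(toy_test.columns) - set(toy_train.columns)
print(only_in_train, only_in_test)
```

If the check is run on the real frames, one would expect only the target column to be missing from test; anything else is worth investigating before dropping or transforming columns.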

**Read Data**

Let's start by reading all datasets including the sample submission file.

In [None]:
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")
sample_submission = pd.read_csv("../input/sample_submission.csv")

A sneak peek at the data is never a bad idea.

In [None]:
train.head()

In [None]:
print("There are {} rows and {} columns".format(train.shape[0],train.shape[1]))

**Exploratory Data Analysis**

First we get a feel for the patterns, distributions and relationships in the data, starting with the target variable.

In [None]:
train['target'].value_counts(normalize=True)

The two classes are split roughly evenly, so class imbalance is not a concern here.
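
As a rough sketch of the same check, the class shares can be reduced to a single balance ratio. The frame `toy` below holds invented values and only stands in for `train`; `balance_ratio` is a hypothetical helper name, not something from the original kernel.

```python
import pandas as pd

# Toy stand-in for `train`; the values are invented for illustration.
toy = pd.DataFrame({"target": [0, 1, 1, 0, 1, 0, 1, 0]})

# Share of rows in each class.
class_share = toy["target"].value_counts(normalize=True)

# Ratio of the rarest class to the most common one; 1.0 means perfectly balanced.
balance_ratio = class_share.min() / class_share.max()
print(class_share)
print("balance ratio:", balance_ratio)
```

A ratio close to 1 supports treating the problem as balanced; a much smaller value would suggest looking at stratified splits or class weights.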

**Missing Value Treatment**

Next we count the missing values in each column. The function below is taken from [Will's](https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction) kernel.

In [None]:
# Function to calculate missing values by column
def missing_values_table(df):
        # Total missing values
        mis_val = df.isnull().sum()
        
        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        # Print some summary information
        print(f"Your selected dataframe has {df.shape[1]} columns.\n"
              f"There are {mis_val_table_ren_columns.shape[0]} columns that have missing values.")
        
        # Return the dataframe with missing information
        return mis_val_table_ren_columns

In [None]:
# Missing values statistics
missing_values = missing_values_table(train)
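
For a quick look, the percentage column of that table can also be approximated in a couple of lines. This is a minimal sketch on a made-up frame (`toy` and its column names are invented), not a replacement for the function above.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; the column names are hypothetical.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": [1, 2, 3, 4],
})

# Fraction of missing entries per column, expressed as a percentage.
pct_missing = toy.isnull().mean() * 100

# Keep only columns with at least one missing value, largest share first.
pct_missing = pct_missing[pct_missing > 0].sort_values(ascending=False)
print(pct_missing)
```

Taking the mean of the boolean mask gives the missing fraction directly, which avoids dividing by `len(df)` by hand.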

In [None]:
# Number of unique values in each object (string) column
train.select_dtypes('object').apply(pd.Series.nunique, axis = 0)

The only object column is `id`, which is the unique identifier of each row in the dataset.
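
The uniqueness claim can be checked explicitly rather than read off the count. The sketch below uses an invented frame `toy` in place of `train`; only the column name `id` comes from the notebook.

```python
import pandas as pd

# Toy stand-in for `train`; the ids and targets are invented.
toy = pd.DataFrame({"id": ["a1", "b2", "c3", "d4"], "target": [0, 1, 0, 1]})

# True when no id value repeats.
id_is_unique = toy["id"].is_unique

# Equivalent check: the number of distinct ids equals the number of rows.
same_count = toy["id"].nunique() == len(toy)
print(id_is_unique, same_count)
```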

Next we compute the pairwise correlations between the variables.

In [None]:
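# Pairwise correlations between the numeric columns.
# The string `id` column is excluded, since corr() cannot handle it in recent pandas versions.
correlations = train.select_dtypes('number').corr()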

In [None]:
correlations
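
The full matrix is hard to read with many columns, so it may help to pull out only the correlations with the target and rank them by magnitude. The following is a sketch on synthetic data: `feat_signal` and `feat_noise` are hypothetical column names, and the ranking step is an assumption about what is useful here, not part of the original kernel.

```python
import numpy as np
import pandas as pd

# Synthetic data; `feat_signal` and `feat_noise` are hypothetical names.
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=500)
toy = pd.DataFrame({
    "id": [f"row{i}" for i in range(500)],
    "feat_signal": target + rng.normal(0, 0.5, size=500),
    "feat_noise": rng.normal(0, 1, size=500),
    "target": target,
})

# Restrict to numeric columns so the string id does not break corr().
numeric = toy.select_dtypes("number")

# Correlation of every feature with the target, strongest first by magnitude.
target_corr = numeric.corr()["target"].drop("target")
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Ranking by absolute value keeps strongly negative correlations near the top, which a plain ascending or descending sort would split across both ends of the list.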

In [None]:
train.describe()