In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
# Next, load the train and test datasets available in the "../input/" directory
train = pd.read_csv("../input/train.csv") # the train dataset is now a Pandas DataFrame
test = pd.read_csv("../input/test.csv") # the test dataset is now a Pandas DataFrame

# Let's have a peek of the train data
train.head()

**Data size**

Let's start with the most basic exploration of the dataset: get the number of attributes (features) and instances (data points):



In [None]:
instance_count, attr_count = train.shape
print('Number of instances:', instance_count)
print('Number of features:', attr_count)

The features can be split into input ones and target ones. In this case there's just one target (SalePrice) and several inputs.
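
As a minimal sketch of that split, the snippet below separates inputs from the target on a tiny stand-in DataFrame. The name `toy` and its three rows are illustrative assumptions, not the real training data; on the real data the same two lines would be applied to `train`.

```python
import pandas as pd

# Hypothetical miniature stand-in for the training data (illustrative values only).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "YearBuilt": [2003, 1976, 2001],
    "SalePrice": [208500, 181500, 223500],
})

# Inputs: every column except the target. Target: the SalePrice column.
X = toy.drop(columns="SalePrice")
y = toy["SalePrice"]

print(X.shape, y.shape)
```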

## Distributions of each attribute ##

We'll start by exploring our dataset attributes:

In [None]:
# View the columns
train.columns

Next we'd like to know how the values of each attribute are distributed. We can readily get the basic statistics (`count, mean, std, min, max, quartiles`) via the pandas `df.describe()` method:

In [None]:
# some statistical overview

train.describe()

## Missing values ##

As the differing `count` values in the cell above show, some of the columns have missing values. In pandas, missing values are represented by `np.nan`. The `pd.isnull(df).any()` command tells us whether each column contains any missing values, and `pd.isnull(df).sum()` counts them.

In [None]:
# Check for missing values
pd.isnull(train).any()

In [None]:
# Count missing values in training data set
pd.isnull(train).sum()
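
With many columns the raw counts are hard to scan. A small sketch of one way to condense them is to keep only the columns that have gaps and rank them by the share of rows affected. The DataFrame `toy` below is a made-up stand-in; the same steps should carry over to `train`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with a few gaps (values are made up).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})

missing_counts = toy.isnull().sum()
# Keep only columns with at least one missing value, largest count first.
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
# Fraction of rows that are missing in each of those columns.
missing_share = missing_counts / len(toy)

print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))
```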

**Filling in Missing Data**

Let's call the `fillna` method to fill in the missing data with the column averages. 

First let's view the mean

In [None]:
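# Means only exist for numeric columns; recent pandas versions require this to be explicit
train.mean(numeric_only=True)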

Let's fill in the "holes" with the means on numerical attributes

In [None]:
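# fillna returns a new DataFrame and leaves `train` itself unchanged;
# assign the result back to `train` to keep the filled values
train.fillna(train.mean(numeric_only=True))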
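
The behaviour of mean imputation is easier to see on a tiny example. The sketch below uses a made-up DataFrame `toy` (an assumption for illustration) to show that only numeric columns are filled and that the original frame is left untouched unless the result is assigned back.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: one numeric and one text column, each with a gap.
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "Alley": ["Grvl", None, "Pave"],
})

col_means = toy.mean(numeric_only=True)  # a Series holding the LotFrontage mean only
filled = toy.fillna(col_means)           # new DataFrame; columns are matched by name

print(filled)
print("gaps left in the original:", toy["LotFrontage"].isnull().sum())
```

The text column keeps its gap because it has no mean; categorical attributes need a different strategy, such as a placeholder category.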

## Correlations between attributes ##

By definition, a correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together. A positive correlation indicates the extent to which those variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases.

For our dataset, knowing which pairs of attributes are correlated, and how strongly, gives a better understanding of the underlying structure of the data. Correlated features can cause trouble for many ML algorithms, so ideally we should aim for a set of independent features.

We can use the Pandas `DataFrame.corr()` function, which supports three correlation coefficients: the standard Pearson correlation coefficient, the Spearman rank correlation and the Kendall Tau correlation coefficient.
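
To illustrate why the choice of coefficient matters, here is a small sketch on synthetic data (the frame `toy` is an assumption, not part of the housing data). The relationship `y = x ** 3` is perfectly monotonic but not linear, so the rank-based Spearman coefficient reports a perfect association while Pearson does not.

```python
import numpy as np
import pandas as pd

# Synthetic example: y grows monotonically with x, but not linearly.
x = np.arange(1, 21, dtype=float)
toy = pd.DataFrame({"x": x, "y": x ** 3})

pearson_xy = toy.corr(method="pearson").loc["x", "y"]
spearman_xy = toy.corr(method="spearman").loc["x", "y"]

print("Pearson: ", pearson_xy)   # below 1, because the relationship is curved
print("Spearman:", spearman_xy)  # ranks agree exactly
```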

**Pearson correlation coefficient**

One of the simplest methods for understanding a feature's relation to the response variable is the Pearson correlation coefficient, which measures the linear correlation between two variables. The resulting value lies in [-1, 1], with -1 meaning perfect negative correlation (as one variable increases, the other decreases), +1 meaning perfect positive correlation and 0 meaning no linear correlation between the two variables.

In [None]:
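# Correlations are only defined for numeric columns; recent pandas versions require this to be explicit
pearson = train.corr(method='pearson', numeric_only=True)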
pearson
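
For reference, the coefficient can also be computed by hand as the sum of products of the centred values divided by the square root of the product of their sums of squares. The sketch below checks this against pandas on two short made-up series (`x` and `y` here are illustrative, not columns of `train`).

```python
import numpy as np
import pandas as pd

# Made-up values for illustration.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Centre both variables, then apply the textbook formula.
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = x.corr(y)  # Pearson is the default method

print(r_manual, r_pandas)
```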

We'd like to know how well each input attribute predicts the target, i.e. `SalePrice`. As a simple measure of this predictivity we use the correlation between each input attribute and the target:

In [None]:
# Since the target attr is the last, remove corr with itself
# (.iloc replaces the .ix indexer, which was removed from pandas)
corr_with_target = pearson.iloc[-1][:-1]

corr_with_target_dict = corr_with_target.to_dict()

# List the attributes sorted from the most predictive by their correlation with Sale Price
print("FEATURE \tCORRELATION")
for attr in sorted(corr_with_target_dict.items(), key = lambda x: -abs(x[1])):
    print("{0}: \t{1}".format(*attr))

Since we are also interested in strong negative correlations, it is better to sort the correlations by absolute value:

In [None]:
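# Order by absolute value, strongest first, while keeping the signed values
corr_with_target.reindex(corr_with_target.abs().sort_values(ascending=False).index)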

It would also be interesting to understand strong correlations between attribute pairs.

In [None]:
attrs = pearson.iloc[:-1,:-1] # all except target
# only important correlations and not auto-correlations
threshold = 0.5
# {(YearBuilt, YearRemodAdd): 0.592855, (1stFlrSF, GrLivArea): 0.566024, ...
important_corrs = (attrs[abs(attrs) > threshold][attrs != 1.0]) \
    .unstack().dropna().to_dict()
#     attribute pair                   correlation
# 0     (OverallQual, TotalBsmtSF)     0.537808
# 1     (GarageArea, GarageCars)	   0.882475
# ...
unique_important_corrs = pd.DataFrame(
    list(set([(tuple(sorted(key)), important_corrs[key]) \
    for key in important_corrs])), columns=['Attribute Pair', 'Correlation'])
# sorted by absolute value
unique_important_corrs = unique_important_corrs.iloc[
    abs(unique_important_corrs['Correlation']).argsort().values[::-1]]

unique_important_corrs
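
An alternative way to list each pair only once is to mask everything except the upper triangle of the correlation matrix before flattening it. The following is a sketch on synthetic data; the frame `toy` and its columns `a`, `b`, `c` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: b is nearly a copy of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
})
corr = toy.corr()

# Keep only entries strictly above the diagonal, so that every pair
# appears once and self-correlations are excluded.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack().dropna()

# Filter by threshold and order by absolute value, strongest first.
threshold = 0.5
strong = pairs[pairs.abs() > threshold]
strong = strong.reindex(strong.abs().sort_values(ascending=False).index)

print(strong)
```

This avoids comparing floating-point values with `1.0` and removes the need to deduplicate sorted tuples afterwards.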

## Visualisation ##

Let's support the above correlations with some visualisations. Graphical representations of these statistics make difficult concepts easier to grasp and new patterns easier to spot.

Good Python packages for this kind of plotting include the standard matplotlib package and, built on top of it, seaborn, which adds statistical plots and more elegant, readable plot styles.

**Diagonal Correlation Matrix**

Let's start by visualising the correlation of each pair of attributes as a 2D matrix:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt


# Generate a mask for the upper triangle
mask = np.zeros_like(pearson, dtype=bool)  # the np.bool alias was removed from NumPy
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(12, 12))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(pearson, mask=mask, cmap=cmap, vmax=.3,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)

Now, let's visualise an estimate of the probability density function (PDF) to get a better understanding of what the values of each attribute look like.

A histogram is a simple initial PDF estimate, and the standard `hist()` method from matplotlib will suffice:

In [None]:
target = train['SalePrice']
plt.hist(target, bins=50)

For a better distribution plot, we can use Seaborn's `histplot()` method with `kde=True`, which overlays a kernel density estimation (KDE) curve on the histogram in a single plot (it supersedes the deprecated `distplot()`):

In [None]:
sns.histplot(target, kde=True)

Going further with our plots, we would like to explore the correlation of an attribute pair with a 2D plot in which each axis spans the range of one attribute and each point is one house. Regions where the points cluster are value combinations that occur often:

In [None]:
# Scatter Plot
x, y = train['YearBuilt'], train['SalePrice']
plt.scatter(x, y, alpha=0.5)

# or via jointplot (with histograms aside):
# (recent seaborn versions require x and y to be passed as keywords)
sns.jointplot(x=x, y=y, kind='scatter', joint_kws={'alpha': 0.5})

In [None]:
# Hexagonal 2-D plot
sns.jointplot(x=x, y=y, kind='hex')

We can also estimate the PDF smoothly by convolving each datapoint with a kernel function via the Seaborn `kdeplot()` method:

In [None]:
# `fill` replaces the deprecated `shade` argument
sns.kdeplot(x=x, y=y, fill=True)
# or 
sns.jointplot(x=x, y=y, kind='kde')
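
To make the idea behind kernel density estimation concrete, here is a minimal one-dimensional sketch in plain NumPy. It is not how seaborn implements it: the function name `gaussian_kde_1d`, the fixed bandwidth and the sample values are all assumptions chosen for illustration. Each data point contributes a Gaussian bump, and the estimate is the average of those bumps.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian kernels centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    # Mean over samples, rescaled so the result integrates to one.
    return kernels.mean(axis=1) / bandwidth

samples = np.array([1.0, 1.5, 2.0, 4.0, 4.2])  # made-up values
grid = np.linspace(-3.0, 9.0, 1201)
density = gaussian_kde_1d(samples, grid, bandwidth=0.5)

# A valid density should integrate to roughly 1 (simple Riemann sum).
area = np.sum(density) * (grid[1] - grid[0])
print(area)
```

A smaller bandwidth gives a spikier estimate and a larger one a smoother estimate; the 2D plots above apply the same idea with a two-dimensional kernel.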

Next, let's create a merged plot of six of the features most strongly correlated with the target (SalePrice). Recall from the correlation listing above that the following attributes have a strong positive correlation with SalePrice: *OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, FullBath, TotRmsAbvGrd, YearBuilt, YearRemodAdd, GarageYrBlt, MasVnrArea* and *Fireplaces*.

Let's see what the scatter plots against SalePrice look like for six of them:

In [None]:
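# plt.subplots creates the figure itself, so a separate plt.figure() call
# (which would leave an extra empty figure) is not needed
f, axarr = plt.subplots(3, 2, figsize=(10, 9))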
y = target.values
axarr[0, 0].scatter(train['OverallQual'].values, y)
axarr[0, 0].set_title('OverallQual')
axarr[0, 1].scatter(train['TotRmsAbvGrd'].values, y)
axarr[0, 1].set_title('TotRmsAbvGrd')
axarr[1, 0].scatter(train['GarageCars'].values, y)
axarr[1, 0].set_title('GarageCars')
axarr[1, 1].scatter(train['GarageArea'].values, y)
axarr[1, 1].set_title('GarageArea')
axarr[2, 0].scatter(train['TotalBsmtSF'].values, y)
axarr[2, 0].set_title('TotalBsmtSF')
axarr[2, 1].scatter(train['1stFlrSF'].values, y)
axarr[2, 1].set_title('1stFlrSF')
f.text(-0.01, 0.5, 'Sale Price', va='center', rotation='vertical', fontsize = 12)
plt.tight_layout()
plt.show()

Sources:

 1. [Dataset exploration: Boston house pricing][1] by Bohumír Zámečník
 2. [Plotting a diagonal correlation matrix][2] by Michael Waskom

  [1]: http://www.neural.cz/dataset-exploration-boston-house-pricing.html
  [2]: https://stanford.edu/~mwaskom/software/seaborn/examples/many_pairwise_correlations.html