# Comprehensive data exploration with Python

[Pedro Marcelino's Kernel](https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python)

<br>

# COMPREHENSIVE DATA EXPLORATION WITH PYTHON

[Pedro Marcelino](http://pmarcelino.com/) - February 2017

Other Kernels: [Data analysis and feature extraction with Python](https://www.kaggle.com/pmarcelino/data-analysis-and-feature-extraction-with-python)

---

<br>

**'The most difficult thing in life is to know yourself'**

This quote belongs to Thales of Miletus. Thales was a Greek/Phoenician philosopher, mathematician and astronomer, who is recognised as the first individual in Western civilisation known to have entertained and engaged in scientific thought (source: [https://en.wikipedia.org/wiki/Thales_of_Miletus](https://en.wikipedia.org/wiki/Thales_of_Miletus)).

I wouldn't say that knowing your data is the most difficult thing in data science, but it is time-consuming. Therefore, it's easy to overlook this initial step and jump too soon into the water.

So I tried to learn how to swim before jumping into the water. Based on [Hair et al. (2013)](https://www.amazon.com/gp/product/9332536503/ref=as_li_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=9332536503&linkCode=as2&tag=pmarcelino0b-20&linkId=ab279fb29582571ebfa89e6e8b95c50a), chapter 'Examining your data', I did my best to carry out a comprehensive, but not exhaustive, analysis of the data. I'm far from reporting a rigorous study in this kernel, but I hope that it can be useful for the community, so I'm sharing how I applied some of those data analysis principles to this problem.

Despite the strange names I gave to the chapters, what we are doing in this kernel is something like:

1. **Understand the problem.**  
We'll look at each variable and do a philosophical analysis of its meaning and importance for this problem.

2. **Univariate study.**  
We'll just focus on the dependent variable ('SalePrice') and try to know a little bit more about it.

3. **Multivariate study.**  
We'll try to understand how the dependent variable and independent variables relate.

4. **Basic cleaning.**  
We'll clean the dataset and handle the missing data, outliers and categorical variables.

5. **Test assumptions.**  
We'll check if our data meets the assumptions required by most multivariate techniques.

Now, it's time to have fun!

In [1]:
# invite people for the Kaggle party
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [2]:
# bring in the six packs
df_train = pd.read_csv('../../input/train.csv')

In [3]:
# check the decoration
df_train.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive
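
A long `Index` printout is hard to scan, so here is a minimal sketch of another way to look at the columns: one row per column, with its pandas dtype and a rough numerical/categorical guess. It is not part of the original kernel. `df_demo` is a hypothetical stand-in for `df_train` with invented values, and the `overview` table and its column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df_train: three column names taken from the
# listing above, filled with invented values so the snippet runs without train.csv.
df_demo = pd.DataFrame({
    "LotArea": [1000, 2000, 3000],
    "Neighborhood": ["A", "B", "A"],
    "OverallQual": [5, 6, 7],
})

# One row per column, with the dtype pandas inferred for it.
overview = pd.DataFrame({
    "Variable": df_demo.columns,
    "dtype": df_demo.dtypes.astype(str).values,
})

# Rough first guess only: numeric dtype -> 'numerical', anything else -> 'categorical'.
numeric_cols = df_demo.select_dtypes(include="number").columns.tolist()
overview["Type"] = [
    "numerical" if col in numeric_cols else "categorical"
    for col in overview["Variable"]
]

# to_string() prints every row instead of an abbreviated view.
print(overview.to_string(index=False))
```

With the real data, `df_demo` would be replaced by `df_train`. The dtype-based guess is only a starting point: a category stored as integer codes would be labelled 'numerical' here and would need a manual correction.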

<br>

# 1. So... What can we expect?

In order to understand our data, we can look at each variable and try to understand its meaning and relevance to this problem. I know this is time-consuming, but it will give us the flavour of our dataset.

In order to have some discipline in our analysis, we can create an Excel spreadsheet with the following columns:

- **Variable**  
Variable name.

- **Type**  
Identification of the variable's type. There are two possible values for this field: **'numerical'** or **'categorical'**. By **'numerical'** we mean variables for which the values are numbers, and by **'categorical'** we mean variables for which the values are categories.

- **Segment**  
Identification of the variable's segment. We can define three possible segments: **building**, **space** or **location**. When we say **'building'**, we mean a variable that relates to the physical characteristics of the building (e.g. 'OverallQual'). When we say **'space'**, we mean a variable that reports space properties of the house (e.g. 'TotalBsmtSF'). Finally, when we say **'location'**, we mean a variable that gives information about the place where the house is located (e.g. 'Neighborhood').