# Exploratory data analysis with Python
[Hyun woo kim] - 2018-10-03

![](https://kaggle2.blob.core.windows.net/competitions/kaggle/5407/media/housesbanner.png)

Update 1: The Pearson correlation was changed to the Spearman correlation to account for ordinal variables.

Based on the kernel: [Comprehensive data exploration with Python](https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python)

### Outline
- What can we do?
- First things first: analysing 'SalePrice'
- Finding Missing values
- Preprocessing
    - Outlier
    - Missing Values

## Competition description
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa (in central Iowa, USA), this competition challenges you to predict the final price of each home.

In [None]:
import pandas as pd #Analysis 
import matplotlib.pyplot as plt #Visualization
import seaborn as sns #Visualization
import numpy as np #Analysis 
from scipy.stats import norm #Analysis 
from sklearn.preprocessing import StandardScaler #Analysis 
from scipy import stats #Analysis 
import warnings 
warnings.filterwarnings('ignore')
%matplotlib inline
import gc

In [None]:
#load the train and test sets
df_train = pd.read_csv('../input/train.csv')
df_test  = pd.read_csv('../input/test.csv')

In [None]:
print("train.csv. Shape: ",df_train.shape)
print("test.csv. Shape: ",df_test.shape)

## 1. What can we do?
In order to understand our data, we can look at each variable and try to understand their meaning and relevance to this problem. I know this is time-consuming, but it will give us the flavour of our dataset.

We can create an Excel spreadsheet like the one pictured below.


![](https://choco9966.github.io/Team-EDA/image/Excelspread.PNG)

There is also a feature selection question: whether to use all 80 variables or only the important ones. This is covered in detail in the third week :D

## 2. First things first: analysing 'SalePrice'

In [None]:
#descriptive statistics summary
df_train['SalePrice'].describe()

  - The standard deviation is large.
  - The minimum is greater than 0.
  - There is a big difference between the minimum value and the 25th percentile.
  - The difference between the 75th percentile and the max is even bigger.
  - That is, the gap between the 75th percentile and the max is greater than the gap between the minimum and the 25th percentile.

In [None]:
#histogram
f, ax = plt.subplots(figsize=(8, 6))
sns.distplot(df_train['SalePrice'])

- Long tail to the right (not a normal distribution) => Should I normalize?

In [None]:
#skewness and kurtosis
print("Skewness: %f" % df_train['SalePrice'].skew())
print("Kurtosis: %f" % df_train['SalePrice'].kurt())

- Skewness: the longer the right tail, the more positive the skewness.
- Kurtosis: pandas' `kurt()` reports excess kurtosis (Fisher's definition), so a value close to 0 means the tails are close to those of a normal distribution; this corresponds to K = 3 under the classical definition. If the value is below 0 (K < 3), the distribution is flatter than the normal distribution; if it is above 0 (K > 3), the distribution is more peaked, with heavier tails, than the normal distribution.
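
To make these two statistics concrete, here is a minimal sketch on synthetic data. The samples, sizes, and seed are illustrative assumptions and have nothing to do with the competition data. It relies on pandas' `kurt()` returning excess kurtosis, so a normal sample should land near 0 for both statistics, while a right-skewed sample should be clearly positive.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed, chosen arbitrarily for this sketch

# Synthetic samples (assumption: stand-ins for a symmetric and a right-skewed variable)
normal_sample = pd.Series(rng.normal(loc=0.0, scale=1.0, size=100000))
skewed_sample = pd.Series(rng.lognormal(mean=0.0, sigma=0.5, size=100000))

normal_skew, normal_kurt = normal_sample.skew(), normal_sample.kurt()
skewed_skew, skewed_kurt = skewed_sample.skew(), skewed_sample.kurt()

print("normal : skew=%.3f, excess kurtosis=%.3f" % (normal_skew, normal_kurt))
print("skewed : skew=%.3f, excess kurtosis=%.3f" % (skewed_skew, skewed_kurt))
```

A variable shaped like the second sample is the kind that prompts the "should I normalize?" question.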

## Relationship with variables

### `SalePrice` correlation matrix (zoomed heatmap style)

### Pearson product-moment correlation
Pearson correlation evaluates the linear relationship between two continuous variables. A relationship is linear when a change in one variable is associated with a proportional change in the other.

For example, Pearson correlation can be used to assess whether the increase in temperature in a production facility is related to changes in the thickness of the chocolate coating.

### Spearman Rank Correlation
Spearman correlation evaluates the monotonic relationship between two continuous or ordinal variables. In a monotonic relationship, the two variables tend to change together, but not necessarily at a constant rate. The Spearman correlation coefficient is based on the ranked values of each variable, not the raw data.

Spearman correlation is often used to evaluate relationships involving ordinal variables. For example, you can use Spearman correlation to assess whether the order in which employees complete the test exercises is related to the number of months employed.

Correlation for different types: https://m.blog.naver.com/PostView.nhn?blogId=lucifer246&logNo=180754322&proxyReferer=https%3A%2F%2Fwww.google.co.kr%2F

![](https://choco9966.github.io/Team-EDA//image/correlation1.png)

               Pearson = +1, Spearman = +1                       Pearson = +0.851, Spearman = +1
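
The difference between the two coefficients can be checked with a small sketch. The data below is synthetic (an assumption for illustration): `y` always increases with `x`, but not at a constant rate, so the rank-based Spearman coefficient should be 1 while Pearson stays below 1.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly increasing but non-linear relationship (illustrative assumption)
x = np.arange(1, 21, dtype=float)
y = np.exp(x / 4.0)

pearson_r = stats.pearsonr(x, y)[0]    # measures linear association
spearman_r = stats.spearmanr(x, y)[0]  # measures monotonic association via ranks

print("Pearson : %.3f" % pearson_r)
print("Spearman: %.3f" % spearman_r)
```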

In [None]:
#saleprice correlation matrix
k = 10 #number of variables for heatmap
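corrmat = df_train.select_dtypes(include=[np.number]).corr(method='spearman') # Spearman correlation over all numeric variables
cols = corrmat.nlargest(k, 'SalePrice').index # nlargest: the k rows with the largest 'SalePrice' correlation, in descending order
cm = np.corrcoef(df_train[cols].values.T) # note: np.corrcoef computes the Pearson correlation among the selected columns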
sns.set(font_scale=1.25)
f, ax = plt.subplots(figsize=(8, 6))
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 8}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

The 9 variables most correlated with SalePrice:
- OverallQual : Overall material and finish quality
- GrLivArea : Above grade living area in square feet (above grade: the portion of a home that is above the ground)
- GarageCars : Size of garage in car capacity
- GarageArea : Size of garage in square feet
- TotalBsmtSF : Total square feet of basement area (basement: the lowermost portion of a structure, partly or wholly below ground level; often used for storage)
- 1stFlrSF : First Floor square feet
- FullBath : Full bathrooms above grade
- TotRmsAbvGrd : Total rooms above grade (does not include bathrooms)
- YearBuilt : Original construction date

In [None]:
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea', 'TotalBsmtSF', '1stFlrSF', 'FullBath','TotRmsAbvGrd','YearBuilt']
sns.pairplot(df_train[cols], height = 2.5) # 'size' was renamed to 'height' in seaborn 0.9
plt.show();

In [None]:
data = pd.concat([df_train['SalePrice'], df_train['OverallQual']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x='OverallQual', y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

- The higher the quality, the higher the selling price.
- However, the outliers at quality levels 4, 7, 8, and 10 stand out. These values should be checked again later.

In [None]:
data = pd.concat([df_train['SalePrice'], df_train['GrLivArea']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.regplot(x='GrLivArea', y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

The relationship between GrLivArea and SalePrice is positive: as the living area grows, the price also increases. However, some houses with a GrLivArea of 4000 or more sell at a low price despite their size.

In [None]:
data = pd.concat([df_train['SalePrice'], df_train['GarageCars']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x='GarageCars', y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

- The value 4 looks very strange ... why?

In [None]:
data = pd.concat([df_train['SalePrice'], df_train['GarageArea']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.regplot(x='GarageArea', y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

- GarageArea splits into a zero part and a non-zero part.
- Generally, there is a positive correlation.
- But the points with a SalePrice of about 700,000-800,000 and the four points with a GarageArea of 1200-1400 look like outliers!

In [None]:
data = pd.concat([(df_train[df_train['GarageArea'] > 0])['SalePrice'], (df_train[df_train['GarageArea'] > 0])['GarageArea']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.regplot(x='GarageArea', y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

In [None]:
data = pd.concat([df_train['SalePrice'], df_train['TotalBsmtSF']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.regplot(x='TotalBsmtSF', y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

- Outlier removal
- What is the relationship between TotalBsmtSF of 0 and GarageArea 0?

In [None]:
data = pd.concat([df_train['SalePrice'], df_train['1stFlrSF']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.regplot(x='1stFlrSF', y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

- It looks very similar to TotalBsmtSF. The correlation between the two is very high at 0.82

In [None]:
data = pd.concat([df_train['SalePrice'], df_train['FullBath']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x='FullBath', y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

- It is no surprise that SalePrice rises with FullBath. But what about zero? Why is it higher than for 1?

In [None]:
data = pd.concat([df_train['SalePrice'], df_train['TotRmsAbvGrd']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x='TotRmsAbvGrd', y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

Possible outliers:
- 12 and 14 need a separate analysis
- Very low values at 6
- Very high value at 10

In [None]:
data = pd.concat([df_train['SalePrice'], df_train['YearBuilt']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.regplot(x='YearBuilt', y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

In [None]:
#histogram
f, ax = plt.subplots(figsize=(8, 6))
sns.distplot(df_train['YearRemodAdd'])

In [None]:
data = pd.concat([df_train['SalePrice'], df_train['YearBuilt'], df_train['YearRemodAdd']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
data['YearRemodBuilt'] = data['YearRemodAdd'] - data['YearBuilt']
fig = sns.regplot(x='YearRemodBuilt', y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

In [None]:
data = data[data['YearRemodBuilt'] > 1]
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.regplot(x='YearRemodBuilt', y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

## 3. Finding Missing values

In [None]:
#missing data
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

In [None]:
#bar chart of the missing-value ratio for the 20 columns with the most missing data
#missing_data = missing_data.head(20)
percent_data = percent.head(20)
percent_data.plot(kind="bar", figsize = (8,6), fontsize = 10)
plt.xlabel("Columns", fontsize = 20)
plt.ylabel("Missing ratio", fontsize = 20)
plt.title("Missing Value Ratio by Column", fontsize = 20)

- Why are there missing values?
- How should I handle missing values?
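
Before deciding how to handle missing values, it can help to wrap the summary computed above in a small reusable function. The sketch below is one possible form; `missing_summary` is a hypothetical helper name and the toy frame uses invented values (only the column names are borrowed from the dataset).

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return per-column missing counts and ratios, largest first (hypothetical helper)."""
    total = df.isnull().sum()
    ratio = total / len(df)
    summary = pd.concat([total, ratio], axis=1, keys=['Total', 'Ratio'])
    # keep only the columns that actually have missing values
    return summary[summary['Total'] > 0].sort_values('Total', ascending=False)

# Toy frame (assumption: invented values, not rows from train.csv)
toy = pd.DataFrame({
    'PoolQC': [np.nan, np.nan, np.nan, 'Gd'],
    'LotFrontage': [65.0, np.nan, 80.0, 70.0],
    'SalePrice': [200000, 150000, 180000, 300000],
})
toy_summary = missing_summary(toy)
print(toy_summary)
```

The same function could be called on `df_train`, `df_test`, or a combined frame to compare where the gaps are.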

## 4. Preprocessing
### 4.1 Outliers
#### 4.1.1 OverallQual

In [None]:
data = pd.concat([df_train['SalePrice'], df_train['OverallQual']], axis=1)
f, ax = plt.subplots(figsize=(16, 10))
fig = sns.boxplot(x='OverallQual', y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

- OverallQual : 4
- OverallQual : 8
- OverallQual : 10

In [None]:
df_train[(df_train['OverallQual'] == 4) & (df_train['SalePrice'] > 200000)]

In [None]:
#compare the flagged house with the mean and 75th percentile of its OverallQual group
group = df_train[df_train['OverallQual'] == 4]
flagged = group[group['SalePrice'] > 200000]
print("Variable","       value  ", "     mean","      ","   0.75Q")
for col in ['YearBuilt', 'GarageCars', 'GrLivArea', 'FullBath', 'YearRemodAdd', 'GarageArea', 'TotalBsmtSF']:
    print((col + ":").ljust(14), flagged[col].values, "", group[col].mean(), "", group[col].quantile(0.75))

Overall, its feature values are unremarkable, yet its SalePrice is unusually high, so I judged it to be an outlier.

In [None]:
df_train = df_train[df_train['Id'] != 458]
df_train[(df_train['OverallQual'] == 8) & (df_train['SalePrice'] > 500000)]

In [None]:
#compare the flagged house with the mean and 75th percentile of its OverallQual group
group = df_train[df_train['OverallQual'] == 8]
flagged = group[group['SalePrice'] > 500000]
print("Variable","       value  ", "     mean","      ","   0.75Q")
for col in ['YearBuilt', 'GarageCars', 'GrLivArea', 'FullBath', 'YearRemodAdd', 'GarageArea', 'TotalBsmtSF']:
    print((col + ":").ljust(14), flagged[col].values, "", group[col].mean(), "", group[col].quantile(0.75))

In [None]:
df_train[(df_train['OverallQual'] == 10) & (df_train['SalePrice'] < 180000)]

In [None]:
#compare the flagged houses with the mean and 25th percentile of their OverallQual group
group = df_train[df_train['OverallQual'] == 10]
flagged = group[group['SalePrice'] < 180000]
print("Variable","       value  ", "     mean","      ","   0.25Q")
for col in ['YearBuilt', 'GarageCars', 'GrLivArea', 'FullBath', 'YearRemodAdd', 'GarageArea', 'TotalBsmtSF']:
    print((col + ":").ljust(14), flagged[col].values, "", group[col].mean(), "", group[col].quantile(0.25))

If the process above does not yield a clear lead, we look at variables we have not explored yet. Since the numerical variables were already examined to some degree through the correlation matrix, I explore the categorical variables next.
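
The comparison used in this section (a flagged house against the mean and a quantile of its quality group) can also be written as a small function. This is only a sketch: `compare_to_group` is a hypothetical helper, and the toy rows are invented rather than taken from train.csv.

```python
import pandas as pd

def compare_to_group(df, group_mask, flag_mask, cols, q=0.75):
    """Compare flagged rows with the mean and q-quantile of their group (hypothetical helper)."""
    group = df[group_mask]
    flagged = df[group_mask & flag_mask]
    rows = []
    for col in cols:
        rows.append({
            'variable': col,
            'flagged': flagged[col].tolist(),
            'group_mean': group[col].mean(),
            'group_quantile': group[col].quantile(q),
        })
    return pd.DataFrame(rows).set_index('variable')

# Toy data (assumption: made-up values for illustration)
toy = pd.DataFrame({
    'OverallQual': [4, 4, 4, 4, 8],
    'SalePrice': [90000, 100000, 110000, 250000, 300000],
    'GrLivArea': [900, 1000, 1100, 1200, 2000],
})
report = compare_to_group(
    toy,
    group_mask=toy['OverallQual'] == 4,
    flag_mask=toy['SalePrice'] > 200000,
    cols=['GrLivArea'],
)
print(report)
```

If the flagged value sits close to the group mean while the price is far from it, the features do not explain the price, which is the reasoning used above to call a row an outlier.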

In [None]:
var = 'SaleCondition'
data = pd.concat([df_train[df_train['OverallQual'] == 10]['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 10))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
xt = plt.xticks(rotation=45)

In [None]:
var = 'MSZoning'
data = pd.concat([df_train[df_train['OverallQual'] == 10]['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 10))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
xt = plt.xticks(rotation=45)

In [None]:
df_train[(df_train['OverallQual'] == 10) & (df_train['SalePrice'] < 200000)]

We judge them to be outliers and remove them.

In [None]:
df_train = df_train[~df_train['Id'].isin([524, 1299])]

In [None]:
var = 'Neighborhood'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 10))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
xt = plt.xticks(rotation=45)

SalePrice seems to vary considerably by neighborhood.

In [None]:
df_train[df_train['Neighborhood'] == 'Edwards']['SalePrice'].describe()

In [None]:
df_train[(df_train['OverallQual'] == 10) & (df_train['SalePrice'] > 700000)]

In [None]:
df_train = df_train[~df_train['Id'].isin([692, 1183])]

Likewise, you can see that both of these houses share the same Neighborhood: NoRidge.

While going through the categorical variables, I noticed that Neighborhood has a significant impact on the price. This raises two issues:
- How do we encode neighborhoods so that the computer can work with them?
- How do we deal with anomalies within a neighborhood?

We will cover encoding in the next kernel, in week 2.

In [None]:
df_train[df_train['Neighborhood'] == 'NoRidge']['SalePrice'].describe()

In [None]:
#BsmtQual
var = 'BsmtQual'
data = pd.concat([df_train[df_train['OverallQual'] == 10]['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 10))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
xt = plt.xticks(rotation=45)

#### 4.1.2 GarageCars

In [None]:
data = pd.concat([df_train['SalePrice'], df_train['GarageCars']], axis=1)
f, ax = plt.subplots(figsize=(16, 10))
fig = sns.boxplot(x='GarageCars', y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

In [None]:
df_train[df_train['GarageCars'] == 4]

Could I delete these rows if the value does not appear in the test set? Deletion would be less risky in that case, but ...

In [None]:
df_test[df_test['GarageCars'] == 4]
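
The check above (does a suspicious training value also occur in the test set?) can be generalized with a short sketch. `values_missing_from_test` is a hypothetical helper, and the toy frames contain invented values.

```python
import pandas as pd

def values_missing_from_test(train, test, col):
    """Return values of `col` that occur in train but never in test (hypothetical helper)."""
    train_values = set(train[col].dropna().tolist())
    test_values = set(test[col].dropna().tolist())
    return sorted(train_values - test_values)

# Toy frames (assumption: invented values for illustration only)
toy_train = pd.DataFrame({'GarageCars': [0, 1, 2, 3, 4]})
toy_test = pd.DataFrame({'GarageCars': [0, 1, 2, 3, 5]})

only_in_train = values_missing_from_test(toy_train, toy_test, 'GarageCars')
print(only_in_train)
```

An empty result means every training value also appears in the test set, so dropping those rows would remove information the model needs at prediction time.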

We have seen that the Neighborhood variable is important, so we will explore it.

In [None]:
df_train[df_train['GarageCars'] == 4]['Neighborhood'].unique()

In [None]:
df_train[df_train['Neighborhood'] == 'Mitchel']['SalePrice'].describe()

In [None]:
df_train[df_train['Neighborhood'] == 'OldTown']['SalePrice'].describe()

In [None]:
df_train[(df_train['Neighborhood'] == 'OldTown') & (df_train['SalePrice'] > 400000)]

In [None]:
df_train = df_train[df_train['Id']!=186]

#### 4.1.3 GrLivArea

In [None]:
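data = pd.concat([df_train['SalePrice'], df_train['GrLivArea']], axis=1)
#f, ax = plt.subplots(figsize=(8, 6))
def r2(x, y):
    # squared Pearson correlation coefficient
    return stats.pearsonr(x, y)[0] ** 2
# stat_func is no longer accepted by recent seaborn releases, so r2 is printed separately
sns.jointplot(x='GrLivArea', y='SalePrice', kind="reg", data=data, height=18)
print("r2: %f" % r2(data['GrLivArea'], data['SalePrice']))
#fig.axis(ymin=0, ymax=800000);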

In [None]:
df_train[(df_train['GrLivArea'] < 3000) & (df_train['SalePrice'] > 600000)]

In [None]:
df_train[df_train['Neighborhood'] == 'NridgHt']['SalePrice'].describe()

In [None]:
df_train = df_train[df_train['Id'] != 899]

### 4.2 Missing Values

In [None]:
#missing data
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

#bar chart of the missing-value ratio
#missing_data = missing_data.head(20)
percent_data = percent.head(20)
percent_data.plot(kind="bar", figsize = (18,16), fontsize = 15)
plt.xlabel("Columns", fontsize = 20)
plt.ylabel("Percent of Missing Value (%)", fontsize = 20)
#plt.title("Total Missing Value (%)", fontsize = 20)

Pedro Marcelino comments on this in [Comprehensive data exploration with Python](https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python):

Let's analyse this to understand how to handle the missing data.

We'll consider that when more than 15% of the data is missing, we should delete the corresponding variable and pretend it never existed. This means that we will not try any trick to fill the missing data in these cases. According to this, there is a set of variables (e.g. 'PoolQC', 'MiscFeature', 'Alley', etc.) that we should delete. The point is: will we miss this data? I don't think so. None of these variables seem to be very important, since most of them are not aspects in which we think about when buying a house (maybe that's the reason why data is missing?). Moreover, looking closer at the variables, we could say that variables like 'PoolQC', 'MiscFeature' and 'FireplaceQu' are strong candidates for outliers, so we'll be happy to delete them.

In what concerns the remaining cases, we can see that 'GarageX' variables have the same number of missing data. I bet missing data refers to the same set of observations (although I will not check it; it's just 5% and we should not spend 20 dollars on 5-dollar problems). Since the most important information regarding garages is expressed by 'GarageCars' and considering that we are just talking about 5% of missing data, I'll delete the mentioned 'GarageX' variables. The same logic applies to 'BsmtX' variables.

Regarding 'MasVnrArea' and 'MasVnrType', we can consider that these variables are not essential. Furthermore, they have a strong correlation with 'YearBuilt' and 'OverallQual' which are already considered. Thus, we will not lose information if we delete 'MasVnrArea' and 'MasVnrType'.

Finally, we have one missing observation in 'Electrical'. Since it is just one observation, we'll delete this observation and keep the variable.

In summary, to handle missing data, we'll delete all the variables with missing data, except the variable 'Electrical'. In 'Electrical' we'll just delete the observation with missing data.

**But I will basically fill them all in instead.**


![](https://cdn-images-1.medium.com/max/1600/1*Nph7tFVhdnFjJWHMKof-0A.png)

In [None]:
import missingno as msno
len_train = df_train.shape[0]
df_all = pd.concat([df_train,df_test])
missingdata_df = df_all.columns[df_all.isnull().any()].tolist()
msno.heatmap(df_all[missingdata_df], figsize=(20,20))

We can see that some groups of variables have correlated missing values.
- Bsmt~
- Garage~
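
As a rough way to see what "correlated missing values" means without the plotting library, the missingness indicators themselves can be correlated with plain pandas. This is a sketch on invented data, and it is only an approximation of what the heatmap displays, not a description of missingno's internals.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption: invented values; the two garage columns are missing together)
toy = pd.DataFrame({
    'GarageType': ['Attchd', np.nan, 'Detchd', np.nan, 'Attchd', 'Attchd'],
    'GarageQual': ['TA', np.nan, 'TA', np.nan, 'Gd', 'TA'],
    'LotFrontage': [65.0, 70.0, np.nan, 80.0, 75.0, np.nan],
})

# 1 where a value is missing, 0 otherwise
is_missing = toy.isnull().astype(int)
# correlation between the missingness indicators of each pair of columns
nullity_corr = is_missing.corr()
print(nullity_corr)
```

A value of 1 between two columns means they are always missing in the same rows, which is the pattern the Bsmt and Garage groups show.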

### PoolQC

```
   Ex   Excellent
   Gd   Good
   TA   Average/Typical
   Fa   Fair
   NA   No Pool
   ```
   So, it is obvious that I just need to assign ‘No Pool’ to the NAs (encoded here as the string "None"). Also, the high number of NAs makes sense, as normally only a small proportion of houses have a pool.

In [None]:
df_all["PoolQC"] = df_all["PoolQC"].fillna("None")

In [None]:
df_all["PoolQC"].describe()

In [None]:
df_all[(df_all["PoolQC"] == 'None') & (df_all["PoolArea"] > 0)][["Id","PoolQC","PoolArea","OverallQual"]]

In [None]:
df_all.loc[df_all['Id'] == 2421, ['PoolQC']] = 'TA'
df_all.loc[df_all['Id'] == 2504, ['PoolQC']] = 'Gd'
df_all.loc[df_all['Id'] == 2600, ['PoolQC']] = 'Fa'

In [None]:
df_all[(df_all["PoolQC"] == 'None') & (df_all["PoolArea"] > 0)][["Id","PoolQC","PoolArea","OverallQual"]]

In [None]:
df_all["PoolQC"].describe()
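
The fill-then-verify pattern used for PoolQC can be sketched on toy data. The rows below are invented for illustration; the point is that after encoding NA as an explicit "None" category, any row that still reports a pool area needs a manual decision.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption: invented rows; Id 3 has a pool area but no recorded quality)
toy = pd.DataFrame({
    'Id': [1, 2, 3],
    'PoolQC': ['Gd', np.nan, np.nan],
    'PoolArea': [500, 0, 400],
})

# NA in the data dictionary means "no pool", so encode it as an explicit category
toy['PoolQC'] = toy['PoolQC'].fillna('None')

# rows that claim "no pool" but still report a pool area are inconsistent
inconsistent = toy[(toy['PoolQC'] == 'None') & (toy['PoolArea'] > 0)]
print(inconsistent)
```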

### MiscFeature 
```
   Elev Elevator
   Gar2 2nd Garage (if not described in garage section)
   Othr Other
   Shed Shed (over 100 SF)
   TenC Tennis Court
   NA   None
```

In [None]:
df_all["MiscFeature"] = df_all["MiscFeature"].fillna("None")

### Alley 
```
   Grvl Gravel
   Pave Paved
   NA   No alley access
```

In [None]:
df_all["Alley"] = df_all["Alley"].fillna("None")

### Fence 
```
   GdPrv    Good Privacy
   MnPrv    Minimum Privacy
   GdWo Good Wood
   MnWw Minimum Wood/Wire
   NA   No Fence
```

In [None]:
df_all["Fence"] = df_all["Fence"].fillna("None")

### FireplaceQu 
```
   Ex   Excellent - Exceptional Masonry Fireplace
   Gd   Good - Masonry Fireplace in main level
   TA   Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
   Fa   Fair - Prefabricated Fireplace in basement
   Po   Poor - Ben Franklin Stove
   NA   No Fireplace
```

In [None]:
df_all["FireplaceQu"] = df_all["FireplaceQu"].fillna("None")

### LotFrontage
Because the street frontage of a lot is likely to be similar to that of other houses in the same neighborhood, we can fill in the missing values with the median LotFrontage of the neighborhood.

In [None]:
df_all["LotFrontage"] = df_all.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
len(df_all[df_all["LotFrontage"].isnull()])
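
To see what the grouped fill does row by row, here is the same pattern on a toy frame. The neighborhoods and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption: invented values, two neighborhoods)
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B', 'B'],
    'LotFrontage': [60.0, np.nan, 80.0, 40.0, 50.0, np.nan],
})

# fill each missing value with the median of its own neighborhood
toy['LotFrontage'] = toy.groupby('Neighborhood')['LotFrontage'].transform(
    lambda x: x.fillna(x.median()))
print(toy)
```

If a neighborhood had no observed LotFrontage at all, its median would be NaN and the values would stay missing, which is why counting the remaining nulls afterwards is a useful check.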

### Garage 
The missing-value heatmap above shows that missingness is highly correlated among the four variables 'GarageType', 'GarageCond', 'GarageFinish', and 'GarageQual'.
So we first look at the missing values in this group.

First, GarageYrBlt (the year the garage was built): missing values are filled with the YearBuilt value (this is similar to YearRemodAdd, which defaults to YearBuilt when there has been no remodeling or addition).

In [None]:
df_all['GarageYrBlt'] = df_all['GarageYrBlt'].fillna(df_all['YearBuilt'])

In [None]:
df_all[df_all['GarageType'].isnull()][['GarageCond','GarageFinish','GarageQual']].head(10)

In [None]:
df_all[((df_all['GarageType'].isnull()) == False) & ((df_all['GarageFinish'].isnull()) == True)][['Id','GarageCars', 'GarageArea', 'GarageType', 'GarageCond', 'GarageQual', 'GarageFinish']]

All 157 houses with NA in GarageType also have NA in GarageCond, GarageQual, and GarageFinish. The differences are found in houses 2127 and 2577. As you can see, house 2127 actually has a garage, while house 2577 does not appear to have one. Therefore, there must be 158 houses without a garage. To fix house 2127, we impute the most common values (modes) for GarageCond, GarageQual, and GarageFinish.

In [None]:
#mode() ignores NaN by default, so no extra missing-value filter is needed
print("GarageCond: ", df_all[df_all['GarageType']=='Detchd']['GarageCond'].mode().values)
print("GarageQual: ", df_all[df_all['GarageType']=='Detchd']['GarageQual'].mode().values)
print("GarageFinish: ", df_all[df_all['GarageType']=='Detchd']['GarageFinish'].mode().values)

In [None]:
df_all.loc[df_all['Id'] == 2127, ['GarageCond']] = 'TA'
df_all.loc[df_all['Id'] == 2127, ['GarageQual']] = 'TA'
df_all.loc[df_all['Id'] == 2127, ['GarageFinish']] = 'Unf'
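
Instead of copying the printed mode by hand, the same imputation can be driven by the computed value. The sketch below uses an invented toy frame and is only one way to write it.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption: invented rows; Id 4 has a detached garage with unknown condition)
toy = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'GarageType': ['Detchd', 'Detchd', 'Detchd', 'Detchd'],
    'GarageCond': ['TA', 'TA', 'Fa', np.nan],
})

# mode() ignores NaN by default and returns a Series; take its first entry
detached = toy[toy['GarageType'] == 'Detchd']
most_common = detached['GarageCond'].mode()[0]

# write the mode into the single row that needs it
toy.loc[toy['Id'] == 4, 'GarageCond'] = most_common
print(toy)
```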

In [None]:
df_all[df_all["Id"]==2127][['GarageCond','GarageQual','GarageFinish']]

Id 2577, however, has a GarageType, but its GarageCars and GarageArea do not indicate a garage. In this case, it is difficult to understand the situation from the information above alone: a garage may have been present but not recorded. If this row were in the training set I would have removed it, but because it is in the test set, I reduce the bias by treating it as a house without a garage.

In [None]:
df_all.loc[df_all['Id'] == 2577, ['GarageCars']] = 0
df_all.loc[df_all['Id'] == 2577, ['GarageArea']] = 0
df_all.loc[df_all['Id'] == 2577, ['GarageType']] = 'None'

In [None]:
df_all[df_all["Id"]==2577][['GarageCars','GarageArea','GarageType']]

#### GarageType: Garage location
```
   2Types   More than one type of garage
   Attchd   Attached to home
   Basment  Basement Garage
   BuiltIn  Built-In (Garage part of house - typically has room above garage)
   CarPort  Car Port
   Detchd   Detached from home
   NA       No Garage
   ```

In [None]:
df_all['GarageType'] = df_all['GarageType'].fillna('None')

#### GarageFinish: Interior finish of the garage
```
   Fin  Finished
   RFn  Rough Finished  
   Unf  Unfinished
   NA   No Garage 
```

In [None]:
df_all['GarageFinish'] = df_all['GarageFinish'].fillna('None')

#### GarageQual: Garage quality
Another variable that can be made ordinal with the Qualities vector.
```
   Ex   Excellent
   Gd   Good
   TA   Typical/Average
   Fa   Fair
   Po   Poor
   NA   No Garage
   ```


In [None]:
df_all['GarageQual'] = df_all['GarageQual'].fillna('None')
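
The text mentions a "Qualities vector" for making such variables ordinal without showing it. One possible form, sketched here as an assumption (the name `qualities` and the integer codes are hypothetical, not taken from the original notebook), is a plain dictionary mapped over the column after the NAs have been filled.

```python
import pandas as pd

# Hypothetical ordering: higher integer = better quality, 'None' = no garage
qualities = {'None': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}

# Toy column (assumption: invented values, NA already replaced by 'None')
garage_qual = pd.Series(['TA', 'None', 'Gd', 'Fa', 'Ex', 'Po'])

# map each label to its integer rank
garage_qual_ordinal = garage_qual.map(qualities)
print(garage_qual_ordinal.tolist())
```

Any label missing from the dictionary would become NaN under `map`, so it is worth checking for nulls after the conversion.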

#### GarageCond: Garage condition

Another variable that can be made ordinal with the Qualities vector.
```
   Ex   Excellent
   Gd   Good
   TA   Typical/Average
   Fa   Fair
   Po   Poor
   NA   No Garage
   ```

In [None]:
df_all['GarageCond'] = df_all['GarageCond'].fillna('None')

### Basement 
Altogether, there are 11 variables that relate to the Basement of a house. Five of those have 79-82 NAs, six have one or two NAs.

In [None]:
df_all[((df_all["BsmtFinType1"].isnull())==False) & ((df_all["BsmtCond"].isnull()) | (df_all["BsmtQual"].isnull()) | (df_all["BsmtExposure"].isnull()) | (df_all["BsmtFinType2"].isnull()))][['Id','BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']]

In [None]:
print("BsmtFinType2 mode",df_all['BsmtFinType2'].mode().values,"\nBsmtExposure mode",df_all['BsmtExposure'].mode().values,"\nBsmtCond mode",df_all['BsmtCond'].mode().values,"\nBsmtQual mode",df_all['BsmtQual'].mode().values)

In [None]:
#impute each partially missing basement field with the mode printed above
df_all.loc[df_all['Id'] == 333, ['BsmtFinType2']] = 'Unf'
df_all.loc[df_all['Id'].isin([949, 1488, 2349]), ['BsmtExposure']] = 'No'
#'Unf' is not a valid BsmtCond or BsmtQual level, so use the column mode directly
df_all.loc[df_all['Id'].isin([2041, 2186, 2525]), ['BsmtCond']] = df_all['BsmtCond'].mode()[0]
df_all.loc[df_all['Id'].isin([2218, 2219]), ['BsmtQual']] = df_all['BsmtQual'].mode()[0]

#### BsmtQual: Evaluates the height of the basement
```
   Ex   Excellent (100+ inches) 
   Gd   Good (90-99 inches)
   TA   Typical (80-89 inches)
   Fa   Fair (70-79 inches)
   Po   Poor (<70 inches)
   NA   No Basement
   ```

In [None]:
df_all['BsmtQual'] = df_all['BsmtQual'].fillna('None')

#### BsmtCond: Evaluates the general condition of the basement
A variable that can be made ordinal with the Qualities vector.

```
   Ex   Excellent
   Gd   Good
   TA   Typical - slight dampness allowed
   Fa   Fair - dampness or some cracking or settling
   Po   Poor - Severe cracking, settling, or wetness
   NA   No Basement
```



In [None]:
df_all['BsmtCond'] = df_all['BsmtCond'].fillna('None')

#### BsmtExposure: Refers to walkout or garden level walls

A variable that can be made ordinal.
```
   Gd   Good Exposure
   Av   Average Exposure (split levels or foyers typically score average or above)  
   Mn   Minimum Exposure
   No   No Exposure
   NA   No Basement
   ```

In [None]:
df_all['BsmtExposure'] = df_all['BsmtExposure'].fillna('None')

#### BsmtFinType1: Rating of basement finished area

A variable that can be made ordinal.
```
   GLQ  Good Living Quarters
   ALQ  Average Living Quarters
   BLQ  Below Average Living Quarters   
   Rec  Average Rec Room
   LwQ  Low Quality
   Unf  Unfinished
   NA   No Basement
   ```

In [None]:
df_all['BsmtFinType1'] = df_all['BsmtFinType1'].fillna('None')

#### BsmtFinType2: Rating of basement finished area (if multiple types)

A variable that can be made ordinal with the FinType vector.
```
   GLQ  Good Living Quarters
   ALQ  Average Living Quarters
   BLQ  Below Average Living Quarters   
   Rec  Average Rec Room
   LwQ  Low Quality
   Unf  Unfinished
   NA   No Basement
   ```

In [None]:
df_all['BsmtFinType2'] = df_all['BsmtFinType2'].fillna('None')

#### Remaining Basement variables with just a few NAs

That leaves the 6 numeric basement variables that have only 1 or 2 NAs.

In [None]:
df_all[(df_all["BsmtFullBath"].isnull()) & ((df_all["BsmtHalfBath"].isnull()) | (df_all["BsmtFinSF1"].isnull()) | (df_all["BsmtFinSF2"].isnull()) | (df_all["BsmtUnfSF"].isnull())| (df_all["TotalBsmtSF"].isnull()) )][['Id','BsmtQual', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF']]

It should be obvious that those remaining NAs all refer to ‘not present’. Below, I am fixing those remaining variables.

#### BsmtFullBath: Basement full bathrooms

An integer variable.

In [None]:
df_all['BsmtFullBath'] = df_all['BsmtFullBath'].fillna(0)

#### BsmtHalfBath: Basement half bathrooms

An integer variable.

In [None]:
df_all['BsmtHalfBath'] = df_all['BsmtHalfBath'].fillna(0)

#### BsmtFinSF1: Type 1 finished square feet

An integer variable.

In [None]:
df_all['BsmtFinSF1'] = df_all['BsmtFinSF1'].fillna(0)

#### BsmtFinSF2: Type 2 finished square feet

An integer variable.

In [None]:
df_all['BsmtFinSF2'] = df_all['BsmtFinSF2'].fillna(0)

#### BsmtUnfSF: Unfinished square feet of basement area

An integer variable.

In [None]:
df_all['BsmtUnfSF'] = df_all['BsmtUnfSF'].fillna(0)

#### TotalBsmtSF: Total square feet of basement area

An integer variable.

In [None]:
df_all['TotalBsmtSF'] = df_all['TotalBsmtSF'].fillna(0)
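The six cells above all do the same thing, so they can also be written as one step. A minimal sketch on a made-up `toy` frame (the column list comes from the cells above, the values are invented):

```python
import numpy as np
import pandas as pd

# The six numeric basement columns handled above.
bsmt_numeric = ["BsmtFullBath", "BsmtHalfBath", "BsmtFinSF1",
                "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF"]

# Toy frame standing in for df_all; the middle row mimics a house with no basement data.
toy = pd.DataFrame({col: [1.0, np.nan, 0.0] for col in bsmt_numeric})

# A missing value is read as 'not present', i.e. zero baths / zero square feet.
toy[bsmt_numeric] = toy[bsmt_numeric].fillna(0)
print(toy.isnull().sum().sum())
```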

#### Masonry veneer type, and masonry veneer area

Masonry veneer type has 24 NAs. Masonry veneer area has 23 NAs. If a house has a veneer area, it should also have a masonry veneer type. Let’s fix this one first.

In [None]:
df_all[(df_all['MasVnrType'].isnull()) & (df_all['MasVnrArea'].isnull() == False ) ][['Id','MasVnrType','MasVnrArea']]

In [None]:
#MasVnrArea by MasVnrType
var = 'MasVnrArea'
data = pd.concat([df_all['MasVnrType'], df_all[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 10))
fig = sns.boxplot(x='MasVnrType', y="MasVnrArea", data=data)
xt = plt.xticks(rotation=45)

In [None]:
df_all.loc[df_all['Id'] == 2611, ['MasVnrType']] = 'Stone'

In [None]:
df_all[df_all['Id']==2611]['MasVnrType']

#### Masonry veneer type

Will check the ordinality below.
```
   BrkCmn   Brick Common
   BrkFace  Brick Face
   CBlock   Cinder Block
   None     None
   Stone    Stone
   ```

In [None]:
df_all[df_all['MasVnrType'].isnull() == True]['MasVnrType'].head()

In [None]:
df_all['MasVnrType'] = df_all['MasVnrType'].fillna('None')

#### MasVnrArea: Masonry veneer area in square feet

An integer variable.

In [None]:
df_all['MasVnrArea'] = df_all['MasVnrArea'].fillna(0)
len(df_all[df_all['MasVnrArea'].isnull()])

#### MSZoning: Identifies the general zoning classification of the sale

4 NAs. Values are categorical.
```
   A    Agriculture
   C    Commercial
   FV   Floating Village Residential
   I    Industrial
   RH   Residential High Density
   RL   Residential Low Density
   RP   Residential Low Density Park 
   RM   Residential Medium Density
   ```

In [None]:
df_all['MSZoning'].describe()

In [None]:
df_all['MSZoning'] = df_all['MSZoning'].fillna('RL')

### Kitchen Variables
#### Kitchen quality and number of Kitchens above grade

Kitchen quality has 1 NA. Number of Kitchens is complete.

#### Kitchen quality

1 NA. Can be made ordinal with the qualities vector.
```
   Ex   Excellent
   Gd   Good
   TA   Typical/Average
   Fa   Fair
   Po   Poor
   ```

In [None]:
df_all['KitchenQual'].describe()

In [None]:
df_all['KitchenQual'] = df_all['KitchenQual'].fillna('TA')

### Utilities
#### Utilities: Type of utilities available

2 NAs. Ordinal, since having more utilities is better.
```
   AllPub   All public Utilities (E,G,W,& S)    
   NoSewr   Electricity, Gas, and Water (Septic Tank)
   NoSeWa   Electricity and Gas Only
   ELO  Electricity only
```
However, the table below shows that only one house does not have all public utilities. This house is in the train set. Therefore, imputing ‘AllPub’ for the NAs means that all houses in the test set will have ‘AllPub’. This makes the variable useless for prediction. Consequently, I will get rid of it.

In [None]:
df_all['Utilities'].describe()

In [None]:
del df_all['Utilities'];
gc.collect()
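The argument for dropping Utilities (almost every house has the same value) can be checked for any column by looking at the share of its most frequent level. The sketch below uses a made-up `toy` frame, and the 0.85 cut-off is an arbitrary assumption, not a rule from this notebook.

```python
import pandas as pd

# Toy frame: 'Utilities' is almost constant, 'MSZoning' is not.
toy = pd.DataFrame({
    "Utilities": ["AllPub"] * 9 + ["NoSeWa"],
    "MSZoning": ["RL"] * 5 + ["RM"] * 3 + ["FV"] * 2,
})

# Share of rows taken by the most frequent level in each column.
# value_counts sorts in descending order, so the first entry is the top level.
top_share = toy.apply(lambda s: s.value_counts(normalize=True).iloc[0])

# Hypothetical threshold: flag columns dominated by a single level.
near_constant = top_share[top_share >= 0.85].index.tolist()
print(near_constant)
```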

#### Functional: Home functionality

1 NA. Can be made ordinal (salvage only is worst, typical is best).
```
   Typ  Typical Functionality
   Min1 Minor Deductions 1
   Min2 Minor Deductions 2
   Mod  Moderate Deductions
   Maj1 Major Deductions 1
   Maj2 Major Deductions 2
   Sev  Severely Damaged
   Sal  Salvage only
   ```

In [None]:
df_all['Functional'].describe()

In [None]:
df_all['Functional'] = df_all['Functional'].fillna('Typ')

### Exterior Variables
#### There are 4 exterior variables

2 variables have 1 NA, 2 variables have no NAs.

#### Exterior1st: Exterior covering on house

1 NA. Values are categorical.
```
   AsbShng  Asbestos Shingles
   AsphShn  Asphalt Shingles
   BrkComm  Brick Common
   BrkFace  Brick Face
   CBlock   Cinder Block
   CemntBd  Cement Board
   HdBoard  Hard Board
   ImStucc  Imitation Stucco
   MetalSd  Metal Siding
   Other    Other
   Plywood  Plywood
   PreCast  PreCast 
   Stone    Stone
   Stucco   Stucco
   VinylSd  Vinyl Siding
   Wd Sdng  Wood Siding
   WdShing  Wood Shingles
   ```


In [None]:
df_all['Exterior1st'].describe()

In [None]:
df_all['Exterior1st'] = df_all['Exterior1st'].fillna('VinylSd')

#### Exterior2nd: Exterior covering on house (if more than one material)

1 NA. Values are categorical. 
```
   AsbShng  Asbestos Shingles
   AsphShn  Asphalt Shingles
   BrkComm  Brick Common
   BrkFace  Brick Face
   CBlock   Cinder Block
   CemntBd  Cement Board
   HdBoard  Hard Board
   ImStucc  Imitation Stucco
   MetalSd  Metal Siding
   Other    Other
   Plywood  Plywood
   PreCast  PreCast
   Stone    Stone
   Stucco   Stucco
   VinylSd  Vinyl Siding
   Wd Sdng  Wood Siding
   WdShing  Wood Shingles
   ```

In [None]:
df_all['Exterior2nd'].describe()

In [None]:
df_all['Exterior2nd'] = df_all['Exterior2nd'].fillna('VinylSd')

### Electrical
#### Electrical: Electrical system

1 NA. Values are categorical.
```
   SBrkr    Standard Circuit Breakers & Romex
   FuseA    Fuse Box over 60 AMP and all Romex wiring (Average) 
   FuseF    60 AMP Fuse Box and mostly Romex wiring (Fair)
   FuseP    60 AMP Fuse Box and mostly knob & tube wiring (poor)
   Mix  Mixed
   ```

In [None]:
df_all['Electrical'].describe()

In [None]:
df_all['Electrical'] = df_all['Electrical'].fillna('SBrkr')

### SaleType
#### SaleType: Type of sale

1 NA. Values are categorical.
```
   WD   Warranty Deed - Conventional
   CWD  Warranty Deed - Cash
   VWD  Warranty Deed - VA Loan
   New  Home just constructed and sold
   COD  Court Officer Deed/Estate
   Con  Contract 15% Down payment regular terms
   ConLw    Contract Low Down payment and low interest
   ConLI    Contract Low Interest
   ConLD    Contract Low Down
   Oth  Other
   ```

In [None]:
df_all['SaleType'].describe()

In [None]:
df_all['SaleType'] = df_all['SaleType'].fillna('WD')
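In the cells above the fill values ('TA', 'Typ', 'VinylSd', 'SBrkr', 'WD', ...) are typed in by hand after reading `describe()`. A rough sketch of looking up the mode programmatically instead, on a made-up `toy` frame:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_all; the values are invented for illustration.
toy = pd.DataFrame({
    "KitchenQual": ["TA", "Gd", "TA", np.nan],
    "SaleType": ["WD", "WD", np.nan, "New"],
})

# Fill each categorical column with its own most frequent level.
for col in ["KitchenQual", "SaleType"]:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

Typing the value by hand has one advantage: the imputed level stays visible in the notebook, so keeping the explicit cells is a reasonable choice too.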

In [None]:
#missing data
total = df_all.isnull().sum().sort_values(ascending=False)
percent = (df_all.isnull().sum()/df_all.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

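#bar chart of the 20 columns with the largest share of missing values
#missing_data = missing_data.head(20)
percent_data = percent.head(20) * 100 #fraction -> percent, to match the axis label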
percent_data.plot(kind="bar", figsize = (18,16), fontsize = 15)
plt.xlabel("Columns", fontsize = 20)
plt.ylabel("Percent of Missing Value (%)", fontsize = 20)
#plt.title("Total Missing Value (%)", fontsize = 20)

In [None]:
df_train = df_all[:len_train]
df_test = df_all[len_train:]
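The positional split only works if the combined frame still has the train rows first and no rows were dropped or reordered along the way. Below is a small round-trip sketch on made-up frames. I am assuming here that `df_all` was built with `pd.concat` of train followed by test; in that case the test part also carries an all-NaN `SalePrice` column, which the sketch drops.

```python
import pandas as pd

# Made-up stand-ins for the original train and test sets.
train = pd.DataFrame({"Id": [1, 2, 3], "SalePrice": [100, 200, 300]})
test = pd.DataFrame({"Id": [4, 5]})

len_train = len(train)

# Assumed construction of the combined frame: train rows first, then test rows.
combined = pd.concat([train, test], ignore_index=True, sort=False)

# Split back by position; the test part has no target, so drop the empty column.
train_back = combined.iloc[:len_train]
test_back = combined.iloc[len_train:].drop(columns="SalePrice")

print(train_back.shape, test_back.shape)
```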

In [None]:
df_train.to_csv('train.csv', index=False)
df_test.to_csv('test.csv', index=False)

Next week I will talk about encoding methods and feature engineering.