Hi, thank you for clicking my Kernel :)

Here you will find a detailed data-wrangling walkthrough for the [House Sales in King County dataset](https://www.kaggle.com/harlfoxem/housesalesprediction), which may give you some inspiration for your own data wrangling.

I also introduce three different methods for finding outliers. By the end of this kernel the data will be clean and tidy, which I believe will help you make predictions or build machine learning models.

Hopefully this kernel is helpful for you. Enjoy the reading! :)


# Step 1. Loading a CSV into pandas

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load
import os
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np  
# Configure visualisations
%matplotlib inline
mpl.style.use('ggplot') # use the ggplot style for nicer-looking figures

In [None]:
#loading the dataset 
df_house=pd.read_csv("../input/housesalesprediction/kc_house_data.csv")
df_house.head()

In [None]:
#remove some columns that we do not need for the following steps
df_house.drop(['id', 'sqft_living15','sqft_lot15'], axis = 1, inplace = True)
df_house.head()

### Data types and missing values

In [None]:
df_house.info()

There are no missing values, but 'zipcode' should be a categorical variable rather than a quantitative one.

In [None]:
df_house['zipcode'] = df_house['zipcode'].astype('category')

In [None]:
df_house.describe()

What the table above tells us:

1. We have 21,613 samples in total, and none of them has missing values.
2. For 'price', the standard deviation (367,127) is large: it is roughly two-thirds of the mean (540,088), which indicates that individual prices can vary a lot from the mean. Keep in mind that outliers affect the mean as well.
3. For sqft_living, the standard deviation (918) is less than half of the mean (2,079), so living space varies less around its mean, in relative terms, than price does.
4. The mean of condition is above 3.4, so most properties are in average or better condition, even though most of them have no view.
5. The mean of waterfront is far below 0.5, the midpoint of a 0/1 indicator (the 0.086517 in the table is its standard deviation), which means most properties are not next to the water.
6. The properties were built between 1900 and 2015.
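
To put a number on "the standard deviation is large relative to the mean", one option is the coefficient of variation (standard deviation divided by the mean). This is only a minimal sketch on made-up numbers; the `toy` frame and its values are assumptions for illustration, not rows from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from kc_house_data.csv)
toy = pd.DataFrame({
    "price": [200_000, 350_000, 500_000, 1_500_000],
    "sqft_living": [1_000, 1_500, 2_000, 2_500],
})

# Coefficient of variation: standard deviation divided by the mean, per column
cv = toy.std() / toy.mean()
print(cv.round(2))
```

A larger value means the column is more spread out relative to its own mean, which makes columns with different units comparable.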

Let's see the distribution of the object variables

In [None]:
df_house.describe(include=['O'])

What the table above tells us:

1. There are 372 distinct sale dates among the 21,613 records.
2. The most frequent date is 23/06/2014, with 142 properties sold on that day.
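
As a reminder of how to read this table, here is a small sketch of what `describe()` reports for a text column. The date strings below are hypothetical and only mimic the format of the raw 'date' column.

```python
import pandas as pd

# Hypothetical date strings in the same format as the raw 'date' column
dates = pd.Series(["20140623T000000", "20140623T000000",
                   "20140502T000000", "20150112T000000"], name="date")

summary = dates.describe()
# count  -> number of non-missing records
# unique -> number of distinct values
# top    -> the most frequent value
# freq   -> how many times 'top' occurs
print(summary)
```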

Let's see the distribution of the categorical variables

In [None]:
df_house.describe(include=['category'])

What the table above tells us:

1. There are 70 distinct zipcodes (areas) in King County in this dataset.
2. Zipcode 98103 is the hottest area for property sales, with 602 sales.


# Step 2. Auditing and cleansing the loaded data

## Checking irregularities

#### Checking irregularity of zipcodes

In [None]:
df_house.zipcode.unique()

#### Checking irregularity of bathrooms

In [None]:
df_house.bathrooms.unique()

It is very common to see a fractional bathroom count such as 0.25, 0.5 or 0.75: a room that lacks some fixtures of a full bathing facility still counts as part of a bathroom, so a missing bathtub doesn't mean you can't “bathe”.

#### Checking irregularity of bedrooms

In [None]:
df_house.bedrooms.unique()

Holy crap, there are some houses with 33 bedrooms! I guess they are either hotels or palaces.

#### Checking irregularity of views

In [None]:
df_house.view.unique()

#### Checking irregularity of conditions

In [None]:
df_house.condition.unique()

#### Checking irregularity of year renovated

In [None]:
df_house.yr_renovated.unique()

#### Checking irregularity of waterfront

In [None]:
df_house.waterfront.unique()

#### Checking irregularity of grade

In [None]:
df_house.grade.unique()

Hey buddy, where is the grade 2?

#### Checking irregularity of floors

In [None]:
df_house.floors.unique()

The minimum number of floors is 1. Decimals might be allowed because some houses have a mezzanine or an attic, and some might be built on a slope.
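
The one-column-per-cell checks above could also be done in a single pass. Here is a minimal sketch, assuming a small made-up frame; the column names mirror the dataset but the values are hypothetical.

```python
import pandas as pd

# Toy frame with hypothetical values; the column names mirror the dataset
toy = pd.DataFrame({
    "bedrooms": [3, 2, 33, 0],
    "floors": [1.0, 1.5, 2.0, 1.0],
    "waterfront": [0, 0, 1, 0],
})

# Collect the sorted distinct values of every column in one pass
unique_values = {col: sorted(toy[col].unique().tolist()) for col in toy.columns}
for col, values in unique_values.items():
    print(f"{col}: {values}")
```

Sorting the values makes gaps (like a missing grade) and extremes (like 33 bedrooms) easier to spot.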

## Checking any lexical errors in the data
Typos are the most common errors, particularly when the data collection process involves humans. Let's look at the date. First, we use the 'value_counts()' function to check whether any values look like one-off errors.

#### Checking attribute of date

In [None]:
df_house.date.value_counts()

I suspect a date that appears only once or twice might have some issues. So I check them one by one, but this dataset is too damn perfect. Everything is correct here.

In [None]:
#remove 'T000000'
df_house['date'] = df_house.date.str.replace('T000000' , '')

Let's use a regular expression to check whether the dates contain any other errors.

In [None]:
regex = r'''(?x)
    # Year
    (?:(?:(?:\d{2})?\d{2})
    #30-day months
    (?:(?:(?:0[469]|11)(?:30|[12][0-9]|0[1-9]))|
    #31-day months
    (?:(?:0[13578]|1[02])(?:3[01]|[12][0-9]|0[1-9]))|
    #February (29 days every year)
    (?:(?:0?2)(?:[12][0-9]|0?[1-9]))))
'''

df_house[~df_house["date"].str.match(regex)]

OMG! The dataset is too perfect!
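
As a cross-check on the regular expression, the dates can also be validated by letting pandas parse them. This is a sketch of an alternative technique, not the method used above: `pd.to_datetime` with `errors="coerce"` turns anything that is not a real calendar date into `NaT`. The sample strings are hypothetical.

```python
import pandas as pd

# Hypothetical date strings; the last two are deliberately invalid
dates = pd.Series(["20140623", "20150228", "20150230", "2014-13-01"])

# Values that cannot be parsed as YYYYMMDD become NaT instead of raising an error
parsed = pd.to_datetime(dates, format="%Y%m%d", errors="coerce")
invalid = dates[parsed.isna()]
print(invalid.tolist())
```

Unlike the regex, which accepts 29 February in every year, this check is calendar-aware.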


In [None]:
#convert the date strings to datetime
df_house['date']= pd.to_datetime(df_house['date'])
df_house.head()

#### Checking lexical errors of zipcode attribute

In [None]:
df_house.zipcode.value_counts()

Nothing wrong here. But I won't give up!

## Checking Inconsistency, Integrity and Semantic errors
### Checking that the values of "sqft_living" are consistent with the values of "sqft_above" and "sqft_basement"

In [None]:
df_house.loc[(df_house.sqft_living!=df_house.sqft_above+df_house.sqft_basement)]

### Checking whether any year of last renovation is earlier than or equal to the year built

In [None]:
df_house.loc[(df_house.yr_renovated<=df_house.yr_built) & (df_house.yr_renovated !=0) ]

What should I say? This dataset is too clean!
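
The two consistency checks above can be kept together as named boolean masks, so the number of violations per rule is visible at a glance. This is only a sketch on hypothetical rows; the `rules` dictionary and its labels are my own names, not part of the dataset.

```python
import pandas as pd

# Toy rows with hypothetical values; the second row violates both rules
toy = pd.DataFrame({
    "sqft_living": [2000, 1500],
    "sqft_above": [1500, 1000],
    "sqft_basement": [500, 400],
    "yr_built": [1990, 2005],
    "yr_renovated": [0, 2001],
})

# Each rule is a boolean mask that is True where a row is inconsistent
rules = {
    "living != above + basement":
        toy.sqft_living != toy.sqft_above + toy.sqft_basement,
    "renovated on or before built":
        (toy.yr_renovated != 0) & (toy.yr_renovated <= toy.yr_built),
}
violations = {name: int(mask.sum()) for name, mask in rules.items()}
print(violations)
```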

### Checking whether any property has neither bathrooms nor bedrooms

In [None]:
df_house[(df_house.bathrooms==0) & (df_house.bedrooms==0)]

Finally, I found something. Looking at the other features, I think these records are reasonable even though the properties have no bedrooms or bathrooms. yr_renovated is 0 here, so we assume these properties have not been renovated yet, and most of them are very old. Keeping them in the dataset could be a useful reference for someone who would like to sell a house without bedrooms and bathrooms. We can assume properties of this type are unfurnished and do not need any painting. Therefore, I decide to keep them without making any changes.

### Checking whether any property has only 1 floor and no basement, yet has more living space than land space

Before we move on, let's look at the houses that have just 1 floor.

In [None]:
df_house[(df_house.floors==1)]

From the table above, we can see that basement space is not counted as a floor.

Now let's check whether any property with only 1 floor and no basement has more living space than land space.

In [None]:
df_house[(df_house.floors==1) & (df_house.sqft_basement==0) & (df_house.sqft_living > df_house.sqft_lot)]

Technically, it does not make sense for the living space to be larger than the land space when a property has only 1 floor and no basement. Therefore, I count it as an error and decide to remove it.

In [None]:
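#drop the rows flagged by the check above, selected by condition instead of a hard-coded position
df_house.drop(df_house[(df_house.floors==1) & (df_house.sqft_basement==0) & (df_house.sqft_living > df_house.sqft_lot)].index, inplace=True)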

In [None]:
df_house[(df_house.floors==1) & (df_house.sqft_basement==0) & (df_house.sqft_living > df_house.sqft_lot)]

## Checking duplicated records

Assume that location ("lat" and "long"), "date", "yr_built", "sqft_living" and "sqft_lot" together uniquely identify a sale, because a property will not be sold twice in one day. We can then use these six columns to check whether the dataset has duplicated records.

In [None]:
df_house[df_house.duplicated(["lat", "long", "date", "yr_built",
                               "sqft_living", "sqft_lot"], keep=False)]

Oh, we have duplicates here. Let's just remove the first record and keep the second one for this property.
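
To make the difference between inspecting and removing duplicates concrete, here is a small sketch on hypothetical sales. `keep=False` flags every member of a duplicate group, while `keep='last'` retains only the final row of each group. The shortened `key` list is an assumption to keep the example small.

```python
import pandas as pd

# Hypothetical sales; rows 0 and 1 describe the same property on the same day
toy = pd.DataFrame({
    "lat": [47.51, 47.51, 47.72],
    "long": [-122.25, -122.25, -122.31],
    "date": ["20140623", "20140623", "20141209"],
    "price": [300_000, 310_000, 538_000],
})
key = ["lat", "long", "date"]

# keep=False marks all rows of a duplicate group, which is handy for inspection
flagged = toy[toy.duplicated(key, keep=False)]
# keep='last' drops every row of a group except the last one
deduped = toy.drop_duplicates(key, keep="last")
print(flagged.index.tolist(), deduped.index.tolist())
```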

In [None]:
df_house.drop_duplicates(["lat", "long", "date", "yr_built",
                          "sqft_living", "sqft_lot"], keep='last', inplace=True)

## Outliers

### Univariate Outlier Detection Methods
* The 3σ Edit Rule
* Boxplot Graphical Detection

##### 1. The 3σ Edit Rule

The most common way to detect outliers in this case is the 3σ edit rule, which declares any point lying farther than three standard deviations from the mean to be an outlier.
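
Before applying the rule to the real prices, here is a minimal sketch of it on made-up numbers. The `prices` series is hypothetical (nineteen ordinary values and one extreme value) and only illustrates the mechanics.

```python
import pandas as pd

# Hypothetical prices: nineteen ordinary values and one extreme value
prices = pd.Series([100.0] * 19 + [1000.0])

mean, sd = prices.mean(), prices.std()
# Flag points lying more than three standard deviations from the mean, on either side
is_outlier = (prices - mean).abs() > 3 * sd
print(prices[is_outlier])
```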

In [None]:
df_house['price'].describe()

In [None]:
#calculate the mean of price and the distance of 3*standard deviation
price_mean=df_house.price.mean()
three_sigma=3*(df_house.price.std())
#find the positions of the rows whose price is more than 3sd away from the mean (on either side)
price_Editrule=np.flatnonzero(((df_house.price-price_mean).abs()>three_sigma).to_numpy()).tolist()

df_house.iloc[price_Editrule]

The calculation flags 406 outliers, which seems implausibly many. A common difficulty with this edit rule is its sensitivity to the mean and standard deviation, which are themselves affected by the outliers. One useful property follows from the definition: if a data point lies farther from the mean than an outlier does, it is itself an outlier. I would therefore like to use more methods to check the outliers more accurately.
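
The sensitivity to the mean and standard deviation can be seen on a tiny made-up sample: when the extreme values inflate the standard deviation enough, the 3σ rule no longer flags them. The numbers below are hypothetical and chosen only to show this effect.

```python
import pandas as pd

# Hypothetical prices: eight ordinary values and two extreme values
prices = pd.Series([100.0] * 8 + [1000.0, 1000.0])

mean, sd = prices.mean(), prices.std()
# The two extreme values pull the mean up and inflate the standard deviation
flagged = (prices - mean).abs() > 3 * sd
print(mean, round(sd, 1), int(flagged.sum()))
```

Here nothing is flagged even though two values are ten times larger than the rest, which is one reason to cross-check with other methods.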

##### 2. Boxplot Detection

Graphical methods are very important for visualising and identifying outliers, especially when the data has few dimensions. The boxplot is a common graphical method with the advantage of being robust against outliers, because it is based on quartiles. I build a boxplot in the next step.
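
The points a boxplot draws beyond its whiskers can also be computed directly from the quartiles. This is a sketch assuming the usual 1.5 × IQR whisker length (matplotlib's default); the `prices` values are hypothetical.

```python
import pandas as pd

# Hypothetical prices with one obviously large value
prices = pd.Series([100.0, 120.0, 130.0, 150.0, 160.0, 180.0, 200.0, 900.0])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1                                   # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # whisker limits (fences)
outliers = prices[(prices < lower) | (prices > upper)]
print(lower, upper, outliers.tolist())
```

Because the quartiles barely move when one extreme value is added, the fences stay stable, which is the robustness mentioned above.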

In [None]:
df_house.boxplot(column='price',sym='k.')
plt.show()

From the boxplot and the price description shown above, we can see that the y-axis is in millions. The median sits near the bottom of the plot, so it is hard to distinguish deviations below the median. Therefore, we take the natural log of the house prices to see the other deviations more clearly.

In [None]:
#log of each price
df_house['price']=np.log(df_house.price)
#Now plot a boxplot again
df_house.boxplot(column='price',sym='k.')
plt.show()

From the boxplot shown above, we can clearly see the outliers that deviate most from the median value.

In [None]:
df_house[df_house.price>15.5]

The four observations above look reasonable, because they all have large living and land space and multiple bedrooms and bathrooms. More importantly, 2 of the properties were renovated in 2001.

### Multivariate Outlier Detection Methods

* Mahalanobis Distance
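
The squared Mahalanobis distance of a point x from the mean μ is (x − μ)ᵀ Σ⁻¹ (x − μ), where Σ is the covariance matrix. Here is a minimal NumPy sketch of that formula for all rows at once; the array `X` holds hypothetical (log price, sqft_living) pairs, and the vectorised `einsum` call is my own choice rather than the approach used in the cells of this notebook.

```python
import numpy as np

# Hypothetical (log price, sqft_living) pairs; the last point breaks the trend
X = np.array([
    [12.5, 1000.0],
    [12.8, 1500.0],
    [13.0, 2000.0],
    [13.3, 2500.0],
    [13.5, 3000.0],
    [12.4, 3200.0],
])

diff = X - X.mean(axis=0)                          # centre every row on the mean vector
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))   # inverse of the covariance matrix
# Squared Mahalanobis distance of each row: diff_i . inv_cov . diff_i
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
farthest = int(np.argmax(d2))
print(d2.round(2), farthest)
```

The last row has a large living space but a low price, so it gets the largest distance even though neither value is extreme on its own.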

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(df_house.corr(numeric_only=True), annot=True, fmt=".2f")
plt.show()

In [None]:
#plot a price histogram
df_house['price'].hist(bins=100)
plt.show()

In [None]:
#plot living space histogram
sns.histplot(df_house['sqft_living'], kde=True)
plt.show()

In [None]:
#2-dimensional Mahalanobis distance to detect outliers
#The greater the Mahalanobis distance, the more likely the point is an outlier.

from pandas import Series
from scipy.spatial import distance 
#build a new dataframe which contains just the price and sqft_living columns
hw=df_house[['price','sqft_living']]
#define number of outliers
n_outliers =6
#mean vector of the 2 columns and inverse of their covariance matrix, computed once
hw_mean = hw.mean().values
inv_cov = np.linalg.inv(hw.cov().values)
#squared Mahalanobis distance of each row (hw.iloc[i] selects the row at position i);
#the Series index is the row position, so m_dist_order lists positions by descending distance
m_dist_order = Series([distance.mahalanobis(hw.iloc[i].values, hw_mean, inv_cov) ** 2 for i in range(len(hw))]).sort_values(ascending=False).index.tolist()
#If the property is an outlier the flag is True, otherwise False
is_outlier = [False, ] * (len(hw))
for i in range(n_outliers):
    is_outlier[m_dist_order[i]] = True
#outliers are displayed in red, others are displayed in blue
color = ['b', 'r']
#turn True to 1, False to 0
pch = [1 if is_outlier[i] == True else 0 for i in range(len(is_outlier))]
#turn True to 'r', False to 'b'
cValue = [color[is_outlier[i]] for i in range(len(is_outlier))]

#plotting
fig = plt.figure()  
#set title
plt.title('Scatter Plot')  
#set x label
plt.xlabel('sqft_living')  
#set y label
plt.ylabel('price')  
#draw scatter
plt.scatter(hw['sqft_living'],  hw['price'], s=40, c=cValue) 
plt.show()  

The outliers found with the 2-dimensional Mahalanobis distance are shown below:

In [None]:
index_list1=[]
for i in range(len(pch)):
    #if value in pch is 1, it is an outlier
    if pch[i]==1:
        index_list1.append(i)
#show outliers from the 2-dimensional Mahalanobis distance
df_house.iloc[index_list1]

6 outliers have been detected based on the Mahalanobis distance, and they include the outliers detected by the univariate methods. To identify outliers more accurately, I build a 3-dimensional Mahalanobis distance model using the 2 variables most correlated with price (sqft_living and grade).

In [None]:
#3-dimensional Mahalanobis distance to detect outliers
#The greater the Mahalanobis distance, the more likely the point is an outlier.
from mpl_toolkits.mplot3d import Axes3D

#build a new dataframe including three columns
hw=df_house[['price','sqft_living','grade']]

n_outliers = 6 #select 6 outliers
#hw.iloc[i] is one row with 3 columns; hw_mean is the vector of the three column means;
#inv_cov is the inverse of the covariance matrix; ** is exponentiation
hw_mean = hw.mean().values
inv_cov = np.linalg.inv(hw.cov().values)
#The Series has the row position as index and the squared distance as value
#m_dist_order is a list of row positions sorted by descending distance
m_dist_order =  Series([distance.mahalanobis(hw.iloc[i].values, hw_mean, inv_cov) ** 2
       for i in range(len(hw))]).sort_values(ascending=False).index.tolist()
is_outlier = [False, ] * len(hw)
for i in range(n_outliers):#the rows with the largest Mahalanobis distance are marked as True
    is_outlier[m_dist_order[i]] = True

#outliers are displayed in red, others are displayed in blue
color = ['b', 'r']
#turn True to 1, False to 0
pch = [1 if is_outlier[i] == True else 0 for i in range(len(is_outlier))]
#turn True to 'r', False to 'b'
cValue = [color[is_outlier[i]] for i in range(len(is_outlier))]

#plotting
fig = plt.figure()
#use 3 dimensions
ax1 = fig.add_subplot(111, projection='3d')
#set title and labels (the label order matches the scatter call below)
ax1.set_title('Scatter Plot')
ax1.set_xlabel('sqft_living')
ax1.set_ylabel('price')
ax1.set_zlabel('grade')
#plot scatter plot
ax1.scatter(hw['sqft_living'], hw['price'], hw['grade'], s=30, c=cValue)
plt.show()  

In [None]:
index_list2=[]
#pch is 1 for outliers, so here we can find the positions of the outliers
for i in range(len(pch)):
    if pch[i]==1:
        index_list2.append(i)
#show outliers from the 3-dimensional Mahalanobis distance
df_house.iloc[index_list2]

6 outliers have been detected. The result is exactly the same as the outliers found in 2 dimensions. We can keep the outliers or remove them. Here, I decide to remove them by setting their price to missing (NaN), since I think they would distort prediction models built on this data in the future.
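
There are two common ways to "remove" outliers, and they leave the frame in different states. This is a sketch on a hypothetical three-row frame; `outlier_positions` stands in for a list of positional indices such as the ones collected above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the last one plays the role of a detected outlier
toy = pd.DataFrame({"price": [12.5, 13.0, 15.9],
                    "sqft_living": [1000, 2000, 12000]})
outlier_positions = [2]  # hypothetical positional indices of the outliers

# Option 1: keep the row but blank out the price
masked = toy.copy()
masked.iloc[outlier_positions, masked.columns.get_loc("price")] = np.nan

# Option 2: drop the rows entirely
dropped = toy.drop(toy.index[outlier_positions])
print(len(masked), len(dropped))
```

Blanking the price keeps the other features of the row available, while dropping the row shrinks the dataset; which one fits depends on what the data is used for afterwards.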

In [None]:
#set the price of the detected outliers to NaN (row positions come from index_list1)
df_house.iloc[index_list1, df_house.columns.get_loc('price')] = np.nan

We now have a clean and tidy dataset after data wrangling, and we can write the data to a CSV file.

In [None]:
df_house
#Writing a CSV file with the pandas library if you want
#df_house.to_csv('house.csv', encoding='utf-8',index=False)

Thank you for reading!