# What is Data Preparation?

Data preparation is the process of cleaning and transforming raw data before processing and analysis. It often involves reformatting data, correcting errors, and combining data sets to enrich the data.

Data preparation is often a lengthy undertaking for data professionals and business users, but it is an essential prerequisite for putting data in context, turning it into insights, and eliminating bias caused by poor data quality.

For example, the data preparation process usually includes standardizing data formats, enriching source data, and/or removing outliers. ([Read full article about Data Preparation](https://www.talend.com/resources/what-is-data-preparation/))

Our dataset comes from the Kaggle Getting Started prediction competition [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#### Import Dataset

In [None]:
train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

In [None]:
train.head(1)

In [None]:
test.head(1)

#### Combine Train and Test Datasets

In [None]:
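# Combine Train and Test Datasets for Data Preparation.
# ignore_index=True gives every row a unique label; without it the train and
# test rows share labels 0..n, and a later drop by index would hit both sets.
df = pd.concat([test.assign(ind="test"), train.assign(ind="train")], ignore_index=True)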

# Then later we can split them again:
# test, train = df[df["ind"].eq("test")], df[df["ind"].eq("train")]
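
To make the round trip concrete, here is a minimal sketch on two toy frames. The values are invented for illustration, and passing `ignore_index=True` is an assumption of this sketch, chosen so that every row of the combined frame gets a unique label.

```python
import pandas as pd

# Toy stand-ins for the competition files (values are invented).
toy_train = pd.DataFrame({"Id": [1, 2], "SalePrice": [200000, 150000]})
toy_test = pd.DataFrame({"Id": [3, 4]})

# Tag each row with its origin, then stack the two frames.
combined = pd.concat(
    [toy_test.assign(ind="test"), toy_train.assign(ind="train")],
    ignore_index=True,  # unique row labels across both sources
)

# Split back using the indicator column.
back_test = combined[combined["ind"].eq("test")]
back_train = combined[combined["ind"].eq("train")]

print(combined.index.is_unique)
print(len(back_test), len(back_train))
```

The test rows end up with `NaN` in `SalePrice`, because that column only exists in the training frame.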

### Data Overview

In [None]:
df.tail()

In [None]:
df.info()

#### Correlation overview:

In [None]:
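# Correlation is only defined for numeric columns, so select them first
# (recent pandas versions raise an error on text columns otherwise).
df.select_dtypes(include='number').corr()['SalePrice'].sort_values()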
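
As a small illustration of what this ranking shows, the sketch below computes the same kind of correlation on a toy frame. The column values are made up, and `select_dtypes` is used here as one possible way to leave out text columns.

```python
import pandas as pd

# Invented values: area rises with price, year built falls with price.
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500],
    "YearBuilt": [2005, 1990, 1975, 1960],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],  # text column, excluded below
    "SalePrice": [100000, 150000, 200000, 250000],
})

# Keep numeric columns only, then rank them by correlation with the target.
corr_with_price = (
    toy.select_dtypes(include="number").corr()["SalePrice"].sort_values()
)
print(corr_with_price)
```

Values close to 1 or -1 mark features with a strong linear relationship to the target, which are good candidates for the outlier checks that follow.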

#### Check Outliers from Plots

* OverallQual

In [None]:
sns.boxplot(x='OverallQual', y='SalePrice', data=df)

In [None]:
sns.scatterplot(data = df, x='OverallQual', y='SalePrice')
plt.axhline(y=200000,color='r')

> We can see outliers at OverallQual=10 where SalePrice is under 200000:

In [None]:
df[(df['OverallQual']>8) &(df['SalePrice']<200000)][['SalePrice', 'OverallQual']]

* GrLivArea

In [None]:
sns.scatterplot(data = df , x='GrLivArea', y='SalePrice')
plt.axhline(y=200000, color='r')
plt.axvline(x=4000, color='r')

> We can see two outliers in the scatterplot. Let's find them in the dataframe:

In [None]:
df[(df['GrLivArea']>4000) & (df['SalePrice']<400000)][['SalePrice', 'GrLivArea']]

*Notice that the number of outliers in the two plots is the same, so we can drop them in one step:*

In [None]:
index_drop=df[(df['GrLivArea']>4000) & (df['SalePrice']<400000)].index
df=df.drop(index_drop, axis=0)
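
The following sketch replays the same filter on a toy frame with invented values. It illustrates two points under stated assumptions: a row whose `SalePrice` is missing (like a test row) never matches the price comparison, and dropping by index label is only safe when the labels are unique.

```python
import pandas as pd

toy = pd.DataFrame({
    "GrLivArea": [1500, 4500, 5000, 4200],
    "SalePrice": [180000, 160000, 700000, None],  # None mimics a test row
})

# Large living area combined with a low price marks the outlier.
is_outlier = (toy["GrLivArea"] > 4000) & (toy["SalePrice"] < 400000)

# Drop the matching rows by their (unique) index labels.
cleaned = toy.drop(toy[is_outlier].index)
print(cleaned)
```

Only the second row is removed: the third row is large but expensive, and the fourth has no price to compare.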

**Now we can look at our scatterplot again. It should be free of these outliers:**

In [None]:
sns.scatterplot(x='GrLivArea', y='SalePrice', data=df)
plt.axhline(y=200000, color='r')
plt.axvline(x=4000, color='r')

In [None]:
sns.scatterplot(x='OverallQual', y='SalePrice', data=df)
plt.axhline(y=200000,color='r')

In [None]:
sns.boxplot(x='OverallQual', y='SalePrice', data=df)

#### Dealing with Missing Data

Let's look at the null values in the dataframe with info():

In [None]:
df.info()

In [None]:
df.head()

**Some columns carry no useful information for our model, Id for example. These columns must be dropped:**

In [None]:
df= df.drop('Id', axis=1)

**Now we want to know what percentage of each column is null:**

In [None]:
msd = 100*(df.isnull().sum()/len(df)).nlargest(13)

#### Visualize missing values 

In [None]:
plt.figure(figsize = (18,5))
sns.lineplot(data = msd).set_title('13 Columns with the Most Missing Data (%)')

In [None]:
import missingno as msno

In [None]:
# Visualize missing values as a matrix
msno.matrix(df)

#### All Columns That Have Missing Data
**Make a function that calculates the percentage of missing data in each column (feature) and sorts the result:**

In [None]:
def missing_percent(df):
    nan_percent= 100*(df.isnull().sum()/len(df))
    nan_percent= nan_percent[nan_percent>0].sort_values()
    return nan_percent

In [None]:
nan_percent= missing_percent(df)

In [None]:
nan_percent
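
To check what such a helper returns, here is a minimal sketch on a toy frame. The function name `missing_percent_sketch` and the data are hypothetical; it uses `isnull().mean()`, which is equivalent to dividing the null count by the number of rows.

```python
import numpy as np
import pandas as pd

def missing_percent_sketch(frame):
    """Return the percentage of nulls per column, ascending, nulls only."""
    pct = frame.isnull().mean() * 100
    return pct[pct > 0].sort_values()

toy = pd.DataFrame({
    "A": [1, 2, 3, 4],                      # complete, so it is filtered out
    "B": [1, np.nan, 3, 4],                 # 1 of 4 missing
    "C": [np.nan, np.nan, np.nan, 4],       # 3 of 4 missing
})

result = missing_percent_sketch(toy)
print(result)
```

Columns without nulls drop out of the result, so an empty Series means the frame is complete.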

#### Plot the columns that have missing data:

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x=nan_percent.index, y=nan_percent)
plt.xticks(rotation=90)

Some columns have under one percent missing data. To see them on the plot, we can cap the y-axis at a threshold:

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x=nan_percent.index, y=nan_percent)
plt.xticks(rotation=90)

#Set 1% threshold:
plt.ylim(0,1)

In [None]:
nan_percent[nan_percent<1]

In [None]:
df[df['Electrical'].isnull()]

In [None]:
df= df.dropna(axis=0, subset=['Electrical'])

We know by the [description](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data) of this dataset that "MasVnrType" is categorical and "MasVnrArea" is numeric. Therefore, we can fill the missing data with respect to the documentation.

In [None]:
#Numerical Columns fill with 0:
df['MasVnrArea']=df['MasVnrArea'].fillna(0)

#String Columns fill with None:
df['MasVnrType']= df['MasVnrType'].fillna('None')

We should run the missing_percent function again:

In [None]:
nan_percent= missing_percent(df)

Now let's plot again:

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x=nan_percent.index, y=nan_percent)
plt.xticks(rotation=90)

#Set 1% threshold:
plt.ylim(0,1)

##### Now I'm going to check and fill the Bsmt* columns:

In [None]:
#Numerical Columns fill with 0:
bsmt_num_cols= ['BsmtFinSF1' , 'BsmtFinSF2' , 'BsmtUnfSF' , 'BsmtHalfBath' , 'TotalBsmtSF' , 'BsmtFullBath']
df[bsmt_num_cols]= df[bsmt_num_cols].fillna(0)

#String Columns fill with NA (BsmtQual is a categorical quality rating, so it belongs here):
bsmt_str_cols = ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType2', 'BsmtFinType1']
df[bsmt_str_cols]= df[bsmt_str_cols].fillna('NA')

In [None]:
nan_percent= missing_percent(df)

# plot 
plt.figure(figsize=(12,6))
sns.barplot(x=nan_percent.index, y=nan_percent)
plt.xticks(rotation=90)
#Set 1% threshold:
plt.ylim(0,1)

##### Now I'm going to check and fill the Garage columns:

In [None]:
#Numerical Columns fill with 0:
Garage_num_cols= ['GarageCars' , 'GarageArea']
df[Garage_num_cols]= df[Garage_num_cols].fillna(0)

#String Columns fill with NA:
# Note: GarageYrBlt holds years (numeric); filling it with 'NA' makes it a mixed-type column.
Garage_str_cols= ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'GarageYrBlt']
df[Garage_str_cols]= df[Garage_str_cols].fillna('NA')

In [None]:
nan_percent= missing_percent(df)

# plot 
plt.figure(figsize=(12,6))
sns.barplot(x=nan_percent.index, y=nan_percent)
plt.xticks(rotation=90)

##### Some columns have a high share of missing data (more than 80%). We are going to drop these columns:

In [None]:
df= df.drop(['Fence', 'Alley', 'MiscFeature','PoolQC'], axis=1)
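
Instead of typing the column names by hand, the same selection could be derived from the missing-data percentages. This is only a sketch on a toy frame: the 80 percent cut-off mirrors the text, while the variable names and values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, np.nan, np.nan],  # 100% missing
    "Fence": [np.nan, "MnPrv", "GdWo", "MnPrv", "GdPrv"],  # 20% missing
    "LotArea": [8450, 9600, 11250, 9550, 14260],           # complete
})

threshold = 80  # percent; hypothetical cut-off taken from the text above

# Percentage of nulls per column, then the columns above the cut-off.
pct_missing = toy.isnull().mean() * 100
to_drop = pct_missing[pct_missing > threshold].index.tolist()

reduced = toy.drop(columns=to_drop)
print(to_drop)
```

Deriving the list this way keeps the rule and the dropped columns in sync if the data changes.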

In [None]:
nan_percent= missing_percent(df)

# plot 
plt.figure(figsize=(12,6))
sns.barplot(x=nan_percent.index, y=nan_percent)
plt.xticks(rotation=90)

#Set 1% threshold:
plt.ylim(0,1)

In [None]:
df["FireplaceQu"]= df["FireplaceQu"].fillna('NA')

In [None]:
nan_percent= missing_percent(df)

# plot 
plt.figure(figsize=(12,6))
sns.barplot(x=nan_percent.index, y=nan_percent)
plt.xticks(rotation=90)

#Set 1% threshold:
plt.ylim(0,1)

**Now about Exterior1st and Exterior2nd:**
data_description.txt says this about Exterior1st and Exterior2nd:

> Exterior1st: Exterior covering on house
> 
       AsbShng	Asbestos Shingles
       AsphShn	Asphalt Shingles
       BrkComm	Brick Common
       BrkFace	Brick Face
       CBlock	Cinder Block
       CemntBd	Cement Board
       HdBoard	Hard Board
       ImStucc	Imitation Stucco
       MetalSd	Metal Siding
       Other	Other
       Plywood	Plywood
       PreCast	PreCast	
       Stone	Stone
       Stucco	Stucco
       VinylSd	Vinyl Siding
       Wd Sdng	Wood Siding
       WdShing	Wood Shingles
	
> Exterior2nd: Exterior covering on house (if more than one material)
> 
       AsbShng	Asbestos Shingles
       AsphShn	Asphalt Shingles
       BrkComm	Brick Common
       BrkFace	Brick Face
       CBlock	Cinder Block
       CemntBd	Cement Board
       HdBoard	Hard Board
       ImStucc	Imitation Stucco
       MetalSd	Metal Siding
       Other	Other
       Plywood	Plywood
       PreCast	PreCast
       Stone	Stone
       Stucco	Stucco
       VinylSd	Vinyl Siding
       Wd Sdng	Wood Siding
       WdShing	Wood Shingles

So we can fill nulls with 'Other':

In [None]:
str_cols= ['Exterior1st', 'Exterior2nd']
df[str_cols]= df[str_cols].fillna('Other')

**Now let's check the KitchenQual column:**

KitchenQual: Kitchen quality

       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor

None of these categories is a natural fill value for nulls. Let's check how many nulls this column has. If there are only a few, we will drop those rows:

In [None]:
df[df['KitchenQual'].isnull()]

Let's drop this row:

In [None]:
df = df.dropna(axis=0, subset=['KitchenQual'])

**Check** SaleType: Type of sale
		
       WD 	Warranty Deed - Conventional
       CWD	Warranty Deed - Cash
       VWD	Warranty Deed - VA Loan
       New	Home just constructed and sold
       COD	Court Officer Deed/Estate
       Con	Contract 15% Down payment regular terms
       ConLw	Contract Low Down payment and low interest
       ConLI	Contract Low Interest
       ConLD	Contract Low Down
       Oth	Other
       
We can fill nulls with 'Oth':

In [None]:
df['SaleType']= df['SaleType'].fillna('Oth')

**Check** Utilities: Type of utilities available
		
       AllPub	All public Utilities (E,G,W,& S)	
       NoSewr	Electricity, Gas, and Water (Septic Tank)
       NoSeWa	Electricity and Gas Only
       ELO	Electricity only

Again, none of these categories is a natural fill value for nulls. Let's check how many nulls this column has. If there are only a few, we will drop those rows:

In [None]:
df[df['Utilities'].isnull()]

In [None]:
df = df.dropna(axis=0, subset=['Utilities'])

**Check** Functional: Home functionality (Assume typical unless deductions are warranted)

       Typ	Typical Functionality
       Min1	Minor Deductions 1
       Min2	Minor Deductions 2
       Mod	Moderate Deductions
       Maj1	Major Deductions 1
       Maj2	Major Deductions 2
       Sev	Severely Damaged
       Sal	Salvage only

So we can either fill nulls with 'Typ' or drop the rows. Here we fill them:

In [None]:
df['Functional']= df['Functional'].fillna('Typ')

**Check** MSZoning: Identifies the general zoning classification of the sale.
		
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM	Residential Medium Density

In [None]:
df[df['MSZoning'].isnull()]

In [None]:
df = df.dropna(axis=0, subset=['MSZoning'])

##### Let's look at the plot again:

In [None]:
nan_percent= missing_percent(df)

# plot 
plt.figure(figsize=(3,3))
sns.barplot(x=nan_percent.index, y=nan_percent)
plt.xticks(rotation=90)

#### D-Imputation of Missing Data
**We assume that LotFrontage is related to the Neighborhood a house is in:**

In [None]:
df['Neighborhood'].unique()

In [None]:
plt.figure(figsize=(8,12))
sns.boxplot(data=df, x='LotFrontage', y='Neighborhood')

In [None]:
df['LotFrontage']=df.groupby('Neighborhood')['LotFrontage'].transform(lambda val: val.fillna(val.mean()))

In [None]:
df['LotFrontage']= df['LotFrontage'].fillna(0)
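
The two steps above can be sketched on a toy frame as follows. The neighborhood labels and frontage values are invented. The sketch shows why the second `fillna(0)` is needed: a group with no observed value has a mean of `NaN`, so the group-wise fill leaves it empty.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "C"],
    "LotFrontage": [60.0, 80.0, np.nan, 50.0, np.nan, np.nan],
})

# Step 1: fill each gap with the mean of its own neighborhood.
toy["LotFrontage"] = toy.groupby("Neighborhood")["LotFrontage"].transform(
    lambda val: val.fillna(val.mean())
)

# Step 2: neighborhood C has no observed value at all, so fall back to 0.
toy["LotFrontage"] = toy["LotFrontage"].fillna(0)
print(toy)
```

A fallback such as the overall mean or median may be more realistic than 0 for a lot frontage. The 0 here simply follows the cell above.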

In [None]:
nan_percent= missing_percent(df)

In [None]:
nan_percent

**Now we can say that no feature has missing data left. The remaining missing values in SalePrice belong to the test rows, because the test dataset has no SalePrice.**

**Check** GarageYrBlt column:

In [None]:
df['GarageYrBlt'].value_counts(dropna=False)

In GarageYrBlt we can see 159 entries with NA, which I think is too many rows to drop. These houses have no garage, so we can change NA to 0. Another solution is to convert this column to object and then one-hot encode it.

I'm going to convert NA to 0:

In [None]:
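# Turn the 'NA' strings into the number 0 and make the whole column numeric again
df['GarageYrBlt'] = pd.to_numeric(df['GarageYrBlt'], errors='coerce').fillna(0)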

In [None]:
df['GarageYrBlt'].value_counts(dropna=False)
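
A column that mixes years with the string 'NA' has object dtype, so it would be treated as categorical later on. The sketch below shows one possible way to get a purely numeric column, using `pd.to_numeric` with `errors="coerce"`. The sample values are invented.

```python
import pandas as pd

# Mixed content: real years plus the 'NA' placeholder for "no garage".
mixed = pd.Series([2003.0, "NA", 1976.0, "NA"], dtype="object")

# Unparseable entries become NaN, which are then replaced by 0.
as_number = pd.to_numeric(mixed, errors="coerce").fillna(0)

print(mixed.dtype, "->", as_number.dtype)
print(as_number.tolist())
```

Note that 0 is far from any real construction year, so a model may treat it as an extreme value. One-hot encoding the column, as mentioned above, avoids that at the cost of many extra columns.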

#### Dealing with Categorical Data

##### Numerical Columns to Categorical
Some columns hold numeric values that are really category codes. Such columns need to be converted to object type.

To find them, we should check the description of the dataset.

**I checked the dataset's data_description and found that these columns are categorical codes stored as numbers:**

* OverallCond
* OverallQual
* MSSubClass

I think we don't need to convert OverallCond and OverallQual to object type. OverallQual rates the overall material and finish of the house, and OverallCond rates the overall condition of the house, both with numbers from 1 to 10 (1 is Very Poor and 10 is Very Excellent). These ratings are ordered, so it is better for training to keep them numeric.

In [None]:
df['MSSubClass'] = df['MSSubClass'].apply(str)

In [None]:
print(df['MSSubClass'].dtypes)

##### Convert All Object Columns to One-Hot Encoding

In [None]:
df_num = df.select_dtypes(exclude='object')
df_obj = df.select_dtypes(include='object')

In [None]:
df_obj.dtypes

In [None]:
non_dummy_cols = ['ind']
# Take all other columns, keeping their original order
dummy_cols = [col for col in df_obj.columns if col not in non_dummy_cols]
df_obj = pd.get_dummies(df_obj, columns=dummy_cols)

In [None]:
df_obj.shape

In [None]:
final_df = pd.concat([df_num, df_obj], axis = 1)
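
Here is a minimal sketch of the encoding step on a toy frame with invented values. It shows how `pd.get_dummies` expands each listed column into one indicator column per category, while the `ind` column passes through untouched.

```python
import pandas as pd

toy = pd.DataFrame({
    "MSSubClass": ["20", "60", "20"],   # category codes stored as strings
    "Street": ["Pave", "Grvl", "Pave"],
    "ind": ["train", "train", "test"],  # indicator column, not encoded
})

# Encode every column except the train/test indicator.
dummy_cols = [col for col in toy.columns if col != "ind"]
encoded = pd.get_dummies(toy, columns=dummy_cols)

print(encoded.columns.tolist())
print(encoded.shape)
```

Because train and test are encoded together, both end up with the same set of indicator columns, even for categories that appear in only one of them.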

**Well done! Now we can use this cleaned dataframe.**

#### We can split test and train:

In [None]:
test, train = final_df[final_df["ind"].eq("test")], final_df[final_df["ind"].eq("train")]

# We should Drop indicator Column from test and train dataframes:
test= test.drop(['ind'], axis=1)
train= train.drop(['ind'], axis=1)

One more thing to do: drop the label column from the test dataframe.

First, let's check that it exists among the columns:

In [None]:
# Prints True if the label column is still present in the test dataframe
print('SalePrice' in test.columns)

In [None]:
test = test.drop(['SalePrice'], axis=1)

#### Done!