# Introduction

![](https://res.akamaized.net/domain/image/upload/t_web/c_fill,w_600/v1554864563/6_New_Jersey_Road_Five_Dock_NSW_Low_res_tq758d.jpg)

## Context

This is a Sydney House Prices dataset.

This dataset contains information on houses sold in Sydney between 2000 and 2019. In this study, missing values are handled, outliers are treated and, finally, regression analysis is applied.

## Content
1. [Load and Check Data](#0)
1. [Dataset Description](#1)
1. [Data Visualization](#2)
1. [Missing Value Analysis](#3)
    * [Defining and Visualizing Missing Values](#4)
    * [Testing the Randomness of Missing Values](#5)
    * [Operations on Missing Values](#6)
1. [Variable Transformation](#7)
1. [Outlier Value Analysis](#8)
    * [Outlier Value Detection Using Boxplot](#9)
    * [Outlier Value Analysis With IQR](#10) 
1. [Machine Learning With Regression Algorithms](#11)

In [None]:
import numpy as np
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt
import missingno as msno

from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.preprocessing import scale
from sklearn import model_selection
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression, PLSSVD

import warnings
warnings.filterwarnings("ignore");

<a id="0"></a>
# Load and Check Data

* Load file

In [None]:
shouse = pd.read_csv("../input/sydney-house-prices/SydneyHousePrices.csv")
df = shouse.copy()

* First 5 records in the Dataset

In [None]:
df.head()

<a id="1"></a>
# Dataset Description

* The info() function shows the number of variables in the dataset, their types and the number of non-null observations in each.
* The dataset consists of 199504 rows and 9 columns.
* Missing values occur in the bed and car columns, which are imputed below.
* Variables and types:
    - float64(2): bed, car
    - int64(4): Id, postalCode, sellPrice, bath
    - object(3): Date, suburb, propType

In [None]:
df.info()

 * Dataset variable names.

In [None]:
df.columns

### Variable Description

 1. **Id**: A row index with no further meaning; it is deleted below because it is unnecessary.
 1. **Date**: Sale date of the house.
 1. **suburb**: Name of the suburb.
 1. **propType**: The type of property.
 1. **sellPrice**: Sale price of the house.
 1. **car**: Not documented in the source; presumably the number of car spaces.
 1. **postalCode**: Postal code.
 1. **bed**: Number of bedrooms.
 1. **bath**: Number of bathrooms.

* Unnecessary variable deletion.

In [None]:
df.drop(["Id"],axis=1,inplace=True)

* The deletion succeeded. The dataset now has 8 variables.

* Statistical information about the dataset.
    * You can access information such as means, medians, standard deviations, minimum and maximum values of numerical variables with the describe() function.

In [None]:
round(df.describe(),2).T

* The correlation of numerical variables is examined in the dataset. It is determined that there is a moderate relationship between the variables.

<a id="2"></a>
# Data Visualization

 Using two types of plots:

* Univariate plots to better understand each attribute.
* Multivariate plots to better understand the relationships between attributes.

## Univariate Plots

* Univariate plots show each variable on its own.
* propType is categorical, so a pie chart shows the share of each property type.

In [None]:
# pie chart of the property types
plt.subplot(2,1,1)
df["propType"].value_counts().plot(kind='pie', title='PropType', autopct='%.1f%%', figsize=[20,20]);

* Creating histogram of each input variable to get an idea of the distribution.

In [None]:
plt.subplot(4,1,1)
df.bed.plot(kind='hist',color='pink',bins=50,figsize=(15,15))
plt.title("bed Variable Histogram Chart");


plt.subplot(4,1,2)
df.sellPrice.plot(kind='hist',color='pink',bins=50,figsize=(15,15))
plt.title("sellPrice Variable Histogram Chart");


plt.subplot(4,1,3)
df.bath.plot(kind='hist',color='pink',bins=50,figsize=(15,15))
plt.title("bath Variable Histogram Chart");


plt.subplot(4,1,4)
df.car.plot(kind='hist',color='pink',bins=50,figsize=(15,15))
plt.title("car Variable Histogram Chart");

## Multivariate Plots

In [None]:
plt.figure(figsize=(12,5))
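# Correlate only the numerical columns; object columns such as Date and suburb cannot be correlated.
sns.heatmap(df.select_dtypes("number").corr(),annot=True,linewidth=2.5,fmt='.3F',linecolor='black');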

* The relationships between the numerical variables are examined. The data do not show a clear linear relationship between sellPrice and the other variables.

In [None]:
sns.pairplot(df, hue = "propType");

<a id="3"></a>
# Missing Values

*Missing-value analysis is applied when some observations have no recorded value.
Missing values generally appear as NA (NaN in pandas).*

* Is there any missing value in the dataset?

In [None]:
df.isnull().values.any()

<a id="4"></a>
## Defining and Visualizing Missing Values

* Total missing values in the variables.

In [None]:
df.isnull().sum()

* In the bar chart, the numbers at the top give the count of non-missing observations for each variable. The left axis shows the proportion of the dataset that is present, the right axis shows the number of observations, and the variable names are at the bottom.

In [None]:
msno.bar(df,color = sns.color_palette('deep'));

<a id="5"></a>
## Testing the Randomness of Missing Values

* In the matrix plot, the left axis shows the observation (row) index, the top shows the variable names, and the sparkline on the right summarizes how complete each row is. White gaps mark missing values.

In [None]:
msno.matrix(df, color = (0.1, 0.2, 0.3));

The nullity heat map shows how the missingness of one variable relates to the missingness of another. The values in this graph range from -1 to 1. A value of 1 means a positive relationship (when one variable is missing, the other is too), a value of -1 means an inverse relationship (when one is missing, the other is present), and a value of 0 means there is no relationship between the two variables.
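
The correlation behind such a heat map can be reproduced, at least approximately, with plain pandas. The sketch below uses a small made-up frame (`toy`, with hypothetical columns `a`, `b`, `c`, not the Sydney data) and correlates the missing-value indicators of its columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "a" and "b" are missing in the same rows,
# "c" is missing exactly where "a" is present.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [2.0, np.nan, 1.0, np.nan, 4.0, 3.0],
    "c": [np.nan, 7.0, np.nan, 8.0, np.nan, np.nan],
})

# 1 where a value is missing, 0 where it is present.
nullity = toy.isna().astype(int)

# Pearson correlation between the missingness indicators.
nullity_corr = nullity.corr()
print(nullity_corr.round(2))
```

Here `a` and `b` correlate at 1 and `a` and `c` at -1, which matches the reading of the heat map described above.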

In [None]:
msno.heatmap(df);

* Missing value numbers and percentages.

In [None]:
def missing_value_table(df):
    missing_value = df.isna().sum().sort_values(ascending=False)
    missing_value_percent = (100 * df.isna().sum() / len(df)).round(2) # true division keeps percentages below 1% from being floored to 0
    missing_value_table = pd.concat([missing_value, missing_value_percent], axis=1)
    missing_value_table_return = missing_value_table.rename(columns = {0 : 'Missing Values', 1 : '% Value'})
    cm = sns.light_palette("darkred", as_cmap=True)
    missing_value_table_return = missing_value_table_return.style.background_gradient(cmap=cm)
    return missing_value_table_return
  
missing_value_table(df)

<a id="6"></a>
## Operations on Missing Values

### Filling in the Missing Values

* Missing values in the dataset are filled with the average of the variables.
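
As a minimal illustration of mean imputation, the sketch below fills the gaps of a made-up frame (`toy`, hypothetical values) column by column. It only demonstrates the mechanics and is not tied to the actual Sydney figures.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap per column.
toy = pd.DataFrame({
    "car": [1.0, 2.0, np.nan, 3.0],
    "bed": [3.0, np.nan, 4.0, 5.0],
})

# Replace each missing value with the mean of the observed values in that column.
for col in ["car", "bed"]:
    toy[col] = toy[col].fillna(toy[col].mean())

print(toy)
```

For count-like variables with skewed distributions, the median is a frequently used alternative to the mean.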

In [None]:
df['car'] = df['car'].fillna(df['car'].mean())

In [None]:
df['bed'] = df['bed'].fillna(df['bed'].mean())

In [None]:
df.isnull().sum()

<a id="7"></a>
# Variable Transformation

* One Hot Encoding means that categorical variables are represented as binary.
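
A small sketch of one-hot encoding on a made-up column may help. The category names below are illustrative assumptions and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical property types for four sales.
toy = pd.DataFrame({"propType": ["house", "townhouse", "house", "villa"]})

# One 0/1 indicator column per category.
dummies = pd.get_dummies(toy["propType"], prefix="propType", dtype=int)
print(dummies)
```

Each row has exactly one indicator set to 1, so the indicator columns together carry the same information as the original categorical column.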

In [None]:
df["propType"].value_counts()

In [None]:
df['propType'] = pd.Categorical(df['propType'])
dfDummies = pd.get_dummies(df['propType'], prefix = 'propType')
dfDummies  



* Only the propType_house indicator column from the one-hot encoding is added to the data.

In [None]:
df = pd.concat([df, dfDummies["propType_house"]], axis=1)

In [None]:
df.Date.value_counts()

* The Date variable is split into year and month.
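
The split can be tried on a few made-up date strings first. The `YYYY-MM-DD` format used here is an assumption for the illustration.

```python
import pandas as pd

# Hypothetical sale dates in an assumed YYYY-MM-DD format.
toy = pd.DataFrame({"Date": ["2019-06-19", "2005-12-01", "2010-01-31"]})

# Parse the strings once, then read year and month from the datetime accessor.
parsed = pd.to_datetime(toy["Date"], format="%Y-%m-%d")
toy["Year"] = parsed.dt.year
toy["Months"] = parsed.dt.month
print(toy)
```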

In [None]:
Date_ = pd.to_datetime(df['Date'])
df['Year'] = Date_.dt.year
df['Months'] = Date_.dt.month
df

* Categorical variables are deleted and the data set consists only of numerical values.

In [None]:
df.drop(["Date","suburb","propType"],axis = 1, inplace=True)

In [None]:
df.head()

<a id="8"></a>
# Outlier Value Analysis

<a id="9"></a>
## Outlier Value Detection Using Boxplot

In statistics, an outlier is a data point that differs significantly from other observations.

* An outlier is a value smaller than Q1-1.5(Q3-Q1) or larger than Q3+1.5(Q3-Q1).

    * (Q3-Q1) = IQR (interquartile range)
    * Q3 = third quartile (75th percentile)
    * Q1 = first quartile (25th percentile)
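
A worked example of the rule on ten made-up numbers (the array `values` is hypothetical) shows how the bounds are obtained and which points fall outside them.

```python
import numpy as np

# Hypothetical sample with one clearly extreme value.
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 20], dtype=float)

q1, q3 = np.percentile(values, [25, 75])  # first and third quartiles
iqr = q3 - q1                             # interquartile range

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Points outside [lower_bound, upper_bound] are flagged as outliers.
outliers = values[(values < lower_bound) | (values > upper_bound)]
print(q1, q3, iqr, lower_bound, upper_bound, outliers)
```

With these numbers Q1 = 2.25, Q3 = 4 and IQR = 1.75, so the bounds are -0.375 and 6.625 and only the value 20 is flagged.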

In [None]:
dff = df.drop(["postalCode","propType_house","Year","Months"],axis = 1)

* The variables dropped above (codes, indicators and calendar fields) are excluded because outlier analysis is not meaningful for them.

In [None]:
for i, col in enumerate(dff.columns):
    plt.figure(i)
    sns.boxplot(x=col, data=df)

<a id="10"></a>
## Outlier Value Analysis With IQR

* The data are split into train and test sets before the outlier analysis, so that the bounds are computed on the training data only.

In [None]:
y = df[["sellPrice"]]
X = df.drop("sellPrice", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.25)

In [None]:
columns = X_train.copy()

In [None]:
del columns["postalCode"]
del columns["propType_house"]
del columns["Year"]
del columns["Months"]

In [None]:
lower_and_upper = {} # storage
X_train_copy = X_train.copy() # train copy 

for col in columns.columns: # outlier detect
    q1 = X_train[col].quantile(0.25) # Q1 = first quartile (25th percentile)
    q3 = X_train[col].quantile(0.75) # Q3 = third quartile (75th percentile)
    iqr = q3-q1  # IQR = Q3 - Q1
    
    lower_bound = q1-(1.5*iqr)
    upper_bound = q3+(1.5*iqr)
    
    lower_and_upper[col] = (lower_bound, upper_bound)
    X_train_copy.loc[(X_train_copy.loc[:,col]<lower_bound),col]=lower_bound*0.75
    X_train_copy.loc[(X_train_copy.loc[:,col]>upper_bound),col]=upper_bound*1.25
    
lower_and_upper

In [None]:
X_test_copy = X_test.copy() # test copy   

for col in columns.columns:
    X_test_copy.loc[(X_test_copy.loc[:,col]<lower_and_upper[col][0]),col]=lower_and_upper[col][0]*0.75
    X_test_copy.loc[(X_test_copy.loc[:,col]>lower_and_upper[col][1]),col]=lower_and_upper[col][1]*1.25
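
The loops above replace outliers with scaled versions of the bounds. A common alternative, shown here only as a sketch and not as the method used in this notebook, is to clip values exactly to the bounds learned from the training data. The frames `train` and `test` and the helper `iqr_bounds` are hypothetical.

```python
import pandas as pd

# Hypothetical training and test samples of a single feature.
train = pd.DataFrame({"bed": [1, 2, 3, 3, 3, 4, 4, 5, 30]}, dtype=float)
test = pd.DataFrame({"bed": [0, 3, 50]}, dtype=float)

def iqr_bounds(series):
    """Return the (lower, upper) IQR fences of a numeric series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Bounds come from the training data only and are reused for the test data.
lower, upper = iqr_bounds(train["bed"])
train_clipped = train["bed"].clip(lower, upper)
test_clipped = test["bed"].clip(lower, upper)
print(lower, upper, test_clipped.tolist())
```

Clipping keeps every capped value on the fence itself, so the capped points no longer lie outside the fences computed on the training data.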

* The outliers in the train set have been capped.

In [None]:
for i, col in enumerate(X_train_copy.columns):
    plt.figure(i)
    sns.boxplot(x=col, data=X_train_copy)

* The outliers in the test set have been capped.

In [None]:
for i, col in enumerate(X_test_copy.columns):
    plt.figure(i)
    sns.boxplot(x=col, data=X_test_copy)

* The same procedure is applied to the target variable.

In [None]:
sns.boxplot(y_train);

In [None]:
sns.boxplot(y_test);

In [None]:
lower_and_upper = {} # storage
y_train_copy = y_train.copy() # train copy 

for col in y_train.columns: # outlier detect
    q1 = y_train[col].quantile(0.25) # Q1 = first quartile (25th percentile)
    q3 = y_train[col].quantile(0.75) # Q3 = third quartile (75th percentile)
    iqr = q3-q1  # IQR = Q3 - Q1
    
    lower_bound = q1-(1.5*iqr)
    upper_bound = q3+(1.5*iqr)
    
    lower_and_upper[col] = (lower_bound, upper_bound)
    y_train_copy.loc[(y_train_copy.loc[:,col]<lower_bound),col]=lower_bound*0.75
    y_train_copy.loc[(y_train_copy.loc[:,col]>upper_bound),col]=upper_bound*1.25
    
lower_and_upper

In [None]:
y_test_copy = y_test.copy() # test copy   

for col in y_test.columns:
    y_test_copy.loc[(y_test_copy.loc[:,col]<lower_and_upper[col][0]),col]=lower_and_upper[col][0]*0.75
    y_test_copy.loc[(y_test_copy.loc[:,col]>lower_and_upper[col][1]),col]=lower_and_upper[col][1]*1.25

* The outliers in the train target have been capped.

In [None]:
sns.boxplot(y_train_copy);

* The outliers in the test target have been capped.

In [None]:
sns.boxplot(y_test_copy);

<a id="11"></a>
# Machine Learning With Regression Algorithms

* Multiple Linear Regression
* Principal Component Regression (PCR)
* Partial Least Squares Regression (PLS)
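
Principal Component Regression can also be written as a single scikit-learn pipeline, which guarantees that the scaler and the PCA are fitted on the training data only and merely applied to the test data. The sketch below runs on synthetic data from `make_regression`. The data, the 5-fold setting and all variable names are assumptions for illustration, not the notebook's own setup.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target.
X, y = make_regression(n_samples=300, n_features=7, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Cross-validated RMSE on the training set for each number of components.
cv_rmse = {}
for k in range(1, X.shape[1] + 1):
    pcr = make_pipeline(StandardScaler(), PCA(n_components=k), LinearRegression())
    scores = cross_val_score(pcr, X_tr, y_tr, cv=5, scoring="neg_mean_squared_error")
    cv_rmse[k] = np.sqrt(-scores.mean())

# Refit with the best number of components and evaluate once on the test set.
best_k = min(cv_rmse, key=cv_rmse.get)
final_pcr = make_pipeline(StandardScaler(), PCA(n_components=best_k), LinearRegression())
final_pcr.fit(X_tr, y_tr)
test_rmse = np.sqrt(mean_squared_error(y_te, final_pcr.predict(X_te)))
print(best_k, round(test_rmse, 2))
```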

In [None]:
print("X_train:",X_train_copy.shape)
print("y_train:",y_train_copy.shape)
print("X_test:",X_test_copy.shape)
print("y_test:",y_test_copy.shape)

### Model with Sklearn

In [None]:
def final_model(X_reduced_train, y_train, X_reduced_test, y_test, X_train, X_test):
    
    #Setting up final models with the best values
    pcr_final_model = LinearRegression().fit(X_reduced_train[:,0:6],y_train)
    pls_final_model = PLSRegression(n_components = 7).fit(X_train, y_train)
    multi_linear_final_model = LinearRegression().fit(X_train, y_train)
    
    #Forecasting operations with final models.
    y_pred_pcr = pcr_final_model.predict(X_reduced_test[:,0:6])
    y_pred_pls = pls_final_model.predict(X_test)
        
    print("tuned RMSE of pcr model:",np.sqrt(mean_squared_error(y_test, y_pred_pcr)))
    print("tuned RMSE of multi linear regression model:",np.sqrt(-cross_val_score(multi_linear_final_model, X_test, y_test, cv = 10, scoring = "neg_mean_squared_error")).mean())
    print("tuned RMSE of pls model:",np.sqrt(mean_squared_error(y_test, y_pred_pls)))    

In [None]:
def model_tuning(X_reduced_train, x_train, y_train):
    cv_10 = model_selection.KFold(n_splits =10, shuffle = True, random_state = 1)
    
    lm = LinearRegression()
    RMSE_pcr = []
    RMSE_pls = []

    #The best parameters are found with cross validation.
    for i in np.arange(1, X_reduced_train.shape[1] + 1):    
        score1 = np.sqrt(-1*cross_val_score(lm, X_reduced_train[:,:i], y_train.values.ravel(), cv = cv_10, scoring = "neg_mean_squared_error").mean())
        RMSE_pcr.append(score1)
    
    #The best parameters are found with cross validation.
    for i in np.arange(1, x_train.shape[1] + 1): # use the function argument, not the global X_train
        pls = PLSRegression(n_components=i)
        score2 = np.sqrt(-1*cross_val_score(pls,  x_train, y_train, cv = cv_10, scoring = "neg_mean_squared_error").mean())
        RMSE_pls.append(score2)
    
    
    fig, axs = plt.subplots(2,figsize=(10,10))
    fig.suptitle('PCR / PLS Model Tuning For Price Prediction Model')
    axs[0].plot(RMSE_pcr, '-v')
    axs[1].plot(np.arange(1, x_train.shape[1]+1), np.array(RMSE_pls), '-v', c = "r")
    axs[0].set_xlabel('Number of components')
    axs[0].set_ylabel('RMSE')
    axs[1].set_xlabel('Number of components')
    axs[1].set_ylabel('RMSE')

In [None]:
def model_predict(x_train ,y_train, x_test, y_test):
    pca = PCA()
    X_reduced_train = pca.fit_transform(scale(x_train)) # Fit PCA on the scaled training features.
    X_reduced_test = pca.transform(scale(x_test)) # Project the scaled test features with the PCA fitted on train (no refit).
    
    
    #Building a models
    pcr_model = LinearRegression().fit(X_reduced_train, y_train)
    reg_model = LinearRegression().fit(x_train, y_train)
    pls_model = PLSRegression(n_components = 2).fit(x_train, y_train)
    
    #Predicted operations from created models.
    y1_pred = pcr_model.predict(X_reduced_test)
    y2_pred = reg_model.predict(x_test)
    y3_pred = pls_model.predict(x_test)
    
    print("baseline RMSE of pcr model:", np.sqrt(mean_squared_error(y_test, y1_pred)))
    print("baseline RMSE of multiple linear regression model:", np.sqrt(mean_squared_error(y_test, y2_pred)))
    print("baseline RMSE of pls model:", np.sqrt(mean_squared_error(y_test, y3_pred)))
    print("----------------------------------------------------------------------------------")
    
    model_tuning(X_reduced_train, x_train, y_train)
    final_model(X_reduced_train, y_train, X_reduced_test, y_test, x_train, x_test) # pass the arguments, not the uncapped globals

In [None]:
model_predict(X_train_copy, y_train_copy, X_test_copy, y_test_copy)

* Ridge Regression
* Lasso Regression
* ElasticNet Regression

## Ridge Model

The aim is to find the coefficients that minimize the sum of squared errors plus a penalty on the size of these coefficients.

The Ridge model adds an extra term to the loss, known as the penalty term. The λ in this term is specified by the alpha parameter of the Ridge function, so the penalty is controlled by changing alpha. The higher the alpha, the greater the penalty and the smaller the coefficients.

Important points:
* Shrinks the coefficients, so it is mostly used against multicollinearity.
* It is resistant to overfitting.
* Reduces model complexity through coefficient shrinkage.
* Uses the L2 regularization technique.
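
The shrinkage effect can be checked on synthetic data. In scikit-learn, `Ridge` minimizes `||y - Xw||^2 + alpha * ||w||^2`, so the norm of the fitted coefficient vector should fall as alpha grows. The data and the alpha grid below are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression problem standing in for the real features.
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    # L2 norm of the coefficient vector for this penalty strength.
    coef_norms.append(np.linalg.norm(model.coef_))

for alpha, norm in zip(alphas, coef_norms):
    print(alpha, round(norm, 3))
```

The coefficients shrink toward zero but do not become exactly zero, which is why Ridge keeps every variable in the model.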

**Model/Estimation**

In [None]:
# A high alpha value means a stronger constraint on the coefficients. Here we try 100 alpha values spaced logarithmically between 5e9 and 5e-3.
#alfa = [0.00005,0.5,10] 
alfa = 10**np.linspace(10,-2,100)*0.5 
Coef_=[]

In [None]:
print('---------------')
ridge_model=Ridge()
for güncelalfa in alfa:
    ridge_model.set_params(alpha=güncelalfa)
    ridge_model.fit(X_train_copy,y_train_copy)
    y_pred = ridge_model.predict(X_test_copy)
    Coef_.append(ridge_model.coef_)
    mse = mean_squared_error(y_test_copy, y_pred) # test MSE as a scalar
    print('Train score of the Ridge regression model with alpha ' + str(güncelalfa) + ': ',ridge_model.score(X_train_copy,y_train_copy))
    print('Test score of the Ridge regression model with alpha ' + str(güncelalfa) + ': ',ridge_model.score(X_test_copy,y_test_copy))
    print('Number of features used: ',np.sum(ridge_model.coef_!=0))
    print('Test error (MSE): ', mse)
    print('\n')    

In [None]:
# Coef_
print(Coef_)

**Model Tuning**

In [None]:
# Setting up the RidgeCV model to find the optimal lambda.
alfa[0:5]
ridge_cv=RidgeCV(alphas=alfa,scoring="neg_mean_squared_error",normalize=True)

In [None]:
ridge_cv.fit(X_train_copy,y_train_copy)

In [None]:
# Finding the Optimal lambda.
ridge_cv.alpha_

In [None]:
# Setting up Ridge regression model with optimal lambda value
ridge_tuned=Ridge(alpha=ridge_cv.alpha_,normalize=True).fit(X_train_copy,y_train_copy)

In [None]:
# Root mean squared error (RMSE) on the test set
rmse = np.sqrt(mean_squared_error(y_test_copy,ridge_tuned.predict(X_test_copy)))
rmse
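
The `normalize=True` argument used in the tuning cells was removed in scikit-learn 1.2. On such versions, one commonly used substitute is to standardize the features inside a pipeline. This is a sketch on synthetic data and is not numerically identical to the old option, which rescaled by the L2 norm rather than the standard deviation. The alpha grid and variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data.
X, y = make_regression(n_samples=200, n_features=7, noise=10.0, random_state=0)

# Candidate penalties on a log scale (an assumed grid for this illustration).
alphas = 10 ** np.linspace(3, -2, 50)

# Standardize first, then let RidgeCV pick alpha by cross-validation.
ridge_cv_pipe = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=alphas, scoring="neg_mean_squared_error"),
)
ridge_cv_pipe.fit(X, y)

best_alpha = ridge_cv_pipe.named_steps["ridgecv"].alpha_
print(best_alpha)
```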

## Lasso Model

The aim is to find the coefficients that minimize the sum of squared errors plus a penalty on these coefficients.

* It was proposed to address the disadvantage that Ridge regression keeps all variables, relevant or not, in the model.
* Lasso shrinks the coefficients toward zero.
* The L1 penalty sets some coefficients exactly to zero when lambda is large enough, so it eliminates variables.
* Neither Ridge nor Lasso is uniformly superior to the other.
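
The variable-elimination behaviour can be observed by counting non-zero coefficients for a few penalty strengths. The sketch uses synthetic data with three informative features out of ten. The data and the alpha values are assumptions for the illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic problem: only 3 of the 10 features actually drive the target.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

alphas = [0.01, 1.0, 10.0, 1000.0]
nonzero_counts = []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    # Number of features the model still uses at this penalty strength.
    nonzero_counts.append(int(np.sum(model.coef_ != 0)))

print(dict(zip(alphas, nonzero_counts)))
```

With a very large alpha the penalty dominates and every coefficient is set to zero, leaving only the intercept.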

**Model/Estimation**

In [None]:
print('---------------')
Coef_=[] # reset so the Lasso coefficients are not mixed with the Ridge ones
lasso_model=Lasso()
for güncelalfa in alfa:
    lasso_model.set_params(alpha=güncelalfa)
    lasso_model.fit(X_train_copy,y_train_copy)
    y_pred = lasso_model.predict(X_test_copy)
    Coef_.append(lasso_model.coef_)
    mse= np.sqrt(mean_squared_error(y_test_copy,y_pred)) # RMSE despite the variable name
    print('Train score of the Lasso regression model with alpha ' + str(güncelalfa) + ': ',lasso_model.score(X_train_copy,y_train_copy))
    print('Test score of the Lasso regression model with alpha ' + str(güncelalfa) + ': ',lasso_model.score(X_test_copy,y_test_copy))
    print('Number of features used: ',np.sum(lasso_model.coef_!=0))
    print('Test error (RMSE): ', mse)
    print('\n')

In [None]:
# Coef_
print(Coef_)

**Model Tuning**

In [None]:
# Setting up the LassoCV model to find the optimal lambda.
lasso_cv_model=LassoCV(alphas=None,cv=10,max_iter=10000,normalize=True).fit(X_train_copy,y_train_copy)

In [None]:
# Finding the Optimal lambda.
lasso_cv_model.alpha_

In [None]:
# Setting up Lasso regression model with optimal lambda value
lasso_tuned=Lasso(alpha=lasso_cv_model.alpha_).fit(X_train_copy,y_train_copy)

In [None]:
# Root mean squared error (RMSE) on the test set
rmse = np.sqrt(mean_squared_error(y_test_copy,lasso_tuned.predict(X_test_copy)))
rmse

## ElasticNet Model

The aim is to find the coefficients that minimize the sum of error squares by applying a penalty score to these coefficients. ElasticNet combines L1 and L2 approaches.
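
In scikit-learn, the mix between the two penalties is controlled by `l1_ratio`: the objective is `1/(2n) * ||y - Xw||^2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2`. The sketch below fits three mixes on synthetic data and counts the non-zero coefficients. The data and the parameter values are assumptions for the illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Synthetic problem: only 3 of the 10 features actually drive the target.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# l1_ratio close to 0 behaves more like Ridge, close to 1 more like Lasso.
nonzero_by_ratio = {}
for l1_ratio in [0.1, 0.5, 0.9]:
    model = ElasticNet(alpha=1.0, l1_ratio=l1_ratio, max_iter=10000).fit(X, y)
    nonzero_by_ratio[l1_ratio] = int(np.sum(model.coef_ != 0))

print(nonzero_by_ratio)
```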

**Model/Estimation**

In [None]:
elastikNet_model=ElasticNet().fit(X_train_copy,y_train_copy)
y_pred = elastikNet_model.predict(X_test_copy)
mse= np.sqrt(mean_squared_error(y_test_copy,y_pred)) # RMSE despite the variable name
r2=r2_score(y_test_copy,y_pred)
# Report the alpha the model actually used (the ElasticNet default), not the last value of the earlier loops.
print('Train score of the ElasticNet regression model with alpha ' + str(elastikNet_model.alpha) + ': ',elastikNet_model.score(X_train_copy,y_train_copy))
print('Test score of the ElasticNet regression model with alpha ' + str(elastikNet_model.alpha) + ': ',elastikNet_model.score(X_test_copy,y_test_copy))
print('Number of features used: ',np.sum(elastikNet_model.coef_!=0))
print('Test error (RMSE): ', mse)
print('Test R2 score: ', r2)
print('\n')

In [None]:
# Coef_
elastikNet_model.coef_

**Model Tuning**

In [None]:
# Setting up the ElasticNetCV model to find the optimal lambda.
elastikNet_cv_model=ElasticNetCV(cv=10,random_state=0).fit(X_train_copy,y_train_copy)

In [None]:
# Finding the Optimal lambda.
elastikNet_cv_model.alpha_

In [None]:
# Setting up Elastic Net regression model with optimal lambda value
elastikNet_tuned=ElasticNet(alpha=elastikNet_cv_model.alpha_).fit(X_train_copy,y_train_copy)

In [None]:
# Root mean squared error (RMSE) on the test set
rmse = np.sqrt(mean_squared_error(y_test_copy,elastikNet_tuned.predict(X_test_copy)))
rmse