The data set describes the sale of individual residential property in Ames, Iowa
from 2006 to 2010. The data set contains 2930 observations and a large number of explanatory
variables (23 nominal, 23 ordinal, 14 discrete, and 20 continuous) involved in assessing home
values.

In this notebook we will explore the Ames housing data set. We will focus on:
1. Removing outliers 
2. Dealing with missing data
3. Building and assessing the model

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt 
import seaborn as sns

## Setting max displayed rows to 500, in order to display the full output of any command 
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [None]:
# read the data 
df = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")

In [None]:
df.head()

In [None]:
df.describe()

### 1. Checking for outliers
The following example shows why outliers are dangerous: they strongly affect the mean and the standard deviation, and therefore the estimated coefficients of the model.

| | Data without outlier | Data with outlier |
|--|--|--|
| **Data** | 1, 2, 3, 3, 4, 5, 4 | 1, 2, 3, 3, 4, 5, **400** |
| **Mean** | 3.143 | **59.714** |
| **Median** | 3 | 3 |
| **Standard deviation** | 1.345 | **150.057** |
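
The figures in the table can be reproduced with Python's standard `statistics` module. This is a small sketch; the helper name `summarize` is illustrative, and `stdev` is the sample standard deviation (n - 1 in the denominator), which is what the table values correspond to.

```python
import statistics

clean = [1, 2, 3, 3, 4, 5, 4]
with_outlier = [1, 2, 3, 3, 4, 5, 400]

def summarize(values):
    """Return mean, median and sample standard deviation of a list."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample std (n - 1 in the denominator)
    }

clean_stats = summarize(clean)
outlier_stats = summarize(with_outlier)

for name, stats in [("without outlier", clean_stats), ("with outlier", outlier_stats)]:
    print(name, {key: round(val, 3) for key, val in stats.items()})
```

A single extreme value moves the mean and the standard deviation a lot, while the median does not move at all.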

To see outliers visually, we need a box plot or a scatter plot.
Therefore, let's find the features most correlated with the sale price and plot them against it.

In [None]:
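# correlations are only defined for numeric columns, so select those first
df.select_dtypes(include = "number").corr()["SalePrice"].sort_values(ascending = False)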

In [None]:
plt.figure(figsize = (8, 4), dpi = 100)
sns.scatterplot(data = df, x = "OverallQual", y = "SalePrice");

As we can see, there are some points with very high quality (10/10) but a very low price. Let's explore other features that are highly correlated with the sale price.

In [None]:
plt.figure(figsize = (8, 4), dpi = 100)
sns.scatterplot(data = df, x = "GrLivArea", y = "SalePrice");

In [None]:
plt.figure(figsize = (8, 4), dpi = 100)
sns.scatterplot(data = df, x = "TotalBsmtSF", y = "SalePrice");

The points with both a very high price and a very large living area (in the top-right corner) are not outliers. They make sense because they follow the trend, so they will not hurt our model.

On the other hand, the 3 points in the lower-right corner have a very large living area but a very low price. They are very likely outliers because they do not follow the general trend.



#### Let's now check those points closely

In [None]:
df[(df["SalePrice"] < 200000) & (df["OverallQual"] > 8)]

In [None]:
df[(df["SalePrice"] < 200000) & (df["OverallQual"] > 8) & (df["GrLivArea"] > 4000)]

In [None]:
drop_index = df[(df["SalePrice"] < 200000) & (df["OverallQual"] > 8) & (df["GrLivArea"] > 4000)].index

In [None]:
df = df.drop(drop_index, axis = 0)

#### Let's now repeat one of the scatter plots that we had before

In [None]:
plt.figure(figsize = (8, 4), dpi = 100)
sns.scatterplot(data = df, x = "GrLivArea", y = "SalePrice");

### 2. Dealing with missing data

In [None]:
df.head()

Id is just an identifier; it carries no numeric information for the model. We can set it as the index or drop it. Dropping it causes no problems, because we still have the default index (0, 1, 2, 3, ...).

In [None]:
df = df.drop("Id", axis = 1)

In [None]:
df.info()

In [None]:
## Let's create a function that can be reused on any future data
def percent_missing_data(df):
    """Return the count and percentage of missing values per column, largest first."""
    missing_count = df.isna().sum().sort_values(ascending = False)
    missing_count = missing_count[missing_count > 0]  # keep only columns with gaps
    missing_percent = 100 * missing_count / len(df)
    missing_table = pd.concat([missing_count, missing_percent], axis = 1)
    missing_table.columns = ["missing_count", "missing_percent"]

    return missing_table

In [None]:
percent_nan = percent_missing_data(df)
percent_nan

In [None]:
plt.figure(figsize = (8,4), dpi = 100)
sns.barplot(x = percent_nan.index, y = percent_nan.values[:,1])
plt.xticks(rotation = 90)
plt.show()

In principle we should go through each feature and decide whether to keep it, fill it or drop it. When we speak about dropping, we can drop either columns or rows.

For example, PoolQC values are missing for 99.6 percent of houses. This might be because:
1. These houses have no pool, and the NaN should have been recorded as zero or "none".
2. These houses have pools, but the data is actually missing.
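
One way to tell the two cases apart is to compare the quality column with the matching size column. The sketch below uses a small made-up frame (the values are invented for illustration), and it assumes that the `PoolArea` column records 0 for houses without a pool.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; column names mirror the data set
toy = pd.DataFrame({
    "PoolArea": [0, 0, 512, 0, 648],
    "PoolQC": [np.nan, np.nan, "Ex", np.nan, np.nan],
})

missing_qc = toy["PoolQC"].isna()

# Case 1: quality is missing because there is no pool at all
no_pool = missing_qc & (toy["PoolArea"] == 0)

# Case 2: a pool exists but its quality was never recorded
truly_missing = missing_qc & (toy["PoolArea"] > 0)

print("no pool:", no_pool.sum())
print("pool without quality:", truly_missing.sum())
```

Rows that fall in the first group can safely be filled with a "no pool" label, while rows in the second group are genuine gaps.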

We should go back to the description file and try to understand it better. But for now, let's deal with the columns that have very few missing values.

In [None]:
## Let's see the features that have less than one percent missing
plt.figure(figsize = (8,4), dpi = 100)
sns.barplot(x = percent_nan.index, y = percent_nan.values[:,1])
plt.xticks(rotation = 90)
plt.ylim(0,1)
plt.show()

Let's now look at these rows; there might be houses with missing values across all of these features.

In [None]:
percent_nan[percent_nan["missing_percent"] < 1]

In [None]:
index = percent_nan[percent_nan["missing_percent"] < 1].index
for name in index:
    print(df[df["Electrical"].isnull()][name])

In [None]:
df[df["GarageType"].isnull()]["GarageFinish"]

In [None]:
df = df.dropna(axis = 0, subset = ["GarageType"])

In [None]:
percent_nan = percent_missing_data(df)
percent_nan

In [None]:
df[df["BsmtFinType1"].isnull()]

It seems that all basement-related features have a high number of missing values. If we go back to the data description, we find that NaN actually means that the house does not have a basement. The value is not missing; the house simply has no basement. Therefore, it makes sense to replace NaN with a string saying that the house has no basement. This works for the basement string columns; in the basement numeric columns we will replace NaN with zero.

In [None]:
for col in df.columns:
    if "Bsmt" in col:
        print(col)

In [None]:
## basement numeric features ==> fillna 0
bsmt_num_cols = ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath']
df[bsmt_num_cols] = df[bsmt_num_cols].fillna(0)

## basement string features ==> fillna none
bsmt_str_cols =  ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']
df[bsmt_str_cols] = df[bsmt_str_cols].fillna('None')

In [None]:
# now if you check again, you will find no nulls
df[df["BsmtFinSF1"].isnull()]

In [None]:
percent_nan = percent_missing_data(df)
percent_nan

Electrical still has 1 missing value. Let's look at it closely and decide what to do.

In [None]:
df[df["Electrical"].isnull()]

In [None]:
# We can either fill it with the mode or drop the row; here we drop it
df = df.dropna(axis = 0, subset = ["Electrical"])
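
For comparison, the mode-filling alternative mentioned in the comment could look like the sketch below. It runs on a made-up series rather than the real column, so the category labels are only illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a categorical column with one missing entry
electrical = pd.Series(["SBrkr", "SBrkr", "FuseA", np.nan, "SBrkr"])

# mode() returns a Series (ties are possible), so take the first entry
most_common = electrical.mode()[0]
filled = electrical.fillna(most_common)

print(filled.tolist())
```

With a single missing row out of more than a thousand, either choice is unlikely to change the model noticeably.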


In [None]:
percent_nan = percent_missing_data(df)
percent_nan

Both "MasVnrArea" and "MasVnrType" have less than 1 percent of null values. How should we deal with them?

Going back to the data description, we find that there is a category for none: the house has no masonry veneer. We can assume that those missing values are also "none" but were mistakenly recorded as NaN.

In [None]:
df[["MasVnrArea"]] = df[["MasVnrArea"]].fillna(0)
df[["MasVnrType"]] = df[["MasVnrType"]].fillna("None")

In [None]:
percent_nan = percent_missing_data(df)
percent_nan

#### What to do with the rest?
The rest of the features have more than 1% missing data. We need to look carefully at each one and decide how to deal with it. Dropping rows is no longer a sensible strategy, because we would lose too much data. That leaves two options:
1. Fill in the missing values
2. Drop the feature column

In [None]:
plt.figure(figsize = (8,4), dpi = 100)
sns.barplot(x = percent_nan.index, y = percent_nan.values[:,1])
plt.xticks(rotation = 90)
plt.show()

Several of the above features are missing for most of the houses (Pool QC for more than 99 percent of them). Dropping these features can be the best strategy.

In [None]:
df = df.drop(["PoolQC", "MiscFeature", "Alley", "Fence"], axis = 1)

In [None]:
percent_nan = percent_missing_data(df)
percent_nan

In [None]:
plt.figure(figsize = (8,4), dpi = 100)
sns.barplot(x = percent_nan.index, y = percent_nan.values[:,1])
plt.xticks(rotation = 90)
plt.show()

Now we are left with just two columns. We have to be careful and think this through, because we cannot simply drop the rows or the feature columns: too little is missing to justify dropping the feature, but too much to drop the rows.

In [None]:
df["FireplaceQu"].value_counts()

Since it is a categorical variable we can fill missing data with "None"

In [None]:
df["FireplaceQu"] = df["FireplaceQu"].fillna("None")

In [None]:
percent_nan = percent_missing_data(df)
percent_nan

In [None]:
df["LotFrontage"].value_counts()

This one is tricky because it is numeric: we can no longer go back to the description and fill it with a convenient text label.
We will use the Neighborhood feature to estimate the missing values.

Neighborhood: Physical locations within Ames city limits

LotFrontage: Linear feet of street connected to property

We will operate under the assumption that the Lot Frontage is related to what neighborhood a house is in.

In [None]:
plt.figure(figsize = (8, 4), dpi = 100)
sns.boxplot(x = "Neighborhood", y = "LotFrontage", data = df)
plt.xticks(rotation = 90)
plt.show()

As we can see each category is unique enough to make the assumption that we can impute the LotFrontage based on Neighborhood categories. 

In [None]:
df.groupby("Neighborhood")["LotFrontage"].mean()

To achieve the intended result, we will use the pandas transform method. It is called on the group-by object and fills in the missing values within each group.

In [None]:
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(lambda value: value.fillna(value.mean()))
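
The same pattern is easier to follow on a tiny example. The frame below is invented for illustration; only the column names are borrowed from the data set.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "LotFrontage": [60.0, 80.0, np.nan, 100.0, np.nan],
})

# transform returns a result aligned with the original rows, so it can be
# assigned straight back to the column
toy["LotFrontage"] = (
    toy.groupby("Neighborhood")["LotFrontage"]
       .transform(lambda values: values.fillna(values.mean()))
)

print(toy)
```

Each missing value is replaced by the mean of its own neighborhood, not by the global mean. If a group contained only missing values, its mean would be NaN and the gap would remain.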

In [None]:
percent_nan = percent_missing_data(df)
percent_nan

**Yeah! Congratulations! We did it. Nothing is missing any more!**

Let's now move on to encoding. Essentially, we will use one-hot encoding for variables of type "object". However, there is one variable that looks numeric but is in fact categorical: "MSSubClass".

If we go back to data description we will find that:

MSSubClass: Identifies the type of dwelling involved in the sale.

        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES
        80	SPLIT OR MULTI-LEVEL
        85	SPLIT FOYER
        90	DUPLEX - ALL STYLES AND AGES
       120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150	1-1/2 STORY PUD - ALL AGES
       160	2-STORY PUD - 1946 & NEWER
       180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190	2 FAMILY CONVERSION - ALL STYLES AND AGES

These numbers have no ordinal meaning, so we should convert the variable from integer to text.
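
The conversion matters because `pd.get_dummies` only encodes text-like and categorical columns by default. The sketch below, on a made-up frame, shows the difference.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({"MSSubClass": [20, 60, 20, 120]})

# As integers the column is passed through untouched
as_number = pd.get_dummies(toy)

# As strings each code becomes its own indicator column
as_text = pd.get_dummies(toy.astype({"MSSubClass": str}))

print(as_number.columns.tolist())
print(as_text.columns.tolist())
```

Left as integers, the codes would be treated as a quantity, as if class 120 were six times class 20.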

In [None]:
df["MSSubClass"].dtypes

In [None]:
df["MSSubClass"] = df["MSSubClass"].apply(str)

In [None]:
df["MSSubClass"].dtypes

In [None]:
#Select all Object Features 
df.select_dtypes(include = "object")

In [None]:
df_object = df.select_dtypes(include = "object")
df_numeric = df.select_dtypes(exclude = "object")

In [None]:
df_object_dummies = pd.get_dummies(df_object, drop_first = True)
df_object_dummies

Note that we will not examine the coefficient (coef_) of every single feature. With regularization, we will be able to shrink or drop unimportant features.

Let's now concatenate the two data frames.

In [None]:
df_final = pd.concat([df_numeric, df_object_dummies], axis = 1)
df_final.head()

In [None]:
print(df_final.shape)

In [None]:
corr = abs(df_final.corr()["SalePrice"]).sort_values(ascending = False)
large_corr = corr[corr > 0.3]

plt.figure(figsize = (10, 4), dpi = 100)
sns.barplot(x = large_corr.index, y = large_corr.values)
plt.xticks(rotation = 90)
plt.show()

Overall Quality is the feature most strongly correlated with the sale price, so it is likely to be the most important feature for our model. But there is a problem: it is most likely produced by human judgment, so a deployed model will depend on having someone who rates the quality of each house and feeds that rating to the model.

Let's now proceed to model building and evaluation.

### 3. Model Building and evaluation 

#### Train | Test Split Procedure 

1. Split Data in Train/Test for both X and y
2. Fit/Train Scaler on Training X Data
3. Scale X Test Data
4. Create Model
5. Fit/Train Model on X Train Data
6. Evaluate Model on X Test Data (by creating predictions and comparing to Y_test)
7. Adjust Parameters as Necessary and repeat steps 5 and 6

In [None]:
# Split the data for X and y
X = df_final.drop("SalePrice", axis = 1)
y = df_final["SalePrice"]

In [None]:
# Train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

In [None]:
# scaling the X data 
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train) # only fit to training data to avoid data leakage
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
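
To see why the scaler is fit on the training split only, the sketch below standardizes by hand with NumPy on made-up numbers. It mirrors what `StandardScaler` does by default (subtract the mean, divide by the population standard deviation), but it is a manual illustration, not the scikit-learn implementation.

```python
import numpy as np

# Toy feature column, invented for illustration
X_train_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test_toy = np.array([[10.0]])

# "Fit": the statistics come from the training rows only
train_mean = X_train_toy.mean(axis=0)
train_std = X_train_toy.std(axis=0)  # population std (ddof=0)

# "Transform": the same statistics are applied to both splits
X_train_scaled = (X_train_toy - train_mean) / train_std
X_test_scaled = (X_test_toy - train_mean) / train_std

print(X_train_scaled.ravel())
print(X_test_scaled.ravel())
```

The scaled test value lands far outside the training range, which is expected: the test data had no say in the statistics, just as unseen data will not at prediction time.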

In [None]:
# Create the Ridge model
from sklearn.linear_model import Ridge
ridge1 = Ridge(alpha = 100)
ridge1.fit(X_train, y_train)

# testing the model
from sklearn.metrics import mean_absolute_error
y_predict = ridge1.predict(X_test)
mean_absolute_error(y_test, y_predict)

Disadvantages of the classic train/test split:
1. Finding the right parameter by hand is quite tedious.
2. It is not the fairest evaluation, because we adjusted the parameters to perform better on that specific test data.

Therefore it is useful to hold some data aside. The model has never been adjusted to this data, so the metrics computed on it reflect the true performance.

#### Train | Validation | Test Split Procedure 

This is often also called a "hold-out" set, since you should not adjust parameters based on the final test set, but instead use it *only* for reporting final expected performance.

0. Clean and adjust data as necessary for X and y
1. Split Data in Train/Validation/Test for both X and y
2. Fit/Train Scaler on Training X Data
3. Scale X Eval Data
4. Create Model
5. Fit/Train Model on X Train Data
6. Evaluate Model on X Evaluation Data (by creating predictions and comparing to Y_eval)
7. Adjust Parameters as Necessary and repeat steps 5 and 6
8. Get final metrics on Test set (not allowed to go back and adjust after this!)

In [None]:
# first split
from sklearn.model_selection import train_test_split
X_train, X_other, y_train, y_other = train_test_split(X, y, test_size=0.3, random_state=101)

# second split: 50% of 30% = 15% of all data 
X_eval, X_test, y_eval, y_test = train_test_split(X_other, y_other, test_size=0.5, random_state=101)
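
The two-step split can be checked on synthetic data. The sketch below uses 1000 made-up rows, and the variable names are chosen so that they do not clash with the notebook's own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows, invented for illustration
X_toy = np.arange(2000).reshape(1000, 2)
y_toy = np.arange(1000)

# First split: 70% train, 30% held back
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=101
)

# Second split: the held-back 30% is halved into validation and test
X_val, X_te, y_val, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=101
)

print(len(X_tr), len(X_val), len(X_te))
```

The result is a 70/15/15 split in which every row ends up in exactly one of the three sets.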

In [None]:
# scaling the X data 
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train) # only fit to training data to avoid data leakage

X_train = scaler.transform(X_train)
X_eval = scaler.transform(X_eval)
X_test = scaler.transform(X_test)

In [None]:
# Create the Ridge model
from sklearn.linear_model import Ridge
ridge1 = Ridge(alpha = 100)
ridge1.fit(X_train, y_train)

# testing the model
from sklearn.metrics import mean_absolute_error
y_predict = ridge1.predict(X_eval)
mean_absolute_error(y_eval, y_predict)

In [None]:
# Search for the Ridge alpha with the lowest validation error
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

alpha_list = []
mse_list = []  # note: despite the name, this stores the mean absolute error (MAE)
for alpha_val in np.arange(0.01, 200):
    ridge1 = Ridge(alpha = alpha_val)
    ridge1.fit(X_train, y_train)
    alpha_list.append(alpha_val)

    # evaluate on the validation set
    y_predict = ridge1.predict(X_eval)
    mse = mean_absolute_error(y_eval, y_predict)
    mse_list.append(mse)
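
Note which grid the loop actually visits: `np.arange(0.01, 200)` uses the default step of 1, so it tries 0.01, 1.01, 2.01 and so on. The sketch below inspects that grid and, as an alternative that is not part of the original notebook, builds a logarithmically spaced grid with `np.logspace`, which covers several orders of magnitude with fewer fits.

```python
import numpy as np

# The grid used in the search loop: default step of 1
linear_grid = np.arange(0.01, 200)
print(len(linear_grid), linear_grid[:3], linear_grid[-1])

# Alternative (an assumption, not the notebook's choice):
# 30 values spread evenly on a log scale between 0.01 and 1000
log_grid = np.logspace(-2, 3, num=30)
print(log_grid[0], log_grid[-1])
```

If the best alpha turns out to be the last value of the grid, the search range was probably too narrow.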

In [None]:
alpha_list = pd.DataFrame(alpha_list)
mse_list = pd.DataFrame(mse_list)
alpha_mse = pd.concat([alpha_list, mse_list], axis = 1)
alpha_mse.columns = ["alpha_list", "mse_list"]

In [None]:
alpha_mse[alpha_mse["mse_list"] == alpha_mse["mse_list"].min()]

In [None]:
# Create the Ridge model
from sklearn.linear_model import Ridge
ridge3 = Ridge(alpha = 81.01)
ridge3.fit(X_train, y_train)

# testing the model
from sklearn.metrics import mean_absolute_error
y_predict = ridge3.predict(X_eval)
print(mean_absolute_error(y_eval, y_predict))

y_final_test_predict = ridge3.predict(X_test)
print(mean_absolute_error(y_test, y_final_test_predict))

### Lasso Regression 

In [None]:
# Create the Lasso model
from sklearn.linear_model import Lasso
ls = Lasso(alpha = 100)
ls.fit(X_train, y_train)

# testing the model
from sklearn.metrics import mean_absolute_error
y_predict = ls.predict(X_eval)
print(mean_absolute_error(y_eval, y_predict))

y_final_test_predict = ls.predict(X_test)
print(mean_absolute_error(y_test, y_final_test_predict))

In [None]:
# Search for the Lasso alpha with the lowest validation error
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error

alpha_list = []
mse_list = []  # note: despite the name, this stores the mean absolute error (MAE)
for alpha_val in np.arange(0.01, 200):
    ls = Lasso(alpha = alpha_val)
    ls.fit(X_train, y_train)
    alpha_list.append(alpha_val)

    # evaluate on the validation set
    y_predict = ls.predict(X_eval)
    mse = mean_absolute_error(y_eval, y_predict)
    mse_list.append(mse)

In [None]:
alpha_list = pd.DataFrame(alpha_list)
mse_list = pd.DataFrame(mse_list)
alpha_mse = pd.concat([alpha_list, mse_list], axis = 1)
alpha_mse.columns = ["alpha_list", "mse_list"]

In [None]:
alpha_mse[alpha_mse["mse_list"] == alpha_mse["mse_list"].min()]

In [None]:
# Create the Lasso model with the best alpha from the search above
from sklearn.linear_model import Lasso
ls = Lasso(alpha = 199.01)  # largest value in the grid, so a wider search may do better
ls.fit(X_train, y_train)

# testing the model
from sklearn.metrics import mean_absolute_error
y_predict = ls.predict(X_eval)
print(mean_absolute_error(y_eval, y_predict))

y_final_test_predict = ls.predict(X_test)
print(mean_absolute_error(y_test, y_final_test_predict))

### Elastic Net Model

In [None]:
from sklearn.linear_model import ElasticNetCV
elastic_model = ElasticNetCV(l1_ratio= np.linspace(0.01, 1, 100),tol=0.01)
elastic_model.fit(X_train,y_train)

In [None]:
elastic_model.l1_ratio_

In [None]:
# testing the model
from sklearn.metrics import mean_absolute_error
y_predict = elastic_model.predict(X_eval)
print(mean_absolute_error(y_eval, y_predict))

y_final_test_predict = elastic_model.predict(X_test)
print(mean_absolute_error(y_test, y_final_test_predict))

### Polynomial Regression 

In [None]:
# Import the polynomial converter
from sklearn.preprocessing import PolynomialFeatures
polynomial_converter = PolynomialFeatures(degree=2,include_bias=False)

# Convert the X data: fit on the training set only, then reuse the fitted converter
poly_features_train = polynomial_converter.fit_transform(X_train)
poly_features_eval = polynomial_converter.transform(X_eval)
poly_features_test = polynomial_converter.transform(X_test)

In [None]:
poly_features_train.shape
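
The number of columns grows quickly because a degree-2 expansion without a bias term keeps every original feature and adds every square and every pairwise product. The sketch below enumerates those terms for three hypothetical features and counts them for a wider input; the figure of 250 input features is an arbitrary illustration, not the notebook's actual column count.

```python
from itertools import combinations_with_replacement

def degree2_terms(feature_names):
    """List the linear, squared and pairwise interaction terms (no bias column)."""
    linear = [(name,) for name in feature_names]
    quadratic = list(combinations_with_replacement(feature_names, 2))
    return linear + quadratic

terms = degree2_terms(["a", "b", "c"])
print(len(terms), terms)

def degree2_count(n_features):
    """Number of degree-2 output columns for n_features inputs, without bias."""
    return n_features + n_features * (n_features + 1) // 2

print(degree2_count(3), degree2_count(250))
```

With hundreds of dummy columns as input, the expanded matrix has tens of thousands of columns, which is why a regularized model is used on top of it.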

In [None]:
#import elastic net 
from sklearn.linear_model import ElasticNetCV
elastic_model = ElasticNetCV(l1_ratio= 1,tol=0.01)
elastic_model.fit(poly_features_train,y_train)

In [None]:
# testing the model
from sklearn.metrics import mean_absolute_error
y_predict = elastic_model.predict(poly_features_eval)
print(mean_absolute_error(y_eval, y_predict))

y_final_test_predict = elastic_model.predict(poly_features_test)
print(mean_absolute_error(y_test, y_final_test_predict))

Comparing the errors printed above, the original elastic net model (without polynomial features) is the best one.