The Ames Housing dataset is a well-known dataset for practicing regression models in machine learning. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, the challenge is to predict the final price of each home.

# Step 1: Importing the dataset and necessary libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
df=pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
df

# Step 2: Getting to know the data

In [None]:
df.columns

The dataset ships with a description file that explains every column:

In [None]:
with open('../input/house-prices-advanced-regression-techniques/data_description.txt', 'r') as f:
    print(f.read())

# Step 3: Dealing with the missing data

In [None]:
df.shape

In [None]:
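# numeric_only=True skips the text columns, which recent pandas versions refuse to correlate
df.corr(numeric_only=True)['SalePrice'].sort_values()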

We can see that the overall quality (OverallQual), followed by GrLivArea, has the highest correlation with the sale price! Let's look at these two visually:

In [None]:
sns.scatterplot(data=df, x='OverallQual', y='SalePrice')
plt.axhline(y=200000,color='r')

As you can see, at an overall quality of 10 there are two houses that sold for less than 200,000. That seems unreasonably cheap for a house of quality 10, but let's analyze a bit more to be sure they are outliers.

In [None]:
df[(df['OverallQual']>8) &(df['SalePrice']<200000)][['SalePrice', 'OverallQual']]

In [None]:
sns.scatterplot(x='GrLivArea', y='SalePrice', data=df)
plt.axhline(y=200000, color='r')
plt.axvline(x=4000, color='r')

GrLivArea is the above-grade (ground) living area in square feet. If a house is larger than 4000 square feet and sold for less than 200,000, this is definitely unusual and it may be an outlier.

In [None]:
df[(df['GrLivArea']>4000) & (df['SalePrice']<200000)][['SalePrice', 'GrLivArea']]

These are the same houses as before, so let's remove these two rows from the dataset:

In [None]:
#Remove the outliers:
index_drop=df[(df['GrLivArea']>4000) & (df['SalePrice']<200000)].index
df=df.drop(index_drop, axis=0)
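The two-condition filter used above can be tried on a small hand-made frame to see exactly which rows it removes. This is only a sketch: the `toy` values are invented for illustration and are not taken from the Ames data.

```python
import pandas as pd

# Toy data (made up): two large, cheap houses play the role of the outliers.
toy = pd.DataFrame({
    "GrLivArea": [1500, 2200, 4700, 5600, 4300],
    "SalePrice": [150000, 230000, 185000, 160000, 750000],
})

# Flag rows that are both very large and very cheap.
mask = (toy["GrLivArea"] > 4000) & (toy["SalePrice"] < 200000)

# Keep everything except the flagged rows.
toy_clean = toy.loc[~mask]
print(toy_clean)
```

Note that the large but expensive house stays in the data, because it only meets one of the two conditions.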

### NaN values

Now, let's look at the NaN values.

In [None]:
df.info()

Some columns carry no useful information, for example Id. So let's drop it:

In [None]:
df= df.drop('Id', axis=1)

How much data is missing in each column?

In [None]:
df.isnull().sum()

Let's see what percentage of each column is null:

In [None]:
def missing_percent(df):
    nan_percent= 100*(df.isnull().sum()/len(df))
    nan_percent= nan_percent[nan_percent>0].sort_values()
    return nan_percent

percentage = missing_percent(df)
percentage
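To check what the helper returns, it can be run on a tiny frame where the answer is easy to verify by hand. The sketch below repeats the function so that it runs on its own; the `toy` frame and its values are made up.

```python
import numpy as np
import pandas as pd

def missing_percent(frame):
    # Percentage of null entries per column, keeping only columns that have gaps.
    nan_percent = 100 * frame.isnull().sum() / len(frame)
    return nan_percent[nan_percent > 0].sort_values()

# Toy data (made up): "a" has 1 of 4 missing, "b" has 2 of 4, "c" is complete.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, "x", "y"],
    "c": [1, 2, 3, 4],
})

result = missing_percent(toy)
print(result)
```

Complete columns such as `c` are left out of the result, which is why an empty result means that no missing data remains.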

Let's visualize the percentage of missing data on a graph:

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(x=percentage.index, y=percentage)
plt.xticks(rotation=90)

And now let's look at the ones with less than 1%.

### Less than 1%: Electrical, MasVnrType, MasVnrArea

In [None]:
percentage[percentage<1]

In [None]:
df[df['Electrical'].isnull()][['Electrical']]

If we look at the data information:

Electrical: Electrical system

       SBrkr	Standard Circuit Breakers & Romex
       FuseA	Fuse Box over 60 AMP and all Romex wiring (Average)	
       FuseF	60 AMP Fuse Box and mostly Romex wiring (Fair)
       FuseP	60 AMP Fuse Box and mostly knob & tube wiring (poor)
       Mix	Mixed

Only one row is affected, so we will drop it.

In [None]:
df[df['MasVnrType'].isnull()][['MasVnrType']]

If we look at the data information we can see:

MasVnrType: Masonry veneer type

       BrkCmn	Brick Common
       BrkFace	Brick Face
       CBlock	Cinder Block
       None	None
       Stone	Stone

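So a null value here most likely means the house has no masonry veneer rather than a genuinely missing entry, and it could be replaced with None. Because fewer than 1% of the rows are affected, we simply drop them below together with the other low-rate columns.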

In [None]:
df[df['MasVnrArea'].isnull()][['MasVnrArea']]

If we look at the data information we can see:

MasVnrArea: Masonry veneer area in square feet

The rate of missing values here is clearly low, so we drop these rows as well.

In [None]:
df= df.dropna(axis=0, subset=['Electrical', 'MasVnrType', 'MasVnrArea'])

In [None]:
missing_percent(df)

### Basement data: BsmtQual, BsmtCond, BsmtFinType1, BsmtExposure, BsmtFinType2

Let's look at the information about the basement:

BsmtQual: Evaluates the height of the basement

       Ex	Excellent (100+ inches)	
       Gd	Good (90-99 inches)
       TA	Typical (80-89 inches)
       Fa	Fair (70-79 inches)
       Po	Poor (<70 inches)
       NA	No Basement
		
BsmtCond: Evaluates the general condition of the basement

       Ex	Excellent
       Gd	Good
       TA	Typical - slight dampness allowed
       Fa	Fair - dampness or some cracking or settling
       Po	Poor - Severe cracking, settling, or wetness
       NA	No Basement
	
BsmtFinType1: Rating of basement finished area

       GLQ	Good Living Quarters
       ALQ	Average Living Quarters
       BLQ	Below Average Living Quarters	
       Rec	Average Rec Room
       LwQ	Low Quality
       Unf	Unfinished
       NA	No Basement
		
BsmtExposure: Refers to walkout or garden level walls

       Gd	Good Exposure
       Av	Average Exposure (split levels or foyers typically score average or above)	
       Mn	Minimum Exposure
       No	No Exposure
       NA	No Basement
		
BsmtFinType2: Rating of basement finished area (if multiple types)

       GLQ	Good Living Quarters
       ALQ	Average Living Quarters
       BLQ	Below Average Living Quarters	
       Rec	Average Rec Room
       LwQ	Low Quality
       Unf	Unfinished
       NA	No Basement

So the NA values are not missing data; the house simply has no basement. We have to change them to None.

In [None]:
df['BsmtQual']= df['BsmtQual'].fillna('None')
df['BsmtCond']= df['BsmtCond'].fillna('None')
df['BsmtFinType1']= df['BsmtFinType1'].fillna('None')
df['BsmtExposure']= df['BsmtExposure'].fillna('None')
df['BsmtFinType2']= df['BsmtFinType2'].fillna('None')

In [None]:
missing_percent(df)
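The repeated `fillna` calls can also be written as a single assignment over a list of columns. The sketch below shows this on a made-up frame; `toy` and `bsmt_cat_cols` are illustrative names, and only two of the basement columns are included to keep it short.

```python
import numpy as np
import pandas as pd

# Toy data (made up): the second house has no basement.
toy = pd.DataFrame({
    "BsmtQual": ["Gd", np.nan, "TA"],
    "BsmtCond": ["TA", np.nan, "Gd"],
    "TotalBsmtSF": [850.0, 0.0, 1200.0],
})

# Categorical basement columns where NaN means "no basement".
bsmt_cat_cols = ["BsmtQual", "BsmtCond"]

# Fill all listed columns in one step with the string 'None'.
toy[bsmt_cat_cols] = toy[bsmt_cat_cols].fillna("None")
print(toy)
```

Keeping the column names in one list makes it easier to apply exactly the same treatment to another frame later.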

### Garage data: GarageCond, GarageQual, GarageFinish, GarageType, GarageYrBlt

Let's look at the information about the garage:

GarageType: Garage location
		
       2Types	More than one type of garage
       Attchd	Attached to home
       Basment	Basement Garage
       BuiltIn	Built-In (Garage part of house - typically has room above garage)
       CarPort	Car Port
       Detchd	Detached from home
       NA	No Garage
		
GarageYrBlt: Year garage was built
		
GarageFinish: Interior finish of the garage

       Fin	Finished
       RFn	Rough Finished	
       Unf	Unfinished
       NA	No Garage
		
GarageQual: Garage quality

       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor
       NA	No Garage
		
GarageCond: Garage condition

       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor
       NA	No Garage

So the NA values are not missing data; the house simply has no garage. We change them to None, and set the numeric GarageYrBlt to 0.

In [None]:
df['GarageType']= df['GarageType'].fillna('None')
df['GarageYrBlt']= df['GarageYrBlt'].fillna(0)
df['GarageFinish']= df['GarageFinish'].fillna('None')
df['GarageQual']= df['GarageQual'].fillna('None')
df['GarageCond']= df['GarageCond'].fillna('None')

In [None]:
missing_percent(df)

### More than 80%: Fence, Alley, MiscFeature, PoolQC

For these features, very few rows contain valid data, so we drop the columns:

In [None]:
df= df.drop(['Fence', 'Alley', 'MiscFeature','PoolQC'], axis=1)

In [None]:
missing_percent(df)

### Remaining missing data: FireplaceQu

As we can see from the information:

FireplaceQu: Fireplace quality

       Ex	Excellent - Exceptional Masonry Fireplace
       Gd	Good - Masonry Fireplace in main level
       TA	Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
       Fa	Fair - Prefabricated Fireplace in basement
       Po	Poor - Ben Franklin Stove
       NA	No Fireplace

So the NA values are not missing data; the house simply has no fireplace. We have to change them to None.

In [None]:
df['FireplaceQu']= df['FireplaceQu'].fillna('None')

In [None]:
missing_percent(df)

### Remaining missing data: LotFrontage 

About 17.7% of the LotFrontage values are missing, too many to simply drop. Lot frontage is the side of a lot that abuts a legally accessible street right-of-way (other than an alley or an improved county road). We need to replace the NaN values with a sensible estimate.
Let's check whether lot frontage is related to the neighborhood:

In [None]:
plt.figure(figsize=(8,12))
sns.boxplot(data=df, x='LotFrontage', y='Neighborhood')

In [None]:
df.groupby('Neighborhood')['LotFrontage'].mean()

We can substitute each missing value with the mean lot frontage of its neighborhood:

In [None]:
df['LotFrontage']=df.groupby('Neighborhood')['LotFrontage'].transform(lambda val: val.fillna(val.mean()))
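One edge case of a group-mean fill is a group that has no observed value at all: its mean is NaN, so its gaps stay unfilled. The sketch below shows the same idea on made-up data and adds a fallback to the overall mean. The fallback is an assumption added for illustration, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Toy data (made up): neighborhood "C" has no observed LotFrontage.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "C"],
    "LotFrontage": [60.0, np.nan, 80.0, 50.0, np.nan, np.nan],
})

# Mean of each row's own neighborhood, aligned to the original rows.
group_mean = toy.groupby("Neighborhood")["LotFrontage"].transform("mean")

# Fill each gap with its neighborhood mean.
filled = toy["LotFrontage"].fillna(group_mean)

# The group mean of "C" is NaN, so fall back to the overall mean there.
filled = filled.fillna(toy["LotFrontage"].mean())
print(filled)
```

If every neighborhood has at least one observed value, the fallback line changes nothing.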

In [None]:
missing_percent(df)

Finally, no missing data remains!

In [None]:
df.shape

In [None]:
df = df.reset_index(drop = True)
df.tail(10)

# Step 4: Categorical data

### Numerical Columns to Categorical
We need to be careful when categorical variables are encoded as numbers. We want to make sure that the numerical relationship makes sense for the model. For example, MSSubClass is essentially just a code per dwelling class:

MSSubClass: Identifies the type of dwelling involved in the sale.	

        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES
        80	SPLIT OR MULTI-LEVEL
        85	SPLIT FOYER
        90	DUPLEX - ALL STYLES AND AGES
       120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150	1-1/2 STORY PUD - ALL AGES
       160	2-STORY PUD - 1946 & NEWER
       180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190	2 FAMILY CONVERSION - ALL STYLES AND AGES

So we need to convert it to a string:

In [None]:
df['MSSubClass']= df['MSSubClass'].apply(str)
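The effect of the conversion shows up as soon as dummy variables are created. In the sketch below (toy values, not the real data), `pd.get_dummies` leaves an integer column untouched but expands the string version into one indicator column per code.

```python
import pandas as pd

# Toy data (made up): a few dwelling codes stored as integers.
toy = pd.DataFrame({"MSSubClass": [20, 60, 20, 120]})

# As integers the column is treated as a quantity and passes through unchanged.
as_number = pd.get_dummies(toy)

# As strings every code becomes its own indicator column.
as_category = pd.get_dummies(toy.astype({"MSSubClass": str}))

print(as_number.columns.tolist())
print(as_category.columns.tolist())
```

With the integer version, a linear model would treat code 120 as six times code 20, which has no meaning for dwelling types.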

### Encoding:

We divide the categorical and numerical features to create dummy variables from the categorical ones:

In [None]:
df_num= df.select_dtypes(exclude='object')
df_obj= df.select_dtypes(include='object')

In [None]:
df_obj.info()

Let's do the encoding:

In [None]:
df_obj= pd.get_dummies(df_obj)

In [None]:
df_obj

# Step 5: Numerical data (feature scaling)

We need to scale the numerical features:

In [None]:
df_num.info()

In [None]:
df_num

In [None]:
y_train = df_num['SalePrice']
df_num = df_num.drop( ['SalePrice'] , axis=1)
y_train

In [None]:
names_num = df_num.columns

In [None]:
from sklearn.preprocessing import StandardScaler
scaler= StandardScaler()
scaler.fit(df_num)
df_num = scaler.transform(df_num)
df_num = pd.DataFrame(df_num, columns= names_num)
df_num
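What the scaler stores and reuses can be seen on a one-column example. The sketch below uses made-up arrays and the hypothetical name `toy_scaler`: the statistics are learned from the training values only and are then applied unchanged to new values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up): one feature, three training rows, two new rows.
train = np.array([[1.0], [2.0], [3.0]])
new = np.array([[2.0], [4.0]])

toy_scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training rows.
train_scaled = toy_scaler.fit_transform(train)

# transform reuses those training statistics; nothing is re-estimated.
new_scaled = toy_scaler.transform(new)

print(toy_scaler.mean_, toy_scaler.scale_)
print(new_scaled.ravel())
```

A new value equal to the training mean maps to 0, whatever the mean of the new data is.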

# Step 6: Merging the categorical and numerical values

In [None]:
x_train= pd.concat([df_num, df_obj], axis=1)
x_train
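Concatenating along `axis=1` aligns rows by index label, which is why the index was reset after rows had been dropped. The sketch below, with made-up frames `left` and `right`, shows what can go wrong when one frame still carries gaps in its index.

```python
import pandas as pd

# Toy data (made up): `left` lost its row 1 earlier, `right` has a fresh 0..2 index.
left = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=[0, 2, 3])
right = pd.DataFrame({"y": [10.0, 20.0, 30.0]})

# Rows are matched by label, so the result has four rows with NaN holes.
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first lets the rows line up by position.
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned)
print(aligned)
```

Checking the shape of the concatenated frame against the inputs is a quick way to catch this kind of mismatch.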

# Step 7: Importing the test dataset and applying the same changes

## Dealing with the missing data:

We apply every change we made to the training dataset, including dropping some columns and converting MSSubClass to str.

In [None]:
df_test=pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
df_test.shape

In [None]:
df_test.drop('Id', axis=1, inplace=True)
df_test.drop(['Fence', 'Alley', 'MiscFeature','PoolQC'], axis=1, inplace=True)
df_test['MSSubClass']= df_test['MSSubClass'].apply(str)

df_test

In [None]:
df_test.isnull().sum()

In [None]:
missing_percent(df_test)

In [None]:
df_test['BsmtQual']= df_test['BsmtQual'].fillna('None')
df_test['BsmtCond']= df_test['BsmtCond'].fillna('None')
df_test['BsmtFinType1']= df_test['BsmtFinType1'].fillna('None')
df_test['BsmtExposure']= df_test['BsmtExposure'].fillna('None')
df_test['BsmtFinType2']= df_test['BsmtFinType2'].fillna('None')
df_test['BsmtFinSF1']= df_test['BsmtFinSF1'].fillna(0)
df_test['BsmtFinSF2']= df_test['BsmtFinSF2'].fillna(0)
df_test['BsmtUnfSF']= df_test['BsmtUnfSF'].fillna(0)
df_test['BsmtFullBath']= df_test['BsmtFullBath'].fillna(0)
df_test['BsmtHalfBath']= df_test['BsmtHalfBath'].fillna(0)
df_test['TotalBsmtSF']= df_test['TotalBsmtSF'].fillna(0)

In [None]:
df_test['GarageType']= df_test['GarageType'].fillna('None')
df_test['GarageYrBlt']= df_test['GarageYrBlt'].fillna(0)
df_test['GarageFinish']= df_test['GarageFinish'].fillna('None')
df_test['GarageQual']= df_test['GarageQual'].fillna('None')
df_test['GarageCond']= df_test['GarageCond'].fillna('None')
df_test['GarageArea']= df_test['GarageArea'].fillna(0)
df_test['GarageCars']= df_test['GarageCars'].fillna(0)

In [None]:
df_test['FireplaceQu']= df_test['FireplaceQu'].fillna('None')
df_test['LotFrontage']=df_test.groupby('Neighborhood')['LotFrontage'].transform(lambda val: val.fillna(val.mean()))

In [None]:
df_test['Electrical'] = df_test['Electrical'].fillna(df['Electrical'].mode()[0])
df_test['MasVnrType'] = df_test['MasVnrType'].fillna(df['MasVnrType'].mode()[0])
df_test['MasVnrArea'] = df_test['MasVnrArea'].fillna(df['MasVnrArea'].mode()[0])
df_test['Exterior1st'] = df_test['Exterior1st'].fillna(df['Exterior1st'].mode()[0])
df_test['Exterior2nd'] = df_test['Exterior2nd'].fillna(df['Exterior2nd'].mode()[0])
df_test['KitchenQual'] = df_test['KitchenQual'].fillna(df['KitchenQual'].mode()[0])
df_test['SaleType'] = df_test['SaleType'].fillna(df['SaleType'].mode()[0])
df_test['Utilities'] = df_test['Utilities'].fillna(df['Utilities'].mode()[0])
df_test['MSZoning'] = df_test['MSZoning'].fillna(df['MSZoning'].mode()[0])
df_test['Functional'] = df_test['Functional'].fillna(df['Functional'].mode()[0])

In [None]:
missing_percent(df_test)
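Filling test gaps with the most frequent training value can be illustrated on two short series. The values below are made up, and `train_col` and `test_col` are hypothetical names standing in for one column of each dataset.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one categorical column from train and from test.
train_col = pd.Series(["TA", "Gd", "TA", "Ex"])
test_col = pd.Series(["Gd", np.nan, "TA"])

# mode() returns a Series of the most frequent values; take the first one.
train_mode = train_col.mode()[0]

# Fill the test gap with the value learned from the training data.
test_filled = test_col.fillna(train_mode)
print(test_filled.tolist())
```

Taking the mode from the training column keeps the test data from influencing the fill value.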

### Categorical data (creating dummy variables)

In [None]:
df_num_test= df_test.select_dtypes(exclude='object')
df_obj_test= df_test.select_dtypes(include='object')

In [None]:
df_obj_test.info()

In [None]:
df_obj_test= pd.get_dummies(df_obj_test)

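As you can see, the encoded test data has 248 columns, which does not match the training data. The columns must be identical to those of the training set, so we drop dummy columns for categories that never appear in the training data and add all-zero columns for categories that are missing from the test data: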

In [None]:
# reindex keeps only the training dummy columns (dropping test-only ones)
# and adds any that are missing from the test data, filled with 0.
df_obj_test = df_obj_test.reindex(columns = df_obj.columns, fill_value=0)

In [None]:
df_obj_test
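The column alignment can be checked on a small example in which the test data contains a category the training data never saw, and lacks two that it did. The `Roof` column and its values are invented for this sketch.

```python
import pandas as pd

# Toy data (made up): "Shed" appears only in test, "Hip" and "Flat" only in train.
train_cat = pd.DataFrame({"Roof": ["Gable", "Hip", "Flat"]})
test_cat = pd.DataFrame({"Roof": ["Gable", "Shed"]})

train_dummies = pd.get_dummies(train_cat)
test_dummies = pd.get_dummies(test_cat)

# Keep exactly the training columns: extras are dropped, missing ones become 0.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0).astype(int)

print(test_aligned)
```

A row whose category is unknown to the training data ends up with zeros in every dummy column of that feature.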

### Numerical data (feature scaling)

In [None]:
df_num_test.info()

In [None]:
df_num_test = scaler.transform(df_num_test)
df_num_test = pd.DataFrame(df_num_test, columns= names_num)
df_num_test

### Merging the categorical and numerical data

In [None]:
x_test= pd.concat([df_num_test, df_obj_test], axis=1)
x_test

# Step 8: Building a model (Linear Regression)

In [None]:
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
lr.fit(x_train, y_train)

In [None]:
y_pred=lr.predict(x_test)
y_pred

In [None]:
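# Caution: sample_submission.csv contains Kaggle's placeholder predictions, not the true sale prices,
# so the metrics computed below only measure agreement with that baseline, not real accuracy.
y_test=pd.read_csv('../input/house-prices-advanced-regression-techniques/sample_submission.csv')
y_test.shape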

In [None]:
y_test = y_test['SalePrice']
y_test

In [None]:
from sklearn import metrics
MAE=metrics.mean_absolute_error(y_test,y_pred)
MSE=metrics.mean_squared_error(y_test,y_pred)
RMSE=np.sqrt(MSE)

In [None]:
pd.DataFrame(data=[MAE,MSE,RMSE],index=["MAE","MSE","RMSE"],columns=["LinearRegression"])
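Since the competition does not release the true prices of the test houses, one common way to estimate model error is k-fold cross-validation on the training data. The sketch below is not part of the original workflow. It runs on synthetic data so that it is self-contained, and `X_demo`, `y_demo`, and the coefficients are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (made up): a linear signal plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)

# scikit-learn reports errors as negative scores, so flip the sign afterwards.
scores = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=5, scoring="neg_root_mean_squared_error",
)
rmse_per_fold = -scores

print(rmse_per_fold)
print(rmse_per_fold.mean())
```

In the notebook, the same call could be made with `x_train` and `y_train` in place of the synthetic arrays. Each fold is then scored on rows the model did not see during fitting.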