# House Prices - Competition

The goal of the competition is to predict house prices as accurately as possible, using a dataset that contains 79 explanatory variables describing residential homes in Ames, Iowa.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

## Importing needed libraries and data

In [1]:
# Importing needed tools
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


# Getting the data to train the models
home_filepath = '../input/house-prices-advanced-regression-techniques/train.csv'
home_data = pd.read_csv(home_filepath)

# Getting the data for which we give predictions
test_data_path = '../input/house-prices-advanced-regression-techniques/test.csv'
test_data = pd.read_csv(test_data_path)

## Initial data analysis

In [2]:
home_data.info()
print('************')
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1459 non-null   int64  
 1   MSSubClass     1459 non-null   int64  
 2   MSZoning       1455 non-null   object 
 3   LotFrontage    1232 non-null   float64
 4   LotArea        1459 non-null   int64  
 5   Street         1459 non-null   object 
 6   Alley          107 non-null    object 
 7   LotShape       1459 non-null   object 
 8   LandContour    1459 non-null   object 
 9   Utilities      1457 non-null   object 
 10  LotConfig      1459 non-null   object 
 11  LandSlope      1459 non-null   object 
 12  Neighborhood   1459 non-null   object 
 13  Condition1     1459 non-null   object 
 14  Condition2     1459 non-null   object 
 15  BldgType       1459 non-null   object 
 16  HouseStyle     1459 non-null   object 
 17  OverallQual    1459 non-null   int64  
 18  OverallC

That is a lot of columns. For a first attempt I will only consider the numeric features.

In [3]:
numeric_home_data = home_data.select_dtypes(exclude='object')
numeric_home_data.describe()

|  | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 1459.0 | 1459.0 | 1232.0 | 1459.0 | 1459.0 | 1459.0 | 1459.0 | 1459.0 | 1444.0 | 1458.0 | ... | 1458.0 | 1459.0 | 1459.0 | 1459.0 | 1459.0 | 1459.0 | 1459.0 | 1459.0 | 1459.0 | 1459.0 |
| mean | 2190.0 | 57.378341 | 68.580357 | 9819.161069 | 6.078821 | 5.553804 | 1971.357779 | 1983.662783 | 100.709141 | 439.203704 | ... | 472.768861 | 93.174777 | 48.313914 | 24.243317 | 1.79438 | 17.064428 | 1.744345 | 58.167923 | 6.104181 | 2007.769705 |
| std | 421.321334 | 42.74688 | 22.376841 | 4955.517327 | 1.436812 | 1.11374 | 30.390071 | 21.130467 | 177.6259 | 455.268042 | ... | 217.048611 | 127.744882 | 68.883364 | 67.227765 | 20.207842 | 56.609763 | 30.491646 | 630.806978 | 2.722432 | 1.30174 |
| min | 1461.0 | 20.0 | 21.0 | 1470.0 | 1.0 | 1.0 | 1879.0 | 1950.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 |
| 25% | 1825.5 | 20.0 | 58.0 | 7391.0 | 5.0 | 5.0 | 1953.0 | 1963.0 | 0.0 | 0.0 | ... | 318.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 2007.0 |
| 50% | 2190.0 | 50.0 | 67.0 | 9399.0 | 6.0 | 5.0 | 1973.0 | 1992.0 | 0.0 | 350.5 | ... | 480.0 | 0.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2008.0 |
| 75% | 2554.5 | 70.0 | 80.0 | 11517.5 | 7.0 | 6.0 | 2001.0 | 2004.0 | 164.0 | 753.5 | ... | 576.0 | 168.0 | 72.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 2009.0 |
| max | 2919.0 | 190.0 | 200.0 | 56600.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1290.0 | 4010.0 | ... | 1488.0 | 1424.0 | 742.0 | 1012.0 | 360.0 | 576.0 | 800.0 | 17000.0 | 12.0 | 2010.0 |
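
To make the numeric-only selection concrete, here is a minimal sketch on a tiny made-up frame. The values are invented and only the column names are borrowed from the dataset. The text column is given `dtype=object` explicitly so that the same call as in the notebook filters it out.

```python
import pandas as pd

# Tiny invented frame: two numeric columns and one text column
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    'LotFrontage': [65.0, 80.0, None],
    'Street': pd.Series(['Pave', 'Pave', 'Grvl'], dtype=object),
})

# Same call as in the notebook: every column that is not dtype object is kept
numeric_toy = toy.select_dtypes(exclude='object')
print(numeric_toy.columns.to_list())
```

Note that a numeric column with missing values, like `LotFrontage` here, is still kept, so the gaps have to be handled separately.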


## Choosing features to use

In [4]:
numeric_corr = numeric_home_data.corr()
sale_price_corr_table = numeric_corr['SalePrice'].sort_values(ascending=False).to_frame()
sale_price_corr_table

KeyError: 'SalePrice'

Let's keep the features whose correlation with the price is greater than 0.25.

In [None]:
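# Keep features whose correlation with SalePrice is above 0.25; the upper bound drops SalePrice itself (correlation 1)
corr_with_price = sale_price_corr_table['SalePrice']
features = sale_price_corr_table[(corr_with_price > 0.25) & (corr_with_price < 1)].index.to_list()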
home_data[features].describe()
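
The selection rule can be tried out on a small invented frame. This is only a sketch: the numbers are made up, `Age` is a hypothetical column that is not in the dataset, and `threshold` is just a name for the 0.25 cut-off used above. Since `numeric_corr['SalePrice']` raises a `KeyError` when the frame has no `SalePrice` column, the sketch checks for the column first.

```python
import pandas as pd

# Invented values; 'Age' is a hypothetical column used only for illustration
toy = pd.DataFrame({
    'SalePrice': [100, 150, 200, 250, 300],
    'GrLivArea': [50, 70, 95, 120, 150],   # grows with price
    'MoSold':    [3, 9, 1, 9, 3],          # unrelated to price
    'Age':       [50, 40, 30, 20, 10],     # falls as price rises
})

target = 'SalePrice'
if target not in toy.columns:
    raise KeyError(f'{target} is missing, so no correlation can be computed')

# Correlation of every column with the target, without the target itself
corr_with_price = toy.corr()[target].drop(target)

threshold = 0.25  # same cut-off as in the notebook
features = corr_with_price[corr_with_price > threshold].index.to_list()
print(features)
```

Dropping the target by name avoids relying on its self-correlation being exactly 1. The rule also discards strongly negative correlations such as `Age` here; comparing `corr_with_price.abs()` against the threshold would keep them, if that is wanted.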

For this first attempt I will drop the features that have many missing values.

In [None]:
features.remove('LotFrontage')
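
One way to decide which features have too many gaps is to look at the share of missing values per column. The sketch below uses an invented frame, and the 0.4 cut-off (`max_missing_share`) is an arbitrary assumption for the example, not a value taken from the notebook.

```python
import pandas as pd

# Invented values; None becomes NaN in the float columns
toy = pd.DataFrame({
    'LotFrontage': [65.0, None, None, 60.0],
    'MasVnrArea': [196.0, 0.0, None, 0.0],
    'GarageArea': [548, 460, 608, 642],
})
features = ['LotFrontage', 'MasVnrArea', 'GarageArea']

# Fraction of missing entries per feature, largest first
missing_share = toy[features].isna().mean().sort_values(ascending=False)
print(missing_share)

max_missing_share = 0.4  # hypothetical cut-off, chosen only for this example
too_sparse = missing_share[missing_share > max_missing_share].index.to_list()

# Keep the remaining features in their original order
kept_features = [f for f in features if f not in too_sparse]
print(kept_features)
```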

In [None]:
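# Treat a missing masonry veneer area as 0; assigning back avoids the chained inplace fillna that recent pandas versions warn about
home_data['MasVnrArea'] = home_data['MasVnrArea'].fillna(0)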

In [None]:
# Where the garage year is missing, fall back to the year the house was built
home_data['GarageYrBlt'] = home_data['GarageYrBlt'].fillna(home_data['YearBuilt'])
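
Both kinds of filling, with a constant and with the value from another column, can be checked on a small made-up frame. This is a sketch with invented values. It assigns the result back to the column rather than using `inplace=True`, which is the form that keeps working under pandas' copy-on-write behaviour.

```python
import pandas as pd

# Invented rows; the second house has no garage year and no veneer area recorded
toy = pd.DataFrame({
    'YearBuilt': [2003, 1976, 1915],
    'GarageYrBlt': [2003.0, None, 1998.0],
    'MasVnrArea': [196.0, None, 0.0],
})

# Constant fill: a missing veneer area is treated as 0
toy['MasVnrArea'] = toy['MasVnrArea'].fillna(0)

# Column fill: missing garage years take the row's YearBuilt (aligned by index)
toy['GarageYrBlt'] = toy['GarageYrBlt'].fillna(toy['YearBuilt'])

print(toy)
```

Whatever filling is applied to the training data presumably needs to be applied to `test_data` in the same way before predicting.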

In [None]:
home_data[features].describe()

In [None]:
test_data[features].describe()

In [None]:
home_data[features].head()

In [None]:
test_data[features].head()