# **Top 12% Solution**
 
# House Prices - Advanced Regression Techniques

The goal of this project: Predict the sale price of each house in Ames, Iowa.

Why would a company benefit from this project, and is there a next step to this project:
1. A company such as Zillow, or any real estate company, would benefit by being able to see how the price of each house is predicted.
2. If needed, these companies could add to this dataset to predict prices for other types of homes in a different area, or even to estimate monthly rental prices.

How to find the best tree model: I used XGBoost's XGBRegressor because it is a highly efficient implementation of gradient-boosted trees.

Why did I use these parameters: I used a small learning_rate to help prevent overfitting.

How would I implement this algorithm: Prepare the data by filling in or otherwise handling the nulls, add a few features, and then fit and predict the final model with XGBoost.

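My results: On the hold-out split, the final root mean squared error was 199500 per house. This figure was computed with the targets converted back to dollars but the predictions left on the log scale, so it is close to the typical sale price itself and overstates the model's error.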

Why is this regression: This is a regression project because the target, the sale price of each house, is a continuous value.


In [None]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings("ignore")
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Importing the csv files. One for training data and one for testing data.
train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')


# 1. Reading the data:

In [None]:
# Viewing the train data. I can already see there are a few missing values.
train.head(10)

In [None]:
# Keeping every training column except SalePrice as the features.
X = train.drop(columns='SalePrice')
# Here I am using concat to append the testing data to the train data so both get the same preprocessing.
X = pd.concat([X, test], axis=0)
# Creating the y (target) for the train test split.
y = pd.DataFrame(train['SalePrice'])


In [None]:
X.info()


In [None]:
# Creating histograms to view how the SalePrice data is distributed.
fig, ax = plt.subplots(1,2)
# The graph has a long right tail, which means it is right skewed.
ax[0].hist(train['SalePrice'], bins=12, edgecolor='green', facecolor='yellow')
ax[0].set_title("SalePrice distribution ")
# Log of the SalePrice.
ax[1].hist(np.log(y['SalePrice']),bins=12,edgecolor='white')
ax[1].set_title('SalePrice Log')
plt.subplots_adjust(right=0.9, wspace=0.4, hspace=0.4)


In [None]:
# Heatmap to visualize the missing data using seaborn.
plt.figure(figsize=(12, 9))
sns.heatmap(X.isnull(),cmap='YlGnBu')
# The blue represents any missing values.

In [None]:
# As you can see, there is a lot of missing data in this dataset. This is the total count of missing values.
X.isnull().sum().sum()
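
The grand total hides which columns are responsible for the missing values. A minimal sketch on a toy frame (the values below are made up for illustration, only the column names are borrowed from the dataset) shows one way to rank columns by their share of missing values:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X; the values are illustrative only.
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, "Gd", np.nan],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count missing values per column and keep only the columns that have any.
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)

# Express the counts as a fraction of the rows so columns are easy to compare.
missing_report = pd.DataFrame({
    "n_missing": missing,
    "fraction": missing / len(toy),
})
print(missing_report)
```

A per-column view like this makes it easier to decide which columns need a "No" category, a zero, the mode, or the median.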

# 2. Data Cleaning:

In [None]:
# Filling in the null values.
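# Assigning the result back instead of using inplace=True on a single column,
# which newer pandas versions may not apply to the original frame.
X['Functional'] = X['Functional'].fillna('Typ')
X['Electrical'] = X['Electrical'].fillna('SBrkr')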

# Filling in these features with "No", which means they do not have this feature.
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'MiscFeature', 'Fence', 'FireplaceQu', 'Alley', 'PoolQC'):
  X[col] = X[col].fillna('No')

# Filling these features with zeros
for col in ('GarageArea', 'GarageCars'):
  X[col] = X[col].fillna(0)

# Filling these features with the mode.
for col in ('MSZoning', 'Utilities', 'MasVnrType', 'Exterior1st', 'Exterior2nd', 'SaleType'):
  X[col] = X[col].fillna(X[col].mode()[0])

# The rest of the data (numeric columns) I am filling in with the median.
# numeric_only=True keeps median() from failing on the text columns.
X = X.fillna(X.median(numeric_only=True))
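
The cleaning cell applies a different fill rule depending on what a missing value means. A small sketch of that reasoning on a toy frame (values are invented for illustration) could look like this:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan, np.nan],     # NaN means "no pool"
    "GarageCars": [2.0, np.nan, 1.0, 2.0],        # NaN means "no garage"
    "MSZoning": ["RL", "RL", np.nan, "RM"],       # value is simply unknown
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],    # value is simply unknown
})

# Structural absence: use an explicit category or a zero.
toy["PoolQC"] = toy["PoolQC"].fillna("No")
toy["GarageCars"] = toy["GarageCars"].fillna(0)

# Unknown categorical value: fall back to the most frequent category.
toy["MSZoning"] = toy["MSZoning"].fillna(toy["MSZoning"].mode()[0])

# Unknown numeric value: fall back to the column median.
toy = toy.fillna(toy.median(numeric_only=True))
print(toy)
```

The order matters: the targeted fills run first, so the final median fill only touches numeric columns that still have gaps.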

# 3. Feature Engineering:

In [None]:
# Adding new features to the dataset by using the original features
X['AllSF'] = X['TotalBsmtSF'] + X['1stFlrSF'] + X['2ndFlrSF']
X['BackyardSF'] = X['LotArea'] - X['1stFlrSF']
X['PorchSF'] = X['WoodDeckSF'] + X['OpenPorchSF'] + X['EnclosedPorch'] + X['3SsnPorch'] + X['ScreenPorch'] 
X['Total_Bathrooms'] = X['FullBath'] + X['BsmtFullBath'] + (.5 * X['HalfBath']) + (.5 * X['BsmtHalfBath'])
X['MedNhbdArea'] = X.groupby('Neighborhood')['GrLivArea'].transform('median')
# Flagging houses whose living area is above the median of their neighborhood.
X['IsAbvGr'] = np.where(X['GrLivArea'] > X['MedNhbdArea'], 'yes', 'no')
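
The last two features rely on `groupby(...).transform(...)`, which returns one value per original row rather than one per group. A minimal sketch with invented numbers shows the mechanics:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "GrLivArea": [1000, 1500, 2000, 800, 1200],
})

# transform('median') broadcasts each group's median back onto its rows.
toy["MedNhbdArea"] = toy.groupby("Neighborhood")["GrLivArea"].transform("median")

# Compare each house with its own neighborhood's median.
toy["IsAbvGr"] = np.where(toy["GrLivArea"] > toy["MedNhbdArea"], "yes", "no")
print(toy)
```

Because the comparison is strict, a house sitting exactly at its neighborhood median is labeled 'no'.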


In [None]:
# Correlation matrix (numeric columns only; text columns cannot be correlated directly)
corrmat = X.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)


# 4. Scaling, Encoding, Transforming:

In [None]:
scale = ['MedNhbdArea', 'BackyardSF', 'PorchSF', 'WoodDeckSF', 'OpenPorchSF','AllSF', '1stFlrSF','2ndFlrSF', 'BsmtFinSF1', 'BsmtFinSF2','BsmtUnfSF','GarageArea','GrLivArea','LotArea','LotFrontage','LowQualFinSF','MasVnrArea','TotalBsmtSF','PoolArea']

encode = list(set(X.columns) - set(scale) - set(['Id']))

# Working with the skew
skew_feats = X[scale].skew().sort_values(ascending=False)
skewness = pd.DataFrame({'Skew': skew_feats.astype('float')})
skewness = skewness[(skewness.Skew > .75)]
indeces = list(skewness.index)

# Why am I using log: log1p reduces the right skew of these features and is safe for zeros.
for x in indeces:
  X[x] = np.log1p(X[x])
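
To see what the skew threshold and `np.log1p` do, here is a small sketch on synthetic data. The log-normal draws are an assumption chosen to mimic a right-skewed area feature; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    # Log-normal values have a long right tail, like many area features.
    "LotArea": rng.lognormal(mean=9.0, sigma=0.5, size=1000),
    # Uniform values are roughly symmetric, so they should be left alone.
    "YearBuilt": rng.integers(1900, 2010, size=1000).astype(float),
})

# Select only the columns whose skew exceeds the same 0.75 threshold.
skew_before = toy.skew()
skewed_cols = skew_before[skew_before > 0.75].index

# log1p(x) = log(1 + x), so zeros map to zero instead of -inf.
toy[skewed_cols] = np.log1p(toy[skewed_cols])
skew_after = toy.skew()

print(pd.DataFrame({"before": skew_before, "after": skew_after}))
```

Only the columns above the threshold are transformed, so features that are already roughly symmetric keep their original scale.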


In [None]:
# Starting to scale: MinMaxScaler rescales each numeric feature to the [0, 1] range.
Xscale = X[scale]
scaler = MinMaxScaler().fit(Xscale)
Xscale = pd.DataFrame(scaler.transform(Xscale), columns=Xscale.columns)

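# Selecting the remaining columns, which will be encoded.
Xencode = X[encode]

# Merge the scaled data with the columns to encode. drop=True discards the old,
# duplicated index instead of adding it to the data as an extra 'index' feature.
X = Xscale.merge(Xencode.reset_index(drop=True), left_index=True, right_index=True)
# Turning the categorical columns into dummies.
X = pd.get_dummies(data=X)

# Ordinal-encoding any object columns that remain. After get_dummies there are
# normally none left, so this loop acts as a safeguard.
oc = OrdinalEncoder()
for x in X:
  if X[x].dtype == 'object':
    X[x] = oc.fit_transform(X[x].values.reshape(-1, 1))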
X.head(5)
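
One reason for preprocessing the train and test rows together is that the dummy columns then match in both parts. A minimal sketch with two tiny made-up frames (not the real data) illustrates this:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy train/test frames; 'Grvl' appears only in the training rows.
train_toy = pd.DataFrame({"LotArea": [8000.0, 10000.0, 12000.0],
                          "Street": ["Pave", "Grvl", "Pave"]})
test_toy = pd.DataFrame({"LotArea": [9000.0, 14000.0],
                         "Street": ["Pave", "Pave"]})

# Stack the rows, using a fresh 0..n-1 index to avoid duplicate labels.
combined = pd.concat([train_toy, test_toy], axis=0, ignore_index=True)

# Min-max scaling maps the numeric column onto [0, 1].
combined[["LotArea"]] = MinMaxScaler().fit_transform(combined[["LotArea"]])

# One-hot encoding on the combined frame yields the same columns for both parts.
combined = pd.get_dummies(combined)

train_part = combined.iloc[:len(train_toy)]
test_part = combined.iloc[len(train_toy):]
print(list(train_part.columns) == list(test_part.columns))
```

Encoding the two frames separately would leave the test part without a `Street_Grvl` column, and the model would then see a different number of features at prediction time.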

# 5. Train Test split: 

In [None]:
# Preparing the data for train test split: separating the combined data back
# into the training rows and the Kaggle test rows.
j = X
X = j.iloc[:train.shape[0]]
test = j.iloc[train.shape[0]:]

# Split
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=.3, random_state=0)

# Working with ytrain: log-transforming the target and resetting the index.
ytrain = pd.DataFrame(np.log1p(ytrain.SalePrice))
ytrain.reset_index(inplace=True)
ytrain.drop(columns='index', inplace=True)
# Working with ytest: same steps as for ytrain.
ytest = pd.DataFrame(np.log1p(ytest.SalePrice))
ytest.reset_index(inplace=True)
ytest.drop(columns='index', inplace=True)
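
The target is modeled as `log1p(SalePrice)`, and `expm1` undoes that transform. A short sketch with made-up prices shows the round trip and why the log scale is convenient: equal price ratios become roughly equal differences.

```python
import numpy as np

# Illustrative prices only; they are not rows from the dataset.
prices = np.array([100000.0, 200000.0, 400000.0])

# Forward transform used for training.
log_prices = np.log1p(prices)

# Inverse transform used to get predictions back into dollars.
recovered = np.expm1(log_prices)

# Each doubling of the price adds about log(2) on the log scale.
steps = np.diff(log_prices)
print(log_prices, recovered, steps)
```

Anything predicted by a model trained on this target has to go through `np.expm1` before it can be read as a price.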

# 6. XGBoost:
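A few of the more important parameters that are worth noting:
1. learning_rate: Shrinks the contribution of each tree; a small value helps prevent overfitting.
2. n_estimators: The number of trees (boosting rounds); the value comes from parameter tuning.
3. gamma: The minimum loss reduction required to split a node further. The higher the number, the stronger the regularization; 0 applies none.
4. objective: Regression with squared error loss ('reg:linear' is the older name of 'reg:squarederror').
5. nthread: The number of parallel threads; -1 is intended to use all available threads.
6. reg_alpha: L1 regularization on the weights, which helps reduce overfitting and test error.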
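
To make the interplay between learning_rate and n_estimators concrete, here is a from-scratch sketch of plain gradient boosting with one-split trees ("stumps") on squared error. It is a simplified illustration on synthetic data, not XGBoost's actual algorithm, and the helper names `fit_stump` and `boost` are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Return (threshold, left_value, right_value) with the lowest squared error."""
    best = None
    # Skip the largest value so the right side is never empty.
    for t in np.unique(x)[:-1]:
        left = residual[x <= t]
        right = residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

def boost(x, y, learning_rate, n_estimators):
    """Fit stumps to the residuals and return the training MSE after each round."""
    pred = np.full_like(y, y.mean())
    history = []
    for _ in range(n_estimators):
        t, left_val, right_val = fit_stump(x, y - pred)
        # The learning rate shrinks how much each new tree changes the prediction.
        pred = pred + learning_rate * np.where(x <= t, left_val, right_val)
        history.append(np.mean((y - pred) ** 2))
    return history

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

fast = boost(x, y, learning_rate=0.5, n_estimators=50)
slow = boost(x, y, learning_rate=0.05, n_estimators=50)
print(f"lr=0.5  training MSE after 50 rounds: {fast[-1]:.4f}")
print(f"lr=0.05 training MSE after 50 rounds: {slow[-1]:.4f}")
```

With the same number of rounds, the smaller learning rate fits the training data more slowly, which is why a value like 0.01 is paired with a large n_estimators.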

In [None]:
# 'reg:squarederror' is the current name of the deprecated 'reg:linear' objective.
model = XGBRegressor(learning_rate=0.01, n_estimators=3460, gamma=0, objective='reg:squarederror', nthread=-1, reg_alpha=0.00006)


In [None]:
# Fitting the model
model.fit(xtrain, ytrain)
# Predicting on the hold-out split (the predictions are on the log scale)
preds = model.predict(xtest)
# Finding the root mean squared error in dollars: the log is removed from both
# the true values and the predictions so they are on the same scale.
np.sqrt(mean_squared_error(np.expm1(ytest), np.expm1(preds)))
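
Because the model is trained on `log1p(SalePrice)`, its predictions are on the log scale, and an RMSE is only meaningful when both arguments share a scale. A small sketch with invented numbers (the `rmse` helper is hypothetical) compares the three possibilities:

```python
import numpy as np

def rmse(a, b):
    """Root mean squared error between two equally sized arrays."""
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))

# Made-up true prices and log-scale predictions that are off by 5-10%.
true_prices = np.array([120000.0, 180000.0, 250000.0])
pred_log = np.log1p(true_prices * np.array([1.05, 0.95, 1.10]))
true_log = np.log1p(true_prices)

rmse_log = rmse(true_log, pred_log)                    # both on the log scale
rmse_dollars = rmse(true_prices, np.expm1(pred_log))   # both in dollars
rmse_mixed = rmse(true_prices, pred_log)               # mismatched scales

print(rmse_log, rmse_dollars, rmse_mixed)
```

The mismatched version is dominated by the size of the prices themselves, so it reports an error close to the typical sale price regardless of how good the predictions are.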


In [None]:
# Adding log to y
y = pd.DataFrame(np.log1p(y.SalePrice))
# Resetting the index
y.reset_index(inplace=True)
# Dropping the index column
y.drop(columns='index', inplace=True)

In [None]:
# Predicting on the Kaggle test data with the fitted model
final_preds = model.predict(test)
# Removing the log so the predictions are back in dollars
final_preds = np.expm1(final_preds)
# Re-reading the original test file to recover the Id column
new_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

# 7. Submission:


In [None]:
# Working on the final submission.
submission = pd.DataFrame(new_test['Id'], columns=['Id'])
# Adding column for SalePrice
submission['SalePrice'] = final_preds
# Final view of the finished submission data
submission

In [None]:
#Saving the submission in order to submit it to Kaggle.
submission.to_csv('submission.csv', index=False, header=True)
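
Before uploading, it can help to sanity-check the file. The sketch below uses a hypothetical helper, `check_submission`, and toy frames; the checks assume the two-column `Id`/`SalePrice` layout built above.

```python
import pandas as pd

def check_submission(sub, expected_ids):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(sub.columns) != ["Id", "SalePrice"]:
        problems.append("columns must be exactly ['Id', 'SalePrice']")
        return problems
    if sub["Id"].tolist() != list(expected_ids):
        problems.append("Id column does not match the test file")
    if sub["SalePrice"].isnull().any():
        problems.append("SalePrice contains missing values")
    if (sub["SalePrice"] <= 0).any():
        problems.append("SalePrice contains non-positive values")
    return problems

# Toy examples: one clean frame and one with a missing prediction.
good = pd.DataFrame({"Id": [1461, 1462], "SalePrice": [120000.0, 155000.0]})
bad = pd.DataFrame({"Id": [1461, 1462], "SalePrice": [120000.0, float("nan")]})

print(check_submission(good, [1461, 1462]))
print(check_submission(bad, [1461, 1462]))
```

In the notebook itself, the call would be along the lines of `check_submission(submission, new_test['Id'])`.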