This Kernel takes a simple approach to the House Prices challenge and aims for an acceptable leaderboard score. We will apply Linear Regression, Lasso Regression, and Random Forests, models that are known for handling structured data well.

> Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price.
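
As a minimal sketch of that metric, assuming plain arrays of predicted and observed prices (the function name `rmse_of_logs` and the sample prices are illustrative, not part of the competition code):

```python
import numpy as np

def rmse_of_logs(predicted, observed):
    """RMSE between log(predicted) and log(observed)."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return np.sqrt(np.mean((np.log(predicted) - np.log(observed)) ** 2))

# Hypothetical prices, chosen only for illustration
observed = [100_000, 200_000, 400_000]
predicted = [110_000, 180_000, 400_000]
print(rmse_of_logs(predicted, observed))

# Being off by the same ratio costs the same, whatever the price level
cheap = rmse_of_logs([110_000], [100_000])
pricey = rmse_of_logs([1_100_000], [1_000_000])
print(np.isclose(cheap, pricey))
```

Because the error is taken on logs, a 10% miss on a cheap house and a 10% miss on an expensive one are penalized equally.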

Resources used to guide my personal learning and application throughout this House Price project:
* [Comprehensive data exploration with Python](https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python)
* [Getting Started with Kaggle: House Prices Competition](https://www.dataquest.io/blog/kaggle-getting-started/)
* [Regularized Linear Models](https://www.kaggle.com/apapiu/regularized-linear-models)

**Notebook Content:**
1. Imports
2. Exploratory Data Analysis
3. Transforming and Engineering Features
4. Build the Model

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# visualization tools
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('ggplot')

# SciPy tools
from scipy.stats import skew

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
import os
print(os.listdir("../input"))
# Any results you write to the current directory are saved as output.

In [None]:
# given data imports
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

# copies of DS for manipulation (we will use this for remainder of project)
train_df = train.copy()
test_df = test.copy()

# copies for EDA purposes
EDA_train = train.copy()
EDA_test = test.copy()

# combine data sets to avoid dimension misalignment
all_data = pd.concat((train_df.loc[:,'MSSubClass':'SaleCondition'],
                      test_df.loc[:,'MSSubClass':'SaleCondition']))

print(train_df.shape, test_df.shape, all_data.shape)

In [None]:
# keep a copy of the target (dependent variable); SalePrice itself is
# dropped from train_df later, when the combined data is split
actual_y = train_df['SalePrice']
#train_df = train_df.drop('SalePrice', axis=1)

train_df.shape

**Exploratory Data Analysis**

In [None]:
# from Abhinand "Predicting HousingPrices: Simple Approach" Kernel
def show_all(df):
    # This function lets us view the full dataframe (up to 100 rows and columns)
    with pd.option_context('display.max_rows', 100, 'display.max_columns', 100):
        display(df)

In [None]:
show_all(train_df.head())

> We can see in the output above that there are many qualitative features and missing values.
> 
> Let's take a look at the skewness of SalePrice to see if a log transformation will be necessary for linear regression.

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(actual_y)
print("Skew is: ", actual_y.skew())

> You can see that the data is skewed. We will attempt to log-transform the data to bring the skew number closer to 0.

In [None]:
log_actual_y = np.log(actual_y)
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(log_actual_y)

print("Skew is: ", log_actual_y.skew())

> A skew value closer to 0 means that we have improved the skewness of the data. You can see from the plot that the logged data resembles a normal distribution!

We will handle quantitative and qualitative features separately, beginning with the quantitative features. First we examine the correlation between the actual SalePrice and each quantitative feature.

In [None]:
quant = train.select_dtypes(include=[np.number])
quant.dtypes

> Examine correlations between the numeric features and the target (SalePrice)

In [None]:
corr = quant.corr()
print(corr['SalePrice'].sort_values(ascending=False)[:5])
print(corr['SalePrice'].sort_values(ascending=False)[-5:])

We can now see the 5 features most positively correlated with SalePrice, followed by the 5 at the bottom of the ranking.
> If your dataset has perfectly positively or negatively correlated attributes, then there is a high chance that the performance of the model will be impacted by a problem called “Multicollinearity”

Let's take a look at a correlation heatmap.

In [None]:
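# correlate numeric columns only (non-numeric columns raise an error in recent pandas)
corr_map = train_df.select_dtypes(include=[np.number]).corr()
fig, ax = plt.subplots(figsize=(20,16))
sns.heatmap(corr_map, vmax=.8, square=True, annot=True, fmt='.1f')
plt.show()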

> We can see that 'GrLivArea' and 'TotRmsAbvGrd', 'TotalBsmtSF' and '1stFlrSF', and 'YearBuilt' and 'GarageYrBlt' have high correlations. **These correlations are so strong that they may indicate multicollinearity**.
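
Pairs like these can also be listed programmatically instead of being read off the heatmap. The following is a sketch under assumptions: the helper name `correlated_pairs`, the 0.8 threshold, and the toy columns are all illustrative and not part of the original analysis.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes(include=[np.number]).corr()
    cols = corr.columns
    pairs = []
    # walk the upper triangle so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Toy frame with hypothetical columns: 'b' is nearly a multiple of 'a', 'c' is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
print(correlated_pairs(toy))
```

Calling `correlated_pairs(train_df)` on the notebook's data should surface candidates similar to the ones named above, though the exact list depends on the threshold chosen.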

Let's take a moment to visualize these highly correlated numeric features (and later trim outliers).


In [None]:
# Top 10 high correlation to SalePrice matrix
n = 10
cols = corr_map.nlargest(n, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(train_df[cols].values.T)
sns.set(font_scale=1.25)
fig, ax = plt.subplots(figsize=(10,8))
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

* 'GarageCars' and 'GarageArea' are like twins, so we only need one. We will choose 'GarageCars' because it has a stronger correlation to 'SalePrice'!
* 'TotalBsmtSF' and '1stFlrSF' are also twins. We will choose 'TotalBsmtSF' because of its higher correlation to 'SalePrice'.
* 'TotRmsAbvGrd' and 'GrLivArea' are also twins. We will choose 'GrLivArea'.

> * (Drop 'GarageArea')
> * (Drop '1stFlrSF')
> * (Drop 'TotRmsAbvGrd')

In [None]:
train_df.OverallQual.unique()

In [None]:
# pivot table to further investigate relationship between 'OverallQual' and 'SalePrice'
quality_pivot = train.pivot_table(index='OverallQual', values='SalePrice', aggfunc='median')
quality_pivot

In [None]:
f, ax = plt.subplots(figsize=(8, 4))
sns.lineplot(x='OverallQual', y='SalePrice', color='green', data=train_df)

> We can see that as the overall quality of the house increases so does the price of the house. This is an excellent variable for our model.

Now we will take a look at 'GrLivArea'

In [None]:
plt.figure(figsize=(8, 6), dpi=80)
plt.scatter(x = train_df['GrLivArea'], y = log_actual_y)
plt.ylabel('LogSalePrice')
plt.xlabel('GrLivArea')
plt.show()

> Outliers can affect a regression model by pulling our estimated regression line further away from the true population regression line. So, we’ll remove those observations from our data.
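
To make that pull concrete, here is a small sketch on synthetic data (not the housing data): a straight line is fitted with `np.polyfit` before and after adding a single extreme point. All values are made up for illustration.

```python
import numpy as np

# Synthetic data lying exactly on y = 2x + 1 (illustrative values only)
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

slope_clean, intercept_clean = np.polyfit(x, y, deg=1)

# Add one point at a large x that sits far below the line
x_out = np.append(x, 20.0)
y_out = np.append(y, 5.0)
slope_out, intercept_out = np.polyfit(x_out, y_out, deg=1)

print(f"without outlier: slope={slope_clean:.2f}")
print(f"with outlier:    slope={slope_out:.2f}")
```

A single high-leverage point is enough to flatten the fitted slope noticeably, which is the effect the large, cheap houses have in the 'GrLivArea' scatter plot.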

In [None]:
# remove outliers and update EDA_train
EDA_train = EDA_train[EDA_train['GrLivArea'] < 4000]

plt.figure(figsize=(8, 6), dpi=80)
plt.scatter(x = EDA_train['GrLivArea'], y = np.log(EDA_train.SalePrice))
plt.xlim(-200,6000) # keeps same scale as first scatter plot
plt.ylabel('LogSalePrice')
plt.xlabel('GrLivArea')
plt.show()

In [None]:
# lets do the same for all data
#all_data = all_data[all_data['GrLivArea'] < 4000]

Now we will take a look at garage area.

In [None]:
plt.figure(figsize=(8, 6), dpi=80)
plt.scatter(x = train_df['GarageArea'], y = np.log(train_df.SalePrice))
plt.ylabel('LogSalePrice')
plt.xlabel('GarageArea')
plt.show()

In [None]:
# remove outliers and update EDA_train
EDA_train = EDA_train[EDA_train['GarageArea'] < 1200]

plt.figure(figsize=(8, 6), dpi=80)
plt.scatter(x = EDA_train['GarageArea'], y = np.log(EDA_train.SalePrice))
plt.xlim(-50,1475)
plt.ylabel('LogSalePrice')
plt.xlabel('GarageArea')
plt.show()

In [None]:
# instead of removing outlier rows, lets try to impute them with a value
# dropping rows with outliers is misaligning my data and preventing submission
#all_data = all_data[all_data['GarageArea'] < 1200]
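
One way to act on that note, sketched here under assumptions, is to cap extreme values with `Series.clip` rather than dropping rows, so the train and test row counts stay intact. The toy frame is hypothetical and the cap of 1200 simply reuses the cutoff from the plot above; whether capping helps the model has not been tested in this notebook.

```python
import pandas as pd

# Toy stand-in for the combined data; values are hypothetical
toy = pd.DataFrame({"GarageArea": [300, 480, 576, 1390, 1418]})

cap = 1200  # same cutoff as the outlier filter above
clipped = toy.copy()
clipped["GarageArea"] = clipped["GarageArea"].clip(upper=cap)

# Rows are kept, so a later split by row count still lines up
print(len(toy), len(clipped))
print(clipped["GarageArea"].tolist())
```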

We can investigate outliers in other variables at a later date. Now we will take a look at missing values and begin the imputation process. (Remember that any drops and transformations must be applied to the combined data set, `all_data`, to avoid dimension misalignment.)

In [None]:
# Number of missing values in each column of training data
missing_vals = (train_df.isnull().sum())
print(missing_vals[missing_vals > 0])

We will drop all variables with a large number of missing values. Why? None of these variables seems to be an important consideration when deciding to buy a house (and that's probably why they have so many missing values). 
> Drop: 'MiscFeature', 'Fence', 'PoolQC', 'FireplaceQu', 'Alley'

With regard to the Garage-related variables with missing values, we already have a garage variable with a high correlation to SalePrice. That variable alone will do the trick, so we will delete all Garage variables with missing data.
> Drop: 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond'

The same logic applies to Bsmt variables. Also MasVnr variables correlate heavily with 'OverallQual' so we will delete those.

We will delete everything except for Electrical.

> General intuition here was gathered from [Comprehensive data exploration with Python](https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python)
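
The missing-value review above can be tabulated in one place. This is a sketch on a toy frame: the helper name `missing_summary` and the sample values are hypothetical, and the "more than one missing value" rule mirrors the decision to keep only 'Electrical'.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and share of missing values per column, largest first."""
    total = df.isnull().sum()
    share = df.isnull().mean()
    summary = pd.DataFrame({"total": total, "share": share})
    return summary[summary["total"] > 0].sort_values("total", ascending=False)

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "Electrical": ["SBrkr", np.nan, "SBrkr", "SBrkr"],
    "GrLivArea": [1710, 1262, 1786, 1717],
})
summary = missing_summary(toy)
print(summary)

# Drop columns missing in more than one row; a single gap (Electrical) is kept
to_drop = summary[summary["total"] > 1].index
kept = toy.drop(columns=to_drop)
print(kept.columns.tolist())
```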


*We will now begin an analysis on the normality of some of our very important features. Let's note our most important variables thus far:*
> * OverallQual
> * GarageCars (recall all other garage variables have been dropped)
> * TotalBsmtSF (recall all other Bsmt variables have been dropped)
> * GrLivArea

In [None]:
# GrLivArea
f, ax = plt.subplots(figsize=(8, 4))
sns.distplot(train_df['GrLivArea'])
print("Skew is: ", train_df['GrLivArea'].skew())

Holy skew! Let's transform.

In [None]:
train_df['GrLivArea'] = np.log(train_df['GrLivArea'])

f, ax = plt.subplots(figsize=(8, 4))
sns.distplot(train_df['GrLivArea'])
print("Skew is: ", train_df['GrLivArea'].skew())

In [None]:
# TotalBsmtSF
f, ax = plt.subplots(figsize=(8, 4))
sns.distplot(train_df['TotalBsmtSF'])
print("Skew is: ", train_df['TotalBsmtSF'].skew())

> Methods exist for log-transforming awkward variables such as this one, which contains values equal to zero. We will revisit this at a later date.
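
One common option, shown here as a sketch rather than a decision about 'TotalBsmtSF', is `np.log1p`, which computes log(1 + x) and is therefore defined at zero. The basement areas below are hypothetical.

```python
import numpy as np

# Hypothetical basement areas; 0 means the house has no basement
total_bsmt_sf = np.array([0.0, 856.0, 1262.0, 0.0, 920.0])

# A plain log sends zeros to -inf (the warning is silenced for the demo)
with np.errstate(divide="ignore"):
    plain_log = np.log(total_bsmt_sf)

# log(1 + x) maps zeros to 0 instead
shifted_log = np.log1p(total_bsmt_sf)

print(plain_log)
print(shifted_log)

# np.expm1 inverts np.log1p, which is how values get back to the original scale
print(np.allclose(np.expm1(shifted_log), total_bsmt_sf))
```

Note that the zeros still form a spike at 0 after the transform, so this removes the undefined-log problem but not the mix of houses with and without a basement.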

**Transforming and Engineering Features**
> Start with applying all drops and transformations noted in EDA section. 
> For this section I reference: [Regularized Linear Models](https://www.kaggle.com/apapiu/regularized-linear-models)

In [None]:
all_data.shape

In [None]:
# we will begin by applying log transformation to skewed numeric features
num_data = all_data.dtypes[all_data.dtypes != "object"].index

skew_data = all_data[num_data].apply(lambda x: skew(x.dropna()))
skew_data = skew_data[skew_data > 0.75]
skew_data = skew_data.index

all_data[skew_data] = np.log1p(all_data[skew_data])

For the sake of working quickly, we will encode all qualitative variables with dummy representations. At a later point we will revisit the qualitative variables with a more granular approach.
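
The reason for encoding the combined data rather than train and test separately can be seen in a small sketch. The toy frames and category values are hypothetical; the point is only that a level missing from one frame produces a missing dummy column.

```python
import pandas as pd

# Toy train/test frames; the category values are hypothetical
toy_train = pd.DataFrame({"RoofMatl": ["CompShg", "Metal", "CompShg"]})
toy_test = pd.DataFrame({"RoofMatl": ["CompShg", "WdShake"]})

# Encoding separately: each frame only gets columns for the levels it contains
sep_train = pd.get_dummies(toy_train)
sep_test = pd.get_dummies(toy_test)
print(list(sep_train.columns), list(sep_test.columns))

# Encoding the concatenated frame, then splitting by row count
combined = pd.get_dummies(pd.concat((toy_train, toy_test)))
enc_train = combined[:len(toy_train)]
enc_test = combined[len(toy_train):]
print(list(enc_train.columns) == list(enc_test.columns))
```

A model fitted on `sep_train` could not score `sep_test`, because the columns differ; the combined encoding gives both halves the same columns.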

In [None]:
all_data.shape

In [None]:
# drop all features with more than one missing value in train (noted above); this keeps Electrical
all_data = all_data.drop(columns=missing_vals[missing_vals > 1].index)
#all_data = all_data.drop(all_data.loc[all_data['Electrical'].isnull()].index)

# fill the few remaining numeric gaps (mostly in the test set) with column means
all_data = all_data.fillna(all_data.mean(numeric_only=True))

In [None]:
all_data.shape

In [None]:
all_data = pd.get_dummies(all_data)

In [None]:
# drop variables noted in EDA section
drop_me = ['GarageArea', '1stFlrSF', 'TotRmsAbvGrd']
all_data = all_data.drop(drop_me, axis=1)

In [None]:
# quick look under the hood
show_all(all_data.head())
print(all_data.shape)

In [None]:
# split concatenated data into train and test dataframes

y = np.log1p(train_df["SalePrice"])
train_df = train_df.drop('SalePrice', axis=1)
X_train = all_data[:train_df.shape[0]]
X_test = all_data[train_df.shape[0]:]


X_train.shape

**Build the Model**
>  We will attempt to apply the following models:
> * Linear Regression
> * Lasso Regression
> * Random Forests

In [None]:
# Root-Mean-Squared-Error (RMSE) evaluation metric
from sklearn.model_selection import cross_val_score

# from "Regularized Linear Models" w/ cross validation
def rmse_cv(model):
    rmse = np.sqrt(-cross_val_score(model, X_train, y, scoring="neg_mean_squared_error", cv=5))
    return rmse

In [None]:
# Linear Regression !
from sklearn.linear_model import LinearRegression

# avoid rebinding the name `linear_model`, which would shadow the sklearn module
lr_model = LinearRegression().fit(X_train, y)

rmse_cv(lr_model).mean()

In [None]:
# LassoCV !
from sklearn.linear_model import LassoCV
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

lasso_model = LassoCV(alphas = [1, 0.1, 0.001, 0.0005, 0.005, 0.0001, 0.5, 0.2]).fit(X_train, y)
rmse_cv(lasso_model).mean()
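
After fitting, it can be useful to check which alpha `LassoCV` selected and how many coefficients it kept. The sketch below uses synthetic data from `make_regression` rather than the housing features, so the numbers it prints say nothing about this competition; only the `alpha_` and `coef_` attributes carry over.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic regression problem: 30 features, only 5 carry signal
X, y_syn = make_regression(
    n_samples=200, n_features=30, n_informative=5,
    noise=5.0, random_state=0,
)

# Same candidate alphas as in the cell above
alphas = [1, 0.1, 0.001, 0.0005, 0.005, 0.0001, 0.5, 0.2]
model = LassoCV(alphas=alphas, cv=5).fit(X, y_syn)

print("chosen alpha:", model.alpha_)
print("non-zero coefficients:", int(np.sum(model.coef_ != 0)), "of", X.shape[1])
```

On the notebook's data the same two attributes are available as `lasso_model.alpha_` and `lasso_model.coef_`.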

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(random_state=42, max_depth = 6, n_jobs = 5)
rf_model.fit(X_train, y)

rmse_cv(rf_model).mean()

In [None]:
linear_pred = np.expm1(lr_model.predict(X_test))
lasso_pred = np.expm1(lasso_model.predict(X_test))

Submit!

In [None]:
lasso_pred.shape

In [None]:
#submit!
output = pd.DataFrame({"Id": test.Id, "SalePrice": lasso_pred})
output.to_csv("lasso_solution.csv", index = False)