# 10. Kaggle online competition: Supervised Learning

This is a perfect competition for data science students who have completed an online course in machine learning and are looking to expand their skill set before trying a featured competition. 

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview
 
![image](https://user-images.githubusercontent.com/43855029/156053760-007e3d08-3472-47e5-ba96-c07d8d3fa325.png)

_**Project description:**_

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home. 


For simplicity, I downloaded the data for you and put it here:
https://github.com/vuminhtue/SMU_Machine_Learning_Python/tree/master/data/house-prices

## 10.1 Understand the data

There are 4 files in this folder:
- train.csv: the training data with 1460 rows and 81 columns. The last column, "**SalePrice**", is the output and holds continuous values
- test.csv: the test data with 1459 rows and 80 columns. Note: there is no "**SalePrice**" column
- data_description.txt: contains information on all columns
- sample_submission.csv: shows the format in which to save the model predictions before uploading them to Kaggle for the competition

**Objective:**
- We will use the **train.csv** data to create the actual train/test sets and apply several algorithms to find the ML algorithm that works best with this data
- Once the model is built and trained, apply it to **test.csv** and write the output in the format of sample_submission.csv
- Write all analyses in ipynb format


## Step 1: Load data from Kaggle housing dataset

In [None]:
import pandas as pd
import numpy as np

In [None]:
df_train = pd.read_csv("https://raw.githubusercontent.com/vuminhtue/SMU_Machine_Learning_Python/master/data/house-prices/train.csv")

In [None]:
df_train.columns


In [None]:
y=df_train['SalePrice']

In [None]:
df_test = pd.read_csv("https://raw.githubusercontent.com/vuminhtue/SMU_Machine_Learning_Python/master/data/house-prices/test.csv")
df_test.columns

## Step 2: Select variables

- First, split the input data into numerical and categorical columns
- Visualize the input data
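As a minimal sketch of the numerical/categorical split, the toy frame below stands in for the housing data (it reuses three of its column names, but the values are made up). It assumes that categorical columns are stored as strings and selects on `"number"` rather than `"object"`, which is one possible way to express the same split.

```python
import pandas as pd

# Toy stand-in for the housing data (values are invented for illustration)
toy = pd.DataFrame({
    "OverallQual": [7, 6, 8],
    "GrLivArea": [1710, 1262, 1786],
    "KitchenQual": ["Gd", "TA", "Gd"],
})

# Numeric columns go to one frame, everything else is treated as categorical
toy_numerical = toy.select_dtypes(include="number")
toy_categorical = toy.select_dtypes(exclude="number")

print(list(toy_numerical.columns))
print(list(toy_categorical.columns))
```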



In [None]:
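# Remove every column that has at least one missing value in the test set,
# then keep the same columns in the training set (y already holds SalePrice)
df_test = df_test.dropna(axis=1)
df_train = df_train[df_test.columns]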
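Before dropping columns, it can help to see how many values are actually missing. The sketch below uses two small hypothetical frames (invented values, real column names) to count the gaps and then applies the same drop-and-align step. Note that the target column disappears from the training frame as well, which is why `y` was saved beforehand.

```python
import numpy as np
import pandas as pd

# Toy train/test frames (hypothetical values); LotFrontage has a gap in test
toy_train = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, 80.0, 68.0],
    "SalePrice": [208500, 181500, 223500],
})
toy_test = pd.DataFrame({
    "LotArea": [11622, 14267],
    "LotFrontage": [80.0, np.nan],
})

# Count missing values per column before dropping anything
missing_per_column = toy_test.isna().sum()
print(missing_per_column[missing_per_column > 0])

# Drop incomplete test columns, then keep the same columns in train
toy_test_clean = toy_test.dropna(axis=1)
toy_train_clean = toy_train[toy_test_clean.columns]
print(list(toy_train_clean.columns))
```

A single missing value is enough to remove a whole column with this approach, so it is worth checking the counts before deciding whether dropping or imputing is the better choice.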

In [None]:
df_train_numerical=df_train.select_dtypes(exclude=['object'])
df_train_categorical=df_train.select_dtypes(include=['object'])

df_test_numerical=df_test.select_dtypes(exclude=['object'])
df_test_categorical=df_test.select_dtypes(include=['object'])

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

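plt.figure(figsize=(20, 10))
# Compute the correlation matrix once and hide cells with |correlation| below 0.6
corr_numerical = df_test_numerical.corr()
sns.heatmap(corr_numerical, cmap='RdYlGn_r', annot=True, mask=(np.abs(corr_numerical) < 0.6))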
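With many columns the heatmap can be hard to read, so the strongly correlated pairs can also be listed as text. This is a sketch on synthetic data: the column names are borrowed from the dataset, the values are randomly generated, and the 0.6 threshold simply mirrors the mask used in the plot.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
area = rng.normal(1500, 400, size=200)
toy = pd.DataFrame({
    "GrLivArea": area,
    "1stFlrSF": area * 0.7 + rng.normal(0, 50, size=200),  # built to track GrLivArea
    "YrSold": rng.integers(2006, 2011, size=200),           # unrelated to the areas
})

corr = toy.corr()
cols = corr.columns

# Visit each pair once (upper triangle, diagonal skipped) and keep the strong ones
strong_pairs = [
    (cols[i], cols[j], corr.iloc[i, j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) >= 0.6
]

for first, second, value in strong_pairs:
    print("%s vs %s: %1.2f" % (first, second, value))
```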


## Step 6: How about categorical data?
Sometimes categorical data, such as a house's condition, also contributes meaningfully to the prediction.

In [None]:
# Merge categorical data with predictand
df_train_categorical = pd.concat([df_train_categorical,y],axis=1)

In [None]:
df_train_categorical.head()

#### Using One Hot Encoding:

In [None]:
df_train_categorical_ohe=pd.get_dummies(df_train_categorical,drop_first=True)
df_train_categorical_ohe.head()
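To see what `drop_first=True` does, here is a small sketch on a single toy column (the values are invented). The levels are sorted alphabetically and the first one is dropped, so a row with no active dummy column represents that dropped level.

```python
import pandas as pd

toy = pd.DataFrame({"KitchenQual": ["Gd", "TA", "Ex", "TA"]})

# Levels sort as Ex, Gd, TA; drop_first removes the "Ex" column,
# so the "Ex" row ends up with no active dummy column
toy_ohe = pd.get_dummies(toy, drop_first=True)
print(toy_ohe)
```

This is why column names such as `KitchenQual_Gd` appear later: each one is the original column name followed by one of its remaining levels.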

In [None]:

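plt.figure(figsize=(20, 10))
# Compute the correlation matrix once and show only cells with |correlation| above 0.5
corr_categorical = df_train_categorical_ohe.corr()
sns.heatmap(corr_categorical, cmap='RdYlGn_r', annot=True, mask=(np.abs(corr_categorical) <= 0.5))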
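Since the goal is to find dummy columns that relate to the price, ranking them by their absolute correlation with `SalePrice` can be easier to read than the full heatmap. The frame below is a hypothetical miniature of the data with invented values, so treat it as a sketch of the idea only.

```python
import pandas as pd

toy = pd.DataFrame({
    "KitchenQual": ["Gd", "TA", "Gd", "TA", "Gd", "TA"],
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Grvl", "Pave"],
    "SalePrice": [250000, 140000, 260000, 150000, 240000, 145000],
})

# One-hot encode, then cast to float so every column is numeric
toy_ohe = pd.get_dummies(toy, drop_first=True).astype(float)

# Absolute correlation of every dummy column with the target, strongest first
corr_with_price = (
    toy_ohe.drop(columns="SalePrice")
    .corrwith(toy_ohe["SalePrice"])
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_price)
```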



In [None]:
cate_selected = df_train_categorical_ohe[["KitchenQual_Gd","ExterQual_TA"]]

In [None]:
big_train = pd.concat([df_train_numerical,df_train_categorical_ohe],axis=1)

In [None]:
big_train.head()

In [None]:
# Peek at the last column of the merged frame (X is redefined below)
X = big_train.iloc[:,-1]

In [None]:
X

In [None]:
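# Assumed reconstruction of the undefined df_train2, mirroring the test-side columns below
df_train2 = pd.concat([cate_selected,df_train[["OverallQual","TotalBsmtSF","1stFlrSF","GrLivArea","GarageCars","GarageArea"]],y],axis=1)
X = df_train2.iloc[:,0:8]
y = df_train2.iloc[:,-1]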

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.6,random_state=123)

In [None]:
from sklearn.ensemble import RandomForestRegressor
model_RF = RandomForestRegressor(n_estimators=100).fit(X_train,y_train)
y_pred_RF = model_RF.predict(X_test)

In [None]:
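from sklearn import metrics
print("R2 using Random Forest is: %1.2f " % metrics.r2_score(y_test,y_pred_RF))
# Take the square root explicitly; the squared=False argument was deprecated and later removed in newer scikit-learn releases
print("RMSE using Random Forest is: %1.2f" % np.sqrt(metrics.mean_squared_error(y_test,y_pred_RF)))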
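To make the two scores concrete, the sketch below computes RMSE and R2 by hand with NumPy on four made-up prices. RMSE is the square root of the mean squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Hypothetical observed and predicted prices
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_hat = np.array([210000.0, 140000.0, 290000.0, 260000.0])

residuals = y_true - y_hat

# RMSE: typical size of the error, in the same unit as the price
rmse = np.sqrt(np.mean(residuals ** 2))

# R2: share of the variance in y_true explained by the predictions
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("RMSE: %1.2f, R2: %1.3f" % (rmse, r2))
```

Every prediction here is off by exactly 10000, so the RMSE is 10000 as well, while R2 stays close to 1 because the errors are small compared with the spread of the prices.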

In [None]:
# Build the test-set inputs from the same columns that were used for training

dftest_numerical=df_test.select_dtypes(exclude=['object'])
dftest_categorical=df_test.select_dtypes(include=['object'])

# Six numerical predictors
df_test1 = df_test[["OverallQual","TotalBsmtSF","1stFlrSF","GrLivArea","GarageCars","GarageArea"]]
dftest_categorical = dftest_categorical.dropna(axis=1)

# One-hot encode the categorical columns, as was done for the training set
dftest_categorical_ohe=pd.get_dummies(dftest_categorical,drop_first=True)


In [None]:
dftest_categorical_ohe.columns
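Checking the columns matters because `get_dummies` derives the levels from whatever frame it is given, so train and test can end up with different dummy columns when a level is missing from one of them. The sketch below shows the mismatch on toy data (invented values) and one possible remedy, which is to fix the levels from the training data with `pd.Categorical` before encoding. This is an illustration under those assumptions, not a step taken in the notebook.

```python
import pandas as pd

toy_train_cat = pd.DataFrame({"ExterQual": ["Gd", "TA", "Ex", "Fa"]})
toy_test_cat = pd.DataFrame({"ExterQual": ["TA", "Gd", "TA"]})  # "Ex" and "Fa" never occur

# Naive encoding: each frame derives its own levels, so the columns differ
naive_train_cols = list(pd.get_dummies(toy_train_cat, drop_first=True).columns)
naive_test_cols = list(pd.get_dummies(toy_test_cat, drop_first=True).columns)
print(naive_train_cols)
print(naive_test_cols)

# Fix the levels from the training data so both frames encode identically
levels = sorted(toy_train_cat["ExterQual"].unique())
train_fixed = toy_train_cat.assign(
    ExterQual=pd.Categorical(toy_train_cat["ExterQual"], categories=levels)
)
test_fixed = toy_test_cat.assign(
    ExterQual=pd.Categorical(toy_test_cat["ExterQual"], categories=levels)
)

toy_train_ohe = pd.get_dummies(train_fixed, drop_first=True)
toy_test_ohe = pd.get_dummies(test_fixed, drop_first=True)
print(list(toy_test_ohe.columns))
```

In the naive version the test frame drops `Gd` as its first level, so `ExterQual_Gd` does not exist there at all. With fixed levels both frames share the same columns and the `Gd` row is encoded correctly.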

In [None]:

# Select the two one-hot columns and join them with the numerical predictors
dftest_selected = dftest_categorical_ohe[["KitchenQual_Gd","ExterQual_TA"]]
df_test2 = pd.concat([dftest_selected,df_test1],axis=1)
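The remaining step from the objective is to predict on the test inputs and write the result in the layout of sample_submission.csv, which has two columns: `Id` and `SalePrice`. The sketch below shows this on synthetic data. The toy frames, the `model` variable and the price formula are all hypothetical; in the notebook the equivalent would presumably use `model_RF`, `df_test2` and the `Id` column of `df_test`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy training data (hypothetical): price grows with quality and living area
toy_X_train = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, size=100),
    "GrLivArea": rng.integers(600, 3000, size=100),
})
toy_y_train = 20000 * toy_X_train["OverallQual"] + 50 * toy_X_train["GrLivArea"]

# Toy test data with the same feature columns plus an Id column
toy_test = pd.DataFrame({
    "Id": [1461, 1462, 1463],
    "OverallQual": [5, 6, 8],
    "GrLivArea": [896, 1329, 2100],
})

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(toy_X_train, toy_y_train)

# Two-column layout of sample_submission.csv: Id, SalePrice
submission = pd.DataFrame({
    "Id": toy_test["Id"],
    "SalePrice": model.predict(toy_test[toy_X_train.columns]),
})

# Returns the CSV as text; pass a filename instead to write a file
csv_text = submission.to_csv(index=False)
print(csv_text)
```

The test features are selected with `toy_X_train.columns` so that they reach the model in the same order as during training.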

#### We can see that with the addition of categorical data as input, using the same Random Forest algorithm, we are able to obtain better results