<a href="https://colab.research.google.com/github/robimalco/colab/blob/main/House_prices_advanced_regression_techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Download files



- **SalePrice** - the property's sale price in dollars. This is the target variable that you're trying to predict.
- **1stFlrSF**: First Floor square feet
- **2ndFlrSF**: Second floor square feet
- **3SsnPorch**: Three season porch area in square feet
- **Alley**: Type of alley access
- **Bedroom**: Number of bedrooms above basement level
- **BldgType**: Type of dwelling
- **BsmtCond**: General condition of the basement
- **BsmtExposure**: Walkout or garden level basement walls
- **BsmtFinSF1**: Type 1 finished square feet
- **BsmtFinSF2**: Type 2 finished square feet
- **BsmtFinType1**: Quality of basement finished area
- **BsmtFinType2**: Quality of second finished area (if present)
- **BsmtFullBath**: Basement full bathrooms
- **BsmtHalfBath**: Basement half bathrooms
- **BsmtQual**: Height of the basement
- **BsmtUnfSF**: Unfinished square feet of basement area
- **CentralAir**: Central air conditioning
- **Condition1**: Proximity to main road or railroad
- **Condition2**: Proximity to main road or railroad (if a second is present)
- **Electrical**: Electrical system
- **EnclosedPorch**: Enclosed porch area in square feet
- **ExterCond**: Present condition of the material on the exterior
- **Exterior1st**: Exterior covering on house
- **Exterior2nd**: Exterior covering on house (if more than one material)
- **ExterQual**: Exterior material quality
- **Fence**: Fence quality
- **FireplaceQu**: Fireplace quality
- **Fireplaces**: Number of fireplaces
- **Foundation**: Type of foundation
- **FullBath**: Full bathrooms above grade
- **Functional**: Home functionality rating
- **GarageArea**: Size of garage in square feet
- **GarageCars**: Size of garage in car capacity
- **GarageCond**: Garage condition
- **GarageFinish**: Interior finish of the garage
- **GarageQual**: Garage quality
- **GarageType**: Garage location
- **GarageYrBlt**: Year garage was built
- **GrLivArea**: Above grade (ground) living area square feet
- **HalfBath**: Half baths above grade
- **Heating**: Type of heating
- **HeatingQC**: Heating quality and condition
- **HouseStyle**: Style of dwelling
- **Kitchen**: Number of kitchens
- **KitchenQual**: Kitchen quality
- **LandContour**: Flatness of the property
- **LandSlope**: Slope of property
- **LotArea**: Lot size in square feet
- **LotConfig**: Lot configuration
- **LotFrontage**: Linear feet of street connected to property
- **LotShape**: General shape of property
- **LowQualFinSF**: Low quality finished square feet (all floors)
- **MasVnrArea**: Masonry veneer area in square feet
- **MasVnrType**: Masonry veneer type
- **MiscFeature**: Miscellaneous feature not covered in other categories
- **MiscVal**: Dollar value of miscellaneous feature
- **MoSold**: Month Sold
- **MSSubClass**: The building class
- **MSZoning**: The general zoning classification
- **Neighborhood**: Physical locations within Ames city limits
- **OpenPorchSF**: Open porch area in square feet
- **OverallCond**: Overall condition rating
- **OverallQual**: Overall material and finish quality
- **PavedDrive**: Paved driveway
- **PoolArea**: Pool area in square feet
- **PoolQC**: Pool quality
- **RoofMatl**: Roof material
- **RoofStyle**: Type of roof
- **SaleCondition**: Condition of sale
- **SaleType**: Type of sale
- **ScreenPorch**: Screen porch area in square feet
- **Street**: Type of road access
- **TotalBsmtSF**: Total square feet of basement area
- **TotRmsAbvGrd**: Total rooms above grade (does not include bathrooms)
- **Utilities**: Type of utilities available
- **WoodDeckSF**: Wood deck area in square feet
- **YearBuilt**: Original construction date
- **YearRemodAdd**: Remodel date
- **YrSold**: Year Sold

# Configure and import

In [None]:
from google.colab import files
files.upload()

In [None]:
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c house-prices-advanced-regression-techniques

In [None]:
!pip install torch==1.7.0+cu101 torchvision==0.8.1+cu101 torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
!pip3 install torchvision

In [None]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

In [None]:
# !nvcc --version
# !python --version
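# Report the GPU name, or fall back to 'cpu' when no CUDA device is attached
torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'cpu'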

# Data Import

In [None]:
original_train_df = pd.read_csv('train.csv')
original_train_df['Source'] = 'train.csv'

original_test_df = pd.read_csv('test.csv')
original_test_df['Source'] = 'test.csv'

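# ignore_index=True gives every row a unique label (train and test both start at 0)
total_df = pd.concat([original_train_df, original_test_df], axis=0, ignore_index=True)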

In [None]:
numerical_columns = []
categorical_columns = []

for column in total_df.columns:
  if total_df.dtypes[column] == np.int64 or total_df.dtypes[column] == np.float64:
    numerical_columns.append(column)
  else:
    categorical_columns.append(column)

categorical_columns.remove('Source')
numerical_columns.remove('SalePrice')

# Data Exploration

In [None]:
list_of_numerics = total_df.select_dtypes(include=['float', 'int']).columns
# Correlate against SalePrice from the same frame so rows stay aligned; test rows (NaN price) are skipped
corrSalePrice = round(total_df[numerical_columns].corrwith(total_df['SalePrice']), 3) * 100
types = total_df.dtypes
missing = round((total_df.isnull().sum()/total_df.shape[0]),3)*100
overview = total_df.apply(
    lambda x: [
      round(x.min()), 
      round(x.max()), 
      round(x.mean()), 
      round(x.quantile(0.5))
    ] if x.name in list_of_numerics else x.unique())
outliers = total_df.apply(
    lambda x: sum(
        (x<(x.quantile(0.25)-1.5*(x.quantile(0.75)-x.quantile(0.25)))) | 
        (x>(x.quantile(0.75)+1.5*(x.quantile(0.75)-x.quantile(0.25)))) 
      if x.name in list_of_numerics else ''))
explore_df = pd.DataFrame({
  'Types': types,
  'CorrSalePrice%': corrSalePrice,
  'Missing%': missing,
  'Overview': overview,
  'Outliers': outliers
})
explore_df['Types'] = explore_df['Types'].astype(str)
explore_df.sort_values(by=['Missing%'], ascending=False).transpose()
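
The `Outliers` column above counts values that fall outside the usual 1.5 × IQR fences. A minimal sketch of that rule on a made-up series; the helper name `count_iqr_outliers` and the sample values are illustrative and not part of the notebook:

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]. Hypothetical helper."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(mask.sum())

# Made-up living areas; 9000 lies far above the upper fence.
sample = pd.Series([1200, 1350, 1500, 1600, 1750, 1900, 9000])
n_outliers = count_iqr_outliers(sample)
```

Naming the quartiles once also avoids recomputing `quantile` four times per column, as the one-line lambda does.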

In [None]:
# Plot Correlation Matrix

temp_df = total_df[numerical_columns]
f = plt.figure(figsize=(19, 15))
plt.matshow(temp_df.corr(), fignum=f.number)
plt.xticks(range(temp_df.shape[1]), temp_df.columns, fontsize=14, rotation=45)
plt.yticks(range(temp_df.shape[1]), temp_df.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)

# Preprocessing

In [None]:
# Manage missing values
# Based on the number of missing values in a column,
# decide whether it makes sense to create a dedicated category called "None",
# or whether it is simply better to fill with the mode, 0, or the mean()

for column in ['Alley', 'MasVnrType','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1', 'BsmtFinType2','FireplaceQu','GarageType','GarageFinish','GarageQual','GarageCond','PoolQC','Fence','MiscFeature']:
  total_df[column] = total_df[column].fillna('None')
for column in ['Electrical','MSZoning','Exterior1st','Exterior2nd','KitchenQual','SaleType','Functional', 'Utilities']:
  total_df[column] = total_df[column].fillna(total_df[column].mode()[0])
for column in ['MasVnrArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','BsmtFullBath','BsmtHalfBath', 'GarageYrBlt','GarageCars','GarageArea']:
  total_df[column] = total_df[column].fillna(0)
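for column in ['LotFrontage']:
  # Fill only the missing entries; assigning mean() directly would overwrite the whole column
  total_df[column] = total_df[column].fillna(total_df[column].mean())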
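
The fill strategies used in the cell above are easier to compare side by side on a tiny frame. A sketch with made-up values (the column names mirror the dataset, the rows do not come from it):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    'Alley': ['Grvl', np.nan, np.nan],         # NaN means "no alley"
    'Electrical': ['SBrkr', 'SBrkr', np.nan],  # rarely missing
    'GarageArea': [400.0, np.nan, 250.0],      # NaN means "no garage"
    'LotFrontage': [60.0, np.nan, 80.0],       # genuinely unknown
})

toy['Alley'] = toy['Alley'].fillna('None')                                  # dedicated category
toy['Electrical'] = toy['Electrical'].fillna(toy['Electrical'].mode()[0])   # most frequent value
toy['GarageArea'] = toy['GarageArea'].fillna(0)                             # absent feature has zero area
toy['LotFrontage'] = toy['LotFrontage'].fillna(toy['LotFrontage'].mean())   # column mean
```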

In [None]:
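# Remove outliers (training rows only; every test row is needed for the submission)

is_test = total_df['Source'] == 'test.csv'
total_df = total_df[is_test | (total_df['GrLivArea'] < 4000)]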

## Data Exploration

In [None]:
# GarageYrBlt
# MiscVal
# Id
# YrSold 
# BsmtHalfBath
# numerical_columns.remove('GarageYrBlt')
# numerical_columns.remove('MiscVal')
# numerical_columns.remove('Id')
# numerical_columns.remove('YrSold')
# numerical_columns.remove('BsmtHalfBath')

In [None]:
# Work on a real copy so the encoding below does not modify total_df
cp_total_df = total_df.copy()
# cp_total_df = cp_total_df.drop(['Id'], axis=1)
# cp_total_df = cp_total_df.drop(['GarageYrBlt', 'MiscVal', 'Id', 'YrSold', 'BsmtHalfBath'], axis=1)

In [None]:
pd.options.mode.chained_assignment = None

for column in categorical_columns:
  cp_total_df[column] = LabelEncoder().fit_transform(cp_total_df[column])

for column in categorical_columns:
  cp_total_df[column] = cp_total_df[column].astype('category')
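
`LabelEncoder` numbers the labels of a column in sorted order. A small sketch on made-up rows; keeping the fitted encoder around (an addition here, since the cell above discards it) allows mapping codes back to labels:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows for a single categorical column.
toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})

encoder = LabelEncoder()
toy['Street'] = encoder.fit_transform(toy['Street'])

codes = toy['Street'].tolist()                         # integer code per row
classes = list(encoder.classes_)                       # label for code 0, 1, ...
decoded = list(encoder.inverse_transform(toy['Street']))
```

The integer codes carry an arbitrary ordering, which a linear layer will read as a magnitude; one-hot encoding or an embedding layer are common alternatives.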

In [None]:
# Split back into train and test; drop() returns new frames instead of modifying slices in place
train_df = cp_total_df[cp_total_df['Source'] == 'train.csv']
train_output_df = train_df[['SalePrice']]
train_df = train_df.drop(columns='SalePrice')
test_df = cp_total_df[cp_total_df['Source'] == 'test.csv'].drop(columns='SalePrice')

In [None]:
def create_tensor(input_df):
  stack = []
  for column in categorical_columns:
    temp_stack = input_df[column].cat.codes.values
    stack.append(temp_stack)
  for column in numerical_columns:
    temp_stack = input_df[column].astype(np.float64)
    stack.append(temp_stack)
  return torch.tensor(np.stack(stack, 1), dtype=torch.float)


tensor_train = create_tensor(train_df).float()
tensor_output = torch.tensor(train_output_df.values).flatten().float()

tensor_test = create_tensor(test_df).float()

In [None]:
total_records_train = len(tensor_train)
test_records_train = int(total_records_train * 0.2)

tensor_train_data = tensor_train[:total_records_train-test_records_train]
tensor_train_output = tensor_output[:total_records_train-test_records_train]

tensor_validation_data = tensor_train[total_records_train-test_records_train:total_records_train]
tensor_validation_output = tensor_output[total_records_train-test_records_train:total_records_train]

In [None]:
class Model(nn.Module):
  def __init__(self):
    super().__init__()
    # Input width is the number of feature columns; no device lookup is needed here
    self.linear1 = nn.Linear(tensor_train_data.shape[1], 1000)
    self.linear2 = nn.Linear(1000, 500)
    self.linear3 = nn.Linear(500, 200)
    self.linear4 = nn.Linear(200, 1)
  def forward(self, x):
    # training=self.training lets model.eval() switch dropout off at inference
    y = self.linear1(x)
    y = torch.nn.functional.dropout(y, p=0.2, training=self.training)
    y = self.linear2(y)
    y = torch.nn.functional.dropout(y, p=0.2, training=self.training)
    y = self.linear3(y)
    y = torch.nn.functional.dropout(y, p=0.2, training=self.training)
    y = self.linear4(y)
    return y
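
The model above stacks linear layers without activation functions, so apart from dropout the four layers compose to a single affine map. A hedged sketch of one common variant, assuming ReLU activations and `nn.Dropout` modules (both are assumptions, not the notebook's choice); the class name `MLPRegressor` and the `n_features` parameter are illustrative:

```python
import torch
import torch.nn as nn

class MLPRegressor(nn.Module):
    """Hypothetical variant: same layer widths, with ReLU between layers."""
    def __init__(self, n_features: int, p_drop: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 1000), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(1000, 500), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(500, 200), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(200, 1),
        )

    def forward(self, x):
        return self.net(x)

torch.manual_seed(0)
sketch_model = MLPRegressor(n_features=10)   # 10 is an arbitrary width
batch = torch.randn(4, 10)                   # made-up inputs

sketch_model.eval()                          # nn.Dropout is inactive in eval mode
with torch.no_grad():
    out_a = sketch_model(batch)
    out_b = sketch_model(batch)
```

Because `nn.Dropout` follows the module's train/eval mode, two forward passes in eval mode return identical outputs.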

In [None]:
# Define the device first so it exists before the model is built and moved
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = Model()
loss_function = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
model.to(device)

In [None]:
def train_model(fold, epochs, x, y, aggregated_losses):
  for i in range(epochs):
    y_pred = model(x)
    loss = loss_function(y_pred.squeeze(), y)
    optimizer.zero_grad() # reset the gradients of all optimized parameters
    loss.backward() # compute the gradient of the loss with respect to every parameter
    optimizer.step() # update every parameter using its current gradient
    if i == epochs - 1:
      print("fold:", fold, "epoch: " + str(i) + "\tloss: " + str(loss.item()))
    # Store a detached CPU copy so the autograd graph is released and the losses can be plotted
    aggregated_losses.append(loss.detach().cpu())

In [None]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)
epochs = 300
aggregated_losses = []

for fold, (train_index, test_index) in enumerate(kf.split(tensor_train_data, tensor_train_output)):
  x_train_fold = tensor_train_data[train_index].to(device)
  y_train_fold = tensor_train_output[train_index].to(device)
  train_model(fold, epochs, x_train_fold, y_train_fold, aggregated_losses)
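
The loop above trains on each fold's training indices but never uses `test_index`. A minimal sketch of how the held-out part of each fold could be scored, using made-up targets and a placeholder mean predictor in place of the network:

```python
import numpy as np
from sklearn.model_selection import KFold

# Made-up targets; the "model" is just the training-fold mean (a placeholder).
y = np.arange(10, dtype=float)
X = y.reshape(-1, 1)
kf = KFold(n_splits=5)

fold_mse = []
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    prediction = y[train_index].mean()        # "fit" on the training part
    errors = y[test_index] - prediction       # score on the held-out part
    fold_mse.append(float(np.mean(errors ** 2)))

cv_mse = float(np.mean(fold_mse))             # average held-out error
```

For the held-out scores to be comparable, a fresh model would normally be created for every fold rather than continuing to train a single one.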

In [None]:
plt.figure(figsize=(10,5))
# Convert each stored loss to a plain float so matplotlib can plot it
plt.plot(range(len(aggregated_losses)), [float(loss) for loss in aggregated_losses])
plt.ylabel('Loss')
plt.xlabel('epoch');

In [None]:
# Overfitting if: training loss << validation loss
# Underfitting if: training loss and validation loss both stay high
# Just right if: training loss ~ validation loss, and both are low

model.eval() # put the model in evaluation mode
with torch.no_grad():
    x_validation = tensor_validation_data.to(device)
    y_validation = tensor_validation_output.to(device)
    y_val = model(x_validation)
    loss_validation = loss_function(y_val.squeeze(), y_validation)
print("Validation loss: ", str(loss_validation.item()))
print("Train Loss VS Validation loss: ", round(1 - float(aggregated_losses[-1]) / loss_validation.item(), 2) * 100)

In [None]:
float(aggregated_losses[-1])

In [None]:
# To reduce overfitting
#   Cross-validation: use the initial training data to generate multiple mini train-test splits.
#   Remove features: remove irrelevant input features, or aggregate them.
#   Early stopping: stop the training process before the learner degrades.
#   Regularization: add a penalty that grows with model complexity.
#   Ensembling: combine the predictions of multiple separate models.
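
Of the remedies listed above, early stopping is the simplest to add to a training loop. A minimal sketch, assuming a validation loss is recorded after every epoch; the function name, the `patience` parameter and the loss values are illustrative:

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop.

    Stops once the validation loss has failed to improve on its best
    value for `patience` consecutive epochs. Hypothetical helper.
    """
    best = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: all epochs were run

# Made-up validation curve: improves, then stalls above its best value.
history = [10.0, 8.0, 7.0, 7.5, 7.2, 7.1, 6.0]
stop_at = early_stopping_epoch(history, patience=3)
```

In practice the model weights from the best epoch are usually saved and restored once the stop triggers.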

In [None]:
# Make predictions
with torch.no_grad():
    x_test = tensor_test.to(device)
    y_pred = model(x_test)

In [None]:
# Move predictions to the CPU and convert to NumPy before building the DataFrame
submission_df = pd.DataFrame(y_pred.cpu().numpy(), columns=['SalePrice']).astype("float")

submission_df = pd.concat([original_test_df, submission_df], axis=1)

submission_df = submission_df[['Id', 'SalePrice']]

# submission_df[submission_df['Id'] == 2891]
submission_df

In [None]:
# Fallback value for a prediction left missing after the concat above
submission_df['SalePrice'] = submission_df['SalePrice'].fillna(244171.2813)

In [None]:
from google.colab import files
submission_df.to_csv('submission.csv', index=False)
files.download('submission.csv')
