**[Introduction to Machine Learning Home Page](https://www.kaggle.com/learn/intro-to-machine-learning)**

---


# Housing Prices from [Kaggle's Intro to ML](https://www.kaggle.com/learn/intro-to-machine-learning)


## Setup

In [None]:
# Code you have previously used to load data
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv") 
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex7 import *

# read train and test data files using pandas
train_data_file_path = '../input/train.csv'
original_home_data = pd.read_csv(train_data_file_path)

test_data_file_path = '../input/test.csv'
test_data = pd.read_csv(test_data_file_path)


train_data = original_home_data.reset_index(drop=True)

In [None]:
# correlations = train_data.corr(numeric_only=True)
# correlations.SalePrice.sort_values()

## Inspect the data and come back anytime to review

train_data.head()
# train_data.describe()

In [None]:
def home_data_handle_na(home_data):
    # Columns where 'NaN' means None
    none_cols = [
        'Alley', 'PoolQC', 'MiscFeature', 'Fence', 'FireplaceQu', 'GarageType',
        'GarageFinish', 'GarageQual', 'GarageCond', 'BsmtQual', 'BsmtCond',
        'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'MasVnrType'
    ]
    for col in none_cols:
        # Assign the result back; in-place replace on a selected column is unreliable in recent pandas
        home_data[col] = home_data[col].fillna('None')

    # Columns where 'NaN' means 0
    zero_cols = [
        'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath',
        'BsmtHalfBath', 'GarageYrBlt', 'GarageArea', 'GarageCars', 'MasVnrArea'
    ]
    for col in zero_cols:
        home_data[col] = home_data[col].fillna(0)

    # Columns where 'NaN' can be replaced with mode
    freq_cols = [
        'Electrical', 'Exterior1st', 'Exterior2nd', 'Functional', 'KitchenQual',
        'SaleType', 'Utilities'
    ]
    for col in freq_cols:
        home_data[col] = home_data[col].fillna(home_data[col].mode()[0])

    # Filling 'MSZoning' with the most common value for the same MSSubClass.
    # transform keeps the original row index, so the result aligns with home_data.
    home_data['MSZoning'] = home_data.groupby('MSSubClass')['MSZoning'].transform(
        lambda x: x.fillna(x.mode()[0]))

    # Filling 'LotFrontage' with the median value for the same Neighborhood.
    home_data['LotFrontage'] = home_data.groupby(
        'Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.median()))

    # Features that are stored as numbers but should be treated as categories:
    home_data['MSSubClass'] = home_data['MSSubClass'].astype(str)
    home_data['YrSold'] = home_data['YrSold'].astype(str)
    home_data['MoSold'] = home_data['MoSold'].astype(str)

    # Merging rare values (fewer than 10 occurrences) into a single 'Other' group.
    other_cols = [
        'Condition1', 'Condition2', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
        'Heating', 'Electrical', 'Functional', 'SaleType'
    ]
    for col in other_cols:
        counts = home_data[col].value_counts()
        mask = home_data[col].isin(counts[counts < 10].index)
        # .loc avoids chained assignment, which may silently write to a copy
        home_data.loc[mask, col] = 'Other'

    # Converting some of the categorical values to numeric ones.
    neigh_map = {
        'MeadowV': 1, 'IDOTRR': 1, 'BrDale': 1,
        'BrkSide': 2, 'OldTown': 2, 'Edwards': 2,
        'Sawyer': 3, 'Blueste': 3, 'SWISU': 3, 'NPkVill': 3, 'NAmes': 3,
        'Mitchel': 4,
        'SawyerW': 5, 'NWAmes': 5, 'Gilbert': 5, 'Blmngtn': 5,
        'CollgCr': 6, 'ClearCr': 6, 'Crawfor': 6,
        'Veenker': 7, 'Somerst': 7,
        'Timber': 8, 'StoneBr': 9, 'NridgHt': 10, 'NoRidge': 10
    }
    home_data['Neighborhood'] = home_data['Neighborhood'].map(neigh_map).astype(
        'int')
    ext_map = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
    home_data['ExterQual'] = home_data['ExterQual'].map(ext_map).astype('int')
    home_data['ExterCond'] = home_data['ExterCond'].map(ext_map).astype('int')
    bsm_map = {'None': 0, 'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
    home_data['BsmtQual'] = home_data['BsmtQual'].map(bsm_map).astype('int')
    home_data['BsmtCond'] = home_data['BsmtCond'].map(bsm_map).astype('int')
    bsmf_map = {
        'None': 0, 'Unf': 1, 'LwQ': 2, 'Rec': 3, 'BLQ': 4, 'ALQ': 5, 'GLQ': 6
    }
    home_data['BsmtFinType1'] = home_data['BsmtFinType1'].map(bsmf_map).astype('int')
    home_data['BsmtFinType2'] = home_data['BsmtFinType2'].map(bsmf_map).astype('int')
    heat_map = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
    home_data['HeatingQC'] = home_data['HeatingQC'].map(heat_map).astype('int')
    home_data['KitchenQual'] = home_data['KitchenQual'].map(heat_map).astype('int')
    home_data['FireplaceQu'] = home_data['FireplaceQu'].map(bsm_map).astype('int')
    home_data['GarageCond'] = home_data['GarageCond'].map(bsm_map).astype('int')
    home_data['GarageQual'] = home_data['GarageQual'].map(bsm_map).astype('int')

    return home_data


train_data = home_data_handle_na(train_data)
test_data = home_data_handle_na(test_data)

In [None]:
# Dropping outliers after detecting them by eye.

def home_data_handle_outliers(home_data):
    
    home_data = home_data.drop(home_data[(home_data['OverallQual'] < 5) & (home_data['SalePrice'] > 200000)].index)
    home_data = home_data.drop(home_data[(home_data['OverallQual'] > 7) & (home_data['SalePrice'] > 420000)].index)
    home_data = home_data.drop(home_data[(home_data['OverallQual'] > 8) & (home_data['SalePrice'] < 250000)].index)

    home_data = home_data.drop(home_data[(home_data['GrLivArea'] > 4000) & (home_data['SalePrice'] < 320000)].index)

    home_data = home_data.drop(home_data[(home_data['GarageArea'] > 1200) & (home_data['SalePrice'] < 200000)].index)

    home_data = home_data.drop(home_data[(home_data['TotalBsmtSF'] > 3000) & (home_data['SalePrice'] < 320000)].index)

    home_data = home_data.drop(home_data[(home_data['1stFlrSF'] < 3000) & (home_data['SalePrice'] > 640000)].index)
    home_data = home_data.drop(home_data[(home_data['1stFlrSF'] > 3000) & (home_data['SalePrice'] < 240000)].index)
    return home_data


train_data = home_data_handle_outliers(train_data)


In [None]:
def home_data_add_insights(home_data):
    # Creating new features based on previous observations. Some of them may be highly correlated; drop those if you want to.

    home_data['TotalSF'] = (
        home_data['BsmtFinSF1'] + home_data['BsmtFinSF2'] + 
        home_data['1stFlrSF'] + home_data['2ndFlrSF']
    )
    home_data['TotalBathrooms'] = (
        home_data['FullBath'] + home_data['BsmtFullBath'] + 
        (0.8 * (home_data['HalfBath'] + home_data['BsmtHalfBath']))
    )

    home_data['TotalPorchSF'] = (
        home_data['OpenPorchSF'] + home_data['3SsnPorch'] + home_data['EnclosedPorch'] + 
        home_data['ScreenPorch'] + home_data['WoodDeckSF']
    )

    # home_data['YearBlRm'] = (home_data['YearBuilt'] + home_data['YearRemodAdd'])

    # Merging quality and conditions.

    home_data['TotalExtQual'] = (home_data['ExterQual'] + home_data['ExterCond'])
    home_data['TotalBsmQual'] = (
        home_data['BsmtQual'] + home_data['BsmtCond'] + home_data['BsmtFinType1'] + home_data['BsmtFinType2']
    )
    home_data['TotalGrgQual'] = (home_data['GarageQual'] + home_data['GarageCond'])
    home_data['TotalQual'] = (
        home_data['OverallQual'] + home_data['TotalExtQual'] + 
        home_data['TotalBsmQual'] + home_data['TotalGrgQual'] + 
        home_data['KitchenQual'] + home_data['HeatingQC']
    )

    # Creating new features by using the new quality indicators.

    home_data['QualGr'] = home_data['TotalQual'] * home_data['GrLivArea']
    home_data['QualBsm'] = home_data['TotalBsmQual'] * (home_data['BsmtFinSF1'] + home_data['BsmtFinSF2'])
    home_data['QualPorch'] = home_data['TotalExtQual'] * home_data['TotalPorchSF']
    home_data['QualExt'] = home_data['TotalExtQual'] * home_data['MasVnrArea']
    home_data['QualGrg'] = home_data['TotalGrgQual'] * home_data['GarageArea']
    home_data['QlLivArea'] = (home_data['GrLivArea'] - home_data['LowQualFinSF']) * (home_data['TotalQual'])
    home_data['QualSFNg'] = home_data['QualGr'] * home_data['Neighborhood']

    return home_data


train_data = home_data_add_insights(train_data)
test_data = home_data_add_insights(test_data)

In [None]:
def home_data_add_bool_insights(home_data):
    home_data['HasPool'] = home_data['PoolArea'].apply(lambda x: 1 if x > 0 else 0)
    home_data['Has2ndFloor'] = home_data['2ndFlrSF'].apply(lambda x: 1 if x > 0 else 0)
    home_data['HasGarage'] = home_data['QualGrg'].apply(lambda x: 1 if x > 0 else 0)
    home_data['HasBsmt'] = home_data['QualBsm'].apply(lambda x: 1 if x > 0 else 0)
    home_data['HasFireplace'] = home_data['Fireplaces'].apply(lambda x: 1 if x > 0 else 0)
    home_data['HasPorch'] = home_data['QualPorch'].apply(lambda x: 1 if x > 0 else 0)
    return home_data

train_data = home_data_add_bool_insights(train_data)
test_data = home_data_add_bool_insights(test_data)

In [None]:
def home_data_drop_features(home_data):
    # Dropping features.
    home_data.drop(columns=[
        'Utilities',
        'PoolQC',
        'YrSold',
        'MoSold',
        'ExterQual',
        'BsmtQual',
        'GarageQual',
        'KitchenQual',
        'HeatingQC',
    ], inplace=True)

    return home_data


train_data = home_data_drop_features(train_data)
test_data = home_data_drop_features(test_data)

In [None]:
print(f'Number of missing values in train_data: {train_data.isna().sum().sum()}')
print(f'Number of missing values in test_data: {test_data.isna().sum().sum()}')

In [None]:
# Create target object and call it y
y = train_data['SalePrice']
y.dropna(inplace=True)

# Create X
features = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', '1stFlrSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt', 'YearRemodAdd', 'WoodDeckSF', 'ScreenPorch']
features += ['TotalSF', 'TotalBathrooms', 'TotalPorchSF', 'TotalExtQual', 'TotalBsmQual', 'TotalGrgQual', 'TotalQual'] 
features += ['QualGr', 'QualBsm', 'QualPorch', 'QualExt', 'QualGrg', 'QlLivArea', 'QualSFNg'] 
features += ['HasPool', 'Has2ndFloor', 'HasGarage', 'HasBsmt', 'HasFireplace', 'HasPorch']
X = train_data[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE when not specifying max_leaf_nodes: {:,.0f}".format(val_mae))

# Using best value for max_leaf_nodes
iowa_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=1)
iowa_model.fit(train_X, train_y)
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE for best value of max_leaf_nodes: {:,.0f}".format(val_mae))

# Define the model. Set random_state to 1
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)
rf_val_predictions = rf_model.predict(val_X)
rf_val_mae = mean_absolute_error(val_y, rf_val_predictions)

print("Validation MAE for Random Forest Model: {:,.0f}".format(rf_val_mae))
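The cell above uses `max_leaf_nodes=100` as the best value without showing how it was chosen. A minimal sketch of one way to pick it is to compare validation MAE across a few candidate sizes. The data here is synthetic, and `get_mae`, `candidate_sizes`, and the `demo_*` names are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree with the given leaf budget and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))


# Synthetic stand-in data (an assumption, not the Iowa housing data).
rng = np.random.default_rng(0)
demo_X = rng.uniform(0, 10, size=(400, 3))
demo_y = 50 * demo_X[:, 0] + 20 * demo_X[:, 1] ** 2 + rng.normal(0, 25, size=400)

tr_X, va_X, tr_y, va_y = train_test_split(demo_X, demo_y, random_state=1)

# Score each candidate tree size on the held-out validation split.
candidate_sizes = [5, 25, 50, 100, 250]
scores = {size: get_mae(size, tr_X, va_X, tr_y, va_y) for size in candidate_sizes}
best_size = min(scores, key=scores.get)

print(scores)
print("Best max_leaf_nodes on the demo data:", best_size)
```

To try this on the housing data, pass `train_X, val_X, train_y, val_y` from the cell above to `get_mae` instead of the demo split.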

In [None]:
train_data.describe()

# Creating a Model For the Competition

Build a Random Forest model and train it on all of **X** and **y**.

In [None]:
# To improve accuracy, create a new Random Forest model which you will train on all training data
rf_model_on_full_data = RandomForestRegressor(random_state=1)

# fit rf_model_on_full_data on all data from the training data
rf_model_on_full_data.fit(X, y)

# Make Predictions
The "test" data file was already read in Setup. Apply your model to it to make predictions.

In [None]:

# create test_X which comes from test_data but includes only the columns you used for prediction.
# The list of columns is stored in a variable called features
test_X = test_data[features]

# make predictions which we will submit, using the model trained on all training data
test_preds = rf_model_on_full_data.predict(test_X)

# The lines below save the predictions in the format used for competition scoring

output = pd.DataFrame({'Id': test_data.Id,
                      'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)

Before submitting, run a check to make sure your `test_preds` have the right format.
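One possible shape for such a check is sketched below. It assumes the submission needs exactly the `Id` and `SalePrice` columns built above, one row per test `Id`, and no missing or non-positive prices. The helper name `check_submission` and the demo values are hypothetical.

```python
import pandas as pd


def check_submission(output, expected_ids):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(output.columns) != ['Id', 'SalePrice']:
        problems.append("columns should be exactly ['Id', 'SalePrice']")
        return problems
    if list(output['Id']) != list(expected_ids):
        problems.append("Id values differ from the test set")
    if output['SalePrice'].isna().any():
        problems.append("SalePrice contains missing values")
    if (output['SalePrice'] <= 0).any():
        problems.append("SalePrice contains non-positive values")
    return problems


# Made-up demo values, for illustration only.
demo_ids = [1, 2, 3]
good = pd.DataFrame({'Id': demo_ids, 'SalePrice': [120000.0, 155000.5, 189900.0]})
bad = pd.DataFrame({'Id': demo_ids, 'SalePrice': [120000.0, None, -5.0]})

print(check_submission(good, demo_ids))
print(check_submission(bad, demo_ids))
```

In the notebook itself, the equivalent call would be `check_submission(output, test_data.Id)`.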

# Test Your Work

To test your results, you'll need to join the competition (if you haven't already).  So open a new window by clicking on [this link](https://www.kaggle.com/c/home-data-for-ml-course).  Then click on the **Join Competition** button.

![join competition image](https://i.imgur.com/wLmFtH3.png)

Next, follow the instructions below:
1. Begin by clicking on the blue **Save Version** button in the top right corner of the window.  This will generate a pop-up window.  
2. Ensure that the **Save and Run All** option is selected, and then click on the blue **Save** button.
3. This generates a window in the bottom left corner of the notebook.  After it has finished running, click on the number to the right of the **Save Version** button.  This pulls up a list of versions on the right of the screen.  Click on the ellipsis **(...)** to the right of the most recent version, and select **Open in Viewer**.  This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
4. Click on the **Output** tab on the right of the screen.  Then, click on the blue **Submit** button to submit your results to the leaderboard.

You have now successfully submitted to the competition!

If you want to keep working to improve your performance, select the blue **Edit** button in the top right of the screen. Then you can change your code and repeat the process. There's a lot of room to improve, and you will climb up the leaderboard as you work.


# Continuing Your Progress
There are many ways to improve your model, and **experimenting is a great way to learn at this point.**

The best way to improve your model is to add features.  Look at the list of columns and think about what might affect home prices.  Some features will cause errors because of issues like missing values or non-numeric data types. 
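Before adding a column to `features`, it can help to see which columns would cause those errors. The sketch below uses a small made-up frame (the column names are borrowed from the housing data, the values are not) and flags columns with missing values or non-numeric types.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the values are invented.
demo = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    'LotFrontage': [65.0, np.nan, 68.0],
    'Street': ['Pave', 'Pave', 'Grvl'],
})

# Columns that contain at least one missing value.
missing_cols = [col for col in demo.columns if demo[col].isna().any()]

# Columns whose dtype is not numeric (for example, text categories).
non_numeric_cols = demo.select_dtypes(exclude='number').columns.tolist()

# Columns that could be passed to the model without further preparation.
ready_cols = [
    col for col in demo.columns
    if col not in missing_cols and col not in non_numeric_cols
]

print("Missing values:", missing_cols)
print("Non-numeric:", non_numeric_cols)
print("Ready to use:", ready_cols)
```

Running the same three steps on `train_data` would show which candidate features need imputation or encoding first.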

The **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** micro-course will teach you how to handle these types of features. You will also learn to use **xgboost**, a technique that often gives even better accuracy than Random Forest.


# Other Micro-Courses
The **[Pandas](https://kaggle.com/Learn/Pandas)** micro-course will give you the data manipulation skills to quickly go from conceptual idea to implementation in your data science projects. 

You are also ready for the **[Deep Learning](https://kaggle.com/Learn/Deep-Learning)** micro-course, where you will build models with better-than-human level performance at computer vision tasks.

*Have questions or comments? Visit the [Learn Discussion forum](https://www.kaggle.com/learn-forum/161285) to chat with other Learners.*