# Introduction: House Prices - Advanced Regression Techniques

This notebook is intended for those who are new to machine learning competitions or want a gentle introduction to the problem. I purposely avoid jumping into complicated models or joining together lots of data in order to show the basics of how to get started in machine learning! Any comments or suggestions are much appreciated.

In this notebook, we will take an initial look at the House Prices machine learning competition currently hosted on Kaggle. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

Goal: to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable. 


I hope you like this notebook. I have also prepared more notebooks for this competition, and I will be glad to share them with you:

1. [COMPREHENSIVE DATA EXPLORATION WITH PYTHON](https://www.kaggle.com/andrej0marinchenko/comprehensive-data-exploration-with-python-upd)
2. [Data ScienceTutorial for Beginners ](https://www.kaggle.com/andrej0marinchenko/data-sciencetutorial-for-beginners-house-prices)
3. [House Price Calculation methods for beginnners](https://www.kaggle.com/andrej0marinchenko/house-price-calculation-methods-for-beginnners)
4. [Start: Introduction for beginners ](https://www.kaggle.com/andrej0marinchenko/start-introduction-for-beginners-house-prices)
5. [EDA + Data Analytics For beginners](https://www.kaggle.com/andrej0marinchenko/eda-data-analytics-for-beginners-house-prices)
6. [1 step for beginners linear model](https://www.kaggle.com/andrej0marinchenko/1-step-for-beginners-linear-model-house-prices)
7. [Universal notebook 4 data analysis](https://www.kaggle.com/andrej0marinchenko/universal-notebook-4-data-analysis)


# Data

The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset. 

File descriptions
- train.csv - the training set
- test.csv - the test set

Moreover, we are provided with the definitions of all the columns (in `data_description.txt`) and an example of the expected submission file `sample_submission.csv`. 

In this notebook, we will work only with `train.csv` and `test.csv` and keep the modeling simple. Seriously competing will take more work, but this will let us establish a baseline that we can then improve upon. With these projects, it's best to build up an understanding of the problem a little at a time rather than diving all the way in and getting completely lost!

## Metric: RMSE on Log Prices

Once we have a grasp of the data, we need to understand the metric by which our submission is judged. House Prices is a regression problem, so classification metrics such as ROC AUC do not apply. Instead, submissions are evaluated on the Root Mean Squared Error (RMSE) between the logarithm of the predicted price and the logarithm of the observed sale price.

The RMSE may sound intimidating, but it is relatively straightforward: take the difference between each predicted and observed value, square it, average the squares over all houses, and take the square root. The metric is never negative, and a better model scores lower. A perfect model would score 0.

Taking logarithms before computing the error means that errors in predicting expensive houses and cheap houses affect the result equally. Without the logs, a 10% miss on a $500,000 house would count far more than a 10% miss on a $100,000 house; with them, both count the same.
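Because this competition is a regression problem, Kaggle scores submissions by the RMSE between the logarithms of the predicted and observed prices. The snippet below is a minimal sketch of that calculation; the function name `rmse_log` and the example prices are made up for illustration.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between the logarithms of observed and predicted prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Hypothetical prices, for illustration only
observed = [200000, 150000, 320000]
predicted = [210000, 140000, 300000]

score = rmse_log(observed, predicted)
print('RMSE on log prices: %.4f' % score)
```

Because the error is computed on logarithms, a 10% miss costs the same on a cheap house as on an expensive one.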

Now that we know the background of the data we are using and the metric we will be judged on, let's get into exploring the data. In this notebook, as mentioned previously, we will stick to the main data sources and simple models which we can build upon in future work.

## Imports

We are using a typical data science stack: `numpy`, `pandas`, `sklearn`, `matplotlib`, and `seaborn`.

In [None]:
# numpy and pandas for data manipulation
import numpy as np
import pandas as pd 

# sklearn preprocessing for dealing with categorical variables
from sklearn.preprocessing import LabelEncoder

# File system management
import os

# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')

# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns

## Read in Data 

First, we can list all the available data files. There are a total of 4 files: 1 main file for training (with the target), 1 main file for testing (without the target), 1 example submission file, and 1 text file describing each column.

In [None]:
# List files available
print(os.listdir("../input/house-prices-advanced-regression-techniques/"))

In [None]:
# Training data
app_train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
print('Training data shape: ', app_train.shape)
app_train.head()

The training data has 1460 observations (each one a separate house) and 81 columns. Why do we speak of 79 features if there are 81 columns? Because two of the columns are not explanatory variables: `Id`, the number of the sold house in the data table, and `SalePrice`, the label we want to predict.

In [None]:
# Testing data features
app_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
print('Testing data shape: ', app_test.shape)
app_test.head()

The test set is about the same size (1459 observations) and lacks the `SalePrice` column.

# Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an open-ended process where we calculate statistics and make figures to find trends, anomalies, patterns, or relationships within the data. The goal of EDA is to learn what our data can tell us. It generally starts out with a high level overview, then narrows in to specific areas as we find intriguing areas of the data. The findings may be interesting in their own right, or they can be used to inform our modeling choices, such as by helping us decide which features to use.

## Examine the Distribution of the Target Column

The target is what we are asked to predict: `SalePrice`.

In [None]:
app_train['SalePrice'].value_counts()

In [None]:
app_train['SalePrice'].astype(int).plot.hist();

From this information, we see that the distribution of `SalePrice` is right-skewed: the number of houses sold under $250,000 is much higher than the rest, with a long tail of expensive houses. This is a regression problem, so there are no classes to balance, but once we get into more sophisticated machine learning models, we can account for the skew, for example by modeling the logarithm of the price.
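One common way to tame a long right tail in the target is a log transform. The snippet below is a minimal sketch on synthetic prices, which stand in for `app_train['SalePrice']` (an assumption made so that the example runs on its own); it compares the skewness before and after the transform.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices stand in for app_train['SalePrice'] (an assumption)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000), name='SalePrice')

# log1p compresses the long right tail of the distribution
log_prices = np.log1p(prices)

skew_raw = prices.skew()
skew_log = log_prices.skew()
print('Skewness before: %.2f, after: %.2f' % (skew_raw, skew_log))

# expm1 inverts log1p, so predictions made on the log scale can be converted back
recovered = np.expm1(log_prices)
```

A model trained on the transformed target predicts log prices, so its output has to be passed through `np.expm1` before submission.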

## Examine Missing Values

Next we can look at the number and percentage of missing values in each column. 

In [None]:
# Function to calculate missing values by column
def missing_values_table(df):
        # Total missing values
        mis_val = df.isnull().sum()
        
        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        # Print some summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        # Return the dataframe with missing information
        return mis_val_table_ren_columns

In [None]:
# Missing values statistics
missing_values = missing_values_table(app_train)
missing_values.head(20)

When it comes time to build our machine learning models, we will have to fill in these missing values (known as imputation). In later work, we will use models such as XGBoost that can [handle missing values with no need for imputation](https://stats.stackexchange.com/questions/235489/xgboost-can-handle-missing-data-in-the-forecasting-phase). Another option would be to drop columns with a high percentage of missing values, although it is impossible to know ahead of time if these columns will be helpful to our model. Therefore, we will keep all of the columns for now.
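As a minimal sketch of imputation, the snippet below fills missing values with the column medians learned from the training data. The two small frames are hypothetical; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; the values are made up for illustration
train = pd.DataFrame({'LotFrontage': [60.0, np.nan, 80.0, 70.0],
                      'MasVnrArea': [0.0, 100.0, np.nan, 50.0]})
test = pd.DataFrame({'LotFrontage': [np.nan, 65.0],
                     'MasVnrArea': [20.0, np.nan]})

# Learn the medians from the training data only
medians = train.median()

# Fill both sets with the training medians so no test information leaks in
train_filled = train.fillna(medians)
test_filled = test.fillna(medians)

print(test_filled)
```

Reusing the training medians for the test set keeps the two sets consistent, which is the same idea as fitting an imputer on the training data and then transforming both.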

## Column Types

Let's look at the number of columns of each data type. `int64` and `float64` are numeric variables ([which can be either discrete or continuous](https://stats.stackexchange.com/questions/206/what-is-the-difference-between-discrete-data-and-continuous-data)). `object` columns contain strings and are [categorical features](http://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/supporting-topics/basics/what-are-categorical-discrete-and-continuous-variables/).

In [None]:
# Number of each type of column
app_train.dtypes.value_counts()

Let's now look at the number of unique entries in each of the `object` (categorical) columns.

In [None]:
# Number of unique classes in each object column
app_train.select_dtypes('object').apply(pd.Series.nunique, axis = 0)

Most of the categorical variables have a relatively small number of unique entries. We will need to find a way to deal with these categorical variables! 

## Encoding Categorical Variables

Before we go any further, we need to deal with pesky categorical variables.  A machine learning model unfortunately cannot deal with categorical variables (except for some models such as [LightGBM](http://lightgbm.readthedocs.io/en/latest/Features.html)). Therefore, we have to find a way to encode (represent) these variables as numbers before handing them off to the model. There are two main ways to carry out this process:

* Label encoding: assign an integer to each unique category in a categorical variable. No new columns are created. An example is shown below

![image](https://raw.githubusercontent.com/WillKoehrsen/Machine-Learning-Projects/master/label_encoding.png)

* One-hot encoding: create a new column for each unique category in a categorical variable. Each observation receives a 1 in the column for its corresponding category and a 0 in all other new columns.

![image](https://raw.githubusercontent.com/WillKoehrsen/Machine-Learning-Projects/master/one_hot_encoding.png)

The problem with label encoding is that it gives the categories an arbitrary ordering. The value assigned to each of the categories does not reflect any inherent aspect of the category. In the example above, programmer receives a 4 and data scientist a 1, but with a different assignment scheme the labels could be reversed or completely different. Therefore, when we perform label encoding, the model might use the relative value of the feature (for example programmer = 4 and data scientist = 1) to assign weights, which is not what we want. If we only have two unique values for a categorical variable (such as Male/Female), then label encoding is fine, but for more than 2 unique categories, one-hot encoding is the safe option.

There is some debate about the relative merits of these approaches, and some models can deal with label encoded categorical variables with no issues. [Here is a good Stack Exchange discussion](https://datascience.stackexchange.com/questions/9443/when-to-use-one-hot-encoding-vs-labelencoder-vs-dictvectorizor). I think (and this is just a personal opinion) for categorical variables with many classes, one-hot encoding is the safest approach because it does not impose arbitrary values on categories. The only downside to one-hot encoding is that the number of features (dimensions of the data) can explode with categorical variables with many categories. To deal with this, we can perform one-hot encoding followed by [PCA](http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf) or other [dimensionality reduction methods](https://www.analyticsvidhya.com/blog/2015/07/dimension-reduction-methods/) to reduce the number of dimensions (while still trying to preserve information).

In this notebook, we will use Label Encoding for any categorical variables with only 2 categories and One-Hot Encoding for any categorical variables with more than 2 categories. This process may need to change as we get further into the project, but for now, we will see where this gets us. (We will also not use any dimensionality reduction in this notebook but will explore in future iterations).
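To make the two encodings concrete, here is a minimal sketch on a hypothetical toy column. It uses pandas category codes as a stand-in for Scikit-Learn's `LabelEncoder` (both number the categories in sorted order) so that the example needs only pandas.

```python
import pandas as pd

# Hypothetical toy column; the values are made up for illustration
df = pd.DataFrame({'ExterQual': ['TA', 'Gd', 'Ex', 'TA', 'Fa']})

# Label encoding: one integer per category (sorted order), no new columns
df['ExterQual_label'] = df['ExterQual'].astype('category').cat.codes

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df['ExterQual'], prefix='ExterQual')

print(df)
print(one_hot)
```

The integer codes imply an order (Ex < Fa < Gd < TA) that has nothing to do with quality, which is exactly the arbitrary ordering discussed above; the one-hot columns carry no such order.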

### Label Encoding and One-Hot Encoding

Let's implement the policy described above: for any categorical variable (`dtype == object`) with 2 unique categories, we will use label encoding, and for any categorical variable with more than 2 unique categories, we will use one-hot encoding. 

For label encoding, we use the Scikit-Learn `LabelEncoder` and for one-hot encoding, the pandas `get_dummies(df)` function.

In [None]:
# Dictionary of fitted label encoders, keyed by column name
le = dict()
le_count = 0

def encode_transform(app, col, le):
    # Replace the column with its integer codes from the fitted encoder
    app[col] = le[col].transform(app[col])
    return app, col, le

    


# Iterate through the columns
for col in app_train:
#     print(col)
    if app_train[col].dtype == 'object':        
        # If 2 or fewer unique categories
        if len(list(app_train[col].unique())) <= 2:
            le[col] = LabelEncoder()
            # Train on the training data
            le[col].fit(app_train[col])
            encode_transform(app_train, col, le)
#             encode_transform(app_test, col, le)
            
            # Keep track of how many columns were label encoded
            le_count += 1
            
print('%d columns were label encoded.' % le_count)

In [None]:
# Apply the encoders fitted on the training data to the testing data
le_test_count = 0
for col, encoder in le.items():
    # LabelEncoder cannot transform missing values it has not seen, so fill
    # them with the most frequent training category first (a simple assumption)
    most_frequent = encoder.inverse_transform([app_train[col].mode()[0]])[0]
    app_test[col] = encoder.transform(app_test[col].fillna(most_frequent))
    
    # Keep track of how many columns were label encoded
    le_test_count += 1
            
print('%d columns were label encoded in the testing data.' % le_test_count)

In [None]:
# one-hot encoding of categorical variables
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)

### Aligning Training and Testing Data

There need to be the same features (columns) in both the training and testing data. One-hot encoding has created more columns in the training data because there were some categorical variables with categories not represented in the testing data. To remove the columns in the training data that are not in the testing data, we need to `align` the dataframes. First we extract the target column from the training data (because this is not in the testing data but we need to keep this information). When we do the align, we must make sure to set `axis = 1` to align the dataframes based on the columns and not on the rows!

In [None]:
train_labels = app_train['SalePrice']

# Align the training and testing data, keep only columns present in both dataframes
app_train, app_test = app_train.align(app_test, join = 'inner', axis = 1)

# Add the target back in
app_train['SalePrice'] = train_labels

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)

The training and testing datasets now have the same features which is required for machine learning. The number of features has grown significantly due to one-hot encoding. At some point we probably will want to try [dimensionality reduction (removing features that are not relevant)](https://en.wikipedia.org/wiki/Dimensionality_reduction) to reduce the size of the datasets.
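The alignment step is easier to follow on a tiny example. The sketch below uses two hypothetical frames in which the training data has a one-hot column (`Roof_Metal`) that never occurs in the testing data.

```python
import pandas as pd

# Hypothetical frames; Roof_Metal exists only in the training data
train = pd.DataFrame({'GrLivArea': [1500, 1200],
                      'Roof_Metal': [1, 0],
                      'Roof_Tile': [0, 1],
                      'SalePrice': [200000, 150000]})
test = pd.DataFrame({'GrLivArea': [1300], 'Roof_Tile': [1]})

# Set the target aside, because the inner join would drop it
labels = train['SalePrice']

# Keep only the columns present in both frames (axis=1 aligns on columns)
train_aligned, test_aligned = train.align(test, join='inner', axis=1)

# Add the target back to the training data
train_aligned['SalePrice'] = labels

print(train_aligned.columns.tolist())
print(test_aligned.columns.tolist())
```

After the inner join both frames share the same feature columns, and `Roof_Metal` is gone from the training data.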

## Back to Exploratory Data Analysis

### Anomalies

One problem we always want to be on the lookout for when doing EDA is anomalies within the data. These may be due to mis-typed numbers, errors in measuring equipment, or they could be valid but extreme measurements. One way to detect anomalies quantitatively is by looking at the statistics of a column using the `describe` method. As an example, we look at the `YrSold` column, which records the year in which each house was sold:

In [None]:
app_train['YrSold'].describe()

Those years look reasonable. There are no outliers on either the high or low end.

### Correlations

Now that we have dealt with the categorical variables and the outliers, let's continue with the EDA. One way to try and understand the data is by looking for correlations between the features and the target. We can calculate the Pearson correlation coefficient between every variable and the target using the `.corr` dataframe method.

The correlation coefficient is not the greatest method to represent "relevance" of a feature, but it does give us an idea of possible relationships within the data. Some [general interpretations of the absolute value of the correlation coefficient](http://www.statstutor.ac.uk/resources/uploaded/pearsons.pdf) are:


* .00-.19 “very weak”
* .20-.39 “weak”
* .40-.59 “moderate”
* .60-.79 “strong”
* .80-1.0 “very strong”
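The scale above can be turned into a small helper. This is a sketch: the function name `correlation_strength` and the toy quality and price values are made up for illustration, and the Pearson coefficient is computed with `np.corrcoef`.

```python
import numpy as np

def correlation_strength(r):
    """Map a correlation coefficient to the verbal scale listed above."""
    magnitude = abs(r)
    if magnitude < 0.20:
        return 'very weak'
    if magnitude < 0.40:
        return 'weak'
    if magnitude < 0.60:
        return 'moderate'
    if magnitude < 0.80:
        return 'strong'
    return 'very strong'

# Hypothetical toy data: quality ratings and prices in thousands of dollars
quality = np.array([5, 6, 7, 8, 9])
price = np.array([120, 150, 160, 230, 300])

# Pearson correlation coefficient between the two variables
r = np.corrcoef(quality, price)[0, 1]
print('r = %.3f (%s)' % (r, correlation_strength(r)))
```

The sign of `r` gives the direction of the relationship, while the label depends only on its absolute value.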


In [None]:
# Find correlations with the target and sort
correlations = app_train.corr()['SalePrice'].sort_values()

# Display correlations
print('Most Positive Correlations:\n', correlations.tail(15))
print('\nMost Negative Correlations:\n', correlations.head(15))

Let's take a look at some of the more significant correlations: `OverallQual` has the most positive correlation (except for `SalePrice` itself, because the correlation of a variable with itself is always 1!). Looking at the documentation, `OverallQual` rates the overall material and finish of the house.

OverallQual: Rates the overall material and finish of the house

       10	Very Excellent
       9	Excellent
       8	Very Good
       7	Good
       6	Above Average
       5	Average
       4	Below Average
       3	Fair
       2	Poor
       1	Very Poor

In [None]:
# Find the correlation between the overall quality and the sale price
app_train['OverallQual'].corr(app_train['SalePrice'])

In [None]:
# Set the style of plots
plt.style.use('fivethirtyeight')

# Plot the General quality
plt.hist(app_train['OverallQual'], edgecolor = 'k', bins = 25)
plt.title('General quality'); plt.xlabel('OverallQual'); plt.ylabel('Count');

By itself, the distribution of `OverallQual` does not tell us much. To compare it with `SalePrice`, we will next make a [kernel density estimation plot](https://en.wikipedia.org/wiki/Kernel_density_estimation) (KDE) of `OverallQual` together with the sale price rescaled to a similar range. A [kernel density estimate plot shows the distribution of a single variable](https://chemicalstatistician.wordpress.com/2013/06/09/exploratory-data-analysis-kernel-density-estimation-in-r-on-ozone-pollution-data-in-new-york-and-ozonopolis/) and can be thought of as a smoothed histogram (it is created by computing a kernel, usually a Gaussian, at each data point and then averaging all the individual kernels to develop a single smooth curve). We will use the seaborn `kdeplot` for this graph.

In [None]:
# KDE plot 
plt.figure(figsize = (20, 15))
sns.kdeplot(app_train['OverallQual'], label = 'OverallQual')
sns.kdeplot(app_train['SalePrice']/25000, label = 'levelized selling price')
plt.xlabel('OverallQual / levelized selling price'); plt.ylabel('Density'); plt.title('Distribution of General quality');
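To see what a kernel density estimate does under the hood, the sketch below builds one by hand with NumPy: it places a Gaussian kernel on each data point and averages them. The function name, the toy quality ratings, and the fixed bandwidth of 0.5 are assumptions for illustration; seaborn chooses its own bandwidth.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian kernel per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # Gaussian kernel centred on each sample
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    # Averaging the kernels gives a single smooth density curve
    return kernels.mean(axis=1)

# Hypothetical quality ratings, for illustration only
quality = np.array([5, 5, 6, 6, 6, 7, 7, 8])
grid = np.linspace(0, 13, 521)
density = gaussian_kde_1d(quality, grid, bandwidth=0.5)

# A density should integrate to (approximately) 1
area = float(np.sum(density) * (grid[1] - grid[0]))
peak = float(grid[np.argmax(density)])
print('Area under the curve: %.3f, peak near %.2f' % (area, peak))
```

A smaller bandwidth gives a bumpier curve that follows the individual points, while a larger one smooths them together.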


In [None]:
plt.figure(figsize = (25, 15))

sns.kdeplot(data=app_train, x="OverallQual", y="SalePrice", fill=True,)

There is a clear trend: most of the houses sold are of average quality, which is understandable! At the average quality level, the price distribution still shows 3 local maxima and minima; to find the cause of these differences, it is necessary to investigate the remaining parameters of the housing data.

### Strongest Correlations

The variable with the strongest negative correlation with `SalePrice` is `ExterQual_TA`, and the 6 variables with the strongest positive correlations are `1stFlrSF`, `TotalBsmtSF`, `GarageArea`, `GarageCars`, `GrLivArea`, and `OverallQual`.
According to the documentation, these features represent the following:
- 1stFlrSF: First Floor square feet            
- TotalBsmtSF: Total square feet of basement area         
- GarageArea: Size of garage in square feet          
- GarageCars: Size of garage in car capacity          
- GrLivArea: Above grade (ground) living area square feet          
- OverallQual: Rates the overall material and finish of the house        



The initial data did not contain a feature named `ExterQual_TA`. It was created when we one-hot encoded the categorical feature `ExterQual`, and it equals 1 when `ExterQual` has the value TA (Average/Typical).

ExterQual: Evaluates the quality of the material on the exterior 
		
       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor

Let's take a look at these variables.

First, we can show the correlations of the  features with the SalePrice and with each other.

In [None]:
# 1stFlrSF            0.605852
# TotalBsmtSF         0.613581
# GarageArea          0.623431
# GarageCars          0.640409
# GrLivArea           0.708624
# OverallQual         0.790982
# 
# ExterQual_TA        -0.589044

# Extract the most correlated variables and show correlations
ext_data = app_train[['SalePrice', '1stFlrSF', 'TotalBsmtSF', 'GarageArea', 'GarageCars', 'GrLivArea', 'OverallQual', 'ExterQual_TA']]
ext_data_corrs = ext_data.corr()
ext_data_corrs

In [None]:
plt.figure(figsize = (8, 6))

# Heatmap of correlations
sns.heatmap(ext_data_corrs, cmap = plt.cm.RdYlBu_r, vmin = -0.6, annot = True, vmax = 0.8)
plt.title('Correlation Heatmap');

Next we can look at the distribution of each of these features alongside the rescaled sale price. This will let us visually compare each variable with the target.

In [None]:

    
plt.figure(figsize = (20, 15))
sns.kdeplot(app_train['ExterQual_TA'], label = 'ExterQual_TA')
sns.kdeplot(app_train['SalePrice']/200000, label = 'levelized selling price')
plt.xlabel('ExterQual_TA / levelized selling price'); plt.ylabel('Density'); plt.title('Distribution of ExterQual_TA');



In [None]:
plt.figure(figsize = (20, 15))



# Iterate through the features
for i, source in enumerate(['1stFlrSF', 'TotalBsmtSF', 'GarageArea', 'GarageCars', 'GrLivArea']):
    
    # Create a new subplot for each feature
    plt.subplot(5, 1, i + 1)
    # Plot the distribution of the feature
    sns.kdeplot(app_train[source], label = source)

    # Plot the rescaled sale price for comparison
    sns.kdeplot(app_train['SalePrice']/100, label = 'levelized selling price')
    
    # Label the plots
    plt.title('Distribution of %s and Rescaled Sale Price' % source)
    plt.xlabel('%s' % source); plt.ylabel('Density');
    plt.legend()
    
plt.tight_layout(h_pad = 2.5)

In [None]:
plt.figure(figsize = (20, 15))



# Iterate through the features
for i, source in enumerate(['1stFlrSF', 'TotalBsmtSF', 'GarageArea', 'GarageCars', 'GrLivArea']):
    
    # Create a new subplot for each feature
    plt.subplot(5, 1, i + 1)
    # Plot the joint density of the feature and the sale price
    sns.kdeplot(data=app_train, x=source, y="SalePrice", fill=True,)
       
    # Label the plots
    plt.title('Joint Distribution of %s and SalePrice' % source)
    plt.xlabel('%s' % source); plt.ylabel('SalePrice');
    
plt.tight_layout(h_pad = 2.5)

After `OverallQual`, `GrLivArea` is the next strongest determinant of the cost of housing, and the plots show that it is directly related to the sale price. With a correlation of about 0.71, the relationship is strong on the scale given earlier, although far from perfect, and all of these variables (with absolute correlations of roughly 0.6 to 0.8) will be useful for a machine learning model to predict the cost of housing.

# Feature Engineering

Kaggle competitions are won by feature engineering: those who win are those who can create the most useful features out of the data. (This is true for the most part as the winning models, at least for structured data, all tend to be variants on [gradient boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)). This represents one of the patterns in machine learning: feature engineering has a greater return on investment than model building and hyperparameter tuning. [This is a great article on the subject](https://www.featurelabs.com/blog/secret-to-data-science-success/). As Andrew Ng is fond of saying: "applied machine learning is basically feature engineering."

While choosing the right model and optimal settings are important, the model can only learn from the data it is given. Making sure this data is as relevant to the task as possible is the job of the data scientist (and maybe some [automated tools](https://docs.featuretools.com/getting_started/install.html) to help us out).

Feature engineering refers to a general process and can involve both feature construction (adding new features from the existing data) and feature selection (choosing only the most important features, or other methods of dimensionality reduction). There are many techniques we can use to both create features and select features.

We could do a lot more feature engineering, but in this notebook we will try only two simple feature construction methods:

* Polynomial features
* Domain knowledge features


## Polynomial Features

One simple feature construction method is called [polynomial features](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html). In this method, we make features that are powers of existing features as well as interaction terms between existing features. For example, we can create variables `GrLivArea^2` and `OverallQual^2` and also variables such as `GrLivArea` x `OverallQual`, `GrLivArea` x `OverallQual^2`, `GrLivArea^2` x `OverallQual`, and so on. These features that are a combination of multiple individual variables are called [interaction terms](https://en.wikipedia.org/wiki/Interaction_(statistics)) because they capture the interactions between variables. In other words, while two variables by themselves may not have a strong influence on the target, combining them together into a single interaction variable might show a relationship with the target. [Interaction terms are commonly used in statistical models](https://www.theanalysisfactor.com/interpreting-interactions-in-regression/) to capture the effects of multiple variables, but I do not see them used as often in machine learning. Nonetheless, we can try out a few to see if they might help our model to predict housing costs.

Jake VanderPlas writes about [polynomial features in his excellent book Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html) for those who want more information.


In the following code, we create polynomial features using the `GrLivArea` variable and the `OverallQual` variable. [Scikit-Learn has a useful class called `PolynomialFeatures`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) that creates the polynomials and the interaction terms up to a specified degree. We can use a degree of 3 to see the results (when we are creating polynomial features, we want to avoid using too high of a degree, both because the number of features grows rapidly with the degree, and because we can run into [problems with overfitting](http://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html#sphx-glr-auto-examples-model-selection-plot-underfitting-overfitting-py)).
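To get a feeling for how quickly the number of polynomial features grows, the sketch below enumerates every term by hand using only the standard library. The helper name `polynomial_term_names` is hypothetical; it lists the terms rather than computing them, and it is not how Scikit-Learn implements the transformation.

```python
from itertools import combinations_with_replacement
from math import comb

def polynomial_term_names(features, degree):
    """List every monomial of the features up to `degree`, including the constant."""
    names = []
    for d in range(degree + 1):
        # Each combination with replacement of size d is one term of degree d
        for combo in combinations_with_replacement(features, d):
            names.append(' * '.join(combo) if combo else '1')
    return names

terms = polynomial_term_names(['GrLivArea', 'OverallQual'], degree=3)
print(len(terms), terms)

# The number of terms equals C(n_features + degree, degree)
expected_count = comb(2 + 3, 3)

# With 10 input features instead of 2, degree 3 already gives far more terms
many_terms = len(polynomial_term_names(list('abcdefghij'), 3))
print(expected_count, many_terms)
```

With only two input features the count stays small, but the second call shows why a high degree on many features quickly becomes unmanageable.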

In [None]:
# Make a new dataframe for polynomial features
poly_features = app_train[['GrLivArea', 'OverallQual', 'SalePrice']]
poly_features_test = app_test[['GrLivArea', 'OverallQual']]


from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median')


poly_target = poly_features['SalePrice']

poly_features = poly_features.drop(columns = ['SalePrice'])

# Need to impute missing values
poly_features = imputer.fit_transform(poly_features)
poly_features_test = imputer.transform(poly_features_test)

from sklearn.preprocessing import PolynomialFeatures
                                  
# Create the polynomial object with specified degree
poly_transformer = PolynomialFeatures(degree = 3)

In [None]:
# Train the polynomial features
poly_transformer.fit(poly_features)

# Transform the features
poly_features = poly_transformer.transform(poly_features)
poly_features_test = poly_transformer.transform(poly_features_test)
print('Polynomial Features shape: ', poly_features.shape)

This creates a considerable number of new features. To get the names we have to use the polynomial features `get_feature_names_out` method (called `get_feature_names` in scikit-learn versions before 1.0).

In [None]:
poly_transformer.get_feature_names_out(input_features = ['GrLivArea', 'OverallQual'])[:15]

There are 10 features with individual features raised to powers up to degree 3 and interaction terms. Now, we can see whether any of these new features are correlated with the target.

In [None]:
# Create a dataframe of the features 
poly_features = pd.DataFrame(poly_features, 
                             columns = poly_transformer.get_feature_names_out(['GrLivArea', 'OverallQual']))

# Add in the target
poly_features['SalePrice'] = poly_target

# Find the correlations with the target
poly_corrs = poly_features.corr()['SalePrice'].sort_values()

# Display most negative and most positive
print(poly_corrs.head(10))
print(poly_corrs.tail(5))

Several of the new variables have a greater (in terms of absolute magnitude) correlation with the target than the original features. When we build machine learning models, we can try with and without these features to determine if they actually help the model learn. 

We will add these features to a copy of the training and testing data and then evaluate models with and without the features. Many times in machine learning, the only way to know if an approach will work is to try it out! 

In [None]:
# Put test features into dataframe
poly_features_test = pd.DataFrame(poly_features_test, 
                                  columns = poly_transformer.get_feature_names_out(['GrLivArea', 'OverallQual']))

# Merge polynomial features into training dataframe
poly_features['Id'] = app_train['Id']
app_train_poly = app_train.merge(poly_features, on = 'Id', how = 'left')

# Merge polynomial features into testing dataframe
poly_features_test['Id'] = app_test['Id']
app_test_poly = app_test.merge(poly_features_test, on = 'Id', how = 'left')

# Align the dataframes
app_train_poly, app_test_poly = app_train_poly.align(app_test_poly, join = 'inner', axis = 1)

# Print out the new shapes
print('Training data with polynomial features shape: ', app_train_poly.shape)
print('Testing data with polynomial features shape:  ', app_test_poly.shape)
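
One thing to watch in the merge above: `poly_features` still contains `GrLivArea` and `OverallQual`, which also exist in `app_train`, so pandas keeps both copies and appends `_x` / `_y` suffixes. A minimal sketch on toy frames (all values are invented) shows the effect, together with one possible way to avoid it by dropping the overlapping columns before merging:

```python
import pandas as pd

# Toy stand-ins for app_train and poly_features; the values are made up
left = pd.DataFrame({'Id': [1, 2, 3],
                     'GrLivArea': [1710, 1262, 1786],
                     'OverallQual': [7, 6, 7]})
right = pd.DataFrame({'Id': [1, 2, 3],
                      'GrLivArea': [1710.0, 1262.0, 1786.0],
                      'GrLivArea^2': [1710.0 ** 2, 1262.0 ** 2, 1786.0 ** 2]})

# Overlapping non-key columns are duplicated with _x / _y suffixes
merged = left.merge(right, on='Id', how='left')
print(merged.columns.tolist())

# Dropping columns already present on the left (except the key) avoids duplicates
overlap = [c for c in right.columns if c in left.columns and c != 'Id']
merged_clean = left.merge(right.drop(columns=overlap), on='Id', how='left')
print(merged_clean.columns.tolist())
```

Duplicated columns are not fatal for a linear model, but they add redundant features, so it is worth checking `app_train_poly.columns` after the merge.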

In [None]:
# Missing values statistics app_train_poly
missing_values = missing_values_table(app_train_poly)
missing_values.head(20)

In [None]:
# Fill missing values in each column by sampling from that column's observed values,
# weighted by how often each value occurs
for column in app_train_poly.columns:
    missing = app_train_poly[column].isna()
    if missing.any():
        a, b = np.unique(app_train_poly.loc[~missing, column], return_counts = True)
        app_train_poly.loc[missing, column] = np.random.choice(a, missing.sum(), p = b / b.sum())
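
The idea of filling gaps by sampling from a column's own observed values, weighted by how often each value occurs, can be sketched on a toy frame. The values and the seed are invented for illustration, and `fill_by_frequency` is a hypothetical helper rather than a pandas function:

```python
import numpy as np
import pandas as pd


def fill_by_frequency(df, rng):
    """Fill NaNs in each column by sampling that column's observed values.

    Hypothetical helper: values are drawn with probability proportional
    to their frequency among the column's non-missing entries.
    """
    df = df.copy()
    for column in df.columns:
        missing = df[column].isna()
        if not missing.any() or missing.all():
            continue  # nothing to fill, or nothing to sample from
        values, counts = np.unique(df.loc[~missing, column], return_counts=True)
        df.loc[missing, column] = rng.choice(values, size=missing.sum(),
                                             p=counts / counts.sum())
    return df


rng = np.random.default_rng(0)  # seeded so the fill is reproducible
toy = pd.DataFrame({'LotFrontage': [65.0, 80.0, np.nan, 65.0, np.nan],
                    'GarageCars': [2.0, np.nan, 2.0, 1.0, 2.0]})
filled = fill_by_frequency(toy, rng)
print(filled)
```

Because the fill is random, fixing a seed (as done here) is one way to make scores comparable between runs.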

In [None]:
# Missing values statistics
missing_values = missing_values_table(app_train_poly)
missing_values.head(20)

In [None]:
# Missing values statistics app_test_poly
missing_values = missing_values_table(app_test_poly)
missing_values.head(20)

In [None]:
# Fill missing values in each column by sampling from that column's observed values,
# weighted by how often each value occurs
for column in app_test_poly.columns:
    missing = app_test_poly[column].isna()
    if missing.any():
        a, b = np.unique(app_test_poly.loc[~missing, column], return_counts = True)
        app_test_poly.loc[missing, column] = np.random.choice(a, missing.sum(), p = b / b.sum())

In [None]:
# Missing values statistics
missing_values = missing_values_table(app_test_poly)
missing_values.head(20)

In [None]:
app_train_poly.columns

In [None]:
# Fill missing values in the original dataframes, column by column, in the same way
for column in app_train.columns:
    missing = app_train[column].isna()
    if missing.any():
        a, b = np.unique(app_train.loc[~missing, column], return_counts = True)
        app_train.loc[missing, column] = np.random.choice(a, missing.sum(), p = b / b.sum())

for column in app_test.columns:
    missing = app_test[column].isna()
    if missing.any():
        a, b = np.unique(app_test.loc[~missing, column], return_counts = True)
        app_test.loc[missing, column] = np.random.choice(a, missing.sum(), p = b / b.sum())

In [None]:
# Missing values statistics
missing_values = missing_values_table(app_train)
missing_values.head(20)

## Modelling
I will fit a simple linear regression to the dataset to predict house prices. To train the regression model, we first need to split the data into `X`, a dataframe containing the features to train on, and `y`, a series with the target variable, in this case the `SalePrice` column.

In [None]:
from sklearn.model_selection import train_test_split

# Add the target back: aligning columns with the test data removed it
app_train_poly['SalePrice'] = app_train['SalePrice']
X = app_train_poly.drop(['SalePrice'], axis = 1)
y = app_train_poly['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
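
A quick sketch on toy arrays (placeholder numbers, not the housing data) shows how a `test_size` of 0.2 divides the rows and keeps features and targets paired up:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 toy rows with 2 features each; row i is [2i, 2i + 1] and its target is i
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)

# Each held-out feature row still lines up with its own target
print(X_te[:, 0] // 2, y_te)
```

Fixing `random_state` makes the shuffle repeatable, so metrics computed on the hold-out rows can be compared between runs.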

In [None]:
'''For comparison, I also left the code for the original dataset, without the additional features, in this notebook.'''
# from sklearn.model_selection import train_test_split

# app_train_poly['SalePrice']=app_train['SalePrice']

# X = app_train.drop(['SalePrice'], axis = 1)  
# y = app_train['SalePrice']
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

We split the data into training and testing sets using scikit-learn's `train_test_split` function, with 80% of the data for training and 20% for testing. `train_test_split()` returns four objects:

- X_train: the subset of our features used for training
- X_test: the subset which will be our ‘hold-out’ set – what we’ll use to test the model
- y_train: the target variable SalePrice which corresponds to X_train
- y_test: the target variable SalePrice which corresponds to X_test

Now we import the linear regression class and create an object of that class, which is the linear regression model.

In [None]:
from sklearn import linear_model

lr = linear_model.LinearRegression()

Then we use the `fit` method to fit the model to the training data. This makes the regressor "study" the data and "learn" from it by estimating one coefficient per feature plus an intercept.

In [None]:
model = lr.fit(X_train, y_train)
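
To make "learning" concrete, here is a minimal sketch on invented data that follows `y = 3x + 5` exactly. After fitting, the model's `coef_` and `intercept_` attributes recover the slope and the offset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data that follows y = 3x + 5 exactly (invented for illustration)
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y_line = 3.0 * x.ravel() + 5.0

toy_model = LinearRegression().fit(x, y_line)
print(toy_model.coef_, toy_model.intercept_)

# The fitted line can then be used on unseen inputs
print(toy_model.predict(np.array([[10.0]])))
```

On the housing data the same attributes hold one coefficient per column of `X_train`, which is a simple way to inspect what the model has picked up.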

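R-squared is a measure of how close the data are to the fitted regression line: it is the share of the variance in SalePrice that the model explains. It is usually read on a convenient 0 – 100% scale, although on held-out data it can be negative when the model does worse than always predicting the mean. Before computing any metric, we need predictions for the hold-out set.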

In [None]:
# make predictions based on model
predictions = model.predict(X_test)
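
R-squared, described above, is not computed in this notebook, so here is a small sketch of the formula, 1 - SS_res / SS_tot, on invented prices. The function `r_squared` is written out by hand purely for illustration:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [100.0, 200.0, 300.0, 400.0]   # invented prices
good = [110.0, 190.0, 310.0, 390.0]     # close predictions
bad = [400.0, 300.0, 200.0, 100.0]      # worse than predicting the mean

print(r_squared(actual, good))
print(r_squared(actual, bad))
```

The second call returns a negative value, which is why the 0 – 100% reading only holds for reasonably good models. For the model fitted here, `model.score(X_test, y_test)` should give the same quantity, since scikit-learn regressors report R-squared from `score`.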

There are three primary metrics used to evaluate regression models:

- Mean absolute error (MAE): the easiest to understand. It is the average absolute difference between the predictions and the actual values.
- Mean squared error (MSE): similar to MAE, but the errors are squared, so larger errors are "punished" more heavily. It is harder to interpret than MAE because it is not in the units of the target.
- Root mean squared error (RMSE): the square root of MSE, which puts the result back in the units of the target and makes it easier to interpret. It is the most popular of the three, and it is recommended as the primary metric for interpreting your model.
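
The three metrics above can be computed by hand in a few lines, which makes the relationship between them easier to see. The prices below are invented:

```python
from math import sqrt

actual = [200000.0, 150000.0, 320000.0]      # invented sale prices
predicted = [210000.0, 140000.0, 290000.0]   # invented predictions

errors = [p - a for a, p in zip(actual, predicted)]
mae = sum(abs(e) for e in errors) / len(errors)   # average absolute error
mse = sum(e ** 2 for e in errors) / len(errors)   # average squared error
rmse = sqrt(mse)                                  # back in price units
print(mae, mse, rmse)
```

Here RMSE comes out larger than MAE because the single 30,000 error is weighted more heavily once squared.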

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt

print ('MAE is:', mean_absolute_error(y_test, predictions))
print ('MSE is:', mean_squared_error(y_test, predictions))
print ('RMSE is:', sqrt(mean_squared_error(y_test, predictions)))
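
The scores quoted at the end of this notebook are on a different scale from the RMSE printed above. As far as I understand the competition rules, submissions are scored by the RMSE between the logarithm of the predicted price and the logarithm of the actual price, so that errors on cheap and expensive houses count equally. A minimal sketch with invented prices (the helper name `rmse_of_logs` is my own):

```python
from math import log, sqrt


def rmse_of_logs(actual, predicted):
    """RMSE between log(actual) and log(predicted); inputs must be positive."""
    squared = [(log(p) - log(a)) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


# Invented prices: both predictions are 10% too high, at very different price levels
actual = [100000.0, 500000.0]
predicted = [110000.0, 550000.0]
print(rmse_of_logs(actual, predicted))
```

Note that the logarithm is undefined for zero or negative predictions, which a plain linear regression can produce, so predictions may need to be checked before this metric is computed.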

In [None]:
# alpha helps to show overlapping data
plt.scatter(predictions, y_test, alpha = 0.7, color = 'b')
plt.xlabel('Predicted Price')
plt.ylabel('Actual Price')
plt.title('Linear Regression Model')

## Submission

In [None]:
submission = pd.DataFrame()
submission['Id'] = app_test_poly['Id'].astype(int)
# Predict on the same columns the model was trained on
predictions = model.predict(app_test_poly)

In [None]:
# submission = pd.DataFrame()
# submission['Id'] = app_test['Id'].astype(int)
# temp = app_test.select_dtypes(include = [np.number]).drop(['Id'], axis = 1).interpolate()
# predictions = model.predict(app_test)

In [None]:
submission['SalePrice'] = predictions

In [None]:
submission.to_csv('submission.csv', index = False)
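
Before uploading, it can be worth running a few sanity checks on the submission frame. The sketch below uses an invented three-row stand-in, and the floor of 1.0 used for clipping is an arbitrary assumption, not something the competition prescribes:

```python
import pandas as pd

# Toy stand-in for the submission frame; Ids and prices are invented
toy_submission = pd.DataFrame({'Id': [1461, 1462, 1463],
                               'SalePrice': [169000.5, -1200.0, 187500.0]})

# Basic checks: one row per Id and no missing predictions
assert toy_submission['Id'].is_unique
assert not toy_submission['SalePrice'].isna().any()

# A linear model can extrapolate to non-positive prices; count and clip them
n_negative = int((toy_submission['SalePrice'] <= 0).sum())
toy_submission['SalePrice'] = toy_submission['SalePrice'].clip(lower=1.0)
print(n_negative, toy_submission['SalePrice'].min())
```

The same checks can be applied to `submission` itself before calling `to_csv`.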

The submission has now been saved to the virtual environment in which our notebook is running. To access the submission, at the end of the notebook, we will hit the blue Commit & Run button at the upper right of the kernel. This runs the entire notebook and then lets us download any files that are created during the run. 

Once we run the notebook, the files created are available in the Versions tab under the Output sub-tab. From here, the submission file, `submission.csv`, can be submitted to the competition or downloaded.

__This notebook scores around 0.34661 when submitted.__

If we use the original dataset with missing values filled in but no additional features, the score is 0.38272 (lower is better).

# Conclusions

In this notebook, we saw how to get started with a Kaggle machine learning competition. We first made sure to understand the data, our task, and the metric by which our submissions will be judged. Then, we performed a fairly simple EDA to try and identify relationships, trends, or anomalies that may help our modeling. Along the way, we performed necessary preprocessing steps such as encoding categorical variables, imputing missing values, and scaling features to a range. Then, we constructed new features out of the existing data to see if doing so could help our model. 

Once the data exploration, data preparation, and feature engineering were complete, we implemented a baseline model upon which we hope to improve. Then we built a second slightly more complicated model to beat our first score. We also carried out an experiment to determine the effect of adding the engineered variables.

We followed the general outline of a [machine learning project](https://towardsdatascience.com/a-complete-machine-learning-walk-through-in-python-part-one-c62152f39420): 

1. Understand the problem and the data
2. Data cleaning and formatting (this was mostly done for us)
3. Exploratory Data Analysis
4. Baseline model
5. Improved model


Machine learning competitions do differ slightly from typical data science problems in that we are concerned only with achieving the best performance on a single metric and do not care about the interpretation. However, by attempting to understand how our models make decisions, we can try to improve them or examine the mistakes in order to correct the errors. In future notebooks we will look at incorporating more sources of data, building more complex models (by following the code of others), and improving our scores. 

I hope this notebook was able to get you up and running in this machine learning competition and that you are now ready to go out on your own - with help from the community - and start working on some great problems! 

__Running the notebook__: now that we are at the end of the notebook, you can hit the blue Commit & Run button to execute all the code at once. After the run is complete (this should take about 2 minutes), you can then access the files that were created by going to the versions tab and then the output sub-tab. The submission files can be directly submitted to the competition from this tab or they can be downloaded to a local machine and saved. 
