# Useful articles

- [Dealing with categorical features](https://medium.com/hugo-ferreiras-blog/dealing-with-categorical-features-in-machine-learning-1bb70f07262d)

- ['statsmodels' library](https://www.statsmodels.org/stable/index.html)

- [Some potentially useful packages](https://medium.com/activewizards-machine-learning-company/top-15-python-libraries-for-data-science-in-in-2017-ab61b4f9b4a7)

- [Other potentially useful packages](https://www.kdnuggets.com/2018/06/top-20-python-libraries-data-science-2018.html/2)

- ['seaborn' library (visualization)](https://seaborn.pydata.org/tutorial.html)

- [Useful example](https://becominghuman.ai/linear-regression-in-python-with-pandas-scikit-learn-72574a2ec1a5)

- [Example with gradient boosting regression on Boston housing](https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regression.html#sphx-glr-auto-examples-ensemble-plot-gradient-boosting-regression-py)

# Small tasks yet to be done

### Declaring libraries

In [2]:
%matplotlib inline
# For configuration and Jupyter
import os
import sys
import re
import random
import matplotlib
import implicit
# For data manipulation
import pandas as pd
import numpy as np
# For visualization
import matplotlib.pyplot as plt
import seaborn as sns
# For performance evaluation
from time import time

os.environ["OPENBLAS_NUM_THREADS"] = "1"   # Required by implicit
base = "/mnt/workspace/AML-2019/Challenges/House_Pricing/challenge_data/"

# Data exploration
### Fetching data

In [3]:
#pricesRawDF = pd.read_csv(base + 'train.csv', keep_default_na = False)
pricesRawDF = pd.read_csv(base + 'train.csv')
null_values = pricesRawDF.isnull().sum().sort_values(ascending = False)
null_values[:20]
#print(null_values.index)

PoolQC          1196
MiscFeature     1153
Alley           1125
Fence            973
FireplaceQu      564
LotFrontage      210
GarageType        67
GarageCond        67
GarageYrBlt       67
GarageFinish      67
GarageQual        67
BsmtExposure      33
BsmtFinType2      33
BsmtFinType1      32
BsmtCond          32
BsmtQual          32
MasVnrArea         6
MasVnrType         6
Exterior2nd        0
Exterior1st        0
dtype: int64

Many of the columns have a large number of missing values. For the categorical columns, this is because the category "NA" has been interpreted as NaN by the "read_csv" function, so we correct for this first.
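
To illustrate the behaviour described above, here is a minimal sketch on a made-up three-row CSV (the data and variable names are illustrative, not taken from the challenge files). By default `read_csv` turns the string "NA" into NaN; passing `keep_default_na=False` together with `na_values=[""]` should keep "NA" as a literal category while still treating empty cells as missing.

```python
import io

import pandas as pd

# Made-up sample: "NA" is a real category in Alley, the empty LotArea cell is truly missing
csv_text = "Id,Alley,LotArea\n1,NA,8450\n2,Pave,9600\n3,NA,\n"

# Default parsing: the string "NA" and the empty cell both become NaN
default_df = pd.read_csv(io.StringIO(csv_text))

# Only empty cells count as missing; "NA" stays a plain string
literal_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

default_nulls = default_df.isnull().sum().to_dict()
literal_nulls = literal_df.isnull().sum().to_dict()
print(default_nulls)
print(literal_nulls)
```

This is the route hinted at by the commented-out `keep_default_na = False` line in the cell above. Filling the affected columns with `fillna('NA')` after loading, as the next cell does, reaches a similar result for the categorical columns.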

In [4]:
columns_with_NaN = ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'LotFrontage',
       'GarageType', 'GarageCond', 'GarageYrBlt', 'GarageFinish', 'GarageQual',
       'BsmtExposure', 'BsmtFinType2', 'BsmtFinType1', 'BsmtCond', 'BsmtQual',
       'MasVnrType']
for col in columns_with_NaN:
    pricesRawDF[col] = pricesRawDF[col].fillna('NA')

pricesRawDF['MasVnrArea'] = pricesRawDF['MasVnrArea'].fillna(0)

pricesRawDF.isnull().sum().any()

False

## Reformatting data
In a statistical or machine learning model, it is much easier to use numerical data than categorical data. For the categorical parameters of the data set, we can either map them directly to numbers or split them into individual boolean columns by one-hot encoding. We have chosen to one-hot encode some columns and make others numerical.
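
As a rough illustration of the two options, the sketch below maps an ordered quality column to integers and one-hot encodes an unordered column with `pd.get_dummies`. The toy frame and the `quality_order` mapping are assumptions made for the example; the notebook itself uses its own helper functions and the dictionaries imported from `variables`.

```python
import pandas as pd

# Toy data; column names follow the data set, values are made up
df = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Ex"],
    "MSZoning": ["RL", "RM", "RL"],
})

# Option 1: ordered categories -> integers (assumed ordering, for illustration only)
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
df["ExterQual"] = df["ExterQual"].map(quality_order)

# Option 2: unordered categories -> one 0/1 column per observed category
encoded = pd.get_dummies(df, columns=["MSZoning"], dtype="int8")
print(encoded)
```

The integer mapping keeps the ordering information in a single column, while one-hot encoding avoids implying an order where none exists.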

In [5]:
# To check if number of unique elements in column exceeds number of types in data description
# Only useful for columns with categorical data 
nr_column_categories = []

for column in pricesRawDF:
    nr_column_categories.append((column, pricesRawDF[column].nunique()))

for elements in nr_column_categories[:10]:
    print(elements)

('Id', 1200)
('MSSubClass', 15)
('MSZoning', 5)
('LotFrontage', 107)
('LotArea', 913)
('Street', 2)
('Alley', 3)
('LotShape', 4)
('LandContour', 4)
('Utilities', 2)


What is worth noting here is that in some of the columns containing categorical data, not all categories are represented. For example, an element of the column "MSSubClass" can take 16 unique values according to its description in "Data Description.rtf", but only 15 of them appear in the data. The full set of categories therefore cannot be inferred from the data alone, which makes transforming the categorical columns to numerical values, or one-hot encoding them, a bit cumbersome.
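
One possible way to handle categories that are absent from the data is to declare the full category list up front, so that the encoder creates a column for every declared category and values outside the list can be counted. The sketch below assumes a hypothetical, shortened category list; the real lists would come from the data description.

```python
import pandas as pd

# Hypothetical, shortened list of declared categories
declared = ["20", "30", "40"]

# Made-up column: "30" never occurs, "99" is not a declared category
col = pd.Series(["20", "40", "20", "99"], name="MSSubClass")

# Values outside the declared categories become NaN in the categorical series
cat = pd.Series(pd.Categorical(col, categories=declared), name="MSSubClass")
n_unknown = int(cat.isna().sum())

# One column per declared category, including the unobserved "30"
one_hot = pd.get_dummies(cat, prefix="MSSubClass", dtype="int8")
print(one_hot)
print("unknown values:", n_unknown)
```

Rows holding an undeclared value end up with zeros in every column, which mirrors the error counting done by the notebook's own splitting function.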

In [6]:
# Transforming the categories we want to be numeric, to numeric values
from sklearn.preprocessing import LabelEncoder
from variables import cats_split, cats_num

# The categories we want to split
categorical_to_split = ['MSSubClass', 'MSZoning', 'Alley', 'LotConfig', 'Utilities',
                       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
                       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
                       'Foundation', 'Heating', 'Electrical', 'GarageType', 'PavedDrive', 
                       'MiscFeature', 'SaleType', 'SaleCondition']

# Casting all columns we want to one-hot encode to strings
for cat in categorical_to_split:
    pricesRawDF[cat] = pricesRawDF[cat].astype(str, errors = 'ignore')

# Categories to make numerical
categorical_to_make_numerical = ['Street', 'LotShape', 'LandContour', 'LandSlope', 'CentralAir',
                                'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
                                'BsmtFinType1', 'BsmtFinType2', 'HeatingQC', 'KitchenQual',
                                'Functional', 'FireplaceQu', 'GarageFinish', 'GarageQual', 'GarageCond',
                                'PoolQC', 'Fence']

# Function to make desired columns numerical
def numerize(rawDF, complete_mapping_dictionary):
    DF = rawDF.copy()
    errors = []
    for mapping_dictionary in complete_mapping_dictionary:
        # Fetching column name
        col_name = mapping_dictionary['name']
        # Fetching column
        column_to_numerize = DF[col_name].copy()
        # Creating columns of same size, and correct type
        numerized_column = pd.Series(np.zeros([column_to_numerize.size]), dtype=np.int8, name=col_name)
        # Resetting error_counter
        error_counter = 0
        # Looping through elements of column
        for index, value in column_to_numerize.items():
            if (value in mapping_dictionary):
                numerized_column.at[index] = mapping_dictionary[value]
            else:
                error_counter += 1
        # Merging numerized column into dataframe
        DF.drop(labels=col_name, axis=1, inplace=True)
        DF[col_name] = numerized_column
        # Appending errors to error vector
        errors.append((col_name, error_counter))
    return errors, DF

# Function for one-hot-encoding a single column, given the column and its possible categories
def split_and_filter(col_name, DF, categories):
    # Extracting column to split
    column_to_split = DF[col_name].copy()
    # Setting new names for columns
    new_col_names = [(col_name + '_' + category) for category in categories]
    # Creating expanded DF of zeros
    splitDF = pd.DataFrame(np.zeros([column_to_split.size,len(categories)], dtype = np.int8), columns = categories)
    # Resetting error_counter
    error_counter = 0
    # Looping through series and setting correct values in new DF
    for index, value in column_to_split.items():
        # Checking if the value is valid, i.e exists in the set of possible categories for one column
        if (value in categories):
            splitDF.at[index, value] = 1
        else:
            error_counter += 1
    # Renaming columns of new dataframe
    splitDF.columns = new_col_names
    # Merging expansion of one column with full DataFrame
    newDF = pd.merge(DF, splitDF, left_index=True, right_index=True)
    # Dropping original column
    newDF.drop([col_name], axis=1, inplace=True)
    # Deleting unused dataframe and series to conserve memory
    del splitDF
    del column_to_split
    return error_counter, newDF

# Function to one-hot encode the entire dataframe given a dataframe and list of names for new columns
def category_splitting(rawDF, list_of_categories):
    first_err, newDF = split_and_filter(list_of_categories[0][0], rawDF, list_of_categories[0][1:])
    errors = [(list_of_categories[0][0], first_err)]
    for column_to_split in list_of_categories[1:]:
        nth_err, newDF = split_and_filter(column_to_split[0], newDF, column_to_split[1:])
        errors.append((column_to_split[0], nth_err))
    return errors, newDF

# Function to reformat a dataset to our desired format
def reformat_dataset(rawDF):
    num_errors, numerizedDF = numerize(rawDF, cats_num)
    split_errors, reformattedDF = category_splitting(numerizedDF, cats_split)
    errors = num_errors + split_errors
    return errors, reformattedDF

errors, rfmtDF = reformat_dataset(pricesRawDF)

# Pre-processing 
## Error processing
Apart from missing values, we found some errors when reformatting the categorical columns. Now let us look at which errors we have got.

In [7]:
for columns in errors:
    if(columns[1] > 0):
        print(columns)

('BldgType', 106)
('Exterior2nd', 85)
('MasVnrType', 6)


Here we can see that we have some errors. By "error" we mean a cell that does not contain one of the predefined values listed in the "Data description" file. Now we inspect the values that give errors.

In [8]:
from variables import error_detect

error_values = []
for defect_col_name in error_detect:
    err = [defect_col_name[0]]
    defect_col = pricesRawDF[defect_col_name[0]]
    categories = defect_col_name[1:]
    for index, value in defect_col.items():
        if value not in categories:
            err.append(value)
    error_values.append(err)

# Find a way to print all the unique errors in each list-element of error_values
col2print = error_values[4]
for el in col2print:
    print(el)

MasVnrType
NA
NA
NA
NA
NA
NA
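
Since the loop above prints every offending cell, the distinct offending values and their counts can be easier to read. A minimal sketch on made-up data, where `allowed` stands in for the category list of one column (an assumption for the example):

```python
import pandas as pd

# Made-up column and an assumed list of allowed categories
col = pd.Series(["1Fam", "Twnhs", "Duplex", "Twnhs", "2fmCon"], name="BldgType")
allowed = ["1Fam", "2FmCon", "Duplx", "TwnhsE", "TwnhsI"]

# Keep only values that are not in the allowed list, then count each distinct value
unexpected = col[~col.isin(allowed)].value_counts()
print(unexpected)
```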


Now we can see that the errors are due to the following reasons:
    
* MSZoning:  
 * One category is called 'C' in the data description, but 'C (all)' in the data.
 * Here we need to change the dictionary
* Neighborhood:  
 * 179 elements that should have been spelled as 'Names', have been misspelled as 'NAmes'. 
 * Here we need to change the dictionary
* BldgType:  
 * Spelling error: "2fmCon" instead of "2FmCon"
 * Spelling error: "Duplex" instead of "Duplx"
 * Spelling error: Many elements have been spelled as "Twnhs". The problem is that it is impossible to tell whether the category "TwnhsI" or "TwnhsE" was meant
 * Correcting function (that fixes spelling errors and combines the two "Twnhs" categories to one)
* Exterior2nd:  
 * Spelling error: "Wd Shng" instead of "Wd Sdng"
 * Spelling error: "CmentBd" instead of "CemntBd"
 * Spelling error: "Brk Cmn" instead of "BrkComm"
 * Correcting function
* MasVnrType:  
 * Six elements where this category is not applicable
 * No correction is required
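
An alternative to patching the one-hot columns afterwards would be to normalize the raw spellings before encoding. The sketch below is not the approach used in this notebook; it applies the corrections listed above with `DataFrame.replace` on a made-up frame, and folds "TwnhsE" and "TwnhsI" into "Twnhs" since the ambiguous rows cannot be assigned to either.

```python
import pandas as pd

# Per-column corrections, taken from the list above
spelling_fixes = {
    "BldgType": {
        "2fmCon": "2FmCon",
        "Duplex": "Duplx",
        "TwnhsE": "Twnhs",  # merged, because plain "Twnhs" is ambiguous
        "TwnhsI": "Twnhs",
    },
    "Exterior2nd": {
        "Wd Shng": "Wd Sdng",
        "CmentBd": "CemntBd",
        "Brk Cmn": "BrkComm",
    },
}

# Made-up rows containing each known misspelling
raw = pd.DataFrame({
    "BldgType": ["1Fam", "2fmCon", "Duplex", "Twnhs", "TwnhsE"],
    "Exterior2nd": ["VinylSd", "Wd Shng", "CmentBd", "Brk Cmn", "VinylSd"],
})

# Nested dict: replace the given values only within the named column
fixed = raw.replace(spelling_fixes)
print(fixed)
```

With this ordering, the category lists used for encoding would need to contain "Twnhs" instead of the two separate townhouse categories.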

In [9]:
# Spelling-error-correcting function
def correct_known_spelling_errors(rawDF, reformattedDF):
    toReturnDF = reformattedDF.copy()
    cols_to_correct = ['BldgType', 'Exterior2nd']
    error_fix_count = 0
    ################## BldgType
    col = rawDF['BldgType']
    for index, value in col.items():
        if value == '2fmCon':
            toReturnDF.at[index, 'BldgType_2FmCon'] = 1
            error_fix_count += 1
        elif value == 'Duplex':
            toReturnDF.at[index, 'BldgType_Duplx'] = 1
            error_fix_count += 1
     ################## Exterior2nd
    col = rawDF['Exterior2nd']
    for index, value in col.items():
        if value == 'Wd Shng':
            toReturnDF.at[index, 'Exterior2nd_Wd Sdng'] = 1
            error_fix_count += 1
        elif value == 'CmentBd':
            toReturnDF.at[index, 'Exterior2nd_CemntBd'] = 1
            error_fix_count += 1
        elif value == 'Brk Cmn':
            toReturnDF.at[index, 'Exterior2nd_BrkComm'] = 1
            error_fix_count += 1
    print('Nr. of errors corrected: ', error_fix_count)
    return toReturnDF

rfmtDF = correct_known_spelling_errors(pricesRawDF, rfmtDF)

Nr. of errors corrected:  154


In [10]:
# Function to compensate for "Twnhs" spelling error
def twnhs_combiner(rawDF, reformattedDF):
    toReturnDF = reformattedDF.copy()
    error_fix_count = 0
    toReturnDF['BldgType_Twnhs'] = toReturnDF['BldgType_TwnhsI'] + toReturnDF['BldgType_TwnhsE']
    toReturnDF.drop(['BldgType_TwnhsI', 'BldgType_TwnhsE'], axis=1, inplace=True)
    BldgType_col = rawDF['BldgType']
    for index, value in BldgType_col.items():
        if value == 'Twnhs':
            toReturnDF.at[index, 'BldgType_Twnhs'] = 1 
            error_fix_count += 1
    print('Nr. of errors corrected: ', error_fix_count)
    return toReturnDF

correctedDF = twnhs_combiner(pricesRawDF, rfmtDF)

Nr. of errors corrected:  37


## Error processing on test-data


In [11]:
testDataRawDF = pd.read_csv(base + 'test.csv')
null_values = testDataRawDF.isnull().sum().sort_values(ascending = False)
null_values[:20]

PoolQC          257
MiscFeature     253
Alley           244
Fence           206
FireplaceQu     126
LotFrontage      49
GarageCond       14
GarageType       14
GarageYrBlt      14
GarageFinish     14
GarageQual       14
BsmtExposure      5
BsmtCond          5
BsmtQual          5
BsmtFinType1      5
BsmtFinType2      5
MasVnrArea        2
MasVnrType        2
Electrical        1
LotConfig         0
dtype: int64

Here the same columns have missing values as before, with the addition of one missing value in "Electrical". This can be treated like the others.

In [12]:
columns_with_NaN = ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'LotFrontage',
       'GarageType', 'GarageCond', 'GarageYrBlt', 'GarageFinish', 'GarageQual',
       'BsmtExposure', 'BsmtFinType2', 'BsmtFinType1', 'BsmtCond', 'BsmtQual',
       'MasVnrType', 'Electrical']
for col in columns_with_NaN:
    testDataRawDF[col] = testDataRawDF[col].fillna('NA')

testDataRawDF['MasVnrArea'] = testDataRawDF['MasVnrArea'].fillna(0)

testDataRawDF.isnull().sum().any()

False

In [13]:
for cat in categorical_to_split:
    testDataRawDF[cat] = testDataRawDF[cat].astype(str, errors = 'ignore')

errors, rfmtTestDF = reformat_dataset(testDataRawDF)

for columns in errors:
    if(columns[1] > 0):
        print(columns)

('BldgType', 20)
('Exterior2nd', 20)
('MasVnrType', 2)
('Electrical', 1)


In [14]:
rfmtTestDF = correct_known_spelling_errors(testDataRawDF, rfmtTestDF)
correctedTestDF = twnhs_combiner(testDataRawDF, rfmtTestDF)

Nr. of errors corrected:  34
Nr. of errors corrected:  6


## Feature combining

In [56]:
correctedDF['TotalSF'] = correctedDF['TotalBsmtSF'] + correctedDF['1stFlrSF'] + correctedDF['2ndFlrSF']

## Feature selection

In [54]:
paramsDF = correctedDF.copy()
paramsDF.drop(['Id', 'SalePrice'], axis=1, inplace=True)
priceCorr = correctedDF.corr().abs()['SalePrice']
corrDF = paramsDF.corr(method="pearson").abs()

In [59]:
def remove_correlated_cols(trainDF, testDF, corr_limit):
    paramsDF = trainDF.copy()
    toReturnTrainDF = trainDF.copy()
    toReturnTestDF = testDF.copy()
    paramsDF.drop(['Id', 'SalePrice'], axis=1, inplace=True)
    priceCorr = trainDF.corr().abs()['SalePrice']
    corrDF = paramsDF.corr(method="pearson").abs()

    highly_correlated_columns = np.where(np.logical_and((corrDF > corr_limit),(corrDF < 1.0)))
    param_col_names = corrDF.columns
    redundant_cols = []
    correlatd_cols = []
    for index in range(len(highly_correlated_columns[0])):
        col_name1 = param_col_names[highly_correlated_columns[0][index]]
        col_name2 = param_col_names[highly_correlated_columns[1][index]]
        if (col_name1 not in redundant_cols) and (col_name2 not in redundant_cols):
            correlation = corrDF.iloc[highly_correlated_columns[0][index]][col_name2]
            correlatd_cols.append([col_name1, col_name2, correlation])
            #print(col_name1, col_name2, correlation)
            redundant_cols.append(col_name1)
            redundant_cols.append(col_name2)

    cols_to_remove = []
    for row in correlatd_cols:
        correlation1 = priceCorr[row[0]]
        correlation2 = priceCorr[row[1]]
        if correlation1 > correlation2:
            cols_to_remove.append(row[1])
        else:
            cols_to_remove.append(row[0])
    toReturnTrainDF.drop(cols_to_remove, axis=1, inplace=True)
    toReturnTestDF.drop(cols_to_remove, axis=1, inplace=True)
    return toReturnTrainDF, toReturnTestDF

featureReducedTrainDF, featureReducedTestDF = remove_correlated_cols(correctedDF, correctedTestDF, 0.7)
print(featureReducedTrainDF.shape)
print(featureReducedTestDF.shape)

(1200, 212)
(260, 211)
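
One way to read the feature-selection step: for each pair of feature columns whose absolute Pearson correlation exceeds a limit, keep the column that correlates more strongly with the target and drop the other. A minimal sketch of that idea using the upper triangle of the correlation matrix, with `drop_correlated` as a hypothetical helper and toy data (this is a simplified variant, not a drop-in replacement for `remove_correlated_cols`):

```python
import numpy as np
import pandas as pd


def drop_correlated(features, target, limit):
    """Return the names of columns to drop from highly correlated pairs."""
    corr = features.corr().abs()
    # Keep only entries above the diagonal so each pair is visited once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    target_corr = features.corrwith(target).abs()
    to_drop = set()
    for col_a in upper.index:
        for col_b in upper.columns:
            value = upper.loc[col_a, col_b]
            if pd.isna(value) or value <= limit:
                continue
            if col_a in to_drop or col_b in to_drop:
                continue
            # Drop whichever column of the pair is less correlated with the target
            weaker = col_a if target_corr[col_a] < target_corr[col_b] else col_b
            to_drop.add(weaker)
    return sorted(to_drop)


# Toy data: "b" is nearly a copy of "a", "c" is unrelated
features = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 2, 3, 4, 6, 5],
    "c": [2, -1, 0, 0, -1, 2],
})
target = features["a"] * 2

dropped = drop_correlated(features, target, limit=0.9)
print(dropped)
```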


In [None]:
# Function to remove columns that have a low correlation with SalePrice? 

In [20]:
# Possible models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor

# ?
from sklearn import ensemble
from sklearn import datasets

# Model training and evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_squared_error # Metric

# More complex model evaluation
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

housing_params = rfmtDF.copy()
# Put prices in a separate column
housing_params.drop(['SalePrice','LotFrontage','MasVnrArea','GarageYrBlt'], axis=1, inplace=True)
prices = rfmtDF['SalePrice']
X_train, X_test, Y_train, Y_test  = train_test_split(housing_params, prices, test_size = 0.1, random_state = 9)

# #############################################################################
# Fit regression model
params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 2,
          'learning_rate': 0.01, 'loss': 'ls'}
clf = ensemble.GradientBoostingRegressor(**params)

clf.fit(X_train, Y_train)
mse = mean_squared_error(Y_test, clf.predict(X_test))

# #############################################################################
# Plot training deviance

# compute test set deviance
test_score = np.zeros((params['n_estimators'],), dtype=np.float64)

for i, y_pred in enumerate(clf.staged_predict(X_test)):
    test_score[i] = clf.loss_(Y_test, y_pred)

plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.title('Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, clf.train_score_, 'b-',
         label='Training Set Deviance')
plt.plot(np.arange(params['n_estimators']) + 1, test_score, 'r-',
         label='Test Set Deviance')
plt.legend(loc='upper right')
plt.xlabel('Boosting Iterations')
plt.ylabel('Deviance')

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
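
The error message suggests that the feature matrix passed to `fit` still contains NaN or infinite values. A small diagnostic sketch, using a made-up frame `X` in place of `X_train` (an assumption for the example), that lists the affected numeric columns:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the training features
X = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [1.0, 2.0, np.inf],
    "c": [1, 2, 3],
})

# Only numeric columns can hold NaN/inf in the sense of the error message
numeric = X.select_dtypes(include=[np.number])
values = numeric.to_numpy(dtype=float)

# Count NaN and infinite entries per column
report = pd.DataFrame(
    {"nan": np.isnan(values).sum(axis=0), "inf": np.isinf(values).sum(axis=0)},
    index=numeric.columns,
)
problem_cols = report[(report["nan"] > 0) | (report["inf"] > 0)].index.tolist()
print(report)
print(problem_cols)
```

Running the same check on the real feature frame before calling `fit` should show which columns need to be filled or dropped.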