<a class="anchor" name="Return"></a>
# Table of Contents

* ## [Data Description](#Data_Description)
   #### * [Overview of Dataset](#overview)
   #### * [Exploratory Data Analysis](#eda)   
* ## [Data Cleansing/Pre-Processing](#Data_Cleansing)
   #### * [Imputing Missing Values-Training](#impute_missing_train)
   #### * [Imputing Missing Values-Testing](#impute_missing_test)
   #### * [Converting columns of categorical type to numerical type](#conv_columns)
* ## [Finding Most Important Features](#best_features)
   ####  * [SelectKBest](#SelectKBest)
   ####  * [ExtraTreesClassifier](#ExtraTreesClassifier)
* ## [Regression Techniques](#RegTech)
   ####  * [Linear Regression](#linReg)
   ####  * [Boosted Trees Regression](#bReg)
* ## [Classification Techniques](#class)
   ####  * [Creating Bins](#Bins)
   ####  * [Create Generic Model](#GenericClass)
   ####  * [Create Random Forest Classification Model](#RandomForest)
* ## [Graphs](#graphs)
   ####  * [Violin Plots-Regression](#Violin_PlotsR)
   ####  * [Violin Plots-Classification](#Violin_PlotsC)
   #### * [Histogram-Regression](#RHistogram)
   #### * [Histogram-Classification](#CHistogram)
   ####  * [Line Plots-Regression](#Line_PlotsR)
* ## [Summary](#Summary)

Importing necessary packages

In [None]:
#Importing necessary packages
import pandas as pd # Used for cleaning the data and filling in missing values
import numpy as np
import turicreate as tc
from sklearn.feature_selection import SelectKBest # Used for finding the best combination of features
from sklearn.feature_selection import chi2 # Used for finding the best combination of features
from sklearn.ensemble import ExtraTreesClassifier # Used for finding the best combination of features
import matplotlib.pyplot as plt # Used for visualizations
import itertools as it # Used for finding the best combination of features
import seaborn as sns # Used for visualizations

<a class="anchor" name="Data_Description"></a>
## Data Description

[Return to TOC](#Return)

In [None]:
# Reading in both datasets
train_df = pd.read_csv(r'train.csv', index_col = 0)
test_df = pd.read_csv(r'test.csv', index_col = 0)

<a class="anchor" name="overview"></a>
### Overview of Dataset
[Return to TOC](#Return)

Size of data

In [None]:
# Printing # of rows and columns in both datasets
print('Training data:', train_df.shape)
print('Testing data:', test_df.shape)

Data types of columns

In [None]:
# Printing out data types of train dataset
train_df.dtypes

In [None]:
# Printing out # of features for each data type in train dataset
train_df.dtypes.value_counts()

In [None]:
# Printing out data types of test dataset
test_df.dtypes

In [None]:
# Printing out # of features for each data type in test dataset
test_df.dtypes.value_counts()

<a class="anchor" name="eda"></a>
### Exploratory Data Analysis
[Return to TOC](#Return)

head()

In [None]:
# Printing out first 5 rows of train dataset
train_df.head()

In [None]:
# Printing out first 5 rows of test dataset
test_df.head()

tail()

In [None]:
# Printing out last 5 rows of train dataset
train_df.tail()

In [None]:
# Printing out last 5 rows of test dataset
test_df.tail()

sample(5)

In [None]:
# Printing out random 5 rows of train dataset
train_df.sample(5)

In [None]:
# Printing out random 5 rows of test dataset
test_df.sample(5)

info()

In [None]:
# Printing out concise summary of columns, their data types, and non-null values in train dataset
train_df.info()

In [None]:
# Printing out concise summary of columns, their data types, and non-null values in test dataset
test_df.info()

describe()

In [None]:
# Printing out descriptive statistics of train dataset
train_df.describe()

In [None]:
# Printing out descriptive statistics of test dataset
test_df.describe()

<a class="anchor" name="Data_Cleansing"></a>
## Data Cleansing/Pre-Processing

[Return to TOC](#Return)

<a class="anchor" name="impute_missing_train"></a>
### Imputing Missing Values in Training Dataset
[Return to TOC](#Return)

In [None]:
# Get columns with NA values and the number of NA values in these columns in the training data
columns_with_detected_null_values_in_train_data = train_df.columns[train_df.isna().any()].tolist()
train_df[columns_with_detected_null_values_in_train_data].isnull().sum()

In [None]:
# Get columns with NA values and the number of NA values in these columns in the testing data
columns_with_detected_null_values_in_test_data = test_df.columns[test_df.isna().any()].tolist()
test_df[columns_with_detected_null_values_in_test_data].isnull().sum()

In [None]:
# Pandas reads the following columns as containing null/NA values. However, according to the data dictionary,
# NA in these columns simply means that the house does not have the feature (for example, no alley access).
# Pandas would otherwise treat these NA values as missing data points, which we don't want. The next three cells
# solve this problem for the training data

columns_with_no_actual_null_values = ['Alley',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'FireplaceQu',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PoolQC',
 'Fence',
 'MiscFeature']

In [None]:
# We will iterate through the list columns_with_no_actual_null_values and replace the NA values with a placeholder value
for col in columns_with_no_actual_null_values:
    train_df[col].fillna('asdf', inplace = True)
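
To illustrate why these entries show up as missing in the first place: by default, `pd.read_csv` parses the literal string `NA` as a missing value. The following is a minimal sketch on a made-up three-row CSV; the data and the `'None'` placeholder are illustrative assumptions and are not taken from the dataset.

```python
import io
import pandas as pd

# Toy CSV (made up): 'NA' in Alley means "no alley access", not an unknown value
csv_text = "Id,Alley,LotArea\n1,NA,8450\n2,Grvl,9600\n3,NA,11250\n"
toy_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# pandas parses the literal string 'NA' as a missing value by default
n_missing_before = int(toy_df['Alley'].isna().sum())

# Replace the parsed NaN with an explicit placeholder category
toy_df['Alley'] = toy_df['Alley'].fillna('None')
n_missing_after = int(toy_df['Alley'].isna().sum())

print(n_missing_before, n_missing_after)
print(toy_df['Alley'].tolist())
```

Any placeholder string works as long as it does not collide with a real category in the column.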

In [None]:
# Whenever we run this cell, we store any remaining columns in the dataframe that still have NA values
# into this variable. This is done every time after we fill in NA values, to see which columns still have NA values
columns_with_detected_null_values_in_train_data = train_df.columns[train_df.isna().any()].tolist()
train_df[columns_with_detected_null_values_in_train_data].isnull().sum()

In [None]:
# Get datatypes of remaining null values
train_df[columns_with_detected_null_values_in_train_data].dtypes

In [None]:
# If any of the remaining columns with null values are an object type, we will fill those with the value that occurs most
# often in that column
for col in columns_with_detected_null_values_in_train_data:
    if train_df[col].dtypes == 'object':
        train_df[col].fillna(train_df[col].mode()[0], inplace = True)
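
When filling with the most frequent value, it helps to remember that `value_counts()` returns the counts, indexed by the values themselves. The most frequent value is therefore the first index label (or, equivalently, `mode()[0]`), while the first element of the data is how often it occurs. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Toy column (made up) with one missing entry
toy = pd.Series(['Gd', 'TA', 'TA', None, 'TA', 'Gd'])

counts = toy.value_counts()               # index = values, data = counts, sorted descending
most_common_value = counts.index[0]       # the value that occurs most often
most_common_count = int(counts.iloc[0])   # how many times it occurs

# mode()[0] returns the same most frequent value directly
filled = toy.fillna(toy.mode()[0])

print(most_common_value, most_common_count)
print(filled.tolist())
```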

In [None]:
# Refreshing variable with list of columns where there are still NA values present 
columns_with_detected_null_values_in_train_data = train_df.columns[train_df.isna().any()].tolist()
train_df[columns_with_detected_null_values_in_train_data].isnull().sum()

In [None]:
# filling in NA values with the mean
train_df['LotFrontage'].fillna((train_df['LotFrontage'].mean()), inplace = True)

In [None]:
# Refreshing variable with list of columns where there are still NA values present
columns_with_detected_null_values_in_train_data = train_df.columns[train_df.isna().any()].tolist()
train_df[columns_with_detected_null_values_in_train_data].isnull().sum()

In [None]:
# filling in NA values with the median
train_df['MasVnrArea'].fillna((train_df['MasVnrArea'].median()), inplace = True)

In [None]:
# Refreshing variable with list of columns where there are still NA values present
columns_with_detected_null_values_in_train_data = train_df.columns[train_df.isna().any()].tolist()
train_df[columns_with_detected_null_values_in_train_data].isnull().sum()

In [None]:
# filling in NA values with the median
train_df['GarageYrBlt'].fillna((train_df['GarageYrBlt'].median()), inplace = True)

In [None]:
# Refreshing variable with list of columns where there are still NA values present. In this case, all columns are clean and 
# do not contain any NA values
columns_with_detected_null_values_in_train_data = train_df.columns[train_df.isna().any()].tolist()
train_df[columns_with_detected_null_values_in_train_data].isnull().sum()

<a class="anchor" name="impute_missing_test"></a>
### Imputing Missing Values in Test Dataset
[Return to TOC](#Return)

In [None]:
# Refreshing variable with list of columns where there are still NA values present
columns_with_detected_null_values_in_test_data = test_df.columns[test_df.isna().any()].tolist()
test_df[columns_with_detected_null_values_in_test_data].isnull().sum()

In [None]:
columns_with_no_actual_null_values

In [None]:
# We will iterate through the list columns_with_no_actual_null_values and replace the NA values with a placeholder value
for col in columns_with_no_actual_null_values:
    test_df[col].fillna('asdf', inplace = True)

In [None]:
# Refreshing variable with list of columns where there are still NA values present
columns_with_detected_null_values_in_test_data = test_df.columns[test_df.isna().any()].tolist()
test_df[columns_with_detected_null_values_in_test_data].isnull().sum()

In [None]:
# Get datatypes of remaining null values
test_df[columns_with_detected_null_values_in_test_data].dtypes

In [None]:
# If any of the remaining columns with null values are an object type, we will fill those with the value that occurs most
# often in that column
for col in columns_with_detected_null_values_in_test_data:
    if test_df[col].dtypes == 'object':
        test_df[col].fillna(test_df[col].mode()[0], inplace = True)

In [None]:
# Refreshing variable with list of columns where there are still NA values present
columns_with_detected_null_values_in_test_data = test_df.columns[test_df.isna().any()].tolist()
test_df[columns_with_detected_null_values_in_test_data].isnull().sum()

In [None]:
# filling in NA values with the mean
test_df['LotFrontage'].fillna((test_df['LotFrontage'].mean()), inplace = True)

In [None]:
# Refreshing variable with list of columns where there are still NA values present
columns_with_detected_null_values_in_test_data = test_df.columns[test_df.isna().any()].tolist()
test_df[columns_with_detected_null_values_in_test_data].isnull().sum()

In [None]:
# filling in NA values with the median
test_df['MasVnrArea'].fillna((test_df['MasVnrArea'].median()), inplace = True)

In [None]:
# Refreshing variable with list of columns where there are still NA values present
columns_with_detected_null_values_in_test_data = test_df.columns[test_df.isna().any()].tolist()
test_df[columns_with_detected_null_values_in_test_data].isnull().sum()

In [None]:
# Each column in this list is filled in with its most common value
for col in ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'BsmtFullBath', 'BsmtHalfBath', 'GarageCars']:
    test_df[col].fillna(test_df[col].mode()[0], inplace = True)

In [None]:
# Refreshing variable with list of columns where there are still NA values present
columns_with_detected_null_values_in_test_data = test_df.columns[test_df.isna().any()].tolist()
test_df[columns_with_detected_null_values_in_test_data].isnull().sum()

In [None]:
# filling in NA values with the median
test_df['GarageYrBlt'].fillna((test_df['GarageYrBlt'].median()), inplace = True)

In [None]:
# Refreshing variable with list of columns where there are still NA values present
columns_with_detected_null_values_in_test_data = test_df.columns[test_df.isna().any()].tolist()
test_df[columns_with_detected_null_values_in_test_data].isnull().sum()

In [None]:
# filling in NA values with the mean
test_df['TotalBsmtSF'].fillna((test_df['TotalBsmtSF'].mean()), inplace = True)

In [None]:
# Refreshing variable with list of columns where there are still NA values present
columns_with_detected_null_values_in_test_data = test_df.columns[test_df.isna().any()].tolist()
test_df[columns_with_detected_null_values_in_test_data].isnull().sum()

In [None]:
# filling in NA values with the mean
test_df['GarageArea'].fillna((test_df['GarageArea'].mean()), inplace = True)

In [None]:
# Refreshing variable with list of columns where there are still NA values present, which in this case is none
columns_with_detected_null_values_in_test_data = test_df.columns[test_df.isna().any()].tolist()
test_df[columns_with_detected_null_values_in_test_data].isnull().sum()
# At this point, there are no longer any null values in either our training or testing datasets

<a class="anchor" name="conv_columns"></a>
### Converting columns of categorical type to numerical type
[Return to TOC](#Return)

In [None]:
cat_columns = train_df.select_dtypes(['object']).columns

In [None]:
cat_columns

In [None]:
# Convert columns of object dtype to category dtype
for col in cat_columns:
    train_df[col] = train_df[col].astype('category')
    test_df[col] = test_df[col].astype('category')

In [None]:
# Convert all category columns to integer (label encoding)
train_df[cat_columns] = train_df[cat_columns].apply(lambda x: x.cat.codes)
test_df[cat_columns] = test_df[cat_columns].apply(lambda x: x.cat.codes)
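
One caveat with label encoding: `cat.codes` numbers the categories observed in each frame, so converting two frames separately can assign different codes to the same label when their sets of observed values differ. The sketch below uses two made-up frames and a shared `pd.CategoricalDtype` to keep the mapping identical; the toy data and variable names are assumptions for illustration only.

```python
import pandas as pd

# Toy frames (made up): 'Grvl' appears only in the training data
toy_train = pd.DataFrame({'Street': ['Grvl', 'Pave', 'Pave']})
toy_test = pd.DataFrame({'Street': ['Pave', 'Pave']})

# Encoding each frame on its own: 'Pave' becomes 1 in train but 0 in test
separate_train = toy_train['Street'].astype('category').cat.codes.tolist()
separate_test = toy_test['Street'].astype('category').cat.codes.tolist()

# Encoding both frames with one shared category list keeps the mapping identical
shared_categories = sorted(set(toy_train['Street']) | set(toy_test['Street']))
shared_dtype = pd.CategoricalDtype(categories=shared_categories)
shared_train = toy_train['Street'].astype(shared_dtype).cat.codes.tolist()
shared_test = toy_test['Street'].astype(shared_dtype).cat.codes.tolist()

print(separate_train, separate_test)
print(shared_train, shared_test)
```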

<a class="anchor" name="best_features"></a>
## Finding most important features
[Return to TOC](#Return)

<a class="anchor" name="SelectKBest"></a>
### Method 1: SelectKBest
[Return to TOC](#Return)

In [None]:
X = train_df.drop('SalePrice', axis = 1) # Independent features/columns
y = train_df['SalePrice'] # Target feature/column

# Using SelectKBest class with chi-squared test (used to measure statistical significance) to retrieve the top 10 features with highest importance scores
bestfeatures = SelectKBest(score_func=chi2, k=10) 
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)

featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  
print(featureScores.nlargest(10,'Score'))  # Printing the 10 best features

first_set_of_important_features = list(featureScores.nlargest(10,'Score')['Specs']) # Retrieving the 10 best features

<a class="anchor" name="ExtraTreesClassifier"></a>
### Method 2: ExtraTreesClassifier
[Return to TOC](#Return)

In [None]:
model = ExtraTreesClassifier()
model.fit(X,y)

feat_importances = pd.Series(model.feature_importances_, index=X.columns) # Using the built-in feature_importances_ attribute of tree-based classifiers
feat_importances.nlargest(10).plot(kind='barh') # Plotting bar chart of feature importance
plt.title('Method 2 - Top 10 features with highest importance scores')
plt.xlabel('Feature importance score')
plt.ylabel('Feature')
plt.show()

second_set_of_important_features = list(feat_importances.nlargest(10).index) # Retrieving features that have the 10 highest feature importance scores

In [None]:
# Getting a cumulative list of unique, important features from the two methods

cumulative_list_important_features = first_set_of_important_features + second_set_of_important_features
cumulative_set_important_features = set(cumulative_list_important_features)
print("Cumulative set of unique, important features from the two ways:", cumulative_set_important_features)

<a class="anchor" name="RegTech"></a>
## Regression Techniques
[Return to TOC](#Return)

<a class="anchor" name="linReg"></a>
### Linear Regression
[Return to TOC](#Return)

Creating generic regression model

In [None]:
# Converting DataFrames to SFrames for Turicreate models to run on
train_SF_regression = tc.SFrame(train_df)
test_SF_regression = tc.SFrame(test_df)

In [None]:
# Creating generic regression model
regression_model = tc.regression.create(train_SF_regression, target='SalePrice')
regression_model_errors = regression_model.evaluate(test_SF_regression)
print(regression_model_errors) # Printing out errors after testing evaluation
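
The regression models below are compared by the `rmse` entry of the evaluation results. As a reminder of what that number measures, here is a minimal sketch of root mean squared error on made-up prices; the helper name `rmse` is hypothetical and is not a Turi Create function.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up sale prices; every prediction is off by exactly 10,000
actual_prices = [200000, 150000, 310000]
predicted_prices = [210000, 140000, 300000]

toy_rmse_value = rmse(actual_prices, predicted_prices)
print(toy_rmse_value)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.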

Creating linear regression model and finding the combination of features that yields the lowest RMSE

In [None]:
ideal_linear_regression_model_features_and_errors = []

for i in range(1, 4): # Choosing combinations from 1 to 3 features inclusive due to high computation expense
    list_combination_indexes = list(it.combinations(list(cumulative_set_important_features), i))
    ideal_linear_regression_model_features = list(list_combination_indexes[0])
    linear_regression_model = tc.linear_regression.create(train_SF_regression, target = 'SalePrice', features = ideal_linear_regression_model_features)
    ideal_linear_regression_model_errors = linear_regression_model.evaluate(test_SF_regression)
    
    print('BEGINNING')
    print('Number of features used:', len(ideal_linear_regression_model_features))
    print('Features used:', ideal_linear_regression_model_features)
    print('Errors:', ideal_linear_regression_model_errors)

    for j in list_combination_indexes: # iterating through the items in list of combinations of size i
        linear_regression_model = tc.linear_regression.create(train_SF_regression, target = 'SalePrice', features = list(j))
        linear_regression_model_errors = linear_regression_model.evaluate(test_SF_regression)
        
        if (linear_regression_model_errors['rmse'] < ideal_linear_regression_model_errors['rmse']): # comparing the error with current Ideal
            ideal_linear_regression_model_features = list(j) # Finding ideal features for each number combination with lowest amount of RMSE
            ideal_linear_regression_model_errors = linear_regression_model_errors
    
    print('ENDING')
    print('Number of features used:', len(ideal_linear_regression_model_features))
    print('Features used:', ideal_linear_regression_model_features)
    print('Errors:', ideal_linear_regression_model_errors)
    
    ideal_linear_regression_model_features_and_errors.append([ideal_linear_regression_model_features, ideal_linear_regression_model_errors['rmse']])

print(ideal_linear_regression_model_features_and_errors)
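
The search above can be sketched without any model training. In the example below, `toy_rmse` is a hypothetical stand-in for "train a model on these features and return its RMSE", and the penalty values are made up; only the search pattern (best combination per size, then best overall) mirrors the cell above.

```python
import itertools as it

def toy_rmse(features):
    # Hypothetical stand-in for training a model and evaluating its RMSE
    penalties = {'LotArea': 3.0, 'MoSold': 5.0, 'YrSold': 4.0, 'LotFrontage': 1.0}
    return sum(penalties[f] for f in features) / len(features) ** 2

candidates = ['LotArea', 'MoSold', 'YrSold', 'LotFrontage']

# Best combination for each size from 1 to 3 features inclusive
best_per_size = []
for size in range(1, 4):
    best = min(it.combinations(candidates, size), key=toy_rmse)
    best_per_size.append((list(best), toy_rmse(best)))

# Best combination across all sizes
overall_best = min(best_per_size, key=lambda pair: pair[1])
print(overall_best)
```

The number of combinations grows quickly with the number of candidate features, which is why the search is capped at three features.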

In [None]:
final_ideal_linear_regression_model_features_and_errors = ideal_linear_regression_model_features_and_errors[0]

for i in ideal_linear_regression_model_features_and_errors: # Finding ideal combination of features out of the ideal features for each number combination
    if i[1] < final_ideal_linear_regression_model_features_and_errors[1]:
        final_ideal_linear_regression_model_features_and_errors = i

In [None]:
# Printing results
print("Ideal number of features used:", len(final_ideal_linear_regression_model_features_and_errors[0]))
print("Ideal combination of features used:", final_ideal_linear_regression_model_features_and_errors[0])
print("RMSE:", final_ideal_linear_regression_model_features_and_errors[1])

<a class="anchor" name="bReg"></a>
### Boosted Regression Tree
[Return to TOC](#Return)

Creating boosted regression tree model and finding the max_depth and max_iterations that yield the lowest RMSE

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))

# Testing max_iterations and max_depth values only from 1 to 10 due to expensive computation
a = list(range(1, 11))
b = list(range(1, 11))
c = list(it.product(a, b))
print(c)

x = []
y = []

for val in c:
    x.append(val[0])
    y.append(val[1])

z = []

ideal_max_iterations_boosted_trees_regression = 1
ideal_max_depth_boosted_trees_regression = 1

ideal_boosted_trees_regression_model = tc.boosted_trees_regression.create(train_SF_regression, target = 'SalePrice', features = list(cumulative_set_important_features), max_iterations = 1, max_depth = 1)
ideal_boosted_trees_regression_model_errors = ideal_boosted_trees_regression_model.evaluate(test_SF_regression)

for i in range(1, 11):
    for j in range(1, 11):
        test_boosted_trees_regression_model = tc.boosted_trees_regression.create(train_SF_regression, target = 'SalePrice', features = list(cumulative_set_important_features), max_iterations = i, max_depth = j)
        test_boosted_trees_regression_model_errors = test_boosted_trees_regression_model.evaluate(test_SF_regression)
        
        z.append(test_boosted_trees_regression_model_errors['rmse'])

        if test_boosted_trees_regression_model_errors['rmse'] < ideal_boosted_trees_regression_model_errors['rmse']:
            ideal_max_iterations_boosted_trees_regression = i
            ideal_max_depth_boosted_trees_regression = j
            ideal_boosted_trees_regression_model_errors['rmse'] = test_boosted_trees_regression_model_errors['rmse']

df = pd.DataFrame({'max_iterations':x, 'max_depth':y, 'rmse':z})

# Creating a heatmap to show different combinations of max_depth and max_iterations and the model's corresponding RMSE
sns.heatmap(pd.crosstab(df['max_iterations'], df['max_depth'], values=df['rmse'] / 10000, aggfunc='sum'), linewidths=.5, ax=ax,annot=True)

ax.set_title('Ideal # of max_iterations and max_depth (RMSE shown in 10,000s) - Boosted trees regression model', fontsize = 15)

plt.show()

In [None]:
# Printing ideal max_iterations and max_depth for the boosted regression tree model to generate the least amount of RMSE
print(ideal_max_iterations_boosted_trees_regression)
print(ideal_max_depth_boosted_trees_regression)
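
The grid search pattern itself is independent of the model. As a minimal sketch, the example below walks the same 10 x 10 grid with `it.product`, using a made-up error function `toy_error` (a hypothetical stand-in for training and evaluating a boosted trees model) and keeps the parameter pair with the lowest error.

```python
import itertools as it

def toy_error(max_iterations, max_depth):
    # Hypothetical error surface with its minimum at (7, 4)
    return (max_iterations - 7) ** 2 + (max_depth - 4) ** 2

# Every (max_iterations, max_depth) pair from 1 to 10 inclusive
grid = list(it.product(range(1, 11), range(1, 11)))

# Evaluate each pair once and remember its error
errors = {params: toy_error(*params) for params in grid}

# The pair with the smallest recorded error
best_params = min(errors, key=errors.get)
print(best_params, errors[best_params])
```

Storing every error in a dictionary keyed by the parameter pair also gives the data needed for a heatmap without a second pass.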

<a class="anchor" name="class"></a>
## Classification Techniques
[Return to TOC](#Return)

<a class="anchor" name="Bins"></a>
### Creating Bins
[Return to TOC](#Return)

In [None]:
# Determine the maximum range of SalePrice between the testing & training data
max_value = max(train_df['SalePrice'])
min_value = min(train_df['SalePrice'])

if max(test_df['SalePrice']) > max_value:
    max_value = max(test_df['SalePrice'])

if min(test_df['SalePrice']) < min_value:
    min_value = min(test_df['SalePrice'])

# Creating the bins and their respective labels which will be used to encode the new target column, 
# 'SalePrice_label_for_bin_range'

bins = []

for i in range(min_value - 10000, max_value + 10001, 10000):
    bins.append(i)

labels = list(range(1, len(bins)))
## Train
# Creating new target column which is in the format of just an integer
train_df['SalePrice_bin_range'] = pd.cut(train_df['SalePrice'], bins = bins)
train_df['SalePrice_label_for_bin_range'] = pd.cut(train_df['SalePrice'], bins = bins, labels = labels)

# Create a dictionary to match the bin label to the actual range
label_for_bin_range_SalePrice = dict(zip(list(train_df['SalePrice_label_for_bin_range']), list(train_df['SalePrice_bin_range'].astype(str))))

## Test
# Creating new target column which is in the format of just an integer
test_df['SalePrice_bin_range'] = pd.cut(test_df['SalePrice'], bins = bins)
test_df['SalePrice_label_for_bin_range'] = pd.cut(test_df['SalePrice'], bins = bins, labels = labels)

# Drop bin range due to it not being necessary in our analysis
train_df.drop('SalePrice_bin_range', inplace = True, axis = 1)
test_df.drop('SalePrice_bin_range', inplace = True, axis = 1)
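
To make the bin labelling concrete: with its default settings, `pd.cut` assigns each value to a right-closed interval `(left, right]`, and the label is the position of that interval. The sketch below reproduces that mapping with the standard-library `bisect` module on made-up prices; the helper `bin_label` is hypothetical and only illustrates the arithmetic.

```python
import bisect

# Made-up sale prices
prices = [34900, 129500, 140000, 755000]
min_value, max_value = min(prices), max(prices)

# Same edge construction as above: 10,000-wide bins padded on both sides
bins = list(range(min_value - 10000, max_value + 10001, 10000))
labels = list(range(1, len(bins)))

def bin_label(price):
    # Position i of the right-closed interval (bins[i-1], bins[i]] containing price
    return bisect.bisect_left(bins, price)

price_labels = [bin_label(p) for p in prices]
print(len(bins), price_labels)
```

Padding the range by 10,000 below the minimum matters because the left edge of the first interval is open, so the minimum price would otherwise fall outside every bin.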

In [None]:
train_df['SalePrice_label_for_bin_range'] = train_df['SalePrice_label_for_bin_range'].astype(int)
test_df['SalePrice_label_for_bin_range'] = test_df['SalePrice_label_for_bin_range'].astype(int)

# Creating the new SFrame by dropping SalePrice so that it is not used as a feature to predict the 'SalePrice_label_for_bin_range'

train_df_classifier = train_df.drop('SalePrice', axis = 1)
test_df_classifier = test_df.drop('SalePrice', axis = 1)

train_SF_classifier = tc.SFrame(train_df_classifier)
test_SF_classifier = tc.SFrame(test_df_classifier)

<a class="anchor" name="GenericClass"></a>
### Creating generic classification model
[Return to TOC](#Return)

In [None]:
# Creating generic classification model
classifier_model = tc.classifier.create(train_SF_classifier, target = 'SalePrice_label_for_bin_range')
classifier_model_errors = classifier_model.evaluate(test_SF_classifier)

In [None]:
# Printing out accuracy of generic model
classifier_model_errors['accuracy']
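
Accuracy here is the fraction of test rows whose predicted bin label exactly matches the actual bin label. A minimal sketch with made-up labels:

```python
# Made-up actual and predicted bin labels
actual_bins = [3, 5, 5, 8, 12]
predicted_bins = [3, 5, 6, 8, 10]

# Count exact matches and divide by the number of rows
correct = sum(a == p for a, p in zip(actual_bins, predicted_bins))
accuracy = correct / len(actual_bins)
print(accuracy)
```

Note that a prediction one bin away counts as wrong just like a prediction ten bins away, so accuracy can look low even when predicted price ranges are close.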

<a class="anchor" name="RandomForest"></a>
### Creating random forest classification model
[Return to TOC](#Return)

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))

a = list(range(1, 11))
b = list(range(1, 11))
c = list(it.product(a, b))
print(c)

x = []
y = []

for val in c:
    x.append(val[0])
    y.append(val[1])

z = []

ideal_max_iterations_random_forest_classifier = 1
ideal_max_depth_random_forest_classifier = 1

ideal_random_forest_classifier_model = tc.random_forest_classifier.create(train_SF_classifier, target='SalePrice_label_for_bin_range', max_iterations = 1, max_depth = 1)
ideal_random_forest_classifier_model_errors = ideal_random_forest_classifier_model.evaluate(test_SF_classifier)['accuracy'] 

for i in range(1, 11):
    for j in range(1, 11):
        test_random_forest_classifier_model = tc.random_forest_classifier.create(train_SF_classifier, target='SalePrice_label_for_bin_range', max_iterations = i, max_depth = j)
        test_random_forest_classifier_model_errors = test_random_forest_classifier_model.evaluate(test_SF_classifier)
        
        z.append(test_random_forest_classifier_model_errors['accuracy'])
        
        # Finding ideal max_iterations and max_depth to generate highest testing accuracy
        if test_random_forest_classifier_model_errors['accuracy'] > ideal_random_forest_classifier_model_errors:
            ideal_max_iterations_random_forest_classifier = i
            ideal_max_depth_random_forest_classifier = j
            ideal_random_forest_classifier_model_errors = test_random_forest_classifier_model_errors['accuracy']

df = pd.DataFrame({'max_iterations':x, 'max_depth':y, 'accuracy':z})

# Creating a heatmap to show different combinations of max_depth and max_iterations and the model's corresponding accuracy
sns.heatmap(pd.crosstab(df['max_iterations'], df['max_depth'], values=df['accuracy'] * 100, aggfunc='sum'), linewidths=.5, ax=ax, annot=True)

ax.set_title('Ideal # of max_iterations and max_depth (Accuracy shown in %) - Random forest classifier model', fontsize=10)

plt.show()

In [None]:
# Printing ideal max_iterations and max_depth for the random forest classifier model
print(ideal_max_iterations_random_forest_classifier)
print(ideal_max_depth_random_forest_classifier)

<a class="anchor" name="graphs"></a>
## Graphs
[Return to TOC](#Return)

<a class="anchor" name="Violin_PlotsR"></a>
### Violin Plots - Regression
[Return to TOC](#Return)

Making violin plots to show the distributions of the regression models' predictions and the actual data

In [None]:
# Take our best linear model predictions and store them
ideal_linear_model = tc.linear_regression.create(train_SF_regression,features = final_ideal_linear_regression_model_features_and_errors[0], target='SalePrice')
ideal_linear_predictions = ideal_linear_model.predict(test_SF_regression)

# Also take our actual values for comparison
regression_actual_values = test_SF_regression['SalePrice']

# Take our generic model predictions (using all features) and store them
generic_linear_predictions = regression_model.predict(test_SF_regression)

# Take our best determined boosted trees regression model instance, recreate it, and store the predictions
boosted_trees_model = tc.boosted_trees_regression.create(train_SF_regression, target = 'SalePrice', features = list(cumulative_set_important_features), max_iterations = ideal_max_iterations_boosted_trees_regression, max_depth = ideal_max_depth_boosted_trees_regression)
boosted_trees_model_predictions = boosted_trees_model.predict(test_SF_regression)
boosted_trees_model_errors = boosted_trees_model.evaluate(test_SF_regression)['rmse']
# Make a list of all the plotting data
regression_plot_data = [regression_actual_values,ideal_linear_predictions,boosted_trees_model_predictions,generic_linear_predictions]

print('Ideal Linear Model:',ideal_linear_model.evaluate(test_SF_regression))
print('Boosted Trees Model:',boosted_trees_model.evaluate(test_SF_regression))
print('Generic Model:',regression_model.evaluate(test_SF_regression))

In [None]:
# For a one-unit change in a feature variable (LotArea, MoSold, or YrSold), holding the others fixed, the predicted
# value of the target variable SalePrice changes by that feature's coefficient.
ideal_linear_model.coefficients
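
As a small worked example of reading linear regression coefficients, the sketch below uses a made-up intercept and made-up coefficients (they are not the fitted values from the model above) and shows that increasing one feature by one unit changes the prediction by exactly that feature's coefficient.

```python
# Made-up intercept and coefficients, for illustration only
intercept = 1100000.0
coefficients = {'LotArea': 2.0, 'MoSold': 100.0, 'YrSold': -500.0}

def predict(row):
    # Linear model: intercept plus the sum of coefficient * feature value
    return intercept + sum(coefficients[name] * row[name] for name in coefficients)

house = {'LotArea': 8000, 'MoSold': 6, 'YrSold': 2008}
bigger_lot = dict(house, LotArea=house['LotArea'] + 1)

# Difference in prediction after a one-unit increase in LotArea
delta = predict(bigger_lot) - predict(house)
print(predict(house), delta)
```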

In [None]:
# Creating a violin plot showing the actual values compared to the predicted values for our regression models.
VLfig = plt.figure()
VLax = VLfig.add_axes([0,0,2,1])

Lxticklabels = ['Actual Values', 'Ideal Linear Regression Model','Boosted Model','Generic Model']

VLax.set_xticks([1,2,3,4])
VLax.set_xticklabels(Lxticklabels)

plt.title('Regression models',fontsize=20)
plt.xlabel('Models',fontsize=15)
plt.ylabel('Values',fontsize=15)

bp = VLax.violinplot(regression_plot_data)

"""
plt.ylim([0, 1000000])
VLax.set_yticks(list(range(0, 1000001, 100000)))
VLax.set_yticklabels(list(range(0, 1000001, 100000)))
"""

plt.show()

<a class="anchor" name="Violin_PlotsC"></a>
### Violin Plots - Classification
[Return to TOC](#Return)

Making violin plots to show the distributions of the classification models' predictions and the actual data

In [None]:
# Take our generic classifier model and store the predicted values
generic_classifier_predicted_values = classifier_model.predict(test_SF_classifier)

# Store our actual values as an SArray for comparison
classifier_actual_values = test_SF_classifier['SalePrice_label_for_bin_range']

# Recreate the random forest model using the ideal max_iterations and max_depth found above (the optimal max_depth was 1), then store those predictions
random_forest_classifier_model = tc.random_forest_classifier.create(train_SF_classifier, target='SalePrice_label_for_bin_range', max_iterations = ideal_max_iterations_random_forest_classifier, max_depth = ideal_max_depth_random_forest_classifier)
random_forest_classifier_predictions = random_forest_classifier_model.predict(test_SF_classifier)
random_forest_classifier_errors = random_forest_classifier_model.evaluate(test_SF_classifier)['accuracy']

# Make a list of all the plotting data
classification_plot_data = [classifier_actual_values,generic_classifier_predicted_values,random_forest_classifier_predictions]

In [None]:
print(random_forest_classifier_errors)

In [None]:
# Creating a violin plot showing the actual values compared to the predicted values for our classification models.
VCfig = plt.figure()
VCax = VCfig.add_axes([0,0,2,1])

xticklabels = ['Actual Values', 'Generic Classifier','Random Forest Classifier']

VCax.set_xticks([1,2,3])
VCax.set_xticklabels(xticklabels)

plt.title('Classification models',fontsize=20)
plt.xlabel('Models',fontsize=15)
plt.ylabel('Bin Values',fontsize=15)

ap = VCax.violinplot(classification_plot_data)
plt.show()

<a class="anchor" name="RHistogram"></a>
### Histograms - Regression 
[Return to TOC](#Return)

In [None]:
## Histogram plots for Regression models
n_bins = 50

fig, axs = plt.subplots(1, 4, sharey=True,sharex=True, tight_layout=False)
fig.set_figheight(10)
fig.set_figwidth(20)

axs[0].hist(regression_plot_data[0], bins = n_bins)
axs[1].hist(regression_plot_data[1], bins = n_bins)
axs[2].hist(regression_plot_data[2], bins = n_bins)
axs[3].hist(regression_plot_data[3], bins = n_bins)

axs[0].set_ylabel('Count', fontsize = 12)

for i in range(0, 4):
    axs[i].set_xlabel('SalePrice', fontsize = 12)

axs[0].set_title('Actual Data', fontsize = 15)
axs[1].set_title('Ideal Linear Model Predictions', fontsize = 15)
axs[2].set_title('Boosted Trees Model Predictions', fontsize = 15)
axs[3].set_title('Generic Model Predictions', fontsize = 15)

plt.show()

<a class="anchor" name="CHistogram"></a>
### Histograms - Classification
[Return to TOC](#Return)

In [None]:
## Histogram plots for Classification models
n_bins = 50

fig, axs = plt.subplots(1, 3, sharey=True,sharex=True, tight_layout=True)
fig.set_figheight(10)
fig.set_figwidth(15)

axs[0].hist(classification_plot_data[0], bins = n_bins)
axs[1].hist(classification_plot_data[1], bins = n_bins)
axs[2].hist(classification_plot_data[2], bins = n_bins)

axs[0].set_ylabel('Count', fontsize = 12)

for i in range(0, 3):
    axs[i].set_xlabel('SalePrice_BinLabels', fontsize = 12)
    
axs[0].set_title('Actual Data', fontsize = 15)
axs[1].set_title('Generic Classifier', fontsize = 15)
axs[2].set_title('Random Forest Model', fontsize = 15)

plt.show()

<a class="anchor" name="Line_PlotsR"></a>
### Line Plots - Regression
[Return to TOC](#Return)

In [None]:
# Comparing predicted SalePrice vs actual SalePrice based on LotArea, MoSold, and YrSold independently for linear regression model
test_df_used_for_graphs = test_df.copy()
test_df_used_for_graphs['YrSold'] = test_df_used_for_graphs['YrSold'].astype(int)
test_df_used_for_graphs['Linear_Regression_Predictions'] = ideal_linear_predictions

fig, ax = plt.subplots(figsize=(20, 10), nrows=1, ncols=3)
fig.suptitle('Linear Regression Model - Predicted SalePrice vs. Actual SalePrice based on:')

ax1 = plt.subplot(1, 3, 1)
sns.regplot(x='LotArea', y='SalePrice', data = test_df_used_for_graphs, label = 'Actual SalePrice', scatter=False, ci=None)
sns.regplot(x='LotArea', y='Linear_Regression_Predictions', data = test_df_used_for_graphs, label = 'Predicted SalePrice', scatter=False)
plt.title('Feature: LotArea')
plt.ylabel('SalePrice')
plt.legend()

ax2 = plt.subplot(1, 3, 2)
sns.regplot(x='MoSold', y='SalePrice', data = test_df_used_for_graphs, label = 'Actual SalePrice', scatter=False, ci=None)
# Use the linear regression predictions column created above (there is no 'Predictions' column)
sns.regplot(x='MoSold', y='Linear_Regression_Predictions', data = test_df_used_for_graphs, label = 'Predicted SalePrice', scatter=False)
plt.legend()
plt.title('Feature: MoSold')
# plt.ylabel('SalePrice')
ax2.yaxis.label.set_visible(False)

ax3 = plt.subplot(1, 3, 3)
sns.regplot(x='YrSold', y='SalePrice', data = test_df_used_for_graphs, label = 'Actual SalePrice', scatter=False, ci=None)
sns.regplot(x='YrSold', y='Linear_Regression_Predictions', data = test_df_used_for_graphs, label = 'Predicted SalePrice', scatter=False)
plt.title('Feature: YrSold')
# plt.ylabel('SalePrice')
plt.legend()
ax3.set_xticks(list(range(2006, 2011)))
ax3.yaxis.label.set_visible(False)

fig.tight_layout()

plt.show()
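
The 'Actual SalePrice' and 'Predicted SalePrice' lines in these plots are not raw values. With the scatter points turned off, `sns.regplot` draws only a fitted linear trend of the y column against the chosen feature (plus a confidence band unless `ci=None`). The sketch below illustrates that idea under the assumption of an ordinary least-squares line fit with `np.polyfit`. The helper name `linear_trend` and the sample numbers are hypothetical and are not taken from the dataset.

```python
import numpy as np

def linear_trend(x, y):
    """Fit y = slope * x + intercept by ordinary least squares (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(slope), float(intercept)

# Made-up values for illustration only, not rows from the housing data
example_lot_area = [5000, 8000, 10000, 12000]
example_sale_price = [120000, 180000, 220000, 260000]

example_slope, example_intercept = linear_trend(example_lot_area, example_sale_price)
print(example_slope, example_intercept)
```

Comparing the slope and intercept fitted to the actual prices with those fitted to a model's predictions is roughly what each panel shows visually.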

In [None]:
# Comparing predicted SalePrice vs actual SalePrice based on LotArea, MoSold, and YrSold independently for boosted regression tree model
test_df_used_for_graphs['Boosted_Regression_Tree_Predictions'] = boosted_trees_model_predictions

fig, ax = plt.subplots(figsize=(20, 10), nrows=1, ncols=3)
fig.suptitle('Boosted Regression Tree Model - Predicted SalePrice vs. Actual SalePrice based on:')

ax1 = plt.subplot(1, 3, 1)
sns.regplot(x='LotArea', y='SalePrice', data = test_df_used_for_graphs, label = 'Actual SalePrice', scatter=False, ci=None)
sns.regplot(x='LotArea', y='Boosted_Regression_Tree_Predictions', data = test_df_used_for_graphs, label = 'Predicted SalePrice', scatter=False)
plt.title('Feature: LotArea')
plt.ylabel('SalePrice')
plt.legend()

ax2 = plt.subplot(1, 3, 2)
sns.regplot(x='MoSold', y='SalePrice', data = test_df_used_for_graphs, label = 'Actual SalePrice', scatter=False, ci=None)
sns.regplot(x='MoSold', y='Boosted_Regression_Tree_Predictions', data = test_df_used_for_graphs, label = 'Predicted SalePrice', scatter=False)
plt.title('Feature: MoSold')
# plt.ylabel('SalePrice')
plt.legend()
ax2.yaxis.label.set_visible(False)

ax3 = plt.subplot(1, 3, 3)
sns.regplot(x='YrSold', y='SalePrice', data = test_df_used_for_graphs, label = 'Actual SalePrice', scatter=False, ci=None)
# Plot the boosted trees predictions here, not the linear regression predictions
sns.regplot(x='YrSold', y='Boosted_Regression_Tree_Predictions', data = test_df_used_for_graphs, label = 'Predicted SalePrice', scatter=False)
plt.title('Feature: YrSold')
# plt.ylabel('SalePrice')
plt.legend()
ax3.set_xticks(list(range(2006, 2011)))
ax3.yaxis.label.set_visible(False)

fig.tight_layout()

plt.show()

<a class="anchor" name="Summary"></a>
# Summary
[Return to TOC](#Return)

Comparing RMSE across all regression models on the test data (lower RMSE means better performance)
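
For reference, RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as SalePrice. The following is a minimal sketch of that calculation, not the evaluation code used by the models above. The function name `rmse` and the sample prices are hypothetical.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences (hypothetical helper)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up sale prices for illustration only
example_actual = [200000.0, 150000.0, 320000.0]
example_predicted = [210000.0, 140000.0, 320000.0]

example_rmse = rmse(example_actual, example_predicted)
print(round(example_rmse, 2))
```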

In [None]:
BRfig = plt.figure()
BRax = BRfig.add_axes([0,0,1,1])

regression_model_labels = ['Generic Regression Model','Ideal Linear Model','Boosted Trees Model']
RMSE_Values = [regression_model_errors['rmse'],final_ideal_linear_regression_model_features_and_errors[1],boosted_trees_model_errors]

# Plotting bar chart for RMSE of each model after testing evaluation
BRax.bar(regression_model_labels,RMSE_Values, color=['purple', 'blue', 'green'])
plt.title('Regression models',fontsize=20)
plt.xlabel('Models',fontsize=15)
plt.ylabel('RMSE',fontsize=15)


plt.show()

Comparing accuracy across all classification models on the test data (higher accuracy means better performance)
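
For reference, accuracy is the fraction of test rows whose predicted bin label matches the actual bin label, shown here as a percentage to match the chart. The following is a minimal sketch of that calculation, not the evaluation code used by the models above. The function name `accuracy_percent` and the sample labels are hypothetical.

```python
import numpy as np

def accuracy_percent(actual_labels, predicted_labels):
    """Percentage of positions where the predicted label equals the actual label (hypothetical helper)."""
    actual = np.asarray(actual_labels)
    predicted = np.asarray(predicted_labels)
    return float(np.mean(actual == predicted) * 100)

# Made-up bin labels for illustration only
example_actual_bins = [0, 1, 2, 1]
example_predicted_bins = [0, 1, 1, 1]

example_accuracy = accuracy_percent(example_actual_bins, example_predicted_bins)
print(example_accuracy)
```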

In [None]:
BLfig = plt.figure()
BLax = BLfig.add_axes([0,0,1,1])

classification_model_labels = ['Generic Classification Model','Random Forest Model']
accuracy_values = [classifier_model_errors['accuracy'] * 100, random_forest_classifier_errors * 100]

# Plotting bar chart for accuracy of each model after testing evaluation
BLax.bar(classification_model_labels, accuracy_values, color=['blue', 'green'])
plt.title('Classification models',fontsize=20)
plt.xlabel('Models',fontsize=15)
plt.ylabel('Accuracy (%)',fontsize=15)

plt.show()