# *Group Exam INFO284 V21*

# *SALE PRICE PREDICTION NYC SALES*

### **Members: Didrik Nettelhorst Krog, Jonas Bech Holtan, Gunnar Hole Gjengedal, Snorre Alvsvåg**


## Introduction
### The task
You are supposed to build at least five machine learning models from these data to predict or
classify one relevant target feature for new data points. You can choose target feature yourself, but
sales price is perhaps the most suitable. You may also reduce the number of data points somewhat
by focusing on only specific meaningful parts of the data. Or perhaps you will try dimension
reduction.

### Our Approach
This task can be split into four parts:
- **Part 1** Data Exploration
- **Part 2** Data Cleaning
- **Part 3** Model Building
- **Part 4** Presentation and Analysis

We will explore our data by conducting an *Exploratory Data Analysis*. Here we will look at the data, note which features are important and which are not, and generally familiarize ourselves with the data. In part 2, we will clean the data by removing empty or unique columns and so on. In part 3 we will fit the models, adapting the data in preparation for each model. Finally, we will present the results and analyze the different approaches we took with our models.

We have chosen these supervised machine learning algorithms:
- Decision Tree Regressor
- Random Forest Regressor
- Gradient Boosting Regressor

- Linear Regression
- Ridge Regression
- Lasso Regression
- Neural Network (MLPRegressor)

- Support Vector Machine (SVM)

- Gaussian Naive Bayes
- Bernoulli Naive Bayes

We have chosen one unsupervised machine learning algorithm:
- Clustering (k-means)

Moving forward, we will start by learning about our data.

# Importing Data and Others

In [None]:
import pandas as pd
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

df = pd.read_csv('nyc-rolling-sales.csv')

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVR, SVR, NuSVR 

from sklearn.preprocessing import LabelEncoder

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV

# Presenting the Data

In [None]:
df.head(10)

In [None]:
df.tail(10)

In [None]:
df.shape

In [None]:
df.info()

As can be seen above, the data contains several features of different types, both categorical and continuous. Before creating the models necessary for this assignment, we need to convert them to the appropriate types.

*The continuous features are:*

- SALE PRICE
    - Depending on which model we are using, 'SALE PRICE' will be either continuous or categorical. This is because this feature serves as our target feature and must have the appropriate type for the specific model.
- LAND SQUARE FEET
    - The land area of the property listed in square feet
    
- GROSS SQUARE FEET 
    - The total area of all the floors of a building as measured from the exterior surfaces of the outside walls of the
      building, including the land area and space within any building or structure on the property.
      
- SALE DATE (Which will be converted to MONTH SOLD and YEAR SOLD)
    - After the conversion these will be treated as categorical. 'MONTH SOLD' is the month in which the property was sold and
      'YEAR SOLD' is the year in which it was sold.
    
- COMMERCIAL UNITS
    - The number of commercial units at the listed property.
    
- RESIDENTIAL UNITS 
    - The number of residential units at the listed property.
    
- TOTAL UNITS
    - The total number of units at the listed property.
    
- BLOCK
    - Because there are more than 11k unique blocks in the dataset, it doesn't make sense to define it as a categorical 
      variable
      
- LOT
    - The same reason as for the 'BLOCK' feature

*The categorical features are:*

- BOROUGH
    - A digit code for the borough the property is located in; in order these are Manhattan (1), Bronx (2), Brooklyn (3),
      Queens (4), and Staten Island (5).
      
- NEIGHBORHOOD
    - Names of the neighborhoods, which work better as a categorical feature than as a numerical one
    
- ZIP CODE
    - The property’s postal code.
    
- BUILDING CLASS CATEGORY
    - Easily identifiable categories of the possible types of buildings
    
- TAX CLASS AT PRESENT
    - Every property in the city is assigned to one of four tax classes, based on the use of the property.
    
- BUILDING CLASS AT PRESENT
    - The Building Classification is used to describe a property’s constructive use. The first position of the Building Class
      is a letter that is used to describe a general class of properties. The second position, a number, adds more specific
      information about the property’s use or construction style.
      
- YEAR BUILT
    - Year the structure on the property was built.
    
- TAX CLASS AT TIME OF SALE
    - See 'TAX CLASS AT PRESENT'
    
- BUILDING CLASS AT TIME OF SALE
    - See 'BUILDING CLASS AT PRESENT'

*The features we definitely don't need are:*

- Unnamed: 0 
    - Dropping it because it just looks like an iterator
    
- EASE-MENT 
    - Because it only contains NaN values
    
- APARTMENT NUMBER 
    - The number of the apartment is not relevant for sale price
    
- ADDRESS
    - Address is just listed as names and we feel that it won't affect the sales price prediction at all.

# Cleaning the Overall Data

**The next section is about converting, reducing and cleaning the features we are going to use in our models. The cleaning done here is for the Regression and Tree Models used later in the code. Since we are also using Gaussian and Bernoulli Naive Bayes Models, there will be a second round of data cleaning and pre-processing later, after the models mentioned above.**

**Either way, the most important part of this data cleaning is to convert the necessary features to more appropriate types and to remove most of the outliers. We will end up with some outliers regardless, because of the complexity of our dataset. However, with our Unsupervised Machine Learning Model, k-means, we expect the models to perform well even with many outliers.**

The features we definitely don't need

In [None]:
df = df.drop(['EASE-MENT', 'Unnamed: 0', 'APARTMENT NUMBER', "ADDRESS"], axis=1)

Checking for any duplicates

In [None]:
sum(df.duplicated(df.columns))

Removing duplicates

In [None]:
df = df.drop_duplicates(df.columns, keep='last')
sum(df.duplicated(df.columns))

Checking if there are any null values in the dataframe

In [None]:
df.isna().any(), df.isnull().any()

Replacing empty or '-' values with NaN

In [None]:
df = df.replace(' ', np.nan)
df = df.replace(' -  ', np.nan)
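
The two literal replacements above only match those exact strings. As a rough sketch, assuming the placeholders are always whitespace with an optional dash, a single regular expression can catch the variants in one pass. The toy frame `demo` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: placeholders with varying amounts of whitespace
demo = pd.DataFrame({'SALE PRICE': ['450000', ' -  ', ' ', '-', '1200000']})

# Match cells that contain only whitespace and at most one dash
demo_clean = demo.replace(r'^\s*-?\s*$', np.nan, regex=True)

print(demo_clean['SALE PRICE'].isna().sum())
```

This is slightly more tolerant than matching fixed strings, at the cost of also treating completely empty strings as missing.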

An overview of where the NaN values are

In [None]:
print("Percentage null or na values in Dataset\n-------------------------------------")
((df.isnull() | df.isna()).sum() * 100 / df.index.size).round(2)

### Removing some of the features

Before further inspection, we decided to remove some more of the features because we don't think they are necessary. These are:

- "ZIP CODE" : 

- "BLOCK" :

- "LOT" : 

In [None]:
df = df.drop(['ZIP CODE', 'BLOCK', 'LOT'], axis=1)

## Cleaning Sale Price

Dropping all the NaN values

In [None]:
df = df.dropna(subset=['SALE PRICE'])

Converting 'SALE PRICE' to a more appropriate type

In [None]:
df["SALE PRICE"] = df["SALE PRICE"].astype(float)

Making a diagram to get a clearer view of how 'SALE PRICE' is distributed

In [None]:
plt.figure(figsize=(12,6))
sns.distplot(df['SALE PRICE'], kde=True, bins=50, rug=True)
plt.show()

From the distribution above, there are a lot of outliers around 2,000,000 and above, while the majority of the values are close to 0. The most suitable range to keep is between 0 and 3,000,000.

In [None]:
df[(df['SALE PRICE'] < 10000) | (df['SALE PRICE'] > 3000000)]['SALE PRICE'].count() /len(df) 

This shows that 22% of the values are either greater than 3,000,000 or less than 10,000. We remove these to get a better distribution of 'SALE PRICE'.

In [None]:
df= df[(df['SALE PRICE'] > 10000) & (df['SALE PRICE']<3000000)]

In [None]:
df['SALE PRICE'].value_counts().head(10)

In [None]:
plt.figure(figsize=(12,6))
sns.distplot(df['SALE PRICE'], kde=True, bins=50, rug=True)
plt.show()

**The 'SALE PRICE' feature, our target feature, is now more appropriately distributed for our models.**

## Cleaning Square Feet

There are a lot of missing values in these features.
We can either fill them with the mean or remove them.

For the time being, let's remove the rows with missing values.

In [None]:
df = df.dropna(subset=['LAND SQUARE FEET'])
df = df.dropna(subset=['GROSS SQUARE FEET'])
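
The alternative mentioned above, filling the gaps with the mean instead of dropping rows, could look roughly like this. It is only a sketch on made-up numbers, not what the notebook does; `demo` and `demo_filled` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing square-footage values
demo = pd.DataFrame({
    'LAND SQUARE FEET': [2000.0, np.nan, 2500.0, 4000.0],
    'GROSS SQUARE FEET': [1800.0, 2200.0, np.nan, 3600.0],
})

cols = ['LAND SQUARE FEET', 'GROSS SQUARE FEET']
# Fill each column's gaps with that column's own mean
demo_filled = demo.copy()
demo_filled[cols] = demo_filled[cols].fillna(demo_filled[cols].mean())

print(demo_filled)
```

Swapping `.mean()` for `.median()` gives a fill value that is less affected by very large properties.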

Converting the features to a more appropriate type

In [None]:
df['LAND SQUARE FEET'] = df['LAND SQUARE FEET'].astype(float)
df['GROSS SQUARE FEET'] = df['GROSS SQUARE FEET'].astype(float)

In [None]:
df['LAND SQUARE FEET'].value_counts().head(10)

In [None]:
df['GROSS SQUARE FEET'].value_counts().head(10)

There are a lot of 0 values here, which would mean that these properties have no square footage. We will remove these, along with any other values of 10 or less.

In [None]:
df = df[df['LAND SQUARE FEET'] > 10]

In [None]:
df = df[df['GROSS SQUARE FEET'] > 10]

In [None]:
plt.figure(figsize=(12,6))
sns.distplot(df['LAND SQUARE FEET'], kde=True, bins=50, rug=True)
plt.show()

In [None]:
plt.figure(figsize=(12,6))
sns.distplot(df['GROSS SQUARE FEET'], kde=True, bins=50, rug=True)
plt.show()

As can be seen from the diagrams and value_counts(), there are a lot of outliers in these features.
These can be cleaned by keeping only the values below an upper limit.

In [None]:
df = df[df['LAND SQUARE FEET'] < 20000]

In [None]:
df = df[df['GROSS SQUARE FEET'] < 20000]

In [None]:
plt.figure(figsize=(12,6))
sns.distplot(df['LAND SQUARE FEET'], kde=True, bins=50, rug=True)
plt.show()

In [None]:
plt.figure(figsize=(12,6))
sns.distplot(df['GROSS SQUARE FEET'], kde=True, bins=50, rug=True)
plt.show()

## Cleaning Sale Date

TimeStamp types are hard to work with. Therefore, we split the 'SALE DATE' feature into two distinct features, 'YEAR SOLD' and
'MONTH SOLD'. These two features can help us later in further analysis of the data.

In [None]:
df['SALE DATE'].value_counts().head()

Making a list with all the different months

In [None]:
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

Splitting Sale Date into two features, one for Month Sold and one for Year Sold. This makes the TimeStamp feature easier to use and lets us calculate which month or year had the highest sales rate.

In [None]:
df['YEAR SOLD'] = [int(n[0:4]) for n in df['SALE DATE']]
df['MONTH SOLD'] = [int(n[5:7]) for n in df['SALE DATE']]
df = df.drop(['SALE DATE'], axis=1)
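
The string slicing above depends on the exact character positions of the date. A sketch of the same split using `pd.to_datetime`, which parses the date rather than cutting it, is shown here on two made-up rows; the date layout is an assumption based on the slicing indices.

```python
import pandas as pd

# Hypothetical sample in the 'YYYY-MM-DD HH:MM:SS' layout the slicing assumes
demo = pd.DataFrame({'SALE DATE': ['2017-07-19 00:00:00', '2016-12-14 00:00:00']})

sale_date = pd.to_datetime(demo['SALE DATE'])
demo['YEAR SOLD'] = sale_date.dt.year
demo['MONTH SOLD'] = sale_date.dt.month

print(demo[['YEAR SOLD', 'MONTH SOLD']])
```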

In [None]:
plt.hist(df['YEAR SOLD'], bins=2, color='c')
counts_per_year = [sum(df['YEAR SOLD'] == 2016), sum(df['YEAR SOLD'] == 2017)]
# seaborn 0.12+ only accepts `data` positionally; x and y must be keyword arguments
sns.barplot(x=[2016, 2017], y=counts_per_year).set_title('Number of properties sold by year')

The first diagram shows that significantly more properties were sold in 2017 than in 2016. This may indicate a further increase for 2018, although the comparison only holds if the dataset covers comparable parts of both years.

Code retrieved from: https://jshams.github.io/NYC-real-estate-analysis/new_york_real_estate.slides.html#/5/2 and https://jshams.github.io/NYC-real-estate-analysis/new_york_real_estate.slides.html#/6/2

Transforming to more appropriate types

In [None]:
df['MONTH SOLD'] = df['MONTH SOLD'].astype('category')

In [None]:
df['YEAR SOLD'] = df['YEAR SOLD'].astype('category')

## Cleaning Tax Class

Dropping all the NaN values

In [None]:
df = df.dropna(subset=['TAX CLASS AT PRESENT'])

In [None]:
df['TAX CLASS AT TIME OF SALE'].value_counts()

In [None]:
df['TAX CLASS AT PRESENT'].value_counts()

Tax classes, as mentioned earlier, should be categorical. Therefore, it is more appropriate to replace the values with more informative labels, as shown below.

In [None]:
df['TAX CLASS AT TIME OF SALE'] = df['TAX CLASS AT TIME OF SALE'].replace({1:'Class_1',
                                                                           2:'Class_2',
                                                                           4:'Class_4'})

In [None]:
df['TAX CLASS AT PRESENT'] = df['TAX CLASS AT PRESENT'].replace({'1':'Class_1',
                                                                 '1A':'Class_1',
                                                                 '1B':'Class_1',
                                                                 '1C':'Class_1',
                                                                 '2':'Class_2',
                                                                 '2A':'Class_2',
                                                                 '2B':'Class_2',
                                                                 '2C':'Class_2',
                                                                 '3':'Class_3',
                                                                 '4':'Class_4'})

Converting the features to more appropriate types

In [None]:
df['TAX CLASS AT TIME OF SALE'] = df['TAX CLASS AT TIME OF SALE'].astype('category')
df['TAX CLASS AT PRESENT'] = df['TAX CLASS AT PRESENT'].astype('category')

In [None]:
df['TAX CLASS AT PRESENT'].value_counts().plot(kind='pie')

In [None]:
df['TAX CLASS AT TIME OF SALE'].value_counts().plot(kind='pie')

**These pie charts show that more than 3/4 of all the values in the features belong to class_1.**

- Class 1: Includes most residential property of up to three units (such as one-, two-, and three-family homes and small stores or offices with one or two attached apartments), vacant land that is zoned for residential use, and most condominiums that are not more than three stories.
- Class 2: Includes all other property that is primarily residential, such as cooperatives and condominiums.
- Class 4: Includes all other properties not included in class 1, 2, and 3, such as offices, factories, warehouses, garage buildings, etc.

**We have chosen to not remove class_2 and class_4 because we want to focus on all of the properties instead of only focusing on residential properties.**

## Cleaning Building Class

Converting the features to more appropriate types

In [None]:
df['BUILDING CLASS AT TIME OF SALE'] = df['BUILDING CLASS AT TIME OF SALE'].astype('category')

'BUILDING CLASS AT PRESENT' and 'BUILDING CLASS AT TIME OF SALE' are essentially the same. We remove one of them to avoid having too many features.

In [None]:
df = df.drop(['BUILDING CLASS AT PRESENT'], axis=1)

In [None]:
df['BUILDING CLASS CATEGORY'].value_counts()

In [None]:
property_types = {'01 ONE FAMILY DWELLINGS': 12390,
 '02 TWO FAMILY DWELLINGS': 9590,
 '03 THREE FAMILY DWELLINGS': 2239,
 '07 RENTALS - WALKUP APARTMENTS': 1177,
 'Others' : 1045}                         # 'Others' here is the sum of fifth - 28th value given above

labels = property_types.keys()
sizes = property_types.values()
colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99', '#be3cb2']
plt.pie(sizes, colors = colors, labels=labels, autopct='%1.2f%%',startangle=35, pctdistance=0.75, explode = tuple([0.05] * 5))
centre_circle = plt.Circle((0,0),0.50,fc='#ffffff')
plt.gcf().gca().add_artist(centre_circle)

From the value_counts() output and the pie chart above, we can observe that the first four values stand out the most, and that the rest would be better combined into one value to reduce the number of outliers. This is because the combined 'Others' value makes up a lower percentage than the fourth value. Code retrieved from: https://jshams.github.io/NYC-real-estate-analysis/new_york_real_estate.slides.html#/8

We tried to make a new feature called others with the fifth all the way down to the 20th value, but it seems it is not possible to combine all of them. Either way, we chose to keep the feature anyway. 
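
One possible way to combine the less frequent categories into a single 'Others' value is sketched here on a made-up column. The choice of `top_n` and the toy counts are assumptions for illustration, and this is not applied to the data in this notebook.

```python
import pandas as pd

# Hypothetical toy column; the category names mimic the real ones
cat = pd.Series(['01 ONE FAMILY DWELLINGS'] * 5
                + ['02 TWO FAMILY DWELLINGS'] * 4
                + ['22 STORE BUILDINGS'] * 2
                + ['29 COMMERCIAL GARAGES'])

top_n = 2  # assumption: keep the two most frequent categories
top = cat.value_counts().nlargest(top_n).index
# Keep values that are in the top list and replace everything else with 'Others'
collapsed = cat.where(cat.isin(top), 'Others')

print(collapsed.value_counts())
```

If the real column has already been converted to the `category` dtype, converting it back to strings first (for example with `.astype(str)`) avoids errors about 'Others' not being an existing category.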

Converting to a more appropriate type

In [None]:
df['BUILDING CLASS CATEGORY'] = df['BUILDING CLASS CATEGORY'].astype('category')

## Cleaning Borough

Changing the borough index to the borough's real name in New York City

In [None]:
# Replace the numeric borough codes with names in one step (avoids chained assignment)
borough_names = {1: 'Manhattan', 2: 'Bronx', 3: 'Brooklyn', 4: 'Queens', 5: 'Staten Island'}
df['BOROUGH'] = df['BOROUGH'].replace(borough_names)

Converting 'BOROUGH' to a more appropriate type

In [None]:
df['BOROUGH'] = df['BOROUGH'].astype('category')

In [None]:
sns.countplot(x='BOROUGH', data=df, palette='Set2')
plt.title('Sales per Borough')

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(y = 'BOROUGH', x = 'SALE PRICE', data = df )
plt.title('Box plots for SALE PRICE on each BOROUGH')
plt.show()

The diagrams above show how much the number of properties sold and the number of outliers differ between the boroughs.

## Cleaning Year Built

In [None]:
df=df[df['YEAR BUILT']!=0]
sns.distplot(df['YEAR BUILT'], bins=50, rug=True)
plt.show()

**As can be seen from the graph above, the feature will work best if the oldest values are removed. We therefore keep only properties built after 1880, which removes some of the possible outliers in our model.**

In [None]:
df= df[(df['YEAR BUILT'] > 1880)]

Converting the feature to a more appropriate type

In [None]:
df['YEAR BUILT'] = df['YEAR BUILT'].astype('category')

## Cleaning "... Units"

Dropping 'TOTAL UNITS' because it is only the sum of 'RESIDENTIAL UNITS' and 'COMMERCIAL UNITS'

In [None]:
df = df.drop(['TOTAL UNITS'], axis=1)

In [None]:
sns.countplot(x='RESIDENTIAL UNITS', data=df, palette='Set2')

In [None]:
sns.countplot(x='COMMERCIAL UNITS', data=df, palette='Set2')

In [None]:
df= df[(df['RESIDENTIAL UNITS'] < 10)]

In [None]:
df= df[(df['COMMERCIAL UNITS'] < 7)]

**The 'Units' features have a really skewed distribution. To avoid too many outliers, we only include the values with the highest counts.**

## Cleaning Neighborhood

Converting the feature to a more appropriate type

In [None]:
df['NEIGHBORHOOD'] = df['NEIGHBORHOOD'].astype('category')

# Overview so far

In [None]:
df.info()

**By now, all the necessary features have been cleaned of most outliers and converted to the types needed for the algorithms.**

# Data Pre-Processing - Making Categorical Data more Appropriate

**Since most of our models prefer numerical features, and our dataset contains a significant amount of categorical data, we have to encode our categorical data so we can actually use the features in our models. For this we will use One-Hot-Encoding and Label Encoding, depending on how many unique values exist in the features.**

In [None]:
df['BOROUGH'].unique()

In [None]:
df['TAX CLASS AT PRESENT'].unique()

In [None]:
df['TAX CLASS AT TIME OF SALE'].unique()

In [None]:
df['BUILDING CLASS CATEGORY'].unique()

In [None]:
df['YEAR SOLD'].unique()

**The categorical features above have few unique values, which makes One-Hot-Encoding the most appropriate way to make the features numerical.**

**We make a list of all these features for an easier overview**

In [None]:
one_hot_features = ['BOROUGH', 'TAX CLASS AT PRESENT', 'TAX CLASS AT TIME OF SALE', 'BUILDING CLASS CATEGORY', 'YEAR SOLD']

In [None]:
one_hot_encoded = pd.get_dummies(df[one_hot_features])

df = df.drop(one_hot_features, axis = 1)

df = pd.concat([df, one_hot_encoded], axis=1)

**The rest of the categorical features are a little different. They have hundreds of unique values, which makes Label Encoding more appropriate than One-Hot-Encoding. This prevents a much higher dimensionality and ending up with many more features than necessary.**

In [None]:
df['NEIGHBORHOOD'].unique()

In [None]:
label_encoder = LabelEncoder()

df['NEIGHBORHOOD'] = label_encoder.fit_transform(df['NEIGHBORHOOD'])

In [None]:
df['BUILDING CLASS AT TIME OF SALE'].unique()

In [None]:
label_encoder_2 = LabelEncoder()

df['BUILDING CLASS AT TIME OF SALE'] = label_encoder_2.fit_transform(df['BUILDING CLASS AT TIME OF SALE'])

In [None]:
df['YEAR BUILT'].unique()

In [None]:
label_encoder_3 = LabelEncoder()

df['YEAR BUILT'] = label_encoder_3.fit_transform(df['YEAR BUILT'])

In [None]:
df['MONTH SOLD'].unique()

In [None]:
label_encoder_4 = LabelEncoder()

df['MONTH SOLD'] = label_encoder_4.fit_transform(df['MONTH SOLD'])
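
To make the difference between the two encodings concrete, here is a small sketch on a made-up 'BOROUGH' column. The toy values are assumptions; the calls are the same `pd.get_dummies` and `LabelEncoder` used above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

demo = pd.DataFrame({'BOROUGH': ['Queens', 'Bronx', 'Queens', 'Manhattan']})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(demo['BOROUGH'], prefix='BOROUGH')

# Label encoding: a single integer column, classes numbered in sorted order
encoder = LabelEncoder()
labels = encoder.fit_transform(demo['BOROUGH'])

print(one_hot.columns.tolist())
print(list(encoder.classes_), labels.tolist())
```

Label encoding keeps the number of columns small, but the integers imply an order that the categories do not really have. Tree models can usually work around this, while linear models may read it as a real ordering.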

# K-means Unsupervised Learning

**To improve the overall performance of the models, we decided to use the k-means method to make clusters and see if there are any logical ways to group the data in the feature space.**

We start by making a scatter plot of each feature against the sale price.

In [None]:
scatter_df = df.drop(['SALE PRICE'], axis = 1)

for col in scatter_df.columns: 
    plt.scatter(scatter_df[col], df['SALE PRICE']) 
    plt.ylabel('Sale price') 
    plt.xlabel(col) 
    plt.show()

As the scatter plots above show, the features are scattered very differently from each other because of the categorical features. The continuous features show widely scattered points, while the categorical features are scattered along lines. This will affect our k-means cluster model.

Since we are using k-means clustering, we need to define k, the number of clusters we want in our unsupervised model. We use the elbow method to find the optimal k.

The code used below for our Unsupervised Learning Model is retrieved from: https://predictivehacks.com/k-means-elbow-method-code-for-python/

In [None]:
distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(df)
    distortions.append(kmeanModel.inertia_)

plt.figure(figsize=(16,8))
plt.plot(K, distortions, 'bx-') 
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()

**The plot above shows us that the optimal k is two. We only need two clusters.**

In [None]:
kmeanModel = KMeans(n_clusters=2)
kmeanModel.fit(df)

df['k_means'] = kmeanModel.predict(df)

fig, axes = plt.subplots(1, 2, figsize=(24,12))
axes[0].scatter(df['k_means'], df['SALE PRICE'], c=df['SALE PRICE'])
axes[1].scatter(df['k_means'], df['SALE PRICE'], c=df['k_means'], cmap=plt.cm.Set1)
axes[0].set_title('Actual', fontsize=18)
axes[1].set_title('K_Means', fontsize=18)

**As the plots above show, the k-means clustering result is questionable. Because the x-axis is the cluster label, the points form two straight lines covering two very different intervals, and the plots do not really tell us much about how the data are grouped. This might be because of the high complexity of our data or because of how the categorical features are placed in the feature space, as the scatter plots above show.**

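**However, as we have seen in different tests of the machine learning models, even though the visualization of the Unsupervised Learning Model is not very informative, the models perform much better with the k-means feature than without it. This suggests that the k-means clustering is of value and that there are groupings in the feature space. Note that the clustering is fitted on the whole data frame, including 'SALE PRICE', so part of this improvement may come from the cluster label carrying information about the target.**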
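
The clustering above is fitted on every column of `df`, unscaled and including 'SALE PRICE'. As a hedged sketch of an alternative, the snippet below clusters on scaled features only, so that the cluster label cannot encode the target. The data is synthetic and the column choice is an assumption; it is not the configuration used for the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in for the cleaned data frame
demo = pd.DataFrame({
    'GROSS SQUARE FEET': rng.uniform(500, 5000, size=200),
    'LAND SQUARE FEET': rng.uniform(500, 5000, size=200),
    'SALE PRICE': rng.uniform(10000, 3000000, size=200),
})

# Cluster on the features only, leaving the target out
features = demo.drop(columns=['SALE PRICE'])
# Scale first: k-means uses Euclidean distance, so large-valued columns dominate otherwise
scaled = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
demo['k_means'] = kmeans.fit_predict(scaled)

print(demo['k_means'].value_counts())
```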

In [None]:
minmax_df = df.copy()

In [None]:
minmax_df_columns = minmax_df.columns
minmax_scaler = MinMaxScaler()
minmax_df = minmax_scaler.fit_transform(minmax_df)

minmax_df = pd.DataFrame(minmax_df)
minmax_df.columns = minmax_df_columns

minmax_df.head(10)

In [None]:
standard_df = df.copy()

To be able to make the most accurate models possible, we should use Principal Component Analysis (PCA). To do this, we have to use a Standard Scaler.

In [None]:
standard_df_columns = standard_df.columns
scaler = StandardScaler()
standard_df = scaler.fit_transform(standard_df)

standard_df = pd.DataFrame(standard_df)
standard_df.columns = standard_df_columns

standard_df.head(10)
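
The scalers above are fitted on the complete data frame before it is split. A common alternative, sketched here on synthetic data, is to put the scaler in a pipeline so that its statistics are learned from the training rows only. The data, the Ridge model and all names in this snippet are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=1000, scale=300, size=(200, 3))   # synthetic features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=10, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# The scaler's mean and variance are learned from X_train only;
# X_test is transformed with those same statistics when scoring
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print(round(test_r2, 3))
```

A pipeline like this can also be passed directly to `GridSearchCV`, which then refits the scaler inside every cross-validation fold.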

**Now that we have used our data in the Unsupervised Model and defined our necessary scaling methods, we are now ready to create our Machine Learning Models.**

# ---------------------------------- NYC SALES PREDICTION -----------------------------------------

**Before we run each Model, we will perform a Grid Search to see what the best parameters for that model are.**

## Tree Models

Since these Tree Models don't need any scaling, we have chosen not to scale the data, because the scores without scaling are better.

We still need to split the data set in a train_test_split

In [None]:
y = df['SALE PRICE']
x = df.drop(['SALE PRICE'], axis = 1)

x_train, x_test , y_train, y_test = train_test_split(x , y, random_state = 1)

In [None]:
#dt = DecisionTreeRegressor().fit(x_train, y_train)

#param_grid = {'max_depth' : [1,5,10,15,20,50,100]}
#grid_dt = GridSearchCV(DecisionTreeRegressor() , param_grid = param_grid)
#grid_dt.fit(x_train , y_train)

#print('Grid search with the best accuracy: \n------------------------------------')
#print('Best parameters: ', grid_dt.best_params_) 
#print('Best cross validation score(Accuracy): {:.3f}'.format(grid_dt.best_score_))

#print('Test set score: {:.3f}'.format(grid_dt.score(x_test,y_test)))

The grid search above gave us the parameter 'max_depth: 10'

In [None]:
dt = DecisionTreeRegressor(max_depth=10).fit(x_train, y_train)

print("Decision Tree Regressor Model")
print("-----------------------------")
print()
print("Training set score: {:.5f}".format(dt.score(x_train, y_train)))
print("Test set score: {:.5f}".format(dt.score(x_test, y_test)))
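
Since the grid search cells are commented out, here is a small runnable sketch of the same pattern on synthetic data. The data and the shortened `max_depth` grid are assumptions; only the `GridSearchCV` usage mirrors the cells above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))               # synthetic features
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)    # synthetic target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

param_grid = {'max_depth': [1, 5, 10, 15, 20]}
demo_grid = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid, cv=5)
demo_grid.fit(X_train, y_train)

# For regressors, best_score_ is the mean cross-validated R^2, not an accuracy
print(demo_grid.best_params_, round(demo_grid.best_score_, 3))
print(round(demo_grid.score(X_test, y_test), 3))
```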

In [None]:
#rf = RandomForestRegressor().fit(x_train, y_train)

#param_grid = {'max_depth' : [1,5,10,15,20,50,100]}
#grid_rf = GridSearchCV(RandomForestRegressor() , param_grid = param_grid)
#grid_rf.fit(x_train , y_train)

#print('Grid search with the best accuracy: \n------------------------------------')
#print('Best parameters: ', grid_rf.best_params_) 
#print('Best cross validation score(Accuracy): {:.3f}'.format(grid_rf.best_score_))

#print('Test set score: {:.3f}'.format(grid_rf.score(x_test,y_test)))

The grid search above gave us the parameter 'max_depth: 10'

In [None]:
rf = RandomForestRegressor(max_depth=10).fit(x_train, y_train)

print("Random Forest Regressor Model")
print("-----------------------------")
print()
print("Training set score: {:.5f}".format(rf.score(x_train, y_train)))
print("Test set score: {:.5f}".format(rf.score(x_test, y_test)))

In [None]:
#gbr = GradientBoostingRegressor(n_estimators = 500)
#param_grid = {'learning_rate' : [0.0001,0.001,0.01,1,0.05,0.1, 0.15, 0.5]}
#grid_gbr = GridSearchCV(gbr , param_grid , n_jobs = -1)
#grid_gbr.fit(x_train , y_train)

#print('Grid search with the best accuracy:')
#print('-----------------------------------')
#print('Best parameters: ', grid_gbr.best_params_) 
#print('Best cross validation score(Accuracy): {:.5f}'.format(grid_gbr.best_score_))

The grid search above gave us the parameter 'learning_rate: 0.15'

In [None]:
gbr = GradientBoostingRegressor(n_estimators = 500, learning_rate=0.15)
gbr.fit(x_train, y_train)

print("Gradient Boosting Regressor Model")
print("---------------------------------")
print()
print("Training set score: {:.5f}".format(gbr.score(x_train, y_train)))
print("Test set score: {:.5f}".format(gbr.score(x_test, y_test)))

**The Tree Models above give extremely good scores. It is no surprise that the Random Forest Model has a higher score than the Decision Tree Model, and that the Gradient Boosting Regressor has a higher score than the Random Forest Model. A Random Forest averages the predictions of many decision trees, each trained on a random sample of the data, which reduces the overfitting of a single tree. The Gradient Boosting Regressor builds its trees one after another, each one correcting the errors of the previous ones, and therefore has the best score.**

## Linear Regression

In [None]:
y = standard_df['SALE PRICE']
x = standard_df.drop(['SALE PRICE'], axis = 1)

print("The shape of the target set:", y.shape)
print("The shape of the dataset:", x.shape)

x_train, x_test , y_train, y_test = train_test_split(x , y, random_state = 1)

In [None]:
lr = LinearRegression().fit(x_train , y_train)

print("Normal Linear Regression Model")
print("------------------------------")
print()
print("Training set score: {:.5f}".format(lr.score(x_train, y_train)))
print("Test set score: {:.5f}".format(lr.score(x_test, y_test)))

In [None]:
#ridge = Ridge().fit(x_train, y_train)

#alpha = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
#param_grid = dict(alpha=alpha)

#grid = GridSearchCV(estimator=ridge, param_grid=param_grid, scoring='r2')
#grid_result = grid.fit(x_train, y_train)

#print('Best Score: ', grid_result.best_score_)
#print('Best Params: ', grid_result.best_params_)

The grid search above gave us the parameter 'alpha: 1'. The model below is fitted with alpha=100.

In [None]:
ridge = Ridge(alpha=100).fit(x_train, y_train)

print("Ridge Regression Model")
print("----------------------")
print()
print("Training set score: {:.5f}".format(ridge.score(x_train, y_train)))
print("Test set score: {:.5f}".format(ridge.score(x_test, y_test)))

In [None]:
#lasso = Lasso().fit(x_train, y_train)

#alpha = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
#param_grid = dict(alpha=alpha)

#grid = GridSearchCV(estimator=lasso, param_grid=param_grid, scoring='r2')
#grid_result = grid.fit(x_train, y_train)

#print('Best Score: ', grid_result.best_score_)
#print('Best Params: ', grid_result.best_params_)

The grid search above gave us the parameter 'alpha: 0.001'. The model below is fitted with alpha=0.01.

In [None]:
lasso = Lasso(alpha=0.01).fit(x_train, y_train)

print("Lasso Regression Model")
print("----------------------")
print()
print("Training set score: {:.5f}".format(lasso.score(x_train, y_train)))
print("Test set score: {:.5f}".format(lasso.score(x_test, y_test)))
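
To illustrate what the `alpha` parameter does in the two regularized models, the sketch below fits Ridge and Lasso on synthetic data in which only two of ten features matter. The data and the alpha values are assumptions chosen for illustration, not the values from the grid search.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two of the ten synthetic features influence the target
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

demo_ridge = Ridge(alpha=1.0).fit(X, y)
demo_lasso = Lasso(alpha=0.5).fit(X, y)

# Ridge shrinks coefficients towards zero; Lasso can set them to exactly zero
n_zero_ridge = int(np.sum(demo_ridge.coef_ == 0))
n_zero_lasso = int(np.sum(demo_lasso.coef_ == 0))
print(n_zero_ridge, n_zero_lasso)
```

A larger alpha means stronger regularization in both models. With Lasso this also acts as a form of feature selection.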

In [None]:
#nn = MLPRegressor()
#param_grid = {"hidden_layer_sizes": [(1,),(10,),(25,),(50,),(60,),(70,),(80,),(90,)], 
#              "activation": ["identity", "logistic", "tanh", "relu"], 
#              "alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]}
#grid_nn = GridSearchCV(nn, param_grid , n_jobs = -1)
#grid_nn.fit(x_train , y_train)

#print('Grid search with the best accuracy:')
#print('-----------------------------------')
#print('Best parameters: ', grid_nn.best_params_) 
#print('Best cross validation score(Accuracy): {:.5f}'.format(grid_nn.best_score_))

The grid search above gave us the parameters 'alpha: 0.01' and 'hidden_layer_sizes: (80,)'. The model below is fitted with alpha=0.1 and hidden_layer_sizes=[80,45].

In [434]:
nn = MLPRegressor(hidden_layer_sizes= [80,45], activation='relu', alpha=0.1)
nn.fit(x_train, y_train)

print("Neural Network Regressor Model")
print("------------------------------")
print()
print("Training set score: {:.5f}".format(nn.score(x_train, y_train)))
print("Test set score: {:.5f}".format(nn.score(x_test, y_test)))

Neural Network Regressor Model
------------------------------

Training set score: 0.02410
Test set score: -0.01640


**The Regressions above give us scores that are overall not that bad.
Linear Regression scores poorly, but that is expected: the relationship between the features and the sale price is not linear, so a straight-line model cannot fit it well.**

**Ridge and Lasso score better overall than plain Linear Regression because regularization makes them more appropriate for complex data, and the grid search gives us a suitable alpha.**

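**Lastly, we expected the Neural Network Model to score best, because neural networks with hidden layers are well suited to large and complex data. Note, however, that the run shown above gives a training score of 0.02410 and a test score of -0.01640, so this particular configuration does not achieve that.**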

## Gaussian and Bernoulli Naive Bayes

**These Machine Learning Models are a little different from the others. Since they are classification models, we have to, as mentioned before, convert our target feature 'SALE PRICE' into a categorical type. Since the data pre-processing differs from that of the models above, we have to do a new round of pre-processing, based on the pre-processing we did before.**

### Data Pre-Processing

In [None]:
catedf = df
# Split the rows into four price bands; the band index (0-3) becomes the class label.
catedf3 = catedf[(catedf['SALE PRICE'] >= 1400000)]
catedf2 = catedf[(catedf['SALE PRICE'] >= 850000) & (catedf['SALE PRICE'] < 1400000)]
catedf1 = catedf[(catedf['SALE PRICE'] >= 400000) & (catedf['SALE PRICE'] < 850000)]
catedf0 = catedf[(catedf['SALE PRICE'] < 400000)]
names_list = [catedf0, catedf1 , catedf2 , catedf3]
new_df = []
for i in range(4):
    # .copy() lets us add a column to the slice without a SettingWithCopyWarning.
    band = names_list[i].copy()
    band['CATEGORICAL PRICE'] = i
    new_df.append(band)
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the bands in the same order.
catedf = pd.concat(new_df)
catedf = catedf.drop(['SALE PRICE'], axis = 1)
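The same four price bands could also be produced in a single step with `pd.cut`. The sketch below is only an illustration on a small made-up frame (`toy` is hypothetical, not our data); the thresholds are the ones used above, and `right=False` makes every band include its lower edge, matching the `>=` comparisons.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df; only 'SALE PRICE' matters here.
toy = pd.DataFrame({'SALE PRICE': [150000, 400000, 849999, 850000, 1400000, 5000000]})

# Band edges taken from the cell above; right=False gives [low, high) intervals.
edges = [-np.inf, 400000, 850000, 1400000, np.inf]
toy['CATEGORICAL PRICE'] = pd.cut(toy['SALE PRICE'], bins=edges,
                                  labels=[0, 1, 2, 3], right=False).astype(int)

print(toy)
```

Unlike the loop above, this keeps the rows in their original order instead of grouping them by band.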

We divide the dataframe into one containing the categorical variables and one containing the continuous ones. 
However, some of the categorical data has not been one-hot encoded. These variables are not binary and cannot be used with the categorical (Bernoulli) Naive Bayes model, so they will be dropped. 

In [None]:
categorical_df = catedf.drop(['GROSS SQUARE FEET' , 'LAND SQUARE FEET' , 'COMMERCIAL UNITS' , 'RESIDENTIAL UNITS', 'NEIGHBORHOOD' ,
                              'BUILDING CLASS AT TIME OF SALE', 'YEAR BUILT' , 'MONTH SOLD'] ,  axis = 1)

continous_df = catedf[['GROSS SQUARE FEET' , 'LAND SQUARE FEET' , 'CATEGORICAL PRICE']]

The continuous data now needs to be made approximately normally distributed for GaussianNB to be valid, so we log-transform it. Rows where a value equals 0 have to be removed first: they distort the distribution, and the logarithm of 0 is undefined. The variables commercial units and residential units will be dropped, as they cannot be normalized. To make the columns logarithmic we used code from: https://www.datasciencemadesimple.com/log-natural-logarithmic-value-column-pandas-python-2/

In [None]:
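# Keep only rows with positive areas; .copy() lets us add columns below without a SettingWithCopyWarning.
continous_df = continous_df[(continous_df['GROSS SQUARE FEET'] > 0)]
continous_df = continous_df[(continous_df['LAND SQUARE FEET'] > 0)].copy()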


continous_df['LOG GROSS SQUARE FEET'] = np.log(continous_df['GROSS SQUARE FEET'])
continous_df['LOG LAND SQUARE FEET'] = np.log(continous_df['LAND SQUARE FEET'])
continous_df.drop(['GROSS SQUARE FEET' , 'LAND SQUARE FEET' ], axis = 1 , inplace = True)
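To illustrate why the log transform helps, here is a minimal sketch on synthetic data. The values are drawn from a log-normal distribution as a stand-in for square footage (an assumption made for the example, not a claim about our data), and the skewness is compared before and after taking the logarithm.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, strictly positive, right-skewed values standing in for square footage.
area = pd.Series(rng.lognormal(mean=7.5, sigma=0.8, size=5000))

skew_raw = area.skew()          # strongly positive for right-skewed data
skew_log = np.log(area).skew()  # close to 0 if the log values are roughly normal

print("skewness before log: {:.2f}".format(skew_raw))
print("skewness after log:  {:.2f}".format(skew_log))
```

A skewness near zero is only one indication of normality; a histogram of the real columns would be the natural next check.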

### Making The Models

In [None]:
x = continous_df.drop(['CATEGORICAL PRICE'] , axis = 1)
y = continous_df['CATEGORICAL PRICE']

x_train, x_test , y_train , y_test = train_test_split(x,y , random_state = 1)

In [None]:
gnb = GaussianNB()
gnb.fit(x_train, y_train)

print("Gaussian Naive Bayes Model")
print("--------------------------")
print()
print("Training set score: {:.5f}".format(gnb.score(x_train, y_train)))
print("Test set score: {:.5f}".format(gnb.score(x_test, y_test)))

In [None]:
x = categorical_df.drop(['CATEGORICAL PRICE'] , axis = 1)
y = categorical_df['CATEGORICAL PRICE']


x_train, x_test , y_train , y_test = train_test_split(x, y, random_state = 0 )

In [None]:
#param_grid = {'alpha' : [0.000001 , 0.00001 , 0.0001 , 0.001 , 0.01 , 0.1 , 1 ,10]}
#grid_bernouli = GridSearchCV(BernoulliNB() , param_grid = param_grid)
#grid_bernouli.fit(x_train , y_train)

#print('Grid search with the best accuracy: \n------------------------------------')
#print('Best parameters: ', grid_bernouli.best_params_) 
#print('Best cross validation score(Accuracy): {:.3f}'.format(grid_bernouli.best_score_))

#print('Test set score: {:.3f}'.format(grid_bernouli.score(x_test,y_test)))

The GridSearch above gave us the parameter 'alpha: 0.01'.

In [None]:
bernoulli = BernoulliNB(alpha = 0.01)
bernoulli.fit(x_train , y_train)

print("Bernoulli Naive Bayes Model")
print("---------------------------")
print()
print("Training set score: {:.5f}".format(bernoulli.score(x_train, y_train)))
print("Test set score: {:.5f}".format(bernoulli.score(x_test, y_test)))

Baseline: the accuracy a model would get by always guessing the most frequent category. 

In [None]:
print(15768/33865)
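The same kind of baseline can be computed without hard-coding the counts by using scikit-learn's `DummyClassifier`. The sketch below uses made-up labels (`y_toy` is hypothetical) purely to show the mechanics.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: class 1 is the most common (5 of 10 samples).
y_toy = np.array([0, 1, 1, 1, 2, 3, 1, 0, 2, 1])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)
print("Baseline accuracy: {:.2f}".format(baseline_acc))
```

Any classifier worth keeping should beat this number on the test set.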

**To get the best possible score, we will now try combining the categorical and continuous variables into one model, first by binning the continuous variables.**

In [None]:
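# Equal-frequency binning: split each log-transformed area into 5 quantile bins.
continous_df['BINNED LOG LAND SQUARE FEET'] = pd.qcut(continous_df['LOG LAND SQUARE FEET'] , 5)
continous_df['BINNED LOG GROSS SQUARE FEET'] = pd.qcut(continous_df['LOG GROSS SQUARE FEET'] , 5)

# One-hot encode the bins; the prefixes turn the interval labels into distinct string column names.
one_hot_land = pd.get_dummies(continous_df['BINNED LOG LAND SQUARE FEET'] , prefix = 'LAND')
one_hot_gross = pd.get_dummies(continous_df['BINNED LOG GROSS SQUARE FEET'] , prefix = 'GROSS')

# join = 'inner' keeps only rows present in all frames, so rows removed from continous_df do not come back as NaN.
categorical_df = pd.concat([categorical_df , one_hot_land , one_hot_gross], axis = 1 , join = 'inner')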
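As a small illustration of what quantile binning followed by one-hot encoding produces, the sketch below runs the two steps on ten made-up values (`toy` is hypothetical). Each of the five bins receives the same number of rows, and every row ends up with exactly one active indicator column.

```python
import pandas as pd

# Ten made-up log-area values.
toy = pd.DataFrame({'LOG LAND SQUARE FEET': [6.1, 6.8, 7.2, 7.5, 7.9,
                                             8.3, 8.8, 9.4, 9.9, 10.5]})

# Five equal-frequency bins, then one indicator column per bin.
binned = pd.qcut(toy['LOG LAND SQUARE FEET'], 5)
one_hot = pd.get_dummies(binned, prefix='LAND')

print(binned.value_counts().sort_index())
print(one_hot.sum())
```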

In [None]:
x = categorical_df.drop(['CATEGORICAL PRICE'] , axis = 1)
y = categorical_df['CATEGORICAL PRICE']


x_train, x_test , y_train , y_test = train_test_split(x,y , random_state = 0 )

In [None]:
#param_grid = {'alpha' : [0.000001 , 0.00001 , 0.0001 , 0.001 , 0.01 , 0.1 , 1 ,10]}
#grid_bernouli = GridSearchCV(BernoulliNB() , param_grid = param_grid)
#grid_bernouli.fit(x_train , y_train)

#print('Grid search with the best accuracy: \n------------------------------------')
#print('Best parameters: ', grid_bernouli.best_params_) 
#print('Best cross validation score(Accuracy): {:.3f}'.format(grid_bernouli.best_score_))

#print('Test set score: {:.3f}'.format(grid_bernouli.score(x_test,y_test)))

The GridSearch above gave us the parameter 'alpha: 0.1'.

In [None]:
bernoulli = BernoulliNB(alpha = 0.1)
bernoulli.fit(x_train , y_train)


print('Training set score: {:.5f}'.format(bernoulli.score(x_train , y_train)))
print('Test set score: {:.5f}'. format(bernoulli.score(x_test , y_test)))

This performs better than when only using the continuous variables.

We will now try another method for combining continuous and categorical variables. The approach was found at: 
https://towardsdatascience.com/naive-bayes-classifier-how-to-successfully-use-it-in-python-ecf76a995069

In [None]:
# Continuous (log) columns go to GaussianNB; all remaining binary columns go to BernoulliNB.
log_cols = ['LOG GROSS SQUARE FEET' , 'LOG LAND SQUARE FEET']

X = pd.concat([categorical_df.drop(['CATEGORICAL PRICE'] ,axis = 1), continous_df[log_cols]], axis = 1)
y = categorical_df['CATEGORICAL PRICE']

X_train , X_test , y_train , y_test = train_test_split(X , y , random_state = 1)

# Fit one model per feature type.
gaussian_model = GaussianNB()
clf_g = gaussian_model.fit(X_train[log_cols] , y_train)

categorical_model = BernoulliNB()
clf_c = categorical_model.fit(X_train.drop(log_cols , axis = 1) , y_train)

# Class probabilities from each model on the training and test sets.
g_train_probs = gaussian_model.predict_proba(X_train[log_cols])
c_train_probs = categorical_model.predict_proba(X_train.drop(log_cols , axis = 1))

g_test_probs = gaussian_model.predict_proba(X_test[log_cols])
c_test_probs = categorical_model.predict_proba(X_test.drop(log_cols , axis = 1))

# Stack the two models' probabilities for class 1 (column index 1) as the new features.
X_new_train = np.c_[(g_train_probs[:,1], c_train_probs[:,1])] # Train
X_new_test = np.c_[(g_test_probs[:,1], c_test_probs[:,1])] # Test

# Final model trained on the stacked probabilities.
combined_gaussian = GaussianNB()
combined_gaussian.fit(X_new_train, y_train)
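The cell above passes only the class-1 probability from each model to the final classifier. With four price classes, a variant worth trying is to keep every probability column. The sketch below shows that variant on synthetic data; the arrays `X_cont`, `X_bin` and `y_toy` are made up for the example and do not reflect our dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.default_rng(0)
n = 600
y_toy = rng.integers(0, 4, size=n)                                # four made-up classes
X_cont = rng.normal(loc=y_toy[:, None], scale=1.0, size=(n, 2))   # continuous block
X_bin = (rng.random((n, 3)) < (0.2 + 0.2 * y_toy[:, None])).astype(int)  # binary block

idx_train, idx_test = train_test_split(np.arange(n), random_state=0)

# One Naive Bayes model per feature type.
gnb = GaussianNB().fit(X_cont[idx_train], y_toy[idx_train])
bnb = BernoulliNB().fit(X_bin[idx_train], y_toy[idx_train])

def stack(idx):
    # Keep every class-probability column from both models, not only column 1.
    return np.hstack([gnb.predict_proba(X_cont[idx]), bnb.predict_proba(X_bin[idx])])

# Final model trained on the stacked probabilities.
meta = GaussianNB().fit(stack(idx_train), y_toy[idx_train])
stacked_acc = meta.score(stack(idx_test), y_toy[idx_test])
print("Stacked test accuracy: {:.3f}".format(stacked_acc))
```

Whether this helps on the real data is an open question; it only ensures the final model sees information about all four classes.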

In [None]:
print("Combined Naive Bayes Model")
print("--------------------------")
print()
print('Training set score: {:.5f}'.format(combined_gaussian.score(X_new_train, y_train)))
print('Test set score: {:.5f}'.format(combined_gaussian.score(X_new_test, y_test)))

**The Naive Bayes models do not seem to perform much better than guessing, and they appear to perform about equally well (or badly) on the training and test sets. Although the models do better than randomly guessing one of the outcomes, Naive Bayes does not seem like a valid model for this dataset. Changing the model complexity through alpha also has no impact on accuracy. It is no surprise that the Naive Bayes classifiers have very limited usefulness on the New York property dataset, as the models are not well suited to complex datasets. It is also worth mentioning that the continuous data are initially not normally distributed at all, so the normalizing step makes the data artificial. The same goes for the categorization of the sale price.**

## Support Vector Regressor

The final ML algorithm we will throw at the dataset is the **Support Vector Regressor (SVR)**, the Support Vector Classifier's regressive brother. We chose the Support Vector Regressor over the classifier out of curiosity about whether it performs better than the newer and more modern Neural Network Regressor we tested earlier. 

We will be testing the traditional Support Vector Regressor with the linear and RBF kernels. There are a lot of kernels, but they are simply different ways of forming the hyperplane the model fits, some linearly and some nonlinearly$^1$. Therefore we limited ourselves to one of each.

Further, we will test the **Linear Support Vector Regressor (LinearSVR)**. The difference between this and SVR with the linear kernel is that LinearSVR is implemented with liblinear rather than libsvm and uses different defaults for, among others, the epsilon and tolerance parameters. This gives more flexibility in the choice of penalty and loss function, and it supposedly scales better with large numbers of samples$^2$, which is relevant for us (n = 26309). 

Lastly, we will be using the **Nu Support Vector Regression (NuSVR)**, similar to SVR with the addition of the `nu` parameter. The `nu` parameter is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors$^2$. Basically, it lets you control the number of support vectors used.
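The effect of `nu` on the number of support vectors can be seen in a minimal sketch on synthetic data. `X_toy` and `y_toy` are made up for the example, and the two `nu` values are arbitrary; a larger `nu` should leave a larger share of the training points as support vectors.

```python
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.3, size=300)   # noisy synthetic target

sv_fraction = {}
for nu in (0.2, 0.8):
    model = NuSVR(nu=nu, C=1.0, kernel="rbf").fit(X_toy, y_toy)
    # support_ holds the indices of the training points kept as support vectors.
    sv_fraction[nu] = len(model.support_) / len(X_toy)
    print("nu = {:.1f}: fraction of support vectors = {:.2f}".format(nu, sv_fraction[nu]))
```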

We will first conduct a grid search for the three models, then take these results into the modeling stage, and finally discuss the outcome.

In terms of preprocessing we will simply use the already scaled and handled `standard_df` dataframe, as it is prepared for a regression algorithm.

<details>
<summary>Sources</summary>
[1]: Uddin, Md. Palash. (2018). Re: Diffference between SVM Linear, polynmial and RBF kernel?. Retrieved from: https://www.researchgate.net/post/Diffference_between_SVM_Linear_polynmial_and_RBF_kernel/5af811d18272c91a19463943/citation/download.
<br>[2]: Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
</details>
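To make the difference between the linear and RBF kernels concrete, here is a small sketch on a synthetic target that is deliberately nonlinear (a noisy parabola). The data and the choice `C=3` are assumptions for the illustration only.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X_toy = rng.uniform(-3, 3, size=(400, 1))
y_toy = X_toy[:, 0] ** 2 + rng.normal(scale=0.2, size=400)   # symmetric, nonlinear target

r2 = {}
for kernel in ("linear", "rbf"):
    # score() returns R^2 for regressors.
    r2[kernel] = SVR(kernel=kernel, C=3).fit(X_toy, y_toy).score(X_toy, y_toy)
    print("{:>6} kernel, training R^2: {:.3f}".format(kernel, r2[kernel]))
```

A straight line cannot follow the parabola, so the linear kernel scores far lower here; on data that is close to linear, the gap would be much smaller.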



In [435]:
standard_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26309 entries, 0 to 26308
Data columns (total 47 columns):
 #   Column                                                               Non-Null Count  Dtype  
---  ------                                                               --------------  -----  
 0   NEIGHBORHOOD                                                         26309 non-null  float64
 1   RESIDENTIAL UNITS                                                    26309 non-null  float64
 2   COMMERCIAL UNITS                                                     26309 non-null  float64
 3   LAND SQUARE FEET                                                     26309 non-null  float64
 4   GROSS SQUARE FEET                                                    26309 non-null  float64
 5   YEAR BUILT                                                           26309 non-null  float64
 6   BUILDING CLASS AT TIME OF SALE                                       26309 non-null  float64
 7   SALE

In [None]:
y = np.asarray(standard_df["SALE PRICE"], dtype=float)
x = standard_df.drop(["SALE PRICE"], axis = 1)
x = np.asarray(x, dtype=float)

print("shape of Y :"+str(y.shape))
print("shape of X :"+str(x.shape))

In [None]:
X_train,X_test,Y_train,Y_test = train_test_split(x,y,test_size=.20,random_state=42)
print("shape of X Train :"+str(X_train.shape))
print("shape of X Test :"+str(X_test.shape))
print("shape of Y Train :"+str(Y_train.shape))
print("shape of Y Test :"+str(Y_test.shape))

### Grid Search SVR()

In [None]:
"""
modelsvr = SVR()
param = {'kernel' : ('linear', 'rbf'),
         'C' : [1,3,5,7,10,15]}


grid_svr = GridSearchCV(modelsvr,param_grid = param, n_jobs = -1, verbose = 2)

grid_svr.fit(X_train,Y_train)
print('Grid search with the best accuracy: \n------------------------------------')
print('Best parameters: ', grid_svr.best_params_) 
print('Best cross validation score(Accuracy): {:.3f}'.format(grid_svr.best_score_))

print('Test set score: {:.3f}'.format(grid_svr.score(X_test,Y_test)))
"""

#### Grid search with the best accuracy: 
------------------------------------
Best parameters:  {'C': 3, 'kernel': 'rbf'}
Best cross validation score(Accuracy): 0.700
Test set score: 0.709
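One thing to keep in mind when reading these numbers: when no `scoring` argument is given, `GridSearchCV` uses the estimator's own `score` method, which for regressors such as `SVR` is R², not classification accuracy. The sketch below checks this on synthetic data (`X_toy` and `y_toy` are made up, and the parameter grid is arbitrary).

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

search = GridSearchCV(SVR(kernel="linear"), param_grid={"C": [0.1, 1, 10]}, cv=3)
search.fit(X_toy, y_toy)

# With scoring=None, score() falls back to the estimator's own score, i.e. R^2 for SVR.
r2_manual = r2_score(y_toy, search.predict(X_toy))
print("best params:", search.best_params_)
print("search.score: {:.4f}  r2_score: {:.4f}".format(search.score(X_toy, y_toy), r2_manual))
```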

### Grid Search LinearSVR()

In [None]:
"""
modelsvr = LinearSVR()
param = {'C' : [0.001,0.01,0.1,0.2,0.3,1,3,5],
        "tol": [0.00001,0.0001,0.0002,0.0004,0.001],
        "epsilon": [0.0,0.3,0.5,0.7,0.9,1,1.1]}


grid_LinSVR = GridSearchCV(modelsvr,param_grid = param, n_jobs = -1, verbose = 2)

grid_LinSVR.fit(X_train,Y_train)
print('Grid search with the best accuracy: \n------------------------------------')
print('Best parameters: ', grid_LinSVR.best_params_) 
print('Best cross validation score(Accuracy): {:.3f}'.format(grid_LinSVR.best_score_))

print('Test set score: {:.3f}'.format(grid_LinSVR.score(X_test,Y_test)))
"""

#### Grid search with the best accuracy: 
------------------------------------
Best parameters:  {'C': 0.01, 'epsilon': 0.7, 'tol': 0.0002}
Best cross validation score(Accuracy): 0.688
Test set score: 0.690

### Grid Search NuSVR()

In [None]:
"""
modelsvr = NuSVR()

param = {"C": [0.1,0.6,1,1.5,2],
         "nu": [0.1,0.3,0.5,0.6,0.7],
         "kernel": ["linear", "rbf"]}

grid_NuSVR = GridSearchCV(modelsvr,param_grid = param, cv = 3, n_jobs = -1, verbose = 2)

grid_NuSVR.fit(X_train,Y_train)
print('Grid search with the best accuracy: \n------------------------------------')
print('Best parameters: ', grid_NuSVR.best_params_) 
print('Best cross validation score(Accuracy): {:.3f}'.format(grid_NuSVR.best_score_))
"""

#### Grid search with the best accuracy: 
------------------------------------
Best parameters:  {'C': 2, 'kernel': 'rbf', 'nu': 0.3}
Best cross validation score(Accuracy): 0.703

### Fitting the models

With the information we found from the grid Searches we can start fitting our models with the relevant information:

#### SVR()

In [436]:
SVR_model = SVR(kernel = "rbf", C = 3).fit(X_train, Y_train)
scoretrain = SVR_model.score(X_train,Y_train)
scoretest  = SVR_model.score(X_test,Y_test)


print("SVR_model with RBF Kernel:\n Training score :{:2f}\n Test Score: {:2f}".format(scoretrain,scoretest))

SVR_model with RBF Kernel:
 Training score :0.746980
 Test Score: 0.708567


#### LinearSVR()

**Note!!** 
Interestingly enough, the cross-validation in GridSearchCV originally gave us `C = 0.3` as the best option with `cv = 3`, and running it once more with the default value (`cv = 5`) gave the same result. But as shown beneath, we are getting much better results with `C = 3`.

We have not been able to pin down the reason behind this. Some sources suggested issues with cross-validating too large a part of the dataset. One thing to note is that the two cells below instantiate `SVR`, whose default kernel is RBF, rather than `LinearSVR`, so their scores are not directly comparable with the LinearSVR grid search. As of now we are holding out 20% as the test set, and at a later date it would be interesting to test this with different values in the train_test_split function.
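A quick way to check which estimator a call actually builds is to inspect its parameters. The sketch below constructs `SVR` and `LinearSVR` with the arguments used in this section (nothing is fitted) and prints what each one contains.

```python
from sklearn.svm import SVR, LinearSVR

svr_default = SVR(C=3, epsilon=0.7, tol=0.0002)
lin_svr = LinearSVR(C=3, epsilon=0.7, tol=0.0002)

# SVR uses a kernel (RBF unless told otherwise); LinearSVR has no kernel parameter at all.
print("SVR default kernel:", svr_default.kernel)
print("LinearSVR has a kernel parameter:", "kernel" in lin_svr.get_params())
```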

In [437]:
LinSVR_model = SVR(C = 3, epsilon = 0.7, tol = 0.0002).fit(X_train, Y_train)
scoretrain = LinSVR_model.score(X_train,Y_train)
scoretest  = LinSVR_model.score(X_test,Y_test)


print("Linear Support Vector Regressor with C = 3:\n Training score :{:2f}\n Test Score: {:2f}".format(scoretrain,scoretest))

Linear Support Vector Regressor with C = 3:
 Training score :0.737446
 Test Score: 0.706844


In [438]:
LinSVR_model = SVR(C = 0.3, epsilon = 0.7, tol = 0.0002).fit(X_train, Y_train)
scoretrain = LinSVR_model.score(X_train,Y_train)
scoretest  = LinSVR_model.score(X_test,Y_test)


print("Linear Support Vector Regressor with C = 0.3:\n Training score :{:2f}\n Test Score: {:2f}".format(scoretrain,scoretest))

Linear Support Vector Regressor with C = 0.3:
 Training score :0.689292
 Test Score: 0.683061


#### NuSVR()

In [439]:
NuSVR_model = NuSVR(C = 2, kernel = "rbf", nu = 0.3).fit(X_train, Y_train)
scoretrain = NuSVR_model.score(X_train,Y_train)
scoretest  = NuSVR_model.score(X_test,Y_test)


print("Nu Support Vector Regressor with C = 2:\n Training score :{:2f}\n Test Score: {:2f}".format(scoretrain,scoretest))

Nu Support Vector Regressor with C = 2:
 Training score :0.742859
 Test Score: 0.711971


### Analysing and Conclusion

To start, it seems that all three models deliver nearly the same results. 