# **CS412 - Machine Learning - 2022**
## Assignment #2
100 pts


## Goal

The goal of this homework is two-fold:

*   Gain experience with neural network approaches
*   Gain experience with the Keras library

## Dataset
You are going to use a house price dataset that we prepared for you; it contains four independent variables (predictors) and one target variable. The task is to predict the target variable (house price) from the predictors (house attributes).


Download the data from SuCourse. Reserve 10% of the training data for validation and use the rest for development (learning your models). The official test data we provide (1,200 samples) should only be used for testing at the end, and not model selection.

## Task 
Build a regressor with a neural network that has only one hidden layer, using the Keras library function calls to predict house prices in the provided dataset.

Your code should follow the given skeleton and try the indicated parameters.

## Preprocessing and Meta-parameters
You should try 25, 50 and 100 as the hidden node count.

You should decide on the learning rate (step size). You can try values such as 0.001, 0.01 and 0.1, but you may need to increase it if learning is very slow, or decrease it if you see the loss increase.
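To see why the step size matters, the toy sketch below runs plain gradient descent on the one-dimensional loss f(w) = w². This is not the house-price model: the function `gradient_descent`, the starting point, and the three rates are assumptions chosen only for illustration.

```python
def gradient_descent(learning_rate, steps=20, w0=1.0):
    """Minimise f(w) = w**2 with plain gradient descent; return the loss after each step."""
    w = w0
    losses = []
    for _ in range(steps):
        grad = 2.0 * w                  # derivative of w**2
        w = w - learning_rate * grad    # one gradient step
        losses.append(w * w)
    return losses


for lr in (0.001, 0.1, 1.1):
    losses = gradient_descent(lr)
    print(f"lr={lr}: first loss={losses[0]:.4f}, last loss={losses[-1]:.4f}")
```

With the smallest rate the loss barely moves in 20 steps, with the middle rate it shrinks quickly, and with the largest rate every step overshoots the minimum so the loss grows.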

You can use either sigmoid or ReLU activations for the hidden nodes (indicate which with your results). You should know what to use for the output-layer activation, the input and output layer sizes, and a suitable loss function.
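The shapes involved in a one-hidden-layer regressor can be sketched in plain NumPy. This is an illustration with random, untrained weights, not the Keras implementation. The helper names `forward` and `mse`, the 25 hidden nodes, and the fake data are assumptions for the example. It assumes four numeric inputs, a linear (identity) output unit, and mean squared error as the loss, which are common choices for regression.

```python
import numpy as np

rng = np.random.default_rng(0)

# 4 predictors -> 25 hidden nodes -> 1 predicted price (sizes assumed for illustration)
n_inputs, n_hidden, n_outputs = 4, 25, 1

# Randomly initialised parameters; a trained network would learn these.
W1 = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_outputs))
b2 = np.zeros(n_outputs)


def relu(z):
    return np.maximum(z, 0.0)


def forward(X):
    """Hidden layer with ReLU, then a linear output (no activation) for regression."""
    hidden = relu(X @ W1 + b1)
    return hidden @ W2 + b2


def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return float(np.mean((y_true - y_pred) ** 2))


X = rng.random((8, n_inputs))   # 8 fake samples with features already in [0, 1]
y = rng.random((8, 1))          # fake scaled prices
pred = forward(X)
print(pred.shape, mse(y, pred))
```

The output layer has a single unit because there is one target per house, and it has no squashing activation so that the prediction is not confined to a fixed range.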

## Software: 

Keras is a library that we will use especially for deep learning, but also with basic neural network functionality of course.

You may find the necessary function references here: 

http://scikit-learn.org/stable/supervised_learning.html
https://keras.io/api/

When you search for Dense for instance, you should find the relevant function and explained parameters, easily.

## Submission: 

Fill this notebook. Write the report section at the end.

You should prepare a separate PDF document as your homework (named hw2-CS412-yourname.pdf) that consists of the report (Part 8) of the notebook for easy viewing, and include a link to your notebook within the PDF report. Use the link obtained from the share link on the top right, and **be sure to share with Sabancı University first**, as otherwise there will be access problems. Also, do not forget to add your answers for Questions 2 and 3 on the assignment document.

## 1) Initialize

*   First make a copy of the notebook given to you as a starter.

*   Make sure you choose Connect from the upper right.


## 2) Load training dataset

* Upload the datasets (train.csv, test.csv) provided on SuCourse to your Google Drive and read them using Google Drive's mount functions. 
You may find the necessary functions here: 
https://colab.research.google.com/notebooks/io.ipynb

In [None]:
from google.colab import drive
drive.mount('/content/drive') 
# click on the url that pops up and give the necessary authorizations

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).




*   Set your notebook's working directory to the path where the datasets are uploaded (cd is the Linux command for changing directory).
*   You may need to use cd drive/MyDrive depending on the path to the datasets on your Google Drive. (Do not comment out the code in the cells when using Linux commands.)






In [None]:
cd /content/drive/My Drive/412


* List the files in the current directory.

In [None]:
ls

[0m[01;34mdrive[0m/  [01;34msample_data[0m/


## 3) Understanding the dataset (5 pts)

There are a lot of functions that can be used to learn more about this dataset.

- What is the shape of the training set (num of samples X number of attributes) **[shape function can be used]**

- Display attribute names **[columns function can be used]**

- Display the first 5 rows from training dataset **[head or sample functions can be used]**

..

In [None]:
# import the necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

train_df = pd.read_csv("/content/drive/My Drive/412/train.csv")
# show first 10 elements of the training data
train_df.head(10)

Unnamed: 0,sqmtrs,nrooms,view,crime_rate,price
0,251,5,west,low,925701.721399
1,211,3,west,high,622237.482636
2,128,5,east,low,694998.182376
3,178,3,east,high,564689.015926
4,231,3,west,low,811222.970379
5,253,5,north,high,766250.032506
6,101,1,north,low,512749.401548
7,242,1,north,high,637010.760148
8,174,5,west,high,638136.374869
9,328,2,south,high,787704.988273


In [None]:
# print the shape of data

print("Data dimensionality is:", train_df.shape)

# also give some statistics about the data like mean, standard deviation etc.
print("-----------------")
print("Mean of sqmtrs:", np.mean(train_df['sqmtrs']))
print("Standard Deviation of sqmtrs", np.std(train_df['sqmtrs']))
print("-----------------")
print("Mean of Price", np.mean(train_df['price']))
print("Standard Deviation of Price", np.std(train_df['price']))


Data dimensionality is: (4800, 5)
-----------------
Mean of sqmtrs: 225.03354166666668
Standard Deviation of sqmtrs 71.84395068449813
-----------------
Mean of Price 725756.960757509


## 4) Preprocessing Steps (10 pts)

As some of the features (predictive variables) in this dataset are categorical (non-numeric), you need to do some preprocessing for those features.

You can use as many **dummy or indicator variables** as there are categories within one feature. You can also look at pandas' get_dummies or keras.utils.to_categorical functions.

In neural networks, scaling of the features is important because it affects the net input of a neuron as a whole. You should use sklearn's **MinMax scaler** for this task, which scales the variables between 0 and 1 by default. (Remember that the mean squared error loss tends to be extremely large with unscaled features.)
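As a minimal sketch of these two steps, the example below one-hot encodes the categorical columns with `pd.get_dummies` and applies the min-max formula by hand to one numeric column. The three-row frame is made up; only the column names are taken from the assignment data. sklearn's `MinMaxScaler` applies the same (x - min) / (max - min) mapping with its default settings.

```python
import pandas as pd

# Tiny made-up frame that reuses the assignment's column names.
df = pd.DataFrame({
    "sqmtrs": [100, 200, 300],
    "view": ["north", "west", "north"],
    "crime_rate": ["low", "high", "low"],
})

# One indicator column per category of each categorical feature.
encoded = pd.get_dummies(df, columns=["view", "crime_rate"], dtype=float)

# Min-max scaling: (x - min) / (max - min) maps the column to [0, 1].
col = encoded["sqmtrs"]
encoded["sqmtrs"] = (col - col.min()) / (col.max() - col.min())

print(encoded)
```

Indicator variables avoid imposing an artificial order on categories such as the view direction, at the cost of a wider input layer.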


In [None]:
# encode the categorical variables

cleanup_nums = {"crime_rate":     {"low": 0, "high": 1},
                "view": {"north": 1, "south": 2, "east": 3, "west": 4}}
train_df = train_df.replace(cleanup_nums)

# scale the features between 0-1
from sklearn.preprocessing import MinMaxScaler
msc = MinMaxScaler(feature_range=(0, 1))
scaled_train = msc.fit_transform(train_df)
scaled_train_df = pd.DataFrame(scaled_train, columns=train_df.columns.values)
scaled_train_df.head(10)

Unnamed: 0,sqmtrs,nrooms,view,crime_rate,price
0,0.606426,1.0,1.0,0.0,0.791034
1,0.445783,0.5,1.0,1.0,0.369303
2,0.11245,1.0,0.666667,0.0,0.470421
3,0.313253,0.5,0.666667,1.0,0.289327
4,0.526104,0.5,1.0,0.0,0.631941
5,0.614458,1.0,0.0,1.0,0.569441
6,0.004016,0.0,0.0,0.0,0.217145
7,0.570281,0.0,0.0,1.0,0.389834
8,0.297189,1.0,1.0,1.0,0.391398
9,0.915663,0.25,0.333333,1.0,0.599257


Don't forget to split the training data to obtain a validation set. **Use random_state=42**

In [None]:
# split 90-10
from sklearn.model_selection import train_test_split
# Define X:
X = scaled_train_df[['sqmtrs', 'nrooms', 'view', 'crime_rate']]

# Define y:
y = scaled_train_df['price']
X_train, X_val, y_train, y_val= train_test_split(X,y, test_size=0.1, random_state=42)

X_train_array = X_train.values
Y_train_array = y_train.values
X_val_array = X_val.values
Y_val_array = y_val.values


## 5) Train neural networks on development data and do model selection using the validation data (55 pts)


* Train a neural network with **one hidden layer** (try 3 different values for the number of neurons in that hidden layer, as 25, 50, 100), you will need to correctly choose the optimizer and the loss function that this model will train with. Use batch_size as 64 and train each model for 30 epochs. 

* Train another neural network with two hidden layers with meta-parameters of your choice. Again, use batch_size as 64 and train the model for 30 epochs. 

* **Bonus (5 pts)** Train a KNN or a Decision Tree model with your own choice of meta parameters to predict the house prices.
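For the bonus item, the idea behind KNN regression can be sketched in a few lines of NumPy: each prediction is the mean target of the k closest training points. The function `knn_predict`, the value k=3, and the toy data are assumptions for illustration. In the notebook itself, scikit-learn's `KNeighborsRegressor` would be the more usual choice.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, k=3):
    """Predict each query target as the mean target of its k nearest training points."""
    # Pairwise Euclidean distances, shape (n_query, n_train).
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Indices of the k closest training points for every query row.
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)


# Made-up 1-D data where the target is 2 * x.
X_train = np.array([[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]])
y_train = np.array([0.0, 0.2, 0.4, 1.6, 1.8, 2.0])
X_query = np.array([[0.1], [0.9]])

knn_pred = knn_predict(X_train, y_train, X_query, k=3)
print(knn_pred)
```

Because KNN relies on distances, it benefits from the same feature scaling as the neural network.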


In [None]:
# use the Keras API bundled with TensorFlow consistently
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD, Adam


In [None]:
model_1 = Sequential()
model_1.add(Dense(25, activation='sigmoid', name='hidden_1'))
model_1.add(Dense(1, name='output_layer'))
# accuracy is a classification metric; track mean absolute error for regression
model_1.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
histories_1= []
history_1 = model_1.fit(X_train_array, Y_train_array, batch_size=64, epochs=30, verbose=1)
histories_1.append(history_1)

Epoch 1/30
...
Epoch 30/30


In [None]:
model_2 = Sequential()
model_2.add(Dense(50, activation='sigmoid', name='hidden_1'))
model_2.add(Dense(1, name='output_layer'))
# accuracy is a classification metric; track mean absolute error for regression
model_2.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
histories_2= []
history_2 = model_2.fit(X_train_array, Y_train_array, batch_size=64, epochs=30, verbose=1)
histories_2.append(history_2)

Epoch 1/30
...
Epoch 30/30


In [None]:
model_3 = Sequential()
model_3.add(Dense(100, activation='sigmoid', name='hidden_1'))
model_3.add(Dense(1, name='output_layer'))
# accuracy is a classification metric; track mean absolute error for regression
model_3.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
histories_3= []
history_3 = model_3.fit(X_train_array, Y_train_array, batch_size=64, epochs=30, verbose=1)
histories_3.append(history_3)

Epoch 1/30
...
Epoch 30/30


In [None]:
# train a two-hidden layered neural network
model = Sequential()
model.add(Dense(100, activation='sigmoid', name='hidden_1'))
model.add(Dense(50, activation='sigmoid', name='hidden_2'))
model.add(Dense(1, name='output_layer'))
# accuracy is a classification metric; track mean absolute error for regression
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
histories= []
history = model.fit(X_train_array, Y_train_array, batch_size=64, epochs=30, verbose=1)
histories.append(history)




Epoch 1/30
...
Epoch 30/30


## 6) Test your trained models on the validation set (10 pts)
Test your trained models on the validation set and print the mean squared errors.
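When the target is min-max scaled together with the features, the mean squared error is reported in scaled units. The sketch below uses made-up prices and predictions to show how such a value relates to the error in the original price units: multiplying by the squared price range undoes the scaling. The helper `mse` and all numbers are assumptions for illustration.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error between two arrays."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))


# Made-up prices and predictions in the original currency units.
prices = np.array([500_000.0, 700_000.0, 900_000.0])
preds = np.array([520_000.0, 690_000.0, 880_000.0])

p_min, p_max = prices.min(), prices.max()
scale = p_max - p_min

# The same values after min-max scaling with the price range.
prices_scaled = (prices - p_min) / scale
preds_scaled = (preds - p_min) / scale

mse_scaled = mse(prices_scaled, preds_scaled)
mse_original = mse(prices, preds)

# Undoing the scaling multiplies squared errors by (max - min) ** 2.
print(mse_scaled, mse_scaled * scale ** 2, mse_original)
```

Taking the square root of the rescaled value gives a typical error in price units, which is often easier to interpret than the scaled MSE.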


In [None]:
# tests on validation

#...
score = model.evaluate(X_val_array, Y_val_array, verbose=0)
print('Val MSE of 2-Layered:', score[0])

score1 = model_1.evaluate(X_val_array, Y_val_array, verbose=0)
print('Val MSE of 1-Layered 25 neuron:', score1[0])

score2 = model_2.evaluate(X_val_array, Y_val_array, verbose=0)
print('Val MSE of 1-Layered 50 neuron:', score2[0])

score3 = model_3.evaluate(X_val_array, Y_val_array, verbose=0)
print('Val MSE of 1-Layered 100 neuron:', score3[0])


Val MSE of 2-Layered: 0.0005103551666252315
Val MSE of 1-Layered 25 neuron: 0.00039245132938958704
Val MSE of 1-Layered 50 neuron: 0.0004338944854680449
Val MSE of 1-Layered 100 neuron: 0.0004035242600366473


## 7) Test your model on the test set (10 pts)

- Load the test data
- Apply the same preprocessing as for the training data (encoding categorical variables, scaling)
- Predict the prices for the test data **using the best model that you have selected according to your validation results** and report the mean squared error. 
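Applying the same preprocessing usually means reusing the statistics learned from the training data rather than refitting them on the test set. The sketch below illustrates this with a hypothetical `SimpleMinMax` helper and made-up rows. With scikit-learn's `MinMaxScaler`, this corresponds to calling `fit` (or `fit_transform`) on the training frame and only `transform` on the test frame.

```python
import numpy as np


class SimpleMinMax:
    """Minimal stand-in for a min-max scaler: learn min/max in fit, reuse them in transform."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.min_ = X.min(axis=0)
        # Assumes no column is constant; a zero range would divide by zero below.
        self.range_ = X.max(axis=0) - self.min_
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.min_) / self.range_


train = np.array([[100.0, 1.0], [300.0, 5.0]])   # made-up [sqmtrs, nrooms] rows
test = np.array([[200.0, 3.0], [350.0, 2.0]])

scaler = SimpleMinMax().fit(train)       # statistics come from the training data only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)     # the same mapping is applied to the test rows

print(test_scaled)
```

Test values may fall slightly outside [0, 1] when they exceed the training range. That is expected and keeps the test features on the same scale the model was trained on.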

In [None]:
# test results
test_df = pd.read_csv("/content/drive/My Drive/412/test.csv")
cleanup_nums = {"crime_rate":     {"low": 0, "high": 1},
                "view": {"north": 1, "south": 2, "east": 3, "west": 4}}
test_df = test_df.replace(cleanup_nums)

# scale the features between 0-1
msc = MinMaxScaler(feature_range=(0, 1))
scaled_test = msc.fit_transform(test_df)
scaled_test_df = pd.DataFrame(scaled_test, columns=test_df.columns.values)
scaled_test_df.head(10)
# Define X:
X = scaled_test_df[['sqmtrs', 'nrooms', 'view', 'crime_rate']]

# Define y:
Y = scaled_test_df['price']

X_test_array = X.values
Y_test_array = Y.values

score_final = model_2.evaluate(X_test_array, Y_test_array, verbose=0)
print('Test MSE of 1-Layered 50 neuron:', score_final[0])

Test MSE of 1-Layered 50 neuron: 0.0011126638855785131


In [None]:
min_rates = []
learning_rates = [0.1, 0.09, 0.07, 0.03, 0.01, 0.001]
for i in range (3):
  print('RUN:', i + 1)
  min_score = 0
  min_rate = 0
  count = 0
  for rate in learning_rates:
    model = keras.Sequential()
    model.add(keras.layers.Dense(128, activation='relu', name='hiddenLayer-1'))
    model.add(keras.layers.Dense(1))
    opt = Adam(learning_rate=rate)
    model.compile(loss='mean_squared_error', optimizer=opt)
    model.fit(X_train_array, Y_train_array, batch_size=64, epochs=30, verbose=0)
    score = model.evaluate(X_val_array, Y_val_array, verbose=0)
    if count == 0:
      min_score = score
      min_rate = rate
    count += 1
    if score < min_score:
      min_score = score
      min_rate = rate
    print("Loss with learning rate ", rate, ': ', score, sep='')
  print("Lowest loss was:", min_score, 'with learning rate:', min_rate, '\n')
  min_rates.append(min_rate)

max_occur_rate = 0
for rate in min_rates:
  if min_rates.count(rate) > min_rates.count(max_occur_rate):
    max_occur_rate = rate

print("Best learning rate is:", max_occur_rate)

RUN: 1
Loss with learning rate 0.1: 0.048385992646217346
Loss with learning rate 0.09: 0.00015368190361186862
Loss with learning rate 0.07: 0.00046236536582000554
Loss with learning rate 0.03: 6.817416578996927e-05
Loss with learning rate 0.01: 5.880971366423182e-05
Loss with learning rate 0.001: 6.194718298502266e-05
Lowest loss was: 5.880971366423182e-05 with learning rate: 0.01 

RUN: 2
Loss with learning rate 0.1: 0.0004445115337148309
Loss with learning rate 0.09: 0.0007889315602369606
Loss with learning rate 0.07: 0.00015180477930698544
Loss with learning rate 0.03: 7.618909876327962e-05
Loss with learning rate 0.01: 5.9890571719734e-05
Loss with learning rate 0.001: 6.119866156950593e-05
Lowest loss was: 5.9890571719734e-05 with learning rate: 0.01 

RUN: 3
Loss with learning rate 0.1: 0.0008767552208155394
Loss with learning rate 0.09: 0.0004876038001384586
Loss with learning rate 0.07: 0.00012871863145846874
Loss with learning rate 0.03: 7.992091559572145e-05
Loss with learnin

In [None]:
min_rates = []
neurons = [25,50,100,200,500]
for i in range (3):
  print('RUN:', i + 1)
  min_score = 0
  min_rate = 0
  count = 0
  for n in neurons:
    model = keras.Sequential()
    model.add(keras.layers.Dense(n, activation='relu', name='hiddenLayer-1'))
    model.add(keras.layers.Dense(1))
    opt = Adam(learning_rate=0.001)
    model.compile(loss='mean_squared_error', optimizer=opt)
    model.fit(X_train_array, Y_train_array, batch_size=64, epochs=30, verbose=0)
    score = model.evaluate(X_val_array, Y_val_array, verbose=0)
    if count == 0:
      min_score = score
      min_rate = n
    count += 1
    if score < min_score:
      min_score = score
      min_rate = n
    print("Loss with # of neurons ", n, ': ', score, sep='')
  print("Loss was:", min_score, 'with # of neurons:', min_rate, '\n')
  min_rates.append(min_rate)

max_occur_rate = 0
for rate in min_rates:
  if min_rates.count(rate) > min_rates.count(max_occur_rate):
    max_occur_rate = rate

print("Best # of neurons is:", max_occur_rate)

RUN: 1
Loss with # of neurons 25: 0.0003037714632228017
Loss with # of neurons 50: 0.00010257354733766988
Loss with # of neurons 100: 6.398266123142093e-05
Loss with # of neurons 200: 6.754184141755104e-05
Loss with # of neurons 500: 6.517938163597137e-05
Loss was: 6.398266123142093e-05 with # of neurons: 100 

RUN: 2
Loss with # of neurons 25: 0.00011754301522159949
Loss with # of neurons 50: 0.0001277210540138185
Loss with # of neurons 100: 8.362873631995171e-05
Loss with # of neurons 200: 5.5636748584220186e-05
Loss with # of neurons 500: 5.532363138627261e-05
Loss was: 5.532363138627261e-05 with # of neurons: 500 

RUN: 3
Loss with # of neurons 25: 0.00014621230366174132
Loss with # of neurons 50: 0.0001312487875111401
Loss with # of neurons 100: 8.525009616278112e-05
Loss with # of neurons 200: 6.583460344700143e-05
Loss with # of neurons 500: 5.1525876187952235e-05
Loss was: 5.1525876187952235e-05 with # of neurons: 500 

Best # of neurons is: 500


In [None]:
model = keras.Sequential()
model.add(keras.layers.Dense(500, activation='relu', name='hiddenLayer-1'))
model.add(keras.layers.Dense(1))
opt = Adam(learning_rate=0.001)
model.compile(loss='mean_squared_error', optimizer=opt)
model.fit(X_train_array, Y_train_array, batch_size=64, epochs=30, verbose=0)
score = model.evaluate(X_val_array, Y_val_array, verbose=0)
# evaluate the 500-neuron model trained above (compiled without extra metrics, so evaluate returns the loss only)
score_final = model.evaluate(X_test_array, Y_test_array, verbose=0)

print("We have obtained the best results on the validation set with the last approach (experimenting with different numbers of neurons and learning rates), using \n500 neurons and a learning rate of 0.001.\nThe result of this model on the test data is", score_final, "MSE.")


## 8) Report Your Results (10 pts)

**Notebook should be RUN:** As training and testing may take a long time, we may just look at your notebook results without running the code again; so make sure **each cell is run**, so outputs are there.

**Report:** Write a **1-2 page summary** of your approach to this problem **as indicated below**. 

**Must include statements such as those below:**
**(Remove the text in parentheses, below, and include your own report)**

( Include the problem definition: 1-2 lines )

 (Talk about train/val/test sets, size and how split. )

 (Talk about feature extraction or preprocessing.)

**Add your observations as follows** (keep the questions for easy grading/context) in the report part of your notebook.

**Observations**

- Try a few learning rates for N=25 hidden neurons and train for the indicated number of epochs. Comment on what happens when the learning rate is large or small. What is a good number/range for the learning rate?
Your answer here….

- Use that learning rate and vary the number of hidden neurons over the given values, training for the indicated number of epochs. Give the validation mean squared errors for the different approaches and meta-parameters tried **in a table** and state which one you selected as your model. How many hidden neurons give the best model? 
Your answer here….

- State what your test results are with the chosen approach and meta-parameters, e.g. "We have obtained the best results on the validation set with the .......... approach using a value of ...... for .... parameter. The result of this model on the test data is ..... MSE."

- How slow is learning? Any other problems?
Your answer here….

- Any other observations (not obligatory)

You can add additional visualizations as separate pages if you want; think of them as an appendix, keeping the summary to 1-2 pages.

