# Task 1: Introduction

---

For this project, we will predict the price of a house from the following features:

1. Year of sale of the house
2. The age of the house at the time of sale
3. Distance from city center
4. Number of stores in the locality
5. The latitude
6. The longitude


Note: This notebook uses `python 3` and these packages: `tensorflow`, `pandas`, `matplotlib`, `scikit-learn`.

## 1.1: Importing Libraries & Helper Functions

First of all, we will need to import some libraries and helper functions. This includes TensorFlow and some utility functions that I've written to save time.

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

from utils import *
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, LambdaCallback

%matplotlib inline
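tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR) # tf.logging was removed in TensorFlow 2.x; the compat.v1 path also works on late 1.x releases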

print('Libraries imported.')


# Task 2: Importing the Data

## 2.1: Importing the Data

The dataset is saved in a `data.csv` file. We will use `pandas` to take a look at some of the rows.

In [None]:
df = pd.read_csv('data.csv', names = column_names) # read the file with pandas; column_names is not defined in this notebook, so it presumably comes from the utils import
df.head() # show the first 5 rows of the table

## 2.2: Check Missing Data

It is good practice to check whether the data has any missing values. In real-world data, missing values are quite common and must be handled before any pre-processing or model training.

In [1]:
# df.isna() on its own returns True/False for every cell, which is impractical with thousands of rows;
# chaining .sum() gives the number of missing values in each column instead
df.isna().sum()

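
As a small illustration of the check above, here is a sketch on a toy table. The column names and values are made up for the example and are not taken from `data.csv`; dropping rows is only one of several possible ways to handle gaps.

```python
import numpy as np
import pandas as pd

# Toy frame; names and values are invented for illustration only.
toy = pd.DataFrame({
    "age": [10.0, np.nan, 3.0],
    "stores": [4.0, 2.0, np.nan],
    "price": [13000.0, 9000.0, 12000.0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# One simple option: drop every row that contains a missing value.
toy_clean = toy.dropna()
print(len(toy_clean), "complete row(s) remain")
```

If every count is zero, as we hope for our dataset, no clean-up is needed.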

# Task 3: Data Normalization

## 3.1: Data Normalization

We can make it easier for optimization algorithms to find minima by normalizing the data before training a model.

In [None]:
df = df.iloc[:, 1:] # drop the serial-number column; iloc takes the rows first, then the columns
df_norm = (df - df.mean()) / df.std() # df.mean() and df.std() are computed column-wise
df_norm.head()
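
To see what this normalization does, here is a minimal sketch on a toy table with made-up values. After the transformation each column has mean 0 and standard deviation 1. Note that pandas computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "distance": [1.0, 2.0, 3.0],
    "price": [100.0, 200.0, 300.0],
})

# Same formula as above: subtract the column mean, divide by the column std.
toy_norm = (toy - toy.mean()) / toy.std()
print(toy_norm)
```

Both columns end up on the same scale even though the raw prices are 100 times larger than the distances.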

## 3.2: Convert Label Value

Because we train on normalized labels, a trained model returns its predictions on the same normalized scale. To get predicted prices, we need to convert those values back to the original distribution.

In [None]:
y_mean = df['price'].mean() # use df, not df_norm: we want the mean of the original price distribution
y_std = df['price'].std()

def convert_label_value(pred):
    return int(pred * y_std + y_mean) # map a normalized prediction back to the original price scale

print(convert_label_value(0.350088)) # int() truncates, so the result may be slightly off from the exact price
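
The conversion is simply the inverse of the normalization. The sketch below checks the round trip on made-up prices using only the standard library. The names `to_normalized` and `to_price` are hypothetical and stand in for the normalization step and `convert_label_value`.

```python
import statistics

# Made-up prices, for illustration only.
prices = [12000.0, 14000.0, 16000.0]
mean = statistics.mean(prices)
std = statistics.stdev(prices)  # sample standard deviation, like pandas' default

def to_normalized(price):
    # Forward step: the same formula used to build df_norm.
    return (price - mean) / std

def to_price(pred):
    # Inverse step: the same formula as convert_label_value.
    return int(pred * std + mean)

normalized = [to_normalized(p) for p in prices]
recovered = [to_price(z) for z in normalized]
print(normalized)
print(recovered)
```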

# Task 4: Create Training and Test Sets

## 4.1: Select Features

Make sure to remove the column __price__ from the list of features as it is the label and should not be used as a feature.

In [None]:
x = df_norm.iloc[:, :6] # all rows, first 6 columns (the features; price is the label)
x.head()

## 4.2: Select Labels

In [None]:
y = df_norm.iloc[:, -1] # only the last column
y.head()

## 4.3: Feature and Label Values

We will need to extract just the numeric values for the features and labels as the TensorFlow model will expect just numeric values as input.

In [None]:
# pandas stores the data as NumPy arrays, which can be accessed with .values:
# x gives a 2-D array (examples x features), y gives a 1-D array
x_arr = x.values
y_arr = y.values # labels

# 6 features and 5000 total examples
print('features array shape:', x_arr.shape)
print('labels array shape:', y_arr.shape)

## 4.4: Train and Test Split

We will keep some part of the data aside as a __test__ set. The model will not use this set during training and it will be used only for checking the performance of the model in trained and un-trained states. This way, we can make sure that we are going in the right direction with our model training.

In [None]:
# If we trained on every example we have, there would be no unbiased way to measure the trained model's performance:
# we would know how it does on the training set, but could not assess whether it works on data it has never seen.
# A held-out test set lets us check that the model is learning the underlying relationship between inputs and outputs, not just memorizing the data.

x_train, x_test, y_train, y_test = train_test_split(x_arr, y_arr, test_size=0.05, random_state=0) # scikit-learn helper
# test_size=0.05 holds out 5% of the data; random_state=0 makes the split reproducible, so anyone running this gets the same split

print('training set:', x_train.shape, y_train.shape)
print('test set:', x_test.shape, y_test.shape)
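
To make the effect of `test_size` and `random_state` concrete, here is a small sketch on synthetic arrays. The data is generated for the example and the variable names are made up. With 100 examples, 5% gives 5 test rows and 95 training rows, and repeating the call with the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 examples with 6 features each.
features = np.arange(600, dtype=float).reshape(100, 6)
labels = np.arange(100, dtype=float)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.05, random_state=0
)
print(f_train.shape, f_test.shape)

# Same seed, same split.
_, f_test_again, _, _ = train_test_split(
    features, labels, test_size=0.05, random_state=0
)
print(np.array_equal(f_test, f_test_again))
```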

# Task 5: Create the Model

## 5.1: Create the Model

Let's write a function that returns an untrained model of a certain architecture.

In [None]:
def get_model():
    
    model = Sequential([
        Dense(10, input_shape = (6,), activation = 'relu'),
        Dense(20, activation = 'relu'),
        Dense(5, activation = 'relu'),
        Dense(1)
    ])

    model.compile(
        loss='mse',
        optimizer='adam' # the Adam optimizer minimizes the MSE loss
    )
    
    return model

model = get_model()
model.summary()
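
As a sanity check on the summary, the number of trainable parameters can be worked out by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The helper name `dense_params` below is made up for this sketch.

```python
def dense_params(n_in, n_out):
    # Weights (n_in * n_out) plus one bias per unit.
    return n_in * n_out + n_out

# Input size followed by the units of each Dense layer in get_model().
layer_sizes = [6, 10, 20, 5, 1]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [70, 220, 105, 6] 401
```

The per-layer counts should match the `Param #` column printed by `model.summary()`.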

# Task 6: Model Training

## 6.1: Model Training

We can use an `EarlyStopping` callback from Keras to stop the model training if the validation loss stops decreasing for a few epochs.

In [None]:
# val_loss is calculated on the validation data (here, the test set), which makes it a better signal than the training loss for deciding when to stop
# training stops once val_loss has not improved for 5 consecutive epochs (patience = 5)

early_stopping = EarlyStopping(monitor='val_loss', patience = 5)

model = get_model()

preds_on_untrained = model.predict(x_test) # predictions from randomly initialized weights

history = model.fit( # history records the training and validation loss for every epoch
    x_train, y_train,
    validation_data = (x_test, y_test),
    epochs = 1000,
    callbacks = [early_stopping]
)
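
The idea behind `patience` can be sketched in plain Python. This is a simplified illustration of the stopping rule, not the actual Keras implementation. The function name `epochs_run` and the loss values are made up.

```python
def epochs_run(val_losses, patience=5):
    """Return how many epochs run before a patience-based stop."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch  # stop: no improvement for `patience` epochs
    return len(val_losses)  # never triggered; all epochs ran

# Made-up validation losses: the best value (0.7) is reached at epoch 3.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.73, 0.74, 0.75, 0.9]
stopped_at = epochs_run(losses, patience=5)
print(stopped_at)  # 8
```

In this example training stops at epoch 8, five epochs after the last improvement, so the `epochs = 1000` setting acts only as an upper bound.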

## 6.2: Plot Training and Validation Loss

Let's use the `plot_loss` helper function to take a look at the training and validation loss.

In [None]:
plot_loss(history)

# Task 7: Predictions

## 7.1: Plot Raw Predictions

Let's use the `compare_predictions` helper function to compare predictions from the model when it was untrained and when it was trained.

In [None]:
preds_on_trained = model.predict(x_test)

compare_predictions(preds_on_untrained, preds_on_trained, y_test)

# the trained model's predictions form a roughly linear plot and are much more accurate than the untrained model's
# this plot uses the normalized prices

## 7.2: Plot Price Predictions

The plot of price predictions will look the same as the plot of raw predictions, with one difference: the scales of the x and y axes change.

In [None]:
#convert every prediction back to original prices
price_on_untrained = [convert_label_value(y) for y in preds_on_untrained]
price_on_trained = [convert_label_value(y) for y in preds_on_trained]
price_y_test = [convert_label_value(y) for y in y_test]

compare_predictions(price_on_untrained, price_on_trained, price_y_test)
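
The list comprehensions above convert one value at a time. As an alternative sketch, the same conversion can be done on the whole array with NumPy. The function name `convert_label_values`, the fake predictions, and the mean and standard deviation below are all made up for illustration. Keras returns predictions with shape `(n, 1)`, so the array is flattened first.

```python
import numpy as np

def convert_label_values(preds, y_mean, y_std):
    # Flatten (n, 1) predictions to (n,), rescale, then truncate to int
    # (the same truncation that int() performs in convert_label_value).
    preds = np.asarray(preds, dtype=float).ravel()
    return (preds * y_std + y_mean).astype(int)

# Fake normalized predictions shaped like model.predict output.
fake_preds = np.array([[-1.0], [0.0], [0.5]])
prices = convert_label_values(fake_preds, y_mean=14000.0, y_std=2000.0)
print(prices)
```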