# Modelling and Deep Learning with TensorFlow

Welcome to the third notebook of our project, where we will focus on building and training our predictive models.

In this notebook, we'll be using TensorFlow, a popular open-source platform for machine learning. TensorFlow offers a comprehensive ecosystem of tools, libraries, and community resources that allows researchers and developers to build and deploy machine learning models with ease.

We will start by loading the processed dataset that we created in the previous notebook. After this, we will carry out the following steps:

1. **Data Splitting**: We will split our data into training and test sets. The training set will be used to train our models, while the test set will serve to evaluate their performance on unseen data.

2. **Baseline Model**: We will begin with a simple linear regression model to serve as our baseline. This will allow us to gauge the performance of our subsequent, more complex models.

3. **Deep Learning Model**: After establishing our baseline, we will proceed to construct a deep learning model using TensorFlow's Keras API. We will start with a basic feed-forward neural network and assess its performance.

4. **Model Improvement**: We will attempt to improve the performance of our deep learning model by tuning its architecture and hyperparameters. We may include techniques such as adding more layers, using different types of layers (like dropout for regularization), or adjusting the learning rate.

5. **Model Training**: Our models will be trained on our training set, using the appropriate loss functions and optimizers for regression tasks. We will monitor the training and validation loss during training.

By the end of this notebook, we should have a trained model ready for evaluation and optimization in our final notebook.


## Data Splitting

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.getOrCreate()

# Read the data from the CSV file
df = spark.read.csv('processed_housing.csv', inferSchema=True, header=True)

# Convert Spark DataFrame to Pandas DataFrame
df_pandas = df.toPandas()

# Perform one-hot encoding
df_encoded = pd.get_dummies(df_pandas)

# Split the data into train and test sets
train_data, test_data = train_test_split(df_encoded, test_size=0.2, random_state=42)

# Separate the features from the target variable
X_train = train_data.drop('median_house_value', axis=1)
y_train = train_data['median_house_value']

X_test = test_data.drop('median_house_value', axis=1)
y_test = test_data['median_house_value']

# Convert data to float32
X_train = np.array(X_train).astype('float32')
y_train = np.array(y_train).astype('float32')
X_test = np.array(X_test).astype('float32')
y_test = np.array(y_test).astype('float32')

After loading the processed dataset, we convert the Spark DataFrame into a Pandas DataFrame so that the remaining steps can use pandas and scikit-learn directly.

Our dataset includes a categorical variable, 'ocean_proximity'. Machine learning models typically require numerical inputs, so we convert this column using one-hot encoding. With one-hot encoding, each distinct category becomes its own binary column, which holds 1 for rows that belong to that category and 0 otherwise. Each categorical value is therefore represented as a binary vector with a single 1.
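To make the encoding concrete, here is a minimal sketch on a toy DataFrame. The category values (`INLAND`, `NEAR BAY`, `ISLAND`) and the `median_income` column are illustrative assumptions, not taken from the processed dataset.

```python
import pandas as pd

# Toy data: the category values and the numeric column are made up for illustration.
toy = pd.DataFrame({
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
    "median_income": [3.2, 5.1, 2.8, 4.0],
})

# get_dummies leaves numeric columns alone and expands each text column
# into one binary column per distinct category.
encoded = pd.get_dummies(toy, dtype="float32")

print(encoded.columns.tolist())
print(encoded)
```

The numeric column passes through unchanged, while `ocean_proximity` is replaced by one column per category, named `ocean_proximity_<value>`.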

Then, we split our dataset into training data and testing data. Training data (80% of the dataset) is used to train our machine learning model, while testing data (20% of the dataset) is used to evaluate the model's performance.

Finally, we separate the features (independent variables) from the target variable (median_house_value) and convert the data to 32-bit floating point (`float32`), which is the default floating-point type for TensorFlow models.

In the next step, we will normalize our features.

In [2]:
# Normalizing the features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

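Feature scaling is an important step for many machine learning algorithms. The `StandardScaler` used here standardizes each feature: it subtracts the column's mean and divides by its standard deviation, so every column ends up with zero mean and unit variance on the training set. Putting the features on a common scale keeps columns with large raw values from dominating the others and generally helps gradient-based training converge. Note that the scaler is fitted on the training set only and the same statistics are then applied to the test set, so no information from the test data leaks into training.

Next, we will build a simple baseline model before turning to TensorFlow.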
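As a rough illustration of what the scaler computes, the sketch below standardizes a tiny made-up array with plain NumPy. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses by default; the array values are arbitrary.

```python
import numpy as np

# Made-up feature matrices: two columns on very different scales.
X_train_toy = np.array([[1.0, 100.0],
                        [2.0, 200.0],
                        [3.0, 300.0]])
X_test_toy = np.array([[2.0, 400.0]])

# Statistics come from the training data only.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population standard deviation (ddof=0)

# The same mean and std are applied to both sets.
X_train_scaled = (X_train_toy - mean) / std
X_test_scaled = (X_test_toy - mean) / std

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

The scaled training columns have mean 0 and standard deviation 1, while the test row is merely expressed in training-set units and can fall outside that range.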

## Baseline Model
To have a point of reference for evaluating the performance of our deep learning models, we'll first build a simple linear regression model and record its mean squared error (MSE) on the training set.

In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Create a linear regression model
lin_reg = LinearRegression()

# Train the model
lin_reg.fit(X_train, y_train)

# Get predictions on the training set
lin_reg_preds = lin_reg.predict(X_train)

# Compute the mean squared error of the predictions
lin_reg_mse = mean_squared_error(y_train, lin_reg_preds)

print("Linear Regression MSE: ", lin_reg_mse)

Linear Regression MSE:  2643793400.0
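Because MSE is expressed in squared units, the figure above is hard to interpret directly. A minimal sketch, using only the value printed by the cell above, takes the square root to obtain a root-mean-squared error (RMSE) in the same units as `median_house_value`:

```python
import math

# Value copied from the notebook output above.
lin_reg_mse = 2643793400.0

# RMSE is in the same units as the target variable.
lin_reg_rmse = math.sqrt(lin_reg_mse)

print(f"Linear Regression RMSE: {lin_reg_rmse:,.0f}")
```

This gives an RMSE of roughly 51,400, measured on the training set, which is the number the neural network needs to beat.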


## Deep Learning Model

With a baseline in place, we're ready to build our deep learning model. We'll use the Keras `Sequential` model from TensorFlow, which is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor. For this task, we'll be using fully connected (dense) layers.

In the following code block, we'll build and compile our model:

In [4]:
import tensorflow as tf

# Define the model
model = tf.keras.models.Sequential([
  tf.keras.layers.Dense(64, activation='relu', input_shape=[len(X_train[0])]),
  tf.keras.layers.Dense(64, activation='relu'),
  tf.keras.layers.Dense(1)
])

# Compile the model
model.compile(loss='mean_squared_error',
              optimizer=tf.keras.optimizers.Adam(0.001),
              metrics=['mae', 'mse'])

Our model consists of three layers:

1. The first layer is a dense layer with 64 nodes (or neurons) and the ReLU (Rectified Linear Unit) activation function.
2. The second layer is also a dense layer with 64 neurons and the ReLU activation function.
3. The third layer is the output layer. It has a single node and no activation function, because we're predicting one continuous value (the median house value).

After defining the model's architecture, we compile the model. At compilation we specify a loss function, an optimizer, and the metrics we want to monitor. Here, the loss is the mean squared error, a common choice for regression problems, and the optimizer is Adam with a learning rate of 0.001. The metrics we monitor are mean absolute error (MAE) and mean squared error (MSE); the latter duplicates the loss value.
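To show what this 64-64-1 stack computes, here is a minimal NumPy sketch of the forward pass and the resulting parameter count. The feature count (`n_features = 13`) and the random weight initialization are assumptions for illustration only; Keras uses its own initializers, and the real input width is `len(X_train[0])`.

```python
import numpy as np

def relu(x):
    """ReLU activation: negative values become 0."""
    return np.maximum(x, 0.0)

def init_dense(rng, n_in, n_out):
    """Return (weights, bias) for one dense layer with small random weights."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def forward(x, layers):
    """Apply ReLU dense layers, then a linear output layer."""
    *hidden, (w_out, b_out) = layers
    for w, b in hidden:
        x = relu(x @ w + b)
    return x @ w_out + b_out  # no activation on the regression output

n_features = 13  # hypothetical; the notebook derives this from X_train
rng = np.random.default_rng(0)
layers = [
    init_dense(rng, n_features, 64),
    init_dense(rng, 64, 64),
    init_dense(rng, 64, 1),
]

batch = rng.normal(size=(5, n_features))  # five made-up input rows
preds = forward(batch, layers)

# Each dense layer has n_in * n_out weights plus n_out biases.
n_params = sum(w.size + b.size for w, b in layers)
print(preds.shape, n_params)
```

With 13 input features the stack has 896 + 4,160 + 65 = 5,121 trainable parameters, and every input row maps to a single predicted value.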

Next, we'll train our model.

In [5]:
# Train the model
history = model.fit(X_train, y_train, epochs=100, validation_split = 0.2)

Epoch 1/100
...
Epoch 77/100
Epoch 78

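Here, we train the model for 100 epochs with a validation split of 0.2, meaning that 20% of the training data is held out as validation data and is not used to update the weights. Keras takes these validation samples from the end of the arrays, before any shuffling.

In the next step, we'll try to improve the model.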
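The sketch below illustrates how a `validation_split` of 0.2 partitions the rows passed to `fit`. It is a simplified illustration rather than Keras's actual code, and the sample count of 1,000 is a made-up number.

```python
import math

def split_index(n_samples, validation_split):
    """Index at which the training rows end and the validation rows begin."""
    return int(math.floor(n_samples * (1.0 - validation_split)))

n_samples = 1000  # hypothetical number of rows passed to fit()
split_at = split_index(n_samples, 0.2)

train_rows = range(0, split_at)          # used for weight updates
val_rows = range(split_at, n_samples)    # held out, taken from the end

print(len(train_rows), len(val_rows), val_rows[0])
```

Because the validation rows always come from the end of the arrays, it matters that `train_test_split` has already shuffled the data; otherwise the validation set could be systematically different from the training rows.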

## Model Improvement
To improve our model's performance, we can adjust its architecture and hyperparameters, for example by adding more layers, using different types of layers (such as dropout for regularization), or tweaking the learning rate. Here, we add a dropout layer after each hidden layer.

In [6]:
# Redefine the model with dropout
model = tf.keras.models.Sequential([
  tf.keras.layers.Dense(64, activation='relu', input_shape=[len(X_train[0])]),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(64, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(1)
])

# Recompile the model
model.compile(loss='mean_squared_error',
              optimizer=tf.keras.optimizers.Adam(0.001),
              metrics=['mae', 'mse'])

# Retrain the model
history = model.fit(X_train, y_train, epochs=100, validation_split = 0.2)

Epoch 1/100
...
Epoch 77/100
Epoch 78

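By incorporating dropout layers, we're introducing a form of regularization. During training, each `Dropout(0.2)` layer randomly sets 20% of its input units to 0 at every update and scales the remaining units up by 1/(1 - 0.2), so the expected sum of the inputs is unchanged. At inference time dropout is inactive. Because the network cannot rely on any single unit, this helps prevent overfitting.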
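The behaviour of a dropout layer can be sketched in a few lines of NumPy. This is an illustrative re-implementation of inverted dropout under the assumptions stated in the comments, not the Keras source; the `dropout` helper is a hypothetical name.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of units and
    rescale the survivors by 1 / (1 - rate). No-op outside training."""
    if not training or rate == 0.0:
        return x
    keep_mask = rng.random(x.shape) >= rate
    return np.where(keep_mask, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(0)
activations = np.ones(10_000)  # made-up activations, all equal to 1

dropped = dropout(activations, 0.2, rng, training=True)
unchanged = dropout(activations, 0.2, rng, training=False)

print("fraction zeroed:", np.mean(dropped == 0.0))
print("surviving value:", dropped.max())
print("mean after dropout:", dropped.mean())
```

About a fifth of the units are zeroed, the survivors are scaled to 1.25, and the mean stays close to 1, which is why no rescaling is needed when dropout is switched off at inference.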

## Model Training
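We have already trained our model on the training data using a loss function and optimizer suited to our regression task, and we monitored the training and validation loss, which gives useful insight into the model's performance and its ability to generalize. Calling `fit` again does not reset the weights, so the cell below continues training the dropout model for another 100 epochs and overwrites `history` with the results of this new run.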

In [7]:
# Continue training the model
history = model.fit(X_train, y_train, epochs=100, validation_split = 0.2)

Epoch 1/100
...
Epoch 77/100
Epoch 78
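One way to use the monitored losses is to look for the epoch where the validation loss stops improving. The sketch below assumes a dictionary shaped like `history.history`, which Keras fills with one list per metric (`loss`, `val_loss`, and so on). The numbers are made-up placeholders, not results from this notebook.

```python
# Placeholder values shaped like history.history; NOT real training results.
history_dict = {
    "loss":     [9.0, 6.0, 4.5, 3.8, 3.4, 3.1],
    "val_loss": [9.2, 6.5, 5.1, 4.9, 5.0, 5.3],
}

val_loss = history_dict["val_loss"]

# Index of the epoch with the lowest validation loss (0-based).
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

# A validation loss that rises while the training loss keeps falling
# is a typical sign of overfitting.
final_gap = val_loss[-1] - history_dict["loss"][-1]

print(f"Lowest val_loss at epoch {best_epoch + 1}")
print(f"Final gap between val_loss and loss: {final_gap:.1f}")
```

In the notebook itself, `history_dict` would be replaced by `history.history` from the most recent call to `fit`.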

Finally, we save our improved model for future use.

In [8]:
# Save the model
model.save('housing_model_improved.h5')

## Summary
In this notebook, we've built upon our preprocessed data, constructed both a baseline linear regression model and a deep learning model, and made attempts to improve the latter with dropout. We've trained these models while monitoring training and validation loss, and saved the improved network for later use. In the next notebook, we'll delve into model evaluation and optimization.