
# Activity 3.1.5 Building a basic neural network

## Scenario
The Spambase data set was created by Hopkins et al. (1999) and donated to the UCI Machine Learning Repository. It contains 4,601 emails, each marked as spam or non-spam by a postmaster or by individuals. Fifty-seven features (e.g. word frequencies and email characteristics) aid in classifying each email. The data set is used for developing and benchmarking spam detection models, and provides a basis for analysing how effectively various machine learning techniques distinguish spam from legitimate emails.

As a data professional, you have been tasked by your company with developing a neural network in TensorFlow, based on the Spambase data set, that classifies emails as spam or non-spam.


## Objective
In this portfolio activity, you’ll create a simple neural network using TensorFlow to classify emails as spam or non-spam.

You will complete the activity in your Notebook, where you’ll:
- create a model with the Sequential API
- add layers as needed
- employ the model pipeline (compile, fit, and evaluate)
- present your insights based on the performance of the model.


## Assessment criteria
By completing this activity, you will be able to provide evidence that you can synthesise and apply the TensorFlow life cycle from creation to evaluation.


## Activity guidance

1. Import the libraries needed to load and analyse the data set. The URL for the data set is provided. Note that the data set has no header row.
2. View the DataFrame.
3. Specify input features (`X`) and the target variable (`y`). The last column indicates whether an email is spam or non-spam.
4. Split the data into train and test sets, with a test size of 20%. Then create a validation set by holding out a further 10% of the training data (a split of 0.1).
5. Standardise the features and define the sequential model with 2 dense hidden layers. The first layer has 64 neurons and ReLU activation, while the second layer has 32 neurons and ReLU activation. Remember that the `input_shape` of the first hidden layer has to be equal to the number of columns of the input data features matrix. The last layer is the output layer, with a sigmoid activation function and 1 neuron.
6. Compile the model with `binary_crossentropy` as the loss and the Adam optimiser, and track accuracy as a metric.
7. Train/fit the model with a batch size of 64 and 10 epochs.
8. Evaluate the model on the test set with the `evaluate` function and print the loss and accuracy that it returns.

> Start your activity here. Select the pen from the toolbar to add your entry.

In [7]:
# Start your activity here:

# URL to import data set from GitHub.
url = 'https://raw.githubusercontent.com/fourthrevlxd/cam_dsb/main/spamdata.csv'

In [8]:
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [9]:
# The file has no header row, so pandas assigns integer column names.
data = pd.read_csv(url, header=None)
data.head()

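|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 |
|---|---|---|---|---|---|---|---|---|---|---|-----|----|----|----|----|----|----|----|----|----|----|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |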


### Specify Input Features (X) and Target Variable (y)


In [10]:
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

### Split the Data into Training, Validation, and Test Sets


In [11]:
# Split into train and test sets (80% train, 20% test)
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Further split the training set into train and validation (90% train, 10% validation)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full, test_size=0.1, random_state=42)
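
Because the validation set is carved out of the 80% training portion, it ends up as roughly 8% of all rows rather than 10%. The sketch below works through the arithmetic. It assumes the CSV holds the 4,601 rows quoted in the scenario and that scikit-learn rounds the held-out share up when only `test_size` is given; the helper `split_sizes` is a hypothetical name used for illustration.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the held-out share is rounded up and the
    training share takes whatever rows remain.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_total = 4601  # row count quoted in the scenario (assumed to match the CSV)

# First split: 80% train (full), 20% test
n_train_full, n_test = split_sizes(n_total, 0.2)
# Second split: 10% of the remaining training rows become validation
n_train, n_valid = split_sizes(n_train_full, 0.1)

print(f"train={n_train}, valid={n_valid}, test={n_test}")
print(f"share of all rows: train={n_train / n_total:.1%}, "
      f"valid={n_valid / n_total:.1%}, test={n_test / n_total:.1%}")

# Number of batches per epoch when fitting with batch_size=64
steps_per_epoch = math.ceil(n_train / 64)
print(f"steps per epoch: {steps_per_epoch}")
```

Under these assumptions the step count agrees with the `52/52` progress counter in the training log, which is a quick way to confirm that the splits have the expected sizes.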

### Standardise the Features


In [12]:
scaler = StandardScaler()
# Fit the scaler on the training data and transform it
X_train = scaler.fit_transform(X_train)
# Reuse the training statistics for the validation and test sets
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)
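
To make the fit/transform distinction concrete, here is a minimal sketch of standardisation on a single feature using only the standard library. The values are column 56 from the preview rows, divided by hand into a tiny "train" and "validation" set purely for illustration. It assumes, as `StandardScaler` does, that the population standard deviation is used.

```python
from statistics import fmean, pstdev

# One feature column, split by hand for illustration only
train_values = [278.0, 1028.0, 2259.0, 191.0]
valid_values = [191.0]

# "fit": learn the mean and population standard deviation from training data only
mean = fmean(train_values)
std = pstdev(train_values)


def standardise(values):
    """"transform": apply the training-set statistics to any split."""
    return [(v - mean) / std for v in values]


train_scaled = standardise(train_values)
valid_scaled = standardise(valid_values)

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in valid_scaled])
```

The scaled training values have mean 0 and standard deviation 1 by construction. The validation value is mapped with the same mean and standard deviation, so no information from held-out data leaks into preprocessing.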

### Define the Sequential Model


In [13]:
model = tf.keras.Sequential()
number_neurons_1 = 64
number_neurons_2 = 32

# Hidden layer 1: input_shape must match the number of feature columns.
model.add(tf.keras.layers.Dense(number_neurons_1, activation='relu', input_shape=(X_train.shape[1],)))
# Hidden layer 2.
model.add(tf.keras.layers.Dense(number_neurons_2, activation='relu'))
# Output layer: a single sigmoid unit outputs a value between 0 and 1.
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
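
A quick way to sanity-check the architecture is to count its trainable parameters by hand. The sketch below assumes 57 input features, as stated in the scenario, and that each dense layer has one weight per input-unit pair plus one bias per unit. `dense_params` is a hypothetical helper, not part of Keras.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


n_features = 57            # feature count quoted in the scenario
layer_units = [64, 32, 1]  # two hidden layers and the output layer

total = 0
n_inputs = n_features
for units in layer_units:
    params = dense_params(n_inputs, units)
    print(f"Dense({units}): {params} parameters")
    total += params
    n_inputs = units  # this layer's output feeds the next layer

print(f"Total trainable parameters: {total}")
```

The total should agree with the figure reported by `model.summary()`; a mismatch usually points to a wrong `input_shape`.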



### Compile the Model Using TensorFlow's Alias


In [16]:
# Compile the model using TensorFlow's alias
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])
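
Loss and accuracy measure different things, which is why both are tracked. The following sketch computes binary cross-entropy by hand on toy labels and probabilities invented for illustration. The clipping constant `eps` is an assumption chosen to avoid `log(0)`, and the function is a plain-Python stand-in, not the Keras implementation.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


y_true = [1, 0, 1, 0]             # toy labels: 1 = spam, 0 = non-spam
confident = [0.9, 0.1, 0.8, 0.2]  # correct and fairly sure
hesitant = [0.6, 0.4, 0.6, 0.4]   # correct side of 0.5, but unsure

loss_confident = binary_crossentropy(y_true, confident)
loss_hesitant = binary_crossentropy(y_true, hesitant)

print(f"confident: {loss_confident:.4f}")
print(f"hesitant:  {loss_hesitant:.4f}")
```

Both sets of predictions are 100% accurate at a 0.5 threshold, yet the hesitant one has the higher loss. The loss rewards well-calibrated confidence, whereas accuracy only checks which side of the threshold a prediction falls on.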

### Train the Model Using TensorFlow's Alias


In [18]:
history = model.fit(X_train, y_train, epochs=10, batch_size=64, validation_data=(X_valid,y_valid))

Epoch 1/10
52/52 ━━━━━━━━━━━━━━━━━━━━ 3s 11ms/step - accuracy: 0.6900 - loss: 0.6067 - val_accuracy: 0.9158 - val_loss: 0.3403
Epoch 2/10
52/52 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9064 - loss: 0.3176 - val_accuracy: 0.9158 - val_loss: 0.2313
Epoch 3/10
52/52 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9316 - loss: 0.2118 - val_accuracy: 0.9293 - val_loss: 0.2073
Epoch 4/10
52/52 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9343 - loss: 0.1922 - val_accuracy: 0.9375 - val_loss: 0.1989
Epoch 5/10
52/52 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9382 - loss: 0.1701 - val_accuracy: 0.9402 - val_loss: 0.1902
Epoch 6/10
52/52 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.9366 - loss: 0.1719 - val_accuracy: 0.9348 - val_loss: 0.1867
Epoch 7/10
52/52 ━━━━━━━━━

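The validation loss in the log can be inspected to see how quickly training is levelling off. The sketch below copies the `val_loss` values for the six complete epochs shown above into a list by hand and measures the epoch-to-epoch improvement.

```python
# Validation loss for the six complete epochs shown in the training log
val_loss = [0.3403, 0.2313, 0.2073, 0.1989, 0.1902, 0.1867]

# Improvement from one epoch to the next (positive means the loss fell)
improvements = [prev - curr for prev, curr in zip(val_loss, val_loss[1:])]

for epoch, gain in enumerate(improvements, start=2):
    print(f"epoch {epoch}: val_loss improved by {gain:.4f}")

# Epoch (1-indexed) with the lowest validation loss seen so far
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(f"lowest val_loss so far: epoch {best_epoch}")
```

On a live run the same values are available as `history.history['val_loss']`. Shrinking improvements suggest the model is approaching the point where further epochs add little.
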
### Evaluate the Model on the Test Set Using TensorFlow's Alias


In [19]:
test_loss, test_accuracy = model.evaluate(X_test, y_test)
# Print the loss and accuracy
print(f'Test Loss: {test_loss:.4f}, Test Accuracy: {test_accuracy:.4f}')

29/29 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.9379 - loss: 0.1612
Test Loss: 0.1542, Test Accuracy: 0.9468
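
Accuracy alone does not show whether errors are missed spam or legitimate emails flagged as spam. As one possible way to dig deeper, the sketch below thresholds sigmoid outputs at 0.5 and derives precision and recall. The labels and probabilities are toy values invented for illustration, not output from this model, and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TP, FP, TN, FN after thresholding sigmoid outputs."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted_spam = p >= threshold
        if predicted_spam and y == 1:
            tp += 1
        elif predicted_spam and y == 0:
            fp += 1
        elif not predicted_spam and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Toy labels and probabilities, invented for illustration (not model output)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.95, 0.80, 0.30, 0.10, 0.60, 0.05, 0.20, 0.70]

tp, fp, tn, fn = confusion_counts(y_true, y_prob)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # of emails flagged as spam, how many were spam
recall = tp / (tp + fn)     # of actual spam, how much was caught

print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```

With the trained model, the probabilities would come from `model.predict(X_test)` and the labels from `y_test`.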


# Reflect

Write a brief paragraph highlighting your process and the rationale to showcase critical thinking and problem-solving.

> Select the pen from the toolbar to add your entry.

# References

Hopkins, M., Reeber, E., Forman, G., Suermondt, J., 1999. Spambase. [online]. Available at: https://archive.ics.uci.edu/dataset/94. [Accessed 5 March 2024].