<a href="https://colab.research.google.com/github/sheldonkemper/portfolio/blob/main/CAM_DS_C201_Activity_3_2_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Activity 3.2.3 Experimenting with hyperparameter tuning

## Scenario
Hopkins et al. (1999) created the Spambase data set and donated it to the UCI Machine Learning Repository. The data set contains 4,601 emails, each marked as spam or non-spam by a postmaster or by individuals. Fifty-seven features (e.g. word frequencies and email characteristics) aid in classifying emails as spam. The Spambase data set is used for developing and benchmarking spam detection models, providing a base for analysing how effectively various machine learning techniques distinguish between spam and legitimate emails.

As a data professional, you have been tasked by your company with developing a neural network in TensorFlow, based on the Spambase data set, that can classify emails as spam or non-spam.



## Objective
In this portfolio activity, you’ll continue to work with the model you created in Activity 3.1.5 by applying model tuning and grid search to classify emails as spam or non-spam.

You will complete the activity in your Notebook, where you’ll:
- add an extra four layers to the model you created previously
- create a new model pipeline
- employ different batch sizes and epochs to evaluate the impact on the accuracy
- present your insights based on the performance of the model.


## Assessment criteria
By completing this activity, you will be able to provide evidence that you can critically select appropriate strategies to demonstrate expertise in model tuning techniques.


## Activity guidance
1. Continue to work on the model you created in **Activity 3.1.5**.
2. Add four hidden layers with ReLU activation, using 16 neurons for the fourth layer.
3. Compile the model with `binary_crossentropy` as loss, Adam optimiser, and print the accuracy of the model.
4. Train and evaluate the model again.
5. Jot down whether the final evaluation changed. Was there any improvement in the model? If not, train and evaluate again. Does the final evaluation change? Does it improve?
6. Create a vector of different `batch_sizes=np.array([16, 32, 64])` and `loop` through it, retraining the model each time, and print the performances. Use the same model and number of epochs.
7. Create a vector of different `epochs=np.array([10, 20, 30])` and `loop` through it, retraining the model each time and using the batch size. Jot down which model gave you the highest accuracy.

> Start your activity here. Select the pen from the toolbar to add your entry.

In [None]:
# Start your activity here:

# URL to import data set from GitHub.
url = 'https://raw.githubusercontent.com/fourthrevlxd/cam_dsb/main/spamdata.csv'

In [None]:
import pandas as pd

# The CSV has no header row, so columns are numbered 0..N-1.
data = pd.read_csv(url, header=None)
data.head()

In [None]:
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

In [None]:
from sklearn.model_selection import train_test_split

# Split into train and test sets (80% train, 20% test)
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Further split the training set into train and validation (90% train, 10% validation)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full, test_size=0.1, random_state=42)
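
The two successive splits above leave a smaller share for training than the 90% in the comment might suggest, because the 10% validation set is taken from the 80% training portion. A minimal sketch of the arithmetic (the helper name `split_fractions` is hypothetical, introduced only for illustration):

```python
def split_fractions(test_size, valid_size):
    """Return the (train, valid, test) shares of the full data set
    after a test split followed by a validation split."""
    train_full = 1.0 - test_size          # share left after holding out the test set
    valid = train_full * valid_size       # validation share is taken from that remainder
    train = train_full - valid            # what is finally used for fitting
    return train, valid, test_size


train_share, valid_share, test_share = split_fractions(test_size=0.2, valid_size=0.1)
print(f"train={train_share:.2%}, valid={valid_share:.2%}, test={test_share:.2%}")
# train=72.00%, valid=8.00%, test=20.00%
```

So roughly 72% of the emails are used for fitting, 8% for validation and 20% for the final test.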

In [None]:
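from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit the scaler on the training data only, then transform it
X_train = scaler.fit_transform(X_train)
# Reuse the training statistics for the validation and test sets
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)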
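
To illustrate why the scaler is fitted on the training data only, here is a simplified, single-column sketch of standardisation in plain Python. The values are made up, the helper names are hypothetical, and the sketch ignores the zero-variance case.

```python
from statistics import fmean, pstdev


def fit_standardiser(train_values):
    """Learn the mean and population standard deviation from training data."""
    return fmean(train_values), pstdev(train_values)


def standardise(values, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(v - mean) / std for v in values]


train_col = [0.0, 2.0, 4.0]   # made-up training values for one feature
test_col = [6.0]              # made-up unseen value

mean, std = fit_standardiser(train_col)
train_scaled = standardise(train_col, mean, std)
test_scaled = standardise(test_col, mean, std)   # reuses the training mean and std

print(train_scaled)
print(test_scaled)
```

The scaled training column has mean 0 and standard deviation 1. The test value is scaled with the same statistics, so no information from the test set leaks into preprocessing.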

In [None]:
import tensorflow as tf

model = tf.keras.Sequential()
number_neurons_1 = 64
number_neurons_2 = 32

# Hidden layer 1 (also declares the input shape: one value per feature).
model.add(tf.keras.layers.Dense(number_neurons_1, activation='relu', input_shape=(X_train.shape[1],)))
# Hidden layer 2.
model.add(tf.keras.layers.Dense(number_neurons_2, activation='relu'))
# Output layer: a single sigmoid unit giving the probability of spam.
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
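
As a rough sanity check on the size of this network, the number of trainable parameters can be worked out by hand: each dense layer has one weight per input-output pair plus one bias per output. The sketch below assumes 57 input features, as stated in the scenario; `dense_param_count` is a hypothetical helper, not part of Keras.

```python
def dense_param_count(layer_sizes):
    """Parameters per dense layer for sizes [n_inputs, hidden..., n_outputs].

    Each layer has n_in * n_out weights plus n_out biases.
    """
    return [n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]


# 57 features -> 64 -> 32 -> 1 output (57 is taken from the scenario text)
per_layer = dense_param_count([57, 64, 32, 1])
total_params = sum(per_layer)
print(per_layer, total_params)  # [3712, 2080, 33] 5825
```

Comparing this total with the figure reported by `model.summary()` is a quick way to confirm the layers were added as intended.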

In [None]:
# Compile with the Adam optimiser and binary cross-entropy loss, tracking accuracy
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])
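
For intuition about the loss used here, binary cross-entropy can be sketched in plain Python. This is an illustrative re-implementation, not the Keras code. The clipping constant `eps` is an assumption that keeps the logarithm finite.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)   # clip so log() never sees 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]                      # made-up labels: 1 = spam, 0 = non-spam
confident = [0.9, 0.1, 0.8, 0.2]           # mostly correct, confident predictions
hesitant = [0.5, 0.5, 0.5, 0.5]            # always predicts 50%

loss_confident = binary_crossentropy(labels, confident)
loss_hesitant = binary_crossentropy(labels, hesitant)
print(f"confident: {loss_confident:.4f}, hesitant: {loss_hesitant:.4f}")
```

Confident, correct predictions yield a lower loss than a model that always outputs 0.5, whose loss equals `log(2)`.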

In [None]:
history = model.fit(X_train, y_train, epochs=10, batch_size=64, validation_data=(X_valid,y_valid))
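
Batch size and epochs together determine how many weight updates the optimiser performs, which is one reason the guidance asks you to vary them. A small sketch of the arithmetic follows; the figure of 3,312 training rows is an assumption (about 72% of the 4,601 emails) and `updates_per_epoch` is a hypothetical helper.

```python
import math


def updates_per_epoch(n_samples, batch_size):
    """Number of mini-batches per epoch; the last batch may be smaller."""
    return math.ceil(n_samples / batch_size)


n_train = 3312   # assumed training-set size, not read from the data
n_epochs = 10

update_counts = {}
for batch_size in (16, 32, 64):
    per_epoch = updates_per_epoch(n_train, batch_size)
    update_counts[batch_size] = per_epoch
    print(f"batch_size={batch_size}: {per_epoch} updates per epoch, "
          f"{per_epoch * n_epochs} over {n_epochs} epochs")
```

Smaller batches give more, noisier updates per epoch, whereas larger batches give fewer, smoother ones, so the same number of epochs does not mean the same amount of optimisation.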

In [None]:
test_loss, test_accuracy = model.evaluate(X_test, y_test)
# Print the loss and accuracy
print(f'Test Loss: {test_loss:.4f}, Test Accuracy: {test_accuracy:.4f}')
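
One possible way to approach guidance steps 2 to 7 is sketched below; it is not the only valid solution. Several choices are assumptions: all four extra layers use 16 neurons (the guidance only fixes the fourth), a fresh model is built for every run so that earlier training does not carry over, and step 7 reuses the batch size that scored best in step 6. The block uses synthetic stand-in data so that it runs on its own. In the notebook, pass `X_train`, `y_train`, `X_valid` and `y_valid` instead.

```python
import numpy as np
import tensorflow as tf


def build_model(n_features, extra_units=(16, 16, 16, 16)):
    """Original 64/32 network plus four extra ReLU hidden layers.

    The widths in extra_units are an assumption for illustration.
    """
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(n_features,)))
    model.add(tf.keras.layers.Dense(64, activation='relu'))
    model.add(tf.keras.layers.Dense(32, activation='relu'))
    for units in extra_units:
        model.add(tf.keras.layers.Dense(units, activation='relu'))
    model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model


def train_and_score(X_tr, y_tr, X_va, y_va, batch_size, epochs):
    """Train a freshly initialised model and return its validation accuracy."""
    model = build_model(X_tr.shape[1])
    model.fit(X_tr, y_tr, epochs=int(epochs), batch_size=int(batch_size), verbose=0)
    _, accuracy = model.evaluate(X_va, y_va, verbose=0)
    return float(accuracy)


# Synthetic stand-in data (NOT Spambase) so the sketch is self-contained.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 57)).astype('float32')
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype('float32')
X_tr, X_va = X_demo[:160], X_demo[160:]
y_tr, y_va = y_demo[:160], y_demo[160:]

# Step 6: vary the batch size, keeping the number of epochs fixed.
batch_sizes = np.array([16, 32, 64])
batch_results = {}
for batch_size in batch_sizes:
    acc = train_and_score(X_tr, y_tr, X_va, y_va, batch_size, epochs=10)
    batch_results[int(batch_size)] = acc
    print(f"batch_size={int(batch_size)}: validation accuracy={acc:.4f}")

# Assumption: carry the best-scoring batch size into step 7.
best_batch_size = max(batch_results, key=batch_results.get)

# Step 7: vary the number of epochs at that batch size.
epochs = np.array([10, 20, 30])
epoch_results = {}
for n_epochs in epochs:
    acc = train_and_score(X_tr, y_tr, X_va, y_va, best_batch_size, n_epochs)
    epoch_results[int(n_epochs)] = acc
    print(f"epochs={int(n_epochs)}: validation accuracy={acc:.4f}")
```

Scoring on the validation set keeps the test set untouched until a final configuration has been chosen. Because weight initialisation and shuffling are random, accuracies will differ slightly between runs, which is worth bearing in mind when comparing settings.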

# Reflect

Write a brief paragraph highlighting your process and the rationale to showcase critical thinking and problem-solving.

> Select the pen from the toolbar to add your entry.

# References

Hopkins, M., Reeber, E., Forman, G., Suermondt, J., 1999. Spambase. [online]. Available at: https://archive.ics.uci.edu/dataset/94. [Accessed 5 March 2024].