# CMSC471 - Artificial Intelligence - Spring 2020
## Instructor: Fereydoon Vafaei
# <font color="blue"> Assignment 5: Classification and Regression Using NN in TensorFlow</font>

*Type your name and ID here*

## Overview and Learning Objectives

In Part I of this assignment, you are going to build a neural network for binary classification. 

In Part II, you will build a NN for regression.

<b>Note: </b>As you work through this assignment, consult the textbook examples, the course notebooks, and the TensorFlow documentation.

Pedagogically, this assignment will help you:
- better understand classification and regression using Neural Networks.

- practice implementing NNs in TensorFlow and Keras.

## Part I - Classification Using NN in TensorFlow

You're going to build a binary classifier NN to predict disease, i.e. "diagnosis".

First, download [the breast cancer dataset](https://github.com/fereydoonvafaei/CMSC471-Spring2020/blob/master/Assignment-5/breast_cancer.csv) and save it in the same working directory as your notebook.

Read the feature specifications on the [Kaggle page](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/data) to learn more about the data.

**NOTE:** As you work through the notebook, keep adding any required modules from Python, TensorFlow and Keras to the following cell.

## <font color="red"> Required Coding</font>

In [1]:
# Import necessary modules from python, tensorflow and keras
# NOTE: As you work through the notebook, keep adding any required module here if necessary
...

import warnings
warnings.filterwarnings("ignore")

In [2]:
print("tf Version: ", tf.__version__)
print("Eager Execution mode: ", tf.executing_eagerly())

tf Version:  2.1.0
Eager Execution mode:  True


> Next, load the data with pandas. The data (csv file) should be stored in the same working directory as your notebook.

In [3]:
# Load dataset using pd
...

# Show the first five rows
data.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


### Preprocessing

> Check whether the data contains any null or NA values.

In [1]:
print(data.isnull().sum())
data.isna().sum()

> Also, the first column `id` is only a record identifier (the next cell shows that it has 569 unique values) and provides no useful information to the ML model, so drop it.

In [6]:
len(data['id'].unique())

569

In [7]:
# drop "id"
...
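
A minimal sketch of these two preprocessing steps on a toy dataframe. The frame `toy` and its values are made up for illustration and are not part of the assignment data:

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, None],
})

# Count missing values per column (isnull() and isna() are aliases)
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop a column that carries no predictive information
toy_no_id = toy.drop(columns=["id"])
print(toy_no_id.columns.tolist())
```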

> Now, you can extract features and labels from `data`. Your classifier should attempt to predict `diagnosis` so that is your target/label column.

In [24]:
# Organize data to feature vector X and label vector y
X = ...
y = ...

In [25]:
print("Features shape: ", X.shape)
print("Labels shape: ", y.shape)

Features shape:  (569, 30)
Labels shape:  (569,)


> Your `X` dataframe now contains only features and hence has 30 columns, whereas `y` is a 1D vector containing only labels. Notice that `y` has 569 labels, one for each of the 569 records in `X`.

In [26]:
# X should no longer contain the diagnosis which is target/label column - i.e. the column to be predicted
X.head()

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [27]:
# y should only contain diagnosis - target/label column
y.head()

0    M
1    M
2    M
3    M
4    M
Name: diagnosis, dtype: object

In [28]:
y.unique()

array(['M', 'B'], dtype=object)

> The two classes (aka labels) here are `M` and `B`, representing the `malignant` and `benign` tumors you are going to classify. You need to represent them by `1` and `0` respectively. In other words, to use sklearn classifiers and score metrics, you need to convert the categorical labels to numeric ones.

In [29]:
# Encoding categorical labels M and B to 1 and 0
from sklearn.preprocessing import LabelEncoder
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
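
`LabelEncoder` assigns integer codes in sorted order of the class names, which is why `B` becomes `0` and `M` becomes `1` here. A small check on made-up labels (the variable names below are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

toy_labels = np.array(["M", "B", "B", "M", "B"])  # hypothetical labels

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

# classes_ lists the original labels in the order of their integer codes
print(encoder.classes_)
print(encoded)

# inverse_transform maps the integer codes back to the original labels
decoded = encoder.inverse_transform(encoded)
print(decoded)
```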

> When you have multiple features with different scales/ranges, you should consider standardizing them. There are different ways to standardize and to normalize the feature vector. One convenient way is using scikit-learn modules.

In [30]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)
X

array([[ 1.09706398, -2.07333501,  1.26993369, ...,  2.29607613,
         2.75062224,  1.93701461],
       [ 1.82982061, -0.35363241,  1.68595471, ...,  1.0870843 ,
        -0.24388967,  0.28118999],
       [ 1.57988811,  0.45618695,  1.56650313, ...,  1.95500035,
         1.152255  ,  0.20139121],
       ...,
       [ 0.70228425,  2.0455738 ,  0.67267578, ...,  0.41406869,
        -1.10454895, -0.31840916],
       [ 1.83834103,  2.33645719,  1.98252415, ...,  2.28998549,
         1.91908301,  2.21963528],
       [-1.80840125,  1.22179204, -1.81438851, ..., -1.74506282,
        -0.04813821, -0.75120669]])
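
Standardization subtracts each column's mean and divides by its standard deviation. The sketch below reproduces that computation with plain numpy on made-up numbers; it uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses:

```python
import numpy as np

# Hypothetical features on very different scales
toy_X = np.array([[17.99, 1001.0],
                  [20.57, 1326.0],
                  [11.42,  386.1],
                  [13.54,  566.3]])

col_mean = toy_X.mean(axis=0)
col_std = toy_X.std(axis=0)  # ddof=0, i.e. population standard deviation

toy_X_scaled = (toy_X - col_mean) / col_std

# After scaling, every column has mean close to 0 and std close to 1
print(toy_X_scaled.mean(axis=0))
print(toy_X_scaled.std(axis=0))
```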

In [31]:
# Split the data to training set and testing set
...

# Check the shapes of X_train, X_test, y_train, y_test
print("X_train shape: ", X_train.shape)
print("y_train shape: ", y_train.shape)
print("X_test shape: ", X_test.shape)
print("y_test shape: ", y_test.shape)

X_train shape:  (426, 30)
y_train shape:  (426,)
X_test shape:  (143, 30)
y_test shape:  (143,)
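
The shapes above are consistent with scikit-learn's default 75/25 split. A sketch with `train_test_split` on placeholder arrays of the same size; `random_state=42` is an arbitrary choice, and your own split may use different arguments:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the scaled features and labels
demo_X = np.zeros((569, 30))
demo_y = np.arange(569) % 2

# With no test_size given, 25% of the rows (rounded up) go to the test set
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, random_state=42
)

print(demo_X_train.shape, demo_X_test.shape)
print(demo_y_train.shape, demo_y_test.shape)
```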


### Building NN for Binary Classification

> Now, you should build a binary classifier NN that can predict diagnosis.

> You can begin with a simple neural network with a couple of hidden layers, and increase the number of hidden layers and neurons as needed. You may also use a callback with early stopping to find the optimal number of epochs, but it is possible to reach the minimum required accuracy (0.97) within only 20 epochs.

> **Hint-1**: During training, despite some variations, you should see a clear trend of descending loss and increasing accuracy; otherwise, your model has not been developed properly.

> **Hint-2**: Every time you want to retrain your network, rerun the following cell first so that you start with a fresh NN and clear the previously trained weights. Otherwise, training continues from the old weights and your results will not be accurate.

In [169]:
# Build a sequential NN with appropriate layers for binary classification of diagnosis
# Use ReLU for all hidden layers

# Hint1: input_dim of the first layer should match with the number of features in X_train

# Hint2: Notice that the activation function and number of neurons in the output layer are determined
# by the type of ML task i.e. Binary Classification
nn_clf = tf.keras.Sequential([
    # Add layers accordingly
    ...
    ]) 

In [2]:
nn_clf.summary()

> Next, you should compile your `nn_clf`.

In [171]:
# Compile nn_clf with loss='binary_crossentropy' and metrics=['accuracy']
# Hint1: One of the hyperparameters you can change is the optimizer (Adam, RMSprop, SGD, ...)
# Hint2: The other impactful hyperparameter is learning_rate,
# initially set it to 0.001 and fine-tune accordingly
...
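
For reference, the `binary_crossentropy` loss named in the cell above averages `-(y*log(p) + (1-y)*log(1-p))` over the samples, where `p` is the predicted probability of class `1`. A numpy sketch on made-up predictions; the clipping constant `eps` is an assumption added for numerical safety:

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1.0 - y_true) * np.log(1.0 - y_prob)))

toy_true = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]  # mostly correct predictions
uncertain = [0.5, 0.5, 0.5, 0.5]  # coin-flip predictions

loss_confident = binary_crossentropy(toy_true, confident)
loss_uncertain = binary_crossentropy(toy_true, uncertain)
print(loss_confident, loss_uncertain)
```

Confident, correct predictions give a lower loss than coin-flip predictions, which is why the loss should fall as training improves the model.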

> Train the model using the `fit()` method.

In [3]:
# Train nn_clf on X_train and y_train with 20 epochs
nn_clf_history = ...
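
The optional early stopping mentioned earlier can be sketched as follows. This is an illustration on random synthetic data, not a reference solution: the layer sizes, `patience=3`, `validation_split=0.2` and the name `demo_clf` are all assumptions chosen for the example.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in data: 200 samples, 30 features, binary labels
rng = np.random.RandomState(0)
demo_X = rng.normal(size=(200, 30)).astype("float32")
demo_y = (demo_X[:, 0] + demo_X[:, 1] > 0).astype("float32")

# Reset Keras global state; the model below is rebuilt from scratch
tf.keras.backend.clear_session()

demo_clf = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_dim=30),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # one probability output
])
demo_clf.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# Stop when validation loss has not improved for 3 consecutive epochs
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)
demo_history = demo_clf.fit(
    demo_X, demo_y,
    epochs=20,
    validation_split=0.2,
    callbacks=[early_stop],
    verbose=0,
)
print("epochs run:", len(demo_history.history["loss"]))
```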

> Next, plot the training history.

In [4]:
pd.DataFrame(nn_clf_history.history).plot(figsize=(10, 5))
plt.grid(True)

# Set the xticks - label locations
plt.xticks(np.arange(0, 20, step=2))  

# set the y-axis range to [0-1]
plt.gca().set_ylim(0, 1) 

> To evaluate the model, use the `evaluate()` method.

> <font color='red'>**Minimum Accuracy Requirement**</font>: Your accuracy on `X_test` and `y_test` must be at least **0.97**. Otherwise, your notebook will get NO CREDIT for this part, so you must fine-tune your `nn_clf` accordingly.

In [5]:
# Evaluate the model on X_test and y_test
loss, accuracy = ...

In [175]:
# Minimum Required Accuracy: 0.97
round(accuracy, 2)

0.98

## Part II - Regression Using NN in TensorFlow

In this part, you will build a regression model that predicts video game sales using a NN.

You can download the dataset directly from [here](https://github.com/fereydoonvafaei/CMSC471-Spring2020/blob/master/Assignment-5/video-games.csv).

The description of the dataset you're going to work on can be seen [here](https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings). You are going to use the features to predict video game sales in Europe `EU_Sales`.

Follow the instructions for loading and preprocessing the data very carefully. Even though preprocessing is worth only 10 points, if you don't do it correctly, all of your results will be wrong.

## <font color="red"> Required Coding</font>

In [6]:
# Load the data as a dataframe using pandas
game_data = ...
print(game_data.shape)
game_data.head()

In [7]:
# Drop NAs
...
print(game_data.shape)
game_data.head()

In [8]:
# Drop "Name" column as it does not provide any useful info
...
print(game_data.shape)
game_data.head()

In [9]:
# Drop "Global_Sales" column as it is a redundant feature - it's just the sum of regional and other sales
...
print(game_data.shape)
game_data.head()

In [45]:
# Get feature vector X_reg (all columns but "EU_Sales") and target label y_reg as "EU_Sales"
X_reg = ...
y_reg = ...

In [10]:
# Print X_reg shape and head
print(X_reg.shape)
X_reg.head()

In [11]:
# Using pandas.get_dummies() create dummy variables for categorical features of X_reg
...
print(X_reg.shape)
X_reg.head()
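
`pandas.get_dummies()` replaces each categorical column with one indicator column per category and leaves numeric columns unchanged. A toy example (the frame below is made up; only the column names echo the dataset):

```python
import pandas as pd

toy_games = pd.DataFrame({
    "Critic_Score": [76.0, 82.0, 80.0],
    "Platform": ["Wii", "DS", "Wii"],
    "Rating": ["E", "E", "T"],
})

# String columns are expanded into indicators; numeric columns are kept as they are
toy_encoded = pd.get_dummies(toy_games)
print(toy_encoded.columns.tolist())
print(toy_encoded.shape)
```

Each new column is named `<original column>_<category>`, so columns with many distinct values, such as `Developer`, contribute many indicator columns to the final feature count.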

> <b>Note:</b> The output of the following cells is provided to you for your reference. All the following cells depend on the correctness of your preprocessing steps and can be verified by these outputs.

In [48]:
# Normalize X_reg using mean() and std()  NOTE: The output is provided for your reference.
...
print(X_reg.shape)
X_reg.head()

(6825, 1683)


Unnamed: 0,Year_of_Release,NA_Sales,JP_Sales,Other_Sales,Critic_Score,Critic_Count,User_Count,Platform_3DS,Platform_DC,Platform_DS,...,Developer_odenis studio,Developer_syn Sophia,Developer_zSlide,Rating_AO,Rating_E,Rating_E10+,Rating_K-A,Rating_M,Rating_RP,Rating_T
0,-0.341176,42.346639,12.886767,31.004904,0.413014,1.147975,0.250716,-0.15243,-0.045334,-0.270063,...,-0.012105,-0.01712,-0.012105,-0.012105,1.509226,-0.397162,-0.012105,-0.515485,-0.012105,-0.730971
2,0.133743,15.800856,12.956315,11.884655,0.845647,2.292368,0.909519,-0.15243,-0.045334,-0.270063,...,-0.012105,-0.01712,-0.012105,-0.012105,1.509226,-0.397162,-0.012105,-0.515485,-0.012105,-0.730971
3,0.371202,15.728496,11.182831,10.624793,0.701436,2.292368,0.029412,-0.15243,-0.045334,-0.270063,...,-0.012105,-0.01712,-0.012105,-0.012105,1.509226,-0.397162,-0.012105,-0.515485,-0.012105,-0.730971
6,-0.341176,11.252514,22.380122,10.36541,1.350385,1.876225,0.43627,-0.15243,-0.045334,3.702302,...,-0.012105,-0.01712,-0.012105,-0.012105,1.509226,-0.397162,-0.012105,-0.515485,-0.012105,-0.730971
7,-0.341176,14.022868,9.965734,10.217191,-0.884885,0.627797,-0.077835,-0.15243,-0.045334,-0.270063,...,-0.012105,-0.01712,-0.012105,-0.012105,1.509226,-0.397162,-0.012105,-0.515485,-0.012105,-0.730971
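
Normalizing with `mean()` and `std()` means subtracting each column's mean and dividing by its standard deviation. A sketch on a made-up frame; note that pandas' `std()` uses the sample standard deviation (`ddof=1`) by default, so the result differs slightly from `StandardScaler`:

```python
import pandas as pd

# Hypothetical numeric features
toy_features = pd.DataFrame({
    "Critic_Score": [76.0, 82.0, 80.0, 89.0],
    "User_Count": [322.0, 709.0, 192.0, 431.0],
})

# Column-wise z-score: pandas aligns the per-column mean and std by column name
toy_normalized = (toy_features - toy_features.mean()) / toy_features.std()

print(toy_normalized.mean())  # close to 0 for every column
print(toy_normalized.std())   # close to 1 for every column
```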


In [50]:
# Split the data to train and test with ratio of 80/20 for train/test respectively
X_reg_train, X_reg_test, y_reg_train, y_reg_test = ...
print(X_reg_train.shape)
print(y_reg_train.shape)
print(X_reg_test.shape)
print(y_reg_test.shape)

(5460, 1683)
(5460,)
(1365, 1683)
(1365,)


### Building NN for Regression

> Try different architectures (different numbers of hidden layers and neurons) to reach the desired loss. You may start with a couple of hidden layers and a few neurons, and add more until the loss falls below the maximum acceptable value.

In [63]:
# Build a sequential NN with appropriate layers for regression to predict EU_Sales
# Use ReLU for all hidden layers

# Hint1: input_dim of the first layer should match with the number of features in X_reg_train
# Hint2: Notice that the activation function and number of neurons in the output layer are determined
# by the type of ML task i.e. Regression

nn_reg = tf.keras.Sequential([
    # Add layers accordingly
    
    ]) 

In [12]:
nn_reg.summary()

In [65]:
# Compile nn_reg: set both loss and metric to 'mse',
# and set the optimizer to 'RMSprop' with a learning rate of 0.001

# For this regression task, loss and metrics are the same
...
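
Mean squared error is the average of the squared differences between targets and predictions. A short numpy sketch with made-up sales figures:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

toy_sales = [0.5, 1.2, 0.0, 3.0]        # hypothetical targets
toy_predictions = [0.4, 1.0, 0.1, 2.5]  # hypothetical model outputs

toy_mse = mean_squared_error(toy_sales, toy_predictions)
print(toy_mse)
```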

In [13]:
# Fit the network on X_reg_train and y_reg_train with 20 epochs
nn_reg_history = ...

In [15]:
pd.DataFrame(nn_reg_history.history).plot(figsize=(10, 5))
plt.grid(True)

# Set the xticks - label locations
plt.xticks(np.arange(0, 20, step=2))  

# set the y-axis range to [0-1]
plt.gca().set_ylim(0, 1) 

> <font color='red'>**Maximum Acceptable MSE Loss Requirement**</font>: The MSE loss of your model evaluated on `X_reg_test` and `y_reg_test` should not exceed **0.20**. Otherwise, your notebook will get NO CREDIT for this part, so you must fine-tune your `nn_reg` accordingly.

In [16]:
# Evaluate the model on X_reg_test, y_reg_test
# evaluate() returns [loss, metric]; both are MSE for this model
loss, mse = nn_reg.evaluate(X_reg_test, y_reg_test)

In [70]:
# Maximum acceptable mse loss: 0.20
round(mse, 2)

0.19

## Grading

Assignment-5 has a maximum of 100 points. Make sure that you get the correct outputs and plots for all cells that you implement and give complete answers to all questions. Also, your notebook should be free of grammatical and spelling errors, nicely formatted, and easy to read.

Notice that even though preprocessing is worth only 10 points, if your preprocessing steps are not correct, your notebook will get ZERO because all the results will be wrong, so be very careful with preprocessing.

The breakdown of the 100 points is as follows:

- Part I Classification on diagnosis: [total 50 points]
    - Preprocessing: 10 points
    - Implementation of nn_clf: 40 points - **Minimum Required Accuracy**: 0.97 otherwise ZERO CREDIT!
    
- Part II Regression on EU_Sales: [total 50 points]
    - Preprocessing: 10 points
    - Implementation of nn_reg: 40 points - **Maximum Acceptable MSE Loss**: 0.20 otherwise ZERO CREDIT!
   

Follow the instructions of each section carefully. Up to 10 points may be deducted if your submitted notebook is not easy to read and follow or if it has grammatical, spelling or formatting issues.

## Submission

Name your notebook ```Lastname-A5.ipynb```. Submit the file using the ```Assignment-5``` link on Blackboard.

Grading will be based on 

  * correct implementation, correct results and plots, correct answers to the questions, and
  * readability of the notebook.
  
<font color=red><b>Due Date: Tuesday May 12th, 11:59PM.</b></font>