## SI 670 Applied Machine Learning, Homework 7: Deep Learning

For this assignment, question 1 is worth 50 points, and question 2 is worth 40 points, for a total of 90 points. Correct answers and code receive full credit, but partial credit will be awarded if you have the right idea even if your final answers aren't quite right.

Submit your completed notebook file AND corresponding **HTML** file to the Canvas site.

As a reminder, the notebook code you submit must be your own work. Feel free to discuss general approaches to the homework with classmates: if you end up forming more of a team discussion on multiple questions, please include the names of the people you worked with at the top of your notebook file.

### Put your name here: `Yijing Chen`

### Put your uniquename here: `yijingch`

### Question 1 Comparing ML with DL (50 points)

In this question, we are still exploring classifying the IMDB movie data set as we did in the lab.   You will use the different classifiers you learned in this course: (1) LinearSVC; (2) RandomForestClassifier; (3) Deep learning. 

#### Preprocessing

In [1]:
from keras.datasets import imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set specific indices of results[i] to 1s
    return results

# Our vectorized training data
X_train = vectorize_sequences(train_data)
# Our vectorized test data
X_test = vectorize_sequences(test_data)

# Our vectorized labels
y_train = np.asarray(train_labels).astype("float32")
y_test = np.asarray(test_labels).astype("float32")

In [5]:
X_train.shape
# y_train.shape

(25000, 10000)

#### Question 1(a) (10 points)
Please use LinearSVC to train the model and return the mean accuracy on the given test data and labels. You can use the default parameteers in LinearSVC.

In [3]:
def answer_one_a():
    from sklearn.svm import LinearSVC
    
    clf = LinearSVC(max_iter=2000)
    clf.fit(X_train, y_train)

    score_train = clf.score(X_train, y_train)
    score_test = clf.score(X_test, y_test)

    print("accuracy on the training set", score_train)
    print("accuracy on the test set", score_test)

answer_one_a()

# max_iter = 2000, seems to overfit
# also received ConvergenceWarning: Liblinear failed to converge
# accuracy on the training set 0.99996
# accuracy on the test set 0.83572



accuracy on the training set 0.99996
accuracy on the test set 0.83572


#### Question 1(b) (10 points)
Please use RandomForestClassifier (with random_state = 0) to train the model and return the mean accuracy on the given test data and labels. 

In [4]:
def answer_one_b():
    from sklearn.ensemble import RandomForestClassifier
    
    clf = RandomForestClassifier(random_state=0)
    clf.fit(X_train, y_train)

    score_train = clf.score(X_train, y_train)
    score_test = clf.score(X_test, y_test)

    print("accuracy on the training set", score_train)
    print("accuracy on the test set", score_test)

answer_one_b()

# default parameter: (also overfitting)
# accuracy on the training set 1.0
# accuracy on the test set 0.84384

accuracy on the training set 1.0
accuracy on the test set 0.84384


#### Question 1(c) (20 points)

Please use the below architecture of the dense layers to design your model:
one intermediate layers with 32 hidden units, 
and a second layer which will output the scalar prediction regarding the sentiment of the current review. 

The intermediate layer will use `relu` as its "activation function", 
and the final layer will use a sigmoid activation so as to output a probability 
(a score between 0 and 1, indicating how likely the sample is to have the target "1", i.e. how likely the review is to be positive). 
A `relu` (rectified linear unit) is a function meant to zero-out negative values, 
while a sigmoid "squashes" arbitrary values into the `[0, 1]` interval, thus outputting something that can be interpreted as a probability.

We configure our model with the `rmsprop` optimizer and the `binary_crossentropy` loss function as we did in the lab. Note that we will 
also monitor accuracy during training.

For model fitting, we train our model for 4 epochs (4 iterations over all samples in the x_train and y_train tensors), in mini-batches of 512 samples.

Please return the testing accuracy. 

In [5]:
# X_train.shape

In [6]:
def answer_one_c():
    from keras import models
    from keras import layers

    model = models.Sequential()
    model.add(layers.Dense(32, activation="relu", input_shape=(10000,))) # the 1st layer with 32 hidden units, use "relu" to activate
    model.add(layers.Dense(1, activation="sigmoid")) # the 2nd layer using sigmoid activation

    model.compile(optimizer="rmsprop",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])

    model.fit(X_train, y_train, epochs=4, batch_size=512)
    results = model.evaluate(X_test, y_test)
    return results

answer_one_c()

# result 1:
# Epoch 1/4
# 49/49 [==============================] - 2s 32ms/step - loss: 0.4174 - accuracy: 0.8252
# Epoch 2/4
# 49/49 [==============================] - 2s 32ms/step - loss: 0.2484 - accuracy: 0.9119
# Epoch 3/4
# 49/49 [==============================] - 2s 32ms/step - loss: 0.1975 - accuracy: 0.9294
# Epoch 4/4
# 49/49 [==============================] - 2s 32ms/step - loss: 0.1696 - accuracy: 0.9408
# 782/782 [==============================] - 2s 2ms/step - loss: 0.3061 - accuracy: 0.8771
# [0.3060723543167114, 0.8771200180053711]

# result 2:
# Epoch 1/4
# 49/49 [==============================] - 2s 34ms/step - loss: 0.4210 - accuracy: 0.8264
# Epoch 2/4
# 49/49 [==============================] - 2s 32ms/step - loss: 0.2492 - accuracy: 0.9127
# Epoch 3/4
# 49/49 [==============================] - 2s 33ms/step - loss: 0.1976 - accuracy: 0.9314
# Epoch 4/4
# 49/49 [==============================] - 2s 33ms/step - loss: 0.1683 - accuracy: 0.9422
# 782/782 [==============================] - 2s 3ms/step - loss: 0.2934 - accuracy: 0.8834
# [0.29340335726737976, 0.8834400177001953]

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


[0.2880520522594452, 0.884440004825592]

#### Question 1(d) Open Question (10 points)
Can you conclude that deep learning is better than the classic ML models on this task? If so, what do you think that helps Deep Learning perform better? If not, tells us your reasons. 

Deep learning models apparantly achieved better results than classic ML models. I think one of the advantages of deep learning models lies in the multi-layer architecture (the neural network) that learns the feature automatically. 

(In addition, because of the SGD training process, I noticed that with 4 epochs, the deep learning model (sometimes) does not reach an optimal point where the loss is minimized and the accuracy is maximized, which could be the reason why the current improvements on accuracy scores is not as impressive as it should be.)


## Question 2 Hyper-parameteer tunning in DL (40 points)

We have shown you how to tune parameters such as training epoch in the lab. 

In this question, we are exploring the hyper-parameteer tuning in deep learning from the perspective of the size of the network. 

First, we divide some part of training data into valiadation data. 

In [7]:
x_val = X_train[:10000]
partial_x_train = X_train[10000:]

y_val = y_train[:10000]
partial_y_train = y_train[10000:]

### Question 2(a): (30 points)

Follow the IMDB classification question in Question 1, we hope you design and test different neural network. You shall vary the first intermediate layer with [3, 6, 9, 12] hidden units, and the second intermediate layer with [2, 4, 6, 8] hidden units. 

The last layer which will output the scalar prediction regarding the sentiment of the current review. 

Apply the different models onto the valiadation set and return the best model with highest accuracy on the valiadation. You should three numbers, which represent the best model's number of hideen units in first, second intermediate layer, and the best accuracy. 

Fit the model by using the default setting in 1(c). (epochs = 4, batch_sze = 512, optimizer = rmsprop, loss = binary_crossentropy, metrics = 'accuracy')

In [9]:
def answer_two_a():
    from keras import models
    from keras import layers

    best_first, best_second, best_res = 0, 0, 0

    score_dict = {}
    for first_layer in [3, 6, 9, 12]:
        for second_layer in [2, 4, 6, 8]:
            print("hidden units in the 1st layer:", first_layer)
            print("hidden units in the 2nd layer:", second_layer)
            model = models.Sequential()
            model.add(layers.Dense(first_layer, activation="relu", input_shape=(10000,)))
            model.add(layers.Dense(second_layer, activation="relu"))
            model.add(layers.Dense(1, activation="sigmoid"))



            model.compile(optimizer='rmsprop',
                      loss='binary_crossentropy',
                      metrics=['accuracy'])

            model.fit(partial_x_train, partial_y_train, epochs=4, batch_size=512)

            score = model.evaluate(X_test, y_test)[1]
            score_dict[(first_layer, second_layer)] = score
            print("  accuracy:", score)

    best_first, best_second = max(score_dict, key=score_dict.get)
    best_res = score_dict[(best_first, best_second)]
    print("!! best performing model !!")
    print("hidden units in the 1st layer:", best_first)
    print("hidden units in the 2nd layer:", second_layer)
    print("  accuracy:", best_res)

    return best_first, best_second, best_res

answer_two_a()

# first run:
# (6, 8, 0.8838000297546387)

# second run:
# (12, 6, 0.8835999965667725)

# third run:
# (9, 8, 0.8841599822044373)

# these best performing models do not have the most complex neural network.

hidden units in the 1st layer: 3
hidden units in the 2nd layer: 2
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
  accuracy: 0.830839991569519
hidden units in the 1st layer: 3
hidden units in the 2nd layer: 4
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
  accuracy: 0.8225200176239014
hidden units in the 1st layer: 3
hidden units in the 2nd layer: 6
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
  accuracy: 0.847760021686554
hidden units in the 1st layer: 3
hidden units in the 2nd layer: 8
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
  accuracy: 0.8689600229263306
hidden units in the 1st layer: 6
hidden units in the 2nd layer: 2
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
  accuracy: 0.7535200119018555
hidden units in the 1st layer: 6
hidden units in the 2nd layer: 4
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
  accuracy: 0.8673999905586243
hidden units in the 1st layer: 6
hidden units in the 2nd layer: 6
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
  accuracy: 0.8812800049781799
hidden units in the 1st layer: 6
hidden uni

(9, 8, 0.8841599822044373)

#### Question 2(b): (10 points)

According to the performance we observed for three models? Which design is best for this task? 

What take-away do you get from this question? Does it mean larger network improve the model's performance?

Based on the performance, the best performing model with an accuracy of 0.88 does not (necessarily) have the most complex neural network. I ran the program for several times and because of the stochasticity involved in the process, the best performing model is not fixed (see comments in the last code block). But, the take-away is that larger network does not always guarantee a better performance. 