In this assignment, we're going to use a new dataset of tweets with sentiment labels to train a neural-network-based sentiment analyzer.

---

*Fill in the missing code in the relevant sections below; missing code is indicated by \<FILL_CODE>.*

*IMPORTANT: Make sure you include the outputs or printouts for every cell in the .ipynb file that you upload.*

## Import & Pre-process data

### 1. Load the data, preprocess & tokenize as appropriate for tweets (2 points)

* Load data from the twitter_sentiment_train_150k.txt file into a dataframe with columns - 'label' and 'tweet'. 
* Put preprocessed text into column named 'tweet_cleaned' and tokens into 'tweet_tokens'
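
Tweets often contain URLs, @mentions, and hashtags, which a letters-only filter turns into stray tokens. The following is a minimal sketch of tweet-specific cleaning using only the standard library. The function name `clean_tweet` and the exact patterns are illustrative assumptions, not part of the assignment's starter code.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions, and non-letter characters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)            # drop links
    text = MENTION_RE.sub(" ", text)        # drop user handles
    text = text.replace("#", " ")           # keep the hashtag word, drop the symbol
    text = re.sub(r"[^a-z\s]", " ", text)   # keep letters and whitespace only
    return " ".join(text.split())           # collapse repeated whitespace

print(clean_tweet("@bob Loving the #NLP course!! http://t.co/xyz"))
```

Removing links and handles before the letters-only filter keeps fragments such as `http` or `bob` out of the vocabulary.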

In [None]:
import pandas as pd

twitter_training = pd.read_csv('./twitter_sentiment_train_150k.txt', sep='\t', header=None, names=['label', 'tweet'])

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_tweet(tweet):
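    # keep letters only, then lowercase so capitalized stop words ("The", "And") are removed too
    tweet = re.sub(r'[^a-zA-Z]', ' ', tweet).lower()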
    tweet_tokens = word_tokenize(tweet)
    tweet_tokens = [word for word in tweet_tokens if word not in stop_words]
    tweet_tokens = [lemmatizer.lemmatize(word) for word in tweet_tokens]
    return tweet_tokens

twitter_training['tweet_cleaned'] = twitter_training['tweet'].apply(preprocess_tweet)
twitter_training['tweet_tokens'] = twitter_training['tweet_cleaned'].apply(lambda x: ' '.join(x))

print(twitter_training['tweet_tokens'].head(10))

### 2. Load & split your testing data into validation (1/3) & testing (2/3) datasets (1 point)

* Similarly load test data from the twitter_sentiment_test_63k.txt file
* Pre-process and tokenize the testing data as well
* Split testing data into testing and validation data 
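
The idea behind the split can be sketched without pandas: shuffle the row indices once, then cut them into two disjoint groups. The helper `split_indices` below is a hypothetical illustration, not a pandas or scikit-learn function.

```python
import random

def split_indices(n_rows, test_frac=2/3, seed=0):
    """Shuffle row indices and split them into disjoint test and validation sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded so the split is reproducible
    n_test = round(n_rows * test_frac)
    return indices[:n_test], indices[n_test:]

test_idx, val_idx = split_indices(9)
print(len(test_idx), len(val_idx))
```

Because both groups come from one shuffled list, no row can end up in both the test and validation sets.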

In [None]:

twitter_t_and_v = pd.read_csv('./twitter_sentiment_test_62k.txt', sep='\t', header=None, names=['label', 'tweet'])

twitter_t_and_v['tweet_cleaned'] = twitter_t_and_v['tweet'].apply(preprocess_tweet)
twitter_t_and_v['tweet_tokens'] = twitter_t_and_v['tweet_cleaned'].apply(lambda x: ' '.join(x))


test_data = twitter_t_and_v.sample(frac=0.6667, random_state=0)
validation_data = twitter_t_and_v.drop(test_data.index)

print(test_data.shape)
print(validation_data.shape)

## Vectorize, setup model & train

### 3. Vectorize using a count vectorizer (1 point)

Use the Tokenizer and text_to_word_sequence functions from keras to convert the text in the 'tweet_cleaned' column from all 3 datasets into tokens and labels
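
As a rough mental model, a word-level tokenizer ranks words by frequency and replaces each word with its rank. The sketch below is a simplified, standalone approximation of that behaviour (no lowercasing or punctuation filtering). `build_word_index` and `encode_texts` are hypothetical helper names, not Keras functions.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer rank, 1 = most frequent (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode_texts(texts, word_index, num_words=None):
    """Replace words by their index, dropping unknown words and ranks >= num_words."""
    sequences = []
    for text in texts:
        seq = [word_index[w] for w in text.split() if w in word_index]
        if num_words is not None:
            seq = [i for i in seq if i < num_words]
        sequences.append(seq)
    return sequences

corpus = ["good movie", "good day", "bad day good"]
word_index = build_word_index(corpus)
print(word_index)
print(encode_texts(["good bad unknown"], word_index))
```

Fitting the index on the training data only, then reusing it for validation and test data, keeps information from the held-out sets out of the vocabulary.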

In [None]:
MAX_FEATURES = 5000
MAX_LENGTH = 100

from keras_preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=MAX_FEATURES)
tokenizer.fit_on_texts(twitter_training['tweet_tokens'])

train_sequences = tokenizer.texts_to_sequences(twitter_training['tweet_tokens'])
test_sequences = tokenizer.texts_to_sequences(test_data['tweet_tokens'])
val_sequences = tokenizer.texts_to_sequences(validation_data['tweet_tokens'])

train_texts = pad_sequences(train_sequences, maxlen=MAX_LENGTH)
test_texts = pad_sequences(test_sequences, maxlen=MAX_LENGTH)
val_texts = pad_sequences(val_sequences, maxlen=MAX_LENGTH)

train_labels = twitter_training['label']
test_labels = test_data['label']
val_labels = validation_data['label']

print(len(train_texts))
print(len(train_labels))

print(len(test_texts))
print(len(test_labels))

print(len(val_texts))
print(len(val_labels))

In [None]:
from keras.utils import pad_sequences

MAX_LENGTH = max(len(train_ex) for train_ex in train_texts)
train_texts = pad_sequences(train_texts, maxlen=MAX_LENGTH)
test_texts = pad_sequences(test_texts, maxlen=MAX_LENGTH)
validation_texts = pad_sequences(val_texts, maxlen=MAX_LENGTH)

### 4. Why do you have to pad sequences? Explain (1 point)

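Answer: The network processes tweets in batches, and each batch must be a tensor of fixed shape, so every sequence needs the same length. Padding fills shorter sequences with zeros (and longer ones are truncated) so that all sequences have length MAX_LENGTH.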
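
A minimal pure-Python sketch of padding, assuming zeros are added at the front and overlong sequences lose their earliest items (the defaults of Keras `pad_sequences`). `pad_left` is a hypothetical helper written for illustration.

```python
def pad_left(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:] if maxlen > 0 else []   # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_left([[5, 2], [7, 1, 4, 9, 3]], maxlen=4))
```

Every row of the result has exactly `maxlen` entries, so the rows can be stacked into one rectangular array.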

### We're going to use a neural network with the following architecture for this part of the assignment


In [None]:
from keras import layers, Input, Model

sequences = Input(shape=(MAX_LENGTH,))
embedded = layers.Embedding(MAX_FEATURES, 64)(sequences)
x = layers.SimpleRNN(128, return_sequences=True)(embedded)
x = layers.SimpleRNN(128)(x)
x = layers.Dense(32, activation='relu')(x)
x = layers.Dense(100, activation='relu')(x)
predictions = layers.Dense(1, activation='sigmoid')(x)
rnn_model = Model(inputs=sequences, outputs=predictions)

### 6. Train your model with the following settings (1 point)

Note: The actual performance of your model is less important for this assignment, so if your model takes a while to train, you have some options to shorten the training time:
*   Use a GPU to accelerate training (Colab provides these as well)
*   Decrease the number of epochs (minimum 5 for this assignment)
*   Decrease the size of your training data through random sampling
*   If you have a different methodology that you'd like to try, feel free to use and submit it, as long as you provide a justification for it



In [None]:
batch_size = 128
epochs = 10

# first compile the model using the correct loss function & metrics
rnn_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# then save the history of the model training
rnn_history = rnn_model.fit(train_texts, train_labels, batch_size=batch_size, epochs=epochs, validation_data=(validation_texts, val_labels))

In [None]:
history_dict = rnn_history.history
print(history_dict.keys())

epochs_range = range(1, epochs+1)
print(list(epochs_range))

### Plot accuracy vs val_accuracy below 

In [None]:
import matplotlib.pyplot as plt

acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']

# "bo" is for "blue dot"
plt.plot(epochs_range, acc, 'bo', label='Training accuracy')
# b is for "solid blue line"
plt.plot(epochs_range, val_acc, 'b', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

### 7. What is the difference between accuracy & val_accuracy? How should you change your training strategy (if at all) based on the differences between accuracy & val_accuracy? (1 point)

Answer: accuracy is measured on the training data, while val_accuracy is measured on the held-out validation data that the model never trains on. If accuracy keeps rising while val_accuracy plateaus or falls, the model is overfitting the training data. To counter this, stop training earlier (fewer epochs or early stopping), add regularization such as dropout, reduce the model's capacity, or use more training data. If both are low, the model is underfitting and may need more epochs or more capacity.
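
One simple way to read the two curves is to look for the epoch where training accuracy starts to pull away from validation accuracy. The sketch below assumes a fixed gap threshold. The helper `first_overfit_epoch`, the threshold of 0.05, and the numbers are all made up for illustration and are not results from this notebook.

```python
def first_overfit_epoch(acc, val_acc, max_gap=0.05):
    """Return the first 1-based epoch where training accuracy leads validation
    accuracy by more than `max_gap`, or None if that never happens."""
    for epoch, (train, val) in enumerate(zip(acc, val_acc), start=1):
        if train - val > max_gap:
            return epoch
    return None

# made-up curves for illustration only
acc = [0.70, 0.78, 0.85, 0.91]
val_acc = [0.69, 0.75, 0.76, 0.75]
print(first_overfit_epoch(acc, val_acc))
```

In practice the same lists could come from `history_dict['accuracy']` and `history_dict['val_accuracy']`; the right threshold depends on the task.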

### Plot loss vs val_loss below

In [None]:
import matplotlib.pyplot as plt
loss = history_dict['loss']
val_loss = history_dict['val_loss']

# "bo" is for "blue dot"
plt.plot(epochs_range, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs_range, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

### 8. What is the difference between loss & val_loss and why do they matter? (1 point)

Answer: loss is the binary cross-entropy on the training data and val_loss is the same quantity on the validation data. They matter because the optimizer minimizes loss directly, while val_loss shows how well the model generalizes to unseen data. If both stay high, the model is underfitting; if loss keeps falling while val_loss rises, the model is overfitting the training data.
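
A common use of val_loss is early stopping: training halts once val_loss has not improved for a set number of epochs, and the weights from the best epoch are kept. The sketch below shows that rule in plain Python. `best_epoch`, the patience value, and the loss numbers are hypothetical and chosen for illustration.

```python
def best_epoch(val_losses, patience=2):
    """Return the 1-based epoch with the lowest val_loss seen before
    `patience` consecutive epochs pass without improvement."""
    best, best_at, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_at, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break   # stop training here
    return best_at

val_losses = [0.62, 0.55, 0.53, 0.56, 0.60, 0.66]  # made-up numbers
print(best_epoch(val_losses))
```

Keras provides an `EarlyStopping` callback that applies the same idea during `fit`.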

### 9. Evaluate your model on your test dataset. How are the loss & accuracy here different from the earlier losses & accuracies? (1 point)

In [None]:
rnn_model.evaluate(test_texts, test_labels)


## Re-train model using the embeddings provided

### Convert your text data into embeddings data

In [None]:
import numpy as np
embeddings_index = {}

# load embeddings provided
#with open("/content/drive/MyDrive/Wharton/Models/glove.6B.100d.txt") as f:
with open("./glove.6B.100d.txt", errors="ignore") as f:
  for line in f:
    word, coefs = line.split(maxsplit=1)
    coefs = np.fromstring(coefs, "f", sep=" ")
    embeddings_index[word] = coefs

In [None]:
# vectorize again before creating the embeddings
import tensorflow as tf
from keras.layers import TextVectorization

vectorizer = TextVectorization(max_tokens=MAX_FEATURES, output_sequence_length=MAX_LENGTH)
tweets_as_strings = twitter_training['tweet_cleaned'].apply(' '.join).values
text_ds = tf.data.Dataset.from_tensor_slices(tweets_as_strings).batch(128)
vectorizer.adapt(text_ds)

In [None]:
# Extract vocabulary & word index
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

num_tokens = len(voc) + 2
embedding_dim = 100
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

### 10. Use an embedding layer as the input layer instead of counts

In [None]:
from keras.layers import Embedding
from keras.initializers import Constant

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=Constant(embedding_matrix),
    trainable=False,
)

In [None]:
# specify input layer
rnn_inputs = Input(shape=(MAX_LENGTH,), dtype="int64")
rnn_embedded_sequences = embedding_layer(rnn_inputs)
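# layers.GRU picks the cuDNN kernel automatically on a GPU when used with default arguments;
# the separate CuDNNGRU layer is not available in recent Keras versions
x = layers.GRU(128, return_sequences=True)(rnn_embedded_sequences)
x = layers.GRU(128)(x)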
x = layers.Dense(32, activation='relu')(x)
x = layers.Dense(100, activation='relu')(x)
predictions = layers.Dense(1, activation='sigmoid')(x)

rnn_model_embeddings = Model(inputs=rnn_inputs, outputs=predictions)
rnn_model_embeddings.summary()


### 11. When creating the embedding layer in the previous section of code, what does the embedding layer do and what are the first 2 parameters? (1 point)

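Answer: The embedding layer is a lookup table that maps each integer token index to a dense vector. The first parameter (input_dim, here num_tokens) is the vocabulary size, i.e. the number of rows in the table; the second (output_dim, here embedding_dim = 100) is the length of each embedding vector.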
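
Conceptually, an embedding layer is row selection from a table with one row per token index. The toy table below uses 4 tokens and 3 dimensions; the values and the helper `embed` are made up for illustration.

```python
# 4 rows (vocabulary size, "input_dim") x 3 columns (vector length, "output_dim")
embedding_table = [
    [0.0, 0.0, 0.0],    # index 0: padding
    [0.1, 0.3, -0.2],   # index 1
    [0.7, -0.5, 0.4],   # index 2
    [-0.6, 0.2, 0.9],   # index 3
]

def embed(token_ids, table):
    """Look up one vector per token id, as an embedding layer does row by row."""
    return [table[i] for i in token_ids]

vectors = embed([0, 2, 3], embedding_table)
print(len(vectors), len(vectors[0]))
```

A sequence of length 3 becomes a 3 x 3 block of numbers: one row per token, one column per embedding dimension.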

### Compile & Train your model


In [None]:
batch_size = 128
epochs = 10

# first compile the model using the correct loss function & metrics
rnn_model_embeddings.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# then save the history of the model training
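# NOTE: train_texts was indexed by the Keras Tokenizer, whose word indices may differ from
# the TextVectorization vocabulary that embedding_matrix was built from
rnn_embeddings_history = rnn_model_embeddings.fit(train_texts, train_labels, batch_size=batch_size, epochs=epochs, validation_data=(validation_texts, val_labels))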

### 12. Evaluate your new model (1 point)

In [None]:
rnn_model_embeddings.evaluate(test_texts, test_labels)

### 13. Why did your model perform better or worse after switching to an embedding input layer? Explain... (1 point)

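Answer: Pretrained GloVe embeddings place words with similar meanings close together in vector space, so the model starts with semantic information learned from a much larger corpus instead of learning word representations from scratch on the training tweets. That usually helps generalization, although words missing from GloVe (slang, misspellings, hashtags) stay as all-zero vectors, which can limit the gain on tweets.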


### 14. Did the model train faster or slower with embeddings than without embeddings as the inputs? Why was it faster or slower? (1 point)

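Answer: Faster with the pretrained embeddings, most likely because the embedding layer is frozen (trainable=False), so its weights are not updated during training, and because the GRU layers can use the cuDNN-accelerated implementation on a GPU.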
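
A rough way to compare the two models is to count weights per layer. The sketch below assumes the standard formulas for an embedding table, a simple RNN, and a GRU with two bias vectors per gate (the layout Keras uses when `reset_after=True`). The helper names are hypothetical, and parameter counts are only a proxy for training time, not a measurement of it.

```python
def embedding_params(vocab_size, dim):
    """One row of `dim` weights per vocabulary entry."""
    return vocab_size * dim

def simple_rnn_params(input_dim, units):
    """Input kernel + recurrent kernel + bias."""
    return units * (input_dim + units + 1)

def gru_params(input_dim, units):
    """Three gates, each with input kernel, recurrent kernel, and two bias vectors."""
    return 3 * units * (input_dim + units + 2)

# layer sizes taken from the two models in this notebook
print(embedding_params(5000, 64))    # learned embedding in the first model
print(simple_rnn_params(64, 128))    # first SimpleRNN layer
print(gru_params(100, 128))          # first GRU layer on 100-d GloVe vectors
```

With the embedding layer frozen, its weights drop out of the trainable total even though the GRU layers have more weights than the SimpleRNN layers.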