
# Deep learning models for sentiment classification

### Full marks: 10

In this assignment you will implement a few simple deep learning models for sentiment classification using review data. 

You are required to implement the functions as instructed. <b>Do not change the signatures of the functions.</b>

<b>Deadline: Monday, 16 August 2021 03:00 AM</b>

<i>This is an easy assignment. An extension of this deadline should not be required.</i>

<b>Submission instruction</b>:
Submit an ipynb file to the classroom. If you work on Colab, download your file as ipynb and upload that file. For convenience, add your roll number as a suffix to the file name.

<b>Important</b>: Do not just share the link to your Colab file. Since the assignment needs to be submitted before the deadline, the file you submit should not be editable by you after you submit. A mistake in this regard will not be accepted under any circumstances; it will be treated as not submitting the assignment.

<b>Late submission penalty</b>: For every hour of delay, you lose 1% of the marks you earn. So, if you submit 2 days late, you lose almost half of your marks. Please do not leave this to the last minute; you will need time to solve it.

## Setup 

Get the auth code using your Google account so that you can read the files.

In [None]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()  
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
# Training data (anyone on the internet can view)
# https://drive.google.com/file/d/1E4aQmxQg_RLHVH28G2sUecUQwvGl0EXE/view?usp=sharing 

# Test data (anyone on the internet can view)
# https://drive.google.com/file/d/1i0nX9LxYI1oh7872qXTVrVqliQCnkbOn/view?usp=sharing

filename_train = 'reviews_train.bz2'
filename_test = 'reviews_test.bz2'

id_train = '1E4aQmxQg_RLHVH28G2sUecUQwvGl0EXE' 
id_test = '1i0nX9LxYI1oh7872qXTVrVqliQCnkbOn' 

train_file = drive.CreateFile({'id': id_train})
test_file = drive.CreateFile({'id': id_test})

# Read the files
train_file.GetContentFile(filename_train)
test_file.GetContentFile(filename_test)

## Imports

Some essential imports are listed below. You may import more libraries if you want to. 

In [None]:
import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences

import bz2
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score

%matplotlib inline

## 1. Read texts and labels from the files (1 mark)

Each line in the given files starts with <tt>\_\_label\_\_1</tt> or <tt>\_\_label\_\_2</tt> (two underscores, 'label', two underscores and  then 1 or 2), followed by a whitespace and then the text. \_\_label\_\_1 indicates <b>negative sentiment</b> and \_\_label\_\_2 indicates <b>positive sentiment</b>. 

Write a function to read the file and return labels (numpy array of integers) and texts (array or list of strings). Ideally, you want to convert positive sentiment labels to 1 and negative sentiment labels to 0 right here. If you don't, you will have to deal with it later in the model. 


If you want, you may also apply some preprocessing to the text (e.g., lowercasing, stemming, removing punctuation), but it is optional. Note that not all such preprocessing steps may help in sentiment classification.
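The line format described above can be parsed as in the following minimal sketch. The helper name `parse_line` and the `LABEL_MAP` dictionary are illustrative assumptions, not part of the assignment template; the sketch lowercases nothing and only strips surrounding whitespace.

```python
# Hypothetical helper: maps the two label prefixes to 0 (negative) and 1 (positive)
LABEL_MAP = {'__label__1': 0, '__label__2': 1}

def parse_line(line):
    """Split one '__label__N text' line into (integer label, text)."""
    # Split once on the first run of whitespace: prefix first, review text second
    parts = line.split(None, 1)
    if not parts or parts[0] not in LABEL_MAP:
        raise ValueError('line does not start with a known label prefix')
    text = parts[1].strip() if len(parts) > 1 else ''
    return LABEL_MAP[parts[0]], text

label, text = parse_line('__label__2 Great product, works as described\n')
```

Raising an error on an unknown prefix, rather than silently treating it as negative, makes malformed lines visible early.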

In [None]:
def read_texts_and_labels(file):
  """Read a bz2 file of '__label__N text' lines and return (texts, labels)."""
  texts = []
  labels = []
  with bz2.open(file, 'rt', encoding='utf-8') as f:
    for line in f:
      # The label prefix and the review text are separated by the first space
      label, _, text = line.partition(' ')
      texts.append(text.strip().lower())
      # __label__2 is positive (1); __label__1 is negative (0)
      labels.append(1 if label == '__label__2' else 0)
  return texts, np.array(labels)

train_texts, train_labels = read_texts_and_labels(filename_train)
test_texts, test_labels = read_texts_and_labels(filename_test)

# One-hot encode the test labels to match the 2-unit softmax output used below
test_labels = keras.utils.to_categorical(test_labels, num_classes=2)

## 2. Set aside validation data (1 mark)

Since there are a lot of data points, set aside 5% of the data as a validation set. You may use the <tt>train_test_split</tt> function from <tt>sklearn</tt>, or any other method to do it.

In [None]:
from sklearn.model_selection import train_test_split
train_texts_split, val_texts_split, train_labels_split, val_labels_split=train_test_split(train_texts, train_labels, test_size=0.05, random_state=42)

## 3. Convert the texts to sequences and pad the sequences (1 mark)

Convert the texts (train\_text, val\_text and test\_text) to sequences using the Keras tokenizer. After that, pad the sequences with leading zeros using Keras.
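To make "leading zeros" concrete, here is a pure-Python sketch of what padding at the front and truncating at the end does to a single sequence. The function `pad_pre_truncate_post` is a hypothetical stand-in for illustration only; in the notebook itself, `pad_sequences` from Keras does this work.

```python
def pad_pre_truncate_post(seq, maxlen, value=0):
    """Illustrate padding='pre' with truncating='post' for one sequence."""
    seq = list(seq)[:maxlen]                     # 'post' truncation drops tokens from the end
    return [value] * (maxlen - len(seq)) + seq   # 'pre' padding adds zeros at the front

padded_short = pad_pre_truncate_post([5, 8, 2], maxlen=5)
padded_long = pad_pre_truncate_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(padded_short)
print(padded_long)
```

Short reviews gain zeros on the left, while long reviews keep only their first `maxlen` tokens.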

In [None]:
vocab_size = 3000
trunc_type = 'post'
oov_tok = '<oov>'
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
# Fit on the training split only, so the validation texts do not influence the vocabulary
tokenizer.fit_on_texts(train_texts_split)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(train_texts_split)
val_texts_tokenized = tokenizer.texts_to_sequences(val_texts_split)
test_texts_tokenized = tokenizer.texts_to_sequences(test_texts)

In [None]:
max_length_of_sequences = 100


train_texts_split_tokenized_padded=pad_sequences(sequences, maxlen=max_length_of_sequences, truncating=trunc_type)
val_texts_tokenized_padded=pad_sequences(val_texts_tokenized, maxlen=max_length_of_sequences, truncating=trunc_type)
test_texts_tokenized_padded=pad_sequences(test_texts_tokenized, maxlen=max_length_of_sequences, truncating=trunc_type)

### No scaling of the padded sequences

In [None]:
# Token ids are indices into the embedding table, not magnitudes, so they must stay
# integers. Dividing them (by 3000, 300 or 10) would collapse distinct ids when the
# embedding layer looks them up. The names below are kept so that later cells still work.
normalizer_train_texts_split_tokenized_padded = train_texts_split_tokenized_padded
normalizer_test_texts_tokenized_padded = test_texts_tokenized_padded

## 4. A simple feedforward network (2 marks)

Implement a simple feedforward network with one embedding layer with <tt>d1</tt> dimension, then one hidden layer with <tt>d2</tt> dimension (activation function ReLU) and an output layer with appropriate dimension and activation function. 

Then, compile the model with a suitable optimizer and suitable loss and metrics. 
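The data flow through such a network can be traced with a small NumPy sketch: embedding lookup, flattening, a ReLU hidden layer, and a softmax output. All sizes and the random weights below are made-up stand-ins for trained parameters; this is an illustration of the shapes involved, not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len, d1, d2, n_classes = 50, 6, 4, 8, 2  # toy sizes, assumed

# Random weights standing in for trained parameters
E = rng.normal(size=(vocab_size, d1))        # embedding table
W1 = rng.normal(size=(seq_len * d1, d2))     # hidden layer weights
b1 = np.zeros(d2)
W2 = rng.normal(size=(d2, n_classes))        # output layer weights
b2 = np.zeros(n_classes)

def forward(token_ids):
    x = E[token_ids]                         # (batch, seq_len, d1)
    x = x.reshape(x.shape[0], -1)            # (batch, seq_len * d1)
    h = np.maximum(x @ W1 + b1, 0.0)         # ReLU hidden layer, (batch, d2)
    logits = h @ W2 + b2                     # (batch, n_classes)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)  # softmax probabilities

batch = rng.integers(0, vocab_size, size=(3, seq_len))
probs = forward(batch)
print(probs.shape)
```

Note that the embedding output has one vector per token, so it has to be flattened (or pooled) before a dense layer can produce one prediction per review.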

In [None]:
# You may tune these hyperparameters, but implement the model as instructed above.
d1 = 200
d2 = 128

# Embedding (d1) -> flatten -> ReLU hidden layer (d2) -> 2-way softmax output
model = keras.Sequential([
    keras.Input(shape=(max_length_of_sequences,)),
    layers.Embedding(input_dim=vocab_size, output_dim=d1),
    layers.Flatten(),  # one vector per review, so the output shape matches the labels
    layers.Dense(d2, activation='relu'),
    layers.Dense(2, activation='softmax'),
])

model1 = model
optimizer = keras.optimizers.Adam(learning_rate=0.001)
# The labels are one-hot encoded with to_categorical, so use categorical cross-entropy
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['acc'])


In [None]:
model1.summary()

In [None]:

train_labels_split=keras.utils.to_categorical(train_labels_split,num_classes=2)
val_labels_split=keras.utils.to_categorical(val_labels_split,num_classes=2)

Train the model using the fit() method.

In [None]:
model.fit(normalizer_train_texts_split_tokenized_padded, train_labels_split, batch_size=256, epochs=3,validation_data=(val_texts_tokenized_padded, val_labels_split))
 

## 4A. Experiment and note your observation (1 mark)

Experiment with the hyperparameters a bit (dimensions, learning rate, optimizer) and note your observations in the cell below. You do not have to try every possible combination of hyperparameters, but try more than one.

WRITE HERE

## 5. Implement a bidirectional GRU Model (2 marks)

Implement a GRU model with the following details:

*   First, an embedding layer with <tt>d1</tt> dimension.
*   Then, a bidirectional GRU layer with <tt>d2</tt> dimension.
*   Then, a unidirectional GRU layer with <tt>d3</tt> dimension.
*   Lastly, an appropriate output layer with activation.



In [None]:
d3 = 64  # assumed size for the unidirectional GRU layer; tune as needed

def model_GRU():
  model = keras.Sequential([
      keras.Input(shape=(max_length_of_sequences,)),
      layers.Embedding(input_dim=vocab_size, output_dim=d1),
      layers.Bidirectional(layers.GRU(d2, return_sequences=True)),
      layers.GRU(d3),  # returns only the final state: one vector per review
      layers.Dense(2, activation='softmax'),
  ])
  return model

## 5A. Experiment and note your observation (1 mark)

Experiment with the hyperparameters a bit (dimensions, learning rate, optimizer) and note your observations in the cell below. You do not have to try every possible combination of hyperparameters, but try more than one.

In [None]:
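model2 = model_GRU()
# Same optimizer, loss and metric as the feedforward model, for a like-for-like comparison
model2.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='categorical_crossentropy',
    metrics=['acc'],
)

model2.fit(
    normalizer_train_texts_split_tokenized_padded, train_labels_split,
    batch_size=256, epochs=3,
    validation_data=(val_texts_tokenized_padded, val_labels_split),
)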

## 6. Compute the accuracy and F1 scores (1 mark)

Compute the accuracy and F1 scores for both models on the <b>test data</b>.
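One way to get both scores from a softmax model is sketched below on made-up numbers. The `probs` array stands in for the output of `model.predict(...)` and `y_true_onehot` for one-hot test labels; both are invented for illustration and are not results from this assignment.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up softmax outputs for five reviews (columns: negative, positive)
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4],
                  [0.3, 0.7],
                  [0.1, 0.9]])
# Made-up one-hot ground truth for the same five reviews
y_true_onehot = np.array([[1, 0], [0, 1], [0, 1], [0, 1], [1, 0]])

# Convert probabilities and one-hot labels back to class indices (0 or 1)
y_pred = probs.argmax(axis=1)
y_true = y_true_onehot.argmax(axis=1)

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)  # F1 of the positive class (label 1)
print(acc, f1)
```

Both sklearn functions expect class indices rather than one-hot vectors, which is why the `argmax` step comes first.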

In [None]:
# Implement code here

## Optional (ungraded)

Learn how to load a pre-trained BERT model and fine-tune it to classify the sentiment of the texts.