<a href="https://colab.research.google.com/github/rajeshsahu09/CS69002_9A_18CS60R19/blob/master/MLP_Tutorial_New.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Review Sentiment Analysis

## Import libraries

In [0]:
import torch
import pandas as pd
import numpy as np

In [7]:
print('Welcome to Computing Lab 2. The current PyTorch version is ', torch.__version__)

Welcome to Computing Lab 2. The current PyTorch version is  1.0.1.post2


## Load the dataset and visualize


There are 3 ways to do this:

(a) From GitHub (files < 25 MB)

The easiest way to load a CSV file is straight from your GitHub repository. Open the dataset in the repository and click View Raw, then copy the URL of the raw file. Storing that URL in a string variable called `url`, as shown below, is optional but keeps the code tidy. Finally, pass `url` to Pandas `read_csv` to get the dataframe.


```
url = 'copied_raw_GH_link'
df1 = pd.read_csv(url)
# Dataset is now stored in a Pandas Dataframe

```

---


(b) From a local drive

To upload from your local drive, start with the following code:

```
from google.colab import files
uploaded = files.upload()
```

It will prompt you to select a file. Click on “Choose Files” then select and upload the file. Wait for the file to be 100% uploaded. You should see the name of the file once Colab has uploaded it.

Finally, type in the following code to import it into a dataframe (make sure the filename matches the name of the uploaded file).
```
import io
df2 = pd.read_csv(io.BytesIO(uploaded['Filename.csv']))
# Dataset is now stored in a Pandas Dataframe
```

---


(c) From Google Drive

This is the most complicated of the three methods. Follow this [link](https://medium.freecodecamp.org/how-to-transfer-large-files-to-google-colab-and-remote-jupyter-notebooks-26ca252892fa) to learn more.

In [0]:
url = 'https://raw.githubusercontent.com/rajeshsahu09/CS69002_9A_18CS60R19/master/Dataset/check.csv?token=As1EmLNK6ueuuXJ9qHTb6sk1CQdMCZ2bks5cnkkMwA%3D%3D'
df = pd.read_csv(url, sep='\t')

In [0]:
# from google.colab import files
# uploaded = files.upload()

In [0]:
# type(uploaded), uploaded.keys(), type(uploaded['check.csv'])

In [0]:
import io
# df = pd.read_csv(io.StringIO(uploaded['check.csv'].decode('utf-8')), sep='\t')
# df.head()

## Pandas

[Pandas](https://pandas.pydata.org/) is a high-performance, easy-to-use library of data structures and data analysis tools for Python.

Tutorials:

*   [Link 1](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html)
*   [Link 2](https://www.tutorialspoint.com/python_pandas)
*   [Link 3](https://www.dataquest.io/blog/pandas-python-tutorial/)



In [10]:
print('Number of Negative movie reviews', len(df[df['label']==0]))
print('Number of Positive movie reviews', len(df[df['label']==1]))

Number of Negative movie reviews 246
Number of Positive movie reviews 253
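The per-label counts above can also be read off in a single call with `value_counts`. The sketch below is illustrative only: it builds a tiny tab-separated sample in memory (the three rows are invented, not taken from `check.csv`) and assumes the same `text` and `label` column names used in this notebook.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same tab-separated layout
# (columns `text` and `label`); the rows are invented for illustration.
sample_tsv = "text\tlabel\nA fun comedy\t1\nDull and slow\t0\nGreat moments\t1\n"

# sep='\t' matches the way the real dataset is read earlier in the notebook.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Count how many reviews carry each label value.
label_counts = sample_df["label"].value_counts()
print(label_counts)
```

On the real dataframe the equivalent call would be `df['label'].value_counts()`.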


## Data pre-processing
The first step when building a neural network model is getting your data into the proper form to feed into the network. We'll need to encode each word with an integer. We'll also want to clean it up a bit.

You can use any type of pre-processing that you see fit, based on the task. For this tutorial, I will show just one pre-processing step: **lowercasing**.

In [11]:
text_reviews = df['text'].astype(str).tolist()
text_labels = df['label'].astype(int)

text_reviews = [x.lower() for x in text_reviews]
text_reviews[0], text_labels[0]

('john waters has given us a genuinely enjoyable film. this certainly isn\'t without its shocking waters-esque moments, but it is tamer than his older culty stuff, such as "pink flamingoes". "pecker" harkens back to john\'s early mainstream stage in that it reminds the viewer of the same kind of humor that was evident in "polyester". overall, a really fun comedy with some great moments!',
 1)

In [12]:
len(text_reviews)

499
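Lowercasing is the only step applied in this tutorial, but further cleaning follows the same pattern. The function below is a minimal sketch of one possible extension, not part of the original pipeline: the name `clean_review` and the choice to drop HTML line breaks and punctuation (keeping apostrophes) are assumptions made for illustration.

```python
import re

def clean_review(text):
    """Lowercase, drop HTML line breaks, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"<br\s*/?>", " ", text)      # remove HTML line-break tags
    text = re.sub(r"[^a-z0-9\s']", " ", text)   # keep letters, digits, apostrophes
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace

print(clean_review('A really FUN comedy,<br /> with "great" moments!'))
```

It could be applied to every review with `[clean_review(x) for x in text_reviews]`. Whether stripping punctuation helps depends on the task, so treat this as a starting point rather than a recommendation.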

## Creating a Bag-of-Words (BOW) representation of sentences

In [0]:
def generate_word_ids(dataset):
  word_to_ix = {}
  print(dataset)
  for sent,_ in dataset:
      for word in sent.split():
          if word not in word_to_ix:
              word_to_ix[word] = len(word_to_ix)
  return word_to_ix

In [14]:
generate_word_ids([['welcome to computing lab 2','1']])

[['welcome to computing lab 2', '1']]


{'2': 4, 'computing': 2, 'lab': 3, 'to': 1, 'welcome': 0}

In [0]:
data = [("me gusta comer en la cafeteria", "SPANISH"),
        ("Give it to me", "ENGLISH"),
        ("No creo que sea una buena idea", "SPANISH"),
        ("No it is not a good idea to get lost at sea", "ENGLISH")]

test_data = [("Yo creo que si", "SPANISH"),
             ("it is lost on me", "ENGLISH")]

In [16]:
word_to_ix = generate_word_ids(data+test_data)

VOCAB_SIZE = len(word_to_ix)
NUM_LABELS = 2
print(VOCAB_SIZE, word_to_ix)

[('me gusta comer en la cafeteria', 'SPANISH'), ('Give it to me', 'ENGLISH'), ('No creo que sea una buena idea', 'SPANISH'), ('No it is not a good idea to get lost at sea', 'ENGLISH'), ('Yo creo que si', 'SPANISH'), ('it is lost on me', 'ENGLISH')]
26 {'me': 0, 'gusta': 1, 'comer': 2, 'en': 3, 'la': 4, 'cafeteria': 5, 'Give': 6, 'it': 7, 'to': 8, 'No': 9, 'creo': 10, 'que': 11, 'sea': 12, 'una': 13, 'buena': 14, 'idea': 15, 'is': 16, 'not': 17, 'a': 18, 'good': 19, 'get': 20, 'lost': 21, 'at': 22, 'Yo': 23, 'si': 24, 'on': 25}


## Intro to PyTorch

In [0]:
import torch.nn as nn
import torch.nn.functional as F
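# Variable has been deprecated since PyTorch 0.4: tensors track gradients
# directly, and Variable(x) simply returns a tensor.
from torch.autograd import Variable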

SEED = 42

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)

lin = nn.Linear(5,3)
data1 = Variable(torch.randn(10,5)) # create a 10 * 5 matrix of random vars

In [18]:
data1, data1.size()

(tensor([[-0.4974,  0.4396, -0.7581,  1.0783,  0.8008],
         [ 1.6806,  0.3559, -0.6866,  0.6105,  1.3347],
         [-0.2316,  0.0418, -0.2516,  0.8599, -0.3097],
         [-0.3957, -0.2234,  1.7174,  0.3189, -0.4245],
         [ 0.3057, -0.7746,  0.0349,  0.3211, -0.8798],
         [-0.6011, -1.2742,  2.1228, -1.2347, -0.4879],
         [-1.4181,  0.8963,  0.0780,  0.5258,  0.3466],
         [-0.1973, -1.0546,  1.2780,  0.1453,  0.2311],
         [ 0.0087, -0.1423,  0.5750, -0.6417, -2.2064],
         [-0.7508,  2.8140,  0.3598, -0.0898,  0.4584]]), torch.Size([10, 5]))

In [19]:
lin(data1), lin(data1).size()

(tensor([[ 0.3739,  0.1674, -0.1029],
         [ 0.8355,  0.0414,  0.8471],
         [ 0.2831,  0.6893, -0.1183],
         [-0.2887,  1.0737,  0.3888],
         [-0.0317,  0.9654, -0.0386],
         [-1.4234,  0.7988,  0.2472],
         [-0.0414,  0.1357, -0.2777],
         [-0.6189,  0.8737,  0.3820],
         [-0.2207,  0.9984, -0.2671],
         [ 0.6052, -0.4270,  0.2221]], grad_fn=<AddmmBackward>),
 torch.Size([10, 3]))

In [20]:
lin.weight.size()

torch.Size([3, 5])

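**BE VERY CAREFUL WITH THE DIMENSIONS:** `nn.Linear(5, 3)` stores its weight as `(out_features, in_features)`, here `[3, 5]`, and maps a `[10, 5]` input to a `[10, 3]` output.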
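To make the shape bookkeeping concrete, the sketch below reproduces the linear layer's forward pass by hand and checks it against `nn.Linear`. The seed and the variable names `x` and `manual` are illustrative choices, so the random values will not match the outputs printed above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lin = nn.Linear(5, 3)          # in_features=5, out_features=3
x = torch.randn(10, 5)         # batch of 10 rows, 5 features each

# nn.Linear stores its weight as (out_features, in_features) = (3, 5),
# so the forward pass multiplies by the transpose: y = x @ W^T + b.
manual = x @ lin.weight.t() + lin.bias

print(lin.weight.shape, manual.shape)
print(torch.allclose(manual, lin(x), atol=1e-6))
```

A `[10, 5]` input times a `[5, 3]` transposed weight gives a `[10, 3]` output; passing an input whose last dimension is not 5 raises a shape error.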

In [21]:
F.relu(lin(data1))  # Activation Function

tensor([[0.3739, 0.1674, 0.0000],
        [0.8355, 0.0414, 0.8471],
        [0.2831, 0.6893, 0.0000],
        [0.0000, 1.0737, 0.3888],
        [0.0000, 0.9654, 0.0000],
        [0.0000, 0.7988, 0.2472],
        [0.0000, 0.1357, 0.0000],
        [0.0000, 0.8737, 0.3820],
        [0.0000, 0.9984, 0.0000],
        [0.6052, 0.0000, 0.2221]], grad_fn=<ReluBackward0>)

## Model Definition for the BOWClassifier

In [0]:
class BOWClassifier(nn.Module):
  def __init__(self, num_labels, vocab_size):
    super(BOWClassifier, self).__init__()
    self.lin = nn.Linear(vocab_size, num_labels)
    
  def forward(self, x):
    # softmax over the class dimension; omitting dim triggers a deprecation warning
    return F.softmax(self.lin(x), dim=1)

## Generate the BOW Vectors

In [0]:
def make_bow_vector(sentence, word_to_ix):
    # create a vector of zeros of vocab size = len(word_to_ix)
    vec = torch.zeros(len(word_to_ix))
    for word in sentence.split():
        if word not in word_to_ix:
            raise ValueError('Word {!r} not present in the dictionary. Sorry!'.format(word))
        else:
            vec[word_to_ix[word]]+=1
    return vec.view(1, -1)

def make_target(label, label_to_ix):
    return torch.LongTensor([label_to_ix[label]])
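`make_bow_vector` raises an error for any word missing from `word_to_ix`, which is why the vocabulary above is built from `data + test_data`. A common alternative is to reserve one index for unknown words. The sketch below shows that idea in plain Python; the names `UNK`, `build_vocab`, and `bow_counts` are hypothetical and not part of the original notebook.

```python
UNK = "<unk>"  # hypothetical placeholder token for out-of-vocabulary words

def build_vocab(sentences):
    """Map each distinct word to an integer id; id 0 is reserved for UNK."""
    word_to_ix = {UNK: 0}
    for sent in sentences:
        for word in sent.split():
            # assign the next free id only if the word is new
            word_to_ix.setdefault(word, len(word_to_ix))
    return word_to_ix

def bow_counts(sentence, word_to_ix):
    """Return a list of word counts; unseen words are counted under UNK."""
    vec = [0] * len(word_to_ix)
    for word in sentence.split():
        vec[word_to_ix.get(word, word_to_ix[UNK])] += 1
    return vec

vocab = build_vocab(["Give it to me"])
print(vocab)
print(bow_counts("Give it to Yo", vocab))
```

With this design the vocabulary can be built from the training data alone, and the resulting list can be wrapped with `torch.tensor(...)` before being fed to a model.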


In [0]:
bow = BOWClassifier(NUM_LABELS, VOCAB_SIZE)

In [25]:
for param in bow.parameters():
    print(param,param.size())

Parameter containing:
tensor([[ 0.1190, -0.0465,  0.1122, -0.1524, -0.0990,  0.0598,  0.0415, -0.0500,
          0.1169,  0.1333, -0.1422, -0.1047,  0.1796, -0.0662, -0.0695, -0.1898,
         -0.1123,  0.0490, -0.0259, -0.1424,  0.0046, -0.1340, -0.1664, -0.1080,
         -0.1716, -0.1249],
        [ 0.1960,  0.0370,  0.0604, -0.1829, -0.1288, -0.0653,  0.0307, -0.1726,
         -0.0845, -0.1174,  0.0005, -0.0730, -0.0136, -0.1329, -0.1346, -0.1144,
         -0.0671, -0.1548,  0.1644, -0.0389,  0.1687,  0.0611, -0.1661,  0.1357,
         -0.0540, -0.0752]], requires_grad=True) torch.Size([2, 26])
Parameter containing:
tensor([-0.1628, -0.1950], requires_grad=True) torch.Size([2])


### Let's check how our model behaves on a sample case

In [26]:
sample_data, sample_label = data[0]
bow_vector = torch.autograd.Variable(make_bow_vector(sample_data, word_to_ix))
logprobs = bow(bow_vector)
print(data[0])
print(logprobs)

('me gusta comer en la cafeteria', 'SPANISH')
tensor([[0.5272, 0.4728]], grad_fn=<SoftmaxBackward>)




In [27]:
label_to_ix = {"SPANISH": 0, "ENGLISH": 1}
ix_to_label = {v: k for k, v in label_to_ix.items()}

label_to_ix, ix_to_label

({'ENGLISH': 1, 'SPANISH': 0}, {0: 'SPANISH', 1: 'ENGLISH'})

In [28]:
for instance, label in test_data:
    bow_vec = Variable(make_bow_vector(instance, word_to_ix))
    logprobs = bow(bow_vec)
    print(logprobs)
    pred = np.argmax(logprobs.data.numpy())
    print('prediction: {}'.format(ix_to_label[pred]))
    print('actual: {}'.format(label))


tensor([[0.3767, 0.6233]], grad_fn=<SoftmaxBackward>)
prediction: ENGLISH
actual: SPANISH
tensor([[0.4471, 0.5529]], grad_fn=<SoftmaxBackward>)
prediction: ENGLISH
actual: ENGLISH




In [0]:
# define a loss function and an optimizer
loss_function = nn.NLLLoss()
opt = torch.optim.SGD(bow.parameters(), lr = 0.1)
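One caveat about the loss: `nn.NLLLoss` is documented to expect log-probabilities, whereas the `BOWClassifier` above returns softmax probabilities. Training still moves in the right direction here, but the usual pairing is `log_softmax` with `NLLLoss`, or raw scores with `CrossEntropyLoss`. The sketch below uses made-up scores (`logits`) and a made-up target to show that the two pairings give the same loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up raw scores (logits) for one example over two classes.
logits = torch.tensor([[0.2, -0.1]])
target = torch.tensor([0])                  # index of the correct class

log_probs = F.log_softmax(logits, dim=1)    # log-probabilities along the class axis
nll = nn.NLLLoss()(log_probs, target)

# CrossEntropyLoss applies log_softmax and NLLLoss in a single step.
ce = nn.CrossEntropyLoss()(logits, target)

print(nll.item(), ce.item())
```

Adopting this in the tutorial's model would mean returning `F.log_softmax(self.lin(x), dim=1)` from `forward`; the printed values would then be log-probabilities, and calling `.exp()` on them recovers the probabilities. The numbers shown in this notebook come from the softmax version.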

In [30]:
# the training loop
for epoch in range(100):
#     print(epoch)
    for instance, label in data:
        # get the training data
        bow.zero_grad()
        bow_vec = Variable(make_bow_vector(instance, word_to_ix))
        label = Variable(make_target(label, label_to_ix))
        probs = bow(bow_vec) # forward pass
        loss = loss_function(probs, label)
        loss.backward()
#        print('CURRENT LOSS: {}'.format(loss.data))
        opt.step()




In [31]:
print('--- AFTER TRAINING ---')
for instance, label in test_data:
    bow_vec = Variable(make_bow_vector(instance, word_to_ix))
    logprobs = bow(bow_vec)
    print(logprobs)
    pred = np.argmax(logprobs.data.numpy())
    print('prediction: {}'.format(ix_to_label[pred]))
    print('actual: {}'.format(label))

--- AFTER TRAINING ---
tensor([[0.7725, 0.2275]], grad_fn=<SoftmaxBackward>)
prediction: SPANISH
actual: SPANISH
tensor([[0.0491, 0.9509]], grad_fn=<SoftmaxBackward>)
prediction: ENGLISH
actual: ENGLISH




In [32]:
# note: torch.save writes a pickled binary file, not JSON, despite the extension
torch.save(bow,'model.json')

from google.colab import files
files.download("model.json")



In [38]:
from google.colab import files
temp_test = files.upload()


Saving model.json to model (1).json


In [39]:
torch.load(io.BytesIO(temp_test['model.json']))

BOWClassifier(
  (lin): Linear(in_features=26, out_features=2, bias=True)
)
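Saving the whole module with `torch.save(bow, ...)` pickles the object, which ties the file to the exact class definition and is what triggers the serialization warning. The PyTorch documentation generally recommends saving only the `state_dict` instead. The sketch below illustrates that pattern with an in-memory buffer; the plain `nn.Linear(26, 2)` is a stand-in for the trained classifier, used here only to keep the example self-contained.

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(26, 2)  # stand-in for the trained classifier (same shapes)

# Serialize only the parameters into an in-memory buffer.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)

# Rebuild the architecture, then load the saved parameters into it.
buffer.seek(0)
restored = nn.Linear(26, 2)
restored.load_state_dict(torch.load(buffer))

print(torch.equal(model.weight, restored.weight))
```

For the model in this notebook the same idea would read `torch.save(bow.state_dict(), path)` followed by `BOWClassifier(NUM_LABELS, VOCAB_SIZE).load_state_dict(torch.load(path))`, where `path` is any file name you choose.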