<a href="https://colab.research.google.com/github/binny-mathew/IITKGP_CS69002_Spring_2019/blob/master/MLP_Tutorial_New.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Review Sentiment Analysis

## Import Header files

In [0]:
import torch
import pandas as pd
import numpy as np

In [0]:
print('Welcome to Computing Lab 2. The current PyTorch version is ', torch.__version__)

## Load the dataset and visualize


There are 3 ways to do this:

(a) From GitHub (files < 25 MB)

The easiest way to load a CSV file is from your GitHub repository. Open the dataset in the repository and click Raw. Copy the link to the raw file and store it in a string variable called `url` in Colab, as shown below (using a variable is tidier but not required). Then pass `url` to Pandas `read_csv` to get the dataframe.


```
url = 'copied_raw_GH_link'
df1 = pd.read_csv(url)
# Dataset is now stored in a Pandas Dataframe

```

---


(b) From a local drive

To upload from your local drive, start with the following code:

```
from google.colab import files
uploaded = files.upload()
```

It will prompt you to select a file. Click on “Choose Files” then select and upload the file. Wait for the file to be 100% uploaded. You should see the name of the file once Colab has uploaded it.

Finally, type in the following code to import it into a dataframe (make sure the filename matches the name of the uploaded file).
```
import io
df2 = pd.read_csv(io.BytesIO(uploaded['Filename.csv']))
# Dataset is now stored in a Pandas Dataframe
```

---


(c) From Google Drive

This is the most complicated of the three methods. Follow this [link](https://medium.freecodecamp.org/how-to-transfer-large-files-to-google-colab-and-remote-jupyter-notebooks-26ca252892fa) to learn more.

In [0]:
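# Use the raw file URL; the github.com/.../blob/... page returns HTML, not the CSV itself
url = 'https://raw.githubusercontent.com/binny-mathew/IITKGP_CS69002_Spring_2019/master/Dataset/check.csv'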
df1 = pd.read_csv(url, sep='\t')

In [0]:
from google.colab import files
uploaded = files.upload()

In [0]:
type(uploaded), uploaded.keys(), type(uploaded['check.csv'])

In [0]:
import io
df = pd.read_csv(io.StringIO(uploaded['check.csv'].decode('utf-8')), sep='\t')
df.head()

## Pandas

[Pandas](https://pandas.pydata.org/) is a fast, easy-to-use Python library that provides data structures (such as the DataFrame) and tools for data analysis.

Tutorials:

*   [Link 1](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html)
*   [Link 2](https://www.tutorialspoint.com/python_pandas)
*   [Link 3](https://www.dataquest.io/blog/pandas-python-tutorial/)



In [0]:
print('Number of Negative movie reviews', len(df[df['label']==0]))
print('Number of Positive movie reviews', len(df[df['label']==1]))

## Data pre-processing
The first step when building a neural network model is getting your data into the proper form to feed into the network. We'll need to encode each word with an integer. We'll also want to clean it up a bit.

You can use any pre-processing that suits the task. For this tutorial, I will show just one pre-processing step: **lowercasing**.

In [0]:
text_reviews = df['text'].astype(str).tolist()
text_labels = df['label'].astype(int)

text_reviews = [x.lower() for x in text_reviews]
text_reviews[0], text_labels[0]

In [0]:
len(text_reviews)
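
Lowercasing is only one possible cleaning step. As a minimal sketch of a slightly stronger one, the hypothetical helper `clean_text` below (not part of the original notebook) also replaces punctuation with spaces and collapses repeated whitespace. The regular expression assumes English text; note that it splits contractions such as "don't" into "don t".

```python
import re

def clean_text(text):
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    text = text.lower()
    # Anything that is not a lowercase letter, digit, or whitespace becomes a space
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # split() with no arguments drops runs of whitespace, so join() yields single spaces
    return " ".join(text.split())

example = "This movie was GREAT!!!  Loved it."
print(clean_text(example))
```

It could be applied to the whole list with `[clean_text(x) for x in text_reviews]`, in place of the plain `x.lower()` call.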

## Creating a Bag-of-Words (BOW) representation of sentences

In [0]:
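def generate_word_ids(dataset):
    """Map every distinct word in the dataset to a unique integer index."""
    word_to_ix = {}
    for sent, _ in dataset:  # each item is a (sentence, label) pair
        for word in sent.split():
            if word not in word_to_ix:
                # the next free index equals the current vocabulary size
                word_to_ix[word] = len(word_to_ix)
    return word_to_ix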

In [0]:
generate_word_ids([['welcome to computing lab 2','1']])

In [0]:
data = [("me gusta comer en la cafeteria", "SPANISH"),
        ("Give it to me", "ENGLISH"),
        ("No creo que sea una buena idea", "SPANISH"),
        ("No it is not a good idea to get lost at sea", "ENGLISH")]

test_data = [("Yo creo que si", "SPANISH"),
             ("it is lost on me", "ENGLISH")]

In [0]:
word_to_ix = generate_word_ids(data+test_data)

VOCAB_SIZE = len(word_to_ix)
NUM_LABELS = 2
print(VOCAB_SIZE, word_to_ix)
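
To make the word-to-index mapping concrete, here is a small sketch of how a sentence becomes a vector of word counts, using plain Python lists rather than tensors. The names `build_vocab` and `bow_counts` and the two toy sentences are illustrative assumptions, not part of the notebook.

```python
def build_vocab(sentences):
    """Assign each new word the next free integer index."""
    word_to_ix = {}
    for sent in sentences:
        for word in sent.split():
            # setdefault only inserts when the word has not been seen before
            word_to_ix.setdefault(word, len(word_to_ix))
    return word_to_ix

def bow_counts(sentence, word_to_ix):
    """Return a list with one slot per vocabulary word, holding that word's count."""
    counts = [0] * len(word_to_ix)
    for word in sentence.split():
        counts[word_to_ix[word]] += 1
    return counts

toy_vocab = build_vocab(["it is lost on me", "give it to me"])
print(toy_vocab)
print(bow_counts("it is it", toy_vocab))
```

Word order is discarded: only the number of times each vocabulary word occurs is kept, which is what "bag of words" refers to.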

## Intro to PyTorch

In [0]:
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

SEED = 42

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)

lin = nn.Linear(5,3)
data1 = Variable(torch.randn(10,5)) # a 10 x 5 matrix of random values (since PyTorch 0.4, Variable simply returns a Tensor)

In [0]:
data1, data1.size()

In [0]:
lin(data1), lin(data1).size()

In [0]:
lin.weight.size()

**BE VERY CAREFUL WITH THE DIMENSIONS**

In [0]:
F.relu(lin(data1))  # Activation Function
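
The dimension warning above can be checked directly. The sketch below, with illustrative variable names, shows that `nn.Linear(5, 3)` stores its weight as (out_features, in_features), maps a (10, 5) batch to (10, 3), and raises an error when the last input dimension does not match `in_features`.

```python
import torch
import torch.nn as nn

layer = nn.Linear(in_features=5, out_features=3)
batch = torch.randn(10, 5)            # 10 examples, 5 features each

out = layer(batch)
print(layer.weight.shape)             # weight is stored as (out_features, in_features)
print(layer.bias.shape)               # one bias per output feature
print(out.shape)                      # (batch size, out_features)

# The layer computes x @ W^T + b; verify against a manual computation
manual = batch @ layer.weight.t() + layer.bias
print(torch.allclose(out, manual, atol=1e-6))

# A wrong last dimension (4 instead of 5) raises a RuntimeError
try:
    layer(torch.randn(10, 4))
except RuntimeError as err:
    print('shape mismatch:', err)
```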

## Model Definition for the BOWClassifier

In [0]:
class BOWClassifier(nn.Module):
  def __init__(self, num_labels, vocab_size):
    super(BOWClassifier, self).__init__()
    self.lin = nn.Linear(vocab_size, num_labels)
    
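  def forward(self, x):
    # nn.NLLLoss (used for training below) expects log-probabilities,
    # so apply log_softmax over the label dimension rather than softmax
    return F.log_softmax(self.lin(x), dim=1)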

## Generate the BOW Vectors

In [0]:
def make_bow_vector(sentence, word_to_ix):
    # create a vector of zeros of length vocab size = len(word_to_ix)
    vec = torch.zeros(len(word_to_ix))
    for word in sentence.split():
        if word not in word_to_ix:
            raise ValueError('Word {!r} not present in the dictionary. Sorry!'.format(word))
        else:
            vec[word_to_ix[word]] += 1  # count each occurrence of the word
    return vec.view(1, -1)  # shape (1, vocab_size): a batch of one sentence

def make_target(label, label_to_ix):
    return torch.LongTensor([label_to_ix[label]])


In [0]:
bow = BOWClassifier(NUM_LABELS, VOCAB_SIZE)

In [0]:
for param in bow.parameters():
    print(param,param.size())

### Let's check with a sample case how our model is working

In [0]:
sample_data, sample_label = data[0]
bow_vector = torch.autograd.Variable(make_bow_vector(sample_data, word_to_ix))
logprobs = bow(bow_vector)
print(data[0])
print(logprobs)

In [0]:
label_to_ix = {"SPANISH": 0, "ENGLISH": 1}
ix_to_label = {v: k for k, v in label_to_ix.items()}

label_to_ix, ix_to_label

In [0]:
for instance, label in test_data:
    bow_vec = Variable(make_bow_vector(instance, word_to_ix))
    logprobs = bow(bow_vec)
    print(logprobs)
    pred = np.argmax(logprobs.data.numpy())
    print('prediction: {}'.format(ix_to_label[pred]))
    print('actual: {}'.format(label))


In [0]:
# define a loss function and an optimizer
loss_function = nn.NLLLoss()
opt = torch.optim.SGD(bow.parameters(), lr = 0.1)
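
`nn.NLLLoss` expects log-probabilities as input, not raw scores or plain probabilities. The sketch below uses made-up logits to illustrate this: `log_softmax` followed by `NLLLoss` matches `nn.CrossEntropyLoss` applied to the raw scores, whereas feeding plain `softmax` output into `NLLLoss` gives a negative value that is not a valid negative log-likelihood.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5]])   # raw scores for one example, two labels (illustrative values)
target = torch.tensor([0])            # index of the correct label

log_probs = F.log_softmax(logits, dim=1)
nll = nn.NLLLoss()(log_probs, target)
ce = nn.CrossEntropyLoss()(logits, target)   # log_softmax + NLLLoss in one step
print(nll.item(), ce.item())                 # the two losses agree

# Exponentiating log-probabilities recovers probabilities that sum to 1
print(torch.exp(log_probs).sum().item())

# Passing plain probabilities to NLLLoss returns minus the probability of the target
wrong = nn.NLLLoss()(F.softmax(logits, dim=1), target)
print(wrong.item())
```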

In [0]:
# the training loop
for epoch in range(100):
#     print(epoch)
    for instance, label in data:
        # get the training data
        bow.zero_grad()
        bow_vec = Variable(make_bow_vector(instance, word_to_ix))
        label = Variable(make_target(label, label_to_ix))
        log_probs = bow(bow_vec) # forward pass
        loss = loss_function(log_probs, label) # NLLLoss compares the model output with the target index
        loss.backward()
#        print('CURRENT LOSS: {}'.format(loss.data))
        opt.step()
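
To confirm that training is making progress, it helps to record the average loss per epoch. The following is a self-contained sketch of the same loop structure on a made-up toy problem: the input vectors, labels, and the `epoch_losses` list are assumptions for illustration, and a bare `nn.Linear` stands in for the classifier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy bag-of-words inputs (4 examples, vocabulary of 6 words) with labels; illustrative values only
inputs = torch.tensor([[1., 1., 0., 0., 0., 0.],
                       [0., 0., 1., 1., 0., 0.],
                       [1., 0., 0., 0., 1., 0.],
                       [0., 0., 1., 0., 0., 1.]])
targets = torch.tensor([0, 1, 0, 1])

model = nn.Linear(6, 2)
loss_function = nn.NLLLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(100):
    total = 0.0
    for x, y in zip(inputs, targets):
        opt.zero_grad()                                   # clear gradients from the previous step
        log_probs = F.log_softmax(model(x.view(1, -1)), dim=1)
        loss = loss_function(log_probs, y.view(1))
        loss.backward()                                   # compute gradients
        opt.step()                                        # update the parameters
        total += loss.item()
    epoch_losses.append(total / len(inputs))              # average loss over this epoch

print('first epoch: {:.4f}, last epoch: {:.4f}'.format(epoch_losses[0], epoch_losses[-1]))
```

A loss that fails to decrease over the epochs usually points to a problem with the learning rate, the loss inputs, or the tensor shapes.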


In [0]:
print('--- AFTER TRAINING ---')
for instance, label in test_data:
    bow_vec = Variable(make_bow_vector(instance, word_to_ix))
    logprobs = bow(bow_vec)
    print(logprobs)
    pred = np.argmax(logprobs.data.numpy())
    print('prediction: {}'.format(ix_to_label[pred]))
    print('actual: {}'.format(label))

In [0]:
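# Save only the learned parameters (the state_dict) rather than pickling the whole model object.
# The file is a binary checkpoint, not JSON, so a .pt extension is used.
torch.save(bow.state_dict(), 'model.pt')

from google.colab import files
files.download('model.pt')

In [0]:
from google.colab import files
temp_test = files.upload()


In [0]:
# Rebuild the architecture, then load the uploaded parameters into it
restored = BOWClassifier(NUM_LABELS, VOCAB_SIZE)
restored.load_state_dict(torch.load(io.BytesIO(temp_test['model.pt'])))
restored.eval()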