<a href="https://colab.research.google.com/github/shipra-bhadauria/NLP_practice/blob/main/LSTM_WITH_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Here we’ll use a dataset of movie reviews, each accompanied by a sentiment label: positive or negative.
We’ll classify the reviews with a recurrent neural network (RNN), specifically an LSTM.

Load in and visualize the data

In [None]:
import numpy as np

# read data from text files
with open('reviews.txt', 'r') as f:
    reviews = f.read()
with open('labels.txt', 'r') as f:
    labels = f.read()

Print some content:

In [None]:
print(reviews[:1000])  # first 1000 characters of the reviews
print()
print(labels[:20])  # first 20 characters of the labels

bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   
story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turn

In [None]:
from string import punctuation

Convert to lowercase and remove punctuation

In [None]:
# lowercase everything so that 'The' and 'the' count as the same word
reviews = reviews.lower()
# get rid of punctuation
all_text = ''.join([c for c in reviews if c not in punctuation])
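As a quick illustration of what this cleaning step does, here is a minimal sketch on a made-up sentence (the `sample` string is an assumption for demonstration, not taken from `reviews.txt`):

```python
from string import punctuation

# Hypothetical sample text, used only to illustrate the cleaning step.
sample = "Bromwell High is a cartoon comedy. It isn't far-fetched!"

# Same logic as above: lowercase, then drop every punctuation character.
cleaned = ''.join(c for c in sample.lower() if c not in punctuation)
print(cleaned)  # bromwell high is a cartoon comedy it isnt farfetched
```

Note that removing punctuation character by character fuses hyphenated words and contractions ("far-fetched" becomes "farfetched"). That is acceptable for this exercise, but it does change the vocabulary.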

Create list of reviews

In [None]:
# split on newlines to get one string per review,
# then rejoin everything into a single text for building the vocabulary
reviews_split = all_text.split('\n')
all_text = ' '.join(reviews_split)

# create a list of words
words = all_text.split()

In [None]:
words[:20]

['bromwell',
 'high',
 'is',
 'a',
 'cartoon',
 'comedy',
 'it',
 'ran',
 'at',
 'the',
 'same',
 'time',
 'as',
 'some',
 'other',
 'programs',
 'about',
 'school',
 'life',
 'such']

Tokenize - Create Vocab to Int mapping dictionary

In [None]:
from collections import Counter
## Build a dictionary that maps words to integers 
counts = Counter(words)

In [None]:
# sort the words by frequency, most common first
vocab = sorted(counts, key=counts.get, reverse=True)
print(vocab)



In [None]:
# number the words from 1 so that 0 stays free for padding
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

In [None]:
print(vocab_to_int)



The most frequent word, ‘the’, now maps to 1. Numbering starts at 1 so that 0 stays free for padding.
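To make the frequency-ordered mapping concrete, here is a small sketch on a toy word list (the sentence is made up for illustration; the three steps mirror the cells above):

```python
from collections import Counter

# Toy corpus, assumed for illustration only.
toy_words = "the movie was good the acting was good the end".split()

toy_counts = Counter(toy_words)
# Most frequent word first.
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
# Start numbering at 1, leaving 0 unused.
toy_vocab_to_int = {word: i for i, word in enumerate(toy_vocab, 1)}

print(toy_vocab_to_int['the'])          # 1, because 'the' occurs most often
print(sorted(toy_vocab_to_int.values()))  # [1, 2, 3, 4, 5, 6]
```

Because no word maps to 0, that value can later be used unambiguously as a padding token.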

Tokenize — Encode the words

In [None]:
## use the dict to tokenize each review in reviews_split
## store the tokenized reviews in reviews_ints
reviews_ints = []
for review in reviews_split:
    reviews_ints.append([vocab_to_int[word] for word in review.split()])

In [None]:
print('Unique words: ', len(vocab_to_int))
# print the tokens of the first review
print('Tokenized review: \n', reviews_ints[:1])

Unique words:  64404
Tokenized review: 
 [[17570, 314, 6, 3, 1076, 203, 8, 2136, 32, 1, 169, 58, 15, 49, 79, 5514, 44, 401, 110, 137, 15, 4679, 60, 152, 9, 1, 5028, 6029, 476, 70, 5, 256, 12, 17570, 314, 13, 2254, 6, 74, 2431, 5, 583, 73, 6, 4679, 1, 22193, 5, 2056, 9233, 1, 6227, 1536, 36, 52, 66, 205, 144, 67, 1177, 4679, 18818, 1, 32355, 4, 1, 222, 905, 31, 3109, 70, 4, 1, 5856, 10, 689, 2, 67, 1536, 54, 10, 213, 1, 360, 9, 62, 3, 1423, 3920, 810, 5, 3701, 179, 1, 401, 10, 1195, 15647, 32, 314, 3, 359, 344, 3282, 10, 142, 127, 5, 7337, 30, 4, 129, 4679, 1423, 2340, 5, 17570, 314, 10, 518, 12, 109, 1504, 4, 60, 543, 102, 12, 17570, 314, 6, 229, 4193, 48, 3, 2247, 12, 8, 221, 22]]
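The lookup above works because every word in the corpus is in `vocab_to_int`. For text that was not used to build the vocabulary, a direct lookup would raise a `KeyError`. One possible way to handle that, sketched here under the assumption that unknown words are simply skipped (the `encode` helper and the tiny vocabulary are hypothetical):

```python
# Tiny stand-in vocabulary, assumed for illustration.
small_vocab_to_int = {'the': 1, 'was': 2, 'good': 3, 'movie': 4}

def encode(review, mapping):
    """Map each known word to its integer; skip words missing from the mapping."""
    return [mapping[word] for word in review.split() if word in mapping]

encoded_review = encode("the movie was surprisingly good", small_vocab_to_int)
print(encoded_review)  # [1, 4, 2, 3]
```

Skipping unknown words is only one option; reserving a dedicated "unknown" index is another common choice.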


Tokenize — Encode the labels

In [None]:
# 1=positive, 0=negative label conversion 
labels_split = labels.split('\n') 
encoded_labels = np.array([1 if label == 'positive' else 0 for label in labels_split])
print(labels_split)

['positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'negative', 'positive', 'ne
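One thing worth checking with `split('\n')`: if the text ends with a newline, the result contains a trailing empty string, which the encoding above would silently turn into a 0 (negative) label. Whether `labels.txt` ends that way is not shown here, so the following is only a sketch on a made-up string:

```python
# Toy label text; the trailing newline is an assumption for illustration.
toy_labels = "positive\nnegative\npositive\n"

naive_split = toy_labels.split('\n')
print(naive_split)  # ['positive', 'negative', 'positive', '']

# splitlines() does not produce the trailing empty entry.
clean_split = toy_labels.splitlines()
toy_encoded = [1 if label == 'positive' else 0 for label in clean_split]
print(toy_encoded)  # [1, 0, 1]
```

A quick `len(encoded_labels) == len(reviews_ints)` comparison is a cheap way to confirm that labels and reviews stay aligned.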

Removing Outliers — Getting rid of extremely long or short reviews

In [None]:
# outlier review stats
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

Zero-length reviews: 0
Maximum review length: 2514
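The cell above only reports the statistics. Here there are no zero-length reviews, so nothing needs removing. If there were, the reviews and their labels would have to be filtered together. A minimal sketch on toy data (the variable names prefixed with `toy_` are assumptions for illustration):

```python
# Toy tokenized reviews with one empty entry, plus matching labels.
toy_reviews_ints = [[5, 2], [], [7, 1, 3]]
toy_encoded_labels = [1, 0, 1]

# Indices of the reviews that contain at least one token.
keep_idx = [i for i, review in enumerate(toy_reviews_ints) if len(review) > 0]

# Filter reviews and labels with the same indices so they stay aligned.
toy_reviews_ints = [toy_reviews_ints[i] for i in keep_idx]
toy_encoded_labels = [toy_encoded_labels[i] for i in keep_idx]

print(toy_reviews_ints)    # [[5, 2], [7, 1, 3]]
print(toy_encoded_labels)  # [1, 1]
```

Overly long reviews are not dropped in this notebook; they are truncated in the padding step instead.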


Padding and truncating reviews to a fixed length

In [None]:
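def pad_features(reviews_ints, seq_length):
    """Left-pad each review with 0s, or truncate it, to exactly seq_length tokens."""
    features = np.zeros((len(reviews_ints), seq_length), dtype=int)
    for i, row in enumerate(reviews_ints):
        if len(row) == 0:
            continue  # an empty review stays all zeros
        # short rows fill the right-hand side; long rows keep their first seq_length tokens
        features[i, -len(row):] = np.array(row)[:seq_length]
    return features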
# Test your implementation!

seq_length = 200

features = pad_features(reviews_ints, seq_length=seq_length)

## test statements - do not change - ##
assert len(features)==len(reviews_ints), "Your features should have as many rows as reviews."
assert len(features[0])==seq_length, "Each feature row should contain seq_length values."

# print the first 10 values of the first 30 rows (reviews)
print(features[:30,:10])

[[    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [18819    42 40223    15   728 16557  3398    47    75    35]
 [ 4570   515    15     3  3884   162  7789  1596     6  4571]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [   54    10    14   119    60   831   554    70   351     5]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    1   332   578    34     3   162   755  2682     9   323]
 [    9    11 10378  5091  1953   687   434    21   269   669]
 [    0     0     0     0     0     0     0     0     0
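The printed rows show both behaviours: short reviews start with zeros (left padding), while long reviews start with real tokens because they were truncated. A small sketch with toy input and `seq_length=5` makes this easier to see. It reuses the same slicing logic as `pad_features`; the helper name `pad_toy` and the toy data are assumptions for illustration:

```python
import numpy as np

def pad_toy(rows, seq_length):
    """Same slicing idea as pad_features, applied to toy (non-empty) rows."""
    out = np.zeros((len(rows), seq_length), dtype=int)
    for i, row in enumerate(rows):
        # A negative start index beyond the row width is clamped to the start.
        out[i, -len(row):] = np.array(row)[:seq_length]
    return out

toy_rows = [[1, 2, 3], [4, 5, 6, 7, 8, 9, 10]]
toy_padded = pad_toy(toy_rows, seq_length=5)
print(toy_padded)
# [[0 0 1 2 3]
#  [4 5 6 7 8]]
```

The first row is shorter than 5 and gets zeros on the left. The second is longer and keeps only its first 5 tokens.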

Training, Validation, Test Dataset Split

In [None]:
split_frac = 0.8

## split data into training, validation, and test data (features and labels, x and y)

split_idx = int(len(features)*split_frac)
train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]

test_idx = int(len(remaining_x)*0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]

## print out the shapes of your resultant feature data
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

			Feature Shapes:
Train set: 		(14379, 200) 
Validation set: 	(1797, 200) 
Test set: 		(1798, 200)
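The printed shapes can be checked by hand. The sketch below reproduces the index arithmetic, assuming a total of 17974 reviews (a number inferred by adding the three printed set sizes, not stated elsewhere in the notebook):

```python
# Total inferred from the printed shapes: 14379 + 1797 + 1798.
n_reviews = 14379 + 1797 + 1798
split_frac = 0.8

split_idx = int(n_reviews * split_frac)   # size of the training set
n_remaining = n_reviews - split_idx       # what is left for validation + test
test_idx = int(n_remaining * 0.5)         # size of the validation set
n_val, n_test = test_idx, n_remaining - test_idx

print(split_idx, n_val, n_test)  # 14379 1797 1798
```

Because `int()` truncates, the odd remainder of 3595 reviews splits into 1797 for validation and 1798 for test. The split is sequential, so the sets keep the order of the original files.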


Dataloaders and Batching

In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader
batch_size = 50
# wrap the NumPy arrays as tensors and serve them in fixed-size batches
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
train_loader = DataLoader(train_data, batch_size=batch_size)

In [None]:
# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)  # the iterator's .next() method is not available in recent PyTorch versions

print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

Sample input size:  torch.Size([50, 200])
Sample input: 
 tensor([[    0,     0,     0,  ...,     8,   221,    22],
        [    0,     0,     0,  ...,    29,   111,  3223],
        [18819,    42, 40223,  ...,   489,    17,     3],
        ...,
        [    0,     0,     0,  ...,    23,   395,   242],
        [    0,     0,     0,  ...,     7,  2738,  4578],
        [    0,     0,     0,  ...,  1502,    70,   348]])

Sample label size:  torch.Size([50])
Sample label: 
 tensor([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
        1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
        1, 0])
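With 14379 training reviews and a batch size of 50, the batches do not divide evenly. The following arithmetic sketch (plain Python, no PyTorch required) shows how many batches one pass over the training set yields and how large the last one is:

```python
import math

n_train = 14379   # training set size from the printed shapes
batch_size = 50

n_batches = math.ceil(n_train / batch_size)
# If the division were exact, the last batch would be a full one.
last_batch_size = n_train % batch_size or batch_size

print(n_batches)        # 288
print(last_batch_size)  # 29
```

A smaller final batch matters if later code assumes every batch has exactly `batch_size` rows, as code that pre-allocates an LSTM hidden state often does. Passing `drop_last=True` to `DataLoader` is one way to avoid it, at the cost of skipping those last 29 reviews each epoch.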
