### Project - Analyzing Movie Reviews using NLTK

![imdb](https://upload.wikimedia.org/wikipedia/commons/6/69/IMDB_Logo_2016.svg)

In [9]:
import pandas as pd
imdb_data = pd.read_csv('IMDB Dataset.csv')

In [10]:
reviews = imdb_data.review

Transform the `reviews` object into a list of reviews.

Tokenize every review in `list_reviews` into words.

In [17]:
import nltk
# Plain Python list of review strings, as asked for above
list_reviews = reviews.tolist()
tokenized_reviews = [nltk.tokenize.word_tokenize(review) for review in list_reviews]

In [18]:
tokenized_reviews[0:10]

[['One',
  'of',
  'the',
  'other',
  'reviewers',
  'has',
  'mentioned',
  'that',
  'after',
  'watching',
  'just',
  '1',
  'Oz',
  'episode',
  'you',
  "'ll",
  'be',
  'hooked',
  '.',
  'They',
  'are',
  'right',
  ',',
  'as',
  'this',
  'is',
  'exactly',
  'what',
  'happened',
  'with',
  'me.',
  '<',
  'br',
  '/',
  '>',
  '<',
  'br',
  '/',
  '>',
  'The',
  'first',
  'thing',
  'that',
  'struck',
  'me',
  'about',
  'Oz',
  'was',
  'its',
  'brutality',
  'and',
  'unflinching',
  'scenes',
  'of',
  'violence',
  ',',
  'which',
  'set',
  'in',
  'right',
  'from',
  'the',
  'word',
  'GO',
  '.',
  'Trust',
  'me',
  ',',
  'this',
  'is',
  'not',
  'a',
  'show',
  'for',
  'the',
  'faint',
  'hearted',
  'or',
  'timid',
  '.',
  'This',
  'show',
  'pulls',
  'no',
  'punches',
  'with',
  'regards',
  'to',
  'drugs',
  ',',
  'sex',
  'or',
  'violence',
  '.',
  'Its',
  'is',
  'hardcore',
  ',',
  'in',
  'the',
  'classic',
  'use',
  'of',
  't

Clean the tokens of every review in the list, namely:
- Remove stop words.
- Lowercase every word.
- Remove punctuation.

In [19]:
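import string
from nltk.corpus import stopwords

punctuation = string.punctuation
# A set gives constant-time membership checks across millions of tokens
stop_words = set(stopwords.words('english'))

cleaned_tokens = []

for review in tokenized_reviews:
    cleaned_review = []
    for word in review:
        lowered = word.lower()
        # Skip stop words and tokens found in string.punctuation. This is a
        # substring test, so tokens such as `` and '' are not filtered out.
        if lowered in punctuation or lowered in stop_words:
            continue
        cleaned_review.append(lowered)
    # Append the cleaned review to cleaned_tokens
    cleaned_tokens.append(cleaned_review)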

In [20]:
nltk.FreqDist(cleaned_tokens[80]).most_common(10)

[('movie', 3),
 ('man', 3),
 ('stephen', 2),
 ('hawkings', 2),
 ('makes', 2),
 ("'s", 2),
 ('theories', 2),
 ('universe', 2),
 ('black', 2),
 ('holes', 2)]

Check the top 10 words of the entire corpus:

In [21]:
bag_words = []

for review in cleaned_tokens:
    for word in review:
        bag_words.append(word)

In [22]:
nltk.FreqDist(bag_words).most_common(10)

[('br', 201951),
 ("'s", 122130),
 ('movie', 85070),
 ('film', 76919),
 ("''", 66440),
 ("n't", 66243),
 ('``', 65690),
 ('one', 51828),
 ('like', 39182),
 ('good', 28767)]
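
The most frequent token, `br`, comes from the `<br />` HTML line breaks embedded in the raw reviews, not from the reviewers' vocabulary. A minimal sketch of one way to strip those tags before tokenizing, using only the standard library; the helper name `strip_br_tags` and the sample string are illustrative and not part of the original notebook:

```python
import re

# Matches <br>, <br/> and <br /> in any letter case
BR_TAG = re.compile(r'<br\s*/?>', flags=re.IGNORECASE)

def strip_br_tags(text):
    """Replace HTML line-break tags with a single space."""
    return BR_TAG.sub(' ', text)

sample = "exactly what happened with me.<br /><br />The first thing that struck me"
cleaned_sample = strip_br_tags(sample)
print(cleaned_sample)
```

Applied to each review before tokenizing (for example with `reviews.apply(strip_br_tags)`), this should remove `br` from the frequency table; the counts shown above would change accordingly.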

Use `pos_tag` to pair every token with its part-of-speech (POS) tag. Note that the cell below tags only the first 10,000 reviews.

In [23]:
tagged_reviews = [nltk.tag.pos_tag(review) for review in cleaned_tokens[0:10000]]

Based on the `tagged_reviews` object, create a new list of lists called `adjectives` that holds, for each review, the list of its adjectives.

In [24]:
adjectives = []
for review in tagged_reviews:
    adj_review = []
    for word_tag in review:
        if word_tag[1].startswith('JJ'):
            adj_review.append(word_tag[0])
    adjectives.append(adj_review)
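
`pos_tag` uses the Penn Treebank tag set by default, where `JJ` marks a plain adjective, `JJR` a comparative and `JJS` a superlative, so the `startswith('JJ')` test keeps all three. A small sketch with hand-written tags (the tagged pairs below are made up for illustration, not actual tagger output):

```python
# Hand-written (word, tag) pairs standing in for one tagged review
tagged_review = [
    ('good', 'JJ'), ('movie', 'NN'), ('better', 'JJR'),
    ('best', 'JJS'), ('really', 'RB'), ('watched', 'VBD'),
]

# Same filter as the loop above, written as a list comprehension
adjectives_in_review = [word for word, tag in tagged_review if tag.startswith('JJ')]
print(adjectives_in_review)  # ['good', 'better', 'best']
```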

Based on the `sentiment` column of the dataframe `imdb_data`, split the `adjectives` list into two lists: `adjectives_positive` and `adjectives_negative`. `adjectives_positive` should be a flat list (not a list of lists) with all adjectives that come from positive reviews. `adjectives_negative` should be a similar list with all adjectives that come from negative reviews.

In [25]:
def get_adjectives_sentiment(adjectives, sentiment_column, sentiment):
    sentiment_list = []
    for index, adjectives_review in enumerate(adjectives):
        if sentiment_column[index] == sentiment:
            sentiment_list.extend(adjectives_review)
    return sentiment_list

In [26]:
adjectives_positive = get_adjectives_sentiment(adjectives, imdb_data.sentiment, 'positive')
adjectives_negative = get_adjectives_sentiment(adjectives, imdb_data.sentiment, 'negative')

Extract the 50 most common adjectives for negative and for positive reviews. Save them in a dataframe with the number of times each adjective appears in positive and in negative reviews. If an adjective is in only one of the two top-50 lists, set its count on the other side to `0`. For example, an adjective that is in the negative top 50 but not in the positive top 50 gets a `positive_count` of `0`.

In [27]:
top_positives = pd.DataFrame(
    [
        [count[0] for count in nltk.FreqDist(adjectives_positive).most_common(50)],
        [count[1] for count in nltk.FreqDist(adjectives_positive).most_common(50)]
    ],
    index = ['adjective','positive_count']
).T

top_negatives = pd.DataFrame(
    [
        [count[0] for count in nltk.FreqDist(adjectives_negative).most_common(50)],
        [count[1] for count in nltk.FreqDist(adjectives_negative).most_common(50)]
    ],
    index = ['adjective','negative_count']
).T

In [28]:
top_adjectives = top_positives.merge(top_negatives, on='adjective', how='outer').fillna(0)
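
The outer merge keeps every adjective that is in either top-50 list, and `fillna(0)` turns the missing side into a zero. The same bookkeeping can be sketched in plain Python with `collections.Counter`; the two short word lists and `top_n = 2` are toy values chosen for illustration:

```python
from collections import Counter

positive_words = ['great', 'great', 'good', 'good', 'good', 'new']
negative_words = ['bad', 'bad', 'bad', 'good', 'good', 'old']
top_n = 2  # the notebook uses 50

top_pos = dict(Counter(positive_words).most_common(top_n))
top_neg = dict(Counter(negative_words).most_common(top_n))

# Union of both top lists; a word missing from one side gets a count of 0
combined = {
    word: (top_pos.get(word, 0), top_neg.get(word, 0))
    for word in sorted(set(top_pos) | set(top_neg))
}
print(combined)
```

Keep in mind that a `0` produced this way means the word fell outside the other list's top entries, not necessarily that it never occurs there.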

Which adjective is the most overweighted in negative reviews, meaning that it appears very often in negative reviews but not in positive ones?

In [29]:
top_adjectives.sort_values(by='negative_count', ascending=False).head(10)

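| | adjective | positive_count | negative_count |
|---|---|---|---|
| 0 | good | 2809 | 2866 |
| 13 | bad | 694 | 2865 |
| 2 | br | 1868 | 2049 |
| 6 | much | 1323 | 1414 |
| 5 | little | 1328 | 1176 |
| 3 | many | 1473 | 1149 |
| 1 | great | 2575 | 1019 |
| 7 | real | 1008 | 888 |
| 8 | first | 974 | 877 |
| 11 | old | 746 | 771 |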


And in the positive reviews? Is more than one adjective overweighted?

In [30]:
top_adjectives.sort_values(by='positive_count', ascending=False).head(10)

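| | adjective | positive_count | negative_count |
|---|---|---|---|
| 0 | good | 2809 | 2866 |
| 1 | great | 2575 | 1019 |
| 2 | br | 1868 | 2049 |
| 3 | many | 1473 | 1149 |
| 4 | best | 1371 | 662 |
| 5 | little | 1328 | 1176 |
| 6 | much | 1323 | 1414 |
| 7 | real | 1008 | 888 |
| 8 | first | 974 | 877 |
| 9 | new | 914 | 678 |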
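
One way to make "overweighted" concrete is the ratio between the two counts. The sketch below hard-codes a few rows from the two tables above and ranks them by that ratio; the ratio is an illustrative choice, not a metric the notebook itself defines:

```python
# (positive_count, negative_count) copied from the tables above
counts = {
    'good': (2809, 2866),
    'bad': (694, 2865),
    'great': (2575, 1019),
    'best': (1371, 662),
    'many': (1473, 1149),
    'new': (914, 678),
}

# Ratio > 1 leans negative, ratio < 1 leans positive
neg_to_pos = {adj: neg / pos for adj, (pos, neg) in counts.items()}
ranked = sorted(neg_to_pos, key=neg_to_pos.get, reverse=True)

for adj in ranked:
    print(f"{adj:<6} {neg_to_pos[adj]:.2f}")
```

On these rows, `bad` stands out on the negative side, while `great` and `best` both lean clearly positive.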


Based on the `cleaned_tokens` object, stem all words available in our reviews and save the object in a new list of lists.

In [31]:
from nltk.stem import SnowballStemmer
snowball = SnowballStemmer(language='english')

In [32]:
stemmed_tokens = []
for review in cleaned_tokens:
    stemmed_review = []
    for token in review:
        stemmed_token = snowball.stem(token)
        stemmed_review.append(stemmed_token)
    stemmed_tokens.append(stemmed_review)

Add the fraction of retained data (the number of characters left for each review in `stemmed_tokens`, divided by the number of characters in the original review) to `imdb_data`.


In [33]:
perc_retained = []
for index, review in enumerate(stemmed_tokens):
    # Characters kept after cleaning and stemming, relative to the raw review
    perc_retained.append(len(' '.join(review)) / len(reviews[index]))

In [34]:
imdb_data['perc_retain_stemming'] = perc_retained

In [35]:
imdb_data.sort_values(by='perc_retain_stemming',ascending=True)

| | review | sentiment | perc_retain_stemming |
|---|---|---|---|
| 48927 | Smallville episode Justice is the best episode... | positive | 0.110272 |
| 39182 | Smallville episode Justice is the best episode... | positive | 0.110272 |
| 45723 | What is the story what is it on the screen. At... | negative | 0.390173 |
| 48697 | Maybe it was the fact that I saw Spider-man th... | negative | 0.391213 |
| 11645 | No offense to anyone who saw this and liked it... | negative | 0.404545 |
| ... | ... | ... | ... |
| 11926 | I wouldn't rent this one even on dollar rental... | negative | 0.811321 |
| 30527 | As so many others have written, this is a wond... | positive | 0.813246 |
| 28920 | Primary plot!Primary direction!Poor interpreta... | negative | 0.823529 |
| 36844 | OZ is the greatest show ever mad full stop.OZ ... | positive | 0.836257 |


In [36]:
imdb_data.loc[48927,'review']

"Smallville episode Justice is the best episode of Smallville ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! It's my favorite episode of Smallville! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !"

In [37]:
' '.join(stemmed_tokens[48927])

"smallvill episod justic best episod smallvill 's favorit episod smallvill"

- The stemmed review retained few of the original characters because of the removed punctuation, not because of stemming. It can probably be excluded from the analysis, as it contains little meaningful text.
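
To check that punctuation, rather than stemming, drives a low retention value, one can measure how much of the raw text is punctuation or whitespace. A rough sketch on a shortened, made-up imitation of the review above (the helper `non_word_share` is hypothetical and not part of the notebook):

```python
import string

def non_word_share(text):
    """Fraction of characters that are punctuation or whitespace."""
    if not text:
        return 0.0
    non_word = sum(ch in string.punctuation or ch.isspace() for ch in text)
    return non_word / len(text)

# Two words followed by 50 spaced exclamation marks
sample = "Best episode" + " !" * 50
share = non_word_share(sample)
print(f"{share:.2f}")
```

A review whose share is close to 1 carries almost no words, so a low `perc_retain_stemming` for it says little about the stemmer.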