<h1>Part-of-Speech Tagging for Sentiment Analysis</h1>
Part-of-speech tagging is the process of converting a sentence, in the form of a list of words,
into a list of tuples, where each tuple is of the form (word, tag). The tag is a part-of-speech
tag, and signifies whether the word is a noun, adjective, verb, and so on.
<br>
<br>
Most NLTK taggers are trainable. They take a list of tagged sentences as training data, such as
the one returned by the tagged_sents() method of a TaggedCorpusReader. From these training
sentences, the tagger builds an internal model that tells it how to tag a word. Other taggers
use external data sources or match word patterns to choose a tag for a word.
All taggers in NLTK live in the nltk.tag package. Many taggers can also be combined into a backoff
chain, so that if one tagger cannot tag a word, the next tagger is used, and so on.
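
As a minimal sketch of such a backoff chain, the snippet below trains a UnigramTagger on a tiny hand-written training set (invented for illustration, not the treebank data used later) and falls back to a DefaultTagger that labels every unknown word as NN.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Toy training data, invented for illustration only.
toy_train = [[("it", "PRP"), ("is", "VBZ"), ("good", "JJ")]]

# Last link of the chain: tags every word as a noun.
default_tagger = DefaultTagger("NN")

# First link: looks each word up; unknown words are handed to the backoff.
chain = UnigramTagger(toy_train, backoff=default_tagger)

tagged = chain.tag(["it", "is", "zorblax"])
print(tagged)
```

The made-up word "zorblax" never appears in the training data, so the unigram tagger cannot decide and the default tagger supplies the tag.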

# POS tagging import

In [55]:
from nltk import *
import nltk
from nltk.tag import UnigramTagger
from nltk.corpus import treebank

nltk.download('treebank')

import unicodedata
from nltk.tokenize import TreebankWordTokenizer
import re

import pandas as pd
nltk.download('stopwords')
nltk.download('sentiwordnet')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

## Treebank

In [2]:
treebank.sents()[0]

['Pierre',
 'Vinken',
 ',',
 '61',
 'years',
 'old',
 ',',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov.',
 '29',
 '.']

In [3]:
treebank.tagged_sents()[0]

[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT'),
 ('board', 'NN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('nonexecutive', 'JJ'),
 ('director', 'NN'),
 ('Nov.', 'NNP'),
 ('29', 'CD'),
 ('.', '.')]

In [4]:
treebank.tagged_sents()[0:2]

[[('Pierre', 'NNP'),
  ('Vinken', 'NNP'),
  (',', ','),
  ('61', 'CD'),
  ('years', 'NNS'),
  ('old', 'JJ'),
  (',', ','),
  ('will', 'MD'),
  ('join', 'VB'),
  ('the', 'DT'),
  ('board', 'NN'),
  ('as', 'IN'),
  ('a', 'DT'),
  ('nonexecutive', 'JJ'),
  ('director', 'NN'),
  ('Nov.', 'NNP'),
  ('29', 'CD'),
  ('.', '.')],
 [('Mr.', 'NNP'),
  ('Vinken', 'NNP'),
  ('is', 'VBZ'),
  ('chairman', 'NN'),
  ('of', 'IN'),
  ('Elsevier', 'NNP'),
  ('N.V.', 'NNP'),
  (',', ','),
  ('the', 'DT'),
  ('Dutch', 'NNP'),
  ('publishing', 'VBG'),
  ('group', 'NN'),
  ('.', '.')]]

In [5]:
train_sents = treebank.tagged_sents()[:3000]
train_sents

[[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')], [('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'), ('of', 'IN'), ('Elsevier', 'NNP'), ('N.V.', 'NNP'), (',', ','), ('the', 'DT'), ('Dutch', 'NNP'), ('publishing', 'VBG'), ('group', 'NN'), ('.', '.')], ...]

## UnigramTagger

In [6]:
tagger = UnigramTagger(train_sents)

In [7]:
tokenizer = TreebankWordTokenizer()
tokens=tokenizer.tokenize("it is good")
tokens

['it', 'is', 'good']

In [8]:
tagger.tag(tokens)

[('it', 'PRP'), ('is', 'VBZ'), ('good', 'JJ')]

In [9]:
tagger.tag(["good"])

[('good', 'JJ')]

We use the first 3000 tagged sentences of the treebank corpus as the training set to
initialize the UnigramTagger class. We then tokenize the sentence "it is good" into a list of words
and see how the tag() function transforms it into a list of (word, tag) tuples.
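
To make the "internal model" concrete, here is a simplified pure-Python sketch of what a unigram tagger learns: for each word, the tag it carries most often in the training data. The function names and the toy sentences are hypothetical, and NLTK's own implementation offers more options (such as a cutoff and a backoff tagger).

```python
from collections import Counter, defaultdict

def train_unigram_model(tagged_sents):
    """Map each word to its most frequent tag in the training sentences."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_with_model(model, tokens):
    """Tag tokens by dictionary lookup; unseen words get None."""
    return [(tok, model.get(tok)) for tok in tokens]

# Toy data: "board" is a noun twice and a verb once.
toy_sents = [
    [("the", "DT"), ("board", "NN")],
    [("they", "PRP"), ("board", "VBP"), ("the", "DT"), ("plane", "NN")],
    [("a", "DT"), ("board", "NN")],
]
model = train_unigram_model(toy_sents)
toy_tagged = tag_with_model(model, ["the", "board", "meeting"])
print(toy_tagged)
```

Because the model looks at each word in isolation, "board" is always tagged NN here, and a word that was never seen in training receives no tag at all.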

## Averaged Perceptron Tagger

In [None]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [None]:
nltk.pos_tag(nltk.word_tokenize("it is good"))

[('it', 'PRP'), ('is', 'VBZ'), ('good', 'JJ')]

# SentiWordNet

http://wordnetweb.princeton.edu/perl/webwn

In [70]:
from nltk.corpus import sentiwordnet as swn


In [None]:
list(swn.senti_synsets('computer'))

[SentiSynset('computer.n.01'), SentiSynset('calculator.n.01')]

In [None]:
list(swn.senti_synsets('computer', 'n'))

In [None]:
list(swn.senti_synsets('computer', 'n'))[0].obj_score()

1.0

In [None]:
list(swn.senti_synsets('computer', 'n'))[0].pos_score()

0.0

In [None]:
list(swn.senti_synsets('computer', 'n'))[0].neg_score()

0.0

In [None]:
list(swn.senti_synsets('good'))

[SentiSynset('good.n.01'),
 SentiSynset('good.n.02'),
 SentiSynset('good.n.03'),
 SentiSynset('commodity.n.01'),
 SentiSynset('good.a.01'),
 SentiSynset('full.s.06'),
 SentiSynset('good.a.03'),
 SentiSynset('estimable.s.02'),
 SentiSynset('beneficial.s.01'),
 SentiSynset('good.s.06'),
 SentiSynset('good.s.07'),
 SentiSynset('adept.s.01'),
 SentiSynset('good.s.09'),
 SentiSynset('dear.s.02'),
 SentiSynset('dependable.s.04'),
 SentiSynset('good.s.12'),
 SentiSynset('good.s.13'),
 SentiSynset('effective.s.04'),
 SentiSynset('good.s.15'),
 SentiSynset('good.s.16'),
 SentiSynset('good.s.17'),
 SentiSynset('good.s.18'),
 SentiSynset('good.s.19'),
 SentiSynset('good.s.20'),
 SentiSynset('good.s.21'),
 SentiSynset('well.r.01'),
 SentiSynset('thoroughly.r.02')]

In [None]:
if not list(swn.senti_synsets('rzezdf', 'n')):
  print("empty list")

empty list


# Sentiment analysis

The main goal of this work package is to determine whether movie reviews are positive or negative.

Download the movie review polarity dataset v2.0 from http://www.cs.cornell.edu/people/pabo/movie-review-data





In [11]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


We extract the data

In [15]:
import tarfile
fname = '/content/drive/My Drive/data_collab/review_polarity.tar.gz'
if fname.endswith("tar.gz"):
    tar = tarfile.open(fname, "r:gz")
    tar.extractall('/content/drive/My Drive/data_collab')
    tar.close()

In [22]:
import os

data_dir = '/content/drive/My Drive/data_collab/txt_sentoken'

# Read every review file of each polarity class into a list of strings.
poslist = []
neglist = []
for file in os.listdir(os.path.join(data_dir, 'pos')):
  with open(os.path.join(data_dir, 'pos', file), 'r') as f:
    poslist.append(f.read())

for file in os.listdir(os.path.join(data_dir, 'neg')):
  with open(os.path.join(data_dir, 'neg', file), 'r') as f:
    neglist.append(f.read())
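
The loading loop can also be factored into a small helper. This is only a sketch, assuming every review is stored as a .txt file; read_reviews is a hypothetical name, and the demo runs on a temporary directory with made-up files rather than on the Drive paths above.

```python
import tempfile
from pathlib import Path

def read_reviews(directory):
    """Return the text of every .txt file in `directory`, sorted by file name."""
    paths = sorted(Path(directory).glob("*.txt"))
    return [p.read_text(encoding="utf-8") for p in paths]

# Demo on a throwaway directory containing two made-up files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b.txt").write_text("second review", encoding="utf-8")
    Path(tmp, "a.txt").write_text("first review", encoding="utf-8")
    demo_reviews = read_reviews(tmp)

print(demo_reviews)
```

Sorting the file names makes the order of the reviews reproducible, whereas os.listdir returns entries in arbitrary order.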

In [57]:
raw_reviews = poslist + neglist

Let's clean the data: remove the newline characters (\n) and replace any escaped apostrophes.

In [58]:
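# Remove newline characters from every review.
reviews_clean = [x.replace('\n','') for x in raw_reviews]
# Replace backslash-escaped apostrophes (\') with plain ones.
reviews_clean = [x.replace(r"\'","'") for x in reviews_clean]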
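
Deleting "\n" outright glues two words together whenever a line does not end with a space. A slightly more defensive variant, shown here as a sketch on a made-up sample (normalize_whitespace is a hypothetical helper), collapses every run of whitespace into a single space instead.

```python
def normalize_whitespace(text):
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    return " ".join(text.split())

# Made-up sample in which the lines do not end with a space.
sample = "a great film .\nthe cast is strong .\n"

glued = sample.replace("\n", "")
normalized = normalize_whitespace(sample)

print(repr(glued))
print(repr(normalized))
```

In the first result the period is stuck to the following word, which would change how the text is tokenized; the second keeps the tokens apart.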

In [59]:
len(reviews_clean)

2000

In [60]:
reviews_clean[0:4]

['films adapted from comic books have had plenty of success , whether they\'re about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there\'s never really been a comic book like from hell before . for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid \'80s with a 12-part series called the watchmen . to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . in other words , don\'t dismiss this film because of its source . if you can get past the whole comic book thing , you might find another stumbling block in from hell\'s directors , albert and allen hughes . getting the hughes brothers to direct this seems almost as

First, to identify positivity or negativity, you will POS-tag the reviews and identify the adverbs.

Adverbs are tagged RB by the NLTK POS tagger. Let's identify them:

In [63]:
reviews_postag = []
for sentence in reviews_clean:
  postag = nltk.pos_tag(nltk.word_tokenize(sentence))
  reviews_postag.append(postag)

In [65]:
len(reviews_postag)

2000

In [68]:
adverbs_only = [[x for x in sentence if x[1]=='RB'] for sentence in reviews_postag]
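
The filter above keeps only the plain RB tag. In the Penn Treebank tag set, comparative and superlative adverbs are tagged RBR and RBS. The sketch below, run on a hand-tagged toy review (not real tagger output), shows one way to keep those as well; the adverbs helper and its include_graded parameter are hypothetical.

```python
# Hand-tagged toy review, for illustration only.
toy_review = [
    ("the", "DT"), ("plot", "NN"), ("moves", "VBZ"), ("quickly", "RB"),
    ("and", "CC"), ("ends", "VBZ"), ("better", "RBR"), ("than", "IN"),
    ("expected", "VBN"), (".", "."),
]

def adverbs(tagged_tokens, include_graded=True):
    """Return the adverb tokens of a tagged review."""
    wanted = {"RB", "RBR", "RBS"} if include_graded else {"RB"}
    return [(word, tag) for word, tag in tagged_tokens if tag in wanted]

plain_only = adverbs(toy_review, include_graded=False)
all_adverbs = adverbs(toy_review)
print(plain_only)
print(all_adverbs)
```

Whether graded adverbs help the classification is an empirical question; the set of accepted tags is easy to adjust either way.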

Second, you will use a lexical resource to classify the adverbs found.
SentiWordNet is a lexical resource for opinion mining. It assigns three sentiment scores to each
WordNet synset: positivity, negativity, and objectivity.
Each of the three scores lies in the interval [0.0, 1.0], and their sum is 1.0 for each synset.
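
Because the three scores sum to 1.0, objectivity can be derived from the other two. Here is a small sketch with a hypothetical Scores container (not part of NLTK), using the values shown earlier for the first noun sense of "computer" and one invented positive sense.

```python
from dataclasses import dataclass

@dataclass
class Scores:
    pos: float
    neg: float

    @property
    def obj(self):
        # Objectivity is whatever remains after positivity and negativity.
        return 1.0 - (self.pos + self.neg)

computer = Scores(pos=0.0, neg=0.0)       # values from the outputs above
invented_sense = Scores(pos=0.75, neg=0.0)  # made-up values

print(computer.obj)
print(invented_sense.obj)
```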

The scoring strategy is as follows:
<br>
We estimate the prior polarity of a single adverb from the posterior polarities of all its senses.
<br>
Given an adverb with $n$ senses, a formula $f$ is applied independently to all the Pos(s) and all the Neg(s); in the code below, $f$ is the arithmetic mean.
<br>
This produces two scores, $f(posScore)$ and $f(negScore)$, for each adverb. To obtain a unique prior polarity for the adverb, we use one of the following two strategies:

$$
f_m = \left\{
  \begin{array}{ll}
    f(posScore) & \text{if } f(posScore) \ge f(negScore)\\
    -f(negScore) & \text{otherwise}
  \end{array}
\right.
$$
<br>
$$ f_d = f(posScore) - f(negScore) $$
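
As a worked example of the two formulas, take $f$ to be the arithmetic mean and apply it to made-up per-sense scores for two hypothetical words. The names combine_max and combine_diff are illustrative only.

```python
def mean(values):
    return sum(values) / len(values)

def combine_max(pos_scores, neg_scores):
    """f_m: the larger of the two means, negated if negativity dominates."""
    p, n = mean(pos_scores), mean(neg_scores)
    return p if p >= n else -n

def combine_diff(pos_scores, neg_scores):
    """f_d: mean positivity minus mean negativity."""
    return mean(pos_scores) - mean(neg_scores)

# Made-up (positive, negative) scores, one entry per sense.
mostly_pos = ([0.75, 0.5, 0.25], [0.0, 0.25, 0.125])
mostly_neg = ([0.0, 0.125], [0.5, 0.75])

print(combine_max(*mostly_pos), combine_diff(*mostly_pos))  # 0.5 0.375
print(combine_max(*mostly_neg), combine_diff(*mostly_neg))  # -0.625 -0.5625
```

For the first word the mean positivity is 0.5 and the mean negativity 0.125, so $f_m$ keeps 0.5 while $f_d$ subtracts to 0.375. For the second word negativity dominates, so both values are negative.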

In [93]:
def f_posscore(list_synsets):
  # Mean positive score over all senses (0.0 if the word has no senses).
  if not list_synsets:
    return 0.0
  return sum(s.pos_score() for s in list_synsets) / len(list_synsets)

def f_negscore(list_synsets):
  # Mean negative score over all senses (0.0 if the word has no senses).
  if not list_synsets:
    return 0.0
  return sum(s.neg_score() for s in list_synsets) / len(list_synsets)

def f_m(word, pos='a'):
  # pos is a WordNet POS tag: 'a' for adjectives (default), 'r' for adverbs.
  synsets_list = list(swn.senti_synsets(word, pos))
  f_pos_score = f_posscore(synsets_list)
  f_neg_score = f_negscore(synsets_list)

  # Keep the dominant score, negated when negativity wins.
  if f_pos_score >= f_neg_score:
    return f_pos_score
  else:
    return -f_neg_score

def f_d(word, pos='a'):
  # pos is a WordNet POS tag: 'a' for adjectives (default), 'r' for adverbs.
  synsets_list = list(swn.senti_synsets(word, pos))
  f_pos_score = f_posscore(synsets_list)
  f_neg_score = f_negscore(synsets_list)

  return f_pos_score - f_neg_score


Let's explain the $f_m$ and $f_d$ values:
- $f_m$ is whichever of the two scores is larger, with a negative sign when the negative score dominates
- $f_d$ is the difference between them

Here is an example of the two scores for the adjective nice:


In [94]:
f_m('nice')

0.65

In [95]:
f_d('nice')

0.5750000000000001

Now we will apply these two scores to the adverbs-only list and compute the mean for each review.
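
One way this step could look, offered as a sketch and not as the notebook's actual continuation: score every adverb of a review, average the scores, and label the review positive when the mean is non-negative. The toy_scores lookup stands in for f_m or f_d and its values are invented; the zero threshold and the choice to score adverb-free reviews as 0.0 are assumptions as well.

```python
def review_score(tagged_adverbs, score_word):
    """Mean score of a review's adverbs; 0.0 if the review has none."""
    scores = [score_word(word) for word, _tag in tagged_adverbs]
    return sum(scores) / len(scores) if scores else 0.0

def predict_label(score, threshold=0.0):
    """Label a review from its mean adverb score (threshold is an assumption)."""
    return "pos" if score >= threshold else "neg"

# Invented word scores standing in for f_m or f_d.
toy_scores = {"beautifully": 0.5, "badly": -0.75, "really": 0.25}

def toy_score_word(word):
    return toy_scores.get(word, 0.0)

review_a = [("beautifully", "RB"), ("really", "RB")]
review_b = [("badly", "RB"), ("really", "RB")]

score_a = review_score(review_a, toy_score_word)
score_b = review_score(review_b, toy_score_word)
print(score_a, predict_label(score_a))
print(score_b, predict_label(score_b))
```

Passing the scoring function as a parameter makes it easy to compare the two strategies on the same reviews.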

In [None]:
f