# TD2: Part-of-speech tagging for sentiment analysis

Part-of-speech tagging is the process of converting a sentence, in the form of a list of words,
into a list of tuples, where each tuple is of the form (word, tag). The tag is a part-of-speech
tag, and signifies whether the word is a noun, adjective, verb, and so on.

Most NLTK taggers are trainable. They take a list of tagged sentences as training data, such as
the output of the tagged_sents() method of a TaggedCorpusReader. From these training
sentences, the tagger builds an internal model that tells it how to tag a word. Other taggers
use external data sources or match word patterns to choose a tag for a word.
All taggers in NLTK are in the nltk.tag package. Many taggers can also be combined into a backoff
chain, so that if one tagger cannot tag a word, the next tagger is tried, and so on.
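
The backoff idea can be illustrated without NLTK. The sketch below is a simplified stand-in, not NLTK's implementation: the three tagger functions, their tiny lexicon, and the patterns are all made up for the example. Each tagger returns None when it cannot decide, and the chain keeps the first real answer.

```python
import re

def lookup_tagger(word):
    """Tag from a tiny hand-made dictionary; None when the word is unknown."""
    lexicon = {"the": "DT", "board": "NN", "will": "MD", "join": "VB"}
    return lexicon.get(word.lower())

def pattern_tagger(word):
    """Tag by word shape: numbers as CD, '-ly' words as RB; None otherwise."""
    if re.fullmatch(r"\d+(\.\d+)?", word):
        return "CD"
    if word.endswith("ly"):
        return "RB"
    return None

def default_tagger(word):
    """Last resort: always answers, so the chain never ends without a tag."""
    return "NN"

def tag_with_backoff(words, chain):
    """Try each tagger in order and keep the first non-None answer."""
    tagged = []
    for word in words:
        tag = next(t for t in (tagger(word) for tagger in chain) if t is not None)
        tagged.append((word, tag))
    return tagged

chain = [lookup_tagger, pattern_tagger, default_tagger]
result = tag_with_backoff(
    ["the", "board", "will", "quickly", "join", "61", "directors"], chain
)
print(result)
```

The order of the chain matters: the most specific tagger goes first and the catch-all default goes last.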

## Training a unigram part-of-speech tagger

UnigramTagger can be trained by giving it a list of tagged sentences at initialization.

>>> from nltk.tag import UnigramTagger
>>> from nltk.corpus import treebank
>>> train_sents = treebank.tagged_sents()[:3000]
>>> tagger = UnigramTagger(train_sents)
>>> treebank.sents()[0]
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
>>> tagger.tag(treebank.sents()[0])
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]

We use the first 3000 tagged sentences of the treebank corpus as the training set to
initialize the UnigramTagger class. We then display the first sentence as a list of words
and see how the tag() method turns it into a list of tagged tokens.
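
Conceptually, a unigram model remembers the most frequent tag seen for each word. The sketch below shows that idea in plain Python on a hand-made training set. It is an illustration under that assumption, not NLTK's code, and `train_unigram`, `tag`, and `toy_train` are names invented here.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carried most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(model, words):
    """Look each word up in the model; unseen words get None."""
    return [(word, model.get(word)) for word in words]

# Toy training data: 'board' is a noun twice and a verb once.
toy_train = [
    [("the", "DT"), ("board", "NN"), ("will", "MD"), ("meet", "VB")],
    [("they", "PRP"), ("board", "VBP"), ("the", "DT"), ("plane", "NN")],
    [("the", "DT"), ("board", "NN"), ("agreed", "VBD")],
]

model = train_unigram(toy_train)
tagged = tag(model, ["the", "board", "will", "vote"])
print(tagged)
```

The unseen word `vote` receives None here, which is the gap a backoff tagger is meant to fill.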

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('sentiwordnet')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\rosel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\rosel\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data] Downloading package sentiwordnet to
[nltk_data]     C:\Users\rosel\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\sentiwordnet.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rosel\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
import os

In [4]:
# Path to the folder where you unpacked the dataset
dataset_directory = "C:/Users/rosel/Desktop/ML_NLP/txt_sentoken"

In [5]:
# Subdirectories for positive and negative reviews
pos_dir = os.path.join(dataset_directory, 'pos')
neg_dir = os.path.join(dataset_directory, 'neg')

In [6]:
def load_reviews(directory, label):
    """Load the reviews from a given directory and assign them a label."""
    reviews = []
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
            text = file.read().strip()
            reviews.append((text, label))
    return reviews

In [7]:
# Load the positive and negative reviews
positive_reviews = load_reviews(pos_dir, 'positive')
negative_reviews = load_reviews(neg_dir, 'negative')


In [8]:
# Combine the reviews
all_reviews = positive_reviews + negative_reviews

Shuffle the dataset so that it can later be split into training and test sets.

In [9]:
import random

random.shuffle(all_reviews)
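
An unseeded shuffle gives a different order on every run, so downstream numbers are not reproducible. One possible approach, sketched on made-up data, is to shuffle a copy with a dedicated seeded generator. The helper `shuffled_copy` and the seed value are assumptions for the example, not part of the notebook.

```python
import random

def shuffled_copy(items, seed):
    """Return a shuffled copy of items; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, leaves the global state alone
    copy = list(items)
    rng.shuffle(copy)
    return copy

# Hypothetical stand-in for the (text, label) pairs built above.
reviews = [
    ("text a", "positive"),
    ("text b", "negative"),
    ("text c", "positive"),
    ("text d", "negative"),
]

first = shuffled_copy(reviews, seed=42)
second = shuffled_copy(reviews, seed=42)
print(first == second)  # True: the order is repeatable
```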

In [14]:
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

# Tokenize each review, then tag every token with its part of speech
tagged_reviews = [(word_tokenize(review), label) for review, label in all_reviews]
tagged_reviews = [(pos_tag(tokens), label) for tokens, label in tagged_reviews]

In [15]:
def extract_adverbs(tagged_tokens):
    """Extract the adverbs from a list of tagged tokens."""
    return [word for word, pos in tagged_tokens if pos.startswith('RB')]

# Extract the adverbs of each review
adverbs_in_reviews = [(extract_adverbs(tagged_tokens), label) for tagged_tokens, label in tagged_reviews]
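
The `startswith('RB')` test matches the three Penn Treebank adverb tags (RB, RBR, RBS), but it also passes through whatever the tagger mislabels as an adverb, which is why words such as 'emily' or 'kelly' show up in the output. The sketch below reproduces this on hand-written tags. The token list and the `known_adverbs` whitelist are invented for illustration.

```python
def extract_adverbs(tagged_tokens):
    """Keep the words whose tag starts with 'RB' (RB, RBR, RBS)."""
    return [word for word, pos in tagged_tokens if pos.startswith("RB")]

# Hand-written (word, tag) pairs; the last tag is deliberately wrong
# to mimic a tagger error on a lowercased proper noun.
hand_tagged = [
    ("it", "PRP"), ("is", "VBZ"), ("not", "RB"), ("flawless", "JJ"),
    ("but", "CC"), ("it", "PRP"), ("works", "VBZ"), ("better", "RBR"),
    ("and", "CC"), ("most", "RBS"), ("reliably", "RB"),
    ("emily", "RB"),
]

adverbs = extract_adverbs(hand_tagged)
print(adverbs)

# Hypothetical post-filter: keep only words found in a small whitelist.
known_adverbs = {"not", "better", "most", "reliably"}
filtered = [word for word in adverbs if word in known_adverbs]
print(filtered)
```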

In [20]:
adverbs_in_reviews

[(['not',
   'often',
   'never',
   'always',
   'once',
   'just',
   'really',
   'politically',
   'enjoy',
   'completely',
   'not',
   'often',
   'not',
   'so',
   'very',
   "n't",
   'dangerously',
   'so',
   'long',
   'very',
   'once',
   'hard',
   'again',
   'lovely'],
  'positive'),
 (['as',
   'probably',
   'up',
   'most',
   'however',
   'soon',
   'yet',
   'elsewhere',
   'indeed',
   'back',
   'even',
   'even',
   'probably',
   'already',
   "n't",
   'even',
   "n't",
   'even',
   'emily',
   'craven',
   'supposedly',
   'quite',
   'randomly',
   'maybe',
   'eventually',
   'eventually',
   'just',
   'also',
   'forth',
   'probably',
   'not',
   'as',
   'well',
   'also',
   "n't",
   'even',
   'harder',
   'even',
   'even'],
  'negative'),
 (['cuddly',
   'largely',
   'here',
   'merely',
   'kelly',
   'not',
   'only',
   'little',
   'curiously',
   'actually',
   'quite',
   "n't",
   'much',
   'either'],
  'negative'),
 (['deservedly',
 

In [16]:
from nltk.corpus import sentiwordnet as swn

def get_sentiment(adverb):
    """Get the sentiment score of an adverb using SentiWordNet."""
    synsets = list(swn.senti_synsets(adverb, 'r'))  # 'r' selects adverbs
    if not synsets:
        return 0  # No score if the adverb is not found in SentiWordNet

    # Use the first synset by default (could be improved with word-sense disambiguation)
    return synsets[0].pos_score() - synsets[0].neg_score()

# Sum the sentiment scores of the adverbs in each review
sentiments_in_reviews = [(sum(get_sentiment(adverb) for adverb in adverbs), label) for adverbs, label in adverbs_in_reviews]
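
Taking only the first synset ignores the other senses of a word. One simple alternative is to average the score over all senses. The sketch below compares the two strategies on a hand-made lexicon: `TOY_LEXICON` and its scores are invented for the example and are not SentiWordNet values.

```python
# Hypothetical lexicon: word -> list of (pos_score, neg_score), one pair per sense.
TOY_LEXICON = {
    "well": [(0.5, 0.0), (0.25, 0.0), (0.0, 0.0)],
    "hard": [(0.0, 0.25), (0.125, 0.625)],
}

def first_sense_score(word, lexicon):
    """Score of the first listed sense, as in get_sentiment above."""
    senses = lexicon.get(word)
    if not senses:
        return 0.0
    pos, neg = senses[0]
    return pos - neg

def mean_sense_score(word, lexicon):
    """Average of (pos - neg) over every listed sense."""
    senses = lexicon.get(word)
    if not senses:
        return 0.0
    return sum(pos - neg for pos, neg in senses) / len(senses)

for word in ("well", "hard", "unknownly"):
    print(word, first_sense_score(word, TOY_LEXICON), mean_sense_score(word, TOY_LEXICON))
```

Averaging dampens the effect of one strongly polarized sense, but it does not handle negation: words like 'not' and "n't" are still scored as ordinary adverbs.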

In [21]:
sentiments_in_reviews

[(-1.5, 'positive'),
 (0.0, 'negative'),
 (-0.625, 'negative'),
 (-1.875, 'positive'),
 (-0.375, 'positive'),
 (-1.875, 'negative'),
 (-0.25, 'positive'),
 (0.75, 'positive'),
 (-3.375, 'positive'),
 (-2.0, 'positive'),
 (-3.25, 'negative'),
 (2.125, 'negative'),
 (-0.75, 'negative'),
 (3.25, 'positive'),
 (1.25, 'positive'),
 (2.125, 'positive'),
 (-3.625, 'negative'),
 (2.0, 'positive'),
 (-1.25, 'negative'),
 (2.0, 'positive'),
 (0.25, 'negative'),
 (-1.625, 'negative'),
 (-2.75, 'negative'),
 (-2.375, 'negative'),
 (-1.0, 'positive'),
 (0.25, 'positive'),
 (0.875, 'positive'),
 (0.625, 'negative'),
 (-4.0, 'negative'),
 (1.625, 'positive'),
 (-3.25, 'negative'),
 (-0.375, 'positive'),
 (-1.375, 'negative'),
 (3.5, 'negative'),
 (-6.375, 'negative'),
 (-1.5, 'negative'),
 (0.5, 'negative'),
 (0.75, 'negative'),
 (1.0, 'positive'),
 (1.5, 'positive'),
 (-2.0, 'positive'),
 (0.125, 'negative'),
 (1.875, 'positive'),
 (0.75, 'positive'),
 (-2.75, 'positive'),
 (-0.875, 'positive'),
 (-

In [19]:
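# Classify each review from the sign of its summed sentiment score.
# The returned strings must match the dataset labels ('positive' / 'negative').
def classify_review(sum_score):
    return "positive" if sum_score > 0 else "negative"

predicted_labels = [classify_review(score) for score, _ in sentiments_in_reviews]

# Compute the classification accuracy
actual_labels = [label for _, label in sentiments_in_reviews]
correctly_classified = sum(1 for predicted, actual in zip(predicted_labels, actual_labels) if predicted == actual)

accuracy = correctly_classified / len(predicted_labels)

print(f"Classification accuracy: {accuracy * 100:.2f}%")

Classification accuracy: 0.00%

The 0.00% shown here was recorded with classify_review returning "pos" and "neg". Those strings never equal the dataset labels 'positive' and 'negative', so every comparison failed; the figure does not measure the quality of the threshold rule itself.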
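
One way to catch a label mismatch early is to check that every predicted label occurs among the actual labels before computing accuracy. A minimal sketch on made-up scores follows. `toy_scores` is hypothetical and stands in for `sentiments_in_reviews`.

```python
def classify_review(sum_score):
    """Threshold rule: strictly positive sums are 'positive', the rest 'negative'."""
    return "positive" if sum_score > 0 else "negative"

# Hypothetical (summed score, true label) pairs.
toy_scores = [
    (1.5, "positive"),
    (-0.5, "negative"),
    (0.0, "positive"),
    (2.0, "negative"),
]

predicted = [classify_review(score) for score, _ in toy_scores]
actual = [label for _, label in toy_scores]

# Guard: a predicted label that never occurs in the data signals a naming mismatch.
unknown = set(predicted) - set(actual)
if unknown:
    raise ValueError(f"Predicted labels not present in the data: {unknown}")

accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)
print(f"Classification accuracy: {accuracy * 100:.2f}%")
```

Note that a score of exactly 0 falls on the negative side of the rule, which also covers reviews with no adverb found in the lexicon.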


In [22]:
X = [sum_ for sum_, _ in sentiments_in_reviews]
y = [label for _, label in sentiments_in_reviews]

In [23]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [24]:
from sklearn.linear_model import LogisticRegression

# Reshape the data into one-column rows because there is a single feature
X_train = [[x] for x in X_train]
X_test = [[x] for x in X_test]

# Train the model
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

In [25]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy*100:.2f}%")

report = classification_report(y_test, y_pred)
print("\nClassification Report:")
print(report)

Accuracy: 47.00%

Classification Report:
              precision    recall  f1-score   support

    negative       0.48      0.18      0.26       209
    positive       0.47      0.79      0.59       191

    accuracy                           0.47       400
   macro avg       0.47      0.48      0.42       400
weighted avg       0.47      0.47      0.42       400
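
The f1-score column is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it from the rounded values printed in the report, so small rounding differences against the library's unrounded figures are possible.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall copied from the report above.
f1_negative = f1(0.48, 0.18)
f1_positive = f1(0.47, 0.79)

print(f"negative: {f1_negative:.2f}")
print(f"positive: {f1_positive:.2f}")
```

The very low recall on the negative class (0.18) is what drags its F1 down: the model predicts 'positive' for most reviews.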



The model reaches only 47% accuracy, which is no better than guessing on this nearly balanced test set, so let's try to improve it.
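
To put the 47% in context, the majority-class baseline can be read off the support column of the report: always predicting the most frequent class already does better. A small sketch using only the numbers printed above:

```python
# Test-set support per class, copied from the classification report.
support = {"negative": 209, "positive": 191}

total = sum(support.values())
majority_label = max(support, key=support.get)
baseline_accuracy = support[majority_label] / total

model_accuracy = 0.47  # accuracy reported for the logistic regression

print(f"Majority class: {majority_label}")
print(f"Baseline accuracy: {baseline_accuracy * 100:.2f}%")
print(f"Model beats baseline: {model_accuracy > baseline_accuracy}")
```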

TF-IDF (Term Frequency-Inverse Document Frequency):

Instead of relying only on the sentiment score, we can turn the reviews into numeric vectors using TF-IDF. This technique weights a word by how important it is in a document relative to the whole corpus.
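
The weighting can be sketched in a few lines of plain Python. This is the textbook variant, tf × log(N / df), on a made-up three-document corpus. scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each vector by default, so its numbers will differ from the ones computed here.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document (textbook tf * log(N / df))."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical mini-corpus.
docs = ["a great film", "a dull film", "a great great cast"]
vectors = tf_idf(docs)

for doc, vector in zip(docs, vectors):
    print(doc, {term: round(weight, 3) for term, weight in vector.items()})
```

A word that appears in every document, such as 'a', gets a weight of zero, while a word found in a single document, such as 'dull', is weighted most heavily.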

In [28]:
print(all_reviews[:5])  # Display the first 5 entries as a check.

[('robert redford\'s a river runs through it is not a film i watch often . \nit is a masterpiece -- one of the better films of recent years . \nuntil 1994 , it was my second favorite film of all time . \nthe acting and direction is top-notch -- never sappy , always touching . \na friend of mine once reported that he avoided it because " i was afraid it would just be really politically correct , and tick me off . " \nall i could do was tell him to go in unbiased , and enjoy . \nit is one of the few movies that has completely reduced me to tears . \nbut certain memories should not often be rereleased -- in the last few shots , you have to cry . \nupon my first viewing i left bawling . \nit is not flawless -- but it is so very good , that you can\'t help but be effected . \nthe opening is dangerously nolstalgic and sentimental -- watching these shots of people who have been dead so long , gives you a feeling of perspective and history observation that you will find in very few other films

In [29]:
all_reviews = [review[0] for review in all_reviews]

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Build a TF-IDF matrix from the reviews
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(all_reviews)

# Split the data into training and test sets again
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [31]:
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=100)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy*100:.2f}%")

report = classification_report(y_test, y_pred)
print("\nClassification Report:")
print(report)

Accuracy: 81.50%

Classification Report:
              precision    recall  f1-score   support

    negative       0.81      0.84      0.83       209
    positive       0.82      0.79      0.80       191

    accuracy                           0.81       400
   macro avg       0.82      0.81      0.81       400
weighted avg       0.82      0.81      0.81       400

