# Aspect-based sentiment analysis

Aspect-Based Sentiment Analysis (ABSA), also known as fine-grained opinion mining, focuses on identifying the sentiment of a text concerning a specific aspect. This approach has gained prominence as a response to the limitations of traditional sentiment analysis methods.

Conventional sentiment analysis typically assigns an overall sentiment label (e.g., positive, negative, or neutral) to an entire text. While this is sufficient for many applications, it lacks the granularity needed in scenarios where sentiment varies across different aspects. For instance, in a restaurant review, a customer may rate the restaurant positively overall but criticize the service. In such cases, ABSA helps capture the sentiment towards individual aspects, such as identifying that the sentiment toward “service” is negative, despite the overall positive review.

## Example: Hotel reviews

<p style="margin-left:200px; float:right"><img src="hotel-reviews.jpg" width="300px" /></p>  

**Hotel aspects:** cleanliness, staff behavior, food quality, location, service and amenities\
**Aspect sentiment analysis:** sentiment scores aggregated for each aspect



- cleanliness_tables_area: ⭐⭐⭐⭐⭐
- staff_behavior_serving_greeting: ⭐⭐⭐
- food_menu_taste_cousines: ⭐⭐⭐⭐⭐
- location_by main road_close to city: ⭐⭐
- service_waiter_table booking: ⭐⭐


Aspect-Based Sentiment Analysis (ABSA) is particularly useful in analyzing hotel reviews, where customers express opinions on multiple aspects of their stay, such as cleanliness, staff behavior, room quality, location, and amenities. Traditional sentiment analysis may label a review as positive or negative as a whole, but ABSA allows for a more nuanced understanding by identifying sentiment tied to specific aspects. For example, a guest might praise the hotel's location and service but complain about the room's cleanliness. By applying ABSA, hotel management can gain detailed insights into what aspects need improvement while maintaining strengths. Additionally, potential customers can make informed decisions based on sentiments about aspects that matter most to them. This fine-grained analysis helps hotels enhance customer experience and tailor their services to meet guest expectations more effectively.



Aspect-Based Sentiment Analysis (ABSA) is a challenging task as it involves both identifying relevant "aspects" within a text and assigning sentiment labels to them. Various approaches exist for ABSA, but a common strategy involves first detecting aspects in the text and then applying an ABSA model to determine the sentiment associated with each aspect.

Aspect identification can be performed using different techniques, including rule-based methods such as dictionary-based approaches. For instance, terms like "iPhone X" or "MacBook Pro" might be predefined as aspects.

After identifying aspects, an ABSA classifier is trained to assess sentiment in relation to the context of a sentence. For example, in the sentence, "We had a great experience at the restaurant, the food was delicious, but the service was kinda bad," the classifier would determine that the sentiment towards "service" is negative, despite the overall positive tone of the review.



## Topic Modeling

Topic modeling is an unsupervised machine learning technique used to identify hidden thematic structures in a large collection of text data. It helps discover topics that frequently occur in a dataset without requiring prior labeling or annotation. One of the most widely used topic modeling methods is Latent Dirichlet Allocation (LDA), which represents documents as mixtures of topics, with each topic consisting of a set of words with varying probabilities. Topic modeling is commonly applied in text mining, information retrieval, document classification, and content recommendation systems. It enables researchers and businesses to analyze vast amounts of textual data, uncover trends, and gain insights into discussions, making it a valuable tool in areas such as social media analysis, academic research, and customer feedback categorization.

## Sentiment Analysis

Sentiment analysis, also known as opinion mining, is a natural language processing (NLP) technique used to determine the sentiment or emotional tone expressed in a piece of text. It involves classifying text into categories such as positive, negative, or neutral, enabling businesses and researchers to analyze opinions, feedback, and trends. Sentiment analysis is widely applied in various domains, including social media monitoring, customer feedback analysis, brand reputation management, and market research. Advanced sentiment analysis techniques, such as deep learning and transformer-based models, enhance accuracy by capturing contextual nuances, sarcasm, and complex emotions within text data.

## Tutorial Content
1. Data preparation / preprocessing
2. Integer encoding
3. Topic modeling (Latent dirichlet allocation with collapsed Gibbs sampling)
4. Performing sentiment analysis (using [SentiStrength](https://github.com/zhunhung/Python-SentiStrength))  
5. Separate neutral i.e., topic (aspect) presenting words and subjective words
6. Aggregating scores of the subjective words against each topic
7. Preparing output

In [1]:
# Its vanilla implementation of Topic modeling that only uses basic tools:
# json - to read from and write to files in json format 
# numpy - for faster matrix operations 
# pandas - to read csv data
# string - to only keep English letters, removing puntuations and other characters
# random - to generate random numbers for initializing Markov-chain monte carlo, and 
#           and during algorithm working to avoid local optima


import json
import numpy as np
import random
import pandas as pd
import string

### 1. Data preparation

**1.1. Read textual data**

In [2]:
en_stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

def clean_text(text):
    clean_text = text.lower()
    # cleaning documents by removing unwanted characters
    clean_text = "".join([char for char in text if char in string.ascii_lowercase])
    # cleaning documents by stopwords
    clean_text = [word for word in text.split(" ") if word not in en_stopwords and len(word) > 2]
    return clean_text

In [3]:
# Read input data: titles of BBC articles available on the following link
# https://github.com/vahadruya/Capstone-Project-Unsupervised-ML-Topic-Modelling/blob/main/Input_Data/input.csv

with open('config.json', 'r') as file:
    configurations = json.load(file)
#df = pd.read_csv(configurations["text-doc-path"], encoding="utf-8")
rawdata = open(configurations["text-doc-path"]).read()
#rawdata = df["Title"].tolist()
rawdata = rawdata.split("\n")

# Tokenize sentences into words
tokenized_documents = []
for document in rawdata: # Considering only first 100 titles for the sake of demonstration
    document = clean_text(document)
    if len(document) > 2:
        tokenized_documents.append(document)
len(tokenized_documents)

605

**1.2. Generate Integer encoding**\
It preserves both frequency and position related information. The process involves assigning each unique token a dedicated integer id, preserving it in a dictionary for later retrieval, while rewriting documents by replacing with with their integer ids.

It makes the operations a lot faster as numbers are much faster to read/store and compare as compared to strings. 

The integer ids will be replaced with their original words at the end using stored dictionary files

In [4]:
# Create a dictionary of unique tokens and assign integers
dictionary = {}
revdictionary = {}
index = 0

#tokenized_documents = [[word for word in doc if word not in esw] for doc in tokenized_documents]

for doc in tokenized_documents:
    for word in doc:
        if word not in dictionary.keys():
            dictionary[word] = index
            revdictionary[index] = word
            index += 1

# Replace words in sentences with their corresponding integers
encoded_documents = [[dictionary[word] for word in doc] for doc in tokenized_documents]

**1.3. Storing intermediate data**\
The integer encoded documents are stored in files
the word-to-id and id-to-word dictionaries are also stored

*It will help to avoid these steps, each time topic modeling is performed under different settings*

In [5]:
toStr = ''
for endoc in encoded_documents:
    toStr = toStr + '\t'.join(str(item) for item in endoc)
    toStr = toStr + '\n'
toStr = toStr[:-2]
file = open('data/integer-encoded-data.txt', 'w')
file.write(toStr)
file.close()

#write dictionary to file
file = open('data/dictionary.json', 'w')
file.write(json.dumps(dictionary))
file.close()
file = open('data/revdictionary.json', 'w')
file.write(json.dumps(revdictionary))
file.close()

### 2. Integer Encoding

It preserves both frequency and position related information. The process involves assigning each unique token a dedicated integer id, preserving it in a dictionary for later retrieval, while rewriting documents by replacing with with their integer ids.

It makes the operations a lot faster as numbers are much faster to read/store and compare as compared to strings. 

The integer ids will be replaced with their original words at the end using stored dictionary files

**2.1 Generate integer encoded documents**


In [6]:
# Create a dictionary of unique tokens and assign integers
dictionary = {}
revdictionary = {}
index = 0

#tokenized_documents = [[word for word in doc if word not in esw] for doc in tokenized_documents]

for doc in tokenized_documents:
    for word in doc:
        if word not in dictionary.keys():
            dictionary[word] = index
            revdictionary[index] = word
            index += 1

# Replace words in sentences with their corresponding integers
encoded_documents = [[dictionary[word] for word in doc] for doc in tokenized_documents]

**2.2 Storing intermediate data**\
The integer encoded documents are stored in files
the word-to-id and id-to-word dictionaries are also stored

*It will help to avoid these steps, each time topic modeling is performed under different settings*

In [7]:
toStr = ''
for endoc in encoded_documents:
    toStr = toStr + '\t'.join(str(item) for item in endoc)
    toStr = toStr + '\n'
toStr = toStr[:-2]
file = open('data/integer-encoded-data.txt', 'w')
file.write(toStr)
file.close()

#write dictionary to file
file = open('data/dictionary.json', 'w')
file.write(json.dumps(dictionary))
file.close()
file = open('data/revdictionary.json', 'w')
file.write(json.dumps(revdictionary))
file.close()

### 3. Topic Modeling (Latent Dirichlet Allocation)

**Setting (in config.json)**

*numTopics: 10* - how much can we stretch the data? After manual exploration or domain knowledge having fewer topics more than the high level separation can give good meaningful topics. Having more topics beyond that can identify more specific topics, however there can me more topics that are incoherent and cannot be interpreted.

*numAlpha ($\alpha$): 1.0* - We want natural representation of topics in documents. A higher value will push in more topics within documents while a lower value will only have fewer most dominant topics. $\alpha$ is a hyper-parameter where a higher value (above 1) adds external bias to each topic within a document. In extreme case (a value of 1000 or above for example) will have equal representation of all topics within the document.

*numBeta ($\beta$): 0.01* - We want fewer words to represent a topic, therefore, a value 0.01 (below 1) is used. Given the vocabulary size, a lower value will push the lower probability words in the topic further down, therefore, we will have few more prominent words to represent a topic. Pushing this value further down will results in increase in the probability of the prominent words while further drop in the probabilities of the background words for the topic.

Further, we set the number of iterations *numGIterations: 1000* giving it enough time to settle, starting from a randomly initialized state.

There are some other performance related parameters, set to default values

**LDA class**
main functions are:
1. Markov chain monte carlo initialization (giving the model a random inital state, expecting the model
    to converge for higher number of iterations.
2. Collapsed gibbs sampling inference: in each iteration \
   2.1 Iterates through all documents, all tokens/words in each document \
   2.2 For for each token computes its most suitable topic, given the current status of the model \
   2.3 Updates new topic if different from current topic, associated estimates update, so does the model state \
3. Estimate document-topic distribution from the final state of the model 
4. Estimate topic-word distribution (organized in decreasing order of probabilities) from the final state of the model
5. Other utility functions

In [8]:
# The class implements topic modeling (Latent dirichlet allocation) algorithm using collapsed gibbs sampling as in inference. 
class LDA:
    # topics to extract from the data (Components)
    _numTopics = None
    # vocabulary (unique words) in the dataset
    _arrVocab = None
    #size of vocabulary (count of unique words)
    _numVocabSize = None
    # dataset
    _arrDocs = []
    # dataset size (number of documents)
    _numDocSize = None
    # dirichlet prior (document to topic prior)
    _numAlpha = None
    # dirichlet prior (topic to word prior)
    _numBeta = None
    _ifScalarHyperParameters = True
    # Gibb sampler iterations
    _numGSIterations = None
    # The iterations for initial burnin (update of parameters)
    _numNBurnin = None
    # The iterations for continuous burnin (update of parameters)
    _numSampleLag = None
    
    
    
    # The following attributes are for internal working
    __numTAlpha = None  
    __numVBeta = None   
    __arrTheta = None
    __arrThetaSum = None
    __arrPhi = None
    __arrPhiSum = None
    __arrNDT = None
    __arrNDSum = []
    __arrNTW = None
    __arrNTSum = []
    __arrZ = []
    
    # for alpha to be a list, its size must be equal to the size of the dataset, has value for each doc
    # for beta to be a list, its size must be equal to the number of topics, has value for each topic  
    def __init__(self, numTopics = 2, numAlpha = 1.0, numBeta = 0.01, 
                 numGSIterations = 1000, numNBurnin = 50, numSampleLag = 20, 
                 wordsPerTopic = 10):
        self._numTopics = configurations["numTopics"]
        self._numAlpha = configurations["numAlpha"]
        self._numBeta = configurations["numBeta"]
        self._numGSIterations = configurations["numGSIterations"]
        self._numNBurni = configurations["numNBurnin"]
        self._numSampleLag = configurations["numSampleLag"]
        self.__wordsPerTopic = configurations["wordsPerTopic"]
            
    #load data as integer encoding of words in a sequence (no padding or truncation)
    def getData(self, path):
        file = open(path, 'r')
        rawData = file.read()
        file.close()
        self.__loadData(rawData)
        self.__loadVocab()
        self.__prepareCollections()

    #load docs and docSize from the dataset
    def __loadData(self, rawData):
        rows = rawData.split('\n')
         
        #read dataset as documents of words IDs
        for row in rows:
            swordlist = row.split('\t')
            swordlist = list(filter(None, swordlist))   #remove empty items from list
            if len(swordlist) > 0:
                iwordlist = [eval(w) for w in swordlist]    
                self._arrDocs.append(iwordlist)

        # determine dataset size
        self._numDocSize = len(self._arrDocs)
        
        
    #Determine unique words (vocabulary) and count of unique words (vocabSize)    
    def __loadVocab(self):
        #determine unique vocabulary
        uniqueWords = []
        for doc in self._arrDocs:
            for word in doc:
                if word not in uniqueWords:
                    uniqueWords.append(word)
        self._arrVocab = uniqueWords
        self._numVocabSize = len(self._arrVocab)    

    def __prepareCollections(self):
        self.__arrNDSum = np.array([0] * self._numDocSize)
        self.__arrTheta = np.array([[0] * self._numTopics] * self._numDocSize)
        self.__arrThetasum = np.array([[0] * self._numTopics] * self._numDocSize)
        self.__arrNDT = np.array([[0] * self._numTopics] * self._numDocSize)
        
        self.__arrNTSum = np.array([0] * self._numTopics)
        self.__arrPhi = np.array([[0] * self._numVocabSize] * self._numTopics)
        self.__arrPhisum = np.array([[0] * self._numVocabSize] * self._numTopics)
        self.__arrNTW = np.array([[0] * self._numVocabSize] * self._numTopics)

        #Assign values to parameters based on hyper-parameters
        self.__numTAlpha = self._numTopics*self._numAlpha  
        self.__numVBeta = self._numVocabSize*self._numBeta   

        
        for d in range(0, self._numDocSize):
            rowOfZeros = [0] * len(self._arrDocs[d])
            self.__arrZ.append(rowOfZeros)
                
    # Initialize first markov chain randomly
    def randomMarkovChainInitialization(self):
        
        for d in range(self._numDocSize):
            wta = []                        #wta - word topic assignment
            doc = self._arrDocs[d]
            for ind in range(len(doc)): 
                randtopic = random.randint(0, self._numTopics - 1)      # generate a topic number at random
                self.__arrZ[d][ind] = randtopic
                self.__arrNDT[d][randtopic] += 1
                self.__arrNDSum[d] += 1
                wordid = self._arrDocs[d][ind]
                self.__arrNTW[randtopic][wordid] += 1
                self.__arrNTSum[randtopic] += 1
            
    
    #Inference (Collapsed Gibbs Sampling)
    def gibbsSampling(self):
        tAlpha = self._numAlpha * self._numTopics
        vBeta = self._numBeta * self._numVocabSize            
                    
        for it in range(self._numGSIterations):
            for d in range(self._numDocSize):
                dsize = len(self._arrDocs[d])
                for ind in range(dsize):
                    # remove old topic from a word instance
                    oldTopic = self.__arrZ[d][ind]
                    wordid = self._arrDocs[d][ind]
                    self.__arrNDT[d][oldTopic] -= 1
                    self.__arrNDSum[d] -= 1
                    self.__arrNTW[oldTopic][wordid] -= 1
                    self.__arrNTSum[oldTopic] -= 1   

                    # find a new more appropriate tpoic for the word instanc as per current state of the model
                    prob = [0] * self._numTopics
                    
                    for t in range(self._numTopics):
                        prob[t] = ((self.__arrNDT[d][t] + self._numAlpha) / (self.__arrNDSum[d] + tAlpha)) * \
                            (self.__arrNTW[t][wordid] + self._numBeta) / (self.__arrNTSum[t] + vBeta)
                    
                    #cumulate multinomial
                    cdf = prob
                    for x in range(1, len(cdf)):
                        cdf[x] += cdf[x-1]
                    
                    cutoff = random.random() * cdf[-1]
                    newTopic = 0
                    for i in range(len(cdf)):
                        if cdf[i] > cutoff:
                            newTopic = i
                            break
                    #update as per new topic
                    self.__arrZ[d][ind] = newTopic
                    self.__arrNDT[d][newTopic] += 1
                    self.__arrNDSum[d] += 1
                    self.__arrNTW[newTopic][wordid] += 1
                    self.__arrNTSum[newTopic] += 1
                
    def getTopicsPerDocument(self):
        results = ''
        results += "***Topics per Document***\n"
        for d in range(self._numDocSize):
            results += "Document " + str(d) + ":\n"
            for t in range(self._numTopics):
                val = (self.__arrNDT[d][t]+self._numAlpha)/(self.__arrNDSum[d]+self.__numTAlpha)
                results += "Topic " + str(t) + ":" + str(val) + '\t'
            results += '\n'
        #print(results)
        file = open('data/output-data/document-topic-distribution.txt', 'w')
        file.write(results)
        return results
                    
   
    def getWordsPerTopic(self, revdictionary):
        results = {}
        
        for t in range(self._numTopics):
            #results += "\nTopic " + str(t) + ":"
            #flag = 0
            wpt = {}
            for v in range(self._numVocabSize):
                val = (self.__arrNTW[t][v]+self._numBeta)/(self.__arrNTSum[t]+self.__numVBeta)
                wpt[revdictionary[str(v)]] = float(val)
             #   flag += 1
             #   if flag == self.__wordsPerTopic:
             #       break
            wpt = sorted(wpt.items(), key=lambda x: x[1], reverse=True)[:self.__wordsPerTopic]
            results[t] = wpt
        #print(results)
        return results
        
    
    def printall(self):
        print("topics: ", self._numTopics)
        print("dataset: ", self._arrDocs)
        print("dataset size: ", self._numDocSize)
        print("vocab: ", self._arrVocab)
        print("vocab size: ", self._numVocabSize)
        print("ndt: ", self.__arrNDT)
        print("ndsum: ", self.__arrNDSum)
        print("ntw: ", self.__arrNTW)
        print("ntsum: ", self.__arrNTSum)
        print("z: ", self.__arrZ)

**Running the model**

In [9]:
if __name__ == "__main__":
    lda = LDA()
    lda.getData(configurations["integer-encoded-doc-path"])
    lda.randomMarkovChainInitialization()
    lda.gibbsSampling()

**Results: Getting Topics**

In [10]:
with open(configurations["integer-word-dict"], 'r') as file:
    revdictionary = json.load(file)
topic_words = lda.getWordsPerTopic(revdictionary)

### 4. Performing sentiment analysis (using SentiStrength)s

- We are using Senti-Strength in this tutorial for computing the sentiment score, using scale parameter which gives a score in range [-5, 5]

### 5. Separate neutral i.e., topic (aspect) presenting words and subjective words

- Words with Senti-score of 0 are considered Neutral or Objective, words with score below 0 are negatively subjective while words with score above 0 are positively subjective.
- It gives us a split of Neutral i.e., topic or aspect presentable words and subjective (both positive and negative together) words

### 6. Aggregating scores of the subjective words against each topic
- Aggregate the senti-scorse of all subjective terms in a topic (using mean)

In [11]:
#!pip install sentistrength

In [72]:
from sentistrength import PySentiStr
senti = PySentiStr()
senti.setSentiStrengthPath('jar_datei/SentiStrength.jar') # Note: Provide absolute path instead of relative path
senti.setSentiStrengthLanguageFolderPath('SentiStrengthData/') # Note: Provide absolute path instead of relative path

topic_presenting_words = {}
topic_senti_words = {}
topic_senti_score = {}

for i in range(configurations['numTopics']):
    topic_presenting_words[i] = []
    topic_senti_words[i] = []
    topic_senti_score[i] = 0

for topic, wordslist in topic_words.items():
    for word in wordslist:
        score = senti.getSentiment(word[0], score='scale')[0]
        if score == 0:
            topic_presenting_words[topic].append(word[0])
        else: 
            topic_senti_words[topic].append(word[0])
            topic_senti_score[topic] += score
    topic_senti_score[topic] /= len(topic_senti_words)
            

### 7. Preparing output

- Prepare understandable topic name by concating its top 5 most presentable words
-  

In [82]:
for i in range(configurations['numTopics']):
    print('Topic ', i)
    print(topic_presenting_words[i][:5], ' = ', topic_senti_score[i], '\ntop remarks(', ' '.join(topic_senti_words[i][:5]), ')\n')


Topic  0
['one', 'Areas', 'experience', 'time', 'Seixo']  =  0.9 
top remarks( great fantastic excellent kind better )

Topic  1
['stay', 'staff', 'room,', 'days', 'stay.']  =  1.5 
top remarks( Thank wonderful beautiful thanks lovely )

Topic  2
['every', 'much', 'everything', 'Everything', 'want']  =  1.1 
top remarks( thank enjoyed perfect love truly )

Topic  3
['come', 'back', 'hotel,', 'back.', 'next']  =  2.4 
top remarks( amazing loved Thanks place! Fantastic )

Topic  4
['would', 'team', 'little', 'menu', 'bit']  =  0.9 
top remarks( like good experience! beauty perfect! )

Topic  5
['The', 'hotel', 'also', 'restaurant', 'food']  =  0.6 
top remarks( delicious friendly, Amazing impressed )

Topic  6
['made', 'And', 'always', 'get', 'guests']  =  0.3 
top remarks( enjoy welcome )

Topic  7
['place', 'place.', 'feel', 'This', 'place,']  =  0.9 
top remarks( special magical cozy relax everything! )

Topic  8
['The', 'room', 'staff', 'definitely', 'rooms']  =  1.3 
top remarks( ni

**Commentary on Output**

*Topic 0* is synonymous to user experience of the hotel getting a score of *0.9* with great, fantastic, excellent, kind and better being the top responses. *Topic 1* is similar to services getting a score of *1.5*. Similarly, *Topic 4* represents menu items getting *0.9* score etc. Some of the topics e.g., *Topic 2* is incohrent. 

Reducing the number of topics from 10 can results in more understandable topics.

### Bonus
- Computing Sentiment analysis for topic words using **SentiWordNET**

In [88]:
import nltk
nltk.download('wordnet')
nltk.download('sentiwordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\khantr\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package sentiwordnet to
[nltk_data]     C:\Users\khantr\AppData\Roaming\nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!


True

In [101]:
#Sentiment Analysis with SentiWordNET (on top of WordNET)
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn

for topic, wordslist in topic_words.items():
    score = 0
    for word in wordslist:
        if len(wn.synsets(word[0])) > 0:
            senti_synset = swn.senti_synset(wn.synsets(word[0])[0].name())
            score += senti_synset.pos_score() - senti_synset.neg_score()
    print('Topic: ', topic, 'has Sentiment score of ', score)

Topic:  0 has Sentiment score of  3.375
Topic:  1 has Sentiment score of  3.625
Topic:  2 has Sentiment score of  2.25
Topic:  3 has Sentiment score of  3.0
Topic:  4 has Sentiment score of  0.625
Topic:  5 has Sentiment score of  -0.25
Topic:  6 has Sentiment score of  1.75
Topic:  7 has Sentiment score of  1.75
Topic:  8 has Sentiment score of  1.0
Topic:  9 has Sentiment score of  1.75
