<a href="https://colab.research.google.com/github/saishdesai23/Sentiment-Analysis/blob/main/Assignment2_Sentiment_Analyzer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Character sketch using Sentiment Analysis 
## Author - Saish Desai (sbdesai2)
## Assignment 2

This project studies the characters in a play and analyses their traits using the VADER sentiment analyzer. We extract the dialogues of every character in the play and quantify their characteristics using the sentiment scores of those dialogues.

## Reference Articles
1) VADER (Valence Aware Dictionary and sEntiment Reasoner) - https://towardsdatascience.com/sentimental-analysis-using-vader-a3415fef7664

1. Importing all the required packages

In [2]:
# importing all required libraries
import numpy as np
import pandas as pd
import requests

# importing libraries for sentiment analysis
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob
# !pip install flair
# from flair.models import TextClassifier
# from flair.data import Sentence

# importing packages for text pre-processing
from nltk.corpus import stopwords
from bs4 import BeautifulSoup # package used in web scraping to parse the HTML of the downloaded pages
import re # regular expressions, used to remove punctuation and numbers
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.corpus import wordnet

import spacy


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...




2. Scraping the data pertaining to the play from the HTML file available on Project Gutenberg

In [3]:
# Extracting book text link
def url_to_html(website_link : str,book_id : int):
    """
    A function to get the website link and book id and return the html parsed format of the chosen book.
    :param website_link: link of the website where books are stored
    :param book_id: id of the book for which the data is to be extracted
    :returns: content of the chosen webpage as a parsed BeautifulSoup object, or None if the request fails
    """
    ebook_link = website_link + "/ebooks/" + str(book_id) # str() so that an integer id also works
    # identify the scraper to the server through the User-Agent header
    headers = {'User-Agent': 'Web scraper - school project (sbdesai2@illinois.edu)'}
    book = requests.get(ebook_link, headers=headers) # GET request for the page of the chosen book
    if book.status_code == 200:
        soup = BeautifulSoup(book.text, 'html.parser') # using the BeautifulSoup module to parse the HTML
        link = soup.find(type=re.compile("text/html")) # selecting the tag that links to the HTML format of the book
        text_link = link.get('href')
        ebook_text_link = website_link + text_link # link for HTML format of the book
        book_data = requests.get(ebook_text_link, headers=headers)
        print(ebook_text_link)
        soup_data = BeautifulSoup(book_data.text, 'html.parser')
        return soup_data
    else:
        print("Error:",book.status_code)
        return None

Please enter the bookID as "1524"

In [4]:
website_link = "https://www.gutenberg.org"
book_id = input("Enter Book ID: ")
soup_data = url_to_html(website_link,book_id)
# print(soup_data)

Enter Book ID: 1524
https://www.gutenberg.org/files/1524/1524-h/1524-h.htm
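The printed link is built by concatenating the site root with the `href` found on the book's page. As a minimal sketch, assuming the `href` may be either site-relative (as in the output above) or already absolute, `urllib.parse.urljoin` from the standard library handles both cases. The helper name `build_text_link` is hypothetical and not part of the notebook.

```python
from urllib.parse import urljoin

def build_text_link(website_link: str, href: str) -> str:
    """Resolve an href from the ebook page against the site root (hypothetical helper)."""
    return urljoin(website_link, href)

# site-relative href, matching the link printed by the notebook
print(build_text_link("https://www.gutenberg.org", "/files/1524/1524-h/1524-h.htm"))

# an absolute href is returned unchanged instead of being prefixed a second time
print(build_text_link("https://www.gutenberg.org",
                      "https://www.gutenberg.org/files/1524/1524-h/1524-h.htm"))
```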


3. Listing all the character names from the play

In [5]:
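# Extracting list of characters from the play
characters = []
cast_lines = [] # stays empty if the cast heading is not found
for heading in soup_data.find_all("h3"):
    # match on the ASCII prefix so the check does not depend on how "æ" is encoded
    if heading.text.strip().startswith("Dramatis Person"):
        cast_lines = [line.split(",") for line in heading.find_next('p').get_text().split("\n")]
for parts in cast_lines:
    name = re.sub("[^a-zA-Z]","",parts[0]) # keep only the letters of the text before the first comma
    if name != "":
        characters.append(name)
print("Following is the list of characters in the play: ")
print(characters)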

Following is the list of characters in the play: 
['HAMLET', 'CLAUDIUS', 'TheGHOSTofthelateking', 'GERTRUDE', 'POLONIUS', 'LAERTES', 'OPHELIA', 'HORATIO', 'FORTINBRAS', 'VOLTEMAND', 'CORNELIUS', 'ROSENCRANTZ', 'GUILDENSTERN', 'MARCELLUS', 'BARNARDO', 'FRANCISCO', 'OSRIC', 'REYNALDO', 'Players', 'AGentleman', 'APriest', 'TwoClowns', 'ACaptain', 'EnglishAmbassadors', 'Lords']
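Because the regular expression removes every non-letter character, multi-word entries such as 'TheGHOSTofthelateking' lose their spaces. The sketch below reproduces that behaviour on two illustrative cast lines (their wording is assumed, not copied from the scraped page) and shows one possible alternative that keeps only the first fully capitalised word. Both `letters_only` and `first_caps_word` are hypothetical helpers.

```python
import re

def letters_only(entry: str) -> str:
    """Same cleaning as the notebook: keep only letters from the text before the first comma."""
    return re.sub("[^a-zA-Z]", "", entry.split(",")[0])

def first_caps_word(entry: str) -> str:
    """Alternative (assumption): return the first run of two or more capital letters, or ''."""
    match = re.search(r"\b[A-Z]{2,}\b", entry)
    return match.group(0) if match else ""

# illustrative cast lines, not taken verbatim from the scraped page
for line in ["HAMLET, Prince of Denmark.", "The GHOST of the late king, Hamlet's father."]:
    print(letters_only(line), "|", first_caps_word(line))
```

Entries without an all-caps name, such as 'Players', give an empty string with the alternative, so it only suits named characters.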


4. Extracting all the dialogues of both the characters

In [6]:
act = [] # list storing all the dialogues from the play, each as a list of lines starting with the speaker label
hamlet = [] # list storing all the dialogues of Hamlet
claudius = [] # list storing all the dialogues of Claudius
for chapter in soup_data.find_all("div","chapter"):
    # find_all_next returns every <p> after this chapter div, so later paragraphs are collected more than once
    act+=[para.text.split("\n") for para in chapter.find_all_next("p")]
print("Dialogues in the act")
print(act[:10])

Dialogues in the act
[[' Enter Francisco and\r', 'Barnardo, two sentinels.'], ['BARNARDO.\r', 'Who’s there?', ''], ['FRANCISCO.\r', 'Nay, answer me. Stand and unfold yourself.', ''], ['BARNARDO.\r', 'Long live the King!', ''], ['FRANCISCO.\r', 'Barnardo?', ''], ['BARNARDO.\r', 'He.', ''], ['FRANCISCO.\r', 'You come most carefully upon your hour.', ''], ['BARNARDO.\r', '’Tis now struck twelve. Get thee to bed, Francisco.', ''], ['FRANCISCO.\r', 'For this relief much thanks. ’Tis bitter cold,\r', 'And I am sick at heart.', ''], ['BARNARDO.\r', 'Have you had quiet guard?', '']]
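Each element of `act` is one paragraph split into lines: the first line is the speaker label followed by a carriage return, and the remaining lines are the speech. Here is a minimal sketch of separating the two, assuming speaker labels consist only of capital letters and spaces followed by a full stop. `split_speaker` is a hypothetical helper, not part of the notebook.

```python
import re

def split_speaker(paragraph_lines):
    """Return (speaker, speech) for one paragraph, or None for stage directions."""
    label = paragraph_lines[0].strip()  # strip() removes the trailing '\r'
    if not re.fullmatch(r"[A-Z ]+\.", label):
        return None  # e.g. "Enter Francisco and" is not a speaker label
    speaker = label[:-1]  # drop the final full stop
    # join the non-empty speech lines with single spaces
    speech = " ".join(line.strip() for line in paragraph_lines[1:] if line.strip())
    return speaker, speech

print(split_speaker(['FRANCISCO.\r', 'For this relief much thanks. ’Tis bitter cold,\r',
                     'And I am sick at heart.', '']))
print(split_speaker([' Enter Francisco and\r', 'Barnardo, two sentinels.']))
```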


In [7]:
act = []
hamlet = [] # list storing all the dialogues of Hamlet
claudius = [] # list storing all the dialogues of Claudius
for chapter in soup_data.find_all("div","chapter"):
    act+=[para.text.split("\n") for para in chapter.find_all_next("p")]

# storing dialogues of Hamlet (protagonist) and Claudius (antagonist) in separate lists
for ele in act:
    if ele[0][:-1] == "HAMLET.":
        temp_text = " ".join(ele[1:])
        temp_text = re.sub("[^a-zA-Z,.?!']"," ",temp_text)
        temp_text = temp_text.strip()
        hamlet.append(temp_text)
    if ele[0][:-1] == "KING.":
        temp_text = " ".join(ele[1:])
        temp_text = re.sub("[^a-zA-Z,.?!']"," ",temp_text)
        temp_text = temp_text.strip()
        claudius.append(temp_text)

Here, we use a separate counter for each character. If a dialogue has a positive sentiment value (a VADER compound score above 0), we increase the counter by 1. Conversely, if a dialogue has a negative sentiment value (a compound score below 0), we decrease the counter by 1. A dialogue with a score of exactly 0 leaves the counter unchanged.

After iterating over all the dialogues of both characters, the counter for Hamlet is higher than the counter for Claudius, so Hamlet has a larger net number of dialogues with positive sentiment. On this measure, the sentiment scores are consistent with Hamlet being the protagonist and Claudius the antagonist.

In [8]:
hcount = 0 # sentiment count for Hamlet
ccount = 0 # sentiment count for Claudius
sid = SentimentIntensityAnalyzer()

for ele in set(hamlet):
    if sid.polarity_scores(ele)['compound']<0:
        hcount-=1
    elif sid.polarity_scores(ele)['compound']>0:
        hcount+=1
               
for ele in set(claudius):
    if sid.polarity_scores(ele)['compound'] < 0:
        ccount-=1
    elif sid.polarity_scores(ele)['compound'] > 0:
        ccount+=1
print(hcount,ccount)

45 13
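The counting rule can be sketched separately from the scraping code. The version below accepts any scoring function that returns a dictionary with a 'compound' key, which is the shape returned by `SentimentIntensityAnalyzer.polarity_scores`. The `toy_scorer` used here is a made-up stand-in so the example runs without NLTK, and its scores are not VADER scores.

```python
def net_sentiment_count(dialogues, score_fn):
    """Return (#positive - #negative) over the unique dialogues, scoring each one once."""
    count = 0
    for text in set(dialogues):  # set() drops repeated dialogues, as in the notebook
        compound = score_fn(text)['compound']
        if compound > 0:
            count += 1
        elif compound < 0:
            count -= 1
    return count

def toy_scorer(text):
    """Hypothetical stand-in for VADER: +0.5 if 'love' occurs, -0.5 if 'hate' occurs, else 0."""
    lowered = text.lower()
    if "love" in lowered:
        return {'compound': 0.5}
    if "hate" in lowered:
        return {'compound': -0.5}
    return {'compound': 0.0}

sample = ["I love thee.", "I hate this.", "Who goes there?", "I love thee.", "Love is all."]
print(net_sentiment_count(sample, toy_scorer))
```

With NLTK installed, passing `sid.polarity_scores` as `score_fn` applies the same rule with real VADER scores, and each dialogue is scored once rather than twice.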


We now generalize this process into a function and evaluate the sentiment counts for all the listed characters in the play. This function gives us a character sketch of each member of the play in terms of the number of positive, negative and neutral dialogues among all the dialogues they speak.


In [37]:
def character_quantifier(act,character):
    """
    A function to extract the dialogues of a character in the act and quantify their character sketch based on the
    sentiment of each dialogue
    """
    if character == 'CLAUDIUS':
        character = 'KING'
    if character == 'GERTRUDE':
        character = 'QUEEN'
    char_diag = []
    diag_count = 0 # counter for all the dialogues spoken by particular character.
    
    # Counters for VADER sentiment analyzer
    vader_pos = 0 # number of dialogues with positive sentiment
    vader_neg = 0 # number of dialogues with negative sentiment
    vader_neu = 0 # number of dialogues with neutral sentiment
    
    # Counters for Flair sentiment analyzer (the Flair code below is commented out, so these stay 0)
    flair_pos = 0 # positive sentiment count using flair package
    flair_neg = 0 # negative sentiment count using flair package
    flair_neu = 0 # neutral sentiment count using flair
    
    # Counters for textblob sentiment analyzer (the TextBlob code below is commented out, so these stay 0)
    tb_pos = 0 # positive sentiment count using textblob package
    tb_neg = 0 # negative sentiment count using textblob package
    tb_neu = 0 # neutral sentiment count using textblob

    #storing all the dialogues for a character in a list
    for ele in act:
        if ele[0][:-1] == character+".":
            temp_text = "".join(ele[1:])
            temp_text = re.sub("[^a-zA-Z,.?!']"," ",temp_text)
            temp_text = temp_text.strip()
            temp_text = temp_text.lower()
            char_diag.append(temp_text)
    
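    # Using the VADER sentiment analyzer (relies on the global `sid` created earlier)
    for ele in set(char_diag):
        # only dialogues with more than 10 space-separated tokens are scored and counted
        if len(ele.split(" ")) > 10: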
            diag_count +=1
            if sid.polarity_scores(ele)['compound'] < 0:
                vader_neg+=1
            elif sid.polarity_scores(ele)['compound'] > 0:
                vader_pos+=1
            else:
                vader_neu+=1
    # Using the flair package for sentiment analysis
    # for ele in set(char_diag):
    #   sentence = Sentence(ele)
    #   classifier.predict(sentence)
    #   score = sentence.labels[0]
    #   if "POSITIVE" in str(score):
    #     flair_pos+=1
    #   elif "NEGATIVE" in str(score):
    #     flair_neg+=1
    #   else:
    #     flair_neu+=1

    # Using the TextBlob sentiment analyzer
    # for ele in set(char_diag):
    #    testimonial = TextBlob(ele)
    #    if testimonial.sentiment.polarity > 0:
    #      tb_pos+=1
    #    elif testimonial.sentiment.polarity < 0:
    #      tb_neg+=1

    return (diag_count, vader_pos,vader_neg,vader_neu, flair_pos,flair_neg, flair_neu, tb_pos, tb_neg, tb_neu, char_diag)

In [38]:
# Creating a dictionary to store the dialogue and sentiment count for each character
data = {"Character":[], "Dialogue_count":[], "VADER_Postive_Sentiment":[], "VADER_Negative_Sentiment":[],"VADER_Neutral_Sentiment":[]}

for ele in characters:
  if ele != "":
    # call the quantifier once per character and unpack the four counts we need
    diag_count, vader_pos, vader_neg, vader_neu = character_quantifier(act,ele)[:4]
    data["Character"].append(ele)
    data['Dialogue_count'].append(diag_count)
    data['VADER_Postive_Sentiment'].append(vader_pos)
    data['VADER_Negative_Sentiment'].append(vader_neg)
    data['VADER_Neutral_Sentiment'].append(vader_neu)

# presenting the data from the dictionary in a dataframe.
Char_data = pd.DataFrame.from_dict(data)

In [39]:
# Cleaning the data frame to extract data pertaining to relevant characters
Char_data = Char_data.dropna()
Char_data = Char_data[Char_data['Dialogue_count']>0]
Char_data['Positve_Sentiment_Score'] = Char_data['VADER_Postive_Sentiment'] - Char_data['VADER_Negative_Sentiment']

In [40]:
Char_data.sort_values(by = ['Dialogue_count'],ascending=False, inplace= True)
Char_data

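| | Character | Dialogue_count | VADER_Postive_Sentiment | VADER_Negative_Sentiment | VADER_Neutral_Sentiment | Positve_Sentiment_Score |
|---|---|---|---|---|---|---|
| 0 | HAMLET | 198 | 103 | 69 | 26 | 34 |
| 1 | CLAUDIUS | 48 | 27 | 17 | 4 | 10 |
| 4 | POLONIUS | 44 | 28 | 14 | 2 | 14 |
| 7 | HORATIO | 30 | 12 | 13 | 5 | -1 |
| 5 | LAERTES | 29 | 14 | 13 | 2 | 1 |
| 6 | OPHELIA | 27 | 17 | 8 | 2 | 9 |
| 3 | GERTRUDE | 25 | 10 | 11 | 4 | -1 |
| 11 | ROSENCRANTZ | 21 | 11 | 5 | 5 | 6 |
| 12 | GUILDENSTERN | 11 | 9 | 2 | 0 | 7 |
| 13 | MARCELLUS | 8 | 3 | 5 | 0 | -2 |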


The "Positve_Sentiment_Score" column is calculated by subtracting the number of dialogues with negative sentiment from the number of dialogues with positive sentiment.
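As a quick check of this formula, the sketch below recomputes the score for four characters from the positive and negative counts reported in the table above. It uses plain Python rather than the notebook's DataFrame, and the variable names are illustrative.

```python
# (positive, negative) dialogue counts copied from the table above
counts = {
    "HAMLET": (103, 69),
    "CLAUDIUS": (27, 17),
    "HORATIO": (12, 13),
    "GERTRUDE": (10, 11),
}

# score = positive count - negative count
scores = {name: pos - neg for name, (pos, neg) in counts.items()}
for name, score in scores.items():
    print(f"{name}: {score}")
```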

By observing the "Positve_Sentiment_Score" column, we can conclude that:

*   Hamlet, with the highest score (34), is closest to being a protagonist. However, his high negative sentiment count is due to his hatred towards his uncle and his rash and impulsive acts.
*   Similarly, Claudius has a relatively low "Positve_Sentiment_Score" (10), which is consistent with his playing the villain in this play.
*   Gertrude, though caring for Hamlet, is a shallow, weak woman who seeks affection and status more urgently than moral rectitude or truth. This is reflected in the negative value of her "Positve_Sentiment_Score" (-1).

Thus, we are able to quantify the character sketches of the lead characters in the play.

Reference for character sketch - https://www.sparknotes.com/shakespeare/hamlet/characters/






