# Introduction

We are going to put what we learned in the previous notebook to good use by analyzing a real book: Bram Stoker's *Dracula*.

You can download the book here: https://www.gutenberg.org/ebooks/345



In [30]:
import spacy
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
nlp = spacy.load('en_core_web_lg')
# Rule-based matching, visualization, emotion classification, and plotting
from spacy.matcher import Matcher
from spacy import displacy
from transformers import pipeline
import pandas as pd
import plotly.express as px

## Analyzing how Jonathan Harker Feels
The first chapters are Harker's journal. We will extract the adjectives and nouns he uses, so that we can understand how he evolves with his situation. Asking Google about the first chapters returns this:

""" 

In Dracula, Harker's journal is one of the longest and most important parts of the novel. During his long stay at Dracula's castle, Jonathan discloses his fears of Dracula in his journal. Because he is alone for so long, Jonathan wonders at times if what he is witnessing is real, or whether he is losing his mind.

"""

So we will analyze the journal to check whether this is true.

In [2]:
# Load the full text of the book into memory
with open('dracula book.txt', 'r') as file:
    data = file.read()

In [3]:
# Keep only Harker's journal (chapters I to IV): everything before the first "CHAPTER V"
harker_journal = data.split("CHAPTER V",1)[0]
#print(harker_journal)

In [4]:
# Run the spaCy pipeline on the journal text
doc = nlp(harker_journal)

In [5]:
# Optional: render the named entities (uncomment to display)
#displacy.render(doc,style='ent')

In [6]:
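# ACTIONS: a verb, then any number of alphabetic tokens, then an adverb
pattern1 = [{'POS': 'VERB'},
            {'IS_ALPHA': True, 'OP': '*'},
            {'POS': 'ADV'}]

# DESCRIPTIONS: an adjective immediately followed by a noun
pattern2 = [{'POS': 'ADJ'},
            # {'IS_ALPHA': True, 'OP': '*'},
            {'POS': 'NOUN'}]

matcher = Matcher(nlp.vocab)
# greedy="FIRST": among overlapping matches, keep the one that starts first
matcher.add('ACTIONS', [pattern1], greedy="FIRST")
matcher.add('DESCRIPTIONS', [pattern2], greedy="FIRST")
matches = matcher(doc)

# Sort the matches by start token so they follow the order of the text
matches.sort(key=lambda x: x[1])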
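To make the shape of the matcher output easier to read, here is a minimal sketch on a toy sentence. It assumes a blank English pipeline with hand-assigned POS tags, so it does not need `en_core_web_lg`; the sentence and the `toy_*` names are hypothetical and not part of the analysis.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Blank pipeline: no tagger, so the POS tags are set by hand below (an assumption
# made only to keep the example small).
toy_nlp = spacy.blank("en")

# Hypothetical toy sentence with hand-assigned coarse POS tags.
words = ["A", "wonderful", "place", "and", "a", "dreadful", "night"]
pos = ["DET", "ADJ", "NOUN", "CCONJ", "DET", "ADJ", "NOUN"]
toy_doc = Doc(toy_nlp.vocab, words=words, pos=pos)

toy_matcher = Matcher(toy_nlp.vocab)
toy_matcher.add("DESCRIPTIONS", [[{"POS": "ADJ"}, {"POS": "NOUN"}]], greedy="FIRST")

# Each match is (match_id, start, end); match_id is the hash of the rule name,
# and the StringStore turns it back into readable text.
toy_found = sorted(
    (start, end, toy_nlp.vocab.strings[match_id], toy_doc[start:end].text)
    for match_id, start, end in toy_matcher(toy_doc)
)
for item in toy_found:
    print(item)
```

The long integers printed in the next cell are these same hashed rule names, and `start`/`end` are token offsets into the document.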

In [7]:
for match in matches:
    print(f"{match} -> {doc[match[1]:match[2]]} -> { nlp.vocab[match[0]].text}")

(13441827139367135875, 40, 42) -> next morning -> DESCRIPTIONS
(13441827139367135875, 62, 64) -> wonderful place -> DESCRIPTIONS
(5492369006848601173, 89, 94) -> feared to go very far -> ACTIONS
(13441827139367135875, 110, 112) -> correct time -> DESCRIPTIONS
(13441827139367135875, 137, 139) -> splendid bridges -> DESCRIPTIONS
(13441827139367135875, 148, 150) -> noble width -> DESCRIPTIONS
(5492369006848601173, 165, 168) -> left in pretty -> ACTIONS
(13441827139367135875, 168, 170) -> good time -> DESCRIPTIONS
(13441827139367135875, 255, 257) -> national dish -> DESCRIPTIONS
(5492369006848601173, 263, 266) -> get it anywhere -> ACTIONS
(5492369006848601173, 272, 280) -> found my smattering of German very useful here -> ACTIONS
(13441827139367135875, 409, 411) -> known portions -> DESCRIPTIONS
(13441827139367135875, 428, 430) -> exact locality -> DESCRIPTIONS
(5492369006848601173, 438, 446) -> are no maps of this country as yet -> ACTIONS
(13441827139367135875, 514, 516) -> distinct nat

If we read the DESCRIPTIONS matches, we see that the adjectives steadily get darker and everything gets worse. The actions are harder to understand and analyze, so here we will analyze the descriptions together with the token number (where each one appears), to see if we can find spots where the journal becomes darker.


In [8]:
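# Keep only the DESCRIPTIONS matches, looking the rule name up instead of hard-coding its hash
description = [m for m in matches if nlp.vocab.strings[m[0]] == 'DESCRIPTIONS']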

In [9]:
for match in description:
    print(f"{match} -> {doc[match[1]:match[2]]} -> { nlp.vocab[match[0]].text}") 

(13441827139367135875, 40, 42) -> next morning -> DESCRIPTIONS
(13441827139367135875, 62, 64) -> wonderful place -> DESCRIPTIONS
(13441827139367135875, 110, 112) -> correct time -> DESCRIPTIONS
(13441827139367135875, 137, 139) -> splendid bridges -> DESCRIPTIONS
(13441827139367135875, 148, 150) -> noble width -> DESCRIPTIONS
(13441827139367135875, 168, 170) -> good time -> DESCRIPTIONS
(13441827139367135875, 255, 257) -> national dish -> DESCRIPTIONS
(13441827139367135875, 409, 411) -> known portions -> DESCRIPTIONS
(13441827139367135875, 428, 430) -> exact locality -> DESCRIPTIONS
(13441827139367135875, 514, 516) -> distinct nationalities -> DESCRIPTIONS
(13441827139367135875, 587, 589) -> eleventh century -> DESCRIPTIONS
(13441827139367135875, 628, 630) -> imaginative whirlpool -> DESCRIPTIONS
(13441827139367135875, 763, 765) -> more paprika -> DESCRIPTIONS
(13441827139367135875, 794, 796) -> excellent dish -> DESCRIPTIONS
(13441827139367135875, 877, 879) -> further east -> DESCRIPTI

In [13]:
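def matches_to_df(matches):
    """Build a DataFrame with each matched text and the index of its first token."""
    elements_matcher = []
    token_begin = []
    for match in matches:
        # Each match is (match_id, start, end); `doc` is the global spaCy Doc
        text = str(doc[match[1]:match[2]])
        elements_matcher.append(text)
        token_begin.append(match[1])
    df = pd.DataFrame({'text': elements_matcher,
                       'token_begin': token_begin})
    return df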

In [14]:
df =  matches_to_df(description)

In [15]:
classifier = pipeline(task="text-classification", model="SamLowe/roberta-base-go_emotions", top_k=None)
model_outputs = classifier(list(df.text))
print(model_outputs[0][0])

{'label': 'neutral', 'score': 0.9610273838043213}


In [26]:
# Keep the first label returned for each phrase
labels = [output[0]['label'] for output in model_outputs]
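The cell above takes the first entry of each output, which assumes the pipeline returns the scores sorted from highest to lowest. A minimal sketch that does not rely on that ordering picks the maximum score explicitly. The `sample_outputs` values and the `top_label` helper are hypothetical, made up for illustration.

```python
# Hypothetical classifier output for two phrases; the scores are invented.
sample_outputs = [
    [{"label": "neutral", "score": 0.96}, {"label": "approval", "score": 0.02}],
    [{"label": "joy", "score": 0.10}, {"label": "admiration", "score": 0.85}],
]


def top_label(scores):
    """Return the label with the highest score, whatever the list order."""
    return max(scores, key=lambda item: item["score"])["label"]


sample_labels = [top_label(scores) for scores in sample_outputs]
print(sample_labels)
```

With the real data this would read `labels = [top_label(scores) for scores in model_outputs]`.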

In [25]:
df['emotion'] = labels
df

| | text | token_begin | emotion |
|---|---|---|---|
| 0 | next morning | 40 | neutral |
| 1 | wonderful place | 62 | admiration |
| 2 | correct time | 110 | neutral |
| 3 | splendid bridges | 137 | neutral |
| 4 | noble width | 148 | neutral |
| ... | ... | ... | ... |
| 769 | dreadful place | 28291 | fear |
| 770 | nearest train | 28306 | neutral |
| 771 | cursed spot | 28313 | annoyance |
| 772 | cursed land | 28318 | anger |


I disagree with some of the labels the classifier assigned, such as "splendid bridges" being neutral, but overall it looks OK!
There are a lot of neutral labels, so let us take a quick look at the emotions with those removed.

In [28]:
[emotion for emotion in labels if emotion != 'neutral']

['admiration',
 'joy',
 'admiration',
 'admiration',
 'admiration',
 'fear',
 'admiration',
 'joy',
 'admiration',
 'admiration',
 'confusion',
 'admiration',
 'admiration',
 'admiration',
 'admiration',
 'admiration',
 'excitement',
 'admiration',
 'disappointment',
 'admiration',
 'sadness',
 'sadness',
 'admiration',
 'admiration',
 'fear',
 'fear',
 'annoyance',
 'fear',
 'admiration',
 'admiration',
 'fear',
 'admiration',
 'admiration',
 'admiration',
 'admiration',
 'gratitude',
 'admiration',
 'gratitude',
 'admiration',
 'admiration',
 'admiration',
 'admiration',
 'admiration',
 'fear',
 'surprise',
 'admiration',
 'admiration',
 'admiration',
 'admiration',
 'love',
 'surprise',
 'surprise',
 'surprise',
 'admiration',
 'admiration',
 'surprise',
 'sadness',
 'fear',
 'fear',
 'fear',
 'sadness',
 'admiration',
 'fear',
 'fear',
 'admiration',
 'pride',
 'embarrassment',
 'annoyance',
 'admiration',
 'admiration',
 'admiration',
 'admiration',
 'admiration',
 'sadness',
 'fe
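A long list like the one above is hard to read at a glance. One possible summary, sketched here with `collections.Counter`, is to count how often each non-neutral label appears. The `sample` list is only illustrative: its first six labels are copied from the output above and the two `neutral` entries are added by assumption.

```python
from collections import Counter

# Illustrative sample: six labels from the output above plus two assumed neutrals.
sample = ["admiration", "joy", "admiration", "admiration",
          "admiration", "fear", "neutral", "neutral"]

# Count every label except "neutral", most frequent first.
counts = Counter(label for label in sample if label != "neutral")
print(counts.most_common())
```

On the real data, replacing `sample` with `labels` would give the full tally.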

In [29]:
# It is very interesting to see how the emotions are indeed changing!

In [33]:
len(df.emotion.unique())

19

In [62]:
color_map = {
    "love": "lightpink",
    "gratitude":"lime", 
    "admiration":"lightsalmon",
    "joy": "yellow",
    
    "excitement":"cyan",
    'pride':"gold", 
    'amusement':"magenta",
    "approval":"green",
    'surprise':"orange",
    "neutral":"white",
    
    "sadness": "darkblue", 
    "confusion": "chocolate", 
    'embarrassment':"darkorange",
    "disappointment":"slategray", 
    'disapproval':"darksalmon", 
    'annoyance':"darkviolet", 
    'disgust': "darkgreen", 
    "anger": "darkred", 
    "fear": "black"}
    
fig = px.scatter(df, x="token_begin", color="emotion",color_discrete_map=color_map)
fig.show()

Notice the darker colors at the end: those are reserved for "bad" emotions. The neutrality we had at the beginning is lost by the end of the journey!
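The visual impression can also be checked with a number. The sketch below groups matches into fixed-size token bins and computes, for each bin, the share of non-neutral labels that are "bad". The `NEGATIVE` set mirrors the dark entries of `color_map`, which is an assumption about what counts as a bad emotion; the bin size and the helper name `negative_share_by_bin` are hypothetical. The rows are only the ones visible in the DataFrame preview above, so the result illustrates the method and is not a finding.

```python
from collections import defaultdict

# Assumed grouping: the labels given dark colors in color_map.
NEGATIVE = {"sadness", "confusion", "embarrassment", "disappointment",
            "disapproval", "annoyance", "disgust", "anger", "fear"}


def negative_share_by_bin(rows, bin_size):
    """Map each bin start (in tokens) to the share of non-neutral labels that are negative."""
    totals = defaultdict(int)
    negatives = defaultdict(int)
    for token_begin, emotion in rows:
        if emotion == "neutral":
            continue  # neutral phrases carry no signal for this measure
        start = (token_begin // bin_size) * bin_size
        totals[start] += 1
        negatives[start] += emotion in NEGATIVE
    return {start: negatives[start] / totals[start] for start in sorted(totals)}


# Only the rows shown in the DataFrame preview above.
rows = [(40, "neutral"), (62, "admiration"), (110, "neutral"),
        (137, "neutral"), (148, "neutral"), (28291, "fear"),
        (28306, "neutral"), (28313, "annoyance"), (28318, "anger")]

shares = negative_share_by_bin(rows, bin_size=5000)
print(shares)
```

On the full data, `rows` could be built with `zip(df.token_begin, df.emotion)`; a share that rises across bins would support the reading of the plot.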