# Exploring a dictionary-based approach with Empath

Empath (see [Fast et al., 2016](https://dl.acm.org/doi/10.1145/2858036.2858535)) is a tool for analysing a corpus of text against a set of pre-defined linguistic categories, similar to those provided by LIWC. It also lets us create our own categories based on the behaviour we want to examine.

In [None]:
# Uncomment the line below if Empath is not installed.
# !pip install empath

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from empath import Empath
from collections import Counter
lexicon = Empath()

As earlier, we first load the book into a variable.

In [None]:
with open('./data/gutenberg/carroll-alice.txt', 'r', encoding='utf-8-sig') as fo :
    book = fo.readlines()

# Skip the first 23 lines, which contain the table of contents
book = book[23:]

# Remove the newline characters from each line
book = [text.replace('\n', '') for text in book]

# Drop lines that are now empty
book = [text for text in book if len(text) > 0]

## Get the list of categories from Empath (optional)
Empath has a set of predefined categories. 
For this exercise, we will focus on existing categories. 
We can also create our own category, but let's not worry about it for now.
You can uncomment the code below to see a list of all existing categories in Empath.

In [None]:
# Uncomment the line below to see a list of all Empath categories
# lexicon.cats.keys()

Empath also has a built-in function, `analyze`, that scores a given text against all of its categories. To use it, we join the lines of the book into a single string and pass that string to Empath.
The results cover roughly 200 categories, so we sort them in descending order and focus on the top-scoring ones.

In [None]:
book_text = ' '.join(book)
results = lexicon.analyze(book_text, normalize=True)
results_sorted = Counter(results).most_common()
print("Top five dictionary categories by score:")
results_sorted[:5]
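
To make the effect of `normalize=True` concrete, here is a minimal sketch of dictionary-based scoring. It is not Empath's actual implementation: the lexicon, the `toy_analyze` function, and the sample sentence are all hypothetical, and it assumes that normalisation means dividing each category count by the total number of tokens.

```python
from collections import Counter

# Hypothetical toy lexicon: category name -> set of words (not Empath's real data).
toy_lexicon = {
    "size": {"small", "large", "tall", "little"},
    "animal": {"rabbit", "cat", "mouse"},
}

def toy_analyze(text, lexicon, normalize=False):
    """Count how many tokens of `text` fall into each category of `lexicon`."""
    tokens = text.lower().split()  # naive whitespace tokenisation
    counts = {name: 0.0 for name in lexicon}
    for token in tokens:
        for name, words in lexicon.items():
            if token in words:
                counts[name] += 1.0
    if normalize and tokens:
        # Assumption: divide by the token count so texts of different
        # lengths become comparable.
        counts = {name: value / len(tokens) for name, value in counts.items()}
    return counts

sample = "the little white rabbit was too small to reach the tall table"
raw_scores = toy_analyze(sample, toy_lexicon)
norm_scores = toy_analyze(sample, toy_lexicon, normalize=True)

# Counter.most_common() orders the categories from highest to lowest score.
ranked = Counter(norm_scores).most_common()
print(raw_scores)
print(ranked)
```

Raw counts grow with the length of the text, whereas normalised scores are proportions, which is why the normalised form is the more useful one when comparing texts of different lengths.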

---
We can also plot these values for easier comparison.

In [None]:
sns.set_context('notebook', font_scale=0.9)
sns.set_style('ticks')
plt.figure(figsize=(4,5))
sns.barplot(dict(results_sorted[0:20]), orient='h', color='steelblue')
sns.despine(right=True, top=True)

## Analysing a piece of text using a particular category in Empath

Let's first create a dataframe so that we can store any computed metrics alongside each unit of text (here, each line of the book).

In [None]:
book_df = pd.DataFrame({'text' : book})
print("********************************************")
print(" Loaded", book_df.shape[0], "lines of text into dataframe.")
print("********************************************")
book_df[0:10]

We can wrap the category scoring in a function, so that we can apply it to every row of the dataframe.

In [None]:
def calc_category(text, category_name, normalize=True):
    """Return the Empath score of `text` for a single category."""
    score = lexicon.analyze(text, categories=[category_name], normalize=normalize)
    return score[category_name]

category = 'shape_and_size'
# Score every line of the book; normalize=False keeps raw word counts per line.
book_df[category] = book_df.apply(
    lambda x: calc_category(x['text'], category, normalize=False),
    axis=1)
book_df.sample(5)

---
Recall the use of `lambda` from the previous notebook.

In the above code, we use it to pass the function we created (`calc_category`) to `apply`, a built-in pandas method. With `axis=1`, `apply` calls the function once per row, and the results fill an entire new column.
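
As a small illustration of the two ways to hand extra arguments to `apply`, the sketch below uses a toy dataframe and a hypothetical `count_word` function standing in for `calc_category`, so that it runs without Empath. The row-wise `lambda` mirrors the cell above; the column-wise call forwards extra arguments through `args` and keyword arguments instead.

```python
import pandas as pd

def count_word(text, word, ignore_case=True):
    """Hypothetical stand-in for calc_category: count occurrences of `word` in `text`."""
    tokens = text.lower().split() if ignore_case else text.split()
    target = word.lower() if ignore_case else word
    return tokens.count(target)

toy_df = pd.DataFrame({"text": ["Alice was tired", "the Rabbit ran", "Alice saw Alice"]})

# Row-wise apply with a lambda: each row is passed in and we pick out 'text'.
toy_df["via_lambda"] = toy_df.apply(
    lambda row: count_word(row["text"], "alice"), axis=1)

# Column-wise apply: each value of 'text' is passed in directly, and the
# extra positional and keyword arguments are forwarded to count_word.
toy_df["via_args"] = toy_df["text"].apply(
    count_word, args=("alice",), ignore_case=True)

print(toy_df)
```

Both columns hold the same values; which form to prefer is largely a matter of readability.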

---
We can now examine how our category score changes over the length of the book.

In [None]:
plt.figure(figsize=(15,3),dpi=600)
g = sns.lineplot(data = book_df, y=category, x=book_df.index,
                 color='steelblue', lw=0.5)
sns.despine(right=True, top=True)

We can then choose a particular range of line numbers and examine the text closely against the category scores.

In [None]:
plt.figure(figsize=(3,6))
# sns.set_context('notebook', font_scale=0.7)
g = sns.barplot(data = book_df[675:700], x=category, y='text',
                color='steelblue')
sns.despine(right=True, top=True)

In [None]:
plt.figure(figsize=(2,6))
category_text_df = book_df[book_df[category] > 0]
g = sns.barplot(data = category_text_df[0:25], x=category, y='text',
                color='steelblue')
sns.despine(right=True, top=True)

## A similar approach for sentiment analysis
We can either add more dictionary categories, or use a different metric.
Let's try sentiment analysis, for which we will use [VADER](https://github.com/cjhutto/vaderSentiment).

In [None]:
# Uncomment the line below if you don't have VADER installed.
# !pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

In [None]:
def sentiment(text):
    """Return VADER's compound sentiment score for `text` (ranges from -1 to +1)."""
    return analyzer.polarity_scores(text)['compound']

In [None]:
book_df['sentiment'] = book_df['text'].apply(sentiment)
book_df.sample(5)
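
The compound score is a single number between -1 and +1. If a categorical label is more convenient than a number, one option is to threshold it. The sketch below follows the cut-offs suggested in the VADER README (0.05 and -0.05); the `label_sentiment` function and the example scores are hypothetical and only serve as illustration.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score to 'positive', 'negative' or 'neutral'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Made-up compound scores, not computed from the book.
example_scores = [0.62, -0.41, 0.0, 0.03]
labels = [label_sentiment(score) for score in example_scores]
print(labels)
```

In the notebook, the same function could be applied to the existing column, for example with `book_df['sentiment'].apply(label_sentiment)`.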

As in the earlier case, we can visualize the scores, this time using different colours for positive and negative sentiment.

In [None]:
plt.figure(figsize=(2,6))
# Colour each bar by its sentiment score; legend=False hides the redundant legend.
sns.barplot(data=book_df[-30:], x='sentiment', y='text', hue='sentiment',
            palette=sns.color_palette("vlag_r", as_cmap=True), legend=False)
sns.despine(right=True, top=True)

Do you observe anything interesting in how the sentiment changes over parts of your story?

## Studying correlations
Do you notice any correlations between your chosen LIWC/Empath measure and sentiment? 
You can use a 2D histogram to plot them together.

In [None]:
plt.figure(figsize=(4,4))
g = sns.histplot(data=book_df, x=category, y="sentiment")
sns.despine(right=True, top=True)
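
Alongside the plot, a single correlation coefficient can summarise how strongly two measures move together. The sketch below computes Pearson's r by hand on made-up numbers; the `pearson` helper and both lists are hypothetical and are not taken from the book.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)

# Made-up scores for five lines of text.
category_scores = [0, 1, 2, 3, 4]
sentiment_scores = [0.1, 0.0, -0.2, -0.3, -0.5]

r = pearson(category_scores, sentiment_scores)
print(round(r, 3))
```

Values near +1 or -1 indicate a strong linear relationship and values near 0 a weak one. With the dataframe from this notebook, `book_df[category].corr(book_df['sentiment'])` should give the same kind of summary, since pandas uses Pearson correlation by default.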

# Next Steps

What kind of hunch do you have about the tone, style, or themes of your favourite book? How will you verify it?

Can you use this approach to compare tones and styles across different books? How will you try it out?