# Emotion dynamics in books related to technology
**Research question: Which emotional words are used in the books on this Goodreads list?** https://www.goodreads.com/list/show/123917.Scary_Tech_Big_Data_Surveillance_Information_Overload_Tech_Addiction_Propaganda_Dark_Money_

## Part I: Process Books

Import the e-books in .epub format and turn each one into a dictionary that maps every word to the number of times it occurs.

### List of books: the top 8 books from the Goodreads list
**book1:** Algorithms of Oppression How Search Engines Reinforce Racism (Safiya Umoja Noble) <br>
**book2:** Data and Goliath The Hidden Battles to Collect Your Data and Control Your World (Bruce Schneier) <br>
**book3:** World Without Mind The Existential Threat of Big Tech (Franklin Foer) <br>
**book4:** How to Do Nothing Resisting the Attention Economy (Jenny Odell)<br>
**book5:** Irresistible The Rise of Addictive Technology and the Business of Keeping Us Hooked (Adam Alter)<br>
**book6:** The Age of Surveillance Capitalism The Fight for a Human Future at the New Frontier of Power (Shoshana Zuboff)<br>
**book7:** The Attention Merchants The Epic Scramble to Get Inside Our Heads (Tim Wu)<br>
**book8:** Weapons of Math Destruction How Big Data Increases Inequality and Threatens Democracy (Cathy O’Neil)<br>


I first write a function `get_words(book)` that reads an e-book (.epub) and returns a dictionary `book_dict`, where each key is the name the e-book assigns to a chapter and each value is the text of that chapter.

In [2]:
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup

def get_words(book):
    # Read the .epub file and map each chapter document to its paragraph text.
    epub_book = epub.read_epub(book)

    book_dict = {}
    for item in epub_book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        # Keep only the text inside <p> tags; other markup is skipped.
        soup = BeautifulSoup(item.get_body_content(), 'html.parser')
        text = [para.get_text() for para in soup.find_all('p')]
        book_dict[item.get_name()] = ' '.join(text)

    return book_dict
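
To illustrate what `get_words` keeps and drops, here is a minimal sketch that mimics the `<p>`-only extraction with Python's built-in `html.parser` instead of BeautifulSoup. The class `ParagraphText` and the sample HTML are made up for this illustration and are not part of the notebook.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Hypothetical helper: collect the text found inside <p> elements only."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_paragraph = True
            self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_paragraph = False

    def handle_data(self, data):
        # Text in nested tags such as <em> still counts while a <p> is open.
        if self.in_paragraph:
            self.paragraphs[-1] += data


# Made-up chapter markup, standing in for item.get_body_content().
sample_html = '<h1>Chapter 1</h1><p>Data is <em>power</em>.</p><p>Who holds it?</p>'

parser = ParagraphText()
parser.feed(sample_html)
parser.close()
chapter_text = ' '.join(parser.paragraphs)
print(chapter_text)
```

The chapter heading is dropped because it sits in an `<h1>` tag, so headings do not contribute to the word counts later on.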

I then write a function `get_word_count(book_dict)` that processes all the text in a book and returns a dictionary mapping every word (except stopwords) to the number of times it appears in that book.

In [3]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words("english"))
from nltk.tokenize import word_tokenize

from collections import Counter

def get_word_count(book_dict):
    # Count every alphabetic, non-stopword token across all chapters (lowercased).
    word_count = Counter()
    for text in book_dict.values():
        for word in word_tokenize(text):
            if word.isalpha() and word.lower() not in stop_words:
                word_count[word.lower()] += 1

    return dict(word_count)
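
The shape of the returned dictionary is easiest to see on a single sentence. The sketch below is only an approximation of `get_word_count`: it swaps `word_tokenize` for a plain regular expression and NLTK's stopword list for a tiny made-up set (`SAMPLE_STOP_WORDS`), so it runs without any downloads. The name `count_words` is hypothetical.

```python
import re
from collections import Counter

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead.
SAMPLE_STOP_WORDS = {'the', 'is', 'of', 'and', 'a', 'to'}


def count_words(text, stop_words=SAMPLE_STOP_WORDS):
    # A regex over letter runs stands in for nltk's word_tokenize here.
    tokens = re.findall(r'[a-z]+', text.lower())
    return dict(Counter(token for token in tokens if token not in stop_words))


sample = 'The fear of surveillance is the fear of losing trust.'
counts = count_words(sample)
print(counts)
```

The two tokenizers do not always agree. For example, this regex splits a hyphenated word into two tokens, so counts from the sketch are not interchangeable with the notebook's counts.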


I use both `get_words(book)` and `get_word_count(book_dict)` to process **books 1 to 8** and store the words and word counts of each book in the dictionary `book_word_count`.

In [11]:
# Paths 'books/book1.epub' to 'books/book8.epub'
book_list = [f'books/book{i}.epub' for i in range(1, 9)]

book_word_count = {}
for book in book_list:
    book_dict = get_words(book)
    word_count = get_word_count(book_dict)
    book_word_count[book] = word_count
    

## Part II: Emotion Lexicon
Source: http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm <br>
The NRC Emotion Lexicon is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). The annotations were made manually through crowdsourcing.

I use the eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) for this analysis.

Reference:
Mohammad, S. M., & Turney, P. D. (2013). Crowdsourcing a word–emotion association lexicon. Computational intelligence, 29(3), 436-465.

I write a function `get_lexicon(term)` and loop through all eight emotions to read the list of words associated with each emotion from its .txt file.

In [12]:
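def get_lexicon(term):
    # Each line holds a word and a 0/1 flag; keep the words flagged 1 for this emotion.
    word_list = []
    with open(f'NRC-Emotion-Lexicon/OneFilePerEmotion/{term}-NRC-Emotion-Lexicon.txt', 'r') as file:
        for line in file:
            parts = line.split()
            # Skip blank or malformed lines instead of raising an IndexError.
            if len(parts) == 2 and parts[1] == '1':
                word_list.append(parts[0])
    return word_list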

In [13]:
emotion_lexicon = {}
emotions = ['anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'trust']

for emotion in emotions:
    emotion_lexicon[emotion] = set(get_lexicon(emotion))
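
As a minimal sketch of the parsing step, the snippet below applies the same logic to an in-memory sample instead of a file on disk. It assumes the layout that `get_lexicon` relies on, one word and one 0/1 flag per line separated by whitespace. The three sample entries and the name `parse_lexicon` are made up for illustration and are not quoted from the lexicon.

```python
import io

# Made-up excerpt in the "word<TAB>flag" layout that get_lexicon expects.
SAMPLE_FILE = 'abandon\t1\nabacus\t0\nthreat\t1\n'


def parse_lexicon(handle):
    # Keep only the words whose flag is 1, i.e. words associated with the emotion.
    words = set()
    for line in handle:
        parts = line.split()
        if len(parts) == 2 and parts[1] == '1':
            words.add(parts[0])
    return words


sample_words = parse_lexicon(io.StringIO(SAMPLE_FILE))
print(sorted(sample_words))
```

Storing each lexicon as a set, as the loop above does, keeps the later `word in lexicon` checks fast.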

## Part III: Combining & Presenting the results

In this part, I combine the first part (the words and their counts from the books) with the second part (the groups of emotion-related words). The dictionary `emotion_word_count` has the eight emotions as keys; each value is a dictionary that maps every word in that emotion's lexicon to its total count across all eight books.

I then write a function called `get_top_words(emotion, N)`, which returns the N words with the highest counts in that emotion's lexicon.

In [16]:
# Start every lexicon word at zero so that words absent from the books are still listed.
emotion_word_count = {
    emotion: dict.fromkeys(lexicon, 0)
    for emotion, lexicon in emotion_lexicon.items()
}

# Add each book's count of a word to every emotion whose lexicon contains it.
for word_count in book_word_count.values():
    for word, count in word_count.items():
        for emotion, lexicon in emotion_lexicon.items():
            if word in lexicon:
                emotion_word_count[emotion][word] += count
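
A toy run makes the aggregation concrete. The sketch below uses made-up books and lexicons (all `toy_` names are hypothetical) and a `Counter` per emotion. Unlike the cell above, it does not list lexicon words with a count of zero.

```python
from collections import Counter

# Made-up inputs with the same shape as book_word_count and emotion_lexicon.
toy_book_word_count = {
    'book_a': {'threat': 3, 'hope': 1, 'data': 10},
    'book_b': {'threat': 2, 'hope': 4},
}
toy_emotion_lexicon = {
    'fear': {'threat'},
    'anticipation': {'hope', 'threat'},
}

toy_emotion_word_count = {emotion: Counter() for emotion in toy_emotion_lexicon}
for word_count in toy_book_word_count.values():
    for emotion, lexicon in toy_emotion_lexicon.items():
        # Only words that occur in both the book and the lexicon are added.
        for word in word_count.keys() & lexicon:
            toy_emotion_word_count[emotion][word] += word_count[word]

print(toy_emotion_word_count)
```

In this toy data `threat` belongs to two lexicons, so its five occurrences are counted once under each emotion. The same happens in the real data for any word that the lexicon associates with more than one emotion.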

In [17]:
import heapq

def get_top_words(emotion, N):
    top_words = heapq.nlargest(N, emotion_word_count[emotion].items(), key=lambda x: x[1])
    return top_words
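
As a small illustration of the selection step, the sketch below runs `heapq.nlargest` on a made-up count dictionary (`toy_counts` is hypothetical) and checks it against the equivalent full sort.

```python
import heapq

# Made-up counts with the same shape as emotion_word_count[emotion].
toy_counts = {'threat': 5, 'hope': 5, 'abandon': 0, 'risk': 7}

# nlargest returns (word, count) pairs ordered from highest to lowest count.
top_two = heapq.nlargest(2, toy_counts.items(), key=lambda item: item[1])

# Sorting everything and slicing gives the same result, but sorts the whole lexicon.
top_two_sorted = sorted(toy_counts.items(), key=lambda item: item[1], reverse=True)[:2]

print(top_two)
```

For N = 20 and lexicons of a few thousand words either approach is fast enough; `heapq.nlargest` simply avoids sorting entries that will never be shown.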

### Presenting results: Bar chart
Here I present eight bar charts, one per basic emotion, each showing the 20 words from that emotion's lexicon that occur most often in the books.

In [33]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="darkgrid")

for emotion in emotions:
    top_words = get_top_words(emotion, 20)

    words = [item[0] for item in top_words]
    counts = [item[1] for item in top_words]

    fig, ax = plt.subplots()
    ax.barh(words, counts)
    # barh draws the first item at the bottom; flip the axis so the top word comes first.
    ax.invert_yaxis()

    ax.set_xlabel('count')
    ax.set_title(f'Top words: {emotion}')

    # Leave room on the left for the word labels.
    fig.subplots_adjust(left=0.2, right=0.95, bottom=0.15, top=0.9)

    fig.savefig(f'plot/top_words_{emotion}.png')
    plt.close(fig)

### Reflection on the bar charts
It is worth questioning whether the words shown in the plots actually carry the emotional connotations that the lexicon assigns to them.

### Presenting results: Pie chart
A pie chart showing, for each of the eight basic emotions, the summed count of all words in its lexicon.

In [38]:
emotion_total_count = {}
for emotion, lexicon_count in emotion_word_count.items():
    emotion_total_count[emotion] = sum(lexicon_count.values())

{'anger': 10536,
 'anticipation': 19721,
 'disgust': 6340,
 'fear': 16747,
 'joy': 12882,
 'sadness': 9802,
 'surprise': 7569,
 'trust': 29187}
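
The percentage labels on the pie chart follow directly from these totals: each emotion's count divided by the sum over all eight emotions. A short worked calculation, using only the numbers printed above (the names `grand_total` and `shares` are introduced here for illustration):

```python
# Totals copied from the output above.
emotion_total_count = {
    'anger': 10536,
    'anticipation': 19721,
    'disgust': 6340,
    'fear': 16747,
    'joy': 12882,
    'sadness': 9802,
    'surprise': 7569,
    'trust': 29187,
}

grand_total = sum(emotion_total_count.values())

# Share of each emotion in percent, as the pie chart's autopct labels report it.
shares = {
    emotion: 100 * count / grand_total
    for emotion, count in emotion_total_count.items()
}

for emotion, share in shares.items():
    print(f'{emotion}: {share:.1f}%')
```

The denominator is the sum of emotion counts, not the number of words in the books, so each percentage is a share of all emotion-word matches.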

In [46]:
import matplotlib.pyplot as plt

labels = list(emotion_total_count.keys())
sizes = list(emotion_total_count.values())

# One colour per emotion, in the same order as the labels.
colors = ['#d62728', '#ff7f0e', '#2ca02c', '#9467bd', '#FCE205', '#1f77b4', '#e377c2', '#7f7f7f']

plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
plt.title('Emotion distribution in books')
plt.axis('equal')  # keep the pie circular
# plt.show()

plt.savefig('plot/pie_chart_emotions.png')
plt.close()


### Reflection on the pie chart
It is worth questioning whether any conclusion can be drawn from this plot.

## Conclusion
