# Book Content Analysis
RQ: Which emotional words are used in the books on this Goodreads list? https://www.goodreads.com/list/show/123917.Scary_Tech_Big_Data_Surveillance_Information_Overload_Tech_Addiction_Propaganda_Dark_Money_

## Get started: one chapter
I start with a small file, sample 1, to get familiar with how EPUB files work.


In [6]:
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup

In [1]:
book = epub.read_epub('books/sample1.epub')
items = list(book.get_items_of_type(ebooklib.ITEM_DOCUMENT))

for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        print('==================================')
        print('NAME : ', item.get_name())
        print('----------------------------------')
        print(item.get_content())
        print('==================================')

NAME :  OEBPS/GeographyofBli_body_split_000.html
----------------------------------
b'<?xml version=\'1.0\' encoding=\'utf-8\'?>\n<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" epub:prefix="z3998: http://www.daisy.org/z3998/2012/vocab/structure/#" lang="en" xml:lang="en">\n  <head/>\n  <body><div class="dedication" id="dedication_1">\n<p class="ded"><em class="calibre1">for Sharon</em></p>\n</div>\n \n</body>\n</html>\n'
NAME :  OEBPS/GeographyofBli_body_split_001.html
----------------------------------
b'<?xml version=\'1.0\' encoding=\'utf-8\'?>\n<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops" epub:prefix="z3998: http://www.daisy.org/z3998/2012/vocab/structure/#" lang="en" xml:lang="en">\n  <head/>\n  <body><div class="dedication" id="epigraph_1">\n\n<p class="ep"><em class="calibre1">In these days of wars and rumors of wars, haven\xe2\x80\x99t you ever dreamed of a place wher
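The `b'...'` prefix and escape sequences such as `\xe2\x80\x99` in the output above suggest that `get_content()` returns raw bytes rather than a string. A minimal sketch, using only a fragment copied from that output, of how decoding as UTF-8 (the encoding declared in the XML header) recovers the typographic apostrophe:

```python
# Fragment of the raw bytes printed above (UTF-8, as the XML header declares).
raw = b"haven\xe2\x80\x99t you ever dreamed of a place"

# Decoding turns the three bytes \xe2\x80\x99 into one right single quotation mark.
decoded = raw.decode("utf-8")
print(decoded)
```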



### html 
The output above shows that `item.get_content()` returns HTML.
In the following cells, I use `BeautifulSoup` to extract the text content of the sample.

In [11]:
for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        print('==================================')
        print('NAME : ', item.get_name())
        print('----------------------------------')
        soup = BeautifulSoup(item.get_body_content(), 'html.parser')
        text = [para.get_text() for para in soup.find_all('p')]
        texts = ' '.join(text)
        print(text)
        print('==================================')

NAME :  OEBPS/GeographyofBli_body_split_000.html
----------------------------------
['for Sharon']
NAME :  OEBPS/GeographyofBli_body_split_001.html
----------------------------------
['In these days of wars and rumors of wars, haven’t you ever dreamed of a place where there was peace and security, where living was not a struggle but a lasting delight?', '—Lost Horizon, directed by Frank Capra, 1937']
NAME :  OEBPS/GeographyofBli_body_split_002.html
----------------------------------
['Introduction', 'My bags were packed and my provisions loaded. I was ready for adventure. And so, on a late summer afternoon, I dragged my reluctant friend Drew off to explore new worlds and, I hoped, to find some happiness along the way. I’ve always believed that happiness is just around the corner. The trick is finding the right corner.', 'Not long into our journey, Drew grew nervous. He pleaded with me to turn back, but I insisted we press on, propelled by an irresistible curiosity about what lay ahead.

In [13]:
for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        print('==================================')
        print('NAME : ', item.get_name())
        print('----------------------------------')
        soup = BeautifulSoup(item.get_body_content(), 'html.parser')
        text = [para.get_text() for para in soup.find_all('p', class_=['cotx', 'calibre3'])]
        texts = ' '.join(text)
        print(text)
        print('==================================')

NAME :  OEBPS/GeographyofBli_body_split_000.html
----------------------------------
[]
NAME :  OEBPS/GeographyofBli_body_split_001.html
----------------------------------
[]
NAME :  OEBPS/GeographyofBli_body_split_002.html
----------------------------------
['My bags were packed and my provisions loaded. I was ready for adventure. And so, on a late summer afternoon, I dragged my reluctant friend Drew off to explore new worlds and, I hoped, to find some happiness along the way. I’ve always believed that happiness is just around the corner. The trick is finding the right corner.', 'Not long into our journey, Drew grew nervous. He pleaded with me to turn back, but I insisted we press on, propelled by an irresistible curiosity about what lay ahead. Danger? Magic? I needed to know, and to this day I’m convinced I would have reached wherever it was I was trying to reach had the Baltimore County Police not concluded, impulsively I thought at the time, that the shoulder of a major thoroughfare

In [26]:
full_text = ''
for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        soup = BeautifulSoup(item.get_body_content(), 'html.parser')
        # keep only paragraphs with class 'cotx' or 'calibre3' (the body text seen above)
        text = [para.get_text() for para in soup.find_all('p', class_=['cotx', 'calibre3'])]
        texts = ' '.join(text)
        # note: items are appended without a separator between them
        full_text += texts
full_text
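When per-item texts are concatenated into one string, it matters whether a separator is placed between items: without one, the last word of one file and the first word of the next end up glued together. A small sketch with two made-up strings standing in for the per-item texts (they are assumptions for illustration, not the notebook's data):

```python
# Two hypothetical per-item texts.
docs = ["The trick is finding the right corner.", "Not long into our journey"]

glued = "".join(docs)    # what repeated `+=` without a separator produces
spaced = " ".join(docs)  # one space between items

# The boundary words merge into a single token in the glued version.
print([w for w in glued.split() if w not in spaced.split()])
```

A merged token such as `corner.Not` is unlikely to pass an `isalpha()` filter, so both boundary words could be lost from the counts.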




In [7]:
import nltk

nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(nltk.corpus.stopwords.words("english"))

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/stefankronborgnielsen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/stefankronborgnielsen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [27]:
from nltk.tokenize import word_tokenize

list_words = []
for word in word_tokenize(full_text):
    if word.isalpha():
        if word.lower() not in stop_words:
            list_words.append(word.lower())
print(list_words[:50])

['bags', 'packed', 'provisions', 'loaded', 'ready', 'adventure', 'late', 'summer', 'afternoon', 'dragged', 'reluctant', 'friend', 'drew', 'explore', 'new', 'worlds', 'hoped', 'find', 'happiness', 'along', 'way', 'always', 'believed', 'happiness', 'around', 'corner', 'trick', 'finding', 'right', 'corner', 'long', 'journey', 'drew', 'grew', 'nervous', 'pleaded', 'turn', 'back', 'insisted', 'press', 'propelled', 'irresistible', 'curiosity', 'lay', 'ahead', 'danger', 'magic', 'needed', 'know', 'day']


In [None]:
from itertools import chain

In [28]:
from collections import Counter

word_count = Counter(list_words)
top_words = word_count.most_common(25)
print(top_words)

[('happiness', 113), ('happy', 58), ('people', 48), ('one', 34), ('like', 32), ('veenhoven', 31), ('world', 26), ('good', 22), ('say', 22), ('dutch', 22), ('know', 20), ('new', 18), ('would', 18), ('happier', 18), ('time', 17), ('places', 15), ('something', 14), ('research', 14), ('many', 13), ('think', 13), ('countries', 13), ('others', 12), ('also', 12), ('man', 12), ('get', 12)]


## One book
book1: Algorithms of Oppression: How Search Engines Reinforce Racism (Safiya Umoja Noble)

In [8]:
book = epub.read_epub('books/book1.epub')
items = list(book.get_items_of_type(ebooklib.ITEM_DOCUMENT))


for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        print('NAME : ', item.get_name())

NAME :  nyu-noble-0001.xhtml
NAME :  nyu-noble-0002.xhtml
NAME :  nyu-noble-0003.xhtml
NAME :  nyu-noble-0004.xhtml
NAME :  nyu-noble-0005.xhtml
NAME :  nyu-noble-0006.xhtml
NAME :  nyu-noble-0007.xhtml
NAME :  nyu-noble-0008.xhtml
NAME :  nyu-noble-0009.xhtml
NAME :  nyu-noble-0010.xhtml
NAME :  nyu-noble-0011.xhtml
NAME :  nyu-noble-0012.xhtml
NAME :  nyu-noble-0013.xhtml
NAME :  nyu-noble-0014.xhtml
NAME :  nyu-noble-0015.xhtml
NAME :  nyu-noble-0016.xhtml
NAME :  nyu-noble-0017.xhtml
NAME :  nyu-noble-0018.xhtml
NAME :  nyu-noble-0019.xhtml
NAME :  nyu-noble-0020.xhtml
NAME :  toc.xhtml




In [9]:
for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        print('==================================')
        print('NAME : ', item.get_name())
        print('----------------------------------')
        print(item.get_body_content().decode('utf-8'))
        print('==================================')


NAME :  nyu-noble-0001.xhtml
----------------------------------
<body style="margin-top: 0px; margin-left: 0px; margin-right: 0px; margin-bottom: 0px;"><div style="display:none;"><a id="GBS.0001.01"/></div>
    <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" height="99%" preserveaspectratio="xMidYMid meet" version="1.1" viewbox="0 0 533 800" width="100%"><image alt="" height="800" width="533" xlink:href="images/nyu-noble-cover.jpg"/></svg><section xmlns:epub="http://www.idpf.org/2007/ops" class="chapter" epub:type="cover" id="cvi">
      <div class="cover">
        <p class="fig"><a id="GBS.0002.01"/></p>
      </div>
    </section>
  <div style="display:none;"><a id="GBS.0002.02"/></div></body>


NAME :  nyu-noble-0002.xhtml
----------------------------------
<div style="display:none;"><a id="GBS.0003.01"/></div>
    <section xmlns:epub="http://www.idpf.org/2007/ops" class="chapter" epub:type="halftitlepage" id="bkht">
      <h1 class="bkht"><a class

In [10]:
for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        print('==================================')
        print('NAME : ', item.get_name())
        print('----------------------------------')
        soup = BeautifulSoup(item.get_body_content(), 'html.parser')
        text = [para.get_text() for para in soup.find_all('p')]
        texts = ' '.join(text)
        print(text)
        print('==================================')

NAME :  nyu-noble-0001.xhtml
----------------------------------
['']
NAME :  nyu-noble-0002.xhtml
----------------------------------
[]
NAME :  nyu-noble-0003.xhtml
----------------------------------
['Safiya Umoja Noble', '', 'NEW YORK UNIVERSITY PRESS', 'New York']
NAME :  nyu-noble-0004.xhtml
----------------------------------
['NEW YORK UNIVERSITY PRESS', 'New York', 'www.nyupress.org', '© 2018 by New York University', 'All rights reserved', 'References to Internet websites (URLs) were accurate at the time of writing. Neither the author nor New York University Press is responsible for URLs that may have expired or changed since the manuscript was prepared.', 'ISBN: 978-1-4798-3364-1 (e-book)', 'Library of Congress Cataloging-in-Publication Data', 'Names: Noble, Safiya Umoja, author.', 'Title: Algorithms of oppression : how search engines reinforce racism / Safiya Umoja Noble.', 'Description: New York : New York University Press, [2018] | Includes bibliographical references and inde

['On June 28, 2016, Black feminist and mainstream social media erupted with the announcement that Black Girls Code, an organization dedicated to teaching and mentoring African American girls interested in computer programming, would be moving into Google’s New York offices. The partnership was part of Google’s effort to spend $150 million on diversity programs that could create a pipeline of talent into Silicon Valley and the tech industries. But just two years before, searching on “black girls” surfaced “Black Booty on the Beach” and “Sugary Black Pussy” to the first page of Google results, out of the trillions of web-indexed pages that Google Search crawls. In part, the intervention of teaching computer code to African American girls through projects such as Black Girls Code is designed to ensure fuller participation in the design of software and to remedy persistent exclusion. The logic of new pipeline investments in youth was touted as an opportunity to foster an empowered vision f

['1. Matsakis, 2017.', '2. See Peterson, 2014.', '3. This term was coined by Eli Pariser in his book The Filter Bubble (2011).', '4. See Dewey, 2015.', '5. I use phrases such as “the N-word” or “n*gger” rather than explicitly using the spelling of a racial epithet in my scholarship. As a regular practice, I also do not cite or promote non–African American scholars or research that flagrantly uses the racial epithet in lieu of alternative phrasings.', '6. See Sweney, 2009.', '7. See Boyer, 2015; Craven, 2015.', '8. See Noble, 2014.', '9. The term “digital footprint,” often attributed to Nicholas Negroponte, refers to the online identity traces that are used by digital media platforms to understand the profile of a user. The online interactions are often tracked across a variety of hardware (e.g., mobile phones, computers, internet services) and platforms (e.g., Google’s Gmail, Facebook, and various social media) that are on the World Wide Web. Digital traces are often used in the data-m

In [11]:
book_dict = {}
for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        soup = BeautifulSoup(item.get_body_content(), 'html.parser')
        text = [para.get_text() for para in soup.find_all('p')]
        texts = ' '.join(text)
        book_dict[item.get_name()] = texts

After inspecting the output, I manually select items 0007 to 0017, which are the chapters with actual content.

In [12]:
book_dict['nyu-noble-0013.xhtml']

'Student protests on college campuses have led to calls for increased support of students of color, but one particular request became a matter of national policy that led to a threat to the Library of Congress’s budget in the summer of 2016. In February 2014, a coalition of students at Dartmouth College put forward “The Plan for Dartmouth’s Freedom Budget: Items for Transformative Justice at Dartmouth” (the “Freedom Plan”),1 which included a line item to “ban the use of ‘illegal aliens,’ ‘illegal immigrants,’ ‘wetback,’ and any racially charged term on Dartmouth-sanctioned programming materials and locations.” The plan also demanded that “the library search catalog system shall use undocumented instead of ‘illegal’ in reference to immigrants.” Lisa Peet, reporting for Library Journal, noted, The replacement of the subject heading was the culmination of a two-year grassroots process that began when Melissa Padilla, class of 2016, first noticed what she felt were inappropriate search ter

In [14]:
from nltk.tokenize import word_tokenize
list_chapter = [f"nyu-noble-{str(i).zfill(4)}.xhtml" for i in range(7, 18)]

list_words = []
for chapter in list_chapter:
    text = book_dict[chapter]
    for word in word_tokenize(text):
        if word.isalpha():
            if word.lower() not in stop_words:
                list_words.append(word.lower())


In [15]:
from collections import Counter

word_count = Counter(list_words)
top_words = word_count.most_common(25)
print(top_words)

[('search', 450), ('google', 443), ('information', 382), ('black', 285), ('see', 263), ('people', 253), ('women', 225), ('results', 197), ('public', 184), ('social', 175), ('media', 150), ('web', 144), ('internet', 127), ('ways', 115), ('girls', 111), ('white', 110), ('many', 108), ('one', 107), ('work', 99), ('commercial', 99), ('racial', 97), ('new', 95), ('research', 94), ('digital', 93), ('african', 93)]


In [16]:
len(list_words)

32135

In [17]:
dict_word_count = dict(word_count)

In [62]:
dict_word_count['empowered']

3

In [78]:
dict_word_count['oppressing']

1

In [75]:
list_words_1 = []
for key in book_dict.keys():
    text = book_dict[key]
    for word in word_tokenize(text):
        if word.isalpha():
            if word.lower() not in stop_words:
                list_words_1.append(word.lower())

word_count_1 = Counter(list_words_1)
top_words_1 = word_count_1.most_common(25)
print(top_words_1)


[('search', 514), ('google', 504), ('information', 424), ('black', 353), ('see', 287), ('people', 271), ('women', 265), ('results', 213), ('social', 198), ('public', 195), ('media', 177), ('new', 170), ('web', 170), ('internet', 156), ('white', 137), ('girls', 127), ('racial', 121), ('digital', 119), ('ways', 117), ('race', 117), ('commercial', 111), ('university', 109), ('many', 109), ('research', 107), ('one', 107)]
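Comparing the two outputs shows how much the items outside chapters 7 to 17 contribute. A short sketch of that comparison using `Counter` subtraction, with four counts copied from the two outputs above (the variable names are assumptions introduced here):

```python
from collections import Counter

# Counts copied from the outputs above: selected chapters vs. every item.
chapters_only = Counter({"search": 450, "google": 443, "information": 382, "black": 285})
all_items = Counter({"search": 514, "google": 504, "information": 424, "black": 353})

# Subtraction keeps only positive differences, i.e. occurrences that come
# from items outside the selected chapters.
outside_chapters = all_items - chapters_only
print(outside_chapters.most_common())
```

With the full counters instead of these four entries, the same subtraction would list the words that mostly come from the non-chapter items.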


## Next step 1: find out how to cope with multiple books
- Can I classify the items in some way? How do I know whether an item contains actual content?
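One possible way to approach the question above is a rough heuristic on the extracted paragraphs: very short items are probably front matter, and items where most paragraphs start with a number or contain a URL are probably notes. The function name and both thresholds below are assumptions to be tuned per book, not something the notebook has validated.

```python
import re

def looks_like_body(paragraphs, min_words=500, max_note_share=0.5):
    """Guess whether a list of paragraph strings is running text."""
    words = sum(len(p.split()) for p in paragraphs)
    if words < min_words:
        return False  # e.g. title pages, dedications, copyright notices
    # Paragraphs starting with "12. " or containing a URL look like endnotes.
    note_like = sum(
        1 for p in paragraphs
        if re.match(r"\s*\d+\.\s", p) or "http" in p
    )
    return note_like / len(paragraphs) <= max_note_share

print(looks_like_body(["for Sharon"]))                   # very short item
print(looks_like_body(["1. Matsakis, 2017."] * 400))     # note-like item
```

Items flagged this way would still need a manual check, since a chapter that quotes many URLs could be misclassified.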

book2: Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World (Bruce Schneier)

book3: Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are (Seth Stephens-Davidowitz)

book4: How to Do Nothing: Resisting the Attention Economy (Jenny Odell)

book5: Irresistible: The Rise of Addictive Technology and the Business of Keeping Us Hooked (Adam Alter)

book6: The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power (Shoshana Zuboff)

book7: The Attention Merchants: The Epic Scramble to Get Inside Our Heads (Tim Wu)

book8: Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (Cathy O’Neil)

book9: World Without Mind: The Existential Threat of Big Tech (Franklin Foer)
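With nine books to process, the tokenize, filter, and count steps are easier to reuse when wrapped in one function and applied per book. The sketch below is dependency-free: it uses a regular expression and a tiny stop-word set as stand-ins for `word_tokenize`, `isalpha()`, and the NLTK stop-word list, so its counts would not match NLTK exactly. The book labels and texts are made up for illustration.

```python
import re
from collections import Counter

# Tiny stand-in for the NLTK English stop-word list (assumption for this sketch).
STOP_WORDS = {"the", "a", "of", "and", "to", "is", "in"}

def count_words(text, stop_words=STOP_WORDS):
    """Lower-case the text, keep alphabetic tokens, drop stop words, and count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

# Hypothetical mapping from a book label to its extracted text.
book_texts = {
    "book_a": "Data and the data of the crowd",
    "book_b": "Attention is a scarce resource",
}

counts_per_book = {name: count_words(text) for name, text in book_texts.items()}
print(counts_per_book["book_a"].most_common(2))
```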


In [81]:
book = epub.read_epub('books/book2.epub')
items = list(book.get_items_of_type(ebooklib.ITEM_DOCUMENT))


for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        print('==================================')
        print('NAME : ', item.get_name())

NAME :  index_split_000.html
NAME :  index_split_001.html
NAME :  index_split_002.html
NAME :  index_split_003.html
NAME :  index_split_004.html
NAME :  index_split_005.html
NAME :  index_split_006.html
NAME :  index_split_007.html


In [82]:
for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        print('==================================')
        print('NAME : ', item.get_name())
        print('----------------------------------')
        soup = BeautifulSoup(item.get_body_content(), 'html.parser')
        text = [para.get_text() for para in soup.find_all('p')]
        texts = ' '.join(text)
        print(text)
        print('==================================')

NAME :  index_split_000.html
----------------------------------
['', '', 'SELECTED\tBOOKS\tBY\tBRUCE\tSCHNEIER', ' Carry\tOn:\tSound\tAdvice\tfrom\tSchneier\ton\tSecurity\t(2013)', ' Liars\tand\tOutliers:\tEnabling\tthe\tTrust\tThat\tSociety\tNeeds\tto\tThrive\t(2012) Schneier\ton\tSecurity\t(2008)', ' Beyond\tFear:\tThinking\tSensibly\tabout\tSecurity\tin\tan\tUncertain\tWorld\t(2003) Secrets\tand\tLies:\tDigital\tSecurity\tin\ta\tNetworked\tWorld\t(2000)', ' Applied\tCryptography:\tProtocols,\tAlgorithms,\tand\tSource\tCode\tin\tC\t(1994\tand\t1996) To\tKaren:\tDMASC', 'Contents', 'Introduction', 'Part\tOne:\tThe\tWorld\tWe’re\tCreating', '1.\tData\tas\ta\tBy-product\tof\tComputing', '2.\tData\tas\tSurveillance', '3.\tAnalyzing\tOur\tData', '4.\tThe\tBusiness\tof\tSurveillance', '5.\tGovernment\tSurveillance\tand\tControl', '6.\tConsolidation\tof\tInstitutional\tControl', 'Part\tTwo:\tWhat’s\tat\tStake', '7.\tPolitical\tLiberty\tand\tJustice', '8.\tCommercial\tFairness\tand\tEquality

['allowed\tacademics\tto\tmine\ttheir\tdata:\tHere\tare\ttwo\texamples.\tLars\tBackstrom\tet\tal.\t(5\tJan\t2012),\t“Four\tdegrees of\tseparation,”\tarXiv:1111.4570\t[cs.SI],\thttp://arxiv.org/abs/1111.4570.\tRussell\tB.\tClayton\t(Jul\t2014),\t“The\tthird wheel:\tThe\timpact\tof\tTwitter\tuse\ton\trelationship\tinfidelity\tand\tdivorce,”  Cyberpsychology,\tBehavior,\tand\tSocial Networking\t17,\thttp://www.cs.vu.nl/~eliens/sg/local/cyber/twitter-infidelity.pdf. ', 'Facebook\tcan\tpredict: The\texperiment\tcorrectly\tdiscriminates\tbetween\thomosexual\tand\theterosexual\tmen\tin\t88%', 'of\tcases,\tAfrican\tAmericans\tand\tCaucasian\tAmericans\tin\t95%\tof\tcases,\tand\tDemocrats\tand\tRepublicans\tin\t85%\tof cases.\tMichal\tKosinski,\tDavid\tStillwell,\tand\tThore\tGraepel\t(11\tMar\t2013),\t“Private\ttraits\tand\tattributes\tare predictable\tfrom\tdigital\trecords\tof\thuman\tbehavior,”  Proceedings\tof\tthe\tNational\tAcademy\tof\tSciences\tof\tthe United\tStates\tof\tAmerica,\tEar

['Russian\tlaw\trequiring\tbloggers: Neil\tMacFarquhar\t(6\tMay\t2014),\t“Russia\tquietly\ttightens\treins\ton\tweb\twith', '‘Bloggers\tLaw,’”  New\tYork\tTimes,\thttp://www.nytimes.com/2014/05/07/world/europe/russia-quietly-tightens-reinson-web-with-bloggers-law.html. ', 'Those\twho\tdo\tthe\treporting:\tThe\tdeputizing\tof\tcitizens\tto\treport\ton\teach\tother\tis\ttoxic\tto\tsociety.\tIt\tcreates\ta pervasive\tfear\tthat\tunravels\tthe\tsocial\tbonds\tthat\thold\tsociety\ttogether.\tBruce\tSchneier\t(26\tApr\t2007), ', '“Recognizing\t‘hinky’\tvs.\tcitizen\tinformants,”  Schneier\ton\tSecurity, ', 'https://www.schneier.com/blog/archives/2007/04/recognizing_hin_1.html. ', 'Internet\tcompanies\tin\tChina:\tJason\tQ.\tNg\t(12\tMar\t2012),\t“How\tChina\tgets\tthe\tInternet\tto\tcensor\titself,”  Waging Nonviolence,\thttp://wagingnonviolence.org/feature/how-china-gets-the-internet-to-censor-itself. ', 'the\tmore\tsevere\tthe\tconsequences:\tCuiming\tPang\t(2008),\t“Self-censorship\tand\t

['\n']


In [83]:
book_dict = {}
for item in book.get_items():
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        soup = BeautifulSoup(item.get_body_content(), 'html.parser')
        text = [para.get_text() for para in soup.find_all('p')]
        texts = ' '.join(text)
        book_dict[item.get_name()] = texts

list_words = []
for key in book_dict.keys():
    text = book_dict[key]
    for word in word_tokenize(text):
        if word.isalpha():
            if word.lower() not in stop_words:
                list_words.append(word.lower())

In [97]:
word_count = Counter(list_words)
top_words = word_count.most_common(25)
print(top_words)

dict_word_count = dict(word_count)

[('http', 1444), ('data', 1097), ('surveillance', 853), ('us', 723), ('nsa', 699), ('privacy', 440), ('government', 430), ('people', 412), ('security', 397), ('internet', 392), ('new', 326), ('information', 319), ('companies', 285), ('one', 259), ('use', 252), ('law', 230), ('like', 222), ('know', 221), ('google', 205), ('even', 202), ('much', 197), ('phone', 194), ('world', 183), ('https', 176), ('need', 173)]
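`http` tops this list, and `https` also appears, which suggests that URLs in the notes are being tokenized into words. One option is to strip URLs from the text before tokenizing. A minimal sketch, where the pattern and function name are assumptions and the sample string is abridged from the notes output above:

```python
import re

# Matches http://..., https://..., or www.... up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def strip_urls(text):
    """Replace anything that looks like a URL with a single space."""
    return URL_PATTERN.sub(" ", text)

sample = "Lars\tBackstrom\tet\tal.\t(5\tJan\t2012),\thttp://arxiv.org/abs/1111.4570.\tRussell\tB.\tClayton"
cleaned = strip_urls(sample)
print(cleaned.split())
```

URLs that were broken across paragraphs during extraction would not be fully caught by this pattern.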


In [117]:
dict_word_count['scared']

11

## Next step 2: build lists of emotional words
- Do I want separate lists of negative and positive emotional words?
- How do I cope with word derivation (e.g. "oppressing" vs. its base form)?

categories: happiness, surprise, anger, fear, sadness, disgust 
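For the word-derivation question, one crude option is to try a word as it is and then with a common suffix removed before looking it up in a lexicon. The sketch below is a simplified stand-in for a proper stemmer or lemmatizer; the suffix list, function name, and toy lexicon are all assumptions.

```python
SUFFIXES = ("ing", "ed", "es", "s", "ly")

def lexicon_form(word, lexicon):
    """Return the lexicon entry matching `word`, or None if there is none.

    Tries the word itself, then the word minus a common suffix,
    then that stem with a final "e" restored.
    """
    if word in lexicon:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            for candidate in (stem, stem + "e"):
                if candidate in lexicon:
                    return candidate
    return None

toy_lexicon = {"offend", "slaughter", "grudge"}
print([lexicon_form(w, toy_lexicon) for w in ["offending", "slaughtered", "grudges", "data"]])
```

Suffix stripping like this can produce false matches, so a real stemmer or lemmatizer would likely be a safer choice for the final analysis.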

In [2]:
anger = open('NRC-Emotion-Lexicon/OneFilePerEmotion/anger-NRC-Emotion-Lexicon.txt', 'r').read().split()
anger

['idiotic',
 '1',
 'offend',
 '1',
 'strained',
 '1',
 'punishment',
 '1',
 'kicking',
 '1',
 'hardened',
 '1',
 'slaughter',
 '1',
 'unfulfilled',
 '1',
 'disillusionment',
 '1',
 'imprisoned',
 '1',
 'cacophony',
 '1',
 'payback',
 '1',
 'trickery',
 '1',
 'retaliation',
 '1',
 'venomous',
 '1',
 'encumbrance',
 '1',
 'lying',
 '1',
 'recession',
 '1',
 'remiss',
 '1',
 'stingy',
 '1',
 'defense',
 '1',
 'suicide',
 '1',
 'diabolical',
 '1',
 'blasphemy',
 '1',
 'destroyer',
 '1',
 'gnome',
 '1',
 'fierce',
 '1',
 'selfish',
 '1',
 'stolen',
 '1',
 'slander',
 '1',
 'tripping',
 '1',
 'unforgiving',
 '1',
 'insurrection',
 '1',
 'wrangling',
 '1',
 'shaky',
 '1',
 'grudge',
 '1',
 'latent',
 '1',
 'scandalous',
 '1',
 'mob',
 '1',
 'exaggerate',
 '1',
 'rob',
 '1',
 'deserted',
 '1',
 'devastation',
 '1',
 'unjust',
 '1',
 'dissension',
 '1',
 'harshness',
 '1',
 'litigate',
 '1',
 'profane',
 '1',
 'thump',
 '1',
 'carnage',
 '1',
 'punishing',
 '1',
 'dishonor',
 '1',
 'intense',
 

In [56]:
anger = []
count = 0
with open('NRC-Emotion-Lexicon/OneFilePerEmotion/anger-NRC-Emotion-Lexicon.txt', 'r') as file:
    for line in file:
        # each line holds a word followed by a 0/1 flag
        word, flag = line.split()[:2]
        if int(flag) == 1:
            anger.append(word)
        count += 1
        if count > 50:  # only read the first 51 lines while testing
            break
anger


['idiotic',
 'offend',
 'strained',
 'punishment',
 'kicking',
 'hardened',
 'slaughter',
 'unfulfilled',
 'disillusionment',
 'imprisoned',
 'cacophony',
 'payback',
 'trickery',
 'retaliation',
 'venomous',
 'encumbrance',
 'lying',
 'recession',
 'remiss',
 'stingy',
 'defense',
 'suicide',
 'diabolical',
 'blasphemy',
 'destroyer',
 'gnome',
 'fierce',
 'selfish',
 'stolen',
 'slander',
 'tripping',
 'unforgiving',
 'insurrection',
 'wrangling',
 'shaky',
 'grudge',
 'latent',
 'scandalous',
 'mob',
 'exaggerate',
 'rob',
 'deserted',
 'devastation',
 'unjust',
 'dissension',
 'harshness',
 'litigate',
 'profane',
 'thump',
 'carnage',
 'punishing']

In [18]:
N = 0
for word, count in dict_word_count.items():
    if word in anger:
        N += count

N 

14721

In [59]:
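def get_lexicon(term):
    """Return the words flagged 1 in the NRC file for the given emotion."""
    word_list = []
    with open(f'NRC-Emotion-Lexicon/OneFilePerEmotion/{term}-NRC-Emotion-Lexicon.txt', 'r') as file:
        for line in file:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            word, flag = parts[0], int(parts[1])
            if flag == 1:
                word_list.append(word)
    return word_list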
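The parsing logic in `get_lexicon` can be checked without the lexicon files by feeding it lines from memory. The sketch below assumes the `word flag` layout seen in the earlier output; `parse_lexicon` and the sample lines (including the `0`-flagged entry) are hypothetical.

```python
import io

def parse_lexicon(lines):
    """Collect the words flagged 1 from `word<whitespace>flag` lines."""
    words = set()
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, flag = parts
        if flag == "1":
            words.add(word)
    return words

# Made-up sample: two flagged words, one unflagged word, one blank line.
sample = io.StringIO("offend\t1\nabacus\t0\n\ngrudge\t1\n")
print(sorted(parse_lexicon(sample)))
```

Returning a set rather than a list also makes later `word in lexicon` checks fast.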

In [60]:
emotion_lexicon = {}
emotions = ['anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'trust']

for emotion in emotions:
    emotion_lexicon[emotion] = set(get_lexicon(emotion))



In [61]:
for emotion, lexicon in emotion_lexicon.items():
    print(emotion, list(lexicon)[:20])
    print(len(lexicon))

anger ['stab', 'surly', 'horror', 'boxing', 'disparage', 'condescension', 'upset', 'discontent', 'jerk', 'distracted', 'criticism', 'selfish', 'traitor', 'ram', 'delusional', 'destroyed', 'unfairness', 'turmoil', 'squelch', 'thoughtless']
1245
anticipation ['opponent', 'distracting', 'revive', 'flinch', 'reconstruction', 'perpetuate', 'wont', 'chant', 'punt', 'ongoing', 'closure', 'seek', 'success', 'completion', 'triumph', 'simmer', 'fortune', 'invoke', 'obliging', 'rail']
837
disgust ['unsettled', 'opponent', 'vulgar', 'mishap', 'surly', 'horror', 'unhappy', 'indignation', 'flinch', 'unthinkable', 'insanity', 'denounce', 'gutter', 'recklessness', 'disparage', 'unfair', 'condescension', 'saturated', 'instinctive', 'ridicule']
1056
fear ['stab', 'horror', 'stretcher', 'lifeless', 'isolated', 'typhoon', 'regiment', 'discontent', 'radioactive', 'traitor', 'delusional', 'destroyed', 'turmoil', 'helmet', 'steal', 'decomposition', 'diagnosis', 'theft', 'accidental', 'perpetrator']
1474
joy 

In [62]:
emotion_counts = {
    'anger': 0,
    'anticipation': 0,
    'disgust': 0,
    'fear': 0,
    'joy': 0,
    'sadness': 0,
    'surprise': 0,
    'trust': 0
}

for word, count in dict_word_count.items():
    for emotion, lexicon in emotion_lexicon.items():
        if word in lexicon:
            emotion_counts[emotion] += count

emotion_counts

{'anger': 650,
 'anticipation': 1117,
 'disgust': 430,
 'fear': 809,
 'joy': 753,
 'sadness': 606,
 'surprise': 339,
 'trust': 1934}
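Raw counts like these depend on the length of the text, so comparing books probably calls for a rate instead. A minimal sketch that converts counts to occurrences per 1,000 tokens; the function name is an assumption, and the token total is a placeholder because the notebook does not record which word list these counts came from.

```python
def emotion_rates(emotion_counts, total_tokens):
    """Convert raw emotion counts to occurrences per 1,000 tokens."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return {emotion: 1000 * n / total_tokens for emotion, n in emotion_counts.items()}

# Two counts from the output above; total_tokens=10_000 is a placeholder value.
rates = emotion_rates({"anger": 650, "trust": 1934}, total_tokens=10_000)
print(rates)
```

Because one word can appear in several emotion lexicons (for example, `horror` is listed under anger, disgust, and fear above), the rates for different emotions overlap and should not be summed.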

In [28]:
len(dict_word_count)

6648