<a href="https://colab.research.google.com/github/senthilchandrasegaran/designing-intelligence/blob/main/notebooks/empath-vs-liwc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring Categories in Empath

Empath (see [Fast et al., 2016](https://dl.acm.org/doi/10.1145/2858036.2858535)) is a tool for analysing a corpus of text to identify occurrences of pre-defined linguistic categories, similar to those provided by LIWC. It also lets us create our own linguistic categories based on the behaviour we want to examine. For instance, we might want to look for instances of **_reflective thinking_** in a corpus of designers' interview transcripts.

In this notebook, we provide a way for you to compare Empath against LIWC. How does a category in Empath compare against a category in LIWC?
 

## First, Some Housekeeping
The cells below are hidden for simplicity. 

You can click on the triangle (pointing right) on the left of this cell to examine the code in the cells below.

For now, simply run them. If you are unfamiliar with the notebook format, you "run" cells by clicking once on the "play" button below, to the left of the text that might say "_[X] cells hidden_". The same instruction applies for any other cell(s) later on in this notebook.

### Installing Libraries and Importing Data Sources
We install and import some necessary libraries in this cell. Simply run the cell below by clicking on the "Play" button on the top left corner (if you cannot see it, hover your mouse pointer over the top left corner of the cell below and it should appear).


In [1]:
!pip install Empath

import glob
import re
import string
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pprint
import seaborn as sns

from IPython.display import Markdown, display
from empath import Empath

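pp = pprint.PrettyPrinter(indent=2, compact=True)


def printmd(text):
    # Render a Markdown-formatted string in the notebook output.
    # The parameter is named `text` so it does not shadow the imported `string` module.
    display(Markdown(text))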


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting Empath
  Downloading empath-0.89.tar.gz (57 kB)
     |████████████████████████████████| 57 kB 1.4 MB/s
Building wheels for collected packages: Empath
  Building wheel for Empath (setup.py) ... done
  Created wheel for Empath: filename=empath-0.89-py3-none-any.whl size=57821 sha256=6d2e41448b7f4f4a05f6a7f4bebdc4eca321ac0f3ebc9204ef115d46095d2d97
  Stored in directory: /root/.cache/pip/wheels/2b/78/a8/37d4505eeae79807f4b5565a193f7cfcee892137ad37591029
Successfully built Empath
Installing collected packages: Empath
Successfully installed Empath-0.89


#### **NOTE:** Re-run above cell if needed.
If you receive a message above along the lines of "Restart Runtime", simply run the above cell once again.

## Empath Categories
We will now use the [Empath toolkit](https://github.com/Ejhfast/empath-client/tree/master/empath) to analyse the transcripts by creating a linguistic category and using the category as a lens with which to examine the dataset. 
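
To make the idea of using a category as a lens more concrete, here is a minimal sketch of lexicon-based counting in plain Python. It is not Empath's implementation: the function name `count_category_hits`, the toy `hearing_terms` set, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def count_category_hits(text, category_terms):
    """Count how often each term of a category occurs in a text.

    Illustrative only: tokenises on runs of lowercase letters and
    apostrophes, then checks each token against the category's terms.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = Counter(token for token in tokens if token in category_terms)
    # Share of all tokens that belong to the category.
    share = sum(hits.values()) / len(tokens) if tokens else 0.0
    return hits, share


# Hypothetical toy category and sentence.
hearing_terms = {"hear", "heard", "listen", "noise", "whisper"}
sample = "I heard a noise, so I stopped to listen. Did you hear it too?"
hits, share = count_category_hits(sample, hearing_terms)
print(hits)
print(round(share, 3))
```

Empath's own `analyze` method plays the corresponding role for its built-in and custom categories; the sketch above only shows the general principle.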

### Manage Categories in Empath
Below are some functions for keeping track of original categories present in Empath and new ones that you create. We will call these functions if needed.

In [2]:
def check_for_custom_caetgories():
    current_lexicon = Empath()
    empath_original_categories_list = [
         'achievement', 'affection', 'aggression', 'air_travel', 'alcohol', 'ancient', 'anger',
         'animal', 'anonymity', 'anticipation', 'appearance', 'art', 'attractive', 'banking',
         'beach', 'beauty', 'blue_collar_job', 'body', 'breaking', 'business', 'car',
         'celebration', 'cheerfulness', 'childish', 'children', 'cleaning', 'clothing', 'cold',
         'college', 'communication', 'competing', 'computer', 'confusion', 'contentment',
         'cooking', 'crime', 'dance', 'death', 'deception', 'disappointment', 'disgust',
         'dispute', 'divine', 'domestic_work', 'dominant_heirarchical', 'dominant_personality',
         'driving', 'eating', 'economics', 'emotional', 'envy', 'exasperation', 'exercise',
         'exotic', 'fabric', 'family', 'farming', 'fashion', 'fear', 'feminine', 'fight',
         'fire', 'friends', 'fun', 'furniture', 'gain', 'giving', 'government', 'hate',
         'healing', 'health', 'hearing', 'help', 'heroic', 'hiking', 'hipster', 'home',
         'horror', 'hygiene', 'independence', 'injury', 'internet', 'irritability',
         'journalism', 'joy', 'kill', 'law', 'leader', 'legend', 'leisure', 'liquid', 'listen',
         'love', 'lust', 'magic', 'masculine', 'medical_emergency', 'medieval', 'meeting',
         'messaging', 'military', 'money', 'monster', 'morning', 'movement', 'music',
         'musical', 'negative_emotion', 'neglect', 'negotiate', 'nervousness', 'night',
         'noise', 'occupation', 'ocean', 'office', 'optimism', 'order', 'pain', 'party',
         'payment', 'pet', 'philosophy', 'phone', 'plant', 'play', 'politeness', 'politics',
         'poor', 'positive_emotion', 'power', 'pride', 'prison', 'programming', 'rage',
         'reading', 'real_estate', 'religion', 'restaurant', 'ridicule', 'royalty', 'rural',
         'sadness', 'sailing', 'school', 'science', 'sexual', 'shame', 'shape_and_size',
         'ship', 'shopping', 'sleep', 'smell', 'social_media', 'sound', 'speaking', 'sports',
         'stealing', 'strength', 'suffering', 'superhero', 'surprise', 'swearing_terms',
         'swimming', 'sympathy', 'technology', 'terrorism', 'timidity', 'tool', 'torment',
         'tourism', 'toy', 'traveling', 'trust', 'ugliness', 'urban', 'vacation', 'valuable',
         'vehicle', 'violence', 'war', 'warmth', 'water', 'weakness', 'wealthy', 'weapon',
         'weather', 'wedding', 'white_collar_job', 'work', 'worship', 'writing', 'youth', 'zest'
    ]
    empath_original_set = set(empath_original_categories_list)
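    # Use the lexicon loaded in this function rather than a global `lexicon`,
    # which is only defined in a later cell.
    empath_current_set = set(current_lexicon.cats.keys())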
    new_categories = empath_current_set.difference(empath_original_set)
    return new_categories
    

def delete_custom_categories():
    cats_to_delete = check_for_custom_caetgories()
    empath_lexicon = Empath()
    for cat in list(cats_to_delete) :
        empath_lexicon.delete_category(cat)
    print("Categories Deleted: ")
    print(cats_to_delete)
        

## Examine Empath Category
You can list the words in an Empath Category by running the cells below. First, specify the category name you want to examine (this is ideally an existing Empath category).

In [31]:
#@title Specify Existing Empath Category
empath_category_name = 'hearing' #@param [ 'achievement', 'affection', 'aggression', 'air_travel', 'alcohol', 'ancient', 'anger', 'animal', 'anonymity', 'anticipation', 'appearance', 'art', 'attractive', 'banking', 'beach', 'beauty', 'blue_collar_job', 'body', 'breaking', 'business', 'car', 'celebration', 'cheerfulness', 'childish', 'children', 'cleaning', 'clothing', 'cold', 'college', 'communication', 'competing', 'computer', 'confusion', 'contentment', 'cooking', 'crime', 'dance', 'death', 'deception', 'disappointment', 'disgust', 'dispute', 'divine', 'domestic_work', 'dominant_heirarchical', 'dominant_personality', 'driving', 'eating', 'economics', 'emotional', 'envy', 'exasperation', 'exercise', 'exotic', 'fabric', 'family', 'farming', 'fashion', 'fear', 'feminine', 'fight', 'fire', 'friends', 'fun', 'furniture', 'gain', 'giving', 'government', 'hate', 'healing', 'health', 'hearing', 'help', 'heroic', 'hiking', 'hipster', 'home', 'horror', 'hygiene', 'independence', 'injury', 'internet', 'irritability', 'journalism', 'joy', 'kill', 'law', 'leader', 'legend', 'leisure', 'liquid', 'listen', 'love', 'lust', 'magic', 'masculine', 'medical_emergency', 'medieval', 'meeting', 'messaging', 'military', 'money', 'monster', 'morning', 'movement', 'music', 'musical', 'negative_emotion', 'neglect', 'negotiate', 'nervousness', 'night', 'noise', 'occupation', 'ocean', 'office', 'optimism', 'order', 'pain', 'party', 'payment', 'pet', 'philosophy', 'phone', 'plant', 'play', 'politeness', 'politics', 'poor', 'positive_emotion', 'power', 'pride', 'prison', 'programming', 'rage', 'reading', 'real_estate', 'religion', 'restaurant', 'ridicule', 'royalty', 'rural', 'sadness', 'sailing', 'school', 'science', 'sexual', 'shame', 'shape_and_size', 'ship', 'shopping', 'sleep', 'smell', 'social_media', 'sound', 'speaking', 'sports', 'stealing', 'strength', 'suffering', 'superhero', 'surprise', 'swearing_terms', 'swimming', 'sympathy', 'technology', 'terrorism', 'timidity', 'tool', 'torment', 'tourism', 'toy', 'traveling', 'trust', 'ugliness', 'urban', 'vacation', 'valuable', 'vehicle', 'violence', 'war', 'warmth', 'water', 'weakness', 'wealthy', 'weapon', 'weather', 'wedding', 'white_collar_job', 'work', 'worship', 'writing', 'youth', 'zest' ] {type:"string"}

In [32]:
#@title List Terms under Specified Category

lexicon = Empath()
if empath_category_name in lexicon.cats.keys() :
    empath_category_terms = sorted(lexicon.cats[empath_category_name])
    pp.pprint(empath_category_terms)
else :
    print("Category '" + empath_category_name + "' does not exist. Please try again.")

[ 'aloud', 'amplify', 'audible', 'blaring', 'blasting', 'boom', 'buzzing',
  'call', 'chatter', 'chattering', 'creaking', 'deafening', 'decibel',
  'distract', 'drumbeat', 'ear', 'ears', 'eavesdrop', 'eavesdropping',
  'echoing', 'grumbling', 'gunfire', 'harmonic', 'hear', 'heard', 'hearing',
  'hears', 'hush', 'hushed', 'irritating', 'knocking', 'listen', 'listening',
  'louder', 'loudly', 'low', 'melodic', 'melody', 'mumbling', 'murmur',
  'murmurs', 'music', 'mute', 'noise', 'noisy', 'pitch', 'quieter', 'quietly',
  'radio', 'resonate', 'response', 'ringing', 'rumble', 'scream', 'screaming',
  'shout', 'shrill', 'shuffling', 'siren', 'snoring', 'soft', 'softly',
  'soothing', 'sound', 'sounded', 'sounding', 'speak', 'speaker', 'speaking',
  'splitting', 'squeak', 'squeaky', 'talk', 'talking', 'tapping', 'thumping',
  'tune', 'voiced', 'volume', 'wail', 'wailing', 'whisper', 'whispered', 'yell',
  'yelling']


## Compare with LIWC Category
In the cell below, copy and paste a list of terms from the LIWC category against which you want to compare the Empath category. Then run all cells below it.
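
LIWC term lists mark word stems with a trailing `*`, so an entry such as `audibl*` stands for any word that begins with `audibl`. As a minimal sketch of that matching rule, the standard-library `fnmatch` module can test a single word against a single entry. The helper name `matches_liwc_term` is hypothetical, and the sketch assumes the entries contain no other glob characters such as `?` or `[`.

```python
from fnmatch import fnmatchcase


def matches_liwc_term(word, liwc_term):
    """Return True if `word` matches a LIWC entry such as 'audibl*'."""
    # fnmatchcase anchors the pattern at both ends, so an entry
    # without '*' only matches the identical word.
    return fnmatchcase(word.lower(), liwc_term.lower())


print(matches_liwc_term("audible", "audibl*"))    # True
print(matches_liwc_term("inaudible", "audibl*"))  # False
print(matches_liwc_term("ears", "ear"))           # False
```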

In [33]:
#@title Specify Existing LIWC Category
liwc_category_name = 'hear' #@param {type:"string"}

In [34]:
#@title Specify LIWC Category Terms (as a list of strings, each enclosed within quotes)
liwc_category_terms = ["audibl*", "audio*", "boom*", "cellphone*", "choir*", "chorus", "clap", "clapped", "clapping", "claps", "click*", "concert*", "deaf*", "di*", "ear", "ears", "ep", "flapping", "harmon*", "hear", "heard", "hearing", "hears", "hush*", "hymn*", "inaudibl*", "laugh", "listen", "listened", "listener*", "listening", "listens", "loud", "louder", "loudest", "loudly", "loudn*", "musi*", "noise", "noises", "noisier", "noisiest", "noisy", "opera", "orchestra*", "overhear*", "phone*", "quiet", "quieter", "quietest", "quietly", "radio*", "rang", "rap", "remark*", "rng", "ringing", "rings", "said", "sang", "say", "saying", "says", "scream*", "shout*", "silen*", "sing", "singing", "sings", "song*", "sound*", "speak", "speaker*", "speaking", "speaks", "speech", "spoke*", "stutter*", "sung", "telephon*", "thunder*", "tune", "tunes", "voic*", "wheez*", "whine*", "whining", "whisper*", "yawn*", "yell", "yelled", "yelling", "yells" ] #@param {type:"raw"}

In [35]:
#@title Compare Both Categories

def compare_categories(empath_category_terms, liwc_category_terms):
    empath_set = set(empath_category_terms)
    liwc_set = set(liwc_category_terms)
    # LIWC marks word stems with a trailing '*' (e.g. 'audibl*').
    liwc_wildcard_list = [term for term in liwc_set if '*' in term]
    liwc_set = liwc_set.difference(liwc_wildcard_list)
    empath_str = ' '.join(empath_set)
    wildcard_intersections = []
    wildcards_unmatched = []
    for wildcard in liwc_wildcard_list:
        # Match the literal stem followed by any lowercase suffix.
        reg_str = r'\b' + re.escape(wildcard.rstrip('*')) + r'[a-z]*\b'
        matches = re.findall(reg_str, empath_str)
        if matches:
            wildcard_intersections.extend(matches)
        else:
            wildcards_unmatched.append(wildcard)
    # Set aside Empath terms that are already covered by a LIWC wildcard.
    empath_set = empath_set.difference(wildcard_intersections)

    overlaps = empath_set.intersection(liwc_set)
    overlaps = overlaps.union(wildcard_intersections)
    unique_to_liwc = liwc_set.difference(empath_set)
    unique_to_liwc = unique_to_liwc.union(wildcards_unmatched)
    unique_to_empath = empath_set.difference(liwc_set)

    print("Words common to both categories:")
    pp.pprint(overlaps)
    print("--------------------------------")
    print("")
    print("Words in LIWC but NOT IN Empath:")
    pp.pprint(unique_to_liwc)
    print("--------------------------------")
    print("")
    print("Words in Empath but NOT IN LIWC:")
    pp.pprint(unique_to_empath)
    print("--------------------------------")
    print("")


compare_categories(empath_category_terms, liwc_category_terms)

Words common to both categories:
{ 'audible', 'boom', 'deafening', 'distract', 'ear', 'ears', 'harmonic', 'hear',
  'heard', 'hearing', 'hears', 'hush', 'hushed', 'listen', 'listening',
  'louder', 'loudly', 'music', 'noise', 'noisy', 'quieter', 'quietly', 'radio',
  'ringing', 'scream', 'screaming', 'shout', 'sound', 'sounded', 'sounding',
  'speak', 'speaker', 'speaking', 'tune', 'voiced', 'whisper', 'whispered',
  'yell', 'yelling'}
--------------------------------

Words in LIWC but NOT IN Empath:
{ 'audio*', 'cellphone*', 'choir*', 'chorus', 'clap', 'clapped', 'clapping',
  'claps', 'click*', 'concert*', 'ep', 'flapping', 'hymn*', 'inaudibl*',
  'laugh', 'listened', 'listener*', 'listens', 'loud', 'loudest', 'loudn*',
  'noises', 'noisier', 'noisiest', 'opera', 'orchestra*', 'overhear*', 'phone*',
  'quiet', 'quietest', 'rang', 'rap', 'remark*', 'rings', 'rng', 'said', 'sang',
  'say', 'saying', 'says', 'silen*', 'sing', 'singing', 'sings', 'song*',
  'speaks', 'speech', 'spoke*', 'stutter*', 'sung', 'telephon*', 'thunder*',
  'tunes', 'wheez*', 'whine*', 'whining', 'yawn*', 'yelled', 'yells'}
--------------------------------

Words in Empath but NOT IN LIWC:
{ 'aloud', 'amplify', 'blaring', 'blasting', 'buzzing', 'call', 'chatter',
  'chattering', '
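
One possible way to summarise such a comparison in a single number is the Jaccard index: the size of the intersection divided by the size of the union of the two term sets. The sketch below is only a suggestion for quantifying the overlap. The function name `jaccard_index` and the two short term lists are hypothetical, and LIWC wildcard entries would first need to be resolved to actual words.

```python
def jaccard_index(terms_a, terms_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of terms."""
    set_a, set_b = set(terms_a), set(terms_b)
    union = set_a | set_b
    # Two empty categories are treated as having no overlap.
    return len(set_a & set_b) / len(union) if union else 0.0


# Hypothetical toy term lists, not the real categories.
empath_sample = ["hear", "listen", "noise", "whisper"]
liwc_sample = ["hear", "listen", "loud", "quiet", "noise", "sing"]
print(round(jaccard_index(empath_sample, liwc_sample), 3))
```

A value of 0 means the two categories share no terms, and a value of 1 means they are identical.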

## Create your own Empath Category
In the cell below, you can create your own category. Use an English word without special characters as the category name, as this can improve the outcome.

If needed, replace the text that is assigned to the variable ```category_name``` and the list of seed words assigned to ```category_seed_words``` with a category and a corresponding set of seed words that interest you for the analysis.
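
Fast et al. describe Empath as growing a category from its seed words by looking for nearby terms in a word-embedding space. The sketch below illustrates only that general idea and is not Empath's code: the three-dimensional `toy_vectors`, the helper names `cosine` and `expand_seeds`, and the choice to average the seed vectors are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_seeds(seeds, vectors, size):
    """Rank non-seed words by similarity to the averaged seed vector."""
    dims = len(next(iter(vectors.values())))
    query = [sum(vectors[s][i] for s in seeds) / len(seeds) for i in range(dims)]
    candidates = [word for word in vectors if word not in seeds]
    ranked = sorted(candidates, key=lambda w: cosine(vectors[w], query), reverse=True)
    return ranked[:size]


# Made-up vectors purely for illustration.
toy_vectors = {
    "believe": [0.9, 0.1, 0.0],
    "realise": [0.8, 0.2, 0.1],
    "understand": [0.85, 0.15, 0.05],
    "wonder": [0.7, 0.3, 0.1],
    "hammer": [0.0, 0.1, 0.9],
    "siren": [0.1, 0.9, 0.2],
}
print(expand_seeds(["believe", "realise"], toy_vectors, size=2))
```

The choice of seed words matters because every candidate is scored only by how close it lies to the seeds, so a stray seed pulls unrelated words into the category.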

In [None]:
#@title Specify New Empath Category Name
category_name = 'reflection' #@param {type:"string"}

In [None]:
#@title Enter a list of seed words (list of strings, each within quotes)
category_seed_words = ['believe', 'realise', 'realize', 'retrospect', 'introspect', 'know'] #@param {type:"raw"}

In [None]:
#@title Maximum number of words in the new category
number_of_words_in_category = 100 #@param{type : 'number'}

In [None]:
#@title Specify which model to use when generating category terms
model_to_use = 'fiction' #@param ['fiction', 'reddit', 'nytimes'] {type:"raw"}

In [None]:
#@title Generate category based on specs
lexicon.create_category(category_name, category_seed_words, size=number_of_words_in_category, model=model_to_use)
custom_category_terms = lexicon.cats[category_name]
print('-------------------------------------------')
print('Category', category_name, "created with", len(custom_category_terms), "terms.")

["realize", "realise", "believe", "understand", "Because", "mean", "knowing", "though", "thought", "care", "honestly", "actually", "remember", "knew", "wonder", "admit", "guess", "anymore", "Honestly", "matter", "thinking", "suppose", "trust", "probably", "blame", "assume", "Maybe", "explain", "Obviously", "realized", "seem", "wish", "imagine", "though", "right", "deny", "notice", "doubt", "knows", "anything", "anyway", "realizing", "exactly", "forget", "either", "expect", "seriously", "figured", "pretend", "why", "Actually", "truly", "idea", "seeing", "realised", "Even", "If", "meant", "realise", "bet", "regret", "suspect", "yet", "sure", "happen", "slightest_clue", "understood", "Knowing", "accept", "question", "funny_thing"]
-------------------------------------------
Category reflection created with 69 terms.
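
The generated list above contains capitalised variants (for example "Honestly" and "honestly") and repeated entries (for example "though"). If these should be treated as one term when comparing categories, a small clean-up step can help. The sketch below is one possible approach; the helper name `normalise_terms` and the short sample list are hypothetical.

```python
def normalise_terms(terms):
    """Lowercase terms and drop duplicates, keeping first-seen order."""
    seen = {}
    for term in terms:
        # dict keys preserve insertion order, so the first spelling wins.
        seen.setdefault(term.lower(), None)
    return list(seen)


sample_terms = ["realize", "Honestly", "honestly", "though", "though", "Even"]
print(normalise_terms(sample_terms))
# ['realize', 'honestly', 'though', 'even']
```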


In [None]:
#@title Uncomment code here to check for and/or delete custom categories
# check_for_custom_caetgories()
# delete_custom_categories()