<a href="https://colab.research.google.com/github/seungbobkimpants/reilly_nlp/blob/master/AVD_get.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Getting text arousal, valence, and dominance using Warriner, Kuperman & Brysbaert (2013)**




## **Preliminaries**

We are going to import some useful libraries first.

* `numpy` supports fast computation on large, n-dimensional arrays. Pure Python loops are slow compared with languages like C and Java, so `numpy` runs its array operations in compiled code. It's conventional to import it as `np`.
* `pandas` is built on top of `numpy`. It gives you some easy tools to visualize and analyze tabular (2-dimensional) data. It's conventional to import it as `pd`.
* `spacy` is a fast and fancy natural language processing library.

In [None]:
import numpy as np
import pandas as pd
import spacy

spaCy and NLTK both come with their own lemmatizers, but they are not that good. Brad Jascob's LemmInflect (`https://github.com/bjascob/LemmInflect`) is better, so we'll use it.

In a notebook cell, anything after the `!` symbol runs as a shell command. Colab doesn't come with LemmInflect preinstalled, so we install it with `pip`.

In [None]:
!pip3 install lemminflect
import lemminflect



## **Dealing with data**


~~First, obtain WKB's measures at `http://crr.ugent.be/archives/1003` then upload here by connecting to the runtime, then uploading `Ratings_Warriner_et_al.csv` in the Files sidebar.~~

OK, because it's annoying to upload the WKB ratings every time, we are going to keep it in the `Summer Cog Zoom May 20` folder on Google Drive. We can mount Google Drive by running the following cell:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Read in the lexicon, loading only the relevant columns.

In [None]:
ratings_path = '/content/drive/My Drive/Summer Cog Zoom May 20/Linguistic/Ratings_Warriner_et_al.csv'

# df meaning dataframe
df = pd.read_csv(ratings_path, usecols=['Word', 'V.Mean.Sum', 'A.Mean.Sum', 'D.Mean.Sum'])
# len(df) already excludes the header row, so there is no need to subtract 1
print('There are this many words:', len(df))

There are this many words: 13915
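
As a small sanity check on how `pd.read_csv` counts rows and applies `usecols`, here is a sketch that reads an in-memory CSV. Only the column names match the cell above; the three rows, their numbers, and the `Extra` column are invented for illustration and are not WKB values.

```python
import io
import pandas as pd

# Toy CSV with the same column names as the WKB file.
# The rows, numbers, and the "Extra" column are made up for illustration.
toy_csv = io.StringIO(
    "Word,V.Mean.Sum,A.Mean.Sum,D.Mean.Sum,Extra\n"
    "apple,6.0,4.0,5.0,x\n"
    "storm,4.0,6.0,4.5,y\n"
    "chair,5.5,3.0,5.5,z\n"
)

toy_df = pd.read_csv(toy_csv, usecols=['Word', 'V.Mean.Sum', 'A.Mean.Sum', 'D.Mean.Sum'])

# The header line becomes the column names, so len() counts data rows only.
print(len(toy_df))            # 3
print(list(toy_df.columns))   # the "Extra" column is not loaded
```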


Obviously, English has far more words than this lexicon covers. We will have to ignore every word that isn't in the list, but if we have time, we could try to extrapolate ('bootstrap') ratings for the missing words using word vector distances.
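
One way the extrapolation idea could work is a nearest-neighbour lookup: give an unrated word the rating of the rated word whose vector is closest by cosine similarity. The sketch below uses tiny hand-made 3-dimensional vectors and invented valence numbers purely for illustration. `toy_vectors`, `toy_valence`, and `estimate_valence` are hypothetical names, and real vectors would have to come from an actual embedding model.

```python
import math

# Hypothetical toy data: 3-d "word vectors" and made-up valence ratings.
toy_vectors = {
    'happy':  [0.9, 0.1, 0.0],
    'sad':    [-0.8, 0.2, 0.1],
    'joyful': [0.85, 0.15, 0.05],   # pretend this word has no rating
}
toy_valence = {'happy': 8.0, 'sad': 2.0}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def estimate_valence(word):
    """Borrow the rating of the most similar rated word."""
    nearest = max(toy_valence, key=lambda w: cosine(toy_vectors[word], toy_vectors[w]))
    return nearest, toy_valence[nearest]

print(estimate_valence('joyful'))   # ('happy', 8.0)
```

Averaging over several neighbours instead of taking just one would be a natural refinement, but that is a design choice this sketch does not make.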

Let's define some test text. We'll modify this part to use actual text data later.

In [None]:
test_text = 'GRASSLEY: As part of judge Kavanaugh’s nomination to the Supreme Court, the FBI conducted its sixth full field background investigation of Judge Kavanaugh since 1993, 25 years ago. Nowhere in any of these six FBI reports, which committee investigators have reviewed on a bipartisan basis, was there a whiff of any issue — any issue at all related in any way to inappropriate sexual behavior. \n Dr. Ford first raised her allegations in a secret letter to the ranking member nearly two months ago in July. This letter was secret from July 30th, September 13th to — no, July 30th until September 13th when I first heard about it.'

evil_text = 'I fucking hate that stupid asshole! I am so angry!'
good_text = 'I really love that beautiful angel. I am feeling blessed.'

## **The fun stuff**

We'll first load in one of spaCy's pre-trained English models. It's customary to call it `nlp`.

In [None]:
# load spaCy's pre-trained English model and call it nlp
nlp = spacy.load('en_core_web_sm')
# TODO: try a larger model, e.g. 'en_core_web_lg'

Feeding a text into a spaCy model runs its processing pipeline (tokenization, tagging, parsing, and so on) for you. So once we pass `test_text` to `nlp` and call the result `doc`, we can ask, for example, what the sentences in `test_text` are.

In [None]:
doc = nlp(test_text)

sentences = list(doc.sents)
print('The sentences are:')
for s in sentences:
  print(s)

The sentences are:
GRASSLEY:
As part of judge Kavanaugh’s nomination to the Supreme Court, the FBI conducted its sixth full field background investigation of Judge Kavanaugh since 1993, 25 years ago.
Nowhere in any of these six FBI reports, which committee investigators have reviewed on a bipartisan basis, was there a whiff of any issue — any issue at all related in any way to inappropriate sexual behavior. 
 
Dr. Ford first raised her allegations in a secret letter to the ranking member nearly two months ago in July.
This letter was secret from July 30th, September 13th to — no, July 30th until September 13th when I first heard about it.


But we don't really care about sentence segmentation for getting AVD values. Instead, we need all the tokens, which are what you get when you iterate over a `doc`.

In [None]:
lemmata = [token.lemma_ for token in doc]
print('Lemmata:', lemmata)

Lemmata: ['GRASSLEY', ':', 'as', 'part', 'of', 'judge', 'Kavanaugh', '’s', 'nomination', 'to', 'the', 'Supreme', 'Court', ',', 'the', 'FBI', 'conduct', '-PRON-', 'sixth', 'full', 'field', 'background', 'investigation', 'of', 'Judge', 'Kavanaugh', 'since', '1993', ',', '25', 'year', 'ago', '.', 'nowhere', 'in', 'any', 'of', 'these', 'six', 'FBI', 'report', ',', 'which', 'committee', 'investigator', 'have', 'review', 'on', 'a', 'bipartisan', 'basis', ',', 'be', 'there', 'a', 'whiff', 'of', 'any', 'issue', '—', 'any', 'issue', 'at', 'all', 'relate', 'in', 'any', 'way', 'to', 'inappropriate', 'sexual', 'behavior', '.', '\n ', 'Dr.', 'Ford', 'first', 'raise', '-PRON-', 'allegation', 'in', 'a', 'secret', 'letter', 'to', 'the', 'rank', 'member', 'nearly', 'two', 'month', 'ago', 'in', 'July', '.', 'this', 'letter', 'be', 'secret', 'from', 'July', '30th', ',', 'September', '13th', 'to', '—', 'no', ',', 'July', '30th', 'until', 'September', '13th', 'when', '-PRON-', 'first', 'hear', 'about', '-P

The strange one-line code above uses a *list comprehension*. Instead of writing a for loop over three or more lines, we can fit it on one line. More concretely, 

`lemmata = [token.lemma_ for token in doc]` 

is equivalent to 

```
lemmata = []
for token in doc:
  lemmata.append(token.lemma_)
```
And we can even use conditionals with it:



In [None]:
present = [lem for lem in lemmata if (df['Word']==lem).any()]
print('Lemmata in WKB:', present)

Lemmata in WKB: ['part', 'judge', 'nomination', 'conduct', 'full', 'field', 'background', 'investigation', 'year', 'six', 'report', 'committee', 'investigator', 'have', 'review', 'basis', 'be', 'whiff', 'issue', 'issue', 'relate', 'way', 'inappropriate', 'sexual', 'behavior', 'first', 'raise', 'allegation', 'secret', 'letter', 'rank', 'member', 'two', 'month', 'letter', 'be', 'secret', 'first', 'hear']


Just to be clear, `(df['Word']==lem).any()` checks whether there is any word in the `Word` column in `df` that matches the lemma, then returns a boolean (true/false) value.
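
Scanning the whole `Word` column for every lemma works, but it repeats the same search many times. A possible alternative, sketched here on a toy DataFrame with invented numbers, is to build a Python `set` of the words once and test membership against it. `toy_df`, `toy_lemmata`, and `known_words` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the WKB dataframe; values are invented.
toy_df = pd.DataFrame({
    'Word': ['judge', 'letter', 'secret'],
    'V.Mean.Sum': [5.0, 6.0, 4.5],
})
toy_lemmata = ['judge', 'Kavanaugh', 'secret', 'letter', ',', 'secret']

# Build the lookup set once; each membership test is then cheap.
known_words = set(toy_df['Word'])
present_fast = [lem for lem in toy_lemmata if lem in known_words]

# Same result as the column-scanning version used above.
present_slow = [lem for lem in toy_lemmata if (toy_df['Word'] == lem).any()]

print(present_fast)                  # ['judge', 'secret', 'letter', 'secret']
print(present_fast == present_slow)  # True
```

Both versions are case-sensitive, so a capitalized lemma such as `'Judge'` would not match the lowercase entry `'judge'`.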

We can write a quick function to return the row index of a word (basically copy-pasted from StackOverflow):

In [None]:
def idx(element):
  # row label of the first row whose Word matches element
  return (df[df['Word']==element].index)[0]

Then, using the `.at[index, column_name]` accessor that comes with `pandas`, we can look up the arousal/valence/dominance values for a word in their respective columns.

In [None]:
arousal_values = [df.at[idx(lem), 'A.Mean.Sum'] for lem in present]
valence_values = [df.at[idx(lem), 'V.Mean.Sum'] for lem in present]
dominance_values = [df.at[idx(lem), 'D.Mean.Sum'] for lem in present]

print('Mean arousal:', sum(arousal_values)/len(arousal_values))
print('Mean valence:', sum(valence_values)/len(valence_values))
print('Mean dominance:', sum(dominance_values)/len(dominance_values))

Mean arousal: 4.026923076923078
Mean valence: 5.451538461538464
Mean dominance: 5.579487179487179
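
The three list comprehensions can also be collapsed into a single pandas lookup. This is a sketch on a toy DataFrame with invented ratings, and it assumes each word appears only once in the `Word` column. `toy_df`, `toy_present`, and `indexed` are illustrative names.

```python
import pandas as pd

# Toy stand-in for the WKB dataframe; all ratings are invented.
toy_df = pd.DataFrame({
    'Word': ['judge', 'letter', 'secret'],
    'V.Mean.Sum': [5.0, 6.0, 4.0],
    'A.Mean.Sum': [4.0, 3.0, 5.0],
    'D.Mean.Sum': [6.0, 5.0, 4.0],
})
toy_present = ['judge', 'secret', 'secret']   # repeats are kept on purpose

# Index rows by word so .loc can fetch every row in one step.
indexed = toy_df.set_index('Word')
means = indexed.loc[toy_present, ['A.Mean.Sum', 'V.Mean.Sum', 'D.Mean.Sum']].mean()

print(means['A.Mean.Sum'])   # (4 + 5 + 5) / 3
print(means['V.Mean.Sum'])   # (5 + 4 + 4) / 3
print(means['D.Mean.Sum'])   # (6 + 4 + 4) / 3
```

Note that `.loc` raises a `KeyError` for words missing from the index, so the list still has to be filtered first, as `present` is.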


Here are the results for the test texts we defined earlier:

Good text:
* Mean arousal: 4.471666666666667
* Mean valence: 6.859999999999999
* Mean dominance: 5.891666666666667

Evil text: 
* Mean arousal: 5.578333333333333
* Mean valence: 3.4516666666666667
* Mean dominance: 4.323333333333333

Kavanope:
* Mean arousal: 4.026923076923078 
* Mean valence: 5.451538461538464
* Mean dominance: 5.579487179487179


## **Automating the process**

Let's write a function to do all this in one call.

In [None]:
def avd_get(text):
  doc = nlp(text)
  lemmata = [token.lemma_ for token in doc]
  present = [lem for lem in lemmata if (df['Word']==lem).any()]
  arousal_values = [df.at[idx(lem), 'A.Mean.Sum'] for lem in present]
  valence_values = [df.at[idx(lem), 'V.Mean.Sum'] for lem in present]
  dominance_values = [df.at[idx(lem), 'D.Mean.Sum'] for lem in present]

  return sum(arousal_values)/len(arousal_values), sum(valence_values)/len(valence_values), sum(dominance_values)/len(dominance_values)
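
One edge case worth keeping in mind: if a text contains no rated lemma, the value lists are empty and the division by their length fails. Below is a minimal sketch of a guard. It uses a plain dictionary with invented ratings in place of `df` and a whitespace split in place of spaCy, and `toy_lexicon` and `avd_from_tokens` are hypothetical names.

```python
from statistics import mean

# Hypothetical lexicon: word -> (arousal, valence, dominance); numbers invented.
toy_lexicon = {
    'love':  (5.0, 8.0, 6.0),
    'angry': (6.0, 2.0, 4.0),
}

def avd_from_tokens(tokens):
    """Return mean (arousal, valence, dominance), or None if nothing matches."""
    rated = [toy_lexicon[t] for t in tokens if t in toy_lexicon]
    if not rated:
        return None   # avoids dividing by zero
    arousal, valence, dominance = zip(*rated)
    return mean(arousal), mean(valence), mean(dominance)

print(avd_from_tokens('i love angry'.split()))   # (5.5, 5.0, 5.0)
print(avd_from_tokens('xyzzy plugh'.split()))    # None
```

Returning `None` is only one option; whether an empty result should be skipped, reported, or written as a blank row depends on how the output is used later.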


Try it out:

In [None]:
avd_get(test_text)

(4.026923076923078, 5.451538461538464, 5.579487179487179)

Beautiful. Now, we want to be able to take a folder containing all the files, and then write all these values into a CSV. Because it's a hassle to upload all the input files every time, we are going to access the input files from Google Drive. 

Once we have read in all the files from `input_data`, we store a list of `(filename, contents)` tuples as `documents`.

In [None]:
import os

input_path = '/content/drive/My Drive/Summer Cog Zoom May 20/Linguistic/input_data'

# read every file in the folder into a (filename, contents) tuple
documents = []
for name in os.listdir(input_path):
  with open(os.path.join(input_path, name), 'r') as f:
    documents.append((name, f.read()))
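
For reference, the same `(filename, contents)` list can be built with `pathlib`. The sketch below runs against a temporary folder holding two made-up files, so it works outside Colab too. The file names and contents are invented, `read_folder` is an illustrative name, and UTF-8 encoding and alphabetical ordering are assumptions rather than properties of the real input data.

```python
import tempfile
from pathlib import Path

def read_folder(folder):
    """Return a list of (filename, contents) tuples for every file in folder."""
    return [(p.name, p.read_text(encoding='utf-8'))
            for p in sorted(Path(folder).iterdir()) if p.is_file()]

# Demo on a throwaway folder with invented files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'a.txt').write_text('first file', encoding='utf-8')
    (Path(tmp) / 'b.txt').write_text('second file', encoding='utf-8')
    toy_documents = read_folder(tmp)

print(toy_documents)   # [('a.txt', 'first file'), ('b.txt', 'second file')]
```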

Then we can get to work writing the AVD values into a CSV. Note that this can feel quite slow, because spaCy has to run `nlp()` on every text.

In [None]:
import csv

output_path = '/content/drive/My Drive/Summer Cog Zoom May 20/Linguistic/avd_output.csv'
# newline='' is what the csv module recommends, to avoid blank rows on some platforms
with open(output_path, 'w', newline='') as file:
    writer = csv.writer(file, delimiter=',')
    writer.writerow(['name','arousal','valence','dominance'])
    for name, text in documents:
        arousal, valence, dominance = avd_get(text)
        writer.writerow([name, arousal, valence, dominance])
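
A variant worth considering is `csv.DictWriter`, which ties each value to its column name instead of relying on list order. The sketch writes to an in-memory buffer rather than Drive, and the file names and scores in `toy_results` are invented.

```python
import csv
import io

# Invented results: (name, arousal, valence, dominance)
toy_results = [
    ('a.txt', 4.0, 6.0, 5.5),
    ('b.txt', 5.0, 3.5, 4.5),
]

buffer = io.StringIO()
fieldnames = ['name', 'arousal', 'valence', 'dominance']
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
for row in toy_results:
    # pair each value with its column name
    writer.writerow(dict(zip(fieldnames, row)))

lines = buffer.getvalue().splitlines()
print(lines[0])   # name,arousal,valence,dominance
print(lines[1])   # a.txt,4.0,6.0,5.5
```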

Finally, let's take a look at the resulting CSV:

In [None]:
output_df = pd.read_csv(output_path)
display(output_df)

Unnamed: 0,name,arousal,valence,dominance
0,Copy of P01_T04_edited.txt,4.022658,5.993507,5.739699
1,Copy of P03_T04.json.txt,3.964554,6.215812,5.796568
2,Copy of P02_T04_speakerID.txt,3.91874,5.875603,5.669517
3,Copy of P04_T04.json.txt,4.019333,5.977841,5.651619
