<a href="https://colab.research.google.com/github/seungbobkimpants/reilly_nlp/blob/master/AVD_get.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Getting text arousal, valence, and dominance using Warriner, Kuperman & Brysbaert (2013)**




## **Preliminaries**

We are going to import some useful libraries first.

* `numpy` supports fast computation on large, n-dimensional arrays. Pure-Python loops are slow compared with languages like C, so `numpy` does the heavy lifting in compiled code. It's conventional to import it as `np`.
* `pandas` is built on top of `numpy`. It gives you some easy tools to visualize and analyze tabular (2-dimensional) data. It's conventional to import it as `pd`.
* `spacy` is a fast and fancy natural language processing library.

In [1]:
import numpy as np
import pandas as pd
import spacy

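Though spaCy and NLTK both come with their own lemmatizers, they are not that good. Brad Jascob's LemmInflect is better, so we'll install it: `https://github.com/bjascob/LemmInflect`. Note that the cells below still read spaCy's own `token.lemma_`; as far as I can tell, LemmInflect's lemmas are exposed separately, through the `token._.lemma()` extension it registers when imported.

In a notebook cell, anything after a `!` is run as a shell command. Colab doesn't come with LemmInflect, so we have to install it with `pip`.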

In [2]:
!pip3 install lemminflect
import lemminflect

Collecting lemminflect
  Downloading https://files.pythonhosted.org/packages/8d/c5/62e8dd0b6cbfea212cf55a2338838d85a819dbda9462ba53a415dcf19b86/lemminflect-0.2.1-py3-none-any.whl (769kB)
Installing collected packages: lemminflect
Successfully installed lemminflect-0.2.1


## **Dealing with data**


~~First, obtain WKB's measures at `http://crr.ugent.be/archives/1003` then upload here by connecting to the runtime, then uploading `Ratings_Warriner_et_al.csv` in the Files sidebar.~~

OK, because it's annoying to upload the WKB ratings every time, we are going to keep it in the `Summer Cog Zoom May 20` folder on Google Drive. We can mount Google Drive by running the following cell:

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Read in the lexicon, and call in only the relevant columns.

In [5]:
ratings_path = '/content/drive/My Drive/Data Analysis Summer Cog Lab/Linguistics/Ratings_Warriner_et_al.csv'

# df is short for DataFrame
df = pd.read_csv(ratings_path, usecols=['Word', 'V.Mean.Sum', 'A.Mean.Sum', 'D.Mean.Sum'])
# len(df) counts data rows only; read_csv has already consumed the header row
print('There are this many words:', len(df))

There are this many words: 13915


Obviously, there are more than 13,915 words in the English language. We will have to ignore every word that is not in this list, but if we have time, we could try to extrapolate ('bootstrap') ratings for unlisted words from their nearest rated neighbours in word-vector space.
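
To make the extrapolation idea concrete, here is a minimal sketch of one way it could work: give an unrated word the average rating of its most similar rated words, with similarity measured by cosine. Everything in it is an assumption for illustration: the 3-dimensional vectors, the valence numbers, and the helper names `cosine` and `estimate_valence` are made up, and real vectors would come from a trained model.

```python
import math

# Hypothetical 3-d word vectors; real ones would come from a trained model.
vectors = {
    'happy':  [0.9, 0.1, 0.0],
    'joyful': [0.8, 0.2, 0.1],
    'sad':    [-0.7, 0.3, 0.1],
    'table':  [0.0, 0.0, 1.0],
}
# Hypothetical valence ratings for the words we pretend are in the lexicon.
rated_valence = {'happy': 8.0, 'sad': 2.0, 'table': 5.0}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def estimate_valence(word, k=2):
    """Average the ratings of the k rated words most similar to `word`."""
    ranked = sorted(rated_valence,
                    key=lambda w: cosine(vectors[word], vectors[w]),
                    reverse=True)
    neighbours = ranked[:k]
    return sum(rated_valence[w] for w in neighbours) / len(neighbours)

estimate = estimate_valence('joyful', k=1)
print(estimate)
```

With `k=1` the unrated word simply inherits the rating of its single nearest neighbour; larger `k` smooths the estimate at the cost of pulling in less related words.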

Let's define some test text. We'll modify this part to use actual text data later.

In [6]:
test_text = 'GRASSLEY: As part of judge Kavanaugh’s nomination to the Supreme Court, the FBI conducted its sixth full field background investigation of Judge Kavanaugh since 1993, 25 years ago. Nowhere in any of these six FBI reports, which committee investigators have reviewed on a bipartisan basis, was there a whiff of any issue — any issue at all related in any way to inappropriate sexual behavior. \n Dr. Ford first raised her allegations in a secret letter to the ranking member nearly two months ago in July. This letter was secret from July 30th, September 13th to — no, July 30th until September 13th when I first heard about it.'

evil_text = 'I fucking hate that stupid asshole! I am so angry!'
good_text = 'I really love that beautiful angel. I am feeling blessed.'

## **The fun stuff**

We'll first load in one of spaCy's pre-trained English models. It's customary to call it `nlp`.

In [7]:
# load spaCy's pre-trained English model and call it nlp
nlp = spacy.load('en_core_web_sm')
# try large

Feeding a text into spaCy's models automatically does some NLP for you. So, once we feed `test_text` into `nlp`, and call it `doc`, we can ask what the sentences are in `test_text`, for example.

In [8]:
doc = nlp(test_text)

sentences = list(doc.sents)
print('The sentences are:')
for s in sentences:
  print(s)

The sentences are:
GRASSLEY:
As part of judge Kavanaugh’s nomination to the Supreme Court, the FBI conducted its sixth full field background investigation of Judge Kavanaugh since 1993, 25 years ago.
Nowhere in any of these six FBI reports, which committee investigators have reviewed on a bipartisan basis, was there a whiff of any issue — any issue at all related in any way to inappropriate sexual behavior. 
 
Dr. Ford first raised her allegations in a secret letter to the ranking member nearly two months ago in July.
This letter was secret from July 30th, September 13th to — no, July 30th until September 13th when I first heard about it.


But we don't really care about sentence segmentation for getting AVD values. Instead, we need the tokens, which are what you get when you iterate over a `doc`.

In [9]:
lemmata = [token.lemma_ for token in doc]
print('Lemmata:', lemmata)

Lemmata: ['GRASSLEY', ':', 'as', 'part', 'of', 'judge', 'Kavanaugh', '’s', 'nomination', 'to', 'the', 'Supreme', 'Court', ',', 'the', 'FBI', 'conduct', '-PRON-', 'sixth', 'full', 'field', 'background', 'investigation', 'of', 'Judge', 'Kavanaugh', 'since', '1993', ',', '25', 'year', 'ago', '.', 'nowhere', 'in', 'any', 'of', 'these', 'six', 'FBI', 'report', ',', 'which', 'committee', 'investigator', 'have', 'review', 'on', 'a', 'bipartisan', 'basis', ',', 'be', 'there', 'a', 'whiff', 'of', 'any', 'issue', '—', 'any', 'issue', 'at', 'all', 'relate', 'in', 'any', 'way', 'to', 'inappropriate', 'sexual', 'behavior', '.', '\n ', 'Dr.', 'Ford', 'first', 'raise', '-PRON-', 'allegation', 'in', 'a', 'secret', 'letter', 'to', 'the', 'rank', 'member', 'nearly', 'two', 'month', 'ago', 'in', 'July', '.', 'this', 'letter', 'be', 'secret', 'from', 'July', '30th', ',', 'September', '13th', 'to', '—', 'no', ',', 'July', '30th', 'until', 'September', '13th', 'when', '-PRON-', 'first', 'hear', 'about', '-P

The strange one-line code above uses a *list comprehension*. Instead of writing a 3+ line for loop, we can fit the whole loop into one line. More concretely,

`lemmata = [token.lemma_ for token in doc]` 

is equivalent to 

```
lemmata = []
for token in doc:
  lemmata.append(token.lemma_)
```
And we can even use conditionals with it:



In [10]:
present = [lem for lem in lemmata if (df['Word']==lem).any()]
print('Lemmata in WKB:', present)

Lemmata in WKB: ['part', 'judge', 'nomination', 'conduct', 'full', 'field', 'background', 'investigation', 'year', 'six', 'report', 'committee', 'investigator', 'have', 'review', 'basis', 'be', 'whiff', 'issue', 'issue', 'relate', 'way', 'inappropriate', 'sexual', 'behavior', 'first', 'raise', 'allegation', 'secret', 'letter', 'rank', 'member', 'two', 'month', 'letter', 'be', 'secret', 'first', 'hear']


Just to be clear, `(df['Word']==lem).any()` checks whether there is any word in the `Word` column in `df` that matches the lemma, then returns a boolean (true/false) value.
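
That check compares the lemma against every row of the column, and it does so again for each lemma. A sketch of a cheaper alternative is to build a set of the lexicon words once and test membership against it. The two short lists below are hypothetical stand-ins for `df['Word']` and `lemmata`; with the real data, `set(df['Word'])` would play the role of `known`.

```python
# Hypothetical stand-ins for df['Word'] and the lemmata list.
lexicon_words = ['part', 'judge', 'issue', 'secret']
lemmata = ['GRASSLEY', ':', 'as', 'part', 'of', 'judge', 'issue', 'issue']

# Build the set once; each `in` test is then a hash lookup rather than a scan.
known = set(lexicon_words)
present = [lem for lem in lemmata if lem in known]
print(present)  # ['part', 'judge', 'issue', 'issue']
```

The result keeps the original order and the repeats, exactly as the comprehension over the DataFrame does.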

We can write a quick function to return the row index of a word (basically copy-pasted from StackOverflow):

In [11]:
def idx(element):
  return (df[df['Word']==element].index)[0]

Then, using the `.at[row_label, column_name]` accessor that comes with `pandas`, we can look up the arousal/valence/dominance values for a word in their respective columns.

In [12]:
arousal_values = [df.at[idx(lem), 'A.Mean.Sum'] for lem in present]
valence_values = [df.at[idx(lem), 'V.Mean.Sum'] for lem in present]
dominance_values = [df.at[idx(lem), 'D.Mean.Sum'] for lem in present]

print('Mean arousal:', sum(arousal_values)/len(arousal_values))
print('Mean valence:', sum(valence_values)/len(valence_values))
print('Mean dominance:', sum(dominance_values)/len(dominance_values))

Mean arousal: 4.026923076923078
Mean valence: 5.451538461538464
Mean dominance: 5.579487179487179
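
One detail worth spelling out: `present` keeps repeated lemmata ('issue', 'letter' and 'secret' each appear twice in it), so these are means over tokens, not over distinct words. The sketch below shows the difference on a toy example; the ratings in it are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from statistics import mean

# Hypothetical valence ratings, for illustration only.
valence = {'issue': 4.0, 'letter': 6.0, 'secret': 5.0}
present = ['issue', 'issue', 'letter', 'secret']

# Mean over tokens: 'issue' is counted twice -> (4 + 4 + 6 + 5) / 4
token_mean = mean(valence[lem] for lem in present)
# Mean over distinct lemmata: each counted once -> (4 + 6 + 5) / 3
type_mean = mean(valence[lem] for lem in set(present))

print(token_mean, type_mean)  # 4.75 5.0
```

The token-based mean weights frequent words more heavily, which is what the cell above does.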


Here are the results for the test texts we defined earlier:

Good text:
* Mean arousal: 4.471666666666667
* Mean valence: 6.859999999999999
* Mean dominance: 5.891666666666667

Evil text: 
* Mean arousal: 5.578333333333333
* Mean valence: 3.4516666666666667
* Mean dominance: 4.323333333333333

Kavanope:
* Mean arousal: 4.026923076923078 
* Mean valence: 5.451538461538464
* Mean dominance: 5.579487179487179


## **Automating the process**

Let's write a function to do all this in one line.

In [13]:
def avd_get(text):
  doc = nlp(text)
  lemmata = [token.lemma_ for token in doc]
  present = [lem for lem in lemmata if (df['Word']==lem).any()]
  arousal_values = [df.at[idx(lem), 'A.Mean.Sum'] for lem in present]
  valence_values = [df.at[idx(lem), 'V.Mean.Sum'] for lem in present]
  dominance_values = [df.at[idx(lem), 'D.Mean.Sum'] for lem in present]

  return sum(arousal_values)/len(arousal_values), sum(valence_values)/len(valence_values), sum(dominance_values)/len(dominance_values)


Try it out:

In [14]:
avd_get(test_text)

(4.026923076923078, 5.451538461538464, 5.579487179487179)
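
One edge case: `avd_get` divides by `len(arousal_values)`, so a text with no lemma in the lexicon (an empty file, say) raises `ZeroDivisionError`. Below is a minimal sketch of a guarded variant. It is not the function above: `avd_or_none` and `toy_lexicon` are hypothetical names, the ratings are made up, and it takes a ready-made list of lemmata so that it runs without spaCy or the ratings file.

```python
from statistics import mean

# Hypothetical (arousal, valence, dominance) ratings keyed by word.
toy_lexicon = {'love': (5.0, 8.0, 6.0), 'hate': (6.0, 2.0, 4.0)}

def avd_or_none(lemmata, lexicon):
    """Return mean (arousal, valence, dominance), or None if no lemma is rated."""
    rated = [lexicon[lem] for lem in lemmata if lem in lexicon]
    if not rated:
        return None
    # zip(*rated) regroups the triples into three columns, one per dimension.
    return tuple(mean(column) for column in zip(*rated))

print(avd_or_none(['i', 'love', 'hate'], toy_lexicon))  # (5.5, 5.0, 5.0)
print(avd_or_none(['zzz'], toy_lexicon))                # None
```

Returning `None` leaves it to the caller to decide whether to skip the file or record a missing value.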

Beautiful. Now, we want to be able to take a folder containing all the files, and then write all these values into a CSV. Because it's a hassle to upload all the input files every time, we are going to access the input files from Google Drive. 

Once we have read in all the files from `input_path`, we store them as `documents`, a list of `(filename, contents)` tuples.

In [15]:
import os

input_path = '/content/drive/My Drive/Data Analysis Summer Cog Lab/Linguistics/pilot/3_edited'

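filenames = os.listdir(input_path)

# read each file inside a with-block so it is closed straight after reading
documents = []
for name in filenames:
  with open(os.path.join(input_path, name), 'r') as f:
    documents.append((name, f.read()))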
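
`os.listdir` returns every entry in the folder, whatever its type. If the folder might also hold other files, one option is to select only the `.txt` transcripts. The sketch below does this with `pathlib`; `read_txt_files` is a hypothetical helper, and the demo runs on a throwaway folder with made-up file names so that it does not depend on Google Drive.

```python
import tempfile
from pathlib import Path

def read_txt_files(folder):
    """Return (filename, contents) tuples for every .txt file in `folder`."""
    paths = sorted(Path(folder).glob('*.txt'))
    return [(p.name, p.read_text(encoding='utf-8')) for p in paths]

# Demo on a temporary folder (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'P01_vacation.txt').write_text('We went to the beach.', encoding='utf-8')
    Path(tmp, 'notes.md').write_text('not a transcript', encoding='utf-8')
    documents = read_txt_files(tmp)

print(documents)  # [('P01_vacation.txt', 'We went to the beach.')]
```

Sorting the paths also makes the processing order the same on every run.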

Then we can get to work writing the AVD values into a CSV. NB: this can feel quite slow, because spaCy has to run its whole `nlp()` pipeline on each text.

In [None]:
import csv

output_path = '/content/drive/My Drive/Data Analysis Summer Cog Lab/Linguistics/pilot/4_results/avd_edited_output.csv'
# newline='' lets the csv module control line endings, as its documentation recommends
with open(output_path, 'w', newline='') as file:
    writer = csv.writer(file, delimiter=',')
    writer.writerow(['name','arousal','valence','dominance'])
    for name, text in documents:
        arousal, valence, dominance = avd_get(text)
        writer.writerow([name, arousal, valence, dominance])

Finally, let's take a look at the resulting CSV:

In [None]:
output_df = pd.read_csv(output_path)
display(output_df)

Unnamed: 0,name,arousal,valence,dominance
0,P02_T04_politics.txt,3.98181,5.478952,5.51581
1,P02_T04_vacation.txt,3.927586,6.357328,5.917328
2,P02_T04_cruelty.txt,3.878952,5.699274,5.50621
3,P03_T04_politics.txt,3.908696,6.054435,5.792087
4,P03_T04_vacation.txt,4.025524,6.461714,5.903619
5,P03_T04_cruelty.txt,4.074452,6.269806,5.783484
6,P04_T04_politics.txt,3.863061,5.910612,5.901939
7,P04_T04_vacation.txt,4.203226,6.45828,5.808602
8,P04_T04_cruelty.txt,4.194466,5.697184,5.332718
9,P01_T04_politics.txt,3.975289,5.790992,5.632397


Try with the unedited transcripts:

In [None]:
input_path = '/content/drive/My Drive/Data Analysis Summer Cog Lab/Linguistics/pilot/2_unedited'

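filenames = os.listdir(input_path)

# read each file inside a with-block so it is closed straight after reading
documents = []
for name in filenames:
  with open(os.path.join(input_path, name), 'r') as f:
    documents.append((name, f.read()))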

output_path = '/content/drive/My Drive/Data Analysis Summer Cog Lab/Linguistics/pilot/4_results/avd_unedited_output.csv'
# newline='' lets the csv module control line endings, as its documentation recommends
with open(output_path, 'w', newline='') as file:
    writer = csv.writer(file, delimiter=',')
    writer.writerow(['name','arousal','valence','dominance'])
    for name, text in documents:
        arousal, valence, dominance = avd_get(text)
        writer.writerow([name, arousal, valence, dominance])

In [None]:
output_df = pd.read_csv(output_path)
display(output_df)

Unnamed: 0,name,arousal,valence,dominance
0,P01_T04_vacation.txt,3.975315,6.155676,5.811622
1,P01_T04_politics.txt,3.909922,5.823488,5.644806
2,P01_T04_cruelty.txt,4.054685,6.042308,5.78014
3,P02_T04_vacation.txt,3.915315,6.333604,5.884234
4,P02_T04_politics.txt,3.945929,5.484336,5.515929
5,P02_T04_cruelty.txt,3.854,5.741,5.540167
6,P03_T04_vacation.txt,4.004804,6.441471,5.885588
7,P03_T04_politics.txt,3.886607,6.051161,5.788929
8,P03_T04_cruelty.txt,4.037974,6.247647,5.775425
9,P04_T04_vacation.txt,4.187045,6.344659,5.782159


OK, so we get some results. But we need to run some statistics on them to check whether the differences we're seeing are statistically significant.

First, we'll compare edited vs. unedited texts.


In [16]:
edited_path = '/content/drive/My Drive/Data Analysis Summer Cog Lab/Linguistics/pilot/4_results/avd_edited_output.csv'
edited_df = pd.read_csv(edited_path)
display(edited_df)

Unnamed: 0,name,arousal,valence,dominance
0,P02_T04_politics.txt,3.98181,5.478952,5.51581
1,P02_T04_vacation.txt,3.927586,6.357328,5.917328
2,P02_T04_cruelty.txt,3.878952,5.699274,5.50621
3,P03_T04_politics.txt,3.908696,6.054435,5.792087
4,P03_T04_vacation.txt,4.025524,6.461714,5.903619
5,P03_T04_cruelty.txt,4.074452,6.269806,5.783484
6,P04_T04_politics.txt,3.863061,5.910612,5.901939
7,P04_T04_vacation.txt,4.203226,6.45828,5.808602
8,P04_T04_cruelty.txt,4.194466,5.697184,5.332718
9,P01_T04_politics.txt,3.975289,5.790992,5.632397


In [17]:
unedited_path = '/content/drive/My Drive/Data Analysis Summer Cog Lab/Linguistics/pilot/4_results/avd_unedited_output.csv'
unedited_df = pd.read_csv(unedited_path)
display(unedited_df)

Unnamed: 0,name,arousal,valence,dominance
0,P01_T04_vacation.txt,3.975315,6.155676,5.811622
1,P01_T04_politics.txt,3.909922,5.823488,5.644806
2,P01_T04_cruelty.txt,4.054685,6.042308,5.78014
3,P02_T04_vacation.txt,3.915315,6.333604,5.884234
4,P02_T04_politics.txt,3.945929,5.484336,5.515929
5,P02_T04_cruelty.txt,3.854,5.741,5.540167
6,P03_T04_vacation.txt,4.004804,6.441471,5.885588
7,P03_T04_politics.txt,3.886607,6.051161,5.788929
8,P03_T04_cruelty.txt,4.037974,6.247647,5.775425
9,P04_T04_vacation.txt,4.187045,6.344659,5.782159


In [23]:
from scipy.stats import ttest_ind

ttest_ind(unedited_df['arousal'], edited_df['arousal'])

Ttest_indResult(statistic=-0.6938064382131189, pvalue=0.49506483394100365)

In [24]:
ttest_ind(unedited_df['valence'], edited_df['valence'])

Ttest_indResult(statistic=-0.053334363798167445, pvalue=0.957946852395784)

In [25]:
ttest_ind(unedited_df['dominance'], edited_df['dominance'])

Ttest_indResult(statistic=-0.07150717502446252, pvalue=0.9436401999665964)
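
`ttest_ind` treats the two columns as independent samples. Since every edited transcript has an unedited counterpart with the same filename, a paired comparison is an alternative worth considering (SciPy provides `ttest_rel` for that). The sketch below is not a reanalysis: it takes the arousal values of just four files that appear in both tables above, pairs them by filename, and computes the paired t statistic by hand, without a p-value.

```python
import math
from statistics import mean, stdev

# Arousal values copied from the two tables above (a four-file subset).
edited = {
    'P02_T04_politics.txt': 3.98181,
    'P02_T04_vacation.txt': 3.927586,
    'P02_T04_cruelty.txt': 3.878952,
    'P03_T04_politics.txt': 3.908696,
}
unedited = {
    'P03_T04_politics.txt': 3.886607,
    'P02_T04_cruelty.txt': 3.854,
    'P02_T04_vacation.txt': 3.915315,
    'P02_T04_politics.txt': 3.945929,
}

# Pair by filename, so the row order of either table does not matter.
names = sorted(edited.keys() & unedited.keys())
diffs = [edited[n] - unedited[n] for n in names]

# Paired t statistic: mean difference divided by its standard error.
mean_diff = mean(diffs)
t_stat = mean_diff / (stdev(diffs) / math.sqrt(len(diffs)))
print(len(diffs), mean_diff, t_stat)
```

Pairing by name matters here because the two CSVs list the files in different orders, so comparing row 0 with row 0 would mix up transcripts.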

OK, now topics. We'll compare the three topics within the edited transcripts first, then within the unedited ones.

In [33]:
# keep the rows whose filename mentions each topic
vacations_df = edited_df[edited_df['name'].str.contains('vacation')]
politics_df = edited_df[edited_df['name'].str.contains('politics')]
cruelty_df = edited_df[edited_df['name'].str.contains('cruelty')]

display(vacations_df)
display(politics_df)
display(cruelty_df)

Unnamed: 0,name,arousal,valence,dominance
1,P02_T04_vacation.txt,3.927586,6.357328,5.917328
4,P03_T04_vacation.txt,4.025524,6.461714,5.903619
7,P04_T04_vacation.txt,4.203226,6.45828,5.808602
10,P01_T04_vacation.txt,3.97902,6.159412,5.836176


Unnamed: 0,name,arousal,valence,dominance
0,P02_T04_politics.txt,3.98181,5.478952,5.51581
3,P03_T04_politics.txt,3.908696,6.054435,5.792087
6,P04_T04_politics.txt,3.863061,5.910612,5.901939
9,P01_T04_politics.txt,3.975289,5.790992,5.632397


Unnamed: 0,name,arousal,valence,dominance
2,P02_T04_cruelty.txt,3.878952,5.699274,5.50621
5,P03_T04_cruelty.txt,4.074452,6.269806,5.783484
8,P04_T04_cruelty.txt,4.194466,5.697184,5.332718
11,P01_T04_cruelty.txt,4.094366,6.046901,5.761831


In [37]:
import scipy.stats as stats


print(stats.f_oneway(vacations_df['arousal'], politics_df['arousal'], cruelty_df['arousal']))
print(stats.f_oneway(vacations_df['valence'], politics_df['valence'], cruelty_df['valence']))
print(stats.f_oneway(vacations_df['dominance'], politics_df['dominance'], cruelty_df['dominance']))

F_onewayResult(statistic=1.57291043869498, pvalue=0.2595197226962048)
F_onewayResult(statistic=6.331254044692919, pvalue=0.0192044543376894)
F_onewayResult(statistic=2.8145852241573617, pvalue=0.11235764875538076)
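
To see what `f_oneway` is computing, here is a sketch of the one-way ANOVA F statistic worked out by hand for valence, using the twelve edited-transcript values from the three topic tables above. The variable names are my own. Because the tables show values rounded to six decimals, the result should agree with the valence line of the output above only to a few decimal places.

```python
from statistics import mean

# Edited-transcript valence values copied from the three topic tables above.
groups = {
    'vacation': [6.357328, 6.461714, 6.45828, 6.159412],
    'politics': [5.478952, 6.054435, 5.910612, 5.790992],
    'cruelty':  [5.699274, 6.269806, 5.697184, 6.046901],
}

all_values = [v for g in groups.values() for v in g]
grand_mean = mean(all_values)

# Between-group sum of squares: how far each group mean sits from the grand mean.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
# Within-group sum of squares: spread of the values around their own group mean.
ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)

df_between = len(groups) - 1               # 3 groups -> 2
df_within = len(all_values) - len(groups)  # 12 values -> 9

f_stat = (ss_between / df_between) / (ss_within / df_within)
print(round(f_stat, 2))  # 6.33
```

A large F means the topic means differ by more than the spread within each topic would suggest.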


In [38]:
# keep the rows whose filename mentions each topic
vacations_df = unedited_df[unedited_df['name'].str.contains('vacation')]
politics_df = unedited_df[unedited_df['name'].str.contains('politics')]
cruelty_df = unedited_df[unedited_df['name'].str.contains('cruelty')]

In [39]:
print(stats.f_oneway(vacations_df['arousal'], politics_df['arousal'], cruelty_df['arousal']))
print(stats.f_oneway(vacations_df['valence'], politics_df['valence'], cruelty_df['valence']))
print(stats.f_oneway(vacations_df['dominance'], politics_df['dominance'], cruelty_df['dominance']))

F_onewayResult(statistic=2.2328698375824447, pvalue=0.16313807307524747)
F_onewayResult(statistic=5.768527332746453, pvalue=0.02441572228605834)
F_onewayResult(statistic=2.2229300449588707, pvalue=0.16422627672943538)


## IMDB Test

In [None]:
pos_path = "/content/drive/My Drive/Summer Cog Zoom May 20/Linguistic/IMDB_reviews/pos"
neg_path = "/content/drive/My Drive/Summer Cog Zoom May 20/Linguistic/IMDB_reviews/neg"

pos_files = os.listdir(pos_path)
neg_files = os.listdir(neg_path)

pos_contents = [open(pos_path+"/"+file, "r").read() for file in pos_files]
  

If you "get it", it's magnificent.<br /><br />If you don't, it's decent.<br /><br />Please understand that "getting it" does not necessarily mean you've gone through a school shooting. There is so much more to this movie that, at times, the school shooting becomes insignificant.<br /><br />Above all, it's a movie about acceptance, both superficially--of a traumatic event, but also of people who are different for whatever reason.<br /><br />It's also a movie about unendurable pain, and how different people endure it. In this case, the contrast between Alicia's rage and Deanna's obsession creates an atmosphere of such palpable anxiety that halfway through the movie we wonder how the director could possibly pull a happy ending out of his hat. Thankfully, the audience is given credit for being human beings; our intelligence is not insulted by a sappy, implausibly moralistic ending.<br /><br />Above and beyond that, I try to keep a clear head about movies being fiction and all that. Yet I m

In [None]:
neg_contents = [open(neg_path+"/"+file, "r").read() for file in neg_files]

In [None]:
pos_results = [avd_get(entry) for entry in pos_contents]
neg_results = [avd_get(entry) for entry in neg_contents]

In [None]:
import pickle

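# with-blocks flush and close each pickle file once it has been written
# note: these land inside the review folders, so a later os.listdir() will list them too
with open(pos_path+"/pos.pickle", "wb") as pos_pickle:
    pickle.dump(pos_results, pos_pickle)
with open(neg_path+"/neg.pickle", "wb") as neg_pickle:
    pickle.dump(neg_results, neg_pickle)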

In [None]:
pos_df = pd.DataFrame(pos_results)
neg_df = pd.DataFrame(neg_results)


In [None]:
display(pos_df)

Unnamed: 0,0,1,2
0,4.126977,5.794419,5.544651
1,3.981395,5.939535,5.659070
2,3.944889,6.136444,5.805111
3,4.278000,5.890231,5.722308
4,4.133692,6.214308,5.717077
...,...,...,...
195,3.853551,6.087009,5.773925
196,4.123301,5.941553,5.711942
197,3.800615,6.052462,5.928615
198,4.064412,6.209020,5.916078


In [None]:
pos_avd = (pos_df[0].mean(), pos_df[1].mean(), pos_df[2].mean())
neg_avd = (neg_df[0].mean(), neg_df[1].mean(), neg_df[2].mean())

In [None]:
print("Positive Reviews Arousal:", pos_avd[0], "Valence:", pos_avd[1], "Dominance:", pos_avd[2])
print("Negative Reviews Arousal:", neg_avd[0], "Valence:", neg_avd[1], "Dominance:", neg_avd[2])

Positive Reviews Arousal: 4.071610069204202 Valence: 5.914732459686648 Dominance: 5.661810084938473
Negative Reviews Arousal: 4.036584152805322 Valence: 5.768386691675906 Dominance: 5.580582868391926


## Statistics (T-test, p-value)

In [None]:
from scipy.stats import ttest_ind

print('Arousal')
ttest_ind(pos_df[0], neg_df[0])

Arousal


Ttest_indResult(statistic=2.406988367717121, pvalue=0.01653925997668273)

In [None]:
print("Valence")
ttest_ind(pos_df[1], neg_df[1])

Valence


Ttest_indResult(statistic=6.378697076168146, pvalue=4.963757899054474e-10)

In [None]:
print("Dominance")
ttest_ind(pos_df[2], neg_df[2])

Dominance


Ttest_indResult(statistic=5.414266488445275, pvalue=1.0663296007641505e-07)

## Edited vs. Unedited