In this notebook, we test a hypothesis about the culture industry during the Great Depression: that it tended to connect people with the system by evoking public emotion toward the country and the government, and by emphasizing and promoting a positive image of the executive, legislative, and judicial systems.

## Prepare for analysis

In [1]:
#Special module written for this class
#This provides access to data and to helper functions from previous weeks
#Make sure you update it before starting this notebook
import lucem_illud #pip install -U git+git://github.com/Computational-Content-Analysis-2018/lucem_illud.git

#All these packages need to be installed from pip
#For NLP
import nltk

import numpy as np #For arrays
import pandas #Gives us DataFrames
import matplotlib.pyplot as plt #For graphics
import seaborn #Makes the graphics look nicer

#Displays the graphs
import graphviz #You also need to install the command line graphviz

#These are from the standard library
import os.path
import zipfile
import subprocess
import io
import tempfile

%matplotlib inline

In [2]:
lucem_illud.setupStanfordNLP()

Starting downloads, this will take 5-10 minutes
[0%] Downloading core from http://nlp.stanford.edu/software/stanford-corenlp-full-2017-06-09.zip
[24%] Downloaded core, extracting to ../stanford-NLP/core
[25%] Downloading postagger from https://nlp.stanford.edu/software/stanford-postagger-full-2017-06-09.zip
[49%] Downloaded postagger, extracting to ../stanford-NLP/postagger
[50%] Downloading ner from https://nlp.stanford.edu/software/stanford-ner-2017-06-09.zip
[74%] Downloaded ner, extracting to ../stanford-NLP/ner
[75%] Downloading parser from https://nlp.stanford.edu/software/stanford-parser-full-2017-06-09.zip
[99%] Downloaded parser, extracting to ../stanford-NLP/parser
[100%]Done setting up the Stanford NLP collection


In [3]:
import lucem_illud.stanford as stanford

The StanfordTokenizer will be deprecated in version 3.2.5.
Please use nltk.tag.corenlp.CoreNLPPOSTagger or nltk.tag.corenlp.CoreNLPNERTagger instead.
  super(StanfordNERTagger, self).__init__(*args, **kwargs)
The StanfordTokenizer will be deprecated in version 3.2.5.
Please use nltk.tag.corenlp.CoreNLPPOSTagger or nltk.tag.corenlp.CoreNLPNERTagger instead.
  super(StanfordPOSTagger, self).__init__(*args, **kwargs)


## Load and Process Data

Here we load the two raw data sets into DataFrames: `preDF` (labeled 'Before' in the comparisons below) and `postDF` (labeled 'After').

In [7]:
preDF = pandas.read_csv('../Data/pre_popular.csv') 
postDF = pandas.read_csv('../Data/post_popular.csv')

## Part-of-Speech (POS) tagging

Split each plot into sentences, then tokenize each sentence into words:

In [8]:
preDF['sentences'] = preDF['plot'].apply(lambda x: [nltk.word_tokenize(s) for s in nltk.sent_tokenize(x)])
postDF['sentences'] = postDF['plot'].apply(lambda x: [nltk.word_tokenize(s) for s in nltk.sent_tokenize(x)])

Tag every token in the corpus with its part of speech (POS). Note that this can take more than 4 hours to run.

In [9]:
preDF['POS_sents'] = preDF['sentences'].apply(lambda x: stanford.postTagger.tag_sents(x))
postDF['POS_sents'] = postDF['sentences'].apply(lambda x: stanford.postTagger.tag_sents(x))

Export the tagged DataFrames to CSV files as a backup:

In [10]:
preDF.to_csv('pre_popularPOS.csv', index = False)
postDF.to_csv('post_popularPOS.csv', index = False)
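
One caveat with the CSV backup above: a CSV cell can only hold text, so nested lists such as those in `POS_sents` are stored as their string representation and come back as strings when the file is reloaded. The sketch below illustrates the round trip with the standard library only; the toy `pos_sents` value is made up for illustration, and the sketch assumes the cell is written as `str(value)`, which is how the `csv` module handles non-string values.

```python
import ast
import csv
import io

# Toy stand-in for one cell of the 'POS_sents' column (illustrative data):
# a list of sentences, each a list of (token, tag) pairs.
pos_sents = [[("The", "DT"), ("secret", "JJ"), ("government", "NN")]]

# Write the cell to an in-memory CSV; the list is stored as its string form.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["POS_sents"])
writer.writerow([str(pos_sents)])

# Read it back: the cell is now a plain string, not a list.
buffer.seek(0)
rows = list(csv.reader(buffer))
raw_cell = rows[1][0]

# ast.literal_eval parses the string back into nested lists and tuples
# without executing arbitrary code.
restored = ast.literal_eval(raw_cell)
```

When reloading with pandas, the same conversion can be applied at read time through the `converters` argument of `pandas.read_csv`, for example `converters={'POS_sents': ast.literal_eval}`.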

### Consider conditional frequencies (e.g., adjectives associated with nouns of interest or adverbs with verbs of interest)

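To make these comparisons easier, the function `conditionalFreq` collects every distinct word with a given POS tag that immediately precedes a target word. It returns a set, so it shows which words co-occur with the target rather than how often. `compare_conditionalFreq` runs it on both corpora: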

In [11]:
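def conditionalFreq(NTarget, corpusPOS, Word):
    # Collect the distinct words tagged NTarget that immediately precede Word.
    # Word is compared against the lowercased token, so pass it in lowercase.
    NResults = set()
    for entry in corpusPOS:  # one entry per plot
        for sentence in entry:  # each sentence is a list of (token, tag) pairs
            # Walk over adjacent token pairs
            for (ent1, kind1), (ent2, kind2) in zip(sentence[:-1], sentence[1:]):
                if (kind1, ent2.lower()) == (NTarget, Word):
                    NResults.add(ent1)
    return NResults

# The default corpora are bound when this cell runs; rerun it if preDF or postDF change
def compare_conditionalFreq(NTarget, Word, corpusPOS_pre = preDF['POS_sents'], corpusPOS_post = postDF['POS_sents']):
    # Run conditionalFreq on both corpora and label the results
    pre = conditionalFreq(NTarget, corpusPOS_pre, Word)
    post = conditionalFreq(NTarget, corpusPOS_post, Word)
    result = [('Before', pre), ('After', post)]
    return result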
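
Because `conditionalFreq` returns a set, repeated occurrences of the same word are collapsed. If actual frequencies are wanted, one possible variant keeps a tally instead. This is only a sketch: `conditional_counts` and `toy_corpus` are hypothetical names introduced here, and the toy data is made up to mimic the structure of `POS_sents`.

```python
from collections import Counter

def conditional_counts(tag, corpus_pos, word):
    """Count how often each token tagged `tag` immediately precedes `word`."""
    counts = Counter()
    for entry in corpus_pos:          # one entry per plot
        for sentence in entry:        # list of (token, tag) pairs
            for (tok1, tag1), (tok2, _tag2) in zip(sentence[:-1], sentence[1:]):
                # Same matching rule as conditionalFreq: the target is lowercased
                if tag1 == tag and tok2.lower() == word:
                    counts[tok1] += 1
    return counts

# Illustrative data only, shaped like the 'POS_sents' column
toy_corpus = [
    [[("The", "DT"), ("secret", "JJ"), ("government", "NN")]],
    [[("A", "DT"), ("secret", "JJ"), ("Government", "NNP"), ("plan", "NN")],
     [("The", "DT"), ("British", "JJ"), ("government", "NN")]],
]

counts = conditional_counts("JJ", toy_corpus, "government")
```

Calling `counts.most_common()` then lists the preceding words from most to least frequent, which makes it easier to tell a recurring pattern from a one-off pairing.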

To test the function, I looked at the adjectives (tag `JJ`) that immediately precede several nouns of interest, starting with 'government'.

In [48]:
compare_conditionalFreq('JJ', 'government')

[('Before', {'Israeli', 'critical', 'democratic'}),
 ('After', {'British', 'secret'})]

In [36]:
compare_conditionalFreq('JJ', 'leader')

[('Before',
  {'Muslim', 'fast-thinking', 'military', 'stern', 'tough', 'unofficial'}),
 ('After',
  {'dictatorial',
   'evil',
   'ferocious',
   'military',
   'new',
   'original',
   'popular'})]
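
Since each result is a pair of sets, the two periods can be compared directly with set operations. The sketch below copies the 'leader' output above into two variables (the names `before`, `after`, `shared`, `only_before`, and `only_after` are introduced here for illustration) and separates the adjectives shared by both periods from those unique to each.

```python
# Adjectives preceding 'leader', copied from the output above
before = {'Muslim', 'fast-thinking', 'military', 'stern', 'tough', 'unofficial'}
after = {'dictatorial', 'evil', 'ferocious', 'military', 'new', 'original', 'popular'}

shared = before & after        # adjectives that appear in both periods
only_before = before - after   # adjectives that appear only in the 'Before' set
only_after = after - before    # adjectives that appear only in the 'After' set

print('Shared:', sorted(shared))
print('Only before:', sorted(only_before))
print('Only after:', sorted(only_after))
```

For 'leader', the only shared adjective is 'military'; the remaining adjectives differ between the two sets.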

In [54]:
compare_conditionalFreq('JJ', 'hero')

[('Before', {'average', 'demonic', 'legendary', 'local', 'national', 'super'}),
 ('After', {'national'})]

In [55]:
compare_conditionalFreq('JJ', 'savior')

[('Before', set()), ('After', set())]

In [49]:
compare_conditionalFreq('JJ', 'children')

[('Before',
  {'adopted',
   'beloved',
   'countless',
   'husband/the',
   'ill-behaved',
   'missing',
   'other',
   'pajama-clad',
   'picture-perfect',
   'respective',
   'young'}),
 ('After',
  {'adolescent',
   'adult',
   'crazy',
   'gifted',
   'other',
   'own',
   'unruly',
   'young'})]

In [56]:
compare_conditionalFreq('JJ', 'woman')

[('Before',
  {'American',
   'British',
   'French',
   'Russian',
   'actual',
   'contemplative',
   'dead',
   'good',
   'human',
   'ill',
   'impulsive',
   'independent',
   'insecure',
   'multi-talented',
   'mysterious',
   'old',
   'perfect',
   'pregnant',
   'seductive',
   'successful',
   'unconfident',
   'unmarried',
   'young'}),
 ('After',
  {'Caucasian',
   'crazy',
   'dead',
   'difficult',
   'friendless',
   'gipsy',
   'mysterious',
   'only',
   'overachieving',
   'paranoid',
   'uncultured',
   'young'})]

In [59]:
compare_conditionalFreq('JJ', 'support')

[('Before', {'moral'}), ('After', {'real', 'stubborn'})]