# Introduction: Must I exist?

In the introduction to *Contingent Selves*, I describe the concept of the 'contingent self', and indicate what makes this concept important for the study of Romantic literature.

As part of this introduction, I analyse a corpus of academic documents from JSTOR Data for Research and consider how the word 'self' is used in them. This notebook shows the high-level code required to reproduce the analysis in the text. More experienced users can explore the rest of this repository for the implementation details.

In [1]:
import pickle as p

from matplotlib import pyplot as plt
from bs4 import BeautifulSoup
import pandas as pd
import nltk

from utils import JSTORCorpus, TargetedTrigramAssocFinder, RobustBigramAssocMeasures, RobustTrigramAssocMeasures

%matplotlib inline

In [2]:
DATA_DIR = 'data/'
OUT_DIR = DATA_DIR + 'associations-2020-07-07-wn25/'

corpus = JSTORCorpus.load(DATA_DIR + 'last-15-years-corpus.p')

with open(OUT_DIR + 'romantic-self-trigrams.p', 'rb') as file:
    rom_self_trigrams = p.load(file)

with open(OUT_DIR + 'self-bigrams.p', 'rb') as file:
    self_bigrams = p.load(file)
    
with open(OUT_DIR + 'romantic-bigrams.p', 'rb') as file:
    romantic_bigrams = p.load(file)

Corpus loaded from data/last-15-years-corpus.p


In [3]:
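# Store the pair under the key ('romantic', 'self') rather than ('self', 'romantic'),
# so that it can be looked up with w1='romantic', w2='self' later on.
if ('self','romantic') in self_bigrams.ngram_fd:
    self_bigrams.ngram_fd[('romantic','self')] = self_bigrams.ngram_fd[('self','romantic')]
    del self_bigrams.ngram_fd[('self','romantic')]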

# The shape of the data

In [4]:
print(f'Unigrams in the corpus: {rom_self_trigrams.N:,}\n')

print('Trigrams with "self" and "romantic":')
print(f'    types: {rom_self_trigrams.ngram_fd.B():,}')
print(f'    tokens: {rom_self_trigrams.ngram_fd.N():,}\n')

print('Bigrams with "self" or "romantic":')
print(f'    types: {rom_self_trigrams.bigram_fd.B():,}')
print(f'    tokens: {rom_self_trigrams.bigram_fd.N():,}')
print('Bigrams with "self":')
print(f'    types: {self_bigrams.ngram_fd.B():,}')
print(f'    tokens: {self_bigrams.ngram_fd.N():,}')
print('Bigrams with "romantic":')
print(f'    types: {romantic_bigrams.ngram_fd.B():,}')
print(f'    tokens: {romantic_bigrams.ngram_fd.N():,}')

Unigrams in the corpus: 120,941,637

Trigrams with "self" and "romantic":
    types: 5,294
    tokens: 27,504

Bigrams with "self" or "romantic":
    types: 108,830
    tokens: 2,783,740
Bigrams with "self":
    types: 59,754
    tokens: 1,567,113
Bigrams with "romantic":
    types: 49,077
    tokens: 1,219,005
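
In these counts, 'types' are distinct n-grams and 'tokens' are total occurrences; NLTK's `FreqDist.B()` and `FreqDist.N()` report those two figures respectively. The sketch below illustrates the distinction on a tiny, made-up list of bigrams. It uses `collections.Counter` from the standard library as a stand-in for `FreqDist`, and the sample data is hypothetical.

```python
from collections import Counter

# Hypothetical sample: each tuple is one observed bigram (one token).
observed = [
    ('romantic', 'self'),
    ('romantic', 'poetry'),
    ('romantic', 'self'),
    ('self', 'conscious'),
    ('romantic', 'self'),
]

counts = Counter(observed)

n_types = len(counts)            # distinct bigrams, as FreqDist.B() would report
n_tokens = sum(counts.values())  # total occurrences, as FreqDist.N() would report

print(f'types: {n_types:,}')
print(f'tokens: {n_tokens:,}')
```

Here the bigram `('romantic', 'self')` contributes one type but three tokens, which is why the token counts in the real corpus are so much larger than the type counts.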


In [5]:
corpus_df = pd.DataFrame.from_dict(corpus.corpus_meta, orient='index')

In [6]:
with open(DATA_DIR + 'metadata/journal-article-10.2307_41328981.xml') as file:
    meta_xml = BeautifulSoup(file.read(), features="lxml")

In [12]:
meta_xml.find('journal-title').text

'New Literary History'

# Stopwords?

What happens if we filter for stopwords?

In [15]:
import re
from nltk.corpus import stopwords
english = stopwords.words('english')

In [46]:
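# Build the stopword set and the junk pattern once, rather than on every call.
STOPWORDS = frozenset(stopwords.words('english'))
JUNK = re.compile(r'[\W\d]+')

def stopword_filter(*ngram):
    """Return True if the ngram contains a stopword or a junk token.

    A token counts as junk if it begins with a non-word character or a digit.
    """
    if not STOPWORDS.isdisjoint(ngram):
        return True
    return any(JUNK.match(wd) for wd in ngram)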

In [47]:
rom_self_trigrams.apply_ngram_filter(stopword_filter)
romantic_bigrams.apply_ngram_filter(stopword_filter)
self_bigrams.apply_ngram_filter(stopword_filter)
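
`apply_ngram_filter` removes every n-gram for which the filter function returns `True`. As a rough illustration of which n-grams survive, the sketch below applies the same logic to a few made-up trigrams. To stay self-contained it uses a small hypothetical stopword set (`DEMO_STOPWORDS`) in place of NLTK's English list, and `demo_filter` and `samples` are illustrative names, not part of the repository.

```python
import re

# Hypothetical stand-in for stopwords.words('english'), so no NLTK download is needed.
DEMO_STOPWORDS = frozenset({'the', 'of', 'and', 'a'})

def demo_filter(*ngram):
    """Return True if the ngram should be discarded."""
    if not DEMO_STOPWORDS.isdisjoint(ngram):
        return True
    # re.match anchors at the start of the string, so only tokens that
    # BEGIN with punctuation or a digit are treated as junk.
    return any(re.match(r'[\W\d]+', wd) for wd in ngram)

samples = [
    ('poetry', 'self', 'romantic'),  # kept
    ('the', 'self', 'romantic'),     # dropped: contains a stopword
    ('1798', 'self', 'romantic'),    # dropped: token starts with a digit
    ('c19', 'self', 'romantic'),     # kept: the digits are not at the start
]

kept = [ngram for ngram in samples if not demo_filter(*ngram)]
print(kept)
```

Because the pattern is applied with `re.match`, a token such as `'c19'` passes through. If tokens containing digits anywhere should be dropped, `re.search` would be the natural alternative, though that would change the results reported here.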

In [112]:
rom_scores = romantic_bigrams.score_ngrams(RobustBigramAssocMeasures.likelihood_ratio)
self_scores = self_bigrams.score_ngrams(RobustBigramAssocMeasures.likelihood_ratio)
rom_self_scores = rom_self_trigrams.score_ngrams(RobustTrigramAssocMeasures.likelihood_ratio)

In [113]:
# The first element of each n-gram is the word that co-occurs with the target word(s).
for idx, (rom_item, self_item, tri_item) in enumerate(zip(rom_scores[:30], self_scores[:30], rom_self_scores[:30])):
    rom_bg, r_sc = rom_item
    self_bg, s_sc = self_item
    trigram, t_sc = tri_item

    r_wd = rom_bg[0]
    s_wd = self_bg[0]
    t_wd = trigram[0]

    print(f"{idx:<2} : {r_wd:<15} {r_sc:>13,.2f} : {s_wd:<15} {s_sc:>13,.2f} : {t_wd:<15} {t_sc:>13,.2f}")

0  : romantic           892,095.38 : self             1,137,949.75 : self             1,736,647.47
1  : period              29,328.70 : conscious           21,493.52 : romantic         1,357,924.04
2  : poetry              23,752.89 : consciousness       20,221.33 : period              70,292.85
3  : era                 19,311.59 : one                 17,474.49 : poetry              66,935.82
4  : self                16,595.72 : romantic            16,595.72 : one                 62,852.02
5  : romanticism         16,292.47 : consciously         11,999.87 : conscious           57,432.59
6  : literature          15,699.04 : identity            11,685.73 : consciousness       57,014.88
7  : poets               15,245.03 : sense               10,890.09 : era                 54,726.71
8  : british             13,548.93 : also                10,581.49 : literature          52,901.60
9  : literary            12,704.06 : world               10,242.77 : romanticism         52,815.59
10 : write

In [104]:
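def rank_word(scores, word):
    """Return the zero-based rank and score of the first n-gram containing `word`.

    Returns None if no n-gram in `scores` contains the word.
    """
    for rank, (ngram, score) in enumerate(scores):
        if word in ngram:
            return rank, score
    return None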
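
A brief usage sketch for `rank_word`, run on made-up scores in the `(ngram, score)` format that `score_ngrams` returns (sorted from highest to lowest score). The function is repeated so that the example runs on its own; `demo_scores` and its values are hypothetical.

```python
def rank_word(scores, word):
    """Return (rank, score) for the first n-gram containing word, or None."""
    for rank, (ngram, score) in enumerate(scores):
        if word in ngram:
            return rank, score
    return None

# Hypothetical scores, already sorted from highest to lowest.
demo_scores = [
    (('period', 'romantic'), 300.0),
    (('poetry', 'romantic'), 200.0),
    (('era', 'romantic'), 100.0),
]

print(rank_word(demo_scores, 'poetry'))   # second entry, so rank 1
print(rank_word(demo_scores, 'sublime'))  # absent from every n-gram
```

Ranks are zero-based, and a word that appears in every n-gram (here `'romantic'`) will always be reported at rank 0, so the function is most useful for looking up collocates rather than the target words themselves.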

# Which words are excluded in the context of 'romantic' and 'self'?

It appears that several words, including 'abnegation' and 'contingent', never occur with both 'romantic' and 'self' within 14 words on either side. Are there other such words?

In [138]:
def never_with_both(scores):
    """Keep only the collocates that never form a trigram with both target words."""
    return [
        (ngram, score) for ngram, score in scores
        if (ngram[0],'self','romantic') not in rom_self_trigrams.ngram_fd
    ]

filtered_rom_scores = never_with_both(rom_scores)
filtered_self_scores = never_with_both(self_scores)
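
The filtering step relies on an exact tuple lookup, so it only works if the trigram keys are stored in the order `(collocate, 'self', 'romantic')`. The sketch below shows the same membership test on made-up data; `demo_scores` and `demo_trigrams` are hypothetical stand-ins for the scored bigrams and for `rom_self_trigrams.ngram_fd`.

```python
# Hypothetical scored bigrams: (collocate, target) pairs with made-up scores.
demo_scores = [
    (('poetry', 'romantic'), 90.0),
    (('abrams', 'romantic'), 60.0),
    (('era', 'romantic'), 40.0),
]

# Hypothetical set of trigrams in which a collocate occurs near BOTH targets.
demo_trigrams = {
    ('poetry', 'self', 'romantic'),
    ('era', 'self', 'romantic'),
}

# Keep the collocates of 'romantic' that never occur alongside 'self' as well.
excluded = [
    (ngram, score)
    for ngram, score in demo_scores
    if (ngram[0], 'self', 'romantic') not in demo_trigrams
]
print(excluded)

# The lookup is order-sensitive: the same words in another order do not match.
print(('poetry', 'romantic', 'self') in demo_trigrams)
```

Only `'abrams'` survives in this toy case, because it is the only collocate with no corresponding trigram.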

In [146]:
for idx, (rom_item, self_item) in enumerate(zip(filtered_rom_scores[:50], filtered_self_scores[:50])):
    rom_bg, r_sc = rom_item
    self_bg, s_sc = self_item

    r_wd = rom_bg[0]
    s_wd = self_bg[0]

    print(f"{idx:<2} : {r_wd:<15} {r_sc:>13,.2f} : {s_wd:<15} {s_sc:>13,.2f}")

0  : abrams               3,086.69 : cannot               2,650.72
1  : circles              2,878.80 : preservation         2,531.96
2  : edited               1,718.78 : denial               2,085.75
3  : michael              1,657.13 : make                 2,004.17
4  : ecology              1,484.96 : interested           1,819.75
5  : scholars             1,469.87 : discipline           1,817.74
6  : melancholy           1,442.19 : effacing             1,784.51
7  : praxis               1,426.95 : esteem               1,690.77
8  : writings             1,354.74 : respect              1,621.89
9  : trumpener            1,244.66 : styled               1,531.31
10 : companion            1,234.64 : envelope             1,342.85
11 : err                  1,212.37 : things               1,297.58
12 : anthology            1,195.80 : boundaries           1,283.58
13 : scholarship          1,185.10 : addressed            1,233.29
14 : bate                 1,185.04 : abnegation           1,22

In [167]:
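# Contingency counts for the pair. Assuming NLTK's ordering, these are: both words
# together, 'self' without 'romantic', 'romantic' without 'self', and neither word.
contingency = list(self_bigrams.score_ngram(w1='romantic',w2='self', score_fn=RobustBigramAssocMeasures._contingency))

In [168]:
contingency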

[2378.0, 64569.0, 48458.0, 120826232.0]
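
These four counts are what a likelihood-ratio score is computed from. The sketch below is a plain-Python version of the standard log-likelihood ratio (Dunning's G²) for a 2×2 contingency table, applied to the counts printed above. It assumes the counts follow NLTK's `(n_ii, n_oi, n_io, n_oo)` ordering and that the robust measure in `utils` computes the standard statistic; the notebook confirms neither, and `log_likelihood_ratio` is an illustrative name rather than the repository's implementation.

```python
from math import log

def log_likelihood_ratio(n_ii, n_oi, n_io, n_oo):
    """Dunning's G^2 for a 2x2 contingency table of co-occurrence counts."""
    total = n_ii + n_oi + n_io + n_oo
    row_totals = (n_ii + n_io, n_oi + n_oo)  # w1 present / w1 absent
    col_totals = (n_ii + n_oi, n_io + n_oo)  # w2 present / w2 absent
    observed = ((n_ii, n_io), (n_oi, n_oo))

    g2 = 0.0
    for r in range(2):
        for c in range(2):
            obs = observed[r][c]
            if obs > 0:  # an empty cell contributes nothing to the sum
                expected = row_totals[r] * col_totals[c] / total
                g2 += obs * log(obs / expected)
    return 2 * g2

# Counts copied from the output above.
counts = [2378.0, 64569.0, 48458.0, 120826232.0]

print(f'N  = {sum(counts):,.0f}')
print(f'G2 = {log_likelihood_ratio(*counts):,.2f}')
```

The four counts sum to 120,941,637, the corpus size reported earlier, and under these assumptions the statistic should come out at roughly 16,596, in line with the score of 16,595.72 listed for the 'romantic'/'self' pair in the table of top-scoring collocates.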