# Text Mining Tutorial
#### Re: "Political discourses of national and international in the United States political process regarding the restructuring of the global order in 1921–1924"

Question: How can we support qualitative analysis with quantitative examination using collocates?

Answer: There are many ways to extract collocates. Here we introduce one way to mine for ngrams (such as bigrams and trigrams). We will define a function that uses the **nltk** library to perform the extraction. To do this, we first install **nltk**.

In [None]:
# The ! prefix tells Jupyter to run the line as a shell command.
# It is syntactically valid in a notebook even if your editor underlines it in red.
!pip install nltk

We can now define a function, **get_ngrams()**, that uses **ngrams** and **word_tokenize** from **nltk** and returns word sequences of our desired length.

In [1]:
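from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# word_tokenize needs NLTK's Punkt tokenizer data. If it raises a LookupError,
# run nltk.download('punkt') once ('punkt_tab' on recent NLTK releases).

def get_ngrams(text, n):
    """Return every run of n consecutive tokens in text as a space-joined string."""
    n_grams = ngrams(word_tokenize(text), n)
    return [' '.join(grams) for grams in n_grams]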

**get_ngrams()** takes two arguments for **text** and **n**. **text** can be passed the textual data we wish to break into ngrams, and **n** can be used to control the length of the extracted word sequences.
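To make the idea of a "word sequence of length **n**" concrete, the following is a minimal sketch of the sliding window that produces ngrams. It is not how **nltk** is implemented; **sliding_ngrams()** is a hypothetical helper written for illustration, and plain whitespace splitting stands in for **word_tokenize** so the snippet runs without any extra downloads.

```python
def sliding_ngrams(tokens, n):
    """Return each run of n consecutive tokens, joined by a space."""
    # zip over n staggered views of the token list: tokens[0:], tokens[1:], ...
    windows = zip(*(tokens[i:] for i in range(n)))
    return [' '.join(window) for window in windows]

# Whitespace splitting stands in for word_tokenize here (an assumption made
# to keep the sketch dependency-free); it does not separate punctuation.
tokens = 'landlords may exercise a grievous oppression'.split()
bigrams = sliding_ngrams(tokens, 2)
trigrams = sliding_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A text of k tokens yields k - n + 1 ngrams, so the six tokens above give five bigrams and four trigrams.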

In the following example we will supply **get_ngrams** with a sentence from the Hansard corpus to extract bigrams and trigrams. 

In [2]:
get_ngrams('But it is urged that landlords may exercise a grievous oppression over tenants.', 2)

['But it',
 'it is',
 'is urged',
 'urged that',
 'that landlords',
 'landlords may',
 'may exercise',
 'exercise a',
 'a grievous',
 'grievous oppression',
 'oppression over',
 'over tenants',
 'tenants .']

In [3]:
get_ngrams('But it is urged that landlords may exercise a grievous oppression over tenants.', 3)

['But it is',
 'it is urged',
 'is urged that',
 'urged that landlords',
 'that landlords may',
 'landlords may exercise',
 'may exercise a',
 'exercise a grievous',
 'a grievous oppression',
 'grievous oppression over',
 'oppression over tenants',
 'over tenants .']

In the above example we extracted ngrams from a single string of text. In practice we will want to extract ngrams from many sentences in one pass. We can do that in different ways, depending on the structure of the data.

For example, you can use **get_ngrams()** to extract ngrams from a column in a data frame. In the following code we import our data, the Congressional Records (1919-1929), and then extract bigrams from each speech.

In [10]:
import pandas as pd 

congress = pd.read_csv('/home/stephbuon/research_data/stanford_congressional_records/stanford_congressional_records_joel.csv')

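# For this tutorial we keep only the first 100 speeches so the code finishes quickly.
# Comment out the next line to run on the full data set for your research.
congress = congress.head(100).copy()  # .copy() lets us add a column without a pandas warning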

In [11]:
# On the full decade of records this step takes several minutes to complete;
# on a 100-speech sample it finishes almost immediately.
congress['bigrams'] = congress['speech'].apply(get_ngrams, n=2)

The **bigrams** column of our data set now contains, for each speech, a list of every bigram in that speech.

In [12]:
congress

Unnamed: 0,speech_id,speech,date,speaker,file,year,bigrams
0,650348541,The Chair lays before the Senate the credentia...,1919-01-02,The VICE PRESIDENT,01021919.txt,1919,"[The Chair, Chair lays, lays before, before th..."
1,650348542,I present the certificate of election of my co...,1919-01-02,Mr. HALE,01021919.txt,1919,"[I present, present the, the certificate, cert..."
2,650348543,Mr. President. for the information of the Fore...,1919-01-02,Mr. JOHNSON of California,01021919.txt,1919,"[Mr. President, President ., . for, for the, t..."
3,650348544,I introduce a joint resolution. and ask that i...,1919-01-02,Mr. ASHURST,01021919.txt,1919,"[I introduce, introduce a, a joint, joint reso..."
4,650348545,Mr. President. I ask unanimous consent that th...,1919-01-02,Mr. BECKHAM,01021919.txt,1919,"[Mr. President, President ., . I, I ask, ask u..."
...,...,...,...,...,...,...,...
95,650348636,Does the Senator from Idaho know whether or no...,1919-01-02,Mr. KING,01021919.txt,1919,"[Does the, the Senator, Senator from, from Ida..."
96,650348637,I am unable to answer the question of the Sena...,1919-01-02,Mr. BORAH,01021919.txt,1919,"[I am, am unable, unable to, to answer, answer..."
97,650348638,Mr. President. I think I can answer the questi...,1919-01-02,Mr. JONES of Washington,01021919.txt,1919,"[Mr. President, President ., . I, I think, thi..."
98,650348639,Mr. President. the Senator from Washington: is...,1919-01-02,Mr. KIRBY,01021919.txt,1919,"[Mr. President, President ., . the, the Senato..."


To make this data set easier to visualize, we can do further pre-processing and explode the **bigrams** column so that the data set has one bigram per row.

In [13]:
congress = congress.explode('bigrams')
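To see what **explode** does on a small scale, here is a sketch on a hypothetical two-row data frame (the **toy** name and its values are made up for illustration and are not part of the Congressional Records data). Each element of a list becomes its own row, and the other columns and the index label are repeated.

```python
import pandas as pd

# Hypothetical toy data: one list of bigrams per speech.
toy = pd.DataFrame({
    'speech_id': [1, 2],
    'bigrams': [['The Chair', 'Chair lays'], ['I present']],
})

# One row per list element; speech_id and the index label are repeated.
exploded = toy.explode('bigrams')
print(exploded)

# Optional: give every row its own index label again.
renumbered = exploded.reset_index(drop=True)
print(renumbered)
```

The repeated index labels are why the exploded table shows the same row number several times; **reset_index(drop=True)** is one way to renumber the rows if that is inconvenient.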

In [14]:
congress

Unnamed: 0,speech_id,speech,date,speaker,file,year,bigrams
0,650348541,The Chair lays before the Senate the credentia...,1919-01-02,The VICE PRESIDENT,01021919.txt,1919,The Chair
0,650348541,The Chair lays before the Senate the credentia...,1919-01-02,The VICE PRESIDENT,01021919.txt,1919,Chair lays
0,650348541,The Chair lays before the Senate the credentia...,1919-01-02,The VICE PRESIDENT,01021919.txt,1919,lays before
0,650348541,The Chair lays before the Senate the credentia...,1919-01-02,The VICE PRESIDENT,01021919.txt,1919,before the
0,650348541,The Chair lays before the Senate the credentia...,1919-01-02,The VICE PRESIDENT,01021919.txt,1919,the Senate
...,...,...,...,...,...,...,...
99,650348640,If there be no further conclurrent or other re...,1919-01-02,The VICE PRESIDENT,01021919.txt,1919,the morning
99,650348640,If there be no further conclurrent or other re...,1919-01-02,The VICE PRESIDENT,01021919.txt,1919,morning business
99,650348640,If there be no further conclurrent or other re...,1919-01-02,The VICE PRESIDENT,01021919.txt,1919,business is
99,650348640,If there be no further conclurrent or other re...,1919-01-02,The VICE PRESIDENT,01021919.txt,1919,is closed
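With one bigram per row, a natural next step toward the quantitative question posed at the start is to count how often each bigram occurs. The following is a minimal sketch using only the standard library; the **bigrams_per_speech** list is hypothetical toy data shaped like the **bigrams** column before it was exploded.

```python
from collections import Counter

# Hypothetical toy data: one list of bigrams per speech.
bigrams_per_speech = [
    ['Mr. President', 'President .', '. I', 'I ask'],
    ['Mr. President', 'President .', '. I', 'I think'],
    ['I present', 'present the'],
]

# Flatten the lists and tally every bigram across all speeches.
counts = Counter(bigram for speech in bigrams_per_speech for bigram in speech)

# The most frequent bigrams are candidate collocates to inspect by hand.
top_three = counts.most_common(3)
print(top_three)
```

On the exploded data frame itself, `congress['bigrams'].value_counts()` should produce the same kind of tally. Note that bigrams containing punctuation tokens, such as 'President .', are counted too, so some filtering may be needed before interpreting the results.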
