# The First Script
### This script processes the collected `lipad.pkl` dataframe into a plain-text corpus.
Writing an intermediate corpus reduces memory usage and splits the pipeline into separate stages. The second script explains this in more detail.

These are the choices that can be tuned in this script:

1. Choice of standardizing defunct/merged party names, described below
2. Choice of stopwords
3. Choice of stemming/lemmatization (none at the moment)
4. Choice of replacement words, e.g. abortion, immigration. This happens here rather than downstream because replacement depends on date and party information, which is discarded once the corpus is written.

**Note:** to generate the year groupings or the 3-year rolling average, skip to the relevant section at the bottom.

In [1]:
import re
import pandas as pd
import nltk
import numpy as np
import sys
from nltk.corpus import stopwords
import time
from decimal import Decimal
from collections import Counter
from tqdm import tqdm
from itertools import chain
# Very nice tool to show progress 
tqdm.pandas()



In [3]:
nltk.download('punkt')
nltk.download('stopwords')
tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')
stopwords = stopwords.words('english')

[nltk_data] Downloading package punkt to /Users/rui/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/rui/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
hansard = pd.read_pickle('../corpora/lipad.pkl') 
hansard = hansard[['speechdate', 'speechtext','speakerparty']] # Drop unnecessary columns; almost halves the size
stopwords_dict = Counter(stopwords) # A Counter is a dict, so membership tests are hash lookups rather than list scans
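
The efficiency remark above comes down to hash-based membership tests. A minimal sketch, using a hypothetical toy stopword list rather than nltk's, suggests that a `Counter` and a `frozenset` filter words identically, so either could back the `in` check:

```python
from collections import Counter

# Hypothetical toy stopword list; the notebook uses nltk's English list instead.
toy_stopwords = ["the", "of", "to", "and"]

as_counter = Counter(toy_stopwords)   # what the notebook does
as_set = frozenset(toy_stopwords)     # same membership semantics, no counts stored

words = "the right to choose and the right to life".split()

# Both containers answer `word in container` with a hash lookup.
kept_counter = [w for w in words if w not in as_counter]
kept_set = [w for w in words if w not in as_set]
```

Both lists come out the same, so the choice between the two containers is mostly about stating intent.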

This cleaning is opinionated.  Uncontroversially, it links:
1. Progressive Conservative and Conservative (1867-1942) parties to the Conservative party
2. Laurier Liberal to the Liberal party
3. Co-operative Commonwealth Federation (C.C.F.) to the NDP.  

More controversially, it treats the Progressive Conservative, Reform, and Canadian Alliance parties as a single unit in the 1990s, before they merged in 2003.  
It is worth examining the results when treating the Reform/Alliance parties together with, and separately from, the Progressive Conservative Party.  
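
One way to compare the two treatments is to make the controversial merge a switch. The sketch below is an assumption, not the notebook's code: `recode_party`, `BASE_MAP`, `REFORM_MAP`, and the `merge_reform` flag are hypothetical names, and only the party labels are taken from the recoding cell.

```python
# Hypothetical mapping tables; the party labels match those used in this notebook.
BASE_MAP = {
    "Progressive Conservative": "Conservative",
    "Conservative (1867-1942)": "Conservative",
    "Co-operative Commonwealth Federation (C.C.F.)": "NDP",
    "New Democratic Party": "NDP",
    "Laurier Liberal": "Liberal",
}
# The controversial part, kept separate so it can be switched off.
REFORM_MAP = {
    "Reform": "Conservative",
    "Canadian Alliance": "Conservative",
}

def recode_party(party, merge_reform=True):
    """Return the standardized label for one party name (assumed helper)."""
    mapping = {**BASE_MAP, **REFORM_MAP} if merge_reform else BASE_MAP
    # Unknown parties pass through unchanged.
    return mapping.get(party, party)

merged = recode_party("Reform")
separate = recode_party("Reform", merge_reform=False)
```

With `merge_reform=False`, Reform and Canadian Alliance keep their own labels. Note that a label containing a space, such as `Canadian Alliance`, would be split by a whitespace tokenizer once it is embedded in a replacement token, so a space-free label may be preferable in that case.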

In [5]:
def recodeParty(party):
    # Receives one party label at a time (a single cell, not a whole Series)
    if party in ['Progressive Conservative','Conservative (1867-1942)', 'Reform', 'Canadian Alliance']:
        return 'Conservative'
    elif party in ['Co-operative Commonwealth Federation (C.C.F.)', 'New Democratic Party']:
        return 'NDP'
    elif party == 'Laurier Liberal':
        return 'Liberal'
    else:
        return party

hansard['speakerparty'] = hansard['speakerparty'].apply(recodeParty)

In [6]:
# Choice of what to replace and what to replace it with 

replace_list = [r"abortion[s]?\b", r"preborns?\b", r"unborns?\b", r"fo?etus(es)?\b", 
                r"wom[ea]ns?\srights?\b", r"right\s(choose|life)\b", r"freedom\schoice\b",
                r"prolife\b", r"prochoice\b", r"womens?\shealth\b", r"reproductive\s(health|rights?)\b"]
replace_regex = re.compile("|".join(replace_list))
stem = 'abort'

def clean_replace(row, tokenizer, replace_regex, stem, remove_stopwords=True):
    sentences = []
    # change "./." to "." and drop the period from "hon." so the abbreviation is not mistaken for a sentence end
    paragraph = row.speechtext.replace("/.", '').replace("hon.", "hon")
    # running the paragraph through the tokenizer to split it into sentences intelligently
    raw_sentences = tokenizer.tokenize(paragraph.strip()) 
    for raw_sentence in raw_sentences:
    
        if len(raw_sentence) > 0:
            # get rid of any character that's not alphanumeric or whitespace.
            sentence_text = re.sub(r'[^\w\s]','', raw_sentence) 
            # split into lowercase words, filtering out stopwords only when remove_stopwords is set
            words = [word for word in sentence_text.lower().split()
                     if not remove_stopwords or word not in stopwords_dict]
            # join back together each sentence with spaces and add a newline 
            sentences.append(" ".join(words) + '\n')
            
    # join all the sentences in the paragraph together and then replace
    replace_sentence = replace_regex.sub("{}_{}_{}".format(stem,row.speechdate.year,row.speakerparty), "".join(sentences))
    return replace_sentence

df = hansard.copy() 
file_name = '../corpora/corpus_abort_1.txt'

# Same as a standard df.apply, but progress_apply has tqdm
df['speechtext'] = df.progress_apply(clean_replace, axis=1, args=(tokenizer, replace_regex, stem, True))

# instead of writing a sentence at a time to the file, concatenate everything and write it at once.
# str.join concatenates in a single pass and avoids the repeated copying of df.speechtext.sum()
with open(file_name, 'w') as corpus_file:
    corpus_file.write("".join(df.speechtext))

100%|██████████| 3559499/3559499 [19:57<00:00, 2971.99it/s]
100%|██████████| 3559499/3559499 [19:32<00:00, 3037.08it/s]
100%|██████████| 3559499/3559499 [20:19<00:00, 2919.74it/s]
100%|██████████| 3559499/3559499 [20:19<00:00, 2919.62it/s]
100%|██████████| 3559499/3559499 [20:16<00:00, 2926.42it/s]
100%|██████████| 3559499/3559499 [20:18<00:00, 2921.83it/s]
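
The replacement patterns are written against the cleaned text, not the raw speech: by the time they run, punctuation and stopwords are gone, so a phrase like "right to choose" has already become "right choose". The sketch below illustrates this on one made-up speech. It is a simplified stand-in, not the cell above: `toy_clean_replace`, `toy_stopwords`, and `toy_regex` are hypothetical, a regex split replaces the punkt tokenizer, and a `namedtuple` replaces the dataframe row.

```python
import re
from collections import namedtuple
from datetime import date

# Hypothetical stand-ins for a dataframe row, nltk's stopwords, and replace_regex.
Row = namedtuple("Row", ["speechtext", "speechdate", "speakerparty"])
toy_stopwords = {"the", "to", "is", "a", "on", "of", "we", "must"}
toy_regex = re.compile(r"abortions?\b|right\s(choose|life)\b|prolife\b")

def toy_clean_replace(row, stem="abort"):
    sentences = []
    # Naive sentence split on ., ! or ? (the notebook uses nltk's punkt instead).
    for raw in re.split(r"(?<=[.!?])\s+", row.speechtext.strip()):
        text = re.sub(r"[^\w\s]", "", raw)           # strip punctuation
        words = [w for w in text.lower().split() if w not in toy_stopwords]
        sentences.append(" ".join(words) + "\n")     # one sentence per line
    # Every matched phrase collapses to one token carrying year and party.
    token = "{}_{}_{}".format(stem, row.speechdate.year, row.speakerparty)
    return toy_regex.sub(token, "".join(sentences))

row = Row("The right to choose is protected. We must debate abortion today!",
          date(1988, 1, 28), "Liberal")
cleaned = toy_clean_replace(row)
print(cleaned)
# abort_1988_Liberal protected
# debate abort_1988_Liberal today
```

Both "right to choose" and "abortion" end up as the same `abort_1988_Liberal` token, which is why the year and party must still be available at this stage.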


### **The code below is for creating the other groupings**

In [2]:
file_name = '../corpora/corpus_abort_1.txt'
with open(file_name, 'r') as corpus_file:
    text_abort = corpus_file.read()

In [35]:
for group in range(2,7):
    text = text_abort
    file_name = '../corpora/corpus_abort_{}.txt'.format(group)
    for year in tqdm(range(1901, 2020)):
        # relabel each year with the start of its bucket (the year floored to a multiple of group)
        text = text.replace("abort_{}_".format(year), "abort_{}_".format(year // group * group))
    with open(file_name, 'w') as corpus_file:
        corpus_file.write(text)

100%|██████████| 119/119 [06:15<00:00,  3.16s/it]
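
The loop above makes one full pass over the corpus per year. A possible alternative, sketched here under assumptions, is a single `re.sub` pass with a callback that relabels every year token as it is found. `bucket_years` and `YEAR_TOKEN` are hypothetical names, and unlike the loop, this version relabels any four-digit year rather than only 1901 to 2019.

```python
import re

# Matches the year inside tokens such as abort_1904_Liberal (assumed token shape).
YEAR_TOKEN = re.compile(r"abort_(\d{4})_")

def bucket_years(text, group):
    """Relabel every year token with the start of its bucket, in one pass."""
    def relabel(match):
        year = int(match.group(1))
        # Floor the year to a multiple of group, as in the loop above.
        return "abort_{}_".format(year // group * group)
    return YEAR_TOKEN.sub(relabel, text)

sample = "abort_1901_Liberal spoke\nabort_1904_NDP replied\nabort_1907_Liberal\n"
grouped = bucket_years(sample, 5)
print(grouped)
# abort_1900_Liberal spoke
# abort_1900_NDP replied
# abort_1905_Liberal
```

Whether this is faster on the full corpus has not been measured here; it mainly avoids rescanning the text once per year.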


### **The code below is for creating the 3-year rolling average corpus**

In [4]:
hold = ''
for group in range(3,4):
    for window in range(group):
        text = text_abort
        file_name = '../corpora/corpus_abort_{}_rolling.txt'.format(group)
        for year in tqdm(range(1901, 2020)):
            # shift the bucket boundaries by `window`; for group = 3 the label is the bucket's middle year
            text = text.replace("abort_{}_".format(year), 
                                "abort_{}_".format((year - window) // group * group + 1 + window))
        
        hold = "".join([hold, text])
        
    with open(file_name, 'w') as corpus_file:
        corpus_file.write(hold)

100%|██████████| 119/119 [06:08<00:00,  3.09s/it]
100%|██████████| 119/119 [06:23<00:00,  3.23s/it]
100%|██████████| 119/119 [06:28<00:00,  3.26s/it]
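
The label arithmetic is easier to follow on a single year. The sketch below isolates it in a hypothetical helper, `rolling_label`, and checks two things for `group = 3`: each year receives three labels (itself and its two neighbours), and each label collects exactly three consecutive years. Only `group = 3` is exercised, as in the cell above.

```python
def rolling_label(year, window, group=3):
    # Same arithmetic as the cell above, written with integer division.
    return (year - window) // group * group + 1 + window

# Labels that 1903's speeches receive across the three shifted passes.
labels_1903 = sorted(rolling_label(1903, w) for w in range(3))

# Years whose speeches end up under the label 1903.
members_1903 = sorted({y for y in range(1901, 2020)
                       for w in range(3)
                       if rolling_label(y, w) == 1903})

print(labels_1903)   # [1902, 1903, 1904]
print(members_1903)  # [1902, 1903, 1904]
```

Because each pass writes a full relabelled copy of the corpus, the rolling file is three times the size of the single-year corpus.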


In [5]:
print(len(hold))
# locations = [m.start() for m in re.finditer(r'\babort_[0-9]{4}_', hold)]
# print(len(locations))
# for location in locations:
#     print(hold[location-12:location+12])

5955496182
