# Preprocessing

We preprocess the text scraped in [```1_scrape.py```](1_scrape.py), converting it into lists of cleaned tokens that our NLP models can use.

In [1]:
import pandas as pd

import sys
sys.path.append("..")
from util.preprocess import process_texts, find_ngrams

%load_ext autoreload
%autoreload 2


Let's load our data.

In [2]:
fic_df = pd.read_pickle("../data/avatar_fics_scraped.pickle")

Our preprocessing logic lives in [```preprocess.py```](../util/preprocess.py), which cleans, tokenizes, and lemmatizes the text, removes stopwords, and merges specified n-grams. Most of the process is automatic, but we do need to specify which n-grams to merge.

### Finding n-grams

Let's use ```find_ngrams``` to find the most common bigrams and trigrams in the corpus. We can examine those for ideas on what to merge.

In [4]:
# Keep "ten" out of the stopword list because "Lu Ten" is an important character in the series.
bigrams, trigrams = find_ngrams(fic_df["text"], stop_exceptions=["ten"])
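
For intuition, n-gram counting of this kind can be sketched with ```collections.Counter```. This is a minimal illustration under stated assumptions, not the actual ```find_ngrams``` implementation: the name ```count_ngrams```, the tiny stopword set, and the whitespace tokenizer are all hypothetical.

```python
from collections import Counter

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "of", "to", "and", "ten"}


def count_ngrams(texts, n, stop_exceptions=()):
    """Count n-grams of non-stopword tokens across an iterable of strings."""
    # Words listed in stop_exceptions are NOT treated as stopwords.
    stopwords = STOPWORDS - set(stop_exceptions)
    counts = Counter()
    for text in texts:
        # Naive tokenization: lowercase, split on whitespace, strip punctuation.
        tokens = [tok.strip(".,!?\"'") for tok in text.lower().split()]
        tokens = [tok for tok in tokens if tok and tok not in stopwords]
        # Slide a window of length n over the token list.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


sample = ["Lu Ten was the son of Iroh.", "Iroh mourned Lu Ten."]
bigram_counts = count_ngrams(sample, 2, stop_exceptions=["ten"])
print(bigram_counts.most_common(1))  # [(('lu', 'ten'), 2)]
```

Without the exception, "ten" would be dropped as a stopword before counting and the ```('lu', 'ten')``` bigram would never appear.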

In [5]:
bigrams.most_common()[0:400]

[(('fire', 'nation'), 10989),
 (('fire', 'lord'), 7622),
 (('ty', 'lee'), 6006),
 (('zuko', 'said'), 5180),
 (('water', 'tribe'), 3903),
 (('sokka', 'said'), 3483),
 (('earth', 'kingdom'), 3244),
 (('ba', 'sing'), 3060),
 (('sing', 'se'), 3041),
 (('zuko', 'says'), 2691),
 (('sokka', 'says'), 2119),
 (('katara', 'said'), 2119),
 (('gon', 'na'), 1710),
 (('lu', 'ten'), 1541),
 (('dai', 'li'), 1525),
 (('prince', 'zuko'), 1502),
 (('felt', 'like'), 1393),
 (('aang', 'said'), 1302),
 (('deep', 'breath'), 1296),
 (('zuko', 'looked'), 1284),
 (('long', 'time'), 1259),
 (('zuko', 'asked'), 1211),
 (('toph', 'said'), 1145),
 (('uncle', 'iroh'), 1055),
 (('southern', 'water'), 1054),
 (('feels', 'like'), 1052),
 (('feel', 'like'), 1039),
 (('sokka', 'asked'), 988),
 (('years', 'ago'), 972),
 (('old', 'man'), 955),
 (('blue', 'spirit'), 946),
 (('looks', 'like'), 918),
 (('said', 'zuko'), 915),
 (('looked', 'like'), 914),
 (('iroh', 'said'), 911),
 (('azula', 'said'), 908),
 (('zuko', 'looks'),

In [6]:
trigrams.most_common()[:100]

[(('ba', 'sing', 'se'), 3033),
 (('southern', 'water', 'tribe'), 1020),
 (('fire', 'lord', 'zuko'), 695),
 (('northern', 'water', 'tribe'), 508),
 (('fire', 'lord', 'ozai'), 433),
 (('western', 'air', 'temple'), 309),
 (('water', 'tribe', 'boy'), 282),
 (('fire', 'nation', 'soldiers'), 215),
 (('ty', 'lee', 'says'), 192),
 (('long', 'time', 'ago'), 178),
 (('ty', 'lee', 'said'), 171),
 (('fire', 'lord', 'azulon'), 158),
 (('fire', 'nation', 'prince'), 155),
 (('fire', 'lord', 'sozin'), 138),
 (('dai', 'li', 'agents'), 133),
 (('fire', 'nation', 'zuko'), 126),
 (('southern', 'air', 'temple'), 123),
 (('new', 'fire', 'lord'), 122),
 (('wan', 'shi', 'tong'), 114),
 (('dai', 'li', 'agent'), 113),
 (('zuko', 'said', 'quietly'), 112),
 (('fire', 'nation', 'capital'), 110),
 (('yeah', 'zuko', 'said'), 104),
 (('water', 'tribe', 'girl'), 103),
 (('fire', 'nation', 'royal'), 103),
 (('fire', 'nation', 'ship'), 101),
 (('ember', 'island', 'players'), 98),
 (('fire', 'nation', 'attacked'), 98),
 

Based on the counts above and on domain knowledge, we decide which bigrams and trigrams to merge. We also define exceptions that correct certain lemmatization errors.

In [12]:
avatar_bigrams = [
    "fire lord", "fire nation", "lu ten", "water tribe", "ty lee", "earth kingdom",
    "agni kai", "blue spirit", "air temple", "boiling rock", "chit sang", "ember island",
    "dai li", "pai sho", "jasmine dragon", "long feng", "north pole", "south pole",
    "spirit world", "white lotus", "gran gran", "si wong", "air nomad",
]
avatar_trigrams = ["ba sing se"]

# The empty string covers the bare forms ("benders", "bending", ...).
elements = ["", "water", "earth", "fire", "air"]

# unigram exceptions
u_exc_1 = {f"{element}benders":f"{element}bender" for element in elements}
u_exc_2 = {f"{element}bends":f"{element}bend" for element in elements}
u_exc_3 = {f"{element}bending":f"{element}bend" for element in elements}
u_exc_4 = {f"{element}bended": f"{element}bend" for element in elements}
u_exc_5 = {"soulmates": "soulmate", "firelord": "fire lord", "firelords": "fire lord", "grangran": "gran gran"}
u_exceptions = {**u_exc_1, **u_exc_2, **u_exc_3, **u_exc_4, **u_exc_5}

# bigram exceptions
b_exceptions = {"soul mate": "soulmate", "soul mates": "soulmate", "gon na": "gonna", "wan na": "wanna", "agni kais": "agni kai"}

avatar_exceptions = {**u_exceptions, **b_exceptions}

avatar_ngrams = {"bigrams": avatar_bigrams, "trigrams": avatar_trigrams, "exceptions": avatar_exceptions}
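
To make the role of these three entries concrete, here is a rough sketch of how merging and exception handling could work on a token list. It is not the code in our preprocessing module: the function ```merge_ngrams``` is hypothetical, and it assumes that multi-word exception keys such as "gon na" are merged like any other bigram before being replaced.

```python
def merge_ngrams(tokens, bigrams, trigrams, exceptions):
    """Merge listed n-grams into single tokens, then apply exception fixes."""
    trigram_set = {tuple(t.split()) for t in trigrams}
    bigram_set = {tuple(b.split()) for b in bigrams}
    # Assumption: two-word exception keys also need merging before lookup.
    bigram_set |= {tuple(k.split()) for k in exceptions if len(k.split()) == 2}

    merged, i = [], 0
    while i < len(tokens):
        # Try the longest match first so a trigram wins over any bigram.
        if tuple(tokens[i:i + 3]) in trigram_set:
            merged.append(" ".join(tokens[i:i + 3]))
            i += 3
        elif tuple(tokens[i:i + 2]) in bigram_set:
            merged.append(" ".join(tokens[i:i + 2]))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1

    # Replace any token that has an exception entry; keep the rest as-is.
    return [exceptions.get(tok, tok) for tok in merged]


tokens = ["fire", "lord", "zuko", "gon", "na", "visit",
          "ba", "sing", "se", "firebenders"]
merged_example = merge_ngrams(
    tokens,
    bigrams=["fire lord"],
    trigrams=["ba sing se"],
    exceptions={"gon na": "gonna", "firebenders": "firebender"},
)
print(merged_example)
# ['fire lord', 'zuko', 'gonna', 'visit', 'ba sing se', 'firebender']
```

Checking trigrams before bigrams is a design choice in this sketch: it keeps a listed trigram from being split by a bigram that overlaps it.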


The bigram "Lu Ten," the name of a character in Avatar, needs special handling. "ten" is in our default stopword list, so it would be removed before the bigram could be merged. We therefore exempt "ten" from stopword removal and then, after merging bigrams, remove every remaining instance of "ten" that wasn't merged into a bigram.

The same approach extends to other stopwords that appear in important n-grams.

In [14]:
avatar_stop_exceptions = ["ten"]
avatar_stop_after = ["ten"]
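
The three-step order described above (exempt, merge, then clean up) can be sketched as follows. This is an illustration only: the function ```filter_and_merge``` and the small ```DEFAULT_STOPWORDS``` set are hypothetical stand-ins, not the real internals of ```process_texts```.

```python
# Hypothetical default stopword list, for illustration only.
DEFAULT_STOPWORDS = {"the", "was", "of", "ten"}


def filter_and_merge(tokens, bigrams, stop_exceptions, stop_after):
    """Drop stopwords, merge bigrams, then drop leftover exempted stopwords."""
    # Step 1: remove stopwords, except the exempted ones ("ten" survives).
    active_stopwords = DEFAULT_STOPWORDS - set(stop_exceptions)
    tokens = [tok for tok in tokens if tok not in active_stopwords]

    # Step 2: merge adjacent tokens that form a listed bigram.
    bigram_set = {tuple(b.split()) for b in bigrams}
    merged, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) in bigram_set:
            merged.append(" ".join(tokens[i:i + 2]))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1

    # Step 3: remove exempted stopwords that were not absorbed into a bigram.
    leftover = set(stop_after)
    return [tok for tok in merged if tok not in leftover]


tokens = ["lu", "ten", "was", "one", "of", "ten", "soldiers"]
result = filter_and_merge(tokens, ["lu ten"], ["ten"], ["ten"])
print(result)  # ['lu ten', 'one', 'soldiers']
```

The standalone "ten" in "one of ten soldiers" is removed in step 3, while the "ten" in "lu ten" survives because it is already part of a merged token.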

### Example Process

Let's look at an example of how the text is processed.

In [44]:
fic_df.iloc[5]

work_id                                                   27089068
rating                                                        teen
lang                                                       English
words                                                         8570
chapters                                                         1
date                                                   18 Oct 2020
series                                                          {}
author                                                     aeoleus
all_authors                                              [aeoleus]
title                                     Icarus, Point to the Sun
text               Zuko tries.    Agni, he tries. Uncle wants s...
relationships    [Iroh & Zuko (Avatar), The Gaang & Zuko (Avatar)]
chars            [Zuko (Avatar), Iroh (Avatar), Toph Beifong, K...
tags             [yeah its whump and what about it, Hurt Zuko (...
Name: 5, dtype: object

In [45]:
text_example = fic_df.iloc[5]["text"][:505]
text_example

'  Zuko tries.    Agni, he tries. Uncle wants so desperately for him to do the right thing, and he wants to, he does, he does, he does-  So when the Water Tribe girl looks at him with wild eyes, begging for help, and Azula raises one eyebrow with a sneer, Zuko chooses.    Azula’s scream of anger when he blasts a fireball in her direction is almost worth the twenty Dai Li agents that immediately surround him. The fight isn’t long, not for him.    “Go!” He shouts at the Water Tribe girl, who’s clutching'

In [48]:
processed_example = process_texts([text_example], ngrams=avatar_ngrams, stop_exceptions=avatar_stop_exceptions, stop_after=avatar_stop_after)
print(processed_example)

[['zuko', 'try', 'agni', 'try', 'uncle', 'want', 'desperately', 'right', 'thing', 'want', 'water tribe', 'girl', 'look', 'wild', 'eye', 'beg', 'help', 'azula', 'raise', 'eyebrow', 'sneer', 'zuko', 'choose', 'azula', 'scream', 'anger', 'blast', 'fireball', 'direction', 'worth', 'dai li', 'agent', 'immediately', 'surround', 'fight', 'long', 'shout', 'water tribe', 'girl', 'clutch']]


### Final Processing

We run every text through the preprocessing sequence, store the result in a new ```processed``` column of our dataframe, and pickle the dataframe.

In [10]:
processed_texts = process_texts(fic_df["text"], ngrams=avatar_ngrams, stop_exceptions=avatar_stop_exceptions, stop_after=avatar_stop_after)
fic_df["processed"] = processed_texts

In [12]:
fic_df.to_pickle("../data/avatar_fics_processed.pickle")