# Preprocessing

We preprocess the text we scraped in [```1_scrape.py```](1_scrape.py), converting it into a format usable for our NLP modeling.

In [164]:
import pandas as pd

import sys
sys.path.append("..")
from util.preprocess import process_texts, find_ngrams

%load_ext autoreload
%autoreload 2


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Let's load our data.

In [166]:
# Named `df` because every later cell refers to the dataframe by that name.
df = pd.read_pickle("../avatar_fics_scraped.pickle")

Our preprocessing logic is contained in [```preprocessing.py```](../util/preprocessing.py), which cleans, tokenizes, and lemmatizes the text, removes stopwords, and merges specified n-grams. Most of the process is automatic, but we need to specify which n-grams to merge.

### Finding n-grams

Let's use ```find_ngrams``` to find the most common bigrams and trigrams in the corpus. We can examine those for ideas on what to merge.
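The internals of ```find_ngrams``` are not shown in this notebook. As a minimal sketch of the general idea, assuming the text has already been tokenized into lists of lowercase words, n-gram counts can be collected with a `Counter`. The helper name `count_ngrams` and the toy documents below are hypothetical and are not part of `util`.

```python
from collections import Counter


def count_ngrams(token_lists, n):
    """Count every run of n consecutive tokens across all documents."""
    counts = Counter()
    for tokens in token_lists:
        # Zip n staggered views of the token list to form the n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Toy, pre-tokenized documents (illustrative only).
docs = [
    ["fire", "nation", "soldiers", "fire", "nation"],
    ["ba", "sing", "se", "fire", "nation"],
]

toy_bigrams = count_ngrams(docs, 2)
toy_trigrams = count_ngrams(docs, 3)

print(toy_bigrams.most_common(1))  # [(('fire', 'nation'), 3)]
```

N-grams are counted within each document, so no n-gram spans the boundary between two texts.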

In [167]:
# Keep "ten" out of the stopword list because "Lu Ten" is an important character in the series.
bigrams, trigrams = find_ngrams(df["text"], stop_exceptions=["ten"])

In [168]:
bigrams.most_common()[0:400]

[(('fire', 'nation'), 3326),
 (('fire', 'lord'), 2835),
 (('zuko', 'says'), 1503),
 (('sokka', 'says'), 1235),
 (('water', 'tribe'), 1177),
 (('ty', 'lee'), 1047),
 (('zuko', 'said'), 1006),
 (('earth', 'kingdom'), 963),
 (('lu', 'ten'), 901),
 (('sokka', 'said'), 795),
 (('prince', 'zuko'), 732),
 (('ba', 'sing'), 721),
 (('sing', 'se'), 717),
 (('lord', 'zuko'), 632),
 (('gon', 'na'), 581),
 (('feels', 'like'), 475),
 (('zuko', 'looks'), 443),
 (('long', 'time'), 386),
 (('looks', 'like'), 379),
 (('little', 'bit'), 378),
 (('uncle', 'iroh'), 358),
 (('southern', 'water'), 351),
 (('zuko', 'looked'), 342),
 (('felt', 'like'), 340),
 (('deep', 'breath'), 336),
 (('agni', 'kai'), 334),
 (('old', 'man'), 332),
 (('feel', 'like'), 303),
 (('look', 'like'), 294),
 (('zuko', 'zuko'), 290),
 (('years', 'ago'), 287),
 (('time', 'zuko'), 287),
 (('blue', 'spirit'), 283),
 (('zuko', 'knows'), 282),
 (('zuko', 'sokka'), 273),
 (('zuko', 'asks'), 269),
 (('aang', 'says'), 261),
 (('sokka', 'asks

In [169]:
trigrams.most_common()[:100]

[(('ba', 'sing', 'se'), 715),
 (('fire', 'lord', 'zuko'), 520),
 (('southern', 'water', 'tribe'), 338),
 (('western', 'air', 'temple'), 135),
 (('northern', 'water', 'tribe'), 128),
 (('fire', 'lord', 'ozai'), 99),
 (('fire', 'nation', 'soldiers'), 87),
 (('x', 'x', 'x'), 76),
 (('fire', 'lord', 'azulon'), 74),
 (('water', 'tribe', 'boy'), 72),
 (('long', 'time', 'ago'), 66),
 (('fire', 'nation', 'prince'), 66),
 (('ty', 'lee', 'says'), 58),
 (('new', 'fire', 'lord'), 48),
 (('fire', 'nation', 'zuko'), 45),
 (('yeah', 'zuko', 'says'), 44),
 (('ember', 'island', 'players'), 41),
 (('zuko', 'feels', 'like'), 41),
 (('water', 'tribe', 'siblings'), 38),
 (('okay', 'zuko', 'says'), 36),
 (('fire', 'lord', 'iroh'), 35),
 (('fire', 'nation', 'ship'), 34),
 (('zuko', 'says', 'quietly'), 34),
 (('sokka', 'says', 'zuko'), 33),
 (('water', 'tribe', 'warrior'), 33),
 (('yeah', 'sokka', 'says'), 32),
 (('zuko', 'sokka', 'says'), 32),
 (('fire', 'nation', 'royal'), 31),
 (('sokka', 'feels', 'like'),

Based on the counters above and on domain knowledge, we choose the bigrams and trigrams to merge. We also define exceptions that correct certain lemmatization errors.

In [170]:
avatar_bigrams = [
    "fire lord", "fire nation", "lu ten", "water tribe", "ty lee",
    "earth kingdom", "agni kai", "blue spirit", "air temple", "boiling rock",
    "chit sang", "ember island", "dai li", "pai sho", "jasmine dragon",
    "long feng", "north pole", "south pole", "spirit world", "white lotus",
    "gran gran", "si wong", "air nomad",
]
avatar_trigrams = ["ba sing se"]

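# The empty string covers the bare forms ("benders", "bending", ...) alongside the element-prefixed ones.
elements = ["", "water", "earth", "fire", "air"]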

# unigram exceptions
u_exc_1 = {f"{element}benders":f"{element}bender" for element in elements}
u_exc_2 = {f"{element}bends":f"{element}bend" for element in elements}
u_exc_3 = {f"{element}bending":f"{element}bend" for element in elements}
u_exc_4 = {f"{element}bended": f"{element}bend" for element in elements}
u_exc_5 = {"soulmates": "soulmate", "firelord": "fire lord", "firelords": "fire lord", "grangran": "gran gran"}
u_exceptions = {**u_exc_1, **u_exc_2, **u_exc_3, **u_exc_4, **u_exc_5}

# bigram exceptions
b_exceptions = {"soul mate": "soulmate", "soul mates": "soulmate", "gon na": "gonna", "wan na": "wanna", "agni kais": "agni kai"}

avatar_exceptions = {**u_exceptions, **b_exceptions}

avatar_ngrams = {"bigrams": avatar_bigrams, "trigrams": avatar_trigrams, "exceptions": avatar_exceptions}
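
The notebook does not show how `process_texts` applies the exceptions dictionary. As a rough illustration only, a mapping of this shape could be applied to a token list by checking adjacent pairs first and single tokens second. The helper `apply_exceptions` and the small `toy_exceptions` dictionary are hypothetical, and splitting multi-word replacements back into separate tokens is an assumption of this sketch.

```python
def apply_exceptions(tokens, exceptions):
    """Replace tokens and adjacent token pairs that appear as keys in `exceptions`."""
    out = []
    i = 0
    while i < len(tokens):
        pair = " ".join(tokens[i:i + 2])
        if i + 1 < len(tokens) and pair in exceptions:
            # Bigram exception, e.g. "gon na" -> "gonna".
            out.extend(exceptions[pair].split())
            i += 2
        else:
            # Unigram exception, or the token itself if no rule matches.
            out.extend(exceptions.get(tokens[i], tokens[i]).split())
            i += 1
    return out


# A few entries in the same style as the exceptions defined in the cell above.
toy_exceptions = {
    "firebenders": "firebender",
    "firelord": "fire lord",
    "gon na": "gonna",
}

fixed = apply_exceptions(["firelord", "gon", "na", "fight", "firebenders"], toy_exceptions)
print(fixed)  # ['fire', 'lord', 'gonna', 'fight', 'firebender']
```

In this sketch "firelord" becomes the two tokens "fire" and "lord", which a later bigram-merging step could then join.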


One special case is the bigram "Lu Ten," the name of a character in Avatar. Because "ten" is in our default stopword list, it would be dropped before "lu ten" could be merged, so we exempt it from stopword removal. Then, after merging bigrams, we remove every remaining "ten" that was not merged into a bigram.

The same mechanism can be extended to other stopwords that appear in important n-grams.
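
The ordering described above can be sketched on a toy example. This is an illustration under stated assumptions and not the code in `util`: the helpers `merge_bigrams` and `toy_pipeline`, the two-word stopword set, and the choice to join merged bigrams with a space are all hypothetical.

```python
def merge_bigrams(tokens, bigrams):
    """Join adjacent tokens that form one of the listed bigrams."""
    wanted = {tuple(b.split()) for b in bigrams}
    out, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) in wanted:
            out.append(" ".join(tokens[i:i + 2]))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def toy_pipeline(tokens, stopwords, bigrams, stop_exceptions, stop_after):
    # 1. Remove stopwords, sparing the exceptions so they can still be merged.
    active = set(stopwords) - set(stop_exceptions)
    kept = [t for t in tokens if t not in active]
    # 2. Merge the specified bigrams.
    merged = merge_bigrams(kept, bigrams)
    # 3. Remove the spared stopwords that did not end up inside a bigram.
    leftover = set(stop_after)
    return [t for t in merged if t not in leftover]


tokens = ["lu", "ten", "waited", "for", "ten", "days"]
result = toy_pipeline(
    tokens,
    stopwords={"for", "ten"},
    bigrams=["lu ten"],
    stop_exceptions=["ten"],
    stop_after=["ten"],
)
print(result)  # ['lu ten', 'waited', 'days']
```

The "ten" in "lu ten" survives because it is merged before the final filter runs, while the standalone "ten" is removed.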

In [171]:
avatar_stop_exceptions = ["ten"]
avatar_stop_after = ["ten"]

### Final Processing

We run our text through the preprocessing sequence, add it to our dataframe, and pickle the result.

In [172]:
processed_texts = process_texts(
    df["text"],
    ngrams=avatar_ngrams,
    stop_exceptions=avatar_stop_exceptions,
    stop_after=avatar_stop_after,
)
df["processed"] = processed_texts

In [176]:
# Uncomment to write the processed dataframe to disk.
# df.to_pickle("../avatar_fics_processed.pickle")