<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1.-Introduction" data-toc-modified-id="1.-Introduction-1">1. Introduction</a></span></li><li><span><a href="#2.-Create-Stemmed-Text-" data-toc-modified-id="2.-Create-Stemmed-Text--2">2. Create Stemmed Text </a></span></li><li><span><a href="#3.-Create-Lemmatized-Text-" data-toc-modified-id="3.-Create-Lemmatized-Text--3">3. Create Lemmatized Text </a></span></li><li><span><a href="#3.-Calculate-Class-Weights-" data-toc-modified-id="3.-Calculate-Class-Weights--4">3. Calculate Class Weights </a></span></li><li><span><a href="#4.-Split-Features/Labels-and-Binarize-Labels-" data-toc-modified-id="4.-Split-Features/Labels-and-Binarize-Labels--5">4. Split Features/Labels and Binarize Labels </a></span></li><li><span><a href="#5.-Define-and-Fit-Tokenizers-" data-toc-modified-id="5.-Define-and-Fit-Tokenizers--6">5. Define and Fit Tokenizers </a></span></li></ul></div>

In [303]:
# Styling notebook
from IPython.display import HTML
def css_styling():
    # Read the custom stylesheet and close the file handle when done
    with open("../styles/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)
css_styling()

In [1]:
import sys
sys.path.append('../')
import numpy as np
from numpy import array
import os
import pandas as pd
import pickle
from tqdm.auto import tqdm
from src.pipeline_helpers import get_proportions
from src.clean_data import normalize_text
from sklearn.utils import class_weight
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import (Input, Dense, Embedding, Flatten, Activation, LeakyReLU,
                                     Bidirectional, LSTM, BatchNormalization,
                                     GlobalAveragePooling1D, GlobalMaxPooling1D)
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping

<h1 id="1.1-Introduction">1. Introduction</h1>

<div class="description">
In this notebook we will clean and normalize our data to prepare our features and labels for use in multi-class classification NLP models. 

<br>
    
First we will use the normalize_text function defined in our clean_data module, which uses NLTK and spaCy for normalization. The function lowercases the text, removes numbers, removes stopwords, and then creates a stemmed or lemmatized version of the text. We will save both versions here to compare their performance later.
<br>
    
Then we will calculate and save the class weights for each version of the text so that we can pass them to the class_weight parameter when fitting our TensorFlow models, although we will also explore other options for handling imbalanced data.
<br>
    
The next step is splitting the dataset into its features (complaint_description) and labels (assigned_division), and converting the labels into a binarized ndarray using scikit-learn's LabelBinarizer.
<br>
    
Next we create and save TensorFlow Tokenizers fit on the stemmed and lemmatized text so we can use them later to transform new data for predictions. We also use them to transform the stemmed and lemmatized features into padded integer sequences, and finally we save the tokenized features and binarized labels as .npy files in our /data/ directory.
    
</div>

<div class="description">

First we will load and briefly review our preprocessed data. We want to:
    <ul>
        <li>ensure there are no duplicates or N/A values (these were removed in preprocessing)</li>
        <li>check the data's shape</li>
        <li>view the size of each class in the label (classes with fewer than 1000 instances were also removed in preprocessing)</li>
    </ul>
</div>
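
<div class="description">
The three checks listed above can be bundled into one small helper. This is only a sketch: the name review_frame and the toy DataFrame are assumptions made for illustration and are not part of the project's modules.
</div>

```python
import pandas as pd

def review_frame(frame, label_col):
    """Return null counts, duplicate-row count, shape, and per-class sizes."""
    return {
        "nulls": frame.isna().sum().to_dict(),
        "duplicates": int(frame.duplicated().sum()),
        "shape": frame.shape,
        "class_sizes": frame[label_col].value_counts().to_dict(),
    }

# Toy data standing in for preprocessed.csv (values are illustrative only)
toy = pd.DataFrame({
    "complaint_description": ["no heat", "roof leak", "no heat", "exposed wiring"],
    "assigned_division": ["Housing", "Building", "Housing", "Electrical"],
})
report = review_frame(toy, "assigned_division")
print(report)
```

<div class="description">
On the real data the call would look like review_frame(df, "assigned_division").
</div>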


In [2]:
working_dir = os.getcwd()
data_path = os.path.dirname(working_dir) + '/data/'
df = pd.read_csv(data_path + 'preprocessed.csv')

In [3]:
df.head(10)

Unnamed: 0,complaint_description,assigned_division
0,Date last observed: 29-jun-20; time last ob...,Housing Inspection Services
1,Unpermitted interior framing at 3rd level atti...,Electrical Inspection Division
2,Elevator (in the parking area )to condos is in...,Housing Inspection Services
3,Complainant is concerned about the lenght of t...,Building Inspection Division
4,Date last observed: 06-jan-21; time last ob...,Building Inspection Division
5,1921-1925 mason and 28-32 valparasio --- there...,Building Inspection Division
6,Electrical work at rooftop without permits.,Electrical Inspection Division
7,Needs to renew boiler permit for permit no 107...,Plumbing Inspection Division
8,Date last observed: 31-jul-20; time last ob...,Housing Inspection Services
9,Plumbing and windows,Housing Inspection Services


In [15]:
#ensure there are no null values or duplicates
print(f"Null: \n {df.isna().sum()} \n")
print(f"Duplicates:  {df.duplicated().sum()}")

Null: 
 complaint_description    0
assigned_division        0
dtype: int64 

Duplicates:  0


In [16]:
df.shape

(183607, 2)

In [17]:
df.assigned_division.value_counts()

Housing Inspection Services       86793
Building Inspection Division      63267
Plumbing Inspection Division      16241
Code Enforcement Section          10680
Electrical Inspection Division     5515
Disabled Access Division           1111
Name: assigned_division, dtype: int64

<h1>2. Create Stemmed Text </h1>

<div class="description">
First we will prepare a stemmed version of our text using the normalize_text function in our clean_data module. 
This function will lowercase the text, remove numbers (by default), remove stopwords, and then return a stemmed version of the text created using NLTK's PorterStemmer.
</div>
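
<div class="description">
The implementation of normalize_text is not shown in this notebook, so the sketch below is only an approximation of the steps described above. The function name normalize_sketch, the regular expression, and the tiny hard-coded stopword set are all assumptions; the real function presumably uses a full stopword list.
</div>

```python
import re
from nltk.stem import PorterStemmer

# Tiny stand-in stopword set (assumption); the real module likely uses a full list
STOPWORDS = {"in", "the", "to", "is", "a", "of", "and"}
stemmer = PorterStemmer()

def normalize_sketch(text):
    """Lowercase, drop non-letters (including digits), remove stopwords, stem."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return [stemmer.stem(t) for t in tokens]

stems = normalize_sketch("Elevator (in the parking area) to 2 condos is inoperable")
print(stems)
```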

In [18]:
#create stemmed text
stemmed_path = data_path + "stemmed/"   # data_path already ends with '/'
tqdm.pandas()
stemmed = df.copy()
stemmed['complaint_description'] = stemmed['complaint_description'].progress_apply(normalize_text)
stemmed.to_csv(stemmed_path + 'stemmed_text.csv')


<h1>3. Create Lemmatized Text </h1>

<div class="description">
Next we will create a lemmatized version of the text by passing lemmatize=True to our normalize_text function. This performs the same steps as the stemming pass, except that it uses spaCy's lemmatizer instead of a stemmer.
</div>

In [24]:
#create lemmatized text
lemmatized_path = data_path + "lemmatized/"   # data_path already ends with '/'
lemmatized = df.copy()
lemmatized['complaint_description'] = lemmatized['complaint_description'] \
                                      .progress_apply(lambda x: normalize_text(x,
                                                                lemmatize = True))
lemmatized.to_csv(lemmatized_path + 'lemmatized_text.csv')


In [264]:
#load the saved data and ensure it looks correct
lemmatized = pd.read_csv(lemmatized_path + 'lemmatized_text.csv',index_col=[0])
stemmed = pd.read_csv(stemmed_path + 'stemmed_text.csv',index_col=[0])

In [265]:
lemmatized.head()

Unnamed: 0,complaint_description,assigned_division
0,"['date', 'last', 'observe', 'jun', 'time', 'la...",Housing Inspection Services
1,"['unpermitte', 'interior', 'frame', 'rd', 'lev...",Electrical Inspection Division
2,"['elevator', 'parking', 'area', 'condo', 'inop...",Housing Inspection Services
3,"['complainant', 'concerned', 'lenght', 'time',...",Building Inspection Division
4,"['date', 'last', 'observe', 'jan', 'time', 'la...",Building Inspection Division


In [266]:
lemmatized.shape

(183607, 2)

In [267]:
print(lemmatized.isna().sum())
print("Duplicated :", lemmatized.duplicated().sum())

complaint_description    0
assigned_division        0
dtype: int64
Duplicated : 15284


In [268]:
stemmed.head()

Unnamed: 0,complaint_description,assigned_division
0,"['date', 'last', 'observ', 'jun', 'time', 'las...",Housing Inspection Services
1,"['unpermit', 'interior', 'frame', 'rd', 'level...",Electrical Inspection Division
2,"['elev', 'park', 'area', 'condo', 'inoper']",Housing Inspection Services
3,"['complain', 'concern', 'lenght', 'time', 'tak...",Building Inspection Division
4,"['date', 'last', 'observ', 'jan', 'time', 'las...",Building Inspection Division


In [269]:
stemmed.shape

(183607, 2)

In [270]:
print(stemmed.isna().sum())
print("Duplicated :", stemmed.duplicated().sum())

complaint_description    0
assigned_division        0
dtype: int64
Duplicated : 15436


<div class="description">
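After stemming and lemmatizing the text we can see that roughly 15,000 rows in each set are now duplicates (15,436 in the stemmed set and 15,284 in the lemmatized set), even though the raw text had none. Descriptions that differed only in details removed by normalization, such as casing, numbers, stopwords, or word endings, now map to the same token list. We will drop them before proceeding. 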
</div>
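
<div class="description">
A small illustration of how normalization can create duplicates. The normalizer here is a toy (lowercasing and digit removal only) and the rows are made up; it is not the project's normalize_text.
</div>

```python
import re
import pandas as pd

def strip_case_and_digits(text):
    """Toy normalizer: lowercase, drop digits, collapse whitespace."""
    text = re.sub(r"\d+", " ", text.lower())
    return " ".join(text.split())

raw = pd.DataFrame({
    "complaint_description": ["No heat in unit 4", "no heat in unit 12", "Roof leak"],
    "assigned_division": ["Housing", "Housing", "Building"],
})
normalized = raw.copy()
normalized["complaint_description"] = normalized["complaint_description"].map(strip_case_and_digits)

print(raw.duplicated().sum())         # duplicate rows before normalization
print(normalized.duplicated().sum())  # duplicate rows after normalization
deduped = normalized.drop_duplicates()
```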

In [271]:
lemmatized.drop_duplicates(inplace = True)
stemmed.drop_duplicates(inplace = True)

In [272]:
stemmed.shape

(168171, 2)

In [273]:
lemmatized.shape

(168323, 2)

In [274]:
print(f'Duplicates in stemmed text: {stemmed.duplicated().sum()}, duplicates in lemmatized text: {lemmatized.duplicated().sum()}.')

Duplicates in stemmed text: 0, duplicates in lemmatized text: 0.


In [275]:
lemmatized.assigned_division.value_counts()

Housing Inspection Services       81498
Building Inspection Division      60955
Plumbing Inspection Division      10631
Code Enforcement Section           8737
Electrical Inspection Division     5417
Disabled Access Division           1085
Name: assigned_division, dtype: int64

In [276]:
stemmed.assigned_division.value_counts()

Housing Inspection Services       81408
Building Inspection Division      60952
Plumbing Inspection Division      10592
Code Enforcement Section           8726
Electrical Inspection Division     5412
Disabled Access Division           1081
Name: assigned_division, dtype: int64

<h1>3. Calculate Class Weights </h1>

In [277]:
classes_stemmed = stemmed.assigned_division
#np.unique returns the class names sorted, which matches the column order LabelBinarizer uses below
stemmed_class_names = np.unique(classes_stemmed)
stemmed_class_weights = class_weight.compute_class_weight('balanced',
                                                          classes = stemmed_class_names,
                                                          y = classes_stemmed)
stemmed_class_weights = dict(enumerate(stemmed_class_weights))

In [278]:
print(stemmed_class_weights)

{0: 0.459845452159076, 1: 3.212067384826954, 2: 25.928307123034227, 3: 5.178954175905395, 4: 0.3442966293238994, 5: 2.646195241691843}


In [279]:
classes_lemmatized = lemmatized.assigned_division
#np.unique returns the class names sorted, which matches the column order LabelBinarizer uses below
lemmatized_class_names = np.unique(classes_lemmatized)
lemmatized_class_weights = class_weight.compute_class_weight('balanced',
                                                             classes = lemmatized_class_names,
                                                             y = classes_lemmatized)
lemmatized_class_weights = dict(enumerate(lemmatized_class_weights))

In [280]:
print(lemmatized_class_weights)

{0: 0.4602384272550789, 1: 3.2109228949677617, 2: 25.856067588325654, 3: 5.178850532274937, 4: 0.34422726120068387, 5: 2.638870598563948}


In [281]:
#save class weights for later use
with open(stemmed_path + 'stemmed_class_weights.pickle', 'wb') as f:
    pickle.dump(stemmed_class_weights, f)
    
with open(lemmatized_path + 'lemmatized_class_weights.pickle', 'wb') as f:
    pickle.dump(lemmatized_class_weights, f)
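
<div class="description">
For reference, the 'balanced' setting weights each class by n_samples / (n_classes * class_size). The sketch below recomputes the stemmed weights by hand from the class sizes printed earlier; the variable names are illustrative.
</div>

```python
# Class sizes copied from the stemmed value_counts() output above
counts = {
    "Housing Inspection Services": 81408,
    "Building Inspection Division": 60952,
    "Plumbing Inspection Division": 10592,
    "Code Enforcement Section": 8726,
    "Electrical Inspection Division": 5412,
    "Disabled Access Division": 1081,
}
n_samples = sum(counts.values())
n_classes = len(counts)

# balanced weight = n_samples / (n_classes * class_size)
balanced = {name: n_samples / (n_classes * size) for name, size in counts.items()}
for name in sorted(balanced):
    print(f"{name}: {balanced[name]:.4f}")
```

<div class="description">
The rarest class (Disabled Access Division) receives the largest weight and the most common class (Housing Inspection Services) the smallest.
</div>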

<h1>4. Split Features/Labels and Binarize Labels </h1>

In [282]:
encoder = LabelBinarizer()

X_stemmed = stemmed.complaint_description
y_stemmed = encoder.fit_transform(classes_stemmed)

X_lemmatized = lemmatized.complaint_description
y_lemmatized = encoder.fit_transform(lemmatized.assigned_division)
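
<div class="description">
LabelBinarizer orders its output columns by the sorted class names, which it exposes as classes_. Any per-class dictionary keyed by column index, such as class weights, has to follow that same order. A minimal sketch with toy labels (the shortened class names are placeholders):
</div>

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

labels = np.array(["Housing", "Electrical", "Housing", "Building"])  # toy labels
encoder = LabelBinarizer()
y = encoder.fit_transform(labels)

print(encoder.classes_)   # sorted class names = column order of y
column_of = {name: i for i, name in enumerate(encoder.classes_)}
recovered = encoder.inverse_transform(y)   # map one-hot rows back to names
```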

In [283]:
X_stemmed.shape

(168171,)

In [284]:
y_stemmed.shape

(168171, 6)

In [285]:
X_lemmatized.shape

(168323,)

In [286]:
y_lemmatized.shape

(168323, 6)

In [287]:
rand_indices = np.random.randint(1,100,(10))
for x in rand_indices:
    #index into y_lemmatized so each label is compared with its own binarized row
    print(f'Original Label: {lemmatized["assigned_division"].iloc[x]} -- Binarized Label: {y_lemmatized[x]}.')

Original Label: Building Inspection Division -- Binarized Label: [1 0 0 0 0 0].
Original Label: Building Inspection Division -- Binarized Label: [1 0 0 0 0 0].
Original Label: Code Enforcement Section -- Binarized Label: [0 1 0 0 0 0].
Original Label: Building Inspection Division -- Binarized Label: [1 0 0 0 0 0].
Original Label: Building Inspection Division -- Binarized Label: [1 0 0 0 0 0].
Original Label: Code Enforcement Section -- Binarized Label: [0 1 0 0 0 0].
Original Label: Housing Inspection Services -- Binarized Label: [0 0 0 0 1 0].
Original Label: Housing Inspection Services -- Binarized Label: [0 0 0 0 1 0].
Original Label: Building Inspection Division -- Binarized Label: [1 0 0 0 0 0].
Original Label: Code Enforcement Section -- Binarized Label: [0 1 0 0 0 0].


In [292]:
print(X_lemmatized)

0         ['date', 'last', 'observe', 'jun', 'time', 'la...
1         ['unpermitte', 'interior', 'frame', 'rd', 'lev...
2         ['elevator', 'parking', 'area', 'condo', 'inop...
3         ['complainant', 'concerned', 'lenght', 'time',...
4         ['date', 'last', 'observe', 'jan', 'time', 'la...
                                ...                        
183600    ['go', 'beyond', 'scope', 'work', 'remove', 'e...
183601    ['date', 'last', 'observe', 'mar', 'time', 'la...
183603    ['bathroom', 'community', 'bathroom', 'area', ...
183604    ['van', 'ness', 'construction', 'site', 'van',...
183605                ['outdoor', 'pipe', 'leak', 'studio']
Name: complaint_description, Length: 168323, dtype: object


In [293]:
print(y_lemmatized)

[[0 0 0 0 1 0]
 [0 0 0 1 0 0]
 [0 0 0 0 1 0]
 ...
 [0 0 0 0 1 0]
 [1 0 0 0 0 0]
 [0 0 0 0 1 0]]


In [294]:
print(X_stemmed)

0         ['date', 'last', 'observ', 'jun', 'time', 'las...
1         ['unpermit', 'interior', 'frame', 'rd', 'level...
2               ['elev', 'park', 'area', 'condo', 'inoper']
3         ['complain', 'concern', 'lenght', 'time', 'tak...
4         ['date', 'last', 'observ', 'jan', 'time', 'las...
                                ...                        
183600    ['go', 'beyond', 'scope', 'work', 'remov', 'en...
183601    ['date', 'last', 'observ', 'mar', 'time', 'las...
183603    ['bathroom', 'commun', 'bathroom', 'area', 'ke...
183604    ['van', 'ness', 'construct', 'site', 'van', 'n...
183605                ['outdoor', 'pipe', 'leak', 'studio']
Name: complaint_description, Length: 168171, dtype: object


In [295]:
print(y_stemmed)

[[0 0 0 0 1 0]
 [0 0 0 1 0 0]
 [0 0 0 0 1 0]
 ...
 [0 0 0 0 1 0]
 [1 0 0 0 0 0]
 [0 0 0 0 1 0]]


<h1>5. Define and Fit Tokenizers </h1>

In [None]:
#define tensorflow tokenizer for the stemmed text
num_words = 10000
stemmed_tokenizer = Tokenizer(num_words=num_words, oov_token='<UNK>')

#fit tokenizer to text
stemmed_tokenizer.fit_on_texts(X_stemmed)

#save the tokenizer so we can use it later to process data for predictions
with open(stemmed_path + 'stemmed_tokenizer.pickle', 'wb') as f:
    pickle.dump(stemmed_tokenizer, f)
    
#define variables for word count and index
stemmed_word_count = stemmed_tokenizer.word_counts
stemmed_word_index = stemmed_tokenizer.word_index


#encode the data into a sequence 
X_sequences_stemmed = stemmed_tokenizer.texts_to_sequences(X_stemmed)

#pad the sequences
X_stemmed = pad_sequences(X_sequences_stemmed, padding='post',
                truncating='post', maxlen=200)
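
<div class="description">
To make the padding='post' and truncating='post' settings concrete, here is a plain NumPy re-implementation of that behaviour. The helper pad_post is written for illustration only and is not the Keras function; the token ids are arbitrary.
</div>

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Right-pad with `value` and cut off anything past `maxlen`."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[:maxlen]              # truncating='post' keeps the start
        out[row, :len(kept)] = kept      # padding='post' leaves zeros at the end
    return out

padded = pad_post([[5, 6, 16], [2, 3, 4, 12, 25, 9]], maxlen=4)
print(padded)
```

<div class="description">
The short sequence is filled with zeros on the right and the long one loses its last tokens, so every row ends up with the same length.
</div>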

In [296]:
#define tensorflow tokenizer for the lemmatized text
num_words = 10000
lemmatized_tokenizer = Tokenizer(num_words=num_words, oov_token='<UNK>')

#fit tokenizer to text
lemmatized_tokenizer.fit_on_texts(X_lemmatized)

#save the tokenizer so we can use it later to process data for predictions
with open(lemmatized_path + 'lemmatized_tokenizer.pickle', 'wb') as f:
    pickle.dump(lemmatized_tokenizer, f)


#define variables for word count and index
lemmatized_word_count = lemmatized_tokenizer.word_counts
lemmatized_word_index = lemmatized_tokenizer.word_index


#encode the data into a sequence 
X_lemmatized_sequences = lemmatized_tokenizer.texts_to_sequences(X_lemmatized)

#pad the sequences
X_lemmatized = pad_sequences(X_lemmatized_sequences, padding='post',
                truncating='post', maxlen=200)
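
<div class="description">
Because the token lists were written to CSV and read back, each complaint_description cell is a string such as "['date', 'last', ...]" rather than a Python list. The Tokenizer's default filters do not strip apostrophes, which appears to be why the vocabulary keys printed below carry quotes (for example "'permit'"). One possible way to recover plain tokens before fitting is sketched here; the helper name tokens_to_text is an assumption, and refitting on the converted text would change the saved vocabularies.
</div>

```python
import ast

def tokens_to_text(cell):
    """Turn "['date', 'last']" (a list saved through CSV) into "date last"."""
    tokens = ast.literal_eval(cell) if isinstance(cell, str) else cell
    return " ".join(tokens)

cell = "['outdoor', 'pipe', 'leak', 'studio']"   # value copied from the printed Series
text = tokens_to_text(cell)
print(text)
```

<div class="description">
On the real data this could be applied with X_lemmatized.map(tokens_to_text) before calling fit_on_texts.
</div>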

In [298]:
display(stemmed_word_count)

OrderedDict([("'date'", 22265),
             ("'last'", 40162),
             ("'observ'", 38554),
             ("'jun'", 1878),
             ("'time'", 21091),
             ("'floor'", 30972),
             ("'nd'", 4747),
             ("'unit'", 38081),
             ("'exact'", 21035),
             ("'locat'", 24791),
             ("'main'", 17400),
             ("'bldg'", 20671),
             ("'build'", 59839),
             ("'type'", 21406),
             ("'insectsrod'", 1235),
             ("'addit'", 22377),
             ("'inform'", 20124),
             ("'mani'", 1741),
             ("'differ'", 415),
             ("'fli'", 676),
             ("'infest'", 2440),
             ("'affect'", 629),
             ("'multipl'", 1519),
             ("'face'", 1022),
             ("'rear'", 9813),
             ("'sever'", 2831),
             ("'attempt'", 357),
             ("'counter'", 499),
             ("'solut'", 72),
             ("'manag'", 4015),
             ("'contact'", 1314),


In [300]:
display(stemmed_word_index)

{'<UNK>': 1,
 "'permit'": 2,
 "'work'": 3,
 "'build'": 4,
 "'last'": 5,
 "'observ'": 6,
 "'unit'": 7,
 "'floor'": 8,
 "'locat'": 9,
 "'water'": 10,
 "'construct'": 11,
 "'without'": 12,
 "'addit'": 13,
 "'date'": 14,
 "'type'": 15,
 "'time'": 16,
 "'exact'": 17,
 "'bldg'": 18,
 "'inform'": 19,
 "'illeg'": 20,
 "'window'": 21,
 "'wall'": 22,
 "'residencedwel'": 23,
 "'main'": 24,
 "'leak'": 25,
 "'wo'": 26,
 "'bathroom'": 27,
 "'hous'": 28,
 "'door'": 29,
 "'properti'": 30,
 "'kitchen'": 31,
 "'room'": 32,
 "'garag'": 33,
 "'electr'": 34,
 "'back'": 35,
 "'perform'": 36,
 "'heat'": 37,
 "'person'": 38,
 "'fire'": 39,
 "'paint'": 40,
 "'mold'": 41,
 "'front'": 42,
 "'use'": 43,
 "'ident'": 44,
 "'roof'": 45,
 "'ceil'": 46,
 "'rear'": 47,
 "'instal'": 48,
 "'damag'": 49,
 "'structur'": 50,
 "'done'": 51,
 "'broken'": 52,
 "'area'": 53,
 "'plumb'": 54,
 "'possibl'": 55,
 "'scope'": 56,
 "'street'": 57,
 "'pm'": 58,
 "'neighbor'": 59,
 "'side'": 60,
 "'problem'": 61,
 "'also'": 62,
 "'nois'

In [299]:
display(lemmatized_word_count)

OrderedDict([("'date'", 22266),
             ("'last'", 40165),
             ("'observe'", 38319),
             ("'jun'", 1878),
             ("'time'", 21038),
             ("'floor'", 30969),
             ("'nd'", 4748),
             ("'unit'", 38049),
             ("'exact'", 21035),
             ("'location'", 22989),
             ("'main'", 17400),
             ("'bldg'", 20620),
             ("'build'", 63665),
             ("'type'", 21406),
             ("'insectsrodent'", 1235),
             ("'additional'", 20228),
             ("'information'", 19590),
             ("'many'", 1742),
             ("'different'", 390),
             ("'fly'", 695),
             ("'infestation'", 2007),
             ("'affect'", 627),
             ("'multiple'", 1519),
             ("'face'", 1022),
             ("'rear'", 9809),
             ("'several'", 1961),
             ("'attempt'", 357),
             ("'counter'", 499),
             ("'solution'", 72),
             ("'management'", 1692)

In [261]:
display(lemmatized_word_index)

{'<UNK>': 1,
 "'build'": 2,
 "'permit'": 3,
 "'work'": 4,
 "'last'": 5,
 "'observe'": 6,
 "'unit'": 7,
 "'floor'": 8,
 "'water'": 9,
 "'location'": 10,
 "'without'": 11,
 "'construction'": 12,
 "'date'": 13,
 "'type'": 14,
 "'time'": 15,
 "'exact'": 16,
 "'bldg'": 17,
 "'additional'": 18,
 "'information'": 19,
 "'window'": 20,
 "'wall'": 21,
 "'illegal'": 22,
 "'residencedwelle'": 23,
 "'main'": 24,
 "'leak'": 25,
 "'wo'": 26,
 "'bathroom'": 27,
 "'door'": 28,
 "'do'": 29,
 "'property'": 30,
 "'kitchen'": 31,
 "'room'": 32,
 "'garage'": 33,
 "'back'": 34,
 "'perform'": 35,
 "'house'": 36,
 "'heat'": 37,
 "'fire'": 38,
 "'person'": 39,
 "'paint'": 40,
 "'mold'": 41,
 "'front'": 42,
 "'electrical'": 43,
 "'use'": 44,
 "'identity'": 45,
 "'roof'": 46,
 "'ceiling'": 47,
 "'rear'": 48,
 "'break'": 49,
 "'go'": 50,
 "'damage'": 51,
 "'area'": 52,
 "'come'": 53,
 "'plumb'": 54,
 "'nt'": 55,
 "'scope'": 56,
 "'street'": 57,
 "'pm'": 58,
 "'neighbor'": 59,
 "'problem'": 60,
 "'side'": 61,
 "'al

In [301]:
np.save(stemmed_path + 'X_stemmed_prepared.npy', X_stemmed)
np.save(stemmed_path + 'y_stemmed_prepared.npy', y_stemmed)

np.save(lemmatized_path + 'X_lemmatized_prepared.npy', X_lemmatized)
np.save(lemmatized_path + 'y_lemmatized_prepared.npy', y_lemmatized)
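
<div class="description">
A later notebook can read these artifacts back with np.load and pickle.load. The round trip below uses a temporary directory and stand-in values, so the file names and contents are illustrative rather than the project's real files.
</div>

```python
import os
import pickle
import tempfile
import numpy as np

X_demo = np.array([[5, 6, 16, 0], [2, 3, 4, 12]], dtype="int32")  # stand-in for padded features
weights_demo = {0: 0.46, 1: 3.21}                                  # stand-in for a class-weight dict

with tempfile.TemporaryDirectory() as tmp:
    # Save the array and the dictionary
    np.save(os.path.join(tmp, "X_demo.npy"), X_demo)
    with open(os.path.join(tmp, "weights_demo.pickle"), "wb") as f:
        pickle.dump(weights_demo, f)

    # Load them back
    X_loaded = np.load(os.path.join(tmp, "X_demo.npy"))
    with open(os.path.join(tmp, "weights_demo.pickle"), "rb") as f:
        weights_loaded = pickle.load(f)

print(np.array_equal(X_demo, X_loaded), weights_loaded == weights_demo)
```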