<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1.-Introduction" data-toc-modified-id="1.-Introduction-1">1. Introduction</a></span></li><li><span><a href="#2.-Create-Stemmed-Text-" data-toc-modified-id="2.-Create-Stemmed-Text--2">2. Create Stemmed Text </a></span></li><li><span><a href="#3.-Create-Lemmatized-Text-" data-toc-modified-id="3.-Create-Lemmatized-Text--3">3. Create Lemmatized Text </a></span></li><li><span><a href="#3.-Calculate-Class-Weights-" data-toc-modified-id="3.-Calculate-Class-Weights--4">3. Calculate Class Weights </a></span></li><li><span><a href="#4.-Split-Features/Labels-and-Binarize-Labels-" data-toc-modified-id="4.-Split-Features/Labels-and-Binarize-Labels--5">4. Split Features/Labels and Binarize Labels </a></span></li><li><span><a href="#5.-Define-and-Fit-Tokenizers-" data-toc-modified-id="5.-Define-and-Fit-Tokenizers--6">5. Define and Fit Tokenizers </a></span></li></ul></div>

In [1]:
# Styling notebook
from IPython.display import HTML
def css_styling():
    #read the custom stylesheet and close the file afterwards
    with open("../styles/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)
css_styling()

In [2]:
import sys
sys.path.append('../')
import numpy as np
from numpy import array
import os
import pandas as pd
import pickle
from tqdm.auto import tqdm
from src.pipeline_helpers import get_proportions
from src.clean_data import normalize_text
from sklearn.utils import class_weight
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from keras.layers import Input, Dense, Embedding, Flatten, Activation, LeakyReLU,Bidirectional, LSTM, BatchNormalization, GlobalAveragePooling1D, GlobalMaxPooling1D
from keras.models import Sequential
from keras.callbacks import EarlyStopping
import tabulate

<h1 id="1.1-Introduction">1. Introduction</h1>

<div class="description">
In this notebook we will clean and normalize our data to prepare our features and labels for use in multi-class classification NLP models. 

<br>
    
We begin with the normalize_text function defined in our clean_data module, which uses NLTK and spaCy for normalization. The function lowercases the text, removes numbers, removes stopwords, and then creates a stemmed or lemmatized version of the text. We will save both versions here to compare their performance later.
<br>
    
Then we will calculate and save the class weights for each version of the text so that we can pass them to the class_weight parameter when fitting our TensorFlow models. We will also explore other options for handling imbalanced data.
<br>
    
The next step is splitting the dataset into its features (complaint_description) and labels (assigned_division), and converting the labels into a binarized ndarray using sklearn's LabelBinarizer.
<br>
    
Next we create and save TensorFlow Tokenizers fit on the stemmed and lemmatized text so we can use them later to transform new data for predictions. We also use them to transform the stemmed and lemmatized features, and finally we save the tokenized features and binarized labels as .npy files to our /data/ directory.
    
</div>

<div class="description">

First we will load and briefly review our preprocessed data. We want to:
    <ul>
        <li> ensure there are no duplicates or n/a values (these were removed in preprocessing)</li>
        <li> check the data's shape</li>
        <li> view the size of each class (classes with fewer than 1000 instances were also removed in preprocessing)</li>
    </ul>
</div>

</div>

In [3]:
working_dir = os.getcwd()
data_path = os.path.dirname(working_dir) + '/data/'
lemmatized_path = data_path + "lemmatized/"   #data_path already ends with a slash
stemmed_path = data_path + "stemmed/"
df = pd.read_csv(data_path + 'preprocessed.csv')

In [5]:
df.head(10)

Unnamed: 0,complaint_description,assigned_division
0,Caller reporting that the water is discolored....,Housing Inspection Services
1,Date last observed: 03-oct-17; time last ob...,Building Inspection Division
2,Date last observed: 08-dec-17; exact locati...,Housing Inspection Services
3,Date last observed: 19-aug-17; time last ob...,Building Inspection Division
4,Date last observed: 14-aug-17; time last ob...,Building Inspection Division
5,Garbage is overflowing. trash has no top.,Housing Inspection Services
6,Building code violations resulting from invest...,Housing Inspection Services
7,Date last observed: 27-sep-17; time last ob...,Plumbing Inspection Division
8,There is a high amount of equipment operation ...,Building Inspection Division
9,Remodeled in the garage level. four bedrooms ...,Building Inspection Division


In [6]:
#ensure there are no null values or duplicates
print(f"Null: \n {df.isna().sum()} \n")
print(f"Duplicates:  {df.duplicated().sum()}")

Null: 
 complaint_description    0
assigned_division        0
dtype: int64 

Duplicates:  0


In [7]:
df.shape

(173754, 2)

In [8]:
df.assigned_division.value_counts()

Housing Inspection Services       87117
Building Inspection Division      63684
Plumbing Inspection Division      16296
Electrical Inspection Division     5543
Disabled Access Division           1114
Name: assigned_division, dtype: int64

<h1>2. Create Stemmed Text </h1>

<div class="description">
First we will prepare a stemmed version of our text using the normalize_text function in our clean_data module. 
This function lowercases the text, removes numbers (by default), removes stopwords, and then returns a stemmed version of the text created using NLTK's PorterStemmer.
</div>

In [57]:
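#create stemmed text
tqdm.pandas()   #enable progress_apply on pandas objects
stemmed = df.copy()
#apply normalize_text (stemming is the default) to each complaint
stemmed['complaint_description'] = stemmed['complaint_description'].progress_apply(normalize_text)
#save stemmed version of text
stemmed.to_csv(stemmed_path + 'stemmed_text.csv')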


<h1>3. Create Lemmatized Text </h1>

<div class="description">
Next we will create a lemmatized version of the text by passing lemmatize=True to our normalize_text function. This performs the same steps as the stemming pass, except that it uses spaCy's lemmatizer instead of a stemmer.
</div>

In [58]:
#create lemmatized text
lemmatized = df.copy()
#apply normalize_text to dataframe
lemmatized['complaint_description'] = lemmatized['complaint_description'] \
                                      .progress_apply(lambda x: normalize_text(x,
                                                                lemmatize = True))
#save lemmatized version of text
lemmatized.to_csv(lemmatized_path + 'lemmatized_text.csv')


In [4]:
#load the saved data and ensure it looks correct
lemmatized = pd.read_csv(lemmatized_path + 'lemmatized_text.csv',index_col=[0])
stemmed = pd.read_csv(stemmed_path + 'stemmed_text.csv',index_col=[0])

In [5]:
#view sample of lemmatized text
lemmatized.sample(10)

In [62]:
lemmatized.shape

(173754, 2)

In [63]:
print(lemmatized.isna().sum())                          #print null values
print("Duplicated :", lemmatized.duplicated().sum())    #print duplicates

complaint_description    0
assigned_division        0
dtype: int64
Duplicated : 13399


In [64]:
stemmed.head()

Unnamed: 0,complaint_description,assigned_division
0,"['caller', 'report', 'water', 'discolor', 'ora...",Housing Inspection Services
1,"['date', 'last', 'observ', 'oct', 'time', 'las...",Building Inspection Division
2,"['date', 'last', 'observ', 'dec', 'exact', 'lo...",Housing Inspection Services
3,"['date', 'last', 'observ', 'aug', 'time', 'las...",Building Inspection Division
4,"['date', 'last', 'observ', 'aug', 'time', 'las...",Building Inspection Division


In [65]:
stemmed.shape

(173754, 2)

In [66]:
print(stemmed.isna().sum())
print("Duplicated :", stemmed.duplicated().sum())

complaint_description    0
assigned_division        0
dtype: int64
Duplicated : 13540


<div class="description">
After stemming and lemmatizing the text we can see that normalization has introduced duplicates that were not there before: 13,540 in the stemmed set and 13,399 in the lemmatized set. We will drop them before proceeding. 
</div>
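
<div class="description">
To see why normalization can introduce duplicates, consider the minimal sketch below. The simple_normalize helper and its tiny stopword set are hypothetical, much-simplified stand-ins for normalize_text (they skip stemming and lemmatization entirely). The point is only that raw complaints differing in numbers, punctuation, or casing can collapse to the same token list.
</div>

```python
import re

# Hypothetical, much-simplified stand-in for normalize_text:
# lowercase, drop digits and punctuation, remove a few stopwords.
STOPWORDS = {"the", "is", "on", "at", "a", "of"}

def simple_normalize(text):
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # replace numbers and punctuation with spaces
    return [tok for tok in text.split() if tok not in STOPWORDS]

# Three distinct raw complaints (made up for illustration)
raw = [
    "Water leak on floor 2, unit 12.",
    "WATER LEAK on floor 3, unit 7",
    "Broken window.",
]

# Tuples are hashable, so they can be counted with a set
normalized = [tuple(simple_normalize(t)) for t in raw]

n_raw_unique = len(set(raw))
n_norm_unique = len(set(normalized))
print(f"unique raw: {n_raw_unique}, unique after normalization: {n_norm_unique}")
```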

In [67]:
lemmatized.drop_duplicates(inplace = True)  #drop lemmatized duplicates
stemmed.drop_duplicates(inplace = True)     #drop stemmed duplicates

In [69]:
stemmed.shape

(160214, 2)

In [70]:
lemmatized.shape

(160355, 2)

In [71]:
print(f'Duplicates in stemmed text: {stemmed.duplicated().sum()}, duplicates in lemmatized text: {lemmatized.duplicated().sum()}.')

Duplicates in stemmed text: 0, duplicates in lemmatized text: 0.


In [72]:
lemmatized.assigned_division.value_counts() #print value counts for each division after lemmatization

Housing Inspection Services       81811
Building Inspection Division      61362
Plumbing Inspection Division      10650
Electrical Inspection Division     5444
Disabled Access Division           1088
Name: assigned_division, dtype: int64

In [73]:
stemmed.assigned_division.value_counts()

Housing Inspection Services       81721
Building Inspection Division      61359
Plumbing Inspection Division      10611
Electrical Inspection Division     5439
Disabled Access Division           1084
Name: assigned_division, dtype: int64

<h1>3. Calculate Class Weights </h1>

In [74]:
classes_stemmed = stemmed.assigned_division                                 #select classes
stemmed_class_weights = class_weight.compute_class_weight('balanced',       #compute balanced class weights
                                                 classes = np.unique(classes_stemmed),y = classes_stemmed)
stemmed_class_weights = dict(enumerate(stemmed_class_weights))              #convert class weights array to dictionary keyed by class index

In [75]:
print(stemmed_class_weights)

{0: 0.5222184194657671, 1: 29.55977859778598, 2: 5.891303548446405, 3: 0.39209994982929725, 4: 3.019771934784657}
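
<div class="description">
For reference, the 'balanced' heuristic assigns each class the weight n_samples / (n_classes * class_count). The sketch below reproduces the stemmed weights in plain Python from the class counts printed earlier. The counts dictionary is copied from that output, and the integer keys are assumed to follow the sorted class order that np.unique returns.
</div>

```python
# Class counts for the stemmed text after dropping duplicates
# (copied from the value_counts output above).
counts = {
    "Building Inspection Division": 61359,
    "Disabled Access Division": 1084,
    "Electrical Inspection Division": 5439,
    "Housing Inspection Services": 81721,
    "Plumbing Inspection Division": 10611,
}

n_samples = sum(counts.values())
n_classes = len(counts)

# 'balanced' heuristic: n_samples / (n_classes * count of that class)
weights_by_name = {name: n_samples / (n_classes * c) for name, c in counts.items()}

# Integer keys in sorted (alphabetical) class order, mirroring dict(enumerate(...))
class_order = sorted(counts)
weights_by_index = {i: weights_by_name[name] for i, name in enumerate(class_order)}

for i, name in enumerate(class_order):
    print(i, name, round(weights_by_index[i], 4))
```

<div class="description">
The rarest class, Disabled Access Division, receives a weight of about 29.6, roughly 75 times the weight of Housing Inspection Services (about 0.39).
</div>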


In [76]:
classes_lemmatized = lemmatized.assigned_division                           #select classes
lemmatized_class_weights = class_weight.compute_class_weight('balanced',    #compute balanced class weights
                                                 classes = np.unique(classes_lemmatized),y = classes_lemmatized)
lemmatized_class_weights = dict(enumerate(lemmatized_class_weights))        #convert class weights list to dictionary

In [77]:
print(lemmatized_class_weights)

{0: 0.522652455917343, 1: 29.47702205882353, 2: 5.891072740631889, 3: 0.39201329894512965, 4: 3.011361502347418}


In [78]:
#save class weights for later use
with open(stemmed_path + 'stemmed_class_weights.pickle', 'wb') as f:
    pickle.dump(stemmed_class_weights, f)
    
with open(lemmatized_path + 'lemmatized_class_weights.pickle', 'wb') as f:
    pickle.dump(lemmatized_class_weights, f)

<h1>4. Split Features/Labels and Binarize Labels </h1>

In [79]:
encoder = LabelBinarizer()                          #define label binarizer

X_stemmed = stemmed.complaint_description           #select stemmed complaints, which will be our features
y_stemmed = encoder.fit_transform(classes_stemmed)  #fit label binarizer and binarize stemmed labels
X_lemmatized = lemmatized.complaint_description     #select lemmatized complaints, which will be our features
y_lemmatized = encoder.fit_transform(lemmatized.assigned_division)  #refit label binarizer and binarize lemmatized labels
class_names = encoder.classes_                      #select and save class names to preserve order for later use

with open(lemmatized_path + 'class_names.pickle', 'wb') as f:   #save class names 
    pickle.dump(class_names, f)

In [80]:
print(class_names)

['Building Inspection Division' 'Disabled Access Division'
 'Electrical Inspection Division' 'Housing Inspection Services'
 'Plumbing Inspection Division']
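
<div class="description">
For more than two classes, LabelBinarizer produces one row per sample with a single 1 in the column of that sample's class, and the columns follow the sorted order of classes_. The sketch below mimics that behavior with plain NumPy on a few hypothetical labels and shows how argmax maps a row back to its class name. It is an illustration only, not the scikit-learn implementation.
</div>

```python
import numpy as np

# A few hypothetical labels drawn from the five divisions
labels = np.array([
    "Housing Inspection Services",
    "Building Inspection Division",
    "Plumbing Inspection Division",
    "Disabled Access Division",
    "Electrical Inspection Division",
    "Housing Inspection Services",
])

# Sorted unique class names define the column order
classes = np.unique(labels)

# One-hot encode: compare every label against every class name
one_hot = (labels[:, None] == classes[None, :]).astype(int)

# Decode: the position of the 1 in each row indexes into the class names
decoded = classes[one_hot.argmax(axis=1)]

print(classes)
print(one_hot)
print(decoded)
```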


In [81]:
X_stemmed.shape

(160214,)

In [82]:
y_stemmed.shape

(160214, 5)

In [83]:
X_lemmatized.shape

(160355,)

In [84]:
y_lemmatized.shape

(160355, 5)

In [89]:
print(X_lemmatized)

0         ['caller', 'report', 'water', 'discolor', 'ora...
1         ['date', 'last', 'observe', 'oct', 'time', 'la...
2         ['date', 'last', 'observe', 'dec', 'exact', 'l...
3         ['date', 'last', 'observe', 'aug', 'time', 'la...
4         ['date', 'last', 'observe', 'aug', 'time', 'la...
                                ...                        
173749    ['date', 'last', 'observe', 'nov', 'time', 'la...
173750    ['entirely', 'sure', 'correct', 'address', 'ma...
173751       ['mold', 'plumb', 'issue', 'back', 'bathroom']
173752    ['caller', 'say', 'wall', 'crumble', 'due', 'm...
173753    ['mold', 'visible', 'ceiling', 'throughout', '...
Name: complaint_description, Length: 160355, dtype: object


In [90]:
print(y_lemmatized)

[[0 0 0 1 0]
 [1 0 0 0 0]
 [0 0 0 1 0]
 ...
 [0 0 0 1 0]
 [0 0 0 1 0]
 [0 0 0 1 0]]


In [91]:
print(X_stemmed)

0         ['caller', 'report', 'water', 'discolor', 'ora...
1         ['date', 'last', 'observ', 'oct', 'time', 'las...
2         ['date', 'last', 'observ', 'dec', 'exact', 'lo...
3         ['date', 'last', 'observ', 'aug', 'time', 'las...
4         ['date', 'last', 'observ', 'aug', 'time', 'las...
                                ...                        
173749    ['date', 'last', 'observ', 'nov', 'time', 'las...
173750    ['entir', 'sure', 'correct', 'address', 'massi...
173751        ['mold', 'plumb', 'issu', 'back', 'bathroom']
173752    ['caller', 'say', 'wall', 'crumbl', 'due', 'mo...
173753    ['mold', 'visibl', 'ceil', 'throughout', 'unit...
Name: complaint_description, Length: 160214, dtype: object


In [92]:
print(y_stemmed)

[[0 0 0 1 0]
 [1 0 0 0 0]
 [0 0 0 1 0]
 ...
 [0 0 0 1 0]
 [0 0 0 1 0]
 [0 0 0 1 0]]


<h1>5. Define and Fit Tokenizers </h1>

In [93]:
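#define tensorflow tokenizer for the stemmed text
num_words = 10000   #number of words to use in vocabulary
#note: the out-of-vocabulary token here is the integer 0, not the '<UNK>' string used for the lemmatized tokenizer
stemmed_tokenizer = Tokenizer(num_words=num_words, oov_token=0)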

#fit tokenizer to text
stemmed_tokenizer.fit_on_texts(X_stemmed)

#save the tokenizer so we can use it later to process data for predictions
with open(stemmed_path + 'stemmed_tokenizer.pickle', 'wb') as f:
    pickle.dump(stemmed_tokenizer, f)
    
#define variables for word count and index
stemmed_word_count = stemmed_tokenizer.word_counts
stemmed_word_index = stemmed_tokenizer.word_index


#encode the data into a sequence 
X_sequences_stemmed = stemmed_tokenizer.texts_to_sequences(X_stemmed)

#pad the sequences
X_stemmed = pad_sequences(X_sequences_stemmed, padding='post',
                truncating='post', maxlen=200)

In [94]:
#define tensorflow tokenizer for the lemmatized text
num_words = 10000   #number of words to use in vocabulary
lemmatized_tokenizer = Tokenizer(num_words=num_words, oov_token='<UNK>') #initialize the tokenizer with out of vocabulary token

#fit tokenizer to text
lemmatized_tokenizer.fit_on_texts(X_lemmatized)

#save the tokenizer so we can use it later to process data for predictions
with open(lemmatized_path + 'lemmatized_tokenizer.pickle', 'wb') as f:
    pickle.dump(lemmatized_tokenizer, f)


#define variables for word count and index
lemmatized_word_count = lemmatized_tokenizer.word_counts
lemmatized_word_index = lemmatized_tokenizer.word_index


#encode the data into a sequence 
X_lemmatized_sequences = lemmatized_tokenizer.texts_to_sequences(X_lemmatized)

#pad the sequences
X_lemmatized = pad_sequences(X_lemmatized_sequences, padding='post',
                truncating='post', maxlen=200)
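
<div class="description">
The sketch below illustrates, in plain Python, roughly what the tokenize-and-pad steps do: build a frequency-ranked word index with index 1 reserved for the out-of-vocabulary token, map words outside the top num_words indices to that token, then truncate and pad each sequence at the end. The helper names (split_words, fit_word_index, to_sequence, pad_post) are hypothetical, and this is a simplified stand-in rather than the Keras implementation. The filter string is the Keras Tokenizer default, which does not include the apostrophe. Since the token lists are read back from CSV as strings, this would explain why the vocabulary entries keep their quotes.
</div>

```python
from collections import Counter

# Default `filters` value of the Keras Tokenizer (note: no apostrophe).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_STRIP = str.maketrans({ch: " " for ch in FILTERS})

def split_words(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    return text.lower().translate(_STRIP).split()

def fit_word_index(texts, oov_token="<UNK>"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(w for t in texts for w in split_words(t))
    ranked = [w for w, _ in counts.most_common()]
    return {w: i for i, w in enumerate([oov_token] + ranked, start=1)}

def to_sequence(text, word_index, num_words, oov_token="<UNK>"):
    """Map words to indices; unseen or low-rank words become the OOV index."""
    oov = word_index[oov_token]
    seq = []
    for w in split_words(text):
        idx = word_index.get(w, oov)
        seq.append(idx if idx < num_words else oov)
    return seq

def pad_post(seq, maxlen):
    """Truncate at the end, then pad with zeros at the end."""
    seq = seq[:maxlen]
    return seq + [0] * (maxlen - len(seq))

# Token lists stored as strings, as they appear after the CSV round trip
texts = ["['water', 'leak', 'wall']", "['water', 'leak']", "['water']"]
word_index = fit_word_index(texts)

# 'mold' was never seen and 'wall' falls outside the top num_words indices
seq = to_sequence("['water', 'mold', 'wall']", word_index, num_words=4)
padded = pad_post(seq, maxlen=5)

print(word_index)
print(padded)
```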

In [95]:
display(stemmed_word_count)

OrderedDict([("'caller'", 5380),
             ("'report'", 3241),
             ("'water'", 23632),
             ("'discolor'", 72),
             ("'orang'", 49),
             ("'color'", 128),
             ("'sink'", 5216),
             ("'heaterstain'", 1),
             ("'drizzl'", 5),
             ("'wall'", 17085),
             ("'exterior'", 3427),
             ("'window'", 17811),
             ("'paint'", 10798),
             ("'peopl'", 3889),
             ("'wore'", 7),
             ("'mask'", 500),
             ("'state'", 4665),
             ("'inform'", 19732),
             ("'particl'", 187),
             ("'come'", 7315),
             ("'build'", 56838),
             ("'test'", 372),
             ("'asbesto'", 1156),
             ("'find'", 376),
             ("'disclos'", 34),
             ("'tenant'", 7269),
             ("'manag'", 4042),
             ("'ga'", 2961),
             ("'leak'", 16640),
             ("'given'", 469),
             ("'reason'", 232),
         

In [96]:
display(stemmed_word_index)

{0: 1,
 "'work'": 2,
 "'build'": 3,
 "'permit'": 4,
 "'last'": 5,
 "'observ'": 6,
 "'unit'": 7,
 "'floor'": 8,
 "'locat'": 9,
 "'water'": 10,
 "'construct'": 11,
 "'date'": 12,
 "'addit'": 13,
 "'type'": 14,
 "'without'": 15,
 "'time'": 16,
 "'exact'": 17,
 "'bldg'": 18,
 "'inform'": 19,
 "'window'": 20,
 "'illeg'": 21,
 "'residencedwel'": 22,
 "'wall'": 23,
 "'main'": 24,
 "'leak'": 25,
 "'bathroom'": 26,
 "'wo'": 27,
 "'door'": 28,
 "'hous'": 29,
 "'properti'": 30,
 "'kitchen'": 31,
 "'room'": 32,
 "'electr'": 33,
 "'garag'": 34,
 "'perform'": 35,
 "'back'": 36,
 "'heat'": 37,
 "'person'": 38,
 "'mold'": 39,
 "'paint'": 40,
 "'fire'": 41,
 "'front'": 42,
 "'ident'": 43,
 "'use'": 44,
 "'ceil'": 45,
 "'roof'": 46,
 "'rear'": 47,
 "'damag'": 48,
 "'instal'": 49,
 "'done'": 50,
 "'structur'": 51,
 "'broken'": 52,
 "'area'": 53,
 "'plumb'": 54,
 "'possibl'": 55,
 "'scope'": 56,
 "'pm'": 57,
 "'street'": 58,
 "'nois'": 59,
 "'problem'": 60,
 "'come'": 61,
 "'tenant'": 62,
 "'neighbor'": 6

In [97]:
display(lemmatized_word_count)

OrderedDict([("'caller'", 5380),
             ("'report'", 3237),
             ("'water'", 23624),
             ("'discolor'", 53),
             ("'orange'", 49),
             ("'color'", 126),
             ("'sink'", 5237),
             ("'heaterstain'", 1),
             ("'drizzle'", 5),
             ("'wall'", 17088),
             ("'exterior'", 3428),
             ("'window'", 17822),
             ("'paint'", 10806),
             ("'people'", 3889),
             ("'wear'", 937),
             ("'mask'", 500),
             ("'state'", 4665),
             ("'inform'", 520),
             ("'particle'", 187),
             ("'come'", 8201),
             ("'build'", 60224),
             ("'test'", 288),
             ("'asbestos'", 1144),
             ("'find'", 981),
             ("'disclose'", 34),
             ("'tenant'", 7271),
             ("'management'", 1712),
             ("'testing'", 84),
             ("'gas'", 2989),
             ("'leak'", 16633),
             ("'give'", 987)

In [98]:
display(lemmatized_word_index)

{'<UNK>': 1,
 "'build'": 2,
 "'work'": 3,
 "'permit'": 4,
 "'last'": 5,
 "'observe'": 6,
 "'unit'": 7,
 "'floor'": 8,
 "'water'": 9,
 "'location'": 10,
 "'date'": 11,
 "'construction'": 12,
 "'type'": 13,
 "'without'": 14,
 "'time'": 15,
 "'exact'": 16,
 "'bldg'": 17,
 "'additional'": 18,
 "'information'": 19,
 "'window'": 20,
 "'residencedwelle'": 21,
 "'wall'": 22,
 "'main'": 23,
 "'illegal'": 24,
 "'leak'": 25,
 "'wo'": 26,
 "'bathroom'": 27,
 "'door'": 28,
 "'do'": 29,
 "'property'": 30,
 "'kitchen'": 31,
 "'room'": 32,
 "'garage'": 33,
 "'perform'": 34,
 "'back'": 35,
 "'heat'": 36,
 "'house'": 37,
 "'person'": 38,
 "'mold'": 39,
 "'paint'": 40,
 "'fire'": 41,
 "'electrical'": 42,
 "'front'": 43,
 "'identity'": 44,
 "'use'": 45,
 "'ceiling'": 46,
 "'roof'": 47,
 "'break'": 48,
 "'go'": 49,
 "'rear'": 50,
 "'damage'": 51,
 "'area'": 52,
 "'come'": 53,
 "'nt'": 54,
 "'plumb'": 55,
 "'scope'": 56,
 "'pm'": 57,
 "'street'": 58,
 "'noise'": 59,
 "'problem'": 60,
 "'tenant'": 61,
 "'nei

In [99]:
np.save(stemmed_path + 'X_stemmed_prepared.npy', X_stemmed)
np.save(stemmed_path + 'y_stemmed_prepared.npy', y_stemmed)

np.save(lemmatized_path + 'X_lemmatized_prepared.npy', X_lemmatized)
np.save(lemmatized_path + 'y_lemmatized_prepared.npy', y_lemmatized)
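
<div class="description">
A later notebook can read these arrays back with np.load. The sketch below shows the round trip on small hypothetical arrays written to a temporary directory, so it does not touch the project's /data/ directory. The file names mirror the ones above.
</div>

```python
import os
import tempfile

import numpy as np

# Small hypothetical stand-ins for the padded features and one-hot labels
X_demo = np.array([[86, 151, 9, 0, 0], [11, 5, 6, 0, 0]], dtype=np.int32)
y_demo = np.array([[0, 0, 0, 1, 0], [1, 0, 0, 0, 0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = os.path.join(tmp_dir, "X_lemmatized_prepared.npy")
    y_path = os.path.join(tmp_dir, "y_lemmatized_prepared.npy")

    # Save both arrays in NumPy's binary .npy format
    np.save(x_path, X_demo)
    np.save(y_path, y_demo)

    # Load them back into memory before the temporary directory is removed
    X_loaded = np.load(x_path)
    y_loaded = np.load(y_path)

print(X_loaded.shape, X_loaded.dtype)
print(np.array_equal(X_demo, X_loaded) and np.array_equal(y_demo, y_loaded))
```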

In [100]:
display(X_lemmatized)

array([[ 86, 151,   9, ...,   0,   0,   0],
       [ 11,   5,   6, ...,   0,   0,   0],
       [ 11,   5,   6, ...,   0,   0,   0],
       ...,
       [ 39,  55,  76, ...,   0,   0,   0],
       [ 86, 173,  22, ...,   0,   0,   0],
       [ 39, 436,  46, ...,   0,   0,   0]], dtype=int32)

In [101]:
display(y_lemmatized)

array([[0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0],
       ...,
       [0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0]])