<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Bag-of-words,-fully-connected-with-1-hidden-layer" data-toc-modified-id="Bag-of-words,-fully-connected-with-1-hidden-layer-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Bag-of-words, fully connected with 1 hidden layer</a></span><ul class="toc-item"><li><span><a href="#baseline-sgdclassifier" data-toc-modified-id="baseline-sgdclassifier-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>baseline sgdclassifier</a></span></li><li><span><a href="#keras-with-hidden-layer" data-toc-modified-id="keras-with-hidden-layer-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>keras with hidden layer</a></span></li><li><span><a href="#CountVectorizer" data-toc-modified-id="CountVectorizer-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>CountVectorizer</a></span></li></ul></li><li><span><a href="#Word-embedding,-fully-connected" data-toc-modified-id="Word-embedding,-fully-connected-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Word embedding, fully connected</a></span></li><li><span><a href="#CNN---global-max-pooling" data-toc-modified-id="CNN---global-max-pooling-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>CNN - global max pooling</a></span><ul class="toc-item"><li><span><a href="#dropout" data-toc-modified-id="dropout-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>dropout</a></span></li></ul></li><li><span><a href="#CNN-with-window" data-toc-modified-id="CNN-with-window-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>CNN with window</a></span></li><li><span><a href="#RNN" data-toc-modified-id="RNN-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>RNN</a></span></li><li><span><a href="#Bert-Transformer" data-toc-modified-id="Bert-Transformer-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Bert Transformer</a></span></li></ul></div>

plan
- one-hot encoding + fully connected network
- word embedding + fully connected network
- convolutional neural network over whole sentence
- convolutional neural network with window
- recurrent neural network
- BERT transformer

A linear SVM with roughly 1,000,000 parameters worked well on this task, which gives an approximate lower bound on the number of parameters the networks below should have.

In [1]:
from path import Path

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

pd.options.display.max_columns = None
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated in pandas >= 1.0

sns.set(style="white")

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import f1_score

In [2]:
bills = pd.read_csv('US-Legislative-congressional_bills_18.1.csv', 
                    usecols=['description','majortopic'])
bills.dropna(inplace=True)

bills['majortopic'] = bills['majortopic'].astype(int)
bills = bills[(bills['majortopic'] != 99) & (bills['majortopic']!=23)]

In [3]:
nconvert_keys = sorted(bills['majortopic'].unique())
nconvert_values = range(len(nconvert_keys))
dict_nconvert = dict(zip(nconvert_keys, nconvert_values))

In [4]:
dict_number_topic = {1: 'Macroeconomics',
                     2: 'Civil Rights',
                     3: 'Health',
                     4: 'Agriculture',
                     5: 'Labor',
                     6: 'Education',
                     7: 'Environment',
                     8: 'Energy',
                     9: 'Immigration',
                     10: 'Transportation',
                     12: 'Law and Crime',
                     13: 'Social Welfare',
                     14: 'Housing',
                     15: 'Domestic Commerce',
                     16: 'Defense',
                     17: 'Technology',
                     18: 'Foreign Trade',
                     19: 'International Affairs',
                     20: 'Government Operations',
                     21: 'Public Lands'
                    }

len(dict_number_topic)

20

In [5]:
bills['topic0'] = bills['majortopic'].map(dict_nconvert)

In [6]:
bills.head(1)

Unnamed: 0,description,majortopic,topic0
3,To increase the rates of certain educational and readjustment allowances payable to veterans in order to compensate for the higher cost of living in Alaska,6,5
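
The remapping above turns the non-contiguous major-topic codes into class indices 0..19, which is the form `to_categorical` expects later. A minimal sketch, assuming the codes present in the data are exactly the keys of `dict_number_topic` (1 to 10 and 12 to 21); the names `topic_codes`, `code_to_index` and `index_to_code` are illustrative:

```python
# Assumption: these are the topic codes left after dropping 99 and 23.
topic_codes = list(range(1, 11)) + list(range(12, 22))

# Same construction as dict_nconvert: sorted code -> position in the sorted list.
code_to_index = dict(zip(sorted(topic_codes), range(len(topic_codes))))

# Inverse mapping, useful for turning predicted classes back into topic codes.
index_to_code = {index: code for code, index in code_to_index.items()}

print(code_to_index[6])    # code 6 (Education) becomes class 5, as in the row above
print(code_to_index[12])   # code 11 is unused, so code 12 becomes class 10
print(index_to_code[19])   # the last class maps back to code 21
```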


In [8]:
import nltk
import re
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')

def tokenize_doc(doc, complete=False):
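    # remove everything except letters and whitespace, then lower-case and trim
    # (flags must be passed by keyword: the fourth positional argument of re.sub is `count`)
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)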
    doc = doc.lower()
    doc = doc.strip()
    
    # tokenize document
    tokens = wpt.tokenize(doc)
    
    # filter stopwords out of document
    if complete:
        return tokens
    else:
        filtered_tokens = [token for token in tokens if token not in stop_words]
        return filtered_tokens

bills['tokens'] = bills['description'].apply(tokenize_doc)
bills['tokens_complete'] = bills['description'].apply(tokenize_doc, complete=True)
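
One detail of `re.sub` matters for the cleaning step: its signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so a flag value passed as the fourth positional argument is read as `count`. The sketch below uses a made-up string to show the difference; `re.I | re.A` has the integer value 258.

```python
import re

pattern = r'[^a-zA-Z\s]'
text = "a1" * 300  # hypothetical input: 300 letters interleaved with 300 digits

# What a positional flags argument amounts to: at most 258 replacements.
as_count = re.sub(pattern, '', text, count=int(re.I | re.A))

# Flags passed by keyword: every non-letter character is removed.
as_flags = re.sub(pattern, '', text, flags=re.I | re.A)

print(sum(ch.isdigit() for ch in as_count))  # digits left behind
print(sum(ch.isdigit() for ch in as_flags))  # no digits left
```

For short bill descriptions the two calls usually agree, because few texts contain more than 258 non-letter characters, but only the keyword form does what the comment says.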

In [9]:
# re-create description from filtered tokens
bills['norm_description'] = bills['tokens'].str.join(' ')
bills['norm_description_complete'] = bills['tokens_complete'].str.join(' ')

In [10]:
list_labels = bills["topic0"]

X_train_val, X_test, y_train_val, y_test = train_test_split(
    bills.drop(columns='topic0'),
    list_labels,
    test_size=0.2,
    stratify=list_labels,
    random_state=42
)

single_split_cv = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, val_index in single_split_cv.split(X_train_val, y_train_val):
    X_train, y_train = X_train_val.iloc[train_index], y_train_val.iloc[train_index]
    X_val, y_val = X_train_val.iloc[val_index], y_train_val.iloc[val_index]
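
The two nested 80/20 splits leave roughly 64% of the bills for training, 16% for validation and 20% for testing. A quick arithmetic check, using the training and validation sizes that appear in the printed shapes of this notebook (246016 and 61504):

```python
test_fraction = 0.2   # first split: held-out test set
val_fraction = 0.2    # second split: validation share of the remaining data

train_share = (1 - test_fraction) * (1 - val_fraction)
val_share = (1 - test_fraction) * val_fraction
print(round(train_share, 2), round(val_share, 2), test_fraction)

# Row counts taken from the shapes printed in the notebook.
n_train, n_val = 246016, 61504
print(n_val / (n_train + n_val))  # validation share of the train+val pool
```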

# Bag-of-words, fully connected with 1 hidden layer

In [47]:
description = 'norm_description'
train_val = X_train_val[description]
train = X_train[description]
val = X_val[description]
test = X_test[description]

In [12]:
from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(ngram_range=(1,2), n_features=2**18)
train_v = vectorizer.fit_transform(train)
val_v = vectorizer.transform(val)

In [13]:
train_v.shape

(246016, 262144)
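
`HashingVectorizer` needs no fitted vocabulary: each n-gram is hashed straight to one of `n_features` column indices, which is why the matrix has exactly 2**18 columns. The sketch below illustrates the idea only. It uses `hashlib.md5` as a stand-in hash and the hypothetical helpers `ngrams` and `hashed_counts`, and it skips the sign trick and L2 normalisation that scikit-learn applies by default (scikit-learn itself uses a signed MurmurHash3).

```python
import hashlib
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Yield word n-grams of length n_min..n_max as space-joined strings."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def hashed_counts(text, n_features=2**18):
    """Map each n-gram to a column index via a hash and count occurrences."""
    counts = Counter()
    for gram in ngrams(text.split()):
        digest = hashlib.md5(gram.encode("utf-8")).hexdigest()
        counts[int(digest, 16) % n_features] += 1
    return counts

row = hashed_counts("increase rates certain educational allowances")
print(len(row), sum(row.values()))  # distinct columns used, total n-grams
```

Different n-grams can collide in the same column; with 2**18 columns this is the price paid for a fixed-size, stateless feature space.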

## baseline sgdclassifier

In [40]:
from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier(max_iter=5, tol=None, random_state=42, alpha=0.0001)

In [43]:
sgd.fit(train_v, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=5,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=42, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)

In [44]:
y_val_predicted = sgd.predict(val_v)
f1_score(y_val, y_val_predicted, average='macro')

0.7979508711993613
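
Macro-averaged F1 computes one F1 score per topic and takes their unweighted mean, so small topics count as much as large ones. A minimal pure-Python sketch with made-up labels (the helper name `macro_f1` is illustrative, not part of scikit-learn):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = [0, 0, 1, 1, 2, 2]  # made-up example labels
y_pred = [0, 1, 1, 1, 2, 0]
print(macro_f1(y_true, y_pred))
```

Because the per-class formula treats false positives and false negatives symmetrically, swapping the two arguments does not change the macro score.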

## keras with hidden layer

In [14]:
from keras.utils import to_categorical

Using TensorFlow backend.


In [15]:
y_train_cat = to_categorical(y_train)
y_val_cat = to_categorical(y_val)
y_train_cat.shape

(246016, 20)
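
`to_categorical` turns each integer class index into a one-hot row, giving one column per class. A plain-Python sketch of the same encoding (Keras returns a NumPy float array rather than nested lists; `one_hot` is an illustrative helper):

```python
def one_hot(labels, num_classes=None):
    """Return one row per label with a single 1.0 at the label's index."""
    if num_classes is None:
        num_classes = max(labels) + 1  # same default rule as to_categorical
    return [[1.0 if col == label else 0.0 for col in range(num_classes)]
            for label in labels]

encoded = one_hot([5, 0, 19], num_classes=20)
print(len(encoded), len(encoded[0]))  # rows, columns
print(encoded[0].index(1.0))          # position of the hot entry in the first row
```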

In [16]:
from keras.models import Sequential
from keras.layers import Dense

In [17]:
n_cols = train_v.shape[1]
input_shape = (n_cols,)

In [18]:
model = Sequential()
model.add(Dense(20, activation='relu', input_shape=input_shape))
model.add(Dense(20, activation='softmax'))

In [19]:
model.compile(optimizer='adam', loss='categorical_crossentropy', 
              metrics=['accuracy'])

In [20]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 20)                5242900   
_________________________________________________________________
dense_2 (Dense)              (None, 20)                420       
Total params: 5,243,320
Trainable params: 5,243,320
Non-trainable params: 0
_________________________________________________________________
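
The parameter counts in the summary follow from the layer shapes: a `Dense` layer has `inputs * units` weights plus `units` biases. A quick check (the helper `dense_params` is illustrative):

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

hidden = dense_params(2**18, 20)   # hashed bag-of-words input -> 20 hidden units
output = dense_params(20, 20)      # 20 hidden units -> 20 topic classes
print(hidden, output, hidden + output)
```

Almost all parameters sit in the first layer, so the model size is driven by the number of input features rather than by the hidden layer width.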


In [21]:
model.fit(train_v,
          y_train_cat, 
          validation_data=(val_v, y_val_cat),
          batch_size=32, 
          epochs=4)

Train on 246016 samples, validate on 61504 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x23eec8b5908>

In [22]:
y_val_prob = model.predict(val_v)
y_val_predict = y_val_prob.argmax(axis=-1)

In [23]:
f1_score(y_val, y_val_predict, average='macro')

0.8641545611452752

## CountVectorizer

In [48]:
from sklearn.feature_extraction.text import CountVectorizer
cvectorizer = CountVectorizer(ngram_range=(1,2), max_features=2**18)
train_v = cvectorizer.fit_transform(train)
val_v = cvectorizer.transform(val)

In [52]:
n_cols = train_v.shape[1]
input_shape = (n_cols,)
n_cols

262144

In [55]:
model = Sequential()
model.add(Dense(20, activation='relu', input_shape=input_shape))
model.add(Dense(20, activation='softmax'))

In [56]:
model.compile(optimizer='adam', loss='categorical_crossentropy', 
              metrics=['accuracy'])

In [58]:
model.fit(train_v,
          y_train_cat, 
          validation_data=(val_v, y_val_cat),
          batch_size=32, 
          epochs=1)

Train on 246016 samples, validate on 61504 samples
Epoch 1/1


<keras.callbacks.History at 0x2410eb45a20>

In [59]:
y_val_prob = model.predict(val_v)
y_val_predict = y_val_prob.argmax(axis=-1)
f1_score(y_val, y_val_predict, average='macro')

0.860126142356495

# Word embedding, fully connected

In [24]:
import gensim



In [25]:
with open('path_saved_word2vec.txt') as f:
    # strip the trailing newline; the with-block closes the file automatically
    path = Path(f.readline().strip())

In [26]:
word2vec_path = path / "GoogleNews-vectors-negative300.bin.gz"
if 'word2vec' not in locals():
    word2vec = gensim.models.KeyedVectors.load_word2vec_format(word2vec_path, binary=True)

In [27]:
bills["tokens20"] = bills["tokens_complete"].apply(lambda x: x[:20])
all_words_complete = [word for tokens in bills["tokens20"] for word in tokens]
VOCAB_COMPLETE = sorted(list(set(all_words_complete)))
sentence_lengths = [len(tokens) for tokens in bills["tokens20"]]
print("%s words total, with a vocabulary size of %s" % (len(all_words_complete), len(VOCAB_COMPLETE)))
print("Max sentence length is %s" % max(sentence_lengths))

6892118 words total, with a vocabulary size of 39391
Max sentence length is 20


In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

In [28]:
EMBEDDING_DIM = 300
MAX_SEQUENCE_LENGTH = 20
VOCAB_SIZE = len(VOCAB_COMPLETE)

In [29]:
tokenizer = Tokenizer() #num_words=VOCAB_SIZE
tokenizer.fit_on_texts(bills["description"])
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 49622 unique tokens.


In [60]:
train_k = tokenizer.texts_to_sequences(X_train["description"])
val_k = tokenizer.texts_to_sequences(X_val["description"])

In [61]:
# rows of embedding_weights are vector embedding for each of the words in word_index
embedding_weights = np.zeros((len(word_index)+1, EMBEDDING_DIM))
for word,index in word_index.items():
    embedding_weights[index,:] = word2vec[word] if word in word2vec else np.random.rand(EMBEDDING_DIM)
print(embedding_weights.shape)

(49623, 300)


In [62]:
train = pad_sequences(train_k, maxlen=MAX_SEQUENCE_LENGTH)
val = pad_sequences(val_k, maxlen=MAX_SEQUENCE_LENGTH)
y_train_cat = to_categorical(y_train)
y_val_cat = to_categorical(y_val)
train.shape, y_train_cat.shape

((246016, 20), (246016, 20))
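
By default `pad_sequences` pads and truncates at the front (`padding='pre'`, `truncating='pre'`), so descriptions longer than 20 tokens keep their last 20 tokens, not their first. A plain-Python sketch of that default behaviour (`pad_pre` is a hypothetical helper, not a Keras function):

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]                                   # truncating='pre'
        padded.append([value] * (maxlen - len(kept)) + kept)   # padding='pre'
    return padded

print(pad_pre([[7, 8], [1, 2, 3, 4, 5, 6]], maxlen=4))
```

Passing `truncating='post'` to `pad_sequences` would keep the opening tokens instead, which is closer to how `tokens20` is built from the first 20 tokens.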

In [33]:
from keras.layers import Dense, Input, Flatten, Embedding

In [35]:
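# `embedding_layer` is not defined in this export; reconstructed from the summary
# below (length 20, trainable). Initialising from `embedding_weights` is an assumption.
embedding_layer = Embedding(len(word_index) + 1, EMBEDDING_DIM, weights=[embedding_weights],
                            input_length=MAX_SEQUENCE_LENGTH, trainable=True)
model = Sequential()
model.add(embedding_layer)
model.add(Flatten())
model.add(Dense(20, activation='relu'))
model.add(Dense(20, activation='softmax'))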

In [36]:
model.compile(optimizer='adam', loss='categorical_crossentropy', 
              metrics=['accuracy'])

In [37]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 20, 300)           14886900  
_________________________________________________________________
flatten_1 (Flatten)          (None, 6000)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 20)                120020    
_________________________________________________________________
dense_4 (Dense)              (None, 20)                420       
Total params: 15,007,340
Trainable params: 15,007,340
Non-trainable params: 0
_________________________________________________________________


In [38]:
model.fit(train,
          y_train_cat, 
          validation_data=(val, y_val_cat),
          batch_size=128, 
          epochs=10)

Train on 246016 samples, validate on 61504 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x23eeba47748>

# CNN - global max pooling

Based on Kim (2014), "Convolutional Neural Networks for Sentence Classification":

https://arxiv.org/pdf/1408.5882.pdf
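
The architecture slides filters of several widths over the sequence of word vectors and keeps only the largest activation of each filter (global max pooling), so every filter contributes one feature regardless of sentence length. A toy sketch in plain Python for a single filter; the inputs, weights and the helper `conv1d_relu` are made up for illustration:

```python
def conv1d_relu(sequence, kernel, bias=0.0):
    """Valid 1D convolution (cross-correlation) of one filter, followed by ReLU."""
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = bias + sum(x * w
                           for vec, wvec in zip(window, kernel)
                           for x, w in zip(vec, wvec))
        outputs.append(max(0.0, total))
    return outputs

# Four "words" with 2-dimensional embeddings and one filter spanning 2 words.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [[1.0, -1.0], [0.5, 0.5]]

feature_map = conv1d_relu(sequence, kernel)
pooled = max(feature_map)  # global max pooling: one number per filter
print(feature_map, pooled)
```

With 100 filters for each of the widths 3, 4 and 5, the same procedure yields 300 pooled features per description.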

In [110]:
from keras.models import Sequential, Model

from keras.layers import MaxPooling1D, GlobalMaxPooling1D
from keras.layers import Conv1D
from keras.layers import Dropout
from keras.layers import Concatenate

In [99]:
max([len(x) for x in train_k])

275

In [100]:
MAX_SEQUENCE_LENGTH = 275

In [101]:
train = pad_sequences(train_k, maxlen=MAX_SEQUENCE_LENGTH)
val = pad_sequences(val_k, maxlen=MAX_SEQUENCE_LENGTH)

In [102]:
train.shape

(246016, 275)

In [103]:
graph_in = Input(shape=(MAX_SEQUENCE_LENGTH, EMBEDDING_DIM))

In [104]:
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_weights],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

In [105]:
filter_sizes = (3, 4, 5)
num_filters = 100

convs = []
for fsz in filter_sizes:
    conv = Conv1D(num_filters, fsz, activation='relu')(graph_in)
    pool = GlobalMaxPooling1D()(conv)
    convs.append(pool)

out = Concatenate()(convs)
graph = Model(inputs=graph_in, outputs=out)



In [106]:
model = Sequential()
model.add(embedding_layer)
model.add(graph)
model.add(Dense(20, activation='softmax'))

In [107]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 275, 300)          14886900  
_________________________________________________________________
model_2 (Model)              (None, 300)               360300    
_________________________________________________________________
dense_11 (Dense)             (None, 20)                6020      
Total params: 15,253,220
Trainable params: 366,320
Non-trainable params: 14,886,900
_________________________________________________________________
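
The 360,300 parameters of the convolutional block can be reproduced by hand: a `Conv1D` layer has `kernel_size * input_channels * filters` weights plus one bias per filter, and the three pooled outputs of 100 features each are concatenated into 300. A quick check with the values used in this notebook:

```python
EMBEDDING_DIM = 300
num_filters = 100
filter_sizes = (3, 4, 5)

conv_params = sum(fsz * EMBEDDING_DIM * num_filters + num_filters
                  for fsz in filter_sizes)
dense_params = len(filter_sizes) * num_filters * 20 + 20  # 300 features -> 20 classes
embedding_params = 49623 * EMBEDDING_DIM                   # frozen word2vec matrix

print(conv_params, dense_params, conv_params + dense_params)
print(embedding_params + conv_params + dense_params)
```

Only the convolution and output layers are trained here, about 2% of the total, because the embedding matrix is frozen.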


In [108]:
model.compile(optimizer='adam', loss='categorical_crossentropy', 
              metrics=['accuracy'])

In [109]:
model.fit(train,
          y_train_cat, 
          validation_data=(val, y_val_cat),
          batch_size=128, 
          epochs=5)

Train on 246016 samples, validate on 61504 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x23ef8901630>

## dropout

In [111]:
model = Sequential()
model.add(embedding_layer)
# `graph` is the same object trained above, so its convolution weights are not re-initialised here
model.add(graph)
model.add(Dropout(0.5))
model.add(Dense(20, activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy', 
              metrics=['accuracy'])

In [112]:
model.fit(train,
          y_train_cat, 
          validation_data=(val, y_val_cat),
          batch_size=128, 
          epochs=5)

Train on 246016 samples, validate on 61504 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x240fe356a58>
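
`Dropout(0.5)` zeroes each pooled feature with probability 0.5 during training and rescales the survivors by 1/(1 - rate), so the expected activation is unchanged; at prediction time the layer passes its input through untouched. A plain-Python sketch of this inverted-dropout scheme (the helper `dropout` and its inputs are illustrative):

```python
import random

def dropout(values, rate, training=True, rng=random):
    """Inverted dropout: drop with probability `rate`, scale the rest by 1/(1-rate)."""
    if not training or rate == 0.0:
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(42)
features = [1.0] * 10
print(dropout(features, rate=0.5, rng=rng))         # each entry is 0.0 or 2.0
print(dropout(features, rate=0.5, training=False))  # unchanged at inference
```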

# CNN with window

# RNN

# Bert Transformer