#### Data description
The dataset has three columns: label (party name), Twitter handle, and tweet text.


#### Problem Description:

Design a feedforward deep neural network to predict the political party, using either PyTorch or TensorFlow.
Build two models:

1. Without using the handle

2. Using the handle


#### Deliverables

- Report the performance on the test set.

- Try multiple models with different hyperparameters. Present the results of each model on the test set. No need to create a dev set.

- Experiment with:
    - L2 and dropout regularization techniques
    - SGD, RMSProp and Adam optimization techniques



- Creating a fixed-sized vocabulary: Give a unique id to each word in your selected vocabulary and use it as the input to the network

    - Option 1: Feedforward networks can only handle fixed-sized inputs. You can choose a fixed number K of words from the tweet text (e.g. the first K words, K randomly selected words, etc.). K can be a hyperparameter.

    - Option 2: You can choose the top N (e.g. N=1000) most frequent words from the dataset and use an N-sized input layer. If a word is present in a tweet, pass its id, otherwise pass 0.
    
    -  Clearly state your design choices and assumptions. Think about the pros and cons of each option.
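
The two encoding options can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's pipeline: the helper names (`build_vocab`, `encode_first_k`, `encode_top_n`), the toy tweets, and the convention of reserving id 0 for padding and id 1 for out-of-vocabulary words are all assumptions made for the example.

```python
from collections import Counter

PAD_ID = 0  # assumption: id 0 is reserved for padding / "word absent"
OOV_ID = 1  # assumption: id 1 marks words outside the vocabulary


def build_vocab(tweets, max_words):
    """Map the most frequent words to ids starting at 2."""
    counts = Counter(word for tweet in tweets for word in tweet.lower().split())
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common(max_words))}


def encode_first_k(tweet, vocab, k):
    """Option 1: ids of the first k words, padded with PAD_ID to length k."""
    ids = [vocab.get(word, OOV_ID) for word in tweet.lower().split()[:k]]
    return ids + [PAD_ID] * (k - len(ids))


def encode_top_n(tweet, vocab):
    """Option 2: one slot per vocabulary word, holding its id if present, else 0."""
    present = set(tweet.lower().split())
    # Slots are ordered by id, so slot j corresponds to the word with id j + 2.
    return [word_id if word in present else PAD_ID
            for word, word_id in sorted(vocab.items(), key=lambda item: item[1])]


tweets = ["tax cuts help families", "health care for all families"]
vocab = build_vocab(tweets, max_words=5)
first_k = encode_first_k(tweets[0], vocab, k=6)  # length 6, last two slots are padding
top_n = encode_top_n(tweets[0], vocab)           # length 5, one slot per vocabulary word
```

Option 1 keeps word order but discards everything after the K-th word, while Option 2 sees the whole tweet but loses order and needs an input layer as wide as the vocabulary. The models below follow Option 1 with K = 25.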

 

<b> Tabulate your results, either at the end of the code file or in the text box on the submission page. The final result should have:</b>

1. Experiment description

2. Hyperparameters used and their values

3. Performance on the test set

 

### Imports

In [1]:
from keras import layers, losses
from keras.models import Sequential
from keras.preprocessing import text
from keras.utils import np_utils
from random import randrange

import keras
import keras.backend as K
import numpy as np
import os
import pandas as pd
import random
import re
import string
import tensorflow as tf
import warnings
warnings.filterwarnings('ignore')

PATH = r"C:\Users\samue\Documents\Applied Data Science\INFO-H518 Deep Learning\Assignments\A3\Input"

### Data Retrieval 

In [21]:
# Grab the data
train = pd.read_pickle(PATH + r'\train_tokenized.pickle').dropna().sample(frac=1)
train_vocab = pd.read_csv(PATH + r'\train_vocab_frequency.csv', index_col=0) \
    .dropna().drop(columns='Count')
test = pd.read_pickle(PATH + r'\test_tokenized.pickle').dropna().sample(frac=1)
test_vocab = pd.read_csv(PATH + r'\test_vocab_frequency.csv', index_col=0) \
    .dropna().drop(columns='Count')

train['Party'] = pd.Categorical(train.Party)
train['Party'] = train.Party.cat.codes
test['Party'] = pd.Categorical(test.Party)
test['Party'] = test.Party.cat.codes

# pd.concat replaces Series.append, which was removed in pandas 2.0
vocab_size = pd.concat([train_vocab['Terms'], test_vocab['Terms']]).unique().shape[0]

# Validation split
idx = int(train.shape[0]*.8)
valid = train.iloc[idx:]
train = train.iloc[:idx]

# Format into Tensors
train_dataset = tf.data.Dataset \
    .from_tensor_slices(
        (train['Tweet'].to_numpy(), train['Party'].to_numpy())
    )

valid_dataset = tf.data.Dataset \
    .from_tensor_slices(
        (valid['Tweet'].to_numpy(), valid['Party'].to_numpy())
    )

test_dataset = tf.data.Dataset \
    .from_tensor_slices(
        (test['Tweet'].to_numpy(), test['Party'].to_numpy())
    )

### Input Pipeline

In [22]:
# Set up the input pipeline
BUFFER_SIZE = 5000
BATCH_SIZE = 32

train_dataset = train_dataset \
    .shuffle(BUFFER_SIZE) \
    .batch(BATCH_SIZE)

valid_dataset = valid_dataset \
    .shuffle(BUFFER_SIZE) \
    .batch(BATCH_SIZE)

test_dataset = test_dataset \
    .shuffle(BUFFER_SIZE) \
    .batch(BATCH_SIZE)

print('Train Example')
for tweet, label in train_dataset.take(1):
    print(f"Tweet: {tweet[0]} \nLabel: {label[0]}")

print('\nValidation Example')
for tweet, label in valid_dataset.take(1):
    print(f"Tweet: {tweet[0]} \nLabel: {label[0]}")

print('\nTest Example')
for tweet, label in test_dataset.take(1):
    print(f"Tweet: {tweet[0]} \nLabel: {label[0]}")

Train Example
Tweet: b'Fun to be at the Congressional Baseball game with my family but my friend Steve Scalise was deeply missed. https://t.co/6MFTima9YM' 
Label: 1

Validation Example
Tweet: b'On this day, holy for so many, let us be joined by our common values https://t.co/iO9FmQuYgW' 
Label: 0

Test Example
Tweet: b"I'm chairing the Tactical Air &amp; Land Forces #FY19NDAA markup. You can watch live here: https://t.co/omDPZCTzyc" 
Label: 1


In [23]:
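def cleaned_data(input_data):
    # Drop everything from the first "https" to the end of the tweet
    stripped_url_ending = tf.strings.regex_replace(input_data, "https(.*)", '')
    # Replace every character that is not a letter or '#' with a space
    words_and_tags = tf.strings.regex_replace(stripped_url_ending, "[^#A-Za-z]", ' ')
    # Remove any remaining punctuation except '#'
    return tf.strings.regex_replace(
        words_and_tags,
        '[%s]' % re.escape(string.punctuation[:2]+string.punctuation[3:]),
        ''
    )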

for tweet, label in test_dataset.take(1):
    print(f"Tweet: {cleaned_data(tweet[0])} \nLabel: {label[0]}")

Tweet: b'On  cspan LIVE soon  I ll be on the floor to talk about my H Con Res      Encouraging reunions of divided Korean American families' 
Label: 0
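
To see what the standardization does without running TensorFlow, the same three substitutions can be reproduced with Python's `re` module. This is a rough sketch: `cleaned_text` is a hypothetical stand-in for `cleaned_data`, the sample string is the test tweet printed earlier, and it assumes these simple patterns behave the same in Python's regex engine as in TensorFlow's.

```python
import re
import string


def cleaned_text(tweet: str) -> str:
    """Plain-Python counterpart of the three regex_replace steps in cleaned_data."""
    # 1. Drop everything from the first "https" to the end of the line.
    no_url = re.sub(r"https(.*)", "", tweet)
    # 2. Replace every character that is not a letter or '#' with a space.
    words_and_tags = re.sub(r"[^#A-Za-z]", " ", no_url)
    # 3. Remove punctuation other than '#'. Step 2 already removed it all,
    #    so this pass changes nothing here.
    punctuation = string.punctuation.replace("#", "")
    return re.sub("[%s]" % re.escape(punctuation), "", words_and_tags)


sample = ("I'm chairing the Tactical Air &amp; Land Forces #FY19NDAA markup. "
          "You can watch live here: https://t.co/omDPZCTzyc")
cleaned = cleaned_text(sample)
print(cleaned)
```

Because digits are replaced by spaces, a hashtag such as `#FY19NDAA` is split into `#FY` and `NDAA`, and case is preserved since no lowercasing step is applied.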


### Encode the Data

In [24]:
# Parameters
k = 25

# Vectorization Layer
vectorize_layer = layers.TextVectorization(
    standardize=cleaned_data,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=k
)

train_features = train_dataset.map(lambda x, y: x)
valid_features = valid_dataset.map(lambda x, y: x)
test_features = test_dataset.map(lambda x, y: x) 

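# Note: each adapt() call rebuilds the vocabulary from scratch, so only the
# last call (on the test features) determines the final vocabulary.
vectorize_layer.adapt(train_features)
vectorize_layer.adapt(valid_features)
vectorize_layer.adapt(test_features)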

def vectorize_me(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label

text_batch, label_batch = next(iter(train_dataset))
first_review, first_label = text_batch[0], label_batch[0]
print("Tweet: ", first_review.numpy())
print("Label: ", first_label.numpy())
print("Encoded Tweet: ", vectorize_me(first_review, first_label))

training = train_dataset.map(vectorize_me).cache().prefetch(tf.data.AUTOTUNE)
validation = valid_dataset.map(vectorize_me).prefetch(tf.data.AUTOTUNE)
testing = test_dataset.map(vectorize_me).prefetch(tf.data.AUTOTUNE)

Tweet:  b'RT @IRSnews: #IRSTaxTip: If you didn\xe2\x80\x99t get your #IRS #tax refund yet, check its status with \xe2\x80\x9cWhere\xe2\x80\x99s My Refund?\xe2\x80\x9d https://t.co/Z07vIsuvQf'
Label:  1
Encoded Tweet:  (<tf.Tensor: shape=(1, 25), dtype=int64, numpy=
array([[    9,  5943,     1,   171,    18,  1196,    32,   160,    46,
         2884, 24480, 13160,   626,  1158,   167,  1282,    14,  3314,
           11,    73,     1,     0,     0,     0,     0]], dtype=int64)>, <tf.Tensor: shape=(), dtype=int8, numpy=1>)


# Select K-words from the tweet

## Model #1 | Dropout regularization with standard Adam optimization

### M1 | Create the model

In [6]:
epochs = 5
embedding_dim = 32

# Structure
k_words = Sequential([
    layers.Embedding(vocab_size, embedding_dim),
    layers.Dropout(0.1),
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.3),
    layers.Dense(1)
])

k_words.compile(
    loss=losses.BinaryCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(
        learning_rate=0.001,
        beta_1=0.9,
        beta_2=0.999,
        epsilon=1e-07,
        amsgrad=False,
        name='Adam'
    ),
    metrics=tf.metrics.BinaryAccuracy(threshold=0.0)
)
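
The final `Dense(1)` layer has no activation, so the model outputs raw logits; that is why the loss uses `from_logits=True` and the accuracy metric uses `threshold=0.0`. The small sketch below, with hypothetical helper names and made-up logit values, checks that thresholding a logit at 0 gives the same class as thresholding its sigmoid probability at 0.5.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_from_logit(z: float) -> int:
    """Class 1 when the logit is positive, mirroring BinaryAccuracy(threshold=0.0)."""
    return int(z > 0.0)


def predict_from_probability(z: float) -> int:
    """Class 1 when the sigmoid probability exceeds 0.5."""
    return int(sigmoid(z) > 0.5)


logits = [-2.0, -0.1, 0.3, 4.0]  # illustrative values, not model outputs
from_logits = [predict_from_logit(z) for z in logits]
from_probs = [predict_from_probability(z) for z in logits]
```

Both lists agree because the sigmoid is increasing and equals 0.5 exactly at a logit of 0.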

### M1 | Training

In [7]:
history = k_words.fit(
    training,
    validation_data=validation,
    epochs=epochs
)
k_words.save(r'C:\Users\samue\Documents\Applied Data Science\INFO-H518 Deep Learning\Assignments\A3\Models\M1')

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
INFO:tensorflow:Assets written to: C:\Users\samue\Documents\Applied Data Science\INFO-H518 Deep Learning\Assignments\A3\Models\M1\assets


### M1 | Testing

In [8]:
loss, accuracy = k_words.evaluate(testing)
print(f"Loss: {loss} \nAccuracy: {accuracy}")

Loss: 0.4155210554599762 
Accuracy: 0.8021273612976074


Epochs = 5 

K = 25

Layers = [embedding | dropout(0.1) | globalMaxPooling1D | dropout(0.3) | dense(1)]

Embedding_dim = 32 

Optimizer = 'adam' (standard settings)

Loss = .4155

Accuracy = 0.8021

## Model #2 | L2 Regularization instead of Dropout 

### M2 | Create the Model

In [9]:
# Parameters
epochs = 5
embedding_dim = 32

# Structure
kr_words = Sequential([
    layers.Embedding(vocab_size, embedding_dim),
    layers.GlobalAveragePooling1D(),
    layers.Dense(16, activation='swish', kernel_regularizer='l2'),
    layers.Dense(1)
])

kr_words.compile(
    loss=losses.BinaryCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(
        learning_rate=0.001,
        beta_1=0.9,
        beta_2=0.999,
        epsilon=1e-07,
        amsgrad=False,
        name='Adam'
    ),
    metrics=tf.metrics.BinaryAccuracy(threshold=0.0)
)

### M2 | Training

In [10]:
history = kr_words.fit(
    training,
    validation_data=validation,
    epochs=epochs
)
kr_words.save(r'C:\Users\samue\Documents\Applied Data Science\INFO-H518 Deep Learning\Assignments\A3\Models\M2')

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
INFO:tensorflow:Assets written to: C:\Users\samue\Documents\Applied Data Science\INFO-H518 Deep Learning\Assignments\A3\Models\M2\assets


### M2 | Testing

In [11]:
loss, accuracy = kr_words.evaluate(testing)
print(f"Loss: {loss} \nAccuracy: {accuracy}")

Loss: 0.4388076663017273 
Accuracy: 0.800305962562561


Epochs = 5 

K = 25 

Layers = [embedding | globalAvgPooling1D | dense(16, activation='swish', regularizer='l2') | dense(1)]

Embedding_dim = 32 

Optimizer = 'adam' (standard settings)

Loss = BinaryCrossEntropy : 0.4388

Accuracy = 0.8003
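
Model 1 relies on dropout and Model 2 on an L2 kernel penalty. As a rough illustration of what each does, the sketch below computes both by hand. The function names and numbers are made up for the example; the factor 0.01 is used because that is the default strength Keras applies when the regularizer is given as the string `'l2'`.

```python
import random


def l2_penalty(weights, l2=0.01):
    """L2 term added to the loss: l2 * sum of squared weights."""
    return l2 * sum(w * w for w in weights)


def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`,
    and scale the survivors by 1 / (1 - rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


weights = [0.5, -1.0, 2.0]
penalty = l2_penalty(weights)  # 0.01 * (0.25 + 1.0 + 4.0)

rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0] * 10
dropped = dropout(activations, rate=0.3, rng=rng)
```

The L2 term pushes weights toward zero at every update, whereas dropout only acts during training and is switched off at evaluation time.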

## Model #3: K-words with RMSProp

### M3 | Create the structure

In [42]:
# Parameters
epochs = 5
embedding_dim = 32

# Structure
handle_kr_words = Sequential([
    layers.Embedding(vocab_size, embedding_dim),
    layers.GlobalAveragePooling1D(),
    layers.Dense(8, activation='swish', kernel_regularizer='l2'),
    layers.Dense(1)
])

handle_kr_words.compile(
    loss=losses.BinaryCrossentropy(from_logits=True),
    optimizer=keras.optimizers.RMSprop(
        learning_rate=0.0012,
        rho=0.999,
        momentum=0.0,
        epsilon=1e-07,
        centered=False,
        name='RMSprop'
    ),
    metrics=tf.metrics.BinaryAccuracy(threshold=0.0)
)

### M3 | Training

In [43]:
history = handle_kr_words.fit(
    training,
    validation_data=validation,
    epochs=epochs
)
handle_kr_words.save(r'C:\Users\samue\Documents\Applied Data Science\INFO-H518 Deep Learning\Assignments\A3\Models\M3')

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
INFO:tensorflow:Assets written to: C:\Users\samue\Documents\Applied Data Science\INFO-H518 Deep Learning\Assignments\A3\Models\M3\assets


### M3 | Testing

In [44]:
loss, accuracy = handle_kr_words.evaluate(testing)
print(f"Loss: {loss} \nAccuracy: {accuracy}")

Loss: 0.442028284072876 
Accuracy: 0.7968818545341492


Epochs = 5 

K = 25

Layers = [embedding | globalAvgPooling1D | dense(8, activation='swish', regularizer='l2') | dense(1)]

Embedding_dim = 32 

Optimizer = RMSProp

Loss = .4420

Accuracy = 0.7969
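
Models 2 and 3 differ mainly in the optimizer. The sketch below writes out textbook versions of the SGD, RMSProp and Adam update rules on a one-dimensional toy loss. It is only meant to show how the updates differ: the toy loss, the function names and the hyperparameter values are assumptions for the example, and library implementations may differ in details such as where epsilon is added.

```python
import math


def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def run_sgd(w, lr=0.1, steps=300):
    for _ in range(steps):
        w -= lr * grad(w)
    return w


def run_rmsprop(w, lr=0.1, rho=0.9, eps=1e-7, steps=300):
    v = 0.0  # running average of squared gradients
    for _ in range(steps):
        g = grad(w)
        v = rho * v + (1.0 - rho) * g * g
        w -= lr * g / (math.sqrt(v) + eps)
    return w


def run_adam(w, lr=0.1, beta_1=0.9, beta_2=0.999, eps=1e-7, steps=300):
    m, v = 0.0, 0.0  # running averages of the gradient and its square
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta_1 * m + (1.0 - beta_1) * g
        v = beta_2 * v + (1.0 - beta_2) * g * g
        m_hat = m / (1.0 - beta_1 ** t)  # bias correction for the zero start
        v_hat = v / (1.0 - beta_2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


results = {"SGD": run_sgd(0.0), "RMSprop": run_rmsprop(0.0), "Adam": run_adam(0.0)}
```

All three runs start at `w = 0` and move toward the minimum at `w = 3`. RMSProp and Adam rescale each step by a running estimate of the gradient magnitude, which is why their learning rates are not directly comparable to an SGD learning rate.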

## Model #4: K-words with Handle

### Data Retrieval | Input Pipeline | Encode the data

In [15]:
# DATA RETRIEVAL

def with_handle(x):
    x['Tweet'] = f"{x['Handle']} " + x['Tweet']
    return x
train_ = train.apply(with_handle, 1)
valid_ = valid.apply(with_handle, 1)
test_ = test.apply(with_handle, 1)

# pd.concat replaces Series.append, which was removed in pandas 2.0
vocab_size = pd.concat([
    train_vocab['Terms'],
    test_vocab['Terms'],
    pd.Series(train_.Handle.unique()),
    pd.Series(valid_.Handle.unique()),
    pd.Series(test_.Handle.unique()),
]).unique().shape[0]

# Format into Tensors
train_dataset = tf.data.Dataset \
    .from_tensor_slices(
        (train_['Tweet'].to_numpy(), train_['Party'].to_numpy())
    )

valid_dataset = tf.data.Dataset \
    .from_tensor_slices(
        (valid_['Tweet'].to_numpy(), valid_['Party'].to_numpy())
    )

test_dataset = tf.data.Dataset \
    .from_tensor_slices(
        (test_['Tweet'].to_numpy(), test_['Party'].to_numpy())
    )


# INPUT PIPELINE

# Set up the input pipeline
BUFFER_SIZE = 5000
BATCH_SIZE = 32

train_dataset = train_dataset \
    .shuffle(BUFFER_SIZE) \
    .batch(BATCH_SIZE)

valid_dataset = valid_dataset \
    .shuffle(BUFFER_SIZE) \
    .batch(BATCH_SIZE)

test_dataset = test_dataset \
    .shuffle(BUFFER_SIZE) \
    .batch(BATCH_SIZE)


# ENCODE THE DATA

# Parameters
k = 25

# Vectorization Layer
vectorize_layer = layers.TextVectorization(
    standardize=cleaned_data,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=k
)

train_features = train_dataset.map(lambda x, y: x)
valid_features = valid_dataset.map(lambda x, y: x)
test_features = test_dataset.map(lambda x, y: x) 

# Note: as before, only the last adapt() call determines the final vocabulary.
vectorize_layer.adapt(train_features)
vectorize_layer.adapt(valid_features)
vectorize_layer.adapt(test_features)

def vectorize_me(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label

text_batch, label_batch = next(iter(train_dataset))
first_review, first_label = text_batch[0], label_batch[0]
print("Tweet: ", first_review.numpy())
print("Label: ", first_label.numpy())
print("Encoded Tweet: ", vectorize_me(first_review, first_label))

training = train_dataset.map(vectorize_me).cache().prefetch(tf.data.AUTOTUNE)
validation = valid_dataset.map(vectorize_me).prefetch(tf.data.AUTOTUNE)
testing = test_dataset.map(vectorize_me).prefetch(tf.data.AUTOTUNE)

Tweet:  b'RepScottPeters #TeamPeters met with @FamilyHealthSD and Essential Health Access to disucss the impotance of Title X funding. I\xe2\x80\x99m p\xe2\x80\xa6 https://t.co/6YwN9NcSOp'
Label:  0
Encoded Tweet:  (<tf.Tensor: shape=(1, 25), dtype=int64, numpy=
array([[ 1099,  6280,   268,    14,     1,     5, 22393,   403,  2494,
            3,     1,     2,     1,     4,  4474,  4448,   184,    10,
           48,   393,     0,     0,     0,     0,     0]], dtype=int64)>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)


In [16]:
# Structure
embedding_dim = 32
epochs = 5

handle_with_k_words = Sequential([
    layers.Embedding(vocab_size, embedding_dim),
    layers.GlobalAveragePooling1D(),
    layers.Dense(16, activation='swish', kernel_regularizer='l2'),
    layers.Dense(1)
])

handle_with_k_words.compile(
    loss=losses.BinaryCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(
        learning_rate=0.001,
        beta_1=0.9,
        beta_2=0.999,
        epsilon=1e-07,
        amsgrad=False,
        name='Adam'
    ),
    metrics=tf.metrics.BinaryAccuracy(threshold=0.0)
)

In [17]:
epochs = 5
history = handle_with_k_words.fit(
    training,
    validation_data=validation,
    epochs=epochs
)
handle_with_k_words.save(r'C:\Users\samue\Documents\Applied Data Science\INFO-H518 Deep Learning\Assignments\A3\Models\M4')

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
INFO:tensorflow:Assets written to: C:\Users\samue\Documents\Applied Data Science\INFO-H518 Deep Learning\Assignments\A3\Models\M4\assets


In [18]:
loss, accuracy = handle_with_k_words.evaluate(testing)
print(f"Loss: {loss} \nAccuracy: {accuracy}")

Loss: 0.035931963473558426 
Accuracy: 0.9964301586151123
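
The jump from roughly 0.80 to 0.996 accuracy is consistent with the handle acting as a near-unique key for the party, provided the same handles appear in both the training and test sets. The notebook does not check this, so it is an assumption. A quick way to test it would be a lookup-table baseline like the sketch below, shown here on hypothetical rows rather than the real `train_` and `test_` frames.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (handle, party code). The real data lives in train_ / test_.
train_rows = [("RepA", 0), ("RepA", 0), ("SenB", 1), ("SenB", 1), ("RepC", 0)]
test_rows = [("RepA", 0), ("SenB", 1), ("RepC", 0), ("RepD", 1)]

# Majority party per handle seen during training.
votes = defaultdict(Counter)
for handle, party in train_rows:
    votes[handle][party] += 1
lookup = {handle: counts.most_common(1)[0][0] for handle, counts in votes.items()}

# Share of test rows whose handle also appears in the training rows.
seen = [handle in lookup for handle, _ in test_rows]
overlap = sum(seen) / len(test_rows)

# Accuracy of the lookup table on the test rows it can answer.
answered = [(lookup[h], p) for h, p in test_rows if h in lookup]
lookup_accuracy = sum(pred == true for pred, true in answered) / len(answered)
```

If the overlap and lookup accuracy on the real data are both close to 1, the network in Model 4 is mostly memorizing handles rather than learning from tweet content.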


# Recapitulation:

## Predict the associated political party of a user based only on the content of the user's tweet

## M1:

Epochs = 5 

K = 25

Layers = [embedding | dropout(0.1) | globalMaxPooling1D | dropout(0.3) | dense(1)]

Embedding_dim = 32 

Optimizer = 'adam' (standard settings)

Loss = .4155

Accuracy = 0.8021

## M2:

Epochs = 5 

K = 25 

Layers = [embedding | globalAvgPooling1D | dense(16, activation='swish', regularizer='l2') | dense(1)]

Embedding_dim = 32 

Optimizer = 'adam' (standard settings)

Loss = BinaryCrossEntropy : 0.4388

Accuracy = 0.8003

## M3:

Epochs = 5 

K = 25

Layers = [embedding | globalAvgPooling1D | dense(8, activation='swish', regularizer='l2') | dense(1)]

Embedding_dim = 32 

Optimizer = RMSProp

Loss = .4420

Accuracy = 0.7969

## Predict the associated political party of a user based on the user's Handle and the content of the user's tweet.

## M4:
Use K words PLUS the user's Handle from a tweet to predict the user's political party.

Epochs = 5 

K = 25 

Layers = [embedding | globalAvgPooling1D | dense(16, activation='swish', regularizer='l2') | dense(1)]

Embedding_dim = 32 

Optimizer = 'adam' (standard settings)

Loss = 0.0359

Accuracy = 0.9964
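
Since the assignment asks for tabulated results, the figures above can be collected into a single table. The snippet below is one possible way to print it. The numbers are copied from the test cells in this notebook, while the column names and layout are choices made for this example.

```python
results = [
    # (model, regularization, optimizer, test loss, test accuracy)
    ("M1", "dropout(0.1, 0.3)", "Adam", 0.4155, 0.8021),
    ("M2", "L2", "Adam", 0.4388, 0.8003),
    ("M3", "L2", "RMSprop", 0.4420, 0.7969),
    ("M4 (with handle)", "L2", "Adam", 0.0359, 0.9964),
]

header = ("Model", "Regularization", "Optimizer", "Test loss", "Test accuracy")
rows = [header] + [(m, r, o, f"{loss:.4f}", f"{acc:.4f}") for m, r, o, loss, acc in results]

# Pad every column to the width of its longest cell.
widths = [max(len(row[i]) for row in rows) for i in range(len(header))]
lines = ["  ".join(cell.ljust(width) for cell, width in zip(row, widths)) for row in rows]
print("\n".join(lines))

# Best of the three tweet-only models by test accuracy.
best_tweet_only = max(results[:3], key=lambda row: row[4])[0]
```

All four models share K = 25, an embedding dimension of 32, and 5 epochs, so those settings are left out of the table.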
