#### Data description
This dataset has three columns: label (party name), Twitter handle, and tweet text.


#### Problem Description:

Design a feed-forward deep neural network to predict the political party using PyTorch or TensorFlow.
Build two models:

1. Without using the handle

2. Using the handle


#### Deliverables

- Report the performance on the test set.

- Try multiple models with different hyperparameters. Present the results of each model on the test set. There is no need to create a dev set.

- Experiment with:
    - L2 and dropout regularization techniques
    - SGD, RMSProp, and Adam optimization techniques



- Creating a fixed-sized vocabulary: Give a unique id to each word in your selected vocabulary and use it as the input to the network

    - Option 1: Feed-forward networks can only handle fixed-size inputs. You can choose a fixed number K of words from the tweet text (e.g. the first K words, K randomly selected words, etc.). K can be a hyperparameter.

    - Option 2: You can choose the top N (e.g. N=1000) most frequent words from the dataset and use an N-sized input layer. If a word is present in a tweet, pass its id; otherwise pass 0.
    
    -  Clearly state your design choices and assumptions. Think about the pros and cons of each option.
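
The two input encodings described above can be sketched in plain Python. This is a minimal illustration, not the required implementation: the helper names (`build_vocab`, `encode_first_k`, `encode_top_n`), the toy tweets, and the choice to reserve id 0 for both padding and out-of-vocabulary words are all assumptions made for the example.

```python
from collections import Counter

PAD_ID = 0  # assumption: id 0 means "padding", "unknown word", or "word absent"


def build_vocab(tokenized_tweets, n):
    """Give ids 1..n to the n most frequent words in the dataset."""
    counts = Counter(word for tweet in tokenized_tweets for word in tweet)
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common(n))}


def encode_first_k(tokens, vocab, k):
    """Option 1: ids of the first k words, padded with PAD_ID up to length k."""
    ids = [vocab.get(word, PAD_ID) for word in tokens[:k]]
    return ids + [PAD_ID] * (k - len(ids))


def encode_top_n(tokens, vocab):
    """Option 2: one slot per vocabulary word, holding its id if present, else 0."""
    present = set(tokens)
    # Sort by id so that slot j always corresponds to the word with id j + 1.
    by_id = sorted(vocab.items(), key=lambda item: item[1])
    return [idx if word in present else PAD_ID for word, idx in by_id]


# Toy data, invented purely for illustration.
tweets = [["vote", "for", "jobs"], ["jobs", "and", "taxes"], ["vote", "today"]]
vocab = build_vocab(tweets, n=3)
option1 = encode_first_k(tweets[2], vocab, k=4)  # length is always k
option2 = encode_top_n(tweets[2], vocab)         # length is always len(vocab)
```

Option 1 keeps word order but discards everything after the K-th word, while Option 2 covers the whole tweet but loses order and ignores words outside the top N.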

 

<b> Tabulate your results, either at the end of the code file or in the text box on the submission page. The final result should have:</b>

1. Experiment description

2. Hyperparameters used and their values

3. Performance on the test set
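
One possible way to organize the results table is to enumerate every configuration up front and fill in the test metric after each run. The sketch below is only a bookkeeping aid. The grid values for L2 strength and dropout rate are hypothetical placeholders, the helper names are invented, and `test_accuracy` is deliberately left as `None` until a real evaluation has been run.

```python
from itertools import product

# Hypothetical grid: placeholder values, not recommendations.
OPTIMIZERS = ["sgd", "rmsprop", "adam"]
L2_STRENGTHS = [0.0, 1e-4]
DROPOUT_RATES = [0.0, 0.2]


def build_experiment_rows():
    """Return one row per (handle usage, optimizer, L2, dropout) combination."""
    rows = []
    grid = product([False, True], OPTIMIZERS, L2_STRENGTHS, DROPOUT_RATES)
    for use_handle, optimizer, l2, dropout in grid:
        rows.append({
            "experiment": "with handle" if use_handle else "without handle",
            "optimizer": optimizer,
            "l2": l2,
            "dropout": dropout,
            "test_accuracy": None,  # fill in after evaluating on the test set
        })
    return rows


def format_table(rows):
    """Render the rows as a simple pipe-separated text table."""
    headers = list(rows[0].keys())
    lines = [" | ".join(headers)]
    for row in rows:
        lines.append(" | ".join(str(row[h]) for h in headers))
    return "\n".join(lines)


rows = build_experiment_rows()
table = format_table(rows)
```

Each row carries the three required pieces: the experiment description, the hyperparameter values, and a slot for the test-set performance.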

 

In [16]:
from keras import layers, losses
from keras.models import Sequential
from keras.preprocessing import text
from keras.utils import np_utils

import keras.backend as K
import numpy as np
import os
import pandas as pd
import tensorflow as tf

print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))


PATH = r"C:\Users\samue\Documents\Applied Data Science\INFO-H518 Deep Learning\Assignments\A3\Input"

train = pd.read_pickle(PATH + r'\train_tokenized.pickle').dropna().sample(frac=1)
train_vocab = pd.read_csv(PATH + r'\train_vocab.csv', header=0).dropna()
test = pd.read_pickle(PATH + r'\test_tokenized.pickle').dropna().sample(frac=1)
test_vocab = pd.read_csv(PATH + r'\test_vocab.csv', header=0).dropna()

Num GPUs Available:  0


In [17]:
tf.version.VERSION

'2.10.0'

In [18]:
train.Party = pd.Categorical(train.Party)
train['Party'] = train.Party.cat.codes
test.Party = pd.Categorical(test.Party)
test['Party'] = test.Party.cat.codes

In [19]:
def vocab_dict(terms: pd.DataFrame) -> tuple[dict, dict]:
    # Map row index -> term, then invert it to get term -> index.
    i_words = terms['Terms'].to_dict()
    words_i = {word: i for i, word in i_words.items()}
    return words_i, i_words

In [20]:
train_words_i, train_i_words = vocab_dict(train_vocab)
test_words_i, test_i_words = vocab_dict(test_vocab)

batch_size = 128
batch_count = round(train.shape[0] / batch_size)
epochs = 5
vocab_size = len(train_words_i)
embedding_dim = 140
k = 5

In [21]:
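def first_k_words(words_i: dict, data: pd.DataFrame, k: int):
    """Yield (x, y) per row, where x is a multi-hot vector over the vocabulary
    marking which of the first k tokens occur, and y is the party code."""
    for _, row in data.iterrows():
        # One slot per vocabulary term, initialised to 0.
        terms = dict.fromkeys(words_i, 0.0)
        for term in row['Tokens'][:k]:
            # Skip out-of-vocabulary tokens so the vector length stays fixed.
            if term in terms:
                terms[term] = 1.0
        x = tf.convert_to_tensor(list(terms.values()), dtype=tf.float32)
        y = row['Party']
        yield (x, y)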

for x, y in first_k_words(train_words_i, train.iloc[0:70000:15000], k):
    print(f"X = [{x}] | Y = {y}")

X = [[0. 0. 0. ... 0. 0. 0.]] | Y = 0
X = [[0. 0. 0. ... 0. 0. 0.]] | Y = 1
X = [[0. 0. 0. ... 0. 0. 0.]] | Y = 0
X = [[0. 0. 0. ... 0. 0. 0.]] | Y = 0
X = [[0. 0. 0. ... 0. 0. 0.]] | Y = 0


In [22]:
model = Sequential([
    layers.Embedding(vocab_size, embedding_dim),
    layers.Dropout(0.2),
    layers.GlobalAveragePooling1D(),
    layers.Dropout(0.2),
    layers.Dense(1)
])

model.compile(loss=losses.BinaryCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=tf.metrics.BinaryAccuracy(threshold=0.0))

In [23]:
# Training
for epoch in range(1):
    for batch in range(batch_count):
        train_data, train_labels = [], []
        start = batch * batch_size
        end = start + batch_size
        for x, y in first_k_words(train_words_i, train.iloc[start:end], k):
            train_data.append(x)
            train_labels.append(y)
        # Stack the per-tweet vectors into one (batch, vocab_size) float tensor.
        train_data = tf.stack(train_data)
        train_labels = tf.convert_to_tensor(train_labels, dtype=tf.float32)
        print(f"Training on batch: {batch}")
        history = model.train_on_batch(train_data, train_labels)

Training on batch: 0
Training on batch: 1
Training on batch: 2
Training on batch: 3
Training on batch: 4
Training on batch: 5
Training on batch: 6
Training on batch: 7
Training on batch: 8
Training on batch: 9
Training on batch: 10
Training on batch: 11
Training on batch: 12
Training on batch: 13
Training on batch: 14
Training on batch: 15
Training on batch: 16
Training on batch: 17
Training on batch: 18
Training on batch: 19
Training on batch: 20
Training on batch: 21
Training on batch: 22
Training on batch: 23
Training on batch: 24
Training on batch: 25
Training on batch: 26
Training on batch: 27
Training on batch: 28
Training on batch: 29
Training on batch: 30
Training on batch: 31
Training on batch: 32
Training on batch: 33
Training on batch: 34
Training on batch: 35
Training on batch: 36
Training on batch: 37
Training on batch: 38
Training on batch: 39
Training on batch: 40
Training on batch: 41
Training on batch: 42
Training on batch: 43
Training on batch: 44
Training on batch: 4

In [None]:
model.save(r'C:\Users\samue\Documents\Applied Data Science\INFO-H518 Deep Learning\Assignments\A3')