#### Problem statement

Predict the political party from the tweet text and the handle.

#### Data description
The dataset has three columns: label (party name), Twitter handle, and tweet text.


#### Problem description

Design a feedforward deep neural network to predict the political party using PyTorch or TensorFlow.
Build two models:

1. Without using the handle

2. Using the handle
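
The two models can share one architecture and differ only in input width. The following is a minimal PyTorch sketch, not a prescribed solution: the class name `PartyClassifier`, the layer sizes, the two-party (single-logit) output, and the one-hot handle encoding are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class PartyClassifier(nn.Module):
    """Feedforward binary classifier (hypothetical name and layer sizes)."""

    def __init__(self, n_features: int, hidden: int = 64, dropout: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, 1),  # one logit, assuming exactly two parties
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


# Model 1 sees only text features; model 2 sees text features plus a one-hot handle.
n_text, n_handles = 1000, 50  # assumed sizes for illustration
text_only = PartyClassifier(n_text)
with_handle = PartyClassifier(n_text + n_handles)

x_text = torch.rand(4, n_text)  # stand-in for encoded tweets
x_handle = torch.zeros(4, n_handles)
x_handle[torch.arange(4), torch.tensor([0, 3, 3, 7])] = 1.0  # one-hot handle ids

logits_text = text_only(x_text)
logits_both = with_handle(torch.cat([x_text, x_handle], dim=1))
```

Concatenating the handle features onto the text features keeps the comparison between the two models limited to a single change in the input.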


#### Deliverables

- Report the performance on the test set.

- Try multiple models with different hyperparameters. Present the results of each model on the test set. No need to create a dev set.

- Experiment with:
    - L2 and dropout regularization techniques
    - SGD, RMSProp, and Adam optimization techniques



- Creating a fixed-sized vocabulary: Give a unique id to each word in your selected vocabulary and use it as the input to the network

    - Option 1: Feedforward networks can only handle fixed-sized inputs. You can choose a fixed number K of words from the tweet text (e.g. the first K words, K randomly selected words, etc.). K can be a hyperparameter.

    - Option 2: You can choose the top N (e.g. N=1000) most frequent words from the dataset and use an N-sized input layer. If a word is present in a tweet, pass its id; otherwise pass 0.
    
    -  Clearly state your design choices and assumptions. Think about the pros and cons of each option.
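
The two encoding options can be sketched in plain Python as follows. This is only an illustration on toy data: the helper names `build_vocab`, `encode_first_k`, and `encode_top_n` are hypothetical, ids start at 1 so that 0 can mean "padding" or "absent", and out-of-vocabulary words in Option 1 are mapped to 0 as a simplifying assumption.

```python
from collections import Counter

# Toy tokenized tweets (made-up data for illustration).
tweets = [
    ["save", "the", "internet", "vote", "today"],
    ["happy", "first", "day", "of", "school"],
    ["vote", "today"],
]

counts = Counter(tok for tweet in tweets for tok in tweet)


def build_vocab(counts, n):
    """Map the n most frequent words to ids 1..n (ties broken alphabetically)."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {tok: i + 1 for i, (tok, _) in enumerate(ranked[:n])}


def encode_first_k(tokens, vocab, k):
    """Option 1: ids of the first k tokens, right-padded with 0."""
    ids = [vocab.get(tok, 0) for tok in tokens[:k]]
    return ids + [0] * (k - len(ids))


def encode_top_n(tokens, vocab):
    """Option 2: one slot per vocabulary word; the word's id if present, else 0."""
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok] - 1] = vocab[tok]
    return vec


vocab = build_vocab(counts, n=4)
print(vocab)                                # {'today': 1, 'vote': 2, 'day': 3, 'first': 4}
print(encode_first_k(tweets[2], vocab, 3))  # [2, 1, 0]
print(encode_top_n(tweets[1], vocab))       # [0, 0, 3, 4]
```

Option 1 preserves word order but discards everything past the first K words; Option 2 covers the whole tweet but ignores order and any word outside the top N.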

 

<b> Tabulate your results, either at the end of the code file or in the text box on the submission page. The final result should have:</b>

1. Experiment description

2. Hyperparameters used and their values

3. Performance on the test set
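
One way to organize the experiments and the results table is a small grid over optimizer, L2 strength, and dropout rate. The sketch below is an assumption-laden outline rather than a finished experiment: `make_optimizer`, the grid values, the learning rate, and the layer sizes are hypothetical, and the training step is omitted, so `test_accuracy` stays `None`. In `torch.optim`, the `weight_decay` argument of `SGD`, `RMSprop`, and `Adam` applies an L2 penalty.

```python
import itertools

import torch
import torch.nn as nn


def make_optimizer(name, params, lr, weight_decay):
    """Build one of the three optimizers; weight_decay acts as the L2 penalty."""
    optimizers = {
        "SGD": torch.optim.SGD,
        "RMSProp": torch.optim.RMSprop,
        "Adam": torch.optim.Adam,
    }
    return optimizers[name](params, lr=lr, weight_decay=weight_decay)


# Hypothetical grid: 3 optimizers x 2 L2 strengths x 2 dropout rates.
grid = itertools.product(["SGD", "RMSProp", "Adam"], [0.0, 1e-4], [0.0, 0.5])

results = []
for opt_name, l2, p_drop in grid:
    model = nn.Sequential(
        nn.Linear(1000, 64), nn.ReLU(), nn.Dropout(p_drop), nn.Linear(64, 1)
    )
    optimizer = make_optimizer(opt_name, model.parameters(), lr=1e-3, weight_decay=l2)
    # Training and evaluation would go here; no run is shown, so accuracy is None.
    results.append({
        "experiment": f"{opt_name}, L2={l2}, dropout={p_drop}",
        "optimizer": opt_name,
        "l2": l2,
        "dropout": p_drop,
        "lr": 1e-3,
        "test_accuracy": None,
    })

for row in results:
    print(row)
```

Each dictionary holds the three required fields (description, hyperparameters, test performance), so the list can be passed to `pd.DataFrame(results)` for the final table.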

 

In [2]:
from pyspark.ml import Pipeline
from pyspark.sql import (Row, functions as F)
from sparknlp.base import DocumentAssembler, EmbeddingsFinisher
from sparknlp.annotator import Tokenizer, Normalizer, Word2VecApproach

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sparknlp
import torch 
import torch.nn as nn
import warnings
warnings.filterwarnings('ignore')

spark = sparknlp.start(m1=True)
spark.sparkContext.setLogLevel('ERROR')


In [3]:
PATH = "/Users/samuelahickey/Documents/Data-Science/INFO-H518-Deep-Learning/Assignments/A3"
train = spark.createDataFrame(
    pd.read_csv(PATH+"/train.csv", header="infer", index_col=0).dropna()
)
test = spark.createDataFrame(
    pd.read_csv(PATH+"/test.csv", header="infer", index_col=0).dropna()
)
train.toPandas()

                                                                                

Unnamed: 0,Party,Handle,Tweet
0,Democrat,RepDarrenSoto,"Today, Senate Dems vote to #SaveTheInternet. P..."
1,Democrat,RepDarrenSoto,RT @WinterHavenSun: Winter Haven resident / Al...
2,Democrat,RepDarrenSoto,RT @NBCLatino: .@RepDarrenSoto noted that Hurr...
3,Democrat,RepDarrenSoto,RT @NALCABPolicy: Meeting with @RepDarrenSoto ...
4,Democrat,RepDarrenSoto,RT @Vegalteno: Hurricane season starts on June...
...,...,...,...
72729,Republican,RepTomPrice,Check out my op-ed on need for End Executive O...
72730,Republican,RepTomPrice,"Yesterday, Betty &amp; I had a great time lear..."
72731,Republican,RepTomPrice,We are forever grateful for the service and sa...
72732,Republican,RepTomPrice,Happy first day of school @CobbSchools! #CobbB...


In [4]:
doc_assembler = DocumentAssembler() \
    .setInputCol('Tweet') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

normalizer = Normalizer() \
    .setInputCols(['token']) \
    .setOutputCol('normal_token') \
    .setCleanupPatterns(["[^#A-Za-z]", "^https(.*)", "^#$"])

token_pipeline = Pipeline().setStages([
    doc_assembler,
    tokenizer,
    normalizer
])
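
To see what the three cleanup patterns remove, the sketch below applies them with Python's `re` module to a few toy tokens. It assumes the patterns are applied to each token in the order listed and that tokens left empty are dropped; Spark NLP's `Normalizer` may differ in detail, so treat this as an approximation. The `normalize` helper and the sample tokens are hypothetical.

```python
import re

# Same patterns as passed to setCleanupPatterns in the cell above.
CLEANUP_PATTERNS = ["[^#A-Za-z]", "^https(.*)", "^#$"]


def normalize(tokens, patterns=CLEANUP_PATTERNS):
    """Apply each pattern in order to every token; drop tokens that end up empty."""
    cleaned = []
    for tok in tokens:
        for pat in patterns:
            tok = re.sub(pat, "", tok)
        if tok:
            cleaned.append(tok)
    return cleaned


tokens = ["Today", ",", "#SaveTheInternet", "https://t.co/abc123", "#", "2018"]
print(normalize(tokens))  # ['Today', '#SaveTheInternet']
```

The first pattern strips everything except letters and `#`, which turns a URL into a string beginning with `https`; the second pattern then removes that string entirely, and the third removes a `#` left on its own.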

In [5]:
# Tokenize and normalize each split; keep the normalized token strings as `Tokens`
train = token_pipeline \
    .fit(train) \
    .transform(train) \
    .selectExpr(['Party', 'Handle', 'Tweet', 'normal_token.result as Tokens'])

test = token_pipeline \
    .fit(test) \
    .transform(test) \
    .selectExpr(['Party', 'Handle', 'Tweet', 'normal_token.result as Tokens'])

In [6]:
# Get terms from training set
vocab = train.select('Tokens') \
    .select(F.explode('Tokens').alias('Terms')) \
    .distinct() \
    .sort('Terms') \
    .toPandas()

# Add terms from test set
# (pd.concat is used because Series.append was removed in pandas 2.0)
vocab = pd.concat([
    vocab.Terms,
    test.select('Tokens')
        .select(F.explode('Tokens').alias('Terms'))
        .distinct()
        .sort('Terms')
        .toPandas().Terms
]).drop_duplicates().reset_index()
vocab[''] = pd.Series(np.zeros(vocab.shape[0]), name='')
# One-hot Encode
vocab = pd.get_dummies(vocab, sparse=True).drop('index', axis=1)

# Rename and transpose
vocab = vocab.rename(columns={k: k[6:] for k in vocab.columns}).T
vocab['77930'] = np.zeros(vocab.shape[0], dtype=float)
vocab = vocab.astype(np.float16)

vocab

                                                                                

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,77921,77922,77923,77924,77925,77926,77927,77928,77929,77930
,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
##SLS,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
##USPS,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
#A,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
#AADRFNIDCRAdvocacyD,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zone,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zones,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zoo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zoomed,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
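
The table above is a square one-hot matrix with one row per term. Summing the rows for a tweet's tokens gives a bag-of-words count vector, and the same vector can be built from a term-to-index lookup without materializing the full matrix. The sketch below shows the equivalence on a toy vocabulary; `bag_of_words` and the sample terms are hypothetical and are not taken from the notebook's own code.

```python
import numpy as np

terms = ["#A", "zone", "zones", "zoo"]  # toy vocabulary
index = {t: i for i, t in enumerate(terms)}


def bag_of_words(tokens, index):
    """Count vector equal to the sum of the tokens' one-hot rows."""
    vec = np.zeros(len(index), dtype=np.float32)
    for tok in tokens:
        vec[index[tok]] += 1.0
    return vec


# Reference: explicit one-hot rows, summed per token.
one_hot = np.eye(len(terms), dtype=np.float32)
tokens = ["zoo", "#A", "zoo"]
summed = sum(one_hot[index[t]] for t in tokens)

print(bag_of_words(tokens, index))  # [1. 0. 0. 2.]
print(np.array_equal(bag_of_words(tokens, index), summed))  # True
```

An empty token list yields the all-zero vector, which matches the role of the all-zero `''` row in the table.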


In [7]:
def create_CBOW(row, v):
    result = [
        row.Party,
        row.Handle,
        row.Tweet,
        row.Tokens 
    ]
    if len(row.Tokens) > 0:
        result.append([str(i) for i in sum([v.loc[j].to_numpy() for j in row.Tokens])])
    else:
        result.append([str(i) for i in v.loc['']])
    return result

# Split the training set into batches of 10,000 rows.
# round() means a trailing partial batch may be left unprocessed.
batch_size = 10000
batch_count = round(train.count() / batch_size)
batches = []
copy_df = train

for i in range(batch_count):
    tmp = copy_df.limit(batch_size)
    copy_df = copy_df.subtract(tmp)
    batches.append(tmp)
    tmp = None

# Skip the first three batches; the loop below starts at i = 3 to match
for _ in range(3):
    batches.pop(0)

for i in range(3, batch_count):
    # Split each batch into mini-batches of 1,000 rows
    mini_size = 1000
    mini_count = round(batch_size / mini_size)
    mini_batch = []
    b = batches.pop(0)

    for j in range(mini_count):
        tmp = b.limit(mini_size)
        mini_batch.append(tmp)
        b = b.subtract(tmp)
        tmp = None

    # Encode each mini-batch and write it to disk as a pickle
    for j in range(mini_count):
        tmp = mini_batch.pop(0)
        print(tmp.count())
        tmp = tmp.rdd \
            .map(lambda row: create_CBOW(row, vocab)) \
            .toDF(['Party', 'Handle', 'Tweet', 'Tokens', 'CBOW']) \
            .toPandas()
        print("tmp created")
        tmp.to_pickle(f'{PATH}/Train/train_CBOW_b{i}_mb{j}.pickle')


                                                                                

1000


                                                                                

tmp created


                                                                                



                                                                                

In [None]:
tmp = test.rdd \
    .map(lambda row: create_CBOW(row, vocab)) \
    .toDF(['Party', 'Handle', 'Tweet', 'Tokens', 'CBOW'])