# Top 5k BoW-TF
In this notebook we create a dictionary of the top 5k words appearing across all posts in our dataset. This is necessary for the first step of our pipeline: the 5k TF (term frequency) vector.

Getting started:
1. Unpack the semeval2016 dataset & place the SemEval2016-Task6-subtaskA-testdata-gold.txt file in the same directory as this notebook.
2. Install all the libraries imported below if necessary. The NLTK stop word list also has to be downloaded once with nltk.download('stopwords').

In [1]:
import numpy as np
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [13]:
from collections import Counter
from nltk.corpus import stopwords

In [2]:
# read dataset (tab-separated); with header=None the file's header row is loaded as data row 0
df = pd.read_csv('SemEval2016-Task6-subtaskA-testdata-gold.txt', sep="\t", header=None)
df.head()

Unnamed: 0,0,1,2,3
0,ID,Target,Tweet,Stance
1,10001,Atheism,He who exalts himself shall be humbled; a...,AGAINST
2,10002,Atheism,RT @prayerbullets: I remove Nehushtan -previou...,AGAINST
3,10003,Atheism,@Brainman365 @heidtjj @BenjaminLives I have so...,AGAINST
4,10004,Atheism,#God is utterly powerless without Human interv...,AGAINST


In [3]:
# number of rows, including the header row that was read as data
len(df)

1250

First, we strip punctuation, lowercase the text, and remove stop words as well as the dataset's SemST hashtag. Each word is kept at most once per tweet.

In [5]:
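# takes in a string & returns a cleaned string: lowercased, punctuation stripped,
# stop words and the "semst" tag removed, each remaining word kept once
def preprocess(text):
    sw = set(stopwords.words('english'))
    text = re.sub(r'[^\w\s]', '', text).lower()
    kept = []
    for word in text.split():
        if word not in sw and word != "semst":
            # membership is tested on whole words (a list), not on substrings
            # of a joined string, so "men" is not dropped after "women"
            if word not in kept:
                kept.append(word)
    # every word is followed by a space, matching the outputs shown below
    return "".join(word + " " for word in kept)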
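
One thing to watch for: NLTK's English stop word list contains contractions written with an apostrophe, such as "don't" and "shouldn't". Because preprocess strips punctuation before the stop word check, these arrive as "dont" and "shouldnt" and slip through, which is why such tokens show up in the outputs further down. Below is a minimal sketch of one possible remedy. STOP_WORDS is a tiny hypothetical stand-in for stopwords.words('english'), and preprocess_contractions is an illustrative name, not part of the notebook.

```python
import re

# Tiny stand-in for stopwords.words('english'); an assumption for illustration only.
STOP_WORDS = {"a", "over", "don't", "shouldn't", "you're"}

# The same stop words with punctuation stripped, so "shouldn't" also matches "shouldnt".
STOP_WORDS_STRIPPED = {re.sub(r"[^\w\s]", "", w) for w in STOP_WORDS}


def preprocess_contractions(text):
    """Variant of preprocess that also drops contractions such as "shouldn't"."""
    text = re.sub(r"[^\w\s]", "", text).lower()
    kept = []
    for word in text.split():
        if word in STOP_WORDS_STRIPPED or word == "semst":
            continue
        if word not in kept:  # keep each word once per text
            kept.append(word)
    return " ".join(kept)


print(preprocess_contractions(
    "Girls over 130 pounds shouldn't wear a bikini #womenintech #SemST"
))
# girls 130 pounds wear bikini womenintech
```

Stripping the punctuation from the stop words themselves keeps the rest of the pipeline unchanged; whether contractions should count as stop words at all is a modelling choice.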

Now we apply preprocess to every tweet and collect the cleaned strings in a list.

In [6]:
# one cleaned string per row of the tweet column (column 2)
vocab = [preprocess(tweet) for tweet in df[2]]

In [7]:
vocab

['tweet ',
 'exalts shall humbled humbles exaltedmatt 2312 ',
 'rt prayerbullets remove nehushtan previous moves god become idols high places 2 kings 184 ',
 'brainman365 heidtjj benjaminlives sought truth soul found strong enough stand merits ',
 'god utterly powerless without human intervention ',
 'david_cameron miracles multiculturalism shady 786 taqiya tawriya jaziya kafirs dhimmi jihad allah ',
 'world needs tight group hug enough relieve anger hate makepeacewitheachother ',
 'morality derived religion precedes christopher hitch hitchens freethinkers ',
 'godly husband knows trusts loves respects honors supports wants appreciates ',
 'seculardutchess ill huckleberry deanmodified ',
 'bible big irrelevant book lies exaggerations judaism god teamjesus islam truth freedom ',
 'dreams real gone singlebecause getonyourfeet ',
 'happy independence day america beautiful constitution independenceday usa ',
 'let house built wisdom become strong good sense prov 243 ',
 'days cool kids ath

It's a list of strings, where each one holds the preprocessed words of one post. The first entry, 'tweet ', comes from the file's header row. Next we turn this list into a DataFrame.

In [8]:
vocab_df = pd.DataFrame(vocab)

In [9]:
vocab_df.head()

Unnamed: 0,0
0,tweet
1,exalts shall humbled humbles exaltedmatt 2312
2,rt prayerbullets remove nehushtan previous mov...
3,brainman365 heidtjj benjaminlives sought truth...
4,god utterly powerless without human intervention


In [10]:
vocab_df.size

1250

In [11]:
print(df.iloc[420][2], "\n")
print(vocab_df.iloc[420][0])

Girls over 130 pounds shouldn't wear a bikini #womenintech #SemST 

girls 130 pounds shouldnt wear bikini womenintech 


For each tweet in our data, we now have one row in vocab_df, with the cleaned text in its single column 0. Above you can see the original text from the dataset; below it is the preprocessed version.
Tbh this is just improvised and there is probably a better way to do this.
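
A possible simplification, sketched here with a small made-up list in place of the real data: the word counts can be taken from the list of cleaned strings directly, without going through a DataFrame. vocab_example is a hypothetical stand-in for vocab.

```python
from collections import Counter

# Hypothetical stand-in for the vocab list built above.
vocab_example = ["god truth ", "truth soul strong ", "god human truth "]

# Count every word of every cleaned string in one pass.
word_counts = Counter(word for text in vocab_example for word in text.split())

top_words = word_counts.most_common(2)
print(top_words)  # [('truth', 3), ('god', 2)]
```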

In [12]:
Counter(" ".join(vocab_df[0]).split()).most_common(5000)

[('women', 83),
 ('hillaryclinton', 73),
 ('like', 67),
 ('god', 66),
 ('dont', 63),
 ('people', 61),
 ('hillary', 50),
 ('get', 48),
 ('im', 45),
 ('feminist', 45),
 ('one', 44),
 ('would', 40),
 ('life', 39),
 ('need', 39),
 ('men', 36),
 ('abortion', 36),
 ('feminists', 34),
 ('time', 33),
 ('right', 33),
 ('know', 32),
 ('want', 32),
 ('cant', 31),
 ('rt', 30),
 ('woman', 30),
 ('good', 29),
 ('say', 29),
 ('us', 29),
 ('think', 29),
 ('world', 27),
 ('never', 27),
 ('make', 25),
 ('equality', 25),
 ('love', 24),
 ('going', 24),
 ('climate', 24),
 ('2', 23),
 ('man', 23),
 ('even', 23),
 ('much', 22),
 ('tcot', 22),
 ('may', 22),
 ('change', 22),
 ('see', 22),
 ('support', 22),
 ('feminism', 22),
 ('lord', 21),
 ('hope', 21),
 ('go', 20),
 ('youre', 20),
 ('great', 20),
 ('way', 20),
 ('america', 18),
 ('every', 18),
 ('still', 18),
 ('believe', 18),
 ('live', 18),
 ('clinton', 18),
 ('human', 17),
 ('always', 17),
 ('give', 17),
 ('really', 17),
 ('president', 17),
 ('religion', 1

Well, this seems to work better than expected. 
The list above holds the most common words in our tweets (at most 5k) together with their counts.
In the future we should do this with a very large combination of multiple tweet datasets & save the result to a file.
This way we get our 5k dictionary for the TF vector.
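
As a rough sketch of that last step, the most common words could be mapped to fixed vector positions, saved as JSON, and later used to turn a cleaned text into a TF vector. Everything named here (build_vocabulary, tf_vector, the file name top5k_vocab.json, the toy texts) is an assumption for illustration, not the pipeline's actual code.

```python
import json
import os
import tempfile
from collections import Counter


def build_vocabulary(cleaned_texts, size=5000):
    """Map each of the `size` most common words to a fixed vector index."""
    counts = Counter(word for text in cleaned_texts for word in text.split())
    return {word: index for index, (word, _) in enumerate(counts.most_common(size))}


def tf_vector(cleaned_text, vocabulary):
    """Count how often each vocabulary word occurs in one cleaned text."""
    vector = [0] * len(vocabulary)
    for word in cleaned_text.split():
        index = vocabulary.get(word)
        if index is not None:  # words outside the vocabulary are ignored
            vector[index] += 1
    return vector


# Hypothetical stand-in for the preprocessed tweets of several datasets.
cleaned_texts = ["truth god soul", "truth god soul", "truth god", "truth", "human"]
vocabulary = build_vocabulary(cleaned_texts, size=3)

# Save the vocabulary and load it again; the file name is an assumption.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "top5k_vocab.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(vocabulary, f)
    with open(path, encoding="utf-8") as f:
        loaded_vocabulary = json.load(f)

example_vector = tf_vector("truth god truth unknown", loaded_vocabulary)
print(example_vector)  # [2, 1, 0]
```

Note that a text cleaned with preprocess contains each word at most once, so its vector would only hold zeros and ones; counts above one appear only if duplicates are kept. CountVectorizer from scikit-learn, already imported at the top, offers a max_features parameter that could serve the same purpose.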