# 5k-MultiDF Vocabulary
This notebook aims to create a function that takes multiple dataframes as input and extracts the top 5k words from them. The result should be a production-ready function.
Here we apply methods from the previous "Top 5K BoW-TF" notebook to handle multiple dataframes.

In [1]:
import numpy as np
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from collections import Counter
from nltk.corpus import stopwords

In [9]:
# header=None means the file's header row (ID, Target, Tweet, Stance) is read as data row 0
df1 = pd.read_csv('SemEval2016-Task6-subtaskA-testdata-gold.txt', sep="\t", header=None)
df1.head()

Unnamed: 0,0,1,2,3
0,ID,Target,Tweet,Stance
1,10001,Atheism,He who exalts himself shall be humbled; a...,AGAINST
2,10002,Atheism,RT @prayerbullets: I remove Nehushtan -previou...,AGAINST
3,10003,Atheism,@Brainman365 @heidtjj @BenjaminLives I have so...,AGAINST
4,10004,Atheism,#God is utterly powerless without Human interv...,AGAINST


In [10]:
df2 = pd.read_csv('SemEval2016-Task6-subtaskB-testdata-gold.txt', sep="\t", header=None)
df2.head()

Unnamed: 0,0,1,2,3
0,ID,Target,Tweet,Stance
1,20001,Donald Trump,@2014voteblue @ChrisJZullo blindly supporting ...,NONE
2,20002,Donald Trump,@ThePimpernelX @Cameron_Gray @CalebHowe Total...,NONE
3,20003,Donald Trump,@JeffYoung @ThePatriot143 I fully support full...,NONE
4,20004,Donald Trump,@ABC Stupid is as stupid does! Showedhis true ...,AGAINST


In [56]:
custom_stopwords = ["semst", "im"]
# build the stop-word set once; set lookups are faster than scanning a list on every call
stop_words = set(stopwords.words('english')) | set(custom_stopwords)

# takes in a string & returns a cleaned, lowercased string of all non-stop-words
def preprocess(text):
    # strip punctuation first, then lowercase
    text = re.sub(r'[^\w\s]', '', text).lower()
    return " ".join(word for word in text.split() if word not in stop_words)
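
One side effect of the order of operations in `preprocess`: punctuation is stripped before the stop-word check, so "don't" becomes "dont" and no longer matches NLTK's stop-word entry "don't". This is likely why `dont` ranks so high in the counts further down. A minimal sketch of the alternative order follows. `STOP_WORDS` is a tiny stand-in set and `preprocess_stop_first` is a hypothetical name; neither is part of this notebook.

```python
import re

# Stand-in stop-word set for illustration only (assumption); the notebook
# uses nltk's stopwords.words('english') plus custom_stopwords.
STOP_WORDS = {"i", "the", "don't", "is", "semst"}

def preprocess_stop_first(text, stop_words=STOP_WORDS):
    """Lowercase, drop stop words, then strip punctuation from what is left."""
    kept = []
    for token in text.lower().split():
        # Trim punctuation around the token but keep inner apostrophes,
        # so "don't," still matches the stop-word entry "don't".
        core = token.strip(".,;:!?\"()#@")
        if not core or core in stop_words:
            continue
        # Remove any remaining punctuation, as the notebook's preprocess does.
        cleaned = re.sub(r"[^\w\s]", "", core)
        if cleaned:
            kept.append(cleaned)
    return " ".join(kept)

print(preprocess_stop_first("I don't think the #SemST tag is needed!"))
```

Whether this order is preferable depends on the downstream task; it changes the vocabulary, so counts would no longer match the outputs shown in this notebook.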

In [67]:
# Takes a list of dataframes, returns a df with one cleaned text per row
def multidf_vocab(df_arr, column=2):
    # create list of cleaned strings
    vocab = []
    for df in df_arr:
        # column defaults to label 2 (the Tweet column in these files); pass another label for other data
        for text in df[column]:
            vocab.append(preprocess(text))
    vocab_df = pd.DataFrame(vocab)
    # how do I use Counter without turning the vocab list into a df first?
    # counting the words and building the frequency dataframe happens in a later cell
    return vocab_df
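
On the question in the comment above: `Counter` does not need a DataFrame. Its `update` method accepts any iterable of tokens, so the cleaned strings can be counted one at a time. A minimal sketch, assuming the texts are already cleaned; `count_words` is a hypothetical helper name:

```python
from collections import Counter

def count_words(cleaned_texts, top_n=5000):
    """Return the top_n (word, count) pairs from an iterable of cleaned strings."""
    counts = Counter()
    for text in cleaned_texts:
        # update() adds one to the count of every token it is given
        counts.update(text.split())
    return counts.most_common(top_n)

sample = ["god utterly powerless", "trump trump people", "people like god"]
print(count_words(sample, top_n=3))
```

Because it consumes any iterable, it could also be fed a generator such as `(preprocess(t) for df in df_arr for t in df[2])`, which avoids building the intermediate list and DataFrame and may help with the performance concern.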

In [68]:
vocab = multidf_vocab([df1, df2])

In [69]:
vocab.head()

Unnamed: 0,0
0,tweet
1,exalts shall humbled humbles shall exaltedmatt...
2,rt prayerbullets remove nehushtan previous mov...
3,brainman365 heidtjj benjaminlives sought truth...
4,god utterly powerless without human intervention


In [70]:
counter = Counter(" ".join(vocab[0]).split()).most_common(5000)

In [75]:
counter[:10]

[('realdonaldtrump', 192),
 ('trump', 146),
 ('like', 111),
 ('people', 105),
 ('dont', 105),
 ('get', 95),
 ('women', 95),
 ('god', 92),
 ('hillaryclinton', 84),
 ('one', 74)]

In [72]:
counter_df = pd.DataFrame(counter)

In [73]:
counter_df.head()

Unnamed: 0,0,1
0,realdonaldtrump,192
1,trump,146
2,like,111
3,people,105
4,dont,105


In [76]:
# next, let's put the above steps in a function
# takes in cleaned text df, returns top 5k frequency df
def tf5k(processed_df):
    counter = Counter(" ".join(processed_df[0]).split()).most_common(5000)
    counter_df = pd.DataFrame(counter)
    return counter_df
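
One detail worth noting about `most_common(5000)`: words with equal counts are returned in first-seen order. The tail of the output below consists of words with count 1, so which of them make the cut depends on the order of the input dataframes. If a reproducible cut is wanted, one option is to break ties alphabetically. A minimal sketch; `top_words_stable` is a hypothetical name:

```python
from collections import Counter

def top_words_stable(counts, top_n=5000):
    """Top-n (word, count) pairs, with ties broken alphabetically."""
    # Sort by descending count, then by word, so the result
    # does not depend on the order in which words were seen.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]

counts = Counter("pole flew checked stolen trump trump".split())
print(top_words_stable(counts, top_n=3))
```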

In [77]:
tf5k(vocab)

Unnamed: 0,0,1
0,realdonaldtrump,192
1,trump,146
2,like,111
3,people,105
4,dont,105
...,...,...
4995,stolen,1
4996,checked,1
4997,flew,1
4998,pole,1


Great, the only thing left to do would be to save the df for permanent use, but we'll do that once we have more dataframes as input.
Furthermore, we should think of ways to improve performance, especially in the multidf_vocab function.
If everything is working, I could write this as a Python script which works like this:

$ ./5kvocab <path/to/dataframes>

and creates a .txt file containing the frequency dataframe in the current directory.
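
A rough sketch of what such a script could look like, using only the standard library. Everything here is an assumption rather than the final design: the names `build_vocab` and `main`, the `--output` flag and its default `vocab5k.txt`, UTF-8 encoding, and the empty default stop-word set (the notebook's NLTK-based set would be passed in instead). Unlike the cells above, it skips each file's header row.

```python
import argparse
import csv
import re
from collections import Counter
from pathlib import Path

def build_vocab(paths, column=2, top_n=5000, stop_words=frozenset()):
    """Count words in one column of several tab-separated files."""
    counts = Counter()
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
            next(reader, None)  # skip the header row (ID, Target, Tweet, Stance)
            for row in reader:
                if len(row) <= column:
                    continue  # skip short or malformed rows
                text = re.sub(r"[^\w\s]", "", row[column]).lower()
                counts.update(w for w in text.split() if w not in stop_words)
    return counts.most_common(top_n)

def main(argv=None):
    parser = argparse.ArgumentParser(description="Write top word counts to a text file.")
    parser.add_argument("directory", help="folder containing tab-separated .txt files")
    parser.add_argument("--output", default="vocab5k.txt", help="output file name")
    args = parser.parse_args(argv)
    output = Path(args.output).resolve()
    # Exclude a previous output file if it sits in the input folder.
    paths = [p for p in sorted(Path(args.directory).glob("*.txt")) if p.resolve() != output]
    with open(output, "w", newline="", encoding="utf-8") as out:
        # One "word<TAB>count" line per vocabulary entry.
        csv.writer(out, delimiter="\t").writerows(build_vocab(paths))
```

Saved as a file and ended with an `if __name__ == "__main__": main()` guard, this could be invoked in roughly the way shown above.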