# Seed-Guided Topic Model for Document Filtering and Classification

This notebook is an implementation of the paper available at `doi:10.1145/3238250`.

In [37]:
import pandas as pd
import numpy as np

# nltk
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

# Utility
import re
# import os

In [39]:
# TEXT CLEANING
# Matches @mentions, links, and any run of non-alphanumeric characters.
TEXT_CLEANING_RE = r"@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+"

In [34]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Tay\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [19]:
tweets_df = pd.read_csv(r"../data/tweets_dataset/training.1600000.processed.noemoticon.csv")

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 80-81: invalid continuation byte

In [22]:
# DATASET
DATASET_COLUMNS = ["target", "ids", "date", "flag", "user", "text"]
DATASET_ENCODING = "ISO-8859-1"

In [24]:
dataset_path = r"../data/tweets_dataset/training.1600000.processed.noemoticon.csv"
print("Open file:", dataset_path)
tweets_df = pd.read_csv(dataset_path, encoding=DATASET_ENCODING, names=DATASET_COLUMNS)

Open file: ../data/tweets_dataset/training.1600000.processed.noemoticon.csv
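
The earlier `UnicodeDecodeError` indicates that the file contains bytes that are not valid UTF-8, which is why the read succeeds once `encoding` is set to ISO-8859-1. A minimal sketch of the same failure on a hand-made byte string (the sample bytes are an assumption for illustration, not taken from the dataset):

```python
# Illustrative bytes only: 0xE9 is "é" in ISO-8859-1, but in UTF-8 it starts
# a multi-byte sequence, so the plain space that follows is invalid.
raw = b"caf\xe9 time"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ISO-8859-1 maps every byte value to a character, so decoding never raises.
latin1_text = raw.decode("ISO-8859-1")

print(utf8_ok)      # False
print(latin1_text)  # café time
```

Because ISO-8859-1 accepts any byte sequence, a successful read does not by itself prove that this is the file's true encoding; it only guarantees that no decoding error is raised.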


In [25]:
tweets_df

| | target | ids | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |
| ... | ... | ... | ... | ... | ... | ... |
| 1599995 | 4 | 2193601966 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | AmandaMarie1028 | Just woke up. Having no school is the best fee... |
| 1599996 | 4 | 2193601969 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | TheWDBoards | TheWDB.com - Very cool to hear old Walt interv... |
| 1599997 | 4 | 2193601991 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | bpbabe | Are you ready for your MoJo Makeover? Ask me f... |
| 1599998 | 4 | 2193602064 | Tue Jun 16 08:40:49 PDT 2009 | NO_QUERY | tinydiamondz | Happy 38th Birthday to my boo of alll time!!! ... |


In [27]:
tweets_df.dtypes

target     int64
ids        int64
date      object
flag      object
user      object
text      object
dtype: object

In [33]:
# A set gives constant-time membership tests in preprocess()
stop_words = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

## Data cleaning

In [35]:
def preprocess(text, stem=False):
    # Remove links, user mentions and special characters, then lowercase
    text = re.sub(TEXT_CLEANING_RE, ' ', str(text).lower()).strip()
    tokens = []
    for token in text.split():
        if token not in stop_words:
            if stem:
                tokens.append(stemmer.stem(token))
            else:
                tokens.append(token)
    return " ".join(tokens)
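
To see what `preprocess` does to a single tweet, the sketch below repeats the same steps without stemming. It uses the same pattern as `TEXT_CLEANING_RE`, written as a raw string, and a tiny hand-picked stop-word set (`DEMO_STOP_WORDS`, an assumption standing in for NLTK's English list so the snippet runs without any download). The sample tweet and `demo_preprocess` are hypothetical names used only for illustration.

```python
import re

# Same pattern as TEXT_CLEANING_RE: mentions, links, non-alphanumeric runs.
DEMO_CLEANING_RE = r"@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+"

# Assumption: a few stop words chosen by hand for this demo only.
DEMO_STOP_WORDS = {"i", "my", "can", "t"}


def demo_preprocess(text):
    # Lowercase, replace every match with a space, trim the ends.
    text = re.sub(DEMO_CLEANING_RE, " ", str(text).lower()).strip()
    # Split on whitespace and drop stop words.
    return " ".join(tok for tok in text.split() if tok not in DEMO_STOP_WORDS)


sample = "@Kenichan I can't update https://example.com my Facebook!!"
cleaned = demo_preprocess(sample)
print(cleaned)  # update facebook
```

Note that the apostrophe in "can't" is replaced by a space, so the contraction splits into `can` and `t`; both are removed here only because they are in the stop-word set.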

In [40]:
%%time
tweets_df["text"] = tweets_df["text"].apply(preprocess)

Wall time: 54.5 s


In [57]:
tweets_df.head()

| | target | ids | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | awww bummer shoulda got david carr third day |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | upset update facebook texting might cry result... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | dived many times ball managed save 50 rest go ... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | whole body feels itchy like fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | behaving mad see |


Take the first 100 tweets as a small subset for testing the algorithm.

In [60]:
small_df = tweets_df.loc[:99, :]
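
Label-based slicing with `.loc` includes the end label, so `.loc[:99, :]` on a default integer index returns 100 rows, not 99. A minimal sketch on a throwaway frame (`demo_df` is a hypothetical stand-in for `tweets_df`):

```python
import pandas as pd

# Hypothetical stand-in for tweets_df with a default RangeIndex.
demo_df = pd.DataFrame({"text": [f"tweet {i}" for i in range(250)]})

by_label = demo_df.loc[:99, :]        # label-based: includes label 99
by_position = demo_df.iloc[:100, :]   # position-based: stops before position 100

print(len(by_label), len(by_position))  # 100 100
print(by_label.equals(by_position))     # True
```

The two forms agree only while the index is the default 0, 1, 2, ... sequence; after filtering or shuffling rows, `.iloc` is the safer way to take the first n rows.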

In [61]:
small_df

| | target | ids | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | awww bummer shoulda got david carr third day |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | upset update facebook texting might cry result... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | dived many times ball managed save 50 rest go ... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | whole body feels itchy like fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | behaving mad see |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | 0 | 1467836448 | Mon Apr 06 22:26:27 PDT 2009 | NO_QUERY | Dogbook | strider sick little puppy |
| 96 | 0 | 1467836500 | Mon Apr 06 22:26:28 PDT 2009 | NO_QUERY | natalieantipas | rylee grace wana go steve party sadly since ea... |
| 97 | 0 | 1467836576 | Mon Apr 06 22:26:29 PDT 2009 | NO_QUERY | timdonnelly | hey actually one bracket pools bad one money |
| 98 | 0 | 1467836583 | Mon Apr 06 22:26:29 PDT 2009 | NO_QUERY | homeworld | follow either work |


In [62]:
for doc in small_df.text:
    print(doc)
    # for r in revTopics:  # inner loop body is missing; revTopics is not defined in this notebook

awww bummer shoulda got david carr third day
upset update facebook texting might cry result school today also blah
dived many times ball managed save 50 rest go bounds
whole body feels itchy like fire
behaving mad see
whole crew
need hug
hey long time see yes rains bit bit lol fine thanks
nope
que muera
spring break plain city snowing
pierced ears
bear watch thought ua loss embarrassing
counts idk either never talk anymore
would first gun really though zac snyder doucheclown
wish got watch miss iamlilnicki premiere
hollis death scene hurt severely watch film wry directors cut
file taxes
ahh ive always wanted see rent love soundtrack
oh dear drinking forgotten table drinks
day get much done
one friend called asked meet mid valley today time sigh
baked cake ated
week going hoped
blagh class 8 tomorrow
hate call wake people
going cry sleep watching marley
im sad miss lilly
ooooh lol leslie ok leslie get mad
meh almost lover exception track gets depressed every time
some1 hacked account ai
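
The loop in cell `In [62]` iterates over each document and over `revTopics`, which is never defined in this notebook, and its inner body is missing. As a purely illustrative sketch of what a seed-word pass over documents can look like, the snippet below counts seed-word occurrences per category. This is plain keyword counting, not the model from the paper; `demo_seed_words`, its categories, and `count_seed_hits` are hypothetical. The three documents are copied from the output above.

```python
from collections import Counter

# Hypothetical seed words per category, chosen only for this demo.
demo_seed_words = {
    "sadness": {"cry", "sad", "upset", "miss"},
    "weather": {"snowing", "rains", "spring"},
}

# Three cleaned tweets taken from the printed output.
docs = [
    "upset update facebook texting might cry result school today also blah",
    "spring break plain city snowing",
    "file taxes",
]


def count_seed_hits(doc, seed_words):
    """Count how many tokens of `doc` are seed words of each category."""
    tokens = doc.split()
    hits = Counter()
    for topic, seeds in seed_words.items():
        hits[topic] = sum(1 for tok in tokens if tok in seeds)
    return hits


for doc in docs:
    print(dict(count_seed_hits(doc, demo_seed_words)), "<-", doc)
```

A document with no seed words, such as `file taxes`, gets zero hits in every category, which is the case a seed-guided model has to handle through word co-occurrence rather than direct matching.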