```
0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
1 - the id of the tweet (2087)
2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
3 - the query (lyx). If there is no query, then this value is NO_QUERY.
4 - the user that tweeted (robotickilldozr)
5 - the text of the tweet (Lyx is cool)
```
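To make the field layout concrete, here is a minimal sketch that parses one row in this six-field format with the standard library. The column names, the `parse_rows` helper, and the sample row are assumptions for illustration: the row is assembled from the example values listed above, and its polarity of 4 is a guess, not a value taken from the dataset.

```python
import csv
import io

# Column names follow the six-field layout above; the names themselves are assumptions.
COLUMNS = ["polarity", "id", "date", "query", "user", "text"]
POLARITY_LABELS = {0: "negative", 2: "neutral", 4: "positive"}

# One illustrative row built from the example values above (polarity 4 is assumed).
raw = '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"\n'


def parse_rows(handle):
    """Yield one dict per CSV row, with the polarity decoded to a label."""
    for fields in csv.reader(handle):
        row = dict(zip(COLUMNS, fields))
        row["label"] = POLARITY_LABELS[int(row["polarity"])]
        yield row


rows = list(parse_rows(io.StringIO(raw)))
print(rows[0]["label"], "|", rows[0]["text"])
```

Every field is read as a string, so the polarity has to be converted with `int()` before it can be looked up.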

## Downloading data
A download script is available at `data/sentiment140/download.sh`. The cells below load the dataset through the Hugging Face `datasets` library instead.

In [12]:
# the huggingface datasets library will be helpful for many future projects!
%pip install datasets

Note: you may need to restart the kernel to use updated packages.


In [13]:
from datasets import load_dataset

dataset = load_dataset("sentiment140")  # requires ~120MB

In [14]:
train_df = dataset["train"].to_pandas()
train_df

| | text | date | user | sentiment | query |
|---|---|---|---|---|---|
| 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... | Mon Apr 06 22:19:45 PDT 2009 | \_TheSpecialOne\_ | 0 | NO_QUERY |
| 1 | is upset that he can't update his Facebook by ... | Mon Apr 06 22:19:49 PDT 2009 | scotthamilton | 0 | NO_QUERY |
| 2 | @Kenichan I dived many times for the ball. Man... | Mon Apr 06 22:19:53 PDT 2009 | mattycus | 0 | NO_QUERY |
| 3 | my whole body feels itchy and like its on fire | Mon Apr 06 22:19:57 PDT 2009 | ElleCTF | 0 | NO_QUERY |
| 4 | @nationwideclass no, it's not behaving at all.... | Mon Apr 06 22:19:57 PDT 2009 | Karoli | 0 | NO_QUERY |
| ... | ... | ... | ... | ... | ... |
| 1599995 | Just woke up. Having no school is the best fee... | Tue Jun 16 08:40:49 PDT 2009 | AmandaMarie1028 | 4 | NO_QUERY |
| 1599996 | TheWDB.com - Very cool to hear old Walt interv... | Tue Jun 16 08:40:49 PDT 2009 | TheWDBoards | 4 | NO_QUERY |
| 1599997 | Are you ready for your MoJo Makeover? Ask me f... | Tue Jun 16 08:40:49 PDT 2009 | bpbabe | 4 | NO_QUERY |
| 1599998 | Happy 38th Birthday to my boo of alll time!!! ... | Tue Jun 16 08:40:49 PDT 2009 | tinydiamondz | 4 | NO_QUERY |


In [15]:
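# we're only interested in sentiment and text;
# .copy() gives an independent DataFrame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning
train_df = train_df[["sentiment", "text"]].copy()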
train_df

| | sentiment | text |
|---|---|---|
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... |
| ... | ... | ... |
| 1599995 | 4 | Just woke up. Having no school is the best fee... |
| 1599996 | 4 | TheWDB.com - Very cool to hear old Walt interv... |
| 1599997 | 4 | Are you ready for your MoJo Makeover? Ask me f... |
| 1599998 | 4 | Happy 38th Birthday to my boo of alll time!!! ... |


In [16]:
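import re

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

nltk.download('stopwords')

# English stopwords (loaded here, but not applied by twitter_preprocessing below)
STOP = stopwords.words("english")

# Twitter-specific tokens to drop, e.g. the retweet marker
twitter_words_to_filter = ["rt"]

# Compile the patterns once instead of on every call
pattern = re.compile(r"^[a-zA-Z\!\?\,\.\']+$")  # letters and basic punctuation only
dot_pattern = re.compile(r"\.{2,}")  # runs of dots such as "..."
url_matcher = re.compile(r"^[a-zA-Z]+\.[a-zA-Z]+$")  # bare domains such as "example.com"

def filter_single_letter(word):
    """Keep multi-character tokens and the one-letter words 'a' and 'i'."""
    return len(word) > 1 or word == 'a' or word == 'i'

tt = TweetTokenizer()

def twitter_preprocessing(post):
    """Tokenize a tweet and return its lowercased tokens that pass all filters."""
    tokenized_post = tt.tokenize(post)
    tokens = [
        w.lower() for w in tokenized_post
        if pattern.match(w) and not dot_pattern.match(w)
        and w.lower() not in twitter_words_to_filter
        and filter_single_letter(w.lower())
        and not url_matcher.match(w.lower())
    ]
    return tokens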

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tollef/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
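
To see what the filters in `twitter_preprocessing` keep and drop, the sketch below applies the same three patterns to a hand-written token list. The token list is an assumption standing in for `TweetTokenizer` output, and the `keep` helper and constant names are hypothetical. NLTK is left out so the example runs on its own.

```python
import re

# Same patterns as in the cell above, compiled once at module level.
WORD = re.compile(r"^[a-zA-Z!?,.']+$")
DOTS = re.compile(r"\.{2,}")
DOMAIN = re.compile(r"^[a-zA-Z]+\.[a-zA-Z]+$")
TWITTER_WORDS = {"rt"}


def keep(token):
    """Return True if the token passes every filter used in twitter_preprocessing."""
    lowered = token.lower()
    return bool(
        WORD.match(token)
        and not DOTS.match(token)
        and lowered not in TWITTER_WORDS
        and (len(lowered) > 1 or lowered in {"a", "i"})
        and not DOMAIN.match(lowered)
    )


# Hand-written tokens standing in for TweetTokenizer output (an assumption).
tokens = ["@switchfoot", "http://twitpic.com/2y1zl", "-", "Awww", ",", "that's",
          "a", "bummer", ".", "RT", "...", "TheWDB.com", "38th"]
kept = [t.lower() for t in tokens if keep(t)]
print(kept)  # ['awww', "that's", 'a', 'bummer']
```

Mentions, URLs, and tokens containing digits fail the first pattern. Single punctuation marks are removed by the length check. `RT`, `...`, and `TheWDB.com` are caught by the word list, the dot pattern, and the domain pattern respectively.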


In [17]:
# apply preprocessing to the tweets in parallel with mapply
# (runtime depends on the number of CPU cores available)
%pip install mapply
import mapply
mapply.init()
train_df["text"] = train_df["text"].mapply(twitter_preprocessing)

Note: you may need to restart the kernel to use updated packages.


100%|██████████| 80/80 [00:13<00:00,  5.77it/s]


In [18]:
train_df

| | sentiment | text |
|---|---|---|
| 0 | 0 | [awww, that's, a, bummer, you, shoulda, got, d... |
| 1 | 0 | [is, upset, that, he, can't, update, his, face... |
| 2 | 0 | [i, dived, many, times, for, the, ball, manage... |
| 3 | 0 | [my, whole, body, feels, itchy, and, like, its... |
| 4 | 0 | [no, it's, not, behaving, at, all, i'm, mad, w... |
| ... | ... | ... |
| 1599995 | 4 | [just, woke, up, having, no, school, is, the, ... |
| 1599996 | 4 | [very, cool, to, hear, old, walt, interviews] |
| 1599997 | 4 | [are, you, ready, for, your, mojo, makeover, a... |
| 1599998 | 4 | [happy, birthday, to, my, boo, of, alll, time,... |


In [19]:
# keep only tweets with more than 5 tokens (tweets with 5 or fewer are dropped):
train_df = train_df[train_df.text.map(len) > 5]
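
The boundary of this filter is easy to misread, so here is a small pure-Python sketch of the same `len(...) > 5` condition. The token lists are made up for illustration and are not rows from the dataset.

```python
# Hypothetical token lists, not rows from the dataset.
samples = [
    ["so", "tired", "today"],                          # 3 tokens
    ["this", "has", "exactly", "five", "tokens"],      # 5 tokens
    ["this", "one", "has", "six", "tokens", "total"],  # 6 tokens
]

# Same condition as the DataFrame filter: strictly more than 5 tokens.
kept = [tokens for tokens in samples if len(tokens) > 5]
print([len(tokens) for tokens in kept])  # [6]
```

A tweet with exactly five tokens is dropped, since the comparison is strict.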

In [20]:
# add a column that is the text as a string
# (assign returns a new DataFrame, which avoids SettingWithCopyWarning):
train_df = train_df.assign(text_str=train_df["text"].map(" ".join))



In [21]:
train_df = train_df.rename(columns={"text": "tokens", "text_str": "text"})

In [22]:
train_df

| | sentiment | tokens | text |
|---|---|---|---|
| 0 | 0 | [awww, that's, a, bummer, you, shoulda, got, d... | awww that's a bummer you shoulda got david car... |
| 1 | 0 | [is, upset, that, he, can't, update, his, face... | is upset that he can't update his facebook by ... |
| 2 | 0 | [i, dived, many, times, for, the, ball, manage... | i dived many times for the ball managed to sav... |
| 3 | 0 | [my, whole, body, feels, itchy, and, like, its... | my whole body feels itchy and like its on fire |
| 4 | 0 | [no, it's, not, behaving, at, all, i'm, mad, w... | no it's not behaving at all i'm mad why am i h... |
| ... | ... | ... | ... |
| 1599994 | 4 | [yeah, that, does, work, better, than, just, w... | yeah that does work better than just waiting f... |
| 1599995 | 4 | [just, woke, up, having, no, school, is, the, ... | just woke up having no school is the best feel... |
| 1599996 | 4 | [very, cool, to, hear, old, walt, interviews] | very cool to hear old walt interviews |
| 1599997 | 4 | [are, you, ready, for, your, mojo, makeover, a... | are you ready for your mojo makeover ask me fo... |


In [23]:
train_df.to_csv("sentiment140_train.csv", index=False)
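
One thing to keep in mind when reloading `sentiment140_train.csv`: a CSV cell can only hold text, so the lists in the `tokens` column are written out as their string representation. The sketch below illustrates the round trip with the standard-library `csv` module standing in for pandas (an assumption made to keep the example self-contained). The single row reuses the "very cool to hear old walt interviews" tweet from the output above.

```python
import ast
import csv
import io

# One row shaped like the final DataFrame (sentiment, tokens, text).
tokens = ["very", "cool", "to", "hear", "old", "walt", "interviews"]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["sentiment", "tokens", "text"])
writer.writerow([4, tokens, " ".join(tokens)])  # the list is written via str()

# Read the row back: every field, including tokens, is now a plain string.
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row["tokens"]).__name__)  # str

# Option 1: parse the list literal back into a Python list.
restored = ast.literal_eval(row["tokens"])

# Option 2: rebuild the tokens from the space-joined text column.
resplit = row["text"].split()
```

Because the `text` column is just the tokens joined by spaces, splitting it on whitespace is usually the simpler way to recover the token lists.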