# Sentiment140 dataset preparation

Clean and prepare the Stanford Sentiment140 dataset.

The training dataset contains 1,600,000 tweets, split equally between positive and negative.

The dataset can be downloaded from:
http://help.sentiment140.com/for-students

## Initialization

Import Python packages, initialize parameters and load the datasets from CSV files.

### Imports

Import the required Python packages.

In [1]:
import pandas as pd
import numpy as np
import string

from nltk.tokenize import TweetTokenizer
from bs4 import BeautifulSoup
from tqdm import tqdm

tqdm.pandas()

### Parameters

Initialize variables for the dataset column headers and for the input and output file locations.

In [2]:
train_dataset_csv = '../../data/external/training.1600000.processed.noemoticon.csv'
test_dataset_csv = '../../data/external/testdata.manual.2009.06.14.csv'
dataset_headers = ['polarity', 'id', 'date', 'query', 'user', 'text']

train_clean_csv= '../../data/interim/sentiment140_train_clean.csv'
test_clean_csv= '../../data/interim/sentiment140_test_clean.csv'
test_txt = '../../reports/sentiment140_test.txt'

### Load datasets

Load the training and test datasets from disk and drop the unnecessary columns.

In [3]:
df_train = pd.read_csv(train_dataset_csv, header=None, names=dataset_headers)
df_test = pd.read_csv(test_dataset_csv, header=None, names=dataset_headers)
df_train.drop(['id', 'date', 'query', 'user'], axis=1, inplace=True)
df_test.drop(['id', 'date', 'query', 'user'], axis=1, inplace=True)
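
As an alternative to loading all six columns and then dropping four, `read_csv` accepts a `usecols` argument. The sketch below is not part of the original notebook; it uses two made-up rows in the Sentiment140 column layout instead of the real files.

```python
import io
import pandas as pd

dataset_headers = ['polarity', 'id', 'date', 'query', 'user', 'text']

# Two made-up rows in the Sentiment140 layout (hypothetical data)
sample_csv = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","alice","so tired today"\n'
    '"4","2","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","bob","loving the sunshine"\n'
)

# Only the listed columns are kept while parsing, so no drop() is needed afterwards
df = pd.read_csv(sample_csv, header=None, names=dataset_headers,
                 usecols=['polarity', 'text'])
print(list(df.columns))  # ['polarity', 'text']
```

Either approach should give the same two-column frame; `usecols` simply avoids holding the unused columns in memory.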

### Remap polarities

Drop the neutral tweets (polarity 2) from the test dataset, then remap the positive sentiment polarity from 4 to 1.

In [4]:
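# Drop neutral tweets (polarity 2); copy so the assignment below does not act on a view
df_test = df_test[df_test.polarity != 2].copy()
# Remap negative (0) and positive (4) polarities to 0 and 1
df_train.polarity = df_train.polarity.map({0: 0, 4: 1})
df_test.polarity = df_test.polarity.map({0: 0, 4: 1})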
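
A small illustration of why the neutral rows are removed before remapping: `Series.map` with a dictionary turns any value missing from the dictionary into `NaN`. The polarity values below are made up for the example.

```python
import pandas as pd

# Toy polarity column (hypothetical values): negative, positive, neutral, positive
polarity = pd.Series([0, 4, 2, 4])

# Values absent from the mapping (here the neutral 2) become NaN
mapped = polarity.map({0: 0, 4: 1})
print(mapped.isna().tolist())  # [False, False, True, False]

# Filtering out the neutral rows first keeps the column free of NaN
filtered = polarity[polarity != 2].map({0: 0, 4: 1})
print(filtered.tolist())  # [0, 1, 1]
```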

## Cleanup

The following cleanup steps are applied to the datasets:

1. Convert to lowercase
2. Decode all HTML-encoded symbols
3. Tokenize using NLTK's Twitter tokenizer
4. Filter out all HTTP link tokens
5. Filter out all mention tokens
6. Filter out all hashtag tokens
7. Filter out all tokens containing characters other than lowercase ASCII letters and apostrophes, as well as lone apostrophes
8. Join tokens back together
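
Step 2 can be illustrated on its own. The sketch below uses the standard library's `html.unescape` as a stand-in for the BeautifulSoup call used in this notebook. Unlike BeautifulSoup, it does not strip markup, so treating the two as interchangeable is an assumption that only holds for plain entity decoding. The sample tweet is made up.

```python
import html

# Made-up tweet text containing HTML entities
raw = "&quot;Tom &amp; Jerry&quot; is &lt;3"

# Lowercase first (step 1), then decode the entities (step 2)
decoded = html.unescape(raw.lower())
print(decoded)  # "tom & jerry" is <3
```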

### Process text function

Define a function that will be applied to all texts in the datasets.

In [5]:
# Set of all lowercase ASCII letters plus the apostrophe
letters = set(string.ascii_lowercase + "'")
tokenizer = TweetTokenizer()

def process_text(text):
    """Lowercase, HTML-decode, tokenize and filter a single tweet."""
    text = text.lower()
    # Decode HTML entities such as &amp; and strip any markup
    text = BeautifulSoup(text, 'lxml').get_text()
    tokens = tokenizer.tokenize(text)
    # Drop links, mentions and hashtags
    tokens = [t for t in tokens if not t.startswith(('http', '@', '#'))]
    # Keep only tokens made of letters and apostrophes, excluding lone apostrophes
    tokens = [t for t in tokens if set(t).issubset(letters) and t != "'"]
    return " ".join(tokens)
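
To see what the token filters keep and discard, the sketch below applies the same rules to a hand-written token list. The list is hypothetical and only stands in for `TweetTokenizer` output, and `keep_token` is a helper introduced here for illustration, not part of the notebook.

```python
import string

# Same character set as in the cell above: lowercase ASCII letters plus the apostrophe
letters = set(string.ascii_lowercase + "'")

def keep_token(token):
    """Return True if the token would pass the link, mention, hashtag and letter filters."""
    if token.startswith(('http', '@', '#')):
        return False
    if not set(token).issubset(letters):
        return False
    return token != "'"

# Hand-written tokens (hypothetical), standing in for tokenizer output
sample_tokens = ['@user', 'i', "can't", 'wait', '!', '#happy',
                 'http://t.co/x', '2morrow', ':)', "'"]
kept = [t for t in sample_tokens if keep_token(t)]
print(" ".join(kept))  # i can't wait
```

Note that punctuation, emoticons and any token containing a digit are discarded along with links, mentions and hashtags.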

### Process datasets

Apply the function to both datasets, then drop the tweets left empty by the cleanup and reset the indices.

In [6]:
df_train.text = df_train.text.progress_map(process_text)
df_test.text = df_test.text.progress_map(process_text)
df_train = df_train[df_train.text!='']
df_test = df_test[df_test.text!='']
df_train.reset_index(drop=True, inplace=True)
df_test.reset_index(drop=True, inplace=True)

100%|██████████| 1600000/1600000 [05:13<00:00, 5110.36it/s]
100%|██████████| 359/359 [00:00<00:00, 4974.56it/s]


## Save clean datasets

Save the cleaned datasets to disk.

In [7]:
df_train.to_csv(train_clean_csv, index=False)
df_test.to_csv(test_clean_csv, index=False)

## Generate benchmark file

Create the input file for SentiStrength: the raw, uncleaned test tweets with the neutral ones excluded, written one per line.

In [8]:
df_bench = pd.read_csv(test_dataset_csv, header=None, names=dataset_headers)
df_bench = df_bench[df_bench.polarity!=2]
np.savetxt(test_txt, df_bench.text, fmt="%s")