# STS-Gold dataset preparation

Clean and prepare the STS-Gold dataset, which contains 2034 manually annotated tweets.

The dataset can be downloaded from:
https://github.com/pollockj/world_mood/tree/master/sts_gold_v03

## Initialization

Import Python packages, initialize parameters, and load the dataset from its CSV file.

### Imports

Import the required Python packages.

In [1]:
import pandas as pd
import numpy as np
import string

from nltk.tokenize import TweetTokenizer
from bs4 import BeautifulSoup
from tqdm import tqdm

# activate tqdm for pandas
tqdm.pandas()

### Parameters

Initialize variables for the dataset headers and the input and output file locations.

In [2]:
dataset_csv = '../../data/external/sts-gold.csv'
dataset_headers = ['id', 'polarity', 'tweet']

clean_csv = '../../data/interim/sts-gold-clean.csv'

test_txt = '../../reports/sts-gold_test.txt'

### Load dataset

Load the dataset from disk and drop the unused id column.

In [3]:
df = pd.read_csv(dataset_csv, sep=';')
df.drop(['id'], axis=1, inplace=True)

### Remap polarities

Remap the positive sentiment polarity from 4 to 1; the negative polarity stays 0.

In [4]:
df.polarity = df.polarity.map({0:0, 4:1})
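
As a sanity check on the remapping, the sketch below runs the same load, drop, and map steps on a small in-memory CSV. The three sample rows (including the polarity value 2) are invented for illustration and are not taken from STS-Gold. It relies on the pandas behavior that mapping with a dict yields NaN for any value missing from the dict, so unexpected labels can be spotted before they propagate.

```python
import io

import pandas as pd

# Hypothetical sample in the same semicolon-separated layout; not real STS-Gold rows.
sample_csv = io.StringIO(
    'id;polarity;tweet\n'
    '1;0;"bad day"\n'
    '2;4;"great day"\n'
    '3;2;"just a day"\n'
)

sample = pd.read_csv(sample_csv, sep=';')
sample = sample.drop(columns=['id'])

# Values that are not keys of the dict (here: 2) are mapped to NaN.
mapped = sample['polarity'].map({0: 0, 4: 1})

# Collect the original labels that had no mapping.
unmapped = sample.loc[mapped.isna(), 'polarity'].unique()
print(list(unmapped))
```

On the real dataset, an empty result would confirm that only the labels 0 and 4 occur.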

## Cleanup

The following cleanup steps will be applied on the dataset:

1. Convert to lowercase
2. Decode all HTML-encoded symbols
3. Tokenize using NLTK's Twitter tokenizer
4. Filter out all http link tokens
5. Filter out all mention tokens
6. Filter out all hashtag tokens
7. Filter out all tokens containing characters other than lowercase ASCII letters and apostrophes
8. Filter out tokens that consist of a single apostrophe
9. Join tokens back together

### Process text function

Define a function that will be applied to all texts in the dataset.

In [5]:
# create a set of all lowercase ASCII letters plus the apostrophe
letters = set(string.ascii_lowercase + "'")
tokenizer = TweetTokenizer()

def process_text(text):
    text = text.lower()
    text = BeautifulSoup(text, 'lxml').get_text()
    tokens = tokenizer.tokenize(text)
    tokens = list(filter(lambda t: not t.startswith('http'), tokens))
    tokens = list(filter(lambda t: not t.startswith('@'), tokens))
    tokens = list(filter(lambda t: not t.startswith('#'), tokens))
    tokens = list(filter(lambda t: set(t).issubset(letters), tokens))
    tokens = list(filter(lambda t: not t == "'", tokens))
    return " ".join(tokens)
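
To make the effect of the token filters easier to see, here is a dependency-free sketch of the same pipeline. It is not the function above: whitespace splitting stands in for TweetTokenizer and html.unescape stands in for BeautifulSoup, and the names filter_tokens and process_text_sketch as well as the example tweet are made up for illustration. Because whitespace splitting keeps punctuation attached to words, a token such as "rain!!" is dropped here, whereas a tweet-aware tokenizer would normally split it first.

```python
import html
import string

# Lowercase ASCII letters plus the apostrophe, as in the cleanup step.
LETTERS = set(string.ascii_lowercase + "'")


def filter_tokens(tokens):
    """Apply the link, mention, hashtag, and letter filters to a token list."""
    kept = []
    for token in tokens:
        if token.startswith(('http', '@', '#')):
            continue  # links, mentions, hashtags
        if not set(token).issubset(LETTERS):
            continue  # digits, punctuation, non-ASCII characters
        if token == "'":
            continue  # a lone apostrophe carries no content
        kept.append(token)
    return kept


def process_text_sketch(text):
    # Stand-ins (assumptions): html.unescape and str.split replace
    # BeautifulSoup and TweetTokenizer.
    text = html.unescape(text.lower())
    return " ".join(filter_tokens(text.split()))


example = "@Bob I don't LOVE mondays &amp; rain!! http://t.co/x #sad ok"
result = process_text_sketch(example)
print(result)  # i don't love mondays ok
```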

### Process dataset

Apply the function to every tweet in the dataset.

In [6]:
df.tweet = df.tweet.progress_map(process_text)

100%|██████████| 2034/2034 [00:00<00:00, 4923.12it/s]


## Save clean dataset

Save the cleaned dataset to disk.

In [7]:
df.to_csv(clean_csv, index=False)

## Generate benchmark file

Create the input file for SentiStrength from the original, uncleaned tweets, one tweet per line.

In [8]:
df_bench = pd.read_csv(dataset_csv, sep=';')
np.savetxt(test_txt, df_bench.tweet, fmt="%s")
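
If the benchmark tool expects exactly one text per line (an assumption here), a tweet with an embedded line break would shift every following line. The sketch below shows the effect on invented sample tweets and one possible fix, collapsing all whitespace before writing. Whether the real dataset contains such tweets is not checked here.

```python
import os
import tempfile

import numpy as np

# Invented sample tweets; the second one contains a line break.
tweets = ["first tweet", "second\ntweet with a line break", "third tweet"]

# Collapse every run of whitespace (including newlines) into a single space.
flattened = [" ".join(t.split()) for t in tweets]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bench.txt")

    # Written as-is, the embedded newline produces an extra line.
    np.savetxt(path, tweets, fmt="%s")
    with open(path) as fh:
        raw_lines = len(fh.read().splitlines())

    # After flattening, the file has one line per tweet.
    np.savetxt(path, flattened, fmt="%s")
    with open(path) as fh:
        flat_lines = len(fh.read().splitlines())

print(raw_lines, flat_lines)  # 4 3
```

Comparing the line count of the written file with the number of rows in the DataFrame is a quick way to detect the problem.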