# Data Analysis - STS Dataset

The dataset used in this example comes from the SemEval-2012 Semantic Textual Similarity (STS) task, available from the [Semantic Textual Similarity Wiki page](http://ixa2.si.ehu.es/stswiki/index.php/Main_Page).

There are 3 sets of sentence pairs and associated scores in the training set, as listed below:

* 750 sentence pairs from Microsoft Research Paraphrase Corpus (MSR-Paraphrase)
* 750 sentence pairs from Microsoft Research Video Description Corpus (MSR-Video)
* 734 sentence pairs from [WMT2008 Development Dataset](http://www.statmt.org/wmt08/shared-evaluation-task.html)

For each corpus, the sentence pairs are in one .txt file (`STS.input.*.txt`, one tab-separated pair per line) and the associated labels (real values from 0 to 5) are in the corresponding .gs file (`STS.gs.*.txt`, one score per line).

The test set consists of 750, 750 and 459 different sentence pairs from these same 3 corpora.

In [1]:
from __future__ import division, print_function
import collections
import nltk
import os
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [2]:
DATA_DIR = "../data"
TRAIN_DIR = os.path.join(DATA_DIR, "train")
TEST_DIR = os.path.join(DATA_DIR, "test-gold")

DATA_SOURCES = ["MSRpar", "MSRvid", "SMTeuroparl"]
SPAIR_FILE_TPL = "STS.input.{:s}.txt"
LABEL_FILE_TPL = "STS.gs.{:s}.txt"

VOCAB_FILE = os.path.join(DATA_DIR, "sts-vocab.tsv")

## Data Loading

In [3]:
def load_data(datadir):
    """Load sentence pairs and similarity scores for every data source."""
    xleft, xright, ys = [], [], []
    for data_source in DATA_SOURCES:
        # similarity scores, one per line
        label_filename = LABEL_FILE_TPL.format(data_source)
        with open(os.path.join(datadir, label_filename)) as flabel:
            for line in flabel:
                ys.append(float(line.strip()))
        # sentence pairs, tab-separated, one pair per line
        spair_filename = SPAIR_FILE_TPL.format(data_source)
        with open(os.path.join(datadir, spair_filename)) as fsents:
            for line in fsents:
                left, right = line.strip().split("\t")
                xleft.append(left)
                xright.append(right)
        # every sentence pair must have exactly one score
        assert len(xleft) == len(xright) and len(xright) == len(ys)
    return xleft, xright, ys
    
xleft_train, xright_train, ys_train = load_data(TRAIN_DIR)
xleft_test, xright_test, ys_test = load_data(TEST_DIR)

print((len(xleft_train), len(xright_train), len(ys_train)),
      (len(xleft_test), len(xright_test), len(ys_test)))

(2234, 2234, 2234) (1959, 1959, 1959)
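
As a quick sanity check, the totals printed above should equal the sums of the per-corpus sizes given in the introduction. The sketch below copies those sizes by hand and assumes the `DATA_SOURCES` names correspond to the three corpora in the order they were listed. The dictionary names are illustrative and not part of the notebook.

```python
# Per-corpus sizes copied from the dataset description; the key names
# are assumed to match DATA_SOURCES in the order the corpora were listed.
train_sizes = {"MSRpar": 750, "MSRvid": 750, "SMTeuroparl": 734}
test_sizes = {"MSRpar": 750, "MSRvid": 750, "SMTeuroparl": 459}

train_total = sum(train_sizes.values())
test_total = sum(test_sizes.values())
print(train_total, test_total)
```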


## Vocabulary

In [4]:
counter = collections.Counter()
for sent_coll in [xleft_train, xright_train, xleft_test, xright_test]:
    for sent in sent_coll:
        sent = sent.decode("utf8").encode("ascii", "ignore").lower()
        for word in nltk.word_tokenize(sent):
            counter[word] += 1

fvocab = open(VOCAB_FILE, "wb")
for word, count in counter.most_common():
    fvocab.write("{:s}\t{:d}\n".format(word, count))
fvocab.close()
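
The cell above relies on Python 2 byte strings (`sent.decode("utf8")`). Under Python 3, where text read from a file is usually already `str`, the same ASCII-only, lowercased normalization might look roughly like the sketch below. The helper names `normalize` and `simple_tokenize` are hypothetical, and the regex tokenizer is only a stand-in for `nltk.word_tokenize` so that the example runs without NLTK data.

```python
import collections
import re

def normalize(sent):
    """Drop non-ASCII characters and lowercase; accepts bytes or str."""
    if isinstance(sent, bytes):
        sent = sent.decode("utf8")
    return sent.encode("ascii", "ignore").decode("ascii").lower()

def simple_tokenize(sent):
    # Stand-in for nltk.word_tokenize: runs of word characters,
    # plus each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", sent)

# Two made-up sentences, not taken from the STS data.
demo_counter = collections.Counter()
for sent in ["A caf\u00e9 in Paris.", "A man is playing a guitar."]:
    for word in simple_tokenize(normalize(sent)):
        demo_counter[word] += 1

print(demo_counter.most_common(3))
```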

In [5]:
vocab_size = len([w for w, c in counter.most_common() if c >= 2])
print(vocab_size)

6940
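
The vocabulary size above counts only words that occur at least twice. To see how such a minimum-count threshold changes the size, here is a small sketch on a toy counter. The sentence and the helper `vocab_size_at` are made up for illustration and say nothing about the real STS counts.

```python
import collections

def vocab_size_at(counter, min_count):
    """Number of distinct words occurring at least min_count times."""
    return sum(1 for c in counter.values() if c >= min_count)

# Toy counts standing in for the notebook's `counter`.
toy_counter = collections.Counter(
    "the cat sat on the mat and the cat ran".split())

for min_count in (1, 2, 3):
    print(min_count, vocab_size_at(toy_counter, min_count))
```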


## Compute #-words per sentence

In [6]:
num_words_list = []
for sent_coll in [xleft_train, xright_train, xleft_test, xright_test]:
    for sent in sent_coll:
        sent = sent.decode("utf8").encode("ascii", "ignore").lower()
        num_words_list.append(len(nltk.word_tokenize(sent)))
num_words = np.array(num_words_list)

plt.hist(num_words, bins=10)
plt.show()

In [7]:
for i in range(90, 100):
    print("{:d} percentile, #-words: {:.0f}".format(
        i, np.percentile(num_words, i)))

90 percentile, #-words: 30
91 percentile, #-words: 30
92 percentile, #-words: 31
93 percentile, #-words: 32
94 percentile, #-words: 34
95 percentile, #-words: 36
96 percentile, #-words: 39
97 percentile, #-words: 44
98 percentile, #-words: 50
99 percentile, #-words: 60
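
Each value above is a sentence length in words: roughly the stated percentage of sentences are at most that long. A minimal sketch of that relationship, using made-up lengths rather than the STS data, and NumPy's default linear interpolation:

```python
import numpy as np

# Made-up sentence lengths (in words), for illustration only.
toy_lengths = np.array([5, 7, 8, 9, 10, 12, 15, 20, 30, 60])

# 90th percentile of the lengths, then the fraction of sentences
# that are no longer than this cut-off.
cutoff = np.percentile(toy_lengths, 90)
coverage = np.mean(toy_lengths <= cutoff)
print(cutoff, coverage)
```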


## Labels

In [8]:
# NOTE: the int32 cast truncates the real-valued scores to whole numbers
y = np.array(ys_train + ys_test, dtype="int32")
print(y.shape, min(y), max(y))

(4193,) 0 5
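
Because the labels are real-valued, building the array with `dtype="int32"` drops the fractional part of each score. The sketch below, on made-up scores, contrasts that truncation with rounding to the nearest integer and shows one way to count labels per integer bucket with `np.bincount`. Whether truncation or rounding suits a given model is left open here.

```python
import numpy as np

# Made-up scores in the 0-5 range, not taken from the STS data.
toy_scores = [0.0, 0.75, 2.4, 3.8, 4.999, 5.0]

as_int = np.array(toy_scores, dtype="int32")      # truncates toward zero
as_rounded = np.rint(toy_scores).astype("int32")  # rounds to nearest integer
print(as_int)
print(as_rounded)

# Number of labels falling into each integer bucket 0..5 after truncation.
print(np.bincount(as_int, minlength=6))
```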
