In [1]:
# Silence all warnings by replacing warnings.warn with a no-op.
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
import os
import sys
import random as rn
rn.seed(42)

sys.path.append(os.path.abspath(os.pardir))

import numpy as np
np.random.seed(42)
from sklearn.model_selection import StratifiedShuffleSplit

from bella.helper import read_config, full_path
from bella.parsers import mitchel
from bella.data_types import TargetCollection, Target
from bella import write_data

# Creating Training and Test Sets for the Mitchell et al. Dataset
This notebook shows how we created the training and test sets for this dataset.

The original dataset can be downloaded from [here](http://www.m-mitchell.com/code/MitchellEtAl-13-OpenSentiment.tgz) and the accompanying paper can be found [here](https://www.aclweb.org/anthology/D13-1171). Mitchell et al. evaluated their models with 10-fold cross-validation, so the dataset has no single train/test split. We therefore take one of their train/test folds, combine the two parts, and re-split the result into 70% train and 30% test. We then save the new train and test sets as XML in the same format as the [SemEval 2014](http://alt.qcri.org/semeval2014/task4/) datasets; we chose this format because we found it the easiest to parse, use, and read.

First ensure the following has been done:
1. Download the dataset and pick one train/test fold from the folder `/en/10-fold` (we used `train.1` and `test.1`).
2. Ensure that the following values in the [config.yaml](../config.yaml) file have the correct file paths:
  1. mitchel_org_train = the file path to train.1
  2. mitchel_org_test = the file path to test.1
  3. mitchel_train = the file path where the new training dataset should be written
  4. mitchel_test = the file path where the new test dataset should be written
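
The configuration requirement above can be checked before running the rest of the notebook. The following is a minimal sketch, assuming the config has already been loaded into a plain dictionary; the helper `missing_config_keys` and the example paths are hypothetical and are not part of `bella`.

```python
# The four config keys named in the list above.
REQUIRED_KEYS = (
    "mitchel_org_train",
    "mitchel_org_test",
    "mitchel_train",
    "mitchel_test",
)


def missing_config_keys(config):
    """Return the required keys that are absent or empty in ``config``."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Hypothetical paths, for illustration only.
example_config = {
    "mitchel_org_train": "data/en/10-fold/train.1",
    "mitchel_org_test": "data/en/10-fold/test.1",
    "mitchel_train": "data/mitchel_train.xml",
    "mitchel_test": "data/mitchel_test.xml",
}

missing = missing_config_keys(example_config)
if missing:
    raise KeyError(f"config.yaml is missing values for: {missing}")
```

An empty list means all four paths are set; it does not check that the files actually exist.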

The original dataset contains 3288 targets, as stated in the paper. This notebook shows that our parsed version contains the same number of targets, which indicates that the dataset was parsed correctly.

In [2]:
# Mitchel Dataset
mitchel_org_train = mitchel(full_path(read_config('mitchel_org_train')))
mitchel_org_test = mitchel(full_path(read_config('mitchel_org_test')))

mitchel_combined = TargetCollection.combine_collections(mitchel_org_train, 
                                                        mitchel_org_test)
m_dataset_size = len(mitchel_combined)

Parsed dataset size = {{m_dataset_size}}
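
The comparison with the paper's figure can also be made explicit rather than read off by eye. This is a small sketch, assuming the only thing to verify is the total count; `check_target_count` is a hypothetical helper, and 3288 is the number quoted from the paper above.

```python
EXPECTED_TARGETS = 3288  # target count quoted from the paper in the text above


def check_target_count(actual, expected=EXPECTED_TARGETS):
    """Raise if the parsed target count differs from the expected one."""
    if actual != expected:
        raise ValueError(
            f"Parsed {actual} targets but expected {expected}; "
            "the parser or the chosen fold may be wrong."
        )
    return actual


# In the notebook this would be called as check_target_count(m_dataset_size).
checked = check_target_count(3288)
```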

In [3]:
# A single stratified 70/30 split, seeded so that it is reproducible.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)

mitchel_data = np.asarray(mitchel_combined.data_dict())
mitchel_sentiment = np.asarray(mitchel_combined.sentiment_data())
for train_indexs, test_indexs in splitter.split(mitchel_data, mitchel_sentiment):
    train_data = mitchel_data[train_indexs]
    test_data = mitchel_data[test_indexs]
    
convert_to_targets = lambda data: [Target(**target) for target in data]
mitchel_train = TargetCollection(convert_to_targets(train_data))
mitchel_test = TargetCollection(convert_to_targets(test_data))

The dataset has now been split with stratification on the class labels, so each class label appears in the same proportion in the train and test splits as in the combined data. This can be seen here:

Train Data ratio: **{{mitchel_train.ratio_targets_sentiment()}}**  
Train Data raw values: **{{mitchel_train.no_targets_sentiment()}}**

Test Data ratio: **{{mitchel_test.ratio_targets_sentiment()}}**  
Test Data raw values: **{{mitchel_test.no_targets_sentiment()}}**

Original Data ratio: **{{mitchel_combined.ratio_targets_sentiment()}}**  
Original Data raw values: **{{mitchel_combined.no_targets_sentiment()}}**
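
To illustrate what the stratified split preserves, here is a minimal sketch on toy labels rather than the real dataset. The label counts and the `class_ratios` helper are hypothetical; only `StratifiedShuffleSplit` and its parameters are taken from the cell above.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels (hypothetical): 60% neutral, 30% positive, 10% negative.
labels = np.array(["neutral"] * 60 + ["positive"] * 30 + ["negative"] * 10)
# Placeholder features; only the labels matter for stratification.
features = np.arange(len(labels)).reshape(-1, 1)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(features, labels))


def class_ratios(values):
    """Map each class label to its share of ``values``."""
    counts = Counter(values.tolist())
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}


train_ratios = class_ratios(labels[train_idx])
test_ratios = class_ratios(labels[test_idx])
print("train:", train_ratios)
print("test: ", test_ratios)
```

With these toy counts the class proportions divide evenly, so both splits reproduce them exactly. On real data the proportions match only up to rounding.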

We now save the data as XML files in the same format as the SemEval 2014 data.

In [4]:
write_data.semeval_14(full_path(read_config('mitchel_train')), mitchel_train)
write_data.semeval_14(full_path(read_config('mitchel_test')), mitchel_test)
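
For readers unfamiliar with the target format, the sketch below builds and re-parses a tiny document in the SemEval 2014 Task 4 aspect-term layout using only the standard library. It is not the implementation of `write_data.semeval_14`, and whether `bella` writes exactly these elements and attributes is an assumption; the toy sentence and the `build_semeval_xml` helper are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_semeval_xml(sentences):
    """Serialise sentences and their targets in a SemEval 2014 style layout."""
    root = ET.Element("sentences")
    for sentence in sentences:
        sentence_el = ET.SubElement(root, "sentence", id=sentence["id"])
        ET.SubElement(sentence_el, "text").text = sentence["text"]
        terms_el = ET.SubElement(sentence_el, "aspectTerms")
        for target in sentence["targets"]:
            start = target["start"]
            # "from" and "to" are character offsets into the sentence text.
            ET.SubElement(terms_el, "aspectTerm", {
                "term": target["target"],
                "polarity": target["polarity"],
                "from": str(start),
                "to": str(start + len(target["target"])),
            })
    return ET.tostring(root, encoding="unicode")


# Toy example (hypothetical, not taken from the dataset).
toy_sentences = [{
    "id": "s1",
    "text": "The battery life is great",
    "targets": [
        {"target": "battery life", "start": 4, "polarity": "positive"},
    ],
}]

xml_text = build_semeval_xml(toy_sentences)

# Reading the XML back recovers the targets and their polarities.
parsed = ET.fromstring(xml_text)
polarities = [term.get("polarity") for term in parsed.iter("aspectTerm")]
```

Parsing the saved train and test files in the same way gives a quick check that the number of targets per polarity matches the raw values reported above.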