In [1]:
# Silence all library warnings by replacing warnings.warn with a no-op
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
from collections import defaultdict
import os
import sys

sys.path.append(os.path.abspath(os.pardir))

import pandas as pd

from bella.helper import read_config, full_path
from bella.parsers import semeval_14, dong, election
from bella.data_types import TargetCollection
from bella.tokenisers import ark_twokenize

In [2]:
# Load all of the datasets
semeval_14_rest_train = semeval_14(full_path(read_config('semeval_2014_rest_train')))
semeval_14_lap_train = semeval_14(full_path(read_config('semeval_2014_lap_train')))
semeval_14_rest_test = semeval_14(full_path(read_config('semeval_2014_rest_test')))
semeval_14_lap_test = semeval_14(full_path(read_config('semeval_2014_lap_test')))
dong_train = dong(full_path(read_config('dong_twit_train_data')))
dong_test = dong(full_path(read_config('dong_twit_test_data')))
election_train, election_test = election(full_path(read_config('election_folder_dir')))
mitchel_train = semeval_14(full_path(read_config('mitchel_train')))
mitchel_test = semeval_14(full_path(read_config('mitchel_test')))

youtubean = semeval_14(full_path(read_config('youtubean')))
semeval_14_rest = TargetCollection.combine_collections(semeval_14_rest_train,
                                                       semeval_14_rest_test)
semeval_14_laptop = TargetCollection.combine_collections(semeval_14_lap_train,
                                                         semeval_14_lap_test)
# Named *_all so that the imported `dong` and `election` parsers are not
# shadowed, which would break this cell if it were run a second time
dong_all = TargetCollection.combine_collections(dong_train, dong_test)
election_all = TargetCollection.combine_collections(election_train, election_test)
mitchel = TargetCollection.combine_collections(mitchel_train, mitchel_test)
# Map each dataset name to its combined train and test collection
datasets = {'SemEval 14 Laptop' : semeval_14_laptop, 'SemEval 14 Resturant' : semeval_14_rest,
            'Mitchel' : mitchel, 'Dong Twitter' : dong_all,
            'Election Twitter' : election_all, 'YouTuBean' : youtubean}

# Datasets
This notebook describes the datasets that have been used and reports their statistics. The datasets are the following:
1. [Dong et al.](https://aclanthology.coli.uni-saarland.de/papers/P14-2009/p14-2009) [Twitter dataset](https://github.com/bluemonk482/tdparse/tree/master/data/lidong). Note that the dataset link does not point to the version released with the paper, because that version has already been pre-processed whereas this one has not.
2. [SemEval 2014 Restaurant dataset](http://alt.qcri.org/semeval2014/task4/index.php?id=data-and-tools). We used train dataset version 2 and the test dataset. This dataset contains 4 sentiment values: 1. Positive, 2. Neutral, 3. Negative, and 4. Conflict. We only use the first 3, both to make it comparable to the other datasets and because the conflict label only has 91 instances in the training set and 14 in the test set.
3. [SemEval 2014 Laptop dataset](http://alt.qcri.org/semeval2014/task4/index.php?id=data-and-tools). We used train dataset version 2 and the test dataset. This dataset contains 4 sentiment values: 1. Positive, 2. Neutral, 3. Negative, and 4. Conflict. We only use the first 3, both to make it comparable to the other datasets and because the conflict label only has 45 instances in the training set and 16 in the test set.
4. [Election dataset](https://figshare.com/articles/EACL_2017_-_Multi-target_UK_election_Twitter_sentiment_corpus/4479563/1)
5. [Youtubean dataset](https://github.com/epochx/opinatt/blob/master/samsung_galaxy_s5.xml) [by Marrese-Taylor et al.](https://www.aclweb.org/anthology/W17-5213). A dataset of 7 YouTube reviews of the Samsung Galaxy S5. The text is the closed captions of the videos, where the captions were provided by the authors and not automatically generated. The dataset contains 7 conflict labels, which the original paper mapped to neutral labels. In our experiments we remove these labels, so the statistics presented here differ slightly from those in the original paper. If you parse the dataset and include the conflicts, the statistics will match the original paper (to do so, add the `conflict=True` parameter to the *semeval_14* function).
6. [Mitchel dataset](http://www.m-mitchell.com/code/MitchellEtAl-13-OpenSentiment.tgz), which was released with this [paper](https://www.aclweb.org/anthology/D13-1171). The dataset consists of Tweets where the targets are named entities, specifically either organisations or persons.
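
As a rough illustration of what removing the conflict label means, the sketch below filters a handful of made-up target records. The record layout and the `drop_conflicts` helper are hypothetical and are not part of bella; in the notebook itself this filtering is handled by the `semeval_14` parser.

```python
from collections import Counter

# Hypothetical toy records; the real datasets are parsed by bella's semeval_14.
targets = [
    {'target': 'battery', 'sentiment': 'positive'},
    {'target': 'screen', 'sentiment': 'conflict'},
    {'target': 'price', 'sentiment': 'negative'},
    {'target': 'camera', 'sentiment': 'neutral'},
]


def drop_conflicts(records):
    """Keep only the targets labelled positive, neutral or negative."""
    kept_labels = {'positive', 'neutral', 'negative'}
    return [record for record in records if record['sentiment'] in kept_labels]


filtered = drop_conflicts(targets)
# Count how many targets remain for each of the three kept labels
label_counts = Counter(record['sentiment'] for record in filtered)
```

After filtering, every dataset is left with the same three sentiment labels, which is what makes the label distributions comparable across datasets.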

In [3]:
dataset_dict = defaultdict(list)
index = []
columns = ['Domain', 'Type', 'Medium', 'No. Targets (Dataset Size)', 
           'No. Senti Labels', 'Mean Targets per Sent', 'No Unique Targets',
           '% Targets with 1 Sentiment per Sentence', '% Targets with 2 Sentiment per Sentence', 
           '% Targets with 3 Sentiment per Sentence', 'Avg sentence length per target']
name_domain = {'SemEval 14 Laptop' : 'Laptop', 'SemEval 14 Resturant' : 'Restaurant', 
               'Mitchel' : 'Unknown', 'Dong Twitter' : 'General', 'Election Twitter' : 'Politics',
               'YouTuBean' : 'Mobile Phones'}
name_type = {'SemEval 14 Laptop' : 'Review', 'SemEval 14 Resturant' : 'Review', 
               'Mitchel' : 'Social Media', 'Dong Twitter' : 'Social Media', 'Election Twitter' : 'Social Media',
               'YouTuBean' : 'Review'}
name_medium = {'SemEval 14 Laptop' : 'Written', 'SemEval 14 Resturant' : 'Written', 
               'Mitchel' : 'Written', 'Dong Twitter' : 'Written', 'Election Twitter' : 'Written',
               'YouTuBean' : 'Spoken'}
for name, dataset in datasets.items():
    index.append(name)
    targets_i_senti = []
    num_targets = len(dataset)
    num_sentiment_labels = len(dataset.stored_sentiments())
    avg_sent_length = dataset.avg_sentence_length_per_target()
    for i in range(1, 4):
        if i > num_sentiment_labels:
            targets_i_senti.append(0)
        else:
            i_senti_targets = len(dataset.subset_by_sentiment(i))
            targets_i_senti.append((i_senti_targets / num_targets) * 100)
            
    dataset_dict['Domain'].append(name_domain[name])
    dataset_dict['Type'].append(name_type[name])
    dataset_dict['Medium'].append(name_medium[name])
    dataset_dict['No. Targets (Dataset Size)'].append(num_targets)
    dataset_dict['No. Senti Labels'].append(num_sentiment_labels)
    dataset_dict['Mean Targets per Sent'].append(dataset.avg_targets_per_sentence())
    dataset_dict['No Unique Targets'].append(dataset.number_unique_targets())
    dataset_dict['% Targets with 1 Sentiment per Sentence'].append(targets_i_senti[0])
    dataset_dict['% Targets with 2 Sentiment per Sentence'].append(targets_i_senti[1])
    dataset_dict['% Targets with 3 Sentiment per Sentence'].append(targets_i_senti[2])
    dataset_dict['Avg sentence length per target'].append(avg_sent_length)
    

dataset_stats = pd.DataFrame(dataset_dict, index=index, columns=columns)
dataset_stats.round(2)

| Dataset | Domain | Type | Medium | No. Targets (Dataset Size) | No. Senti Labels | Mean Targets per Sent | No Unique Targets | % Targets with 1 Sentiment per Sentence | % Targets with 2 Sentiment per Sentence | % Targets with 3 Sentiment per Sentence | Avg sentence length per target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SemEval 14 Laptop | Laptop | Review | Written | 2951 | 3 | 1.58 | 1295 | 81.09 | 17.62 | 1.29 | 18.57 |
| SemEval 14 Resturant | Restaurant | Review | Written | 4722 | 3 | 1.83 | 1630 | 75.26 | 22.94 | 1.8 | 17.25 |
| Mitchel | Unknown | Social Media | Written | 3288 | 3 | 1.22 | 2507 | 90.48 | 9.43 | 0.09 | 18.02 |
| Dong Twitter | General | Social Media | Written | 6940 | 3 | 1.0 | 145 | 100.0 | 0.0 | 0.0 | 17.37 |
| Election Twitter | Politics | Social Media | Written | 11899 | 3 | 2.94 | 2190 | 44.5 | 46.72 | 8.78 | 21.68 |
| YouTuBean | Mobile Phones | Review | Spoken | 798 | 3 | 2.07 | 522 | 81.45 | 18.17 | 0.38 | 22.53 |


The high level statistics are presented above. At first it is surprising that the social media data has such a high average sentence length, but in the Twitter datasets a "sentence" is actually a whole Tweet, whereas the SemEval and YouTuBean data have been sentence split. Even so, the YouTuBean data still has the longest sentences after sentence splitting, which could be due to it being speech text rather than written text.

The datasets also vary in the number of distinct sentiments per sentence. In most of them the large majority of targets are in sentences with only one distinct sentiment; the exception is the Election dataset, which has quite an even split between 1 and 2 distinct sentiments (44.5% and 46.72% of targets).

Lastly, the Election dataset has by far the highest number of targets per sentence (2.94), and this is not proportional to its average sentence length.
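
To make the "distinct sentiments per sentence" columns concrete, here is a minimal sketch on made-up data. It assumes that each target is counted according to how many distinct sentiment labels occur among all targets in its sentence; that is a reading of the column names and of the fact that the three percentages sum to 100, not a confirmed description of how bella computes them. The triples and both helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: (sentence_id, target, sentiment) triples.
samples = [
    ('s1', 'Labour', 'negative'),
    ('s1', 'NHS', 'positive'),
    ('s2', 'Cameron', 'neutral'),
    ('s3', 'SNP', 'positive'),
    ('s3', 'Tories', 'positive'),
]


def distinct_sentiment_percentages(triples):
    """Return {k: % of targets whose sentence contains k distinct sentiments}."""
    sentiments_in_sentence = defaultdict(set)
    for sentence_id, _, sentiment in triples:
        sentiments_in_sentence[sentence_id].add(sentiment)
    counts = defaultdict(int)
    for sentence_id, _, _ in triples:
        counts[len(sentiments_in_sentence[sentence_id])] += 1
    total = len(triples)
    return {k: 100 * n / total for k, n in sorted(counts.items())}


def mean_targets_per_sentence(triples):
    """Number of targets divided by the number of distinct sentences."""
    sentence_ids = {sentence_id for sentence_id, _, _ in triples}
    return len(triples) / len(sentence_ids)


percentages = distinct_sentiment_percentages(samples)
mean_targets = mean_targets_per_sentence(samples)
```

In this toy example the two targets in `s1` fall in the "2 distinct sentiments" bucket, while the two targets in `s3` share a sentiment and so fall in the "1 distinct sentiment" bucket, even though `s3` has more than one target.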

## Syntactic Complexity of the datasets
The statistics above are all quite high level and, apart from perhaps the average sentence length, none of them is linguistically motivated. Therefore the table below reports the average constituency tree depth of each dataset, which can be viewed as a measure of syntactic complexity. [Marrese-Taylor et al.](https://www.aclweb.org/anthology/W17-5213) reported this statistic for the datasets they used; here we present the same statistic for all of the datasets above.
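
For readers unfamiliar with the statistic, the sketch below shows one way to measure the depth of a constituency tree given as a Penn Treebank style bracketed string. It is only an illustration: the `constituency_depth` helper and the two parses are hypothetical, it assumes that brackets inside tokens have been escaped (for example as `-LRB-` and `-RRB-`), and the depth convention used by `avg_constituency_depth` in bella may differ (some toolkits also count the word level, which would add one).

```python
def constituency_depth(bracketed):
    """Maximum bracket nesting depth of a bracketed parse string."""
    depth = 0
    max_depth = 0
    for char in bracketed:
        if char == '(':
            depth += 1
            max_depth = max(max_depth, depth)
        elif char == ')':
            depth -= 1
    return max_depth


# Hypothetical hand-written parses, not produced by a real parser.
parses = [
    '(S (NP (DT The) (NN battery)) (VP (VBZ is) (ADJP (JJ great))))',
    '(S (NP (PRP It)) (VP (VBZ works)))',
]
depths = [constituency_depth(parse) for parse in parses]
average_depth = sum(depths) / len(depths)
```

Deeper nesting, such as the adjective phrase inside the verb phrase in the first parse, raises the depth, which is why the average can be read as a rough proxy for syntactic complexity.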

In [7]:
dataset_ling_dict = defaultdict(list)
index = []
columns = ['Average constituency tree depth']
for name, dataset in datasets.items():
    index.append(name)
    dataset_ling_dict['Average constituency tree depth'].append(dataset.avg_constituency_depth())

In [8]:
dataset_ling_stats = pd.DataFrame(dataset_ling_dict, index=index, columns=columns)
dataset_ling_stats.round(2)

| Dataset | Average constituency tree depth |
|---|---|
| SemEval 14 Laptop | 10.61 |
| SemEval 14 Resturant | 9.73 |
| Mitchel | 8.28 |
| Dong Twitter | 8.34 |
| Election Twitter | 9.67 |
| YouTuBean | 11.72 |


Interestingly, the Election dataset, which has the 2nd largest average sentence length and by far the largest number of targets per sentence, does not have the largest average tree depth. The YouTuBean dataset does, which may suggest that spoken text is syntactically more complex than written text.