# Index for Sameness Classification

See [**README.md**](README.md) for setup notes.

## Data

#### Yelp

https://www.yelp.com/dataset/documentation/main  
https://www.yelp.com/developers/documentation/v3/all_category_list  
&rarr; `business.json` + `review.json` + `all_category_list.json`

```bash
# download

# - https://www.yelp.com/dataset/download
#   extract business.json / review.json

# - https://www.yelp.com/developers/documentation/v3/all_category_list/categories.json
#   as `all_category_list.json`
```

See [Yelp dataset license](https://s3-media0.fl.yelpcdn.com/assets/srv0/engineering_pages/bea5c1e92bf3/assets/vendor/yelp-dataset-agreement.pdf)!

#### Amazon

https://nijianmo.github.io/amazon/index.html#subsets


#### IMDB

http://ai.stanford.edu/~amaas/data/sentiment/  
https://www.imdb.com/interfaces/

Only a single positive and a single negative review for each film  
&rarr; no review text!

## Code

[datasets.py](datasets.py)  
dataset (caching etc.), train data args

[metrics.py](metrics.py)  
simple evaluation metrics

[processors.py](processors.py)  
dataset loading (TSV), processor lookup (task definitions?)  
(_own processors in conversion code_)

[trainer.py](trainer.py)  
main run script, model args

[utils.py](utils.py)  
optional utilities (`Timer` for timed sections)
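
A timed section can be wrapped in a small context manager. The sketch below is only an illustration of the idea: the actual `Timer` in `utils.py` may have a different interface, and the `name` argument and `elapsed` attribute are assumptions.

```python
import time


class Timer:
    """Context manager that measures the wall-clock time of a code section."""

    def __init__(self, name="section"):
        self.name = name      # hypothetical label, only used for printing
        self.elapsed = None   # seconds, set when the block exits

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self._start
        print(f"[{self.name}] took {self.elapsed:.3f}s")
        return False  # do not swallow exceptions raised inside the block


with Timer("demo") as timer:
    total = sum(range(100_000))
```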

[data_prep.py](data_prep.py)  
&rarr; [data_prep_sentiment_yelp.py](data_prep_sentiment_yelp.py)  
utility functions for data preparation, like loading, shuffling, filtering and writing TSVs
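
As a rough illustration of the filter, shuffle and write steps, here is a minimal sketch using only the standard library. The function names, the `(text, label)` column layout and the fixed seed are assumptions and not taken from `data_prep.py`.

```python
import csv
import os
import random
import tempfile


def prepare_rows(rows, seed=42):
    """Drop rows whose text column is empty, then shuffle reproducibly."""
    kept = [row for row in rows if row[0].strip()]
    random.Random(seed).shuffle(kept)
    return kept


def write_tsv(rows, path, header=None):
    """Write rows to a tab-separated file, optionally with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        if header is not None:
            writer.writerow(header)
        writer.writerows(rows)


def read_tsv(path):
    """Read a tab-separated file back into a list of rows."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter="\t"))


# Toy rows with a hypothetical (text, label) layout.
rows = [("great food", "1"), ("", "0"), ("slow service", "0"), ("would return", "1")]

with tempfile.TemporaryDirectory() as out_dir:
    path = os.path.join(out_dir, "train.tsv")
    write_tsv(prepare_rows(rows), path, header=("text", "label"))
    loaded = read_tsv(path)
```

The `csv` module quotes fields that contain tabs or newlines, which matters for free-form review text.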

[trainer_siamese.py](trainer_siamese.py)  
main siamese baseline run script, train args

[utils_siamese.py](utils_siamese.py)  
[hf_argparser.py](hf_argparser.py)  
utils for siamese baseline


**NOTE:** Changed the original `Trainer` so that `do_test` runs the test evaluation on `test.tsv`, while `do_predict` runs predictions on `pred.tsv` instead of `test.tsv`.  
_Side-note: to get predictions for the test set, just symlink `pred.tsv` &rarr; `test.tsv`_
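
The symlink can also be created from Python. This is a minimal sketch: the helper name `link_pred_to_test` is hypothetical, it replaces any existing `pred.tsv`, and on Windows creating symlinks may require extra privileges.

```python
import os
import tempfile


def link_pred_to_test(data_dir):
    """Create `pred.tsv` as a relative symlink to `test.tsv` inside data_dir."""
    test_path = os.path.join(data_dir, "test.tsv")
    pred_path = os.path.join(data_dir, "pred.tsv")
    if not os.path.exists(test_path):
        raise FileNotFoundError(test_path)
    if os.path.lexists(pred_path):
        os.remove(pred_path)  # replace a stale file or link
    # The link target is relative, so the data folder can be moved as a whole.
    os.symlink("test.tsv", pred_path)
    return pred_path


# Demo in a temporary directory with a dummy test file.
with tempfile.TemporaryDirectory() as data_dir:
    with open(os.path.join(data_dir, "test.tsv"), "w", encoding="utf-8") as handle:
        handle.write("dummy\n")
    pred_path = link_pred_to_test(data_dir)
    is_link = os.path.islink(pred_path)
    same_file = os.path.samefile(pred_path, os.path.join(data_dir, "test.tsv"))
```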

---

## Notebooks

[D2_samesentiment_yelp_create_pairs.ipynb](D2_samesentiment_yelp_create_pairs.ipynb)  
Conversion of samesentiment (yelp reviews) data into pairs

- review pair (samesentiment)  
  (2 per pair type, at least 5 sentiments per business)
- review pair (samesentiment)  
  (double amount of pairs)
- splits:  
  90:10 traindev/test split  
  70:30 train/dev split
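
The two-stage split can be sketched as follows, assuming the 70:30 train/dev split is applied to the 90% traindev portion. The `split` helper and the seed are illustrative and not the notebook's actual code.

```python
import random


def split(items, first_fraction, seed=42):
    """Shuffle a copy of items and cut it into two parts at first_fraction."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = round(len(items) * first_fraction)
    return items[:cut], items[cut:]


pairs = list(range(1000))            # stand-in for the generated review pairs
traindev, test = split(pairs, 0.9)   # 90:10 traindev/test
train, dev = split(traindev, 0.7)    # 70:30 train/dev

print(len(train), len(dev), len(test))  # 630 270 100
```

Under this reading, the final shares are 63% train, 27% dev and 10% test.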

[D2_samesentiment_yelp_base.ipynb](D2_samesentiment_yelp_base.ipynb)  
Conversion of samesentiment (yelp reviews) data

- splits:  
  90:10 traindev/test split  
  70:30 train/dev split
- review pair (samesentiment)  
  (2 per pair type, at least 5 sentiments per business)
- review pair (samesentiment)  
  (double amount of pairs)

[D2_samesentiment_yelp_sentiment.ipynb](D2_samesentiment_yelp_sentiment.ipynb)  
Baseline data/tests for (single) sentiment evaluations

- single review sentiment  
  90:10 traindev/test split  
  70:30 train/dev split

[D2_samesentiment_yelp_baseline_doc2vec.ipynb](D2_samesentiment_yelp_baseline_doc2vec.ipynb)  
Baseline experiments

- count vectors
- doc2vec
- different sklearn classifiers


[D2_samesentiment_yelp_siamese.ipynb](D2_samesentiment_yelp_siamese.ipynb)  
Baseline experiments using siamese networks.  
See the head notes about environment setup. May require a separate environment due to conflicts between torch and tensorflow.

- sandbox code, wrapped in [trainer_siamese.py](trainer_siamese.py)
- train + eval

Code based on [GH: sainimohit23/siamese-text-similarity](https://github.com/sainimohit23/siamese-text-similarity).

[D2_samesentiment_yelp_cross_setup.ipynb](D2_samesentiment_yelp_cross_setup.ipynb)  
SameSentiment (yelp reviews) data with cross-validation.

- generation of cross-validation datasets for each shard
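
A rough sketch of the idea, assuming a "shard" is one of `k` disjoint folds and that each shard is held out once for evaluation. The notebook may group the data differently (for example by business or category), so `make_folds` is purely illustrative.

```python
def make_folds(items, k):
    """Yield (fold_index, train_items, eval_items) for k disjoint shards."""
    shards = [items[i::k] for i in range(k)]  # round-robin assignment
    for i, held_out in enumerate(shards):
        train = [x for j, shard in enumerate(shards) if j != i for x in shard]
        yield i, train, held_out


folds = list(make_folds(list(range(10)), k=5))

for index, train, held_out in folds:
    print(index, len(train), len(held_out))
```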

[D2_samesentiment_yelp_siamese_cross.ipynb](D2_samesentiment_yelp_siamese_cross.ipynb)  
SameSentiment (yelp reviews) data with cross-validation using siamese baseline.  
&rarr; with results

- same as [D2_samesentiment_yelp_cross.ipynb](D2_samesentiment_yelp_cross.ipynb)
- see [D2_samesentiment_yelp_cross_setup.ipynb](D2_samesentiment_yelp_cross_setup.ipynb) for data generation
- run experiments (train on group / eval on splits...)

[D2_samesentiment_yelp_cross.ipynb](D2_samesentiment_yelp_cross.ipynb)  
SameSentiment (yelp reviews) data with cross-validation using transformers

- same as [D2_samesentiment_yelp_base.ipynb](D2_samesentiment_yelp_base.ipynb)
- see [D2_samesentiment_yelp_cross_setup.ipynb](D2_samesentiment_yelp_cross_setup.ipynb) for data generation
- run experiments (train on group / eval on splits...)

[D2_samesentiment_yelp_pair_eval.ipynb](D2_samesentiment_yelp_pair_eval.ipynb)  
SameSentiment (yelp reviews) test evaluation on trained model for each pair type.

- train a model in [D2_samesentiment_yelp_base.ipynb](D2_samesentiment_yelp_base.ipynb)
- evaluate per pair-type
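
Per-group evaluation boils down to computing a metric separately for each group key. The sketch below uses plain accuracy; the pair-type names and the toy labels are made up for illustration, and the notebook may report other metrics.

```python
from collections import defaultdict


def accuracy_by_group(labels, predictions, groups):
    """Return a dict mapping each group key to the accuracy within that group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for label, prediction, group in zip(labels, predictions, groups):
        total[group] += 1
        correct[group] += int(label == prediction)
    return {group: correct[group] / total[group] for group in total}


# Toy data: 1 = same sentiment, 0 = different sentiment (hypothetical encoding).
labels = [1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 0, 1, 1]
pair_types = ["good-good", "good-good", "good-bad", "good-bad", "bad-bad", "bad-good"]

scores = accuracy_by_group(labels, predictions, pair_types)
```

The grouping key is arbitrary, so the same helper would work with a business category per example.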

[D2_samesentiment_yelp_cat_eval.ipynb](D2_samesentiment_yelp_cat_eval.ipynb)  
SameSentiment (yelp reviews) test evaluation on trained model for each category.

- train a model in [D2_samesentiment_yelp_base.ipynb](D2_samesentiment_yelp_base.ipynb)
- evaluate per category (similar to pair-type)