# Phrase-to-Phrase Matching Using [DCPCSE](https://github.com/YJiangcm/DCPCSE)

Created for the GaTech CS7650 final project.

## Downloading the Dataset

Steps to get this working:
1. Go to your Kaggle account and request a "New API Token", which downloads a `kaggle.json` file.
2. Upload this file to Colab under `/root/.kaggle` (toggle visibility of hidden directories in the file browser to see this folder).
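
If you would rather place the token from a cell than through the file browser, the two steps above can be sketched in Python. This is a minimal sketch, not part of the original notebook: `install_kaggle_token` is a hypothetical helper, and the upload location in the usage note is an assumption.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path, kaggle_dir="/root/.kaggle"):
    """Copy an API token into kaggle_dir and restrict it to owner read/write.

    Hypothetical helper for illustration; the default directory matches the
    path used by the chmod command in the next cell.
    """
    target_dir = Path(kaggle_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copy(token_path, target)
    # Equivalent to `chmod 600`: readable and writable by the owner only.
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target
```

For example, `install_kaggle_token("/content/kaggle.json")` would cover both the upload target and the permission change, assuming the token was first uploaded to `/content`.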

In [None]:
! chmod 600 /root/.kaggle/kaggle.json
! kaggle competitions download -c us-patent-phrase-to-phrase-matching --force
! unzip -q us-patent-phrase-to-phrase-matching.zip

Downloading us-patent-phrase-to-phrase-matching.zip to /content/gdrive/MyDrive/7650_DCPCSE/DCPCSE
  0% 0.00/682k [00:00<?, ?B/s]
100% 682k/682k [00:00<00:00, 40.1MB/s]
replace sample_submission.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace test.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: A


## Cloning DCPCSE

In [None]:
from google.colab import drive, files
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
%cd /content
! mkdir gdrive/MyDrive/7650_DCPCSE
! mkdir results
%cd gdrive/MyDrive/7650_DCPCSE
! git clone https://github.com/YJiangcm/DCPCSE.git
%cd DCPCSE
! pip install -r requirements.txt

/content
mkdir: cannot create directory ‘gdrive/MyDrive/7650_DCPCSE’: File exists
/content/gdrive/MyDrive/7650_DCPCSE
fatal: destination path 'DCPCSE' already exists and is not an empty directory.
/content/gdrive/MyDrive/7650_DCPCSE/DCPCSE
Collecting transformers==4.2.1
  Downloading transformers-4.2.1-py3-none-any.whl (1.8 MB)
Collecting scipy==1.5.4
  Downloading scipy-1.5.4-cp37-cp37m-manylinux1_x86_64.whl (25.9 MB)
Collecting datasets==1.2.1
  Downloading datasets-1.2.1-py3-none-any.whl (159 kB)
Collecting pandas==1.1.5
  Downloading pandas-1.1.5-cp37-cp37m-manylinux1_x86_64.whl (9.5 MB)
Collecting scikit-learn==0.24.0
  Downloading scikit_learn-0.24.0-cp37-cp37m-manylinux2010_x86_64.whl (22.3 MB)

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

! ls ../../../../

gdrive	results  sample_data  sample_submission.csv  test.csv  train.csv


In [None]:
train_df = pd.read_csv('../../../../train.csv')
# test_df = pd.read_csv('../../../../test.csv')
train, test = train_test_split(train_df, shuffle=True, test_size=0.20, random_state=17)
print(train.shape)
print(test.shape)

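# Pairs scoring at or above this cutoff count as positives; the rest are negatives.
SCORE_CUTOFF = 0.25

train_all = train.drop(['id', 'context'], axis=1)
# test = test.drop(['id', 'context'], axis=1)

train_pos = train_all[train_all['score'] >= SCORE_CUTOFF].drop('score', axis=1)
train_neg = train_all[train_all['score'] < SCORE_CUTOFF].drop('score', axis=1)

# Pair every positive target with every negative target of the same anchor,
# then keep one (anchor, positive, negative) triplet per positive.
sentences = train_pos.merge(train_neg, how='inner', on='anchor', suffixes=['_pos', '_neg'])
sentences = sentences.drop_duplicates(subset=['anchor', 'target_pos'])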

print(sentences.head())
print(sentences.shape)



sentences.to_csv('../../../../train_sentences.csv', index=False)
train.to_csv('SentEval/data/downstream/STS/STSBenchmark/train_sentences.csv', index=False)
test.to_csv('SentEval/data/downstream/STS/STSBenchmark/test_sentences.csv', index=False)

(29178, 5)
(7295, 5)
                        anchor                 target_pos         target_neg
0   perform working operations            perform working  working principle
10  perform working operations     perform working action  working principle
20  perform working operations   metal working operations  working principle
30  perform working operations  perform working operation  working principle
40  perform working operations  execute working operation  working principle
(21413, 3)
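
The cell above turns scored phrase pairs into (anchor, positive, negative) triplets. A minimal sketch on a toy frame shows what the merge and de-duplication steps do. The phrases and scores here are made up for illustration and are not taken from `train.csv`.

```python
import pandas as pd

SCORE_CUTOFF = 0.25

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "anchor": ["steel beam", "steel beam", "steel beam", "steel beam", "water pump"],
    "target": ["metal beam", "steel girder", "wooden spoon", "light beam", "fluid pump"],
    "score": [0.75, 0.50, 0.00, 0.00, 0.50],
})

pos = toy[toy["score"] >= SCORE_CUTOFF].drop("score", axis=1)
neg = toy[toy["score"] < SCORE_CUTOFF].drop("score", axis=1)

# Inner join on the anchor: 2 positives x 2 negatives = 4 candidate rows
# for "steel beam". "water pump" has no negative, so it drops out.
pairs = pos.merge(neg, how="inner", on="anchor", suffixes=("_pos", "_neg"))

# Keep a single negative per (anchor, positive) pair.
triplets = pairs.drop_duplicates(subset=["anchor", "target_pos"])

print(len(pairs), len(triplets))
print(triplets)
```

Two consequences are worth noting: an anchor with no target below the cutoff contributes no training rows at all, and each positive keeps only one of its available negatives.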


In [None]:
! python train.py \
  --model_name_or_path roberta-large \
  --train_file ../../../../train_sentences.csv \
  --output_dir ../../../../results \
  --num_train_epochs 10 \
  --per_device_train_batch_size 64 \
  --learning_rate 5e-3 \
  --max_seq_length 32 \
  --metric_for_best_model stsb_spearman \
  --load_best_model_at_end \
  --pooler_type cls \
  --pre_seq_len 10 \
  --overwrite_output_dir \
  --eval_steps 100 \
  --temp 0.05 \
  --do_train \

05/02/2022 02:32:25 - INFO - __main__ -   Training/evaluation parameters OurTrainingArguments(output_dir='../../../../results', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=64, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=0.005, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=10.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_steps=0, logging_dir='runs/May02_02-32-25_1e3fc9ec83a0', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', fp16_backend='auto', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_s

## Download the results from running DCPCSE

In [None]:
!zip -r /content/dcpcse.zip /content/results

files.download('/content/dcpcse.zip')

## Evaluate on Test data

First, download the evaluation datasets.

In [None]:
%cd SentEval/data/downstream/
! bash download_dataset.sh
%cd ../../..

/content/gdrive/MyDrive/7650_DCPCSE/DCPCSE/SentEval/data/downstream
--2022-05-01 23:43:01--  https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse/resolve/main/senteval.tar
Resolving huggingface.co (huggingface.co)... 34.225.34.242, 34.197.58.156, 54.161.5.137, ...
Connecting to huggingface.co (huggingface.co)|34.225.34.242|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/datasets/princeton-nlp/datasets-for-simcse/bc43c148f7be97471c78fc4255399d3158cb99dfe8f2221999c918338b138c38 [following]
--2022-05-01 23:43:01--  https://cdn-lfs.huggingface.co/datasets/princeton-nlp/datasets-for-simcse/bc43c148f7be97471c78fc4255399d3158cb99dfe8f2221999c918338b138c38
Resolving cdn-lfs.huggingface.co (cdn-lfs.huggingface.co)... 52.85.130.79, 52.85.130.5, 52.85.130.28, ...
Connecting to cdn-lfs.huggingface.co (cdn-lfs.huggingface.co)|52.85.130.79|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 89825280 (86M

In [None]:
! python evaluation.py \
    --model_name_or_path ../../../../results/ \
    --pooler_type cls \
    --task_set na \
    --tasks STSBenchmark \
    --mode test 

total param is 354801664, trainable param is 491520
2022-05-02 03:14:02,215 : 

***** Transfer task : STSBenchmark*****


2022-05-02 03:15:51,584 : train : pearson = 0.7110, spearman = 0.7113
2022-05-02 03:16:19,037 : test : pearson = 0.6764, spearman = 0.6738
2022-05-02 03:16:19,059 : ALL : Pearson = 0.7035,             Spearman = 0.7034
2022-05-02 03:16:19,059 : ALL (weighted average) : Pearson = 0.7041,             Spearman = 0.7038
2022-05-02 03:16:19,059 : ALL (average) : Pearson = 0.6937,             Spearman = 0.6925
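
The first line of the log above reports 491,520 trainable parameters out of 354,801,664. One decomposition consistent with that figure is sketched below. The layer count and hidden size are the standard `roberta-large` configuration; the assumption that each prompt position stores one key vector and one value vector per layer is a guess about how the prompts are parameterized, not something this notebook verifies.

```python
# Values taken from the training command and the evaluation log.
PRE_SEQ_LEN = 10              # --pre_seq_len
TOTAL_PARAMS = 354_801_664    # "total param is ..."
TRAINABLE_PARAMS = 491_520    # "trainable param is ..."

# roberta-large configuration.
NUM_LAYERS = 24
HIDDEN_SIZE = 1024

# Assumption: one key and one value vector per prompt position per layer.
VECTORS_PER_POSITION = 2

prompt_params = PRE_SEQ_LEN * NUM_LAYERS * VECTORS_PER_POSITION * HIDDEN_SIZE
frozen_params = TOTAL_PARAMS - TRAINABLE_PARAMS
trainable_share = 100 * TRAINABLE_PARAMS / TOTAL_PARAMS

print(f"prompt parameters under this assumption: {prompt_params:,}")
print(f"frozen parameters: {frozen_params:,}")
print(f"trainable share: {trainable_share:.2f}%")
```

Under this reading, training updates about 0.14% of the weights while the remaining parameters stay fixed.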

------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+------+
|  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |    67.38     |       0.00      | 9.63 |
+-------+-------+-------+-------+-------+--------------+-----------------+------+
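
The `Avg.` column in the table above is low only because it averages over all seven task columns, six of which were not run and are reported as 0.00. A quick check of that arithmetic, using the values from the table:

```python
# Scores copied from the table; only STSBenchmark was evaluated.
scores = {
    "STS12": 0.00,
    "STS13": 0.00,
    "STS14": 0.00,
    "STS15": 0.00,
    "STS16": 0.00,
    "STSBenchmark": 67.38,
    "SICKRelatedness": 0.00,
}

table_avg = sum(scores.values()) / len(scores)
print(f"{table_avg:.2f}")  # 9.63, matching the Avg. column
```

The number to read from this run is therefore the STSBenchmark entry (Spearman correlation times 100), not the average.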
+------+------+------+------+------+----