# NLP Paraphrase Detection Data Cleaning

We will use the [Microsoft Research Paraphrase Corpus (MRPC)](https://www.microsoft.com/en-us/download/details.aspx?id=52398) to build a series of paraphrase detection classifiers.

We will take several approaches to this data in later notebooks, so the first step is to inspect it, do some preliminary cleaning, and store it in a form those notebooks can load directly.

In [None]:
import pandas as pd
import csv

In [None]:
ROOT_PATH = '/content/drive/MyDrive/Colab Notebooks/NLP/ms_paraphrase'

In [None]:
data_path = f'{ROOT_PATH}/data'

train_file = f'{data_path}/msr_paraphrase_train.txt'
test_file = f'{data_path}/msr_paraphrase_test.txt'

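# The files are tab-separated and the sentences contain literal double quotes,
# so QUOTE_NONE makes pandas treat quotes as ordinary characters.
train_df = pd.read_csv(train_file, delimiter='\t', quoting=csv.QUOTE_NONE)
test_df = pd.read_csv(test_file, delimiter='\t', quoting=csv.QUOTE_NONE)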
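
To see why `quoting=csv.QUOTE_NONE` matters when loading, here is a minimal sketch on a made-up three-row sample in the same tab-separated layout. The sample text and the names `sample`, `default_df`, and `literal_df` are assumptions for illustration only, not part of the corpus or the notebook. Two of the rows contain an unpaired double quote, which is the kind of input that can cause the default parser to merge rows.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: row 2 has a sentence that starts with an unpaired quote,
# and row 3 has a stray quote in the middle of a sentence.
sample = (
    'Quality\t#1 ID\t#2 ID\t#1 String\t#2 String\n'
    '1\t1\t2\tHe said hello.\tHe said hi.\n'
    '0\t3\t4\t"I am leaving, he said.\tHe left.\n'
    '1\t5\t6\tIt rained" hard.\tRain fell.\n'
)

# Default quoting: the opening quote starts a quoted field that runs on
# until the next quote, swallowing the line break in between.
default_df = pd.read_csv(io.StringIO(sample), delimiter='\t')

# QUOTE_NONE: quotes are plain characters, so every line stays its own row.
literal_df = pd.read_csv(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE)

print(len(default_df), len(literal_df))
print(literal_df['#1 String'].tolist())
```

With default quoting the sample yields fewer rows than it has lines of data, while `QUOTE_NONE` keeps all of them. Comparing `len(train_df)` against the line count of the raw file is a cheap way to confirm nothing was merged.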

In [None]:
# Viewing the data
# train_df.head()
# test_df.head()
# train_df.info()
# test_df.info()

In [None]:
train_df.head()

In [None]:
test_df.head()

Next, we drop the ID columns, which are not needed, and rename the remaining columns to shorter names.

In [None]:
def format_data(df):
    """Keep the two sentence columns and the label, under shorter names."""
    new_df = df[['#1 String', '#2 String', 'Quality']]
    new_df = new_df.rename(columns={'#1 String': 's1', '#2 String': 's2', 'Quality': 'label'})
    return new_df

In [None]:
train_df = format_data(train_df)
test_df = format_data(test_df)

In [None]:
train_df.head()
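
Beyond eyeballing `head()`, a quick programmatic check can confirm that the formatted frame has the expected shape. The sketch below is one possible way to do that. `check_formatted` is a hypothetical helper, not something defined elsewhere in this notebook, and the toy frame stands in for `train_df` so the example runs on its own.

```python
import pandas as pd


def check_formatted(df):
    """Hypothetical helper: summarise a frame with columns s1, s2, label."""
    expected = ['s1', 's2', 'label']
    if list(df.columns) != expected:
        raise ValueError(f'unexpected columns: {list(df.columns)}')
    return {
        'rows': len(df),
        # Total number of missing cells across all columns.
        'missing': int(df.isna().sum().sum()),
        # How many pairs fall under each label value.
        'label_counts': df['label'].value_counts().to_dict(),
    }


# Toy data in the post-formatting layout (made up for illustration).
toy_df = pd.DataFrame({
    's1': ['He said hello.', 'It rained.', 'She left.'],
    's2': ['He said hi.', 'The sun shone.', 'She departed.'],
    'label': [1, 0, 1],
})

summary = check_formatted(toy_df)
print(summary)
```

Calling `check_formatted(train_df)` and `check_formatted(test_df)` in the notebook would surface missing values or an unexpected label before the data is saved.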

Now let's save the data as CSV files so we can load it when building our classifiers.

In [None]:
train_df.to_csv(f'{data_path}/train_df.csv', index=False)
test_df.to_csv(f'{data_path}/test_df.csv', index=False)
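
Because the sentences can contain commas and double quotes, it is worth checking that the saved CSV reads back unchanged. The following is a minimal sketch of such a round trip, using an in-memory buffer and made-up sentences in place of the real files. `toy_df`, `buffer`, and `restored_df` are illustrative names only.

```python
import io

import pandas as pd

# Toy sentences (made up) that contain both commas and double quotes.
toy_df = pd.DataFrame({
    's1': ['"I am leaving," he said.', 'Prices rose 3%, analysts said.'],
    's2': ['He said he was leaving.', 'Analysts said prices rose 3%.'],
    'label': [1, 1],
})

# Write with the same options as above, but into memory instead of a file.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)

# Read it back with pandas' default CSV settings, as a later notebook might.
buffer.seek(0)
restored_df = pd.read_csv(buffer)

print(restored_df['s1'].tolist() == toy_df['s1'].tolist())
print(restored_df['label'].tolist() == toy_df['label'].tolist())
```

Note that the saved files are comma-separated with standard quoting, so later notebooks should load them with a plain `pd.read_csv(path)` rather than the tab delimiter and `QUOTE_NONE` used for the raw corpus files.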