
# Data cleaning

In [9]:
!pip install clean-text[gpl]
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [10]:
from cleantext import clean
import pandas as pd
from tqdm import tqdm
import re

import datasets
from datasets import load_dataset

In [3]:
tqdm.pandas()

In [11]:
task = 'rte'
init_file_name = 'rte_generated_'
num_of_files = 10

single_col = False

single_col_header = 'sentence'

double_col_headers = ['sentence1', 'sentence2']


In [12]:
def cleanhtml(raw_html):
    # Drop anything that looks like an HTML tag (non-greedy match between < and >).
    return re.sub(r'<.*?>', '', raw_html)


def cleaning(text):
    text = str(text)
    text = text.strip()
    
    # regular cleaning
    text = clean(text,
        fix_unicode=True,
        to_ascii=False,
        no_line_breaks=True,
        no_urls=True,
        no_emails=True,
        no_phone_numbers=True,
        no_numbers=False,
        no_digits=False,
        no_currency_symbols=True,
        no_punct=False,
        replace_with_url="",
        replace_with_email="",
        replace_with_phone_number="",
        replace_with_number="",
        replace_with_digit="0",
        replace_with_currency_symbol="",
    )

    # stripping HTML tags
    text = cleanhtml(text)
    
    # removing weird patterns (emoji, pictographs, invisible control characters)
    weird_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        # NOTE: U+24C2..U+1F251 is a very wide range; besides symbols it also
        # strips CJK, Hangul and other characters above U+24C2.
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u'\U00010000-\U0010ffff'  # all supplementary-plane characters
        u"\u200d"  # zero width joiner
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\u3030"
        u"\ufe0f"  # variation selector-16
        u"\u2069"
        u"\u2066"
        # u"\u200c"
        u"\u2068"
        u"\u2067"
        "]+", flags=re.UNICODE)

    text = weird_pattern.sub(r'', text)
    
    # removing extra spaces, hashtags
    text = re.sub("#", "", text)
    text = re.sub(r"\s+", " ", text)  # raw string avoids the invalid "\s" escape warning
    
    return text
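
The regex steps of `cleaning` can be checked in isolation. What follows is a minimal sketch, not the notebook's full pipeline: it leaves out the `clean-text` call, uses a reduced set of emoji ranges, and adds a final `strip()`. The helper name `strip_markup_and_emoji` and the sample string are made up for illustration.

```python
import re

# Hypothetical helper mirroring only the regex steps of `cleaning`;
# the clean-text normalization is deliberately left out.
TAG_RE = re.compile(r"<.*?>")
# Assumption: a reduced set of emoji ranges, narrower than the notebook's pattern.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001F6FF\U0001F1E0-\U0001F1FF\u2600-\u27BF\ufe0f\u200d]+"
)


def strip_markup_and_emoji(text):
    text = TAG_RE.sub("", str(text))      # remove HTML-like tags
    text = EMOJI_RE.sub("", text)         # remove emoji in the reduced ranges
    text = text.replace("#", "")          # remove hashtag markers
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace runs


sample = "<p>Hello   #world 🚀</p>\n<b>done</b>"
print(strip_markup_and_emoji(sample))  # prints: Hello world done
```

Running a few hand-written strings through a helper like this is a quick way to see which characters a pattern removes before applying it to all generated files.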

In [13]:
# DataFrame.append was removed in pandas 2.0; read all chunks and concatenate once.
chunks = [pd.read_csv(init_file_name + str(i + 1) + '.csv') for i in range(num_of_files)]
combined_data = pd.concat(chunks)

combined_data

Unnamed: 0,sentence1,sentence2
0,The French are now looking at a similar propos...,running for 25 years between Paris and Brussels.
1,"The US-based company, BAE Systems, is the worl...","its subsidiary, Raytheon, is the world's seco..."
2,The first two presidents of Zimbabwe were Afri...,"However, Mugabe was brought up outside Zimbab..."
3,"Titan's engines, propelled by nine giant gas t...",the air.
4,The UN Security Council also condemned the she...,six of the children.
...,...,...
995,"Yoko Ono was born in San Francisco, California.",She is the widow of the late John Lennon. She...
996,The British Columbia government has introduced...,"On Jan. 1, 2018, the Liquor Control Board of ..."
997,The company's shareholders' association said t...,plans due to the confidentiality agreement.
998,Hugh Jackman is playing Wolverine for Marvel.,Jackman is the actor best known for his portr...


In [14]:
# Clean the text column(s), then keep only rows whose cleaned text is longer than 3 characters.
if single_col:
    combined_data[single_col_header] = combined_data[single_col_header].progress_apply(cleaning)
    filtered_data = combined_data[combined_data[single_col_header].str.len() > 3]
else:
    for col in double_col_headers:
        combined_data[col] = combined_data[col].progress_apply(cleaning)
    filtered_data = combined_data[
        (combined_data[double_col_headers[0]].str.len() > 3)
        & (combined_data[double_col_headers[1]].str.len() > 3)
    ]

filtered_data

100%|██████████| 10000/10000 [00:01<00:00, 9028.56it/s]
100%|██████████| 10000/10000 [00:01<00:00, 9724.62it/s]


Unnamed: 0,sentence1,sentence2
0,the french are now looking at a similar propos...,running for 25 years between paris and brussels.
1,"the us-based company, bae systems, is the worl...","its subsidiary, raytheon, is the world's secon..."
2,the first two presidents of zimbabwe were afri...,"however, mugabe was brought up outside zimbabw..."
3,"titan's engines, propelled by nine giant gas t...",the air.
4,the un security council also condemned the she...,six of the children.
...,...,...
995,"yoko ono was born in san francisco, california.",she is the widow of the late john lennon. she ...
996,the british columbia government has introduced...,"on jan. 1, 2018, the liquor control board of b..."
997,the company's shareholders' association said t...,plans due to the confidentiality agreement.
998,hugh jackman is playing wolverine for marvel.,jackman is the actor best known for his portra...
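
The length filter in the cell above can be illustrated on a toy frame. This is a sketch under stated assumptions: the three rows are invented for illustration, and only the column names and the "more than 3 characters" threshold come from the notebook.

```python
import pandas as pd

# Toy data, made up for illustration; column names follow `double_col_headers`.
toy = pd.DataFrame({
    "sentence1": ["a full premise sentence.", "ok", "another premise."],
    "sentence2": ["a hypothesis.", "a hypothesis.", "no"],
})

# Keep a row only if both columns hold more than 3 characters.
mask = (toy["sentence1"].str.len() > 3) & (toy["sentence2"].str.len() > 3)
kept = toy[mask]
print(len(kept))  # prints: 1
```

Only the first row survives, because each of the other two rows has one column with a two-character string.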


In [15]:
filtered_data.to_csv(task + '_unlabeled.csv', index=False)

# Final output generation

In [None]:
# labeled_file =  task + '.csv'

# final_file = task + '_augmented.csv'

In [None]:
# labeled_data = pd.read_csv(init_file_name + str(i+1) + '.csv')