
In [2]:
"""
Load the dataset and separate the train and test sets
"""
import pandas as pd
from datasets import load_dataset
# Load the QEvasion dataset from Hugging Face
dataset = load_dataset("ailsntua/QEvasion")

# Construct the train and test sets
train_df = dataset['train'].to_pandas()
test_df = dataset['test'].to_pandas()

train_df['split_type'] = 'train'
test_df['split_type'] = 'test'

# Combine into one main dataframe
df = pd.concat([train_df, test_df], ignore_index=True)

print(f"Total train set loaded: {len(train_df)}")
print(f"Total test set loaded: {len(test_df)}")

Total train set loaded: 3448
Total test set loaded: 308


In [3]:
"""
Check a sample of messy rows (answers)
"""
row_numbers = [3268, 1987, 2114, 1873]

print("--- RAW ANSWERS (By Row Number) ---")

for idx in row_numbers:
    # iloc accesses the row by its raw integer position
    text = df.iloc[idx]['interview_answer']
    print(f"\nRow Number: {idx}")
    print(text)
    print("-" * 50)

--- RAW ANSWERS (By Row Number) ---

Row Number: 3268
They talked about . [] Excuse me. I wasn't there. []No, that's a legitimate question, and the question is, why now? Why do I think something positive can happen? Well, first of all, the legislative process takes awhile in the United States. I don't know about Mexico, Mr. President, but sometimes legislators, you know, debate issues for awhile before a solution can be achieved.And we had a very—by the way, we haven't had a serious debate on migration until recently. A law was passed in 1986, and then there really wasn't a serious debate until pretty much starting after the year 2000, if my memory serves me well. I've always known this is an important issue because I happened to have been the Governor of Texas. And so I'm very comfortable about discussing the issue and have elevated the issue over the past years. And Members of Congress have taken the issue very seriously, but it's hard to get legislation out of the Congress on a very
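The raw sample above shows three recurring artifacts: empty brackets, em-dashes marking interruptions, and sentences glued together without a space (for example "achieved.And"). A minimal sketch of how such artifacts could be counted before cleaning is given below. The helper `count_artifacts` and the pattern names are hypothetical and not part of the original notebook, and the sample string is made up for illustration.

```python
import re

# Hypothetical diagnostic patterns (assumptions, not from the original notebook).
ARTIFACT_PATTERNS = {
    "empty_brackets": re.compile(r"\[\s*\]"),
    "em_dashes": re.compile(r"—"),
    # A sentence-ending mark directly followed by a capital letter.
    "glued_sentences": re.compile(r"[.!?](?=[A-Z])"),
}

def count_artifacts(text):
    """Return how many times each artifact pattern occurs in `text`."""
    if not isinstance(text, str):
        return {name: 0 for name in ARTIFACT_PATTERNS}
    return {name: len(p.findall(text)) for name, p in ARTIFACT_PATTERNS.items()}

# Illustrative string, not taken from the dataset.
sample = "Excuse me. [] I wasn't there. []No, wait—it can be achieved.And we had a debate."
print(count_artifacts(sample))
```

The `glued_sentences` pattern is deliberately crude: it would also count abbreviations such as "U.S.A", so the numbers are best read as a rough indicator rather than an exact tally.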

In [4]:
"""
Clean the text
"""
import re

def clean_text(text):
    if not isinstance(text, str):
        return ""

    # 1. Remove empty brackets []
    text = re.sub(r'\[\s*\]', '', text)

    # 2. Normalize em-dashes and double hyphens to a comma followed by a space
    # "The—all right" becomes "The, all right"
    text = re.sub(r'—', ', ', text)
    text = re.sub(r'--', ', ', text)

    # 3. Fix the "hven't" missing-vowel corruption (Source 45)
    # This is a manual fix for one specific recurring pattern;
    # spacing issues are handled by the general regex in step 5.
    text = re.sub(r"\bhven't\b", "haven't", text)

    # 4. Remove bracketed meta-commentary placeholders,
    # e.g., [addressing the press], though most brackets seen so far are empty []
    text = re.sub(r'\[.*?\]', '', text)

    # 5. Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Apply cleaning to the answer column
print("Cleaning data...")
df['cleaned_answer'] = df['interview_answer'].apply(clean_text)

# Inspect one cleaned row (position 3191) to see the difference
print("\n--- CLEANED ROW 3191 ---")
print(df.iloc[3191]['cleaned_answer'])

Cleaning data...

--- CLEANED ROW 3191 ---
Jennifer , I'll be glad to share some of the private conversation with His Holy Father. First, I'll give you an impression. I was talking to a very smart, loving man. And I, you know, after 61⁄2 years of being the President, I've seen some unusual, I've been to some unusual places, and I've met some interesting people. And I was in awe, and it was a moving experience for me.We didn't talk about just war. He did express deep concern about the Christians inside Iraq, that he was concerned that the society that was evolving would not tolerate the Christian religion. And I assured him we're working hard to make sure that they, people lived up to the Constitution, that modern Constitution voted on by the people that would honor people from different walks of life and different attitudes.We talked about a lot of other subjects. We talked about our attempts to help the people in Africa deal with HIV/AIDS and malaria and hunger. I reminded him that we
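The cleaned output above still contains a space before a comma ("Jennifer , I'll") and sentences with no space after the period ("me.We"). A possible follow-up pass is sketched below. The function `tidy_punctuation` is a hypothetical addition rather than part of the original cleaning step, and it assumes that a lowercase letter, a sentence-ending mark, and a capital letter in direct succession always indicate a missing space.

```python
import re

def tidy_punctuation(text):
    """Hypothetical post-processing pass for punctuation spacing."""
    if not isinstance(text, str):
        return ""
    # Drop whitespace that precedes a punctuation mark: "Jennifer , I" -> "Jennifer, I"
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    # Add the missing space between glued sentences: "me.We" -> "me. We"
    # The lowercase lookbehind avoids splitting acronyms such as "U.S.A".
    text = re.sub(r"(?<=[a-z])([.!?])(?=[A-Z])", r"\1 ", text)
    return text

# Illustrative string modelled on the output above.
example = "Jennifer , I'll be glad. It was moving for me.We didn't talk."
print(tidy_punctuation(example))
```

Such a pass could be applied after `clean_text`, for instance with `df['cleaned_answer'].apply(tidy_punctuation)`, once a few rows have been checked by hand for false positives.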

In [5]:
"""
Check a sample of cleaned rows
"""
row_numbers = [3268, 1987, 2114, 1873]

print("--- CLEANED ANSWERS ---")

for idx in row_numbers:
    text = df.iloc[idx]['cleaned_answer']
    print(f"\nRow Number: {idx}")
    print(text)
    print("-" * 50)

--- CLEANED ANSWERS ---

Row Number: 3268
They talked about . Excuse me. I wasn't there. No, that's a legitimate question, and the question is, why now? Why do I think something positive can happen? Well, first of all, the legislative process takes awhile in the United States. I don't know about Mexico, Mr. President, but sometimes legislators, you know, debate issues for awhile before a solution can be achieved.And we had a very, by the way, we haven't had a serious debate on migration until recently. A law was passed in 1986, and then there really wasn't a serious debate until pretty much starting after the year 2000, if my memory serves me well. I've always known this is an important issue because I happened to have been the Governor of Texas. And so I'm very comfortable about discussing the issue and have elevated the issue over the past years. And Members of Congress have taken the issue very seriously, but it's hard to get legislation out of the Congress on a very complex issue.A lo

In [9]:
"""
Check a sample of messy rows (questions)
"""
row_numbers = [3268, 1987, 2114, 1873]

print("--- RAW QUESTIONS (By Row Number) ---")

for idx in row_numbers:
    # iloc accesses the row by its raw integer position
    text = df.iloc[idx]['interview_question']
    print(f"\nRow Number: {idx}")
    print(text)
    print("-" * 50)

--- RAW QUESTIONS (By Row Number) ---

Row Number: 3268
Q. Good morning to both Presidents. President Bush, I ask you, why do Mexicans want to—why would you think that Mexicans could believe in a reform in migration when for so many years, this was not a possibility nor reality? And what are your chances of coming through with this bill in Congress? And President Calderon, you had lunch with President Fox. Can you tell us what you talked about?
--------------------------------------------------

Row Number: 1987
Q. Thank you very much. Mr. President, I've got a question to Mr. Obama. You just mentioned that Afghanistan is still a dangerous place. While it's a dangerous place, is it the right decision to draw down the force level at a time when it's a dangerous place and meanwhile Afghan forces are less equipped and they cannot fight truly?[The reporter and President Ghani spoke in Pashto, and their remarks were translated by an interpreter as follows.] Q. Mr. President, my question is,
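Row 1987 above shows that a question can contain a bracketed transcript note and a second `Q.` marker in the middle of the text, in addition to the leading one. A minimal sketch for handling all three cases follows. The function `clean_question` is hypothetical, and it assumes that an interior `Q.` marker always follows sentence-ending punctuation and whitespace.

```python
import re

def clean_question(text):
    """Hypothetical question cleaner: strips notes and 'Q.' markers."""
    if not isinstance(text, str):
        return ""
    # Replace bracketed transcript notes with a space so words do not fuse.
    text = re.sub(r"\[.*?\]", " ", text).strip()
    # Drop "Q." at the very start, or after sentence-ending punctuation plus whitespace.
    # Requiring whitespace before an interior "Q." leaves tokens like "I.Q." intact.
    text = re.sub(r"(?:^|(?<=[.?!])\s+)Q\.\s+", " ", text)
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

# Illustrative string modelled on row 1987, not copied from the dataset.
example = ("Q. Is it the right decision?"
           "[The reporter spoke in Pashto.] Q. Mr. President, my question is this.")
print(clean_question(example))
```

Removing the interior marker merges two separate reporter questions into one string; whether that is desirable depends on how the questions are used downstream.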

In [None]:
# Remove the "Q. " prefix from the start of each interview question
def remove_q_prefix(text):
    return re.sub(r'^\s*Q\.\s*', '', text)

df['interview_question'] = df['interview_question'].apply(remove_q_prefix)
for i in df.sample(3).index:
    print("QUESTION:", df.loc[i, 'interview_question'])
    print("-" * 40)


QUESTION: Thank you, Mr. President. My question follows Julianna's in content. The American people have seen hundreds of billions of dollars spent already, and still the economy continues to free fall. Beyond avoiding the national catastrophe that you've warned about, once all the legs of your stool are in place, how can the American people gauge whether or not your programs are working? Can they--should they be looking at the metric of the stock market, home foreclosures, unemployment? What metric should they use? When? And how will they know if it's working, or whether or not we need to go to a plan B?
----------------------------------------
QUESTION: Thanks, gentlemen. Mr. Trump, Mr. Turnbull, Phil Coorey from the Financial Review. To you, Mr. Trump, just on the region and China and associated issues, the United States Navy has conducted, frequently, freedom-of-navigation sail-throughs through the disputed areas. Would you like to see the Australian Navy participate directly in tho