### Data cleaning

The extracted data is not yet ready for analysis: it contains unnecessary symbols, missing values, and inconsistent formats. We clean the review text by removing special characters and stopwords, and we tidy the remaining columns so that later analysis is more reliable.

In [None]:
#imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import nltk
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

**Loading the Dataset**  
Before cleaning, we first load the dataset. The script retrieves the current working directory (`cwd`) and reads the CSV file `BA_reviews.csv` into a pandas DataFrame. The `index_col=0` parameter ensures that the first column is used as the index. Finally, `df.head()` displays the first few rows for an initial inspection.

In [32]:
cwd = os.getcwd()
df = pd.read_csv(os.path.join(cwd, "BA_reviews.csv"), index_col=0)
df.head()

Unnamed: 0,Review,Stars,Date,Country
0,✅ Trip Verified | Flight mainly let down by ...,5,19th March 2025,United Kingdom
1,✅ Trip Verified | Another awful experience b...,7,16th March 2025,United States
2,"✅ Trip Verified | The service was rude, full...",1,16th March 2025,United States
3,✅ Trip Verified | This flight was a joke. Th...,3,16th March 2025,United States
4,✅ Trip Verified | This time British Airways ...,1,7th March 2025,United Kingdom


**Extracting Verified Reviews**  
We create a new column, `verified`, to indicate whether a review is marked as "Trip Verified." The `.str.contains("Trip Verified")` function checks if the phrase appears in the `Review` column, returning `True` for verified reviews and `False` otherwise.

In [33]:
df['verified'] = df.Review.str.contains("Trip Verified")
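
As a quick sanity check of how `str.contains` behaves, the sketch below runs it on a small made-up Series (the sample strings are invented for illustration, not taken from the dataset). It also uses `na=False`, which may be worth considering if any `Review` value is missing, since the default yields a missing value rather than `False` for such rows.

```python
import pandas as pd

# Hypothetical sample rows, invented for illustration only.
sample = pd.Series([
    "✅ Trip Verified | Good flight.",
    "Not Verified | Delayed again.",
    None,
])

# regex=False treats the phrase as a literal substring;
# na=False maps missing reviews to False instead of a missing value.
verified = sample.str.contains("Trip Verified", regex=False, na=False)
print(verified.tolist())
```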



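**Cleaning the Review Text**  
The review text is cleaned by stripping the "✅ Trip Verified |" prefix, replacing non-alphabetic characters with spaces, converting to lowercase, removing English stopwords, and lemmatizing each remaining word. The processed text is stored in the `corpus` column.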

In [34]:
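lemma = WordNetLemmatizer()
# regex=False: the "|" in the prefix is a literal character, not a regex alternation
reviews_data = df.Review.str.replace("✅ Trip Verified |", "", regex=False)
# Build the stopword set once instead of once per word
stop_words = set(stopwords.words("english"))
corpus = []

for rev in reviews_data:
    rev = re.sub('[^a-zA-Z]', ' ', rev)  # replace every non-letter with a space
    rev = rev.lower().split()
    rev = [lemma.lemmatize(word) for word in rev if word not in stop_words]
    corpus.append(" ".join(rev))

df['corpus'] = corpus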
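
To make the individual cleaning steps easier to follow, here is a minimal standalone sketch of the same idea applied to one string. It relies on two simplifying assumptions: a tiny hand-written stopword set (`STOPWORDS`, hypothetical) stands in for NLTK's English list, and lemmatization is left out so the snippet runs without any downloads. The helper name `clean_review` is also illustrative only.

```python
import re

# Tiny stand-in stopword list (an assumption for this sketch);
# the notebook uses NLTK's full English list instead.
STOPWORDS = {"the", "was", "a", "and", "by", "of"}
PREFIX = "✅ Trip Verified |"

def clean_review(text: str) -> str:
    """Strip the prefix, keep letters only, lowercase, drop stopwords."""
    # re.escape makes the "|" literal rather than a regex alternation.
    text = re.sub(re.escape(PREFIX), "", text)
    # Replace every non-letter character with a space.
    text = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = text.lower().split()
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)

print(clean_review("✅ Trip Verified | The service was rude, and the food cold!"))
```

Escaping the prefix matters because an unescaped `|` splits the pattern into two alternatives, so the pipe character itself would not be matched.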

In [35]:
# Parse date strings such as "19th March 2025"; values that cannot be parsed become NaT
df.Date = pd.to_datetime(df.Date, errors='coerce')
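
The date strings carry ordinal suffixes ("19th", "7th"), and `errors='coerce'` turns anything unparseable into a missing value. The following standard-library sketch shows the same idea explicitly for a single string. It is an alternative illustration, not what pandas does internally, and the helper name `parse_review_date` is hypothetical. It assumes English month names and the day-month-year order seen in the sample rows.

```python
import re
from datetime import datetime
from typing import Optional

def parse_review_date(raw: str) -> Optional[datetime]:
    """Parse strings like '19th March 2025'; return None if parsing fails."""
    # Drop the ordinal suffix that directly follows the day number.
    stripped = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", raw.strip())
    try:
        return datetime.strptime(stripped, "%d %B %Y")
    except ValueError:
        # Plays the role that NaT plays under errors='coerce'.
        return None

print(parse_review_date("19th March 2025"))
print(parse_review_date("unknown"))
```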

In [36]:
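# Strip stray newlines and tabs from the rating, then drop rows without a rating
df.Stars = df.Stars.str.strip("\n\t")
df = df[df.Stars != "None"].copy()  # .copy() avoids chained-assignment warnings later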
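
Note that after this step `Stars` still holds strings. If numeric ratings are needed later, one possible approach is sketched below on made-up values (the sample Series is hypothetical and only mimics ratings with stray whitespace). It uses `pd.to_numeric` with `errors="coerce"`, which is a different technique from the string comparison in the cell above.

```python
import pandas as pd

# Hypothetical raw values mimicking scraped ratings with stray whitespace.
stars = pd.Series(["\n\t\t5", "7", "None", "\n\t1"])

# Non-numeric entries such as "None" become NaN and are then dropped.
numeric = pd.to_numeric(stars.str.strip(), errors="coerce").dropna().astype(int)
print(numeric.tolist())
```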

In [37]:
# Drop rows with a missing country, then renumber the index from 0
df = df.dropna(subset=['Country']).reset_index(drop=True)


The cleaned dataset is saved as a CSV file named **`cleaned-BA-reviews.csv`** in the current working directory for further analysis.

In [38]:
df.to_csv(os.path.join(cwd, "cleaned-BA-reviews.csv"))
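
Because `to_csv` writes the index as an unnamed first column by default, the saved file is meant to be read back with `index_col=0`, as was done when loading the raw data. A small round-trip sketch, using an in-memory buffer and a made-up two-row frame (`df_demo`, hypothetical) instead of the real file:

```python
import io
import pandas as pd

# Made-up frame for illustration; not the actual review data.
df_demo = pd.DataFrame({"Stars": [5, 1],
                        "Country": ["United Kingdom", "United States"]})

buffer = io.StringIO()
df_demo.to_csv(buffer)   # the index is written as an unnamed first column
buffer.seek(0)

# index_col=0 turns that first column back into the index.
restored = pd.read_csv(buffer, index_col=0)
print(restored.columns.tolist())
```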

In [39]:
df.head()

Unnamed: 0,Review,Stars,Date,Country,verified,corpus
0,✅ Trip Verified | Flight mainly let down by ...,5,2025-03-19,United Kingdom,True,flight mainly let disagreeable flight attendan...
1,✅ Trip Verified | Another awful experience b...,7,2025-03-16,United States,True,another awful experience british airway flight...
2,"✅ Trip Verified | The service was rude, full...",1,2025-03-16,United States,True,service rude full attitude food poorly service...
3,✅ Trip Verified | This flight was a joke. Th...,3,2025-03-16,United States,True,flight joke four people business class includi...
4,✅ Trip Verified | This time British Airways ...,1,2025-03-07,United Kingdom,True,time british airway managed get everything rig...
