In this activity, my goal is to take a raw dataset of airline reviews and clean it thoroughly using a series of text preprocessing techniques. The plan is to handle the dataset-specific noise I found in the data, convert the text to lowercase, remove punctuation, break the text into words (tokenization), and remove common, low-information words (stopwords). The purpose of all this is to make the text data clean and structured, so it's ready for more advanced tasks like sentiment analysis.

### Setting Up the Environment: Importing Libraries

This initial section of the code is all about getting our tools ready. To work with the data and clean the text, we need a few Python libraries. First, importing "pandas" gives us the functions needed to handle our data in a table format, which makes it easy to work with. Next, the "re" library provides regular expressions, which are well suited to finding and removing specific text patterns like dates or punctuation. Finally, we import "nltk", the Natural Language Toolkit, a widely used Python library for text processing.

After importing the main libraries, we need to download two specific resources from NLTK. The first is "stopwords", which is a ready-made list of common English words we'll want to remove later. The second is "punkt_tab", the pre-trained Punkt tokenizer data that "word_tokenize()" depends on when splitting text into individual words, a process called tokenization. This step makes sure all our tools are loaded and ready before we even touch the data.

In [23]:
import pandas as pd
import re
import nltk

In [24]:
nltk.download('stopwords')
nltk.download('punkt_tab')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [25]:
from google.colab import drive

Now that our libraries are set up, the next logical step is to load our dataset. In this portion of the code, I'm connecting to my Google Drive to access the "Airline_review.csv" file. The "pd.read_csv()" function from the pandas library reads the CSV file and loads all the data into a structure called a DataFrame. A DataFrame is like a smart table that lets us easily see and work with our data columns and rows.

In [26]:
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Dataset/Airline_review.csv')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


After loading it, I'll display the first 5 rows using "head()" and get a summary with "info()" to make sure everything loaded correctly and to see what kind of data we're dealing with.

In [27]:
df.head()

Unnamed: 0.1,Unnamed: 0,Airline Name,Overall_Rating,Review_Title,Review Date,Verified,Review,Aircraft,Type Of Traveller,Seat Type,Route,Date Flown,Seat Comfort,Cabin Staff Service,Food & Beverages,Ground Service,Inflight Entertainment,Wifi & Connectivity,Value For Money,Recommended
0,0,AB Aviation,9,"""pretty decent airline""",11th November 2019,True,Moroni to Moheli. Turned out to be a pretty ...,,Solo Leisure,Economy Class,Moroni to Moheli,November 2019,4.0,5.0,4.0,4.0,,,3.0,yes
1,1,AB Aviation,1,"""Not a good airline""",25th June 2019,True,Moroni to Anjouan. It is a very small airline...,E120,Solo Leisure,Economy Class,Moroni to Anjouan,June 2019,2.0,2.0,1.0,1.0,,,2.0,no
2,2,AB Aviation,1,"""flight was fortunately short""",25th June 2019,True,Anjouan to Dzaoudzi. A very small airline an...,Embraer E120,Solo Leisure,Economy Class,Anjouan to Dzaoudzi,June 2019,2.0,1.0,1.0,1.0,,,2.0,no
3,3,Adria Airways,1,"""I will never fly again with Adria""",28th September 2019,False,Please do a favor yourself and do not fly wi...,,Solo Leisure,Economy Class,Frankfurt to Pristina,September 2019,1.0,1.0,,1.0,,,1.0,no
4,4,Adria Airways,1,"""it ruined our last days of holidays""",24th September 2019,True,Do not book a flight with this airline! My fr...,,Couple Leisure,Economy Class,Sofia to Amsterdam via Ljubljana,September 2019,1.0,1.0,1.0,1.0,1.0,1.0,1.0,no


In [28]:
print("\nDataset Info:")
df.info()


Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23171 entries, 0 to 23170
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Unnamed: 0              23171 non-null  int64  
 1   Airline Name            23171 non-null  object 
 2   Overall_Rating          23171 non-null  object 
 3   Review_Title            23171 non-null  object 
 4   Review Date             23171 non-null  object 
 5   Verified                23171 non-null  bool   
 6   Review                  23171 non-null  object 
 7   Aircraft                7129 non-null   object 
 8   Type Of Traveller       19433 non-null  object 
 9   Seat Type               22075 non-null  object 
 10  Route                   19343 non-null  object 
 11  Date Flown              19417 non-null  object 
 12  Seat Comfort            19016 non-null  float64
 13  Cabin Staff Service     18911 non-null  float64
 14  Food & Beverages       

Before jumping into cleaning, it's standard practice to get a baseline of our raw data. This part of the code serves as our 'before' snapshot, so we have something to compare our results to at the end. First, I create a new column called "review_original" to keep a safe copy of the original reviews. Then, I calculate the total number of whitespace-separated words across all reviews to understand the initial size of our text data. Finally, I select one specific review to use as our "test_case". The reason for this is so we can visually track the effects of each cleaning step on the same piece of text, which makes the whole process much easier to follow.

In [29]:
df['review_original'] = df['Review']

initial_word_count = df['review_original'].str.split().str.len().sum() # Calculate the total word count before any cleaning
print(f"Initial total word count: {initial_word_count}")

test_case = df['review_original'].iloc[5] # Running example: the review at index 5 has punctuation, a digit, and mixed case (it happens to contain no date phrase).
print("\n--- Test Case: Raw Review ---")
print(test_case)

Initial total word count: 3026955

--- Test Case: Raw Review ---
  Had very bad experience with rerouted and cancelled flights last weekend with Adria airways. Original Route was Ljubljana to Sarajevo return. Two weeks before i received an email that the flight was cancelled. Offered route change was Ljubljana to Sarajevo via Munich. Flight back changed to Sarajevo-Pristina-Ljubljana. I accepted. The first flight via Munich was ok. Two hours before the return flight I got the email that the flight was cancelled. I had to rebook via hotline and had to accept a flight with Croatian to Zagreb. I reached Ljubljana 4 h later and had to organize Transport from Zagreb to Ljubljana on my own cost. Do not book flights with Adria airways. I heard that their financial situation is very very bad.
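
To make the baseline calculation concrete, here is a minimal sketch of the same word count in plain Python. It swaps the DataFrame column for an ordinary list, and the two reviews in it are made up for illustration, so treat it as a stand-in for the pandas expression rather than the notebook's actual code.

```python
# Hypothetical mini-corpus standing in for df['review_original'].
reviews = [
    "Flight was on time.",
    "Do not book flights with this airline!",
]

# Split each review on whitespace and count the pieces, then add the
# counts up. This mirrors .str.split().str.len().sum() on the column.
word_counts = [len(review.split()) for review in reviews]
initial_word_count = sum(word_counts)

print(word_counts)
print(initial_word_count)
```

Because the split is on whitespace only, punctuation attached to a word ("time.") is counted as part of that word, which is fine for a rough size estimate.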


### The Text Preprocessing Pipeline

So, this is where the main work of the exercise begins: the text preprocessing pipeline. The plan is to apply a series of cleaning steps to the "Review" column, one after the other. For each step, I'll first show how it works on our single `test_case` example, and then I'll apply that same logic to clean all the reviews in the entire dataset. The steps run in a specific order because each one depends on the ones before it: the date pattern needs the digits that the punctuation step later strips out, and the punctuation pattern keeps only lowercase letters, so lowercasing has to come first.

Here is the plan:
1.  Remove the "new" noise I identified, which are date phrases.
2.  Convert all text to lowercase.
3.  Remove all punctuation and special characters.
4.  Split the text into individual words (tokenization).
5.  Remove common, unimportant words (stopwords).
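
The five steps above can be sketched as a single function. This is only an illustration under a few assumptions: the function name `preprocess` and the tiny `DEMO_STOPWORDS` set are hypothetical, a whitespace split stands in for NLTK's "word_tokenize()", and the sample sentence is made up.

```python
import re

DATE_PATTERN = r"\d{1,2}(?:st|nd|rd|th)\s\w+\s\d{4}"

# Tiny illustrative stopword set (assumption); the notebook uses
# NLTK's full English list instead.
DEMO_STOPWORDS = {"the", "was", "on", "a", "and", "to", "i"}

def preprocess(text, stop_words=DEMO_STOPWORDS):
    """Run the five cleaning steps in order on one review."""
    text = re.sub(DATE_PATTERN, "", text)   # 1. drop date phrases
    text = text.lower()                     # 2. lowercase
    text = re.sub(r"[^a-z\s]", "", text)    # 3. strip non-letters
    tokens = text.split()                   # 4. tokenize (whitespace split)
    return [t for t in tokens if t not in stop_words]  # 5. remove stopwords

print(preprocess("Flew on 11th November 2019. The flight was GREAT!"))
```

Swapping steps 2 and 3 in this sketch would delete every capital letter, which is one way to see why the order matters.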

#### "New" Noise Removal: Date Phrases

Kicking off the pipeline, the first cleaning step tackles the "new" noise that I found during my initial look at the data. I noticed many reviews contain a date stamp, like "25th August 2015" or "11th November 2019". This isn't useful for analyzing the review's content, so it needs to go. The code uses a regular expression, stored in the "date_pattern" variable, that matches a one- or two-digit day with an ordinal suffix, followed by a word and a four-digit year. The "re.sub()" function then finds every match of this pattern in the text and replaces it with an empty string, which effectively deletes it. Our test case happens to contain no date phrase, so its output for this step is unchanged.

In [30]:
date_pattern = r"\d{1,2}(st|nd|rd|th)\s\w+\s\d{4}" # This pattern looks for 1-2 digits, followed by 'st', 'nd', 'rd', or 'th', a word and a 4-digit year.

# Demonstrate on the test case
cleaned_test_case = re.sub(date_pattern, '', test_case)
print("--- Test Case after removing date ---")
print(cleaned_test_case)

df['Review'] = df['Review'].apply(lambda x: re.sub(date_pattern, '', str(x))) # Apply to the entire 'Review' column

print("\nSuccessfully removed date phrases from the 'Review' column.")

--- Test Case after removing date ---
  Had very bad experience with rerouted and cancelled flights last weekend with Adria airways. Original Route was Ljubljana to Sarajevo return. Two weeks before i received an email that the flight was cancelled. Offered route change was Ljubljana to Sarajevo via Munich. Flight back changed to Sarajevo-Pristina-Ljubljana. I accepted. The first flight via Munich was ok. Two hours before the return flight I got the email that the flight was cancelled. I had to rebook via hotline and had to accept a flight with Croatian to Zagreb. I reached Ljubljana 4 h later and had to organize Transport from Zagreb to Ljubljana on my own cost. Do not book flights with Adria airways. I heard that their financial situation is very very bad.

Successfully removed date phrases from the 'Review' column.
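
Since the test case contains no date, the effect of the pattern is easier to see on a made-up sentence. The sketch below uses a version of the pattern that behaves the same under "re.sub()" but writes the suffix group as non-capturing, so that "re.findall()" returns whole matches. Both sample strings are invented for illustration.

```python
import re

# Behaves like the notebook's pattern under re.sub; the suffix group is
# non-capturing here so re.findall returns the full matched text.
date_pattern = r"\d{1,2}(?:st|nd|rd|th)\s\w+\s\d{4}"

sample = "Flew on 25th August 2015 and it was fine."  # made-up sentence
matches = re.findall(date_pattern, sample)
print(matches)

without_date = re.sub(date_pattern, "", sample)
tidy = " ".join(without_date.split())  # collapse the leftover double space
print(tidy)

# \w+ accepts any word, not only month names, so a contrived phrase
# like this one is matched as well:
over_match = re.findall(date_pattern, "moved to the 4th seat 2020 times")
print(over_match)
```

If false positives like the last one turned out to matter, one option would be to replace `\w+` with an explicit list of month names; the notebook does not do this.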


#### Lowercasing the Text

Next in our pipeline is a simple but very important step: lowercasing. In this code, the ".lower()" string method is applied to the test case, and its pandas counterpart ".str.lower()" is applied to all the text in the "Review" column. The purpose of this is to ensure consistency. Without this step, "Flight", "flight", and "FLIGHT" would be counted as three completely different words. By converting everything to lowercase, we make sure that the same word is always treated as the same word, which is a standard step in text analysis.

In [31]:
# Demonstrate on the test case
cleaned_test_case = cleaned_test_case.lower()
print("--- Test Case after lowercasing ---")
print(cleaned_test_case)

# Apply to the entire 'Review' column
df['Review'] = df['Review'].str.lower()
print("\nSuccessfully converted 'Review' column to lowercase.")

--- Test Case after lowercasing ---
  had very bad experience with rerouted and cancelled flights last weekend with adria airways. original route was ljubljana to sarajevo return. two weeks before i received an email that the flight was cancelled. offered route change was ljubljana to sarajevo via munich. flight back changed to sarajevo-pristina-ljubljana. i accepted. the first flight via munich was ok. two hours before the return flight i got the email that the flight was cancelled. i had to rebook via hotline and had to accept a flight with croatian to zagreb. i reached ljubljana 4 h later and had to organize transport from zagreb to ljubljana on my own cost. do not book flights with adria airways. i heard that their financial situation is very very bad.

Successfully converted 'Review' column to lowercase.
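
The effect on word counts can be checked with a small sketch. The token list is made up, and "collections.Counter" is used here only to count distinct spellings; it is not part of the notebook's pipeline.

```python
from collections import Counter

words = ["Flight", "flight", "FLIGHT", "delay"]  # made-up tokens

# With case preserved, each spelling gets its own entry.
before = Counter(words)

# After lowercasing, the three spellings collapse into one entry.
after = Counter(word.lower() for word in words)

print(len(before), len(after))
print(after["flight"])
```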


#### Removing Punctuation and Special Characters

This portion of the code cleans up the punctuation and other special characters. Just like with the date phrases, we use a regular expression to do this. The pattern "[^a-z\s]" matches any character that is *not* a lowercase letter from 'a' to 'z' or a whitespace character, and the "re.sub()" function removes every match. This step is useful because symbols like '!', '?', or '.' usually don't add value to the core meaning of the text and can be treated as noise. Note that the pattern is broad: it also removes digits and accented letters, and deleting hyphens fuses the words around them, as "sarajevopristinaljubljana" in the test case output shows.

In [32]:
cleaned_test_case = re.sub(r'[^a-z\s]', '', cleaned_test_case) # The pattern [^a-z\s] matches anything that is NOT a lowercase letter or a space.
print("--- Test Case after removing punctuation ---")
print(cleaned_test_case)

# Apply to the entire 'Review' column
df['Review'] = df['Review'].apply(lambda x: re.sub(r'[^a-z\s]', '', x))
print("\nSuccessfully removed punctuation and special characters.")

--- Test Case after removing punctuation ---
  had very bad experience with rerouted and cancelled flights last weekend with adria airways original route was ljubljana to sarajevo return two weeks before i received an email that the flight was cancelled offered route change was ljubljana to sarajevo via munich flight back changed to sarajevopristinaljubljana i accepted the first flight via munich was ok two hours before the return flight i got the email that the flight was cancelled i had to rebook via hotline and had to accept a flight with croatian to zagreb i reached ljubljana  h later and had to organize transport from zagreb to ljubljana on my own cost do not book flights with adria airways i heard that their financial situation is very very bad

Successfully removed punctuation and special characters.
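
The fused word in the output above comes from deleting the hyphens outright. As a sketch of one possible alternative (not what the notebook does), the same pattern can replace each match with a space and then collapse repeated whitespace. The sample string is adapted from the test case.

```python
import re

sample = "sarajevo-pristina-ljubljana 4 h later, ok!"

# Notebook behaviour: delete every character that is not a-z or whitespace.
deleted = re.sub(r"[^a-z\s]", "", sample)
print(deleted)

# Alternative (assumption): substitute a space, then squeeze whitespace,
# so words joined by hyphens stay separate.
spaced = " ".join(re.sub(r"[^a-z\s]", " ", sample).split())
print(spaced)
```

Either way the digit is lost, so anything that depends on numbers (such as the date pattern) has to run before this step.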


#### Tokenization: Splitting Text into Words

After cleaning the text, we need to break it down into its basic components, which are individual words. This process is called tokenization. In this part of the code, we use the "word_tokenize()" function that we imported from the NLTK library. It takes a full string of text as its input and returns a list of individual words, which are called "tokens". This step is fundamental because the next and final cleaning step, removing stopwords, operates on individual words.

In [33]:
# Demonstrate on the test case
tokens_test_case = word_tokenize(cleaned_test_case)
print("--- Test Case after tokenization ---")
print(tokens_test_case)

# Apply tokenization to the entire 'Review' column and store it in a new column
df['review_tokenized'] = df['Review'].apply(word_tokenize)
print("\nSuccessfully tokenized 'Review' column.")
display(df.head())

--- Test Case after tokenization ---
['had', 'very', 'bad', 'experience', 'with', 'rerouted', 'and', 'cancelled', 'flights', 'last', 'weekend', 'with', 'adria', 'airways', 'original', 'route', 'was', 'ljubljana', 'to', 'sarajevo', 'return', 'two', 'weeks', 'before', 'i', 'received', 'an', 'email', 'that', 'the', 'flight', 'was', 'cancelled', 'offered', 'route', 'change', 'was', 'ljubljana', 'to', 'sarajevo', 'via', 'munich', 'flight', 'back', 'changed', 'to', 'sarajevopristinaljubljana', 'i', 'accepted', 'the', 'first', 'flight', 'via', 'munich', 'was', 'ok', 'two', 'hours', 'before', 'the', 'return', 'flight', 'i', 'got', 'the', 'email', 'that', 'the', 'flight', 'was', 'cancelled', 'i', 'had', 'to', 'rebook', 'via', 'hotline', 'and', 'had', 'to', 'accept', 'a', 'flight', 'with', 'croatian', 'to', 'zagreb', 'i', 'reached', 'ljubljana', 'h', 'later', 'and', 'had', 'to', 'organize', 'transport', 'from', 'zagreb', 'to', 'ljubljana', 'on', 'my', 'own', 'cost', 'do', 'not', 'book', 'flights

Unnamed: 0.1,Unnamed: 0,Airline Name,Overall_Rating,Review_Title,Review Date,Verified,Review,Aircraft,Type Of Traveller,Seat Type,...,Seat Comfort,Cabin Staff Service,Food & Beverages,Ground Service,Inflight Entertainment,Wifi & Connectivity,Value For Money,Recommended,review_original,review_tokenized
0,0,AB Aviation,9,"""pretty decent airline""",11th November 2019,True,moroni to moheli turned out to be a pretty d...,,Solo Leisure,Economy Class,...,4.0,5.0,4.0,4.0,,,3.0,yes,Moroni to Moheli. Turned out to be a pretty ...,"[moroni, to, moheli, turned, out, to, be, a, p..."
1,1,AB Aviation,1,"""Not a good airline""",25th June 2019,True,moroni to anjouan it is a very small airline ...,E120,Solo Leisure,Economy Class,...,2.0,2.0,1.0,1.0,,,2.0,no,Moroni to Anjouan. It is a very small airline...,"[moroni, to, anjouan, it, is, a, very, small, ..."
2,2,AB Aviation,1,"""flight was fortunately short""",25th June 2019,True,anjouan to dzaoudzi a very small airline and...,Embraer E120,Solo Leisure,Economy Class,...,2.0,1.0,1.0,1.0,,,2.0,no,Anjouan to Dzaoudzi. A very small airline an...,"[anjouan, to, dzaoudzi, a, very, small, airlin..."
3,3,Adria Airways,1,"""I will never fly again with Adria""",28th September 2019,False,please do a favor yourself and do not fly wi...,,Solo Leisure,Economy Class,...,1.0,1.0,,1.0,,,1.0,no,Please do a favor yourself and do not fly wi...,"[please, do, a, favor, yourself, and, do, not,..."
4,4,Adria Airways,1,"""it ruined our last days of holidays""",24th September 2019,True,do not book a flight with this airline my fri...,,Couple Leisure,Economy Class,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,no,Do not book a flight with this airline! My fr...,"[do, not, book, a, flight, with, this, airline..."
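
For intuition about what a token list looks like, here is a minimal sketch that uses Python's built-in whitespace split as a simplified stand-in. It is not equivalent to "word_tokenize()", which applies additional rules of its own, so results on real reviews can differ. The input string is made up.

```python
# Made-up text that has already been lowercased and stripped of punctuation.
cleaned = "  do not book   flights with adria airways "

# str.split() with no arguments splits on any run of whitespace and
# ignores leading and trailing blanks, so no empty tokens are produced.
tokens = cleaned.split()
print(tokens)
```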


#### Stopword Removal

This is the final cleaning step in our pipeline. Stopwords are very common words, like 'the', 'is', 'a', or 'in', that appear all the time but don't carry much specific meaning. To handle this, we first load the standard list of English stopwords that NLTK provides. Then, for each review, we go through our list of tokens from the previous step and build a new list that includes only the words that are not found in the stopword list. Removing these common words helps to reduce the overall size of our data and allows any future analysis to focus on the more unique and meaningful words in the reviews.

In [34]:
# Get the standard list of English stopwords from NLTK
stop_words = set(stopwords.words('english'))

# Define a function to remove stopwords from a list of tokens
def remove_stopwords(tokens):
    return [word for word in tokens if word not in stop_words]

# Demonstrate on the test case
tokens_test_case_no_stopwords = remove_stopwords(tokens_test_case)
print("--- Test Case after stopword removal ---")
print(tokens_test_case_no_stopwords)

# Apply stopword removal to the tokenized column
df['review_cleaned'] = df['review_tokenized'].apply(remove_stopwords)
print("\nSuccessfully removed stopwords.")
display(df.head())

--- Test Case after stopword removal ---
['bad', 'experience', 'rerouted', 'cancelled', 'flights', 'last', 'weekend', 'adria', 'airways', 'original', 'route', 'ljubljana', 'sarajevo', 'return', 'two', 'weeks', 'received', 'email', 'flight', 'cancelled', 'offered', 'route', 'change', 'ljubljana', 'sarajevo', 'via', 'munich', 'flight', 'back', 'changed', 'sarajevopristinaljubljana', 'accepted', 'first', 'flight', 'via', 'munich', 'ok', 'two', 'hours', 'return', 'flight', 'got', 'email', 'flight', 'cancelled', 'rebook', 'via', 'hotline', 'accept', 'flight', 'croatian', 'zagreb', 'reached', 'ljubljana', 'h', 'later', 'organize', 'transport', 'zagreb', 'ljubljana', 'cost', 'book', 'flights', 'adria', 'airways', 'heard', 'financial', 'situation', 'bad']

Successfully removed stopwords.


Unnamed: 0.1,Unnamed: 0,Airline Name,Overall_Rating,Review_Title,Review Date,Verified,Review,Aircraft,Type Of Traveller,Seat Type,...,Cabin Staff Service,Food & Beverages,Ground Service,Inflight Entertainment,Wifi & Connectivity,Value For Money,Recommended,review_original,review_tokenized,review_cleaned
0,0,AB Aviation,9,"""pretty decent airline""",11th November 2019,True,moroni to moheli turned out to be a pretty d...,,Solo Leisure,Economy Class,...,5.0,4.0,4.0,,,3.0,yes,Moroni to Moheli. Turned out to be a pretty ...,"[moroni, to, moheli, turned, out, to, be, a, p...","[moroni, moheli, turned, pretty, decent, airli..."
1,1,AB Aviation,1,"""Not a good airline""",25th June 2019,True,moroni to anjouan it is a very small airline ...,E120,Solo Leisure,Economy Class,...,2.0,1.0,1.0,,,2.0,no,Moroni to Anjouan. It is a very small airline...,"[moroni, to, anjouan, it, is, a, very, small, ...","[moroni, anjouan, small, airline, ticket, advi..."
2,2,AB Aviation,1,"""flight was fortunately short""",25th June 2019,True,anjouan to dzaoudzi a very small airline and...,Embraer E120,Solo Leisure,Economy Class,...,1.0,1.0,1.0,,,2.0,no,Anjouan to Dzaoudzi. A very small airline an...,"[anjouan, to, dzaoudzi, a, very, small, airlin...","[anjouan, dzaoudzi, small, airline, airline, b..."
3,3,Adria Airways,1,"""I will never fly again with Adria""",28th September 2019,False,please do a favor yourself and do not fly wi...,,Solo Leisure,Economy Class,...,1.0,,1.0,,,1.0,no,Please do a favor yourself and do not fly wi...,"[please, do, a, favor, yourself, and, do, not,...","[please, favor, fly, adria, route, munich, pri..."
4,4,Adria Airways,1,"""it ruined our last days of holidays""",24th September 2019,True,do not book a flight with this airline my fri...,,Couple Leisure,Economy Class,...,1.0,1.0,1.0,1.0,1.0,1.0,no,Do not book a flight with this airline! My fr...,"[do, not, book, a, flight, with, this, airline...","[book, flight, airline, friend, returned, sofi..."
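
One side effect is visible in the test case: "do not book flights" ends up as "book", "flights", because "not" is on the stopword list. For a later sentiment analysis it may be worth keeping negations. The sketch below shows one way to do that; the small `demo_stop_words` set is a hand-written stand-in for NLTK's list, and the two-argument `remove_stopwords` is a variation on the notebook's function, both assumptions for illustration.

```python
# Small hand-written set standing in for stopwords.words('english').
demo_stop_words = {"do", "not", "with", "the", "a", "i"}

def remove_stopwords(tokens, stop_words):
    """Keep only the tokens that are not in the given stopword set."""
    return [word for word in tokens if word not in stop_words]

tokens = ["do", "not", "book", "flights", "with", "adria", "airways"]

# Default behaviour: the negation disappears along with the other stopwords.
plain = remove_stopwords(tokens, demo_stop_words)
print(plain)

# Variation: take negation words out of the stopword set first.
negations = {"not", "no", "nor"}
kept = remove_stopwords(tokens, demo_stop_words - negations)
print(kept)
```

Whether to keep negations depends on the downstream task; the notebook's pipeline removes them.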


### Final Evaluation

Finally, after all the cleaning steps are complete, this last section is for evaluating the results of our pipeline. This part of the code serves as our 'after' snapshot. First, we take the cleaned lists of tokens and join them back together into readable sentences. Next, we calculate the new total word count and compare it to our initial count from the beginning. This allows us to see exactly how much 'noise' we removed, and I'll display this as a percentage reduction. To make the impact really clear, I will show a direct side-by-side comparison of our original reviews and their final, fully cleaned versions. This gives a clear, visual confirmation of the effectiveness of our entire preprocessing work.

In [35]:
# Re-join the cleaned tokens into a single string for readability
df['review_cleaned_text'] = df['review_cleaned'].apply(lambda tokens: ' '.join(tokens))

# Calculate the final word count
final_word_count = df['review_cleaned_text'].str.split().str.len().sum()
print(f"Final total word count: {final_word_count}")

# Calculate the percentage reduction in words
reduction = ((initial_word_count - final_word_count) / initial_word_count) * 100
print(f"Total word count reduction: {reduction:.2f}%")

# Display the original vs. cleaned text for our test case
print("\n--- Final Comparison for Test Case ---")
print("\nOriginal:")
print(test_case)
print("\nCleaned:")
print(df['review_cleaned_text'].iloc[5])

# Display a comparison DataFrame
print("\n--- Side-by-Side Comparison ---")
display(df[['review_original', 'review_cleaned_text']].head(10))

Final total word count: 1530168
Total word count reduction: 49.45%

--- Final Comparison for Test Case ---

Original:
  Had very bad experience with rerouted and cancelled flights last weekend with Adria airways. Original Route was Ljubljana to Sarajevo return. Two weeks before i received an email that the flight was cancelled. Offered route change was Ljubljana to Sarajevo via Munich. Flight back changed to Sarajevo-Pristina-Ljubljana. I accepted. The first flight via Munich was ok. Two hours before the return flight I got the email that the flight was cancelled. I had to rebook via hotline and had to accept a flight with Croatian to Zagreb. I reached Ljubljana 4 h later and had to organize Transport from Zagreb to Ljubljana on my own cost. Do not book flights with Adria airways. I heard that their financial situation is very very bad.

Cleaned:
bad experience rerouted cancelled flights last weekend adria airways original route ljubljana sarajevo return two weeks received email flight

Unnamed: 0,review_original,review_cleaned_text
0,Moroni to Moheli. Turned out to be a pretty ...,moroni moheli turned pretty decent airline onl...
1,Moroni to Anjouan. It is a very small airline...,moroni anjouan small airline ticket advised tu...
2,Anjouan to Dzaoudzi. A very small airline an...,anjouan dzaoudzi small airline airline based c...
3,Please do a favor yourself and do not fly wi...,please favor fly adria route munich pristina j...
4,Do not book a flight with this airline! My fr...,book flight airline friend returned sofia amst...
5,Had very bad experience with rerouted and ca...,bad experience rerouted cancelled flights last...
6,"Ljubljana to Zürich. Firstly, Ljubljana airp...",ljubljana zrich firstly ljubljana airport terr...
7,"First of all, I am not complaining about a s...",first complaining specific flight lufthansa fr...
8,Worst Airline ever! They combined two flight...,worst airline ever combined two flights save c...
9,Ljubljana to Munich. The homebase airport of ...,ljubljana munich homebase airport adria airway...
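
As a quick arithmetic check, the reported percentage can be reproduced from the two word counts printed above. This sketch only re-runs the formula on those two numbers; it does not touch the dataset.

```python
initial_word_count = 3026955  # reported before cleaning
final_word_count = 1530168    # reported after cleaning

# Words removed by the whole pipeline, and that amount as a share
# of the starting count.
removed = initial_word_count - final_word_count
reduction = removed / initial_word_count * 100

print(removed)
print(f"{reduction:.2f}%")
```

The reduction counts every dropped token, so it reflects stopwords, standalone digits, and date phrases together rather than any single step.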
