# 02 Preprocessing Practical Example on Realistic Text Data (NLP)

## Overview

This notebook walks through a complete text preprocessing pipeline using **NLTK**, **Pandas**, and **regular expressions** on real-world hotel review data.  
Each preprocessing step is applied incrementally, with results stored in new columns to preserve intermediate transformations.

---

## Libraries Used

- **NLTK**
  - Tokenization (`word_tokenize`)
  - Stopwords
  - Stemming (`PorterStemmer`)
  - Lemmatization (`WordNetLemmatizer`)
  - N-grams
- **Pandas** for data handling
- **re** for regular expression–based text cleaning

---

## Dataset Loading and Inspection

- Load hotel reviews from a CSV file
- Inspect dataset structure using:
  - `data.info()`
  - `data.head()`
- Access individual review entries for inspection
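
The loading and inspection calls can be tried without the real file. The sketch below reads a two-row CSV from an in-memory string; only the `Review` column name comes from this notebook, and the sample rows are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the hotel review CSV; the rows are invented.
csv_text = 'Review\n"Nice hotel, not expensive."\n"Room was small but clean."\n'
data = pd.read_csv(io.StringIO(csv_text))

data.info()                 # column names, dtypes, non-null counts
print(data.head())          # first rows of the table
print(data['Review'][0])    # a single review, selected by row label 0
```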

---

## Text Preprocessing Steps

### 1. Lowercasing Text
- Convert all review text to lowercase
- Store results in a new column to preserve original data
- Use Pandas `.str` string accessor to apply changes to all rows
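
As a minimal sketch of this step, assuming a toy DataFrame with invented rows in place of the real reviews:

```python
import pandas as pd

# Invented sample rows standing in for the review data.
data = pd.DataFrame({"Review": ["Nice Hotel, NOT Expensive!", "GREAT location"]})

# .str applies the string method element-wise; the original column is untouched.
data["review_lowercase"] = data["Review"].str.lower()
print(data)
```

Because the result goes into a new column, the original casing remains available for comparison.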

---

### 2. Stopword Removal
- Load English stopwords from NLTK
- Explicitly **retain the word “not”** to preserve sentiment meaning
- Remove stopwords with `apply()` and a custom lambda function
- Store cleaned text in a new column
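
To see why "not" is kept, the sketch below filters one sample sentence with and without it. The stopword list here is a tiny hand-written stand-in (an assumption made for illustration), not NLTK's actual English list, and `remove_stopwords` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical mini stopword list; the notebook uses stopwords.words('english').
stopword_list = ["the", "was", "a", "not", "and"]
kept_not = [w for w in stopword_list if w != "not"]   # same idea as .remove("not")

def remove_stopwords(text, stop_list):
    """Drop every whitespace-separated word found in stop_list."""
    return " ".join(word for word in text.split() if word not in stop_list)

reviews = pd.Series(["the room was not clean"])
print(reviews.apply(lambda x: remove_stopwords(x, stopword_list))[0])  # negation lost
print(reviews.apply(lambda x: remove_stopwords(x, kept_not))[0])       # negation kept
```

With "not" in the list the review collapses to a positive-sounding phrase; with it removed from the list, the negation survives.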

---

### 3. Removing Punctuation and Special Characters
- Replace asterisk (`*`) symbols with the word `"star"`
- Remove all remaining punctuation using regular expressions
- Apply transformations row-by-row using `axis=1`
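
The two substitutions can be sketched on invented sample text as follows. This version uses `Series.apply` for brevity, whereas the notebook applies the same patterns through `DataFrame.apply` with `axis=1`.

```python
import re
import pandas as pd

# Invented sample text (already lowercased, as in the earlier steps).
reviews = pd.Series(["4* hotel, great location!", "room: small & noisy..."])

# First turn each asterisk into the word "star" so the rating survives...
with_star = reviews.apply(lambda text: re.sub(r"[*]", "star", text))
# ...then delete anything that is neither a word character nor whitespace.
no_punct = with_star.apply(lambda text: re.sub(r"[^\w\s]", "", text))

print(no_punct.tolist())
```

Note that `\w` also matches digits and underscores, so those are kept, and that deleting a standalone symbol such as `&` leaves a double space behind, which tokenization later absorbs.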

---

### 4. Tokenization
- Convert cleaned review text into tokens
- Store tokens as lists in a new column
- Each review becomes a list of individual words
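
A small sketch of tokenizing a column, using invented sample text. It passes `preserve_line=True`, which skips sentence splitting so that no Punkt download is needed here; that argument is a choice made for this sketch, and the notebook itself calls `word_tokenize` with its defaults.

```python
import pandas as pd
from nltk.tokenize import word_tokenize

# Invented sample text, already lowercased and stripped of punctuation.
reviews = pd.Series(["nice hotel not expensive", "4star hotel great location"])

# Each string becomes a list of tokens; the Series keeps one list per review.
tokenized = reviews.apply(lambda text: word_tokenize(text, preserve_line=True))

print(tokenized[0])
```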

---

### 5. Stemming
- Apply **Porter Stemmer** to reduce words to their root forms
- Store stemmed tokens in a separate column
- Allows comparison between original, tokenized, and stemmed text
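
The effect of the stemmer is easiest to see on a handful of tokens. The words below are invented examples, not taken from the dataset.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

tokens = ["hotels", "rooms", "running", "location"]
stemmed = [ps.stem(token) for token in tokens]

# Print each token next to its stem for a side-by-side look.
for token, stem in zip(tokens, stemmed):
    print(f"{token} -> {stem}")
```

Porter stems are produced by suffix-stripping rules, so they are not always dictionary words; "location", for instance, comes out truncated.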

---

### 6. Lemmatization
- Apply **WordNet Lemmatizer** to normalize words to their dictionary form
- Store lemmatized tokens separately
- Enables direct comparison between stemming and lemmatization results

---

## Preparing Tokens for N-gram Analysis

- Combine all token lists from every review into **one single list**
- Use Python’s `sum()` function to flatten the list of lists
- This produces a corpus-wide token list suitable for frequency analysis
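
The flattening step can be sketched on invented token lists. The `itertools.chain` variant is shown only as an alternative and is not part of this notebook; it gives the same result but avoids re-copying the growing list on every addition, which matters for large corpora.

```python
from itertools import chain

# Invented token lists standing in for the per-review lists.
token_lists = [["nice", "hotel"], ["not", "good"], ["great", "location"]]

flat_sum = sum(token_lists, [])                      # approach used in this notebook
flat_chain = list(chain.from_iterable(token_lists))  # same result, linear time

print(flat_sum)
```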

---

## N-gram Analysis

- Generate:
  - **Unigrams (1-grams)**
  - **Bigrams (2-grams)**
  - **Trigrams (3-grams)**
- Count occurrences using:
  - `nltk.ngrams`
  - `pd.Series().value_counts()`
- Display frequency distributions for analysis
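
A minimal sketch of the counting step on an invented token list. Wrapping the n-gram generator in `list()` is an extra precaution for this sketch rather than something the notebook does.

```python
import nltk
import pandas as pd

# Invented corpus-wide token list.
tokens = ["not", "good", "hotel", "not", "good", "location"]

# nltk.ngrams yields tuples of n consecutive tokens; list() materializes them.
bigrams = pd.Series(list(nltk.ngrams(tokens, 2))).value_counts()

print(bigrams)
```

`value_counts()` sorts by frequency in descending order, so the most common n-gram appears first.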

---

## Key Takeaways

- Each preprocessing step is isolated, and the original text is never overwritten
- New columns preserve preprocessing history
- Stopword customization matters for sentiment analysis
- Token preparation is required before n-gram analysis
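- Stemming and lemmatization serve different purposes: stemming strips suffixes by rule and can produce non-words, while lemmatization maps words to dictionary forms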


In [2]:
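import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
import re
import pandas as pd

# One-time setup: the tokenizer models, stopword list, and WordNet data must be downloaded before first use.
# Resource names can vary by NLTK version (newer releases may ask for "punkt_tab" instead of "punkt").
# nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")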

In [3]:
data = pd.read_csv("../../data/tripadvisor_hotel_reviews.csv")

In [None]:
data.info()  # see the columns, dtypes, and non-null counts

In [None]:
data.head()  # take a look at the first few rows of the data

In [None]:
data['Review'][0]  # select the review in row zero

In [None]:
# Step 1: lowercase the text in the Review column.
# A pandas Series is not a single string; it is a column containing many strings.
# .str is the string accessor that tells pandas to apply the operation to every element in the column.
data['review_lowercase'] = data['Review'].str.lower()

In [None]:
data.head()

In [None]:
# Step 2: remove stopwords from the reviews.
en_stopwords = stopwords.words('english')

In [None]:
# To keep "not" in the reviews, remove it from the list of stopwords with .remove().
en_stopwords.remove("not")

In [None]:
# Create a new column, 'review-stop-no-stopwords', holding the reviews without stopwords.
# apply() is very useful because it lets us take one column and perform a custom operation on every value in it.
# .split() splits each review x into individual words.
# When preprocessing text, it is worth making a new column for each step.
data['review-stop-no-stopwords'] = data['review_lowercase'].apply(lambda x: ' '.join([word for word in x.split() if word not in en_stopwords]))

In [None]:
data['review-stop-no-stopwords'][0]

In [None]:
# Step 3: remove punctuation.
# First create a new column in which every asterisk is replaced with the word "star".
# axis=1 tells pandas to apply the function row by row rather than column by column.
data['review-stop-no-stopwords-no-punct'] = data.apply(lambda x: re.sub(r"[*]", "star", x['review-stop-no-stopwords']), axis=1)


In [None]:
data.head()  # wherever the review-stop-no-stopwords column had an asterisk, the new column has the word "star"

In [None]:
# Remove every character that is neither a word character (\w) nor whitespace (\s).
data['review-stop-no-stopwords-no-punct'] = data.apply(lambda x: re.sub(r"[^\w\s]", "", x['review-stop-no-stopwords-no-punct']), axis=1)

In [None]:
data.head()

In [None]:
# Step 4: tokenize the text; each review becomes a list of tokens.
data['tokenized'] = data.apply(lambda x: word_tokenize(x['review-stop-no-stopwords-no-punct']), axis=1)

In [None]:
data['tokenized'][0]

In [None]:
# Step 5: stem the tokens.
ps = PorterStemmer()

In [None]:
data['stemmed'] = data['tokenized'].apply(lambda tokens: [ps.stem(token) for token in tokens])

In [None]:
data.head()  # confirm that the words have been stemmed

In [None]:
# Step 6: lemmatize the tokens.
lemmatizer = WordNetLemmatizer()

In [None]:
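# Lemmatize the same tokens so the result can be compared with the stemmed column.
# Without a part-of-speech argument, lemmatize() treats every token as a noun.
data['lemmatized'] = data['tokenized'].apply(lambda tokens: [lemmatizer.lemmatize(token) for token in tokens])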

In [None]:
data['lemmatized'][0]

In [None]:
# We now have both stemmed and lemmatized columns. Before running n-grams, the tokens need to be in the right format.
# Right now each row in the lemmatized column holds a separate list of tokens, one list per review.
# We need to combine these smaller lists into one long list that contains every token from all reviews.
# sum() with a list as the start value concatenates lists instead of adding numbers:
# it adds each review's token list to the running result, starting from the empty list, so we end up with one big list.
tokens_clean = sum(data['lemmatized'], [])

In [None]:
tokens_clean

In [None]:
unigrams = (pd.Series(nltk.ngrams(tokens_clean, 1)).value_counts())

In [None]:
unigrams

In [None]:
bigrams = (pd.Series(nltk.ngrams(tokens_clean, 2)).value_counts())

In [None]:
print(bigrams)

In [None]:
trigrams = (pd.Series(nltk.ngrams(tokens_clean, 3)).value_counts())

In [None]:
print(trigrams)