## Text Pre-Processing Example

Install the spaCy package and its small English model if you have not already done so. Once they are installed, you do not need to run this cell again.

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm

Import the pandas and spaCy libraries, then load the small English pipeline (`en_core_web_sm`).

In [None]:
# Import libraries
import pandas as pd
import spacy
nlp = spacy.load('en_core_web_sm')


### Sample Data
Create some sample data.

In [None]:
# Example Twitter data
data = {
    "tweet": [
        "I love the new iPhone! It's fantastic. #Apple",
        "The service at this restaurant was terrible. Never going back. #Disappointed",
        "Tesla's new model is groundbreaking! #Innovation",
        "I had an average experience with the product. It's okay. #Neutral",
    ]
}
df = pd.DataFrame(data)

### Tokenization
The first technical step in preprocessing is tokenization, which breaks text down into smaller units called tokens. Tokens are the building blocks for all subsequent preprocessing steps, and isolating these meaningful components lets models process text effectively.

In [None]:
# Tokenization with spaCy
df["tokens"] = df["tweet"].apply(lambda x: [token.text for token in nlp(x)])
print(df)
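
To make the idea of a token concrete, here is a minimal sketch of a regex-based tokenizer in plain Python. It is not spaCy's tokenizer: the `simple_tokenize` function and its pattern are assumptions made for illustration, and spaCy applies its own rules, so it may split some strings (contractions, for example) differently.

```python
import re

# Hypothetical pattern, for illustration only: keep hashtags and
# contractions together, and split off any other punctuation mark.
TOKEN_PATTERN = re.compile(r"#\w+|\w+(?:'\w+)?|[^\w\s]")

def simple_tokenize(text):
    """Return a list of word, hashtag, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("I love the new iPhone! It's fantastic. #Apple")
print(tokens)
```

Even this rough version shows the main point: punctuation becomes its own token, so later steps can treat words and symbols separately.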


### Remove Stop Words
Next, we clean the text by removing stop words. Stop words are common words such as “the,” “is,” and “and,” which occur frequently but carry little semantic value. Removing them reduces noise and allows the model to focus on meaningful terms. Note that punctuation marks are not stop words, so they remain in the output of this step.

In [None]:
# Function to remove stop words
def remove_stopwords(text):
    doc = nlp(text)
    tokens = [token.text for token in doc if not token.is_stop]  # Exclude stop words
    return tokens

# Rebuild the DataFrame from the sample data (pandas is already imported),
# then apply the function to every tweet
df = pd.DataFrame(data)
df["tokens"] = df["tweet"].apply(remove_stopwords)

# Display the DataFrame
print(df)
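
The filtering idea itself can be sketched without spaCy. The example below is a rough illustration under stated assumptions: the `STOP_WORDS` set is a tiny hand-picked list and `filter_stop_words` is a hypothetical helper, whereas spaCy relies on its own, much larger English stop word list.

```python
# Tiny hand-picked stop word list, for illustration only (an assumption;
# spaCy ships a much larger English list).
STOP_WORDS = {"the", "is", "and", "a", "at", "this", "was"}

def filter_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop word set."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["The", "service", "at", "this", "restaurant", "was", "terrible"]
filtered = filter_stop_words(tokens)
print(filtered)  # ['service', 'restaurant', 'terrible']
```

Comparing the lowercase form matters here: without it, a capitalized "The" at the start of a sentence would slip through the filter.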

### Lemmatization
After cleaning, we standardize the text using stemming or lemmatization. Stemming reduces words to a root form by stripping affixes (mostly suffixes) with simple rules, so the result is not always a real word. Lemmatization converts words to their base, or dictionary, form using linguistic rules. Lemmatization is generally more accurate than stemming but computationally heavier. This example uses lemmatization only, through spaCy's `lemma_` attribute. Note the difference in run time between the stop word code block and the lemmatization code block.

In [None]:
# Function to perform lemmatization on stopword-removed tokens
def lemmatize_tokens(tokens):
    doc = nlp(" ".join(tokens))  # Recreate a string from tokens
    lemmas = [token.lemma_ for token in doc]  # Lemmatize the tokens
    return lemmas

# Apply stop word removal
df["tokens_no_stop"] = df["tweet"].apply(remove_stopwords)

# Apply lemmatization on tokens without stop words
df["lemmas"] = df["tokens_no_stop"].apply(lemmatize_tokens)

# Display the DataFrame
print(df)
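
The difference between stemming and lemmatization described in this section can be seen in a toy comparison. This is only a sketch under stated assumptions: `toy_stem`, `toy_lemmatize`, the suffix list, and the lookup table are all hypothetical and far simpler than a real stemmer or spaCy's lemmatizer.

```python
# Toy suffix list and lemma lookup table, both hypothetical and far
# smaller than what a real stemmer or lemmatizer uses.
SUFFIXES = ("ing", "es", "ed", "s")
LEMMA_TABLE = {"studies": "study", "going": "go", "was": "be", "better": "good"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the table; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

for word in ["studies", "going", "was", "better"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The rule-based stemmer turns "studies" into "studi" and "was" into "wa", neither of which is a word, while the lookup returns "study" and "be". That gain in accuracy is what the extra linguistic knowledge, and the extra computation, pays for.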

### Visualization
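Visualize the processed data by placing the original tweets, the tokens without stop words, and the lemmatized tokens side by side.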

In [None]:
# Create a new DataFrame for visualization
df_visualization = pd.DataFrame({
    "Original Tweet": df["tweet"],
    "Tokens": df["tokens_no_stop"].apply(lambda x: ", ".join(x)),  # Display tokens without stop words
    "Lemmatized Tokens": df["lemmas"].apply(lambda x: ", ".join(x))  # Display lemmatized tokens
})

# Print the DataFrame
print(df_visualization)

# Optional: Render the DataFrame as an HTML table for better visualization (e.g., in a Jupyter Notebook)
from IPython.display import display
display(df_visualization)