![bse_logo_textminingcourse](https://bse.eu/sites/default/files/bse_logo_small.png)

# Introduction to Text Mining and Natural Language Processing

Hannes Mueller


## Session 3: Pre-processing

This notebook supports the lecture for Session 3. It generates the examples shown in class of a text at various stages of pre-processing. The code is worth understanding because we will use these three options as the standard pre-processing choices throughout the course.

The code also shows how the text can be collected in a document term matrix. Strictly speaking, this is not part of pre-processing, but it helps to understand what pre-processing is for in the bag-of-words model.
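
To make the target concrete, here is a minimal sketch of a document term matrix for two toy sentences, using only the standard library. The sentences and the `toy_` names are made up for illustration and are not part of the lecture data.

```python
from collections import Counter

# Two toy documents (hypothetical), already lowercased and split on whitespace.
toy_docs = ["peace talks resume", "talks fail and talks end"]

# Count the tokens of each document separately.
toy_counts = [Counter(doc.split()) for doc in toy_docs]

# The vocabulary is the sorted set of all tokens across documents.
toy_vocab = sorted(set().union(*toy_counts))

# One row per document, one column per vocabulary entry.
toy_dtm = [[c[term] for term in toy_vocab] for c in toy_counts]

print(toy_vocab)   # ['and', 'end', 'fail', 'peace', 'resume', 'talks']
for row in toy_dtm:
    print(row)     # [0, 0, 0, 1, 1, 1] then [1, 1, 1, 0, 0, 2]
```

Word order is lost in this representation: each row only records how often each term occurs, which is why decisions such as lowercasing or stemming directly change the columns of the matrix.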

In [1]:
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:

import re
import nltk
import spacy
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords



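# Download NLTK data (if needed).
# Newer NLTK releases look for 'punkt_tab' instead of 'punkt', so we request both.
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)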

# The spaCy model itself is loaded in Option 3 below.

# Used later to build a document term matrix.
from sklearn.feature_extraction.text import CountVectorizer


# =============================================================================
# Original Text
# =============================================================================

text = (
    "Palestinian President  Mahmoud Abbas said Wednesday that late former Israeli President  "
    "Shimon Peres had exerted unremitting efforts to make peace until  the last moment of his life. "
    "\"He (Peres) exerted unremitting efforts to reach a permanent  peace since Oslo agreement was signed "
    "with Israel in 1993 until  the last moment in his life,\" Abbas said in a condolence letter to  Peres' family, "
    "carried by state news agency WAFA. In his letter, Abbas expressed deep grief and sorrow for Peres'  passing. "
    "\"He was a partner in making the peace of the brave with  late Palestinian leader Yasser Arafat and late "
    "Israeli Prime  Minister Yitzhak Rabin.\" Earlier on Wednesday, Israel announced that Peres died in a  hospital "
    "in the suburbs of Tel Aviv at the age of 93. \"We are so happy about the news of former Israeli president's  death. "
    "It is not only us, but all the Palestinian people are happy  about the news of Peres' death,\" he said. "
    "\"This man committed  crimes and shed the blood of our people.\" Abu Zuhri, whose movement is classified as a "
    "terrorist group  and rejects to recognize Israel and the peace process, said:  \"Peres was the last founder of this "
    "entity (Israel) and we believe  it is a start of a new stage of the Israeli occupation's  weakness.\""
)

print("=== Raw Original Text ===\n")
print(text)
print("\n")

# =============================================================================
# Option 1: Lowercasing, Punctuation Removal, and Stopword Removal
# =============================================================================

# Lowercase the text.
text_lower = text.lower()

# Remove punctuation using regex (keeping only word characters and whitespace).
text_no_punct = re.sub(r'[^\w\s]', '', text_lower)

# Get the set of English stopwords.
stop_words = set(stopwords.words('english'))

# Tokenize the cleaned text.
tokens = word_tokenize(text_no_punct)

# Filter out stopwords.
filtered_tokens = [token for token in tokens if token not in stop_words]

# Join tokens back into a single string.
cleaned_text = " ".join(filtered_tokens)

print("=== Cleaned Text (Lowercased, No Punctuation, Stopwords Removed) ===\n")
print(cleaned_text)
print("\n")

# =============================================================================
# Option 2: Stemming (Using Cleaned Text)
# =============================================================================

# Initialize the Porter Stemmer.
ps = PorterStemmer()

# Apply stemming to each filtered token.
stemmed_tokens = [ps.stem(token) for token in filtered_tokens]

# Join the stemmed tokens into a single string.
stemmed_text = " ".join(stemmed_tokens)

print("=== Stemmed Text (Stopwords Removed) ===\n")
print(stemmed_text)
print("\n")

# =============================================================================
# Option 3: Lemmatization (Only Stopwords Removed; No Lowercasing)
# =============================================================================
# Load the spaCy English language model.
nlp = spacy.load("en_core_web_sm")

# Process the original text with spaCy.
doc = nlp(text)

# Extract the lemma for each token, filtering out punctuation and stopwords.
# Note: spaCy flags stop words (token.is_stop) based on its internal list.
# The strip() check also drops whitespace-only tokens created by double spaces.
lemmatized_tokens = [
    token.lemma_
    for token in doc
    if not token.is_stop and not token.is_punct and token.lemma_.strip() != ''
]

# Join the lemmatized tokens into a single string.
lemmatized_text = " ".join(lemmatized_tokens)

print("=== Lemmatized Text (Stopwords Removed, Original Casing) ===\n")
print(lemmatized_text)



=== Raw Original Text ===

Palestinian President  Mahmoud Abbas said Wednesday that late former Israeli President  Shimon Peres had exerted unremitting efforts to make peace until  the last moment of his life. "He (Peres) exerted unremitting efforts to reach a permanent  peace since Oslo agreement was signed with Israel in 1993 until  the last moment in his life," Abbas said in a condolence letter to  Peres' family, carried by state news agency WAFA. In his letter, Abbas expressed deep grief and sorrow for Peres'  passing. "He was a partner in making the peace of the brave with  late Palestinian leader Yasser Arafat and late Israeli Prime  Minister Yitzhak Rabin." Earlier on Wednesday, Israel announced that Peres died in a  hospital in the suburbs of Tel Aviv at the age of 93. "We are so happy about the news of former Israeli president's  death. It is not only us, but all the Palestinian people are happy  about the news of Peres' death," he said. "This man committed  crimes and shed th
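
Option 1 above removes punctuation with the pattern `[^\w\s]`. The short sketch below applies the same pattern to a made-up sample sentence (not taken from the output) to show what happens to possessives.

```python
import re

# Hypothetical sample sentence with two possessive forms.
sample = "Peres' family heard of the former president's death."

# Same pattern as Option 1: drop everything that is not a word character or whitespace.
no_punct = re.sub(r"[^\w\s]", "", sample.lower())

print(no_punct)  # peres family heard of the former presidents death
```

The apostrophe is deleted rather than replaced by a space, so `president's` becomes the single token `presidents`. Replacing punctuation with a space instead would leave a stray `s` token, so either choice is a modelling decision rather than a neutral cleaning step.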

# Building the Document Term Matrix

You could add all three documents, but one would never want to mix different pre-processing schemes in the same matrix, so I pass just one document. The cell below uses stemmed_text; experiment by exchanging it for:

    - cleaned_text    # Lowercased, punctuation removed, stopwords removed.
    - lemmatized_text # Original casing, stopwords removed and lemmatized.
    


In [3]:
#############################################
# Document Term Matrix Construction Example
#############################################

# For demonstration purposes, we create a list of one document only.
# In practice, you might have many documents (e.g., from a corpus).
documents = [stemmed_text]   # Stemmed version of the cleaned text (Option 2).


# Define a helper function to generate n-grams.
def generate_ngrams(tokens, n):
    """
    Generate n-grams from a list of tokens.
    For example, n=2 returns bigrams.
    
    Args:
        tokens (List[str]): List of tokens.
        n (int): The number for the n-gram.
    
    Returns:
        List[str]: A list of n-gram strings.
    """
    return [' '.join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]

# Process each document to extract unigrams and bigrams.
documents_ngrams = []  # Will hold the combined list of unigrams and bigrams for each document.
for doc in documents:
    # Each document is already a string of tokens separated by a space.
    tokens = doc.split()  
    unigrams = tokens
    bigrams = generate_ngrams(tokens, 2)
    combined = unigrams + bigrams
    documents_ngrams.append(combined)

# Build the vocabulary. Each unique token (from unigrams and bigrams) is assigned a unique index.
vocab = {}
index = 0
for doc in documents_ngrams:
    for token in doc:
        if token not in vocab:
            vocab[token] = index
            index += 1

print("Vocabulary (Unigrams & Bigrams):")
print(vocab)
print("\nTotal vocabulary size:", len(vocab))
print("\n")

# Now, create the document-term matrix.
# Rows represent documents and columns represent token counts according to our vocabulary.
import numpy as np

# Initialize the document-term matrix with zeros.
dtm = np.zeros((len(documents_ngrams), len(vocab)), dtype=int)

# Fill in the document-term matrix.
for doc_id, doc in enumerate(documents_ngrams):
    for token in doc:
        token_index = vocab[token]
        dtm[doc_id, token_index] += 1

print("Document-Term Matrix (rows: documents, columns: token counts):")
print(dtm)

# For better visualization, let's print the matrix as a DataFrame with token (column) labels.
import pandas as pd

# Create a DataFrame from the dtm, with columns ordered by vocabulary index.
dtm_df = pd.DataFrame(dtm, columns=sorted(vocab, key=vocab.get))
# Row label for the single document; rename it to match the text passed in `documents`.
dtm_df.index = ['Cleaned']

print("\nDocument-Term Matrix DataFrame:")
dtm_df


Vocabulary (Unigrams & Bigrams):
{'palestinian': 0, 'presid': 1, 'mahmoud': 2, 'abba': 3, 'said': 4, 'wednesday': 5, 'late': 6, 'former': 7, 'isra': 8, 'shimon': 9, 'pere': 10, 'exert': 11, 'unremit': 12, 'effort': 13, 'make': 14, 'peac': 15, 'last': 16, 'moment': 17, 'life': 18, 'reach': 19, 'perman': 20, 'sinc': 21, 'oslo': 22, 'agreement': 23, 'sign': 24, 'israel': 25, '1993': 26, 'condol': 27, 'letter': 28, 'famili': 29, 'carri': 30, 'state': 31, 'news': 32, 'agenc': 33, 'wafa': 34, 'express': 35, 'deep': 36, 'grief': 37, 'sorrow': 38, 'pass': 39, 'partner': 40, 'brave': 41, 'leader': 42, 'yasser': 43, 'arafat': 44, 'prime': 45, 'minist': 46, 'yitzhak': 47, 'rabin': 48, 'earlier': 49, 'announc': 50, 'die': 51, 'hospit': 52, 'suburb': 53, 'tel': 54, 'aviv': 55, 'age': 56, '93': 57, 'happi': 58, 'death': 59, 'us': 60, 'peopl': 61, 'man': 62, 'commit': 63, 'crime': 64, 'shed': 65, 'blood': 66, 'abu': 67, 'zuhri': 68, 'whose': 69, 'movement': 70, 'classifi': 71, 'terrorist': 72, 'group

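| | palestinian | presid | mahmoud | abba | said | wednesday | late | former | isra | shimon | ... | last founder | founder entiti | entiti israel | israel believ | believ start | start new | new stage | stage isra | isra occup | occup weak |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cleaned | 3 | 3 | 1 | 3 | 4 | 2 | 3 | 2 | 4 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |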
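
The bigram columns at the right of the table come from sliding a window of length two over the token list. The sketch below repeats that logic on a short, hand-picked list of stems; `toy_ngrams` and `toy_tokens` are illustrative names, not part of the cell above.

```python
def toy_ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# A short, hand-picked list of stems (hypothetical input).
toy_tokens = ["peac", "last", "moment", "life"]

toy_bigrams = toy_ngrams(toy_tokens, 2)
print(toy_bigrams)                # ['peac last', 'last moment', 'moment life']

# A window longer than the token list yields no n-grams at all.
print(toy_ngrams(toy_tokens, 5))  # []
```

A list of k tokens produces k - 1 bigrams. Because stopwords were removed before the bigrams were built, a bigram such as `founder entiti` joins words that were not adjacent in the raw text.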


In [8]:
# Just for fun, let's build a unigram matrix from the raw text with CountVectorizer.

cv = CountVectorizer(ngram_range = (1,1), lowercase=False)
cv.fit([text])



vectorized_text=cv.transform([text])
vectorized_text=vectorized_text.todense()
print("document term matrix has size:", vectorized_text.shape)

print("The column names of the matrix are:")
print(cv.get_feature_names_out())

document term matrix has size: (1, 121)
The column names of the matrix are:
['1993' '93' 'Abbas' 'Abu' 'Arafat' 'Aviv' 'Earlier' 'He' 'In' 'Israel'
 'Israeli' 'It' 'Mahmoud' 'Minister' 'Oslo' 'Palestinian' 'Peres'
 'President' 'Prime' 'Rabin' 'Shimon' 'Tel' 'This' 'WAFA' 'We' 'Wednesday'
 'Yasser' 'Yitzhak' 'Zuhri' 'about' 'age' 'agency' 'agreement' 'all' 'and'
 'announced' 'are' 'as' 'at' 'believe' 'blood' 'brave' 'but' 'by'
 'carried' 'classified' 'committed' 'condolence' 'crimes' 'death' 'deep'
 'died' 'efforts' 'entity' 'exerted' 'expressed' 'family' 'for' 'former'
 'founder' 'grief' 'group' 'had' 'happy' 'he' 'his' 'hospital' 'in' 'is'
 'it' 'last' 'late' 'leader' 'letter' 'life' 'make' 'making' 'man'
 'moment' 'movement' 'new' 'news' 'not' 'occupation' 'of' 'on' 'only'
 'our' 'partner' 'passing' 'peace' 'people' 'permanent' 'president'
 'process' 'reach' 'recognize' 'rejects' 'said' 'shed' 'signed' 'since'
 'so' 'sorrow' 'stage' 'start' 'state' 'suburbs' 'terrorist' 'that' 'the'
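
The feature names above contain `He` and `he` as separate columns because of `lowercase=False`, and the article `a` is missing even though no stopword list was passed. The second effect comes from the default `token_pattern` of `CountVectorizer`, which only keeps tokens of two or more word characters. A minimal sketch applying that pattern directly with `re` to a made-up sample string:

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer:
# a token is a run of at least two word characters.
default_pattern = r"(?u)\b\w\w+\b"

# Hypothetical sample string.
sample = "He was a partner in Peres' family"

found = re.findall(default_pattern, sample)
print(found)  # ['He', 'was', 'partner', 'in', 'Peres', 'family']
```

Single-character tokens and punctuation are dropped silently, so the vectorizer already performs some pre-processing of its own even when it is given the raw text.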


In [9]:
# Let's try the same trick with the lemmatized text:
cv.fit([lemmatized_text])


# Transform the same text the vectorizer was fitted on.
vectorized_text=cv.transform([lemmatized_text])
vectorized_text=vectorized_text.todense()
print("document term matrix has size:", vectorized_text.shape)

print("The column names of the matrix are:")
print(cv.get_feature_names_out())

document term matrix has size: (1, 81)
The column names of the matrix are:
['1993' '93' 'Abbas' 'Abu' 'Arafat' 'Aviv' 'Israel' 'Mahmoud' 'Minister'
 'Oslo' 'Peres' 'President' 'Prime' 'Rabin' 'Shimon' 'Tel' 'WAFA'
 'Wednesday' 'Yasser' 'Yitzhak' 'Zuhri' 'age' 'agency' 'agreement'
 'announce' 'believe' 'blood' 'brave' 'carry' 'classify' 'commit'
 'condolence' 'crime' 'death' 'deep' 'die' 'early' 'effort' 'entity'
 'exert' 'express' 'family' 'founder' 'grief' 'group' 'happy' 'hospital'
 'israeli' 'late' 'leader' 'letter' 'life' 'make' 'man' 'moment'
 'movement' 'new' 'news' 'occupation' 'palestinian' 'partner' 'pass'
 'peace' 'people' 'permanent' 'president' 'process' 'reach' 'recognize'
 'reject' 'say' 'shed' 'sign' 'sorrow' 'stage' 'start' 'state' 'suburb'
 'terrorist' 'unremitting' 'weakness']
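
`CountVectorizer` learns its vocabulary in `fit` and counts only those terms in `transform`, so the text passed to `transform` should be pre-processed the same way as the text used in `fit`. The sketch below mimics this split with the standard library; `fit_vocabulary` and `transform_counts` are hypothetical helpers written for illustration, not scikit-learn functions, and the two toy strings are made up.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Hypothetical helper: map each term seen in docs to a column index (sorted order)."""
    terms = sorted({tok for doc in docs for tok in doc.split()})
    return {term: i for i, term in enumerate(terms)}

def transform_counts(docs, vocabulary):
    """Hypothetical helper: count only the terms that are in the fitted vocabulary."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[term] for term in vocabulary])
    return rows

# Fit on a lemma-like toy text.
fitted = fit_vocabulary(["Abbas say Peres exert effort"])

# Transform the same text, then a raw-like variant with inflected forms.
same = transform_counts(["Abbas say Peres exert effort"], fitted)
other = transform_counts(["Abbas said Peres exerted efforts"], fitted)

print(list(fitted))  # ['Abbas', 'Peres', 'effort', 'exert', 'say']
print(same)          # [[1, 1, 1, 1, 1]]
print(other)         # [[1, 1, 0, 0, 0]]
```

The number of columns is fixed by the fitted vocabulary in both cases, so the matrix shape alone does not reveal that inflected forms such as `said` were silently dropped in the second call.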


# Homework: process more than one text
Write a loop that goes through several different texts and builds a single document term matrix with one row per text.