## Preprocessing
As we can see, punctuation cannot simply be ignored. There can also be special characters to take into account, such as `#`, `$`, `%`, etc. Raw text is typically not well suited for analysis. Text preprocessing usually incorporates a cleaning step, including

* converting everything to lower case
* removing line breaks
* removing punctuation
* removing numbers
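
The four steps listed above can be sketched as one small function. This is only a minimal illustration: the name `basic_clean` and the sample sentence are assumptions made for the example, and the cells that follow refine each step separately.

```python
import re
import string

def basic_clean(text):
    """Apply the four basic cleaning steps in sequence (hypothetical helper)."""
    text = text.lower()                                   # lower-case everything
    text = text.replace("\r", " ").replace("\n", " ")     # remove line breaks
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove ASCII punctuation
    text = re.sub(r"\d+", "", text)                       # remove numbers
    return re.sub(r"\s+", " ", text).strip()              # collapse leftover whitespace

print(basic_clean("Price: $15,\nup 3% since 2020!"))
```

Note that this plain version also throws away years and hyphens, which is why the following cells treat those cases more carefully.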

In [1]:
import tarfile
import csv
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from tqdm import tqdm
import json

In [2]:
# File paths
tar_path = "/home/elahe/Desktop/UniPisa/IR/Project/TinTinfy/collection.tar.gz"  
output_file = "preprocessed_data.json"

# Extract the tar.gz file
#with tarfile.open(tar_path, "r:gz") as tar:
#   tar.extractall()

# Process the extracted file
input_file = "collection.tsv"  # The extracted file name
preprocessed_data = []

In [5]:
def remove_punctuation(text):
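    # Remove punctuation, keeping word characters, whitespace and every hyphen
    # (the pattern does not distinguish intra-word hyphens from stand-alone ones)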
    return re.sub(r'[^\w\s-]', '', text)

# Example usage
sample_text = "North-American wildlife, Rock&R Wildlife is—important!"
print(remove_punctuation(sample_text))


North-American wildlife RockR Wildlife isimportant
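
The output above shows that deleting punctuation outright fuses neighbouring words ("RockR", "isimportant"). One possible variant, sketched here under the assumption that punctuation should act as a word separator, replaces it with a space and keeps only hyphens that sit between two word characters. The name `remove_punctuation_spaced` is hypothetical.

```python
import re

def remove_punctuation_spaced(text):
    """Replace punctuation with spaces and keep only intra-word hyphens."""
    # Any character that is not a word character, whitespace or hyphen becomes a space
    text = re.sub(r"[^\w\s-]", " ", text)
    # Hyphens not surrounded by word characters on both sides become a space
    text = re.sub(r"(?<!\w)-|-(?!\w)", " ", text)
    # Collapse the runs of whitespace created by the two substitutions
    return re.sub(r"\s+", " ", text).strip()

# Same sample as above; \u2014 is the dash between "is" and "important"
sample_text = "North-American wildlife, Rock&R Wildlife is\u2014important!"
print(remove_punctuation_spaced(sample_text))
```

With this variant the sample keeps "North-American" intact while "Rock&R" and the dash-joined words are split into separate tokens.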


In [7]:
def remove_numbers_keep_dates(text):
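    # Keep years (YYYY) and dates (MM-DD); the capture group consumes a date
    # such as "12-25" whole, so its second number is not stripped afterwards.
    return re.sub(r'\b(\d{4}|\d{1,2}-\d{1,2})\b|\b\d+\b', lambda m: m.group(1) or '', text)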

# Example usage
sample_text = "In 2023, there were 15 species and 12 - 25 migrations."
print(remove_numbers_keep_dates(sample_text))


In 2023, there were  species and  -  migrations.


In [28]:
import unicodedata

def fix_encoding_issues(text):
    # Apply Unicode compatibility decomposition (NFKD), e.g. '³' becomes '3'.
    # This does not undo mojibake such as 'Ã³'.
    return unicodedata.normalize('NFKD', text)

# Example usage
sample_text = "canciÃ³n doesnât Â©"
print(fix_encoding_issues(sample_text))


canciÃ3n doesnât Â©
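
As the output shows, normalization leaves the garbled sequences in place. Sequences such as "Ã³" typically arise when UTF-8 bytes are decoded as Latin-1, so one possible repair is to reverse that round trip. The sketch below assumes exactly this kind of mix-up; text mis-decoded with another code page (for example Windows-1252) would need a different codec, and `repair_mojibake` is a hypothetical name.

```python
def repair_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Latin-1 (hypothetical helper)."""
    try:
        # Recover the original bytes, then decode them with the intended codec
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not a clean Latin-1/UTF-8 mix-up: leave the text untouched
        return text

print(repair_mojibake("canciÃ³n"))
```

Strings that are already correct, such as "café", fail the UTF-8 decoding step and are returned unchanged, so the function is safe to apply to mixed input.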


In [10]:
def remove_line_breaks(text):
    # Replace line breaks with a single space
    return text.replace('\n', ' ')

# Example usage
sample_text = "Line 1\nLine 2\nLine 3."
print(remove_line_breaks(sample_text))


Line 1 Line 2 Line 3.


In [14]:
import re

def preserve_case(text):
    # Split the text into tokens
    tokens = text.split()
    # Convert tokens to lowercase unless they are all uppercase
    processed_tokens = [token if token.isupper() else token.lower() for token in tokens]
    # Join tokens back into a string
    return " ".join(processed_tokens)

# Example usage
sample_text = "CVT WILDLIFE Wildlife is important for the US and CVT ecosystem."
print(preserve_case(sample_text))

CVT WILDLIFE wildlife is important for the US and CVT ecosystem.


In [29]:
# Download necessary NLTK resources
nltk.download('stopwords')

# Define helper functions
def preprocess_text(text):
    # Clean the text: punctuation, case, numbers, encoding, line breaks
    text = remove_punctuation(text)
    text = preserve_case(text)
    text = remove_numbers_keep_dates(text)
    text = fix_encoding_issues(text)
    text = remove_line_breaks(text)
    # Tokenize text
    tokens = text.split()
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Apply stemming (SnowballStemmer lowercases every token, including the
    # all-uppercase ones kept by preserve_case)
    stemmer = SnowballStemmer("english")
    tokens = [stemmer.stem(word) for word in tokens]
    return " ".join(tokens)

[nltk_data] Downloading package stopwords to /home/elahe/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [30]:
preprocess_text("canciÃ³n doesnât Â©")

'canciã3n doesnât â'

In [3]:
# View a sample of raw file content
print("Raw file content (first 5 lines):")
with open(input_file, "r", encoding="utf-8") as file:
    reader = csv.reader(file, delimiter="\t")
    for i, line in enumerate(reader):
        if i == 5:  # Limit output to the first 5 lines
            break
        print(line)

Raw file content (first 5 lines):
['0', 'The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.']
['1', 'The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science.']
['2', 'Essay on The Manhattan Project - The Manhattan Project The Manhattan Project was to see if making an atomic bomb possible. The success of this project would forever change the world forever making it known that something this powerful can be manmade.']
['3', 'The Manhattan Project was the name for a project conducted during World War II, to develop the first atomic bomb. It refers specifically to the period of the project from 194 â\x8

In [4]:
import random


raw_data = []
with open(input_file, "r", encoding="utf-8") as file:
    reader = csv.reader(file, delimiter="\t")
    for line in reader:
        # Ensure line has both pid and text
        if len(line) == 2:
            raw_data.append(line)

# Display 5 random batches of 5 samples each
print("Random Batches of Raw Data:")
for i in range(5):
    print(f"\nBatch {i + 1}:")
    batch = random.sample(raw_data, 5)  # Randomly select 5 samples
    for pid, text in batch:
        print(f"PID: {pid}, Text: {text}")

Random Batches of Raw Data:

Batch 1:
PID: 5216351, Text: This difference between the automatic transmission and CVT is the number of gears. The automatic is limited to 4 to 9 gear ratios and there are definite or noticable changes between the gears and this is felt during driving. The CVT has what is termed an infinite number of ratios ranging from low to high and the transmission move through the gears steeplessly according to the driving situation.
PID: 2463607, Text: WILDLIFE Wildlife at Anedodi is abundant and ever surprising! Being fairly remote and surrounded mainly by large oaks, dogwoods, pine, and other forest tree varieties, the possibilities are endless!
PID: 2214634, Text: A convenient way to access money. Travellers cheques provide a convenient way to access your money when travelling. By cashing them as you need them, you also minimize the amount of cash you carry around. Denomination Availability. Availability may vary and result in your order not being processed exactl

In [None]:
def cleanup(doc):    
    """Returns a string with special characters replaced by whitespaces."""
    tmp = doc
    tmp = tmp.replace(',', ' ')
    tmp = tmp.replace('.', ' ')
    tmp = tmp.replace('#', ' ')
    tmp = tmp.replace('$', ' ')
    tmp = tmp.replace('%', ' ')
    tmp = tmp.replace('\n', ' ')
    return tmp

In [None]:
def tokenise(doc):    
    """Returns a sequence of terms given an input text."""
    return doc.split()

In [None]:
def lowercase(doc):    
    """Returns a string with all characters lower-cased."""
    return doc.lower()

In [None]:
doc = 'One ring to rule them all. One by one, the free peoples of Middle Earth fell to the power of the Ring. But there were some who resisted.'
doc = lowercase(doc)
doc = cleanup(doc)
tokens = tokenise(doc)
print(tokens)

An even better tokenizer is the Treebank Word Tokenizer. It incorporates a variety of common rules for English word tokenization. For example, it separates phrase-terminating punctuation (`?!.;,`) from adjacent tokens and retains decimal numbers containing a period as a single token. In addition, it contains rules for English contractions: for example, "don't" is tokenized as `["do", "n't"]`.


In [None]:
from nltk.tokenize import TreebankWordTokenizer

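tokenizer = TreebankWordTokenizer()
# Same example sentence as in the previous cell
sentence = 'One ring to rule them all. One by one, the free peoples of Middle Earth fell to the power of the Ring. But there were some who resisted.'
tokens = tokenizer.tokenize(cleanup(sentence))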
print(tokens)
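
Because `cleanup` replaces periods and commas with spaces, the cell above never lets the tokenizer see them. To observe the punctuation, decimal and contraction rules described earlier, the tokenizer can be given raw text instead. The sentence below is an illustrative assumption, not taken from the collection.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
# Illustrative sentence with a contraction, a decimal number and punctuation
raw_sentence = "Don't pay $3.50, okay?"
tokens = tokenizer.tokenize(raw_sentence)
print(tokens)
```

The contraction is split off as `n't`, the decimal `3.50` stays in one piece, and the comma becomes a token of its own.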

Stemming reduces inflected words to a common root. A stemmer built from a single regular expression works well for regular cases but cannot address more complex ones.
Two of the most popular stemming algorithms are the **Porter** and **Snowball** stemmers. They implement more elaborate rules than a simple regular expression, which lets them handle the complexities of English spelling and word-ending rules:

In [None]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

print(' '.join([stemmer.stem(w).strip("'") for w in "dish washer's washed dishes".split()]))
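
For contrast with the Porter stemmer above, a naive regular-expression stemmer might look like the sketch below. It is deliberately crude and only meant to show where a single suffix rule breaks down; `naive_stem` and its suffix list are assumptions made for this example.

```python
import re

def naive_stem(word):
    """Strip one common English suffix (hypothetical, deliberately simplistic)."""
    return re.sub(r"(ing|ed|es|s)$", "", word.lower())

for word in ["washed", "dishes", "running", "glass"]:
    print(word, "->", naive_stem(word))
```

The regular cases come out as expected ("washed" and "dishes" become "wash" and "dish"), but "running" is reduced to "runn" and "glass" to "glas", which is the kind of case rule-based stemmers are designed to handle.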

In [2]:
import tarfile
import csv
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from tqdm import tqdm
import json

# Download necessary NLTK resources
nltk.download('stopwords')

# Build the stopword set and the stemmer once, not once per document
STOP_WORDS = set(stopwords.words('english'))
STEMMER = SnowballStemmer("english")

# Define helper functions
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Replace everything except ASCII letters and whitespace (digits included) with a space
    text = re.sub(r"[^a-z\s]", " ", text)
    # Tokenize text
    tokens = text.split()
    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]
    # Apply stemming
    tokens = [STEMMER.stem(word) for word in tokens]
    return " ".join(tokens)

# File paths
tar_path = "/home/elahe/Desktop/UniPisa/IR/TinTinfy/collection.tar.gz"  # Update with your file path
output_file = "preprocessed_data.json"

# Extract the tar.gz file
with tarfile.open(tar_path, "r:gz") as tar:
    tar.extractall()

# Process the extracted file
input_file = "collection.tsv"  # The extracted file name
preprocessed_data = []

with open(input_file, "r", encoding="utf-8") as file:
    reader = csv.reader(file, delimiter="\t")
    for line in tqdm(reader, desc="Processing Documents"):
        try:
            # Ensure line has both pid and text
            if len(line) != 2:
                continue
            pid, text = line
            # Preprocess the text
            cleaned_text = preprocess_text(text)
            # Skip empty or too short documents
            if len(cleaned_text.split()) < 5:
                continue
            # Append to result
            preprocessed_data.append({"pid": pid, "text": cleaned_text})
        except Exception as e:
            # Log and skip malformed lines
            print(f"Error processing line: {line} - {e}")

# Save preprocessed data
with open(output_file, "w", encoding="utf-8") as outfile:
    json.dump(preprocessed_data, outfile, ensure_ascii=False, indent=4)

print(f"Preprocessed data saved to {output_file}")


[nltk_data] Downloading package stopwords to /home/elahe/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Processing Documents: 8841823it [1:10:51, 2079.65it/s]


Preprocessed data saved to preprocessed_data.json
