# Feature extraction from text

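This notebook walks through the classical way of describing text documents: tokenization, stemming, stopword removal, and set-of-words and bag-of-words representations. We will improve these descriptors in the next chapter.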

In [1]:
%pip install -r requirements.txt
%load_ext autoreload
%autoreload 2

Note: you may need to restart the kernel to use updated packages.


In [2]:
from IPython.display import display, Markdown, JSON
import urllib.request
import os, re
from PyPDF2 import PdfReader
from unidecode import unidecode
from collections import Counter, defaultdict
from helpers import *
import math
import random
from nltk.stem import PorterStemmer

## Extracting text from file formats

In [3]:
uri = "https://dmi.unibas.ch/fileadmin/user_upload/dmi/Studium/Computer_Science/Vorlesungen_HS23/Multimedia_Retrieval/HS24/03_ClassicalTextRetrieval.pdf"
local_filename = 'example.pdf'

# unless local file already exists, download the file
if not os.path.exists(local_filename):
    urllib.request.urlretrieve(uri, local_filename)
    print(f"File downloaded and saved as {local_filename}")

In [8]:
def extract_text_from_pdf(file_name: str) -> list[str]:
    pages = []

    def visitor_text(text, cm, tm, fontDict, fontSize):
        y = tm[5]
        if y > 20 and len(text) > 0:
            # replace newlines with a space, then strip citation markers like [12] and bullet glyphs
            text = text.replace("\n", " ")
            text = re.sub(r'\[\d+\]|➢|•', '', text)
            parts.append(text)

    # read the PDF and extract all texts (do some post-processing with above function)
    reader = PdfReader(file_name)
    for page in reader.pages:
        parts = []
        page.extract_text(visitor_text=visitor_text)
        pages.append(re.sub(r'\s+',' ', " ".join(parts)).strip())

    # return one cleaned-up string per page
    return pages

pages = extract_text_from_pdf(local_filename)
text = re.sub(r'\s+',' ', " ".join(pages))
display(Markdown(text[0:4000]+"..."))

Computer Science / 15731 - 01 / 2024 Multimedia Retrieval Chapter 1: Introduction Dr. Roger Weber, roger.weber@gmail.com 1.1 Motivation 1.2 Generic Retrieval Process 1.3 Metadata and How It Can Help 1.4 References Links 1.1 Motivation 1.2 Generic Retrieval Process 1.3 Metadata and How It Can Help 1.4 References & Links Course ID 15731 - 01 Lecturer Dr. Roger Weber, roger.weber@gmail.com Time Friday 15:15 - 18:00 (1 st /2 nd hour for theory, 3 rd hour for exercise & practice → bring your own laptop) Note: changes are announced on web site and / or per e - mail ahead of lectures Location Physical presence: Seminarraum 05.002, Spiegelgasse 5 If physical presence is not possible, we use Zoom Meetings. Please check the schedule for updates. During physical presence lectures, no Zoom meetings and no video recordings are available. Prerequisites Basics of programming (Python preferred) Mathematical foundations (for some parts) Content Introduction to multimedia retrieval with a focus on classical text retrieval, web retrieval, extraction and machine learning of features for images, audio, and video, index structures, search algorithms, and concrete implementations. The course is touching on classical and current information retrieval techniques and search algorithms. Exam Oral exam (30 minutes) on January 10, 14, 17, 21, 24 Credit Points 6 Grades From 1 to 6 with 0.5 steps. 4.0 or higher required to pass exam. Homepage WEB: https://dmi.unibas.ch/de/studium/computer - science - informatik/lehrangebot - hs24/15731 - lecture - multimedia - retrieval/ ADAM: https://adam.unibas.ch/goto_adam_crs_1738202.html All materials are published in advance. 
Practical exercises to be submitted to ADAM Structure of the class Foundation 1 Introduction We cover motivation, a summary of history, the generic retrieval process and its variations, a quick overview of metadata, and view demos to get us started 2 Evaluation We focus on evaluating and comparing retrieval systems and machine learning approaches. This serves as the basis for assessing the effectiveness of the methods covered in most of the chapters 11 ML Methods* We cover essential machine learning methods as needed for content analysis and the extraction of metadata items. As we progress through the course, we will visit individual chapters as need Text & Web Retrieval 3 Classic We explore classical text retrieval models, with a particular emphasis on vector space retrieval. We also delve into Lucene, OpenSearch and Elasticsearch which showcase the capabilities of these models 4 Advanced We examine natural language processing using NLTK as an example. Additionally, we explore contemporary methods for creating embeddings and leveraging generative AI to improve results 5 Web & Social We focus on web and social media retrieval, particularly examining methods to influence rankings based on the relationships between documents 6 Vector Search We explore the challenge of searching through embeddings and feature vectors. We discuss the “curse of dimensionality” and study contemporary techniques used by products like Lucene, OpenSearch, and Elasticsearch Image Retrieval 7 Basic We cover the human perception of visual signal information and examine several algorithms for extracting features that describe color, texture, and shape aspects found in the images 8 Advanced We delve into neural networks and explore the concept of deep learning. 
We apply these techniques to extract higher - level features, including classifications, facial recognition, and object bounding boxes Audio Retrieval 9 Basic We cover the human perception of audio signals and study various algorithms for extracting features in both the time and frequency domains. Additionally, we delve into the unique case of extracting musical features Video Retrieval 10 Basic We discuss fundamental elements of motion detection and video segmentation. Specifically, we focus on identifying shot and scene boundaries in videos For exams, Chapter 11 requi...
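
The clean-up applied to each text fragment during extraction can be tried on its own. The following is a minimal sketch using only the standard library; the helper name `clean_fragment` and the sample string are made up for illustration and are not part of the notebook.

```python
import re

def clean_fragment(text: str) -> str:
    # Hypothetical helper that mirrors the clean-up steps of the extraction code.
    text = text.replace("\n", " ")
    # drop citation markers such as [3] and the bullet glyphs used on the slides
    text = re.sub(r'\[\d+\]|➢|•', '', text)
    # collapse runs of whitespace into a single space
    return re.sub(r'\s+', ' ', text).strip()

raw = "• Vector space\nretrieval [3]   ranks  documents ➢ by similarity"
print(clean_fragment(raw))
# Vector space retrieval ranks documents by similarity
```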

## A simple tokenizer

In [None]:
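def tokenize(text: str) -> list[str]:
    # replace every run of characters that are neither word characters nor hyphens with a space
    text = re.sub(r'[^\w\-]+', ' ', text)
    tokens = []
    for token in text.split(' '):
        # lowercase and transliterate to ASCII
        token = unidecode(token.strip().lower())
        # skip empty strings and single characters
        if len(token) < 2: continue
        # keep only tokens that start with a letter (drops numbers such as "2024")
        if not re.match(r'^[a-zA-Z][\w\-\.]*$', token): continue
        tokens.append(token)
    return tokens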

tokens = tokenize(text)
print("\n".join(tokens[0:20]))
print(f'...\n\nextracted {len(tokens)} tokens from text with {len(set(tokens))} unique tokens')
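
To see which tokens survive the filter rules, it helps to run them on a single sentence. The sketch below repeats the same regular expressions using only the standard library. It leaves out `unidecode`, so it assumes ASCII input; the name `tokenize_ascii` and the sample sentence are hypothetical.

```python
import re

def tokenize_ascii(text: str) -> list[str]:
    # Hypothetical stdlib-only variant of the tokenizer (no unidecode, ASCII input assumed).
    text = re.sub(r'[^\w\-]+', ' ', text)  # keep only word characters and hyphens
    tokens = []
    for token in text.split(' '):
        token = token.strip().lower()
        if len(token) < 2:  # drops empty strings and single characters
            continue
        if not re.match(r'^[a-zA-Z][\w\-\.]*$', token):  # must start with a letter
            continue
        tokens.append(token)
    return tokens

sample = "Chapter 3: Vector-space retrieval (e.g. tf-idf) in 2024!"
print(tokenize_ascii(sample))
# ['chapter', 'vector-space', 'retrieval', 'tf-idf', 'in']
```

Hyphenated terms stay intact, while the number `2024` and the single letters left over from `e.g.` are dropped.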

### Let's see which terms appear most often

In [None]:
tokens = tokenize(text)
print_table(Counter(tokens).most_common(20),['token', 'frequency'])

### Apply Porter stemming to reduce words to a common stem

In [None]:
porter_stemmer = PorterStemmer()

def reduce_to_stems(tokens):
    return [porter_stemmer.stem(token) for token in tokens]

tokens = reduce_to_stems(tokenize(text))
print_table(Counter(tokens).most_common(20),['token', 'frequency'])

### Eliminate the stopwords as they do not describe the content of the document

In [None]:
def eliminate_stopwords(tokens):
    # stopwords is expected to come from the helpers module (star import above)
    return [token for token in tokens if token not in stopwords['english']]

tokens = tokenize(text)
count = len(tokens)
tokens = reduce_to_stems(eliminate_stopwords(tokens))
print(f'{count-len(tokens)} stopwords removed ({(count-len(tokens))/count*100:.2f}%)')
print(f'{len(tokens)} non-stopword tokens remain with {len(set(tokens))} unique tokens')
print_table(Counter(tokens).most_common(20),['token', 'frequency'])
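
The effect of stopword removal is easy to check on a short token list. This is a minimal sketch with a made-up stopword set; `STOPWORDS`, `remove_stopwords`, and the sample sentence are assumptions for illustration, not the list the notebook uses.

```python
# Hypothetical mini stopword list, far smaller than a real English list.
STOPWORDS = frozenset({"the", "of", "and", "a", "in", "to", "is", "we"})

def remove_stopwords(tokens: list[str]) -> list[str]:
    # keep only tokens that are not in the stopword set
    return [t for t in tokens if t not in STOPWORDS]

sample_tokens = "we cover the history of retrieval and the evaluation of retrieval systems".split()
kept = remove_stopwords(sample_tokens)
removed = len(sample_tokens) - len(kept)

print(kept)
# ['cover', 'history', 'retrieval', 'evaluation', 'retrieval', 'systems']
print(f"{removed} of {len(sample_tokens)} tokens removed ({removed / len(sample_tokens) * 100:.0f}%)")
# 6 of 12 tokens removed (50%)
```

Using a set (here a `frozenset`) keeps each membership test at constant time, which matters once the token list grows to whole documents.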

### We describe each page separately and treat the pages as mini-documents

In [None]:
collection = [reduce_to_stems(eliminate_stopwords(tokenize(page))) for page in pages]

n = 10
print_table(
    [
        [
            i+1,
            " ".join(collection[i])
        ] for i in range(n)
    ],
    ["page", "tokens"]
)

### Set-of-words summary

In [None]:
def set_of_words(tokens):
    return set(tokens)

n = 10
print_table(
    [
        [
            f"{i+1}",
            ", ".join(sorted(set_of_words(collection[i])))
        ] for i in range(n)
    ],
    ["page", "set of words"]
)

### Bag-of-words summary

In [None]:
def bag_of_words(tokens):
    return dict(Counter(tokens))

n = 10
print_table(
    [
        [
            f"{i+1}",
            ", ".join([f'{x[0]}:{x[1]}' for x in sorted(bag_of_words(collection[i]).items())])
        ] for i in range(n)
    ],
    ["page", "bag of words"]
)
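
The difference between the two summaries shows up on a handful of tokens. The sketch below uses a made-up list of stemmed tokens (`sample_tokens` is hypothetical): the set of words records only presence, the bag of words also records how often each term occurs.

```python
from collections import Counter

# Hypothetical stemmed tokens of one mini-document
sample_tokens = ["retriev", "model", "vector", "retriev", "vector", "retriev"]

set_summary = sorted(set(sample_tokens))            # presence only
bag_summary = sorted(Counter(sample_tokens).items())  # (term, frequency) pairs

print(set_summary)
# ['model', 'retriev', 'vector']
print(bag_summary)
# [('model', 1), ('retriev', 3), ('vector', 2)]
```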

### Bag-of-words requires document frequencies and idf weights for each term

In [None]:
def idf(N, df):
    return math.log10((N + 1) / (df + 1))

terms = defaultdict(int)
for page in collection:
    # go through each distinct term on this page
    for term in set(page):
        terms[term] += 1
vocabulary = {term: {"df": count, "idf": idf(len(collection), count)} for term, count in terms.items()}

n = 20
sample = sorted(random.sample(list(vocabulary.items()), n), key=lambda x: x[1]["idf"], reverse=True)
print_table(
    [
        [x[0], f'{x[1]["df"]} / {len(collection)}', f'{x[1]["idf"]:.3f}'] for x in sample
    ],
    ["Term", "df", "idf"]
)
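
The formula used above is $\mathrm{idf}(t) = \log_{10}\frac{N+1}{\mathrm{df}(t)+1}$, where $N$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing term $t$. A small worked example, with a made-up collection of three mini-documents (`docs` is hypothetical), shows how the values behave:

```python
import math
from collections import defaultdict

def idf(N: int, df: int) -> float:
    # same smoothed formula as in the cell above
    return math.log10((N + 1) / (df + 1))

# Hypothetical collection of three tokenized mini-documents
docs = [
    ["vector", "retriev", "model"],
    ["vector", "search", "vector"],
    ["audio", "retriev"],
]

doc_freq = defaultdict(int)
for doc in docs:
    for term in set(doc):  # count each term once per document
        doc_freq[term] += 1

for term in sorted(doc_freq):
    print(f"{term}: df={doc_freq[term]}, idf={idf(len(docs), doc_freq[term]):.3f}")
# audio: df=1, idf=0.301
# model: df=1, idf=0.301
# retriev: df=2, idf=0.125
# search: df=1, idf=0.301
# vector: df=2, idf=0.125
```

Rare terms get the larger weight. With this formula, a term that occurs in every document has an idf of exactly 0 and so contributes nothing to the weighted representation.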

### Now we can compute the bag-of-words representation for vector space retrieval

In [None]:
def bag_of_words_idf(tokens, vocabulary):
    # weight each term frequency by the term's idf; returns a list of (term, weight) pairs
    terms = Counter(tokens)
    return [(term, tf * vocabulary[term]['idf']) for term, tf in terms.items()]

n = 10
print_table(
    [
        [
            f"{i+1}",
            ", ".join([f'{x[0]}:{x[1]:.2f}' for x in sorted(bag_of_words_idf(collection[i], vocabulary))])
        ] for i in range(n)
    ],
    ["page", "bag of words (vector space retrieval)"]
)
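
The notebook stops at the weighted representation. One common way to use such vectors in vector space retrieval is to rank documents by cosine similarity to a query. The sketch below assumes that approach; the function names, idf values, and documents are all made up for illustration. Unlike the function above, `weighted_bag` skips terms that have no idf entry instead of raising a `KeyError`.

```python
import math
from collections import Counter

def weighted_bag(tokens, idf_by_term):
    # term frequency times idf; terms without an idf entry are skipped (assumption)
    return {t: tf * idf_by_term[t] for t, tf in Counter(tokens).items() if t in idf_by_term}

def cosine(a: dict, b: dict) -> float:
    # cosine similarity between two sparse vectors stored as dicts
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical idf values and mini-documents
idf_by_term = {"vector": 0.125, "retriev": 0.125, "model": 0.301, "audio": 0.301}
docs = {
    "page 1": ["vector", "retriev", "model"],
    "page 2": ["audio", "retriev"],
}
query = ["audio", "retriev"]

q = weighted_bag(query, idf_by_term)
ranking = sorted(docs, key=lambda d: cosine(q, weighted_bag(docs[d], idf_by_term)), reverse=True)
print(ranking)
# ['page 2', 'page 1']
```

Storing the vectors as dictionaries keeps them sparse: only terms that actually occur in a document take up space, and the dot product only touches terms the two vectors share.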