### Loading our data

In [None]:
import csv
import pandas as pd
from typing import List, Set, Tuple

# English data
classes_en = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}
train_en = pd.read_csv("https://raw.githubusercontent.com/michabirklbauer/hgb_dse_text_mining/master/data/AGNews/train.csv", 
                       names = ["Label", "Title", "Article"],
                       encoding = "utf-8")
test_en = pd.read_csv("https://raw.githubusercontent.com/michabirklbauer/hgb_dse_text_mining/master/data/AGNews/test.csv", 
                      names = ["Label", "Title", "Article"],
                      encoding = "utf-8")

# German data
train_de = pd.read_csv("https://raw.githubusercontent.com/michabirklbauer/hgb_dse_text_mining/master/data/10kGNAD/train.csv", 
                       sep = ";", names = ["Label", "Article"], 
                       quotechar = "\'", quoting = csv.QUOTE_MINIMAL, encoding = "utf-8")
test_de = pd.read_csv("https://raw.githubusercontent.com/michabirklbauer/hgb_dse_text_mining/master/data/10kGNAD/test.csv", 
                      sep = ";", names = ["Label", "Article"], 
                      quotechar = "\'", quoting = csv.QUOTE_MINIMAL, encoding = "utf-8")
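
The `sep` and `quotechar` arguments matter for the German files: fields are separated by semicolons, and a single quote wraps any article that itself contains a semicolon. A minimal sketch of that behaviour with the standard library's `csv` module, assuming a made-up two-row sample (the `sample` string is hypothetical, not taken from the dataset):

```python
import csv
import io

# Hypothetical sample in the same "label;article" layout as the German files.
sample = "Sport;'Ein Satz; mit Semikolon.'\nWeb;Ein Satz ohne Semikolon.\n"

# Same delimiter, quote character and quoting mode as in the read_csv calls.
reader = csv.reader(io.StringIO(sample), delimiter=";", quotechar="'",
                    quoting=csv.QUOTE_MINIMAL)
rows = list(reader)

for label, article in rows:
    print(label, "->", article)
```

The quoted semicolon in the first row stays inside the article text instead of starting a third column.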

From the DataFrame columns we can construct plain ("vanilla") Python lists of labels and documents to work on:

In [None]:
# convert the columns directly instead of looping over rows with iterrows()
labels_en = train_en["Label"].astype(int).map(classes_en).tolist()
articles_en = train_en["Article"].tolist()
labels_de = train_de["Label"].tolist()
articles_de = train_de["Article"].tolist()

In [None]:
articles_en[:5]

# **NLTK**

[https://www.nltk.org/](https://www.nltk.org/)

NLTK (short for Natural Language Toolkit) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.

I mostly use NLTK for preprocessing tasks because, in my opinion, it is more lightweight and straightforward than spaCy.

In [None]:
import nltk
from nltk.corpus import stopwords as nltkStopwords
from nltk.stem.snowball import SnowballStemmer

### NLTK tokenizes documents, which can be any string

In [None]:
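nltk.download("punkt")  # tokenizer models; newer NLTK releases may ask for "punkt_tab" instead

articles_en_tokenized = [nltk.word_tokenize(doc) for doc in articles_en]
# word_tokenize defaults to English, so pass the language for the German articles
articles_de_tokenized = [nltk.word_tokenize(doc, language = "german") for doc in articles_de]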

In [None]:
articles_en_tokenized[0]

### Stemming can be done with NLTK's Snowball Stemmer

[https://www.nltk.org/api/nltk.stem.snowball.html](https://www.nltk.org/api/nltk.stem.snowball.html)

In [None]:
def stem(tokenized_document: List[str], language: str = "english") -> List[str]:
    # SnowballStemmer needs a concrete language name such as "english" or "german"
    stemmer = SnowballStemmer(language, ignore_stopwords = False)
    return [stemmer.stem(word) for word in tokenized_document]

articles_en_stemmed = [stem(doc, "english") for doc in articles_en_tokenized]
articles_de_stemmed = [stem(doc, "german") for doc in articles_de_tokenized]

In [None]:
articles_en_stemmed[0]

### NLTK also offers built-in stopword sets for different languages

In [None]:
nltk.download("stopwords")
stopwords_en = set(nltkStopwords.words("english"))
stopwords_de = set(nltkStopwords.words("german"))

### The English stopwords are:

In [None]:
",".join(stopwords_en)

### And the German ones are:

In [None]:
",".join(stopwords_de)

### Removing stopwords from our stemmed documents

In [None]:
def remove_stopwords(stemmed_document: List[str], stopwords: Set[str]) -> List[str]:
    # keep only the tokens that are not in the stopword set
    return [word for word in stemmed_document if word not in stopwords]

articles_en_final = [remove_stopwords(doc, stopwords_en) for doc in articles_en_stemmed]
articles_de_final = [remove_stopwords(doc, stopwords_de) for doc in articles_de_stemmed]

In [None]:
articles_en_final[0]
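
Because `nltk.word_tokenize` emits punctuation marks as separate tokens and the stopword sets contain only words, tokens such as `,` or `(` survive the filter above. If that is unwanted, one possible extra step is sketched below; `drop_punctuation` is a hypothetical helper that is not part of the original pipeline, and the token list is made up for illustration:

```python
from typing import List

def drop_punctuation(tokens: List[str]) -> List[str]:
    """Keep only tokens that contain at least one letter or digit."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

# Hypothetical stemmed, stopword-filtered tokens.
tokens = ["wall", "st.", "bear", "claw", "back", ",", "(", "reuter", ")", "--"]
cleaned = drop_punctuation(tokens)
print(cleaned)
```

Tokens that mix letters and punctuation, such as `st.`, are kept; only pure punctuation is dropped.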

# **Gensim**

[https://radimrehurek.com/gensim/](https://radimrehurek.com/gensim/)

Gensim bills itself as "Topic Modelling for Humans" and is the third and final NLP library that we will look at. I have mainly used Gensim to build TF-IDF models and run text queries on datasets. We will use our NLTK-preprocessed documents as input to build a dictionary, a corpus and an index with Gensim, and compute the TF-IDF matrix so that we can run text queries on our data.

In [None]:
from gensim import corpora
from gensim import models
from gensim import similarities

### Building the TF-IDF model

In [None]:
size = 500 # number of documents to index; lower this if the model gets too big
corpus_dictionary_en = corpora.Dictionary(articles_en_final[:size])
corpus_en = [corpus_dictionary_en.doc2bow(document) for document in articles_en_final[:size]]
model_en = models.TfidfModel(corpus_en)
index_en = similarities.MatrixSimilarity(model_en[corpus_en])
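
To make the TF-IDF step less of a black box, here is a minimal pure-Python sketch of the idea on a toy corpus. It assumes the weighting that I understand to be Gensim's default (raw term frequency times `log2(N / df)`, followed by L2 normalization); the documents and the `tfidf` helper are made up for illustration, and Gensim's actual implementation differs in details such as sparse vectors and integer token ids.

```python
import math
from collections import Counter
from typing import Dict, List

# Toy corpus of already preprocessed documents (hypothetical data).
docs = [
    ["stock", "market", "fall"],
    ["market", "rise", "market"],
    ["team", "win", "game"],
]

n_docs = len(docs)
# Document frequency: in how many documents does each token occur?
df = Counter(tok for doc in docs for tok in set(doc))

def tfidf(doc: List[str]) -> Dict[str, float]:
    """Weight each known token by tf * log2(N / df), then L2-normalize."""
    tf = Counter(doc)
    weights = {tok: count * math.log2(n_docs / df[tok])
               for tok, count in tf.items() if tok in df}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()} if norm else {}

vectors = [tfidf(doc) for doc in docs]
for doc, vec in zip(docs, vectors):
    print(doc, {tok: round(w, 3) for tok, w in vec.items()})
```

Tokens that appear in fewer documents receive a higher weight: in the first toy document, `stock` outweighs `market` because `market` also occurs in the second one.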

To calculate the similarity of a query to our documents, the query has to be preprocessed the same way our data was:

In [None]:
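def query_en(query_string: str) -> List[Tuple[int, float]]:
    # preprocess the query exactly like the corpus: tokenize, stem, remove stopwords
    tokens = remove_stopwords(stem(nltk.word_tokenize(query_string), language = "english"), stopwords_en)
    q = corpus_dictionary_en.doc2bow(tokens)
    # similarity of the TF-IDF weighted query to every indexed document
    result = index_en[model_en[q]]
    # sort (document index, similarity) pairs by descending similarity
    result = sorted(enumerate(result), key = lambda item: -item[1])
    # print the three best matches together with their article text
    for doc_id, score in result[:3]:
        print((doc_id, score), articles_en[doc_id])
    return result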

### Gensim returns each document's index and its similarity to the query

In [None]:
query_en("Scientists United States");
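
The scores returned by `MatrixSimilarity` are cosine similarities between the query vector and each document vector. A minimal sketch of that ranking step in pure Python, assuming made-up sparse vectors stored as dictionaries (the `cosine` helper and all weights are hypothetical, chosen only to illustrate the computation):

```python
import math
from typing import Dict

def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors stored as token -> weight dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical TF-IDF vectors for three documents and one query.
doc_vectors = [
    {"scientist": 0.8, "unit": 0.6},
    {"game": 1.0},
    {"unit": 0.6, "state": 0.8},
]
query_vector = {"scientist": 1.0, "unit": 0.5}

# (document index, similarity) pairs, best match first.
ranked = sorted(enumerate(cosine(query_vector, d) for d in doc_vectors),
                key=lambda item: item[1], reverse=True)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

A document that shares no token with the query ends up with a similarity of zero, which is why unrelated articles sink to the bottom of the result list.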