# Topic Identification & Classification

The basic idea of topic identification can be implemented with a bag-of-words. Simply by counting or visualizing which words occur most often, we can get a sense of the main idea of a text.

Comparison can be done in a similar way. If you remember our application of word clouds, the texts by Adam Smith and David Ricardo produced quite similar word clouds. Since real-life work involves much more complex sources, it is essential to learn more advanced methods.
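To make "similar word clouds" measurable, one option is to compare two bag-of-words counts with cosine similarity. The sketch below is only an illustration under stated assumptions: the function name `cosine_similarity` and the three toy documents are made up for this example and are not part of the lecture material.

```python
import math
from collections import Counter

def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two bag-of-words Counters (0 = no overlap, 1 = same direction)."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy stand-ins for real texts (hypothetical data, not taken from the books).
doc_a = Counter("labour price rent labour wages".split())
doc_b = Counter("price rent profit wages labour".split())
doc_c = Counter("monster night creature ice".split())

sim_ab = cosine_similarity(doc_a, doc_b)  # two texts with a shared vocabulary
sim_ac = cosine_similarity(doc_a, doc_c)  # no shared vocabulary at all
print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

The same function could be applied to the `Counter` objects built from the real books, although raw counts are dominated by very frequent words, which is one reason the more advanced methods are worth learning.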

In [None]:
import nltk
nltk.download('punkt') # Tokenizer data, so NLTK knows where sentences and words begin and end.
nltk.download('stopwords') # Stopword lists, including English stopwords like 'the' and 'and'.
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import pandas as pd
import requests
import numpy as np
from collections import Counter

In [None]:
# Let's remember the concept of bag-of-words, using Adam Smith and Ricardo.
wealth_of_nations = requests.get("https://raw.githubusercontent.com/timuroeztuerk/data-science-lecture-S24/main/Datasets/The_Wealth_of_Nations_Volume_1_Cleaned.txt").text
ricardo = requests.get('https://raw.githubusercontent.com/timuroeztuerk/data-science-lecture-S24/main/Datasets/ricardo.txt').text

In [None]:
# Adam Smith.
# Tokenize the text.
unique_tokens = word_tokenize(wealth_of_nations)

# Lowercase all tokens to avoid confusion (e.g. 'The' vs. 'the').
lower_tokens = [token.lower() for token in unique_tokens]
# Keep purely alphabetic tokens (drops punctuation and numbers).
alpha_only = [t for t in lower_tokens if t.isalpha()]
# Remove English stopwords plus a few frequent words of our own choosing.
custom_exclusions = ['upon', 'may', 'therefore', 'one', 'much', 'must', '0', 'whole','would']
all_exclusions = stopwords.words('english') + custom_exclusions
filtered_words = [word for word in alpha_only if word not in all_exclusions]

# Call the counter on our filtered words.
adam_smith_counts = Counter(filtered_words)

# What are the most common tokens?
adam_smith_counts.most_common(5)

In [None]:
# Ricardo.
# Tokenize the text.
unique_tokens_r = word_tokenize(ricardo)

# Lowercase all tokens to avoid confusion (e.g. 'The' vs. 'the').
lower_tokens = [token.lower() for token in unique_tokens_r]
# Keep purely alphabetic tokens (drops punctuation and numbers).
alpha_only = [t for t in lower_tokens if t.isalpha()]
# Remove English stopwords plus a few frequent words of our own choosing.
custom_exclusions = ['upon', 'may', 'therefore', 'one', 'much', 'must', '0', 'whole','would']
all_exclusions = stopwords.words('english') + custom_exclusions
filtered_words = [word for word in alpha_only if word not in all_exclusions]

# Call the counter on our filtered words.
ricardo_counts = Counter(filtered_words)

# What are the most common tokens?
ricardo_counts.most_common(5)

In [None]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

# Initialize lemmatizer and stemmer
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Function to preprocess text
def preprocess_text(document):
    # Tokenize text
    tokens = word_tokenize(document.lower())  # Convert to lower case and tokenize
    # Keep alphanumeric tokens that are not in all_exclusions (defined above), then lemmatize.
    # Stemming is not applied here; you could add stemmer.stem() if you want it.
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word.isalnum() and word not in all_exclusions]
    return tokens

In [None]:
# Mini Assignment 0: Preprocess the Smith and Ricardo books with the function above, look at the differences.

a = preprocess_text(wealth_of_nations)
a1 = Counter(a)
a1.most_common(10)

In [None]:
Counter(preprocess_text(ricardo)).most_common(10)

In [None]:
# Variant of preprocess_text that also stems each token.
def preprocess_text2(document):
    # Tokenize text
    tokens = word_tokenize(document.lower())  # Convert to lower case and tokenize
    # Keep alphanumeric tokens that are not in all_exclusions, then stem first and lemmatize the stem.
    tokens = [lemmatizer.lemmatize(stemmer.stem(word)) for word in tokens if word.isalnum() and word not in all_exclusions]
    return tokens

In [None]:
print(Counter(preprocess_text(wealth_of_nations)).most_common(10))
print(Counter(preprocess_text(ricardo)).most_common(10))

In [None]:
print(Counter(preprocess_text2(wealth_of_nations)).most_common(10))
print(Counter(preprocess_text2(ricardo)).most_common(10))

# Introduction to Gensim - https://pypi.org/project/gensim/

For the core concepts of gensim - https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html#sphx-glr-auto-examples-core-run-core-concepts-py

Gensim is an open-source Python library. It's particularly well-suited for applications like topic modeling, document similarity analysis, and building word embeddings.
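Gensim models work on documents that have been turned into lists of `(token id, count)` pairs. The following pure-Python sketch imitates that idea on a toy example so the representation is easier to picture. It is not Gensim's implementation: the helper names `build_dictionary` and `to_bow`, the toy texts, and the order in which ids are assigned are all assumptions made for this illustration.

```python
from collections import Counter

def build_dictionary(texts):
    """Give every distinct token an integer id, in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for token in tokens:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from token2id are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical, already preprocessed documents.
toy_texts = [["labour", "price", "labour"], ["price", "rent"]]
toy_dictionary = build_dictionary(toy_texts)
toy_corpus = [to_bow(tokens, toy_dictionary) for tokens in toy_texts]

print(toy_dictionary)
print(toy_corpus)

# A new document: words the dictionary has never seen simply disappear.
unseen = to_bow(["monster", "price"], toy_dictionary)
print(unseen)
```

The last step matters when a model is applied to a text from outside the corpus: only the words already known to the dictionary contribute to the result.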



In [None]:
from gensim import corpora, models

# Let's define our data sources.
data = [wealth_of_nations, ricardo]

# We want to preprocess our data as specified by our function above.
texts = [preprocess_text(doc) for doc in data]

In [None]:
# We create a dictionary (token -> integer id) and a corpus, so that the text is represented numerically.
dictionary = corpora.Dictionary(texts)

# A basic corpus can be created with bag-of-words, like our approach during the last weeks.
corpus = [dictionary.doc2bow(text) for text in texts]

# Here is the crux of the whole thing. https://radimrehurek.com/gensim/models/ldamodel.html
lda_model = models.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=25)

# Print the topics of the model.
topics = lda_model.print_topics(num_words=4)
for topic in topics:
    print(topic)

# What about the topics in Ricardo?
lda_model.get_document_topics(corpus[1])

In [None]:
# Mini-Assignment: What other models could we use? Find out and apply two different models here. Let's see what you come up with.





############# Your code here.

lsimodel = models.LsiModel(corpus, num_topics=3, id2word=dictionary)
lsimodel.print_topics(3,4)

In [None]:
models.LsiModel(corpus, num_topics=3, id2word=dictionary).show_topics(2,4)

In [None]:
# Assume that we have a big corpus on our hands. And now we would like to find out how another document is related to the topics in this corpus.
# Let's learn by applying it.

# Fast forward to the 20th century.
keynes = requests.get('https://raw.githubusercontent.com/timuroeztuerk/data-science-lecture-S24/main/Datasets/keynes.txt').text

# Preprocess.
general_theory = preprocess_text(keynes)

# We convert the text from Keynes to BOW.
general_theory_corpus = dictionary.doc2bow(general_theory)

# And we ask for topics, easy as that!
general_theory_topics = lda_model.get_document_topics(general_theory_corpus)
print(general_theory_topics)

In [None]:
# The raw output is a bit hard to read, so here is a helper that adds the top words of each topic.
def get_document_topics(lda_model, corpus):
    # minimum_probability=0 keeps every topic, even very unlikely ones.
    topics = lda_model.get_document_topics(corpus, minimum_probability=0)
    # For each topic: (topic id, its ten most important words, probability for this document).
    topic_details = [(topic, ' '.join(word for word, _ in lda_model.show_topic(topic, topn=10)), prob) for topic, prob in topics]
    return topic_details

general_theory_topics = get_document_topics(lda_model, general_theory_corpus)

# Print the topic distribution with words for the new document
for topic_num, words, prob in general_theory_topics:
    print(f"Topic {topic_num}: {words} - Probability: {prob:.4f}")

In [None]:
# What if we include something mostly unrelated to the corpus?
frankenstein = requests.get('https://www.gutenberg.org/cache/epub/84/pg84.txt').text

frankenstein_text = preprocess_text(frankenstein)
frankenstein_corpus = dictionary.doc2bow(frankenstein_text)
frankenstein_topics = get_document_topics(lda_model, frankenstein_corpus)

for topic_num, words, prob in frankenstein_topics:
    print(f"Topic {topic_num}: {words} - Probability: {prob:.4f}")

**Tf-idf** stands for term frequency-inverse document frequency. It is a weighting scheme that helps you determine the most important words in each document of a corpus. The idea behind tf-idf is that the documents in a corpus may share many words beyond the usual stopwords (think about economics books). These common words behave like stopwords and should be removed or at least down-weighted in importance.

https://en.wikipedia.org/wiki/Tf%E2%80%93idf
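The down-weighting can be seen on a small worked example. The sketch below uses one common variant of tf-idf (raw term count times log base 2 of the number of documents divided by the number of documents containing the term, followed by length normalisation); libraries differ in the details, so treat it as an illustration and not as the exact formula of any particular package. The function name `tfidf_weights` and the toy documents are assumptions for this example.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {token: weight} dict per document, normalised to unit length."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    weighted = []
    for tokens in docs:
        term_freq = Counter(tokens)
        # tf * idf, with idf = log2(N / df); a token found in every document gets idf = 0.
        raw = {t: tf * math.log2(n_docs / doc_freq[t]) for t, tf in term_freq.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return weighted

# Hypothetical, already preprocessed documents.
toy_docs = [
    ["labour", "price", "labour", "pin"],
    ["price", "rent", "corn"],
]
toy_weights = tfidf_weights(toy_docs)

for weights in toy_weights:
    print({t: round(w, 3) for t, w in weights.items()})
```

Here "price" occurs in both documents and ends up with weight zero, while "labour", which is frequent in only one document, receives the highest weight there. With a corpus of just two books, every word shared by both is treated this way.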

In [None]:
from gensim.models import TfidfModel

# Initiating the model.
tfidf = TfidfModel(corpus)

In [None]:
# Let's see which weights Ricardo's text gets (corpus[1]; corpus[0] is Adam Smith).
tfidf_weights = tfidf[corpus[1]]

# Print the first five weights
print(tfidf_weights[:5])

In [None]:
# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)

In [None]:
# Language Detection
# We take our examples from the movie review dataset.
movie_reviews = pd.read_csv('https://raw.githubusercontent.com/timuroeztuerk/data-science-lecture-S24/main/Datasets/imdb_review_dataset.csv')
movie_reviews.head()

In [None]:
# Let's see an example.
movie_reviews.review[12312]

In [None]:
# We install this package into our environment, since it is not preinstalled in Google Colab.
!pip install langdetect
# Classic import.
from langdetect import detect_langs

In [None]:
# Detecting the language is (as you can imagine) very easy. We simply call detect_langs on a string.
print(detect_langs(movie_reviews.review[12123]))

In [None]:
# Application Session 2.1: Add a new column to our dataset, that specifies which language(s) the review is in.

# 2.2. Now show some descriptive statistics. What are the minimum and maximum probabilities for English (a)? Are there any multilingual reviews (b)?

# 2.3. Lastly, using the TextBlob package, add a sentiment column with four sentiments: very positive, positive, negative, very negative. For this you have to do a little research.


#################################### Your code here...
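As a starting point for task 2.3, here is one possible way to turn a numeric polarity score into the four labels. It assumes a polarity between -1 and 1, which is the range TextBlob reports for `sentiment.polarity`. The function name `label_sentiment`, the cut-off of 0.5, and the choice to count a polarity of exactly 0 as "positive" are arbitrary assumptions that you should adapt after your own research.

```python
def label_sentiment(polarity, strong=0.5):
    """Map a polarity score in [-1, 1] to one of four sentiment labels.

    `strong` is a hypothetical threshold separating 'very' from plain sentiment.
    """
    if polarity >= strong:
        return "very positive"
    if polarity >= 0:
        return "positive"
    if polarity > -strong:
        return "negative"
    return "very negative"

# A few hand-picked scores to show the four buckets.
example_scores = [0.8, 0.1, -0.1, -0.9]
example_labels = [label_sentiment(score) for score in example_scores]
print(list(zip(example_scores, example_labels)))
```

Applied to a DataFrame, such a function could be passed to `.apply` on a column of polarity scores once those scores have been computed.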

