# Unsupervised Text Mining - Class Exercise (TAU Text Mining for Business, 24'/25')

## Dataset

In [None]:
import io
import requests
import pandas as pd

url = "https://raw.githubusercontent.com/susanli2016/PyCon-Canada-2019-NLP-Tutorial/master/bbc-text.csv"
s = requests.get(url).content
df = pd.read_csv(io.StringIO(s.decode('utf-8')))
# df = pd.read_csv('bbc-text.csv')  # use this line instead if you've downloaded the dataset directly

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df['category'].unique()

## 0 - Preprocessing

### Q1: Perform basic preprocessing

Write a function that receives a dataframe with the schema `["category", "text"]`, tokenizes and stems the texts (optionally also removing stopwords), and returns a single joined list of all tokens (not unique!) appearing in the input corpus.
That is, if each tokenized text is a list of tokens, you are required to concatenate these lists into one.

You can use the following code for that:
```python
import itertools
all_words = list(itertools.chain.from_iterable(sentences))
```
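
To see what the flattening step does, here is a minimal sketch on a toy input. The `toy_texts` list and the regex tokenizer are illustrative assumptions, not part of the exercise; stemming and stopword removal are deliberately left out.

```python
import itertools
import re

# Hypothetical toy corpus standing in for df["text"]
toy_texts = ["The match was great", "Great goal, great match"]

# Naive tokenizer (assumption): lowercase, keep runs of letters only
sentences = [re.findall(r"[a-z]+", text.lower()) for text in toy_texts]

# Flatten the per-text token lists into one list; duplicates are kept
all_words = list(itertools.chain.from_iterable(sentences))
print(all_words)
```

Note that repeated tokens such as `great` survive the flattening, which is what the frequency questions below rely on.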

In [None]:
from typing import List  # needed for the return type annotation


def extract_token_list(df: pd.DataFrame) -> List[str]:
    # fix me!
    return []

### Q2: How many tokens is the corpus composed of?

## 1 - Words Frequencies

### Q3: Use nltk's FreqDist to display the top 20 and top 40 frequent words in the corpus
https://www.nltk.org/api/nltk.probability.FreqDist.html
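
As a rough illustration of the API, the sketch below builds a `FreqDist` from a hypothetical toy token list (`toy_tokens` is an assumption) and asks for its most frequent entries. On the real corpus you would pass the output of `extract_token_list` instead.

```python
from nltk.probability import FreqDist

# Hypothetical toy tokens standing in for the output of extract_token_list
toy_tokens = ["goal", "match", "goal", "team", "goal", "match"]

# FreqDist counts how often each token occurs
fd = FreqDist(toy_tokens)

# most_common(n) returns (token, count) pairs, most frequent first
top2 = fd.most_common(2)
print(top2)
```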

### Q4: Display top 20 frequent words for each category separately
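
One possible shape for the per-category loop is sketched below on a hypothetical toy dataframe. `collections.Counter` stands in for `FreqDist` here (in current nltk versions `FreqDist` subclasses it), and whitespace splitting stands in for your own preprocessing; both are assumptions.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy dataframe with the same schema as the exercise data
toy_df = pd.DataFrame({
    "category": ["sport", "politics", "sport"],
    "text": ["goal match goal", "vote party vote", "match team"],
})

top_per_category = {}
for category, group in toy_df.groupby("category"):
    # Whitespace split is a placeholder for the real tokenization
    tokens = " ".join(group["text"]).split()
    top_per_category[category] = Counter(tokens).most_common(2)

print(top_per_category)
```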

### Q5: Word Frequencies Bar Plots

Given an `nltk.FreqDist` object that has already been initialized with a corpus (and is therefore populated with word frequencies), write a function that receives such an object and an integer `n`, and draws a bar plot of the top `n` most frequent words (on the X axis) against their frequencies (on the Y axis).

**HINT: ChatGPT exists, you know...**
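
A minimal sketch of such a function follows. The name `plot_top_words` is hypothetical, and the function only assumes its input has a `most_common` method, so a plain `collections.Counter` is used in the demo call instead of a `FreqDist`.

```python
from collections import Counter

import matplotlib.pyplot as plt


def plot_top_words(freq, n):
    """Draw a bar plot of the n most frequent words in a Counter-like object."""
    # most_common(n) yields (word, count) pairs; split them into two tuples
    words, counts = zip(*freq.most_common(n))
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.bar(words, counts)
    ax.set_xlabel("word")
    ax.set_ylabel("frequency")
    ax.tick_params(axis="x", rotation=45)  # keep long words readable
    return ax


# Demo on hypothetical toy counts
ax = plot_top_words(Counter(["goal", "match", "goal", "team", "goal", "match"]), 2)
```

Returning the `Axes` object rather than calling `plt.show()` is a design choice that lets the caller add a title or save the figure afterwards.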

## 2 - Word Clouds

In [None]:
!pip install wordcloud
!pip install --upgrade Pillow

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

### Q6: Plotting word clouds

Write a function that, given a list of words, draws a word cloud using the `wordcloud` Python package.
Use it to print a word cloud for each category.

## 3 - Clustering

### Q7: Adapt our preprocessing

Adapt `extract_token_list` into `extract_sentence_list`, which returns a list of strings instead, where each sentence is a single string (after stemming, tokenization, stopword removal, etc.).

### Q8: Focus our problem on Sport vs Politics

Assign the `bidf` variable with a sub-dataframe containing articles from only the `"sport"` and `"politics"` categories.

Then, process it using `extract_sentence_list` and save the result in `bi_sent_list`.
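
The filtering step can be sketched with a boolean mask, shown here on a hypothetical toy dataframe (`toy_df` and `toy_bidf` are illustrative names, not the exercise variables).

```python
import pandas as pd

# Hypothetical toy dataframe with the same schema as the exercise data
toy_df = pd.DataFrame({
    "category": ["sport", "tech", "politics", "business"],
    "text": ["goal match", "new phone", "vote party", "market shares"],
})

# Keep only rows whose category is one of the two of interest
toy_bidf = toy_df[toy_df["category"].isin(["sport", "politics"])]
print(toy_bidf)
```

The filtered dataframe keeps the original row labels (here 0 and 2). If positional alignment with other arrays matters later, `reset_index(drop=True)` is one option.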

### Q9: Get a BoW representation of our binary corpus

You may use a naive BoW representation, or TF-IDF.

**Hint: Look at the bottom of the following page:**

https://scikit-learn.org/1.5/api/sklearn.feature_extraction.html#module-sklearn.feature_extraction.text

### Q10: KMeans Clustering

Cluster the two-category corpus into 2 clusters using KMeans with k-means++ initialization and Euclidean distance.

**Hint: Look at the following class, and consider carefully the initialization parameters.**
https://scikit-learn.org/1.5/modules/generated/sklearn.cluster.KMeans.html
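
A minimal sketch of the call on toy data is given below. The sentences are hypothetical, and the values chosen for `n_init` and `random_state` are assumptions made for reproducibility rather than recommended settings.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: three sport-like and three politics-like sentences
toy_sents = [
    "goal match team win",
    "team goal striker match",
    "match win team coach",
    "election vote party minister",
    "party vote parliament election",
    "minister parliament vote party",
]
X = TfidfVectorizer().fit_transform(toy_sents)

# init="k-means++" spreads the initial centroids; n_init reruns and keeps the best fit
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels)
```

The cluster ids themselves (0 or 1) are arbitrary, so only the grouping is meaningful, not which group receives which id.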

### Q11: Compare clusters to categories

Find a way to check if the clusters match the categories closely or not.
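
Two common options are a contingency table and a label-permutation-invariant score such as the adjusted Rand index. The sketch below applies both to hypothetical toy labels; this is one possible approach, not necessarily the intended one.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical toy labels: true categories and cluster ids for five documents
true_categories = ["sport", "sport", "politics", "politics", "sport"]
cluster_labels = [1, 1, 0, 0, 0]

# Rows are true categories, columns are cluster ids, cells are document counts
table = pd.crosstab(
    pd.Series(true_categories, name="category"),
    pd.Series(cluster_labels, name="cluster"),
)
print(table)

# 1.0 means a perfect match up to renaming of clusters; values near 0 mean chance level
ari = adjusted_rand_score(true_categories, cluster_labels)
print(ari)
```

The table shows where the disagreements are, while the single score makes it easy to compare two clustering runs.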

### Q12: KMeans++ w/ Cosine Similarity

Cluster using KMeans++ again, this time using cosine similarity instead of Euclidean distance.

**Hint: Look at the `KMeans.fit` method, and the following sklearn function:**
https://scikit-learn.org/dev/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
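
scikit-learn's `KMeans` has no metric parameter, so some workaround is needed. One commonly used alternative to the route hinted at above (an assumption, not necessarily the intended solution) is to L2-normalize each document vector first: for unit vectors, squared Euclidean distance equals `2 - 2 * cosine_similarity`, so Euclidean KMeans on normalized rows approximates clustering by cosine similarity. The sketch checks that identity on hypothetical toy data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.preprocessing import normalize

# Hypothetical toy corpus
toy_sents = [
    "goal match team win",
    "team goal striker match goal goal",
    "election vote party minister",
    "party vote parliament election vote",
]
counts = CountVectorizer().fit_transform(toy_sents)

# Scale every row to unit L2 norm
X_unit = normalize(counts, norm="l2")

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
sq_dist = euclidean_distances(X_unit, squared=True)
cos = cosine_similarity(counts)
identity_holds = np.allclose(sq_dist, 2 - 2 * cos)
print(identity_holds)

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X_unit)
print(labels)
```

Two caveats: the centroids are averages and are not themselves unit length, so this is only an approximation of spherical k-means; and `TfidfVectorizer` already L2-normalizes rows by default, so the extra step mostly matters for raw counts.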

### Q13: Is it better?

Use the evaluation method you created in Q11 to compare the coherence of the clustering obtained with cosine similarity against that obtained with Euclidean distance.

## 4 - LDA (Bonus)

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
## Run LDA
# BOW is assumed to be the bag-of-words matrix you built in Q9
num_topics = 5
lda = LatentDirichletAllocation(n_components=num_topics, random_state=1234)
lda.fit(BOW)

In [None]:
# Assign training documents to topics
doc_topic = lda.transform(BOW)

# Get the topic with the highest probability for each document
topic_assignments = doc_topic.argmax(axis=1)
# topic_assignments

In [None]:
bidf.head()

In [None]:
# bi_sent_list already holds one string per document (see Q7), so it is used as-is
topic_df = pd.DataFrame({"real_category": bidf.category, "topic": topic_assignments, "text": bi_sent_list})
topic_df.groupby(['topic'])['topic'].count().plot(kind='bar', figsize=(10, 5))

In [None]:
topic_df.head()

In [None]:
# One pie per topic, showing how the real categories are split within it
topic_df.groupby(['real_category', 'topic'])['topic'].count().unstack(fill_value=0).plot.pie(subplots=True, figsize=(10, 5), legend=False)

In [None]:
## analyze output
n_top_words = 10
feature_names = vectorizer.get_feature_names_out()

for topic_idx, topic in enumerate(lda.components_):
    print(f"Topic_{topic_idx} = ", end="")
    top_features_ind = topic.argsort()[:-n_top_words - 1:-1]  # Indices of the top words in the topic
    top_features = [(feature_names[i], topic[i] / topic.sum()) for i in top_features_ind]
    top_words = [f"{weight:.3f}*{term}" for term, weight in top_features]
    print(" + ".join(top_words))

In [None]:
import numpy as np
# Build the column names from num_topics so they stay in sync with the LDA model
fuzzy_df = pd.DataFrame(np.round(doc_topic, 2), columns=[f"Topic {i}" for i in range(num_topics)])
fuzzy_df

In [None]:
fuzzy_df.corr()