# Fudan PRML Fall 2024 Exercise 4: Unsupervised Learning

![news](./news.png)

**Your name and Student ID:**

In this assignment, you will build a **text classification** system. Text classification is a fundamental task in Natural Language Processing (NLP). More precisely, you are given a news classification task: assign each news text to the category it belongs to. Unlike traditional classification tasks, **we do not provide any labels for this assignment, and you need to find a way to construct labels for these articles**.

For this assignment, you can use common deep learning frameworks such as PyTorch. **You can use pretrained word vectors such as GloVe, but not large pretrained models such as BERT.**

# 1. Setup

In [None]:
# setup code
%load_ext autoreload
%autoreload 2
%env CUDA_VISIBLE_DEVICES = 1
import os
import pickle
import numpy as np
from sklearn.cluster import KMeans

In [None]:
dataset_path = 'kmeans_news.pkl'

all_data = None
with open(dataset_path,'rb') as fin:
    all_data = pickle.load(fin)
    all_data_np = np.array(all_data)

print('\n'.join(all_data[0:5]))
print('Total number of news: {}'.format(len(all_data)))

# 2. Exploratory Data Analysis

Not all entries in the dataset are suitable for clustering. You may need to filter and preprocess some of them first.
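As one possible starting point, the sketch below drops entries that are not strings, titles that are blank or very short, and exact duplicates. The helper name `filter_titles` and the `min_chars` threshold are illustrative assumptions, not part of the assignment; inspect the data before settling on any rule.

```python
def filter_titles(titles, min_chars=5):
    """Drop non-string entries, blank or very short titles, and exact duplicates.

    `min_chars` is an arbitrary illustrative threshold, not a recommended value.
    """
    seen = set()
    kept = []
    for title in titles:
        if not isinstance(title, str):
            continue
        cleaned = " ".join(title.split())  # trim and collapse whitespace
        if len(cleaned) < min_chars or cleaned in seen:
            continue
        seen.add(cleaned)
        kept.append(cleaned)
    return kept


# Toy input for illustration only; these are not entries from the dataset.
sample = ["Stocks rally as rates fall", "  ", "Stocks rally as rates fall", "ok", None]
filtered = filter_titles(sample)
print(filtered)
```

If you filter `all_data` this way, rebuild `all_data_np` from the filtered list so that its indices stay aligned with the cluster labels computed later.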

# 3. Get embeddings for the news

We need to convert the news titles into some kind of numerical representation (embedding) before we can do clustering on them. Below are two ways to get embeddings for a paragraph of text:

1. **Pretrained word embeddings**: You can use pretrained word embeddings such as GloVe to get an embedding for each word in the news, and then average them (or try more advanced techniques) to get the news embedding.

2. **General text embedding models**: You can use a general text embedding model to get an embedding for a sentence directly.

You can choose either of them to convert the news titles into embeddings.
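The first option can be sketched as follows. This is a minimal illustration, assuming the word vectors have already been loaded into a dictionary that maps each word to a NumPy array; the helper name `average_embedding`, the lowercased whitespace tokenization, and the toy two-dimensional vectors are all assumptions made for the example.

```python
import numpy as np


def average_embedding(text, word_vectors, dim):
    """Average the vectors of all known words in `text`.

    Returns a zero vector when no word in `text` has a vector.
    """
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)


# Toy vectors for illustration only; real pretrained vectors have far more dimensions.
toy_vectors = {
    "stocks": np.array([1.0, 0.0]),
    "rally": np.array([0.0, 1.0]),
}
emb = average_embedding("Stocks rally today", toy_vectors, dim=2)
print(emb)
```

Stacking one such vector per title, for example with `np.stack`, would give a 2-D array that can serve as `sentence_embeddings` in the clustering step.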

# 4. Clustering

Run K-means clustering on the embeddings.

In [10]:
clusters = 15
kmeans = KMeans(n_clusters=clusters, random_state=0).fit(sentence_embeddings)

View samples in each cluster

In [None]:
random_sample = True
for i in range(clusters):
    print(f'Cluster {i} has {np.sum(kmeans.labels_ == i)} sentences')
    if random_sample:
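        # Sample without replacement so the same title is not printed twice
        idx = np.where(kmeans.labels_ == i)[0]
        print('\n'.join(all_data_np[np.random.choice(idx, min(5, len(idx)), replace=False)]))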
    else:
        print('\n'.join(all_data_np[kmeans.labels_ == i][0:5]))
    print('')
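Besides reading samples, listing the most frequent words in each cluster can help when you decide what label to give it. The sketch below is one simple way to do this, assuming whitespace-tokenized text; the helper name `top_words_per_cluster` is hypothetical, and in practice you would probably want to remove stop words first.

```python
from collections import Counter


def top_words_per_cluster(texts, labels, top_n=5):
    """Return the `top_n` most frequent words for each cluster label."""
    counters = {}
    for text, label in zip(texts, labels):
        counters.setdefault(int(label), Counter()).update(text.lower().split())
    return {
        label: [word for word, _ in counter.most_common(top_n)]
        for label, counter in counters.items()
    }


# Toy data for illustration only; in the notebook you could pass
# all_data_np and kmeans.labels_ instead.
toy_texts = ["team wins match", "team loses match", "stocks fall"]
toy_labels = [0, 0, 1]
top_words = top_words_per_cluster(toy_texts, toy_labels, top_n=2)
print(top_words)
```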