Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name below:

In [None]:
NAME = "Rachna Mallara"
STUDENT_ID = "14444372"

---

*Objective*: Apply topic modelling techniques, such as Latent Dirichlet Allocation (LDA), to analyze and interpret the primary topics present in a collection of online news articles.

Topic modelling is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. It is a frequently used text-mining tool for the discovery of hidden semantic structures in a text body. This assignment involves implementing and interpreting LDA topic modelling on a dataset of online news articles to understand the prevalent themes and topics.

For this task, you will use the "Fake news" dataset, which contains information about a large number of fake news articles. The dataset is available here: https://www.kaggle.com/datasets/mrisdal/fake-news.

1. Prepare: Explore the dataset
2. Pre-process the text data
3. Implement the LDA model
4. Analyze the topics and interpret the results

### Setup and requirements
First, make sure that the required Python libraries are correctly installed.

In [1]:
#!pip install numpy pandas matplotlib scikit-learn gensim nltk

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
import gensim
from gensim import corpora, models
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rachn\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rachn\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## 1. Prepare and Explore the Dataset (1 point)

The first step is to download and load the dataset. Familiarize yourself with its structure and content. Understand the kind of articles included, and how the data is organized.


1. Load the dataset using pandas.
2. Explore the dataset. What columns does it include? How are the articles represented?
3. For exploration and initial model training, take a 15-35% sample of the dataframe using the `sample` method in pandas.
4. Store your dataset in the variable named `news_df`
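
Because `sample` draws rows at random, the selected articles change on every run unless a seed is fixed. The sketch below illustrates this on a toy frame; the frame contents and the seed value are illustrative assumptions, not part of the assignment data.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical data, 100 rows)
toy_df = pd.DataFrame({"text": [f"article {i}" for i in range(100)]})

# frac=0.2 draws 20% of the rows; random_state makes the draw repeatable
sample_a = toy_df.sample(frac=0.2, random_state=42)
sample_b = toy_df.sample(frac=0.2, random_state=42)

print(len(sample_a))              # number of sampled rows
print(sample_a.equals(sample_b))  # same seed, same rows
```

Fixing the seed is optional, but it keeps later outputs (vocabulary, topics, coherence scores) comparable between runs.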

In [2]:
import pandas as pd

news_df = pd.read_csv('fake.csv')
print(f'The head of the original dataset fake.csv: {news_df.head()}.')
print(f'The columns in the dataset are: {news_df.columns}.')

news_df = news_df.sample(frac=0.2)
# YOUR CODE HERE
#raise NotImplementedError()

The head of the original dataset fake.csv:                                        uuid  ord_in_thread  \
0  6a175f46bcd24d39b3e962ad0f29936721db70db              0   
1  2bdc29d12605ef9cf3f09f9875040a7113be5d5b              0   
2  c70e149fdd53de5e61c29281100b9de0ed268bc3              0   
3  7cf7c15731ac2a116dd7f629bd57ea468ed70284              0   
4  0206b54719c7e241ffe0ad4315b808290dbe6c0f              0   

                 author                      published  \
0     Barracuda Brigade  2016-10-26T21:41:00.000+03:00   
1  reasoning with facts  2016-10-29T08:47:11.259+03:00   
2     Barracuda Brigade  2016-10-31T01:41:49.479+02:00   
3                Fed Up  2016-11-01T05:22:00.000+02:00   
4                Fed Up  2016-11-01T21:56:00.000+02:00   

                                               title  \
0  Muslims BUSTED: They Stole Millions In Gov’t B...   
1  Re: Why Did Attorney General Loretta Lynch Ple...   
2  BREAKING: Weiner Cooperating With FBI On Hilla...   
3  PIN DROP

In [3]:
assert 1949 <= len(news_df) <= 4550, "You should sample between 15-35% of the dataset."

### Question 1: Dataset Exploration (1 point)


What are the key characteristics of this dataset? Describe the dataset in terms of its size, variety of articles, and any other notable features.
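
One possible way to summarise size, missing values, and language mix is sketched below on a toy frame. The column names mirror `author`, `text`, and `language` from the dataset, but the rows and the language labels are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the real dataset
toy_df = pd.DataFrame({
    "author": ["a", None, "c", "d"],
    "text": ["some text", "more text", None, "текст"],
    "language": ["english", "english", "english", "russian"],
})

n_rows, n_cols = toy_df.shape                 # overall size
missing_per_column = toy_df.isna().sum()      # missing values per column
language_counts = toy_df["language"].value_counts()  # articles per language

print(n_rows, n_cols)
print(missing_per_column)
print(language_counts)
```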

In [4]:
# YOUR CODE HERE
print('Information about sampled dataset: \n')
print(news_df.info())
print('\nDescriptive statistics: \n')
print(news_df.describe())

#find the main language for the comments
#raise NotImplementedError()

Information about sampled dataset: 

<class 'pandas.core.frame.DataFrame'>
Index: 2600 entries, 8877 to 4223
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   uuid                2600 non-null   object 
 1   ord_in_thread       2600 non-null   int64  
 2   author              2104 non-null   object 
 3   published           2600 non-null   object 
 4   title               2467 non-null   object 
 5   text                2595 non-null   object 
 6   language            2600 non-null   object 
 7   crawled             2600 non-null   object 
 8   site_url            2600 non-null   object 
 9   country             2569 non-null   object 
 10  domain_rank         1732 non-null   float64
 11  thread_title        2594 non-null   object 
 12  spam_score          2600 non-null   float64
 13  main_img_url        1917 non-null   object 
 14  replies_count       2600 non-null   int64  
 15  participants_count  

## 2. Pre-process the Text Data

Before applying topic modelling, it's crucial to pre-process the text data. This involves cleaning the text, removing stop words, and converting the text into a suitable format for analysis.

1. Complete the `preprocess_text()` function to clean the text data (remove punctuation, lowercase, tokenize, lemmatize).
2. Remove stopwords using the NLTK library.
3. Create a corpus required for the LDA model using the gensim package and save it in the variable `corpus`.
4. Convert the cleaned text into a document-term matrix using the gensim package and save it in the variable `doc_term_matrix`.
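
To make the steps above concrete, here is a library-free sketch of the same idea: clean, tokenize, drop stopwords, and count tokens per document. It is a simplification under stated assumptions: the stopword list is a tiny made-up set, tokenization is a plain whitespace split, and lemmatization is omitted. The `(token_id, count)` pairs imitate the bag-of-words layout, but this is not gensim's implementation.

```python
import string
from collections import Counter

# Tiny illustrative stopword list (an assumption; NLTK's English list is far longer)
TOY_STOPWORDS = {"the", "a", "is", "on", "and"}

def toy_preprocess(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in TOY_STOPWORDS]

docs = ["The FBI is on the case.", "A case, and a vote!"]
tokenized = [toy_preprocess(d) for d in docs]

# Map each distinct token to an integer id, in first-seen order
vocab = {}
for tokens in tokenized:
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))

# Bag-of-words: one sorted list of (token_id, count) pairs per document
bow = [sorted(Counter(vocab[t] for t in tokens).items()) for tokens in tokenized]

print(tokenized)
print(bow)
```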

In [5]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from gensim import corpora
from gensim.models import TfidfModel

nltk.download('punkt')

lemmatizer = WordNetLemmatizer()
# Build the stopword set once instead of on every function call
stop_words = set(stopwords.words('english'))

# Step 1: Text pre-processing function
def preprocess_text(text):
    # YOUR CODE HERE
    # Missing articles (NaN) yield an empty token list
    if pd.isnull(text):
        return []
    # Keep only alphanumeric characters and whitespace, then lowercase
    text = ''.join([char for char in text if char.isalnum() or char.isspace()])
    text = text.lower()
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatization
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return tokens

# Apply text pre-processing to the 'text' column
news_df['processed_text'] = news_df['text'].apply(preprocess_text)

# Step 2: Build the dictionary and the bag-of-words corpus for the LDA model
dictionary = corpora.Dictionary(news_df['processed_text'])
bow_corpus = [dictionary.doc2bow(text) for text in news_df['processed_text']]

# Step 3: Create a TF-IDF-weighted document-term matrix
tfidf = TfidfModel(bow_corpus)
doc_term_matrix = [tfidf[doc] for doc in bow_corpus]

# The public test requires `corpus` to be the gensim Dictionary itself
corpus = dictionary

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\rachn\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Public test (1 point)

In [6]:
assert type(doc_term_matrix) == list, "doc_term_matrix should be a list of lists"
assert type(corpus) == gensim.corpora.dictionary.Dictionary, "corpus should be a gensim.corpora.dictionary.Dictionary"

Hidden tests (2 points)

### Question 2: Pre-processing Importance (2 points)

Why is pre-processing important in topic modelling? Describe how each step in the pre-processing pipeline contributes to the overall analysis.

YOUR ANSWER HERE

## 3. Implement the LDA Model (1 point)

Now, it's time to implement the LDA model using the Gensim library. Be sure to check out the documentation for hyperparameter settings.

1. Choose the number of topics for the model. This is a crucial step and may require some experimentation.
2. Train the LDA model on the dataset.
3. Save the model for future use.

In [7]:
from gensim.models import LdaModel
from gensim.models import CoherenceModel

# evaluate model to see if num_topics_selected should be modified - highest coherence was for num_topics_selected = 2
"""
for num_topics_selected in range(1, 11):
    lda_model = LdaModel(corpus = doc_term_matrix, num_topics = num_topics_selected, id2word = corpus)
    coherence_model = CoherenceModel(model = lda_model, texts = news_df['processed_text'], dictionary = corpus, coherence = 'c_v')
    print(f'Num Topics: {num_topics_selected}, Coherence Score: {coherence_model.get_coherence()}')
"""
# generate lda model
num_topics_selected = 2
lda_model = LdaModel(corpus = doc_term_matrix, num_topics = num_topics_selected, id2word = corpus)

# YOUR CODE HERE
#raise NotImplementedError()

In [8]:
assert type(lda_model) == gensim.models.ldamodel.LdaModel, "lda_model should be a gensim.models.ldamodel.LdaModel"
lda_model.save('lda_model.model')

### Question 3: Model Parameters (2 points)

Discuss the choice of number of topics for the LDA model. How does this choice impact the model's performance and the interpretability of the results?

YOUR ANSWER HERE

## 4. Analyze Topics and Interpret Results (1 point)

Finally, analyze the topics produced by the LDA model and interpret the results.

1. Use the LDA model to identify the main topics in the dataset.
2. For each topic, examine the most representative words.
3. Interpret the topics: What themes or subjects do they represent?

### Question 4: Topic Interpretation

Interpret the topics generated by the LDA model. How coherent are the topics? What do they tell us about the content of the dataset? Does this model need improvement, for example by modifying parameters or applying further pre-processing?
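
One rough diagnostic for topic quality is how much the top words of two topics overlap: heavy overlap suggests the topics are not well separated. The sketch below computes a Jaccard overlap on two short, hand-picked word lists. The lists are illustrative assumptions, and this measure is a quick sanity check, not a substitute for a coherence score.

```python
def jaccard(words_a, words_b):
    """Share of distinct words that two lists have in common (0 = disjoint, 1 = identical)."""
    set_a, set_b = set(words_a), set(words_b)
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical top-word lists for two topics
topic_1 = ["trump", "clinton", "email", "election", "russia", "syria"]
topic_2 = ["clinton", "trump", "email", "election", "fbi", "vote"]

overlap = jaccard(topic_1, topic_2)
print(f"{overlap:.2f}")
```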

In [11]:
# YOUR CODE HERE
# Step 1: Identify the main topics in the dataset
topics = lda_model.show_topics(num_topics = num_topics_selected, num_words = 30, formatted=False)

# Step 2: Print the most representative words for each topic
for topic_id, word_scores in topics:
    print(f"\nTopic {topic_id + 1}:")
    for word, score in word_scores:
        print(f"{word} (Score: {score:.4f})")
        
#raise NotImplementedError()


Topic 1:
trump (Score: 0.0006)
clinton (Score: 0.0006)
u (Score: 0.0004)
email (Score: 0.0004)
state (Score: 0.0003)
election (Score: 0.0003)
people (Score: 0.0003)
hillary (Score: 0.0003)
russia (Score: 0.0003)
government (Score: 0.0003)
war (Score: 0.0003)
donald (Score: 0.0003)
syria (Score: 0.0003)
like (Score: 0.0003)
american (Score: 0.0003)
obama (Score: 0.0003)
2016 (Score: 0.0003)
president (Score: 0.0003)
в (Score: 0.0003)
new (Score: 0.0002)
fbi (Score: 0.0002)
would (Score: 0.0002)
time (Score: 0.0002)
one (Score: 0.0002)
child (Score: 0.0002)
october (Score: 0.0002)
country (Score: 0.0002)
day (Score: 0.0002)
get (Score: 0.0002)
know (Score: 0.0002)

Topic 2:
clinton (Score: 0.0007)
trump (Score: 0.0007)
hillary (Score: 0.0005)
fbi (Score: 0.0004)
de (Score: 0.0004)
email (Score: 0.0004)
election (Score: 0.0003)
said (Score: 0.0003)
vote (Score: 0.0003)
investigation (Score: 0.0003)
would (Score: 0.0003)
war (Score: 0.0003)
campaign (Score: 0.0003)
president (Score: 0.000

YOUR ANSWER HERE

## Question 5: Improving Preprocessing for Topic Modeling (1 point)

### Objective:
Enhance your understanding and skills in preprocessing text data for topic modeling. You will focus on three key areas: 
1. Subsetting posts by language (focusing on English).
2. Enriching the list of stopwords specific to your dataset for more effective topic modeling by adding custom stopwords. Analyze the results to identify irrelevant or overly common words that could be added to your stopwords list.
3. **Re-run Topic Modeling**: Apply the enriched stopwords list and re-run the topic modeling process.
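
A minimal sketch of the first two steps on toy data follows. It assumes the `language` column holds lowercase labels such as `english` (worth verifying with `value_counts()` on the real data), and the 80% document-frequency threshold is an arbitrary illustrative choice.

```python
from collections import Counter

import pandas as pd

# Hypothetical tokenized articles with a language label
toy_df = pd.DataFrame({
    "language": ["english", "english", "russian", "english"],
    "tokens": [
        ["said", "election", "vote"],
        ["said", "war", "army"],
        ["в", "и", "на"],
        ["said", "email", "fbi"],
    ],
})

# Step 1: keep only the English articles
english_df = toy_df[toy_df["language"] == "english"]

# Step 2: document frequency, i.e. in how many documents each token appears
doc_freq = Counter(tok for tokens in english_df["tokens"] for tok in set(tokens))
n_docs = len(english_df)

# Tokens present in more than 80% of documents carry little topical signal
candidate_stopwords = {tok for tok, count in doc_freq.items() if count / n_docs > 0.8}
print(candidate_stopwords)
```

Candidates found this way still need a manual check before being added to `custom_stopwords`, since a frequent word can also be a genuinely important topic word.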

In [None]:
# subset dataset by english articles
# news_df = ...

# YOUR CODE HERE
raise NotImplementedError()

custom_stopwords = set([])

# YOUR CODE HERE
raise NotImplementedError()


def preprocess_text(text):
    # YOUR CODE HERE
    raise NotImplementedError()



# YOUR CODE HERE
raise NotImplementedError()


Does this additional preprocessing improve the topic model output? Why?

YOUR ANSWER HERE

## Question 6: Assessing LDA Model Coherence (2 points)

### Objective

In this exercise, you will assess the coherence of an LDA topic model using Gensim's coherence measures. Coherence measures help in evaluating how well the topics generated by the model are interpretable and semantically meaningful.

### Task

1. **Implement an LDA Model**: Using the "Fake news" dataset, implement an LDA model as done in the previous exercises.
2. **Compute Coherence Score**: Calculate the coherence score of your model using Gensim's CoherenceModel (https://radimrehurek.com/gensim/models/coherencemodel.html).
3. **Experiment with Different Number of Topics**: Experiment with different numbers of topics (e.g., 5, 10, 15 or 10, 50, 100 or whatever range you deem likely for the given data) and assess how the coherence score changes. Write a function that computes a coherence score for each model and plot the coherence scores associated with each topic number value (1 point).
4. **Interpret Results**: Based on the coherence scores, determine the optimal number of topics for the model (1 point).
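
To build intuition for what a coherence score measures, the sketch below hand-rolls a simplified UMass-style score: for each pair of top words it takes the log of the smoothed co-document count divided by the document count of the higher-ranked word, then averages over pairs. This is an illustration under stated assumptions (toy documents, simple averaging), not Gensim's `CoherenceModel` pipeline, so its numbers will not match Gensim's output.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents, eps=1.0):
    """Average of log((D(w_i, w_j) + eps) / D(w_i)) over ranked word pairs with i < j."""
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        # Number of documents that contain all the given words
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score, n_pairs = 0.0, 0
    for higher, lower in combinations(top_words, 2):
        denom = doc_count(higher)
        if denom == 0:
            continue  # skip words that never occur in the documents
        score += math.log((doc_count(higher, lower) + eps) / denom)
        n_pairs += 1
    return score / n_pairs if n_pairs else 0.0

# Hypothetical tokenized documents
toy_docs = [
    ["election", "vote", "campaign"],
    ["election", "vote", "poll"],
    ["war", "army", "border"],
    ["war", "army", "troops"],
]

coherent_score = umass_coherence(["election", "vote"], toy_docs)
mixed_score = umass_coherence(["election", "army"], toy_docs)
print(coherent_score, mixed_score)
```

Words that tend to appear in the same documents score higher than words that never co-occur, which is why a topic mixing unrelated themes receives a lower score.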

### Assessment Criteria

- Quality of LDA model implementation.
- Correct calculation and interpretation of coherence scores.
- Thoughtful experimentation with different numbers of topics and analysis of the impact on coherence.

---

In [None]:
from gensim.models.coherencemodel import CoherenceModel

# Function to compute coherence score
def compute_coherence(dictionary, corpus, texts, limit, start=2, step=3):
    # YOUR CODE HERE
    raise NotImplementedError()

# Applying the function to our dataset
model_list, coherence_values = compute_coherence(dictionary=dictionary, corpus=doc_term_matrix, texts=news_df['cleaned_text'].str.split(), start=20, limit=100, step=10)

# Plotting coherence scores
# YOUR CODE HERE
raise NotImplementedError()

What is the optimal number of topics for your model?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Question 7: Fitting the Final LDA Model on the Entire Dataset (4 points)

### Objective:
Having identified the optimal number of topics using the coherence model in Gensim, your task now is to apply this knowledge to fit the final LDA (Latent Dirichlet Allocation) model on the entire dataset.

### Instructions:

1. **Optimal Number of Topics**:
   - Recall the optimal number of topics you determined using the coherence model on a sample of your dataset.
   
2. **Preprocess the Full Dataset**:
   - Ensure that the entire dataset is properly preprocessed (tokenization, removing stopwords, etc.).
   - Create a dictionary and a bag-of-words corpus using the full dataset.

3. **Fit the LDA Model**:
   - Instantiate and train the LDA model on the entire dataset using the optimal number of topics you previously determined.
   - Use the same model parameters that were most effective during your experimentation with the sample.

4. **Model Evaluation**:
   - Briefly evaluate the model by examining the coherence score on the full dataset.
   - Display the top words for each topic and provide a brief interpretation.

5. **Reflection**:
   - Reflect on any differences observed in topic quality and coherence when the model is applied to the entire dataset versus the sample.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()