In [16]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

In [4]:
news = pd.read_csv('/Users/camillecu/Downloads/KUL/text_mining/abcnews-date-text.csv')
news.head()

   publish_date  headline_text
0  20030219      aba decides against community broadcasting lic...
1  20030219      act fire witnesses must be aware of defamation
2  20030219      a g calls for infrastructure protection summit
3  20030219      air nz staff in aust strike for pay rise
4  20030219      air nz strike to affect australian travellers


In [12]:
news.shape

(1244184, 2)

Perform the following tasks:
1. Apply the Latent Semantic Analysis and Latent Dirichlet Allocation techniques to study the topic focus of ABC’s news headlines. Characterize the topics by exploring the most frequent words in each.


In [8]:
!pip install scikit-learn


Collecting scikit-learn
  Obtaining dependency information for scikit-learn from https://files.pythonhosted.org/packages/07/95/070d6e70f735d13f1c10afebb65ba3526125b7d6c6fc7022651a4a061148/scikit_learn-1.6.0-cp311-cp311-macosx_10_9_x86_64.whl.metadata
  Downloading scikit_learn-1.6.0-cp311-cp311-macosx_10_9_x86_64.whl.metadata (31 kB)
Collecting scipy>=1.6.0 (from scikit-learn)
  Obtaining dependency information for scipy>=1.6.0 from https://files.pythonhosted.org/packages/b8/53/7f627c180cdaa211fa537650ca05912f58cb68fc33bb2f9af3d29169913e/scipy-1.15.0-cp311-cp311-macosx_10_13_x86_64.whl.metadata
  Downloading scipy-1.15.0-cp311-cp311-macosx_10_13_x86_64.whl.metadata (61 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Obtaining dependency information for threadpoolctl>=3.1.0 from https://files.pythonhosted.org/packages/4b/2c/ffbf7a134b9ab11a67b0cf0726453cedd9c

Preprocess the data:
Remove stopwords, punctuation, and special characters
Convert text to lowercase
Tokenize the headlines
Apply lemmatization or stemming

In [25]:
import nltk
nltk.download('wordnet')  # For lemmatization
nltk.download('omw-1.4')  # Optional: For multilingual WordNet
nltk.download('stopwords')  # For the stopwords list
nltk.download('punkt')  # For word tokenization


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/camillecu/nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/camillecu/nltk_data...
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/camillecu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/camillecu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [26]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

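def preprocess_text(text):
    """Lowercase, strip punctuation, drop stopwords, and lemmatize a headline."""
    # Lowercase, then remove everything except word characters and whitespace
    text = re.sub(r'[^\w\s]', '', text.lower())
    tokens = word_tokenize(text)
    # Drop English stopwords, then lemmatize (WordNetLemmatizer treats tokens as nouns by default)
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
    return ' '.join(tokens)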

# Apply the preprocess_text function to each row in the 'headline_text' column
news['processed_text'] = news['headline_text'].apply(preprocess_text)


Creating a document-term matrix using TF-IDF vectorization is essential for several reasons:

1. Numerical representation: It converts text data into a structured, numerical format suitable for machine learning algorithms[2]. This is crucial because most ML models require numerical inputs.

2. Capturing word importance: TF-IDF scores reflect the importance of a word in a document relative to the entire corpus[1]. This helps in identifying key terms that are more relevant to each document.

3. Balancing frequency and uniqueness: TF-IDF balances the term frequency (how often a word appears in a document) with its inverse document frequency (how rare or common a word is across all documents)[5]. This provides a more nuanced representation of the text data.

4. Improved text analysis: By considering both the importance of terms within documents and across the corpus, TF-IDF allows for more effective text analysis in tasks such as classification, clustering, and information retrieval[2].

5. Addressing common word bias: TF-IDF helps mitigate the issue of common words dominating the analysis by giving higher weight to terms that are specific to particular documents[1][5].

6. Feature extraction: The resulting document-term matrix serves as a set of features that can be used for various NLP tasks, including document classification, sentiment analysis, and topic modeling[2][5].

By creating a document-term matrix using TF-IDF vectorization, we transform raw text into a numerical format that reflects how distinctive each word is for each document. The weights are statistical rather than semantic: they say how characteristic a term is, not what it means.
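
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. It assumes the smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization, which to my understanding matches the defaults of scikit-learn's TfidfVectorizer. The three headlines and the helper name `tfidf` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical headlines, already preprocessed)
docs = [
    "police charge man",
    "man charged murder",
    "council water plan",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(doc)
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In the first headline, "man" also occurs in another document, so it receives a lower weight than "police" or "charge", which occur only once in the corpus.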

Citations:
[1] https://www.capitalone.com/tech/machine-learning/understanding-tf-idf/
[2] https://app.studyraid.com/en/read/2545/51592/tf-idf-vectorization
[3] https://hackernoon.com/document-term-matrix-in-nlp-count-and-tf-idf-scores-explained
[4] https://openclassrooms.com/en/courses/6532301-introduction-to-natural-language-processing/8081363-apply-the-tf-idf-vectorization-approach
[5] https://letsdatascience.com/tf-idf/
[6] https://stackoverflow.com/questions/33510938/is-using-tf-idf-for-classification-task-like-sentiment-analysis-task-correct
[7] https://www.learndatasci.com/glossary/tf-idf-term-frequency-inverse-document-frequency/

In [27]:
# Create TF-IDF matrix
# max_features=5000 limits the number of features (unique words) to the
# top 5000 most frequent terms in the corpus. This helps in reducing the
# dimensionality of the feature space and can improve the performance of
# subsequent machine learning models.
vectorizer = TfidfVectorizer(max_features=5000)
tfidf_matrix = vectorizer.fit_transform(news['processed_text'])
tfidf_matrix.shape


(1244184, 5000)

In [28]:
# Apply LSA
n_topics = 8
lsa_model = TruncatedSVD(n_components=n_topics, random_state=42)
lsa_topic_matrix = lsa_model.fit_transform(tfidf_matrix)

# Apply LDA
lda_model = LatentDirichletAllocation(n_components=n_topics, random_state=42)
lda_topic_matrix = lda_model.fit_transform(tfidf_matrix)
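
A note on the LDA input: the generative model behind LDA is defined over word counts, while the cell above fits it on TF-IDF weights. scikit-learn accepts any non-negative matrix, so that call runs, but fitting on raw counts from CountVectorizer is a common alternative. The sketch below shows that variant on a made-up six-headline corpus; the names `toy_headlines`, `count_vectorizer`, and `toy_lda` are hypothetical, and it is not the configuration that produced the results in this notebook.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, already-preprocessed headlines
toy_headlines = [
    "police charge man murder",
    "man charged court murder",
    "police search missing man",
    "council water plan approved",
    "council hospital funding plan",
    "water plan council vote",
]

# Raw term counts instead of TF-IDF weights
count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform(toy_headlines)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = toy_lda.fit_transform(count_matrix)

# Each row is a headline's distribution over the two topics
print(doc_topics.shape)
print(np.round(doc_topics.sum(axis=1), 6))
```

On the full dataset, the same idea would mean swapping TfidfVectorizer for CountVectorizer only for the LDA branch and keeping TF-IDF for LSA.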

In [29]:
# Function to print top words for each topic
def print_top_words(model, feature_names, n_top_words=10):
    for topic_idx, topic in enumerate(model.components_):
        top_words = [feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
        print(f"Topic {topic_idx + 1}: {', '.join(top_words)}")
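
The slice `topic.argsort()[:-n_top_words - 1:-1]` is compact but easy to misread. The sketch below walks through it on a made-up five-term topic: `argsort()` returns indices in ascending order of weight, and the negative-step slice reads the last `n_top_words` of them back to front. The weights and vocabulary are hypothetical.

```python
import numpy as np

# Hypothetical vocabulary and one row of model.components_
feature_names = ["court", "police", "water", "murder", "council"]
topic = np.array([0.1, 0.7, 0.3, 0.9, 0.5])
n_top_words = 3

order = topic.argsort()                 # indices by ascending weight
top_idx = order[:-n_top_words - 1:-1]   # last n_top_words indices, reversed
top_words = [feature_names[i] for i in top_idx]
print(top_words)

# An equivalent, arguably more readable form
alt_idx = np.argsort(-topic)[:n_top_words]
print(np.array_equal(top_idx, alt_idx))
```

Both forms pick the same indices here; the second sorts the negated weights so that the largest come first.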


In [30]:
# Print LSA topics
print("LSA Topics:")
print_top_words(lsa_model, vectorizer.get_feature_names_out())

LSA Topics:
Topic 1: interview, police, man, new, extended, charged, court, murder, crash, death
Topic 2: police, man, charged, new, court, murder, crash, death, woman, say
Topic 3: man, charged, murder, police, court, missing, dy, jailed, stabbing, crash
Topic 4: police, probe, investigate, officer, search, hunt, arrest, seek, driver, fatal
Topic 5: new, police, man, zealand, year, case, york, murder, search, law
Topic 6: say, court, face, man, govt, murder, new, plan, police, accused
Topic 7: say, australia, crash, australian, win, day, world, woman, dy, car
Topic 8: court, australia, face, win, charge, australian, murder, back, accused, day
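
One caveat when reading the LSA output: unlike LDA, the rows of a TruncatedSVD model's `components_` can contain negative as well as positive weights, and `print_top_words` ranks by signed value, so it only shows the positive end of each component. A minimal sketch of ranking by magnitude instead, on a made-up component; `top_terms_by_magnitude` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def top_terms_by_magnitude(component, feature_names, n_top_words=3):
    """Return (term, weight) pairs with the largest absolute weights."""
    order = np.argsort(-np.abs(component))[:n_top_words]
    return [(feature_names[i], float(component[i])) for i in order]

# Hypothetical vocabulary and one signed LSA component
feature_names = ["police", "interview", "court", "market", "rain"]
component = np.array([0.40, -0.65, 0.30, 0.05, -0.10])

result = top_terms_by_magnitude(component, feature_names)
for term, weight in result:
    print(f"{term}: {weight:+.2f}")
```

Printing the sign alongside each term shows which words pull a headline toward one end of the component and which toward the other.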


In [31]:
# Print LDA topics
print("\nLDA Topics:")
print_top_words(lda_model, vectorizer.get_feature_names_out())


LDA Topics:
Topic 1: test, black, future, election, murray, poll, broken, hill, go, candidate
Topic 2: iraq, year, killed, kill, attack, blast, china, asylum, pakistan, protest
Topic 3: police, man, charged, found, missing, court, charge, coast, murder, woman
Topic 4: interview, cup, world, say, trump, medium, speaks, talk, john, closer
Topic 5: market, rural, news, win, abc, record, country, share, day, weather
Topic 6: council, plan, govt, health, water, call, new, hospital, service, centre
Topic 7: crash, fire, death, rate, dy, car, man, road, victim, bushfire
Topic 8: change, new, government, pay, say, wa, climate, govt, farmer, labor
