## What is Topic Modeling?

Topic modeling is a type of unsupervised machine learning (learning without labels) that identifies the hidden topics in a collection of text documents.

## Types of Topic Modeling

##### Latent Dirichlet Allocation (LDA)

##### Non-Negative Matrix Factorization (NMF)

## Latent Dirichlet Allocation (LDA)

In [15]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import spacy
import re

# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# Sample documents, entered manually
documents = [
    "Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. Unqualified, the word football normally means the form of football that is the most popular where the word is used. Sports commonly called football include association football (known as soccer in Australia, Canada, South Africa, and the United States; in Ireland and New Zealand, the game is referred to as either 'football' or 'soccer', depending on the person using the term); Australian rules football; Gaelic football; gridiron football (specifically American football, Arena football, or Canadian football); International rules football; rugby league football; and rugby union football.[1] These various forms of football share, to varying degrees, common origins and are known as 'football codes'.",
    "Cricket is a bat-and-ball game played between two teams of eleven players on a field at the centre of which is a 22-yard (20-metre) pitch with a wicket at each end, each comprising two bails balanced on three stumps. The batting side scores runs by striking the ball bowled at one of the wickets with the bat and then running between the wickets, while the bowling and fielding side tries to prevent this (by preventing the ball from leaving the field, and getting the ball to either wicket) and dismiss each batter (so they are 'out'). Means of dismissal include being bowled, when the ball hits the stumps and dislodges the bails, and by the fielding side either catching the ball after it is hit by the bat, but before it hits the ground, or hitting a wicket with the ball before a batter can cross the crease in front of the wicket. When ten batters have been dismissed, the innings ends and the teams swap roles. The game is adjudicated by two umpires, aided by a third umpire and match referee in international matches. They communicate with two off-field scorers who record the match's statistical information."
]

In [16]:
# Create a DataFrame to hold the sample documents
pd.set_option("display.max_colwidth", 100)
data = pd.DataFrame({"text": documents})
data

Unnamed: 0,text
0,"Football is a family of team sports that involve, to varying degrees, kicking a ball to score a ..."
1,Cricket is a bat-and-ball game played between two teams of eleven players on a field at the cent...


In [19]:
# Preprocessing: tokenization, stop-word removal, and lemmatization using spaCy
def preprocess(text):
    doc = nlp(text)
    processed_tokens = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]
    return ' '.join(processed_tokens)

In [20]:
data["processed_text"] = data["text"].apply(preprocess)
data

Unnamed: 0,text,processed_text
0,"Football is a family of team sports that involve, to varying degrees, kicking a ball to score a ...",football family team sport involve vary degree kick ball score goal unqualified word football no...
1,Cricket is a bat-and-ball game played between two teams of eleven players on a field at the cent...,cricket bat ball game play team player field centre yard metre pitch wicket end comprise bail ba...


## Create the TF-IDF Vectorizer and Fit the Model

In [22]:
# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Transform the preprocessed documents into TF-IDF vectors
X = vectorizer.fit_transform(data["processed_text"])

# Create a Latent Dirichlet Allocation model with two topics
lda = LatentDirichletAllocation(n_components=2)

# Fit the model to the TF-IDF vectors
lda.fit(X)

# Print the topic-word weights (one row per topic, one column per vocabulary term)
print(lda.components_)

[[0.50324437 0.50418424 0.50324437 0.50418424 0.50418424 0.50418424
  0.50418424 0.50418424 0.50511896 0.50324437 0.50817936 0.50626521
  0.50626521 0.50324437 0.50511896 0.50324437 0.50418424 0.50418424
  0.50418424 0.50324437 0.50324437 0.50418424 0.50418424 0.50418424
  0.50324437 0.50324437 0.50324437 0.50324437 0.50324437 0.50702125
  0.50418424 0.50324437 0.50511896 0.50324437 0.50511896 0.50418424
  0.50626521 0.50511896 0.514618   0.50702125 0.50418424 0.50592486
  0.50324437 0.50418424 0.50418424 0.50324437 0.50700331 0.50480607
  0.50324437 0.50324437 0.50480607 0.50418424 0.50418424 0.50418424
  0.50702125 0.50418424 0.50324437 0.50626521 0.50480607 0.50324437
  0.50418424 0.50418424 0.50418424 0.50418424 0.50324437 0.50324437
  0.50324437 0.50418424 0.50511896 0.50324437 0.50418424 0.50324437
  0.50324437 0.50702125 0.50702125 0.50511896 0.50480607 0.50324437
  0.50418424 0.50702125 0.50418424 0.50418424 0.50702125 0.50418424
  0.50324437 0.50324437 0.50511896 0.50324437 0.
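
The raw `components_` matrix above is hard to read because its rows are unnormalized pseudo-counts. The sketch below shows one way to turn them into per-topic word distributions and to obtain the per-document topic mixture. It makes two assumptions that differ from the cell above: it uses `CountVectorizer` (LDA is formulated over word counts) and fixes `random_state` for repeatability. The corpus `toy_docs` and the names `lda_toy`, `topic_word`, and `doc_topic` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only
toy_docs = [
    "football team kick ball score goal",
    "football goal team league match",
    "cricket bat ball wicket bowler",
    "cricket wicket umpire innings batter",
]

# Raw term counts (documents x terms)
counts = CountVectorizer().fit_transform(toy_docs)

# Fit LDA; fit_transform returns the document-topic distribution
lda_toy = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda_toy.fit_transform(counts)

# Normalize each row of components_ so it sums to 1 (a distribution over words)
topic_word = lda_toy.components_ / lda_toy.components_.sum(axis=1, keepdims=True)

print(np.round(topic_word, 3))  # topics x terms
print(np.round(doc_topic, 3))   # documents x topics
```

With only a handful of short documents the weights stay close to uniform, which is likely also why the values printed above are all near 0.5.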

## Print the Topics and Their Associated Words

In [None]:
feature_names = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    print(f"Topic {idx + 1}:")
    # Get the indices of the 5 words with the highest weights for this topic
    top_words_idx = topic.argsort()[-5:][::-1]
    top_words = [feature_names[i] for i in top_words_idx]
    print(", ".join(top_words))
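
The slicing expression `argsort()[-5:][::-1]` can be hard to parse at first glance, so here is a small sketch of what it does, using a hypothetical weight vector and vocabulary (`weights` and `vocab` are made up for illustration and are not taken from the fitted model).

```python
import numpy as np

# Hypothetical topic weights and matching vocabulary
weights = np.array([0.1, 0.7, 0.3, 0.9, 0.5])
vocab = ["ball", "wicket", "goal", "cricket", "team"]

ascending = weights.argsort()   # indices ordered from lowest to highest weight
top3 = ascending[-3:][::-1]     # keep the last three, then reverse to descending order

top_words = [vocab[i] for i in top3]
print(top_words)  # ['cricket', 'wicket', 'team']
```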
    