# Using LDA to conduct topic modelling on Coursera data

<strong>The program descriptions and learning-outcome summaries are aggregated into a bag-of-words corpus, and the LDA model from Gensim is fitted on it to generate n topics across the programs, with n chosen by a grid search over coherence scores.</strong>

## (1) Import libraries and Coursera data into notebook

In [1]:
import pandas as pd
import re
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from gensim import corpora
from gensim.models import LdaModel
from gensim.models import CoherenceModel

import numpy as np


In [2]:
df_prog = pd.read_csv("../../data/prog_scraped.csv")
df_course = pd.read_csv("../../data/course_scraped.csv")
df_module = pd.read_csv("../../data/module_scraped.csv")

df_prog.rename(columns={"Unnamed: 0": "id"}, inplace=True)

df_prog.head()



Unnamed: 0,id,prog_title,org,cert_type,enrolled,num_courses,rating,num_reviews,difficulty,prog_lo_description,prog_lo_skills,prog_description,course_title_description_skills,url
0,0,ISC2 Systems Security Certified Practitioner (...,ISC2,Specialization,22401,7,4.8,973.0,Beginner,"Implement, monitor and administer an organizat...",Security Engineering;Network Security;Leadersh...,Pursue better IT security job opportunities an...,Security Concepts and Practices|Course 1 - Sec...,https://www.coursera.org/specializations/sscp-...
1,1,.NET FullStack Developer Specialization,Board Infinity,Specialization,11753,3,4.1,259.0,Intermediate,Master .NET full stack web dev: from .NET core...,Model–View–Controller (MVC);HTML;React (Web Fr...,Develop the proficiency required to design and...,.Net Full Stack Foundation|Understand .NET fra...,https://www.coursera.org/specializations/dot-n...
2,2,AI For Business Specialization,University of Pennsylvania,Specialization,39842,4,4.7,1.0,Beginner,,Machine Learning;Machine Learning Algorithms;A...,This specialization will provide learners with...,AI Fundamentals for Non-Data Scientists|In thi...,https://www.coursera.org/specializations/ai-fo...
3,3,AI Foundations for Everyone Specialization,IBM,Specialization,44046,4,4.7,3.0,Beginner,,Cloud Computing;Human Computer Interaction;Hum...,Artificial Intelligence (AI) is no longer scie...,Introduction to Artificial Intelligence (AI)|D...,https://www.coursera.org/specializations/ai-fo...
4,4,AI Product Management Specialization,Duke University,Specialization,40831,3,4.7,636.0,Beginner,Identify when and how machine learning can app...,Machine Learning;Machine Learning Algorithms;A...,Organizations in every industry are accelerati...,Machine Learning Foundations for Product Manag...,https://www.coursera.org/specializations/ai-pr...


## (2) Cleaning and preprocessing text to create dictionary for LDA model

In [3]:
nlp = spacy.load("en_core_web_sm")

In [4]:
df_prog["prog_description"] = df_prog["prog_description"].fillna("")
df_prog["prog_lo_description"] = df_prog["prog_lo_description"].fillna("")

print('nan removed')

nan removed


In [5]:
def combine_text(row):
    '''
    Combines prog_lo_description (meant to act as the summary) with the program description
    into a single string, to simplify topic modelling.
    '''
    lo_descriptions = row["prog_lo_description"].split(";") 
    lo_text = " ".join(lo_descriptions)
    combined = row["prog_description"] + " " + lo_text
    return combined
df_prog["combined_text"] = df_prog.apply(combine_text, axis=1)
print("Combined data for row 0:\n", df_prog["combined_text"][0])

Combined data for row 0:
 Pursue better IT security job opportunities and prove knowledge with confidence. The SSCP Professional Training Certificate shows employers you have the IT security foundation to defend against cyber attacks – and puts you on a clear path to earning SSCP certification.;Upon completing the SSCP Professional Certificate, you will:;Complete seven courses of preparing you to sit for the Systems Security Certified Practitioner (SSCP) certification exam Opens in a new tabas outlined below.;Course 1 - Security Concepts and Practices;Course 2 - Access Controls;Course 3 - Risk Identification, Monitoring, and Analysis;Course 4 - Incident Response and Recovery;Course 5 - Cryptography;Course 6 - Network and Communications Security;Course 7 - Systems and Application Security;Receive a certificate of program completion.;Understand how to implement, monitor and administer an organization’s IT infrastructure in accordance with security policies and procedures that ensure data

In [6]:
coursera_stopwords = {  # Coursera-specific extension of spaCy's stop words; not exhaustive, so add more as needed
    "course", "program", "learn", "learning", "outcome", "outcomes",
    "description", "skill", "skills", "module", "modules", "specialization",
    "specialisation", "certificate", "certificates",  # punctuation is stripped before matching, so no "certificate," variants
    "professional", "career", "opportunity", "opportunities", "project", "projects", "work", "experience", "experiences"
}
STOP_WORDS = STOP_WORDS.union(coursera_stopwords)

def preprocess_text(text: str):
    '''
    Preprocessing steps:
    1. lowercase conversion
    2. removal of non-alphabetic characters (digits and punctuation)
    3. lemmatization, then removal of stop words and of lemmas with two characters or fewer
    '''
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    doc = nlp(text)
    tokens = []
    for token in doc:
        if (token.lemma_ not in STOP_WORDS and token.lemma_.isalpha() and len(token.lemma_) > 2):
            tokens.append(token.lemma_)
    return tokens

# apply preprocessing function
df_prog["tokens"] = df_prog["combined_text"].apply(preprocess_text)
print("Tokens:", df_prog["tokens"][0])


Tokens: ['pursue', 'security', 'job', 'prove', 'knowledge', 'confidence', 'sscp', 'training', 'employer', 'security', 'foundation', 'defend', 'cyber', 'attack', 'clear', 'path', 'earn', 'sscp', 'certificationupon', 'complete', 'sscp', 'willcomplete', 'seven', 'prepare', 'sit', 'system', 'security', 'certify', 'practitioner', 'sscp', 'certification', 'exam', 'open', 'new', 'tabas', 'outline', 'belowcourse', 'security', 'concept', 'practicescourse', 'access', 'controlscourse', 'risk', 'identification', 'monitoring', 'analysiscourse', 'incident', 'response', 'recoverycourse', 'cryptographycourse', 'network', 'communication', 'securitycourse', 'system', 'application', 'securityreceive', 'completionunderstand', 'implement', 'monitor', 'administer', 'organization', 'infrastructure', 'accordance', 'security', 'policy', 'procedure', 'ensure', 'datum', 'confidentiality', 'integrity', 'availabilityapplie', 'projecteach', 'include', 'final', 'assessment', 'knowledge', 'check', 'require', 'student
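
Several tokens in the output above are fused words, for example 'certificationupon', 'willcomplete' and 'belowcourse'. This is consistent with the regex deleting separators such as `.;` outright, which leaves no whitespace between the end of one sentence and the start of the next. Below is a minimal sketch of one possible adjustment, assuming it is acceptable to substitute a space for each run of non-letter characters. `strip_non_letters` is a hypothetical helper that is not part of this notebook, and the sample string is a shortened excerpt of row 0.

```python
import re

def strip_non_letters(text: str) -> str:
    """Lowercase the text and replace each run of non-letter characters with one space.

    Hypothetical helper (an assumption, not the notebook's own function).
    """
    text = text.lower()
    # Substituting a space (rather than an empty string) keeps words apart
    # when they are separated only by punctuation such as ".;" or ":;".
    text = re.sub(r"[^a-z]+", " ", text)
    return text.strip()

sample = "earning SSCP certification.;Upon completing the SSCP Professional Certificate, you will:;Complete seven courses"
print(strip_non_letters(sample).split())
```

With this variant, "certification.;Upon" yields the two tokens "certification" and "upon". The cleaned string could then be passed to `nlp` as before, although the dictionary size and coherence scores reported below would likely change.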

In [7]:
# creating a dictionary for the LDA model
all_tokens = df_prog["tokens"].tolist()
dictionary_prog = corpora.Dictionary(all_tokens)
dictionary_prog.filter_extremes(no_below=5, no_above=0.5)

# creating BOW corpus
bow_corpus_prog = [dictionary_prog.doc2bow(doc) for doc in all_tokens]
print(f"Number of unique tokens in dictionary: {len(dictionary_prog)}")
# print(bow_corpus_prog)
# print(bow_corpus_prog[0])
print("Example BOW for the first document:", bow_corpus_prog[0][:5])


Number of unique tokens in dictionary: 1415
Example BOW for the first document: [(0, 1), (1, 1), (2, 2), (3, 1), (4, 2)]
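
To make the two steps above concrete, here is a small pure-Python sketch of document-frequency filtering followed by bag-of-words encoding. It only approximates what `filter_extremes` and `doc2bow` do and is not Gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, the toy documents are made up, and ids are assigned alphabetically for simplicity, which is not how Gensim numbers tokens.

```python
from collections import Counter

def build_vocab(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents, then map each to an integer id."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}

def to_bow(doc, vocab):
    """Encode one tokenized document as sorted (token_id, count) pairs."""
    counts = Counter(t for t in doc if t in vocab)
    return sorted((vocab[t], c) for t, c in counts.items())

# Toy corpus (illustrative only, not taken from the Coursera data)
docs = [
    ["security", "network", "security"],
    ["security", "cloud", "network"],
    ["data", "python", "data"],
    ["data", "cloud", "sql"],
]
vocab = build_vocab(docs, no_below=2, no_above=0.5)
print(vocab)
print(to_bow(docs[0], vocab))
```

Each pair reads as (token id, number of occurrences in that document), so `(2, 2)` in the notebook output above means the token with id 2 appears twice in the first program's text. Tokens dropped by the filter simply do not appear in the encoding.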


## (3) Calculating coherence scores + best number of topics through grid search

In [8]:
def compute_coherence_values(dictionary, corpus, texts, start, limit, step):
    """
    Computes c_v coherence for various values of num_topics.
    
    Returns:
        model_list: List of trained LdaModel
        coherence_values: Coherence values corresponding to the models
    """
    coherence_values = []
    model_list = []
    
    for num_topics in range(start, limit, step):
        model = LdaModel(
            corpus=corpus,
            id2word=dictionary,
            num_topics=num_topics,
            random_state=42,
            passes=10,       # tweak for more stable training
            alpha='auto',    # auto tuning of alpha by gensim
            per_word_topics=True
        )
        model_list.append(model)
        
        # calculate coherence score
        coherencemodel = CoherenceModel(
            model=model, 
            texts=texts, 
            dictionary=dictionary,
            coherence='c_v'
        )
        coherence_values.append(coherencemodel.get_coherence())
    
    return model_list, coherence_values

start, limit, step = 5, 26, 5  
model_list, coherence_values = compute_coherence_values(
    dictionary_prog, 
    bow_corpus_prog, 
    all_tokens, 
    start, 
    limit, 
    step
)

# identifying best coherence
best_index = np.argmax(coherence_values)
optimal_num_topics = range(start, limit, step)[best_index]
best_model = model_list[best_index]
best_coherence = coherence_values[best_index]
print("Coherence Values:", coherence_values)
print(f"Best number of topics: {optimal_num_topics} with Coherence = {best_coherence:.4f}")


Coherence Values: [0.3147428853581383, 0.3756566704581427, 0.34795359237237833, 0.3614342856557239, 0.33794611460121565]
Best number of topics: 10 with Coherence = 0.3757
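
The selection step can be checked by hand against the printed values. Because the `range` stop value is exclusive, `limit = 26` makes the grid 5, 10, 15, 20 and 25 topics. The sketch below pairs each candidate with its coherence score from the output above and ranks them. The names `candidates`, `ranked` and `margin` are introduced here for illustration only.

```python
# Grid used in the notebook: start=5, limit=26 (exclusive), step=5
candidates = list(range(5, 26, 5))  # [5, 10, 15, 20, 25]

# c_v coherence values copied from the notebook output above
coherence_values = [
    0.3147428853581383,
    0.3756566704581427,
    0.34795359237237833,
    0.3614342856557239,
    0.33794611460121565,
]

# Rank candidates from highest to lowest coherence
ranked = sorted(zip(candidates, coherence_values), key=lambda pair: pair[1], reverse=True)
for k, score in ranked:
    print(f"{k:>2} topics: c_v = {score:.4f}")

best_k, best_score = ranked[0]
runner_up_k, runner_up_score = ranked[1]
margin = best_score - runner_up_score
print(f"Best: {best_k} topics, ahead of {runner_up_k} topics by {margin:.4f}")
```

The gap between 10 and 20 topics is about 0.014. Whether a difference of that size is meaningful is not established by this single run. One way to probe it, offered as a suggestion rather than something done in this notebook, would be to repeat the grid search with several `random_state` values and compare the rankings.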


## (4) Viewing top words for each topic

In [9]:
for idx in range(optimal_num_topics):
    terms = best_model.get_topic_terms(idx, topn=10)
    term_words = [dictionary_prog[term_id] for term_id, _ in terms]
    print(f"\nTopic {idx} top words: {term_words}")



Topic 0 top words: ['machine', 'programming', 'build', 'model', 'image', 'datum', 'application', 'java', 'language', 'tensorflow']

Topic 1 top words: ['application', 'web', 'design', 'build', 'create', 'technology', 'blockchain', 'learner', 'develop', 'game']

Topic 2 top words: ['datum', 'data', 'science', 'analysis', 'create', 'python', 'tool', 'database', 'analyze', 'sql']

Topic 3 top words: ['marketing', 'create', 'social', 'digital', 'business', 'product', 'strategy', 'music', 'brand', 'practice']

Topic 4 top words: ['design', 'business', 'job', 'new', 'create', 'management', 'analytic', 'customer', 'product', 'tool']

Topic 5 top words: ['cloud', 'google', 'network', 'cybersecurity', 'handson', 'security', 'lab', 'certification', 'new', 'engineer']

Topic 6 top words: ['health', 'design', 'healthcare', 'patient', 'learner', 'care', 'develop', 'human', 'new', 'field']

Topic 7 top words: ['datum', 'learner', 'business', 'model', 'analysis', 'complete', 'new', 'teach', 'techniq