# Spotify Podcast Dataset - Genre from Description via LDA

This notebook applies Latent Dirichlet Allocation (LDA) to podcast show descriptions to look for common themes that could serve as genres.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../podcasts-no-audio-13GB-selected/metadata.tsv', sep='\t')

TODO: filter out non-English podcasts.
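One possible way to approach this TODO, sketched under assumptions: a crude stopword-ratio heuristic that treats a description as English when enough of its words are common English function words. The `looks_english` helper, the `ENGLISH_HINTS` word list, and the 0.2 threshold are all hypothetical choices made for illustration; a dedicated language-identification tool would likely be more reliable.

```python
import re

# A tiny, illustrative set of very common English function words.
# This list is an assumption; a real filter would need a larger one.
ENGLISH_HINTS = {
    "the", "and", "of", "to", "a", "in", "is", "for", "on", "with",
    "that", "this", "your", "you", "we", "about", "it", "are",
}


def looks_english(text, threshold=0.2):
    """Return True if at least `threshold` of the words are common English words."""
    if not isinstance(text, str):
        # Missing descriptions (e.g. NaN) are treated as non-English.
        return False
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return False
    hits = sum(word in ENGLISH_HINTS for word in words)
    return hits / len(words) >= threshold
```

Once `filtered_df` exists, the helper could be used as a boolean mask, for example `filtered_df[filtered_df['show_description'].apply(looks_english)]`.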

In [3]:
# Keep only the show_name and show_description columns, one row per show.
filtered_df = df.drop_duplicates(subset='show_name')[['show_name', 'show_description']]

In [4]:
filtered_df.head()

Unnamed: 0,show_name,show_description
0,Kream in your Koffee,A 20-something blunt female takes on the world...
1,Morning Cup Of Murder,Ever wonder what murder took place on today in...
2,Inside The 18 : A Podcast for Goalkeepers by G...,Inside the 18 is your source for all things Go...
3,Arrowhead Live!,Your favorite podcast for everything @Chiefs! ...
4,FBoL,"The comedy podcast about toxic characters, wri..."


In [5]:
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(
    stop_words='english',
    max_df=0.1,
    max_features=5000
)
X = count.fit_transform(filtered_df['show_description'].dropna().values)

from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(
    n_components=10,
    random_state=123,
    learning_method='batch'
)
X_topics = lda.fit_transform(X)

KeyboardInterrupt: 
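For context on the vectorizer settings in the cell above: `max_df=0.1` tells `CountVectorizer` to ignore terms that appear in more than 10% of the descriptions, which removes words that are too common to separate topics. The sketch below mimics that filter in plain Python on a made-up toy corpus. `filter_by_max_df` and `toy_docs` are illustrative names, and whitespace tokenization is a simplification of what scikit-learn does.

```python
from collections import Counter


def filter_by_max_df(docs, max_df):
    """Return the sorted terms whose document frequency is at most max_df."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(doc.lower().split()))
    return sorted(term for term, count in doc_freq.items()
                  if count / n_docs <= max_df)


# Hypothetical toy corpus: "podcast" and "about" occur in every document.
toy_docs = [
    "podcast about football",
    "podcast about murder",
    "podcast about business",
    "podcast about comedy",
]
vocab = filter_by_max_df(toy_docs, max_df=0.5)
```

Here the words shared by every document are dropped and only the distinguishing words remain in `vocab`.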

In [None]:
n_top_words = 5
feature_names = count.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d:" % (topic_idx + 1))
    #print(f"Topic: {topic}")
    print(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))
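The slice `[:-n_top_words - 1:-1]` in the cell above is compact but easy to misread: after an ascending sort of indices by weight, it walks backwards from the end and returns the `n_top_words` largest entries in descending order. A minimal pure-Python sketch of the same idiom follows; `top_indices`, `toy_vocab`, and `toy_weights` are made-up names and values for illustration.

```python
def top_indices(weights, n_top):
    """Return indices of the n_top largest weights, largest first."""
    # Indices sorted by ascending weight, like numpy's argsort.
    ascending = sorted(range(len(weights)), key=weights.__getitem__)
    # Walk backwards from the end, taking n_top items.
    return ascending[:-n_top - 1:-1]


toy_vocab = ["news", "sports", "music", "comedy", "crime"]
toy_weights = [0.1, 0.7, 0.05, 0.9, 0.3]
top_words = [toy_vocab[i] for i in top_indices(toy_weights, 3)]
```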

In [None]:
# Each row of X_topics is the topic distribution for one show.
# Each row sums to one (i.e. it is a probability distribution).
X_topics[0,:]
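To illustrate how such a row can be read, here is a small sketch that checks the row sums to one and returns the most probable topic as a 1-based number. `dominant_topic` and `example_row` are hypothetical; the values are invented and not taken from the model.

```python
import math


def dominant_topic(topic_row):
    """Return the 1-based number of the most probable topic in one row."""
    if not math.isclose(sum(topic_row), 1.0, abs_tol=1e-6):
        raise ValueError("row does not sum to 1")
    best_index = max(range(len(topic_row)), key=topic_row.__getitem__)
    # Topics are stored 0-based but reported 1-based in this notebook.
    return best_index + 1


# Invented distribution over three topics.
example_row = [0.05, 0.70, 0.25]
best = dominant_topic(example_row)
```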

## Inspect the topics

In [None]:
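# Allow a user to browse sample shows per topic to judge the quality
# of the LDA results.
#
# To exit, go to the menu -> Kernel -> Interrupt Kernel.
import numpy as np
from IPython.display import clear_output, display

# Dominant topic (0-based) for each row of X_topics.
X_topics_final = np.argmax(X_topics, axis=1)
# X was built from non-null descriptions only, so drop the same rows here
# to keep positional indices aligned with X_topics.
shows_with_description = filtered_df.dropna(subset=['show_description'])

while True:
    clear_output(wait=True)
    topic_num = int(input("Provide the Topic Number from 1 to 10:"))
    # Topics are displayed 1-based but stored 0-based.
    show_idx = np.where(X_topics_final == topic_num - 1)[0]
    matches = shows_with_description.iloc[show_idx]
    display(matches.sample(min(5, len(matches))))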
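The selection step inside the loop can also be written as a small pure function, which makes the conversion from a 1-based topic number to a 0-based index explicit and can be tested without the interactive prompt. This is a sketch on plain lists; `sample_shows_for_topic` and the toy data are hypothetical.

```python
import random


def sample_shows_for_topic(topic_number, dominant_topics, show_names, k=5, seed=None):
    """Return up to k show names whose dominant topic matches topic_number.

    topic_number is 1-based (as typed by the user); dominant_topics holds
    0-based topic indices, one per show, in the same order as show_names.
    """
    matches = [name for name, topic in zip(show_names, dominant_topics)
               if topic == topic_number - 1]
    rng = random.Random(seed)
    # Never ask for more samples than there are matching shows.
    return rng.sample(matches, min(k, len(matches)))


# Toy data: four shows and their (0-based) dominant topics.
names = ["Show A", "Show B", "Show C", "Show D"]
dominant = [0, 2, 0, 1]
picked = sample_shows_for_topic(1, dominant, names, k=5, seed=0)
```

Capping the sample size avoids an error when a topic has fewer than `k` shows.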


## Topics

Here are some general guesses at major topics: <br>
Topic 1: News <br>
Topic 2: Social Influencer <br>
Topic 3: Current Events <br>
Topic 4: Creatives <br>
Topic 5: Business <br>
Topic 6: Sports <br>
Topic 7: Current Events <br>
Topic 8: Sports <br>
Topic 9: Self Help <br>
Topic 10: TBD <br>
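As a quick check on these guesses, the labels can be grouped to see which ones repeat. Repeated labels might suggest that ten components is more than the descriptions support, or simply that the labels are coarse; either reading is speculative. The `topic_labels` dict below copies the list above, and the other names are illustrative.

```python
from collections import defaultdict

# Labels copied from the list above, keyed by 1-based topic number.
topic_labels = {
    1: "News", 2: "Social Influencer", 3: "Current Events", 4: "Creatives",
    5: "Business", 6: "Sports", 7: "Current Events", 8: "Sports",
    9: "Self Help", 10: "TBD",
}

# Group topic numbers under each label.
topics_by_label = defaultdict(list)
for number, label in topic_labels.items():
    topics_by_label[label].append(number)

# Keep only labels assigned to more than one topic.
repeated = {label: numbers for label, numbers in topics_by_label.items()
            if len(numbers) > 1}
```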

