## MVP Summary

The goal of this project is to build a recommendation system that, given podcast and episode descriptions from Spotify for podcasts related to mental health, recommends a similar podcast episode. Depending on user preference, the recommendation can come from the same podcast or from a different one.

After cleaning and preprocessing the data, I tried several vectorizer and topic model combinations. CountVectorizer with NMF yielded the most interpretable topics, although some are still unlabeled. The topics and their top terms:

#### Topic 1:
life, u, people, talk, get, way, share, time, love, make

#### Spiritual:
Tarot, Soul, Lindsay, healing, u, card, Tribe, Wild, work, around

#### Mental Health Support:
health, mental, Health, Mental, Join, support, issue, conversation, people, experience

#### Topic 4:
wa, year, life, time, would, first, God, going, day, could

#### Fitness/physical wellbeing:
training, athlete, fitness, CrossFit, strength, coach, Barbell, program, get, Shrugged

#### Older adults:
Dr, older, adult, show, aging, expert, interview, find, January, care

### Next Steps
I will try the spaCy library to improve the handling of proper nouns and names during preprocessing, and I will also focus on some common compound terms, in order to improve the topics. Then I will perform topic modeling on the podcast descriptions before using the final topic models in my recommendation system.
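
As a rough sketch of how the final recommender might use the topic model output: each episode is represented by its vector of topic weights, and the recommendation is the most similar other episode. Cosine similarity, the function names `cosine` and `recommend`, and the toy data are all assumptions made for illustration; they are not the finished system.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend(query_idx, topic_vectors, podcast_names, same_podcast=None):
    """Return the index of the episode most similar to query_idx.

    same_podcast: None = any podcast, True = only the same podcast,
    False = only other podcasts.
    """
    best_idx, best_score = None, -1.0
    for idx, vec in enumerate(topic_vectors):
        if idx == query_idx:
            continue  # never recommend the query episode itself
        is_same = podcast_names[idx] == podcast_names[query_idx]
        if same_podcast is not None and is_same != same_podcast:
            continue  # respect the user's same/different podcast preference
        score = cosine(topic_vectors[query_idx], vec)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx

# Toy topic weights (rows = episodes, columns = topics); hypothetical values.
toy_vectors = [
    [0.9, 0.1, 0.0],  # episode 0, podcast "A"
    [0.8, 0.2, 0.0],  # episode 1, podcast "A"
    [0.7, 0.0, 0.1],  # episode 2, podcast "B"
    [0.0, 0.1, 0.9],  # episode 3, podcast "B"
]
toy_podcasts = ["A", "A", "B", "B"]

print(recommend(0, toy_vectors, toy_podcasts))                      # 1
print(recommend(0, toy_vectors, toy_podcasts, same_podcast=False))  # 2
```

The `same_podcast` flag is one possible way to express the "same podcast or not" preference described in the summary.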

## Supplemental/Code

In [1]:
import pickle
import pandas as pd

In [2]:
import re
import string

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [4]:
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

In [5]:
from sklearn.decomposition import NMF, TruncatedSVD

In [6]:
mh_podcasts = pd.read_pickle('mh_podcasts.pkl')

In [7]:
mh_podcasts.head()

| | Podcast_Name | Ep_id | Ep_name | Ep_date | Ep_desc |
|---|---|---|---|---|---|
| 0 | (2020) Mental Health Explained \| Created By Yo... | 10JraOKEu4gb2dKQEwjhmm | Depression and Tics During Quarantine | 2020-12-16 | This episode helps explain the effects of quar... |
| 1 | Being African American in 2021 and dealing wit... | 4Vs1ajXhg5t53zHNDpM3wu | Chipping away at the mental health stigma | 2021-10-11 | The Black community has made enormous contribu... |
| 2 | Being African American in 2021 and dealing wit... | 6jFW6wq6Pafs0OLAlHVNRh | Being black in America in 2021 | 2021-10-08 | With love for seven addressing mental health i... |
| 3 | Being African American in 2021 and dealing wit... | 4F5RugIvvmb8uI5fDqPmhz | Surviving a Narcissistic breakup : The Fear an... | 2020-12-12 | Moving on and healing from an narcissistic -... |
| 4 | Being African American in 2021 and dealing wit... | 4eEe5dXg47re6BjpeyZdPx | Love and mental health 2020 | 2020-12-09 | Love - relationship, mental health and parenti... |


In [8]:
mh_podcasts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20745 entries, 0 to 20744
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Podcast_Name  20745 non-null  object
 1   Ep_id         20745 non-null  object
 2   Ep_name       20745 non-null  object
 3   Ep_date       20745 non-null  object
 4   Ep_desc       20745 non-null  object
dtypes: object(5)
memory usage: 810.5+ KB


In [9]:
podcast_names_df = pd.read_pickle('just_podcasts.pkl')

In [10]:
podcast_names_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Podcast_Name         350 non-null    object
 1   Podcast_ShowID       350 non-null    object
 2   Podcast_Description  350 non-null    object
dtypes: object(3)
memory usage: 8.3+ KB


In [11]:
podcast_names_df

| | Podcast_Name | Podcast_ShowID | Podcast_Description |
|---|---|---|---|
| 0 | (2020) Mental Health Explained \| Created By Yo... | 4pwPCZriBVbcLcufvtchsP | Hi, my name is Logan Isfeld, I am 17 years old... |
| 1 | Being African American in 2021 and dealing wit... | 4eoXzwruqyu2yAh4jYA7EM | Being black in 2021 has its own challenges and... |
| 2 | Aubrey Marcus Podcast | 0n7j2qseg6fu0Fj2dvzXVi | The Aubrey Marcus Podcast is an illuminating c... |
| 3 | Unfazed and Unbothered with Tasia and Camo | 6MZJi1fkxSbqjfQiSqC5OL | Millions of eyes watching, the pressure, the n... |
| 4 | Barbell Shrugged | 6MFeb0x9bw9wjrphztLSn9 | Shrugged Collective is a network of fitness, h... |
| ... | ... | ... | ... |
| 345 | Happy and Healthy Mind with Dr. Rozina | 5XwuvVKnlVtKNBluBl0ITY | Hello and welcome to Happy and Healthy mind wi... |
| 346 | NAH Podcast | 0muSoy4HndaTpELvVDu1iW | Hey Hey! My name is Han or Hannah. Whichever y... |
| 347 | Healthcare Insight | 5GO3DnQpENyNVJymwG8BjU | Ronald E. Bachman FSA, MAAA, CHC President & ... |
| 348 | Mental Health Education in High Schools | 2Ow2pcCGA3rcRDVxSjhI6C | Atkins et al. (2010). Toward the integration o... |


### Cleaning

In [12]:
import string
def clean_regex(series):
    # remove digits
    desc = series.apply(lambda x: re.sub(r'\d', '', x))
    # remove \xa0 from string in Python: https://stackoverflow.com/questions/10993612/how-to-remove-xa0-from-string-in-python
    desc = desc.apply(lambda x: x.replace('\xa0', ''))
    # remove the | and > symbols together with everything that follows them on the line
    desc = desc.apply(lambda x: re.sub(r'\|.+', '', x))
    desc = desc.apply(lambda x: re.sub(r'\>.+', '', x))
    # remove punctuation
    desc = desc.apply(lambda x: re.sub('[%s]' % re.escape(string.punctuation), '', x))
    # remove websites and the info that comes after them (seems like sponsorship)
    desc = desc.apply(lambda x: re.sub(r'http.+', '', x))
    desc = desc.apply(lambda x: re.sub(r'www.+', '', x))
    # intended to add a space before capital letters where none exists (some words are run together); see https://stackoverflow.com/questions/199059/a-pythonic-way-to-insert-a-space-before-capital-letters
    # NOTE: the replacement r"\1" puts each matched letter back unchanged, so this step currently leaves the text as-is
    desc = desc.apply(lambda x: re.sub(r"([A-Z])(?![A-Z])", r"\1", x))
    return desc
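
The last substitution in `clean_regex` is meant to separate words that were run together, but its replacement is `\1` alone, so no space is actually inserted. A substitution that does insert one could look like the sketch below. The helper name `split_fused_words` and the pattern are assumptions for illustration, not what the cell above runs.

```python
import re

def split_fused_words(text):
    # Zero-width match: any position that follows a lowercase letter
    # and precedes an uppercase letter gets a single space inserted.
    return re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', text)

print(split_fused_words('healthJoin us'))  # health Join us
print(split_fused_words('ADHD'))           # ADHD (all-caps tokens are untouched)
print(split_fused_words('CrossFit'))       # Cross Fit
```

The trade-off is visible in the last line: camel-cased brand names such as CrossFit are split as well, so such names might need to be protected before applying a rule like this.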

In [13]:
cleaned = clean_regex(mh_podcasts.Ep_desc)

In [14]:
#helper function to lowercase words that are not identified as proper nouns, so can use stemmer/lemmatizer
def lowercase(single_desc):
    tokens = pos_tag(word_tokenize(single_desc))
    edited_words = []
    for item in tokens:
        if (item[1] != 'NNP') and (item[1] != 'NNPS'):
            edited_words.append(item[0].lower())
        else:
            edited_words.append(item[0])
    return edited_words
    

In [15]:
cleaned_lower = cleaned.map(lowercase)

In [16]:
cleaned_lower

0        [this, episode, helps, explain, the, effects, ...
1        [the, Black, community, has, made, enormous, c...
2        [with, love, for, seven, addressing, mental, h...
3        [moving, on, and, healing, from, an, narcissis...
4        [Love, relationship, mental, health, and, pare...
                               ...                        
20740    [by, PR, Sarkar, founder, of, Ananda, MargaDis...
20741    [by, PR, Sarkar, founder, of, Ananda, MargaDis...
20742    [by, PR, Sarkar, founder, of, Ananda, MargaPub...
20743    [Discourse, given, by, Prabhat, Ranjan, Sarkar...
20744    [Discourse, given, by, Prabhat, Ranjan, Sarkar...
Name: Ep_desc, Length: 20745, dtype: object

In [17]:
from nltk.stem import WordNetLemmatizer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer 
 

In [18]:
lemmatizer = WordNetLemmatizer()
pstemmer = PorterStemmer()
lstemmer = LancasterStemmer()
sstemmer = SnowballStemmer("english")

In [19]:
#comparing
for word in cleaned_lower[23]:
    lemmed = lemmatizer.lemmatize(word)
    pstemmed = pstemmer.stem(word)
    lstemmed = lstemmer.stem(word)
    sstemmed = sstemmer.stem(word)
    print(lemmed)
    print(pstemmed)
    print(lstemmed)
    print(sstemmed)
      

Alan
alan
al
alan
Stern
stern
stern
stern
is
is
is
is
a
a
a
a
planetary
planetari
planet
planetari
scientist
scientist
sci
scientist
and
and
and
and
astronautic
astronaut
astronaut
astronaut
engineer
engin
engin
engin
Alan
alan
al
alan
is
is
is
is
the
the
the
the
chief
chief
chief
chief
Exploration
explor
expl
explor
Officer
offic
off
offic
of
of
of
of
World
world
world
world
View
view
view
view
a
a
a
a
company
compani
company
compani
that
that
that
that
is
is
is
is
pioneering
pioneer
pion
pioneer
the
the
the
the
stratocraft
stratocraft
stratocraft
stratocraft
industry
industri
industry
industri
Their
their
their
their
mission
mission
miss
mission
is
is
is
is
to
to
to
to
allow
allow
allow
allow
people
peopl
peopl
peopl
an
an
an
an
unparalleled
unparallel
unparallel
unparallel
experience
experi
expery
experi
from
from
from
from
the
the
the
the
edge
edg
edg
edg
of
of
of
of
space
space
spac
space
using
use
us
use
breakthrough
breakthrough
breakthrough
breakthrough
in
in
in
in
helium
heliu

In [20]:
#given a list of words (each item in our 'cleaned' list), lemmatize each word 
def lem(words):
    new_list = [lemmatizer.lemmatize(word) for word in words]
    return new_list
    

In [21]:
#will go with lemmatizer first (more conservative approach)
cleaned_ll = cleaned_lower.map(lem)

In [22]:
cleaned_ll

0        [this, episode, help, explain, the, effect, of...
1        [the, Black, community, ha, made, enormous, co...
2        [with, love, for, seven, addressing, mental, h...
3        [moving, on, and, healing, from, an, narcissis...
4        [Love, relationship, mental, health, and, pare...
                               ...                        
20740    [by, PR, Sarkar, founder, of, Ananda, MargaDis...
20741    [by, PR, Sarkar, founder, of, Ananda, MargaDis...
20742    [by, PR, Sarkar, founder, of, Ananda, MargaPub...
20743    [Discourse, given, by, Prabhat, Ranjan, Sarkar...
20744    [Discourse, given, by, Prabhat, Ranjan, Sarkar...
Name: Ep_desc, Length: 20745, dtype: object

In [23]:

from nltk.corpus import stopwords
default_stop = stopwords.words('english')
custom_stop = ["Twitter", "Instagram", "follow", "Youtube", "Spotify", "check", 'help', 'ha', 'episode', 'thing', "YouTube", "podcasting", "like", "one", "podcast", "also"]
#my full list of stop words
full_list = default_stop + custom_stop 
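
One thing to watch with this list: the vectorizers below are created with `lowercase=False`, so stop words are matched case-sensitively, and capitalized forms such as `A` or `ABOUT` can survive into the vocabulary. A minimal sketch of one way to cover the common casings follows. The helper `expand_stop_words` and the mini stop list standing in for `full_list` are hypothetical.

```python
def expand_stop_words(stop_words):
    """Return each stop word together with its lowercase, Capitalized and UPPERCASE forms."""
    expanded = set()
    for word in stop_words:
        expanded.update({word, word.lower(), word.capitalize(), word.upper()})
    return sorted(expanded)

# Hypothetical mini stop list standing in for full_list.
mini_stop = ["a", "about", "the", "YouTube"]
expanded = expand_stop_words(mini_stop)

tokens = ["A", "talk", "ABOUT", "the", "Tarot"]
kept = [t for t in tokens if t not in expanded]
print(kept)  # ['talk', 'Tarot']
```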

## Vectorizer

Trying both CountVectorizer (CV) and TF-IDF.

In [24]:
corpus = cleaned_ll.apply(lambda x: " ".join(x))

In [25]:
cv = CountVectorizer(stop_words=full_list, min_df=2, max_df=0.7, lowercase=False, token_pattern=r'(?u)\b[A-Za-z]+\b', max_features=10000)

In [26]:
doc_term = cv.fit_transform(corpus)

In [27]:
dtm = pd.DataFrame(doc_term.toarray(), columns = cv.get_feature_names_out())
dtm

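| | A | AA | ABC | ABOUT | ACL | ADD | ADDebrief | ADHD | AI | AIDS | ... | youre | youseason | youth | youtube | youtubeleHOQHrNs | youve | yr | zero | zone | zoom |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20740 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 20741 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 20742 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 20743 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |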


In [28]:
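# note: vocabulary_ maps each term to its column index (not its count), so this orders terms by index, last column first
top_words_dict = {k: v for k, v in sorted(cv.vocabulary_.items(), key=lambda x: x[1], reverse=True)}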

In [29]:
top_words_dict

{'zoom': 9999,
 'zone': 9998,
 'zero': 9997,
 'yr': 9996,
 'youve': 9995,
 'youtubeleHOQHrNs': 9994,
 'youtube': 9993,
 'youth': 9992,
 'youseason': 9991,
 'youre': 9990,
 'youngest': 9989,
 'younger': 9988,
 'young': 9987,
 'youll': 9986,
 'youin': 9985,
 'youhow': 9984,
 'youd': 9983,
 'youarenottobusypodcast': 9982,
 'yogi': 9981,
 'yoga': 9980,
 'yet': 9979,
 'yesterday': 9978,
 'yes': 9977,
 'yep': 9976,
 'yelling': 9975,
 'yearold': 9974,
 'year': 9973,
 'yeah': 9972,
 'yang': 9971,
 'yall': 9970,
 'ya': 9969,
 'xx': 9968,
 'xoxochristinechen': 9967,
 'xoxo': 9966,
 'xmo': 9965,
 'x': 9964,
 'wurrang': 9963,
 'wrote': 9962,
 'wrong': 9961,
 'written': 9960,
 'writing': 9959,
 'writes': 9958,
 'writer': 9957,
 'write': 9956,
 'wrist': 9955,
 'wrestling': 9954,
 'wrestler': 9953,
 'wrestled': 9952,
 'wrapping': 9951,
 'wrapped': 9950,
 'wrap': 9949,
 'wow': 9948,
 'woven': 9947,
 'wounded': 9946,
 'wound': 9945,
 'wouldnt': 9944,
 'would': 9943,
 'worthy': 9942,
 'worth': 9941,
 'w
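
To rank terms by how often they occur in the corpus, the column sums of the document-term matrix are needed, since `vocabulary_` stores column indices rather than counts. A minimal pure-Python sketch with a hypothetical three-term vocabulary and toy counts:

```python
# Hypothetical stand-ins: term -> column index, and a tiny count matrix
# (rows = documents, columns = terms in index order).
vocabulary = {"health": 0, "tarot": 1, "zoom": 2}
counts = [
    [2, 0, 0],
    [1, 3, 0],
    [4, 0, 1],
]

# Sum each column to get the corpus-wide frequency of each term.
column_totals = [sum(col) for col in zip(*counts)]

# Pair every term with its total and sort by frequency, highest first.
term_totals = {term: column_totals[idx] for term, idx in vocabulary.items()}
top_terms = sorted(term_totals.items(), key=lambda kv: kv[1], reverse=True)
print(top_terms)  # [('health', 7), ('tarot', 3), ('zoom', 1)]
```

With the real objects, the column totals could be obtained with `np.asarray(doc_term.sum(axis=0)).ravel()` (after `import numpy as np`) and paired with `cv.get_feature_names_out()`.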

In [30]:
#TF-IDF
tfidf_vec = TfidfVectorizer(stop_words=full_list, min_df=2, max_df=0.7, lowercase=False, token_pattern=r'(?u)\b[A-Za-z]+\b', max_features=10000)

In [31]:
doc_term_tfidf = tfidf_vec.fit_transform(corpus)

In [32]:
dtm_tfidf = pd.DataFrame(doc_term_tfidf.toarray(), columns = tfidf_vec.get_feature_names_out())

dtm_tfidf

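| | A | AA | ABC | ABOUT | ACL | ADD | ADDebrief | ADHD | AI | AIDS | ... | youre | youseason | youth | youtube | youtubeleHOQHrNs | youve | yr | zero | zone | zoom |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20740 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 20741 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 20742 | 0.124303 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 20743 | 0.141641 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |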


## Topic Modeling

Trying NMF first, using the vectorized data from CountVectorizer and TF-IDF Vectorizer (to compare)

In [37]:
#NMF 
nmf_act = NMF(6, init = 'nndsvda')

In [38]:
nmf = nmf_act.fit(dtm)

In [39]:
# Function to display the top n terms in each topic- sourced from Metis
def display_topics(model, feature_names, no_top_words, topic_names = None): 
    for ix, topic in enumerate(model.components_):
        if not topic_names or not topic_names[ix]:
            print("\nTopic ", ix + 1)
        else:
            print("\nTopic: ", topic_names[ix])
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
    print("\n")
    return model, feature_names, no_top_words
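
The slice `topic.argsort()[:-no_top_words - 1:-1]` is what picks the top terms: `argsort` returns indices in ascending order of weight, and the slice walks backwards from the end for `no_top_words` steps. A small pure-Python sketch of the same selection, with weights and term names made up for illustration:

```python
weights = [0.1, 0.7, 0.0, 0.4, 0.9]  # one row of model.components_ (toy values)
feature_names = ["get", "tarot", "zoom", "soul", "card"]
n = 3

# Indices sorted by ascending weight, which is what numpy's argsort returns.
ascending = sorted(range(len(weights)), key=lambda i: weights[i])

# Same slice as in display_topics: the last n indices, in reverse order.
top_idx = ascending[:-n - 1:-1]
print(top_idx)                                        # [4, 1, 3]
print(", ".join(feature_names[i] for i in top_idx))   # card, tarot, soul
```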

In [40]:
# output contents for each topic - Count Vectorizer with NMF

output = display_topics(nmf, cv.get_feature_names_out(), 10)
output;


Topic  1
life, u, people, talk, get, way, share, time, love, make

Topic  2
Tarot, Soul, Lindsay, healing, u, card, Tribe, Wild, work, around

Topic  3
health, mental, Health, Mental, Join, support, issue, conversation, people, experience

Topic  4
wa, year, life, time, would, first, God, going, day, could

Topic  5
training, athlete, fitness, CrossFit, strength, coach, Barbell, program, get, Shrugged

Topic  6
Dr, older, adult, show, aging, expert, interview, find, January, care




Previewing the episode descriptions most strongly associated with a topic (here topic 2, Spiritual):

In [41]:
doc_topic = nmf.transform(dtm)

In [43]:
doc_topic_df = pd.DataFrame(doc_topic.round(3), columns = [1,2,3,4,5,6])

In [44]:
doc_topic_df[doc_topic_df[2] > 1]

| | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 6431 | 0.057 | 1.258 | 0.013 | 0.036 | 0.036 | 0.000 |
| 6432 | 0.038 | 1.191 | 0.013 | 0.016 | 0.000 | 0.000 |
| 6434 | 0.015 | 1.188 | 0.011 | 0.092 | 0.000 | 0.000 |
| 6436 | 0.029 | 1.242 | 0.007 | 0.016 | 0.008 | 0.000 |
| 6439 | 0.033 | 1.244 | 0.007 | 0.005 | 0.000 | 0.000 |
| ... | ... | ... | ... | ... | ... | ... |
| 6553 | 0.025 | 1.613 | 0.011 | 0.021 | 0.031 | 0.000 |
| 6554 | 0.002 | 1.391 | 0.005 | 0.029 | 0.000 | 0.000 |
| 6555 | 0.024 | 1.479 | 0.003 | 0.090 | 0.018 | 0.000 |
| 6558 | 0.000 | 1.443 | 0.000 | 0.014 | 0.000 | 0.004 |
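
The cutoff of 1 used above depends on this topic's weight scale, since NMF document weights are not normalized across topics. Another way to preview a topic is to take its n highest-weighted documents. A sketch with toy weights follows; `top_docs_for_topic` is a hypothetical helper.

```python
def top_docs_for_topic(doc_topic_rows, topic_col, n=3):
    """Return (doc_index, weight) pairs for the n largest weights in one topic column."""
    weights = [(idx, row[topic_col]) for idx, row in enumerate(doc_topic_rows)]
    return sorted(weights, key=lambda pair: pair[1], reverse=True)[:n]

# Toy document-topic weights (rows = documents, columns = topics).
toy_doc_topic = [
    [0.057, 1.258],
    [0.300, 0.010],
    [0.025, 1.613],
    [0.002, 1.391],
]
print(top_docs_for_topic(toy_doc_topic, topic_col=1, n=2))  # [(2, 1.613), (3, 1.391)]
```

With the DataFrame above, `doc_topic_df[2].nlargest(5)` would give the same kind of ranking for topic 2.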


In [46]:
ep_corpus = mh_podcasts.Ep_desc.tolist()

In [47]:
ep_corpus[6431]

'Welcome to a new Weekly Medicine Minisode, Wild Souls! This week, we are working with Four of Cups and Queen of Cups.   With Four of Cups, we are being called to review and reflect, to honor the process of sacred digestion. We cannot pick up and truly drink from that fourth cup, making room for the new cycles and experiences of life, until we have fully integrated what has been. This is a gentle death process, a welcome release of something that we no longer need to carry.   With Queen of Cups, we are getting a lot of support on how to both hold this work AND the duties and responsibilities of our earthly day to day. How do we make space for our emotional digestion, our intuitive clearing, and our dishes, deadlines, kids, and creation time? By touching in with this archetype, we can begin to discover how this kind of both/and work is possible. _________  ABOUT THE PODCAST\xa0 Tarot for the Wild Soul Podcast explores the cards through an inclusive, trauma informed perspective, rooted i

Continuing baseline topic modeling attempts:

In [48]:
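# fit a fresh NMF instance: fit() returns the estimator itself, so refitting nmf_act would also overwrite `nmf`
nmf_tfidf = NMF(6, init = 'nndsvda').fit(dtm_tfidf)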

In [49]:
# output contents for each topic - TF-IDF Vectorizer with NMF
output = display_topics(nmf_tfidf, tfidf_vec.get_feature_names_out(), 10)
output;


Topic  1
date, show, David, created, Kato, note, today, Video, Audio, join

Topic  2
life, u, talk, wa, way, time, get, people, make, share

Topic  3
FULL, EP, Abbie, guest, host, Nat, edge, FIRST, Jacob, w

Topic  4
Sanctuary, Buddhist, Meditation, guided, Awareness, open, HMR, Aggacitta, Mindful, Hokkien

Topic  5
Official, Site, Visit, ad, megaphonefmadchoices, choice, Learn, Joe, Camo, Follow

Topic  6
mental, health, Join, Health, Bobby, factor, Mental, delve, condition, Temps




Trying LSA (via TruncatedSVD) now, using the vectorized data from CountVectorizer and TF-IDF Vectorizer (to compare)

In [50]:
#LSA 
lsa_act = TruncatedSVD(n_components=4, n_iter=8)

In [51]:
lsa = lsa_act.fit(dtm)

In [52]:
# output contents for each topic - Count Vectorizer with LSA

output = display_topics(lsa, cv.get_feature_names_out(), 10)
output;


Topic  1
wa, life, u, health, people, time, mental, get, talk, year

Topic  2
Tarot, Soul, Lindsay, healing, card, u, Tribe, Wild, around, called

Topic  3
health, mental, Dr, older, Health, adult, Mental, show, care, aging

Topic  4
wa, mental, health, year, older, family, Health, adult, Tarot, Mental




In [53]:
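# fit a fresh TruncatedSVD instance so the count-based model in `lsa` is not overwritten
lsa_tfidf = TruncatedSVD(n_components=4, n_iter=8).fit(dtm_tfidf)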

In [54]:
# output contents for each topic - TF-IDF Vectorizer with LSA
output = display_topics(lsa_tfidf, tfidf_vec.get_feature_names_out(), 10)
output;


Topic  1
date, show, life, u, talk, health, mental, get, wa, share

Topic  2
life, u, health, mental, talk, wa, share, way, time, get

Topic  3
FULL, EP, Abbie, Nat, host, guest, edge, FIRST, Jacob, w

Topic  4
Sanctuary, Buddhist, Meditation, guided, Awareness, open, HMR, Aggacitta, Mindful, Hokkien


