## Netflix Movie/Shows - Topic Modeling
> <span style='font-family:cursive'> If a group of people were asked to guess the genre(s) of the same movie/TV show, do you think the answers would be the same?</span>

In this notebook, we'll analyze the Netflix dataset (https://www.kaggle.com/shivamb/netflix-shows) to create topics/groups based on title descriptions. Given a new movie, can we predict its topic probabilities? (This would be useful for assigning genres, recommending movies/shows, etc.)

<b>Note</b>: The summaries/views presented in this notebook are based on the above dataset and should NOT be confused with the current Netflix catalog.

### Index
1. [Load Dataset](#section1)
2. [Interactive Visualizations](#section2)
3. [Topic Modeling](#section3)
4. [Word Embedding](#section4)

<a id="section1"></a>
### Load dataset

In [None]:
#importing libraries
import pandas as pd, numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from IPython.display import IFrame
import warnings
%matplotlib inline

warnings.filterwarnings('ignore')

df = pd.read_csv("../input/netflix-shows/netflix_titles.csv")

#let's look at the profile report
from pandas_profiling import ProfileReport
import panel as pn
ProfileReport(df, title="Pandas Profiling Report")

<a id="section2"></a>
### Visualizations (Interactive)

Let's do some basic analysis before proceeding to topic modeling

In [None]:
#plotly libraries
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.subplots import make_subplots
import plotly.graph_objs as go
init_notebook_mode(connected=True)
import cufflinks as cf
cf.go_offline()
import plotly.express as px
import plotly
cols = plotly.colors.DEFAULT_PLOTLY_COLORS

#number of shows/movies across countries
from tabulate import tabulate
import seaborn as sns

def split(df,col):
    arr = []
    for index, row in df[df[col].notnull()].iterrows():
        txt = row[col].split(",")
        for i in txt:
            if i!="":
                arr.append([i.strip(),row["type"],row["show_id"]])
    
    return arr

#cleaning the country and listed_in columns, as each cell can hold multiple comma-separated values
df_country = pd.DataFrame(split(df,"country"), columns=["country","type","show_id"])
grp = df_country.groupby(["country"])["show_id"].count().reset_index()
lst = list(grp.sort_values("show_id",ascending=False).head(10)["country"])
grp_country = df_country[df_country["country"].isin(lst)].groupby(["country","type"]) \
.count().reset_index().sort_values(["country","type"])

df_listedIn = pd.DataFrame(split(df,"listed_in"), columns=["listed_in","type","show_id"])
grp = df_listedIn.groupby(["listed_in"])["show_id"].count().reset_index()
lst = list(grp.sort_values("show_id",ascending=False).head(10)["listed_in"])
grp_listedIn = df_listedIn[df_listedIn["listed_in"].isin(lst)].groupby(["listed_in","type"]) \
.count().reset_index().sort_values(["listed_in","type"])

fig = make_subplots(rows=1, cols=2)
grp = grp_country[grp_country["type"]=="Movie"]
fig.add_trace(
    go.Bar(x=grp["country"], y=grp["show_id"],name='Movies',marker_color=cols[0]),
    row=1, col=1
)
grp = grp_country[grp_country["type"]=="TV Show"]
fig.add_trace(
    go.Bar(x=grp["country"], y=grp["show_id"],name='Series',marker_color=cols[1]),
    row=1, col=1
)
grp = grp_listedIn[grp_listedIn["type"]=="Movie"]
fig.add_trace(
    go.Bar(x=grp["listed_in"], y=grp["show_id"],showlegend=False,marker_color=cols[0]),
    row=1, col=2
)
grp = grp_listedIn[grp_listedIn["type"]=="TV Show"]
fig.add_trace(
    go.Bar(x=grp["listed_in"], y=grp["show_id"],showlegend=False,marker_color=cols[1]),
    row=1, col=2
)
fig.update_layout(template="plotly",title="Movies/Series across different Countries and Genres")
iplot(fig)
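
The `split` helper above turns each comma-separated `country` / `listed_in` cell into one row per value before counting. A minimal standard-library sketch of the same idea on a few made-up rows (the sample rows and the `explode_counts` name are illustrative assumptions, not part of the dataset or the notebook):

```python
from collections import Counter

# Hypothetical rows mimicking the (show_id, type, country) columns.
rows = [
    ("s1", "Movie", "United States, India"),
    ("s2", "TV Show", "India"),
    ("s3", "Movie", "United States,, Canada"),
    ("s4", "Movie", None),  # missing values are skipped
]

def explode_counts(rows):
    """Count titles per (value, type) after splitting comma-separated cells."""
    counts = Counter()
    for _show_id, show_type, cell in rows:
        if cell is None:
            continue
        for value in cell.split(","):
            value = value.strip()
            if value:  # ignore empty fragments such as ",,"
                counts[(value, show_type)] += 1
    return counts

counts = explode_counts(rows)
print(counts.most_common(3))
```

Because a title can list several countries or genres, the per-value counts add up to more than the number of titles.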

In [None]:
dff = pd.merge(df_country,df[['release_year','show_id']],how='inner',on='show_id')
dff = dff[(dff['release_year'].notnull())&(dff['release_year'] < 2020)&(dff['release_year'] > 2000)]
lst = list(dff.groupby('country')['show_id'].count().reset_index().sort_values(by='show_id',ascending=False).head(10)['country'])
dff = dff[dff['country'].isin(lst)].groupby(['release_year','country','type'])['show_id'].count().reset_index().sort_values(by=["release_year","show_id"],ascending=[True,False])
dff.columns = ['Years', 'Country', 'Type', 'Count']
fig1 = px.line(dff, x='Years', y='Count', color='Country',facet_col='Type')
fig1.update_layout(title='Movie/Shows released across years', xaxis=dict(title=''), xaxis2=dict(title=''))
fig1.for_each_annotation(lambda a: a.update(text=a.text.replace("Type=", "")))

dff = pd.merge(df_country,df[['date_added','show_id']],how='inner',on='show_id')
dff['date_added'] = pd.to_datetime(dff['date_added'])
dff = dff[(dff['date_added'].notnull())&(dff['date_added'] < '1/1/2020')&(dff['date_added'] > '1/1/2000')]
dff = dff[dff['country'].isin(lst)].groupby([dff.date_added.dt.year,'country','type'])['show_id'].count().reset_index().sort_values(by=["date_added","show_id"],ascending=[True,False])
dff.columns = ['Years', 'Country', 'Type', 'Count']

fig2 = px.line(dff, x='Years', y='Count', color='Country',facet_col='Type')
fig2.update_layout(title='Movie/Shows added to Netflix', xaxis=dict(title=''), xaxis2=dict(title=''))
fig2.for_each_annotation(lambda a: a.update(text=a.text.replace("Type=", "")))
fig1.show()
fig2.show()

In [None]:
dff = pd.merge(df_country,df_listedIn[["listed_in","show_id"]], how='inner',on='show_id')
dff = dff.groupby(['country','listed_in','type'])['show_id'].count().reset_index()
lst = ['United States', 'India', 'United Kingdom', 'Canada']

dff = dff.sort_values(by=['country', 'type','show_id'],ascending=[False, False, False])

fig1 = px.bar(dff[(dff['country'].isin(lst))&(dff['type']=='Movie')], x='listed_in', y='show_id', facet_col='country',color='country')
fig1.update_layout(title='', xaxis=dict(title=''), xaxis2=dict(title=''), xaxis3=dict(title=''), 
                   xaxis4=dict(title=''), xaxis5=dict(title=''), yaxis=dict(title=''), showlegend=False)
fig1.for_each_annotation(lambda a: a.update(text=a.text.replace("country=", "")))

fig2 = px.bar(dff[(dff['country'].isin(lst))&(dff['type']=='TV Show')], x='listed_in', y='show_id', facet_col='country',color='country')
fig2.update_layout(title='', xaxis=dict(title=''), xaxis2=dict(title=''), xaxis3=dict(title=''), 
                   xaxis4=dict(title=''),xaxis5=dict(title=''), yaxis=dict(title=''), showlegend=False)
fig2.for_each_annotation(lambda a: a.update(text=""))
fig1.update_layout(
    autosize=False,
    width=780,
    height=400,)
fig2.update_layout(
    autosize=False,
    width=780,
    height=400,)
fig1.show()
fig2.show()

#### Summary (inferred from the above plots):

- The United States and India account for more titles than the other countries
- International movies/series, dramas and comedies seem to be the most common genres
- The growth of the Netflix catalog seems most prominent after 2016, and most titles are recent releases (2015 and beyond)
- Documentaries and family movies are also well represented among the countries with the most titles
- Cult classic movies/shows and sci-fi shows are comparatively rare

The dataset counts titles, not viewers, so this is only an indirect taste of the likes/dislikes of the Netflix audience: it shows what Netflix chooses to offer.

<a id="section3"></a>
### Topic Modeling (LDA)

We are going to train an LDA (Latent Dirichlet Allocation) model on the title descriptions and use it to determine the topic(s) of "unseen" descriptions.

<b>Steps:</b>  
- Train/Test Split
    - We need to split the data into train/test sets in order to test our LDA model
    - 90:10 - Split "descriptions"
- Preprocessing
    - Text cleaning - remove special characters, punctuation etc.
    - Remove stop-words
    - Create bigram/trigram tokens
    - Lemmatization
- Training and Evaluation (Coherence/Perplexity)
- Test on the "unseen" data
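
The split and cleaning steps above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the tiny `STOP` set stands in for gensim's `STOPWORDS`, the sample sentences are made up, and `clean_tokens` / `train_test_split` are hypothetical helper names. Bigram detection and lemmatization are left out.

```python
import random
import re
from collections import Counter

# Tiny illustrative stop-word list; the notebook uses gensim's STOPWORDS.
STOP = {"a", "an", "the", "and", "of", "in", "to", "is", "his", "her"}

def clean_tokens(text, min_len=5):
    """Lowercase, keep alphabetic tokens, drop stop-words and short words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP and len(t) >= min_len]

def train_test_split(items, test_frac=0.10, seed=0):
    """Shuffle indices reproducibly and hold out `test_frac` of the items."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    n_test = round(test_frac * len(items))
    test = [items[i] for i in idx[:n_test]]
    train = [items[i] for i in idx[n_test:]]
    return train, test

docs = [f"A detective hunts the killer in city number {i}" for i in range(20)]
train, test = train_test_split(docs)
bow = Counter(clean_tokens(train[0]))  # bag-of-words counts for one document
print(len(train), len(test), bow)
```

Fixing the seed makes the held-out set reproducible, which helps when comparing models across runs.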

In [None]:
#importing necessary libraries/functions
from nltk.util import ngrams
import nltk, re, string, gensim
from nltk.corpus import stopwords, wordnet as wn
from nltk.stem import WordNetLemmatizer 
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from collections import defaultdict
from gensim import corpora
from nltk.stem import SnowballStemmer

ps = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer() 

tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV


#Words from corpus - dictionary 
words = set(nltk.corpus.words.words())


#Text cleaning - tokenization, remove special characters, punctuation, non-dictionary words etc.
def txt_clean(txt):
    tokens = nltk.word_tokenize(txt.lower())
    tokens_clean = [w for w in tokens if w.isalpha() and w in words] 
    return tokens_clean

#create unigram/bigram/trigram tokens, remove stopwords and short words, lemmatize
#(despite the name, no stemming is applied)
def lemmatize_stem(tokens, ngram_type=None):
    
    bigram = gensim.models.Phrases(tokens, min_count=2, threshold=100)
    bigram_tokens = [bigram[doc] for doc in tokens]
    #the trigram model is trained on bigram-merged documents, so it is applied to them as well
    trigram = gensim.models.Phrases(bigram[tokens], threshold=100)
    trigram_tokens = [trigram[bigram[doc]] for doc in tokens]
    
    tokens_clean = []
    if ngram_type == "bigram":
        tokens_c = bigram_tokens
    elif ngram_type == "trigram":
        tokens_c = trigram_tokens
    else:
        tokens_c = tokens

    #process every document so that the corpus stays aligned with txt_data
    for txt in tokens_c:
        #keep tokens of at least 5 characters that are not stopwords
        txt_above5 = [k for k in txt if len(k)>=5 and k not in gensim.parsing.preprocessing.STOPWORDS]
        #lemmatize each token using its part-of-speech tag
        lemma_txt = [lemmatizer.lemmatize(w,pos=tag_map[tg[0]]) for w,tg in nltk.pos_tag(txt_above5)]
        tokens_clean.append(lemma_txt)

    dictionary = corpora.Dictionary(tokens_clean)
    corpus = [dictionary.doc2bow(text) for text in tokens_clean]
    
    return dictionary, corpus, tokens_clean

unseen_len = int(round(0.10 * len(df),0))
unseen_data = df["description"].sample(unseen_len) #random sample
txt_data = df["description"].drop(unseen_data.index)

#Tokenize and clean
tokens = list(txt_data.apply(lambda x: txt_clean(x)))

#Chose bigram - it had the best performance (also tried unigram and trigram)
dictionary, corpus, tokens_clean = lemmatize_stem(tokens, "bigram")

from gensim.models import CoherenceModel

#Choosing best parameter and topics
topics_arr = [20, 40, 60, 80]
learning_decay = [0.5, 0.7, 0.9]
minimum_probability= [0.01, 0.05, 0.08]

def best_params(topic_num, min_probability, decay):
    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = topic_num, id2word=dictionary, passes=15, minimum_probability=min_probability, decay=decay)
    # Compute Coherence Score
    coherence_model_lda = CoherenceModel(model=ldamodel, texts = tokens_clean, dictionary=dictionary, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    return [topic_num, min_probability, decay, coherence_lda]


#Parameter Tuning - this'll take some time to run
params = []
for t in topics_arr:
    for p in minimum_probability:
        for l in learning_decay:
            val = best_params(t,p,l)
            params.append(val)


#Chosen from the above function (The number of topics and decay shouldn't be too high or too low)
topic_num = 40
min_probability = 0.05
learning_decay = 0.5


ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = topic_num, id2word=dictionary, passes=15, minimum_probability=min_probability, decay=learning_decay)

topics = ldamodel.print_topics(num_words=10)

#Evaluation
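#Perplexity - how well the model predicts a corpus (lower is better). log_perplexity returns the per-word
#likelihood bound, and perplexity = 2^(-bound); it is computed here on the training corpus, not on unseen data.
bound = ldamodel.log_perplexity(corpus)
print('Per-word bound: ', bound, ' Perplexity: ', 2 ** (-bound))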

#Coherence - measure of the degree of semantic similarity between high scoring words in each topic (and then average across topics)
coherence_model_lda = CoherenceModel(model=ldamodel, texts = tokens_clean, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)
print("\nSample of Topics:")
for i,j in ldamodel.show_topics(formatted=True,num_words= 10):
    print("Topic-{} => {}".format(i,j))
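
gensim's `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself; gensim reports the corresponding perplexity estimate as `2 ** (-bound)`. A small sketch of that conversion, where the bound values are made up for illustration and `perplexity_from_bound` is a hypothetical helper name:

```python
def perplexity_from_bound(per_word_bound):
    """Perplexity estimate as gensim reports it: 2 ** (-bound)."""
    return 2.0 ** (-per_word_bound)

# Hypothetical bounds for two models: a bound closer to zero means lower perplexity.
for bound in (-8.0, -7.5):
    print(f"bound={bound:.1f} -> perplexity={perplexity_from_bound(bound):.1f}")
```

When comparing candidate models, the one with the bound closer to zero (lower perplexity) fits the evaluated corpus better, although this does not always agree with the coherence score.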

In [None]:
#Description-topic distributions for our training set - lists the top 4 keywords and the dominant topic for each description
arr = []
for i, j in enumerate(ldamodel[corpus]):
    if len(j) > 0:
        #dominant topic = the (topic_id, probability) pair with the highest probability
        max_topic, max_val = max(j, key=lambda w: w[1])
        keywords = [k[0] for k in ldamodel.show_topic(max_topic, topn=4)]
        description = txt_data.iloc[i]
        arr.append([description, ",".join(keywords), max_topic, round(max_val,2)])

lda_distribution = pd.DataFrame(arr, columns=['Description', 'Top Keywords', 'Dominant Topic', 'Probability'])
lda_distribution.head()

In [None]:
#interactive word visualization
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary, sort_topics=False)
pyLDAvis.save_html(lda_display, 'lda.html')

#Not displaying the output - this for some reason messes up other visualizations

In [None]:
#let's predict the topics for unseen data
unseen_clean = unseen_data.apply(lambda x: txt_clean(x))
    
arr = []
for i in unseen_clean:
    lemma_txt = [lemmatizer.lemmatize(w,pos=tag_map[tg[0]]) for w,tg in nltk.pos_tag(i)]
    lemma_txt2 = [w for w in lemma_txt if w not in gensim.parsing.preprocessing.STOPWORDS]
    arr2 = ldamodel[dictionary.doc2bow(lemma_txt2)]
    #keep the three most probable topics, ordered by descending probability
    top3 = sorted(arr2, key=lambda x: x[1], reverse=True)[:3]
    sel_arr2 = [v for pair in top3 for v in pair]
    #pad with "None" when fewer than three topics pass minimum_probability
    sel_arr2.extend(["None"]*(6-len(sel_arr2)))
    
    sel_arr2.extend([i])
    arr.append(sel_arr2)

unseen_df = pd.DataFrame(arr, columns = ["topic_1", "topic_1_prob", "topic_2", "topic_2_prob", "topic_3", "topic_3_prob", 'Tokens']).round(2)
unseen_df["Description"] = list(unseen_data)
cols = list(unseen_df.columns)
cols = cols[-1:] + [cols[-2]] + cols[:-2]
unseen_df = unseen_df[cols]
unseen_df.head()

#### Summary

We've trained the model and generated predictions for our test set. We've also generated topic distributions for each description in our training set. Let's see what we've got:

- The number of topics is 40, so the 5k rows (train set) are each represented by probabilities over 40 topics; an "unseen" description can likewise be represented as a mixture of "n" of those 40 topics. The topics are not distinct, though: they overlap, so we can't give each one a clear label (check out the sample topics and try to tell them apart)
- The coherence score is good, but for the unseen data the predicted topic probabilities are low, and the word overlap between those topics is high (as we saw in the bullet above)
- These per-description topic probability distributions can be used as features in a clustering algorithm to see which topics can be grouped together, or in predictive modeling to predict ratings, duration, etc.
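
To use the topic distributions as features, each description needs a fixed-length vector. `ldamodel[bow]` returns only the topics above `minimum_probability` as `(topic_id, probability)` pairs, so the remaining topics have to be filled with zeros. A minimal sketch, assuming made-up topic outputs and a hypothetical `to_dense` helper:

```python
def to_dense(topic_probs, num_topics):
    """Expand a sparse [(topic_id, prob), ...] list into a fixed-length vector."""
    dense = [0.0] * num_topics
    for topic_id, prob in topic_probs:
        dense[topic_id] = float(prob)
    return dense

# Hypothetical output of ldamodel[bow] for two descriptions (5 topics for brevity).
doc_topics = [
    [(0, 0.62), (3, 0.30)],
    [(1, 0.15), (2, 0.11), (4, 0.55)],
]
features = [to_dense(d, num_topics=5) for d in doc_topics]
print(features)
```

The resulting rows can be stacked into a matrix and passed to a clustering or regression model.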

<a id="section4"></a>
### Word Embeddings (Word2Vec)

We are going to train a Word2Vec model on the descriptions and then cluster the generated features. Word2Vec is a shallow two-layer neural network that represents each word as a vector, so that the similarity between words such as "Man" and "King" or "Woman" and "Queen" can be measured as the similarity between their vectors.

<b>Steps:</b>  
- Train/Test Split
- Preprocessing
    - Text cleaning - remove special characters, punctuation etc.
    - Remove stop-words
    - Create bigram/trigram tokens
- Training + Clustering
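
Word similarity in Word2Vec is usually measured as the cosine of the angle between two word vectors, which is what gensim's `similar_by_word` ranks by. A toy sketch with made-up 3-dimensional vectors (the real vectors here have 300 dimensions, and `cosine_similarity` / `most_similar` are illustrative helper names, not the gensim API):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy embeddings, for illustration only.
vectors = {
    "crime": [0.9, 0.1, 0.0],
    "murder": [0.8, 0.2, 0.1],
    "love": [0.0, 0.2, 0.9],
}

def most_similar(word, topn=2):
    """Rank the other words by cosine similarity to `word`."""
    scores = [(w, cosine_similarity(vectors[word], vec))
              for w, vec in vectors.items() if w != word]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:topn]

print(most_similar("crime"))
```

Cosine similarity ignores vector length, so frequent and rare words can be compared on direction alone.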

In [None]:
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

#We can use the same train and test from our LDA model - txt_data (train) and unseen_data (test)

#text cleaning - tokenize, drop non-alphabetic tokens and stopwords
def clean(x):
    tokens = nltk.word_tokenize(x.lower())
    #dict.fromkeys drops duplicate words while keeping the original word order,
    #which Word2Vec's context window relies on (a plain set() would scramble it)
    tokens_clean = list(dict.fromkeys(w for w in tokens if w.isalpha() and w not in gensim.parsing.preprocessing.STOPWORDS))
    return tokens_clean

#clean train data (named txt_tokens so that the txt_clean() function defined earlier is not overwritten)
txt_tokens = list(txt_data.apply(clean))

#bigram-tokenize
bigram = gensim.models.Phrases(txt_tokens, min_count=2, threshold=100)
bigram_tokens = [bigram[doc] for doc in txt_tokens]

model = Word2Vec(bigram_tokens, size=300, min_count=1, workers=3, window=5, sg=1)

print("Let's see some sample word similarities (according to train-set):\n")
print("Similar words to 'Crime':")
print(model.wv.similar_by_word("crime"),"\n")
print("Similar words to 'Love':")
print(model.wv.similar_by_word("love"),"\n")
print("These similarities are learned from our train set only, so some of them may not hold in real life")

In [None]:
#Word2Vec to Sentence-Vector Representation
def get_feat(x):
    arr = []
    for i in x:
        arr.append(model.wv[i])
    
    arr = list(np.mean(arr, axis=0))
    return arr

#Each row in txt_data is represented by 300 features (Size parameter) from Word2Vec
X = pd.Series(bigram_tokens).apply(lambda x: get_feat(x))
X = np.array(list(X))
X = X/X.max(axis=0)

#scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
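
The sentence vector above is the mean of the word vectors in a description. A standard-library sketch of that pooling step, with two made-up 2-dimensional word vectors; the zero-vector fallback for descriptions without any known word is an assumption added here and is not part of `get_feat`:

```python
def mean_vector(tokens, word_vectors, dim):
    """Average the vectors of known tokens; zeros if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

word_vectors = {"young": [1.0, 0.0], "family": [0.0, 1.0]}  # toy 2-d vectors
print(mean_vector(["young", "family", "unknown"], word_vectors, dim=2))
print(mean_vector([], word_vectors, dim=2))
```

Averaging keeps the vector length fixed regardless of how many words a description has, at the cost of ignoring word order.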

In [None]:
#Elbow method to determine the number of clusters
distortions = [] 
inertias = [] 
K = range(1,10) 
for k in K: 
    #Building and fitting the model 
    kmeanModel = KMeans(n_clusters=k).fit(X) 

    #distortion: mean euclidean distance of each point to its nearest cluster center
    distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 
                      'euclidean'),axis=1)) / X.shape[0]) 
    #inertia: sum of squared distances of each point to its nearest cluster center
    inertias.append(kmeanModel.inertia_) 


df_elbow = pd.DataFrame(list(zip(K, distortions, inertias)), columns=["K","Distortions","Inertia"])

fig = make_subplots(rows=1, cols=2)
fig.add_trace(
    go.Scatter(x=df_elbow["K"], y=df_elbow["Distortions"], name="Distortion"),
    row=1, col=1
)
fig.add_trace(
    go.Scatter(x=df_elbow["K"], y=df_elbow["Inertia"], name="Inertia"),
    row=1, col=2
)
fig.update_layout(template="plotly",title="Elbow method - Distortions and Inertia", xaxis_title="K", xaxis2_title="K")


# fig, ax  = plt.subplots(1,2)
# sns.lineplot(x="K", y="Distortions", data=df_elbow,ax=ax[0])
# sns.lineplot(x="K", y="Inertia", data=df_elbow,ax=ax[1])
# plt.suptitle("The Elbow method using Distortion and Inertia");
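
The two curves in the elbow plot measure slightly different things: distortion is the mean distance of each point to its nearest center, while inertia is the sum of squared distances. A small standard-library sketch on four made-up points, with hand-picked centers rather than fitted ones, shows how both drop once the number of centers matches the structure of the data:

```python
import math

def nearest_distances(points, centers):
    """Euclidean distance from each point to its closest center."""
    return [min(math.dist(p, c) for c in centers) for p in points]

def distortion(points, centers):
    """Mean distance to the nearest center."""
    d = nearest_distances(points, centers)
    return sum(d) / len(d)

def inertia(points, centers):
    """Sum of squared distances to the nearest center."""
    return sum(d * d for d in nearest_distances(points, centers))

# Two well-separated pairs of points (toy data).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_center = [(5.0, 1.0)]
two_centers = [(0.0, 1.0), (10.0, 1.0)]
for centers in (one_center, two_centers):
    print(len(centers), distortion(points, centers), inertia(points, centers))
```

Because inertia squares the distances, it falls more steeply than distortion, so the "elbow" can look sharper on the inertia curve.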

In [None]:
#clustering
kmeans = KMeans(n_clusters=4, random_state=0).fit(X)
centers = kmeans.labels_
df_X  = pd.DataFrame(X)
df_X["clusters"] = list(centers)

#concat with original dataframe
df_concat = pd.concat([txt_data.reset_index(),pd.Series(bigram_tokens,name="bigram_tokens_clean"),df_X],axis=1).set_index('index')
df_concat = df_concat.join(df[["show_id","title","listed_in","country"]])

In [None]:
#Cluster Visualization using PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
#project all Word2Vec feature columns (named 0..n-1 in df_X)
X_pca = pca.fit_transform(df_concat[np.arange(X.shape[1])])
df_concat['PC1'] = X_pca[:,0]
df_concat['PC2'] = X_pca[:,1]

#import seaborn as sns
#mpl.rcParams['figure.dpi']= 150
#marker size = number of tokens in each description
fig = px.scatter(df_concat, x="PC1", y="PC2", color="clusters",
                 size=df_concat["bigram_tokens_clean"].apply(len), hover_data=['title'])



fig.show()

In [None]:
# WordCloud Visualization of each cluster
from wordcloud import WordCloud, STOPWORDS 

def clust_words(cluster_num):
    lst = list(df_concat[df_concat["clusters"]==cluster_num]["bigram_tokens_clean"])
    reviews = ""
    for i in lst:
        for t in i:
            if len(t) >= 5:
                reviews += " " + t + " "

    reviews = reviews.strip()
    return reviews


length = len(df_concat["clusters"].unique())
stopwords = set(STOPWORDS) 
wrdcld = WordCloud(width = 200, height =300, 
                background_color ='black', 
                stopwords = stopwords, 
                min_font_size = 4)

fig, ax = plt.subplots(1,length,figsize = (24,20))
#plt.figure(figsize = (50, 50), facecolor = None) 

for i in range(length):
    reviews = clust_words(i)
    wordcloud = wrdcld.generate(reviews)
    ax[i].imshow(wordcloud) 
    ax[i].axis("off") 

plt.show();

In [None]:
#t-SNE visualization - projects the high-dimensional word vectors down to 2 dimensions
from sklearn.manifold import TSNE
embeddings_ak = model.wv[model.wv.vocab]
tsne_ak_2d = TSNE(perplexity=30, n_components=2, init='pca', n_iter=3500, random_state=32)
embeddings_ak_2d = tsne_ak_2d.fit_transform(embeddings_ak)

import matplotlib.cm as cm
def tsne_plot_2d(label, embeddings, words=[], a=1):
    plt.figure(figsize=(20, 8))
    colors = cm.rainbow(np.linspace(0, 1, 1))
    x = embeddings[:,0]
    y = embeddings[:,1]
    plt.scatter(x, y, c=colors, alpha=a, label=label)
    for i, word in enumerate(words):
        plt.annotate(word, alpha=0.3, xy=(x[i], y[i]), xytext=(5, 2), 
                     textcoords='offset points', ha='right', va='bottom', size=10)
    plt.legend(loc=4)
    plt.grid(True)
    plt.savefig("hhh.png", format='png', dpi=200, bbox_inches='tight')
    plt.show()


tsne_plot_2d('Netflix description words', embeddings_ak_2d, a=0.1)

#### Summary

Using Word2Vec, we've converted each description into a vector of 300 features (word -> vector; description -> mean of its word vectors). We then clustered these features into 4 groups of potentially similar titles (based on the descriptions alone). Finally, we generated a word cloud for each cluster and a t-SNE visualization of all the word vectors. Let's see what we've got:

- The clusters are clearly separated with only slight overlap: the titles within each cluster have similar descriptions according to Word2Vec. So, we can use this to build recommendations. For example, people who like some movies in Cluster 0 can be recommended other movies in Cluster 0, ranked by euclidean distance.
- In the word clouds, words like "Young", "family", "friend" and "World" appear across clusters, but we can still distinguish between the clusters. The overlap is expected at the word level; what matters most is the context of the whole sentence (description).
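
The recommendation idea above can be sketched as a nearest-neighbour lookup restricted to one cluster. The titles, cluster labels and 2-dimensional vectors below are made up for illustration, and `recommend` is a hypothetical helper, not something the notebook defines:

```python
import math

# Hypothetical (title, cluster, feature vector) rows; real vectors have 300 dims.
titles = [
    ("Title A", 0, [0.1, 0.2]),
    ("Title B", 0, [0.2, 0.2]),
    ("Title C", 0, [0.9, 0.8]),
    ("Title D", 1, [0.1, 0.25]),
]

def recommend(liked, titles, topn=2):
    """Return the closest titles from the same cluster as `liked`."""
    by_name = {name: (cluster, vec) for name, cluster, vec in titles}
    cluster, vec = by_name[liked]
    candidates = [(name, math.dist(vec, v))
                  for name, c, v in titles
                  if c == cluster and name != liked]
    return sorted(candidates, key=lambda x: x[1])[:topn]

print(recommend("Title A", titles))
```

Note that "Title D" is very close to "Title A" in feature space but is excluded because it falls in another cluster; whether to restrict recommendations to a single cluster is a design choice worth testing.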