Overview:

* Used [sentence transformers](https://github.com/UKPLab/sentence-transformers) to get the sentence embeddings. Sentence transformers fine-tunes BERT/RoBERTa/XLM-RoBERTa models with a siamese or triplet network structure to produce semantically meaningful sentence embeddings, which can be used for semantic search or similarity comparison.
* For each episode, sentence embeddings are encoded with a pretrained NLI model. The mean of an episode's sentence embeddings serves as its document embedding (the episode embedding in this case). K-means clustering then groups the episodes so that similar episodes fall into the same cluster.
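
The averaging step described above can be sketched with toy vectors. This is only an illustration: the 3-dimensional values are made up rather than real model output, and `episode_embedding` is a hypothetical helper name that does not appear in the notebook.

```python
import numpy as np

def episode_embedding(sentence_embeddings):
    """Average per-sentence vectors (one per row) into a single document vector."""
    sentence_embeddings = np.asarray(sentence_embeddings, dtype=float)
    return sentence_embeddings.mean(axis=0)

# Toy stand-in for the encoder output: 2 sentences, 3 dimensions each.
toy_sentences = np.array([[1.0, 2.0, 3.0],
                          [3.0, 4.0, 5.0]])
toy_episode = episode_embedding(toy_sentences)
print(toy_episode)
```

Every sentence contributes equally to the mean, so long episodes are not weighted more heavily per sentence than short ones.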


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os

In [None]:
!pip install -U sentence-transformers


In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')

In [None]:
episode_name_list = os.listdir('/kaggle/input/chai-time-data-science/Cleaned Subtitles')

# Finding Episode Clusters from Mean Sentence Embeddings

In [None]:
def preprocess(df):
    # Drop the host's (Sanyam Bhutani) lines and keep only the 'Text' column
    df = df[df['Speaker']!='Sanyam Bhutani']['Text']
    return df

In [None]:
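results = {}
for episode in episode_name_list:
    df = pd.read_csv("/kaggle/input/chai-time-data-science/Cleaned Subtitles/"+episode)
    text = preprocess(df)
    # encode() takes a list of strings, so convert the pandas Series first
    sentence_embeddings = model.encode(text.tolist())
    # Key each result by episode id (file name without the .csv extension)
    results[episode.replace('.csv','')] = sentence_embeddings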

In [None]:
episode_embeddings = {k:np.mean(v,axis=0) for k,v in results.items()}

In [None]:
for k, v in episode_embeddings.items():
    if v.shape == ():
        print(k)

In [None]:
del episode_embeddings['E69']
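
Instead of deleting a hard-coded key, the same cleanup could be done programmatically. The sketch below is one possible approach, not the notebook's method: `drop_empty_embeddings` is a hypothetical helper, and the toy dictionary assumes that an episode with no usable sentences ends up as a 0-d NaN value, like the one the shape check above flags.

```python
import numpy as np

def drop_empty_embeddings(embeddings):
    """Keep only entries whose embedding is a non-empty 1-D vector without NaNs."""
    kept = {}
    for key, vec in embeddings.items():
        vec = np.asarray(vec, dtype=float)
        if vec.ndim == 1 and vec.size > 0 and not np.isnan(vec).any():
            kept[key] = vec
    return kept

# Toy data: "A" is a valid vector, "B" mimics a scalar NaN embedding.
toy_embeddings_by_id = {
    "A": np.array([0.1, 0.2]),
    "B": np.float64(np.nan),
}
cleaned = drop_empty_embeddings(toy_embeddings_by_id)
print(sorted(cleaned))
```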

In [None]:
episode_embeddings_list = list(episode_embeddings.values())
episode_ids = list(episode_embeddings.keys())


In [None]:
from sklearn.cluster import KMeans

num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(episode_embeddings_list)
cluster_assignment = clustering_model.labels_
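
K-means starts from randomly chosen centroids, so cluster numbering and membership can differ between runs. A minimal sketch of pinning the result with scikit-learn's `random_state` parameter follows; the 2-D toy points and the seed value are arbitrary assumptions, not values from this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy blobs standing in for episode embeddings.
toy_points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                       [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Fixing random_state makes repeated runs produce the same assignment.
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_assignment = toy_model.fit_predict(toy_points)
print(toy_assignment)
```

With a fixed seed, the cluster numbers quoted in any written interpretation stay valid when the notebook is re-run.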

In [None]:
cluster_assignment

In [None]:
clustered_episodes = [[] for _ in range(num_clusters)]
# enumerate() yields the position in episode_ids, not the episode id itself
for episode_idx, cluster_id in enumerate(cluster_assignment):
    clustered_episodes[cluster_id].append(episode_ids[episode_idx])
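
The grouping step can also be written by pairing ids and labels directly, which avoids index bookkeeping. This is an alternative sketch with made-up ids and labels; `group_by_cluster` is a hypothetical helper.

```python
from collections import defaultdict

def group_by_cluster(ids, labels):
    """Map each cluster label to the list of ids assigned to it."""
    groups = defaultdict(list)
    for item_id, label in zip(ids, labels):
        groups[int(label)].append(item_id)
    return dict(groups)

demo_ids = ["E1", "E2", "E3", "E4"]
demo_labels = [0, 1, 0, 1]
demo_groups = group_by_cluster(demo_ids, demo_labels)
print(demo_groups)
```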


In [None]:
clustered_episodes

# Match the Episode Names to the Episode Ids

In [None]:
episode_names = pd.read_csv("/kaggle/input/chai-time-data-science/Episodes.csv")

In [None]:
episode_descriptions = pd.read_csv("/kaggle/input/chai-time-data-science/Description.csv")

In [None]:
episode_names.head()

In [None]:
episode_mapping = pd.Series(episode_names['episode_name'].values,index=episode_names['episode_id']).to_dict()

In [None]:
clustered_names = []
for index, cluster in enumerate(clustered_episodes):
    print("\n")
    print("Cluster ",index+1)
    print("\n")
    cluster_list = []
    for episode in cluster:
        print(episode_mapping[episode])
        cluster_list.append(episode_mapping[episode])
    # Append once per cluster, after all of its episode names are collected
    clustered_names.append(cluster_list)

In cluster 5 we can see multiple mentions of Japan. The Machine Learning Tokyo community interview (Suzana Ilic) and the Kuzushiji recognition competition interviews (Tarin and Anokas), all related to Japanese language or culture, fell into cluster 4. We can also see five mentions of fastai (Sylvain Gugger, Even Oldridge, Robert Bracco, Hamel Husain, Jason Antic). My guess is that all five of them mentioned the impact of fastai in accelerating their learning. Episodes about self-study and learning to learn fall into cluster 5. However, in Cluster 0 I also saw advocates of self-study, getting started in data science, and learning to learn (Jeremy Howard, Goku Mohandas, Parul Pandey, Daniel Bourke, Edouard Harris).

Cluster 2 has multiple mentions of deep learning researchers. Pierre Stock and Christine Payne talk about their research on model quantization and MuseNet respectively, and Rachel Thomas is active in applied AI ethics research. The Tim Dettmers interview also mentions research. A second theme in cluster 2 seems to be Kaggle grandmasters (Dr Olivier Grellier, Artgor, Andres Torrubia).

In Cluster 4 there are multiple mentions of the Google Quest Q&A labelling competition (Dmitry Danevskiy, Christof Henkel) and of computer vision, but beyond that I can't discern much of a theme by eyeballing. Since a wide variety of people use fastai (educators, researchers, Kagglers), that theme is present to some degree in all clusters.