# Video Transcript
**Overview:** This short script transcribes videos, either downloaded from YouTube or read from local files, and writes the transcript to a text file. <br>

**Notebook Owner:** Rahim Hashim <br>
**Date:** December 2023 <br>

***
## Install Relevant Packages
If you are running this on Google Colab, you must install the relevant packages:
> * `openai-whisper`: OpenAI transcription model ([documentation](https://github.com/openai/whisper))
> * `pytube`: YouTube downloader ([documentation](https://pytube.io/en/latest/))

In [2]:
!pip install -U openai-whisper
!pip install pytube
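
Before running the install cell, it can help to check which of the two packages are already importable. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` and the `REQUIRED` mapping are illustrative and not part of the original notebook.

```python
import importlib.util

# pip package name -> module name used in `import` statements
REQUIRED = {"openai-whisper": "whisper", "pytube": "pytube"}

def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]

print("Missing packages:", missing_packages())
```

An empty list means both modules can be imported and the install cell can be skipped.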



## Imports
Import the modules you will be using and load the Whisper model. `base` is fine to start with, but note the tradeoffs described in the Whisper documentation:

There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs. Below are the names of the available models and their approximate memory requirements and inference speed relative to the large model; actual speed may vary depending on many factors including the available hardware.

|  Size  | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
| :----: | :--------: | :----------------: | :----------------: | :-----------: | :------------: |
|  tiny  |    39 M    |     `tiny.en`      |       `tiny`       |     ~1 GB     |      ~32x      |
|  base  |    74 M    |     `base.en`      |       `base`       |     ~1 GB     |      ~16x      |
| small  |   244 M    |     `small.en`     |      `small`       |     ~2 GB     |      ~6x       |
| medium |   769 M    |    `medium.en`     |      `medium`      |     ~5 GB     |      ~2x       |
| large  |   1550 M   |        N/A         |      `large`       |    ~10 GB     |       1x       |
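
As a rough illustration of how the table can guide the choice, the sketch below picks the largest model whose approximate VRAM requirement fits a given budget. The helper `pick_whisper_model` is hypothetical (it is not part of Whisper), and the numbers are copied from the table above, so treat them as approximate.

```python
# Approximate VRAM needs (GB) copied from the table above, smallest model first.
MODEL_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]

def pick_whisper_model(available_vram_gb, english_only=False):
    """Return the largest model name whose approximate VRAM need fits."""
    chosen = None
    for name, vram_gb in MODEL_VRAM_GB:
        if vram_gb <= available_vram_gb:
            chosen = name
    if chosen is None:
        raise ValueError("Not enough memory for even the tiny model")
    # Per the table, every size except large has an English-only '.en' variant.
    if english_only and chosen != "large":
        chosen += ".en"
    return chosen

print(pick_whisper_model(4))                     # small
print(pick_whisper_model(1, english_only=True))  # base.en
```

The returned name could then be passed to `whisper.load_model(...)` in place of `'base'`.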

In [3]:
import whisper
import pprint

model = whisper.load_model('base')

100%|███████████████████████████████████████| 139M/139M [00:05<00:00, 25.8MiB/s]


## Select Video
Either enter the link of the YouTube video you would like to download, or use a local file that contains audio (.mp4 format only, for now). If you use a local file, leave `youtube_url` empty and set `local_path` to the file's path relative to the current working directory.

In [4]:
from pytube import YouTube
from datetime import datetime

youtube_url = "https://www.youtube.com/watch?v=5t1vTLU7s40" # @param {type:"string"}
local_path = "" # @param {type:"string"}

if youtube_url:
  youtube_obj = YouTube(youtube_url)
  streams = youtube_obj.streams.filter(only_audio=True)
  stream = streams.first()
  video_path = 'youtube.mp4'
  stream.download(filename=video_path)
  video_title = stream.title
  video_url = youtube_url  # link to the watch page (stream.url is a direct media link)
  video_author = youtube_obj.author
  video_date = f"{youtube_obj.publish_date:%B %d, %Y}"
else:
  video_path = local_path
  video_title = video_path
  video_url = None
  video_author = 'Local File'
  video_date = f"{datetime.now():%B %d, %Y}"

print(f'Video Title: {video_title}')
print(f'  Author: {video_author}')
print(f'  Date: {video_date}')

Video Title: Yann Lecun: Meta AI, Open Source, Limits of LLMs, AGI & the Future of AI | Lex Fridman Podcast #416
  Author: Lex Fridman
  Date: March 07, 2024
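
For the local-file branch, a small check up front can give a clearer error than a failed transcription. This is a minimal sketch under the assumptions stated in the text (.mp4 only, relative paths resolved against the current working directory); the helper name `resolve_local_video` is illustrative and not part of the original notebook.

```python
from pathlib import Path

def resolve_local_video(local_path):
    """Return an absolute Path to a local .mp4 file, or raise a clear error."""
    path = Path(local_path).expanduser()
    if not path.is_absolute():
        # Relative paths are interpreted against the current working directory.
        path = Path.cwd() / path
    if path.suffix.lower() != ".mp4":
        raise ValueError(f"Expected an .mp4 file, got: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    return path
```

The result could be converted with `str(...)` and assigned to `video_path` before transcribing.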


## Transcribe
This is where the magic happens -- as long as everything is set up properly, all you have to do is send the video to the model, and it should all work!

## Output
Take a peek at the format of the output. The result is a dictionary, and each entry in `segments` is itself a dictionary:

In [5]:
output = model.transcribe(video_path)
print('Output Keys:', output.keys())
print('  Segment Keys:', output['segments'][0].keys())

Output Keys: dict_keys(['text', 'segments', 'language'])
  Segment Keys: dict_keys(['id', 'seek', 'start', 'end', 'text', 'tokens', 'temperature', 'avg_logprob', 'compression_ratio', 'no_speech_prob'])
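
The `start`, `end`, and `text` keys listed above are enough to build a timestamped transcript. The sketch below uses a made-up stand-in for the dictionary returned by `model.transcribe(...)` so that it runs on its own; the helper names are illustrative.

```python
def format_timestamp(seconds):
    """Convert a number of seconds to an HH:MM:SS string."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

def timestamped_lines(output):
    """Build one '[start - end] text' line per segment."""
    lines = []
    for segment in output["segments"]:
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(f"[{start} - {end}] {segment['text'].strip()}")
    return lines

# Made-up stand-in for the dictionary returned by model.transcribe(...)
example_output = {
    "text": " Hello there. Welcome back.",
    "language": "en",
    "segments": [
        {"id": 0, "start": 0.0, "end": 2.4, "text": " Hello there."},
        {"id": 1, "start": 2.4, "end": 3725.0, "text": " Welcome back."},
    ],
}

for line in timestamped_lines(example_output):
    print(line)
# [00:00:00 - 00:00:02] Hello there.
# [00:00:02 - 01:02:05] Welcome back.
```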


## View Transcription
And now you can view the transcription.

In [6]:
# Collect the text of every segment
segments = [segment['text'] for segment in output['segments']]
# segments  # uncomment to display the transcription

## Write to Text Output
If you want to save it to a text file:

In [7]:
with open('transcript.md', 'w', encoding='utf-8') as f:
  f.write(f'# {video_title}\n')
  f.write(f'**{video_author}:** [{video_date}]({video_url})\n')
  for segment in segments:
    f.write(f'* {segment}\n')
# No explicit f.close() is needed: the with block closes the file.

***
## BERTopic Clustering
The first clustering exercise we'll be performing is via BERTopic (documentation [link](https://github.com/MaartenGr/BERTopic)).

In [None]:
%%capture
!pip install bertopic

## Training

We start by instantiating BERTopic. We set language to `english` since our documents are in the English language. If you would like to use a multi-lingual model, please use `language="multilingual"` instead.

We will also calculate the topic probabilities. However, this can slow down BERTopic significantly with large amounts of data (>100_000 documents). It is advised to turn this off if you want to speed up the model.


In [None]:
from bertopic import BERTopic

topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(segments)

2024-02-02 15:03:02,837 - BERTopic - Embedding - Transforming documents to embeddings.


2024-02-02 15:03:08,342 - BERTopic - Embedding - Completed ✓
2024-02-02 15:03:08,344 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-02 15:03:17,790 - BERTopic - Dimensionality - Completed ✓
2024-02-02 15:03:17,792 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-02 15:03:17,822 - BERTopic - Cluster - Completed ✓
2024-02-02 15:03:17,831 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-02-02 15:03:17,869 - BERTopic - Representation - Completed ✓


## Extracting Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first, as they best represent the collection of documents. Topic -1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topics that were generated.

In [None]:
freq = topic_model.get_topic_info()
num_topics = len(freq['Topic'].unique())
print(f'Number of topics: {num_topics}')
freq.head(10)

Number of topics: 2


|     | Topic | Count | Name | Representation | Representative_Docs |
| :-: | :---: | :---: | :--- | :------------- | :------------------ |
|  0  |   0   |  433  | 0_the_and_in_to | [the, and, in, to, that, of, you, this, is, we] | [ And the example that I often give about that... |
|  1  |   1   |  42   | 1_dinner_of_wine_bottle | [dinner, of, wine, bottle, to, party, and, sto... | [ party. It's a positive stimulus to me in tha... |
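
The advice to ignore topic -1 and look at the most frequent topics first can be sketched on plain rows shaped like the output above. The rows for topics 0 and 1 are copied from that output; the -1 row is made up for illustration, since this particular run produced no outlier topic. The helper name `top_topics` is hypothetical.

```python
# Rows shaped like the topic info output; the -1 row is a made-up example.
topic_info = [
    {"Topic": -1, "Count": 12, "Name": "-1_example_outliers"},
    {"Topic": 1, "Count": 42, "Name": "1_dinner_of_wine_bottle"},
    {"Topic": 0, "Count": 433, "Name": "0_the_and_in_to"},
]

def top_topics(rows, n=10):
    """Drop the outlier topic (-1) and return the n most frequent topics."""
    kept = [row for row in rows if row["Topic"] != -1]
    return sorted(kept, key=lambda row: row["Count"], reverse=True)[:n]

for row in top_topics(topic_info):
    print(row["Topic"], row["Count"], row["Name"])
# 0 433 0_the_and_in_to
# 1 42 1_dinner_of_wine_bottle
```

With the actual DataFrame, the same filter would be a boolean mask on the `Topic` column.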


To see the most representative words for a topic, select its topic index (from the `Topic` column above).

In [None]:
topic_index = 0 # @param {type:"integer"}
topic_model.get_topic(topic_index)  # Select the most frequent topic

[('the', 0.1388178999581227),
 ('and', 0.08949639891522118),
 ('in', 0.08287969948216967),
 ('to', 0.07822130138054394),
 ('that', 0.076968053254933),
 ('of', 0.07135768401228992),
 ('you', 0.06934160463749271),
 ('this', 0.06200080129079323),
 ('is', 0.061742013895098785),
 ('we', 0.046671852738445894)]

***
## **Visualization**
There are several visualization options available in BERTopic, namely the visualization of topics, probabilities, and topics over time. Topic modeling is, to a certain extent, quite subjective. Visualizations help us understand the topics that were created.

## Visualize Topics
After having trained our `BERTopic` model, we could go through the topics one by one, perhaps a hundred of them, to get a good
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation.
Instead, we can visualize the topics that were generated in a way very similar to
[LDAvis](https://github.com/cpsievert/LDAvis):

In [None]:
topic_model.visualize_topics()

## Topic Reduction
We can also reduce the number of topics after having trained a BERTopic model. The advantage of doing so
is that you can decide the number of topics after knowing how many were actually created. It is difficult to
predict before training your model how many topics are in your documents and how many will be extracted.
Instead, we can decide afterwards how many topics seem realistic:

In [None]:
topic_model.reduce_topics(segments, nr_topics=10)

2024-01-23 13:31:10,706 - BERTopic - Topic reduction - Reducing number of topics
2024-01-23 13:31:10,758 - BERTopic - Topic reduction - Reduced number of topics from 47 to 10


<bertopic._bertopic.BERTopic at 0x7c4126e82050>

## Visualize Topic Probabilities

The probabilities returned from `transform()` or `fit_transform()` (stored in `probs` above) can
be used to understand how confident BERTopic is that certain topics can be found in a document.

To visualize the distribution for a single document (here, the segment at index 200), we simply call:

In [None]:
topic_model.visualize_distribution(probs[200], min_probability=0.005)
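
To get a feel for what a threshold such as `min_probability=0.005` does, the sketch below filters one row of topic probabilities and sorts what remains. The row is made up for illustration, and the helper `topics_above` is hypothetical; it mirrors the idea of the threshold rather than BERTopic's internal implementation.

```python
def topics_above(prob_row, min_probability=0.005):
    """Return (topic_index, probability) pairs at or above the threshold, highest first."""
    pairs = [(index, p) for index, p in enumerate(prob_row) if p >= min_probability]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Made-up probabilities for a single document across five topics
example_row = [0.62, 0.001, 0.30, 0.004, 0.075]
print(topics_above(example_row))
# [(0, 0.62), (2, 0.3), (4, 0.075)]
```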

## Visualize Topic Hierarchy

The topics that were created can be hierarchically reduced. To understand the potential hierarchical structure of the topics, we can use `scipy.cluster.hierarchy` to create clusters and visualize how they relate to one another. This can help in selecting an appropriate `nr_topics` when reducing the number of topics that you have created.

In [None]:
topic_model.visualize_hierarchy(top_n_topics=50)

## Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [None]:
topic_model.visualize_barchart(top_n_topics=10)

## Visualize Topic Similarity
Having generated topic embeddings, through both c-TF-IDF and embeddings, we can create a similarity matrix by computing the cosine similarity between those topic embeddings. The result is a matrix indicating how similar certain topics are to each other.

In [None]:
topic_model.visualize_heatmap(n_clusters=8, width=1000, height=1000)
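
The similarity matrix behind such a heatmap can be illustrated with a few toy vectors. This is a minimal pure-Python sketch of pairwise cosine similarity; the three "topic embeddings" are made up, and real embeddings would have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_matrix(vectors):
    """Pairwise cosine similarities; entry [i][j] compares vector i with vector j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Made-up topic embeddings: the first two point in similar directions.
topic_embeddings = [
    [1.0, 0.0, 1.0],
    [0.9, 0.1, 1.1],
    [0.0, 1.0, 0.0],
]

for row in similarity_matrix(topic_embeddings):
    print([round(value, 3) for value in row])
```

The diagonal is 1 (each topic compared with itself), the matrix is symmetric, and the first two topics score much higher with each other than with the third.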

We can go through each topic manually, which would take a lot of work, or we can visualize them all in a single interactive graph.
BERTopic has several [visualization functions](https://maartengr.github.io/BERTopic/getting_started/visualization/visualize_documents.html) that we can use. For now, we are sticking with visualizing the documents.

In [None]:
topic_model.visualize_documents(segments, hide_annotations=True, hide_document_hover=False, custom_labels=True)

## Visualize Term Score Decline
Topics are represented by a number of words, starting with the most representative word. Each word has a c-TF-IDF score: the higher the score, the more representative the word is of the topic. Since the topic words are sorted by their c-TF-IDF score, the scores slowly decline with each word that is added. At some point, adding words to the topic representation only marginally increases the total c-TF-IDF score and is no longer beneficial for its representation.

To visualize this effect, we can plot the c-TF-IDF scores for each topic by the term rank of each word. In other words, the position of the words (term rank), where the word with the highest c-TF-IDF score has a rank of 1, is put on the x-axis, and the c-TF-IDF scores are put on the y-axis. The result is a visualization that shows the decline in c-TF-IDF score as words are added to the topic representation. It allows you, using the elbow method, to select the best number of words in a topic.


In [None]:
topic_model.visualize_term_rank()
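
The elbow can also be estimated numerically. The sketch below applies one common heuristic (the rank whose score falls furthest below the straight line joining the first and last scores) to the c-TF-IDF scores listed for topic 0 earlier. The helper `elbow_rank` is illustrative and not a BERTopic function, and other elbow heuristics may pick a different rank.

```python
def elbow_rank(scores):
    """Return the 1-based rank with the largest gap below the line from first to last score."""
    n = len(scores)
    if n < 3:
        return n
    slope = (scores[-1] - scores[0]) / (n - 1)
    # Gap between the straight line and the actual score at each rank
    gaps = [scores[0] + slope * i - score for i, score in enumerate(scores)]
    return gaps.index(max(gaps)) + 1

# c-TF-IDF scores of topic 0, in rank order (the, and, in, to, that, of, you, this, is, we)
topic_0_scores = [
    0.1388178999581227,
    0.08949639891522118,
    0.08287969948216967,
    0.07822130138054394,
    0.076968053254933,
    0.07135768401228992,
    0.06934160463749271,
    0.06200080129079323,
    0.061742013895098785,
    0.046671852738445894,
]

print(elbow_rank(topic_0_scores))  # 2
```

For this topic the score drops sharply after the first word and then flattens, so the heuristic places the elbow at rank 2.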

## Update Topics
When you have trained a model and viewed the topics and the words that represent them,
you might not be satisfied with the representation. Perhaps you forgot to remove
stopwords or you want to try out a different `n_gram_range`. We can use the function `update_topics` to update
the topic representation with new parameters for `c-TF-IDF`:


In [None]:
topic_model.update_topics(segments, n_gram_range=(1, 2))
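
To make the effect of stopword removal and `n_gram_range=(1, 2)` concrete, the sketch below builds unigrams and bigrams from a single sentence after dropping a few stopwords. It only illustrates the idea and is not how BERTopic computes its representation (within BERTopic these settings are typically handled by a scikit-learn `CountVectorizer`). The stopword set and the example sentence are made up.

```python
# A tiny, hypothetical stopword set (mostly the words that dominated topic 0)
STOPWORDS = {"the", "and", "in", "to", "that", "of", "you", "this", "is", "we", "a"}

def ngrams(tokens, n_gram_range=(1, 2)):
    """Return all n-grams with n between the two bounds (inclusive)."""
    low, high = n_gram_range
    grams = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

text = "we opened a bottle of wine at the dinner party"
tokens = [word for word in text.lower().split() if word not in STOPWORDS]

print(tokens)
# ['opened', 'bottle', 'wine', 'at', 'dinner', 'party']
print(ngrams(tokens))
```

With `(1, 2)`, phrases such as "dinner party" become candidate terms alongside the single words.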

## Search Topics
After having trained our model, we can use `find_topics` to search for topics that are similar
to an input search term. Here, we search for topics that closely relate to the
search term "memory". The call returns the most similar topic indices and their similarity scores:

In [None]:
topic_model.find_topics('memory')

([0, 8, 5, 3, -1],
 [0.5628913513946472,
  0.4732045481993843,
  0.44150052173898924,
  0.436342910536105,
  0.3958021255331643])
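
The result is a pair of parallel lists, which is easier to read once zipped together. The sketch below uses the values printed above and drops the outlier topic (-1); the helper name `rank_matches` is illustrative.

```python
def rank_matches(result, drop_outliers=True):
    """Pair each topic index with its similarity score, optionally dropping topic -1."""
    topic_ids, similarities = result
    pairs = list(zip(topic_ids, similarities))
    if drop_outliers:
        pairs = [(topic_id, score) for topic_id, score in pairs if topic_id != -1]
    return pairs

# Values copied from the output above
result = (
    [0, 8, 5, 3, -1],
    [0.5628913513946472,
     0.4732045481993843,
     0.44150052173898924,
     0.436342910536105,
     0.3958021255331643],
)

for topic_id, similarity in rank_matches(result):
    print(f"Topic {topic_id}: similarity {similarity:.3f}")
# Topic 0: similarity 0.563
# Topic 8: similarity 0.473
# Topic 5: similarity 0.442
# Topic 3: similarity 0.436
```

The best match's index can then be passed to `topic_model.get_topic(...)` to inspect its words.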