# 🚀Setup

## Set the runtime type

Set the runtime type of this Google Colab notebook to T4 GPU.

## Run shell commands

You can run shell commands in a cell by prefixing them with `!`, for example:
```
!pip install transformers
```
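
When the output of a command is needed as a Python value, the same command can also be run through the standard `subprocess` module. A minimal sketch, where `run_command` is a hypothetical helper name and not part of Colab:

```python
import subprocess
import sys

def run_command(args):
    """Run a command given as a list of arguments and return its stdout as text."""
    completed = subprocess.run(args, capture_output=True, text=True, check=True)
    return completed.stdout

# Example: run a tiny Python program in a child process and capture what it prints.
print(run_command([sys.executable, "-c", "print('hello from a subprocess')"]))
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` raises an exception if the command fails.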



In [None]:
# Some imports
import subprocess, json, os
import pandas as pd
import numpy as np

## Install `insanely-fast-whisper`

This is a command-line tool that runs the Whisper model for audio-to-text transcription.

Note that you first need to install `pipx`. Check the `pipx` repository for instructions on installing it on Linux, then check the `insanely-fast-whisper` repository for its own installation instructions.

Check that it works with this URL: https://www.signalogic.com/melp/EngSamples/Orig/male.wav

Notes:
* The installation is slow; it might take a few minutes.
* If the `insanely-fast-whisper` executable is not globally available once installed, run it with its absolute path: `/root/.local/bin/insanely-fast-whisper`. Making it globally available inside this Colab can be tricky.
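
The fallback described in the last note can be sketched in Python: look for the executable on `PATH` first, and use the absolute path only if that fails. `resolve_executable` is a hypothetical helper name; the fallback path is the one mentioned above.

```python
import os
import shutil

def resolve_executable(name, fallback_path):
    """Return a usable path for `name`: first from PATH, then from `fallback_path`."""
    found = shutil.which(name)
    if found is not None:
        return found
    if os.path.isfile(fallback_path) and os.access(fallback_path, os.X_OK):
        return fallback_path
    return None

whisper_bin = resolve_executable(
    "insanely-fast-whisper", "/root/.local/bin/insanely-fast-whisper"
)
print(whisper_bin)  # None if the tool is not installed in either place
```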

In [None]:
!sudo apt update

In [None]:
!sudo apt install -y pipx

In [None]:
!pipx install insanely-fast-whisper

In [None]:
!cp /root/.local/bin/insanely-fast-whisper /usr/bin

In [None]:
!insanely-fast-whisper --file-name https://www.signalogic.com/melp/EngSamples/Orig/male.wav

## Install a Python library to download YouTube videos


There are a few Python libraries for downloading YouTube videos, but some of them no longer work due to blocking issues. For example, `pytube` used to be a common choice, but it no longer seems to work (see https://www.reddit.com/r/learnpython/comments/1edm1q5/pytube_not_working_please_help/).

Find a library that does work for downloading YouTube videos, and download a video as audio only (in MP3) to check that it works.

In [None]:
!pip install pytubefix

In [None]:
from pytubefix import YouTube
from pytubefix.cli import on_progress

def download_mp3(url):
  yt = YouTube(url, on_progress_callback = on_progress)

  # Extract the first audio-only stream
  audio = yt.streams.filter(only_audio=True).first()

  out_file = audio.download(output_path=".")

  # Give the file an .mp3 extension
  # (this only changes the name; the audio is not re-encoded)
  base, ext = os.path.splitext(out_file)
  new_file = base + '.mp3'
  os.rename(out_file, new_file)

  return new_file

In [None]:
url = "https://www.youtube.com/watch?v=yKoF14Mu0CY"

In [None]:
download_mp3(url)
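
`download_mp3` above renames the downloaded audio stream to `.mp3` without re-encoding it, so the file keeps whatever format YouTube served. If a genuine MP3 file is needed, a sketch along these lines could be used instead of the rename. It assumes `ffmpeg` is installed and on `PATH`; `build_ffmpeg_command` and `convert_to_mp3` are hypothetical helper names.

```python
import os
import subprocess

def build_ffmpeg_command(src_path, dst_path):
    """Build an ffmpeg command that re-encodes the audio of src_path as MP3."""
    return [
        "ffmpeg",
        "-y",                      # overwrite dst_path if it already exists
        "-i", src_path,            # input file
        "-vn",                     # ignore any video stream
        "-codec:a", "libmp3lame",  # encode the audio with the LAME MP3 encoder
        dst_path,
    ]

def convert_to_mp3(src_path):
    """Transcode src_path to an .mp3 file next to it and return the new path."""
    base, _ext = os.path.splitext(src_path)
    dst_path = base + ".mp3"
    subprocess.run(build_ffmpeg_command(src_path, dst_path), check=True)
    return dst_path

# Only build and show the command here; convert_to_mp3 would actually run it.
print(build_ffmpeg_command("song.m4a", "song.mp3"))
```

`convert_to_mp3` is meant to be called on the file as originally downloaded (before any rename), so that the source and destination paths differ.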

## Download the lyrics dataset

Download this dataset:
https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information

Use the Python code suggested on the Kaggle website:
```
import kagglehub

# Download latest version
path = kagglehub.dataset_download("carlosgdcj/genius-song-lyrics-with-language-information")

print("Path to dataset files:", path)
```

You should find a very large file, `song_lyrics.csv`; check that it is there.
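
One way to do that check from Python is to walk the download directory and report the file's location and size. This is a minimal sketch; `find_file` is a hypothetical helper name, and in the notebook the directory to search would be the `path` returned by `kagglehub.dataset_download`.

```python
import os

def find_file(root_dir, file_name):
    """Walk root_dir and return (path, size_in_bytes) of the first match, or None."""
    for dir_path, _dir_names, file_names in os.walk(root_dir):
        if file_name in file_names:
            full_path = os.path.join(dir_path, file_name)
            return full_path, os.path.getsize(full_path)
    return None

# Searches the current directory here; prints None if the file is not found.
print(find_file(".", "song_lyrics.csv"))
```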


In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("carlosgdcj/genius-song-lyrics-with-language-information", force_download=True)

print("Path to dataset files:", path)

In [None]:
# Move the CSV next to the notebook, using the `path` returned by kagglehub above
!mv "{path}/song_lyrics.csv" .

## Install more dependencies

Run `!pip install transformers torch faiss-cpu sentence-transformers` to install those packages, since they will be used later.

In [None]:
!pip install transformers torch faiss-cpu sentence-transformers



# ✏️Development of solution

## Implement `get_lyrics_from_youtube_url(youtube_url)`

Implement a function that extracts the lyrics of a YouTube video as a string, using `insanely-fast-whisper`.

In [None]:
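def get_lyrics_from_youtube_url(youtube_url):
    """Download the audio of a YouTube video and return its transcription."""
    try:
        # Download the audio
        audio_file = download_mp3(youtube_url)

        # Execute Whisper; passing the arguments as a list avoids shell quoting
        # problems with file names that contain spaces or quotes
        subprocess.run(["insanely-fast-whisper", "--file-name", audio_file], check=True)

        # The transcription is read from output.json in the working directory
        with open('output.json', 'r') as f:
            data = json.load(f)

        # Clean up the audio file and the transcription file
        os.remove(audio_file)
        os.remove("output.json")

        return data['text']

    except Exception as e:
        print(f"Error extracting lyrics: {e}")
        return ""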
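
The JSON-reading step can be tried on its own, without running Whisper. The sketch below assumes only what the function above relies on, namely a top-level `"text"` key in the output file; `read_transcript` is a hypothetical helper name, and the sample file is written by hand for the demonstration.

```python
import json
import os
import tempfile

def read_transcript(json_path):
    """Return the transcribed text from a Whisper output file, or "" if absent."""
    with open(json_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data.get("text", "").strip()

# Demonstration with a hand-written file that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "output.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump({"text": " hello world "}, f)
    print(read_transcript(sample_path))  # hello world
```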

## Embeddings extractor

Prepare a function that extracts embeddings (for example, with BERT) from a given text. In our experience, GPT will provide you with code for this very efficiently.

Test it with some string.



In [None]:
from sentence_transformers import SentenceTransformer

# Load the pre-trained model once, so it is not reloaded on every call
encoder = SentenceTransformer("all-mpnet-base-v2")

def extract_embeddings(text):
  # `text` can be a single string or a list/array of strings
  embeddings = encoder.encode(text)

  return embeddings

In [None]:
# Example usage:
extract_embeddings("Cat")

## Create a vector database

Using `faiss`, create an index with a few embeddings, and use it to search for the nearest neighbors of a query string.

Note that the input to `faiss` must be numpy arrays with proper shape, typically: `(num_items, embedding_dimension)`. For querying only one string, it might require `(1, embedding_dimension)`.
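
A minimal sketch of that reshaping step, using only NumPy. `as_faiss_query` is a hypothetical helper name, and the random vector is a stand-in for a real sentence embedding; `faiss` also expects `float32` data, which the helper enforces.

```python
import numpy as np

def as_faiss_query(embedding):
    """Return a contiguous float32 array of shape (1, embedding_dimension)."""
    vector = np.asarray(embedding, dtype=np.float32)
    return np.ascontiguousarray(vector.reshape(1, -1))

single = np.random.rand(768)     # stand-in for one sentence embedding
query = as_faiss_query(single)
print(query.shape, query.dtype)  # (1, 768) float32
```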



In [None]:
import faiss

In [None]:
# Sample embeddings
data = [['Musician in BMAT'], ['Doing stuff in BMAT'], ['Guacamole']]
df = pd.DataFrame(data, columns = ['text'])

text = df['text'].values
vectors = extract_embeddings(text)

# Creating index
index = faiss.IndexFlatL2(vectors.shape[1])
faiss.normalize_L2(vectors)
index.add(vectors)

# Sample query embedding, reshaped to (1, embedding_dimension)
search_vector = extract_embeddings("Testing for BMAT")
testVector = np.array([search_vector])
faiss.normalize_L2(testVector)

# Search for the nearest neighbors (k = index.ntotal ranks every item in the index)
k = index.ntotal
distances, indices = index.search(testVector, k)
results = pd.DataFrame({'distances': distances[0], 'ann': indices[0]})
results
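
A note on interpreting the distances: `IndexFlatL2` reports squared Euclidean distances. When both the indexed vectors and the query are L2-normalized, as in the cell above, the squared distance `D` and the cosine similarity are related by `D = 2 - 2 * cos`, so the cosine similarity can be recovered as `1 - D / 2`. The sketch below checks this identity in plain Python on two arbitrary example vectors; the helper names are hypothetical.

```python
import math

def normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Dot product, which equals the cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))

a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 0.5, 1.0])
d2 = squared_l2(a, b)

print(round(d2, 6), round(2 - 2 * cosine(a, b), 6))  # the two values agree
print(round(1 - d2 / 2, 6), round(cosine(a, b), 6))  # cosine recovered from distance
```

This matters when turning a distance into a similarity score: `1 - D` and `1 - D / 2` are different quantities, and only the latter is the cosine similarity.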

## Load lyrics database

From the database in `song_lyrics.csv`, we want to extract the top 1000 songs by views. We will build our vector database from them.

Important: This file is huge and does not fit in RAM, so we read it in chunks and keep only the running top 1000 rows:

```
import pandas as pd

file_path = path + '/song_lyrics.csv'
chunksize = 500000
top_n = 1000

top_views_df = pd.DataFrame()

for chunk in pd.read_csv(file_path, chunksize=chunksize):
    chunk_top = chunk.nlargest(top_n, 'views')
    top_views_df = pd.concat([top_views_df, chunk_top])
    top_views_df = top_views_df.nlargest(top_n, 'views')
```
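
The same idea, keeping only the best `n` rows seen so far while streaming through the file, can be illustrated without pandas. The sketch below is an alternative built on the standard `csv` and `heapq` modules, not the approach used in this notebook; `top_n_rows` is a hypothetical helper name, and the tiny in-memory CSV is made up for the demonstration.

```python
import csv
import heapq
import io

def top_n_rows(csv_file, n, key_column):
    """Stream a CSV file and return the n rows with the largest key_column value."""
    reader = csv.DictReader(csv_file)
    heap = []  # min-heap of (value, row_number, row); the smallest value sits at heap[0]
    for row_number, row in enumerate(reader):
        item = (float(row[key_column]), row_number, row)
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item[0] > heap[0][0]:
            # The new row beats the current smallest of the top n, so swap it in.
            heapq.heapreplace(heap, item)
    return [row for _value, _number, row in sorted(heap, reverse=True)]

sample = io.StringIO("title,views\nA,10\nB,300\nC,25\nD,7\nE,120\n")
print([row["title"] for row in top_n_rows(sample, 3, "views")])  # ['B', 'E', 'C']
```

At most `n` rows are held in memory at any time. For a real file with multi-line quoted fields such as lyrics, the file should be opened with `newline=""` as the `csv` module documentation recommends.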




In [None]:
# Returns a dataframe with the numTop most viewed songs
def getDFTop(path, numTop):
  chunksize = 500000

  top_views_df = pd.DataFrame()

  # Read the CSV in chunks, keeping only the running top numTop rows in memory
  for chunk in pd.read_csv(path, chunksize=chunksize, encoding='utf8', engine='python'):
    chunk_top = chunk.nlargest(numTop, 'views')
    top_views_df = pd.concat([top_views_df, chunk_top])
    top_views_df = top_views_df.nlargest(numTop, 'views')
  return top_views_df

In [None]:
top_views_df = getDFTop('./song_lyrics.csv', 1000)

## Extract embeddings for lyrics database

Extract embeddings for the 1000 lyrics in your database.



In [None]:
# Extract the lyrics of each song
text = top_views_df['lyrics'].values

In [None]:
# Extract the embeddings of the songs
embeddings = extract_embeddings(text)


## Create a `faiss` index with lyrics

Create a `faiss` index with those 1000 lyrics, and test it with some example text.


In [None]:
# Creating index with lyrics
def getLyricsIndex(embeddings):
  index = faiss.IndexFlatL2(embeddings.shape[1])
  # normalize_L2 modifies `embeddings` in place
  faiss.normalize_L2(embeddings)
  index.add(embeddings)
  return index


## Implement final function: `get_covers`

As described at the beginning of this doc.

In [None]:
def get_covers(youtube_url, k):
  # Transcribe the lyrics of the YouTube video
  song = get_lyrics_from_youtube_url(youtube_url)

  # Extract the embedding of the transcribed lyrics, shaped (1, embedding_dimension)
  embedding = extract_embeddings(song)
  embedding = np.array([embedding])
  faiss.normalize_L2(embedding)

  # Extract the lyrics of each song in the database
  lyrics = top_views_df['lyrics'].values

  # Extract the embeddings of the songs
  # (recomputed on every call; cache them if this becomes too slow)
  embeddings = extract_embeddings(lyrics)

  # Create the index
  index = getLyricsIndex(embeddings)

  # Search for the k nearest neighbors
  distances, indices = index.search(embedding, k)
  results = pd.DataFrame({'distances': distances[0], 'ann': indices[0]})

  # Look up title and artist by row position in top_views_df
  results['title'] = top_views_df['title'].iloc[results['ann'].values].values
  results['artist'] = top_views_df['artist'].iloc[results['ann'].values].values

  # Return similarity as percentage: sim = 100 * (1-D)
  results['score'] = results['distances'].apply(lambda d: round(100 * (1 - d), 1))
  results.drop(['distances', 'ann'], axis=1, inplace=True)

  # Return as: {"title": "Title 1", "artist": "Artist 1", "score": 95.0}
  return results.to_dict('records')

# 📊Evaluation of your solution

Let's evaluate the system with 8 YouTube videos:

* https://www.youtube.com/watch?v=BDC8Jr-gp_4
* https://www.youtube.com/watch?v=W_97b97G5ds
* https://www.youtube.com/watch?v=L53MZzuE0QY
* https://www.youtube.com/watch?v=9vmrPrYJPqI
* https://www.youtube.com/watch?v=R6ATpAr7rQU
* https://www.youtube.com/watch?v=RmtP8X4ZErs
* https://www.youtube.com/watch?v=DfMnRP0pk3A
* https://www.youtube.com/watch?v=1BVP72VrGQs

In [None]:
# Create the dataframe if not created
top_views_df = getDFTop('./song_lyrics.csv', 1000)

In [None]:
# Evaluation
k = 1
get_covers("https://www.youtube.com/watch?v=BDC8Jr-gp_4", k)

In [None]:
get_covers("https://www.youtube.com/watch?v=W_97b97G5ds", k)

In [None]:
get_covers("https://www.youtube.com/watch?v=L53MZzuE0QY", k)

In [None]:
get_covers("https://www.youtube.com/watch?v=9vmrPrYJPqI", k)

In [None]:
get_covers("https://www.youtube.com/watch?v=R6ATpAr7rQU", k)

In [None]:
get_covers("https://www.youtube.com/watch?v=RmtP8X4ZErs", k)

In [None]:
get_covers("https://www.youtube.com/watch?v=DfMnRP0pk3A", k)

In [None]:
get_covers("https://www.youtube.com/watch?v=1BVP72VrGQs", k)
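
The eight calls above can also be driven by a loop, which keeps the results together and continues if one video fails. This is a minimal sketch: `evaluate` and `fake_get_covers` are hypothetical names, and the stand-in function only exists so that the example runs on its own; in the notebook, the real `get_covers` would be passed instead.

```python
EVALUATION_URLS = [
    "https://www.youtube.com/watch?v=BDC8Jr-gp_4",
    "https://www.youtube.com/watch?v=W_97b97G5ds",
    "https://www.youtube.com/watch?v=L53MZzuE0QY",
    "https://www.youtube.com/watch?v=9vmrPrYJPqI",
    "https://www.youtube.com/watch?v=R6ATpAr7rQU",
    "https://www.youtube.com/watch?v=RmtP8X4ZErs",
    "https://www.youtube.com/watch?v=DfMnRP0pk3A",
    "https://www.youtube.com/watch?v=1BVP72VrGQs",
]

def evaluate(urls, get_covers_fn, k=1):
    """Call get_covers_fn(url, k) for each URL and collect the results per URL."""
    results = {}
    for url in urls:
        try:
            results[url] = get_covers_fn(url, k)
        except Exception as error:  # keep going if one video fails
            results[url] = [{"error": str(error)}]
    return results

# Stand-in for get_covers so the sketch runs on its own.
def fake_get_covers(youtube_url, k):
    return [{"title": "Title 1", "artist": "Artist 1", "score": 95.0}][:k]

for url, covers in evaluate(EVALUATION_URLS, fake_get_covers).items():
    print(url, covers)
```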