# Title: Enhancing Search Engine Relevance for Video Subtitles


## Objective:
- Develop an advanced search engine algorithm that efficiently retrieves subtitles based on user queries, with a specific emphasis on subtitle content. The primary goal is to leverage natural language processing and machine learning techniques to enhance the relevance and accuracy of search results.

## About the Dataset
- Database File Name: **eng_subtitles_database.db**
- The database contains a sample of 82498 subtitle files from opensubtitles.org.
- Most of the subtitles are for movies and TV series released after 1990 and before 2024.
- The database contains a table called **'zipfiles'** with three columns:
  1. **num**: Unique subtitle ID, used as the reference on www.opensubtitles.org
  2. **name**: Subtitle file name
  3. **content**: Subtitle files, compressed and stored as binary data using 'latin-1' encoding

## Importing Libraries

In [2]:
# pandas: A data analysis library in Python
import pandas as pd

# sqlite3: A module that provides a lightweight, disk-based database
import sqlite3

# zipfile: A module to work with ZIP archives
import zipfile

# io: A module for handling I/O operations
import io

# warnings: A module for handling warnings during code execution
import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")


## Ingesting the Data

### Reading the Tables from the Dataset

In [3]:
# Connect to the SQLite database
conn = sqlite3.connect('eng_subtitles_database.db')

# Create a cursor object to interact with the database
cursor = conn.cursor()

# Execute SQL query to select table names from the database schema
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")

# Fetch all rows (table names) from the result set
table_names = cursor.fetchall()

# Print the names of all tables in the database
print(table_names)

[('zipfiles',)]


- This code connects to a SQLite database file named 'eng_subtitles_database.db', retrieves the names of all tables in the database, and prints them. The table name here is "zipfiles".

### Reading the Columns of Table

In [4]:
# Execute a PRAGMA query to retrieve column information for the 'zipfiles' table
cursor.execute("PRAGMA table_info('zipfiles')")

# Fetch all rows (column information) from the result set
cols = cursor.fetchall()

# Iterate through the fetched columns and print their names
for col in cols:
    print(col[1])

num
name
content


- This code executes a SQL PRAGMA query to retrieve information about the columns in the 'zipfiles' table of the SQLite database. It then iterates through the fetched columns and prints their names.

**What is a PRAGMA query?**
- A PRAGMA query in SQLite is a SQL command used to query or modify various aspects of the SQLite database engine's behavior or metadata. PRAGMA statements are primarily used for administrative or informational purposes rather than for data manipulation.

- Some common uses of PRAGMA statements include:

  - PRAGMA table_info: This PRAGMA is used to retrieve information about the columns (fields) in a specific table, such as their names, data types, and constraints.
  - PRAGMA foreign_key_list: This PRAGMA retrieves information about foreign key constraints defined on a table.
  - PRAGMA journal_mode: This PRAGMA is used to query or set the journaling mode, which determines how changes are recorded in the SQLite database.
  - PRAGMA cache_size: This PRAGMA is used to query or set the size of the database page cache, which can affect performance.
- More information can be found on https://www.sqlite.org/pragma.html
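
As a small illustration of the PRAGMA table_info statement described above, the sketch below runs it against a throwaway in-memory database. The table definition is an assumption made for the demo (the declared column types of the real 'zipfiles' table are not shown in this notebook). Each returned row has the form (cid, name, type, notnull, dflt_value, pk), which is why the column name sits at index 1.

```python
import sqlite3

# Hypothetical in-memory table that mimics the 'zipfiles' layout;
# the declared types here are assumptions, not taken from the real database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE zipfiles (num INTEGER, name TEXT, content BLOB)")

# Each row is (cid, name, type, notnull, dflt_value, pk)
columns = conn.execute("PRAGMA table_info('zipfiles')").fetchall()
column_names = [row[1] for row in columns]
column_types = {row[1]: row[2] for row in columns}

print(column_names)
print(column_types)
conn.close()
```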

### Loading the Table inside a Dataframe

In [5]:
# Read data from the 'zipfiles' table into a Pandas DataFrame
df = pd.read_sql_query("""SELECT * FROM zipfiles""", conn)

# Close the database connection
conn.close()

# Display the DataFrame (the notebook shows the first and last rows)
df

Unnamed: 0,num,name,content
0,9180533,the.message.(1976).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...
4,9180600,broker.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...
...,...,...,...
82493,9521935,the.prophets.game.(2000).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\xb8\xa6\x...
82494,9521937,west.beirut.(1998).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x13\x97\x...
82495,9521938,frankenstein.the.true.story.(1973).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00$\x97\x9aV...
82496,9521940,frankenstein.the.true.story.(1973).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x00\x97\x...


In [6]:
df.shape

(82498, 3)

In [7]:
df['content'][0]

b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x99V\x9fx\x96\xf0\x8c\x9e\x00\x00\x86\x9b\x01\x00;\x00\x00\x00The.Message.1976.REMASTERED.1080p.BluRay.x264-PiGNUS.EN.srt\xad\xbdm\x93\xdc\xc6\x91.\xfa\x9d\x11\xfc\x0f-}\xe1=\x11-\x9d\x06P\x85\x17\x9d\x8d\xd5%%[\xa4-Y>&u\x15>\xdf\xd0\xd3\x98\x19x\xfae\x0cts<\xfe\xf57\x9f\'\xb3\n\xd9\xa4\xbc\xbb\xf7\xc6Fl\xacELW\xa2\xaa\x90\x95\x95\xafO\x16/_l6\xdf\xe0\xff\xea\xf5f\xb3Y}\xf5\xd5\xbf\xaf\xf4AQ\xae7Mx\xf9\xe2\xd7\xfe|s\xbf\xea\x8f\xcf\xab\x8f\xe3n8\xadN\xc7\xfdx\x1cVO\xe3\xf9~\xf5\xf3\xe3p\xfc\xea\xfd/o>\xbc\xfb\xf0\xe3\xef\xde\xbf|\xf1\xfbi\x18Vo\xa6\xd3\xd3<L\xab\xe1\x1f\xe7\xe18\x8f\xa7\xe37\xab\xd3\xbc\xdb~-\xc3\x1e\xfe\xa7<|\xf9\xe2\xe5\x8bR_[~S\xd6\xeb\xa2k\xf3k\xe5A\xb7\xeeb\xf5\xf2\xc5\xbb\xe3\xea|?\xac\x8e\xfdaX\x9dnW?\x9cvk>8\x9c\xe6\xf3\xean\xeao\xc6\xd3ev\x8f~\x1a\xa6\x9b\xf1\xf6\xb2\xff\x1a\xe4\xabD\xbe*d\x11\xa5#_U\xeb\xaa\xd9`\xa6\xa7\xc3\xea\xa7\xcb}\x7f8\xf4F\xf9\xa7a\x9e\x87\xe3\x9d\xcc\\\xdf\x07B!\x13\xaa\xd61n<!\xd9\xaf\xd0\

### NOTE:
- The content column does not hold readable text. As the README.txt file indicates, it stores the compressed subtitle files as binary data using 'latin-1' encoding.
- When the data is retrieved, it appears as a sequence of escaped bytes (beginning with the ZIP signature 'PK\x03\x04'), which looks like gibberish.
- To see the actual text content of the subtitle files, the binary data needs to be decompressed and then decoded.
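
Before decompressing every row, it can help to confirm that a value really is a ZIP archive. The following is a minimal sketch of such a check; the helper name `looks_like_zip` and the sample archive are made up for illustration and are not part of the dataset.

```python
import io
import zipfile

def looks_like_zip(blob: bytes) -> bool:
    """Return True if the bytes form a readable ZIP archive."""
    return zipfile.is_zipfile(io.BytesIO(blob))

# Build a tiny archive in memory as stand-in data (not from the real database)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("example.srt", "1\r\n00:00:01,000 --> 00:00:02,000\r\nHello\r\n")
sample_blob = buffer.getvalue()

print(sample_blob[:4])  # the local file header signature
print(looks_like_zip(sample_blob))
print(looks_like_zip(b"plain text, not an archive"))
```

On the real data, the same helper could be applied to the content column to count rows that are not valid archives before extraction.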

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82498 entries, 0 to 82497
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   num      82498 non-null  int64 
 1   name     82498 non-null  object
 2   content  82498 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.9+ MB


### Function to Decompress and Extract Contents from ZIP Archive

In [9]:
import zipfile  # Library for working with ZIP archives
import io       # Library for working with binary data streams

# Define a function to decompress and extract contents from a ZIP archive
def decompress_and_extract(content):
    """
    Decompresses and extracts contents from a ZIP archive.
    
    Args:
        content (bytes): Binary data representing a ZIP archive.
        
    Returns:
        str: Extracted contents as a decoded string.
            Returns None if the ZIP archive is empty.
    """
    
    # Create a file-like object from the compressed data
    with io.BytesIO(content) as content_file:
        
        # Create a ZipFile object from the file-like object
        with zipfile.ZipFile(content_file) as zip_file:
            # Note: Assuming there is only one file in the ZIP archive, extract its contents, if there are multiple files, you might need to specify the file to extract
            
            # Obtain a list of file names contained within the ZIP archive
            file_list = zip_file.namelist()
            
            # If the archive contains at least one file
            if file_list:
                # Select the first file from the list
                first_file = file_list[0]
                # Read the contents of the selected file from the ZIP archive
                extracted_data = zip_file.read(first_file)
                # Decode the extracted binary data using the 'latin-1' encoding
                return extracted_data.decode('latin-1')
            
            # If the ZIP archive is empty
            else:
                # Return None
                return None
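
Some decoded subtitles in this notebook begin with the characters 'ï»¿', which is what a UTF-8 byte order mark looks like when it is decoded as 'latin-1'. The sketch below shows one possible alternative, not the approach used in this notebook: try UTF-8 first and fall back to 'latin-1'. The function names and the sample archive are hypothetical.

```python
import io
import zipfile

def decode_subtitle_bytes(raw: bytes) -> str:
    """Decode subtitle bytes, preferring UTF-8 and falling back to latin-1."""
    try:
        # 'utf-8-sig' also strips a leading byte order mark if present
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this never fails
        return raw.decode("latin-1")

def extract_first_member(blob: bytes):
    """Return the decoded text of the first file in a ZIP archive, or None."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        names = archive.namelist()
        if not names:
            return None
        return decode_subtitle_bytes(archive.read(names[0]))

# Stand-in archive holding a BOM-prefixed UTF-8 subtitle (not real data)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo.srt", "\ufeff1\r\nCafé\r\n".encode("utf-8"))

text = extract_first_member(buffer.getvalue())
print(repr(text))
```

Whether this is preferable depends on the data: if most files really are 'latin-1', the fallback handles them, while UTF-8 files no longer pick up stray characters.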


In [10]:
# Apply the decompress_and_extract function and add a new column 'subtitle_text' to the dataframe
df['subtitle_text'] = df['content'].apply(decompress_and_extract)

In [11]:
df['subtitle_text'][0]

'1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch any video online with Open-SUBTITLES\r\nFree Browser extension: osdb.link/ext\r\n\r\n2\r\n00:02:26,198 --> 00:02:29,953\r\nIn the name of God, the most gracious, the most Merciful.\r\n\r\n3\r\n00:02:31,072 --> 00:02:33,370\r\nFrom Muhammad, the Messenger of God\r\n\r\n4\r\n00:02:33,550 --> 00:02:36,047\r\nto Heraclius, the emperor of Byzantium.\r\n\r\n5\r\n00:02:36,407 --> 00:02:39,464\r\ngreetings to him who is the\r\nfollower of righteous guidance.\r\n\r\n6\r\n00:02:39,783 --> 00:02:42,591\r\nI bid you to hear the divine call.\r\n\r\n7\r\n00:02:43,160 --> 00:02:45,817\r\nI am the messenger of God to the people;\r\n\r\n8\r\n00:02:46,337 --> 00:02:48,784\r\naccept Islam for your salvation.\r\n\r\n9\r\n00:02:52,231 --> 00:02:54,709\r\nHe speaks of a new prophet in Arabia.\r\n\r\n10\r\n00:02:55,068 --> 00:02:57,825\r\nWas it like this when John, the Baptist\r\ncame to king Herod\r\n\r\n11\r\n00:02:58,145 --> 00:03:01,272\r\nout of the desert, 

In [12]:
# Drop the 'content' column
df.drop('content',axis=1,inplace=True)

In [13]:
import re

In [13]:
# import re

# # Extract text content from the 'subtitle_text' column and perform cleaning
# bb = ("".join(re.findall(r'[A-z ]+', re.sub(r'www.\S+', '', "".join(re.findall(r'[A-z. ]+', df['subtitle_text'][0])))))).strip()


In [14]:
# # Extract text content from the 'subtitle_text' column and perform cleaning

# cleaned_text = df['subtitle_text'][0]  # Processing the first row
# cleaned_text = re.sub(r'www\.\S+', '', cleaned_text)  # Remove URLs
# cleaned_text = re.sub(r'[^a-zA-Z\s]', '', cleaned_text)  # Remove non-alphabetic characters
# cleaned_text = cleaned_text.lower().strip()  # Convert to lowercase and remove leading/trailing spaces

In [14]:
# Extract text content from the 'subtitle_text' column and perform cleaning

subtitle_text = df['subtitle_text'][0] # Extract text content from first row

cleaned_text_no_urls = re.sub(r'www\.\S+', '', subtitle_text) # Remove URLs 

# Extract runs of letters, periods, and spaces
# ([A-Za-z] is used because [A-z] would also match the characters [ \ ] ^ _ `)
cleaned_text_alpha = re.findall(r'[A-Za-z. ]+', cleaned_text_no_urls)

concatenated_text = "".join(cleaned_text_alpha) # Concatenate the extracted text fragments into a single string

cleaned_text = concatenated_text.strip() # Remove leading/trailing spaces

cleaned_text = re.sub(r'[^a-zA-Z\s]', '', cleaned_text) # Remove non-alphabetic characters and keep spaces

In [15]:
cleaned_text

'Watch any video online with OpenSUBTITLESFree Browser extension osdblinkext  In the name of God the most gracious the most Merciful  From Muhammad the Messenger of God  to Heraclius the emperor of Byzantium  greetings to him who is thefollower of righteous guidance  I bid you to hear the divine call  I am the messenger of God to the people  accept Islam for your salvation  He speaks of a new prophet in Arabia  Was it like this when John the Baptistcame to king Herod  out of the desert crying about salvation  To Muqawqis Patriarch of Alexandria  Kisra emperor of Persia  Muhammad calls you with the call of God  Accept Islam for your salvation  embrace Islam  You come out of the desertsmelling of camel and goat  To tell Persia where he should kneel  Muhammad Messenger of God  Who gave him this authority  God sent Muhammadas a mercy to mankind  The Scholars and Historians of Islam The University of AlAzhar in CairoThe High Islamic Congress of the Shiat in Lebanon  The makers of this film 
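
The output above contains fused words such as "OpenSUBTITLESFree" and "thefollower". These appear because the matched fragments are joined with an empty string, so the last word of one subtitle line runs into the first word of the next. The sketch below is one possible alternative, under the assumption that the input follows the usual SRT layout (cue number, timestamp line, text lines). It processes the text line by line and joins with spaces. The function name `srt_to_text` and the sample string are illustrative only.

```python
import re

TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}")
TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"(?:https?://|www\.)\S+")

def srt_to_text(srt: str) -> str:
    """Flatten SRT-formatted subtitles to plain text, keeping word boundaries."""
    kept = []
    for line in srt.splitlines():
        line = line.strip().lstrip("\ufeff")
        # Skip blank lines, cue numbers, and timestamp lines
        if not line or line.isdigit() or TIMESTAMP.search(line):
            continue
        line = TAG.sub("", line)
        line = URL.sub("", line)
        # Drop apostrophes so contractions stay one word (don't -> dont)
        line = line.replace("'", "")
        # Replace anything that is not a letter or whitespace with a space
        line = re.sub(r"[^A-Za-z\s]", " ", line)
        kept.append(line)
    # Joining with a space keeps the end of one line apart from the next
    return re.sub(r"\s+", " ", " ".join(kept)).strip()

sample = (
    "1\r\n00:02:36,407 --> 00:02:39,464\r\n"
    "greetings to him who is the\r\nfollower of righteous guidance.\r\n\r\n"
    "2\r\n00:02:39,783 --> 00:02:42,591\r\n"
    "<i>I bid you</i> to hear the divine call.\r\n"
)
print(srt_to_text(sample))
```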

In [16]:
# Renaming the subtitle_text column
df.rename({'subtitle_text':'subtitle'},axis=1,inplace=True)

In [17]:
df.head()

Unnamed: 0,num,name,subtitle
0,9180533,the.message.(1976).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an..."
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,"1\r\n00:00:29,359 --> 00:00:32,048\r\nAh! Ther..."
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,"1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi'..."
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an..."
4,9180600,broker.(2022).eng.1cd,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch..."


In [18]:
df.shape

(82498, 3)

In [19]:
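# Clean and preprocess subtitle text: keep letters, periods, and spaces, strip URLs, then drop the periods
# ([A-Za-z] replaces [A-z], which also matches [ \ ] ^ _ `; the dot in www\. is escaped so it matches a literal period)
df['text']=df['subtitle'].apply(lambda x:("".join(re.findall(r'[A-Za-z ]+',re.sub(r'www\.\S+','',"".join(re.findall(r'[A-Za-z. ]+',x)))))).strip())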

In [20]:
df

Unnamed: 0,num,name,subtitle,text
0,9180533,the.message.(1976).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",Watch any video online with OpenSUBTITLESFree ...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,"1\r\n00:00:29,359 --> 00:00:32,048\r\nAh! Ther...",Ah Theres PrincessDawn and Terry with the Blo...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,"1\r\n00:00:53,200 --> 00:00:56,030\r\n<i>Yumi'...",iYumis Cells i iEpisode Extremely Polite Yumi...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch an...",Watch any video online with OpenSUBTITLESFree ...
4,9180600,broker.(2022).eng.1cd,"ï»¿1\r\n00:00:06,000 --> 00:00:12,074\r\nWatch...",Watch any video online with OpenSUBTITLESFree ...
...,...,...,...,...
82493,9521935,the.prophets.game.(2000).eng.1cd,"ï»¿1\r\n00:01:16,284 --> 00:01:19,537\r\nGod,\...",Godwhy are you punishing me With red onhis he...
82494,9521937,west.beirut.(1998).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\napi.Open...",apiOpenSubtitlesorg is deprecated pleaseimplem...
82495,9521938,frankenstein.the.true.story.(1973).eng.1cd,"1\r\n00:00:01,001 --> 00:00:04,630\r\n(Dramati...",Dramatic orchestral music Advertise your prod...
82496,9521940,frankenstein.the.true.story.(1973).eng.1cd,"1\r\n00:00:06,000 --> 00:00:12,074\r\nAdvertis...",Advertise your product or brand herecontact t...


In [21]:
# Dropping the subtitle column
df.drop('subtitle',axis=1,inplace=True)

In [22]:
df

Unnamed: 0,num,name,text
0,9180533,the.message.(1976).eng.1cd,Watch any video online with OpenSUBTITLESFree ...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,Ah Theres PrincessDawn and Terry with the Blo...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,iYumis Cells i iEpisode Extremely Polite Yumi...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,Watch any video online with OpenSUBTITLESFree ...
4,9180600,broker.(2022).eng.1cd,Watch any video online with OpenSUBTITLESFree ...
...,...,...,...
82493,9521935,the.prophets.game.(2000).eng.1cd,Godwhy are you punishing me With red onhis he...
82494,9521937,west.beirut.(1998).eng.1cd,apiOpenSubtitlesorg is deprecated pleaseimplem...
82495,9521938,frankenstein.the.true.story.(1973).eng.1cd,Dramatic orchestral music Advertise your prod...
82496,9521940,frankenstein.the.true.story.(1973).eng.1cd,Advertise your product or brand herecontact t...


In [24]:
# Keep only word characters and spaces in the text column
df['text'] = df['text'].apply(lambda x: ("".join(re.findall(r'[\w ]+',x))).strip())

In [25]:
df

Unnamed: 0,num,name,text
0,9180533,the.message.(1976).eng.1cd,Watch any video online with OpenSUBTITLESFree ...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,Ah Theres PrincessDawn and Terry with the Blo...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,iYumis Cells i iEpisode Extremely Polite Yumi...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,Watch any video online with OpenSUBTITLESFree ...
4,9180600,broker.(2022).eng.1cd,Watch any video online with OpenSUBTITLESFree ...
...,...,...,...
82493,9521935,the.prophets.game.(2000).eng.1cd,Godwhy are you punishing me With red onhis he...
82494,9521937,west.beirut.(1998).eng.1cd,apiOpenSubtitlesorg is deprecated pleaseimplem...
82495,9521938,frankenstein.the.true.story.(1973).eng.1cd,Dramatic orchestral music Advertise your prod...
82496,9521940,frankenstein.the.true.story.(1973).eng.1cd,Advertise your product or brand herecontact t...


In [26]:
# Random Sampling and Index Resetting of DataFrame
# Randomly selecting 30% of the data

df = df.sample(frac=0.3, random_state=42).reset_index(drop=True)

In [27]:
df

Unnamed: 0,num,name,text
0,9251120,maybe.this.time.(2014).eng.1cd,Watch any video online with OpenSUBTITLESFree ...
1,9211589,down.the.shore.s01.e10.and.justice.for.all.(19...,Oh I know that its getting late but I dont ...
2,9380845,uncontrollably.fond.s01.e07.heartache.(2016).e...,iTiming and Subtitles by The Uncontrollable Lo...
3,9301436,screen.two.s13.e04.the.precious.blood.(1996).e...,ethereal music apiOpenSubtitlesorg is depreca...
4,9408707,battlebots.(2015).eng.1cd,Chris Oh nonot the Minibots yelling Oh You le...
...,...,...,...
24744,9458807,kevin.can.wait.s01.e13.ring.worm.(2017).eng.1cd,Script InfoTitle Default fileScriptType vWrapS...
24745,9244890,bia.s01.e29.episode.1.29.(2019).eng.1cd,Where did that come fromI dont know Its a tap...
24746,9345965,heroes.s02.e11.chapter.eleven.powerless.(2007)...,iPreviously oni Heroes Tell me where I can fi...
24747,9417351,hot.in.cleveland.s05.e09.bad.george.clooney.(2...,i Hot in Clevelandi is recorded in frontof a ...


In [28]:
df.shape

(24749, 3)

In [29]:
# Saving the dataframe into a CSV file
df.to_csv('subtitle.csv',index=False)

In [27]:
df = pd.read_csv("subtitle.csv")

In [31]:
df

Unnamed: 0,num,name,text
0,9251120,maybe.this.time.(2014).eng.1cd,Watch any video online with OpenSUBTITLESFree ...
1,9211589,down.the.shore.s01.e10.and.justice.for.all.(19...,Oh I know that its getting late but I dont ...
2,9380845,uncontrollably.fond.s01.e07.heartache.(2016).e...,iTiming and Subtitles by The Uncontrollable Lo...
3,9301436,screen.two.s13.e04.the.precious.blood.(1996).e...,ethereal music apiOpenSubtitlesorg is depreca...
4,9408707,battlebots.(2015).eng.1cd,Chris Oh nonot the Minibots yelling Oh You le...
...,...,...,...
24744,9458807,kevin.can.wait.s01.e13.ring.worm.(2017).eng.1cd,Script InfoTitle Default fileScriptType vWrapS...
24745,9244890,bia.s01.e29.episode.1.29.(2019).eng.1cd,Where did that come fromI dont know Its a tap...
24746,9345965,heroes.s02.e11.chapter.eleven.powerless.(2007)...,iPreviously oni Heroes Tell me where I can fi...
24747,9417351,hot.in.cleveland.s05.e09.bad.george.clooney.(2...,i Hot in Clevelandi is recorded in frontof a ...


### Basic Checks

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24749 entries, 0 to 24748
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   num     24749 non-null  int64 
 1   name    24749 non-null  object
 2   text    24749 non-null  object
dtypes: int64(1), object(2)
memory usage: 580.2+ KB


In [33]:
# Checking for missing values
df.isnull().sum()

num     0
name    0
text    0
dtype: int64

In [34]:
# Checking for duplicate rows
df.duplicated().sum()

0

### Text Preprocessing

In [36]:
import string
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [36]:
# def clean_text(text):
#     # Remove timestamps
#     cleaned_text = re.sub(r'\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}\n?', '', text)
    
#     # Remove other non-textual patterns
#     cleaned_text = re.sub(r'<[^>]+>', '', cleaned_text)
#     cleaned_text = re.sub(r"[^\w\s]", '', cleaned_text)
#     cleaned_text = re.sub(r"[^\x00-\x7F]+", '', cleaned_text)
#     cleaned_text = re.sub(r"\b\d+\s", '', cleaned_text)
    
#     # Convert to lowercase
#     cleaned_text = cleaned_text.lower()
    
#     # Tokenize the text
#     tokens = word_tokenize(cleaned_text)
    
#     # Remove stopwords and lemmatize tokens
#     stop_words = set(stopwords.words('english'))
#     lemmatizer = WordNetLemmatizer()
#     clean_tokens = [lemmatizer.lemmatize(word) for word in tokens if word.lower() not in stop_words]
    
#     # Join the filtered tokens back into a string
#     cleaned_text = ' '.join(clean_tokens)
    
#     return cleaned_text.strip()

# # Test the function
# sample_text = "This is a sample text <b>with HTML tags</b> and 123 numbers. 12:34:56 --> 00:12:34,567"
# print("Cleaned Text:", clean_text(sample_text))


### Tokenization

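In [37]:
# word_tokenize needs the NLTK 'punkt' tokenizer data ('punkt_tab' in newer NLTK releases);
# uncomment the next line on the first run
# nltk.download('punkt')

# Tokenize each subtitle and re-join the tokens with single spaces
df['text'] = df['text'].apply(lambda x: " ".join(word_tokenize(x)))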

In [38]:
def clean_text(text):
    # Remove timestamp patterns first, while digits and punctuation are still present
    cleaned_text = re.sub(r'\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}', '', text)
    
    # Remove HTML tags and other non-alphabetic characters
    cleaned_text = re.sub(r'<[^>]+>', '', cleaned_text)
    cleaned_text = re.sub(r'[^a-zA-Z\s]', '', cleaned_text)
    
    # Remove non-ASCII characters
    cleaned_text = re.sub(r'[^\x00-\x7F]+', '', cleaned_text)
    
    # Replace multiple spaces with a single space and convert to lowercase
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text)
    cleaned_text = cleaned_text.lower().strip()
    
    return cleaned_text

In [39]:
df['text'] = df['text'].apply(clean_text)

In [40]:
df.head()

Unnamed: 0,num,name,text
0,9251120,maybe.this.time.(2014).eng.1cd,watch any video online with opensubtitlesfree ...
1,9211589,down.the.shore.s01.e10.and.justice.for.all.(19...,oh i know that its getting late but i dont wan...
2,9380845,uncontrollably.fond.s01.e07.heartache.(2016).e...,itiming and subtitles by the uncontrollable lo...
3,9301436,screen.two.s13.e04.the.precious.blood.(1996).e...,ethereal music apiopensubtitlesorg is deprecat...
4,9408707,battlebots.(2015).eng.1cd,chris oh nonot the minibots yelling oh you lea...


In [41]:
df['text'][1]

'oh i know that its getting late but i dont wan na go home im in no hurry baby time can wait cause i dont wan na go home i know we had to try to reach up and touch the sky baby whatever happened to you and i and i dont wan na go home watch any video online with opensubtitlesfree browser extension osdblinkext guys guys you dont understand this isnt just any wet tshirt contest this is mickey greenssixth annual breastfest hundreds of girls forget that thousands of breasts all pressed togetheron a small wet stage with the three of us right therecelebrating our manhood mugs of beer in one hand buckets of water in the other im not sure im ready for that im ready im not sure i can go im not that comfortable being in the same room with wet breasts last time you werein a room with a wet breast was when you took a steam bathwith your grandfather i remember eddies grandfather boy he was stacked come on eddie itll be fun i dont know youre unbelievable you wont go to strip joints you hate bachelor 

In [42]:
# Saving the dataframe into a CSV file
df.to_csv("cleaned_text.csv",index=False)

In [4]:
df = pd.read_csv("cleaned_text.csv")

In [5]:
df.head()

Unnamed: 0,num,name,text
0,9251120,maybe.this.time.(2014).eng.1cd,watch any video online with opensubtitlesfree ...
1,9211589,down.the.shore.s01.e10.and.justice.for.all.(19...,oh i know that its getting late but i dont wan...
2,9380845,uncontrollably.fond.s01.e07.heartache.(2016).e...,itiming and subtitles by the uncontrollable lo...
3,9301436,screen.two.s13.e04.the.precious.blood.(1996).e...,ethereal music apiopensubtitlesorg is deprecat...
4,9408707,battlebots.(2015).eng.1cd,chris oh nonot the minibots yelling oh you lea...


In [6]:
# Dropping the num column, as it is not needed for the remaining steps
df.drop(columns=['num'], inplace=True)

In [7]:
df.head()

Unnamed: 0,name,text
0,maybe.this.time.(2014).eng.1cd,watch any video online with opensubtitlesfree ...
1,down.the.shore.s01.e10.and.justice.for.all.(19...,oh i know that its getting late but i dont wan...
2,uncontrollably.fond.s01.e07.heartache.(2016).e...,itiming and subtitles by the uncontrollable lo...
3,screen.two.s13.e04.the.precious.blood.(1996).e...,ethereal music apiopensubtitlesorg is deprecat...
4,battlebots.(2015).eng.1cd,chris oh nonot the minibots yelling oh you lea...


### Document Chunking and Sentence Transformers

- **Chunking data** involves breaking large text documents into smaller, manageable segments. This improves memory efficiency, enables parallel processing, and lets NLP models work on smaller text units, so a search result can point to a specific passage rather than a whole file. Overlapping consecutive chunks slightly helps avoid losing context at chunk boundaries.

- **Sentence Transformers** are deep learning models designed to generate sentence embeddings: fixed-length vectors that capture semantic meaning. These embeddings support tasks like sentence similarity, semantic search, and clustering. Built on architectures such as BERT, RoBERTa, or DistilBERT, Sentence Transformers encode the contextual information of a sentence, which matters for applications that depend on understanding text in context.

- Here, a **BERT-based SentenceTransformers** model ('bert-base-nli-mean-tokens') is used to generate embeddings that encode semantic information.
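
To make the idea of semantic search over embeddings concrete, the sketch below ranks a few vectors by cosine similarity to a query vector. The three-dimensional vectors are made up for the demo and stand in for real sentence embeddings, which have many more dimensions. The function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scores = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Made-up 3-dimensional vectors standing in for real sentence embeddings
toy_embeddings = {
    "chunk_a": [0.9, 0.1, 0.0],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [0.7, 0.7, 0.0],
}
toy_query = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(toy_query, toy_embeddings)
print(ranking)
```

In a real search engine the query would be embedded with the same model as the subtitle chunks, and a vector store would typically do the ranking.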

In [47]:
#!pip install sentence_transformers



In [8]:
from sentence_transformers import SentenceTransformer

In [9]:
# Function to chunk the input data into smaller segments
def chunk_text(data, chunk_size=500, overlap_size=50):
    """
    Chunk the input data into smaller segments.

    Parameters:
        data (str or list of str): Text to split into words, or a list of items to group.
        chunk_size (int): Number of items (words or list elements) per chunk.
        overlap_size (int): Number of items shared between consecutive chunks.

    Returns:
        list of str: List of text chunks.
    """
    if overlap_size >= chunk_size:
        raise ValueError("overlap_size must be smaller than chunk_size")
    # Split a string into words; slicing the raw string would yield single characters
    items = data.split() if isinstance(data, str) else list(data)
    chunks = []
    start_index = 0
    # Chunk the data into segments of specified size with an overlap
    while start_index < len(items):
        end_index = min(start_index + chunk_size, len(items))
        chunks.append(' '.join(items[start_index:end_index]))
        if end_index == len(items):
            break  # the last chunk already reaches the end of the data
        start_index += chunk_size - overlap_size
    return chunks

# Initialize the SentenceTransformer model once so repeated calls reuse it
model = SentenceTransformer('bert-base-nli-mean-tokens')

# Function to generate embeddings for text chunks using a SentenceTransformer model
def generate_text_embeddings(texts):
    """
    Generate embeddings for text chunks using a SentenceTransformer model.

    Parameters:
        texts (list of str): List of text chunks.

    Returns:
        list of numpy.ndarray: List of embeddings corresponding to each text chunk.
    """
    embeddings = []
    # Encode each text chunk into embeddings using the model
    # (input longer than model.max_seq_length tokens is truncated by the model)
    for text_chunk in texts:
        chunk_embeddings = model.encode(text_chunk)
        embeddings.append(chunk_embeddings)
    return embeddings
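
The arithmetic behind overlapping chunks is easiest to see on the index ranges alone. The sketch below computes the (start, end) positions for a given input length: each window starts `chunk_size - overlap_size` items after the previous one. The helper `window_bounds` is hypothetical and exists only for this illustration. It stops once a window reaches the end of the input, so no trailing window consisting only of overlap is produced.

```python
def window_bounds(n_items, chunk_size=500, overlap_size=50):
    """Return (start, end) index pairs for overlapping windows over n_items."""
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap_size
    bounds = []
    start = 0
    while start < n_items:
        end = min(start + chunk_size, n_items)
        bounds.append((start, end))
        if end == n_items:
            break  # the final window already reaches the end
        start += step
    return bounds

print(window_bounds(1000))
print(window_bounds(10, chunk_size=4, overlap_size=1))
```

With the default settings, 1000 items produce three windows, and items 450 to 499 and 900 to 949 each appear in two of them.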

In [10]:
texts = df['text'].tolist()

In [11]:
# Applying the chunk_text function to the list of subtitles
# (texts is a list, so each chunk joins up to 500 whole subtitle texts)
chunked_texts = chunk_text(texts)

In [12]:
# Applying the generate_text_embeddings to perform text embedding on the chunked data
Embedded_chunks = generate_text_embeddings(chunked_texts)

### Storing the Embeddings in ChromaDB

- Chroma DB is an open-source vector store used for storing and retrieving vector embeddings. 
- Its main use is to save embeddings along with metadata to be used later by large language models. Additionally, it can also be used for semantic search engines over text data.
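
The notebook does not show the storage step itself, so the following is only a sketch of how chunks and their embeddings might be written to a persistent Chroma collection. The storage path, collection name, id scheme, and both function names are assumptions. Ids are built from the row position rather than the subtitle name because names are not unique in this dataset. The chromadb import is kept inside the function so the record-building helper can be run on its own.

```python
def build_records(names, chunk_lists):
    """Flatten per-subtitle chunk lists into ids, documents, and metadata."""
    ids, documents, metadatas = [], [], []
    for row, (name, chunks) in enumerate(zip(names, chunk_lists)):
        for position, chunk in enumerate(chunks):
            # Row position keeps ids unique even when two subtitles share a name
            ids.append(f"doc{row}-chunk{position}")
            documents.append(chunk)
            metadatas.append({"name": name, "chunk_index": position})
    return ids, documents, metadatas

def store_in_chroma(ids, documents, metadatas, embeddings,
                    path="subtitle_chroma", collection_name="subtitles"):
    """Persist records to a local Chroma collection (path and name are assumptions)."""
    import chromadb  # imported lazily; requires the chromadb package

    client = chromadb.PersistentClient(path=path)
    collection = client.get_or_create_collection(name=collection_name)
    collection.add(ids=ids, documents=documents,
                   metadatas=metadatas, embeddings=embeddings)
    return collection

# Made-up example input, used only to show the record layout
ids, documents, metadatas = build_records(
    ["demo.movie.(2000).eng.1cd"],
    [["first chunk of text", "second chunk of text"]],
)
print(ids)
print(metadatas)
```

Once stored, a search would embed the user query with the same model and pass it to the collection's query method (for example with `query_embeddings` and `n_results`), then map the returned metadata back to subtitle names.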

In [53]:
#!pip install chromadb

In [2]:
#!pip install --upgrade chromadb

In [1]:
# The PyPI distribution is named "protobuf" (it provides the google.protobuf module);
# "google.protobuf" is not an installable package name
#!pip install "protobuf>=3.19.0"


In [13]:
import chromadb

In [20]:
Embedded_chunks

[array([-6.83184117e-02,  9.31833088e-02,  1.07720590e+00,  2.54844457e-01,
         5.08137763e-01, -9.76729333e-01,  9.24862683e-01,  1.84479311e-01,
         3.90087515e-01, -1.29481614e-01, -1.39887527e-01,  1.43626735e-01,
         4.72809523e-01,  2.70105958e-01, -4.00534898e-01,  5.53636491e-01,
        -1.39639631e-01, -4.43915099e-01,  2.81683177e-01, -7.60083318e-01,
        -4.41479355e-01,  6.71645552e-02, -1.74850538e-01, -1.34599373e-01,
         5.97074509e-01,  1.43535987e-01,  4.02392864e-01,  1.23448573e-01,
        -7.92010963e-01,  2.57300884e-01, -3.88647705e-01,  9.44299817e-01,
         6.43783808e-02, -2.03857973e-01,  1.54299721e-01,  7.50576138e-01,
        -1.38042718e-02,  3.34336579e-01,  4.04155731e-01, -1.90665066e-01,
        -3.21541697e-01, -1.79468662e-01,  3.16014647e-01, -3.24590474e-01,
        -6.19631886e-01,  2.77414411e-01,  3.04029822e-01,  6.74875021e-01,
         1.29043543e+00, -7.55099952e-01,  6.76638544e-01,  3.48051548e-01,
         1.2

In [None]:
chunked_texts

In [21]:
df

Unnamed: 0,name,text
0,maybe.this.time.(2014).eng.1cd,watch any video online with opensubtitlesfree ...
1,down.the.shore.s01.e10.and.justice.for.all.(19...,oh i know that its getting late but i dont wan...
2,uncontrollably.fond.s01.e07.heartache.(2016).e...,itiming and subtitles by the uncontrollable lo...
3,screen.two.s13.e04.the.precious.blood.(1996).e...,ethereal music apiopensubtitlesorg is deprecat...
4,battlebots.(2015).eng.1cd,chris oh nonot the minibots yelling oh you lea...
...,...,...
24744,kevin.can.wait.s01.e13.ring.worm.(2017).eng.1cd,script infotitle default filescripttype vwraps...
24745,bia.s01.e29.episode.1.29.(2019).eng.1cd,where did that come fromi dont know its a tape...
24746,heroes.s02.e11.chapter.eleven.powerless.(2007)...,ipreviously oni heroes tell me where i can fin...
24747,hot.in.cleveland.s05.e09.bad.george.clooney.(2...,i hot in clevelandi is recorded in frontof a l...


In [24]:
df['chunks'] = df['text'].apply(chunk_text)

In [26]:
df['embeddings'] = df['chunks'].apply(generate_text_embeddings)

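In [28]:
# Saving the final dataframe; note that list and array columns are written to CSV as strings
df.to_csv("final.csv", index=False)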
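
Because a CSV file stores the chunks and embeddings columns as plain strings, reading the file back returns text that has to be parsed again. One alternative, sketched below with made-up values, is to write one JSON object per line so the vectors come back as lists of floats. The file layout and function names are assumptions, not part of the notebook.

```python
import json
import os
import tempfile

def save_embeddings_jsonl(path, names, embeddings):
    """Write one JSON object per line: a name plus its list of vectors."""
    with open(path, "w", encoding="utf-8") as handle:
        for name, vectors in zip(names, embeddings):
            # Convert each vector to plain floats so json can serialize it
            plain = [[float(value) for value in vector] for vector in vectors]
            handle.write(json.dumps({"name": name, "embeddings": plain}) + "\n")

def load_embeddings_jsonl(path):
    """Read the file back into a list of (name, vectors) pairs."""
    with open(path, encoding="utf-8") as handle:
        records = [json.loads(line) for line in handle if line.strip()]
    return [(record["name"], record["embeddings"]) for record in records]

# Made-up values used only to show the round trip
demo_names = ["demo.movie.(2000).eng.1cd"]
demo_embeddings = [[[0.25, -0.5], [1.0, 0.0]]]

demo_path = os.path.join(tempfile.mkdtemp(), "embeddings.jsonl")
save_embeddings_jsonl(demo_path, demo_names, demo_embeddings)
restored = load_embeddings_jsonl(demo_path)
print(restored)
```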