In [1]:
from langchain_huggingface import HuggingFaceEmbeddings

# 1. Initialize the same embedding model used in your notebook
embeddings_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")



In [2]:
from langchain_community.vectorstores import FAISS
import pandas as pd

In [3]:
from langchain_core.documents import Document

In [4]:
df = pd.read_csv("df1_cleaned.csv")

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter


In [7]:

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, 
    chunk_overlap=100
)
# 2. Process your DataFrame into Split Documents
documents = []
for _, row in df.iterrows():
    # Split the long plot into smaller chunks
    chunks = text_splitter.split_text(str(row['clean_plot']))
    
    for chunk in chunks:
        documents.append(
            Document(
                page_content=chunk,
                metadata={
                    "title": row['Title'], 
                    "year": row['Release Year']
                }
            )
        )
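To make the roles of chunk_size and chunk_overlap concrete, here is a simplified fixed-width splitter. It is only an illustration: RecursiveCharacterTextSplitter prefers to break on separators such as paragraph and line breaks, so its real chunks will differ. The function name split_with_overlap is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Simplified sketch: fixed-width character windows that share
    `chunk_overlap` characters with the previous window.
    Not the algorithm RecursiveCharacterTextSplitter uses."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


demo_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)
```

Each window starts chunk_size - chunk_overlap characters after the previous one, so with the settings in the cell above consecutive chunks would share up to 100 characters.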

In [9]:
# Create the vector store using the fixed documents list
vector_store = FAISS.from_documents(documents, embeddings_model)

# Save it to the artifacts folder
vector_store.save_local("artifacts/movie_faiss_bm")

In [10]:
query = "A movie about space exploration and black holes"
docs = vector_store.similarity_search(query, k=3)

for doc in docs:
    print(f"Title: {doc.metadata['title']} ({doc.metadata['year']})")
    print(f"Snippet: {doc.page_content[:150]}...\n")

Title: Sphere (1998)
Snippet: an alien spacecraft is discovered on the floor of the pacific ocean, estimated to have been there for nearly 300 years. a team of experts, including m...

Title: Ulsaha Committee (2014)
Snippet: the film is about a school drop out whose pursuit for amazing scientific inventions lands him in trouble....

Title: The Black Hole (1979)
Snippet: nearing the end of a long mission exploring deep space, the spacecraft uss palomino is returning to earth. the crew consists of captain dan holland, f...



In [11]:
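# Caution: str.contains() is a substring match, so "Sphere" also matches any
# title that merely contains "sphere". idxmax() then returns the most frequent
# plot among all matching rows, which may belong to a different film.
movie_plot = df[df['Title'].str.contains("Sphere", case=False)]['clean_plot'].value_counts().idxmax()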

print(movie_plot)

in the mid-1930s, in the early days of military aviation, an era of open cockpits and biplanes, two u.s. army pilots, in a friendly rivalry, are always trying to get the best of each other. 2nd lt. tom cooper (william cagney) gets the nickname "soapy", from his friend, 1st lt. richard "dick" wood, "woody" (edward j. nugent). tom's trademark gift to a female friend is an inscribed bar of soap. tom finds out that "ida johnson", the girl he's been seeing while dick has been off the base, is really dick's fiancée, evelyn worthington (june collyer). she introduced herself as ida (hattie mcdaniel), using her maid's name as a lark. when dick finds the tell-tale bar of soap from tom, it's no joke to him, and two friends are at odds. dick breaks off the engagement while evelyn is torn between two loves. the two pilots are picked to go on a dangerous balloon mission launched into the stratosphere, to evaluate high altitude flight capability. before they get off the ground, the tense relationshi
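
The plot printed above is not the Sphere (1998) plot that the similarity search returned, which suggests the title filter picked up another film. A small sketch of the difference between substring and exact, case-insensitive title matching is shown here in plain Python. The titles are made up for illustration and are not taken from df1_cleaned.csv.

```python
import re

# Hypothetical titles, for illustration only
titles = ["Sphere", "The Stratosphere Story", "Hemisphere", "sphere"]


def substring_matches(titles, needle):
    """Case-insensitive substring match, similar in spirit to str.contains."""
    return [t for t in titles if needle.lower() in t.lower()]


def exact_matches(titles, needle):
    """Case-insensitive match on the whole title."""
    pattern = re.compile(re.escape(needle), flags=re.IGNORECASE)
    return [t for t in titles if pattern.fullmatch(t.strip())]


loose = substring_matches(titles, "Sphere")
strict = exact_matches(titles, "Sphere")
print(loose)
print(strict)
```

In pandas, a comparable exact filter would be df['Title'].str.lower() == "sphere". If several films share a title, filtering on 'Release Year' as well should narrow it to one.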

In [13]:
movie_plot = df[df['Title'].str.contains("The Black Hole", case=False)]['clean_plot'].value_counts().idxmax()

print(movie_plot)

nearing the end of a long mission exploring deep space, the spacecraft uss palomino is returning to earth. the crew consists of captain dan holland, first officer lieutenant charlie pizer, journalist harry booth, esp-sensitive scientist dr. kate mccrae, the expedition's civilian leader dr. alex durant and the diminutive robot v.i.n.cent ("vital information necessary centralized"). the palomino crew discovers a black hole in space with a spaceship nearby, somehow defying the hole's massive gravitational pull. the ship is identified as the long-lost uss cygnus, the ship mccrae's father served aboard when it went missing. deciding to investigate, the palomino encounters a mysterious null gravity field surrounding the cygnus. the palomino becomes damaged when it drifts away from the cygnus and into the black hole's intense gravity field, but the ship manages to move back to the cygnus and finds itself able to dock with it. the cygnus appears abandoned. the palomino crew cautiously boards t

In [12]:

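# Build a retriever. Note: search_type is not set here, so this still uses the
# default similarity search; pass search_type="mmr" to get MMR's more varied results.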
retriever = vector_store.as_retriever(
    search_kwargs={'k': 5, 'fetch_k': 20}
)

# Use .invoke() instead of .get_relevant_documents()
results = retriever.invoke("A movie about space exploration and black holes")

# To see your results
for doc in results:
    print(f"Movie: {doc.metadata['title']}")

Movie: Sphere
Movie: Ulsaha Committee
Movie: The Black Hole
Movie: The Thousand Faces of Dunjia
Movie: Interstellar
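
For reference, here is a minimal pure-Python sketch of how maximal marginal relevance (MMR) trades relevance against redundancy. It illustrates the general technique and is not LangChain's implementation. The name mmr_select and the toy vectors are hypothetical, and lambda_mult mirrors the name of the LangChain search_kwargs option.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Return indices of up to k documents chosen greedily by
    lambda_mult * relevance - (1 - lambda_mult) * redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: highest similarity to anything already selected
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy 2-D vectors: documents 0 and 1 are near-duplicates, document 2 differs
query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.1], [1.0, 0.12], [0.6, -0.8]]

diverse = mmr_select(query_vec, doc_vecs, k=2)
relevance_only = mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With lambda_mult=1.0 the selection reduces to plain relevance ranking and keeps both near-duplicates; at 0.5 the second pick is the dissimilar document. In LangChain, the corresponding retriever would be created with as_retriever(search_type="mmr", search_kwargs={...}), and its results may differ from the list printed above.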
