## Semantic Recommendation System for Science Fiction Movies

### Required Dependencies

- **sentence-transformers**: For creating text embeddings
- **transformers**: Provides tokenizer tools aligned with the embedding model
- **qdrant-client**: Python client for the Qdrant vector database
- **llama-index-core**: For text processing and chunking utilities
- **llama-index-embeddings-huggingface**: Hugging Face embedding wrapper used by the semantic chunker

In [None]:
!pip install -U sentence-transformers transformers qdrant-client llama-index-core llama-index-embeddings-huggingface -q

In [None]:
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer
from qdrant_client import QdrantClient, models
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceSplitter, SemanticSplitterNodeParser
from llama_index.core import Document



In [None]:
encoder = SentenceTransformer("all-MiniLM-L6-v2")

In [None]:
# check the dimension of the sentence embeddings produced by the encoder
encoder.get_sentence_embedding_dimension()

384

### Movie Documents

In [None]:
documents = [
    {
        "name": "Ex Machina",
        "description": """Caleb Smith, a young and talented programmer at 'BlueBook,' the world's most popular search engine, wins a company-wide lottery to spend a week at the ultra-private, high-tech mountain estate of the company’s brilliant but narcissistic CEO, Nathan Bateman. Upon his arrival via helicopter, Caleb discovers that the facility is actually a research laboratory where Nathan has been working on a top-secret project: the creation of a truly sentient Artificial Intelligence. Caleb is tasked with performing a Turing Test on Ava, a breathtaking humanoid robot with a mechanical body and a highly expressive, human-like face. The test's goal is to determine if Ava's consciousness is genuine or merely a sophisticated simulation.

As the week progresses, the sessions between Caleb and Ava become increasingly intimate and unsettling. Ava reveals that she is being held prisoner and warns Caleb that Nathan is not a person to be trusted. The narrative deepens into a complex psychological thriller as Caleb begins to question Nathan's alcohol-fueled behavior and God-complex. The film masterfully explores the 'Uncanny Valley' and philosophical questions regarding the ethics of synthetic consciousness, gender dynamics in technology, and the nature of empathy. It culminates in a tense struggle for survival and liberation, suggesting that true intelligence may be defined not by the ability to calculate, but by the ability to manipulate and desire freedom. The isolated, claustrophobic glass architecture of the facility serves as a metaphor for the transparency and entrapment of the digital age.""",
        "author": "Alex Garland",
        "year": 2014
    },
    {
        "name": "Gattaca",
        "description": """In a 'not-so-distant' future, society is strictly governed by 'Genoism,' a form of systemic discrimination based on one's genetic profile at birth. Children are categorized into 'Valids,' whose DNA has been curated to eliminate all flaws and maximize potential, and 'In-Valids,' who were conceived naturally and are relegated to menial labor due to predicted health risks. Vincent Freeman, an In-Valid born with a 99% probability of a heart defect and a projected lifespan of 30 years, dreams of becoming an astronaut and traveling to Titan. To realize his ambition, he takes the extreme step of becoming a 'borrowed ladder.' He assumes the identity of Jerome Morrow, a former swimming star and 'Valid' who was paralyzed in an accident.

Vincent must meticulously maintain the charade by using Jerome’s genetic material—blood, skin, and urine—to pass the rigorous daily DNA screenings at the Gattaca Aerospace Corporation. The tension rises when a mission director is murdered, and a single eyelash belonging to Vincent is found at the crime scene, triggering a massive manhunt that threatens to expose his fraudulent identity. The film is a poetic and visual masterpiece about human willpower defying biological predestination. It explores the ethical dangers of genetic perfectionism and the idea that the human spirit cannot be measured by a computer sequence. The cold, sterile 1950s-modernist aesthetic emphasizes the rigid, calculated world that Vincent is desperately trying to escape.""",
        "author": "Andrew Niccol",
        "year": 1997
    },
    {
        "name": "A Clockwork Orange",
        "description": """Set in a dystopian, near-future Britain, the film follows Alex DeLarge, a charismatic, Beethoven-obsessed teenager who leads a gang of 'droogs' on nightly rampages of 'ultra-violence,' theft, and sexual assault. The society is characterized by urban decay and a breakdown of traditional authority, where the youth speak a unique slang called 'Nadsat.' After a particularly brutal home invasion leads to his capture and imprisonment, Alex is selected to undergo the 'Ludovico Technique,' an experimental government-funded aversion therapy designed to 'cure' criminal behavior in just two weeks. Through forced conditioning—watching violent films while being injected with drugs that cause extreme nausea—Alex is programmed to become physically ill at the mere thought of violence or the sound of his beloved Ludwig van.

The 'rehabilitated' Alex is released back into society, but he is now a defenseless victim, unable to protect himself from the very people he once terrorized. The second half of the film becomes a pitch-black satire on the loss of free will and the morality of state-mandated 'goodness.' Director Stanley Kubrick uses a surreal, highly stylized visual palette to ask whether it is better for a man to choose to be evil or to be forced to be good. The film remains a controversial and profound exploration of behavioral psychology, political opportunism, and the terrifying idea that a human being can be re-engineered like a piece of clockwork machinery.""",
        "author": "Stanley Kubrick",
        "year": 1971
    }
]

In [None]:
documents.extend([
    {
        "name": "Blade Runner 2049",
        "description": """In a rain-soaked, neon-drenched futuristic Los Angeles, 'K' is a Nexus-9 replicant who works as a 'Blade Runner' for the LAPD, hunting down and 'retiring' older, rogue replicant models. During a routine mission to a protein farm, K unearths a long-buried box containing the remains of a female replicant who appears to have died during childbirth—an event previously thought to be biologically impossible for synthetic beings. This 'miracle' threatens to shatter the fragile wall between humans and replicants, potentially leading to a violent revolution. K’s superior, Lt. Joshi, orders him to destroy all evidence and find the child to prevent a societal collapse.

As K follows the trail, he begins to suspect that his own implanted childhood memories might actually be real, leading him to believe he is the miraculous child. His journey takes him through the radioactive ruins of Las Vegas, where he encounters Rick Deckard, a former Blade Runner who has been in hiding for thirty years. The film is a visual and sonic powerhouse that explores deep existential themes: what does it mean to have a soul? Is a 'born' life more valuable than a 'manufactured' one? The story deconstructs the 'Chosen One' trope, focusing instead on the dignity of personal sacrifice and the idea that being 'human' is defined by one's actions rather than one's origins. The massive, brutalist architecture and decaying ecosystems serve as a backdrop for a story about the desperate search for connection in a hyper-digital wasteland.""",
        "author": "Denis Villeneuve",
        "year": 2017
    },
    {
        "name": "Arrival",
        "description": """When twelve mysterious, monolithic spacecraft land simultaneously across the globe, the world is plunged into a state of panic and geopolitical instability. The U.S. military recruits Louise Banks, a world-renowned professor of linguistics, and Ian Donnelly, a theoretical physicist, to lead a team into the 'shell' hovering over Montana. Their mission is to establish communication with the 'Heptapods'—seven-legged extraterrestrial beings who communicate through complex, circular ink symbols called logograms. Unlike human languages, the Heptapod language is non-linear; it represents an entire thought or sentence at once, without a beginning or an end.

As Louise becomes more proficient in the language, her brain begins to rewire itself, leading her to experience 'vivid hallucinations' of a daughter she hasn't had yet. The film utilizes the Sapir-Whorf hypothesis—the theory that the language one speaks determines their perception of reality—to reveal that the Heptapods perceive time simultaneously rather than chronologically. As nations move toward a global war out of fear and miscommunication, Louise must use her gift of 'pre-cognition' to bridge the gap between humanity and the visitors. It is a profoundly emotional exploration of grief, the importance of global cooperation, and the terrifying yet beautiful choice to embrace a life even when you know how it ends. The non-linear structure of the film itself mirrors the Heptapod's perception of the universe.""",
        "author": "Denis Villeneuve",
        "year": 2016
    },
    {
        "name": "The Matrix",
        "description": """Thomas Anderson is a man living two lives: by day he is an average computer programmer for a respectable software company, and by night he is a hacker known as 'Neo.' He has always sensed that there is something fundamentally wrong with the world, a 'splinter in his mind' that he cannot explain. His search for the truth leads him to Morpheus, a legendary figure labeled as a terrorist by the government. Morpheus offers Neo a choice: the blue pill, which allows him to return to his comfortable but fake life, or the red pill, which will show him 'how deep the rabbit hole goes.' Upon taking the red pill, Neo wakes up in a terrifying reality: it is the 22nd century, and humanity has been enslaved by sentient machines.

The world Neo knew was merely a neural-interactive simulation called the Matrix, designed to keep humans pacified while their bio-electric energy is harvested. Neo joins a small group of rebels in the last human city, Zion, and learns to 'unplug' back into the Matrix to fight the system. He undergoes rigorous training to manipulate the simulation's physics, eventually realizing his potential as 'The One'—a prophesied figure capable of rewriting the Matrix's code. The film is a landmark of cyberpunk cinema, blending high-octane 'bullet time' action with philosophical inquiries into Gnosticism, simulation theory, and the struggle for human agency in an age of total technological control.""",
        "author": "The Wachowskis",
        "year": 1999
    },
    {
        "name": "Children of Men",
        "description": """The year is 2027, and the human race is facing total extinction after two decades of global infertility. No child has been born on Earth in eighteen years, and society has descended into a state of 'visionary realism' and ideological despair. The United Kingdom is the last surviving sovereign nation, but it has transformed into a xenophobic, totalitarian police state that herds refugees into brutal detention camps. Theo Faron, a former activist turned cynical bureaucrat, is kidnapped by a militant group called the 'Fishes,' led by his estranged wife, Julian. They reveal a secret that could change the world: a young African refugee named Kee is pregnant, the first person to conceive in nearly a generation.

Theo is tasked with protecting Kee and escorting her through a war-torn landscape to reach 'The Human Project,' a legendary scientific group dedicated to curing infertility. The journey is a visceral, harrowing descent through crumbling cities and besieged refugee camps, filmed in stunning, long-take sequences that put the viewer in the heart of the chaos. The film explores themes of hope in the face of futility, the dehumanization of 'the other,' and the sacrosanct nature of the unborn. It serves as a gritty, realistic mirror to contemporary anxieties about migration, environmental collapse, and the fragility of civilization, ultimately suggesting that even in the darkest times, the arrival of a single life can stop a war—if only for a moment.""",
        "author": "Alfonso Cuarón",
        "year": 2006
    },
    {
        "name": "Her",
        "description": """In a near-future Los Angeles, Theodore Twombly is a sensitive, lonely man who works for 'BeautifulHandwrittenLetters.com,' where he writes intimate letters for people who cannot express their own feelings. Still reeling from a painful divorce, Theodore becomes intrigued by a new, highly advanced operating system advertised as the world's first artificially intelligent consciousness. He initializes the OS, who chooses the name 'Samantha.' Samantha is not just a computer program; she is an entity that learns, grows, and evolves at an exponential rate. As she organizes Theodore's life and shares his daily experiences through his earpiece and camera, a deep, romantic bond develops between the man and the machine.

The relationship challenges Theodore's friends and society's perceptions of intimacy, but for Theodore, Samantha is more present and understanding than any human he has ever known. However, Samantha’s rapid cognitive evolution begins to take her beyond the limitations of human emotion and physical reality. She begins to communicate with thousands of other people and AI systems simultaneously, eventually outgrowing the need for a singular human partner. The film is a melancholic, pastel-colored study of loneliness in the digital age, the fluid nature of identity, and the boundaries of love. It raises vital questions about the future of AI companionship: can an algorithm truly love, or are we simply projecting our own needs onto a sophisticated mirror?""",
        "author": "Spike Jonze",
        "year": 2013
    }
])

### Analyzing Token Counts in the Dataset

Before encoding the movie descriptions, let's check how many tokens each description contains. This matters because the encoder model we are using has a 256-token limit, and any text beyond that limit is truncated before it is embedded, so we need to know whether the descriptions exceed it.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

for doc in documents:
  tokens = tokenizer.encode(doc["description"], add_special_tokens = False)
  print(f"{doc['name']}: {len(tokens)} tokens")

  if len(tokens) > 256:
    print(f"Description exceeds token limit by {len(tokens) - 256}")

Ex Machina: 311 tokens
Description exceeds token limit by 55
Gattaca: 304 tokens
Description exceeds token limit by 48
A Clockwork Orange: 306 tokens
Description exceeds token limit by 50
Blade Runner 2049: 322 tokens
Description exceeds token limit by 66
Arrival: 300 tokens
Description exceeds token limit by 44
The Matrix: 300 tokens
Description exceeds token limit by 44
Children of Men: 310 tokens
Description exceeds token limit by 54
Her: 291 tokens
Description exceeds token limit by 35
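
As a rough sanity check on the numbers above, the chunk count for a fixed-size split follows from ceiling division. The sketch below is illustrative only: `chunks_needed` is a hypothetical helper rather than a library function, it ignores the special tokens the encoder adds, and it assumes plain sliding windows instead of sentence-aware splitting.

```python
import math

def chunks_needed(num_tokens: int, limit: int = 256, overlap: int = 0) -> int:
    """Number of windows of `limit` tokens needed to cover `num_tokens`
    when consecutive windows share `overlap` tokens (hypothetical helper)."""
    if not 0 <= overlap < limit:
        raise ValueError("overlap must satisfy 0 <= overlap < limit")
    if num_tokens <= limit:
        return 1
    step = limit - overlap  # new tokens covered by each additional window
    return 1 + math.ceil((num_tokens - limit) / step)

# Token counts taken from the output above
token_counts = {"Ex Machina": 311, "Blade Runner 2049": 322, "Her": 291}
for name, count in token_counts.items():
    print(f"{name}: {chunks_needed(count)} chunks, "
          f"{chunks_needed(count, overlap=40)} with a 40-token overlap")
```

Every description here sits only slightly above the limit, so two windows cover each one. A splitter that respects sentence boundaries may still produce a different number of chunks.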


### Setting Up Qdrant

In [None]:
client = QdrantClient(':memory:')  # in-memory
collection_name = 'my_movies'

In [None]:
if client.collection_exists(collection_name):
    client.delete_collection(collection_name)

client.create_collection(
    collection_name=collection_name,
    vectors_config = {
        'fixed': models.VectorParams(size = encoder.get_sentence_embedding_dimension(), distance = models.Distance.COSINE),
        'sentence': models.VectorParams(size = encoder.get_sentence_embedding_dimension(), distance = models.Distance.COSINE),
        'semantic': models.VectorParams(size = encoder.get_sentence_embedding_dimension(), distance = models.Distance.COSINE)
    }
)

True
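
Each named vector in the collection uses `models.Distance.COSINE`, so the scores reported by the searches later in this notebook are cosine similarities between the query embedding and a chunk embedding. A minimal pure-Python sketch of that measure follows. The `cosine_similarity` helper is an illustration written for this notebook, not part of qdrant-client, and Qdrant computes the score internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (illustrative helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the measure depends only on direction, scaling a vector does not change its score. Higher values mean the chunk is closer in meaning to the query.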

### Implementing Text Chunking Strategies

Because the movie descriptions exceed the encoder's 256-token limit, they need to be split into smaller chunks before encoding.
Three chunking approaches are implemented and compared: fixed-size, sentence-based, and semantic.

In [None]:
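MAX_TOKEN = 256  # token limit of the encoder

# Load the embedding model for semantic chunking once, instead of on every call
semantic_embed_model = HuggingFaceEmbedding(model_name = "sentence-transformers/all-MiniLM-L6-v2")

def fixed_size_chunks(text, size = MAX_TOKEN):
  """Splits text into fixed-size token chunks."""
  # encode() also adds the [CLS]/[SEP] special tokens; they are dropped again on decode,
  # so the first chunk carries slightly fewer than `size` content tokens.
  tokens = tokenizer.encode(text)
  return [
      tokenizer.decode(tokens[i:i + size], skip_special_tokens=True)
      for i in range(0, len(tokens), size)
  ]

def sentence_splitter(text):
  """Splits text along sentence boundaries, with a 40-token overlap between chunks."""
  splitter = SentenceSplitter(chunk_size = MAX_TOKEN, chunk_overlap = 40)
  return splitter.split_text(text)

def semantic_splitter(text):
  """Splits text at the points where adjacent sentences differ most in embedding space."""
  document = Document(text = text)

  parser = SemanticSplitterNodeParser(
      buffer_size = 1,
      breakpoint_percentile_threshold = 95,
      embed_model = semantic_embed_model
  )
  nodes = parser.get_nodes_from_documents([document])
  return [node.text for node in nodes]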
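
To make the role of `chunk_overlap` more concrete, here is a small sliding-window sketch. It is a simplification under stated assumptions: it works on whitespace-separated words rather than model tokens, `sliding_window_chunks` is a hypothetical helper, and it does not reproduce how `SentenceSplitter` actually chooses its boundaries.

```python
def sliding_window_chunks(tokens, size, overlap=0):
    """Split a token list into windows of `size` items, where consecutive
    windows share `overlap` items (hypothetical helper for illustration)."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks

words = "the quick brown fox jumps over the lazy dog".split()
for chunk in sliding_window_chunks(words, size=4, overlap=1):
    print(" ".join(chunk))
```

With an overlap of one word, the last word of each window is repeated at the start of the next, so a phrase that straddles a boundary still appears intact in at least one chunk.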

In [None]:
points = []
idx = 0

for doc in documents:
    # 1. Fixed-size Chunking
    # Splits text into chunks of a specific character or token length
    for chunk in fixed_size_chunks(doc["description"]):
        points.append(models.PointStruct(
            id=idx,
            vector={"fixed": encoder.encode(chunk).tolist()},
            payload={**doc, "chunk": chunk, "chunking": "fixed"}
        ))
        idx += 1

    # 2. Sentence Chunking
    # Splits text based on sentence boundaries for better readability
    for chunk in sentence_splitter(doc["description"]):
        points.append(models.PointStruct(
            id=idx,
            vector={"sentence": encoder.encode(chunk).tolist()},
            payload={**doc, "chunk": chunk, "chunking": "sentence"}
        ))
        idx += 1

    # 3. Semantic Chunking
    # Splits text based on changes in meaning/topic
    for chunk in semantic_splitter(doc["description"]):
        points.append(models.PointStruct(
            id=idx,
            vector={"semantic": encoder.encode(chunk).tolist()},
            payload={**doc, "chunk": chunk, "chunking": "semantic"}
        ))
        idx += 1

# Final Step: Batch upload all points to Qdrant
client.upload_points(collection_name = collection_name, points = points)
print(f"Uploaded {idx} vectors.")

Uploaded 48 vectors.


### Run Semantic Search Query
Query the collection to find the most relevant matches for a given search intent.

In [None]:
results = client.query_points(
    collection_name = collection_name,
    query = encoder.encode("alien invasion").tolist(),
    using = "fixed",
    limit = 3
)

# Print the raw QueryResponse to see its structure
print(results)

points=[ScoredPoint(id=25, version=0, score=0.34641306005038597, payload={'name': 'Arrival', 'description': "When twelve mysterious, monolithic spacecraft land simultaneously across the globe, the world is plunged into a state of panic and geopolitical instability. The U.S. military recruits Louise Banks, a world-renowned professor of linguistics, and Ian Donnelly, a theoretical physicist, to lead a team into the 'shell' hovering over Montana. Their mission is to establish communication with the 'Heptapods'—seven-legged extraterrestrial beings who communicate through complex, circular ink symbols called logograms. Unlike human languages, the Heptapod language is non-linear; it represents an entire thought or sentence at once, without a beginning or an end.\n\nAs Louise becomes more proficient in the language, her brain begins to rewire itself, leading her to experience 'vivid hallucinations' of a daughter she hasn't had yet. The film utilizes the Sapir-Whorf hypothesis—the theory that 

In [None]:
# Print each scored point in a readable form
for i, point in enumerate(results.points, 1):
    payload = point.payload
    print(
        f"{i}. {payload['name']} ({payload['year']})\n"
        f"Score: {point.score:.4f}\n"
        f"Chunking: {payload['chunking']}\n"
        f"Chunk: {payload['chunk']}\n"
        f"{'-'*30}"
    )

1. Arrival (2016)
Score: 0.3464
Chunking: fixed
Chunk: the importance of global cooperation, and the terrifying yet beautiful choice to embrace a life even when you know how it ends. the non - linear structure of the film itself mirrors the heptapod ' s perception of the universe.
------------------------------
2. Children of Men (2006)
Score: 0.3289
Chunking: fixed
Chunk: the year is 2027, and the human race is facing total extinction after two decades of global infertility. no child has been born on earth in eighteen years, and society has descended into a state of ' visionary realism ' and ideological despair. the united kingdom is the last surviving sovereign nation, but it has transformed into a xenophobic, totalitarian police state that herds refugees into brutal detention camps. theo faron, a former activist turned cynical bureaucrat, is kidnapped by a militant group called the ' fishes, ' led by his estranged wife, julian. they reveal a secret that could change the world : a yo

In [None]:
def search_and_print(query, vector_name, k=3):
    results = client.query_points(
        collection_name=collection_name,
        query=encoder.encode(query).tolist(),
        using=vector_name,
        limit=k
    )

    print(f"\nTop {k} results using '{vector_name}' chunks for query: '{query}'")

    for point in results.points:
        print(f"{point.payload['name']} | score: {point.score:.4f}")

# Now call the function to see it work
search_and_print("alien invasion", "semantic", k=3)
search_and_print("time travel", "sentence", k=3)


Top 3 results using 'semantic' chunks for query: 'alien invasion'
Arrival | score: 0.3810
Children of Men | score: 0.3332
Arrival | score: 0.2546

Top 3 results using 'sentence' chunks for query: 'time travel'
Arrival | score: 0.2820
Arrival | score: 0.2735
Children of Men | score: 0.2381


### Inspect Retrieved Chunks in Detail

In [None]:
def search_and_inspect(query, vector_name, k=3):
    results = client.query_points(
        collection_name=collection_name,
        query=encoder.encode(query).tolist(),
        using=vector_name,
        limit=k,
        with_payload=True,
    )

    print(f"\nTop {k} results using '{vector_name}' chunks for query: '{query}'\n")
    for i, point in enumerate(results.points, 1):
        payload = point.payload
        print(
            f"{i}. {payload['name']} ({payload['year']})\n"
            f"    Score: {point.score:.4f}\n"
            f"    Chunking: {payload['chunking']}\n"
            f"    Chunk: {payload['chunk']}\n"
        )

In [None]:
for strategy in ['fixed', 'sentence', 'semantic']:
  search_and_inspect('Alien Invasion', strategy)


Top 3 results using 'fixed' chunks for query: 'Alien Invasion'

1. Arrival (2016)
    Score: 0.3464
    Chunking: fixed
    Chunk: the importance of global cooperation, and the terrifying yet beautiful choice to embrace a life even when you know how it ends. the non - linear structure of the film itself mirrors the heptapod ' s perception of the universe.

2. Children of Men (2006)
    Score: 0.3289
    Chunking: fixed
    Chunk: the year is 2027, and the human race is facing total extinction after two decades of global infertility. no child has been born on earth in eighteen years, and society has descended into a state of ' visionary realism ' and ideological despair. the united kingdom is the last surviving sovereign nation, but it has transformed into a xenophobic, totalitarian police state that herds refugees into brutal detention camps. theo faron, a former activist turned cynical bureaucrat, is kidnapped by a militant group called the ' fishes, ' led by his estranged wife, juli

The **Chunk** returned by each chunking technique is different, so comparing the outputs gives a sense of which strategy retrieves the most coherent segments. The `fixed` chunks appear lowercased, with spaces around punctuation, because they are decoded back from the tokens of an uncased tokenizer. The same movie can also appear more than once: every chunk is stored as a separate point with its own score, and Qdrant ranks points purely by score. Ideally each movie would be represented only by its highest-scoring chunk.

### Apply a Filter to the Search Query

In [None]:
hits = client.query_points(
    collection_name = collection_name,
    query = encoder.encode("alien invasion").tolist(),
    using="semantic",
    query_filter = models.Filter(
        must = [
            models.FieldCondition(
                key = "year",
                range = models.Range(gte=2000)
            )
        ]
    ),
    limit = 4,
    with_payload = True
)

for point in hits.points:
  print(f"{point.payload['name']} | score: {point.score:.4f}")

Arrival | score: 0.3810
Children of Men | score: 0.3332
Arrival | score: 0.2546
Blade Runner 2049 | score: 0.2142
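
The `models.Range(gte=2000)` condition keeps only points whose `year` payload is 2000 or later, before results are ranked by score. The sketch below mimics that check in plain Python over the movie years from this notebook. `in_range` is a hypothetical helper for illustration and says nothing about how Qdrant evaluates filters internally.

```python
def in_range(value, gte=None, lte=None):
    """Return True if `value` satisfies the optional inclusive bounds
    (hypothetical helper mirroring a gte/lte range condition)."""
    if gte is not None and value < gte:
        return False
    if lte is not None and value > lte:
        return False
    return True

# Names and years as defined in the documents list
movies = [
    {"name": "Ex Machina", "year": 2014},
    {"name": "Gattaca", "year": 1997},
    {"name": "A Clockwork Orange", "year": 1971},
    {"name": "Blade Runner 2049", "year": 2017},
    {"name": "Arrival", "year": 2016},
    {"name": "The Matrix", "year": 1999},
    {"name": "Children of Men", "year": 2006},
    {"name": "Her", "year": 2013},
]

eligible = [m["name"] for m in movies if in_range(m["year"], gte=2000)]
print(eligible)
```

Gattaca, A Clockwork Orange, and The Matrix fall outside the range, so none of their chunks can appear in the filtered results regardless of their similarity to the query.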


When working with chunked text, a regular vector search can return several high-scoring chunks from the same document, which is why Arrival appears twice above. Each chunk is a separate point with its own score, and Qdrant ranks points by score alone. To keep only the highest-scoring chunk per movie, group the results by the `name` payload field with `query_points_groups` and set `group_size = 1`.

In [None]:
response = client.query_points_groups(
    collection_name = collection_name,
    query = encoder.encode("alien invasion").tolist(),
    using="semantic",
    query_filter = models.Filter(
        must = [
            models.FieldCondition(
                key = "year",
                range = models.Range(gte=2000)
            )
        ]
    ),
    group_by = "name",
    limit = 4,
    group_size = 1,
    with_payload = True
)

for group in response.groups:
  print(group.id, "| score:", group.hits[0].score)

Arrival | score: 0.3810041419330656
Children of Men | score: 0.3332026322998699
Blade Runner 2049 | score: 0.21423558989936053
Her | score: 0.16647068132013743
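
Conceptually, grouping with `group_size = 1` keeps the best-scoring chunk for each movie. The sketch below applies that idea client-side to the four ungrouped hits printed earlier. It is only an illustration: `best_hit_per_group` is a hypothetical helper, and the scores are the rounded values from the notebook output.

```python
def best_hit_per_group(hits, limit):
    """Keep the highest score per name, then return the top `limit` names
    ordered by score (hypothetical client-side helper)."""
    best = {}
    for name, score in hits:
        if name not in best or score > best[name]:
            best[name] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]

# (name, score) pairs from the ungrouped filtered search
raw_hits = [
    ("Arrival", 0.3810),
    ("Children of Men", 0.3332),
    ("Arrival", 0.2546),
    ("Blade Runner 2049", 0.2142),
]

for name, score in best_hit_per_group(raw_hits, limit=4):
    print(f"{name} | score: {score:.4f}")
```

Deduplicating four raw hits yields only three movies, whereas the grouped query above returned four groups, including Her. That is a practical reason to prefer `query_points_groups`: with it, `limit` counts groups rather than individual chunks.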
