# 🎬 Movie plot similarity explorer

This notebook helps you understand **text similarity** by comparing movie plot descriptions. It is a *walkthrough* notebook, not an assignment: it illustrates how the methods work.

We will:
- Use **CountVectorizer** and **TfidfVectorizer** with cosine similarity
- Use **spaCy embeddings** for a smarter similarity (soft cosine)
- Take a **user query** and find the most similar movie plot

Let's get started!

In [8]:
# Only run this once to install the model
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [10]:
import numpy as np
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import re

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/anne/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Define movie plots

In [17]:
movie_plots = [
    "A man discovers reality is an illusion and joins a resistance to fight digital overlords. (The Matrix)",
    "An exiled heir returns to take back his homeland from a tyrant uncle. (The Lion King)",
    "A young couple from different worlds fall in love aboard a doomed ship. (Titanic)",
    "A group undertakes a journey to destroy a powerful object and defeat a rising darkness. (The Lord of the Rings)",
    "A young orphan is invited to a hidden institution to learn to harness mystical forces. (Harry Potter)",
    "A man wakes up with no memory and evades secret agents while uncovering his past. (The Bourne Identity)",
    "A linguist must interpret an alien language to prevent global war. (Arrival)",
    "A baseball manager uses data and algorithms to rebuild his losing team. (Moneyball)",
    "A student builds a tech empire while navigating betrayal and lawsuits. (The Social Network)",
    "A lonely man falls in love with an intelligent operating system. (Her)"
]


## You can explore and play around with regular and soft cosine on this dataset [here](https://moviesimilarity-guvj2z7bwubdlxkibnn7e3.streamlit.app/)

### 🔍 Enter your own plot/query
Type a short movie description or idea, and we'll show the most similar movie.

In [15]:
# User query
query_plot = "Boy finds school for magic, where he was taught how to control powerful spells and faces dark force."
all_texts = movie_plots + [query_plot]  # all plots, followed by the query

## Cosine Similarity based on CountVectorizer
This just looks at word overlap, not meaning.
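
To make "word overlap" concrete, here is a minimal pure-Python sketch of what a count-based cosine similarity computes. It is not scikit-learn's implementation: the helper names `bag_of_words` and `cosine` and the three toy sentences are made up for illustration, and the token rule (lowercased words of two or more characters) is only meant to mimic `CountVectorizer`'s default behaviour.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count word tokens of two or more characters."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy sentences, not taken from movie_plots
toy_query = "a boy learns magic at a hidden school"
toy_plot_a = "an orphan learns magic at a hidden institution"
toy_plot_b = "a couple fall in love aboard a doomed ship"

score_a = cosine(bag_of_words(toy_query), bag_of_words(toy_plot_a))
score_b = cosine(bag_of_words(toy_query), bag_of_words(toy_plot_b))

print(f"Some shared words: {score_a:.4f}")
print(f"No shared words:   {score_b:.4f}")
```

Only words that appear in both texts contribute to the dot product, so "school" and "institution" count as completely unrelated here, and the second plot scores zero because it shares no counted word with the query.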

In [24]:
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform([query_plot] + movie_plots)
cosine_sim = cosine_similarity(vectors[0:1], vectors[1:])
#print("CountVectorizer Cosine Similarities:", cosine_sim)

# Find the index of the most similar movie
most_similar_index = cosine_sim[0].argmax()
most_similar_score = cosine_sim[0][most_similar_index]

# Output the best match
print("🎯 Most Similar Movie:")
print(f"Movie {most_similar_index}: {movie_plots[most_similar_index]}")
print(f"Similarity Score: {most_similar_score:.4f}")

🎯 Most Similar Movie:
Movie 10: A young man moves to a new city where he is invited for a job and tries to balance her personal and professional life. (The New Beginnings)
Similarity Score: 0.4082


In [27]:
# Want to know the cosine scores of the other movies?
for idx, score in enumerate(cosine_sim[0]):
    print(f"Movie {idx}: {movie_plots[idx]}")
    print(f"Similarity Score: {score:.4f}")
    print("------")

Movie 0: A man discovers reality is an illusion and joins a resistance to fight digital overlords. (The Matrix)
Similarity Score: 0.2309
------
Movie 1: An exiled heir returns to take back his homeland from a tyrant uncle. (The Lion King)
Similarity Score: 0.1155
------
Movie 2: A young couple from different worlds fall in love aboard a doomed ship. (Titanic)
Similarity Score: 0.0000
------
Movie 3: A group undertakes a journey to destroy a powerful object and defeat a rising darkness. (The Lord of the Rings)
Similarity Score: 0.2108
------
Movie 4: A young orphan is invited to a hidden institution where he learns to harness mystical forces. (Harry Potter)
Similarity Score: 0.3689
------
Movie 5: A man wakes up with no memory and evades secret agents while uncovering his past. (The Bourne Identity)
Similarity Score: 0.0542
------
Movie 6: A linguist must interpret an alien language to prevent global war. (Arrival)
Similarity Score: 0.1348
------
Movie 7: A baseball manager uses data an

## Cosine similarity using TF-IDF
This weights words that are rare across the collection more heavily and down-weights words that appear in many documents, like "the".
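
The weighting idea can be sketched in a few lines of plain Python. This assumes the smoothed inverse document frequency that scikit-learn documents for `TfidfVectorizer` with its default `smooth_idf=True`, namely idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing t. The three toy documents and the helper `smooth_idf` are hypothetical.

```python
import math

# Hypothetical toy corpus: "the" occurs everywhere, "wizard" only once
docs = [
    "the wizard opens the door",
    "the ship leaves the harbour",
    "the team wins the game",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

def smooth_idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in tokens for tokens in tokenized)  # document frequency
    return math.log((1 + n_docs) / (1 + df)) + 1

idf_the = smooth_idf("the")
idf_wizard = smooth_idf("wizard")

print(f"idf('the')    = {idf_the:.4f}")
print(f"idf('wizard') = {idf_wizard:.4f}")
```

A word found in every document gets the minimum weight of 1, while a word found in a single document gets a noticeably higher one, so matches on distinctive words move the cosine score more than matches on common ones.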

In [29]:
tfidf = TfidfVectorizer()
tfidf_vectors = tfidf.fit_transform([query_plot] + movie_plots)
tfidf_sim = cosine_similarity(tfidf_vectors[0:1], tfidf_vectors[1:])
#print("TF-IDF Cosine Similarities:", tfidf_sim)
# Find the index of the most similar movie
most_similar_index = tfidf_sim[0].argmax()
most_similar_score = tfidf_sim[0][most_similar_index]

# Output the best match
print("🎯 Most Similar Movie:")
print(f"Movie {most_similar_index}: {movie_plots[most_similar_index]}")
print(f"Similarity Score: {most_similar_score:.4f}")

🎯 Most Similar Movie:
Movie 10: A young man moves to a new city where he is invited for a job and tries to balance her personal and professional life. (The New Beginnings)
Similarity Score: 0.2238


## Soft Cosine similarity (using embeddings from spaCy)
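This method compares meaning rather than exact words. It uses word vectors, so it can treat "wizard" and "magic" as related even though they are not identical.
Note that spaCy's `Doc.similarity` computes the cosine similarity between the averaged word vectors of the two texts, which approximates the soft cosine idea rather than implementing the soft cosine measure exactly.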
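
For comparison, here is a minimal sketch of the soft cosine measure itself, written in plain Python. The three-word vocabulary and the term-to-term similarity matrix `S` are invented for illustration (in practice such a matrix would be derived from word embeddings), and this is not what `Doc.similarity` runs internally.

```python
import math

vocab = ["wizard", "magic", "ship"]

# Hypothetical term-to-term similarities: "wizard" and "magic" are related
S = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# With the identity matrix, soft cosine reduces to ordinary cosine
IDENTITY = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

def bilinear(a, b, sim):
    """Compute a^T * sim * b for plain Python lists."""
    return sum(a[i] * sim[i][j] * b[j]
               for i in range(len(a)) for j in range(len(b)))

def soft_cosine(a, b, sim):
    """Soft cosine: a^T S b / (sqrt(a^T S a) * sqrt(b^T S b))."""
    denom = math.sqrt(bilinear(a, a, sim)) * math.sqrt(bilinear(b, b, sim))
    return bilinear(a, b, sim) / denom if denom else 0.0

doc_wizard = [1, 0, 0]  # a text containing only "wizard"
doc_magic = [0, 1, 0]   # a text containing only "magic"

plain = soft_cosine(doc_wizard, doc_magic, IDENTITY)
soft = soft_cosine(doc_wizard, doc_magic, S)

print(f"Ordinary cosine: {plain:.2f}")
print(f"Soft cosine:     {soft:.2f}")
```

The two texts share no word, so ordinary cosine sees nothing in common, while the soft version credits them with the similarity stored in `S` for the pair "wizard" and "magic".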

In [31]:
nlp = spacy.load("en_core_web_md")
query_doc = nlp(query_plot)
similarities = [query_doc.similarity(nlp(p)) for p in movie_plots]
#print("spaCy Similarities:", similarities)

# Find the index of the most similar plot
most_similar_index = similarities.index(max(similarities))
most_similar_score = similarities[most_similar_index]

# Output the best match
print("🎯 Most Similar Movie:")
print(f"Movie {most_similar_index}: {movie_plots[most_similar_index]}")
print(f"Similarity Score: {most_similar_score:.4f}")

🎯 Most Similar Movie:
Movie 4: A young orphan is invited to a hidden institution where he learns to harness mystical forces. (Harry Potter)
Similarity Score: 0.9142


# Add preprocessing to the mix!

### As a next step, you can investigate how cosine similarity scores differ after applying preprocessing steps.

## Why can preprocessing improve cosine similarity?

When comparing text similarity using vectorizers like `CountVectorizer` or `TfidfVectorizer`, **preprocessing the text** can significantly improve the results. Here's why:

### What Preprocessing Does

| Preprocessing Step     | Purpose |
|------------------------|---------|
| **Lowercasing**        | Makes "The" and "the" identical |
| **Tokenization**       | Breaks text into words for further processing |
| **Stopword Removal**   | Removes common, non-informative words like "the", "is", etc. |
| **Stemming**           | Reduces words to their root form (e.g., "loved" → "love") |

### How this helps cosine similarity

Cosine similarity measures the **angle between two vectors**, not their length. Vectors built from cleaned, normalized text are:
- More compact
- Less noisy
- More focused on the meaningful content words

This often leads to **more accurate similarity comparisons** between texts.

For example:
- Raw: "The movie is amazing!"
- Cleaned: "movi amaz"

Both "The movie is amazing!" and "An amazing movie indeed." reduce to token lists that share the stems `movi` and `amaz`, making them more likely to match.
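
The effect on the score can be checked with a small pure-Python sketch using the two example sentences above. To stay dependency-free it applies only lowercasing and stopword removal, with a tiny hand-picked stopword set (`TOY_STOPWORDS`, an assumption standing in for NLTK's list) and no stemming.

```python
import math
import re
from collections import Counter

# Hypothetical mini stopword list, a stand-in for NLTK's English list
TOY_STOPWORDS = {"the", "is", "an"}

def count_tokens(text, remove_stopwords=False):
    """Lowercase, keep alphabetic tokens, optionally drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    if remove_stopwords:
        words = [w for w in words if w not in TOY_STOPWORDS]
    return Counter(words)

def cosine(a, b):
    """Cosine similarity between two Counters of token counts."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

text_a = "The movie is amazing!"
text_b = "An amazing movie indeed."

before = cosine(count_tokens(text_a), count_tokens(text_b))
after = cosine(count_tokens(text_a, True), count_tokens(text_b, True))

print(f"Before stopword removal: {before:.4f}")
print(f"After stopword removal:  {after:.4f}")
```

Removing the stopwords shortens both vectors without touching the shared content words, so the angle between them shrinks and the score rises.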

---

### When NOT to preprocess

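When using embeddings (for Soft Cosine), extensive preprocessing, like stemming or stopword removal, is not needed! Pretrained word vectors are learned for ordinary word forms, so a stem such as "movi" is usually not in the model's vocabulary.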

---

### Summary

> Use basic preprocessing (lowercasing, punctuation removal, stopword removal, stemming) when working with `CountVectorizer` or `TfidfVectorizer` for better cosine similarity, but not when using spaCy embeddings for soft cosine.

In [52]:
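import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# word_tokenize needs NLTK's Punkt tokenizer data; the resource is named
# 'punkt' or 'punkt_tab' depending on the NLTK version, so request both
nltk.download('punkt')
nltk.download('punkt_tab')

# Build these once instead of on every call
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    # Lowercase the text
    text = text.lower()

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stopwords and tokens that consist only of punctuation
    tokens = [word for word in tokens
              if word not in stop_words
              and not all(ch in string.punctuation for ch in word)]

    # Stem the remaining tokens
    return [stemmer.stem(word) for word in tokens]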

In [54]:
# Preprocess plots
processed_plots = [preprocess_text(text) for text in movie_plots]
processed_query = preprocess_text(query_plot)

# Join tokens back into strings (the vectorizers expect strings, not lists of tokens)
processed_plots_joined = [' '.join(tokens) for tokens in processed_plots]
processed_query_joined = ' '.join(processed_query)

In [40]:
# Now, let's use the cleaned text to calculate cosine similarity

In [58]:
vectorizer = CountVectorizer()
#vectorizer = TfidfVectorizer() # or use tfidf
vectors = vectorizer.fit_transform([processed_query_joined] + processed_plots_joined)

# Calculate cosine similarity
cosine_sim = cosine_similarity(vectors[0:1], vectors[1:])

# Find the most similar movie
most_similar_index = cosine_sim[0].argmax()
most_similar_score = cosine_sim[0][most_similar_index]

# Output the result
print("🎯 Most Similar Movie:")
print(f"Movie {most_similar_index}: {movie_plots[most_similar_index]}")
print(f"Similarity Score: {most_similar_score:.4f}")

🎯 Most Similar Movie:
Movie 3: A group undertakes a journey to destroy a powerful object and defeat a rising darkness. (The Lord of the Rings)
Similarity Score: 0.1818
