## Deeper Analysis and Further Experiments
This section analyzes TF-IDF vectorization on the dataset and describes experiments with other vector encoding methods. These experiments are outside the scope of the challenge; I did them on my own time.

While TF-IDF was sufficient for the purposes of this challenge, I noticed several things. To start, TF-IDF with cosine similarity was excellent at comparing sentences lexically: if two sentences or phrases share vocabulary, the algorithm usually picks it up. However, this is also where the shortcomings come in. If the word order changes or synonyms are used, the program sometimes struggles to return the correct results.

Take these two queries as an example, run through the `main.py` script.

Query: `psycho horror movies`
```txt
Results
Movie: Psycho --- Similarity: 0.2611649965940957
Movie: A Night at the Opera --- Similarity: 0.08203814048662973
Movie: What Ever Happened to Baby Jane? --- Similarity: 0.06942657526655543
Movie: All Quiet on the Western Front --- Similarity: 0.06685645193757431
Movie: The Exorcist --- Similarity: 0.06444365239730257
```

Query: `horror psycho movies`
```txt
Results
Movie: Slumdog Millionaire --- Similarity: 0.0
Movie: The Straight Story --- Similarity: 0.0
Movie: His Girl Friday --- Similarity: 0.0
Movie: Short Term 12 --- Similarity: 0.0
Movie: The Lost Weekend --- Similarity: 0.0
```

As one can see, although both queries contain the same words, they gave completely different results. In fact, the second query returned no matches at all (every similarity is 0.0), even though it means essentially the same thing as the first. This shows that for tasks such as content recommendation, TF-IDF and lexical analysis can only go so far. What we need are vector embeddings that capture the semantic meaning of both the data and the queries.
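The lexical limitation can be reproduced on a small scale. The sketch below is not the `main.py` pipeline; it assumes scikit-learn's `TfidfVectorizer` with default settings and uses a three-sentence toy corpus invented for illustration. The helper name `tfidf_scores` is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus, invented for illustration only (not from the IMDb dataset)
documents = [
    "a scary horror film about a haunted house",
    "a romantic comedy about two friends",
    "a frightening ghost story set in an old mansion",
]


def tfidf_scores(query, docs):
    """Return the TF-IDF cosine similarity between the query and each document."""
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(docs)
    query_vector = vectorizer.transform([query])
    return cosine_similarity(query_vector, doc_vectors).flatten()


# Shares the words "horror" and "film" with the first document only
lexical_scores = tfidf_scores("horror film", documents)

# Similar meaning, but none of these words occur in the corpus
synonym_scores = tfidf_scores("spooky movie", documents)

print(lexical_scores)
print(synonym_scores)
```

In this toy setup the third document is also a scary story, yet it scores zero for both queries because it shares no vocabulary with either. The second query scores zero everywhere for the same reason.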

### SBERT
After further research, I found a pre-trained embedding model that encodes a sentence's semantic meaning as a vector. This model, SBERT, is built on top of BERT, which uses transformers to capture information within words and sentences. Since SBERT is a complex neural network, it was not included in the main submission; however, it was still interesting to see the improvement this change brings.

Using the Jupyter code below and the same queries as above:

Query: `psycho horror movies`
```txt
Movie: Psycho --- Similarity: 0.6416359543800354
Movie: Touch of Evil --- Similarity: 0.5505009889602661
Movie: The Exorcist --- Similarity: 0.5373932123184204
Movie: Rope --- Similarity: 0.5182092189788818
Movie: The Shining --- Similarity: 0.5134167075157166
```

Query: `horror psycho movies`
```txt
Movie: Psycho --- Similarity: 0.6472702622413635
Movie: Touch of Evil --- Similarity: 0.551189661026001
Movie: The Exorcist --- Similarity: 0.529065728187561
Movie: The Shining --- Similarity: 0.5264713168144226
Movie: Rope --- Similarity: 0.5127915740013123
```

Because these two queries are semantically the same, SBERT captured this and returned similar results after calculating cosine similarity. This is a stark difference from the output of the TF-IDF pipeline.
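Both pipelines rank movies by cosine similarity, so it may help to see what that score measures. The following is a minimal NumPy sketch, assuming plain Python lists as input; the function name `cosine` and the three-dimensional vectors are made up for illustration and are not real SBERT embeddings.

```python
import numpy as np


def cosine(u, v):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Made-up 3-dimensional "embeddings"; real SBERT vectors have hundreds of dimensions
query = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]  # same direction, different length
orthogonal = [0.0, 0.0, 3.0]      # no overlap with the query

print(cosine(query, same_direction))
print(cosine(query, orthogonal))
```

The score depends only on the direction of the vectors, not their length, which is why two differently worded queries can land close to the same movies if the model maps them to similar directions.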

Below is the Jupyter code for my SBERT prompting pipeline.

```python
# Install the sentence-transformers package (notebook magic; run inside Jupyter)
%pip install sentence-transformers
```

```python
# Import dependencies
import numpy as np
import pandas as pd

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
```


```python
# Load one of SBERT's pretrained sentence-transformer models
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
```

```python
# Load the dataset and add the All_Info column, same as in main.py
df = pd.read_csv('./data/imdb_top250_movies.csv')
info_columns = ["Title", "Genre", "Director", "Actors", "Plot", "Language"]

# Join the selected text columns into one string per movie;
# astype(str) guards against non-string cells breaking the join
df["All_Info"] = df[info_columns].astype(str).agg(" ".join, axis=1)

# Encode every movie description into a dense embedding vector
sentences = df["All_Info"].tolist()
vectors = model.encode(sentences)
```

```python
# Prompting block

# Enter prompt here ==============================
prompt = "return a list of superhero movies"
# ================================================

# Embed the prompt and compare it against every movie embedding
prompt_vector = model.encode(prompt)
similarity = cosine_similarity([prompt_vector], vectors).flatten()

# Indices of the five highest similarity scores, best match first
idx = np.argsort(similarity)[-5:][::-1]
titles = df["Title"].to_numpy()

for i in idx:
    print(f"Movie: {titles[i]} --- Similarity: {similarity[i]}")
print("===========================================")
```