# 1. Install necessary requirements

In [None]:
%pip install -r requirements.txt

# 2. Import necessary libraries

In [2]:
import sys
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 3. Implement Content-Based Recommendation Systems

In [4]:
def preprocess_data(dataset: str) -> pd.DataFrame:
    '''
    Preprocesses a given csv file for the task.

    :param dataset: Path to csv file to preprocess.
    :type dataset: str
    :returns: Preprocessed DataFrame object.
    :rtype: pd.DataFrame
    '''
    # Read the file at the given path, keep only the title and plot columns, and drop rows with missing values
    df = pd.read_csv(dataset)[['Title', 'Plot']].dropna()
    # Sample 500 examples at random (fixed seed for reproducibility)
    df = df.sample(n=500, random_state=22)
    return df

def vectorize(text: list[str], vectorizer: TfidfVectorizer):
    '''
    Vectorizes a given input text given a fitted vectorizer.

    :param text: Text to vectorize.
    :type text: list[str]
    :param vectorizer: Fitted vectorizer to transform text.
    :type vectorizer: TfidfVectorizer
    :returns: Vectorized text.
    :rtype: scipy.sparse.csr_matrix
    '''
    return vectorizer.transform(text)
    
def compute_similarity(a, b) -> np.ndarray:
    '''
    Computes the cosine similarity between a vectorized text and one or more others.
    
    :param a: Vectorized query text, shape (1, n_features).
    :type a: scipy.sparse.csr_matrix
    :param b: Vectorized texts to compare against, shape (n_texts, n_features).
    :type b: scipy.sparse.csr_matrix
    :returns: Similarity score between 0 and 1 for each row of b.
    :rtype: 1D np.ndarray (float)
    '''
    return cosine_similarity(a, b).flatten()
    
def recommend_movies(df: pd.DataFrame, review: str, vectorizer: TfidfVectorizer) -> list:
    '''
    Given a user review, recommend movies based on the similarity metric.

    :param df: DataFrame containing movies and corresponding summaries.
    :type df: pd.DataFrame
    :param review: User sentiment on movies.
    :type review: str
    :param vectorizer: Fitted vectorizer object used to transform text.
    :type vectorizer: TfidfVectorizer
    :returns: Movies sorted in descending similarity score.
    :rtype: list of (title, similarity score) tuples.
    '''
    # Obtain vectorized dataset and user review
    vectorized_plots = vectorize(df['Plot'], vectorizer)
    vectorized_review = vectorize([review], vectorizer)

    # Calculate and sort titles from high to low similarity scores
    similarity_scores = compute_similarity(vectorized_review, vectorized_plots)
    sorted_indices = np.argsort(similarity_scores)[::-1]

    # Format and return recommendations
    recommendations = [(df['Title'].iloc[idx], similarity_scores[idx]) for idx in sorted_indices]
    return recommendations

# 4. Recommend Movies

In [5]:
def main():
    dataset = './movies.csv'
    # Read in a user review and preprocess the dataset
    user_review = input('Enter user review here:')
    df = preprocess_data(dataset)

    # Initialize the vectorizer with its default settings and fit it on the plots
    vectorizer = TfidfVectorizer()
    vectorizer.fit(df['Plot'])
    
    # Recommend users movies
    user_recommendations = recommend_movies(df, user_review, vectorizer)
    
    print('Top 5 Movie Recommendations')
    for i, (title, score) in enumerate(user_recommendations[:5], start=1):
        print(f'{i}. {title} | Similarity Score: {score}')

In [10]:
if __name__ == '__main__':
    main()

Enter user review here: I like comedy films


Top 5 Movie Recommendations
1. Best Actor | Similarity Score: 0.08575388093328194
2. Alice in Wonderland | Similarity Score: 0.06601668577160774
3. The Band Wagon | Similarity Score: 0.0447180503457958
4. Major Barbara | Similarity Score: 0.03407259752676045
5. An Angel from Texas | Similarity Score: 0.03352653805024108


## Additional Design Considerations / Experimentation Details

1. **Using stemmers, removing punctuation, and general text preprocessing**

There are several ways to improve bag-of-words vectorization schemes such as the one implemented in TfidfVectorizer. For this task, I experimented with stemming, stopword removal, and other text preprocessing. However, many of these changes made the recommendations worse, which I attribute to conflicts with TfidfVectorizer's internal tokenization and fitting scheme. This was judged informally through sanity checks on the top 5 recommendations, not measured empirically. I therefore discarded these changes and opted for a bare-bones approach with just the vectorizer.

Keeping the text in its raw form also makes it easier to experiment with attention-based models, which need unmodified input to account for the semantic meaning and position of words in a text.
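One way to test a preprocessing step without running a separate pipeline in front of the vectorizer is to use TfidfVectorizer's own options. The sketch below is not what the notebook above does; it only compares the default vocabulary with one built using the built-in `stop_words="english"` setting. The two plots are invented for illustration and are not taken from `movies.csv`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy plots, invented for illustration (not taken from movies.csv)
plots = [
    "A detective hunts the thief across the city.",
    "The comedy follows a thief who becomes a detective.",
]

# Baseline: the default settings used in the notebook
plain = TfidfVectorizer().fit(plots)

# Variant: let the vectorizer drop English stopwords itself
filtered = TfidfVectorizer(stop_words="english").fit(plots)

# Terms that the stopword filter removed from the vocabulary
removed = sorted(set(plain.vocabulary_) - set(filtered.vocabulary_))
print(removed)
```

Comparing the two vocabularies this way shows exactly which terms a setting removes before judging its effect on the recommendations.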

2. **Vectorization scheme**

This ties in with the previous point about preprocessing: different vectorization schemes can lead to different results. TfidfVectorizer is by no means the only vectorization scheme suitable for this task, and attention-based methods like [SBERT](https://sbert.net/), which take semantic meaning into account, can also be used. For simplicity, the TF-IDF model is the one shown here; the code is modularized so that it can be replaced if needed.
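As a rough sketch of what swapping the vectorization scheme could look like, the ranking step can be written against any function that maps a list of texts to a dense array. Everything named here (`Encoder`, `rank_by_similarity`, `toy_encoder`) is hypothetical and not part of the notebook; `toy_encoder` is a keyword-count stand-in used only so the example runs without downloading a model.

```python
from typing import Callable, Sequence

import numpy as np

# Hypothetical interface: any function mapping texts to a dense (n_texts, dim) array
Encoder = Callable[[Sequence[str]], np.ndarray]

def rank_by_similarity(titles, plots, review, encode: Encoder, top_k=5):
    """Rank titles by cosine similarity between each plot and the review."""
    plot_vecs = np.asarray(encode(list(plots)), dtype=float)
    review_vec = np.asarray(encode([review]), dtype=float)[0]
    # Cosine similarity; rows with zero norm get a score of 0 instead of NaN
    norms = np.linalg.norm(plot_vecs, axis=1) * np.linalg.norm(review_vec)
    scores = np.divide(plot_vecs @ review_vec, norms,
                       out=np.zeros(len(plot_vecs)), where=norms > 0)
    order = np.argsort(scores)[::-1][:top_k]
    return [(titles[i], float(scores[i])) for i in order]

def toy_encoder(texts):
    """Stand-in encoder for demonstration only: counts of a few fixed keywords."""
    keywords = ("comedy", "detective", "space")
    return np.array([[t.lower().count(k) for k in keywords] for t in texts], dtype=float)

titles = ["A", "B", "C"]
plots = ["a space comedy", "a detective story", "nothing relevant"]
ranked = rank_by_similarity(titles, plots, "I like comedy", toy_encoder, top_k=1)
print(ranked)
```

Under this assumption, an SBERT model's `encode` method could be passed in place of `toy_encoder`, though that combination has not been run here.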

3. **Modularizing portions of code**

While some of these functions may seem too small to be worth separating, each one reflects a design choice:

- `preprocess_data()`: Performs all the dataset reduction needed to fit the problem specification; easily interchangeable for other datasets.
- `vectorize()`: Performs the vectorization procedure; modularized according to the example given in the README.
- `compute_similarity()`: Computes the similarity score according to a specific metric; interchangeable for different similarity metrics.
- `recommend_movies()`: Handles and formats the recommendations so that not everything is located in `main()`.

Some parts depend on the libraries used and did not fit neatly into any of the functions above (namely, the lines that fit the vectorizer). I placed those in `main()` to avoid overcomplicating this coding challenge.
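To illustrate the interchangeability of `compute_similarity()`, here is a minimal sketch in which the metric is passed in as an argument. The `metric` parameter is a hypothetical addition, not part of the function above, and the three plots are invented for the example. It relies on the fact that TfidfVectorizer L2-normalizes each row by default (`norm='l2'`), so a plain dot product (`linear_kernel`) should give the same scores as `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

def compute_similarity(a, b, metric=cosine_similarity) -> np.ndarray:
    # `metric` is a hypothetical extra parameter: any callable that takes
    # two matrices and returns a pairwise score matrix
    return metric(a, b).flatten()

# Toy plots, invented for illustration (not taken from movies.csv)
plots = [
    "A detective hunts a thief across the city.",
    "A comedy about two rival bakers.",
    "Astronauts repair a failing space station.",
]

vectorizer = TfidfVectorizer().fit(plots)
plot_vecs = vectorizer.transform(plots)
review_vec = vectorizer.transform(["I like comedy films"])

cosine_scores = compute_similarity(review_vec, plot_vecs)
dot_scores = compute_similarity(review_vec, plot_vecs, metric=linear_kernel)

# With L2-normalized TF-IDF rows, both metrics agree
print(np.allclose(cosine_scores, dot_scores))
```

A distance such as `euclidean_distances` could be passed the same way, but the sort order in `recommend_movies()` would then need to be ascending, since smaller distances mean closer matches.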

4. **Docstrings, Parameter Types, Return Types**

I've worked on a lot of projects where the codebase I inherited was large and the documentation sparse, and it took too much time and effort just to learn how everything worked. I care about the longevity of the code and the product, and about how well both developers and consumers can understand them, so I would much rather overspecify than underspecify.

## Salary Expectation

I would be fine with 1600 per month, or the base 20 per hour. I would just be excited at the opportunity to work for a Neo Accelerator-backed company!