- This project is a movie recommendation system that uses content-based filtering.
- The first dataset is the TMDB 5000 Movie Dataset, which contains information about roughly 5000 movies released up to 2017. The dataset can be found here.
- The dataset for movies from 2018 to 2023 was created using the TMDB API and Wikipedia.
- The names of the movies were scraped from Wikipedia, and the data for each movie was then collected using the TMDB API.
- TMDB API: TMDB
- Wikipedia links:
- The model uses the cosine similarity between the movie vectors to find the most similar movies.
- The movie vectors are created using the following features:
  - The genres of the movie
  - The cast of the movie
  - The director of the movie
  - The keywords of the movie
  - The overview of the movie
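
As a rough illustration of how the features listed above might be combined into a single 'tags' string per movie, here is a minimal sketch. The `build_tags` helper, the dictionary layout, and the sample movie are all hypothetical. Removing spaces inside multi-word names (so that a full name stays one token) is an assumption about preprocessing, not something this README states.

```python
def build_tags(movie):
    """Join a movie's content features into one lowercase 'tags' string.

    `movie` is a hypothetical dict with the five features listed above.
    """
    def squash(names):
        # Assumption: drop inner spaces so "Jane Doe" becomes one token.
        return [name.replace(" ", "") for name in names]

    parts = (
        movie["overview"].split()
        + squash(movie["genres"])
        + squash(movie["keywords"])
        + squash(movie["cast"])
        + squash([movie["director"]])
    )
    return " ".join(parts).lower()


# Made-up sample record, for illustration only.
sample_movie = {
    "overview": "A pilot explores a distant moon",
    "genres": ["Science Fiction", "Adventure"],
    "keywords": ["space travel"],
    "cast": ["Jane Doe"],
    "director": "Alex Poe",
}

sample_tags = build_tags(sample_movie)
print(sample_tags)
```

Each movie then becomes one short document, which is the form the stemming and vectorization steps operate on.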
- Stemming:
  - Uses `PorterStemmer` from the `nltk.stem.porter` module to perform stemming on the text data in the 'tags' column of the DataFrame (`new_df`).
  - Stemming reduces words to their root form (e.g., 'running' to 'run', 'easily' to 'easili'), so that different forms of a word are normalized to the same token for analysis.
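
The stemming step can be sketched as follows. The `stem` helper name is an assumption. In the project it would presumably be applied to the 'tags' column with something like `new_df['tags'].apply(stem)`, which is shown only as a comment here so the snippet runs without a DataFrame.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()


def stem(text):
    """Stem every whitespace-separated word in `text` and rejoin them."""
    return " ".join(ps.stem(word) for word in text.split())


stemmed = stem("running easily")
print(stemmed)

# Assumed usage on the DataFrame described above:
# new_df['tags'] = new_df['tags'].apply(stem)
```

Note that a stem such as 'easili' is not a dictionary word. That is fine here, because the goal is only to map related word forms to the same token.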
- CountVectorizer:
  - Uses `CountVectorizer` from `sklearn.feature_extraction.text` to convert the processed 'tags' data into numerical vectors.
  - `max_features=5000` sets the maximum number of features to consider, keeping only the most frequent terms.
  - `stop_words='english'` removes English stop words (common words like 'and', 'the', 'is', etc.) during vectorization.
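
A minimal sketch of the vectorization step, using the two parameters mentioned above. The three tag strings are made-up toy data, and the variable names `cv` and `vectors` are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy 'tags' strings standing in for the real column.
toy_tags = [
    "action hero save the city",
    "hero and villain fight in the city",
    "romantic comedy in paris",
]

cv = CountVectorizer(max_features=5000, stop_words="english")

# fit_transform returns a sparse matrix; toarray() makes it dense.
vectors = cv.fit_transform(toy_tags).toarray()

print(vectors.shape)            # (number of movies, vocabulary size)
print(sorted(cv.vocabulary_))   # learned terms, stop words excluded
```

Each row is one movie, each column is one term, and each cell counts how often that term appears in the movie's tags.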
- Cosine Similarity:
  - Calculates the cosine similarity between the vectors derived from the 'tags' using `cosine_similarity` from `sklearn.metrics.pairwise`.
  - The resulting similarity matrix shows how similar each pair of movies is based on their 'tags' content.
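
To make the similarity matrix concrete, here is a small sketch on hand-written count vectors (toy data, not taken from the project). Cosine similarity depends only on the angle between two vectors, so a vector and a scaled copy of it score 1, while vectors with no terms in common score 0.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three toy movie vectors over a three-term vocabulary.
toy_vectors = np.array([
    [1, 1, 0],   # movie 0
    [2, 2, 0],   # movie 1: same direction as movie 0
    [0, 0, 3],   # movie 2: shares no terms with the others
])

# Entry [i, j] is the cosine similarity between movie i and movie j.
toy_similarity = cosine_similarity(toy_vectors)
print(toy_similarity.round(2))
```

The matrix is square and symmetric, with ones on the diagonal because every movie is identical to itself.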
- Recommendation Function (`recommend`):
  - Finds the index of the given movie in the DataFrame (`new_df`) based on its title.
  - Retrieves the similarity scores of that movie with all others from the similarity matrix.
  - Sorts the movies by similarity score and prints the top 5 recommendations (excluding the queried movie).
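
The steps above could look roughly like the sketch below. It is not the project's actual function: the signature (passing `new_df` and `similarity` as arguments, plus a `top_n` parameter), the 'title' column name, and returning a list instead of only printing are all assumptions made to keep the example self-contained. The DataFrame and similarity matrix are toy data.

```python
import numpy as np
import pandas as pd


def recommend(title, new_df, similarity, top_n=5):
    """Return the titles of the `top_n` movies most similar to `title`."""
    # Row positions whose title matches the query.
    positions = np.flatnonzero((new_df["title"] == title).to_numpy())
    if positions.size == 0:
        return []  # unknown title: nothing to recommend
    idx = int(positions[0])

    # Pair each movie position with its similarity to the query movie.
    scores = list(enumerate(similarity[idx]))
    ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)

    # Skip the queried movie itself, then keep the best matches.
    picks = [i for i, _ in ranked if i != idx][:top_n]
    return new_df["title"].iloc[picks].tolist()


# Toy data standing in for new_df and the similarity matrix.
toy_df = pd.DataFrame({"title": ["A", "B", "C", "D"]})
toy_sim = np.array([
    [1.0, 0.9, 0.2, 0.5],
    [0.9, 1.0, 0.1, 0.3],
    [0.2, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])

for name in recommend("A", toy_df, toy_sim, top_n=2):
    print(name)
```

Excluding the queried movie by position, rather than slicing off the first sorted entry, avoids dropping the wrong row when another movie ties with a score of 1.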