In [1]:
# Importing necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load the datasets
ratings_df = pd.read_csv('Dataset_Rating.csv')
movies_df = pd.read_csv('Dataset_Movie.csv')

# Inspect the datasets
print("Ratings DataFrame:")
print(ratings_df.head())
print("\nMovies DataFrame:")
print(movies_df.head())


Ratings DataFrame:
   User_ID  Rating  Movie_ID
0   712664       5         3
1  1331154       4         3
2  2632461       3         3
3    44937       5         3
4   656399       4         3

Movies DataFrame:
   Movie_ID  Year                          Name
0         1  2003               Dinosaur Planet
1         2  2004    Isle of Man TT 2004 Review
2         3  1997                     Character
3         4  1994  Paula Abdul's Get Up & Dance
4         5  2004      The Rise and Fall of ECW


In [2]:
# Merge the ratings with movie information to get full details
merged_df = pd.merge(ratings_df, movies_df, on="Movie_ID")


Let’s break down each part of this code in detail:

### 1. **Merging the Ratings with Movie Information**

```python
merged_df = pd.merge(ratings_df, movies_df, on="Movie_ID")
```

- **`pd.merge()`**:
  - This is a function in pandas used to combine two DataFrames based on a common column, also known as a "key." In this case, the two DataFrames (`ratings_df` and `movies_df`) are merged using the `Movie_ID` column.
  
- **Purpose**:
  - The `ratings_df` contains ratings information, including `User_ID`, `Movie_ID`, and the `Rating`. 
  - The `movies_df` contains metadata about the movies, including the `Movie_ID`, `Year`, and `Name`.
  
- **How it works**:
  - The `merge` operation combines these two DataFrames into a single DataFrame (`merged_df`), where the `Movie_ID` from both datasets is used to align the rows.
  - After the merge, the resulting `merged_df` will have columns from both datasets: `User_ID`, `Rating`, `Movie_ID`, `Year`, and `Name`.
  
- **Example**:
  After merging, `merged_df` might look like this:
  ```
     User_ID  Rating  Movie_ID  Year       Name
  0   712664       5         3  1997  Character
  1  1331154       4         3  1997  Character
  2  2632461       3         3  1997  Character
  3    44937       5         3  1997  Character
  4   656399       4         3  1997  Character
  ```
  - Now, each row contains both the user’s rating and the movie’s metadata (e.g., name and year).
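To see the merge behaviour on a small scale, the sketch below rebuilds two toy DataFrames (values are illustrative, modelled on the sample output above) and passes `how` and `validate` explicitly. Neither argument appears in the original cell: `how="inner"` is the pandas default, and `validate="many_to_one"` is an optional check, added here as an assumption, that each `Movie_ID` occurs at most once in the movies table.

```python
import pandas as pd

# Toy data modelled on the sample rows shown earlier (illustrative only)
ratings_toy = pd.DataFrame({
    "User_ID": [712664, 1331154, 44937],
    "Rating": [5, 4, 5],
    "Movie_ID": [3, 3, 99],   # Movie_ID 99 has no entry in movies_toy
})
movies_toy = pd.DataFrame({
    "Movie_ID": [1, 3],
    "Year": [2003, 1997],
    "Name": ["Dinosaur Planet", "Character"],
})

# Inner join: ratings without a matching movie are dropped,
# and movies without ratings do not appear either
merged_toy = pd.merge(ratings_toy, movies_toy, on="Movie_ID",
                      how="inner", validate="many_to_one")
print(merged_toy)
```

With an inner join, the rating for the unknown `Movie_ID` 99 disappears silently; switching to `how="left"` would keep it with missing `Year` and `Name` values.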

---


In [3]:
# Create a TF-IDF Vectorizer for the movie titles
tfidf_vectorizer = TfidfVectorizer(stop_words='english')


### 2. **Creating a TF-IDF Vectorizer for Movie Titles**

```python
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
```

- **`TfidfVectorizer`**:
  - `TfidfVectorizer` is a class from the `scikit-learn` library that converts a collection of text documents into a matrix of TF-IDF features. TF-IDF stands for **Term Frequency-Inverse Document Frequency**, a statistic that reflects how important a word is to a document within a corpus: it grows with how often the word appears in the document and shrinks with how many documents contain the word.
  
- **`stop_words='english'`**:
  - The `stop_words` parameter is used to specify words that should be ignored when calculating TF-IDF. "Stop words" are common words like "the", "and", "in", etc., which do not carry significant meaning and can skew the results.
  - By setting `stop_words='english'`, the `TfidfVectorizer` will automatically ignore common English stop words in the movie titles.

- **Purpose**:
  - The purpose of this step is to vectorize the movie titles, converting each title into a numeric representation. This representation can then be used to compute similarities between movies.
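As a quick check of what the stop-word filter does to a title, the sketch below uses `build_analyzer()`, which returns the preprocessing and tokenizing callable that the vectorizer applies internally. The variable names are illustrative and not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english")

# build_analyzer() returns the callable the vectorizer uses to
# lowercase, tokenize, and drop stop words; no fitting is required
analyze = vectorizer.build_analyzer()
tokens = analyze("The Rise and Fall of ECW")
print(tokens)
```

Only the tokens that survive this step can contribute to a title's vector, so a title made up entirely of stop words would end up with an all-zero row.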

---


In [4]:
# Fit and transform the movie titles to vectors
tfidf_matrix = tfidf_vectorizer.fit_transform(movies_df['Name'])


### 3. **Fitting and Transforming the Movie Titles to Vectors**

```python
tfidf_matrix = tfidf_vectorizer.fit_transform(movies_df['Name'])
```

- **`fit_transform()`**:
  - The `fit_transform()` method performs two actions:
    1. **Fit**: It learns the vocabulary of the movie titles and calculates the inverse document frequency (IDF) for each word across all movie titles.
    2. **Transform**: It transforms the movie titles into a matrix of TF-IDF features based on the learned vocabulary. Each row in this matrix corresponds to a movie, and each column corresponds to a word in the vocabulary. The values in the matrix represent the TF-IDF score of each word in each movie title.
  
- **Purpose**:
  - This step converts the list of movie titles (`movies_df['Name']`) into a numerical format, where each movie title is represented as a vector. The size of the vector corresponds to the number of unique words (after removing stop words) across all titles.
  
- **Example**:
  - If the movie titles are:
    - `Dinosaur Planet`
    - `Character`
    - `The Rise and Fall of ECW`
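  - With scikit-learn's default settings, the TF-IDF matrix for these three titles looks like this (rounded; columns follow the alphabetically sorted vocabulary `character`, `dinosaur`, `ecw`, `fall`, `planet`, `rise`):
    ```
    [[0.0,   0.707, 0.0,   0.0,   0.707, 0.0  ],  # Dinosaur Planet
     [1.0,   0.0,   0.0,   0.0,   0.0,   0.0  ],  # Character
     [0.0,   0.0,   0.577, 0.577, 0.0,   0.577]]  # The Rise and Fall of ECW
    ```
  - Each row represents a movie, and each column holds one vocabulary word's TF-IDF score in that title. Words that do not occur in a title score 0, the stop words ("The", "and", "of") get no column at all, and each row is scaled to unit length.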

---


In [5]:
# Compute the cosine similarity between the movies based on their titles
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)


### 4. **Computing Cosine Similarity Between Movies Based on Their Titles**

```python
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
```

- **`cosine_similarity()`**:
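  - This function calculates the cosine similarity between pairs of vectors. Cosine similarity is the cosine of the angle between two vectors; in general it ranges from -1 (pointing in opposite directions) to 1 (pointing in the same direction). Because TF-IDF values are never negative, the scores here fall between 0 (no shared terms) and 1 (identical term weights).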
  - In this case, it calculates the cosine similarity between the TF-IDF vectors of all movies' titles. Each movie title is represented by a vector, and we want to measure how similar each movie title is to all others based on their vector representations.
  
- **Purpose**:
  - The goal of calculating cosine similarity is to quantify how similar the movie titles are to each other based on the words they contain. If two movie titles share words (e.g., "Dinosaur Planet" and a hypothetical "Planet Earth"), they will have a higher cosine similarity; titles with no words in common, such as "Dinosaur Planet" and "Character", score 0.
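The following sketch checks the scikit-learn result against the textbook formula (dot product divided by the product of the vector lengths) on three toy titles. "Planet Earth" is a made-up title used only to create a word overlap; it is not taken from the dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = ["Dinosaur Planet", "Planet Earth", "Character"]
vectors = TfidfVectorizer(stop_words="english").fit_transform(titles)

# With a single argument, every row is compared with every other row
sim = cosine_similarity(vectors)

# Manual check for the first two titles
a = vectors[0].toarray().ravel()
b = vectors[1].toarray().ravel()
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(sim.round(3))
print(round(float(manual), 3))
```

The diagonal is 1 (each title compared with itself), the pair sharing "planet" gets a score strictly between 0 and 1, and "Character" scores 0 against both other titles.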
  

In [6]:

# Function to get movie recommendations based on a movie title
def get_movie_recommendations(movie_title, cosine_sim=cosine_sim):
    # Get index of the movie that matches the title
    idx = movies_df[movies_df['Name'] == movie_title].index[0]
    
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort the movies based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get the top 5 most similar movies (excluding the movie itself)
    sim_scores = sim_scores[1:6]
    
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    
    # Return the top 5 most similar movies
    return movies_df['Name'].iloc[movie_indices]

# Example: Get recommendations for the movie "Character"
recommended_movies = get_movie_recommendations("Character")
print("\nMovies recommended based on 'Character':")
print(recommended_movies)



Movies recommended based on 'Character':
0                 Dinosaur Planet
1      Isle of Man TT 2004 Review
3    Paula Abdul's Get Up & Dance
4        The Rise and Fall of ECW
5                            Sick
Name: Name, dtype: object


### 1. **Get Index of the Movie That Matches the Title**

```python
idx = movies_df[movies_df['Name'] == movie_title].index[0]
```

- **Purpose**:
  - We need to identify the index of the movie in the `movies_df` DataFrame that matches the given `movie_title`.
  
- **How it works**:
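  - `movies_df[movies_df['Name'] == movie_title]`: This filters the `movies_df` DataFrame to the rows whose title exactly matches the input `movie_title` (the comparison is case-sensitive). If titles are unique this is a single row; if several movies share a title, all of them are returned.
  - `.index[0]`: The `index` attribute returns the index labels of the matching rows, and `[0]` selects the first one. If no title matches, the index is empty and `[0]` raises an `IndexError`. Using this label as a row position in `cosine_sim` relies on `movies_df` keeping its default 0-based index.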

- **Result**: 
  - The variable `idx` now holds the index of the movie in the `movies_df` DataFrame.
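A more defensive lookup might look like the sketch below. `find_title_index` is a hypothetical helper, not part of the original notebook; it returns `None` for an unknown title instead of raising, and it makes the "first match wins" behaviour for duplicate titles explicit.

```python
import pandas as pd

def find_title_index(movies, title):
    """Return the index label of the first exact title match, or None."""
    matches = movies.index[movies["Name"] == title]
    if len(matches) == 0:
        return None          # unknown title: let the caller decide what to do
    return matches[0]        # duplicates: the first match wins

# Toy frame with a duplicated title (illustrative only)
movies_toy = pd.DataFrame({"Name": ["Dinosaur Planet", "Character", "Character"]})
print(find_title_index(movies_toy, "Character"))
print(find_title_index(movies_toy, "No Such Film"))
```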

---

### 2. **Get Pairwise Similarity Scores of All Movies with the Chosen Movie**

```python
sim_scores = list(enumerate(cosine_sim[idx]))
```

- **Purpose**:
  - We want to retrieve the cosine similarity scores between the chosen movie (at index `idx`) and all other movies in the dataset.
  
- **How it works**:
  - `cosine_sim[idx]`: This extracts the row of the cosine similarity matrix that corresponds to the movie at index `idx`. This row contains the cosine similarity scores between the chosen movie and all other movies in the dataset.
  - `enumerate(cosine_sim[idx])`: `enumerate` pairs each similarity score with its position in the row, which is also the movie's index. Wrapping it in `list()` produces a list of tuples, where each tuple consists of an index and its similarity score (e.g., `(0, 0.8)` means the movie at index 0 has a cosine similarity of 0.8 to the movie at index `idx`).

- **Result**:
  - The variable `sim_scores` is a list of tuples where each tuple contains the index of a movie and its cosine similarity score with the chosen movie.

---

### 3. **Sort Movies Based on Similarity Scores**

```python
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
```

- **Purpose**:
  - We want to sort the movies based on their similarity to the chosen movie in descending order (from most similar to least similar).
  
- **How it works**:
  - `sorted(sim_scores, key=lambda x: x[1], reverse=True)`: This sorts the `sim_scores` list. The `key=lambda x: x[1]` specifies that the sorting should be based on the second element of each tuple (the similarity score). `reverse=True` ensures the list is sorted in descending order, so the most similar movies come first.

- **Result**:
  - The `sim_scores` list is now sorted, with the most similar movie to the chosen movie first.

---

### 4. **Get the Top 5 Most Similar Movies (Excluding the Chosen Movie)**

```python
sim_scores = sim_scores[1:6]
```

- **Purpose**:
  - We want to exclude the movie itself from the recommendations and only return the top 5 most similar movies.

- **How it works**:
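  - `sim_scores[1:6]`: This slices the sorted `sim_scores` list to skip the first element and take the next 5 entries (indices 1 through 5). The code assumes the first element is the movie itself, with a cosine similarity of 1.0. That holds unless another title ties at 1.0 (for example, a title with the same non-stop words), in which case the chosen movie may not sort first. When the chosen title shares no words with any other title, every other score is 0 and the slice simply returns the first movies in DataFrame order, which is consistent with the output for "Character" above (indices 0, 1, 3, 4, 5).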

- **Result**:
  - The `sim_scores` list now contains the top 5 most similar movies, excluding the chosen movie itself.
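One way to avoid relying on the chosen movie sorting first is to drop it by index rather than by position. The sketch below does this with NumPy; `top_k_similar` is a hypothetical helper that is not part of the original notebook.

```python
import numpy as np

def top_k_similar(sim_row, idx, k=5):
    """Indices of the k highest scores in sim_row, skipping position idx."""
    # Stable descending sort: tied scores keep their original order
    order = np.argsort(-sim_row, kind="stable")
    # Remove the query movie by index, wherever it landed in the ranking
    order = order[order != idx]
    return order[:k].tolist()

# Toy similarity row for the movie at index 2; index 0 ties with it at 1.0
row = np.array([1.0, 0.2, 1.0, 0.0, 0.5])
print(top_k_similar(row, idx=2, k=3))   # → [0, 4, 1]
```

On this toy row, slicing a sorted list from position 1 would keep the query movie (index 2) and drop its tied twin (index 0) instead.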

---

### 5. **Get Movie Indices and Return Movie Titles**

```python
movie_indices = [i[0] for i in sim_scores]
return movies_df['Name'].iloc[movie_indices]
```

- **Purpose**:
  - We need to extract the movie indices from `sim_scores` and use them to fetch the corresponding movie titles from the `movies_df` DataFrame.

- **How it works**:
  - `[i[0] for i in sim_scores]`: This list comprehension extracts the movie indices from the sorted `sim_scores` list. Each item in `sim_scores` is a tuple `(index, similarity)`, so `i[0]` gives the movie index.
  - `movies_df['Name'].iloc[movie_indices]`: The `iloc` indexer selects rows by integer position, so this picks the titles at the positions listed in `movie_indices`. The result is a pandas Series holding the titles of the top 5 most similar movies.


In [7]:

# Now, we can recommend movies to users based on their ratings
def recommend_movies_for_user(user_id):
    # Get the movies rated by the user
    user_ratings = ratings_df[ratings_df['User_ID'] == user_id]
    
    # Get the IDs of the movies the user has rated
    rated_movie_ids = user_ratings['Movie_ID'].tolist()
    
    # Get the corresponding movie titles
    rated_movies = movies_df[movies_df['Movie_ID'].isin(rated_movie_ids)]
    
    # For each movie rated by the user, get recommendations
    recommendations = []
    for movie in rated_movies['Name']:
        recommended_movies = get_movie_recommendations(movie)
        recommendations.extend(recommended_movies)
    
    # Remove movies the user has already rated
    recommendations = list(set(recommendations) - set(rated_movies['Name']))
    
    # Return the recommended movies for the user
    return recommendations

# Example: Recommend movies for user with ID 712664
user_recommendations = recommend_movies_for_user(712664)
print("Number of recommended movies for user 712664:", len(user_recommendations))
print("\nMovies recommended for user 712664:")
for movie in user_recommendations:
    print(f"- {movie}")


Number of recommended movies for user 712664: 1139

Movies recommended for user 712664:
- Stargate SG-1: Season 8
- Modern Vampires
- G3: Live in Denver
- My Husband's Double Life
- The Man Who Came to Dinner
- The Wedding Party
- Pursuit of Happiness
- Boys
- Runaway Jury
- The Fallen Ones
- The Animated Passion
- Robin Hood (Disney)
- Bunny Lake Is Missing
- Wilder Days
- Bright Future
- Better Than Sex
- There Goes My Baby
- Die! Die! Die!
- The Godfather Trilogy: Bonus Material
- Dogs and More Dogs
- The Omen Legacy
- VeggieTales: Bible Heroes: Stand Up! Stand Tall! Stand Strong!
- Eight Men Out
- Out for a Kill
- The Twilight Zone: Vol. 1
- Bringing Down the House
- Passion
- What the #$*! Do We Know!?
- Love
- Will & Grace: Season 1
- When It Was a Game
- Serial Slayer
- Grace of My Heart
- Sometimes in April
- Day for Night
- Brides of Christ
- Yellow
- Nothing But Trouble
- Hard Times
- The Lady from Shanghai
- Runaway Train
- Amityville Dollhouse
- The Great Silence
- The Slee

### 1. **Get the Movies Rated by the User**

```python
user_ratings = ratings_df[ratings_df['User_ID'] == user_id]
```

- **Purpose**:
  - We need to retrieve all the movies that the user has rated.
  
- **How it works**:
  - `ratings_df[ratings_df['User_ID'] == user_id]`: This filters the `ratings_df` DataFrame to get all rows where the `User_ID` column matches the input `user_id`. The result is a DataFrame containing only the movies rated by that user.

- **Result**:
  - The variable `user_ratings` is a DataFrame that contains only the ratings given by the user. It will have columns like `User_ID`, `Rating`, and `Movie_ID`.

---

### 2. **Get the Movie IDs**

```python
rated_movie_ids = user_ratings['Movie_ID'].tolist()
```

- **Purpose**:
  - We want to extract the movie IDs of the movies that the user has rated.
  
- **How it works**:
  - `user_ratings['Movie_ID']`: This selects the `Movie_ID` column from the `user_ratings` DataFrame.
  - `.tolist()`: This converts the `Movie_ID` column into a Python list, so we have a list of movie IDs that the user has rated.

- **Result**:
  - The variable `rated_movie_ids` is a list of movie IDs, like `[3, 4, 5]`, representing the movies the user has rated.

---

### 3. **Get the Corresponding Movie Titles**

```python
rated_movies = movies_df[movies_df['Movie_ID'].isin(rated_movie_ids)]
```

- **Purpose**:
  - We want to get the movie titles for the movies that the user has rated.
  
- **How it works**:
  - `movies_df[movies_df['Movie_ID'].isin(rated_movie_ids)]`: This filters the `movies_df` DataFrame to get only the rows where the `Movie_ID` is in the list of `rated_movie_ids`. The `.isin()` function checks if each movie ID in `movies_df` is in the `rated_movie_ids` list.
  
- **Result**:
  - The variable `rated_movies` is a DataFrame containing only the movies that the user has rated, with columns like `Movie_ID`, `Year`, and `Name`.
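The filtering steps above treat every rated movie as a seed, whatever rating it received. If only well-liked movies should drive the recommendations, a rating threshold can be added to the same filter. The sketch below is an assumption rather than the notebook's method: `liked_titles` and `min_rating` are hypothetical names, and the toy data is illustrative.

```python
import pandas as pd

ratings_toy = pd.DataFrame({
    "User_ID": [1, 1, 1, 2],
    "Rating": [5, 2, 4, 3],
    "Movie_ID": [3, 4, 5, 3],
})
movies_toy = pd.DataFrame({
    "Movie_ID": [3, 4, 5],
    "Name": ["Character", "Paula Abdul's Get Up & Dance", "The Rise and Fall of ECW"],
})

def liked_titles(user_id, min_rating=4):
    """Titles the user rated at or above min_rating (hypothetical threshold)."""
    mask = (ratings_toy["User_ID"] == user_id) & (ratings_toy["Rating"] >= min_rating)
    liked_ids = ratings_toy.loc[mask, "Movie_ID"]
    # Same isin() filter as in the notebook, applied to the liked IDs only
    return movies_toy.loc[movies_toy["Movie_ID"].isin(liked_ids), "Name"].tolist()

print(liked_titles(1))
```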

---

### 4. **Get Recommendations for Each Rated Movie**

```python
recommendations = []
for movie in rated_movies['Name']:
    recommended_movies = get_movie_recommendations(movie)
    recommendations.extend(recommended_movies)
```

- **Purpose**:
  - For each movie the user has rated, we want to generate recommendations by calling the `get_movie_recommendations` function.
  
- **How it works**:
  - `for movie in rated_movies['Name']`: This loops through each movie title in the `rated_movies` DataFrame.
  - `get_movie_recommendations(movie)`: For each movie, this function returns a pandas Series with the titles of up to five movies whose titles are most similar.
  - `recommendations.extend(recommended_movies)`: The `extend()` method is used to add all the recommended movies to the `recommendations` list.

- **Result**:
  - The `recommendations` list contains all the recommended movies for the user based on the movies they have rated.

---

### 5. **Remove Movies the User Has Already Rated**

```python
recommendations = list(set(recommendations) - set(rated_movies['Name']))
```

- **Purpose**:
  - We want to exclude movies that the user has already rated from the recommendations list, so they don’t receive recommendations for movies they've already seen.
  
- **How it works**:
  - `set(recommendations)`: This converts the `recommendations` list to a set, removing any duplicates in the list.
  - `set(rated_movies['Name'])`: This converts the `Name` column of the `rated_movies` DataFrame into a set of titles.
  - `set(recommendations) - set(rated_movies['Name'])`: This performs set subtraction, removing the movies the user has already rated from the recommendations.
  - `list()`: This converts the resulting set back to a list.
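Sets do not preserve order, so the list produced this way comes back in arbitrary order. If the order in which recommendations were generated matters, an order-preserving variant can be sketched as follows; `dedupe_excluding` is a hypothetical helper, not part of the original notebook.

```python
def dedupe_excluding(recommendations, already_rated):
    """Drop duplicates and already-rated titles while keeping first-seen order."""
    rated = set(already_rated)
    # dict.fromkeys keeps insertion order (Python 3.7+) and removes duplicates
    return [title for title in dict.fromkeys(recommendations) if title not in rated]

recs = ["Sick", "Character", "Dinosaur Planet", "Sick"]
print(dedupe_excluding(recs, ["Character"]))   # → ['Sick', 'Dinosaur Planet']
```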


### Model Evaluation

In this section, we will evaluate the movie recommendation model based on the recommendations generated for a specific user. This will include measuring the effectiveness of our approach and discussing possible metrics for further evaluation.

#### 1. **Evaluation Overview**

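Our recommendation model relies on cosine similarity between the TF-IDF vectors of movie titles. It implicitly assumes that users will enjoy movies whose titles resemble those they have already rated. Note that the code uses every rated movie as a seed regardless of the rating given, so a low-rated title influences the recommendations as much as a high-rated one. In this evaluation, we'll review the process and suggest ways to improve the model's performance.

#### 2. **Evaluation Metrics**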

There are several metrics that can be used to evaluate the effectiveness of a recommendation system:

- **Precision and Recall**: These metrics measure how many of the recommended movies are relevant to the user (precision) and how many of the relevant movies are being recommended (recall).
- **F1 Score**: This metric is the harmonic mean of precision and recall, providing a balanced measure of the model's performance.
- **Mean Average Precision (MAP)**: This evaluates how well the system ranks the relevant items at the top of the list.
- **Hit Rate**: The fraction of users for whom at least one recommended item is one they actually interact with.
- **Coverage**: The proportion of items in the entire dataset that the recommendation system ever recommends.

Since we are using cosine similarity for the recommendations, the next step would be to collect user feedback or interaction data to apply these metrics.
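As a small illustration of how the first two metrics are computed for one user, the sketch below implements precision@k, recall@k, and F1 in plain Python. The function names and the single-letter titles are illustrative assumptions; `relevant` stands for whatever ground truth is available, such as movies the user later rated highly.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user's ranked recommendation list."""
    top_k = recommended[:k]
    relevant = set(relevant)
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Illustrative titles only
ranked = ["A", "B", "C", "D"]
liked = ["B", "D", "E"]
p, r = precision_recall_at_k(ranked, liked, k=3)
print(p, r, f1_from(p, r))
```

Here one of the top three recommendations ("B") is relevant, and one of the three relevant titles is found, so precision, recall, and F1 all come out at one third.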

#### 3. **Evaluating Recommendations**

Here’s a simple way to evaluate the recommendations generated for a specific user:



In [8]:
from sklearn.metrics import precision_score, recall_score, f1_score

def evaluate_recommendations(user_id):
    # Get the movies rated by the user
    user_ratings = ratings_df[ratings_df['User_ID'] == user_id]
    
    # Get the movie IDs and titles of the rated movies
    rated_movie_ids = user_ratings['Movie_ID'].tolist()
    rated_movie_titles = movies_df[movies_df['Movie_ID'].isin(rated_movie_ids)]['Name'].tolist()
    
    # Get recommendations for the user
    recommended_movies = recommend_movies_for_user(user_id)
    
    # Convert both lists to sets for easier comparison
    rated_movie_set = set(rated_movie_titles)
    recommended_movie_set = set(recommended_movies)
    
    # Define True Positives, False Positives, and False Negatives
    tp = len(rated_movie_set.intersection(recommended_movie_set))
    fp = len(recommended_movie_set - rated_movie_set)
    fn = len(rated_movie_set - recommended_movie_set)
    
    # Calculate precision, recall, and F1 score
    precision = tp / (tp + fp) if tp + fp > 0 else 0
    recall = tp / (tp + fn) if tp + fn > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if precision + recall > 0 else 0
    
    return precision, recall, f1

# Example: Evaluate for user with ID 712664
precision, recall, f1 = evaluate_recommendations(712664)
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")


Precision: 0.0
Recall: 0.0
F1 Score: 0


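- **Precision**: The share of recommended movies that the user has actually rated. A higher precision means more of the recommendations are relevant.
- **Recall**: The share of the user's rated movies that appear among the recommendations. A higher recall means the system recovers more of the relevant movies.
- **F1 Score**: The harmonic mean of precision and recall. A higher F1 score indicates a better balance between the two.

All three values are zero here by construction: `recommend_movies_for_user` removes every title the user has already rated, so the recommended set and the rated set cannot overlap and `tp` is always 0. The result therefore says nothing about recommendation quality; a meaningful evaluation needs ratings that were hidden from the recommender.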
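One way to obtain scores that can be non-zero is a hold-out split: hide part of the user's rated titles, generate recommendations from the remaining ones, and count how many hidden titles are recovered. The sketch below is a generic outline under that assumption. `holdout_evaluate` and `recommend_fn` are hypothetical names, and the stand-in recommender simply returns a fixed catalogue; in the notebook, `recommend_fn` would wrap the title-similarity logic.

```python
import random

def holdout_evaluate(rated_titles, recommend_fn, holdout_fraction=0.5, seed=0):
    """Hide part of a user's rated titles, recommend from the rest,
    and score the recommendations against the hidden part."""
    rng = random.Random(seed)
    titles = list(rated_titles)
    rng.shuffle(titles)
    n_hidden = max(1, int(len(titles) * holdout_fraction))
    hidden, visible = set(titles[:n_hidden]), titles[n_hidden:]

    # Recommendations may not include titles the recommender was shown
    recommended = set(recommend_fn(visible)) - set(visible)
    tp = len(recommended & hidden)
    precision = tp / len(recommended) if recommended else 0.0
    recall = tp / len(hidden) if hidden else 0.0
    return precision, recall

# Stand-in recommender for illustration: always suggests the same catalogue
catalogue = ["A", "B", "C", "D", "E", "F"]
precision, recall = holdout_evaluate(["A", "B", "C", "D"], lambda seen: catalogue)
print(precision, recall)   # → 0.5 1.0
```

Averaging these scores over many users, and over several random splits per user, gives a more stable picture than a single-user check.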