# ITCS 6162: Data Mining - Programming Assignment

**In this assignment, you will explore data analysis, recommendation algorithms, and graph-based techniques using the MovieLens dataset. Your tasks will range from basic data exploration to advanced recommendation models, including:**
- Data manipulation with pandas
- User-item collaborative filtering
- Similarity-based recommendation models
- A Pixie-inspired Graph-based recommendation using adjacency lists with weighted random walks (without using NetworkX)


#### **Dataset Files:**
- **`u.data`**: User-movie ratings (`user_id  movie_id  rating  timestamp`)
- **`u.item`**: Movie metadata (`movie_id | title | release_date | video_release_date | IMDb_URL | 19 genre flags`)
- **`u.user`**: User demographics (`user_id | age | gender | occupation | zip_code`)

## **Part 1: Exploring and Cleaning Data**

### Inspecting the Dataset Format

The dataset is not in a traditional CSV format. To examine its structure, use the following shell command to display the first 10 lines of the file:

```sh
!head <file_name>
```

**In the cells given below, write the code to read the files.**

In [152]:
# u.data
!head u.data # displays structural format of the dataset

196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596
298	474	4	884182806
115	265	2	881171488
253	465	5	891628467
305	451	3	886324817
6	86	3	883603013


In [153]:
# u.item

!head u.item # displays structural format of the dataset

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0
6|Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)|01-Jan-1995||http://us.imdb.com/Title?Yao+a+yao+yao+dao+waipo+qiao+(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0
7|Twelve Monkeys (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Twelve%20Monkeys%20(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|1|0|0|0
8|Babe (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Babe%20(1995)|0|0|0|0|1

In [154]:
# u.user
!head u.user # displays structural format of the dataset

1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
8|36|M|administrator|05201
9|29|M|student|01002
10|53|M|lawyer|90703


#### Loading the Dataset with Pandas

Use **pandas** to load the dataset into a DataFrame for analysis. Follow these steps:  

1. Import the necessary library: `pandas`.  
2. Use `pd.read_csv()` (or an appropriate function) to read the dataset file.  
3. Ensure the dataset is loaded with the correct delimiter (e.g., `','`, `'\t'`, `'|'`, or another separator if needed).  
4. Select and display the first few rows using `.head()`.

Ensure that:  

- The `ratings` dataset is read from `"u.data"` using tab (`'\t'`) as a separator and column names (`"user_id"`, `"movie_id"`, `"rating"` and `"timestamp"`).  
- The `movies` dataset is read from `"u.item"` using `'|'` as a separator, use columns (`0`, `1`, `2`), encoding (`"latin-1"`) and name the columns (`movie_id`, `title`, and `release_date`).  
- The `users` dataset is read from `"u.user"` using `'|'` as a separator, use columns (`0`, `1`, `2`, `3`) and name the columns (`user_id`, `age`, `gender`, and `occupation`).

In [155]:
#ratings

import pandas as pd

# Load the ratings dataset
ratings = pd.read_csv('u.data', sep='\t', names=['user_id', 'movie_id', 'rating', 'timestamp'])
print("\nRatings dataset:")
print(ratings.head())



Ratings dataset:
   user_id  movie_id  rating  timestamp
0      196       242       3  881250949
1      186       302       3  891717742
2       22       377       1  878887116
3      244        51       2  880606923
4      166       346       1  886397596


In [156]:
#movies

import pandas as pd
# Load the movies dataset
movies = pd.read_csv('u.item', sep='|', encoding='latin-1', usecols=[0, 1, 2], names=['movie_id', 'title', 'release_date'])
print("\nMovies dataset:")
print(movies.head())



Movies dataset:
   movie_id              title release_date
0         1   Toy Story (1995)  01-Jan-1995
1         2   GoldenEye (1995)  01-Jan-1995
2         3  Four Rooms (1995)  01-Jan-1995
3         4  Get Shorty (1995)  01-Jan-1995
4         5     Copycat (1995)  01-Jan-1995


In [157]:
# users

import pandas as pd
# Load the users dataset
users = pd.read_csv('u.user', sep='|', usecols=[0, 1, 2, 3], names=['user_id', 'age', 'gender', 'occupation'])
print("\nUsers dataset:")
print(users.head())


Users dataset:
   user_id  age gender  occupation
0        1   24      M  technician
1        2   53      F       other
2        3   23      M      writer
3        4   24      M  technician
4        5   33      F       other


**Note:** As a **bonus** task, save the `ratings`, `movies`, and `users` DataFrames as `.csv` files. <br>
**Hint:** Use the `to_csv()` function in pandas to save these DataFrames as CSV files.

In [158]:
# ratings

# Save the Ratings DataFrame to CSV file
ratings.to_csv('ratings.csv', index=False)


# Export the CSV files
from google.colab import files
files.download('ratings.csv')




In [159]:
# movies

# Save the Movies DataFrame to CSV file
movies.to_csv('movies.csv', index=False)

# Export the CSV files
from google.colab import files
files.download('movies.csv')


In [160]:
# users

# Save the Users DataFrame to CSV file
users.to_csv('users.csv', index=False)

# Export the CSV files
from google.colab import files
files.download('users.csv')


**Display the first 10 rows of each file.**

In [161]:
# ratings

import pandas as pd


try:
    ratings_df = pd.read_csv('ratings.csv')
    print("Ratings (first 10 rows):\n", ratings_df.head(10)) #displays first 10 rows
except FileNotFoundError:
    print("ratings.csv not found.")



Ratings (first 10 rows):
    user_id  movie_id  rating  timestamp
0      196       242       3  881250949
1      186       302       3  891717742
2       22       377       1  878887116
3      244        51       2  880606923
4      166       346       1  886397596
5      298       474       4  884182806
6      115       265       2  881171488
7      253       465       5  891628467
8      305       451       3  886324817
9        6        86       3  883603013


In [162]:
# movies

import pandas as pd

try:
    movies_df = pd.read_csv('movies.csv')
    print("\nMovies (first 10 rows):\n", movies_df.head(10)) #displays first 10 rows
except FileNotFoundError:
    print("\nmovies.csv not found.")



Movies (first 10 rows):
    movie_id                                              title release_date
0         1                                   Toy Story (1995)  01-Jan-1995
1         2                                   GoldenEye (1995)  01-Jan-1995
2         3                                  Four Rooms (1995)  01-Jan-1995
3         4                                  Get Shorty (1995)  01-Jan-1995
4         5                                     Copycat (1995)  01-Jan-1995
5         6  Shanghai Triad (Yao a yao yao dao waipo qiao) ...  01-Jan-1995
6         7                              Twelve Monkeys (1995)  01-Jan-1995
7         8                                        Babe (1995)  01-Jan-1995
8         9                            Dead Man Walking (1995)  01-Jan-1995
9        10                                 Richard III (1995)  22-Jan-1996


In [163]:
# users

import pandas as pd

try:
    users_df = pd.read_csv('users.csv')
    print("\nUsers (first 10 rows):\n", users_df.head(10)) #displays first 10 rows
except FileNotFoundError:
    print("\nusers.csv not found.")




Users (first 10 rows):
    user_id  age gender     occupation
0        1   24      M     technician
1        2   53      F          other
2        3   23      M         writer
3        4   24      M     technician
4        5   33      F          other
5        6   42      M      executive
6        7   57      M  administrator
7        8   36      M  administrator
8        9   29      M        student
9       10   53      M         lawyer


### Data Cleaning and Exploration with Pandas  

After loading the dataset, it’s important to clean and explore the data to ensure consistency and accuracy. Below are key **pandas** functions for cleaning and understanding the dataset.

#### 1. Handle Missing Values  
- `df.dropna()` – Removes rows with missing values.  
- `df.fillna(value)` – Fills missing values with a specified value.  

#### 2. Remove Duplicates  
- `df.drop_duplicates()` – Drops duplicate rows from the dataset.  

#### 3. Handle Incorrect Data Types  
- `df.astype(dtype)` – Converts columns to the appropriate data type.  

#### 4. Filter Outliers (if applicable)  
- `df[df['column_name'] > threshold]` – Filters rows based on a condition.  

#### 5. Rename Columns (if needed)  
- `df.rename(columns={'old_name': 'new_name'})` – Renames columns for clarity.  

#### 6. Reset Index  
- `df.reset_index(drop=True, inplace=True)` – Resets the index after cleaning.  
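
As a rough illustration of how several of the cleaning calls listed above chain together, the sketch below runs them on a tiny made-up table. The column name `uid`, the sample values, and the 1 to 5 rating range check are assumptions for the example, not part of the assignment data.

```python
import numpy as np
import pandas as pd

# Toy ratings table (hypothetical values) with one duplicate row,
# one missing rating, and one out-of-range rating.
raw = pd.DataFrame({
    "uid":      [1, 1, 2, 3, 3],
    "movie_id": [10, 10, 20, 30, 40],
    "rating":   [4.0, 4.0, np.nan, 9.0, 2.0],
})

clean = (
    raw.drop_duplicates()                   # drops the repeated (1, 10, 4.0) row
       .dropna(subset=["rating"])           # drops the row with no rating
       .astype({"rating": "int64"})         # ratings are whole numbers
       .rename(columns={"uid": "user_id"})  # clearer column name
)
clean = clean[clean["rating"].between(1, 5)]  # keep only ratings in the 1-5 range
clean = clean.reset_index(drop=True)          # renumber rows 0..n-1

print(clean)
```

The sketch drops the missing rating rather than filling it; `fillna(value)` would be the alternative when a sensible default exists.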

### Data Exploration Functions  

To better understand the dataset, use these **pandas** functions:  

- `df.shape` – Returns the number of rows and columns in the dataset.  
- `df.nunique()` – Displays the number of unique values in each column.  
- `df['column_name'].unique()` – Returns unique values in a specific column.  

**Example Usage in Pandas:**  
```python
import pandas as pd

# Load dataset
df = pd.read_csv("your_file.csv")

# Drop missing values
df_cleaned = df.dropna()

# Remove duplicate rows
df_cleaned = df_cleaned.drop_duplicates()

# Convert 'timestamp' column (Unix epoch seconds) to datetime format.
# Without unit='s', pandas reads the integers as nanoseconds since 1970.
df_cleaned['timestamp'] = pd.to_datetime(df_cleaned['timestamp'], unit='s')

# Display dataset shape
print("Dataset shape:", df_cleaned.shape)

# Display number of unique values in each column
print("Unique values per column:\n", df_cleaned.nunique())

# Display unique movie IDs
print("Unique movie IDs:", df_cleaned['movie_id'].unique()[:10])  # Show first 10 unique movie IDs
```

**Note:** The functions mentioned above are some of the widely used **pandas** functions for data cleaning and exploration. However, it is not necessary that all of these functions will be required in the exercises below. Use them as needed based on the dataset and the specific tasks.

In [164]:
# to use some of the data exploration functions like df.shape, df.nunique(), df['column_name'].unique()

# Load dataset
df = pd.read_csv('ratings.csv')
df_m = pd.read_csv('movies.csv')
df_u = pd.read_csv('users.csv')

# Display dataset shape
print("Ratings dataset shape:", df.shape)
print("Movies dataset shape:", df_m.shape)
print("Users dataset shape:", df_u.shape)

# Display number of unique values in each column
print("\nRatings unique values per column:\n", df.nunique())
print("\nMovies unique values per column:\n", df_m.nunique())
print("\nUsers unique values per column:\n", df_u.nunique())

# Display unique values in specific columns
print("\nUnique movie IDs:", df['movie_id'].unique()[:10])  # Show first 10 unique movie IDs
print("\nUnique user IDs:", df['user_id'].unique()[:10])  # Show first 10 unique user IDs
print("\nUnique ratings:", df['rating'].unique())
print("\nUnique occupations:", df_u['occupation'].unique())


Ratings dataset shape: (100000, 4)
Movies dataset shape: (1682, 3)
Users dataset shape: (943, 4)

Ratings unique values per column:
 user_id        943
movie_id      1682
rating           5
timestamp    49282
dtype: int64

Movies unique values per column:
 movie_id        1682
title           1664
release_date     240
dtype: int64

Users unique values per column:
 user_id       943
age            61
gender          2
occupation     21
dtype: int64

Unique movie IDs: [242 302 377  51 346 474 265 465 451  86]

Unique user IDs: [196 186  22 244 166 298 115 253 305   6]

Unique ratings: [3 1 2 4 5]

Unique occupations: ['technician' 'other' 'writer' 'executive' 'administrator' 'student'
 'lawyer' 'educator' 'scientist' 'entertainment' 'programmer' 'librarian'
 'homemaker' 'artist' 'engineer' 'marketing' 'none' 'healthcare' 'retired'
 'salesman' 'doctor']


**Convert Timestamps into Readable dates.**

In [165]:
# ratings

import pandas as pd

# Load dataset
df = pd.read_csv('ratings.csv')

# Drop missing values
df_cleaned = df.dropna()

# Remove duplicate rows
df_cleaned = df_cleaned.drop_duplicates()

# Convert 'timestamp' column (Unix epoch seconds) to datetime format.
# Without unit='s', pandas reads the integers as nanoseconds since 1970.
df_cleaned['timestamp'] = pd.to_datetime(df_cleaned['timestamp'], unit='s')

# Display the updated DataFrame
df_cleaned.head()

   user_id  movie_id  rating           timestamp
0      196       242       3 1997-12-04 15:55:49
1      186       302       3 1998-04-04 19:22:22
2       22       377       1 1997-11-07 07:18:36
3      244        51       2 1997-11-27 05:02:03
4      166       346       1 1998-02-02 05:33:16


**Check for Missing Values**

In [166]:
# ratings
import pandas as pd

# Load the CSV file
df_rt = pd.read_csv('ratings.csv')

# Check for missing values
missing_values = df_rt.isnull().sum()

print("Missing values:")
for column, count in missing_values.items():
    print(f"{column}: {count}")

df_rt = df_rt.dropna() #Drop any rows with missing values

Missing values:
user_id: 0
movie_id: 0
rating: 0
timestamp: 0


In [167]:
# movies

import pandas as pd

# Load the CSV file
df_mov = pd.read_csv('movies.csv')

# Check for missing values
missing_values = df_mov.isnull().sum()

print("Missing values:")
for column, count in missing_values.items():
    print(f"{column}: {count}")

df_mov = df_mov.dropna() #Drop any rows with missing values

#row 267 has no release date. It's a garbage record.

Missing values:
movie_id: 0
title: 0
release_date: 1


In [168]:
# users
import pandas as pd

# Load the CSV file
df_us = pd.read_csv('users.csv')

# Check for missing values
missing_values = df_us.isnull().sum()

print("Missing values:")
for column, count in missing_values.items():
    print(f"{column}: {count}")

df_us = df_us.dropna() #Drop any rows with missing values

Missing values:
user_id: 0
age: 0
gender: 0
occupation: 0


**Print the total number of users, movies, and ratings.**

In [169]:
import pandas as pd

# Print the total number of users, movies and ratings

print(f"Total Users: { df_us['user_id'].count() }")
print(f"Total Movies: { df_mov['movie_id'].count() }")
print(f"Total Ratings: { df_rt['rating'].count() }")

Total Users: 943
Total Movies: 1681
Total Ratings: 100000


## **Part 2: Collaborative Filtering-Based Recommendation**

### **Create a User-Item Matrix**

#### Instructions for Creating a User-Movie Rating Matrix

In this exercise, you will create a user-movie rating matrix using **pandas**. This matrix will represent the ratings that users have given to different movies.

1. **Dataset Overview**:  
   The dataset has already been loaded. It includes the following key columns:
   - `user_id`: The ID of the user.
   - `movie_id`: The ID of the movie.
   - `rating`: The rating the user gave to the movie.

2. **Create the User-Movie Rating Matrix**:  
   Use the **`pivot()`** function in **pandas** to reshape the data. Your goal is to create a matrix where:
   - Each **row** represents a **user**.
   - Each **column** represents a **movie**.
   - Each **cell** contains the **rating** that the user has given to the movie.

   Specify the following parameters for the `pivot()` function:
   - **`index`**: The `user_id` column (this will define the rows).
   - **`columns`**: The `movie_id` column (this will define the columns).
   - **`values`**: The `rating` column (this will fill the matrix with ratings).

3. **Inspect the Matrix**:  
   After creating the matrix, examine the first few rows of the resulting matrix to ensure it has been constructed correctly.

4. **Handle Missing Values**:  
   It's likely that some users have not rated every movie, resulting in `NaN` values in the matrix. You will need to handle these missing values. Consider the following options:
   - **Fill with 0**: If you wish to represent missing ratings as zeros (indicating no rating).
   - **Fill with the average rating**: Alternatively, replace missing values with the average rating for each movie.
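
The two fill options above can be compared on a small example. This is a minimal sketch on made-up ratings; `toy_ratings`, `filled_zero`, and `filled_mean` are names chosen for illustration.

```python
import pandas as pd

# Toy ratings in the same long format as u.data (hypothetical values).
toy_ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3],
    "movie_id": [10, 20, 10, 20, 30],
    "rating":   [5, 3, 4, 1, 2],
})

# Rows are users, columns are movies; unrated pairs become NaN.
matrix = toy_ratings.pivot(index="user_id", columns="movie_id", values="rating")

# Option 1: represent "no rating" as 0.
filled_zero = matrix.fillna(0)

# Option 2: fill each movie's gaps with that movie's mean rating.
# matrix.mean() works column by column and skips NaN by default.
filled_mean = matrix.fillna(matrix.mean())

print(filled_zero)
print(filled_mean)
```

Note that `pivot()` raises a `ValueError` if the same `(user_id, movie_id)` pair appears more than once; `pivot_table()` with an aggregation function is the usual fallback in that case.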

**Create the user-movie rating matrix using the `pivot()` function.**

In [170]:

# Creating a user-movie rating matrix using the pivot() function.

user_movie_matrix = ratings.pivot(index='user_id', columns='movie_id', values='rating').fillna(0)



**Display the matrix to verify the transformation.**

In [171]:
# Displaying matrix

print(user_movie_matrix)


movie_id  1     2     3     4     5     6     7     8     9     10    ...  \
user_id                                                               ...   
1          5.0   3.0   4.0   3.0   3.0   5.0   4.0   1.0   5.0   3.0  ...   
2          4.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   2.0  ...   
3          0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
4          0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
5          4.0   3.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
...        ...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...   
939        0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   5.0   0.0  ...   
940        0.0   0.0   0.0   2.0   0.0   0.0   4.0   5.0   3.0   0.0  ...   
941        5.0   0.0   0.0   0.0   0.0   0.0   4.0   0.0   0.0   0.0  ...   
942        0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
943        0.0   5.0   0.0   0.0   0.0   0.0   0.0   0.0   3.0   0.0  ...   

### **User-Based Collaborative Filtering Recommender System**

#### **Objective**
In this task, you will implement a **user-based collaborative filtering** movie recommendation system using the **MovieLens dataset**. The goal is to recommend movies to a user based on the preferences of similar users.

##### **Step 1: Import Required Libraries**
Before starting, ensure you have the necessary libraries installed. Use the following imports:

```python
import pandas as pd  # For handling data
import numpy as np   # For numerical computations
from sklearn.metrics.pairwise import cosine_similarity  # For computing user similarity
```

##### **Step 2: Compute User-User Similarity**
- We will use **cosine similarity** to measure how similar each pair of users is based on their movie ratings.
- Since `cosine_similarity` does not handle missing values (NaN), replace them with `0` before computation.

##### **Instructions:**
1. Fill missing values with `0` using `.fillna(0)`.
2. Compute similarity using `cosine_similarity()`.
3. Convert the result into a **Pandas DataFrame**, with users as both row and column labels.

##### **Hint:**  
You can achieve this using the following approach:

```python
user_similarity = cosine_similarity(user_movie_matrix.fillna(0))
user_sim_df = pd.DataFrame(user_similarity, index=user_movie_matrix.index, columns=user_movie_matrix.index)
```
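
To make the similarity score concrete, the sketch below runs the same call on a 3-user toy matrix and checks one entry against the textbook definition (dot product divided by the product of the vector norms). The matrix values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-movie matrix (hypothetical ratings, 0 = not rated).
toy_matrix = pd.DataFrame(
    [[5, 3, 0],
     [4, 0, 0],
     [0, 1, 2]],
    index=pd.Index([1, 2, 3], name="user_id"),
    columns=pd.Index([10, 20, 30], name="movie_id"),
)

sim = pd.DataFrame(
    cosine_similarity(toy_matrix),
    index=toy_matrix.index,
    columns=toy_matrix.index,
)

# Manual check for users 1 and 2: dot product over the product of norms.
u1 = toy_matrix.loc[1].to_numpy(dtype=float)
u2 = toy_matrix.loc[2].to_numpy(dtype=float)
manual = u1 @ u2 / (np.linalg.norm(u1) * np.linalg.norm(u2))

print(sim.round(3))
print(round(manual, 3))
```

Users 2 and 3 share no rated movie, so their similarity is 0; every user has similarity 1 with themselves, which is why the recommendation function has to exclude the target user.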

##### **Step 3: Implement the Recommendation Function**
Now, implement the function `recommend_movies_for_user(user_id, num=5)` to recommend movies for a given user.

##### **Function Inputs:**
- `user_id`: The target user for whom we need recommendations.
- `num`: The number of movies to recommend (default is 5).

##### **Function Steps:**
1. Find **similar users**:
   - Retrieve the similarity scores for the given `user_id`.
   - Sort them in **descending** order (highest similarity first).
   - Exclude the user themselves.
   
2. Get the **movie ratings** from these similar users.

3. Compute the **average rating** for each movie based on these users' preferences.

4. Sort the movies in **descending order** based on the computed average ratings.

5. Retrieve the **top `num` recommended movies**.

6. Map **movie IDs** to their **titles** using the `movies` DataFrame.

7. Return the results as a **Pandas DataFrame** with rankings.
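
The steps above can be sketched on toy data as follows. This is one possible reading, not a reference solution: the neighbor count `k`, the choice to ignore zeros when averaging, and the filter to unrated movies are assumptions, and `toy_matrix`, `toy_movies`, `toy_sim`, and `recommend_sketch` are hypothetical names.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy data (hypothetical): 4 users, 4 movies, 0 = not rated.
toy_matrix = pd.DataFrame(
    [[5, 4, 0, 0],
     [5, 5, 4, 0],
     [4, 5, 0, 5],
     [0, 1, 5, 2]],
    index=pd.Index([1, 2, 3, 4], name="user_id"),
    columns=pd.Index([10, 20, 30, 40], name="movie_id"),
)
toy_movies = pd.DataFrame({"movie_id": [10, 20, 30, 40],
                           "title": ["A", "B", "C", "D"]})
toy_sim = pd.DataFrame(cosine_similarity(toy_matrix),
                       index=toy_matrix.index, columns=toy_matrix.index)

def recommend_sketch(user_id, num=2, k=2):
    # Step 1: the k most similar users, excluding the target user.
    neighbors = toy_sim[user_id].drop(user_id).nlargest(k).index
    # Steps 2-3: mean neighbor rating per movie; zeros mean "not rated".
    neighbor_ratings = toy_matrix.loc[neighbors]
    mean_ratings = neighbor_ratings.where(neighbor_ratings > 0).mean().dropna()
    # Keep only movies the target user has not rated yet (an assumption).
    own = toy_matrix.loc[user_id]
    candidates = mean_ratings[mean_ratings.index.isin(own[own == 0].index)]
    # Steps 4-5: highest average first, keep the top `num`.
    top = candidates.nlargest(num)
    # Steps 6-7: map ids to titles and build the ranked DataFrame.
    titles = toy_movies.set_index("movie_id")["title"]
    result = pd.DataFrame({"Ranking": range(1, len(top) + 1),
                           "Movie Name": titles.loc[top.index].to_list()})
    return result.set_index("Ranking")

print(recommend_sketch(1))
```

Using `len(top)` instead of `num` for the ranking keeps the DataFrame valid when fewer than `num` candidates exist.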

##### **Step 4: Return the Final Recommendation List**
Your function should return a **DataFrame** structured as follows:

| Ranking | Movie Name |
|---------|-----------|
| 1       | Movie A   |
| 2       | Movie B   |
| 3       | Movie C   |
| 4       | Movie D   |
| 5       | Movie E   |

##### **Hint:** Your final DataFrame should be created like this:
```python
result_df = pd.DataFrame({
    'Ranking': range(1, num+1),
    'Movie Name': movie_names     
})
result_df.set_index('Ranking', inplace=True)
```

#### **Example: User-Based Collaborative Filtering**
```python
recommend_movies_for_user(10, num = 5)
```
**Output:**
```
| Ranking | Movie Name                     |
|---------|--------------------------------|
| 1       | In the Company of Men (1997)   |
| 2       | Misérables, Les (1995)         |
| 3       | Thin Blue Line, The (1988)     |
| 4       | Braindead (1992)               |
| 5       | Boys, Les (1997)               |
```


**User Similarity Calculation:**
The code computes cosine similarity between all users based on their ratings in user_movie_matrix, resulting in a user_sim_df matrix where each entry represents how similar two users are.

**Filtering Similar Users:**
For a given user (e.g., user 10), the function retrieves and sorts other users by their similarity scores, excluding the user themselves.

**Identifying Unrated Movies:**
It identifies which movies the target user has not rated (i.e., entries are 0.0 in the matrix).

**Neighbor Rating Aggregation:**
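For each unrated movie, the function finds the users who rated it, merges them with the similarity scores, and takes the plain mean of their ratings as the predicted rating. The similarity scores are merged in but do not weight the average, so every user who rated the movie counts equally.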

**Predicted Ratings Compilation:**
A list of predicted ratings for each unrated movie is compiled into a DataFrame and sorted by predicted score in descending order.

**Mapping Movie IDs to Titles:**
Movie IDs are mapped to their corresponding titles using the movies DataFrame, creating a more user-friendly output.

**Displaying Recommendations:**
The top num (default 5) recommended movies are printed in a formatted table, showing their ranking and title for the user.
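
For comparison with the aggregation step described above, the following sketch predicts a rating as a similarity-weighted mean, so that ratings from users closer to the target count for more. It is an illustrative variant on made-up data, not the method required by the assignment; `toy_matrix`, `toy_sim`, and `predict_weighted` are hypothetical names.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy data (hypothetical): 0 = not rated.
toy_matrix = pd.DataFrame(
    [[5, 4, 0],
     [5, 5, 2],
     [1, 0, 5]],
    index=pd.Index([1, 2, 3], name="user_id"),
    columns=pd.Index([10, 20, 30], name="movie_id"),
)
toy_sim = pd.DataFrame(cosine_similarity(toy_matrix),
                       index=toy_matrix.index, columns=toy_matrix.index)

def predict_weighted(user_id, movie_id):
    """Similarity-weighted mean of the ratings other users gave movie_id."""
    ratings = toy_matrix[movie_id].drop(user_id)
    ratings = ratings[ratings > 0]                 # only users who rated it
    if ratings.empty:
        return 0.0
    weights = toy_sim.loc[user_id, ratings.index]  # similarity to each rater
    if weights.sum() == 0:
        return 0.0
    return float((weights * ratings).sum() / weights.sum())

# Plain mean over everyone who rated movie 30, for comparison.
plain_mean = toy_matrix[30][toy_matrix[30] > 0].mean()
print(plain_mean, predict_weighted(1, 30))
```

In this toy case user 1 rates much like user 2, so the weighted prediction for movie 30 is pulled toward user 2's low rating and lands below the plain mean.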

In [172]:
# Code the function here

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Compute User-User Similarity
user_similarity = cosine_similarity(user_movie_matrix.fillna(0))
user_sim_df = pd.DataFrame(user_similarity, index=user_movie_matrix.index, columns=user_movie_matrix.index)

def recommend_movies_for_user(user_id, num=5):
    # Get similarity scores for the user and drop the user themselves
    user_similarity_scores_df = user_sim_df[[user_id]].sort_values(by=user_id, ascending=False)
    user_similarity_scores_df.drop(user_id, inplace=True)
    user_similarity_scores_df.reset_index(inplace=True)
    user_similarity_scores_df.columns = ['user_id', 'sim_score']

    # Get list of movies the user hasn't rated
    target_user_unrated_movies = [columns for columns in user_movie_matrix.columns if user_movie_matrix.loc[user_id, columns] == 0.0]

    predicted_ratings = []
    for movie in target_user_unrated_movies:
        # Get non-zero ratings for the current movie
        movie_ratings_from_neighbors = user_movie_matrix[[movie]]
        movie_ratings_from_neighbors = movie_ratings_from_neighbors[movie_ratings_from_neighbors[movie] > 0.0]
        # Merge with user similarity scores
        neighbor_ratings_for_movie = pd.merge(user_similarity_scores_df, movie_ratings_from_neighbors, on='user_id')
        # Unweighted mean of the ratings from every other user who rated this movie
        if not neighbor_ratings_for_movie.empty:
            average_neighbor_rating = neighbor_ratings_for_movie.iloc[:, 2].sum() / movie_ratings_from_neighbors.shape[0]
        else:
            average_neighbor_rating = 0.0
        predicted_ratings.append(average_neighbor_rating)

    # Prepare prediction DataFrame
    data = {
        "movie_id": target_user_unrated_movies,
        "pred_rating": predicted_ratings,
    }
    ranked_movie_predictions_df = pd.DataFrame(data)
    ranked_movie_predictions_df = ranked_movie_predictions_df.sort_values(by="pred_rating", ascending=False).iloc[:num, :]
    movie_id_to_title = dict(zip(movies['movie_id'], movies['title']))
    ranked_movie_predictions_df['title'] = ranked_movie_predictions_df['movie_id'].map(movie_id_to_title)

    # Print movie table
    recommended_titles = list(ranked_movie_predictions_df['title'])
    print(f"| {'Ranking':<7} | {'Movie Name':<62} |")  # Header with fixed widths
    print(f"|{'-' * 9}|{'-' * 64}|")  # Separator with fixed widths
    for i, name in enumerate(recommended_titles, 1):  # Iterate and print
        print(f"| {i:<7} | {name:<62} |")

# Example usage
recommend_movies_for_user(10, num=5)


| Ranking | Movie Name                                                     |
|---------|----------------------------------------------------------------|
| 1       | Star Kid (1997)                                                |
| 2       | Marlene Dietrich: Shadow and Light (1996)                      |
| 3       | Someone Else's America (1995)                                  |
| 4       | Prefontaine (1997)                                             |
| 5       | Saint of Fort Washington, The (1993)                           |


**The system generates a diverse set of movie recommendations, including titles like Star Kid (1997) and Marlene Dietrich: Shadow and Light (1996), demonstrating its ability to suggest films across various genres and time periods**.

**This diversity enhances the recommendation experience by introducing users to a broader range of options**, including lesser-known or niche films they might not have otherwise encountered.

**The output is presented in a straightforward, ranked format that includes movie titles**, contributing to a user-friendly and accessible interface.

**By focusing on user similarity, the system offers personalized recommendations that align closely with individual preferences**, resulting in a more engaging and tailored movie discovery process.










### **Item-Based Collaborative Filtering Recommender System**

#### **Objective**
In this task, you will implement an **item-based collaborative filtering** recommendation system using the **MovieLens dataset**. The goal is to recommend movies similar to a given movie based on user rating patterns.

#### **Step 1: Import Required Libraries**
We already did this in the previous task; the imports are repeated here for emphasis.

Before starting, ensure you have the necessary libraries installed. Use the following imports:

```python
import pandas as pd  # For handling data
import numpy as np   # For numerical computations
from sklearn.metrics.pairwise import cosine_similarity  # For computing item similarity
```

#### **Step 2: Compute Item-Item Similarity**
- We will use **cosine similarity** to measure how similar each pair of movies is based on their user ratings.
- Since `cosine_similarity` does not handle missing values (NaN), replace them with `0` before computation.
- Unlike user-based filtering, we need to **transpose** (`.T`) the `user_movie_matrix` because we want similarity between movies (columns) instead of users (rows).

##### **Instructions:**
1. Transpose the user-movie matrix using `.T` to make movies the rows.
2. Fill missing values with `0` using `.fillna(0)`.
3. Compute similarity using `cosine_similarity()`.
4. Convert the result into a **Pandas DataFrame**, with movies as both row and column labels.

##### **Hint:**  
You can achieve this using the following approach:

```python
item_similarity = cosine_similarity(user_movie_matrix.T.fillna(0))
item_sim_df = pd.DataFrame(item_similarity, index=user_movie_matrix.columns, columns=user_movie_matrix.columns)
```
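
The effect of the transpose is easiest to see from the shapes. In this small sketch on made-up ratings, the untransposed call compares users while the transposed call compares movies; `toy_matrix` and `item_sim_toy` are illustrative names.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-movie matrix (hypothetical): 3 users x 4 movies, 0 = not rated.
toy_matrix = pd.DataFrame(
    [[5, 5, 0, 1],
     [4, 5, 0, 0],
     [0, 0, 5, 4]],
    index=pd.Index([1, 2, 3], name="user_id"),
    columns=pd.Index([10, 20, 30, 40], name="movie_id"),
)

user_sim = cosine_similarity(toy_matrix)    # rows are users  -> 3 x 3
item_sim = cosine_similarity(toy_matrix.T)  # rows are movies -> 4 x 4

item_sim_toy = pd.DataFrame(item_sim,
                            index=toy_matrix.columns,
                            columns=toy_matrix.columns)
print(user_sim.shape, item_sim.shape)
print(item_sim_toy.round(2))
```

Movies 10 and 20 were rated highly by the same users, so they come out as near neighbors, while movies 10 and 30 share no raters and score 0.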

#### **Step 3: Implement the Recommendation Function**
Now, implement the function `recommend_movies(movie_name, num=5)` to recommend movies similar to a given movie.

##### **Function Inputs:**
- `movie_name`: The target movie for which we need recommendations.
- `num`: The number of similar movies to recommend (default is 5).

##### **Function Steps:**
1. Find the **movie_id** corresponding to the given `movie_name` in the `movies` DataFrame.
2. If the movie is not found, return an appropriate message.
3. Extract the **similarity scores** for this movie from `item_sim_df`.
4. Sort the movies in **descending order** based on similarity (excluding the movie itself).
5. Retrieve the **top `num` similar movies**.
6. Map **movie IDs** to their **titles** using the `movies` DataFrame.
7. Return the results as a **Pandas DataFrame** with rankings.
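
A minimal sketch of these steps on toy data is shown below. It is an illustration under assumptions rather than a reference solution: the titles, ratings, and the names `toy_movies`, `toy_item_sim`, and `recommend_similar_sketch` are made up for the example.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy data (hypothetical titles and ratings): 0 = not rated.
toy_movies = pd.DataFrame({"movie_id": [10, 20, 30, 40],
                           "title": ["A", "B", "C", "D"]})
toy_matrix = pd.DataFrame(
    [[5, 1, 4, 0],
     [4, 0, 5, 1],
     [0, 5, 1, 5]],
    index=pd.Index([1, 2, 3], name="user_id"),
    columns=pd.Index([10, 20, 30, 40], name="movie_id"),
)
toy_item_sim = pd.DataFrame(cosine_similarity(toy_matrix.T),
                            index=toy_matrix.columns,
                            columns=toy_matrix.columns)

def recommend_similar_sketch(movie_name, num=2):
    # Steps 1-2: look up the id; an empty match means the title is unknown.
    match = toy_movies.loc[toy_movies["title"] == movie_name, "movie_id"]
    if match.empty:
        return "Movie not found"
    movie_id = match.iloc[0]
    # Steps 3-5: drop the movie itself by label, then take the top `num`.
    top = toy_item_sim[movie_id].drop(movie_id).nlargest(num)
    # Step 6: look titles up in similarity order, not DataFrame order.
    titles = toy_movies.set_index("movie_id")["title"]
    # Step 7: ranked DataFrame; len(top) guards against fewer than `num` rows.
    result = pd.DataFrame({"Ranking": range(1, len(top) + 1),
                           "Movie Name": titles.loc[top.index].to_list()})
    return result.set_index("Ranking")

print(recommend_similar_sketch("A"))
print(recommend_similar_sketch("Not a real title"))
```

Dropping the target movie by label avoids relying on it being the first entry after sorting, which can fail when another movie ties at similarity 1.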

#### **Step 4: Return the Final Recommendation List**
Your function should return a **DataFrame** structured as follows:

| Ranking | Movie Name |
|---------|-----------|
| 1       | Movie A   |
| 2       | Movie B   |
| 3       | Movie C   |
| 4       | Movie D   |
| 5       | Movie E   |

##### **Hint:** Your final DataFrame should be created like this:
```python
result_df = pd.DataFrame({
    'Ranking': range(1, num+1),
    'Movie Name': movie_names
})
result_df.set_index('Ranking', inplace=True)
```

#### **Example: Item-Based Collaborative Filtering**
```python
recommend_movies("Jurassic Park (1993)", num=5)
```
**Output:**
```
| Ranking | Movie Name                                |
|---------|-------------------------------------------|
| 1       | Top Gun (1986)                            |
| 2       | Empire Strikes Back, The (1980)           |
| 3       | Raiders of the Lost Ark (1981)            |
| 4       | Indiana Jones and the Last Crusade (1989) |
| 5       | Speed (1994)                              |
```


**Calculate Item-Item Similarity:**
The code transposes user_movie_matrix so that movies become rows, and computes cosine similarity between them to get item_sim_df, which stores how similar each pair of movies is based on user ratings.

**Find Movie ID from Title:**
Given a movie_name, the function looks up its corresponding movie_id in the movies DataFrame.

**Handle Missing Movie Names:**
The function is intended to return a "Movie not found" message when the given title does not exist in the dataset.

**Retrieve Similar Movies:**
For the identified movie_id, the function gets similarity scores from item_sim_df, sorts them in descending order, and excludes the movie itself from the results.

**Select Top-N Similar Movies:**
It selects the top `num` most similar movie IDs and maps them to their titles using the `movies` DataFrame. Because the lookup filters with `isin`, the titles come back in `movie_id` order rather than in order of similarity.

**Formatted Output Display:**
The recommended movie titles are printed in a neatly formatted table showing ranking and names, providing an easy-to-read list of similar films.

**Return Result DataFrame:**
A `result_df` DataFrame containing the ranked recommendations is created. As written, the function does not return it, so a `return result_df` is needed before it can be used for further processing or visualization.
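
One detail worth checking when mapping IDs to titles: an `isin` filter keeps rows in the DataFrame's own order, while a label-based lookup keeps the order of the IDs passed in. The sketch below shows the difference on a made-up table; `toy_movies` and `top_ids` are hypothetical.

```python
import pandas as pd

# Hypothetical titles and ids, for illustration only.
toy_movies = pd.DataFrame({"movie_id": [1, 2, 3, 4],
                           "title": ["A", "B", "C", "D"]})

# Ids as they might come out of a similarity sort: most similar first.
top_ids = pd.Index([4, 1, 3])

# isin() is a membership filter: rows come back in toy_movies order.
by_isin = toy_movies[toy_movies["movie_id"].isin(top_ids)]["title"].tolist()

# Label-based lookup returns titles in the order the ids were given.
titles = toy_movies.set_index("movie_id")["title"]
by_lookup = titles.loc[top_ids].tolist()

print(by_isin)    # ['A', 'C', 'D']
print(by_lookup)  # ['D', 'A', 'C']
```

When the ranking column is meant to reflect similarity, the label-based lookup is the safer choice.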

In [173]:
# Code the function here

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

item_similarity = cosine_similarity(user_movie_matrix.T.fillna(0))
item_sim_df = pd.DataFrame(item_similarity, index=user_movie_matrix.columns, columns=user_movie_matrix.columns)

def recommend_movies(movie_name, num=5):
    # Find the movie_id; an empty match means the title is not in the dataset
    matches = movies.loc[movies['title'] == movie_name, 'movie_id']

    # Handle cases where movie name is not found
    if matches.empty:
        return "Movie not found"
    movie_id = matches.iloc[0]

    # Extract similarity scores for this movie
    similar_movies = item_sim_df[movie_id].sort_values(ascending=False)[1:]  # Exclude the movie itself

    # Get top N similar movies
    top_n_movie_ids = similar_movies.index[:num]
    movie_names = movies[movies['movie_id'].isin(top_n_movie_ids)]['title'].tolist()

    print(f"| {'Ranking':<7} | {'Movie Name':<42} |")
    print(f"|{'-'*9}|{'-'*44}|")
    for i, name in enumerate(movie_names, 1):
        print(f"| {i:<7} | {name:<42} |")

    # Create the result DataFrame
    result_df = pd.DataFrame({
        'Ranking': range(1, num + 1),
        'Movie Name': movie_names
    })

# Example usage
recommend_movies("Speed (1994)", num=5)


| Ranking | Movie Name                                 |
|---------|--------------------------------------------|
| 1       | Fugitive, The (1993)                       |
| 2       | Jurassic Park (1993)                       |
| 3       | Terminator 2: Judgment Day (1991)          |
| 4       | Top Gun (1986)                             |
| 5       | True Lies (1994)                           |


**For Speed (1994), the recommender returns other well-known action and thriller titles**, such as The Fugitive and Jurassic Park. Because similarity is computed from rating vectors, this reflects overlap in who rated these films and how they rated them, not an explicit genre model.

**Inclusion of films like Top Gun and True Lies suggests the model picks up co-rating patterns among action fans**; genre and release date are not inputs, so any alignment in genre or era is an indirect effect of shared audiences.

**The output is presented in a clear, ranked format that improves readability and user-friendliness**, making it easy for users to interpret and evaluate the suggested movies.

**By leveraging cosine similarity, the system produces plausible item-to-item recommendations**, although the list is the same for every user and no accuracy metric was computed here; its collaborative filtering foundation offers room for future enhancements like content-based features or advanced modeling techniques.

## **Part 3: Graph-Based Recommender (Pixie-Inspired Algorithm)**

### **Adjacency List**

#### **Objective**
In this task, you will preprocess the MovieLens dataset and construct a **graph representation** where:
- **Users** are connected to the movies they have rated.
- **Movies** are connected to users who have rated them.
  
This graph structure will help in exploring **user-movie relationships** for recommendations.

#### **Step 1: Merge Ratings with Movie Titles**
Since we have **movie IDs** in the ratings dataset but need human-readable movie titles, we will:
1. Merge the `ratings` DataFrame with the `movies` DataFrame using the `'movie_id'` column.
2. This allows each rating to be associated with a **movie title**.

#### **Hint:**
Use the following Pandas operation to merge:
```python
ratings = ratings.merge(movies, on='movie_id')
```


#### **Step 2: Aggregate Ratings**
Since the same user may have more than one rating for the same movie, we:
1. Group the dataset by `['user_id', 'movie_id', 'title']`.
2. Compute the **mean rating** for each movie by each user.
3. Reset the index to ensure we maintain a clean DataFrame structure.

#### **Hint:**  
Use `groupby()` and `mean()` as follows:
```python
ratings = ratings.groupby(['user_id', 'movie_id', 'title'])['rating'].mean().reset_index()
```

#### **Step 3: Normalize Ratings**
Since different users have different rating biases, we normalize ratings by:
1. **Computing each user's mean rating**.
2. **Subtracting the mean rating** from each individual rating.

#### **Instructions:**
- Use `groupby('user_id')` to group ratings by users.
- Apply `transform(lambda x: x - x.mean())` to adjust ratings.

#### **Hint:**  
Normalize ratings using:
```python
ratings['rating'] = ratings.groupby('user_id')['rating'].transform(lambda x: x - x.mean())
```
This ensures each user’s ratings are centered around zero, making similarity calculations fairer.
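
A small worked example may make the effect of centering concrete. The DataFrame below is made up (one generous rater, one harsh rater) and the column name `centered` is an assumption used only to keep the raw ratings visible next to the result.

```python
import pandas as pd

# Hypothetical ratings: user 1 rates generously, user 2 rates harshly.
demo = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "movie_id": [10, 11, 12, 10, 11, 12],
    "rating": [5.0, 4.0, 3.0, 3.0, 2.0, 1.0],
})

# Subtract each user's own mean rating (4.0 for user 1, 2.0 for user 2).
demo["centered"] = demo.groupby("user_id")["rating"].transform(lambda x: x - x.mean())

print(demo)
```

Both users end up with the same centered values for the three movies, even though their raw ratings differ by two points throughout.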

#### **Step 4: Construct the Graph Representation**
We represent the user-movie interactions as an **undirected graph** using an **adjacency list**:
- Each **user** is a node connected to movies they rated.
- Each **movie** is a node connected to users who rated it.

#### **Graph Construction Steps:**
1. Initialize an empty dictionary `graph = {}`.
2. Iterate through the **ratings dataset**.
3. For each `user_id` and `movie_id` pair:
   - Add the movie to the user’s set of connections.
   - Add the user to the movie’s set of connections.

#### **Hint:**  
The following code builds the graph:

```python
graph = {}
for _, row in ratings.iterrows():
    user, movie = row['user_id'], row['movie_id']
    if user not in graph:
        graph[user] = set()
    if movie not in graph:
        graph[movie] = set()
    graph[user].add(movie)
    graph[movie].add(user)
```

This results in a **bipartite graph**, where:
- **Users** are connected to multiple movies.
- **Movies** are connected to multiple users.
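
One thing to watch with a single dictionary is that `user_id` and `movie_id` are both small integers, so user 1 and movie 1 would share the key `1`. The sketch below, on made-up pairs, shows the collision and one possible workaround: tagging each key with its node type. The `("user", id)` / `("movie", id)` labels are an assumption for illustration; any scheme that keeps the two key spaces apart would do.

```python
# Toy (user_id, movie_id) pairs; both IDs are small integers, as in MovieLens.
pairs = [(1, 5), (2, 1), (3, 1)]

# Integer keys for both node types: user 1 and movie 1 end up in one entry.
shared = {}
for user, movie in pairs:
    shared.setdefault(user, set()).add(movie)
    shared.setdefault(movie, set()).add(user)

# Tagged keys keep the two sides of the bipartite graph apart.
tagged = {}
for user, movie in pairs:
    user_node, movie_node = ("user", user), ("movie", movie)
    tagged.setdefault(user_node, set()).add(movie_node)
    tagged.setdefault(movie_node, set()).add(user_node)

print(shared[1])             # movie 5 (rated by user 1) mixed with users 2 and 3
print(tagged[("user", 1)])   # only the movie rated by user 1
print(tagged[("movie", 1)])  # only the users who rated movie 1
```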

#### **Step 5: Understanding the Graph**
- **Nodes** in the graph represent **users and movies**.
- **Edges** exist between a user and a movie **if the user has rated the movie**.
- This structure allows us to find **users with similar movie tastes** and **movies frequently watched together**.

#### **Exploring the Graph**
- **Find a user’s rated movies:**  
  ```python
  user_id = 1
  print(graph[user_id])  # Movies rated by user 1
  ```

- **Find users who rated a movie:**  
  ```python
  movie_id = 50
  print(graph[movie_id])  # Users who rated movie 50
  ```

**Merge and Preprocess Ratings:** The ratings DataFrame is merged with the movies DataFrame to add movie titles using movie_id. Then it averages duplicate user-movie ratings and normalizes each user's ratings by subtracting their mean rating (centering).

In [174]:
ratings = ratings.merge(movies[['movie_id', 'title']], on='movie_id')


In [175]:
ratings = ratings.groupby(['user_id', 'movie_id', 'title'])['rating'].mean().reset_index()

In [176]:
# Normalize Ratings
ratings['rating'] = ratings.groupby('user_id')['rating'].transform(lambda x: x - x.mean())

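**Build Bipartite Graph:** A bipartite graph, stored in the dictionary `graph`, is constructed where users are connected to the movies they've rated, and movies are connected back to the users who rated them. Movies are keyed by `(movie_id, title)` tuples, which also keeps them from colliding with the integer user IDs used as user keys.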

In [177]:
graph = {}
for _, row in ratings.iterrows():
    user, movie_id, movie_title = row['user_id'], row['movie_id'], row['title']
    movie = (movie_id, movie_title) # Create a tuple to represent a movie using both movie_id and title

    if user not in graph:
        graph[user] = set()

    if movie not in graph:
        graph[movie] = set()

    graph[user].add((movie_id, movie_title))
    graph[movie].add(user)

**Inspect User's Rated Movies:** The graph is queried to print a sorted list of all movies rated by user_id = 1.

In [178]:
user_id = 1
print(sorted(graph[user_id]))  # Movies rated by user 1


[(1, 'Toy Story (1995)'), (2, 'GoldenEye (1995)'), (3, 'Four Rooms (1995)'), (4, 'Get Shorty (1995)'), (5, 'Copycat (1995)'), (6, 'Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)'), (7, 'Twelve Monkeys (1995)'), (8, 'Babe (1995)'), (9, 'Dead Man Walking (1995)'), (10, 'Richard III (1995)'), (11, 'Seven (Se7en) (1995)'), (12, 'Usual Suspects, The (1995)'), (13, 'Mighty Aphrodite (1995)'), (14, 'Postino, Il (1994)'), (15, "Mr. Holland's Opus (1995)"), (16, 'French Twist (Gazon maudit) (1995)'), (17, 'From Dusk Till Dawn (1996)'), (18, 'White Balloon, The (1995)'), (19, "Antonia's Line (1995)"), (20, 'Angels and Insects (1995)'), (21, 'Muppet Treasure Island (1996)'), (22, 'Braveheart (1995)'), (23, 'Taxi Driver (1976)'), (24, 'Rumble in the Bronx (1995)'), (25, 'Birdcage, The (1996)'), (26, 'Brothers McMullen, The (1995)'), (27, 'Bad Boys (1995)'), (28, 'Apollo 13 (1995)'), (29, 'Batman Forever (1995)'), (30, 'Belle de jour (1967)'), (31, 'Crimson Tide (1995)'), (32, 'Crumb (1994)')

**Find Users for a Specific Movie:** The code identifies the tuple for movie_id = 50, finds all users who rated that movie via the graph, prints their IDs, and counts them.

In [179]:
movie_to_find = (50, movies[movies['movie_id'] == 50]['title'].iloc[0])
users_who_rated_this_movie = graph.get(movie_to_find, set())
print(users_who_rated_this_movie)
print(len(users_who_rated_this_movie)) # Number of users who rated this movie_id

{1, 2, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 18, 20, 21, 22, 23, 25, 26, 27, 28, 30, 32, 37, 41, 42, 43, 44, 45, 46, 48, 49, 51, 53, 54, 55, 56, 57, 58, 59, 60, 62, 63, 64, 65, 66, 68, 69, 70, 71, 72, 77, 79, 80, 82, 83, 85, 87, 89, 91, 92, 94, 95, 96, 97, 99, 101, 102, 103, 104, 108, 109, 113, 115, 116, 117, 119, 120, 121, 123, 124, 125, 127, 128, 130, 132, 137, 141, 144, 145, 148, 150, 151, 153, 154, 157, 158, 160, 161, 162, 169, 174, 175, 176, 177, 178, 182, 183, 184, 185, 188, 189, 192, 194, 197, 198, 200, 201, 203, 209, 210, 213, 214, 215, 216, 217, 221, 222, 227, 230, 231, 232, 233, 234, 235, 236, 239, 244, 245, 246, 247, 248, 249, 250, 251, 253, 254, 256, 257, 262, 263, 265, 267, 268, 269, 270, 271, 272, 274, 275, 276, 277, 279, 280, 283, 286, 287, 288, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 301, 303, 305, 307, 308, 310, 311, 312, 313, 316, 318, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 332, 334, 336, 337, 339, 340, 343, 344, 345, 346, 347, 350, 352, 354,

### **Implement Weighted Random Walks**

#### **Random Walk-Based Movie Recommendation System (Weighted Pixie)**

#### **Objective**
In this task, you will implement a **random-walk-based recommendation algorithm** using the **Weighted Pixie** method. This technique uses a **user-movie bipartite graph** to recommend movies by simulating a random walk from a given user or movie.

#### **Step 1: Import Required Libraries**
Make sure you have the necessary libraries:

```python
import random  # For random walks
import pandas as pd  # For handling data
```

#### **Step 2: Implement the Random Walk Algorithm**
Your task is to **simulate a random walk** from a given starting point in the **bipartite user-movie graph**.

##### **Hints for Implementation**
- Start from **either a user or a movie**.
- At each step, **randomly move** to a connected node.
- Keep track of **how many times each movie is visited**.
- After completing the walk, **rank movies by visit count**.
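
The hints above can be sketched on a toy graph as follows. This is a minimal, unweighted version under stated assumptions: the node labels `("user", id)` and `("movie", title)`, the graph contents, and the function name `random_walk_counts` are all hypothetical, and a seeded `random.Random` is used only so that repeated runs give the same walk.

```python
import random
from collections import Counter

# Toy bipartite graph; node labels are hypothetical.
toy_graph = {
    ("user", 1): {("movie", "A"), ("movie", "B")},
    ("user", 2): {("movie", "B"), ("movie", "C")},
    ("movie", "A"): {("user", 1)},
    ("movie", "B"): {("user", 1), ("user", 2)},
    ("movie", "C"): {("user", 2)},
}

def random_walk_counts(graph, start, walk_length, rng):
    """Walk `walk_length` steps from `start`, counting visits to movie nodes."""
    visits = Counter()
    current = start
    for _ in range(walk_length):
        neighbors = sorted(graph.get(current, ()))  # sorted so a seed reproduces the walk
        if not neighbors:
            break  # dead end: stop the walk
        current = rng.choice(neighbors)  # uniform choice among connected nodes
        if current[0] == "movie":
            visits[current[1]] += 1
    return visits

rng = random.Random(0)
counts = random_walk_counts(toy_graph, ("user", 1), walk_length=20, rng=rng)
print(counts.most_common())  # movies ranked by visit count
```

Starting from a user, every odd step lands on a movie, so a 20-step walk records 10 movie visits in total.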

#### **Step 3: Implement User-Based Recommendation**
**Hints:**
- Check if the `user_id` exists in the `graph`.
- Start a loop that runs for `walk_length` steps.
- Randomly pick a **connected node** (user or movie).
- Track how many times each **movie** is visited.
- Sort movies by visit frequency and return the **top N**.

#### **Step 4: Implement Movie-Based Recommendation**
**Hints:**
- Find the `movie_id` corresponding to the given `movie_name`.
- Ensure the movie exists in the `graph`.
- Start a random walk from that movie.
- Follow the same **tracking and ranking** process as the user-based version.

**Note:**  
**Your task:** Implement a function `weighted_pixie_recommend(user_id, walk_length=15, num=5)` or `weighted_pixie_recommend(movie_name, walk_length=15, num=5)`.  
**Implement either Step 3 or Step 4.**

#### **Step 5: Running Your Recommendation System**
Once your function is implemented, test it by calling:

##### **Example: User-Based Recommendation**
```python
weighted_pixie_recommend(1, walk_length=15, num=5)
```
| Ranking | Movie Name                  |
|---------|-----------------------------|
| 1       | My Own Private Idaho (1991) |
| 2       | Aladdin (1992)              |
| 3       | 12 Angry Men (1957)         |
| 4       | Happy Gilmore (1996)        |
| 5       | Copycat (1995)              |


##### **Example: Movie-Based Recommendation**
```python
weighted_pixie_recommend("Jurassic Park (1993)", walk_length=10, num=5)
```
| Ranking | Movie Name                             |
|---------|----------------------------------------|
| 1       | Rear Window (1954)                     |
| 2       | Great Dictator, The (1940)             |
| 3       | Field of Dreams (1989)                 |
| 4       | Casablanca (1942)                      |
| 5       | Nightmare Before Christmas, The (1993) |


#### **Step 6: Understanding the Results**
Your function should return a **DataFrame** structured as follows:

| Ranking | Movie Name |
|---------|-----------|
| 1       | Movie A   |
| 2       | Movie B   |
| 3       | Movie C   |
| 4       | Movie D   |
| 5       | Movie E   |

Each movie is ranked based on **how frequently it was visited** during the walk.

#### **Experiment with Different Parameters**
- Try different **`walk_length`** values and observe how it changes recommendations.
- Adjust the number of recommended movies (`num`).

**Input & Movie Lookup:** Given a movie_name, the function retrieves its corresponding movie_id from the movies DataFrame. If the movie isn’t found in the graph, it returns an error message.

**Initialization for Random Walk:** It creates a movie node tuple (movie_id, movie_name) as the starting point for the walk and sets up a dictionary movie_visits to track how often other movies are visited during the walk.

**Random Walk Traversal:** The function performs a random walk of walk_length steps through the bipartite user-movie graph. At each step, it picks one neighbor of the current node uniformly at random with `random.choice` (rating values are not used as weights) and increments the visit count whenever the step lands on a movie node other than the starting movie.

**Handling of Neighbors and Nodes:** Because the graph is bipartite, the walk alternates between the two node types: a step from a movie lands on a user who rated it, and a step from a user lands on a movie that user rated. Movies are identified by checking if the node is a tuple, and only those (excluding the starting node) are recorded in the movie_visits dictionary.

**Ranking and Selection:** After the walk, it sorts the movies based on visit frequency (i.e., how often they were landed on during the walk) in descending order and selects the top num recommended movies.

**Output Formatting:** The top recommended movies are returned as a nicely formatted pandas DataFrame showing rankings and movie names, making the recommendations easy to read and interpret.
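
The traversal described here gives every neighbor the same chance of being picked. One possible way to make the step weighted, offered as an assumption since the assignment does not define the weights, is to store a weight per edge and sample with `random.choices`. The graph, the node labels, and the use of raw 1 to 5 ratings as weights below are hypothetical; mean-centered ratings can be negative and would need shifting before being used this way, because `random.choices` expects non-negative weights.

```python
import random
from collections import Counter

# Hypothetical weighted adjacency list: neighbor -> edge weight (here, a raw 1-5 rating).
weighted_graph = {
    "u1": {("m", "A"): 5.0, ("m", "B"): 1.0},
    "u2": {("m", "A"): 4.0, ("m", "C"): 2.0},
    ("m", "A"): {"u1": 5.0, "u2": 4.0},
    ("m", "B"): {"u1": 1.0},
    ("m", "C"): {"u2": 2.0},
}

def weighted_step(graph, node, rng):
    """Pick the next node with probability proportional to the edge weight."""
    neighbors = sorted(graph[node].items(), key=lambda kv: str(kv[0]))
    nodes = [neighbor for neighbor, _ in neighbors]
    weights = [weight for _, weight in neighbors]
    return rng.choices(nodes, weights=weights, k=1)[0]

rng = random.Random(42)  # seeded for reproducibility
picks = Counter(weighted_step(weighted_graph, "u1", rng) for _ in range(1000))
print(picks)  # movie A (weight 5) should be drawn far more often than movie B (weight 1)
```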

In [180]:
import pandas as pd
import random

def weighted_pixie_recommend(movie_name, walk_length=15, num=5):
    # Look up the movie_id; an unknown title yields an empty selection
    matches = movies.loc[movies['title'] == movie_name, 'movie_id']
    if matches.empty:
        return "Movie not found in the graph."

    # Movie nodes are keyed by (movie_id, title) tuples, so test the tuple, not the bare id
    movie_node = (matches.iloc[0], movie_name)
    if movie_node not in graph:
        return "Movie not found in the graph."

    movie_visits = {}

    current_node = movie_node
    for _ in range(walk_length):
        neighbors = graph.get(current_node)
        if not neighbors:
            break  # Stop the walk if no neighbors
        next_node = random.choice(list(neighbors))  # uniform choice among neighbors
        # Count movie nodes only, skipping the starting movie itself
        if isinstance(next_node, tuple) and next_node != movie_node:
            movie_title = next_node[1]
            movie_visits[movie_title] = movie_visits.get(movie_title, 0) + 1
        current_node = next_node

    # Rank movies by how often the walk landed on them
    sorted_movies = sorted(movie_visits.items(), key=lambda x: x[1], reverse=True)

    top_movies = sorted_movies[:num]

    result_df = pd.DataFrame({
        'Ranking': range(1, len(top_movies)+1),
        'Movie Name': [movie[0] for movie in top_movies]
    })
    result_df.set_index('Ranking', inplace=True)

    return result_df

# Example Usage
weighted_pixie_recommend("Jurassic Park (1993)", 10, 5)


| Ranking | Movie Name                    |
|---------|-------------------------------|
| 1       | Miracle on 34th Street (1994) |
| 2       | Bullets Over Broadway (1994)  |
| 3       | Clerks (1994)                 |
| 4       | Aliens (1986)                 |
| 5       | Back to the Future (1985)     |


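**Walk length controls how far the algorithm explores** the user-movie graph starting from the given movie. Because the graph is bipartite, the walk alternates between users and movies, so a walk of `walk_length` steps lands on at most `walk_length // 2` movies.

**Shorter walks (e.g., 5 steps) stay near the starting movie**, giving safer and more similar recommendations, but they visit too few movies to rank them reliably.

**Longer walks (e.g., 25+ steps) explore further, increasing diversity but potentially reducing relevance.**

**A walk length of around 10 to 20 steps is a reasonable starting point, although it was not tuned here**; with a single walk, most movies are visited only once, so the recommendations change from run to run.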
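
The sketch below checks the relationship between walk length and movie visits on a toy graph shaped like the notebook's (integer users, `(movie_id, title)` tuples for movies), and then sums the counts of many short walks. Aggregating repeated walks is an assumption added for illustration, not something the notebook's function does; the graph contents and the helper name `walk_movie_visits` are hypothetical.

```python
import random
from collections import Counter

# Toy graph: integer user nodes, (movie_id, title) tuples for movie nodes.
toy_graph = {
    1: {(10, "A"), (11, "B")},
    2: {(10, "A"), (12, "C")},
    3: {(11, "B"), (12, "C")},
    (10, "A"): {1, 2},
    (11, "B"): {1, 3},
    (12, "C"): {2, 3},
}

def walk_movie_visits(graph, start, walk_length, rng):
    """One uniform random walk; returns (visit counts excluding `start`, movie landings)."""
    visits = Counter()
    movie_landings = 0
    current = start
    for _ in range(walk_length):
        current = rng.choice(sorted(graph[current]))
        if isinstance(current, tuple):  # movie node
            movie_landings += 1
            if current != start:
                visits[current[1]] += 1
    return visits, movie_landings

rng = random.Random(7)  # seeded for reproducibility
start = (10, "A")

for walk_length in (10, 15):
    _visits, landings = walk_movie_visits(toy_graph, start, walk_length, rng)
    print(walk_length, landings)  # landings equals walk_length // 2 when starting on a movie

# Summing many short walks smooths out the noise of any single walk.
total = Counter()
for _walk in range(200):
    visits, _landings = walk_movie_visits(toy_graph, start, 10, rng)
    total += visits
print(total.most_common())
```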

In [181]:
# Example Usage
weighted_pixie_recommend("Jurassic Park (1993)", 15, 5)

| Ranking | Movie Name                |
|---------|---------------------------|
| 1       | Fugitive, The (1993)      |
| 2       | Dave (1993)               |
| 3       | African Queen, The (1951) |
| 4       | Sling Blade (1996)        |
| 5       | Benny & Joon (1993)       |


In [182]:
# Example Usage
weighted_pixie_recommend("Shine (1996)", 15, 7)

| Ranking | Movie Name                                    |
|---------|-----------------------------------------------|
| 1       | High Noon (1952)                              |
| 2       | Homeward Bound: The Incredible Journey (1993) |
| 3       | Babe (1995)                                   |
| 4       | Fugitive, The (1993)                          |
| 5       | Junior (1994)                                 |
| 6       | Independence Day (ID4) (1996)                 |
| 7       | Craft, The (1996)                             |


---

## **Submission Requirements:**

To successfully complete this assignment, ensure that you submit the following:


### **1. Jupyter Notebook Submission**
- Submit a **fully completed Jupyter Notebook** that includes:
  - **All implemented recommendation functions** (user-based, item-based, and random walk-based recommendations).
  - **Code explanations** in markdown cells to describe each step.
  - **Results and insights** from running your recommendation models.


### **2. Explanation of Pixie-Inspired Algorithms (3-5 Paragraphs)**
- Write a **detailed explanation** of **Pixie-inspired random walk algorithms** used for recommendations.
- Your explanation should cover:
  - What **Pixie-inspired recommendation systems** are.
  - How **random walks** help in identifying relevant recommendations.
  - Any real-world applications of such algorithms in industry.


### **3. Report for the Submitted Notebook**
Your report should be structured as follows:

#### **Title: Movie Recommendation System Report**

#### **1. Introduction**
- Briefly introduce **movie recommendation systems** and why they are important.
- Explain the **different approaches used** (user-based, item-based, random-walk).

#### **2. Dataset Description**
- Describe the **MovieLens 100K dataset**:
  - Number of users, movies, and ratings.
  - What features were used.
  - Any preprocessing performed.

#### **3. Methodology**
- Explain the three recommendation techniques implemented:
  - **User-based collaborative filtering** (how user similarity was calculated).
  - **Item-based collaborative filtering** (how item similarity was determined).
  - **Random-walk-based Pixie algorithm** (why graph-based approaches are effective).
  
#### **4. Implementation Details**
- Discuss the steps taken to build the functions.
- Describe how the **adjacency list graph** was created.
- Explain how **random walks** were performed and how visited movies were ranked.

#### **5. Results and Evaluation**
- Present **example outputs** from each recommendation approach.
- Compare the different methods in terms of accuracy and usefulness.
- Discuss any **limitations** in the implementation.

#### **6. Conclusion**
- Summarize the key takeaways from the project.
- Discuss potential improvements (e.g., **hybrid models, additional features**).
- Suggest real-world applications of the methods used.

### **Submission Instructions**

- Submit a `.zip` file containing the Jupyter Notebook, all provided data files, and the saved CSV files (i.e., `users.csv`, `movies.csv`, and `ratings.csv`). Also include the Report and the Pixie Algorithm explanation document.
- [`Bonus 10 Points`] **Upload your Jupyter Notebook, Explanation Document, and Report** to your GitHub repository.
- Ensure the repository is public and contains:
  - `users.csv`, `movies.csv` and `ratings.csv` [These are the Dataframes which were created in part 1. Save and export them as a `.csv` file]
  - `Movie_Recommendation.ipynb`
  - `Pixie_Algorithm_Explanation.pdf` or `.md`
  - `Recommendation_Report.pdf` or `.md`
- **Submit the GitHub repository link in the cell below.**


#### **Example Submission Format**
```text
GitHub Repository: https://github.com/username/Movie-Recommendation
```

In [None]:
# Submit the GitHub link here:
# https://github.com/sbal2911/Movie-Recommendation/tree/main

### **Grading Rubric: ITCS 6162 - Data Mining Assignment**


| **Category**                              | **Criteria**                                                     | **Points** |
|-------------------------------------------|----------------------------------------------------------------|------------|
| **Part 1: Exploring and Cleaning Data (15 pts)**  | Properly loads `u.user`, `u.data`, and `u.item` datasets into DataFrames | 5 |
|                                           | Handles missing values, duplicates, and inconsistencies appropriately | 5 |
|                                           | Saves the cleaned datasets into CSV files: `users.csv`, `movies.csv`, `ratings.csv` | 5 |
| **Part 2: Collaborative Filtering-Based Recommendation (30 pts)** | Implements user-based collaborative filtering correctly | 10 |
|                                           | Implements item-based collaborative filtering correctly | 10 |
|                                           | Computes similarity measures accurately and provides valid recommendations | 10 |
| **Part 3: Graph-Based Recommender (Pixie-Inspired Algorithm) (35 pts)** | Constructs adjacency lists properly from user-movie interactions | 10 |
|                                           | Implements weighted random walk-based recommendation correctly | 15 |
|                                           | Explains and justifies the algorithm design choices (Pixie-inspired) | 10 |
| **Code Quality & Documentation (10 pts)** | Code is well-structured, efficient, and follows best practices | 5 |
|                                           | Markdown explanations and comments are clear and enhance understanding | 5 |
| **Results & Interpretation (5 pts)**      | Provides meaningful insights from the recommendation system's output | 5 |
| **Submission & Report (5 pts)**          | Submits all required files in the correct format (ZIP file with Jupyter notebook, processed CSV files, and project report) | 5 |
| **Total**                                 |                              | 100 |

#### **Bonus (10 pts)**
| **Category**                              | **Criteria**                                                     | **Points** |
|-------------------------------------------|----------------------------------------------------------------|------------|
| **GitHub Submission**                     | Provides a well-documented GitHub repository with CSV files, a structured README, and a properly formatted Jupyter Notebook | 10 |