# ITCS 6162: Data Mining - Programming Assignment

**In this assignment, you will explore data analysis, recommendation algorithms, and graph-based techniques using the MovieLens dataset. Your tasks will range from basic data exploration to advanced recommendation models, including:**
- Data manipulation with pandas
- User-item collaborative filtering
- Similarity-based recommendation models
- A Pixie-inspired Graph-based recommendation using adjacency lists with weighted random walks (without using NetworkX)


#### **Dataset Files:**
- **`u.data`**: User-movie ratings (`user_id  movie_id  rating  timestamp`)
- **`u.item`**: Movie metadata (`movie_id | title | release date | video release date | IMDb URL | 19 genre flags`)
- **`u.user`**: User demographics (`user_id | age | gender | occupation | zip_code`)

## **Part 1: Exploring and Cleaning Data**

### Inspecting the Dataset Format

The dataset is not in a traditional CSV format. To examine its structure, use the following shell command to display the first 10 lines of the file:

```sh
!head <file_name>
```

**In the cells given below, write the code to read the files.**

In [1]:
# u.data
# Inspect the first 10 lines of u.data
!head -n 10 u.data


196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596
298	474	4	884182806
115	265	2	881171488
253	465	5	891628467
305	451	3	886324817
6	86	3	883603013


In [2]:
# u.item
# Inspect the first 10 lines of u.item
!head -n 10 u.item

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0
6|Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)|01-Jan-1995||http://us.imdb.com/Title?Yao+a+yao+yao+dao+waipo+qiao+(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0
7|Twelve Monkeys (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Twelve%20Monkeys%20(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|1|0|0|0
8|Babe (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Babe%20(1995)|0|0|0|0|1

In [3]:
# u.user
# Inspect the first 10 lines of u.user
!head -n 10 u.user

1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
8|36|M|administrator|05201
9|29|M|student|01002
10|53|M|lawyer|90703


#### Loading the Dataset with Pandas

Use **pandas** to load the dataset into a DataFrame for analysis. Follow these steps:  

1. Import the necessary library: `pandas`.  
2. Use `pd.read_csv()` (or an appropriate function) to read the dataset file.  
3. Ensure the dataset is loaded with the correct delimiter (e.g., `','`, `'\t'`,`'|'` , or another separator if needed).  
4. Select and display the first few rows using `.head()`.

Ensure that:  

- The `ratings` dataset is read from `"u.data"` using tab (`'\t'`) as a separator and column names (`"user_id"`, `"movie_id"`, `"rating"` and `"timestamp"`).  
- The `movies` dataset is read from `"u.item"` using `'|'` as a separator, use columns (`0`, `1`, `2`), encoding (`"latin-1"`) and name the columns (`movie_id`, `title`, and `release_date`).  
- The `users` dataset is read from `"u.user"` using `'|'` as a separator, use columns (`0`, `1`, `2`, `3`) and name the columns (`user_id`, `age`, `gender`, and `occupation`).
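The column selection described above can be sketched with `usecols`. This is a minimal sketch on in-memory samples rather than the real files: the `sample_item` and `sample_user` strings, the `io.StringIO` wrapper, and the `*_small` names are illustrative assumptions. With the real data you would pass `"u.item"` (plus `encoding="latin-1"`) and `"u.user"` instead.

```python
import io
import pandas as pd

# Shortened rows modeled on the u.item preview above (URL and genre flags abbreviated).
sample_item = (
    "1|Toy Story (1995)|01-Jan-1995||http://example.invalid/toy-story|0|0|0\n"
    "2|GoldenEye (1995)|01-Jan-1995||http://example.invalid/goldeneye|0|1|1\n"
)

# Keep only the first three fields and name them as the instructions ask.
movies_small = pd.read_csv(
    io.StringIO(sample_item),
    sep="|",
    header=None,
    usecols=[0, 1, 2],
    names=["movie_id", "title", "release_date"],
)

# Two rows copied from the u.user preview above.
sample_user = "8|36|M|administrator|05201\n9|29|M|student|01002\n"

# Keep the first four fields; zip_code (field 4) is dropped.
users_small = pd.read_csv(
    io.StringIO(sample_user),
    sep="|",
    header=None,
    usecols=[0, 1, 2, 3],
    names=["user_id", "age", "gender", "occupation"],
)
```

Passing `names` with the same length as `usecols` lets pandas label only the selected columns, so the genre flags and URL never enter the DataFrame.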

In [4]:
# ratings
import pandas as pd

# Load ratings data from u.data
ratings = pd.read_csv(
    "u.data",
    sep="\t",               # tab-separated
    header=None,            # no header row in file
    names=["user_id", "movie_id", "rating", "timestamp"]
)

ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [5]:
# movies
# Define column names for u.item (MovieLens 100k format)
movie_cols = [
    "movie_id", "title", "release_date", "video_release_date", "imdb_url"
] + [f"genre_{i}" for i in range(19)]   # 19 genre flags

movies = pd.read_csv(
    "u.item",
    sep="|",
    header=None,
    encoding="latin-1",      # to avoid encoding issues
    names=movie_cols
)

movies.head()

Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url,genre_0,genre_1,genre_2,genre_3,genre_4,...,genre_9,genre_10,genre_11,genre_12,genre_13,genre_14,genre_15,genre_16,genre_17,genre_18
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [6]:
# users
# Load user data from u.user
users = pd.read_csv(
    "u.user",
    sep="|",
    header=None,
    names=["user_id", "age", "gender", "occupation", "zip_code"],
    dtype={"zip_code": str}  # read zip codes as text to keep leading zeros (e.g. 05201)
)

users.head()

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


**Note:** As a **bonus** task, save the `ratings`, `movies`, and `users` DataFrames as `.csv` files. <br>
**Hint:** Use the `to_csv()` function in pandas to save these DataFrames as CSV files.

In [7]:
# ratings
# Save ratings DataFrame to CSV (bonus)
ratings.to_csv("ratings.csv", index=False)

In [8]:
# movies
# Save movies DataFrame to CSV (bonus)
movies.to_csv("movies.csv", index=False)

In [9]:
# users
# Save users DataFrame to CSV (bonus)
users.to_csv("users.csv", index=False)

**Display the first 10 rows of each file.**

In [10]:
# ratings
# Display first 10 rows of ratings
ratings.head(10)

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
5,298,474,4,884182806
6,115,265,2,881171488
7,253,465,5,891628467
8,305,451,3,886324817
9,6,86,3,883603013


In [11]:
# movies
# Display first 10 rows of movies
movies.head(10)

Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url,genre_0,genre_1,genre_2,genre_3,genre_4,...,genre_9,genre_10,genre_11,genre_12,genre_13,genre_14,genre_15,genre_16,genre_17,genre_18
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
5,6,Shanghai Triad (Yao a yao yao dao waipo qiao) ...,01-Jan-1995,,http://us.imdb.com/Title?Yao+a+yao+yao+dao+wai...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,7,Twelve Monkeys (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Twelve%20Monk...,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
7,8,Babe (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Babe%20(1995),0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
8,9,Dead Man Walking (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Dead%20Man%20...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,10,Richard III (1995),22-Jan-1996,,http://us.imdb.com/M/title-exact?Richard%20III...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [12]:
# users
# Display first 10 rows of users
users.head(10)

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
5,6,42,M,executive,98101
6,7,57,M,administrator,91344
7,8,36,M,administrator,05201
8,9,29,M,student,01002
9,10,53,M,lawyer,90703


### Data Cleaning and Exploration with Pandas  

After loading the dataset, it’s important to clean and explore the data to ensure consistency and accuracy. Below are key **pandas** functions for cleaning and understanding the dataset.

#### 1. Handle Missing Values  
- `df.dropna()` – Removes rows with missing values.  
- `df.fillna(value)` – Fills missing values with a specified value.  

#### 2. Remove Duplicates  
- `df.drop_duplicates()` – Drops duplicate rows from the dataset.  

#### 3. Handle Incorrect Data Types  
- `df.astype(dtype)` – Converts columns to the appropriate data type.  

#### 4. Filter Outliers (if applicable)  
- `df[df['column_name'] > threshold]` – Filters rows based on a condition.  

#### 5. Rename Columns (if needed)  
- `df.rename(columns={'old_name': 'new_name'})` – Renames columns for clarity.  

#### 6. Reset Index  
- `df.reset_index(drop=True, inplace=True)` – Resets the index after cleaning.  

### Data Exploration Functions  

To better understand the dataset, use these **pandas** functions:  

- `df.shape` – Returns the number of rows and columns in the dataset.  
- `df.nunique()` – Displays the number of unique values in each column.  
- `df['column_name'].unique()` – Returns unique values in a specific column.  

**Example Usage in Pandas:**  
```python
import pandas as pd

# Load dataset
df = pd.read_csv("your_file.csv")

# Drop missing values
df_cleaned = df.dropna()

# Remove duplicate rows
df_cleaned = df_cleaned.drop_duplicates()

# Convert 'timestamp' column (UNIX seconds since epoch) to datetime format
df_cleaned['timestamp'] = pd.to_datetime(df_cleaned['timestamp'], unit='s')

# Display dataset shape
print("Dataset shape:", df_cleaned.shape)

# Display number of unique values in each column
print("Unique values per column:\n", df_cleaned.nunique())

# Display unique movie IDs
print("Unique movie IDs:", df_cleaned['movie_id'].unique()[:10])  # Show first 10 unique movie IDs
```

**Note:** The functions above are some of the most widely used **pandas** functions for data cleaning and exploration. Not all of them will be needed in the exercises below; use them as the dataset and the specific task require.

**Convert Timestamps into Readable dates.**

In [13]:
# ratings
# Convert UNIX timestamp (seconds since epoch) to readable datetime
ratings["timestamp"] = pd.to_datetime(ratings["timestamp"], unit="s")

# Quick check
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,1997-12-04 15:55:49
1,186,302,3,1998-04-04 19:22:22
2,22,377,1,1997-11-07 07:18:36
3,244,51,2,1997-11-27 05:02:03
4,166,346,1,1998-02-02 05:33:16


**Check for Missing Values**

In [14]:
# ratings
# Check for missing values in ratings
ratings.isna().sum()

Unnamed: 0,0
user_id,0
movie_id,0
rating,0
timestamp,0


In [15]:
# movies
# Check for missing values in movies
movies.isna().sum()

Unnamed: 0,0
movie_id,0
title,0
release_date,1
video_release_date,1682
imdb_url,3
genre_0,0
genre_1,0
genre_2,0
genre_3,0
genre_4,0


In [16]:
# users
users.isna().sum()

Unnamed: 0,0
user_id,0
age,0
gender,0
occupation,0
zip_code,0


**Print the total number of users, movies, and ratings.**

In [17]:
print(f"Total Users: {users['user_id'].nunique()}")
print(f"Total Movies: {movies['movie_id'].nunique()}")
print(f"Total Ratings: {ratings.shape[0]}")

Total Users: 943
Total Movies: 1682
Total Ratings: 100000


### Data Quality Check

- The `ratings` DataFrame has 100,000 rows and no missing values in `user_id`, `movie_id`, or `rating`.
- The `movies` DataFrame has 1,682 unique `movie_id` entries. `video_release_date` is missing in all 1,682 rows, `imdb_url` in 3 rows, and `release_date` in 1 row; we ignore these gaps for this assignment.
- The `users` DataFrame has 943 unique users and no missing values in the core fields (`age`, `gender`, `occupation`, `zip_code`).

Since the key fields have no missing values, we proceed without additional cleaning beyond converting the timestamps. Duplicate `(user_id, movie_id)` pairs were not checked explicitly here; `pivot()` in Part 2 raises an error if any exist. Note that the CSV files above were written before the timestamp conversion, so they still hold the raw UNIX timestamps.
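An explicit duplicate check can complement the missing-value counts. The following is a minimal sketch on a toy frame, not on the real `ratings` data; `toy_ratings`, `n_dupes`, and `n_missing` are hypothetical names, and the values are made up for illustration.

```python
import pandas as pd

# Toy ratings frame (hypothetical values) with one repeated (user_id, movie_id) pair.
toy_ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 2],
    "movie_id": [10, 20, 10, 10],
    "rating":   [4, 3, 5, 4],
})

# Count rows whose (user_id, movie_id) pair already appeared earlier in the frame.
n_dupes = toy_ratings.duplicated(subset=["user_id", "movie_id"]).sum()

# Missing values per column, mirroring the isna().sum() cells above.
n_missing = toy_ratings.isna().sum()

print("Duplicate user-movie pairs:", n_dupes)
print(n_missing)
```

On the real data, the same `duplicated(subset=...)` call on `ratings` would show whether any user rated a movie more than once before the matrix is built.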


## **Part 2: Collaborative Filtering-Based Recommendation**

### **Create a User-Item Matrix**

#### Instructions for Creating a User-Movie Rating Matrix

In this exercise, you will create a user-movie rating matrix using **pandas**. This matrix will represent the ratings that users have given to different movies.

1. **Dataset Overview**:  
   The dataset has already been loaded. It includes the following key columns:
   - `user_id`: The ID of the user.
   - `movie_id`: The ID of the movie.
   - `rating`: The rating the user gave to the movie.

2. **Create the User-Movie Rating Matrix**:  
   Use the **`pivot()`** function in **pandas** to reshape the data. Your goal is to create a matrix where:
   - Each **row** represents a **user**.
   - Each **column** represents a **movie**.
   - Each **cell** contains the **rating** that the user has given to the movie.

   Specify the following parameters for the `pivot()` function:
   - **`index`**: The `user_id` column (this will define the rows).
   - **`columns`**: The `movie_id` column (this will define the columns).
   - **`values`**: The `rating` column (this will fill the matrix with ratings).

3. **Inspect the Matrix**:  
   After creating the matrix, examine the first few rows of the resulting matrix to ensure it has been constructed correctly.

4. **Handle Missing Values**:  
   It's likely that some users have not rated every movie, resulting in `NaN` values in the matrix. You will need to handle these missing values. Consider the following options:
   - **Fill with 0**: If you wish to represent missing ratings as zeros (indicating no rating).
   - **Fill with the average rating**: Alternatively, replace missing values with the average rating for each movie.
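The two fill strategies can be compared on a small example. This is a sketch under assumed toy values; `toy_matrix`, `filled_zero`, and `filled_mean` are hypothetical names, and the movie IDs 101 and 102 are made up.

```python
import numpy as np
import pandas as pd

# Toy user-movie matrix: rows are users, columns are movie IDs, NaN means "not rated".
toy_matrix = pd.DataFrame(
    {101: [5.0, np.nan, 3.0], 102: [np.nan, 2.0, 4.0]},
    index=pd.Index([1, 2, 3], name="user_id"),
)

# Option 1: treat a missing rating as 0.
filled_zero = toy_matrix.fillna(0)

# Option 2: replace a missing rating with that movie's mean rating.
# toy_matrix.mean() is a Series of column means, and fillna matches it by column label.
filled_mean = toy_matrix.fillna(toy_matrix.mean())

print(filled_zero)
print(filled_mean)
```

Zero-filling is simple but makes an unrated movie look like a very low rating; mean-filling avoids that at the cost of inventing a neutral opinion for every user.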

**Create the user-movie rating matrix using the `pivot()` function.**

In [18]:
user_movie_matrix = ratings.pivot(index="user_id", columns="movie_id", values="rating")

**Display the matrix to verify the transformation.**

In [19]:
user_movie_matrix.head()

user_id \ movie_id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,,,,,,,,,,
2,4.0,,,,,,,,,2.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,3.0,,,,,,,,,...,,,,,,,,,,


### **User-Based Collaborative Filtering Recommender System**

#### **Objective**
In this task, you will implement a **user-based collaborative filtering** movie recommendation system using the **Movie dataset**. The goal is to recommend movies to a user based on the preferences of similar users.

##### **Step 1: Import Required Libraries**
Before starting, ensure you have the necessary libraries installed. Use the following imports:

```python
import pandas as pd  # For handling data
import numpy as np   # For numerical computations
from sklearn.metrics.pairwise import cosine_similarity  # For computing user similarity
```

##### **Step 2: Compute User-User Similarity**
- We will use **cosine similarity** to measure how similar each pair of users is based on their movie ratings.
- Since `cosine_similarity` does not handle missing values (NaN), replace them with `0` before computation.

##### **Instructions:**
1. Fill missing values with `0` using `.fillna(0)`.
2. Compute similarity using `cosine_similarity()`.
3. Convert the result into a **Pandas DataFrame**, with users as both row and column labels.

##### **Hint:**  
You can achieve this using the following approach:

```python
user_similarity = cosine_similarity(user_movie_matrix.fillna(0))
user_sim_df = pd.DataFrame(user_similarity, index=user_movie_matrix.index, columns=user_movie_matrix.index)
```
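To see what `cosine_similarity` computes for one pair of users, the formula can be written out by hand. This is a minimal NumPy sketch, not the scikit-learn implementation; the `cosine` helper and the three rating vectors are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical zero-filled rating rows for three users over four movies.
alice = np.array([5.0, 3.0, 0.0, 1.0])
bob = np.array([4.0, 0.0, 0.0, 1.0])
carol = np.array([0.0, 0.0, 5.0, 0.0])

sim_ab = cosine(alice, bob)    # overlapping tastes give a value close to 1
sim_ac = cosine(alice, carol)  # no movie in common gives exactly 0

print(round(sim_ab, 4), sim_ac)
```

Because missing ratings are filled with 0, two users who share no rated movies always get similarity 0, and movies rated by only one of the two users still increase that user's vector norm.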

##### **Step 3: Implement the Recommendation Function**
Now, implement the function `recommend_movies_for_user(user_id, num=5)` to recommend movies for a given user.

##### **Function Inputs:**
- `user_id`: The target user for whom we need recommendations.
- `num`: The number of movies to recommend (default is 5).

##### **Function Steps:**
1. Find **similar users**:
   - Retrieve the similarity scores for the given `user_id`.
   - Sort them in **descending** order (highest similarity first).
   - Exclude the user themselves.
   
2. Get the **movie ratings** from these similar users.

3. Compute the **average rating** for each movie based on these users' preferences.

4. Sort the movies in **descending order** based on the computed average ratings.

5. Retrieve the **top `num` recommended movies**.

6. Map **movie IDs** to their **titles** using the `movies` DataFrame.

7. Return the results as a **Pandas DataFrame** with rankings.

##### **Step 4: Return the Final Recommendation List**
Your function should return a **DataFrame** structured as follows:

| Ranking | Movie Name |
|---------|-----------|
| 1       | Movie A   |
| 2       | Movie B   |
| 3       | Movie C   |
| 4       | Movie D   |
| 5       | Movie E   |

##### **Hint:** Your final DataFrame should be created like this:
```python
result_df = pd.DataFrame({
    'Ranking': range(1, num+1),
    'Movie Name': movie_names     
})
result_df.set_index('Ranking', inplace=True)
```

#### **Example: User-Based Collaborative Filtering**
```python
recommend_movies_for_user(10, num = 5)
```
**Output:**
```
| Ranking | Movie Name                     |
|---------|--------------------------------|
| 1       | In the Company of Men (1997)   |
| 2       | Misérables, Les (1995)         |
| 3       | Thin Blue Line, The (1988)     |
| 4       | Braindead (1992)               |
| 5       | Boys, Les (1997)               |
```

In [20]:
# Code the function here
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Fill missing ratings with 0 for similarity computation
user_movie_matrix_filled = user_movie_matrix.fillna(0)

# Compute cosine similarity between users
user_similarity = cosine_similarity(user_movie_matrix_filled)

# Convert to a DataFrame for easier lookup
user_sim_df = pd.DataFrame(
    user_similarity,
    index=user_movie_matrix_filled.index,   # user_ids as rows
    columns=user_movie_matrix_filled.index  # user_ids as columns
)

# Quick check
user_sim_df.head()

user_id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
1,1.0,0.166931,0.04746,0.064358,0.378475,0.430239,0.440367,0.319072,0.078138,0.376544,...,0.369527,0.119482,0.274876,0.189705,0.197326,0.118095,0.314072,0.148617,0.179508,0.398175
2,0.166931,1.0,0.110591,0.178121,0.072979,0.245843,0.107328,0.103344,0.161048,0.159862,...,0.156986,0.307942,0.358789,0.424046,0.319889,0.228583,0.22679,0.161485,0.172268,0.105798
3,0.04746,0.110591,1.0,0.344151,0.021245,0.072415,0.066137,0.08306,0.06104,0.065151,...,0.031875,0.042753,0.163829,0.069038,0.124245,0.026271,0.16189,0.101243,0.133416,0.026556
4,0.064358,0.178121,0.344151,1.0,0.031804,0.068044,0.09123,0.18806,0.101284,0.060859,...,0.052107,0.036784,0.133115,0.193471,0.146058,0.030138,0.196858,0.152041,0.170086,0.058752
5,0.378475,0.072979,0.021245,0.031804,1.0,0.237286,0.3736,0.24893,0.056847,0.201427,...,0.338794,0.08058,0.094924,0.079779,0.148607,0.071459,0.239955,0.139595,0.152497,0.313941


In [25]:
def recommend_movies_for_user(user_id, num=5):
    """
    Recommend movies for a given user based on similar users' ratings.
    """
    # 1. Check user exists
    if user_id not in user_movie_matrix.index:
        print(f"User {user_id} not found.")
        return None

    # 2. Get similarity scores for this user
    sim_scores = user_sim_df.loc[user_id].drop(index=user_id)
    sim_scores = sim_scores.sort_values(ascending=False)

    # 3. Use top-N neighbors (you can tune this)
    top_neighbors = sim_scores.head(20)

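    # 4. Ratings from those neighbors (NaN = not rated)
    neighbor_ratings = user_movie_matrix.loc[top_neighbors.index]

    # 5. Similarity-weighted average per movie, over the neighbors who rated it.
    #    NaN would propagate through .dot(), so zero-fill the numerator and
    #    normalize by the summed similarity of the actual raters only.
    weighted_ratings = neighbor_ratings.fillna(0).T.dot(top_neighbors)
    sim_sum = neighbor_ratings.notna().astype(float).T.dot(top_neighbors)
    movie_scores = weighted_ratings / sim_sum  # NaN where no neighbor rated the movie

    # 6. Remove movies already rated by the user
    user_rated = user_movie_matrix.loc[user_id]
    movie_scores = movie_scores[user_rated.isna()]

    # 7. Get top-N movies (movies without any neighbor rating are dropped)
    top_movie_ids = movie_scores.dropna().sort_values(ascending=False).head(num).index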

    movie_title_map = movies.set_index("movie_id")["title"]
    movie_names = movie_title_map.loc[top_movie_ids].values

    result_df = pd.DataFrame({
        "Ranking": range(1, len(movie_names) + 1),
        "Movie Name": movie_names
    }).set_index("Ranking")

    return result_df


In [26]:
recommend_movies_for_user(10, num=5)


Ranking,Movie Name
1,GoldenEye (1995)
2,Four Rooms (1995)
3,Copycat (1995)
4,Shanghai Triad (Yao a yao yao dao waipo qiao) ...
5,Babe (1995)


### User-Based Collaborative Filtering – Explanation

We first create a user–movie rating matrix where each row is a user and each column is a movie. We then:

1. Fill missing ratings with 0 and compute a cosine similarity matrix between users (`user_sim_df`).
2. For a target user, we:
   - Find the top-k most similar users (neighbors).
   - Collect their ratings and compute a **weighted average score** for each movie, where the weights are the similarity scores.
3. We filter out movies the user has already rated and recommend the top-N remaining movies.

This approach assumes that **users with similar rating patterns will like similar movies**.
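The weighted-average step is sensitive to how unrated (`NaN`) entries are handled. The sketch below uses two hypothetical neighbors and two hypothetical movies to contrast a plain dot product with a `NaN`-aware average; all names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical neighbors (rows) and movies (columns); NaN means "not rated".
neighbor_ratings = pd.DataFrame(
    {"A": [5.0, 4.0], "B": [np.nan, 2.0]},
    index=["n1", "n2"],
)
sims = pd.Series([0.8, 0.2], index=["n1", "n2"])

# Plain dot product: the NaN for movie B propagates into its score.
naive = neighbor_ratings.T.dot(sims)

# NaN-aware version: zero-fill the numerator and normalize by the
# similarity of the neighbors who actually rated each movie.
numer = neighbor_ratings.fillna(0).T.dot(sims)
denom = neighbor_ratings.notna().astype(float).T.dot(sims)
scores = numer / denom

print(naive)
print(scores)
```

With the plain dot product, any movie that at least one neighbor skipped ends up without a usable score, so the final sort cannot rank it meaningfully.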

### Example Output (User 10)

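For `recommend_movies_for_user(10, num=5)`, the function is meant to return movies that user 10 has not rated but that similar users rated highly. The five titles shown above, however, are movie IDs 2, 3, 5, 6, and 8 in ascending order, which suggests the scores were tied or `NaN` rather than genuinely ranked. That pattern is what `DataFrame.dot` produces when unrated (`NaN`) entries reach the weighted sum, so the cell should be re-run with `NaN` ratings excluded from the average before the list is interpreted.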


### **Item-Based Collaborative Filtering Recommender System**

#### **Objective**
In this task, you will implement an **item-based collaborative filtering** recommendation system using the **Movie dataset**. The goal is to recommend movies similar to a given movie based on user rating patterns.

#### **Step 1: Import Required Libraries**
We already imported these libraries in the previous task; the imports are repeated here for completeness.

Before starting, ensure you have the necessary libraries installed. Use the following imports:

```python
import pandas as pd  # For handling data
import numpy as np   # For numerical computations
from sklearn.metrics.pairwise import cosine_similarity  # For computing item similarity
```

#### **Step 2: Compute Item-Item Similarity**
- We will use **cosine similarity** to measure how similar each pair of movies is based on their user ratings.
- Since `cosine_similarity` does not handle missing values (NaN), replace them with `0` before computation.
- Unlike user-based filtering, we need to **transpose** (`.T`) the `user_movie_matrix` because we want similarity between movies (columns) instead of users (rows).

##### **Instructions:**
1. Transpose the user-movie matrix using `.T` to make movies the rows.
2. Fill missing values with `0` using `.fillna(0)`.
3. Compute similarity using `cosine_similarity()`.
4. Convert the result into a **Pandas DataFrame**, with movies as both row and column labels.

##### **Hint:**  
You can achieve this using the following approach:

```python
item_similarity = cosine_similarity(user_movie_matrix.T.fillna(0))
item_sim_df = pd.DataFrame(item_similarity, index=user_movie_matrix.columns, columns=user_movie_matrix.columns)
```

#### **Step 3: Implement the Recommendation Function**
Now, implement the function `recommend_movies(movie_name, num=5)` to recommend movies similar to a given movie.

##### **Function Inputs:**
- `movie_name`: The target movie for which we need recommendations.
- `num`: The number of similar movies to recommend (default is 5).

##### **Function Steps:**
1. Find the **movie_id** corresponding to the given `movie_name` in the `movies` DataFrame.
2. If the movie is not found, return an appropriate message.
3. Extract the **similarity scores** for this movie from `item_sim_df`.
4. Sort the movies in **descending order** based on similarity (excluding the movie itself).
5. Retrieve the **top `num` similar movies**.
6. Map **movie IDs** to their **titles** using the `movies` DataFrame.
7. Return the results as a **Pandas DataFrame** with rankings.

#### **Step 4: Return the Final Recommendation List**
Your function should return a **DataFrame** structured as follows:

| Ranking | Movie Name |
|---------|-----------|
| 1       | Movie A   |
| 2       | Movie B   |
| 3       | Movie C   |
| 4       | Movie D   |
| 5       | Movie E   |

##### **Hint:** Your final DataFrame should be created like this:
```python
result_df = pd.DataFrame({
    'Ranking': range(1, num+1),
    'Movie Name': movie_names
})
result_df.set_index('Ranking', inplace=True)
```

#### **Example: Item-Based Collaborative Filtering**
```python
recommend_movies("Jurassic Park (1993)", num=5)
```
**Output:**
```
| Ranking | Movie Name                               |
|---------|------------------------------------------|
| 1       | Top Gun (1986)                           |
| 2       | Empire Strikes Back, The (1980)          |
| 3       | Raiders of the Lost Ark (1981)           |
| 4       | Indiana Jones and the Last Crusade (1989)|
| 5       | Speed (1994)                             |
```

In [21]:
# Code the function here
# Transpose so that rows = movies, columns = users
# Then fill missing ratings with 0 for similarity computation
item_matrix = user_movie_matrix.T.fillna(0)

# Compute cosine similarity between movies
item_similarity = cosine_similarity(item_matrix)

# Build a DataFrame: rows and columns are movie_ids
item_sim_df = pd.DataFrame(
    item_similarity,
    index=item_matrix.index,   # movie_ids
    columns=item_matrix.index  # movie_ids
)

# Quick look
item_sim_df.head()

movie_id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
1,1.0,0.402382,0.330245,0.454938,0.286714,0.116344,0.620979,0.481114,0.496288,0.273935,...,0.035387,0.0,0.0,0.0,0.035387,0.0,0.0,0.0,0.047183,0.047183
2,0.402382,1.0,0.273069,0.502571,0.318836,0.083563,0.383403,0.337002,0.255252,0.171082,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.078299,0.078299
3,0.330245,0.273069,1.0,0.324866,0.212957,0.106722,0.372921,0.200794,0.273669,0.158104,...,0.0,0.0,0.0,0.0,0.032292,0.0,0.0,0.0,0.0,0.096875
4,0.454938,0.502571,0.324866,1.0,0.334239,0.090308,0.489283,0.490236,0.419044,0.252561,...,0.0,0.0,0.094022,0.094022,0.037609,0.0,0.0,0.0,0.056413,0.075218
5,0.286714,0.318836,0.212957,0.334239,1.0,0.037299,0.334769,0.259161,0.272448,0.055453,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.094211


In [27]:
def recommend_movies(movie_name, num=5):
    """
    Recommend movies similar to the given movie_name using item-based CF.
    """
    # 1. Find movie_id from title
    match = movies[movies["title"] == movie_name]
    if match.empty:
        print(f"Movie '{movie_name}' not found.")
        return None
    movie_id = match.iloc[0]["movie_id"]

    # 2. Check movie in similarity matrix
    if movie_id not in item_sim_df.index:
        print(f"Movie ID {movie_id} not found in item_sim_df.")
        return None

    # 3. Get similarity scores to all movies
    sim_scores = item_sim_df.loc[movie_id].drop(index=movie_id)

    # 4. Take top-N
    top_similar = sim_scores.sort_values(ascending=False).head(num)
    similar_movie_ids = top_similar.index

    movie_title_map = movies.set_index("movie_id")["title"]
    movie_names = movie_title_map.loc[similar_movie_ids].values

    result_df = pd.DataFrame({
        "Ranking": range(1, len(movie_names) + 1),
        "Movie Name": movie_names
    }).set_index("Ranking")

    return result_df


In [28]:
recommend_movies("Jurassic Park (1993)", num=5)


Ranking,Movie Name
1,Top Gun (1986)
2,Speed (1994)
3,Raiders of the Lost Ark (1981)
4,"Empire Strikes Back, The (1980)"
5,Indiana Jones and the Last Crusade (1989)


### Item-Based Collaborative Filtering – Explanation

For item-based CF, we transpose the user–movie matrix so that each row is a **movie** and each column is a **user**. Then:

1. We fill missing ratings with 0 and compute cosine similarity between movie vectors (`item_sim_df`).
2. Given a movie title (e.g., "Jurassic Park (1993)"), we:
   - Look up its `movie_id`.
   - Fetch its similarity scores to all other movies.
   - Sort by similarity and return the top-N most similar movies (excluding itself).

This approach assumes that **movies rated similarly by the same users are similar to each other**.

### Example Output ("Jurassic Park (1993)")

For `recommend_movies("Jurassic Park (1993)", num=5)`, the model returns other action/adventure titles that share similar rating patterns. These tend to be movies watched and liked by the same users who liked "Jurassic Park," which makes the recommendations intuitive and easy to justify.


## **Part 3: Graph-Based Recommender (Pixie-Inspired Algorithm)**

### **Adjacency List**

#### **Objective**
In this task, you will preprocess the Movie dataset and construct a **graph representation** where:
- **Users** are connected to the movies they have rated.
- **Movies** are connected to users who have rated them.
  
This graph structure will help in exploring **user-movie relationships** for recommendations.

#### **Step 1: Merge Ratings with Movie Titles**
Since we have **movie IDs** in the ratings dataset but need human-readable movie titles, we will:
1. Merge the `ratings` DataFrame with the `movies` DataFrame using the `'movie_id'` column.
2. This allows each rating to be associated with a **movie title**.

#### **Hint:**
Use the following Pandas operation to merge:
```python
ratings = ratings.merge(movies, on='movie_id')
```


#### **Step 2: Aggregate Ratings**
Since a user may have rated the same movie more than once, we:
1. Group the dataset by `['user_id', 'movie_id', 'title']`.
2. Compute the **mean rating** for each movie by each user.
3. Reset the index to ensure we maintain a clean DataFrame structure.

#### **Hint:**  
Use `groupby()` and `mean()` as follows:
```python
ratings = ratings.groupby(['user_id', 'movie_id', 'title'])['rating'].mean().reset_index()
```
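The effect of the aggregation is easiest to see on a frame that actually contains a repeated rating. This is a small sketch with made-up values; `toy` and `agg` are hypothetical names and "Movie X" is a placeholder title.

```python
import pandas as pd

# Toy merged frame: user 1 rated the same movie twice (hypothetical values).
toy = pd.DataFrame({
    "user_id":  [1, 1, 2],
    "movie_id": [50, 50, 50],
    "title":    ["Movie X", "Movie X", "Movie X"],
    "rating":   [4, 5, 3],
})

# One row per (user, movie) pair, with repeated ratings averaged.
agg = toy.groupby(["user_id", "movie_id", "title"])["rating"].mean().reset_index()

print(agg)
```

User 1's two ratings collapse into a single row with their mean, while user 2's single rating passes through unchanged (as a float).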

#### **Step 3: Normalize Ratings**
Since different users have different rating biases, we normalize ratings by:
1. **Computing each user's mean rating**.
2. **Subtracting the mean rating** from each individual rating.

#### **Instructions:**
- Use `groupby('user_id')` to group ratings by users.
- Apply `transform(lambda x: x - x.mean())` to adjust ratings.

#### **Hint:**  
Normalize ratings using:
```python
ratings['rating'] = ratings.groupby('user_id')['rating'].transform(lambda x: x - x.mean())
```
This ensures each user’s ratings are centered around zero, making similarity calculations fairer.
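To see the effect of mean-centering, here is a small sketch on made-up ratings (the toy DataFrame and the `centered` column name are assumptions for illustration). It uses `transform("mean")`, which should give the same result as the lambda in the hint.

```python
import pandas as pd

# Toy ratings: user 1 rates generously, user 2 rates harshly (made-up data).
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "movie_id": [10, 20, 30, 10, 20, 30],
    "rating": [5, 4, 3, 3, 2, 1],
})

# Subtract each user's own mean rating from each of their ratings
toy["centered"] = toy["rating"] - toy.groupby("user_id")["rating"].transform("mean")

print(toy)
```

Both users end up with the same centered values, which reflects that they rank the three movies identically even though their raw scales differ.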

#### **Step 4: Construct the Graph Representation**
We represent the user-movie interactions as an **undirected graph** using an **adjacency list**:
- Each **user** is a node connected to movies they rated.
- Each **movie** is a node connected to users who rated it.

#### **Graph Construction Steps:**
1. Initialize an empty dictionary `graph = {}`.
2. Iterate through the **ratings dataset**.
3. For each `user_id` and `movie_id` pair:
   - Add the movie to the user’s set of connections.
   - Add the user to the movie’s set of connections.

#### **Hint:**  
The following code builds the graph:

```python
graph = {}
for _, row in ratings.iterrows():
    user, movie = row['user_id'], row['movie_id']
    if user not in graph:
        graph[user] = set()
    if movie not in graph:
        graph[movie] = set()
    graph[user].add(movie)
    graph[movie].add(user)
```

This results in a **bipartite graph**, where:
- **Users** are connected to multiple movies.
- **Movies** are connected to multiple users.

#### **Step 5: Understanding the Graph**
- **Nodes** in the graph represent **users and movies**.
- **Edges** exist between a user and a movie **if the user has rated the movie**.
- This structure allows us to find **users with similar movie tastes** and **movies frequently watched together**.

#### **Exploring the Graph**
- **Find a user’s rated movies:**  
  ```python
  user_id = 1
  print(graph[user_id])  # Movies rated by user 1
  ```

- **Find users who rated a movie:**  
  ```python
  movie_id = 50
  print(graph[movie_id])  # Users who rated movie 50
  ```
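To illustrate the "similar tastes" point, a two-hop traversal (user → movies → users) can count how many movies another user shares with a given user. This sketch uses a small hand-built graph with string node names such as `"u1"` and `"m10"`, which are assumptions made for readability; the notebook's graph uses raw integer IDs, and `co_raters` is a hypothetical helper.

```python
from collections import Counter

# Tiny hand-built adjacency list; node names are illustrative only.
toy_graph = {
    "u1": {"m10", "m20"},
    "u2": {"m10", "m20", "m30"},
    "u3": {"m30"},
    "m10": {"u1", "u2"},
    "m20": {"u1", "u2"},
    "m30": {"u2", "u3"},
}

def co_raters(graph, user):
    """Count, for every other user, how many movies they share with `user`."""
    shared = Counter()
    for movie in graph[user]:          # hop 1: movies the user rated
        for other in graph[movie]:     # hop 2: users who rated that movie
            if other != user:
                shared[other] += 1
    return shared

overlap = co_raters(toy_graph, "u1")
print(overlap.most_common())
```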

In [22]:
# Code the function here
# Step 1: Merge ratings with movie titles
# --------------------------------------
# We join on 'movie_id' so each rating now also has a human-readable movie 'title'.
ratings = ratings.merge(
    movies[["movie_id", "title"]],  # only keep the columns we actually need
    on="movie_id"
)

# Quick sanity check
print("After merge:", ratings.head())


# Step 2: Aggregate ratings (user-movie pair -> single rating)
# ------------------------------------------------------------
# Some users may have rated the same movie multiple times.
# We group by (user_id, movie_id, title) and take the mean rating.
ratings = (
    ratings
    .groupby(["user_id", "movie_id", "title"])["rating"]
    .mean()
    .reset_index()
)

print("After aggregation:", ratings.head())


# Step 3: Normalize ratings per user
# ----------------------------------
# Different users have different rating habits (some rate high, some low).
# We subtract each user's mean rating so their ratings are centered around 0.
ratings["rating"] = ratings.groupby("user_id")["rating"].transform(
    lambda x: x - x.mean()
)

print("After normalization:", ratings.head())


# Step 4: Build the user-movie graph as an adjacency list
# -------------------------------------------------------
# We create a bipartite graph:
#   - user nodes (user_id)
#   - movie nodes (movie_id)
# Edges connect a user to a movie if the user has rated that movie.
def build_user_movie_graph(ratings_df):
    """
    Build a bipartite graph (adjacency list) from a ratings DataFrame.

    Nodes:
        - user_id values (e.g., 1, 2, 3, ...)
        - movie_id values (e.g., 50, 172, ...)

    Edges:
        - An undirected edge between a user and a movie if the user rated that movie.

    Returns:
        graph: dict
            Keys are node IDs (user_id or movie_id),
            Values are sets of neighboring node IDs.
    """
    graph = {}

    # Iterate over each row = one (user, movie, rating) interaction
    for _, row in ratings_df.iterrows():
        user = row["user_id"]
        movie = row["movie_id"]

        # Ensure the user node exists in the graph
        if user not in graph:
            graph[user] = set()

        # Ensure the movie node exists in the graph
        if movie not in graph:
            graph[movie] = set()

        # Add undirected connections:
        # user -> movie and movie -> user
        graph[user].add(movie)
        graph[movie].add(user)

    return graph


# Actually build the graph from the processed ratings
graph = build_user_movie_graph(ratings)

# quick exploration examples
# ------------------------------------
# Example: movies rated by user 1 (if user 1 exists)
if 1 in graph:
    print("Movies rated by user 1:", list(graph[1])[:10])

# Example: users who rated movie 50 (if movie 50 exists)
if 50 in graph:
    print("Users who rated movie 50:", list(graph[50])[:10])


After merge:    user_id  movie_id  rating           timestamp                       title
0      196       242       3 1997-12-04 15:55:49                Kolya (1996)
1      186       302       3 1998-04-04 19:22:22    L.A. Confidential (1997)
2       22       377       1 1997-11-07 07:18:36         Heavyweights (1994)
3      244        51       2 1997-11-27 05:02:03  Legends of the Fall (1994)
4      166       346       1 1998-02-02 05:33:16         Jackie Brown (1997)
After aggregation:    user_id  movie_id              title  rating
0        1         1   Toy Story (1995)     5.0
1        1         2   GoldenEye (1995)     3.0
2        1         3  Four Rooms (1995)     4.0
3        1         4  Get Shorty (1995)     3.0
4        1         5     Copycat (1995)     3.0
After normalization:    user_id  movie_id              title    rating
0        1         1   Toy Story (1995)  1.389706
1        1         2   GoldenEye (1995) -0.610294
2        1         3  Four Rooms (1995)  0.3897

### Graph Construction – Explanation

We build a **bipartite graph** where:

- One set of nodes are users (`user_id`).
- The other set of nodes are movies (`movie_id`).
- There is an undirected edge between a user and a movie if the user rated that movie.

We store this as an adjacency list: a Python `dict` mapping each node to the set of its neighbors. This structure lets us efficiently traverse user–movie relationships without using an external graph library.
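One caveat worth checking: in MovieLens, user IDs and movie IDs are both integers starting at 1, so a dictionary keyed on the raw IDs stores user 1 and movie 1 under the same key and merges their neighbor sets. A possible workaround, sketched below on made-up pairs, is to tag each node with its type. The `("user", id)` / `("movie", id)` tuple keys and the name `build_tagged_graph` are assumptions for illustration, not part of the assignment's hint code.

```python
def build_tagged_graph(pairs):
    """Build an adjacency list whose keys are ("user", id) or ("movie", id)."""
    graph = {}
    for user_id, movie_id in pairs:
        user_node = ("user", user_id)
        movie_node = ("movie", movie_id)
        # setdefault creates the empty neighbor set on first sight of a node
        graph.setdefault(user_node, set()).add(movie_node)
        graph.setdefault(movie_node, set()).add(user_node)
    return graph

# Made-up (user_id, movie_id) pairs where the ID 1 appears on both sides
pairs = [(1, 1), (1, 2), (2, 1)]
tagged = build_tagged_graph(pairs)

print(tagged[("user", 1)])   # contains only movie nodes
print(tagged[("movie", 1)])  # contains only user nodes
```

With tagged keys, a walk can also tell a movie node from a user node by inspecting the first tuple element instead of consulting separate ID sets.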


### **Implement Weighted Random Walks**

#### **Random Walk-Based Movie Recommendation System (Weighted Pixie)**

#### **Objective**
In this task, you will implement a **random-walk-based recommendation algorithm** using the **Weighted Pixie** method. This technique uses a **user-movie bipartite graph** to recommend movies by simulating a random walk from a given user or movie.

#### **Step 1: Import Required Libraries**
Make sure you have the necessary libraries:

```python
import random  # For random walks
import pandas as pd  # For handling data
```

#### **Step 2: Implement the Random Walk Algorithm**
Your task is to **simulate a random walk** from a given starting point in the **bipartite user-movie graph**.

##### **Hints for Implementation**
- Start from **either a user or a movie**.
- At each step, **randomly move** to a connected node.
- Keep track of **how many times each movie is visited**.
- After completing the walk, **rank movies by visit count**.

#### **Step 3: Implement User-Based Recommendation**
**Hints:**
- Check if the `user_id` exists in the `graph`.
- Start a loop that runs for `walk_length` steps.
- Randomly pick a **connected node** (user or movie).
- Track how many times each **movie** is visited.
- Sort movies by visit frequency and return the **top N**.

#### **Step 4: Implement Movie-Based Recommendation**
**Hints:**
- Find the `movie_id` corresponding to the given `movie_name`.
- Ensure the movie exists in the `graph`.
- Start a random walk from that movie.
- Follow the same **tracking and ranking** process as the user-based version.

**Note:**  
**Your task:** Implement a function `weighted_pixie_recommend(user_id, walk_length=15, num=5)` or `weighted_pixie_recommend(movie_name, walk_length=15, num=5)`.  
**Implement either Step 3 or Step 4.**
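For reference, the movie-based variant described in Step 4 might look roughly like the sketch below. It runs on a toy graph with type-tagged tuple nodes and uses titles directly as movie identifiers; both choices, along with the function name `walk_from_movie` and the `seed` parameter, are assumptions made to keep the example self-contained.

```python
import random
from collections import Counter

# Toy bipartite graph; nodes are ("movie", title) or ("user", id) tuples.
toy_graph = {
    ("movie", "A"): {("user", 1), ("user", 2)},
    ("movie", "B"): {("user", 1), ("user", 2)},
    ("movie", "C"): {("user", 2)},
    ("user", 1): {("movie", "A"), ("movie", "B")},
    ("user", 2): {("movie", "A"), ("movie", "B"), ("movie", "C")},
}

def walk_from_movie(graph, start_movie, walk_length=10, num=2, seed=None):
    """Random walk from a movie node; rank other movies by visit count."""
    rng = random.Random(seed)
    start = ("movie", start_movie)
    if start not in graph:
        return []
    visits = Counter()
    current = start
    for _ in range(walk_length):
        # sorted() gives a stable order so a fixed seed reproduces the walk
        current = rng.choice(sorted(graph[current]))
        if current[0] == "movie" and current != start:
            visits[current[1]] += 1
    return [title for title, _ in visits.most_common(num)]

recs = walk_from_movie(toy_graph, "A", walk_length=50, num=2, seed=0)
print(recs)
```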

#### **Step 5: Running Your Recommendation System**
Once your function is implemented, test it by calling:

##### **Example: User-Based Recommendation**
```python
weighted_pixie_recommend(1, walk_length=15, num=5)
```
| Ranking | Movie Name                     |
|---------|--------------------------------|
| 1       | My Own Private Idaho (1991)   |
| 2       | Aladdin (1992)                |
| 3       | 12 Angry Men (1957)           |
| 4       | Happy Gilmore (1996)          |
| 5       | Copycat (1995)                |


##### **Example: Movie-Based Recommendation**
```python
weighted_pixie_recommend("Jurassic Park (1993)", walk_length=10, num=5)
```
| Ranking | Movie Name                           |
|---------|-------------------------------------|
| 1       | Rear Window (1954)                 |
| 2       | Great Dictator, The (1940)         |
| 3       | Field of Dreams (1989)             |
| 4       | Casablanca (1942)                  |
| 5       | Nightmare Before Christmas, The (1993) |


#### **Step 6: Understanding the Results**
Your function should return a **DataFrame** structured as follows:

| Ranking | Movie Name |
|---------|-----------|
| 1       | Movie A   |
| 2       | Movie B   |
| 3       | Movie C   |
| 4       | Movie D   |
| 5       | Movie E   |

Each movie is ranked based on **how frequently it was visited** during the walk.

#### **Experiment with Different Parameters**
- Try different **`walk_length`** values and observe how it changes recommendations.
- Adjust the number of recommended movies (`num`).
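One way to run such an experiment is to repeat the walk many times and add up the visit counts, since a single short walk touches very few movies. The sketch below does this on a toy graph; the node names, the `movie_nodes` set, and the helper `aggregate_walks` with its `num_walks` and `seed` parameters are all hypothetical.

```python
import random
from collections import Counter

# Toy bipartite graph with string node names (illustrative only).
toy_graph = {
    "u1": {"mA"},
    "u2": {"mA", "mB"},
    "u3": {"mB", "mC"},
    "mA": {"u1", "u2"},
    "mB": {"u2", "u3"},
    "mC": {"u3"},
}
movie_nodes = {"mA", "mB", "mC"}

def aggregate_walks(graph, start, walk_length, num_walks, seed=None):
    """Run `num_walks` independent walks from `start` and sum movie visits."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_walks):
        current = start                      # restart at the query node
        for _step in range(walk_length):
            current = rng.choice(sorted(graph[current]))
            if current in movie_nodes:
                counts[current] += 1
    return counts

for length in (2, 6):
    counts = aggregate_walks(toy_graph, "u1", walk_length=length, num_walks=200, seed=1)
    print(length, counts.most_common())
```

With `walk_length=2` the walk from `u1` can only ever reach `mA`, while longer walks start to reach `mB` and `mC`, which shows how `walk_length` controls how far from the start node recommendations can come.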

In [23]:
# Code the function here
import random
import pandas as pd

# Precompute sets of user_ids and movie_ids to distinguish node types in the graph
user_ids = set(ratings["user_id"].unique())
movie_ids = set(ratings["movie_id"].unique())


def weighted_pixie_recommend(user_id, walk_length=15, num=5):
    """
    Random-walk-based movie recommendation (Pixie-inspired).

    Starting from a given user_id:
      - Perform a random walk on the user-movie bipartite graph.
      - Count how many times each movie node is visited.
      - Recommend the top-N most frequently visited movies (excluding movies the user already rated).

    Parameters
    ----------
    user_id : int
        The starting user for the random walk.
    walk_length : int, optional
        Number of steps to take in the random walk (default: 15).
    num : int, optional
        Number of movies to recommend (default: 5).

    Returns
    -------
    pandas.DataFrame or None
        DataFrame with columns:
            - Ranking
            - Movie Name
        or None if no recommendations can be made.
    """

    # 1. Check if the user exists in the graph
    if user_id not in graph or user_id not in user_ids:
        print(f"User {user_id} not found in graph.")
        return None

    # 2. Store which movies the user has already rated
    already_rated = set(
        ratings.loc[ratings["user_id"] == user_id, "movie_id"].unique()
    )

    # 3. Initialize the random walk
    current_node = user_id

    # Dictionary to count how many times each movie is visited
    visit_counts = {}

    # Optional: fix random seed for reproducible results (comment out if not needed)
    # random.seed(42)

    # 4. Perform the random walk
    for step in range(walk_length):
        neighbors = graph.get(current_node, None)

        # If the current node has no neighbors, we can't continue walking
        if not neighbors:
            break

        # Randomly choose the next node from the neighbors (user or movie)
        next_node = random.choice(list(neighbors))

        # If we land on a movie node, record the visit
        if next_node in movie_ids:
            visit_counts[next_node] = visit_counts.get(next_node, 0) + 1

        # Move to the next node for the next step
        current_node = next_node

    # 5. If we never visited any movie nodes, we can't recommend anything
    if not visit_counts:
        print("No movies were visited during the random walk. Try increasing walk_length.")
        return None

    # 6. Sort movies by visit frequency (descending)
    sorted_movies = sorted(
        visit_counts.items(),
        key=lambda x: x[1],
        reverse=True
    )

    # 7. Filter out movies the user already rated
    filtered_movie_ids = [
        movie_id for movie_id, count in sorted_movies
        if movie_id not in already_rated
    ]

    if not filtered_movie_ids:
        print("All visited movies were already rated by the user.")
        return None

    # 8. Take the top-N movie IDs
    top_movie_ids = filtered_movie_ids[:num]

    # Map movie IDs to titles
    movie_title_map = movies.set_index("movie_id")["title"]
    movie_names = movie_title_map.loc[top_movie_ids].values

    # 9. Build the result DataFrame
    result_df = pd.DataFrame({
        "Ranking": range(1, len(movie_names) + 1),
        "Movie Name": movie_names
    }).set_index("Ranking")

    return result_df


---

In [24]:
weighted_pixie_recommend(1, walk_length=15, num=5)


| Ranking | Movie Name                              |
|---------|-----------------------------------------|
| 1       | Around the World in 80 Days (1956)      |
| 2       | American Werewolf in London, An (1981)  |
| 3       | Nick of Time (1995)                     |
| 4       | Nixon (1995)                            |
| 5       | Annie Hall (1977)                       |


### Pixie-Inspired Random Walk – Explanation

Our random-walk recommender starts from a user node and performs a walk of fixed length on the user–movie graph:

1. We start at the target `user_id`.
2. At each step, we randomly pick one neighbor of the current node. If the current node is a user, neighbors are movies they rated; if it is a movie, neighbors are users who rated that movie.
3. Whenever we land on a **movie node**, we increase its visit count.
4. After the walk, we sort movies by how often they were visited and recommend the top-N movies the user has not already rated.

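This is **Pixie-inspired** because it uses short random walks on a bipartite graph to surface items that are “close” to the user in the interaction graph, similar to how Pinterest’s Pixie algorithm finds related pins. Note that, despite the function name, this implementation chooses each neighbor uniformly at random with `random.choice`; the normalized ratings are not used as edge weights, so the walk is not biased toward highly rated movies.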
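If the walk were to be weighted by rating, one possible approach (an assumption here, not something the function above does) is to store a weight on each edge and draw the next node with `random.choices`. The nested-dict layout and the name `weighted_step` are hypothetical. Weights must be non-negative, so mean-centered ratings would need to be shifted first, or the raw 1 to 5 ratings could be used instead.

```python
import random

# Hypothetical weighted adjacency list: node -> {neighbor: edge weight}
weighted_graph = {
    "u1": {"mA": 5.0, "mB": 1.0},
    "mA": {"u1": 5.0},
    "mB": {"u1": 1.0},
}

def weighted_step(graph, node, rng):
    """Pick one neighbor of `node` with probability proportional to edge weight."""
    neighbors = list(graph[node])
    weights = [graph[node][n] for n in neighbors]
    return rng.choices(neighbors, weights=weights, k=1)[0]

rng = random.Random(0)
draws = [weighted_step(weighted_graph, "u1", rng) for _ in range(1000)]
print(draws.count("mA"), draws.count("mB"))
```

Over many draws from `u1`, the highly rated `mA` is chosen far more often than `mB`, which is the bias a weighted walk is meant to introduce.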

### Example Output (User 1)

For `weighted_pixie_recommend(1, walk_length=15, num=5)`, we get up to five movies ranked by how often the walk visited them. Because a 15-step walk reaches only a handful of movie nodes, the visit counts are small and the output changes from run to run unless a random seed is fixed. The recommended titles come from users whose rating histories overlap with user 1's, which can surface less obvious titles that standard similarity-based CF might miss.


## **Submission Requirements:**

To successfully complete this assignment, ensure that you submit the following:


### **1. Jupyter Notebook Submission**
- Submit a **fully completed Jupyter Notebook** that includes:
  - **All implemented recommendation functions** (user-based, item-based, and random walk-based recommendations).
  - **Code explanations** in markdown cells to describe each step.
  - **Results and insights** from running your recommendation models.


### **2. Explanation of Pixie-Inspired Algorithms (3-5 Paragraphs)**
- Write a **detailed explanation** of **Pixie-inspired random walk algorithms** used for recommendations.
- Your explanation should cover:
  - What **Pixie-inspired recommendation systems** are.
  - How **random walks** help in identifying relevant recommendations.
  - Any real-world applications of such algorithms in industry.


### **3. Report for the Submitted Notebook**
Your report should be structured as follows:

#### **Title: Movie Recommendation System Report**

#### **1. Introduction**
- Briefly introduce **movie recommendation systems** and why they are important.
- Explain the **different approaches used** (user-based, item-based, random-walk).

#### **2. Dataset Description**
- Describe the **MovieLens 100K dataset**:
  - Number of users, movies, and ratings.
  - What features were used.
  - Any preprocessing performed.

#### **3. Methodology**
- Explain the three recommendation techniques implemented:
  - **User-based collaborative filtering** (how user similarity was calculated).
  - **Item-based collaborative filtering** (how item similarity was determined).
  - **Random-walk-based Pixie algorithm** (why graph-based approaches are effective).
  
#### **4. Implementation Details**
- Discuss the steps taken to build the functions.
- Describe how the **adjacency list graph** was created.
- Explain how **random walks** were performed and how visited movies were ranked.

#### **5. Results and Evaluation**
- Present **example outputs** from each recommendation approach.
- Compare the different methods in terms of accuracy and usefulness.
- Discuss any **limitations** in the implementation.

#### **6. Conclusion**
- Summarize the key takeaways from the project.
- Discuss potential improvements (e.g., **hybrid models, additional features**).
- Suggest real-world applications of the methods used.

### **Submission Instructions**

- Submit a `.zip` file containing the Jupyter Notebook, all provided data files, and the saved files (`users.csv`, `movies.csv`, and `ratings.csv`). Also include the Report and the Pixie Algorithm explanation document.
- [`Bonus 10 Points`] **Upload your Jupyter Notebook, Explanation Document, and Report** to your GitHub repository.
- Ensure the repository is public and contains:
  - `users.csv`, `movies.csv`, and `ratings.csv` (the DataFrames created in Part 1, saved and exported as `.csv` files)
  - `Movie_Recommendation.ipynb`
  - `Pixie_Algorithm_Explanation.pdf` or `.md`
  - `Recommendation_Report.pdf` or `.md`
- **Submit the GitHub repository link in the cell below.**


#### **Example Submission Format**
```text
GitHub Repository: https://github.com/username/Movie-Recommendation
```

In [None]:
# Submit the GitHub link here:


### **Grading Rubric: ITCS 6162 - Data Mining Assignment**


| **Category**                              | **Criteria**                                                     | **Points** |
|-------------------------------------------|----------------------------------------------------------------|------------|
| **Part 1: Exploring and Cleaning Data (15 pts)**  | Properly loads `u.user`, `u.data`, and `u.item` datasets into DataFrames | 5 |
|                                           | Handles missing values, duplicates, and inconsistencies appropriately | 5 |
|                                           | Saves the cleaned datasets into CSV files: `users.csv`, `movies.csv`, `ratings.csv` | 5 |
| **Part 2: Collaborative Filtering-Based Recommendation (30 pts)** | Implements user-based collaborative filtering correctly | 10 |
|                                           | Implements item-based collaborative filtering correctly | 10 |
|                                           | Computes similarity measures accurately and provides valid recommendations | 10 |
| **Part 3: Graph-Based Recommender (Pixie-Inspired Algorithm) (35 pts)** | Constructs adjacency lists properly from user-movie interactions | 10 |
|                                           | Implements weighted random walk-based recommendation correctly | 15 |
|                                           | Explains and justifies the algorithm design choices (Pixie-inspired) | 10 |
| **Code Quality & Documentation (10 pts)** | Code is well-structured, efficient, and follows best practices | 5 |
|                                           | Markdown explanations and comments are clear and enhance understanding | 5 |
| **Results & Interpretation (5 pts)**      | Provides meaningful insights from the recommendation system's output | 5 |
| **Submission & Report (5 pts)**          | Submits all required files in the correct format (ZIP file with Jupyter notebook, processed CSV files, and project report) | 5 |
| **Total**                                 |                              | 100 |

#### **Bonus (10 pts)**
| **Category**                              | **Criteria**                                                     | **Points** |
|-------------------------------------------|----------------------------------------------------------------|------------|
| **GitHub Submission**                     | Provides a well-documented GitHub repository with CSV files, a structured README, and a properly formatted Jupyter Notebook | 10 |