In [2]:
import pandas as pd

In [5]:
movies = pd.read_csv('../data/movies.csv')
ratings = pd.read_csv('../data/ratings.csv')

In [6]:
print("\n***Movies***\n")
movies.head()


***Movies***



Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
print("\n***Ratings***\n")
ratings.head()


***Ratings***



Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [8]:
num_movies = movies['movieId'].nunique()
num_users = ratings['userId'].nunique()

In [9]:
print(f"Number of Movies: {num_movies}")
print(f"\nNumber of Users: {num_users}")

Number of Movies: 9742

Number of Users: 610


## üìÖ **Day 2: Data Preprocessing for Collaborative Filtering**

### üéØ **Main Goal**:
Prepare your MovieLens dataset for collaborative filtering by:
- Cleaning the data (handle missing values, filter sparse/low-rated data).
- Creating a **User-Movie Ratings Matrix**, which will later be used to compute similarities.

---

### ‚úÖ **Outcome at the end of Day 2**:
- A **cleaned dataset** (filtered to remove noise or insufficient data).
- A **User-Movie Ratings Matrix** (users as rows, movies as columns, ratings as values).
- You're now ready to calculate similarities from this matrix in later days.

---

## üîÅ Step-by-Step Plan

### **1. Import Necessary Libraries**

### **2. Load the Ratings CSV**

### **3. Check for Missing Values**

#### üß† **Why?**
Missing values can cause issues when building the user-movie matrix.

### **4. Filter Out Infrequently Rated Movies**

#### üß† **Why?**
Movies with very few ratings give weak collaborative filtering signals and make the similarity matrix sparse.

#### üëá Here's the **explanation** of the code 

```python
# Count how many times each movie has been rated
movies_count = ratings['movieId'].value_counts()
```

- üîç `value_counts()` ‚Üí Think of it as: ‚ÄúHow many times does each movie appear in the ratings?‚Äù

```python
# Keep only movies with at least 10 ratings

# Day-2

# Documentation Concepts to Read Up On:
| Concept | What to Search |
|--------|----------------|
| `pandas.DataFrame.value_counts()` | Count frequencies of values |
| `pandas.Series.isin()` | Filter based on a list of values |
| `pandas.pivot_table()` | Creating pivot (matrix-style) tables |
| Missing values (`NaN`) | `dropna()`, handling NaNs |
| Filtering DataFrames | Boolean indexing in Pandas |

---

In [11]:
ratings.isnull()

Unnamed: 0,userId,movieId,rating,timestamp
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False
4,False,False,False,False
...,...,...,...,...
100831,False,False,False,False
100832,False,False,False,False
100833,False,False,False,False
100834,False,False,False,False


In [24]:
# Counting how many times each movies has been rated
movies_count = ratings['movieId'].value_counts()
movies_count

movieId
356       329
318       317
296       307
593       279
2571      278
         ... 
188833      1
189381      1
3899        1
2848        1
147002      1
Name: count, Length: 9724, dtype: int64

In [28]:
# Keeping only moviews with at least 10 ratings
popular_movies = movies_count[movies_count >= 10].index
popular_movies

Index([   356,    318,    296,    593,   2571,    260,    480,    110,    589,
          527,
       ...
          258,   1290,   5621,    918,   2380,   4167,  50794,   4255,   1147,
       120466],
      dtype='int64', name='movieId', length=2269)

In [29]:
# filter the original dataset to include only those popular movies
filter_ratings = ratings[ratings['movieId'].isin(popular_movies)]
filter_ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100818,610,159093,3.0,1493847704
100829,610,164179,5.0,1493845631
100830,610,166528,4.0,1493879365
100833,610,168250,5.0,1494273047


In [31]:
# Filtering Users who rated only few Movies
users_counts = filter_ratings['userId'].value_counts()
actice_users = users_counts[users_counts >5].index
filter_ratings = filter_ratings[filter_ratings['userId'].isin(actice_users)]
filter_ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100818,610,159093,3.0,1493847704
100829,610,164179,5.0,1493845631
100830,610,166528,4.0,1493879365
100833,610,168250,5.0,1494273047


In [32]:
# Creating the user movie ratings Matrix
user_movie_matrix = filter_ratings.pivot_table(index='userId', columns='movieId', values='rating')

In [33]:
user_movie_matrix.shape

(610, 2269)

In [34]:
user_movie_matrix.head()

movieId,1,2,3,5,6,7,9,10,11,12,...,166461,166528,166643,168250,168252,174055,176371,177765,179819,187593
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,,4.0,,4.0,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,


# DAY 3

## What is Cosine Similarity?
- Cosine similarity measures how **‚Äúsimilar‚Äù** two vectors are by calculating the cosine of the angle between them. In simple terms, it tells you if two users rate movies in a similar way.
- **Range:** 0 to 1 (for non-negative ratings like 1‚Äì5 stars):
    - 1 = Identical preferences (users rate movies the same way).
    - 0 = No similarity (completely different tastes).


**Analogy**: Imagine two people pointing at stars in the sky. If their arms point in nearly the same direction (small angle), they‚Äôre looking at similar stars (high cosine similarity). If their arms point in opposite directions (large angle), they‚Äôre looking at different stars (low similarity).

**In AI**: Cosine similarity is used in recommendation systems (like Netflix) to find similar users or items, powering collaborative filtering (recommending based on user behavior).

In [2]:
import pandas as pd
import numpy as np

In [3]:
movies = pd.read_csv("../data/movies.csv")
ratings = pd.read_csv("../data/ratings.csv")

In [6]:
# Creating USer-movie matrix
user_movie_matrix = ratings.pivot_table(index='userId', columns='movieId', values='rating')

# Fill NaN (no rating) with 0
user_movie_matrix = user_movie_matrix.fillna(0)
user_movie_matrix.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Each row is a user‚Äôs rating vector (e.g., User 1: [4.0, 0.0, 4.5, ...]).

In **AI:** This matrix is the foundation of collaborative filtering. It‚Äôs sparse (lots of zeros) because users rate only a few movies, which is typical in recommendation systems.

## The Cosine Similarity Function
Cosine similarity between two vectors ùê¥ and B is:
- coisine_similarity(A, B) = ( A . B ) / ( ||A|| ||B|| )

- **A‚ãÖB:** Dot product (sum of element-wise products).
- **‚à•ùê¥‚à•:** Magnitude (square root of sum of squared elements).
- NumPy makes this easy with vectorized operations.

In [12]:
def coisine_similarity(user1, user2):
    """
    Compute cosine similarity between two users' rating vectors.
    Parameters:
        user1, user2: NumPy arrays of ratings (e.g., user_movie_matrix.loc[1].values)
    Returns:
        Float between 0 and 1 (1 = identical, 0 = no similarity)
    """
    # Compute dot product
    dot_product = np.dot(user1, user2)

    # Compute magnitude 
    magnitude1 = np.sqrt(np.sum(user1 ** 2))
    magnitude2 = np.sqrt(np.sum(user2 ** 2))

    # Avoid division by zero
    if magnitude1 == 0 or magnitude2 == 0:
        return 0.0
    
    # Compute cosine similarity
    return dot_product / (magnitude1 + magnitude2)


**What‚Äôs Happening?**
- `np.dot(user1, user2)`: Multiplies corresponding ratings and sums them (like \( 4.0*0.0 + 0.0*3.0 + ... \)).
- `np.sqrt(np.sum(user1 ** 2))`: Computes the magnitude of `user1` (square root of sum of squared ratings).
- If either user has all zeros (no ratings), return 0 to avoid division by zero.
- The result is a number between 0 and 1.

**Analogy**: Think of two users as archers shooting arrows (ratings) at a target (movies). Cosine similarity measures how close their arrows land to each other‚Äôs direction, ignoring how far they shot.

#### Step 3: Test the Function
Let‚Äôs compute the similarity between two users (e.g., `userId` 1 and 2).


In [15]:
# Get ranking vectors for Two users
user1_ratings = user_movie_matrix.loc[1].values
user2_ratings = user_movie_matrix.loc[2].values

# Compute cosine similarity
similarity = coisine_similarity(user1_ratings, user2_ratings)
print(f"Coisine Similarity between User 1 and User 2: {similarity}")

Coisine Similarity between User 1 and User 2: 0.44795632538769775


- A low score (e.g., 0.447) means User 1 and User 2 have different tastes.
- A high score (e.g., 0.9) would mean they rate movies similarly.

**What‚Äôs Happening?**
- `user_movie_matrix.loc[1].values`: Gets User 1‚Äôs ratings as a NumPy array (e.g., `[4.0, 0.0, 4.5, ...]`).
- The function computes how aligned their ratings are.
- **In AI**: A high similarity score means User 2‚Äôs favorite movies could be recommended to User 1.

#### Step 4: Apply to Recommendation System
To make recommendations:
1. Find users similar to a target user (high cosine similarity).
2. Recommend movies they rated highly that the target user hasn‚Äôt seen.


In [20]:
# Merge ratings and movies on movieId
merged_data = pd.merge(ratings, movies, on='movieId')

In [22]:
# Find movies User 2 rated highly (e.g., > 3) that User 1 hasn't rated
user2_high_rated = merged_data[(merged_data['userId'] == 2) & (merged_data['rating'] > 3)][['movieId', 'title', 'rating']]
user1_rated = merged_data[merged_data['userId'] == 1]['movieId']

# Recommend movies User 2 rated but User 1 hasn't
recommendations = user2_high_rated[~user2_high_rated['movieId'].isin(user1_rated)]
print("Recommended movies for User 1 based on User 2:")
print(recommendations)

Recommended movies for User 1 based on User 2:
     movieId                                              title  rating
234     1704                           Good Will Hunting (1997)     4.5
236     6874                           Kill Bill: Vol. 1 (2003)     4.0
237     8798                                  Collateral (2004)     3.5
238    46970  Talladega Nights: The Ballad of Ricky Bobby (2...     4.0
239    48516                               Departed, The (2006)     4.0
240    58559                            Dark Knight, The (2008)     4.5
241    60756                               Step Brothers (2008)     5.0
242    68157                        Inglourious Basterds (2009)     4.5
244    74458                              Shutter Island (2010)     4.0
246    79132                                   Inception (2010)     4.0
247    80489                                   Town, The (2010)     4.5
248    80906                                  Inside Job (2010)     5.0
249    86345     

**Note**: This is a basic example. In a full system, you‚Äôd compute similarities for all users and pick the most similar ones.

**In AI**: This is **user-based collaborative filtering**. You‚Äôre using similarity scores to find ‚Äúneighbors‚Äù (similar users) and recommend their favorite movies.

#### Troubleshooting
- **KeyError: 1**: If `userId` 1 or 2 isn‚Äôt in `user_movie_matrix`, check `user_movie_matrix.index` and pick valid `userId`s (e.g., `user_movie_matrix.index[0]`).
- **NaN in Matrix**: Ensure `fillna(0)` was applied to replace missing ratings.
- **Low Similarity Scores**: If all scores are near 0, it‚Äôs normal for sparse data (users rate few movies). Try users with more ratings or a smaller dataset.
- **File Path Issues**: Use absolute paths (e.g., `'C:/AI_Projects/movies.csv'`) and check with `!dir` (Windows) or `!ls` (Mac/Linux).

---