### Part 2: PCA Method with Maximum Likelihood Estimation

**Group 16**

Adham Mohmed Elsaied Elwakel 222100195,
Samaa Khaled Eltaky 222100761,
Habiba Ahmed Abdelnapy 222100471, 
Youssef Hussieny 222101943

In [1]:
import pandas as pd
import numpy as np
ratings = pd.read_csv('../data/ratings.csv')

In [2]:
ratings

,userId,movieId,rating,timestamp
0,1,17,4.0,944249077
1,1,25,1.0,944250228
2,1,29,2.0,943230976
3,1,30,5.0,944249077
4,1,32,5.0,943228858
...,...,...,...,...
32000199,200948,79702,4.5,1294412589
32000200,200948,79796,1.0,1287216292
32000201,200948,80350,0.5,1294412671
32000202,200948,80463,3.5,1350423800


In [None]:
target_items = [1, 2]

In [None]:
#2. Select Top 1,000 Popular Movies (plus target items if they aren't in top 1000)
desired_item_count = 1000

In [None]:
item_counts = ratings['movieId'].value_counts()
top_movies = item_counts.nlargest(desired_item_count).index.tolist()

In [None]:
for t in target_items:
    if t not in top_movies:
        top_movies.append(t)

ratings_filtered = ratings[ratings['movieId'].isin(top_movies)]

In [None]:
# 3. Select Top 100,000 Active Users
desired_user_count = 100000
user_counts = ratings_filtered['userId'].value_counts()
top_users = user_counts.nlargest(desired_user_count).index.tolist()

In [None]:
ratings_final = ratings_filtered[ratings_filtered['userId'].isin(top_users)]

In [None]:
ratings_final.shape

(15899442, 4)

In [None]:
user_item_matrix = ratings_final.pivot(index='userId', columns='movieId', values='rating')

In [None]:
user_item_matrix

userId \ movieId,1,2,3,5,6,7,10,11,16,17,...,168252,171763,174055,176371,177765,187593,195159,202439,204698,207313
1,,,,,,,,,,4.0,...,,,,,,,,,,
3,,3.5,,,,,4.0,4.0,,5.0,...,,,,,,,,,,
10,2.5,2.0,,,,,4.0,,,,...,4.0,,,3.5,,,,,,
13,,,,,5.0,,,3.0,,,...,,,,,,,,,,
15,,,,,,,,,,4.5,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200943,3.5,2.5,,,3.5,4.5,3.5,,,5.0,...,,,,,,,,,,
200944,4.0,,,,,,,,,,...,,,,,,,,,,
200945,,,,,,,,,,,...,,,0.5,,,,,,,
200947,4.0,,,,,,,,,,...,,,,,,,,,,


In [None]:
item_means = user_item_matrix.mean()

In [None]:
centered_matrix_mle = user_item_matrix - item_means

#### 1- Generate the covariance matrix

In [None]:
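# Pairwise (available-case) covariance: each entry uses only users who rated both items.
mask = (~centered_matrix_mle.isna()).astype(int)    # 1 where a rating exists
filled_centered = centered_matrix_mle.fillna(0)     # missing entries contribute 0 to the sums
numerator = filled_centered.T.dot(filled_centered)  # sum of centered products over co-raters
common_user_counts = mask.T.dot(mask)               # number of co-raters per item pair
# Note: dividing by (n - 1) gives the unbiased sample estimate; the strict Gaussian MLE divides by n.
denominator = common_user_counts - 1
denominator = denominator.where(denominator > 0)    # pairs with fewer than 2 co-raters become NaN
cov_matrix_mle = numerator / denominator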


In [None]:
print(cov_matrix_mle.iloc[:5, :5])

movieId         1         2         3         5         6
movieId                                                  
1        0.824260  0.278830  0.211434  0.241623  0.111825
2        0.278830  0.893008  0.349880  0.410519  0.117049
3        0.211434  0.349880  1.016477  0.530252  0.140805
5        0.241623  0.410519  0.530252  1.031786  0.151808
6        0.111825  0.117049  0.140805  0.151808  0.737174


In [None]:
cov_matrix_mle.isna().sum().sum()

np.int64(0)

In [None]:
cov_matrix_mle = cov_matrix_mle.fillna(0)
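
To make the masking step easier to check by hand, here is a minimal sketch of the same pairwise covariance on a toy matrix. The data, the helper name `pairwise_covariance`, and its `ddof` parameter are hypothetical and only for illustration; `ddof=1` mirrors the `n - 1` denominator used above, and `ddof=0` would give the `1/n` form.

```python
import numpy as np
import pandas as pd

# Toy user-item matrix (hypothetical data): rows are users, columns are items.
toy = pd.DataFrame(
    {
        "A": [5.0, 3.0, np.nan, 1.0],
        "B": [4.0, np.nan, 2.0, 2.0],
        "C": [np.nan, 1.0, 5.0, np.nan],
    },
    index=["u1", "u2", "u3", "u4"],
)


def pairwise_covariance(matrix, ddof=1):
    """Covariance computed from co-rated entries only (illustrative helper)."""
    centered = matrix - matrix.mean()     # item means over observed ratings
    mask = matrix.notna().astype(int)     # 1 where a rating exists
    filled = centered.fillna(0)           # missing entries add 0 to the sums
    numerator = filled.T.dot(filled)      # sum of products over co-raters
    counts = mask.T.dot(mask)             # number of co-raters per item pair
    denominator = counts - ddof
    # Pairs with too few co-raters have no defined covariance
    return numerator / denominator.where(denominator > 0)


toy_cov = pairwise_covariance(toy)
print(toy_cov.round(3))
```

Items A and C share a single co-rater (u2), so their entry is NaN under `ddof=1`. In the real matrix every pair has at least two co-raters, which is why the NaN count above is 0.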

#### 2- Determine the top 5-peers and top 10-peers for each of the target items (I1 and I2) using the transformed representation (covariance matrix)

In [None]:
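for item_id in target_items:
    if item_id in cov_matrix_mle.index:
        # Covariance of the target item with every item, largest first
        item_covariances = cov_matrix_mle[item_id]
        sorted_peers = item_covariances.sort_values(ascending=False)

        # Exclude the item itself (its own variance would otherwise rank first)
        sorted_peers = sorted_peers.drop(item_id)

        top_5_peers = sorted_peers.head(5)
        top_10_peers = sorted_peers.head(10)  # computed here; step 5 selects the 10 peers again
        # Only the top 5 are printed: the first block of output is for item 1, the second for item 2
        for rank, (peer_id, score) in enumerate(top_5_peers.items(), 1):
            print(f"{rank}. Item {peer_id} (Covariance: {score:.4f})")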

1. Item 3114 (Covariance: 0.5809)
2. Item 78499 (Covariance: 0.5538)
3. Item 4886 (Covariance: 0.4165)
4. Item 2355 (Covariance: 0.4049)
5. Item 34 (Covariance: 0.4025)
1. Item 2953 (Covariance: 0.4792)
2. Item 158 (Covariance: 0.4769)
3. Item 3489 (Covariance: 0.4557)
4. Item 673 (Covariance: 0.4519)
5. Item 500 (Covariance: 0.4456)
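
As a small illustration of the peer selection, the sketch below applies the same "drop the item, keep the largest covariances" rule to the 5×5 excerpt of the covariance matrix printed in step 1. The helper name `top_k_peers` is hypothetical. Because only five items are present, the peers it returns are the best within that excerpt and are not the real top peers listed above.

```python
import pandas as pd

# The 5x5 excerpt of cov_matrix_mle printed in step 1 (movie ids 1, 2, 3, 5, 6).
ids = [1, 2, 3, 5, 6]
cov_excerpt = pd.DataFrame(
    [
        [0.824260, 0.278830, 0.211434, 0.241623, 0.111825],
        [0.278830, 0.893008, 0.349880, 0.410519, 0.117049],
        [0.211434, 0.349880, 1.016477, 0.530252, 0.140805],
        [0.241623, 0.410519, 0.530252, 1.031786, 0.151808],
        [0.111825, 0.117049, 0.140805, 0.151808, 0.737174],
    ],
    index=ids,
    columns=ids,
)


def top_k_peers(cov, item_id, k):
    """Return the k items with the largest covariance to item_id, excluding itself."""
    return cov[item_id].drop(item_id).nlargest(k)


for item_id in [1, 2]:
    peers = top_k_peers(cov_excerpt, item_id, 2)
    print(f"Target item {item_id}:")
    for rank, (peer_id, score) in enumerate(peers.items(), 1):
        print(f"  {rank}. Item {peer_id} (Covariance: {score:.4f})")
```

Dropping the target before calling `nlargest` means the result never depends on the item's own variance being among the largest values.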


#### 3- Determine reduced dimensional space for each user in case of using the top 5-peers

In [None]:
reduced_space_matrices = {}

for item_id in target_items:
    if item_id in cov_matrix_mle.index:
        sorted_peers = cov_matrix_mle[item_id].sort_values(ascending=False).drop(item_id)
        
        top_5_peer_ids = sorted_peers.head(5).index.tolist()
        
        reduced_matrix = centered_matrix_mle[top_5_peer_ids]
        
        reduced_space_matrices[item_id] = reduced_matrix
        print(top_5_peer_ids)
        print(reduced_matrix.shape)

[3114, 78499, 4886, 2355, 34]
(100000, 5)
[2953, 158, 3489, 673, 500]
(100000, 5)


#### 4- Use the results from point 3 to compute the rating predictions of the original missing ratings for each of the target items (I1 and I2) using the top 5-peers.

In [None]:
for item_id in target_items:
    if item_id in reduced_space_matrices:
        reduced_matrix = reduced_space_matrices[item_id]
        peers_indices = reduced_matrix.columns
        weights = cov_matrix_mle.loc[item_id, peers_indices]
        
        # For each user, compute weighted sum using only rated (non-NaN) peers
        predicted_ratings = []
        for user_id in reduced_matrix.index:
            user_ratings = reduced_matrix.loc[user_id]
            # Only use non-NaN ratings
            valid_mask = ~user_ratings.isna()
            
            if valid_mask.sum() == 0:  # No valid ratings for this user
                predicted_ratings.append(np.nan)
            else:
                valid_ratings = user_ratings[valid_mask]
                valid_weights = weights[valid_mask]
                
                weighted_sum = (valid_ratings * valid_weights).sum()
                sum_abs_weights = valid_weights.abs().sum()
                
                target_mean = item_means[item_id]
                prediction = target_mean + (weighted_sum / sum_abs_weights)
                predicted_ratings.append(prediction)
        
        predicted_ratings = pd.Series(predicted_ratings, index=reduced_matrix.index)
        
        target_mean = item_means[item_id]
        print(f"Target Item {item_id}:")
        print(f"  Mean: {target_mean}")
        print(f"  Predicted ratings (top 10 users):")
        print(predicted_ratings.head(10))

Target Item 1:
  Mean: 3.883266029744903
  Predicted ratings (top 10 users):
userId
1     2.363511
3     4.065145
10    2.363511
13         NaN
15    1.565145
16    2.565145
17         NaN
18    4.565145
20    5.081433
25         NaN
dtype: float64
Target Item 2:
  Mean: 3.234285033554382
  Predicted ratings (top 10 users):
userId
1          NaN
3     3.139628
10    2.222354
13         NaN
15    3.956788
16         NaN
17         NaN
18         NaN
20    1.907591
25         NaN
dtype: float64
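
The per-user loop above applies the rule "target mean plus the covariance-weighted average of the user's centered peer ratings, using only the peers that user rated". A vectorized sketch of that rule is shown below on toy data. The function name `predict_from_peers` and the toy values are hypothetical; this is one possible formulation under those assumptions, not the code that produced the output above.

```python
import numpy as np
import pandas as pd


def predict_from_peers(centered_peers, weights, target_mean):
    """Vectorized form of: mean + sum(w * r) / sum(|w|) over each user's rated peers."""
    rated = centered_peers.notna().astype(float)              # 1.0 where the user rated the peer
    weighted_sum = centered_peers.fillna(0).mul(weights, axis=1).sum(axis=1)
    abs_weight_sum = rated.mul(weights.abs(), axis=1).sum(axis=1)
    # Users with no rated peers have a zero denominator and get NaN
    return target_mean + weighted_sum / abs_weight_sum.where(abs_weight_sum > 0)


# Toy inputs (hypothetical): two peers, three users, centered ratings.
weights = pd.Series({"p1": 0.5, "p2": 0.25})
centered_peers = pd.DataFrame(
    {"p1": [1.0, np.nan, np.nan], "p2": [-1.0, 2.0, np.nan]},
    index=["u1", "u2", "u3"],
)

preds = predict_from_peers(centered_peers, weights, target_mean=3.0)
print(preds)
```

When a user has rated only one peer with a positive weight, the weight cancels and the prediction is the target mean plus that single centered rating. This is also how a prediction can land above the top of the rating scale, as with user 20 for item 1.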


#### 5- Determine reduced dimensional space for each user in case of using the top 10-peers.

In [None]:
reduced_spaces_10 = {}

for item_id in target_items:
    if item_id in cov_matrix_mle.index:
        # Drop the item itself first, then take its 10 largest covariances
        peers_indices = cov_matrix_mle[item_id].drop(item_id).nlargest(10).index
        
        # Select only the peer columns from the centered matrix
        reduced_matrix_10 = centered_matrix_mle[peers_indices]
        
        # Store for later use
        reduced_spaces_10[item_id] = reduced_matrix_10

#### 6- Use the results from point 5 to compute the rating predictions of the original missing ratings for each of the target items (I1 and I2) using the top 10-peers.

In [None]:
for item_id in target_items:
    if item_id in reduced_spaces_10:
        reduced_matrix = reduced_spaces_10[item_id]
        peers_indices = reduced_matrix.columns
        weights = cov_matrix_mle.loc[item_id, peers_indices]
        
        # For each user, compute weighted sum using only rated (non-NaN) peers
        predicted_ratings = []
        for user_id in reduced_matrix.index:
            user_ratings = reduced_matrix.loc[user_id]
            # Only use non-NaN ratings
            valid_mask = ~user_ratings.isna()
            
            if valid_mask.sum() == 0:  # No valid ratings for this user
                predicted_ratings.append(np.nan)
            else:
                valid_ratings = user_ratings[valid_mask]
                valid_weights = weights[valid_mask]
                
                weighted_sum = (valid_ratings * valid_weights).sum()
                sum_abs_weights = valid_weights.abs().sum()
                
                target_mean = item_means[item_id]
                prediction = target_mean + (weighted_sum / sum_abs_weights)
                predicted_ratings.append(prediction)
        
        predicted_ratings = pd.Series(predicted_ratings, index=reduced_matrix.index)
        
        print(item_means[item_id])  # target item mean, read directly so it is defined even if no user had rated peers
        print(predicted_ratings.head(10))

3.883266029744903
userId
1     2.363511
3     4.132448
10    3.154438
13         NaN
15    1.641193
16    1.837532
17         NaN
18    4.820206
20    4.558270
25    3.582667
dtype: float64
3.234285033554382
userId
1          NaN
3     3.139628
10    2.368820
13         NaN
15    3.956788
16    4.269777
17         NaN
18         NaN
20    1.907591
25         NaN
dtype: float64


#### 7- Compare the results of point 3 with the results of point 6. Comment on your answer.

In [None]:
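# Observation: The predictions change when the neighborhood grows from 5 to 10 peers, and coverage improves.
# In the first ten users shown, user 25 (item 1) and user 16 (item 2) have no prediction with 5 peers but get one with 10.
# Analysis:

# Top 5 Peers: Rely on the items with the largest covariance to the target, so predictions reflect its closest neighbors.
# The cost is coverage: a user who rated none of the 5 peers gets NaN.

# Top 10 Peers: Add items with smaller covariance (ranks 6-10).
# Because the weights are normalized by the sum of absolute covariances, the extra peers shift a prediction whenever the user rated them.
# The shift is not always toward the item mean: for item 1 (mean 3.88), users 10, 15 and 20 move closer to it, while users 3, 16 and 18 move further away.
# Some predictions do not change at all (user 1 for item 1; users 3, 15 and 20 for item 2), presumably because those users rated none of the added peers.
# Conclusion: In the MLE method, Top 5 keeps predictions driven by the most similar items, while Top 10 trades some of that specificity for wider coverage.
# Top 5 is preferred here on the assumption that the weaker peers add more noise than signal; confirming this would require measuring error on held-out ratings.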
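
To put numbers on the comparison, the sketch below takes the item 1 predictions printed in points 4 and 6 for the first ten users and counts how many users gain a prediction and how many move toward or away from the item mean. The helper name `compare_predictions` is hypothetical, and ten users are far too few to generalize from, so treat it as a template for running the same check on the full prediction series.

```python
import numpy as np
import pandas as pd

# Item 1 predictions for the first ten users, copied from the printed output.
users = [1, 3, 10, 13, 15, 16, 17, 18, 20, 25]
target_mean = 3.883266029744903
pred_top5 = pd.Series(
    [2.363511, 4.065145, 2.363511, np.nan, 1.565145,
     2.565145, np.nan, 4.565145, 5.081433, np.nan],
    index=users,
)
pred_top10 = pd.Series(
    [2.363511, 4.132448, 3.154438, np.nan, 1.641193,
     1.837532, np.nan, 4.820206, 4.558270, 3.582667],
    index=users,
)


def compare_predictions(a, b, center):
    """Summarize how series b differs from series a (illustrative helper)."""
    both = a.notna() & b.notna()                 # users predicted by both settings
    dist_a = (a[both] - center).abs()
    dist_b = (b[both] - center).abs()
    return {
        "coverage_a": int(a.notna().sum()),
        "coverage_b": int(b.notna().sum()),
        "mean_abs_change": float((a[both] - b[both]).abs().mean()),
        "moved_closer": int((dist_b < dist_a).sum()),
        "moved_away": int((dist_b > dist_a).sum()),
    }


summary = compare_predictions(pred_top5, pred_top10, target_mean)
print(summary)
```

On this small sample the top-10 setting covers one more user, and the users covered by both settings split evenly between moving toward the mean and moving away from it.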

#### 8- Compare the results of point 9 in part 1 with the results of point 4. Comment on your answer.

In [None]:
# Observation: MLE (Part 2) predictions usually show higher variance and differ from the Mean-Filling predictions (Part 1).
# Analysis:

# Mean-Filling (Part 1): Filling a missing rating with the item mean makes its centered value 0.
# Those zeros add nothing to the sum of products, but the covariance is still divided by the full user count.
# This "dampens" every covariance toward 0, making items seem less related than they really are, and pulls predictions conservatively toward the average.

# MLE (Part 2): Calculates each covariance using only users who rated both items.
# The estimate is not diluted by imputed values, so it reflects the stronger relationships seen in the observed data; for pairs with few co-raters it is also noisier.
# Conclusion: MLE is preferable here. It produces sharper predictions that reflect the user's actual ratings rather than imputed averages.
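
The dampening effect can be seen on a toy example. The sketch below assumes that Part 1 fills each missing rating with the item mean and then takes the ordinary sample covariance over all users; the data are hypothetical. Both estimates share the same numerator, so the only difference is the denominator: all users minus one versus co-raters minus one.

```python
import numpy as np
import pandas as pd

# Toy ratings (hypothetical): 6 users, only 3 of them rated both items.
toy = pd.DataFrame(
    {
        "A": [5.0, 3.0, np.nan, 1.0, np.nan, np.nan],
        "B": [4.0, 2.0, np.nan, 1.0, 3.0, np.nan],
    }
)

# Mean-filling (assumed Part 1 approach): impute the item mean, then use all users.
mean_filled_cov = toy.fillna(toy.mean()).cov()

# Pairwise approach from this part: use only users who rated both items.
centered = toy - toy.mean()
mask = toy.notna().astype(int)
filled = centered.fillna(0)
pairwise_cov = filled.T.dot(filled) / (mask.T.dot(mask) - 1)

shrink = mean_filled_cov.loc["A", "B"] / pairwise_cov.loc["A", "B"]
print(mean_filled_cov.round(3))
print(pairwise_cov.round(3))
print(f"Shrink factor for (A, B): {shrink:.2f}")
```

Here the shrink factor equals (co-raters - 1) / (users - 1), so the sparser a pair of items is, the more mean-filling understates its covariance.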

#### 9- Compare the results of point 11 in part 1 with the results of point 6. Comment on your answer.

In [None]:
# Observation: The gap between the two methods tends to persist or widen when the neighborhood grows to 10 peers.
# Analysis:

# Imputation Bias: In Part 1, treating missing ratings as "average" tells the model that users are average on every item they have not rated.
# This suppresses the signal of the additional 5 peers (ranks 6-10), which have smaller covariances to begin with.

# Robustness: MLE ignores the missing entries entirely during the covariance calculation. Even with 10 peers, the weights are based purely on observed ratings.
# Conclusion: The MLE method scales better to the larger neighborhood.
# The expanded neighborhood contributes according to actual user behavior, avoiding the "noise" introduced by the artificial mean-filling in Part 1.