**Mass Personalization in Recommender Systems: An Analysis of Amazon User Ratings Data by Saran Duncan**

This notebook explores the Amazon Reviews 2023 data from [this repository](https://amazon-reviews-2023.github.io/index.html#data-fields).

### Dataset
The repository provides several processed datasets covering May 2000 to Sep. 2023. I focus on the Amazon data in the Beauty and Personal Care category.

train.rating:
- Training file.
- Each line is a training instance: userID\t itemID\t rating\t timestamp (if available)

test.rating:
- Test file (positive instances).
- Each line is a test instance: userID\t itemID\t rating\t timestamp (if available)

test.negative:
- Test file (negative instances).
- Each line corresponds to the same line of test.rating and contains 99 negative samples.
- Each line has the format: (userID,itemID)\t negativeItemID1\t negativeItemID2 ...
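
As a rough illustration of the `test.negative` layout described above, the sketch below parses one line into its `(userID, itemID)` pair and its list of negative item IDs. The helper name `parse_negative_line` and the sample line are made up for illustration; real files may differ in whitespace or ID types.

```python
def parse_negative_line(line):
    """Split one test.negative line into ((user_id, item_id), [negative item ids])."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    # The first field looks like "(userID,itemID)": drop the parentheses, split on the comma
    user_id, item_id = fields[0].strip("()").split(",")
    negatives = fields[1:]
    return (user_id.strip(), item_id.strip()), negatives


# Hypothetical sample line with three negatives (a real line would carry 99)
sample = "(0,25)\t1064\t174\t2791\n"
pair, negatives = parse_negative_line(sample)
print(pair, negatives)
```

IDs are kept as strings here, since the Amazon user and product IDs used later in this notebook are alphanumeric.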

In [None]:
!pip install datasets
from datasets import load_dataset

In [None]:
data = load_dataset("McAuley-Lab/Amazon-Reviews-2023",
                      "0core_timestamp_Beauty_and_Personal_Care",
                      trust_remote_code=True, split='train').to_pandas()

In [None]:
print(data.head())

                        user_id parent_asin rating      timestamp
0  AFKZENTNBQ7A7V7UXW5JJI6UGRYQ  B07X4FKLNK    3.0  1581313195358
1  AFKZENTNBQ7A7V7UXW5JJI6UGRYQ  B07YYG76X1    1.0  1609700981786
2  AFKZENTNBQ7A7V7UXW5JJI6UGRYQ  B08G81QQ9L    5.0  1612052493701
3  AFKZENTNBQ7A7V7UXW5JJI6UGRYQ  B085PRT2MP    1.0  1614915977684
4  AFKZENTNBQ7A7V7UXW5JJI6UGRYQ  B00Z03RC80    1.0  1616743454733


In [None]:
# Check the length of the DataFrame
print(f"Data Length: {len(data)}")

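# Check for max/min values in specific columns.
# Note: the IDs are strings, and the timestamp column appears to be stored as strings too,
# so min()/max() compare lexicographically (hence "Min Timestamp" > "Max Timestamp" numerically in the output).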
print(f"Max User ID: {data['user_id'].max()}")
print(f"Min Parent ASIN: {data['parent_asin'].min()}")
print(f"Min Rating: {data['rating'].min()}")
print(f"Min Timestamp: {data['timestamp'].min()}")
print(f"Max Timestamp: {data['timestamp'].max()}")

Data Length: 17151479
Max User ID: AHZZZZ3CH2ZFY4CLZTVGAJYCAWIA
Min Parent ASIN: 0000060356
Min Rating: 1.0
Min Timestamp: 1000050959000
Max Timestamp: 999546809000
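
The minimum timestamp printed above is numerically larger than the maximum, which suggests the `timestamp` column holds strings and is being compared character by character. A minimal sketch of the effect on a toy Series (the two values are copied from the output above), assuming a cast with `pd.to_numeric` is acceptable for the real column:

```python
import pandas as pd

# Toy column holding the two extreme values printed above, stored as strings
ts = pd.Series(["1000050959000", "999546809000"])

# String comparison is lexicographic, so "1..." sorts before "9..."
print(ts.min(), ts.max())

# Casting to integers restores numeric ordering
ts_num = pd.to_numeric(ts)
print(ts_num.min(), ts_num.max())

# The numeric minimum can then be read as a date (milliseconds since the epoch)
print(pd.to_datetime(int(ts_num.min()), unit="ms"))
```

The same cast would apply to `rating` if it turns out to be stored as text as well.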


In [None]:
print(data.columns)

Index(['user_id', 'parent_asin', 'rating', 'timestamp'], dtype='object')


In [None]:
# Extract the columns of interest and give them readable names
column_names2 = ['User ID', 'Product ID', 'Rating', 'Timestamp']
data_subset = data[['user_id', 'parent_asin', 'rating', 'timestamp']].copy()
data_subset.columns = column_names2

# Display the new DataFrame with renamed columns
print(data_subset.head())

                        User ID  Product ID Rating      Timestamp
0  AFKZENTNBQ7A7V7UXW5JJI6UGRYQ  B07X4FKLNK    3.0  1581313195358
1  AFKZENTNBQ7A7V7UXW5JJI6UGRYQ  B07YYG76X1    1.0  1609700981786
2  AFKZENTNBQ7A7V7UXW5JJI6UGRYQ  B08G81QQ9L    5.0  1612052493701
3  AFKZENTNBQ7A7V7UXW5JJI6UGRYQ  B085PRT2MP    1.0  1614915977684
4  AFKZENTNBQ7A7V7UXW5JJI6UGRYQ  B00Z03RC80    1.0  1616743454733


In [None]:
# Add a 'year' column derived from the timestamps
import pandas as pd
# Timestamps are in milliseconds. Cast to numeric first: passing strings
# together with unit= is deprecated in pandas and triggers a warning.
data['year'] = pd.to_datetime(pd.to_numeric(data['timestamp']), unit='ms').dt.year
print(data.head())

                        user_id parent_asin rating      timestamp  year
0  AFKZENTNBQ7A7V7UXW5JJI6UGRYQ  B07X4FKLNK    3.0  1581313195358  2020
1  AFKZENTNBQ7A7V7UXW5JJI6UGRYQ  B07YYG76X1    1.0  1609700981786  2021
2  AFKZENTNBQ7A7V7UXW5JJI6UGRYQ  B08G81QQ9L    5.0  1612052493701  2021
3  AFKZENTNBQ7A7V7UXW5JJI6UGRYQ  B085PRT2MP    1.0  1614915977684  2021
4  AFKZENTNBQ7A7V7UXW5JJI6UGRYQ  B00Z03RC80    1.0  1616743454733  2021


In [None]:
# Number of distinct years in which each user rated at least one product
import pandas as pd
data.groupby('user_id')['year'].nunique().sort_values(ascending=False)

user_id,year
AGWDYYVVWM3DC3CASUZKXK67G6IA,21
AFIJLAW3HIOMRUFSWNH54IJ3XQAA,18
AG5HHTEBKFEPR7PNAWBEV3BUJT2Q,17
AHMG3ALUBE3FEBHODTBHP5J24YDA,16
AGRBRL2U46HIDKDOKCDG5QVVW3LA,16
...,...
AHZZZSRO7DFLUK3XGUJ3VXSCALDQ,1
AHZZZSYG3JBS74IT2JZF2SIBBNDQ,1
AHZZZTMIGGOPC2OMKOZTQHAGP74Q,1
AHZZZU7NJSVAH7IFQZ2WENUSDOKQ,1


In [None]:
# (user, year) pairs in which the user gave more than 4 ratings
filtered_data = data.groupby(['user_id', 'year'])['rating'].count().reset_index(name='rating_count')
filtered_data = filtered_data[filtered_data['rating_count'] > 4]
filtered_data = filtered_data.sort_values(by='rating_count', ascending=False)
print(filtered_data)

                                user_id  year  rating_count
2901223    AEZP6Z2C5AVQDZAJECQYZWQRNG3Q  2020          1173
6325420    AG73BVBKUOH22USSFJA5ZWL7AKXA  2020          1094
1878893  AEOK4TQIKGO23SJKZ6PW4FETNNDA_1  2020           721
6325421    AG73BVBKUOH22USSFJA5ZWL7AKXA  2021           667
8778805  AGZUJTI7A3JFKB4FP5JOH6NVAJIQ_1  2021           638
...                                 ...   ...           ...
4418219    AFKAQJ2HHGZTTVQNGN4VTNFO3CEQ  2020             5
4418143    AFKAPRIUHOT23VPVRZJ4MMUBD6QQ  2014             5
4418022    AFKAOCD3QF3Z33LZWYFRA5UVKMFA  2020             5
4417990    AFKANWU4SFLPKZTYSTKB2MIKNK2A  2015             5
7735237    AGOIL3JOLYGPQY36UTO7XF3BSRYA  2015             5

[307348 rows x 3 columns]


In [None]:
# Group the users who rated the same product in the same year
product_groups = data.groupby(['parent_asin', 'year'])['user_id'].apply(list).reset_index(name='users')
# Keep only groups with more than one user; .copy() avoids a SettingWithCopyWarning on the next assignment
product_groups = product_groups[product_groups['users'].apply(len) > 1].copy()
product_groups['user count'] = product_groups['users'].apply(len)
print(product_groups)

        parent_asin  year                                              users  \
0        0000060356  2016  [AERIAI7NKMRVVDKNBQBP5BURZG4A, AF2EAJP6YLG4X4Z...   
8        0008159181  2017  [AG4HINTXX2Y2XDCUA7IQCXBJAV2A, AHWWBWEY4Z5FSH6...   
11       0061689165  2009  [AHPMT5PGCM2P6SYATH5R37VMO6XQ, AEREHMHPE3XD2KQ...   
12       0061689165  2010  [AFEUDM4YGZ2O5O6FJ7TES2G7AKKQ, AHYMV2WMV5CY4GB...   
14       006283827X  2019  [AGQ55KS3MG3ISEJHPZBROYMLNDAQ, AEA3RPVLH2KLFZN...   
...             ...   ...                                                ...   
1870034  B0CJD2SSSR  2021  [AHO75G35ALBCOBA4CVWASKIIZTUQ, AGHS5Y7C63ZF2PX...   
1870036  B0CJPBTZSV  2019  [AEDN5DSMVV2KAS32RHBSQSMFSYFA, AEFYW5GRUL45NOK...   
1870041  B0CK2TXK5B  2021  [AERF32744FF33PEMZ4B7WJLN3LQQ, AFELHTVW3OL6Z6Q...   
1870045  B0CKFCJFWV  2021  [AECT7BIGGB3JDBXQSKU54M57TEGQ, AHNQ4C42KBKK46Y...   
1870050  B0CL3MH43L  2020  [AHCBI3RXZVPJW5STELT4PCTNIEKQ, AFQS766GFBXUG2J...   

         user count  
0                

In [None]:
product_groups.sort_values(by='user count', ascending=False)

Unnamed: 0,parent_asin,year,users,user count
1078305,B01LSUQSB0,2020,"[AHKWW7UAKEI3TIJYRHQEOXPV5JHQ, AHBIGTZCJOWAGUE...",12936
1078304,B01LSUQSB0,2019,"[AHMQXVUL6BW6LNCUMGQRVLATJ3VQ, AE54OHCX3MPLLYK...",8569
1835364,B0BVGHXZJ1,2021,"[AGI3VWFRRXY5MAZALSP2KBQYZMXQ, AFD43HMFBTG2TUI...",8436
1835363,B0BVGHXZJ1,2020,"[AET24VX4XLW3GZ2XNFXZNPTE2AVA, AGZVVKFA4QTXBRL...",7224
1852445,B0C4LYR6Z3,2019,"[AFZUK3MTBIBEDQOPAK3OATUOUKLA, AGS4BZHM7C75W52...",7139
...,...,...,...,...
80,0735356238,2018,"[AEURQ6KMTVX56GF7YFYZPX6YT6XQ, AHEURJF2X4KX374...",2
81,0735356238,2019,"[AHYI24NY54LIE652CBSYOWMRJGBQ, AG2KRHTPM6JR4PW...",2
82,0735356238,2020,"[AH3Q6SV3SPQGQMM2NZFRAWLZ77IA, AG3BO2Q4NZGN455...",2
83,0735356238,2021,"[AGJDVMLXB32DIEDGS6JMKHKXUN3Q, AEKYMJB767VZQYS...",2


In [None]:
from datasets import load_dataset
import pandas as pd
from collections import defaultdict
import json

def get_avg(product_groups, outfile):
    """For each year, count shared product groups per user pair and write one summary row."""
    for year in product_groups['year'].unique():
        subset = product_groups[product_groups['year'] == year]
        subset.to_pickle(f"./Group{year}.pkl")

        # Nested hashmap: user_overlap[u1][u2] = number of product groups shared by u1 and u2
        user_overlap = defaultdict(lambda: defaultdict(int))

        # Iterate through each product group of this year
        for users in subset['users']:
            num_users = len(users)

            # Iterate through all pairs of users within the group
            for i in range(num_users):
                for j in range(i + 1, num_users):
                    user1 = users[i]
                    user2 = users[j]

                    # Increment the overlap count in both directions
                    user_overlap[user1][user2] += 1
                    user_overlap[user2][user1] += 1

        with open(f'user_overlap_{year}.json', 'w') as f:
            json.dump(user_overlap, f)

        # Average overlap over pairs that share more than one product group.
        # Each pair of distinct users is stored in both directions, so pair_count
        # is twice the number of unordered pairs; the average is unaffected.
        pair_count = 0
        tot_count = 0
        for user1, overlaps in user_overlap.items():
            for user2, count in overlaps.items():
                if count > 1:
                    pair_count += 1
                    tot_count += count
        avg_overlap = tot_count / pair_count if pair_count else 0
        outfile.write(f"{year}\t{pair_count}\t{avg_overlap}\n")

def main():
    # Load the full training split as a regular (non-streaming) Dataset
    data = load_dataset("McAuley-Lab/Amazon-Reviews-2023",
                          "0core_timestamp_Beauty_and_Personal_Care",
                          trust_remote_code=True, split='train').to_pandas()
    # Timestamps are in milliseconds and appear to be loaded as strings,
    # so cast to numeric before converting to datetime
    data['date'] = pd.to_datetime(pd.to_numeric(data['timestamp']), unit='ms')
    data['year'] = data['date'].dt.year

    # Get product groups with more than one user
    product_groups = data.groupby(['parent_asin', 'year'])['user_id'].apply(list).reset_index(name='users')
    product_groups = product_groups[product_groups['users'].apply(len) > 1].copy()
    product_groups['user count'] = product_groups['users'].apply(len)

    # The context manager closes the file even if get_avg raises
    with open('info.tsv', 'w') as outfile:
        outfile.write("year\tcount\tavg\n")
        get_avg(product_groups, outfile)

main()
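
The pairwise counting in `get_avg` can also be written with `itertools.combinations` and a `Counter` keyed on unordered pairs, which stores each pair once instead of twice. This is a sketch on toy data, not a drop-in replacement for the function above; `count_shared_groups` and the sample groups are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_shared_groups(groups):
    """Count, for every unordered user pair, how many groups contain both users."""
    overlap = Counter()
    for users in groups:
        # sorted(set(...)) removes repeated users and gives each pair a canonical order
        for pair in combinations(sorted(set(users)), 2):
            overlap[pair] += 1
    return overlap


# Toy product groups: each inner list holds the users who rated one product in one year
groups = [["u1", "u2", "u3"], ["u1", "u2"], ["u2", "u3"], ["u1", "u2"]]
overlap = count_shared_groups(groups)

# Keep only pairs sharing more than one group, as in get_avg
shared = {pair: n for pair, n in overlap.items() if n > 1}
avg_overlap = sum(shared.values()) / len(shared) if shared else 0
print(shared, avg_overlap)
```

Because each pair appears once, `len(shared)` here corresponds to half of the `pair_count` that `get_avg` writes, while the average stays the same. The number of pairs still grows quadratically with group size, so very large product groups remain expensive either way.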


In [None]:
group_2019 = product_groups[product_groups['year']==2019]
print(group_2019)

        parent_asin  year                                              users  \
14       006283827X  2019  [AGQ55KS3MG3ISEJHPZBROYMLNDAQ, AEA3RPVLH2KLFZN...   
37       0326592458  2019  [AFR5YWVJLK6IDMF6UV6EOLCQJSEA, AFJHVWJHJCEOIHH...   
40       0376792108  2019  [AHXQSKZ3WGX4KRTZMLTSU5J3RR4A, AFOWEJJ5XRM4VIX...   
70       0623012030  2019  [AEQPUAK57AXBJ7SCQ5ATVBOEAP7A, AH3QSOQ5UXSPPWL...   
81       0735356238  2019  [AHYI24NY54LIE652CBSYOWMRJGBQ, AG2KRHTPM6JR4PW...   
...             ...   ...                                                ...   
1870011  B0CHSV6PJ8  2019  [AEXHNCR4FKNOT5J7RX2ZCUSO3QXQ, AF5S7QMQ6AZ5BG7...   
1870021  B0CHWJSM9G  2019  [AGCSZJRLGZDN2WLDJBPN3AYUPENA, AFE7JCJ53MIMDZO...   
1870023  B0CHX2RY8Z  2019  [AFM5L5X4I3B22FELXQGDRRAR3E5A, AG3Y2DY3L4IFTAM...   
1870026  B0CJ69126H  2019  [AGZ75WR5EZD5ZDHS4TK2JZOYYXGA, AHNEOPIF3RTTHQX...   
1870036  B0CJPBTZSV  2019  [AEDN5DSMVV2KAS32RHBSQSMFSYFA, AEFYW5GRUL45NOK...   

         user count  
14               

In [None]:
group_2019.to_pickle("./Group2019.pkl")
pd.read_pickle("./Group2019.pkl")

Unnamed: 0,parent_asin,year,users,user count
14,006283827X,2019,"[AGQ55KS3MG3ISEJHPZBROYMLNDAQ, AEA3RPVLH2KLFZN...",2
37,0326592458,2019,"[AFR5YWVJLK6IDMF6UV6EOLCQJSEA, AFJHVWJHJCEOIHH...",6
40,0376792108,2019,"[AHXQSKZ3WGX4KRTZMLTSU5J3RR4A, AFOWEJJ5XRM4VIX...",3
70,0623012030,2019,"[AEQPUAK57AXBJ7SCQ5ATVBOEAP7A, AH3QSOQ5UXSPPWL...",3
81,0735356238,2019,"[AHYI24NY54LIE652CBSYOWMRJGBQ, AG2KRHTPM6JR4PW...",2
...,...,...,...,...
1870011,B0CHSV6PJ8,2019,"[AEXHNCR4FKNOT5J7RX2ZCUSO3QXQ, AF5S7QMQ6AZ5BG7...",7
1870021,B0CHWJSM9G,2019,"[AGCSZJRLGZDN2WLDJBPN3AYUPENA, AFE7JCJ53MIMDZO...",3
1870023,B0CHX2RY8Z,2019,"[AFM5L5X4I3B22FELXQGDRRAR3E5A, AG3Y2DY3L4IFTAM...",5
1870026,B0CJ69126H,2019,"[AGZ75WR5EZD5ZDHS4TK2JZOYYXGA, AHNEOPIF3RTTHQX...",4


In [None]:
# Finding similarities for users in 2019
import pandas as pd
from collections import defaultdict

group_2019 = pd.read_pickle("./Group2019.pkl")

#hashmap to store user overlaps
user_overlap = defaultdict(lambda: defaultdict(int))

# Iterate through each product group in 2019
for _, row in group_2019.iterrows():
    users = row['users']
    num_users = len(users)

    # Iterate through all pairs of users within the group
    for i in range(num_users):
        for j in range(i + 1, num_users):
            user1 = users[i]
            user2 = users[j]

            # Increment the overlap count for both user pairs
            user_overlap[user1][user2] += 1
            user_overlap[user2][user1] += 1


# print(user_overlap['A123']['B456'])
# print(user_overlap['A123'])

# Print every user pair that shares more than one product group
for user1, overlaps in user_overlap.items():
    for user2, count in overlaps.items():
        if count > 1:
            print(f"User {user1} and User {user2} have {count} product groups in common.")


In [None]:
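# Average overlap across user pairs that share more than one product group
pair_count = 0
tot_count = 0
for user1, overlaps in user_overlap.items():
    for user2, count in overlaps.items():  # count = number of product groups the two users share
        if count > 1:
            pair_count += 1
            tot_count += count
# Guard against division by zero when no pair shares more than one group
avg_overlap = tot_count / pair_count if pair_count else 0
print(avg_overlap)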


2.0672630748452145
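
The cells that follow rely on `items_data` and `user_to_product_df`, whose definitions are not shown in this notebook. One plausible construction, assuming `items_data` is a ratings table with the renamed columns `User ID`, `Product ID`, and `Rating`, is a pivot in which unrated products are filled with 0. The toy rows below are invented for illustration.

```python
import pandas as pd

# Toy stand-in for items_data (hypothetical rows, not taken from the real dataset)
items_data = pd.DataFrame({
    "User ID": ["u1", "u1", "u2", "u3"],
    "Product ID": ["p1", "p2", "p2", "p3"],
    "Rating": [5.0, 3.0, 4.0, 1.0],
})

# Rows are users, columns are products, cells hold the rating (0 where unrated).
# aggfunc="mean" collapses repeated ratings of one product by the same user.
user_to_product_df = items_data.pivot_table(
    index="User ID", columns="Product ID", values="Rating",
    aggfunc="mean", fill_value=0,
)
print(user_to_product_df)
```

A dense pivot of the full 17-million-row table would not fit in memory, which is presumably why the matrix used below covers only 1620 users and 6982 products.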


In [None]:
from scipy.sparse import csr_matrix

# transform matrix to scipy sparse matrix
user_to_product_sparse_df = csr_matrix(user_to_product_df.values)
user_to_product_sparse_df

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 14206 stored elements and shape (1620, 6982)>

**Fitting K-Nearest Neighbours model to the scipy sparse matrix:**

In [None]:
from sklearn.neighbors import NearestNeighbors

knn_model = NearestNeighbors(metric='cosine', algorithm='brute')
knn_model.fit(user_to_product_sparse_df)

**Specify User ID and number of similar users we want to consider here**

In [None]:
import numpy as np
from pprint import pprint

user_id = 'AE2CVCNCDLMNEBC6XZLMTHJTYEXA'
print(" Few of the products rated by the User:")
pprint(list(items_data[items_data['User ID'] == user_id]['Product ID'])[:10])

def get_similar_users(user, n = 5):
  """Return the n nearest users to `user` and their cosine distances."""
  user_index = user_to_product_df.index.get_loc(user) # Row position of the user ID
  knn_input = np.asarray([user_to_product_df.values[user_index]])
  print(knn_input.sum(axis=-1)) # Sum of the user's ratings, as a sanity check
  # Ask for n+1 neighbours because the query user is normally returned as its own nearest neighbour
  distances, indices = knn_model.kneighbors(knn_input, n_neighbors=n+1)

  print("Top",n,"users who are very similar to the user-",user, "are: ")
  print(" ")

  # Drop the first neighbour (assumed to be the query user) and map row positions back to user IDs
  similar_users = [user_to_product_df.index[i] for i in indices.flatten()[1:]]

  for i, similar_user_id in enumerate(similar_users):
    print(i+1,". User:", similar_user_id, "separated by distance of", distances[0][i+1])

  return similar_users, distances.flatten()[1:]  # Similar user IDs and their distances

similar_user_list, distance_list = get_similar_users(user_id,5)

 Few of the products rated by the User:
['B09NDHDB4H', 'B07YD5WC45', 'B07216GPNS']
[4.]
Top 5 users who are very similar to the user- AE2CVCNCDLMNEBC6XZLMTHJTYEXA are: 
 
1 . User: AHZJXBE3DUHP6UYPORBUYKTSJ75Q separated by distance of 1.0
2 . User: AHZGQXTGR3WB6CQR3PP2TB2YPTUA_1 separated by distance of 1.0
3 . User: AHZGQXTGR3WB6CQR3PP2TB2YPTUA separated by distance of 1.0
4 . User: AHZBG2UL6QCVICFURKFMPEMBPMZQ separated by distance of 1.0
5 . User: AHZATJF2HWPHRX46E5VVI5FYWHLA separated by distance of 1.0
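
Every neighbour listed above sits at a cosine distance of exactly 1.0. For non-negative rating vectors, that value means the two users have no rated product in common, so these neighbours carry no real similarity signal. A small check on made-up vectors (the helper `cosine_distance` is written out here for illustration):

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Made-up rating vectors over five products (0 = not rated)
query = np.array([5.0, 0.0, 3.0, 0.0, 0.0])
disjoint = np.array([0.0, 4.0, 0.0, 2.0, 0.0])     # rates none of the query's products
overlapping = np.array([4.0, 0.0, 0.0, 2.0, 0.0])  # shares the first product

# No shared products: dot product is 0, so the distance is 1
d_disjoint = cosine_distance(query, disjoint)
# At least one shared product: the distance falls below 1
d_overlap = cosine_distance(query, overlapping)
print(d_disjoint, d_overlap)
```

With such a sparse matrix (14206 stored ratings across 1620 users), many users may have no co-rated products at all, in which case the choice of neighbours among the ties is arbitrary.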


**Next, we pick the top products to recommend by weighting the ratings given by the similar users.**

In [None]:
similar_user_list, distance_list

(['AHZJXBE3DUHP6UYPORBUYKTSJ75Q',
  'AHZGQXTGR3WB6CQR3PP2TB2YPTUA_1',
  'AHZGQXTGR3WB6CQR3PP2TB2YPTUA',
  'AHZBG2UL6QCVICFURKFMPEMBPMZQ',
  'AHZATJF2HWPHRX46E5VVI5FYWHLA'],
 array([1., 1., 1., 1., 1.]))

In [None]:
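# Normalise the distances so that the weights sum to 1.
# Note: this gives larger weights to more distant users; here all distances are equal, so the weights are uniform.
weight_list = distance_list/np.sum(distance_list)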
weight_list

array([0.2, 0.2, 0.2, 0.2, 0.2])
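
Dividing each distance by the sum of distances gives the largest weight to the most distant neighbour. If the intent is for closer users to count more, one alternative is to weight by cosine similarity (1 minus distance) and fall back to uniform weights when every similarity is zero, as happens with the distances above. The helper name `similarity_weights` is hypothetical, and this is a sketch rather than the method used in this notebook.

```python
import numpy as np


def similarity_weights(distances):
    """Turn cosine distances into weights that sum to 1 and favour closer users."""
    similarities = 1.0 - np.asarray(distances, dtype=float)
    total = similarities.sum()
    if np.isclose(total, 0.0):
        # No neighbour shares anything with the query user: use equal weights
        return np.full(len(similarities), 1.0 / len(similarities))
    return similarities / total


# All distances equal to 1.0, as in the output above: uniform fallback
uniform = similarity_weights([1.0, 1.0, 1.0, 1.0, 1.0])
# Mixed distances: the closest user receives the largest weight
mixed = similarity_weights([0.2, 0.6, 1.0])
print(uniform, mixed)
```

For the five neighbours found above this yields the same uniform weights of 0.2, so the downstream results would not change in this particular case.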

**Getting ratings of all products by derived similar users**

In [None]:
import numpy as np

# Row positions of the similar users in the user-product matrix
similar_user_indices = [user_to_product_df.index.get_loc(u) for u in similar_user_list]
# Select their rating rows. The weights are applied once, in the weighted-sum cell further down;
# multiplying by weight_list here as well would apply them twice.
product_ratings_sim_users = user_to_product_df.values[similar_user_indices]
product_ratings_sim_users

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
products_list = user_to_product_df.columns
products_list

Index(['1453085815', 'B000067E30', 'B000068PBJ', 'B00009RB1I', 'B0000ZLF18',
       'B00011JN5G', 'B00014DMLO', 'B00018TN1S', 'B0001DDK2G', 'B0001HYKBM',
       ...
       'B0C1ND3FVR', 'B0C3LVTX2V', 'B0C3XDNTJV', 'B0C49FG2KQ', 'B0C4LR3NZT',
       'B0C4XVMFJY', 'B0C5FZDPD3', 'B0C5V83YMB', 'B0C9CWKY9G', 'B0CDNZ7F2V'],
      dtype='object', name='Product ID', length=6982)

In [None]:
print("Weight list shape:", len(weight_list))
print("product_ratings_sim_users shape:", product_ratings_sim_users.shape)
print("Number of products:", len(products_list))

Weight list shape: 5
product_ratings_sim_users shape: (5, 6982)
Number of products: 6982


**Broadcasting the weight vector to the shape of the similar-user rating matrix, so that the two can be multiplied elementwise**

In [None]:
weight_list = weight_list[:,np.newaxis] + np.zeros(len(products_list))
weight_list.shape

(5, 6982)

In [None]:
new_rating_matrix = weight_list*product_ratings_sim_users
mean_rating_list = new_rating_matrix.sum(axis =0)
mean_rating_list

array([0., 0., 0., ..., 0., 0., 0.])
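
The explicit broadcast followed by an elementwise product and a column sum amounts to a weighted sum of the neighbours' rating rows, which NumPy can compute directly as a vector-matrix product. A sketch on a toy matrix (all values invented), assuming `weights` is the 1-D weight vector and `ratings` is the neighbours-by-products array:

```python
import numpy as np

weights = np.array([0.5, 0.3, 0.2])      # one weight per similar user
ratings = np.array([[5.0, 0.0, 1.0],     # rows: similar users
                    [0.0, 4.0, 1.0],     # columns: products
                    [0.0, 0.0, 5.0]])

# Elementwise route: broadcast the weights across columns, then sum over users
scores_broadcast = (weights[:, np.newaxis] * ratings).sum(axis=0)

# Equivalent vector-matrix product
scores_matmul = weights @ ratings

print(scores_broadcast, scores_matmul)
```

NumPy broadcasts the `(3, 1)` weight column against the `(3, 3)` matrix on its own, so materialising a full-size weight matrix is optional.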

In [None]:
from pprint import pprint
def recommend_products(n):
  n = min(len(mean_rating_list),n)
  pprint(list(products_list[np.argsort(mean_rating_list)[::-1][:n]]))

In [None]:
print("Products recommended based on similar users are: ")
recommend_products(10)

Products recommended based on similar users are: 
['B0B5XFVSXY',
 'B0BBS6GZ8T',
 'B08GCJZW9V',
 'B086GST51S',
 'B08QD2RNJT',
 'B08BZ1RHPS',
 'B07FP2C8N8',
 'B08P6HB6PH',
 'B08H8RWPT1',
 'B07T3PC35M']
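
Since most entries of `mean_rating_list` are zero, a plain `argsort` can return products that no similar user rated, and it can also return products the target user has already rated. A hedged sketch of a stricter selection follows; `top_n_unrated` and its arguments are hypothetical names, and skipping already-rated products is an assumption about the desired behaviour.

```python
import numpy as np


def top_n_unrated(scores, already_rated, product_ids, n=10):
    """Return up to n product IDs with the highest positive score, skipping rated ones."""
    scores = np.asarray(scores, dtype=float).copy()
    # Mask out products the user has already rated
    scores[np.asarray(already_rated, dtype=bool)] = -np.inf
    order = np.argsort(scores)[::-1]
    # Drop products with no support from any similar user (score <= 0)
    return [product_ids[i] for i in order[:n] if scores[i] > 0]


product_ids = ["p1", "p2", "p3", "p4"]
scores = [0.9, 0.0, 0.4, 0.7]
already_rated = [True, False, False, False]
recommended = top_n_unrated(scores, already_rated, product_ids, n=3)
print(recommended)
```

In this toy case `p1` is skipped because it is already rated and `p2` because its score is zero, so fewer than `n` products come back.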


In [None]:
import pandas as pd
import plotly.express as px

try:
    df = pd.read_csv('info.tsv', sep='\t')
    fig = px.scatter(df,
                     x='year',
                     y='avg',
                     #size='count',
                     hover_name='year',
                     title='Average Number of Product Groups in Common Over the Years',
                     labels={'year': 'Year', 'avg': 'Average Overlap', 'count': 'Number of Pairs'})

    fig.show()
except pd.errors.EmptyDataError:
    print("The file 'info.tsv' is empty or has an incorrect format. Please check the data and file format.")
except FileNotFoundError:
    print("The file 'info.tsv' was not found. Please ensure it has been created and is in the correct directory.")

