**Personalized Recommendation System with Collaborative Filtering using Scikit-surprise**

This notebook builds a personalized recommendation system using the SVD algorithm (collaborative filtering).
It predicts user preferences based on past interactions and filters recommendations based on user-specified types and price ranges.

Steps to use:
1. Upload your CSV files to a folder in your Google Drive at the following path:
   `/MyDrive/RECOMMENDER/`
   Required files:
   - `PROJECT_LOCATIONS_TYPES.csv`: Project types and locations.
   - `NYK_USERS_SPENT.csv`: User spending data.
   - `NYK_USERS_PROJ_LIKES.csv`: User project likes and ratings.
   - `NYK_PROJ_LOCATIONS.csv`: Project locations with associated details.

2. Mount your Google Drive to access the files (the mount step is included in the script).

3. Run the code blocks in order to:
   - Preprocess and merge data.
   - Train the recommendation model.
   - Generate personalized recommendations.
   - Export the recommendations to a CSV file (stored in the same Google Drive folder).

Output:
The recommendations will be saved as `recommendations.csv`.

Each row in the `recommendations.csv` file represents a single recommendation: one `USER_ID` paired with one recommended `ITEM_ID`.

For example, the user with USER_ID 293774 is recommended project locations 7225, 7224, 6974, 6955, and so on (up to 10 recommendations).


**RMSE Summary**

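**Before Filtering**: Average RMSE = 0.0335 across 5 cross-validation folds.

**After Filtering**: RMSE = 0.6047 on the filtered recommendations. These user-item pairs have no observed rating, so this value measures how far the estimates deviate from the global mean rating that Surprise fills in for such pairs, not error against real ratings.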

Rating range: 0–5.

In [None]:
# Install required packages
!pip install scikit-surprise

# Import necessary modules
import pandas as pd
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import cross_validate, train_test_split
from collections import defaultdict, Counter
import ast


Collecting scikit-surprise
  Downloading scikit_surprise-1.1.4.tar.gz (154 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp311-cp311-linux_x86_64.whl size=2505182 sha256=58a7c4aeeed10976291691c90792fd00d54d1d8fd6b56ff623bc1b4fc35cd6f3
  Stored in directory: /root/.cache/pip/wheels/2a/8f/6e/7e28991

In [None]:
# 1. Mount & load data from Google Drive
from google.colab import drive
drive.mount('/content/drive')

project_locations_types = pd.read_csv("/content/drive/MyDrive/RECOMMENDER/PROJECT_LOCATIONS_TYPES.csv", encoding='utf-8')
users_spent = pd.read_csv("/content/drive/MyDrive/RECOMMENDER/NYK_USERS_SPENT.csv", encoding='utf-8')
users_proj_likes = pd.read_csv("/content/drive/MyDrive/RECOMMENDER/NYK_USERS_PROJ_LIKES.csv", encoding='utf-8')
proj_locations = pd.read_csv("/content/drive/MyDrive/RECOMMENDER/NYK_PROJ_LOCATIONS.csv", encoding='utf-8')

# Helper: safely parse stringified lists; returns [] for malformed or missing values
def safe_eval(type_string):
    try:
        return ast.literal_eval(type_string)
    except (ValueError, SyntaxError):
        return []


Mounted at /content/drive
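
As a quick illustration of how `safe_eval` behaves, the sketch below repeats the helper so that it runs standalone and applies it to a well-formed list string, a malformed string, and a missing value. The sample inputs are made up for illustration and are not taken from the CSV files.

```python
import ast

def safe_eval(type_string):
    """Parse a stringified Python literal; fall back to [] on failure."""
    try:
        return ast.literal_eval(type_string)
    except (ValueError, SyntaxError):
        return []

# Illustrative inputs (assumed, not from the dataset).
parsed = safe_eval("['cafe', 'bar']")     # well-formed list literal
malformed = safe_eval("['cafe', 'bar'")   # unbalanced bracket raises SyntaxError
missing = safe_eval(float("nan"))         # NaN from an empty CSV cell raises ValueError

print(parsed, malformed, missing)
```

Returning an empty list keeps the later `explode` call from failing on bad rows, at the cost of silently dropping their type information.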


In [None]:
print(project_locations_types.head())
print(users_spent.head())
print(users_proj_likes.head())
print(proj_locations.head())



   PROJECT_LOC_ID     SLUG
0            4863  italian
1            4864  italian
2            4881  italian
3            4913  italian
4            4863    pasta
        ID  AMOUNT  PROJECT_LOCATION_ID  MIN_AMOUNT  MAX_AMOUNT  \
0  1972987  185.64               6835.0      185.64      185.64   
1  2002212  173.66               5837.0      173.66      173.66   
2  1186294   98.87               5776.0        5.31      551.99   
3  1988260  250.41               5616.0      250.41      250.41   
4   796310   36.06               6480.0       21.28      173.66   

   STDDEV_AMOUNT  AVERAGE_AMOUNT  
0            NaN       185.64000  
1            NaN       173.66000  
2     106.943024       104.01590  
3            NaN       250.41000  
4      52.096791        85.95352  
   USER_ID  PROJECT_LOCATION_ID  IS_FAVORITE
0   293774                 4720            1
1   293774                 4721            1
2   293774                 4722            1
3   293774                 4723            1


# 2. Data Preprocessing

In [None]:

# 2.1 User Preferences (Types)
user_pref_types = users_proj_likes.merge(proj_locations, on="PROJECT_LOCATION_ID")
user_pref_types['TYPES'] = user_pref_types['TYPES'].apply(safe_eval)
user_pref_types = user_pref_types.explode('TYPES')
user_pref_types = user_pref_types.groupby('USER_ID')['TYPES'].apply(list).reset_index()


In [None]:
# check the data

print(user_pref_types.head())



   USER_ID                                              TYPES
0       74  [restaurant, food, point_of_interest, establis...
1      676  [cafe, bar, store, food, point_of_interest, es...
2      923  [bar, restaurant, food, point_of_interest, est...
3     1372  [bar, point_of_interest, establishment, bar, p...
4     1738  [bar, restaurant, food, point_of_interest, est...


In [None]:
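# 2.2 User Preferences (Price Categories)
# Bin each average spend into a price tier. pd.cut intervals are right-closed by default:
# (0, 50] -> '$', (50, 100] -> '$$', (100, 200] -> '$$$', (200, inf] -> '$$$$'.
users_spent['c.price'] = pd.cut(users_spent['AVERAGE_AMOUNT'], bins=[0, 50, 100, 200, float('inf')],
                                 labels=['$', '$$', '$$$', '$$$$'])
# Use each user's most frequent tier as their preferred price category.
user_pref_price = users_spent.groupby('ID')['c.price'].apply(lambda x: x.mode()[0]).reset_index()
user_pref_price.rename(columns={'ID': 'USER_ID'}, inplace=True)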


In [None]:
# check the data

print(user_pref_price.head())

   USER_ID c.price
0     8432     $$$
1    45378      $$
2    46602     $$$
3    47805      $$
4    51280     $$$
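
The binning rule can also be sketched in plain Python to make the tier boundaries explicit. This is only an approximation of the `pd.cut` call above, not the notebook's implementation: `price_label`, `EDGES`, and `LABELS` are hypothetical names, the spending history is illustrative, and tie-breaking for the most frequent tier may differ from pandas' `mode()`.

```python
from bisect import bisect_left
from collections import Counter

EDGES = [50, 100, 200]                # upper edges of the first three bins
LABELS = ['$', '$$', '$$$', '$$$$']

def price_label(amount):
    """Return the price tier for an amount, or None if it is missing or not positive."""
    if amount is None or amount != amount or amount <= 0:  # amount != amount catches NaN
        return None
    # Right-closed bins: an amount equal to an edge stays in the lower tier.
    return LABELS[bisect_left(EDGES, amount)]

# Illustrative spending history for one user.
amounts = [36.06, 50.0, 98.87, 185.64]
tiers = [price_label(a) for a in amounts]

# The most frequent tier plays the role of the user's preferred price category.
preferred = Counter(tiers).most_common(1)[0][0]
print(tiers, preferred)
```

Note that an amount of exactly 0 falls outside every bin, which matches `pd.cut` with its default `include_lowest=False`.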


In [None]:
# 2.3 Ratings
ratings = users_proj_likes.merge(proj_locations, on="PROJECT_LOCATION_ID")
ratings = ratings[['USER_ID', 'PROJECT_LOCATION_ID', 'RATING']].rename(columns={'PROJECT_LOCATION_ID': 'ITEM_ID'})
ratings['RATING'] = ratings['RATING'].fillna(ratings['RATING'].median())

In [None]:
# check the data

ratings.head()

   USER_ID  ITEM_ID  RATING
0   293774     4720     4.2
1   293774     4721     4.2
2   293774     4722     3.3
3   293774     4723     3.9
4   293774     4724     3.9


In [None]:
# 2.4 Process item details

items = proj_locations[['PROJECT_LOCATION_ID', 'TYPES', 'AVERAGE_AMOUNT']].rename(columns={'PROJECT_LOCATION_ID': 'ITEM_ID'})

# Flatten the list of all non-null types and compute the most common type

all_types = []
for row in items['TYPES'].dropna():
    try:
        all_types.extend(ast.literal_eval(row) if isinstance(row, str) else row)
    except (ValueError, SyntaxError):
        continue

most_common_type = Counter(all_types).most_common(1)[0][0]

def process_types(row):
    try:
        if isinstance(row, str):
            return ast.literal_eval(row)
        elif isinstance(row, list):
            return row
    except (ValueError, SyntaxError):
        return [most_common_type]
    return [most_common_type]

items['TYPES'] = items['TYPES'].apply(process_types)

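# Explode 'TYPES' into separate rows and group back into lists by 'ITEM_ID'
items = items.explode('TYPES').groupby('ITEM_ID')['TYPES'].apply(list).reset_index()

# Bin each item's average amount into a price label.
# Map by ITEM_ID rather than assigning by row position: after the groupby above,
# `items` no longer shares a row index with `proj_locations`.
item_avg_amount = proj_locations.groupby('PROJECT_LOCATION_ID')['AVERAGE_AMOUNT'].mean()
items['d.price'] = pd.cut(
    items['ITEM_ID'].map(item_avg_amount),
    bins=[0, 50, 100, 200, float('inf')],
    labels=['$', '$$', '$$$', '$$$$']
)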


In [None]:
# check the data

items.head()

   ITEM_ID                                              TYPES d.price
0      183  [cafe, restaurant, food, point_of_interest, es...       $
1      184  [restaurant, food, point_of_interest, establis...       $
2      186  [restaurant, food, point_of_interest, establis...       $
3     2595  [restaurant, food, point_of_interest, establis...     $$$
4     2597  [restaurant, food, point_of_interest, establis...       $
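
One point worth checking when attaching prices to `items`: pandas aligns a Series assigned to a DataFrame column by row index, not by `ITEM_ID`. The toy example below uses made-up data with the notebook's column names, and `by_position` and `by_id` are hypothetical column names. It shows how a regrouped frame can end up with another item's amount unless the value is looked up by ID.

```python
import pandas as pd

# Toy stand-in for proj_locations: item 20 is listed before item 10.
proj = pd.DataFrame({
    "PROJECT_LOCATION_ID": [20, 10],
    "TYPES": [["bar"], ["cafe"]],
    "AVERAGE_AMOUNT": [150.0, 30.0],
})

# Same explode/groupby pattern as the notebook; groupby sorts by ITEM_ID,
# so row 0 of `items` is now item 10.
items = (
    proj.rename(columns={"PROJECT_LOCATION_ID": "ITEM_ID"})
        .explode("TYPES")
        .groupby("ITEM_ID")["TYPES"].apply(list)
        .reset_index()
)

# Assignment by row index: item 10 receives item 20's amount.
items["by_position"] = proj["AVERAGE_AMOUNT"]

# Lookup by ITEM_ID: each item keeps its own amount.
amount_by_id = proj.groupby("PROJECT_LOCATION_ID")["AVERAGE_AMOUNT"].mean()
items["by_id"] = items["ITEM_ID"].map(amount_by_id)

print(items)
```

Averaging per `PROJECT_LOCATION_ID` before mapping also guards against duplicate location rows, which would otherwise make the lookup ambiguous.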


In [None]:
# 3. Prepare Data for Surprise
# rating_scale should span the actual range of RATING values; Surprise clips estimates to it.
reader = Reader(rating_scale=(1, 5))
dataset = Dataset.load_from_df(ratings[['USER_ID', 'ITEM_ID', 'RATING']], reader)


In [None]:
# 4. Train and Evaluate using SVD & RMSE
model = SVD()
results = cross_validate(model, dataset, measures=['RMSE'], cv=5, verbose=True)
print(f"Average RMSE: {results['test_rmse'].mean()}")

# Train on the full dataset
trainset = dataset.build_full_trainset()
model.fit(trainset)


Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.0340  0.0334  0.0333  0.0331  0.0336  0.0335  0.0003  
Fit time          2.45    2.26    3.28    2.51    2.62    2.62    0.35    
Test time         0.41    0.59    0.37    0.19    0.16    0.34    0.16    
Average RMSE: 0.03347549581059595


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7d24162e7050>
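
For intuition about what is being fitted and scored, the Surprise documentation describes the SVD estimate as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors. The sketch below writes that rule and the RMSE metric in plain Python. The function names and all parameter values are made up for illustration and are not read from the trained model.

```python
import math

def svd_estimate(mu, b_u, b_i, p_u, q_i):
    """Biased matrix-factorization estimate: mu + b_u + b_i + (q_i dot p_u)."""
    return mu + b_u + b_i + sum(q * p for q, p in zip(q_i, p_u))

def rmse(pairs):
    """Root mean squared error over (true_rating, estimate) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

# Made-up parameters for one user and one item with two latent factors.
est = svd_estimate(mu=3.5, b_u=0.2, b_i=-0.1, p_u=[0.5, -1.0], q_i=[0.4, 0.1])

# Made-up (true, estimated) rating pairs.
error = rmse([(4.0, 3.0), (2.0, 2.0)])

print(round(est, 4), round(error, 4))
```

Because RMSE is expressed in rating units, it should be read against the width of the rating scale.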

In [None]:
# 5. Generate Recommendations
def get_top_n(predictions, n=10):
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n

# Generate predictions for every user-item pair that has no rating in the training set
testset = trainset.build_anti_testset()
predictions = model.test(testset)
top_n = get_top_n(predictions, n=10)
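
A possible variant of the top-N step, sketched here under the assumption that predictions unpack as 5-tuples like Surprise's `Prediction` objects, uses `heapq.nlargest` so that each user's full list does not have to be sorted. `get_top_n_heap` and the toy predictions are hypothetical.

```python
import heapq
from collections import defaultdict

def get_top_n_heap(predictions, n=10):
    """Group (uid, iid, true_r, est, details) tuples by user and keep the n highest estimates."""
    by_user = defaultdict(list)
    for uid, iid, _true_r, est, _details in predictions:
        by_user[uid].append((iid, est))
    # nlargest returns the n best pairs in descending order of estimate.
    return {
        uid: heapq.nlargest(n, user_ratings, key=lambda pair: pair[1])
        for uid, user_ratings in by_user.items()
    }

# Toy predictions: (user, item, true rating, estimate, details).
toy_predictions = [
    ("u1", "i1", 3.0, 4.0, {}),
    ("u1", "i2", 3.0, 4.5, {}),
    ("u1", "i3", 3.0, 2.0, {}),
    ("u2", "i1", 3.0, 3.5, {}),
]
top2 = get_top_n_heap(toy_predictions, n=2)
print(top2)
```

With only 10 items kept per user, the difference matters mainly when each user has a large number of candidate items.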


In [None]:
# 6. Filter Recommendations by User Preferences
filtered_recommendations = []
for uid, user_ratings in top_n.items():
    user_prefs = user_pref_types[user_pref_types['USER_ID'] == uid]
    user_price = user_pref_price[user_pref_price['USER_ID'] == uid]

    if not user_prefs.empty and not user_price.empty:
        pref_types = user_prefs['TYPES'].iloc[0]
        pref_price = user_price['c.price'].iloc[0]

        for iid, est in user_ratings:
            item_row = items[items['ITEM_ID'] == iid]
            if not item_row.empty:
                item_types = item_row['TYPES'].iloc[0]
                item_price = item_row['d.price'].iloc[0]

                if any(t in pref_types for t in item_types) and item_price <= pref_price:
                    filtered_recommendations.append([uid, iid])
    else:
        filtered_recommendations.extend([[uid, iid] for iid, _ in user_ratings])
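
The filtering rule can be isolated into a small predicate, sketched below on toy data. It ranks price tiers explicitly instead of comparing the `'$'` strings directly, and it assumes that an item or user with a missing price tier should be dropped. `PRICE_RANK`, `keep_item`, and the sample items are hypothetical.

```python
PRICE_RANK = {'$': 1, '$$': 2, '$$$': 3, '$$$$': 4}

def keep_item(item_types, item_price, pref_types, pref_price):
    """True if the item shares a type with the user and is not pricier than the user's tier."""
    if item_price not in PRICE_RANK or pref_price not in PRICE_RANK:
        return False  # assumption: drop pairs where either price tier is missing
    shares_type = not set(item_types).isdisjoint(pref_types)
    return shares_type and PRICE_RANK[item_price] <= PRICE_RANK[pref_price]

# Toy item catalogue keyed by ITEM_ID: (types, price tier).
items_by_id = {
    101: (['cafe', 'food'], '$'),
    102: (['bar'], '$$$$'),
    103: (['museum'], '$'),
}
pref_types, pref_price = ['cafe', 'bar'], '$$'

candidates = [101, 102, 103]
kept = [iid for iid in candidates
        if keep_item(*items_by_id[iid], pref_types, pref_price)]
print(kept)
```

Keeping items in a dictionary keyed by `ITEM_ID` also replaces the per-item DataFrame scan with a constant-time lookup.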

In [None]:
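# 7. RMSE After Filtering
# Note: `predictions` comes from build_anti_testset(), whose "true" rating for every
# pair is the global mean fill value. This RMSE therefore measures how far the estimates
# deviate from that mean, not error against observed ratings.
filtered_pairs = {(uid, iid) for uid, iid in filtered_recommendations}
filtered_predictions = [pred for pred in predictions if (pred.uid, pred.iid) in filtered_pairs]
rmse_filtered = accuracy.rmse(filtered_predictions)
print(f"RMSE After Filtering: {rmse_filtered}")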

RMSE: 0.6047
RMSE After Filtering: 0.6047480717506137


In [None]:
# 8. Save Recommendations to CSV
recommendations_df = pd.DataFrame(filtered_recommendations, columns=['USER_ID', 'ITEM_ID'])
recommendations_df.to_csv("/content/drive/MyDrive/RECOMMENDER/recommendations.csv", index=False)
print("Recommendations saved to recommendations.csv.")

Recommendations saved to recommendations.csv.


In [None]:
recommendations_df.head()

   USER_ID  ITEM_ID
0   293774     8284
1   293774     7225
2   293774     7223
3   293774     7224
4   293774     8974
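
Since the file holds one recommendation per row, a consumer will usually regroup the rows per user. A minimal sketch using only the standard library follows. It reads an inline stand-in for `recommendations.csv` that contains the first three rows shown above, and `recs_by_user` is a hypothetical name.

```python
import csv
import io
from collections import defaultdict

# Inline stand-in for recommendations.csv (first three rows of the output above).
csv_text = """USER_ID,ITEM_ID
293774,8284
293774,7225
293774,7223
"""

# Collect each user's recommended items in file order.
recs_by_user = defaultdict(list)
for row in csv.DictReader(io.StringIO(csv_text)):
    recs_by_user[int(row["USER_ID"])].append(int(row["ITEM_ID"]))

print(dict(recs_by_user))
```

To read the real file, the `io.StringIO` object would be replaced by an open file handle on the exported CSV.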
