## Collaborative filtering recommender system

- Dataset: first 10000 restaurant reviews in Philadelphia
    - Total reviews : 10000
    - Total users: 8236
    - Total restaurants: 470
- Models:
    - Singular Value Decomposition model (SVD)
    - Neural Collaborative Filtering model (NCF)
- Evaluation metrics: MSE and MAE

In [1]:
# Import libraries
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Mount Google Drive: uncomment the two lines below if running in Colab, and change the paths accordingly
# from google.colab import drive
# drive.mount('/content/drive', force_remount=True)

In [3]:
!pip install surprise

Collecting surprise
  Using cached surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise
  Using cached scikit-surprise-1.1.3.tar.gz (771 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py): started
  Building wheel for scikit-surprise (setup.py): finished with status 'done'
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp310-cp310-win_amd64.whl size=1085601 sha256=a17fa4916c047e1be8f2067dbfa2822986a77f05555197193caed72b00b99402
  Stored in directory: c:\users\toran\appdata\local\pip\cache\wheels\df\e4\a6\7ad72453dd693f420b0c639bedeec34641738d11b55d8d9b84
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.3 surprise-0.1


In [4]:
path_to_review = 'review_philadelphia_50000.json'
path_to_business = 'business_philadelphia.json'

In [6]:
# Read review data from json file
review_philadelphia_original_df = pd.read_json(path_to_review, lines=True)

review_philadelphia_df = review_philadelphia_original_df[:10000]

review_philadelphia_df = review_philadelphia_df[['user_id', 'business_id', 'stars']]
print('Shape of 10000 reviews:', review_philadelphia_df.shape)
print('Attributes in the review dataframe:', review_philadelphia_df.columns)

# read business data
restaurant_philadelphia_df = pd.read_json(path_to_business, lines=True)
restaurant_philadelphia_df = restaurant_philadelphia_df.drop(['city', 'state', 'postal_code', 'is_open'], axis=1)

print('Shape of all restaurants in Philadelphia:', restaurant_philadelphia_df.shape)
print('Attributes in the restaurant dataframe:', restaurant_philadelphia_df.columns)

Shape of 10000 reviews: (10000, 3)
Attributes in the review dataframe: Index(['user_id', 'business_id', 'stars'], dtype='object')
Shape of all restaurants in Philadelphia: (5853, 10)
Attributes in the restaurant dataframe: Index(['business_id', 'name', 'address', 'latitude', 'longitude', 'stars',
       'review_count', 'attributes', 'categories', 'hours'],
      dtype='object')


### Data preprocessing
- Collect every distinct `user_id` in the dataset into the array `users`
- Collect every distinct `business_id` in the dataset into the array `restaurants`
- Replace each `user_id` with its index in `users`
- Replace each `business_id` with its index in `restaurants`
- Remove duplicate reviews (same `user_id` and `business_id`) from the review dataframe

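The index conversion listed above can be sketched with `pandas.factorize`, which assigns each distinct ID a contiguous integer in order of first appearance (the same order that `unique()` returns). This is only an illustration on made-up data: `toy_df` and its values are assumptions, not part of the Yelp dataset, and the notebook itself uses `unique()` with `np.argmax` instead.

```python
import pandas as pd

# Hypothetical toy data standing in for the review dataframe
toy_df = pd.DataFrame({
    "user_id": ["u_a", "u_b", "u_a", "u_a"],
    "business_id": ["b_x", "b_x", "b_y", "b_x"],
    "stars": [5, 3, 4, 1],
})

# Keep only the first review for each (user, restaurant) pair
toy_df = toy_df.drop_duplicates(subset=["user_id", "business_id"]).reset_index(drop=True)

# factorize returns (integer codes, array of the distinct values)
user_idx, toy_users = pd.factorize(toy_df["user_id"])
item_idx, toy_restaurants = pd.factorize(toy_df["business_id"])

# Same column layout as the NCF input used later: user, item, label
encoded = pd.DataFrame({"user": user_idx, "item": item_idx, "label": toy_df["stars"]})
print(encoded)
```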
In [7]:
# find all the restaurant_id and user_id in review dataframe
users = review_philadelphia_df['user_id'].unique()
restaurants = review_philadelphia_df['business_id'].unique()

print('Number of user', users.shape[0])
print('Number of restaurant', restaurants.shape[0])

# Remove duplicate review by unique ['user_id', 'business_id']
review_df_no_duplicates = review_philadelphia_df[~review_philadelphia_df.duplicated(subset=['user_id', 'business_id'])]

# Resetting the indices of the dataframe to make them contiguous
review_df_no_duplicates = review_df_no_duplicates.reset_index(drop=True)

Number of user 8236
Number of restaurant 470


In [8]:
review_philadelphia_df.head(),review_philadelphia_df.shape

(                  user_id             business_id  stars
 0  _7bHUi9Uuf5__HHc_Q8guQ  kxX2SOes4o-D3ZQBkiMRfA      5
 1  eUta8W_HdHMXPzLBBZhL1A  04UD14gamNjLY0IDYVhHJg      1
 2  smOvOajNG0lS4Pq7d8g4JQ  RZtGWDLCAtuipwaZ-UfjmQ      4
 3  Dd1jQj7S-BFGqRbApFzCFw  YtSqYv1Q_pOltsVPSx54SA      5
 4  IQsF3Rc6IgCzjVV9DE8KXg  eFvzHawVJofxSnD7TgbZtg      5,
 (10000, 3))

### Cross-validation of SVD model
- Dataset: the deduplicated reviews (9925 of the first 10000)
- Folds: 5
- Metrics: MSE, MAE

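As a reminder of what the two measures reported by the cross-validation compute, the sketch below evaluates MSE and MAE with NumPy. The rating arrays are made up for illustration and are not taken from the dataset. MSE squares each error, so one large miss weighs more heavily than in MAE.

```python
import numpy as np

# Hypothetical true and predicted star ratings
true_stars = np.array([5.0, 1.0, 4.0, 3.0])
pred_stars = np.array([4.0, 2.0, 4.0, 5.0])

errors = pred_stars - true_stars
mse = np.mean(errors ** 2)       # mean squared error
mae = np.mean(np.abs(errors))    # mean absolute error
print(f"MSE: {mse:.4f}, MAE: {mae:.4f}")
```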
In [9]:
from surprise import Dataset
from surprise import Reader
from surprise import SVD, accuracy
from surprise.model_selection import cross_validate

review_df_no_duplicates_svd = review_df_no_duplicates.copy()
review_df_no_duplicates_svd.columns = ["user_id", "business_id", "stars"]

# converting dataframe to desired format by surprise
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(review_df_no_duplicates_svd[["user_id",
                                                     "business_id",
                                                     "stars"]], reader)

# Use the SVD algorithm
svd_model = SVD(n_factors=100, n_epochs=100, biased=True)

# Perform cross-validation
cross_validate(svd_model, data, measures=['MSE', 'MAE'], cv=5, verbose=True);


Evaluating MSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
MSE (testset)     1.4226  1.4151  1.4377  1.4096  1.4040  1.4178  0.0117  
MAE (testset)     0.9509  0.9472  0.9497  0.9497  0.9342  0.9463  0.0062  
Fit time          0.54    0.44    0.52    0.43    0.44    0.47    0.05    
Test time         0.02    0.02    0.01    0.02    0.02    0.02    0.00    


In [10]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(review_df_no_duplicates, test_size=0.25, random_state=17)

### Training SVD model
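- Dataset: the deduplicated reviews, 75% for the training set, 25% for the test set
- Number of latent factors: 100
- Epochs: 100
- Optimizer: Stochastic Gradient Descent
- Learning rate: 0.005 (the Surprise default, since the code does not set `lr_all`)
- Loss function: regularized squared error (evaluated with MSE and MAE)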

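For intuition about what such a model learns, here is a minimal NumPy sketch of biased matrix factorization trained by SGD on a regularized squared error, with the prediction rule `mu + b_u + b_i + q_i . p_u`. It is not Surprise's implementation: the function name `fit_biased_mf`, the toy ratings, and the hyperparameters are assumptions chosen for illustration.

```python
import numpy as np

def fit_biased_mf(ratings, n_users, n_items, n_factors=4, n_epochs=300,
                  lr=0.01, reg=0.02, seed=0):
    """Fit r_hat(u, i) = mu + b_u + b_i + q_i . p_u with plain SGD (illustrative only)."""
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in ratings])      # global mean rating
    b_u = np.zeros(n_users)                       # user biases
    b_i = np.zeros(n_items)                       # item biases
    p = rng.normal(0, 0.1, (n_users, n_factors))  # user factors
    q = rng.normal(0, 0.1, (n_items, n_factors))  # item factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + b_u[u] + b_i[i] + q[i] @ p[u])
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            p_u_old = p[u].copy()                 # use the pre-update value for q's step
            p[u] += lr * (err * q[i] - reg * p[u])
            q[i] += lr * (err * p_u_old - reg * q[i])

    def predict(u, i):
        # Clip to the 1-5 star scale
        return float(np.clip(mu + b_u[u] + b_i[i] + q[i] @ p[u], 1, 5))

    return predict

# Hypothetical (user, item, stars) triples
toy_ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
predict = fit_biased_mf(toy_ratings, n_users=3, n_items=3)
train_mae = np.mean([abs(r - predict(u, i)) for u, i, r in toy_ratings])
print(f"Training MAE on the toy data: {train_mae:.3f}")
print(f"Predicted stars for unseen pair (user 0, item 2): {predict(0, 2):.2f}")
```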
In [11]:
# dividing the data as trainset and testset
train_df_svd = train_df.copy()
# train_df_svd.columns = ["user_id", "business_id", "stars"]
train_data = Dataset.load_from_df(train_df_svd[['user_id', 'business_id', 'stars']], reader)
trainset = train_data.build_full_trainset()

test_df_svd = test_df.copy()
# test_df_svd.columns = ["user_id", "business_id", "stars"]
test_data = Dataset.load_from_df(test_df_svd[['user_id', 'business_id', 'stars']], reader)
testset = test_data.build_full_trainset().build_testset()

# building model on train set
svd_model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x20d0152bf70>

### Evaluation of SVD model


In [12]:
print('----------------------------------------------------------------------------------------------------')

# testing model on test set
prediction = svd_model.test(testset)
print('MSE and MAE of the SVD prediction:')
print()
accuracy.mse(prediction)
accuracy.mae(prediction)

print('----------------------------------------------------------------------------------------------------')

----------------------------------------------------------------------------------------------------
MSE and MAE of the SVD prediction:

MSE: 1.3852
MAE:  0.9382
----------------------------------------------------------------------------------------------------


### Build NCF model
Reference: X. He, L. Liao, H. Zhang, L. Nie, X. Hu and T.-S. Chua. Neural Collaborative Filtering. 2017.

In [13]:
from sklearn.model_selection import train_test_split
from keras.models import Model
from keras.layers import Embedding, Flatten, Input, Dropout, Dense, Concatenate, Dot
from keras.optimizers import Adam
from matplotlib.ticker import MaxNLocator

# Define the NCF model
def get_model(num_users, num_items, latent_dim):
    # Define inputs
    item_input = Input(shape=[1], name='item-input')
    user_input = Input(shape=[1], name='user-input')

    # MLP Embeddings
    item_embedding_mlp = Embedding(num_items + 1, latent_dim, name='item-embedding-mlp')(item_input)
    item_vec_mlp = Flatten(name='flatten-item-mlp')(item_embedding_mlp)

    user_embedding_mlp = Embedding(num_users + 1, latent_dim, name='user-embedding-mlp')(user_input)
    user_vec_mlp = Flatten(name='flatten-user-mlp')(user_embedding_mlp)

    # MF Embeddings
    item_embedding_mf = Embedding(num_items + 1, latent_dim, name='item-embedding-mf')(item_input)
    item_vec_mf = Flatten(name='flatten-item-mf')(item_embedding_mf)

    user_embedding_mf = Embedding(num_users + 1, latent_dim, name='user-embedding-mf')(user_input)
    user_vec_mf = Flatten(name='flatten-user-mf')(user_embedding_mf)

    # MLP layers
    concat = Concatenate(name='concat')([item_vec_mlp, user_vec_mlp])
    concat_dropout = Dropout(0.2)(concat)
    fc_1 = Dense(100, name='fc-1', activation='relu')(concat_dropout)
    fc_1_dropout = Dropout(0.2)(fc_1)
    fc_2 = Dense(50, name='fc-2', activation='relu')(fc_1_dropout)
    fc_2_dropout = Dropout(0.2)(fc_2)

    # Prediction from both layers
    pred_mlp = Dense(10, name='pred-mlp', activation='relu')(fc_2_dropout)
    pred_mf = Dot(axes=-1, normalize=False)([item_vec_mf, user_vec_mf])
    combine_mlp_mf = Concatenate(name='combine-mlp-mf')([pred_mf, pred_mlp])

    # Final prediction
    result = Dense(1, name='result', activation='relu')(combine_mlp_mf)

    model = Model([user_input, item_input], result)

    return model

In [14]:
# convert all the user_id and business_id in the review dataframe to their index in users and restaurants
def replace_ids(row):
    # Get the 'user_id' and 'business_id' values from the row
    user_id = row['user_id']
    business_id = row['business_id']

    user_index = np.argmax(users == user_id)
    business_index = np.argmax(restaurants == business_id)

    return pd.Series([user_index, business_index], index=['user_id', 'business_id'])

# get review data for ncf
review_df_no_duplicates_ncf = review_df_no_duplicates.copy()

# Apply the function to replace the 'user_id' and 'business_id' values in the DataFrame with their indices
review_df_no_duplicates_ncf[['user_id', 'business_id']] = review_df_no_duplicates_ncf.apply(replace_ids, axis=1)

# change the column name of the dataframe
review_df_no_duplicates_ncf.columns = ["user", "item", "label"]

In [15]:
review_df_no_duplicates_ncf.shape

(9925, 3)

### Cross-validation of NCF model
- Dataset: the deduplicated reviews (9925 rows)
- Folds: 5
- Loss function: MAE (MSE is tracked as an additional metric)

In [16]:
review_df_no_duplicates_ncf.head(10)

   user  item  label
0     0     0      5
1     1     1      1
2     2     2      4
3     3     3      5
4     4     4      5
5     5     5      5
6     6     6      4
7     7     7      5
8     8     8      4
9     9     9      5


In [17]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True)
# The tiny test_size is only used to shuffle the rows: a single review is held out
# and the remaining 9924 go into cross-validation
train_df_kfold, test_df_kfold = train_test_split(review_df_no_duplicates_ncf, test_size=0.00001, random_state=17)
#train_df_kfold.reset_index(drop=True,inplace =True)

for train_index, val_index in kf.split(train_df_kfold):
    print(f"Train indices: {train_index}, Validation indices: {val_index}")
    train_set = train_df_kfold.iloc[train_index]
    val_set = train_df_kfold.iloc[val_index]

    # Build a fresh model for every fold so that no weights carry over
    ncf_model = get_model(users.shape[0], restaurants.shape[0], 100)
    ncf_model.compile(optimizer=Adam(learning_rate=0.01), loss='mean_absolute_error', metrics=['mean_squared_error', 'mae'])

    ncf_model.fit([train_set.user, train_set.item], train_set.label, epochs=20, verbose='auto')

    # evaluate returns [loss, mean_squared_error, mae], following the order of `metrics` in compile
    _, mse, mae = ncf_model.evaluate([val_set.user, val_set.item], val_set.label, verbose=2)


Train indices: [   0    1    4 ... 9921 9922 9923], Validation indices: [   2    3    8 ... 9911 9913 9918]
Epoch 1/20 ... Epoch 20/20
63/63 - 0s - loss: 0.9786 - mean_squared_error: 1.5458 - mae: 0.9786 - 301ms/epoch - 5ms/step
Train indices: [   0    1    2 ... 9919 9921 9922], Validation indices: [   6   18   33 ... 9915 9920 9923]
Epoch 1/20 ... Epoch 20/20
63/63 - 0s - loss: 0.9513 - mean_squared_error: 1.5327 - mae: 0.9513 - 251ms/epoch - 4ms/step
Train indices: [   1    2    3 ... 9918 9920 9923], Validation indices: [   0    4   19 ... 9919 9921 9922]
Epoch 1/20
Epoch 2/20
[output truncated]
Epoch 2/20 ... Epoch 20/20
63/63 - 0s - loss: 1.0116 - mean_squared_error: 1.6322 - mae: 1.0116 - 378ms/epoch - 6ms/step
Train indices: [   0    1    2 ... 9921 9922 9923], Validation indices: [  10   13   16 ... 9906 9908 9909]
Epoch 1/20 ... Epoch 20/20
62/62 - 0s - loss: 0.9625 - mean_squared_error: 1.5358 - mae: 0.9625 - 227ms/epoch - 4ms/step


### Training NCF model
- Dataset: the deduplicated reviews, 75% for the training set, 25% for the test set
- Embedding dimension: 100
- Epochs: 20
- Optimizer: Adam
- Learning rate: 0.01
- Loss function: MAE

In [18]:
# show the model structure
ncf_model = get_model(users.shape[0], restaurants.shape[0], 100)
ncf_model.compile(optimizer=Adam(learning_rate=0.01), loss='mean_absolute_error', metrics=['mean_squared_error', 'mae'])
ncf_model.summary()

Model: "model_5"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 item-input (InputLayer)        [(None, 1)]          0           []                               
                                                                                                  
 user-input (InputLayer)        [(None, 1)]          0           []                               
                                                                                                  
 item-embedding-mlp (Embedding)  (None, 1, 100)      47100       ['item-input[0][0]']             
                                                                                                  
 user-embedding-mlp (Embedding)  (None, 1, 100)      823700      ['user-input[0][0]']             
                                                                                            

In [19]:
# Split the index-encoded dataframe instead of reusing train_df / test_df:
# those still hold the raw string IDs, which Keras cannot cast to float
# ("Cast string to float is not supported"). Using the same test_size and
# random_state as in In [10] reproduces the same row split, because
# review_df_no_duplicates_ncf has the same rows in the same order.
train_df_ncf, test_df_ncf = train_test_split(review_df_no_duplicates_ncf, test_size=0.25, random_state=17)

# train NCF model
history = ncf_model.fit([train_df_ncf.user.to_numpy(), train_df_ncf.item.to_numpy()], train_df_ncf.label.to_numpy(), epochs=20)
pd.Series(history.history['loss']).plot(logy=True)
plt.xlabel("Epoch")
plt.ylabel("Train Error")
plt.show()

### Evaluation of NCF model

In [None]:
# evaluate NCF model
ncf_model.evaluate([test_df_ncf.user, test_df_ncf.item],  test_df_ncf.label, verbose=2);

In [None]:
prediction = ncf_model.predict([np.array([0]), np.array([5])])
print(prediction)

### Conclusion of Collaborative filtering recommender system
- Cross-validation
    - SVD is better (mean MAE 0.9463, versus 0.95 to 1.01 in the recorded NCF folds)
- Prediction
    - NCF is better
- Speed
    - SVD is much faster than NCF


### Hybrid recommender system
- Combine the content-based and collaborative filtering (CF) systems
- The output of the content-based system is the input of the collaborative filtering system
- Output of the content-based system: the top 10 recommended restaurants
- The CF system then selects its top 5 restaurants from within the content-based output

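The two-stage pipeline described above can be sketched as a small re-ranking function. The names `rerank_candidates` and `toy_predict` and all scores below are hypothetical stand-ins: in this notebook the candidates come from the content-based cosine similarities and the ratings from the SVD or NCF model. Note that restaurant `d` has the highest predicted rating in the toy data but is never returned, because the content-based stage did not shortlist it.

```python
from typing import Callable, Dict, List

def rerank_candidates(user_id: str,
                      similarity: Dict[str, float],
                      predict_rating: Callable[[str, str], float],
                      n_candidates: int = 10,
                      k: int = 5) -> List[str]:
    # Stage 1: content-based shortlist, highest similarity first
    candidates = sorted(similarity, key=similarity.get, reverse=True)[:n_candidates]
    # Stage 2: collaborative filtering re-ranks the shortlist by predicted rating
    ranked = sorted(candidates, key=lambda b: predict_rating(user_id, b), reverse=True)
    return ranked[:k]

# Hypothetical content-based similarities and CF rating predictions for one user
toy_similarity = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
toy_ratings = {"a": 2.0, "b": 4.5, "c": 3.0, "d": 5.0}

def toy_predict(user_id: str, business_id: str) -> float:
    return toy_ratings[business_id]

top = rerank_candidates("some_user", toy_similarity, toy_predict, n_candidates=3, k=2)
print(top)
```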
In [None]:
import pandas as pd

review_restaurant_sampled_origin = review_philadelphia_original_df
review_restaurant_sampled_origin = review_restaurant_sampled_origin[~review_restaurant_sampled_origin.duplicated(subset=['user_id', 'business_id'])]

# Resetting the indices of the dataframe to make them contiguous
review_restaurant_sampled_origin = review_restaurant_sampled_origin.reset_index(drop=True)

# Importing data into Pandas DataFrames
business_restaurant = restaurant_philadelphia_df

In [None]:
# .copy() so that the text cleaning below modifies an independent frame rather than a slice
review_restaurant_sampled = review_restaurant_sampled_origin.loc[:9999].copy()
review_restaurant_sampled.shape

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Assuming the data has already been loaded into the following Pandas DataFrames
# business_restaurant, review_restaurant_sampled

# Step 1: Data Preprocessing
import re

# Function to clean text data
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and special characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Optionally, more preprocessing steps like stopword removal, stemming, etc.
    return text

# Apply the text cleaning function to review data
review_restaurant_sampled['text'] = review_restaurant_sampled['text'].apply(clean_text)


# Step 2: Feature Engineering
# Use TF-IDF to transform text data into feature vectors.
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

# Combine all reviews for each business
business_reviews = review_restaurant_sampled.groupby('business_id')['text'].apply(lambda reviews: ' '.join(reviews))
business_reviews = business_reviews.reset_index()

# Compute TF-IDF matrix for the review text
tfidf_matrix = tfidf_vectorizer.fit_transform(business_reviews['text'])

# Create a mapping from each business_id to its row number in the TF-IDF matrix
index_to_row_mapping = {index: row_number for row_number, index in enumerate(business_reviews['business_id'].values)}

# Function to average, for one user, the TF-IDF vectors of the businesses they reviewed
def average_tfidf_vectors(reviews):
    # Map each reviewed business_id to the corresponding row in the TF-IDF matrix
    rows_to_use = [index_to_row_mapping[idx] for idx in reviews['business_id']]
    # Use the mapped rows to index into the TF-IDF matrix and take the mean
    return tfidf_matrix[rows_to_use].mean(axis=0)

# Step 3: Profile Creation
# Create a user profile based on the restaurants the user has reviewed
# Here, we simply average the TF-IDF vectors of those restaurants (each built from all of its reviews)
# Apply the function to each user's reviews
user_profiles = review_restaurant_sampled.groupby('user_id').apply(average_tfidf_vectors)

# Convert each 1 x n_features matrix in the series to a 1-D numpy array
user_profiles = user_profiles.apply(lambda x: np.asarray(x).flatten())


# Now, since each element in user_profiles is a 1-D array, we can stack them into a 2-D structure
# We can use numpy's vstack method to stack the 1-D arrays vertically into a 2-D array
import numpy as np
user_profiles_np = np.vstack(user_profiles.values)

# Now create a DataFrame from this 2-D numpy array
user_profiles_df = pd.DataFrame(user_profiles_np, index=user_profiles.index)

# Step 4: Similarity Calculation
# Calculate cosine similarity between user profiles and business TF-IDF feature vectors
cosine_similarities = cosine_similarity(user_profiles_df, tfidf_matrix)

# Step 5: Recommendation Generation
# Create a DataFrame from the cosine similarities
similarity_df = pd.DataFrame(cosine_similarities, index=user_profiles_df.index, columns=business_reviews['business_id'])

# Function to get top N recommendations for a user
def get_recommendations(user_id, similarity_df, N=10):
    # Get the similarity scores for the user
    user_similarities = similarity_df.loc[user_id]

    # Sort the businesses by similarity score
    recommended_business_ids = user_similarities.sort_values(ascending=False).head(N).index

    # Get the business details
    recommendations = business_restaurant[business_restaurant['business_id'].isin(recommended_business_ids)]
    return recommendations

In [None]:
from sklearn.model_selection import train_test_split

# Splitting the review data into training and testing sets
# train_reviews, test_reviews = train_test_split(review_restaurant_sampled, test_size=0.2)
test_reviews = test_df


# Function to calculate precision and recall
def calculate_precision_recall(user_id, recommendations, test_data):
    # True positives: Recommended items that are relevant
    tp = len(recommendations[recommendations['business_id'].isin(test_data[test_data['user_id'] == user_id]['business_id'])])
    # False positives: Recommended items that are not relevant
    fp = len(recommendations) - tp
    # False negatives: Relevant items that are not recommended
    fn = len(test_data[test_data['user_id'] == user_id]) - tp

    precision = tp / (tp + fp) if (tp + fp) != 0 else 0
    recall = tp / (tp + fn) if (tp + fn) != 0 else 0
    return precision, recall

# Evaluate the model
cb_precisions = []
cb_recalls = []
cb_TopK = 10
svd_precisions = []
svd_recalls = []
ncf_precisions = []
ncf_recalls = []
cf_TopK = 5

count = 0

for user_id in test_reviews['user_id'].unique():

    # get prediction from content-based system
    user_test_data = test_reviews[test_reviews['user_id'] == user_id]
    recommendations = get_recommendations(user_id, similarity_df, N=cb_TopK)
    cb_recommended_restaurants = recommendations['business_id'].values

    # get prediction from content-based & SVD hybrid system
    svd_estimate_stars = []
    for restaurant_id in cb_recommended_restaurants:
        svd_estimate_stars.append(svd_model.predict(uid=user_id, iid=restaurant_id, verbose=False).est)

    svd_estimate_stars_numpy = np.array(svd_estimate_stars)
    svd_top_k_indices = np.argsort(-svd_estimate_stars_numpy)[:cf_TopK]

    svd_recommended_restaurants = [ cb_recommended_restaurants[index] for index in svd_top_k_indices ]
    svd_recommendations = business_restaurant[business_restaurant['business_id'].isin(svd_recommended_restaurants)]

    # get prediction from content-based & NCF hybrid system
    user_id_num = np.argmax(users == user_id)
    ncf_estimate_stars = []
    for restaurant_id in cb_recommended_restaurants:
        restaurant_id_num = np.argmax(restaurants == restaurant_id)
        ncf_estimate_stars.append(ncf_model.predict([np.array([user_id_num]), np.array([restaurant_id_num])], verbose=0)[0][0])

    ncf_estimate_stars_numpy = np.array(ncf_estimate_stars)
    ncf_top_k_indices = np.argsort(-ncf_estimate_stars_numpy)[:cf_TopK]

    ncf_recommended_restaurants = [ cb_recommended_restaurants[index] for index in ncf_top_k_indices ]
    ncf_recommendations = business_restaurant[business_restaurant['business_id'].isin(ncf_recommended_restaurants)]


    cb_precision, cb_recall = calculate_precision_recall(user_id, recommendations, user_test_data)
    cb_precisions.append(cb_precision)
    cb_recalls.append(cb_recall)

    svd_precision, svd_recall = calculate_precision_recall(user_id, svd_recommendations, user_test_data)
    svd_precisions.append(svd_precision)
    svd_recalls.append(svd_recall)

    ncf_precision, ncf_recall = calculate_precision_recall(user_id, ncf_recommendations, user_test_data)
    ncf_precisions.append(ncf_precision)
    ncf_recalls.append(ncf_recall)

    count += 1
    #print('User:', count)

# Calculate the average precision and recall
cb_average_precision = sum(cb_precisions) / len(cb_precisions)
cb_average_recall = sum(cb_recalls) / len(cb_recalls)
cb_f1_score = 2 * (cb_average_precision * cb_average_recall) / (cb_average_precision + cb_average_recall) if (cb_average_precision + cb_average_recall) != 0 else 0

svd_average_precision = sum(svd_precisions) / len(svd_precisions)
svd_average_recall = sum(svd_recalls) / len(svd_recalls)
svd_f1_score = 2 * (svd_average_precision * svd_average_recall) / (svd_average_precision + svd_average_recall) if (svd_average_precision + svd_average_recall) != 0 else 0

ncf_average_precision = sum(ncf_precisions) / len(ncf_precisions)
ncf_average_recall = sum(ncf_recalls) / len(ncf_recalls)
ncf_f1_score = 2 * (ncf_average_precision * ncf_average_recall) / (ncf_average_precision + ncf_average_recall) if (ncf_average_precision + ncf_average_recall) != 0 else 0

print(f"Content-based recommender system prediction Average Precision: {cb_average_precision}")
print(f"Content-based recommender system prediction Average Recall: {cb_average_recall}")
print(f"Content-based recommender system prediction F1-Score: {cb_f1_score}")
print()
print(f"Content-based & SVD hybrid recommender system prediction Average Precision: {svd_average_precision}")
print(f"Content-based & SVD hybrid recommender system prediction Average Recall: {svd_average_recall}")
print(f"Content-based & SVD hybrid recommender system prediction F1-Score: {svd_f1_score}")
print()
print(f"Content-based & NCF hybrid recommender system prediction Average Precision: {ncf_average_precision}")
print(f"Content-based & NCF hybrid recommender system prediction Average Recall: {ncf_average_recall}")
print(f"Content-based & NCF hybrid recommender system prediction F1-Score: {ncf_f1_score}")


### Conclusion about the hybrid system
- Content-based & SVD system
    - has a higher average precision than simple content-based system
    - has a much lower average recall than simple content-based system
    - slightly improved the F1-score of simple content-based system
- Content-based & NCF system
    - has a higher average precision than simple content-based system
    - has a much lower average recall than simple content-based system
    - has almost the same F1-score as simple content-based system

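The precision/recall pattern noted above is what one would expect when a list shrinks from 10 to 5 items: recall cannot increase, because the top-5 list is a subset of the top-10 list, while precision rises only if most of the hits survive the cut. A minimal sketch with made-up restaurant IDs (the function name and all values are illustrative assumptions, not results from this notebook):

```python
def precision_recall(recommended, relevant):
    # hits = recommended items that the user actually reviewed in the test set
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical test-set restaurants for one user
relevant = {"r1", "r2", "r3"}
# Content-based top 10 contains all three relevant restaurants
top10 = ["r1", "x1", "x2", "r2", "x3", "r3", "x4", "x5", "x6", "x7"]
# After CF re-ranking, the top 5 keeps two of them and drops r3
top5 = ["r1", "r2", "x1", "x2", "x3"]

p10, r10 = precision_recall(top10, relevant)
p5, r5 = precision_recall(top5, relevant)
print(f"top-10: precision={p10:.2f}, recall={r10:.2f}")
print(f"top-5:  precision={p5:.2f}, recall={r5:.2f}")
```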
*We therefore choose the content-based & SVD hybrid system: it achieves the best results, and SVD runs much faster than the NCF model.*