<a href="https://colab.research.google.com/github/tnewtont/ModCloth_Recommendation_System/blob/main/rsp_pre_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.sparse.linalg import svds
from sklearn.model_selection import train_test_split

In [2]:
# Makes the utility matrix: one row per user, one column per item, ratings as values
def make_um(df):
    um = df.pivot_table(index = 'user_id', columns = 'item_id', values = 'rating')
    um.fillna(0, inplace = True) # User-item pairs with no rating become 0
    return um
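
As a quick illustration of what the utility matrix looks like, here is a minimal sketch on a made-up ratings table. The three users and three items are hypothetical; only the column names are taken from the function above. Note that pivot_table averages duplicate (user, item) pairs by default, and that a 0 in the result stands for "no rating".

```python
import pandas as pd

# Hypothetical toy ratings; column names match those used by make_um.
ratings = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "item_id": ["a", "b", "a", "c"],
    "rating": [5, 3, 4, 2],
})

# Same two steps as make_um: pivot to users x items, then fill the gaps with 0.
um_toy = ratings.pivot_table(index="user_id", columns="item_id", values="rating").fillna(0)
print(um_toy)
```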

In [58]:
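# Makes the model using SVD
#   um:    utility matrix (users x items)
#   n:     number of singular values to keep
#   kay:   number of nearest neighbors, i.e. items recommended per user
#   p_val: Minkowski exponent for the neighbor distance (1 = L1 norm, 2 = L2 norm)
def build_model_SVD(um, n, kay, p_val):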
    R = um.values
    R_mean = np.mean(R, axis = 1)
    R_mean = R_mean.reshape(-1,1)
    R_demeaned = R - R_mean

    U, sigma, Vt = svds(R_demeaned, k = n) # This gets applied to the *demeaned* ratings matrix
    sigma = np.diag(sigma)

    R_red = U@sigma@Vt
    R_red = R_red + R_mean # Un-demean the ratings matrix

    um_red = pd.DataFrame(R_red, columns = um.columns, index = um.index)
    cs_red = um_red.corr(method = 'spearman')

    nn_red = NearestNeighbors(n_neighbors = kay, p = p_val)
    nn_red.fit(cs_red)

    return nn_red
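
The SVD step of the function above can be tried in isolation. The sketch below runs it on a small random matrix (hypothetical data, not the ModCloth ratings) and covers only the demean, truncate, and un-demean steps, not the correlation or nearest-neighbor fit. With SciPy's default solver, k must be smaller than the smaller dimension of the matrix.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
R = rng.integers(0, 6, size=(6, 5)).astype(float)  # hypothetical 6 users x 5 items

# Subtract each user's mean rating, as build_model_SVD does.
R_mean = R.mean(axis=1, keepdims=True)
R_demeaned = R - R_mean

# Keep only 2 singular values (k must be < min(R.shape) for the default solver).
U, sigma, Vt = svds(R_demeaned, k=2)
R_low_rank = U @ np.diag(sigma) @ Vt

# Add the user means back to obtain the reduced ratings matrix.
R_red = R_low_rank + R_mean

print(R_red.shape)
print(np.linalg.norm(R_demeaned - R_low_rank), np.linalg.norm(R_demeaned))
```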

In [59]:
# Builds a model using SVD and then evaluates each user's linear combination to
# obtain their recommended items. user_pos maps each user_id to its row in rec_items.
def recommend_items_SVD(UM_test, user_LC, CS_train_M, user_pos, n, kay, p_val):
    rec_items = np.zeros((len(UM_test.index), kay))
    model = build_model_SVD(UM_test, n, kay, p_val)
    item_ids = np.array(CS_train_M.columns)

    for user in user_LC.index:
        # kneighbors expects a 2D array with one row per query
        user_reshaped = user_LC.loc[user].to_numpy().reshape(1, -1)
        items = model.kneighbors(user_reshaped, return_distance = False)
        rec_items[user_pos[user],:] = item_ids[items]
    return rec_items

In [60]:
# Calculates the number of matches between our actual data and the model
def num_of_matches(actual_data, rec_items_df):
    counter = 0
    for user in actual_data['user_id'].unique():
        items_actual = actual_data.loc[actual_data['user_id'] == user, 'item_id']
        items_pred = rec_items_df.loc[user]
        counter += len(set(items_actual).intersection(set(items_pred)))
    return counter
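
To make the metric concrete, here is the same counting logic applied to a tiny made-up example. The user and item IDs are hypothetical; the layout (one row of recommended item IDs per user) follows what num_of_matches expects.

```python
import pandas as pd

# Hypothetical ground truth: u1 rated items 10 and 11, u2 rated item 12.
actual = pd.DataFrame({"user_id": ["u1", "u1", "u2"], "item_id": [10, 11, 12]})

# Hypothetical recommendations: one row per user, one column per recommended item.
recs = pd.DataFrame([[10, 11, 99], [98, 97, 96]], index=["u1", "u2"])

matches = 0
for user in actual["user_id"].unique():
    rated = set(actual.loc[actual["user_id"] == user, "item_id"])
    recommended = set(recs.loc[user])
    matches += len(rated & recommended)

print(matches)  # 2: both of u1's items were recommended, none of u2's
```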

In [26]:
# Reconstruction RMSE of the rank-n truncated SVD for each n in n_list
def find_RMSE(um, n_list):
    # These do not depend on n, so compute them once
    R = um.values
    R_mean = np.mean(R, axis = 1).reshape(-1,1)
    R_demeaned = R - R_mean

    RMSE_dict = {}
    for n in n_list:
        U, sigma, Vt = svds(R_demeaned, k = n) # This gets applied to the *demeaned* ratings matrix
        R_red = U@np.diag(sigma)@Vt + R_mean # Un-demean the ratings matrix

        # Root mean squared difference over every cell of the matrix
        RMSE_dict[n] = np.sqrt(np.mean((R - R_red)**2))
    return RMSE_dict

In [63]:
# Load our filtered data
df = pd.read_csv('/content/df_modcloth_filtered.csv')

Now, we will split our data 50/50 into training and test sets, stratified by item_id so that each item's ratings are divided roughly evenly between the two halves.

In [7]:
train, test = train_test_split(df, test_size = 0.5, stratify = df['item_id'])
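
The effect of stratifying can be seen on a small made-up table. This is only a sketch: the data is hypothetical, and random_state is added here for reproducibility (the cell above does not set one, so its split changes from run to run).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: item "a" has 6 ratings, item "b" has 4.
toy = pd.DataFrame({
    "user_id": range(10),
    "item_id": ["a"] * 6 + ["b"] * 4,
    "rating": [5, 4, 3, 5, 4, 3, 2, 1, 2, 1],
})

# random_state is an assumption added for this sketch only.
train_toy, test_toy = train_test_split(
    toy, test_size=0.5, stratify=toy["item_id"], random_state=0
)

# Each half keeps the 6:4 item ratio of the full table.
print(train_toy["item_id"].value_counts().to_dict())
print(test_toy["item_id"].value_counts().to_dict())
```

Stratification requires every item to have at least two ratings; otherwise scikit-learn raises an error.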

We will then create the utility matrices and calculate Spearman's correlation for the full data, the training data, and the test data.

In [8]:
um = make_um(df)
cs = um.corr(method = 'spearman')

In [9]:
um_train = make_um(train)
cs_train = um_train.corr(method = 'spearman')

In [10]:
um_test = make_um(test)
cs_test = um_test.corr(method = 'spearman')
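
Because corr works column by column, calling it on a utility matrix yields an item-by-item matrix. A minimal sketch on a hypothetical 4-user, 3-item matrix shows the shape and the meaning of the values:

```python
import pandas as pd

# Hypothetical utility matrix: rows are users, columns are items.
um_toy = pd.DataFrame(
    {"a": [5, 3, 0, 1], "b": [4, 2, 0, 1], "c": [0, 1, 5, 4]},
    index=["u1", "u2", "u3", "u4"],
)

# Spearman's correlation compares the rank order of the ratings in each pair of columns.
cs_toy = um_toy.corr(method="spearman")
print(cs_toy)
```

Items a and b are ranked identically by the four users, so their correlation is 1; item c is ranked in exactly the reverse order of a, so their correlation is -1.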

We store the Spearman correlation values for the full, training, and test data in their own dataframes, labeled by item on both axes, so that the matrix multiplication aligns properly when we calculate each user's linear combination.

In [17]:
cs_M = pd.DataFrame(cs, columns = um.columns, index = um.columns)
cs_train_M = pd.DataFrame(cs_train, columns = um_train.columns, index = um_train.columns)
cs_test_M = pd.DataFrame(cs_test, columns = um_test.columns, index = um_test.columns)

Calculate each user's linear combination by multiplying the utility matrix of the test data by the Spearman correlation matrix of the training data.

In [12]:
user_LC = um_test@cs_train_M
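
The @ operator on two dataframes matches the columns of the left operand to the index of the right operand by label, which is why the item labels matter. The sketch below uses hypothetical 2-user, 2-item data to show both the product and what happens when the labels do not match.

```python
import pandas as pd

# Hypothetical test utility matrix (users x items) and training correlations (items x items).
um_test_toy = pd.DataFrame([[5, 0], [0, 4]], index=["u1", "u2"], columns=["a", "b"])
cs_train_toy = pd.DataFrame([[1.0, 0.5], [0.5, 1.0]], index=["a", "b"], columns=["a", "b"])

# Each user's row becomes a weighted sum of the correlation rows of the items they rated.
user_LC_toy = um_test_toy @ cs_train_toy
print(user_LC_toy)

# If the item labels differ between the two operands, pandas refuses to multiply.
raised = False
try:
    um_test_toy @ cs_train_toy.rename(index={"b": "z"})
except ValueError as err:
    raised = True
    print("misaligned:", err)
```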

In [29]:
user_LC_f = um@cs_M

In [15]:
# Map each user_id to its row position in user_LC
user_dict = {user: pos for pos, user in enumerate(user_LC.index)}

In [31]:
# Map each user_id to its row position in user_LC_f
user_dict_f = {user: pos for pos, user in enumerate(user_LC_f.index)}

We will build several models using different numbers of singular values, with the L2 norm (p = 2) as the nearest-neighbor distance and 10 recommended items per user.

In [22]:
SVD_dfs = []
for v in [150, 200, 250, 300, 350, 400, 410, 425]:
    SVD_dfs.append(pd.DataFrame(recommend_items_SVD(um_test, user_LC, cs_train_M, user_dict, v, 10, 2), index = user_dict.keys()))

To determine the best model, we will use a simple metric: the number of recommended items that a user actually rated in the test data, summed over all users.

In [23]:
num_of_matches_list = [num_of_matches(test, d) for d in SVD_dfs]
num_of_matches_list

[28471, 31677, 36024, 36527, 35273, 24905, 16697, 9012]

Using 300 singular values yields the highest number of matches (36,527).
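
Rather than reading the maximum off the list by eye, the best setting can be picked programmatically. The sketch below reuses the values and match counts printed above; the variable names are introduced here for illustration only.

```python
import numpy as np

n_values = [150, 200, 250, 300, 350, 400, 410, 425]
matches = [28471, 31677, 36024, 36527, 35273, 24905, 16697, 9012]

# Index of the largest match count, mapped back to its number of singular values.
best_n = n_values[int(np.argmax(matches))]
print(best_n)  # 300
```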

In [28]:
find_RMSE(um_test, [150, 200, 250, 300, 350, 400, 410, 425])

{150: 0.11694898618322215,
 200: 0.09220406287725107,
 250: 0.07272982418493837,
 300: 0.05572772770216872,
 350: 0.03960055127367815,
 400: 0.022028825810890874,
 410: 0.017799197684599175,
 425: 0.009877427505372538}

Let's check whether a number of singular values between 300 and 350 (here 310, 320, 330, and 340) improves our model.

In [52]:
list(range(310, 341, 10))

[310, 320, 330, 340]

In [53]:
SVD_dfs_2 = []
for v in list(range(310, 341, 10)):
    SVD_dfs_2.append(pd.DataFrame(recommend_items_SVD(um_test, user_LC, cs_train_M, user_dict, v, 10, 2), index = user_dict.keys()))

In [54]:
num_of_matches_list_2 = [num_of_matches(test, d) for d in SVD_dfs_2]
num_of_matches_list_2

[35820, 35570, 35067, 34545]

None of 310, 320, 330, or 340 singular values improved on the 36,527 matches obtained with 300.

Let's observe our results using the L1 norm instead of the L2 norm.

In [61]:
SVD_dfs_L1 = []
for v in [150, 200, 250, 300, 350, 400, 410, 425]:
    SVD_dfs_L1.append(pd.DataFrame(recommend_items_SVD(um_test, user_LC, cs_train_M, user_dict, v, 10, 1), index = user_dict.keys()))

In [62]:
num_of_matches_list_L1 = [num_of_matches(test, d) for d in SVD_dfs_L1]
num_of_matches_list_L1

[4941, 3264, 3828, 2999, 2056, 993, 585, 1539]

Using the L1 norm drastically reduces the number of matches (at most 4,941, versus 36,527 with the L2 norm), so the L2 norm is the better choice.
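
The p argument of NearestNeighbors selects the Minkowski distance: p = 1 sums absolute coordinate differences (L1), while p = 2 is the ordinary Euclidean distance (L2). The two can disagree about which point is nearest, as this small sketch with hypothetical 2D points shows:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical candidates and a query at the origin.
points = np.array([[3.0, 0.0], [2.0, 2.0]])
query = np.array([[0.0, 0.0]])

nearest = {}
for p in (1, 2):
    nn = NearestNeighbors(n_neighbors=1, p=p).fit(points)
    dist, idx = nn.kneighbors(query)
    nearest[p] = (int(idx[0, 0]), float(dist[0, 0]))
    print(p, nearest[p])
```

Under L1 the point (3, 0) is nearest (distance 3 versus 4), whereas under L2 the point (2, 2) is nearest (distance about 2.83 versus 3). This is only an illustration of how the neighbor sets can differ; it does not by itself explain the gap in match counts above.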

As a check on our train-test validation, we will now build the model on the entire dataset and count its matches against that same dataset.

In [34]:
SVD_dfs_full = []
for v in [150, 200, 250, 300, 350, 400, 410, 425]:
    SVD_dfs_full.append(pd.DataFrame(recommend_items_SVD(um, user_LC_f, cs, user_dict_f, v, 10, 2), index = user_dict_f.keys()))

In [35]:
num_of_matches_list_full = [num_of_matches(df, d) for d in SVD_dfs_full]
num_of_matches_list_full

[29421, 37057, 54726, 65142, 60317, 49655, 39560, 20500]

In [36]:
find_RMSE(um, [150, 200, 250, 300, 350, 400, 410, 425])

{150: 0.128778485257336,
 200: 0.1019257880438485,
 250: 0.08059295336375218,
 300: 0.061961617379721494,
 350: 0.04433289670423865,
 400: 0.02500844225436355,
 410: 0.020314003417670517,
 425: 0.011552216980782718}

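Note that although the root mean squared error decreases as the number of singular values increases, a lower RMSE does not guarantee a better model: here it only measures how closely the truncated SVD reconstructs the matrix it was computed from, while the number of matches peaks at 300 singular values and then falls.<br>
As with our train-test split validation, using 300 singular values on the entire dataset yields the best model.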
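
The behavior of the RMSE can be reproduced on any matrix. The sketch below uses a small random matrix (hypothetical data) and the same demean, truncate, and un-demean steps as find_RMSE; keeping more singular values can only bring the reconstruction closer to the original, so the error never goes up.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(1)
R = rng.integers(0, 6, size=(8, 7)).astype(float)  # hypothetical 8 users x 7 items
R_mean = R.mean(axis=1, keepdims=True)

rmse = {}
for k in [1, 2, 3, 4]:
    U, s, Vt = svds(R - R_mean, k=k)
    R_red = U @ np.diag(s) @ Vt + R_mean
    rmse[k] = float(np.sqrt(np.mean((R - R_red) ** 2)))

# The error shrinks with every additional singular value.
print(rmse)
```

This is why the RMSE computed this way cannot be used alone to choose the number of singular values: it always favors the largest one.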