<a href="https://colab.research.google.com/github/tnewtont/ModCloth_Recommendation_System/blob/main/rsp_data_wrangling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import numpy as np
import random

In [None]:
def read_and_filter_data(filepath):
    # Obtain the dataframe
    df = pd.read_csv(filepath)

    # We have a null in user_id, so impute it
    df['user_id'] = df['user_id'].fillna('Unknown')

    # Get the necessary columns
    ratings = df[['item_id', 'user_id', 'rating', 'category']]

    # Get the value counts of the items
    vc_items = ratings['item_id'].value_counts()

    # Keep only the items that have 24 or more reviews
    vc_items_24 = vc_items[vc_items >= 24].index

    # Apply this filter to our data
    ratings_filtered = ratings.loc[ratings['item_id'].isin(vc_items_24)]

    return ratings_filtered

First, we are only going to consider four features: <br>
(1) item_id<br>
(2) user_id<br>
(3) rating<br>
(4) category<br>

In [None]:
# Loading the data
df = pd.read_csv('/content/df_modcloth.csv')
df

Unnamed: 0,item_id,user_id,rating,timestamp,size,fit,user_attr,model_attr,category,brand,year,split
0,7443,Alex,4,2010-01-21 08:00:00+00:00,,,Small,Small,Dresses,,2012,0
1,7443,carolyn.agan,3,2010-01-27 08:00:00+00:00,,,,Small,Dresses,,2012,0
2,7443,Robyn,4,2010-01-29 08:00:00+00:00,,,Small,Small,Dresses,,2012,0
3,7443,De,4,2010-02-13 08:00:00+00:00,,,,Small,Dresses,,2012,0
4,7443,tasha,4,2010-02-18 08:00:00+00:00,,,Small,Small,Dresses,,2012,0
...,...,...,...,...,...,...,...,...,...,...,...,...
99888,154797,BernMarie,5,2019-06-26 21:15:13.165000+00:00,6.0,Just right,Large,Small&Large,Dresses,,2017,0
99889,77949,Sam,4,2019-06-26 23:22:29.633000+00:00,4.0,Slightly small,Small,Small&Large,Bottoms,,2014,2
99890,67194,Janice,5,2019-06-27 00:20:52.125000+00:00,,Just right,Small,Small&Large,Dresses,,2013,2
99891,71607,amy,3,2019-06-27 15:45:06.250000+00:00,,Slightly small,Small,Small&Large,Outerwear,Jack by BB Dakota,2016,2


In [None]:
# Filtering the data to include the 4 features
df2 = df[['item_id', 'user_id', 'rating', 'category']]
df2

Unnamed: 0,item_id,user_id,rating,category
0,7443,Alex,4,Dresses
1,7443,carolyn.agan,3,Dresses
2,7443,Robyn,4,Dresses
3,7443,De,4,Dresses
4,7443,tasha,4,Dresses
...,...,...,...,...
99888,154797,BernMarie,5,Dresses
99889,77949,Sam,4,Bottoms
99890,67194,Janice,5,Dresses
99891,71607,amy,3,Outerwear


From our earlier EDA, we know that 'user_id' has exactly 1 missing value, so we need to impute it. We will simply fill it with 'Unknown'.

In [None]:
# df2 was selected from df, so assigning into it in place triggers pandas'
# SettingWithCopyWarning. assign returns a new dataframe instead.
df2 = df2.assign(user_id = df2['user_id'].fillna('Unknown'))


In [None]:
# Checking that the entire dataframe has no nulls
df2.isna().sum().sum()

0
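
Filling a column on a frame that was selected from another frame is the usual cause of pandas' SettingWithCopyWarning. Below is a minimal sketch of one common way to avoid it, assuming it is acceptable to take an explicit `.copy()` of the column subset before filling. The frame `toy` and its values are hypothetical stand-ins, not ModCloth data.

```python
import pandas as pd

# Hypothetical toy data standing in for the ModCloth dataframe
toy = pd.DataFrame({
    'item_id': [1, 1, 2],
    'user_id': ['Alex', None, 'Sam'],
    'rating': [4, 3, 5],
    'category': ['Dresses', 'Dresses', 'Tops'],
    'brand': [None, None, 'X'],
})

# An explicit copy makes the subset independent of the original frame,
# so the column assignment below is unambiguous
subset = toy[['item_id', 'user_id', 'rating', 'category']].copy()
subset['user_id'] = subset['user_id'].fillna('Unknown')

print(subset['user_id'].tolist())
print(int(subset.isna().sum().sum()))
```

The original frame `toy` is left untouched, which is usually what is wanted when the raw data should stay available for later checks.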

Now, we need to take the following into account:<br>
(1) There are many more items rated 4 or above than items rated below 4.<br>
(2) The number of reviews a user has given varies.<br>
(3) The number of reviews that a product has varies.<br>
 - This means that we need to define what counts as a "good" item.<br>
 - Our EDA showed that the median number of reviews per item is 17, and that for items rated 4 or above the median is 22. Therefore, we define a "good" item as one that has at least 24 reviews and is rated 4 or above. The threshold of 24 is somewhat arbitrary, but it sits just above the median of 22 and is sufficient for our implementation.<br>
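
The definition of a "good" item above can be sketched on a small example. This is only an illustration: the frame `toy` is hypothetical, `MIN_REVIEWS` stands in for the threshold of 24, and "rated 4 or above" is read here as a mean rating of at least 4, which is an assumption.

```python
import pandas as pd

# Hypothetical toy ratings: item 10 has 3 reviews, item 20 has 2, item 30 has 1
toy = pd.DataFrame({
    'item_id': [10, 10, 10, 20, 20, 30],
    'rating':  [5, 4, 4, 3, 4, 2],
})

MIN_REVIEWS = 2  # stands in for the threshold of 24 used in this notebook

# Number of reviews per item, and the median of those counts
review_counts = toy['item_id'].value_counts()
median_reviews = review_counts.median()

# Keep only the rows whose item meets the review threshold
kept_items = review_counts[review_counts >= MIN_REVIEWS].index
toy_filtered = toy[toy['item_id'].isin(kept_items)]

# Among the kept items, a "good" item also has a mean rating of at least 4
mean_ratings = toy_filtered.groupby('item_id')['rating'].mean()
good_items = mean_ratings[mean_ratings >= 4].index.tolist()

print(median_reviews, good_items)
```

In this toy data, item 30 is dropped for having too few reviews, and item 20 survives the count filter but does not qualify as "good" because its mean rating is 3.5.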



In [None]:
# Filter our data to include items that have 24 or more reviews

# Obtain the value counts for each item
vc_items = df2['item_id'].value_counts()

# Include only 24 or more reviews
vc_items_24 = vc_items[vc_items >= 24].index

# Apply this filter to our original data
df2_filtered = df2.loc[df2['item_id'].isin(vc_items_24)]
df2_filtered

Unnamed: 0,item_id,user_id,rating,category
0,7443,Alex,4,Dresses
1,7443,carolyn.agan,3,Dresses
2,7443,Robyn,4,Dresses
3,7443,De,4,Dresses
4,7443,tasha,4,Dresses
...,...,...,...,...
99888,154797,BernMarie,5,Dresses
99889,77949,Sam,4,Bottoms
99890,67194,Janice,5,Dresses
99891,71607,amy,3,Outerwear


In [None]:
# Saving the filtered dataframe
df2_filtered.to_csv('df_modcloth_filtered.csv', index = False)

In [None]:
# Reading the filtered dataframe
pd.read_csv('/content/df_modcloth_filtered.csv')

Unnamed: 0,item_id,user_id,rating,category
0,7443,Alex,4,Dresses
1,7443,carolyn.agan,3,Dresses
2,7443,Robyn,4,Dresses
3,7443,De,4,Dresses
4,7443,tasha,4,Dresses
...,...,...,...,...
93910,154797,BernMarie,5,Dresses
93911,77949,Sam,4,Bottoms
93912,67194,Janice,5,Dresses
93913,71607,amy,3,Outerwear


Since we will be using collaborative filtering for our recommendation system, we also need to distinguish two cases:<br>
(1) The user's average rating is 4 or above. <br>
(2) The user's average rating is below 4. <br>
This matters because we only want our recommendation system to recommend good products.<br>
Therefore, users whose average rating is below 4 will receive generic recommendations (i.e. the top 3 items that have the most ratings among those rated 4 or above).
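
The two cases can be sketched as follows. This is a rough illustration on hypothetical data: the names `personalized_users`, `generic_users`, and `TOP_N` are not from the notebook, and the generic list is built directly from the description above (most-reviewed items among those with a mean rating of at least 4).

```python
import pandas as pd

# Hypothetical toy ratings
toy = pd.DataFrame({
    'user_id': ['a', 'a', 'b', 'b', 'c'],
    'item_id': [1, 2, 1, 3, 2],
    'rating':  [5, 4, 3, 4, 2],
})

# Average rating given by each user
user_means = toy.groupby('user_id')['rating'].mean()

# Case (1): average of 4 or above; case (2): average below 4
personalized_users = user_means[user_means >= 4].index.tolist()
generic_users = user_means[user_means < 4].index.tolist()

# Per-item mean rating and number of reviews
item_stats = toy.groupby('item_id').agg(
    mean=('rating', 'mean'),
    num_reviews=('rating', 'count'),
)

TOP_N = 2  # stands in for the top 3 used in this notebook

# Generic recommendations: most-reviewed items among those rated 4 or above
top_generic = (
    item_stats[item_stats['mean'] >= 4]
    .nlargest(TOP_N, 'num_reviews')
    .index.tolist()
)

print(personalized_users, generic_users, top_generic)
```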

In [None]:
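# Obtaining the dataframe that contains generic recommendations

# Mean rating per item over all items (before the 24-review filter)
mean_item_ratings = df2.groupby('item_id')['rating'].mean()

# Per-item statistics over the filtered items only:
#   mean        - average rating of the item
#   num_reviews - number of reviews of the item
#   cat         - alphabetically first category label recorded for the item
mean_item_ratings2 = df2_filtered.groupby('item_id').agg(
        mean = ('rating', 'mean'),
        num_reviews = ('item_id', 'count'),
        cat = ('category', 'min')
    )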

In [None]:
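# Order items by number of reviews, descending. The final order comes from the
# last sort_values call; the default sort is not guaranteed to be stable, so
# items tied on num_reviews are not necessarily ordered by mean.
pop_items = mean_item_ratings2.sort_values('mean', ascending = False).sort_values('num_reviews', ascending = False)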
pop_items

item_id,mean,num_reviews,cat
34935,4.482247,1887,Tops
21296,4.171760,1636,Bottoms
32405,4.325829,1599,Dresses
32406,4.328648,1494,Dresses
32403,4.367199,1378,Dresses
...,...,...,...
153801,4.083333,24,Bottoms
138414,4.000000,24,Tops
153397,3.791667,24,Bottoms
153470,4.416667,24,Outerwear


We now want to give more importance to items that are rated 4 or above than to those rated below 4. We create a feature called weighted_vals that multiplies mean and num_reviews together, along with a weight determined by the interval that the item's mean rating falls into, as described by the function below. Although these numbers were chosen arbitrarily, the idea is that the higher the item's rating, the more weight it holds. Items with a mean rating below 4 are instead divided by 1000, which pushes them to the bottom of the ranking.
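
As a worked example of this weighting, here is a table-driven sketch that is intended to behave like the notebook's weights_multiply function. The names `THRESHOLDS`, `MULTIPLIERS`, `weights_lookup`, and `weighted_val` are hypothetical; the interval bounds and multipliers are copied from that function.

```python
from bisect import bisect_right

# Lower bounds of the rating intervals and their multipliers,
# copied from the weights_multiply function in this notebook
THRESHOLDS = [4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9]
MULTIPLIERS = [700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600]

def weights_lookup(x):
    """Hypothetical table-driven counterpart of weights_multiply."""
    # Number of thresholds that are <= x
    k = bisect_right(THRESHOLDS, x)
    if k == 0:
        # Mean rating below 4: shrink the value instead of amplifying it
        return x / 1000
    return x * MULTIPLIERS[k - 1]

def weighted_val(mean, num_reviews):
    # weighted_vals = (mean * interval multiplier) * num_reviews
    return weights_lookup(mean) * num_reviews

# Item 34935 in the notebook: mean 4.482247 with 1887 reviews (multiplier 1100)
print(weighted_val(4.482247, 1887))

# Item 153397 in the notebook: mean 3.791667 with 24 reviews (below 4)
print(weighted_val(3.791667, 24))
```

For the first item this gives roughly 4.482247 × 1100 × 1887 ≈ 9.3 million, while the second item ends up near 0.091, so any item with a mean below 4 ranks after every item with a mean of 4 or above.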

In [None]:
def weights_multiply(x):
    # Scale the mean rating x by a multiplier that grows with its rating interval.
    # Each elif is only reached when the checks above it failed,
    # so the upper bound of every interval is implicit.
    if x >= 4.9:
        return x * 1600
    elif x >= 4.8:
        return x * 1500
    elif x >= 4.7:
        return x * 1400
    elif x >= 4.6:
        return x * 1300
    elif x >= 4.5:
        return x * 1200
    elif x >= 4.4:
        return x * 1100
    elif x >= 4.3:
        return x * 1000
    elif x >= 4.2:
        return x * 900
    elif x >= 4.1:
        return x * 800
    elif x >= 4.0:
        return x * 700
    else:
        # Mean rating below 4: shrink the value so these items rank last
        return x / 1000

In [None]:
pop_items['weighted_vals'] = (pop_items['mean'].apply(weights_multiply)) * pop_items['num_reviews']
pop_items

item_id,mean,num_reviews,cat,weighted_vals
34935,4.482247,1887,Tops,9303800.000
21296,4.171760,1636,Bottoms,5460000.000
32405,4.325829,1599,Dresses,6917000.000
32406,4.328648,1494,Dresses,6467000.000
32403,4.367199,1378,Dresses,6018000.000
...,...,...,...,...
153801,4.083333,24,Bottoms,68600.000
138414,4.000000,24,Tops,67200.000
153397,3.791667,24,Bottoms,0.091
153470,4.416667,24,Outerwear,116600.000


We need to make sure that the mapping from each item_id to its weighted_vals is unique and one-to-one (i.e. a bijection). To ensure this, we add random noise (a random float drawn from the interval [0, 1)) to weighted_vals, which makes exact ties extremely unlikely.
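
A minimal sketch of this tie-breaking idea on hypothetical scores follows. It differs from the notebook in two labeled ways: the noise comes from a seeded NumPy generator so that results are reproducible (the notebook uses the unseeded `random.random()`), and a deterministic alternative based on `Series.rank(method='first')` is shown for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical scores containing a tie between items 11 and 22
scores = pd.Series([100.0, 100.0, 50.0], index=[11, 22, 33], name='weighted_vals')

# Seeded generator so that the noise is identical on every run
rng = np.random.default_rng(0)
noisy = scores + rng.random(len(scores))

# Deterministic alternative: integer ranks where tied scores
# still receive distinct ranks
ranks = scores.rank(method='first', ascending=False).astype(int)

print(scores.duplicated().sum(), noisy.duplicated().sum())
print(ranks.is_unique)
```

Because the noise is smaller than 1, it only changes the relative order of items whose scores differ by less than 1, such as exact ties or the items with a mean rating below 4.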

In [None]:
# Without adding random noise, we will obtain duplicate values
pop_items['weighted_vals'].duplicated().sum()

16

In [None]:
# Checking that weighted_vals becomes unique once random noise is added
# (this is a trial draw only; the next cell draws fresh noise before storing it)
(pop_items['weighted_vals'].apply(lambda x: x + random.random())).duplicated().sum()

0

In [None]:
pop_items['weighted_vals'] = pop_items['weighted_vals'].apply(lambda x: x + random.random())

In [None]:
# Save this dataframe of generic recommendations
pop_items.to_csv('pop_items.csv')

In [None]:
# Double checking this dataframe was saved properly
pop_items_read = pd.read_csv('/content/pop_items.csv')
pop_items_read.set_index('item_id', inplace = True)
pop_items_read

item_id,mean,num_reviews,cat,weighted_vals
34935,4.482247,1887,Tops,9.303801e+06
21296,4.171760,1636,Bottoms,5.460000e+06
32405,4.325829,1599,Dresses,6.917001e+06
32406,4.328648,1494,Dresses,6.467001e+06
32403,4.367199,1378,Dresses,6.018001e+06
...,...,...,...,...
153801,4.083333,24,Bottoms,6.860005e+04
138414,4.000000,24,Tops,6.720080e+04
153397,3.791667,24,Bottoms,1.371119e-01
153470,4.416667,24,Outerwear,1.166010e+05
