<h1><center>Content-based Filtering - Amazon Beauty Products</center></h1>

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)  
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', -1)  # pandas 1.0 and later: use None for unlimited width (-1 is deprecated)

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import warnings 
warnings.filterwarnings('ignore')

import gc

In [2]:
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel, euclidean_distances
from scipy.stats import pearsonr

Read the cleaned files into DataFrames.

In [3]:
meta = pd.read_csv('cleaned_metadata.csv',index_col=0)
reviews = pd.read_csv('cleaned_reviews.csv',index_col=0)

In [4]:
meta.head(2)

Unnamed: 0,asin,description,price,brand_title,health_personal_care,beauty,main_cat,sub_cat,related_count
0,205616461,as age youthful healthy skin succumbs enzymatic imbalance wears away cellular network resulting skin thinning aging combining best nature cosmetic biotechnology bioactive products formulated enzymes gently exfoliate skin stimulate regeneration youthful glow benefiting fertile orchards italian countryside bioactive formulas rich phytohormones flavonoids fatty acids active extracts apple pear seeds enzymatically modified developed especially care aging skin this repairing fluid helps nourish firm accelerating penetration delivery active principles skin giving youthful appearance advanced probiotic complex nourishing milk proteins regains skins natural equilibrium boosts immunities protects environmental biological stress peptides ceramides help firm regenerate skin stimulating collagen production strengthening epidermis a calming botanical complex hyaluronic acid wheat germ extract hydrates restores skins protective barriers a nutritive vitamin complex moisturizes protects skin damaging environmental factors paracress extract natural alternative cosmetic injections limits relaxes microcontractions create facial lines producing immediate longterm smoothing skin to use apply pumps apply pumps clean dried face neck dcollet,,Bio-Active Anti-Aging Serum (Firming Ultra-Hydrati,461765.0,-1.0,Skin Care,Face,0.0
1,558925278,mineral powder brushapply powder mineral foundation face circular buffing motion work inward towards nose concealer brushuse liquid mineral powder concealer coverage blemishes eyes eye shading brush expertly cut apply blend powder eye shadows baby kabuki buff powder areas need coverage cosmetic brush bag 55 hemp linen 45 cotton,,Eco Friendly Ecotools Quality Natural Bamboo Cosme,-1.0,402875.0,Tools & Accessories,Makeup Brushes & Tools,0.0


In [5]:
reviews.head(2)

Unnamed: 0,reviewerID,asin,overall,reviewTime,review,upvotes,downvotes,word_count,polarity
0,A39HTATAQ9V7YF,205616461,5,2013-05-28,bioactive antiaging serum love moisturizer would recommend someone dry skin fine lines wrinkles using brand day night serum,0,0,34,0.283333
1,A3JM6GV9MNOF9X,558925278,3,2012-12-14,product ok im use baby kabuki moment received product deadlinei tested baby kabuki quality material best packaging cute love itthe fibers smell soft,0,1,44,0.52


In [6]:
print("Products:",meta.shape)
print("Reviews:",reviews.shape)

Products: (259204, 9)
Reviews: (2023070, 9)


In [7]:
meta.info(null_counts=True)  # null_counts was renamed to show_counts in pandas 1.2

<class 'pandas.core.frame.DataFrame'>
Int64Index: 259204 entries, 0 to 259203
Data columns (total 9 columns):
asin                    259204 non-null object
description             259137 non-null object
price                   189930 non-null float64
brand_title             258760 non-null object
health_personal_care    259204 non-null float64
beauty                  259204 non-null float64
main_cat                259204 non-null object
sub_cat                 259204 non-null object
related_count           259204 non-null float64
dtypes: float64(4), object(5)
memory usage: 19.8+ MB


In [8]:
reviews.info(null_counts=True)  # null_counts was renamed to show_counts in pandas 1.2

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2023070 entries, 0 to 2023069
Data columns (total 9 columns):
reviewerID    2023070 non-null object
asin          2023070 non-null object
overall       2023070 non-null int64
reviewTime    2023070 non-null object
review        2023067 non-null object
upvotes       2023070 non-null int64
downvotes     2023070 non-null int64
word_count    2023070 non-null int64
polarity      2023070 non-null float64
dtypes: float64(1), int64(4), object(4)
memory usage: 154.3+ MB


In [9]:
meta['description'][54107]

nan

In [10]:
meta['description'] = meta['description'].fillna(meta['main_cat'])
meta['brand_title'] = meta['brand_title'].fillna(meta['main_cat'])
meta['price'] = meta['price'].fillna(0)

Since the product description field will be used extensively for filtering, we will lemmatize it.

In [11]:
from textblob import TextBlob, Word

meta['description'] = meta['description'].apply(lambda x: " ".join([w.lemmatize() for w in TextBlob(x).words]))
#reviews['review'] = reviews['review'].apply(lambda x: " ".join([w.lemmatize() for w in TextBlob(x).words]))

In [12]:
meta[meta['asin'] == 'B001MA0QY2']

Unnamed: 0,asin,description,price,brand_title,health_personal_care,beauty,main_cat,sub_cat,related_count
66896,B001MA0QY2,the proffesional hsi flat iron great transforming frizzy dull hair gorgeously straight sleek lock aside straightening proffesional hsi flat iron curl flip hair beautifully 1 plate giving maximum control hair type with flash quick heating swivel cord iron provides great style without making mess taking much time featuring new easier grip ergonomic design easier hold styler flat iron also versatile heat setting provide total control hairstyling need moist ceramic heat solid ceramic plate coil maintain even temperature,53.59,HSI PROFESSIONAL HSI PROFESSIONAL 1 CERAMIC TOURMA,-1.0,1.0,Hair Care,Styling Tools,0.0


In [13]:
meta.main_cat.unique()

array(['Skin Care', 'Tools & Accessories', 'Makeup', 'Hair Care',
       'Bath & Body', 'Fragrance', 'Fragrance]', 'Makeup]', 'Skin Care]',
       'Hair Care]', 'Tools & Accessories]', 'Fan Shop]', 'Bath & Body]',
       'Snow Sports', 'Kitchen & Dining', 'Health Care',
       'Stationery & Party Supplies', 'Storage & Organization]',
       'Baby & Child Care]', 'Fan Shop', 'Personal Care',
       'Household Supplies', 'Accessories', 'Hardware'], dtype=object)

In [14]:
# Strip the stray trailing bracket; regex=False treats ']' as a literal character
meta['main_cat'] = meta['main_cat'].str.replace(']', '', regex=False)

In [15]:
meta['sub_cat'] = meta['sub_cat'].str.replace('"','')

### Content-based recommendation system

A content-based recommender suggests items whose features are similar to the features of a given item in the data set.

Here we use a nearest-neighbor approach, an unsupervised method also known as a memory-based system. It stores the feature vectors of all items and, for a query item, recommends the items that are quantitatively most similar to it.

In [16]:
usersperasin = reviews['asin'].value_counts()
#usersperasin

In [17]:
meta_5 = meta[meta['asin'].isin(usersperasin[usersperasin>10].index) & meta['asin'].isin(usersperasin[usersperasin<7000].index)]

In [18]:
meta_5.head()

Unnamed: 0,asin,description,price,brand_title,health_personal_care,beauty,main_cat,sub_cat,related_count
18,1304351475,too faced natural eye shadow palette color include heaven silk teddy nude beach velvet revolver pushup honey pot sexspresso erotica cocoa puff collectible tin version new box full size,33.99,Omagazee NEW EUROPEAN COLLECTION Too Faced Natural,-1.0,15567.0,Makeup,Eyes,309.0
51,1403790965,silkylight powder creates enhances look glowing sunkissed cheek brow bone brush evenly flawlesslooking finish,33.99,Arbonne Bronzer,-1.0,16280.0,Makeup,Face,27.0
61,3227001381,no description,0.0,Elemis Aromazing Shampoo - 300 mL,-1.0,168222.0,Hair Care,Shampoos,0.0
75,5357955948,add pliable firm hold texture,0.0,"Kms California Hairplay Paste Up Spray, 6.4 Fluid",-1.0,77540.0,Hair Care,Styling Products,7.0
84,535795531X,maximum intensity cream using acidfree smoothing agent skin regenerators designed improve skin firmness texture a blend active botanical includes red seaweed rice extract soy protein phytoestrogens kukui nut licorice multivitamin feel difference immediately benefiting stronger resilient skin fragrance free,126.92,Dermalogica Dermalogica AGE Smart Power Rich (5 x,-1.0,16175.0,Skin Care,Face,140.0


In [19]:
meta_5.shape

(33878, 9)

In [20]:
meta_5 = meta_5.reset_index(drop=True)

In [21]:
indices = pd.Series(meta_5['asin'].index)

In [22]:
def recommend(index, method):
    item_id = indices[index]  # avoid shadowing the built-in id()
    # Get the pairwise similarity scores of all products for this product,
    # sort them in descending order and keep the top 5.
    # Position 0 is skipped on the assumption that it is the product itself.
    similarity_scores = list(enumerate(method[item_id]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    similarity_scores = similarity_scores[1:6]
    
    # Get the product index
    asin_index = [i[0] for i in similarity_scores]
    
    # Return the top 5 most similar products using integer-location based indexing (iloc)
    return meta_5['brand_title'].iloc[asin_index]
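
Sorting the full list and dropping position 0 relies on the query product ranking first, which is not guaranteed when another product has exactly the same similarity score. A minimal sketch of an alternative that excludes the query index explicitly; `top_k_similar` is a hypothetical helper and the toy row is made up for illustration:

```python
import numpy as np

def top_k_similar(scores, self_index, k=5):
    """Return the positions of the k highest scores, excluding self_index."""
    scores = np.asarray(scores, dtype=float).copy()  # copy so the caller's row is untouched
    scores[self_index] = -np.inf                     # the query item can never be selected
    order = np.argsort(-scores, kind="stable")       # descending; ties keep index order
    return order[:k]

# Toy similarity row: item 3 ties with the query item 0 at similarity 1.0
row = [1.0, 0.2, 0.9, 1.0, 0.5]
top2 = top_k_similar(row, self_index=0, k=2)
print(top2)
```

The returned positions could be passed to `meta_5.iloc[...]` in the same way as `asin_index`.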

Merge the text fields that describe the item attributes into one field so it can be converted to vector form easily.

In [23]:
meta_5['all_content'] = meta_5['brand_title'] + meta_5['main_cat'] + meta_5['description'] + meta_5['sub_cat']
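
Note that the fields are concatenated without a separator, so the last word of one field fuses with the first word of the next (for example `mLHair` or `descriptionShampoos` in `all_content`). The sketch below illustrates the effect on tokenization using one row from this notebook; the regular expression is the default `token_pattern` of scikit-learn's vectorizers, and the helper `tokens` is hypothetical:

```python
import re

# Default token pattern of scikit-learn's text vectorizers: words of 2+ characters
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN.findall(text.lower())

fields = ["Elemis Aromazing Shampoo - 300 mL", "Hair Care", "no description", "Shampoos"]

fused_tokens = tokens("".join(fields))    # fields joined as in the cell above
spaced_tokens = tokens(" ".join(fields))  # fields joined with a space

print(fused_tokens)
print(spaced_tokens)
```

Joining with `' '` keeps `hair`, `care` and `shampoos` as separate tokens, which would change the vocabulary and therefore the recommendations shown in this notebook.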

Use TfidfVectorizer to convert the new attribute (all_content) into vector form. TF-IDF counts how often each term occurs in a document (term frequency) and down-weights terms that occur in many documents (inverse document frequency). Each document is thus represented by a vector of term weights in which rare, distinctive terms count more than common ones.
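
A minimal sketch of this weighting on a toy corpus (made up for illustration). It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which corresponds to TfidfVectorizer's default settings; the whitespace tokenizer is a simplification:

```python
import math
from collections import Counter

docs = [
    "shampoo for dry hair",
    "shampoo for oily hair",
    "red lipstick",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for doc_tokens in tokenized for term in set(doc_tokens))

def tfidf(doc_tokens):
    """Return L2-normalized TF-IDF weights for one document (smoothed IDF)."""
    tf = Counter(doc_tokens)
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(tokenized[0])
print(weights)
```

In the first document, `dry` appears in only one document and therefore receives a higher weight than `shampoo`, which appears in two.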

In [24]:
vectorizer = TfidfVectorizer(analyzer='word', stop_words='english', ngram_range=(1, 2), max_features=1000)

In [25]:
tfidf_all_content = vectorizer.fit_transform(meta_5['all_content'])

To compute the similarity between item vectors, several measures can be used: cosine similarity, Euclidean distance, and Pearson's correlation. The recommender then returns the items that are most similar to the query item.

First, we use linear_kernel to compute the dot product of the vectors. Because TfidfVectorizer L2-normalizes each row by default, this dot product equals the cosine similarity.

In [26]:
cos_sim_lk = linear_kernel(tfidf_all_content, tfidf_all_content)
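
The following sketch checks, on a toy corpus made up for illustration, that the dot product of default TF-IDF rows matches their cosine similarity because each row has unit length:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = [
    "moisturizing shampoo dry hair",
    "clarifying shampoo oily hair",
    "matte red lipstick",
]
X = TfidfVectorizer().fit_transform(docs)  # default norm='l2'

# Euclidean length of every row; expected to be 1 under the default settings
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())

dot = linear_kernel(X, X)    # plain dot products
cos = cosine_similarity(X)   # dot products divided by the row norms
same = np.allclose(dot, cos)
print(row_norms, same)
```

linear_kernel skips the normalization step, which makes it the cheaper choice when the rows are already normalized.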

In [27]:
meta_5.iloc[2].main_cat, meta_5.iloc[2].sub_cat, meta_5.iloc[2].brand_title

('Hair Care', 'Shampoos', 'Elemis Aromazing Shampoo - 300 mL')

In [28]:
recommend(2, cos_sim_lk)

2078     Nexxus Nexxus shampoo therappe, 33.8oz            
16036    Suave Suave Professionals Shampoo, Rosemary Mint f
25901    Suave Suave Professionals mens, shampoo/conditione
28200    Nexxus promend shampoo, 33.8oz                    
32687    Fekkai Fekkai Apple Cider Shampoo 236ml/8oz       
Name: brand_title, dtype: object

In [29]:
meta_5.iloc[2]

asin                    3227001381                                                      
description             no description                                                  
price                   0                                                               
brand_title             Elemis Aromazing Shampoo - 300 mL                               
health_personal_care   -1                                                               
beauty                  168222                                                          
main_cat                Hair Care                                                       
sub_cat                 Shampoos                                                        
related_count           0                                                               
all_content             Elemis Aromazing Shampoo - 300 mLHair Careno descriptionShampoos
Name: 2, dtype: object

In [30]:
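# Drop references to large objects so their memory can be reclaimed;
# tfidf_all_content is kept because it is reused later
#tfidf_all_content = None
tfidf_feature_name = None
cos_sim_lk = None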

CountVectorizer can also be used to convert a collection of text documents to a matrix of token counts. It builds a sparse representation of the counts using scipy.sparse.csr_matrix.

In [31]:
cv = CountVectorizer(analyzer='word', stop_words='english', ngram_range=(1, 2), max_features=1000)
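
As a small illustration of what `ngram_range=(1, 2)` produces, the sketch below fits a CountVectorizer on two toy strings (made up for this example) and inspects the vocabulary and the count matrix:

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

toy_cv = CountVectorizer(analyzer='word', ngram_range=(1, 2))
toy_counts = toy_cv.fit_transform(["dry hair shampoo", "dry hair conditioner"])

terms = sorted(toy_cv.vocabulary_)  # unigrams and bigrams, in column order
dense = toy_counts.toarray()        # dense copy, only sensible for tiny examples

print(terms)
print(dense)
```

Each column counts one unigram or bigram, and the matrix stays sparse until `toarray()` is called.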

Cosine similarity measures the cosine of the angle between two non-zero n-dimensional vectors. For non-negative count vectors it ranges from 0 to 1: a value close to 1 means the vectors point in almost the same direction, so the items are very similar, while a value of 0 means the items share no terms.
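
A minimal sketch of the formula, cos(a, b) = a·b / (‖a‖ ‖b‖), applied to toy count vectors; the helper name `cosine` is hypothetical:

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine([1, 1, 0], [2, 2, 0])  # same terms, different counts
no_overlap = cosine([1, 0, 0], [0, 3, 1])      # no shared terms
print(same_direction, no_overlap)
```

Because the measure ignores vector length, a short and a long text with the same term proportions are treated as identical.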

In [32]:
count_matrix = cv.fit_transform(meta_5['all_content'])
cos_sim = cosine_similarity(count_matrix)

In [33]:
recommend(2, cos_sim)

2078     Nexxus Nexxus shampoo therappe, 33.8oz            
16036    Suave Suave Professionals Shampoo, Rosemary Mint f
25901    Suave Suave Professionals mens, shampoo/conditione
28200    Nexxus promend shampoo, 33.8oz                    
32687    Fekkai Fekkai Apple Cider Shampoo 236ml/8oz       
Name: brand_title, dtype: object

The results from TfidfVectorizer and CountVectorizer are exactly the same for the item at index 2.

In [34]:
indices_n = pd.Series(meta_5['asin'])
inddict = indices_n.to_dict()
inddict = dict((v,k) for k,v in inddict.items())

In [35]:
def recommend_cosine(asin):
    id = inddict[asin]
    # Get the pairwise similarity scores of all products for this product,
    # sort and derive top 5
    similarity_scores = list(enumerate(cos_sim[id]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    similarity_scores = similarity_scores[1:6]
    
    # Get the items index
    asin_index = [i[0] for i in similarity_scores]
    
    # Return the top 5 most similar products using iloc
    return meta_5.iloc[asin_index]

In [36]:
recommend_cosine('3227001381')

Unnamed: 0,asin,description,price,brand_title,health_personal_care,beauty,main_cat,sub_cat,related_count,all_content
2078,B0009I4MKW,no description,21.0,"Nexxus Nexxus shampoo therappe, 33.8oz",-1.0,895.0,Hair Care,Shampoos,235.0,"Nexxus Nexxus shampoo therappe, 33.8ozHair Careno descriptionShampoos"
16036,B002VA4FXA,no description,19.99,"Suave Suave Professionals Shampoo, Rosemary Mint f",-1.0,177752.0,Hair Care,Shampoos,0.0,"Suave Suave Professionals Shampoo, Rosemary Mint fHair Careno descriptionShampoos"
25901,B006N9LWWW,no description,10.35,"Suave Suave Professionals mens, shampoo/conditione",-1.0,24913.0,Hair Care,Shampoos,140.0,"Suave Suave Professionals mens, shampoo/conditioneHair Careno descriptionShampoos"
28200,B008AGWLJ4,no description,15.97,"Nexxus promend shampoo, 33.8oz",-1.0,18581.0,Hair Care,Shampoos,111.0,"Nexxus promend shampoo, 33.8ozHair Careno descriptionShampoos"
32687,B00EYZY5TY,no description,10.9,Fekkai Fekkai Apple Cider Shampoo 236ml/8oz,37518.0,-1.0,Hair Care,Shampoos,175.0,Fekkai Fekkai Apple Cider Shampoo 236ml/8ozHair Careno descriptionShampoos


In [37]:
tfidf_content_array = tfidf_all_content.toarray()

In [38]:
def recommend_pearson(asin):
    ind = inddict[asin]
    correlation = []
    for i in range(len(tfidf_content_array)):
        correlation.append(pearsonr(tfidf_content_array[ind], tfidf_content_array[i])[0])
    correlation = list(enumerate(correlation))
    sorted_corr = sorted(correlation, reverse=True, key=lambda x: x[1])[1:6]
    asin_index = [i[0] for i in sorted_corr]
    return meta_5.iloc[asin_index]
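
Calling `pearsonr` once per row in a Python loop is slow for tens of thousands of items. Pearson's correlation between two rows equals the cosine similarity of the mean-centered rows, so all correlations against one row can be computed at once. A sketch under the assumption that no row is constant (a constant row has zero variance and would cause a division by zero); `pearson_to_all` is a hypothetical helper:

```python
import numpy as np

def pearson_to_all(matrix, row):
    """Pearson correlation of matrix[row] with every row of matrix."""
    centered = matrix - matrix.mean(axis=1, keepdims=True)  # subtract each row's mean
    norms = np.linalg.norm(centered, axis=1)                # assumed non-zero
    return centered @ centered[row] / (norms * norms[row])

rng = np.random.default_rng(0)
M = rng.random((5, 8))                   # toy data standing in for TF-IDF rows
corr_fast = pearson_to_all(M, 0)
corr_reference = np.corrcoef(M)[0]       # NumPy's row-wise correlation matrix
print(np.allclose(corr_fast, corr_reference))
```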

In [39]:
recommend_pearson('3227001381')

Unnamed: 0,asin,description,price,brand_title,health_personal_care,beauty,main_cat,sub_cat,related_count,all_content
2078,B0009I4MKW,no description,21.0,"Nexxus Nexxus shampoo therappe, 33.8oz",-1.0,895.0,Hair Care,Shampoos,235.0,"Nexxus Nexxus shampoo therappe, 33.8ozHair Careno descriptionShampoos"
11389,B001EO5WYU,no description,16.69,"Nexxus Color Assure Shampoo, 33.8Ounce Bottle",-1.0,3507.0,Hair Care,Shampoos,0.0,"Nexxus Color Assure Shampoo, 33.8Ounce BottleHair Careno descriptionShampoos"
12988,B001P1ZEJK,no description,4.22,"Suave Suave, shampoo, humectant moisture, 28oz",-1.0,82593.0,Hair Care,Shampoos,60.0,"Suave Suave, shampoo, humectant moisture, 28ozHair Careno descriptionShampoos"
176,B0000530LO,no description,6.85,"Suave Suave Naturals Shampoo, Daily Clarifying - 2",-1.0,10144.0,Hair Care,Shampoos,0.0,"Suave Suave Naturals Shampoo, Daily Clarifying - 2Hair Careno descriptionShampoos"
13765,B0020122ZS,no description,5.16,"Suave Suave Kids 2 in 1 Shampoo and Conditioner, C",-1.0,58864.0,Hair Care,Shampoos,0.0,"Suave Suave Kids 2 in 1 Shampoo and Conditioner, CHair Careno descriptionShampoos"


In [40]:
D = euclidean_distances(tfidf_all_content)

In [41]:
def recommend_euclidean(asin):
    ind = inddict[asin]
    distance = list(enumerate(D[ind]))
    distance = sorted(distance, key=lambda x: x[1])
    distance = distance[1:6]
    # Get the item indices
    asin_index = [i[0] for i in distance]

    # Return the top 5 most similar items using integer-location based indexing (iloc)
    return meta_5.iloc[asin_index]

In [42]:
recommend_euclidean('3227001381')

Unnamed: 0,asin,description,price,brand_title,health_personal_care,beauty,main_cat,sub_cat,related_count,all_content
2078,B0009I4MKW,no description,21.0,"Nexxus Nexxus shampoo therappe, 33.8oz",-1.0,895.0,Hair Care,Shampoos,235.0,"Nexxus Nexxus shampoo therappe, 33.8ozHair Careno descriptionShampoos"
16036,B002VA4FXA,no description,19.99,"Suave Suave Professionals Shampoo, Rosemary Mint f",-1.0,177752.0,Hair Care,Shampoos,0.0,"Suave Suave Professionals Shampoo, Rosemary Mint fHair Careno descriptionShampoos"
25901,B006N9LWWW,no description,10.35,"Suave Suave Professionals mens, shampoo/conditione",-1.0,24913.0,Hair Care,Shampoos,140.0,"Suave Suave Professionals mens, shampoo/conditioneHair Careno descriptionShampoos"
28200,B008AGWLJ4,no description,15.97,"Nexxus promend shampoo, 33.8oz",-1.0,18581.0,Hair Care,Shampoos,111.0,"Nexxus promend shampoo, 33.8ozHair Careno descriptionShampoos"
32687,B00EYZY5TY,no description,10.9,Fekkai Fekkai Apple Cider Shampoo 236ml/8oz,37518.0,-1.0,Hair Care,Shampoos,175.0,Fekkai Fekkai Apple Cider Shampoo 236ml/8ozHair Careno descriptionShampoos
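
For unit-length vectors, such as the rows produced by TfidfVectorizer with its default L2 normalization, squared Euclidean distance and cosine similarity are tied by d² = 2 − 2·cos, so (ties aside) ranking by smallest distance gives the same order as ranking by largest cosine similarity. A sketch that checks this on random unit vectors (toy data, not the notebook's matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((6, 4))
V /= np.linalg.norm(V, axis=1, keepdims=True)  # scale every row to unit length

cos = V @ V.T                                                   # cosine similarities
dist = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2)    # pairwise distances

identity_holds = np.allclose(dist ** 2, 2 - 2 * cos)
same_ranking = np.array_equal(np.argsort(-cos[0]), np.argsort(dist[0]))
print(identity_holds, same_ranking)
```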


Convert reviewerID from string to integer type so it can be used in regression analysis.

In [43]:
from sklearn.preprocessing import LabelEncoder

lb_user = LabelEncoder()
reviews["userid"] = lb_user.fit_transform(reviews["reviewerID"])
reviews[["reviewerID", "userid"]].head(11)

Unnamed: 0,reviewerID,userid
0,A39HTATAQ9V7YF,725046
1,A3JM6GV9MNOF9X,814606
2,A1Z513UWSAAO0F,313101
3,A1WMRR494NWEWV,291075
4,A3IAAVS479H7M7,802842
5,AKJHHD5VEH7VG,1073169
6,A1BG8QW55XHN6U,102756
7,A22VW0P4VZHDE3,346278
8,A3V3RE4132GKRO,916162
9,A327B0I7CYTEJC,660058
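
A small sketch of what LabelEncoder does, using shortened toy IDs made up for illustration. Codes are assigned according to the sorted order of the distinct labels, so they act as identifiers and their magnitude carries no meaning:

```python
from sklearn.preprocessing import LabelEncoder

toy_ids = ["A3JM", "A39H", "A3JM", "A1Z5"]

encoder = LabelEncoder()
codes = encoder.fit_transform(toy_ids)       # one integer per distinct ID
classes = list(encoder.classes_)             # distinct IDs in sorted order
restored = list(encoder.inverse_transform(codes))

print(codes, classes)
```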


In [44]:
reviews[reviews['reviewerID'] == 'A39HTATAQ9V7YF']

Unnamed: 0,reviewerID,asin,overall,reviewTime,review,upvotes,downvotes,word_count,polarity,userid
0,A39HTATAQ9V7YF,0205616461,5,2013-05-28,bioactive antiaging serum love moisturizer would recommend someone dry skin fine lines wrinkles using brand day night serum,0,0,34,0.283333,725046
899125,A39HTATAQ9V7YF,B002OVV7F0,3,2013-05-28,haute model nyx liked different colorsbut find stay longand little bit powderythanks letting express opinion,0,0,29,0.1375,725046
969482,A39HTATAQ9V7YF,B0031IH5FQ,5,2013-05-28,bioactive antiaging cream love product rich texture good dry skin like mine reduces fine lines wrinkles,0,0,26,0.385,725046
1499680,A39HTATAQ9V7YF,B006GQPZ8E,4,2013-05-28,peach parfit revlon found color beautiful smooth lipsstaying power okay like lipstick would recomend others,0,0,29,0.583333,725046


In [45]:
reviews.head(2)

Unnamed: 0,reviewerID,asin,overall,reviewTime,review,upvotes,downvotes,word_count,polarity,userid
0,A39HTATAQ9V7YF,205616461,5,2013-05-28,bioactive antiaging serum love moisturizer would recommend someone dry skin fine lines wrinkles using brand day night serum,0,0,34,0.283333,725046
1,A3JM6GV9MNOF9X,558925278,3,2012-12-14,product ok im use baby kabuki moment received product deadlinei tested baby kabuki quality material best packaging cute love itthe fibers smell soft,0,1,44,0.52,814606


In [46]:
data = meta_5.merge(reviews, on='asin')
data.head(2)

Unnamed: 0,asin,description,price,brand_title,health_personal_care,beauty,main_cat,sub_cat,related_count,all_content,reviewerID,overall,reviewTime,review,upvotes,downvotes,word_count,polarity,userid
0,1304351475,too faced natural eye shadow palette color include heaven silk teddy nude beach velvet revolver pushup honey pot sexspresso erotica cocoa puff collectible tin version new box full size,33.99,Omagazee NEW EUROPEAN COLLECTION Too Faced Natural,-1.0,15567.0,Makeup,Eyes,309.0,Omagazee NEW EUROPEAN COLLECTION Too Faced NaturalMakeuptoo faced natural eye shadow palette color include heaven silk teddy nude beach velvet revolver pushup honey pot sexspresso erotica cocoa puff collectible tin version new box full sizeEyes,A1RXI3A1E99112,5,2014-07-14,great product use almost every day well worth price lovetoo faced products go smooth last pretty long colors coordinated use lot different looks great product,0,0,45,0.3125,249303
1,1304351475,too faced natural eye shadow palette color include heaven silk teddy nude beach velvet revolver pushup honey pot sexspresso erotica cocoa puff collectible tin version new box full size,33.99,Omagazee NEW EUROPEAN COLLECTION Too Faced Natural,-1.0,15567.0,Makeup,Eyes,309.0,Omagazee NEW EUROPEAN COLLECTION Too Faced NaturalMakeuptoo faced natural eye shadow palette color include heaven silk teddy nude beach velvet revolver pushup honey pot sexspresso erotica cocoa puff collectible tin version new box full sizeEyes,A26QL1FBQO9C0E,5,2014-02-11,great palette colors really pigmented palette creates beautiful natural eye without eyes looking crazy love,0,0,24,0.308333,380136


Perform one-hot encoding on main_cat so that it can also be used in the regression analysis.

In [47]:
maincat_df = pd.get_dummies(data['main_cat'])
final_df = pd.concat([data, maincat_df], axis=1)
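
The same two steps on a toy frame (values made up for illustration) show the shape of the result: one indicator column per category, appended to the original columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "asin": ["a1", "a2", "a3"],
    "main_cat": ["Makeup", "Hair Care", "Makeup"],
})

dummies = pd.get_dummies(toy["main_cat"])   # one indicator column per category
encoded = pd.concat([toy, dummies], axis=1) # original columns plus indicators

print(encoded)
```

Depending on the pandas version, the indicator columns hold 0/1 integers or booleans; both work as numeric model inputs.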

In [48]:
final_df.head(2)

Unnamed: 0,asin,description,price,brand_title,health_personal_care,beauty,main_cat,sub_cat,related_count,all_content,reviewerID,overall,reviewTime,review,upvotes,downvotes,word_count,polarity,userid,Bath & Body,Fragrance,Hair Care,Makeup,Personal Care,Skin Care,Tools & Accessories
0,1304351475,too faced natural eye shadow palette color include heaven silk teddy nude beach velvet revolver pushup honey pot sexspresso erotica cocoa puff collectible tin version new box full size,33.99,Omagazee NEW EUROPEAN COLLECTION Too Faced Natural,-1.0,15567.0,Makeup,Eyes,309.0,Omagazee NEW EUROPEAN COLLECTION Too Faced NaturalMakeuptoo faced natural eye shadow palette color include heaven silk teddy nude beach velvet revolver pushup honey pot sexspresso erotica cocoa puff collectible tin version new box full sizeEyes,A1RXI3A1E99112,5,2014-07-14,great product use almost every day well worth price lovetoo faced products go smooth last pretty long colors coordinated use lot different looks great product,0,0,45,0.3125,249303,0,0,0,1,0,0,0
1,1304351475,too faced natural eye shadow palette color include heaven silk teddy nude beach velvet revolver pushup honey pot sexspresso erotica cocoa puff collectible tin version new box full size,33.99,Omagazee NEW EUROPEAN COLLECTION Too Faced Natural,-1.0,15567.0,Makeup,Eyes,309.0,Omagazee NEW EUROPEAN COLLECTION Too Faced NaturalMakeuptoo faced natural eye shadow palette color include heaven silk teddy nude beach velvet revolver pushup honey pot sexspresso erotica cocoa puff collectible tin version new box full sizeEyes,A26QL1FBQO9C0E,5,2014-02-11,great palette colors really pigmented palette creates beautiful natural eye without eyes looking crazy love,0,0,24,0.308333,380136,0,0,0,1,0,0,0


In [49]:
# columns= already selects the axis, so axis=1 is not needed
final_df.drop(columns=['description','brand_title','main_cat','sub_cat','all_content'], inplace=True)
final_df.drop(columns=['reviewTime','review','reviewerID'], inplace=True)
final_df.head(2)

Unnamed: 0,asin,price,health_personal_care,beauty,related_count,overall,upvotes,downvotes,word_count,polarity,userid,Bath & Body,Fragrance,Hair Care,Makeup,Personal Care,Skin Care,Tools & Accessories
0,1304351475,33.99,-1.0,15567.0,309.0,5,0,0,45,0.3125,249303,0,0,0,1,0,0,0
1,1304351475,33.99,-1.0,15567.0,309.0,5,0,0,24,0.308333,380136,0,0,0,1,0,0,0


Finally, let's create different regression models for rating predictions based on all numerical columns.

In [50]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# import regression models and metrics
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor 
from sklearn.metrics import r2_score, mean_squared_error

In [51]:
trainset, testset = train_test_split(final_df, test_size=0.25)

In [52]:
X_train, y_train = trainset.drop(['asin','overall'], axis=1), trainset.overall
X_test, y_test = testset.drop(['asin','overall'], axis=1), testset.overall

In [53]:
index = ['Lasso','Ridge','RandomForestRegressor','GradientBoostingRegressor']
score_table = pd.DataFrame(index = index, columns= ['r2_train','mse_train','rmse_train','r2_test','mse_test','rmse_test'])

In [54]:
def compute_log_result(algo, pred_train, pred_test):
    r2_train = r2_score(y_train, pred_train)
    r2_test = r2_score(y_test, pred_test)
    mse_train = mean_squared_error(y_train, pred_train)
    mse_test = mean_squared_error(y_test, pred_test)
    rmse_train = np.sqrt(mse_train)
    rmse_test = np.sqrt(mse_test)
    score_table.loc[algo,:] = r2_train, mse_train, rmse_train, r2_test, mse_test, rmse_test
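
For reference, the three metrics stored in `score_table` can be computed by hand. A sketch on made-up ratings, where MSE is the mean squared error, RMSE its square root, and R² = 1 − SS_res / SS_tot:

```python
import numpy as np

y_true = np.array([3.0, 4.0, 5.0])   # toy ratings
y_pred = np.array([3.0, 4.0, 4.0])   # toy predictions

mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(mse, rmse, r2)
```

RMSE is expressed in the same units as the rating, which makes it easier to read than MSE on a 1 to 5 scale.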

In [55]:
lasso = Pipeline([('scaler', StandardScaler()),('lasso', Lasso(alpha=0.0015, max_iter=1000, selection='random'))])
lasso.fit(X_train, y_train)
pred_train = lasso.predict(X_train)
pred_test = lasso.predict(X_test)
compute_log_result("Lasso", pred_train, pred_test)

In [56]:
ridge = Pipeline([('scaler', StandardScaler()),('ridge',Ridge(alpha=100,max_iter=1000,tol=0.001))])
ridge.fit(X_train, y_train)
pred_train = ridge.predict(X_train)
pred_test = ridge.predict(X_test)
compute_log_result("Ridge", pred_train, pred_test)

In [57]:
rfr = Pipeline([('scaler', StandardScaler()),('rfr', RandomForestRegressor(n_estimators=70, max_features='log2'))])
rfr.fit(X_train, y_train)
pred_train = rfr.predict(X_train)
pred_test = rfr.predict(X_test)
compute_log_result("RandomForestRegressor", pred_train, pred_test)

In [58]:
gbr = Pipeline([('scaler', StandardScaler()),('gbr', GradientBoostingRegressor(n_estimators=400, max_features='log2'))])
gbr.fit(X_train, y_train)
pred_train = gbr.predict(X_train)
pred_test = gbr.predict(X_test)
compute_log_result("GradientBoostingRegressor", pred_train, pred_test)

In [59]:
score_table

Unnamed: 0,r2_train,mse_train,rmse_train,r2_test,mse_test,rmse_test
Lasso,0.236038,1.30334,1.14164,0.2372,1.29666,1.13871
Ridge,0.236289,1.30291,1.14145,0.237376,1.29636,1.13858
RandomForestRegressor,0.902064,0.167081,0.408756,0.313068,1.1677,1.0806
GradientBoostingRegressor,0.31845,1.16275,1.07831,0.318015,1.15929,1.0767


#### Pros:

1. If the items have sufficient descriptions, the new-item (cold-start) problem is avoided, because an item can be recommended before it has received any ratings.

2. The method only needs to analyze item profiles to make recommendations. It is independent of other users' profiles and ratings.

#### Cons:

1. Recommendations will be imprecise if the items are poorly described.

2. The recommendations made to a user may lack novelty, since all recommended items are similar to those the user has already rated.