<h1><center>Amazon Beauty Products - Recommender System</center></h1>
<h2><center>Second Milestone Report</center></h2>
<h3><center>Machine Learning Techniques</center></h3>

## 1. Introduction

### 1.1 Problem

In this project, I plan to build an intelligent recommendation engine for beauty products that are sold on Amazon.com. By providing recommendations, Amazon.com will give its customers the ability to take their brand experiences into their hands and make informed decisions. This level of personalization will support higher customer retention and reinforce brand loyalty.

### 1.2 Approach

#### Data Wrangling:

After loading the json files into pandas dataframes, the raw data will be cleaned, structured and enriched. Clean data allows quicker and better analysis and produces more accurate results. Identifying and removing errors and duplicates will create a reliable dataset and a better-quality outcome. This step was completed in the first milestone.

#### EDA:

EDA (Exploratory Data Analysis) is the initial basic exploration of the dataset in a systematic manner using visual methods. This step will include identification and elimination of outliers as well as checking for correlations between the independent variables. This step was also completed in the first milestone.

#### Machine Learning:

Different types of recommender systems will be built using machine learning algorithms.
The algorithms can be classified into two categories: content-based filtering and collaborative filtering. Modern recommenders use a hybrid approach, which combines the two.
Recommender systems are applied where many users interact with many items. We have a rich dataset with item attributes and historical user reviews for these items, which will be used to train and test the models.

#### Performance Evaluation:

A variety of offline and online evaluation metrics are available to measure the accuracy of recommender systems. Each model will be evaluated using these metrics, and the scores will be compared to determine which performs best. Strong recommender systems can improve the user experience. They can result in higher customer satisfaction and retention, and in turn boost revenue.

## 2. Data Collection

The dataset has been downloaded from the website: http://jmcauley.ucsd.edu/data/amazon/links.html
(Citation:
Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering
R. He, J. McAuley
WWW, 2016)

There are 2 individual json files containing the beauty product reviews and metadata from Amazon. The reviews file has ~2 million reviews spanning May 1996 - July 2014. The metadata file contains the details of ~260K beauty products sold at Amazon.

The reviews file includes product ratings, the text of each review and helpfulness votes.  


| Column        | Description                     |
|---------------|---------------------------------|
|reviewerID     | Reviewer ID                     |
|asin           | Product ID                      |
|reviewerName   | Name of reviewer                |
|helpful        | Helpfulness rating of the review|
|reviewText     | Text of the review              |
|overall        | Rating of the product           |
|summary        | Summary of the review           |
|unixReviewTime | Unix time of the review         | 
|reviewTime     | Raw time of the review          |   

The metadata file includes descriptions, category information, price, sales rank, brand info and image features.  
  

| Column     | Description                           |
|------------|---------------------------------------|
|asin        | Product ID                            |
|description | Product description                   |
|title       | Name of the product                   |
|imUrl       | URL of the product image              |
|salesRank   | Sales rank information                |
|categories  | List of categories product belongs to |
|price       | Price in US Dollars                   | 
|related     | Related products                      |
|brand       | Brand name                            | 

The json files were retrieved in compressed format. They were parsed line by line and loaded into two pandas dataframes, reviews and meta. These dataframes were then written to .csv files for analysis and modeling.
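The loading step could look roughly like the sketch below. It assumes one record per line and falls back to parsing a Python-style dict literal when a line is not strict JSON. The helper names `parse_line` and `load_json_lines` are hypothetical and are not taken from the project notebooks.

```python
import ast
import gzip
import json

import pandas as pd


def parse_line(line):
    """Parse one record; fall back to a Python literal if it is not strict JSON."""
    line = line.strip()
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        # Some dumps use single-quoted dicts, which json.loads rejects.
        return ast.literal_eval(line)


def load_json_lines(path):
    """Read a gzip-compressed file with one record per line into a DataFrame."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        records = [parse_line(line) for line in handle if line.strip()]
    return pd.DataFrame(records)
```

A call such as `load_json_lines("reviews.json.gz")` (file name assumed) would then return the raw reviews dataframe, ready to be written out with `to_csv`.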

## 3. Data Wrangling

### 3.1 Metadata file

The metadata file contains the details of ~260K beauty products sold at Amazon. There are 9 columns in the file including asin, description, title, brand, categories, related, price, sales rank and image url. No duplicate values were found in the asin column. 

51% of the items were missing "brand" information, far too many to simply discard. 444 items were also missing the "title" information. The "brand" and "title" columns were therefore merged into a single "brand_title" column, and the individual columns were then dropped.

The image URL column carries little information for this analysis, so it was also dropped from the dataframe.

The "salesRank" column had values in dict format (key-value pairs). So this column was split into multiple columns where each key became the column name and the value became the column value. Some of these new columns were not related to Beauty and Personal care so they were dropped. Only the new columns "Beauty" and "Health & Personal Care" were retained.
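A minimal sketch of this split is given here, assuming the column still holds Python dicts (not strings) at this stage. The toy dataframe `meta_demo` and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the metadata; the ranks are invented.
meta_demo = pd.DataFrame({
    "asin": ["A1", "A2", "A3"],
    "salesRank": [{"Beauty": 4324}, {"Health & Personal Care": 120}, None],
})

# Replace missing entries with empty dicts so every row expands cleanly.
rank_dicts = meta_demo["salesRank"].apply(lambda d: d if isinstance(d, dict) else {})

# One column per dict key; rows that lack a key get NaN.
rank_cols = pd.DataFrame(rank_dicts.tolist(), index=meta_demo.index)

# Keep only the two categories of interest and attach them to the frame.
keep = ["Beauty", "Health & Personal Care"]
meta_demo = meta_demo.drop(columns="salesRank").join(rank_cols.reindex(columns=keep))
```

Using `reindex(columns=keep)` rather than plain column selection means the code still works if one of the two categories never appears in the data.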

At this point, the resulting dataframe was written to a .csv file (cleaned_meta.csv).

It includes the columns:
asin, description, categories, price, related, brand_title, Health & Personal Care, Beauty

Next came the most challenging column for cleaning. 

This is the "related" column which had data in the following format:

{'also_bought': ['B002J2KOXK', 'B008TCVZ6Y'], 'also_viewed': ['B002J2KOXK'], 'bought_together': ['B002J2KOXK'], 'buy_after_viewing': ['B002J2KOXK']}

It made the most sense to count the number of times an asin occurred in any of these 4 key-value pairs and to use that count as an indicator of how popular an item is. This required the creation of a new column called "related_count".

The "asin" and "related" columns were first copied to a separate dataframe. The "related" column was then split into 4 new columns "also_bought", "also_viewed", "bought_together" and "buy_after_viewing". Then the values in these 4 columns were concatenated into a large string. Using iterrows, each asin in the dataframe was checked against the string and the number of occurrences of this asin in the string was captured in the "related_count" column. This particular operation took 20 hours to complete and the resulting dataframe was immediately captured into a .csv file (cleaned_related.csv).
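For comparison, the same count can usually be obtained in a single pass with a `collections.Counter`, which avoids scanning one large string once per asin. The sketch below is an alternative, not the code that was actually run. It assumes the "related" values are dicts of asin lists, and the toy data is made up.

```python
from collections import Counter

import pandas as pd

# Toy stand-in for the asin/related dataframe.
related_demo = pd.DataFrame({
    "asin": ["B002J2KOXK", "B008TCVZ6Y", "B00AAAAAAA"],
    "related": [
        {"also_bought": ["B008TCVZ6Y"], "also_viewed": ["B008TCVZ6Y"]},
        {"also_bought": ["B002J2KOXK"], "bought_together": ["B002J2KOXK"],
         "buy_after_viewing": ["B002J2KOXK"]},
        None,
    ],
})

# Count every asin mentioned in any of the four lists, across all rows.
counts = Counter()
for related in related_demo["related"]:
    if isinstance(related, dict):
        for asin_list in related.values():
            counts.update(asin_list)

# A Counter returns 0 for keys it has never seen.
related_demo["related_count"] = related_demo["asin"].map(lambda a: counts[a])
```

Because each related list is visited exactly once, the run time grows with the total number of related entries and not with the number of items times the length of the concatenated string.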

The code for data wrangling of metadata file can be found at:
https://github.com/venustrip/Amazon-Recommendation-Engine/blob/master/clean_meta.ipynb

### 3.2 Reviews File

The reviews file has ~2 million reviews collected between May-1996 and July-2014. There are 9 columns in the file including: reviewerID, reviewerName, asin, helpful, reviewText, overall, summary, unixReviewTime and reviewTime.

12,248 rows were missing the "reviewerName" information, and some "reviewerID" values appeared with more than one "reviewerName". Since the model can be built with the ID alone, the "reviewerName" column was dropped from the dataframe.

The "summary" and "reviewText" columns have similar information. 14 rows were missing the "summary" information. The 2 columns were merged into "review" column and the individual columns were dropped.

The column "reviewTime" was in text format, so it was converted to datetime and the column "unixReviewTime" was dropped.

The "helpful" column had values in list format, where the first element indicated the number of upvotes and second element indicated the number of downvotes for the review. This column was split into two: "upvotes" and "downvotes" and the original "helpful" column was dropped.

Since the "review" column is the most significant for this project and consists of long text values, it can be used for natural language processing.

First the wordcount of each review was stored into a new column "word_count". 

Next, the text was converted to lowercase to avoid having multiple versions of one word.  

All punctuation marks were removed since they do not add any extra value while processing text data.

Finally, the stop words and the 50 rarest words were removed.
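The four text-cleaning steps can be sketched as follows. The stop-word list here is a tiny illustrative set, and only the 2 rarest words are dropped so that the toy example stays readable. Both are assumptions and do not reflect the settings used in the notebook.

```python
import string
from collections import Counter

import pandas as pd

STOP_WORDS = {"the", "a", "is", "it", "and", "this", "i"}  # illustrative only
N_RARE = 2                                                 # the report used 50

reviews_demo = pd.DataFrame({"review": [
    "I love this serum, it is GREAT!",
    "The brush is soft and the packaging is cute.",
]})

# 1. Word count on the raw text.
reviews_demo["word_count"] = reviews_demo["review"].str.split().str.len()

# 2. Lowercase, then 3. strip punctuation.
table = str.maketrans("", "", string.punctuation)
cleaned = reviews_demo["review"].str.lower().apply(lambda text: text.translate(table))

# 4. Drop stop words and the rarest remaining words.
tokens = cleaned.str.split()
freq = Counter(word for row in tokens for word in row if word not in STOP_WORDS)
rare = {word for word, _ in freq.most_common()[-N_RARE:]}
reviews_demo["review"] = tokens.apply(
    lambda row: " ".join(w for w in row if w not in STOP_WORDS and w not in rare)
)
```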

Once these steps were completed, sentiment analysis was performed on all the reviews and the polarity of each was calculated. This value was stored in the newly created "polarity" column.

After the cleaning, the following columns existed in the dataframe:
reviewerID, asin, overall, reviewTime, review, upvotes, downvotes, word_count, polarity

The dataframe was written to a .csv file for further analysis (cleaned_reviews.csv).

The code for data wrangling of reviews file can be found at: https://github.com/venustrip/Amazon-Recommendation-Engine/blob/master/clean_reviews.ipynb

### 3.3 Cleaned Datasets

All the cleaned datasets were written to .csv files. The first file has the metadata, the second one has the related_count information for the products and the third file has all the reviews for these products.

The "meta" file includes:
1. asin
2. description
3. categories
4. price
5. related
6. brand_title
7. health_and_personal_care
8. beauty

The "related" file includes:
1. asin
2. related_count

The "reviews" file includes:
1. reviewerID
2. asin
3. overall
4. reviewTime
5. review
6. upvotes
7. downvotes
8. word_count
9. polarity

First let's import the required packages

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import gc

pd.set_option('display.max_columns', None)  
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', None)  # None = no truncation; -1 is rejected by recent pandas

import warnings 
warnings.filterwarnings('ignore')

The related and meta dataframes obtained from Data wrangling step were merged into one and written to a new .csv file. 
Read the cleaned files into dataframes. 

In [2]:
meta = pd.read_csv('cleaned_metadata.csv',index_col=0)
reviews = pd.read_csv('cleaned_reviews.csv',index_col=0)

In [3]:
reviews.head(2)

| | reviewerID | asin | overall | reviewTime | review | upvotes | downvotes | word_count | polarity |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A39HTATAQ9V7YF | 205616461 | 5 | 2013-05-28 | bioactive antiaging serum love moisturizer would recommend someone dry skin fine lines wrinkles using brand day night serum | 0 | 0 | 34 | 0.283333 |
| 1 | A3JM6GV9MNOF9X | 558925278 | 3 | 2012-12-14 | product ok im use baby kabuki moment received product deadlinei tested baby kabuki quality material best packaging cute love itthe fibers smell soft | 0 | 1 | 44 | 0.52 |


In [4]:
meta.head(2)

| | asin | description | price | brand_title | health_personal_care | beauty | main_cat | sub_cat | related_count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 205616461 | as age youthful healthy skin succumbs enzymatic imbalance wears away cellular network resulting skin thinning aging combining best nature cosmetic biotechnology bioactive products formulated enzymes gently exfoliate skin stimulate regeneration youthful glow benefiting fertile orchards italian countryside bioactive formulas rich phytohormones flavonoids fatty acids active extracts apple pear seeds enzymatically modified developed especially care aging skin this repairing fluid helps nourish firm accelerating penetration delivery active principles skin giving youthful appearance advanced probiotic complex nourishing milk proteins regains skins natural equilibrium boosts immunities protects environmental biological stress peptides ceramides help firm regenerate skin stimulating collagen production strengthening epidermis a calming botanical complex hyaluronic acid wheat germ extract hydrates restores skins protective barriers a nutritive vitamin complex moisturizes protects skin damaging environmental factors paracress extract natural alternative cosmetic injections limits relaxes microcontractions create facial lines producing immediate longterm smoothing skin to use apply pumps apply pumps clean dried face neck dcollet | NaN | Bio-Active Anti-Aging Serum (Firming Ultra-Hydrati | 461765.0 | -1.0 | Skin Care | Face | 0.0 |
| 1 | 558925278 | mineral powder brushapply powder mineral foundation face circular buffing motion work inward towards nose concealer brushuse liquid mineral powder concealer coverage blemishes eyes eye shading brush expertly cut apply blend powder eye shadows baby kabuki buff powder areas need coverage cosmetic brush bag 55 hemp linen 45 cotton | NaN | Eco Friendly Ecotools Quality Natural Bamboo Cosme | -1.0 | 402875.0 | Tools & Accessories | Makeup Brushes & Tools | 0.0 |


In [5]:
meta.info(show_counts=True)  # null_counts was renamed to show_counts in newer pandas

<class 'pandas.core.frame.DataFrame'>
Int64Index: 259204 entries, 0 to 259203
Data columns (total 9 columns):
asin                    259204 non-null object
description             259137 non-null object
price                   189930 non-null float64
brand_title             258760 non-null object
health_personal_care    259204 non-null float64
beauty                  259204 non-null float64
main_cat                259204 non-null object
sub_cat                 259204 non-null object
related_count           259204 non-null float64
dtypes: float64(4), object(5)
memory usage: 19.8+ MB


In [6]:
reviews.info(show_counts=True)  # null_counts was renamed to show_counts in newer pandas

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2023070 entries, 0 to 2023069
Data columns (total 9 columns):
reviewerID    2023070 non-null object
asin          2023070 non-null object
overall       2023070 non-null int64
reviewTime    2023070 non-null object
review        2023067 non-null object
upvotes       2023070 non-null int64
downvotes     2023070 non-null int64
word_count    2023070 non-null int64
polarity      2023070 non-null float64
dtypes: float64(1), int64(4), object(4)
memory usage: 154.3+ MB


In [7]:
reviews[reviews['review'].isnull()]

| | reviewerID | asin | overall | reviewTime | review | upvotes | downvotes | word_count | polarity |
|---|---|---|---|---|---|---|---|---|---|
| 1360701 | A1LF7KGH427GYC | B00552467Q | 5 | 2014-07-08 | NaN | 0 | 0 | 2 | 0.0 |
| 1467181 | AUII249W6FBZY | B00604MV0C | 3 | 2013-01-11 | NaN | 0 | 0 | 2 | 0.0 |
| 1784718 | A3BX2SSKN9SFRV | B00ADXZ9LY | 1 | 2014-07-03 | NaN | 0 | 0 | 4 | 0.0 |


Missing values can lead to weak or biased analysis. 

They can be handled in 3 ways: 
1. Delete the rows with missing values using dropna - note that this can cause information loss
2. Impute the missing values using fillna
3. Predictive filling using the interpolate method

Three reviews have a missing value in the review field. We will impute this field from the overall rating: 5 is mapped to "Excellent" and 1 to "Bad", with ratings 2, 3 and 4 falling in between.

In [8]:
d = {5: "Excellent", 4:"Very Good", 3: "Good", 2:"Not Good", 1: "Bad"}
s = reviews.overall.map(d)
reviews['review'] = reviews['review'].combine_first(s)

In [9]:
reviews[reviews['review']=='Excellent']

| | reviewerID | asin | overall | reviewTime | review | upvotes | downvotes | word_count | polarity |
|---|---|---|---|---|---|---|---|---|---|
| 1360701 | A1LF7KGH427GYC | B00552467Q | 5 | 2014-07-08 | Excellent | 0 | 0 | 2 | 0.0 |


In the meta dataframe, 3 columns still have missing values: description, brand_title and price. Let's try to impute these values. The best way to estimate a missing price would be to look at the prices of similar items, but for now we will simply replace missing prices with 0.
Since there is no reliable way to recover the description and brand_title of an item, we will use main_cat to populate these 2 columns, because the main category is at least loosely related to both.

In [10]:
meta['main_cat'] = meta['main_cat'].str.replace(']','')

In [11]:
meta['brand_title'].unique()

array(['Bio-Active Anti-Aging Serum (Firming Ultra-Hydrati',
       'Eco Friendly Ecotools Quality Natural Bamboo Cosme',
       'Mastiha Body Lotion', ...,
       'ResQ Organics Face &amp; Body Wash - Aloe Vera Man',
       '2 Tier Tulle Elbow Wedding Veil with Ribbon Edge F',
       '*ECOCRAFTWORLD* GENUINE BUFFALO LEATHER TRAVEL BAG'], dtype=object)

In [12]:
meta['description'] = meta['description'].fillna(meta['main_cat'])
meta['brand_title'] = meta['brand_title'].fillna(meta['main_cat'])
meta['price'] = meta['price'].fillna(0)

Rename the 2 columns that were extracted from salesRank, since their names are inconsistent with the remaining columns.

In [13]:
meta.rename(columns={'Health & Personal Care': 'health_personal_care', 'Beauty': 'beauty'}, inplace=True)

In [14]:
data = pd.merge(reviews, meta, on='asin')

In [15]:
data.shape

(2023070, 17)

The brand_title field will be useful for building the recommender system. Since it is text data, we will have to process it so it can be read by the algorithm. This processing includes converting all text to lowercase and removing punctuation.

In [16]:
# Lowercase every word and collapse repeated whitespace.
meta['brand_title'] = meta['brand_title'].apply(lambda text: " ".join(word.lower() for word in text.split()))

In [17]:
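# regex=True is required in recent pandas, where str.replace treats the pattern literally by default.
meta['brand_title'] = meta['brand_title'].str.replace(r'[^\w\s]', '', regex=True)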

Now that all our collected data is fully cleaned, we have to filter it so that relevant information can be extracted and used to make the final recommendations.

## 4. Building the Recommendation Systems

There are 4 common ways to build a recommender engine. 
1. Popularity based
2. Content based
3. Collaborative
4. Hybrid 


Each will be explored in more detail as we build them one by one.

### 4.1 Popularity-based Recommender

This is the simplest recommendation system. As the name suggests, it simply recommends the popular and highly rated items to all users. The biggest drawback of this system is that it is non-personalized and recommends the same items to everyone. Since users have different tastes and preferences, this is not a very useful filtering method.

One way of measuring popularity is to count the number of times an item was rated. The higher the rating count, the more popular the item.

Similarly, the items with the highest average ratings are also popular and can be recommended to all customers. We also calculated the polarity of each review from its text. Items with a higher mean polarity are good candidates for recommendation as well.

Another way to measure the popularity of an item is to check its related_count. This count was obtained by adding up the number of times an item appeared as also bought, also viewed, bought together with, or bought after viewing another item. The higher the related_count, the more popular the item should be.

We also have sales ranking for the items either under beauty or under health_personal_care categories. Another way of determining popular items is by looking for their sales rank under either of these columns.

The most popular items can also be determined using weighted average ratings. 
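One common choice of weighted rating is the formula popularized by IMDb, which pulls the mean rating of rarely reviewed items toward the global mean. Using this particular formula is an assumption, as is the threshold `m`. The toy ratings are made up.

```python
import pandas as pd

# Toy ratings: A1 has four reviews, A2 a single 5-star review, A3 two poor ones.
ratings_demo = pd.DataFrame({
    "asin": ["A1", "A1", "A1", "A1", "A2", "A3", "A3"],
    "overall": [5, 4, 5, 4, 5, 2, 2],
})

# v = number of ratings per item, R = mean rating per item.
stats = ratings_demo.groupby("asin")["overall"].agg(v="count", R="mean")
C = ratings_demo["overall"].mean()  # mean rating across all reviews
m = 2                               # hypothetical minimum-ratings threshold

# weighted = v/(v+m) * R + m/(v+m) * C
stats["weighted"] = (stats["v"] * stats["R"] + m * C) / (stats["v"] + m)
top_items = stats.sort_values("weighted", ascending=False)
```

In this toy data A2 has the highest raw mean, but its single review counts for less than the four reviews of A1, so A1 ranks first on the weighted score.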

#### Pros and Cons of Popularity-based Filtering

##### Pros:
It is the simplest form of recommendation system.

##### Cons:
The caveat to Popularity based recommenders is that they do not filter items based on personal preferences and recommend the same top-N items to every single user. The method is pretty simple, but not very effective.

For complete code on Popularity-based recommendation, please refer to: 
https://github.com/venustrip/Amazon-Recommendation-Engine/blob/master/recommender_popular.ipynb

### 4.2 Content-based Filtering

This method recommends an item based on its features and how similar they are to features of other items in the data set. It is based on similarity of item attributes, which can be determined using cosine similarity or nearest neighbor algorithms.

The past interactions of a given user with items are taken into account, ignoring all other users. Items are recommended by comparing the content of each item with a user profile. The content of each item can be represented as a set of descriptors or terms. 

We will combine the brand_title, main_cat and description features of the items into a single feature and convert it into vector form so it can be effectively used for determining similarities between different items.

For a fair analysis, we will include only those items that have been reviewed more than 10 times and fewer than 7000 times. Fewer than 10 reviews means that too few users have rated the item. More than 7000 reviews means the item is already popular enough that it would show up as a recommendation regardless.

This leaves 33,878 items to filter on their content/attributes.

The item attributes are converted to vector form using TFIDF and Count vectorizers. 

TF-IDF (term frequency-inverse document frequency) is widely used in natural language processing and information retrieval for feature extraction. It measures how important a word is to a document within a corpus. The text columns are combined into one, and the combined text is used to fit the vectorizer.

The other approach is to use the CountVectorizer, which counts the number of times a token shows up in the document and uses this value as its weight. It is simpler than TfidfVectorizer. 

The pairwise similarity is computed using:
1. Cosine Similarity 
2. Linear Kernel 
3. Euclidean distance
4. Pearson Correlation
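As a small sketch of the TF-IDF route with scikit-learn, the example below combines the three text columns, vectorizes them, and looks up the most similar items. The toy items and the helper `similar_items` are hypothetical. Only the linear kernel is shown, which for TF-IDF vectors gives the same values as cosine similarity.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy items; the real data has 33,878 rows.
items_demo = pd.DataFrame({
    "asin": ["A1", "A2", "A3"],
    "brand_title": ["bioactive antiaging serum", "ecotools bamboo brush", "hydrating face serum"],
    "main_cat": ["Skin Care", "Tools & Accessories", "Skin Care"],
    "description": ["firming serum for aging skin", "soft kabuki brush for powder", "serum for dry skin"],
})

# Combine the text features into a single document per item.
text = items_demo["brand_title"] + " " + items_demo["main_cat"] + " " + items_demo["description"]

tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(text)

# TF-IDF rows are L2-normalized by default, so the dot product computed by
# linear_kernel equals the cosine similarity.
similarity = linear_kernel(matrix, matrix)


def similar_items(asin, top_n=2):
    """Return the asins of the top_n items most similar to the given one."""
    idx = items_demo.index[items_demo["asin"] == asin][0]
    order = similarity[idx].argsort()[::-1]
    return [items_demo.loc[i, "asin"] for i in order if i != idx][:top_n]
```

Swapping `TfidfVectorizer` for `CountVectorizer` gives the count-based variant. In that case the rows are no longer normalized, so `cosine_similarity` should be used instead of `linear_kernel`.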

#### Pros and Cons of Content-based Filtering

##### Pros:
The recommendations are specific to one user, so the model does not need any data about other users. This makes it easier to scale to a large number of users. It is very effective at recommending niche items, since it can capture the specific interests of a user. With sufficiently rich item descriptions, the cold-start problem for new items can largely be avoided.

##### Cons:
The feature representation of the items must be very rich because the model solely depends on that. The model can make recommendations based only on existing interests of the user. So it tends to over-specialize and will recommend items only similar to those already used and rated.

For complete code on Content-based recommendation, please refer to: https://github.com/venustrip/Amazon-Recommendation-Engine/blob/master/recommender_content.ipynb

### 4.3 Collaborative Filtering

The collaborative filtering method ignores user and item attributes and focuses instead on user-item interactions. Recommendations are based purely on past behavior and not on context. 

It is based on the similarity in preferences of two users. It determines similarities between 2 users or 2 items, using the historical ratings given to items by various users and makes recommendations on the basis of that.

The concept here is to develop algorithms that operate independently of the characteristics of particular items being recommended. Patterns between users and items are identified and used to make recommendations.

Using the customer rating for items, a utility matrix is built. These ratings can be explicit (rating on a scale of 1 to 5, likes or dislikes) or implicit (viewing an item, adding it to a wish list, the time spent on an article). In the matrix, each row contains the ratings given by a user and each column contains the ratings received by an item. The matrix is typically sparse as not every user uses every item nor do they rate every item they use.
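A utility matrix of this shape can be built from the long-format reviews with a pandas pivot, as in the sketch below. The toy ratings are made up. For the full dataset, a dense pivot of this kind would likely be too large for memory, so a sparse representation would be needed.

```python
import pandas as pd

# Toy ratings in the same long format as the reviews dataframe.
ratings_demo = pd.DataFrame({
    "reviewerID": ["U1", "U1", "U2", "U3", "U3"],
    "asin": ["A1", "A2", "A1", "A2", "A3"],
    "overall": [5, 3, 4, 2, 5],
})

# Rows are users, columns are items, cells are ratings (NaN = not rated).
utility = ratings_demo.pivot_table(index="reviewerID", columns="asin", values="overall")

# Fraction of user-item pairs that have no rating.
sparsity = utility.isna().to_numpy().mean()
```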

There are two types of collaborative filtering methods:
1. Memory-based: Statistical techniques are applied to the entire dataset to calculate the predictions.
2. Model-based: These methods include a step that reduces or compresses the large but sparse user-item matrix. Matrix factorization can be performed using the singular value decomposition (SVD) algorithm or principal component analysis (PCA).
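To illustrate the model-based idea, the sketch below applies a truncated SVD to a small, mean-centered rating matrix using NumPy. The matrix, the choice of `k = 2` latent factors and the treatment of 0 as "not rated" are all assumptions made for the example. Dedicated recommender libraries fit the factors on the observed ratings only, which this plain SVD does not do.

```python
import numpy as np

# Rows are users, columns are items; 0 means "not rated".
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

# Center each user's observed ratings on that user's mean.
mask = R > 0
user_means = R.sum(axis=1) / mask.sum(axis=1)
centered = np.where(mask, R - user_means[:, None], 0.0)

# Factorize and keep only the k strongest latent factors.
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] + user_means[:, None]

# approx[u, i] is the predicted rating of user u for item i.
```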

Memory based collaborative filtering methods can be further classified as:
1. User-user collaborative filtering
2. Item-item collaborative filtering

#### 4.3.1 User-User collaborative filtering

Recommendations are given to a user on the basis of the likes and dislikes of similar users. This method is useful when the number of users is small. It is not effective when there are many users, because computing the similarity between all user pairs takes a long time.

#### 4.3.2 Item-Item collaborative filtering

Item-based recommenders are faster than user-based for large datasets. They make the recommendations by computing the similarity between each pair of items.
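A minimal item-item sketch with NumPy is shown here: cosine similarity between the item columns of a small rating matrix, followed by a similarity-weighted average to predict a missing rating. The matrix and the helper `predict` are hypothetical, and 0 again stands for "not rated".

```python
import numpy as np

# Rows are users, columns are items; 0 means "not rated".
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

# Cosine similarity between every pair of item columns.
norms = np.linalg.norm(R, axis=0)
item_sim = (R.T @ R) / np.outer(norms, norms)


def predict(user, item):
    """Predict a rating as the similarity-weighted mean of the user's own ratings."""
    rated = R[user] > 0
    weights = item_sim[item, rated]
    if weights.sum() == 0:
        return 0.0  # no similar rated item to base a prediction on
    return float(weights @ R[user, rated] / weights.sum())
```

The user-user variant follows the same pattern with the roles swapped: similarities are computed between the rows of `R`, and a prediction averages the ratings that similar users gave to the item.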

#### Pros and Cons of Collaborative Filtering

##### Pros:
This filtering method does not require any knowledge about features of users and items. Also, it can help recommenders to not overspecialize in a user’s profile and recommend items that are completely different from what they have seen before.

##### Cons:
New items suffer from the cold-start problem: until someone rates them, they are not recommended. Computational complexity also grows as the size of the dataset increases. 

For complete code on Collaborative recommendation, please refer to: https://github.com/venustrip/Amazon-Recommendation-Engine/blob/master/recommender_collaborative.ipynb

### 4.4 Hybrid Method

All the recommender systems suffer from the cold-start problem in some form or another. Hybrid recommender systems combine two or more filtering methods to improve recommendation performance, usually to overcome this problem. 

There are 7 ways to apply the hybrid technique. 
1. Weighted
2. Feature Combination
3. Switching Method
4. Mixed
5. Feature Augmentation
6. Cascade
7. Meta-level

In this project, we will focus on the first three approaches.
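Two of these approaches can be sketched in a few lines. The weighted method blends the scores of two recommenders, while the switching method picks one recommender depending on how much is known about the user. The function names, the blending weight `alpha` and the `min_ratings` threshold are hypothetical. Feature combination is not shown, since it amounts to feeding content features and collaborative features into a single model.

```python
def weighted_hybrid(content_score, collab_score, alpha=0.5):
    """Blend two scores; alpha is the weight given to the content-based score."""
    return alpha * content_score + (1 - alpha) * collab_score


def switching_hybrid(content_score, collab_score, n_user_ratings, min_ratings=5):
    """Use the collaborative score only when the user has enough rating history."""
    if n_user_ratings >= min_ratings:
        return collab_score
    return content_score
```

With `switching_hybrid`, a new user with only 2 ratings would be served by the content-based score, which is one way to work around the cold-start problem for users.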

For complete code on Hybrid recommendation, please refer to: https://github.com/venustrip/Amazon-Recommendation-Engine/blob/master/recommender_hybrid.ipynb
