<h1><center>Amazon Beauty Products - Recommender System</center></h1>

## 1. Introduction

### 1.1 Problem

These days, recommendation engines or recommender systems are commonly used in the digital domain, especially on e-commerce sites. According to Wikipedia, a recommender system or a recommendation system is a subclass of information filtering system that seeks to predict the "rating" or "preference" a user would give to an item. The main goal here is to provide relevant suggestions to online users to make better decisions about their purchases. 

Amazon uses a recommendation system to suggest products that its customers might like. According to McKinsey, 35% of Amazon.com's revenue is generated by its recommendation engine. 

In this project, I plan to build an intelligent recommendation engine for beauty products that are sold on Amazon.com. By providing recommendations, Amazon.com will give its customers the ability to take their brand experiences into their hands and make informed decisions. This level of personalization will support higher customer retention and reinforce brand loyalty.

### 1.2 Approach

#### Data Wrangling:

After loading the json files into pandas dataframes, the raw data will be cleaned, structured and enriched. Clean data allows quicker and better analysis and produces more accurate results. Identifying and removing errors and duplicates creates a reliable dataset and, in turn, a better-quality outcome. This step was completed in the first milestone.

#### EDA:

EDA (Exploratory Data Analysis) is the initial basic exploration of the dataset in a systematic manner using visual methods. This step will include identification and elimination of outliers as well as checking for correlations between the independent variables. This step was also completed in the first milestone.

#### Machine Learning:

Different types of recommender systems will be built using Machine learning algorithms. 
The algorithms can be classified into two categories: Content-based filtering and Collaborative filtering. Modern recommenders use a hybrid approach which is a combination of these two.
Recommender systems are applied where many users interact with many items. We have a rich dataset with item attributes and historical user reviews for these items, that will be used to train and test the models. 

#### Performance Evaluation:

A variety of offline and online evaluation metrics are available to measure the accuracy of recommender systems. Each model will be evaluated using these metrics and the scores will be compared to determine the best performance. Strong recommender systems can have positive effects on user experience. They can result in higher customer satisfaction and retention, and in turn boost revenues.

### 1.3 Impact

Amazon is a widely popular e-commerce company with a huge selection of apparel, household items, beauty products, books etc. Customers shop on its website and frequently provide reviews of these products based on their experience. In this project, I will focus mainly on beauty products, so my client is the product owner of the Amazon beauty department. Through this recommendation system, they will be able to suggest beauty products to their customers, based on the likes and dislikes of other customers. Strong recommender systems can have positive effects on user experience. This can result in higher customer satisfaction and retention, and in turn boost revenues.

Personalization is a technique of dynamically tailoring recommendations based on tastes and preferences of each user. 

Personalization helps with -

1. Recognizing users' profiles, including demographics, geography, and expressed and shared interests

2. Remembering users’ interaction history, explicit and implicit ratings

3. Delivering the content and recommendation for a user based on their actions, preferences, and interests

4. Delivering personalization within the context of user locations and shopping timelines

Benefits of personalization -

1. Enhancement of customer experience by showing relevant content
2. Increase in product visibility, thereby leading to higher sales
3. Increased basket size and more purchases
4. Higher customer retention

### 1.4 Dataset

The data sets for this project are publicly provided by Julian McAuley of UCSD. They are available in json format on the website http://jmcauley.ucsd.edu/data/amazon/links.html. They contain reviews spanning May 1996 – Jul 2014.

The Product reviews dataset has the following columns:
ReviewerId, ProductId, ReviewText, OverallRating, Summary, ReviewTimestamp

The Product metadata dataset has these columns:
ProductId, Description, Title, Category, Brand, ImageURL, SalesRank, SalesPrice, Related

Data can be explicit or implicit. 

Explicit data would include ratings and text reviews for a product.

Implicit data would include number of times an item was clicked on or viewed or added to cart or purchased. 

## 2. Data Collection

The dataset has been downloaded from the website: http://jmcauley.ucsd.edu/data/amazon/links.html
(Citation:
Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering
R. He, J. McAuley
WWW, 2016)

There are 2 individual json files containing the beauty product reviews and metadata from Amazon. The reviews file has ~2 million reviews spanning May 1996 - July 2014. The metadata file contains the details of ~260K beauty products sold at Amazon.

The reviews file includes product ratings, the text of each review and helpfulness votes.  


| Column        | Description                     |
|---------------|---------------------------------|
|reviewerID     | Reviewer ID                     |
|asin           | Product ID                      |
|reviewerName   | Name of reviewer                |
|helpful        | Helpfulness rating of the review|
|reviewText     | Text of the review              |
|overall        | Rating of the product           |
|summary        | Summary of the review           |
|unixReviewTime | Unix time of the review         | 
|reviewTime     | Raw time of the review          |   

Metadata includes descriptions, category information, price, sales-rank, brand info and image features.  
  

| Column     | Description                           |
|------------|---------------------------------------|
|asin        | Product ID                            |
|description | Product description                   |
|title       | Name of the product                   |
|imUrl       | URL of the product image              |
|salesRank   | Sales rank information                |
|categories  | List of categories product belongs to |
|price       | Price in US Dollars                   | 
|related     | Related products                      |
|brand       | Brand name                            | 

The json files were retrieved in compressed format. They were parsed line by line and loaded into pandas dataframes reviews and meta. Then these dataframes were written to .csv files for analysis and modeling.

Here's the code to load the compressed raw json files into pandas dataframes and then write the dataframes to csv files.

In [None]:
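import ast
import gzip

import pandas as pd

def parse(path):
  # The original code relied on eval, so each line is assumed to be a Python
  # literal; ast.literal_eval parses it without executing arbitrary code.
  with gzip.open(path, 'rt', encoding='utf-8') as g:
    for line in g:
      yield ast.literal_eval(line)

def getDF(path):
  # Collect the parsed records into a dataframe, one row per line of the file.
  records = {i: d for i, d in enumerate(parse(path))}
  return pd.DataFrame.from_dict(records, orient='index')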

In [None]:
meta = getDF('meta_Beauty.json.gz')

reviews = getDF('reviews_Beauty.json.gz')

In [None]:
meta.to_csv('meta_beauty.csv')

reviews.to_csv('reviews_beauty.csv')

## 3. Data Wrangling

### 3.1 Metadata file

The metadata file contains the details of ~260K beauty products sold at Amazon. There are 9 columns in the file including asin, description, title, brand, categories, related, price, sales rank and image url. No duplicate values were found in the asin column. 

51% of the items were missing "brand" information. Since this is a large percentage, the column could not be dropped. 444 items were missing the "title" information. So "brand" and "title" columns were merged into a single column and the individual columns were then dropped. 

The image url column does not seem to have much meaning, so it was also dropped from the dataframe.

The "salesRank" column had values in dict format (key-value pairs). So this column was split into multiple columns where each key became the column name and the value became the column value. Some of these new columns were not related to Beauty and Personal care so they were dropped. Only the new columns "Beauty" and "Health & Personal Care" were retained.

At this point, the resulting dataframe was written to a .csv file (cleaned_meta.csv).

It includes the columns:
asin, description, categories, price, related, brand_title, Health & Personal Care, Beauty

Next came the most challenging column for cleaning. 

This is the "related" column which had data in the following format:

{'also_bought': ['B002J2KOXK', 'B008TCVZ6Y'], 'also_viewed': ['B002J2KOXK'], 'bought_together': ['B002J2KOXK'], 'buy_after_viewing': ['B002J2KOXK']}

It made most sense to count the number of times an asin occurred in any of these 4 key-value pairs and use that count as an indicator of how popular an item is. This required creation of a new column called "related_count".

The "asin" and "related" columns were first copied to a separate dataframe. The "related" column was then split into 4 new columns "also_bought", "also_viewed", "bought_together" and "buy_after_viewing". Then the values in these 4 columns were concatenated into a large string. Using iterrows, each asin in the dataframe was checked against the string and the number of occurrences of this asin in the string was captured in the "related_count" column. This particular operation took 20 hours to complete and the resulting dataframe was immediately captured into a .csv file (cleaned_related.csv).
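The counting step described above can also be sketched with a hash-based counter, which makes a single pass over the data instead of scanning a large string once per row. This is only an illustration, not the notebook's code: `related_counts` is a hypothetical helper, the records are assumed to be dicts with "asin" and "related" keys in the format shown earlier, and the third asin is made up. It counts exact asin matches, which should agree with the substring count as long as no asin is contained in another.

```python
from collections import Counter

RELATED_KEYS = ("also_bought", "also_viewed", "bought_together", "buy_after_viewing")

def related_counts(products):
    """Count how often each asin appears in any product's 'related' lists."""
    counts = Counter()
    for product in products:
        related = product.get("related") or {}  # 'related' may be missing or empty
        for key in RELATED_KEYS:
            counts.update(related.get(key, []))
    # Report a count for every product in the catalog, defaulting to 0.
    return {product["asin"]: counts[product["asin"]] for product in products}

products = [
    {"asin": "B002J2KOXK", "related": {"also_bought": ["B008TCVZ6Y"]}},
    {"asin": "B008TCVZ6Y", "related": {"also_bought": ["B002J2KOXK"],
                                       "also_viewed": ["B002J2KOXK"]}},
    {"asin": "B00EXAMPLE", "related": None},  # made-up asin with no related data
]
print(related_counts(products))
```

Because each lookup is a dictionary access, the cost grows roughly with the total number of related entries rather than with the number of rows times the length of the concatenated string.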

The code for data wrangling of metadata file can be found at:
https://github.com/venustrip/Amazon-Recommendation-Engine/blob/master/clean_meta.ipynb

### 3.2 Reviews File

The reviews file has ~2 million reviews collected between May-1996 and July-2014. There are 9 columns in the file including: reviewerID, reviewerName, asin, helpful, reviewText, overall, summary, unixReviewTime and reviewTime.

12248 rows were missing the "reviewerName" information. Also there were some "reviewerID" with multiple values under "reviewerName". Since we can build our model just with the ID, the "reviewerName" column was dropped from the dataframe.

The "summary" and "reviewText" columns have similar information. 14 rows were missing the "summary" information. The 2 columns were merged into "review" column and the individual columns were dropped.

The column "reviewTime" was in text format, so it was converted to datetime and the column "unixReviewTime" was dropped.

The "helpful" column had values in list format, where the first element indicated the number of upvotes and second element indicated the number of downvotes for the review. This column was split into two: "upvotes" and "downvotes" and the original "helpful" column was dropped.

Since the "review" column is the most significant for this project and consists of long text values, it can be used for natural language processing.

First, the word count of each review was stored in a new column "word_count". 

Next, the text was converted to lowercase to avoid having multiple versions of one word.  

All punctuation marks were removed since they do not add any extra value while processing text data.

Finally, the stop words and the 50 rarest words were removed.

Once these steps were completed, sentiment analysis was performed on all the reviews and polarity was calculated. This value was stored for each review in the newly created "polarity" column.
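The text-cleaning steps above can be sketched with the standard library as follows. This is an illustration under assumptions rather than the notebook's code: `clean_reviews` and `STOP_WORDS` are hypothetical names, the stop-word list is a tiny stand-in for a full one, and the polarity step is left out because it depends on a sentiment library that is not named here.

```python
import string
from collections import Counter

# Tiny illustrative stop-word list (assumption); a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "is", "it", "this", "my", "i", "for"}

def clean_reviews(reviews, n_rare=1):
    """Lowercase, strip punctuation, drop stop words and the n_rare rarest words."""
    table = str.maketrans("", "", string.punctuation)
    tokenized = [
        [w for w in text.lower().translate(table).split() if w not in STOP_WORDS]
        for text in reviews
    ]
    freq = Counter(w for tokens in tokenized for w in tokens)
    # most_common() sorts by descending count, so the tail holds the rarest words.
    rare = {w for w, _ in freq.most_common()[-n_rare:]} if n_rare else set()
    return [" ".join(w for w in tokens if w not in rare) for tokens in tokenized]

reviews = [
    "This lotion is GREAT!",
    "Great lotion, great price.",
    "Great price for the lotion",
]
word_counts = [len(text.split()) for text in reviews]  # counted before cleaning
cleaned = clean_reviews(reviews, n_rare=1)
print(word_counts)
print(cleaned)
```

In this toy corpus "price" is the rarest remaining word, so it is the one removed; on the real data `n_rare` would be set to 50.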

After the cleaning, the following columns existed in the dataframe:
reviewerID, asin, overall, review, upvotes, downvotes, word_count, polarity

The dataframe was written to a .csv file for further analysis (cleaned_reviews.csv).

The code for data wrangling of reviews file can be found at: https://github.com/venustrip/Amazon-Recommendation-Engine/blob/master/clean_reviews.ipynb

### 3.3 Cleaned Datasets

All the cleaned datasets were written to .csv files. The first file has the metadata, the second one has the related_count information for the products and the third file has all the reviews for these products.

The "meta" file includes:
1. asin
2. description
3. categories
4. price
5. related
6. brand_title
7. health_and_personal_care
8. beauty

The "related" file includes:
1. asin
2. related_count

The "reviews" file includes:
1. reviewerID
2. asin
3. overall
4. review
5. upvotes
6. downvotes
7. word_count
8. polarity

## 4. Exploratory Data Analysis

All the dataframes are merged into one to do the exploratory data analysis.

EDA is an important step because it provides insight into the data that we are dealing with. It helps identify the kind of data we have and its different types. It helps detect anomalies, outliers and missing values. It involves descriptive statistics, grouping of data, handling missing values, analysis of variance and correlations.

Missing values can lead to weak or biased analysis. 

They can be handled in 3 ways: 
1. Delete the rows with missing values using dropna - note that this can cause information loss
2. Impute the missing values using fillna
3. Predictive filling using the interpolate method
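The three options can be compared on a toy dataframe. The column names and values below are made up for illustration, and filling with the median is just one possible imputation choice.

```python
import pandas as pd

# Toy data (assumption): one product is missing its price.
df = pd.DataFrame({
    "asin": ["A1", "A2", "A3", "A4"],
    "price": [10.0, None, 30.0, 40.0],
})

# 1. Drop rows with a missing price (loses the information in that row).
dropped = df.dropna(subset=["price"])

# 2. Impute the missing price with a fixed statistic, here the column median.
filled = df.fillna({"price": df["price"].median()})

# 3. Estimate the missing price from its neighbors by linear interpolation.
interpolated = df.assign(price=df["price"].interpolate())

print(dropped, filled, interpolated, sep="\n\n")
```

Interpolation only makes sense when the row order carries meaning (for example, a time series), so for unordered product data the first two options are usually the safer choice.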

Various kinds of plots and charts were created to do a visual analysis of the data set.

For complete code on EDA, please refer to: https://github.com/venustrip/Amazon-Recommendation-Engine/blob/master/milestone_report.ipynb

## 5. Building the Recommendation Systems

There are 4 common ways to build a recommender engine. 
1. Popularity based
2. Content based
3. Collaborative
4. Hybrid 


Each will be explored more in detail as we build them one by one.

### 5.1 Popularity-based Recommender

This is the simplest recommendation system. As is obvious from the name, it simply recommends the popular and highly rated items to all the users. The biggest drawback of this system is that it is non-personalized and recommends the same items to everyone. Since users have different tastes and preferences, this is not a very useful filtering method.

One way of measuring popularity is by counting the number of times an item was rated. The higher the rating count, the more popular the item.

Similarly, items with the highest average rating are also popular and can be recommended to all customers. We also calculated the polarity of customer reviews using the review text. Items with a higher mean polarity would be good recommendations as well.

Another way to measure the popularity of an item is by checking its related_count. This count was obtained by adding up the number of times an item appeared in the also_bought, also_viewed, bought_together and buy_after_viewing lists of the products. The higher the related_count, the more popular the item should be.

We also have sales rankings for the items under either the beauty or the health_personal_care category. Another way of determining popular items is by looking at their sales rank under either of these columns.

The most popular items can also be determined using weighted average ratings. 
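The exact weighting is not spelled out here, so the sketch below assumes one common choice: a Bayesian-style weighted rating that pulls the average of rarely rated items towards the global mean. The function name, the threshold `min_ratings` and the sample numbers are all illustrative.

```python
def weighted_rating(avg_rating, n_ratings, global_mean, min_ratings):
    """Blend an item's own average with the global mean, weighted by rating count."""
    v, m = n_ratings, min_ratings
    return (v / (v + m)) * avg_rating + (m / (v + m)) * global_mean

# (average rating, number of ratings) per item; made-up values.
items = {"item_a": (5.0, 2), "item_b": (4.6, 400)}
global_mean = 4.2

scores = {
    name: weighted_rating(avg, n, global_mean, min_ratings=50)
    for name, (avg, n) in items.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

With this weighting, an item with a perfect average from only two ratings ranks below an item with a slightly lower average backed by hundreds of ratings.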

#### Pros and Cons of Popularity-based Filtering

##### Pros:
It is the simplest form of recommendation system.

##### Cons:
The caveat to popularity-based recommenders is that they do not filter items based on personal preferences and recommend the same top-N items to every single user. The method is simple, but not very effective.

For complete code on Popularity-based recommendation, please refer to: 
https://github.com/venustrip/Amazon-Recommendation-Engine/blob/master/recommender_popular.ipynb

### 5.2 Content-based Filtering

This method recommends an item based on its features and how similar they are to features of other items in the data set. It is based on similarity of item attributes, which can be determined using cosine similarity or nearest neighbor algorithms.

The past interactions of a given user with items is taken into account, ignoring all other users. Items are recommended based on comparison between contents of the items and a user profile. The content of each item could be represented as descriptor or terms. 

We will combine the brand_title, main_cat and description features of the items into a single feature and convert it into vector form so it can be effectively used for determining similarities between different items.

For a fair analysis, we will include only those items from the dataset that have been reviewed more than 10 times and fewer than 7000 times. Fewer than 10 ratings means that not many users have reviewed the item. More than 7000 ratings means the item is already quite popular and would show up as a recommendation regardless.

This leaves 33,878 items to filter using their content/attributes.

The item attributes are converted to vector form using TF-IDF and count vectorizers. 

TF-IDF (term frequency-inverse document frequency) is widely used for feature extraction in natural language processing and information retrieval. It measures how important a word is to a document within a corpus of documents. The text columns are combined into one and their text will be used to fit the model.

The other approach is to use the CountVectorizer, which counts the number of times a token shows up in the document and uses this value as its weight. It is simpler than TfidfVectorizer. 

The pairwise similarity is computed using:
1. Cosine Similarity 
2. Linear Kernel 
3. Euclidean distance
4. Pearson Correlation
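To make the vectorize-then-compare idea concrete, here is a small pure-Python sketch of TF-IDF weighting followed by cosine similarity. It is not the notebook's implementation: the function names and product texts are made up, and it uses the plain unsmoothed `log(N / df)` form of IDF, so the numbers will differ from what a library vectorizer with smoothing and normalization produces.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [
    "olay moisturizer face cream",
    "nivea face cream moisturizer",
    "opi nail polish red",
]
vectors = tfidf_vectors(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(sim_01, sim_02)
```

The two face creams share most of their terms and get a positive similarity, while the nail polish shares none and scores zero. When the vectors are already scaled to unit length, the linear kernel (a plain dot product) gives the same values as cosine similarity.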

#### Pros and Cons of Content-based Filtering

##### Pros:
The recommendations are specific to one user, therefore the model does not need any data about other users. This makes it easier to scale to a large number of users. It is very effective in recommending niche items, since it can capture the specific interests of a user. With sufficiently rich item descriptions, the cold-start problem for new items can be largely avoided.

##### Cons:
The feature representation of the items must be very rich because the model solely depends on that. The model can make recommendations based only on existing interests of the user. So it tends to over-specialize and will recommend items only similar to those already used and rated.

For complete code on Content-based recommendation, please refer to: https://github.com/venustrip/Amazon-Recommendation-Engine/blob/master/recommender_content.ipynb

### 5.3 Collaborative Filtering

The collaborative filtering method ignores user and item attributes and instead focuses on user-item interactions. Recommendations are based purely on past behavior and not on context. 

It is based on the similarity in preferences of two users. It determines similarities between 2 users or 2 items, using the historical ratings given to items by various users and makes recommendations on the basis of that.

The concept here is to develop algorithms that operate independently of the characteristics of particular items being recommended. Patterns between users and items are identified and used to make recommendations. Collaborative filtering is the workhorse of recommender engines. It has the ability to do feature learning on its own, which means that it can start to learn for itself what features to use.

Using the customer rating for items, a utility matrix is built. These ratings can be explicit (rating on a scale of 1 to 5, likes or dislikes) or implicit (viewing an item, adding it to a wish list, the time spent on an article). In the matrix, each row contains the ratings given by a user and each column contains the ratings received by an item. The matrix is typically sparse as not every user uses every item nor do they rate every item they use.
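A utility matrix of this kind can be sketched as a nested dictionary, which stores only the ratings that exist and so stays compact when the matrix is sparse. The helper names and the sample ratings are illustrative.

```python
from collections import defaultdict

def build_utility_matrix(ratings):
    """Map user -> {item: rating}; unrated user/item pairs are simply absent."""
    matrix = defaultdict(dict)
    for user, item, rating in ratings:
        matrix[user][item] = rating
    return dict(matrix)

def sparsity(matrix):
    """Fraction of user/item cells that hold no rating."""
    items = {item for row in matrix.values() for item in row}
    filled = sum(len(row) for row in matrix.values())
    return 1 - filled / (len(matrix) * len(items))

# (reviewerID, asin, overall) triples; made-up values.
ratings = [("u1", "p1", 5), ("u1", "p2", 3), ("u2", "p2", 4), ("u3", "p3", 1)]
matrix = build_utility_matrix(ratings)
print(matrix)
print(sparsity(matrix))
```

Here 4 of the 9 possible cells are filled, so more than half of this small matrix is empty; on real review data the empty share is typically far larger.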

There are two types of collaborative filtering methods:
1. Memory-based: Statistical techniques are applied to the entire dataset to calculate the predictions.
2. Model-based: Involve a step to reduce or compress the large but sparse user-item matrix. Matrix factorization can be performed using the singular value decomposition (SVD) algorithm or principal component analysis (PCA).

Memory based collaborative filtering methods can be further classified as:
1. User-user collaborative filtering
2. Item-item collaborative filtering

Similarities between users or between items can be computed using:
1. Adjusted cosine
2. Pearson correlation
3. Spearman rank correlation
4. Mean squared difference (MSD)
5. Jaccard similarity
6. Cosine similarity 
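Two of the measures listed above are sketched here for a pair of users, each represented as an {item: rating} dictionary. Pearson correlation compares how ratings move together on co-rated items, while Jaccard similarity looks only at which items were rated. The function names and ratings are illustrative.

```python
import math

def pearson(a, b):
    """Pearson correlation over co-rated items; 0.0 when it is undefined."""
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = math.sqrt(sum((a[i] - mean_a) ** 2 for i in common)) * \
          math.sqrt(sum((b[i] - mean_b) ** 2 for i in common))
    return num / den if den else 0.0

def jaccard(a, b):
    """Overlap of the rated item sets, ignoring the rating values."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0

alice = {"p1": 5, "p2": 3, "p3": 1}
bob = {"p1": 4, "p2": 3, "p3": 2, "p4": 5}   # same taste as alice, milder scores
carol = {"p1": 1, "p2": 3, "p3": 5}          # opposite taste

print(pearson(alice, bob), pearson(alice, carol), jaccard(alice, bob))
```

Pearson treats Bob as a perfect match for Alice even though his scores are less extreme, because it removes each user's mean before comparing. Jaccard cannot tell Bob and Carol apart on taste, since both overlap heavily with Alice's items.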

#### 5.3.1 User-User collaborative filtering

Recommendations are given to a user on the basis of the likes and dislikes of similar users. This method is useful when the number of users is small. It is not effective when there are a large number of users, as it takes a lot of time and resources to compute the similarity between all user pairs. For very large datasets, this method is hard to implement without a strongly parallelized system.
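A minimal user-user prediction might look like the sketch below: the rating a user would give an item is estimated as the similarity-weighted mean of the ratings from the k most similar users who rated it. The names are hypothetical, cosine similarity is computed over co-rated items only, and ratings are assumed to be positive.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity restricted to the items both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict_user_based(matrix, user, item, k=2):
    """Similarity-weighted mean rating of the k nearest users who rated the item."""
    neighbors = [
        (cosine_sim(matrix[user], row), row[item])
        for other, row in matrix.items()
        if other != user and item in row
    ]
    top = sorted(neighbors, reverse=True)[:k]
    total = sum(sim for sim, _ in top)
    if total == 0:
        return None  # no usable neighbors, so no prediction
    return sum(sim * rating for sim, rating in top) / total

matrix = {
    "u1": {"p1": 5, "p2": 4},
    "u2": {"p1": 5, "p2": 4, "p3": 5},
    "u3": {"p1": 1, "p2": 2, "p3": 1},
}
print(predict_user_based(matrix, "u1", "p3", k=2))
print(predict_user_based(matrix, "u1", "p3", k=1))
```

Every prediction compares the target user against all other users, which is where the cost grows quickly as the user base gets larger.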

#### 5.3.2 Item-Item collaborative filtering

Item-based recommenders are faster than user-based ones for large datasets. They make recommendations by computing the similarity between each pair of items. This technique was developed by Amazon and, when there are more users than items, it is much faster and more stable than the user-based approach. The reason for the higher stability is that items are relatively stable in their features and ratings, whereas user preferences and choices can change at any time.
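The item-item variant can be sketched by transposing the utility matrix so that each item is described by the users who rated it, then scoring unseen items by their similarity to the items a user has already rated. The function names and data are illustrative, and missing ratings are treated as zeros in the cosine computation.

```python
import math
from collections import defaultdict

def item_vectors(matrix):
    """Transpose user -> {item: rating} into item -> {user: rating}."""
    items = defaultdict(dict)
    for user, row in matrix.items():
        for item, rating in row.items():
            items[item][user] = rating
    return items

def cosine_sim(a, b):
    """Cosine similarity of two sparse vectors, with missing entries as zeros."""
    dot = sum(rating * b[user] for user, rating in a.items() if user in b)
    norm = math.sqrt(sum(r * r for r in a.values())) * \
           math.sqrt(sum(r * r for r in b.values()))
    return dot / norm if norm else 0.0

def recommend_item_based(matrix, user, n=2):
    """Rank unseen items by rating-weighted similarity to the user's rated items."""
    items = item_vectors(matrix)
    seen = matrix[user]
    scores = {
        candidate: sum(cosine_sim(vec, items[s]) * r for s, r in seen.items())
        for candidate, vec in items.items()
        if candidate not in seen
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]

matrix = {
    "u1": {"p1": 5},
    "u2": {"p1": 5, "p2": 4},
    "u3": {"p1": 4, "p2": 5},
    "u4": {"p3": 5},
}
print(recommend_item_based(matrix, "u1", n=1))
```

In practice the item-item similarities would be precomputed once and reused, which is what makes this approach cheaper at recommendation time than comparing users on the fly.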

#### Pros and Cons of Collaborative Filtering

##### Pros:
This filtering method does not require any knowledge about features of users and items. Also, it can help recommenders to not overspecialize in a user’s profile and recommend items that are completely different from what they have seen before.

##### Cons:
There can be a cold-start problem for new items and new users. Until someone rates a new item, it does not get recommended. Similarly, until a new user rates an item, it is difficult to recommend items to them. Complexity also increases as the size of the dataset increases, and the method is computationally expensive. Dimensionality reduction improves performance but can reduce recommendation quality.

For complete code on Collaborative recommendation, please refer to: https://github.com/venustrip/Amazon-Recommendation-Engine/blob/master/recommender_collab.ipynb

### 5.4 Hybrid Method

All the recommender systems suffer from the cold-start problem in some form or another. Hybrid recommender systems combine two or more filtering methods to improve recommendation performance, usually to overcome this problem. 

There are 7 ways to apply the hybrid technique. 

• Weighted: The scores of different recommendation components are combined numerically.

• Switching: The system chooses among recommendation components and applies the selected one.

• Feature Combination: Features derived from different knowledge sources are combined together and given to a single recommendation algorithm.

• Mixed: Recommendations from different recommenders are presented together.

• Feature Augmentation: One recommendation technique is used to compute a feature or set of features, which is then part of the input to the next technique.

• Cascade: Recommenders are given strict priority, with the lower priority ones breaking ties in the scoring of the higher ones.

• Meta-level: One recommendation technique is applied and produces some sort of model, which is then the input used by the next technique.

In this project, we will focus on the first three approaches.
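The weighted and switching strategies can be sketched in a few lines. This is an illustration under assumptions: both component recommenders are assumed to return {item: score} dictionaries on a comparable 0-1 scale, and the names `alpha` and `min_ratings` and their values are made up.

```python
def weighted_hybrid(content_scores, collab_scores, alpha=0.5):
    """Blend two score dictionaries; alpha is the weight of the content-based part."""
    items = set(content_scores) | set(collab_scores)
    return {
        item: alpha * content_scores.get(item, 0.0)
              + (1 - alpha) * collab_scores.get(item, 0.0)
        for item in items
    }

def switching_hybrid(user_rating_count, content_scores, collab_scores, min_ratings=5):
    """Use collaborative scores only when the user has enough rating history."""
    return collab_scores if user_rating_count >= min_ratings else content_scores

content_scores = {"p1": 0.9, "p2": 0.2}
collab_scores = {"p2": 0.8, "p3": 0.6}

blended = weighted_hybrid(content_scores, collab_scores, alpha=0.5)
best = max(blended, key=blended.get)
print(blended, best)
print(switching_hybrid(2, content_scores, collab_scores))
```

The switching rule shown here addresses the new-user cold start: a user with only two ratings falls back to content-based scores until enough history accumulates.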

For complete code on Hybrid recommendation, please refer to: https://github.com/venustrip/Amazon-Recommendation-Engine/blob/master/recommender_hybrid.ipynb

## 6. Evaluation

A strong recommender system should score well on:

1. Coverage - Percentage of user/item pairs that can be predicted

2. Diversity - Breadth of variety of items recommended to different users

3. Novelty - Balance between familiar popular items and novel items

4. Churn - Sensitivity towards new user behavior

5. Responsiveness - How quickly new user behavior is reflected in the recommendations

Evaluations of the different models can be performed by either splitting data into train and test or by using K-fold cross validation. 

In a train/test split, the model learns the relationship between items and users from the training set. The test set can then be used to measure prediction accuracy.

K-fold cross validation avoids tying the evaluation to a single train/test split, which gives a more reliable estimate of how the recommendation system will perform on unseen ratings.
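A K-fold split over rating records can be sketched as follows. The helper name and the seed are arbitrary, and the ratings are placeholder triples; each record lands in the test set of exactly one fold.

```python
import random

def k_fold_splits(ratings, k=5, seed=42):
    """Yield (train, test) pairs so that every rating is tested exactly once."""
    shuffled = ratings[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k roughly equal slices
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

# Placeholder (reviewerID, asin, overall) triples.
ratings = [(f"u{i}", f"p{i}", 5) for i in range(10)]
splits = list(k_fold_splits(ratings, k=5))
for train, test in splits:
    print(len(train), len(test))
```

A model would be trained and scored once per fold, and the k scores averaged to get the final estimate.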

Top-N accuracy metrics can also be used. They evaluate the accuracy of the top recommendations provided to a user by comparing them to the items the user has actually rated in test set.
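Two common top-N metrics, precision@k and recall@k, can be sketched like this. The item ids are made up, and `relevant` stands for the items the user actually rated in the test set.

```python
def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommendations that the user actually interacted with."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Share of the user's relevant items that appear in the top-k recommendations."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

recommended = ["p3", "p1", "p7", "p2"]   # ranked output of a recommender
relevant = {"p1", "p2", "p9"}            # items the user rated in the test set

print(precision_at_k(recommended, relevant, k=3))
print(recall_at_k(recommended, relevant, k=3))
```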

RMSE, R-squared and MAE are among the most commonly used evaluation metrics. However, the best way to test a recommendation system is with online A/B tests: put different algorithms in front of different users and measure whether they actually buy, view or indicate interest. Real impact on real users needs to be measured. Prediction accuracy doesn't always result in good recommendations. Users typically do not care about how accurately their rating for an item was predicted. They care about the recommender engine's ability to show them new items that they will like.
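For reference, RMSE and MAE on a held-out set of ratings can be computed as in this short sketch. The rating values are made up.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error; the average size of a miss in rating points."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [5, 3, 4, 1]       # ratings held out in the test set
predicted = [4, 3, 5, 3]    # ratings predicted by a model

print(rmse(actual, predicted))
print(mae(actual, predicted))
```

RMSE comes out higher than MAE here because the single two-point miss is squared before averaging.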

## 7. Potential Improvement

1. Use more features from the 2 datasets to build the personalized recommender 
2. Apply NLP techniques on item description and review text for a richer and more accurate recommender
3. Improve computation using cloud distributed systems and Spark
4. Use the entire data set (without dropping any rows) 
5. Build a user interface screen to input an item or a user, that will output a blend of top-N non-personalized and personalized recommendations on the same screen
6. Improve the rating metric by adjusting overall score using review time and age relevance of the rating
7. Use the related column in meta file; it indicates implicit ratings of the items.