<a href="https://colab.research.google.com/github/semant/MachineLearning/blob/master/RecommenderSystem_Content.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recommender System Using Content-Based Filtering to Identify the Most Similar Products for an E-Commerce Operation

> Semant Jain, PhD
> semant@gmail.com

### Background
> There are four types of recommendation systems:
> + Social and demographic recommenders: These algorithms do not require any preferences from the user. They suggest items that are liked by friends, friends of friends, and demographically similar people.
> + Contextual recommenders: By incorporating a user's current context, these algorithms are more likely to elicit a response than methods based only on historical data.
> + Collaborative filtering: Starting from a matrix of users' preferences for items, these algorithms predict the missing preferences and recommend the items with the highest predicted scores.
> + Content-based filtering: When features that characterize the items or the users' preferences are available, these algorithms recommend items with similar features.


### Summary
> At ACM, content-based recommenders were used to identify stocks that behave similarly. As that project is protected by non-disclosure agreements, content-based recommenders are demonstrated here through natural language processing of a dataset with 500 different items. The following concepts and preprocessing steps were used:
> + TF: The Term Frequency of a word is the number of times it appears in a document.
> + IDF: The Inverse Document Frequency of a word measures how significant that term is across the whole corpus. IDF(t) is computed as the log of the total number of documents divided by the number of documents that contain the term t.
> + Stop words: Commonly used words such as 'and', 'the', 'an', and 'is' were removed from the analysis.
> + Cosine similarity: Computes the similarity of an item with all other items in the dataset.
>
> Thereafter, the recommender system was set up.
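
The terms above can be sketched on a toy corpus using only the standard library. The three documents below are hypothetical and are not taken from the project dataset; the helper names (`tokenize`, `tfidf_vector`, `cosine`) are illustrative. The sketch uses the unsmoothed IDF exactly as defined above, whereas scikit-learn's `TfidfVectorizer` by default applies a smoothed IDF and L2-normalises each row, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical three-document corpus (not taken from the project dataset).
docs = [
    "organic cotton boxer briefs",
    "organic cotton sport briefs",
    "nylon river shorts",
]
STOP_WORDS = {"and", "the", "an", "is"}  # the examples named in the summary

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

tokens = [tokenize(d) for d in docs]
vocab = sorted({w for doc in tokens for w in doc})
n_docs = len(docs)

# IDF(t) = log(N / df(t)), the unsmoothed form described above.
doc_freq = {t: sum(t in doc for doc in tokens) for t in vocab}
idf = {t: math.log(n_docs / doc_freq[t]) for t in vocab}

def tfidf_vector(doc_tokens):
    """TF-IDF weights for one document, ordered by vocab."""
    tf = Counter(doc_tokens)
    return [tf[t] * idf[t] for t in vocab]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = [tfidf_vector(doc) for doc in tokens]
sim_01 = cosine(vectors[0], vectors[1])  # two briefs: several shared terms
sim_02 = cosine(vectors[0], vectors[2])  # briefs vs. shorts: no shared terms
print(round(sim_01, 4), round(sim_02, 4))
```

The two briefs descriptions share three of their four terms and so receive a positive score, while the briefs and shorts descriptions share none and score zero. Terms that occur in fewer documents (here `boxer` and `sport`) carry the larger IDF weight.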


### Contents
+ Setting up: importing libraries, preprocessing data
+ Code: helper functions, execution


### Libraries
+ Pandas
+ scikit-learn (sklearn)

# 1. Setting up

### Importing libraries

In [0]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

### Preprocessing data

In [0]:
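# Load the product catalogue; the 'id' and 'description' columns are used below.
ds = pd.read_csv("CE_ML_Project_21_sample-data.csv")

# TF-IDF over word 1- to 3-grams with English stop words removed.
# min_df=1 (the default) keeps every term; newer scikit-learn versions reject min_df=0.
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(ds['description'])

# TfidfVectorizer L2-normalises each row by default, so the plain dot product
# computed by linear_kernel equals the cosine similarity.
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

# Map each product id to its nearest neighbours as (score, id) pairs.
results = {}

# enumerate gives positional row numbers, which match the rows of the
# similarity matrix even if the DataFrame index is not 0..n-1.
for pos, item_id in enumerate(ds['id']):
    # Positions of the 99 highest scores, best first.
    similar_indices = cosine_similarities[pos].argsort()[:-100:-1]
    similar_items = [(cosine_similarities[pos][i], ds['id'].iloc[i]) for i in similar_indices]

    # The first entry is normally the item itself (score 1.0), so drop it.
    results[item_id] = similar_items[1:]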
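
The slice `argsort()[:-100:-1]` in the loop above can be hard to read at a glance. The following is a minimal sketch of the same idiom on a hypothetical five-element similarity row, with `k = 3` standing in for the 99 positions kept in the notebook; the values and the names `row`, `k`, and `neighbours` are illustrative only.

```python
import numpy as np

# Hypothetical similarity row for item 0 against five items (itself included).
row = np.array([1.0, 0.10, 0.42, 0.05, 0.30])
k = 3

# argsort() sorts ascending; [:-(k + 1):-1] walks backwards from the end,
# yielding the k highest-scoring positions in descending order of score.
top = row.argsort()[:-(k + 1):-1]

# Position 0 is the item itself (self-similarity is 1.0), so skip it,
# mirroring `similar_items[1:]` in the notebook.
neighbours = [(float(row[i]), int(i)) for i in top[1:]]
print(neighbours)
```

Dropping the first entry assumes the item itself ranks first. If another item has an identical description, and therefore also scores 1.0, the order among the tied entries is not guaranteed, and filtering on the id may be the safer choice.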

# 2. Code

### Helper functions

In [0]:
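def item(item_id):
    """Return the product name: the description text before the first ' - '."""
    return ds.loc[ds['id'] == item_id, 'description'].tolist()[0].split(' - ')[0]

def recommend(item_id, num):
    """Print the num products most similar to item_id, with their scores."""
    print("Top " + str(num) + " products similar to " + item(item_id) + ": ")
    recs = results[item_id][:num]
    for score, rec_id in recs:
        print("- " + item(rec_id) + " (score:" + str(round(score, 4)) + ")")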
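
The `item` helper relies on each description starting with the product name followed by `' - '`. A small sketch of that parsing step, using a hypothetical description string that is not copied from the dataset:

```python
# Hypothetical description following the "<name> - <text>" layout that
# item() assumes; it is not copied from the dataset.
description = "Example product name - Longer marketing text about the product."

# split(' - ') breaks on the separator; element 0 is the short product name.
name = description.split(' - ')[0]
print(name)

# Without the separator, split returns the whole string as its only element,
# so the full description would be shown as the name.
fallback = "Description without a separator".split(' - ')[0]
print(fallback)
```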

### Execution

In [14]:
recommend(item_id=2, num=5)
print()
recommend(item_id=12, num=5)

Top 5 products similar to Active sport boxer briefs: 
- Active sport briefs (score:0.4182)
- Cap 1 boxer briefs (score:0.1155)
- Active boxer briefs (score:0.113)
- Active briefs (score:0.1125)
- Active boy shorts (score:0.1115)

Top 5 products similar to Baggies shorts: 
- River shorts (score:0.2465)
- Baby baggies shorts (score:0.1682)
- Baggies shorts (score:0.1644)
- Girl's baggies shorts (score:0.1498)
- Baggies shorts (score:0.1474)
