# Content-Based Recommendation System

## Step 1 - Train the engine.
Create a TF-IDF matrix of unigrams, bigrams, and trigrams for each product description. The `stop_words` parameter tells the TF-IDF vectorizer to ignore common English words such as 'the'.

Then we compute the similarity between all products using scikit-learn's `linear_kernel`. In this case it is equivalent to cosine similarity, because `TfidfVectorizer` normalizes each row to unit length by default.

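Iterate through each item's similar items and store the most similar ones. The code in this notebook keeps 98 per item: the slice `[:-100:-1]` yields 99 indices, and the first of those is the item itself. It stops there because, well, how many similar products do you really need to show?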

Similar items and their scores are stored in a dictionary, keyed by item id, as a list of `(score, item_id)` tuples.


In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [2]:
ds = pd.read_csv("data.csv")

In [3]:
ds.head()

| Unnamed: 0 | id | description |
|---|---|---|
| 0 | 1 | Active classic boxers - There's a reason why o... |
| 1 | 2 | Active sport boxer briefs - Skinning up Glory ... |
| 2 | 3 | Active sport briefs - These superbreathable no... |
| 3 | 4 | Alpine guide pants - Skin in, climb ice, switc... |
| 4 | 5 | Alpine wind jkt - On high ridges, steep ice an... |


In [4]:
len(ds)

500

## What are N-grams?
N-grams of text are widely used in text mining and natural language processing. They are sets of co-occurring words within a given window; when computing n-grams you typically move one word forward at a time (although you can move X words forward in more advanced scenarios). For example, take the sentence "The cow jumps over the moon".

If N=2 (known as bigrams), then the ngrams would be:
- the cow
- cow jumps
- jumps over
- over the
- the moon

So you have 5 n-grams in this case. Notice that we moved from the->cow to cow->jumps to jumps->over, etc, essentially moving one word forward to generate the next bigram.

If N=3, the n-grams would be: 
- the cow jumps
- cow jumps over
- jumps over the
- over the moon

So you have 4 n-grams in this case. When N=1, the n-grams are called unigrams; these are simply the individual words in the sentence. When N=2 they are called bigrams, and when N=3 trigrams. For N>3 they are usually referred to as four-grams, five-grams, and so on.

How many n-grams are there in a sentence?
If X is the number of words in a given sentence K, the number of n-grams for sentence K is:
### NGRAM = X - (N - 1)
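
The sliding-window idea and the count formula can be checked with a few lines of plain Python. This is only an illustrative sketch: the helper name `ngrams` is made up here, and it splits on whitespace rather than using the tokenizer that scikit-learn applies later.

```python
def ngrams(sentence, n):
    """Return the word n-grams of `sentence`, sliding one word at a time."""
    words = sentence.lower().split()
    # There are len(words) - (n - 1) windows; range() is empty if the sentence is too short.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The cow jumps over the moon"
for n in (1, 2, 3):
    grams = ngrams(sentence, n)
    print(n, len(grams), grams)
```

For the six-word example sentence this gives 6 unigrams, 5 bigrams, and 4 trigrams, matching X - (N - 1).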

In [5]:
# min_df=1 (the default) keeps every term; recent scikit-learn versions reject the integer 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(ds['description'])

In [6]:
tfidf_matrix.shape

(500, 52262)

<img src='cosine1.png' />

In [7]:
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

In [8]:
type(cosine_similarities), cosine_similarities.shape

(numpy.ndarray, (500, 500))
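
Why is a plain dot product enough here? By default `TfidfVectorizer` scales every row to unit L2 length, so the dot product of two rows is already their cosine. A minimal check on a made-up three-document corpus (`toy_docs` is an assumption for illustration, not part of the dataset):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

toy_docs = [
    "waterproof alpine jacket",
    "lightweight alpine pants",
    "cotton boxer briefs",
]
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)

# Each TF-IDF row has unit L2 norm with the default norm='l2' ...
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)
print(np.allclose(row_norms, 1.0))

# ... so the dot product (linear_kernel) matches cosine_similarity.
dot = linear_kernel(toy_matrix, toy_matrix)
cos = cosine_similarity(toy_matrix)
print(np.allclose(dot, cos))
```

If the vectorizer were created with `norm=None`, the two would no longer agree and `cosine_similarity` would be the safer choice.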

In [9]:
tfidf_matrix[0, :50].toarray()

array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.03509164,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ]])

In [10]:
len(cosine_similarities[:-5])

495

In [11]:
len(cosine_similarities[:-5:-1])

4

In [12]:
cosine_similarities[:-5:-1]

array([[ 0.06955608,  0.06480538,  0.05038512, ...,  0.04958298,
         0.36281626,  1.        ],
       [ 0.06546914,  0.06936414,  0.0455137 , ...,  0.04187121,
         1.        ,  0.36281626],
       [ 0.06097409,  0.03550042,  0.03402428, ...,  1.        ,
         0.04187121,  0.04958298],
       [ 0.11657908,  0.05444406,  0.05937085, ...,  0.04918646,
         0.05983098,  0.06523674]])

### NumPy sorting

In [13]:
x = np.array([3, 1, 2])

In [14]:
np.argsort(x)

array([1, 2, 0])
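
`argsort` returns indices in ascending order of value, so the most similar items sit at the end. The negative-step slice used in the next section walks that array backwards. A small sketch with made-up scores shows that `[:-k:-1]` returns k - 1 indices, not k:

```python
import numpy as np

scores = np.array([0.10, 1.00, 0.40, 0.70])  # made-up similarity row

order = scores.argsort()   # indices from lowest to highest score
top = order[:-3:-1]        # walk backwards from the end, stop before position -3

print(order)  # [0 2 3 1]
print(top)    # [1 3]
```

An equivalent and arguably more readable spelling for the top k indices is `order[::-1][:k]`.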

### Find the 98 most similar items for each product

In [15]:
results = {}

for idx, row in ds.iterrows():
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
    similar_items = [(cosine_similarities[idx][i], ds['id'][i]) for i in similar_indices]

    # First item is the item itself, so remove it.
    # Each dictionary entry is like: [(1,2), (3,4)], with each tuple being (score, item_id)
    results[row['id']] = similar_items[1:]
    
print('done!')

done!


In [16]:
results.keys()

dict_keys([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 22

In [17]:
results.values()

dict_values([[(0.22037921472617453, 19), (0.16938950913002357, 494), (0.16769458065321555, 18), (0.16485527745622977, 172), (0.14812615460586401, 442), (0.14577863284367545, 171), (0.1413764236536125, 21), (0.13884463426216978, 495), (0.13879533331363048, 25), (0.13813550299091404, 496), (0.13481110970996832, 487), (0.13225329613833622, 20), (0.13028260329762048, 341), (0.12768743540103286, 176), (0.12671622868413698, 488), (0.12319623660641409, 365), (0.12155681060658907, 340), (0.11800704948227406, 60), (0.11786722607586674, 440), (0.11657908072337515, 497), (0.11184896270837259, 173), (0.11069752245804719, 441), (0.10857685392562949, 413), (0.10572078621963336, 443), (0.10553058093119776, 174), (0.10403103809186293, 359), (0.10338035552770783, 22), (0.10290746221687935, 61), (0.10286246471301803, 312), (0.10166673618893814, 23), (0.10110641701157386, 2), (0.10082418508282549, 360), (0.099140299683494942, 175), (0.098829765199383385, 329), (0.09677082647987284, 24), (0.09627616966125

In [18]:
len(results[2])

98
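
The loop above relies on the item itself sorting first with a score of 1.0. If two products had identical descriptions, that tie could put a different item first. One possible variant, sketched here under the assumption of a small hypothetical similarity matrix (`toy_sim`) and id list (`toy_ids`), drops the item by index instead; the function name `top_k_similar` is invented for this example.

```python
import numpy as np

def top_k_similar(similarities, ids, k):
    """Map each id to its k most similar (score, id) pairs, excluding the item itself."""
    out = {}
    for idx, item_id in enumerate(ids):
        row = similarities[idx]
        # Sort indices by descending score, then drop the item's own index explicitly.
        order = [i for i in np.argsort(-row) if i != idx][:k]
        out[item_id] = [(float(row[i]), ids[i]) for i in order]
    return out

# Hypothetical 3x3 similarity matrix and ids, for illustration only.
toy_sim = np.array([
    [1.0, 0.2, 0.5],
    [0.2, 1.0, 0.1],
    [0.5, 0.1, 1.0],
])
toy_ids = [101, 102, 103]
toy_top = top_k_similar(toy_sim, toy_ids, k=2)
print(toy_top)
```

Here `k` is the number of neighbours actually returned, which avoids the off-by-one bookkeeping of the reversed slice.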

## Step 2 - Predict

In [19]:
ds.loc[ds['id'] == 11]['description'].tolist()[0].split(' - ')[0]

'Baby sunshade top'

In [20]:
# hacky little function to get a friendly item name from the description field, given an item ID
# (the parameter is named item_id so it does not shadow Python's built-in id())
def item(item_id):
    return ds.loc[ds['id'] == item_id]['description'].tolist()[0].split(' - ')[0]

In [21]:
# Just reads the results out of the dictionary. No real logic here.
def recommend(item_id, num):
    print("Recommending " + str(num) + " products similar to " + item(item_id) + "...")
    print("-------")
    recs = results[item_id][:num]
    for rec in recs:
        print("Recommended: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")
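
Printing is fine for a notebook, but a caller such as a web page would probably want the recommendations back as data. A minimal sketch of that idea follows; `get_recommendations` and `toy_results` are hypothetical names, and the dictionary only mimics the `(score, item_id)` layout built in Step 1.

```python
def get_recommendations(results, item_id, num):
    """Return up to `num` (item_id, score) pairs for `item_id`, best match first."""
    if item_id not in results:
        return []  # unknown product: nothing to recommend
    return [(rec_id, score) for score, rec_id in results[item_id][:num]]

# Hypothetical results dictionary in the same (score, item_id) layout as above.
toy_results = {11: [(0.21, 12), (0.11, 9), (0.10, 300)]}
print(get_recommendations(toy_results, 11, 2))   # [(12, 0.21), (9, 0.11)]
print(get_recommendations(toy_results, 999, 2))  # []
```

Returning an empty list for an unknown id is one design choice among several; raising an error would be equally reasonable.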

In [22]:
# Just plug in any item id here (1-500), and the number of recommendations you want (1-98)
# You can get a list of valid item IDs by evaluating the variable 'ds', or a few are listed below

recommend(item_id=11, num=5)

Recommending 5 products similar to Baby sunshade top...
-------
Recommended: Sunshade hoody (score:0.213302960211)
Recommended: Baby baggies apron dress (score:0.109753112963)
Recommended: Runshade t-shirt (score:0.0998815126278)
Recommended: Runshade t-shirt (score:0.0953069824169)
Recommended: Runshade top (score:0.0851055009302)


# Try it yourself!

Here are some product IDs to try. Just call:

    recommend(item_id=<id>, num=<n>)

1 - Active classic boxers

2 - Active sport boxer briefs

3 - Active sport briefs

4 - Alpine guide pants

5 - Alpine wind jkt

6 - Ascensionist jkt

8 - Print banded betina btm

9 - Baby micro d-luxe cardigan

10 - Baby sun bucket hat

11 - Baby sunshade top