# Content-based filtering

## Abstract
Content-based filtering recommends items by ranking them by the similarity of their feature vectors.
For example, a gourmet site that displays the shops associated with the keywords "hanoi + japanese restaurant" entered by a user is doing this. In short, content-based recommendation relies on the features of the items themselves.

## Vectorize feature
First of all, why do we need to vectorize features? Vectors appear constantly in machine learning papers and slide decks, so it is worth defining them.

A "vector" is defined as follows:
A quantity that can be expressed by specifying a magnitude and a direction.

A set of n real numbers $x_i (i = 1, 2, ⋯, n)$ vertically arranged

$$ 
  X = \left(
    \begin{array}{c}
      x_1 \\
      x_2 \\
      \vdots \\
      x_n
    \end{array}
  \right)
$$

is called an n-dimensional vector over the real number field R, or simply a real vector.

Suppose, for example, that we want to capture the features of a sentence.
We first transform the sentence into a vector.
Why a vector? The answer is simple: operating on vectors is much easier, and usually faster, than writing explicit for-loops.
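
As a minimal sketch of that point, the snippet below computes the same dot product twice: once with an explicit loop and once with a single NumPy call. The two count vectors `a` and `b` are illustrative values chosen for this example, and no timing is claimed here; the sketch only shows that the vectorized form is shorter and gives the same result.

```python
import numpy as np

# Two illustrative count vectors (assumed values for this sketch)
a = [0, 1, 0, 1, 1, 0, 1, 0, 0]
b = [2, 3, 1, 1, 1, 1, 2, 1, 1]

# Explicit loop: multiply element by element and accumulate
dot_loop = 0
for x, y in zip(a, b):
    dot_loop += x * y

# Vectorized: one call on NumPy arrays
dot_vec = int(np.dot(np.array(a), np.array(b)))

print(dot_loop, dot_vec)
```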

## How do we vectorize?
This time I will build a feature vector with the BoW (Bag of Words) method.

The concept of BoW is simple: it only counts how many times each word appears in the target sentence.

These are the sample sentences:
```
The sun is shining
The weather is sweet
The sun is shining, the weather is sweet, and one and is two
```
Write down every distinct word that appears in these sentences.
```
the, sun, is, shining, weather, sweet, and, one, two
```
Here "The" is treated the same as "the", so there are 9 distinct words.
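
The counting can be sketched by hand with only the standard library. The `tokenize` helper and its regular expression are assumptions made for this sketch (lowercase, alphabetic runs only) and are simpler than what a real tokenizer does.

```python
import re
from collections import Counter

sentences = [
    "The sun is shining",
    "The weather is sweet",
    "The sun is shining, the weather is sweet, and one and is two",
]

def tokenize(text):
    # Lowercase and keep alphabetic runs only (a simplification for this sketch)
    return re.findall(r"[a-z]+", text.lower())

# Vocabulary: every distinct word, sorted so each word gets a fixed column
vocabulary = sorted({word for s in sentences for word in tokenize(s)})

# One count vector per sentence, one column per vocabulary word
bow = []
for s in sentences:
    counts = Counter(tokenize(s))
    bow.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow:
    print(row)
```

Each row is the BoW vector of one sentence, and each column position always refers to the same word.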

Now assign each word a unique index and count its occurrences in each sentence.

In [16]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
words_ = np.array(['The sun is shining',
                   'The weather is sweet',
                   'The sun is shining, the weather is sweet, and one and is two'])
counts_ = vectorizer.fit_transform(words_)
print(vectorizer.vocabulary_)
print(counts_.toarray())

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}
[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 1 1 1 1 2 1 1]]


Shaping:

In [23]:
import pandas as pd
df = pd.DataFrame(counts_.toarray())
df.columns = ["and(x1)", "is(x2)", "one(x3)", "shining(x4)", "sun(x5)", "sweet(x6)", "the(x7)", "two(x8)", "weather(x9)"]
df

| | and(x1) | is(x2) | one(x3) | shining(x4) | sun(x5) | sweet(x6) | the(x7) | two(x8) | weather(x9) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 2 | 2 | 3 | 1 | 1 | 1 | 1 | 2 | 1 | 1 |


The work actually continues from here.  
But today we do not have time to explain everything.  
I think this is enough to explain the process.

Also, BoW is very useful, but nowadays "word2vec" is often a better solution for this.
However, NLP is not today's main agenda, so I will not explain it here.

## Let's try to implement content-based filtering!
data A: 
>The sun is shining.  
>The weather is sweet.  
>The sun is shining, the weather is sweet, and one and is two

data B:
>The moon is shining.  
>The juice is sweet.  
>The moon is shining, the juice is sweet, and one and is three.

### 1st. Calculate similarity
`Cosine similarity` is a good choice.
It is easy to calculate once the texts have been transformed into vectors.
You can picture it like this:
![cos_sim.png](attachment:cos_sim.png)
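
For reference, the standard definition of cosine similarity between two n-dimensional vectors is given below. The symbols $A$ and $B$, with components $a_i$ and $b_i$, are names chosen here; $\theta$ is the angle between the two vectors.

$$
\cos\theta = \frac{A \cdot B}{|A|\,|B|}
= \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}
$$

For count vectors, whose components are never negative, the value lies between 0 (no words in common) and 1 (same direction).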

In [26]:
from math import sqrt

In [28]:
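def calc_sim(text_a, text_b):
    # text_a and text_b are re-iterable collections of (word, weight) pairs,
    # for example the result of dict.items()
    b_dict = {key: value for key, value in text_b}

    # A * B (dot product over the words both texts share)
    ab = 0
    for key, value in text_a:
        value_b = b_dict.get(key)
        if value_b:
            ab += float(value * value_b)

    # |A| and |B|
    a = sqrt(sum([v ** 2 for k, v in text_a]))
    b = sqrt(sum([v ** 2 for k, v in text_b]))
    # Avoid dividing by zero when either text has no non-zero weights
    if a == 0 or b == 0:
        return 0.0
    return float(ab / (a * b))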

### 2nd. Implement CBF class

In [25]:
class ContentBaseFiltering:
    def __init__(self, item_1, item_2):
        self.item_1 = item_1
        self.item_2 = item_2
        
    def calculate_simularity_by_cosine(self):
        sim = None
        item_2_dict = {key: value for key, value in self.item_2}
        
        ab = 0
        for key, value in self.item_1:
            value2 = item_2_dict.get(key)
            if value2:
                ab += float(value * value2)
                
        a = sqrt(sum([v ** 2 for k, v in self.item_1]))
        b = sqrt(sum([v ** 2 for k, v in self.item_2]))
        sim = float(ab / (a * b))
        return sim
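
With a similarity function available, recommending reduces to sorting candidates by their similarity to a query item. The sketch below is a standalone, standard-library illustration of that step, not the class above: `word_counts` and `cosine` are helper names introduced here, and the last sentence of data A is used as the query against the three sentences of data B.

```python
import re
from collections import Counter
from math import sqrt

def word_counts(text):
    # Lowercased word counts (simplified tokenization, assumed for this sketch)
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(counts_a, counts_b):
    # Dot product over the words of A; words missing from B contribute 0
    dot = sum(v * counts_b.get(k, 0) for k, v in counts_a.items())
    norm_a = sqrt(sum(v ** 2 for v in counts_a.values()))
    norm_b = sqrt(sum(v ** 2 for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = "The sun is shining, the weather is sweet, and one and is two"
candidates = [
    "The moon is shining.",
    "The juice is sweet.",
    "The moon is shining, the juice is sweet, and one and is three.",
]

query_counts = word_counts(query)
# Score every candidate, then sort from most to least similar
ranked = sorted(
    ((cosine(query_counts, word_counts(c)), c) for c in candidates),
    key=lambda pair: pair[0],
    reverse=True,
)

for score, text in ranked:
    print(f"{score:.3f}  {text}")
```

The longest candidate ranks first because it shares the most words with the query. Note that very common words such as "the" and "is" dominate the raw counts.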

### 3rd. Pre-processing data
Use `TF-IDF` to add weight to the word counts.
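
To make the weighting concrete, here is a minimal hand-computed sketch using only the standard library. It assumes the smoothed variant idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which is the formula scikit-learn documents for `smooth_idf=True` and `norm='l2'`. The count matrix is the small three-sentence example, and the variable names are chosen for this sketch.

```python
import math

# Count matrix: rows are sentences, columns are
# and, is, one, shining, sun, sweet, the, two, weather
counts = [
    [0, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 1],
    [2, 3, 1, 1, 1, 1, 2, 1, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many sentences each word appears
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (assumed formula, see lead-in)
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

# Weight each count by its idf, then scale each row to unit length
tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print([round(v, 2) for v in row])
```

Words that appear in every sentence, such as "is" and "the", get the smallest idf, so rarer words gain relative weight.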

In [36]:
import os 
print(os.getcwd())

/Users/maki/dev/src/github.com/simula-labs/tech_note_for_machine_learning/data


In [40]:
# Context managers close the files even if reading fails
with open('world_cup_a.txt', 'r') as file_a:
    text_a = file_a.read()
with open('world_cup_b.txt', 'r') as file_b:
    text_b = file_b.read()


In [41]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
words_ = np.array([text_a, text_b])
# Fit once and slice the rows, instead of fitting the vectorizer twice
counts_ab_ = vectorizer.fit_transform(words_)
a_counts_ = counts_ab_[0]
b_counts_ = counts_ab_[1]

In [55]:
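from sklearn.feature_extraction.text import TfidfTransformer
# TfidfTransformer has no stop_words parameter; stop words are an option of CountVectorizer.
# To drop them, build the vectorizer as CountVectorizer(stop_words=stop_words) instead.
stop_words = ['the', 'in', 'group', 'of', 'to', 'and', 'will', 'on', 'but', '103', '12', '14', '16', '17']
tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
np.set_printoptions(precision=2)
res = tfidf.fit_transform(vectorizer.fit_transform(words_)).toarray()
tfidf_a = res[0]
tfidf_b = res[1]
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
labels_ = vectorizer.get_feature_names_out()

Note: passing `stop_words` to `TfidfTransformer` raises `TypeError: __init__() got an unexpected keyword argument 'stop_words'`. The outputs below still contain words such as "the" and "103", so they come from a run without stop-word filtering.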

In [56]:
import pandas as pd
df = pd.DataFrame(res)
df.columns = labels_
df

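| | 103 | 12 | 14 | 16 | 17 | 2010 | 2018 | 64th | 65 | aaron | ... | would | wrong | xherdan | years | yoshida | you | young | youth | zlatan | zuber |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.011498 | 0.011498 | 0.0 | 0.011498 | 0.011498 | 0.0 | 0.0 | 0.011498 | ... | 0.016362 | 0.011498 | 0.011498 | 0.011498 | 0.011498 | 0.011498 | 0.011498 | 0.011498 | 0.011498 | 0.011498 |
| 1 | 0.044545 | 0.044545 | 0.0 | 0.0 | 0.044545 | 0.0 | 0.0 | 0.044545 | 0.044545 | 0.0 | ... | 0.031694 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |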


In [52]:
series_a = df.sort_values(by=0, axis=1, ascending=False)[:1]
series_b = df.sort_values(by=1, axis=1, ascending=False)[1:2]
series_a

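| | the | in | group | of | to | and | will | on | but | have | ... | hat | bottom | trick | treble | training | re | hand | half | haiti | 103 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.572674 | 0.319061 | 0.287455 | 0.237251 | 0.22907 | 0.220889 | 0.163621 | 0.12648 | 0.122716 | 0.106354 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |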


In [53]:
series_b

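| | the | to | minutes | messi | world | in | haiti | dream | match | argentina | ... | high | hernandez | hernan | here | heavyweight | he | having | has | hard | zuber |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.633876 | 0.253551 | 0.178178 | 0.158469 | 0.158469 | 0.158469 | 0.133634 | 0.133634 | 0.133634 | 0.126775 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |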
