# Introduction

Content-Based Recommender Systems (CBRS) use item attributes to compute predictions. They try to match users to items that are similar to what they have liked in the past. This similarity is not necessarily based on rating correlations across users, but on the attributes of the items the user has liked. <br>
CBRS focus on the target user's own ratings and the attributes of the items that user liked. Therefore, other users play no role in CBRS.

CBRS are dependent on two sources of data:
1. Description of various items in terms of content-centric attributes (text descriptions of an item).
2. User profile, which is generated from user feedback about various items.

# Basic components of Content-Based Systems

1. Preprocessing and feature extraction
2. Content-based learning of user profiles
3. Filtering and recommendation

## Feature Representation and Cleaning
Keywords are converted into a vector-space representation after the following cleaning steps:
1. Stop-word removal
2. Stemming
3. Phrase extraction
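
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not a production pipeline: the stop-word list, the crude suffix-stripping `naive_stem` helper, and the two item descriptions are all assumptions made for the example, and phrase extraction is left out for brevity.

```python
import re
import numpy as np

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "and", "of", "with", "in", "is"}

def naive_stem(token):
    """Very crude suffix stripping; a real system would use a proper stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

def to_binary_matrix(descriptions):
    """Build a vocabulary and a binary item-by-term matrix."""
    docs = [tokenize(d) for d in descriptions]
    vocabulary = sorted({t for doc in docs for t in doc})
    column = {term: j for j, term in enumerate(vocabulary)}
    matrix = np.zeros((len(docs), len(vocabulary)), dtype=int)
    for i, doc in enumerate(docs):
        for t in doc:
            matrix[i, column[t]] = 1
    return vocabulary, matrix

# Hypothetical item descriptions.
descriptions = [
    "The drums and the guitars with a heavy beat",
    "A classical symphony of the orchestra",
]
vocabulary, matrix = to_binary_matrix(descriptions)
print(vocabulary)
print(matrix)
```

Each row of `matrix` is one item and each column one stemmed keyword, which is the binary representation the classifiers later in this notebook expect.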

## Creating User Profile
Data about the user can be obtained from:
1. Ratings
2. Implicit feedback
3. Text opinions
4. Cases

## Supervised Feature Selection and Weighting

### Gini index
Let $t$ be the total number of possible values of the ratings. Among documents containing a particular word $w$, let $p_1(w)...p_t(w)$ be the fraction of items rated at each of these $t$ possible values. Then the Gini index of the word $w$ is defined as follows:
$$ Gini(w) = 1 - \sum_{i=1}^t p_i(w)^2 $$ <br>
The Gini index lies between $0$ and $1 - 1/t$; smaller values indicate greater discriminative power.
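
As a rough sketch of the definition, the function below computes the Gini index from the ratings of the items that contain a given word. The function names and the sample ratings are assumptions made for illustration.

```python
import numpy as np

def rating_fractions(ratings_with_word, possible_ratings):
    """Fraction p_i(w) of items at each possible rating value."""
    ratings = np.asarray(ratings_with_word)
    return np.array([np.mean(ratings == r) for r in possible_ratings])

def gini_index(ratings_with_word, possible_ratings):
    """Gini(w) = 1 - sum_i p_i(w)^2."""
    p = rating_fractions(ratings_with_word, possible_ratings)
    return float(1.0 - np.sum(p ** 2))

# Hypothetical ratings of the items whose description contains each word.
pure_word = [1, 1, 1, 1]      # always liked: highly discriminative
mixed_word = [1, -1, 1, -1]   # liked half the time: uninformative

print(gini_index(pure_word, [-1, 1]))   # 0.0
print(gini_index(mixed_word, [-1, 1]))  # 0.5
```

With $t = 2$ rating values, the mixed word reaches the maximum $1 - 1/t = 0.5$, while the word that only appears in liked items reaches $0$.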

### Entropy
$$ Entropy(w) = -\sum_{i=1}^t p_i(w)\log(p_i(w)) $$
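
The entropy of a word can be computed in the same way. The sketch below assumes base-2 logarithms and the usual convention that a zero fraction contributes nothing to the sum; both choices, like the sample ratings, are assumptions of this example.

```python
import numpy as np

def entropy(ratings_with_word, possible_ratings, base=2.0):
    """Entropy(w) = -sum_i p_i(w) log p_i(w), with 0 * log(0) taken as 0."""
    ratings = np.asarray(ratings_with_word)
    p = np.array([np.mean(ratings == r) for r in possible_ratings])
    p = p[p > 0]  # drop zero fractions so that log is defined
    # -p * log(p) is written as p * log(1 / p) to avoid returning -0.0
    return float(np.sum(p * np.log(1.0 / p)) / np.log(base))

# Hypothetical ratings of the items containing each word.
print(entropy([1, 1, 1, 1], [-1, 1]))    # 0.0
print(entropy([1, -1, 1, -1], [-1, 1]))  # 1.0
```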

### $\chi^2$ Statistic
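The $\chi^2$ statistic treats the co-occurrence between a word and a rating class as a contingency table with $p$ cells. Let $O_i$ be the observed value of the $i^{th}$ cell and $E_i$ be the expected value of the $i^{th}$ cell. Then the $\chi^2$ statistic is computed as follows:
$$ \chi^2 = \sum_{i=1}^p \dfrac{(O_i - E_i)^2}{E_i} $$
For a $2 \times 2$ contingency table whose first row is $(O_1, O_2)$ and whose second row is $(O_3, O_4)$, this simplifies to:
$$ \chi^2 = \dfrac{(O_1 + O_2 + O_3 + O_4)(O_1O_4 - O_2O_3)^2}{(O_1 + O_2)(O_3 + O_4)(O_1 + O_3)(O_2 + O_4)} $$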
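
The sketch below computes the statistic in two ways: from the general sum over cells, with expected counts derived from the row and column totals, and from the $2 \times 2$ shortcut with the squared cross-product term $(O_1O_4 - O_2O_3)^2$. The contingency table is hypothetical (rows: word present or absent, columns: like or dislike).

```python
import numpy as np

def chi_square(observed):
    """General form: sum over cells of (O - E)^2 / E."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected count under independence: row total * column total / grand total
    expected = row_totals @ col_totals / observed.sum()
    return float(np.sum((observed - expected) ** 2 / expected))

def chi_square_2x2(o1, o2, o3, o4):
    """Shortcut for a 2x2 table with rows (o1, o2) and (o3, o4)."""
    total = o1 + o2 + o3 + o4
    numerator = total * (o1 * o4 - o2 * o3) ** 2
    denominator = (o1 + o2) * (o3 + o4) * (o1 + o3) * (o2 + o4)
    return numerator / denominator

# Hypothetical counts, for illustration only.
table = [[30, 10],
         [20, 40]]
print(chi_square(table))
print(chi_square_2x2(30, 10, 20, 40))
```

Both calls print approximately 16.67, which is a quick check that the two forms agree.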

# Learning User Profiles and Filtering
Nearest-neighbor classification can also be used for this step, but this notebook uses only the Bayes classifier.

## Bayes Classifier

Let $D_L$ be the set of training documents and $D_U$ the set of test documents. <br>
We assume that the labels are binary: the user specifies either a like or a dislike rating, encoded as $1$ and $-1$. The rating of the $i^{th}$ document in $D_L$ is denoted by $c_i \in \{-1, 1\}$. <br>
Each document is treated as a binary vector of $d$ words containing only the values $0$ and $1$. <br>
Consider a target document $\bar{X} \in D_U$ with features $(x_1...x_d)$. We would like to determine $P(Active\ user\ likes\ \bar{X}|x_1...x_d)$. To do so, we compute $P(c(\bar{X}) = 1|x_1...x_d)$ and $P(c(\bar{X}) = -1|x_1...x_d)$ and select the larger of the two to decide whether or not the active user likes $\bar{X}$. We apply the Bayes rule as follows:
$$ P(c(\bar{X}) = 1|x_1...x_d) = \dfrac{P(c(\bar{X}) = 1).P(x_1...x_d|c(\bar{X}) = 1)}{P(x_1...x_d)} $$
$$ \propto P(c(\bar{X}) = 1).P(x_1...x_d|c(\bar{X}) = 1) $$
$$ = P(c(\bar{X}) = 1).\Pi_{i=1}^dP(x_i|c(\bar{X}) = 1) $$ <br>
The denominator $P(x_1...x_d)$ is the same for both classes, so it can be dropped. The last step uses the naive assumption that the features are conditionally independent given the class.

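The prior probability is estimated as the smoothed fraction of training documents in the positive class $D_L^+$, where $\alpha > 0$ is a Laplacian smoothing parameter:
$$ P(c(\bar{X}) = 1) = \dfrac{|D_L^+| + \alpha}{|D_L| + 2\alpha} $$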

The conditional feature probability $P(x_i|c(\bar{X}) = 1)$ is estimated as the fraction of the instances in the positive class for which the $i^{th}$ feature takes on the value $x_i$. Let $q^+(x_i)$ represent the number of instances in the positive class that take on the value $x_i \in \{0, 1\}$ for the $i^{th}$ feature. Then, with Laplacian smoothing parameter $\beta > 0$:
$$ P(x_i|c(\bar{X}) = 1) = \dfrac{q^+(x_i) + \beta}{|D_L^+| + 2\beta} $$
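
The estimates above can be turned into a small from-scratch classifier. The sketch below is only an illustration of the formulas: the function names are assumptions, $\alpha = \beta = 1$ is an arbitrary choice, and the toy data reuses the binary item matrix from the example later in this notebook with labels re-encoded as $-1$ and $1$. Log-probabilities are summed instead of multiplying probabilities, which does not change which class wins.

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0, beta=1.0):
    """Estimate smoothed priors and P(x_i = 1 | class) for classes -1 and 1."""
    X, y = np.asarray(X), np.asarray(y)
    model = {}
    for c in (-1, 1):
        X_c = X[y == c]
        prior = (len(X_c) + alpha) / (len(X) + 2 * alpha)
        # (q(x_i = 1) + beta) / (|D_L^c| + 2 * beta); P(x_i = 0 | c) is 1 minus this
        p_one = (X_c.sum(axis=0) + beta) / (len(X_c) + 2 * beta)
        model[c] = (prior, p_one)
    return model

def predict_bernoulli_nb(model, x):
    """Return the class with the larger (log) posterior score."""
    x = np.asarray(x)
    scores = {}
    for c, (prior, p_one) in model.items():
        likelihood = np.where(x == 1, p_one, 1.0 - p_one)
        scores[c] = np.log(prior) + np.sum(np.log(likelihood))
    return max(scores, key=scores.get)

# Toy data: rows are items, columns are binary keyword indicators.
X = [[1, 1, 1, 0, 0, 0],
     [1, 1, 0, 0, 0, 1],
     [0, 1, 1, 0, 0, 0],
     [0, 0, 0, 1, 1, 1],
     [0, 1, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
y = [-1, -1, -1, 1, 1, 1]

model = fit_bernoulli_nb(X, y)
print(predict_bernoulli_nb(model, [0, 0, 0, 1, 0, 0]))  # 1
print(predict_bernoulli_nb(model, [1, 0, 1, 0, 0, 0]))  # -1
```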

#### Movie Lens

### Example

In [1]:
import numpy as np

In [2]:
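# One row per item; columns: Drums, Guitar, Beat, Classical, Symphony, Orchestra
x_train = np.array([[1, 1, 1, 0, 0, 0],
                  [1, 1, 0, 0, 0, 1],
                  [0, 1, 1, 0, 0, 0], 
                  [0, 0, 0, 1, 1, 1],
                  [0, 1, 0, 1, 0, 1],
                  [0, 0, 0, 1, 1, 0]])

# Labels: 0 = dislike, 1 = like (the derivation above uses -1 and 1)
y_train = np.array([0, 0, 0, 1, 1, 1])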

x_test = np.array([[0, 0, 0, 1, 0, 0],
                   [1, 0, 1, 0, 0, 0]])

In [3]:
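from sklearn.naive_bayes import MultinomialNB, BernoulliNB
# MultinomialNB is fitted on the binary features here; BernoulliNB is the
# scikit-learn variant that matches the binary-vector model derived above.
model = MultinomialNB()
model.fit(x_train, y_train)
model.predict(x_test)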

array([1, 0])

## Rule-Based Classifiers

Rule-based classifiers can be designed in a variety of ways, including leave-one-out methods, as well as associative methods. For example:
- Item contains keyword set A $\Rightarrow$ $Rating = Like$
- Item contains keyword set B $\Rightarrow$ $Rating = Dislike$

The overall approach for rule-based classification can be described as follows:
1. $\textbf{Training phase: }$ Determine all the relevant rules from the user profile at the desired level of minimum support and confidence from the training dataset $D_L$.
2. $\textbf{Testing phase: }$ For each item description in $D_U$, determine the fired rules and an average rating. Rank the items in $D_U$ on the basis of this average rating.

In [9]:
import itertools
def findsubsets(S,m):
    return set(itertools.combinations(S, m))

def get_all_subset(input_list):
    res_list = []
    for i in range(len(input_list)):
        r_i = list(findsubsets(input_list, i+1))
        res_list += r_i
    return res_list

def get_number_of_occurrences(binary_matrix, index):
    count = 0
    for arr in binary_matrix:
        if np.sum(arr[list(index)] != 1) == 0:
            count += 1
    
    return count

def get_support(binary_matrix, index):
    occurrences = get_number_of_occurrences(binary_matrix, index)
    
    return occurrences/len(binary_matrix)

def get_confidence(binary_matrix, true_labels, label, index):
    count = 0
    contain_x = 0
    for i in range(len(binary_matrix)):
        arr = binary_matrix[i]
        
        if np.sum(arr[list(index)] != 1) == 0:
            contain_x += 1
            if true_labels[i] == label:
                count += 1
    if contain_x == 0:
        return 0
    return count/contain_x
    
def get_rules(binary_matrix, true_labels, min_support, min_confidence):
    res = []
    
    indexes_range = range(len(binary_matrix[0]))
    indexes = get_all_subset(indexes_range)
    
    for index in indexes:
        for label in range(2):
            support = get_support(binary_matrix, index)
            confidence = get_confidence(binary_matrix, true_labels, label, index)
            
            if support >= min_support and confidence >= min_confidence:
                res.append((index, label, support, confidence))
#                 print(index, " => ", label, "\t ===> support : ", support, "\t confidence: ", confidence)
    res = sorted(res, key=lambda tup: (tup[3], tup[2]), reverse=True)
    return res
# get_rules(x_train, y_train, 0.33, 0.75)

def get_rule_in_text(binary_matrix, true_labels, min_support, min_confidence,
                     attribute_names = np.array(['Drums', 'Guitar', 'Beat', 'Classical', 'Symphony', 'Orchestra']),
                     label_names = ['Dislike', 'Like']):
    res = get_rules(binary_matrix, true_labels, min_support, min_confidence)
    for r in res:
        index, label, support, confidence = r
        print(attribute_names[list(index)], " => ", label_names[label], "\t ===> support : ", support, "\t confidence: ", confidence)
        
rules = get_rules(x_train, y_train, 0.33, 0.75)


In [11]:
def check_fired_rules(test_matrix, rules):
    for i in range(len(test_matrix)):
        arr = test_matrix[i]
        for j in range(len(rules)):
            index, label, support, confidence = rules[j]
            if np.sum(arr[list(index)] != 1) == 0:
                print("Test ", i, " fired by rule ", j + 1)
                
check_fired_rules(x_test, rules)

Test  0  fired by rule  1
Test  1  fired by rule  2
Test  1  fired by rule  3
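
The cell above only reports which rules fire. The ranking step of the testing phase could be sketched as follows. The three rules are written by hand in the same `(index, label, support, confidence)` layout that `get_rules` returns and are illustrative rather than its exact output. The helper names are assumptions, and so is the choice to skip items for which no rule fires.

```python
import numpy as np

# Hand-written rules: (feature indexes, label, support, confidence)
rules = [((3,), 1, 0.5, 1.0),
         ((0,), 0, 2 / 6, 1.0),
         ((2,), 0, 2 / 6, 1.0)]

x_test = np.array([[0, 0, 0, 1, 0, 0],
                   [1, 0, 1, 0, 0, 0]])

def average_fired_rating(item, rules):
    """Average the labels of all rules whose keywords are all present in the item."""
    fired = [label for index, label, _, _ in rules
             if np.all(item[list(index)] == 1)]
    return float(np.mean(fired)) if fired else None

def rank_items(test_matrix, rules):
    """Return (item position, average rating) pairs, best first."""
    scored = []
    for i, item in enumerate(test_matrix):
        score = average_fired_rating(item, rules)
        if score is not None:  # items with no fired rule are left unranked
            scored.append((i, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(rank_items(x_test, rules))  # [(0, 1.0), (1, 0.0)]
```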


## Regression-Based Models

Let $\bar{y}$ be the $n$-dimensional column vector containing the ratings of the active user for the $n$ documents in the training set, and let $D_L$ denote the $n \times d$ matrix of word frequencies of these documents. The basic idea in linear regression is to assume that the ratings can be modeled as a linear function of the word frequencies. Let $\bar{W}$ be a $d$-dimensional row vector representing the coefficients of each word in the linear function relating word frequencies to the rating:
$$ \bar{y} \approx D_L\bar{W}^T $$
The objective function $O$, with regularization parameter $\lambda > 0$, can be expressed as follows:
$$  Minimize\ O = \|D_L\bar{W}^T - \bar{y}\|^2 + \lambda\|\bar{W}\|^2 $$
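
Setting the gradient of $O$ to zero gives the closed-form solution $\bar{W}^T = (D_L^TD_L + \lambda I)^{-1}D_L^T\bar{y}$. The sketch below solves this linear system with NumPy. The function name `fit_ridge`, the value $\lambda = 1$, and the toy data (the binary item matrix used earlier in this notebook with ratings encoded as $-1$ and $1$) are assumptions made for illustration.

```python
import numpy as np

def fit_ridge(D_L, y, lam=1.0):
    """Solve (D_L^T D_L + lam * I) W^T = D_L^T y for the coefficient vector."""
    D_L = np.asarray(D_L, dtype=float)
    y = np.asarray(y, dtype=float)
    d = D_L.shape[1]
    return np.linalg.solve(D_L.T @ D_L + lam * np.eye(d), D_L.T @ y)

# Toy training data: rows are items, columns are word indicators.
D_L = np.array([[1, 1, 1, 0, 0, 0],
                [1, 1, 0, 0, 0, 1],
                [0, 1, 1, 0, 0, 0],
                [0, 0, 0, 1, 1, 1],
                [0, 1, 0, 1, 0, 1],
                [0, 0, 0, 1, 1, 0]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1], dtype=float)

W = fit_ridge(D_L, y, lam=1.0)

# Predicted ratings for unseen items are the dot products with W.
D_U = np.array([[0, 0, 0, 1, 0, 0],
                [1, 0, 1, 0, 0, 0]], dtype=float)
print(D_U @ W)
```

Items in $D_U$ can then be ranked by their predicted ratings.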



# Advantages and Disadvantages

## Advantages
1. Content-based systems can recommend new items, because they rely only on item attributes; cold-start problems arise only for new users.
2. Content-based systems provide explanations in terms of the features of items.
3. Content-based methods can generally be used with off-the-shelf text classifiers.

## Disadvantages
1. Content-based systems tend to find items that are similar to those the user has seen so far (overspecialization). It is always desirable to have a certain amount of novelty and serendipity in the recommendations.
2. Content-based systems do not help in resolving the cold-start problem for new users.