# Factorization Machine
----

### Concept
- A Factorization Machine (FM) is a general predictor, like an SVM. An FM models all pairwise interactions between variables. Because each interaction weight is factorized into the inner product of two latent vectors, an FM can estimate interactions that were rarely or never observed together, which helps it cope with the `cold-start` problem. Each variable gets a latent vector, and factorizing the interaction parameters breaks their independence: data for one interaction also helps estimate the parameters of related interactions. As a result, the algorithm can have an effect similar to using SVD++ with multiple variables.
- The algorithm works especially well in recommender systems such as e-commerce, because implicit-feedback data and content meta-information are sparse.
- [Paper is here](https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf)
- [My implemented code is here](https://github.com/yoonkt200/ml-theory-python/tree/master/nn-recommender/FM.py)

### Model Equation
- Conceptual model equation

$$ \hat{y}(x) = w_0 + \sum_{i=1}^{n}w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j $$

$$ \langle v_i, v_j \rangle = \hat{w}_{ij} $$
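- To make the factorization concrete, the sketch below builds the implied interaction weights from a small latent matrix. The values in `V` are made up for illustration and do not come from the paper or the linked implementation.

```python
import numpy as np

# Hypothetical latent matrix: n = 3 features, k = 2 factors (illustrative values)
V = np.array([[1.0, 0.0],
              [0.5, 2.0],
              [0.0, 1.0]])

# Implied pairwise weights: W_hat[i, j] = <v_i, v_j>
W_hat = V @ V.T

# Only the strict upper triangle (i < j) enters the model equation
i_idx, j_idx = np.triu_indices(V.shape[0], k=1)
pair_weights = W_hat[i_idx, j_idx]
print(pair_weights)  # weights for the pairs (0,1), (0,2), (1,2)
```

- Here 3 pairwise weights are derived from n * k = 6 latent values. The saving appears for large n, where the n(n-1)/2 pairwise weights far outnumber the n * k latent values.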

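- FM has a closed-form model equation that can be computed in linear time O(kn). On sparse data the cost is effectively O(k m_D), where m_D is the average number of non-zero features per sample, because zero-valued features contribute nothing to the sums. The pairwise term is rewritten as follows.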

$$ \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j = 0.5\sum_{i=1}^{n}\sum_{j=1}^{n} \langle v_i, v_j \rangle x_i x_j - 0.5\sum_{i=1}^{n} \langle v_i, v_i \rangle x_i x_i $$

$$ = 0.5(\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{f=1}^{k} v_{i,f} v_{j,f} x_i x_j - \sum_{i=1}^{n}\sum_{f=1}^{k} v_{i,f} v_{i,f} x_i x_i) $$

$$ = 0.5\sum_{f=1}^{k}((\sum_{i=1}^{n} v_{i,f} x_i)(\sum_{j=1}^{n} v_{j,f} x_j) - \sum_{i=1}^{n} {v_{i,f}}^2{x_i}^2) $$

$$ = 0.5\sum_{f=1}^{k}((\sum_{i=1}^{n} v_{i,f} x_i)^2 - \sum_{i=1}^{n} {v_{i,f}}^2{x_i}^2) $$
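- As a sanity check on the derivation above, the following sketch compares the naive double loop over pairs with the linear-time form on random inputs. The sizes, the random seed, and the function names `pairwise_naive` and `pairwise_linear` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3                      # hypothetical number of features and factors
V = rng.normal(size=(n, k))      # latent vectors v_i stored as rows
x = rng.normal(size=n)           # one dense input vector

def pairwise_naive(V, x):
    """O(k n^2): sum over i < j of <v_i, v_j> x_i x_j."""
    total = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            total += np.dot(V[i], V[j]) * x[i] * x[j]
    return total

def pairwise_linear(V, x):
    """O(k n): 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2)."""
    vx = V * x[:, None]                      # vx[i, f] = v_if * x_i
    return 0.5 * np.sum(np.sum(vx, axis=0) ** 2 - np.sum(vx ** 2, axis=0))

naive = pairwise_naive(V, x)
fast = pairwise_linear(V, x)
print(np.isclose(naive, fast))
```

- Both functions compute the same quantity. Only the second avoids the nested loop over feature pairs.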

- The code below implements the FM model equation.

In [1]:
import numpy as np

# Pre-trained parameters
b = 0.3
w = np.array([0.001, 0.02, 0.009, -0.001])
v = np.array([[0.00516, 0.0212581, 0.150338, 0.22903],
              [0.241989, 0.0474224, 0.128744, 0.0995021], 
              [0.0657265, 0.1858, 0.0223, 0.140097], 
              [0.145557, 0.202392, 0.14798, 0.127928]])

# Equation of FM model
def inference(data):
    num_data = len(data)
    scores = np.zeros(num_data)
    for n in range(num_data):
        feat_idx = data[n][0]
        val = np.array(data[n][1])
        
        # linear feature score
        linear_feature_score = np.sum(w[feat_idx] * val)
        
        # factorized feature score
        vx = v[feat_idx] * (val.reshape(-1, 1))
        cross_sum = np.sum(vx, axis=0)
        square_sum = np.sum(vx*vx, axis=0)
        cross_feature_score = 0.5 * np.sum(np.square(cross_sum) - square_sum)
        
        # Model's equation
        scores[n] = b + linear_feature_score + cross_feature_score

    # Sigmoid transformation for binary classification
    scores = 1.0 / (1.0 + np.exp(-scores))
    return scores

In [None]:
# Inference test for 3 cases
data = [[[0, 1, 3], # feature index 
         [0.33, 1, 1]], # feature value
        [[2],
         [1]],
        [[0, 1, 2, 3],
         [0.96, 1, 1, 1]]]

inference(data)

### Learning FM
- The model equation can be computed in linear time, so the parameters can be learned with SGD using the gradients below. In most cases, L2 regularization is added when training the model.

$$ \frac{\partial}{\partial w_0}\hat{y}(x) = 1 $$

$$ \frac{\partial}{\partial w_i}\hat{y}(x) = x_i $$

$$ \frac{\partial}{\partial v_{i,f}}\hat{y}(x) = x_i \sum_{j=1}^{n} v_{j,f}x_{j} - v_{i,f} {x_i}^2 $$
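- The gradients above can be checked numerically. The sketch below computes them in vectorized form and compares one entry of the latent-weight gradient against a central finite difference. The helper names `fm_predict` and `fm_gradients`, the sizes, and the random inputs are assumptions made for this example.

```python
import numpy as np

def fm_predict(w0, w, V, x):
    """Raw FM output y_hat(x) for one dense input vector x."""
    vx = V * x[:, None]
    pair = 0.5 * np.sum(np.sum(vx, axis=0) ** 2 - np.sum(vx ** 2, axis=0))
    return w0 + w @ x + pair

def fm_gradients(V, x):
    """Analytic gradients of y_hat with respect to w0, w and V."""
    grad_w0 = 1.0
    grad_w = x.copy()
    s = V.T @ x                                   # s[f] = sum_j v_jf x_j
    grad_V = np.outer(x, s) - V * (x ** 2)[:, None]
    return grad_w0, grad_w, grad_V

rng = np.random.default_rng(1)
n, k = 5, 3                                       # hypothetical sizes
w0, w = 0.1, rng.normal(size=n)
V, x = rng.normal(size=(n, k)), rng.normal(size=n)

_, _, grad_V = fm_gradients(V, x)

# Central finite difference for a single entry v_{i,f}
eps, i, f = 1e-6, 2, 1
V_plus, V_minus = V.copy(), V.copy()
V_plus[i, f] += eps
V_minus[i, f] -= eps
numeric = (fm_predict(w0, w, V_plus, x) - fm_predict(w0, w, V_minus, x)) / (2 * eps)
print(np.isclose(numeric, grad_V[i, f], atol=1e-5))
```

- The sum `s` does not depend on i, so it is computed once per sample and reused for every row of `grad_V`.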

### Binary Classification by FM
- The FM classification model follows these rules.
    - 1. h(x) applies the sigmoid function to the original FM output.
    - 2. The cost function is binary cross-entropy (equivalently, maximum likelihood estimation).
    - 3. There are three types of parameters: bias (b), linear weights (w), and latent weights (v).
    - 4. The gradient combines the binary cross-entropy gradient with the FM parameter gradients (see the `Learning FM` section above).

$$ h(x) = \text{sigmoid}(\hat{y}(x)) = \text{sigmoid}(w_0 + \sum_{i=1}^{n} w_i x_i + 0.5\sum_{f=1}^{k}((\sum_{i=1}^{n} v_{i,f} x_i)^2 - \sum_{i=1}^{n} {v_{i,f}}^2{x_i}^2)) $$

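- In the cost function, N is the number of training samples (n already denotes the number of features).

$$ Cost(\Theta) = -\frac{1}{N}\sum_{i=1}^{N}[y^{(i)}\log(h(x^{(i)})) + (1-y^{(i)})\log(1-h(x^{(i)}))] $$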

$$ \text{SGD: repeat } \{ $$

$$ \theta_j := \theta_j - \alpha (h(x^{(i)})-y^{(i)}) \frac{\partial \hat{y}(x^{(i)})}{\partial \theta_j} $$

$$ \} $$
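- The update rule above can be sketched as a small training loop. This is a minimal illustration on dense toy data, not the linked implementation. The learning rate, the L2 strength, the toy dataset, and the function names (`fm_raw`, `sgd_epoch`, `log_loss`) are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fm_raw(w0, w, V, x):
    """Raw FM output y_hat(x) before the sigmoid."""
    vx = V * x[:, None]
    pair = 0.5 * np.sum(np.sum(vx, axis=0) ** 2 - np.sum(vx ** 2, axis=0))
    return w0 + w @ x + pair

def sgd_epoch(w0, w, V, X, y, lr=0.05, l2=0.0):
    """One SGD pass over (X, y). Updates w and V in place and returns w0."""
    for x_i, y_i in zip(X, y):
        err = sigmoid(fm_raw(w0, w, V, x_i)) - y_i         # (h(x) - y)
        s = V.T @ x_i                                      # s[f] = sum_j v_jf x_j
        grad_V = np.outer(x_i, s) - V * (x_i ** 2)[:, None]
        w0 -= lr * err                                     # d y_hat / d w0 = 1
        w -= lr * (err * x_i + l2 * w)                     # d y_hat / d w_i = x_i
        V -= lr * (err * grad_V + l2 * V)
    return w0

def log_loss(w0, w, V, X, y):
    """Mean binary cross-entropy, with clipping to avoid log(0)."""
    p = np.array([sigmoid(fm_raw(w0, w, V, x_i)) for x_i in X])
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical toy data: the label is 1 only when features 0 and 1 are both active
X = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1],
              [1, 1, 1], [0, 0, 1], [1, 0, 0]], dtype=float)
y = np.array([1, 0, 0, 1, 0, 0], dtype=float)

rng = np.random.default_rng(0)
w0, w, V = 0.0, np.zeros(3), rng.normal(scale=0.1, size=(3, 2))

loss_before = log_loss(w0, w, V, X, y)
for _ in range(200):
    w0 = sgd_epoch(w0, w, V, X, y)
loss_after = log_loss(w0, w, V, X, y)
print(loss_before, loss_after)
```

- Setting `l2` above zero adds the L2 penalty to the linear and latent weights. The bias is left unregularized here, which is a design choice of this sketch.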