# Factorization Machines (FM)
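Factorization Machines (FM) are a factorization model proposed by Steffen Rendle in 2010 to address several shortcomings of traditional factorization models.

First, with traditional factorization models, every new problem requires building a new model on top of matrix factorization (for example, SVD), deriving a new parameter-learning algorithm, and tuning various parameters during learning. As a result, these models are time-consuming, laborious, and error-prone for anyone who is not already familiar with factorization models.

Second, traditional factorization models cannot make good use of feature engineering to accomplish learning tasks. In practical machine learning, the common approach is to first represent the data as feature vectors and then train with an open-source tool such as LibSVM or Weka, which makes classification or decision tasks convenient.

The advantage of FM is that it can mimic factorization models through feature vectors. It combines the generality and applicability of feature engineering with the ability of factorization models to model and estimate interactions between variables of different categories. With the open-source implementation libFM, learning tasks can be completed quickly and with good accuracy. By naming the model a factorization machine, the author hoped it would be as simple, easy to use, and accurate as the support vector machine.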
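To make the "interactions between variables" idea concrete, here is a minimal sketch of the second-order FM scoring function in plain Python. It assumes the standard formulation: a global bias `w0`, one linear weight per feature, and a `k`-dimensional latent vector per feature whose pairwise dot products weight the interactions. The function names and the toy parameters are illustrative only and are not taken from libFM or pyFM.

``` python
from itertools import combinations


def fm_predict_naive(x, w0, w, V):
    """Second-order FM score computed directly from the pairwise definition."""
    linear = sum(w_i * x_i for w_i, x_i in zip(w, x))
    pairwise = 0.0
    for i, j in combinations(range(len(x)), 2):
        # The interaction weight of features i and j is the dot product <v_i, v_j>.
        dot = sum(a * b for a, b in zip(V[i], V[j]))
        pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise


def fm_predict_fast(x, w0, w, V):
    """Same score using the O(k * n) identity:
    sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2]
    """
    linear = sum(w_i * x_i for w_i, x_i in zip(w, x))
    pairwise = 0.0
    for f in range(len(V[0])):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


# Toy parameters (made up for illustration): 4 features, k = 2 latent factors.
x = [1.0, 0.0, 1.0, 0.5]
w0 = 0.1
w = [0.2, -0.1, 0.3, 0.05]
V = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.5, 0.1]]

naive_score = fm_predict_naive(x, w0, w, V)
fast_score = fm_predict_fast(x, w0, w, V)
print(round(naive_score, 6), round(fast_score, 6))
```

Both functions return the same value; the second form avoids the explicit loop over feature pairs, which matters when the feature vector is long and sparse.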

## Code implementation (using a library)
### Installation
``` powershell
pip install git+https://github.com/coreylynch/pyFM
```

``` python
from pyfm import pylibfm
from sklearn.feature_extraction import DictVectorizer
import numpy as np
```

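Create the training set and convert it to a feature matrix. `DictVectorizer` one-hot encodes the string-valued fields (`user`, `item`) and keeps the numeric `age` field as is.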

``` python
train = [
    {"user": "1", "item": "5", "age": 19},
    {"user": "2", "item": "43", "age": 33},
    {"user": "3", "item": "20", "age": 55},
    {"user": "4", "item": "10", "age": 20},
]
v = DictVectorizer()
X = v.fit_transform(train)
print(X.toarray())
```

```
[[19.  0.  0.  0.  1.  1.  0.  0.  0.]
 [33.  0.  0.  1.  0.  0.  1.  0.  0.]
 [55.  0.  1.  0.  0.  0.  0.  1.  0.]
 [20.  1.  0.  0.  0.  0.  0.  0.  1.]]
```
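The column order in the matrix above can be hard to read at a glance. The following dependency-free sketch mimics the layout rule: each string value becomes an indicator column named `key=value`, numeric values keep their key as the column name, and the columns are sorted by name. `one_hot_rows` is a hypothetical helper written for illustration, not part of scikit-learn.

``` python
def one_hot_rows(records):
    """Hypothetical helper mimicking the column layout DictVectorizer produces."""

    def feature_name(key, value):
        # String values become an indicator column "key=value";
        # numeric values keep the plain key as the column name.
        return f"{key}={value}" if isinstance(value, str) else key

    names = sorted({feature_name(k, v) for r in records for k, v in r.items()})
    index = {name: i for i, name in enumerate(names)}
    rows = []
    for r in records:
        row = [0.0] * len(names)
        for k, v in r.items():
            row[index[feature_name(k, v)]] = 1.0 if isinstance(v, str) else float(v)
        rows.append(row)
    return names, rows


train = [
    {"user": "1", "item": "5", "age": 19},
    {"user": "2", "item": "43", "age": 33},
    {"user": "3", "item": "20", "age": 55},
    {"user": "4", "item": "10", "age": 20},
]
names, rows = one_hot_rows(train)
print(names)
for row in rows:
    print(row)
```

Because names are sorted as strings, `item=5` comes after `item=43`, which is why the first record has its item indicator in the fifth column.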


Create the labels. For simplicity, every label is set to 1.

``` python
y = np.repeat(1.0, X.shape[0])
y
```

```
array([1., 1., 1., 1.])
```

Train and predict

``` python
fm = pylibfm.FM()
fm.fit(X, y)
# Wrap the single sample in a list: transform expects an iterable of dicts.
fm.predict(v.transform([{"user": "1", "item": "10", "age": 24}]))
```

```
Creating validation dataset of 0.01 of training for adaptive regularization
-- Epoch 1
Training log loss: 0.13187

array([0.97810867])
```

### Hands-on: movie rating dataset
Dataset: http://www.grouplens.org/system/files/ml-100k.zip

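``` python
def loadData(filename, path="ml-100k/"):
    """Read a tab-separated ratings file into feature dicts and a rating array."""
    data = []
    y = []
    users = set()
    items = set()
    with open(path + filename) as f:
        for line in f:
            # Each line holds: user id, movie id, rating, timestamp.
            (user, movieid, rating, ts) = line.split('\t')
            # Keep the ids as strings so DictVectorizer one-hot encodes them.
            data.append({"user_id": str(user), "movie_id": str(movieid)})
            y.append(float(rating))
            users.add(user)
            items.add(movieid)

    return (data, np.array(y), users, items)
```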
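To see what the loader returns without downloading the dataset, the sketch below applies the same parsing logic to a few in-memory lines. `parse_ratings` is a hypothetical variant of `loadData` that accepts any iterable of lines and returns a plain list of ratings, and the three sample lines are made up for illustration; they are not taken from ml-100k.

``` python
def parse_ratings(lines):
    """Hypothetical variant of loadData that reads from any iterable of lines."""
    data, y = [], []
    users, items = set(), set()
    for line in lines:
        # Same four tab-separated fields as in loadData; the timestamp is unused.
        user, movie_id, rating, _ts = line.rstrip("\n").split("\t")
        data.append({"user_id": user, "movie_id": movie_id})
        y.append(float(rating))
        users.add(user)
        items.add(movie_id)
    return data, y, users, items


# Made-up sample lines in the user \t movie \t rating \t timestamp layout.
sample = ["1\t10\t4\t100\n", "1\t20\t3\t101\n", "2\t10\t5\t102\n"]
data, y, users, items = parse_ratings(sample)
print(data[0], y, sorted(users), sorted(items))
```

Keeping the ids as strings is what makes the vectorizer treat them as categories rather than as numeric magnitudes.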

Load the training and test sets and convert them to feature matrices

``` python
(train_data, y_train, train_users, train_items) = loadData("ua.base")
(test_data, y_test, test_users, test_items) = loadData("ua.test")
v = DictVectorizer()
X_train = v.fit_transform(train_data)
X_test = v.transform(test_data)
```

Train the model for 100 epochs, then evaluate it

``` python
fm = pylibfm.FM(num_factors=10, num_iter=100, verbose=True, task="regression", initial_learning_rate=0.001, learning_rate_schedule="optimal")
fm.fit(X_train, y_train)
```

```
Creating validation dataset of 0.01 of training for adaptive regularization
-- Epoch 1
Training MSE: 0.59525
-- Epoch 2
Training MSE: 0.51804
-- Epoch 3
Training MSE: 0.49046
-- Epoch 4
Training MSE: 0.47458
-- Epoch 5
Training MSE: 0.46416
-- Epoch 6
Training MSE: 0.45662
-- Epoch 7
Training MSE: 0.45099
-- Epoch 8
Training MSE: 0.44639
-- Epoch 9
Training MSE: 0.44264
-- Epoch 10
Training MSE: 0.43949
-- Epoch 11
Training MSE: 0.43675
-- Epoch 12
Training MSE: 0.43430
-- Epoch 13
Training MSE: 0.43223
-- Epoch 14
Training MSE: 0.43020
-- Epoch 15
Training MSE: 0.42851
-- Epoch 16
Training MSE: 0.42691
-- Epoch 17
Training MSE: 0.42531
-- Epoch 18
Training MSE: 0.42389
-- Epoch 19
Training MSE: 0.42255
-- Epoch 20
Training MSE: 0.42128
-- Epoch 21
Training MSE: 0.42003
-- Epoch 22
Training MSE: 0.41873
-- Epoch 23
Training MSE: 0.41756
-- Epoch 24
Training MSE: 0.41634
-- Epoch 25
Training MSE: 0.41509
-- Epoch 26
Training MSE: 0.41391
-- Epoch 27
Training MSE: 0.41274
-- Epoch 28
... (output truncated)
```

Predict on the test set and print the error

``` python
from sklearn.metrics import mean_squared_error

preds = fm.predict(X_test)
print("FM MSE: %.4f" % mean_squared_error(y_test, preds))
```

```
FM MSE: 0.8873
```
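For readers who want to interpret the number, here is a small sketch of what the metric computes: the mean of the squared differences between true and predicted ratings. Taking the square root gives the RMSE, which is expressed in the same units as the ratings. The `mse` helper and the toy values are illustrative; only the 0.8873 figure comes from the run above.

``` python
import math


def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy ratings (made up) to show the calculation by hand.
toy_true = [3.0, 4.0, 5.0, 1.0]
toy_pred = [2.5, 4.0, 4.0, 2.0]
toy_mse = mse(toy_true, toy_pred)  # (0.25 + 0 + 1 + 1) / 4

# RMSE corresponding to the test MSE reported above.
rmse_reported = math.sqrt(0.8873)
print("toy MSE: %.4f, RMSE for MSE 0.8873: %.3f" % (toy_mse, rmse_reported))
```

An RMSE of about 0.94 means the predictions are off by slightly less than one rating point in the root-mean-square sense.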
