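## Description
This notebook shows how to use an off-the-shelf FM (factorization machine) package for both a regression task and a classification task. The regression task reuses the movie-rating dataset from the collaborative filtering section. The classification task uses a synthetically generated dataset for now; the from-scratch implementation will later be exercised on a Kaggle CTR dataset.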

## Regression task
The regression data is again the MovieLens 100K movie-rating dataset, which can be downloaded from [http://www.grouplens.org/system/files/ml-100k.zip](http://www.grouplens.org/system/files/ml-100k.zip).

In [1]:
# Import packages
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from pyfm import pylibfm

In [3]:
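# Define the data loader
def loadData(filename, path='ml-100k/'):
    """Read a tab-separated ratings file with columns user, movie, rating, timestamp."""
    data = []      # one {'user_id', 'movie_id'} dict per rating
    y = []         # the ratings, used as regression targets
    users = set()
    items = set()
    with open(path + filename) as f:
        for line in f:
            # Strip the trailing newline so the timestamp field is clean
            (user, movieid, rating, ts) = line.rstrip('\n').split('\t')
            data.append({'user_id': str(user), 'movie_id': str(movieid)})
            y.append(float(rating))
            users.add(user)
            items.add(movieid)

    return (data, np.array(y), users, items)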

In [7]:
# Load the training and test splits
(train_data, y_train, train_users, train_items) = loadData('ua.base')
(test_data, y_test, test_users, test_items) = loadData('ua.test')

In [21]:
train_data

[{'user_id': '1', 'movie_id': '1'},
 {'user_id': '1', 'movie_id': '2'},
 {'user_id': '1', 'movie_id': '3'},
 {'user_id': '1', 'movie_id': '4'},
 {'user_id': '1', 'movie_id': '5'},
 {'user_id': '1', 'movie_id': '6'},
 {'user_id': '1', 'movie_id': '7'},
 {'user_id': '1', 'movie_id': '8'},
 {'user_id': '1', 'movie_id': '9'},
 {'user_id': '1', 'movie_id': '10'},
 {'user_id': '1', 'movie_id': '11'},
 {'user_id': '1', 'movie_id': '12'},
 {'user_id': '1', 'movie_id': '13'},
 {'user_id': '1', 'movie_id': '14'},
 {'user_id': '1', 'movie_id': '15'},
 {'user_id': '1', 'movie_id': '16'},
 {'user_id': '1', 'movie_id': '17'},
 {'user_id': '1', 'movie_id': '18'},
 {'user_id': '1', 'movie_id': '19'},
 {'user_id': '1', 'movie_id': '21'},
 {'user_id': '1', 'movie_id': '22'},
 {'user_id': '1', 'movie_id': '23'},
 {'user_id': '1', 'movie_id': '24'},
 {'user_id': '1', 'movie_id': '25'},
 {'user_id': '1', 'movie_id': '26'},
 {'user_id': '1', 'movie_id': '27'},
 {'user_id': '1', 'movie_id': '28'},
 ...

In [9]:
# Convert the dicts into a one-hot encoded sparse matrix
v = DictVectorizer()
X_train = v.fit_transform(train_data)
X_test = v.transform(test_data)
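
To make the one-hot step concrete, here is a small sketch of what `DictVectorizer` does with string-valued fields. The `toy_*` records and names are hypothetical and only mirror the shape of `train_data`; they are not taken from the dataset.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy records in the same shape as train_data
toy_train = [
    {'user_id': '1', 'movie_id': '10'},
    {'user_id': '1', 'movie_id': '20'},
    {'user_id': '2', 'movie_id': '10'},
]
# Movie '30' never appears in the toy training data
toy_test = [{'user_id': '2', 'movie_id': '30'}]

toy_v = DictVectorizer()
# fit_transform learns one column per distinct "field=value" pair
toy_X_train = toy_v.fit_transform(toy_train)
# transform reuses the learned columns; unseen pairs are silently dropped
toy_X_test = toy_v.transform(toy_test)

print(toy_v.feature_names_)
print(toy_X_train.toarray())
print(toy_X_test.toarray())
```

Each row ends up with exactly two non-zero entries (one user column, one movie column), which is why the result is kept as a sparse matrix. A user or movie that only appears in the test split contributes no column at all, so the model has no learned information about it.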

In [15]:
# Build the FM model
fm = pylibfm.FM(num_factors=10, num_iter=100, verbose=True, task='regression', initial_learning_rate=0.001, learning_rate_schedule='optimal')

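The parameters of `FM` are listed below; the ones that usually need to be set are in bold (see the source code for details):
* **num_factors**: dimension of the latent vectors, i.e. k
* **num_iter**: number of iterations. Training uses SGD (stochastic gradient descent), so this is the number of epochs to run
* k0, k1: k0 controls whether the bias term is used (see the FM formula), and k1 controls whether the first-order term (the per-feature weights) is used; both default to True
* init_stdev: standard deviation used when initializing the latent vectors, default 0.01
* **validation_size**: fraction of the training data held out as a validation set, default 0.01
* learning_rate_schedule: learning-rate decay schedule, one of constant, optimal, or invscaling; see the source code for the exact formulas
* **initial_learning_rate**: initial learning rate, default 0.01
* power_t, t0: the exponent of the inverse-scaling learning rate and the constant in the denominator of the optimal learning rate; both feed into the schedules above
* **task**: classification or regression; must be specified
* verbose: whether to print the current epoch and the training error
* shuffle_training: whether to shuffle the training set before learning
* seed: random seed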
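
To tie `num_factors`, `k0` and `k1` back to the FM formula, here is a minimal NumPy sketch of the standard second-order FM score for a single feature vector. It is not pyfm's internal code; the function names and the random parameters are assumptions made for illustration. The bias `w0` corresponds to the k0 term, the weights `w` to the k1 term, and `V` holds one latent vector of length `num_factors` per feature.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM score for one dense feature vector x.

    w0: bias, w: first-order weights (n_features,),
    V: latent matrix of shape (n_features, num_factors).
    """
    linear = w0 + w @ x
    # Pairwise term via the linear-time identity:
    # sum_{i<j} <v_i, v_j> x_i x_j
    #   = 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ]
    xv = V.T @ x
    pairwise = 0.5 * np.sum(xv ** 2 - (V ** 2).T @ (x ** 2))
    return linear + pairwise

def fm_predict_naive(x, w0, w, V):
    """Same score with an explicit double loop over feature pairs."""
    score = w0 + w @ x
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            score += (V[i] @ V[j]) * x[i] * x[j]
    return score

# Hypothetical parameters: 6 features, num_factors = 3
rng = np.random.default_rng(0)
x = rng.normal(size=6)
w0 = 0.5
w = rng.normal(size=6)
V = rng.normal(scale=0.1, size=(6, 3))

fast = fm_predict(x, w0, w, V)
slow = fm_predict_naive(x, w0, w, V)
print(np.isclose(fast, slow))
```

For a regression task this score is the prediction itself; for a classification task it is typically passed through a sigmoid to obtain a probability.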

In [16]:
# Train the model
fm.fit(X_train, y_train)

Creating validation dataset of 0.01 of training for adaptive regularization
-- Epoch 1
Training MSE: 0.59515
-- Epoch 2
Training MSE: 0.51796
-- Epoch 3
Training MSE: 0.49021
-- Epoch 4
Training MSE: 0.47434
-- Epoch 5
Training MSE: 0.46388
-- Epoch 6
Training MSE: 0.45643
-- Epoch 7
Training MSE: 0.45072
-- Epoch 8
Training MSE: 0.44616
-- Epoch 9
Training MSE: 0.44245
-- Epoch 10
Training MSE: 0.43925
-- Epoch 11
Training MSE: 0.43646
-- Epoch 12
Training MSE: 0.43414
-- Epoch 13
Training MSE: 0.43197
-- Epoch 14
Training MSE: 0.43014
-- Epoch 15
Training MSE: 0.42832
-- Epoch 16
Training MSE: 0.42672
-- Epoch 17
Training MSE: 0.42534
-- Epoch 18
Training MSE: 0.42394
-- Epoch 19
Training MSE: 0.42274
-- Epoch 20
Training MSE: 0.42146
-- Epoch 21
Training MSE: 0.42029
-- Epoch 22
Training MSE: 0.41925
-- Epoch 23
Training MSE: 0.41809
-- Epoch 24
Training MSE: 0.41714
-- Epoch 25
Training MSE: 0.41611
-- Epoch 26
Training MSE: 0.41504
-- Epoch 27
Training MSE: 0.41400
-- Epoch 28
...

In [17]:
# Evaluate: predict on the test set
preds = fm.predict(X_test)

In [18]:
preds

array([3.82775899, 3.36936358, 3.97046438, ..., 3.1684557 , 2.83862524,
       3.30447369])

In [19]:
from sklearn.metrics import mean_squared_error

In [20]:
print('FM MSE: %.4f' % mean_squared_error(y_test, preds))

FM MSE: 0.8894
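
A test MSE is easier to interpret next to a trivial baseline and in the units of the ratings themselves (an MSE of 0.8894 corresponds to an RMSE of roughly 0.94 rating points). The sketch below shows one way to compute both, using hypothetical `toy_*` arrays in place of `y_train` and `y_test`; the numbers are made up and say nothing about the real dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical ratings standing in for y_train / y_test
toy_y_train = np.array([4.0, 3.0, 5.0, 2.0, 4.0])
toy_y_test = np.array([3.0, 5.0, 4.0])

# Baseline: always predict the mean training rating
baseline_preds = np.full_like(toy_y_test, toy_y_train.mean())

baseline_mse = mean_squared_error(toy_y_test, baseline_preds)
baseline_rmse = np.sqrt(baseline_mse)  # same units as the ratings
print('Baseline MSE: %.4f, RMSE: %.4f' % (baseline_mse, baseline_rmse))
```

Running the same three lines with the real `y_train` and `y_test` gives a reference point: the FM model is only useful to the extent that its MSE is clearly below the baseline's.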


## Classification task

In [15]:
from sklearn.datasets import make_classification   # generates a random classification dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

In [30]:
X, y = make_classification(n_samples=1000, n_features=100, n_clusters_per_class=1) # 1000 samples with 100 features each
# Turn each row into a {feature_index: value} dict so DictVectorizer can consume it
data = [dict(enumerate(row)) for row in X]

In [31]:
x_train, x_test, y_train, y_test = train_test_split(data, y, test_size=0.1, random_state=42)

In [32]:
v = DictVectorizer()
x_train = v.fit_transform(x_train)
x_test = v.transform(x_test)

In [33]:
x_train.toarray()

array([[ 1.13546601, -0.44001418,  0.05883419, ...,  0.1106177 ,
         2.40164079,  1.06692114],
       [-0.95999225, -0.11428069,  1.52518244, ...,  0.34756597,
        -1.02474209, -0.89588306],
       [ 0.33820952,  0.02851325,  0.44870256, ..., -2.15892132,
         0.62617635,  1.42945567],
       ...,
       [-1.17764912,  0.52031492, -0.58740119, ...,  0.27980417,
        -0.17167864, -0.10519347],
       [-0.35854756,  0.31230172, -0.19378306, ..., -0.40616573,
        -0.10869339, -0.72132468],
       [ 1.14339962,  0.1036347 , -0.44070509, ...,  0.94306115,
         0.22989284,  1.45309461]])
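
The dense array above is essentially the original feature matrix, so for purely numeric data the detour through dicts and `DictVectorizer` is optional. The sketch below converts a NumPy array straight to a SciPy CSR matrix and checks that it matches the dict route. The `*_demo` names and the smaller dataset size are hypothetical. Whether `pylibfm` accepts this input depends on it expecting a float64 CSR matrix, which is an assumption based on reading its source; verify it against your installed version.

```python
import numpy as np
from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical smaller dataset, seeded so the run is repeatable
X_demo, y_demo = make_classification(n_samples=200, n_features=20,
                                     n_clusters_per_class=1, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.1, random_state=42)

# Direct route: wrap the dense arrays as float64 CSR matrices
Xtr_sparse = sparse.csr_matrix(Xtr, dtype=np.float64)
Xte_sparse = sparse.csr_matrix(Xte, dtype=np.float64)

# Dict route, as in the notebook, for comparison
dv = DictVectorizer()
Xtr_dict = dv.fit_transform([dict(enumerate(row)) for row in Xtr])

same = np.allclose(Xtr_sparse.toarray(), Xtr_dict.toarray())
print(same)
```

The dict route earns its keep when features are categorical strings, as in the regression task; for dense numeric features the direct conversion is shorter and avoids building 100 dict entries per row.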

In [21]:
# Build the model
fm = pylibfm.FM(num_factors=50, num_iter=10, verbose=True, task='classification', initial_learning_rate=0.0001, learning_rate_schedule='optimal')

In [22]:
fm.fit(x_train, y_train)

Creating validation dataset of 0.01 of training for adaptive regularization
-- Epoch 1
Training log loss: 2.06286
-- Epoch 2
Training log loss: 1.77003
-- Epoch 3
Training log loss: 1.51550
-- Epoch 4
Training log loss: 1.29592
-- Epoch 5
Training log loss: 1.10810
-- Epoch 6
Training log loss: 0.94841
-- Epoch 7
Training log loss: 0.81340
-- Epoch 8
Training log loss: 0.70018
-- Epoch 9
Training log loss: 0.60551
-- Epoch 10
Training log loss: 0.52595


In [23]:
y_pre = fm.predict(x_test)

In [25]:
print('validation log loss: %.4f' % log_loss(y_test, y_pre))

validation log loss: 1.3945
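
Log loss is computed on probabilities, so it helps to also look at a label-based metric. The sketch below thresholds predicted probabilities at 0.5 and reports accuracy alongside log loss. It assumes that `fm.predict` returns positive-class probabilities for a classification task, which is consistent with passing `y_pre` to `log_loss` above. The `toy_*` arrays are hypothetical stand-ins for `y_test` and `y_pre`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

# Hypothetical labels and predicted probabilities
toy_y_true = np.array([1, 0, 1, 1, 0])
toy_y_prob = np.array([0.9, 0.2, 0.4, 0.7, 0.6])

# Turn probabilities into hard 0/1 labels with a 0.5 threshold
toy_y_label = (toy_y_prob >= 0.5).astype(int)

toy_acc = accuracy_score(toy_y_true, toy_y_label)
toy_ll = log_loss(toy_y_true, toy_y_prob)
print('accuracy: %.4f' % toy_acc)
print('log loss: %.4f' % toy_ll)
```

A held-out log loss well above the final training log loss, as in the run above, usually points to overfitting or poorly calibrated probabilities, and accuracy shows whether the ranking of examples is still usable despite that.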
