# Personalized Recommendation
This project uses a text convolutional neural network and the [`MovieLens`](https://grouplens.org/datasets/movielens/) dataset to build a movie recommender.



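Recommender systems are everywhere in everyday web applications: online shopping, online bookstores, news apps, social networks, music sites, movie sites, and so on. Wherever there are users, there are recommendations. These systems tailor content to each person based on their own preferences and on the habits of people with similar tastes. Open a news app, for example, and personalization means that every user sees a different front page.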

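This is clearly useful. In today's information explosion there are countless channels for obtaining information, so people no longer spend most of their time working out where to get it. They spend it sifting through the flood for what actually interests them. This is the information overload problem, and recommender systems emerged to address it.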

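Collaborative filtering is one of the most widely used techniques in recommender systems. It collects users' histories, preferences, and similar information, computes each user's similarity to other users, and uses the ratings of similar users to predict how much the target user will like a particular item. Its advantage is that it can recommend items the user has never browsed. Its drawback is the cold-start problem: a new user has no interaction records with items and no stated preferences, so the model cannot find similar users or items.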
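To make the idea concrete, here is a minimal sketch of user-based collaborative filtering on a made-up rating matrix. It is an illustration only, not the model this project builds: the function name `predict_rating`, the `toy` matrix, and the choice of cosine similarity over co-rated items are all assumptions introduced for this example.

```python
import numpy as np

def predict_rating(ratings, user, item):
    """Predict ratings[user, item] from similar users; 0 means 'not rated'."""
    target = ratings[user]
    weighted_sum, weight_total = 0.0, 0.0
    for other, row in enumerate(ratings):
        # Only users who actually rated the item can contribute
        if other == user or row[item] == 0:
            continue
        # Compare the two users on the items both of them rated
        both = (target > 0) & (row > 0)
        if not both.any():
            continue
        a, b = target[both], row[both]
        sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        weighted_sum += sim * row[item]
        weight_total += abs(sim)
    if weight_total == 0:
        return None  # cold start: no comparable user was found
    return weighted_sum / weight_total

# Hypothetical data: rows are users, columns are items
toy = np.array([
    [5, 4, 0, 1],
    [5, 5, 4, 1],
    [1, 1, 2, 5],
    [0, 0, 0, 0],  # a brand-new user with no history
], dtype=float)

print(predict_rating(toy, user=0, item=2))  # pulled toward the rating of the most similar user
print(predict_rating(toy, user=3, item=2))  # None: the cold-start case
```

The new user in the last row has nothing to compare against, so the sketch returns `None`, which is exactly the cold-start situation described above.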

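A common way to mitigate cold start is to ask newly registered users to first pick what interests them: topics, groups, products, personality traits, favorite music genres, and so on. Douban FM does this, for example: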
<img src="assets/IMG_6242_300.PNG"/>

## Downloading the dataset
Run the code below to download the [dataset](http://files.grouplens.org/datasets/movielens/ml-1m.zip).

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from collections import Counter
import tensorflow as tf

import os
import pickle
import re
from tensorflow.python.ops import math_ops

In [3]:
from urllib.request import urlretrieve
from os.path import isfile, isdir
from tqdm import tqdm
import zipfile
import hashlib
import shutil  # needed by the cleanup path in download_extract

def _unzip(save_path, _, database_name, data_path):
    """
    Unzip wrapper with the same interface as _ungzip
    :param save_path: The path of the zip file
    :param _: HACK - Unused; keeps the same interface as _ungzip
    :param database_name: Name of database
    :param data_path: Path to extract to
    """
    print('Extracting {}...'.format(database_name))
    with zipfile.ZipFile(save_path) as zf:
        zf.extractall(data_path)

def download_extract(database_name, data_path):
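    """
    Download and extract database
    :param database_name: Database name
    :param data_path: Directory to download into and extract under
    """
    DATASET_ML1M = 'ml-1m'

    if database_name == DATASET_ML1M:
        url = 'http://files.grouplens.org/datasets/movielens/ml-1m.zip'
        hash_code = 'c4d9eecfca2ab87c1945afe126590906'
        extract_path = os.path.join(data_path, 'ml-1m')
        save_path = os.path.join(data_path, 'ml-1m.zip')
        extract_fn = _unzip
    else:
        # Otherwise the variables used below would be undefined
        raise ValueError('Unknown database name: {}'.format(database_name))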

    if os.path.exists(extract_path):
        print('Found {} Data'.format(database_name))
        return

    if not os.path.exists(data_path):
        os.makedirs(data_path)

    if not os.path.exists(save_path):
        with DLProgress(unit='B', unit_scale=True, miniters=1, desc='Downloading {}'.format(database_name)) as pbar:
            urlretrieve(
                url,
                save_path,
                pbar.hook)

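    # Verify the download against the expected MD5 and close the file afterwards
    with open(save_path, 'rb') as f:
        file_hash = hashlib.md5(f.read()).hexdigest()
    assert file_hash == hash_code, \
        '{} file is corrupted.  Remove the file and try again.'.format(save_path)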

    os.makedirs(extract_path)
    try:
        extract_fn(save_path, extract_path, database_name, data_path)
    except Exception as err:
        shutil.rmtree(extract_path)  # Remove extraction folder if there is an error
        raise err

    print('Done.')
    # Remove compressed data
    # os.remove(save_path)

class DLProgress(tqdm):
    """
    Handle Progress Bar while Downloading
    """
    last_block = 0

    def hook(self, block_num=1, block_size=1, total_size=None):
        """
        A hook function that will be called once on establishment of the network connection and
        once after each block read thereafter.
        :param block_num: A count of blocks transferred so far
        :param block_size: Block size in bytes
        :param total_size: The total size of the file. This may be -1 on older FTP servers which do not return
                            a file size in response to a retrieval request.
        """
        self.total = total_size
        self.update((block_num - self.last_block) * block_size)
        self.last_block = block_num

In [4]:
data_dir = '../data/'
download_extract('ml-1m', data_dir)

Extracting ml-1m...
Done.
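The download cell verifies the archive by reading the whole file into memory before hashing it. That is fine for ml-1m, but for larger archives a chunked variant may be preferable. The helper below is a sketch; `md5_of_file` and its `chunk_size` parameter are names introduced here, not part of the original notebook.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read until it returns b''
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Its return value could be compared with `hash_code` in `download_extract` in place of the one-shot read.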


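## A first look at the data

This project uses the MovieLens 1M dataset, which contains about 1 million ratings from roughly 6,000 users on nearly 4,000 movies.

The dataset is split into three files: user data in users.dat, movie data in movies.dat, and rating data in ratings.dat.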

### User data
The fields are user ID, gender, age, occupation ID, and zip code.

Format in the file: UserID::Gender::Age::Occupation::Zip-code

- Gender is denoted by a "M" for male and "F" for female
- Age is chosen from the following ranges:

	*  1:  "Under 18"
	* 18:  "18-24"
	* 25:  "25-34"
	* 35:  "35-44"
	* 45:  "45-49"
	* 50:  "50-55"
	* 56:  "56+"

- Occupation is chosen from the following choices:

	*  0:  "other" or not specified
	*  1:  "academic/educator"
	*  2:  "artist"
	*  3:  "clerical/admin"
	*  4:  "college/grad student"
	*  5:  "customer service"
	*  6:  "doctor/health care"
	*  7:  "executive/managerial"
	*  8:  "farmer"
	*  9:  "homemaker"
	* 10:  "K-12 student"
	* 11:  "lawyer"
	* 12:  "programmer"
	* 13:  "retired"
	* 14:  "sales/marketing"
	* 15:  "scientist"
	* 16:  "self-employed"
	* 17:  "technician/engineer"
	* 18:  "tradesman/craftsman"
	* 19:  "unemployed"
	* 20:  "writer"
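As a small illustration of the record format and the age coding above, the sketch below parses one `::`-delimited line and attaches the readable age range. The names `AGE_LABELS` and `parse_user_line` are introduced here for illustration; the notebook itself loads the file with pandas instead.

```python
# Age codes and their ranges, copied from the list above
AGE_LABELS = {
    1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
    45: "45-49", 50: "50-55", 56: "56+",
}

def parse_user_line(line):
    """Split one users.dat record into typed fields."""
    user_id, gender, age, occupation, zip_code = line.strip().split('::')
    return {
        'UserID': int(user_id),
        'Gender': gender,
        'Age': int(age),
        'AgeLabel': AGE_LABELS[int(age)],
        'OccupationID': int(occupation),
        'Zip-code': zip_code,  # kept as text so leading zeros survive
    }

print(parse_user_line('1::F::1::10::48067'))
```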



In [6]:
users_fname = 'users.dat'
users_title = ['UserID', 'Gender', 'Age', 'OccupationID', 'Zip-code']
users = pd.read_table(data_dir+users_fname, sep='::', header=None, names=users_title, engine='python')
users.head()

| | UserID | Gender | Age | OccupationID | Zip-code |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |
| 2 | 3 | M | 25 | 15 | 55117 |
| 3 | 4 | M | 45 | 7 | 2460 |
| 4 | 5 | M | 25 | 20 | 55455 |


UserID, Gender, Age, and OccupationID are all categorical fields. The Zip-code field is not used.

In [13]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6040 entries, 0 to 6039
Data columns (total 5 columns):
UserID          6040 non-null int64
Gender          6040 non-null object
Age             6040 non-null int64
OccupationID    6040 non-null int64
Zip-code        6040 non-null object
dtypes: int64(3), object(2)
memory usage: 236.0+ KB


In [17]:
for c in users.columns:
    print('the value counts of '+c, end=': ')
    print(users[c].value_counts().shape[0])

the value counts of UserID: 6040
the value counts of Gender: 2
the value counts of Age: 7
the value counts of OccupationID: 21
the value counts of Zip-code: 3439


In [31]:
users['Gender'].value_counts()

M    4331
F    1709
Name: Gender, dtype: int64

The male-to-female ratio is about 2.5:1.
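The ratio follows directly from the two counts in the output above. A quick check, rebuilding a stand-in `Gender` column from those counts (the series constructed here is an assumption for illustration, not the real `users` frame):

```python
import pandas as pd

# Stand-in column with the same counts as the output above
gender = pd.Series(['M'] * 4331 + ['F'] * 1709, name='Gender')
counts = gender.value_counts()

ratio = counts['M'] / counts['F']          # males per female
male_share = counts['M'] / counts.sum()    # fraction of all users
print('M:F = {:.2f}:1, male share = {:.1%}'.format(ratio, male_share))
```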

In [33]:
users['Age'].value_counts()

25    2096
35    1193
18    1103
45     550
50     496
56     380
1      222
Name: Age, dtype: int64

Most users fall in the 25-34 age group, perhaps because people of that age have both the time and the means to watch movies.

In [36]:
users['OccupationID'].value_counts()[:5]

4     759
0     711
7     679
1     528
17    502
Name: OccupationID, dtype: int64

4: "college/grad student" is the largest group   
0: "other" or not specified is the second largest   
7: "executive/managerial" is the third largest   

In [39]:
users['Zip-code'].value_counts()[:10]

48104    19
22903    18
94110    17
55104    17
55105    16
55455    16
10025    16
55408    15
02138    15
94114    15
Name: Zip-code, dtype: int64

In [76]:
OccupationID_key = {
    0: "other or not specified",
    1: "academic/educator",
    2: "artist",
    3: "clerical/admin",
    4: "college/grad student",
    5: "customer service",
    6: "doctor/health care",
    7: "executive/managerial",
    8: "farmer",
    9: "homemaker",
    10: "K-12 student",
    11: "lawyer",
    12: "programmer",
    13: "retired",
    14: "sales/marketing",
    15: "scientist",
    16: "self-employed",
    17: "technician/engineer",
    18: "tradesman/craftsman",
    19: "unemployed",
    20: "writer",
    }
# Group on the column name (not a one-element list) so each key is a plain age code
for age, occupations in users.groupby('Age')['OccupationID']:
    print('='*80)
    print('Age={0}'.format(age))
    # value_counts() sorts by frequency, so the first two rows are the top two occupations
    top_2 = occupations.value_counts().head(2)
    for rank, (occ_id, freq) in enumerate(top_2.items(), start=1):
        print('the top {0} OccupationID is:{1}({2}),frequency is:{3}'.format(
            rank, occ_id, OccupationID_key[occ_id], freq))
    

Age=1
the top 1 OccupationID is:10(K-12 student),frequency is:163
the top 2 OccupationID is:0(other or not specified),frequency is:27
Age=18
the top 1 OccupationID is:4(college/grad student),frequency is:534
the top 2 OccupationID is:0(other or not specified),frequency is:106
Age=25
the top 1 OccupationID is:0(other or not specified),frequency is:298
the top 2 OccupationID is:7(executive/managerial),frequency is:253
Age=35
the top 1 OccupationID is:7(executive/managerial),frequency is:214
the top 2 OccupationID is:0(other or not specified),frequency is:135
Age=45
the top 1 OccupationID is:1(academic/educator),frequency is:80
the top 2 OccupationID is:7(executive/managerial),frequency is:74
Age=50
the top 1 OccupationID is:7(executive/managerial),frequency is:80
the top 2 OccupationID is:1(academic/educator),frequency is:70
Age=56
the top 1 OccupationID is:13(retired),frequency is:102
the top 2 OccupationID is:1(academic/educator),frequency is:55


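Under 18: mainly K-12 students,   
18-24: mainly college/grad students,   
25-34: mainly "other" (perhaps people unwilling to disclose personal details?) and executive/managerial,   
35-44: mainly executive/managerial (presumably with free time) and "other",   
45-49: mainly academics/educators and executive/managerial (presumably with free time and comfortable with computers),   
50-55: mainly executive/managerial and academics/educators (presumably with free time and comfortable with computers),   
56+: mainly retirees and academics/educators (presumably with free time and comfortable with computers).   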

### Movie data
The fields are movie ID, title, and genres.

Format in the file: MovieID::Title::Genres

- Titles are identical to titles provided by the IMDB (including
year of release)
- Genres are pipe-separated and are selected from the following genres:

	* Action
	* Adventure
	* Animation
	* Children's
	* Comedy
	* Crime
	* Documentary
	* Drama
	* Fantasy
	* Film-Noir
	* Horror
	* Musical
	* Mystery
	* Romance
	* Sci-Fi
	* Thriller
	* War
	* Western


In [23]:
movies_fname = 'movies.dat'
movies_title = ['MovieID', 'Title', 'Genres']
movies = pd.read_table(data_dir+movies_fname, sep='::', header=None, names=movies_title, engine = 'python')
movies.head()

| | MovieID | Title | Genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


MovieID is a categorical field, Title is text, and Genres is also categorical (one movie can carry several pipe-separated genres).

In [24]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3883 entries, 0 to 3882
Data columns (total 3 columns):
MovieID    3883 non-null int64
Title      3883 non-null object
Genres     3883 non-null object
dtypes: int64(1), object(2)
memory usage: 91.1+ KB


In [25]:
for c in movies.columns:
    print('the value counts of '+c, end=': ')
    print(movies[c].value_counts().shape[0])

the value counts of MovieID: 3883
the value counts of Title: 3883
the value counts of Genres: 301


In [41]:
movies['Genres'].value_counts()[:10]

Drama             843
Comedy            521
Horror            178
Comedy|Drama      162
Comedy|Romance    142
Drama|Romance     134
Documentary       116
Thriller          101
Action             65
Drama|Thriller     63
Name: Genres, dtype: int64

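These counts are per movie and per exact genre combination, so they describe the catalog rather than viewing behavior directly. With that caveat, some tentative readings:

Drama: 843, Comedy: 521, Comedy|Drama: 162. Drama and comedy dominate, which suggests most people watch movies for entertainment and relaxation.   
Horror: 178, Thriller: 101, Drama|Thriller: 63. A sizable share caters to curiosity and a fascination with the unknown.   
Comedy|Romance: 142, Drama|Romance: 134. Love is a perennial theme, and a good part of this audience may be female.   
Documentary: 116. Some viewers evidently enjoy science and nature.   
Action: 65. Movies tagged purely as Action are surprisingly few.   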
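Because `Genres` is pipe-separated, `value_counts()` on the raw column counts exact combinations such as `Comedy|Drama` rather than individual genres. One way to count each genre on its own is to split and explode the column. The sketch below uses the five rows shown by `movies.head()` as a stand-in frame named `sample`; on the real data the same expression would be applied to `movies['Genres']`.

```python
import pandas as pd

# Stand-in for the movies frame: the five rows shown by movies.head()
sample = pd.DataFrame({
    'MovieID': [1, 2, 3, 4, 5],
    'Title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)',
              'Waiting to Exhale (1995)', 'Father of the Bride Part II (1995)'],
    'Genres': ["Animation|Children's|Comedy", "Adventure|Children's|Fantasy",
               'Comedy|Romance', 'Comedy|Drama', 'Comedy'],
})

# Split each string into a list, then give every genre its own row
genre_counts = sample['Genres'].str.split('|').explode().value_counts()
print(genre_counts)
```

Counted this way, a movie tagged `Action|Adventure` contributes to both genres, so the per-genre totals can look quite different from the combination counts above.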

### Rating data
The fields are user ID, movie ID, rating, and timestamp.

Format in the file: UserID::MovieID::Rating::Timestamp

- UserIDs range between 1 and 6040 
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings
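The timestamps are plain Unix epoch seconds. This project does not use them, but if they are ever needed they can be turned into readable datetimes with `pd.to_datetime`. The sketch below uses the five timestamp values from the first rows of ratings.dat; the variable names are illustrative.

```python
import pandas as pd

# Timestamps of the first five rows of ratings.dat
timestamps = pd.Series([978300760, 978302109, 978301968, 978300275, 978824291])

# unit='s' interprets the integers as seconds since 1970-01-01 UTC
rated_at = pd.to_datetime(timestamps, unit='s', utc=True)
print(rated_at)
```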

In [8]:
ratings_fname = 'ratings.dat'
ratings_title = ['UserID','MovieID', 'Rating', 'timestamps']
ratings = pd.read_table(data_dir+ratings_fname, sep='::', header=None, names=ratings_title, engine = 'python')
ratings.head()

| | UserID | MovieID | Rating | timestamps |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |


The Rating field is the target we want to learn. The timestamp field is not used.

In [43]:
for c in ratings.columns:
    print('the value counts of '+c, end=': ')
    print(ratings[c].value_counts().shape[0])

the value counts of UserID: 6040
the value counts of MovieID: 3706
the value counts of Rating: 5
the value counts of timestamps: 458455


The ratings cover 6040 users (all 6040 have profile data) and 3706 movies (out of the 3883 that have metadata), with ratings from 1 to 5.
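The gap between 3883 movies with metadata and 3706 rated movies means some titles never received a rating. A sketch of how those could be listed, using tiny made-up frames (`movies_sample` and `ratings_sample` are illustrative stand-ins for `movies` and `ratings`):

```python
import pandas as pd

# Made-up stand-ins for the movies and ratings frames
movies_sample = pd.DataFrame({'MovieID': [1, 2, 3, 4]})
ratings_sample = pd.DataFrame({
    'UserID': [1, 1, 2],
    'MovieID': [1, 3, 1],
    'Rating': [5, 4, 3],
})

# Keep the movies whose ID never appears in the ratings
rated_ids = ratings_sample['MovieID']
unrated = movies_sample[~movies_sample['MovieID'].isin(rated_ids)]
print(unrated)

# On the real data, assuming every rated MovieID also appears in movies.dat:
n_unrated = 3883 - 3706
print(n_unrated)
```

Movies like these have no interaction history at all, so they are the item-side counterpart of the cold-start problem described earlier.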