Skip to content

yennanliu/movie_recommendation

Repository files navigation

movie_recommendation

INTRO

Build various recommend systems via user rating, movie data, and other meta data via CF, KNN, and DL models.The main idea of this prove-of-concept project is "making data production" via ML/DL apart from analysis level. The output can be dump csv/DB, APIs or web APPs.

There will be 4 different ways building the different recommend systems. Current plan is : Build CF model via Pyspark, Popularity model via Numpy, KNN model via scikit-learn, and the RNN model via Tensforflow/Keras.

Please check the theory intro, step-by-step notebook, and quick start start below.

FILE STRUCTURE

├── [ 12k]  README.md
├── [ 224]  analysis
├── [ 160]  datasets                           : main training dataset
├── [ 352]  model
│   ├── [5.2k]  movie_recommend_ALS.java       : movie recommend via java spark ALS model 
│   ├── [4.7k]  movie_recommend_KNN.py         : movie recommend via KNN (user similarity)
│   ├── [ 11k]  movie_recommend_NCF.py         : movie recommend via NN+CF (dev)
│   ├── [9.7k]  movie_recommend_Similarity.py  : movie recommend via user similarity
│   ├── [3.0k]  movie_recommend_benchmark.py   : movie recommend benchmark model 
│   ├── [4.4k]  movie_recommend_popularity.py  : movie recommend via movie popularity
│   └── [ 13k]  movie_recommend_spark_CF.py    : movie recommend via CF (pyspark ML)  
├── [ 320]  notebook                           : step by step ML code notebook demo 

THEORY

THEORY
  • Collaborative Filtering (CF)
####  - User-Based Collaborative Filter

# 1)
# consider there is an user movie rating matrix :
# x-axis : user
# y-axis : movie 

	[[ 2  ?  0  0  4]
	 [ ?  ?  8  5  4]
	 [ 1 1  0  ?   3]
	 [ 3 2  2  ?   1]
	 [ 1 2  3  1   1] 
	 [ 1 2  3  ?   ?]]

# 2)
# the purpose of user-based CF is predicting the unknown rating of the movies haven't rated by the given user and give recommendation based on the user silimarity 
# for example if we have 3 user rating as below:
user1 	 [ 3 2  2  ?   1]
user2	 [ 1 2  3  1   1] 
user3	 [ 1 2  3  ?   ?]
# then based on user silimarity, we can say user2 and user3 are "much similar" then others. So we can push user2's movie taste to user3 


# 3)
# Here is the logic in python 

def cosine_similarity(v,w):
	import numpy as np 
    # v,w is a vector, return the cosine between the two vectors,
    # value between -1~1
    return np.dot(v,w)/(math.sqrt(np.dot(v,v) * np.dot(w,w)))


def users_similarity(users_movie_matrix):
    users_movie_matrix = []
    for i,user in enumerate(users_movie_matrix):
        similarity_vector = []
        for j in range(len(users_movie_matrix)):
        	# use cosine similarity
            similarity_val=cosine_similarity(user, users_movie_matrix[j])
            similarity_vector.append(similarity_val)
        users_movie_matrix.append(similarity_vector)
    return users_movie_matrix
    ####  - Item-Based Collaborative Filter
    # dev 
    
  • Popularity

    • Movie popularity based recommender
  • Similarity

    • User similarity based recommender (KNN)
  • DL

    • Collaborative Filtering via Neural network

PROCESS

data EDA -> feature engineering -> benchmark model -> popularity model -> CF model ->  NCF model -> model tuning -> save trained models -> hybrid model Architecture 
(offline + online training)

QUICK START

Quick-Start
# get the repo 
$ git clone https://github.com/yennanliu/movie_recommendation.git
# get needed dataset 
$ cd ~ && cd movie_recommendation/ && brew install Wget && bash download_dataset.sh
### CF model via spark ### 
# install pyspark and needed dataset 
$ cd ~ && bash /Users/$USER/movie_recommendation/install_pyspark.sh  && cd movie_recommendation/ && brew install Wget && bash download_dataset.sh
# declare env variables  
$ export SPARK_HOME=/Users/$USER/spark && export PATH=$SPARK_HOME/bin:$PATH
# run the pyspark model train script 
$ spark-submit  movie_recommend_spark.py 

# output 
For rank 4 the RMSE is 0.9432575570983046
For rank 8 the RMSE is 0.9566157499964845
For rank 12 the RMSE is 0.9521388924465031
The best model was trained with rank 4
************
For testing data the RMSE is 0.9491107183690944
************
[(2.0, 107), (97328.0, 1), (4.0, 13)]
random movid id :  [3613 4927 8845 1508 5692]
-------------------
Please rate following 5 random movies as new user teste interest : 
-------------------
movie_id : 3613
movie_name : Things Change (1988)
 * What is your rating? 3
-> Your rating for Things Change (1988) is : 3.0
movie_id : 4927
movie_name : "Last Wave
 * What is your rating? 1
-> Your rating for "Last Wave is : 1.0
insert movie_id : 8845
movie_id not exist
 * What is your rating? 0
-> Your rating for None is : 0.0
movie_id : 1508
movie_name : Traveller (1997)
 * What is your rating? 3
-> Your rating for Traveller (1997) is : 3.0
insert movie_id : 5692
movie_id not exist
 * What is your rating? 2
-> Your rating for None is : 2.0
New user ratings: [(9997, 3613, 3.0), (9997, 4927, 1.0), (9997, 8845, 0.0), (9997, 1508, 3.0), (9997, 5692, 2.0)]
[(1.0, 31.0, 2.5), (1.0, 1029.0, 3.0), (1.0, 1061.0, 3.0), (1.0, 1129.0, 2.0), (1.0, 1172.0, 4.0), (1.0, 1263.0, 2.0), (1.0, 1287.0, 2.0), (1.0, 1293.0, 2.0), (1.0, 1339.0, 3.5), (1.0, 1343.0, 2.0)]
<pyspark.mllib.recommendation.MatrixFactorizationModel object at 0x10f4b1240>
=======================
[Rating(user=9997, product=267, rating=1.9431035590658032), Rating(user=9997, product=18, rating=2.4471404575434224), Rating(user=9997, product=227, rating=1.8898669807166826), Rating(user=9997, product=639, rating=1.2313836204250688), Rating(user=9997, product=630, rating=2.0651897033288247), Rating(user=9997, product=248, rating=0.9056995408584969), Rating(user=9997, product=183, rating=1.096099378407863), Rating(user=9997, product=62, rating=2.3965661520727375), Rating(user=9997, product=318, rating=2.693287049630902), Rating(user=9997, product=6, rating=2.403548622949053)]
=======================
=======================
TOP recommended movies (with more than 25 reviews):
('Forrest Gump (1994)', 2.740902212733893, 341)
('Braveheart (1995)', 2.7270995452301943, 228)
('"Shawshank Redemption', 2.693287049630902, 311)
("Schindler's List (1993)", 2.6615360145071953, 244)
('Much Ado About Nothing (1993)', 2.646212665422727, 60)
('Welcome to the Dollhouse (1995)', 2.578621020576323, 30)
('Philadelphia (1993)', 2.5727445302939564, 86)
.....
### user similarity model ###
$ python movie_recommend_Similarity.py 

# output 

------------------------------------------------------------------------------------
Training data movies for the user userid: 37:
------------------------------------------------------------------------------------
3481
3564
364
4018
1246
3948
2085
912
4034
4054
2028
3538
1307
3977
3751
1196
940
1193
2858
1907
595
920
2081
1
902
2273
4011
4015
----------------------------------------------------------------------
Recommendation process going on:
----------------------------------------------------------------------
------------
[3481, 3564, 364, 4018, 1246, 3948, 2085, 912, 4034, 4054, 2028, 3538, 1307, 3977, 3751, 1196, 940, 1193, 2858, 1907, 595, 920, 2081, 1, 902, 2273, 4011, 4015]
------------
no. of unique movies in the training set: 8401
Non zero values in cooccurence_matrix :142279
   userId  movieId    rating  view_count
0    37.0   1380.0  0.167462         1.0
1    37.0   2918.0  0.165759         2.0
2    37.0   1270.0  0.165678         3.0
3    37.0    919.0  0.163948         4.0
4    37.0   2762.0  0.163071         5.0
5    37.0   1682.0  0.162160         6.0
6    37.0   4027.0  0.160869         7.0
7    37.0   2797.0  0.159535         8.0
8    37.0   1704.0  0.158849         9.0
9    37.0   4306.0  0.158516        10.0

### Benchmark model ###
$  python movie_recommend_benchmark.py

# output 

recommend list  : 
    movieId  total_view  avg_rating
0       356       341.0    4.054252
1       296       324.0    4.256173
2       318       311.0    4.487138
3       593       304.0    4.138158
4       260       291.0    4.221649
5       480       274.0    3.706204
6      2571       259.0    4.183398
7         1       247.0    3.872470
8       527       244.0    4.303279
9       589       237.0    4.006329
10     1196       234.0    4.232906
11      110       228.0    3.945175
12     1270       226.0    4.015487
13      608       224.0    4.256696
14     2858       220.0    4.236364
15     1198       220.0    4.193182
16      780       218.0    3.483945
17     1210       217.0    4.059908
18      588       215.0    3.674419
19      457       213.0    3.953052
# run via docker
# install wget via apt-get (ubuntu)

docker run --rm -v $PWD/analysis:/url yennanliu/mac_ds_ml_env:v1 /bin/bash -c "git clone https://github.com/yennanliu/movie_recommendation.git ;  ls  ;  pwd ;  apt-get install wget ; bash download_dataset.sh ; cd movie_recommendation && python movie_recommend_benchmark.py"

# output 

Cloning into 'movie_recommendation'...
bin
boot
dev
ds
etc
home
lib
lib64
media
mnt
movie_recommendation
notebooks
opt
proc
root
run
sbin
srv
sys
tmp
url
usr
var
/
Reading package lists...
Building dependency tree...
Reading state information...
wget is already the newest version (1.18-5+deb9u2).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
bash: download_dataset.sh: No such file or directory
/opt/conda/lib/python3.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
   userId  std_rating
0       1    0.887041
1       2    0.901753
2       3    0.741752
....

[100004 rows x 5 columns]
        userId  movieId  rating   timestamp  total_view  avg_rating
0            1       31     2.5  1260759144          42    3.178571
1            7       31     3.0   851868750          42    3.178571
2           31       31     4.0  1273541953          42    3.178571
3           32       31     4.0   834828440          42    3.178571
4           36       31     3.0   847057202          42    3.178571
5           39       31     3.0   832525157          42    3.178571
6           73       31     3.5  1255591860          42    3.178571
7           88       31     3.0  1239755559          42    3.178571
8           96       31     2.5  1223256331          42    3.178571
...

[100004 rows x 6 columns]
recommend list  : 
    movieId  total_view  avg_rating
0       356       341.0    4.054252
1       296       324.0    4.256173
2       318       311.0    4.487138
3       593       304.0    4.138158
4       260       291.0    4.221649
5       480       274.0    3.706204
6      2571       259.0    4.183398
7         1       247.0    3.872470
....

REFERENCE