## To do for 07282017:
1. User_filter:
   Filter users based on certain features, e.g.,
   consistency with a theme, a certain time of viewing,
   or a certain time interval before each item viewing.
2. Recommendation core:
   It will basically be collaborative filtering (CF),
   but instead of using the real items, I'd like to use
   features extracted from a CNN and dimension-reduced
   by t-SNE to maybe 20 dimensions.
3. Processor:
   Inputs are
   a. log of user history
   b. item features
   Output is
   a. Top-N ranking of recommended items for each user
4. Evaluator:
   Evaluate whether the item the user buys is within the top
   N recommended items.
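
A minimal sketch of steps 3 and 4, assuming each user is represented by a single profile vector in the same feature space as the items and that ranking is done by cosine similarity. Both are assumptions for illustration; the plan above does not fix the similarity measure. The names `top_n_items` and `hit_rate_at_n` and the toy data are hypothetical.

```python
import numpy as np

def top_n_items(user_vec, item_matrix, n=5):
    """Return indices of the n items most similar to user_vec (cosine similarity)."""
    # Normalise rows; the small epsilon guards against zero vectors.
    item_norm = item_matrix / (np.linalg.norm(item_matrix, axis=1, keepdims=True) + 1e-12)
    user_norm = user_vec / (np.linalg.norm(user_vec) + 1e-12)
    scores = item_norm.dot(user_norm)
    # argsort is ascending, so reverse it and keep the first n indices.
    return np.argsort(scores)[::-1][:n]

def hit_rate_at_n(user_vecs, item_matrix, bought_idx, n=5):
    """Fraction of users whose bought item appears in their top-n list."""
    hits = [bought in top_n_items(u, item_matrix, n)
            for u, bought in zip(user_vecs, bought_idx)]
    return float(np.mean(hits))

# Toy data (hypothetical): 50 items with 20 features each; three users whose
# profile is a slightly perturbed copy of the item they end up buying.
rng = np.random.RandomState(0)
items = rng.rand(50, 20)
bought = np.array([3, 17, 42])
users = items[bought] + 0.01 * rng.rand(3, 20)

rate = hit_rate_at_n(users, items, bought, n=5)
```

With real data, `users` would come from the viewing log and `items` from the reduced CNN features; the hit rate is then the quantity the evaluator in step 4 reports.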

## After trial run:
* t-SNE may not be feasible for this number of samples and the dimensionality we want. Need to try a small portion and time it, or try PCA instead.

In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

In [2]:
os.chdir('/Users/Walkon302/Desktop/deep-learning-models-master/view2buy')

In [3]:
# Read the preprocessed file, containing the user profile and item features from view2buy folder
df = pd.read_pickle('user_fea_for_eval.pkl')

In [4]:
# Drop the first column, which is the original data format.
df.drop('0', axis = 1, inplace = True)

In [5]:
# Check the data
df.head()

Unnamed: 0,user_id,buy_spu,buy_sn,buy_ct3,view_spu,view_sn,view_ct3,time_interval,view_cnt,view_secondes,view_features,buy_features
0,2469583035,4199682998971011301,10013436,334,220189917005230097,10013861,334,37496,7,45,"[0.621, 0.542, 0.0, 0.369, 0.062, 0.039, 0.103...","[0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757..."
1,2469583035,4199682998971011301,10013436,334,234826617504419925,10003862,334,170826,2,23,"[0.15, 0.98, 0.104, 1.295, 0.111, 0.0, 0.0, 0....","[0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757..."
2,2469583035,4199682998971011301,10013436,334,235671027621670949,10003862,334,426968,2,11,"[0.106, 0.027, 0.0, 1.398, 0.096, 0.021, 0.072...","[0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757..."
3,1488725183,4199682998971011301,10013436,334,235671027621670949,10003862,334,180564,1,22,"[0.106, 0.027, 0.0, 1.398, 0.096, 0.021, 0.072...","[0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757..."
4,2469583035,4199682998971011301,10013436,334,245522675097001998,10026364,334,83993,2,7,"[0.019, 1.415, 0.007, 0.088, 0.055, 0.015, 0.0...","[0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757..."


In [6]:
# Calculate the average view seconds over all viewed items for each (user, bought item) pair
avg_view_sec = pd.DataFrame(df.groupby(['user_id', 'buy_spu'])['view_secondes'].mean())

In [7]:
# Reset the index and rename the column
avg_view_sec.reset_index(inplace=True)
avg_view_sec.rename(columns = {'view_secondes':'avg_view_sec'}, inplace=True)

In [8]:
# Check the data
avg_view_sec.head()

Unnamed: 0,user_id,buy_spu,avg_view_sec
0,88224,91837317697974354,13.0
1,149036,9928117558321210,29.75
2,187458,296751091217272878,24.537255
3,187458,308010090285679082,28.190299
4,187458,7364869139418243104,28.230483


In [9]:
# Merge avg item view into data
df = pd.merge(df, avg_view_sec, on=['user_id', 'buy_spu'])

In [10]:
# Calculate the weights for view item vec
df['weight_of_view'] = df['view_secondes']/df['avg_view_sec']
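
The groupby, merge, and divide steps above can also be sketched in one pass with `groupby(...).transform('mean')`, which returns the group mean aligned to the original rows. The toy frame below is hypothetical and only mirrors the column names used in this notebook. The `weighted_profile` helper shows one possible use of the weights (a weighted average of the view feature vectors per user and bought item); that use is an assumption, not something this notebook has settled on.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as the real frame.
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2],
    'buy_spu': [10, 10, 10, 20],
    'view_secondes': [45, 23, 22, 8],
    'view_features': [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]],
})

# Group mean broadcast back to every row, so no merge is needed.
group_mean = toy.groupby(['user_id', 'buy_spu'])['view_secondes'].transform('mean')
toy['weight_of_view'] = toy['view_secondes'] / group_mean

def weighted_profile(group):
    """Weighted average of the view feature vectors in one group (assumed usage)."""
    feats = np.array(group['view_features'].tolist())
    return np.average(feats, axis=0, weights=group['weight_of_view'].values)

# One profile vector per (user_id, buy_spu) pair.
profiles = {key: weighted_profile(g)
            for key, g in toy.groupby(['user_id', 'buy_spu'])}
```

By construction the weights average to 1 within each group, so an item viewed longer than the group average gets a weight above 1.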

In [11]:
df.head()

Unnamed: 0,user_id,buy_spu,buy_sn,buy_ct3,view_spu,view_sn,view_ct3,time_interval,view_cnt,view_secondes,view_features,buy_features,avg_view_sec,weight_of_view
0,2469583035,4199682998971011301,10013436,334,220189917005230097,10013861,334,37496,7,45,"[0.621, 0.542, 0.0, 0.369, 0.062, 0.039, 0.103...","[0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757...",30.520833,1.474403
1,2469583035,4199682998971011301,10013436,334,234826617504419925,10003862,334,170826,2,23,"[0.15, 0.98, 0.104, 1.295, 0.111, 0.0, 0.0, 0....","[0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757...",30.520833,0.753584
2,2469583035,4199682998971011301,10013436,334,235671027621670949,10003862,334,426968,2,11,"[0.106, 0.027, 0.0, 1.398, 0.096, 0.021, 0.072...","[0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757...",30.520833,0.36041
3,2469583035,4199682998971011301,10013436,334,245522675097001998,10026364,334,83993,2,7,"[0.019, 1.415, 0.007, 0.088, 0.055, 0.015, 0.0...","[0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757...",30.520833,0.229352
4,2469583035,4199682998971011301,10013436,334,296751124749754369,10005367,334,427866,2,12,"[0.066, 0.328, 0.043, 0.0, 0.062, 0.016, 0.303...","[0.091, 0.805, 0.0, 0.591, 0.981, 0.026, 0.757...",30.520833,0.393174


In [12]:
# Generate TSNE model
model = TSNE(n_components=10, random_state=0)

In [13]:
# Extract the view feature vectors that the model will be fit on
view_item_vec = df['view_features']

In [14]:
len(view_item_vec)

3176676

## Try TSNE and time it
* It turns out that t-SNE is too time-consuming: 250, 500, and 1000 samples take about 23 s, 1.5 min, and 4.5 min of wall time, respectively.

In [None]:
# Generate PCA model
model = PCA(n_components=200, random_state=0)

In [121]:
%%time
# Time the tSNE with 250 samples (the cell magic must be the first line of the cell)
a = pd.DataFrame()
for i, j in enumerate(view_item_vec.iloc[0:250]):
    a = pd.concat([a, pd.DataFrame(j).transpose()], axis = 0)
vt = model.fit_transform(a)

CPU times: user 22.3 s, sys: 501 ms, total: 22.8 s
Wall time: 22.8 s


In [114]:
%%time
# Time the tSNE with 500 samples (the cell magic must be the first line of the cell)
a = pd.DataFrame()
for i, j in enumerate(view_item_vec.iloc[0:500]):
    a = pd.concat([a, pd.DataFrame(j).transpose()], axis = 0)
vt = model.fit_transform(a)

CPU times: user 1min 23s, sys: 2.57 s, total: 1min 25s
Wall time: 1min 31s


In [113]:
%%time
# Time the tSNE with 1000 samples (the cell magic must be the first line of the cell)
a = pd.DataFrame()
for i, j in enumerate(view_item_vec.iloc[0:1000]):
    a = pd.concat([a, pd.DataFrame(j).transpose()], axis = 0)
vt = model.fit_transform(a)

CPU times: user 4min 25s, sys: 6.05 s, total: 4min 31s
Wall time: 4min 33s
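
A rough way to read the three timings above is to fit a power law to them on a log-log scale and extrapolate to the full data set. This is only a sketch: it assumes the wall times are dominated by the model fit (they also include the DataFrame construction loop), and three points are too few for a reliable fit.

```python
import numpy as np

# Sample sizes and wall times taken from the three timing cells above.
n_samples = np.array([250, 500, 1000])
wall_sec = np.array([22.8, 91.0, 273.0])  # 22.8 s, 1 min 31 s, 4 min 33 s

# Fit log(time) = slope * log(n) + intercept, i.e. time ~ n ** slope.
slope, intercept = np.polyfit(np.log(n_samples), np.log(wall_sec), 1)

def projected_seconds(n):
    """Extrapolated wall time in seconds for n samples under the fitted power law."""
    return float(np.exp(intercept) * n ** slope)

# Projection for the full 3,176,676 rows, expressed in days.
full_days = projected_seconds(3176676) / 86400.0
```

The fitted exponent is close to 2, and the projection for the full data set comes out far beyond anything practical on a single machine, which is the reason for switching to PCA.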


## Try PCA instead
* PCA looks reasonable. We can process 500k samples in around 90 seconds. I will proceed with this setting for a first try.

In [22]:
# Generate PCA model
model = PCA(n_components=200, random_state=0)

In [18]:
%%time
a = []
for i in view_item_vec.iloc[0:500000]:
    a.append(i)
b = np.array(a)

CPU times: user 34.1 s, sys: 12.5 s, total: 46.6 s
Wall time: 1min 12s
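
The element-by-element loop above can be replaced by a single conversion, sketched here on a hypothetical three-row Series. It assumes every feature vector has the same length. Storing the result as `float32` is an extra assumption: it halves the memory footprint compared with the default `float64`, at the cost of some precision.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for view_item_vec: a Series whose elements are lists.
vecs = pd.Series([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]])

# Convert the Series of lists to a 2-D array in one call.
stacked = np.array(vecs.tolist(), dtype=np.float32)
```

The result has one row per sample and one column per feature, which is the layout `fit_transform` expects.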


In [21]:
%%time
pca_vec = model.fit_transform(b)

KeyboardInterrupt: 

In [19]:
# 200 PCA dimensions explain 85% of the variance. Beyond that, e.g., 300 D, my computer will run out of memory (8 GB)
sum(model.explained_variance_ratio_)

0.8508820736566447
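
One possible way around the memory limit is scikit-learn's `IncrementalPCA`, which fits the model batch by batch with `partial_fit` instead of holding all samples in memory at once. The sketch below uses random data in place of the real features; the batch size and component count are placeholders, not tuned values, and whether this reaches more than 200 dimensions on an 8 GB machine is untested here.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Random stand-in for the stacked feature matrix (2000 samples, 50 features).
rng = np.random.RandomState(0)
X = rng.rand(2000, 50).astype(np.float32)

ipca = IncrementalPCA(n_components=10)
batch_size = 500  # each batch must contain at least n_components samples

# Feed the data in chunks; only one batch is processed at a time.
for start in range(0, X.shape[0], batch_size):
    ipca.partial_fit(X[start:start + batch_size])

# Transform can also be applied batch by batch.
reduced = ipca.transform(X[:batch_size])
explained = ipca.explained_variance_ratio_.sum()
```

With the real data, each batch could be built from a slice of `view_item_vec`, so the full 3,176,676 x feature-length array never has to exist in memory.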

In [19]:
view_item_vec.shape

(3176676,)