# SHORYA SETHIA [ 22B2725 ]


### Data
Using data from : https://www.kaggle.com/netflix-inc/netflix-prize-data/data
It contains:
1. combined_data_1.txt
2. combined_data_2.txt
3. combined_data_3.txt
4. combined_data_4.txt
5. movie_titles.csv

### Data Overview
Each combined_data_{i}.txt file is a sequence of blocks. A block starts with a line containing a movie id followed by a colon; every following line, until the next movie id line, is one customer's rating of that movie and its date in the format: CustomerID,Rating,Date

- MovieIDs range from 1 to 17770 sequentially.
- CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users.
- Ratings are on a five star (integral) scale from 1 to 5.
- Dates have the format YYYY-MM-DD.
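
A minimal sketch of how this layout can be parsed, assuming the block structure described above. The function name `parse_ratings` and the second block of sample lines are made up for illustration; they are not part of the dataset or of the notebook's own code.

```python
import io

def parse_ratings(lines):
    """Yield (movie_id, customer_id, rating, date) tuples from Netflix-style lines."""
    movie_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(':'):
            # A "movie id:" line starts a new block of ratings.
            movie_id = int(line[:-1])
        else:
            customer_id, rating, date = line.split(',')
            yield movie_id, int(customer_id), int(rating), date

# Small illustrative sample in the same layout (the block for movie 2 is hypothetical).
sample = io.StringIO(
    "1:\n"
    "1488844,3,2005-09-06\n"
    "822109,5,2005-05-13\n"
    "2:\n"
    "42,4,2005-01-01\n"
)
rows = list(parse_ratings(sample))
print(rows)
```

Carrying the current movie id along while reading is the same idea the conversion cell below uses when it writes data.csv.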

In [1]:
# used to measure how long this entire notebook takes to run
from datetime import datetime

In [2]:
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('nbagg')

import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})

import seaborn as sns
sns.set_style('whitegrid')
import os
from scipy import sparse
from scipy.sparse import csr_matrix

from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
import random

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit

# Exploratory Data Analysis
### Converting entire data to following format:
u_i,m_j,r_ij

In [3]:
start = datetime.now()
if not os.path.isfile('data.csv'):
    # Create the file 'data.csv' before reading it.
    # Read all four Netflix Prize data files and store their ratings in one big file ('data.csv').
    # Each rating is re-read from the source files and appended to the global file 'data.csv'.
    data = open('data.csv', mode='w')
    
    row = list()
    files=['data/combined_data_1.txt','data/combined_data_2.txt', 
           'data/combined_data_3.txt', 'data/combined_data_4.txt']
    for file in files:
        print("Reading ratings from {}...".format(file))
        with open(file) as f:
            for line in f: 
                del row[:] # you don't have to do this.
                line = line.strip()
                if line.endswith(':'):
                    # All below are ratings for this movie, until another movie appears.
                    movie_id = line.replace(':', '')
                else:
                    row = [x for x in line.split(',')]
                    row.insert(0, movie_id)
                    data.write(','.join(row))
                    data.write('\n')
        print("Done.\n")
    data.close()
print('Time taken :', datetime.now() - start)

Time taken : 0:00:00


In [4]:
loaded_data=pd.read_csv('data.csv')
loaded_data.head()

Unnamed: 0,1,1488844,3,2005-09-06
0,1,822109,5,2005-05-13
1,1,885013,4,2005-10-19
2,1,30878,4,2005-12-26
3,1,823519,3,2004-05-03
4,1,893988,3,2005-11-17


In [5]:
start = datetime.now()

if not os.path.isfile('sorted_data.csv'):
  print("creating the dataframe from data.csv file..")
  df = pd.read_csv('data.csv', sep=',', names=['movie', 'user','rating','date'])
  df.date = pd.to_datetime(df.date)
  print('Done.\n')

  # I am arranging the ratings according to time
  print('Sorting the dataframe by date..')
  df.sort_values(by='date', inplace=True)
  print('Done..')

  output_filename = 'sorted_data.csv'
  df.to_csv(output_filename, index=False)

else:
  print("File already exists. Reading it...")
  df = pd.read_csv('sorted_data.csv')
  
print('Time taken :', datetime.now() - start)

File already exists. Reading it...
Time taken : 0:00:28.001155


In [5]:
df.shape

(100480507, 4)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100480507 entries, 0 to 100480506
Data columns (total 4 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   movie   int64 
 1   user    int64 
 2   rating  int64 
 3   date    object
dtypes: int64(3), object(1)
memory usage: 3.0+ GB


In [7]:
df.head()

Unnamed: 0,movie,user,rating,date
0,10341,510180,4,1999-11-11
1,1798,510180,5,1999-11-11
2,10774,510180,3,1999-11-11
3,8651,510180,2,1999-11-11
4,14660,510180,2,1999-11-11


In [12]:
df.describe()['rating']

count    1.004805e+08
mean     3.604290e+00
min      1.000000e+00
25%      3.000000e+00
50%      4.000000e+00
75%      4.000000e+00
max      5.000000e+00
std      1.085219e+00
Name: rating, dtype: float64

### Checking for NaN values

In [9]:
print("Number of columns containing NaN values in our dataframe : ", sum(df.isnull().any()))

Number of columns containing NaN values in our dataframe :  0


### Checking for duplicates on (movie, user, rating)

In [8]:
dup_bool = df.duplicated(['movie','user','rating'])
dups = sum(dup_bool) 
print("There are {} duplicate rating entries in the data..".format(dups))

There are 0 duplicate rating entries in the data..


### Number of Users, movies and ratings in sorted_data.csv

In [None]:
print("Total No of Users   :", len(np.unique(df.user)))
print("Total No of movies  :", len(np.unique(df.movie)))
print("Total no of ratings :",df.shape[0]) #total rows == no. of ratings

Total No of Users   : 480189
Total No of movies  : 17770
Total no of ratings : 100480507


### Splitting data into Train and Test (0.80 : 0.20 respectively)

In [10]:
if not os.path.isfile('train.csv'):
    # create the dataframe and store it as csv for further purposes
    df.iloc[:int(df.shape[0]*0.80)].to_csv("train.csv", index=False)
    print("train.csv formed.")
else :
    print("train.csv exists")

if not os.path.isfile('test.csv'):
    # create the dataframe and store it as csv for further purposes
    df.iloc[int(df.shape[0]*0.80):].to_csv("test.csv", index=False)
    print("test.csv formed.")
else :
    print("test.csv exists")

start = datetime.now()
train_df = pd.read_csv("train.csv", parse_dates=['date'])
test_df = pd.read_csv("test.csv")
print("read both csv")
print('Time taken :', datetime.now() - start)


train.csv exists
test.csv exists
read both csv
Time taken : 0:00:55.331386
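
Because df was sorted by date before slicing, the 80/20 split above is a split in time: the oldest 80% of ratings form the training set and the newest 20% the test set. A minimal sketch on a hypothetical toy dataframe (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy ratings, deliberately out of date order.
toy = pd.DataFrame({
    'movie': [1, 2, 3, 4, 5],
    'user': [10, 11, 10, 12, 11],
    'rating': [4, 3, 5, 2, 4],
    'date': pd.to_datetime(['2005-03-01', '2004-01-15', '2005-12-26',
                            '2004-06-30', '2005-07-04']),
})

# Sort by date first, then cut at 80% of the rows.
toy = toy.sort_values(by='date').reset_index(drop=True)
cut = int(toy.shape[0] * 0.80)
toy_train, toy_test = toy.iloc[:cut], toy.iloc[cut:]

print(len(toy_train), len(toy_test))
# Every training rating is at least as old as every test rating.
print(toy_train['date'].max() <= toy_test['date'].min())
```

A split like this mimics predicting future ratings from past ones, which is why some users and movies show up only in the test set.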


In [11]:
train_df.head()

Unnamed: 0,movie,user,rating,date
0,10341,510180,4,1999-11-11
1,1798,510180,5,1999-11-11
2,10774,510180,3,1999-11-11
3,8651,510180,2,1999-11-11
4,14660,510180,2,1999-11-11


### Number of Users, Movies and ratings in train.csv and test.csv

In [None]:
print("Numbers for train.csv")
print("Total No of Users   :", len(np.unique(train_df.user)))
print("Total No of movies  :", len(np.unique(train_df.movie)))
print("Total no of ratings :",train_df.shape[0])

print("\nNumbers for test.csv")
print("Total No of Users   :", len(np.unique(test_df.user)))
print("Total No of movies  :", len(np.unique(test_df.movie)))
print("Total no of ratings :",test_df.shape[0])

Numbers for train.csv


Total No of Users   : 405041
Total No of movies  : 17424
Total no of ratings : 80384405

Numbers for test.csv
Total No of Users   : 349312
Total No of movies  : 17757
Total no of ratings : 20096102


### EDA on train_df

In [7]:
# method to make y-axis more readable
def human(num, units = 'M'):
    units = units.lower()
    num = float(num)
    if units == 'k':
        return str(num/10**3) + " K"
    elif units == 'm':
        return str(num/10**6) + " M"
    elif units == 'b':
        return str(num/10**9) +  " B"

In [None]:
# Plotting the rating distribution was taking a very long time, so it is commented out

# fig, ax = plt.subplots()
# plt.title('Distribution of ratings over Training dataset', fontsize=15)
# sns.countplot(train_df.rating)
# ax.set_yticklabels([human(item, 'M') for item in ax.get_yticks()])
# ax.set_ylabel('No. of Ratings(Millions)')

# plt.savefig('img/rating-distribution-train_df')

In [9]:
start = datetime.now()
rating_counts = train_df['rating'].value_counts()
print("Distribution of ratings over Training dataset:")
print(rating_counts)
print('Time taken:', datetime.now() - start)

Distribution of ratings over Training dataset:
rating
4    27161596
3    23339084
5    17772845
2     8369795
1     3741085
Name: count, dtype: int64
Time taken: 0:00:00.308258


In [12]:
# # Add new column (week day) to the data
# train_df['day_of_week'] = train_df.date.dt.weekday_name
# train_df.head()

# Add new column (week day) to the data
train_df['day_of_week'] = train_df['date'].dt.day_name()
train_df.head()

Unnamed: 0,movie,user,rating,date,day_of_week
0,10341,510180,4,1999-11-11,Thursday
1,1798,510180,5,1999-11-11,Thursday
2,10774,510180,3,1999-11-11,Thursday
3,8651,510180,2,1999-11-11,Thursday
4,14660,510180,2,1999-11-11,Thursday


In [13]:
avg_week_df = train_df.groupby(by=['day_of_week'])['rating'].mean()
print("Average ratings")
print(avg_week_df)

Average ratings
day_of_week
Friday       3.585274
Monday       3.577250
Saturday     3.591791
Sunday       3.594144
Thursday     3.582463
Tuesday      3.574438
Wednesday    3.583751
Name: rating, dtype: float64


In [None]:
fig, ax = plt.subplots()
sns.countplot(x='day_of_week', data=train_df, ax=ax)
plt.title('No of ratings on each day.')
plt.ylabel('Total no of ratings')
plt.xlabel('')
ax.set_yticklabels([human(item, 'M') for item in ax.get_yticks()])
plt.savefig('img/no.-of-rating-on-each-day_of_week-train_df.png')



In [None]:
ax = train_df.resample('m', on='date')['rating'].count().plot()
ax.set_title('No of ratings per month (Training data)')
plt.xlabel('Month')
plt.ylabel('No of ratings(per month)')
ax.set_yticklabels([human(item, 'M') for item in ax.get_yticks()])
plt.savefig('img/no.-of-ratings-per-month-train_df.png')

### Analysis on ratings given by a user

In [14]:
no_of_rated_movies_per_user = train_df.groupby(by='user')['rating'].count().sort_values(ascending=False)
no_of_rated_movies_per_user.head()

user
305344     17112
2439493    15896
387418     15402
1639792     9767
1461435     9447
Name: rating, dtype: int64

In [None]:
fig = plt.figure(figsize=plt.figaspect(.5))

ax1 = plt.subplot(121)
sns.kdeplot(no_of_rated_movies_per_user, shade=True, ax=ax1)
plt.xlabel('No of ratings by user')
plt.title("PDF")

ax2 = plt.subplot(122)
sns.kdeplot(no_of_rated_movies_per_user, shade=True, cumulative=True,ax=ax2)
plt.xlabel('No of ratings by user')
plt.title('CDF')

plt.savefig('img/pdf-cdf-rating-by-user-train_df.png')


`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.

  sns.kdeplot(no_of_rated_movies_per_user, shade=True, ax=ax1)

`shade` is now deprecated in favor of `fill`; setting `fill=True`.
This will become an error in seaborn v0.14.0; please update your code.

  sns.kdeplot(no_of_rated_movies_per_user, shade=True, cumulative=True,ax=ax2)


The warnings above only say that the "shade" argument is deprecated and that "fill" should be used in its place; the plots themselves are not affected.

In [15]:
no_of_rated_movies_per_user.describe()

count    405041.000000
mean        198.459921
std         290.793238
min           1.000000
25%          34.000000
50%          89.000000
75%         245.000000
max       17112.000000
Name: rating, dtype: float64

In [16]:
quantiles = no_of_rated_movies_per_user.quantile(np.arange(0,1.01,0.01), interpolation='higher')
quantiles

0.00        1
0.01        1
0.02        2
0.03        4
0.04        5
        ...  
0.96      829
0.97      934
0.98     1079
0.99     1341
1.00    17112
Name: rating, Length: 101, dtype: int64

In [None]:
plt.title("Quantiles and their Values")
quantiles.plot()
# quantiles with 0.05 difference
plt.scatter(x=quantiles.index[::5], y=quantiles.values[::5], c='orange', label="quantiles with 0.05 intervals")
# quantiles with 0.25 difference
plt.scatter(x=quantiles.index[::25], y=quantiles.values[::25], c='m', label = "quantiles with 0.25 intervals")
plt.ylabel('No of ratings by user')
plt.xlabel('Quantile')
plt.legend(loc='best')

# annotate the 25th, 50th, 75th and 100th percentile values....
for x,y in zip(quantiles.index[::25], quantiles[::25]):
    s = "({} , {})".format(x, y)
    plt.annotate(s, xy=(x,y), xytext=(x-0.05, y+500)
                ,fontweight='bold')

plt.savefig('img/quantiles.png')

In [None]:
quantiles[::5]

0.00        1
0.05        7
0.10       15
0.15       21
0.20       27
0.25       34
0.30       41
0.35       50
0.40       60
0.45       73
0.50       89
0.55      109
0.60      133
0.65      163
0.70      199
0.75      245
0.80      307
0.85      392
0.90      520
0.95      749
1.00    17112
Name: rating, dtype: int64

In [None]:
no_of_ratings_per_movie = train_df.groupby(by='movie')['rating'].count().sort_values(ascending=False)

fig = plt.figure(figsize=plt.figaspect(.5))
ax = plt.gca()
plt.plot(no_of_ratings_per_movie.values)
plt.title('# RATINGS per Movie')
plt.xlabel('Movie')
plt.ylabel('No of Users who rated a movie')
ax.set_xticklabels([])

plt.savefig('img/per-movie-ratings-train_df.png')
# plt.show()

- A small share of movies (<10%) are rated by a very large number of users.
- The majority of movies, however, are rated by only a few hundred users.

## Building sparse matrices from data

- The data has 4 columns (movie, user, rating and date), with one row per rating; each movie is rated by many users and each user rates many movies.
- Stored as a dataframe, this takes a lot of memory (about 3 GB).
- To reduce memory use, I store the ratings in a sparse user × movie matrix, in which entry (u_i, m_j) holds the rating r_ij and unrated pairs take no space.
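
A minimal sketch of this construction on made-up ids and ratings, using scipy's `csr_matrix((data, (row_ind, col_ind)))` form. When no shape is passed, the shape is inferred as (largest row index + 1, largest column index + 1), which is why the matrices below come out as (2649430, 17771) even though far fewer users actually appear.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy data: user ids, movie ids and ratings, one entry per rating.
users = np.array([1, 1, 3, 4])
movies = np.array([2, 5, 2, 1])
ratings = np.array([4, 3, 5, 1])

# MATRIX[user, movie] = rating; pairs that are not listed are implicit zeros.
toy_matrix = sparse.csr_matrix((ratings, (users, movies)))

print(toy_matrix.shape)   # inferred from the largest ids
print(toy_matrix[1, 5])   # rating of movie 5 by user 1
print(toy_matrix.nnz)     # number of stored ratings
```

Row 0 and column 0 stay empty because ids start at 1; they cost nothing in CSR storage but do count towards the shape.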

In [17]:
start = datetime.now()
if os.path.isfile('train_sparse_matrix.npz'):
    print("It is present in pwd, loading it")
    train_sparse_matrix = sparse.load_npz('train_sparse_matrix.npz')
    print('Done. It\'s shape is : (user, movie) : ',train_sparse_matrix.shape)
else: 
    print("Building sparse_matrix from the dataframe...")
    # create sparse_matrix and store it for later use.
    # csr_matrix((data_values, (row_index, col_index)), shape=shape_of_matrix)
    # It is built so that MATRIX[row, col] = data
    train_sparse_matrix = sparse.csr_matrix((train_df.rating.values, (train_df.user.values,
                                               train_df.movie.values)),)
    
    print('Done. It\'s shape is : (user, movie) : ',train_sparse_matrix.shape)
    print('Saving it into pwd for further usages...')

    sparse.save_npz("train_sparse_matrix.npz", train_sparse_matrix)
    print('Done.\n')

print(datetime.now() - start)

It is present in pwd, loading it
Done. It's shape is : (user, movie) :  (2649430, 17771)
0:00:02.099595


In [18]:
start = datetime.now()
if os.path.isfile('test_sparse_matrix.npz'):
    print("It is present in pwd, loading it.")
    test_sparse_matrix = sparse.load_npz('test_sparse_matrix.npz')
    print('Done. It\'s shape is : (user, movie) : ',test_sparse_matrix.shape)
else: 
    print("Building sparse_matrix from the dataframe...")
    # create sparse_matrix and store it for later use.
    # csr_matrix((data_values, (row_index, col_index)), shape=shape_of_matrix)
    # It is built so that MATRIX[row, col] = data
    test_sparse_matrix = sparse.csr_matrix((test_df.rating.values, (test_df.user.values,
                                               test_df.movie.values)))
    
    print('Done. It\'s shape is : (user, movie) : ',test_sparse_matrix.shape)
    print('Saving it into pwd for further usages...')

    sparse.save_npz("test_sparse_matrix.npz", test_sparse_matrix)
    print('Done.')
    
print(datetime.now() - start)

It is present in pwd, loading it.
Done. It's shape is : (user, movie) :  (2649430, 17771)
0:00:00.521067


### Sparsity = (Number of zero entries / Number of total entries) * 100

In [17]:
us,mv = train_sparse_matrix.shape
elem = train_sparse_matrix.count_nonzero()
print("Sparsity Of Train matrix : {} % ".format((1-(elem/(us*mv)))*100))

Sparsity Of Train matrix : 99.8292709259195 % 


In [18]:
us,mv = test_sparse_matrix.shape
elem = test_sparse_matrix.count_nonzero()
print("Sparsity Of Test matrix : {} % ".format(  (1-(elem/(us*mv))) * 100) )

Sparsity Of Test matrix : 99.95731772988694 % 
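
The figures above are computed over the full matrix shape, which includes rows and columns for ids that never occur. A hedged sketch of the same formula with an optional variant that counts only users and movies having at least one rating; the helper `sparsity_percent` and its `drop_empty` flag are my own illustrative names, not part of the notebook.

```python
import numpy as np
from scipy import sparse

def sparsity_percent(matrix, drop_empty=False):
    """Percentage of zero entries; optionally ignore all-zero rows and columns."""
    if drop_empty:
        rows = int((matrix.getnnz(axis=1) > 0).sum())
        cols = int((matrix.getnnz(axis=0) > 0).sum())
    else:
        rows, cols = matrix.shape
    return (1 - matrix.count_nonzero() / (rows * cols)) * 100

# Toy matrix: 2 ratings in a 3 x 4 grid, with one empty row and two empty columns.
toy = sparse.csr_matrix(np.array([[5, 0, 0, 0],
                                  [0, 0, 3, 0],
                                  [0, 0, 0, 0]]))

print(sparsity_percent(toy))
print(sparsity_percent(toy, drop_empty=True))
```

Both numbers describe the same data; the second one simply does not count ids that never appear in the matrix.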


### Calculating Average rating globally, per movie and per user

In [40]:
def get_average_ratings(sparse_matrix, of_users):  # of_users is boolean flag (1: users, 0:movies)
    
    # selecting axes of sparse matrix
    ax = 1 if of_users else 0
    
    sum_of_ratings = sparse_matrix.sum(axis=ax).A1     # ".A1" for converting Column_Matrix to 1-D numpy array 
    
    # Boolean matrix of ratings (whether a user rated that movie or not)
    is_rated = sparse_matrix!=0
    
    # number of ratings given by each user OR received by each movie
    no_of_ratings = is_rated.sum(axis=ax).A1
    
    u,m = sparse_matrix.shape     # max_user(u)  and max_movie(m) id's in sparse matrix 

    # average_rating = sum of ratings/sum of non-zero entries
    average_ratings = { i : sum_of_ratings[i]/no_of_ratings[i]            
                                 for i in range(u if of_users else m) 
                                    if no_of_ratings[i] !=0}  
    
    return average_ratings # returns dict
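
To illustrate what this function returns, here is a small vectorized sketch of the same idea on a toy matrix: averages are taken over stored ratings only, and users with no ratings are left out of the dictionary. The toy values are invented, and this is an alternative formulation for illustration, not the notebook's implementation.

```python
import numpy as np
from scipy import sparse

# Toy user x movie matrix; user 1 has not rated anything.
toy = sparse.csr_matrix(np.array([[5, 0, 3],
                                  [0, 0, 0],
                                  [4, 2, 0]]))

sums = np.asarray(toy.sum(axis=1)).ravel()   # sum of ratings per user
counts = toy.getnnz(axis=1)                  # number of stored ratings per user
rated = counts > 0                           # skip users without ratings

user_avg = dict(zip(np.flatnonzero(rated).tolist(),
                    (sums[rated] / counts[rated]).tolist()))
print(user_avg)
```

Dividing by the number of ratings rather than by the number of columns matters here: a zero in the matrix means "not rated", not a rating of 0.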

In [None]:
train_averages = dict()

train_global_average = train_sparse_matrix.sum()/train_sparse_matrix.count_nonzero()
train_averages['global'] = train_global_average
print(f"Global Average of Ratings in training data is {train_averages}")


Global Average of Ratings in training data is {'global': 3.582890686321557}


In [None]:
train_averages['user'] = get_average_ratings(train_sparse_matrix, of_users=True)
# user = random.randint(1,train_sparse_matrix.shape [0])
# print(user)

# Generate a random user ID within the valid range
valid_users = list(train_averages['user'].keys())  # Get the list of valid user IDs
user = random.choice(valid_users) 
print(f'Average rating of user {user} :',train_averages['user'][user])

Average rating of user 573242 : 4.138339920948616


In [None]:
train_averages['movie'] =  get_average_ratings(train_sparse_matrix, of_users=False)

valid_movies = list(train_averages['movie'].keys())
movie = random.choice(valid_movies)
print(f'Average rating of movie {movie} :',train_averages['movie'][movie])

Average rating of movie 12910 : 2.7708333333333335


### PDF and CDF of average rating per user and per movie in train_df

In [None]:
start = datetime.now()

# Draw PDFs for average rating per user and per movie
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=plt.figaspect(.5))
fig.suptitle('Avg Ratings per User and per Movie', fontsize=15)

ax1.set_title('Users-Avg-Ratings')
# Get the list of average user ratings from the averages dictionary
user_averages = [rat for rat in train_averages['user'].values()]
sns.kdeplot(user_averages, cumulative=True, ax=ax1, label='Cdf')
sns.kdeplot(user_averages, ax=ax1, label='Pdf')

ax2.set_title('Movies-Avg-Rating')
# Get the list of movie average ratings from the dictionary
movie_averages = [rat for rat in train_averages['movie'].values()]
sns.kdeplot(movie_averages, cumulative=True, ax=ax2, label='Cdf')
sns.kdeplot(movie_averages, ax=ax2, label='Pdf')

plt.savefig('img/pdf-cdf-avg-rating-user&movie.png')
print(datetime.now() - start)

0:00:05.615467


### How many new users and movies would I encounter in test.csv?

In [None]:
total_users = len(np.unique(df.user))
users_train = len(train_averages['user'])
new_users = total_users - users_train

print('Total number of Users  :', total_users)
print('Number of Users in Train data :', users_train)
print("No of Users that didn't appear in train data: {} ({} %) \n ".format(new_users,(new_users/total_users)*100))

Total number of Users  : 480189
Number of Users in Train data : 405041
No of Users that didn't appear in train data: 75148 (15.649671275268695 %) 
 


In [None]:
total_movies = len(np.unique(df.movie))
movies_train = len(train_averages['movie'])
new_movies = total_movies - movies_train

print('Total number of Movies  :', total_movies)
print('Number of Movies in Train data :', movies_train)
print("No of Movies that didn't appear in train data: {} ({} %) \n ".format(new_movies,(new_movies/total_movies)*100))

Total number of Movies  : 17770
Number of Movies in Train data : 17424
No of Movies that didn't appear in train data: 346 (1.9471018570624647 %) 
 


# Computing similarity matrix

### user - user collaborative filtering

In [20]:
from sklearn.metrics.pairwise import cosine_similarity


def compute_user_similarity(sparse_matrix, compute_for_few=False, top = 100, verbose=False, verb_for_n_rows = 20,
                            draw_time_taken=True):
    no_of_users = sparse_matrix.shape[0]
    # get the indices of  non zero rows (users) from our sparse matrix
    row_ind, col_ind = sparse_matrix.nonzero()
    row_ind = sorted(set(row_ind)) 
    time_taken = list() #  time taken for finding similar users for an user
    
    # Create rows, cols, and data lists.., which can be used to create sparse matrices
    rows, cols, data = list(), list(), list()
    if verbose: print("Computing strted for top",top,"similarities for each user...")
    
    start = datetime.now()
    temp = 0
    
    for row in row_ind[:top] if compute_for_few else row_ind:
        temp = temp+1
        prev = datetime.now()
        
        # get the similarity row for this user with all other users
        sim = cosine_similarity(sparse_matrix.getrow(row), sparse_matrix).ravel()
        # I will consider only the top 10/20/40/100 etc  most similar users and ignore rest of them..
        top_sim_ind = sim.argsort()[-top:]
        top_sim_val = sim[top_sim_ind]
        
        # add them to our rows, cols and data
        rows.extend([row]*top)
        cols.extend(top_sim_ind)
        data.extend(top_sim_val)
        time_taken.append(datetime.now().timestamp() - prev.timestamp())
        if verbose:
            if temp%verb_for_n_rows == 0:
                print("Computing done for {} users [  time elapsed : {}  ]"
                      .format(temp, datetime.now()-start))
            
        
    # lets create sparse matrix out of these and return it
    if verbose: print('Creating Sparse matrix from the computed similarities')
    #return rows, cols, data
    
    if draw_time_taken:
        plt.plot(time_taken, label = 'time taken for each user')
        plt.plot(np.cumsum(time_taken), label='Total time')
        plt.legend(loc='best')
        plt.xlabel('User')
        plt.ylabel('Time (seconds)')
        plt.savefig('img/u-u-cf-17k-dim-per-user.png')
        
    return sparse.csr_matrix((data, (rows, cols)), shape=(no_of_users, no_of_users)), time_taken 

In [None]:
start = datetime.now()
u_u_sim_sparse, _ = compute_user_similarity(train_sparse_matrix, compute_for_few=True, top = 200,
                                                     verbose=True)
print("Time taken for user-user cf with 17k dimensions per user :",datetime.now()-start)

Computing strted for top 200 similarities for each user...
Computing done for 20 users [  time elapsed : 0:01:04.300235  ]
Computing done for 40 users [  time elapsed : 0:01:57.593116  ]
Computing done for 60 users [  time elapsed : 0:02:48.430230  ]
Computing done for 80 users [  time elapsed : 0:03:38.995090  ]
Computing done for 100 users [  time elapsed : 0:04:30.595433  ]
Computing done for 120 users [  time elapsed : 0:05:25.550533  ]
Computing done for 140 users [  time elapsed : 0:06:17.773366  ]
Computing done for 160 users [  time elapsed : 0:07:08.075599  ]
Computing done for 180 users [  time elapsed : 0:08:07.425267  ]
Computing done for 200 users [  time elapsed : 0:09:08.188622  ]
Creating Sparse matrix from the computed similarities
Time taken for user-user cf with 17k dimensions per user : 0:09:13.713449


- Computing the full user-user similarity matrix (user-user collaborative filtering) is expensive.
- For the first 200 users it took **0:09:13.713449**, and the cost grows with the number of users, since every user has to be compared against all the others.

* Average time to find the similar users of one user = (9*60 + 13.71)/200 ≈ **2.77 seconds**
* The training data has 405041 users, so the full computation would take approximately **405041 × 2.77 ≈ 1121964 seconds ≈ 12.99 days**
* It would take almost **13** days just to find the similarities!

- Hence, I will try to compute user-user similarity in a reduced number of dimensions.
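
The extrapolation can be reproduced in a few lines. This is only a back-of-the-envelope sketch: it assumes the per-user time stays constant over all users, and the variable names are illustrative.

```python
from datetime import timedelta

elapsed = timedelta(minutes=9, seconds=13.713449)  # measured for 200 users
users_timed = 200
users_in_train = 405041

per_user_seconds = elapsed.total_seconds() / users_timed
total_days = per_user_seconds * users_in_train / 86400  # 86400 seconds per day

print(round(per_user_seconds, 2), "seconds per user")
print(round(total_days, 2), "days for all users")
```

Using the unrounded per-user time gives a total that differs slightly from the hand calculation, but it still lands at roughly 13 days.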

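### Truncated SVD for reducing the dimension of the user vector
- SVD is a factorization of a matrix into the product of three matrices.
- The SVD of an m×n matrix A is given by the formula A = U Σ V^T
- Where
   - U is an m×m matrix whose columns are orthonormal eigenvectors of AA^T
   - V is an n×n matrix whose columns are orthonormal eigenvectors of (A^T)A, and V^T is its transpose
   - Σ is an m×n diagonal matrix whose r nonzero diagonal entries are the singular values of A, i.e. the square roots of the positive eigenvalues of AA^T (or (A^T)A); r is the rank of A
- Truncated SVD keeps only the k largest singular values and the corresponding vectors, which gives a k-dimensional representation of each user.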
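
These relations can be checked numerically on a small random matrix. The sketch below uses `np.linalg.svd` on a hypothetical 5×3 matrix, not the Netflix data, and the choice k = 2 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))          # hypothetical m x n matrix, m=5, n=3

# Full SVD: U is 5x5, s holds the 3 singular values, Vt is 3x3.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the m x n diagonal matrix Sigma from the singular values.
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

# Singular values are the square roots of the eigenvalues of A^T A.
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]   # sorted in descending order

print(np.allclose(A, U @ Sigma @ Vt))
print(np.allclose(s, np.sqrt(eigvals)))

# Truncation: keep only the k largest singular values.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.matrix_rank(A_k))
```

The truncated product `A_k` has rank k, which is the sense in which truncated SVD compresses the original matrix.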

In [None]:
from datetime import datetime
from sklearn.decomposition import TruncatedSVD

start = datetime.now()

# All parameters are default except n_components and random_state. n_iter is used by the randomized SVD solver.
netflix_svd = TruncatedSVD(n_components=100, algorithm='randomized', random_state=42)
print("Fitting started...")
trunc_svd = netflix_svd.fit_transform(train_sparse_matrix)

# num_iterations = 10
# for i in range(num_iterations):
#     # Fit the TruncatedSVD model for each iteration
#     trunc_svd = netflix_svd.fit_transform(train_sparse_matrix)
    
#     # Print progress update
#     print(f"Iteration {i+1}/{num_iterations} completed")

print(datetime.now()-start)

In [None]:
expl_var = np.cumsum(netflix_svd.explained_variance_ratio_)
expl_var

array([0.23362135, 0.26270872, 0.28323418, 0.29936103, 0.31129667,
       0.32272449, 0.33168545, 0.33816688, 0.34421001, 0.34939129,
       0.35412811, 0.35790579, 0.36145969, 0.36481079, 0.36796535,
       0.3709693 , 0.37381048, 0.37654066, 0.37892266, 0.38128434,
       0.38355732, 0.38573246, 0.38787214, 0.38996681, 0.39201513,
       0.39393495, 0.39577018, 0.39753914, 0.39924786, 0.40091947,
       0.40251418, 0.40408101, 0.40563205, 0.40715363, 0.40864418,
       0.41009275, 0.41151715, 0.41291575, 0.41428276, 0.4156209 ,
       0.41692581, 0.41818944, 0.41941626, 0.4206308 , 0.42183602,
       0.42301205, 0.42417439, 0.42530795, 0.42642577, 0.42753769,
       0.42862981, 0.4297012 , 0.43074982, 0.43178112, 0.43281107,
       0.43382522, 0.434825  , 0.43580279, 0.43677693, 0.43773492,
       0.43868205, 0.43962079, 0.44054526, 0.44145286, 0.44236078,
       0.44325366, 0.44413545, 0.4450134 , 0.44587698, 0.44672994,
       0.44757815, 0.44841497, 0.44924205, 0.45006438, 0.45087

- expl_var is the cumulative explained variance ratio (computed with np.cumsum()): entry k is the share of variance explained by the first k latent factors.
- As latent factors are added one by one, the __gain in explained variance__ keeps decreasing.
- The first 100 factors explain only about 0.47 of the variance. Pushing this above 0.60 would likely need several hundred latent factors (my estimate: 400-500+), which is not worth the extra compute and memory.
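
One way to read such a cumulative curve is to ask how many factors are needed to reach a given target. A small sketch, using only the first ten values from the output above; the helper name `components_needed` is illustrative.

```python
import numpy as np

def components_needed(cumulative_ratio, target):
    """Smallest number of components whose cumulative ratio reaches target, or None."""
    idx = int(np.searchsorted(cumulative_ratio, target, side='left'))
    return idx + 1 if idx < len(cumulative_ratio) else None

# First ten cumulative explained-variance values from the output above.
first_ten = np.array([0.23362135, 0.26270872, 0.28323418, 0.29936103, 0.31129667,
                      0.32272449, 0.33168545, 0.33816688, 0.34421001, 0.34939129])

print(components_needed(first_ten, 0.30))   # reached within these ten factors
print(components_needed(first_ten, 0.60))   # not reached within these ten factors
```

`np.searchsorted` is valid here because a cumulative sum of non-negative ratios is already sorted in ascending order.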

In [23]:
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(10, 12))

ax1.set_ylabel("Cumulative Variance Explained")
ax1.set_xlabel("Number of Latent Factors")
ax1.plot(expl_var)
# annotate some (latent factors, expl_var) points to make it clear
ind = [1, 2, 4, 8, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
ax1.scatter(x = [i-1 for i in ind], y = expl_var[[i-1 for i in ind]], c='#ff3300')
for i in ind:
    ax1.annotate("({}, {})".format(i, np.round(expl_var[i-1], 2)), xy=(i-1, expl_var[i-1]),
                xytext = ( i+20, expl_var[i-1] - 0.01),fontweight='bold')

change_in_expl_var = [expl_var[i+1] - expl_var[i] for i in range(len(expl_var)-1)]
ax2.plot(change_in_expl_var)

ax2.set_ylabel("Increment in Cumulative Variance with One Additional Latent Factor", fontsize=10)
ax2.yaxis.set_label_position("right")
ax2.set_xlabel("Number of Latent Factors")

plt.savefig('img/netflix_svd-expl-var.png')

Each additional latent factor adds very little explained variance. This is what the plots show (especially the bottom plot, which becomes almost flat after the knee).

In [24]:
for i in ind:
    print("({}, {})".format(i, np.round(expl_var[i-1], 2)))

(1, 0.23)
(2, 0.26)
(4, 0.3)
(8, 0.34)
(10, 0.35)
(20, 0.38)
(30, 0.4)
(40, 0.42)
(50, 0.43)
(60, 0.44)
(70, 0.45)
(80, 0.45)
(90, 0.46)
(100, 0.47)


In [None]:
# Project the original user-movie matrix into a 100-dimensional space...
start = datetime.now()
trunc_matrix = train_sparse_matrix.dot(netflix_svd.components_.T)
print(datetime.now()- start)

0:00:03.645870


In [26]:
type(trunc_matrix), trunc_matrix.shape

(numpy.ndarray, (2649430, 100))
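
The projection above multiplies the rating matrix by the transposed components of the fitted model. A hedged sketch on a small random sparse matrix (hypothetical sizes, not the Netflix data), checking that this manual product matches what `TruncatedSVD.transform` returns:

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Hypothetical small sparse matrix standing in for the user x movie matrix.
toy = sparse.random(50, 20, density=0.2, format='csr', random_state=0)

svd = TruncatedSVD(n_components=5, algorithm='randomized', random_state=42)
svd.fit(toy)

# Manual projection, as in the cell above: X . V, with V = components_.T
projected = toy.dot(svd.components_.T)

print(type(projected), projected.shape)
print(np.allclose(projected, svd.transform(toy)))
```

The result is a dense array with one row per user and one column per latent factor, which matches the type and shape printed above.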

In [22]:
if not os.path.isfile('trunc_sparse_matrix.npz'):
    trunc_sparse_matrix = sparse.csr_matrix(trunc_matrix)
    sparse.save_npz('trunc_sparse_matrix', trunc_sparse_matrix)
else:
    print("trunc_sparse_matrix.npz already exists. Loading it...")
    start = datetime.now()
    trunc_sparse_matrix = sparse.load_npz('trunc_sparse_matrix.npz')
    print(datetime.now()- start)

trunc_sparse_matrix.npz already exists. Loading it...
0:00:01.445727


In [25]:
trunc_sparse_matrix.shape

(2649430, 100)

In [28]:
start = datetime.now()
trunc_u_u_sim_matrix, _ = compute_user_similarity(trunc_sparse_matrix, compute_for_few=True, top=50, verbose=True, 
                                                 verb_for_n_rows=10)

print("time:",datetime.now()-start)

Computing strted for top 50 similarities for each user...
Computing done for 10 users [  time elapsed : 0:00:07.767252  ]
Computing done for 20 users [  time elapsed : 0:00:15.033117  ]
Computing done for 30 users [  time elapsed : 0:00:22.483663  ]
Computing done for 40 users [  time elapsed : 0:00:29.950224  ]
Computing done for 50 users [  time elapsed : 0:00:37.384297  ]
Creating Sparse matrix from the computed similarities
time: 0:00:40.147739


- Time taken per user = 0:00:40.147739 / 50 ≈ **0.80 seconds**
- There are 405041 users in total, so computing the user-user similarities for all of them would take about 405041 × 0.80 ≈ 324033 seconds ≈ 3.75 days.
- SVD has clearly reduced the computation time, but almost 4 days is still far too long, and the computation would need a lot of memory and compute power.

In [14]:
if not os.path.isfile('u_u_sim_sparse.npz'):
    # Save the computed user-user similarity matrix
    sparse.save_npz("u_u_sim_sparse.npz", trunc_u_u_sim_matrix)
else:
    print("u_u_sim_sparse.npz already exists. Loading it...")
    start = datetime.now()
    u_u_sim_sparse = sparse.load_npz('u_u_sim_sparse.npz')
    print(datetime.now()- start)

u_u_sim_sparse.npz already exists. Loading it...
0:00:00.067518


### Alternative/Modification to traditional SVD
One drawback I noticed in the method above is that it recalculates the similarity of a user with another user in more than one iteration.
To avoid this:
- I will maintain a binary vector over users that records whether the program has already computed the top (say, 100) similarities for a user.
- **If not**: compute the top (say, 100) most similar users for this user, the way I did above, and add them to a data structure so that they can be looked up later without recomputing.
- **If they are already computed**: read them directly from the data structure. Over time I might have to recompute them if they were computed long ago, because user preferences change.
- So the program could maintain some kind of **timer** per user; when it expires, the similarities have to be updated (recomputed).
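
The idea above can be sketched as a small cache with an expiry time. This is only an illustration under assumptions: the class `SimilarityCache`, the `max_age_seconds` parameter and the stand-in function `fake_top_similar` are hypothetical names, and a dictionary keyed by user replaces the binary vector (a user is "already computed" when it has an unexpired entry).

```python
import time

class SimilarityCache:
    """Cache top-similar users per user and recompute them once they are too old."""

    def __init__(self, compute_fn, max_age_seconds):
        self.compute_fn = compute_fn          # function: user -> list of similar users
        self.max_age_seconds = max_age_seconds
        self._store = {}                      # user -> (timestamp, similar_users)

    def get(self, user, now=None):
        now = time.time() if now is None else now
        entry = self._store.get(user)
        if entry is not None and now - entry[0] < self.max_age_seconds:
            return entry[1]                   # still fresh: reuse it
        result = self.compute_fn(user)        # missing or expired: (re)compute
        self._store[user] = (now, result)
        return result

# Stand-in for the expensive similarity computation; it records each call.
calls = []
def fake_top_similar(user):
    calls.append(user)
    return [user + 1, user + 2]

cache = SimilarityCache(fake_top_similar, max_age_seconds=60)
cache.get(7, now=0)     # computed
cache.get(7, now=30)    # served from the cache
cache.get(7, now=90)    # expired, so recomputed
print(calls)
```

Passing `now` explicitly is only there to make the expiry behaviour easy to demonstrate; in real use the default clock would be used.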


### Movie - Movie collaborative filtering

In [23]:
start = datetime.now()
if not os.path.isfile('m_m_sim_sparse.npz'):
    start = datetime.now()
    m_m_sim_sparse = cosine_similarity(X=train_sparse_matrix.T, dense_output=False)
    # store this sparse matrix in disk before using it. For future purposes.
    sparse.save_npz("m_m_sim_sparse.npz", m_m_sim_sparse)
    print("Done.")
else:
    print("m_m_sim_sparse.npz is there already, Loading it...")
    m_m_sim_sparse = sparse.load_npz("m_m_sim_sparse.npz")
    print("Done.")

# print("m_m_sim_sparse.npz is a ",m_m_sim_sparse.shape," dimensional matrix")

print(datetime.now() - start)

m_m_sim_sparse.npz is there already, Loading it...
Done.
0:00:14.750535


In [24]:
m_m_sim_sparse.shape

(17771, 17771)

- We have a similarity score for every pair of movies, but the least similar movies are rarely of interest.
- Platforms usually recommend only the top few similar items (here, item = movie), for example the top 10 or top 100.
- So it is better to keep only the most similar movies for each movie and store them in a separate dictionary.
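
As a hedged illustration of keeping only the top similar items, the sketch below selects the k best entries of one similarity row with `np.argpartition` instead of a full sort, and removes the item itself by index rather than by position. The helper name `top_k_similar` and the toy row are assumptions made for this example.

```python
import numpy as np


def top_k_similar(sim_row, item_id, k):
    """Return the ids of the k items most similar to item_id, most similar first."""
    scores = np.asarray(sim_row, dtype=float).copy()
    scores[item_id] = -np.inf                            # exclude the item itself explicitly
    k = min(k, scores.size - 1)
    candidates = np.argpartition(-scores, k - 1)[:k]     # unordered top k, linear time
    return candidates[np.argsort(-scores[candidates])]   # sort only those k


sim_row = np.array([0.1, 1.0, 0.7, 0.0, 0.9])   # toy similarities of item 1 to items 0..4
top2 = top_k_similar(sim_row, item_id=1, k=2)
print(top2)
```

With the notebook's objects, a row could be passed in as `m_m_sim_sparse[movie].toarray().ravel()` with `k=100`. Excluding by index matters for movies whose similarities are all zero, where the movie itself is not guaranteed to come first after sorting.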

In [25]:
movie_ids = np.unique(m_m_sim_sparse.nonzero()[1])
movie_ids

array([    1,     2,     3, ..., 17768, 17769, 17770])

In [38]:
len(movie_ids)

17424

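m_m_sim_sparse is computed from the training set only, so movie_ids contains just the 17424 movies (out of 17770) that have at least one rating in the training data.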

In [26]:
start = datetime.now()
similar_movies = dict() 
for movie in movie_ids:
    # get the top similar movies and store them in the dictionary
    sim_movies = m_m_sim_sparse[movie].toarray().ravel().argsort()[::-1][1:]
    similar_movies[movie] = sim_movies[:100]
print(datetime.now() - start)

# quick check: show the similar movies of a randomly chosen movie_id
movie=random.choice(movie_ids)
print(f"Similar movies for movie id {movie} are :\n")
similar_movies[movie]

0:00:27.543420
Similar movies for movie id 860 are :



array([ 3191, 15935, 10309,  7746,  7337, 14221,  1529, 16887,  3258,
       12041,   838,  6662,  5831,  4674,  2999,  1989,  1361,  2844,
        7017,  3361, 10129,  8495,  3298,  1830,  4462,  2003, 11408,
       11906,  3199,   669,  3101, 13976, 13450, 10448, 15604,   245,
       12483,   821, 12364, 16250,  7164,  8095, 12357,  6288,  2099,
       12389,  6757,  8523, 11951,  9168,  7280, 14369, 16391, 14263,
        2289,  7300, 15515,  1846,  2668,  7731, 15581,  3104,  7051,
        2513,   622,   274, 10470,  2415, 16071, 12344,  1398, 10658,
        4758, 10093,   517, 12119,  7743, 12033, 11996, 12916,    26,
       13043, 15405, 16706, 16351, 14674,  9806, 12271, 11870, 10323,
        8855,  6005, 10059, 16509, 12499, 14495,  3186, 13597,  2493,
       11588], dtype=int64)

### Verifying whether these movies are actually similar
 - #### Using Netflix's movie_titles.csv to look up the titles and cross-check manually

In [27]:
movie_titles = pd.read_csv("Data/movie_titles.csv", sep=',', header = None,
                           names=['movie_id', 'year_of_release', 'title'],
                           usecols=[0, 1, 2], verbose=True,
                      index_col = 'movie_id', encoding = "ISO-8859-1")  
#encoding necessary as movie_titles.csv has characters outside ASCII range

Tokenization took: 0.81 ms
Type conversion took: 9.01 ms
Parser memory cleanup took: 0.00 ms



In [28]:
movie_titles.head()

movie_id,year_of_release,title
1,2003.0,Dinosaur Planet
2,2004.0,Isle of Man TT 2004 Review
3,1997.0,Character
4,1994.0,Paula Abdul's Get Up & Dance
5,2004.0,The Rise and Fall of ECW


### Recommending similar movies for a given movie id

In [29]:
mv_id = 1061

print(f"Movie id {mv_id} corresponds to ",movie_titles.loc[mv_id].values[1])

print("It has {} Ratings from users.".format(train_sparse_matrix[:,mv_id].getnnz()))

print(f"Movide id = {mv_id}" + " have {} movies which are similar to this and but only top 100 most similar ones are of interest.".format(m_m_sim_sparse[:,mv_id].getnnz()))

Movie id 1061 corresponds to  Spider-Man vs. Doc Ock
It has 744 Ratings from users.
Movide id = 1061 have 17320 movies which are similar to this and but only top 100 most similar ones are of interest.


In [30]:
similarities = m_m_sim_sparse[mv_id].toarray().ravel()

# Sort the movie ids by decreasing similarity and drop the first entry,
# which is the movie itself (similarity 1).
similar_indices = similarities.argsort()[::-1][1:]

similar_indices

array([12524,  2279, 13274, ...,  6725, 15104,     0], dtype=int64)

In [58]:
len(similar_indices)

17770

In [59]:
plt.plot(similarities[similar_indices], label='All movies')
plt.plot(similarities[similar_indices[:100]], label='top 100 similar movies')
plt.title("Similar Movies of {}(movie_id)".format(mv_id), fontsize=20)
plt.xlabel("Movies (Not Movie_Ids)", fontsize=15)
plt.ylabel("Cosine Similarity",fontsize=15)
plt.legend()
plt.savefig("img/similar-movies-for-movieId-1061.png")

In [31]:
# Top 10 similar movies for the chosen movie_id
movie_titles.loc[similar_indices[:10]]

movie_id,year_of_release,title
12524,2004.0,Spider-Man: The Return of the Green Goblin
2279,2002.0,Spider-Man: The Ultimate Villain Showdown
13274,1996.0,Daredevil vs. Spiderman
1231,1999.0,Batman Beyond: Tech Wars / Disappearing Inque
14184,1967.0,Spider-Man: The '67 Classic Collection
2912,2003.0,Spider-Man: The New Animated Series: Season 1
14017,1999.0,Batman Beyond: School Dayz / Spellbound
11088,1992.0,Adventures of Batman & Robin: The Joker/Fire &...
4342,1992.0,Batman: The Animated Series: Out of the Shadows
7903,1992.0,Adventures of Batman & Robin: Poison Ivy/The P...


### The same approach can be applied to **user-user** similarity to get, for example, the top 10 similar users

In [33]:
def get_watched_movies(user_id, df):
    # Filter DataFrame to get rows corresponding to the specified user
    user_movies_df = df[df['user'] == user_id]
    # Get the list of movies watched by the user
    watched_movies = user_movies_df['movie'].tolist()
    return watched_movies

# Example usage:
target_user_id = 510180  # user to which we are recommending top 10 movies
watched_movies = get_watched_movies(target_user_id, train_df)
print("Movies already watched by user {}: {}".format(target_user_id, watched_movies))
  

Movies already watched by user 510180: [10341, 1798, 10774, 8651, 14660, 3870, 8357, 15894, 9003, 9392, 11234, 5571, 12470, 11313, 2866, 12818, 5625, 9798, 16465, 15057, 12473, 17064, 6615, 15105, 8079, 1367, 3730, 17764, 11612, 15336, 14455, 11080, 5474, 6902, 2948, 3421, 16182, 11259, 2478, 9536, 14869, 11638, 6336, 11005, 1314, 4912, 13651, 7617, 15674, 15421, 9785, 1421, 15813, 15922, 11888, 8773, 5237, 4883, 12317, 15466, 2518, 14610, 10832, 15940, 15698, 13622, 550, 607, 6971, 6240, 9432, 1324, 432, 15875, 15455, 11490, 16668, 4870, 14624, 9189, 6574, 13705, 599, 6158, 1035, 3113, 14403, 13058, 3439, 12926, 15472, 2622, 12633, 12003, 12160, 13216, 13195, 10392, 12309, 4612, 16339, 16605, 5477, 5882, 2851, 6988, 3617, 6797, 674, 10872, 8177, 14858, 1615, 11544, 3535, 8840, 5845, 6266, 2876, 3168, 4824, 12435, 4248, 5775, 2209, 14109, 13748, 10375, 11051, 7055, 13248, 2174, 7397, 10123, 14928, 1027, 7399, 4214, 4031, 16992, 8317, 5402, 2000, 2953, 8596, 13580, 17709, 4705, 4402, 17

In [32]:
from sklearn.metrics.pairwise import cosine_similarity

def get_similar_users(target_user_id, train_sparse_matrix, top_n):
    # Calculate cosine similarity between the target user and all other users
    similarity_scores = cosine_similarity(train_sparse_matrix[target_user_id], train_sparse_matrix).ravel()
    
    # Sort the similarity scores in descending order and get the indices of top similar users
    similar_users_indices = similarity_scores.argsort()[::-1][1:top_n+1]
    
    return similar_users_indices


In [34]:
target_user_id = 510180  # ID of the targeted user
similar_users_indices = get_similar_users(target_user_id, train_sparse_matrix, top_n=10)
print("Top similar users for user", target_user_id, ":", similar_users_indices)

Top similar users for user 510180 : [2070820 1912012   15191  443193  829101 1791707  113369  383858 1797525
 1413561]
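
The two pieces above (movies already watched, and the most similar users) can be combined into a recommendation step. The sketch below is one possible way to do that and is not taken from the notebook: the function name `recommend_from_similar_users`, the toy matrix, and the scoring rule (mean rating among the similar users who rated a movie) are all assumptions.

```python
import numpy as np
from scipy import sparse


def recommend_from_similar_users(ratings, target_user, similar_users, top_n):
    """Rank movies the target user has not rated by the mean rating of similar users."""
    ratings = sparse.csr_matrix(ratings)
    neighbours = ratings[similar_users]                         # rows of the similar users
    rating_sums = np.asarray(neighbours.sum(axis=0)).ravel()
    rating_counts = np.asarray((neighbours != 0).sum(axis=0)).ravel()
    mean_rating = np.divide(rating_sums, rating_counts,
                            out=np.zeros_like(rating_sums, dtype=float),
                            where=rating_counts > 0)
    mean_rating[ratings[target_user].indices] = 0.0             # drop movies already rated
    ranked = np.argsort(-mean_rating, kind="stable")
    return [int(m) for m in ranked[:top_n] if mean_rating[m] > 0]


toy = np.array([
    [5, 0, 0, 4],   # user 0 (target) has rated movies 0 and 3
    [4, 5, 0, 0],   # user 1
    [0, 3, 2, 5],   # user 2
])
recs = recommend_from_similar_users(toy, target_user=0, similar_users=[1, 2], top_n=2)
print(recs)
```

In the notebook, `ratings` would be `train_sparse_matrix` and `similar_users` the array returned by `get_similar_users`. Weighting each neighbour's rating by its similarity score is a common refinement that this sketch leaves out.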


### Combining both user-user and movie-movie collaborative filtering via Machine Learning

In [49]:
def get_sample_sparse_matrix(sparse_matrix, no_users, no_movies, path, verbose = True):
    """
        It will get it from the ''path'' if it is present  or It will create 
        and store the sampled sparse matrix in the path specified.
    """

    # get (row, col) and (rating) tuple from sparse_matrix...
    row_ind, col_ind, ratings = sparse.find(sparse_matrix)
    users = np.unique(row_ind)
    movies = np.unique(col_ind)

    print("Original Matrix : (users, movies) -- ({} {})".format(len(users), len(movies)))
    print("Original Matrix : Ratings -- {}\n".format(len(ratings)))

    # Fix the seed so that the same sample is drawn on every run,
    # and pick without replacement.
    np.random.seed(15)
    sample_users = np.random.choice(users, no_users, replace=False)
    sample_movies = np.random.choice(movies, no_movies, replace=False)
    # Boolean mask selecting the ratings whose user and movie were both sampled.
    mask = np.logical_and( np.isin(row_ind, sample_users),
                      np.isin(col_ind, sample_movies) )
    
    sample_sparse_matrix = sparse.csr_matrix((ratings[mask], (row_ind[mask], col_ind[mask])),
                                             shape=(max(sample_users)+1, max(sample_movies)+1))

    if verbose:
        print("Sampled Matrix : (users, movies) -- ({} {})".format(len(sample_users), len(sample_movies)))
        print("Sampled Matrix : Ratings --", format(ratings[mask].shape[0]))

    print('Saving it into pwd for furthur operations...')
    
    sparse.save_npz(path, sample_sparse_matrix)
    if verbose:
            print('Done.')
    
    return sample_sparse_matrix

### Creating Sample Train data from train_df

In [50]:
start = datetime.now()
path = "sample/small/sample_train_sparse_matrix.npz"
if os.path.isfile(path):
    print("It is present in your pwd, getting it from disk....")
    # just get it from the disk instead of computing it
    sample_train_sparse_matrix = sparse.load_npz(path)
    print("DONE.")
else: 
    # get 8k users and 0.8k movies from available data 
    print("It is not present in pwd...")
    sample_train_sparse_matrix = get_sample_sparse_matrix(train_sparse_matrix, no_users=8000, no_movies=800,
                                             path = path)

print(datetime.now() - start)

It is not present in pwd...
Original Matrix : (users, movies) -- (405041 17424)
Original Matrix : Ratings -- 80384405

Sampled Matrix : (users, movies) -- (8000 800)
Sampled Matrix : Ratings -- 78751
Saving it into pwd for furthur operations...
Done.
0:01:27.232241


### Creating Sample Test Data from test_df

In [51]:
start = datetime.now()

path = "sample/small/sample_test_sparse_matrix.npz"
if os.path.isfile(path):
    print("It is present in your pwd, getting it from disk....")
    # just get it from the disk instead of computing it
    sample_test_sparse_matrix = sparse.load_npz(path)
    print("DONE.")
else:
    # get 4k users and 400 movies from available data 
    print("It is not present in pwd...")
    sample_test_sparse_matrix = get_sample_sparse_matrix(test_sparse_matrix, no_users=4000, no_movies=400,
                                                 path = "sample/small/sample_test_sparse_matrix.npz")
print(datetime.now() - start)

It is not present in pwd...


Original Matrix : (users, movies) -- (349312 17757)
Original Matrix : Ratings -- 20096102

Sampled Matrix : (users, movies) -- (4000 400)
Sampled Matrix : Ratings -- 4530
Saving it into pwd for furthur operations...
Done.
0:00:17.270051


#### Finding Global Average of all movie ratings, Average rating per User, and Average rating per Movie (from sampled train)

In [52]:
sample_train_averages = dict()

In [68]:
# get the global average of ratings in our train set.
global_average = sample_train_sparse_matrix.sum()/sample_train_sparse_matrix.count_nonzero()
sample_train_averages['global'] = global_average
sample_train_averages['global']

3.6119795304186613

In [62]:
sample_train_averages['user'] = get_average_ratings(sample_train_sparse_matrix, of_users=True)
print('\nAverage rating of user 1179 :',sample_train_averages['user'][1179])


Average rating of user 1179 : 3.7142857142857144


In [63]:
sample_train_averages['movie'] =  get_average_ratings(sample_train_sparse_matrix, of_users=False)
print('\n AVerage rating of movie 6464 :',sample_train_averages['movie'][6464])


 AVerage rating of movie 6464 : 3.400396432111001


### Featurizing Data

In [57]:
print('No of ratings in Our Sampled train matrix is : {}\n'.format(sample_train_sparse_matrix.count_nonzero()))
print('No of ratings in Our Sampled test  matrix is : {}\n'.format(sample_test_sparse_matrix.count_nonzero()))

No of ratings in Our Sampled train matrix is : 78751

No of ratings in Our Sampled test  matrix is : 4530



####  Featurizing train data

In [58]:
# get users, movies and ratings from our samples train sparse matrix
sample_train_users, sample_train_movies, sample_train_ratings = sparse.find(sample_train_sparse_matrix)

In [59]:
start = datetime.now()
if os.path.isfile('sample/small/reg_train.csv'):
    print("File already exists you don't have to prepare again..." )
else:
    print('preparing {} tuples for the dataset..\n'.format(len(sample_train_ratings)))
    with open('sample/small/reg_train.csv', mode='w') as reg_data_file:
        count = 0
        for (user, movie, rating)  in zip(sample_train_users, sample_train_movies, sample_train_ratings):
            st = datetime.now()
        #     print(user, movie)    
            #--------------------- Ratings of "movie" by similar users of "user" ---------------------
            # compute the similar Users of the "user"        
            user_sim = cosine_similarity(sample_train_sparse_matrix[user], sample_train_sparse_matrix).ravel()
            top_sim_users = user_sim.argsort()[::-1][1:] # we are ignoring 'The User' from its similar users.
            # get the ratings of most similar users for this movie
            top_ratings = sample_train_sparse_matrix[top_sim_users, movie].toarray().ravel()
            # pad the list to length 5 with the movie's average rating
            top_sim_users_ratings = list(top_ratings[top_ratings != 0][:5])
            top_sim_users_ratings.extend([sample_train_averages['movie'][movie]]*(5 - len(top_sim_users_ratings)))
        #     print(top_sim_users_ratings, end=" ")    


            #--------------------- Ratings by "user"  to similar movies of "movie" ---------------------
            # compute the similar movies of the "movie"        
            movie_sim = cosine_similarity(sample_train_sparse_matrix[:,movie].T, sample_train_sparse_matrix.T).ravel()
            top_sim_movies = movie_sim.argsort()[::-1][1:] # we are ignoring the movie itself from its similar movies.
            # get the ratings of most similar movie rated by this user..
            top_ratings = sample_train_sparse_matrix[user, top_sim_movies].toarray().ravel()
            # pad the list to length 5 with the user's average rating
            top_sim_movies_ratings = list(top_ratings[top_ratings != 0][:5])
            top_sim_movies_ratings.extend([sample_train_averages['user'][user]]*(5-len(top_sim_movies_ratings))) 
        #     print(top_sim_movies_ratings, end=" : -- ")

            #-----------------prepare the row to be stores in a file-----------------#
            row = list()
            row.append(user)
            row.append(movie)
            # Now add the other features to this data...
            row.append(sample_train_averages['global']) # first feature
            # next 5 features are similar_users "movie" ratings
            row.extend(top_sim_users_ratings)
            # next 5 features are "user" ratings for similar_movies
            row.extend(top_sim_movies_ratings)
            # Avg_user rating
            row.append(sample_train_averages['user'][user])
            # Avg_movie rating
            row.append(sample_train_averages['movie'][movie])

            # finally, the actual rating of this user-movie pair...
            row.append(rating)
            count = count + 1

            # add rows to the file opened..
            reg_data_file.write(','.join(map(str, row)))
            reg_data_file.write('\n')        
            if (count)%10000 == 0:
                # print(','.join(map(str, row)))
                print("Done for {} rows----- {}".format(count, datetime.now() - start))


print(datetime.now() - start)

preparing 78751 tuples for the dataset..

Done for 10000 rows----- 0:27:34.998557
Done for 20000 rows----- 0:54:02.188989
Done for 30000 rows----- 1:20:31.693370
Done for 40000 rows----- 1:46:57.151897
Done for 50000 rows----- 2:13:21.269826
Done for 60000 rows----- 2:39:44.191039
Done for 70000 rows----- 3:06:00.945686
3:29:14.949788


In [60]:
reg_train = pd.read_csv('sample/small/reg_train.csv', names = ['user', 'movie', 'GAvg', 'sur1', 'sur2', 'sur3', 'sur4', 'sur5','smr1', 'smr2', 'smr3', 'smr4', 'smr5', 'UAvg', 'MAvg', 'rating'], header=None)
reg_train.head()

Unnamed: 0,user,movie,GAvg,sur1,sur2,sur3,sur4,sur5,smr1,smr2,smr3,smr4,smr5,UAvg,MAvg,rating
0,692,14621,3.61198,2.0,4.0,5.0,5.0,5.0,4.0,4.0,4.0,4.0,4.0,4.0,4.329095,4
1,1179,2239,3.61198,5.0,3.0,3.0,2.0,2.0,3.0,5.0,4.0,3.0,4.0,3.714286,2.909091,5
2,1179,4352,3.61198,4.0,3.0,2.0,3.0,3.0,4.0,4.0,3.0,4.0,5.0,3.714286,3.140845,3
3,1179,6464,3.61198,3.0,5.0,4.0,2.0,4.0,4.0,4.0,2.0,5.0,3.0,3.714286,3.400396,4
4,1179,6510,3.61198,4.0,4.0,3.0,4.0,3.0,5.0,4.0,4.0,3.0,4.0,3.714286,3.936614,4


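- GAvg : global average rating
- UAvg : the user's average rating
- MAvg : the movie's average rating
- rating : rating given by the user to the movie
- sur1,sur2,sur3,sur4,sur5 : ratings given to this movie by the 5 most similar users who rated it (padded with the movie's average when fewer than 5 exist)
- smr1,smr2,smr3,smr4,smr5 : ratings given by this user to the 5 most similar movies they rated (padded with the user's average when fewer than 5 exist)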
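
To make the `sur*` and `smr*` columns concrete, here is a small sketch that builds them for a single (user, movie) pair on a toy dense matrix, using 3 neighbours instead of 5. The helper names `cosine_to_all` and `neighbour_ratings` and the matrix `R` are hypothetical; unlike the loop above, the sketch removes the user (or movie) itself by index rather than by dropping the first sorted entry.

```python
import numpy as np


def cosine_to_all(vec, matrix):
    """Cosine similarity between vec and every row of matrix (0 where a norm is 0)."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vec)
    dots = matrix @ vec
    return np.divide(dots, norms, out=np.zeros(len(matrix)), where=norms > 0)


def neighbour_ratings(sims, self_index, ratings_of_neighbours, fallback, k):
    """First k non-zero ratings, scanning neighbours from most to least similar, padded with fallback."""
    order = [i for i in np.argsort(-sims, kind="stable") if i != self_index]
    found = [float(ratings_of_neighbours[i]) for i in order if ratings_of_neighbours[i] != 0][:k]
    return found + [fallback] * (k - len(found))


# Toy ratings: rows are users, columns are movies, 0 means "not rated".
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 2, 1],
    [0, 4, 1, 5],
    [5, 2, 0, 0],
], dtype=float)

user, movie, k = 0, 1, 3
movie_avg = R[:, movie][R[:, movie] != 0].mean()   # plays the role of MAvg
user_avg = R[user][R[user] != 0].mean()            # plays the role of UAvg

# Ratings of this movie by the users most similar to `user`.
sur = neighbour_ratings(cosine_to_all(R[user], R), user, R[:, movie], movie_avg, k)
# Ratings by `user` of the movies most similar to `movie`.
smr = neighbour_ratings(cosine_to_all(R[:, movie], R.T), movie, R[user], user_avg, k)
print(sur, smr)
```

Here only two similar users have rated the movie, so the third `sur` slot is filled with the movie's average; the same happens for `smr` with the user's average.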


In [61]:
reg_train.shape

(78751, 16)

#### Featurising test data

In [64]:
# get users, movies and ratings from the Sampled Test 
sample_test_users, sample_test_movies, sample_test_ratings = sparse.find(sample_test_sparse_matrix)

In [69]:
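# Note: this stores the test-set global average under sample_train_averages['global'],
# replacing the train value computed earlier; the test features below use this value.
global_average = sample_test_sparse_matrix.sum()/sample_test_sparse_matrix.count_nonzero()
sample_train_averages['global'] = global_average
sample_train_averages['global']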

3.5304635761589402

In [70]:
start = datetime.now()

if os.path.isfile('sample/small/reg_test.csv'):
    print("It is already created...")
else:

    print('preparing {} tuples for the dataset..\n'.format(len(sample_test_ratings)))
    with open('sample/small/reg_test.csv', mode='w') as reg_data_file:
        count = 0 
        for (user, movie, rating)  in zip(sample_test_users, sample_test_movies, sample_test_ratings):
            st = datetime.now()

        #--------------------- Ratings of "movie" by similar users of "user" ---------------------
            #print(user, movie)
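            # start from an empty list so the except branch never pads values left over from the previous row
            top_sim_users_ratings = []
            try: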
                # compute the similar Users of the "user"        
                user_sim = cosine_similarity(sample_train_sparse_matrix[user], sample_train_sparse_matrix).ravel()
                top_sim_users = user_sim.argsort()[::-1][1:] # we are ignoring 'The User' from its similar users.
                # get the ratings of most similar users for this movie
                top_ratings = sample_train_sparse_matrix[top_sim_users, movie].toarray().ravel()
                # we will make it's length "5" by adding movie averages to .
                top_sim_users_ratings = list(top_ratings[top_ratings != 0][:5])
                top_sim_users_ratings.extend([sample_train_averages['movie'][movie]]*(5 - len(top_sim_users_ratings)))
                # print(top_sim_users_ratings, end="--")

            except (IndexError, KeyError):
                # It is a new User or new Movie or there are no ratings for given user for top similar movies...
                ########## Cold STart Problem ##########
                top_sim_users_ratings.extend([sample_train_averages['global']]*(5 - len(top_sim_users_ratings)))
                #print(top_sim_users_ratings)
            except:
                print(user, movie)
                # we just want KeyErrors to be resolved. Not every Exception...
                raise



            #--------------------- Ratings by "user"  to similar movies of "movie" ---------------------
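            # start from an empty list so the except branch never pads values left over from the previous row
            top_sim_movies_ratings = []
            try: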
                # compute the similar movies of the "movie"        
                movie_sim = cosine_similarity(sample_train_sparse_matrix[:,movie].T, sample_train_sparse_matrix.T).ravel()
                top_sim_movies = movie_sim.argsort()[::-1][1:] # we are ignoring the movie itself from its similar movies.
                # get the ratings of most similar movie rated by this user..
                top_ratings = sample_train_sparse_matrix[user, top_sim_movies].toarray().ravel()
                # we will make it's length "5" by adding user averages to.
                top_sim_movies_ratings = list(top_ratings[top_ratings != 0][:5])
                top_sim_movies_ratings.extend([sample_train_averages['user'][user]]*(5-len(top_sim_movies_ratings))) 
                #print(top_sim_movies_ratings)
            except (IndexError, KeyError):
                #print(top_sim_movies_ratings, end=" : -- ")
                top_sim_movies_ratings.extend([sample_train_averages['global']]*(5-len(top_sim_movies_ratings)))
                #print(top_sim_movies_ratings)
            except :
                raise

            #-----------------prepare the row to be stores in a file-----------------#
            row = list()
            # add usser and movie name first
            row.append(user)
            row.append(movie)
            row.append(sample_train_averages['global']) # first feature
            #print(row)
            # next 5 features are similar_users "movie" ratings
            row.extend(top_sim_users_ratings)
            #print(row)
            # next 5 features are "user" ratings for similar_movies
            row.extend(top_sim_movies_ratings)
            #print(row)
            # Avg_user rating
            try:
                row.append(sample_train_averages['user'][user])
            except KeyError:
                row.append(sample_train_averages['global'])
            except:
                raise
            #print(row)
            # Avg_movie rating
            try:
                row.append(sample_train_averages['movie'][movie])
            except KeyError:
                row.append(sample_train_averages['global'])
            except:
                raise
            #print(row)
            # finally, the actual rating of this user-movie pair...
            row.append(rating)
            #print(row)
            count = count + 1

            # add rows to the file opened..
            reg_data_file.write(','.join(map(str, row)))
            #print(','.join(map(str, row)))
            reg_data_file.write('\n')        
            if (count)%1000 == 0:
                #print(','.join(map(str, row)))
                print("Done for {} rows----- {}".format(count, datetime.now() - start))
    print("",datetime.now() - start)  

preparing 4530 tuples for the dataset..

Done for 1000 rows----- 0:04:00.820938
Done for 2000 rows----- 0:07:26.288614
Done for 3000 rows----- 0:11:52.674543
Done for 4000 rows----- 0:15:36.367723
 0:17:06.540811


In [71]:
reg_test_df = pd.read_csv('sample/small/reg_test.csv', names = ['user', 'movie', 'GAvg', 'sur1', 'sur2', 'sur3', 'sur4', 'sur5',
                                                          'smr1', 'smr2', 'smr3', 'smr4', 'smr5',
                                                          'UAvg', 'MAvg', 'rating'], header=None)
reg_test_df.head()

Unnamed: 0,user,movie,GAvg,sur1,sur2,sur3,sur4,sur5,smr1,smr2,smr3,smr4,smr5,UAvg,MAvg,rating
0,7,13072,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,5
1,126,3418,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,5
2,268,11740,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,5
3,1809,705,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,4
4,1809,1140,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3


### Surprise model
Ref : http://surprise.readthedocs.io/en/stable/getting_started.html#load-dom-dataframe-py

In [72]:
from surprise import Reader, Dataset

In [73]:
reader = Reader(rating_scale=(1,5))

# create the traindata from the dataframe...
train_data = Dataset.load_from_df(reg_train[['user', 'movie', 'rating']], reader)

# build the trainset from traindata.., It is of dataset format from surprise library..
trainset = train_data.build_full_trainset() 

In [77]:
testset = list(zip(reg_test_df.user.values, reg_test_df.movie.values, reg_test_df.rating.values))
testset

[(7, 13072, 5),
 (126, 3418, 5),
 (268, 11740, 5),
 (1809, 705, 4),
 (1809, 1140, 3),
 (1809, 2533, 3),
 (1809, 3418, 3),
 (1809, 4011, 4),
 (1809, 6849, 5),
 (1809, 11740, 4),
 (3300, 5226, 3),
 (3300, 5601, 3),
 (3321, 348, 3),
 (3321, 750, 1),
 (3321, 1140, 2),
 (3321, 1648, 4),
 (3321, 3247, 3),
 (3321, 3913, 3),
 (3321, 4072, 2),
 (3321, 4461, 2),
 (3321, 5071, 3),
 (3321, 5521, 2),
 (3321, 5601, 3),
 (3321, 5821, 4),
 (3321, 5840, 2),
 (3321, 6131, 1),
 (3321, 6849, 2),
 (3321, 8127, 1),
 (3321, 10787, 3),
 (3321, 11149, 1),
 (3321, 11263, 2),
 (3321, 11268, 2),
 (3321, 11292, 2),
 (3321, 11350, 3),
 (3321, 11740, 3),
 (3321, 12135, 4),
 (3321, 12367, 1),
 (3321, 12846, 1),
 (3321, 13592, 3),
 (3321, 13649, 1),
 (3321, 13687, 3),
 (3321, 13770, 1),
 (3321, 13866, 1),
 (3321, 14320, 4),
 (3321, 14435, 1),
 (3321, 14766, 1),
 (3321, 14938, 3),
 (3321, 14943, 1),
 (3321, 15392, 4),
 (3321, 15564, 3),
 (3321, 15738, 1),
 (3321, 15989, 1),
 (3321, 16352, 2),
 (3321, 16858, 3),
 (3321,

In [79]:
len(testset)

4530

### Applying ML models 
I will apply several ML models and store their RMSE and MAPE in the following dictionaries.

In [80]:
models_evaluation_train = dict()
models_evaluation_test = dict()

models_evaluation_train, models_evaluation_test

({}, {})
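
For reference, the two metrics stored in these dictionaries can be worked through on a tiny example. The ratings below are made up for illustration. MAPE is well defined here because ratings lie between 1 and 5 and are never zero.

```python
import numpy as np

y_true = np.array([4.0, 2.0, 5.0])   # made-up actual ratings
y_pred = np.array([3.0, 2.0, 4.0])   # made-up predicted ratings

# RMSE: square root of the mean squared error, sqrt((1 + 0 + 1) / 3)
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# MAPE: mean absolute error relative to the true rating, mean(1/4, 0, 1/5) * 100
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print(round(rmse, 4), round(mape, 2))
```

RMSE is expressed in rating stars (about 0.82 here), while MAPE is a percentage (about 15 here), so the two are not directly comparable.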

#### XgBoost

In [83]:
# compute RMSE and MAPE from actual and predicted ratings
def get_error_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rmse = np.sqrt(np.mean((y_true - y_pred)**2))
    mape = np.mean(np.abs((y_true - y_pred)/y_true)) * 100
    return rmse, mape

def run_xgboost(algo,  x_train, y_train, x_test, y_test, verbose=True):
    """
    It will return train_results and test_results
    """
    
    # dictionaries for storing train and test results
    train_results = dict()
    test_results = dict()
    
    
    # fit the model
    print('Training the model..')
    start =datetime.now()
    algo.fit(x_train, y_train, eval_metric = 'rmse')
    print('Done. Time taken : {}\n'.format(datetime.now()-start))
    print('Done \n')

    # from the trained model, get the predictions....
    print('Evaluating the model with TRAIN data...')
    start =datetime.now()
    y_train_pred = algo.predict(x_train)
    # get the rmse and mape of train data...
    rmse_train, mape_train = get_error_metrics(y_train.values, y_train_pred)
    
    # store the results in train_results dictionary..
    train_results = {'rmse': rmse_train,
                    'mape' : mape_train,
                    'predictions' : y_train_pred}
    
  
    # get the test data predictions and compute rmse and mape
    print('Evaluating Test data')
    y_test_pred = algo.predict(x_test) 
    rmse_test, mape_test = get_error_metrics(y_true=y_test.values, y_pred=y_test_pred)
    # store them in our test results dictionary.
    test_results = {'rmse': rmse_test,
                    'mape' : mape_test,
                    'predictions':y_test_pred}
    if verbose:
        print('-'*15)
        print('\nTEST DATA')
        print('-'*15)
        print('RMSE : ', rmse_test)
        print('MAPE : ', mape_test)
        
    # return these train and test results...
    return train_results, test_results

#### Surprise Model

In [84]:
# Fix the seeds so that all of our algorithms produce the same results
# every time they are run

my_seed = 15
random.seed(my_seed)
np.random.seed(my_seed)


def get_ratings(predictions):
    actual = np.array([pred.r_ui for pred in predictions])
    pred = np.array([pred.est for pred in predictions])
    
    return actual, pred


def get_errors(predictions, print_them=False):

    actual, pred = get_ratings(predictions)
    rmse = np.sqrt(np.mean((pred - actual)**2))
    mape = np.mean(np.abs(pred - actual)/actual)

    return rmse, mape*100


def run_surprise(algo, trainset, testset, verbose=True): 
    '''
        return train_dict, test_dict
    
        It returns two dictionaries, one for train and the other is for test
        Each of them have 3 key-value pairs, which specify ''rmse'', ''mape'', and ''predicted ratings''.
    '''
    start = datetime.now()
    # dictionaries that stores metrics for train and test..
    train = dict()
    test = dict()
    
    # train the algorithm with the trainset
    st = datetime.now()
    print('Training the model...')
    algo.fit(trainset)
    print('Done. time taken : {} \n'.format(datetime.now()-st))
    
    # ---------------- Evaluating train data--------------------#
    st = datetime.now()
    print('Evaluating the model with train data..')
    # get the train predictions (list of prediction class inside Surprise)
    train_preds = algo.test(trainset.build_testset())
    # get predicted ratings from the train predictions..
    train_actual_ratings, train_pred_ratings = get_ratings(train_preds)
    # get ''rmse'' and ''mape'' from the train predictions.
    train_rmse, train_mape = get_errors(train_preds)
    print('time taken : {}'.format(datetime.now()-st))
    
    if verbose:
        print('-'*15)
        print('Train Data')
        print('-'*15)
        print("RMSE : {}\n\nMAPE : {}\n".format(train_rmse, train_mape))
    
    #store them in the train dictionary
    if verbose:
        print('adding train results in the dictionary..')
    train['rmse'] = train_rmse
    train['mape'] = train_mape
    train['predictions'] = train_pred_ratings
    
    #------------ Evaluating Test data---------------#
    st = datetime.now()
    print('\nEvaluating for test data...')
    # get the predictions( list of prediction classes) of test data
    test_preds = algo.test(testset)
    # get the predicted ratings from the list of predictions
    test_actual_ratings, test_pred_ratings = get_ratings(test_preds)
    # get error metrics from the predicted and actual ratings
    test_rmse, test_mape = get_errors(test_preds)
    print('time taken : {}'.format(datetime.now()-st))
    
    if verbose:
        print('-'*15)
        print('Test Data')
        print('-'*15)
        print("RMSE : {}\n\nMAPE : {}\n".format(test_rmse, test_mape))
    # store them in test dictionary
    if verbose:
        print('storing the test results in test dictionary...')
    test['rmse'] = test_rmse
    test['mape'] = test_mape
    test['predictions'] = test_pred_ratings
    
    print('\n'+'-'*45)
    print('Total time taken to run this algorithm :', datetime.now() - start)
    
    # return two dictionaries train and test
    return train, test

#### XGBoost with 13 features

In [165]:
import xgboost as xgb

In [None]:
# prepare Train data
x_train = reg_train.drop(['user','movie','rating'], axis=1)
y_train = reg_train['rating']

# Prepare Test data
x_test = reg_test_df.drop(['user','movie','rating'], axis=1)
y_test = reg_test_df['rating']

Before running XGBRegressor, we tune its hyperparameters using grid search with cross-validation.
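
The grid search below uses `TimeSeriesSplit` for its folds. As a small illustration of what that splitter does, the sketch here applies it to 12 toy rows (an assumption standing in for `x_train`): each fold trains on a growing prefix of the rows and validates on the block that follows.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)   # 12 toy rows standing in for x_train

folds = [(train_idx.tolist(), test_idx.tolist())
         for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X)]

for train_idx, test_idx in folds:
    print("train on rows 0..{} -> validate on rows {}".format(train_idx[-1], test_idx))
```

This only mimics a temporal split if the rows are in chronological order; whether that holds for `reg_train` is not checked here.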

In [153]:
parameters = {'max_depth':[1,2,3],
              'learning_rate':[0.001,0.01,0.1],
              'n_estimators':[100,300,500,700,900,1100,1300]} 

In [154]:
start = datetime.now()

# Initialize Our first XGBoost model
first_xgb = xgb.XGBRegressor(nthread=-1)

# Perform cross validation 
gscv = GridSearchCV(first_xgb,
                    param_grid = parameters,
                    scoring="neg_mean_squared_error",
                    cv = TimeSeriesSplit(n_splits=5),
                    n_jobs = -1,
                    verbose = 1)
gscv_result = gscv.fit(x_train, y_train)

# Summarize results
print("Best: %f using %s" % (gscv_result.best_score_, gscv_result.best_params_))
means = gscv_result.cv_results_['mean_test_score']
stds = gscv_result.cv_results_['std_test_score']
params = gscv_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))   
    
print("\nTime Taken: ", datetime.now() - start)

Fitting 5 folds for each of 63 candidates, totalling 315 fits
Best: -0.017268 using {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 1300}
-1.012954 (0.030909) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 100}
-0.794077 (0.026827) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 300}
-0.639678 (0.022093) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 500}
-0.519820 (0.018441) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 700}
-0.424710 (0.015506) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 900}
-0.349214 (0.013238) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 1100}
-0.289241 (0.011460) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 1300}
-0.961326 (0.028048) with: {'learning_rate': 0.001, 'max_depth': 2, 'n_estimators': 100}
-0.665545 (0.020358) with: {'learning_rate': 0.001, 'max_depth': 2, 'n_estimators': 300}
-0.466127 (0.015410) with: {'learning_rate': 0.001, 
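
The scores above are negative mean squared errors (`scoring="neg_mean_squared_error"`), so values closer to zero are better. A minimal sketch of converting the best score into an RMSE on the cross-validation folds (variable names are illustrative):

```python
import math

best_neg_mse = -0.017268   # best_score_ printed by the grid search above

cv_mse = -best_neg_mse     # flip the sign to get a plain MSE
cv_rmse = math.sqrt(cv_mse)

print("CV MSE  :", cv_mse)
print("CV RMSE :", cv_rmse)
```

This RMSE is measured on folds of the training data only, so it should not be compared directly with the test RMSE reported by `run_xgboost`.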

In [155]:
# Create a new XGBRegressor instance (n_estimators=500 is used here, although the grid search above reported 1300 as best)
first_xgb = xgb.XGBRegressor(max_depth=3,learning_rate = 0.1,n_estimators=500,nthread=-1)
first_xgb

In [156]:
train_results, test_results = run_xgboost(first_xgb, x_train, y_train, x_test, y_test)

# store the results in models_evaluations dictionaries
models_evaluation_train['first_algo'] = train_results
models_evaluation_test['first_algo'] = test_results

xgb.plot_importance(first_xgb)
plt.savefig("sample/small/img/feature-importance-xgboost.png")

Training the model..




Done. Time taken : 0:00:00.507970

Done 

Evaluating the model with TRAIN data...
Evaluating Test data
---------------

TEST DATA
---------------
RMSE :  1.1075854286551927
MAPE :  38.06150792818722


#### Surprise Baseline Model

In [157]:
from surprise import BaselineOnly 

In [158]:
# Instantiate BaselineOnly
bsl_options = {'method': 'sgd',
               'reg':0.01,
               'learning_rate': 0.001,
               'n_epochs':120
               }
bsl_algo = BaselineOnly(bsl_options=bsl_options)
bsl_algo

<surprise.prediction_algorithms.baseline_only.BaselineOnly at 0x2b3d507c310>

In [159]:
%%time

# Run the algorithm; it returns the train and test results
bsl_train_results, bsl_test_results = run_surprise(bsl_algo, trainset, testset, verbose=True)


# Just store these error metrics in our models_evaluation datastructure
models_evaluation_train['bsl_algo'] = bsl_train_results 
models_evaluation_test['bsl_algo'] = bsl_test_results

Training the model...
Estimating biases using sgd...


Done. time taken : 0:00:00.626486 

Evaluating the model with train data..
time taken : 0:00:00.441202
---------------
Train Data
---------------
RMSE : 0.8828766422875534

MAPE : 26.93089077683406

adding train results in the dictionary..

Evaluating for test data...
time taken : 0:00:00.021342
---------------
Test Data
---------------
RMSE : 1.0822513101982842

MAPE : 36.634142696886315

storing the test results in test dictionary...

---------------------------------------------
Total time taken to run this algorithm : 0:00:01.090014
CPU times: total: 203 ms
Wall time: 1.1 s
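
To make the baseline model more concrete: it predicts a rating as the global mean plus a user bias plus a movie bias, and with `'method': 'sgd'` the biases are learned by stochastic gradient descent. Below is a minimal pure-Python sketch of that idea using the same `reg`, `learning_rate` and `n_epochs` values as above. The function names and toy ratings are my own, and this is not Surprise's actual code.

```python
def fit_baselines_sgd(ratings, reg=0.01, learning_rate=0.001, n_epochs=120):
    """ratings: list of (user, item, rating). Returns (mu, bu, bi)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)   # global mean
    bu = {u: 0.0 for u, _, _ in ratings}                # user biases
    bi = {i: 0.0 for _, i, _ in ratings}                # item biases
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i])
            # gradient step with L2 regularisation on each bias
            bu[u] += learning_rate * (err - reg * bu[u])
            bi[i] += learning_rate * (err - reg * bi[i])
    return mu, bu, bi

def predict_baseline(mu, bu, bi, user, item):
    # Unknown users or items contribute a zero bias
    return mu + bu.get(user, 0.0) + bi.get(item, 0.0)

# Toy ratings, purely illustrative
toy = [("u1", "m1", 5), ("u1", "m2", 4), ("u2", "m1", 3), ("u2", "m2", 2)]
mu, bu, bi = fit_baselines_sgd(toy)

print(predict_baseline(mu, bu, bi, "u1", "m1"))
print(predict_baseline(mu, bu, bi, "new_user", "new_movie"))
```

In this sketch a user and movie that never appear in the training data simply get the global mean as their prediction.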


## XGBoost with initial 13 features + Surprise Baseline predictor

updating train data

In [160]:
# add our baseline_predicted value as our feature..
reg_train['bslpr'] = models_evaluation_train['bsl_algo']['predictions']
reg_train.head()

Unnamed: 0,user,movie,GAvg,sur1,sur2,sur3,sur4,sur5,smr1,smr2,...,smr4,smr5,UAvg,MAvg,rating,bslpr,knn_bsl_u,knn_bsl_m,svd,svdpp
0,692,14621,3.61198,2.0,4.0,5.0,5.0,5.0,4.0,4.0,...,4.0,4.0,4.0,4.329095,4,4.299739,4.0,4.0,4.183244,4.010225
1,1179,2239,3.61198,5.0,3.0,3.0,2.0,2.0,3.0,5.0,...,3.0,4.0,3.714286,2.909091,5,3.294208,4.992258,4.889138,3.61713,3.475458
2,1179,4352,3.61198,4.0,3.0,2.0,3.0,3.0,4.0,4.0,...,4.0,5.0,3.714286,3.140845,3,3.273046,3.031813,2.952089,3.308964,3.084976
3,1179,6464,3.61198,3.0,5.0,4.0,2.0,4.0,4.0,4.0,...,5.0,3.0,3.714286,3.400396,4,3.537082,3.986084,3.716389,3.488253,3.685102
4,1179,6510,3.61198,4.0,4.0,3.0,4.0,3.0,5.0,4.0,...,3.0,4.0,3.714286,3.936614,4,4.021263,4.080001,4.014558,3.853108,4.039471


updating test data

In [161]:
# add that baseline predicted ratings with Surprise to the test data as well
reg_test_df['bslpr']  = models_evaluation_test['bsl_algo']['predictions']
reg_test_df.head()

Unnamed: 0,user,movie,GAvg,sur1,sur2,sur3,sur4,sur5,smr1,smr2,...,smr4,smr5,UAvg,MAvg,rating,bslpr,knn_bsl_u,knn_bsl_m,svd,svdpp
0,7,13072,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,...,3.530464,3.530464,3.530464,3.530464,5,3.61198,3.61198,3.61198,3.61198,3.61198
1,126,3418,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,...,3.530464,3.530464,3.530464,3.530464,5,3.61198,3.61198,3.61198,3.61198,3.61198
2,268,11740,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,...,3.530464,3.530464,3.530464,3.530464,5,3.61198,3.61198,3.61198,3.61198,3.61198
3,1809,705,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,...,3.530464,3.530464,3.530464,3.530464,4,3.61198,3.61198,3.61198,3.61198,3.61198
4,1809,1140,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,...,3.530464,3.530464,3.530464,3.530464,3,3.61198,3.61198,3.61198,3.61198,3.61198


In [162]:
# prepare train data
x_train = reg_train.drop(['user', 'movie','rating'], axis=1)
y_train = reg_train['rating']

# Prepare Test data
x_test = reg_test_df.drop(['user','movie','rating'], axis=1)
y_test = reg_test_df['rating']

Before running XGBRegressor, we tune its hyperparameters using grid-search cross-validation.
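
The grid search uses `cv = TimeSeriesSplit(n_splits=5)`, which always trains on earlier rows and validates on the rows that follow. A hand-rolled sketch of that expanding-window idea is shown here; `expanding_window_splits` is my own helper (not scikit-learn's code), and it assumes the rows are already in time order.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) with a growing training window."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

splits = list(expanding_window_splits(n_samples=12, n_splits=5))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Each fold validates only on rows that come after its training rows, so no later rating is used to predict an earlier one.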

In [163]:
start = datetime.now()

# Initialize the XGBoost model (named `model` so that it does not shadow the `xgb` module)
model = xgb.XGBRegressor(nthread=-1)

# Perform cross validation 
gscv = GridSearchCV(model,
                    param_grid = parameters,
                    scoring="neg_mean_squared_error",
                    cv = TimeSeriesSplit(n_splits=5),
                    n_jobs = -1,
                    verbose = 1)
gscv_result = gscv.fit(x_train, y_train)

# Summarize results
print("Best: %f using %s" % (gscv_result.best_score_, gscv_result.best_params_))
print()
means = gscv_result.cv_results_['mean_test_score']
stds = gscv_result.cv_results_['std_test_score']
params = gscv_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))  

print("\nTime Taken: ",datetime.now() -start)

Fitting 5 folds for each of 63 candidates, totalling 315 fits


Best: -0.011012 using {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 1300}

-1.012954 (0.030909) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 100}
-0.794077 (0.026827) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 300}
-0.639678 (0.022093) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 500}
-0.519820 (0.018441) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 700}
-0.424710 (0.015506) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 900}
-0.349214 (0.013238) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 1100}
-0.289241 (0.011460) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 1300}
-0.961326 (0.028048) with: {'learning_rate': 0.001, 'max_depth': 2, 'n_estimators': 100}
-0.665545 (0.020358) with: {'learning_rate': 0.001, 'max_depth': 2, 'n_estimators': 300}
-0.466127 (0.015410) with: {'learning_rate': 0.001, 'max_depth': 2, 'n_estimators': 500}
-0.329994 (0.012028) wit

In [166]:
# Create a new XGBRegressor instance (learning_rate=0.01 is used here, although the grid search above reported 0.1 as best)
xgb_bsl = xgb.XGBRegressor(max_depth=3,learning_rate = 0.01,n_estimators=1300,nthread=-1)
xgb_bsl

In [167]:
# Run XGBRegressor
train_results, test_results = run_xgboost(xgb_bsl, x_train, y_train, x_test, y_test)

# store the results in models_evaluations dictionaries
models_evaluation_train['xgb_bsl'] = train_results
models_evaluation_test['xgb_bsl'] = test_results

xgb.plot_importance(xgb_bsl)
plt.savefig("sample/small/img/feature-importance-xgboost+surprise.png")

Training the model..




Done. Time taken : 0:00:01.690594

Done 

Evaluating the model with TRAIN data...
Evaluating Test data
---------------

TEST DATA
---------------
RMSE :  1.1234619836629212
MAPE :  38.588619195862314


## Surprise KNNBaseline predictor

In [108]:
from surprise import KNNBaseline

Surprise KNNBaseline with user user similarities

In [109]:
# sim_options tells the algorithm how to compute similarities and what to consider
sim_options = {'user_based' : True,
               'name': 'pearson_baseline',
               'shrinkage': 100,
               'min_support': 2
              } 
# we keep other parameters like regularization parameter and learning_rate as default values.
bsl_options = {'method': 'sgd'} 

knn_bsl_u = KNNBaseline(k=40, sim_options = sim_options, bsl_options = bsl_options)
knn_bsl_u_train_results, knn_bsl_u_test_results = run_surprise(knn_bsl_u, trainset, testset, verbose=True)

# Just store these error metrics in our models_evaluation datastructure
models_evaluation_train['knn_bsl_u'] = knn_bsl_u_train_results 
models_evaluation_test['knn_bsl_u'] = knn_bsl_u_test_results

Training the model...
Estimating biases using sgd...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Done. time taken : 0:00:05.905061 

Evaluating the model with train data..
time taken : 0:00:43.490439
---------------
Train Data
---------------
RMSE : 0.2941885927227935

MAPE : 7.815052348749038

adding train results in the dictionary..

Evaluating for test data...
time taken : 0:00:00.014664
---------------
Test Data
---------------
RMSE : 1.0822531134111517

MAPE : 36.63794599842355

storing the test results in test dictionary...

---------------------------------------------
Total time taken to run this algorithm : 0:00:49.410164
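
For intuition about `sim_options`: with `pearson_baseline`, the raw correlation between two users is shrunk towards zero when they share few ratings, and KNNBaseline then corrects the baseline estimate using the neighbours' deviations from their own baselines. A minimal sketch of both steps, with hypothetical helper names and toy numbers (Surprise's real implementation differs in detail):

```python
def shrink(correlation, n_common, shrinkage=100):
    """Shrink a correlation according to the number of common ratings."""
    return (n_common - 1) / (n_common - 1 + shrinkage) * correlation

def knn_baseline_predict(baseline_ui, neighbours):
    """neighbours: list of (similarity, neighbour_rating, neighbour_baseline)."""
    # Simplification: keep only neighbours with positive similarity
    used = [(s, r, b) for s, r, b in neighbours if s > 0]
    den = sum(s for s, _, _ in used)
    if den == 0:
        return baseline_ui          # no usable neighbours: fall back to baseline
    num = sum(s * (r - b) for s, r, b in used)
    return baseline_ui + num / den

few = shrink(0.9, n_common=2)       # only 2 common ratings: heavily shrunk
many = shrink(0.9, n_common=201)    # 201 common ratings: much less shrunk
print(few, many)

pred = knn_baseline_predict(3.6, [(0.6, 5, 4.0), (0.2, 3, 3.5)])
print(pred)
```

With `shrinkage: 100` and `min_support: 2`, pairs of users with only a handful of common movies contribute very little to the prediction.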


Surprise KNNBaseline with movie movie similarities

In [110]:
# Specify how to compute similarities via sim_options

# 'user_based': False => use similarities between movies instead of users

sim_options = {'user_based' : False,
               'name': 'pearson_baseline',
               'shrinkage': 100,
               'min_support': 2
              } 
# we keep other parameters like regularization parameter and learning_rate as default values.
bsl_options = {'method': 'sgd'}


knn_bsl_m = KNNBaseline(k=40, sim_options = sim_options, bsl_options = bsl_options)

knn_bsl_m_train_results, knn_bsl_m_test_results = run_surprise(knn_bsl_m, trainset, testset, verbose=True)

# Just store these error metrics in our models_evaluation datastructure
models_evaluation_train['knn_bsl_m'] = knn_bsl_m_train_results 
models_evaluation_test['knn_bsl_m'] = knn_bsl_m_test_results

Training the model...
Estimating biases using sgd...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Done. time taken : 0:00:00.217665 

Evaluating the model with train data..
time taken : 0:00:01.896365
---------------
Train Data
---------------
RMSE : 0.26140322547389855

MAPE : 6.516388525578446

adding train results in the dictionary..

Evaluating for test data...
time taken : 0:00:00.034854
---------------
Test Data
---------------
RMSE : 1.0821949464872673

MAPE : 36.63194849597806

storing the test results in test dictionary...

---------------------------------------------
Total time taken to run this algorithm : 0:00:02.148884


## XGBoost with initial 13 features + Surprise Baseline predictor + KNNBaseline predictor

- First I will run XGBoost with the predictions from both KNN models (which use user-user and item-item similarities), along with our previous features.
- Then I will run XGBoost with just the predictions from both KNN models and the predictions from our baseline model.

preparing train data

In [111]:
# add the predicted values from both knns to this dataframe
reg_train['knn_bsl_u'] = models_evaluation_train['knn_bsl_u']['predictions']
reg_train['knn_bsl_m'] = models_evaluation_train['knn_bsl_m']['predictions']

reg_train.head()

Unnamed: 0,user,movie,GAvg,sur1,sur2,sur3,sur4,sur5,smr1,smr2,smr3,smr4,smr5,UAvg,MAvg,rating,bslpr,knn_bsl_u,knn_bsl_m
0,692,14621,3.61198,2.0,4.0,5.0,5.0,5.0,4.0,4.0,4.0,4.0,4.0,4.0,4.329095,4,4.299739,4.0,4.0
1,1179,2239,3.61198,5.0,3.0,3.0,2.0,2.0,3.0,5.0,4.0,3.0,4.0,3.714286,2.909091,5,3.294208,4.992258,4.889138
2,1179,4352,3.61198,4.0,3.0,2.0,3.0,3.0,4.0,4.0,3.0,4.0,5.0,3.714286,3.140845,3,3.273046,3.031813,2.952089
3,1179,6464,3.61198,3.0,5.0,4.0,2.0,4.0,4.0,4.0,2.0,5.0,3.0,3.714286,3.400396,4,3.537082,3.986084,3.716389
4,1179,6510,3.61198,4.0,4.0,3.0,4.0,3.0,5.0,4.0,4.0,3.0,4.0,3.714286,3.936614,4,4.021263,4.080001,4.014558


preparing test data

In [113]:
reg_test_df['knn_bsl_u'] = models_evaluation_test['knn_bsl_u']['predictions']
reg_test_df['knn_bsl_m'] = models_evaluation_test['knn_bsl_m']['predictions']

reg_test_df.head()

Unnamed: 0,user,movie,GAvg,sur1,sur2,sur3,sur4,sur5,smr1,smr2,smr3,smr4,smr5,UAvg,MAvg,rating,bslpr,knn_bsl_u,knn_bsl_m
0,7,13072,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,5,3.61198,3.61198,3.61198
1,126,3418,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,5,3.61198,3.61198,3.61198
2,268,11740,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,5,3.61198,3.61198,3.61198
3,1809,705,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,4,3.61198,3.61198,3.61198
4,1809,1140,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3,3.61198,3.61198,3.61198


In [114]:
# prepare the train data....
x_train = reg_train.drop(['user', 'movie', 'rating'], axis=1)
y_train = reg_train['rating']

# prepare the test data....
x_test = reg_test_df.drop(['user','movie','rating'], axis=1)
y_test = reg_test_df['rating']

Before running XGBRegressor, we tune its hyperparameters using grid-search cross-validation.

In [115]:
start = datetime.now()

# Initialize Our first XGBoost model
model = xgb.XGBRegressor(nthread=-1)

# Perform cross validation 
gscv = GridSearchCV(model,
                    param_grid = parameters,
                    scoring="neg_mean_squared_error",
                    cv = TimeSeriesSplit(n_splits=5),
                    n_jobs = -1,
                    verbose = 1)
gscv_result = gscv.fit(x_train, y_train)

# Summarize results
print("Best: %f using %s" % (gscv_result.best_score_, gscv_result.best_params_))
print()
means = gscv_result.cv_results_['mean_test_score']
stds = gscv_result.cv_results_['std_test_score']
params = gscv_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))  

print("\nTime Taken: ",datetime.now() - start)

Fitting 5 folds for each of 63 candidates, totalling 315 fits
Best: -0.011088 using {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 1300}

-1.012954 (0.030909) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 100}
-0.794077 (0.026827) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 300}
-0.639678 (0.022093) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 500}
-0.519820 (0.018441) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 700}
-0.424710 (0.015506) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 900}
-0.349214 (0.013238) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 1100}
-0.289241 (0.011460) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 1300}
-0.961326 (0.028048) with: {'learning_rate': 0.001, 'max_depth': 2, 'n_estimators': 100}
-0.665545 (0.020358) with: {'learning_rate': 0.001, 'max_depth': 2, 'n_estimators': 300}
-0.466127 (0.015410) with: {'learning_rate': 0.001,

In [116]:
# Create a new XGBRegressor instance (n_estimators=300 is used here, although the grid search above reported 1300 as best)
xgb_knn_bsl = xgb.XGBRegressor(max_depth=3,learning_rate = 0.1,n_estimators=300,nthread=-1)
xgb_knn_bsl

In [118]:
train_results, test_results = run_xgboost(xgb_knn_bsl, x_train, y_train, x_test, y_test)

# store the results in models_evaluations dictionaries
models_evaluation_train['xgb_knn_bsl'] = train_results
models_evaluation_test['xgb_knn_bsl'] = test_results


xgb.plot_importance(xgb_knn_bsl)
plt.savefig("sample/small/img/feature-importance-xgb+bsl+knn.png")

Training the model..




Done. Time taken : 0:00:00.468740

Done 

Evaluating the model with TRAIN data...
Evaluating Test data
---------------

TEST DATA
---------------
RMSE :  1.11901091975606
MAPE :  38.45632696116646


## Matrix Factorization Techniques

### SVD

In [174]:
from surprise import SVD

In [175]:
# initialize the model
svd = SVD(n_factors=100, biased=True, random_state=30, verbose=True)
svd_train_results, svd_test_results = run_surprise(svd, trainset, testset, verbose=True)

# Just store these error metrics in our models_evaluation datastructure
models_evaluation_train['svd'] = svd_train_results 
models_evaluation_test['svd'] = svd_test_results

Training the model...
Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11


Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19
Done. time taken : 0:00:00.382676 

Evaluating the model with train data..
time taken : 0:00:00.450654
---------------
Train Data
---------------
RMSE : 0.652056585294612

MAPE : 19.474744533179717

adding train results in the dictionary..

Evaluating for test data...
time taken : 0:00:00.018005
---------------
Test Data
---------------
RMSE : 1.0821912114098393

MAPE : 36.63640232612494

storing the test results in test dictionary...

---------------------------------------------
Total time taken to run this algorithm : 0:00:00.864688
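
As a reminder of what the trained model computes: with `biased=True`, SVD predicts a rating as the global mean plus the user and movie biases plus the dot product of the user and movie factor vectors. A minimal sketch with made-up numbers follows (two factors instead of 100; `svd_predict` and `clip_rating` are hypothetical helpers, not Surprise functions):

```python
def svd_predict(mu, b_u, b_i, p_u, q_i):
    """mu + b_u + b_i + q_i . p_u for one (user, movie) pair."""
    interaction = sum(p * q for p, q in zip(p_u, q_i))
    return mu + b_u + b_i + interaction

def clip_rating(value, low=1.0, high=5.0):
    """Keep a prediction inside the 1-5 star scale."""
    return max(low, min(high, value))

# Toy parameters, purely illustrative
mu, b_u, b_i = 3.6, 0.2, -0.1
p_u = [0.5, -1.0]   # user factors
q_i = [0.4, 0.1]    # movie factors

svd_pred = clip_rating(svd_predict(mu, b_u, b_i, p_u, q_i))
print(svd_pred)
```

Training (the "Processing epoch" lines above) adjusts the biases and factor vectors so that these predictions match the known ratings.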


### SVD++: Matrix Factorization with implicit feedback from the user (the set of movies the user rated)

In [176]:
from surprise import SVDpp

In [177]:
# initialize the model
svdpp = SVDpp(n_factors=50, random_state=30, verbose=True)
svdpp_train_results, svdpp_test_results = run_surprise(svdpp, trainset, testset, verbose=True)

# Just store these error metrics in our models_evaluation datastructure
models_evaluation_train['svdpp'] = svdpp_train_results 
models_evaluation_test['svdpp'] = svdpp_test_results

Training the model...
 processing epoch 0


 processing epoch 1
 processing epoch 2
 processing epoch 3
 processing epoch 4
 processing epoch 5
 processing epoch 6
 processing epoch 7
 processing epoch 8
 processing epoch 9
 processing epoch 10
 processing epoch 11
 processing epoch 12
 processing epoch 13
 processing epoch 14
 processing epoch 15
 processing epoch 16
 processing epoch 17
 processing epoch 18
 processing epoch 19
Done. time taken : 0:00:04.654456 

Evaluating the model with train data..
time taken : 0:00:01.861405
---------------
Train Data
---------------
RMSE : 0.5906323996698382

MAPE : 17.159395676027554

adding train results in the dictionary..

Evaluating for test data...
time taken : 0:00:00.016660
---------------
Test Data
---------------
RMSE : 1.0822543401393163

MAPE : 36.641062221542555

storing the test results in test dictionary...

---------------------------------------------
Total time taken to run this algorithm : 0:00:06.532521
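
SVD++ extends the SVD prediction by adding, to the user's factor vector, a normalised sum of implicit factor vectors, one for each movie the user has rated. A minimal sketch with made-up numbers (`svdpp_predict` is a hypothetical helper, not a Surprise function):

```python
import math

def svdpp_predict(mu, b_u, b_i, p_u, q_i, implicit_factors):
    """implicit_factors: one factor vector y_j per movie the user has rated."""
    n_rated = len(implicit_factors)
    if n_rated:
        norm = 1.0 / math.sqrt(n_rated)
        implicit = [norm * sum(y[f] for y in implicit_factors)
                    for f in range(len(p_u))]
    else:
        implicit = [0.0] * len(p_u)
    # Explicit user factors plus the implicit-feedback term
    user_vec = [p + z for p, z in zip(p_u, implicit)]
    return mu + b_u + b_i + sum(q * v for q, v in zip(q_i, user_vec))

# Toy parameters, purely illustrative
mu, b_u, b_i = 3.6, 0.2, -0.1
p_u, q_i = [0.5, -1.0], [0.4, 0.1]
y = [[0.1, 0.0], [0.1, 0.2], [0.0, 0.2], [0.2, 0.0]]   # 4 rated movies

with_implicit = svdpp_predict(mu, b_u, b_i, p_u, q_i, y)
without_implicit = svdpp_predict(mu, b_u, b_i, p_u, q_i, [])
print(with_implicit, without_implicit)
```

With no rated movies the implicit term vanishes and the prediction reduces to the plain biased SVD form.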


## XGBoost with 13 features + Surprise Baseline + Surprise KNNBaseline + MF Techniques

Preparing Train data

In [125]:
# add the predicted values from SVD and SVD++ to this dataframe
reg_train['svd'] = models_evaluation_train['svd']['predictions']
reg_train['svdpp'] = models_evaluation_train['svdpp']['predictions']

reg_train.head() 

Unnamed: 0,user,movie,GAvg,sur1,sur2,sur3,sur4,sur5,smr1,smr2,...,smr4,smr5,UAvg,MAvg,rating,bslpr,knn_bsl_u,knn_bsl_m,svd,svdpp
0,692,14621,3.61198,2.0,4.0,5.0,5.0,5.0,4.0,4.0,...,4.0,4.0,4.0,4.329095,4,4.299739,4.0,4.0,4.183244,4.010225
1,1179,2239,3.61198,5.0,3.0,3.0,2.0,2.0,3.0,5.0,...,3.0,4.0,3.714286,2.909091,5,3.294208,4.992258,4.889138,3.61713,3.475458
2,1179,4352,3.61198,4.0,3.0,2.0,3.0,3.0,4.0,4.0,...,4.0,5.0,3.714286,3.140845,3,3.273046,3.031813,2.952089,3.308964,3.084976
3,1179,6464,3.61198,3.0,5.0,4.0,2.0,4.0,4.0,4.0,...,5.0,3.0,3.714286,3.400396,4,3.537082,3.986084,3.716389,3.488253,3.685102
4,1179,6510,3.61198,4.0,4.0,3.0,4.0,3.0,5.0,4.0,...,3.0,4.0,3.714286,3.936614,4,4.021263,4.080001,4.014558,3.853108,4.039471


Preparing Test data

In [126]:
reg_test_df['svd'] = models_evaluation_test['svd']['predictions']
reg_test_df['svdpp'] = models_evaluation_test['svdpp']['predictions']

reg_test_df.head()

Unnamed: 0,user,movie,GAvg,sur1,sur2,sur3,sur4,sur5,smr1,smr2,...,smr4,smr5,UAvg,MAvg,rating,bslpr,knn_bsl_u,knn_bsl_m,svd,svdpp
0,7,13072,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,...,3.530464,3.530464,3.530464,3.530464,5,3.61198,3.61198,3.61198,3.61198,3.61198
1,126,3418,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,...,3.530464,3.530464,3.530464,3.530464,5,3.61198,3.61198,3.61198,3.61198,3.61198
2,268,11740,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,...,3.530464,3.530464,3.530464,3.530464,5,3.61198,3.61198,3.61198,3.61198,3.61198
3,1809,705,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,...,3.530464,3.530464,3.530464,3.530464,4,3.61198,3.61198,3.61198,3.61198,3.61198
4,1809,1140,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,3.530464,...,3.530464,3.530464,3.530464,3.530464,3,3.61198,3.61198,3.61198,3.61198,3.61198


In [127]:
# prepare x_train and y_train
x_train = reg_train.drop(['user', 'movie', 'rating',], axis=1)
y_train = reg_train['rating']

# prepare test data
x_test = reg_test_df.drop(['user', 'movie', 'rating'], axis=1)
y_test = reg_test_df['rating']

Before running XGBRegressor, we tune its hyperparameters using grid-search cross-validation.

In [128]:
start = datetime.now()

# Initialize Our first XGBoost model
model = xgb.XGBRegressor(nthread=-1)

# Perform cross validation 
gscv = GridSearchCV(model,
                    param_grid = parameters,
                    scoring="neg_mean_squared_error",
                    cv = TimeSeriesSplit(n_splits=5),
                    n_jobs = -1,
                    verbose = 1)
gscv_result = gscv.fit(x_train, y_train)

# Summarize results
print("Best: %f using %s" % (gscv_result.best_score_, gscv_result.best_params_))
print()
means = gscv_result.cv_results_['mean_test_score']
stds = gscv_result.cv_results_['std_test_score']
params = gscv_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))  

print("\nTime Taken: ",datetime.now() - start)

Fitting 5 folds for each of 63 candidates, totalling 315 fits
Best: -0.011012 using {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 1300}

-1.012954 (0.030909) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 100}
-0.794077 (0.026827) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 300}
-0.639678 (0.022093) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 500}
-0.519820 (0.018441) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 700}
-0.424710 (0.015506) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 900}
-0.349214 (0.013238) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 1100}
-0.289241 (0.011460) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 1300}
-0.961326 (0.028048) with: {'learning_rate': 0.001, 'max_depth': 2, 'n_estimators': 100}
-0.665545 (0.020358) with: {'learning_rate': 0.001, 'max_depth': 2, 'n_estimators': 300}
-0.466127 (0.015410) with: {'learning_rate': 0.001,

In [129]:
# Create a new XGBRegressor instance (learning_rate=0.01 is used here, although the grid search above reported 0.1 as best)
xgb_final = xgb.XGBRegressor(max_depth=3,learning_rate = 0.01,n_estimators=1300,nthread=-1)
xgb_final

In [131]:
train_results, test_results = run_xgboost(xgb_final, x_train, y_train, x_test, y_test)

# store the results in models_evaluations dictionaries
models_evaluation_train['xgb_final'] = train_results
models_evaluation_test['xgb_final'] = test_results


xgb.plot_importance(xgb_final)
plt.savefig("sample/small/img/feature-importance-xgb+knn+bsl+svd.png")

Training the model..




Done. Time taken : 0:00:01.771107

Done 

Evaluating the model with TRAIN data...
Evaluating Test data
---------------

TEST DATA
---------------
RMSE :  1.1234619836629212
MAPE :  38.588619195862314


## XGBoost with Surprise Baseline + Surprise KNNBaseline + MF Techniques

In [134]:
# prepare train data
x_train = reg_train[['knn_bsl_u', 'knn_bsl_m', 'svd', 'svdpp']]
y_train = reg_train['rating']

# test data
x_test = reg_test_df[['knn_bsl_u', 'knn_bsl_m', 'svd', 'svdpp']]
y_test = reg_test_df['rating']

Before running XGBRegressor, we tune its hyperparameters using grid-search cross-validation.

In [135]:
start = datetime.now()

# Initialize Our first XGBoost model
model = xgb.XGBRegressor(nthread=-1)

# Perform cross validation 
gscv = GridSearchCV(model,
                    param_grid = parameters,
                    scoring="neg_mean_squared_error",
                    cv = TimeSeriesSplit(n_splits=5),
                    n_jobs = -1,
                    verbose = 1)
gscv_result = gscv.fit(x_train, y_train)

# Summarize results
print("Best: %f using %s" % (gscv_result.best_score_, gscv_result.best_params_))
print()
means = gscv_result.cv_results_['mean_test_score']
stds = gscv_result.cv_results_['std_test_score']
params = gscv_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))  

print("\nTime Taken: ",datetime.now() - start)

Fitting 5 folds for each of 63 candidates, totalling 315 fits
Best: -0.017268 using {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 1300}

-1.012954 (0.030909) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 100}
-0.794077 (0.026827) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 300}
-0.639678 (0.022093) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 500}
-0.519820 (0.018441) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 700}
-0.424710 (0.015506) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 900}
-0.349214 (0.013238) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 1100}
-0.289241 (0.011460) with: {'learning_rate': 0.001, 'max_depth': 1, 'n_estimators': 1300}
-0.961326 (0.028048) with: {'learning_rate': 0.001, 'max_depth': 2, 'n_estimators': 100}
-0.665545 (0.020358) with: {'learning_rate': 0.001, 'max_depth': 2, 'n_estimators': 300}
-0.466127 (0.015410) with: {'learning_rate': 0.001,

In [136]:
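# Create a new XGBRegressor instance (max_depth=1, learning_rate=0.01, n_estimators=700 are used here, although the grid search above reported 3, 0.1 and 1300 as best)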
xgb_all_models = xgb.XGBRegressor(max_depth=1,learning_rate = 0.01,n_estimators=700,nthread=-1)
xgb_all_models

In [137]:
train_results, test_results = run_xgboost(xgb_all_models, x_train, y_train, x_test, y_test)

# store the results in models_evaluations dictionaries
models_evaluation_train['xgb_all_models'] = train_results
models_evaluation_test['xgb_all_models'] = test_results

xgb.plot_importance(xgb_all_models)
plt.savefig("sample/small/img/feature-importance-xgb_with_bsl+knn+svd.png")

Training the model..




Done. Time taken : 0:00:00.460385

Done 

Evaluating the model with TRAIN data...
Evaluating Test data
---------------

TEST DATA
---------------
RMSE :  1.1367862682953647
MAPE :  38.96381974913251


# Comparison of all models

In [178]:
# Saving our TEST_RESULTS into a dataframe to avoid running it again
pd.DataFrame(models_evaluation_test).to_csv('sample/small/models-rmse-comparison.csv')
models = pd.read_csv('sample/small/models-rmse-comparison.csv', index_col=0)
models.loc['rmse'].sort_values()

svd               1.0821912114098393
knn_bsl_m         1.0821949464872673
bsl_algo          1.0822513101982842
knn_bsl_u         1.0822531134111517
svdpp             1.0822543401393163
first_algo        1.1075854286551927
xgb_knn_bsl         1.11901091975606
xgb_bsl           1.1234619836629212
xgb_final         1.1234619836629212
xgb_all_models    1.1367862682953647
Name: rmse, dtype: object
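
A small sketch that ranks the test RMSE values listed above and measures how far the best one is from the RMSE of 0.9514 quoted for Netflix's own system in the SOTA section. The dictionary simply copies the numbers from the output; the variable names are my own.

```python
test_rmse = {
    'svd': 1.0821912114098393,
    'knn_bsl_m': 1.0821949464872673,
    'bsl_algo': 1.0822513101982842,
    'knn_bsl_u': 1.0822531134111517,
    'svdpp': 1.0822543401393163,
    'first_algo': 1.1075854286551927,
    'xgb_knn_bsl': 1.11901091975606,
    'xgb_bsl': 1.1234619836629212,
    'xgb_final': 1.1234619836629212,
    'xgb_all_models': 1.1367862682953647,
}

# Lower RMSE is better, so sort ascending
ranked = sorted(test_rmse.items(), key=lambda kv: kv[1])
best_model, best_rmse = ranked[0]

netflix_rmse = 0.9514   # figure quoted from the linked paper
gap_pct = 100 * (best_rmse - netflix_rmse) / netflix_rmse

print("best model :", best_model, best_rmse)
print("gap to Netflix's system (%):", round(gap_pct, 2))
```

The five Surprise models are within about 0.0001 RMSE of each other, while all the XGBoost variants are higher on this sample.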

# Which is best method/model and Conclusion

- ### The Surprise SVD model (svd) achieved the lowest test RMSE (1.0822) among all the models I tried; the XGBoost-based models, including xgb_bsl, scored between 1.1076 and 1.1368.
- ### Because of the computational power and time required, I completed this case study on a (8000,800) training dataset and a (4000,400) testing dataset.
- ### The approach followed is the one described in [this](https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf) research paper.
- ### Only small differences in RMSE are observed between the models; the results would likely improve if the whole dataset were used for modeling (not feasible at the moment).

# SOTA Comparison

### As per [this](https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf) research paper, Netflix's own system achieves **RMSE = 0.9514** on the same dataset.
### The grand prize's required accuracy was RMSE = 0.8563, and the prize was won by the BellKor's Pragmatic Chaos team (a merger that included BigChaos) with **RMSE = 0.8567**.

### I achieved a minimum **RMSE = 1.0821912114098393** (Surprise SVD, on the reduced sample).