# McKinsey Hackathon - Recommendation Design
I am going to use collaborative filtering here, since it is generally found to work well in most situations. Because no ratings are given for any challenge, we will use the challenges a user has solved as implicit feedback, so our ratings matrix will contain only 0-1 values. We will apply a model-based (latent factor) collaborative filtering model, as that was one of the best algorithms I found in last week's research on recommendation systems. There are other methods too, but given the time limits and my limited knowledge, we will stick to latent factor models and try to have some fun with them. We will use the [implicit](http://implicit.readthedocs.io/en/latest/) package to compute an ALS-based matrix factorization model.
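
As a rough illustration of what a latent factor model computes, here is a toy sketch with made-up numbers (it is not the model fitted in this notebook, and the names `user_factors` and `challenge_factors` are hypothetical). Each user and each challenge gets a short vector, and the predicted preference is their dot product.

```python
import numpy as np

# Hypothetical 2-factor embeddings for 3 users and 4 challenges (made-up values).
user_factors = np.array([[1.0, 0.0],
                         [0.0, 1.0],
                         [0.8, 0.2]])
challenge_factors = np.array([[0.9, 0.1],
                              [0.2, 0.8],
                              [0.5, 0.5],
                              [0.0, 1.0]])

# Predicted preference of every user for every challenge: one dot product per pair.
scores = user_factors @ challenge_factors.T   # shape (3, 4)

# Index of the highest-scoring challenge for each user.
best = scores.argmax(axis=1)
```

ALS is one way of learning such vectors from the 0-1 matrix; the rest of the notebook leaves that step to the package.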

In [321]:
%matplotlib inline
import numpy as np
import scipy as sp
import pandas as pd
# pd.set_option('display.width', 500)
# pd.set_option('display.max_columns', 100)
# pd.set_option('display.notebook_repr_html', True)
import implicit
import os

In [322]:
os.getcwd()

'D:\\DS\\AV\\McK Hack 3'

## Processing the data

In [323]:
train = pd.read_csv("train.csv", header=0)

In [324]:
train.head(4)

Unnamed: 0,user_sequence,user_id,challenge_sequence,challenge
0,4576_1,4576,1,CI23714
1,4576_2,4576,2,CI23855
2,4576_3,4576,3,CI24917
3,4576_4,4576,4,CI23663


Checking whether there are any null values in any of the data frames.

In [325]:
train.isnull().sum()

user_sequence         0
user_id               0
challenge_sequence    0
challenge             0
dtype: int64

In [326]:
test = pd.read_csv("test.csv", header= 0)
test.head(4)

Unnamed: 0,user_sequence,user_id,challenge_sequence,challenge
0,4577_1,4577,1,CI23855
1,4577_2,4577,2,CI23933
2,4577_3,4577,3,CI24917
3,4577_4,4577,4,CI24915


In [327]:
test.isnull().sum()

user_sequence         0
user_id               0
challenge_sequence    0
challenge             0
dtype: int64

In [328]:
challenges = pd.read_csv("challenge_data.csv", header= 0)
challenges.head(4)

Unnamed: 0,challenge_ID,programming_language,challenge_series_ID,total_submissions,publish_date,author_ID,author_gender,author_org_ID,category_id
0,CI23478,2,SI2445,37.0,06-05-2006,AI563576,M,AOI100001,
1,CI23479,2,SI2435,48.0,17-10-2002,AI563577,M,AOI100002,32.0
2,CI23480,1,SI2435,15.0,16-10-2002,AI563578,M,AOI100003,
3,CI23481,1,SI2710,236.0,19-09-2003,AI563579,M,AOI100004,70.0


In [329]:
challenges.isnull().sum()

challenge_ID               0
programming_language       0
challenge_series_ID       12
total_submissions        352
publish_date               0
author_ID                 39
author_gender             97
author_org_ID            248
category_id             1841
dtype: int64

In [330]:
challenges.describe()

Unnamed: 0,programming_language,total_submissions,category_id
count,5606.0,5254.0,3765.0
mean,1.081877,348.362581,81.083665
std,0.316487,1044.810816,56.367797
min,1.0,2.0,22.0
25%,1.0,67.0,36.0
50%,1.0,134.0,66.0
75%,1.0,297.0,113.0
max,3.0,43409.0,304.0


In [331]:
print( "unique challenges | unique users in train set:", len(train.challenge.unique()),"|", len(train.user_id.unique()))

unique challenges | unique users in train set: 5348 | 69532


In [332]:
print("Total unique challenges:", len(challenges.challenge_ID.unique()))

Total unique challenges: 5606


In [333]:
print( "unique challenges | unique users in test set:", len(test.challenge.unique()), "|" ,len(test.user_id.unique()))

unique challenges | unique users in test set: 4477 | 39732


Since we are going to use model-based collaborative filtering, we won't need the challenges dataset. Although we don't have ratings for any challenge, I will refer to the user-challenge matrix as the "ratings" matrix below.

In [334]:
import scipy.sparse as sparse
from scipy.sparse.linalg import spsolve

## ALS for implicit feedback
Now we will apply ALS to the complete dataset, obtained by concatenating the train and test dataframes.

In [346]:
df = pd.concat([train, test], axis=0, ignore_index=True) 

In [347]:
df.head()

Unnamed: 0,user_sequence,user_id,challenge_sequence,challenge
0,4576_1,4576,1,CI23714
1,4576_2,4576,2,CI23855
2,4576_3,4576,3,CI24917
3,4576_4,4576,4,CI23663
4,4576_5,4576,5,CI23933


In [348]:
len(df.challenge.unique())

5502

In [349]:
df = df[["user_id", "challenge"]]

In [350]:
groupby_df = df.groupby(["challenge","user_id"]).size().reset_index(name = "Done")

In [351]:
groupby_df.head(5)

Unnamed: 0,challenge,user_id,Done
0,CI23478,32876,1
1,CI23478,83661,1
2,CI23478,88820,1
3,CI23478,91425,1
4,CI23478,97150,1


In [352]:
user_list = list(np.sort(groupby_df.user_id.unique()))
challenge_list = list(groupby_df.challenge.unique())
done_list = list(groupby_df.Done)

In [353]:
# astype("category", categories=...) was removed from pandas; pd.Categorical gives the same codes
rows = pd.Categorical(groupby_df.user_id, categories=user_list).codes
cols = pd.Categorical(groupby_df.challenge, categories=challenge_list).codes

In [354]:
df_sparse = sparse.csr_matrix((done_list, (rows, cols)), shape = (len(user_list), len(challenge_list)))
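
The cells above map IDs to row and column positions and then build the sparse matrix. A minimal sketch of the same idea on a toy log follows; the IDs and the `toy_*` names are made up for illustration and are not taken from the competition data.

```python
import pandas as pd
import scipy.sparse as sparse

# Toy interaction log (made-up IDs).
toy = pd.DataFrame({
    "user_id":   [10, 10, 20, 30],
    "challenge": ["CI1", "CI2", "CI2", "CI3"],
})

# Fix the row and column order explicitly so codes can be mapped back to IDs.
toy_users = sorted(toy.user_id.unique())
toy_challenges = list(toy.challenge.unique())

toy_rows = pd.Categorical(toy.user_id, categories=toy_users).codes
toy_cols = pd.Categorical(toy.challenge, categories=toy_challenges).codes

# One stored 1 per (user, challenge) pair; everything else is an implicit 0.
toy_matrix = sparse.csr_matrix(
    ([1] * len(toy), (toy_rows, toy_cols)),
    shape=(len(toy_users), len(toy_challenges)),
)
dense = toy_matrix.toarray()
```

Keeping `toy_users` and `toy_challenges` around matters: row `i` of the matrix is `toy_users[i]` and column `j` is `toy_challenges[j]`, which is what lets recommendations be translated back to challenge IDs later.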

In [355]:
df_sparse

<109264x5502 sparse matrix of type '<class 'numpy.int64'>'
	with 1301236 stored elements in Compressed Sparse Row format>

In [357]:
data = (df_sparse*50).astype('double')
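
For context, the objective usually associated with ALS on implicit feedback (Hu, Koren and Volinsky, 2008) is sketched below. Here $r_{ui}$ is the raw 0-1 entry, $p_{ui}$ is 1 when $r_{ui} > 0$ and 0 otherwise, $x_u$ and $y_i$ are the user and challenge factors, and $\lambda$ is the regularization strength. Reading the multiplication by 50 above as a confidence scale comparable to $\alpha$ is an assumption, not a description of the package internals.

$$
\min_{x,\,y} \; \sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2
\;+\; \lambda \Bigl( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \Bigr),
\qquad c_{ui} = 1 + \alpha\, r_{ui}
$$

Under this reading, a larger scale makes the model trust solved challenges more relative to the many unobserved pairs.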

In [358]:
model = implicit.als.AlternatingLeastSquares(factors=400, iterations=100, regularization=100)
model.fit(data)
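# Older implicit releases expect an item-user matrix in fit(). A user-challenge matrix is
# passed here, so the "item" factors hold the users and the "user" factors hold the
# challenges (consistent with the shapes printed below).
user_vecs, chal_vecs = model.item_factors, model.user_factors.T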



In [359]:
user_vecs.shape

(109264, 400)

In [360]:
chal_vecs.shape

(400, 5502)

In [361]:
user_arr = np.array(user_list)
challenge_arr = np.array(challenge_list)

#### Now we will obtain the predicted ratings for every user in the test set and select the 3 highest-rated challenges.

In [362]:
from sklearn.preprocessing import MinMaxScaler

In [363]:
def rec_items(customer_id, mf_train, user_vecs, chal_vecs, user_list, chal_list, num_items = 10):
    '''
    Return the top recommended challenges for one user.

    parameters:

    customer_id - ID of the user to get recommendations for

    mf_train - the sparse user-challenge matrix used to fit the matrix factorization

    user_vecs - the user vectors from the fitted matrix factorization (users x factors)

    chal_vecs - the challenge vectors from the fitted matrix factorization (factors x challenges)

    user_list - array of user IDs that make up the rows of the ratings matrix
                    (in order of matrix)

    chal_list - array of challenge IDs that make up the columns of the ratings matrix
                    (in order of matrix)

    num_items - number of challenges to recommend, best first. Default is 10.

    returns:

    - list of the top num_items challenge IDs, ranked by predicted score,
      among challenges the user has not solved
    '''

    cust_ind = np.where(user_list == customer_id)[0][0]  # row index of this user
    pref_vec = mf_train[cust_ind,:].toarray()  # challenges the user has already solved
    pref_vec = pref_vec.reshape(-1) + 1  # solved -> at least 2, unsolved -> 1
    pref_vec[pref_vec > 1] = 0  # mask: solved -> 0, unsolved -> 1
    rec_vector = user_vecs[cust_ind,:].dot(chal_vecs)  # predicted score for every challenge
    # Scale this recommendation vector between 0 and 1
    min_max = MinMaxScaler()
    rec_vector_scaled = min_max.fit_transform(rec_vector.reshape(-1,1))[:,0]
    recommend_vector = pref_vec*rec_vector_scaled  # solved challenges are zeroed out
    product_idx = np.argsort(recommend_vector)[::-1][:num_items]  # highest scores first
    rec_list = []
    for index in product_idx:
        code = chal_list[index]
        rec_list.append(code)

    return rec_list
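
To make the masking and ranking steps in `rec_items` concrete, here is a small walk-through on made-up numbers. It uses plain NumPy in place of `MinMaxScaler` (an equivalent formula for a single column), and all values and names are hypothetical.

```python
import numpy as np

# Made-up example: one user, five challenges.
solved = np.array([1, 0, 0, 1, 0])                  # 1 = already solved
raw_scores = np.array([2.0, 0.5, 1.5, 3.0, -1.0])   # hypothetical model scores

# Min-max scale to [0, 1], as MinMaxScaler does for a single column.
scaled = (raw_scores - raw_scores.min()) / (raw_scores.max() - raw_scores.min())

# Mask is 1 for unsolved challenges and 0 for solved ones.
mask = 1 - solved
masked = mask * scaled

# Indices of the top 2 unsolved challenges, best first.
top2 = np.argsort(masked)[::-1][:2]
```

One side effect worth knowing: the lowest-scoring challenge always scales to 0, so if it is unsolved it ties with the solved ones. When only the top few challenges are taken, this should rarely matter.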

In [364]:
# Build an empty frame with three rows per test user, indexed by user_id
index = [[i]*3 for i in test.user_id.unique()]

index = [i for sublist in index for i in sublist]

ind = pd.Series(index)

rec_df= pd.DataFrame(index = ind, columns=["Challenge"])
rec_df = rec_df.fillna(0)
rec_df[:5]

Unnamed: 0,Challenge
4577,0
4577,0
4577,0
4578,0
4578,0


In [365]:
unique_test_users = test.user_id.unique()

In [366]:
for user_id in unique_test_users:
    # Top 3 unsolved challenges for this user, written into the user's three rows
    chal_recommended = rec_items(user_id, df_sparse, user_vecs, chal_vecs, user_arr, challenge_arr, 3)
    rec_df.loc[user_id] = np.array([chal_recommended]).reshape(-1,1)
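
An alternative way to assemble the same kind of table is to collect the recommendations first and build the frame in one go. This is only a sketch: `toy_recs` and its contents are made up, and it assumes each user has exactly three recommendations.

```python
import numpy as np
import pandas as pd

# Hypothetical recommendations: user_id -> three challenge IDs, best first.
toy_recs = {
    4577: ["CI1", "CI2", "CI3"],
    4578: ["CI4", "CI5", "CI6"],
}

# Repeat each user ID three times and flatten the lists in the same order.
user_index = np.repeat(list(toy_recs.keys()), 3)
flat_challenges = [c for recs in toy_recs.values() for c in recs]

toy_rec_df = pd.DataFrame({"Challenge": flat_challenges}, index=user_index)
```

This avoids pre-filling the frame with zeros and assigning into a repeated index label for each user.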

In [367]:
rec_df.to_csv("recommeded_challenges_final.csv")