In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Joke Recommendations

In this problem, we will build two recommender systems to recommend jokes to users. 
We'll be using the Jester Dataset (which is available [here](http://eigentaste.berkeley.edu/dataset/)).

The *jester.csv* file has data from 23,500 users who have rated 36 or more jokes. 
Ratings are real values ranging from -10.00 to +10.00.

In [2]:
# load the data
url = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Theory/master/Data/jester.csv'
ratings_matrix = pd.read_csv(url, index_col = 'user').T
ratings_matrix

user,0,1,2,3,4,5,6,7,8,9,...,23490,23491,23492,23493,23494,23495,23496,23497,23498,23499
joke 0,,-4.37,,0.34,,,3.50,-7.67,1.02,3.64,...,0.83,-0.58,,-2.38,9.27,-9.95,-0.78,,1.02,
joke 1,8.11,-3.88,,-6.55,,,2.28,7.14,6.07,1.60,...,4.27,-4.22,8.74,-1.12,8.79,4.85,1.31,,2.77,
joke 2,,0.73,,2.86,,,3.01,-6.31,6.65,-0.39,...,-2.23,2.04,,-0.92,-5.58,-9.95,-2.09,,7.09,
joke 3,,-3.20,,,,,-0.63,,-0.87,3.74,...,-2.14,-6.02,,-2.04,9.27,-8.25,-0.78,,0.05,
joke 4,-2.28,-6.41,0.73,-3.64,9.13,-1.46,4.95,-9.32,6.80,-6.36,...,-1.80,-9.76,1.17,-7.77,8.74,1.41,4.71,6.12,1.26,9.17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
joke 95,-5.92,3.45,,,,,2.28,,5.92,1.02,...,,3.40,,,,,1.02,,2.38,
joke 96,,3.20,,,,,5.05,1.60,-6.60,4.17,...,,-0.53,,,,,2.57,,2.62,
joke 97,,-0.53,,,,,4.51,,0.24,-6.55,...,,-7.28,,,,,1.02,,2.57,
joke 98,,-0.53,,,,,4.08,,9.08,-8.93,...,,-7.52,,,,,-0.29,,1.94,
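
The table above is mostly empty, so it can help to quantify how sparse it is before modelling. The following is a minimal sketch on a made-up three-joke matrix (`toy` and its values are illustrative stand-ins, not part of the Jester data); the same expressions should apply to `ratings_matrix`, whose rows are jokes and whose columns are users.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ratings_matrix: rows are jokes, columns are users.
toy = pd.DataFrame(
    {
        0: [np.nan, 8.11, np.nan],
        1: [-4.37, -3.88, 0.73],
        2: [np.nan, np.nan, np.nan],
    },
    index=["joke 0", "joke 1", "joke 2"],
)

# Number of jokes each user has rated (non-missing entries per column).
ratings_per_user = toy.notna().sum(axis=0)

# Fraction of the matrix that is missing.
missing_fraction = toy.isna().to_numpy().mean()

# Smallest and largest observed rating (NaN entries are skipped).
lowest, highest = toy.min().min(), toy.max().max()
```

On the real data, `ratings_per_user.min()` would be one way to check the "36 or more jokes" claim, and `lowest`/`highest` the stated rating range.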


The *JokeText.csv* file contains the text of each of the 100 jokes.

In [3]:
url = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Theory/master/Data/JokeText.csv'
jokes = pd.read_csv(url, index_col = 'JokeId').squeeze('columns')
jokes

JokeId
0     A man visits the doctor. The doctor says "I ha...
1     This couple had an excellent relationship goin...
2     Q. What's 200 feet long and has 4 teeth? \n\nA...
3     Q. What's the difference between a man and a t...
4     Q.\tWhat's O. J. Simpson's Internet address? \...
                            ...                        
95    Two attorneys went into a diner and ordered tw...
96    A teacher is explaining to her class how diffe...
97    Age and Womanhood\n\n1. Between the ages of 13...
98    A bus station is where a bus stops.\nA train s...
99    Q: Whats the difference between greeting a Que...
Name: JokeText, Length: 100, dtype: object

In [4]:
print(jokes[0])

A man visits the doctor. The doctor says "I have bad news for you.You have
cancer and Alzheimer's disease". 
The man replies "Well,thank God I don't have cancer!"



In [5]:
print(jokes[5])

Bill & Hillary are on a trip back to Arkansas. They're almost out of gas, so Bill pulls into a service station on the outskirts of
town. The attendant runs out of the station to serve them when Hillary realizes it's an old boyfriend from high school. She and
the attendant chat as he gases up their car and cleans the windows. Then they all say good-bye. 

As Bill pulls the car onto the road, he turns to Hillary and says, 'Now aren't you glad you married me and not him ? You could've
been the wife of a grease monkey !' 

To which Hillary replied, 'No, Bill. If I would have married him, you'd be pumping gas and he would be the President !' 



In [6]:
print(jokes[20])

What's the difference between a used tire and 365 used condoms?

One's a Goodyear, the other's a great year.



In [7]:
print(jokes[99])

Q: Whats the difference between greeting a Queen and greeting the
President of the United  States?

A: You only have to get on one knee to greet the queen.



## Part 1: create a 'fake user'

Uncomment and run the following code once to create an empty *fake_user.csv* file.

In [8]:
#pd.Series(index = ratings_matrix.index, name='fake user').to_csv('fake_user.csv')

Open the *fake_user.csv* file and rate some jokes (the text of each joke is in the `jokes` Series). 
Your ratings must range from -10.00 to +10.00; leave the jokes you do not rate blank.

Then, run the following code

In [9]:
fake_user = pd.read_csv('fake_user.csv', index_col=0).squeeze('columns')
fake_user

joke 0     5.0
joke 1     NaN
joke 2     NaN
joke 3     NaN
joke 4     7.0
          ... 
joke 95    NaN
joke 96    NaN
joke 97    NaN
joke 98    NaN
joke 99    NaN
Name: fake user, Length: 100, dtype: float64
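
Since the ratings are typed by hand, it may be worth checking that they respect the -10.00 to +10.00 range before using them. The helper below is a hypothetical sketch (`check_ratings` and `toy_user` are names introduced here, not part of the notebook); it returns the labels of any out-of-range ratings.

```python
import numpy as np
import pandas as pd

def check_ratings(ratings, low=-10.0, high=10.0):
    """Return the labels of rated jokes whose rating falls outside [low, high]."""
    rated = ratings.dropna()  # unrated jokes (NaN) are ignored
    out_of_range = rated[(rated < low) | (rated > high)]
    return out_of_range.index.to_list()

# Made-up ratings for illustration; 'joke 2' is deliberately out of range.
toy_user = pd.Series(
    [5.0, np.nan, 12.0, -10.0],
    index=["joke 0", "joke 1", "joke 2", "joke 3"],
    name="fake user",
)
bad = check_ratings(toy_user)
```

Calling `check_ratings(fake_user)` on the real Series should return an empty list if every rating is valid.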

## Part 2: user-based recommendations

Use the user-based recommendation model to find the top-5 recommendations.
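
As a reading aid, the cells below appear to compute the following prediction for each joke $j$ that the fake user has not rated. The notation is introduced here and is not the notebook's: $\mu$ and $\sigma$ are the mean and standard deviation of the fake user's ratings, $N_k(j)$ is the set of the $k$ users most correlated with the fake user among those who rated joke $j$, $s_v$ is the Pearson correlation between user $v$ and the fake user, and $z_{vj}$ is user $v$'s rating of joke $j$ standardized by that user's own mean and standard deviation.

$$
\hat{r}_j = \mu + \sigma \, \frac{\sum_{v \in N_k(j)} s_v \, z_{vj}}{\sum_{v \in N_k(j)} |s_v|}
$$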

In [10]:
user_mean = fake_user.mean()
user_std = fake_user.std()

In [11]:
similarities = ratings_matrix.corrwith(fake_user)
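
To illustrate what `corrwith` does here, the sketch below uses a made-up four-joke matrix (`toy_ratings` and `toy_user` are illustrative names and values). Each user column is correlated with the fake user over the jokes that both have rated; a user with fewer than two jokes in common gets `NaN`, and NumPy may emit a runtime warning in that case.

```python
import numpy as np
import pandas as pd

# Toy data: rows are jokes, columns are users (same layout as ratings_matrix).
toy_ratings = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],
        "b": [4.0, 3.0, np.nan, 1.0],
        "c": [np.nan, np.nan, np.nan, 5.0],
    },
    index=["joke 0", "joke 1", "joke 2", "joke 3"],
)
toy_user = pd.Series([2.0, 4.0, np.nan, 8.0], index=toy_ratings.index)

# Each column is correlated with toy_user over the jokes both have rated.
toy_sim = toy_ratings.corrwith(toy_user)
```

In this toy case user `a` rates in perfect agreement with `toy_user`, user `b` in perfect disagreement, and user `c` shares only one rated joke, so no correlation can be computed for `c`.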


In [12]:
not_rated_jokes = fake_user[fake_user.isna()].index.to_list()


In [13]:
z_scores = (ratings_matrix - ratings_matrix.mean())/ratings_matrix.std()
z_scores

user,0,1,2,3,4,5,6,7,8,9,...,23490,23491,23492,23493,23494,23495,23496,23497,23498,23499
joke 0,,-1.024861,,-0.054191,,,0.513838,-0.755281,-1.201378,1.408545,...,0.431117,0.071733,,-0.048058,0.957739,-1.261320,-0.395047,,-0.768301,
joke 1,1.569751,-0.902009,,-2.090232,,,0.063200,1.543693,0.126068,1.001843,...,1.230454,-0.692365,1.480818,0.197336,0.889273,1.284369,0.209027,,-0.052175,
joke 2,,0.253802,,0.690485,,,0.332844,-0.544166,0.278527,0.605109,...,-0.279921,0.621716,,0.236287,-1.160416,-1.261320,-0.773678,,1.715634,
joke 3,,-0.731521,,,,,-1.011682,,-1.698185,1.428482,...,-0.259008,-1.070216,,0.018159,0.957739,-0.968910,-0.395047,,-1.165239,
joke 4,-0.426688,-1.536326,-1.226574,-1.230308,1.266762,-1.078827,1.049432,-1.011412,0.317956,-0.585093,...,-0.180004,-1.855307,-0.096370,-1.097799,0.882141,0.692668,1.191732,0.87582,-0.670089,0.762062
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
joke 95,-1.126115,0.935756,,,,,0.063200,,0.086639,0.886212,...,,0.907203,,,,,0.125208,,-0.211769,
joke 96,,0.873076,,,,,1.086370,0.683712,-3.204377,1.514208,...,,0.082229,,,,,0.573206,,-0.113557,
joke 97,,-0.062103,,,,,0.886907,,-1.406410,-0.622972,...,,-1.334712,,,,,0.125208,,-0.134018,
joke 98,,-0.062103,,,,,0.728076,,0.917279,-1.097458,...,,-1.385092,,,,,-0.253422,,-0.391823,


In [14]:
k = 20
for item in not_rated_jokes:
    # k nearest neighbors similarities
    knn_sim = similarities[ratings_matrix.loc[item].notna()].sort_values(ascending=False).head(k)

    # normalization factor
    total = knn_sim.abs().sum()
    
    # k nearest neighbors
    knn = knn_sim.index
    
    # k nearest neighbors z-scores
    knn_z_scores = z_scores.loc[item,knn]
    

    # prediction
    prediction = user_mean + user_std*knn_sim.dot(knn_z_scores)/total
    fake_user.loc[item] = prediction

In [15]:
fake_user.loc[not_rated_jokes].sort_values(ascending=False).head(20)

joke 33    6.793358
joke 52    6.751791
joke 48    6.741578
joke 47    6.731080
joke 35    6.687509
joke 11    6.682506
joke 28    6.628012
joke 31    6.577365
joke 64    6.477840
joke 67    6.474974
joke 10    6.465079
joke 49    6.463592
joke 65    6.461636
joke 39    6.371124
joke 30    6.337450
joke 46    6.308760
joke 25    6.241484
joke 37    6.181663
joke 77    6.115742
joke 99    6.101678
Name: fake user, dtype: float64
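
To turn a ranking like the one above into five readable recommendations, one option is to keep the five largest predictions and look up their text. The sketch below uses made-up stand-ins (`toy_predictions`, `toy_jokes`) and assumes that a label such as `'joke 33'` corresponds to `JokeId` 33 in the `jokes` Series.

```python
import pandas as pd

# Toy stand-ins for the notebook's objects (values are made up for illustration).
toy_predictions = pd.Series(
    [6.8, 6.1, 6.7, 5.9, 6.5, 6.3],
    index=["joke 33", "joke 99", "joke 52", "joke 7", "joke 48", "joke 10"],
)
toy_jokes = pd.Series(
    {7: "text 7", 10: "text 10", 33: "text 33", 48: "text 48", 52: "text 52", 99: "text 99"},
    name="JokeText",
)

# Five highest predicted ratings.
top5 = toy_predictions.nlargest(5)

# Map a label such as 'joke 33' to the integer id used by the jokes index (assumed mapping).
top5_ids = [int(label.split()[-1]) for label in top5.index]
top5_text = toy_jokes.loc[top5_ids]
```

With the notebook's objects, the analogous call would start from `fake_user.loc[not_rated_jokes]` and index into `jokes`.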

## Part 3: matrix-factorization-based recommendations

Use the matrix factorization recommendation model to find the top-5 recommendations.
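
For reference, the `latent_factor_model` function defined below can be read as gradient descent on a squared-error objective restricted to the observed ratings. The notation is introduced here: $R$ is the jokes-by-users ratings matrix, $\Omega$ is the set of observed entries, $u_i$ and $v_j$ are rows of $U$ and $V$, $E$ is the error matrix with zeros at unobserved entries, and $\alpha$ is the learning rate.

$$
J(U, V) = \frac{1}{2} \sum_{(i,j) \in \Omega} \left( r_{ij} - u_i^{\top} v_j \right)^2,
\qquad
E_{ij} =
\begin{cases}
r_{ij} - u_i^{\top} v_j & (i,j) \in \Omega \\
0 & \text{otherwise}
\end{cases}
$$

$$
\nabla_U J = -E V, \quad \nabla_V J = -E^{\top} U
\quad\Longrightarrow\quad
U \leftarrow U + \alpha \, E V, \quad V \leftarrow V + \alpha \, E^{\top} U
$$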

In [16]:
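# append the fake user as a new column of the ratings matrix
# (fake_user now also holds the Part 2 predictions for the jokes it had not rated)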
ratings_matrix['user'] = fake_user
ratings_matrix

user,0,1,2,3,4,5,6,7,8,9,...,23491,23492,23493,23494,23495,23496,23497,23498,23499,user.1
joke 0,,-4.37,,0.34,,,3.50,-7.67,1.02,3.64,...,-0.58,,-2.38,9.27,-9.95,-0.78,,1.02,,5.000000
joke 1,8.11,-3.88,,-6.55,,,2.28,7.14,6.07,1.60,...,-4.22,8.74,-1.12,8.79,4.85,1.31,,2.77,,5.253495
joke 2,,0.73,,2.86,,,3.01,-6.31,6.65,-0.39,...,2.04,,-0.92,-5.58,-9.95,-2.09,,7.09,,6.051141
joke 3,,-3.20,,,,,-0.63,,-0.87,3.74,...,-6.02,,-2.04,9.27,-8.25,-0.78,,0.05,,3.777641
joke 4,-2.28,-6.41,0.73,-3.64,9.13,-1.46,4.95,-9.32,6.80,-6.36,...,-9.76,1.17,-7.77,8.74,1.41,4.71,6.12,1.26,9.17,7.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
joke 95,-5.92,3.45,,,,,2.28,,5.92,1.02,...,3.40,,,,,1.02,,2.38,,4.922160
joke 96,,3.20,,,,,5.05,1.60,-6.60,4.17,...,-0.53,,,,,2.57,,2.62,,4.961552
joke 97,,-0.53,,,,,4.51,,0.24,-6.55,...,-7.28,,,,,1.02,,2.57,,5.207057
joke 98,,-0.53,,,,,4.08,,9.08,-8.93,...,-7.52,,,,,-0.29,,1.94,,4.121614


In [17]:
# transpose the ratings_matrix
# ratings_matrix = ratings_matrix.T
# ratings_matrix

In [18]:
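def latent_factor_model(R, k, learning_rate, n_epochs):
    """Factorize R (with NaN for missing ratings) as U V^T by gradient descent."""
    m, n = R.shape
    
    # number of observed (non-missing) ratings
    n_observed = np.sum(~np.isnan(R))
    
    # random initialization of the factor matrices
    U = np.random.randn(m,k)
    V = np.random.randn(n,k)
    
    # root-mean-square error on the observed ratings, one value per epoch
    mean_error = np.zeros(n_epochs)
    
    # gradient descent steps
    for i in range(n_epochs):
        
        # error matrix; missing ratings contribute zero error
        E = R-U.dot(V.T)
        E[np.isnan(E)]=0

        # update U and V simultaneously (both updates use the old factors)
        new_U = U + learning_rate*E.dot(V)
        new_V = V + learning_rate*E.T.dot(U)

        U = new_U
        V = new_V
        
        # RMSE over the observed ratings, measured before this epoch's update
        error_squared = np.sum(E**2)
        mean_error[i] = np.sqrt(error_squared/n_observed)
        
    return U, V, mean_error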

In [19]:
R = ratings_matrix.to_numpy()
R.shape

(100, 23501)

In [20]:
U, V, mean_error = latent_factor_model(R, 
                                       k = 20, 
                                       learning_rate = .00005, 
                                       n_epochs = 50)
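
Before trusting the predictions, it may be worth checking that the training error actually decreases. The sketch below repeats the same masked update rule on a small synthetic low-rank matrix; `masked_mf`, the small-scale initialization, the seeds, and the toy hyperparameters are assumptions made for illustration, not the notebook's settings.

```python
import numpy as np

def masked_mf(R, k, learning_rate, n_epochs, seed=0):
    """Toy gradient descent on observed entries only (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    observed = ~np.isnan(R)
    # Small random initialization (an assumption; the notebook uses unscaled randn).
    U = 0.1 * rng.standard_normal((m, k))
    V = 0.1 * rng.standard_normal((n, k))
    rmse = np.zeros(n_epochs)
    for epoch in range(n_epochs):
        # Residuals on observed entries; missing entries contribute zero.
        E = np.where(observed, R - U @ V.T, 0.0)
        rmse[epoch] = np.sqrt(np.sum(E**2) / observed.sum())
        # Simultaneous update of both factors, as in the notebook's loop.
        U, V = U + learning_rate * E @ V, V + learning_rate * E.T @ U
    return U, V, rmse

# Synthetic rank-2 "ratings" with roughly 30% of the entries hidden.
rng = np.random.default_rng(1)
R_toy = rng.standard_normal((6, 2)) @ rng.standard_normal((8, 2)).T
R_toy[rng.random(R_toy.shape) < 0.3] = np.nan

U_toy, V_toy, rmse = masked_mf(R_toy, k=2, learning_rate=0.005, n_epochs=500)
```

For the real run, plotting the returned `mean_error` against the epoch number (for example with `plt.plot(mean_error)`) would give the same kind of check; a curve that is still falling steeply at the last epoch suggests that more epochs are needed.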

In [22]:
# predictions
R_pred = U.dot(V.T)
user_pred = pd.Series(R_pred[:,-1], index=fake_user.index) 
user_pred

joke 0     0.092099
joke 1     0.189833
joke 2     0.383168
joke 3     0.325473
joke 4     0.006178
             ...   
joke 95   -0.015685
joke 96   -0.394909
joke 97   -0.136757
joke 98    0.148656
joke 99   -0.502069
Length: 100, dtype: float64

In [23]:
user_pred.sort_values(ascending=False).head(20)

joke 2     0.383168
joke 3     0.325473
joke 12    0.235247
joke 57    0.234663
joke 14    0.224538
joke 66    0.210492
joke 39    0.207933
joke 58    0.205300
joke 9     0.195989
joke 23    0.192821
joke 1     0.189833
joke 76    0.183052
joke 24    0.153800
joke 98    0.148656
joke 32    0.145449
joke 6     0.131933
joke 17    0.125316
joke 45    0.109448
joke 83    0.097958
joke 0     0.092099
dtype: float64