# Intro 

Movie recommendation can be framed as a machine learning classification problem: if the model predicts that you will like a movie, the movie goes on your recommended list; otherwise, it does not. Because there are only two possible outcomes (like or not like), this is a binary classification problem.

# Bayes' theorem 

Let $A$ and $B$ denote two events. In Bayes' theorem, $P(A |B)$ is the probability that $A$ occurs given that $B$ is true. It can be computed as follows:
$$P(A |B) = \frac{P(B |A)\ P(A)}{P(B)}$$
where:
 * $P(A |B)$ is called the *posterior* (probability)
 * $P(B |A)$ is called the *likelihood*
 * $P(A)$ is called the *prior* (probability)
 * $P(B)$ is called the *evidence*.
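
As a quick numeric illustration of the formula, the sketch below plugs made-up probabilities into Bayes' theorem. The numbers and the helper name `bayes` are assumptions chosen only for illustration; $P(B)$ is obtained from the law of total probability.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical inputs (not taken from any dataset)
p_b_given_a = 0.8      # P(B|A)
p_a = 0.3              # P(A)
p_b_given_not_a = 0.2  # P(B|not A)

# Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = bayes(p_b_given_a, p_a, p_b)
print(f"P(B)   = {p_b:.2f}")
print(f"P(A|B) = {p_a_given_b:.4f}")
```

Observing $B$ raises the probability of $A$ from 0.3 to roughly 0.63 here, because $B$ is much more likely when $A$ holds than when it does not.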


# Naïve Bayes classifier

* What Naïve Bayes does:

  * It maps the **probability of observed input features given a possible class** to the **probability of the class given observed pieces of evidence** based on Bayes' theorem.

  * It simplifies probability computation by assuming that predictive features are mutually independent.

* Given a feature vector $\mathbf{x} = (x_1, x_2, \ldots, x_n)$, the goal of Naïve Bayes is to determine the probabilities that $\mathbf{x}$ belongs to each of $K$ possible classes $y_1,y_2,\ldots, y_K$. That is, 
$$P(y_k|\ \mathbf{x}), \text{ where } k=1,2,\ldots,K$$

* By Bayes' theorem,
$$P(y_k | \mathbf{x})=\frac{P(\mathbf{x}|\ y_k)P(y_k)}{P(\mathbf{x})}$$
where:
  * $P(\mathbf{x}|\ y_k)=P(x_1, x_2, \ldots, x_n|\ y_k)$ is the joint distribution of the $n$ features $x_1, x_2, \ldots, x_n$, given that the sample belongs to class $y_k$. It measures how likely it is that features with these values co-occur within that class.

  * $P(y_k|\ \mathbf{x})$, in contrast to $P(y_k)$, has extra knowledge of data sample $\mathbf{x}$.

  * $P(y_k)$ describes how classes are distributed. It can be either predetermined (usually uniformly, so that each class has an equal chance of occurrence) or learned from a set of training examples.

  * $P(\mathbf{x})$ depends only on the overall distribution of features, not on any particular class, so it can be treated as a normalization constant. Thus
$$P(y_k | \mathbf{x}) \propto P(\mathbf{x}|\ y_k)P(y_k),$$
where $\propto$ denotes "proportional to".


* Under the feature independence assumption, the joint conditional distribution of the $n$ features $x_1, x_2, \ldots, x_n$ can be expressed as the product of individual feature conditional distributions:
$$P(x_1, x_2, \ldots, x_n|\ y_k) = P(x_1|\ y_k)\cdot P(x_2|\ y_k)\cdot \ldots \cdot P(x_n|\ y_k)$$

* Then:
$$P(y_k | \mathbf{x}) \propto P(x_1|\ y_k)\cdot P(x_2|\ y_k)\cdot \ldots \cdot P(x_n|\ y_k)\cdot P(y_k).$$
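
The proportionality above can be turned into actual probabilities by normalizing over the classes. The following is a minimal sketch, not a reference implementation: the function name `naive_bayes_posterior` and the example numbers are assumptions, and the inputs are assumed to be strictly positive. It sums logarithms instead of multiplying probabilities, a common way to avoid underflow when $n$ is large.

```python
import numpy as np

def naive_bayes_posterior(priors, likelihoods):
    """Normalize prior * product of per-feature likelihoods over classes.

    priors:      shape (K,),   P(y_k)
    likelihoods: shape (K, n), P(x_i | y_k) evaluated at the observed x
    """
    priors = np.asarray(priors, dtype=float)
    likelihoods = np.asarray(likelihoods, dtype=float)
    # Sum of logs equals the log of the product of probabilities
    log_scores = np.log(priors) + np.log(likelihoods).sum(axis=1)
    # Subtract the maximum before exponentiating for numerical stability
    scores = np.exp(log_scores - log_scores.max())
    return scores / scores.sum()

# Hypothetical two-class, three-feature example
priors = [0.5, 0.5]
likelihoods = [[0.9, 0.8, 0.1],
               [0.2, 0.3, 0.6]]
posterior = naive_bayes_posterior(priors, likelihoods)
print(posterior)
```

With these inputs the unnormalized scores are 0.036 and 0.018, so the posterior is two thirds for the first class and one third for the second.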





# Simplified example of movie recommendation

* Given four users, whether they like each of three movies, $m_1, m_2, m_3$ (indicated as `1` or `0`), and whether they like a target movie (denoted as event `Y`) or not (denoted as event `N`), as shown in the following table, we are asked to predict how likely it is that another user will like that movie:




 |    ID       |$m_1$| $m_2$| $m_3$| The user likes the target movie|
 | ----------- |----|----|----|:----:|
 |   1         | 0  |1|1|Y|
 |   2         | 0  |0|1|N|
 |   3         | 0  |0|0|Y|
 |   4         | 1  |1|0|Y|
 |   5         | 1  |1|0| ?|


 * Whether a user likes each of the three movies, $m_1, m_2, m_3$, is a feature (signal) that we can use to predict the target class. 

* The training data we have are the four samples with both ratings and target information.

* We want to calculate the probability that the user with ID=5 likes the target movie. That is, we want to calculate the posterior probability $P(Y|\ \mathbf{x})$ where $\mathbf{x}=(1,1,0)$.

* The prior probabilities are:
$$ P(Y)= \frac{3}{4} \text{ and } P(N) = \frac{1}{4}$$

*  We will denote the event that a user likes the three movies or not as $f_1$, $f_2$, $f_3$, respectively. We need to compute the likelihoods, 
$$P(f_1=1|\ Y), P(f_2=1|\ Y), P(f_3=0|\ Y) \text{ and }$$  $$P(f_1=1|\ N), P(f_2=1 |\ N), P(f_3=0|\ N).$$

* Since $f_1=1$ was not seen in the $N$ class, the unsmoothed estimate is $P(f_1 = 1|\ N)=0$.
Consequently, we get $$P(N|\ \mathbf{x})\propto P(N)\cdot P(f_1=1|\ N)\cdot P(f_2=1 |\ N) \cdot P(f_3=0 |\ N) =0,$$ which means we would predict class `Y` no matter what the other features say. 

To avoid multiplying by a zero likelihood for a feature value that was never observed in a class, we usually start counting each possible value of a feature from one instead of zero. This technique is known as **Laplace smoothing**. For more details, see https://courses.cs.washington.edu/courses/cse446/20wi/Section7/naive-bayes.pdf. We now have the following:

$$P(f_1 = 1|\ N)=\frac{0+1}{1+2}=\frac{1}{3},$$
$$P(f_2 = 1|\ N)=\frac{0+1}{1+2}=\frac{1}{3},$$
$$P(f_3 = 0|\ N)=\frac{0+1}{1+2}=\frac{1}{3},$$

$$P(f_1 = 1|\ Y)= \frac{1+1}{3+2}=\frac{2}{5},$$
$$P(f_2 = 1|\ Y)= \frac{2+1}{3+2}=\frac{3}{5},$$
$$P(f_3 = 0|\ Y)= \frac{2+1}{3+2}=\frac{3}{5}.$$

We add 2 to the denominator because each feature has two possible values (0 or 1), and each value receives one extra count.
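
The smoothed likelihoods can be checked with a few lines of NumPy. This is a sketch under the assumptions of the toy table; the helper name `smoothed_likelihoods` and its parameters `alpha` and `n_values` are introduced here for illustration. Setting `alpha=0` reproduces the unsmoothed estimates, including the zeros for class `N`.

```python
import numpy as np

# Training data from the table (users 1 to 4)
X_train = np.array([[0, 1, 1],
                    [0, 0, 1],
                    [0, 0, 0],
                    [1, 1, 0]])
y_train = np.array(['Y', 'N', 'Y', 'Y'])
x_new = np.array([1, 1, 0])  # user 5

def smoothed_likelihoods(X, y, x, label, alpha=1, n_values=2):
    """Estimate P(f_i = x_i | label) with additive (Laplace) smoothing."""
    X_label = X[y == label]
    # Per feature: how many samples of this class share the value seen in x
    match_counts = (X_label == x).sum(axis=0)
    return (match_counts + alpha) / (len(X_label) + alpha * n_values)

for label in ['N', 'Y']:
    raw = smoothed_likelihoods(X_train, y_train, x_new, label, alpha=0)
    smooth = smoothed_likelihoods(X_train, y_train, x_new, label, alpha=1)
    print(label, 'unsmoothed:', raw, 'smoothed:', smooth)
```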

Then 
$$\frac{P(N|\ \mathbf{x})}{P(Y|\ \mathbf{x})} = \frac{P(N)\cdot P(f_1=1|\ N)\cdot P(f_2=1 |\ N) \cdot P(f_3=0 |\ N)}{P(Y)\cdot P(f_1=1|\ Y)\cdot P(f_2=1 |\ Y) \cdot P(f_3=0 |\ Y)}= \frac{\frac{1}{4} \cdot \frac{1}{3} \cdot \frac{1}{3}\cdot \frac{1}{3}}{\frac{3}{4} \cdot \frac{2}{5} \cdot \frac{3}{5} \cdot \frac{3}{5}} = \frac{125}{1458}.$$


In [8]:
125/1458

0.08573388203017833

Since $P(N|\ \mathbf{x}) + P(Y|\ \mathbf{x}) = 1$, we get 
$\left(\frac{125}{1458}+1\right) P(Y|\ \mathbf{x}) = 1$ and thus

$$P(Y|\ \mathbf{x}) = \frac{1458}{1583} \approx 0.9210.$$

This means there is a 92.1% chance that the new user will like the target movie.

In [9]:
1/(1+125/1458)

0.9210360075805433
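
The same calculation can be repeated with exact rational arithmetic as a sanity check. This sketch uses Python's standard `fractions` module; the dictionary names `prior`, `likelihood`, and `score` are chosen here for illustration.

```python
from fractions import Fraction as F

# Priors and smoothed likelihoods for x = (1, 1, 0), as derived in this section
prior = {'Y': F(3, 4), 'N': F(1, 4)}
likelihood = {
    'Y': [F(2, 5), F(3, 5), F(3, 5)],
    'N': [F(1, 3), F(1, 3), F(1, 3)],
}

# Unnormalized score per class: prior times the product of likelihoods
score = {}
for label in prior:
    s = prior[label]
    for p in likelihood[label]:
        s *= p
    score[label] = s

ratio = score['N'] / score['Y']
posterior_Y = score['Y'] / (score['Y'] + score['N'])
print('Ratio N/Y:', ratio)
print('P(Y|x):', posterior_Y, '=', float(posterior_Y))
```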

# Naïve Bayes with scikit-learn

To implement Naïve Bayes with scikit-learn we can use the [BernoulliNB class](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html), which is designed for binary features.

We initialize a model with a smoothing factor (specified as `alpha`) of 1.0, and a prior learned from the training set (specified as `fit_prior=True`).


In [75]:
import numpy as np
X_train = np.array([[0, 1, 1],
                    [0, 0, 1],
                    [0, 0, 0],
                    [1, 1, 0]])
Y_train = ['Y', 'N', 'Y', 'Y']
X_test = np.array([[1, 1, 0]])

In [13]:
from sklearn.naive_bayes import BernoulliNB

clf = BernoulliNB(alpha=1.0, fit_prior=True)
clf.fit(X_train, Y_train)

In [15]:
pred_prob = clf.predict_proba(X_test)
print('Predicted probabilities:\n', pred_prob)
pred = clf.predict(X_test)
print('Prediction:', pred)

Predicted probabilities:
 [[0.07896399 0.92103601]]
Prediction: ['Y']
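
To see where these probabilities come from, one can inspect the parameters that the fitted model stores. The sketch below refits the same toy model so that it is self-contained, then exponentiates the `class_log_prior_` and `feature_log_prob_` attributes of `BernoulliNB`. The variable names `class_prior` and `feature_prob` are introduced here for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

X_train = np.array([[0, 1, 1],
                    [0, 0, 1],
                    [0, 0, 0],
                    [1, 1, 0]])
Y_train = ['Y', 'N', 'Y', 'Y']

clf = BernoulliNB(alpha=1.0, fit_prior=True)
clf.fit(X_train, Y_train)

# Classes are stored in sorted order, so row 0 is 'N' and row 1 is 'Y'
print('Classes:', clf.classes_)
# Priors learned from the class frequencies in the training set
class_prior = np.exp(clf.class_log_prior_)
print('Priors:', class_prior)
# Smoothed P(f_i = 1 | class); P(f_i = 0 | class) is one minus this value
feature_prob = np.exp(clf.feature_log_prob_)
print('P(f_i = 1 | class):\n', feature_prob)
```

The stored values are probabilities of a feature being 1, so the hand-computed $P(f_3 = 0|\ Y)=\frac{3}{5}$ corresponds to one minus the third entry of the `Y` row.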


The prediction results for our toy dataset are consistent with what we got using our own solution. Now we will build a movie recommender (or, more specifically, movie preference classifier) using a real dataset.

# Movie preference classifier using MovieLens data

We will now use a real movie rating dataset (https://grouplens.org/datasets/movielens/). Data was collected by the GroupLens Research group from the MovieLens website (http://movielens.org).

We will use the small dataset, ml-latest-small (downloaded from the following link: http://files.grouplens.org/datasets/movielens/ml-latest-small.zip) as an example. 
It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

In [170]:
import pandas as pd
import numpy as np

In [171]:
ratings_path = '../datasets/ml-latest-small/ratings.csv'

In [172]:
df = pd.read_csv(ratings_path)

In [173]:
df.head()

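 |     | userId | movieId | rating | timestamp |
 | --- | ------ | ------- | ------ | --------- |
 | 0   | 1      | 1       | 4.0    | 964982703 |
 | 1   | 1      | 3       | 4.0    | 964981247 |
 | 2   | 1      | 6       | 4.0    | 964982224 |
 | 3   | 1      | 47      | 5.0    | 964983815 |
 | 4   | 1      | 50      | 5.0    | 964982931 |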


In [174]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [175]:
df = df.drop(['timestamp'], axis=1)

In [176]:
n_users = len(df.userId.unique())
print(f'Number of users: {n_users}')
n_movies = len(df.movieId.unique())
print(f'Number of movies: {n_movies}')

Number of users: 610
Number of movies: 9724


In [177]:
# Count how many times each rating value occurs
values, counts = np.unique(df.rating, return_counts=True)
for value, count in zip(values, counts):
    print(f'Number of rating {value}: {count}')

Number of rating 0.5: 1370
Number of rating 1.0: 2811
Number of rating 1.5: 1791
Number of rating 2.0: 7551
Number of rating 2.5: 5550
Number of rating 3.0: 20047
Number of rating 3.5: 13136
Number of rating 4.0: 26818
Number of rating 4.5: 8551
Number of rating 5.0: 13211


In [178]:
df.movieId.value_counts()

356       329
318       317
296       307
593       279
2571      278
         ... 
86279       1
86922       1
5962        1
87660       1
163981      1
Name: movieId, Length: 9724, dtype: int64

In [179]:
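# Look up the most-rated movie instead of hard-coding it
rating_counts = df.movieId.value_counts()
movie_id_most = rating_counts.idxmax()
n_rating_most = rating_counts.max()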
print(f'Movie ID {movie_id_most} has {n_rating_most} ratings.')

Movie ID 356 has 329 ratings.


We can consider movies with ratings greater than 3 as being liked (being recommended):

In [181]:
#Relabel ratings 
df['recommended'] = (df['rating'] > 3).astype(int)
df

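 |        | userId | movieId | rating | recommended |
 | ------ | ------ | ------- | ------ | ----------- |
 | 0      | 1      | 1       | 4.0    | 1           |
 | 1      | 1      | 3       | 4.0    | 1           |
 | 2      | 1      | 6       | 4.0    | 1           |
 | 3      | 1      | 47      | 5.0    | 1           |
 | 4      | 1      | 50      | 5.0    | 1           |
 | ...    | ...    | ...     | ...    | ...         |
 | 100831 | 610    | 166534  | 4.0    | 1           |
 | 100832 | 610    | 168248  | 5.0    | 1           |
 | 100833 | 610    | 168250  | 5.0    | 1           |
 | 100834 | 610    | 168252  | 5.0    | 1           |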


In [182]:
from sklearn.model_selection import train_test_split
X = df[['userId', 'movieId']]  # Features
y = df['recommended']  # Labels
# Hold out 20% of the ratings as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [183]:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X_train, y_train)

In [193]:
# For classifiers, score() returns the mean accuracy on the given test data
accuracy = clf.score(X_test, y_test)
print(f'The accuracy is: {accuracy*100:.1f}%')

The accuracy is: 61.2%
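
To put this accuracy in context, it helps to compare it with a baseline that always predicts the majority class. The sketch below estimates that baseline from the rating counts printed earlier in this section. Note the assumption: it uses the full dataset rather than the held-out 20% split, so it is only an approximation of the baseline on the test set.

```python
# Rating counts copied from the output printed earlier in this section
rating_counts = {
    0.5: 1370, 1.0: 2811, 1.5: 1791, 2.0: 7551, 2.5: 5550,
    3.0: 20047, 3.5: 13136, 4.0: 26818, 4.5: 8551, 5.0: 13211,
}

total = sum(rating_counts.values())
# Ratings greater than 3 are labeled as liked (recommended = 1)
n_liked = sum(count for rating, count in rating_counts.items() if rating > 3)
liked_fraction = n_liked / total
# Always predicting the majority class achieves max(p, 1 - p) accuracy
baseline_accuracy = max(liked_fraction, 1 - liked_fraction)

print(f'Total ratings: {total}')
print(f'Fraction with rating > 3: {liked_fraction*100:.1f}%')
print(f'Majority-class baseline accuracy: {baseline_accuracy*100:.1f}%')
```

If the baseline on the test split is close to this figure, the reported accuracy suggests that raw `userId` and `movieId` values may give the Gaussian model little usable signal beyond the class frequencies.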


# References 

* [Python Machine Learning By Example - Third Edition,
by Yuxi (Hayden) Liu](https://www.packtpub.com/product/python-machine-learning-by-example-third-edition/9781800209718)