# Matrix Factorization
![](https://miro.medium.com/v2/resize:fit:1400/1*EKYGZohO9XTJMo68OPyImA.png)



Features of Matrix Factorization:
1. Latent Features: The key idea behind matrix factorization is to discover latent features that represent underlying patterns in the user-item interaction. These latent features are not directly observable but help in capturing the preferences and characteristics of users and items.

2. Dimensionality Reduction: By choosing a small number of latent features $k$, we achieve dimensionality reduction. The original user-item rating matrix is approximated by the product of much smaller matrices, which makes computation and storage more efficient.

3. Predictions: After factorization, we can predict missing ratings and generate recommendations by multiplying the factors $U$, $\Sigma$, and $V^T$ and reading off the entries for the unseen user-item pairs.

4. Handling Missing Values: Matrix factorization can handle sparse user-item rating matrices effectively as it approximates the missing values based on the learned latent features.

5. Singular Value Thresholding: To reduce noise and improve recommendation accuracy, we can retain only the top $k$ singular values and set the rest to zero. The resulting rank-$k$ decomposition is commonly called truncated SVD.

6. Regularization: To prevent overfitting and improve generalization, regularization terms can be added to the loss function during the optimization process.
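
As a sketch of what item 6 refers to, the regularized objective for bias-aware matrix factorization (the form documented for Surprise's `SVD`) can be written as below. The symbols are introduced here for illustration: $\mu$ is the global mean rating, $b_u$ and $b_i$ are user and item biases, $p_u$ and $q_i$ are the latent factor vectors, $\mathcal{K}$ is the set of observed ratings, and $\lambda$ is the regularization strength.

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u

\min_{b,\,p,\,q} \sum_{(u,i) \in \mathcal{K}}
  \left( r_{ui} - \hat{r}_{ui} \right)^2
  + \lambda \left( b_u^2 + b_i^2 + \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right)
```

Because the sum runs only over observed ratings, missing entries never enter the loss, which is how this family of models copes with sparse data.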

Matrix factorization with SVD is a powerful technique for movie recommendation systems because it can capture complex user preferences and item characteristics even when the rating data is sparse. It is a form of collaborative filtering and underlies many successful recommendation algorithms, including matrix factorization with bias terms and its variations.
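
The ideas of latent features, dimensionality reduction, and prediction from $U$, $\Sigma$, and $V^T$ can be illustrated with a minimal NumPy sketch. The small rating matrix `R` and the choice `k = 2` are made-up assumptions. The matrix is fully observed because classical SVD needs a dense matrix, unlike the Surprise model used later in this notebook.

```python
import numpy as np

# Toy 4x3 rating matrix (rows = users, columns = items); values are made up.
R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])

# Thin SVD: R = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2  # number of latent features to keep
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Rank-k approximation of the whole rating matrix.
R_k = U_k @ np.diag(s_k) @ Vt_k

# Score for a single (user, item) pair: the user's row of U, scaled by the
# singular values, dotted with the item's column of Vt.
user, item = 0, 2
pred = (U_k[user] * s_k) @ Vt_k[:, item]

print(np.round(R_k, 2))
print(round(float(pred), 2))
```

The single-pair score equals the matching entry of `R_k`, so a prediction needs only one row of `U_k` and one column of `Vt_k`, not the full reconstructed matrix.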

In [None]:
!pip install scikit-surprise

# **Step 1: Preparing the Data Set**

In [None]:
import pandas as pd
movie = pd.read_csv('https://raw.githubusercontent.com/ubaid-shah/recommendation_system_ItVedant/main/movies.csv')
rating = pd.read_csv('https://raw.githubusercontent.com/ubaid-shah/recommendation_system_ItVedant/main/ratings.csv')


In [None]:
print('movie.shape:   ',movie.shape)
print('rating.shape:  ',rating.shape)

In [None]:
movie.head()

In [None]:
rating.head()

In [None]:
df = movie.merge(rating, how="left", on="MovieID")
df.head()

In [None]:
df.shape

In [None]:
df=df.drop(columns='Timestamp')

In [None]:
df.head()

In [None]:
# After the left merge UserID is a float column, so the IDs come out as e.g. 'user1753.0'
df['UserID']='user'+df['UserID'].astype(str)

In [None]:
df.head()

In [None]:
df['MovieID']=df['MovieID'].astype(str)

In [None]:
df.info()

In [None]:
df['Title'].value_counts()

In [None]:
counts=df['Title'].value_counts()
counts[counts>10].index  # titles with more than 10 ratings
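
Comparing a `value_counts()` result against a threshold yields a boolean Series whose index still lists every title. To keep only the frequent titles, the counts have to be indexed with that mask. A toy sketch of the difference, using a hypothetical miniature `titles` Series:

```python
import pandas as pd

# Hypothetical miniature of the Title column.
titles = pd.Series(["A", "A", "A", "B", "B", "C"], name="Title")
counts = titles.value_counts()

# The comparison gives a boolean Series; its index still holds every title.
all_titles = list((counts > 1).index)

# Indexing with the mask keeps only the titles above the threshold.
frequent = list(counts[counts > 1].index)

print(all_titles)
print(frequent)
```

This mask-indexing idiom is the one used with the threshold of 1000 in the cells that build `df_final`.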

In [None]:
counts=df['Title'].value_counts()
df[df['Title'].isin(counts[counts>1000].index)]

In [None]:
counts=df['Title'].value_counts()
df_final=df[df['Title'].isin(counts[counts>1000].index)]

# **Step 2: Modelling**

In [None]:
from surprise import Reader, SVD, Dataset, accuracy
from surprise.model_selection import train_test_split
# pd.set_option('display.max_columns', None)


In [None]:
df_final.Rating.unique()

In [None]:
reader = Reader(rating_scale=(1, 5))

In [None]:
data = Dataset.load_from_df(df_final[['UserID', 'Title', 'Rating']], reader)

In [None]:
trainset, testset = train_test_split(data, test_size=.25)
svd_model = SVD(n_factors=100)
svd_model.fit(trainset)

In [None]:
predictions = svd_model.test(testset)
predictions

In [None]:
accuracy.rmse(predictions)

RMSE: 0.8669


0.8668595840657397
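
An RMSE of about 0.87 means the predictions miss the true rating by roughly 0.87 stars on the 1 to 5 scale, in the root-mean-square sense. The metric can be sketched by hand as follows. The `pairs` list is made up and stands in for the `r_ui` and `est` fields of Surprise's `Prediction` objects.

```python
import math

# Hypothetical (true rating, estimated rating) pairs.
pairs = [(4.0, 3.5), (5.0, 4.0), (2.0, 3.0), (3.0, 3.0)]

def rmse(pairs):
    """Root mean squared error over (true, estimate) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

print(rmse(pairs))  # 0.75
```

With real predictions, the pairs could be built as `[(p.r_ui, p.est) for p in predictions]`.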

# **Step 3: Prediction**

In [None]:
df_final.sample(2)

In [None]:
svd_model.predict('user1753.0','Star Wars: Episode IV - A New Hope (1977)')

In [None]:
svd_model.predict('user5170.0','M*A*S*H (1970)')

In [None]:
# svd_model.pu
# svd_model.pu.shape
# svd_model.qi
# svd_model.qi.shape
# svd_model.bu.shape
# svd_model.bi.shape
# df_final[(df_final['UserID']=='user137.0') & (df_final['Title']=='American Beauty (1999)')]
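
The attributes commented out above hold the fitted parameters: `pu` and `qi` are the user and item factor matrices, and `bu` and `bi` are the user and item biases. A minimal sketch of how one estimate is assembled from them follows. All numbers are made up and are not taken from `svd_model`.

```python
import numpy as np

# Made-up parameters for one user and one item.
global_mean = 3.6                   # mean of all training ratings
b_u, b_i = 0.2, -0.1                # user bias and item bias
p_u = np.array([0.5, -0.3, 0.1])    # user latent factors (one row of pu)
q_i = np.array([0.4, 0.2, -0.6])    # item latent factors (one row of qi)

# Baseline (mean plus biases) plus the user-item interaction term.
raw = global_mean + b_u + b_i + q_i @ p_u

# Clip the estimate to the rating scale given to Reader.
low, high = 1, 5
est = min(high, max(low, raw))

print(round(float(est), 2))
```

In Surprise these arrays are indexed by inner ids, so a raw id such as a title must first be mapped with `trainset.to_inner_iid` (or `to_inner_uid` for users).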

In [None]:
df_final[df_final['Rating'].isna()]

| | MovieID | Title | Genres | UserID | Rating |
|---|---|---|---|---|---|


In [None]:
df_final.isna().sum()

MovieID    0
Title      0
Genres     0
UserID     0
Rating     0
dtype: int64

In [None]:
df.isna().sum()

MovieID      0
Title        0
Genres       0
UserID     177
Rating     177
dtype: int64
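
The 177 missing values come from the left merge: a movie with no ratings keeps its row and gets `NaN` for `UserID` and `Rating`, which also forces those columns to float (hence IDs such as `user5170.0`). A toy sketch with hypothetical frames shows the effect and how an inner merge differs:

```python
import pandas as pd

# Hypothetical miniature tables: movie 2 has no ratings.
movies = pd.DataFrame({"MovieID": [1, 2], "Title": ["Rated Movie", "Unrated Movie"]})
ratings = pd.DataFrame({"UserID": [10, 11], "MovieID": [1, 1], "Rating": [5, 4]})

# Left merge keeps the unrated movie and fills its rating columns with NaN.
left = movies.merge(ratings, how="left", on="MovieID")

# Inner merge keeps only movies that have at least one rating.
inner = movies.merge(ratings, how="inner", on="MovieID")

print(left["UserID"].isna().sum())  # one row without a user
print(left["UserID"].dtype)         # float, because of the NaN
print(inner["UserID"].dtype)        # stays an integer type
```

`df_final` shows no missing values because the filter on titles with more than 1000 ratings already removes every unrated movie.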

In [None]:
df[df['Rating'].isna()]

| | MovieID | Title | Genres | UserID | Rating |
|---|---|---|---|---|---|
| 25085 | 51 | Guardian Angel (1994) | Action\|Drama\|Thriller | NaN | NaN |
| 34063 | 109 | Headless Body in Topless Bar (1995) | Comedy | NaN | NaN |
| 38381 | 115 | Happiness Is in the Field (1995) | Comedy | NaN | NaN |
| 40480 | 143 | Gospa (1995) | Drama | NaN | NaN |
| 74693 | 284 | New York Cop (1996) | Action\|Crime | NaN | NaN |
| ... | ... | ... | ... | ... | ... |
| 947047 | 3650 | Anguish (Angustia) (1986) | Horror | NaN | NaN |
| 968525 | 3750 | Boricua's Bond (2000) | Drama | NaN | NaN |
| 984967 | 3829 | Mad About Mambo (2000) | Comedy\|Romance | NaN | NaN |
| 987607 | 3856 | Autumn Heart (1999) | Drama | NaN | NaN |


In [None]:
df1 = movie.merge(rating, how="left", on="MovieID")
df1.drop(columns=['Genres','Timestamp'],inplace=True)

In [None]:
df1.head()

| | MovieID | Title | UserID | Rating |
|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 1.0 | 5.0 |
| 1 | 1 | Toy Story (1995) | 6.0 | 4.0 |
| 2 | 1 | Toy Story (1995) | 8.0 | 4.0 |
| 3 | 1 | Toy Story (1995) | 9.0 | 5.0 |
| 4 | 1 | Toy Story (1995) | 10.0 | 5.0 |


In [None]:
df1[['UserID', 'Title', 'Rating']]

| | UserID | Title | Rating |
|---|---|---|---|
| 0 | 1.0 | Toy Story (1995) | 5.0 |
| 1 | 6.0 | Toy Story (1995) | 4.0 |
| 2 | 8.0 | Toy Story (1995) | 4.0 |
| 3 | 9.0 | Toy Story (1995) | 5.0 |
| 4 | 10.0 | Toy Story (1995) | 5.0 |
| ... | ... | ... | ... |
| 1000381 | 5812.0 | Contender, The (2000) | 4.0 |
| 1000382 | 5831.0 | Contender, The (2000) | 3.0 |
| 1000383 | 5837.0 | Contender, The (2000) | 4.0 |
| 1000384 | 5927.0 | Contender, The (2000) | 1.0 |


In [None]:
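reader_df = Reader(rating_scale=(1, 5))
# Drop the movies that have no ratings (NaN) before building the dataset
df1_rated = df1.dropna(subset=['Rating'])
data_df = Dataset.load_from_df(df1_rated[['UserID', 'Title', 'Rating']], reader_df)
trainset_df, testset_df = train_test_split(data_df, test_size=.25)
svd_df=SVD()
svd_df.fit(trainset_df)
# Evaluate the model that was just fitted on this split
predictions = svd_df.test(testset_df)
predictions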

[Prediction(uid=1680.0, iid='Pi (1998)', r_ui=4.0, est=3.788260341573804, details={'was_impossible': False}),
 Prediction(uid=4422.0, iid='Braveheart (1995)', r_ui=4.0, est=4.315092314540358, details={'was_impossible': False}),
 Prediction(uid=2180.0, iid='Quick and the Dead, The (1995)', r_ui=4.0, est=3.685298773201385, details={'was_impossible': False}),
 Prediction(uid=3823.0, iid='Adventures of Robin Hood, The (1938)', r_ui=5.0, est=4.0697070288622665, details={'was_impossible': False}),
 Prediction(uid=5812.0, iid='Gold Rush, The (1925)', r_ui=5.0, est=4.250677220999993, details={'was_impossible': False}),
 Prediction(uid=2907.0, iid='MatchMaker, The (1997)', r_ui=2.0, est=3.421705443458544, details={'was_impossible': False}),
 Prediction(uid=1214.0, iid='Bedknobs and Broomsticks (1971)', r_ui=3.0, est=3.7345812401241822, details={'was_impossible': False}),
 Prediction(uid=3231.0, iid='Cape Fear (1962)', r_ui=4.0, est=4.1427242996502285, details={'was_impossible': False}),
 Predic

In [None]:
df1.isna().sum()

MovieID      0
Title        0
UserID     177
Rating     177
dtype: int64

In [None]:
df1[df1['Rating'].isna()]

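| | MovieID | Title | UserID | Rating |
|---|---|---|---|---|
| 25085 | 51 | Guardian Angel (1994) | NaN | NaN |
| 34063 | 109 | Headless Body in Topless Bar (1995) | NaN | NaN |
| 38381 | 115 | Happiness Is in the Field (1995) | NaN | NaN |
| 40480 | 143 | Gospa (1995) | NaN | NaN |
| 74693 | 284 | New York Cop (1996) | NaN | NaN |
| ... | ... | ... | ... | ... |
| 947047 | 3650 | Anguish (Angustia) (1986) | NaN | NaN |
| 968525 | 3750 | Boricua's Bond (2000) | NaN | NaN |
| 984967 | 3829 | Mad About Mambo (2000) | NaN | NaN |
| 987607 | 3856 | Autumn Heart (1999) | NaN | NaN |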


In [None]:
svd_model.predict('user5170.0','M*A*S*H (1970)')

Prediction(uid='user5170.0', iid='M*A*S*H (1970)', r_ui=None, est=4.161905195380475, details={'was_impossible': False})
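
Single-pair predictions like the one above can be turned into a recommendation list by scoring many candidate titles for one user and keeping the best. The sketch below assumes a hypothetical helper `top_n` and a made-up scorer so that it runs on its own. With the fitted model, the scorer could be `lambda u, i: svd_model.predict(u, i).est`.

```python
def top_n(predict, user_id, candidate_items, n=3):
    """Return the n (item, estimate) pairs with the highest estimates.

    `predict(user_id, item_id)` must return a float estimate.
    """
    scored = [(item, predict(user_id, item)) for item in candidate_items]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Stand-in scorer with made-up estimates for four hypothetical titles.
fake_scores = {"Movie A": 4.2, "Movie B": 3.1, "Movie C": 4.8, "Movie D": 2.5}

recs = top_n(lambda user, item: fake_scores[item], "user1", fake_scores, n=2)
print(recs)  # [('Movie C', 4.8), ('Movie A', 4.2)]
```

In practice the candidate list would usually exclude titles the user has already rated.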