## importing the datasetst

In [1]:
import pandas as pd
import numpy as np

Let's import the data that we cleaned and prepared in the last notebook.

In [2]:
ratings = pd.read_csv("data/CleanedData/ratings.csv")
movies = pd.read_csv("data/CleanedData/movies.csv")
genres = pd.read_csv("data/CleanedData/genres.csv")
movie_stats = pd.read_csv("data/CleanedData/movieState.csv")

In [3]:
movie_ratings= ratings.merge(movies , on="movieId")
movie_ratings.head()

Unnamed: 0,userId,movieId,rating,title,genres,year
0,1,1,4.0,Toy Story,"['Adventure', 'Animation', 'Children', 'Comedy...",1995
1,5,1,4.0,Toy Story,"['Adventure', 'Animation', 'Children', 'Comedy...",1995
2,7,1,4.5,Toy Story,"['Adventure', 'Animation', 'Children', 'Comedy...",1995
3,18,1,3.5,Toy Story,"['Adventure', 'Animation', 'Children', 'Comedy...",1995
4,19,1,4.0,Toy Story,"['Adventure', 'Animation', 'Children', 'Comedy...",1995


## analizing the best movie by personal rating

I'm going to build my own personal profil and rate some movies

In [4]:
userInput = [
            {'title':'Shrek', 'rating':5},
            {'title':'Toy Story', 'rating':4.5},
            {'title':'Ice Age' , 'rating':5},
            {'title':'Jumanji', 'rating':2},
            {'title':"Dark Knight, The", 'rating':4.5},
            {'title':'Mask, The', 'rating':4}

         ]
inputMovies = pd.DataFrame(userInput)
inputMovies

Unnamed: 0,title,rating
0,Shrek,5.0
1,Toy Story,4.5
2,Ice Age,5.0
3,Jumanji,2.0
4,"Dark Knight, The",4.5
5,"Mask, The",4.0


extracting the movies that I rated

In [5]:
myMovies = movies[movies['title'].isin(inputMovies['title'])].drop("genres" , axis=1)
myMovies

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
325,367,"Mask, The",1994
3194,4306,Shrek,2001
3745,5218,Ice Age,2002
6710,58559,"Dark Knight, The",2008


Next, let's build a genre analysis profile based on my ratings.

In [6]:
userMovies = genres[genres['movieId'].isin(myMovies['movieId'])].reset_index(drop=True)
userProfile = userMovies.transpose().dot(inputMovies['rating'])
#The user profile
pd.DataFrame(userProfile , columns=["waight"] ).drop("movieId" , axis=0)

Unnamed: 0,waight
(no genres listed),0.0
Action,9.0
Adventure,16.0
Animation,11.5
Children,16.0
Comedy,16.5
Crime,5.0
Documentary,0.0
Drama,0.0
Fantasy,16.5


Let's see if we can find some users who have similar tastes to mine.  
First, we'll examine the movies each user has rated or watched.


In [7]:
userAllMovies = ratings.groupby("userId")["movieId"].apply(list)
userAllMovies = pd.DataFrame(userAllMovies)
userAllMovies.reset_index(inplace=True)
userAllMovies.head(2)

Unnamed: 0,userId,movieId
0,1,"[1, 3, 6, 47, 50, 70, 101, 110, 151, 157, 163,..."
1,2,"[318, 333, 1704, 3578, 6874, 8798, 46970, 4851..."


Perfect! Now, let's extract the similarity (common movies that they and I watched) and add it to a column.


In [8]:
userAllMovies["semilarity"] = [len(set(myMovies.movieId).intersection(set(g))) for g in userAllMovies["movieId"]]
userAllMovies.head()

Unnamed: 0,userId,movieId,semilarity
0,1,"[1, 3, 6, 47, 50, 70, 101, 110, 151, 157, 163,...",2
1,2,"[318, 333, 1704, 3578, 6874, 8798, 46970, 4851...",1
2,3,"[31, 527, 647, 688, 720, 849, 914, 1093, 1124,...",0
3,4,"[21, 32, 52, 58, 106, 125, 126, 162, 171, 176,...",1
4,5,"[1, 21, 34, 36, 39, 50, 58, 110, 150, 153, 232...",3


In [9]:
#people who are closes to me
userAllMovies[userAllMovies["semilarity"] == userAllMovies["semilarity"].max()].head(5)

Unnamed: 0,userId,movieId,semilarity
17,18,"[1, 2, 6, 16, 34, 36, 50, 70, 104, 110, 111, 1...",6
18,19,"[1, 2, 3, 7, 10, 12, 13, 15, 19, 32, 34, 44, 4...",6
20,21,"[1, 2, 10, 19, 38, 44, 48, 145, 165, 170, 260,...",6
67,68,"[1, 2, 3, 5, 6, 7, 10, 11, 16, 17, 18, 19, 25,...",6
102,103,"[1, 2, 5, 16, 18, 19, 34, 36, 48, 50, 60, 70, ...",6


## Correlation

Now, we want to extract the correlation between the movies based on their ratings.

<hr>

First, to extract the correlations, we need to create a pivot table based on user ratings and movies.



In [10]:
movie_ratings = movie_ratings.drop("year", axis=1)
movie_ratings = movie_ratings.drop("movieId", axis=1)
movie_ratings = movie_ratings.drop("genres", axis=1)
movie_ratings = movie_ratings.drop_duplicates(subset=["userId" , "title"])
movie_Pivot = movie_ratings.pivot(values="rating" , columns="title", index="userId" )
movie_Pivot.head()

title,"""Great Performances"" Cats",#1 Cheerleader Camp,$ (Dollars),$5 a Day,$9.99,$ellebrity (Sellebrity),'71,'Hellboy': The Seeds of Creation,'Neath the Arizona Skies,'R Xmas,...,xXx,xXx: State of the Union,¡Three Amigos!,À l'aventure,À nos amours,À nous la liberté (Freedom for Us),À propos de Nice,"Ó Paí, Ó",أهواك,貞子3D
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,4.0,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,


Perfect! Now that we have the pivot table, let's create a table with each movie in a row. Additionally, we can count the number of ratings for each movie.


In [11]:
ratingCount = pd.DataFrame(movie_ratings.groupby('title')['rating'].count())

Now, we want to find the correlation between the movie **Shrek** and other movies. We'll add these correlation values to the table we created.


In [12]:
#Getting the correlation
Shrek_corr = movie_Pivot.corrwith(movie_Pivot["Shrek"])

Shrek_corr = pd.DataFrame(Shrek_corr,columns=['Correlation'])
Shrek_corr.dropna(inplace=True)
Shrek_corr = Shrek_corr.join(ratingCount["rating"])  #adding the ratings count to the table

  c = cov(x, y, rowvar, dtype=dtype)
  c *= np.true_divide(1, fact)
  c *= np.true_divide(1, fact)
  c /= stddev[:, None]
  c /= stddev[None, :]


To ensure a fair comparison, we will only include movies that have more than 100 ratings. Then, we will sort the table by the correlation to find the movies most similar to *Shrek*.


In [13]:
Shrek_corr[Shrek_corr["rating"] > 100].sort_values("Correlation" , ascending=False).head()

Unnamed: 0_level_0,Correlation,rating
title,Unnamed: 1_level_1,Unnamed: 2_level_1
Shrek,1.0,6587
Shrek 2,0.739514,2953
Splendor in the Grass,0.588821,119
"Perez Family, The",0.569923,136
"Fantastic Planet, The (Planète sauvage, La)",0.557171,155


***Wow, look at this! Shrek 2 is the most similar movie to Shrek. Great job ;)***

<img src="img/shrek_2.jpg" alt="Description" width="200" height="120">

## Clustering using (K-means)

Now, we will cluster the movies based on their genre, Bayesian average rating, and release year. This approach will help us group movies that share similar genres, receive similar ratings adjusted using Bayesian averaging to account for rating quantity, and were released in similar years. Clustering based on these features can reveal patterns and similarities among movies.



In [14]:
movie_stats.head(2)

Unnamed: 0,movieId,title,year,count,mean,bayesian_avg
0,1,Toy Story,1995,10464,3.88685,3.875
1,2,Jumanji,1995,4206,3.225036,3.22


Let's add the genres

In [15]:
movie_content = movie_stats.merge(genres , on= "movieId")
movie_content.head(3)

Unnamed: 0,movieId,title,year,count,mean,bayesian_avg,(no genres listed),Action,Adventure,Animation,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story,1995,10464,3.88685,3.875,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji,1995,4206,3.225036,3.22,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men,1995,2676,3.210949,3.204,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [16]:
movie_content = movie_content.drop("mean" ,axis=1 ) #removing the Mean (because we use bayesian-mean)
# removing the title and movieId
x = movie_content.drop("movieId" , axis=1)
x = x.drop("title" , axis = 1)

For clustering, we will use the ***K-means*** algorithm from the `sklearn` library. Let's import it and fit our data to it.


In [17]:
from sklearn.cluster import KMeans
N = 30 # number of Cluster Groups
clusterModle = KMeans(init="k-means++", n_clusters = N , n_init=12)
clusterModle.fit(x)

Let's examine the labels (group numbers) assigned by the K-means algorithm.


In [18]:
lables = clusterModle.labels_
lables

array([12, 21, 19, ..., 29, 29, 13])

In [19]:
#adding the lable to the table
movie_content["lable"] = lables
movie_content.head(3)

Unnamed: 0,movieId,title,year,count,bayesian_avg,(no genres listed),Action,Adventure,Animation,Children,...,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,lable
0,1,Toy Story,1995,10464,3.875,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,12
1,2,Jumanji,1995,4206,3.22,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,21
2,3,Grumpier Old Men,1995,2676,3.204,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,19


Let's determine which cluster the movie "Shrek" belongs to.


In [27]:
movie_content[movie_content.title == "Shrek"].lable

3194    22
Name: lable, dtype: int32

In [28]:
movies[movies["movieId"].isin( movie_content["movieId"][movie_content["lable"] == 22].values)]

Unnamed: 0,movieId,title,genres,year
126,153,Batman Forever,"['Action', 'Adventure', 'Comedy', 'Crime']",1995
138,165,Die Hard: With a Vengeance,"['Action', 'Crime', 'Thriller']",1995
197,231,Dumb & Dumber (Dumb and Dumber),"['Adventure', 'Comedy']",1994
325,367,"Mask, The","['Action', 'Comedy', 'Crime', 'Fantasy']",1994
436,500,Mrs. Doubtfire,"['Comedy', 'Drama']",1993
512,595,Beauty and the Beast,"['Animation', 'Children', 'Fantasy', 'Musical'...",1991
594,736,Twister,"['Action', 'Adventure', 'Romance', 'Thriller']",1996
836,1097,E.T. the Extra-Terrestrial,"['Children', 'Drama', 'Sci-Fi']",1982
896,1193,One Flew Over the Cuckoo's Nest,['Drama'],1975
899,1197,"Princess Bride, The","['Action', 'Adventure', 'Comedy', 'Fantasy', '...",1987


Wow, there are some great movies here! As you can see, one of my favorite movies, `'Beauty and the Beast,'` is included in this cluster. It's always a delight to find movies that I enjoy. I've also spotted some other movies that I like as well.

<img src="img/BeautyandtheBeast.jpg" alt="Description" width="200" height="120">

Additionally, you might recall that I added ***"The Mask"*** to `my personal profile` earlier in the notebook. As it turns out, this movie is  included in our list:) 

<img src="img/Mask.jpg" alt="Description" width="200" height="120">


## Collaborative filtering using (KNN)
Now, we are going to implement a popular method called Collaborative Filtering using the **K-Nearest Neighbors** (KNN) algorithm.


*Collaborative filtering* is a technique used in recommendation systems to predict interests or preferences by leveraging similarities between users  based on their ratings. It helps recommend items to users based on the preferences of similar users or recommend similar items based on past user interactions.

First, we need to create a pivot table where movies are in rows, users are in columns, and ratings are the values.


In [30]:
rating_Pivot = movie_ratings.pivot(values="rating" , columns="userId", index="title" ).fillna(0)

In [32]:
rating_Pivot.head(2)

userId,1,2,3,4,5,6,7,8,9,10,...,42121,42122,42123,42124,42125,42126,42127,42128,42129,42130
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"""Great Performances"" Cats",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
#1 Cheerleader Camp,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Next, we will import `csr_matrix` to convert the pivot table to a sparse matrix format. Then, we'll import `KNN` from `sklearn` and fit the matrix to it.


In [31]:
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
rating_matrix = csr_matrix(rating_Pivot)

model_KNN = NearestNeighbors(metric="cosine", algorithm="brute")
model_KNN.fit(rating_matrix)

Now, we will extract the ratings matrix for the movie **"Shrek"** and find the 5 nearest neighbors for this movie using the model that we just fit.
<img src="img/sherek.jpg" alt="Description" width="200" height="120">

In [33]:
ShrekMetrix = np.array(movie_Pivot["Shrek"].fillna(0)).reshape(1,-1)
distance , neighbor = model_KNN.kneighbors(ShrekMetrix , n_neighbors=5)

Now , let's list out the nearest movies

In [34]:
for i in range(0 , len(distance.flatten())):
  movie =rating_Pivot.iloc[neighbor.flatten()[i]].name

  if i == 0 :
    print(f"closses movies to  ( {movie} ) : \n  ")
  else :
    print(f"{i+1}:( {movie} )  distance => : {distance.flatten()[i]} ")

closses movies to  ( Shrek ) : 
  
2:( Monsters, Inc. )  distance => : 0.361110207885454 
3:( Finding Nemo )  distance => : 0.36597993409505347 
4:( Lord of the Rings: The Fellowship of the Ring, The )  distance => : 0.36840866704134534 
5:( Pirates of the Caribbean: The Curse of the Black Pearl )  distance => : 0.38939525382499296 


Wow, look at the first two movies: "Monsters, Inc." and "Finding Nemo"! They are both awesome and perfect for Shrek fans.

<p float="left">
  <img src="img/nemo.jpg" alt="Finding Nemo" width="200" height="120">
  <img src="img/monster_inc.jpg" alt="Monsters, Inc." width="200" height="120">
</p>


## content based filtering
`Content-based filtering` is a recommendation system technique that recommends items based on their features or attributes, such as genre, keywords, or descriptions, matching the user's preferences derived from past interactions or profiles.

For this part, I have an additional dataset containing descriptions for each movie. We will use this dataset to conduct content-based filtering based on the description of each movie.


In [37]:
desc = pd.read_csv("data/CleanedData/description.csv")
desc.head()

Unnamed: 0,title,description
0,Toy Story,"Led by Woody, Andy's toys live happily in his ..."
1,Jumanji,When siblings Judy and Peter discover an encha...
2,Grumpier Old Men,A family wedding reignites the ancient feud be...
3,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom..."
4,Father of the Bride Part II,Just when George Banks has recovered from his ...


let's add the description to the movies dataset

In [38]:
movies_desc = movies.merge(desc , on= "title")
movies_desc.head(1)

Unnamed: 0,movieId,title,genres,year,description
0,1,Toy Story,"['Adventure', 'Animation', 'Children', 'Comedy...",1995,"Led by Woody, Andy's toys live happily in his ..."


We will use the `TfidfVectorizer` from `sklearn` to vectorize the movie descriptions because computers understand numbers, not text. First, let's import `TfidfVectorizer` and set the parameters.


In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfv = TfidfVectorizer(min_df=3 , max_features=None ,
                     strip_accents="unicode" , analyzer="word" ,
                      token_pattern="r'\w{1,}" , ngram_range=(1,3),
                      stop_words="english")
#filling the null value with ""
movies_desc["description"] = movies_desc["description"].fillna("")
# fitting the descriptions to the model
tfv_matrix = tfv.fit_transform(movies_desc["description"])

Next, from `sklearn`, we will import `sigmoid_kernel` and assign the TF-IDF vectorized matrix to it.


In [40]:
from sklearn.metrics.pairwise import sigmoid_kernel

sigmoid = sigmoid_kernel(tfv_matrix , tfv_matrix)

In [41]:
sigmoid[0]

array([0.76159416, 0.76159416, 0.76159416, ..., 0.76159416, 0.76159416,
       0.76159416])

Now, let's retrieve the index for every movie. We will use these indices later to extract each movie's title.


In [42]:
indices = pd.Series(movies_desc.index, index=movies_desc['title']).drop_duplicates()
indices

title
Toy Story                                              0
Jumanji                                                1
Grumpier Old Men                                       2
Waiting to Exhale                                      3
Father of the Bride Part II                            4
                                                   ...  
Obsession: Radical Islam's War Against the West    24355
Hollywood High                                     24356
Bloodmoney                                         24357
The Butterfly Circus                               24358
The 2000 Year Old Man                              24359
Length: 24360, dtype: int64

Next, we will build a function that takes a movie title, uses the sigmoid kernel function to calculate the similarity scores between that movie and others based on their descriptions, and displays the top 5 similar movies.


In [43]:
def Content_recommendation(title, sig=sigmoid):

    idx = indices[title]

    sig_scores = list(enumerate(sig[3]))


    sig_scores = sorted(sig_scores, key=lambda x: x[1], reverse=True)


    sig_scores = sig_scores[1:6]

    movie_indices = [i[0] for i in sig_scores]

    return movies_desc['title'].iloc[movie_indices]

In [61]:
Content_recommendation("Jumanji")

1                        Jumanji
2               Grumpier Old Men
3              Waiting to Exhale
4    Father of the Bride Part II
5                           Heat
Name: title, dtype: object

<hr>

Alright, that wraps it up! We explored different methods for recommending movies and, along the way, discovered some great films to add to our watchlist.
