**The Entertainment Company**, an online movie-watching platform, wants to improve its collection of movies, showcase those that are highly rated, and recommend movies to each customer based on their viewing history. To this end, the company has collected data and shared it with you so that you can provide analytical insights and build a recommendation algorithm that automates its recommendations. The ratings range from -9 to +9.

In [None]:
# Import pandas and load the data we will work on
import pandas as pd

movies = pd.read_csv("Entertainment.csv")

In [None]:
movies.shape

(51, 4)

In [None]:
movies.columns

Index(['Id', 'Titles', 'Category', 'Reviews'], dtype='object')

**Data Description: Entertainment Dataset**

Id -- Nominal identifier of the movie

Titles -- Names of the movies

Category -- Category/genre the film belongs to

Reviews -- Review rating of the movies by the users

The titles and categories in the data are text. To work with the categories numerically, we vectorize them with **TF-IDF - "Term Frequency Inverse Document Frequency"**, which turns each movie's category string into one row of a movie-by-term matrix. From that matrix we can compute a similarity matrix among the movies listed under **Titles**.
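As a small illustration of what the vectorizer produces, the sketch below runs `TfidfVectorizer` on three made-up genre strings. The strings and the names `toy_categories`, `toy_vectorizer`, and `toy_matrix` are assumptions for illustration only and are not taken from Entertainment.csv.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical genre strings, not taken from Entertainment.csv
toy_categories = [
    "Action, Crime, Thriller",
    "Comedy, Romance",
    "Action, Adventure, Thriller",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_categories)

# One row per movie, one column per distinct token found in the genre strings
print(toy_matrix.shape)

# The tokens are lowercased and the punctuation is dropped
print(sorted(toy_vectorizer.vocabulary_))
```

Each row is a sparse vector whose non-zero entries mark the genres of that movie, weighted so that genres shared by many movies count for less than rare ones.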

In [None]:
# Importing the TfidfVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TfidfVectorizer that drops English stop words

Tfidf = TfidfVectorizer(stop_words="english")

In [None]:
# Checking for the NaN values in category
movies["Category"].isnull().sum()


0

In [None]:
# Create the TF-IDF matrix (one row per movie, one column per term)
tfidf_matrix = Tfidf.fit_transform(movies.Category)
tfidf_matrix.shape

(51, 34)

**Cosine Similarity**: Measures the cosine of the angle between two vectors, so it compares their orientation rather than their magnitude. The cosine of 0 degrees is 1, which means the two vectors point in the same direction and the data points are similar. The cosine of 90 degrees is 0, which means the vectors have nothing in common and the data points are dissimilar.
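The definition can be checked on a few hand-made vectors. The sketch below is illustrative: the helper `cosine` and the vectors `a`, `b`, `c`, and `unit_rows` are hypothetical names, not part of the notebook. It also shows why a plain dot product (`linear_kernel`) gives the same result as `cosine_similarity` when every row already has unit length, which is the case for `TfidfVectorizer` output with its default `norm="l2"` setting.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


a = np.array([1.0, 0.0])
b = np.array([3.0, 0.0])  # same direction as a, different length
c = np.array([0.0, 2.0])  # perpendicular to a

same_direction = cosine(a, b)  # angle of 0 degrees
perpendicular = cosine(a, c)   # angle of 90 degrees
print(same_direction, perpendicular)

# For rows that already have unit length, the dot product equals the cosine
unit_rows = np.array([[1.0, 0.0], [0.6, 0.8]])
dot_products = linear_kernel(unit_rows, unit_rows)
cosines = cosine_similarity(unit_rows, unit_rows)
print(np.allclose(dot_products, cosines))
```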

In [None]:
# To find the similarity scores we import linear_kernel from sklearn
from sklearn.metrics.pairwise import linear_kernel

In [None]:
# Build the movie-by-movie cosine similarity matrix. TfidfVectorizer L2-normalizes
# each row by default, so the dot product from linear_kernel equals the cosine similarity

cos_sim_matrix = linear_kernel(tfidf_matrix, tfidf_matrix)

In [None]:
# Map each movie title to its row index, keeping the first row for any repeated title
movies_index = pd.Series(movies.index, index=movies["Titles"])
movies_index = movies_index[~movies_index.index.duplicated(keep="first")]

In [None]:
movies_index

Titles
Toy Story (1995)                                         0
Jumanji (1995)                                           1
Grumpier Old Men (1995)                                  2
Waiting to Exhale (1995)                                 3
Father of the Bride Part II (1995)                       4
Heat (1995)                                              5
Sabrina (1995)                                           6
Tom and Huck (1995)                                      7
Sudden Death (1995)                                      8
GoldenEye (1995)                                         9
American President, The (1995)                          10
Dracula: Dead and Loving It (1995)                      11
Balto (1995)                                            12
Nixon (1995)                                            13
Cutthroat Island (1995)                                 14
Casino (1995)                                           15
Sense and Sensibility (1995)                     

In [None]:
# Look up the row index of one movie as a check
movies_id = movies_index["Heat (1995)"]
movies_id

5

In [None]:
# User-defined function that prints the topN movies most similar to a given title
def get_recommendations(Name, topN):

    # Get the row index of the movie from its title
    movies_id = movies_index[Name]

    # Pair each movie's row index with its cosine similarity to the chosen movie,
    # then sort from most to least similar
    cosine_scores = list(enumerate(cos_sim_matrix[movies_id]))
    cosine_scores = sorted(cosine_scores, key=lambda x: x[1], reverse=True)

    # Keep topN + 1 rows: the chosen movie matches itself with a score of 1,
    # so it takes up one row of the output
    cosine_scores_N = cosine_scores[0:topN + 1]

    # Split the pairs into row indices and scores
    movies_idx = [i[0] for i in cosine_scores_N]
    movies_scores = [i[1] for i in cosine_scores_N]

    # Build the result table; reset_index keeps the original row number in an "index" column
    movies_similar = pd.DataFrame(columns=["Titles", "Scores"])
    movies_similar["Titles"] = movies.loc[movies_idx, "Titles"]
    movies_similar["Scores"] = movies_scores
    movies_similar.reset_index(inplace=True)

    print(movies_similar)


The function defined above recommends movies based on how similar their categories are. It computes the similarity scores against the chosen movie, keeps the top N matches, and prints them. The call below shows it in use.

In [None]:
# Recommend the top 10 movies whose categories are most similar to
# those of the movie named in the call

get_recommendations("Casino (1995)", topN = 10) 
movies_index["Casino (1995)"]

    index                          Titles    Scores
0      15                   Casino (1995)  1.000000
1      35                 Clueless (1995)  0.546160
2       0                Toy Story (1995)  0.432793
3      24        Leaving Las Vegas (1995)  0.418992
4      17               Four Rooms (1995)  0.400306
5      27               Persuasion (1995)  0.400306
6      28    City of Lost Children (1995)  0.400306
7      33         Dead Man Walking (1995)  0.400306
8      48                 Lamerica (1994)  0.400306
9      22                Assassins (1995)  0.379688
10     10  American President, The (1995)  0.349033


15

Hence, the output above lists the movies whose categories match **"Casino (1995)"** most closely. The first row is "Casino (1995)" itself, with a score of 1.
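Since the printed table includes the chosen movie as one of its rows, a variant that leaves it out and returns a DataFrame instead of printing may be easier to reuse. The sketch below is one possible way to do this, not the notebook's own method. It runs on a made-up four-movie table, and `toy_movies`, `toy_sim`, `toy_index`, and `recommend` are hypothetical names introduced for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical stand-in for the movies DataFrame (made-up titles and genres)
toy_movies = pd.DataFrame({
    "Titles": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Category": ["Crime, Drama", "Comedy, Romance", "Crime, Thriller", "Drama, Romance"],
})

toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_movies["Category"])
toy_sim = linear_kernel(toy_tfidf, toy_tfidf)
toy_index = pd.Series(toy_movies.index, index=toy_movies["Titles"])


def recommend(name, top_n):
    """Return the top_n titles most similar to `name`, leaving `name` itself out."""
    query_id = toy_index[name]
    scores = pd.Series(toy_sim[query_id], index=toy_movies.index)
    # Drop the query row first, then keep the highest remaining scores
    scores = scores.drop(query_id).sort_values(ascending=False).head(top_n)
    return pd.DataFrame({
        "Titles": toy_movies.loc[scores.index, "Titles"].to_numpy(),
        "Scores": scores.to_numpy(),
    })


result = recommend("Movie A", top_n=2)
print(result)
```

Returning the table rather than printing it lets the caller filter, merge, or display the recommendations as needed.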