# Movie Recommendation System

In [77]:
import pandas as pd

### Importing the data


In [78]:
data = pd.read_csv('dataset/movies.csv')

data.head()

# len(data.columns)

,id,title,genre,original_language,overview,popularity,release_date,vote_average,vote_count
0,278,The Shawshank Redemption,"Drama,Crime",en,Framed in the 1940s for the double murder of h...,94.075,1994-09-23,8.7,21862
1,19404,Dilwale Dulhania Le Jayenge,"Comedy,Drama,Romance",hi,"Raj is a rich, carefree, happy-go-lucky second...",25.408,1995-10-19,8.7,3731
2,238,The Godfather,"Drama,Crime",en,"Spanning the years 1945 to 1955, a chronicle o...",90.585,1972-03-14,8.7,16280
3,424,Schindler's List,"Drama,History,War",en,The true story of how businessman Oskar Schind...,44.761,1993-12-15,8.6,12959
4,240,The Godfather: Part II,"Drama,Crime",en,In the continuing saga of the Corleone crime f...,57.749,1974-12-20,8.6,9811


## Feature Selection

Feature selection keeps only the columns that actually contribute to the quality of the recommendations and drops the rest.


In [79]:
data.columns

Index(['id', 'title', 'genre', 'original_language', 'overview', 'popularity',
       'release_date', 'vote_average', 'vote_count'],
      dtype='object')

#### We keep only the columns we need: id, title, genre, overview


In [80]:
movies = data[['id', 'title', 'genre', 'overview']]
movies

,id,title,genre,overview
0,278,The Shawshank Redemption,"Drama,Crime",Framed in the 1940s for the double murder of h...
1,19404,Dilwale Dulhania Le Jayenge,"Comedy,Drama,Romance","Raj is a rich, carefree, happy-go-lucky second..."
2,238,The Godfather,"Drama,Crime","Spanning the years 1945 to 1955, a chronicle o..."
3,424,Schindler's List,"Drama,History,War",The true story of how businessman Oskar Schind...
4,240,The Godfather: Part II,"Drama,Crime",In the continuing saga of the Corleone crime f...
...,...,...,...,...
9995,10196,The Last Airbender,"Action,Adventure,Fantasy","The story follows the adventures of Aang, a yo..."
9996,331446,Sharknado 3: Oh Hell No!,"Action,TV Movie,Science Fiction,Comedy,Adventure",The sharks take bite out of the East Coast whe...
9997,13995,Captain America,"Action,Science Fiction,War","During World War II, a brave, patriotic Americ..."
9998,2312,In the Name of the King: A Dungeon Siege Tale,"Adventure,Fantasy,Action,Drama",A man named Farmer sets out to rescue his kidn...


#### Magic of NLP
We build one `tags` string per movie by concatenating its genre and overview. The tags are the text that gets vectorized in the next step, so that movies can be compared with one another.


In [81]:
# assign() returns a new DataFrame, so nothing is written into a slice of `data`
movies = movies.assign(tags=movies['genre'] + movies['overview'])
movies = movies.drop(columns=['genre', 'overview'])
movies


,id,title,tags
0,278,The Shawshank Redemption,"Drama,CrimeFramed in the 1940s for the double ..."
1,19404,Dilwale Dulhania Le Jayenge,"Comedy,Drama,RomanceRaj is a rich, carefree, h..."
2,238,The Godfather,"Drama,CrimeSpanning the years 1945 to 1955, a ..."
3,424,Schindler's List,"Drama,History,WarThe true story of how busines..."
4,240,The Godfather: Part II,"Drama,CrimeIn the continuing saga of the Corle..."
...,...,...,...
9995,10196,The Last Airbender,"Action,Adventure,FantasyThe story follows the ..."
9996,331446,Sharknado 3: Oh Hell No!,"Action,TV Movie,Science Fiction,Comedy,Adventu..."
9997,13995,Captain America,"Action,Science Fiction,WarDuring World War II,..."
9998,2312,In the Name of the King: A Dungeon Siege Tale,"Adventure,Fantasy,Action,DramaA man named Farm..."
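The tags above run the last genre straight into the first word of the overview (for example `Drama,CrimeFramed`), so the default `CountVectorizer` tokenizer would see a single token such as `crimeframed`. A possible variant, sketched here on made-up rows rather than the real dataset, joins the two columns with a space and fills missing text first. Using it would change the tags and therefore probably the recommendations shown later in this notebook.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are illustrative only)
sample = pd.DataFrame({
    'genre': ['Drama,Crime', 'Action,Adventure'],
    'overview': ['Framed for murder.', None],
})

# Fill missing text with empty strings, then join with a space so the last
# genre and the first overview word stay separate tokens
sample['tags'] = sample['genre'].fillna('') + ' ' + sample['overview'].fillna('')

print(sample['tags'].tolist())
```

Without `fillna('')`, adding a string to a missing value yields a missing tag for that row.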


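### Text Vectorization
Convert each movie's tags into a numeric vector of word counts so that similarities between movies can be computed. `CountVectorizer` is limited here to the 10,000 most frequent words and ignores common English stop words.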


In [82]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=10000, stop_words='english')
vector = vectorizer.fit_transform(movies['tags'].values.astype("U")).toarray()
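To make the bag-of-words step concrete, here is a minimal sketch on two invented tag strings (the names `docs`, `toy_vectorizer` and `toy_counts` are illustrative and not part of the notebook). Each row of the result is one document and each column is one vocabulary word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented tag strings, not taken from the dataset
docs = [
    "Drama,Crime A banker is framed for murder",
    "Drama,Crime The saga of a crime family",
]

toy_vectorizer = CountVectorizer(stop_words='english')
toy_counts = toy_vectorizer.fit_transform(docs).toarray()

# vocabulary_ maps each kept word to its column index in the count matrix
print(sorted(toy_vectorizer.vocabulary_))
print(toy_counts)
```

Words are lowercased, stop words such as "the" and "of" are dropped, and a word that appears twice in a document (here "crime" in the second string) gets a count of 2.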

### Understanding
`cosine_similarity` computes a similarity score for every pair of movies from the vectorized **tags**. A score of 1 means two tag vectors point in the same direction (as when a movie is compared with itself), and a score of 0 means they share no counted words. The movies with the **highest similarity** to the input movie are the ones we recommend.

In [83]:
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(vector)
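As a small worked example of what a single entry of the similarity matrix means, the sketch below computes the cosine of the angle between two made-up count vectors by hand and compares it with scikit-learn's result. The vectors `a` and `b` are assumptions chosen for easy arithmetic.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two made-up word-count vectors over a three-word vocabulary
a = np.array([1, 1, 0])
b = np.array([1, 0, 1])

# cosine(a, b) = (a . b) / (|a| * |b|)
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# cosine_similarity returns the full pairwise matrix; [0, 1] is a versus b
library = cosine_similarity([a, b])[0, 1]

print(manual, library)
```

Here the dot product is 1 and both norms are the square root of 2, so the similarity is 1 / 2.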

# Test example
We print the five highest similarity scores, each paired with its row index, for the movie at index 2 (The Godfather). The first entry is the movie itself, whose score of 1 carries a small floating-point rounding error.


In [84]:
distance = sorted(enumerate(similarity[2]), key=lambda pair: pair[1], reverse=True)[:5]
print(distance)

[(2, 1.0000000000000002), (4, 0.41247895569215265), (2412, 0.28867513459481287), (4569, 0.28426762180748055), (9520, 0.27801921874276636)]
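The same top-k selection can also be written with `numpy.argsort`, which avoids building a Python list of every score. This is only a sketch on an invented score row (`scores` is not the notebook's data), not a change to the cell above.

```python
import numpy as np

# Invented similarity row for five movies
scores = np.array([0.10, 0.41, 1.00, 0.05, 0.29])

# argsort sorts ascending, so reverse it and keep the first three indices
top = np.argsort(scores)[::-1][:3]
result = [(int(i), float(scores[i])) for i in top]

print(result)
```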


# Recommendation function

In [85]:
# Print the six titles most similar to `movie_name` (the first is the movie itself).
def recommend_movies(movie_name):
    # Case-insensitive title lookup
    matches = movies[movies['title'].str.lower() == movie_name.lower()]
    if matches.empty:
        print(f"Input movie should be from the provided list: '{movie_name}' was not found")
        return
    index = matches.index[0]
    # Rank every movie by its similarity to the selected one, highest first
    ranked = sorted(enumerate(similarity[index]), key=lambda pair: pair[1], reverse=True)[:6]
    for position, _score in ranked:
        print(movies.iloc[position].title)



In [86]:
recommend_movies("Iron man")

Iron Man
Mazinger Z: Infinity
Justice League Dark
Iron Man 3
The Colony
Marvel One-Shot: Item 47
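Since the first title printed is always the input movie itself, a variant that skips it and returns a list (instead of printing) may be easier to reuse. The sketch below is one possible shape for such a helper. The name `recommend_titles`, its parameters, and the toy data are all hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the notebook's `movies` and `similarity` objects
toy_movies = pd.DataFrame({'title': ['Iron Man', 'Iron Man 3', 'The Godfather']})
toy_similarity = np.array([
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])

def recommend_titles(movie_name, movies_df, similarity_matrix, top_n=5):
    """Return up to top_n titles most similar to movie_name, excluding itself."""
    matches = movies_df.index[movies_df['title'].str.lower() == movie_name.lower()]
    if len(matches) == 0:
        return []
    # Convert the index label to a row position (assumes a unique index)
    position = movies_df.index.get_loc(matches[0])
    order = np.argsort(similarity_matrix[position])[::-1]
    picks = [i for i in order if i != position][:top_n]
    return movies_df['title'].iloc[picks].tolist()

print(recommend_titles('iron man', toy_movies, toy_similarity, top_n=2))
```

Passing the DataFrame and matrix as arguments, rather than reading globals, also makes the function independent of notebook state.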


# Saving the recommender with pickle

In [87]:
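import pickle

# Note: pickle stores a function by reference (its module and name), so
# movies.pkl does not contain the function's code, `movies` or `similarity`.
with open('movies.pkl', 'wb') as f:
    pickle.dump(recommend_movies, f)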

In [88]:
# Loading works here because recommend_movies is defined in this session
with open('movies.pkl', 'rb') as f:
    recommend = pickle.load(f)
recommend("Iron Man")

Iron Man
Mazinger Z: Infinity
Justice League Dark
Iron Man 3
The Colony
Marvel One-Shot: Item 47
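Because `pickle` saves functions by reference, the file written above only works in a session where `recommend_movies`, `movies` and `similarity` already exist. One alternative, sketched here with toy data and a hypothetical file name, is to pickle the data the function depends on and redefine the function wherever the file is loaded.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Toy stand-ins for the notebook's `movies` and `similarity` objects
toy_movies = pd.DataFrame({'id': [1, 2], 'title': ['A', 'B']})
toy_similarity = np.array([[1.0, 0.3], [0.3, 1.0]])

# Hypothetical output path in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'recommender_data.pkl')

# Save the data, not the function
with open(path, 'wb') as f:
    pickle.dump({'movies': toy_movies, 'similarity': toy_similarity}, f)

# Restore it, as another script or session would
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored['movies'].equals(toy_movies))
print(np.array_equal(restored['similarity'], toy_similarity))
```

For the full dataset the similarity matrix has roughly 10,000 by 10,000 entries, so the resulting file would be large.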
