## Movie Recommendation Systems
Project proposal: I plan to use The Movies Dataset from Kaggle to build a movie recommendation engine.

Dataset: Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages. This dataset also has files containing 100,000 ratings from 700 users for a small subset of 9,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website.

## Loading the data

In [3]:
%matplotlib inline
import json
import datetime
import ast
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [52]:
df = pd.read_csv('D:/Cinci prep/Coursework/Python/the-movies-dataset/movies_metadata.csv')
df.head(5)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [53]:
print("The dataset has {} rows and {} columns".format(df.shape[0], df.shape[1]))

The dataset has 45466 rows and 24 columns


In [54]:
#add object type to table

| Column Name | Description |
| :- | :- |
| adult | Indicates if the movie is X-Rated or Adult. |
| belongs_to_collection | Information on the movie series the film belongs to. |
| budget | The budget of the movie in dollars. |
| genres | A stringified list of genres associated with the movie. |
| homepage | The official homepage of the movie. |
| id | The ID of the movie. |
| imdb_id | The IMDB ID of the movie. |
| original_language | The language in which the movie was originally shot. |
| original_title |  The original title of the movie. |
| overview |  A brief blurb of the movie. |
| popularity |  The Popularity Score assigned by TMDB. |
| poster_path |  The URL of the poster image. |
| production_companies | List of production companies involved in making the movie. |
| production_countries | List of countries where the movie was shot or produced. |
| release_date |  Theatrical Release Date of the movie. |
| revenue |  The total revenue of the movie in dollars. |
| runtime |  The runtime of the movie in minutes. |
| spoken_languages |  A stringified list of spoken languages in the film. |
| status |  The status of the movie (Released, To Be Released, Announced, etc.) |
| tagline |  The tagline of the movie. |
| title | The Official Title of the movie. |
| video | Indicates whether TMDB has a video of the movie. |
| vote_average | The average rating of the movie. |
| vote_count | The number of votes by users, as counted by TMDB. |
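
Several columns, such as `genres` and `spoken_languages`, are stored as stringified lists of dictionaries, so they need to be parsed before they can be used as features. The following is a minimal sketch of one way to do that with `ast.literal_eval`. The helper name `extract_names` and the sample strings are illustrative assumptions, not part of the original notebook.

```python
import ast
import pandas as pd

def extract_names(cell):
    """Return the 'name' values from a stringified list of dicts (hypothetical helper)."""
    # Missing values (None/NaN) and other non-strings yield an empty list.
    if not isinstance(cell, str):
        return []
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Malformed strings are treated as "no entries" rather than raising.
        return []
    if not isinstance(parsed, list):
        return []
    return [d["name"] for d in parsed if isinstance(d, dict) and "name" in d]

# Illustrative values in the same shape as the `genres` column.
sample = pd.Series([
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
    "[]",
    None,
])
names = sample.apply(extract_names)
print(names.tolist())
```

On the real data this could be applied as `df['genres'].apply(extract_names)`, assuming the column follows the format shown in `df.head()`.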

## Data wrangling

In [55]:
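def missing_values(df):
    """Summarize missing values per column as a count and a percentage of rows."""
    total_na = df.isnull().sum()
    # isnull().count() is the number of rows, so this is the share of missing values per column
    percent = round((df.isnull().sum()/df.isnull().count()*100),2)
    output =  pd.concat([total_na, percent], axis=1, keys=['Total', 'Percent (%)'])
    # Keep columns whose rounded percentage is above zero, most incomplete first
    return output[output['Percent (%)']>0].sort_values(by = ['Percent (%)'],ascending = False)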

missing_values(df)

Unnamed: 0,Total,Percent (%)
belongs_to_collection,40972,90.12
homepage,37684,82.88
tagline,25054,55.1
overview,954,2.1
poster_path,386,0.85
runtime,263,0.58
status,87,0.19
release_date,87,0.19
imdb_id,17,0.04
original_language,11,0.02


In [56]:
df['adult'].value_counts()

False                                                                                                                             45454
True                                                                                                                                  9
 Rune Balot goes to a casino connected to the October corporation to try to wrap up her case once and for all.                        1
 - Written by Ørnås                                                                                                                   1
 Avalanche Sharks tells the story of a bikini contest that turns into a horrifying affair when it is hit by a shark avalanche.        1
Name: adult, dtype: int64

In [57]:
df[df['title'].isna()]

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
19729,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",,82663,tt0113002,en,Midnight Man,British soldiers force a recently captured IRA...,...,,,,,,,,,,
19730,- Written by Ørnås,0.065736,/ff9qCepilowshEtG2GYWwzt2bs4.jpg,"[{'name': 'Carousel Productions', 'id': 11176}...","[{'iso_3166_1': 'CA', 'name': 'Canada'}, {'iso...",1997-08-20,0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,1.0,,,,,,,,,
29502,False,"{'id': 122661, 'name': 'Mardock Scramble Colle...",0,"[{'id': 16, 'name': 'Animation'}, {'id': 878, ...",http://m-scramble.jp/exhaust/,122662,tt2423504,ja,マルドゥック・スクランブル 排気,Third film of the Mardock Scramble series.,...,,,,,,,,,,
29503,Rune Balot goes to a casino connected to the ...,1.931659,/zV8bHuSL6WXoD6FWogP9j4x80bL.jpg,"[{'name': 'Aniplex', 'id': 2883}, {'name': 'Go...","[{'iso_3166_1': 'US', 'name': 'United States o...",2012-09-29,0,68.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",Released,...,12.0,,,,,,,,,
35586,False,,0,"[{'id': 10770, 'name': 'TV Movie'}, {'id': 28,...",,249260,tt2622826,en,Avalanche Sharks,A group of skiers are terrorized during spring...,...,,,,,,,,,,
35587,Avalanche Sharks tells the story of a bikini ...,2.185485,/zaSf5OG7V8X8gqFvly88zDdRm46.jpg,"[{'name': 'Odyssey Media', 'id': 17161}, {'nam...","[{'iso_3166_1': 'CA', 'name': 'Canada'}]",2014-01-01,0,82.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,22.0,,,,,,,,,


In [58]:
df = df[df['title'].notna()]
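
The six rows with a missing title appear to be three records that were split across two lines, which is also why free text shows up in `adult`. Filtering on `title` removes them. The sketch below shows a slightly stricter check that also validates `adult`. It runs on a toy frame with made-up values and assumes `adult` is read as the strings `'True'` and `'False'`, which should be verified against the real data.

```python
import pandas as pd

# Toy frame mimicking the pattern above; the values are illustrative only.
toy = pd.DataFrame({
    "adult": ["False", "True", " - Written by someone", "False"],
    "title": ["Toy Story", "Some Film", None, None],
})

# A well-formed row is assumed to have 'adult' equal to 'True' or 'False'.
valid_adult = toy["adult"].isin(["True", "False"])
missing_title = toy["title"].isna()

# Rows failing either check are candidates for removal.
suspect = toy[~valid_adult | missing_title]
clean = toy[valid_adult & ~missing_title]
print(len(suspect), len(clean))
```

Checking both conditions catches each half of a split record, whichever column the spilled text landed in.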

In [60]:
df[df['original_title'] != df['title']][['title', 'original_title']].shape

(11396, 2)

In [61]:
df = df.drop(['imdb_id', 'original_title', 'adult'], axis=1)

In [62]:
base_poster_url = 'http://image.tmdb.org/t/p/w185/'
df['poster_path'] = "<img src='" + base_poster_url + df['poster_path'] + "' style='height:100px;'>"
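
Wrapping `poster_path` in an `<img>` tag only pays off if the frame is rendered as unescaped HTML. The following is a rough sketch of that step on a two-row demo frame. The file name `/example.jpg` is hypothetical, and lifting `display.max_colwidth` is a precaution against long tags being truncated, not something the original notebook does.

```python
import pandas as pd

base_poster_url = 'http://image.tmdb.org/t/p/w185/'

# Demo frame: one hypothetical poster path and one missing value.
demo = pd.DataFrame({
    "title": ["Example Movie", "No Poster"],
    "poster_path": ["/example.jpg", None],
})

# Same construction as above; a missing path stays missing (NaN), so no tag is built for it.
# The base URL ends with '/' and the path starts with '/', so the URL contains '//'.
demo["poster_path"] = "<img src='" + base_poster_url + demo["poster_path"] + "' style='height:100px;'>"

# escape=False keeps the tag as markup instead of escaping '<' and '>'.
with pd.option_context("display.max_colwidth", None):
    html = demo.to_html(escape=False)
```

In a notebook, `from IPython.display import HTML` followed by `HTML(html)` would display the table with the images inline.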

In [63]:
print("The final dataset has {} rows and {} columns".format(df.shape[0], df.shape[1]))

The final dataset has 45460 rows and 21 columns
