# Movie Recommendation System

1. Retrieve data from an online movie database (IMDb).
    * Each movie has a unique IMDb ID (beginning with `tt`) and each crew member has an `nm` ID.
    * Besides movie information, select a group of top reviewers and get their ratings of the movies in the database.
        - Each user has a unique `ur` ID on IMDb.
2. Create a recommendation system that allows users to like or dislike several movies and tries to recommend movies for them.
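
One possible shape for step 2 is a content-based scorer: build a signed feature profile from the movies a user liked and disliked, then rank the remaining movies by how well their features match. The sketch below is only an illustration under that assumption; the `recommend` function, the toy `catalog`, and the placeholder IDs `tt0000001` and `tt0000002` are hypothetical and not part of the original notebook.

```python
from collections import Counter


def recommend(catalog, liked, disliked, top_n=3):
    """Rank unseen movies by feature overlap with liked minus disliked ones.

    catalog maps a movie ID to a set of features (genres, keywords, ...).
    """
    # Signed profile: +1 per feature of a liked movie, -1 per disliked one.
    profile = Counter()
    for movie_id in liked:
        for feature in catalog[movie_id]:
            profile[feature] += 1
    for movie_id in disliked:
        for feature in catalog[movie_id]:
            profile[feature] -= 1

    seen = set(liked) | set(disliked)
    scores = {}
    for movie_id, features in catalog.items():
        if movie_id in seen:
            continue
        # Average the weights so movies with many features are not favoured.
        scores[movie_id] = sum(profile[f] for f in features) / max(len(features), 1)

    # Highest score first; ties are broken by ID to keep the order stable.
    return sorted(scores, key=lambda m: (-scores[m], m))[:top_n]


# Toy catalog (assumption): real features would come from the scraped data.
catalog = {
    "tt1375666": {"Action", "Adventure", "Sci-Fi", "dream"},
    "tt0111161": {"Drama", "prison"},
    "tt0000001": {"Action", "Sci-Fi"},
    "tt0000002": {"Drama", "Romance"},
}
print(recommend(catalog, liked=["tt1375666"], disliked=["tt0111161"]))
```

Averaging the weights is a design choice, not a requirement; summing them instead would favour movies with long keyword lists.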

Useful information about a movie:
- ID: tt1375666
  * `https://www.imdb.com/title/tt1375666/`
- Name: Inception
  * `#title-overview-widget > div.vital > div.title_block > div > div.titleBar > div.title_wrapper > h1`
- Duration: 148 min
  * `#title-overview-widget > div.vital > div.title_block > div > div.titleBar > div.title_wrapper > div > time`
- MPAA: PG-13
  * `#title-overview-widget > div.vital > div.title_block > div > div.titleBar > div.title_wrapper > div`
- Genres: Action, Adventure, Sci-Fi
  * `#title-overview-widget > div.vital > div.title_block > div > div.titleBar > div.title_wrapper > div > a:nth-child(4)`
  * `#title-overview-widget > div.vital > div.title_block > div > div.titleBar > div.title_wrapper > div > a:nth-child(5)`
  * `#title-overview-widget > div.vital > div.title_block > div > div.titleBar > div.title_wrapper > div > a:nth-child(6)`
- Release Year: 2010
  * `#titleYear > a`
- Rating: 8.8
  * `#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > div.ratingValue > strong > span`
- Rating Count: 1,870,457
  * `#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span`
- Male Rating: 8.8
  * https://www.imdb.com/title/tt1375666/ratings
  * `#main > section > div > table:nth-child(14) > tbody > tr:nth-child(3) > td:nth-child(2) > div.bigcell`
- Female Rating: 8.6
  * https://www.imdb.com/title/tt1375666/ratings
  * `#main > section > div > table:nth-child(14) > tbody > tr:nth-child(4) > td:nth-child(2) > div.bigcell`
- Ages 18-29 Rating: 9.0
  * https://www.imdb.com/title/tt1375666/ratings
  * `#main > section > div > table:nth-child(14) > tbody > tr:nth-child(2) > td:nth-child(4) > div.bigcell`
- 10-star rating ratio
  * https://www.imdb.com/title/tt1375666/ratings
  * `#main > section > div > table:nth-child(7) > tbody > tr:nth-child(2) > td:nth-child(2) > div.allText > div`
- 1-star rating ratio
  * https://www.imdb.com/title/tt1375666/ratings
  * `#main > section > div > table:nth-child(7) > tbody > tr:nth-child(11) > td:nth-child(2) > div.allText > div`
- Director: Christopher Nolan
  * `#title-overview-widget > div.plot_summary_wrapper > div.plot_summary > div:nth-child(2)`
    * `#title-overview-widget > div.plot_summary_wrapper > div.plot_summary > div:nth-child(2) > a`
- Writer: Christopher Nolan
  * `#title-overview-widget > div.plot_summary_wrapper > div.plot_summary > div:nth-child(3)`
    * `#title-overview-widget > div.plot_summary_wrapper > div.plot_summary > div:nth-child(3) > a`
- Stars: Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen Page
  * `#title-overview-widget > div.plot_summary_wrapper > div.plot_summary > div:nth-child(4)`
    * `#title-overview-widget > div.plot_summary_wrapper > div.plot_summary > div:nth-child(4) > a:nth-child(2)`
    * `#title-overview-widget > div.plot_summary_wrapper > div.plot_summary > div:nth-child(4) > a:nth-child(3)`
    * `#title-overview-widget > div.plot_summary_wrapper > div.plot_summary > div:nth-child(4) > a:nth-child(4)`
- Metascore: 74
  * `#title-overview-widget > div.plot_summary_wrapper > div.titleReviewBar > div:nth-child(1) > a > div > span`
- Keywords: dream, subconscious, ambiguous ending, thief, psycho thriller
  * `#titleStoryLine > div:nth-child(6)`
    * `#titleStoryLine > div:nth-child(6) > a:nth-child(2) > span`
    * `#titleStoryLine > div:nth-child(6) > a:nth-child(4) > span`
    * etc.
- Taglines: Your mind is the scene of the crime
  * `#titleStoryLine > div:nth-child(8)`
- Production: WB
  * `#titleDetails > div:nth-child(19) > a:nth-child(2)`
- Worldwide Gross: \$829,895,144
  * `#titleDetails > div:nth-child(15)`
- Budget: \$160,000,000
  * `#titleDetails > div:nth-child(12)`
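
The selectors above are positional (`nth-child`), so they can miss on titles whose pages are laid out differently, for example a movie without a Metascore or budget entry. In `requests_html`, `find(selector, first=True)` returns `None` when nothing matches, so reading `.text` directly would raise an `AttributeError`. A minimal guard might look like the sketch below. The helper name `text_or_none` is hypothetical, and the stub classes only stand in for `r.html` so that the example runs without network access.

```python
def text_or_none(html, selector):
    """Return the stripped text of the first match, or None if nothing matches.

    `html` is assumed to expose find(selector, first=True), as r.html does.
    """
    element = html.find(selector, first=True)
    if element is None:
        return None
    return element.text.strip()


class StubElement:
    """Stand-in for a matched element; only the .text attribute is needed."""

    def __init__(self, text):
        self.text = text


class StubHTML:
    """Stand-in for r.html that answers find() from a fixed dictionary."""

    def __init__(self, matches):
        self.matches = matches

    def find(self, selector, first=False):
        element = self.matches.get(selector)
        if first:
            return element
        return [] if element is None else [element]


page = StubHTML({"#titleYear > a": StubElement(" 2010 ")})
print(text_or_none(page, "#titleYear > a"))        # matched selector
print(text_or_none(page, "#missing > selector"))   # unmatched selector
```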


In [1]:
from requests_html import HTMLSession
import re
session = HTMLSession()

In [2]:
class Movie:
    def getId(self, url):
        s = re.search(r'tt\d+', url)
        return s.group()
    def getNameYear(self, r):
        sel = '#title-overview-widget > div.vital > div.title_block > div > div.titleBar > div.title_wrapper > h1'
        s = r.html.find(sel, first=True).text
        year = re.search(r'\(\d+\)', s).group().strip("()")
        name = re.sub(r'\(\d+\)', "", s).strip()
        return name, year
    def getRateTimeGenre(self, r):
        sel = '#title-overview-widget > div.vital > div.title_block > div > div.titleBar > div.title_wrapper > div'
        s = r.html.find(sel, first=True).text
        s = s.split('|')
        rate = s[0].strip()
        duration = s[1].strip()
        genre = s[2].strip()
        return rate, duration, genre
    def getRating(self, r):
        sel = '#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > div.ratingValue > strong > span'
        return r.html.find(sel, first=True).text
    def getRatingCount(self, r):
        sel = '#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span'
        return r.html.find(sel, first=True).text
    def getGenderRating(self, url):
        # Male/female averages live on the separate /ratings page, so fetch it
        # from the movie URL (uses the module-level session created above).
        r = session.get(url.rstrip('/') + '/ratings')
        sel_m = "#main > section > div > table:nth-child(14) > tbody > tr:nth-child(3) > td:nth-child(2) > div.bigcell"
        sel_f = "#main > section > div > table:nth-child(14) > tbody > tr:nth-child(4) > td:nth-child(2) > div.bigcell"
        return r.html.find(sel_m, first=True).text, r.html.find(sel_f, first=True).text
    def getDirector(self, r):
        sel = "#title-overview-widget > div.plot_summary_wrapper > div.plot_summary > div:nth-child(2) > a"
        return r.html.find(sel, first=True).text
    def getWriter(self, r):
        sel = "#title-overview-widget > div.plot_summary_wrapper > div.plot_summary > div:nth-child(3) > a"
        return r.html.find(sel, first=True).text
    def getStar(self, r):
        # Collect every linked name; the last link is dropped below because
        # it is not a star (presumably the full-cast link).
        sel = "#title-overview-widget > div.plot_summary_wrapper > div.plot_summary > div:nth-child(4) > a"
        s = r.html.find(sel)
        stars = []
        for star in s:
            stars.append(star.text)
        stars.pop()
        return stars
    def getMetascore(self, r):
        sel = "#title-overview-widget > div.plot_summary_wrapper > div.titleReviewBar > div:nth-child(1) > a > div > span"
        return r.html.find(sel, first=True).text
    def getKeywords(self, r):
        sel = "#titleStoryLine > div:nth-child(6) > a > span"
        s = r.html.find(sel)
        keywords = []
        for word in s:
            keywords.append(word.text)
        return keywords
    def getTagline(self, r):
        sel = "#titleStoryLine > div:nth-child(8)"
        taglines = r.html.find(sel, first=True).text.split('\n')
        return taglines[1]
    def getCompany(self, r):
        sel = "#titleDetails > div:nth-child(19) > a:nth-child(2)"
        return r.html.find(sel, first=True).text
    def getBudget(self, r):
        sel = "#titleDetails > div:nth-child(12)"
        result = r.html.find(sel, first=True).text.split('\n')
        result = re.sub(r'\(.*\)', "", result[1])
        return result
    def getGross(self, r):
        sel = "#titleDetails > div:nth-child(15)"
        result = r.html.find(sel, first=True).text.split('\n')
        return result[1]
    def __init__(self, url):
        self.Id = self.getId(url)
        session = HTMLSession()
        r = session.get(url)
        self.Name, self.Year = self.getNameYear(r)
        self.MPAA, self.Duration, self.Genre = self.getRateTimeGenre(r)
        self.Rating = self.getRating(r)
        self.RatingCount = self.getRatingCount(r)
        self.Director = self.getDirector(r)
        self.Writer = self.getWriter(r)
        self.Star = self.getStar(r)
        self.Metascore = self.getMetascore(r)
        self.Keywords = self.getKeywords(r)
        self.Tagline = self.getTagline(r)
        self.Company = self.getCompany(r)
        self.Budget = self.getBudget(r)
        self.Gross = self.getGross(r)
        session.close()
    def printInfo(self):
        print("Id: " + self.Id)
        print("Name: " + self.Name)
        print("Release Year: " + self.Year)
        print("MPAA Rating: " + self.MPAA)
        print("Duration: " + self.Duration)
        print("Genre: " + self.Genre)
        print("Rating Count: " + self.RatingCount)
        print("Rating: " + self.Rating)
        print("Metascore: " + self.Metascore)
        print("Director: " + self.Director)
        print("Writer: " + self.Writer)
        print("Main Star: " + str(self.Star))
        print("Keywords: " + str(self.Keywords))
        print("Tagline: " + self.Tagline)
        print("Production Company: " + self.Company)
        print("Budget: " + self.Budget)
        print("Worldwide Gross: " + self.Gross)

url = "https://www.imdb.com/title/tt0111161"
a = Movie(url)
a.printInfo()

Id: tt0111161
Name: The Shawshank Redemption
Release Year: 1994
MPAA Rating: R
Duration: 2h 22min
Genre: Drama
Rating Count: 2,132,787
Rating: 9.3
Metascore: 80
Director: Frank Darabont
Writer: Stephen King
Main Star: ['Tim Robbins', 'Morgan Freeman', 'Bob Gunton']
Keywords: ['wrongful imprisonment', 'escape from prison', 'based on the works of stephen king', 'prison', 'voice over narration']
Tagline: Fear can hold you prisoner. Hope can set you free.
Production Company: Castle Rock Entertainment
Budget: $25,000,000 
Worldwide Gross: $28,341,469
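
Every field printed above is still a string (`2h 22min`, `2,132,787`, `$25,000,000 `), which is awkward to compare or feed into a recommender. The sketch below shows one way the values could be converted to numbers. The helper names `parse_count`, `parse_money`, and `parse_duration` are hypothetical, and `parse_money` ignores the currency symbol, so it assumes all amounts share one currency.

```python
import re


def parse_count(text):
    """Convert a count such as '2,132,787' to an int."""
    return int(text.replace(",", ""))


def parse_money(text):
    """Convert an amount such as '$25,000,000 ' to an int, or None if no digits.

    Everything except digits is dropped, including the currency symbol.
    """
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


def parse_duration(text):
    """Convert '2h 22min' or '148 min' to minutes, or None if unrecognised."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?\s*", text)
    if match is None or not any(match.groups()):
        return None
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


print(parse_count("2,132,787"))     # 2132787
print(parse_money("$25,000,000 "))  # 25000000
print(parse_duration("2h 22min"))   # 142
```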
