# Ultimate Must-See Movie List Aggregator

This Jupyter Notebook project aims to curate the ultimate must-see movie list by aggregating data from multiple websites that highlight the 100 must-watch movies. 

We begin by scraping data from multiple credible websites. This function, 'scrape_movies,' takes a URL and a specified HTML keyword, retrieves the webpage content, and returns a list of HTML elements containing the specified keyword, providing a foundational step in web scraping for movie data. While there may be more optimal ways to achieve this, I have started with this approach.

In [3]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
#import openai

In [4]:
def scrape_movies(url, keyword):
 
    response = requests.get(url)
    if response.status_code == 200 or response.status_code == 403:
        soup = BeautifulSoup(response.text, 'html.parser')

        elements_with_title = soup.select(keyword)
        
        return elements_with_title

    else:
        print(f"Failed to retrieve content. Status code: {response.status_code}")



In [5]:
url_to_scrape = 'https://www.timeout.com/film/best-movies-of-all-time'
timeout_ = scrape_movies(url_to_scrape, '[title]')
timeout = []
for element in timeout_:
    title = element.get('title')
    if title and title[-1]==')':
        timeout.append(title[:-7].lower())


In [6]:
url_to_scrape = 'https://www.imdb.com/list/ls091520106/'
imdb_ = scrape_movies(url_to_scrape, 'img')
imdb = []
for element in imdb_:

    title = element.get('alt')
    if title and title!='loading':
        imdb.append(title.lower())
imdb=imdb[:100]

In [7]:
url_to_scrape = 'https://www.theguardian.com/film/2019/sep/13/100-best-films-movies-of-the-21st-century'
guardian_ =scrape_movies(url_to_scrape, 'strong')
guardian =[]
for element in guardian_:
    title = element.get_text(strip=True)
    if title and title!='CS' and title!='AP' and title!='CC':
        guardian.append(title.lower())
guardian=guardian[:100]        

In [8]:
url_to_scrape = 'https://www.empireonline.com/movies/features/best-movies-2/'
empire_ = scrape_movies(url_to_scrape,'img')
empire = []
for element in empire_:
    title = element.get('alt')
    if title:
        empire.append(title.lower())


Here are the overlap between imdb, timeout and empire lists

In [9]:
list(set(imdb) & set(timeout)& set(empire))

['2001: a space odyssey',
 'the dark knight',
 'the godfather',
 'spirited away',
 'seven samurai',
 'raiders of the lost ark',
 'the shining',
 'citizen kane',
 'eternal sunshine of the spotless mind',
 'goodfellas',
 'psycho',
 'vertigo',
 'pulp fiction',
 'alien',
 'apocalypse now']

Given that the overlap is not too extensive, I consider a movie worth watching if it appears in **at least two** of the websites.

In [10]:
overal = guardian+imdb+empire+timeout
overal_ = np.array(overal)
values, counts = np.unique(overal_, return_counts=True)

In [11]:
must_watch =[]
for movie, score in zip(values, counts):
    if score>2:
        must_watch.append(movie)
        

I also wanted to know the organize the list by the genere, so I founf this OMDB api that outputs details about movies.The 'get_movie_details' function retrieves detailed information about a movie using the OMDB API.The fill_movie_info function takes a list of movie titles, utilizes the get_movie_details function to gather information about each movie, and compiles relevant details, including title, year, director, and genre, into a Pandas DataFrame. 

In [12]:
def get_movie_details(title):
    api_key = "your_omdb_api_key"  # Replace with your actual OMDB API key
    base_url = "http://www.omdbapi.com/"

    params = {
        "apikey": '85a3d012',
        "t": title
    }

    response = requests.get(base_url, params=params)
    data = response.json()

    return data

def fill_movie_info(movie_list):
    movie_data = []

    for movie_title in movie_list:
        details = get_movie_details(movie_title)

        if details["Response"] == "True":
            movie_info = {
                "title": details["Title"],
                "year": details["Year"],
                "director": details["Director"],
                "genre": details["Genre"]
            }
            movie_data.append(movie_info)

    return pd.DataFrame(movie_data)

movie_list = must_watch
movies_df = fill_movie_info(movie_list)



In [13]:
movies_df

Unnamed: 0,title,year,director,genre
0,2001: A Space Odyssey,1968,Stanley Kubrick,"Adventure, Sci-Fi"
1,Alien,1979,Ridley Scott,"Horror, Sci-Fi"
2,Apocalypse Now,1979,Francis Ford Coppola,"Drama, Mystery, War"
3,Brokeback Mountain,2005,Ang Lee,"Drama, Romance"
4,Citizen Kane,1941,Orson Welles,"Drama, Mystery"
5,Eternal Sunshine of the Spotless Mind,2004,Michel Gondry,"Drama, Romance, Sci-Fi"
6,Get Out,2017,Jordan Peele,"Horror, Mystery, Thriller"
7,Gladiator,2000,Ridley Scott,"Action, Adventure, Drama"
8,Goodfellas,1990,Martin Scorsese,"Biography, Crime, Drama"
9,In the Mood for Love,2000,Kar-Wai Wong,"Drama, Romance"


In [176]:
movies_df_ = movies_df.sort_values(by="genre", ascending=False)

In [177]:
movies_df_

Unnamed: 0,title,year,director,genre
22,Vertigo,1958,Alfred Hitchcock,"Mystery, Romance, Thriller"
1,Alien,1979,Ridley Scott,"Horror, Sci-Fi"
13,Psycho,1960,Alfred Hitchcock,"Horror, Mystery, Thriller"
6,Get Out,2017,Jordan Peele,"Horror, Mystery, Thriller"
5,Eternal Sunshine of the Spotless Mind,2004,Michel Gondry,"Drama, Romance, Sci-Fi"
9,In the Mood for Love,2000,Kar-Wai Wong,"Drama, Romance"
3,Brokeback Mountain,2005,Ang Lee,"Drama, Romance"
2,Apocalypse Now,1979,Francis Ford Coppola,"Drama, Mystery, War"
11,Mulholland Drive,2001,David Lynch,"Drama, Mystery, Thriller"
4,Citizen Kane,1941,Orson Welles,"Drama, Mystery"
