This notebook scrapes an IMDb search page listing the top 50 animated films and stores the results in a MongoDB database.

In [1]:
import requests
import json
from bs4 import BeautifulSoup
import re
import numpy as np

import pymongo
from bson.json_util import dumps

In [2]:
url='https://www.imdb.com/search/title/?genres=animation&sort=user_rating,desc&title_type=feature&num_votes=25000,&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=5aab685f-35eb-40f3-95f7-c53f09d542c3&pf_rd_r=RSN2DY6XD2962XY8AM4V&pf_rd_s=right-6&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_gnr_3'
response = requests.get(url)
html = response.content
scraped = BeautifulSoup(html, 'html.parser')

In [3]:
movies = scraped.find('div', class_='lister list detail sub-list')

The following fields are taken from the IMDb page for each film:

rank - the position of the film in the list of 50  
title - the title of the film  
year - the year the film was released  
rating - the MPAA rating of the film (e.g. G, PG, R)  
runtime - the runtime of the film in minutes  
genre - the genres of the film  
score - the rating of the film out of 10  
directors - the director or directors of the film  
cast - the main actors in the film  
gross - the box-office gross of the film  
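
As a rough guide to what each record ends up looking like, the fields above can be sketched as a `TypedDict`. The class name `Film` and the placeholder values are hypothetical; the types are inferred from how the scraping cells build each list, so treat this as a sketch, not a fixed schema.

```python
from typing import List, TypedDict


class Film(TypedDict):
    """Shape of one scraped record, with types as the scraping cells produce them."""
    rank: str             # digits kept as a string
    title: str
    year: str             # four-digit year kept as a string
    rating: str
    runtime: int          # minutes
    genre: List[str]
    score: float          # out of 10
    directors: List[str]
    cast: List[str]
    gross: float          # whole number, or NaN when the page lists no gross


# Hypothetical placeholder values, for illustration only.
example: Film = {
    'rank': '1', 'title': 'Example Film', 'year': '2000', 'rating': 'PG',
    'runtime': 90, 'genre': ['Animation', 'Adventure'], 'score': 8.0,
    'directors': ['Jane Doe'], 'cast': ['John Roe', 'Mary Major'],
    'gross': float('nan'),
}
```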

In [4]:
rank = []
title = []
year = []
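for item in movies.find_all('h3'):
    # The heading text splits into: '', '<rank>.', '<title>', '(<year>)', ''
    lines = item.text.split('\n')
    rank.append(re.findall(r'\d+', lines[1])[0])
    title.append(lines[2])
    year.append(lines[-2][-5:-1])  # the four characters before the closing ')'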

In [5]:
rating = []
runtime = []
genre = []
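# Each find_all runs independently, so these three lists only line up
# if every film on the page has all three spans.
for item in movies.find_all('span', class_='certificate'):
    rating.append(item.text)
for item in movies.find_all('span', class_='runtime'):
    runtime.append(int(item.text[:-4]))  # drop the trailing ' min'
for item in movies.find_all('span', class_='genre'):
    # The genre text starts with a newline and is comma-separated
    genre.append(re.findall(r'\n(.*)', item.text)[0].strip().split(', '))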
    

In [6]:
score = []
for item in movies.find_all('strong'):
    score.append(float(item.text))

In [7]:
directors = []
cast = []

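for item in movies.find_all('p', class_=''):
    # Text before the '|' holds the director(s); text after 'Stars:' holds the cast.
    # Both patterns assume two-word names; the stray '\[' in the original
    # director pattern is dropped, keeping its mixed-case surname match.
    directors.append(re.findall(r'[A-Z][a-z]*\s[A-Za-z]*', item.text.split('|')[0]))
    cast.append(re.findall(r'[A-Z][a-z]*\s[A-Z][a-z]*', item.text.split('Stars:\n')[1]))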
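
The cast pattern above only captures two capitalised words per name, so a hyphenated, single-word, or three-part name would likely be truncated or dropped. A minimal alternative sketch is to split on the commas that separate the names. The helper `split_names` and the sample string are hypothetical; the sample only imitates the label-then-names layout that the cell above relies on.

```python
from typing import List


def split_names(block: str) -> List[str]:
    """Return the comma-separated names that follow an optional 'Label:' prefix."""
    # Keep only the text after the last ':' (the whole block if there is none).
    _, _, names = block.rpartition(':')
    # Split on commas, trim whitespace, and discard empty pieces.
    return [name.strip() for name in names.split(',') if name.strip()]


# Hypothetical text shaped like one credits paragraph.
sample = "\n    Directors:\nJean-Pierre Example, \nMary O'Neil Sample\n| \n    Stars:\nCher, \nJohn Q. Public\n"
credits, _, stars = sample.partition('|')

print(split_names(credits))  # ['Jean-Pierre Example', "Mary O'Neil Sample"]
print(split_names(stars))    # ['Cher', 'John Q. Public']
```

Splitting on commas keeps each name whole regardless of how many words it has, at the cost of assuming that names never contain a comma or a colon.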

In [8]:
gross = []
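for item in movies.find_all('p', class_='sort-num_votes-visible'):
    amount = re.findall(r'Gross:\n(.*)', item.text)
    # '$1.23M' -> 1230000: drop the '.', swap 'M' for four zeros, strip the '$'.
    # This assumes exactly two decimal places; films with no gross get NaN.
    gross.append(int(amount[0].replace('.', '').replace('M', '0000')[1:]) if amount else np.nan)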
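
The string replacement above gives the right number only when the figure has exactly two decimal places: `'$1.5M'` would come out as 150000, not 1500000. Whether IMDb ever formats the gross that way has not been checked here, so the following is only a defensive sketch. The helper name `parse_gross` is hypothetical, and it returns `None` where the cell above uses `np.nan`.

```python
from decimal import Decimal
from typing import Optional


def parse_gross(text: str) -> Optional[int]:
    """Convert a string such as '$1.5M' to a whole number; None if it cannot be parsed."""
    cleaned = text.strip().lstrip('$').replace(',', '')
    multiplier = 1
    if cleaned.endswith('M'):
        multiplier = 1_000_000
        cleaned = cleaned[:-1]
    try:
        # Decimal avoids binary floating-point rounding in the multiplication.
        return int(Decimal(cleaned) * multiplier)
    except (ArithmeticError, ValueError):
        # Raised for text that is not a finite number.
        return None


print(parse_gross('$1.5M'))    # 1500000
print(parse_gross('$10.06M'))  # 10060000
print(parse_gross('n/a'))      # None
```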
   

In [9]:
# Combine the parallel lists into a list of dictionaries, one per film.

movie_db = []

for i in range(len(title)):
    movie_db.append({'rank': rank[i],
                     'title': title[i],
                     'year': year[i],
                     'rating': rating[i],
                     'runtime': runtime[i],
                     'genre': genre[i],
                     'score': score[i],
                     'directors': directors[i],
                     'cast': cast[i],
                     'gross': gross[i]})
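
The loop above pairs values by position, so it depends on every list having one entry per film. If one of the scraping loops found fewer elements (for example, a film with no certificate), the lists would either raise an `IndexError` here or silently shift out of step. A minimal sketch of a guard, using a hypothetical helper `build_records` and made-up values:

```python
from typing import Any, Dict, List, Sequence


def build_records(columns: Dict[str, Sequence[Any]]) -> List[Dict[str, Any]]:
    """Turn equal-length columns into a list of per-row dictionaries."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'columns differ in length: {lengths}')
    keys = list(columns)
    # zip(*columns.values()) yields one tuple per row, in column order.
    return [dict(zip(keys, row)) for row in zip(*columns.values())]


# Made-up values, for illustration only.
records = build_records({'title': ['Film A', 'Film B'], 'score': [8.6, 8.4]})
print(records)
# [{'title': 'Film A', 'score': 8.6}, {'title': 'Film B', 'score': 8.4}]
```

A length check catches missing entries but not the shifted ones that happen to keep the counts equal; extracting all fields from each film's own container element would be the more robust fix.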

The records are then inserted into MongoDB with PyMongo. The process is explained in a great blog post by Halil Yildirim, which can be found [here](https://towardsdatascience.com/using-mongodb-with-python-bcb26bf25d5d).

In [10]:
from host import hostname  # connection string defined in a local host.py

client = pymongo.MongoClient(hostname)
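
The `host` module imported above is not shown in this notebook; it only needs to define `hostname`, the connection string passed to `MongoClient`. One way such a file might look, purely as an assumption: it reads the string from an environment variable (the name `MONGODB_URI` is made up here) and falls back to MongoDB's default local address.

```python
# host.py: hypothetical contents; the real file is not part of this notebook
import os

# MONGODB_URI is an assumed variable name. The fallback is the default
# address of a MongoDB server running on the local machine.
hostname = os.environ.get('MONGODB_URI', 'mongodb://localhost:27017')
```

Keeping the connection string out of the notebook means credentials are not exposed when the notebook is shared.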

In [11]:
db = client['movies']

In [12]:
collection = db['animatedFilms']

In [13]:
collection.insert_many(movie_db)

<pymongo.results.InsertManyResult at 0x7ff18bab2e80>

For reproducibility, the animatedFilms collection in the movies db has been exported as a JSON file. To learn how to do this, see this [link](https://www.geeksforgeeks.org/convert-pymongo-cursor-to-json/).

In [14]:
documents = collection.find({})

list_documents = list(documents)

json_films = dumps(list_documents, indent = 2)

with open('films.json', 'w') as file:
    file.write(json_films)
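
One caveat when reusing `films.json`: films without a gross figure carry `np.nan`, and Python's standard `json` encoder writes a float NaN as the bare token `NaN`, which strict JSON parsers may reject. `bson.json_util.dumps` is assumed here to behave the same way, which is worth confirming for your PyMongo version. The standard-library behaviour can be sketched as:

```python
import json

# float('nan') is encoded as the non-standard token NaN ...
with_nan = json.dumps({'gross': float('nan')})
# ... while None is encoded as null, which is valid JSON.
with_none = json.dumps({'gross': None})

print(with_nan)   # {"gross": NaN}
print(with_none)  # {"gross": null}
```

Storing `None` for a missing gross would keep the export portable to parsers outside Python.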