# **Data From Web Scraping**

The objective will be scraping the most information for each film in different top 100 lists from [Imdb](https://www.imdb.com/?ref_=nv_home). For example, in [The Top 100 Greatest Movies of all Time](https://www.imdb.com/list/ls055592025/) there are listed 100 films, from here it will be necessary to to access each film web page, once there, it can be found all the information about the film.

- The top lists selected for this exercise are:

    1. [The Top 100 Greatest Movies of all Time](https://www.imdb.com/list/ls055592025/)
    2. [Top 100 Horror Films](https://www.imdb.com/list/ls000007562/)
    3. [100 Best Sci-Fi movies](https://www.imdb.com/list/ls009668082/)
    4. [Top 100 Best Action Movies Of All Time](https://www.imdb.com/list/ls063897780/)
    5. [TOP 100 BEST DRAMA MOVIES OF ALL TIME](https://www.imdb.com/list/ls069376839/)
    6. [Top 100 Best Foreign Films](https://www.imdb.com/list/ls062615147/)
    7. [100 Best Movies of this Decade (2010-2019)](https://www.imdb.com/list/ls062615147/)
    8. [2000s Top 100 Movies](https://www.imdb.com/list/ls002065120/)
    
From a quick view of the page of a [film](https://www.imdb.com/title/tt0078748/?ref_=ttls_li_tt), the information selected to be obatined is listed below:

1. **`Top`** : Top list in which the film appears (str)
2. **`Ranking`** : Rank in which he movie appears (int)
3. **`Title`** : Title of the film (str)
4. **`Classification`**: Classificaion of the film (str)
5. **`Duration_min`**: Length of the film in minutes (int)
6. **`Genre(s)`**: Genre(s) of the film (str)
7. **`Imdb_Rating`**: Imdb of the film (float)
8. **`No_Ratings`**: Number of ratings (int)
9. **`Release_Date`**: Release date of the film (datetime)
10. **`Summary`**: Summary of the film (str)
11. **`Storyline`**: Storyline of the film (str)
12. **`Metascore`**: Metacritic score of the film (int)
13. **`User_reviews`**: Number of user reviews (int)
14. **`Critic_reviews`**: Number of critic reviews (int)
15. **`Popularity`**: Popularity rank (int)
16. **`Cast`**: Main Cast of the film (dict)
17. **`Taglines`**: Taglines of the film (str)
18. **`Country`**: Country of Origin (str)
19. **`Budget`**: Budget of the film (str)
21. **`Cumulative_Worldwide_Gross`**: Grossing revenue of the film (str)
22. **`Trivia`**: Trivia of the film (str)
23. **`Goof`**: A goof find in the movie (str)

Namely, the final dataframe (an the csv file) will have the above as columns.

First, the necessary libraries are imported:

In [None]:
import requests
import pandas as pd
import re
from bs4 import BeautifulSoup
from IPython.display import clear_output
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
import matplotlib.style as style
style.use('seaborn-darkgrid')

In [3]:
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

Next is to create two functions, `get_top_films` and `get_movie_info`. 

- **`get_top_films`**: this function will take one argument, the top list link; with this the web page will be scrapped in order to obtain the rank, title and the film web page reference. The result will be a list of tuples, one tuple for each movie listed.
- **`get_movie_info`**: this function wil take two arguments, the base url and the movie info. The base url is a common url template shared by all movies web pages, the diference lies in the film reference number, which will be passed with a string concatenation. The movie info will be received by the function in the form of a tuple, which will contain in order the top list name in which the movie appears, the rank of the movie in the top list and the title of the movie. The result will be a list containing all the information of the movie required.


In [8]:
def get_top_films(top):
    # Scraping the films with rank, title and web reference
    parsed_top = BeautifulSoup(requests.get(tops_imdb[top], verify = False).content)
    return [(top, rank, title.next_sibling.next_sibling.text, title.next_sibling.next_sibling['href']) 
            for rank,title in enumerate(parsed_top.find_all('span',{'class':'lister-item-index unbold text-primary'}), 1)]

In [261]:
def get_movie_info(base, title):
    # Print progress
    clear_output()
    print(f'{title[0]} - {title[2]} - {title[1]}')
    
    # Technical movie data base url
    tech = 'https://www.imdb.com/%s/technical?ref_=tt_dt_spec'
    
    # Parsing the webpage of the given title
    parsed_title = BeautifulSoup(requests.get(base % title[3][1:-1], verify = False).content)
    
    # Each data requires a specific search in the parsed webpage
    top = title[0]
    ranking = title[1]
    title_film = parsed_title.find('div', class_='originalTitle').text.split(' (')[0] if parsed_title.find('div', class_='originalTitle') else title[2]
    classification = parsed_title.find_all('div', class_='subtext')[0].text.split('\n')[1].replace(' ','')
    length = BeautifulSoup(requests.get(tech % gm[26][3][1:-1], verify = False).content).select_one('.label').next_sibling.next_sibling.text.split('\n ')[1].lstrip().rstrip()
    genre = ', '.join(set([i.text.replace(' ','') for i in parsed_title.select("a[href*=genres]")]))
    imdb_rating = float(parsed_title.find('div', class_='ratingValue').text.split('/')[0][1:])
    ratings = int(parsed_title.find('span', class_='small').text.replace(',',''))
    release = pd.to_datetime(''.join([i.next_sibling for i in parsed_title.find_all('h4', class_='inline') if 'Release' in i.text]).split(' (')[0])
    summary = parsed_title.find('div', class_='summary_text').text.replace('\n','').rstrip().lstrip()
    storyline = parsed_title.find('div', class_='inline canwrap').text.split('\n')[2].lstrip()
    metascore = int(parsed_title.select_one('.metacriticScore').text.replace('\n','')) if parsed_title.select_one('.metacriticScore') else ''
    u_reviews = int(re.findall(r'\d+(?:,\d+)?', ''.join([i.text.replace(',','') for i in parsed_title.find_all('span', class_='subText')]))[0]) if len(parsed_title.find_all('span', class_='subText'))>=1 else ''
    c_reviews = int(re.findall(r'\d+(?:,\d+)?', ''.join([i.text.replace(',','') for i in parsed_title.find_all('span', class_='subText')]))[1]) if len(parsed_title.find_all('span', class_='subText'))!=1 else ''
    popularity = int(parsed_title.find_all('span', class_='subText')[-1].text.replace('\n','').replace(',','').lstrip().split()[0])
    cast = {pair[0]:('' if len(pair)==1 else pair[1].split('2 e')[0].replace('/','').rstrip().replace('\xa0','')) for pair in [i.text.replace('\n','').replace('  ','').split('...') for i in parsed_title.find_all('tr')[1:-2][0::2]]}
    tagline = parsed_title.find_all('h4', class_='inline')[4].next_sibling[1:]
    country = ' / '.join([c.text for c in parsed_title.find_all('div', class_='txt-block')[4].find_all('a')])
    budget = ''.join([i.next_sibling for i in parsed_title.find_all('h4', class_='inline') if 'Budget' in i.text]).replace('\n','').replace(' ','')
    cwg = ''.join([i.next_sibling for i in parsed_title.find_all('h4', class_='inline') if 'Cumulative' in i.text]).replace('\n','').replace(' ','')
    trivia = parsed_title.find('div', {'id':'trivia'}).text.split('    ')[1] if parsed_title.find('div', {'id':'trivia'}) else ''
    goof = parsed_title.find('div', {'id':'goofs'}).text.split('\n')[2].split('   ')[0] if parsed_title.find('div', {'id':'goofs'}) else ''
    
    return [top,ranking, title_film, classification, length,genre, imdb_rating, metascore, ratings, release, summary, storyline, u_reviews, c_reviews, popularity, cast,tagline, country, budget, cwg, trivia, goof]

The links of the top lists are stored in the `tops_imdb` dictionary.

In [207]:
tops_imdb = {'The Top 100 Greatest Movies of all Time':'https://www.imdb.com/list/ls055592025/',
             'Top 100 Horror Films':'https://www.imdb.com/list/ls000007562/',
             '100 Best Sci-Fi movies':'https://www.imdb.com/list/ls009668082/',
             'Top 100 Best Action Movies Of All Time':'https://www.imdb.com/list/ls063897780/',
             'TOP 100 BEST DRAMA MOVIES OF ALL TIME':'https://www.imdb.com/list/ls069376839/',
             'Top 100 Best Foreign Films':'https://www.imdb.com/list/ls062615147/',
             '100 Best Movies of this Decade (2010-2019)':'https://www.imdb.com/list/ls062615147/',
             '2000s Top 100 Movies':'https://www.imdb.com/list/ls002065120/'}

The `base_url` string varible will contain the url template.

In [208]:
base_url = 'https://www.imdb.com/%s/?ref_=ttls_li_tt'

Having crated all the variables and functions before, only rest to call for them in order to scrape all the 800 web pages. This is made in the line of code below:

In [262]:
movies_info = [get_movie_info(base_url, film) for film in [title for top in [get_top_films(key) for key in list(tops_imdb.keys())] for title in top]]

2000s Top 100 Movies - El desesperar de los muertos - 100


The names for the dataframe are defined in the `names` list.

In [265]:
names = ['Top', 'Ranking', 'Title', 'Classification', 'Length', 'Genre', 'Imdb_Rating', 'Metascore', 'No_Ratings', 'Release_Date', 'Summary', 'Storyline', 'User_reviews', 
         'Critic_reviews', 'Popularity', 'Cast', 'Taglines', 'Country', 'Budget', 'Cumulative_Worldwide_Gross', 'Trivia', 'Goof']

The `movies_df` dataframe is created using the `from_records` method, wich is more suitable for the data that is in the form of a list of lists.

In [268]:
movies_df = pd.DataFrame.from_records(movies_info, columns = names)

In [288]:
movies_df.head(50)

Unnamed: 0,Top,Ranking,Title,Classification,Length,Genre,Imdb_Rating,Metascore,No_Ratings,Release_Date,...,User_reviews,Critic_reviews,Popularity,Cast,Taglines,Country,Budget,Cumulative_Worldwide_Gross,Trivia,Goof
0,The Top 100 Greatest Movies of all Time,1,The Godfather,C,1 hr 37 min (97 min),"Crime, Drama",9.2,100.0,1451346,1972-10-04,...,3200,241,126,"{'Marlon Brando ': 'Don Vito Corleone', 'James...",An offer you can't refuse.,USA,"$6,000,000","$245,066,411",,While Michael is talking to Apollonia's father...
1,The Top 100 Greatest Movies of all Time,2,The Shawshank Redemption,B,1 hr 37 min (97 min),Drama,9.3,80.0,2113978,1994-09-23,...,6650,217,83,"{'Tim Robbins ': 'Andy Dufresne', 'Bob Gunton ...",Fear can hold you prisoner. Hope can set you f...,USA,"$25,000,000","$58,500,000",,"Towards the beginning of the film, during a be..."
2,The Top 100 Greatest Movies of all Time,3,Schindler's List,B15,1 hr 37 min (97 min),"History, Biography, Drama",8.9,93.0,1096633,1994-02-04,...,1601,220,262,"{'Liam Neeson ': 'Oskar Schindler', 'Ralph Fie...","Whoever saves one life, saves the world entire...",USA,"$22,000,000","$221,000,000","This film and E.T., El extraterrestre (1982) a...",When Oskar is in bed with his wife and talking...
3,The Top 100 Greatest Movies of all Time,4,Raging Bull,C,1 hr 37 min (97 min),"Biography, Sport, Drama",8.2,89.0,290559,1980-12-19,...,536,178,1816,"{'Robert De Niro ': 'Jake La Motta', 'Joe Pesc...",,See more,"$18,000,000",,Cinematographer Michael Chapman drew inspirati...,When Jake and Joey are sitting at the kitchen ...
4,The Top 100 Greatest Movies of all Time,5,Casablanca,B,1 hr 37 min (97 min),"Romance, War, Drama",8.5,100.0,479858,1943-03-04,...,1215,184,610,"{'Humphrey Bogart ': 'Rick Blaine', 'Paul Henr...",As big and timely a picture as ever you've see...,USA,"$950,000","$14,952,620,",,The German Wehrmacht-Heer (army) personnel sho...
5,The Top 100 Greatest Movies of all Time,6,Citizen Kane,PG,1 hr 37 min (97 min),"Mystery, Drama",8.3,100.0,366096,1941-06-06,...,1357,261,1087,{'Joseph Cotten ': 'Jedediah Leland Screening ...,Everybody's talking about it!,USA,"$839,727",,,At the party scene where Kane dances with the ...
6,The Top 100 Greatest Movies of all Time,7,Gone with the Wind,B,1 hr 37 min (97 min),"Romance, History, War, Drama",8.2,97.0,265596,1941-01-22,...,794,182,225,"{'Thomas Mitchell ': 'Gerald O'Hara', 'Vivien ...",Winner of Ten Academy Awards [reissue] ...,USA,"$3,977,000","$400,176,459",,After Ashley Wilkes is carried into his room f...
7,The Top 100 Greatest Movies of all Time,8,The Wizard of Oz,A,1 hr 37 min (97 min),"Fantasy, Musical, Adventure, Family",8.0,100.0,353850,1940-01-01,...,653,154,310,"{'Judy Garland ': 'Dorothy', 'Ray Bolger ': ''...",Amazing Sights To See ! The Tornado . . . Munc...,USA,"$2,800,000",,The scene in which the Wicked Witch tries to t...,When the Witch sees the ruby slippers under th...
8,The Top 100 Greatest Movies of all Time,9,One Flew Over the Cuckoo's Nest,,1 hr 37 min (97 min),Drama,8.7,83.0,837147,1975-11-19,...,875,163,516,"{'Jack Nicholson ': 'R.P. McMurphy', 'Will Sam...","If he's crazy, what does that make you?",English,"$4,400,000",,(At around six minutes and sixteen seconds) Wh...,"After the Christmas party, and Randall is wait..."
9,The Top 100 Greatest Movies of all Time,10,Lawrence of Arabia,B,1 hr 37 min (97 min),"History, Adventure, Biography, War, Drama",8.3,100.0,244538,1964-05-14,...,615,188,1388,"{'Peter O'Toole ': 'T.E. Lawrence', 'Anthony Q...",An Epic Masterpiece As You've Never Seen It Be...,UK,"$15,000,000","$77,324,144","After six months filming in the desert, Peter ...",General Sir Edmund Allenby (promoted to Field ...


Finally, all the information in `movies_df` is saved as a CSV file calling for the respective method of the dataframe.

In [287]:
movies_df.to_csv('Data_from_Web_Scraping.csv', index = False)