# Introduction to BeautifulSoup

*BeautifulSoup* is a class in the bs4 library for Python, made for parsing **HTML or XML** documents.

**How cool does that sound?**

Think of any website on the internet: you want to collect only one specific detail about a subject, but it is spread across any number of categories. How long will you scroll?
BeautifulSoup allows you to fetch that specific detail from the webpage for every category in a more structured manner.

# What are we trying to do?

Let's say we want to fetch only the movie name, IMDb score and metascore from the IMDb website. How will we do it? Of course, with the help of BeautifulSoup.

So without further ado, let's dive into it.

![Alt Text](https://static.amazon.jobs/teams/53/images/IMDb_Header_Page.jpg?1501027252)

In this simple Kaggle notebook, we will illustrate how we can use BeautifulSoup to scrape the Internet Movie Database (IMDb) at imdb.com for the top films released in 2018 with the highest US box office gross.

We will organize the final results as a dataframe with the elements below:

* name - title of the movie,
* year - release year of the movie,
* rating - IMDb score of the movie,
* m_score - meta score of the movie,
* votes - number of votes.

 **Let's import all the necessary modules**

In [2]:
# !pip install bs4
import bs4
import requests
import time
import random as ran
import sys
import pandas as pd


> Now we search for the top films released in 2018 at imdb.com and scrape the results from the first page

In [3]:
url = 'https://www.imdb.com/search/title?release_date=2018&sort=boxoffice_gross_us,desc&start=1'

source = requests.get(url).text
soup = bs4.BeautifulSoup(source,'html.parser')

In [6]:
print(soup)


<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>Released between 2018-01-01 and 2018-12-31
(Sorted by US Box Office Descending) - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
</script>
<link href="https://www.imdb.com/search/title/?release_date=2018-01-01,20

> Since the code above extracts everything on the first page, the code below keeps only the blocks that hold movie information.

In [3]:
movie_blocks = soup.find_all('div', {'class': 'lister-item-content'})


**BeautifulSoup.find_all(arguments)** 
returns a list-like collection of the tags that match the arguments, in document order. If there are no matches, the method returns an empty list. It is the method to use when the data you want is spread over several elements, or when you cannot pin down a single element right away and need to dig through the matches first. (`findAll` is the older spelling of the same method.)
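
As a minimal sketch of how `find_all` behaves, the snippet below parses a small hand-written HTML string. The markup and the movie titles are made up for illustration; they only mimic the `lister-item-content` blocks used in this notebook.

```python
import bs4

# Hand-written HTML for illustration only; not copied from imdb.com
html = """
<div class="lister-item-content"><a href="/title/1">First Movie</a></div>
<div class="lister-item-content"><a href="/title/2">Second Movie</a></div>
<div class="sidebar"><a href="/help">Help</a></div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Every <div> whose class is "lister-item-content", in document order
blocks = soup.find_all("div", {"class": "lister-item-content"})
titles = [block.find("a").get_text() for block in blocks]

# No match gives an empty result instead of raising an error
no_match = soup.find_all("div", {"class": "does-not-exist"})

print(len(blocks), titles, len(no_match))
```

Because the result behaves like a list, you can index it, loop over it, or take its length, which is what the rest of this notebook relies on.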

> Let's examine one of the extracted blocks to identify the elements that we need to scrape.

In [4]:
mname = movie_blocks[0].find('a').get_text() # Name of the movie

m_reyear = int(movie_blocks[0].find('span',{'class': 'lister-item-year'}).contents[0][1:-1]) # Release year

m_rating = float(movie_blocks[0].find('div',{'class':'inline-block ratings-imdb-rating'}).get('data-value')) #rating

m_mscore = float(movie_blocks[0].find('span',{'class':'metascore favorable'}).contents[0].strip()) # meta score (matches only a span whose class is exactly 'metascore favorable')

m_votes = int(movie_blocks[0].find('span',{'name':'nv'}).get('data-value')) # votes

print("Movie Name: " + mname,
      "\nRelease Year: " + str(m_reyear),
      "\nIMDb Rating: " + str(m_rating),
      "\nMeta score: " + str(m_mscore),
      "\nVotes: " + '{:,}'.format(m_votes)

)

Movie Name: Black Panther 
Release Year: 2018 
IMDb Rating: 7.3 
Meta score: 88.0 
Votes: 674,492


> If you examine the result pages of the IMDb search we ran above, you will notice that changing the `start` value at the end of the URL moves through the search results. We will use this during the scrape to iterate through all pages.
Since scraping the data is an iterative process, we define a separate function for each step.
First, we define a function that extracts the targeted elements from a single movie block (one item of the movie block list discussed above).
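
To make the pagination idea concrete, here is a small sketch that builds the page URLs up front. The helper `page_start_values` is hypothetical, and the page size of 50 is an assumption based on the result ranges this search reports (for example 101 - 150); the scraper in this notebook reads the range from each page instead of assuming it.

```python
def page_start_values(target, page_size=50):
    """Return the `start` value of every page needed to cover `target` movies.

    page_size=50 is an assumption about how many results one page lists.
    """
    return list(range(1, target + 1, page_size))


base_url = ("https://www.imdb.com/search/title?release_date=2018-01-01,2018-12-31"
            "&sort=boxoffice_gross_us,desc&start=")

# One URL per result page for the first 150 movies
urls = [base_url + str(start) for start in page_start_values(150)]

for url in urls:
    print(url)
```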

In [5]:
def scrape_mblock(movie_block):
    """Extract name, year, rating, meta score and votes from one movie block.

    Any field that cannot be found or converted is stored as None.
    """
    movieb_data = {}

    try:
        movieb_data['name'] = movie_block.find('a').get_text()  # Name of the movie
    except Exception:
        movieb_data['name'] = None

    try:
        # The year text is wrapped in brackets, e.g. "(2018)"; [1:-1] strips them
        movieb_data['year'] = str(movie_block.find('span', {'class': 'lister-item-year'}).contents[0][1:-1])
    except Exception:
        movieb_data['year'] = None

    try:
        movieb_data['rating'] = float(movie_block.find('div', {'class': 'inline-block ratings-imdb-rating'}).get('data-value'))  # rating
    except Exception:
        movieb_data['rating'] = None

    try:
        # Matches only a span whose class is exactly 'metascore favorable';
        # blocks without such a span end up with None
        movieb_data['m_score'] = float(movie_block.find('span', {'class': 'metascore favorable'}).contents[0].strip())
    except Exception:
        movieb_data['m_score'] = None

    try:
        movieb_data['votes'] = int(movie_block.find('span', {'name': 'nv'}).get('data-value'))  # votes
    except Exception:
        movieb_data['votes'] = None

    return movieb_data
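
One plausible reason for a missing `m_score` is the way BeautifulSoup matches a class string that contains a space: it is compared against the whole `class` attribute, so `'metascore favorable'` does not match a span with a different second class. The sketch below shows this on hand-written HTML; the class name `mixed` and the scores are assumptions used only for illustration.

```python
import bs4

# Hand-written HTML for illustration; class names and scores are assumptions
html = """
<div class="lister-item-content"><span class="metascore favorable">88</span></div>
<div class="lister-item-content"><span class="metascore mixed">51</span></div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
blocks = soup.find_all("div", {"class": "lister-item-content"})


def exact_match(block):
    # Matches only when the class attribute is exactly "metascore favorable"
    span = block.find("span", {"class": "metascore favorable"})
    return float(span.get_text().strip()) if span is not None else None


def any_metascore(block):
    # Matches any span that has "metascore" among its classes
    span = block.find("span", class_="metascore")
    return float(span.get_text().strip()) if span is not None else None


exact = [exact_match(b) for b in blocks]
loose = [any_metascore(b) for b in blocks]
print(exact, loose)
```

Searching for the single class `metascore` is one way to pick up every score, if that is what you want in your own scraper.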
    

> Next, we create the function below to scrape all movie blocks within a single search result page


In [6]:
def scrape_m_page(movie_blocks):
    """Scrape every movie block of one result page into a list of dicts."""
    # Iterate over the blocks directly instead of indexing by position
    return [scrape_mblock(block) for block in movie_blocks]


> We have now built the functions that extract all movie data from a single page.

The next function runs them over successive pages of the search results until we have scraped data for the targeted number of movies.

In [7]:
def scrape_this(link, t_count):
    """Scrape result pages starting at `link` until `t_count` movies are covered."""
    base_url = link
    target = t_count

    remaining_mcount = target
    next_start = 1  # value appended to the URL's `start` parameter

    movie_data = []

    while remaining_mcount > 0:

        url = base_url + str(next_start)

        source = requests.get(url).text
        soup = bs4.BeautifulSoup(source, 'html.parser')

        movie_blocks = soup.find_all('div', {'class': 'lister-item-content'})

        movie_data.extend(scrape_m_page(movie_blocks))

        # The navigation bar reports the range shown on this page, e.g. 101-150;
        # read it once and split it into its start and end
        range_text = soup.find("div", {"class": "nav"}).find("div", {"class": "desc"}).contents[1].get_text()
        current_mcount_start = int(range_text.split("-")[0])
        current_mcount_end = int(range_text.split("-")[1].split(" ")[0])

        remaining_mcount = target - current_mcount_end

        print('\r' + "currently scraping movies from: " + str(current_mcount_start) + " - " + str(current_mcount_end), "| remaining count: " + str(remaining_mcount), flush=True, end="")

        # The next page starts right after the last movie of this one
        next_start = current_mcount_end + 1

        # Pause for a random number of seconds between requests
        time.sleep(ran.randint(0, 10))

    return movie_data

> Finally, we put together all the functions created above to scrape the top 150 movies on the list



In [8]:
base_scraping_link = "https://www.imdb.com/search/title?release_date=2018-01-01,2018-12-31&sort=boxoffice_gross_us,desc&start="

top_movies = 150  # input("How many movies do you want to scrape?")

movies = scrape_this(base_scraping_link, int(top_movies))

print('\r' + "List of top " + str(top_movies) + " movies:" + "\n", end="\n")
movies = pd.DataFrame(movies)  # one row per movie, one column per scraped field
movies

List of top 150 movies:es from: 101 - 150 | remaining count: 0



|     | name | year | rating | m_score | votes |
|-----|------|------|--------|---------|-------|
| 0 | Black Panther | 2018 | 7.3 | 88.0 | 674492 |
| 1 | Avengers: Infinity War | 2018 | 8.4 | 68.0 | 920841 |
| 2 | Incredibles 2 | 2018 | 7.6 | 80.0 | 268217 |
| 3 | Jurassic World: Fallen Kingdom | 2018 | 6.2 |  | 282417 |
| 4 | Aquaman | 2018 | 6.9 |  | 414547 |
| ... | ... | ... | ... | ... | ... |
| 145 | Boy Erased | 2018 | 6.9 | 69.0 | 35790 |
| 146 | Hotel Artemis | 2018 | 6.1 |  | 50303 |
| 147 | A-X-L | 2018 | 5.3 |  | 11239 |
| 148 | Run the Race | 2018 | 6.1 |  | 1448 |


In [9]:
movies.to_csv('movies.csv', index=False)


**The saved data can then be worked on easily in Excel or any other spreadsheet application**

# Important pointers

Some websites, especially popular ones, deploy defense mechanisms to deter you from crawling or scraping them. The mechanisms vary, but some of the most common are:
1. Blocking common web scraping user agents.
2. Rate limiting: blocking you once you exceed a connection quota.
3. Blocking you based on your behavior.
4. CAPTCHA: the website asks you to answer a question that normally only a human should be able to answer. This mechanism can be triggered when you try to access certain pages of the site.
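
A scraper is less likely to run into the first two mechanisms if it identifies itself and spaces out its requests. The sketch below uses only the standard library and sends nothing over the network; the `USER_AGENT` string and both helper functions are hypothetical names chosen for illustration, not part of any library.

```python
import random
import urllib.request

# Hypothetical identifier; replace it with details that describe your own script
USER_AGENT = "my-movie-scraper/0.1 (contact: you@example.com)"


def build_request(url):
    """Create a request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def backoff_delays(retries, base=1.0, cap=60.0):
    """Wait times in seconds that double on each retry.

    The doubling part is capped at `cap`; up to one second of random
    jitter is added on top so that retries do not line up exactly.
    """
    return [min(cap, base * 2 ** attempt) + random.uniform(0, 1)
            for attempt in range(retries)]


req = build_request("https://www.imdb.com/search/title?release_date=2018")
print(req.get_header("User-agent"))
print(backoff_delays(4))
```

With `requests`, the same header dictionary can be passed as `requests.get(url, headers={...})`, and each delay can be fed to `time.sleep` before retrying.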

# Conclusion

As you can see, all you need is the BeautifulSoup library and a basic knowledge of Python. Whether the HTML comes from a live website, as in this tutorial, or from a small snippet loaded from a local file, the routine for scraping data is exactly the same:

1. Download the HTML source code.
2. Find which tags hold the data you want.
3. Scrape them out of the source code.
4. Extract the data you want from the strings.
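
The four steps above can be sketched end to end on a tiny example. A hand-written HTML string stands in for the downloaded page, and its tag and class names are made up for illustration, so treat this as a sketch of the routine and not as code for a particular website.

```python
import re

import bs4

# Step 1: "download" the HTML. A hand-written string stands in for
# requests.get(url).text so the example runs without network access.
source = """
<ul>
  <li class="movie">
    <span class="title">Example Film</span>
    <span class="year">(2018)</span>
    <span class="votes">1,234 votes</span>
  </li>
</ul>
"""

# Step 2: find the tag that holds the data
soup = bs4.BeautifulSoup(source, "html.parser")
item = soup.find("li", {"class": "movie"})

# Step 3: scrape the raw strings out of that tag
raw_title = item.find("span", {"class": "title"}).get_text()
raw_year = item.find("span", {"class": "year"}).get_text()
raw_votes = item.find("span", {"class": "votes"}).get_text()

# Step 4: extract typed values from the strings
year = int(re.search(r"\d{4}", raw_year).group())
votes = int(re.search(r"[\d,]+", raw_votes).group().replace(",", ""))

record = {"name": raw_title.strip(), "year": year, "votes": votes}
print(record)
```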

I strongly recommend that you play around with BeautifulSoup, as there is a lot more you can do with it. That's it for this introduction to Python web scraping with BeautifulSoup. If you have any questions, feel free to contact me or leave a comment.

Stay tuned for beginner friendly notebooks!!