# Web-Scraping-Example
This notebook is a simple example of scraping information about the games in Jeff Gerstmann's *top 10 games of 2017* list on [giantbomb.com](https://www.giantbomb.com/articles/jeff-gerstmanns-top-10-games-of-2017/1100-5703/).

<img src="img/top-10-header.jpg">

The information we want is the game and platform for each entry in the top 10 list. This is contained in the header for each game further down the page.

<img src="img/top-10-item.jpg">

### Import libraries

In [2]:
from bs4 import BeautifulSoup  
import pandas as pd
import requests

### Read html into Python

The `headers` argument below sets a custom user agent to prevent the *Wordpress RSS Reader, Anonymous Bot or Scraper Blocked* error when scraping the website.

In [3]:
address = 'https://www.giantbomb.com/articles/jeff-gerstmanns-top-10-games-of-2017/1100-5703/'
r = requests.get(address, headers = {'user-agent': 'rangell'})
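As a minimal sketch of what the `headers` argument does, the request can be prepared without sending it, which makes the outgoing user agent visible offline. The `timeout` value and the `raise_for_status()` call in the comments are assumptions added for illustration and are not part of the original cell.

```python
import requests

address = 'https://www.giantbomb.com/articles/jeff-gerstmanns-top-10-games-of-2017/1100-5703/'

# Build the request without sending it, so the headers can be inspected offline.
prepared = requests.Request('GET', address, headers={'user-agent': 'rangell'}).prepare()

# Header names are matched case-insensitively by requests.
print(prepared.headers['User-Agent'])

# When actually fetching, a timeout and a status check are sensible additions
# (assumed here, not in the original cell):
# r = requests.get(address, headers={'user-agent': 'rangell'}, timeout=10)
# r.raise_for_status()
```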

In [4]:
r.text[0:1000]

'<!doctype html>\n<html lang="en" itemscope id="" class="no-js no-touch ">\n\n<head><script type="text/javascript">(window.NREUM||(NREUM={})).loader_config={xpid:"Vg8OUFFACQQBUFVbDg=="};window.NREUM||(NREUM={}),__nr_require=function(t,n,e){function r(e){if(!n[e]){var o=n[e]={exports:{}};t[e][0].call(o.exports,function(n){var o=t[e][1][n];return r(o||n)},o,o.exports)}return n[e].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<e.length;o++)r(e[o]);return r}({1:[function(t,n,e){function r(t){try{s.console&&console.log(t)}catch(n){}}var o,i=t("ee"),a=t(15),s={};try{o=localStorage.getItem("__nr_flags").split(","),console&&"function"==typeof console.log&&(s.console=!0,o.indexOf("dev")!==-1&&(s.dev=!0),o.indexOf("nr_dev")!==-1&&(s.nrDev=!0))}catch(c){}s.nrDev&&i.on("internal-error",function(t){r(t.stack)}),s.dev&&i.on("fn-err",function(t,n,e){r(e.stack)}),s.dev&&(r("NR AGENT IN DEVELOPMENT MODE"),r("flags: "+a(s,function(t,n){return t}).join(", ")))},{}],2:[functi

### Parse html

In [5]:
soup = BeautifulSoup(r.text, 'html.parser')  

### Extract main body of the webpage
Looking at the source code for the webpage, the "div" tag with the class "js-content-entity-body" seems to be the main body containing the headers we are after.

<img src="img/top-10-source.jpg">

This returns a list of matching elements with a lot of content, from which we want to extract the headers. The result is converted to a string only to limit the amount printed to the notebook.

In [6]:
results = soup.find_all('div', attrs={'class':'js-content-entity-body'})  
str(results)[1:1000]

'<div class="js-content-entity-body">\n<figure data-align="right" data-embed-type="image" data-img-src="https://static.giantbomb.com/uploads/original/0/9446/2862597-jeffe32016.gif" data-ratio="0.75510204081633" data-ref-id="1300-2862597" data-size="medium" data-width="294" style="width: 294px"><a class="fluid-height" data-ref-id="1300-2862597" href="https://static.giantbomb.com/uploads/original/0/9446/2862597-jeffe32016.gif" style="padding-bottom:75.5%"><img alt="No Caption Provided" data-width="294" sizes="(max-width: 294px) 100vw, 294px" src="https://static.giantbomb.com/uploads/original/0/9446/2862597-jeffe32016.gif" srcset="https://static.giantbomb.com/uploads/original/0/9446/2862597-jeffe32016.gif 294w"/></a></figure><p><em><a href="http://twitter.com/jeffgerstmann" rel="nofollow">Jeff Gerstmann</a> is a professional car streamer and pillow enthusiast from Sonoma County, CA. His hobbies include sitting down and staring forward, macaroni and cheese, and thinking about (but not list

### Extract the headers from main body

The "h3" tags seem to be the headers containing the information we want to extract.

In [7]:
headers = results[0].find_all('h3')
headers

[<h3>10. <a data-ref-id="3030-59450" href="/assassins-creed-origins/3030-59450/">Assassin's Creed Origins</a> (Xbox One X)</h3>,
 <h3>9. <a data-ref-id="3030-57687" href="/splatoon-2/3030-57687/">Splatoon 2</a> (Switch)</h3>,
 <h3>8. <a data-ref-id="3030-48255" href="/heat-signature/3030-48255/">Heat Signature</a> (PC)</h3>,
 <h3>7. <a data-ref-id="3030-51837" href="/tekken-7-fated-retribution/3030-51837/">Tekken 7 </a> (PS4, PC)</h3>,
 <h3>6. <a data-ref-id="3030-52647" href="/destiny-2/3030-52647/">Destiny 2</a> (PS4, PC)</h3>,
 <h3>5. <a data-ref-id="3030-59906" href="/wolfenstein-ii-the-new-colossus/3030-59906/">Wolfenstein II: The New Colossus</a> (Xbox One)</h3>,
 <h3>4. <a data-ref-id="3030-58593" href="/steamworld-dig-2/3030-58593/">SteamWorld Dig 2</a> (PC)</h3>,
 <h3>3. <a data-ref-id="3030-56733" href="/super-mario-odyssey/3030-56733/">Super Mario Odyssey</a> (Switch)</h3>,
 <h3>2. <a data-ref-id="3030-49998" href="/nier-automata/3030-49998/">NieR:Automata</a> (PC)</h3>,
 <h
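The two-step lookup (container "div" first, then its "h3" tags) can be tried on a small inline snippet without fetching the page. The sketch below uses a hand-written stand-in for the real HTML, built from two of the headers in the output, and also shows a CSS selector as one possible one-step alternative. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real page, built from two of the headers shown above.
html = """
<div class="js-content-entity-body">
  <p>Intro text</p>
  <h3>10. <a data-ref-id="3030-59450" href="/assassins-creed-origins/3030-59450/">Assassin's Creed Origins</a> (Xbox One X)</h3>
  <h3>9. <a data-ref-id="3030-57687" href="/splatoon-2/3030-57687/">Splatoon 2</a> (Switch)</h3>
</div>
<h3>Unrelated header outside the body</h3>
"""
soup = BeautifulSoup(html, 'html.parser')

# Two-step lookup, as in the notebook: container first, then its h3 tags.
body = soup.find_all('div', attrs={'class': 'js-content-entity-body'})[0]
headers_two_step = body.find_all('h3')

# One-step lookup with a CSS selector: h3 tags nested inside the container.
headers_css = soup.select('div.js-content-entity-body h3')

print([h.get_text() for h in headers_css])
```

Both lookups skip the "h3" outside the container, which is the reason for narrowing to the main body before searching for headers.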

### Extract info from headers
The function below is tailored to the header structure to extract the relevant information.

In [8]:
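def extract_info(h):
    # Header text looks like "10. Assassin's Creed Origins (Xbox One X)"
    header_text = h.text
    link = h.find('a')  # the game's link holds its title, id and URL
    game = link.text
    data_ref_id = link['data-ref-id']
    href = link['href']
    rank = int(header_text.split(".")[0])  # number before the first "."
    platform = header_text.split("(")[1].replace(")", "")  # text inside the parentheses
    return [rank, game, platform, data_ref_id, href]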
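Because the string splitting relies on the exact header layout, a regular expression is one possible alternative for pulling the rank, title and platform out of the header text. This is a sketch rather than the notebook's method: `HEADER_PATTERN` and `parse_header_text` are hypothetical names, and the pattern assumes the platform list is the last parenthesised group in the text.

```python
import re

# Pattern for text like "10. Assassin's Creed Origins (Xbox One X)":
# the rank, then the title, then the final parenthesised platform list.
HEADER_PATTERN = re.compile(r"^\s*(\d+)\.\s*(.+?)\s*\(([^()]*)\)\s*$")


def parse_header_text(text):
    """Return (rank, game, platform), or None if the text does not match."""
    match = HEADER_PATTERN.match(text)
    if match is None:
        return None
    rank, game, platform = match.groups()
    return int(rank), game, platform


print(parse_header_text("10. Assassin's Creed Origins (Xbox One X)"))
print(parse_header_text("7. Tekken 7  (PS4, PC)"))
```

In the notebook it would be called on `h.text`. Unlike the split-based version, it trims stray whitespace around the title and returns `None` for text that does not fit the layout.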

In [9]:
headers_processed = [extract_info(x) for x in headers[:10]]
headers_processed

[[10,
  "Assassin's Creed Origins",
  'Xbox One X',
  '3030-59450',
  '/assassins-creed-origins/3030-59450/'],
 [9, 'Splatoon 2', 'Switch', '3030-57687', '/splatoon-2/3030-57687/'],
 [8, 'Heat Signature', 'PC', '3030-48255', '/heat-signature/3030-48255/'],
 [7,
  'Tekken 7 ',
  'PS4, PC',
  '3030-51837',
  '/tekken-7-fated-retribution/3030-51837/'],
 [6, 'Destiny 2', 'PS4, PC', '3030-52647', '/destiny-2/3030-52647/'],
 [5,
  'Wolfenstein II: The New Colossus',
  'Xbox One',
  '3030-59906',
  '/wolfenstein-ii-the-new-colossus/3030-59906/'],
 [4, 'SteamWorld Dig 2', 'PC', '3030-58593', '/steamworld-dig-2/3030-58593/'],
 [3,
  'Super Mario Odyssey',
  'Switch',
  '3030-56733',
  '/super-mario-odyssey/3030-56733/'],
 [2, 'NieR:Automata', 'PC', '3030-49998', '/nier-automata/3030-49998/'],
 [1,
  "PlayerUnknown's Battlegrounds",
  'PC',
  '3030-54979',
  '/playerunknowns-battlegrounds/3030-54979/']]
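The `href` values are paths relative to the site root. If full links are needed, they can be resolved against the page address with the standard library's `urljoin`, as in this short sketch. The `hrefs` list is copied from the output above and `absolute_urls` is an illustrative name.

```python
from urllib.parse import urljoin

address = 'https://www.giantbomb.com/articles/jeff-gerstmanns-top-10-games-of-2017/1100-5703/'
hrefs = ['/splatoon-2/3030-57687/', '/heat-signature/3030-48255/']

# A path starting with "/" replaces the whole path of the base address.
absolute_urls = [urljoin(address, href) for href in hrefs]
print(absolute_urls[0])  # https://www.giantbomb.com/splatoon-2/3030-57687/
```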

### Convert the nested list into a pandas dataframe

In [10]:
headers_processed_pd = pd.DataFrame(headers_processed, columns = ["rank", "game", "platform", "data_ref_id", "href"])
headers_processed_pd.sort_values("rank")

| | rank | game | platform | data_ref_id | href |
|---|---|---|---|---|---|
| 9 | 1 | PlayerUnknown's Battlegrounds | PC | 3030-54979 | /playerunknowns-battlegrounds/3030-54979/ |
| 8 | 2 | NieR:Automata | PC | 3030-49998 | /nier-automata/3030-49998/ |
| 7 | 3 | Super Mario Odyssey | Switch | 3030-56733 | /super-mario-odyssey/3030-56733/ |
| 6 | 4 | SteamWorld Dig 2 | PC | 3030-58593 | /steamworld-dig-2/3030-58593/ |
| 5 | 5 | Wolfenstein II: The New Colossus | Xbox One | 3030-59906 | /wolfenstein-ii-the-new-colossus/3030-59906/ |
| 4 | 6 | Destiny 2 | PS4, PC | 3030-52647 | /destiny-2/3030-52647/ |
| 3 | 7 | Tekken 7 | PS4, PC | 3030-51837 | /tekken-7-fated-retribution/3030-51837/ |
| 2 | 8 | Heat Signature | PC | 3030-48255 | /heat-signature/3030-48255/ |
| 1 | 9 | Splatoon 2 | Switch | 3030-57687 | /splatoon-2/3030-57687/ |
| 0 | 10 | Assassin's Creed Origins | Xbox One X | 3030-59450 | /assassins-creed-origins/3030-59450/ |
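`sort_values` returns a new dataframe and keeps the original row labels, which is why the index runs from 9 down to 0. As a possible follow-up, the sketch below keeps the sorted result with a fresh index and splits multi-platform entries into one row per platform. It uses three rows from the output, and the names `ranked` and `per_platform` are illustrative. The `ignore_index` arguments assume a reasonably recent pandas (1.1 or later).

```python
import pandas as pd

rows = [
    [10, "Assassin's Creed Origins", 'Xbox One X'],
    [6, 'Destiny 2', 'PS4, PC'],
    [1, "PlayerUnknown's Battlegrounds", 'PC'],
]
df = pd.DataFrame(rows, columns=['rank', 'game', 'platform'])

# sort_values returns a new frame; ignore_index=True renumbers the rows from 0.
ranked = df.sort_values('rank', ignore_index=True)

# Split "PS4, PC" into a list, then expand to one row per platform.
per_platform = (
    ranked.assign(platform=ranked['platform'].str.split(', '))
          .explode('platform', ignore_index=True)
)
print(per_platform)
```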
