## Extract data from web

With the boom of internet there is so much data lying in the web in the form of websites. 
There are many ways to extract data from the web. APIs are probably the best way to extract data from a website. 
Most of the big websites like Twitter, Facebook, amazon, New York Times provide APIs to access their data.But not all websites have an API. 
Some websites don't provide one because of privacy concerns or they lack technical knowledge to provide one. 

Web scraping is a technique of extracting information from websites. 
It focuses on the transformation of unstructured data (HTML format) on the web into structured data (database or spreadsheet).

Python has rich eocsystem to scrape data from web and is easy to use. 
The library ‘BeautifulSoup’ assists this task.

#### LIbraries used

**`requests`**: 
This library is used for fetching data from web pages. 
[Click here for documentation](http://docs.python-requests.org/en/master/)

**`BeautifulSoup`**: 
Use this library to extract tables, lists, paragraph from html web pages. 
It also allows filters to extract information from web pages. 
[Click here for documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [1]:
#import the library to query a website
import requests

In [2]:
# specify the url
url = "https://en.wikipedia.org/wiki/List_of_World_Series_champions"

In [3]:
# Open website URL and return the html to the variable 'response'
response = requests.get(url)

In [4]:
# import Beautiful soup library to access functions to parse the data returned from the website
from bs4 import BeautifulSoup

The response we get from web is typically html content. 
We can read the content of the server's response. 
Below, when a BeautifulSoup object is create from an html response, we explicitly reference the text format(`response.text`). 
Because the default encoding format is 'UTF-8' as shown below. 
[Click here for documentation](http://docs.python-requests.org/en/master/user/quickstart/#response-content)

In [5]:
response.encoding

'UTF-8'

In [6]:
response

<Response [200]>

In [7]:
# Parse the html in the 'response' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(response.text, "lxml")

Use prettify function to print the data in nested html structured format.

In [8]:
print(soup.prettify)

<bound method Tag.prettify of <!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of World Series champions - Wikipedia</title>
<script>document.documentElement.className=document.documentElement.className.replace(/(^|\s)client-nojs(\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_World_Series_champions","wgTitle":"List of World Series champions","wgCurRevisionId":907234451,"wgRevisionId":907234451,"wgArticleId":7599168,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with hCards","Commons category link is on Wikidata","Commons category link is on Wikidata using P373","Featured lists","World Series","World Series lists","Lists of sports championships"],"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransfo

We need to extract the table which has list of all baseball world series champions. This table should be present in one of the html tags. Work with the tags to extract data present in them.  "**soup.tag**": will return the content between opening and closing tag including tag. 

In [9]:
soup.title

<title>List of World Series champions - Wikipedia</title>

In [10]:
# Return string within given tag 
soup.title.string

'List of World Series champions - Wikipedia'

**Identify the html tag**: The data is in a table. You can use inspect element option when you right click the mouse to identify the tag which has the data. 

 * [Additional guide on webpage inspection](../../../datasets/AnalyzingHTMLwithTheWebInspector.pdf)


<img src="../images/table.png">

**Find the right table:** As we are seeking a table to extract information about baseball champions, we should identify the right table first. Let’s write the command to extract information within all table tags. 

In [11]:
all_tables=soup.find_all('table')

In [12]:
print(all_tables)

[<table class="vertical-navbox nowraplinks hlist" style="float:right;clear:right;width:22.0em;margin:0 0 1.0em 1.0em;background:#f9f9f9;border:1px solid #aaa;padding:0.2em;border-spacing:0.4em 0;text-align:center;line-height:1.4em;font-size:88%;font-size: 95%; width: 18em; line-height: 1.3em; padding: 0px;"><tbody><tr><td style="padding-top:0.4em;line-height:1.2em">Part of a series on the</td></tr><tr><th style="padding:0.2em 0.4em 0.2em;padding-top:0;font-size:145%;line-height:1.2em;font-size: 105%;"><a href="/wiki/Major_League_Baseball_postseason" title="Major League Baseball postseason">Major League Baseball postseason</a></th></tr><tr><th class="navbox-abovebelow" style="padding:0.1em;padding-bottom:0.3em">
<a href="/wiki/Major_League_Baseball_wild_card" title="Major League Baseball wild card">Wild Card</a></th></tr><tr><td style="padding:0 0.1em 0.4em">
<ul><li><a class="mw-redirect" href="/wiki/American_League_Wild_Card_Game" title="American League Wild Card Game">ALWCG</a></li>


Now to identify the right table, we will use attribute “class” of table and use it to filter the right table. In chrome, you can check the class name by right click on the required table of web page –> Inspect element –> Copy the class name OR go through the output of above command find the class name of right table.

In [13]:
right_table=soup.find('table', class_='wikitable sortable plainrowheaders')
right_table

<table class="wikitable sortable plainrowheaders">
<tbody><tr>
<th scope="col">Year
</th>
<th scope="col">Winning team
</th>
<th scope="col">Manager
</th>
<th class="unsortable" scope="col">Games
</th>
<th scope="col">Losing team
</th>
<th scope="col">Manager
</th>
<th class="unsortable" scope="col">Ref.
</th></tr>
<tr>
<th align="left" scope="row"><a href="/wiki/1903_World_Series" title="1903 World Series">1903</a>
</th>
<td align="left" style="background:#fcc;"><a href="/wiki/1903_Boston_Americans_season" title="1903 Boston Americans season">Boston Americans</a><small> (1, 1–0)</small></td>
<td align="left"><span data-sort-value="Collins, Jimmy"><span class="vcard"><span class="fn"><a href="/wiki/Jimmy_Collins" title="Jimmy Collins">Jimmy Collins</a></span></span></span></td>
<td>5–3<sup class="reference" id="ref_BestOf9V"><a href="#endnote_BestOf9V">[V]</a></sup></td>
<td align="left" style="background:#d0e7ff;"><a href="/wiki/1903_Pittsburgh_Pirates_season" title="1903 Pittsburgh P

In [14]:
#Generate lists
Year=[]
Winning_team=[]
Winning_Manager=[]
Games=[]
Losing_team=[]
Losing_Manager=[]
Ref=[]

# skip first iteration as we dont need headers 
for row in right_table.findAll("tr")[1:]: 
    game_year=row.findAll('th') # To store game year which is in <th> tag
    cells = row.findAll('td') # To store all other details
    if len(cells)>2: # Only extract information if there is table body not heading
        Year.append(game_year[0].find(text=True))
        Winning_team.append(cells[0].find(text=True))
        Winning_Manager.append(cells[1].find(text=True))
        Games.append(cells[2].find(text=True))
        Losing_team.append(cells[3].find(text=True))
        Losing_Manager.append(cells[4].find(text=True))
        Ref.append(cells[5].find(text=True))

Extract the information to DataFrame:
Here, we need to iterate through each row (tr) and then assign each element of tr (td) to a variable and append it to a list. Let’s first look at the HTML structure of the table

In [16]:
#import pandas to convert list to data frame
import pandas as pd
df=pd.DataFrame(Year,columns=['Year'])
df['Winning_team']=Winning_team
df['Winning_Manager']=Winning_Manager
df['Games']=Games
df['Losing_team']=Losing_team
df['Losing_Manager']=Losing_Manager
df['Ref']=Ref
df

Unnamed: 0,Year,Winning_team,Winning_Manager,Games,Losing_team,Losing_Manager,Ref
0,1903,Boston Americans,Jimmy Collins,5–3,Pittsburgh Pirates,Fred Clarke,[8]
1,1905,New York Giants,John McGraw,4–1,Philadelphia Athletics,Connie Mack,[9]
2,1906,Chicago White Sox,Fielder Jones,4–2,Chicago Cubs,Frank Chance,[10]
3,1907,Chicago Cubs,Frank Chance,4–0–(1),Detroit Tigers,Hugh Jennings,[11]
4,1908,Chicago Cubs,Frank Chance,4–1,Detroit Tigers,Hugh Jennings,[12]
5,1909,Pittsburgh Pirates,Fred Clarke,4–3,Detroit Tigers,Hugh Jennings,[13]
6,1910,Philadelphia Athletics,Connie Mack,4–1,Chicago Cubs,Frank Chance,[14]
7,1911,Philadelphia Athletics,Connie Mack,4–2,New York Giants,John McGraw,[15]
8,1912,Boston Red Sox,Jake Stahl,4–3–(1),New York Giants,John McGraw,[16]
9,1913,Philadelphia Athletics,Connie Mack,4–1,New York Giants,John McGraw,[17]
