# Web Scraping
Web scraping is the process of extracting information from a web page by taking advantage of patterns in the page's underlying HTML.

For web scraping there are a few different libraries to consider, including:

- [Beautiful Soup](https://pypi.org/project/beautifulsoup4/)
- [Requests](http://docs.python-requests.org/en/master/)
- [Scrapy](https://scrapy.org/)
- [Selenium](https://selenium-python.readthedocs.io/)

In this tutorial, we will go through a simple example of how to scrape a website to gather data on [the top 100 companies in 2018 from Fast Track](http://www.fasttrack.co.uk/league-tables/tech-track-100/league-table/). Automating this process with a web scraper avoids manual data gathering, saves time and also allows you to have all the data on the companies in one structured file.

For this example we will be using Beautiful Soup. 


## HTML Tags 
Do get familiar with HTML tags!

https://www.w3schools.com/

### HTML Tags Meaning
https://html-css-js.com/html/tags/

### HTML CheatSheet
https://htmlcheatsheet.com/  an online cheatsheet

## Inspect the webpage
Before coding, let's first inspect the webpage we want to scrape.

http://www.fasttrack.co.uk/league-tables/tech-track-100/league-table/

![title](./../../img/FastTrackWebPage.png)


![title](./../../img/FastTrackWebPageOutput.png)

NOTE: All 100 results are contained within rows in <tr> elements and these are all visible on the one page. This will not always be the case and when results span over many pages you may need to either change the number of results displayed on a webpage, or loop over all pages to gather all the information.
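When results do span several pages, one common pattern is to generate one URL per page and fetch them in turn. The sketch below only builds the URLs. The `page` query parameter, the `example.com` address and the helper name `build_page_urls` are all hypothetical; a real site may paginate differently, so check the address bar while clicking through its pages.

```python
from urllib.parse import urlencode


def build_page_urls(base_url, last_page, param="page"):
    """Return one URL per results page, numbered from 1 to last_page."""
    return [f"{base_url}?{urlencode({param: n})}" for n in range(1, last_page + 1)]


# Hypothetical site that exposes its results as ?page=1, ?page=2, ...
page_urls = build_page_urls("http://example.com/league-table/", 3)
for page_url in page_urls:
    print(page_url)
```

Each generated URL could then be passed to the same request and parsing steps used for a single page.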

NOTE: It is also worth checking whether the website makes an HTTP GET request that already returns the results in a structured format such as JSON or XML. You can check this in the network tab of the browser's inspect tools, often under the XHR filter. After you refresh the page, the requests are listed as they load. If a response is already structured, it is often easier to call that endpoint directly, for example with a REST client such as Insomnia.
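If the network tab does show a JSON response, it can usually be turned into Python objects directly, with no HTML parsing. A minimal sketch, assuming a made-up payload: the field names `rank`, `company` and `location` and the two records are illustrative and are not taken from the Fast Track site.

```python
import json

# Hypothetical response body, as it might appear in the browser's network tab
response_body = """
[
  {"rank": 1, "company": "Company A", "location": "London"},
  {"rank": 2, "company": "Company B", "location": "Leeds"}
]
"""

# json.loads turns the text into a list of dictionaries
records = json.loads(response_body)

# Flatten each record into a row, ready to be written to a csv later
json_rows = [[r["rank"], r["company"], r["location"]] for r in records]
print(json_rows)
```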

In [1]:
# installing BeautifulSoup - uncomment the line below to proceed to its installation.
#! pip3 install BeautifulSoup4

In [2]:
# import libraries
from bs4 import BeautifulSoup
import urllib.request
import csv

In [3]:
# specify the url
urlpage =  'http://www.fasttrack.co.uk/league-tables/tech-track-100/league-table/'

In [4]:
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(urlpage)

# parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')

NOTE: `html.parser` is the parser included with the Python standard library, though other parsers can be used by Beautiful Soup. See [differences between parsers](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers) to learn more.

In [5]:
print(soup)

<!-- Template Name: League Table page
-->
<!DOCTYPE html>

<!--[if lt IE 7 ]> <html class="ie ie6 no-js" lang="en-GB" prefix="og: http://ogp.me/ns#"> <![endif]-->
<!--[if IE 7 ]>    <html class="ie ie7 no-js" lang="en-GB" prefix="og: http://ogp.me/ns#"> <![endif]-->
<!--[if IE 8 ]>    <html class="ie ie8 no-js" lang="en-GB" prefix="og: http://ogp.me/ns#"> <![endif]-->
<!--[if IE 9 ]>    <html class="ie ie9 no-js" lang="en-GB" prefix="og: http://ogp.me/ns#"> <![endif]-->
<!--[if gt IE 9]><!--><html class="no-js" lang="en-GB" prefix="og: http://ogp.me/ns#"><!--<![endif]-->
<!-- the "no-js" class is for Modernizr. -->
<head id="live2-fasttrack-com"><link data-minify="1" href="http://www.fasttrack.co.uk/wp-content/cache/min/1/e6fd2bd36052429c415bff0bc306fc34.css" rel="stylesheet"/>
<meta charset="utf-8"/>
<!-- Always force latest IE rendering engine (even in intranet) & Chrome Frame -->
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<title>
		League table - Fast Track	</ti

## Search for HTML Elements

As all of the results are contained within a table, we can search the soup object for the table using the find method. We can then find each row within the table using the find_all method.

If we print the number of rows we should get a result of 101, the 100 rows plus the header.

In [6]:
# find results within table
table = soup.find('table', attrs={'class': 'tableSorter'})
#print(table)

# find all the table rows (tr)
results = table.find_all('tr')
#print(results)

print('Number of results', len(results))

Number of results 101
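
To experiment with find and find_all without requesting the live page, the same calls can be run on a small inline HTML fragment. This is a sketch on made-up markup (the two companies are placeholders); only the `tableSorter` class name is taken from the page above.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the league table: one header row and two data rows
sample_html = """
<table class="tableSorter">
  <tr><th>Rank</th><th>Company</th></tr>
  <tr><td>1</td><td>Company A</td></tr>
  <tr><td>2</td><td>Company B</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find returns the first match; find_all returns a list of every match
sample_table = sample_soup.find("table", attrs={"class": "tableSorter"})
sample_rows = sample_table.find_all("tr")
print("Number of results", len(sample_rows))

# When nothing matches, find returns None instead of raising an error
missing = sample_soup.find("table", attrs={"class": "no-such-class"})
print(missing)
```

Because find returns None when there is no match, calling a method on its result without checking is a common source of `AttributeError` when a page layout changes.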


In [7]:
#we can slice the results, to get 3 results
results[0:3]

[<tr>
 <th>Rank</th>
 <th>Company</th>
 <th class="">Location</th>
 <th class="no-word-wrap">Year end</th>
 <th class="" style="text-align:right;">Annual sales rise over 3 years</th>
 <th class="" style="text-align:right;">Latest sales £000s</th>
 <th class="" style="text-align:right;">Staff</th>
 <th class="">Comment</th>
 <!--				<th>FYE</th>-->
 </tr>, <tr>
 <td>1</td>
 <td><a href="http://www.fasttrack.co.uk/company_profile/plan-com/"><span class="company-name">Plan.com</span></a>Communications provider</td>
 <td>Isle of Man</td>
 <td>Sep-17</td>
 <td style="text-align:right;">364.38%</td>
 <td style="text-align:right;">*35,418</td>
 <td style="text-align:right;">90</td>
 <td>About 650 partners use its telecoms platform to support more than 100,000 UK business customers</td>
 <!--						<td>Sep-17</td>-->
 </tr>, <tr>
 <td>2</td>
 <td><a href="http://www.fasttrack.co.uk/company_profile/psioxus-2/"><span class="company-name">PsiOxus</span></a>Biotechnology developer</td>
 <td>Oxfords

In [8]:
# we can find the last line from the results list
results[-1]

<tr>
<td>100</td>
<td><a href="http://www.fasttrack.co.uk/company_profile/brompton-technology/"><span class="company-name">Brompton Technology</span></a>Video technology provider</td>
<td>West London</td>
<td>Aug-17</td>
<td style="text-align:right;">50.17%</td>
<td style="text-align:right;">*5,250</td>
<td style="text-align:right;">27</td>
<td>Its technology is used in high-profile events such as the Oscars</td>
<!--						<td>Aug-17</td>-->
</tr>

We can therefore loop over the results to gather the data.

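Before doing so, we can also pretty-print the parsed page with the `prettify()` method to inspect its HTML line by line. Note that the next cells reuse the name `results` for these lines of text; it is reassigned to the table rows again before the loop.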

In [9]:
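# print the first 5 lines of the page, formatted with the soup method prettify()
# the adjacent literals '\n' 'newline' are concatenated into one separator,
# so every line after the first is prefixed with the word 'newline'
results = soup.prettify().splitlines()
print('\n''newline'.join(results[:5]))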

<!-- Template Name: League Table page
newline-->
newline<!DOCTYPE html>
newline<!--[if lt IE 7 ]> <html class="ie ie6 no-js" lang="en-GB" prefix="og: http://ogp.me/ns#"> <![endif]-->
newline<!--[if IE 7 ]>    <html class="ie ie7 no-js" lang="en-GB" prefix="og: http://ogp.me/ns#"> <![endif]-->


In [10]:
# print the first 10 and the last 10 lines of the prettified HTML
results = soup.prettify().splitlines()
print('\n'.join(results[:10] + results[-10:]))

<!-- Template Name: League Table page
-->
<!DOCTYPE html>
<!--[if lt IE 7 ]> <html class="ie ie6 no-js" lang="en-GB" prefix="og: http://ogp.me/ns#"> <![endif]-->
<!--[if IE 7 ]>    <html class="ie ie7 no-js" lang="en-GB" prefix="og: http://ogp.me/ns#"> <![endif]-->
<!--[if IE 8 ]>    <html class="ie ie8 no-js" lang="en-GB" prefix="og: http://ogp.me/ns#"> <![endif]-->
<!--[if IE 9 ]>    <html class="ie ie9 no-js" lang="en-GB" prefix="og: http://ogp.me/ns#"> <![endif]-->
<!--[if gt IE 9]><!-->
<html class="no-js" lang="en-GB" prefix="og: http://ogp.me/ns#">
 <!--<![endif]-->

	  (function() {
	    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
	    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
	    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
	  })();

	</script>
	-->
<!-- Cached page for great performance - Debug: cached@1543249281 -->


In [11]:
print(results)

['<!-- Template Name: League Table page', '-->', '<!DOCTYPE html>', '<!--[if lt IE 7 ]> <html class="ie ie6 no-js" lang="en-GB" prefix="og: http://ogp.me/ns#"> <![endif]-->', '<!--[if IE 7 ]>    <html class="ie ie7 no-js" lang="en-GB" prefix="og: http://ogp.me/ns#"> <![endif]-->', '<!--[if IE 8 ]>    <html class="ie ie8 no-js" lang="en-GB" prefix="og: http://ogp.me/ns#"> <![endif]-->', '<!--[if IE 9 ]>    <html class="ie ie9 no-js" lang="en-GB" prefix="og: http://ogp.me/ns#"> <![endif]-->', '<!--[if gt IE 9]><!-->', '<html class="no-js" lang="en-GB" prefix="og: http://ogp.me/ns#">', ' <!--<![endif]-->', ' <!-- the "no-js" class is for Modernizr. -->', ' <head id="live2-fasttrack-com">', '  <link data-minify="1" href="http://www.fasttrack.co.uk/wp-content/cache/min/1/e6fd2bd36052429c415bff0bc306fc34.css" rel="stylesheet"/>', '  <meta charset="utf-8"/>', '  <!-- Always force latest IE rendering engine (even in intranet) & Chrome Frame -->', '  <meta content="IE=edge,chrome=1" http-equiv=

## Looping through elements and saving variables
In Python, a convenient approach is to append each result to a list and then write that list to a file. We declare the list and add the csv headers before the loop:

In [12]:
# create and write headers to a list 
rows = []
rows.append(['Rank', 'Company Name', 'Webpage', 'Description', 'Location', 'Year end', 'Annual sales rise over 3 years', 'Sales £000s', 'Staff', 'Comments'])
print(rows)

[['Rank', 'Company Name', 'Webpage', 'Description', 'Location', 'Year end', 'Annual sales rise over 3 years', 'Sales £000s', 'Staff', 'Comments']]


The code above prints out the first row that we have added to the list containing the headers.

You might notice that there are a few extra fields, Webpage and Description, which are not column names in the table. If you take a closer look at the html of the table rows printed above, the second column contains more than just the company name. We can use some further extraction to get this extra information.


The next step is to loop over the results, process the data and append to rows which can be written to a csv.

In [14]:
results = table.find_all('tr')
#   Loop over the results
for result in results:
#   Find all columns per result
    data = result.find_all('td')
#   Check that columns have data
    if len(data) == 0:
        continue

Since the first row in the table contains only the headers, we can skip this result, as shown above. It also does not contain any <td> elements so when searching for the element, nothing is returned. We can then check that only results containing data are processed by requiring the length of the data to be non-zero.

We can then start to process the data and save to variables.


In [15]:
# write columns to variables
rank = data[0].getText()
company = data[1].getText()
location = data[2].getText()
yearend = data[3].getText()
salesrise = data[4].getText()
sales = data[5].getText()
staff = data[6].getText()
comments = data[7].getText()

The above simply gets the text from each of the columns and saves to variables. Some of this data however needs further cleaning to remove unwanted characters or extract further information.
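
The two steps above (skipping the header row, then reading the text of each cell) can be tried together on a small inline table. This is a sketch on made-up markup rather than the live page. It uses `get_text`, the documented name of the `getText` method used above, and passes `strip=True`, which additionally trims surrounding whitespace.

```python
from bs4 import BeautifulSoup

sample_html = """
<table>
  <tr><th>Rank</th><th>Location</th></tr>
  <tr><td>1</td><td> Isle of Man </td></tr>
  <tr><td>100</td><td>West London</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

extracted = []
for row in sample_soup.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == 0:
        continue  # header row: it has <th> cells but no <td> cells
    # collect the trimmed text of every cell in this row
    extracted.append([cell.get_text(strip=True) for cell in cells])

print(extracted)
```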

## Data Cleaning
If we print out the variable company, the text not only contains the **name of the company** but also a **description**. If we then print out sales, it contains unwanted characters such as footnote symbols that would be useful to remove.

In [16]:
# the values printed depend on which row `data` currently holds
print('Company is', company)
print('Sales', sales)

Company is Brompton TechnologyVideo technology provider
Sales *5,250


We would like to split company into the company name and the description which we can do in a few lines of code. Looking again at the html, for this column there is a <span> element that contains only the company name. There is also a link in this column to another page on the website that has more detailed information about the company. We will be using this a little later!

To separate company into two fields, we can use the **find** method to get the <span> element and then use **replace** to remove the company name from the company variable, so that only the description is left. (**strip** is not suitable for this, because it removes a set of characters from both ends of the string rather than a substring.)

In [17]:
# extract description from the name
companyname = data[1].find('span', attrs={'class':'company-name'}).getText()    
description = company.replace(companyname, '')
    
# remove unwanted characters
sales = sales.strip('*').strip('†').replace(',','')

In [22]:
print(companyname)
print(description)
print(sales)

Brompton Technology
Video technology provider
5250
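
The difference between replace and strip is easy to miss, so here is a small sketch using the values printed above. The `sample_` variable names are only there to avoid overwriting the notebook's own variables.

```python
sample_company = "Brompton TechnologyVideo technology provider"
sample_name = "Brompton Technology"

# strip() treats its argument as a set of characters, so it also removes the
# trailing "er" of "provider" (both letters occur in the company name)
stripped = sample_company.strip(sample_name)

# replace() removes the exact substring and leaves the rest untouched
replaced = sample_company.replace(sample_name, "")

print(repr(stripped))
print(repr(replaced))

# Cleaning the sales figure and converting it to an integer for analysis
sample_sales = "*5,250"
sales_value = int(sample_sales.strip("*").strip("†").replace(",", ""))
print(sales_value)
```

Using strip for the footnote symbols is fine, since there the intent really is to drop individual characters from the ends of the string.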


The last variable we would like to save is the company website. As discussed above, the second column contains a link to another page that has an overview of each company. Each company page has its own table, which most of the time contains the company website.

### Inspecting the element of the url on the company page
![title](./../../img/fastTrack.png)



To scrape the url from each table and save it as a variable, we need to use the same steps as above:

- Find the element that has the url of the company page on the Fast Track website
- Make a request to each company page url
- Parse the html using Beautiful Soup
- Find the elements of interest


Looking at a few of the company pages, as in the screenshot above, the urls are in the last row of the table, so we can search within the last row for the <a> element.

In [24]:
# go to link and extract company website
url = data[1].find('a').get('href')
page = urllib.request.urlopen(url)

# parse the html 
soup = BeautifulSoup(page, 'html.parser')

# find the last result in the table and get the link
try:
    tableRow = soup.find('table').find_all('tr')[-1]
    webpage = tableRow.find('a').get('href')
except (AttributeError, IndexError):
    # no table, no rows, or no link in the last row
    webpage = None

There may also be cases where the company website is not displayed, so we wrap the lookup in a try/except block and fall back to None when a url is not found.
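
An alternative to try/except is to check each lookup explicitly, which makes it clear which step can come back empty. The sketch below is one possible way to write this; the helper name `find_company_website` and the two inline HTML snippets are hypothetical.

```python
from bs4 import BeautifulSoup


def find_company_website(company_soup):
    """Return the href of the link in the last table row, or None if absent."""
    table = company_soup.find("table")
    if table is None:
        return None  # the page has no table at all
    table_rows = table.find_all("tr")
    if not table_rows:
        return None  # the table has no rows
    link = table_rows[-1].find("a")
    return link.get("href") if link is not None else None


with_link = BeautifulSoup(
    '<table><tr><td>Website</td>'
    '<td><a href="http://www.example.com">example.com</a></td></tr></table>',
    "html.parser",
)
without_link = BeautifulSoup("<p>No table on this page</p>", "html.parser")

print(find_company_website(with_link))
print(find_company_website(without_link))
```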

Once we have saved all of the data to variables, still within the loop, we can add each result to the list rows.

In [25]:
# write each result to rows
rows.append([rank, companyname, webpage, description, location, yearend, salesrise, sales, staff, comments])
print(rows)

[['Rank', 'Company Name', 'Webpage', 'Description', 'Location', 'Year end', 'Annual sales rise over 3 years', 'Sales £000s', 'Staff', 'Comments'], ['100', 'Brompton Technology', 'http://www.bromptontech.com', 'Video technology provider', 'West London', 'Aug-17', '50.17%', '5250', '27', 'Its technology is used in high-profile events such as the Oscars']]


It is then useful to print the variable outside of the loop, to check that it looks as you expect before writing it to a file!

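NOTE: The cells above were run outside the loop, so they process only one row (the last one, rank 100). To generate all 100 rows, these steps must be placed inside the loop, as in the full program below.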

## Writing an output file
You may want to save this data for analysis. This can be done in a few lines of Python, starting from our list.

In [28]:
# Create csv and write rows to output file
try:
    with open('./../../files/techtrack100.csv','w', newline='', encoding="utf-8") as f_output:
        csv_output = csv.writer(f_output)
        csv_output.writerows(rows)
        print('file successfully created')
except Exception as e:
    print('The following error occurred: ', str(e))

file successfully created
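
A quick way to confirm that a file was written as intended is to read it back with `csv.reader`. The sketch below writes two sample rows to a temporary file (so it does not touch the tutorial's output file) and compares what comes back.

```python
import csv
import os
import tempfile

csv_rows = [
    ["Rank", "Company Name", "Sales £000s"],
    ["100", "Brompton Technology", "5250"],
]

# Write to a temporary location rather than the tutorial's output path
csv_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as f_output:
    csv.writer(f_output).writerows(csv_rows)

# Read it back with the same newline and encoding settings
with open(csv_path, newline="", encoding="utf-8") as f_input:
    read_back = list(csv.reader(f_input))

print(read_back == csv_rows)
```

Passing the same `encoding` on both sides matters here because the header contains a non-ASCII character (£).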


### File Output - techtrack100.csv

![title](./../../img/TeckTrack100-csv.png)

## Full Program

In [32]:
# -*- coding: utf-8 -*-

# importing libraries
from bs4 import BeautifulSoup
import urllib.request
import csv

#   Specify the URL
urlpage =   'http://www.fasttrack.co.uk/league-tables/tech-track-100/league-table/'

#   Query the website and return the html to the variable 'page'
page = urllib.request.urlopen(urlpage)

#   Parse the HTML using BS and store in variable 'Soup'
soup = BeautifulSoup(page,'html.parser')

#   Print the results
#print(soup)

#   Create customer error handler
#--------------

#   Find results within the table
table = soup.find('table', attrs={'class': 'tableSorter'})
results = table.find_all('tr')
print('Number of results', len(results))

#   Create and write headers to a list 
rows = []
rows.append(['Rank', 'Company Name', 'Webpage', 'Description', 'Location', 'Year End', 'Annual Sales Rise over 3 Years', 'Sales £000s', 'Staff', 'Comments'])
#print(rows)

#   Loop over the results
for result in results:
    #   Find all columns per result
    data = result.find_all('td')
    #   Check that columns have data (the header row has none)
    if len(data) == 0:
        continue

    #   Write columns to variables
    rank = data[0].getText()
    company = data[1].getText()
    location = data[2].getText()
    yearend = data[3].getText()
    salesrise = data[4].getText()
    sales = data[5].getText()
    staff = data[6].getText()
    comments = data[7].getText()
    #    print(rank)
    #    extract description from the name
    companyname = data[1].find('span', attrs={'class':'company-name'}).getText()    
    description = company.replace(companyname, '')
    
    #   remove unwanted characters
    sales = sales.strip('*').strip('†').replace(',','')

    #    go to link and extract company website
    url = data[1].find('a').get('href')
    page = urllib.request.urlopen(url)
    #    parse the html 
    soup = BeautifulSoup(page, 'html.parser')
    #    find the last result in the table and get the link
    try:
        tableRow = soup.find('table').find_all('tr')[-1]
        webpage = tableRow.find('a').get('href')
    except (AttributeError, IndexError):
        #    no table, no rows, or no link in the last row
        webpage = None

    #    write each result to rows
    rows.append([rank, companyname, webpage, description, location, yearend, salesrise, sales, staff, comments])
    #print(rows)

#    Create csv and write rows to output file (utf-8 encoding keeps characters such as '£' intact)
with open('techtrack100.csv','w', newline='', encoding="utf-8") as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerows(rows)


Number of results 101


## Web Scraping Advice
- Web scraping works best with **static, well-structured web pages**. Dynamic or interactive content on a web page is often not accessible through the HTML source, which makes scraping it much harder!
- Web scraping is a "fragile" approach for building a dataset. The HTML on a page you are scraping can **change at any time**, which may cause your scraper to stop working.
- If you can **download the data** you need from a website, or if the website provides an **API with data access**, those approaches are preferable to scraping since they are easier to implement and less likely to break.
- If you are **scraping a lot of pages** from the same website (in rapid succession), it's best to insert delays in your code so that **you don't overwhelm** the website with requests. If the website decides you are causing a problem, they can block your IP address (which may affect everyone in your building!)
- Before scraping a website, you should review its **robots.txt file** (also known as the [Robots exclusion standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard)) to check whether you are "allowed" to scrape their website. (Here is the [robots.txt file for nytimes.com](http://www.nytimes.com/robots.txt).)
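
The last two points (pausing between requests and respecting robots.txt) can be combined in a small helper. This is a sketch under stated assumptions: the robots.txt rules, the `example.com` urls and the helper name `fetch_politely` are hypothetical, and the `fetch` argument is a stand-in for a real request function so that the example runs offline.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at <site>/robots.txt
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

robots = RobotFileParser()
robots.parse(robots_lines)


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each allowed url, pausing after every request."""
    fetched = []
    for url in urls:
        if not robots.can_fetch("*", url):
            continue  # skip anything robots.txt disallows
        fetched.append(fetch(url))
        time.sleep(delay_seconds)
    return fetched


candidate_urls = [
    "http://example.com/league-table/",
    "http://example.com/private/report/",
]

# Stand-in for a real request so that no network access is needed
pages = fetch_politely(
    candidate_urls,
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=0.1,
)
print(pages)
```

Against a live site, the parser would instead be pointed at the real file with `set_url(...)` followed by `read()`, and `fetch` would wrap a call such as `urllib.request.urlopen`.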

## Summary
This brief tutorial on web scraping with Python has outlined:

- Connecting to a webpage
- Parsing html using BeautifulSoup
- Looping through the soup object to find elements
- Performing some simple data cleaning
- Writing data to a csv


## Additional References

- The [Beautiful Soup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) is written like a tutorial, and is worth reading to gain a detailed understanding of the library.
- For more Beautiful Soup examples, see [Web Scraping 101 with Python](http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/), [More web scraping with Python](http://www.gregreda.com/2013/04/29/more-web-scraping-with-python/), and this [web scraping lesson](http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html) from Stanford's "Text As Data" course.
- [Web Scraping with Python](https://www.youtube.com/watch?v=p1iX0uxM1w8) is a 3-hour video tutorial covering Beautiful Soup and other scraping tools. (The [slides](https://docs.google.com/presentation/d/1uHM_esB13VuSf7O1ScGueisnrtu-6usGFD3fs4z5YCE/edit#slide=id.p) and [code](https://github.com/kjam/python-web-scraping-tutorial) are also available.)
- [Scrapy](http://scrapy.org/) is a popular application framework that is useful for more complex web scraping projects.
- [How a Math Genius Hacked OkCupid to Find True Love](http://www.wired.com/2014/01/how-to-hack-okcupid/all/) and [How Netflix Reverse Engineered Hollywood](http://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/?single_page=true) are two fun examples of using web scraping to build an interesting dataset.