<a href="https://colab.research.google.com/github/umarhassan1996/SJICWeek5/blob/main/01032021Scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##An example scraper using a list

The code below can be copied and adapted to create your own scraper.

The first part installs all the libraries. I've kept this separate to the other parts so that you don't have to install them every time you want to run the scraper itself.

In [34]:
#install the libraries 
#scraperwiki is a library for scraping webpages
!pip install scraperwiki
import scraperwiki
#lxml.html is used to convert it into xml (more structured)
import lxml.html
#cssselect is used to drill down into that and find data in tags
!pip install cssselect
import cssselect
#the pandas library which is used to work with data - we call it 'pd' here so we have to type less!
import pandas as pd



##Using a list

Below we write some code to create a list of counties that can be used to generate URLs on a karting site.

We also store the 'base URL' that we will add to each item in the list to create a full URL.

In [69]:
#create a list of dates that we will need to generate URLs
journojobtypes = ["online","nationals","b2b"]
#store the base URL we will add those to
baseurl = "https://www.cisionjobs.co.uk/jobs/journalist/"

## Using a loop

Next we loop through each item in the list and add it to that base url using the + operator.

We add a print function inside the loop to check that it works each time - and copy those links into a browser to check that they are the right links.

In [70]:
#start looping through our list
for i in journojobtypes:
  fullurl = baseurl+i
  print(fullurl)

https://www.cisionjobs.co.uk/jobs/journalist/online
https://www.cisionjobs.co.uk/jobs/journalist/nationals
https://www.cisionjobs.co.uk/jobs/journalist/b2b


## Scraping each URL as we loop

Now that we know the loop works in generating the right URLs, we can extend the code inside the loop so that it scrapes each URL.

At this point we are using some of the libraries we imported at the start. scraperwiki.scrape(), for example, is the scrape() function from the scraperwiki library.

Let's look at the code first, and then explain it...

In [75]:
for i in journojobtypes:
  fullurl = baseurl+i
  print(fullurl)
  #Scrape the html at that url
  html = scraperwiki.scrape(fullurl)
  # turn our HTML into an lxml object
  root = lxml.html.fromstring(html) 
  #The names are all in <td> and <a>
  #This targets the contents of those html tags
jobs = root.cssselect('span')
  #the results are always a list so we have to loop through it using a 'for' loop
for i in jobs:
    #each item in the list is called i as it loops
    print(i)
    #on its own it looks odd, but we can attach .text_content() to translate it into text
    job = i.text_content()
    print(job)

https://www.cisionjobs.co.uk/jobs/journalist/online
https://www.cisionjobs.co.uk/jobs/journalist/nationals
https://www.cisionjobs.co.uk/jobs/journalist/b2b
<Element span at 0x7fde4d9fa110>
Skip to main menu
<Element span at 0x7fde4d9fa170>
Skip to user menu
<Element span at 0x7fde519b4e30>
Jobseekers
<Element span at 0x7fde51a08dd0>
or
<Element span at 0x7fde519f2530>

			
				US
			
			
				EU
			
		
<Element span at 0x7fde4d987050>

<Element span at 0x7fde4f86fdd0>
Email
<Element span at 0x7fde4da6ccb0>

<Element span at 0x7fde4da6c5f0>
Remove selection
<Element span at 0x7fde4da6ca70>

<Element span at 0x7fde4da6c0b0>
Remove selection
<Element span at 0x7fde4da6cad0>

<Element span at 0x7fde4da6c290>

<Element span at 0x7fde4da6c170>

<Element span at 0x7fde4da6c590>

<Element span at 0x7fde4da6c2f0>

<Element span at 0x7fde4da6c710>

<Element span at 0x7fde4da6cf50>
Trading Risk, Reporter
<Element span at 0x7fde4da6c770>
Trading Risk, Reporter
<Element span at 0x7fde4da6c890>
Save 

## The functions we are using

## Let's break some of this down.

So scraperwiki.scrape() is the scrape() function from the scraperwiki library. The ingredient we give to that function is the URL we stored in the fullurl variable.

The scrape() function basically fetches the whole webpage at a given address (the ingredient it's given).

The results of running that function are stored in a new variable called html.

This isn't in a form we can easily work with, yet, so we need another function to convert it to something we can drill down into.

That function is the fromstring() function from the lxml.html library. The ingredient we give to that function is the html variable we just created.

The results are stored in another new variable, root.

This variable is a particular type of object (an "lxml object" if you need to know) that can be drilled down into using the cssselect function. That function will grab elements that match the CSS selectors that you give it as an ingredient.

In this case we specify 'h2', which means "any h2 tag" - so it will grab the contents of any h2 tags in the page.

Don't worry about memorising any of the code above: this is code that you can re-use time and time again. The only bit you will need to change is the selector, in order to specify the particular HTML you're after.

To work out the selector you need, you'll often need to Google around, learning as you go, but selectors are pretty easy to get the hang of, and I'll talk about it more below.

## Using CSS selectors

CSS selectors are used to target different elements in a HTML page.

## Saving the information we've grabbed.

Now we've grabbed some information we can extend the code further to save it.

At this point we need to use functions from another library: pandas. This is a library for data storage and analysis. When we imported pandas we called it pd for short. This is quite common. Any reference to pd in the code, then, means pandas

First, we use the function DataFrame() which creates a pandas dataframe. As ingredients it needs to know the names of any columns.

You will see below that we add a line before the loop which uses that to create an empty dataframe to store the data in.

Then, inside the loop, the data we extract is added to the dataframe.

Here's the code first - then I'll explain the new bits after.

In [82]:
#Create a dataframe to store the data we are about to scrape
#It has one column called 'title'
#We call this dataframe 'df'
df = pd.DataFrame(columns=["job-role"])

for i in journojobtypes:
  fullurl = baseurl+i
  print(fullurl)
  #Scrape the html at that url
  html = scraperwiki.scrape(fullurl)
  # turn our HTML into an lxml object
  root = lxml.html.fromstring(html) 
  #The names are all in <td> and <a>
  #This targets the contents of those html tags
jobs = root.cssselect('h3 span')
  #the results are always a list so we have to loop through it using a 'for' loop
for i in jobs:
    #each item in the list is called i as it loops
    print(i)
    #on its own it looks odd, but we can attach .text_content() to translate it into text
    job = i.text_content()
    print(job)
    #Now we need to store it in that variable called 'df' 
    df = df.append({
      "job-role" : job
      }, ignore_index=True)

https://www.cisionjobs.co.uk/jobs/journalist/online
https://www.cisionjobs.co.uk/jobs/journalist/nationals
https://www.cisionjobs.co.uk/jobs/journalist/b2b
<Element span at 0x7fde4f3ca1d0>
Trading Risk, Reporter
<Element span at 0x7fde4f3ca830>
MCA, Editor 
<Element span at 0x7fde4f86d0b0>
MCA, Editor 
<Element span at 0x7fde4f86dd70>
Global Restructuring Review, News Reporter
<Element span at 0x7fde4d987050>
Global Banking Regulation Review, Trainee News Reporter
<Element span at 0x7fde51a2ead0>
B2B Editor 
<Element span at 0x7fde4da0d1d0>
Employee Benefits, Reporter 
<Element span at 0x7fde4f8700b0>
Gaming Intelligence, Editorial Internship
<Element span at 0x7fde4f3a9bf0>
Gaming Intelligence, Multimedia Editor
<Element span at 0x7fde4d990830>
Gaming Intelligence, B2B Reporter
<Element span at 0x7fde4d9908f0>
The Media Eye, Researcher & Content Writer - Celebrity Team
<Element span at 0x7fde4f3e54d0>
Distillery, Writer
<Element span at 0x7fde4da6c530>
Ground Engineering, Senior Repor

## The new code

The first line of new code is this:

df = pd.DataFrame(columns=["title"])

We are creating a new variable here, called df, and assigning to it the results of using a function: pd.DataFrame() (the pandas function DataFrame).

That takes an ingredient which specifies the columns as being a list (note the square brackets) of one string: "title".

The second line of new code is this:

df = df.append({
      "title" : title
      }, ignore_index=True)
This takes the df variable and updates it.

On the right of the equals sign is df.append() - this means it is using a function called append to append (add) new data to the df variable it's attached to.

The append function can include various ingredients: firstly the data that you want to append to the dataframe; but also settings, such as whether you want something called ignore_index to be True or False. Setting this to True just avoids problems when your data isn't unique.

What about the data that you are appending? Well, this has to be in the form of a dictionary. A dictionary is like a list, but with two key differences: firstly that it uses curly brackets instead of square ones: {}, and secondly it's a list of pairs: a 'key', and a 'value', separated by a colon.

Here's the dictionary in our code:

{"title" : title}

The first part, "title" is the key. This matches the column heading in the empty data frame. Note that it's a string: a label, basically.

The second part, title, is the value. This isn't in quotes so it's not a string - it's a variable. A few lines earlier we created this variable with title = i.text_content()

So having extracted that information and stored it in title, the line of code is storing it in a dataframe with the label (key) "title":

df = df.append({
      "title" : title
      }, ignore_index=True)

We can print the dataframe to see what's in there:

In [83]:
#Once the loop has finished we can take a look at the data
print(df)

                                             job-role
0                              Trading Risk, Reporter
1                                        MCA, Editor 
2                                        MCA, Editor 
3          Global Restructuring Review, News Reporter
4   Global Banking Regulation Review, Trainee News...
5                                         B2B Editor 
6                        Employee Benefits, Reporter 
7           Gaming Intelligence, Editorial Internship
8              Gaming Intelligence, Multimedia Editor
9                   Gaming Intelligence, B2B Reporter
10  The Media Eye, Researcher & Content Writer - C...
11                                 Distillery, Writer
12               Ground Engineering, Senior Reporter 
13                             Trading Risk, Reporter
14                            British Baker, Reporter
15                        Construction News, Reporter
16                     Construction News, News Editor
17                          

##Exporting the data

The pandas library has another function for exporting data: to_csv().

It needs to be attached to the name of the dataframe variable with a period, then, in the brackets, you specify the name of the file you want to export it as. Make sure this ends in '.csv' so it can be used in a spreadsheet.

In [84]:
#And we can export it
df.to_csv("journojobs.csv")

##That's it. I have managed to use a list to scrape through a list of URLs. Unbelievable.