# Scraping with BeautifulSoup

In [1]:
html_doc = """
<html><head><title>The title is Demo for BeautifulSoup</title></head>
<body>
<h1 class="title"><b>The title headline is Demo for BeautifulSoup</b></p>

<section class="main" id="all_plants">

<p class="article">There are three things to keep in mind:
<a href="http://example.com/plant1" class="plants life" id="plant1">Plant 1</a>: <span class="cost">$10</span>,
<a href="http://example.com/plant2" class="plants life" id="plant2">Plant 2</a>: <span class="cost">$20</span> and
<a href="http://example.com/plant3" class="plants life" id="plant3">Plant 3</a> <span class="cost">$30</span>;
</p>
<strong>Don't forget to water these 3 plants.</strong>


</section>

<section class="main" id="all_animals">
<p class="article"> There are three animals in the barn:
<a href="http://example.com/animal1" class="animals life" id="animal1">Animal 1</a>: <span class="cost">$500</span>,
<a href="http://example.com/animal2" class="animals life" id="animal2">Animal 2</a>: <span class="cost">$600</span> and
<a href="http://example.com/animal3" class="animals life" id="animal3">Animal 3</a>: <span class="cost">$700</span>;
</p>
<strong>Don't forget these feed these 3 animals.</strong>
</section>


<section>
<p><span>Inanimate object 1</span></p>
<p><span>Inanimate object 2</span></p>
<p><span>Inanimate object 3</span></p>
</section>
"""

print(html_doc)


<html><head><title>The title is Demo for BeautifulSoup</title></head>
<body>
<h1 class="title"><b>The title headline is Demo for BeautifulSoup</b></p>

<section class="main" id="all_plants">

<p class="article">There are three things to keep in mind:
<a href="http://example.com/plant1" class="plants life" id="plant1">Plant 1</a>: <span class="cost">$10</span>,
<a href="http://example.com/plant2" class="plants life" id="plant2">Plant 2</a>: <span class="cost">$20</span> and
<a href="http://example.com/plant3" class="plants life" id="plant3">Plant 3</a> <span class="cost">$30</span>;
</p>
<strong>Don't forget to water these 3 plants.</strong>


</section>

<section class="main" id="all_animals">
<p class="article"> There are three animals in the barn:
<a href="http://example.com/animal1" class="animals life" id="animal1">Animal 1</a>: <span class="cost">$500</span>,
<a href="http://example.com/animal2" class="animals life" id="animal2">Animal 2</a>: <span class="cost">$600</span> and
<a

In [2]:
## import library
from bs4 import BeautifulSoup

In [3]:
pip install bs4

Collecting bs4
  Using cached bs4-0.0.1.tar.gz (1.1 kB)
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1272 sha256=3462a030731c910ada8fe6679ebed3a805a75df5a4ec9a69be9354cb1e31bd6f
  Stored in directory: /Users/sandeep.junnarkar/Library/Caches/pip/wheels/75/78/21/68b124549c9bdc94f822c02fb9aa3578a669843f9767776bca
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1
Note: you may need to restart the kernel to use updated packages.


## Create a BeautifulSoup object
<img src="../support_files/bs-soup.png">

In [4]:
## pass the HTML string and the name of the parser to use
soup = BeautifulSoup(html_doc, "html.parser")

In [5]:
## What type of object is it?
type(soup)

bs4.BeautifulSoup

In [7]:
print(soup.prettify())

<html>
 <head>
  <title>
   The title is Demo for BeautifulSoup
  </title>
 </head>
 <body>
  <h1 class="title">
   <b>
    The title headline is Demo for BeautifulSoup
   </b>
  </h1>
 </body>
</html>
<section class="main" id="all_plants">
 <p class="article">
  There are three things to keep in mind:
  <a class="plants life" href="http://example.com/plant1" id="plant1">
   Plant 1
  </a>
  :
  <span class="cost">
   $10
  </span>
  ,
  <a class="plants life" href="http://example.com/plant2" id="plant2">
   Plant 2
  </a>
  :
  <span class="cost">
   $20
  </span>
  and
  <a class="plants life" href="http://example.com/plant3" id="plant3">
   Plant 3
  </a>
  <span class="cost">
   $30
  </span>
  ;
 </p>
 <strong>
  Don't forget to water these 3 plants.
 </strong>
</section>
<section class="main" id="all_animals">
 <p class="article">
  There are three animals in the barn:
  <a class="animals life" href="http://example.com/animal1" id="animal1">
   Animal 1
  </a>
  :
  <span class="

In [8]:
## get title of page
soup.title

<title>The title is Demo for BeautifulSoup</title>

In [11]:
soup.span

<span class="cost">$10</span>

In [10]:
soup.p

<p class="article">There are three things to keep in mind:
<a class="plants life" href="http://example.com/plant1" id="plant1">Plant 1</a>: <span class="cost">$10</span>,
<a class="plants life" href="http://example.com/plant2" id="plant2">Plant 2</a>: <span class="cost">$20</span> and
<a class="plants life" href="http://example.com/plant3" id="plant3">Plant 3</a> <span class="cost">$30</span>;
</p>

In [9]:
soup.h1

<h1 class="title"><b>The title headline is Demo for BeautifulSoup</b></h1>

In [13]:
## What about the h1 tag with the class of title? 
## How can we have two titles?
soup("h1", class_="title")


[<h1 class="title"><b>The title headline is Demo for BeautifulSoup</b></h1>]

### string v. get_text()

In most cases, the final step in a scrape is to convert each result to a plain string. We don't want the HTML markup.

We can use ```.string``` or ```get_text()```.

- ```get_text()``` is more flexible because it accepts parameters that strip whitespace, specify a separator, etc.


In [15]:
## get text from soup
print(soup.get_text())


The title is Demo for BeautifulSoup

The title headline is Demo for BeautifulSoup

There are three things to keep in mind:
Plant 1: $10,
Plant 2: $20 and
Plant 3 $30;

Don't forget to water these 3 plants.


 There are three animals in the barn:
Animal 1: $500,
Animal 2: $600 and
Animal 3: $700;

Don't forget these feed these 3 animals.


Inanimate object 1
Inanimate object 2
Inanimate object 3




In [16]:
soup.get_text()

"\nThe title is Demo for BeautifulSoup\n\nThe title headline is Demo for BeautifulSoup\n\nThere are three things to keep in mind:\nPlant 1: $10,\nPlant 2: $20 and\nPlant 3 $30;\n\nDon't forget to water these 3 plants.\n\n\n There are three animals in the barn:\nAnimal 1: $500,\nAnimal 2: $600 and\nAnimal 3: $700;\n\nDon't forget these feed these 3 animals.\n\n\nInanimate object 1\nInanimate object 2\nInanimate object 3\n\n"

In [17]:
## get the type
print(type(soup.get_text()))

<class 'str'>


In [18]:
## .string only works on a tag that holds a single piece of text (directly, or inside one nested tag).
## The soup object has several children, so soup.string returns None.
soup.string

In [19]:
## TYPE
print(type(soup.string))

<class 'NoneType'>
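
Why does ```.string``` return ```None``` here? A small sketch, using a made-up two-tag document rather than our demo page, suggests the rule: ```.string``` needs a tag with exactly one child.

```python
from bs4 import BeautifulSoup

# Hypothetical mini document for illustration only
demo_soup = BeautifulSoup(
    "<title>Demo</title><p>Plant 1: <span>$10</span></p>", "html.parser"
)

# <title> has exactly one child (a string), so .string returns it
one_child = demo_soup.title.string

# <p> has two children (a string and a <span>), so .string is None
two_children = demo_soup.p.string

# the soup object itself has two children (<title> and <p>), so None again
whole_soup = demo_soup.string

print(one_child, two_children, whole_soup)  # Demo None None
```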


In [23]:
## return just a string of the tag:
soup.title.string

'The title is Demo for BeautifulSoup'

In [22]:
## get only title text and not html
soup.title.get_text()

'The title is Demo for BeautifulSoup'

In [24]:
## get all p tag text
soup.p.get_text()

'There are three things to keep in mind:\nPlant 1: $10,\nPlant 2: $20 and\nPlant 3 $30;\n'

In [25]:
## strip leading and trailing whitespace (including newlines) from each piece of text
soup.p.get_text(strip=True)

'There are three things to keep in mind:Plant 1:$10,Plant 2:$20andPlant 3$30;'

## Finding ```class```

### There are three ways, ranging from simplicity to precision.

```find()``` returns the first occurrence of the item you are searching for.


### 1. Simplicity

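Because ```find_all``` is one of the most popular functions in ```BeautifulSoup```, 
you don't even have to write it. Calling the soup object directly is a shortcut for ```find_all```, so it returns every match, not just the first.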

```soup("tag_name", "class_name")```
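
As a quick check of what the shorthand does, the sketch below compares it with the explicit calls. The three-paragraph `demo_html` string is invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two paragraphs share the class, one does not
demo_html = '<p class="article">One</p><p class="article">Two</p><p>Three</p>'
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Calling the soup object directly...
shorthand = demo_soup("p", "article")

# ...gives the same result as spelling out find_all
explicit = demo_soup.find_all("p", class_="article")

# find, by contrast, stops at the first match
first_only = demo_soup.find("p", class_="article")

print(shorthand == explicit)   # True
print(len(shorthand))          # 2
print(first_only.get_text())   # One
```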


In [26]:
## simple but without precision
# find all p tags with class "article" 

soup("p", "article")

[<p class="article">There are three things to keep in mind:
 <a class="plants life" href="http://example.com/plant1" id="plant1">Plant 1</a>: <span class="cost">$10</span>,
 <a class="plants life" href="http://example.com/plant2" id="plant2">Plant 2</a>: <span class="cost">$20</span> and
 <a class="plants life" href="http://example.com/plant3" id="plant3">Plant 3</a> <span class="cost">$30</span>;
 </p>,
 <p class="article"> There are three animals in the barn:
 <a class="animals life" href="http://example.com/animal1" id="animal1">Animal 1</a>: <span class="cost">$500</span>,
 <a class="animals life" href="http://example.com/animal2" id="animal2">Animal 2</a>: <span class="cost">$600</span> and
 <a class="animals life" href="http://example.com/animal3" id="animal3">Animal 3</a>: <span class="cost">$700</span>;
 </p>]

### 2. Clarity

We want to be clear what we are writing so all team members can more easily understand it later.

- Write ```find``` explicitly so readers know which function is being called.
- Use ```soup.find(class_="class_name")``` to be clear about which class we are looking for.
- ```class_``` is a BeautifulSoup keyword argument. Because ```class``` is a reserved word in Python (it is used to define classes), it can't be used as an argument name, so BeautifulSoup adds the trailing ```_``` to show we are referring to an ```HTML class```.
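
If the trailing underscore feels odd, the same search can also be written with an ```attrs``` dictionary. A minimal sketch, assuming a made-up one-line document:

```python
from bs4 import BeautifulSoup

# Hypothetical tag that carries two classes
demo_soup = BeautifulSoup('<p class="article intro">Hello</p>', "html.parser")

# Keyword form: class_ stands in for the HTML class attribute
by_keyword = demo_soup.find(class_="article")

# Dictionary form: the key is a plain string, so no underscore is needed
by_attrs = demo_soup.find(attrs={"class": "article"})

print(by_keyword == by_attrs)  # True
print(by_keyword["class"])     # ['article', 'intro']
```

Either form matches a tag when the class we ask for is one of its classes, which is why searching for ```"article"``` finds a tag whose class attribute is ```"article intro"```.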


In [30]:
soup.find(class_="article")

<p class="article">There are three things to keep in mind:
<a class="plants life" href="http://example.com/plant1" id="plant1">Plant 1</a>: <span class="cost">$10</span>,
<a class="plants life" href="http://example.com/plant2" id="plant2">Plant 2</a>: <span class="cost">$20</span> and
<a class="plants life" href="http://example.com/plant3" id="plant3">Plant 3</a> <span class="cost">$30</span>;
</p>

In [27]:
# find the first p tag with class "article"
soup.find("p", class_="article")

<p class="article">There are three things to keep in mind:
<a class="plants life" href="http://example.com/plant1" id="plant1">Plant 1</a>: <span class="cost">$10</span>,
<a class="plants life" href="http://example.com/plant2" id="plant2">Plant 2</a>: <span class="cost">$20</span> and
<a class="plants life" href="http://example.com/plant3" id="plant3">Plant 3</a> <span class="cost">$30</span>;
</p>

### 3. Precision

With ```soup.find(class_="article")``` alone, we could run into trouble if ```class="article"``` were applied to more than one kind of tag.

- Use the tag name to add precision.
- ```soup.find("tag_name", class_="class_name")```
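
Here is a small sketch of that trouble. The markup is hypothetical: a ```div``` and a ```p``` share the same class.

```python
from bs4 import BeautifulSoup

# Hypothetical markup where two different tags share one class
demo_html = '<div class="article">A div</div><p class="article">A paragraph</p>'
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Class only: we get whichever tag comes first in the document
any_tag = demo_soup.find(class_="article")

# Tag name plus class: we get the paragraph we actually wanted
p_only = demo_soup.find("p", class_="article")

print(any_tag.name)  # div
print(p_only.name)   # p
```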

In [None]:
## WITH PRECISION


## ```find_all``` tags, classes

- ```find_all``` is one of the *most widely* used BeautifulSoup commands.
- Unlike ```find```, it returns *ALL* occurrences of a class or tag.
- Remember ```find``` returns just the first occurrence.
- It returns all occurrences in a ```ResultSet``` object, which is similar to a ```list```.
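
The sketch below shows what "similar to a list" means in practice. The ```ul``` markup is made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical list markup for illustration
demo_html = "<ul><li>Plant 1</li><li>Plant 2</li><li>Plant 3</li></ul>"
demo_soup = BeautifulSoup(demo_html, "html.parser")

items = demo_soup.find_all("li")

# A ResultSet supports the usual list operations
count = len(items)                          # 3
second = items[1].get_text()                # 'Plant 2'
names = [li.get_text() for li in items]     # loop over every match

# When nothing matches, find_all gives an empty result and find gives None
no_tables = demo_soup.find_all("table")
no_table = demo_soup.find("table")

print(count, second, names)
print(no_tables, no_table)  # [] None
```

The last two lines matter when scraping real pages: an empty result from ```find_all``` loops zero times, while a ```None``` from ```find``` raises an error as soon as we call a method on it.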

In [32]:
## Return all p tags with class article
art = soup.find_all("p", class_="article")

In [38]:
type(art)

bs4.element.ResultSet

In [39]:
art

[<p class="article">There are three things to keep in mind:
 <a class="plants life" href="http://example.com/plant1" id="plant1">Plant 1</a>: <span class="cost">$10</span>,
 <a class="plants life" href="http://example.com/plant2" id="plant2">Plant 2</a>: <span class="cost">$20</span> and
 <a class="plants life" href="http://example.com/plant3" id="plant3">Plant 3</a> <span class="cost">$30</span>;
 </p>,
 <p class="article"> There are three animals in the barn:
 <a class="animals life" href="http://example.com/animal1" id="animal1">Animal 1</a>: <span class="cost">$500</span>,
 <a class="animals life" href="http://example.com/animal2" id="animal2">Animal 2</a>: <span class="cost">$600</span> and
 <a class="animals life" href="http://example.com/animal3" id="animal3">Animal 3</a>: <span class="cost">$700</span>;
 </p>]

In [41]:
for a in art:
    print(a)
    print("********************")

<p class="article">There are three things to keep in mind:
<a class="plants life" href="http://example.com/plant1" id="plant1">Plant 1</a>: <span class="cost">$10</span>,
<a class="plants life" href="http://example.com/plant2" id="plant2">Plant 2</a>: <span class="cost">$20</span> and
<a class="plants life" href="http://example.com/plant3" id="plant3">Plant 3</a> <span class="cost">$30</span>;
</p>
********************
<p class="article"> There are three animals in the barn:
<a class="animals life" href="http://example.com/animal1" id="animal1">Animal 1</a>: <span class="cost">$500</span>,
<a class="animals life" href="http://example.com/animal2" id="animal2">Animal 2</a>: <span class="cost">$600</span> and
<a class="animals life" href="http://example.com/animal3" id="animal3">Animal 3</a>: <span class="cost">$700</span>;
</p>
********************


In [37]:
## What if you want only the second group of life forms?
soup.find_all("p", class_="article")[1]

<p class="article"> There are three animals in the barn:
<a class="animals life" href="http://example.com/animal1" id="animal1">Animal 1</a>: <span class="cost">$500</span>,
<a class="animals life" href="http://example.com/animal2" id="animal2">Animal 2</a>: <span class="cost">$600</span> and
<a class="animals life" href="http://example.com/animal3" id="animal3">Animal 3</a>: <span class="cost">$700</span>;
</p>

In [42]:
## SEARCH BY ID for "animal1"
soup(id="animal1")

[<a class="animals life" href="http://example.com/animal1" id="animal1">Animal 1</a>]

In [43]:
## SEARCH BY ID for "plant1"
soup(id="plant1")

[<a class="plants life" href="http://example.com/plant1" id="plant1">Plant 1</a>]
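
Notice that both searches return a list holding one item, because calling the soup object is shorthand for ```find_all```. Since an ```id``` should be unique on a page, ```find``` is often the more convenient call. A minimal sketch with a made-up one-link document:

```python
from bs4 import BeautifulSoup

# Hypothetical single-link document
demo_soup = BeautifulSoup('<a id="plant1" href="/plant1">Plant 1</a>', "html.parser")

# Shorthand / find_all: a list, even for a single match
as_list = demo_soup(id="plant1")

# find: the tag itself, ready to use
as_tag = demo_soup.find(id="plant1")

# find returns None when the id does not exist
missing = demo_soup.find(id="plant9")

print(len(as_list))        # 1
print(as_tag.get_text())   # Plant 1
print(missing)             # None
```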

## Storing values

We haven't been saving values in memory. 

If we want to move beyond a demo, we need to start saving them.

In [45]:
## save all lifeforms in an object called lifeforms
lifeforms = soup.find_all("a", class_="life")
lifeforms

[<a class="plants life" href="http://example.com/plant1" id="plant1">Plant 1</a>,
 <a class="plants life" href="http://example.com/plant2" id="plant2">Plant 2</a>,
 <a class="plants life" href="http://example.com/plant3" id="plant3">Plant 3</a>,
 <a class="animals life" href="http://example.com/animal1" id="animal1">Animal 1</a>,
 <a class="animals life" href="http://example.com/animal2" id="animal2">Animal 2</a>,
 <a class="animals life" href="http://example.com/animal3" id="animal3">Animal 3</a>]

In [46]:
## what kind of object is it?
type(lifeforms)

bs4.element.ResultSet

### Print lifeforms. Does it look familiar?

In [47]:
## print lifeforms
print(lifeforms)

[<a class="plants life" href="http://example.com/plant1" id="plant1">Plant 1</a>, <a class="plants life" href="http://example.com/plant2" id="plant2">Plant 2</a>, <a class="plants life" href="http://example.com/plant3" id="plant3">Plant 3</a>, <a class="animals life" href="http://example.com/animal1" id="animal1">Animal 1</a>, <a class="animals life" href="http://example.com/animal2" id="animal2">Animal 2</a>, <a class="animals life" href="http://example.com/animal3" id="animal3">Animal 3</a>]


In [None]:
## This breaks!
## You can't just get the text for the lifeforms.
## Why? Because you can't call .get_text() on a <class 'bs4.element.ResultSet'>
# print(lifeforms.string)
# print(lifeforms.get_text())
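
To see the failure without stopping the notebook, we can catch the error. This is a sketch on a made-up two-link document; the variable names are for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link document
demo_soup = BeautifulSoup("<a>Plant 1</a><a>Plant 2</a>", "html.parser")
links = demo_soup.find_all("a")

# get_text() belongs to individual tags, not to the collection of tags
failed = False
try:
    links.get_text()
except AttributeError as error:
    failed = True
    print("AttributeError:", error)

# Calling it on each tag in turn works
texts = [link.get_text() for link in links]
print(texts)  # ['Plant 1', 'Plant 2']
```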

In [55]:
lifeforms

[<a class="plants life" href="http://example.com/plant1" id="plant1">Plant 1</a>,
 <a class="plants life" href="http://example.com/plant2" id="plant2">Plant 2</a>,
 <a class="plants life" href="http://example.com/plant3" id="plant3">Plant 3</a>,
 <a class="animals life" href="http://example.com/animal1" id="animal1">Animal 1</a>,
 <a class="animals life" href="http://example.com/animal2" id="animal2">Animal 2</a>,
 <a class="animals life" href="http://example.com/animal3" id="animal3">Animal 3</a>]

In [57]:
## just the text, no html
## Using for loop
lifeforms_list = []
for life in lifeforms:
    lifeforms_list.append(life.get_text())

In [58]:
lifeforms_list

['Plant 1', 'Plant 2', 'Plant 3', 'Animal 1', 'Animal 2', 'Animal 3']

In [63]:
## just the text, no html
## Using a list comprehension
## lifeforms_lc

lifeforms_lc = [life.get_text() for life in lifeforms]
lifeforms_lc

['Plant 1', 'Plant 2', 'Plant 3', 'Animal 1', 'Animal 2', 'Animal 3']

## Get the urls for each

In [67]:
## use for loop
all_urls_fl = []
for link in lifeforms:
#     print(link)
    url = link.get("href")
#     print(url)
    all_urls_fl.append(url)
    
all_urls_fl


['http://example.com/plant1',
 'http://example.com/plant2',
 'http://example.com/plant3',
 'http://example.com/animal1',
 'http://example.com/animal2',
 'http://example.com/animal3']

In [68]:
# using list comprehension

all_urls_lc = [link.get("href") for link in lifeforms]

all_urls_lc

['http://example.com/plant1',
 'http://example.com/plant2',
 'http://example.com/plant3',
 'http://example.com/animal1',
 'http://example.com/animal2',
 'http://example.com/animal3']
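
Why ```.get("href")``` rather than square brackets? Both read an attribute, but they behave differently when the attribute is missing. A sketch with a made-up document in which the second link has no ```href```:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <a> has no href attribute
demo_html = '<a href="http://example.com/plant1">Plant 1</a><a>No link</a>'
demo_soup = BeautifulSoup(demo_html, "html.parser")
links = demo_soup.find_all("a")

# .get() returns None for a missing attribute, so the loop keeps going
urls = [link.get("href") for link in links]
print(urls)  # ['http://example.com/plant1', None]

# Square brackets raise KeyError for a missing attribute
bracket_failed = False
try:
    links[1]["href"]
except KeyError:
    bracket_failed = True
print(bracket_failed)  # True
```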

## Cost

Let's grab the cost.

How do we target the cost?

In [71]:
print(soup.prettify())

<html>
 <head>
  <title>
   The title is Demo for BeautifulSoup
  </title>
 </head>
 <body>
  <h1 class="title">
   <b>
    The title headline is Demo for BeautifulSoup
   </b>
  </h1>
 </body>
</html>
<section class="main" id="all_plants">
 <p class="article">
  There are three things to keep in mind:
  <a class="plants life" href="http://example.com/plant1" id="plant1">
   Plant 1
  </a>
  :
  <span class="cost">
   $10
  </span>
  ,
  <a class="plants life" href="http://example.com/plant2" id="plant2">
   Plant 2
  </a>
  :
  <span class="cost">
   $20
  </span>
  and
  <a class="plants life" href="http://example.com/plant3" id="plant3">
   Plant 3
  </a>
  <span class="cost">
   $30
  </span>
  ;
 </p>
 <strong>
  Don't forget to water these 3 plants.
 </strong>
</section>
<section class="main" id="all_animals">
 <p class="article">
  There are three animals in the barn:
  <a class="animals life" href="http://example.com/animal1" id="animal1">
   Animal 1
  </a>
  :
  <span class="

In [72]:
## A wide target:
cost = soup.find_all("span")
cost

[<span class="cost">$10</span>,
 <span class="cost">$20</span>,
 <span class="cost">$30</span>,
 <span class="cost">$500</span>,
 <span class="cost">$600</span>,
 <span class="cost">$700</span>,
 <span>Inanimate object 1</span>,
 <span>Inanimate object 2</span>,
 <span>Inanimate object 3</span>]

In [73]:
## narrow the target
cost = soup.find_all("span",  class_="cost")
cost

[<span class="cost">$10</span>,
 <span class="cost">$20</span>,
 <span class="cost">$30</span>,
 <span class="cost">$500</span>,
 <span class="cost">$600</span>,
 <span class="cost">$700</span>]

In [None]:
## using for loop
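
One possible way to fill in this cell is sketched below. It runs on a short made-up snippet so that it stands alone; in the notebook the same loop would run over the ```cost``` object created above, and ```cost_list_fl``` is a hypothetical name chosen to mirror ```all_urls_fl```.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the spans on our demo page
demo_html = '<span class="cost">$10</span><span class="cost">$20</span><span>Other</span>'
demo_soup = BeautifulSoup(demo_html, "html.parser")
cost_tags = demo_soup.find_all("span", class_="cost")

# Start with an empty list and append the text of each matching span
cost_list_fl = []
for amount in cost_tags:
    cost_list_fl.append(amount.get_text())

print(cost_list_fl)  # ['$10', '$20']
```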


In [74]:
## using list comprehension
cost_list_lc = [amount.get_text() for amount in cost]
cost_list_lc

['$10', '$20', '$30', '$500', '$600', '$700']

In [75]:
lifeforms_lc

['Plant 1', 'Plant 2', 'Plant 3', 'Animal 1', 'Animal 2', 'Animal 3']

In [76]:
all_urls_lc

['http://example.com/plant1',
 'http://example.com/plant2',
 'http://example.com/plant3',
 'http://example.com/animal1',
 'http://example.com/animal2',
 'http://example.com/animal3']

## Prepare to Export

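You now have three lists: one holds the name of each lifeform, one holds its cost, and one holds the related URL.

Let's create a dict called ```life_dict``` for each lifeform and collect the dicts in a list.

The keys are ```life_form```, ```cost``` and ```link```; the values come from the three lists.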


In [78]:
life_dict_list = []
## price is the loop variable so we don't overwrite the cost object from earlier
for (name, price, url) in zip(lifeforms_lc, cost_list_lc, all_urls_lc):
    life_dict = {"life_form": name, "cost": price, "link": url}
    life_dict_list.append(life_dict)

life_dict_list

[{'life_form': 'Plant 1', 'cost': '$10', 'link': 'http://example.com/plant1'},
 {'life_form': 'Plant 2', 'cost': '$20', 'link': 'http://example.com/plant2'},
 {'life_form': 'Plant 3', 'cost': '$30', 'link': 'http://example.com/plant3'},
 {'life_form': 'Animal 1',
  'cost': '$500',
  'link': 'http://example.com/animal1'},
 {'life_form': 'Animal 2',
  'cost': '$600',
  'link': 'http://example.com/animal2'},
 {'life_form': 'Animal 3',
  'cost': '$700',
  'link': 'http://example.com/animal3'}]

## Export as CSV

We'll use Pandas to export our data to an external file.

We'll cover this in more detail soon, but for now here it is:

In [79]:
## import pandas
import pandas as pd

In [80]:
## use pandas to write to csv file
filename = "test.csv" ## the name of the file we will write
df = pd.DataFrame(life_dict_list) ## turn our list of dicts into a dataframe, which we'll call df
df.to_csv(filename, encoding='utf-8', index=False) ## export to csv, UTF-8 encoded, without the index column

print(f"{filename} is in your project folder!") ## a print out that tells us the file is ready

test.csv is in your project folder!
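
If Pandas is not installed, Python's built-in ```csv``` module can write the same kind of file. The sketch below is an alternative to the Pandas cell above, not what this notebook uses. It writes to an in-memory buffer instead of a real file, and the two ```rows``` are copied from our results for illustration.

```python
import csv
import io

# Two of the dicts from life_dict_list, repeated here so the example stands alone
rows = [
    {"life_form": "Plant 1", "cost": "$10", "link": "http://example.com/plant1"},
    {"life_form": "Animal 1", "cost": "$500", "link": "http://example.com/animal1"},
]

# An in-memory text buffer stands in for a file opened for writing
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["life_form", "cost", "link"])
writer.writeheader()     # first line: the column names
writer.writerows(rows)   # one line per dict

# Read the text back to confirm nothing was lost
buffer.seek(0)
round_trip = list(csv.DictReader(buffer))
print(round_trip == rows)  # True
```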


In [81]:
df

| | life_form | cost | link |
|---|---|---|---|
| 0 | Plant 1 | $10 | http://example.com/plant1 |
| 1 | Plant 2 | $20 | http://example.com/plant2 |
| 2 | Plant 3 | $30 | http://example.com/plant3 |
| 3 | Animal 1 | $500 | http://example.com/animal1 |
| 4 | Animal 2 | $600 | http://example.com/animal2 |
| 5 | Animal 3 | $700 | http://example.com/animal3 |


In [82]:
df1 = pd.DataFrame(list(zip(lifeforms_lc, cost_list_lc,all_urls_lc)), 
               columns =['life_form', 'cost', 'link'])

In [83]:
df1

| | life_form | cost | link |
|---|---|---|---|
| 0 | Plant 1 | $10 | http://example.com/plant1 |
| 1 | Plant 2 | $20 | http://example.com/plant2 |
| 2 | Plant 3 | $30 | http://example.com/plant3 |
| 3 | Animal 1 | $500 | http://example.com/animal1 |
| 4 | Animal 2 | $600 | http://example.com/animal2 |
| 5 | Animal 3 | $700 | http://example.com/animal3 |


# BeautifulSoup

We covered some basic BeautifulSoup functionality:

- Remember ```soup``` is just the variable name we use to store the parsed webpage or file. We could call it anything we want.
- Searching by ```tags``` like ```title```, ```h1```, ```span``` etc.
- Searching by ```class``` or ```id```
- Finding all occurrences of an item using ```find_all()```
- Finding the first occurrence of an item using ```find()```
- Removing the html and returning just the string by using ```.string``` or ```get_text()```
- Grabbing just the URL(s) using ```get("href")```

These are the most frequently used BeautifulSoup functions. You can [find many more](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#) in the documentation. 
