# Scraping with BeautifulSoup

In [None]:
html_doc = """
<html><head><title>The title is Demo for BeautifulSoup</title></head>
<body>
<h1 class="title"><b>The title headline is Demo for BeautifulSoup</b></h1>

<section class="main" id="all_plants">

<p class="article">There are three things to keep in mind:
<a href="http://example.com/plant1" class="plants life" id="plant1">Plant 1</a>: <span class="cost">$10</span>,
<a href="http://example.com/plant2" class="plants life" id="plant2">Plant 2</a>: <span class="cost">$20</span> and
<a href="http://example.com/plant3" class="plants life" id="plant3">Plant 3</a>: <span class="cost">$30</span>;
</p>
<strong>Don't forget to water these 3 plants.</strong>


</section>

<section class="main" id="all_animals">
<p class="article"> There are three animals in the barn:
<a href="http://example.com/animal1" class="animals life" id="animal1">Animal 1</a>: <span class="cost">$500</span>,
<a href="http://example.com/animal2" class="animals life" id="animal2">Animal 2</a>: <span class="cost">$600</span> and
<a href="http://example.com/animal3" class="animals life" id="animal3">Animal 3</a>: <span class="cost">$700</span>;
</p>
<strong>Don't forget to feed these 3 animals.</strong>
</section>


<section>
<p><span>Inanimate object 1</span></p>
<p><span>Inanimate object 2</span></p>
<p><span>Inanimate object 3</span></p>
</section>
</body></html>
"""

print(html_doc)

In [None]:
## import library

## Create a BeautifulSoup object
<img src="../support_files/bs-soup.png">
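
As a minimal sketch of this step, the cell below builds a soup object from a tiny stand-in document. The name `sample_html` and its contents are made up for illustration; the notebook's `html_doc` would be passed in the same way. It also shows why a page can have "two titles": one is the `<title>` tag in the head, the other is an `<h1>` that happens to have the class `title`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for html_doc, kept short on purpose.
sample_html = (
    "<html><head><title>Demo</title></head>"
    "<body><h1 class='title'>Headline</h1></body></html>"
)

# "html.parser" is the parser that ships with Python, so nothing extra to install.
soup = BeautifulSoup(sample_html, "html.parser")

page_title = soup.title      # the <title> tag inside <head>
headline = soup.find("h1")   # the <h1> tag inside <body>

print(type(soup))
print(page_title.string)
print(headline.string)
```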

In [None]:
## we add name of our file


In [None]:
## What type of file is it?


In [None]:
## get title of page


In [None]:
## What about the h1 tag with the class of title? 
## How can we have two titles?



### string v. get_text()

In most cases, the final step of a scrape is to convert what we found to plain text. We don't want all the html.

We can use ```.string``` or ```get_text()```.

- ```get_text()``` is more flexible because it accepts parameters such as ```separator``` and ```strip```, and it collects the text of nested tags too.
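
Here is a small sketch of the difference, using a made-up one-line snippet (`sample_html` is an assumption, not part of the notebook's `html_doc`). It shows `.string` working on a simple tag, returning `None` on a tag with nested content, and `get_text()` handling both.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a <p> that holds text plus a nested <span>.
sample_html = '<p class="article"> Plant 1: <span class="cost">$10</span> </p>'
soup = BeautifulSoup(sample_html, "html.parser")

p_tag = soup.find("p")
span_tag = soup.find("span")

# .string works when a tag holds exactly one piece of text.
span_text = span_tag.string

# The <p> holds several children, so .string gives None.
p_string = p_tag.string

# get_text() gathers the text of the tag and everything nested in it.
raw_text = p_tag.get_text()

# separator joins the pieces; strip=True trims whitespace from each piece.
clean_text = p_tag.get_text(separator=" ", strip=True)

print(repr(span_text), repr(p_string))
print(repr(raw_text))
print(repr(clean_text))
```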


In [None]:
## get text from soup


In [None]:
## get the type


In [None]:
## .string only works on a tag that holds a single piece of text, not on the entire soup object.
## soup.string returns None


In [None]:
## TYPE


In [None]:
## return just a string of the tag:


In [None]:
## get only title text and not html


In [None]:
## get all p tag text


In [None]:
## get rid of weird characters


## Finding ```class```

### There are three ways ranging from simplicity to precision.

```find()``` returns the first occurrence of any item you are searching for.


### 1. Simplicity

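Because ```find_all``` is one of the most popular functions in ```BeautifulSoup```, 
you don't even have to write it. Calling the soup object directly is shorthand for ```find_all()```, so it returns every match, not just the first.

```soup("tag_name", "class_name")```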
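
A quick sketch of the shortcut, on a made-up snippet (`sample_html` is hypothetical). Calling the soup object like a function is BeautifulSoup's shorthand for `find_all()`, and a plain string in the second position is treated as a class name.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two "article" paragraphs and one "note".
sample_html = """
<p class="article">First article</p>
<p class="note">A note</p>
<p class="article">Second article</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Shortcut: call the soup object directly.
shortcut = soup("p", "article")

# The same search written out in full.
explicit = soup.find_all("p", "article")

shortcut_text = [tag.get_text() for tag in shortcut]
print(shortcut_text)
print(shortcut == explicit)
```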


In [None]:
## simple but without precision
# find all p tags with class "article" 


### 2. Clarity

We want to be clear what we are writing so all team members can more easily understand it later.

- Use ```find``` so readers can see which function is being called.
- Use ```soup.find(class_="class_name")``` to be clear which class we are looking for.
- ```class_``` is the keyword argument BeautifulSoup provides for searching by ```HTML class```. Because ```class``` is a Python reserved word (it defines a new type of object), BeautifulSoup adds the ```_``` so the argument name is valid Python.
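
A minimal sketch of the explicit form, again on a hypothetical `sample_html`. Notice that searching by class alone returns whichever tag comes first, here a `<div>` rather than a `<p>`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two different tags share the class "article".
sample_html = """
<div class="article">A div</div>
<p class="article">A paragraph</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# find() returns only the first tag whose class is "article".
first_article = soup.find(class_="article")

print(first_article.name)
print(first_article.get_text())
```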


In [None]:
# find the first tag with class "article" 


### 3. Precision

In the previous example, we could run into trouble if ```class="article"``` applied to more than one kind of tag.

- Use the tag name to add precision.
- ```soup.find("tag_name", class_="class_name")```
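
To illustrate, this sketch reuses a made-up snippet where a `<div>` and a `<p>` share the same class (`sample_html` is an assumption). Adding the tag name makes the search land on the paragraph.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two different tags share the class "article".
sample_html = """
<div class="article">A div</div>
<p class="article">A paragraph</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Class only: the first match is the <div>.
by_class_only = soup.find(class_="article")

# Tag name plus class: the first match is the <p>.
by_tag_and_class = soup.find("p", class_="article")

print(by_class_only.name, by_tag_and_class.name)
```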

In [None]:
## WITH PRECISION


## ```find_all``` tags, classes

- ```find_all``` is *one of the most widely* used BeautifulSoup commands.
- Unlike ```find```, it returns *ALL* occurrences of a class or tag.
- Remember ```find``` returns just the first occurrence.
- It returns all occurrences in a ```ResultSet``` object, which behaves like a ```list```.
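
The points above can be sketched on a short made-up snippet (`sample_html` is hypothetical, loosely modeled on the links in `html_doc`). The result supports `len()` and indexing, just as a list does.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: three links that all carry the class "life".
sample_html = """
<a class="plants life" id="plant1">Plant 1</a>
<a class="animals life" id="animal1">Animal 1</a>
<a class="animals life" id="animal2">Animal 2</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Every <a> tag whose class list includes "life".
all_life = soup.find_all("a", class_="life")

# Only the tags whose class list includes "animals".
animals_only = soup.find_all("a", class_="animals")

result_type = type(all_life).__name__
count = len(all_life)
second_item = all_life[1]

print(result_type, count, second_item["id"], len(animals_only))
```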

In [None]:
## Return all p tags with class article


In [None]:
## What if you want only the second group of life forms?


In [None]:
## SEARCH BY ID for "animal1"


In [None]:
## SEARCH BY ID for "plant1"
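
One possible way to search by ```id```, sketched on a made-up snippet (`sample_html` is an assumption). Because an id should be unique on a page, `find()` is usually enough. Finding a section by id first is also one way to narrow a later `find_all()` to a single group.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two sections, each identified by an id.
sample_html = """
<section id="all_plants">
<a href="http://example.com/plant1" class="plants life" id="plant1">Plant 1</a>
</section>
<section id="all_animals">
<a href="http://example.com/animal1" class="animals life" id="animal1">Animal 1</a>
<a href="http://example.com/animal2" class="animals life" id="animal2">Animal 2</a>
</section>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Search by id alone, or by tag name plus id.
animal1 = soup.find(id="animal1")
plant1 = soup.find("a", id="plant1")

# Narrow to one section, then search only inside it.
animal_section = soup.find("section", id="all_animals")
animal_names = [tag.get_text() for tag in animal_section.find_all("a")]

print(animal1.get_text(), plant1["href"], animal_names)
```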


## Storing values

We haven't been saving any values in memory. 

If we want to move beyond a demo, we need to start saving them.
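
As a rough sketch of what saving looks like, the cell below stores a `find_all()` result in a variable and then pulls the text out of each item. The snippet `sample_html` is made up; the variable name `lifeforms` follows the cells below. It also shows that the text methods belong to individual tags, not to the whole result.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two links.
sample_html = """
<a class="plants life" id="plant1">Plant 1</a>
<a class="animals life" id="animal1">Animal 1</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Save the result so we can reuse it.
lifeforms = soup.find_all("a", class_="life")

# Calling get_text() on the whole result fails: it is a collection, not a tag.
try:
    lifeforms.get_text()
    call_worked = True
except AttributeError:
    call_worked = False

# Instead, visit each tag with a for loop...
names_from_loop = []
for tag in lifeforms:
    names_from_loop.append(tag.get_text())

# ...or do the same thing with a list comprehension.
names_from_comprehension = [tag.get_text() for tag in lifeforms]

print(call_worked, names_from_loop, names_from_comprehension)
```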

In [None]:
## save all lifeforms in an object called lifeforms


In [None]:
## what kind of object is it?


### Print lifeforms. Does it look familiar?

In [None]:
## print lifeforms


In [None]:
## This breaks!
## You can't just get the text for the lifeforms.
## Why? Because you can't call .string or .get_text() on a <class 'bs4.element.ResultSet'>
# print(lifeforms.string)


In [None]:
## just the text, no html
## Using for loop
   

In [None]:
## just the text, no html
## Using a list comprehension



## Get the urls for each
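
A minimal sketch of pulling the link targets, using a made-up snippet (`sample_html` and the list name `urls` are assumptions). Each tag's attributes can be read with ```get("href")```, which returns `None` if the attribute is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two links.
sample_html = """
<a href="http://example.com/plant1" class="plants life" id="plant1">Plant 1</a>
<a href="http://example.com/animal1" class="animals life" id="animal1">Animal 1</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")
lifeforms = soup.find_all("a", class_="life")

# For loop version.
urls_from_loop = []
for tag in lifeforms:
    urls_from_loop.append(tag.get("href"))

# List comprehension version.
urls = [tag.get("href") for tag in lifeforms]

print(urls)
```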

In [None]:
## use for loop
 
    


In [None]:
# using list comprehension



## Cost

Let's grab the cost

How do we target the cost?
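
One way to think about it, sketched on a made-up snippet (`sample_html` is hypothetical, modeled on the two sections in `html_doc`): a wide target grabs every ```span``` with the class ```cost```, while a narrow target first picks one section by id and searches only inside it.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: costs appear in two different sections.
sample_html = """
<section id="all_plants">
<p class="article"><a id="plant1">Plant 1</a>: <span class="cost">$10</span>,
<a id="plant2">Plant 2</a>: <span class="cost">$20</span></p>
</section>
<section id="all_animals">
<p class="article"><a id="animal1">Animal 1</a>: <span class="cost">$500</span></p>
</section>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Wide target: every cost on the page.
all_costs = [span.get_text() for span in soup.find_all("span", class_="cost")]

# Narrow target: only the costs inside the plants section.
plant_section = soup.find(id="all_plants")
plant_costs = [
    span.get_text() for span in plant_section.find_all("span", class_="cost")
]

print(all_costs)
print(plant_costs)
```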

In [None]:
## A wide target:


In [None]:
## narrow the target


In [None]:
## using for loop


In [None]:
## using list comprehension


## Prepare to Export

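You now have one list that holds the name of each lifeform and another that holds the related URL.

Let's create a dict called ```life_dict``` for each lifeform.

The keys are ```name``` and ```url```, and the values are that lifeform's name and URL. We collect the dicts in a list called ```life_dict_list```, which the export step uses.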
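
A possible sketch of this step, assuming the two lists are called `names` and `urls` (those names and the sample values are made up for illustration). It pairs the lists up with `zip()` and stores one dict per lifeform in `life_dict_list`, the name the export cell expects.

```python
# Hypothetical lists standing in for the ones built earlier.
names = ["Plant 1", "Animal 1"]
urls = ["http://example.com/plant1", "http://example.com/animal1"]

# Build one dict per lifeform and collect them in a list.
life_dict_list = []
for name, url in zip(names, urls):
    life_dict = {"name": name, "url": url}
    life_dict_list.append(life_dict)

print(life_dict_list)
```

A list of dicts with the same keys is a shape that ```pd.DataFrame``` accepts directly, with each key becoming a column.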


## Export as CSV

We'll use Pandas to export our data to an external file.

We'll cover this in more detail soon, but for now here it is:

In [None]:
## import pandas
import pandas as pd

In [None]:
## use pandas to write to csv file
filename = "test.csv" ## what our file name is
df = pd.DataFrame(life_dict_list) ## we turn our list of life dicts into a dataframe, which we'll call df
df.to_csv(filename, encoding='utf-8', index=False) ## export to csv with utf-8 encoding, leaving out the index column

print(f"{filename} is in your project folder!") ## a print out that tells us the file is ready

# BeautifulSoup

We covered some basic BeautifulSoup functionality:

- Remember ```soup``` is just a term we use to store an entire webpage or file. We could call it anything we want.
- Searching by ```tags``` like ```title```, ```h1```, ```span``` etc.
- Searching by ```class``` or ```id```
- Finding all occurrences of an item using ```find_all()```
- Finding the first occurrence of an item using ```find()```
- Removing the html and returning just the string by using ```.string``` or ```get_text()```
- Grabbing just the URL(s) using ```get("href")```

These are the most frequently used BeautifulSoup functions. You can [find many more](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#) in the documentation. 
