In [None]:
pip install icecream

### We'll learn some basic scraping techniques using this mock <a href="https://sandeepmj.github.io/scrape-example-page/demo-text.html">demo page</a>.

The webpage is ```https://sandeepmj.github.io/scrape-example-page/demo-text.html```

### All web scraping requires a little sleuthing:

* Where and how is the content held on the page?
* How can we access it?
* Is there a pattern?
* Is there anything that breaks the pattern?

In [None]:
## import libraries
from bs4 import BeautifulSoup ## package to parse HTML and XML
from icecream import ic ## for debugging
import requests ## one of the most widely downloaded packages - captures content from web


In [None]:
## Requesting web content

## scrape the website url


In [None]:
## did it work?


In [None]:
## what type of object did we capture?


## Pull out what we want using

- ```response.text``` for string content like HTML, XML etc.
- ```response.content``` for binary content like PDFs, images, etc.
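
Here is a minimal sketch of the request step. The helper name `fetch_page` and the 10-second timeout are assumptions for illustration, not part of the lesson. It only wraps `requests.get` and raises an error if the server answers with a failing status code.

```python
import requests


def fetch_page(url, timeout=10):
    """Request a url and return the response, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # stop early on 4xx/5xx status codes
    return response


# Example use (needs a network connection):
# response = fetch_page("https://sandeepmj.github.io/scrape-example-page/demo-text.html")
# html = response.text      # str, ready to hand to BeautifulSoup
# raw = response.content    # bytes, for PDFs, images, etc.
```

For an HTML page like ours, ```response.text``` is the one to pass along to BeautifulSoup.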

In [None]:
## what object does it return


## Create a BeautifulSoup object
<img src="img/bs-soup.png">
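
As a rough sketch, here is how a soup object can be built from a string of HTML. The HTML below is made up for illustration and is not the demo page's actual markup. It also hints at how a page can seem to have "two titles": one is the ```<title>``` tag, the other is an ```h1``` that merely carries a class named ```title```.

```python
from bs4 import BeautifulSoup

# made-up HTML for illustration; not the demo page's actual markup
html = """
<html>
  <head><title>Demo page</title></head>
  <body><h1 class="title">Life forms</h1></body>
</html>
"""

# parse the string with Python's built-in HTML parser
soup = BeautifulSoup(html, "html.parser")

print(type(soup))                        # the parsed page is a BeautifulSoup object
print(soup.prettify())                   # indented view of the parsed HTML
print(soup.title)                        # the <title> tag inside <head>
print(soup.find("h1", class_="title"))   # an <h1> whose class happens to be "title"
```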

In [None]:
## we add name of our file


In [None]:
## prettify our printout


In [None]:
## What type of file is it?


In [None]:
## get title of page


In [None]:
## What about the h1 tag with the class of title? 
## How can we have two titles?



### string v. get_text()

In most cases, our final step in a scrape is to convert everything to a string. We don't want all the html. 

We can use ```.string``` or ```get_text()```.

- ```get_text()``` is far more powerful because you can add parameters to strip, specify separators, etc.

I **only** use ```get_text()```.
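
A small sketch of the difference, using made-up HTML (an assumption, not the demo page's markup). ```.string``` only works when a tag holds a single piece of text, while ```get_text()``` gathers all nested text and accepts ```strip``` and ```separator``` parameters.

```python
from bs4 import BeautifulSoup

# made-up HTML: the <p> tag mixes plain text with a nested <b> tag
html = "<div><h1 class='title'>Life forms</h1><p> A <b>bold</b> claim </p></div>"
soup = BeautifulSoup(html, "html.parser")

h1_string = soup.find("h1").string   # works: the h1 holds one piece of text
p_string = soup.find("p").string     # None: the p holds several children
p_text = soup.find("p").get_text(separator=" ", strip=True)  # all text, cleaned up

print(h1_string)
print(p_string)
print(p_text)
```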


In [None]:
## return just a string of the tag:


In [None]:
## get only title text and not html


In [None]:
## use string on soup (returns None)


In [None]:
## get text from soup


In [None]:
## get rid of weird characters


In [None]:
## get p tag text


# Targeting content



## Searching for IDs

```soup(id="ID_name")```
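
A quick sketch with made-up HTML (the ids match the ones we search for below, but the content is an assumption). Calling ```soup(...)``` directly is shorthand for ```find_all```, so it returns a list-like result, while ```find``` returns a single tag.

```python
from bs4 import BeautifulSoup

# made-up HTML; the real page's ids may hold different content
html = '<p id="animal1">Aardvark</p><p id="plant1">Fern</p>'
soup = BeautifulSoup(html, "html.parser")

matches = soup(id="animal1")     # shorthand for soup.find_all(id="animal1")
first = soup.find(id="plant1")   # a single tag, or None if nothing matches

print(matches)
print(first.get_text())
```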

In [None]:
## SEARCH BY ID for "animal1"


In [None]:
## SEARCH BY ID for "plant1"


## Finding ```class```

Let's say we want to find the ```p tag``` content for the ```article class``` 

```find()``` returns the first occurrence of any item you are searching for.

There are three ways to target our content, but Method 3 is the most precise.




In [None]:
## a wide net is not best


### Method 1. Target the tag only.

```soup.find("tag_name")```


In [None]:
## simple but without precision
## still too wide a net


### Method 2. Target the class only




- Use ```soup.find(class_="class_name")``` to be clear what class we are looking for.
- ```class_``` is the keyword argument BeautifulSoup provides for searching by ```HTML class```. Because ```class``` is a Python reserved word (it is used to define classes), it can't be used as an argument name, so BeautifulSoup adds the trailing ```_```.


In [None]:
# find the first tag with the class "article"
## this is still too wide


### Method 3. Precision, clarity and simplicity

In the previous example, we could run into trouble if ```class = "article"``` applied to multiple tags.

- Use the ```tag``` and the ```class``` to add precision, clarity and simplicity.

```soup.find("tag_name", class_="class_name")```
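
The three methods can be compared side by side in a small sketch. The HTML is made up so that the class ```article``` is shared by two different tags; the demo page's real markup may differ.

```python
from bs4 import BeautifulSoup

# made-up HTML in which the class "article" is shared by two different tags
html = """
<p>Intro paragraph</p>
<h2 class="article">Article heading</h2>
<p class="article">Article paragraph</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.find("p")                      # Method 1: first <p>, whatever its class
by_class = soup.find(class_="article")       # Method 2: first tag with the class, whatever the tag
by_both = soup.find("p", class_="article")   # Method 3: first <p> that has the class

print(by_tag.get_text())
print(by_class.get_text())
print(by_both.get_text())
```

Only the third call lands on the paragraph we actually wanted.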

In [None]:
# find the first p tag with the class "article"


## ```find_all``` tags, classes

- ```find_all``` is **one of the most widely** used BeautifulSoup commands.
- Unlike ```find``` it returns **ALL** occurrences of a class or tag.
- Remember ```find``` returns just the first occurrence.
- ```soup.find_all("tag_name", class_="class_name")```
- It returns all occurrences in a **```bs4.element.ResultSet```** object that is similar to a **```list```**.
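
A short sketch of what ```find_all``` hands back, again using made-up HTML as a stand-in for the demo page.

```python
from bs4 import BeautifulSoup

# made-up HTML: two paragraphs share the class, one has no class
html = """
<p class="article">First</p>
<p class="article">Second</p>
<p>No class</p>
"""
soup = BeautifulSoup(html, "html.parser")

articles = soup.find_all("p", class_="article")   # every <p> with the class
all_paragraphs = soup.find_all("p")               # every <p>, class or not

print(type(articles))        # a ResultSet, which behaves like a list
print(len(articles))         # count the matches just like a list
print(articles[0])           # index into it just like a list
print(len(all_paragraphs))
```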

In [None]:
## Return all p tag content with the class "article"


In [None]:
## what type of object is returned


In [None]:
## Return all content in the sections with the main class


In [None]:
## how many items are in this object


In [None]:
## how many items are there if we targeted only the tag "section"


### Find all life forms on the page

In [None]:
## code it here


## The old ways

Earlier versions of BeautifulSoup did not use the ```class_``` notation. They used:

```soup.find_all("tag", {"class": "class_name"})```

and the equivalent form with the keyword spelled out:

```soup.find_all("tag", attrs={"class": "class_name"})```

FYI, since you might still encounter these in older code and online answers.

To recap, the most current/modern way is:

```soup.find_all("tag_name", class_="class_name")```
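
As a quick check, this sketch runs all three spellings against the same made-up HTML (an assumption for illustration) and shows that they return the same matches.

```python
from bs4 import BeautifulSoup

# made-up HTML for illustration
html = '<p class="article">One</p><p class="other">Two</p>'
soup = BeautifulSoup(html, "html.parser")

modern = soup.find_all("p", class_="article")             # current notation
positional = soup.find_all("p", {"class": "article"})     # dict as second argument
keyword = soup.find_all("p", attrs={"class": "article"})  # dict passed as attrs

for result in (modern, positional, keyword):
    print([tag.get_text() for tag in result])
```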

# Excluding classes

Most modern sites have tags that include multiple classes. 

What if you want to target a tag that has a single class, but that class also appears in tags with additional classes that hold other types of content?

For example, target the tags whose only class is ```animals```, skipping those that also have the ```life``` class.

In this case we use ```.select``` with an attribute selector, which matches only tags whose ```class``` attribute is exactly that value.

```soup.select('[class="class_name"]')```
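
A sketch of the difference, using made-up HTML with the ```animals``` and ```life``` classes mentioned above (the tags and text are assumptions).

```python
from bs4 import BeautifulSoup

# made-up HTML: one tag has only "animals", the other has "animals" and "life"
html = """
<div class="animals">Only animals</div>
<div class="animals life">Animals and life</div>
"""
soup = BeautifulSoup(html, "html.parser")

loose = soup.find_all(class_="animals")    # any tag that includes the class
exact = soup.select('[class="animals"]')   # only tags whose class is exactly "animals"

print(len(loose))
print(len(exact))
print(exact[0].get_text())
```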


In [None]:
'''
If we use find_all to look for the class animals,
it returns every tag with the animals class,
including those that also have the life class.
'''



In [None]:
## write code here



## Storing values

We haven't been saving any values in memory.


In [None]:
## Again, save all lifeforms in an object called lifeforms


In [None]:
## what kind of object is it?


### Print lifeforms. Does it look familiar?

In [None]:
## print lifeforms


In [None]:
## print it out with a break between each
for life in lifeforms:
    print(life)
    print("************")

### You can't just get the text for the lifeforms.
### Why? You can't call ```.get_text()``` on a ```<class 'bs4.element.ResultSet'>``` object.


In [None]:
## try it


## Instead, iterate through and work on each item in the list, which in this case is a ```<class 'bs4.element.Tag'>```
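
A minimal sketch of the problem and the fix. The HTML, including the ```life``` class on ```a``` tags, is an assumption and may not match the demo page.

```python
from bs4 import BeautifulSoup

# made-up HTML standing in for the lifeforms on the page
html = '<a class="life" href="/cat.html">Cat</a><a class="life" href="/oak.html">Oak</a>'
soup = BeautifulSoup(html, "html.parser")

lifeforms = soup.find_all("a", class_="life")

# calling get_text() on the whole ResultSet fails
try:
    lifeforms.get_text()
except AttributeError as error:
    print("ResultSet has no get_text():", error)

# using a for loop: each item is a Tag, which does have get_text()
names_loop = []
for life in lifeforms:
    names_loop.append(life.get_text())

# using list comprehension: same result in one line
names = [life.get_text() for life in lifeforms]
print(names)
```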

In [None]:
## see type of object


In [None]:
## just the text, no html
## Using for loop
  

In [None]:
## just the text, no html
## Using list comprehension



## Get the urls for each
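
The same pattern works for links. In this sketch the HTML and its ```href``` values are made up; ```.get("href")``` reads the attribute from each tag.

```python
from bs4 import BeautifulSoup

# made-up HTML standing in for the lifeform links
html = '<a class="life" href="/cat.html">Cat</a><a class="life" href="/oak.html">Oak</a>'
soup = BeautifulSoup(html, "html.parser")

lifeforms = soup.find_all("a", class_="life")

# pull the href attribute out of each tag
urls = [life.get("href") for life in lifeforms]
print(urls)
```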

In [None]:
## use for loop

    


In [None]:
# using list comprehension



## Cost

Let's grab the cost

How do we target the cost?

In [None]:
## A wide target:


In [None]:
## narrow the target


In [None]:
## using for loop


In [None]:
## using list comprehension


In [None]:
## bring in your function to clean string values
def clean_numbers(some_string_number):
  '''
  Enter a single number as a string, integer or float.
  Strings may include "$" and "," (for example "$1,200").
  Returns the value rounded to an integer.
  '''
  if isinstance(some_string_number, str): 
    ## strip the currency symbol and thousands separators before converting
    amount = round(float(some_string_number.replace("$","").replace(",","")))

  else:
    ## already numeric, so just round it
    amount = round(float(some_string_number))

  return amount
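
As a usage sketch, the block below repeats a compact version of the function so it runs on its own, then applies it to a list with a list comprehension. The cost values are made up for illustration and are not taken from the demo page.

```python
def clean_numbers(some_string_number):
    """Convert a string, integer or float to a rounded integer."""
    if isinstance(some_string_number, str):
        # strip the currency symbol and thousands separators before converting
        return round(float(some_string_number.replace("$", "").replace(",", "")))
    return round(float(some_string_number))


# made-up values: two scraped strings and one float
raw_costs = ["$1,200.75", "$45", 300.2]

# the function takes one value at a time, so loop over the list
final_costs = [clean_numbers(cost) for cost in raw_costs]
print(final_costs)
```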

In [None]:
## final cost


## Prepare to Export

You now have one list that holds the name of the lifeform and another that holds the related URL.

Let's create a dict called ```life_dict```.

The keys are ```name``` and ```url```; the values are the matching lists.
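
A minimal sketch, assuming the two lists are called ```names``` and ```urls``` (hypothetical names) and hold made-up values.

```python
# made-up lists standing in for the scraped names and urls
names = ["Cat", "Oak"]
urls = ["/cat.html", "/oak.html"]

# each key maps to the whole list of related values
life_dict = {"name": names, "url": urls}
print(life_dict)
```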


In [None]:
## create it here

## Export as CSV

We'll use Pandas to export our data to an external file.

We'll cover this in more detail soon, but for now here it is:
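
A rough sketch of the export, assuming a ```life_dict``` shaped like the one described above. The values and the file name ```lifeforms.csv``` are made up for illustration.

```python
import pandas as pd

# made-up dict in the shape described above
life_dict = {"name": ["Cat", "Oak"], "url": ["/cat.html", "/oak.html"]}

# each key becomes a column, each list becomes that column's rows
df = pd.DataFrame(life_dict)

# write to a csv file; index=False leaves out the row numbers
df.to_csv("lifeforms.csv", index=False, encoding="UTF-8")
print(df)
```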

In [None]:
## import pandas
import pandas as pd

In [None]:
## use pandas to write to csv file


# BeautifulSoup

We covered some basic BeautifulSoup functionality:

- Remember ```soup``` is just a variable name we use to store an entire parsed webpage or file. We could call it anything we want.
- Searching by ```tags``` like ```title```, ```h1```, ```span``` etc.
- Searching by ```class``` or ```id```
- Finding all occurrences of an item using ```find_all()```
- Finding the first occurrence of an item using ```find()```
- Removing the html and returning just the string by using ```.string``` or ```get_text()```
- Grabbing just the URL(s) using ```get("href")```

These are the most frequently used BeautifulSoup functions. You can [find many more](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#) in the documentation. 
