# Web Scraping with Python: the Requests and Beautiful Soup Libraries

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
r = requests.get("http://www.pyclass.com/example.html", headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})

In [3]:
type(r)

requests.models.Response
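
The request in cell [2] assumes the server answers normally. As a minimal sketch, the same call can be wrapped in a small helper that sets a timeout and fails loudly on HTTP errors. The helper name `fetch_html` and the 10 second timeout are my own assumptions, not part of the original notebook. The last lines build the request without sending it, so you can inspect the URL and headers offline.

```python
import requests

USER_AGENT = (
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) "
    "Gecko/20100101 Firefox/61.0"
)
URL = "http://www.pyclass.com/example.html"


def fetch_html(url, timeout=10):
    """Hypothetical helper: return the response body as bytes."""
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx statuses
    return response.content


# Build the request without sending it, to inspect what would be sent.
prepared = requests.Request("GET", URL, headers={"User-Agent": USER_AGENT}).prepare()
print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])
```

Calling `fetch_html(URL)` would return the same kind of bytes object as `r.content`, provided the page is still reachable.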

In [4]:
# Grab the raw body (bytes) from the Response object and store it in another variable
c=r.content
c

b'<!DOCTYPE html>\n<html>\n<head>\n<style>\ndiv.cities {\n    background-color:black;\n    color:white;\n    margin:20px;\n    padding:20px;\n} \n</style>\n</head>\n<body>\n\n<h1 align="center"> Here are three big cities </h1>\n\n<div class="cities">\n<h2>London</h2>\n<p>London is the capital of England and it\'s been a British settlement since 2000 years ago. </p>\n</div>\n\n<div class="cities">\n<h2>Paris</h2>\n<p>Paris is the capital city of France. It was declared capital since 508.</p>\n</div>\n\n<div class="cities">\n<h2>Tokyo</h2>\n<p>Tokyo is the capital of Japan and one of the most populated cities in the world.</p>\n</div>\n\n</body>\n</html>'

### This doesn't look very nice, but it is the actual source code of the page: the head tags, the HTML tags, and everything else are there.
- This is where BeautifulSoup comes into play.
- The requests library loaded the source code, but as one raw byte string that is hard to read.
- BeautifulSoup will give me the elements of the HTML tags I'm interested in.
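
The output looks scrambled mostly because Python shows the `repr` of a bytes object, with every newline escaped as `\n`. A small sketch with a shortened stand-in for `r.content` (the `raw` value below is made up for illustration, and UTF-8 is an assumed encoding):

```python
# A shortened, hypothetical stand-in for the bytes returned by r.content.
raw = b'<div class="cities">\n<h2>London</h2>\n</div>'

print(raw)                  # bytes repr: newlines show up as the two characters \n

text = raw.decode("utf-8")  # assumption: the page is UTF-8 encoded
print(text)                 # decoded text: newlines are real line breaks

lines = text.splitlines()
print(len(lines))
```

Decoding makes the text readable, but it is still one flat string. Picking out individual tags is what BeautifulSoup is for.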

In [5]:
# The parser is usually html.parser. If I don't pass one, it gives a warning but still works, so I normally pass it.
soup=BeautifulSoup(c,'html.parser')
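
Once the document is parsed, individual tags become objects you can navigate. A minimal sketch on a one-line HTML string (the `html` value is a shortened stand-in for the page, not the real response):

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the page used in this notebook.
html = b'<html><body><h1 align="center"> Here are three big cities </h1></body></html>'
soup = BeautifulSoup(html, "html.parser")

heading = soup.h1                     # first <h1> tag in the document
print(heading.name)                   # the tag name
print(heading["align"])               # attributes are read like dictionary keys
print(heading.get_text(strip=True))   # inner text without surrounding whitespace
```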

### If you now call print(soup.prettify()), you'll see the source code of the web page in an organized form: BeautifulSoup parses the document, recognizes these tags, and renders them with indentation that is easier for the human eye to read.

In [6]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <style>
   div.cities {
    background-color:black;
    color:white;
    margin:20px;
    padding:20px;
}
  </style>
 </head>
 <body>
  <h1 align="center">
   Here are three big cities
  </h1>
  <div class="cities">
   <h2>
    London
   </h2>
   <p>
    London is the capital of England and it's been a British settlement since 2000 years ago.
   </p>
  </div>
  <div class="cities">
   <h2>
    Paris
   </h2>
   <p>
    Paris is the capital city of France. It was declared capital since 508.
   </p>
  </div>
  <div class="cities">
   <h2>
    Tokyo
   </h2>
   <p>
    Tokyo is the capital of Japan and one of the most populated cities in the world.
   </p>
  </div>
 </body>
</html>


### Normally you won't need the prettify method much, because a better way to see the source code is to open the web page in your browser and choose Inspect, which gives a more readable view of the HTML.

- Back in the code, what you want to do is call a method named `find_all`, and what you want to find is divs.

- But there may be lots of divs in a web page. For instance, if the page had two more divs, we wouldn't want those to be found; we only want these three.

- As you can see, these three share a common class attribute, which is equal to cities. We can make use of that by passing a dictionary that maps class to cities.

In [7]:
all=soup.find_all('div',{'class':'cities'})

In [8]:
all

[<div class="cities">
 <h2>London</h2>
 <p>London is the capital of England and it's been a British settlement since 2000 years ago. </p>
 </div>, <div class="cities">
 <h2>Paris</h2>
 <p>Paris is the capital city of France. It was declared capital since 508.</p>
 </div>, <div class="cities">
 <h2>Tokyo</h2>
 <p>Tokyo is the capital of Japan and one of the most populated cities in the world.</p>
 </div>]

In [9]:
type(all)

bs4.element.ResultSet
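
The dictionary filter is one of several equivalent spellings. The sketch below runs on a made-up page with two extra divs (the `header` and `footer` classes are assumptions for illustration) and compares the dictionary form, the `class_` keyword, and a CSS selector. It also shows that a `ResultSet` behaves like a regular list.

```python
from bs4 import BeautifulSoup

# Hypothetical page: three "cities" divs plus two divs we do not want.
html = """
<div class="header">Menu</div>
<div class="cities"><h2>London</h2></div>
<div class="cities"><h2>Paris</h2></div>
<div class="cities"><h2>Tokyo</h2></div>
<div class="footer">Contact</div>
"""
soup = BeautifulSoup(html, "html.parser")

all_divs = soup.find_all("div")                      # no filter: every div
by_dict = soup.find_all("div", {"class": "cities"})  # form used in this notebook
by_keyword = soup.find_all("div", class_="cities")   # keyword form (note the underscore)
by_css = soup.select("div.cities")                   # CSS selector form

print(len(all_divs))
print(len(by_dict), len(by_keyword), len(by_css))
print(isinstance(by_dict, list))                     # ResultSet is a list subclass
```

The keyword is spelled `class_` because `class` is a reserved word in Python.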

In [10]:
all[0]  # I can index and iterate over it just like a list!

<div class="cities">
<h2>London</h2>
<p>London is the capital of England and it's been a British settlement since 2000 years ago. </p>
</div>

In [11]:
all[1]

<div class="cities">
<h2>Paris</h2>
<p>Paris is the capital city of France. It was declared capital since 508.</p>
</div>

### Notice above that we've got a list with three elements, one for each div. If you want **only the first element with this class attribute** (cities), use the find method instead. It is equivalent to indexing all[0].

In [12]:
all2=soup.find('div',{'class':'cities'})
all2

<div class="cities">
<h2>London</h2>
<p>London is the capital of England and it's been a British settlement since 2000 years ago. </p>
</div>
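
One difference worth keeping in mind shows up when nothing matches: `find` returns `None`, while `find_all` returns an empty result. A short sketch on a made-up snippet (the `countries` class is a hypothetical name that does not exist on the page):

```python
from bs4 import BeautifulSoup

html = '<div class="cities"><h2>London</h2></div><div class="cities"><h2>Paris</h2></div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("div", {"class": "cities"})
same = soup.find_all("div", {"class": "cities"})[0]
print(first == same)        # find gives the same tag as indexing find_all at 0

# "countries" is a class that does not appear in the snippet.
missing_one = soup.find("div", {"class": "countries"})
missing_all = soup.find_all("div", {"class": "countries"})
print(missing_one)          # None
print(len(missing_all))     # 0
```

So indexing `find_all(...)[0]` raises an `IndexError` on a page without matches, whereas `find(...)` hands back `None` that you can check for.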

What if I want only the `<h2>` tags inside the `cities` divs?

In [26]:
all.find_all('h2')

AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

This raises an error because I'm calling find_all on the whole `all` result set. I need to call it on the individual elements of the list.
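
A compact way to reproduce the error and then work around it, as a sketch on a made-up two-div snippet. The variable names are mine, and the nested comprehension is just one possible approach:

```python
from bs4 import BeautifulSoup

html = '<div class="cities"><h2>London</h2></div><div class="cities"><h2>Paris</h2></div>'
divs = BeautifulSoup(html, "html.parser").find_all("div", {"class": "cities"})

# Calling find_all on the result set itself fails: it is a list of tags, not a tag.
try:
    divs.find_all("h2")
    failed = False
except AttributeError:
    failed = True
print(failed)

# Calling find_all on each element works; the comprehension flattens the results.
headings = [h2 for div in divs for h2 in div.find_all("h2")]
print(headings)
```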

In [27]:
all[0].find_all('h2') 

[<h2>London</h2>]

In [28]:
k=all[0].find_all('h2') 

In [29]:
k[0]

<h2>London</h2>

In [30]:
type(k[0])

bs4.element.Tag

In [31]:
k[0].text

'London'
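
`.text` returns the inner text exactly as it appears in the source, including any surrounding spaces. If that matters, `get_text` accepts a `strip` flag and a separator. A small sketch on a made-up snippet with deliberate extra whitespace:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with extra spaces around the text.
html = "<div><h2> London </h2><p>Capital of England. </p></div>"
div = BeautifulSoup(html, "html.parser").div

raw_text = div.h2.text                       # whitespace preserved
clean_text = div.h2.get_text(strip=True)     # whitespace removed
joined = div.get_text(" ", strip=True)       # all text in the div, joined by a space

print(repr(raw_text))
print(repr(clean_text))
print(repr(joined))
```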

In [32]:
all

[<div class="cities">
 <h2>London</h2>
 <p>London is the capital of England and it's been a British settlement since 2000 years ago. </p>
 </div>, <div class="cities">
 <h2>Paris</h2>
 <p>Paris is the capital city of France. It was declared capital since 508.</p>
 </div>, <div class="cities">
 <h2>Tokyo</h2>
 <p>Tokyo is the capital of Japan and one of the most populated cities in the world.</p>
 </div>]

In [33]:
len(all)

3

In [34]:
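a=[]
for div in all:                  # iterate over the divs directly instead of by index
    h2_tags=div.find_all('h2')   # list of <h2> tags inside this div
    a.append(h2_tags[0].text)    # keep the text of the first one
print(a)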

['London', 'Paris', 'Tokyo']


In [35]:
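v=[]
for div in all:                  # iterate over the divs directly instead of by index
    p_tags=div.find_all('p')     # list of <p> tags inside this div
    v.append(p_tags[0].text)     # keep the text of the first one
print(v)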

["London is the capital of England and it's been a British settlement since 2000 years ago. ", 'Paris is the capital city of France. It was declared capital since 508.', 'Tokyo is the capital of Japan and one of the most populated cities in the world.']
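
The two loops above can be merged so that each city name stays paired with its description. This is a minimal sketch on a shortened copy of the page. The dictionary keys `city` and `description` are names I chose for illustration, and the paragraph texts are abbreviated.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the three "cities" divs.
html = """
<div class="cities"><h2>London</h2><p>London is the capital of England. </p></div>
<div class="cities"><h2>Paris</h2><p>Paris is the capital city of France.</p></div>
<div class="cities"><h2>Tokyo</h2><p>Tokyo is the capital of Japan.</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

records = []
for div in soup.find_all("div", {"class": "cities"}):
    records.append({
        "city": div.find("h2").get_text(strip=True),
        "description": div.find("p").get_text(strip=True),  # strip drops the trailing space
    })

for record in records:
    print(record)
```

A list of dictionaries like this is easy to pass on to `csv.DictWriter` or a pandas DataFrame if you want to save the scraped data.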
