# Python Web Scraping
____

[TOC]

## 1. Concept

### What Is Web Scraping?
`In theory, web scraping is the practice of gathering data through any means other than a program interacting with an API (or, obviously, through a human using a web browser). This is most commonly accomplished by writing an automated program that queries a web server, requests data (usually in the form of the HTML and other files that comprise web pages), and then parses that data to extract needed information.`

* Retrieving HTML data from a domain name 
* Parsing that data for target information 
* Storing the target information 
* Optionally, moving to another page to repeat the process
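
The four steps above can be sketched with the standard library alone. This is a minimal illustration, not a production crawler: the names `TitleAndLinkParser`, `fetch`, and `scrape` and the `example.test` URLs are assumptions made up for this example, and the demonstration uses in-memory pages so it runs without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class TitleAndLinkParser(HTMLParser):
    """Collects the <title> text and every <a href> found in a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Step 1: retrieve the raw HTML (needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(start_url, fetch_page=fetch, max_pages=5):
    """Runs the four steps and returns {url: title} for each visited page."""
    results = {}
    queue = [start_url]
    while queue and len(results) < max_pages:
        url = queue.pop(0)
        if url in results:
            continue                              # already visited
        parser = TitleAndLinkParser()
        parser.feed(fetch_page(url))              # steps 1 and 2: retrieve, parse
        parser.close()
        results[url] = parser.title.strip()       # step 3: store
        queue.extend(urljoin(url, link) for link in parser.links)  # step 4: move on
    return results


# Offline demonstration with two in-memory pages (hypothetical URLs).
pages = {
    "http://example.test/a": "<title>Page A</title><a href='b'>next</a>",
    "http://example.test/b": "<title>Page B</title><a href='a'>back</a>",
}
results = scrape("http://example.test/a", fetch_page=pages.__getitem__)
print(results)
```

Passing the fetch function as a parameter keeps the loop testable; calling `scrape(url)` without it falls back to `urlopen`.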


____
## 2. Basic 
### 1. Connection


In [1]:
from urllib.request import urlopen 
html = urlopen("http://pythonscraping.com/pages/page1.html") 
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'


____
This will output the complete HTML code for the page at http://pythonscraping.com/pages/page1.html.
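
Note that the output above is a `bytes` literal (`b'...'`), because `read()` returns raw bytes. A small sketch of turning it into text follows; the helper name `decode_body` and the UTF-8 fallback are assumptions for illustration.

```python
def decode_body(raw, charset=None, fallback="utf-8"):
    """Turns the bytes returned by read() into a str."""
    # Undecodable bytes are replaced rather than raising an error.
    return raw.decode(charset or fallback, errors="replace")


raw = b"<h1>An Interesting Title</h1>\n"
print(decode_body(raw))
```

With a real response object, `html.headers.get_content_charset()` returns the charset declared by the server (or `None`), which can be passed as `charset`.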

**Most modern web pages have many resource files associated with them. These could be image files, JavaScript files, CSS files, or any other content that the page you are requesting links to. When a web browser hits a tag such as `<img src="cuteKitten.jpg">`, it knows that it needs to make another request to the server to get the data in the file cuteKitten.jpg in order to fully render the page for the user. Keep in mind that our Python script doesn’t have the logic to go back and request multiple files (yet); it can only read the single HTML file that we’ve requested.**
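
To make the point concrete, the sketch below lists the extra files a browser would request for a page, without downloading any of them. It uses only the standard library; the class name `ResourceLister`, the helper `list_resources`, and the sample markup are assumptions made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceLister(HTMLParser):
    """Records the URLs of images, scripts, and stylesheets in a page."""

    # tag name -> attribute that holds the resource URL
    URL_ATTRS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attr_name = self.URL_ATTRS.get(tag)
        if attr_name is None:
            return
        value = dict(attrs).get(attr_name)
        if value:
            # Resolve relative paths against the page's own URL.
            self.resources.append(urljoin(self.base_url, value))


def list_resources(html_text, base_url):
    lister = ResourceLister(base_url)
    lister.feed(html_text)
    lister.close()
    return lister.resources


sample = ('<html><head><link rel="stylesheet" href="/css/site.css"></head>'
          '<body><img src="cuteKitten.jpg"></body></html>')
print(list_resources(sample, "http://example.com/pages/page1.html"))
```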
___



### 2. Parsing
`Like its Wonderland namesake, BeautifulSoup tries to make sense of the nonsensical; it helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects representing XML structures.`

In [5]:
from urllib.request import urlopen 
from bs4 import BeautifulSoup 
html = urlopen("http://www.pythonscraping.com/pages/page1.html") 
bsObj = BeautifulSoup(html.read(),"html.parser") 
print(bsObj.h1)

<h1>An Interesting Title</h1>


In fact, any of the following expressions would produce the same output, because BeautifulSoup searches all descendants of a tag for the first match:
```
bsObj.html.body.h1 
bsObj.body.h1 
bsObj.html.h1
```
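
A quick offline check of this equivalence, assuming a small inline page modeled on the one above (the markup string is made up for this example):

```python
from bs4 import BeautifulSoup

page = ("<html><head><title>A Useful Page</title></head>"
        "<body><h1>An Interesting Title</h1><div>Lorem ipsum</div></body></html>")
bsObj = BeautifulSoup(page, "html.parser")

# Four different paths that all reach the first <h1> in the document.
paths = [bsObj.h1, bsObj.html.body.h1, bsObj.body.h1, bsObj.html.h1]
for tag in paths:
    print(tag)
```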

#### Connecting Reliably
Two main things can go wrong in the line `html = urlopen(url)`:

* The page is not found on the server (or there was some error in retrieving it) 
* The server is not found
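
The two failure modes map onto two exception classes in `urllib.error`: `HTTPError` when the server answers with an error status, and its parent class `URLError` when no server can be reached. A minimal sketch follows; the function name `fetch_or_explain` and its `opener` parameter are assumptions added so the behavior can be tried without a network.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_or_explain(url, opener=urlopen):
    """Returns (body, None) on success or (None, reason) on failure."""
    try:
        with opener(url) as response:
            return response.read(), None
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        # This clause must come first: HTTPError is a subclass of URLError.
        return None, "HTTP error {}".format(e.code)
    except URLError as e:
        # No usable answer at all: unknown host, refused connection, etc.
        return None, "server not reached: {}".format(e.reason)
```

A call such as `body, problem = fetch_or_explain("http://www.pythonscraping.com/pages/page1.html")` then leaves exactly one of the two values set to `None`.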

In [8]:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        # The server responded with an error status (404, 500, ...)
        return None
    except URLError as e:
        # The server could not be reached at all
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError as e:
        # bsObj.body was None, so .h1 could not be looked up
        return None
    return title

title = getTitle("http://www.pythonscraping.com/pages/page1.html")

if title is None:
    print("Title could not be found")
else:
    print(title)

<h1>An Interesting Title</h1>


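Wrapping the logic in a function such as `getTitle` has two benefits:

1. The code can be reused for any URL.
2. Either `bsObj.body` or `bsObj.body.h1` could be missing. If `bsObj.body` is `None`, looking up `.h1` on it raises `AttributeError`; if only the `h1` is missing, the lookup returns `None`. Both cases are handled once, inside the function, and the caller only has to check for `None`.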
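
The difference between a missing tag and a missing parent can be seen on two small inline fragments (made up for this example). The sketch assumes the `html.parser` backend, which does not add `<html>` or `<body>` wrappers on its own.

```python
from bs4 import BeautifulSoup

# A page with a <body> but no <h1>: the lookup simply returns None.
no_h1 = BeautifulSoup("<html><body><p>No heading here</p></body></html>", "html.parser")
print(no_h1.body.h1)

# A fragment with no <body> at all: body is None ...
no_body = BeautifulSoup("<p>Just a paragraph</p>", "html.parser")
print(no_body.body)

# ... so asking None for .h1 raises AttributeError.
try:
    no_body.body.h1
    raised = False
except AttributeError:
    raised = True
print(raised)
```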

____
#### Advanced Parsing

Avoid long, position-based chains like the one below. They are hard to read, and the smallest change to the page layout breaks them:
```python
bsObj.findAll("table")[4].findAll("tr")[2].find("td").findAll("div")[1].find("a")
```
**Options:**

* Look for a “print this page” link, or perhaps a mobile version of the site that has better-formatted HTML (more on presenting yourself as a mobile device—and receiving mobile site versions—in Chapter 12). 
* Look for the information hidden in a JavaScript file. Remember, you might need to examine the imported JavaScript files in order to do this. For example, I once collected street addresses (along with latitude and longitude) off a website in a neatly formatted array by looking at the JavaScript for the embedded Google Map that displayed a pinpoint over each address. 
* This is more common for page titles, but the information might be available in the URL of the page itself. 
* If the information you are looking for is unique to this website for some reason, you’re out of luck. If not, try to think of other sources you could get this information from. Is there another website with the same data? Is this website displaying data that it scraped or aggregated from another website?
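
As an illustration of the third option, information carried in the URL can be pulled out with `urllib.parse` without touching the HTML at all. The helper name `describe_url` and the example URL are hypothetical.

```python
from urllib.parse import parse_qs, unquote, urlparse


def describe_url(url):
    """Splits a URL into its path segments and query parameters."""
    parts = urlparse(url)
    segments = [unquote(s) for s in parts.path.split("/") if s]
    # parse_qs returns a list per key; keep the first value of each.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return segments, query


print(describe_url("http://example.com/products/fish-painting?id=3&currency=USD"))
```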

**OR:**
1. Taking advantage of tag names and CSS attributes with `find()` and `findAll()`

In [5]:
from urllib.request import urlopen 
from bs4 import BeautifulSoup 
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html") 
bsObj = BeautifulSoup(html,"html.parser")

In [9]:
# bsObj.findAll(tagName, tagAttributes)
nameList = bsObj.findAll("span", {"class":"green"}) 
for name in nameList: 
    print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


____
2. **Navigating Trees**

* Dealing with children and other descendants
> All children are descendants, but not all descendants are children.
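
The distinction is easy to count on a one-row table. The snippet below is a made-up fragment written without whitespace between tags, so that no extra text nodes appear in the tree.

```python
from bs4 import BeautifulSoup

snippet = "<table><tr><td>Vegetable Basket</td><td>$15.00</td></tr></table>"
table = BeautifulSoup(snippet, "html.parser").table

children = list(table.children)        # direct children only
descendants = list(table.descendants)  # everything nested below the table

print(len(children))     # 1: the single <tr>
print(len(descendants))  # 5: <tr>, two <td> tags, and two text nodes
```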

In [12]:
from urllib.request import urlopen 
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html") 
bsObj = BeautifulSoup(html,"html.parser") 
for child in bsObj.find("table",{"id":"giftList"}).children: 
    print(child)
    
for descendants in bsObj.find("table",{"id":"giftList"}).descendants:
    print(descendants)



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg">
</img></td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg">
</img></td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/im

* Siblings

> The following code prints every product row in the table except the first (title) row, because `next_siblings` yields only the siblings that come after the starting tag, and a tag is never its own sibling.

In [13]:
for sibling in bsObj.find("table",{"id":"giftList"}).tr.next_siblings: 
    print(sibling)



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg">
</img></td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg">
</img></td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg">
</img></td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
Thi
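
The same idea on a small inline table, including the mirror-image `previous_siblings`. The markup and the `id` values are assumptions for this example; the tags are written without whitespace between them, since on a real page the newlines between rows also show up as (string) siblings.

```python
from bs4 import BeautifulSoup

snippet = ('<table><tr id="header"><th>Item</th></tr>'
           '<tr id="gift1"><td>Vegetable Basket</td></tr>'
           '<tr id="gift2"><td>Russian Nesting Dolls</td></tr></table>')
table = BeautifulSoup(snippet, "html.parser").table

# Rows after the first <tr>, in document order.
after_header = [row["id"] for row in table.tr.next_siblings]
print(after_header)

# Rows before the last <tr>, nearest first.
last_row = table.find("tr", {"id": "gift2"})
before_last = [row["id"] for row in last_row.previous_siblings]
print(before_last)
```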

* Dealing with your parents

In [14]:
print(bsObj.find("img",{"src":"../img/gifts/img1.jpg" 
                       }).parent.previous_sibling.get_text())


$15.00
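
The one-line lookup above can be read as four separate steps. The sketch below spells them out on a made-up single-row fragment that mirrors the structure of the gift table; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

snippet = ('<table><tr><td>Vegetable Basket</td><td>$15.00</td>'
           '<td><img src="../img/gifts/img1.jpg"></td></tr></table>')
bsObj = BeautifulSoup(snippet, "html.parser")

img = bsObj.find("img", {"src": "../img/gifts/img1.jpg"})  # 1. select the image
cell = img.parent                                          # 2. its enclosing <td>
price_cell = cell.previous_sibling                         # 3. the <td> just before it
price = price_cell.get_text()                              # 4. the text inside that cell
print(price)
```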



3. Regular Expressions