<img src="./images/Banner_NB.png">

# Web Crawling

In this notebook we will practice a full site crawl, divided into the following steps:
- General considerations and site analysis
- Build a url list for all pages to crawl
- Extract main content from each page
- Mention a class for table parsing
- Summary


## Fetching a Website

Downloading web pages is easy and fast, but heavy scraping can put a considerable load on a server. Webmasters therefore usually publish what kinds of scraping they allow on their websites. Before crawling a site extensively, check its terms of service and the `robots.txt` file of its domain. Terms of service are usually broad, so searching them for “scraping” or “crawling” is a good idea.



Let's take a look at [Google Scholar's robots.txt](https://scholar.google.com/robots.txt):

```
User-agent: *
Disallow: /search
Allow: /search/about
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
...
Disallow: /scholar
Disallow: /citations?
...
```



This file specifies that you're not allowed to crawl many of the pages. The `/scholar` rule is especially painful because it prohibits you from generating queries dynamically.

It's also common for sites to ask you to delay crawling:

```
Crawl-delay: 30 
Request-rate: 1/30 
```



You should respect those restrictions. No one can technically stop you from sending requests through a crawler, but sites like Google Scholar will block you very quickly if you request too many pages in a short time frame.
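
The rules above can also be checked programmatically. The following is a minimal sketch using `urllib.robotparser` from the Python standard library. It parses a shortened excerpt of the rules shown above, combined with the delay directives purely for illustration, so it runs offline. The two URLs being tested are hypothetical examples, not part of the original rules.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the rules shown above, combined with the delay
# directives for illustration. Parsing from a string keeps the example
# offline; against a live site you would use set_url(...) and read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /sdch
Disallow: /groups
Disallow: /scholar
Crawl-delay: 30
Request-rate: 1/30
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs, used only to exercise the rules.
blocked_url = "https://scholar.google.com/scholar?q=web+crawling"
open_url = "https://scholar.google.com/intl/en/scholar/about.html"

print(parser.can_fetch("*", blocked_url))  # matches "Disallow: /scholar"
print(parser.can_fetch("*", open_url))     # matches no Disallow rule
print(parser.crawl_delay("*"))             # seconds to wait between requests
print(parser.request_rate("*"))            # named tuple (requests, seconds)
```

Note that the standard-library parser applies rules in file order, so its answer for overlapping `Allow`/`Disallow` pairs can differ from how the site itself interprets them.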

An alternative to accessing the site dynamically while you crawl (as we do in the next example) is to download a local copy of the website and crawl that. This ensures that you hit the site only once per page. A good tool for this is [wget](https://www.gnu.org/software/wget/).
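
The same "once per page" idea can be sketched in Python with a small on-disk cache. This is an illustration under assumptions, not a replacement for wget: the helper `cached_fetch` and the stand-in `fake_download` are hypothetical names introduced here, and the download function is passed in so the example runs without network access.

```python
import hashlib
import tempfile
from pathlib import Path


def cached_fetch(url, cache_dir, download):
    """Return the page body for url, downloading it at most once."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so that any address maps to a safe file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    cache_file = cache_dir / name
    if cache_file.exists():
        return cache_file.read_bytes()
    body = download(url)  # the only place where the site is contacted
    cache_file.write_bytes(body)
    return body


# Offline demonstration with a stand-in downloader that records its calls.
calls = []


def fake_download(url):
    calls.append(url)
    return b"<html><body>sample page</body></html>"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("https://www.rd.com/jokes/", tmp, fake_download)
    second = cached_fetch("https://www.rd.com/jokes/", tmp, fake_download)

print(len(calls))  # the second request was served from disk
```

In a real crawl, the `download` argument could be something like `lambda u: requests.get(u, headers=user_agent).content`.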

# Real world crawling example
Let's imagine we want to build a really large repository of jokes. We found the site `https://www.rd.com/jokes/`, which has an amazing collection of jokes, and would like to download all of them. All jokes, but only jokes.

## Crawling steps:
1. Analyze site


## Define crawling strategy:
1. Get a list of all the pages that contain jokes (i.e. joke category pages)
2. Iterate through the list, and download each page individually
3. Parse each page, and extract jokes.
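
The three steps of this strategy can be sketched as a small generic loop. This is only an outline under assumptions: the function `crawl` and its parameters (`fetch`, `extract_links`, `extract_items`, `delay_seconds`) are hypothetical names introduced for illustration, and the "site" is a made-up dictionary so the sketch runs offline. The pause between requests is included to limit the load on the server.

```python
import time


def crawl(start_url, fetch, extract_links, extract_items, delay_seconds=1.0):
    """Run the three steps: list the pages, download each one, parse each one."""
    items = []
    # Step 1: collect the pages to visit, dropping duplicates but keeping order.
    page_urls = list(dict.fromkeys(extract_links(fetch(start_url))))
    for index, page_url in enumerate(page_urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause between requests
        html = fetch(page_url)             # Step 2: download the page
        items.extend(extract_items(html))  # Step 3: parse it
    return items


# Made-up stand-in for a website: the index "links" to pages a and b.
FAKE_SITE = {
    "index": "a b a",
    "a": "joke one|joke two",
    "b": "joke three",
}

jokes = crawl(
    "index",
    fetch=FAKE_SITE.__getitem__,
    extract_links=str.split,
    extract_items=lambda page: page.split("|"),
    delay_seconds=0,
)
print(jokes)
```

Passing the download and parsing functions in as arguments keeps the loop independent of any particular site layout.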

### Step 1: Get all pages that contain jokes (and save them into a list)

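In [1]:
from bs4 import BeautifulSoup
import requests

url1 = "https://www.rd.com/jokes/"
# Send a browser-like User-Agent header with the request
user_agent = {'User-agent': 'Mozilla/5.0'}
response1 = requests.get(url1, headers=user_agent)

soup1 = BeautifulSoup(response1.content, "html.parser")
# The div with class "joke-tax-popular" holds the links to the category pages
mtag = soup1.find("div", attrs={"class": "joke-tax-popular"})

# Collect the target URL of every link inside that div
linksToPages = [t['href'] for t in mtag.find_all("a")]

print(linksToPages)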


['https://www.rd.com/jokes/animal/', 'https://www.rd.com/jokes/animal-puns/', 'https://www.rd.com/jokes/puns/bad-puns/', 'https://www.rd.com/jokes/bar/', 'https://www.rd.com/jokes/birthday/', 'https://www.rd.com/jokes/cat/', 'https://www.rd.com/jokes/cat-puns/', 'https://www.rd.com/jokes/christmas-jokes/', 'https://www.rd.com/jokes/coffee-jokes/', 'https://www.rd.com/jokes/computer/', 'https://www.rd.com/jokes/corny/', 'https://www.rd.com/jokes/customer-service/', 'https://www.rd.com/jokes/puns/cute-puns/', 'https://www.rd.com/jokes/dad/', 'https://www.rd.com/jokes/daily-life/', 'https://www.rd.com/jokes/diet-jokes/', 'https://www.rd.com/jokes/doctor/', 'https://www.rd.com/jokes/dog/', 'https://www.rd.com/jokes/dog-puns/', 'https://www.rd.com/jokes/dumb/', 'https://www.rd.com/jokes/easter-jokes/', 'https://www.rd.com/jokes/family/', 'https://www.rd.com/jokes/food-jokes/', 'https://www.rd.com/jokes/food-jokes/food-puns/', 'https://www.rd.com/jokes/headlines/', 'https://www.rd.com/jokes/

### Steps 2 and 3: Iterate through list, download pages, and extract jokes

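In [2]:
#urls=linksToPages
# Only two category pages are crawled here to keep the example short
urls = ['https://www.rd.com/jokes/animal/', 'https://www.rd.com/jokes/animal-puns/']
for url2 in urls:
    response2 = requests.get(url2, headers=user_agent)
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup2 = BeautifulSoup(response2.content, "html.parser")
    # The jokes are in divs with class "content-wrapper"
    for t in soup2.find_all("div", attrs={"class": "content-wrapper"}):
        print(t.text)
        print("####################")

####################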

							Two men are hiking through the woods when one of them cries out, “Snake! Run!” His companion laughs at him. “Oh, relax. It’s only a baby,” he says. “Don’t you hear the rattle?” —Steve Smith						
####################

							Q: Did you hear about the racing snail who got rid of his shell?

A: He thought it would make him faster, but it just made him sluggish.						
####################

							It’s a good thing snakes and dogs don’t interbreed. Nobody wants a loyal snake. —Roy Blount, humorist						
####################

							Q: How are a cat and a sentence different?

A: A cat has claws at the end of its paws; a sentence has a pause at the end of its clause!						
####################

							Q: What do you call a penguin in the desert?
A: Lost						
####################

							Q: What did the SNAIL say while riding on the turtles back?

A: Wheeeeeeeee						
####################

							Q: What is the best way to cook a gator?

A: I

When a page contains a table, there is an easy way to get the data out: the pandas data import functions, in particular the sophisticated [`read_html()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html) function.

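Behind the scenes, `read_html()` relies on an HTML parsing library: by default it tries lxml and falls back to BeautifulSoup with html5lib. These should be part of your Anaconda installation, but if html5lib is missing, you will have to install it: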

```
conda install -c anaconda html5lib
```

Once installed, you can pass in the table as a string.
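
A minimal sketch of this, assuming pandas and one of its HTML parsers are installed. The table below is made-up sample data, used only for illustration. The string is wrapped in `StringIO` because recent pandas versions deprecate passing literal HTML directly.

```python
from io import StringIO

import pandas as pd

# Made-up sample table, used only for illustration.
html = """
<table>
  <thead>
    <tr><th>Category</th><th>Jokes</th></tr>
  </thead>
  <tbody>
    <tr><td>animal</td><td>120</td></tr>
    <tr><td>dad</td><td>85</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))
df = tables[0]
print(df)
```

When scraping a real page, the HTML of a single table found with BeautifulSoup can be converted with `str(...)` and passed in the same way.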

## Scraping Wrap-Up

Scraping is a way to get information from websites that were not designed to make data accessible. As such, it can often be **brittle**: a website change will break your scraping script. It is also often not welcome, as a scraper can cause a lot of traffic.

The way we scraped information here also made the **assumption that HTML is generated consistently** based just on the URL. That is, unfortunately, less and less common, as websites adapt to browser types, resolutions, and locales, and as a lot of content is loaded dynamically, e.g., via web-sockets. For example, many websites now automatically load more data once you scroll to the bottom of the page. These websites can't be scraped with our approach; instead, a browser-emulation approach, using e.g. [Selenium](), would be necessary. [Here is a tutorial](https://medium.com/the-andela-way/introduction-to-web-scraping-using-selenium-7ec377a8cf72) on how to do that.

Finally, many services make their data available through a well-defined interface, an API. Using an API is always a better idea than scraping, but scraping is a good fallback!