## 1. Get a webpage using *requests*

In [None]:
import requests

req = requests.get('https://en.wikipedia.org/wiki/Data_science')

In [None]:
req

In [None]:
webpage = req.text

In [None]:
# req.text is a str, so open the file in text mode ("w"), not binary mode ("wb")
with open("filename", "w", encoding="utf-8") as f:
    f.write(webpage)

In [None]:
print(webpage)

## 2. Get specific contents using BeautifulSoup

In [None]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(webpage, 'html.parser')

### 2.1 Prettify the webpage

In [None]:
print(soup.prettify())

### 2.2 Get all links and their titles

Try removing the "attrs" argument from the calls below to see how the results change.

In [None]:
links = soup.find_all('a')

In [None]:
links

In [None]:
# find_all returns every match; find would return only the first one
links2 = soup.find_all('a', attrs={"class":False})

In [None]:
links2

In [None]:
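# `paragraph` was not defined above; it is assumed here to be the first <p> element.
# Pick another paragraph if that one contains no links.
paragraph = soup.find('p')
paragraph.find_all('a', attrs={"title":True})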
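
The effect of "attrs" is easier to see on a small, hand-written snippet than on a full Wikipedia page. The sketch below is illustrative only: the HTML string and the name `demo_soup` are made up for this example. It compares `find_all` with and without "attrs", and `find`, which returns only the first match.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this demonstration
demo_html = """
<p>
  <a href="/wiki/A" title="A" class="internal">A</a>
  <a href="/wiki/B" title="B">B</a>
  <a href="#note">note</a>
</p>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# No filter: every <a> tag
all_links = demo_soup.find_all("a")
# True keeps only tags that have the attribute
with_title = demo_soup.find_all("a", attrs={"title": True})
# False keeps only tags that lack the attribute
without_class = demo_soup.find_all("a", attrs={"class": False})
# find returns a single tag (the first match), not a list
first_link = demo_soup.find("a")

print(len(all_links), len(with_title), len(without_class))
print(first_link["href"])
```

A value of `True` selects tags that have the attribute, and `False` selects tags that do not.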

In [None]:
data = {"title":[], "href":[]}
# Require both attributes so that link["href"] cannot raise a KeyError
for link in soup.find_all('a', attrs={"title":True, "href":True}):
    data["title"].append(link["title"])
    data["href"].append(link["href"])

In [None]:
import pandas as pd
df = pd.DataFrame(data)
df

In [None]:
df.to_csv(r'Sample_data.csv', index = False)

## 3. Get the contents from all the webpages

In [None]:
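from urllib.parse import urljoin

webpages = []
head = "https://en.wikipedia.org"
for href in data["href"]:
    # urljoin handles relative hrefs ("/wiki/...") as well as absolute ones,
    # which plain string concatenation would turn into invalid URLs
    link = urljoin(head, href)
    req = requests.get(link)
    webpage = req.text
    webpages.append(webpage)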
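
Not every href collected from a page is a site-relative path: some are absolute URLs, some start with `//`, and some are only in-page anchors such as `#cite_note-1`. The sketch below is one possible way to clean the list before crawling. The helper name `resolve_links` and the sample hrefs are hypothetical; it relies only on `urljoin` and `urldefrag` from the standard library.

```python
from urllib.parse import urljoin, urldefrag

def resolve_links(base, hrefs):
    """Turn raw href values into absolute URLs, without duplicates."""
    seen = set()
    resolved = []
    for href in hrefs:
        # Skip empty values and pure in-page anchors
        if not href or href.startswith("#"):
            continue
        # Resolve against the base URL, then drop any "#fragment" part
        url, _fragment = urldefrag(urljoin(base, href))
        if url not in seen:
            seen.add(url)
            resolved.append(url)
    return resolved

# Hypothetical sample of hrefs, similar in shape to data["href"]
sample_hrefs = [
    "/wiki/Statistics",
    "#cite_note-1",
    "/wiki/Statistics#History",
    "//commons.wikimedia.org/wiki/X",
    "https://example.org/a",
]
for url in resolve_links("https://en.wikipedia.org", sample_hrefs):
    print(url)
```

Deduplicating before the loop avoids downloading the same page twice when several links point to it.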

## 4. Further reading

### 4.1 robots.txt

Check the website's robots.txt to find out what crawlers are allowed to access.

In [None]:
req = requests.get("https://en.wikipedia.org/robots.txt")
webpage = req.text

In [None]:
# robots.txt is plain text, so it can be printed directly without an HTML parser
print(webpage)
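
Reading robots.txt by eye works for a quick check, but the rules can also be evaluated in code. The following is a minimal sketch using `urllib.robotparser` from the standard library. The helper name `allowed` and the sample rules are hypothetical and stand in for a real robots.txt file.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, url, user_agent="*"):
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() expects an iterable of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical rules, not taken from any real site
sample_rules = """\
User-agent: *
Disallow: /private/
"""

print(allowed(sample_rules, "https://example.org/private/page"))
print(allowed(sample_rules, "https://example.org/public/page"))
```

For a live site, the robots.txt text downloaded above can be passed in as `robots_txt`.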

### 4.2 Sleep

You may be banned if you scrape a website too fast. Let your crawler sleep for a while after each round.

In [None]:
import time

for i in range(5):
    time.sleep(3)
    print(i)

### 4.3 Randomness

Pausing for exactly three seconds after each round is too robotic. Add some randomness to make your crawler look more like a human.

In [None]:
from random import random

for i in range(5):
    t = 1 + 2 * random()
    time.sleep(t)
    print(i)
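
The sleep and the random delay can be wrapped in one small helper so that every request in a crawler pauses the same way. This is only a sketch: the name `polite_pause`, its default bounds, and the `sleep` parameter (which lets the pause be replaced in tests) are assumptions made for illustration.

```python
import random
import time

def polite_pause(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random duration between min_seconds and max_seconds."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    # Return the delay so the caller can log it if needed
    return delay

# Short bounds here so the demonstration finishes quickly
for i in range(3):
    waited = polite_pause(0.01, 0.02)
    print(i, round(waited, 3))
```

In a crawling loop, a call to `polite_pause()` would go right after each `requests.get(...)`.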

### 4.4 Separate the scraping code from the data extraction code

1. Scraping is the more fragile part. Nothing is more annoying than a crawler that breaks because of a bug in the data extraction part.
2. You never know what data you will need for modeling, so keep all the webpages you obtain.
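
One way to follow both points is to write every downloaded page to disk untouched and run the extraction later as a separate step over the saved files. The sketch below assumes hypothetical helper names (`save_page`, `extract_titles`) and a hash-based file naming scheme. To stay dependency-free it extracts only the `<title>` with the standard library's `html.parser`; BeautifulSoup could be used for the extraction step instead.

```python
import hashlib
from html.parser import HTMLParser
from pathlib import Path

def save_page(url, html, folder):
    """Scraping step: store the raw HTML unchanged, one file per URL."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    # A hash of the URL gives a short, filesystem-safe file name
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html"
    path = folder / name
    path.write_text(html, encoding="utf-8")
    return path

class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def extract_titles(folder):
    """Extraction step: reads only from disk, so it can be rerun freely."""
    titles = {}
    for path in sorted(Path(folder).glob("*.html")):
        parser = TitleParser()
        parser.feed(path.read_text(encoding="utf-8"))
        titles[path.name] = parser.title.strip()
    return titles
```

If the extraction code has a bug, it can be fixed and rerun on the saved files without downloading anything again.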

### 4.5 Chrome Driver and Selenium

Selenium drives a real browser (Chrome, through ChromeDriver), which makes your crawler act even more like a human.