## 1. Get webpage using *requests*

In [60]:
import requests

req = requests.get('https://en.wikipedia.org/wiki/Computer_security')

In [61]:
req

<Response [200]>

In [63]:
webpage = req.text

## 2. Get specific contents using BeautifulSoup

In [64]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(webpage, 'html.parser')

In [65]:
paragraph = soup.find_all('p', attrs={'class':False})

In [66]:
import time
from random import random
import pandas as pd

data = {"title":[], "href":[]}
for p in paragraph:
    for link in p.find_all('a', attrs={'title': True}):
        data["title"].append(link["title"])
        data["href"].append(link["href"])

df = pd.DataFrame(data)

In [67]:
df

|     | title | href |
|-----|-------|------|
| 0 | Computer system | /wiki/Computer_system |
| 1 | Computer network | /wiki/Computer_network |
| 2 | Computer hardware | /wiki/Computer_hardware |
| 3 | Software | /wiki/Software |
| 4 | Data (computing) | /wiki/Data_(computing) |
| ... | ... | ... |
| 419 | Wikipedia:Citation needed | /wiki/Wikipedia:Citation_needed |
| 420 | Cyberwarfare | /wiki/Cyberwarfare |
| 421 | Wikipedia:Citation needed | /wiki/Wikipedia:Citation_needed |
| 422 | Israel | /wiki/Israel |


In [68]:
df.head(100).to_csv('Sample_data.csv')

### 2.1 Prettify the webpage

In [33]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-feature-night-mode-disabled skin-night-mode-clientpref-0 vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Anomaly detection - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-

### 2.2 Get the first paragraph

Try removing "attrs" to see how it works: without it, find_all('p') returns every paragraph, including those that carry a class attribute.

In [34]:
paragraph = soup.find_all('p')

In [35]:
paragraph

[<p>In <a href="/wiki/Data_analysis" title="Data analysis">data analysis</a>, <b>anomaly detection</b> (also referred to as <b>outlier detection</b> and sometimes as <b>novelty detection</b>) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well defined notion of normal behavior.<sup class="reference" id="cite_ref-ChandolaSurvey_1-0"><a href="#cite_note-ChandolaSurvey-1">[1]</a></sup> Such examples may arouse suspicions of being generated by a different mechanism,<sup class="reference" id="cite_ref-Hawkins_1980_2-0"><a href="#cite_note-Hawkins_1980-2">[2]</a></sup> or appear inconsistent with the remainder of that set of data.<sup class="reference" id="cite_ref-Outliers_in_statistical_data_3-0"><a href="#cite_note-Outliers_in_statistical_data-3">[3]</a></sup>
 </p>,
 <p>Anomaly detection finds application in many domains including <a href="/wiki/Computer_security" tit

In [36]:
paragraph = soup.find('p', attrs={"class":False})

In [37]:
paragraph

<p>In <a href="/wiki/Data_analysis" title="Data analysis">data analysis</a>, <b>anomaly detection</b> (also referred to as <b>outlier detection</b> and sometimes as <b>novelty detection</b>) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well defined notion of normal behavior.<sup class="reference" id="cite_ref-ChandolaSurvey_1-0"><a href="#cite_note-ChandolaSurvey-1">[1]</a></sup> Such examples may arouse suspicions of being generated by a different mechanism,<sup class="reference" id="cite_ref-Hawkins_1980_2-0"><a href="#cite_note-Hawkins_1980-2">[2]</a></sup> or appear inconsistent with the remainder of that set of data.<sup class="reference" id="cite_ref-Outliers_in_statistical_data_3-0"><a href="#cite_note-Outliers_in_statistical_data-3">[3]</a></sup>
</p>

### 2.3 Get all the links in this paragraph which point to other webpages

In [38]:
paragraph.find_all('a')

[<a href="/wiki/Data_analysis" title="Data analysis">data analysis</a>,
 <a href="#cite_note-ChandolaSurvey-1">[1]</a>,
 <a href="#cite_note-Hawkins_1980-2">[2]</a>,
 <a href="#cite_note-Outliers_in_statistical_data-3">[3]</a>]

In [39]:
paragraph.find_all('a', attrs={"title":True})

[<a href="/wiki/Data_analysis" title="Data analysis">data analysis</a>]

In [40]:
data = {"title":[], "href":[]}
for link in paragraph.find_all('a', attrs={"title":True}):
    data["title"].append(link["title"])
    data["href"].append(link["href"])

In [41]:
import pandas as pd
df = pd.DataFrame(data)

In [42]:
df

|     | title | href |
|-----|-------|------|
| 0 | Data analysis | /wiki/Data_analysis |


## 3. Get the contents from all the webpages

In [21]:
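webpages = []
head = "https://en.wikipedia.org"
for href in data["href"]:
    # hrefs are site-relative (e.g. /wiki/Data_analysis), so prepend the domain
    link = head + href
    req = requests.get(link, timeout=10)  # do not wait forever on a slow server
    webpage = req.text
    webpages.append(webpage)  # keep the raw HTML for later extraction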
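
The hrefs collected in section 2 are site-relative, and some of them point to non-article pages such as /wiki/Wikipedia:Citation_needed. As a minimal sketch, the standard library's `urljoin` can build the absolute URLs while such entries are filtered out and duplicates are dropped. The helper name `article_urls` and the rule "skip any path that contains a colon" are assumptions made for illustration; a few real article titles do contain colons, so adjust the rule to your needs.

```python
from urllib.parse import urljoin

BASE = "https://en.wikipedia.org"


def article_urls(hrefs, base=BASE):
    """Return absolute URLs for article links, without duplicates.

    Assumption: article paths start with /wiki/ and contain no colon;
    namespaced pages such as /wiki/Wikipedia:Citation_needed are skipped.
    """
    seen = set()
    urls = []
    for href in hrefs:
        if not href.startswith("/wiki/") or ":" in href:
            continue  # skip citation anchors, external links and namespaced pages
        url = urljoin(base, href)
        if url not in seen:  # keep the first occurrence only
            seen.add(url)
            urls.append(url)
    return urls


hrefs = [
    "/wiki/Cyberwarfare",
    "/wiki/Wikipedia:Citation_needed",
    "#cite_note-1",
    "/wiki/Cyberwarfare",
    "/wiki/Israel",
]
print(article_urls(hrefs))
```

With the sample list, only the Cyberwarfare and Israel links remain, each as a full https://en.wikipedia.org/... URL.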

## 4. Further reading

### 4.1 robots.txt

Check the website's robots.txt to find out what crawlers are allowed to do.

In [22]:
req = requests.get("https://en.wikipedia.org/robots.txt")
webpage = req.text

In [23]:
# robots.txt is plain text, so print it directly; no HTML parser is needed
print(webpage)

﻿# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#

# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# http://mj12bot.com/
User-agent: MJ12bot
Disallow: /

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Wikipedia work bots:
User-agent: IsraBot
Disallow:

User-agent: Orthogaffe
Disallow:

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: 
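
Reading the rules by eye works for a quick check, but Python's standard library can also evaluate them. The sketch below feeds a few rules copied from the output above into `urllib.robotparser.RobotFileParser` and asks whether a given user agent may fetch a page. Parsing a pasted string is a choice made here so that the example runs offline; it is not how the notebook fetched the file.

```python
from urllib.robotparser import RobotFileParser

# A few rules copied from the robots.txt output above.
rules = """\
User-agent: MJ12bot
Disallow: /

User-agent: IsraBot
Disallow:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

url = "https://en.wikipedia.org/wiki/Computer_security"
for agent in ("MJ12bot", "IsraBot"):
    # An empty "Disallow:" means nothing is disallowed for that agent.
    print(agent, parser.can_fetch(agent, url))
```

This prints False for MJ12bot and True for IsraBot. Against the live site, `parser.set_url("https://en.wikipedia.org/robots.txt")` followed by `parser.read()` downloads and parses the file in place of `parse`.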

### 4.2 Sleep

A website may block you if you scrape it too fast. Let your crawler sleep for a while after each request.

In [24]:
import time

for i in range(5):
    time.sleep(3)
    print(i)

0
1
2
3
4


### 4.3 Randomness

Pausing for exactly three seconds after each round is too robotic. Let's add some randomness so that your crawler looks more like a human. The code below waits between 1 and 3 seconds in each round.

In [25]:
from random import random

for i in range(5):
    t = 1 + 2 * random()
    time.sleep(t)
    print(i)

0
1
2
3
4
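
The random pause can be combined with the download loop from section 3. The following is a minimal sketch rather than the notebook's own code: `crawl` and `fake_fetch` are hypothetical names, and the download function is passed in as an argument so that the demo runs offline. In a real crawler `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`.

```python
import random
import time


def crawl(urls, fetch, low=1.0, high=3.0):
    """Fetch each URL in turn, pausing a random time between requests.

    `fetch` is any callable that takes a URL and returns its HTML.
    """
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            # Wait between low and high seconds before every request but the first.
            time.sleep(random.uniform(low, high))
        pages[url] = fetch(url)
    return pages


# Stand-in for a real download, used only for this demo.
def fake_fetch(url):
    return f"<html><title>{url}</title></html>"


# Short pauses keep the demo quick; the defaults match the 1 to 3 seconds above.
pages = crawl(["/wiki/A", "/wiki/B", "/wiki/C"], fake_fetch, low=0.05, high=0.1)
print(list(pages))
```

Skipping the pause before the first request avoids a pointless wait at start-up while still spacing out every pair of consecutive requests.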


### 4.4 Separate the scraping code from the data extraction code

1. Scraping is the more fragile step. Nothing is more annoying than a crawler that breaks because of a bug in the data extraction part.  
2. You never know in advance what data you will need for modeling, so keep all the webpages you obtain.
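
One way to apply both points is to write every downloaded page to disk unchanged and run the extraction later from those files. The sketch below assumes one file per page, named after the percent-encoded URL; `save_page` and `load_pages` are hypothetical helper names, and very long URLs may exceed file name limits on some systems.

```python
import tempfile
from pathlib import Path
from urllib.parse import quote, unquote


def save_page(url, html, out_dir):
    """Write the raw HTML to out_dir, using the percent-encoded URL as file name."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / (quote(url, safe="") + ".html")
    path.write_text(html, encoding="utf-8")
    return path


def load_pages(out_dir):
    """Yield (url, html) pairs for every page saved by save_page."""
    for path in sorted(Path(out_dir).glob("*.html")):
        yield unquote(path.stem), path.read_text(encoding="utf-8")


# Demo in a temporary folder: save first, read back for extraction later.
with tempfile.TemporaryDirectory() as folder:
    save_page("https://en.wikipedia.org/wiki/Data_analysis", "<p>raw html</p>", folder)
    saved = list(load_pages(folder))

print(saved)
```

If the extraction code has a bug, it can be fixed and rerun on the saved files without downloading anything again.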

### 4.5 Chrome Driver and Selenium

Selenium drives a real browser (for example, Chrome through ChromeDriver), so these tools make your crawler act even more like a human.