# <center>Parsing Web Pages

Let's demonstrate how to use **requests** and **Beautiful Soup**.

The task is to extract:

1. all top-level headings from the article https://en.wikipedia.org/wiki/Bias-variance_tradeoff;
2. the titles of all articles in the Machine Learning Algorithms category from the page https://en.wikipedia.org/wiki/Category:Machine_learning_algorithms.

In [1]:
import requests
import bs4
from functools import reduce

## <center>1. All top-level headings from the article https://en.wikipedia.org/wiki/Bias-variance_tradeoff

Top-level (first-level) headings use the **`<h1>`** tag.

### First, step by step

#### Fetch the page's HTML

In [2]:
url = 'https://en.wikipedia.org/wiki/Bias-variance_tradeoff'
req = requests.get(url)
print(req)

<Response [200]>
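
The request above succeeded with status 200, but a real script usually should not assume that. A minimal sketch of a more defensive fetch, assuming a hypothetical helper name `fetch_html` and a made-up `User-Agent` string (neither comes from the original notebook):

```python
import requests

# Hypothetical User-Agent value; replace it with one that identifies your script.
DEFAULT_HEADERS = {"User-Agent": "parsing-tutorial-example/0.1"}


def fetch_html(url, timeout=10.0):
    """Return the page body as text; raise on HTTP errors or timeouts."""
    response = requests.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

It could then be called as `fetch_html('https://en.wikipedia.org/wiki/Bias-variance_tradeoff')` in place of the bare `requests.get(url).text`.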


In [3]:
print (type(req))
print (req.text)

<class 'requests.models.Response'>
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Bias–variance tradeoff - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Bias–variance_tradeoff","wgTitle":"Bias–variance tradeoff","wgCurRevisionId":821026037,"wgRevisionId":821026037,"wgArticleId":40678189,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles needing additional references from August 2017","All articles needing additional references","All articles with unsourced statements","Articles with unsourced statements from August 2017","Dilemmas","Model selection","Machine learning","Statistical classi

#### Parse the HTML with the Beautiful Soup 4 library

In [4]:
parser = bs4.BeautifulSoup(req.text, 'lxml')
print (type(parser))
print (parser)

<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Bias–variance tradeoff - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Bias–variance_tradeoff","wgTitle":"Bias–variance tradeoff","wgCurRevisionId":821026037,"wgRevisionId":821026037,"wgArticleId":40678189,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles needing additional references from August 2017","All articles needing additional references","All articles with unsourced statements","Articles with unsourced statements from August 2017","Dilemmas","Model selection","Machine learning","Statistical classificatio
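
The `'lxml'` argument selects the parser backend and requires the separate `lxml` package. As a sketch, assuming a hypothetical helper `make_soup`, one way to fall back to the parser bundled with Python when `lxml` is not installed:

```python
import bs4


def make_soup(html):
    """Parse HTML with lxml if available, otherwise with the stdlib parser."""
    try:
        return bs4.BeautifulSoup(html, "lxml")
    except bs4.FeatureNotFound:
        # lxml is not installed; html.parser ships with Python itself.
        return bs4.BeautifulSoup(html, "html.parser")


soup = make_soup("<h1>Bias–variance tradeoff</h1>")
first_heading = soup.find("h1").text
```

The two backends can differ slightly in how they repair malformed markup, so results on messy pages may not be identical.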

#### Select the first h1 tag

In [5]:
print (parser.find('h1'))

<h1 class="firstHeading" id="firstHeading" lang="en">Bias–variance tradeoff</h1>


In [6]:
x = parser.find('h1')
print (type(x))

<class 'bs4.element.Tag'>


In [7]:
print (x.text)

Bias–variance tradeoff
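
Besides `.text`, a `Tag` exposes its name and attributes. A short sketch on the `<h1>` markup printed above, parsed from an inline string so it runs without network access (the variable names are illustrative):

```python
import bs4

html = '<h1 class="firstHeading" id="firstHeading" lang="en">Bias–variance tradeoff</h1>'
soup = bs4.BeautifulSoup(html, "html.parser")
tag = soup.find("h1")

tag_name = tag.name               # the tag name as a string
tag_id = tag["id"]                # dictionary-style access; KeyError if absent
tag_classes = tag.get("class")    # class is multi-valued, so this is a list
missing = tag.get("title")        # .get() returns None for a missing attribute
title = tag.get_text(strip=True)  # text content with surrounding whitespace removed
```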


#### Select all headings with the h1 tag on the page:

In [8]:
y = parser.find_all('h1')
print (type(y))

<class 'bs4.element.ResultSet'>


In [9]:
for result in y:
    print (result.text)
    print ("\n------\n")

Bias–variance tradeoff

------



#### For compactness, put it all together.
All top-level headings (the h1 tag) from the article https://en.wikipedia.org/wiki/Bias-variance_tradeoff:

In [17]:
url = 'https://en.wikipedia.org/wiki/Bias-variance_tradeoff'
text = requests.get(url).text
parser = bs4.BeautifulSoup(text, 'lxml')
x = parser.find_all(name='h1')
titles_h1 = [res.text for res in x]
titles_h1

['Bias–variance tradeoff']
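
The result contains a single entry because the page has only one `<h1>`, the article title. If section headings are wanted as well, `find_all` also accepts a list of tag names. A sketch on a small made-up HTML fragment (the section names are sample data, not scraped from the article):

```python
import bs4

# Made-up fragment used only to illustrate the call.
html = """
<h1>Bias–variance tradeoff</h1>
<h2>Motivation</h2>
<p>Some text.</p>
<h2>Approaches</h2>
<h3>Regression</h3>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Passing a list matches any of the listed tag names, in document order.
headings = [(tag.name, tag.get_text(strip=True))
            for tag in soup.find_all(["h1", "h2"])]
```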

## <center>2. Titles of all articles in the Machine Learning Algorithms category from the page https://en.wikipedia.org/wiki/Category:Machine_learning_algorithms

#### Fetch the page's HTML

In [23]:
url = 'https://en.wikipedia.org/wiki/Category:Machine_learning_algorithms'
print(requests.get(url).text)

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Category:Machine learning algorithms - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"Category","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":14,"wgPageName":"Category:Machine_learning_algorithms","wgTitle":"Machine learning algorithms","wgCurRevisionId":675167466,"wgRevisionId":675167466,"wgArticleId":33547228,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Machine learning","Algorithms"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March

Next, determine which tag (and class) holds the article titles.
Inspecting the HTML (here and in the browser) shows that:
- all titles sit inside a **single** **`<div class="mw-category">`** tag,
- each individual title is wrapped in an **`<li>`** tag.
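
The same "an `<li>` inside `div.mw-category`" relationship can also be written as a CSS selector with `select`. A sketch on a hand-written fragment that only imitates the structure described above (the real page contains far more markup):

```python
import bs4

# Hand-written imitation of the category block; not the real page source.
html = """
<div class="mw-category">
  <div class="mw-category-group">
    <h3>B</h3>
    <ul><li><a href="/wiki/Backpropagation">Backpropagation</a></li>
        <li><a href="/wiki/Bootstrap_aggregating">Bootstrap aggregating</a></li></ul>
  </div>
</div>
<ul><li>Navigation item outside the category block</li></ul>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# "div.mw-category li" means: every <li> that is a descendant of that div.
titles = [li.get_text(strip=True) for li in soup.select("div.mw-category li")]
```

The `<li>` outside the `div` is not matched, which is the reason for narrowing the search to the category block first.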

In [31]:
url = 'https://en.wikipedia.org/wiki/Category:Machine_learning_algorithms'
text = requests.get(url).text
parser = bs4.BeautifulSoup(text, 'lxml')
x = parser.find('div', attrs={'class':'mw-category'})
titles = [title.text for title in x.find_all('li')]
titles

['Algorithms of Oppression',
 'Almeida–Pineda recurrent backpropagation',
 'Backpropagation',
 'Bootstrap aggregating',
 'CN2 algorithm',
 'Constructing skill trees',
 'Dehaene–Changeux model',
 'Diffusion map',
 'Dominance-based rough set approach',
 'Dynamic time warping',
 'Error-driven learning',
 'Evolutionary multimodal optimization',
 'Expectation–maximization algorithm',
 'FastICA',
 'Forward–backward algorithm',
 'GeneRec',
 'Genetic Algorithm for Rule Set Production',
 'Growing self-organizing map',
 'HEXQ',
 'Hyper basis function network',
 'IDistance',
 'K-nearest neighbors algorithm',
 'Kernel methods for vector output',
 'Kernel principal component analysis',
 'Leabra',
 'Linde–Buzo–Gray algorithm',
 'Local outlier factor',
 'Logic learning machine',
 'LogitBoost',
 'Loss functions for classification',
 'Manifold alignment',
 'Minimum redundancy feature selection',
 'Mixture of experts',
 'Multiple kernel learning',
 'Non-negative matrix factorization',
 'Online machine l

As a check (according to the text on the page, there should be **58**):

In [20]:
len(titles) == 58

True

### A few more (more general) variants for illustration

Print the HTML in a more readable form:

In [21]:
url = 'https://en.wikipedia.org/wiki/Category:Machine_learning_algorithms'
text = requests.get(url).text
parser = bs4.BeautifulSoup(text, 'lxml')
print(parser.prettify())             # print the HTML in a more readable form

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Category:Machine learning algorithms - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"Category","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":14,"wgPageName":"Category:Machine_learning_algorithms","wgTitle":"Machine learning algorithms","wgCurRevisionId":675167466,"wgRevisionId":675167466,"wgArticleId":33547228,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Machine learning","Algorithms"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["",

#### Variant 1:
Search by **several** tag conditions at once

In [24]:
x = parser.find_all(lambda tag: 
            tag.find_parent('div', attrs={'class':'mw-category'}) # everything that has a 'div' with class 'mw-category' as an ancestor
        and tag.find_parent('li')                                 # everything that has an 'li' as an ancestor
                   )
# that is, find everything that satisfies both conditions at once

[t.text for t in x]

['Algorithms of Oppression',
 'Almeida–Pineda recurrent backpropagation',
 'Backpropagation',
 'Bootstrap aggregating',
 'CN2 algorithm',
 'Constructing skill trees',
 'Dehaene–Changeux model',
 'Diffusion map',
 'Dominance-based rough set approach',
 'Dynamic time warping',
 'Error-driven learning',
 'Evolutionary multimodal optimization',
 'Expectation–maximization algorithm',
 'FastICA',
 'Forward–backward algorithm',
 'GeneRec',
 'Genetic Algorithm for Rule Set Production',
 'Growing self-organizing map',
 'HEXQ',
 'Hyper basis function network',
 'IDistance',
 'K-nearest neighbors algorithm',
 'Kernel methods for vector output',
 'Kernel principal component analysis',
 'Leabra',
 'Linde–Buzo–Gray algorithm',
 'Local outlier factor',
 'Logic learning machine',
 'LogitBoost',
 'Loss functions for classification',
 'Manifold alignment',
 'Minimum redundancy feature selection',
 'Mixture of experts',
 'Multiple kernel learning',
 'Non-negative matrix factorization',
 'Online machine l

#### Variant 2:
Again search by **several** tag conditions

In [26]:
x = parser.find_all(lambda tag: 
            tag.find_parent('div', attrs={'class':'mw-category'}) # everything that has a 'div' with class 'mw-category' as an ancestor
        and tag.name == 'li'                                 # everything whose tag is 'li'
                   )
# that is, find everything that satisfies both conditions at once

[t.text for t in x]

['Algorithms of Oppression',
 'Almeida–Pineda recurrent backpropagation',
 'Backpropagation',
 'Bootstrap aggregating',
 'CN2 algorithm',
 'Constructing skill trees',
 'Dehaene–Changeux model',
 'Diffusion map',
 'Dominance-based rough set approach',
 'Dynamic time warping',
 'Error-driven learning',
 'Evolutionary multimodal optimization',
 'Expectation–maximization algorithm',
 'FastICA',
 'Forward–backward algorithm',
 'GeneRec',
 'Genetic Algorithm for Rule Set Production',
 'Growing self-organizing map',
 'HEXQ',
 'Hyper basis function network',
 'IDistance',
 'K-nearest neighbors algorithm',
 'Kernel methods for vector output',
 'Kernel principal component analysis',
 'Leabra',
 'Linde–Buzo–Gray algorithm',
 'Local outlier factor',
 'Logic learning machine',
 'LogitBoost',
 'Loss functions for classification',
 'Manifold alignment',
 'Minimum redundancy feature selection',
 'Mixture of experts',
 'Multiple kernel learning',
 'Non-negative matrix factorization',
 'Online machine l
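
Variants 1 and 2 print the same text but select different elements: the first predicate matches tags *inside* an `<li>` (here the `<a>` links), the second matches the `<li>` tags themselves. A sketch on a hand-written fragment, with an illustrative helper `in_category` that is not part of the original notebook:

```python
import bs4

# Hand-written fragment imitating the category list.
html = """
<div class="mw-category">
  <ul>
    <li><a href="/wiki/Backpropagation">Backpropagation</a></li>
    <li><a href="/wiki/FastICA">FastICA</a></li>
  </ul>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")


def in_category(tag):
    """True if the tag has a div.mw-category ancestor."""
    return tag.find_parent("div", attrs={"class": "mw-category"}) is not None


# Variant 1 style: tags that have an <li> ancestor.
inside_li = soup.find_all(lambda tag: in_category(tag) and tag.find_parent("li") is not None)
# Variant 2 style: the <li> tags themselves.
is_li = soup.find_all(lambda tag: in_category(tag) and tag.name == "li")

names_inside_li = [t.name for t in inside_li]
names_is_li = [t.name for t in is_li]
texts_inside_li = [t.text for t in inside_li]
texts_is_li = [t.text for t in is_li]
```

The distinction matters when attributes are needed: the `href` lives on the `<a>` tag, not on the `<li>`.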

#### Variant 3:
Again search by **several** tag conditions

In [28]:
x = parser.find_all(lambda tag: 
            tag.find_parent('div', attrs={'class':'mw-category-group'}) # everything that has a 'div' with class 'mw-category-group'
                                                                        # as an ancestor
        and tag.name == 'li' # everything whose tag is 'li'
                   )
# that is, find everything that satisfies both conditions at once

[t.text for t in x]

['Algorithms of Oppression',
 'Almeida–Pineda recurrent backpropagation',
 'Backpropagation',
 'Bootstrap aggregating',
 'CN2 algorithm',
 'Constructing skill trees',
 'Dehaene–Changeux model',
 'Diffusion map',
 'Dominance-based rough set approach',
 'Dynamic time warping',
 'Error-driven learning',
 'Evolutionary multimodal optimization',
 'Expectation–maximization algorithm',
 'FastICA',
 'Forward–backward algorithm',
 'GeneRec',
 'Genetic Algorithm for Rule Set Production',
 'Growing self-organizing map',
 'HEXQ',
 'Hyper basis function network',
 'IDistance',
 'K-nearest neighbors algorithm',
 'Kernel methods for vector output',
 'Kernel principal component analysis',
 'Leabra',
 'Linde–Buzo–Gray algorithm',
 'Local outlier factor',
 'Logic learning machine',
 'LogitBoost',
 'Loss functions for classification',
 'Manifold alignment',
 'Minimum redundancy feature selection',
 'Mixture of experts',
 'Multiple kernel learning',
 'Non-negative matrix factorization',
 'Online machine l

**The same approach works for siblings and descendants**
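
A sketch of what such predicates might look like, using `find_previous_sibling` for siblings and `find` for descendants on a hand-written fragment (the markup and variable names are illustrative):

```python
import bs4

# Hand-written fragment; the middle entry deliberately has no link.
html = """
<div class="mw-category">
  <ul>
    <li><a href="/wiki/Backpropagation">Backpropagation</a></li>
    <li>Plain entry without a link</li>
    <li><a href="/wiki/FastICA">FastICA</a></li>
  </ul>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Sibling condition: <li> tags that have an earlier <li> sibling (all but the first).
not_first = soup.find_all(
    lambda tag: tag.name == "li" and tag.find_previous_sibling("li") is not None)

# Descendant condition: <li> tags that contain an <a> somewhere inside.
with_link = soup.find_all(
    lambda tag: tag.name == "li" and tag.find("a") is not None)

not_first_texts = [t.get_text(strip=True) for t in not_first]
with_link_texts = [t.get_text(strip=True) for t in with_link]
```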