https://towardsdatascience.com/introduction-to-web-scraping-with-beautifulsoup-e87a06c2b857

Target article -> https://en.wikipedia.org/wiki/Artificial_intelligence

Scrapy is a complete web scraping framework that takes care of everything from fetching the HTML to processing the data. Selenium is a browser automation tool that can, for example, navigate between multiple pages. Both have a steeper learning curve than Requests, which is used to fetch HTML, and BeautifulSoup, which parses it. This notebook uses the standard library's `urllib.request` in place of Requests.

In this post we will scrape the “content” and “see also” sections from an arbitrary Wikipedia article.

In [1]:
from bs4 import BeautifulSoup
import urllib.request
import re

In [2]:
url = "https://en.wikipedia.org/wiki/Artificial_intelligence"

In [3]:
page = urllib.request.urlopen(url)  # connect to the website

In [4]:
soup = BeautifulSoup(page, 'html.parser')
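
The fetch above relies on the defaults of `urlopen`. As a minimal sketch, assuming you want to identify your script and avoid hanging on a slow connection, the request can carry an explicit User-Agent header and a timeout. The `USER_AGENT` string and the helper names `build_request` and `fetch_html` are hypothetical and not part of the original notebook.

```python
import urllib.request

# Hypothetical identifier; replace it with something that describes your script.
USER_AGENT = "example-scraper/0.1 (learning project)"


def build_request(url):
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Download the page and return its raw bytes (BeautifulSoup accepts bytes)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


req = build_request("https://en.wikipedia.org/wiki/Artificial_intelligence")
# The next line needs network access, so it is left commented out here:
# soup = BeautifulSoup(fetch_html(req.full_url), "html.parser")
```

The `with` block also closes the connection once the bytes have been read.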

We can now start parsing the article.

In [7]:
soup.title.string

'Artificial intelligence - Wikipedia'

## Find specific elements in the page
The BeautifulSoup object can now be used to find elements in the HTML. Inspecting the page shows that every list item in the content section has a class that starts with `tocsection-`, so we can use BeautifulSoup’s `find_all` method to find all list items with that class.

In [10]:
regex = re.compile('^tocsection-')
content_lis = soup.find_all('li', attrs={'class': regex})
print(len(content_lis))

72
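
To see how the regular expression interacts with the `class` attribute, here is a small sketch on hand-written markup. The HTML below is hypothetical and only loosely modelled on a table of contents. When a tag has several classes, BeautifulSoup tests the pattern against each class on its own, so `^tocsection-` still matches an item whose attribute reads `toclevel-1 tocsection-1`.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup for illustration; not copied from Wikipedia.
sample_html = """
<ul>
  <li class="toclevel-1 tocsection-1">1 History</li>
  <li class="toclevel-1 tocsection-2">2 Basics</li>
  <li class="toclevel-1">Not a numbered section</li>
  <li>No class at all</li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
pattern = re.compile(r"^tocsection-")

# Only list items with at least one class starting with "tocsection-" match.
matches = sample_soup.find_all("li", attrs={"class": pattern})
matched_text = [li.get_text() for li in matches]
print(matched_text)  # ['1 History', '2 Basics']
```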


To get the raw text, we can loop through the list and call the `getText` method on each list item. The text of a parent item also includes its nested sub-items, so we keep only the first line.

In [12]:
content = []
for li in content_lis:
    content.append(li.getText().split('\n')[0])
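
The `split('\n')[0]` step in the loop is easier to follow on a tiny example. The sketch below uses hypothetical markup with one nested list, assuming the real table of contents is nested in a similar way.

```python
from bs4 import BeautifulSoup

# Hypothetical nested list: one parent entry with one sub-entry.
nested_html = (
    '<ul><li class="tocsection-3"><a href="#Problems">3 Problems</a>\n'
    '<ul>\n'
    '<li class="tocsection-4"><a href="#Reasoning">3.1 Reasoning, problem solving</a></li>\n'
    '</ul>\n'
    '</li></ul>'
)

sample_soup = BeautifulSoup(nested_html, "html.parser")
parent_li = sample_soup.find("li")

# The parent's text contains its own title AND the titles of nested items.
full_text = parent_li.getText()

# Keeping the first line drops the nested titles.
first_line = full_text.split("\n")[0]
print(repr(first_line))  # '3 Problems'
```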

In [13]:
content

['1 History',
 '2 Basics',
 '3 Problems',
 '3.1 Reasoning, problem solving',
 '3.2 Knowledge representation',
 '3.3 Planning',
 '3.4 Learning',
 '3.5 Natural language processing',
 '3.6 Perception',
 '3.7 Motion and manipulation',
 '3.8 Social intelligence',
 '3.9 General intelligence',
 '4 Approaches',
 '4.1 Cybernetics and brain simulation',
 '4.2 Symbolic',
 '4.2.1 Cognitive simulation',
 '4.2.2 Logic-based',
 '4.2.3 Anti-logic or scruffy',
 '4.2.4 Knowledge-based',
 '4.3 Sub-symbolic',
 '4.3.1 Embodied intelligence',
 '4.3.2 Computational intelligence and soft computing',
 '4.4 Statistical learning',
 '4.5 Integrating the approaches',
 '5 Tools',
 '5.1 Search and optimization',
 '5.2 Logic',
 '5.3 Probabilistic methods for uncertain reasoning',
 '5.4 Classifiers and statistical learning methods',
 '5.5 Artificial neural networks',
 '5.5.1 Deep feedforward neural networks',
 '5.5.2 Deep recurrent neural networks',
 '5.6 Evaluating progress',
 '6 Applications',
 '6.1 Healthcare',
 '6

To get the data from the “see also” section, we use the `find` method to get the div containing the list items, and then use `find_all` to get a list of those items.

In [14]:
see_also_section = soup.find('div', attrs={'class': 'div-col columns column-width'})
see_also_soup = see_also_section.find_all('li')
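
One caveat about the lookup above: passing several class names as a single string matches only when the `class` attribute has exactly that value, in that order. The sketch below, on hypothetical markup, compares it with searching for a single class name, which matches regardless of what other classes the tag has. Whether the single-class search suits a real page depends on that page's markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two divs sharing the "div-col" class.
sample_html = """
<div class="div-col columns column-width"><ul><li>A</li><li>B</li></ul></div>
<div class="columns div-col"><ul><li>C</li></ul></div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Exact attribute value: only the first div has precisely this string.
exact_matches = sample_soup.find_all(
    "div", attrs={"class": "div-col columns column-width"}
)

# Single class name: any div that has "div-col" among its classes.
single_class_matches = sample_soup.find_all("div", class_="div-col")

print(len(exact_matches), len(single_class_matches))  # 1 2
```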

In [15]:
see_also_soup

[<li><a class="image" href="/wiki/File:Animation2.gif"><img alt="Animation2.gif" class="noviewer" data-file-height="78" data-file-width="48" decoding="async" height="16" src="//upload.wikimedia.org/wikipedia/commons/thumb/c/c0/Animation2.gif/10px-Animation2.gif" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/c/c0/Animation2.gif/15px-Animation2.gif 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/c/c0/Animation2.gif/19px-Animation2.gif 2x" width="10"/></a> <a href="/wiki/Portal:Artificial_intelligence" title="Portal:Artificial intelligence">Artificial intelligence portal</a></li>,
 <li><a href="/wiki/Abductive_reasoning" title="Abductive reasoning">Abductive reasoning</a></li>,
 <li><i><a href="/wiki/A.I._Rising" title="A.I. Rising">A.I. Rising</a></i></li>,
 <li><a href="/wiki/Behavior_selection_algorithm" title="Behavior selection algorithm">Behavior selection algorithm</a></li>,
 <li><a href="/wiki/Business_process_automation" title="Business process automation">Business 

To extract the hrefs and the text, we can combine a loop with the `find` method.

In [16]:
see_also = []
for li in see_also_soup:
    # find the first a tag that has a title and no class attribute
    a_tag = li.find('a', href=True, attrs={'title': True, 'class': False})
    href = a_tag['href']  # get the href attribute
    text = a_tag.getText()  # get the link text
    see_also.append([href, text])  # append to the list
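
The loop assumes every list item contains a matching link. `find` returns `None` when nothing matches, and indexing `None` raises a `TypeError`. Below is a minimal sketch of a guarded version, run on hypothetical markup in which one item has no link at all.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an icon link with a class, a normal link, and plain text.
sample_html = """
<ul>
  <li><a class="image" href="/wiki/File:Icon.gif"><img alt="icon"/></a>
      <a href="/wiki/Portal:Example" title="Portal:Example">Example portal</a></li>
  <li><a href="/wiki/Abductive_reasoning" title="Abductive reasoning">Abductive reasoning</a></li>
  <li>Plain text entry without a link</li>
</ul>
"""

sample_items = BeautifulSoup(sample_html, "html.parser").find_all("li")

links = []
for li in sample_items:
    # Same filter as above: has href and title, has no class attribute.
    a_tag = li.find("a", href=True, attrs={"title": True, "class": False})
    if a_tag is None:  # nothing matched in this list item, so skip it
        continue
    links.append([a_tag["href"], a_tag.getText()])

print(links)
```

The icon link is skipped because it has a class, and the third item is skipped because it has no link.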

In [17]:
see_also

[['/wiki/Portal:Artificial_intelligence', 'Artificial intelligence portal'],
 ['/wiki/Abductive_reasoning', 'Abductive reasoning'],
 ['/wiki/A.I._Rising', 'A.I. Rising'],
 ['/wiki/Behavior_selection_algorithm', 'Behavior selection algorithm'],
 ['/wiki/Business_process_automation', 'Business process automation'],
 ['/wiki/Case-based_reasoning', 'Case-based reasoning'],
 ['/wiki/Commonsense_reasoning', 'Commonsense reasoning'],
 ['/wiki/Emergent_algorithm', 'Emergent algorithm'],
 ['/wiki/Evolutionary_computation', 'Evolutionary computation'],
 ['/wiki/Glossary_of_artificial_intelligence',
  'Glossary of artificial intelligence'],
 ['/wiki/Machine_learning', 'Machine learning'],
 ['/wiki/Mathematical_optimization', 'Mathematical optimization'],
 ['/wiki/Multi-agent_system', 'Multi-agent system'],
 ['/wiki/Robotic_process_automation', 'Robotic process automation'],
 ['/wiki/Soft_computing', 'Soft computing'],
 ['/wiki/Weak_AI', 'Weak AI'],
 ['/wiki/Personality_computing', 'Personality co
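
The hrefs collected above are relative to the site root. If absolute URLs are needed later, one option is `urllib.parse.urljoin` from the standard library. The sketch below uses a short hand-copied sample instead of the full `see_also` list; the names `see_also_sample` and `absolute_links` are hypothetical.

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org/wiki/Artificial_intelligence"

# Two rows copied from the output above, used as sample data.
see_also_sample = [
    ["/wiki/Abductive_reasoning", "Abductive reasoning"],
    ["/wiki/Machine_learning", "Machine learning"],
]

# urljoin resolves each relative href against the page it was found on.
absolute_links = [[urljoin(base_url, href), text] for href, text in see_also_sample]

for href, text in absolute_links:
    print(href, "->", text)
```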

## Saving data
We almost always want to save scraped data so we can use it later. The easiest way is to write it to a .txt or .csv file using the `open` function, which is built into Python.

We will save the content section into a text file named `content.txt`.

In [18]:
with open('content.txt', 'w') as f:
    for i in content:
        f.write(i+"\n")

The best format for the “see also” data is probably CSV, because it has two columns (one for the href and one for the text).

In [19]:
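import csv

with open('see_also.csv', 'w', newline='') as f:
    writer = csv.writer(f)  # quotes any field that itself contains a comma
    writer.writerows(see_also)  # one row per [href, text] pair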