## Scrapping Springer Science using Python and Beautiful Soup


Springer Science the publisher of different journals and books with more than 2,900 journals and 300,000 books. Springer Science is one of the world's leading global research, educational and professional publishers, home to an array of respected and trusted brands providing quality content through a range of innovative products and services.

The content from the the https://link.springer.com/ Springer can pulled using keyword for example If someone need to get the research papers on Computer Vision or Machine Learning or Internet of Things etc. You will go the website and search the required papers using the keyword and it pulls all the releevnt papers.

![](https://i.imgur.com/adS0N6J.jpg)

## Project Outline

  1. Download the webpage using the requests library and parse the html source cod using beautiful soup.
  
  2. Extract item such as paper title, author name, year of publication, repository URL.
  
  3. Compile the extracted information into python lists.
  
  4. Combine data from multiple pages.
  
  4. Save the extracted data to a csv(comma separated values) file.

## Project Goal

The project goal is to build a web scraper that withdraws all desirable information and assemble them into a single CSV. The format of the output CSV file is shown below:

![](https://imgur.com/N0aHqBr.jpg)


>### Packages Used:
>1. Requests — For downloading the HTML code from the Springer URL
>2. BeautifulSoup4 — For parsing and extracting data from the HTML string
>3. Pandas — to gather my data into a dataframe for further processing

We will use the `Jovian` library and its `commit()` function to save our work

In [1]:
!pip install jovian --upgrade --quiet

In [2]:
import jovian

# 1. Downloading a web page using requests


We'll use a library called **requests** to download web pages 

In [3]:
!pip install requests --upgrade --quiet

In [4]:
import requests

We have imported the request 


To downlaod the page, we can use the `get` functon from the requests

In [58]:
topic_url =  'https://link.springer.com/search?query=industry+4.0'
response = requests.get(topic_url)

Here we are checking the **Status code**, -> 200-299 will mean that the request was successful

In [119]:
response.status_code


200

We can get the contents of the page using

In [60]:
page_content = response.text

The `len` fucnction tells us the length of the response object

In [62]:
len(page_content)

71644

Displaying the first 1000 characters of `page_contents`

In [63]:
page_content[: 1000]

'<!DOCTYPE html>\n<!--[if lt IE 7]> <html lang="en" class="no-js ie6 lt-ie10 lt-ie9 lt-ie8"> <![endif]-->\n<!--[if IE 7]>    <html lang="en" class="no-js ie7 lt-ie10 lt-ie9 lt-ie8"> <![endif]-->\n<!--[if IE 8]>    <html lang="en" class="no-js ie8 lt-ie10 lt-ie9"> <![endif]-->\n<!--[if IE 9]>    <html lang="en" class="no-js ie9 lt-ie10"> <![endif]-->\n<!--[if gt IE 9]><!--> <html lang="en" class="no-js"> <!--<![endif]-->\n<head>\n  <meta charset="UTF-8"/>\n  <meta name="description" content=""/>\n  <meta name="author" content=""/>\n  <meta name="viewport" content="width=device-width, minimum-scale=1, maximum-scale=1"/>\n  <meta name="format-detection" content="telephone=no"/>\n  <meta name="robots" content="noindex,follow"/>\n  <!--[if (gt IE 8) | (IEMobile)]><!--> <link rel="stylesheet" media="screen" href="/static/9e87700671f4a1203c14482a5d93ed498f4beca9/css/modern_link.min.css"> <!--<![endif]-->\n  <!--[if (lt IE 9) & (!IEMobile)]> <link rel="stylesheet" media="screen" href="/static/

In [130]:
#jovian.commit()

We save the text that we have got into a HTML file with open statement.

In [127]:
with open('Industry_4_0.html', 'w', encoding="utf-8") as file:
    file.write(page_content)

We can see there are 412,183 relevant results found with the keyword Industry 4.0. 

![](https://imgur.com/7aqxlZG.jpg)

In [143]:
jovian.commit()

<IPython.core.display.Javascript object>

[jovian] Updating notebook "mohammed-shahid920/webscrapping-project-1" on https://jovian.ai/
[jovian] Committed successfully! https://jovian.ai/mohammed-shahid920/webscrapping-project-1


'https://jovian.ai/mohammed-shahid920/webscrapping-project-1'

# 2. Inspecting the HTML source code of a web page


To view the source code of any webpage right within your browser, you can right click anywhere on a page and select the "Inspect" option. You access the "Developer Tools" mode, where you can see the source code as a tree. You can expand and collapse various nodes and find the source code for a specific portion of the page

The inspect page with details
![](https://imgur.com/acl7KZp.jpg)



We will be inspecting [HTML code] and using beautiful soup library to get the required details from the Springer website such as **paper title, author name, year of publication, and url link**.

Let's import the **Beautiful Soup library** and start our web Scraping project


In [66]:
#Install BeautifulSoup to retrieve data.
!pip install beautifulsoup4 --upgrade --quiet

We are opening the [HTML] file as an read using the open funtion.

In [70]:
with open('Industry_4_0.html', 'r', encoding="utf-8") as sp:
    springer_page = sp.read()

In [71]:
from bs4 import BeautifulSoup

To parse a document, pass it into the BeautifulSoup constructor.


In [72]:
doc = BeautifulSoup(springer_page, 'html.parser')

Checking type of the **document and title** of the document.

In [73]:
type(doc)

bs4.BeautifulSoup

In [74]:
doc.title

<title>Search Results - Springer</title>

### Feteching relevant papers from Springer Science

In [75]:
def get_topic_springer(topic):
    topic_repos_url = 'https://link.springer.com/search?query=' + topic
    response = requests.get(topic_repos_url)
    print(topic_repos_url)
    if response.status_code !=200:
        print("status code", response.status_code)
        raise Exception ("Failed to fetch web page", + topic_repos_url)
    
    doc = BeautifulSoup(response.text)
    return doc

We are going to search the topic -  **Internet of Things** on Springer Science. 

In [76]:
doc = get_topic_springer('Internet+of+things')

https://link.springer.com/search?query=Internet+of+things


You reting to open the above URl, we will able sucessfully able to retrive the details of Internet Of Things from Springer Science.

![](https://imgur.com/v5vmHCt.jpg)

In [77]:
ML = get_topic_springer('Machine+Learning')
CV = get_topic_springer('Computer+vision')

https://link.springer.com/search?query=Machine+Learning
https://link.springer.com/search?query=Computer+vision


We are able to get all the relevant papers using the keyword Machine Learning and Computer vision.

![](https://imgur.com/E8kugJk.jpg=50x50)
![](https://imgur.com/MZAcuY8.jpg=50x50)

In [None]:
jovian.commit()

<IPython.core.display.Javascript object>

# 3. Parsing the HTML source code using Beautiful soup


Here we need to parse the [HTML] `source code`  get the **Paper title, Name of the Author, Year of Publication and the url link of the paper**

We will try to understand parsing:
 - Lets begin with opening the [HTML] file which is stored on the local machine.
 - We are opening the [HTML] file Industry_4_0.html as read.
 - Then we are parsing the [HTML] source code using `BeautifulSoup`

In [135]:
with open('Industry_4_0.html', 'r', encoding="utf-8") as sp:
    springer_page = sp.read()
doc = BeautifulSoup(springer_page, 'html.parser')
type(doc)

bs4.BeautifulSoup

We have parsed the [HTML] source code and stored in doc variable and the type of doc variable is `bs4.BeautifulSoup`


In [80]:
# After insepection of HTML souce find we found all the required inforamtion stored in Div tag in HTML
#Souce code so we are capturing the details

# In article tags gets all the div tags on the page. 

article_tags = doc.find_all('div', class_='text')

#Getting the first Article details from the souce code.
article_tag = article_tags[0]


# We are filtering out the required information from the div tag. here we are filtering out the h2 tag from the div tag.
h_tag = article_tag.find('h2')


# We are filtering out the required information from the div tag. here we are filtering out the a tag from the div tag.
a_tag = h_tag.find_all('a', recursive = False)

In [81]:
# to get the complete url link of the first article we are using base url + additional links
# base url + addition links 
base_pep_url = "https://link.springer.com"
pep_url = base_pep_url + a_tag[0]['href'].strip()
pep_url

'https://link.springer.com/book/10.1007/978-1-4842-2047-4'

We got the first article url link , lets grap the url link and try to open and check if it works. The above article is from the paper title Industry 4.0. Lets open the url link: https://link.springer.com/book/10.1007/978-1-4842-2047-4
 
![](https://imgur.com/YFo3RUn.jpg)

In [82]:
article_tag = article_tags[2]
h_tag = article_tag.find('h2')
a_tag = h_tag.find_all('a', recursive = False)
pep_url = base_pep_url + a_tag[0]['href'].strip()
pep_url

'https://link.springer.com/book/10.1007/978-981-13-8165-2'

We are able to retrieve the second article url link as well.

![](https://imgur.com/41Gr7xA.png)

In [83]:
#First artile title Tag
article_7_title= article_tags[7].find('a', class_= 'title').text
article_7_title

'Industry 4.0 and Engineering for a Sustainable Future'

In [84]:
article_8_title = article_tags[8].find('a', class_= 'title').text
article_8_title

'CT Scan Generated Material Twins for Composites Manufacturing in Industry 4.0'

In [85]:
 article_7_authors = article_tags[7].find('span', class_='authors').text.strip() 
article_7_authors


'Prof. Dr. Mohammad Dastbaz…'

In [86]:
 article_8_authors = article_tags[8].find('span', class_='authors').text.strip() 
article_8_authors

'Dr. Muhammad A. Ali…'

In [87]:
article_7_year = article_tags[7].find('span', class_= 'year').text
article_7_year

'(2019)'

In [88]:
article_8_year = article_tags[8].find('span', class_= 'year').text
article_8_year

'(2020)'

In [91]:
# Selecting the article 7 from bunch of articles
article_tag = article_tags[7]
article_7_url = base_pep_url + a_tag[0]['href'].strip()
article_7_url

'https://link.springer.com/book/10.1007/978-981-13-8165-2'

### Print the article 7 retrieved details

Paper Title

Author names

Year of Publication

URL link.

In [92]:
print('Paper Title:', article_7_title)
print("Authors names", article_7_authors)
print('Year of publication:', article_7_year)
print(' URL Link:', article_7_url)

Paper Title: Industry 4.0 and Engineering for a Sustainable Future
Authors names Prof. Dr. Mohammad Dastbaz…
Year of publication: (2019)
 URL Link: https://link.springer.com/book/10.1007/978-981-13-8165-2


Let's automate the process and get all the page details using fuction.

In [93]:
article_tag

<div class="text">
<h2>
<a class="title" href="/book/10.1007/978-3-030-12953-8" title="">Industry 4.0 and Engineering for a Sustainable Future</a>
</h2>
<p class="subtitle"></p>
<p class="meta">
<span class="authors">
<a href="/search?facet-creator=%22Prof.+Dr.+Mohammad+Dastbaz%22">Prof. Dr. Mohammad Dastbaz</a><span title="Prof. Peter Cochrane">…</span></span>
<span class="enumeration">
<span class="year" title="2019">(2019)</span>
</span>
</p>
<div class="actions">
<span class="action">
</span>
</div>
</div>


We are getting a tag which contains the title of the paper

In [94]:
a_tags = article_tag.h2.find_all('a')
a_tags

[<a class="title" href="/book/10.1007/978-3-030-12953-8" title="">Industry 4.0 and Engineering for a Sustainable Future</a>]

We are getting the title of the paper by applying `.text.strip()` method.

In [136]:
title = a_tags[0].text.strip()
title

'Industry 4.0 and Engineering for a Sustainable Future'

We can get author names, year of publication and url

In [96]:
auth_tag = article_tag.find_all('span', class_='authors')
auth_tag

[<span class="authors">
 <a href="/search?facet-creator=%22Prof.+Dr.+Mohammad+Dastbaz%22">Prof. Dr. Mohammad Dastbaz</a><span title="Prof. Peter Cochrane">…</span></span>]

In [97]:
authors = auth_tag[0].text.strip()
authors

'Prof. Dr. Mohammad Dastbaz…'

In [98]:
    year_tag = article_tag.find_all('span', class_= 'year')
    year_tag

[<span class="year" title="2019">(2019)</span>]

In [99]:
year = year_tag[0].text.strip()
year

'(2019)'

Lets use the above code and write a function to reterieve all the papers details such as Paper Title, Author name, year of publication and url link.


We given name **parse_article() function** which takes the article as input and provides the information such as paper title, authors name, year of publication, and url link.

We are using the above code and a dictionary to store these information .


In [100]:
def parse_articles(article_tag):
    # <a> tags containing username, repository name and URL
    a_tags = article_tag.h2.find_all('a')
    auth_tag = article_tag.find_all('span', class_='authors')    
    title_name = a_tags[0].text.strip()
    # Repository name
    authors_name = auth_tag[0].text.strip() 
    year_tag = article_tag.find_all('span', class_= 'year')
    year = year_tag[0].text.strip()
    pep_url = base_pep_url + a_tag[0]['href'].strip()
    # Return a dictionary
    return {
        'Paper Title': title_name,
        'Authors': authors_name,        
        'Year of Publication': year,
        'repository_url': pep_url
    }

In [101]:
parse_articles(article_tags[4])

{'Paper Title': 'Smart Agents for the Industry 4.0',
 'Authors': 'Max Hoffmann',
 'Year of Publication': '(2019)',
 'repository_url': 'https://link.springer.com/book/10.1007/978-981-13-8165-2'}

Lets try to get other article information.

In [102]:
parse_articles(article_tags[10])

{'Paper Title': 'Contemporary Challenges in Cooperation and Coopetition in the Age of Industry 4.0',
 'Authors': 'Agnieszka Zakrzewska-Bielawska…',
 'Year of Publication': '(2020)',
 'repository_url': 'https://link.springer.com/book/10.1007/978-981-13-8165-2'}

Now lets automate and get all the paper or get the top 10 or whatever number papers you want to get using the get_top_papers()

get_top_papers function takes topic name and returns all the required information available on the perticular page

In [103]:
# Getting the top search papers 
def get_top_papers(topic):
    article_tags = doc.find_all('div', class_='text')
    topic_repos = [parse_articles(tag) for tag in article_tags]
    return topic_repos

In [104]:
machine_learning = get_topic_springer('Machine+Learning')
top_papers_ml = get_top_papers(machine_learning)
top_papers_ml[:10]

https://link.springer.com/search?query=Machine+Learning


[{'Paper Title': 'Industry 4.0',
  'Authors': 'Alasdair Gilchrist',
  'Year of Publication': '(2016)',
  'repository_url': 'https://link.springer.com/book/10.1007/978-981-13-8165-2'},
 {'Paper Title': 'Industry 4.0',
  'Authors': 'Tessaleno Devezas,\n    João Leitão,\n    Askar Sarygulov',
  'Year of Publication': '(2017)',
  'repository_url': 'https://link.springer.com/book/10.1007/978-981-13-8165-2'},
 {'Paper Title': 'Industry 4.0',
  'Authors': 'Dr. Kaushik Kumar,\n    Divya Zindani,\n    Dr. J. Paulo Davim',
  'Year of Publication': '(2019)',
  'repository_url': 'https://link.springer.com/book/10.1007/978-981-13-8165-2'},
 {'Paper Title': 'A Digital Framework for Industry 4.0',
  'Authors': 'Dr. Ana Landeta Echeberria',
  'Year of Publication': '(2020)',
  'repository_url': 'https://link.springer.com/book/10.1007/978-981-13-8165-2'},
 {'Paper Title': 'Smart Agents for the Industry 4.0',
  'Authors': 'Max Hoffmann',
  'Year of Publication': '(2019)',
  'repository_url': 'https://li

# 4. Writing parsed information into CSV files


By helper function we can list dictionaries and writes them to a CSV file.

In [106]:
def write_paper_csv(items, path):
    # Open the file in write mode
    with open(path, 'w', encoding="utf-8") as f:
        # Return if there's nothing to write
        if len(items) == 0:
            return
        
        # Write the headers in the first line
        headers = list(items[0].keys())
        f.write(','.join(headers) + '\n')
        
        # Write one item per line
        for item in items:
            values = []
            for header in headers:
                values.append(str(item.get(header, "")))
            f.write(','.join(values) + "\n")

topic_page_ml = get_topic_springer('Machine+Learning')
#len(topic_page_ml)

top_pep = get_top_papers(topic_page_ml)
#top_pep

write_paper_csv(top_pep, 'Machine_Learning.csv')


import pandas as pd
spr_df = pd.read_csv('Machine_Learning.csv')
spr_df

https://link.springer.com/search?query=Machine+Learning


Unnamed: 0,Paper Title,Authors,Year of Publication,repository_url
0,Industry 4.0,Alasdair Gilchrist,(2016),https://link.springer.com/book/10.1007/978-981...
1,Industry 4.0,Tessaleno Devezas,,
2,João Leitão,,,
3,Askar Sarygulov,(2017),https://link.springer.com/book/10.1007/978-981...,
4,Industry 4.0,Dr. Kaushik Kumar,,
5,Divya Zindani,,,
6,Dr. J. Paulo Davim,(2019),https://link.springer.com/book/10.1007/978-981...,
7,A Digital Framework for Industry 4.0,Dr. Ana Landeta Echeberria,(2020),https://link.springer.com/book/10.1007/978-981...
8,Smart Agents for the Industry 4.0,Max Hoffmann,(2019),https://link.springer.com/book/10.1007/978-981...
9,The Concept Industry 4.0,Christoph Jan Bartodziej,(2017),https://link.springer.com/book/10.1007/978-981...


We are able to get Title of the paper, Name of the Authors, Year of Publication and url link for single page from Springer Science. The below screen shot shows the CSV file saved on location machine and the content of the CSV File.

Already we have displayed the content of CSV using pandas.

We can find the details of the CSV on the below screen shots.

![](https://imgur.com/STgtzCd.jpg)
![](https://imgur.com/VaHRyU0.jpg)

In [117]:
spr_df.tail()

Unnamed: 0,Paper Title,Authors,Year of Publication,repository_url
13,Applications of Artificial Intelligence Techni...,Prof. Dr. Aydin Azizi,(2019),https://link.springer.com/book/10.1007/978-981...
14,Contemporary Challenges in Cooperation and Coo...,Agnieszka Zakrzewska-Bielawska…,(2020),https://link.springer.com/book/10.1007/978-981...
15,Simulation for Industry 4.0,Dr. Murat M. Gunal,(2019),https://link.springer.com/book/10.1007/978-981...
16,Enabling Systems for Intelligent Manufacturing...,Dr. Arturo Molina…,(2021),https://link.springer.com/book/10.1007/978-981...
17,New Paradigm of Industry 4.0,Prof. Dr. Srikanta Patnaik,(2020),https://link.springer.com/book/10.1007/978-981...


In [129]:
jovian.commit()

<IPython.core.display.Javascript object>

[jovian] Committed successfully! https://jovian.ai/mohammed-shahid920/webscrapping-project-1


'https://jovian.ai/mohammed-shahid920/webscrapping-project-1'

# Summary


Finally, we have managed to parse 'Springer Webiste' to get our hands on very interesting and insightful data when it comes to the books, journals, videos.

We have saved all the information we could extract from that website for our needs in a CSV file.

## References


[1] Python offical documentation. https://docs.python.org/3/


[2] Requests library. https://pypi.org/project/requests/


[3] Beautiful Soup documentation. https://www.crummy.com/software/BeautifulSoup/bs4/doc/


[4] Aakash N S, Introduction to Web Scraping, 2021. https://jovian.ai/aakashns/python-web-scraping-and-rest-api


[5] Pandas library documentation. https://pandas.pydata.org/docs/


[6] Springer Site. https://link.springer.com/


[7] Web Scraping Article. https://www.toptal.com/python/web-scraping-with-python


In [142]:
jovian.commit()

<IPython.core.display.Javascript object>

[jovian] Updating notebook "mohammed-shahid920/webscrapping-project-1" on https://jovian.ai/
[jovian] Committed successfully! https://jovian.ai/mohammed-shahid920/webscrapping-project-1


'https://jovian.ai/mohammed-shahid920/webscrapping-project-1'