# Web Scraping using Python

## Target Journal: MRM

- https://onlinelibrary.wiley.com/toc/15222594/2022/87/2
- https://onlinelibrary.wiley.com/toc/15222594/2022/87/1

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#To easily display the plots, make sure to include the line %matplotlib inline as shown below.
%matplotlib inline

import urllib
import requests
import lxml

#To perform web scraping, you should also import the libraries shown below. 
#The urllib.request module is used to open URLs. 
#The Beautiful Soup package is used to extract data from html files. 
#The Beautiful Soup library's name is bs4 which stands for Beautiful Soup, version 4.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

## First, save as a local html file
 - See: https://zetcode.com/python/beautifulsoup/

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [18]:
# MRM Vol 87 No 2
URL = '/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/rawdata/Magnetic Resonance in Medicine_ Vol 87, No 2.html'
# MRM Vol 87 No 1
# URL = '/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/rawdata/Magnetic Resonance in Medicine_ Vol 87, No 1.html'

with open(URL, 'r') as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

Parse the saved page with a helper function, then identify the title information

In [19]:
def safeOpenParsePage(targetUrl):
    # targetUrl is the path of a locally saved HTML file, so catch file errors
    # (OSError, which also covers urllib.error.HTTPError) instead of HTTPError only.
    try:
        # tmpurl = urlopen(Request(targetUrl, headers={'User-Agent': 'Chrome/92.0.4515.107'}))
        with open(targetUrl, 'r') as tmpurl:
            tmpR = tmpurl.read()
        # tmpSoup = BeautifulSoup(tmpR, 'html.parser')
        tmpSoup = BeautifulSoup(tmpR, 'lxml')
        return tmpSoup
    except OSError as e:
        print(e)
        return None

soupMRM = safeOpenParsePage(URL)
if soupMRM is not None:
    print(soupMRM.prettify())

<!DOCTYPE html>
<html class="pb-page" data-request-id="77439159-6e1b-4bfb-8a48-74ca1c8a7e72" lang="en">
 <head data-pb-dropzone="head">
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content=";journal:journal:15222594;ctype:string:Journal Content;website:website:pericles;requestedJournal:journal:15222594;page:string:Table of Contents;wgroup:string:Publication Websites;pageGroup:string:Publication Pages;issue:issue:doi\:10.1002/mrm.v87.2" name="pbContext"/>
  <script type="text/javascript">
   var $DoubleClickZone = "magres-med_mrm";var $DoubleClickSite =  "wly.radiol.imag_000105";
  </script>
  <script id="analyticDigitalData">
   digitalData = {"site":{"ip":"152.78.0.24","environment":"LIVE","website":"onlinelibrary.wiley.com","websiteCode":"pericles","serverDate":"2022-01-09"},"identities":[{"type":"BasicGroup","uuid":"e623deab-c83b-41e3-84a2-d9f1d46c5f17"},{"type":"ReferrerUser","uuid":"5a2c1d70-1255-4a45-95f3-a3aa3a2771fb"},{"type":"InstitutionUser","uuid":"58cbd8

In [20]:
# Get the title
title = soupMRM.title
print(title)

<title> Magnetic Resonance in Medicine: Vol 87, No 2</title>


In [21]:
# Print out the text
text = soupMRM.get_text()
print(text)

# Another way to extract text
# str(all_links[103]).split("<h2>")[1].replace("</h2></a>", "")



var $DoubleClickZone = "magres-med_mrm";var $DoubleClickSite =  "wly.radiol.imag_000105";digitalData = {"site":{"ip":"152.78.0.24","environment":"LIVE","website":"onlinelibrary.wiley.com","websiteCode":"pericles","serverDate":"2022-01-09"},"identities":[{"type":"BasicGroup","uuid":"e623deab-c83b-41e3-84a2-d9f1d46c5f17"},{"type":"ReferrerUser","uuid":"5a2c1d70-1255-4a45-95f3-a3aa3a2771fb"},{"type":"InstitutionUser","uuid":"58cbd8e6-3c5d-4fc0-ac26-57c4792d2c4b","customerRecords":[{"customerDomain":"ALM-CU","customerNumber":"EALWB000204"}]},{"type":"SmartGroupUser","uuid":"cf9c85b9-7f3b-4309-aeaa-c41669cabe3b"},{"type":"SmartGroupUser","uuid":"db8b8d10-1aed-4809-b3ee-70938c569014"},{"type":"SmartGroupUser","uuid":"cdce41c2-a8de-4a64-a4d9-a85caaaa2314"},{"type":"SmartGroupUser","uuid":"cb1790a9-a0b9-40b3-915d-806df593aeeb"},{"type":"ReferrerUser","uuid":"6401b279-b825-4f41-a389-04e29feb967f"},{"type":"InstitutionUser","uuid":"bfa4ccf8-8126-488a-b090-da278aa7b4d9","customerRecords":[{"cus
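The text dump above begins with JavaScript taken from the page's `<script>` tags rather than readable content. A minimal sketch of one way around this is to remove `script` and `style` elements before calling `get_text()`. The helper name `visible_text` and the toy HTML string are assumptions for illustration, not the real Wiley page, and whether script text appears at all may depend on the installed Beautiful Soup version.

```python
from bs4 import BeautifulSoup

def visible_text(html):
    """Return page text with <script> and <style> contents removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Calling the soup like a function is shorthand for find_all().
    for tag in soup(["script", "style"]):
        tag.decompose()  # drop the element and everything inside it
    # Join the remaining strings with single spaces and trim whitespace.
    return soup.get_text(" ", strip=True)

# Toy document standing in for the saved journal page (an assumption).
sample_html = """
<html><head><title>Demo issue</title>
<script>var tracking = "abc";</script></head>
<body><h2>First article title</h2><p>Some summary text.</p></body></html>
"""
print(visible_text(sample_html))
```

The same loop could be applied to `soupMRM` before extracting text, at the cost of modifying the parsed tree in place.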

In [22]:
all_links = soupMRM.find_all('a')

### Explore Category

In [23]:
# h3 elements hold the category (section) headings, plus some site navigation headings

all_h3 = soupMRM.find_all("h3")

i = 0
for h3 in all_h3:
  # print(h3)
  print(h3.get_text())
  i += 1
print(i)

Menu
Format
Type of import
COVER IMAGE
ISSUE INFORMATION
RESEARCH ARTICLES—SPECTROSCOPIC METHODOLOGY
TECHNICAL NOTE—SPECTROSCOPIC METHODOLOGY
RESEARCH ARTICLE—PRECINICAL AND CLINICAL SPECTROSCOPY
RESEARCH ARTICLES—IMAGING METHODOLOGY
TECHNICAL NOTES—IMAGING METHODOLOGY
RESEARCH ARTICLES—PRECLINICAL AND CLINICAL IMAGING
TECHNICAL NOTE—PRECLINICAL AND CLINICAL IMAGING
RESEARCH ARTICLES—BIOPHYSICS AND BASIC BIOMEDICAL RESEARCH
TECHNICAL NOTE—BIOPHYSICS AND BASIC BIOMEDICAL RESEARCH
RESEARCH ARTICLES—COMPUTER PROCESSING AND MODELING
TECHNICAL NOTES—COMPUTER PROCESSING AND MODELING
RESEARCH ARTICLES—HARDWARE AND INSTRUMENTATION
About Wiley Online Library
Help & Support
Opportunities
Connect with Wiley
21
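In the output above, the issue's section headings are written entirely in capitals, while the site navigation headings ("Menu", "Help & Support", and so on) are not. A small sketch of a filter built on that observation follows; `category_headings` is a hypothetical helper, and the all-capitals rule is an assumption that holds for this issue but may not hold for other pages.

```python
def category_headings(headings):
    """Keep only the all-uppercase headings, which here are the issue sections."""
    return [h.strip() for h in headings if h.strip().isupper()]

# A few of the h3 texts printed above.
sample = ["Menu", "Format", "COVER IMAGE", "ISSUE INFORMATION",
          "RESEARCH ARTICLES—IMAGING METHODOLOGY", "Help & Support"]
print(category_headings(sample))
```

With the parsed page, the input could be `[h3.get_text() for h3 in all_h3]`.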


### Explore Original Research Subcategory

In [None]:
# h4 class corresponds to category

# all_h4 = soupMRM.find_all("h4")

# i = 0
# for h4 in all_h4:
#   # print(h4)
#   print(h4.get_text())
#   i += 1
# print(i)

0


In [24]:
all_h2 = soupMRM.find_all("h2")

i = 0
for h2 in all_h2:
  h2_len = len(h2.get_text().split())
  if h2_len > 3:
    print(h2.get_text())
    i += 1
print(i)

# str_h2 = str(all_h2)

# links = soupJMRI.find_all("a")
# str_links = str(links)
# print(str_h2)

# cleantext = BeautifulSoup(str_h2, "lxml").get_text()
# cleantext = BeautifulSoup(str_links, "lxml").get_text()
# print(cleantext)

Three-dimensional quantification of circulation using finite-element methods in four-dimensional flow MR data of the thoracic aorta
Absolute choline tissue concentration mapping for prostate cancer localization and characterization using 3D 1H MRSI without water-signal suppression
Uncertainty in denoising of MRSI using low-rank methods
Influence of editing pulse flip angle on J-difference MR spectroscopy
Spectroscopy-based multi-parametric quantification in subjects with liver iron overload at 1.5T and 3T
External Dynamic InTerference Estimation and Removal (EDITER) for low field MRI
Comparison of prospective and retrospective motion correction in 3D-encoded neuroanatomical MRI
Quantitative evaluation of prospective motion correction in healthy subjects at 7T MRI
MULTI-parametric MR imaging with fLEXible design (MULTIPLEX)
B1-gradient–based MRI using frequency-modulated Rabi-encoded echoes
MRI-guided attenuation correction in torso PET/MRI: Assessment of segmentation-, atlas-, and deep

## Next, clean up the title list

In [25]:
list_h2 = []
for h2 in all_h2:
  h2_len = len(h2.get_text().split())
  if h2_len > 3:
    cells = h2.get_text()
    list_h2.append(cells)

# Delete the last element: "Log in to Wiley Online Library"
list_h2.pop()

# Insert "Issue Information" as the 2nd element
list_h2.insert(1, "Issue Information")
# MRM Vol 86 No 3
# list_h2.insert(31, "Maxwell parallel imaging")

# Remove the last two elements of the list
# https://stackoverflow.com/questions/15715912/remove-the-last-n-elements-of-a-list
del list_h2[-2:]

for l in list_h2:
  print(l)
print(len(list_h2))

Three-dimensional quantification of circulation using finite-element methods in four-dimensional flow MR data of the thoracic aorta
Issue Information
Absolute choline tissue concentration mapping for prostate cancer localization and characterization using 3D 1H MRSI without water-signal suppression
Uncertainty in denoising of MRSI using low-rank methods
Influence of editing pulse flip angle on J-difference MR spectroscopy
Spectroscopy-based multi-parametric quantification in subjects with liver iron overload at 1.5T and 3T
External Dynamic InTerference Estimation and Removal (EDITER) for low field MRI
Comparison of prospective and retrospective motion correction in 3D-encoded neuroanatomical MRI
Quantitative evaluation of prospective motion correction in healthy subjects at 7T MRI
MULTI-parametric MR imaging with fLEXible design (MULTIPLEX)
B1-gradient–based MRI using frequency-modulated Rabi-encoded echoes
MRI-guided attenuation correction in torso PET/MRI: Assessment of segmentation-

### Add dates first published

In [26]:
# li class: corresponds to "First Published"

all_li = soupMRM.find_all("li")

# i = 0
# for li in all_li:
  # li_class = li.get_attribute_list("class")
  # print(li_class)
  # if li_class == ['ePubDate']:
    # print(li.get_text())
    # i += 1
  # print(li.get_text())
  # i += 1
# print(i)

list_date = []
for li in all_li:
  li_class = li.get_attribute_list("class")
  if li_class == ['ePubDate']:
    cells = li.get_text().split(': ')[1]
    list_date.append(cells)

for l in list_date:
  print(l)
print(len(list_date))

26 November 2021
26 November 2021
23 September 2021
21 September 2021
14 September 2021
23 September 2021
04 September 2021
07 September 2021
31 August 2021
31 August 2021
09 September 2021
04 September 2021
23 September 2021
05 October 2021
30 September 2021
02 October 2021
02 October 2021
04 September 2021
14 September 2021
20 October 2021
30 September 2021
30 September 2021
30 September 2021
30 September 2021
28 August 2021
14 September 2021
14 September 2021
23 September 2021
22 October 2021
07 September 2021
21 September 2021
05 October 2021
21 September 2021
05 October 2021
05 October 2021
05 October 2021
07 October 2021
31 August 2021
07 September 2021
28 August 2021
21 September 2021
10 October 2021
42
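The dates above are plain strings, so they cannot be sorted or compared chronologically as they stand. A minimal sketch of converting them with the standard library is shown below; `parse_pub_date` is a hypothetical helper, and the `%B` format code assumes the interpreter is running with English month names.

```python
from datetime import datetime

def parse_pub_date(text):
    """Convert a string such as '26 November 2021' into a datetime.date."""
    return datetime.strptime(text.strip(), "%d %B %Y").date()

# Three of the date strings printed above.
sample_dates = ["26 November 2021", "04 September 2021", "31 August 2021"]
parsed = sorted(parse_pub_date(d) for d in sample_dates)
print(parsed[0].isoformat())  # earliest date in the sample
```

Once the lists are in a DataFrame, `pd.to_datetime` on the 'First Published' column would be an alternative.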


In [27]:
list_url = []
for link in all_links:
  if 'visitable' in str(link.get("class")): # and '/doi/' in link.get("href"):
    cells = link.get("href").replace('/doi', 'https://doi.org')
    list_url.append(cells)

for l in list_url:
  print(l)
print(len(list_url))

https://doi.org/10.1002/mrm.29102
https://doi.org/10.1002/mrm.28860
https://doi.org/10.1002/mrm.29012
https://doi.org/10.1002/mrm.29018
https://doi.org/10.1002/mrm.29008
https://doi.org/10.1002/mrm.29021
https://doi.org/10.1002/mrm.28992
https://doi.org/10.1002/mrm.28991
https://doi.org/10.1002/mrm.28998
https://doi.org/10.1002/mrm.28999
https://doi.org/10.1002/mrm.29002
https://doi.org/10.1002/mrm.29003
https://doi.org/10.1002/mrm.29019
https://doi.org/10.1002/mrm.29023
https://doi.org/10.1002/mrm.29024
https://doi.org/10.1002/mrm.29027
https://doi.org/10.1002/mrm.29036
https://doi.org/10.1002/mrm.28995
https://doi.org/10.1002/mrm.28993
https://doi.org/10.1002/mrm.29016
https://doi.org/10.1002/mrm.29028
https://doi.org/10.1002/mrm.29035
https://doi.org/10.1002/mrm.29030
https://doi.org/10.1002/mrm.29029
https://doi.org/10.1002/mrm.28996
https://doi.org/10.1002/mrm.29005
https://doi.org/10.1002/mrm.29009
https://doi.org/10.1002/mrm.29025
https://doi.org/10.1002/mrm.28989
https://doi.or
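The cell above builds DOI links with `str.replace`, which rewrites every occurrence of `/doi` in the string. A slightly stricter sketch that only rewrites a leading `/doi/` prefix is given below. `to_doi_url` is a hypothetical helper, and it assumes the relative links have the form `/doi/<DOI>`, as the output above suggests.

```python
def to_doi_url(href):
    """Turn a relative '/doi/<DOI>' link into a https://doi.org/<DOI> URL."""
    prefix = "/doi/"
    if not href or not href.startswith(prefix):
        return None  # missing href, or not a relative article link
    return "https://doi.org/" + href[len(prefix):]

print(to_doi_url("/doi/10.1002/mrm.29102"))
```

Returning `None` for anything else makes unexpected links easy to spot and filter out before they reach the final table.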

In [28]:
# df = pd.DataFrame({'title':list_h2, 'url': list_url})
df = pd.DataFrame({'Journal': title.get_text(strip=True),  # title text, not the Tag object
                   #'Category': list_category, 
                   'Title': list_h2, 
                   'First Published': list_date, 
                   'DOI': list_url})
df

| | Journal | Title | First Published | DOI |
|---|---|---|---|---|
| 0 | Magnetic Resonance in Medicine: Vol 87, No 2 | Three-dimensional quantification of circulatio... | 26 November 2021 | https://doi.org/10.1002/mrm.29102 |
| 1 | Magnetic Resonance in Medicine: Vol 87, No 2 | Issue Information | 26 November 2021 | https://doi.org/10.1002/mrm.28860 |
| 2 | Magnetic Resonance in Medicine: Vol 87, No 2 | Absolute choline tissue concentration mapping ... | 23 September 2021 | https://doi.org/10.1002/mrm.29012 |
| 3 | Magnetic Resonance in Medicine: Vol 87, No 2 | Uncertainty in denoising of MRSI using low-ran... | 21 September 2021 | https://doi.org/10.1002/mrm.29018 |
| 4 | Magnetic Resonance in Medicine: Vol 87, No 2 | Influence of editing pulse flip angle on J-dif... | 14 September 2021 | https://doi.org/10.1002/mrm.29008 |
| 5 | Magnetic Resonance in Medicine: Vol 87, No 2 | Spectroscopy-based multi-parametric quantifica... | 23 September 2021 | https://doi.org/10.1002/mrm.29021 |
| 6 | Magnetic Resonance in Medicine: Vol 87, No 2 | External Dynamic InTerference Estimation and R... | 04 September 2021 | https://doi.org/10.1002/mrm.28992 |
| 7 | Magnetic Resonance in Medicine: Vol 87, No 2 | Comparison of prospective and retrospective mo... | 07 September 2021 | https://doi.org/10.1002/mrm.28991 |
| 8 | Magnetic Resonance in Medicine: Vol 87, No 2 | Quantitative evaluation of prospective motion ... | 31 August 2021 | https://doi.org/10.1002/mrm.28998 |
| 9 | Magnetic Resonance in Medicine: Vol 87, No 2 | MULTI-parametric MR imaging with fLEXible desi... | 31 August 2021 | https://doi.org/10.1002/mrm.28999 |
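Building the DataFrame only works if the title, date and URL lists all have the same number of entries, and the manual `pop`, `insert` and `del` steps earlier make a mismatch easy to introduce. A minimal sketch of a check that could run before the constructor is shown here; `check_equal_lengths` is a hypothetical helper and the lists are toy placeholders, not the scraped data.

```python
def check_equal_lengths(**columns):
    """Raise ValueError if the named lists do not all have the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths

# Toy lists standing in for list_h2, list_date and list_url.
titles = ["Article A", "Article B"]
dates = ["26 November 2021", "23 September 2021"]
urls = ["https://doi.org/10.1002/mrm.29102", "https://doi.org/10.1002/mrm.29012"]
print(check_equal_lengths(Title=titles, Date=dates, DOI=urls))
```

The error message lists every column's length, which makes it quicker to see which list gained or lost an entry.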


## Third, manually create the list of categories based on Issue Information
 - See [here](https://stackoverflow.com/questions/4654414/python-append-item-to-list-n-times) for repeating an element in a list N times

In [30]:
# list_category = ['Cover Image', 'Issue Information', 'Commentary']

### Vol 87 No 1 ###
# list_category = ['Cover Image', 'Issue Information']
# list_category.extend(['GUIDELINES—SPECTROSCOPIC METHODOLOGY'] * 1)
# list_category.extend(['RESEARCH ARTICLE—SPECTROSCOPIC METHODOLOGY'] * 1)
# list_category.extend(['TECHNICAL NOTE—SPECTROSCOPIC METHODOLOGY'] * 1)
# list_category.extend(['RESEARCH ARTICLE—PRECLINICAL AND CLINICAL SPECTROSCOPY'] * 1)
# list_category.extend(['RESEARCH ARTICLES—IMAGING METHODOLOGY'] * 13)
# list_category.extend(['TECHNICAL NOTES—IMAGING METHODOLOGY'] * 4)
# list_category.extend(['RAPID COMMUNICATION—PRECLINICAL AND CLINICAL IMAGING'] * 1)
# list_category.extend(['RESEARCH ARTICLE—PRECLINICAL AND CLINICAL IMAGING'] * 1)
# list_category.extend(['RESEARCH ARTICLES—BIOPHYSICS AND BASIC BIOMEDICAL RESEARCH'] * 6)
# list_category.extend(['TECHNICAL NOTE—BIOPHYSICS AND BASIC BIOMEDICAL RESEARCH'] * 1)
# list_category.extend(['RESEARCH ARTICLES—COMPUTER PROCESSING AND MODELING'] * 5)
# list_category.extend(['TECHNICAL NOTE—COMPUTER PROCESSING AND MODELING'] * 1)
# list_category.extend(['RESEARCH ARTICLES—HARDWARE AND INSTRUMENTATION'] * 3)
# list_category.extend(['TECHNICAL NOTE—HARDWARE AND INSTRUMENTATION'] * 1)

### Vol 87 No 2 ###
list_category = ['Cover Image', 'Issue Information']
list_category.extend(['RESEARCH ARTICLES—SPECTROSCOPIC METHODOLOGY'] * 2)
list_category.extend(['TECHNICAL NOTE—SPECTROSCOPIC METHODOLOGY'] * 1)
list_category.extend(['RESEARCH ARTICLE—PRECLINICAL AND CLINICAL SPECTROSCOPY'] * 1)
list_category.extend(['RESEARCH ARTICLES—IMAGING METHODOLOGY'] * 11)
list_category.extend(['TECHNICAL NOTES—IMAGING METHODOLOGY'] * 4)
list_category.extend(['RESEARCH ARTICLES—PRECLINICAL AND CLINICAL IMAGING'] * 2)
list_category.extend(['TECHNICAL NOTE—PRECLINICAL AND CLINICAL IMAGING'] * 1)
list_category.extend(['RESEARCH ARTICLES—BIOPHYSICS AND BASIC BIOMEDICAL RESEARCH'] * 3)
list_category.extend(['TECHNICAL NOTE—BIOPHYSICS AND BASIC BIOMEDICAL RESEARCH'] * 1)
list_category.extend(['RESEARCH ARTICLES—COMPUTER PROCESSING AND MODELING'] * 9)
list_category.extend(['TECHNICAL NOTES—COMPUTER PROCESSING AND MODELING'] * 2)
list_category.extend(['RESEARCH ARTICLES—HARDWARE AND INSTRUMENTATION'] * 3)

i = 0
for l in list_category:
  print(l)
  i += 1
print(i)

Cover Image
Issue Information
RESEARCH ARTICLES—SPECTROSCOPIC METHODOLOGY
RESEARCH ARTICLES—SPECTROSCOPIC METHODOLOGY
TECHNICAL NOTE—SPECTROSCOPIC METHODOLOGY
RESEARCH ARTICLE—PRECLINICAL AND CLINICAL SPECTROSCOPY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
TECHNICAL NOTES—IMAGING METHODOLOGY
TECHNICAL NOTES—IMAGING METHODOLOGY
TECHNICAL NOTES—IMAGING METHODOLOGY
TECHNICAL NOTES—IMAGING METHODOLOGY
RESEARCH ARTICLES—PRECLINICAL AND CLINICAL IMAGING
RESEARCH ARTICLES—PRECLINICAL AND CLINICAL IMAGING
TECHNICAL NOTE—PRECLINICAL AND CLINICAL IMAGING
RESEARCH ARTICLES—BIOPHYSICS AND BASIC BIOMEDICAL RESEARCH
RESEARCH ARTICL
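The repeated `extend` calls above can also be driven from a list of (category, count) pairs, which keeps the counts in one place and makes the total easy to verify. The sketch below uses only the first few pairs for Vol 87 No 2; `expand_categories` is a hypothetical helper, not part of the original notebook.

```python
def expand_categories(counts):
    """Repeat each category name by its article count, preserving order."""
    categories = []
    for name, n in counts:
        categories.extend([name] * n)
    return categories

# First few (category, count) pairs for Vol 87 No 2, taken from the cell above.
counts = [
    ("Cover Image", 1),
    ("Issue Information", 1),
    ("RESEARCH ARTICLES—SPECTROSCOPIC METHODOLOGY", 2),
    ("TECHNICAL NOTE—SPECTROSCOPIC METHODOLOGY", 1),
]
expanded = expand_categories(counts)
# The total should equal the number of rows the categories will be attached to.
assert len(expanded) == sum(n for _, n in counts)
print(len(expanded))
```

For the full issue, the expanded list needs as many entries as there are rows in `df`, otherwise `df.insert` raises an error.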

In [31]:
df.insert(1, 'Category', list_category)
df

| | Journal | Category | Title | First Published | DOI |
|---|---|---|---|---|---|
| 0 | Magnetic Resonance in Medicine: Vol 87, No 2 | Cover Image | Three-dimensional quantification of circulatio... | 26 November 2021 | https://doi.org/10.1002/mrm.29102 |
| 1 | Magnetic Resonance in Medicine: Vol 87, No 2 | Issue Information | Issue Information | 26 November 2021 | https://doi.org/10.1002/mrm.28860 |
| 2 | Magnetic Resonance in Medicine: Vol 87, No 2 | RESEARCH ARTICLES—SPECTROSCOPIC METHODOLOGY | Absolute choline tissue concentration mapping ... | 23 September 2021 | https://doi.org/10.1002/mrm.29012 |
| 3 | Magnetic Resonance in Medicine: Vol 87, No 2 | RESEARCH ARTICLES—SPECTROSCOPIC METHODOLOGY | Uncertainty in denoising of MRSI using low-ran... | 21 September 2021 | https://doi.org/10.1002/mrm.29018 |
| 4 | Magnetic Resonance in Medicine: Vol 87, No 2 | TECHNICAL NOTE—SPECTROSCOPIC METHODOLOGY | Influence of editing pulse flip angle on J-dif... | 14 September 2021 | https://doi.org/10.1002/mrm.29008 |
| 5 | Magnetic Resonance in Medicine: Vol 87, No 2 | RESEARCH ARTICLE—PRECLINICAL AND CLINICAL SPEC... | Spectroscopy-based multi-parametric quantifica... | 23 September 2021 | https://doi.org/10.1002/mrm.29021 |
| 6 | Magnetic Resonance in Medicine: Vol 87, No 2 | RESEARCH ARTICLES—IMAGING METHODOLOGY | External Dynamic InTerference Estimation and R... | 04 September 2021 | https://doi.org/10.1002/mrm.28992 |
| 7 | Magnetic Resonance in Medicine: Vol 87, No 2 | RESEARCH ARTICLES—IMAGING METHODOLOGY | Comparison of prospective and retrospective mo... | 07 September 2021 | https://doi.org/10.1002/mrm.28991 |
| 8 | Magnetic Resonance in Medicine: Vol 87, No 2 | RESEARCH ARTICLES—IMAGING METHODOLOGY | Quantitative evaluation of prospective motion ... | 31 August 2021 | https://doi.org/10.1002/mrm.28998 |
| 9 | Magnetic Resonance in Medicine: Vol 87, No 2 | RESEARCH ARTICLES—IMAGING METHODOLOGY | MULTI-parametric MR imaging with fLEXible desi... | 31 August 2021 | https://doi.org/10.1002/mrm.28999 |


## Save as csv

In [32]:
df.to_csv(path_or_buf='/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/processed/mrm/mrm-vol-87-no-2.csv', 
          #path_or_buf='/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/processed/mrm/mrm-vol-87-no-1.csv', 
          index=False)
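Passing `index=False` keeps the row index out of the file, which avoids an extra `Unnamed: 0` column when the CSV is read back later. A short round-trip sketch is shown below. It writes to an in-memory buffer instead of Google Drive, and the frame contents are placeholders, not the scraped data.

```python
import io
import pandas as pd

# Toy frame with some of the column names used above (values are placeholders).
demo = pd.DataFrame({
    "Journal": ["Demo Journal: Vol 1, No 1"] * 2,
    "Title": ["Article A", "Article B"],
    "First Published": ["26 November 2021", "23 September 2021"],
})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)   # same call as above, but to an in-memory buffer
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# With index=False, only the real columns come back.
print(list(reloaded.columns))
```

The comma inside the journal name is quoted automatically by `to_csv`, so it survives the round trip as a single field.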

## (Optional) Save as xlsx file
 - See [here](https://xlsxwriter.readthedocs.io/working_with_pandas.html) for instruction

In [None]:
# !pip install xlsxwriter
import xlsxwriter

# (Comment out after saving first sheet) Create an ExcelWriter object
# writer = pd.ExcelWriter('/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/processed/jmri-all-summary.xlsx', engine='xlsxwriter')

In [None]:
df.to_excel(excel_writer=writer,
            # sheet_name='mrm-vol-87-no-1',
            sheet_name='mrm-vol-87-no-2',
            index=False)

In [None]:
# (Uncomment at the end) Close the Pandas Excel writer and output the Excel file.
# writer.save() was removed in pandas 2.0; writer.close() saves and closes the file.
writer.close()
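The comment-and-uncomment workflow above depends on the writer object surviving between notebook runs. A sketch of an alternative is to collect the per-issue DataFrames first and write them all in one `with` block, which closes the writer automatically. `write_issue_sheets`, the file name and the sheet names are hypothetical, and the sketch assumes the `xlsxwriter` package is installed.

```python
import pandas as pd

def write_issue_sheets(path, frames):
    """Write each DataFrame in `frames` (sheet name -> frame) to one workbook."""
    # Leaving the with-block closes the writer, which writes the file to disk.
    with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
        for sheet_name, frame in frames.items():
            frame.to_excel(writer, sheet_name=sheet_name, index=False)

# Hypothetical usage; the path, sheet names and frame variables are placeholders:
# write_issue_sheets("mrm-summary.xlsx",
#                    {"mrm-vol-87-no-1": df_no1, "mrm-vol-87-no-2": df_no2})
```

Because XlsxWriter creates new files and cannot modify existing ones, all sheets for a workbook need to be written in the same call.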