# Web Scraping using Python

## Target Journal: MRM (Magnetic Resonance in Medicine)

- https://onlinelibrary.wiley.com/toc/15222594/2021/86/4
- https://onlinelibrary.wiley.com/toc/15222594/2021/86/3
- https://onlinelibrary.wiley.com/toc/15222594/2021/86/2
- https://onlinelibrary.wiley.com/toc/15222594/2021/86/1
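
The four table-of-contents URLs above share one pattern (journal code, year, volume, issue). As a minimal sketch, they could be generated rather than typed by hand. The helper name `build_toc_urls` is hypothetical; the journal code `15222594` is taken from the URLs themselves.

```python
# Hypothetical helper: rebuild the issue URLs listed above from their shared pattern.
BASE = "https://onlinelibrary.wiley.com/toc"
JOURNAL_CODE = "15222594"  # taken from the URLs above


def build_toc_urls(year, volume, issues):
    """Return one table-of-contents URL per issue number."""
    return [f"{BASE}/{JOURNAL_CODE}/{year}/{volume}/{issue}" for issue in issues]


toc_urls = build_toc_urls(2021, 86, [4, 3, 2, 1])
for url in toc_urls:
    print(url)
```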

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Display plots inline in the notebook.
%matplotlib inline

import urllib
import requests
import lxml

# Libraries for web scraping:
# - urllib.request opens URLs.
# - Beautiful Soup extracts data from HTML files; its package name is bs4 (Beautiful Soup, version 4).
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

## First, save the page as a local HTML file
 - See: https://zetcode.com/python/beautifulsoup/

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# MRM Vol 86 No 4
URL = '/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/rawdata/Magnetic Resonance in Medicine_ Vol 86, No 4.html'

# Read the saved HTML file, then parse it with the lxml parser
with open(URL, 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

Parse the saved page with a helper function, then locate the title information.

In [5]:
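def safeOpenParsePage(targetUrl):
    """Read a locally saved HTML file and return it as a BeautifulSoup object, or None on failure."""
    try:
        # To fetch from the web instead of reading a local file:
        # tmpurl = urlopen(Request(targetUrl, headers={'User-Agent': 'Chrome/92.0.4515.107'}))
        with open(targetUrl, 'r') as tmpurl:
            tmpR = tmpurl.read()
        # tmpSoup = BeautifulSoup(tmpR, 'html.parser')
        tmpSoup = BeautifulSoup(tmpR, 'lxml')
        return tmpSoup
    except OSError as e:
        # OSError covers a missing or unreadable file; urllib.error.HTTPError is a
        # subclass of OSError, so the urlopen variant is handled here as well.
        print(e)
        return None

soupMRM = safeOpenParsePage(URL)
if soupMRM is not None:
    print(soupMRM.prettify())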

<!DOCTYPE html>
<html class="pb-page" data-request-id="6fa2a511-9080-4391-bd0b-4f966ca00334" lang="en">
 <head data-pb-dropzone="head">
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content=";journal:journal:15222594;ctype:string:Journal Content;website:website:pericles;requestedJournal:journal:15222594;page:string:Table of Contents;wgroup:string:Publication Websites;pageGroup:string:Publication Pages;issue:issue:doi\:10.1111/mrm.v86.4" name="pbContext"/>
  <script type="text/javascript">
   var $DoubleClickZone = "magres-med_mrm";var $DoubleClickSite =  "wly.radiol.imag_000105";
  </script>
  <script id="analyticDigitalData">
   digitalData = {"site":{"ip":"152.78.0.24","environment":"READONLY","website":"onlinelibrary.wiley.com","websiteCode":"pericles","serverDate":"2021-09-11","server":"web118"},"identities":[{"type":"BasicGroup","uuid":"e623deab-c83b-41e3-84a2-d9f1d46c5f17"},{"type":"InstitutionUser","uuid":"58cbd8e6-3c5d-4fc0-ac26-57c4792d2c4b","customerRecords

In [6]:
# Get the title
title = soupMRM.title
print(title)

<title> Magnetic Resonance in Medicine: Vol 86, No 4</title>


In [7]:
# Print out the text
text = soupMRM.get_text()
print(text)

# Another way to extract text
# str(all_links[103]).split("<h2>")[1].replace("</h2></a>", "")



var $DoubleClickZone = "magres-med_mrm";var $DoubleClickSite =  "wly.radiol.imag_000105";digitalData = {"site":{"ip":"152.78.0.24","environment":"READONLY","website":"onlinelibrary.wiley.com","websiteCode":"pericles","serverDate":"2021-09-11","server":"web118"},"identities":[{"type":"BasicGroup","uuid":"e623deab-c83b-41e3-84a2-d9f1d46c5f17"},{"type":"InstitutionUser","uuid":"58cbd8e6-3c5d-4fc0-ac26-57c4792d2c4b","customerRecords":[{"customerDomain":"ALM-CU","customerNumber":"EALWB000204"}]},{"type":"SmartGroupUser","uuid":"cf9c85b9-7f3b-4309-aeaa-c41669cabe3b"},{"type":"SmartGroupUser","uuid":"db8b8d10-1aed-4809-b3ee-70938c569014"},{"type":"SmartGroupUser","uuid":"cdce41c2-a8de-4a64-a4d9-a85caaaa2314"},{"type":"InstitutionUser","uuid":"bfa4ccf8-8126-488a-b090-da278aa7b4d9","customerRecords":[{"customerDomain":"ALM-CU","customerNumber":"ALOG00017"}]},{"type":"BasicGroup","uuid":"6a17ff38-0daf-4f0a-b4eb-b42c69d5c12f","customerRecords":[{"customerDomain":"ALM-CU","customerNumber":"CORE0

In [8]:
all_links = soupMRM.find_all('a')

### Explore Category

In [9]:
# <h3> elements hold the section (category) headings, plus a few site navigation headings

all_h3 = soupMRM.find_all("h3")

i = 0
for h3 in all_h3:
  # print(h3)
  print(h3.get_text())
  i += 1
print(i)

Menu
Format
Type of import
COVER IMAGE
ISSUE INFORMATION
OBITUARY
RESEARCH ARTICLE—PRECLINICAL AND CLINICAL SPECTROSCOPY
RESEARCH ARTICLES—IMAGING METHODOLOGY
TECHNICAL NOTES—IMAGING METHODOLOGY
RESEARCH ARTICLES—PRECLINICAL AND CLINICAL IMAGING
TECHNICAL NOTE—PRECLINICAL AND CLINICAL IMAGING
TECHNICAL NOTE—BIOPHYSICS AND BASIC BIOMEDICAL RESEARCH
RESEARCH ARTICLES—COMPUTER PROCESSING AND MODELING
TECHNICAL NOTE—COMPUTER PROCESSING AND MODELING
RESEARCH ARTICLES—HARDWARE AND INSTRUMENTATION
RESEARCH ARTICLE—ESR
About Wiley Online Library
Help & Support
Opportunities
Connect with Wiley
20
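
Because each `<h3>` category heading precedes the article titles that belong to it, the category of every title could in principle be read off the page order instead of being typed in by hand. The sketch below illustrates that idea on a tiny made-up page. It uses only the standard library's `html.parser` so that it runs without extra installs; the same walk could be done with `soupMRM.find_all(['h2', 'h3'])`. The class name `CategoryTitleParser` and the sample HTML are assumptions, and on the real page the navigation headings (for example "Menu") would still need to be filtered out.

```python
from html.parser import HTMLParser


class CategoryTitleParser(HTMLParser):
    """Pair each <h2> title with the most recent <h3> heading seen before it."""

    def __init__(self):
        super().__init__()
        self.current_tag = None       # 'h2' or 'h3' while inside one, else None
        self.buffer = []              # text fragments of the heading being read
        self.current_category = None  # text of the last <h3> seen
        self.pairs = []               # (category, title) tuples in page order

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3"):
            self.current_tag = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current_tag is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self.current_tag:
            return
        text = "".join(self.buffer).strip()
        if tag == "h3":
            self.current_category = text
        elif self.current_category is not None:
            self.pairs.append((self.current_category, text))
        self.current_tag = None


# Made-up page that mimics the heading layout assumed here.
sample_html = """
<h3>OBITUARY</h3>
<a href="/doi/10.1002/mrm.28838"><h2>In memoriam: John R. Mallard (1927-2021)</h2></a>
<h3>EXAMPLE CATEGORY</h3>
<a href="#"><h2>First example title</h2></a>
<a href="#"><h2>Second example title</h2></a>
"""

parser = CategoryTitleParser()
parser.feed(sample_html)
for category, title in parser.pairs:
    print(category, "|", title)
```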


### Explore Original Research Subcategory

In [10]:
# Check whether <h4> elements hold sub-categories (none are found on this page)

all_h4 = soupMRM.find_all("h4")

i = 0
for h4 in all_h4:
  # print(h4)
  print(h4.get_text())
  i += 1
print(i)

0


In [11]:
all_h2 = soupMRM.find_all("h2")

i = 0
for h2 in all_h2:
  h2_len = len(h2.get_text().split())
  if h2_len > 3:
    print(h2.get_text())
    i += 1
print(i)

# str_h2 = str(all_h2)

# links = soupMRM.find_all("a")
# str_links = str(links)
# print(str_h2)

# cleantext = BeautifulSoup(str_h2, "lxml").get_text()
# cleantext = BeautifulSoup(str_links, "lxml").get_text()
# print(cleantext)

Multi-echo gradient-recalled-echo phase unwrapping using a Nyquist sampled virtual echo train in the presence of high-field gradients
In memoriam: John R. Mallard (1927-2021)
Spectral fitting strategy to overcome the overlap between 2-hydroxyglutarate and lipid resonances at 2.25 ppm
A model selection framework to quantify microvascular liver function in gadoxetate-enhanced MRI: Application to healthy liver, diseased tissue, and hepatocellular carcinoma
Pulseq-CEST: Towards multi-site multi-vendor compatibility and reproducibility of CEST experiments using an open-source sequence standard
Systematic evaluation of iterative deep neural networks for fast parallel MRI reconstruction with sensitivity-weighted coil combination
Local perturbation responses and checkerboard tests: Characterization tools for nonlinear MRI methods
Sources of systematic error in DCE-MRI estimation of low-level blood-brain barrier leakage
Real-time deep artifact suppression using recurrent U-Nets for low-latency 

## Next, clean up the title list

In [19]:
list_h2 = []
for h2 in all_h2:
  h2_len = len(h2.get_text().split())
  if h2_len > 3:
    cells = h2.get_text()
    list_h2.append(cells)

# Delete the last element: "Log in to Wiley Online Library"
list_h2.pop()

# "Issue Information" has only two words, so the length filter above dropped it;
# re-insert it as the 2nd element
list_h2.insert(1, "Issue Information")

# MRM Vol 86 No 4 only: remove the last two remaining elements
# https://stackoverflow.com/questions/15715912/remove-the-last-n-elements-of-a-list
del list_h2[-2:]

for l in list_h2:
  print(l)
print(len(list_h2))

Multi-echo gradient-recalled-echo phase unwrapping using a Nyquist sampled virtual echo train in the presence of high-field gradients
Issue Information
In memoriam: John R. Mallard (1927-2021)
Spectral fitting strategy to overcome the overlap between 2-hydroxyglutarate and lipid resonances at 2.25 ppm
A model selection framework to quantify microvascular liver function in gadoxetate-enhanced MRI: Application to healthy liver, diseased tissue, and hepatocellular carcinoma
Pulseq-CEST: Towards multi-site multi-vendor compatibility and reproducibility of CEST experiments using an open-source sequence standard
Systematic evaluation of iterative deep neural networks for fast parallel MRI reconstruction with sensitivity-weighted coil combination
Local perturbation responses and checkerboard tests: Characterization tools for nonlinear MRI methods
Sources of systematic error in DCE-MRI estimation of low-level blood-brain barrier leakage
Real-time deep artifact suppression using recurrent U-Net
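
The clean-up above depends on list positions (`pop`, `insert(1, ...)`, `del [-2:]`), which may differ from issue to issue. A hedged alternative is to filter by content. In the sketch below, only "Log in to Wiley Online Library" and "Issue Information" come from this notebook; the function name and the two sets are assumptions to be adjusted after inspecting each page. It also assumes "Issue Information" appears among the headings in page order; if it does not, the manual insert is still needed.

```python
# Headings known not to be article titles. Only this entry comes from the
# notebook above; extend the set after inspecting a given issue page.
NON_ARTICLE_HEADINGS = {"Log in to Wiley Online Library"}
# Short headings that are real entries and must survive the word-count filter.
SHORT_BUT_VALID = {"Issue Information"}


def clean_titles(headings, min_words=4):
    """Keep headings that look like article titles, preserving page order."""
    cleaned = []
    for heading in headings:
        text = heading.strip()
        if text in NON_ARTICLE_HEADINGS:
            continue
        if text in SHORT_BUT_VALID or len(text.split()) >= min_words:
            cleaned.append(text)
    return cleaned


sample = [
    "Menu",
    "In memoriam: John R. Mallard (1927-2021)",
    "Issue Information",
    "Log in to Wiley Online Library",
]
print(clean_titles(sample))
```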

### Add first-published dates

In [13]:
# <li class="ePubDate"> elements hold the "First Published" date

all_li = soupMRM.find_all("li")

# i = 0
# for li in all_li:
  # li_class = li.get_attribute_list("class")
  # print(li_class)
  # if li_class == ['ePubDate']:
    # print(li.get_text())
    # i += 1
  # print(li.get_text())
  # i += 1
# print(i)

list_date = []
for li in all_li:
  li_class = li.get_attribute_list("class")
  if li_class == ['ePubDate']:
    cells = li.get_text().split(': ')[1]
    list_date.append(cells)

for l in list_date:
  print(l)
print(len(list_date))

20 July 2021
20 July 2021
06 June 2021
12 May 2021
11 May 2021
07 May 2021
10 June 2021
03 June 2021
18 May 2021
25 May 2021
12 May 2021
12 May 2021
19 May 2021
17 June 2021
02 June 2021
06 June 2021
31 May 2021
07 June 2021
31 May 2021
31 May 2021
10 June 2021
28 May 2021
31 May 2021
05 May 2021
22 May 2021
06 June 2021
15 May 2021
18 May 2021
12 May 2021
03 June 2021
24 May 2021
18 May 2021
06 May 2021
19 May 2021
24 May 2021
25 May 2021
09 June 2021
20 May 2021
24 May 2021
03 June 2021
03 June 2021
03 May 2021
42
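
The dates are kept as plain strings, which sort alphabetically rather than chronologically. If real dates are needed later, they could be parsed with the standard library, as in this minimal sketch. The function name is hypothetical, and the "First published:" label in the first example is an assumption about the text of the `<li>` element; the notebook only shows that the text is split on `': '`. Note that `%B` matches month names in the current locale, so an English locale is assumed.

```python
from datetime import date, datetime


def parse_first_published(text):
    """Convert text such as 'First published: 20 July 2021' into a date."""
    # Keep only the part after the label, if a label is present.
    date_part = text.split(": ", 1)[-1].strip()
    return datetime.strptime(date_part, "%d %B %Y").date()


print(parse_first_published("First published: 20 July 2021"))
print(parse_first_published("03 May 2021"))
```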


In [14]:
list_url = []
for link in all_links:
  if 'visitable' in str(link.get("class")): # and '/doi/' in link.get("href"):
    cells = link.get("href").replace('/doi', 'https://doi.org')
    list_url.append(cells)

for l in list_url:
  print(l)
print(len(list_url))

https://doi.org/10.1002/mrm.28920
https://doi.org/10.1002/mrm.28373
https://doi.org/10.1002/mrm.28838
https://doi.org/10.1002/mrm.28829
https://doi.org/10.1002/mrm.28798
https://doi.org/10.1002/mrm.28825
https://doi.org/10.1002/mrm.28827
https://doi.org/10.1002/mrm.28828
https://doi.org/10.1002/mrm.28833
https://doi.org/10.1002/mrm.28834
https://doi.org/10.1002/mrm.28835
https://doi.org/10.1002/mrm.28837
https://doi.org/10.1002/mrm.28847
https://doi.org/10.1002/mrm.28846
https://doi.org/10.1002/mrm.28850
https://doi.org/10.1002/mrm.28851
https://doi.org/10.1002/mrm.28854
https://doi.org/10.1002/mrm.28855
https://doi.org/10.1002/mrm.28856
https://doi.org/10.1002/mrm.28858
https://doi.org/10.1002/mrm.28857
https://doi.org/10.1002/mrm.28872
https://doi.org/10.1002/mrm.28871
https://doi.org/10.1002/mrm.28826
https://doi.org/10.1002/mrm.28832
https://doi.org/10.1002/mrm.28775
https://doi.org/10.1002/mrm.28842
https://doi.org/10.1002/mrm.28844
https://doi.org/10.1002/mrm.28824
https://doi.or
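
The cell above relies on every `visitable` link having an `href` that starts with `/doi`, and `str.replace` substitutes every occurrence of that substring. A slightly more defensive sketch extracts the DOI itself with a regular expression and skips links without one. The pattern is a loose approximation of DOI syntax, not an official definition, and the function name is hypothetical.

```python
import re

# Loose DOI pattern: "10.", a registrant code of 4 to 9 digits, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s?#]+")


def href_to_doi_url(href):
    """Return the https://doi.org/ link for the DOI found in href, or None."""
    match = DOI_PATTERN.search(href)
    if match is None:
        return None
    return "https://doi.org/" + match.group(0)


print(href_to_doi_url("/doi/10.1002/mrm.28920"))
print(href_to_doi_url("/toc/15222594/2021/86/4"))
```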

In [20]:
# df = pd.DataFrame({'title':list_h2, 'url': list_url})
df = pd.DataFrame({'Journal': title, 
                   #'Category': list_category, 
                   'Title': list_h2, 
                   'First Published': list_date, 
                   'DOI': list_url})
df

| | Journal | Title | First Published | DOI |
|---|---|---|---|---|
| 0 | Magnetic Resonance in Medicine: Vol 86, No 4 | Multi-echo gradient-recalled-echo phase unwrap... | 20 July 2021 | https://doi.org/10.1002/mrm.28920 |
| 1 | Magnetic Resonance in Medicine: Vol 86, No 4 | Issue Information | 20 July 2021 | https://doi.org/10.1002/mrm.28373 |
| 2 | Magnetic Resonance in Medicine: Vol 86, No 4 | In memoriam: John R. Mallard (1927-2021) | 06 June 2021 | https://doi.org/10.1002/mrm.28838 |
| 3 | Magnetic Resonance in Medicine: Vol 86, No 4 | Spectral fitting strategy to overcome the over... | 12 May 2021 | https://doi.org/10.1002/mrm.28829 |
| 4 | Magnetic Resonance in Medicine: Vol 86, No 4 | A model selection framework to quantify microv... | 11 May 2021 | https://doi.org/10.1002/mrm.28798 |
| 5 | Magnetic Resonance in Medicine: Vol 86, No 4 | Pulseq-CEST: Towards multi-site multi-vendor c... | 07 May 2021 | https://doi.org/10.1002/mrm.28825 |
| 6 | Magnetic Resonance in Medicine: Vol 86, No 4 | Systematic evaluation of iterative deep neural... | 10 June 2021 | https://doi.org/10.1002/mrm.28827 |
| 7 | Magnetic Resonance in Medicine: Vol 86, No 4 | Local perturbation responses and checkerboard ... | 03 June 2021 | https://doi.org/10.1002/mrm.28828 |
| 8 | Magnetic Resonance in Medicine: Vol 86, No 4 | Sources of systematic error in DCE-MRI estimat... | 18 May 2021 | https://doi.org/10.1002/mrm.28833 |
| 9 | Magnetic Resonance in Medicine: Vol 86, No 4 | Real-time deep artifact suppression using recu... | 25 May 2021 | https://doi.org/10.1002/mrm.28834 |
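
`pd.DataFrame` raises an error when the lists passed to it differ in length, and after the manual clean-up steps it is easy for the titles, dates, and DOIs to drift out of step. A small check before building the frame can report which list is off. This is a sketch only: `check_equal_lengths` is a hypothetical helper, and the short lists stand in for `list_h2`, `list_date`, and `list_url`.

```python
def check_equal_lengths(**columns):
    """Raise ValueError listing every column length if they are not all equal."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


# Made-up stand-ins for list_h2, list_date, and list_url.
titles = ["Title A", "Title B"]
dates = ["20 July 2021", "06 June 2021"]
urls = ["https://doi.org/10.1002/mrm.28920", "https://doi.org/10.1002/mrm.28838"]

n_rows = check_equal_lengths(Title=titles, Date=dates, DOI=urls)
print(n_rows)
```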


## Third, manually create the list of categories based on Issue Information
 - See [here](https://stackoverflow.com/questions/4654414/python-append-item-to-list-n-times) for repeating an element in a list N times

In [27]:
# list_category = ['Cover Image', 'Issue Information', 'Commentary']

### Vol 86 No 4 ###
list_category = ['Cover Image', 'Issue Information', 'Obituary']
list_category.extend(['RESEARCH ARTICLE—PRECLINICAL AND CLINICAL SPECTROSCOPY'] * 1)
list_category.extend(['RESEARCH ARTICLES—IMAGING METHODOLOGY'] * 19)
list_category.extend(['TECHNICAL NOTES—IMAGING METHODOLOGY'] * 2)
list_category.extend(['RESEARCH ARTICLES—PRECLINICAL AND CLINICAL IMAGING'] * 3)
list_category.extend(['TECHNICAL NOTE—PRECLINICAL AND CLINICAL IMAGING'] * 1)
list_category.extend(['TECHNICAL NOTE—BIOPHYSICS AND BASIC BIOMEDICAL RESEARCH'] * 1)
list_category.extend(['RESEARCH ARTICLES—COMPUTER PROCESSING AND MODELING'] * 7)
list_category.extend(['TECHNICAL NOTE—COMPUTER PROCESSING AND MODELING'] * 1)
list_category.extend(['RESEARCH ARTICLES—HARDWARE AND INSTRUMENTATION'] * 3)
list_category.extend(['RESEARCH ARTICLE—ESR'] * 1)

i = 0
for l in list_category:
  print(l)
  i += 1
print(i)

Cover Image
Issue Information
Obituary
RESEARCH ARTICLE—PRECLINICAL AND CLINICAL SPECTROSCOPY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
RESEARCH ARTICLES—IMAGING METHODOLOGY
TECHNICAL NOTES—IMAGING METHODOLOGY
TECHNICAL NOTES—IMAGING METHODOLOGY
RESEARCH ARTICLES—PRECLINICAL AND CLINICAL IMAGING
RESEARCH ARTICLES—PRECLINICAL AND CLINICAL IMAGING
RESEARCH 
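
The same list could be built from a table of (category, count) pairs, which keeps the counts in one place and makes it easy to confirm that the total matches the 42 dates found earlier. This is a sketch of one possible layout, not the notebook's own method; the counts are those used in the cell above, with the heading spelled "Obituary" as on the page.

```python
from itertools import chain, repeat

# (category, number of entries) in page order, using the counts from the cell above.
CATEGORY_COUNTS = [
    ("Cover Image", 1),
    ("Issue Information", 1),
    ("Obituary", 1),
    ("RESEARCH ARTICLE—PRECLINICAL AND CLINICAL SPECTROSCOPY", 1),
    ("RESEARCH ARTICLES—IMAGING METHODOLOGY", 19),
    ("TECHNICAL NOTES—IMAGING METHODOLOGY", 2),
    ("RESEARCH ARTICLES—PRECLINICAL AND CLINICAL IMAGING", 3),
    ("TECHNICAL NOTE—PRECLINICAL AND CLINICAL IMAGING", 1),
    ("TECHNICAL NOTE—BIOPHYSICS AND BASIC BIOMEDICAL RESEARCH", 1),
    ("RESEARCH ARTICLES—COMPUTER PROCESSING AND MODELING", 7),
    ("TECHNICAL NOTE—COMPUTER PROCESSING AND MODELING", 1),
    ("RESEARCH ARTICLES—HARDWARE AND INSTRUMENTATION", 3),
    ("RESEARCH ARTICLE—ESR", 1),
]

# Repeat each category name as many times as it has entries.
categories = list(chain.from_iterable(repeat(name, n) for name, n in CATEGORY_COUNTS))

expected_rows = 42  # number of "First Published" dates found earlier
if len(categories) != expected_rows:
    raise ValueError(f"Expected {expected_rows} categories, got {len(categories)}")
print(len(categories))
```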

In [28]:
df.insert(1, 'Category', list_category)
df

| | Journal | Category | Title | First Published | DOI |
|---|---|---|---|---|---|
| 0 | Magnetic Resonance in Medicine: Vol 86, No 4 | Cover Image | Multi-echo gradient-recalled-echo phase unwrap... | 20 July 2021 | https://doi.org/10.1002/mrm.28920 |
| 1 | Magnetic Resonance in Medicine: Vol 86, No 4 | Issue Information | Issue Information | 20 July 2021 | https://doi.org/10.1002/mrm.28373 |
| 2 | Magnetic Resonance in Medicine: Vol 86, No 4 | Obituary | In memoriam: John R. Mallard (1927-2021) | 06 June 2021 | https://doi.org/10.1002/mrm.28838 |
| 3 | Magnetic Resonance in Medicine: Vol 86, No 4 | RESEARCH ARTICLE—PRECLINICAL AND CLINICAL SPEC... | Spectral fitting strategy to overcome the over... | 12 May 2021 | https://doi.org/10.1002/mrm.28829 |
| 4 | Magnetic Resonance in Medicine: Vol 86, No 4 | RESEARCH ARTICLES—IMAGING METHODOLOGY | A model selection framework to quantify microv... | 11 May 2021 | https://doi.org/10.1002/mrm.28798 |
| 5 | Magnetic Resonance in Medicine: Vol 86, No 4 | RESEARCH ARTICLES—IMAGING METHODOLOGY | Pulseq-CEST: Towards multi-site multi-vendor c... | 07 May 2021 | https://doi.org/10.1002/mrm.28825 |
| 6 | Magnetic Resonance in Medicine: Vol 86, No 4 | RESEARCH ARTICLES—IMAGING METHODOLOGY | Systematic evaluation of iterative deep neural... | 10 June 2021 | https://doi.org/10.1002/mrm.28827 |
| 7 | Magnetic Resonance in Medicine: Vol 86, No 4 | RESEARCH ARTICLES—IMAGING METHODOLOGY | Local perturbation responses and checkerboard ... | 03 June 2021 | https://doi.org/10.1002/mrm.28828 |
| 8 | Magnetic Resonance in Medicine: Vol 86, No 4 | RESEARCH ARTICLES—IMAGING METHODOLOGY | Sources of systematic error in DCE-MRI estimat... | 18 May 2021 | https://doi.org/10.1002/mrm.28833 |
| 9 | Magnetic Resonance in Medicine: Vol 86, No 4 | RESEARCH ARTICLES—IMAGING METHODOLOGY | Real-time deep artifact suppression using recu... | 25 May 2021 | https://doi.org/10.1002/mrm.28834 |


## Save as csv

In [29]:
df.to_csv(path_or_buf='/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/processed/mrm-vol-86-no-4.csv', 
          index=False)
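
The file name above encodes the volume and issue number by hand. Both also appear in the page title printed earlier, so the name could be derived from it when the notebook is rerun for other issues. The sketch below assumes the title always has the form "<journal name>: Vol <V>, No <N>", as in the output for this issue; the function name is hypothetical. With the parsed page, the input would be `soupMRM.title.get_text()`.

```python
import re

# Assumed title format, based on this issue: "<journal name>: Vol <V>, No <N>"
TITLE_PATTERN = re.compile(
    r"^(?P<journal>.+?):\s*Vol\s+(?P<volume>\d+),\s*No\s+(?P<issue>\d+)$"
)


def parse_issue_title(title_text):
    """Split a page title into journal name, volume, and issue number."""
    match = TITLE_PATTERN.match(title_text.strip())
    if match is None:
        raise ValueError(f"Unexpected title format: {title_text!r}")
    return match["journal"], int(match["volume"]), int(match["issue"])


journal, volume, issue = parse_issue_title(" Magnetic Resonance in Medicine: Vol 86, No 4")
csv_name = f"mrm-vol-{volume}-no-{issue}.csv"
print(journal, volume, issue)
print(csv_name)
```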

## Save as xlsx file
 - See [here](https://xlsxwriter.readthedocs.io/working_with_pandas.html) for instructions

In [None]:
# !pip install xlsxwriter
import xlsxwriter

# (Comment out after saving first sheet) Create an ExcelWriter object
# writer = pd.ExcelWriter('/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/processed/jmri-all-summary.xlsx', engine='xlsxwriter')

In [None]:
# Requires the ExcelWriter object created in the previous cell.
# Write this issue to its own sheet (one sheet per issue).
df.to_excel(excel_writer=writer,
            sheet_name='mrm-vol-86-no-4',
            index=False)

In [None]:
# (Uncomment at the end) Close the Pandas Excel writer and output the Excel file.
# writer.save() was removed in pandas 2.0; close() also writes the file.
writer.close()