# Web Scraping using Python

## Target Journal: MRM

- https://onlinelibrary.wiley.com/toc/15222594/2021/86/4
- https://onlinelibrary.wiley.com/toc/15222594/2021/86/3
- https://onlinelibrary.wiley.com/toc/15222594/2021/86/2
- https://onlinelibrary.wiley.com/toc/15222594/2021/86/1

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#To easily display the plots, make sure to include the line %matplotlib inline as shown below.
%matplotlib inline

import urllib
import requests
import lxml

#To perform web scraping, you should also import the libraries shown below. 
#The urllib.request module is used to open URLs. 
#The Beautiful Soup package is used to extract data from html files. 
#The Beautiful Soup library's name is bs4 which stands for Beautiful Soup, version 4.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

## First, save the page as a local HTML file
 - See: https://zetcode.com/python/beautifulsoup/
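One way to produce such a local file from Python is sketched below. This is only a minimal sketch: `fetch_html` and `save_html` are hypothetical helper names, the `User-Agent` value is an assumption, and the publisher's site may refuse automated requests, in which case saving the page from the browser is the fallback. The download call is left commented out, and the demo only exercises the file-writing step.

```python
import tempfile
from pathlib import Path
from urllib.request import Request, urlopen


def fetch_html(url, user_agent="Mozilla/5.0"):
    """Download a page and return its HTML as text (hypothetical helper)."""
    req = Request(url, headers={"User-Agent": user_agent})
    with urlopen(req, timeout=30) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def save_html(html, out_path):
    """Write HTML text to out_path, creating parent folders if needed."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(html, encoding="utf-8")
    return out_path


# Not run here, because the request may be blocked:
# html = fetch_html("https://onlinelibrary.wiley.com/toc/15222594/2021/86/2")
# save_html(html, "rawdata/mrm-vol-86-no-2.html")

# Demo of the saving step with a throwaway folder and made-up HTML
with tempfile.TemporaryDirectory() as tmp:
    saved = save_html("<html><title>demo</title></html>",
                      Path(tmp) / "rawdata" / "demo.html")
    round_trip = saved.read_text(encoding="utf-8")

print(round_trip)
```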

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# MRM Vol 86 No 2
URL = '/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/rawdata/Magnetic Resonance in Medicine_ Vol 86, No 2.html'
# MRM Vol 86 No 4
# URL = '/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/rawdata/Magnetic Resonance in Medicine_ Vol 86, No 4.html'

with open(URL, 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

Find the corresponding index, and identify the title info

In [4]:
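def safeOpenParsePage(targetUrl):
    """Read a locally saved HTML file and parse it; return None if it cannot be read."""
    try:
        # For a live page instead of a local file:
        # tmpurl = urlopen(Request(targetUrl, headers={'User-Agent': 'Chrome/92.0.4515.107'}))
        with open(targetUrl, 'r') as tmpurl:
            tmpR = tmpurl.read()
        # tmpSoup = BeautifulSoup(tmpR, 'html.parser')
        tmpSoup = BeautifulSoup(tmpR, 'lxml')
        return tmpSoup
    except OSError as e:
        # open() raises OSError (e.g. FileNotFoundError); urllib's HTTPError is also an OSError subclass
        print(e)
        return None

soupMRM = safeOpenParsePage(URL)
if soupMRM is not None:
    print(soupMRM.prettify())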

<!DOCTYPE html>
<html class="pb-page" data-request-id="60e18eb4-26e3-4638-9f3a-ccad28e4f47d" lang="en">
 <head data-pb-dropzone="head">
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content=";journal:journal:15222594;ctype:string:Journal Content;website:website:pericles;requestedJournal:journal:15222594;page:string:Table of Contents;wgroup:string:Publication Websites;issue:issue:doi\:10.1002/mrm.v86.2;pageGroup:string:Publication Pages" name="pbContext"/>
  <script type="text/javascript">
   var $DoubleClickZone = "magres-med_mrm";var $DoubleClickSite =  "wly.radiol.imag_000105";
  </script>
  <script id="analyticDigitalData">
   digitalData = {}
  </script>
  <meta charset="utf-8"/>
  <meta content="noarchive" name="robots"/>
  <meta content=" Magnetic Resonance in Medicine: Vol 86, No 2" property="og:title"/>
  <meta content="Journal" property="og:type"/>
  <meta content="https://onlinelibrary.wiley.com/toc/15222594/2021/86/2" property="og:url"/>
  <meta content="h

In [5]:
# Get the title
title = soupMRM.title
print(title)

<title> Magnetic Resonance in Medicine: Vol 86, No 2</title>


In [6]:
# Print out the text
text = soupMRM.get_text()
print(text)

# Another way to extract text
# str(all_links[103]).split("<h2>")[1].replace("</h2></a>", "")



var $DoubleClickZone = "magres-med_mrm";var $DoubleClickSite =  "wly.radiol.imag_000105";digitalData = {}












 Magnetic Resonance in Medicine: Vol 86, No 2

@font-face {
    font-family: 'Open Sans Subset';
    font-display: swap;
    src: local('Muli Regular'), local('Muli-Regular'),
    url(data:application/font-woff;charset=utf-8;base64,d09GRgABAAAAAC/w... [base64 font data truncated]
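The output above shows that `get_text()` also returns the contents of `<script>` and `<style>` tags. A possible way to get only the visible text is to remove those tags before extracting, as in the sketch below. The HTML snippet and the name `demo` are made up for illustration, and `html.parser` is used so that the example does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Made-up miniature page, not the real Wiley markup
html = """<html><head><title> Demo page</title>
<style>@font-face { font-family: 'X'; }</style>
<script>var a = 1;</script></head>
<body><h2>Some article title here</h2></body></html>"""

demo = BeautifulSoup(html, "html.parser")

# Remove script and style elements so their contents do not end up in the text
for tag in demo(["script", "style"]):
    tag.decompose()

# strip=True trims each text fragment and skips the empty ones
visible = demo.get_text(separator="\n", strip=True)
print(visible)
```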

In [7]:
all_links = soupMRM.find_all('a')

### Explore Category

In [8]:
# h3 headings hold the section (category) names, plus a few site-navigation headings

all_h3 = soupMRM.find_all("h3")

for h3 in all_h3:
  # print(h3)
  print(h3.get_text())
print(len(all_h3))

Menu
Format
Type of import
COVER IMAGE
ISSUE INFORMATION
FULL PAPER—SPECTROSCOPIC METHODOLOGY
RAPID COMMUNICATIONS—IMAGING METHODOLOGY
FULL PAPERS—IMAGING METHODOLOGY
TECHNICAL NOTES—IMAGING METHODOLOGY
FULL PAPERS—PRECLINICAL AND CLINICAL IMAGING
TECHNICAL NOTE—PRECLINICAL AND CLINICAL IMAGING
FULL PAPERS—BIOPHYSICS AND BASIC BIOMEDICAL RESEARCH
TECHNICAL NOTES—BIOPHYSICS AND BASIC BIOMEDICAL RESEARCH
FULL PAPERS—COMPUTER PROCESSING AND MODELING
TECHNICAL NOTE—COMPUTER PROCESSING AND MODELING
FULL PAPER—HARDWARE AND INSTRUMENTATION
TECHNICAL NOTES—HARDWARE AND INSTRUMENTATION
About Wiley Online Library
Help & Support
Opportunities
Connect with Wiley
21
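In the output above, the category headings are written in capitals while the site-navigation headings are not. Assuming that pattern holds, a simple filter can separate the two groups. The sketch below uses a shortened copy of the printed list, and `h3_texts` and `category_headings` are illustrative names.

```python
# Shortened copy of the h3 texts printed above
h3_texts = [
    "Menu",
    "Format",
    "Type of import",
    "COVER IMAGE",
    "ISSUE INFORMATION",
    "About Wiley Online Library",
    "Help & Support",
]

# str.isupper() is True only when every cased character is uppercase
category_headings = [t for t in h3_texts if t.isupper()]
print(category_headings)
```

With the real page this would be applied to `[h3.get_text(strip=True) for h3 in all_h3]`.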


### Explore Original Research Subcategory

In [9]:
# No h4 headings exist on this page, so there is no subcategory level to extract

all_h4 = soupMRM.find_all("h4")

for h4 in all_h4:
  # print(h4)
  print(h4.get_text())
print(len(all_h4))

0


In [10]:
all_h2 = soupMRM.find_all("h2")

i = 0
for h2 in all_h2:
  h2_len = len(h2.get_text().split())
  if h2_len > 3:
    print(h2.get_text())
    i += 1
print(i)

# str_h2 = str(all_h2)

# links = soupMRM.find_all("a")
# str_links = str(links)
# print(str_h2)

# cleantext = BeautifulSoup(str_h2, "lxml").get_text()
# cleantext = BeautifulSoup(str_links, "lxml").get_text()
# print(cleantext)

Synchrotron X-ray micro-CT as a validation dataset for diffusion MRI in whole mouse brain
Calibration-free regional RF shims for MRS
High-resolution sodium imaging using anatomical and sparsity constraints for denoising and recovery of novel features
Motion-compensated 3D turbo spin-echo for more robust MR intracranial vessel wall imaging
High spatial resolution spiral first-pass myocardial perfusion imaging with whole-heart coverage at 3 T
All-systolic first-pass myocardial rest perfusion at a long saturation time using simultaneous multi-slice imaging and compressed sensing acceleration
Apparent exchange rate imaging: On its applicability and the connection to the real exchange rate
Imperfect spoiling in variable flip angle T1 mapping at 7T: Quantifying and minimizing impact
MRzero - Automated discovery of MRI sequences using supervised learning
Prospective motion detection and re-acquisition in diffusion MRI using a phase image–based method—Application to brain and tongue imaging
Ef

## Next, clean up the title list

In [15]:
list_h2 = []
for h2 in all_h2:
  h2_len = len(h2.get_text().split())
  if h2_len > 3:
    cells = h2.get_text()
    list_h2.append(cells)

# Delete the last element: "Log in to Wiley Online Library"
list_h2.pop()

# Insert "Issue Information" as the 2nd element
# (its heading has only two words, so the length filter above dropped it)
list_h2.insert(1, "Issue Information")

# Delete the last two remaining elements
# https://stackoverflow.com/questions/15715912/remove-the-last-n-elements-of-a-list
del list_h2[-2:]

for l in list_h2:
  print(l)
print(len(list_h2))

Synchrotron X-ray micro-CT as a validation dataset for diffusion MRI in whole mouse brain
Issue Information
Calibration-free regional RF shims for MRS
High-resolution sodium imaging using anatomical and sparsity constraints for denoising and recovery of novel features
Motion-compensated 3D turbo spin-echo for more robust MR intracranial vessel wall imaging
High spatial resolution spiral first-pass myocardial perfusion imaging with whole-heart coverage at 3 T
All-systolic first-pass myocardial rest perfusion at a long saturation time using simultaneous multi-slice imaging and compressed sensing acceleration
Apparent exchange rate imaging: On its applicability and the connection to the real exchange rate
Imperfect spoiling in variable flip angle T1 mapping at 7T: Quantifying and minimizing impact
MRzero - Automated discovery of MRI sequences using supervised learning
Prospective motion detection and re-acquisition in diffusion MRI using a phase image–based method—Application to brain and

### Add first-published dates

In [13]:
# li class: corresponds to "First Published"

all_li = soupMRM.find_all("li")

# i = 0
# for li in all_li:
  # li_class = li.get_attribute_list("class")
  # print(li_class)
  # if li_class == ['ePubDate']:
    # print(li.get_text())
    # i += 1
  # print(li.get_text())
  # i += 1
# print(i)

list_date = []
for li in all_li:
  li_class = li.get_attribute_list("class")
  if li_class == ['ePubDate']:
    cells = li.get_text().split(': ')[1]
    list_date.append(cells)

for l in list_date:
  print(l)
print(len(list_date))

24 April 2021
24 April 2021
21 March 2021
25 March 2021
25 March 2021
11 March 2021
10 March 2021
10 March 2021
01 March 2021
23 March 2021
04 March 2021
21 March 2021
23 March 2021
10 March 2021
10 March 2021
10 March 2021
15 March 2021
14 March 2021
24 March 2021
16 March 2021
25 March 2021
23 March 2021
27 March 2021
04 March 2021
16 March 2021
15 March 2021
16 March 2021
15 March 2021
25 March 2021
21 March 2021
16 March 2021
15 March 2021
25 March 2021
27 March 2021
14 March 2021
28 February 2021
16 March 2021
23 March 2021
25 March 2021
15 March 2021
16 March 2021
25 March 2021
23 March 2021
24 March 2021
27 March 2021
19 March 2021
23 March 2021
47
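The dates above are plain strings such as `24 April 2021`. If they need to be sorted or compared later, they can be converted to date objects. The sketch below uses the standard library and a few sample values copied from the output; it assumes the default English month names (`%B` is locale dependent).

```python
from datetime import datetime

# Sample values copied from the output above
raw_dates = ["24 April 2021", "01 March 2021", "28 February 2021"]

# %d = day of month, %B = full month name, %Y = four-digit year
parsed = [datetime.strptime(d, "%d %B %Y").date() for d in raw_dates]
iso_dates = [d.isoformat() for d in parsed]
print(iso_dates)
```

On a DataFrame column, `pd.to_datetime(df['First Published'], format='%d %B %Y')` should give the equivalent result.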


In [14]:
list_url = []
for link in all_links:
  if 'visitable' in str(link.get("class")): # and '/doi/' in link.get("href"):
    cells = link.get("href").replace('/doi', 'https://doi.org')
    list_url.append(cells)

for l in list_url:
  print(l)
print(len(list_url))

https://doi.org/10.1002/mrm.28817
https://doi.org/10.1002/mrm.28371
https://doi.org/10.1002/mrm.28749
https://doi.org/10.1002/mrm.28767
https://doi.org/10.1002/mrm.28777
https://doi.org/10.1002/mrm.28701
https://doi.org/10.1002/mrm.28712
https://doi.org/10.1002/mrm.28714
https://doi.org/10.1002/mrm.28720
https://doi.org/10.1002/mrm.28727
https://doi.org/10.1002/mrm.28729
https://doi.org/10.1002/mrm.28734
https://doi.org/10.1002/mrm.28743
https://doi.org/10.1002/mrm.28744
https://doi.org/10.1002/mrm.28742
https://doi.org/10.1002/mrm.28748
https://doi.org/10.1002/mrm.28750
https://doi.org/10.1002/mrm.28758
https://doi.org/10.1002/mrm.28756
https://doi.org/10.1002/mrm.28763
https://doi.org/10.1002/mrm.28761
https://doi.org/10.1002/mrm.28769
https://doi.org/10.1002/mrm.28770
https://doi.org/10.1002/mrm.28737
https://doi.org/10.1002/mrm.28746
https://doi.org/10.1002/mrm.28747
https://doi.org/10.1002/mrm.28762
https://doi.org/10.1002/mrm.28764
https://doi.org/10.1002/mrm.28765
https://doi.or
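The titles, dates and URLs are collected in three separate passes, so the lists only line up if they have the same length (47 here). pandas raises a `ValueError` when list columns differ in length, but an explicit check before building the DataFrame gives a clearer message. The helper name `check_equal_lengths` and the demo values below are made up for illustration.

```python
def check_equal_lengths(**columns):
    """Return the length of each list, or raise if they are not all equal."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Made-up demo lists standing in for list_h2, list_date and list_url
demo_titles = ["Title A", "Title B", "Title C"]
demo_dates = ["24 April 2021", "21 March 2021", "25 March 2021"]
demo_urls = ["https://example.org/a", "https://example.org/b", "https://example.org/c"]

lengths = check_equal_lengths(title=demo_titles, date=demo_dates, url=demo_urls)
print(lengths)

# A mismatch is reported with the offending lengths
try:
    check_equal_lengths(title=demo_titles, date=demo_dates[:2])
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print(err)
```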

In [16]:
# df = pd.DataFrame({'title':list_h2, 'url': list_url})
# Use the title text (not the <title> Tag object) for the Journal column
df = pd.DataFrame({'Journal': title.get_text(strip=True),
                   #'Category': list_category, 
                   'Title': list_h2, 
                   'First Published': list_date, 
                   'DOI': list_url})
df

Unnamed: 0,Journal,Title,First Published,DOI
0,"Magnetic Resonance in Medicine: Vol 86, No 2",Synchrotron X-ray micro-CT as a validation dat...,24 April 2021,https://doi.org/10.1002/mrm.28817
1,"Magnetic Resonance in Medicine: Vol 86, No 2",Issue Information,24 April 2021,https://doi.org/10.1002/mrm.28371
2,"Magnetic Resonance in Medicine: Vol 86, No 2",Calibration-free regional RF shims for MRS,21 March 2021,https://doi.org/10.1002/mrm.28749
3,"Magnetic Resonance in Medicine: Vol 86, No 2",High-resolution sodium imaging using anatomica...,25 March 2021,https://doi.org/10.1002/mrm.28767
4,"Magnetic Resonance in Medicine: Vol 86, No 2",Motion-compensated 3D turbo spin-echo for more...,25 March 2021,https://doi.org/10.1002/mrm.28777
5,"Magnetic Resonance in Medicine: Vol 86, No 2",High spatial resolution spiral first-pass myoc...,11 March 2021,https://doi.org/10.1002/mrm.28701
6,"Magnetic Resonance in Medicine: Vol 86, No 2",All-systolic first-pass myocardial rest perfus...,10 March 2021,https://doi.org/10.1002/mrm.28712
7,"Magnetic Resonance in Medicine: Vol 86, No 2",Apparent exchange rate imaging: On its applica...,10 March 2021,https://doi.org/10.1002/mrm.28714
8,"Magnetic Resonance in Medicine: Vol 86, No 2",Imperfect spoiling in variable flip angle T1 m...,01 March 2021,https://doi.org/10.1002/mrm.28720
9,"Magnetic Resonance in Medicine: Vol 86, No 2",MRzero - Automated discovery of MRI sequences ...,23 March 2021,https://doi.org/10.1002/mrm.28727


## Third, manually create the category list based on Issue Information
 - See [here](https://stackoverflow.com/questions/4654414/python-append-item-to-list-n-times) for how to repeat an element N times in a list

In [22]:
# list_category = ['Cover Image', 'Issue Information', 'Commentary']

### Vol 86 No 2 ###
list_category = ['Cover Image', 'Issue Information']
list_category.extend(['FULL PAPER—SPECTROSCOPIC METHODOLOGY'] * 1)
list_category.extend(['RAPID COMMUNICATIONS—IMAGING METHODOLOGY'] * 2)
list_category.extend(['FULL PAPERS—IMAGING METHODOLOGY'] * 18)
list_category.extend(['TECHNICAL NOTES—IMAGING METHODOLOGY'] * 7)
list_category.extend(['FULL PAPERS—PRECLINICAL AND CLINICAL IMAGING'] * 4)
list_category.extend(['TECHNICAL NOTE—PRECLINICAL AND CLINICAL IMAGING'] * 1)
list_category.extend(['FULL PAPERS—BIOPHYSICS AND BASIC BIOMEDICAL RESEARCH'] * 2)
list_category.extend(['TECHNICAL NOTES—BIOPHYSICS AND BASIC BIOMEDICAL RESEARCH'] * 2)
list_category.extend(['FULL PAPERS—COMPUTER PROCESSING AND MODELING'] * 4)
list_category.extend(['TECHNICAL NOTE—COMPUTER PROCESSING AND MODELING'] * 1)
list_category.extend(['FULL PAPER—HARDWARE AND INSTRUMENTATION'] * 1)
list_category.extend(['TECHNICAL NOTES—HARDWARE AND INSTRUMENTATION'] * 2)

### Vol 86 No 4 ###
# list_category = ['Cover Image', 'Issue Information', 'Obituary']
# list_category.extend(['RESEARCH ARTICLE—PRECLINICAL AND CLINICAL SPECTROSCOPY'] * 1)
# list_category.extend(['RESEARCH ARTICLES—IMAGING METHODOLOGY'] * 19)
# list_category.extend(['TECHNICAL NOTES—IMAGING METHODOLOGY'] * 2)
# list_category.extend(['RESEARCH ARTICLES—PRECLINICAL AND CLINICAL IMAGING'] * 3)
# list_category.extend(['TECHNICAL NOTE—PRECLINICAL AND CLINICAL IMAGING'] * 1)
# list_category.extend(['TECHNICAL NOTE—BIOPHYSICS AND BASIC BIOMEDICAL RESEARCH'] * 1)
# list_category.extend(['RESEARCH ARTICLES—COMPUTER PROCESSING AND MODELING'] * 7)
# list_category.extend(['TECHNICAL NOTE—COMPUTER PROCESSING AND MODELING'] * 1)
# list_category.extend(['RESEARCH ARTICLES—HARDWARE AND INSTRUMENTATION'] * 3)
# list_category.extend(['RESEARCH ARTICLE—ESR'] * 1)

i = 0
for l in list_category:
  print(l)
  i += 1
print(i)

Cover Image
Issue Information
FULL PAPER—SPECTROSCOPIC METHODOLOGY
RAPID COMMUNICATIONS—IMAGING METHODOLOGY
RAPID COMMUNICATIONS—IMAGING METHODOLOGY
FULL PAPERS—IMAGING METHODOLOGY
FULL PAPERS—IMAGING METHODOLOGY
FULL PAPERS—IMAGING METHODOLOGY
FULL PAPERS—IMAGING METHODOLOGY
FULL PAPERS—IMAGING METHODOLOGY
FULL PAPERS—IMAGING METHODOLOGY
FULL PAPERS—IMAGING METHODOLOGY
FULL PAPERS—IMAGING METHODOLOGY
FULL PAPERS—IMAGING METHODOLOGY
FULL PAPERS—IMAGING METHODOLOGY
FULL PAPERS—IMAGING METHODOLOGY
FULL PAPERS—IMAGING METHODOLOGY
FULL PAPERS—IMAGING METHODOLOGY
FULL PAPERS—IMAGING METHODOLOGY
FULL PAPERS—IMAGING METHODOLOGY
FULL PAPERS—IMAGING METHODOLOGY
FULL PAPERS—IMAGING METHODOLOGY
FULL PAPERS—IMAGING METHODOLOGY
TECHNICAL NOTES—IMAGING METHODOLOGY
TECHNICAL NOTES—IMAGING METHODOLOGY
TECHNICAL NOTES—IMAGING METHODOLOGY
TECHNICAL NOTES—IMAGING METHODOLOGY
TECHNICAL NOTES—IMAGING METHODOLOGY
TECHNICAL NOTES—IMAGING METHODOLOGY
TECHNICAL NOTES—IMAGING METHODOLOGY
FULL PAPERS—PRECLINICAL
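As an alternative to counting articles per category by hand, the category could be assigned while walking the headings in document order. This is only a sketch on made-up HTML: it assumes each category is an `<h3>` that precedes the `<h2>` titles belonging to it. The real page also contains navigation headings and short `<h2>` entries, which would need the same filtering as in the cells above.

```python
from bs4 import BeautifulSoup

# Made-up miniature table of contents, not the real Wiley markup
html = """
<h3>COVER IMAGE</h3>
<h2>A first demo article title</h2>
<h3>DEMO SECTION</h3>
<h2>A second demo article title</h2>
<h2>A third demo article title</h2>
"""

demo = BeautifulSoup(html, "html.parser")

rows = []
current_category = None
# Passing a list of tag names returns matches in document order
for tag in demo.find_all(["h3", "h2"]):
    text = tag.get_text(strip=True)
    if tag.name == "h3":
        current_category = text          # remember the most recent category
    elif current_category is not None:
        rows.append((current_category, text))

for category, title_text in rows:
    print(category, "|", title_text)
```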

In [23]:
df.insert(1, 'Category', list_category)
df

Unnamed: 0,Journal,Category,Title,First Published,DOI
0,"Magnetic Resonance in Medicine: Vol 86, No 2",Cover Image,Synchrotron X-ray micro-CT as a validation dat...,24 April 2021,https://doi.org/10.1002/mrm.28817
1,"Magnetic Resonance in Medicine: Vol 86, No 2",Issue Information,Issue Information,24 April 2021,https://doi.org/10.1002/mrm.28371
2,"Magnetic Resonance in Medicine: Vol 86, No 2",FULL PAPER—SPECTROSCOPIC METHODOLOGY,Calibration-free regional RF shims for MRS,21 March 2021,https://doi.org/10.1002/mrm.28749
3,"Magnetic Resonance in Medicine: Vol 86, No 2",RAPID COMMUNICATIONS—IMAGING METHODOLOGY,High-resolution sodium imaging using anatomica...,25 March 2021,https://doi.org/10.1002/mrm.28767
4,"Magnetic Resonance in Medicine: Vol 86, No 2",RAPID COMMUNICATIONS—IMAGING METHODOLOGY,Motion-compensated 3D turbo spin-echo for more...,25 March 2021,https://doi.org/10.1002/mrm.28777
5,"Magnetic Resonance in Medicine: Vol 86, No 2",FULL PAPERS—IMAGING METHODOLOGY,High spatial resolution spiral first-pass myoc...,11 March 2021,https://doi.org/10.1002/mrm.28701
6,"Magnetic Resonance in Medicine: Vol 86, No 2",FULL PAPERS—IMAGING METHODOLOGY,All-systolic first-pass myocardial rest perfus...,10 March 2021,https://doi.org/10.1002/mrm.28712
7,"Magnetic Resonance in Medicine: Vol 86, No 2",FULL PAPERS—IMAGING METHODOLOGY,Apparent exchange rate imaging: On its applica...,10 March 2021,https://doi.org/10.1002/mrm.28714
8,"Magnetic Resonance in Medicine: Vol 86, No 2",FULL PAPERS—IMAGING METHODOLOGY,Imperfect spoiling in variable flip angle T1 m...,01 March 2021,https://doi.org/10.1002/mrm.28720
9,"Magnetic Resonance in Medicine: Vol 86, No 2",FULL PAPERS—IMAGING METHODOLOGY,MRzero - Automated discovery of MRI sequences ...,23 March 2021,https://doi.org/10.1002/mrm.28727


## Save as csv

In [24]:
df.to_csv(path_or_buf='/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/processed/mrm-vol-86-no-2.csv',
          # path_or_buf='/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/processed/mrm-vol-86-no-4.csv', 
          index=False)
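The `index=False` argument matters when the file is read back. Without it, the row index is written as an unnamed first column, which `read_csv` then reports as `Unnamed: 0`. The sketch below shows the difference on a small made-up DataFrame written to a temporary folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up two-row frame standing in for df
demo_df = pd.DataFrame({
    "Title": ["Demo title A", "Demo title B"],
    "First Published": ["24 April 2021", "21 March 2021"],
})

with tempfile.TemporaryDirectory() as tmp:
    with_index = Path(tmp) / "with_index.csv"
    no_index = Path(tmp) / "no_index.csv"

    demo_df.to_csv(with_index)             # default: the index is written too
    demo_df.to_csv(no_index, index=False)  # only the data columns are written

    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_no_index = list(pd.read_csv(no_index).columns)

print(cols_with_index)
print(cols_no_index)
```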

## Save as xlsx file
 - See [here](https://xlsxwriter.readthedocs.io/working_with_pandas.html) for instructions

In [None]:
# !pip install xlsxwriter
import xlsxwriter

# (Comment out after saving first sheet) Create an ExcelWriter object
# writer = pd.ExcelWriter('/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/processed/jmri-all-summary.xlsx', engine='xlsxwriter')

In [None]:
df.to_excel(excel_writer=writer,
            sheet_name='mrm-vol-86-no-2',
            index=False)

In [None]:
# (Uncomment at the end) Close the Pandas Excel writer and output the Excel file.
# writer.save() was removed in pandas 2.0; close() also writes the file.
writer.close()