# Web Scraping using Python

## Target Journal: JMRI

Important fields to include:
 - Category
 - Date first published

Examples:
- https://onlinelibrary.wiley.com/toc/15222586/2021/54/1
- https://onlinelibrary.wiley.com/toc/15222586/2021/54/2
- https://onlinelibrary.wiley.com/toc/15222586/2021/54/3
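
The three example URLs share one pattern: the journal ID `15222586`, followed by year, volume, and issue. A minimal sketch that builds such a URL is shown here; the helper name `toc_url` is hypothetical, and the pattern is inferred only from the examples above, so it may not hold for every issue.

```python
# Prefix taken from the example URLs above
JMRI_TOC_BASE = "https://onlinelibrary.wiley.com/toc/15222586"


def toc_url(year, volume, issue):
    """Build a table-of-contents URL following the pattern of the examples."""
    return f"{JMRI_TOC_BASE}/{year}/{volume}/{issue}"


for issue in (1, 2, 3):
    print(toc_url(2021, 54, issue))
```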

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#To easily display the plots, make sure to include the line %matplotlib inline as shown below.
%matplotlib inline

import urllib
import requests
import lxml

#To perform web scraping, you should also import the libraries shown below. 
#The urllib.request module is used to open URLs. 
#The Beautiful Soup package is used to extract data from html files. 
#The Beautiful Soup library's name is bs4 which stands for Beautiful Soup, version 4.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

## Method 1 (default): First, save the page as a local HTML file
 - See: https://zetcode.com/python/beautifulsoup/

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [57]:
# JMRI Vol 54 No 1
URL = '/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/rawdata/Journal of Magnetic Resonance Imaging_Vol 54, No 1.html'
# JMRI Vol 54 No 2
# URL = '/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/rawdata/Journal of Magnetic Resonance Imaging_ Vol 54, No 2.html'
# Current: JMRI Vol 54 No 3
# URL = '/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/rawdata/Journal of Magnetic Resonance Imaging_ Vol 54, No 3.html'

# Note: despite its name, URL holds the path to a locally saved HTML file
with open(URL, 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')

Parse the saved page and inspect its contents to locate the elements that hold the title information.

In [58]:
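def safeOpenParsePage(targetUrl):
    # targetUrl is the path to a locally saved HTML file (Method 1), not a live URL
    try:
        # tmpurl = urlopen(Request(targetUrl, headers={'User-Agent': 'Chrome/92.0.4515.107'}))
        with open(targetUrl, 'r') as tmpurl:
            tmpR = tmpurl.read()
        # tmpSoup = BeautifulSoup(tmpR, 'html.parser')
        tmpSoup = BeautifulSoup(tmpR, 'lxml')
        return tmpSoup
    except OSError as e:
        # open() raises OSError (e.g. FileNotFoundError); urllib.error.HTTPError
        # is also a subclass of OSError, so the urlopen variant stays covered
        print(e)
        return None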

soupJMRI = safeOpenParsePage(URL)
if soupJMRI is not None:
    print(soupJMRI.prettify())

<!DOCTYPE html>
<html class="pb-page" data-request-id="037d99ac-cdd2-45fc-859a-57d62b6da74d" lang="en">
 <head data-pb-dropzone="head">
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content=";issue:issue:doi\:10.1002/jmri.v54.1;journal:journal:15222586;ctype:string:Journal Content;website:website:pericles;page:string:Table of Contents;requestedJournal:journal:15222586;wgroup:string:Publication Websites;pageGroup:string:Publication Pages" name="pbContext"/>
  <script type="text/javascript">
   var $DoubleClickZone = "j-magres-imaging_jmri";var $DoubleClickSite =  "wly.radiol.imag_000105";
  </script>
  <script id="analyticDigitalData">
   digitalData = {"site":{"ip":"152.78.0.24","environment":"LIVE","website":"onlinelibrary.wiley.com","websiteCode":"pericles","serverDate":"2021-08-05","server":"web117"},"identities":[{"type":"BasicGroup","uuid":"e623deab-c83b-41e3-84a2-d9f1d46c5f17"},{"type":"InstitutionUser","uuid":"58cbd8e6-3c5d-4fc0-ac26-57c4792d2c4b","customerRec

In [59]:
# Get the title
title = soupJMRI.title
print(title)

<title> Journal of Magnetic Resonance Imaging: Vol 54, No 1</title>
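
The title text carries the journal name, volume, and issue number in one string. The sketch below splits it into those parts with a regular expression. The helper `parse_issue_title` is hypothetical, and the pattern assumes every issue title has the same `<journal>: Vol <n>, No <n>` shape as the single title shown above.

```python
import re

# Assumed format, based on the one title shown above: "<journal>: Vol <n>, No <n>"
TITLE_PATTERN = re.compile(
    r"^\s*(?P<journal>.+?):\s*Vol\s+(?P<volume>\d+),\s*No\s+(?P<issue>\d+)\s*$"
)


def parse_issue_title(title_text):
    """Return (journal, volume, issue), or None if the text does not match."""
    match = TITLE_PATTERN.match(title_text)
    if match is None:
        return None
    return match["journal"], int(match["volume"]), int(match["issue"])


print(parse_issue_title(" Journal of Magnetic Resonance Imaging: Vol 54, No 1"))
# ('Journal of Magnetic Resonance Imaging', 54, 1)
```

In the notebook this could be called as `parse_issue_title(soupJMRI.title.get_text())`.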


In [60]:
# Print out the text
text = soupJMRI.get_text()
print(text)

# Another way to extract text
# str(all_links[103]).split("<h2>")[1].replace("</h2></a>", "")



var $DoubleClickZone = "j-magres-imaging_jmri";var $DoubleClickSite =  "wly.radiol.imag_000105";digitalData = {"site":{"ip":"152.78.0.24","environment":"LIVE","website":"onlinelibrary.wiley.com","websiteCode":"pericles","serverDate":"2021-08-05","server":"web117"},"identities":[{"type":"BasicGroup","uuid":"e623deab-c83b-41e3-84a2-d9f1d46c5f17"},{"type":"InstitutionUser","uuid":"58cbd8e6-3c5d-4fc0-ac26-57c4792d2c4b","customerRecords":[{"customerDomain":"ALM-CU","customerNumber":"EALWB000204"}]},{"type":"SmartGroupUser","uuid":"cf9c85b9-7f3b-4309-aeaa-c41669cabe3b"},{"type":"SmartGroupUser","uuid":"db8b8d10-1aed-4809-b3ee-70938c569014"},{"type":"SmartGroupUser","uuid":"cdce41c2-a8de-4a64-a4d9-a85caaaa2314"},{"type":"InstitutionUser","uuid":"bfa4ccf8-8126-488a-b090-da278aa7b4d9","customerRecords":[{"customerDomain":"ALM-CU","customerNumber":"ALOG00017"}]},{"type":"BasicGroup","uuid":"6a17ff38-0daf-4f0a-b4eb-b42c69d5c12f","customerRecords":[{"customerDomain":"ALM-CU","customerNumber":"CO

In [61]:
all_links = soupJMRI.find_all('a')

In [62]:
# h3 elements hold the section headings (article type, e.g. Editorial, Original Research)

all_h3 = soupJMRI.find_all("h3")

i = 0
for h3 in all_h3:
  # print(h3)
  print(h3.get_text())
  i += 1
print(i)

Menu
Format
Type of import
Cover Image
Issue Information
Commentary
Review Articles
Original Research
Editorial
Original Research
Editorial
Original Research
Editorial
Original Research
Editorial
Original Research
Editorial
Original Research
Editorial
Original Research
Editorial
Original Research
Editorial
Original Research
Editorial
Original Research
Editorial
Original Research
Editorial
Original Research
Editorial
Original Research
Editorial
Original Research
Editorial
Letter to the Editor
About Wiley Online Library
Help & Support
Opportunities
Connect with Wiley
40
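
The `h3` list mixes page navigation headings (Menu, Format, and so on) with the section headings of the issue itself. A small sketch of tallying only the section headings follows. The `NAVIGATION_HEADINGS` set is an assumption read off the output above and may need adjusting for other pages; `count_sections` is a hypothetical helper name.

```python
from collections import Counter

# Headings that belong to the page chrome rather than the issue itself.
# Assumption: this set is read off the output above and may differ elsewhere.
NAVIGATION_HEADINGS = {
    "Menu", "Format", "Type of import",
    "About Wiley Online Library", "Help & Support",
    "Opportunities", "Connect with Wiley",
}


def count_sections(headings):
    """Count section headings, ignoring the navigation headings."""
    cleaned = (h.strip() for h in headings)
    return Counter(h for h in cleaned if h not in NAVIGATION_HEADINGS)


# A shortened sample of the h3 texts printed above
sample = ["Menu", "Cover Image", "Original Research", "Editorial",
          "Original Research", "Editorial", "Help & Support"]
counts = count_sections(sample)
print(counts["Original Research"], counts["Editorial"], counts["Cover Image"])
# 2 2 1
```

With the parsed page, the input could be `(h3.get_text() for h3 in all_h3)`.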


### Explore: Add Category

In [63]:
# h4 elements hold the subject category (e.g. Neuro, Cardiac)

all_h4 = soupJMRI.find_all("h4")

i = 0
for h4 in all_h4:
  # print(h4)
  print(h4.get_text())
  i += 1
print(i)

Head and Neck
Head and Neck
Pelvis
Abdomen
Abdomen
Musculoskeletal
Musculoskeletal
Vascular
Neuro
Neuro
Neuro
Breast
Pediatrics
Cardiac
Cardiac
Safety
16


In [64]:
# li elements with class "ePubDate" hold the "First Published" date

all_li = soupJMRI.find_all("li")

i = 0
for li in all_li:
  li_class = li.get_attribute_list("class")
  # print(li_class)
  if li_class == ['ePubDate']:
    print(li.get_text())
    i += 1
  # print(li.get_text())
  # i += 1
print(i)

First Published: 11 June 2021
First Published: 11 June 2021
First Published: 13 May 2021
First Published: 25 June 2020
First Published: 20 June 2020
First Published: 26 August 2020
First Published: 10 March 2021
First Published: 26 February 2021
First Published: 11 February 2021
First Published: 28 March 2021
First Published: 11 February 2021
First Published: 15 February 2021
First Published: 15 February 2021
First Published: 09 February 2021
First Published: 08 February 2021
First Published: 08 February 2021
First Published: 16 March 2021
First Published: 23 April 2021
First Published: 28 February 2021
First Published: 15 February 2021
First Published: 21 February 2021
First Published: 11 May 2021
First Published: 21 December 2020
First Published: 03 January 2021
First Published: 25 January 2021
First Published: 31 December 2020
First Published: 09 February 2021
First Published: 16 February 2021
First Published: 12 March 2021
First Published: 08 February 2021
First Published: 06 Febru

In [65]:
all_h2 = soupJMRI.find_all("h2")

i = 0
for h2 in all_h2:
  h2_len = len(h2.get_text().split())
  if h2_len > 3:
    print(h2.get_text())
    i += 1
print(i)

# str_h2 = str(all_h2)

# links = soupJMRI.find_all("a")
# str_links = str(links)
# print(str_h2)

# cleantext = BeautifulSoup(str_h2, "lxml").get_text()
# cleantext = BeautifulSoup(str_links, "lxml").get_text()
# print(cleantext)

Can 3D Pseudo-Continuous Territorial Arterial Spin Labeling Effectively Diagnose Patients With Recanalization of Unilateral Middle Cerebral Artery Stenosis?
Societal and Research Population Biases
MRI-Based Quantitative Osteoporosis Imaging at the Spine and Femur
Diffusion Imaging in the Post HCP Era
Frontiers of Sodium MRI Revisited: From Cartilage to Brain Imaging
Association of Hypertension With Both Occurrence and Outcome of Symptomatic Patients With Mild Intracranial Atherosclerotic Stenosis: A Prospective Higher Resolution Magnetic Resonance Imaging Study
Editorial for “The Occurrence and Outcome of Mild Intracranial Atherosclerotic Stenosis: A Prospective High-Resolution MRI Study”
Intravoxel Incoherent Motion Magnetic Resonance Imaging for Prediction of Induction Chemotherapy Response in Locally Advanced Hypopharyngeal Carcinoma: Comparison With Model-Free Dynamic Contrast-Enhanced Magnetic Resonance Imaging
Editorial for “Intra-voxel incoherent motion (IVIM) MRI for prediction

### Clean up the title list

In [66]:
list_h2 = []
for h2 in all_h2:
  h2_len = len(h2.get_text().split())
  if h2_len > 4:
    cells = h2.get_text()
    list_h2.append(cells)

# Delete the last element: "Log in to Wiley Online Library"
list_h2.pop()

# Insert "Issue Information" as the 2nd element
list_h2.insert(1, "Issue Information")

# JMRI Vol 54 No 2
# list_h2.append("Erratum")
# JMRI Vol 54 No 3
# list_h2.insert(2, "Commentary")
# list_h2.append("Condensation Artifact")
# list_h2.append("Reviewer Acknowledgements")

for l in list_h2:
  print(l)
print(len(list_h2))

Can 3D Pseudo-Continuous Territorial Arterial Spin Labeling Effectively Diagnose Patients With Recanalization of Unilateral Middle Cerebral Artery Stenosis?
Issue Information
Societal and Research Population Biases
MRI-Based Quantitative Osteoporosis Imaging at the Spine and Femur
Diffusion Imaging in the Post HCP Era
Frontiers of Sodium MRI Revisited: From Cartilage to Brain Imaging
Association of Hypertension With Both Occurrence and Outcome of Symptomatic Patients With Mild Intracranial Atherosclerotic Stenosis: A Prospective Higher Resolution Magnetic Resonance Imaging Study
Editorial for “The Occurrence and Outcome of Mild Intracranial Atherosclerotic Stenosis: A Prospective High-Resolution MRI Study”
Intravoxel Incoherent Motion Magnetic Resonance Imaging for Prediction of Induction Chemotherapy Response in Locally Advanced Hypopharyngeal Carcinoma: Comparison With Model-Free Dynamic Contrast-Enhanced Magnetic Resonance Imaging
Editorial for “Intra-voxel incoherent motion (IVIM) 

In [67]:
list_date = []
for li in all_li:
  li_class = li.get_attribute_list("class")
  if li_class == ['ePubDate']:
    cells = li.get_text().split(': ')[1]
    list_date.append(cells)

for l in list_date:
  print(l)
print(len(list_date))

11 June 2021
11 June 2021
13 May 2021
25 June 2020
20 June 2020
26 August 2020
10 March 2021
26 February 2021
11 February 2021
28 March 2021
11 February 2021
15 February 2021
15 February 2021
09 February 2021
08 February 2021
08 February 2021
16 March 2021
23 April 2021
28 February 2021
15 February 2021
21 February 2021
11 May 2021
21 December 2020
03 January 2021
25 January 2021
31 December 2020
09 February 2021
16 February 2021
12 March 2021
08 February 2021
06 February 2021
15 February 2021
11 May 2021
08 February 2021
05 February 2021
09 January 2021
12 January 2021
18 February 2021
26 February 2021
17 February 2021
29 March 2021
09 February 2021
08 February 2021
24 January 2021
44
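
The dates are still plain strings such as `11 June 2021`, which sort alphabetically rather than chronologically. A minimal sketch of converting them to real dates with the standard library follows; `parse_pub_date` is a hypothetical helper, and the `%d %B %Y` format is assumed from the output above.

```python
from datetime import datetime


def parse_pub_date(text):
    """Parse a date such as '11 June 2021' into a datetime.date.

    Note: %B matches English month names only under an English
    (or the default C) locale.
    """
    return datetime.strptime(text.strip(), "%d %B %Y").date()


sample_dates = ["11 June 2021", "25 June 2020", "09 February 2021"]
parsed = sorted(parse_pub_date(d) for d in sample_dates)
print([d.isoformat() for d in parsed])
# ['2020-06-25', '2021-02-09', '2021-06-11']
```

Once the data is in a DataFrame, `pd.to_datetime(df['First Published'], format='%d %B %Y')` should give the same result for the whole column.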


In [68]:
list_url = []
for link in all_links:
  if 'visitable' in str(link.get("class")): # and '/doi/' in link.get("href"):
    cells = link.get("href").replace('/doi', 'https://doi.org')
    list_url.append(cells)

for l in list_url:
  print(l)
print(len(list_url))

https://doi.org/10.1002/jmri.27227
https://doi.org/10.1002/jmri.27228
https://doi.org/10.1002/jmri.27681
https://doi.org/10.1002/jmri.27260
https://doi.org/10.1002/jmri.27247
https://doi.org/10.1002/jmri.27326
https://doi.org/10.1002/jmri.27516
https://doi.org/10.1002/jmri.27571
https://doi.org/10.1002/jmri.27537
https://doi.org/10.1002/jmri.27621
https://doi.org/10.1002/jmri.27546
https://doi.org/10.1002/jmri.27547
https://doi.org/10.1002/jmri.27549
https://doi.org/10.1002/jmri.27540
https://doi.org/10.1002/jmri.27538
https://doi.org/10.1002/jmri.27533
https://doi.org/10.1002/jmri.27548
https://doi.org/10.1002/jmri.27645
https://doi.org/10.1002/jmri.27574
https://doi.org/10.1002/jmri.27550
https://doi.org/10.1002/jmri.27560
https://doi.org/10.1002/jmri.27660
https://doi.org/10.1002/jmri.27469
https://doi.org/10.1002/jmri.27498
https://doi.org/10.1002/jmri.27505
https://doi.org/10.1002/jmri.27499
https://doi.org/10.1002/jmri.27544
https://doi.org/10.1002/jmri.27514
https://doi.org/10.1
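
The cell above relies on `str.replace`, which rewrites every occurrence of `/doi` in the string, and `link.get("href")` returns `None` for anchors without an `href`, which would raise an `AttributeError`. A slightly more defensive sketch is given below. The helper name `href_to_doi_url` is hypothetical, and the `/doi/<DOI>` href form is assumed from the output above.

```python
DOI_PREFIX = "/doi/"  # assumed href form: "/doi/10.1002/jmri.27227"
DOI_RESOLVER = "https://doi.org/"


def href_to_doi_url(href):
    """Convert a site-relative '/doi/...' href into a doi.org URL.

    Returns None when the href is missing or does not start with '/doi/'.
    """
    if not href or not href.startswith(DOI_PREFIX):
        return None
    return DOI_RESOLVER + href[len(DOI_PREFIX):]


print(href_to_doi_url("/doi/10.1002/jmri.27227"))
# https://doi.org/10.1002/jmri.27227
```

Skipping the `None` results keeps non-article links out of `list_url`.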

In [69]:
# df = pd.DataFrame({'title':list_h2, 'url': list_url})
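# Use the title text (not the bs4 Tag) so the Journal column holds a plain string
df = pd.DataFrame({'Journal': title.get_text(strip=True), 'Title': list_h2, 'First Published': list_date, 'DOI': list_url})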
df

Unnamed: 0,Journal,Title,First Published,DOI
0,Journal of Magnetic Resonance Imaging: Vol 54...,Can 3D Pseudo-Continuous Territorial Arterial ...,11 June 2021,https://doi.org/10.1002/jmri.27227
1,Journal of Magnetic Resonance Imaging: Vol 54...,Issue Information,11 June 2021,https://doi.org/10.1002/jmri.27228
2,Journal of Magnetic Resonance Imaging: Vol 54...,Societal and Research Population Biases,13 May 2021,https://doi.org/10.1002/jmri.27681
3,Journal of Magnetic Resonance Imaging: Vol 54...,MRI-Based Quantitative Osteoporosis Imaging at...,25 June 2020,https://doi.org/10.1002/jmri.27260
4,Journal of Magnetic Resonance Imaging: Vol 54...,Diffusion Imaging in the Post HCP Era,20 June 2020,https://doi.org/10.1002/jmri.27247
5,Journal of Magnetic Resonance Imaging: Vol 54...,Frontiers of Sodium MRI Revisited: From Cartil...,26 August 2020,https://doi.org/10.1002/jmri.27326
6,Journal of Magnetic Resonance Imaging: Vol 54...,Association of Hypertension With Both Occurren...,10 March 2021,https://doi.org/10.1002/jmri.27516
7,Journal of Magnetic Resonance Imaging: Vol 54...,Editorial for “The Occurrence and Outcome of M...,26 February 2021,https://doi.org/10.1002/jmri.27571
8,Journal of Magnetic Resonance Imaging: Vol 54...,Intravoxel Incoherent Motion Magnetic Resonanc...,11 February 2021,https://doi.org/10.1002/jmri.27537
9,Journal of Magnetic Resonance Imaging: Vol 54...,Editorial for “Intra-voxel incoherent motion (...,28 March 2021,https://doi.org/10.1002/jmri.27621
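
Building the DataFrame only works when the title, date, and DOI lists have the same length, and the title list is adjusted by hand (`pop`, `insert`) for each issue. A small check run before `pd.DataFrame` can report which list is off. This is a sketch; `check_equal_lengths` is a hypothetical helper, not part of pandas.

```python
def check_equal_lengths(**columns):
    """Return the length of each named list; raise ValueError if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


print(check_equal_lengths(titles=["a", "b"], dates=["11 June 2021", "13 May 2021"]))
# {'titles': 2, 'dates': 2}
```

In the notebook the call would be `check_equal_lengths(Title=list_h2, Date=list_date, DOI=list_url)`.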


### Finally, save as csv file

In [70]:
df.to_csv(#path_or_buf='/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/processed/jmri-vol-54-no-3.csv',
          #path_or_buf='/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/processed/jmri-vol-54-no-2.csv',
          path_or_buf='/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/processed/jmri-vol-54-no-1.csv', 
          index=False)
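
Switching issues currently means commenting lines in and out. As a sketch, the output path could instead be derived from the volume and issue numbers. The folder and file-name pattern are copied from the cell above; `csv_path` is a hypothetical helper.

```python
from pathlib import PurePosixPath

# Output folder copied from the cell above
PROCESSED_DIR = PurePosixPath(
    "/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/processed"
)


def csv_path(volume, issue):
    """Return the CSV path for one issue, e.g. .../jmri-vol-54-no-1.csv."""
    return PROCESSED_DIR / f"jmri-vol-{volume}-no-{issue}.csv"


print(csv_path(54, 1))
```

The result can be passed on as `df.to_csv(str(csv_path(54, 1)), index=False)`.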