# Web Scraping using Python

## Target Journal: JMRI

Important fields to include:
 - Category
 - Date first published

Examples:
- https://onlinelibrary.wiley.com/toc/15222586/2021/54/1
- https://onlinelibrary.wiley.com/toc/15222586/2021/54/2
- https://onlinelibrary.wiley.com/toc/15222586/2021/54/3
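
The three example URLs differ only in the final issue number, so they can be generated rather than typed by hand. The following is a minimal sketch under that assumption; `toc_url` and `JMRI_JOURNAL_ID` are hypothetical names introduced here, and the year/volume/issue pattern is inferred only from the examples above.

```python
# Journal code as it appears in the example URLs above.
JMRI_JOURNAL_ID = "15222586"


def toc_url(year: int, volume: int, issue: int) -> str:
    """Build a table-of-contents URL (hypothetical helper, pattern inferred from the examples)."""
    return f"https://onlinelibrary.wiley.com/toc/{JMRI_JOURNAL_ID}/{year}/{volume}/{issue}"


# Rebuild the three example URLs for Vol 54, issues 1 to 3.
urls = [toc_url(2021, 54, issue) for issue in (1, 2, 3)]
for u in urls:
    print(u)
```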

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#To easily display the plots, make sure to include the line %matplotlib inline as shown below.
%matplotlib inline

import urllib
import requests
import lxml

#To perform web scraping, you should also import the libraries shown below. 
#The urllib.request module is used to open URLs. 
#The Beautiful Soup package is used to extract data from html files. 
#The Beautiful Soup library's name is bs4 which stands for Beautiful Soup, version 4.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

## Method 1 (default): First, save the page as a local HTML file
 - See: https://zetcode.com/python/beautifulsoup/

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [90]:
# JMRI Vol 54 No 1
# URL = '/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/rawdata/Journal of Magnetic Resonance Imaging_Vol 54, No 1.html'
# JMRI Vol 54 No 2
# URL = '/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/rawdata/Journal of Magnetic Resonance Imaging_ Vol 54, No 2.html'
# Current: JMRI Vol 54 No 3
URL = '/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/rawdata/Journal of Magnetic Resonance Imaging_ Vol 54, No 3.html'

with open(URL, 'r') as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

Parse the saved table-of-contents page and identify the title information

In [91]:
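def safeOpenParsePage(targetUrl):
    # targetUrl is the path of a locally saved HTML file (Method 1), not a live URL.
    try:
        # tmpurl = urlopen(Request(targetUrl, headers={'User-Agent': 'Chrome/92.0.4515.107'}))
        # The with-block closes the file even if reading fails.
        with open(targetUrl, 'r') as tmpurl:
            tmpR = tmpurl.read()
        # tmpSoup = BeautifulSoup(tmpR, 'html.parser')
        tmpSoup = BeautifulSoup(tmpR, 'lxml')
        return tmpSoup
    except OSError as e:
        # OSError covers a missing or unreadable file; urllib.error.HTTPError
        # (raised by the urlopen variant) is a subclass of OSError.
        print(e)
        return None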

soupJMRI = safeOpenParsePage(URL)
if soupJMRI is not None:
      print(soupJMRI.prettify())

<!DOCTYPE html>
<html class="pb-page" data-request-id="e9a30f61-73f9-40c9-a53e-3f820aea96ab" lang="en">
 <head data-pb-dropzone="head">
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content=";journal:journal:15222586;issue:issue:doi\:10.1002/jmri.v54.3;ctype:string:Journal Content;website:website:pericles;page:string:Table of Contents;requestedJournal:journal:15222586;wgroup:string:Publication Websites;pageGroup:string:Publication Pages" name="pbContext"/>
  <script type="text/javascript">
   var $DoubleClickZone = "j-magres-imaging_jmri";var $DoubleClickSite =  "wly.radiol.imag_000105";
  </script>
  <script id="analyticDigitalData">
   digitalData = {"site":{"ip":"109.175.189.34","environment":"LIVE","website":"onlinelibrary.wiley.com","websiteCode":"pericles","serverDate":"2021-08-11","server":"web122"},"identities":[{"type":"SmartGroupUser","uuid":"b238ab85-6bb4-49e4-84ca-a0d9f3ee5a35"},{"type":"BasicGroup","uuid":"84aa8781-b32a-4f06-979c-30f3ab78ad73"},{"type":"

In [92]:
# Get the title
title = soupJMRI.title
print(title)

<title> Journal of Magnetic Resonance Imaging: Vol 54, No 3</title>


In [93]:
# Print out the text
text = soupJMRI.get_text()
print(text)

# Another way to extract text
# str(all_links[103]).split("<h2>")[1].replace("</h2></a>", "")



var $DoubleClickZone = "j-magres-imaging_jmri";var $DoubleClickSite =  "wly.radiol.imag_000105";digitalData = {"site":{"ip":"109.175.189.34","environment":"LIVE","website":"onlinelibrary.wiley.com","websiteCode":"pericles","serverDate":"2021-08-11","server":"web122"},"identities":[{"type":"SmartGroupUser","uuid":"b238ab85-6bb4-49e4-84ca-a0d9f3ee5a35"},{"type":"BasicGroup","uuid":"84aa8781-b32a-4f06-979c-30f3ab78ad73"},{"type":"SmartGroupUser","uuid":"eabce89b-65c2-4a09-ab7c-bd2b3fea2f10"},{"type":"BasicGroup","uuid":"b2f59a00-e5f8-40f1-8e75-8c7dd6bce0ba"},{"type":"BasicGroup","uuid":"f6bf73a6-fd29-4303-8301-24bd8a179b03"},{"type":"BasicGroup","uuid":"e623deab-c83b-41e3-84a2-d9f1d46c5f17"},{"type":"ReferrerUser","uuid":"5a52f27c-b57e-4a2d-8de6-c4f2e0fa8465"},{"type":"BasicGroup","uuid":"8b5697d0-0580-43b3-83ea-a258c8085ad5","customerRecords":[{"customerDomain":"ALM-CU","customerNumber":"CORE00000004148"}]},{"type":"BasicGroup","uuid":"a9298b81-eb3e-4dfb-82a5-86f98d156f4d","customerRec

In [94]:
all_links = soupJMRI.find_all('a')

### Explore Category

In [95]:
# h3 headings correspond to article sections (e.g. Editorial, Research Articles)

all_h3 = soupJMRI.find_all("h3")

i = 0
for h3 in all_h3:
  # print(h3)
  print(h3.get_text())
  i += 1
print(i)

Menu
Format
Type of import
Cover Image
Issue Information
Commentary
Review Article
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Case Report
Reviewer Appreciation
About Wiley Online Library
Help & Support
Opportunities
Connect with Wiley
47


### Explore Original Research Subcategory

In [96]:
# h4 headings correspond to the Original Research subcategory (e.g. Breast, Neuro)

all_h4 = soupJMRI.find_all("h4")

i = 0
for h4 in all_h4:
  # print(h4)
  print(h4.get_text())
  i += 1
print(i)

Breast
Abdomen
Abdomen
Abdomen
Vascular
Cardiac
Cardiac
Pediatrics
Musculoskeletal
Head and Neck
Neuro
Neuro
Neuro
Neuro
Neuro
Neuro
Thoracic
Pelvis
Pelvis
Technical
Technical
21


In [97]:
all_h2 = soupJMRI.find_all("h2")

i = 0
for h2 in all_h2:
  h2_len = len(h2.get_text().split())
  if h2_len > 3:
    print(h2.get_text())
    i += 1
print(i)

# str_h2 = str(all_h2)

# links = soupJMRI.find_all("a")
# str_links = str(links)
# print(str_h2)

# cleantext = BeautifulSoup(str_h2, "lxml").get_text()
# cleantext = BeautifulSoup(str_links, "lxml").get_text()
# print(cleantext)

Whole Brain Adiabatic T1rho and Relaxation Along a Fictitious Field Imaging in Healthy Volunteers and Patients With Multiple Sclerosis: Initial Findings
               
AI-Enhanced Diagnosis of Challenging Lesions in Breast MRI: A Methodology and Application Primer
Intratumoral and Peritumoral Radiomics Based on Functional Parametric Maps from Breast DCE-MRI for Prediction of HER-2 and Ki-67 Status
Reduced Field-of-View Diffusion-Weighted Magnetic Resonance Imaging of the Pancreas With Tilted Excitation Plane: A Preliminary Study
Quantitative Susceptibility Mapping Using a Multispectral Autoregressive Moving Average Model to Assess Hepatic Iron Overload
Bowel Wall Visualization Using MR Enterography in Relationship to Bowel Lumen Contents and Patient Demographics
Editorial for “Bowel Wall Visualization Using MR Enterography in Relationship to Bowel Lumen Contents and Patient Demographics”
MRI Measures of Murine Liver Fibrosis
Editorial for “MRI measures of murine liver fibrosis”
Intrav

### Clean up the title list

In [103]:
list_h2 = []
for h2 in all_h2:
  h2_len = len(h2.get_text().split())
  if h2_len > 4:
    cells = h2.get_text()
    list_h2.append(cells)

# Delete the last element: "Log in to Wiley Online Library"
list_h2.pop()

# Insert "Issue Information" as the 2nd element
list_h2.insert(1, "Issue Information")

# JMRI Vol 54 No 2
# list_h2.append("Erratum")

# JMRI Vol 54 No 3
list_h2.insert(2, "Commentary")
list_h2.append("Condensation Artifact")
list_h2.append("Reviewer Acknowledgements")

for l in list_h2:
  print(l)
print(len(list_h2))

Whole Brain Adiabatic T1rho and Relaxation Along a Fictitious Field Imaging in Healthy Volunteers and Patients With Multiple Sclerosis: Initial Findings
               
Issue Information
Commentary
AI-Enhanced Diagnosis of Challenging Lesions in Breast MRI: A Methodology and Application Primer
Intratumoral and Peritumoral Radiomics Based on Functional Parametric Maps from Breast DCE-MRI for Prediction of HER-2 and Ki-67 Status
Reduced Field-of-View Diffusion-Weighted Magnetic Resonance Imaging of the Pancreas With Tilted Excitation Plane: A Preliminary Study
Quantitative Susceptibility Mapping Using a Multispectral Autoregressive Moving Average Model to Assess Hepatic Iron Overload
Bowel Wall Visualization Using MR Enterography in Relationship to Bowel Lumen Contents and Patient Demographics
Editorial for “Bowel Wall Visualization Using MR Enterography in Relationship to Bowel Lumen Contents and Patient Demographics”
MRI Measures of Murine Liver Fibrosis
Editorial for “MRI measures of 

### Add dates first published

In [104]:
# li elements with class "ePubDate" hold the "First Published" date

all_li = soupJMRI.find_all("li")

# i = 0
# for li in all_li:
  # li_class = li.get_attribute_list("class")
  # print(li_class)
  # if li_class == ['ePubDate']:
    # print(li.get_text())
    # i += 1
  # print(li.get_text())
  # i += 1
# print(i)

list_date = []
for li in all_li:
  li_class = li.get_attribute_list("class")
  if li_class == ['ePubDate']:
    cells = li.get_text().split(': ')[1]
    list_date.append(cells)

for l in list_date:
  print(l)
print(len(list_date))

11 August 2021
11 August 2021
13 July 2021
30 August 2020
06 May 2021
11 March 2021
26 February 2021
04 March 2021
08 May 2021
18 March 2021
01 June 2021
21 March 2021
13 April 2021
07 April 2021
06 May 2021
25 February 2021
01 March 2021
22 February 2021
26 March 2021
04 March 2021
05 May 2021
23 April 2021
01 June 2021
14 March 2021
24 March 2021
29 March 2021
08 April 2021
06 March 2021
11 March 2021
10 March 2021
04 April 2021
28 February 2021
13 March 2021
24 March 2021
20 April 2021
23 April 2021
22 April 2021
20 May 2021
08 May 2021
03 May 2021
14 June 2021
07 May 2021
31 March 2021
21 April 2021
01 April 2021
30 May 2021
15 April 2021
16 April 2021
03 May 2021
23 April 2021
11 August 2021
51
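
The dates above are plain strings in day, full month name, year order. If they need to be sorted or compared later, they can be converted to real timestamps. This is a minimal sketch using a few values copied from the output above; the variable names are illustrative, and the `%d %B %Y` format assumes English month names.

```python
import pandas as pd

# A few date strings copied from the output above.
sample_dates = ["11 August 2021", "13 July 2021", "30 August 2020", "06 May 2021"]

# Parse "day month-name year" strings into timestamps.
parsed = pd.to_datetime(sample_dates, format="%d %B %Y")

# With real timestamps, the earliest and latest dates are easy to find.
print(parsed.min().date(), parsed.max().date())
```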


In [105]:
list_url = []
for link in all_links:
  if 'visitable' in str(link.get("class")): # and '/doi/' in link.get("href"):
    cells = link.get("href").replace('/doi', 'https://doi.org')
    list_url.append(cells)

for l in list_url:
  print(l)
print(len(list_url))

https://doi.org/10.1002/jmri.27231
https://doi.org/10.1002/jmri.27232
https://doi.org/10.1002/jmri.27830
https://doi.org/10.1002/jmri.27332
https://doi.org/10.1002/jmri.27651
https://doi.org/10.1002/jmri.27590
https://doi.org/10.1002/jmri.27584
https://doi.org/10.1002/jmri.27589
https://doi.org/10.1002/jmri.27668
https://doi.org/10.1002/jmri.27601
https://doi.org/10.1002/jmri.27698
https://doi.org/10.1002/jmri.27604
https://doi.org/10.1002/jmri.27642
https://doi.org/10.1002/jmri.27631
https://doi.org/10.1002/jmri.27663
https://doi.org/10.1002/jmri.27578
https://doi.org/10.1002/jmri.27581
https://doi.org/10.1002/jmri.27573
https://doi.org/10.1002/jmri.27618
https://doi.org/10.1002/jmri.27588
https://doi.org/10.1002/jmri.27666
https://doi.org/10.1002/jmri.27649
https://doi.org/10.1002/jmri.27759
https://doi.org/10.1002/jmri.27600
https://doi.org/10.1002/jmri.27610
https://doi.org/10.1002/jmri.27624
https://doi.org/10.1002/jmri.27633
https://doi.org/10.1002/jmri.27586
https://doi.org/10.1
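
The `replace('/doi', 'https://doi.org')` call above works because the hrefs on this page are relative paths beginning with `/doi/`. A slightly more defensive variant is sketched below; `href_to_doi_url` is a hypothetical helper, and it assumes that everything after the first `/doi/` segment is the DOI itself, which may not hold for other link formats.

```python
def href_to_doi_url(href):
    """Turn a link containing '/doi/<DOI>' into a doi.org URL (hypothetical helper)."""
    marker = "/doi/"
    idx = href.find(marker)
    if idx == -1:
        # Not an article link; let the caller decide what to do.
        return None
    return "https://doi.org/" + href[idx + len(marker):]


# Works for a relative href as well as an absolute one.
print(href_to_doi_url("/doi/10.1002/jmri.27231"))
print(href_to_doi_url("https://onlinelibrary.wiley.com/doi/10.1002/jmri.27231"))
print(href_to_doi_url("/action/showLogin"))
```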

In [106]:
# df = pd.DataFrame({'title':list_h2, 'url': list_url})
# Store the title text rather than the bs4 Tag object
df = pd.DataFrame({'Journal': title.get_text(strip=True), 
                   #'Category': list_category, 
                   'Title': list_h2, 
                   'First Published': list_date, 
                   'DOI': list_url})
df

| | Journal | Title | First Published | DOI |
|---|---|---|---|---|
| 0 | Journal of Magnetic Resonance Imaging: Vol 54... | Whole Brain Adiabatic T1rho and Relaxation Alo... | 11 August 2021 | https://doi.org/10.1002/jmri.27231 |
| 1 | Journal of Magnetic Resonance Imaging: Vol 54... | Issue Information | 11 August 2021 | https://doi.org/10.1002/jmri.27232 |
| 2 | Journal of Magnetic Resonance Imaging: Vol 54... | Commentary | 13 July 2021 | https://doi.org/10.1002/jmri.27830 |
| 3 | Journal of Magnetic Resonance Imaging: Vol 54... | AI-Enhanced Diagnosis of Challenging Lesions i... | 30 August 2020 | https://doi.org/10.1002/jmri.27332 |
| 4 | Journal of Magnetic Resonance Imaging: Vol 54... | Intratumoral and Peritumoral Radiomics Based o... | 06 May 2021 | https://doi.org/10.1002/jmri.27651 |
| 5 | Journal of Magnetic Resonance Imaging: Vol 54... | Reduced Field-of-View Diffusion-Weighted Magne... | 11 March 2021 | https://doi.org/10.1002/jmri.27590 |
| 6 | Journal of Magnetic Resonance Imaging: Vol 54... | Quantitative Susceptibility Mapping Using a Mu... | 26 February 2021 | https://doi.org/10.1002/jmri.27584 |
| 7 | Journal of Magnetic Resonance Imaging: Vol 54... | Bowel Wall Visualization Using MR Enterography... | 04 March 2021 | https://doi.org/10.1002/jmri.27589 |
| 8 | Journal of Magnetic Resonance Imaging: Vol 54... | Editorial for “Bowel Wall Visualization Using ... | 08 May 2021 | https://doi.org/10.1002/jmri.27668 |
| 9 | Journal of Magnetic Resonance Imaging: Vol 54... | MRI Measures of Murine Liver Fibrosis | 18 March 2021 | https://doi.org/10.1002/jmri.27601 |
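
Because the title, date, and DOI lists are built separately (and the title list is patched by hand), the DataFrame constructor above only succeeds when all lists have the same length; pandas raises a `ValueError` otherwise. A small pre-check can report which list is off. This is a sketch only; `check_equal_lengths` is a hypothetical helper and the toy lists stand in for `list_h2`, `list_date`, and `list_url`.

```python
def check_equal_lengths(**columns):
    """Return each column's length, raising if they are not all equal (hypothetical helper)."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Toy lists standing in for list_h2, list_date and list_url.
titles = ["Title A", "Title B"]
dates = ["11 August 2021", "13 July 2021"]
dois = ["https://doi.org/10.1002/jmri.27231", "https://doi.org/10.1002/jmri.27232"]

print(check_equal_lengths(Title=titles, Date=dates, DOI=dois))
```

Running the check just before building the DataFrame makes a mismatch (for example, a title that was filtered out by the word-count rule) easier to trace.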


### Manually create the list of categories based on Issue Information
 - See [here](https://stackoverflow.com/questions/4654414/python-append-item-to-list-n-times) for how to repeat elements in a list N times
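
As a quick illustration of the idiom used in the next cell: multiplying a list by an integer repeats the whole list, so a two-element list repeated twice yields an alternating pattern. The variable names here are illustrative only.

```python
# Repeating a one-element list gives N copies of that element.
singles = ['Review Articles'] * 3

# Repeating a two-element list gives an alternating pattern.
pairs = ['Original Research: Abdomen', 'Editorial'] * 2

categories = ['Cover Image']
categories.extend(singles)
categories.extend(pairs)

for c in categories:
    print(c)
print(len(categories))
```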

In [113]:
list_category = ['Cover Image', 'Issue Information', 'Commentary']

### Vol 54 No 1 ###
# list_category.extend(['Review Articles'] * 3)
# list_category.extend(['Original Research: Head and Neck', 'Editorial'] * 2)
# list_category.extend(['Original Research: Pelvis'] * 1)
# list_category.extend(['Original Research: Abdomen'] * 1)
# list_category.extend(['Original Research: Abdomen', 'Editorial'] * 2)
# list_category.extend(['Original Research: Musculoskeletal', 'Editorial'] * 1)
# list_category.extend(['Original Research: Musculoskeletal'] * 1)
# list_category.extend(['Original Research: Vascular'] * 1)
# list_category.extend(['Original Research: Vascular', 'Editorial'] * 1)
# list_category.extend(['Original Research: Neuro'] * 3)
# list_category.extend(['Original Research: Neuro', 'Editorial'] * 3)
# list_category.extend(['Original Research: Breast', 'Editorial'] * 1)
# list_category.extend(['Original Research: Pediatrics', 'Editorial'] * 1)
# list_category.extend(['Original Research: Cardiac'] * 2)
# list_category.extend(['Original Research: Cardiac', 'Editorial'] * 2)
# list_category.extend(['Original Research: Safety', 'Editorial'] * 1)
# list_category.extend(['Letter to the Editor'] * 1)

### Vol 54 No 2 ###
# list_category.extend(['CME Article'] * 1)
# list_category.extend(['Review Articles'] * 3)
# list_category.extend(['Original Research: Whole Body', 'Editorial'] * 1)
# list_category.extend(['Original Research: Cardiac'] * 4)
# list_category.extend(['Original Research: Pelvis', 'Editorial'] * 2)
# list_category.extend(['Original Research: Technical', 'Editorial'] * 1)
# list_category.extend(['Original Research: Musculoskeletal'] * 2)
# list_category.extend(['Original Research: Abdomen'] * 3)
# list_category.extend(['Original Research: Abdomen', 'Editorial'] * 1)
# list_category.extend(['Original Research: Neuro'] * 2)
# list_category.extend(['Original Research: Neuro', 'Editorial'] * 4)
# list_category.extend(['Original Research: Thoracic', 'Editorial'] * 1)
# list_category.extend(['Original Research: Breast'] * 1)
# list_category.extend(['Original Research: Vascular'] * 1)
# list_category.extend(['Original Research: Vascular', 'Editorial'] * 1)
# list_category.extend(['Original Research: Case Report'] * 1)
# list_category.extend(['Erratum'] * 1)

### Vol 54 No 3 ###
list_category.extend(['Review Articles'] * 1)
list_category.extend(['Original Research: Breast'] * 1)
list_category.extend(['Original Research: Abdomen'] * 2)
list_category.extend(['Original Research: Abdomen', 'Editorial'] * 3)
list_category.extend(['Original Research: Vascular', 'Editorial'] * 1)
list_category.extend(['Original Research: Cardiac'] * 2)
list_category.extend(['Original Research: Cardiac', 'Editorial'] * 2)
list_category.extend(['Original Research: Pediatrics', 'Editorial'] * 1)
list_category.extend(['Original Research: Musculoskeletal'] * 1)
list_category.extend(['Original Research: Musculoskeletal', 'Editorial'] * 1)
list_category.extend(['Original Research: Head and Neck'] * 1)
list_category.extend(['Original Research: Neuro'] * 2)
list_category.extend(['Original Research: Neuro', 'Editorial'] * 6)
list_category.extend(['Original Research: Thoracic'] * 1)
list_category.extend(['Original Research: Pelvis', 'Editorial'] * 2)
list_category.extend(['Original Research: Technical'] * 1)
list_category.extend(['Original Research: Technical', 'Editorial'] * 1)
list_category.extend(['Original Research: Case Report: Technical'] * 1)
list_category.extend(['Reviewer Appreciation'] * 1)

i = 0
for l in list_category:
  print(l)
  i += 1
print(i)

Cover Image
Issue Information
Commentary
Review Articles
Original Research: Breast
Original Research: Abdomen
Original Research: Abdomen
Original Research: Abdomen
Editorial
Original Research: Abdomen
Editorial
Original Research: Abdomen
Editorial
Original Research: Vascular
Editorial
Original Research: Cardiac
Original Research: Cardiac
Original Research: Cardiac
Editorial
Original Research: Cardiac
Editorial
Original Research: Pediatrics
Editorial
Original Research: Musculoskeletal
Original Research: Musculoskeletal
Editorial
Original Research: Head and Neck
Original Research: Neuro
Original Research: Neuro
Original Research: Neuro
Editorial
Original Research: Neuro
Editorial
Original Research: Neuro
Editorial
Original Research: Neuro
Editorial
Original Research: Neuro
Editorial
Original Research: Neuro
Editorial
Original Research: Thoracic
Original Research: Pelvis
Editorial
Original Research: Pelvis
Editorial
Original Research: Technical
Original Research: Technical
Editorial
Origi
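
The manual list above has to be rewritten for every issue. One possible alternative is to walk the headings in document order and carry the most recent section and subcategory forward to each title. The sketch below runs on toy markup, not on the real Wiley page: it assumes `h3` marks a section, `h4` a subcategory, and `h2` an article title, and `label_titles` is a hypothetical helper. The real page also contains navigation headings that would need filtering first.

```python
from bs4 import BeautifulSoup

# Toy markup (an assumption, NOT the real page structure).
html = """
<h3>Review Article</h3>
<h2>Review title A</h2>
<h3>Research Articles</h3>
<h4>Breast</h4>
<h2>Research title B</h2>
<h3>Editorial</h3>
<h2>Editorial for B</h2>
<h3>Research Articles</h3>
<h4>Abdomen</h4>
<h2>Research title C</h2>
"""


def label_titles(soup):
    """Pair each h2 title with the most recent h3/h4 headings seen before it."""
    rows = []
    section, subcategory = None, None
    # find_all with a list of tag names returns matches in document order.
    for tag in soup.find_all(["h2", "h3", "h4"]):
        text = tag.get_text(strip=True)
        if tag.name == "h3":
            section = text
        elif tag.name == "h4":
            subcategory = text
        else:
            if section == "Research Articles" and subcategory:
                category = f"Original Research: {subcategory}"
            else:
                category = section
            rows.append((category, text))
    return rows


rows = label_titles(BeautifulSoup(html, "html.parser"))
for category, title_text in rows:
    print(category, "|", title_text)
```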

In [114]:
df.insert(1, 'Category', list_category)
df

| | Journal | Category | Title | First Published | DOI |
|---|---|---|---|---|---|
| 0 | Journal of Magnetic Resonance Imaging: Vol 54... | Cover Image | Whole Brain Adiabatic T1rho and Relaxation Alo... | 11 August 2021 | https://doi.org/10.1002/jmri.27231 |
| 1 | Journal of Magnetic Resonance Imaging: Vol 54... | Issue Information | Issue Information | 11 August 2021 | https://doi.org/10.1002/jmri.27232 |
| 2 | Journal of Magnetic Resonance Imaging: Vol 54... | Commentary | Commentary | 13 July 2021 | https://doi.org/10.1002/jmri.27830 |
| 3 | Journal of Magnetic Resonance Imaging: Vol 54... | Review Articles | AI-Enhanced Diagnosis of Challenging Lesions i... | 30 August 2020 | https://doi.org/10.1002/jmri.27332 |
| 4 | Journal of Magnetic Resonance Imaging: Vol 54... | Original Research: Breast | Intratumoral and Peritumoral Radiomics Based o... | 06 May 2021 | https://doi.org/10.1002/jmri.27651 |
| 5 | Journal of Magnetic Resonance Imaging: Vol 54... | Original Research: Abdomen | Reduced Field-of-View Diffusion-Weighted Magne... | 11 March 2021 | https://doi.org/10.1002/jmri.27590 |
| 6 | Journal of Magnetic Resonance Imaging: Vol 54... | Original Research: Abdomen | Quantitative Susceptibility Mapping Using a Mu... | 26 February 2021 | https://doi.org/10.1002/jmri.27584 |
| 7 | Journal of Magnetic Resonance Imaging: Vol 54... | Original Research: Abdomen | Bowel Wall Visualization Using MR Enterography... | 04 March 2021 | https://doi.org/10.1002/jmri.27589 |
| 8 | Journal of Magnetic Resonance Imaging: Vol 54... | Editorial | Editorial for “Bowel Wall Visualization Using ... | 08 May 2021 | https://doi.org/10.1002/jmri.27668 |
| 9 | Journal of Magnetic Resonance Imaging: Vol 54... | Original Research: Abdomen | MRI Measures of Murine Liver Fibrosis | 18 March 2021 | https://doi.org/10.1002/jmri.27601 |


### Finally, save as a CSV file

In [115]:
df.to_csv(path_or_buf='/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/processed/jmri-vol-54-no-3.csv',
          # path_or_buf='/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/processed/jmri-vol-54-no-2.csv',
          # path_or_buf='/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/processed/jmri-vol-54-no-1.csv', 
          index=False)
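
Since each issue is saved to its own CSV with the same columns, the per-issue files can later be stacked into one table. The following is a minimal sketch of that step; the in-memory strings stand in for saved files, and their rows are made-up placeholders, not scraped data. In practice the file paths would be passed to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Placeholder content standing in for two saved per-issue CSV files (values are made up).
csv_issue_a = io.StringIO(
    "Journal,Category,Title,First Published,DOI\n"
    "Example Journal: Vol 1 No 1,Editorial,Example title one,01 June 2021,https://doi.org/10.0000/example.1\n"
)
csv_issue_b = io.StringIO(
    "Journal,Category,Title,First Published,DOI\n"
    "Example Journal: Vol 1 No 2,Erratum,Example title two,02 July 2021,https://doi.org/10.0000/example.2\n"
)

# Read each file and stack the rows, renumbering the index from 0.
frames = [pd.read_csv(f) for f in (csv_issue_a, csv_issue_b)]
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
```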