# Scraping Web Contents from Journal of Magnetic Resonance Imaging

In this demo, we will scrape web content from Volume 55, Issue 6 of the Journal of Magnetic Resonance Imaging, which can be accessed via the link below:
- https://onlinelibrary.wiley.com/toc/15222586/2022/55/6

Open the link above and save the page locally as an HTML file. In this demo, the saved file is stored on Google Drive.

## First, import packages

To perform web scraping, let's start by importing the packages below. 

In [17]:
import pandas as pd
from bs4 import BeautifulSoup

The `BeautifulSoup` class extracts data from HTML files. It is provided by the Beautiful Soup library, whose package name `bs4` stands for Beautiful Soup, version 4.

 - See: https://zetcode.com/python/beautifulsoup/
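
As a quick illustration of what the parser returns, the sketch below parses a small inline HTML string and pulls out the title and headings. The HTML is made up for this example and is not taken from the Wiley page. It uses Python's built-in `html.parser`, so it should run without `lxml` installed.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document (not the real Wiley page)
sample_html = """
<html>
  <head><title> Demo Journal: Vol 1, No 1</title></head>
  <body>
    <h3>Review Articles</h3>
    <h2>A First Example Title</h2>
    <h3>Research Articles</h3>
    <h2>A Second Example Title</h2>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# get_text(strip=True) returns the text with surrounding whitespace removed
page_title = soup.title.get_text(strip=True)
headings_h3 = [h3.get_text(strip=True) for h3 in soup.find_all("h3")]
headings_h2 = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

print(page_title)
print(headings_h3)
print(headings_h2)
```

The same `find_all` and `get_text` calls are used on the real page in the cells that follow.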

In [18]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Now we load the HTML file. Make sure you have already saved the web page locally at this point.

In [20]:
# Vol 55 No 6
URL = '/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/rawdata/Journal of Magnetic Resonance Imaging_ Vol 55, No 6.html'

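def safeOpenParsePage(targetUrl):
    # targetUrl is a local file path here, so the errors to catch are
    # file-system errors (e.g. FileNotFoundError), not urllib's HTTPError
    try:
        # The context manager closes the file once it has been read
        with open(targetUrl, 'r') as tmpurl:
            tmpR = tmpurl.read()
        tmpSoup = BeautifulSoup(tmpR, 'lxml')
        return tmpSoup
    except OSError as e:
        print(e)
        return None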

soupJMRI = safeOpenParsePage(URL)

First, check that the page was parsed correctly by printing its `<title>` element:

In [21]:
# Get the title
title = soupJMRI.title
print(title)

<title> Journal of Magnetic Resonance Imaging: Vol 55, No 6</title>


In [22]:
# Extract the text component
text = soupJMRI.get_text()

### Explore Category

To see which categories this issue contains, list all the `h3` elements:

In [23]:
all_h3 = soupJMRI.find_all("h3")

i = 0
for h3 in all_h3:
  # print(h3)
  print(h3.get_text())
  i += 1
print(i)

Menu
Cover Image
Issue Information
Review Articles
Commentary
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Research Articles
Editorial
Case Report
Letters to the Editor
Erratum
This article corrects the following:
About Wiley Online Library
Help & Support
Opportunities
Connect with Wiley
Format
Type of import
47


There are 47 entries. Although many are repeated (such as `Research Articles` and `Editorial`) and some belong to the page layout rather than the issue (such as `Menu`), the list gives us a general idea of the categories.
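
To condense a repetitive listing like this one, the headings can be tallied with `collections.Counter`. The sketch below uses a short hypothetical list as a stand-in for `[h3.get_text() for h3 in all_h3]`; the values are illustrative, not the full output above.

```python
from collections import Counter

# Hypothetical stand-in for [h3.get_text() for h3 in all_h3]
h3_texts = [
    "Menu",
    "Review Articles",
    "Research Articles",
    "Editorial",
    "Research Articles",
    "Editorial",
    "Case Report",
]

# Count how many times each heading text occurs
category_counts = Counter(h3_texts)

# Print the headings from most to least frequent
for name, count in category_counts.most_common():
    print(f"{count:>2}  {name}")
```

Each distinct heading appears once with its count, which makes the set of categories easier to read than the raw list.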

### Explore Subcategory

It would help to sort the research articles further by subcategory. Again, we take a quick look at this information, this time by listing all the `h4` elements:

In [24]:
all_h4 = soupJMRI.find_all("h4")

i = 0
for h4 in all_h4:
  # print(h4)
  print(h4.get_text())
  i += 1
print(i)

Breast
Breast
Musculoskeletal
Pediatrics
Chest
Chest
Neuro
Pelvis
Safety
Vascular
Vascular
Cardiac
Cardiac
Cardiac
Abdomen
Abdomen
Abdomen
Safety
Neuro
Abdomen
20


## Explore Title List

Article titles are stored in `h2` elements. To skip short headings that are not article titles, we only print the `h2` elements that contain more than three words:

In [25]:
all_h2 = soupJMRI.find_all("h2")

i = 0
for h2 in all_h2:
  h2_len = len(h2.get_text().split())
  if h2_len > 3:
    print(h2.get_text())
    i += 1
print(i)

Clinical Feasibility of Structural and Functional MRI in Free-Breathing Neonates and Infants
Ultrashort Echo Time Magnetic Resonance Imaging Techniques: Met and Unmet Needs in Musculoskeletal Imaging
Brain MRI in Autism Spectrum Disorder: Narrative Review and Recent Advances
Application of Magnetic Resonance Imaging in Neoadjuvant Treatment of Pancreatic Ductal Adenocarcinoma
Is It Possible for MRI Screening of Breast Cancer to be Available to Many More Women by Greatly Reducing its False Positive Detections via Ultrafast Time to Enhancement Measurements?
Radiomic Analysis of Pharmacokinetic Heterogeneity Within Tumor Based on the Unsupervised Decomposition of Dynamic Contrast-Enhanced MRI for Predicting Histological Characteristics of Breast Cancer
Editorial for “Radiomic Analysis of Pharmacokinetic Heterogeneity Within Tumor Based on the Unsupervised Decomposition of DCE-MRI for Predicting Histological Characteristics of Breast Cancer”
Cross-Cohort Automatic Knee MRI Segmentation Wit

## Clean up Title list

The list above still contains some unwanted rows. For example, "More from this journal", "Log in to Wiley Online Library", and "Create a new account" are page headings rather than article titles, so they need to be removed. In addition, "Issue Information" has only two words, so the word-count filter dropped it and it has to be inserted back by hand.

In [26]:
list_h2 = []
for h2 in all_h2:
  h2_len = len(h2.get_text().split())
  if h2_len > 3:
    cells = h2.get_text()
    list_h2.append(cells)

# Delete the last element: "Log in to Wiley Online Library"
list_h2.pop()

# Insert "Issue Information" as the 2nd element
list_h2.insert(1, "Issue Information")

# Pop out a few more unwanted elements
list_h2 = list_h2[:-2]

for l in list_h2:
  print(l)
print(len(list_h2))

Clinical Feasibility of Structural and Functional MRI in Free-Breathing Neonates and Infants
Issue Information
Ultrashort Echo Time Magnetic Resonance Imaging Techniques: Met and Unmet Needs in Musculoskeletal Imaging
Brain MRI in Autism Spectrum Disorder: Narrative Review and Recent Advances
Application of Magnetic Resonance Imaging in Neoadjuvant Treatment of Pancreatic Ductal Adenocarcinoma
Is It Possible for MRI Screening of Breast Cancer to be Available to Many More Women by Greatly Reducing its False Positive Detections via Ultrafast Time to Enhancement Measurements?
Radiomic Analysis of Pharmacokinetic Heterogeneity Within Tumor Based on the Unsupervised Decomposition of Dynamic Contrast-Enhanced MRI for Predicting Histological Characteristics of Breast Cancer
Editorial for “Radiomic Analysis of Pharmacokinetic Heterogeneity Within Tumor Based on the Unsupervised Decomposition of DCE-MRI for Predicting Histological Characteristics of Breast Cancer”
Cross-Cohort Automatic Knee MR
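
The positional `pop()` and slicing above depend on where the unwanted headings happen to sit in the list. A less position-dependent alternative is to filter by content, as in the minimal sketch below. The list `h2_texts` is a shortened, hypothetical stand-in for `[h2.get_text() for h2 in all_h2]`, and the names `unwanted`, `keep_short`, and `is_article_title` are introduced here for illustration only. The sketch also assumes that the "Issue Information" heading appears in the page at the position where it belongs.

```python
# Hypothetical stand-in for [h2.get_text() for h2 in all_h2]
h2_texts = [
    "Clinical Feasibility of Structural and Functional MRI",
    "Issue Information",
    "Brain MRI in Autism Spectrum Disorder",
    "More from this journal",
    "Create a new account",
    "Log in to Wiley Online Library",
]

# Page headings to drop (the three named in the text above)
unwanted = {
    "More from this journal",
    "Log in to Wiley Online Library",
    "Create a new account",
}

# Short headings that are real entries and must be kept
keep_short = {"Issue Information"}


def is_article_title(text):
    """Return True for headings that should stay in the title list."""
    text = text.strip()
    if text in unwanted:
        return False
    # Keep long headings, plus the explicitly allowed short ones
    return len(text.split()) > 3 or text in keep_short


clean_titles = [t.strip() for t in h2_texts if is_article_title(t)]

for title_text in clean_titles:
    print(title_text)
print(len(clean_titles))
```

The exclusion set still has to be checked against each saved page, since the page headings may differ between issues.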

## Add dates first published

In [27]:
# li class: corresponds to "First Published"

all_li = soupJMRI.find_all("li")

list_date = []
for li in all_li:
  li_class = li.get_attribute_list("class")
  if li_class == ['ePubDate']:
    cells = li.get_text().split(': ')[1]
    list_date.append(cells)

for l in list_date:
  print(l)
print(len(list_date))

11 May 2022
11 May 2022
28 December 2021
09 October 2021
08 February 2022
22 October 2021
13 November 2021
26 December 2021
17 December 2021
10 November 2021
18 November 2021
23 November 2021
03 November 2021
19 November 2021
21 March 2022
21 March 2022
06 November 2021
15 November 2021
07 December 2021
13 November 2021
12 November 2021
12 November 2021
01 November 2021
31 October 2021
27 October 2021
18 November 2021
09 November 2021
25 October 2021
02 November 2021
10 August 2021
24 September 2021
22 September 2021
28 September 2021
06 January 2022
15 October 2021
07 October 2021
19 November 2021
20 October 2021
21 September 2021
28 October 2021
20 October 2021
20 October 2021
27 October 2021
29 October 2021
14 August 2021
29 October 2021
23 October 2021
20 April 2022
48
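
The dates are plain strings at this point. If they need to be sorted or filtered later, they can be converted to datetimes with `pd.to_datetime`. The sketch below uses three of the date strings from the output above and assumes every entry follows the same "day month-name year" format.

```python
import pandas as pd

# A few date strings in the same format as the output above
sample_dates = ["11 May 2022", "28 December 2021", "09 October 2021"]

# %d = day of month, %B = full month name, %Y = four-digit year
parsed_dates = pd.to_datetime(pd.Series(sample_dates), format="%d %B %Y")

# Convert to ISO strings, which sort chronologically
iso_dates = parsed_dates.dt.strftime("%Y-%m-%d").tolist()
print(iso_dates)
```

Passing an explicit `format` makes the conversion fail loudly if an entry does not match, rather than guessing.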


## Add DOI links

Note that the string to replace may differ between saved pages (e.g. `/doi` or `onlinelibrary.wiley.com/doi`), so always check the `href` values in the original data first.

In [44]:
all_links = soupJMRI.find_all('a')

list_url = []
for link in all_links:
  if 'visitable' in str(link.get("class")): # and '/doi/' in link.get("href"):
    # cells = link.get("href").replace('/doi', 'https://doi.org')
    cells = link.get("href").replace('onlinelibrary.wiley.com/doi', 'doi.org')    
    list_url.append(cells)

for l in list_url:
  print(l)
print(len(list_url))

https://doi.org/10.1002/jmri.27714
https://doi.org/10.1002/jmri.27715
https://doi.org/10.1002/jmri.28032
https://doi.org/10.1002/jmri.27949
https://doi.org/10.1002/jmri.28096
https://doi.org/10.1002/jmri.27969
https://doi.org/10.1002/jmri.27993
https://doi.org/10.1002/jmri.28042
https://doi.org/10.1002/jmri.27978
https://doi.org/10.1002/jmri.27990
https://doi.org/10.1002/jmri.27995
https://doi.org/10.1002/jmri.28005
https://doi.org/10.1002/jmri.27981
https://doi.org/10.1002/jmri.27991
https://doi.org/10.1002/jmri.28165
https://doi.org/10.1002/jmri.28161
https://doi.org/10.1002/jmri.27984
https://doi.org/10.1002/jmri.27996
https://doi.org/10.1002/jmri.28022
https://doi.org/10.1002/jmri.27992
https://doi.org/10.1002/jmri.27983
https://doi.org/10.1002/jmri.27994
https://doi.org/10.1002/jmri.27979
https://doi.org/10.1002/jmri.27980
https://doi.org/10.1002/jmri.27972
https://doi.org/10.1002/jmri.27989
https://doi.org/10.1002/jmri.27988
https://doi.org/10.1002/jmri.27966
https://doi.org/10.1
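
Because the `href` form can vary, one option is to extract the DOI itself with a regular expression instead of replacing a fixed prefix. The sketch below is a minimal illustration: the function name `to_doi_url` is hypothetical, and the optional path prefixes it skips (`full/`, `abs/`, `epdf/`, `pdf/`) are assumptions that should be checked against the actual links in the saved page.

```python
import re

# Matches a DOI (starting with "10.") after a "/doi/" path segment,
# skipping an optional view prefix such as "full/" (assumed prefixes)
DOI_PATTERN = re.compile(r"/doi/(?:(?:full|abs|epdf|pdf)/)?(10\.\d{4,9}/\S+)")


def to_doi_url(href):
    """Return a https://doi.org/ link for href, or None if no DOI is found."""
    match = DOI_PATTERN.search(href)
    if match is None:
        return None
    return "https://doi.org/" + match.group(1)


# Both href forms mentioned above map to the same kind of DOI link
print(to_doi_url("/doi/10.1002/jmri.27714"))
print(to_doi_url("https://onlinelibrary.wiley.com/doi/10.1002/jmri.27715"))
print(to_doi_url("/toc/15222586/2022/55/6"))
```

Links without a DOI return `None`, so they can be filtered out before building the list.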

In [45]:
df = pd.DataFrame({'Journal': title.get_text(strip=True),  # title text, not the Tag object
                   'Title': list_h2, 
                   'First Published': list_date, 
                   'DOI': list_url})
df

| | Journal | Title | First Published | DOI |
|---|---|---|---|---|
| 0 | Journal of Magnetic Resonance Imaging: Vol 55... | Clinical Feasibility of Structural and Functio... | 11 May 2022 | https://doi.org/10.1002/jmri.27714 |
| 1 | Journal of Magnetic Resonance Imaging: Vol 55... | Issue Information | 11 May 2022 | https://doi.org/10.1002/jmri.27715 |
| 2 | Journal of Magnetic Resonance Imaging: Vol 55... | Ultrashort Echo Time Magnetic Resonance Imagin... | 28 December 2021 | https://doi.org/10.1002/jmri.28032 |
| 3 | Journal of Magnetic Resonance Imaging: Vol 55... | Brain MRI in Autism Spectrum Disorder: Narrati... | 09 October 2021 | https://doi.org/10.1002/jmri.27949 |
| 4 | Journal of Magnetic Resonance Imaging: Vol 55... | Application of Magnetic Resonance Imaging in N... | 08 February 2022 | https://doi.org/10.1002/jmri.28096 |
| 5 | Journal of Magnetic Resonance Imaging: Vol 55... | Is It Possible for MRI Screening of Breast Can... | 22 October 2021 | https://doi.org/10.1002/jmri.27969 |
| 6 | Journal of Magnetic Resonance Imaging: Vol 55... | Radiomic Analysis of Pharmacokinetic Heterogen... | 13 November 2021 | https://doi.org/10.1002/jmri.27993 |
| 7 | Journal of Magnetic Resonance Imaging: Vol 55... | Editorial for “Radiomic Analysis of Pharmacoki... | 26 December 2021 | https://doi.org/10.1002/jmri.28042 |
| 8 | Journal of Magnetic Resonance Imaging: Vol 55... | Cross-Cohort Automatic Knee MRI Segmentation W... | 17 December 2021 | https://doi.org/10.1002/jmri.27978 |
| 9 | Journal of Magnetic Resonance Imaging: Vol 55... | Editorial for “Cross-Cohort Automatic Knee MRI... | 10 November 2021 | https://doi.org/10.1002/jmri.27990 |
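
Building the table only works when the title, date, and DOI lists have the same length, so it can help to check this explicitly before calling `pd.DataFrame`. The helper below, `build_issue_table`, is a hypothetical name introduced for this sketch; it reports the individual lengths when they disagree, which shows which list needs more cleaning.

```python
import pandas as pd


def build_issue_table(journal, titles, dates, dois):
    """Build the issue table after checking that all lists line up."""
    lengths = {
        "Title": len(titles),
        "First Published": len(dates),
        "DOI": len(dois),
    }
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # The journal name is a single string and is repeated on every row
    return pd.DataFrame({
        "Journal": journal,
        "Title": titles,
        "First Published": dates,
        "DOI": dois,
    })


# Small made-up example with two rows
demo = build_issue_table(
    "Demo Journal: Vol 1, No 1",
    ["Title A", "Title B"],
    ["11 May 2022", "28 December 2021"],
    ["https://doi.org/10.1002/jmri.27714", "https://doi.org/10.1002/jmri.27715"],
)
print(demo.shape)
```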


Everything looks good so far! The most tedious task remains: manually creating a combined list of categories and subcategories. This step cannot be fully automated, or at least not easily.

To look up this information, one would go to the Issue Information (https://doi.org/10.1002/jmri.27715).
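
As a possible starting point for partial automation, one could walk through the `h3`, `h4`, and `h2` elements in document order and remember the most recent category and subcategory for each title. The sketch below runs on a made-up HTML fragment and assumes that every title follows its own headings. The real page is likely messier (for example, several articles may share one subcategory heading), so the result would still need to be checked by hand against the Issue Information.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the heading layout (not the real page)
toy_html = """
<h3>Review Articles</h3>
<h2>Review title one</h2>
<h3>Research Articles</h3>
<h4>Breast</h4>
<h2>Breast study title</h2>
<h3>Editorial</h3>
<h2>Editorial for the breast study</h2>
<h3>Research Articles</h3>
<h4>Chest</h4>
<h2>Chest study title</h2>
"""

toy_soup = BeautifulSoup(toy_html, "html.parser")

category, subcategory = None, None
rows = []
# find_all with a list of tag names returns matches in document order
for tag in toy_soup.find_all(["h2", "h3", "h4"]):
    text = tag.get_text(strip=True)
    if tag.name == "h3":
        # A new category resets the subcategory
        category, subcategory = text, None
    elif tag.name == "h4":
        subcategory = text
    else:
        # An h2 is a title: label it with the current headings
        label = f"{category}: {subcategory}" if subcategory else category
        rows.append((label, text))

for label, title_text in rows:
    print(f"{label} | {title_text}")
```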

## Manually create a category list

In [41]:
### Vol 55 No 6 ###
list_category = ['Cover Image', 'Issue Information']
list_category.extend(['Review Articles'] * 3)
list_category.extend(['Commentary: Breast'] * 1)
list_category.extend(['Original Research: Breast', 'Editorial'] * 1)
list_category.extend(['Original Research: Musculoskeletal', 'Editorial'] * 1)
list_category.extend(['Original Research: Pediatrics', 'Editorial'] * 1)
list_category.extend(['Original Research: Chest', 'Editorial'] * 2)
list_category.extend(['Original Research: Neuro'] * 1)
list_category.extend(['Original Research: Neuro', 'Editorial'] * 1)
list_category.extend(['Original Research: Pelvis'] * 1)
list_category.extend(['Original Research: Pelvis', 'Editorial'] * 1)
list_category.extend(['Original Research: Safety', 'Editorial'] * 1)
list_category.extend(['Original Research: Vascular'] * 1)
list_category.extend(['Original Research: Vascular', 'Editorial'] * 2)
list_category.extend(['Editorial'] * 1)
list_category.extend(['Original Research: Cardiac', 'Editorial'] * 3)
list_category.extend(['Editorial'] * 1)
list_category.extend(['Original Research: Abdomen'] * 1)
list_category.extend(['Original Research: Abdomen', 'Editorial'] * 3)
list_category.extend(['Editorial'] * 1)
list_category.extend(['Case Report: Safety'] * 1)
list_category.extend(['Letter to the Editor: Neuro'] * 1)
list_category.extend(['Erratum: Abdomen'] * 1)

i = 0
for l in list_category:
  print(l)
  i += 1
print(i)

Cover Image
Issue Information
Review Articles
Review Articles
Review Articles
Commentary: Breast
Original Research: Breast
Editorial
Original Research: Musculoskeletal
Editorial
Original Research: Pediatrics
Editorial
Original Research: Chest
Editorial
Original Research: Chest
Editorial
Original Research: Neuro
Original Research: Neuro
Editorial
Original Research: Pelvis
Original Research: Pelvis
Editorial
Original Research: Safety
Editorial
Original Research: Vascular
Original Research: Vascular
Editorial
Original Research: Vascular
Editorial
Editorial
Original Research: Cardiac
Editorial
Original Research: Cardiac
Editorial
Original Research: Cardiac
Editorial
Editorial
Original Research: Abdomen
Original Research: Abdomen
Editorial
Original Research: Abdomen
Editorial
Original Research: Abdomen
Editorial
Editorial
Case Report: Safety
Letter to the Editor: Neuro
Erratum: Abdomen
48


Now the length of the list matches the number of rows (N=48). We can add the manually created list to the table as a new column:

In [46]:
df.insert(1, 'Category', list_category)
df

| | Journal | Category | Title | First Published | DOI |
|---|---|---|---|---|---|
| 0 | Journal of Magnetic Resonance Imaging: Vol 55... | Cover Image | Clinical Feasibility of Structural and Functio... | 11 May 2022 | https://doi.org/10.1002/jmri.27714 |
| 1 | Journal of Magnetic Resonance Imaging: Vol 55... | Issue Information | Issue Information | 11 May 2022 | https://doi.org/10.1002/jmri.27715 |
| 2 | Journal of Magnetic Resonance Imaging: Vol 55... | Review Articles | Ultrashort Echo Time Magnetic Resonance Imagin... | 28 December 2021 | https://doi.org/10.1002/jmri.28032 |
| 3 | Journal of Magnetic Resonance Imaging: Vol 55... | Review Articles | Brain MRI in Autism Spectrum Disorder: Narrati... | 09 October 2021 | https://doi.org/10.1002/jmri.27949 |
| 4 | Journal of Magnetic Resonance Imaging: Vol 55... | Review Articles | Application of Magnetic Resonance Imaging in N... | 08 February 2022 | https://doi.org/10.1002/jmri.28096 |
| 5 | Journal of Magnetic Resonance Imaging: Vol 55... | Commentary: Breast | Is It Possible for MRI Screening of Breast Can... | 22 October 2021 | https://doi.org/10.1002/jmri.27969 |
| 6 | Journal of Magnetic Resonance Imaging: Vol 55... | Original Research: Breast | Radiomic Analysis of Pharmacokinetic Heterogen... | 13 November 2021 | https://doi.org/10.1002/jmri.27993 |
| 7 | Journal of Magnetic Resonance Imaging: Vol 55... | Editorial | Editorial for “Radiomic Analysis of Pharmacoki... | 26 December 2021 | https://doi.org/10.1002/jmri.28042 |
| 8 | Journal of Magnetic Resonance Imaging: Vol 55... | Original Research: Musculoskeletal | Cross-Cohort Automatic Knee MRI Segmentation W... | 17 December 2021 | https://doi.org/10.1002/jmri.27978 |
| 9 | Journal of Magnetic Resonance Imaging: Vol 55... | Editorial | Editorial for “Cross-Cohort Automatic Knee MRI... | 10 November 2021 | https://doi.org/10.1002/jmri.27990 |


## Save as csv

In [47]:
df.to_csv(path_or_buf='/content/drive/MyDrive/UHS-MRIPhysics-journal-web-scrapping/processed/jmri/jmri-vol-55-no-6.csv', index=False)