## **Journal of Machine Learning Research (JMLR) - WebScraping Project**

> **Disclaimer:** This is a personal project to practice web scraping and exploratory data analysis. I do not recommend using it for other purposes. Use it at your own risk.

### **Libraries**

All JMLR volumes are listed on a single index page, and each volume links to a page listing its papers, so the scraping process needs only a few common libraries.

> **To replicate this notebook, you may need to install some of these packages with `pip`.**

In [1]:
import re
import pandas as pd
import requests

from tqdm.notebook import tqdm
from bs4 import BeautifulSoup

### **Variables**

Let's define the URL of the page that lists the volumes.

In [2]:
url = 'https://www.jmlr.org/papers/'

### **Volumes**

Let's start by extracting the volumes.

In [3]:
vols_req = requests.get(url)
vols_req.status_code

200

Next, parse the HTML into a BeautifulSoup object.

In [4]:
vols_soup = BeautifulSoup(vols_req.text, 'html.parser')

The volumes will be stored in a list, which we will loop over later to scrape the papers.

In [5]:
vols = []

Each volume is stored in a *font* tag with a *volume* class, which makes the data easy to scrape.

In [7]:
# Each volume title sits in a <font class="volume"> tag; its parent carries the link.
content = vols_soup.find_all('font', {'class': 'volume'})
for i in content:
    container = i.parent
    href = container['href']
    # Build an absolute URL from the link target.
    if 'papers' in href:
        href = 'https://www.jmlr.org' + href
    else:
        href = url + href

    volume = i.text.strip()
    vols.append(dict(href=href, volume=volume))
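
The two-way branch on `href` can also be expressed with `urljoin` from the standard library, which resolves both relative and root-relative links against a base URL. A minimal sketch, assuming links shaped like `v23` or `/papers/v1` (the second form is a hypothetical example, not taken from the page):

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.jmlr.org/papers/'

def absolute_url(href, base=BASE_URL):
    """Resolve a link found on the index page against the base URL."""
    return urljoin(base, href)

print(absolute_url('v23'))         # relative link
print(absolute_url('/papers/v1'))  # root-relative link (hypothetical)
```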

Let's check the most recently released volume!

In [8]:
vols[0]

{'href': 'https://www.jmlr.org/papers/v23', 'volume': 'Volume 23'}

### **Papers**

If an error occurs partway through scraping, we risk losing all progress. One option is to write each result directly to a file, but that is heavier than this project needs. Since we are working in a notebook, I will store the data in a dictionary instead, where each key is a volume and each value is the list of that volume's papers. If the loop fails, the volumes already in the dictionary are kept, and rerunning the cell skips them.

Maybe it's not the best idea, but it will be enough for this project.
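
The dictionary survives a failed cell but not a kernel restart. A minimal sketch of persisting it to disk between runs, assuming the scraped values are plain strings, numbers, and lists; the file name `papers_checkpoint.json` and both helper names are hypothetical:

```python
import json
from pathlib import Path

CHECKPOINT = Path('papers_checkpoint.json')  # hypothetical file name

def load_checkpoint(path=CHECKPOINT):
    """Return the previously saved volumes, or an empty dict on a first run."""
    if path.exists():
        return json.loads(path.read_text(encoding='utf-8'))
    return {}

def save_checkpoint(data, path=CHECKPOINT):
    """Overwrite the checkpoint file with everything scraped so far."""
    path.write_text(json.dumps(data, ensure_ascii=False), encoding='utf-8')
```

One way to use it would be to start from `papers_per_volume = load_checkpoint()` and call `save_checkpoint(papers_per_volume)` after each volume is scraped.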

In [12]:
papers_per_volume = {}

Unfortunately, the paper entries are not formatted uniformly across volumes, so we need a few extra steps to scrape all the data correctly. Each volume has its own set of papers. Let's scrape each of them and store the results in the dictionary.

In [13]:
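for v in tqdm(vols):
    # Use a separate name so the global `url` defined earlier is not overwritten.
    vol_url = v['href']
    volume = v['volume']

    # Skip volumes that were already scraped, so the cell can be rerun after a failure.
    if papers_per_volume.get(volume, []):
        continue

    req = requests.get(vol_url)
    if req.status_code != 200:
        raise requests.ConnectionError(f"Connection failed to {vol_url}.")

    # Parse the raw bytes so BeautifulSoup detects the page encoding itself;
    # req.text can garble accented author names when the encoding is misguessed.
    soup = BeautifulSoup(req.content, 'html.parser')

    papers = []
    papers_container = soup.find_all('dl')
    for p in papers_container:
        dt = p.find('dt')
        dd = p.find('dd')

        # Some entries nest the <dd> inside the <dt>; keep only the title node.
        if dt.find('dd'):
            title = list(dt.children)[0].text.strip()
        else:
            title = dt.text.strip()

        authors = dd.b.text.strip().split(', ')

        # The text right after the authors holds the page range and the year.
        desc = dd.b.next_sibling.text.strip()
        year = desc.split(' ')[-1].replace('.', '')
        pages_string = desc[desc.find(":")+1:desc.rfind(",")]
        pages_values = re.findall(r'[0-9]+', pages_string)
        pages = int(pages_values[1]) - int(pages_values[0]) + 1

        # Link to the code repository, when the entry has one.
        code_string = dd.find(string='code')
        if code_string:
            code = code_string.parent['href']
        else:
            code = ''

        # The PDF link text is either 'pdf' or '[pdf]'.
        pdf = dd.find(string='pdf')
        if pdf:
            link = pdf.parent['href']
        else:
            link = dd.find(string='[pdf]').parent['href']

        # Make relative links absolute.
        if 'www' not in link:
            link = 'https://www.jmlr.org' + link

        papers.append(dict(title=title, volume=volume, authors=authors, year=year, pages=pages, link=link, code=code))

    papers_per_volume[volume] = papers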
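
The description parsing in the loop indexes `pages_values[1]` directly, so an entry with a single page number, or with no page range at all, would raise an `IndexError`. A minimal sketch of a more defensive helper follows; the function name `parse_description` and the sample strings are assumptions for illustration, modelled on the `issue:first-last, year.` shape that the loop expects:

```python
import re

def parse_description(desc):
    """Return (year, page_count) from a citation string; page_count may be None."""
    desc = desc.strip()
    year = desc.split(' ')[-1].replace('.', '')
    # The page range sits between the first colon and the last comma.
    pages_string = desc[desc.find(':') + 1:desc.rfind(',')]
    numbers = [int(n) for n in re.findall(r'[0-9]+', pages_string)]
    if len(numbers) >= 2:
        pages = numbers[1] - numbers[0] + 1
    elif len(numbers) == 1:
        pages = 1  # a single page number counts as one page
    else:
        pages = None  # no page information found
    return year, pages

print(parse_description('1(Oct):1-48, 2000.'))  # ('2000', 48)
print(parse_description('5(Jan):7, 2004.'))     # ('2004', 1)
```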

Let's see the first two papers of the first volume!

In [14]:
papers_per_volume['Volume 1'][:2]

[{'title': 'Learning with Mixtures of Trees',
  'volume': 'Volume 1',
  'authors': ['Marina Meila', 'Michael I. Jordan'],
  'year': '2000',
  'pages': 48,
  'link': 'http://www.jmlr.org/papers/volume1/meila00a/meila00a.pdf',
  'code': ''},
 {'title': 'Dependency Networks for Inference, Collaborative Filtering, and Data Visualization',
  'volume': 'Volume 1',
  'authors': ['David Heckerman',
   'David Maxwell Chickering',
   'Christopher Meek',
   'Robert Rounthwaite',
   'Carl Kadie'],
  'year': '2000',
  'pages': 27,
  'link': 'http://www.jmlr.org/papers/volume1/heckerman00a/heckerman00a.pdf',
  'code': ''}]

### **Save data**

Since the papers are grouped by volume, let's flatten them into a single list!

In [16]:
full_data = []
for p in papers_per_volume.values():
    full_data.extend(p)
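
The same flattening can be written with `itertools.chain.from_iterable`. A small sketch, using a toy dictionary in place of the scraped data:

```python
from itertools import chain

# Toy stand-in for papers_per_volume (titles are placeholders).
sample = {
    'Volume 2': [{'title': 'C'}],
    'Volume 1': [{'title': 'A'}, {'title': 'B'}],
}

# Concatenate the per-volume lists in dictionary order.
flat = list(chain.from_iterable(sample.values()))
print(len(flat))  # 3
```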

Now we can load the list into a DataFrame and inspect the table.

In [17]:
df = pd.DataFrame(full_data)
df.head()

| | title | volume | authors | year | pages | link | code |
|---|---|---|---|---|---|---|---|
| 0 | Joint Estimation and Inference for Data Integr... | Volume 23 | [Subhabrata Majumdar, George Michailidis] | 2022 | 53 | https://www.jmlr.org/papers/volume23/18-131/18... | https://github.com/GeorgeMichailidis/JMMLE_code |
| 1 | Debiased Distributed Learning for Sparse Part... | Volume 23 | [Shaogao Lv, Heng Lian] | 2022 | 32 | https://www.jmlr.org/papers/volume23/18-467/18... | |
| 2 | Recovering shared structure from multiple netw... | Volume 23 | [Keith Levin, Asad Lodhia, Elizaveta Levina] | 2022 | 48 | https://www.jmlr.org/papers/volume23/19-1056/1... | |
| 3 | Exploiting locality in high-dimensional Factor... | Volume 23 | [Lorenzo Rimella, Nick Whiteley] | 2022 | 34 | https://www.jmlr.org/papers/volume23/19-267/19... | https://github.com/LorenzoRimella/GraphFilter-... |
| 4 | Empirical Risk Minimization under Random Censo... | Volume 23 | [Guillaume Ausset, Stephan Clémençon, Franç... | 2022 | 59 | https://www.jmlr.org/papers/volume23/19-450/19... | |


And finally, let's save the data.

In [18]:
df.to_csv('data.csv', index=False)
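
One caveat when reading `data.csv` back: the `authors` column holds Python lists, and `to_csv` writes each one as its string representation, so it comes back as text. A minimal sketch of restoring the lists with `ast.literal_eval`; the helper name `parse_authors` is hypothetical:

```python
import ast

def parse_authors(cell):
    """Turn a cell such as "['A', 'B']" back into a list of names."""
    # Missing or empty cells become an empty list.
    if not isinstance(cell, str) or not cell.strip():
        return []
    return ast.literal_eval(cell)

print(parse_authors("['Marina Meila', 'Michael I. Jordan']"))
```

With pandas, this could be applied as `df['authors'] = df['authors'].apply(parse_authors)` after `pd.read_csv('data.csv')`.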