## Step 1: DBLP PARSER <a class="anchor" id="parser"></a>
<div class="alert alert-block alert-success">
Designing a DBLP parser to gather author data from AI conference pages (https://dblp.org/xml/).
</div>


##### Imports

In [1]:
import requests
import pprint
from bs4 import BeautifulSoup
import pandas as pd
from pandas import DataFrame

### 1.1 Parse authors' names

##### Parsing URL

In [13]:
URL = 'https://dblp.org/db/conf/uai/uai2019.html' 
#URL = 'https://dblp.org/db/conf/acii/acii2017.html'
#URL = 'https://dblp.org/db/conf/aaai/aaai2019.html'

page = requests.get(URL)
page.raise_for_status()  # stop on an HTTP error instead of parsing an error page
soup = BeautifulSoup(page.content, 'html.parser')

##### Extracting Title

In [14]:
title = soup.title.string
title = title.replace("dblp: ", "")

##### Creating a list of author names and their corresponding DBLP URLs

In [15]:
author_names = soup.find_all('span', itemprop='author')

# Collect one [name, DBLP profile URL] pair per author entry on the page
names_list = []
for names in author_names:
    name = names.text
    url = names.a['href']
    names_list.append([name, url])
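
To see what the loop above extracts without touching the network, the snippet below runs the same `find_all` call on a small hand-written HTML fragment. The fragment is an assumption that only mimics the `itemprop="author"` markup the parser relies on. The helper name `extract_authors`, the author names and the `example.org` URLs are placeholders, not real DBLP entries.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a DBLP proceedings page (hypothetical content).
SAMPLE_HTML = """
<ul>
  <li><span itemprop="author"><a href="https://example.org/pid/1.html"><span itemprop="name">Ada Example</span></a></span></li>
  <li><span itemprop="author"><a href="https://example.org/pid/2.html"><span itemprop="name">Bob Sample</span></a></span></li>
  <li><span itemprop="author"><a href="https://example.org/pid/1.html"><span itemprop="name">Ada Example</span></a></span></li>
</ul>
"""


def extract_authors(html):
    """Return [name, url] pairs for every <span itemprop="author"> in html."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for span in soup.find_all("span", itemprop="author"):
        # The visible text is the author name; the nested link is the profile URL.
        pairs.append([span.text, span.a["href"]])
    return pairs


authors = extract_authors(SAMPLE_HTML)
```

An author who signs several papers is listed once per paper, which is why the dataframe step calls `drop_duplicates()`.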

##### Creation of the dataframe 

In [16]:
df = DataFrame( names_list, columns = ['Authors', 'Url'])
df['Conference'] = title

df = df.drop_duplicates()

### 1.2 Parse authors' affiliations

In [18]:
import time

start_time = time.time()
afilliation_list = []
count = 0

for url in df.Url:
    page_auth = requests.get(url)
    soup_authors = BeautifulSoup(page_auth.content, 'html.parser')

    # The affiliation, if any, is looked up in the first <div class="hide-body">
    author_affiliation = soup_authors.find('div', class_='hide-body')

    # `or []` skips pages without such a block instead of raising a TypeError
    for names in author_affiliation or []:
        try:
            names.span['itemprop']  # raises if this child has no <span itemprop=...>
            name = names.text.replace("affiliation: ", "")
        except (AttributeError, KeyError, TypeError):
            name = None
        afilliation_list.append([name, url])
        count += 1
        if count % 50 == 0:
            elapsed = round((time.time() - start_time) / 60, 2)
            print('Already parsed:', count, '/', len(df.Url), '--- %s minutes ---' % elapsed)

Already parsed: 50 / 411 --- 0.71 minutes ---
Already parsed: 100 / 411 --- 1.41 minutes ---
Already parsed: 150 / 411 --- 2.15 minutes ---
Already parsed: 200 / 411 --- 2.81 minutes ---
Already parsed: 250 / 411 --- 3.67 minutes ---
Already parsed: 300 / 411 --- 4.33 minutes ---
Already parsed: 350 / 411 --- 4.97 minutes ---
Already parsed: 400 / 411 --- 5.59 minutes ---
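
The loop issues one request per author URL, so a single transient network failure could abort a run that takes several minutes. A minimal sketch of a retry wrapper follows. `fetch_with_retry`, its parameters and the injected `fetch` callable are hypothetical names introduced here, not part of the original notebook; in the notebook, `fetch` could be `requests.get`.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying up to `retries` times on any exception.

    `fetch` is any callable taking a URL (for example requests.get).
    `sleep` is injectable so the function can be exercised without waiting.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: fetch is caller-supplied
            last_error = error
            if attempt < retries:
                sleep(delay * attempt)  # wait longer after each failure
    raise last_error


# Demonstration with a fake fetcher that fails twice, then succeeds.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "page for " + url


waits = []
result = fetch_with_retry(
    flaky_fetch, "https://example.org/pid/1.html", sleep=waits.append
)
```

Passing `sleep=waits.append` records the requested pauses instead of actually sleeping, which keeps the demonstration instantaneous.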


In [7]:
df_affiliation = DataFrame( afilliation_list, columns = ['Affiliation', 'Url'])

In [11]:
df_affiliation[8:13]

| | Affiliation | Url |
|---|---|---|
| 8 | Delft University of Technology, Department of ... | https://dblp.org/pid/41/631.html |
| 9 | | https://dblp.org/pid/245/3429.html |
| 10 | | https://dblp.org/pid/245/3413.html |
| 11 | LIMSI - Computer Science Laboratory for Mechan... | https://dblp.org/pid/53/6183.html |
| 12 | Telecom-Paris, France | https://dblp.org/pid/50/2768.html |


In [12]:
print('Total number of entries:', len(df_affiliation))
print('Number of Null Affiliations:', len(df_affiliation[df_affiliation['Affiliation'].isnull()]))

Total number of entries: 417
Number of Null Affiliations: 346
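
The totals show more affiliation rows than author URLs, which is consistent with the loop appending one row per child of the `hide-body` block rather than one row per author. One possible way to reduce this to a single affiliation per URL is sketched below. `first_affiliation_per_url` is a hypothetical helper, and the sample rows are made up for illustration.

```python
def first_affiliation_per_url(rows):
    """Map each URL to its first non-null affiliation (None if it has none).

    `rows` is a list of [affiliation, url] pairs, the same shape as the
    list built by the parsing loop.
    """
    best = {}
    for affiliation, url in rows:
        # Only overwrite while nothing useful has been stored for this URL.
        if best.get(url) is None:
            best[url] = affiliation
    return best


# Made-up rows: the first URL appears twice, once without an affiliation.
sample_rows = [
    [None, "https://example.org/pid/1.html"],
    ["Example University", "https://example.org/pid/1.html"],
    [None, "https://example.org/pid/2.html"],
]
collapsed = first_affiliation_per_url(sample_rows)
```

Keeping the first non-null value is only one choice; an author page with several affiliations would need a different rule, such as joining them into one string.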


##### Merge Dataframes

In [19]:
df_complete = df.merge(df_affiliation, how='left', on='Url')

**Output:** a dataframe with the columns Authors, Url, Conference and Affiliation.
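
Because `df_affiliation` can hold several rows for the same `Url`, a left merge may return more rows than `df` has authors. The sketch below reproduces that effect on toy data and shows one way to avoid it by deduplicating the right-hand frame first. The frames and values are invented for illustration, and keeping the first row per URL is an assumption about what is wanted.

```python
import pandas as pd

authors = pd.DataFrame({
    "Authors": ["Ada Example", "Bob Sample"],
    "Url": ["https://example.org/pid/1.html", "https://example.org/pid/2.html"],
})
affiliations = pd.DataFrame({
    "Affiliation": ["Example University", None, "Sample Institute"],
    "Url": [
        "https://example.org/pid/1.html",
        "https://example.org/pid/1.html",
        "https://example.org/pid/2.html",
    ],
})

# A plain left merge repeats the first author once per matching affiliation row.
naive = authors.merge(affiliations, how="left", on="Url")

# Keep one row per Url on the right, then let pandas verify the key is unique there.
unique_affiliations = affiliations.drop_duplicates(subset="Url", keep="first")
checked = authors.merge(
    unique_affiliations, how="left", on="Url", validate="many_to_one"
)
```

With `validate="many_to_one"`, pandas raises a `MergeError` if the right-hand frame still contains a repeated `Url`, so an unexpected row blow-up fails loudly instead of passing silently.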