## Motivation

About half a year ago, one of my projects required retrieving data about my publications from my Google Scholar profile. At the time, I managed to do it with the [scholarly](https://github.com/OrganicIrradiation/scholarly) Python module, which retrieves author and publication information from Google Scholar in a friendly, Pythonic way. However, Google recently started to change the names of attributes inside many HTML tags on a regular basis, which made such modules completely useless (you can read more about this issue [here](https://github.com/OrganicIrradiation/scholarly/issues/37)). Thus, I started looking for a more stable solution that does not rely on HTML attributes to retrieve the information I needed. I developed my own code using the BeautifulSoup4 module and want to share it here in case someone finds it useful.
___

## Exploring a Google Scholar profile

First, I explore my Google Scholar profile to determine which elements of the page contain the information I need, such as my publications and the number of citations. After opening my profile using this link: [https://scholar.google.com/citations?user=Jid5DjYAAAAJ&hl=en](https://scholar.google.com/citations?user=Jid5DjYAAAAJ&hl=en), I can see that this information is organized as a table (see the red frame below).
![Google Scholar profile](images/GS_profile.jpg "Vlad Turlo's Google Scholar profile in May 2019")

Also, as you may notice, there is another table in the top right corner with the total number of citations and other metrics, such as the h-index and i10-index. This data may also be useful, and it is also stored as a table, so we will focus on extracting tables from this specific page, while the final code should be useful for parsing any Google Scholar profile.
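
The profile link above is a base address followed by two query parameters, `user` and `hl`. Here is a minimal sketch of how a link to another profile could be assembled, assuming other profiles follow the same pattern and that `hl` selects the interface language. The names `BASE_URL`, `profile_url`, and `user_id_from_url` are my own illustration and not part of any library:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# base address taken from the profile link above
BASE_URL = "https://scholar.google.com/citations"


def profile_url(user_id, language="en"):
    """Build a profile URL from a user ID and a language code."""
    return BASE_URL + "?" + urlencode({"user": user_id, "hl": language})


def user_id_from_url(url):
    """Read the `user` query parameter back out of a profile URL."""
    return parse_qs(urlsplit(url).query)["user"][0]


url = profile_url("Jid5DjYAAAAJ")
print(url)
print(user_id_from_url(url))
```

Only the user ID changes from one scholar to another, so the rest of the code in this post does not need to know whose profile it is parsing.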

---

## Let's code!

This is my favorite part. First, let's get the source code of the HTML page above using the awesome [requests](https://2.python-requests.org/en/master/) module. As the authors of this module put it:
>Requests is the only Non-GMO HTTP library for Python, safe for human consumption. Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor.

In [2]:
# import the module
import requests

# request the page with the Google Scholar profile
requests.get("https://scholar.google.com/citations?user=Jid5DjYAAAAJ&hl=en")

<Response [200]>

Such a response means that the page was successfully retrieved. Let's save the response in a variable and look at what is inside:

In [3]:
# request the page with the Google Scholar profile
r = requests.get("https://scholar.google.com/citations?user=Jid5DjYAAAAJ&hl=en")

# show the response as text (the first 1000 characters):
r.text[:1000]

'<!doctype html><html><head><title>Vladyslav Turlo - Google Scholar Citations</title><meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta name="referrer" content="always"><meta name="viewport" content="width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=2"><meta name="format-detection" content="telephone=no"><link rel="shortcut icon" href="/favicon.ico"><link rel="canonical" href="http://scholar.google.com.ua/citations?user=Jid5DjYAAAAJ&amp;hl=ru"><style>html,body,form,table,div,h1,h2,h3,h4,h5,h6,img,ol,ul,li,button{margin:0;padding:0;border:0;}table{border-collapse:collapse;border-width:0;empty-cells:show;}html,body{height:100%}#gs_top{position:relative;box-sizing:border-box;min-height:100%;min-width:964px;-webkit-tap-highlight-color:rgba(0,0,0,0);}#gs_top>*:not(#x){-webkit-tap-highlight-color:rgba(204,204,204,.5);}.gs_el_ph #gs_top,.gs_el_ta #gs_top{min-width:320px;}#gs_top.gs_nscl{position:fixe
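
A request does not always come back with status 200, so it can help to fail early instead of parsing an error page. The sketch below wraps the same `requests.get` call in a small helper. The function name `fetch_html` and the 10-second timeout are my own choices for illustration:

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text."""
    # without a timeout, a stalled connection could block the script forever
    response = requests.get(url, timeout=timeout)
    # raises requests.HTTPError if the server answered with a 4xx or 5xx status
    response.raise_for_status()
    return response.text


# usage (needs network access):
# html = fetch_html("https://scholar.google.com/citations?user=Jid5DjYAAAAJ&hl=en")
```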

As the first 1000 characters show, the response contains raw HTML code that basically tells the browser how to organize all the information on the webpage. Thus, we need a parser to extract the information we are interested in without too much effort. The solution here, as you may expect from the title of this post, is the [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) module. As its developers aptly put it in the description:
>Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching and modifying the parse tree. **It commonly saves programmers hours or days of work.**

So, let's go ahead and make the soup:

In [4]:
# import the module
from bs4 import BeautifulSoup

# parse the text from our request
soup = BeautifulSoup(r.text, "html.parser")
type(soup)

bs4.BeautifulSoup

Now, we can extract elements of the webpage in a very simple way. For example, let's get the title of the page:

In [5]:
soup.title

<title>Vladyslav Turlo - Google Scholar Citations</title>

If we need just the text without HTML tags, we simply do:

In [6]:
soup.title.text

'Vladyslav Turlo - Google Scholar Citations'

After that, we can easily get the scholar's name just by splitting the string above:

In [7]:
scholar_name = soup.title.text.split(" - ")[0]
print(scholar_name)

Vladyslav Turlo
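
One caveat: `split(" - ")[0]` keeps only the text before the first separator, so a name that itself contains `" - "` would be cut short. A small sketch of a more forgiving variant, which splits only at the last separator, is shown below. The helper name `name_from_title` and the second page title are made up for illustration:

```python
def name_from_title(title):
    """Return everything before the LAST ' - ' in a page title."""
    return title.rsplit(" - ", 1)[0]


# the real title from above
print(name_from_title("Vladyslav Turlo - Google Scholar Citations"))

# a hypothetical title whose name part contains the separator itself
print(name_from_title("Ann Lee - Smith - Google Scholar Citations"))
```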


From analyzing the original webpage, we remember that the information we need is stored in tables. So, let's find all the tables in our soup. The BeautifulSoup4 module allows us to do this with just one line of code:

In [8]:
# find all the tables
tables = soup.find_all("table")

# print the number of tables found
print(len(tables))

2


As expected, there are just two tables on the page. Let's look at the first one:

In [9]:
print(tables[0])

<table id="gsc_rsb_st"><thead><tr><th class="gsc_rsb_sth"></th><th class="gsc_rsb_sth">All</th><th class="gsc_rsb_sth">Since 2014</th></tr></thead><tbody><tr><td class="gsc_rsb_sc1"><a class="gsc_rsb_f gs_ibl" href="javascript:void(0)" title='This is the number of citations to all publications. The second column has the "recent" version of this metric which is the number of new citations in the last 5 years to all publications.'>Citations</a></td><td class="gsc_rsb_std">84</td><td class="gsc_rsb_std">82</td></tr><tr><td class="gsc_rsb_sc1"><a class="gsc_rsb_f gs_ibl" href="javascript:void(0)" title='h-index is the largest number h such that h publications have at least h citations. The second column has the "recent" version of this metric which is the largest number h such that h publications have at least h new citations in the last 5 years.'>h-index</a></td><td class="gsc_rsb_std">6</td><td class="gsc_rsb_std">6</td></tr><tr><td class="gsc_rsb_sc1"><a class="gsc_rsb_f gs_ibl" href="j

It looks quite messy, but thanks to the developers, there is an easy way to make it prettier:

In [10]:
print(tables[0].prettify())

<table id="gsc_rsb_st">
 <thead>
  <tr>
   <th class="gsc_rsb_sth">
   </th>
   <th class="gsc_rsb_sth">
    All
   </th>
   <th class="gsc_rsb_sth">
    Since 2014
   </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td class="gsc_rsb_sc1">
    <a class="gsc_rsb_f gs_ibl" href="javascript:void(0)" title='This is the number of citations to all publications. The second column has the "recent" version of this metric which is the number of new citations in the last 5 years to all publications.'>
     Citations
    </a>
   </td>
   <td class="gsc_rsb_std">
    84
   </td>
   <td class="gsc_rsb_std">
    82
   </td>
  </tr>
  <tr>
   <td class="gsc_rsb_sc1">
    <a class="gsc_rsb_f gs_ibl" href="javascript:void(0)" title='h-index is the largest number h such that h publications have at least h citations. The second column has the "recent" version of this metric which is the largest number h such that h publications have at least h new citations in the last 5 years.'>
     h-index
    </a>
   </td

From the HTML syntax, we can see that the table has a head (**&lt;thead&gt;** tag) with a row (**&lt;tr&gt;** tag) of column names (**&lt;th&gt;** tags). The table also has a body (**&lt;tbody&gt;** tag) with rows (**&lt;tr&gt;** tags) and cells (**&lt;td&gt;** tags). Using our soup, let's first extract the list of column names for this table by looping over the elements of the row in the table head:

In [28]:
# make a list of column names
column_names = [name.text for name in tables[0].thead.tr]
print(column_names)

['', 'All', 'Since 2014']


To extract the column names, we can also use the **find_all()** method that we used before to find the tables in our soup (called here through its older alias **findAll()**):

In [29]:
# make a list of column names
column_names = [name.text for name in tables[0].findAll('th')]
print(column_names)

['', 'All', 'Since 2014']
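
Both approaches give the same result here because the page has no whitespace between its tags. The sketch below uses a made-up, line-broken copy of the header row to show the difference. Iterating over the row directly also yields the whitespace strings between the tags, while searching for `th` tags does not:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# the same header row, but with line breaks between the tags (made-up formatting)
html = """<table><thead><tr>
<th></th>
<th>All</th>
<th>Since 2014</th>
</tr></thead></table>"""

row = BeautifulSoup(html, "html.parser").thead.tr

# direct iteration returns the tags AND the newline strings around them
children = list(row)
print(len(children))

# keeping only Tag objects, or searching for <th>, gives just the headers
from_children = [child.get_text() for child in children if isinstance(child, Tag)]
from_search = [th.get_text() for th in row.find_all("th")]
print(from_children)
print(from_search)
```

For that reason, searching by tag name is probably the safer of the two if the page formatting ever changes.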


Let's now extract data from the body of the table:

In [30]:
# extract the data from the body of the table
data = [[column.text for column in row.findAll('td')] for row in tables[0].tbody.findAll('tr')]
print(data)

[['Citations', '84', '82'], ['h-index', '6', '6'], ['i10-index', '3', '3']]
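
The two steps above (column names, then body rows) can be folded into one reusable helper. This is only a sketch: the name `table_to_lists` is mine, and the HTML is a shortened copy of the metrics table with the attributes dropped for brevity:

```python
from bs4 import BeautifulSoup


def table_to_lists(table):
    """Return (column_names, rows) for a <table> element as lists of strings."""
    column_names = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            rows.append(cells)
    return column_names, rows


# shortened copy of the metrics table shown above
html = (
    "<table><thead><tr><th></th><th>All</th><th>Since 2014</th></tr></thead>"
    "<tbody>"
    "<tr><td><a>Citations</a></td><td>84</td><td>82</td></tr>"
    "<tr><td><a>h-index</a></td><td>6</td><td>6</td></tr>"
    "<tr><td><a>i10-index</a></td><td>3</td><td>3</td></tr>"
    "</tbody></table>"
)
table = BeautifulSoup(html, "html.parser").find("table")
column_names, data = table_to_lists(table)
print(column_names)
print(data)
```

Because the helper looks at every `tr` and skips rows without `td` cells, it does not depend on the table having an explicit `tbody` tag.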


One convenient way to store and work with tables in Python is to use **pandas DataFrames**.
> [pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Let's import the **pandas** module and transform our HTML table into a DataFrame:

In [31]:
# import module
import pandas as pd

# create a new DataFrame using the column names we extracted above
df1 = pd.DataFrame(columns = column_names)

# add data to the DataFrame row by row
for i,row in enumerate(data):
    df1.loc[i] = row

# use the first column as the index of the DataFrame:
df1 = df1.set_index(column_names[0])

# show the final DataFrame
df1

| | All | Since 2014 |
|---|---|---|
| Citations | 84 | 82 |
| h-index | 6 | 6 |
| i10-index | 3 | 3 |
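
Filling a DataFrame row by row works well for a small table, but pandas can also take the whole list of rows at once. The sketch below builds the same table in one call and, as an extra step that is not in the code above, converts the text values to integers so they can be used in calculations:

```python
import pandas as pd

# values copied from the extraction steps above
column_names = ['', 'All', 'Since 2014']
data = [['Citations', '84', '82'], ['h-index', '6', '6'], ['i10-index', '3', '3']]

# build the DataFrame in one call and use the first column as the index
df1 = pd.DataFrame(data, columns=column_names).set_index(column_names[0])

# the scraped values are strings, so convert them to integers
df1 = df1.astype(int)

print(df1.loc["Citations", "All"])
```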


Looks great! Let's now explore the second table:

In [39]:
# make a list of column names
column_names = [name.text for name in tables[1].findAll('th')]
print("Column names:")
print(column_names)

# extract the data from the body of the table
data = [[column.text for column in row.findAll('td')] for row in tables[1].tbody.findAll('tr')]

print("The first row of the table:")
print(data[0])

Column names:
['', '', '', 'Title', 'Cited by', 'Year']
The first row of the table:
['Dissolution process at solid/liquid interface in nanometric metallic multilayers: Molecular dynamics simulations versus diffusion modelingV Turlo, O Politano, F BarasActa Materialia 99, 363-372, 2015', '24', '2015']


Two things stand out here:
1. the number of column names is larger than the number of elements in a row of data
2. the first element of the row of data is one string composed of publication title, author names, and journal information

This makes transforming such HTML tables into pandas DataFrames a little more complicated. Let's make a DataFrame with the following columns:
* Publication title
* Authors' names
* Journal information
* Cited by
* Year

The last two can be easily extracted from the second and third columns of the HTML table. However, to extract the other data, we have to dive deeper into the HTML code of the first column. Let's have a look at the first column of the first row of the body of the table:

In [44]:
print(tables[1].tbody.tr.td.prettify())

<td class="gsc_a_t">
 <a class="gsc_a_at" data-href="/citations?view_op=view_citation&amp;hl=en&amp;oe=ASCII&amp;user=Jid5DjYAAAAJ&amp;citation_for_view=Jid5DjYAAAAJ:u-x6o8ySG0sC" href="javascript:void(0)">
  Dissolution process at solid/liquid interface in nanometric metallic multilayers: Molecular dynamics simulations versus diffusion modeling
 </a>
 <div class="gs_gray">
  V Turlo, O Politano, F Baras
 </div>
 <div class="gs_gray">
  Acta Materialia 99, 363-372
  <span class="gs_oph">
   , 2015
  </span>
 </div>
</td>



As we can see, the title of the publication can be accessed by using the **&lt;a&gt;** tag, while authors' names and journal information are stored under **&lt;div&gt;** tags. Let's then define a function that will split each row of the HTML table into the elements using the **&lt;a&gt;**, **&lt;div&gt;**, and **&lt;td&gt;** tags:

In [69]:
# define the function to divide one row of the data into elements
def divide(row):
    # get publication title (empty string if the row has no link)
    try:
        title = row.findAll('a')[0].text
    except IndexError:
        title = ''
    # get authors' names
    try:
        authors = row.findAll('div')[0].text
    except IndexError:
        authors = ''
    # get journal information
    try:
        journal_info = row.findAll('div')[1].text
    except IndexError:
        journal_info = ''
    # get number of citations
    try:
        cited_by = row.findAll('td')[1].text
    except IndexError:
        cited_by = ''
    # get publication year
    try:
        year = row.findAll('td')[2].text
    except IndexError:
        year = ''
    # return all the data as a list
    return [title, authors, journal_info, cited_by, year]


# extract the data from the body of the table
data = [divide(row) for row in tables[1].tbody.findAll('tr')]
data[0]

['Dissolution process at solid/liquid interface in nanometric metallic multilayers: Molecular dynamics simulations versus diffusion modeling',
 'V Turlo, O Politano, F Baras',
 'Acta Materialia 99, 363-372, 2015',
 '24',
 '2015']
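
The five `try` blocks all follow the same pattern, so they could be replaced by one small helper. This is only a sketch of an alternative, not the code used in the rest of this post. The names `text_at` and `divide_row` are mine, and the sample row is a simplified copy of the first table row with a shortened title and no attributes:

```python
from bs4 import BeautifulSoup


def text_at(elements, index):
    """Return the stripped text of elements[index], or '' if it is missing."""
    if index < len(elements):
        return elements[index].get_text(strip=True)
    return ""


def divide_row(row):
    """Split one publication row into title, authors, journal, citations, year."""
    links = row.find_all("a")
    divs = row.find_all("div")
    cells = row.find_all("td")
    return [text_at(links, 0), text_at(divs, 0), text_at(divs, 1),
            text_at(cells, 1), text_at(cells, 2)]


# simplified copy of the first row (title shortened, attributes dropped)
html = (
    "<table><tbody><tr>"
    "<td><a>Dissolution process at solid/liquid interface</a>"
    "<div>V Turlo, O Politano, F Baras</div>"
    "<div>Acta Materialia 99, 363-372<span>, 2015</span></div></td>"
    "<td>24</td><td>2015</td>"
    "</tr></tbody></table>"
)
row = BeautifulSoup(html, "html.parser").find("tr")
result = divide_row(row)
print(result)
```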

Great! Let's create the list of our custom column names and create the pandas DataFrame using the same code as for the first table:

In [71]:
# make the list of column names
column_names = ["Publication title", "List of authors", "Journal information", "Number of citations", "Year"]

# create a new DataFrame using the column names we extracted above
df2 = pd.DataFrame(columns = column_names)

# add data to the DataFrame row by row
for i,row in enumerate(data):
    df2.loc[i] = row

# show the final DataFrame
df2

| | Publication title | List of authors | Journal information | Number of citations | Year |
|---|---|---|---|---|---|
| 0 | Dissolution process at solid/liquid interface ... | V Turlo, O Politano, F Baras | Acta Materialia 99, 363-372, 2015 | 24 | 2015 |
| 1 | Modeling self-sustaining waves of exothermic d... | V Turlo, O Politano, F Baras | Acta Materialia 120, 189-204, 2016 | 16 | 2016 |
| 2 | Alloying propagation in nanometric Ni/Al multi... | V Turlo, O Politano, F Baras | Journal of Applied Physics 121 (5), 055304, 2017 | 10 | 2017 |
| 3 | Grain boundary complexions and the strength of... | V Turlo, TJ Rupert | Acta Materialia 151, 100-111, 2018 | 8 | 2018 |
| 4 | Comparative study of embedded-atom methods app... | V Turlo, F Baras, O Politano | Modelling and Simulation in Materials Science ... | 7 | 2017 |
| 5 | Microstructure evolution and self-propagating ... | V Turlo, O Politano, F Baras | Journal of Alloys and Compounds 708, 989-998, ... | 6 | 2017 |
| 6 | Dissolution at interfaces in layered solid-liq... | F Baras, V Turlo, O Politano | Journal of Materials Engineering and Performan... | 4 | 2016 |
| 7 | Model of phase separation and of morphology ev... | VV Turlo, AM Gusak, KN Tu | Philosophical Magazine 93 (16), 2013-2025, 2013 | 4 | 2013 |
| 8 | Dislocation-assisted linear complexion formati... | V Turlo, TJ Rupert | Scripta Materialia 154, 25-29, 2018 | 2 | 2018 |
| 9 | SHS in Ni/Al nanofoils: a review of experiment... | F Baras, V Turlo, O Politano, SG Vadchenko, AS... | Advanced Engineering Materials 20 (8), 1800091... | 2 | 2018 |
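
The citation counts and years are still text at this point. Below is a sketch of how they could be turned into numbers before any analysis. It uses three rows from the table above plus one made-up row with an empty citation cell, on the assumption that a paper without citations may come back as an empty string:

```python
import pandas as pd

column_names = ["Publication title", "Number of citations", "Year"]
# three rows taken from the table above, plus one made-up row without citations
data = [
    ["Dissolution process at solid/liquid interface ...", "24", "2015"],
    ["Modeling self-sustaining waves of exothermic d...", "16", "2016"],
    ["Alloying propagation in nanometric Ni/Al multi...", "10", "2017"],
    ["A hypothetical uncited paper", "", "2019"],
]
df = pd.DataFrame(data, columns=column_names)

# turn text into numbers; anything that is not a number becomes NaN
for column in ["Number of citations", "Year"]:
    df[column] = pd.to_numeric(df[column], errors="coerce")

# treat a missing citation count as zero and store the counts as integers
df["Number of citations"] = df["Number of citations"].fillna(0).astype(int)

total = df["Number of citations"].sum()
newest_first = df.sort_values("Year", ascending=False)
print(total)
```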


That's it! Our DataFrame is ready for analysis and visualization. In the next post, we will explore ways to enrich the data we extracted here by connecting it to external databases. We will work extensively with string transformations, **pandas** DataFrame manipulations, and data visualizations using the **matplotlib** and **seaborn** modules.