# Automated Literature Review with Scholarly for Python

Scholarly is a Python module that provides programmatic access to metadata from Google Scholar.

#### Installation

To use this notebook, install the scholarly package from the command line with `pip install scholarly`.

#### Documentation

- Scholarly GitHub repository: https://github.com/scholarly-python-package/scholarly
- Scholarly documentation: https://scholarly.readthedocs.io/en/stable/

#### Notes:

1. Scholarly does not require an API key: Google Scholar has no public API, so the package retrieves results by scraping the Google Scholar web pages.
2. A search takes a query string and returns a generator object, which behaves like an iterator: https://www.geeksforgeeks.org/generators-in-python/
3. The generator fetches results from Google Scholar lazily, as they are requested. Consume it either by passing it to the built-in `next()` function or by iterating over it in a `for ... in` loop.
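The two ways of consuming a generator can be tried without touching the network. The sketch below uses a hypothetical stand-in, `fake_search`, in place of a real scholarly search; only the consumption pattern carries over to the real generator.

```python
from itertools import islice


def fake_search(term):
    """Hypothetical stand-in for a scholarly search: yields results lazily."""
    for n in range(1, 6):
        yield {"title": f"{term} paper {n}"}


results = fake_search("digital twin")

# next() pulls exactly one result and advances the generator
first = next(results)

# islice() caps how many further results a for loop will consume
batch = [pub["title"] for pub in islice(results, 2)]

# Whatever is left can be drained in one go
remaining = list(results)

# Once exhausted, next() raises StopIteration unless a default is supplied
sentinel = next(results, None)
```

Capping the loop with `islice()` (or supplying a default to `next()`) avoids an unhandled `StopIteration` when a search returns fewer results than expected.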

**Important:** Google Scholar often blocks automated requests such as those made by scholarly. The most common way to avoid being blocked is to route requests through proxies. See the scholarly GitHub page for further information.
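As a rough sketch of what a proxy setup might look like, the function below follows the pattern described in the project's README: create a `ProxyGenerator`, ask it for a free proxy, and register it with scholarly. The helper name `configure_free_proxy` is made up for this notebook, and the call names should be checked against the documentation for the installed version before relying on them.

```python
def configure_free_proxy():
    """Route scholarly's requests through a free proxy (sketch, see README)."""
    # Imported inside the function so defining it does not require scholarly
    from scholarly import scholarly, ProxyGenerator

    pg = ProxyGenerator()
    # Assumption: FreeProxies() selects a working free proxy, as in the README
    pg.FreeProxies()
    # Register the proxy generator so later searches go through it
    scholarly.use_proxy(pg)
```

Call `configure_free_proxy()` once, before the first search. Free proxies tend to be slow and unreliable, so the README also describes paid alternatives.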

## Import libraries

In [None]:
from scholarly import scholarly
import pandas as pd

import time

import matplotlib.pyplot as plt

%matplotlib inline

## Search for Author

In [None]:
print(next(scholarly.search_author('oliver dawkins')))

#### Search for Author and fill details

In [None]:
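# Retrieve the first matching author, fill in the details, and print
# (scholarly 1.x fills results with scholarly.fill() rather than a .fill() method)
search_query = scholarly.search_author('oliver dawkins')
author = scholarly.fill(next(search_query))
print(author)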

## Search for Publications

In [None]:
# Search parameters
search_term = 'digital twin'
number_articles = 20

#### Request Publications from Google Scholar

In [None]:
search_query = scholarly.search_pubs(search_term)

#### Call the generator object in a loop and output to a dataframe

In [None]:
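# Collect one record (the publication's 'bib' dictionary) per result
records = []

for i in range(number_articles):
    try:
        # scholarly 1.x returns plain dicts and fills them with scholarly.fill()
        pub = scholarly.fill(next(search_query))
    except StopIteration:
        # The search returned fewer results than requested
        break
    records.append(pub['bib'])

    # Debug logging
    print('Added 1 publication: ' + pub['bib']['title'])

    # Wait time in secs before calling for next article
    time.sleep(2)

print('Building dataframe')

# Build a single dataframe with one row per publication
df = pd.DataFrame(records)

# Assumption: scholarly 1.x names the BibTeX key and year 'bib_id' and 'pub_year';
# rename them so the cells below can keep using 'ID' and 'year'
df = df.rename(columns={'bib_id': 'ID', 'pub_year': 'year'})

print('done')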

#### Display the dataframe and summary statistics

In [None]:
# Dataframe head
df.head()

In [None]:
# Summary information
df.info()

In [None]:
# Number of unique publications
len(df['ID'].unique())

In [None]:
# Publication provenance: number of unique publications per journal
provenance_table = df.groupby(['ID', 'journal']).size().reset_index()
provenance_table = provenance_table.groupby('journal').size()
provenance_table
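The two-step `groupby` above counts each publication once per journal even if the dataframe holds several rows for the same `ID`. The toy example below (made-up IDs and journal names, not real search results) shows this and compares it with a one-step form that should give the same counts.

```python
import pandas as pd

# Hypothetical records in which some publication IDs appear on several rows
toy = pd.DataFrame({
    "ID": ["a1", "a1", "b2", "c3", "c3", "c3"],
    "journal": ["J. Twins", "J. Twins", "J. Twins", "Cities", "Cities", "Cities"],
})

# Two-step form: collapse to unique (ID, journal) pairs, then count per journal
two_step = (
    toy.groupby(["ID", "journal"]).size().reset_index().groupby("journal").size()
)

# One-step form: count distinct IDs within each journal
one_step = toy.groupby("journal")["ID"].nunique()
```

Both results list one publication for "Cities" and two for "J. Twins"; the one-step form states the intent more directly.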

#### Display Unique Publications per Year

In [None]:
# Count unique publications per year: one row per (ID, year), then count per year
articles_per_year = df.groupby(['ID', 'year']).size().reset_index().groupby('year').size()

fig, ax = plt.subplots(figsize=(10, 5))

ax.plot(articles_per_year, zorder=1, linewidth=3)

ax.set_title('Total Articles per Year', fontsize=20)
ax.set_ylabel('Number of Records', fontsize=10)

ax.tick_params(axis='y', labelsize=10)
ax.tick_params(axis='x', labelsize=10, labelrotation=90)

plt.show()

#### Save dataframe to CSV

In [None]:
df.to_csv('../software_data.csv')

#### Clear dataframe

In [None]:
del df