# Automated Literature Review with Scholarly for Python

Scholarly is a Python module that provides programmatic access to metadata from Google Scholar.

#### Installation

To use this notebook you need to install the Scholarly module from the command line using `pip install scholarly`.

#### Documentation

- Scholarly GitHub repository: https://github.com/scholarly-python-package/scholarly
- Scholarly documentation: https://scholarly.readthedocs.io/en/stable/

#### Notes:

1. Scholarly doesn't require an API key because it doesn't use an official API; it retrieves results from the Google Scholar web pages.
2. The module searches Google Scholar with a query string and returns a generator object, which behaves like an iterator: https://www.geeksforgeeks.org/generators-in-python/
3. Generator objects fetch results from Google Scholar as they are consumed. Use one either by passing it to the built-in `next()` function or by iterating over it in a `for ... in` loop.
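
The generator behaviour described in these notes can be tried without contacting Google Scholar. The following is a minimal sketch: `fake_search` and its titles are hypothetical stand-ins for a Scholarly search generator and are not part of the Scholarly API.

```python
def fake_search(query):
    """Hypothetical stand-in for a Scholarly search: yields one result at a time."""
    for n in range(1, 4):
        # A real search would fetch each result from Google Scholar at this point
        yield {'title': f'{query} paper {n}'}


results = fake_search('graphene')

# next() pulls a single result from the generator
first = next(results)
print(first['title'])

# A for loop consumes whatever results remain
remaining = []
for result in results:
    remaining.append(result['title'])
print(remaining)

# An exhausted generator raises StopIteration; passing a default avoids the exception
after_end = next(results, None)
print(after_end)
```

Because a generator keeps its position, the `for` loop only sees the results that `next()` has not already taken.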

**Warning:** Google Scholar can block your IP address when you use Scholarly. These temporary bans can last between 1 and 48 hours. To avoid issues of this kind, it is advisable to use a proxy server. Refer to the Scholarly GitHub page and documentation for further information.
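
As a rough sketch of what a proxy setup can look like, the helper below uses Scholarly's `ProxyGenerator` with free proxies. The helper name `configure_free_proxy` is hypothetical, free proxies are often unreliable, and the Scholarly documentation should be checked for the options supported by the installed version.

```python
def configure_free_proxy():
    """Try to route Scholarly requests through a free proxy; return True on success."""
    # Imported inside the function so that defining the helper has no side effects
    from scholarly import scholarly, ProxyGenerator

    pg = ProxyGenerator()
    # FreeProxies() looks for a working free proxy and reports whether it found one
    success = pg.FreeProxies()
    if success:
        # Subsequent Scholarly requests go through the proxy
        scholarly.use_proxy(pg)
    return success
```

Calling `configure_free_proxy()` once, before the first search, is enough; if it returns `False`, no proxy was configured and requests are sent directly.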

## Import Libraries

In [None]:
from scholarly import scholarly
import pandas as pd
import json
from flatten_json import flatten
import time
import matplotlib.pyplot as plt
%matplotlib inline

## Search for Author

In [None]:
# User inputs name of author
author_name = input('Enter author name: ')

# Request author from Google Scholar
author = next(scholarly.search_author(author_name))

# Display the author record as formatted JSON
print(json.dumps(author, indent=3))

#### Search for Author and Fill Response with Additional Details

In [None]:
# User inputs name of author
author_name = input('Enter author name: ')

# Search for the author; this returns a generator of matches
search_query = scholarly.search_author(author_name)

# Take the first match and obtain additional author details with fill()
author = scholarly.fill(next(search_query))

# Display the filled author record as formatted JSON
print(json.dumps(author, indent=3))

## Search for Publications

When performing a search, be aware that the choice and ordering of search terms can affect the results.

In [None]:
# User provides search parameters
search_term = input('Enter search term(s) separated by commas: ')
number_articles = int(input('Number of publications to return? '))

#### Request Publications from Google Scholar

In [None]:
search_query = scholarly.search_pubs(search_term)

#### Initialise the list of records and the article count for data collection

In [None]:
# Create a list to store the flattened publication records
data = []
# Count of articles collected so far
count = 0

#### Call the generator object in a loop and collect the results in a list

In [None]:
# Flatten each publication into a dictionary and add it to the list
for i in range(number_articles):
    # Check that the count of articles does not exceed the number requested
    if count < number_articles:
        try:
            # Request next publication
            pub = scholarly.fill(next(search_query))

            # Flatten publication details and remove unnecessary fields
            flat_dict = flatten(pub, root_keys_to_ignore={'source', 'author_id'})

            # Add dictionary to list of data
            data.append(flat_dict)

            #Increment count of articles
            count = count + 1

            #Debug logging
            print('Article ' + str(count) + ': ' + flat_dict['bib_title'])

        except StopIteration:
            print('Ending as generator exhausted')
            break

        except Exception as e:
            print('Error: ' + str(e))

        #Wait time in secs before calling for next article
        time.sleep(5)
    else:
        break
    
print('Search Complete - Reset count if more articles required.')
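
The loop above takes at most `number_articles` items from the generator. Where the running `count` does not need to persist between runs, the same "take the first n" pattern can be sketched with `itertools.islice`. Here `fake_pubs` and `take` are hypothetical names, and the stand-in generator lets the example run offline; with Scholarly the generator would be `search_query`.

```python
from itertools import islice


def fake_pubs():
    """Hypothetical stand-in for a publication search: yields nested records."""
    for n in range(1, 6):
        yield {'bib': {'title': f'Paper {n}', 'pub_year': str(2018 + n)}}


def take(generator, limit):
    """Collect at most `limit` items; stops early if the generator is exhausted."""
    return list(islice(generator, limit))


pubs = fake_pubs()
first_batch = take(pubs, 3)
# The generator remembers its position, so a second call continues where the first stopped
second_batch = take(pubs, 3)

print([p['bib']['title'] for p in first_batch])
print([p['bib']['title'] for p in second_batch])
```

Unlike the notebook loop, this sketch does not pause between requests or fill each record, so live queries would still need the `time.sleep()` delay and the `scholarly.fill()` call.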

#### Build the dataframe from the collected search results

In [None]:
df = pd.DataFrame(data)
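
Column names such as `bib_title` come from flattening each nested publication record before it is loaded into the dataframe. The sketch below is a simplified illustration of that idea, not the `flatten_json` implementation; `flatten_record` and the sample record are hypothetical.

```python
def flatten_record(record, parent_key='', sep='_'):
    """Flatten nested dicts and lists into a single-level dict with joined keys."""
    flat = {}
    if isinstance(record, dict):
        items = record.items()
    else:
        # Lists are flattened using the element position as the key
        items = enumerate(record)
    for key, value in items:
        new_key = f'{parent_key}{sep}{key}' if parent_key else str(key)
        if isinstance(value, (dict, list)):
            # Recurse into nested containers, carrying the key prefix along
            flat.update(flatten_record(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


sample = {
    'bib': {'title': 'An example paper', 'author': ['A. Author', 'B. Author']},
    'num_citations': 12,
}
flat_sample = flatten_record(sample)
print(flat_sample)
```

Each flattened dictionary becomes one row of the dataframe, and each joined key becomes a column.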

#### Display the dataframe

In [None]:
#Dataframe head
df.head()

#### Remove Duplicates

In [None]:
# drop_duplicates() returns a new dataframe, so assign the result back
df = df.drop_duplicates()
df

#### Display Summary Information

In [None]:
#Summary information
df.info()

In [None]:
#Unique publications
len(df['bib_bib_id'].unique())

In [None]:
#Publication provenance
provenance_table = df.groupby(['bib_bib_id', 'bib_journal']).size().reset_index()
provenance_table = provenance_table.groupby('bib_journal').size()
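# Escape ampersands in the journal names (assumed intent: LaTeX-safe output)
provenance_table.index = provenance_table.index.str.replace('&', r'\&', regex=False)
provenance_table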

#### Display Unique Publications per Year

In [None]:
articles_per_year = df.groupby(['bib_bib_id', 'bib_pub_year']).size().reset_index().groupby('bib_pub_year').size()

fig, ax = plt.subplots(figsize=(10, 5))

ax.plot(articles_per_year, zorder=1, linewidth=3)

ax.set_title('Total Articles per Year', fontsize=20)
ax.set_ylabel('Number of Records', fontsize=10)

ax.tick_params(axis='y', labelsize=10)
ax.tick_params(axis='x', labelsize=10, labelrotation=90)

plt.show()
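
The chained `groupby` first collapses repeated rows of the same publication and then counts publications per year. The same logic can be sketched in plain Python; the rows below are made-up sample data, not search results.

```python
from collections import Counter

rows = [
    {'bib_bib_id': 'smith2019', 'bib_pub_year': '2019'},
    {'bib_bib_id': 'smith2019', 'bib_pub_year': '2019'},  # duplicate row
    {'bib_bib_id': 'lee2019', 'bib_pub_year': '2019'},
    {'bib_bib_id': 'khan2021', 'bib_pub_year': '2021'},
]

# Step 1: keep each (publication ID, year) pair once
unique_pairs = {(row['bib_bib_id'], row['bib_pub_year']) for row in rows}

# Step 2: count how many unique publications fall in each year
per_year = Counter(year for _, year in unique_pairs)

for year in sorted(per_year):
    print(year, per_year[year])
```

Counting rows directly would report three records for 2019 here, which is why the duplicates are collapsed first.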

## Export Data

Before writing the files, create a folder called 'data' in the folder that contains this Jupyter notebook.
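
The folder can also be created from within the notebook. This is a small sketch using the standard library; `ensure_output_dir` is a hypothetical helper name.

```python
from pathlib import Path


def ensure_output_dir(name='data'):
    """Create the output folder if it is missing and return its path."""
    out_dir = Path(name)
    # exist_ok=True makes the call safe to repeat when the folder already exists
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


out_dir = ensure_output_dir('data')
print(out_dir.resolve())
```

The path is relative, so the folder is created in the notebook's current working directory.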

#### Format Filename

In [None]:
replacement_name = search_term.replace(' ', '_')
replacement_name = replacement_name.replace(',', '')
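
Search terms can contain characters other than spaces and commas that are awkward or invalid in filenames, such as quotes, slashes, and colons. A slightly more defensive variant is sketched below; `make_filename_stem` is a hypothetical helper, and the set of characters it keeps is an assumption rather than a requirement of Scholarly or pandas.

```python
import re


def make_filename_stem(search_term):
    """Turn a search term into a filename-safe stem."""
    # Drop commas, as the notebook does
    stem = search_term.replace(',', '')
    # Collapse runs of whitespace into single underscores
    stem = re.sub(r'\s+', '_', stem.strip())
    # Remove anything other than letters, digits, underscores, and hyphens
    stem = re.sub(r'[^\w\-]', '', stem)
    return stem


example_stem = make_filename_stem('machine learning, "deep/neural" nets')
print(example_stem)
```

For a plain term such as `'a, b'` the helper returns `'a_b'`, the same result as the two `replace()` calls above.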

#### Save to CSV

In [None]:
df.to_csv('data/' + replacement_name + '_Search_Scholarly' + '.csv', index=False)

#### Save to JSON

In [None]:
df.to_json('data/' + replacement_name + '_Search_Scholarly' + '.json', orient='records')