![DBC](Images/DBC.png)

**Introduction**

[scholarly](https://pypi.org/project/scholarly/) is a Python module that retrieves bibliometrics from [Google Scholar](https://scholar.google.ca/). This notebook uses scholarly to scrape those metrics for a list of authors.

This project currently works with scholarly 1.4.5.

**Installation and Setup**

1. Upload this project to [Syzygy](https://ubc.syzygy.ca/).
    - Click the "+" button to open a new launcher and click "Terminal".
    - Type "git clone https://github.com/ubcbraincircuits/Scholar_Scraper" and press Enter.
    - The repository should now be cloned into your directory.
2. [Install scholarly](https://pypi.org/project/scholarly/)
    - In the same terminal, type "pip3 install --user git+https://github.com/scholarly-python-package/scholarly.git" and press Enter.
3. Upload a .csv file with the list of author names. Author names should match their names in Google Scholar. This file should be in the same directory as this notebook file.
4. Modify the names of the input/output files in step 1 of the notebook below. The input file name must match the .csv file name.
5. Set the "affiliations" variable to a list of the institution names the researchers are affiliated with. Include both the abbreviated and the long form.
6. Run all cells (press Shift+Enter to run a cell, or click the play button above).
7. Check for warnings in step 5 of the notebook and make sure that the authors scraped from Google Scholar have the correct affiliation and are in fact the correct authors. You may need to change the names in the input .csv file to match the author names on Google Scholar.
8. Step 6 of the notebook should produce an output .csv file in the same directory as this notebook file. You can check the last column of this file for warnings.
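
The input file layout implied by the loading code below is one author name per row, in a single column. The following is a minimal sketch of such a file, assuming no header row (the loader would treat a header as a name). The file name and author names are hypothetical placeholders.

```python
import csv
import os
import tempfile

# Hypothetical author names, used only to illustrate the file layout.
sample_names = ["Jane Q. Example", "John R. Placeholder"]

# Write one name per row, in a single column, with no header row.
sample_path = os.path.join(tempfile.mkdtemp(), "sample_authors.csv")
with open(sample_path, mode="w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows([name] for name in sample_names)

# Read the file back, keeping only single-column rows.
# utf-8-sig also strips a byte-order mark if a spreadsheet export added one.
with open(sample_path, encoding="utf-8-sig", newline="") as f:
    loaded = [row[0] for row in csv.reader(f) if len(row) == 1]

print(loaded)
```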


In [1]:
from scholarly import scholarly
import csv
import warnings

1. Modify the names of the input and output files. The input file should be in this directory. The output file does not have to exist yet (it will be created). 

In [2]:
input_authors = 'DBC Investigators.csv'
output_data = 'output_dbc.csv'

2. Load author list as .csv. Note: author names should match their name on Google Scholar.

In [3]:
# Read the author names in a single pass. Only rows with exactly one column
# are treated as names; blank or multi-column rows are skipped.
author_names = []
with open(input_authors, encoding="utf-8-sig", newline="") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=",")
    for row in csv_reader:
        if len(row) == 1:
            author_names.append(row[0])


3. Specify the institutions the researchers are affiliated with. Tip: include both the long and the abbreviated institution names. Remember to wrap each name in quotation marks!

In [4]:
affiliations = ["University of British Columbia", "UBC"]
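
The affiliation check in step 4 is a plain substring test, so it is case-sensitive. The sketch below illustrates this with a hypothetical helper, `has_expected_affiliation`, which is not part of the notebook; the affiliation strings are made up for the example.

```python
def has_expected_affiliation(affiliation_text, expected):
    """Return True if any expected institution name appears in the text."""
    return any(name in affiliation_text for name in expected)


affiliations = ["University of British Columbia", "UBC"]

# Long form matches.
match_long = has_expected_affiliation(
    "Professor, University of British Columbia", affiliations)
# A different institution does not match.
match_other = has_expected_affiliation(
    "Professor, Some Other University", affiliations)
# Lowercase "ubc" does not match because the test is case-sensitive.
match_lower = has_expected_affiliation("Professor, ubc", affiliations)

print(match_long, match_other, match_lower)
```

If a profile spells the institution differently, adding that spelling to `affiliations` avoids a spurious warning.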

4. Create rows array containing data for each author.

In [8]:
rows = []
warning_list=[]


for i, athr in enumerate(author_names):
    row_num=str(i+1)
    search_query = scholarly.search_author(athr)
    try:
        author = next(search_query)
    except (RuntimeError, TypeError, StopIteration):
        # No profile found: keep the queried name and flag the row
        row = [athr, '', '', '', '', '', '', '', '', 'no information found']
    else:
        # Fetch the profile basics and citation indices for the first match
        data_dict = scholarly.fill(author, sections=['basics', 'indices'])
        row = [
            data_dict['name'], data_dict['scholar_id'],
            data_dict['citedby'], data_dict['citedby5y'],
            data_dict['hindex'], data_dict['hindex5y'],
            data_dict['i10index'], data_dict['i10index5y'],
            data_dict['affiliation'],
        ]
        # Create list of authors who do not have the specified affiliation
        if not any(a in data_dict['affiliation'] for a in affiliations):
            warning_list_row=[i+1, data_dict['name']]
            warning_list.append(warning_list_row)
            row.append("Specified institutions not found in affiliation!")
    finally:    
        rows.append(row)
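
The row-building logic can be tried offline with a plain dictionary in place of a live Google Scholar result. This is only a sketch: `build_row` and the sample record are hypothetical, and the use of `.get` with an empty default assumes that a profile might lack some fields, which the notebook itself does not handle.

```python
# Field names taken from the keys used in the cell above.
FIELDS = ["name", "scholar_id", "citedby", "citedby5y", "hindex",
          "hindex5y", "i10index", "i10index5y", "affiliation"]


def build_row(data_dict, affiliations):
    """Build one output row, leaving missing fields blank (an assumption)."""
    row = [data_dict.get(field, "") for field in FIELDS]
    # Flag the row if none of the expected institutions appear
    if not any(a in data_dict.get("affiliation", "") for a in affiliations):
        row.append("Specified institutions not found in affiliation!")
    return row


# Hypothetical record standing in for the dictionary returned by scholarly.fill
fake_author = {
    "name": "Author A",
    "scholar_id": "XXXXXXXX",
    "citedby": 120,
    "hindex": 5,
    "affiliation": "Some Other University",
}

row = build_row(fake_author, ["University of British Columbia", "UBC"])
print(row)
```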
        
                

5. Check for authors scraped from Google Scholar without any of the specified institutions listed as their affiliation. Make sure you have the correct author.

In [11]:
# Notify user of any authors without specified affiliation in case the wrong author has been scraped from Google Scholar.
if warning_list: # (if not empty)
    print('Warning: The following authors (with respective row numbers) do not have any of the specified institutions listed as their affiliation: ')
    print(warning_list)
    

[[6, 'Cheryl Rivers'], [13, 'Jason S. Snyder'], [34, 'Sophia Frangou'], [38, 'Brian D. Fisher'], [39, 'Leigh Anne Swayne'], [41, 'Adrienne Fairhall'], [42, 'Eric Shea-Brown'], [43, 'Emily Sylwestrak'], [44, 'Andy Y. Shih']]


6. Write rows to output .csv

In [10]:
# Create (or overwrite) the output file and write one row per author.
# newline='' prevents blank lines between rows on Windows; utf-8 preserves non-ASCII names.
with open(output_data, mode='w', encoding="utf-8", newline='') as csv_file:
    csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    query_row = ['Name', 'Scholar ID', 'Cited by', 'Cited by 5 years', 'H Index', 'H Index 5 years', 'i10 Index', 'i10 Index 5 years', 'Affiliation', 'Warning']
    csv_writer.writerow(query_row)
    csv_writer.writerows(rows)
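
To review warnings without opening the file in a spreadsheet, the output can be read back and filtered on its last column. The sketch below uses a trimmed-down, hypothetical sample held in memory, with only three of the columns; rows without a warning have no final field, so `csv.DictReader` fills it with `None`.

```python
import csv
import io

# Hypothetical, shortened output; a real file has more columns.
sample_output = io.StringIO(
    "Name,Affiliation,Warning\n"
    "Author A,University of British Columbia\n"
    "Author B,Some Other University,Specified institutions not found in affiliation!\n"
    "Author C,,no information found\n"
)

# Keep only the rows whose Warning column is present and non-empty.
reader = csv.DictReader(sample_output)
flagged = [(row["Name"], row["Warning"]) for row in reader if row["Warning"]]

for name, warning in flagged:
    print(f"{name}: {warning}")
```

For a real run, replace the in-memory sample with `open(output_data, encoding="utf-8", newline="")`.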