GitHub - starzia/bibliometrics: Scripts to gather and analyze bibliographies for faculty at the top U.S. business schools

Intro

If you are just looking at this code as an example of how to scrape the web, then I'd recommend you look at these files:

school/*.py: which scrape faculty information from ten different schools. Each one is runnable on its own.
web_util.py: which has various utility functions for downloading and parsing pages
google_scholar.py: which uses Selenium to scrape Google Scholar data. The wait_for_captchas function is especially helpful in detecting three different types of Google captchas.
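For readers curious about the captcha-waiting idea, here is a minimal sketch of the general polling pattern. It is not the repository's wait_for_captchas implementation: the function names and the marker strings below are illustrative assumptions, and the real checks in google_scholar.py may look quite different.

```python
import time
from typing import Callable, Iterable

# Illustrative markers only (assumptions, not taken from google_scholar.py).
CAPTCHA_MARKERS = (
    "g-recaptcha",       # class name commonly used by reCAPTCHA widgets
    "unusual traffic",   # phrase from Google's "unusual traffic" interstitial
)


def looks_like_captcha(page_source: str,
                       markers: Iterable[str] = CAPTCHA_MARKERS) -> bool:
    """Return True if any marker appears in the page source (case-insensitive)."""
    lowered = page_source.lower()
    return any(marker.lower() in lowered for marker in markers)


def wait_until_clear(get_source: Callable[[], str],
                     poll_seconds: float = 5.0,
                     max_polls: int = 120) -> bool:
    """Poll the page until no captcha marker is present.

    Returns True once the page looks clear, or False if max_polls is
    exhausted first. A human is expected to solve the captcha meanwhile.
    """
    for _ in range(max_polls):
        if not looks_like_captcha(get_source()):
            return True
        time.sleep(poll_seconds)
    return False
```

With Selenium, get_source could be something like `lambda: driver.page_source`, so the same loop works against a live browser window while you solve the captcha by hand.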

Prerequisites

This is written for Python 3. It will not work with the Python 2.7 that comes with Macs. Mac users can install Python 3 using brew.

Install the prerequisite packages as follows:

pip install --user lxml cssselect pprint selenium unidecode google-api-python-client \
    pycurl bs4 chardet editdistance matplotlib pdfminer3k

(Above, the pip command might actually be pip3 on your system, and you may not need the --user flag.)
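To confirm that the install worked, a small check along these lines may help. It is only a sketch: the helper name missing_modules is hypothetical, and the list maps each pip package above to the module name it is normally imported under (for example, google-api-python-client is imported as googleapiclient and pdfminer3k as pdfminer). pprint is left out because the standard library already provides a module of that name.

```python
import importlib.util

# Import names for the pip packages listed above; the import name can
# differ from the pip package name.
REQUIRED_MODULES = [
    "lxml", "cssselect", "selenium", "unidecode", "googleapiclient",
    "pycurl", "bs4", "chardet", "editdistance", "matplotlib", "pdfminer",
]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

Any name reported as missing points to the pip package that still needs installing.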

To use the Google Scholar scraper, you must first install the Selenium Gecko and Chrome drivers from:

https://github.com/mozilla/geckodriver/releases
https://sites.google.com/a/chromium.org/chromedriver/downloads
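One quick way to see whether the drivers are visible to Python is to look them up on your PATH. This is a sketch under the assumption that the executables keep their default names, geckodriver and chromedriver; the helper name locate_drivers is made up for illustration.

```python
import shutil

# Default executable names (an assumption; adjust if you renamed them).
DRIVER_EXECUTABLES = ("geckodriver", "chromedriver")


def locate_drivers(names=DRIVER_EXECUTABLES):
    """Map each driver name to its full path on PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_drivers().items():
        print(f"{name}: {path or 'not found on PATH'}")
```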

On RHEL/CentOS, you will first have to run:

sudo yum install libcurl-devel
PYCURL_SSL_LIBRARY=nss pip3 install pycurl

On Mac, you may first have to run:

PYCURL_SSL_LIBRARY=openssl LDFLAGS="-L/usr/local/opt/openssl/lib" CPPFLAGS="-I/usr/local/opt/openssl/include" pip install --no-cache-dir pycurl

Running

There isn't a single command that'll do everything. The project is a collection of functions intended to be run interactively.

For a simple demo you can run

python school/kellogg.py

A full scrape can be started with ./scraper.py. Before running it, however, you have to open scraper.py and set the variable do_reload = True.
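This README does not show how do_reload is used inside scraper.py. As a rough illustration of what a reload flag of this kind commonly does, the sketch below reuses cached results unless the flag forces a fresh scrape. The function load_or_scrape, the pickle cache, and the file path are all assumptions for illustration, not the project's actual code.

```python
import os
import pickle


def load_or_scrape(cache_path, scrape_fn, do_reload=False):
    """Return cached data unless do_reload is True or no cache exists.

    scrape_fn is any zero-argument callable that performs the (slow)
    scrape and returns picklable data.
    """
    if not do_reload and os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = scrape_fn()
    # Save the fresh results so later runs can skip the scrape.
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data
```

Under this pattern, leaving the flag at False makes repeated interactive runs cheap, while setting it to True trades time for fresh data.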

To run an analysis that produces all the plots used in the LaTeX report, run ./analysis.py. This relies on a full scrape having been run beforehand.

Output

./scraper.py
