goscholar

Google Scholar scraper written in Go

Install

$ go get github.com/sotetsuk/goscholar

For the command-line tool:

$ go get github.com/sotetsuk/goscholar/cmd/goscholar
$ goscholar -h

Build

You can also use the build script to build the command-line tool from source.

$ git clone git@github.com:sotetsuk/goscholar.git
$ goscholar/build

Options:

--dev: apply go fmt to all files and save dependencies

After the build script has run, you will find the cross-compiled binaries in the bin directory.

Features

  • API for Go
  • API for command line
  • search by keywords, title, and author
  • find by <cluster-id>
  • search the articles citing <cluster-id>
  • JSON output
  • recursive crawling is not implemented

Go API

Example

// create Query and generate URL
q := Query{Keywords:"nature 2015", Author:"y bengio", Title:"Deep learning"}
url := q.SearchUrl()

// fetch document sending the request to the URL
doc, err := Fetch(url)
if err != nil {
	log.Error(err)
	return
}

// parse articles
ch := make(chan *Article, 10)
go ParseDocument(ch, doc)
for a := range ch {
	fmt.Println("---")
	fmt.Println(a)
}
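
The `for a := range ch` loop above only terminates if the producer closes the channel after sending its last article. The sketch below shows that producer/consumer pattern in isolation. The `article` type and the `produce` and `collect` functions are hypothetical stand-ins, not part of goscholar, and the idea that `ParseDocument` closes the channel itself is an assumption inferred from the example.

```go
package main

import "fmt"

// article is a hypothetical stand-in for goscholar's Article type.
type article struct {
	Title string
	Year  string
}

// produce sends every item on ch and closes ch when it is done, so that a
// consumer ranging over ch knows when to stop.
func produce(ch chan<- *article, items []*article) {
	defer close(ch)
	for _, a := range items {
		ch <- a
	}
}

// collect drains ch into a slice; it returns once ch has been closed.
func collect(ch <-chan *article) []*article {
	var out []*article
	for a := range ch {
		out = append(out, a)
	}
	return out
}

func main() {
	items := []*article{
		{Title: "Deep learning", Year: "2015"},
		{Title: "Deep learning via Hessian-free optimization", Year: "2010"},
	}

	// A buffered channel lets the producer run ahead of the consumer.
	ch := make(chan *article, 10)
	go produce(ch, items)

	for _, a := range collect(ch) {
		fmt.Println("---")
		fmt.Println(a.Title, a.Year)
	}
}
```

If the producer never closed the channel, the range loop would block forever once the last article had been received.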

Command line API

Example

$ goscholar search --keywords "deep learning nature" --author "y bengio" --after 2015 --num 1 | jq .
[
  {
    "title": {
      "name": "Deep learning",
      "url": "http://www.nature.com/nature/journal/v521/n7553/abs/nature14539.html"
    },
    "year": "2015",
    "cluster_id": "5362332738201102290",
    "num_cite": "499",
    "num_ver": "7",
    "info_id": "0qfs6zbVakoJ",
    "link": {
      "name": "psu.edu",
      "url": "http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.436.894&rep=rep1&type=pdf",
      "format": "PDF"
    },
    "bibtex": "@article{lecun2015deep, title={Deep learning}, author={LeCun, Yann and Bengio, Yoshua and Hinton, Geoffrey}, journal={Nature}, volume={521}, number={7553}, pages={436--444}, year={2015}, publisher={Nature Publishing Group}}",
    "author": [
      "LeCun, Yann",
      "Bengio, Yoshua",
      "Hinton, Geoffrey"
    ],
    "journal": "Nature",
    "booktitle": "",
    "volume": "521",
    "number": "7553",
    "pages": "436--444",
    "publisher": "Nature Publishing Group"
  }
]
$ goscholar find 15502119379559163003 | jq .
[
  {
    "title": {
      "name": "Deep learning via Hessian-free optimization",
      "url": "http://machinelearning.wustl.edu/mlpapers/paper_files/icml2010_Martens10.pdf"
    },
    "year": "2010",
    "cluster_id": "15502119379559163003",
    "num_cite": "269",
    "num_ver": "",
    "info_id": "e6RSJHGXItcJ",
    "link": {
      "name": "wustl.edu",
      "url": "http://machinelearning.wustl.edu/mlpapers/paper_files/icml2010_Martens10.pdf",
      "format": "PDF"
    },
    "bibtex": "@inproceedings{martens2010deep, title={Deep learning via Hessian-free optimization}, author={Martens, James}, booktitle={Proceedings of the 27th International Conference on Machine Learning (ICML-10)}, pages={735--742}, year={2010}}",
    "author": [
      "Martens, James"
    ],
    "journal": "",
    "booktitle": "Proceedings of the 27th International Conference on Machine Learning (ICML-10)",
    "volume": "",
    "number": "",
    "pages": "735--742",
    "publisher": ""
  }
]
$ goscholar cite 15502119379559163003 --num 1 | python -mjson.tool
[
  {
    "title": {
      "name": "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups",
      "url": "http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6296526"
    },
    "year": "2012",
    "cluster_id": "3674494786452480182",
    "num_cite": "1559",
    "num_ver": "27",
    "info_id": "tmCGO4pt_jIJ",
    "link": {
      "name": "toronto.edu",
      "url": "http://www.cs.toronto.edu/~asamir/papers/SPM_DNN_12.pdf",
      "format": "PDF"
    },
    "bibtex": "@article{hinton2012deep, title={Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups}, author={Hinton, Geoffrey and Deng, Li and Yu, Dong and Dahl, George E and Mohamed, Abdel-rahman and Jaitly, Navdeep and Senior, Andrew and Vanhoucke, Vincent and Nguyen, Patrick and Sainath, Tara N and others}, journal={Signal Processing Magazine, IEEE}, volume={29}, number={6}, pages={82--97}, year={2012}, publisher={IEEE}}",
    "author": [
      "Hinton, Geoffrey",
      "Deng, Li",
      "Yu, Dong",
      "Dahl, George E",
      "Mohamed, Abdel-rahman",
      "Jaitly, Navdeep",
      "Senior, Andrew",
      "Vanhoucke, Vincent",
      "Nguyen, Patrick",
      "Sainath, Tara N",
      "others"
    ],
    "journal": "Signal Processing Magazine, IEEE",
    "booktitle": "",
    "volume": "29",
    "number": "6",
    "pages": "82--97",
    "publisher": "IEEE"
  }
]

(This article cites cluster ID 15502119379559163003, "Deep learning via Hessian-free optimization".)
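
Because the output is plain JSON, it can be consumed from another Go program without importing goscholar. The sketch below decodes an abbreviated copy of the `find` output shown above. The `record` and `link` types and the `decodeRecords` function are hypothetical names introduced for illustration; only the JSON keys are taken from the sample output, and the numeric-looking fields are kept as strings because the sample quotes them.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// link mirrors the "title" and "link" objects in the sample output.
type link struct {
	Name   string `json:"name"`
	URL    string `json:"url"`
	Format string `json:"format,omitempty"`
}

// record mirrors one element of the JSON array printed by the CLI.
type record struct {
	Title     link     `json:"title"`
	Year      string   `json:"year"`
	ClusterID string   `json:"cluster_id"`
	NumCite   string   `json:"num_cite"`
	NumVer    string   `json:"num_ver"`
	InfoID    string   `json:"info_id"`
	Link      link     `json:"link"`
	Bibtex    string   `json:"bibtex"`
	Author    []string `json:"author"`
	Journal   string   `json:"journal"`
	Booktitle string   `json:"booktitle"`
	Volume    string   `json:"volume"`
	Number    string   `json:"number"`
	Pages     string   `json:"pages"`
	Publisher string   `json:"publisher"`
}

// decodeRecords parses a JSON array of records.
func decodeRecords(data string) ([]record, error) {
	var recs []record
	if err := json.Unmarshal([]byte(data), &recs); err != nil {
		return nil, err
	}
	return recs, nil
}

// sample is an abbreviated copy of the "find" output shown above.
const sample = `[
  {
    "title": {"name": "Deep learning via Hessian-free optimization"},
    "year": "2010",
    "cluster_id": "15502119379559163003",
    "num_cite": "269",
    "author": ["Martens, James"],
    "pages": "735--742"
  }
]`

func main() {
	recs, err := decodeRecords(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, r := range recs {
		fmt.Println(r.Title.Name, r.Year, r.Author)
	}
}
```

Keys that are missing from the input, such as `journal` in the abbreviated sample, simply decode to their zero values.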

Usage

goscholar: Google Scholar crawler and scraper written in Go

Usage:
  goscholar search [--keywords=<keywords>] [--author=<author>] [--title=<title>]
                   [--after=<year>] [--before=<year>] [--num=<num>] [--start=<start>]
                   [--user-agent=<user-agent>]
  goscholar find <cluster-id> [--user-agent=<user-agent>]
  goscholar cite <cluster-id> [--after=<year>] [--before=<year>] [--num=<num>] [--start=<start>]
                              [--user-agent=<user-agent>]
  goscholar -h | --help
  goscholar --version

Query-options:
  <cluster-id>
  --keywords=<keywords>
  --author=<author>
  --title=<title>

Search-options:
  --after=<year>
  --before=<year>
  --num=<num>
  --start=<start>

Others:
  -h --help
  --version
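
When driving the command-line tool from another program, the `search` flags listed above can be assembled programmatically. The following is a minimal sketch under stated assumptions: `searchOptions` and `buildSearchArgs` are hypothetical helpers, not part of goscholar, and only the flag names come from the usage text. The resulting slice could be passed to `exec.Command("goscholar", args...)` from the standard `os/exec` package.

```go
package main

import (
	"fmt"
	"strconv"
)

// searchOptions is a hypothetical helper type whose fields map onto the
// flags of "goscholar search". A zero value means "flag not set".
type searchOptions struct {
	Keywords string
	Author   string
	Title    string
	After    int
	Before   int
	Num      int
}

// buildSearchArgs turns the options into an argument list, using the
// --flag=value form from the usage text and skipping unset options.
func buildSearchArgs(o searchOptions) []string {
	args := []string{"search"}
	addStr := func(flag, v string) {
		if v != "" {
			args = append(args, "--"+flag+"="+v)
		}
	}
	addInt := func(flag string, v int) {
		if v > 0 {
			args = append(args, "--"+flag+"="+strconv.Itoa(v))
		}
	}
	addStr("keywords", o.Keywords)
	addStr("author", o.Author)
	addStr("title", o.Title)
	addInt("after", o.After)
	addInt("before", o.Before)
	addInt("num", o.Num)
	return args
}

func main() {
	args := buildSearchArgs(searchOptions{
		Keywords: "deep learning nature",
		Author:   "y bengio",
		After:    2015,
		Num:      1,
	})
	// Each element is one argv entry, so values containing spaces need
	// no shell quoting.
	fmt.Printf("%q\n", args)
}
```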

Dependencies

Related Work

goscholar is inspired by scholar.py.

Contribute

Contributions are more than welcome! See Issues for what is needed.

License

MIT License