scrape-google-query

This repo holds scripts for scraping documents from a Google query. For example, say you want to find examples of university codes of conduct. Searching Google for "university code of conduct pdf" returns a mix of file types and webpages. If you add "filetype:pdf" to that query, the results are links to PDFs hosted on some server. This tool scrapes those links and downloads the files.
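
To make the query construction above concrete, here is a minimal sketch of how a search URL with the `filetype:` operator could be assembled. The function name `build_search_url` and its parameters are hypothetical and are not taken from scrape.py; only the `filetype:pdf` operator and the `.com` default domain come from this README.

```python
from urllib.parse import urlencode


def build_search_url(query, filetype="pdf", domain=".com"):
    """Return a Google search URL restricted to one file type.

    Hypothetical helper for illustration; scrape.py may build its URL differently.
    """
    # The filetype: operator limits results to links with that extension.
    full_query = f"{query} filetype:{filetype}"
    # urlencode escapes spaces and the colon so the query is URL-safe.
    return f"https://www.google{domain}/search?" + urlencode({"q": full_query})


print(build_search_url("university code of conduct"))
```

The same helper accepts another Google domain, for example `build_search_url("code of conduct", domain=".co.uk")`.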

This tool also provides a scrappy review tool for going through the downloaded files and removing any that are not up to your standards. For example, some documents may not be relevant to your query, and you want to get rid of them. Going through 300 documents by hand can be a chore, so review_docs handles opening each file and prompting you for whether or not it is relevant. You can also add a note about a document for later review.

The review tool was built for Debian-based Linux distros and is not guaranteed to work on other distros. Two things are likely to break elsewhere: the location of the Trash directory and the command that opens a file in its default app. Those are set to the following:

Trash location: ~/.local/share/Trash/files
Default app opener: xdg-open

and can be changed easily in review.py after cloning.

These could also easily be modified for macOS.
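
As a rough illustration of such a change, the sketch below picks the two settings by platform. The function name `platform_settings` is hypothetical and does not reflect how review.py names or stores these values. The macOS values (`~/.Trash` and the `open` command) are the standard ones on that system, and the Linux values are the ones listed above.

```python
import sys
from pathlib import Path


def platform_settings(platform=sys.platform):
    """Return (trash_dir, opener_command) for the given platform string.

    Hypothetical helper; review.py may store these settings differently.
    """
    if platform == "darwin":
        # macOS: per-user Trash folder and the built-in `open` command.
        return Path.home() / ".Trash", "open"
    # Default: the values this README lists for Debian-based Linux.
    return Path.home() / ".local/share/Trash/files", "xdg-open"


trash_dir, opener = platform_settings()
print(trash_dir, opener)
```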

Getting Started

1. Clone the repo
2. Run `$ python scrape.py --help` (example: `$ python scrape.py 250 data/conduct "university code of conduct"`)
3. Wait for the downloads to finish
4. Run `$ python review.py --help` (continuing with the example: `$ python review.py data/conduct`)

`scrape.py`

usage: scrape.py [-h] [--filetype FILETYPE] [--domain DOMAIN]
                 num_docs save_dir query

Scrape university codes of conduct from google search. Uses BeautifulSoup to
scrape a list of pdf URLs from a google search looking for university codes of
conduct.

positional arguments:
  num_docs              maximum number of docs to download
  save_dir              where to save docs
  query                 query string

optional arguments:
  -h, --help            show this help message and exit
  --filetype FILETYPE, -f FILETYPE
                        file extension to look for. Do not include period.
                        e.g. 'pdf'
  --domain DOMAIN, -d DOMAIN
                        Google domain to use. Defaults to .com
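
For readers who want to see how the arguments above fit together, here is a minimal `argparse` sketch consistent with that help text. It is not the code from scrape.py: the `type=int` conversion, the `pdf` default for `--filetype`, and the shortened description are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the scrape.py help text (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="scrape.py",
        description="Scrape documents from a Google search.",
    )
    # Positional arguments, in the order shown in the usage line.
    parser.add_argument("num_docs", type=int,  # int conversion is an assumption
                        help="maximum number of docs to download")
    parser.add_argument("save_dir", help="where to save docs")
    parser.add_argument("query", help="query string")
    # Optional arguments; the 'pdf' default is an assumption, '.com' is documented.
    parser.add_argument("--filetype", "-f", default="pdf",
                        help="file extension to look for, without the period")
    parser.add_argument("--domain", "-d", default=".com",
                        help="Google domain to use")
    return parser


args = build_parser().parse_args(
    ["250", "data/conduct", "university code of conduct"]
)
print(args.num_docs, args.save_dir, args.query, args.filetype, args.domain)
```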

`review.py`

usage: review.py [-h] [--responses] data_path

Use metadata file to open documents for review, and delete irrelevant docs.

positional arguments:
  data_path        path to data to review

optional arguments:
  -h, --help       show this help message and exit
  --responses, -r  display valid responses and exit

Valid responses to prompt

After each document is opened, you are prompted with `Is this doc relevant? >`. These are the valid responses and what they do:

Hint: run `$ python review.py -r` to see this list in the CLI

| Response | Alternative(s) | Description & Action |
| --- | --- | --- |
| `yes` | `y` | Document is relevant: set `reviewed` key to `True` |
| `no` | `n` | Document is not relevant: remove from metadata and move file to trash |
| `reopen` | `r` | Open the current document again |
| `mistake` | `m` | You replied `y`/`n` when you meant the opposite: you are prompted for which of the last 5 docs was a mistake. That document is re-added to the metadata as unreviewed, and if the file is in the trash it is restored. |
| `note`, `comment` | `c` | Add a note to the current document |
| `quit` | `q` | Save progress and quit |
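
The responses above can be sketched as a lookup from each accepted spelling to one canonical name. This is an illustration under stated assumptions, not the code in review.py: the names `ALIASES`, `normalize_response`, and `apply_response` are hypothetical, the metadata is assumed to be a dict keyed by filename, and ignoring case and surrounding whitespace is also an assumption.

```python
# Map every accepted spelling to one canonical response name.
ALIASES = {
    "yes": "yes", "y": "yes",
    "no": "no", "n": "no",
    "reopen": "reopen", "r": "reopen",
    "mistake": "mistake", "m": "mistake",
    "note": "note", "comment": "note", "c": "note",
    "quit": "quit", "q": "quit",
}


def normalize_response(raw):
    """Return the canonical response, or None if the input is not valid."""
    # Assumption: case and surrounding whitespace are ignored.
    return ALIASES.get(raw.strip().lower())


def apply_response(response, doc, metadata):
    """Apply a yes/no decision to a metadata dict keyed by filename (assumed layout)."""
    if response == "yes":
        # Relevant: mark the document as reviewed.
        metadata[doc]["reviewed"] = True
    elif response == "no":
        # Not relevant: drop the entry. The real tool also moves the file
        # to the trash, which is omitted in this sketch.
        metadata.pop(doc, None)
    return metadata


metadata = {"a.pdf": {"reviewed": False}, "b.pdf": {"reviewed": False}}
apply_response(normalize_response("y"), "a.pdf", metadata)
apply_response(normalize_response("no"), "b.pdf", metadata)
print(metadata)
```

Keeping the aliases in one table means a new spelling only needs one new entry, and an unrecognized reply returns `None` so the caller can prompt again.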
