
COVID-19 Updates Scraper (International)

This web scraper uses Scrapy with Python to scrape all COVID-19 related updates posted on select government websites (Selenium is used to scrape the links for New Zealand's website).

This scraper uses Scrapy, CLD-2, dateparser, and html2text as dependencies. Python 3 is used to create a virtual environment so that the scraper runs in an isolated environment.

Steps before running the scraper:

  • Create a virtualenv and activate it. (The commands differ slightly between Windows and Linux/Mac.)
  • In order for Selenium to work, you need to download ChromeDriver and place it in the directory containing the shell scripts. Choose the version that matches the web browser you have. Note: you can always opt to use a different browser such as Firefox; just make sure to change the code in new_zealand_links.py accordingly. Also, if you are not on a Windows machine, you need to change this part of new_zealand_links.py to reflect that:
CHROMEDRIVER_PATH = './chromedriver.exe'
  • Run pip install -r requirements.txt from inside the directory containing the requirements.txt file, while the virtualenv is active, to install all the dependencies (a setup sketch is shown below).
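For reference, a minimal setup sketch, assuming Python's standard venv module and a virtual environment directory named venv (the directory name is just an example):

  # Linux/Mac: create and activate a virtualenv, then install the dependencies
  python3 -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt

  # Windows (PowerShell): only the activation step differs
  # python -m venv venv
  # .\venv\Scripts\Activate.ps1
  # pip install -r requirements.txt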

Running the Scraper on Windows

While inside the virtualenv, cd into the directory that contains powershell_script.ps1 and run .\powershell_script.ps1 from a PowerShell terminal, passing one of the allowed arguments. For example, running .\powershell_script.ps1 cdc will fetch COVID-19 related posts from the CDC website. The list of allowed options can be found at the bottom of this document.

Running the Scraper on Mac/Linux

While inside the virtualenv, cd into the directory that contains unix_script.sh and run bash unix_script.sh from a shell terminal, passing one of the allowed arguments. For example, running bash unix_script.sh cdc will fetch COVID-19 related posts from the CDC website. The list of allowed options can be found at the bottom of this document.
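For example, on Mac/Linux the same script can run a single scraper or every scraper (All is one of the allowed arguments listed below):

  bash unix_script.sh cdc    # scrape COVID-19 posts from the CDC website only
  bash unix_script.sh All    # run all of the scrapers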

Accessing the data

The scraped posts are saved in the posts directory in the format {title, source, published, url, scraped, classes, country, municipality, language, text} for each post. The links to each update are saved in the links directory.

List of allowed shell arguments:

  • All (Run all scrapers)

v0-dataset:

v1-dataset (v0+others):

Note: Since all passed arguments are converted to lowercase, casing doesn't matter when you pass them in the shell. For example, .\powershell_script.ps1 cDc works the same way as .\powershell_script.ps1 CDC.

Important Notes:

  • Since new posts are appended rather than overwritten, the contents of the posts directory (or the whole directory) must be deleted before each run (except the first run, since the posts directory does not exist yet). If this step is not taken, posts WILL HAVE incorrect data (see the sketch after these notes).
  • DO NOT delete the files in the links directory, even though it is safe to delete the contents of the files themselves.
  • Since the log level has been set to INFO, only informational messages will be displayed during runs. If an error is encountered and the link being scraped has downloads or .pdf somewhere in it, the error message can be ignored. There may also occasionally be 404 responses and dateparser errors, which should be ignored on a case-by-case basis.
  • While in the virtualenv, run deactivate to stop and exit the virtual environment.
  • Source code for the scrapers can be found in the spiders directory.
  • new_zealand_links.py is located in a separate directory called new_zealand_links in the root directory because that scraper uses Selenium. The reason for not putting the file with all the other scrapers in the spiders directory is that Scrapy pre-compiles (checks) the Python scripts in that directory every time scrapy is invoked. If new_zealand_links.py were placed inside the spiders directory, it would be run every time scrapy is called from the shell. For example, if you run scrapy crawl cdc_links, new_zealand_links.py would still run before the cdc_links scraper. This is especially problematic if you use the script to run all scrapers. This layout is also reflected in the shell scripts.
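As a sketch of the cleanup step from the first note above, assuming a Mac/Linux shell and that the posts directory is created next to the scripts (adjust the path to wherever it appears in your checkout):

  rm -r posts                # delete previous output (skip on the very first run, when posts does not exist yet)
  bash unix_script.sh All    # re-run; posts are written fresh instead of being appended to stale data
  # leave the files in the links directory in place (clearing their contents is fine, deleting the files is not)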
