Scraping Best Practices
sam1rm edited this page Aug 8, 2015 · 4 revisions
- Before you program anything, make sure you can guarantee good results (for example, x% of the schools will have this field).
- Example: out of three potential sources (Chegg, RMP, etc.) for scraping courses, you chose Chegg because it has the most consistent results and the greatest number of options.
- Write a script that gets the data for one university.
- Now try it with 10. If there are any bugs, adjust the script as necessary until it runs for all 10.
- Now try it with 100.
- If the run with 100 has 0 bugs, you are free to run the entire script. Make sure you check on the script regularly in case a bug comes up (maybe set a timer).
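The staged process above (check field coverage, then run on 1, 10, and 100 universities) can be sketched roughly as follows. This is a minimal illustration, not the project's actual tooling: `field_coverage`, `run_in_stages`, and the `scrape_one` callable are hypothetical names, and each scraped record is assumed to be a plain dict.

```python
def field_coverage(records, field):
    """Return the fraction of records that have a non-empty value for `field`."""
    if not records:
        return 0.0
    hits = sum(1 for r in records if r.get(field) not in (None, "", [], {}))
    return hits / len(records)


def run_in_stages(universities, scrape_one, stages=(1, 10, 100)):
    """Run `scrape_one` on growing samples and stop at the first stage with bugs.

    `scrape_one` is a hypothetical function that scrapes a single university.
    Returns a list of (stage_size, failures) tuples, where `failures` is a
    list of (university, error_repr) pairs for that stage.
    """
    report = []
    for size in stages:
        failures = []
        for uni in universities[:size]:
            try:
                scrape_one(uni)
            except Exception as exc:  # record every failure for this stage
                failures.append((uni, repr(exc)))
        report.append((size, failures))
        if failures:
            # Fix the script before moving on to a larger sample.
            break
    return report
```

If the last entry of the report has an empty failure list for the 100-university stage, the full run is reasonable to start; otherwise fix the listed failures and repeat the stage.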
- Only start this after the above process is complete. You should not be running any scripts
- MUST use Tor_Client when retrieving the data
- If you are using a for-loop that retrieves data for each university, make sure you save the progress after each iteration (to prevent starting over if there is an error).
- If you have to re-run the script, it should resume from where it left off.
- If there is a `ConnectionError` or any other error, pause the script and restart it.
- When your script is complete, save the file as 'final-{{original_file_name}}.json'.
- Save the final .json in two formats: one compressed, and one formatted with JSON indentation.
- Drag the two .json files into the [Uguru Drive](https://drive.google.com/drive/folders/0By5VIgFdqFHdfm85QV9lQm5pbHVUdzRsaWtjME0wcm5FUEJkeTF2V1hyU1BtLXM4SXF2LTQ)
- Write a script that will parse the .json file and upload its data to the Uguru Admin API.
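The save-after-each-iteration, resume, and final-save rules above could look something like the sketch below. Several things here are assumptions rather than project code: `fetch` stands in for whatever function retrieves one university's data (in practice it would go through Tor_Client, whose interface is not shown on this page), university identifiers are assumed to be strings, "compressed" is read as compact JSON with no extra whitespace (gzip would be another reading), and the `-indented` suffix for the second file is a made-up naming choice.

```python
import json
import os


def load_progress(path):
    """Load previously saved results, or start empty if there are none."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}


def save_progress(path, results):
    """Write results to a temp file, then swap it in, so a crash mid-write
    does not corrupt the existing progress file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(results, f)
    os.replace(tmp, path)


def scrape_all(universities, fetch, progress_path):
    """Scrape each university once, saving progress after every iteration.

    On any error the progress is saved and the error is re-raised, so the
    script stops and a later re-run resumes where it left off.
    """
    results = load_progress(progress_path)
    for uni in universities:
        if uni in results:  # already scraped in an earlier run
            continue
        try:
            results[uni] = fetch(uni)  # hypothetical fetch function
        except Exception:
            save_progress(progress_path, results)
            raise
        save_progress(progress_path, results)
    return results


def save_final(results, original_file_name, out_dir="."):
    """Save the final data twice: compact and indented. Returns both paths."""
    base = os.path.join(out_dir, "final-" + original_file_name)
    compact_path = base + ".json"
    pretty_path = base + "-indented.json"  # assumed name for the second copy
    with open(compact_path, "w") as f:
        json.dump(results, f, separators=(",", ":"))
    with open(pretty_path, "w") as f:
        json.dump(results, f, indent=4)
    return compact_path, pretty_path
```

Re-running `scrape_all` with the same `progress_path` after a `ConnectionError` skips every university already stored in the progress file, so only the remaining ones are fetched.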