Scraping Best Practices

sam1rm edited this page Aug 8, 2015 · 4 revisions

Follow this order for the best results

  1. Before you program anything, make sure that you can guarantee great results (x% of the schools will have this field).
    • Example: out of three potential sources (Chegg, RMP, etc.) to scrape courses, you choose Chegg because it has the most consistent results and the greatest number of options.
  2. Write a script that gets the data for one university
  3. Now try it with 10. If there are any bugs, adjust the script as necessary to make sure it runs on all 10.
  4. Now try it with 100
  5. If #4 has 0 bugs, you are free to run the entire script. Make sure you check on the script regularly in case a bug comes up (maybe set a timer).
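The 1 → 10 → 100 scaling process above can be sketched as a loop that only grows the batch once the previous size runs clean. `scrape_university` is a hypothetical placeholder, not the project's real scraper:

```python
def scrape_university(name):
    # Placeholder: fetch and parse the data for a single school.
    # Replace with your real per-university scraping logic.
    return {"name": name, "courses": []}

def run_batch(universities):
    """Run the scraper on a slice of schools and collect any failures."""
    results, errors = [], []
    for name in universities:
        try:
            results.append(scrape_university(name))
        except Exception as exc:
            errors.append((name, exc))
    return results, errors

all_universities = ["School %d" % i for i in range(1000)]

# Steps 2-4: grow the batch size only once the previous size has 0 bugs.
for size in (1, 10, 100):
    results, errors = run_batch(all_universities[:size])
    if errors:
        raise SystemExit("Fix the script before scaling past %d schools" % size)

# Step 5: no bugs at 100, so the full run is allowed.
results, errors = run_batch(all_universities)
```

The point of the staged sizes is that most scraping bugs (missing fields, odd page layouts) only surface once you hit enough schools, so each clean stage buys confidence before you commit to the long full run.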

Each Script Must Be Run In This Manner

  1. Only start this after the above process is complete; until then, you should not be running any scripts.
  2. You MUST use Tor_Client when retrieving the data.
  3. If you are using a for-loop that retrieves data for each university, make sure you save the progress after each iteration (to prevent starting over if there is an error).
  4. If you have to re-run the script, it should resume from where it left off.
  5. If there is a ConnectionError (or any other error), pause the script and restart it.
  6. When your script is complete, save the file as 'final-{{original_file_name}}.json'.
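Steps 3-5 above (checkpoint each iteration, resume on re-run, pause on errors) can be sketched as follows. `tor_get` here is a stand-in for the project's Tor_Client wrapper, whose real API is not documented on this page:

```python
import json
import os
import time

PROGRESS_FILE = "progress.json"

def tor_get(url):
    # Assumption: stands in for a Tor_Client request (step 2).
    return {"url": url}

def load_progress():
    """Step 4: on a re-run, pick up the saved state instead of starting over."""
    if os.path.exists(PROGRESS_FILE):
        with open(PROGRESS_FILE) as f:
            return json.load(f)
    return {"done": [], "data": {}}

def save_progress(progress):
    with open(PROGRESS_FILE, "w") as f:
        json.dump(progress, f)

def run(universities):
    progress = load_progress()
    for name in universities:
        if name in progress["done"]:
            continue  # already scraped on a previous run
        try:
            progress["data"][name] = tor_get("https://example.com/" + name)
        except ConnectionError:
            time.sleep(60)            # step 5: pause...
            return run(universities)  # ...then restart, resuming via the checkpoint
        progress["done"].append(name)
        save_progress(progress)       # step 3: save after each iteration
    return progress
```

Saving after every iteration is deliberately wasteful on I/O but means the most you can ever lose to a crash is one school's worth of work.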

Submission (When you are done)

  1. Save the final .json in two formats: one compressed, and one formatted with JSON indentation.
  2. Drag the two .json files into the [Uguru Drive](https://drive.google.com/drive/folders/0By5VIgFdqFHdfm85QV9lQm5pbHVUdzRsaWtjME0wcm5FUEJkeTF2V1hyU1BtLXM4SXF2LTQ)
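A minimal sketch of step 1, reading "compressed" as compact JSON with no extra whitespace (gzipping the file would be another valid reading). The filenames reuse the 'final-' convention from the previous section:

```python
import json

def save_final(data, original_file_name):
    """Write the same results twice: compact and indented."""
    base = "final-" + original_file_name
    # Compact form: no whitespace between tokens.
    with open(base + ".min.json", "w") as f:
        json.dump(data, f, separators=(",", ":"))
    # Human-readable form: two-space JSON indentation.
    with open(base + ".json", "w") as f:
        json.dump(data, f, indent=2)

save_final({"schools": []}, "courses")
```

Both files contain identical data, so anyone reviewing the submission can diff the indented copy while the compact copy keeps the upload small.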

Post-Submission (When you are done with submission)

  1. Write a script that parses the .json file and sends it to the Uguru Admin API.
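A hedged sketch of that post-submission script: parse the final .json and push each record upstream. The endpoint URL and payload shape are assumptions, since the Uguru Admin API is not documented on this page; swap in the real details:

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the real Uguru Admin API URL.
ADMIN_API_URL = "https://example.com/admin/api/courses"

def post_record(record):
    """POST one JSON record to the (assumed) admin endpoint."""
    req = urllib.request.Request(
        ADMIN_API_URL,
        data=json.dumps(record).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

def push_final_json(path, send=post_record):
    """Parse the final .json file and send every record to the API."""
    with open(path) as f:
        records = json.load(f)
    for record in records:
        send(record)
    return len(records)
```

Taking `send` as a parameter keeps the parsing logic testable without network access: pass a list's `append` in tests, and the real `post_record` in production.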