
shanshanzhu/Data-Scrappers


Data-Scrapper

This repo contains Node.js and Python code used to scrape data from various data sources.

### USstock

The source stock files (.csv) can be downloaded from this link: http://cs.brown.edu/~pavlo/stocks/history.tar.gz

The code cleans these files so that they are ready to be loaded into a Postgres database.

###### stockCleanerMultipleFiles.py

Cleans multiple files. Please set your file location accordingly.

###### stockcleanerOneFile.py

Cleans the data in a single file.

### sfGov

The data source is: https://data.sfgov.org/

###### removeLastCol_cleanTimestamp.py

This Python script removes the last column from the CSV file Map__Crime_Incidents_-_Previous_Three_Months.csv, downloaded from https://data.sfgov.org/api/views/gxxq-x39z/rows.csv?accessType=DOWNLOAD. It also formats the timestamp column to match the timestamp format used in the stock data.
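The two steps described above (drop the last column, rewrite the timestamp) can be sketched roughly as follows. This is a minimal illustration, not the original script: the function names, the timestamp column index, and both timestamp formats are assumptions, since the README does not state them.

```python
import csv
from datetime import datetime

# Assumed formats; the README does not say which formats the real files use.
INPUT_FORMAT = "%m/%d/%Y %I:%M:%S %p"
OUTPUT_FORMAT = "%Y-%m-%d %H:%M:%S"


def clean_rows(rows, timestamp_index, in_fmt=INPUT_FORMAT, out_fmt=OUTPUT_FORMAT):
    """Yield each row without its last column and with a reformatted timestamp.

    The first row is treated as a header and is only trimmed.
    The timestamp column must not be the last column, because that one is dropped.
    """
    rows = iter(rows)
    header = next(rows, None)
    if header is None:
        return
    yield header[:-1]
    for row in rows:
        if not row:  # skip blank lines
            continue
        trimmed = row[:-1]
        parsed = datetime.strptime(trimmed[timestamp_index], in_fmt)
        trimmed[timestamp_index] = parsed.strftime(out_fmt)
        yield trimmed


def clean_file(src_path, dst_path, timestamp_index):
    """Read src_path, clean every row, and write the result to dst_path."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        csv.writer(dst).writerows(clean_rows(csv.reader(src), timestamp_index))
```

A call such as `clean_file("in.csv", "out.csv", timestamp_index=1)` would write the cleaned copy; the file names and the index here are placeholders.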

###### urlScrapper.js

This script automatically downloads all 150 CSV files from sfgov. Please read this blog post for a detailed explanation: http://shanshanzhu.com/2013/12/08/datsy3-how-do-i-scrape-data-from-data-sfgov-org/
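For readers who prefer Python, a bulk download along these lines might look like the sketch below. It is not a port of urlScrapper.js: it assumes every dataset follows the same URL pattern as the crime-incidents link above, and `csv_url`, `download`, and `download_all` are hypothetical names. How the list of dataset IDs is collected is left out.

```python
import shutil
import urllib.request
from pathlib import Path

# URL pattern taken from the crime-incidents link in this README;
# assuming other datasets use the same pattern.
BASE = "https://data.sfgov.org/api/views/{view_id}/rows.csv?accessType=DOWNLOAD"


def csv_url(view_id):
    """Build the CSV download URL for one dataset ID."""
    return BASE.format(view_id=view_id)


def download(url, dest_path):
    """Stream one URL to dest_path, creating parent folders as needed."""
    dest = Path(dest_path)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def download_all(view_ids, out_dir):
    """Download every dataset ID in view_ids to out_dir/<id>.csv."""
    return [download(csv_url(v), Path(out_dir) / f"{v}.csv") for v in view_ids]
```

For example, `download_all(["gxxq-x39z"], "sfgov_csv")` would fetch the crime-incidents file into a folder named `sfgov_csv` (the folder name is a placeholder).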

### helper

This folder contains several helper functions that can be used to transfer multiple CSV files to a Postgres database set up on a Microsoft Azure virtual machine.

###### cloudstorage.js

Setup file for using Azure Blob storage.

###### csvtopostgres.js

Imports one CSV file into a PSQL table.

###### csvtopostgresMultipleCSVMultiTables.js

Imports multiple CSV files into multiple PSQL tables. It successfully imported >8000 stock CSV files into >8000 PSQL tables.
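One way to think about the one-table-per-file import is to derive a table name from each file name and emit a psql `\copy` command for it. The Python sketch below only builds the command strings; it is an illustration, not what the JavaScript helper does. The `stock_` prefix, the function names, and the assumption that files have a header row are all hypothetical, and each table is assumed to exist already.

```python
import re
from pathlib import Path


def table_name_for(csv_path, prefix="stock_"):
    """Turn a file name such as 'BRK-A.csv' into a safe table name."""
    stem = Path(csv_path).stem.lower()
    safe = re.sub(r"[^a-z0-9_]", "_", stem)
    return prefix + safe


def copy_statement(csv_path, prefix="stock_", header=True):
    """Build a psql \\copy command that loads one CSV file into its table.

    Assumes the path contains no single quotes.
    """
    table = table_name_for(csv_path, prefix)
    header_flag = "true" if header else "false"
    return f"\\copy {table} FROM '{csv_path}' WITH (FORMAT csv, HEADER {header_flag})"


def copy_statements(csv_paths, prefix="stock_", header=True):
    """Build one \\copy command per CSV file."""
    return [copy_statement(p, prefix, header) for p in csv_paths]
```

The resulting lines could be saved to a script and run with `psql -f`. Because `\copy` reads the file on the client side, this approach suits a database that lives on a remote virtual machine.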

###### dataDownloader.js

A helper function used in urlScrapper.js to download a single CSV file from one URL.

###### phantomJSToGetPageImages.js

PhantomJS helper function to take a screenshot of a web page.

###### psgrDataTypes.js

A helper function that automatically determines the PSQL data type from input data.
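The idea of inferring a column type from its values can be sketched in Python as below. This is not the logic of psgrDataTypes.js, which the README does not describe; the candidate types, their order, the timestamp format, and the name `infer_pg_type` are assumptions chosen for illustration.

```python
from datetime import datetime


def infer_pg_type(values, timestamp_format="%Y-%m-%d %H:%M:%S"):
    """Guess a Postgres column type from a list of string values.

    Tries the narrowest type first and falls back to TEXT.
    Empty strings and None are ignored (treated as NULL).
    """
    non_empty = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not non_empty:
        return "TEXT"

    def all_parse(parser):
        # True only if every value is accepted by the parser.
        for value in non_empty:
            try:
                parser(value)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "BIGINT"
    if all_parse(float):
        return "DOUBLE PRECISION"
    if all_parse(lambda v: datetime.strptime(v, timestamp_format)):
        return "TIMESTAMP"
    return "TEXT"
```

Note that Python's `int` and `float` accept a few spellings a stricter checker might reject, such as `1_000` or `nan`, so a production version would probably validate with explicit patterns instead.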

### factualGeopulse

http://www.factual.com/products/geopulse-context

Node.js code to download Geopulse data from Factual.

###### factualNode_centerOfUS_SouthWest.js

Parameters start at the center of the US and move southwest, with a 0.05 gap between steps of longitude or latitude.

###### factualNode_centerOfUStoNW.js

Parameters start at the center of the US and move northwest, with a 0.05 gap between steps of longitude or latitude.

###### factualNodeCal_SF.js

Parameters cover San Francisco, with a 0.05 gap between steps of longitude or latitude.
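The stepping scheme can be pictured as a grid of latitude/longitude pairs spaced 0.05 apart. The Python sketch below shows one reading of it (a rectangular grid walked outward from a starting point); the original scripts may traverse the area differently. The center coordinates, the step count, and the name `grid_points` are assumptions, and the Factual API call itself is omitted.

```python
# Approximate geographic center of the contiguous US; an assumption,
# the scripts may use a different starting point.
US_CENTER = (39.8283, -98.5795)


def grid_points(start_lat, start_lon, lat_dir, lon_dir, steps, gap=0.05):
    """Yield (lat, lon) pairs on a steps x steps grid.

    lat_dir and lon_dir are +1 or -1: south is lat_dir=-1, west is lon_dir=-1.
    Values are rounded to 6 decimals to hide floating-point noise.
    """
    for i in range(steps):
        for j in range(steps):
            lat = round(start_lat + lat_dir * i * gap, 6)
            lon = round(start_lon + lon_dir * j * gap, 6)
            yield (lat, lon)


# Going southwest from the assumed center: latitude and longitude both decrease.
south_west = grid_points(*US_CENTER, lat_dir=-1, lon_dir=-1, steps=3)
```

Going northwest would use `lat_dir=1, lon_dir=-1`; covering a city such as San Francisco would only need a different starting point and a small step count.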
