ncdc_storm_events is a project that downloads the NCDC Storm Events database from the National Oceanic and Atmospheric Administration.
The NCDC Storm Events Database is a collection of observations for significant weather events. The "database" contains information on property damage, loss of life, intensity of systems and more.
The dataset is updated somewhat regularly, but is not real-time. As of this writing, the dataset covers the period from January 1950 to August 2018.
There are three tables within the database: details, locations, and fatalities.
The details dataset is the heaviest raw dataset, weighing over 1.1G when combined and saved as CSV. locations is much smaller, at only 75M (CSV), with fatalities a featherweight at 1.2K.
details also happens to be the dirtiest. Of its 51 columns, at least a dozen are unnecessary, and most of those relate to date or time observations (there are 13 date/time variables in total, all redundant).
fatalities has the same issue with dates and times as details, but not nearly on the same scale.
locations, though treated as its own dataset, is very comparable to the location data within details, and much of this information is redundant.
The entire database, in my opinion, is a great project for exploring, tidying and cleaning somewhat large datasets. The challenge lies in confirming that data which seems redundant really should be removed before dropping it.
The datasets are broken down by table, then further broken down by year of the observations. They are stored in csv.gz format on the NCDC NOAA FTP server.
Each file name encodes three values: TABLE, which is one of details, locations, or fatalities; YEAR, the year of the observations; and LAST_MODIFIED, the datetime the archive file was last modified.
Files are downloaded with ./code/01_get_data.R. All csv.gz datasets are bound together into one dataframe for each of the three tables. These raw data files are saved in the data directory.
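The binding step can be sketched as follows. This is a Python illustration, not the repository's R script: the function name `bind_table`, the glob pattern, and the UTF-8 encoding are assumptions, and the only thing taken from the description above is that the table name is part of each archive's file name.

```python
import csv
import gzip
from pathlib import Path


def bind_table(data_dir, table):
    """Read every csv.gz archive for one table and bind the rows together.

    Hypothetical helper: assumes the table name appears somewhere in each
    archive's file name and that the files are UTF-8 encoded.
    """
    rows = []
    # Sorting keeps the yearly archives in a stable, repeatable order.
    for path in sorted(Path(data_dir).glob(f"*{table}*.csv.gz")):
        # Text mode lets csv read the archive without unpacking it to disk.
        with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
            rows.extend(csv.DictReader(handle))
    return rows
```

Calling `bind_table("data", "details")` would return one list of row dictionaries covering every matching year, which can then be written out as a single CSV.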
All three datasets can be tidied to some extent using ./code/02_tidy_data.R. The details dataset is the worst, with over a dozen date and/or time variables. The redundant ones are dropped, and the date/time variables that are kept, END_DATE_TIME among them, are reformatted to YYYY-MM-DD HH:MM:SS format as a character string. Timezone information is not saved; though it is included in the raw dataset, it is almost entirely incorrect.
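A minimal sketch of that reformatting, in Python rather than the project's R. The raw timestamp layout is not described here, so the caller supplies it as `raw_format`; the function name `tidy_timestamp` and the optional `year` override (useful if the raw value only carries a two-digit year) are hypothetical.

```python
from datetime import datetime

# Target layout named in the text: YYYY-MM-DD HH:MM:SS, no timezone.
TIDY_FORMAT = "%Y-%m-%d %H:%M:%S"


def tidy_timestamp(raw, raw_format, year=None):
    """Return a raw timestamp as a 'YYYY-MM-DD HH:MM:SS' character string.

    raw_format is a strptime pattern describing the raw value (an assumption
    supplied by the caller). If year is given, it replaces the parsed year.
    """
    parsed = datetime.strptime(raw, raw_format)  # naive: no timezone attached
    if year is not None:
        parsed = parsed.replace(year=year)
    return parsed.strftime(TIDY_FORMAT)
```

The result is a plain string, so no timezone is stored with it, which matches the decision to discard the unreliable timezone field.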
Additionally, date and time variables in the other tables are modified to remove redundancy or, in the case of locations, whose date/time values match those in details, removed completely.
The damage variables in details have also been modified. The raw data uses alphanumeric shorthand; for example, "2k" or "2K" for $2,000 and "10B" or "10b" for $10,000,000,000. These have been cleaned to integer values.
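The damage cleanup might look something like the sketch below (Python, not the repository's R code). The helper name `parse_damage` is hypothetical, the "M" suffix is an assumption since only "K" and "B" are mentioned above, and returning `None` for blank or unparseable values is a design choice made for this example.

```python
from decimal import Decimal, InvalidOperation

# "K" and "B" come from the examples in the text; "M" is an assumption.
MULTIPLIERS = {"K": 10**3, "M": 10**6, "B": 10**9}


def parse_damage(raw):
    """Convert shorthand such as "2k" or "10B" to an integer dollar amount."""
    text = (raw or "").strip().upper()
    if not text:
        return None  # a missing value stays missing
    multiplier = 1
    if text[-1] in MULTIPLIERS:
        multiplier = MULTIPLIERS[text[-1]]
        text = text[:-1]
    try:
        # Decimal avoids float rounding on values such as "2.5K".
        return int(Decimal(text) * multiplier)
    except (InvalidOperation, ValueError, OverflowError):
        return None  # e.g. a bare suffix or non-numeric text
```

For instance, `parse_damage("2k")` gives 2000 and `parse_damage("10B")` gives 10000000000, regardless of the suffix's case.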
Lastly, the narrative variables (including EVENT_NARRATIVE) are split out into their own dataset to avoid redundancy and reduce the size of the other datasets.
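One way to picture the split is the Python sketch below. It is illustrative only: the function name `split_narratives` is hypothetical, and using an `EVENT_ID` column as the key that links the two datasets is an assumption, not something stated above.

```python
def split_narratives(rows, key="EVENT_ID", narrative_columns=("EVENT_NARRATIVE",)):
    """Split narrative text out of each row into a separate keyed dataset.

    rows is a list of dictionaries (one per observation). The key column is
    kept in both outputs so the narratives can be joined back later.
    """
    slim, narratives = [], []
    for row in rows:
        # Everything except the narrative text stays in the main dataset.
        slim.append({k: v for k, v in row.items() if k not in narrative_columns})
        narrative = {key: row[key]}
        narrative.update({col: row.get(col, "") for col in narrative_columns})
        narratives.append(narrative)
    return slim, narratives
```

Because long free-text fields dominate file size, moving them to their own dataset leaves the main table much lighter while the shared key keeps the text recoverable.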
Whereas the raw data is well over 1.2G, the entire dataset, after tidying, sits at 539M.
All tidied datasets are located in the output directory.
Removing data file history from git
When updating data, this repo will use BFG Repo-Cleaner to remove the history of data files from the repository. When this is done, the repository will need to be removed from production environments in favor of a fresh clone.
- R 3.5.2 - The R Project for Statistical Computing
Please read Contributing for details on code of conduct.