# `HURDAT2` Data Munge

**NOTE**: This notebook is a mirror copy of notebook `01`, with one important difference: it runs through Atlantic data instead of Pacific data. This means that variables named `pacific` there are renamed `atlantic` here, and the origin and destination URLs are different. But otherwise everything is exactly the same.

Duplicating the notebook isn't the most elegant approach, but it was the fastest one within the time allotted.

## Introduction

This notebook acquires, cleans up, and saves a copy of the United States National Oceanic and Atmospheric Administration's (NOAA) HURDAT2 dataset.

HURDAT2 is the NOAA's current data export of historical hurricane tracking data. It's split into two files, one for the Atlantic Ocean and one for the Pacific. These two files have different start dates (1851 and 1949 respectively).

## Original Text

From its [description](http://www.nhc.noaa.gov/data/#hurdat) on the NOAA's data web page:

---

<p class="hdr">Best Track Data (HURDAT2)

</p>
<p class="reg"><span style="font-weight:bold;">Atlantic hurricane database (HURDAT2) 1851-2015</span> (<a href="/data/hurdat/hurdat2-1851-2015-070616.txt">5.9MB download</a>)
<br>
This dataset was provided on 6 July 2016 to include the 1956 to 1960 revisions to the best tracks.
</p>
<p class="reg">
This dataset (<a href="/data/hurdat/hurdat2-format-atlantic.pdf">known as Atlantic HURDAT2</a>) has
a comma-delimited, text format with six-hourly information on the location,
maximum winds, central pressure, and (beginning in 2004) size of all known tropical cyclones and subtropical cyclones.
The original HURDAT database has been retired.</p>
<p class="reg">
Detailed information regarding the <a href="http://www.aoml.noaa.gov/hrd/data_sub/re_anal.html">
Atlantic Hurricane Database Re-analysis Project</a> is available from the
<a href="http://www.aoml.noaa.gov/hrd/">Hurricane Research Division</a>.
</p>
<p class="reg"><span style="font-weight:bold;">Northeast and North Central Pacific hurricane database (HURDAT2)
1949-2015</span> &nbsp; (<a href="/data/hurdat/hurdat2-nepac-1949-2015-050916.txt">3.2MB download</a>)
<br>
This dataset was provided on 9 May 2016 to include the remaining 2014 best tracks for Genevieve, Iselle, and Julio in the Central Pacific Hurricane Center (CPHC) area
of responsibility.  Note that the 2015 best tracks from CPHC are not yet available and are not currently included.  Once CPHC
completes their post-storm analyses, this dataset will be updated.
</p>
<p class="reg">
This dataset (<a href="/data/hurdat/hurdat2-format-nencpac.pdf">known as NE/NC Pacific HURDAT2</a>)
has a comma-delimited, text format with six-hourly information on the
location, maximum winds, central pressure, and (beginning in 2004)
size of all known tropical cyclones and subtropical cyclones. The
original HURDAT database has been retired.
</p>
<p class="reg">
<b>UPDATE as of 4/16/2025: This code has been edited slightly to accommodate the most recent data through 2023, which includes a newer Radius of Maximum Winds column. This was done before the 2024 hurricane season data was added, but there are no new columns, so it's probably fine.</b>
</p>

---

## Data Dictionary

The dataset's [data dictionary](http://www.nhc.noaa.gov/data/hurdat/hurdat2-format-atlantic.pdf) shows that the files follow a modified CSV format, with individual hurricanes and storms getting their own subheadings:

In [None]:
from IPython.display import IFrame
IFrame("http://www.nhc.noaa.gov/data/hurdat/hurdat2-format-atlantic.pdf", width=900, height=600)

## Initial Read

Because of the non-standard format, a naive `pandas.read_csv` won't produce usable data. It will be confused by the storm subheadings, such as the first row in the following block:

```
EP202015,           PATRICIA,     19,
20151020, 0600,  , TD, 13.4N,  94.0W,  25, 1007,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
20151020, 1200,  , TD, 13.3N,  94.2W,  30, 1006,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
20151020, 1800,  , TD, 13.2N,  94.6W,  30, 1006,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
...
```

Curiously, the file doesn't seem to quite follow the format specified in the data dictionary, either, as it doesn't have any of the homogenized data lines mentioned in the data dictionary. So for instance, the following (used as an example in the data dictionary) never actually shows up:

```
AL092011, IRENE, 39,
1234567890123456789012345768901234567
```

It looks just like the example line above instead.

Since each subheader line starts with a basin code such as `AL` or `EP`, whilst each data line starts with a date whose year begins with `1` or `2`, we could remove the subheadings by telling `pandas.read_csv` to ignore lines starting with `A` or `E` (for example via `comment="E"`; the parameter accepts a single character, so each file needs its own). But then we lose the position of those lines!

It's easiest to just build our own parser.
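
To see why the `comment` approach falls short, here is a minimal sketch run against the PATRICIA excerpt shown above, embedded as a string rather than downloaded. The names `sample` and `df` are illustrative only and are not used elsewhere in this notebook.

```python
import io

import pandas as pd

# Excerpt copied from the block above: one subheading plus three data lines.
sample = (
    "EP202015,           PATRICIA,     19,\n"
    "20151020, 0600,  , TD, 13.4N,  94.0W,  25, 1007,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,\n"
    "20151020, 1200,  , TD, 13.3N,  94.2W,  30, 1006,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,\n"
    "20151020, 1800,  , TD, 13.2N,  94.6W,  30, 1006,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,\n"
)

# Lines that begin with the comment character are skipped entirely,
# so the subheading never reaches the resulting frame.
df = pd.read_csv(io.StringIO(sample), comment="E", header=None)

print(df.shape)
```

The data rows parse, but nothing in the frame records which storm they belong to, which is the information the custom parser below preserves.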

In [None]:
import requests
atlantic_raw = requests.get("http://www.nhc.noaa.gov/data/hurdat/hurdat2-1851-2015-070616.txt")
atlantic_raw.raise_for_status()  # check that we actually got something back

Double-checking the line prefixes (the first two characters of every line):

In [None]:
import io
from collections import Counter

c = Counter()
for line in io.StringIO(atlantic_raw.text).readlines():
    c[line[:2]] += 1

In [None]:
c

Counter({'18': 9228, '19': 31926, '20': 7951, 'AL': 1814})

In [None]:
import io

atlantic_storms_r = []
atlantic_storm_r = {'header': None, 'data': []}

for i, line in enumerate(io.StringIO(atlantic_raw.text).readlines()):
    if line[:2] == 'AL':
        atlantic_storms_r.append(atlantic_storm_r.copy())
        atlantic_storm_r['header'] = line
        atlantic_storm_r['data'] = []
    else:
        atlantic_storm_r['data'].append(line)

atlantic_storms_r = atlantic_storms_r[1:]

In [None]:
len(atlantic_storms_r)

1813
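
Note that the `Counter` above reports 1814 `AL` header lines, while the loop yields 1813 storms. Each storm is appended only when the next header is encountered, so the final storm in the file is never flushed. Below is a minimal sketch of a variant that appends the last storm after the loop ends. The function name `split_storms` and the shortened sample lines are hypothetical and exist only for illustration.

```python
def split_storms(lines):
    """Group data lines under the header line that precedes them."""
    storms = []
    current = None
    for line in lines:
        if line[:2] == "AL":
            # A new header closes the previous storm, if there was one.
            if current is not None:
                storms.append(current)
            current = {"header": line, "data": []}
        elif current is not None:
            current["data"].append(line)
    # Flush the storm still being built when the input runs out.
    if current is not None:
        storms.append(current)
    return storms


# Shortened, illustrative lines (real rows have many more fields).
sample_lines = [
    "AL011851, UNNAMED, 2,\n",
    "18510625, 0000,  , HU, 28.0N,  94.8W,  80,\n",
    "18510625, 0600,  , HU, 28.0N,  95.4W,  80,\n",
    "AL021851, UNNAMED, 1,\n",
    "18510705, 1200,  , HU, 22.2N,  97.6W,  80,\n",
]
storms = split_storms(sample_lines)
print(len(storms))
```

Applied to the real file, a parser like this would return one more storm than the 1813 shown here, so the row counts recorded later in this notebook would change accordingly.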

In [None]:
atlantic_storms_r[0]

{'data': ['18510625, 0000,  , HU, 28.0N,  94.8W,  80, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999,\n',
  '18510625, 0600,  , HU, 28.0N,  95.4W,  80, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999,\n',
  '18510625, 1200,  , HU, 28.0N,  96.0W,  80, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999,\n',
  '18510625, 1800,  , HU, 28.1N,  96.5W,  80, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999,\n',
  '18510625, 2100, L, HU, 28.2N,  96.8W,  80, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999,\n',
  '18510626, 0000,  , HU, 28.2N,  97.0W,  70, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999,\n',
  '18510626, 0600,  , TS, 28.3N,  97.6W,  60, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999,\n',
  '18510626, 1200,  , TS, 28.4N,  98.3W,  60, -999, -999, -999, -999, -999, -999, -999, -999, -9

In [None]:
atlantic_storms_r[0]['data']

['18510625, 0000,  , HU, 28.0N,  94.8W,  80, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999,\n',
 '18510625, 0600,  , HU, 28.0N,  95.4W,  80, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999,\n',
 '18510625, 1200,  , HU, 28.0N,  96.0W,  80, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999,\n',
 '18510625, 1800,  , HU, 28.1N,  96.5W,  80, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999,\n',
 '18510625, 2100, L, HU, 28.2N,  96.8W,  80, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999,\n',
 '18510626, 0000,  , HU, 28.2N,  97.0W,  70, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999,\n',
 '18510626, 0600,  , TS, 28.3N,  97.6W,  60, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999,\n',
 '18510626, 1200,  , TS, 28.4N,  98.3W,  60, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, 

In [None]:
import pandas as pd

atlantic_storm_dfs = []
for storm_dict in atlantic_storms_r:
    storm_id, storm_name, storm_entries_n = storm_dict['header'].split(",")[:3]
    # remove hanging newline ('\n'), split fields
    data = [[entry.strip() for entry in datum[:-1].split(",")] for datum in storm_dict['data']]
    frame = pd.DataFrame(data)
    frame['id'] = storm_id
    frame['name'] = storm_name
    atlantic_storm_dfs.append(frame)

In [None]:
len(atlantic_storm_dfs)

1813
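
The third header field (`storm_entries_n` above) is unpacked but never used. In the header layout it is the number of data rows that follow, so it could double as a sanity check on the parser. A small sketch, assuming the header format shown earlier; `parse_header` is a hypothetical helper, not part of the original code.

```python
def parse_header(header_line):
    """Split a HURDAT2 subheading into (storm id, name, expected row count)."""
    storm_id, storm_name, n_rows = header_line.split(",")[:3]
    # int() tolerates the padding spaces around the count.
    return storm_id.strip(), storm_name.strip(), int(n_rows)


storm_id, storm_name, n_rows = parse_header("EP202015,           PATRICIA,     19,\n")
print(storm_id, storm_name, n_rows)
```

Comparing `n_rows` with `len(storm_dict['data'])` inside the loop above would flag any storm whose rows were grouped incorrectly.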

In [None]:
atlantic_storm_dfs[0]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,13,14,15,16,17,18,19,20,id,name
0,18510625,0,,HU,28.0N,94.8W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
1,18510625,600,,HU,28.0N,95.4W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
2,18510625,1200,,HU,28.0N,96.0W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
3,18510625,1800,,HU,28.1N,96.5W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
4,18510625,2100,L,HU,28.2N,96.8W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
5,18510626,0,,HU,28.2N,97.0W,70,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
6,18510626,600,,TS,28.3N,97.6W,60,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
7,18510626,1200,,TS,28.4N,98.3W,60,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
8,18510626,1800,,TS,28.6N,98.9W,50,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
9,18510627,0,,TS,29.0N,99.4W,50,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED


In [None]:
atlantic_storms = pd.concat(atlantic_storm_dfs)

In [None]:
atlantic_storms.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,13,14,15,16,17,18,19,20,id,name
0,18510625,0,,HU,28.0N,94.8W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
1,18510625,600,,HU,28.0N,95.4W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
2,18510625,1200,,HU,28.0N,96.0W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
3,18510625,1800,,HU,28.1N,96.5W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
4,18510625,2100,L,HU,28.2N,96.8W,80,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
5,18510626,0,,HU,28.2N,97.0W,70,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
6,18510626,600,,TS,28.3N,97.6W,60,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
7,18510626,1200,,TS,28.4N,98.3W,60,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
8,18510626,1800,,TS,28.6N,98.9W,50,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED
9,18510627,0,,TS,29.0N,99.4W,50,-999,-999,-999,...,-999,-999,-999,-999,-999,-999,-999,,AL011851,UNNAMED


In [None]:
len(atlantic_storms)

49085

## Setting columns

Now we read the column headers out of the data dictionary and assign them appropriate variable names.

In [None]:
# The following line was part of the original code by Bilogur that broke
# atlantic_storms = atlantic_storms.reindex(columns=atlantic_storms.columns[-2:] | atlantic_storms.columns[:-2])

In [None]:
# A fix for the broken code that accomplishes the same task
cols_to_move = ['id', 'name']
atlantic_storms = atlantic_storms[cols_to_move + [col for col in atlantic_storms.columns if col not in cols_to_move]]

In [None]:
atlantic_storms.head()

Unnamed: 0,id,name,0,1,2,3,4,5,6,7,...,11,12,13,14,15,16,17,18,19,20
0,AL011851,UNNAMED,18510625,0,,HU,28.0N,94.8W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,
1,AL011851,UNNAMED,18510625,600,,HU,28.0N,95.4W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,
2,AL011851,UNNAMED,18510625,1200,,HU,28.0N,96.0W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,
3,AL011851,UNNAMED,18510625,1800,,HU,28.1N,96.5W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,
4,AL011851,UNNAMED,18510625,2100,L,HU,28.2N,96.8W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,


In [None]:
atlantic_storms.iloc[0]

id                 AL011851
name                UNNAMED
0                  18510625
1                      0000
2                          
3                        HU
4                     28.0N
5                     94.8W
6                        80
7                      -999
8                      -999
9                      -999
10                     -999
11                     -999
12                     -999
13                     -999
14                     -999
15                     -999
16                     -999
17                     -999
18                     -999
19                     -999
20                         
Name: 0, dtype: object

In [None]:
atlantic_storms.columns

Index([  'id', 'name',      0,      1,      2,      3,      4,      5,      6,
            7,      8,      9,     10,     11,     12,     13,     14,     15,
           16,     17,     18,     19,     20],
      dtype='object')

In [None]:
# Slight edit to original code: Original had an empty final column marked as "na". With the addition of Radius of Maximum Winds in newer data, this column is now filled.
atlantic_storms.columns = [
        "id",
        "name",
        "date",
        "hours_minutes",
        "record_identifier",
        "status_of_system",
        "latitude",
        "longitude",
        "maximum_sustained_wind_knots",
        "maximum_pressure",
        "34_kt_ne",
        "34_kt_se",
        "34_kt_sw",
        "34_kt_nw",
        "50_kt_ne",
        "50_kt_se",
        "50_kt_sw",
        "50_kt_nw",
        "64_kt_ne",
        "64_kt_se",
        "64_kt_sw",
        "64_kt_nw",
        "rmw"
]
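
Assigning a list to `.columns` fails with a fairly terse error when the number of names and columns differ, which is what happens if a future release of the file gains or loses a column. A sketch of a more explicit check follows; `assign_columns` is a hypothetical helper and the three-column frame is made up for the demonstration.

```python
import pandas as pd


def assign_columns(frame, names):
    """Rename columns positionally, failing loudly on a count mismatch."""
    if len(names) != frame.shape[1]:
        raise ValueError(
            f"expected {len(names)} columns, found {frame.shape[1]}"
        )
    renamed = frame.copy()
    renamed.columns = names
    return renamed


demo = pd.DataFrame([["AL011851", "UNNAMED", "18510625"]])
demo = assign_columns(demo, ["id", "name", "date"])
print(list(demo.columns))
```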

In [None]:
# Part of original code. No longer necessary due to addition of RMW column in newer data.
# del atlantic_storms['na']

In [None]:
# This also appears to be unnecessary
# pd.set_option("max_columns", None)

In [None]:
atlantic_storms.head()

Unnamed: 0,id,name,date,hours_minutes,record_identifier,status_of_system,latitude,longitude,maximum_sustained_wind_knots,maximum_pressure,34_kt_ne,34_kt_se,34_kt_sw,34_kt_nw,50_kt_ne,50_kt_se,50_kt_sw,50_kt_nw,64_kt_ne,64_kt_se,64_kt_sw,64_kt_nw
0,AL011851,UNNAMED,18510625,0,,HU,28.0N,94.8W,80,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999
1,AL011851,UNNAMED,18510625,600,,HU,28.0N,95.4W,80,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999
2,AL011851,UNNAMED,18510625,1200,,HU,28.0N,96.0W,80,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999
3,AL011851,UNNAMED,18510625,1800,,HU,28.1N,96.5W,80,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999
4,AL011851,UNNAMED,18510625,2100,L,HU,28.2N,96.8W,80,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999


## Replacing sentinels

`-999` is used as a sentinel value in older records where the data point is unknown. It'd be better to write those out as blank fields (e.g. `,,`) instead, so let's replace them with `NaN`, which `pandas` saves as an empty field.

In [None]:
atlantic_storms.iloc[0]['34_kt_sw']

'-999'

In [None]:
import numpy as np
atlantic_storms = atlantic_storms.replace(to_replace='-999', value=np.nan)
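
As a quick illustration of the effect on the saved file, the sketch below applies the same replacement to a tiny made-up frame and prints the CSV text. The frame `demo` and its column names are assumptions for the example only.

```python
import numpy as np
import pandas as pd

# Two illustrative string columns containing the sentinel.
demo = pd.DataFrame({
    "maximum_sustained_wind_knots": ["80", "-999"],
    "maximum_pressure": ["-999", "1007"],
})

cleaned = demo.replace(to_replace="-999", value=np.nan)

# NaN is written as an empty field, giving the `,,`-style blanks.
csv_lines = cleaned.to_csv(index=False).splitlines()
for line in csv_lines:
    print(line)
```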

The variables are all string types:

In [None]:
atlantic_storms.dtypes

id                              object
name                            object
date                            object
hours_minutes                   object
record_identifier               object
status_of_system                object
latitude                        object
longitude                       object
maximum_sustained_wind_knots    object
maximum_pressure                object
34_kt_ne                        object
34_kt_se                        object
34_kt_sw                        object
34_kt_nw                        object
50_kt_ne                        object
50_kt_se                        object
50_kt_sw                        object
50_kt_nw                        object
64_kt_ne                        object
64_kt_se                        object
64_kt_sw                        object
64_kt_nw                        object
dtype: object

There are some empty strings present:

In [None]:
atlantic_storms.iloc[0]['record_identifier']

''

In [None]:
atlantic_storms['record_identifier'].value_counts()

     48121
L      903
I       27
P        9
S        7
C        5
T        5
W        4
R        3
G        1
Name: record_identifier, dtype: int64

Which we `nan`-ify:

In [None]:
atlantic_storms = atlantic_storms.replace(to_replace="", value=np.nan)

In [None]:
atlantic_storms['record_identifier'].value_counts(dropna=False)

NaN    48121
L        903
I         27
P          9
S          7
C          5
T          5
W          4
R          3
G          1
Name: record_identifier, dtype: int64

In [None]:
atlantic_storms.head()

Unnamed: 0,id,name,date,hours_minutes,record_identifier,status_of_system,latitude,longitude,maximum_sustained_wind_knots,maximum_pressure,34_kt_ne,34_kt_se,34_kt_sw,34_kt_nw,50_kt_ne,50_kt_se,50_kt_sw,50_kt_nw,64_kt_ne,64_kt_se,64_kt_sw,64_kt_nw
0,AL011851,UNNAMED,18510625,0,,HU,28.0N,94.8W,80,,,,,,,,,,,,,
1,AL011851,UNNAMED,18510625,600,,HU,28.0N,95.4W,80,,,,,,,,,,,,,
2,AL011851,UNNAMED,18510625,1200,,HU,28.0N,96.0W,80,,,,,,,,,,,,,
3,AL011851,UNNAMED,18510625,1800,,HU,28.1N,96.5W,80,,,,,,,,,,,,,
4,AL011851,UNNAMED,18510625,2100,L,HU,28.2N,96.8W,80,,,,,,,,,,,,,


## Datafying columns

Some of the columns could be better formatted.

To start with, the latitude and longitude include hemisphere indicators (`N`/`S` and `E`/`W`), which we don't really want. We can just use negatives to indicate `S` and `W` (the values stay strings for now; we'll upconvert the dtype later).

In [None]:
# Values are still strings here, so a negative sign is prepended as text.
atlantic_storms['latitude'] = atlantic_storms['latitude'].map(lambda lat: lat[:-1] if lat[-1] == "N" else "-" + lat[:-1])
atlantic_storms['longitude'] = atlantic_storms['longitude'].map(lambda lon: lon[:-1] if lon[-1] == "E" else "-" + lon[:-1])

In [None]:
atlantic_storms.head()

Unnamed: 0,id,name,date,hours_minutes,record_identifier,status_of_system,latitude,longitude,maximum_sustained_wind_knots,maximum_pressure,34_kt_ne,34_kt_se,34_kt_sw,34_kt_nw,50_kt_ne,50_kt_se,50_kt_sw,50_kt_nw,64_kt_ne,64_kt_se,64_kt_sw,64_kt_nw
0,AL011851,UNNAMED,18510625,0,,HU,28.0,-94.8,80,,,,,,,,,,,,,
1,AL011851,UNNAMED,18510625,600,,HU,28.0,-95.4,80,,,,,,,,,,,,,
2,AL011851,UNNAMED,18510625,1200,,HU,28.0,-96.0,80,,,,,,,,,,,,,
3,AL011851,UNNAMED,18510625,1800,,HU,28.1,-96.5,80,,,,,,,,,,,,,
4,AL011851,UNNAMED,18510625,2100,L,HU,28.2,-96.8,80,,,,,,,,,,,,,
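
The same conversion can also be written as one helper that handles both axes. This is only a sketch; the name `to_signed` is hypothetical, and it keeps the values as strings to match the cell above.

```python
def to_signed(coordinate):
    """Convert e.g. '94.8W' to '-94.8'; N and E values stay positive."""
    value, hemisphere = coordinate[:-1], coordinate[-1]
    return "-" + value if hemisphere in ("S", "W") else value


examples = ["28.0N", "94.8W", "12.5S", "10.0E"]
converted = [to_signed(c) for c in examples]
print(converted)
```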


Next let's store the date in a more standard format. `pandas` writes `datetime` columns in ISO 8601 form, so converting the column's dtype is all that's needed.

In [None]:
# Dates are YYYYMMDD strings; stating the format avoids relying on inference.
atlantic_storms['date'] = pd.to_datetime(atlantic_storms['date'], format="%Y%m%d")

In [None]:
atlantic_storms['date'] = atlantic_storms\
    .apply(
        lambda srs: srs['date'].replace(hour=int(srs['hours_minutes'][:2]), minute=int(srs['hours_minutes'][2:])),
        axis='columns'
    )

In [None]:
del atlantic_storms['hours_minutes']

In [None]:
atlantic_storms.head()

Unnamed: 0,id,name,date,record_identifier,status_of_system,latitude,longitude,maximum_sustained_wind_knots,maximum_pressure,34_kt_ne,34_kt_se,34_kt_sw,34_kt_nw,50_kt_ne,50_kt_se,50_kt_sw,50_kt_nw,64_kt_ne,64_kt_se,64_kt_sw,64_kt_nw
0,AL011851,UNNAMED,1851-06-25 00:00:00,,HU,28.0,-94.8,80,,,,,,,,,,,,,
1,AL011851,UNNAMED,1851-06-25 06:00:00,,HU,28.0,-95.4,80,,,,,,,,,,,,,
2,AL011851,UNNAMED,1851-06-25 12:00:00,,HU,28.0,-96.0,80,,,,,,,,,,,,,
3,AL011851,UNNAMED,1851-06-25 18:00:00,,HU,28.1,-96.5,80,,,,,,,,,,,,,
4,AL011851,UNNAMED,1851-06-25 21:00:00,L,HU,28.2,-96.8,80,,,,,,,,,,,,,
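
The row-wise `apply` above can be slow on a frame of this size. A vectorized alternative, sketched here on two sample values, concatenates the date and time strings and parses them in a single call. It assumes both columns are still zero-padded strings such as `'18510625'` and `'0600'`, as they are before the conversion; the frame `demo` and the column `timestamp` are illustrative names.

```python
import pandas as pd

demo = pd.DataFrame({
    "date": ["18510625", "18510625"],
    "hours_minutes": ["0000", "2100"],
})

# Join the two string columns, then parse date and time together.
demo["timestamp"] = pd.to_datetime(
    demo["date"] + demo["hours_minutes"], format="%Y%m%d%H%M"
)
print(demo["timestamp"])
```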


## Final fixes

These issues were detected by inspecting the saved output.

Fix an issue with character stripping in the names:

In [None]:
atlantic_storms['name'].iloc[0]

'            UNNAMED'

In [None]:
atlantic_storms['name'] = atlantic_storms['name'].map(lambda n: n.strip())

In [None]:
atlantic_storms['name'].iloc[0]

'UNNAMED'

Reindex, and attach a name to the index:

In [None]:
atlantic_storms = atlantic_storms.reset_index(drop=True)
atlantic_storms.index.name = "index"

The data is now ready to be saved as is.

In [None]:
atlantic_storms.to_csv("../data/atlantic_storms.csv", encoding='utf-8')
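
When loading the saved file later, the index name and the `datetime` dtype have to be requested explicitly. The sketch below round-trips a one-row made-up frame through an in-memory buffer instead of the real file; the frame `demo` is an assumption for the example, and only `index_col` and `parse_dates` are the point.

```python
import io

import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "id": ["AL011851"],
    "date": pd.to_datetime(["1851-06-25 06:00:00"]),
    "maximum_pressure": [np.nan],
})
demo.index.name = "index"

# Write to a buffer the same way the cell above writes to disk.
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)

# Restore the named index and parse the date column back into datetimes.
restored = pd.read_csv(buffer, index_col="index", parse_dates=["date"])
print(restored.dtypes)
```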