## Executive Summary

The end goal of this project was to identify tornadoes in NEXRAD data provided by the National Oceanic and Atmospheric Administration and hosted on Amazon AWS. I have encountered several issues that are beyond my knowledge, both in technical weather knowledge and in the architecture and computing power needed for artificial neural networks.

Thus, this project is currently more a documentation of my approach, what I have learned, the technologies involved, and what needs to be done to reach my ultimate goal of classifying tornadoes with some accuracy.

## About the Data

### NEXRAD

Next Generation Weather Radar (NEXRAD), also known as Weather Surveillance Radar 1988 Doppler (WSR-88D), is a network of 160 radar sites spread across the United States, its territories, and two in South Korea. A real-time feed and a historical archive from June 1991 to the present are available on Amazon AWS.

###  - Accessing the data

Radar files are hosted on Amazon Simple Storage Service (S3), and individual files can be accessed with a key of the form:

/{Year}/{Month}/{Day}/{NEXRAD Station}/{filename}
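
As a minimal sketch, the key prefix for one station and day can be built from the pattern above. The bucket name `noaa-nexrad-level2` and the zero-padded month and day are assumptions, not something stated in this report, so check them against the bucket listing before relying on them.

```python
from datetime import date

# Assumption: name of the public NEXRAD Level II bucket.
BUCKET = "noaa-nexrad-level2"


def nexrad_prefix(station: str, day: date) -> str:
    """Build the {Year}/{Month}/{Day}/{NEXRAD Station}/ prefix for one station-day.

    Assumption: month and day are zero-padded to two digits.
    """
    return f"{day.year:04d}/{day.month:02d}/{day.day:02d}/{station}/"


prefix = nexrad_prefix("KHTX", date(2015, 4, 3))
print(f"s3://{BUCKET}/{prefix}")  # s3://noaa-nexrad-level2/2015/04/03/KHTX/
```

Listing everything under such a prefix gives all the volumes one station recorded that day.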

More specific time information is in the filename in the format:

GGGGYYYYMMDD_TTTTTT

Where: 

* GGGG = Ground station ID (map of ground stations)
* YYYY = year
* MM = month
* DD = day
* TTTTTT = time when data started to be collected (GMT)

followed by:

* "_V06.gz" if the file is from 2012 or newer
* "_V03.gz" if the file is from 2011 or older

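The naming convention above can be sketched as a pair of helper functions. This assumes `TTTTTT` is `HHMMSS`, which the format description does not spell out; the function names are hypothetical.

```python
from datetime import datetime
from typing import Tuple


def volume_filename(station: str, start: datetime) -> str:
    """Build GGGGYYYYMMDD_TTTTTT plus the version suffix described above."""
    suffix = "_V06.gz" if start.year >= 2012 else "_V03.gz"
    # Assumption: TTTTTT is hours, minutes, seconds (HHMMSS) in GMT.
    return f"{station}{start:%Y%m%d_%H%M%S}{suffix}"


def parse_volume_filename(name: str) -> Tuple[str, datetime]:
    """Split a filename back into its station ID and volume start time."""
    station = name[:4]
    start = datetime.strptime(name[4:19], "%Y%m%d_%H%M%S")
    return station, start


name = volume_filename("KHTX", datetime(2015, 4, 3, 23, 15, 0))
print(name)  # KHTX20150403_231500_V06.gz
print(parse_volume_filename(name))
```
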
### - Problems

* A radar volume can range from 1 MB to 15 MB.
* Many many many files
* Where are the tornadoes?

### Storm Events Database

NOAA National Centers for Environmental Information hosts the Storm Events Database.

It contains information on weather events where:

* storms or other weather phenomena cause loss of life, injury, property damage, and/or disruption to commerce
* rare or unusual events, such as snow flurries in South Florida or the San Diego coastal area
* records - min/max temperature for a given time period, or rainfall associated with a storm

This data can be accessed here: [Storm Events Database](https://www.ncdc.noaa.gov/stormevents/ftp.jsp)

The data can be retrieved using FTP or HTTP. Documentation for the file naming convention and data format can also be found at the link above.

The data is segmented into yearly intervals.

Useful features of this data are:

* Event Type -- we are looking for tornado events
* The closest NEXRAD station code, for example "KBHM"
* The time of the event
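
Selecting on these features could look like the sketch below. The column names (`EVENT_TYPE`, `STATE`, `BEGIN_DATE_TIME`, `STATION`) and the three sample rows are made up for illustration, so check the real CSV header in the documentation linked above.

```python
import csv
import io

# Invented sample rows; the column names are assumptions, not the real header.
SAMPLE = """EVENT_TYPE,STATE,BEGIN_DATE_TIME,STATION
Tornado,ALABAMA,2015-04-03 18:15:00,KHTX
Hail,ALABAMA,2015-04-03 18:20:00,KBMX
Tornado,GEORGIA,2015-04-03 19:00:00,KFFC
"""


def tornado_events(csv_file, state):
    """Return the rows that are tornado events in the given state."""
    reader = csv.DictReader(csv_file)
    return [
        row for row in reader
        if row["EVENT_TYPE"] == "Tornado" and row["STATE"] == state
    ]


events = tornado_events(io.StringIO(SAMPLE), "ALABAMA")
for row in events:
    print(row["STATION"], row["BEGIN_DATE_TIME"])
```

With a real yearly file, `io.StringIO(SAMPLE)` would be replaced by an open file handle.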

## Joining Our Data Sources

Using the Storm Events Database in conjunction with NEXRAD S3, we can get radar volumes containing tornadoes.
As a preliminary step, I limited the data to weather events that occurred in Alabama in 2015.

Using the Storm Events Database, we can select which radar volumes to download from the NEXRAD S3 bucket.

* Event times from the Storm Events Database are in local time, while radar volumes use UTC (GMT). So we need to convert time zones to select the correct radar data.

* We also need to clean some radar station codes. KHUN was changed to KHTX.

* We also need to find the radar volume whose start time is closest to the tornado event time.

* We need to remove KTAE, which does track storms in parts of Alabama, but the station is in Florida.

After these steps, we can download the radar volumes we need to classify tornadoes.
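
The cleaning steps listed above can be sketched with the standard library alone. This is an illustration, not the code used for the project: the fixed UTC-6 offset for Alabama event times is an assumption (it ignores daylight saving time), and the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

STATION_RENAMES = {"KHUN": "KHTX"}  # KHUN was changed to KHTX
EXCLUDED_STATIONS = {"KTAE"}        # covers parts of Alabama but sits in Florida

# Assumption: event times are Central Standard Time at a fixed UTC-6 offset.
LOCAL_TZ = timezone(timedelta(hours=-6))


def clean_station(code):
    """Apply station renames; return None for stations we drop."""
    code = STATION_RENAMES.get(code, code)
    return None if code in EXCLUDED_STATIONS else code


def to_utc(local_naive):
    """Attach the assumed local offset and convert the event time to UTC."""
    return local_naive.replace(tzinfo=LOCAL_TZ).astimezone(timezone.utc)


def closest_volume(event_utc, volume_times):
    """Pick the volume start time nearest to the event time."""
    return min(volume_times, key=lambda t: abs(t - event_utc))


event_utc = to_utc(datetime(2015, 4, 3, 18, 15))
volumes = [
    datetime(2015, 4, 4, 0, 10, tzinfo=timezone.utc),
    datetime(2015, 4, 4, 0, 14, tzinfo=timezone.utc),
    datetime(2015, 4, 4, 0, 19, tzinfo=timezone.utc),
]
print(clean_station("KHUN"), closest_volume(event_utc, volumes))
```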

## Analyze Our Radar Data

### NEXRAD Radar 101

Before we analyze our radar data, we need some background knowledge of how the radar works and what data it collects.

A NEXRAD radar produces radar volumes at four-, five-, six-, or ten-minute intervals.

Each radar volume is composed of sweeps. A sweep is a 360 degree rotation at a given elevation. A sweep is made up of rays, each a sliver of the 360 degree sweep. A ray is made of gates, or radial pixels, each covering the span between two distances from the radar.
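
The volume, sweep, ray, and gate hierarchy can be pictured as nested containers. The sketch below is a toy model rather than the layout of any radar library; the 250 m gate spacing, the first-gate distance, and the sizes of the toy volume are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Ray:
    azimuth_deg: float
    gates: List[float]  # one measured value per gate along the ray


@dataclass
class Sweep:
    elevation_deg: float
    rays: List[Ray]


@dataclass
class Volume:
    sweeps: List[Sweep]


def gate_bounds_m(gate_index, first_gate_m=0.0, gate_spacing_m=250.0):
    """Near and far distance of one gate (spacing and offset are assumed values)."""
    near = first_gate_m + gate_index * gate_spacing_m
    return near, near + gate_spacing_m


# Toy volume: 2 sweeps, 360 one-degree rays per sweep, 4 gates per ray.
volume = Volume(sweeps=[
    Sweep(elevation_deg=elev,
          rays=[Ray(azimuth_deg=float(az), gates=[0.0] * 4) for az in range(360)])
    for elev in (0.5, 1.5)
])
n_gates = sum(len(ray.gates) for sweep in volume.sweeps for ray in sweep.rays)
print(n_gates)           # 2 * 360 * 4 = 2880
print(gate_bounds_m(2))  # (500.0, 750.0)
```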

NEXRAD operates in two modes: clear air mode and precipitation mode.

Clear air mode is the most sensitive mode, and each sweep is conducted at a slower rotation, producing a volume every ten minutes. Elevation of the sweeps is kept low to detect incoming storms at greater distances.

Precipitation mode is not as sensitive because rain returns more signal. Rotation speed for each sweep is faster, and elevation is increased to detect moisture higher in the atmosphere.

<img src="../report_assets/vcp31.gif">

Above is VCP 31 (a VCP is a volume coverage pattern). It is a clear air mode VCP performing 5 sweeps every 10 minutes.

<img src="../report_assets/vcp11.gif">

Above is VCP 11. It has 14 elevation slices and completes 16 360° scans in 5 minutes, reaching elevations up to 19.5°.

For more information on VCPs visit http://www.srh.noaa.gov/jetstream/doppler/vcp_max.html

### Radar Fields

The data from NEXRAD S3 comes compressed as gz files, each containing a NEXRAD-format radar file. To read them easily and display the radar data, we will use the Department of Energy's Atmospheric Radiation Measurement (ARM) Program's [Python ARM Radar Toolkit](https://github.com/ARM-DOE/pyart).

<img src="../report_assets/radarRaw.png">

The above images are of a storm that generated a tornado. It was picked up on radar by the KHTX ground station (Huntsville, Alabama). The radar volume file was read and plotted using the Python ARM Radar Toolkit (pyart).

The range of each NEXRAD radar is 300 km or 120 nm. The radar is in the center of each image.

After joining our data from the Storm Events Database and NEXRAD S3, we get radar volumes. Each file contains one volume, and each sweep of the volume has 6 fields. NOAA calls this Level II data; data derived from it is known as Level III.

Here are descriptions of each field:

* reflectivity (dBZ) - indicates the density of precipitation; heavy rainfall is shown in red, light in green
* Zdr (dB) - differential reflectivity, the difference in returned energy between the horizontal and vertical pulses of the radar. Used to help detect water drop shape and hail
* Phi_DP (deg) - differential phase, which measures differences in travel time of radio waves through water/air. Used to detect masses of water
* Rho_HV - correlation coefficient, a statistical correlation between the reflected horizontal and vertical power returns. It is a good indicator of regions where there is a mixture of precipitation types, such as rain and snow
* velocity (m/s) - velocity of the wind toward or away from the radar
* spectrum width - a measure of the variability of the mean radial velocity (wind shear, turbulence, and the quality of the velocity measurements). High values indicate chaotic flow; low values indicate smooth flow. Rapidly changing values can indicate a tornado or strong wind gusts
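
A quick way to confirm that a volume carries all six fields is sketched below. The field names are the ones I assume Py-ART assigns to NEXRAD Level II data, so verify them against `radar.fields` on a real file; the file path in the usage comment is hypothetical.

```python
# Assumption: Py-ART's names for the six fields described above.
EXPECTED_FIELDS = [
    "reflectivity",
    "differential_reflectivity",
    "differential_phase",
    "cross_correlation_ratio",
    "velocity",
    "spectrum_width",
]


def load_volume(path):
    """Read one NEXRAD Level II file with Py-ART (imported only when called)."""
    import pyart  # third-party package, not needed for the check below
    return pyart.io.read_nexrad_archive(path)


def missing_fields(available):
    """Return the expected field names that are absent from a collection of names."""
    return [name for name in EXPECTED_FIELDS if name not in available]


# Usage with a downloaded file (hypothetical path):
# radar = load_volume("KHTX20150403_231500_V06.gz")
# print(radar.nsweeps, radar.nrays, radar.ngates)
# print(missing_fields(radar.fields))
print(missing_fields({"reflectivity": None, "velocity": None}))
```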

### Cleaning Our Radar Data

In the above radar plots we can see a storm like the ones we would typically see on the Weather Channel. However, we also see a large circular cloud around the radar.

This "cloud" is not actually a storm, but bioscatter (bugs, birds, bats, etc.) and ground clutter. Before we run any type of model, we need to remove this "cloud".

To do this, we will use Colorado State University's [CSU_RadarTools](https://github.com/CSU-Radarmet/CSU_RadarTools) Python package. It contains a hodgepodge of radar algorithms; the ones that matter most here remove the bioscatter and ground clutter and then clean our data of specks (despeckle).
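
To make the despeckling idea concrete, here is a plain-Python illustration that works on a single ray. It is not the CSU_RadarTools implementation: it simply blanks any run of valid gates shorter than a threshold, and the `min_run` value of 4 is a hypothetical choice.

```python
def despeckle_ray(gates, min_run=4):
    """Return a copy of one ray with short runs of valid gates set to None.

    A gate is "valid" when its value is not None. Runs of consecutive valid
    gates shorter than min_run are treated as specks and removed.
    """
    cleaned = list(gates)
    run_start = None
    # Iterate one step past the end so a run touching the last gate is closed.
    for i in range(len(gates) + 1):
        valid = i < len(gates) and gates[i] is not None
        if valid and run_start is None:
            run_start = i
        elif not valid and run_start is not None:
            if i - run_start < min_run:
                for j in range(run_start, i):
                    cleaned[j] = None
            run_start = None
    return cleaned


ray = [None, 12.0, None, 30.0, 31.0, 35.0, 33.0, None, 8.0, 9.0]
print(despeckle_ray(ray))
```

The isolated gate and the two-gate run at the end are removed, while the four-gate run in the middle survives.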

## What do we do now?

Before delving into this project, I was aiming to develop a simple linear regression classifier to determine whether there was a tornado or not. Normally this would be easy in other problems: just get a bunch of features describing an event, along with a label. However, this will not work with our radar data in raw format. I would have to extract certain metrics from my matrix of radar values, and I do not have the technical knowledge to do so.

### Mesocyclone Detection Algorithm

Published in 1996, [A Neural Network for Tornado Prediction Based on Doppler Radar-Derived Attributes](http://journals.ametsoc.org/doi/abs/10.1175/1520-0450%281996%29035%3C0617%3AANNFTP%3E2.0.CO%3B2) describes the National Severe Storms Laboratory's Mesocyclone Detection Algorithm (mesocyclones are precursors of tornadoes). It extracts 23 variables characterizing the circulations from radar fields and feeds them into a neural network.

The neural network had a 23-node input layer and one hidden layer of two nodes.
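
To give a feel for how small that network is, here is a forward pass for a network of that shape. Only the 23 inputs and the two hidden nodes come from the description above; the single output node, the sigmoid activations, and the random weights are my assumptions and do not reproduce the paper's model.

```python
import math
import random

N_INPUTS, N_HIDDEN = 23, 2


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(features, w_hidden, b_hidden, w_out, b_out):
    """23 inputs -> 2 hidden nodes -> 1 output score between 0 and 1."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


# Assumption: random, untrained weights, used only to show the shapes involved.
rng = random.Random(0)
w_hidden = [[rng.uniform(-1, 1) for _ in range(N_INPUTS)] for _ in range(N_HIDDEN)]
b_hidden = [0.0] * N_HIDDEN
w_out = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]
b_out = 0.0

features = [rng.random() for _ in range(N_INPUTS)]  # stand-in for the 23 variables
score = forward(features, w_hidden, b_hidden, w_out, b_out)
print(score)
```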

The result reported in that paper was a Critical Success Index of 34.3%.
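
For reference, the Critical Success Index is the number of hits divided by the sum of hits, misses, and false alarms, so correct "no tornado" calls do not count toward it. The counts in the example below are invented to show the arithmetic and are not taken from the paper.

```python
def critical_success_index(hits, misses, false_alarms):
    """CSI = hits / (hits + misses + false alarms)."""
    total = hits + misses + false_alarms
    if total == 0:
        raise ValueError("CSI is undefined when there are no events or alarms")
    return hits / total


# Invented counts: 30 detected tornadoes, 40 missed, 30 false alarms.
print(critical_success_index(30, 40, 30))  # 0.3
```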

### Tornado Detection Algorithm

Published in 1998, [The National Severe Storms Laboratory Tornado Detection Algorithm](http://journals.ametsoc.org/doi/abs/10.1175/1520-0434%281998%29013%3C0352%3ATNSSLT%3E2.0.CO%3B2) uses a different method to detect tornadoes. It identifies vortices across multiple elevations of the storm. It is noted that the TDA performs better on Great Plains tornadoes than on non-plains tornadoes. An illustration is shown below.

<img src="../report_assets/tdaDiagram.png" width="800px" height="600px">

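The idea of looking for a vortex at several elevations can be sketched very loosely as follows. This is a toy illustration, not the published algorithm: it compares radial velocities of two azimuth-adjacent rays gate by gate, flags large differences, and keeps the gates flagged in more than one sweep. The threshold and the minimum sweep count are hypothetical.

```python
def gate_to_gate_difference(ray_a, ray_b):
    """Absolute velocity difference between two adjacent rays at each gate."""
    return [
        None if a is None or b is None else abs(a - b)
        for a, b in zip(ray_a, ray_b)
    ]


def flagged_gates(ray_a, ray_b, threshold_ms=11.0):
    """Gate indices where the difference meets the (assumed) threshold in m/s."""
    diffs = gate_to_gate_difference(ray_a, ray_b)
    return [i for i, d in enumerate(diffs) if d is not None and d >= threshold_ms]


def vertically_stacked(flags_by_sweep, min_sweeps=2):
    """Gate indices flagged in at least min_sweeps elevation sweeps."""
    counts = {}
    for flags in flags_by_sweep:
        for i in set(flags):
            counts[i] = counts.get(i, 0) + 1
    return sorted(i for i, n in counts.items() if n >= min_sweeps)


ray_a = [5.0, -20.0, 3.0, None]
ray_b = [4.0, 18.0, 2.0, 1.0]
print(flagged_gates(ray_a, ray_b))               # [1]
print(vertically_stacked([[1], [1, 2], [3]]))    # [1]
```
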
## New Goal

I am not a meteorologist; I am a data science student. So while my science may not be sound, the purpose of this project is to learn, and then maybe discover some cool trend in the data. Therefore I can pick something arbitrary and approach it from that angle.

I do not know much about neural networks or data operations on large-scale data. So I will attempt to use PySpark and Amazon EC2 to build a large neural network to classify the raw radar data as containing tornadoes or not.