# Module 5 Lab - Data

In [1]:
%matplotlib inline

The special command above will make all the `matplotlib` images appear in the notebook.

## Directions

**Failure to follow the directions will result in a "0"**

The due dates for each are indicated in the Syllabus and the course calendar. If anything is unclear, please email EN605.448@gmail.com, the official email for the course, or ask questions in the Lab discussion area on Blackboard.

The Labs also present technical material that augments the lectures and "book".  You should read through the entire lab at the start of each module.

### General Instructions

1.  You will be submitting your assignment to Blackboard. If there are no accompanying files, you should submit *only* your notebook and it should be named using *only* your JHED id: fsmith79.ipynb for example if your JHED id were "fsmith79". If the assignment requires additional files, you should name the *folder/directory* your JHED id and put all items in that folder/directory, ZIP it up (only ZIP...no other compression), and submit it to Blackboard.
    
    * do **not** use absolute paths in your notebooks. All resources should appear in the same directory as the rest of your assignments.
    * the directory **must** be named your JHED id and **only** your JHED id.
    
This assignment has accompanying files. You should include in your zip file:

    1. [jhed_id].ipynb
    2. hurricanes.py - your preprocessing file.
    3. hurricanes.html - the local copy of the Wikipedia page.
    4. hurricanes.db - the SQLite database you create.
    
2. Data Science is as much about what you write (communicating) as the code you execute (researching). In many places, you will be required to execute code and discuss both the purpose and the result. Additionally, Data Science is about reproducibility and transparency. This includes good communication with your team and possibly with yourself. Therefore, you must show **all** work.

3. Avail yourself of the Markdown/Codecell nature of the notebook. If you don't know about Markdown, look it up. Your notebooks should not look like ransom notes. Don't make everything bold. Clearly indicate what question you are answering.

4. Submit a cleanly executed notebook. The first code cell should say `In [1]` and each successive code cell should increase by 1 throughout the notebook.

## Individual Submission

## Getting and Storing Data

This Lab is about acquiring, cleaning and storing data as well as doing a little bit of analysis.

### Basic Outline

1. Using `curl` or `wget`, obtain a local copy of the following web page: Atlantic Hurricane Season ( https://en.wikipedia.org/wiki/Atlantic_hurricane_season ). **Include this in your submission as `hurricanes.html`**.  This is important. In Spring 2016, the page was edited during the module and different people got different answers at different times.  You only need to be correct with respect to your `hurricanes.html` file.
2. Using Beautiful Soup 4 and Python, parse the HTML file into a usable dataset. **Your parsing code should be in a file `hurricanes.py` and included in your submission**.
3. Write this dataset to a SQLite3 database called `hurricanes.db`. **Include this in your submission**.
4. Run the requested queries against the dataset (**see below**). The results should be **nicely formatted**.

Although Wikipedia has an API, I do not want you to use it for this assignment.

### Details

The data is contained in many separate HTML tables. The challenge is to write a general table parsing function and then locate each table and apply the function to it. You only need to get the data from the tables starting at the 1850s. Not all years have the same data. You only need to save the following columns. The name in parentheses is the name the column should have in the database table.

- Year (`year`)
- Number of tropical storms (`tropical_storms`)
- Number of hurricanes (`hurricanes`)
- Number of Major Hurricanes (`major_hurricanes`)
- Deaths (`deaths`)
- Damage (`damage`)
- Notes (`notes`)
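
A general table parser might look like the sketch below. It only illustrates the idea: it uses the standard library's `html.parser` instead of Beautiful Soup 4 so that it runs without extra packages, and the names `TableExtractor` and `parse_tables` and the sample HTML are hypothetical. The real Wikipedia tables may contain merged cells, footnote markers, and nested markup that this sketch does not handle.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows; each row is a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []    # one list of rows per <table> seen so far
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell text.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def parse_tables(html):
    """Return all tables found in an HTML string."""
    extractor = TableExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.tables


# Hypothetical, simplified markup; the saved hurricanes.html is far messier.
sample = """
<table>
  <tr><th>Year</th><th>Number of hurricanes</th><th>Deaths</th></tr>
  <tr><td>1851</td><td>3</td><td>24</td></tr>
  <tr><td>1852</td><td>5</td><td>100</td></tr>
</table>
"""
tables = parse_tables(sample)
header, *rows = tables[0]
records = [dict(zip(header, row)) for row in rows]
print(records)
```

Keeping the parser ignorant of column meanings means the same function can be applied to every decade's table, with the column mapping handled afterwards.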

Note that "Damage" doesn't start until the 1900s and "Notes" was added in the 1880s. "Strongest Storm" should be added to the Notes column (even in years that didn't have Notes), as should "Retired Storms". The name of the database table should be `atlantic_hurricanes`. The name of the database file (SQLite3 uses a file) should be `hurricanes.db` (who knows...you might need to add Pacific storms someday).
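
As a rough sketch of the storage step, the table could be created and filled with the standard `sqlite3` module. The column types and the primary key are assumptions (the lab only fixes the table and column names), and the two rows are illustrative values, not a complete dataset.

```python
import sqlite3

# ":memory:" keeps the sketch self-contained; the lab would connect to "hurricanes.db".
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS atlantic_hurricanes (
        year             INTEGER PRIMARY KEY,  -- assumed: one row per season
        tropical_storms  INTEGER,
        hurricanes       INTEGER,
        major_hurricanes INTEGER,
        deaths           INTEGER,
        damage           REAL,                 -- assumed: plain dollars
        notes            TEXT
    )
    """
)

# Illustrative rows only; None is stored as NULL for columns a decade does not report.
rows = [
    (1851, None, 3, 1, 24, None, None),
    (1852, None, 5, 1, 100, None, None),
]
conn.executemany("INSERT INTO atlantic_hurricanes VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
conn.commit()

stored = conn.execute(
    "SELECT year, hurricanes, damage FROM atlantic_hurricanes ORDER BY year"
).fetchall()
print(stored)
```

Using `?` placeholders lets `sqlite3` handle quoting, which matters for the free-text `notes` column.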

There are a number of parsing problems which will most likely require regular expressions. First, the Deaths column has numbers that include commas and entries that are not numbers (Unknown and None). How should you code Unknown and None so that answers are not misleading but queries are still fairly straightforward to write?
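
One possible coding, offered as an assumption rather than the required answer: store Unknown as NULL, which SQLite aggregates such as `MAX` and `SUM` skip, and store None as 0, since it states that nobody died. The helper name `parse_deaths` and the sample strings are hypothetical.

```python
import re


def parse_deaths(text):
    """Return an int, or None when the count is unknown or has no digits."""
    cleaned = text.strip().lower()
    if cleaned == "none":
        return 0            # the source says there were no deaths
    match = re.search(r"\d[\d,]*", cleaned)
    if match is None:
        return None         # "Unknown" and anything else without digits
    return int(match.group().replace(",", ""))


# Illustrative inputs, not copied from the page.
for raw in ["24", "8,000", "12,000+", "~100", "Unknown", "None"]:
    print(repr(raw), "->", parse_deaths(raw))
```

Python's `None` becomes SQL `NULL` when inserted through `sqlite3`, so queries stay simple while unknown counts never masquerade as zero.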

Similarly, Damages has numbers with commas, currency signs and different amount words (millions, billions). How will you normalize all of these so that a query can compare them? You may need regular expressions.
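
A minimal sketch of one way to normalize damage strings into plain dollar amounts with a regular expression. The function name `parse_damage`, the multiplier table, and the sample strings are assumptions for illustration; the actual page may use formats this pattern misses.

```python
import re

# Assumed set of amount words; extend it if the page uses others.
MULTIPLIERS = {"thousand": 1e3, "million": 1e6, "billion": 1e9}

# A number with optional thousands separators and decimals, then an optional amount word.
DAMAGE_RE = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion)?", re.IGNORECASE
)


def parse_damage(text):
    """Return the damage in dollars as a float, or None if no number is present."""
    match = DAMAGE_RE.search(text)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    word = match.group(2)
    return number * MULTIPLIERS[word.lower()] if word else number


# Illustrative inputs only.
for raw in ["$50 million", ">= $282.16 billion", "$1,500", "Unknown"]:
    print(repr(raw), "->", parse_damage(raw))
```

Storing one numeric unit (dollars) lets a query compare seasons directly with `MAX` or `ORDER BY`.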

Additionally, the way that Tropical Storms are accounted for seems to change mysteriously. Looking over the data, it's not immediately apparent when the interpretation should change. 1850s, 1860s definitely but 1870s? Not sure. It could just be a coincidence that there were never more hurricanes than tropical storms which seems to be the norm but see, for example, 1975. Welcome to Data Science!

You should put your parsing code in `hurricanes.py`. None of it should be in this file. Imagine this file is going to be the report you give to your boss.

## Documentation

Any time you run into a problem where you have to decide what to do--how to solve the problem or treat the values--document it here.

## Hurricanes.db

What is the *function* of `hurricanes.db` in this assignment?

### Queries

When you are done, you must write and execute the following queries against your database. Those queries should be run from this notebook. Find the documentation for using SQLite3 from Python (the library is already included). You should never output raw Python data structures; instead, you need to output well-formatted tables. You may need to look around for a library to help you or write your own formatting code. `Pandas` is one possibility. However, I want you to use raw SQL for your queries, so if you use `Pandas`, use it only for the formatting of query results (don't load the data into Pandas and use Pandas/Python to query the data).

**Write the most general query possible. Never assume that you are going to get only one result back (i.e., don't assume there won't be ties).** The query result should be in a nicely formatted table; don't show raw data structures to your boss or manager.
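
As an illustration of a tie-safe query, the sketch below compares against a `MAX` subquery instead of sorting and taking the first row, then prints the result with a small hand-written formatter. The rows are made up (deliberately tied, with future years so they cannot be mistaken for real seasons), and the reduced two-column table is an assumption to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the lab would connect to "hurricanes.db" instead
conn.execute("CREATE TABLE atlantic_hurricanes (year INTEGER, major_hurricanes INTEGER)")
# Made-up rows with a deliberate tie; not real season counts.
conn.executemany(
    "INSERT INTO atlantic_hurricanes VALUES (?, ?)",
    [(2101, 4), (2102, 7), (2103, 7), (2104, None)],
)

# Every row that matches the maximum is returned, so ties are not dropped.
query = """
    SELECT year, major_hurricanes
    FROM atlantic_hurricanes
    WHERE major_hurricanes = (SELECT MAX(major_hurricanes) FROM atlantic_hurricanes)
    ORDER BY year
"""
cursor = conn.execute(query)
headers = [col[0] for col in cursor.description]
rows = cursor.fetchall()

# Minimal hand-rolled table formatting: right-align each column to its widest value.
widths = [
    max(len(str(v)) for v in [h] + [r[i] for r in rows])
    for i, h in enumerate(headers)
]


def fmt(values):
    return " | ".join(str(v).rjust(w) for v, w in zip(values, widths))


lines = [fmt(headers), "-+-".join("-" * w for w in widths)] + [fmt(r) for r in rows]
print("\n".join(lines))
```

`MAX` ignores the `NULL` row, so a season with an unknown count can never win the comparison.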

Additionally, don't just run the query. Having gotten an answer, add a textual description of the result to the question. Data Science is not about running code; code is a means to an end. The end is the communication of results. We never just run code in this class.

In [48]:
# imports
import pandas as pd
import sqlite3

conn = sqlite3.connect("hurricanes.db")
df = pd.read_sql_query("select * from Hurricanes;", conn)

# Convert the count columns from strings to numbers; missing values stay NaN.
count_cols = ['tropical_storms', 'hurricanes', 'major_hurricanes']
df[count_cols] = df[count_cols].apply(pd.to_numeric)

# Strip annotation characters so the values can be parsed as numbers.
# regex=False treats '+' and '$' as literal characters, not regex metacharacters.
for ch in ['+', ',', '~']:
    df['deaths'] = df['deaths'].str.replace(ch, '', regex=False)

for ch in ['>=', '>', '$']:
    df['damage'] = df['damage'].str.replace(ch, '', regex=False)

# Non-numeric entries (such as "Unknown") become NaN.
df['deaths'] = pd.to_numeric(df['deaths'], errors='coerce')

# Remove rows with an empty Year.
df = df.loc[df.Year != ""]

df

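| | Year | tropical_storms | hurricanes | major_hurricanes | deaths | damage | notes |
|---|---|---|---|---|---|---|---|
| 2 | 1850 | | 3.0 | 0.0 | | | |
| 3 | 1851 | | 3.0 | 1.0 | 24.0 | | |
| 4 | 1852 | | 5.0 | 1.0 | 100.0 | | |
| 5 | 1853 | | 4.0 | 2.0 | 40.0 | | |
| 6 | 1854 | | 3.0 | 1.0 | 30.0 | | |
| 7 | 1855 | | 4.0 | 1.0 | | | |
| 8 | 1856 | | 4.0 | 2.0 | 200.0 | | |
| 9 | 1857 | | 3.0 | 0.0 | 424.0 | | |
| 10 | 1858 | | 6.0 | 0.0 | | | |
| 11 | 1859 | | 7.0 | 1.0 | | | |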


1\. For the 1920s, list the years by number of tropical storms, then hurricanes.

In [31]:
df.loc[df.Year.str.startswith("192")][["Year","tropical_storms","hurricanes"]]

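| | Year | tropical_storms | hurricanes |
|---|---|---|---|
| 72 | 1920 | | 4.0 |
| 73 | 1921 | | 5.0 |
| 74 | 1922 | | 3.0 |
| 75 | 1923 | | 4.0 |
| 76 | 1924 | | 5.0 |
| 77 | 1925 | | 2.0 |
| 78 | 1926 | | 8.0 |
| 79 | 1927 | | 4.0 |
| 80 | 1928 | | 4.0 |
| 81 | 1929 | | 3.0 |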


2\. What year had the most tropical storms?

In [32]:
df.sort_values('tropical_storms', ascending=False).iloc[0]

Year                                                             2005
tropical_storms                                                    28
hurricanes                                                         15
major_hurricanes                                                    7
deaths                                                            NaN
damage                                                  180.4 billion
notes               Second costliest hurricane season on recordSea...
Name: 157, dtype: object

3\. What year had the most major hurricanes?

In [33]:
df.sort_values('major_hurricanes', ascending=False).iloc[0]

Year                                                             2005
tropical_storms                                                    28
hurricanes                                                         15
major_hurricanes                                                    7
deaths                                                            NaN
damage                                                  180.4 billion
notes               Second costliest hurricane season on recordSea...
Name: 157, dtype: object

4\. What year had the most deaths?

In [49]:
df.sort_values('deaths', ascending=False).iloc[0]

Year                                                             1998
tropical_storms                                                    14
hurricanes                                                         10
major_hurricanes                                                    3
deaths                                                          12000
damage                                                   12.2 billion
notes               Four simultaneous hurricanes on September 26, ...
Name: 150, dtype: object

5\. What year had the most damage (not inflation adjusted)?

In [66]:
# I could not find a good way to convert "billion" and "million" amounts to numbers.
# This method is cumbersome, but it works: only amounts given in billions are compared.
df.loc[df.damage.str.contains( "282.16")]

max_damage = 0
max_damage_1 = ""
for damage in  df.damage:
    damage_1 = damage.replace(',','')
    damage_1 = damage_1.replace('+','')
    if " billion" in damage_1:
        damage_2 = float(damage_1.replace(" billion",""))
        if damage_2>max_damage:
            max_damage = damage_2
            max_damage_1 = damage

print("converted value:", max_damage, ", raw value:",max_damage_1)

df.loc[df.damage == max_damage_1]

converted value: 282.16 , raw value:  282.16 billion


| | Year | tropical_storms | hurricanes | major_hurricanes | deaths | damage | notes |
|---|---|---|---|---|---|---|---|
| 169 | 2017 | 17.0 | 10.0 | 6.0 | 3361.0 | 282.16 billion | Costliest hurricane season on record First Apr... |


6\. What year had the highest proportion of tropical storms turn into major hurricanes?

In [90]:
# Ratio of major hurricanes to hurricanes (tropical_storms is missing for many years).
df["majorRatio"] = df.major_hurricanes / df.hurricanes

df.loc[df.majorRatio > 0].sort_values('majorRatio', ascending=False).iloc[0]


Year                                                   1930
tropical_storms                                         NaN
hurricanes                                                2
major_hurricanes                                          2
deaths                                                8,000
damage                                          $50 million
notes               The fifth deadliest hurricane on record
majorRatio                                                1
Name: 82, dtype: object

## Things to think about

1. What is the granularity of this data? (Are the rows the most specific observation possible?)
2. What if this data were contained in worksheets in an Excel file? Find a Python library or libraries that work with Excel spreadsheets.
3. Each section links to details about each hurricane season. Review each Season's page and discuss strategies for extracting the information for every hurricane.
4. Hurricane tracking maps were recently added. How would you get and store those images?
5. Damages are not inflation adjusted. How would you go about *enriching* your data with inflation adjusted dollars? Where should this additional data be stored and how would it be used?

*notes here*