## Capstone 1 
# San Francisco Bay Water Quality

Reference: [Water quality of SF Bay home page](https://sfbay.wr.usgs.gov/access/wqdata/index.html)

## Unit 5 - Data Wrangling, Stations

### Tasks

The first step in completing your capstone project is to collect data. Depending on your dataset, you may apply some of the data wrangling techniques that you learned in this unit.

Include answers to these questions in your submission:
   * What kind of cleaning steps did you perform?

   * How did you deal with missing values, if any?

   * Were there outliers, and how did you handle them?


## Data Acquisition



### Station Location Information

#### Access

Location data for "standard" stations is available from [ScienceBase](https://www.sciencebase.gov/catalog/item/5966abe6e4b0d1f9f05cf551).

However, more complete location data is available in tables at [sfbay.wr.usgs.gov](https://sfbay.wr.usgs.gov/access/wqdata/overview/wherewhen/where.html). These tables include the general location of each station (by geographical landmark) as well as data for "non-standard" stations, which are sampled less often.

These tables were copied, pasted into a spreadsheet, and exported as CSV. Before export, the header fields were edited to remove newlines, and several fields were modified to remove artifacts.

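
The header cleanup was done by hand in a spreadsheet, but the same step could be scripted. The sketch below is only an illustration: `clean_header` is a hypothetical helper, and the raw header strings are assumed examples of what a pasted web table might produce, not the actual cells from the USGS page.

```python
import re


def clean_header(raw: str) -> str:
    """Collapse newlines and repeated whitespace in a header cell into single spaces."""
    return re.sub(r"\s+", " ", raw).strip()


# Hypothetical raw header cells (assumed, for illustration only)
raw_headers = ["Station\nNumber", "General\nLocation", "Depth MLW\n(m)"]

cleaned = [clean_header(h) for h in raw_headers]
print(cleaned)
```

Scripting the cleanup makes it repeatable if the source tables are updated later.
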
#### Files

   1. `SFBayStationLocations.csv`


#### Data Format

The Station Locations file is in CSV format, with one header row and 5 columns.

<small> 
```
Station Number, General Location, North Latitude, West Longitude, Depth MLW (m)
```
</small>




## Setup

Import libraries

In [1]:
# Import useful libraries

import pandas as pd
import matplotlib.pyplot as plt
import datetime
import re
import json


## Read in the Stations Tables

In [2]:
# Read in Station locations
st_df = pd.read_csv('Data/orig/SFBayStationLocationsTable.csv', 
                    dtype={'Station Number' : str},
                    header=0)


In [3]:
st_df.columns

Index(['Station Number', 'General Location', 'North Latitude',
       'West Longitude', 'Depth MLW (m)'],
      dtype='object')

In [4]:
st_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 5 columns):
Station Number      51 non-null object
General Location    48 non-null object
North Latitude      51 non-null object
West Longitude      51 non-null object
Depth MLW (m)       44 non-null float64
dtypes: float64(1), object(4)
memory usage: 2.1+ KB
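
The `info()` output shows that `General Location` and `Depth MLW (m)` have fewer non-null entries than the other columns. One way to see which rows are affected is sketched below on a toy frame. The toy values are illustrative stand-ins, and `missing_per_column` and `rows_with_gaps` are names introduced here, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for st_df; values are illustrative, not the full table
toy = pd.DataFrame({
    "Station Number": ["657", "654", "2"],
    "General Location": ["Rio Vista", None, "Chain Island"],
    "Depth MLW (m)": [10.1, np.nan, 11.3],
})

# Count missing values in each column
missing_per_column = toy.isna().sum()

# Select the rows that have at least one missing value
rows_with_gaps = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_gaps)
```

Running the same two expressions on `st_df` would list the stations that lack a location name or a depth, which helps when deciding whether those gaps matter for later analysis.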


In [5]:
# rename 'Station Number' as 'Station'
st_df.rename(columns={"Station Number": "Station"}, inplace=True)

In [6]:
st_df

Unnamed: 0,Station,General Location,North Latitude,West Longitude,Depth MLW (m)
0,657.0,Rio Vista,38 9.1',-121 41.3',10.1
1,649.0,Sacramento River,3.6',48.0',10.1
2,2.0,Chain Island,3.8',51.1',11.3
3,3.0,Pittsburg,3.1',52.8',11.3
4,4.0,Simmons Point,2.9',56.1',11.6
5,5.0,Middle Ground,3.6',58.8',9.8
6,6.0,Roe Island,3.9',-122 2.1',10.1
7,7.0,Avon Pier,2.9',5.8',11.6
8,8.0,Martinez,1.8',9.1',14.3
9,9.0,Benicia,3.4',11.1',34.4


## Filling in Missing Degrees

Many records in the table do not include geographic degrees; they inherit them from the station on the row above. I want to fill in this data.

In [7]:
# First, extract the degrees from the current Lat and Long columns
# (raw strings avoid escape-sequence warnings; expand=False returns a Series)
st_df['North Lat Degrees'] = st_df['North Latitude'].str.extract(r'^(3[78]) ', expand=False)
st_df['West Long Degrees'] = st_df['West Longitude'].str.extract(r'^(-12[12]) ', expand=False)

In [8]:
# Then fill forward for rows that did not specify degrees
# (assigning the result avoids in-place fills on a column selection)
st_df['North Lat Degrees'] = st_df['North Lat Degrees'].ffill()
st_df['West Long Degrees'] = st_df['West Long Degrees'].ffill()

In [9]:
# Next, extract the minutes, leaving out any leading degrees
st_df['North Lat Minutes'] = st_df['North Latitude'].str.extract(r"(\d{1,2}\.\d)'$", expand=False)
st_df['West Long Minutes'] = st_df['West Longitude'].str.extract(r"(\d{1,2}\.\d)'$", expand=False)

In [10]:
st_df.head(20)

Unnamed: 0,Station,General Location,North Latitude,West Longitude,Depth MLW (m),North Lat Degrees,West Long Degrees,North Lat Minutes,West Long Minutes
0,657,Rio Vista,38 9.1',-121 41.3',10.1,38,-121,9.1,41.3
1,649,Sacramento River,3.6',48.0',10.1,38,-121,3.6,48.0
2,2,Chain Island,3.8',51.1',11.3,38,-121,3.8,51.1
3,3,Pittsburg,3.1',52.8',11.3,38,-121,3.1,52.8
4,4,Simmons Point,2.9',56.1',11.6,38,-121,2.9,56.1
5,5,Middle Ground,3.6',58.8',9.8,38,-121,3.6,58.8
6,6,Roe Island,3.9',-122 2.1',10.1,38,-122,3.9,2.1
7,7,Avon Pier,2.9',5.8',11.6,38,-122,2.9,5.8
8,8,Martinez,1.8',9.1',14.3,38,-122,1.8,9.1
9,9,Benicia,3.4',11.1',34.4,38,-122,3.4,11.1
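
The extract-then-fill-forward pattern can be checked in isolation. The following is a minimal sketch on four longitude strings taken from the table above (rows 0, 1, 6 and 7); the variable names are introduced here for illustration.

```python
import pandas as pd

# Four West Longitude values from the table: two with degrees, two without
toy = pd.DataFrame({"West Longitude": ["-121 41.3'", "48.0'", "-122 2.1'", "5.8'"]})

# Capture the leading degrees where present; rows without degrees become NaN
degrees = toy["West Longitude"].str.extract(r"^(-12[12]) ", expand=False)

# Carry the last seen degrees value down to the rows that omitted it
degrees = degrees.ffill()

# Capture the minutes, which always sit just before the trailing apostrophe
minutes = toy["West Longitude"].str.extract(r"(\d{1,2}\.\d)'$", expand=False)

print(pd.DataFrame({"degrees": degrees, "minutes": minutes}))
```

Forward fill is only safe here because the source table lists stations in order and omits the degrees when they match the row above, so the fill must happen before any sorting.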


Now I can drop the original Latitude and Longitude columns

In [11]:
st_df.drop(columns=['North Latitude', 'West Longitude'], inplace=True)

And replace them with decimal versions

In [12]:
st_df = st_df.astype({'North Lat Degrees': 'float',
              'North Lat Minutes': 'float',
              'West Long Degrees': 'float',
              'West Long Minutes': 'float'
             })

st_df['Latitude'] = st_df['North Lat Degrees'] + round(st_df['North Lat Minutes']/60, 2)
st_df['Longitude'] = abs(st_df['West Long Degrees']) + round(st_df['West Long Minutes']/60, 2)
st_df['Longitude'] = st_df['Longitude'] * (-1)

In [13]:
st_df

Unnamed: 0,Station,General Location,Depth MLW (m),North Lat Degrees,West Long Degrees,North Lat Minutes,West Long Minutes,Latitude,Longitude
0,657.0,Rio Vista,10.1,38.0,-121.0,9.1,41.3,38.15,-121.69
1,649.0,Sacramento River,10.1,38.0,-121.0,3.6,48.0,38.06,-121.8
2,2.0,Chain Island,11.3,38.0,-121.0,3.8,51.1,38.06,-121.85
3,3.0,Pittsburg,11.3,38.0,-121.0,3.1,52.8,38.05,-121.88
4,4.0,Simmons Point,11.6,38.0,-121.0,2.9,56.1,38.05,-121.94
5,5.0,Middle Ground,9.8,38.0,-121.0,3.6,58.8,38.06,-121.98
6,6.0,Roe Island,10.1,38.0,-122.0,3.9,2.1,38.06,-122.04
7,7.0,Avon Pier,11.6,38.0,-122.0,2.9,5.8,38.05,-122.1
8,8.0,Martinez,14.3,38.0,-122.0,1.8,9.1,38.03,-122.15
9,9.0,Benicia,34.4,38.0,-122.0,3.4,11.1,38.06,-122.18
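
The conversion used above is decimal degrees = degrees + minutes / 60, with the sign of the degrees applied to the whole value. A small stand-alone sketch of that arithmetic follows; `dm_to_decimal` is a hypothetical helper, not a function from the notebook, and unlike the notebook it does not round the minutes fraction before adding.

```python
import math


def dm_to_decimal(degrees: float, minutes: float) -> float:
    """Convert degrees and decimal minutes to signed decimal degrees."""
    magnitude = abs(degrees) + minutes / 60.0
    # Apply the sign of the degrees to the combined value
    return math.copysign(magnitude, degrees)


# Rio Vista (station 657): 38 9.1' N, -121 41.3' W
lat = dm_to_decimal(38, 9.1)
lon = dm_to_decimal(-121, 41.3)
print(lat, lon)
```

Rounded to two decimal places, these agree with the 38.15 and -121.69 shown for station 657. Two decimal places of latitude correspond to roughly a kilometre, so keeping more digits may be worthwhile if stations need to be distinguished at finer scale.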


Organize the stations geographically

In [14]:
# Create a list of stations
station_list_tmp = st_df['Station'].tolist()
print(*station_list_tmp) 

657 649 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 20 21 22 23 24 25 26 27 28 29 29.5 30 31 32 33 34 35 36 662 659 655 654 653 652 651 650 411 407 405 12.5 19 28.5


In [15]:
# The stations, in order from the Sacramento River south to San Jose
station_list = ['662', '659', '657', '655', '654', '653', '652', '651', '650', '649', 
                '2', '3', '4', '407', '411','5', '6', '7', '405', '8', '9', '10', 
                '11', '12','12.5', '13', '14', '15', '16', '17', '18', '19', '20', 
                '21', '22', '23', '24', '25', '26','27', '28', '28.5', '29', '29.5', 
                '30', '31', '32', '33', '34', '35', '36']

# Sort the stations by making 'Station' an ordered categorical

st_df.Station = pd.Categorical(st_df.Station, 
                      categories=station_list,
                      ordered=True)

st_df.sort_values('Station', inplace=True)
st_df.reset_index(drop=True, inplace=True)
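
To see why an ordered categorical is used rather than a plain sort, here is a minimal sketch on five station labels. The `order` list is a subset of `station_list` in the same relative order; the other names are introduced for illustration.

```python
import pandas as pd

labels = ["2", "657", "12.5", "12", "649"]
order = ["657", "649", "2", "12", "12.5"]  # subset of station_list, same relative order

# A plain string sort is lexicographic and ignores geography
plain_sorted = sorted(labels)

# An ordered categorical sorts by position in the category list instead
toy = pd.DataFrame({"Station": pd.Categorical(labels, categories=order, ordered=True)})
sorted_toy = toy.sort_values("Station").reset_index(drop=True)

# Labels missing from the category list silently become NaN
unknown = pd.Categorical(["999"], categories=order, ordered=True)

print(plain_sorted)
print(sorted_toy["Station"].astype(str).tolist())
print(unknown.isna())
```

Because unlisted labels turn into missing values, it may be worth checking `st_df['Station'].isna().sum()` after the conversion to confirm that every station in the table appears in `station_list`.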

Rearrange the columns

In [16]:
st_df.columns

Index(['Station', 'General Location', 'Depth MLW (m)', 'North Lat Degrees',
       'West Long Degrees', 'North Lat Minutes', 'West Long Minutes',
       'Latitude', 'Longitude'],
      dtype='object')

In [17]:
st_df = st_df[['Station', 'General Location', 
              'Latitude', 'Longitude',
              'North Lat Degrees', 'North Lat Minutes',
              'West Long Degrees','West Long Minutes', 
              'Depth MLW (m)'
             ]]

In [18]:
st_df.head()

Unnamed: 0,Station,General Location,Latitude,Longitude,North Lat Degrees,North Lat Minutes,West Long Degrees,West Long Minutes,Depth MLW (m)
0,662,Prospect Island,38.23,-121.67,38.0,13.6,-121.0,40.2,10.1
1,659,Old Sac. River,38.18,-121.67,38.0,10.7,-121.0,40.0,10.1
2,657,Rio Vista,38.15,-121.69,38.0,9.1,-121.0,41.3,10.1
3,655,N.of Three Mile Slough,38.12,-121.7,38.0,7.3,-121.0,42.1,10.1
4,654,,38.1,-121.71,38.0,6.3,-121.0,42.5,


In [19]:
#Save the station locations file
st_df.to_csv('Data/SFBayStationLocations.csv', index=False)

In [20]:
# Save the list of station numbers
with open('Data/station_list.json', 'w') as fp:
    json.dump(station_list, fp) 


<hr style="border: 5px solid green;">

Next time, we can read the data in with

```
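st_df = pd.read_csv('Data/SFBayStationLocations.csv',
                    header=0,
                    dtype={'Station' : str}   # the column was renamed from 'Station Number'
                    )

# Load the station order from the file saved above
with open('Data/station_list.json', 'r') as fp:
    station_list = json.load(fp)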
```
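
One thing to keep in mind when reloading: CSV stores plain text, so the ordered categorical on `Station` is not preserved and has to be rebuilt from the saved list. The sketch below shows the round trip on a three-station toy frame written to a temporary directory; the file names and the shortened list are assumptions made for the example.

```python
import json
import os
import tempfile

import pandas as pd

station_list = ["657", "649", "2"]  # shortened list, for illustration only
toy = pd.DataFrame({
    "Station": pd.Categorical(["657", "649", "2"], categories=station_list, ordered=True),
    "Latitude": [38.15, 38.06, 38.06],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "stations.csv")
    json_path = os.path.join(tmp, "station_list.json")

    # Save the table and the station order
    toy.to_csv(csv_path, index=False)
    with open(json_path, "w") as fp:
        json.dump(station_list, fp)

    # Read both back; dtype=str keeps labels such as '12.5' as text
    reloaded = pd.read_csv(csv_path, dtype={"Station": str})
    with open(json_path) as fp:
        reloaded_list = json.load(fp)

# Rebuild the ordered categorical that the CSV could not store
reloaded["Station"] = pd.Categorical(reloaded["Station"],
                                     categories=reloaded_list,
                                     ordered=True)
print(reloaded.dtypes)
```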
