# Capstone Machine Learning Base

### Created by Bret Stine, Mark Mocek, and Miranda Saari

Utilize basic data exploration and machine learning techniques to classify plankton.

### Running Notebook

Do we include this part? from classify_data

Executing this notebook requires a personal STOQS database. Follow the steps to [build your own development system](https://github.com/stoqs/stoqs/blob/master/README.md); this takes about an hour, depending on the speed of your internet connection. Once your server is up, follow these steps to get your virtual environment up and running:
    
    cd ~/Vagrants/stoqsvm
    vagrant ssh -- -X
    cd /vagrant/dev/stoqsgit
    source venv-stoqs/bin/activate
    
Then load the chosen database (e.g. `stoqs_september2013`) with the commands:

    cd stoqs
    ln -s mbari_campaigns.py campaigns.py
    export DATABASE_URL=postgis://stoqsadm:CHANGEME@127.0.0.1:5438/stoqs
    loaders/load.py --db stoqs_september2013
    loaders/load.py --db stoqs_september2013 --updateprovenance
   
Loading this database can take over a day, as there are over 40 million measurements from 22 different platforms. You may want to edit the `stoqs/loaders/CANON/loadCANON_september2013.py` file and comment out all but the `loadDorado()` method calls at the end of the file. You can also set a stride value or use the `--test` option to create a `stoqs_september2013_t` database, in which case you'll need to set the STOQS_CAMPAIGNS environment variable:

    export STOQS_CAMPAIGNS=stoqs_september2013_t

Use the `stoqs/contrib/analysis/classify.py` script to create some labeled data that we will learn from:

    contrib/analysis/classify.py --createLabels --groupName Plankton \
        --database stoqs_september2013 --platform dorado \
        --start 20130916T124035 --end 20130919T233905 \
        --inputs bbp700 fl700_uncorr --discriminator salinity \
        --labels diatom dino1 dino2 sediment \
        --mins 33.33 33.65 33.70 33.75 --maxes 33.65 33.70 33.75 33.93 \
        --clobber -v
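
The command above labels measurements according to the salinity range they fall in. The idea can be sketched in Python as follows; the half-open intervals, the `label_for` helper, and the sample readings are assumptions for illustration, not the actual logic of `classify.py`:

```python
import pandas as pd

# Salinity ranges copied from the --labels/--mins/--maxes arguments above
labels = ['diatom', 'dino1', 'dino2', 'sediment']
mins = [33.33, 33.65, 33.70, 33.75]
maxes = [33.65, 33.70, 33.75, 33.93]

def label_for(salinity):
    """Return the label whose [min, max) range contains salinity, else None."""
    for name, lo, hi in zip(labels, mins, maxes):
        if lo <= salinity < hi:  # half-open interval is an assumption
            return name
    return None

# Made-up salinity readings, for illustration only
sample = pd.DataFrame({'salinity': [33.40, 33.66, 33.72, 33.80, 34.10]})
sample['label'] = sample['salinity'].apply(label_for)
print(sample)
```

Readings outside every range (such as the last one here) stay unlabeled.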

#### Executing notebooks after installation

Start Xming, open a PuTTY window, and run:

    cd dev/stoqsgit && source venv-stoqs/bin/activate
    export DATABASE_URL=postgis://stoqsadm:CHANGEME@127.0.0.1:5438/stoqs
    export STOQS_CAMPAIGNS=stoqs_september2013_t
    cd stoqs/contrib/notebooks
    ../../manage.py shell_plus --notebook

#### Load the stoqs data into a pandas data frame

To find other parameters to put into your data frame, browse http://localhost:8008/stoqs_september2013_o/api/[table_name_here] on the machine where your STOQS server is running. Note: if the parameters are changed, the findings of this notebook may no longer hold. We suggest changing them only in your own copy of the notebook.

In [2]:
import pandas as pd
mps = MeasuredParameter.objects.using('stoqs_september2013_o').filter(measurement__instantpoint__activity__platform__name='dorado')
# df = pd.DataFrame.from_records(mps.values('measurement__instantpoint__timevalue', 'measurement__depth',
#                                           'measurement__geom', 'parameter__name', 'datavalue', 'id'))
df = pd.DataFrame.from_records(mps.values('measurement__instantpoint__timevalue', 'measurement__depth', 
                                          'measurement__geom', 'parameter__name', 'datavalue', 'id', 
                                          'measuredparameterresource__resource__value'))

Bret and McCann's way

In [9]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

mps = MeasuredParameter.objects.using('stoqs_september2013_o').filter(
  measurement__instantpoint__activity__platform__name='dorado')
#mps = mps.filter(measuredparameterresource__resource__name='diatom')
df = pd.DataFrame.from_records(mps.values('measurement__instantpoint__timevalue', 'measurement__depth', 
                     'measurement__geom', 'parameter__name', 'datavalue', 'id', 'measuredparameterresource__resource__value'))

df[0:100]

print(df['parameter__name'].unique())
print(df['measuredparameterresource__resource__value'].unique())

rs = Resource.objects.using('stoqs_september2013_o').filter(value='diatom')

['mepCountList' 'sepCountList' 'altitude' 'spice' 'sigmat' 'yaw' 'pitch'
 'roll' 'biolume' 'salinity' 'fl700_uncorr' 'bbp700' 'bbp420' 'oxygen'
 'temperature' 'nitrate']
[None]


## Exploring Data

The stoqs_september2013_o dataset contains 849,935 rows of data.

Original Column Names

In [3]:
df.columns

Index(['datavalue', 'id', 'measuredparameterresource__resource__value',
       'measurement__depth', 'measurement__geom',
       'measurement__instantpoint__timevalue', 'parameter__name'],
      dtype='object')

Renamed columns to simpler names

In [3]:
df.columns=['value', 'id', 'resourceValue', 'depth', 'geom', 'time', 'name']
df.head(5)

| | value | id | resourceValue | depth | geom | time | name |
|---|---|---|---|---|---|---|---|
| 0 | 828.340949 | 5682174 | | 31.204571 | [-122.17749112685887, 36.71451561320433] | 2013-09-16 15:50:48 | altitude |
| 1 | 881.503691 | 5682017 | | 2.757431 | [-122.18338157084469, 36.71159947459987] | 2013-09-16 15:45:08 | altitude |
| 2 | 0.915118 | 5681654 | | 2.757431 | [-122.18338157084469, 36.71159947459987] | 2013-09-16 15:45:08 | spice |
| 3 | 25.04884 | 5672989 | | 2.757431 | [-122.18338157084469, 36.71159947459987] | 2013-09-16 15:45:08 | sigmat |
| 4 | 47.091859 | 5656022 | | 2.757431 | [-122.18338157084469, 36.71159947459987] | 2013-09-16 15:45:08 | yaw |
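
Assigning a list to `df.columns` relies on the column order staying the same between runs. As a hedged alternative, the same renaming can be written as an explicit mapping with `DataFrame.rename`; the empty `toy` frame below is only a stand-in for the real `df`:

```python
import pandas as pd

# Empty stand-in frame with the original column names listed above
toy = pd.DataFrame(columns=[
    'datavalue', 'id', 'measuredparameterresource__resource__value',
    'measurement__depth', 'measurement__geom',
    'measurement__instantpoint__timevalue', 'parameter__name',
])

# Map each old name to its new name; 'id' is left unchanged
toy = toy.rename(columns={
    'datavalue': 'value',
    'measuredparameterresource__resource__value': 'resourceValue',
    'measurement__depth': 'depth',
    'measurement__geom': 'geom',
    'measurement__instantpoint__timevalue': 'time',
    'parameter__name': 'name',
})
print(list(toy.columns))
```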


In [11]:
#df.info()

Looking at the number of null values in the data, we found that, of the columns used, value and resourceValue are the only ones that contain nulls, and resourceValue is entirely null.

In [4]:
df.isnull().sum()

value              8152
id                    0
resourceValue    849935
depth                 0
geom                  0
time                  0
name                  0
dtype: int64
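
One way to act on these counts is to drop the all-null column and the rows without a measured value. A minimal sketch on a made-up `toy` frame (the values are not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the renamed columns; the values are made up
toy = pd.DataFrame({
    'value': [0.91, np.nan, 25.0, 47.1],
    'resourceValue': [None, None, None, None],
    'name': ['spice', 'nitrate', 'sigmat', 'yaw'],
})

# Drop columns in which every entry is null (resourceValue here)
cleaned = toy.dropna(axis=1, how='all')
# Then drop rows that have no measured value
cleaned = cleaned.dropna(subset=['value'])

print(cleaned.isnull().sum())
```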

By looking at the first and last rows of data, we see that data collection started at 3:45:08 PM on September 16, 2013 and ended at 8:41:54 PM on October 3, 2013 (see the outputs below). We may consider looking at the data in chunks of time, since the AUVs move through the water and data collected in one part of the water may not correlate with data collected in another part.

In [6]:
df.loc[[0]]

| | value | id | resourceValue | depth | geom | time | name |
|---|---|---|---|---|---|---|---|
| 0 | 881.503691 | 5682017 | | 2.757431 | [-122.18338157084469, 36.71159947459987] | 2013-09-16 15:45:08 | altitude |


In [7]:
df.loc[[849934]]

| | value | id | resourceValue | depth | geom | time | name |
|---|---|---|---|---|---|---|---|
| 849934 | 12.715277 | 6302753 | | 3.402838 | [-121.9527427965313, 36.882871083743055] | 2013-10-03 20:41:54 | temperature |
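
The first and last rows give the time span only if the rows are sorted by time. A sketch that does not depend on row order, and that also splits the data into one chunk per day, could look like this; the `toy` timestamps are made up and the daily chunk size is an arbitrary choice:

```python
import pandas as pd

# Toy frame; timestamps are made up and deliberately unsorted
toy = pd.DataFrame({
    'time': pd.to_datetime(['2013-09-17 08:00:00', '2013-09-16 15:45:08',
                            '2013-09-17 09:30:00', '2013-09-18 12:00:00']),
    'value': [1.0, 2.0, 3.0, 4.0],
})

# min/max do not depend on row order, unlike the first and last rows
start, end = toy['time'].min(), toy['time'].max()
print(start, end)

# Split the rows into one chunk per calendar day
for day, chunk in toy.groupby(toy['time'].dt.date):
    print(day, len(chunk))
```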


In [8]:
print(df['resourceValue'].unique())

[None]


In [12]:
df.columns

Index(['value', 'id', 'resourceValue', 'depth', 'geom', 'time', 'name'], dtype='object')

Because the counts per parameter are almost equal, we assume each sample contains a value for each of these parameters. Thus the 849,935 rows of data really correspond to about 60,254 samples, with one row per parameter.

In [7]:
df.name.value_counts()

altitude        60254
spice           60254
oxygen          60254
pitch           60254
temperature     60254
roll            60254
sigmat          60254
salinity        60254
yaw             60254
fl700_uncorr    60248
bbp700          60224
bbp420          60206
biolume         60128
nitrate         58691
sepCountList     4076
mepCountList     4076
Name: name, dtype: int64
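
If each sample really carries one value per parameter, the long table can be reshaped so that each sample is a single row with one column per parameter. A minimal sketch on made-up data, assuming that the pair (`time`, `depth`) identifies a sample:

```python
import pandas as pd

# Toy long-format frame: one row per (sample, parameter); values are made up
toy = pd.DataFrame({
    'time': pd.to_datetime(['2013-09-16 15:45:08'] * 3 + ['2013-09-16 15:45:10'] * 2),
    'depth': [2.76, 2.76, 2.76, 3.10, 3.10],
    'name': ['salinity', 'temperature', 'bbp700', 'salinity', 'temperature'],
    'value': [33.5, 12.7, 0.004, 33.6, 12.5],
})

# One row per sample, one column per parameter; missing readings become NaN
wide = toy.pivot_table(index=['time', 'depth'], columns='name', values='value')
print(wide)
```

Parameters with fewer readings, such as `nitrate` in the counts above, would show up as NaN cells in the wide table.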

In [21]:
df.head()

| | value | id | resourceValue | depth | geom | time | name |
|---|---|---|---|---|---|---|---|
| 0 | 828.340949 | 5682174 | | 31.204571 | [-122.17749112685887, 36.71451561320433] | 2013-09-16 15:50:48 | altitude |
| 1 | 881.503691 | 5682017 | | 2.757431 | [-122.18338157084469, 36.71159947459987] | 2013-09-16 15:45:08 | altitude |
| 2 | 0.915118 | 5681654 | | 2.757431 | [-122.18338157084469, 36.71159947459987] | 2013-09-16 15:45:08 | spice |
| 3 | 25.04884 | 5672989 | | 2.757431 | [-122.18338157084469, 36.71159947459987] | 2013-09-16 15:45:08 | sigmat |
| 4 | 47.091859 | 5656022 | | 2.757431 | [-122.18338157084469, 36.71159947459987] | 2013-09-16 15:45:08 | yaw |


Minimum and maximum depth, along with counts at the various depths.

In [18]:
print(min(df.depth))
print(max(df.depth))
print(df.depth.value_counts().head())
df.depth.value_counts().tail()

-0.21825245527775
81.7012949322709
-0.030662    72
 2.046489    58
-0.003854    58
 0.010047    46
 8.407862    44
Name: depth, dtype: int64


20.868667    12
18.419141    12
18.281148    12
19.625418    11
19.587642    11
Name: depth, dtype: int64
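
Because depth is continuous, counting exact values yields many small groups. Grouping depths into ranges may give a clearer picture; the sketch below uses made-up depths and arbitrary bin edges chosen only for illustration:

```python
import pandas as pd

# Made-up depths in meters, spanning roughly the range reported above
depths = pd.Series([-0.2, 0.5, 2.0, 8.4, 18.3, 19.6, 20.9, 45.0, 81.7])

# Bin edges are an arbitrary choice for illustration
edges = [-1, 10, 20, 40, 82]
binned = pd.cut(depths, bins=edges)

# Count samples per depth range, shallowest first
counts = binned.value_counts().sort_index()
print(counts)
```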

In [24]:
print(df.geom[1].split(','))

AttributeError: 'Point' object has no attribute 'split'

In [34]:
str(df.geom).split(',')[0]

'0          [-122.17749112685887'

In [38]:
Series.str(df.geom).split('[', 1)[1].split(',')[0]

NameError: name 'Series' is not defined

In [None]:
df['Longitude'] = df.geom.apply(lambda p: p.x)  # geom holds Point objects; x is longitude
df['Latitude'] = df.geom.apply(lambda p: p.y)   # y is latitude
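
The `AttributeError` above shows that each `geom` entry is a `Point` object rather than a string, so coordinates are better read per element than by splitting text. A minimal sketch, using a hypothetical stand-in `Point` class and assuming the real objects expose `.x` (longitude) and `.y` (latitude):

```python
import pandas as pd

class Point:
    """Hypothetical stand-in for the geometry objects stored in geom."""
    def __init__(self, x, y):
        self.x = x  # longitude
        self.y = y  # latitude

# Coordinates are rounded, made-up values for illustration
toy = pd.DataFrame({'geom': [Point(-122.1775, 36.7145), Point(-121.9527, 36.8829)]})

# Apply per element; str(toy.geom) would stringify the whole Series instead
toy['longitude'] = toy['geom'].apply(lambda p: p.x)
toy['latitude'] = toy['geom'].apply(lambda p: p.y)
print(toy[['longitude', 'latitude']])
```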