# Capstone Machine Learning Base

### Created by Bret Stine, Mark Mocek, and Miranda Saari

Utilize basic data exploration and machine learning techniques to classify plankton.

### Running Notebook


Executing this notebook requires a personal STOQS database. Follow the steps to [build your own development system](https://github.com/stoqs/stoqs/blob/master/README.md); this will take about an hour, depending on the speed of your internet connection. Once your server is up, follow the steps below to get your virtual environment up and running:
    
    cd ~/Vagrants/stoqsvm
    vagrant ssh -- -X
    cd /vagrant/dev/stoqsgit
    source venv-stoqs/bin/activate
    
Then load the chosen database (e.g. `stoqs_september2013`) with the commands:

    cd stoqs
    ln -s mbari_campaigns.py campaigns.py
    export DATABASE_URL=postgis://stoqsadm:CHANGEME@127.0.0.1:5438/stoqs
    loaders/load.py --db stoqs_september2013
    loaders/load.py --db stoqs_september2013 --updateprovenance
   
Loading this database can take over a day, as there are over 40 million measurements from 22 different platforms. You may want to edit the `stoqs/loaders/CANON/loadCANON_september2013.py` file and comment out all but the `loadDorado()` method calls at the end of the file. You can also set a stride value or use the `--test` option to create a `stoqs_september2013_t` database, in which case you'll need to set the `STOQS_CAMPAIGNS` environment variable:

    export STOQS_CAMPAIGNS=stoqs_september2013_t

Use the `stoqs/contrib/analysis/classify.py` script to create some labeled data that we will learn from:

    contrib/analysis/classify.py --createLabels --groupName Plankton \
        --database stoqs_september2013 --platform dorado \
        --start 20130916T124035 --end 20130919T233905 \
        --inputs bbp700 fl700_uncorr --discriminator salinity \
        --labels diatom dino1 dino2 sediment \
        --mins 33.33 33.65 33.70 33.75 \
        --maxes 33.65 33.70 33.75 33.93 --clobber -v
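
The `--mins` and `--maxes` values above pair each label with a salinity range. The sketch below illustrates that pairing with `pandas.cut`. It is not the implementation used by `classify.py`: the salinity readings are hypothetical, and treating each range as closed on the left and open on the right is an assumption.

```python
import pandas as pd

# Label names and salinity bounds copied from the classify.py command above.
labels = ["diatom", "dino1", "dino2", "sediment"]
mins = [33.33, 33.65, 33.70, 33.75]
maxes = [33.65, 33.70, 33.75, 33.93]

# The ranges are contiguous, so the bin edges are the minimums plus the final maximum.
edges = mins + [maxes[-1]]

# Hypothetical salinity readings, for illustration only.
salinity = pd.Series([33.40, 33.68, 33.72, 33.90, 34.10])

# right=False treats each bin as [min, max) (an assumption);
# readings outside every range are left unlabeled (NaN).
assigned = pd.cut(salinity, bins=edges, labels=labels, right=False)
print(assigned.tolist())
```

The last reading falls outside all four ranges, so it receives no label.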

#### Executing notebooks after installation

Start Xming, open a PuTTY window, and run:

    cd dev/stoqsgit && source venv-stoqs/bin/activate
    export DATABASE_URL=postgis://stoqsadm:CHANGEME@127.0.0.1:5438/stoqs
    export STOQS_CAMPAIGNS=stoqs_september2013_t
    cd stoqs/contrib/notebooks
    ../../manage.py shell_plus --notebook

#### Load the stoqs data into a pandas data frame

To find other parameters to put into your data frame, browse http://localhost:8008/stoqs_september2013_o/api/[table_name_here] on the machine where your STOQS server is running. Note: if the parameters are changed, the findings of this notebook may no longer hold. We suggest making such changes only in your own copy of the notebook.

In [5]:
import pandas as pd
mps = MeasuredParameter.objects.using('stoqs_september2013_o').filter(measurement__instantpoint__activity__platform__name='dorado')
# df = pd.DataFrame.from_records(mps.values('measurement__instantpoint__timevalue', 'measurement__depth',
#                                           'measurement__geom', 'parameter__name', 'datavalue', 'id'))
df = pd.DataFrame.from_records(mps.values('measurement__instantpoint__timevalue', 'measurement__depth', 
                                          'measurement__geom', 'parameter__name', 'datavalue', 'id', 
                                          'measuredparameterresource__resource__value'))
# , 
#                                           'measuredparameter__parameter__name',
#                                          'measuredparameter__datavalue', 'resource__name', 'resource__value', ,
#                                          'resource__resourcetype__name'))



## Exploring Data

The `stoqs_september2013_o` dataset contains 849,935 rows of data.

Original Column Names

In [6]:
df.columns

Index(['datavalue', 'id', 'measuredparameterresource__resource__value',
       'measurement__depth', 'measurement__geom',
       'measurement__instantpoint__timevalue', 'parameter__name'],
      dtype='object')

Renamed columns to simpler names

In [7]:
df.columns=['value', 'id', 'resourceValue', 'depth', 'geom', 'time', 'name']
df.head(5)

|   | value | id | resourceValue | depth | geom | time | name |
|---|---|---|---|---|---|---|---|
| 0 |  | 5691927 |  | -0.040161 | [-122.18620594558094, 36.710534112118594] | 2013-09-16 15:40:20 | mepCountList |
| 1 | 828.340949 | 5682174 |  | 31.204571 | [-122.17749112685887, 36.71451561320433] | 2013-09-16 15:50:48 | altitude |
| 2 | 0.435725 | 5681307 |  | 31.204571 | [-122.17749112685887, 36.71451561320433] | 2013-09-16 15:50:48 | spice |
| 3 | 25.585849 | 5672642 |  | 31.204571 | [-122.17749112685887, 36.71451561320433] | 2013-09-16 15:50:48 | sigmat |
| 4 | 44.072166 | 5656179 |  | 31.204571 | [-122.17749112685887, 36.71451561320433] | 2013-09-16 15:50:48 | yaw |
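
The positional assignment to `df.columns` above depends on the order in which `from_records` returns the columns, which may differ between pandas versions. A minimal sketch of an order-independent alternative, renaming through an explicit mapping, is shown below. The single record reuses values from the output above, and the `sample` and `short_names` names are introduced here for illustration only.

```python
import pandas as pd

# One record shaped like the dictionaries returned by mps.values(...),
# using values from the output above.
records = [
    {
        "measurement__instantpoint__timevalue": "2013-09-16 15:50:48",
        "measurement__depth": 31.204571,
        "parameter__name": "altitude",
        "datavalue": 828.340949,
        "id": 5682174,
    },
]
sample = pd.DataFrame.from_records(records)

# Map each original name to its short name; keys absent from the frame are ignored.
short_names = {
    "datavalue": "value",
    "measuredparameterresource__resource__value": "resourceValue",
    "measurement__depth": "depth",
    "measurement__geom": "geom",
    "measurement__instantpoint__timevalue": "time",
    "parameter__name": "name",
}
sample = sample.rename(columns=short_names)
print(sorted(sample.columns))
```

Because the mapping is keyed by name, the same code gives the same result whether the columns arrive sorted alphabetically or in insertion order.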


In [11]:
#df.info()

Looking at the number of null values in the data, we found that, of the columns used, `value` and `resourceValue` are the only ones that contain nulls, and `resourceValue` is entirely null.

In [19]:
df.isnull().sum()

value              8152
id                    0
resourceValue    849935
depth                 0
geom                  0
time                  0
name                  0
dtype: int64
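
To see which parameters account for the null entries in `value`, the null count can be grouped by parameter name. This is a sketch on a hypothetical miniature frame (`sample`) rather than the full STOQS data, so the counts it prints are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the renamed data frame, for illustration only.
sample = pd.DataFrame({
    "value": [np.nan, 828.340949, 0.435725, np.nan],
    "resourceValue": [None, None, None, None],
    "name": ["mepCountList", "altitude", "spice", "mepCountList"],
})

# Null count per column, as in the cell above.
null_counts = sample.isnull().sum()

# Number of null `value` entries for each parameter name.
nulls_by_name = sample["value"].isnull().groupby(sample["name"]).sum()

print(null_counts)
print(nulls_by_name)
```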

Looking at the first and last rows of data, we see that data collection started at 3:50:48 PM on September 16, 2013 and ended at 8:07:44 PM on October 3, 2013.

In [21]:
df.loc[[0]]

|   | value | id | resourceValue | depth | geom | time | name |
|---|---|---|---|---|---|---|---|
| 1 | 828.340949 | 5682174 |  | 31.204571 | [-122.17749112685887, 36.71451561320433] | 2013-09-16 15:50:48 | altitude |


In [20]:
df.loc[[849934]]

|   | value | id | resourceValue | depth | geom | time | name |
|---|---|---|---|---|---|---|---|
| 849934 | 10.29544 | 6301811 |  | 40.741194 | [-121.97556440230393, 36.86074968308047] | 2013-10-03 20:07:44 | temperature |
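
Reading the time span from the first and last rows assumes the rows are ordered by time. A sketch that does not rely on row order takes the minimum and maximum of the `time` column instead. The two timestamps are copied from the rows shown above; the `sample` frame is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Two rows deliberately listed out of time order; timestamps copied from the rows above.
sample = pd.DataFrame({
    "time": ["2013-10-03 20:07:44", "2013-09-16 15:50:48"],
    "name": ["temperature", "altitude"],
})
sample["time"] = pd.to_datetime(sample["time"])

# min() and max() do not depend on the order of the rows.
start, end = sample["time"].min(), sample["time"].max()
print(start, end, end - start)
```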


In [18]:
print(df['resourceValue'].unique())

[None]
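
Since `resourceValue` holds nothing but `None`, one option is to drop it before further analysis. The sketch below shows one way this might be done with `dropna`. The `sample` frame is a hypothetical stand-in for `df`, and whether to drop the column at all is a choice left to the reader.

```python
import pandas as pd

# Hypothetical miniature of the renamed data frame, for illustration only.
sample = pd.DataFrame({
    "value": [828.340949, 0.435725],
    "resourceValue": [None, None],
    "name": ["altitude", "spice"],
})

# True only if at least one entry in the column is non-null.
has_data = sample["resourceValue"].notnull().any()

# how="all" drops only the columns in which every entry is null.
trimmed = sample.dropna(axis=1, how="all")
print(has_data, list(trimmed.columns))
```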
