## <span style="color:#0B3B2E;float:right;font-family:Calibri">Jordan Graesser</span>

# MpGlue
### Land cover samples
---

## In this notebook we will learn to load and visualize land cover samples
* MpGlue has a `classification` interface that provides modules for handling land cover samples, training classification and regression models, and predicting land cover on satellite imagery.
* Let's start with setting up the `classification` class and loading land cover data.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import mpglue as gl

# Let's set the classification() class to something short and easy to use.
CL = gl.classification()

print dir(CL)

['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_centroid_histogram', '_create_indices', '_default_parameters', '_get_feas', '_get_slope', '_index_samples', '_load_model', '_mask_background', '_num_rows_cols', '_plot_colors', '_predict', '_recode_labels', '_remove_classes', '_remove_min_observations', '_remove_outliers', '_scale_p_vars', '_set_model', '_set_parameters', '_stack_samples', '_stratify', '_train_model', '_transform4crf', '_update_class_counts', 'add_variable_names', 'compare_features', 'compare_samples', 'construct_model', 'copy', 'extract_endmembers', 'get_abundance', 'get_class_subsample', 'get_test_train', 'grid_search', 'load4crf', 'model_options', 'optimize_parameters', 'predict', 'predict_array', 'rank_feas', 'recursive_elimination', 'remove_outliers', 'remove_valu

## Loading samples
* The `split_samples` method handles training samples.
* One thing to keep in mind when using `classification` is that results do not need to be assigned to new variables. Everything is stored as an attribute on our class instance (i.e., `CL`).

In [3]:
# Let's look at the split_samples() help.
# help() prints the documentation itself, so no print statement is needed.
help(CL.split_samples)

Help on method split_samples in module mpglue.classification.classification:

split_samples(self, file_name, perc_samp=0.9, perc_samp_each=0, scale_data=False, class_subs=None, norm_struct=True, labs_type='int', recode_dict=None, classes2remove=None, sample_weight=None, ignore_feas=None, use_xy=False, stratified=False, spacing=1000.0, x_label='X', y_label='Y', response_label='response', clear_observations=None, min_observations=10) method of mpglue.classification.classification.classification instance
    Split samples for training and testing.
    
    Args:
        file_name (str or 2d array): Input text file or 2d array with samples and labels.
        perc_samp (Optional[float]): Percent to sample from all samples. Default is .9. *This parameter
            samples from the entire set of samples, regardless of which class they are in.
        perc_samp_each (Optional[float]): Percent to sample from each class. Default is 0. *This parameter
            overrides ``perc_samp`` and fo

### Requirements
* We can see that `file_name` is the only required parameter.
* After running, the `CL` instance contains new attributes.
* Let's load some samples and look at the attributes that were added.

In [3]:
# Setup the samples
samples = '../testing/data/08N_points_merged.txt'

# Load the samples
CL.split_samples(samples)

print dir(CL)

['XY', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_centroid_histogram', '_create_indices', '_default_parameters', '_get_feas', '_get_slope', '_index_samples', '_load_model', '_mask_background', '_num_rows_cols', '_plot_colors', '_predict', '_recode_labels', '_remove_classes', '_remove_min_observations', '_remove_outliers', '_scale_p_vars', '_set_model', '_set_parameters', '_stack_samples', '_stratify', '_train_model', '_transform4crf', '_update_class_counts', 'add_variable_names', 'all_samps', 'class_counts', 'classes', 'classes2remove', 'compare_features', 'compare_samples', 'construct_model', 'copy', 'extract_endmembers', 'file_name', 'get_abundance', 'get_class_subsample', 'get_test_train', 'grid_search', 'headers', 'label_idx', 'labels', 'labels_test', 'load4crf', 'min_observ

In [4]:
# the x, y coordinates
print CL.XY

[[-330385.78125   113423.414062]
 [-330385.78125   113423.414062]
 [-330385.78125   113423.414062]
 ..., 
 [-294261.75     -191333.578125]
 [-294261.75     -191333.578125]
 [-294261.75     -191333.578125]]


In [5]:
# the unique land cover classes
print CL.classes

[1, 2, 3, 4, 5, 6]


In [6]:
# the class label vector
print CL.labels

[1 2 1 ..., 1 2 2]


In [7]:
# the number of land cover samples per class
print CL.class_counts

{1: 2104, 2: 1908, 3: 293, 4: 201, 5: 176, 6: 101}


In [8]:
# the image variables
print CL.p_vars

[[ 141.          159.          152.         ...,   70.            1.86339998
    29.        ]
 [ 120.          167.          143.         ...,   79.            2.12459993
    28.        ]
 [ 127.          177.          143.         ...,  203.            0.           27.        ]
 ..., 
 [ 115.          161.          148.         ...,   27.            0.58929998
    23.        ]
 [ 129.          174.          144.         ...,  120.            2.4296
    25.        ]
 [ 153.          167.          161.         ...,   68.            3.00460005
    23.        ]]


* `p_vars` is a 2-d array of (n_samples x n_variables)
* the `labels` are a 1-d vector of n_samples
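
The layout described above can be pictured with a small stand-alone sketch. It uses plain Python lists rather than MpGlue, and the toy values are made up for illustration; the only point is that each row of `p_vars` lines up with one entry of `labels`, and that the class counts follow directly from the label vector.

```python
from collections import Counter

# Toy stand-ins (hypothetical values): 4 samples x 3 image variables.
p_vars = [
    [141.0, 159.0, 152.0],
    [120.0, 167.0, 143.0],
    [127.0, 177.0, 143.0],
    [115.0, 161.0, 148.0],
]
labels = [1, 2, 1, 2]

n_samples = len(p_vars)   # number of rows
n_feas = len(p_vars[0])   # number of columns (image variables)

# Each row of p_vars must have a matching entry in labels.
assert n_samples == len(labels)

# Class counts and the unique classes are derived from the labels alone.
class_counts = dict(Counter(labels))
classes = sorted(class_counts)

print((n_samples, n_feas))
print(classes)
print(class_counts)
```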

In [9]:
print 'Image variables shape: ', CL.p_vars.shape
print 'Number of image variables: ', CL.n_feas
print 'Number of class label samples: ', len(CL.labels)
print 'Class counts: ', CL.class_counts

Image variables shape:  (4783, 41)
Number of image variables:  41
Number of class label samples:  4783
Class counts:  {1: 2104, 2: 1908, 3: 293, 4: 201, 5: 176, 6: 101}


#### We have 4,783 land cover samples and 41 image variables. 
* Let's reload the data, but this time include the x, y coordinates as variables. 
* Also, we saw earlier that the samples are unbalanced, so let's take a more balanced sub-sample from our data.

In [11]:
CL.split_samples(samples, 
                 use_xy=True, 
                 class_subs={0: 0, 1: 100, 2: 100, 3: 100, 4: 100, 5: 100, 6: 50, 7: 100, 9: 100, 10: 100})

print 'Image variables shape: ', CL.p_vars.shape
print 'Number of class label samples: ', len(CL.labels)
print 'Class counts: ', CL.class_counts

Image variables shape:  (550, 43)
Number of class label samples:  550
Class counts:  {1: 100, 2: 100, 3: 100, 4: 100, 5: 100, 6: 50}
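
The per-class sub-sampling requested through `class_subs` can be pictured roughly as follows. This is a minimal sketch in plain Python, not MpGlue's implementation. It assumes the dictionary values are target counts per class (which is consistent with the counts printed above) and that classes without an entry, or mapped to 0, are dropped. The function name `subsample_per_class` and the `seed` argument are hypothetical.

```python
import random
from collections import Counter


def subsample_per_class(labels, class_subs, seed=0):
    """Return sorted row indices with at most class_subs[c] samples of class c.

    Assumption for this sketch: classes missing from class_subs,
    or mapped to 0, are dropped entirely.
    """
    rng = random.Random(seed)

    # Group the row indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    keep = []
    for label, indices in by_class.items():
        # Never ask for more samples than the class actually has.
        n_wanted = min(class_subs.get(label, 0), len(indices))
        keep.extend(rng.sample(indices, n_wanted))

    return sorted(keep)


# Toy label vector: 20 samples of class 1, 15 of class 2, 4 of class 3.
labels = [1] * 20 + [2] * 15 + [3] * 4

keep = subsample_per_class(labels, {1: 5, 2: 5, 3: 5, 7: 5})
sub_counts = Counter(labels[i] for i in keep)
print(dict(sub_counts))
```

In this toy example class 3 only has four samples, so it contributes four rather than the five requested, and class 7 is ignored because it does not occur in the labels.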


#### The x, y coordinates are prepended onto p_vars.
* You don't have to keep every land cover class separate. The `split_samples` method can aggregate classes on-the-fly. In the example below we will merge classes 5-10 (say we have various types of forest cover classes).

In [12]:
CL.split_samples(samples, 
                 recode_dict={1:1, 2:2, 3:3, 4:4, 5:5, 6:5, 7:5, 8:5, 9:5, 10:5}, 
                 perc_samp=1.)

print CL.class_counts

{1: 2340, 2: 2120, 3: 325, 4: 225, 5: 305}
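
Conceptually, recoding is a dictionary lookup applied to every label. The sketch below shows the idea in plain Python on a made-up label vector. It is not how MpGlue implements `recode_dict`, and leaving labels without an entry unchanged is an assumption of this sketch; `recode_labels` is a hypothetical helper.

```python
from collections import Counter


def recode_labels(labels, recode_dict):
    """Map each label through recode_dict.

    Assumption for this sketch: labels without an entry are left unchanged.
    """
    return [recode_dict.get(label, label) for label in labels]


# Toy label vector containing classes 1 through 6.
labels = [1, 2, 5, 6, 6, 3, 5, 4]

# Merge class 6 into class 5; all other classes map to themselves.
recode_dict = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 5}

merged = recode_labels(labels, recode_dict)
merged_counts = dict(Counter(merged))

print(merged)
print(merged_counts)
```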


#### The class counts are higher now because `perc_samp=1.` keeps 100% of the samples instead of the default 90%.
* You might also find, after testing, that a class either doesn't have sufficient samples or is not being estimated. Such a class can be removed.
* Here we will remove classes 0 and 5-10 (our imaginary forest classes).

In [13]:
CL.split_samples(samples, 
                 classes2remove=[0, 5, 6, 7, 8, 9, 10], 
                 perc_samp=1.)

print CL.class_counts

{1: 2340, 2: 2120, 3: 325, 4: 225}
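
Removing classes amounts to filtering the rows and the labels together, so that they stay aligned. The following is a minimal plain-Python sketch of that idea on made-up data; `remove_classes` is a hypothetical helper and not part of MpGlue's public interface.

```python
def remove_classes(rows, labels, classes2remove):
    """Drop every sample whose label is listed in classes2remove."""
    drop = set(classes2remove)

    # Filter rows and labels as pairs so they cannot get out of step.
    kept = [(row, label) for row, label in zip(rows, labels) if label not in drop]

    kept_rows = [row for row, _ in kept]
    kept_labels = [label for _, label in kept]
    return kept_rows, kept_labels


# Toy data: one row of image variables per label.
rows = [[10.0, 1.0], [20.0, 2.0], [30.0, 3.0], [40.0, 4.0], [50.0, 5.0]]
labels = [1, 5, 2, 6, 1]

kept_rows, kept_labels = remove_classes(rows, labels, [0, 5, 6, 7, 8, 9, 10])

print(kept_labels)
print(kept_rows)
```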


### Other options
* Grid-based stratification

In [14]:
# Here we sample 70% of the samples to use for model
# training, and the remaining 30% will be used for
# model testing. Further, the samples are drawn in a
# randomly stratified fashion. That is, a sample is
# chosen from each 10 km x 10 km grid cell until 70% of
# the samples is reached. The grid is generated on-the-fly.
CL.split_samples(samples, 
                 perc_samp=.7, 
                 stratified=True, 
                 spacing=10000.)

Stratifying ...
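
One way to picture grid-based stratification is sketched below in plain Python. It follows the description in the cell above (draw one sample per grid cell, repeatedly, until the requested share is reached), but the details are assumptions and may differ from what MpGlue does internally. The function `grid_stratified_split`, its `seed` argument, and the toy coordinates are hypothetical.

```python
import math
import random


def grid_stratified_split(xy, perc_samp=0.7, spacing=10000.0, seed=0):
    """Split sample indices into train/test by cycling over grid cells."""
    rng = random.Random(seed)

    # Assign every sample to a square grid cell of side length `spacing`.
    cells = {}
    for idx, (x, y) in enumerate(xy):
        key = (int(math.floor(x / spacing)), int(math.floor(y / spacing)))
        cells.setdefault(key, []).append(idx)

    # Shuffle within each cell so that the draw from a cell is random.
    for members in cells.values():
        rng.shuffle(members)

    n_target = min(int(round(perc_samp * len(xy))), len(xy))

    # Take one sample from each non-empty cell in turn until the target is met.
    train = []
    while len(train) < n_target:
        for key in sorted(cells):
            if cells[key] and len(train) < n_target:
                train.append(cells[key].pop())

    train_set = set(train)
    test = [i for i in range(len(xy)) if i not in train_set]
    return sorted(train), test


# Toy coordinates: 6 samples in one grid cell and 4 in another.
xy = [(-330385.8, 113423.4)] * 6 + [(-294261.8, -191333.6)] * 4

train, test = grid_stratified_split(xy, perc_samp=0.7, spacing=10000.0)
print(len(train))
print(len(test))
```

Because the draw alternates between cells, the smaller cell is represented in the training set more heavily than a purely random 70% draw would typically give it, which is the point of stratifying by location.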


## Visualizing data

In [None]:
# plot two features (1 and 4) against each other in feature space
CL.vis_data(1, 4)

In [20]:
# plot three features
# CL.vis_data(1, 2, fea_3=4)