# Introduction to mapclassify

`mapclassify` implements a family of classification schemes for choropleth maps.
It focuses on determining the number of classes and assigning observations to those classes.
It is intended for use with downstream mapping and geovisualization packages (see [geopandas](https://geopandas.org/mapping.html) and [geoplot](https://residentmario.github.io/geoplot/user_guide/Customizing_Plots.html) for examples) that handle the rendering of the maps.

This notebook presents the basic functionality of `mapclassify`.

In [28]:
import mapclassify as mc
mc.__version__

'2.3.0'

## Example data
`mapclassify` ships with a built-in dataset of employment density for the 58 California counties.

In [29]:
y = mc.load_example()

## Basic Functionality
All classifiers in `mapclassify` have a common interface and afford similar functionality. We illustrate these using the `MaximumBreaks` classifier.
`MaximumBreaks` requires that the user specify the number of classes `k`. Given this, the classifier sorts the observations in ascending order and computes the difference between each pair of rank-adjacent values. The class boundaries are placed at the $k-1$ largest of these rank-adjacent gaps in the sorted values.
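
The rule described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the library's implementation: the function name `maximum_breaks_sketch` and the toy data are hypothetical, and placing each break at the midpoint of its gap is an assumption of this sketch.

```python
def maximum_breaks_sketch(values, k):
    """Return k upper bounds: one per widest gap, plus the maximum value."""
    xs = sorted(values)
    # Gap between each pair of rank-adjacent values, paired with its left index.
    gaps = [(xs[i + 1] - xs[i], i) for i in range(len(xs) - 1)]
    # Keep the k-1 widest gaps.
    widest = sorted(gaps, reverse=True)[: k - 1]
    # Assumption: each break sits at the midpoint of its gap.
    breaks = sorted((xs[i] + xs[i + 1]) / 2 for _, i in widest)
    # The last class is closed by the maximum value.
    return breaks + [xs[-1]]


toy = [1, 2, 3, 10, 11, 30]  # hypothetical data
bounds = maximum_breaks_sketch(toy, k=3)
print(bounds)  # [6.5, 20.5, 30]
```

In the toy data the two widest gaps are 11 to 30 and 3 to 10, so the three classes are bounded above by 6.5, 20.5, and the maximum, 30.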

In [30]:
mc.MaximumBreaks(y, k=4)

MaximumBreaks             

     Interval        Count
--------------------------
[   0.13,  228.49] |    52
( 228.49,  546.67] |     4
( 546.67, 2417.15] |     1
(2417.15, 4111.45] |     1

The classifier returns an instance of `MaximumBreaks` that reports the resulting intervals and counts. The first class has closed lower and upper bounds: 
`[   0.13,  228.49]`, with `0.13` being the minimum value in the dataset:

In [31]:
y.min()

0.13

Subsequent intervals are open on the lower bound and closed on the upper bound. The fourth class has the maximum value as its closed upper bound:

In [32]:
y.max()

4111.45

Assigning the classifier to a variable lets us inspect other aspects of the classifier:


In [33]:
mb4 = mc.MaximumBreaks(y, k=4)

In [34]:
mb4

MaximumBreaks             

     Interval        Count
--------------------------
[   0.13,  228.49] |    52
( 228.49,  546.67] |     4
( 546.67, 2417.15] |     1
(2417.15, 4111.45] |     1

The `bins` attribute has the upper bounds of the intervals:

In [35]:
mb4.bins

array([ 228.49 ,  546.675, 2417.15 , 4111.45 ])

and `counts` reports the number of values falling in each bin:


In [36]:
mb4.counts

array([52,  4,  1,  1])

The specific bin (i.e. label) for each observation can be found in the `yb` attribute:

In [37]:
mb4.yb

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
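
To make the link between `bins`, `yb`, and `counts` concrete, the following sketch assigns a label to a value by finding the first upper bound that is greater than or equal to it, which matches intervals that are open below and closed above. It uses only the standard library; the helper `label` and the `sample` values are hypothetical, and this is not necessarily how `mapclassify` computes `yb` internally.

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds copied from the `mb4.bins` output.
bins = [228.49, 546.675, 2417.15, 4111.45]


def label(value, upper_bounds):
    """Index of the first upper bound that is >= value."""
    return bisect_left(upper_bounds, value)


# Hypothetical observations, chosen to probe the interval edges.
sample = [0.13, 228.49, 228.50, 546.675, 1000.0, 4111.45]
labels = [label(v, bins) for v in sample]
counts = Counter(labels)

print(labels)  # [0, 0, 1, 1, 2, 3]
print(counts)
```

A value equal to an upper bound (such as `228.49`) stays in that class, while anything just above it moves to the next one. A value above the last bound would receive an out-of-range index in this sketch.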

## Changing the number of classes
Staying with the same classifier, the user can apply the same classification rule for a different number of classes:

In [47]:
mc.MaximumBreaks(y, k=7)

MaximumBreaks             

     Interval        Count
--------------------------
[   0.13,  146.00] |    50
( 146.00,  228.49] |     2
( 228.49,  291.02] |     1
( 291.02,  350.21] |     2
( 350.21,  546.67] |     1
( 546.67, 2417.15] |     1
(2417.15, 4111.45] |     1

In [48]:
mb7 = mc.MaximumBreaks(y, k=7)

In [49]:
mb7.bins

array([ 146.005,  228.49 ,  291.02 ,  350.21 ,  546.675, 2417.15 ,
       4111.45 ])

In [50]:
mb7.counts

array([50,  2,  1,  2,  1,  1,  1])

In [51]:
mb7.yb

array([3, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 1, 0, 0, 0, 6, 0, 0, 3, 0, 2, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

One additional attribute to mention here is the `adcm` attribute:

In [43]:
mb7.adcm

727.3200000000002

`adcm` is a measure of fit, defined as the sum of the absolute deviations of the observations around their class medians. Lower values indicate a tighter fit.

In [44]:
mb4.adcm

1181.4900000000002

The `adcm` can be expected to decrease as $k$ increases for a given classifier. Thus, as a measure of fit, `adcm` should only be used to compare classifiers defined on the same number of classes.
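
A small worked example may help show why more classes tend to lower the value. The sketch below assumes that `adcm` totals the absolute deviation of each observation from the median of its class; the function `adcm_sketch`, the toy data, and the hand-assigned labels are all hypothetical.

```python
from statistics import median


def adcm_sketch(values, labels):
    """Sum of absolute deviations of each value from its class median."""
    total = 0.0
    for c in set(labels):
        members = [v for v, lab in zip(values, labels) if lab == c]
        m = median(members)
        total += sum(abs(v - m) for v in members)
    return total


toy = [1, 2, 3, 10, 11, 30]     # hypothetical data
labels_k2 = [0, 0, 0, 0, 0, 1]  # split at the single widest gap
labels_k3 = [0, 0, 0, 1, 1, 2]  # split at the two widest gaps

adcm_k2 = adcm_sketch(toy, labels_k2)
adcm_k3 = adcm_sketch(toy, labels_k3)
print(adcm_k2, adcm_k3)  # 18.0 3.0
```

Splitting the first class of the two-class solution removes most of the deviation, so the three-class value is smaller even though the classification rule is unchanged.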

## Next Steps
`MaximumBreaks` is but one of many classifiers in `mapclassify`:

In [53]:
mc.classifiers.CLASSIFIERS

('BoxPlot',
 'EqualInterval',
 'FisherJenks',
 'FisherJenksSampled',
 'HeadTailBreaks',
 'JenksCaspall',
 'JenksCaspallForced',
 'JenksCaspallSampled',
 'MaxP',
 'MaximumBreaks',
 'NaturalBreaks',
 'Quantiles',
 'Percentiles',
 'StdMean',
 'UserDefined')

To learn more about an individual classifier, introspection is available:

In [54]:
mc.MaximumBreaks?

Init signature: mc.MaximumBreaks(y, k=5, mindiff=0)
Docstring:
Maximum Breaks Map Classification

Parameters
----------
y  : array
     (n, 1), values to classify

k  : int
     number of classes required

mindiff : float
          The minimum difference between class breaks

Attributes
----------
yb : array
     (n, 1), bin ids for observations
bins : array
       (k, 1), the upper bounds of each class
k    : int
       the number of classes
counts : array
         (k, 1), the number of observations falling in each class (numpy
         array k x 1)

Examples
--------
>>> import mapclassify as mc
>>> cal = mc.load_example()
>>> mb = mc.MaximumBreaks(cal, k = 5)
>>> mb.k
5
>>> mb.bins
array([ 146.005,  228.49 ,  546.675, 2417.15 , 4111.45 ])
>>> mb.counts
array([50,  2,  4,  1,  1])
File:

For more comprehensive applications of `mapclassify`, the interested reader is directed to the chapter on [choropleth mapping](https://geographicdata.science/book/notebooks/05_choropleth.html) in [Rey, Arribas-Bel, and Wolf (2020) "Geographic Data Science with PySAL and the PyData Stack"](https://geographicdata.science/book).