# Using cut to bin data in a pandas DataFrame

In this tutorial, we will explore how to bin data in a pandas DataFrame using the **cut** function. 

## What is binning? 

Binning is a way to group data into smaller containers called bins. For example, in surveys, an age question might collect data into ranges. An example of non-overlapping age bins might be: 0 - 24, 25 - 34, 35 - 49, 50 - 69, 70+.
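As a small illustration of the age example, here is a minimal sketch that assumes a handful of made-up ages and one non-overlapping reading of the ranges (0 to 24, 25 to 34, 35 to 49, 50 to 69, 70 and up). The variable names and label strings are hypothetical, chosen only for this sketch.

```python
import pandas

# Hypothetical ages, invented purely for illustration.
ages = pandas.Series([3, 24, 25, 40, 69, 70, 88])

# Bin edges; with right=False each bin includes its left edge
# and excludes its right edge, so 25 falls into "25-34".
edges = [0, 25, 35, 50, 70, float("inf")]
names = ["0-24", "25-34", "35-49", "50-69", "70+"]

age_groups = pandas.cut(ages, bins=edges, labels=names, right=False)
print(age_groups.tolist())
# ['0-24', '0-24', '25-34', '35-49', '50-69', '70+', '70+']
```

Using an infinite last edge is one way to express an open-ended bin such as 70+.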

## Tutorial

Let's see how we can bin data using the pandas **cut** function.

First we import the **pandas** and **numpy** libraries. We'll use numpy to generate some sample data.

In [1]:
import numpy
import pandas

Now let's generate 1000 random samples using the **random.normal** function. This will be the data that we plan to bin. In the normal function, we pass in three arguments:

* loc - the mean of the normal distribution
* scale - standard deviation
* size - number of samples in the returned numpy array

In our case, we want 1000 samples, but we'll only print out the first 10 samples as a sanity check of the data.

In [2]:
samples = numpy.random.normal(loc=0.5, scale=0.2, size=1000)
print(samples[0:10])

[0.42145724 0.51517573 0.26586396 0.89178226 0.36243388 0.41743652
 0.76768077 0.42668549 0.48943054 0.4796403 ]
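Because no seed is set, the numbers above will differ on every run. If repeatable samples are wanted, one option is numpy's Generator interface. The sketch below assumes an arbitrary seed of 42 and is not the method used in the rest of this tutorial.

```python
import numpy

# Seed 42 is an arbitrary choice; any fixed integer gives repeatable draws.
rng = numpy.random.default_rng(42)
samples = rng.normal(loc=0.5, scale=0.2, size=1000)

# A second generator with the same seed reproduces the same samples.
rng_again = numpy.random.default_rng(42)
samples_again = rng_again.normal(loc=0.5, scale=0.2, size=1000)

print(samples.shape)                              # (1000,)
print(numpy.array_equal(samples, samples_again))  # True
```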


Next we generate some fake labels to pretend we have a binary classification. We will not bin this data; it only gives us a second column to aggregate later. Again, we'll print out only the first 10 samples as a sanity check.

In [3]:
labels = numpy.random.choice(numpy.arange(2), size=1000, replace=True)
print(labels[0:10])

[0 1 1 1 1 0 1 0 0 0]


Let's put these two arrays together to form a pandas DataFrame:

In [4]:
df = pandas.DataFrame({'labels': labels, 'samples': samples})
print(df.head(10))

   labels   samples
0       0  0.421457
1       1  0.515176
2       1  0.265864
3       1  0.891782
4       1  0.362434
5       0  0.417437
6       1  0.767681
7       0  0.426685
8       0  0.489431
9       0  0.479640


To create our bins, we use the **linspace** function in numpy to generate an array of evenly spaced numbers. The arguments for linspace are:

* start - first number in our array
* stop - last number in our array
* num - how many numbers in our array

In [5]:
hist_bins = numpy.linspace(start=0.0, stop=1.0, num=21)
print(hist_bins)

[0.   0.05 0.1  0.15 0.2  0.25 0.3  0.35 0.4  0.45 0.5  0.55 0.6  0.65
 0.7  0.75 0.8  0.85 0.9  0.95 1.  ]
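Note that linspace returns bin edges, not bins: 21 edges enclose 20 bins. A quick sketch of that relationship, where `n_bins` is a hypothetical name introduced only for this example:

```python
import numpy

n_bins = 20  # hypothetical variable: how many bins we want

# n_bins bins need n_bins + 1 edges.
edges = numpy.linspace(start=0.0, stop=1.0, num=n_bins + 1)
widths = numpy.diff(edges)  # distance between neighbouring edges

print(len(edges))                    # 21
print(len(widths))                   # 20
print(numpy.allclose(widths, 0.05))  # True
```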


Now we can indicate which sample is in which bin with the **cut** function in pandas. We will bin the samples column, using the hist_bins values as the bin boundaries. Note that I am also passing in **right=False** so that each bin includes its left boundary and excludes its right boundary. If you want the right side to be closed and the left side to be open, pass in **right=True**, which is the default. Samples that fall outside every bin are assigned NaN.

In [6]:
df['bins'] = pandas.cut(df['samples'], bins=hist_bins, right=False)
print(df.head(10))

   labels   samples         bins
0       0  0.421457  [0.4, 0.45)
1       1  0.515176  [0.5, 0.55)
2       1  0.265864  [0.25, 0.3)
3       1  0.891782  [0.85, 0.9)
4       1  0.362434  [0.35, 0.4)
5       0  0.417437  [0.4, 0.45)
6       1  0.767681  [0.75, 0.8)
7       0  0.426685  [0.4, 0.45)
8       0  0.489431  [0.45, 0.5)
9       0  0.479640  [0.45, 0.5)
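To see how the **right** argument changes which boundary belongs to a bin, here is a minimal sketch on four made-up values. The values and edges are assumptions picked to land exactly on the boundaries and outside the range.

```python
import pandas

# Made-up values: two sit on bin edges, 1.2 is outside every bin.
values = pandas.Series([0.0, 0.5, 1.0, 1.2])
edges = [0.0, 0.5, 1.0]

left_closed = pandas.cut(values, bins=edges, right=False)   # [a, b)
right_closed = pandas.cut(values, bins=edges, right=True)   # (a, b]

# With right=False, 1.0 is excluded from the last bin and becomes NaN.
print(left_closed.isna().tolist())   # [False, False, True, True]

# With right=True, 0.0 is excluded from the first bin and becomes NaN.
print(right_closed.isna().tolist())  # [True, False, False, True]
```

Either way, a value that lies outside all bins (1.2 here) is not binned at all.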


Binning becomes useful when we want to apply an operation to each bin. For our example, let's do a **groupby** operation on the bins and then aggregate the labels data by performing a **count** operation on it. Then we can do a **cumsum** (cumulative sum) on the labels count.

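This is all a contrived example, of course, to give you ideas for your specific use case. Note that the cumulative sum in the output ends at 995 rather than 1000: the remaining 5 samples fell outside the range [0.0, 1.0), so **cut** assigned them NaN and **groupby** dropped them.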

In [7]:
grouped = df.groupby(['bins'], as_index=False).agg({
    'labels': pandas.Series.count,
}).rename(columns={'labels': 'labels_count'})
grouped['cumsum'] = grouped.labels_count.cumsum()

print(grouped)

           bins  labels_count  cumsum
0   [0.0, 0.05)            12      12
1   [0.05, 0.1)             7      19
2   [0.1, 0.15)            19      38
3   [0.15, 0.2)            25      63
4   [0.2, 0.25)            44     107
5   [0.25, 0.3)            68     175
6   [0.3, 0.35)            67     242
7   [0.35, 0.4)            87     329
8   [0.4, 0.45)            87     416
9   [0.45, 0.5)            95     511
10  [0.5, 0.55)            83     594
11  [0.55, 0.6)            94     688
12  [0.6, 0.65)            73     761
13  [0.65, 0.7)            70     831
14  [0.7, 0.75)            60     891
15  [0.75, 0.8)            44     935
16  [0.8, 0.85)            19     954
17  [0.85, 0.9)            25     979
18  [0.9, 0.95)            10     989
19  [0.95, 1.0)             6     995
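The same count and cumulative sum can be checked by hand on a tiny frame. The sketch below uses made-up data and passes **observed=False** explicitly, which is an assumption about how you want empty bins handled rather than part of the example above.

```python
import pandas

# Tiny made-up frame; 1.5 lies outside every bin, so cut marks it as NaN.
df = pandas.DataFrame({
    'labels': [0, 1, 1, 0, 1],
    'samples': [0.1, 0.2, 0.6, 0.7, 1.5],
})
df['bins'] = pandas.cut(df['samples'], bins=[0.0, 0.5, 1.0], right=False)

# observed=False keeps every bin category in the result, even empty ones.
# Rows whose bin is NaN are dropped by groupby.
counts = df.groupby('bins', observed=False)['labels'].count()
cumulative = counts.cumsum()

print(counts.tolist())          # [2, 2]
print(cumulative.tolist())      # [2, 4]
print(df['bins'].isna().sum())  # 1
```

The cumulative total is 4 rather than 5 because the out-of-range sample never reaches a bin.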


## Summary

In this article, we learned how to bin our data using the pandas **cut** function, so that we could later perform some aggregate operations on the data. I hope you found this tutorial useful.