# Slicing data in pandas
This is the second post in a series on indexing and selecting data in pandas. If you haven't read it yet, see [the first post](xxx), which covers the basics of selecting by index label or by relative numerical position. In this post, I'm going to review slicing, which is a core Python topic but has a few subtleties when applied to pandas. Then we'll revisit the selection methods from the last post, this time with slicing applied.

## What is a slice, and what is slicing?
A slice is defined in the [Python glossary](https://docs.python.org/3/glossary.html#term-slice).
> An object usually containing a portion of a sequence. A slice is created using the subscript notation, [] with colons between numbers when several are given, such as in variable_name[1:3:5]. The bracket (subscript) notation uses slice objects internally.

Slicing is defined in the [Python reference](https://docs.python.org/3/reference/expressions.html#slicing).
> a slicing selects a range of items in a sequence object (e.g., a string, tuple or list). Slicings may be used as expressions or as targets in assignment or del statements. 

A slicing has three parts: start, stop, and step. The ```slice``` class is a Python builtin. If you instantiate a ```slice``` object explicitly, you can supply a single value (the stop), or the start and stop and, optionally, the step. In terms of what is selected from the sequence, the start is inclusive, the stop is exclusive, and the step determines how you count from start to stop. You can use either slice objects or slice notation, but unless you need to store the slice for re-use, you'll usually see slice notation in code.

So let's look at a few examples to clear this up. We'll start with a string as a sequence object.

In [1]:
w = "abcdefg"

# all of these select the same slice of w, "abc"

w[slice(0, 3, 1)]  # This is slice(start=0, stop=3, step=1)
w[0:3:1]           # This is the same, but in slice notation
w[slice(3)]        # start=0, step=1 are the defaults
w[:3]              # same here

# select the entire sequence: start=0, stop=len(w) (so the last element is included), step=1
w[:]

'abcdefg'
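
As a small sketch of the re-use case mentioned above, a slice object can be stored in a variable and applied to any sequence. The names ```first_three``` and ```every_other``` are just illustrative. The ```indices``` method shows the concrete start, stop, and step that Python resolves for a sequence of a given length.

```python
# Store slices in variables so they can be re-used across sequences.
first_three = slice(3)               # same as [:3]
every_other = slice(None, None, 2)   # same as [::2]

w = "abcdefg"
nums = [10, 20, 30, 40, 50]

print(w[first_three])       # 'abc'
print(nums[first_three])    # [10, 20, 30]
print(w[every_other])       # 'aceg'

# Values you don't supply are stored as None on the slice object.
print(first_three.start, first_three.stop, first_three.step)   # None 3 None

# indices(length) fills in the defaults for a sequence of that length.
print(first_three.indices(len(w)))   # (0, 3, 1)
print(every_other.indices(len(w)))   # (0, 7, 2)
```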

Slices are a concise and effective way to pull out portions of your data. I'll just cover a few examples here.

In [2]:
w[::-1] # reverse a string (with a negative step, the defaults run from the last element back to the first)

'gfedcba'

In [3]:
w[1:3] # a portion of a string

'bc'
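
A few more cases are worth a quick sketch, since they come up again with positional slicing in pandas: negative values count from the end, and a stop past the end of the sequence is quietly clamped rather than raising an error. The variable names here are only for illustration.

```python
w = "abcdefg"

last_two = w[-2:]        # negative start counts from the end
all_but_last = w[:-1]    # negative stop also counts from the end
clamped = w[2:100]       # a stop past the end is clamped, no IndexError
empty = w[5:2]           # start after stop with a positive step selects nothing
backwards = w[5:2:-1]    # with a negative step, start comes after stop

print(last_two)       # 'fg'
print(all_but_last)   # 'abcdef'
print(clamped)        # 'cdefg'
print(repr(empty))    # ''
print(backwards)      # 'fed'
```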

Strings are immutable, so you can't modify or delete their elements in place. The same is true of tuples.

In [4]:
try:
    w[2] = 'z'
except TypeError as te:
    print(te)
    
try:
    t = (0, 1, 2)
    t[0] = -1
except TypeError as te:
    print(te)

'str' object does not support item assignment
'tuple' object does not support item assignment


We'll switch to using a list to show some more applications of slicing.

In [5]:
l = [0, 1, 2, 3, 4, 5, 6]

# just as with strings, these all work on a list, and return [0, 1, 2]
l[slice(0,3,1)]
l[0:3:1]
l[slice(3)]
l[:3]

# this returns a full (shallow) copy of the list
l[:]

[0, 1, 2, 3, 4, 5, 6]

In [6]:
# using step
l[::2]

[0, 2, 4, 6]

In [7]:
# can also assign
l[1::2] = [-1, -1, -1]
l

[0, -1, 2, -1, 4, -1, 6]
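
The reference quoted earlier also mentions ```del``` statements. Here is a short sketch of deleting with a slice, plus the one rule to remember for assignment: a plain slice can be replaced by a sequence of any length, but a slice with a step needs a replacement of exactly the same length.

```python
l = [0, 1, 2, 3, 4, 5, 6]

# Delete every other element, starting from the first.
del l[::2]
print(l)    # [1, 3, 5]

# A plain slice (no step) can be replaced by a sequence of a different length.
l[1:2] = [10, 20, 30]
print(l)    # [1, 10, 20, 30, 5]

# A slice with a step requires a replacement of matching length.
try:
    l[::2] = [0]
except ValueError as ve:
    error_message = str(ve)
    print(error_message)
```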

## On to pandas
OK, after a quick overview of slicing, let's move on to pandas and see why we're talking about slicing at all.


Just like the last post, I'll grab a dataset from the [Chicago Data Portal](https://data.cityofchicago.org), but this time we'll grab the [list of Chicago public libraries](https://data.cityofchicago.org/Education/Libraries-Locations-Hours-and-Contact-Information/x8fc-8rcq).

In [8]:
import pandas as pd
import numpy as np

# you should be able to grab this dataset as an unauthenticated user, but you can be rate limited
df = pd.read_json("https://data.cityofchicago.org/resource/x8fc-8rcq.json")

In [9]:
df.dtypes

name_                          object
address                        object
city                           object
state                          object
zip                             int64
phone                          object
website                        object
location                       object
:@computed_region_rpca_8um6     int64
:@computed_region_vrxf_vc4k     int64
:@computed_region_6mkv_f3dw     int64
:@computed_region_bdys_3d7i     int64
:@computed_region_43wa_7qmu     int64
hours_of_operation             object
dtype: object

In [10]:
# trim down the columns
df = df[['name_', 'hours_of_operation', 'address', 'city', 'state', 'zip', 'phone', 'website', 'location']]

In [11]:
df.head(3)

Unnamed: 0,name_,hours_of_operation,address,city,state,zip,phone,website,location
0,Jefferson Park,,5363 W. Lawrence Ave.,Chicago,IL,60630,Closed for Construction,{'url': 'https://www.chipublib.org/locations/3...,"{'latitude': '41.96759739182978', 'longitude':..."
1,Merlo,,644 W. Belmont Ave.,Chicago,IL,60657,Closed for Construction,{'url': 'https://www.chipublib.org/locations/5...,"{'latitude': '41.940084208613214', 'longitude'..."
2,Douglass,,3353 W. 13th St.,Chicago,IL,60623,Closed for Construction,{'url': 'https://www.chipublib.org/locations/2...,"{'latitude': '41.864500759742604', 'longitude'..."


## Slicing in pandas
Now in pandas (and the underlying NumPy data structures), slicing is also supported, and you can use it via the indexing operator (```[]```), ```.loc```, and ```.iloc```.  Just like in the last post, we'll walk through these three methods and look at slicing for both ```Series``` and ```DataFrame``` objects. Also, as last time, to prevent confusion between location based indexing and label indexing, I'm going to update the index of our sample data to be string based for the initial explanations.

In [12]:
# this will make our examples a bit more obvious
df.index = "ID-" + df.index.astype(str)
# Use the name_ column for a sample Series.
s = df['name_']

## Slicing with ```[]```
Just like basic indexing, we'll see that slicing with ```.loc``` and ```.iloc``` is preferred to using ```[]```. Since the behavior of ```[]``` can depend on the index and arguments, it's better to use the more explicit indexers. But it is supported, and there are just a few things to keep in mind.

### Series
Just like standard Python, you can use integer slicing on Series objects, with ```start:stop:step```.

In [13]:
s[:3]            # These all slice the first 3 elements in the Series
s[0:3]
s[0:3:1]
s[slice(0,3,1)]

ID-0    Jefferson Park
ID-1             Merlo
ID-2          Douglass
Name: name_, dtype: object

In a ```Series```, we can also slice on our index, which in this case is an index of objects (strings). Note that with label slicing both the start and the stop are included, which is different from regular Python slicing!

In [14]:
s['ID-4':'ID-9']

ID-4              Altgeld
ID-5       Archer Heights
ID-6               Austin
ID-7        Austin-Irving
ID-8               Avalon
ID-9    Back of the Yards
Name: name_, dtype: object
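
To make the inclusive/exclusive difference easier to see, here is a minimal sketch on a tiny made-up ```Series``` (```s_demo``` is not part of the library dataset). I use ```.iloc``` for the positional slice so the comparison doesn't depend on how ```[]``` interprets integers.

```python
import pandas as pd

# A tiny stand-in Series with string labels.
s_demo = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_position = s_demo.iloc[0:2]   # positions 0 and 1, stop is excluded
by_label = s_demo["a":"c"]       # labels a through c, stop is included

print(by_position.index.tolist())   # ['a', 'b']
print(by_label.index.tolist())      # ['a', 'b', 'c']
```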

You can specify a step as well.

In [15]:
s['ID-10':'ID-15':2]

ID-10    South Shore
ID-12       Bezazian
ID-14       Brainerd
Name: name_, dtype: object

Note that your labels have to be in the index (at least for an index like ours that isn't sorted), or you'll get a ```KeyError```.

In [16]:
try:
    s['ID-10':'ID-999']
except KeyError as ke:
    print(ke)

'ID-999'


You can also (potentially) assign to or update a slice. Note that using ```[]``` is not the recommended way to do this, and you may get a ```SettingWithCopyWarning``` because ```s``` was selected from ```df```. More on this in future posts (or read the pandas docs for details).

In [17]:
s['ID-79':] = "Changed"
s.tail()

ID-76             West Pullman
ID-77                West Town
ID-78    Whitney M. Young, Jr.
ID-79                  Changed
ID-80                  Changed
Name: name_, dtype: object
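
Until those future posts, one way to stay out of trouble is to be explicit about what you intend to change. The sketch below uses a small made-up frame (```df_demo``` and its ```name``` column are assumptions for illustration): take an explicit ```.copy()``` if you want an independent ```Series```, or assign through ```.loc``` on the ```DataFrame``` itself if you want to update the original.

```python
import pandas as pd

df_demo = pd.DataFrame({"name": ["x", "y", "z"]},
                       index=["ID-0", "ID-1", "ID-2"])

# Option 1: work on an explicit copy, leaving df_demo untouched.
s_copy = df_demo["name"].copy()
s_copy.loc["ID-1":] = "Changed"

print(s_copy.tolist())            # ['x', 'Changed', 'Changed']
print(df_demo["name"].tolist())   # ['x', 'y', 'z']

# Option 2: update the DataFrame itself with a row slice and a column label.
df_demo.loc["ID-2":, "name"] = "Updated"
print(df_demo["name"].tolist())   # ['x', 'y', 'Updated']
```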

### DataFrame
In a ```DataFrame```, slicing with ```[]``` will **slice on the rows**. If you remember back to the last post, selection on a ```DataFrame``` using ```[]``` selects the columns first, so this is a bit confusing. Just remember, if you see a ```:```, it's slicing on rows. 

In [18]:
df[1:3]

Unnamed: 0,name_,hours_of_operation,address,city,state,zip,phone,website,location
ID-1,Merlo,,644 W. Belmont Ave.,Chicago,IL,60657,Closed for Construction,{'url': 'https://www.chipublib.org/locations/5...,"{'latitude': '41.940084208613214', 'longitude'..."
ID-2,Douglass,,3353 W. 13th St.,Chicago,IL,60623,Closed for Construction,{'url': 'https://www.chipublib.org/locations/2...,"{'latitude': '41.864500759742604', 'longitude'..."


You can also use the step.

In [19]:
df['ID-3':'ID-7':2]

Unnamed: 0,name_,hours_of_operation,address,city,state,zip,phone,website,location
ID-3,Albany Park,"Sun., Closed; Mon. & Wed., 10-6; Tue. & Thu., ...",3401 W. Foster Ave.,Chicago,IL,60625,(773) 539-5450,{'url': 'https://www.chipublib.org/locations/3/'},"{'latitude': '41.975456', 'longitude': '-87.71..."
ID-5,Archer Heights,"Sun., Closed; Mon. & Wed., Noon-8; Tue. & Thu....",5055 S. Archer Ave.,Chicago,IL,60632,(312) 747-9241,{'url': 'https://www.chipublib.org/locations/5/'},"{'latitude': '41.8012136599335', 'longitude': ..."
ID-7,Austin-Irving,"Sun., Closed; Mon. & Wed., Noon-8; Tue. & Thu....",6100 W. Irving Park Rd.,Chicago,IL,60634,(312) 744-6222,{'url': 'https://www.chipublib.org/locations/7/'},"{'latitude': '41.95317390064158', 'longitude':..."


## Slicing with ```.loc```
Just like with basic selection, ```.loc``` is meant for labels and ```.iloc``` for positions. With ```.loc``` you cannot pass in integer positions unless they match labels in the index you are using. In our case, the index holds strings, so integer slices are rejected with a ```TypeError```.

Note that with ```.loc```, we will get both the start and stop values if they are in the index, just like with ```[]```.

### Series
For our example data, you can't slice with integers (since they aren't in the index), but we can slice by labels.

In [20]:
try:
    s.loc[1:3]
except TypeError as te:
    print(te)
    
s.loc['ID-1':'ID-3']

cannot do slice indexing on Index with these indexers [1] of type int


ID-1          Merlo
ID-2       Douglass
ID-3    Albany Park
Name: name_, dtype: object

Assignment or updates using slices (can) work with ```.loc```. We'll talk about updates in more detail in the future.

In [60]:
s.loc['ID-79':] = "Changed again"
s.tail()

ID-76             West Pullman
ID-77                West Town
ID-78    Whitney M. Young, Jr.
ID-79            Changed again
ID-80            Changed again
Name: name_, dtype: object

### DataFrame
And things are similar for a ```DataFrame```.

In [22]:
try:
    df.loc[1:3]
except TypeError as te:
    print(te)
    
df.loc['ID-1':'ID-3']

cannot do slice indexing on Index with these indexers [1] of type int


Unnamed: 0,name_,hours_of_operation,address,city,state,zip,phone,website,location
ID-1,Merlo,,644 W. Belmont Ave.,Chicago,IL,60657,Closed for Construction,{'url': 'https://www.chipublib.org/locations/5...,"{'latitude': '41.940084208613214', 'longitude'..."
ID-2,Douglass,,3353 W. 13th St.,Chicago,IL,60623,Closed for Construction,{'url': 'https://www.chipublib.org/locations/2...,"{'latitude': '41.864500759742604', 'longitude'..."
ID-3,Albany Park,"Sun., Closed; Mon. & Wed., 10-6; Tue. & Thu., ...",3401 W. Foster Ave.,Chicago,IL,60625,(773) 539-5450,{'url': 'https://www.chipublib.org/locations/3/'},"{'latitude': '41.975456', 'longitude': '-87.71..."


Now with a ```DataFrame```, you can slice both on the index and the columns, by label.

In [23]:
df.loc['ID-10':'ID-20':2, "name_":"address"]

Unnamed: 0,name_,hours_of_operation,address
ID-10,South Shore,,2505 E. 73rd St.
ID-12,Bezazian,"Sun., Closed; Mon. & Wed., Noon-8; Tue. & Thu....",1226 W. Ainslie St.
ID-14,Brainerd,"Sun., Closed; Mon. & Wed., 10-6; Tue. & Thu., ...",1350 W. 89th St.
ID-16,Bucktown-Wicker Park,"Sun., Closed; Mon. & Wed., Noon-8; Tue. & Thu....",1701 N. Milwaukee Ave.
ID-18,Canaryville,"Sun., Closed; Mon. & Wed., Noon-8; Tue. & Thu....",642 W. 43rd St.
ID-20,Chicago Lawn,"Sun., Closed; Mon. & Wed., 10-6; Tue. & Thu., ...",6120 S. Kedzie Ave.


I'll also point out a common idiom in pandas code: using a full slice (```:```) to select everything along an axis when you don't want to limit the selection on it.

In [24]:
df['name_']           # this is one way to select a single column, for example
df.loc[:,'name_']     # but this is how you select that column using .loc

ID-0            Jefferson Park
ID-1                     Merlo
ID-2                  Douglass
ID-3               Albany Park
ID-4                   Altgeld
                 ...          
ID-76             West Pullman
ID-77                West Town
ID-78    Whitney M. Young, Jr.
ID-79            Changed again
ID-80            Changed again
Name: name_, Length: 81, dtype: object
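
The full slice works on either axis. As a quick sketch on a made-up frame (```df_demo``` is hypothetical), you can keep all rows while slicing columns, or keep all columns while slicing rows:

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
                       index=["x", "y", "z"])

all_rows_some_cols = df_demo.loc[:, "a":"b"]   # every row, columns a through b
some_rows_all_cols = df_demo.loc["y":, :]      # rows from y onward, every column

print(all_rows_some_cols.columns.tolist())   # ['a', 'b']
print(all_rows_some_cols.shape)              # (3, 2)
print(some_rows_all_cols.index.tolist())     # ['y', 'z']
print(some_rows_all_cols.shape)              # (2, 3)
```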

## Slicing with ```.iloc```
Remember, ```.iloc``` is for strictly integer based indexing, so as you'd expect, it doesn't work with labels. But also note that ```.iloc``` works like standard Python slicing, i.e. the stop value is not included.

### Series

In [25]:
try:
    s.iloc["ID-1":"ID-3"]  # not this way with .iloc
except TypeError as te:
    print(te)
s.iloc[1:3]                # this way

cannot do positional indexing on Index with these indexers [ID-1] of type str


ID-1       Merlo
ID-2    Douglass
Name: name_, dtype: object

### DataFrame
Similarly, with a ```DataFrame```, ```.iloc``` selects rows with the first argument and columns with the second.

In [26]:
df.iloc[1:3]

Unnamed: 0,name_,hours_of_operation,address,city,state,zip,phone,website,location
ID-1,Merlo,,644 W. Belmont Ave.,Chicago,IL,60657,Closed for Construction,{'url': 'https://www.chipublib.org/locations/5...,"{'latitude': '41.940084208613214', 'longitude'..."
ID-2,Douglass,,3353 W. 13th St.,Chicago,IL,60623,Closed for Construction,{'url': 'https://www.chipublib.org/locations/2...,"{'latitude': '41.864500759742604', 'longitude'..."


In [27]:
df.iloc[1:3, 0:2]

Unnamed: 0,name_,hours_of_operation
ID-1,Merlo,
ID-2,Douglass,


In [28]:
df.iloc[1:10:3, 2:-4]

Unnamed: 0,address,city,state
ID-1,644 W. Belmont Ave.,Chicago,IL
ID-4,13281 S. Corliss Ave.,Chicago,IL
ID-7,6100 W. Irving Park Rd.,Chicago,IL


## Some special cases
There are always some special cases to consider, so I'll go through those last. 

### Changing behavior for integer based indexes
You'll remember that I changed the index on our first test ```DataFrame``` to be a string based index instead of the standard ```RangeIndex``` of sequential integer values. The following example shows why, and by now you should be able to see why it works this way.


In [29]:
df2 = pd.DataFrame({"a": [0, 1, 2, 3], "b": [4, 5, 6, 7], "c": [8, 9, 10, 11]})
df2

Unnamed: 0,a,b,c
0,0,4,8
1,1,5,9
2,2,6,10
3,3,7,11


In [30]:
df2.loc[0:2]  # Remember, .loc is label based

Unnamed: 0,a,b,c
0,0,4,8
1,1,5,9
2,2,6,10


In [31]:
df2.iloc[0:2]  # And .iloc is position based

Unnamed: 0,a,b,c
0,0,4,8
1,1,5,9


In [32]:
df2[0:2]   # So here [] behaves like position based, even though we may be trying to use our index

Unnamed: 0,a,b,c
0,0,4,8
1,1,5,9


In [33]:
df2.index = ["w", "x", "y", "z"]
df2["w":"y"]  # now, [] behaves like it's label based. 

Unnamed: 0,a,b,c
w,0,4,8
x,1,5,9
y,2,6,10


Hopefully this example reinforces for you why you'll want to use ```.loc``` and ```.iloc``` in production code, especially when you're taking values to select rows or columns from unknown sources!
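
The difference is even easier to spot when the integer labels don't line up with the positions. A minimal sketch, using a made-up ```Series``` (```s_demo``` is hypothetical) with a sorted integer index that starts at 10:

```python
import pandas as pd

s_demo = pd.Series(["a", "b", "c", "d"], index=[10, 20, 30, 40])

by_label = s_demo.loc[10:30]     # labels 10 through 30, stop included
by_position = s_demo.iloc[0:2]   # positions 0 and 1, stop excluded
no_match = s_demo.loc[0:2]       # no labels fall between 0 and 2

print(by_label.tolist())      # ['a', 'b', 'c']
print(by_position.tolist())   # ['a', 'b']
print(len(no_match))          # 0
```

Being explicit with ```.loc``` or ```.iloc``` means the reader never has to guess which of these you meant.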

### Label slicing and sorted indexes
We saw that using ```.loc``` will include both the start and stop values of a slice, which is not the way standard Python works. You can also end up with indexes that are not sorted, and these behave slightly differently than you might expect. Essentially, with a sorted index, all values between the start and stop are included (including the value for the stop label), even if the start or stop labels themselves are not in the index. With an unsorted index, both labels must be present.

In [54]:
s2 = pd.Series([1, 2, 3, 4, 5], index=[5, 2, 1, 9, 7])
s3 = pd.Series([1, 2, 3, 4, 5], index=[0, 1, 2, 3, 4])

s3.loc[0:3] # works just as you'd expect for .loc (not standard python slicing, includes end label)

0    1
1    2
2    3
3    4
dtype: int64

In [36]:
s3.loc[0:9]  # no exception raised, even though 9 is not in the index, and includes all values between them

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [37]:
try:
    s2.loc[0:9]   # hmmm, this won't work
except KeyError as ke:
    print(ke)

0


In [38]:
s2.sort_index().loc[0:9] # this works though.

1    3
2    2
5    1
7    5
9    4
dtype: int64

The reason for this behavior is that it would be too expensive to sort first, and sorting might not return the values you'd expect anyway, since it may not make sense for your index. Just be aware of this when you are dealing with non-sorted indexes. It's important to understand the indexes on your data and ensure they accurately reflect your data and what you're trying to accomplish by indexing and slicing it.
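
If you aren't sure whether an index is sorted, you can check ```is_monotonic_increasing``` before slicing by label. The helper below, ```sorted_label_slice```, is a hypothetical name and only a sketch of one possible approach; whether sorting is appropriate depends on what your index means.

```python
import pandas as pd

s2 = pd.Series([1, 2, 3, 4, 5], index=[5, 2, 1, 9, 7])

print(s2.index.is_monotonic_increasing)   # False

# When both labels exist in an unsorted (unique) index, the slice runs
# from the position of the start label to the position of the stop label.
between = s2.loc[2:9]
print(between.tolist())   # [2, 3, 4]

def sorted_label_slice(series, start, stop):
    """Hypothetical helper: sort the index if needed, then slice by label."""
    if not series.index.is_monotonic_increasing:
        series = series.sort_index()
    return series.loc[start:stop]

result = sorted_label_slice(s2, 0, 9)
print(result.index.tolist())   # [1, 2, 5, 7, 9]
print(result.tolist())         # [3, 2, 1, 5, 4]
```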

### The Ellipsis and NumPy slicing
This is a little known part of Python, but in working through these examples I came across ```Ellipsis``` while reading the slicing documentation. ```Ellipsis``` is a built-in constant rather than a keyword. It is a singleton (like ```None```) and is intended mostly for use in extended slicing syntax by user-defined container data types. It is written with the ellipsis literal, ```...```.

It doesn't appear to be supported much by pandas beyond ```Series```, but it is more useful in NumPy. The way to think of it is as a special value that expands into as many full slices (```:```) as are needed to cover the dimensions you don't index explicitly. This makes more sense with an example.

In [51]:
...   # the singleton

Ellipsis

In [52]:
m = np.arange(27).reshape((3,3,3))  # 3 x 3 x 3 multi-dimensional array
print("Full array:\n", m)
print("First element in last dimension:\n", m[:,:,0])
print("With ellipsis:\n", m[...,0])
print("Last element in last dimension:\n", m[...,-1])

Full array:
 [[[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]]

 [[ 9 10 11]
  [12 13 14]
  [15 16 17]]

 [[18 19 20]
  [21 22 23]
  [24 25 26]]]
First element in last dimension:
 [[ 0  3  6]
 [ 9 12 15]
 [18 21 24]]
With ellipsis:
 [[ 0  3  6]
 [ 9 12 15]
 [18 21 24]]
Last element in last dimension:
 [[ 2  5  8]
 [11 14 17]
 [20 23 26]]


This could come in handy if you end up dealing with highly dimensioned arrays and want to save some typing. Or if you just want to try to stump someone with some obscure Python knowledge.
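
If you're curious what a container actually receives, here is a minimal sketch. ```ShowKey``` is a made-up class whose ```__getitem__``` simply returns the key it was given, which makes the slice objects and the ```Ellipsis``` singleton visible.

```python
class ShowKey:
    """Hypothetical container that returns whatever key [] passes in."""

    def __getitem__(self, key):
        return key


show = ShowKey()

print(show[1:3])       # slice(1, 3, None)
print(show[..., 0])    # (Ellipsis, 0)
print(show[:, ::2])    # (slice(None, None, None), slice(None, None, 2))
print(... is Ellipsis) # True
```

It is then up to the container (NumPy, in the example above) to decide what a tuple containing ```Ellipsis``` means.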

### Wrapping it up
OK, that should be enough for now. Hopefully you've learned something about slicing. I know I have in doing this writeup. 

Next I'll move on to boolean indexing, a very powerful way to select any bit of data in your ```Series``` or ```DataFrame``` using any logic you can think of.