# Agenda: `.loc` in Pandas

1. Retrieving from a series
    - Retrieving individual indexes
    - Don't forget `.iloc`, as well!
    - Boolean indexes / mask indexes
2. Retrieving from a data frame
    - Retrieving via the index
    - Retrieving via row + columns
3. More advanced topics
    - Using `lambda` with `.loc`

In [2]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [3]:
np.random.seed(0)  # make sure everyone has the same values
s = Series(np.random.randint(0, 1000, 10),
           index=list('abcdefghij'))
s

a    684
b    559
c    629
d    192
e    835
f    763
g    707
h    359
i      9
j    723
dtype: int64

In [4]:
# how can I retrieve from my series?
# the first thing you probably learned in working with Pandas is to use []

s['a']

np.int64(684)

# Don't use `[]` to retrieve from a series!

Why not?

1. It's a bad habit to get into, because on a data frame `[]` with a label retrieves a column, not a row, so it'll fail you
2. It's not nearly as powerful as `.loc`
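
To make the first point concrete, here is a minimal sketch using a small, made-up data frame (the name `demo` and its labels are assumptions for illustration, not part of this session). With a single label, `[]` looks for a column, so a row label isn't found, while `.loc` retrieves the row.

```python
import pandas as pd

# Hypothetical 2x2 data frame: rows 'a' and 'b', columns 'v' and 'w'
demo = pd.DataFrame({'v': [1, 2], 'w': [3, 4]}, index=['a', 'b'])

# [] with a single label selects a column, so a row label is not found
try:
    demo['a']
    row_lookup_failed = False
except KeyError:
    row_lookup_failed = True

print(row_lookup_failed)        # did [] fail to find the row label 'a'?
print(demo.loc['a'].tolist())   # .loc retrieves the row by its label
```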

# How do I retrieve?

1. `.iloc`
2. `.loc`

`.iloc` allows us to retrieve via the numeric position, just as we would with a traditional Python sequence such as a list -- the positions start at 0, and go up to `len(s) - 1`

In [6]:
s.iloc[0]

np.int64(684)

In [7]:
s.iloc[1]

np.int64(559)

In [8]:
s.iloc[-1]  # this returns the final element

np.int64(723)

In [9]:
s.iloc[3:7]  # from position 3 until (not including) position 7

d    192
e    835
f    763
g    707
dtype: int64

In [10]:
# most of the time, we're going to want to use .loc
# .loc works with the index as we've defined it

In [11]:
s.loc[0]  # this will give us an error!

KeyError: 0

In [12]:
s.loc['a']

np.int64(684)

In [13]:
s.loc['b']

np.int64(559)

In [15]:
s.loc['b':'e'] # when we're using .loc, we get a slice UP TO AND INCLUDING the endpoint!

b    559
c    629
d    192
e    835
dtype: int64
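
The difference between the two kinds of slice is easy to forget, so here is a small side-by-side sketch. The series `demo` and its values are made up for illustration; only the slicing behavior matters.

```python
import pandas as pd

# Hypothetical series with letter labels and simple values
demo = pd.Series([10, 20, 30, 40, 50], index=list('abcde'))

by_label = demo.loc['b':'d']     # label slice: includes the endpoint 'd'
by_position = demo.iloc[1:3]     # position slice: excludes position 3

print(by_label.index.tolist())
print(by_position.index.tolist())
```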

In [16]:
# fancy indexing -- I pass a list of values I want inside of the square brackets
# meaning, I have double square brackets -- outside ones let me pass arguments to .loc, and the inside ones are a list
s.loc[['b', 'd', 'f']]

b    559
d    192
f    763
dtype: int64

In [17]:
s.loc[['b', 'd', 'f', 'b', 'd', 'f']]

b    559
d    192
f    763
b    559
d    192
f    763
dtype: int64

In [18]:
# weird, but it works: when a label is repeated in the index, .loc returns every match
s.loc[['b', 'd', 'f', 'b', 'd', 'f']].loc['b']

b    559
b    559
dtype: int64

In [19]:
# you probably know about broadcasting in Pandas
# meaning -- I can combine a series with a scalar value via an operator, and the operation will be applied to every element of the series

s + 10

a    694
b    569
c    639
d    202
e    845
f    773
g    717
h    369
i     19
j    733
dtype: int64

In [20]:
s / 3

a    228.000000
b    186.333333
c    209.666667
d     64.000000
e    278.333333
f    254.333333
g    235.666667
h    119.666667
i      3.000000
j    241.000000
dtype: float64

In [21]:
# I can also use comparison operators
s > 500

a     True
b     True
c     True
d    False
e     True
f     True
g     True
h    False
i    False
j     True
dtype: bool

In [22]:
# you can actually apply a boolean series, with .loc, to a series
# only the elements matching a True boolean value will emerge (the "mask" in the "mask index")

s.loc[ [True, False, True, False, True, False, True, False, True, False] ]

a    684
c    629
e    835
g    707
i      9
dtype: int64

In [23]:
s.loc[[True, False]]

IndexError: Boolean index has wrong length: 2 instead of 10

In [25]:
# I want all elements of s where s > 500

s.loc[ s > 500 ]

a    684
b    559
c    629
e    835
f    763
g    707
j    723
dtype: int64
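
One detail worth knowing when the mask is a boolean series rather than a plain list: Pandas matches the mask to the data by index label, not by position. The following is a minimal sketch with made-up values; the names `data` and `mask` are assumptions for illustration.

```python
import pandas as pd

data = pd.Series([10, 20, 30], index=list('abc'))

# Boolean series whose index is in a different order than data's index
mask = pd.Series([True, False, False], index=list('cba'))

# The True belongs to label 'c', so 'c' is selected, even though
# the True sits in the first position of the mask
result = data.loc[mask]
print(result)
```

In `s.loc[s > 500]` the mask comes from `s` itself, so the labels always line up and this never surprises you.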

In [26]:
# the expression inside the [] is evaluated before .loc uses it
# in this code, we first evaluate s.mean(), then we compare s > that result
# that gives us a boolean series, which is then applied to s.loc

s.loc[ s > s.mean() ]

a    684
b    559
c    629
e    835
f    763
g    707
j    723
dtype: int64

In [27]:
# what if I want all of the values that are greater than the mean + 1 standard deviation?

s.loc[s > s.mean() + s.std() ]

e    835
dtype: int64
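
A natural follow-up is asking for values inside a range, which means combining two comparisons. This sketch rebuilds `s` from the values printed above and uses `&` to keep the elements within one standard deviation of the mean; the names `low`, `high` and `within_one_std` are assumptions for illustration.

```python
import pandas as pd

# The same values as the series s shown above
s = pd.Series([684, 559, 629, 192, 835, 763, 707, 359, 9, 723],
              index=list('abcdefghij'))

low = s.mean() - s.std()
high = s.mean() + s.std()

# Each comparison needs its own parentheses, because & binds
# more tightly than > and <
within_one_std = s.loc[(s > low) & (s < high)]
print(within_one_std.index.tolist())
```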

# Data frames

A data frame is a 2D table, in which we have rows (labeled by the index, same as in a series) and columns (whose names are stored in the `columns` attribute).

Each column is a series, and acts like one. When you retrieve one row, you get a series that Pandas builds on the fly for that request; it isn't stored in the data frame.
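
A quick way to see the difference between columns and rows is a data frame whose columns have different types. This is a sketch with a made-up data frame (`demo`, `n` and `name` are assumptions for illustration): the column keeps its own dtype, while the row has to hold values from every column, so it comes back with the general `object` dtype.

```python
import pandas as pd

# Hypothetical data frame with one integer column and one string column
demo = pd.DataFrame({'n': [1, 2], 'name': ['x', 'y']}, index=['a', 'b'])

col = demo['n']        # a column, returned as a series
row = demo.loc['a']    # a row, assembled into a new series on request

print(type(col).__name__, col.dtype)
print(type(row).__name__, row.dtype, row.name)
```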

In [28]:
df = DataFrame(np.random.randint(0, 1000, [4, 5]),
               index=list('abcd'),
               columns=list('vwxyz'))
df

     v    w    x    y    z
a  277  754  804  599   70
b  472  600  396  314  705
c  486  551   87  174  600
d  849  677  537  845   72


In [29]:
# how can I retrieve a row? Still with .loc and an index
df.loc['a']

v    277
w    754
x    804
y    599
z     70
Name: a, dtype: int64

In [30]:
df.loc[['a', 'c']]

     v    w    x    y    z
a  277  754  804  599   70
c  486  551   87  174  600


In [31]:
df.loc['a':'c']

     v    w    x    y    z
a  277  754  804  599   70
b  472  600  396  314  705
c  486  551   87  174  600


In [33]:
df.loc['a':'c':2]   # from 'a', until (and INCLUDING) 'c', only every other row

     v    w    x    y    z
a  277  754  804  599   70
c  486  551   87  174  600


In [35]:
# what if I just want a column? There, I use []
df['v']

a    277
b    472
c    486
d    849
Name: v, dtype: int64

In [36]:
df[['v', 'w']]   # fancy indexing to get more than one column

     v    w
a  277  754
b  472  600
c  486  551
d  849  677


What if I want to retrieve particular rows and particular columns?
I don't necessarily want all rows and all columns

This is where `.loc` really starts to shine. It has a two-argument version: a row selector and a column selector, separated by a comma.

The way to think about it is:

```python
df.loc[
        # row selector      -- this can be an explicit name, a list of names, a slice, or (as we'll see) a lambda expression
        ,
        # column selector  -- the same is true for columns, although it's usually a name or a list of names
]
```
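
As a minimal sketch of the two-argument form with slices in both positions, the following rebuilds the data frame from the values printed above. The names `block` and `all_rows` are assumptions for illustration.

```python
import pandas as pd

# The data frame shown above, rebuilt from its printed values
df = pd.DataFrame([[277, 754, 804, 599, 70],
                   [472, 600, 396, 314, 705],
                   [486, 551, 87, 174, 600],
                   [849, 677, 537, 845, 72]],
                  index=list('abcd'),
                  columns=list('vwxyz'))

block = df.loc['b':'c', 'w':'y']   # label slices include both endpoints
all_rows = df.loc[:, 'v']          # a bare : means "every row"

print(block)
print(all_rows.tolist())
```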

In [37]:
df.loc[
    'a'   # row selector
    ,
    'v'    # column selector
]

np.int64(277)

In [38]:
df

     v    w    x    y    z
a  277  754  804  599   70
b  472  600  396  314  705
c  486  551   87  174  600
d  849  677  537  845   72


In [39]:
# we can provide a list of names for the index

df.loc[
    ['a', 'd']   # row selector
    ,
    'v'    # column selector
]

a    277
d    849
Name: v, dtype: int64

In [40]:
# we can provide a list of names for the columns

df.loc[
    'a'   # row selector
    ,
    ['v', 'x', 'z']    # column selector
]

v    277
x    804
z     70
Name: a, dtype: int64

In [41]:
# we can provide a list for one or both of them

df.loc[
    ['a', 'c']   # row selector
    ,
    ['v', 'w', 'y']    # column selector
]

     v    w    y
a  277  754  599
c  486  551  174


In [42]:
# what about boolean indexes?

df

     v    w    x    y    z
a  277  754  804  599   70
b  472  600  396  314  705
c  486  551   87  174  600
d  849  677  537  845   72


In [44]:
# I want all of the values from column v that are greater than column v's mean

df.loc[
    df['v'] > df['v'].mean()   # boolean series that we're applying
    ,
    'v'
]


d    849
Name: v, dtype: int64

In [45]:
df['v'].mean()

np.float64(521.0)

In [46]:
df['v'] > df['v'].mean()

a    False
b    False
c    False
d     True
Name: v, dtype: bool

In [47]:
# we can use the test on one column, and grab another column's values

df.loc[
    df['v'] > df['v'].median()   # where v > median
    ,
    ['w', 'x']    # show me columns w and x
]


     w    x
c  551   87
d  677  537
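
The same selection can also be written with a callable as the row selector: `.loc` calls the function with the data frame and uses whatever it returns. This is a minimal sketch that rebuilds the data frame above from its printed values; the parameter name `df_` is arbitrary.

```python
import pandas as pd

df = pd.DataFrame([[277, 754, 804, 599, 70],
                   [472, 600, 396, 314, 705],
                   [486, 551, 87, 174, 600],
                   [849, 677, 537, 845, 72]],
                  index=list('abcd'),
                  columns=list('vwxyz'))

result = df.loc[
    lambda df_: df_['v'] > df_['v'].median()   # row selector, called with df
    ,
    ['w', 'x']                                 # column selector
]
print(result)
```

The lambda form is mostly useful when the data frame has no name yet, for example in the middle of a chain of method calls.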


In [48]:
filename = '/Users/reuven/Courses/Current/Data/taxi.csv'

df = pd.read_csv(filename,
                 usecols=['passenger_count', 'total_amount', 'trip_distance'])
df

      passenger_count  trip_distance  total_amount
0                   1           1.63         17.80
1                   1           0.46          8.30
2                   1           0.87         11.00
3                   1           2.13         17.16
4                   1           1.40         10.30
...               ...            ...           ...
9994                1           2.70         12.30
9995                1           4.50         20.30
9996                1           5.59         22.30
9997                6           1.54          7.80


In [None]:
# with this, I can ask questions like:
# what was the trip_distance where we had more than 2 passengers?

df.loc[
    df['passenger_count'] > 2   # row selector
    ,
    'trip_distance'   # column selector
]
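
For readers without the CSV file, here is a sketch of the same kind of query on a few made-up rows. The column names match the ones loaded above, but the values are invented for illustration and are NOT taken from `taxi.csv`; the names `taxi_demo` and `distances` are also assumptions.

```python
import pandas as pd

# Made-up rows for illustration only; these are NOT values from taxi.csv
taxi_demo = pd.DataFrame({'passenger_count': [1, 3, 2, 6],
                          'trip_distance': [1.5, 2.0, 0.5, 4.0],
                          'total_amount': [10.0, 12.5, 6.0, 20.0]})

distances = taxi_demo.loc[
    taxi_demo['passenger_count'] > 2   # row selector
    ,
    'trip_distance'                    # column selector
]

print(distances.tolist())
print(distances.mean())   # the result is a series, so series methods apply
```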