<a href="https://github.com/theonaunheim">
    <img style="border-radius: 100%; float: right;" src="static/strawberry_thief_square.png" width=10% alt="Theo Naunheim's Github">
</a>

<br style="clear: both">
<hr>
<br>

<h1 align='center'>Basic Indexing</h1>

<br>

<div style="display: table; width: 100%">
    <div style="display: table-row; width: 100%;">
        <div style="display: table-cell; width: 50%; align: middle; text-align: center;">            
            $\large{{{a} ={\begin{pmatrix}a_{1}\\a_{2}\\\vdots \\a_{n}\end{pmatrix}}}}$            
        </div>
        <div style="display: table-cell; width: 10%">
        </div>
        <div style="display: table-cell; width: 40%; vertical-align: top;">
            <blockquote>
                <p style="font-style: italic;">"An index is a great leveller."</p>
                <br>
                <p>-George Bernard Shaw</p>
            </blockquote>
        </div>
    </div>
</div>

<br>



<hr>

In [1]:
# Import stuff so we can use libraries.
import numpy as np
import pandas as pd

## Generally

In Pandas, "indexing" is simply the process of using a key (or "indexer") to select zero or more pieces of data from an object. We can then use this selection to set or get data. We do this by using a look-up structure called an "index". As we've seen, Series are a type of object that can be indexed. Here we're going to look at indexing in more detail.

To reiterate, to perform the **process of "indexing"**: 

* we use an **"indexer" (lookup key)**
* this indexer performs a look up on an **"index" (data structure)**
* this process gives us back a Series or single value, or alternatively allows us to set values for a selected area.

Note: Indexing isn't glamorous, but it is abso-f***ing-lutely crucial to getting stuff done in Pandas.
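To make the three terms concrete, here is a minimal sketch. The `prices` Series and its labels are made up purely for illustration.

```python
import pandas as pd

# Hypothetical data: labels on the left (the index), values on the right.
prices = pd.Series(data=[1.5, 2.0, 0.75], index=['apple', 'pear', 'plum'])

# The "index" is the lookup structure attached to the Series.
lookup_structure = prices.index

# The "indexer" is the key we hand over; here, a single label.
indexer = 'pear'

# "Indexing" is using the indexer against the index to get data...
selected = prices.loc[indexer]

# ...or to set data for the selected area.
prices.loc[indexer] = 2.25
```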

In [2]:
# Let's start with a pd.Series.
# Here we supply an index composed of strings and data composed of integers.
# If we do not supply an index, pandas will give us an integer index.
s = pd.Series(
    data=[0, 1, 2, 3, 4, 5],
    index=['zero', 'one', 'two', 'three', 'four', 'five']
)

# Again, your index (s.index) is on the left
# Your data (s.values) is on the right
s

zero     0
one      1
two      2
three    3
four     4
five     5
dtype: int64

# What kind of indexers and indexes can we use?

Pretty much any [hashable type](https://docs.python.org/3.7/glossary.html). In other words, you can use things like ...

In [3]:
# Like integers (autoconverted to np.int64)
integer_index = pd.Index([100, 200, 300, 400, 500])
integer_index

Int64Index([100, 200, 300, 400, 500], dtype='int64')

In [4]:
# Like floats (autoconverted to np.float64)
float_index = pd.Index([2.1, 2.2, 3.4, 5])
float_index

Float64Index([2.1, 2.2, 3.4, 5.0], dtype='float64')

In [5]:
# Like Python boolean objects
# Notice the non-unique index. This is a bad idea.
boolean_index = pd.Index([True, False, True])
boolean_index

Index([True, False, True], dtype='object')

In [6]:
# Like datetimes (autoconverted to np.datetime64)
# See offset aliases to get your freq on: pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
# Here we'll do every two weeks
# pd.date_range() is used because the DatetimeIndex constructor no longer accepts start/end/freq.
dt_index = pd.date_range(start='2018-01-01', end='2018-12-31', freq='2W')
dt_index

DatetimeIndex(['2018-01-07', '2018-01-21', '2018-02-04', '2018-02-18',
               '2018-03-04', '2018-03-18', '2018-04-01', '2018-04-15',
               '2018-04-29', '2018-05-13', '2018-05-27', '2018-06-10',
               '2018-06-24', '2018-07-08', '2018-07-22', '2018-08-05',
               '2018-08-19', '2018-09-02', '2018-09-16', '2018-09-30',
               '2018-10-14', '2018-10-28', '2018-11-11', '2018-11-25',
               '2018-12-09', '2018-12-23'],
              dtype='datetime64[ns]', freq='2W-SUN')

In [7]:
# Like strings
str_index = pd.Index(['Frankie Knuckles', 'Danny Tenaglia', 'Teddy Rockspin'])
str_index

Index(['Frankie Knuckles', 'Danny Tenaglia', 'Teddy Rockspin'], dtype='object')

In [8]:
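# Like arbitrary Python objects, such as tuples
# (Lists are shown too, but they are not hashable, so avoid them as labels.)
# Note: only the last expression in a cell is displayed.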
tuple_index = pd.Index([(True, 4.4), (True, 4.9), (False, 4.4)])
tuple_index

list_index = pd.Index([[1,2], [2]])
list_index

Index([[1, 2], [2]], dtype='object')
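Whatever the label type, the index's job is to turn a label into a location. A minimal sketch using `Index.get_loc`, with made-up labels, which also shows why the non-unique index above was called a bad idea:

```python
import pandas as pd

# Unique index: get_loc returns the integer position of the label.
unique_index = pd.Index(['zero', 'one', 'two'])
position = unique_index.get_loc('one')

# Non-unique, unsorted index: get_loc returns a boolean mask instead,
# which is one reason duplicate labels are awkward to work with.
duplicated_index = pd.Index(['a', 'b', 'a'])
mask = duplicated_index.get_loc('a')

# is_unique is a quick check before relying on single-label lookups.
print(unique_index.is_unique, duplicated_index.is_unique)
```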

# So how can we index Series?

Again, "indexing" is the process of using a key to get data.

### Bracket Notation

There are a couple of ways to go about it. First, the **least preferred option** is the Python subscription or bracket notation that you've probably seen with lists:

    # Take a list
    my_list    = [0, 1, 2, 3, 4, 5]
    
    # Get an item using __getitem__() magic method
    first_item = my_list[0]
    
    # Get a slice similarly
    list_slice = my_list[0:3]
    
We can also do this with a Series!

In [9]:
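# First, we can use simple list-like subscripting using __getitem__()
# (Recent pandas versions warn about integer-position lookups through plain brackets
# on a labeled Series; s.iloc[0] is the explicit form.)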
item_1 = s[0]
print("Here is a single item from the series:")

item_1

Here is a single item from the series:


0

In [10]:
# Note that if we ask for multiple items, we get a series object back.
# This series shares the exact same index, but chopped down.
sub_series = s[0:3]
print('Here are the first 3 elements of the Series.')
print('The return value is itself a Series.')

sub_series

Here are the first 3 elements of the Series.
The return value is itself a Series.


zero    0
one     1
two     2
dtype: int64
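One reason plain brackets are the least preferred option is that the same key can mean a label or a position depending on the index. A small sketch, assuming a hypothetical Series whose integer labels run in the opposite order to its positions:

```python
import pandas as pd

# Hypothetical Series whose integer labels run opposite to the positions.
letters = pd.Series(data=['a', 'b', 'c'], index=[2, 1, 0])

# With an integer index, plain brackets look the key up as a label...
by_brackets = letters[0]       # label 0 -> 'c'

# ...whereas the explicit accessors say exactly what is meant.
by_label = letters.loc[0]      # label 0 -> 'c'
by_position = letters.iloc[0]  # first element -> 'a'
```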

### The .loc[] attribute, your new best friend

Note: if you don't have a lot of time to spend on this, spend your time learning about loc[] and boolean indexing in the advanced indexing section.

The best way to select something from a Series is the Series [.loc attribute](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.loc.html). loc[] is not a function! It is a special object that uses bracket notation on steroids to mimic languages like R. We feed the loc object our indexer and get our result.

What kind of indexers can we feed it?

In [11]:
# We can feed it a single index, giving us a scalar value
s.loc['three']

3

In [12]:
# Notice that invalid keys will raise a KeyError:
try:
    s.loc['nonexistant key']
except KeyError as e:
    print('Error: ', e)

Error:  'the label [nonexistant key] is not in the [index]'


In [13]:
# We can feed loc a list of keys we want to fetch, giving us a series composed of those indexes.
index_list = ['one', 'three', 'five']
s.loc[index_list]

one      1
three    3
five     5
dtype: int64

In [14]:
# This is identical to the previous example (a list literal inside the indexing brackets),
# but the double brackets can frighten and confuse people
s.loc[['one', 'three', 'five']]

one      1
three    3
five     5
dtype: int64
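A list of keys is strict in the same way a single key is. The sketch below assumes pandas 1.0 or later and a small stand-in Series. It shows that one missing label in the list raises a KeyError, and that `reindex()` is a tolerant alternative.

```python
import pandas as pd

s = pd.Series(data=[0, 1, 2], index=['zero', 'one', 'two'])

# A list containing a label that is not in the index raises KeyError.
try:
    s.loc[['one', 'missing']]
    raised = False
except KeyError:
    raised = True

# reindex() is the tolerant alternative: missing labels come back as NaN.
tolerant = s.reindex(['one', 'missing'])
```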

In [15]:
# We can ask for ranges of keys
s.loc['zero': 'three']

zero     0
one      1
two      2
three    3
dtype: int64

##### Note: Notice anything about the following?

    >>> len(s[0:3])
    3
    
    >>> len(my_list[0:3])
    3
    
    >>> len(s.loc['zero': 'three'])
    4

loc[] slices by label and includes both the first and the last item in your result. Positional slicing (plain brackets with integers, or iloc[]) gives you the first item but not the last. Because R. In all honesty, pretty much anything that doesn't make sense in Pandas is "because R".
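The comparison can be run as a single script. This is a minimal sketch that rebuilds the same Series and list, and uses `iloc` for the positional slice to keep the intent explicit:

```python
import pandas as pd

s = pd.Series(
    data=[0, 1, 2, 3, 4, 5],
    index=['zero', 'one', 'two', 'three', 'four', 'five']
)
my_list = [0, 1, 2, 3, 4, 5]

# Positional slices (lists, iloc) stop before the end position.
list_len = len(my_list[0:3])
iloc_len = len(s.iloc[0:3])

# Label slices (loc) include the end label.
loc_len = len(s.loc['zero':'three'])

print(list_len, iloc_len, loc_len)
```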

### The iloc[] attribute

You can optionally use the [iloc](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.iloc.html) attribute. This is like the loc[] attribute, but it only works with integer positions.

In [16]:
# Get first item
s.iloc[0]

0

In [17]:
# Get 1st, 3rd, and fifth elements
s.iloc[[0, 2, 4]]

zero    0
two     2
four    4
dtype: int64

In [18]:
# Get the 3rd and 4th elements (positions 2 and 3)
# Note: this excludes the upper end of the range, so position 4 (the 5th element) is left out
s.iloc[2:4]

two      2
three    3
dtype: int64
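Since iloc[] works on positions, it follows list conventions in a few other ways too. A short sketch on the same Series, rebuilt here so the block stands alone:

```python
import pandas as pd

s = pd.Series(
    data=[0, 1, 2, 3, 4, 5],
    index=['zero', 'one', 'two', 'three', 'four', 'five']
)

# Negative positions count from the end, as with lists.
last_item = s.iloc[-1]

# A single out-of-range position raises IndexError...
try:
    s.iloc[10]
    raised = False
except IndexError:
    raised = True

# ...but an out-of-range slice is simply truncated.
tail = s.iloc[4:100]
```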

# Setting data via indexing

We can do more than simply get data. We can also set values based on selection. We do that by simply putting the loc[] expression on the left side of the assignment ('=') operator.

In [19]:
# Again, here's our base series
s

zero     0
one      1
two      2
three    3
four     4
five     5
dtype: int64

In [20]:
# Let's first try setting a single value
s.loc['one'] = 8
s

zero     0
one      8
two      2
three    3
four     4
five     5
dtype: int64

In [21]:
# Then with a list of keys. Notice we're changing the original object
# The single value from above is still changed.
s.loc[['two', 'five']] = 9
s

zero     0
one      8
two      9
three    3
four     4
five     9
dtype: int64

In [22]:
# And finally as a range. If you omit the endpoint, it will go all the way to the end.
s.loc['three':] = 10
s

zero      0
one       8
two       9
three    10
four     10
five     10
dtype: int64

In [23]:
# This can also use += style syntax.
s.loc['two': 'four'] += 3
s

zero      0
one       8
two      12
three    13
four     13
five     10
dtype: int64

In [24]:
# Notice that you can assign to keys that don't exist yet, which will add them to your Series.
s.loc['Nonexistant Key'] = 99
s

zero                0
one                 8
two                12
three              13
four               13
five               10
Nonexistant Key    99
dtype: int64

In [25]:
# Be aware that if you assign a value of a different type, the Series will be coerced (here from int64 to float64).
s.loc['one'] = 5.9
s

zero                0.0
one                 5.9
two                12.0
three              13.0
four               13.0
five               10.0
Nonexistant Key    99.0
dtype: float64
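If the change of dtype is intended, one option is to make it explicit. This is a minimal sketch on a small stand-in Series. Newer pandas releases may warn about the implicit upcast, so an explicit `astype` keeps the intent clear:

```python
import pandas as pd

s = pd.Series(data=[0, 8, 12], index=['zero', 'one', 'two'])

# Convert to float up front so the assignment below does not depend on
# implicit coercion.
s = s.astype('float64')
s.loc['one'] = 5.9
```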

---

# Recap
* Pandas uses the process of indexing to select data.
* Indexing uses an index for lookups.
* Indexing uses an indexer as a key (can be many types).
* An indexer can be a slice, a list, or a single element.
* Indexing can be used to get a series or a single value.
* Indexing can be used to set a value or group of values.

---

# Additional Learning Resources

* ### [Official Pandas Indexing and Selecting Data](https://pandas.pydata.org/pandas-docs/stable/indexing.html)
* ### [Index API](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html)
* ### [Returning a View vs. Returning a Copy](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy)


---

# Next Up: [Advanced Indexing](5_advanced_indexing.ipynb)

<br>

$\huge{{x}_{i}}$ 

---