# Agenda: Indexing and dtypes

1. Indexes
    - Setting
    - Resetting
2. `inplace=True`
3. dtypes

In [3]:
import numpy as np
import pandas as pd       # if Pandas isn't yet imported, do so -- and give it the alias "pd"
from pandas import Series, DataFrame  # if Pandas isn't yet imported, do so -- and define Series + DataFrame

In [4]:
s = Series([10, 20, 30, 40, 50])
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [5]:
s.index = list('abcde')
s

a    10
b    20
c    30
d    40
e    50
dtype: int64

In [6]:
# what happens if I don't want this index any more?
# what happens if I want the default index?

s.reset_index()

  index   0
0     a  10
1     b  20
2     c  30
3     d  40
4     e  50


In [7]:
s

a    10
b    20
c    30
d    40
e    50
dtype: int64

# `reset_index`

Many, *many* methods in Pandas don't modify the series/data frame. Rather, they return a new series/data frame, one that reflects the change that we have made.

If we want to "capture" this change in a variable, or even in the original variable we used, then we have to assign.

In [8]:
df = s.reset_index()
df

  index   0
0     a  10
1     b  20
2     c  30
3     d  40
4     e  50
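
A small sketch, not part of the original session: if all we want is the default index back, and we don't need the old index kept as a column, `reset_index` also accepts `drop=True`. On a series, that gives us back a series rather than a data frame. The name `s_default` is just for illustration.

```python
from pandas import Series

s = Series([10, 20, 30, 40, 50], index=list('abcde'))

# drop=True discards the old index instead of turning it into a column
s_default = s.reset_index(drop=True)

print(s_default)
print(list(s.index))   # the original series still has its a-e index
```

As with the other methods here, `s` itself is unchanged; we only see the new index on the returned series.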


# `set_index`

If I have a data frame, and I want to use one of its columns as the index, then I can call `set_index`, indicating which column should be used. I then get back a new data frame with the specified column as an index.

In [10]:
df.set_index('index')

        0
index
a      10
b      20
c      30
d      40
e      50


In [11]:
# if I look at df, what do I see?

df

  index   0
0     a  10
1     b  20
2     c  30
3     d  40
4     e  50


In [12]:
# set_index returns a new data frame, one based on df, but it doesn't modify df.
# in order to do that, you need to assign to df

df = df.set_index('index')
df

        0
index
a      10
b      20
c      30
d      40
e      50


In [13]:
# if I want to get a series back, with just our current index and column 0, I can
# retrieve that column with [0].

df[0]

index
a    10
b    20
c    30
d    40
e    50
Name: 0, dtype: int64

In [15]:
df = df.reset_index()

In [16]:
df

  index   0
0     a  10
1     b  20
2     c  30
3     d  40
4     e  50


In [18]:
# get all values from column 0
# where index is 'c'
df.loc[df['index'] == 'c', 0]

2    30
Name: 0, dtype: int64

In [19]:
# it's far easier and more intuitive to set "index" to be the index,
# and then just use .loc to pull out the value(s) we want at that index

(
    df
    .set_index('index')
    .loc['c']  # here, I'm running .loc not on df, but on the result of running df.set_index
)

0    30
Name: c, dtype: int64

# Exercise: Weather and indexes

1. Create a series in which the index contains the dates in MMDD format ('0520', '0521'), all as strings. The values should be the expected high temperatures for the next 10 days.
2. Use `reset_index`. What do you see?
3. Use a mask index to retrieve the projected high temps for May 22 and May 25.
4. Use `set_index` and `.loc` to achieve the same goal.

In [20]:
s = Series([29, 31, 30, 29, 27, 
            25, 22, 24, 23, 21],
           index='0521 0522 0523 0524 0525 0526 0527 0528 0529 0530'.split())

In [21]:
s

0521    29
0522    31
0523    30
0524    29
0525    27
0526    25
0527    22
0528    24
0529    23
0530    21
dtype: int64

In [22]:
s.reset_index()

  index   0
0  0521  29
1  0522  31
2  0523  30
3  0524  29
4  0525  27
5  0526  25
6  0527  22
7  0528  24
8  0529  23
9  0530  21


In [23]:
s

0521    29
0522    31
0523    30
0524    29
0525    27
0526    25
0527    22
0528    24
0529    23
0530    21
dtype: int64

In [24]:
df = s.reset_index()
df

  index   0
0  0521  29
1  0522  31
2  0523  30
3  0524  29
4  0525  27
5  0526  25
6  0527  22
7  0528  24
8  0529  23
9  0530  21


In [25]:
(df['index'] == '0522') | (df['index'] == '0525')

0    False
1     True
2    False
3    False
4     True
5    False
6    False
7    False
8    False
9    False
Name: index, dtype: bool

In [26]:
# I can apply a boolean series to any series whose index matches

(
    df      # data frame
    [0]     # column 0 in the data frame
    .loc[   # get me all values of df[0] where...
        (df['index'] == '0522') |      # either index is 0522
        (df['index'] == '0525')        # or index is 0525
    ]
)

1    31
4    27
Name: 0, dtype: int64
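
When the list of wanted values grows, chaining `==` comparisons with `|` gets long. One alternative, sketched here as an option rather than as part of the original solution, is the `isin` method, which builds the same kind of boolean series from a list of values. The names `mask` and `wanted` are just for illustration.

```python
from pandas import Series

s = Series([29, 31, 30, 29, 27, 25, 22, 24, 23, 21],
           index='0521 0522 0523 0524 0525 0526 0527 0528 0529 0530'.split())
df = s.reset_index()

# isin returns True wherever the value appears in the given list
mask = df['index'].isin(['0522', '0525'])

# use the boolean series as a row selector, and 0 as the column selector
wanted = df.loc[mask, 0]
print(wanted)
```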

In [29]:
# if we set the index, then we can retrieve these much more easily

(
    df
    .set_index('index')   # make the column named "index" into the index, not a regular column
    .loc[   [   '0522', '0525'   ] ]
)

        0
index
0522   31
0525   27


In [30]:
df

  index   0
0  0521  29
1  0522  31
2  0523  30
3  0524  29
4  0525  27
5  0526  25
6  0527  22
7  0528  24
8  0529  23
9  0530  21


In [31]:
df.loc[3]    # this means: give me the row with index 3

index    0524
0          29
Name: 3, dtype: object

In [32]:
df.loc[ [3, 7, 2] ]  # give me the three rows with indexes 3, 7, and 2

  index   0
3  0524  29
7  0528  24
2  0523  30


In [33]:
df[0] > df[0].mean()

0     True
1     True
2     True
3     True
4     True
5    False
6    False
7    False
8    False
9    False
Name: 0, dtype: bool

In [34]:
# let's apply that boolean series to df[0].loc as a mask index
df[0].loc[    df[0] > df[0].mean()   ]

0    29
1    31
2    30
3    29
4    27
Name: 0, dtype: int64

# `inplace=True`

Many, *many* methods in Pandas don't modify their data, even though series and data frames are mutable. This is both because the core developers want to encourage us to use method chaining, and because it ensures that if multiple variables refer to the same value, neither is surprised when the other modifies it.

There is an option, on many Pandas methods, to pass `inplace=True` as a keyword argument. `set_index` is likely the first one we've encountered. This means that we have two choices:

- If we keep the default, then `set_index` doesn't change the original data frame. Also, the method returns the new data frame.
- If we pass `inplace=True`, then the method returns `None` (a special Python value) and the data frame itself is modified.

Many people see this and say, "I'll always use `inplace=True` because it's going to use less memory." This turns out to be completely false! You have no way of knowing how much memory Pandas is using, or what optimizations it's doing. The core Pandas developers have been begging people for years *not* to use `inplace=True`. But many still do.
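
To make the two choices concrete, here is a minimal sketch using a small data frame like the one from earlier. The names `new_df` and `result` are just for illustration.

```python
from pandas import Series

df = Series([10, 20, 30, 40, 50], index=list('abcde')).reset_index()

# default: df is untouched, and we get a new data frame back
new_df = df.set_index('index')
print(df.columns.tolist())    # "index" is still a regular column in df

# inplace=True: df itself is modified, and the method returns None
result = df.set_index('index', inplace=True)
print(result)
print(df.index.tolist())
```

Because the `inplace=True` call returns `None`, nothing can be chained after it, which is one more reason to prefer assignment.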

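In [None]:
# a sketch of method chaining, with placeholder names (not meant to be run):
# each method returns a new object, so the next method can be called on it
(
    df
    .set_index('whatever')
    .loc['thing']
    .groupby('other')
)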

In [35]:
# let's create a series

s = Series([10, 20, 30, 40, 50])
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

# What does it mean for the `dtype` to be `int64`?

Remember that all of the values in Pandas are stored in NumPy, and that NumPy stores its values in C.

C doesn't have Python integers. It has 8-bit, 16-bit, 32-bit, and 64-bit integers. Every value in a series must have the same dtype -- the type of value that C is using for your data.

If you have a dtype of `np.int64`, each of your ints is stored in 8 bytes (64 bits).

That's the default!

But... there are problems with that. For example, they're huge! Why not use something smaller?

That's where setting the dtype comes in -- it makes your data much smaller, and thus easier to analyze.
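
As a rough illustration of "smaller", the sketch below stores the same 100 values with four integer dtypes and prints how many bytes the values take. The `nbytes` attribute counts only the values, not the index, and the names `values` and `sizes` are just for illustration.

```python
import numpy as np
from pandas import Series

values = list(range(100))    # 0-99, small enough to fit in every integer dtype

sizes = {}
for dtype in (np.int64, np.int32, np.int16, np.int8):
    s = Series(values, dtype=dtype)
    sizes[str(s.dtype)] = s.nbytes   # bytes used by the values alone
    print(s.dtype, s.nbytes)
```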

1. How can we get the dtype of a series?
2. How can we set the dtype when we create the series?
3. How can we modify the dtype of a series?

In [36]:
# we can get the dtype of a series with the .dtype attribute

s.dtype

dtype('int64')

In [38]:
# set the dtype at creation time

s = Series([10, 20, 30, 40, 50], 
            dtype=np.int32)
s

0    10
1    20
2    30
3    40
4    50
dtype: int32

In [39]:
# what else do we have?
s = Series([10, 20, 30, 40, 50],
           dtype=np.int8)
s

0    10
1    20
2    30
3    40
4    50
dtype: int8
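
The third question, changing the dtype of a series that already exists, isn't shown in the cells above. One common way to do it, sketched here, is the `astype` method. Like most Pandas methods, it returns a new series, so we assign the result back.

```python
import numpy as np
from pandas import Series

s = Series([10, 20, 30, 40, 50])     # int64 in the session above

# astype returns a new series with the requested dtype; assign to keep it
s = s.astype(np.int16)
print(s.dtype)
```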

# What dtypes are available?

- Integers
    - `int64`
    - `int32`
    - `int16`
    - `int8`
 
- Floats
    - `float64`
    - `float32`
    - `float16`
- Strings
    - `object` (because they're Python objects)
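
To see what each numeric dtype can actually hold, NumPy provides `np.iinfo` for integer types and `np.finfo` for float types. This sketch just prints their limits; it isn't part of the original session.

```python
import numpy as np

# smallest and largest value each integer dtype can hold
for dtype in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    print(dtype.__name__, info.min, info.max)

# for floats: the largest value, and the approximate number of
# decimal digits of precision
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(dtype.__name__, info.max, info.precision)
```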
    

In [42]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int8

In [44]:
s + 5   # broadcasting

0    15
1    25
2    35
3    45
4    55
dtype: int8

In [45]:
s * 100   # the results (1,000 and up) don't fit in int8, so the values wrap around

0    -24
1    -48
2    -72
3    -96
4   -120
dtype: int8
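
Those negative numbers are integer overflow: `int8` can only hold values from -128 to 127, so 1,000 doesn't fit and wraps around. The sketch below reproduces the first wrapped value by hand, and then shows one way to avoid the problem, assuming we know ahead of time that a wider dtype is needed. The names `wrapped` and `safe` are just for illustration.

```python
import numpy as np
from pandas import Series

s = Series([10, 20, 30, 40, 50], dtype=np.int8)

# 10 * 100 = 1000 needs more than 8 bits; wrapping it into the
# range -128..127 gives the value we saw in the output
wrapped = ((1000 + 128) % 256) - 128
print(wrapped)

# converting to a wider dtype *before* multiplying avoids the overflow
safe = s.astype(np.int16) * 100
print(safe)
```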

In [46]:
100_000_000 * 8    # bytes needed for 100 million int64 values (8 bytes each)

800000000

In [47]:
100_000_000 * 4    # bytes needed for 100 million int32 values (4 bytes each)

400000000

# Exercise: Setting dtypes

1. Create a series of 10 integers with a-j as the index. Then multiply the first and last values by 1,000. Is the dtype adequate? Is it too big? What would be best?
2. Create a series of float values from 1-100 with the dtype `np.float64`. Is this too big? 