# Agenda: Day 2

1. Recap and Q&A
2. Dtypes and `NaN`
3. Data frames (2D data)
    - Creating data frames
    - Retrieving rows
    - Retrieving columns
    - Naming the index and the columns
4. Adding and removing data
5. Useful methods and attributes
6. Boolean indexes
7. Querying with `.loc`
    - Row selectors
    - Column selectors
    - Assigning via `.loc`
8. Reading CSV data

# Recap

- Pandas is for reading, writing, manipulating, cleaning, and analyzing data
- Last time, we talked about the *Series*
- A series contains a bunch of values, all of the same type
- Retrieve from a series using `.loc` (by index) or `.iloc` (by position)
- We can set the index either when we create the series or afterward, by assigning a new value to its `index` attribute
- We can retrieve elements with a mask index, meaning a boolean series passed to `.loc`
- Most operations performed on two series happen via the index
- If we have a series and a scalar value, the operation is "broadcast" to every element of the series
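
To make the point about index-based operations concrete, here is a small sketch with two series whose indexes are in different orders. The labels `x`, `y`, and `z` are made up for illustration. Values are matched by label, not by position, and a label that is missing from one side produces `NaN`, the missing-data marker covered later in this session.

```python
import pandas as pd
from pandas import Series

a = Series([10, 20, 30], index=list('xyz'))
b = Series([1, 2, 3], index=list('zyx'))   # same labels, reversed order

total = a + b        # x: 10 + 3, y: 20 + 2, z: 30 + 1
print(total)

c = Series([100], index=['x'])             # only has the label 'x'
partial = a + c      # y and z have no partner in c, so they become NaN
print(partial)
```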

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [2]:
s = Series([10, 20, 30, 40, 45, 50, 60, 70])

In [3]:
s

0    10
1    20
2    30
3    40
4    45
5    50
6    60
7    70
dtype: int64

In [4]:
s.loc[4]

45

In [5]:
s.loc[[4, 6]]   # fancy indexing

4    45
6    60
dtype: int64

In [6]:
s = Series([10, 20, 30, 40, 45, 50, 60, 70],
          index=list('abcdefgh'))

In [7]:
s

a    10
b    20
c    30
d    40
e    45
f    50
g    60
h    70
dtype: int64

In [8]:
s.loc['d']

40

In [9]:
s.loc[['d', 'f']]

d    40
f    50
dtype: int64

In [10]:
# we can retrieve via the position using .iloc
s.iloc[4]

45

In [11]:
s.iloc[[4, 6]]

e    45
g    60
dtype: int64

In [12]:
s + s    # two series, thus operations are performed by the index



a     20
b     40
c     60
d     80
e     90
f    100
g    120
h    140
dtype: int64

In [13]:
# broadcasting

s + 4

a    14
b    24
c    34
d    44
e    49
f    54
g    64
h    74
dtype: int64

In [14]:
# we can run comparison operations via broadcast, and get a True/False value for each index

s < 50

a     True
b     True
c     True
d     True
e     True
f    False
g    False
h    False
dtype: bool

In [15]:
# if we have a boolean series, we can apply it with .loc
# this returns only those elements of the series for which our booleans are True

s.loc[s<50]

a    10
b    20
c    30
d    40
e    45
dtype: int64

In [16]:
(s<50).value_counts()   

True     5
False    3
dtype: int64

In [17]:
s.describe()

count     8.000000
mean     40.625000
std      20.077973
min      10.000000
25%      27.500000
50%      42.500000
75%      52.500000
max      70.000000
dtype: float64

# Dtypes

If you're a Python programmer, then you might have wondered what the difference is between a "list" and an "array."  In the case of an array, (a) the length is known when it's created and (b) all of the elements are of the same type.

Python lists don't have either of these restrictions! 

Pandas series can be changed in length, so they aren't arrays, either. But they are more similar to arrays, in that all of the elements must be of the same type.

In [18]:
s = Series([10, 20, 30, 40, 50])
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

# `dtype`

Every series has a `dtype`, describing what kind of data is in the series. The `dtype` is typically *not* a Python type, but a special type defined by the NumPy library, which is implemented in C.

In Python, we have integers. But in C, we have integers of different sizes.  The `dtype` allows us to specify how many bits we want to give to each integer. If we don't have integer data, then we have to specify that, too.

In [19]:
# 64 bits == 8 bytes

In [20]:
s = Series([10, 20, 30.5, 40, 50])   # notice, one number is *not* an integer
s

0    10.0
1    20.0
2    30.5
3    40.0
4    50.0
dtype: float64

# What dtypes exist? How can I set one?

Dtypes are all defined in `numpy`, which you can import as 

    import numpy as np
    
Then you can use the following types:

- Integers (ints)
    - `np.int8`
    - `np.int16`
    - `np.int32`
    - `np.int64`  (default if you give integer data)
- Floats 
    - `np.float16`
    - `np.float32`
    - `np.float64` (default if you give float data)
    - `np.float128` (platform-dependent; not available on Windows, for example)
- Unsigned ints
    - `np.uint8`
    - `np.uint16`
    - `np.uint32`
    - `np.uint64`
- `object` (Python objects -- default if you have strings)    
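
To see what each integer dtype can actually hold, NumPy's `np.iinfo` function reports the number of bits and the smallest and largest values. The following is a small sketch; the `ranges` dictionary is just a name chosen for this illustration.

```python
import numpy as np

int_dtypes = [np.int8, np.int16, np.int32, np.int64,
              np.uint8, np.uint16, np.uint32, np.uint64]

# map each dtype's name to its (min, max) range
ranges = {}
for dt in int_dtypes:
    info = np.iinfo(dt)
    name = np.dtype(dt).name
    ranges[name] = (info.min, info.max)
    print(f'{name:>6}: {info.bits:2} bits, {info.min} to {info.max}')
```

The unsigned types start at 0 and reach twice as high as their signed counterparts, because no bit is spent on the sign.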

In [21]:
# when I create a new series, I can tell Pandas what dtypes to use

s = Series([10, 20, 30, 40, 50])
s.dtype  # we can get the dtype with this attribute

dtype('int64')

In [22]:
s = Series([10, 20, 30, 40, 50], 
          dtype=np.float128)

s.dtype  # we can get the dtype with this attribute

dtype('float128')

In [23]:
s

0    10.0
1    20.0
2    30.0
3    40.0
4    50.0
dtype: float128

In [24]:
s.loc[2] = 345   # set a value in our series

In [25]:
s

0     10.0
1     20.0
2    345.0
3     40.0
4     50.0
dtype: float128

In [26]:
# what happens if we assign a value that doesn't match the dtype?

s = Series([10, 20, 30, 40, 50])   # dtype will be the default, np.int64
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [27]:
s.loc[2] = 12.34

In [28]:
s

0    10.00
1    20.00
2    12.34
3    40.00
4    50.00
dtype: float64
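
The automatic switch from `int64` to `float64` shown above is convenient, but more recent versions of Pandas may warn about assignments that silently change a series's dtype. A sketch of the explicit alternative, which converts first and assigns second, and which should behave the same way regardless of version:

```python
import numpy as np
from pandas import Series

s = Series([10, 20, 30, 40, 50])   # int64 by default
s = s.astype(np.float64)           # make room for fractional values first
s.loc[2] = 12.34                   # now the assignment matches the dtype
print(s)
```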

In [29]:
# how can I change a series from one dtype to another?
# you *cannot* assign to the dtype attribute

s.dtype = np.int64

AttributeError: property 'dtype' of 'Series' object has no setter

In [30]:
# you can use the .astype method to get a new series back,
# based on your current series, with a new type

s.astype(np.int64)  # anything after a decimal point was lost

0    10
1    20
2    12
3    40
4    50
dtype: int64

In [31]:
# if I want to "convert" a series from float to int,
# run .astype, and assign the result back to the original series

s = s.astype(np.int64)

In [32]:
s

0    10
1    20
2    12
3    40
4    50
dtype: int64
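
Note that `.astype` with an integer dtype truncates: it drops the fractional part rather than rounding. If rounding is what you want, one approach is to call `.round()` before converting. A small sketch with made-up values:

```python
import numpy as np
from pandas import Series

values = Series([1.7, -1.7, 12.34])

truncated = values.astype(np.int64)          # drops the fractional part
rounded = values.round().astype(np.int64)    # rounds first, then converts

print(truncated)
print(rounded)
```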

In [33]:
s = Series('10 20 30 40 50'.split())
s

0    10
1    20
2    30
3    40
4    50
dtype: object

In [34]:
s + s   # what happens when I add this series to itself?

0    1010
1    2020
2    3030
3    4040
4    5050
dtype: object

In [35]:
# if I want to turn this series into a bunch of integers, I can use astype

# here, I replaced the original (string/object) version of s with 
# an integer version, assigning it back to the same variable

s = s.astype(np.int64)
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [36]:
s + s

0     20
1     40
2     60
3     80
4    100
dtype: int64

In [37]:
x = 5

x + x

10

In [38]:
s = Series([10, 20, 30, 40, 50])
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

# How much memory does this series take up?

5 elements * 8 bytes = 40 bytes

In [40]:
s.memory_usage()  # more than 40 bytes, because the total also counts the index

172

Since this series only contains small numbers, maybe I can/should use smaller integers.

How much memory would it take if I use 8-bit integers?

5 elements * 1 byte = 5 bytes

In [41]:
s = Series([10, 20, 30, 40, 50], dtype=np.int8)
s

0    10
1    20
2    30
3    40
4    50
dtype: int8

In [42]:
s.memory_usage()

137

In [43]:
172 - 137

35
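
The 35-byte difference matches our arithmetic: 40 bytes of values versus 5. To check that directly, we can ask `memory_usage` to leave out the index by passing `index=False`. A short sketch (the names `s64` and `s8` are just for this example):

```python
import numpy as np
from pandas import Series

s64 = Series([10, 20, 30, 40, 50], dtype=np.int64)
s8 = Series([10, 20, 30, 40, 50], dtype=np.int8)

# index=False leaves out the index, so only the values are counted
print(s64.memory_usage(index=False))   # 5 elements * 8 bytes
print(s8.memory_usage(index=False))    # 5 elements * 1 byte

# both series have the same kind of index, so the difference
# between the two totals comes entirely from the values
saved = s64.memory_usage() - s8.memory_usage()
print(saved)
```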

# What's wrong with using `int8`?

Nothing, if you want to stay with small numbers...but if they get big, bad news!

In [44]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int8

In [45]:
s * 10

0    100
1    -56
2     44
3   -112
4    -12
dtype: int8

In [46]:
# 8 bits gives us 256 (2 ** 8) distinct values
# since our integers are signed, that gives us from -128 to 127
# anything outside of that range will be "wrapped around"

# if your dtype is too small, you will LOSE DATA and Pandas won't warn you!
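
One way to guard against this, sketched here under the assumption that we know which calculation is coming, is to check the largest expected result against the dtype's range and widen the series with `.astype` before doing the arithmetic:

```python
import numpy as np
from pandas import Series

s = Series([10, 20, 30, 40, 50], dtype=np.int8)

# np.iinfo reports the range that a given integer dtype can hold
limits = np.iinfo(np.int8)
largest_result = int(s.max()) * 10      # computed with a plain Python int
fits = limits.min <= largest_result <= limits.max
print(fits)

# 500 does not fit in int8, so widen the series before multiplying
safe = s.astype(np.int16) * 10
print(safe)
```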

In [47]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int8

In [48]:
s.loc[2] = 99999

In [49]:
s

0       10
1       20
2    99999
3       40
4       50
dtype: int32

# Why not always use `np.int64`? 

Answer: you'll use tons of memory unnecessarily.

Consider some data with 10 million data points:

- 10 million * 8 bytes = 80 MB
- 10 million * 1 byte = 10 MB

Consider some data with 10 billion data points:

- 10 billion * 8 bytes = 80 GB
- 10 billion * 1 byte = 10 GB

Rule of thumb: Use the smallest dtype you can, without losing data -- with your current data, and with the manipulations/calculations you'll want to do later on.
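
The arithmetic above can be sketched in code. The `estimated_bytes` function is a hypothetical helper written for this illustration, not part of Pandas or NumPy; it relies on the `itemsize` attribute of a NumPy dtype, which gives the number of bytes per element.

```python
import numpy as np

def estimated_bytes(n_elements, dtype):
    """Bytes needed for the values alone, ignoring the index and other overhead."""
    return n_elements * np.dtype(dtype).itemsize

ten_million = 10_000_000
for dt in (np.int8, np.int16, np.int32, np.int64):
    megabytes = estimated_bytes(ten_million, dt) / 1_000_000
    print(f'{np.dtype(dt).name:>5}: {megabytes:.0f} MB')
```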

# Exercise: Strings to numbers

1. Define a series whose values are numbers stored as strings. (That is, the series should contain the strings `'10'`, `'20'`, `'30'`, etc. You can use whatever numbers you want.) The `dtype` for this series should be `object`, which generally means strings.
2. Calculate the mean of these numbers.

In [50]:
s = Series(['10', '20', '30', '40', '50'])
s

0    10
1    20
2    30
3    40
4    50
dtype: object

In [51]:
# what happens if I call the .mean() method on this?

s.mean()

204060810.0

In [52]:
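# .sum() on strings concatenates them; the .mean() result above is
# that concatenation, treated as a number and divided by 5
s.sum()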

'1020304050'

In [55]:
# do it on the fly, without changing s's dtype

s.astype(np.int8).mean()

30.0

In [56]:
s = s.astype(np.int8)    # replace the original series with an int series

s.mean()                 # calculate the mean on that

30.0

In [57]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int8
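
Besides `.astype`, Pandas offers the `pd.to_numeric` function for this kind of conversion. A minimal sketch, which was not part of the original session: by default it picks a dtype for us, and with `downcast='integer'` it chooses the smallest integer dtype that can hold the values.

```python
import pandas as pd
from pandas import Series

s = Series(['10', '20', '30', '40', '50'])

as_numbers = pd.to_numeric(s)                      # picks int64 for this data
as_small = pd.to_numeric(s, downcast='integer')    # smallest int dtype that fits

print(as_numbers.dtype, as_small.dtype)
print(as_small.mean())
```

The same caution applies as with choosing `np.int8` by hand: the downcast dtype fits the current values, not necessarily the results of later calculations.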

# Special value -- `NaN` ("not a number")

Very often, when we're working with data, there will be *missing* data. It might be missing because our sensors failed. Or a computer wasn't connected to the network. Or someone didn't answer a survey question. 

Missing data is a fact of life.

The way that we represent missing data in Pandas is with `NaN` (sometimes written as `nan`), meaning "not a number." This is actually a floating-point value! It's used to represent missing data.  The `pd.NA` value is also for missing data, and you will see it in some cases, but it's still new and relatively unused.
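
Two properties of `NaN` are worth seeing in code: it really is a float, and it is not equal to anything, including itself. That second property means we can't find missing values with `==`; the `.isna()` method does that job instead. A short sketch:

```python
import numpy as np
from pandas import Series

print(type(np.nan))        # NaN is an ordinary Python float
print(np.nan == np.nan)    # False: NaN is not equal to anything, even itself

s = Series([10, 20, np.nan, 30, 40])
missing = s.isna()         # boolean series, True where the value is NaN
n_missing = missing.sum()  # True counts as 1, so this counts the NaN values
print(n_missing)
```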

In [59]:
# what happens if we have NaN?

from numpy import nan

s = Series([10, 20, np.nan, 30, 40])

In [61]:
s  # the dtype is float64, because NaN is a float, and that forces all of them to be floats

0    10.0
1    20.0
2     NaN
3    30.0
4    40.0
dtype: float64
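
If keeping integers matters, one option is a nullable integer dtype, which marks missing values with the `pd.NA` value mentioned above instead of `NaN`. A brief sketch; note the capital "I" in `'Int64'`, which distinguishes this Pandas dtype from NumPy's `int64`:

```python
import pandas as pd
from pandas import Series

s = Series([10, 20, None, 30, 40], dtype='Int64')

print(s)            # the missing element is shown as <NA>
print(s.dtype)      # Int64
print(s.mean())     # the missing value is skipped, as with NaN
```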

In [62]:
s.mean()   # the NaN value will be ignored!

25.0

In [63]:
s.count()   # counts non-NaN values

4

In [64]:
s.loc[0] = np.nan
s

0     NaN
1    20.0
2     NaN
3    30.0
4    40.0
dtype: float64

In [66]:
# NaN is an alias for nan in NumPy 1.x; the alias was removed in NumPy 2.0
np.NaN

nan

In [68]:
from numpy import NaN   # NumPy 1.x only; in NumPy 2.0+, use "from numpy import nan as NaN"

In [69]:
s

0     NaN
1    20.0
2     NaN
3    30.0
4    40.0
dtype: float64

In [70]:
# how can I remove the NaN values?

# option 1: actually remove them, getting a new series back without the NaNs
s.dropna()  

1    20.0
3    30.0
4    40.0
dtype: float64

In [71]:
# we haven't changed s! To do that, we need to assign the result of s.dropna() to s
s

0     NaN
1    20.0
2     NaN
3    30.0
4    40.0
dtype: float64

In [72]:
# option 2: replace the NaN values with another value
# the fillna method does that for us

s.fillna(999)

0    999.0
1     20.0
2    999.0
3     30.0
4     40.0
dtype: float64

In [73]:
# this didn't change s!
s

0     NaN
1    20.0
2     NaN
3    30.0
4    40.0
dtype: float64

In [74]:
# a common value to use with fillna is the mean of the series!
# in other words: We'll replace NaN with the mean, thus keeping the mean identical 

s.fillna(s.mean())

0    30.0
1    20.0
2    30.0
3    30.0
4    40.0
dtype: float64

In [75]:
s.mean()

30.0

In [76]:
# let's replace NaN with s's mean, then assign that back to s

s = s.fillna(s.mean())

In [77]:
s

0    30.0
1    20.0
2    30.0
3    30.0
4    40.0
dtype: float64

In [78]:
s = Series([10, 20, np.nan, 30, 40])
s

0    10.0
1    20.0
2     NaN
3    30.0
4    40.0
dtype: float64

In [79]:
s.interpolate()

0    10.0
1    20.0
2    25.0
3    30.0
4    40.0
dtype: float64
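
By default, `.interpolate` uses linear interpolation: each `NaN` is replaced by a value on the straight line between its nearest non-missing neighbors, treating the elements as equally spaced. That's why the missing value above became 25, halfway between 20 and 30. A sketch with two missing values in a row:

```python
import numpy as np
from pandas import Series

s = Series([10, np.nan, np.nan, 40])

filled = s.interpolate()   # the default method is linear
print(filled)
```

Here the gap from 10 to 40 is divided into three equal steps, giving 20 and 30.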

# Exercise: Missing temperatures

1. Define a series of 10 integers, with the high temperatures expected in your city in the next 10 days.  Make the index the names of the days.
2. Calculate descriptive statistics for these values.
3. Set three of the days' temperatures to be `NaN`.
4. Calculate descriptive statistics again; have they changed a lot?
5. Replace the `NaN` values with the mean of the remaining values. Have the descriptive statistics changed much from the original ones?

In [80]:
s = Series([15, 22, 23, 18, 14, 17, 19, 18, 19, 18],
          index='Wed Thu Fri Sat Sun Mon Tue Wed Thu Fri'.split())

In [81]:
s

Wed    15
Thu    22
Fri    23
Sat    18
Sun    14
Mon    17
Tue    19
Wed    18
Thu    19
Fri    18
dtype: int64

In [82]:
s.describe()

count    10.000000
mean     18.300000
std       2.750757
min      14.000000
25%      17.250000
50%      18.000000
75%      19.000000
max      23.000000
dtype: float64

In [83]:
s.loc['Sat'] = NaN
s.loc['Sun'] = NaN
s.iloc[-1] = NaN     # 'Fri' appears twice in the index, so use iloc to set only the last element



In [84]:
s

Wed    15.0
Thu    22.0
Fri    23.0
Sat     NaN
Sun     NaN
Mon    17.0
Tue    19.0
Wed    18.0
Thu    19.0
Fri     NaN
dtype: float64
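
The reason for using `.iloc` on the last element is that this index contains repeated labels. A small sketch, with made-up values, of how `.loc` behaves when a label appears more than once:

```python
from pandas import Series

s = Series([23.0, 19.0, 18.0], index=['Fri', 'Thu', 'Fri'])

print(s.index.is_unique)    # False: 'Fri' appears twice
print(s.loc['Fri'])         # a series containing both 'Fri' elements

by_label = s.copy()
by_label.loc['Fri'] = 0.0   # sets *both* 'Fri' elements

by_position = s.copy()
by_position.iloc[-1] = 0.0  # sets only the last element

print(by_label)
print(by_position)
```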

In [85]:
s.describe()

count     7.000000
mean     19.000000
std       2.768875
min      15.000000
25%      17.500000
50%      19.000000
75%      20.500000
max      23.000000
dtype: float64

In [86]:
# now let's replace our NaN values with the mean

s.fillna(s.mean())

Wed    15.0
Thu    22.0
Fri    23.0
Sat    19.0
Sun    19.0
Mon    17.0
Tue    19.0
Wed    18.0
Thu    19.0
Fri    19.0
dtype: float64

In [87]:
# without assigning back to s, I can still get descriptive statistics for s
s.fillna(s.mean()).describe()

count    10.000000
mean     19.000000
std       2.260777
min      15.000000
25%      18.250000
50%      19.000000
75%      19.000000
max      23.000000
dtype: float64

In [88]:
s.interpolate().describe()

count    10.000000
mean     19.200000
std       2.347576
min      15.000000
25%      18.250000
50%      19.000000
75%      20.500000
max      23.000000
dtype: float64