# Agenda: Day 2

1. Recap and Q&A
2. Dtypes and `NaN`
3. Data frames (2D data)
    - Creating data frames
    - Retrieving rows
    - Retrieving columns
    - Naming the index and the columns
4. Adding and removing data
5. Useful methods and attributes
6. Boolean indexes
7. Querying with `.loc`
    - Row selectors
    - Column selectors
    - Assigning via `.loc`
8. Reading CSV data

# Recap

- Pandas is for reading, writing, manipulating, cleaning, and analyzing data
- Last time, we talked about the *Series*
- A series contains a sequence of values, all of the same type (dtype)
- Retrieve values from a series using `.loc` (by index) or `.iloc` (by position)
- We can set the index either when we create the series or later, by assigning a new value to its `index` attribute
- We can retrieve a subset of values by passing a boolean series (a "mask index") to `.loc`
- Most operations on two series are matched up via the index, not via position
- If we have a series and a scalar value, the operation is "broadcast" to every element of the series
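
The recap mentions replacing the index after a series has been created, which none of the cells below demonstrates. A minimal sketch, using made-up values and labels (the name `t` is only for illustration):

```python
from pandas import Series

# Illustrative values and labels, not taken from the session
t = Series([10, 20, 30])       # starts with the default index 0, 1, 2
t.index = list('xyz')          # assign a new index after creation

first_by_label = t.loc['x']    # retrieve by index label
first_by_position = t.iloc[0]  # retrieve by position
```

Both lookups return the same element, because `'x'` is now the label of position 0.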

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [2]:
s = Series([10, 20, 30, 40, 45, 50, 60, 70])

In [3]:
s

0    10
1    20
2    30
3    40
4    45
5    50
6    60
7    70
dtype: int64

In [4]:
s.loc[4]

45

In [5]:
s.loc[[4, 6]]   # fancy indexing

4    45
6    60
dtype: int64

In [6]:
s = Series([10, 20, 30, 40, 45, 50, 60, 70],
          index=list('abcdefgh'))

In [7]:
s

a    10
b    20
c    30
d    40
e    45
f    50
g    60
h    70
dtype: int64

In [8]:
s.loc['d']

40

In [9]:
s.loc[['d', 'f']]

d    40
f    50
dtype: int64

In [10]:
# we can retrieve via the position using .iloc
s.iloc[4]

45

In [11]:
s.iloc[[4, 6]]

e    45
g    60
dtype: int64
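
With the default index, labels and positions happen to be the same numbers, so `.loc` and `.iloc` can look interchangeable. A small sketch with a deliberately shuffled integer index (values invented for illustration) shows where they differ:

```python
from pandas import Series

# Hypothetical series whose integer labels do not match the positions
u = Series([10, 20, 30], index=[2, 0, 1])

by_label = u.loc[0]      # the element whose label is 0
by_position = u.iloc[0]  # the element in the first position
```

Here `by_label` is 20 and `by_position` is 10.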

In [12]:
s + s    # two series, so the operation is matched up via the index

a     20
b     40
c     60
d     80
e     90
f    100
g    120
h    140
dtype: int64
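
Adding a series to itself cannot show the index matching at work, since every label lines up. A sketch with two small series whose indexes only partly overlap (values invented for illustration) makes it visible:

```python
import numpy as np
from pandas import Series

# Illustrative data: labels 'b' and 'c' appear in both series
a = Series([10, 20, 30], index=list('abc'))
b = Series([1, 2, 3], index=list('bcd'))

total = a + b   # values are paired by label, not by position
```

Labels found in only one of the two series (`'a'` and `'d'` here) come back as `NaN`, and the result's dtype becomes `float64`, which ties into today's topic of dtypes and `NaN`.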

In [13]:
# broadcasting

s + 4

a    14
b    24
c    34
d    44
e    49
f    54
g    64
h    74
dtype: int64

In [14]:
# we can run comparison operations via broadcast, and get a True/False value for each element

s < 50

a     True
b     True
c     True
d     True
e     True
f    False
g    False
h    False
dtype: bool

In [15]:
# if we have a boolean series, we can apply it with .loc
# this returns only those elements of the series for which our booleans are True

s.loc[s<50]

a    10
b    20
c    30
d    40
e    45
dtype: int64

In [16]:
(s<50).value_counts()   # how many True and how many False values

True     5
False    3
dtype: int64
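
A related shortcut, offered as a sketch rather than something covered in the session: because `True` counts as 1 and `False` as 0, summing a boolean series gives the number of matches, and taking its mean gives the proportion.

```python
from pandas import Series

# Same data as the series s used above
s = Series([10, 20, 30, 40, 45, 50, 60, 70],
           index=list('abcdefgh'))

mask = s < 50            # boolean series, one value per element
n_matches = mask.sum()   # number of True values
proportion = mask.mean() # fraction of True values
```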

In [17]:
s.describe()

count     8.000000
mean     40.625000
std      20.077973
min      10.000000
25%      27.500000
50%      42.500000
75%      52.500000
max      70.000000
dtype: float64
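
Each row of the `describe` output can also be computed on its own. A brief sketch using the same data; the `50%` row is the median, and the other percentage rows are quantiles:

```python
from pandas import Series

# Same data as the series s used above
s = Series([10, 20, 30, 40, 45, 50, 60, 70],
           index=list('abcdefgh'))

count = s.count()        # number of non-NaN values
mean = s.mean()
median = s.median()      # same as the 50% row
q25 = s.quantile(0.25)   # same as the 25% row
q75 = s.quantile(0.75)   # same as the 75% row
```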