# Agenda, week 2


1. Recap and Q&A
2. dtypes in Pandas
    - What are they?
    - How do they work?
    - How do we change them?
     - Why do we care?
3. `NaN` -- "not a number"
    - What is it?
    - Why do we need it?
    - How do we work with it?
4. Data frames    
    - Creating data frames
    - Retrieving from them (rows vs. columns)
    - `.loc` and `.iloc`
5. Adding and removing data
    - Add rows
    - Add columns
    - Remove rows
    - Remove columns
6. Useful methods and attributes    
7. Using boolean ("mask") indexes to retrieve interesting data
    - Using `.loc` with a row specifier + column specifier
8. Reading data from CSV     

# A quick review of last week's topics

1. A series is a one-dimensional data structure.
2. The values in a series can be of any type, but are typically text (strings), integers, or floats.
3. By default, the index of a series works like the indexes of a Python list: it starts at 0 and goes up to the length of the series minus 1.
4. We can set the index of a series to any values we want. Integers are most typical, but we can use strings, too.
5. Unlike the keys of a Python dict, the index of a series can have repeated values.
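
The points about the index can be checked with a quick sketch. The series `s`, `t`, and `u` and their values are made up for illustration and are not used elsewhere in this notebook.

```python
import pandas as pd

# Default index: positions 0 through len(s) - 1, as in a Python list
s = pd.Series([10, 20, 30])
print(list(s.index))

# Custom index made of strings
t = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
print(t.loc['y'])

# Repeated index values are allowed
u = pd.Series([10, 20, 30], index=['x', 'y', 'x'])
print(u.index.is_unique)
```

The `is_unique` attribute is a convenient way to find out whether an index contains repeated values before relying on a label to pick out a single element.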

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [3]:
np.random.seed(0)
s1 = Series(np.random.randint(0, 100, 10),
            index=list('abcdefghij'))
s2 = Series(np.random.randint(0, 100, 10),
            index=list('fghijfghij'))


In [4]:
s1

a    44
b    47
c    64
d    67
e    67
f     9
g    83
h    21
i    36
j    87
dtype: int64

In [5]:
s2

f    70
g    88
h    88
i    12
j    58
f    65
g    39
h    87
i    46
j    88
dtype: int64

In [6]:
s1.loc['b']

47

In [7]:
s1.loc[['b', 'd']]

b    47
d    67
dtype: int64

In [8]:
s2.loc['b']

KeyError: 'b'

In [10]:
s2.loc['f']

f    70
f    65
dtype: int64
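
The last few cells show that what `.loc` returns depends on the index: a label that appears once gives back a single value, a repeated label gives back a series, and a missing label raises a `KeyError`. Here is a minimal sketch of those three cases, using a small made-up series `s` rather than `s1` or `s2`.

```python
import pandas as pd

s = pd.Series([70, 88, 65], index=['f', 'g', 'f'])

# A label that appears once gives back a single value
one = s.loc['g']
print(one)

# A label that appears more than once gives back a series
many = s.loc['f']
print(type(many))

# Checking the index first is one way to avoid a KeyError
has_b = 'b' in s.index
print(has_b)
```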

In [11]:
s1 + s2

a      NaN
b      NaN
c      NaN
d      NaN
e      NaN
f     79.0
f     74.0
g    171.0
g    122.0
h    109.0
h    108.0
i     48.0
i     82.0
j    145.0
j    175.0
dtype: float64
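
Adding two series matches them up by index label, not by position. Any label missing from one side gives `NaN`, which is why the result's dtype becomes `float64`. The sketch below repeats the idea on two small made-up series, `a` and `b`. Using `.add` with `fill_value=0` is one possible way to treat a missing partner as zero. Whether that makes sense depends on the data.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
b = pd.Series([10, 20, 30], index=['b', 'c', 'c'])

# Labels are matched up; 'a' has no partner in b, so it becomes NaN
total = a + b
print(total)

# A boolean series marking which results are NaN
missing = total.isna()
print(missing)

# Treat a missing partner as 0 instead of producing NaN
filled = a.add(b, fill_value=0)
print(filled)
```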

In [12]:
s1

a    44
b    47
c    64
d    67
e    67
f     9
g    83
h    21
i    36
j    87
dtype: int64

In [13]:
s1.mean()

52.5

In [14]:
# which elements of s1 are bigger than s1's mean?
s1 > s1.mean()

a    False
b    False
c     True
d     True
e     True
f    False
g     True
h    False
i    False
j     True
dtype: bool

In [15]:
# now let's apply that boolean series back to s1

# the series we get back contains all elements of s1 
# whose values are greater than s1's mean.
# notice that the index is kept along with the elements

s1.loc[s1 > s1.mean()]

c    64
d    67
e    67
g    83
j    87
dtype: int64
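
Boolean series can also be combined before we hand them to `.loc`. As a small sketch, with a made-up series `s` and a made-up second condition (`is_even`), `&` means "and", `|` means "or", and `~` means "not".

```python
import pandas as pd

s = pd.Series([44, 47, 64, 67, 9], index=list('abcde'))

above_mean = s > s.mean()    # one boolean per element
is_even = s % 2 == 0         # another boolean series, same index

both = s.loc[above_mean & is_even]         # both conditions are True
either = s.loc[above_mean | is_even]       # at least one is True
neither = s.loc[~(above_mean | is_even)]   # neither is True

print(both)
print(either)
print(neither)
```

When the comparisons are written inline, each one needs its own parentheses, as in `s.loc[(s > s.mean()) & (s % 2 == 0)]`, because `&` and `|` bind more tightly than `>` and `==`.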

In [16]:
s1.head(2)

a    44
b    47
dtype: int64

In [17]:
# when we run s1.value_counts(), the result is a series
# whose index contains the unique values from s1, and
# whose values are the number of times each of those values appeared in s1

s1.value_counts()

67    2
44    1
47    1
64    1
9     1
83    1
21    1
36    1
87    1
dtype: int64
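
Because the result of `value_counts` is itself a series, we can use `.loc` and the index on it like on any other series. A minimal sketch, using a made-up series `s` with more repeats than `s1` has:

```python
import pandas as pd

s = pd.Series([67, 44, 67, 9, 67, 44])

counts = s.value_counts()
print(counts)

# The index holds the unique values, sorted from most to least common
top = counts.index[0]
print(top)

# normalize=True gives fractions of the total instead of raw counts
fractions = s.value_counts(normalize=True)
print(fractions)
```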

In [18]:
s1

a    44
b    47
c    64
d    67
e    67
f     9
g    83
h    21
i    36
j    87
dtype: int64