# Agenda, week 2


1. Recap and Q&A
2. dtypes in Pandas
    - What are they?
    - How do they work?
    - How do we change them?
     - Why do we care?
3. `NaN` -- "not a number"
    - What is it?
    - Why do we need it?
    - How do we work with it?
4. Data frames    
    - Creating data frames
    - Retrieving from them (rows vs. columns)
    - `.loc` and `.iloc`
5. Adding and removing data
    - Add rows
    - Add columns
    - Remove rows
    - Remove columns
6. Useful methods and attributes    
7. Using boolean ("mask") indexes to retrieve interesting data
    - Using `.loc` with a row specifier + column specifier
8. Reading data from CSV     

# A quick review of last week's topics

1. A series is a one-dimensional data structure
2. The values in a series can be anything -- typically, text (strings), integers, or floats.
3. By default, the index of a series works just like the index of a Python list: it starts at 0 and goes up to the length minus 1.
4. We can set the index of a series to be any values we want -- most typically integers, but we can use strings, too.
5. Unlike most Python data structures, the index of a series can have repeated values.

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [3]:
np.random.seed(0)
s1 = Series(np.random.randint(0, 100, 10),
           index=list('abcdefghij'))
s2 = Series(np.random.randint(0, 100, 10),
           index=list('fghijfghij'))


In [4]:
s1

a    44
b    47
c    64
d    67
e    67
f     9
g    83
h    21
i    36
j    87
dtype: int64

In [5]:
s2

f    70
g    88
h    88
i    12
j    58
f    65
g    39
h    87
i    46
j    88
dtype: int64

In [6]:
s1.loc['b']

47

In [7]:
s1.loc[['b', 'd']]

b    47
d    67
dtype: int64

In [8]:
s2.loc['b']

KeyError: 'b'

In [10]:
s2.loc['f']

f    70
f    65
dtype: int64

In [11]:
s1 + s2

a      NaN
b      NaN
c      NaN
d      NaN
e      NaN
f     79.0
f     74.0
g    171.0
g    122.0
h    109.0
h    108.0
i     48.0
i     82.0
j    145.0
j    175.0
dtype: float64
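The `NaN` values above appear wherever an index label exists in only one of the two series. As a minimal sketch, using two small hypothetical series `a` and `b` (not part of the original notebook), here is how the alignment works, and how `.add` with `fill_value` can treat a missing partner as 0 if that is what you want:

```python
from pandas import Series

# two small illustrative series; 'a' appears only in the first one
a = Series([1, 2, 3], index=list('abc'))
b = Series([10, 20], index=list('bc'))

# '+' aligns on the index; a label with no partner produces NaN
print(a + b)

# .add() with fill_value substitutes 0 for the missing partner
print(a.add(b, fill_value=0))
```

Whether 0 is a sensible stand-in depends on the data; often the `NaN` is the more honest answer.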

In [12]:
s1

a    44
b    47
c    64
d    67
e    67
f     9
g    83
h    21
i    36
j    87
dtype: int64

In [13]:
s1.mean()

52.5

In [14]:
# which elements of s1 are bigger than s1's mean?
s1 > s1.mean()

a    False
b    False
c     True
d     True
e     True
f    False
g     True
h    False
i    False
j     True
dtype: bool

In [15]:
# now let's apply that boolean series back to s1

# the series we get back contains all elements of s1 
# whose values are greater than s1's mean.
# notice that the index is kept along with the elements

s1.loc[s1 > s1.mean()]

c    64
d    67
e    67
g    83
j    87
dtype: int64

In [16]:
s1.head(2)

a    44
b    47
dtype: int64

In [17]:
# when we run s1.value_counts(), the result is a series
# whose index contains the unique values from s1, and
# whose values are the number of times that each of those values appeared

s1.value_counts()

67    2
44    1
47    1
64    1
9     1
83    1
21    1
36    1
87    1
dtype: int64

In [18]:
s1

a    44
b    47
c    64
d    67
e    67
f     9
g    83
h    21
i    36
j    87
dtype: int64

In [19]:
# let's talk about dtypes!

s = Series([10, 20, 30, 40, 50],
          index=list('abcde'))
s


a    10
b    20
c    30
d    40
e    50
dtype: int64

# What are dtypes?  

Python is not an obviously good candidate for data analysis. That's because each number in Python is actually an object, one that's very large (in memory usage). If you are dealing with many billions of numbers, this will quickly use up the RAM on your system, and will also make your programs very slow.

The advantage of Pandas (and of NumPy, which sits behind the scenes) is that it doesn't use Python's numbers. Rather, it uses C's numbers, which are VERY VERY small in comparison.

The good news is that Pandas is thus very efficient in both memory usage and speed.

The bad news is that we have to do more work. We have to choose which *type* of integer, or float, or other value (but usually ints and floats) we want to use.

The big choice? How many bits they should contain.

By default, Pandas will use `int64` for our integers. That is: 64-bit integers.

That means we get 2\*\*64 different integers, split between negative and non-negative values (from -2\*\*63 to 2\*\*63 - 1).

What if my numbers are all small? For example, what if I'm tracking ages in a population? I'm unlikely to have someone several quadrillion years old. It might make more sense to save memory, without messing up the accuracy of our data, by choosing a different dtype.
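To make the size difference concrete, here is a rough sketch comparing one Python integer object with one element of a NumPy `int64` array. It assumes CPython, where `sys.getsizeof` reports an object's size in bytes; the exact number for the Python integer varies by version and platform, so no output is shown for it.

```python
import sys
import numpy as np

n = 1_000
py_ints = list(range(1_000_000, 1_000_000 + n))                  # Python int objects
np_ints = np.arange(1_000_000, 1_000_000 + n, dtype=np.int64)    # raw 64-bit values

# each Python int is a full object, with bookkeeping on top of the value
bytes_per_py_int = sys.getsizeof(py_ints[0])

# a NumPy int64 array stores plain 8-byte values back to back
bytes_per_np_int = np_ints.itemsize

print(bytes_per_py_int, bytes_per_np_int)
print(np_ints.nbytes)   # n * 8 bytes for the data itself
```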

In [20]:
2**64

18446744073709551616

# Valid dtypes

When you choose a dtype, you have to balance the size/speed with your data needs, because if you choose a dtype that's too small, you will lose data and never know it.

## Integers
- `np.int64` (*default*) or `'int64'`
- `np.int32` or `'int32'`
- `np.int16` or `'int16'`
- `np.int8` or `'int8'`
- `np.uint64` or `'uint64'`
- `np.uint32` or `'uint32'`
- `np.uint16` or `'uint16'`
- `np.uint8` or `'uint8'`

## Floats
- `np.float128` or `'float128'` (not available on every platform, e.g. Windows builds of NumPy)
- `np.float64`  (*default*) or `'float64'`
- `np.float32` or `'float32'`
- `np.float16` or `'float16'`
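Rather than memorizing the limits of each dtype, you can ask NumPy for them. The sketch below uses `np.iinfo` (for integer dtypes) and `np.finfo` (for float dtypes); `float128` is left out because it doesn't exist on every platform.

```python
import numpy as np

# integer dtypes: np.iinfo reports the smallest and largest storable value
for name in ['int8', 'int16', 'int32', 'int64',
             'uint8', 'uint16', 'uint32', 'uint64']:
    info = np.iinfo(np.dtype(name))
    print(f'{name:>6}: {info.min} to {info.max}')

# float dtypes: np.finfo reports the largest value and the
# approximate number of decimal digits of precision
for name in ['float16', 'float32', 'float64']:
    info = np.finfo(np.dtype(name))
    print(f'{name:>7}: max={info.max}, ~{info.precision} decimal digits')
```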

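In [None]:
# np.float (an alias for Python's float) was removed in NumPy 1.24;
# name an explicit dtype, such as np.float64, instead
np.float64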

In [21]:
# int8 can, in theory, have numbers from 0-255, because 2**8 is 256.
# but that's not the case, because we also need negative numbers -- so we really get -128 to 127
# if you know you're only going to have non-negative numbers, you can double the upper limit with uint types (uint8 holds 0 to 255)

2**8

256

In [22]:
# I can set the dtype when I create a series



s = Series([10, 20, 30, 40, 50], dtype=np.int8)

In [23]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int8

In [24]:
# how much memory did I just save?
# int8 == 8 bits, or 1 byte, per integer
# int64 == 64 bits, or 8 bytes, per integer

# in this series, I saved 5*8 - 5*1 = 35 bytes
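You don't have to do this arithmetic by hand. As a small sketch, the `nbytes` attribute of a series reports how many bytes its values occupy (the index is not included):

```python
import numpy as np
from pandas import Series

values = [10, 20, 30, 40, 50]
small = Series(values, dtype=np.int8)
big = Series(values, dtype=np.int64)

# .nbytes counts only the bytes used by the values, not the index
print(small.nbytes)                # 5
print(big.nbytes)                  # 40
print(big.nbytes - small.nbytes)   # 35
```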

In [25]:
s**2   # put s to the 2nd power

0    100
1   -112
2   -124
3     64
4    -60
dtype: int8

In [28]:
# it was a big mistake to use int8... now what?
s

0    10
1    20
2    30
3    40
4    50
dtype: int8

In [29]:
s.dtype

dtype('int8')

In [30]:
# I can just set it to a new dtype!
s.dtype = np.int16

AttributeError: property 'dtype' of 'Series' object has no setter

In [31]:
# we can get a new series back from the existing one, 
# with the values converted to a new dtype

# the way to do that is with "astype"

s.astype(np.int32)

0    10
1    20
2    30
3    40
4    50
dtype: int32

In [32]:
# I still haven't changed s!  I can, however, assign the new series back to s

s = s.astype(np.int32)

In [33]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int32

In [34]:
s ** 2

0     100
1     400
2     900
3    1600
4    2500
dtype: int32
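The wraparound we saw when squaring the `int8` series happened silently. One way to guard against it, sketched here under the assumption that you can afford a temporary wider copy, is to do the arithmetic in `int64` and then check which results would have fit in the smaller dtype:

```python
import numpy as np
from pandas import Series

s = Series([10, 20, 30, 40, 50], dtype=np.int8)

# widen first, then square: every result now fits
squared = s.astype(np.int64) ** 2
print(squared)

# which of those results would have fit in int8?
limits = np.iinfo(np.int8)
fits = squared.between(limits.min, limits.max)
print(fits)
```

Only the first result (100) fits in `int8`; the others are exactly the ones that came back wrong earlier.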

In [36]:
# you may get a warning, not an error, if you try to set the dtype too small
# (newer versions of pandas and NumPy may raise an error here instead)
s = Series([10000, 20000, 30000], dtype=np.int8)

  s = Series([10000, 20000, 30000], dtype=np.int8)


In [37]:
s

0    16
1    32
2    48
dtype: int8

In [38]:
s = Series([10000, 20000, 30000])

In [39]:
s.astype(np.int8)

0    16
1    32
2    48
dtype: int8
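Since `astype` doesn't complain here, it's worth checking before you downcast. The sketch below shows two possible approaches: comparing the data's range against `np.iinfo` yourself, or letting `pd.to_numeric` with `downcast='integer'` choose the smallest signed integer dtype that can hold every value.

```python
import numpy as np
import pandas as pd
from pandas import Series

s = Series([10000, 20000, 30000])

# check the data's range against the target dtype before casting
limits = np.iinfo(np.int8)
fits_in_int8 = s.min() >= limits.min and s.max() <= limits.max
print(fits_in_int8)     # False

# let pandas pick the smallest signed integer dtype that is big enough
smallest = pd.to_numeric(s, downcast='integer')
print(smallest.dtype)   # int16
```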

In [40]:
# what happens if I have a series containing text?
# even if that text contains only digits, there's a difference between numbers and strings

s = Series('12 34 56 78'.split())

In [41]:
# if the dtype is object, that means the series contains Python objects, not NumPy/Pandas data

s

0    12
1    34
2    56
3    78
dtype: object

In [42]:
# what happens if I try to get s.mean()

s.mean()

3086419.5

In [43]:
# huh?

# s.mean() first adds together all of the values

s.sum()

'12345678'

In [45]:
int(s.sum()) / 4

3086419.5

In [47]:
# how can we get a more reasonable answer to this question?
# how can we turn s into a series of integers, and then calculate the mean?

s.astype(np.int64).mean()

45.0

In [48]:
s

0    12
1    34
2    56
3    78
dtype: object

In [49]:
s = s.astype(np.int64)
s

0    12
1    34
2    56
3    78
dtype: int64
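`astype` works here because every string looks like an integer. If even one string isn't a valid number, `astype(np.int64)` raises a `ValueError`. As a sketch of one alternative, `pd.to_numeric` with `errors='coerce'` converts what it can and marks the rest as `NaN`; the `'oops'` value is made up for illustration.

```python
import pandas as pd
from pandas import Series

s = Series('12 34 oops 78'.split())

# s.astype('int64') would raise ValueError because of 'oops'
# errors='coerce' turns anything unparseable into NaN instead
numbers = pd.to_numeric(s, errors='coerce')
print(numbers)

# the mean is calculated over the three values that were converted
print(numbers.mean())
```

Notice that the result is a float series, because `NaN` is a float value.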

In [50]:
# what happens if I now change one of the values to be a float?

s.loc[2] = 34.56

In [51]:
# the dtype for the entire series has changed, to reflect our float values

s

0    12.00
1    34.00
2    34.56
3    78.00
dtype: float64

In [52]:
# Unix time starts at 12 midnight, 1 Jan 1970
# it counts seconds since then

# originally, they used a 32-bit integer
2**32

4294967296
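To see why the number of bits matters here, the sketch below assumes the counter is a *signed* 32-bit integer (as the traditional C `time_t` was on many systems), so the largest value is 2\*\*31 - 1, and asks what moment that last second corresponds to:

```python
from datetime import datetime, timezone

# a signed 32-bit integer tops out at 2**31 - 1
last_second = 2**31 - 1

# interpret that count of seconds since 1 Jan 1970 as a UTC timestamp
print(datetime.fromtimestamp(last_second, tz=timezone.utc))
# 2038-01-19 03:14:07+00:00
```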

# Exercise: Dtypes

1. Ask the user to enter a bunch of integers, separated by spaces (in a string).
2. Turn that string into a series of integers.
3. Show all of the numbers that are greater than the mean.


In [58]:
x = input('Enter integers: ').strip()

Enter integers: 10 20 30 40 50


In [59]:
x

'10 20 30 40 50'

In [64]:
s = Series(x.split())

In [66]:
s = s.astype(np.int64)

s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [67]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [68]:
# boolean series based on s
s > s.mean()

0    False
1    False
2    False
3     True
4     True
dtype: bool

In [69]:
s.loc[s > s.mean()]

3    40
4    50
dtype: int64
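The steps of this exercise can also be collected into one function. This is just one possible sketch; the name `above_mean` is made up here, and it takes the string as an argument rather than calling `input`, so that it's easy to try with different values.

```python
import numpy as np
from pandas import Series

def above_mean(text):
    """Return the integers in a space-separated string that exceed their mean."""
    s = Series(text.split()).astype(np.int64)   # strings -> integers
    return s.loc[s > s.mean()]                  # keep values above the mean

print(above_mean('10 20 30 40 50'))
```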

# Missing data

In almost every data set, some data will be missing. How do we represent that?

- If we use 0, then our calculations will be completely off. (Also, how can we then determine whether 0 is really 0, or indicates that something isn't there?)
- We could use a very small or large number, like -999. But then, we're in a similar situation, where we might calculate things with that bad value!

We need a value that we cannot possibly confuse with others.

The solution in Pandas (and in NumPy, and many other mathematical systems) is to use a special floating-point value called `NaN`, short for "not a number."
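A short sketch of what makes `NaN` hard to confuse with real data: it isn't equal to anything, including itself, and Pandas skips it by default when aggregating. The series here is made up for illustration.

```python
import numpy as np
from pandas import Series

# NaN is a float value that is not equal to anything, including itself
print(np.nan == np.nan)   # False
print(type(np.nan))       # <class 'float'>

s = Series([10, 20, np.nan, 40])

# use isna() rather than == to find the missing values
print(s.isna())

# by default, pandas skips NaN when aggregating
print(s.sum())     # 70.0
print(s.mean())    # (10 + 20 + 40) / 3
print(s.count())   # 3 non-missing values
```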

In [73]:
# You can write it as "big NaN"
# (this alias was removed in NumPy 2.0, so np.nan is the safer spelling)
np.NaN

nan

In [74]:
# you can write it as "little nan"
np.nan

nan

In [None]:
# if you want to use these names without the "np." prefix, import them directly
from numpy import nan, NaN