# Agenda, week 2

1. Q&A
2. dtypes
3. `NaN` (not a number)
4. data frames (2D data structures)
5. Adding and removing data in our data frames
6. Useful methods and attributes
7. Querying with boolean indexes
8. Querying with `.loc`
9. Read some CSV data from a file

# dtypes



In [3]:
import numpy as np   # this is not strictly necessary, but very useful
import pandas as pd  # this is necessary!

from pandas import Series, DataFrame   # this is convenient

In [4]:
# let's create a series

s = Series([10, 20, 30, 40, 50])

s

0    10
1    20
2    30
3    40
4    50
dtype: int64

# What's a dtype?

Many people, when they're learning Python, wonder why we talk about "lists" rather than "arrays." After all, aren't they the same?

No: Lists differ from arrays in two ways:

- We can change their size (adding and removing items)
- Each object in a list can be of a different type. In an array, they must all be of the same type.

Fast forward to now, when we're working with NumPy and Pandas, and we're really dealing with arrays. That means we cannot change their size (although Pandas does allow for that, thanks to some magic) and all of the elements have to be of the same type.

In the worlds of NumPy and Pandas, that type is known as the "dtype," the data type.
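
Here's a minimal sketch of that difference, assuming nothing beyond NumPy itself. The variable names are just for illustration.

```python
import numpy as np

# A Python list can mix types, and it can grow in place.
mylist = [10, 'hello', 3.5]
mylist.append(True)
print(len(mylist))      # 4

# A NumPy array has one dtype for all of its elements.
arr = np.array([10, 20, 30])
print(arr.dtype)        # an integer dtype (int64 on most platforms)

# Mixing ints and floats forces every element to become a float.
mixed = np.array([10, 20.5, 30])
print(mixed.dtype)      # float64
```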

What options do we have for dtypes? These are (mostly) set by NumPy.

Dtypes

- Integers
    - `np.int8`
    - `np.int16`
    - `np.int32`
    - `np.int64` -- the default!
- Unsigned integers
    - `np.uint8`
    - `np.uint16`
    - `np.uint32`
    - `np.uint64`
- Floats
    - `np.float16`
    - `np.float32`
    - `np.float64` -- the default!
    - `np.float128` -- only available on some platforms
    
# What does this mean?

If you don't specify a dtype when you create a series, Pandas will guess what you want/need:

- If you have only integers, then it'll use `np.int64`
- If you have any floating-point numbers, then it'll use `np.float64`
- If you have strings or other funny Python objects, then it'll use `object` as its type
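
The guessing rules above can be checked with a quick sketch. The third series deliberately mixes an int, a string, and a float, which is my own example of "funny Python objects" rather than something from the cells in this notebook.

```python
import numpy as np
from pandas import Series

ints = Series([10, 20, 30])            # only integers
floats = Series([10, 20.5, 30])        # at least one float
mixed = Series([10, 'hello', 3.5])     # a mix of Python objects

print(ints.dtype)      # int64
print(floats.dtype)    # float64
print(mixed.dtype)     # object
```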

In [5]:
# we can get the dtype of a series by retrieving the dtype attribute

s.dtype

dtype('int64')

If you don't want to specify `np.int8`, then you can instead say `'int8'`, and it'll work the same way.

You can also say `np.dtype('int64')`.
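
As a small sketch, here are all three spellings side by side. They should give you series with the same dtype.

```python
import numpy as np
from pandas import Series

data = [10, 20, 30]

a = Series(data, dtype=np.int8)            # the NumPy type itself
b = Series(data, dtype='int8')             # the name as a string
c = Series(data, dtype=np.dtype('int8'))   # an explicit dtype object

print(a.dtype == b.dtype == c.dtype)       # True
```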

In [8]:
s = Series([10, 20, 30, 40, 50])
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [9]:
# how can we specify a different dtype?
# when we create a series, we can pass the keyword argument dtype= along with a valid dtype.

s = Series([10, 20, 30, 40, 50], dtype=np.int8)   # 8-bit numbers
s

0    10
1    20
2    30
3    40
4    50
dtype: int8

In [10]:
# let's multiply our series (s) by 100!

# I can use broadcasting

s * 100

0    -24
1    -48
2    -72
3    -96
4   -120
dtype: int8

In [11]:
# what happened? 8 bits (signed) can hold only -128 through 127, so they aren't enough to hold 1,000, let alone larger numbers.
# so, sort of like a car odometer or an old-style videogame, the numbers roll over
# this is very very bad -- you won't get a warning!



# This is why you need to worry

If your dtype is too small, then if the numbers get too big, you'll lose data without any warning.
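
One way to protect yourself is to check a dtype's range before you use it. NumPy's `np.iinfo` reports the smallest and largest values an integer dtype can hold. The `fits` function below is a hypothetical helper I made up for this sketch, not something that NumPy or Pandas provides.

```python
import numpy as np

info = np.iinfo(np.int8)
print(info.min, info.max)     # -128 127

def fits(values, dtype):
    """Hypothetical helper: True if every value fits in the integer dtype."""
    limits = np.iinfo(dtype)
    return min(values) >= limits.min and max(values) <= limits.max

print(fits([10, 20, 30], np.int8))       # True
print(fits([1000, 2000], np.int8))       # False
print(fits([1000, 2000], np.int16))      # True
```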

So, why not just use larger dtypes? Because that can be a waste of memory.

Imagine 1m 64-bit ints. Each one takes 8 bytes, so that'll take up ... 8 MB.

Imagine 1m 8-bit ints. Each one takes 1 byte, so that'll take up 1 MB.

That might not seem like a lot nowadays.  But what if we have 1b rows?

Then it's the difference between 8 GB and 1 GB... and that's already serious.

So you have to balance between a dtype that's not too small (and won't cause data loss) and not too big (and won't overwhelm your system).  This isn't always easy!
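
You don't have to estimate memory use; you can measure it. This sketch builds two series of one million zeros (my own made-up data) and reads the `nbytes` attribute, which reports the size of the underlying data in bytes.

```python
import numpy as np
from pandas import Series

n = 1_000_000

big = Series(np.zeros(n, dtype=np.int64))    # 8 bytes per element
small = Series(np.zeros(n, dtype=np.int8))   # 1 byte per element

print(big.nbytes)      # 8000000
print(small.nbytes)    # 1000000
```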

In [12]:
s1 = Series([10, 20, 30, 40, 50])
s2 = Series([90, 91, 92.3, 94, 95])


In [13]:
s1

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [14]:
s2

0    90.0
1    91.0
2    92.3
3    94.0
4    95.0
dtype: float64

In [15]:
s1 + s2    # each of the operations will be int + float, which gives us back a float

0    100.0
1    111.0
2    122.3
3    134.0
4    145.0
dtype: float64

In [16]:
# how can I change the dtype of a series?
# what does that even mean?

# if I change the dtype from int to float, we won't lose any data
# if I change the dtype from float to int, I might well lose data... what happens?

# You cannot change the dtype of a series by assigning to its dtype attribute
s.dtype = np.float64

AttributeError: property 'dtype' of 'Series' object has no setter

In [17]:
# we can create a new series, based on our existing series, with a different dtype
# if we do this, by calling the "astype" method, the new series will have the new dtype
# and each element will go through the appropriate transformation

# floats turned to ints will be truncated, for example

s

0    10
1    20
2    30
3    40
4    50
dtype: int8

In [18]:
s.astype(np.float64)   # new series, based on s

0    10.0
1    20.0
2    30.0
3    40.0
4    50.0
dtype: float64

In [19]:
# what about textual data?

s = Series('hello out there to everyone'.split())
s

0       hello
1         out
2       there
3          to
4    everyone
dtype: object

In [20]:
# we'll talk more about text strings in week 4. For now, you should know that text strings have a dtype
# of "object", because the series stores references to regular Python string objects.

# Exercise: Mean from strings

1. Create a list of strings, in which each string contains only digits
2. Create a series based on that list.
3. Transform the series such that you can calculate the mean of those numbers.

Example:

If my list is `['10', '20', '30']`, then I want to have a series such that I can call `s.mean()` and get back 20.

In [21]:
mylist = '11 15 23 97 65'.split()
mylist

['11', '15', '23', '97', '65']

In [22]:
s = Series(mylist)

In [23]:
s

0    11
1    15
2    23
3    97
4    65
dtype: object

In [24]:
# what happens when I try to calculate the mean on them?
s.mean()

223047953.0

In [25]:
# basically, Pandas added together all of the *strings*
s.sum()

'1115239765'

In [26]:
int(s.sum()) / 5

223047953.0

In [28]:
# if we really want to get the mean of these numbers,
# we'll need to transform our series into one of integers

s.astype(np.int8).mean()

42.2

In [29]:
# another way to do this would be at series creation time

s = Series(mylist, dtype=np.int8)


In [30]:
s

0    11
1    15
2    23
3    97
4    65
dtype: int8
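
A third option, which we didn't use above, is the `pd.to_numeric` function. It picks a numeric dtype for you, so you don't have to worry about whether every value fits into 8 bits. Here's a minimal sketch using the same list of strings.

```python
import pandas as pd
from pandas import Series

s = Series('11 15 23 97 65'.split())   # a series of strings

numbers = pd.to_numeric(s)             # new series, with a numeric dtype
print(numbers.dtype)                   # int64
print(numbers.mean())                  # 42.2
```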

In [31]:
# what if I have floats, and I turn them into ints?

s = Series([10.5, 20.7, 30.8, 40.9])
s

0    10.5
1    20.7
2    30.8
3    40.9
dtype: float64

In [32]:
s.astype(np.int64)  # what happens to our values? We'll just truncate the floats at the decimal point

0    10
1    20
2    30
3    40
dtype: int64
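
Truncating isn't the same as rounding, and the difference is easiest to see with negative numbers. In this sketch (with values I chose for illustration), calling `round` before `astype` gives the nearest integer instead.

```python
import numpy as np
from pandas import Series

s = Series([-1.7, -0.2, 0.2, 1.7])

truncated = s.astype(np.int64)           # drops everything after the decimal point
rounded = s.round().astype(np.int64)     # rounds to the nearest integer first

print(truncated.tolist())    # [-1, 0, 0, 1]
print(rounded.tolist())      # [-2, 0, 0, 2]
```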

# `NaN`

This is a weird and hard topic! 

Data is often dirty:
- Computers fail
- Sensors fail
- Things are delayed
- People are unreliable

Often, we'll be missing data. Or the data will need to be thrown out. Or the like.

How can we indicate that data is bad?

Imagine a temperature sensor that tells us the current temperature. What should it send to us when there is no data, or it's offline? Could it send us 0? It could, but we might mistake that for a real number.

What if it returns -999, which is clearly not a real temperature? Someone, someday will make the mistake of using that number, and we'll be in real trouble.

So we need a value that counts as a number, but which we cannot mistake for a real measurement. And that's what `NaN` is all about: It's short for "not a number," but it really is a number (a float)!

In [33]:
np.nan  # little nan

nan

In [34]:
np.NaN   # big nan -- an alias that was removed in NumPy 2.0, so this works only in older versions

nan

In [35]:
# these are exactly the same

np.nan is np.NaN

True

In [40]:
type(np.nan)  # what kind of value is it?

float

In [41]:
np.nan == np.nan   # is nan's value equal to itself?

False

To summarize:

- `NaN` is a float
- It isn't equal to itself
- We use it where we must have a number, but we don't have a value
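
Because `NaN` isn't equal to itself, you can't find it with `==`. Here's a small sketch of checks that do work, using `math.isnan` from the standard library and `np.isnan` from NumPy.

```python
import math
import numpy as np

x = np.nan

print(isinstance(x, float))   # True  -- NaN is a float
print(x == x)                 # False -- it isn't equal to itself
print(x != x)                 # True
print(math.isnan(x))          # True  -- the reliable way to test for it
print(np.isnan(x))            # True
```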

In [42]:
# we often use NaN to indicate that data is missing
# for example, let's assume you have a school with 5 tests during the year, and 
# the student was only present for 4 tests.  We want to calculate the mean
# score for a final grade.

scores = Series([95, 90, 97, 92, 0])

scores.mean()

74.8

In [43]:
# let's try this another way, with NaN

scores = Series([95, 90, 97, 92, np.nan])   # use nan instead of 0

scores.mean()  # pandas skips NaN by default (skipna=True); in plain NumPy, any NaN in a calculation makes the result NaN

93.5

In [44]:
scores.mean(skipna=False)  # if you want to be a stickler, and not calculate if NaN is around

nan
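
To see the difference between the two libraries, here's a sketch that runs the same scores through NumPy and through Pandas. `np.nanmean` is NumPy's version of the mean that ignores `NaN`.

```python
import numpy as np
from pandas import Series

values = [95, 90, 97, 92, np.nan]

arr = np.array(values)
scores = Series(values)

print(np.mean(arr))       # nan  -- NumPy lets NaN take over the result
print(np.nanmean(arr))    # 93.5 -- NumPy, explicitly ignoring NaN
print(scores.mean())      # 93.5 -- Pandas ignores NaN by default
```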

While I could ignore `nan`, more often I want to actually do something with it, to get rid of it. What are the options?

1. Remove `nan` entirely by running the `dropna` method
2. Replace `nan` with another value

In [45]:
scores

0    95.0
1    90.0
2    97.0
3    92.0
4     NaN
dtype: float64

In [46]:
scores.dropna()  # this returns a new series, based on scores, without any NaN values

0    95.0
1    90.0
2    97.0
3    92.0
dtype: float64

In [47]:
# the other way to handle NaN is to replace it with another value
# there are several schools of thought on this; one is to replace it with the mean of all other values

scores.fillna(scores.mean())  # without-nan mean is 93.5

0    95.0
1    90.0
2    97.0
3    92.0
4    93.5
dtype: float64

In [48]:
scores.fillna(scores.mean()).mean()   # get mean of everything, including filled-in values

93.5

In [49]:
# of course, the standard deviation, which measures how far values go from the mean,
# will be affected -- because one of our five values (20%) now sits exactly on the mean
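
Here's a minimal sketch of that effect, using the same scores. The mean stays put, but the standard deviation shrinks once the gap is filled.

```python
import numpy as np
from pandas import Series

scores = Series([95, 90, 97, 92, np.nan])
filled = scores.fillna(scores.mean())

print(scores.mean(), filled.mean())   # the mean is 93.5 both times
print(scores.std())                   # std of the four real scores
print(filled.std())                   # smaller, thanks to the filled-in value
```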

In [54]:
# the way that we can look for a NaN value is with np.isnan
np.isnan(scores.loc[4])

True

In [55]:
np.isnan??
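
Checking one element at a time gets old quickly. A series also has `isna` and `notna` methods, which return a boolean series covering every element. Here's a short sketch on the same scores.

```python
import numpy as np
from pandas import Series

scores = Series([95, 90, 97, 92, np.nan])

mask = scores.isna()                      # True wherever there's a NaN
print(mask.tolist())                      # [False, False, False, False, True]
print(mask.sum())                         # 1 -- how many values are missing

print(scores[scores.notna()].tolist())    # [95.0, 90.0, 97.0, 92.0]
```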

In [58]:
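# pandas also provides some other functionality to deal with nan, such as "interpolate"
# where it'll replace NaN with a value on the straight line between its neighbors (the average, for a single gap)
# a NaN at the end has no value after it, so it gets the last valid value instead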

scores.interpolate()

0    95.0
1    90.0
2    97.0
3    92.0
4    92.0
dtype: float64
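
In the scores example, the `NaN` was at the end, so there was nothing on the other side to average with. Here's a sketch with made-up values that has a gap in the middle as well, assuming the default (linear) behavior of `interpolate`.

```python
import numpy as np
from pandas import Series

s = Series([10, np.nan, 30, np.nan, np.nan])

result = s.interpolate()
print(result.tolist())    # [10.0, 20.0, 30.0, 30.0, 30.0]

# index 1 sits between 10 and 30, so it becomes 20
# indexes 3 and 4 have nothing after them, so they repeat the last valid value
```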