# Agenda, week 2

1. Q&A
2. dtypes
3. `NaN` (not a number)
4. data frames (2D data structures)
5. Adding and removing data in our data frames
6. Useful methods and attributes
7. Querying with boolean indexes
8. Querying with `.loc`
9. Read some CSV data from a file

# dtypes



In [3]:
import numpy as np   # this is not strictly necessary, but very useful
import pandas as pd  # this is necessary!

from pandas import Series, DataFrame   # this is convenient

In [4]:
# let's create a series

s = Series([10, 20, 30, 40, 50])

s

0    10
1    20
2    30
3    40
4    50
dtype: int64

# What's a dtype?

Many people, when they're learning Python, wonder why we talk about "lists" rather than "arrays." After all, aren't they the same?

No: Lists are different from arrays in two different ways:

- We can change their size (adding and removing items)
- Each object in a list can be of a different type. In an array, they must all be of the same type.

Fast forward to now, when we're working with NumPy and Pandas, and we're really dealing with arrays. That means we cannot change their size (although Pandas does allow for that, thanks to some magic) and all of the elements have to be of the same type.

In the worlds of NumPy and Pandas, that type is known as the "dtype," the data type.

What options do we have for dtypes? These are (mostly) set by NumPy.

Dtypes

- Integers
    - `np.int8`
    - `np.int16`
    - `np.int32`
    - `np.int64` -- the default!
- Unsigned integers
    - `np.uint8`
    - `np.uint16`
    - `np.uint32`
    - `np.uint64`
- Floats
    - `np.float16`
    - `np.float32`
    - `np.float64` -- the default!
    - `np.float128` (only on some platforms; not available on Windows)
    
# What does this mean?

If you don't specify a dtype when you create a series, Pandas will guess what you want/need:

- If you have only integers, then it'll use `np.int64`
- If you have any floating-point numbers, then it'll use `np.float64`
- If you have strings or other funny Python objects, then it'll use `object` as its type
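
A quick way to check these guesses yourself is sketched below. The variable names are mine, and the exact integer width can depend on your platform and NumPy version, so take the comments as what I'd expect to see rather than a guarantee.

```python
import numpy as np
import pandas as pd

ints = pd.Series([10, 20, 30])             # only integers
floats = pd.Series([10, 20.5, 30])         # one float is enough to make everything float
mixed = pd.Series([10, 'hello', [1, 2]])   # arbitrary Python objects

print(ints.dtype)     # an integer dtype (int64 on most systems)
print(floats.dtype)   # float64
print(mixed.dtype)    # object
```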

In [5]:
# we can get the dtype of a series by retrieving the dtype attribute

s.dtype

dtype('int64')

If you don't want to specify `np.int8`, then you can instead say `'int8'`, and it'll work the same way.

You can also say `np.dtype('int64')`.

In [8]:
s = Series([10, 20, 30, 40, 50])
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [9]:
# how can we specify a different dtype?
# when we create a series, we can pass the keyword argument dtype= along with a valid dtype.

s = Series([10, 20, 30, 40, 50], dtype=np.int8)   # 8-bit numbers
s

0    10
1    20
2    30
3    40
4    50
dtype: int8

In [10]:
# let's multiply our series (s) by 100!

# I can use broadcasting

s * 100

0    -24
1    -48
2    -72
3    -96
4   -120
dtype: int8

In [11]:
# what happened? 8 bits (signed) aren't enough to hold 1,000 let alone larger numbers.
# so, sort of like a car odometer or an old-style videogame, the numbers roll over
# this is very very bad -- you won't get a warning!



# This is why you need to worry

If your dtype is too small, then if the numbers get too big, you'll lose data without any warning.
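
One way to protect yourself, sketched here under the assumption that you can afford to do the calculation once in a wider dtype, is to ask NumPy for the limits of the small dtype with `np.iinfo` and compare. The names `wide` and `fits_in_int8` are just mine for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], dtype=np.int8)

info = np.iinfo(np.int8)          # smallest and largest values int8 can hold
print(info.min, info.max)         # -128 127

# Do the multiplication in a wider dtype first, then check whether it would fit
wide = s.astype(np.int64) * 100
fits_in_int8 = (wide.min() >= info.min) and (wide.max() <= info.max)
print(fits_in_int8)               # False, so staying in int8 would corrupt the values
```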

So, why not just use larger dtypes? Because that can be a waste of memory.

Imagine 1m 64-bit ints. Each one takes 8 bytes, so that'll take up ... 8 MB.

Imagine 1m 8-bit ints. Each one takes 1 byte, so that'll take up 1 MB.

That might not seem like a lot nowadays.  But what if we have 1b rows?

Then it's the difference between 8 GB and 1 GB... and that's already serious.
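
If you'd rather measure than estimate, the `nbytes` attribute on a series reports the size of its underlying data (not counting the index). This is a small sketch with made-up series of zeros; `big` and `small` are my own names.

```python
import numpy as np
import pandas as pd

n = 1_000_000
big = pd.Series(np.zeros(n, dtype=np.int64))
small = pd.Series(np.zeros(n, dtype=np.int8))

print(big.nbytes)     # 8000000 bytes: 8 bytes per value
print(small.nbytes)   # 1000000 bytes: 1 byte per value
```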

So you have to balance between a dtype that's not too small (and won't cause data loss) and not too big (and won't overwhelm your system).  This isn't always easy!

In [12]:
s1 = Series([10, 20, 30, 40, 50])
s2 = Series([90, 91, 92.3, 94, 95])


In [13]:
s1

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [14]:
s2

0    90.0
1    91.0
2    92.3
3    94.0
4    95.0
dtype: float64

In [15]:
s1 + s2    # each of the operations will be int + float, which gives us back a float

0    100.0
1    111.0
2    122.3
3    134.0
4    145.0
dtype: float64

In [16]:
# how can I change the dtype of a series?
# what does that even mean?

# if I change the dtype from int to float, we won't lose any data
# if I change the dtype from float to int, I might well lose data... what happens?

# You cannot change the dtype of a series
s.dtype = np.float64

AttributeError: property 'dtype' of 'Series' object has no setter

In [17]:
# we can create a new series, based on our existing series, with a different dtype
# if we do this, by calling the "astype" method, the new series will have the new dtype
# and each element will go through the appropriate transformation

# floats turned to ints will be truncated, for example

s

0    10
1    20
2    30
3    40
4    50
dtype: int8

In [18]:
s.astype(np.float64)   # new series, based on s

0    10.0
1    20.0
2    30.0
3    40.0
4    50.0
dtype: float64

In [19]:
# what about textual data?

s = Series('hello out there to everyone'.split())
s

0       hello
1         out
2       there
3          to
4    everyone
dtype: object

In [20]:
# we'll talk more about text strings in week 4. For now, you should know that text strings
# have a dtype of "object", because the series stores references to regular Python string objects.

# Exercise: Mean from strings

1. Create a list of strings, in which each string contains only digits
2. Create a series based on that list.
3. Transform the series such that you can calculate the mean of those numbers.

Example:

If my list is `['10', '20', '30']`, then I want to have a series such that I can call `s.mean()` and get back 20.

In [21]:
mylist = '11 15 23 97 65'.split()
mylist

['11', '15', '23', '97', '65']

In [22]:
s = Series(mylist)

In [23]:
s

0    11
1    15
2    23
3    97
4    65
dtype: object

In [24]:
# what happens when I try to calculate the mean on them?
s.mean()

223047953.0

In [25]:
# basically, Pandas "added" all of the *strings*, which for strings means concatenating them
s.sum()

'1115239765'

In [26]:
int(s.sum()) / 5

223047953.0

In [28]:
# if we really want to get the mean of these numbers,
# we'll need to transform our series into one of integers

s.astype(np.int8).mean()

42.2

In [29]:
# another way to do this would be at series creation time

s = Series(mylist, dtype=np.int8)



In [30]:
s

0    11
1    15
2    23
3    97
4    65
dtype: int8
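
Besides `astype`, Pandas has a `pd.to_numeric` function that picks a numeric dtype for you. We haven't used it in class, so treat this as a sketch of an alternative; the `messy` series is invented to show what `errors='coerce'` does with a string that isn't a number.

```python
import pandas as pd

mylist = '11 15 23 97 65'.split()
s = pd.Series(mylist)

numbers = pd.to_numeric(s)        # picks a numeric dtype for us
print(numbers.dtype)
print(numbers.mean())

# errors='coerce' turns anything unparseable into NaN instead of raising
messy = pd.Series(['11', '15', 'oops'])
cleaned = pd.to_numeric(messy, errors='coerce')
print(cleaned)
```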

In [31]:
# what if I have floats, and I turn them into ints?

s = Series([10.5, 20.7, 30.8, 40.9])
s

0    10.5
1    20.7
2    30.8
3    40.9
dtype: float64

In [32]:
s.astype(np.int64)  # what happens to our values? We'll just truncate the floats at the decimal point

0    10
1    20
2    30
3    40
dtype: int64
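
Truncating isn't the same as rounding, and the difference shows up most clearly with negative numbers. Here's a small sketch with values I made up: if you want rounding, call `round` first and only then convert.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.6, 20.7, -30.8])

truncated = s.astype(np.int64)          # chops off the fractional part (toward zero)
rounded = s.round().astype(np.int64)    # round first, then convert

print(truncated.tolist())   # [10, 20, -30]
print(rounded.tolist())     # [11, 21, -31]
```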

# `NaN`

This is a weird and hard topic! 

Data is often dirty:
- Computers fail
- Sensors fail
- Things are delayed
- People are unreliable

Often, we'll be missing data. Or the data will need to be thrown out. Or the like.

How can we indicate that data is bad?

Imagine a temperature sensor that tells us the current temperature. What should it send to us when there is no data, or it's offline? Could it send us 0? It could, but we might mistake that for a real number.

What if it returns -999, which is clearly not a real temperature? Someone, someday will make the mistake of using that number, and we'll be in real trouble.

So we need a value that is a number, but which we cannot mistake for a number. And that's what `NaN` is all about: It's short for "not a number," but it really is a number!

In [33]:
np.nan  # little nan

nan

In [34]:
np.NaN   # big nan (an alias that NumPy 2.0 removed; stick with np.nan)

nan

In [35]:
# these are exactly the same

np.nan is np.NaN

True

In [40]:
type(np.nan)  # what kind of value is it?

float

In [41]:
np.nan == np.nan   # is nan's value equal to itself?

False

To summarize:

- `NaN` is a float
- It isn't equal to itself
- We use it where we must have a number, but we don't have a value
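
Because `NaN` isn't equal to itself, `==` is useless for finding it. The sketch below shows the checks I'd reach for instead: `np.isnan` for a single float, and `pd.isna` or the `isna` method for Pandas data.

```python
import numpy as np
import pandas as pd

x = np.nan

print(x == x)         # False: NaN is never equal to anything, itself included
print(np.isnan(x))    # True
print(pd.isna(x))     # True

s = pd.Series([95, np.nan, 97])
print(s.isna().tolist())   # [False, True, False]
```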

In [42]:
# we often use NaN to indicate that data is missing
# for example, let's assume you have a school with 5 tests during the year, and 
# the student was only present for 4 tests.  We want to calculate the mean
# score for a final grade.

scores = Series([95, 90, 97, 92, 0])

scores.mean()

74.8

In [43]:
# let's try this another way, with NaN

scores = Series([95, 90, 97, 92, np.nan])   # use nan instead of 0

scores.mean()  # in NumPy, any NaN in a calculation makes the result NaN; Pandas skips NaN by default

93.5

In [44]:
scores.mean(skipna=False)  # if you want to be a stickler, and not calculate if NaN is around

nan

While I could ignore `nan`, more often I want to actually do something with it, to get rid of it. What are the options?

1. Remove `nan` entirely by running the `dropna` method
2. Replace `nan` with another value

In [45]:
scores

0    95.0
1    90.0
2    97.0
3    92.0
4     NaN
dtype: float64

In [46]:
scores.dropna()  # this returns a new series, based on scores, without any NaN values

0    95.0
1    90.0
2    97.0
3    92.0
dtype: float64

In [47]:
# the other way to handle NaN is to replace it with another value
# there are several schools of thought on this; one is to replace it with the mean of all other values

scores.fillna(scores.mean())  # without-nan mean is 93.5

0    95.0
1    90.0
2    97.0
3    92.0
4    93.5
dtype: float64

In [48]:
scores.fillna(scores.mean()).mean()   # get mean of everything, including filled-in values

93.5

In [49]:
# of course, the standard deviation, which measures how far values go from the mean,
# will be affected -- because one of our five values (20%) now sits exactly on the mean
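
To see that effect in numbers, here's a short sketch using the same scores. The name `filled` is mine.

```python
import numpy as np
import pandas as pd

scores = pd.Series([95, 90, 97, 92, np.nan])
filled = scores.fillna(scores.mean())

print(scores.std())   # spread of the 4 real scores
print(filled.std())   # smaller, because the filled-in value sits exactly on the mean
print(filled.mean())  # the mean itself is unchanged
```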

In [54]:
# the way that we can look for a NaN value is with np.isnan
np.isnan(scores.loc[4])

True

In [55]:
np.isnan??

In [58]:
# pandas also provides some other functionality to deal with nan, such as "interpolate"
# where it'll replace NaN with a value on the straight line between its neighbors
# (a NaN at the very end has no next neighbor, so it gets the last valid value)

scores.interpolate()

0    95.0
1    90.0
2    97.0
3    92.0
4    92.0
dtype: float64

In [59]:
s = Series([10, 20, np.nan, np.nan, 50])

In [60]:
s

0    10.0
1    20.0
2     NaN
3     NaN
4    50.0
dtype: float64

In [61]:
s.interpolate()

0    10.0
1    20.0
2    30.0
3    40.0
4    50.0
dtype: float64
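
It's worth knowing what `interpolate` does at the edges of a series. In this sketch (with values I made up, and using only the default arguments) the `NaN` in the middle is filled from its neighbors, the one at the end repeats the last real value, and the one at the start stays `NaN`, because there's nothing before it to work from.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 10, np.nan, 30, np.nan])
result = s.interpolate()

print(result.tolist())   # [nan, 10.0, 20.0, 30.0, 30.0]
```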

# Exercise: Missing weather data

1. Create a series of 10 elements with the predicted high temps for your city in the next 10 days.
2. Have 3-4 of those values be np.nan.
3. Calculate the mean of the values.
4. Use `s.fillna` to replace the nan values with the mean. Has the mean changed?
5. If you use `interpolate`, what sorts of results do you see?

In [62]:
s = Series([32, 25, 24, 24, 27, 35, 30, 27, 26, 29])
s

0    32
1    25
2    24
3    24
4    27
5    35
6    30
7    27
8    26
9    29
dtype: int64

In [64]:
s = Series([32, 25, np.nan, 24, 27, np.nan, 30, 27, np.nan, 29])
s

0    32.0
1    25.0
2     NaN
3    24.0
4    27.0
5     NaN
6    30.0
7    27.0
8     NaN
9    29.0
dtype: float64

In [65]:
s.astype(np.int8)  # what happens if I try to coerce s into being an integer series?

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
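
If you really need integers alongside missing values, Pandas has its own "nullable" integer dtypes, spelled with a capital letter, such as `'Int64'`. We aren't covering them in this course, so this is only a sketch of the option; missing values show up as `<NA>` rather than `NaN`.

```python
import numpy as np
import pandas as pd

s = pd.Series([32, 25, np.nan, 24])

nullable = s.astype('Int64')   # capital "I": Pandas's nullable integer dtype
print(nullable)
print(nullable.mean())         # the missing value is skipped, as before
```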

In [69]:
# what is the mean?
s.describe()

count     7.000000
mean     27.714286
std       2.811541
min      24.000000
25%      26.000000
50%      27.000000
75%      29.500000
max      32.000000
dtype: float64

In [70]:
# what happens if we replace nan with the mean?

s.fillna(s.mean()).describe()

count    10.000000
mean     27.714286
std       2.295613
min      24.000000
25%      27.000000
50%      27.714286
75%      28.678571
max      32.000000
dtype: float64

In [73]:
s.interpolate()

0    32.0
1    25.0
2    24.5
3    24.0
4    27.0
5    28.5
6    30.0
7    27.0
8    28.0
9    29.0
dtype: float64

In [74]:
# s = Series([32, 25, 24, 24, 27, 35, 30, 27, 26, 29])


# Next up

1. Data frames 
    - Creating them
    - Retrieving rows
    - Retrieving columns
    - Naming the index and columns
2. Adding and removing data to our data frame



# Data frames

A data frame is a 2D data structure in Pandas. It's sort of like an Excel spreadsheet, with columns and rows.

We can think of it as a bunch of series objects, with each column being a series. That means every column has a dtype!

We'll be spending most of our time working with data frames. However, when we work with them, we're often going to be working via columns, which means via series.
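
Here's a small sketch of the "bunch of series" idea. I build the data frame from a dict of columns, which is a different constructor style from the list of lists we use in class, and the product data is invented.

```python
import pandas as pd

df = pd.DataFrame({'name': ['apple', 'banana'],
                   'price': [1.0, 1.25],
                   'quantity_sold': [10, 7]})

print(type(df['price']))   # each column comes back as a Series
print(df.dtypes)           # one dtype per column
```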

In [75]:
# how do we create a data frame?
# easiest way: list of lists, with each inner list representing a row in the data frame

df = DataFrame([[10, 20, 30, 40],
                [50, 60, 70, 80],
                [90, 100, 110, 120]])
df


    0    1    2    3
0  10   20   30   40
1  50   60   70   80
2  90  100  110  120


In [76]:
# wouldn't it be nice if we could give names to our index.. or even to our columns?

In [77]:
df = DataFrame([[10, 20, 30, 40],
                [50, 60, 70, 80],
                [90, 100, 110, 120]],
              index=list('abc'),
              columns=list('wxyz'))
df


    w    x    y    z
a  10   20   30   40
b  50   60   70   80
c  90  100  110  120


In [78]:
# how do we retrieve data from the data frame?

# first: how do we retrieve one column?
# answer: with []
df['x']

a     20
b     60
c    100
Name: x, dtype: int64

In [79]:
df['z']

a     40
b     80
c    120
Name: z, dtype: int64

In [81]:
# can I get more than one column at a time? Of course - just use fancy indexing,
# passing a list of column names to df

# this returns a data frame, because we asked for all rows (a, b, c) and two columns

df[['x', 'z']]

     x    z
a   20   40
b   60   80
c  100  120


In [83]:
# what about retrieving rows?
# remember that I told you last week that while you can retrieve from a series
# using [], you should really use .loc, because when we get to data frames, it'll
# make life easier?

# that is now!

# to retrieve a row, use .loc
# we get back a series, whose index is df's column names and whose dtype is the best 
#   we can do for these values
df.loc['a']

w    10
x    20
y    30
z    40
Name: a, dtype: int64

In [84]:
df.loc[['a', 'c']]

    w    x    y    z
a  10   20   30   40
c  90  100  110  120


In [85]:
# I can also use .iloc, if I want to retrieve via the position rather than the index

df.iloc[1]

w    50
x    60
y    70
z    80
Name: b, dtype: int64

In [86]:
df.iloc[[0, 2]]

    w    x    y    z
a  10   20   30   40
c  90  100  110  120


# To summarize

- Retrieve a column with `df[COLNAME]`, such as `df['x']`
- Retrieve a row via the index with `df.loc[ROWNAME]`, such as `df.loc['a']`
- Retrieve a row via the positional index with `df.iloc[NUMBER]`, such as `df.iloc[2]`
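
The difference between `.loc` and `.iloc` is easiest to see when the index contains integers that aren't in order. This is a sketch with a tiny data frame I made up for the purpose.

```python
import pandas as pd

df = pd.DataFrame([[10, 20], [30, 40], [50, 60]],
                  index=[2, 0, 1],
                  columns=['x', 'y'])

print(df.loc[2].tolist())    # row whose index label is 2 -> [10, 20]
print(df.iloc[2].tolist())   # third row by position     -> [50, 60]
print(df['x'].tolist())      # column x                  -> [10, 30, 50]
```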


# Exercise: Grocery store

1. Define a data frame whose columns are `name`, `price`, `quantity_sold`, and whose rows represent products we sell.
2. Find out the descriptive statistics for `price`.
3. Get descriptive statistics for `quantity_sold`, too.

In [88]:
# if I want a grocery store with 3 columns and 5 rows, I'll need a list of lists --
# each inner list will need 3 elements (name, price, quantity_sold)

df = DataFrame([ ['apple', 1, 10],
                 ['banana', 1.25, 7],
                 ['cucumber', 0.5, 15],
                 ['dill', 0.4, 10],
                 ['eggplant', 0.6, 60]],
              columns=['name', 'price', 'quantity_sold'])
df


       name  price  quantity_sold
0     apple   1.00             10
1    banana   1.25              7
2  cucumber   0.50             15
3      dill   0.40             10
4  eggplant   0.60             60


In [89]:
# get descriptive statistics for price

df['price'].describe()

count    5.000000
mean     0.750000
std      0.360555
min      0.400000
25%      0.500000
50%      0.600000
75%      1.000000
max      1.250000
Name: price, dtype: float64

In [90]:
df['quantity_sold'].describe()

count     5.000000
mean     20.400000
std      22.322634
min       7.000000
25%      10.000000
50%      10.000000
75%      15.000000
max      60.000000
Name: quantity_sold, dtype: float64

In [92]:
# what if I ask for descriptive statistics for our two columns?
df[['price', 'quantity_sold']].describe()

          price  quantity_sold
count  5.000000       5.000000
mean   0.750000      20.400000
std    0.360555      22.322634
min    0.400000       7.000000
25%    0.500000      10.000000
50%    0.600000      10.000000
75%    1.000000      15.000000
max    1.250000      60.000000


In [93]:
# instead, I could say

# as a general rule, any method you can run on a series, you can
# also run on a data frame -- you'll get back one result per column

df.describe() 

          price  quantity_sold
count  5.000000       5.000000
mean   0.750000      20.400000
std    0.360555      22.322634
min    0.400000       7.000000
25%    0.500000      10.000000
50%    0.600000      10.000000
75%    1.000000      15.000000
max    1.250000      60.000000


In [94]:
df

       name  price  quantity_sold
0     apple   1.00             10
1    banana   1.25              7
2  cucumber   0.50             15
3      dill   0.40             10
4  eggplant   0.60             60


In [95]:
# what kind of descriptive statistics can I get on a text column?

df['name'].describe()

count         5
unique        5
top       apple
freq          1
Name: name, dtype: object

# Modifying a data frame

- How can we add / remove rows?
- How can we add / remove columns?

The answer is (mostly) by assigning to them!

If we assign to a column, then it is either created or updated. We can assign a single value (which is repeated in every row), a list (which must be the right length), or a series (which is matched up via its index).

If we assign to a row, then it also must be of the right length, and it'll be added to our data frame.  We add via `.loc`.

Removing both rows and columns is done with the `df.drop` method.
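
One subtlety when assigning a column: a list is matched up by position, but a series is matched up by its index. The sketch below uses a made-up data frame; notice that the series is in a different order and is missing `'c'`, so that row ends up with `NaN`.

```python
import pandas as pd

df = pd.DataFrame({'price': [1.0, 1.25, 0.5]}, index=['a', 'b', 'c'])

df['from_list'] = [10, 20, 30]                               # matched by position
df['from_series'] = pd.Series([200, 100], index=['b', 'a'])  # matched by index label

print(df)
```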

In [97]:
# add a new row to our data frame

df.loc[5] = ['fennel', 0.5, 20]
df

       name  price  quantity_sold
0     apple   1.00             10
1    banana   1.25              7
2  cucumber   0.50             15
3      dill   0.40             10
4  eggplant   0.60             60
5    fennel   0.50             20


In [98]:
# adding a new column -- say, the revenue from each product?

df['revenue'] = df['price'] * df['quantity_sold']
df

       name  price  quantity_sold  revenue
0     apple   1.00             10    10.00
1    banana   1.25              7     8.75
2  cucumber   0.50             15     7.50
3      dill   0.40             10     4.00
4  eggplant   0.60             60    36.00
5    fennel   0.50             20    10.00


In [100]:
# how do we remove a row? We use df.drop, passing the index we want to remove
# (we could pass a list of indexes)

# running df.drop returns a new data frame, based on df, without the row(s) we
# removed. It does not modify the data frame in place unless you use inplace=True,
# which you should not use. 

df.drop(5)

       name  price  quantity_sold  revenue
0     apple   1.00             10    10.00
1    banana   1.25              7     8.75
2  cucumber   0.50             15     7.50
3      dill   0.40             10     4.00
4  eggplant   0.60             60    36.00


In [101]:
# if I want to permanently drop that row, I have to assign the result of drop back to df

df = df.drop(5)
df

       name  price  quantity_sold  revenue
0     apple   1.00             10    10.00
1    banana   1.25              7     8.75
2  cucumber   0.50             15     7.50
3      dill   0.40             10     4.00
4  eggplant   0.60             60    36.00


In [103]:
# what about dropping a column? Same thing, but we need to tell 
# drop that we're working with columns, not rows

# the way to do that is by specifying the keyword argument axis='columns'

df.drop('revenue', axis='columns')

       name  price  quantity_sold
0     apple   1.00             10
1    banana   1.25              7
2  cucumber   0.50             15
3      dill   0.40             10
4  eggplant   0.60             60


In [104]:
# really modify df by assigning back

df = df.drop('revenue', axis='columns')
df

       name  price  quantity_sold
0     apple   1.00             10
1    banana   1.25              7
2  cucumber   0.50             15
3      dill   0.40             10
4  eggplant   0.60             60


In [107]:
df.describe().round(2)

       price  quantity_sold
count   5.00           5.00
mean    0.75          20.40
std     0.36          22.32
min     0.40           7.00
25%     0.50          10.00
50%     0.60          10.00
75%     1.00          15.00
max     1.25          60.00


# Exercise: Family data

(For this exercise, if you don't have any family, or don't want to include them, make someone up.)

1. Create a data frame in which we have 3 columns: Name, age, and shoe size.  (If you don't know the person's shoe size, that's OK. Make something up.)  Have 3-4 people in this data frame, each in its own row.
2. Add another two people (rows) to the data frame after creating it.
3. Add a new column to the data frame, height.
4. Check that all rows and all columns have the data.
5. Remove the two new people.
6. Remove the height column.
7. You should be back to the start.

In [108]:
df = DataFrame([['Reuven', 52, 46],
                 ['Atara', 22, 40],
                 ['Shikma', 20, 40],
                 ['Amotz', 17, 44]],
              columns=['name', 'age', 'shoesize'])
df

     name  age  shoesize
0  Reuven   52        46
1   Atara   22        40
2  Shikma   20        40
3   Amotz   17        44


In [110]:
df.loc[4] = ['a', 10, 40]
df.loc[5] = ['b', 10, 40]


In [111]:
df

     name  age  shoesize
0  Reuven   52        46
1   Atara   22        40
2  Shikma   20        40
3   Amotz   17        44
4       a   10        40
5       b   10        40


In [112]:
# if I assign one value, then it is put in all rows
df['height'] = 185
df

     name  age  shoesize  height
0  Reuven   52        46     185
1   Atara   22        40     185
2  Shikma   20        40     185
3   Amotz   17        44     185
4       a   10        40     185
5       b   10        40     185


In [113]:
# typically, you'll need to specify different values

df['height'] = [185, 180, 180, 182, 180, 180]
df

     name  age  shoesize  height
0  Reuven   52        46     185
1   Atara   22        40     180
2  Shikma   20        40     180
3   Amotz   17        44     182
4       a   10        40     180
5       b   10        40     180


In [114]:
# let's remove the height column

df = df.drop('height', axis='columns')
df

     name  age  shoesize
0  Reuven   52        46
1   Atara   22        40
2  Shikma   20        40
3   Amotz   17        44
4       a   10        40
5       b   10        40


In [116]:
# let's drop rows 4-5
# I can pass a list of rows to drop

df = df.drop([4, 5])

In [117]:
df

     name  age  shoesize
0  Reuven   52        46
1   Atara   22        40
2  Shikma   20        40
3   Amotz   17        44


In [119]:
pd.__version__

'2.0.1'

# Next up

- Useful methods on data frames (some of which we already know from series)
- Boolean series and mask indexes on data frames
- Querying with `loc` -- retrieving + setting values on our data frame

In [120]:
# I'm going to create a large data frame with some random numbers,
# just for demonstration purposes.

# I'm going to use np.random.randint to generate some random
# numbers.

df = DataFrame(np.random.randint(0, 100, [5,5]),  # creating a 5x5 NumPy array, which Pandas can use
              index=list('abcde'),   # list of the letters a-e, for our index
              columns=list('vwxyz')) # list of the letters v-z, for our column names

df

    v   w   x   y   z
a  79  93  88  10  16
b  78  24  59  87   1
c  52  43  46  73  12
d  70  50  56  13  51
e  45  53  98  73  31


In [121]:
# I want to see the first 3 rows of the data frame
# remember, I can see the first 3 elements of a series with .head(3)
# I can do the same thing with a data frame

df.head(3)

    v   w   x   y   z
a  79  93  88  10  16
b  78  24  59  87   1
c  52  43  46  73  12


In [122]:
# similarly, I can run df.tail(3) to see the final 3 rows

df.tail(3)

    v   w   x   y   z
c  52  43  46  73  12
d  70  50  56  13  51
e  45  53  98  73  31


In [123]:
# by default, head and tail show 5 lines

In [124]:
# I can run each of our aggregate functions

df.mean()  # this runs on each of our columns, and gives a result for each one



v    64.8
w    52.6
x    69.4
y    51.2
z    22.2
dtype: float64

In [125]:
df.sum()

v    324
w    263
x    347
y    256
z    111
dtype: int64

In [126]:
df.std()

v    15.482248
w    25.244801
x    22.356207
y    36.704223
z    19.357169
dtype: float64

In [127]:
df.min()

v    45
w    24
x    46
y    10
z     1
dtype: int64

In [128]:
# I can summarize all of them with df.describe...

df.describe()

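               v          w          x          y          z
count   5.000000   5.000000   5.000000   5.000000   5.000000
mean   64.800000  52.600000  69.400000  51.200000  22.200000
std    15.482248  25.244801  22.356207  36.704223  19.357169
min    45.000000  24.000000  46.000000  10.000000   1.000000
25%    52.000000  43.000000  56.000000  13.000000  12.000000
50%    70.000000  50.000000  59.000000  73.000000  16.000000
75%    78.000000  53.000000  88.000000  73.000000  31.000000
max    79.000000  93.000000  98.000000  87.000000  51.000000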


In [129]:
# what dtypes do we have in our data frame?

df.dtypes  # notice -- plural

v    int64
w    int64
x    int64
y    int64
z    int64
dtype: object

In [132]:
# I'll add a new column of type float128
df['u'] = Series([10, 20, 30, 40, 50], 
                 index=list('abcde'),
                 dtype=np.float128)

In [133]:
df

    v   w   x   y   z     u
a  79  93  88  10  16  10.0
b  78  24  59  87   1  20.0
c  52  43  46  73  12  30.0
d  70  50  56  13  51  40.0
e  45  53  98  73  31  50.0


In [134]:
df.dtypes

v       int64
w       int64
x       int64
y       int64
z       int64
u    float128
dtype: object

In [135]:
# when I retrieve column z, what dtype do I see?
df['z']

a    16
b     1
c    12
d    51
e    31
Name: z, dtype: int64

In [136]:
# when I retrieve column u, what dtype do I see?
df['u'] 

a    10.0
b    20.0
c    30.0
d    40.0
e    50.0
Name: u, dtype: float128

In [137]:
# when I retrieve row e, what dtype do I see?
# row e will be returned as a series, created on the fly from each of the elements in that row
# as a series, it'll need a dtype
# Pandas figures out what kind of dtype will be acceptable -- it tries to find something good

df.loc['e']

v    45.0
w    53.0
x    98.0
y    73.0
z    31.0
u    50.0
Name: e, dtype: float128
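
When a row mixes numbers with text, there's no numeric dtype that can hold everything, and the row comes back with a dtype of `object`. This is a small sketch using invented grocery data to show both cases.

```python
import pandas as pd

df = pd.DataFrame([['apple', 1.0, 10],
                   ['banana', 1.25, 7]],
                  columns=['name', 'price', 'quantity_sold'])

row = df.loc[0]
print(row.dtype)           # object: the row holds a string, a float and an int

numeric_row = df[['price', 'quantity_sold']].loc[0]
print(numeric_row.dtype)   # float64: an int and a float can share a float dtype
```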

In [139]:
# we can get the shape (rows x columns) of a data frame with the .shape attribute
# always a tuple

df.shape

(5, 6)

In [140]:
# what about NaN?

df

    v   w   x   y   z     u
a  79  93  88  10  16  10.0
b  78  24  59  87   1  20.0
c  52  43  46  73  12  30.0
d  70  50  56  13  51  40.0
e  45  53  98  73  31  50.0


In [141]:
df['v'] = [79, 78, np.nan, 70, np.nan]
df['y'] = [10, np.nan, 87, np.nan, 31]
df

      v   w   x     y   z     u
a  79.0  93  88  10.0  16  10.0
b  78.0  24  59   NaN   1  20.0
c   NaN  43  46  87.0  12  30.0
d  70.0  50  56   NaN  51  40.0
e   NaN  53  98  31.0  31  50.0


In [142]:
df.dtypes

v     float64
w       int64
x       int64
y     float64
z       int64
u    float128
dtype: object

In [143]:
# we know that dropna on a series removes all NaN values
# what will happen if we run dropna on our data frame?

df.dropna()

      v   w   x     y   z     u
a  79.0  93  88  10.0  16  10.0


In [144]:
# running dropna on a data frame returns only those rows with
# zero nans in them. Any row with even a single nan is dropped.

# we can indicate how many good values a row should have
# in order not to be removed.

# we do this with the "thresh" keyword argument

df.dropna(thresh=4)   # keep rows with at least 4 non-NaN values (here: at most 2 nans)

      v   w   x     y   z     u
a  79.0  93  88  10.0  16  10.0
b  78.0  24  59   NaN   1  20.0
c   NaN  43  46  87.0  12  30.0
d  70.0  50  56   NaN  51  40.0
e   NaN  53  98  31.0  31  50.0


In [146]:
df.dropna(thresh=5)   # keep rows with at least 5 non-NaN values (here: at most 1 nan)

      v   w   x     y   z     u
a  79.0  93  88  10.0  16  10.0
b  78.0  24  59   NaN   1  20.0
c   NaN  43  46  87.0  12  30.0
d  70.0  50  56   NaN  51  40.0
e   NaN  53  98  31.0  31  50.0
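
Since every row in our data frame has at most one `NaN`, both of those calls kept everything. To actually see `thresh` remove rows, here's a sketch with a smaller, made-up data frame whose rows have 0, 1, and 2 missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3],
                   [4, np.nan, 6],
                   [np.nan, np.nan, 9]],
                  index=list('abc'), columns=list('xyz'))

print(df.dropna().index.tolist())           # ['a']: only rows with no NaN at all
print(df.dropna(thresh=2).index.tolist())   # ['a', 'b']: at least 2 non-NaN values
print(df.dropna(thresh=1).index.tolist())   # ['a', 'b', 'c']: at least 1 non-NaN value
```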


In [148]:
# I can even indicate which columns cannot have nan in them
# for example, I can say that we'll drop any row with nan
# but only if it has nan in v.

df.dropna(subset=['v'])

      v   w   x     y   z     u
a  79.0  93  88  10.0  16  10.0
b  78.0  24  59   NaN   1  20.0
d  70.0  50  56   NaN  51  40.0


In [149]:
df

      v   w   x     y   z     u
a  79.0  93  88  10.0  16  10.0
b  78.0  24  59   NaN   1  20.0
c   NaN  43  46  87.0  12  30.0
d  70.0  50  56   NaN  51  40.0
e   NaN  53  98  31.0  31  50.0


In [150]:
# what about fillna?
# I can fillna with a value

df.fillna(9999)

        v   w   x       y   z     u
a    79.0  93  88    10.0  16  10.0
b    78.0  24  59  9999.0   1  20.0
c  9999.0  43  46    87.0  12  30.0
d    70.0  50  56  9999.0  51  40.0
e  9999.0  53  98    31.0  31  50.0


In [151]:
# I can pass a series to fillna, and it will use 
# the index of the series to fill values in those columns

# meaning: if my series has an index v,w,x,y,z,u
# then the value in that series at v will be used to replace nan in column v

# how can I get such a series with useful values to replace nan?
# answer: mean()

# if i call df.mean(), I get a series with df's columns as the index
# with the mean of each column

df.fillna(df.mean())

           v   w   x          y   z     u
a  79.000000  93  88  10.000000  16  10.0
b  78.000000  24  59  42.666667   1  20.0
c  75.666667  43  46  87.000000  12  30.0
d  70.000000  50  56  42.666667  51  40.0
e  75.666667  53  98  31.000000  31  50.0


In [152]:
df.interpolate()

      v   w   x     y   z     u
a  79.0  93  88  10.0  16  10.0
b  78.0  24  59  48.5   1  20.0
c  74.0  43  46  87.0  12  30.0
d  70.0  50  56  59.0  51  40.0
e  70.0  53  98  31.0  31  50.0


# Exercise: Weather

1. Create a data frame with two columns, the projected high and low temperatures for the next 10 days. The index can be the dates in `DD` format.  Replace some of the values with `np.nan`.
2. Retrieve the first 5 days of weather info
3. Get descriptive statistics for the weather
4. Replace the nan values with the mean high and the mean low. How close are they to the originals?

In [155]:
df = DataFrame()

# let's add a column to df!
df['highs'] = [32, 25, 24, 24, 27, 35, 30, 27, 26, 29]
df['lows'] =  [20, 18, 16, 16, 15, 16, 21, 18, 15, 15]
df

   highs  lows
0     32    20
1     25    18
2     24    16
3     24    16
4     27    15
5     35    16
6     30    21
7     27    18
8     26    15
9     29    15


In [157]:
df.index = '17 18 19 20 21 22 23 24 25 26'.split()

In [158]:
df

    highs  lows
17     32    20
18     25    18
19     24    16
20     24    16
21     27    15
22     35    16
23     30    21
24     27    18
25     26    15
26     29    15


In [162]:
df = DataFrame()

# let's add a column to df!
df['highs'] = [32, np.nan, 24, 24, np.nan, 35, 30, 27, np.nan, 29]
df['lows'] =  [20, 18, np.nan, 16, np.nan, 16, 21, 18, 15, 15]
df.index = '17 18 19 20 21 22 23 24 25 26'.split()
df

    highs  lows
17   32.0  20.0
18    NaN  18.0
19   24.0   NaN
20   24.0  16.0
21    NaN   NaN
22   35.0  16.0
23   30.0  21.0
24   27.0  18.0
25    NaN  15.0
26   29.0  15.0


In [160]:
df.dtypes

highs    float64
lows     float64
dtype: object

In [163]:
df.head()

    highs  lows
17   32.0  20.0
18    NaN  18.0
19   24.0   NaN
20   24.0  16.0
21    NaN   NaN


In [164]:
df.describe()

           highs       lows
count   7.000000   8.000000
mean   28.714286  17.375000
std     4.070802   2.263846
min    24.000000  15.000000
25%    25.500000  15.750000
50%    29.000000  17.000000
75%    31.000000  18.500000
max    35.000000  21.000000


In [165]:
df

    highs  lows
17   32.0  20.0
18    NaN  18.0
19   24.0   NaN
20   24.0  16.0
21    NaN   NaN
22   35.0  16.0
23   30.0  21.0
24   27.0  18.0
25    NaN  15.0
26   29.0  15.0


In [166]:
df['highs'].mean()

28.714285714285715

In [167]:
df['lows'].mean()

17.375

In [168]:
df.mean()

highs    28.714286
lows     17.375000
dtype: float64

In [169]:
df.fillna(df.mean())

        highs    lows
17  32.000000  20.000
18  28.714286  18.000
19  24.000000  17.375
20  24.000000  16.000
21  28.714286  17.375
22  35.000000  16.000
23  30.000000  21.000
24  27.000000  18.000
25  28.714286  15.000
26  29.000000  15.000


In [170]:
df.index

Index(['17', '18', '19', '20', '21', '22', '23', '24', '25', '26'], dtype='object')

In [173]:
df.index = Series('17 18 19 20 21 22 23 24 25 26'.split())

In [174]:
df

    highs  lows
17   32.0  20.0
18    NaN  18.0
19   24.0   NaN
20   24.0  16.0
21    NaN   NaN
22   35.0  16.0
23   30.0  21.0
24   27.0  18.0
25    NaN  15.0
26   29.0  15.0


In [175]:
df.index

Index(['17', '18', '19', '20', '21', '22', '23', '24', '25', '26'], dtype='object')

In [177]:
df.describe()  # count only shows non-NaN data... so if they don't match, at least one column has nans

           highs       lows
count   7.000000   8.000000
mean   28.714286  17.375000
std     4.070802   2.263846
min    24.000000  15.000000
25%    25.500000  15.750000
50%    29.000000  17.000000
75%    31.000000  18.500000
max    35.000000  21.000000


In [179]:
# are there any NaN values?

df.isna()

    highs   lows
17  False  False
18   True  False
19  False   True
20  False  False
21   True   True
22  False  False
23  False  False
24  False  False
25   True  False
26  False  False


In [181]:
# to find how *many* nan values there are in each column,
# we can depend on the fact that in Python, True is 1 and False is 0.

df.isna().sum()

highs    3
lows     2
dtype: int64

We still cannot:

- Retrieve values fitting a certain rule
- Retrieve/set individual values

In [185]:
# remember boolean indexes?

# if I run a comparison on a series, I'll get a boolean series back
# I can then apply that boolean series as a "mask index" -- and only those values
#   that correspond to True in the boolean will be returned

# find all high temperatures greater than the mean
df['highs'][df['highs'] > df['highs'].mean()]

17    32.0
22    35.0
23    30.0
26    29.0
Name: highs, dtype: float64

In [186]:
# we can also apply our boolean series to the entire data frame
# in that case, we're saying: Show me all columns for rows where highs are > mean

df[df['highs'] > df['highs'].mean()]

    highs  lows
17   32.0  20.0
22   35.0  16.0
23   30.0  21.0
26   29.0  15.0


Using comparisons + boolean series in this way allows us to say, "Show me all rows where X is true on column C." 

I can use comparisons on multiple columns!

In [191]:
# were there any days with above-average highs and below-average lows?

df[(df['highs'] > df['highs'].mean()) & 
   (df['lows'] < df['lows'].mean())]

    highs  lows
22   35.0  16.0
26   29.0  15.0
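
When combining comparisons, remember that we use `&` (and), `|` (or), and `~` (not) rather than Python's `and`, `or`, and `not`, and that each comparison needs its own parentheses unless you store it in a variable first. Here's a sketch with a cut-down version of the weather data; the thresholds and the names `hot` and `cool_night` are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'highs': [32, np.nan, 24, 35],
                   'lows':  [20, 18, np.nan, 16]},
                  index=['17', '18', '19', '22'])

hot = df['highs'] > 30          # comparing with NaN gives False
cool_night = df['lows'] < 18

print(df[hot & cool_night].index.tolist())    # both conditions: ['22']
print(df[hot | cool_night].index.tolist())    # either condition: ['17', '22']
print(df[~hot].index.tolist())                # not hot, which includes the NaN row: ['18', '19']
```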


# Next up

1. Retrieving and setting individual values (and groups of values) with `.loc`
2. Work with real-world CSV data

Note: You'll want to download this zipfile: https://files.lerner.co.il/data-science-exercise-files.zip

# The key to working with data frames is `.loc`

There are two versions of `.loc` you can use on a data frame:

1. Similar to what we already know, with one argument, that describes which rows we want ("row selector")
2. New is a two-argument version -- first the row selector, and then the column selector

The row and column selectors can both be:
- A string
- A list of strings
- A slice
- A boolean series
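
As a taste of how these selectors combine, here's a sketch with a small made-up data frame: a boolean series picks the rows, a list of strings picks the columns, and slices work in both positions.

```python
import pandas as pd

df = pd.DataFrame([[56, 83, 63],
                   [46, 86, 37],
                   [94, 91, 59]],
                  index=list('abc'), columns=list('vwx'))

# rows where v > 50, but only columns w and x
subset = df.loc[df['v'] > 50, ['w', 'x']]
print(subset)

# slices in both positions; with .loc, the end label is included
corner = df.loc['a':'b', 'v':'w']
print(corner)
```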

In [192]:
df

    highs  lows
17   32.0  20.0
18    NaN  18.0
19   24.0   NaN
20   24.0  16.0
21    NaN   NaN
22   35.0  16.0
23   30.0  21.0
24   27.0  18.0
25    NaN  15.0
26   29.0  15.0


In [193]:
df = DataFrame(np.random.randint(0, 100, [5,5]),  # creating a 5x5 NumPy array, which Pandas can use
              index=list('abcde'),   # list of the letters a-e, for our index
              columns=list('vwxyz')) # list of the letters v-z, for our column names

df

    v   w   x   y   z
a  56  83  63  76  30
b  46  86  37  52  11
c  94  91  59  63  81
d  46  64  93  33  23
e  28  19  87  61  22


In [194]:
# one-argument version of df.loc
# first -- just a string

df.loc['b']   # this returns one row

v    46
w    86
x    37
y    52
z    11
Name: b, dtype: int64

In [195]:
# next -- a list of strings, for more than one row
df.loc[['b', 'd']]

    v   w   x   y   z
b  46  86  37  52  11
d  46  64  93  33  23


In [196]:
# a slice, for more than one row
df.loc['b':'d']  # this is up to *AND INCLUDING*

    v   w   x   y   z
b  46  86  37  52  11
c  94  91  59  63  81
d  46  64  93  33  23


In [197]:
# a fancy slice, for more than one row... with a skip
df.loc['b':'d':2]  

    v   w   x   y   z
b  46  86  37  52  11
d  46  64  93  33  23


In [199]:
# boolean series describing which rows we want

# example: show all rows in df where w > w's mean

df.loc[df['w'] > df['w'].mean()]

    v   w   x   y   z
a  56  83  63  76  30
b  46  86  37  52  11
c  94  91  59  63  81


In [201]:
# example: show all rows in df where v is even and y is > y's mean

df.loc[(df['v'] % 2 == 0) &
       (df['y'] > df['y'].mean())]

    v   w   x   y   z
a  56  83  63  76  30
c  94  91  59  63  81
e  28  19  87  61  22


In [202]:
# how can we restrict the columns? We add a column selector

# when we retrieve a single value, we get back that value

df.loc[
    'a',     # row selector
       'x'   # column selector
]   

63

In [204]:
# let's retrieve more than one row

# when we retrieve multiple values, we get a series (or a data frame, if it's 2D)

df.loc[
    ['a', 'c'],     # row selector
       'x'         # column selector
]   

a    63
c    59
Name: x, dtype: int64

In [205]:
df.loc[
    ['a', 'c'],     # row selector
    ['x', 'z']         # column selector
]   

    x   z
a  63  30
c  59  81


In [206]:
# can I assign this way? Absolutely!

df.loc[
    'a',     # row selector
       'x'   # column selector
]    = 9999

In [207]:
df

    v   w     x   y   z
a  56  83  9999  76  30
b  46  86    37  52  11
c  94  91    59  63  81
d  46  64    93  33  23
e  28  19    87  61  22


In [208]:
df.loc[
    ['a', 'c'],     # row selector
    ['x', 'z']         # column selector
]   = 8888

In [209]:
df

    v   w     x   y     z
a  56  83  8888  76  8888
b  46  86    37  52    11
c  94  91  8888  63  8888
d  46  64    93  33    23
e  28  19    87  61    22


In [210]:
df.loc[
    ['a', 'c'],        # row selector
    ['x', 'z']         # column selector
]   = np.nan

In [211]:
df

    v   w     x   y     z
a  56  83   NaN  76   NaN
b  46  86  37.0  52  11.0
c  94  91   NaN  63   NaN
d  46  64  93.0  33  23.0
e  28  19  87.0  61  22.0


# Exercise: Retrieving family members

1. Recreate your data frame with family members -- their names, ages, and shoe sizes.  Try to have 4-5 rows.
2. Find the names of all family members above the median age.
3. Find the average shoe size and age for people whose names come after 'M' alphabetically (that is, `name > 'M'`).
