# Agenda, week 2


1. Recap and Q&A
2. dtypes in Pandas
     - What are they?
     - How do they work?
     - How do we change them?
     - Why do we care?
3. `NaN` -- "not a number"
    - What is it?
    - Why do we need it?
    - How do we work with it?
4. Data frames    
    - Creating data frames
    - Retrieving from them (rows vs. columns)
    - `.loc` and `.iloc`
5. Adding and removing data
    - Add rows
    - Add columns
    - Remove rows
    - Remove columns
6. Useful methods and attributes    
7. Using boolean ("mask") indexes to retrieve interesting data
    - Using `.loc` with a row specifier + column specifier
8. Reading data from CSV     

# A quick review of last week's topics

1. A series is a one-dimensional data structure
2. The values in a series can be anything -- typically, text (strings), integers, or floats.
3. By default, the index of a series works like the positions in a Python list: it starts at 0 and goes up to the length of the series minus 1.  
4. We can set the index of a series to be any values we want -- most typically integers, but we can use strings, too.
5. Unlike most Python data structures, the index of a series can have repeated values.

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [3]:
np.random.seed(0)
s1 = Series(np.random.randint(0, 100, 10),
           index=list('abcdefghij'))
s2 = Series(np.random.randint(0, 100, 10),
           index=list('fghijfghij'))


In [4]:
s1

a    44
b    47
c    64
d    67
e    67
f     9
g    83
h    21
i    36
j    87
dtype: int64

In [5]:
s2

f    70
g    88
h    88
i    12
j    58
f    65
g    39
h    87
i    46
j    88
dtype: int64

In [6]:
s1.loc['b']

47

In [7]:
s1.loc[['b', 'd']]

b    47
d    67
dtype: int64

In [8]:
s2.loc['b']

KeyError: 'b'

In [10]:
s2.loc['f']

f    70
f    65
dtype: int64

In [11]:
s1 + s2

a      NaN
b      NaN
c      NaN
d      NaN
e      NaN
f     79.0
f     74.0
g    171.0
g    122.0
h    109.0
h    108.0
i     48.0
i     82.0
j    145.0
j    175.0
dtype: float64

In [12]:
s1

a    44
b    47
c    64
d    67
e    67
f     9
g    83
h    21
i    36
j    87
dtype: int64

In [13]:
s1.mean()

52.5

In [14]:
# which elements of s1 are bigger than s1's mean?
s1 > s1.mean()

a    False
b    False
c     True
d     True
e     True
f    False
g     True
h    False
i    False
j     True
dtype: bool

In [15]:
# now let's apply that boolean series back to s1

# the series we get back contains all elements of s1 
# whose values are greater than s1's mean.
# notice that the index is kept along with the elements

s1.loc[s1 > s1.mean()]

c    64
d    67
e    67
g    83
j    87
dtype: int64

In [16]:
s1.head(2)

a    44
b    47
dtype: int64

In [17]:
# when we run s1.value_counts(), the result is a series
# whose index contains the unique values from s1
# whose values are the number of times that each of s1's elements appeared

s1.value_counts()

67    2
44    1
47    1
64    1
9     1
83    1
21    1
36    1
87    1
dtype: int64

In [18]:
s1

a    44
b    47
c    64
d    67
e    67
f     9
g    83
h    21
i    36
j    87
dtype: int64

In [19]:
# let's talk about dtypes!

s = Series([10, 20, 30, 40, 50],
          index=list('abcde'))
s


a    10
b    20
c    30
d    40
e    50
dtype: int64

# What are dtypes?  

Python is not an obviously good candidate for data analysis. That's because each number in Python is actually an object, one that's very large (in memory usage). If you are dealing with many billions of numbers, this will quickly use up the RAM on your system, and will also make your programs very slow.

The advantage of Pandas (and of NumPy, which sits behind the scenes) is that it doesn't use Python's numbers. Rather, it uses C's numbers, which are VERY VERY small in comparison.

The good news is that Pandas is thus very efficient in both memory usage and speed.

The bad news is that we have to do more work. We have to choose which *type* of integer, or float, or other value (but usually ints and floats) we want to use.

The big choice? How many bits they should contain.

By default, Pandas will use `int64` for our integers. That is: 64-bit integers.

That means we get 2\*\*64 different integers. Because `int64` is signed, they run from -2\*\*63 to 2\*\*63 - 1.

What if my numbers are all small? For example, what if I'm tracking ages in a population? I'm unlikely to have someone several quadrillion years old. It might make more sense to save memory, without messing up the accuracy of our data, by choosing a different dtype.
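
To make the trade-off concrete, here is a minimal sketch comparing the memory used by the same values stored as `int64` and as `int8`. The ages are made up for illustration, and `nbytes` counts only the values, not the index.

```python
import numpy as np
import pandas as pd

# hypothetical ages, stored explicitly as 64-bit integers
ages = pd.Series([31, 45, 27, 62, 18], dtype=np.int64)

# the same values, stored as 8-bit integers
small_ages = ages.astype(np.int8)

print(ages.nbytes)        # 5 values * 8 bytes each
print(small_ages.nbytes)  # 5 values * 1 byte each
```

All of these ages fit comfortably in 8 bits, so nothing is lost in the conversion.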

In [20]:
2**64

18446744073709551616

# Valid dtypes

When you choose a dtype, you have to balance the size/speed with your data needs, because if you choose a dtype that's too small, you will lose data and never know it.

## Integers
- `np.int64` (*default*) or `'int64'`
- `np.int32` or `'int32'`
- `np.int16` or `'int16'`
- `np.int8` or `'int8'`
- `np.uint64` or `'uint64'`
- `np.uint32` or `'uint32'`
- `np.uint16` or `'uint16'`
- `np.uint8` or `'uint8'`

## Floats
- `np.float128` or `'float128'` (only available on some platforms)
- `np.float64`  (*default*) or `'float64'`
- `np.float32` or `'float32'`
- `np.float16` or `'float16'`
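
If you aren't sure which values a given integer dtype can hold, NumPy can tell you. The following is a small sketch using `np.iinfo`; the variable names `int8_range` and `uint8_range` are just for illustration.

```python
import numpy as np

# print the smallest and largest value each integer dtype can hold
for dtype in [np.int8, np.int16, np.int32, np.int64,
              np.uint8, np.uint16, np.uint32, np.uint64]:
    info = np.iinfo(dtype)
    print(f'{np.dtype(dtype).name:>6}: {info.min} to {info.max}')

# the two 8-bit types, as (min, max) tuples
int8_range = (np.iinfo(np.int8).min, np.iinfo(np.int8).max)
uint8_range = (np.iinfo(np.uint8).min, np.iinfo(np.uint8).max)
```

For floats, `np.finfo` plays the same role.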

In [21]:
# 8 bits give us 2**8 == 256 different values, so int8 could, in theory, hold 0-255.
# but we also need negative numbers, so int8 really holds -128 to 127
# if you know you'll only have non-negative numbers, a uint type spends all of its values on them: uint8 holds 0 to 255

2**8

256

In [22]:
# I can set the dtype when I create a series



s = Series([10, 20, 30, 40, 50], dtype=np.int8)

In [23]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int8

In [24]:
# how much memory did I just save?
# int8 == 8 bits, or 1 byte, per integer
# int64 == 64 bits, or 8 bytes, per integer

# in this series, I saved 5*8 - 5*1 = 35 bytes

In [25]:
s**2   # put s to the 2nd power

0    100
1   -112
2   -124
3     64
4    -60
dtype: int8

In [28]:
# it was a big mistake to use int8... now what?
s

0    10
1    20
2    30
3    40
4    50
dtype: int8

In [29]:
s.dtype

dtype('int8')

In [30]:
# can I just set it to a new dtype? (no, as the error shows)
s.dtype = np.int16

AttributeError: property 'dtype' of 'Series' object has no setter

In [31]:
# we can get a new series back from the existing one, 
# with the values converted to a new dtype

# the way to do that is with "astype"

s.astype(np.int32)

0    10
1    20
2    30
3    40
4    50
dtype: int32

In [32]:
# I still haven't changed s!  I can, however, assign the new series back to s

s = s.astype(np.int32)

In [33]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int32

In [34]:
s ** 2

0     100
1     400
2     900
3    1600
4    2500
dtype: int32

In [36]:
# you'll get a warning, not an error, if you try to set the dtype too small
s = Series([10000, 20000, 30000], dtype=np.int8)


In [37]:
s

0    16
1    32
2    48
dtype: int8

In [38]:
s = Series([10000, 20000, 30000])

In [39]:
s.astype(np.int8)

0    16
1    32
2    48
dtype: int8
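
Since a too-small dtype silently mangles the values, it can be worth checking before converting. The sketch below shows two possible approaches: comparing against the range reported by `np.iinfo`, and asking `pd.to_numeric` to choose the smallest signed integer dtype that fits. The names `fits_int8` and `downcast` are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([10000, 20000, 30000])

# check by hand: does every value fit within int8's range?
info = np.iinfo(np.int8)
fits_int8 = bool(s.between(info.min, info.max).all())

# or let Pandas pick the smallest signed integer dtype that holds all of the values
downcast = pd.to_numeric(s, downcast='integer')

print(fits_int8)
print(downcast.dtype)
```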

In [40]:
# what happens if I have a series containing text?
# even if that text contains only digits, there's a difference between numbers and strings

s = Series('12 34 56 78'.split())

In [41]:
# if the dtype is object, that means the series contains Python objects, not NumPy/Pandas data

s

0    12
1    34
2    56
3    78
dtype: object

In [42]:
# what happens if I try to get s.mean()

s.mean()

3086419.5

In [43]:
# huh?

# s.mean() first adds together all of the values

s.sum()

'12345678'

In [45]:
int(s.sum()) / 4

3086419.5

In [47]:
# how can we get a more reasonable answer to this question?
# how can we turn s into a series of integers, and then calculate the mean?

s.astype(np.int64).mean()

45.0

In [48]:
s

0    12
1    34
2    56
3    78
dtype: object

In [49]:
s = s.astype(np.int64)
s

0    12
1    34
2    56
3    78
dtype: int64

In [50]:
# what happens if I now change one of the values to be a float?

s.loc[2] = 34.56

In [51]:
# the dtype for the entire series has changed, to reflect our float values

s

0    12.00
1    34.00
2    34.56
3    78.00
dtype: float64

In [52]:
# Unix time starts at 12 midnight, 1 Jan 1970
# it counts seconds since then

# originally, they used a 32-bit integer
2**32

4294967296
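
Why does the number of bits matter here? As a rough sketch, assuming the counter is a *signed* 32-bit integer (so its largest value is 2\*\*31 - 1), we can ask Pandas which moment in time that largest value represents:

```python
import pandas as pd

# the largest value that a signed 32-bit integer can hold
max_int32 = 2**31 - 1

# interpret it as a number of seconds since 1 Jan 1970 (UTC)
last_moment = pd.Timestamp(max_int32, unit='s')
print(last_moment)
```

One second later, a signed 32-bit counter would wrap around, just as our `int8` values did above.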

# Exercise: Dtypes

1. Ask the user to enter a bunch of integers, separated by spaces (in a string).
2. Turn that string into a series of integers.
3. Show all of the numbers that are greater than the mean.


In [58]:
x = input('Enter integers: ').strip()

Enter integers: 10 20 30 40 50


In [59]:
x

'10 20 30 40 50'

In [64]:
s = Series(x.split())

In [66]:
s = s.astype(np.int64)

s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [67]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [68]:
# boolean series based on s
s > s.mean()

0    False
1    False
2    False
3     True
4     True
dtype: bool

In [69]:
s.loc[s > s.mean()]

3    40
4    50
dtype: int64
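
The solution above assumes that the user types only integers. Here is a minimal sketch of what happens otherwise; `to_int_series` is a hypothetical helper, not part of Pandas, and the input strings are made up.

```python
import numpy as np
import pandas as pd

def to_int_series(text):
    """Split text on whitespace and return a series of 64-bit integers."""
    return pd.Series(text.split()).astype(np.int64)

good = to_int_series('10 20 30 40 50')
print(good.loc[good > good.mean()])

try:
    to_int_series('10 20 thirty 40 50')   # hypothetical bad input
except ValueError as e:
    # astype raises ValueError when a string can't be turned into an integer
    print(f'Could not convert: {e}')
```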

# Missing data

In almost every data set, some data will be missing. How do we represent that?

- If we use 0, then our calculations will be completely off. (Also, how can we then determine whether 0 is really 0, or indicating that something isn't there?)
- We could use a very small or large number, like -999. But then, we're in a similar situation, where we might calculate things with that bad value!

We need a value that we cannot possibly confuse with others.

The solution in Pandas (and in NumPy, and many other mathematical systems) is to use a special floating-point value called `NaN`, short for "not a number."

In [73]:
# You can write it as "big NaN" (NumPy 2.0 removed this alias; np.nan still works)
np.NaN

nan

In [74]:
# you can write it as "little nan"
np.nan

nan

In [75]:
# if you want to use these names without the np. prefix
# (on NumPy 2.0 and later, only the lowercase name nan can be imported)
from numpy import nan, NaN

In [76]:
# what is nan?

type(nan)

float

In [77]:
# is nan equal to itself?
nan == nan

False
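
Because `nan` is never equal to anything, including itself, `==` can't be used to look for it. A short sketch of two functions that can:

```python
import numpy as np
import pandas as pd

value = np.nan

print(value == np.nan)   # False: == can't find NaN
print(np.isnan(value))   # True
print(pd.isna(value))    # True
```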

In [78]:
s = Series([10, 20, 30, nan, 50, 60])

In [79]:
s

0    10.0
1    20.0
2    30.0
3     NaN
4    50.0
5    60.0
dtype: float64

In [80]:
# can I convert s to ints?

s.astype(np.int64)

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
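
This goes beyond what we cover in this session, but for reference: Pandas also has "nullable" integer dtypes, which can hold missing values. A minimal sketch, using the `'Int64'` dtype (capital "I", passed as a string):

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, np.nan, 50, 60])

# 'Int64' is a Pandas extension dtype, distinct from NumPy's int64
nullable = s.astype('Int64')
print(nullable)
```

The missing value is displayed as `<NA>` rather than `NaN`, and the other values are shown as integers.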

In [81]:
s

0    10.0
1    20.0
2    30.0
3     NaN
4    50.0
5    60.0
dtype: float64

In [82]:
s.min()

10.0

In [83]:
s.max()

60.0

In [84]:
s.mean()

34.0

In [86]:
# in Pandas, most methods ignore NaN values
# this is the *opposite* of NumPy!

s.mean(skipna=False)  # if we insist on taking nan into consideration when calculating the mean... we get nan!

nan

In [88]:
s.count()   # how many non-NaN values are in s?


5

In [90]:
s.size       # since the size (not a method!) is 6, and count() is 5, we know that there is 1 NaN value

6

In [93]:
s.isna()   # this tells us where there are NaNs

0    False
1    False
2    False
3     True
4    False
5    False
dtype: bool

In [94]:
# let's find out how many NaN values there are:

s.isna().value_counts()

False    5
True     1
dtype: int64
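
If all we want is the number of `NaN` values, a shorter route is to sum the boolean series, since `True` counts as 1 and `False` as 0. A small sketch (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, np.nan, 50, 60])

nan_count = s.isna().sum()        # True counts as 1, False as 0
size_minus_count = s.size - s.count()   # the same answer, from size and count()

print(nan_count, size_minus_count)
```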

In [95]:
# what can we do about NaN?

# (1) Ignore it
# that will sometimes work

In [96]:
# (2) Delete it
# that's also sometimes good, but often impossible because we'll be in a data frame, not a series

s.dropna()  # this returns a new series, just like s, but without any NaN values

0    10.0
1    20.0
2    30.0
4    50.0
5    60.0
dtype: float64

In [97]:
# (3) Replace it
# I can use the "fillna" method, and give it any value I want

s.fillna(999)

0     10.0
1     20.0
2     30.0
3    999.0
4     50.0
5     60.0
dtype: float64

In [98]:
s.fillna(s.mean())  # replace the NaN values with s's mean

0    10.0
1    20.0
2    30.0
3    34.0
4    50.0
5    60.0
dtype: float64
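
A fixed value and the mean aren't the only options. As a sketch of two other possibilities, `ffill` copies the previous non-NaN value forward, and `interpolate` (linear, by default) fills in a value between the neighbors:

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, np.nan, 50, 60])

filled_forward = s.ffill()       # NaN becomes the value just before it
interpolated = s.interpolate()   # NaN becomes a value between its neighbors

print(filled_forward.loc[3])
print(interpolated.loc[3])
```

Which of these makes sense depends entirely on what the data represents.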

# Exercise: Broken temp stats

1. Create a series of 5 elements in which the values are the projected high temp for the next 5 days, and the index contains the day names.
2. In two of these cases, replace the values with `NaN`.
3. What is the mean that you get from the existing values?
4. Replace `NaN` with the mean of those existing values.

In [102]:
s = Series([25, 21, 17, 19, 21],
          index='Wed Thu Fri Sat Sun'.split())

s.loc['Thu'] = NaN
s.loc['Sat'] = NaN


In [103]:
s

Wed    25.0
Thu     NaN
Fri    17.0
Sat     NaN
Sun    21.0
dtype: float64

In [104]:
s.mean()

21.0

In [105]:
# how can I replace the NaN values with the mean?

s.fillna(s.mean())   # calculate s.mean(), then fill in all NaN values with it, returning a new series

Wed    25.0
Thu    21.0
Fri    17.0
Sat    21.0
Sun    21.0
dtype: float64

In [106]:
# now we modify s
s = s.fillna(s.mean()) 

In [107]:
s

Wed    25.0
Thu    21.0
Fri    17.0
Sat    21.0
Sun    21.0
dtype: float64

In [108]:
s.describe()

count     5.000000
mean     21.000000
std       2.828427
min      17.000000
25%      21.000000
50%      21.000000
75%      21.000000
max      25.000000
dtype: float64

In [109]:
# could I get descriptive statistics before, with NaN?

s = Series([25, 21, 17, 19, 21],
          index='Wed Thu Fri Sat Sun'.split())

s.loc['Thu'] = NaN
s.loc['Sat'] = NaN

s.describe()

count     3.0
mean     21.0
std       4.0
min      17.0
25%      19.0
50%      21.0
75%      23.0
max      25.0
dtype: float64

# Next up:

Data frames!
  - Creating
  - Adding/removing data
  - Useful methods

# Data frames

A data frame is a 2D table.  It has:

- Rows, and each row has an index
- Columns, and each column has a name

Each column is, behind the scenes, a Pandas series.

Which means that each column has its own dtype, and that all of the values in that column must be of the same dtype. 

Let's create a data frame, using a list of lists to set it up.

In [110]:
df = DataFrame([[10, 20, 30, 40],
               [50, 60, 70, 80],
               [90, 100, 110, 120]])

In [111]:
df

    0    1    2    3
0  10   20   30   40
1  50   60   70   80
2  90  100  110  120


In [112]:
# I can find out the full dimensions of a data frame with the "shape" attribute
df.shape    # (rows, columns)

(3, 4)

In [113]:
df.dtypes   # what is the dtype of each column?

0    int64
1    int64
2    int64
3    int64
dtype: object

In [114]:
# the thing is, the index and columns on this data frame are just integers
# that's confusing and doesn't really give us any semantic power

# let's create our data frame again, giving both index and columns

df = DataFrame([[10, 20, 30, 40],
               [50, 60, 70, 80],
               [90, 100, 110, 120]],
              index=list('abc'),      # this is just like setting the index on a new series
              columns=list('wxyz'))   # this is also just like that, but works on the columns

In [115]:
df

    w    x    y    z
a  10   20   30   40
b  50   60   70   80
c  90  100  110  120


In [116]:
df.dtypes

w    int64
x    int64
y    int64
z    int64
dtype: object

In [118]:
# how do I retrieve a row from df?
# remember that when we had a series, we also had an index. We could use two different 
# systems to retrieve

# (1) retrieve by index
df.loc['a']

w    10
x    20
y    30
z    40
Name: a, dtype: int64

In [119]:
df.loc['c']

w     90
x    100
y    110
z    120
Name: c, dtype: int64

In [120]:
df.loc[['a', 'c']]

    w    x    y    z
a  10   20   30   40
c  90  100  110  120


In [121]:
# (2) retrieve by position, using iloc
df.iloc[0]

w    10
x    20
y    30
z    40
Name: a, dtype: int64

In [123]:
df.iloc[2]

w     90
x    100
y    110
z    120
Name: c, dtype: int64

In [124]:
df.iloc[[0, 2]]

    w    x    y    z
a  10   20   30   40
c  90  100  110  120


In [125]:
# what if I want to retrieve a column (or more)?

df['w']   # notice -- just [], without .loc

a    10
b    50
c    90
Name: w, dtype: int64

In [126]:
df[['w', 'y']]  # retrieve more than one column

    w    y
a  10   30
b  50   70
c  90  110
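
Rows and columns can also be selected together, by giving `.loc` (or `.iloc`) both a row specifier and a column specifier, separated by a comma. A minimal sketch, rebuilding the small data frame from above so that the example stands on its own:

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40],
                   [50, 60, 70, 80],
                   [90, 100, 110, 120]],
                  index=list('abc'),
                  columns=list('wxyz'))

one_value = df.loc['a', 'w']              # row 'a', column 'w'
subset = df.loc[['a', 'c'], ['w', 'y']]   # two rows, two columns
by_position = df.iloc[0, 1]               # first row, second column

print(one_value)
print(subset)
print(by_position)
```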


# Creating a random data frame

Instead of providing Pandas with a list of lists, we can also provide it with a 2D NumPy array. We can even use `np.random.randint` to give us such an array:

In [128]:
df = DataFrame(np.random.randint(0, 100, [4,5]),
              index=list('abcd'),
               columns=list('vwxyz'))

In [129]:
df

    v   w   x   y   z
a  81  37  25  77  72
b   9  20  80  69  79
c  47  64  82  99  88
d  49  29  19  19  14


# Exercise: Grocery store

1. Create a data frame in which each row represents one product at the store, and each column represents some information about the products: ID number, name, price, and sales.
2. Make sure there are 4-5 products in your data frame.
3. Calculate how much revenue you had from all of these products (price * sales).
4. Calculate descriptive statistics on the sales data.

In [132]:
df = DataFrame([[10, 'apple', 1, 10],
                [15, 'banana', 1.20, 15],
                [17, 'calculator', 5, 20],
                [28, 'coffee', 8, 30],
                [35, 'chair', 100, 12],
                ],
              columns='id name price sales'.split())

In [133]:
df

   id        name  price  sales
0  10       apple    1.0     10
1  15      banana    1.2     15
2  17  calculator    5.0     20
3  28      coffee    8.0     30
4  35       chair  100.0     12


In [135]:
df['price']

0      1.0
1      1.2
2      5.0
3      8.0
4    100.0
Name: price, dtype: float64

In [136]:
# what kinds of data do I have in my data frame?
df.dtypes

id         int64
name      object
price    float64
sales      int64
dtype: object

In [137]:
# we can multiply two series that share an index, as here:

df['price'] * df['sales']

0      10.0
1      18.0
2     100.0
3     240.0
4    1200.0
dtype: float64

In [138]:
# get descriptive statistics for our revenue (price * sales)

(df['price'] * df['sales']).describe()

count       5.00000
mean      313.60000
std       504.05833
min        10.00000
25%        18.00000
50%       100.00000
75%       240.00000
max      1200.00000
dtype: float64

In [140]:
# I would like to put the revenue data in the data frame, as a new column

# to add a new column to a data frame, just assign to it!
# - if the column name already exists, you'll replace the existing contents
# - if the column name doesn't exist, then you'll add a new column of that name

df['revenue'] = df['price'] * df['sales']

In [141]:
df

   id        name  price  sales  revenue
0  10       apple    1.0     10     10.0
1  15      banana    1.2     15     18.0
2  17  calculator    5.0     20    100.0
3  28      coffee    8.0     30    240.0
4  35       chair  100.0     12   1200.0


In [142]:
# what if I want to add a new row?

# so long as the index of that row is new, I can assign a list or series of values
# to df.loc[INDEX]

df.loc[5] = [37, 'pen', 1, 30, 30]

In [143]:
df

   id        name  price  sales  revenue
0  10       apple    1.0     10     10.0
1  15      banana    1.2     15     18.0
2  17  calculator    5.0     20    100.0
3  28      coffee    8.0     30    240.0
4  35       chair  100.0     12   1200.0
5  37         pen    1.0     30     30.0
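
Assigning to `df.loc` adds one row at a time. To add several rows at once, one option is `pd.concat`, which glues data frames together. A minimal sketch; the product data and the name `new_rows` are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame([[10, 'apple', 1.0, 10],
                   [15, 'banana', 1.2, 15]],
                  columns='id name price sales'.split())

# hypothetical new products, in a data frame with the same columns
new_rows = pd.DataFrame([[37, 'pen', 1.0, 30],
                         [41, 'notebook', 3.5, 12]],
                        columns=df.columns)

# ignore_index=True renumbers the rows 0, 1, 2, ... in the result
combined = pd.concat([df, new_rows], ignore_index=True)
print(combined)
```

Like most Pandas methods we've seen, `pd.concat` returns a new data frame and leaves the originals untouched.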


In [144]:
# things get trickier when I want to drop rows or columns
# dropping rows is easier: we use the "drop" method, and provide the index of the row

df.drop(5)  # this returns a new data frame, not actually modifying the original one

   id        name  price  sales  revenue
0  10       apple    1.0     10     10.0
1  15      banana    1.2     15     18.0
2  17  calculator    5.0     20    100.0
3  28      coffee    8.0     30    240.0
4  35       chair  100.0     12   1200.0


In [145]:
# this way, we catch the new data frame returned by df.drop, and assign it back to df
df = df.drop(5)

In [146]:
df

   id        name  price  sales  revenue
0  10       apple    1.0     10     10.0
1  15      banana    1.2     15     18.0
2  17  calculator    5.0     20    100.0
3  28      coffee    8.0     30    240.0
4  35       chair  100.0     12   1200.0


In [148]:
# how can I remove a column?

# I can use the same method exactly, df.drop -- but I need to tell df.drop that
# we want to remove a column, because it'll assume a row

df.drop('revenue', axis='columns')  # this means: remove a column, not a row

   id        name  price  sales
0  10       apple    1.0     10
1  15      banana    1.2     15
2  17  calculator    5.0     20
3  28      coffee    8.0     30
4  35       chair  100.0     12


In [149]:
# assign the result back

df = df.drop('revenue', axis='columns')

In [150]:
df

   id        name  price  sales
0  10       apple    1.0     10
1  15      banana    1.2     15
2  17  calculator    5.0     20
3  28      coffee    8.0     30
4  35       chair  100.0     12


# Removing more than one row/column

Just provide a list of strings (row/column names), rather than a single string.

In [151]:
df.drop(['id', 'name', 'price'], axis='columns')

   sales
0     10
1     15
2     20
3     30
4     12
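
As an alternative to `axis='columns'`, `drop` also accepts `columns=` and `index=` keyword arguments, which some people find easier to read. A small sketch on a cut-down version of the grocery data:

```python
import pandas as pd

df = pd.DataFrame([[10, 'apple', 1.0, 10],
                   [15, 'banana', 1.2, 15],
                   [17, 'calculator', 5.0, 20]],
                  columns='id name price sales'.split())

without_cols = df.drop(columns=['id', 'name'])   # remove two columns
without_rows = df.drop(index=[0, 2])             # remove two rows

print(without_cols)
print(without_rows)
```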


# Exercise: Family data

1. Create a data frame with two rows, for two people in your family. (You can pretend, if you need more data here, or in other parts of this exercise.)  For each person, you want to have three pieces of data: First name, last name, and age.  
2. Calculate the mean age of people in your data frame.
3. Add a new person.  Has the mean age changed?
4. Add a new column, `shoe_size`, for each of the people.
5. What is the average shoe size?
6. Remove the person you just added. What is the mean shoe size now?
7. Remove the `shoe_size` column. 

In [154]:
# - two rows, one for each person
# - three columns: first_name, last_name, age

df = DataFrame([['Reuven', 'Lerner', 52],
                 ['Atara', "Lerner-Friedman", 21]],
              columns=list('first_name last_name age'.split()))
df

   first_name        last_name  age
0      Reuven           Lerner   52
1       Atara  Lerner-Friedman   21


In [155]:
df['age'].mean()

36.5

In [156]:
# add a new person

df.loc[2] = ['Shikma', 'Lerner-Friedman', 19]
df

   first_name        last_name  age
0      Reuven           Lerner   52
1       Atara  Lerner-Friedman   21
2      Shikma  Lerner-Friedman   19


In [157]:
df['age'].mean()

30.666666666666668

In [158]:
# add a new column

df['shoe_size'] = [46, 42, 42]   # adding a column means adding a new value for each row
df

   first_name        last_name  age  shoe_size
0      Reuven           Lerner   52         46
1       Atara  Lerner-Friedman   21         42
2      Shikma  Lerner-Friedman   19         42


In [159]:
df['shoe_size'].mean()

43.333333333333336

In [161]:
# remove row 2
df = df.drop(2)

df

   first_name        last_name  age  shoe_size
0      Reuven           Lerner   52         46
1       Atara  Lerner-Friedman   21         42


In [162]:
df['shoe_size'].mean()

44.0

In [163]:
df = df.drop('shoe_size', axis='columns')
df

   first_name        last_name  age
0      Reuven           Lerner   52
1       Atara  Lerner-Friedman   21


# Next up:

1. Useful methods and attributes
2. Querying with boolean indexes

In [165]:
np.random.seed(0)

df = DataFrame(np.random.randint(0, 100, [5,6]),
              index=list('abcde'),
              columns=list('uvwxyz'))
df

    u   v   w   x   y   z
a  44  47  64  67  67   9
b  83  21  36  87  70  88
c  88  12  58  65  39  87
d  46  88  81  37  25  77
e  72   9  20  80  69  79


In [166]:
# what if I want to see the first 3 rows of this data frame?
# we know that there is a series method .head
# sure enough, there's also a data frame method .head

df.head(3)

    u   v   w   x   y   z
a  44  47  64  67  67   9
b  83  21  36  87  70  88
c  88  12  58  65  39  87


In [168]:
df.head()  # 5 rows by default

    u   v   w   x   y   z
a  44  47  64  67  67   9
b  83  21  36  87  70  88
c  88  12  58  65  39  87
d  46  88  81  37  25  77
e  72   9  20  80  69  79


In [169]:
df.tail()  # final 5 rows, which are all of the rows in this case!

    u   v   w   x   y   z
a  44  47  64  67  67   9
b  83  21  36  87  70  88
c  88  12  58  65  39  87
d  46  88  81  37  25  77
e  72   9  20  80  69  79


In [170]:
df.tail(3)  # final 3 rows

    u   v   w   x   y   z
c  88  12  58  65  39  87
d  46  88  81  37  25  77
e  72   9  20  80  69  79


In [171]:
df.shape   # this tells us how many rows x how many columns

(5, 6)

In [172]:
df

    u   v   w   x   y   z
a  44  47  64  67  67   9
b  83  21  36  87  70  88
c  88  12  58  65  39  87
d  46  88  81  37  25  77
e  72   9  20  80  69  79


In [173]:
# describe 
# we've seen describe as a great method for descriptive statistics (count, mean, std, min, 25, 50, 75, max)

df['u'].describe()    # we get a series back describing this series, index represents the measures

count     5.000000
mean     66.600000
std      20.562101
min      44.000000
25%      46.000000
50%      72.000000
75%      83.000000
max      88.000000
Name: u, dtype: float64

In [174]:
# general rule: Many series methods can also be run on data frames
# when that happens, we get one result per column

df.mean()  # on a series, mean returns one number. On a data frame, it returns 1 number per column

u    66.6
v    35.4
w    51.8
x    67.2
y    54.0
z    68.0
dtype: float64

In [175]:
df.min()

u    44
v     9
w    20
x    37
y    25
z     9
dtype: int64

In [176]:
df.median()

u    72.0
v    21.0
w    58.0
x    67.0
y    67.0
z    79.0
dtype: float64

In [177]:
# since each individual descriptive statistics method gives me a series
# asking for all of them (in the "describe" method) will give me a data frame

df.describe()

               u          v          w          x          y          z
count        5.0        5.0        5.0        5.0        5.0        5.0
mean        66.6       35.4       51.8       67.2       54.0       68.0
std    20.562101  32.989392  23.983328  19.188538  20.712315  33.331667
min         44.0        9.0       20.0       37.0       25.0        9.0
25%         46.0       12.0       36.0       65.0       39.0       77.0
50%         72.0       21.0       58.0       67.0       67.0       79.0
75%         83.0       47.0       64.0       80.0       69.0       87.0
max         88.0       88.0       81.0       87.0       70.0       88.0


In [178]:
df.dtypes

u    int64
v    int64
w    int64
x    int64
y    int64
z    int64
dtype: object

In [180]:
# I can add a new column to my data frame

df['veg'] = ['carrot', 'tomato', 'lettuce', 'fennel', 'corn']
df

    u   v   w   x   y   z      veg
a  44  47  64  67  67   9   carrot
b  83  21  36  87  70  88   tomato
c  88  12  58  65  39  87  lettuce
d  46  88  81  37  25  77   fennel
e  72   9  20  80  69  79     corn


In [181]:
df.dtypes

u       int64
v       int64
w       int64
x       int64
y       int64
z       int64
veg    object
dtype: object

In [182]:
# what if I now ask for the descriptive statistics?

df.describe()

               u          v          w          x          y          z
count        5.0        5.0        5.0        5.0        5.0        5.0
mean        66.6       35.4       51.8       67.2       54.0       68.0
std    20.562101  32.989392  23.983328  19.188538  20.712315  33.331667
min         44.0        9.0       20.0       37.0       25.0        9.0
25%         46.0       12.0       36.0       65.0       39.0       77.0
50%         72.0       21.0       58.0       67.0       67.0       79.0
75%         83.0       47.0       64.0       80.0       69.0       87.0
max         88.0       88.0       81.0       87.0       70.0       88.0


In [183]:
df['veg'].describe()

count          5
unique         5
top       carrot
freq           1
Name: veg, dtype: object

In [184]:
df['veg'] = ['carrot', 'carrot', 'carrot', 'fennel', 'fennel']
df

    u   v   w   x   y   z     veg
a  44  47  64  67  67   9  carrot
b  83  21  36  87  70  88  carrot
c  88  12  58  65  39  87  carrot
d  46  88  81  37  25  77  fennel
e  72   9  20  80  69  79  fennel


In [185]:
df['veg'].describe()

count          5
unique         2
top       carrot
freq           3
Name: veg, dtype: object

In [186]:
df.loc['a', 'x'] = NaN
df.loc['a', 'y'] = NaN
df.loc['b', 'z'] = NaN
df.loc['c', 'u'] = NaN
df.loc['e', 'v'] = NaN
df.loc['e', 'z'] = NaN


In [187]:
df

      u     v   w     x     y     z     veg
a  44.0  47.0  64   NaN   NaN   9.0  carrot
b  83.0  21.0  36  87.0  70.0   NaN  carrot
c   NaN  12.0  58  65.0  39.0  87.0  carrot
d  46.0  88.0  81  37.0  25.0  77.0  fennel
e  72.0   NaN  20  80.0  69.0   NaN  fennel


In [189]:
df.dtypes

u      float64
v      float64
w        int64
x      float64
y      float64
z      float64
veg     object
dtype: object

In [190]:
df.describe()

               u          v          w          x          y          z
count        4.0        4.0        5.0        4.0        4.0        3.0
mean       61.25       42.0       51.8      67.25      50.75  57.666667
std     19.31105  34.068558  23.983328  22.156639  22.396056  42.442117
min         44.0       12.0       20.0       37.0       25.0        9.0
25%         45.5      18.75       36.0       58.0       35.5       43.0
50%         59.0       34.0       58.0       72.5       54.0       77.0
75%        74.75      57.25       64.0      81.75      69.25       82.0
max         83.0       88.0       81.0       87.0       70.0       87.0


In [192]:
df.mean(numeric_only=True)

u    61.250000
v    42.000000
w    51.800000
x    67.250000
y    50.750000
z    57.666667
dtype: float64

In [193]:
df

      u     v   w     x     y     z     veg
a  44.0  47.0  64   NaN   NaN   9.0  carrot
b  83.0  21.0  36  87.0  70.0   NaN  carrot
c   NaN  12.0  58  65.0  39.0  87.0  carrot
d  46.0  88.0  81  37.0  25.0  77.0  fennel
e  72.0   NaN  20  80.0  69.0   NaN  fennel


In [194]:
# how can I get rid of the NaN values?
# (1) remove the NaNs, with dropna

df.dropna()   # this method returns only those rows in df WITHOUT ANY NaN VALUES

      u     v   w     x     y     z     veg
d  46.0  88.0  81  37.0  25.0  77.0  fennel


In [196]:
# dropna removes any row with even one NaN
# but we can convince it to be more forgiving, by establishing a minimum threshold of non-NaN values

df.dropna(thresh=6)   # this means: if a row has at least 6 non-NaN values, we'll keep it

      u     v   w     x     y     z     veg
b  83.0  21.0  36  87.0  70.0   NaN  carrot
c   NaN  12.0  58  65.0  39.0  87.0  carrot
d  46.0  88.0  81  37.0  25.0  77.0  fennel
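
To see how `thresh` decides which rows survive, it can help to count the non-NaN values in each row yourself. A minimal sketch on a smaller, made-up data frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'u': [44, 83, np.nan],
                   'v': [47, np.nan, np.nan],
                   'w': [64, 36, 58]},
                  index=list('abc'))

# how many non-NaN values does each row have?
non_nan_per_row = df.notna().sum(axis='columns')
print(non_nan_per_row)

# keep only the rows with at least 2 non-NaN values
kept = df.dropna(thresh=2)
print(kept)
```

Row `c` has only one non-NaN value, so it is the only row that `thresh=2` removes.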


In [198]:
# this didn't actually change df
df

      u     v   w     x     y     z     veg
a  44.0  47.0  64   NaN   NaN   9.0  carrot
b  83.0  21.0  36  87.0  70.0   NaN  carrot
c   NaN  12.0  58  65.0  39.0  87.0  carrot
d  46.0  88.0  81  37.0  25.0  77.0  fennel
e  72.0   NaN  20  80.0  69.0   NaN  fennel


In [200]:
# (2) fill the NaN values with non-NaN values
# 
# fill with a scalar value

df.fillna(999)  # if we want the same value in all NaNs, this will work

       u      v   w      x      y      z     veg
a   44.0   47.0  64  999.0  999.0    9.0  carrot
b   83.0   21.0  36   87.0   70.0  999.0  carrot
c  999.0   12.0  58   65.0   39.0   87.0  carrot
d   46.0   88.0  81   37.0   25.0   77.0  fennel
e   72.0  999.0  20   80.0   69.0  999.0  fennel


In [203]:
df.mean(numeric_only=True) # this will return a series -- index is df's columns, values are the mean of each column

u    61.250000
v    42.000000
w    51.800000
x    67.250000
y    50.750000
z    57.666667
dtype: float64

In [205]:
# since df.mean maps column names to mean values, when we pass it
# to df.fillna, each column's mean will be used to fill that column's NaNs

df.fillna(df.mean(numeric_only=True))

       u     v   w      x      y          z     veg
a   44.0  47.0  64  67.25  50.75        9.0  carrot
b   83.0  21.0  36   87.0   70.0  57.666667  carrot
c  61.25  12.0  58   65.0   39.0       87.0  carrot
d   46.0  88.0  81   37.0   25.0       77.0  fennel
e   72.0  42.0  20   80.0   69.0  57.666667  fennel
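
The same idea works with a plain Python dict: the keys are column names, and the values are what to fill into each column. A minimal sketch, on a small made-up data frame, that fills one column with 0 and another with its median:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [np.nan, 87, 65],
                   'z': [9, np.nan, 87]},
                  index=list('abc'))

# a different fill value for each column
filled = df.fillna({'x': 0, 'z': df['z'].median()})
print(filled)
```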


# Exercise: Weather stats

1. Create a data frame with the 10-day forecast for your area. There should be three columns: `high`, `low`, and `precip`. The index should contain day names (`Mon`, `Tue`, etc.)
2. Add a new column, `diff`, which shows how much the temperature varies on each day.
3. Get the mean high and low temps.
4. Get the mean temps on all Wednesdays in your data frame.
5. Get the mean temps on all Wednesdays and Thursdays in your data frame.

In [208]:
df = DataFrame([[25, 13, 0],
                [21, 13, 6.05],
                [17, 14, 15],
               [19, 12, 1.12],
               [21, 15, 0],
               [24, 15, 0],
               [25, 15, 1.3],
               [23, 14, 1.1],
               [23, 13, 0]],
               index='Wed Thu Fri Sat Sun Mon Tue Wed Thu'.split(),
               columns='high low precip'.split())
df

     high  low  precip
Wed    25   13     0.0
Thu    21   13    6.05
Fri    17   14    15.0
Sat    19   12    1.12
Sun    21   15     0.0
Mon    24   15     0.0
Tue    25   15     1.3
Wed    23   14     1.1
Thu    23   13     0.0


In [209]:
df['diff'] = df['high'] - df['low']
df

     high  low  precip  diff
Wed    25   13     0.0    12
Thu    21   13    6.05     8
Fri    17   14    15.0     3
Sat    19   12    1.12     7
Sun    21   15     0.0     6
Mon    24   15     0.0     9
Tue    25   15     1.3    10
Wed    23   14     1.1     9
Thu    23   13     0.0    10


In [210]:
df.mean()    # more than I asked for, but not wrong!

high      22.000000
low       13.777778
precip     2.730000
diff       8.222222
dtype: float64

In [211]:
df.mean().loc[['high', 'low']]   # works, but it's ugly

high    22.000000
low     13.777778
dtype: float64

In [213]:
# retrieve the columns we want from the start

df[['high', 'low']]    # here I get back a subset of df, with only the high + low columns

     high  low
Wed    25   13
Thu    21   13
Fri    17   14
Sat    19   12
Sun    21   15
Mon    24   15
Tue    25   15
Wed    23   14
Thu    23   13


In [214]:
df[['high', 'low']].mean()

high    22.000000
low     13.777778
dtype: float64

In [216]:
df.loc['Wed'].mean()

high      24.00
low       13.50
precip     0.55
diff      10.50
dtype: float64

In [218]:
df.loc[['Wed', 'Thu']].mean()

high      23.0000
low       13.2500
precip     1.7875
diff       9.7500
dtype: float64

# Boolean indexes with our data frame

If we run a comparison operator on a series, we get a boolean series back. We can apply that to the original series, and thus get a subset of it back.

We can also apply that boolean series to a *different* series that shares the same index.

In [219]:
df

     high  low  precip  diff
Wed    25   13     0.0    12
Thu    21   13    6.05     8
Fri    17   14    15.0     3
Sat    19   12    1.12     7
Sun    21   15     0.0     6
Mon    24   15     0.0     9
Tue    25   15     1.3    10
Wed    23   14     1.1     9
Thu    23   13     0.0    10


In [220]:
np.random.seed(0)

df = DataFrame(np.random.randint(0, 100, [5,6]),
              index=list('abcde'),
              columns=list('uvwxyz'))
df

    u   v   w   x   y   z
a  44  47  64  67  67   9
b  83  21  36  87  70  88
c  88  12  58  65  39  87
d  46  88  81  37  25  77
e  72   9  20  80  69  79


In [221]:
# I want all of the values of y > y's mean

df['y'].mean()

54.0

In [222]:
df['y'] > df['y'].mean()

a     True
b     True
c    False
d    False
e     True
Name: y, dtype: bool

In [223]:
# show all values in df['y']
# that are > df['y']'s mean

df['y'][df['y'] > df['y'].mean()]

a    67
b    70
e    69
Name: y, dtype: int64

In [224]:
# show all values in df['v']
# that correspond to values in df['y'] > its mean

# so long as df['v'] and df['y'] share an index, we're fine
# (because they're in the same data frame, they must!)

df['v'][df['y'] > df['y'].mean()]

a    47
b    21
e     9
Name: v, dtype: int64

# Uses for this technique

1. Show users whose credit balance is < 100.

```python
df['usernames'][df['credit_balance'] < 100]
```
    
2. Show products with a price > 1,000.

```python
df['product_name'][df['price'] > 1000]
```


In [225]:
# I can apply a boolean series to an entire data frame!
# this will return only those rows of the data frame where the boolean is True
# it'll return all columns of the data frame

df

Unnamed: 0,u,v,w,x,y,z
a,44,47,64,67,67,9
b,83,21,36,87,70,88
c,88,12,58,65,39,87
d,46,88,81,37,25,77
e,72,9,20,80,69,79


In [228]:
# this gives all rows in df
# where df['x'] is greater than its mean

df[df['x'] > df['x'].mean()]

Unnamed: 0,u,v,w,x,y,z
b,83,21,36,87,70,88
e,72,9,20,80,69,79


In [229]:
# notice: better to use df.loc!
df.loc[df['x'] > df['x'].mean()]

Unnamed: 0,u,v,w,x,y,z
b,83,21,36,87,70,88
e,72,9,20,80,69,79
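
The cell above recommends `df.loc` without giving a reason. One reason often given (an assumption about the intent here, not something stated in these notes) is that `.loc` accepts a row selector and a column selector in a single call, which also makes assignment through a boolean index reliable. A minimal sketch, rebuilding the same data frame by hand:

```python
from pandas import DataFrame

# Same values as the random data frame shown above, typed in explicitly
df = DataFrame([[44, 47, 64, 67, 67, 9],
                [83, 21, 36, 87, 70, 88],
                [88, 12, 58, 65, 39, 87],
                [46, 88, 81, 37, 25, 77],
                [72, 9, 20, 80, 69, 79]],
               index=list('abcde'),
               columns=list('uvwxyz'))

mask = df['x'] > df['x'].mean()   # True for rows b and e

above = df.loc[mask]              # retrieval: same rows as df[mask]

# Assignment: row selector and column selector in one .loc call,
# so the change is applied to df itself
df.loc[mask, 'x'] = 0

print(above)
print(df)
```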


# Exercise: Cold and rainy

1. Using the weather data frame from before, find days in your 10-day forecast that will be colder than average.  Show the mean precipitation that will fall on those days.
2. Show the mean precipitation for days that are warmer than average. 
3. On which kind of day is it more likely to have precipitation?

In [230]:
df = DataFrame([[25, 13, 0],
                [21, 13, 6.05],
                [17, 14, 15],
                [19, 12, 1.12],
                [21, 15, 0],
                [24, 15, 0],
                [25, 15, 1.3],
                [23, 14, 1.1],
                [23, 13, 0]],
               index='Wed Thu Fri Sat Sun Mon Tue Wed Thu'.split(),
               columns='high low precip'.split())
df

Unnamed: 0,high,low,precip
Wed,25,13,0.0
Thu,21,13,6.05
Fri,17,14,15.0
Sat,19,12,1.12
Sun,21,15,0.0
Mon,24,15,0.0
Tue,25,15,1.3
Wed,23,14,1.1
Thu,23,13,0.0


In [234]:
# find the mean precipitation on days whose low temperatures are lower than average
df['precip'][df['low'] < df['low'].mean()].mean()

1.7925

In [236]:
# find the mean precipitation on days whose high temperatures are higher than average
df['precip'][df['high'] > df['high'].mean()].mean()

0.4800000000000001
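
The two means above compare how much precipitation falls, but question 3 asks how likely precipitation is. One possible reading (an assumption, not the only way to answer) is to compare the share of days with any precipitation at all. The names `cold`, `warm`, and `rainy` are introduced here for illustration.

```python
from pandas import DataFrame

# The weather data frame from the exercise
df = DataFrame([[25, 13, 0],
                [21, 13, 6.05],
                [17, 14, 15],
                [19, 12, 1.12],
                [21, 15, 0],
                [24, 15, 0],
                [25, 15, 1.3],
                [23, 14, 1.1],
                [23, 13, 0]],
               index='Wed Thu Fri Sat Sun Mon Tue Wed Thu'.split(),
               columns='high low precip'.split())

cold = df['low'] < df['low'].mean()     # colder-than-average days
warm = df['high'] > df['high'].mean()   # warmer-than-average days
rainy = df['precip'] > 0                # days with any precipitation

# The mean of a boolean series is the fraction of True values
cold_share = rainy[cold].mean()
warm_share = rainy[warm].mean()

print(cold_share, warm_share)
```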

# Next up

1. Learn to use the two-argument form of `.loc`
2. Do a (little) work with real-world data

If you want, you can download this zipfile: https://files.lerner.co.il/data-science-exercise-files.zip

In [237]:
df

Unnamed: 0,high,low,precip
Wed,25,13,0.0
Thu,21,13,6.05
Fri,17,14,15.0
Sat,19,12,1.12
Sun,21,15,0.0
Mon,24,15,0.0
Tue,25,15,1.3
Wed,23,14,1.1
Thu,23,13,0.0


In [239]:
# I want to find days on which the high temp is greater than average

df.loc[df['high'] > df['high'].mean()]    # we're passing a ROW SELECTOR to df.loc

Unnamed: 0,high,low,precip
Wed,25,13,0.0
Mon,24,15,0.0
Tue,25,15,1.3
Wed,23,14,1.1
Thu,23,13,0.0


In [240]:
# we can give df.loc many different row selectors

df.loc['Sat']  # this retrieves the one row for Saturday

high      19.00
low       12.00
precip     1.12
Name: Sat, dtype: float64

In [241]:
df.loc['Thu']   # I got back a data frame, since two rows matched this index 

Unnamed: 0,high,low,precip
Thu,21,13,6.05
Thu,23,13,0.0


In [242]:
# I can give a two-element row selector

df.loc[['Thu', 'Fri']]

Unnamed: 0,high,low,precip
Thu,21,13,6.05
Thu,23,13,0.0
Fri,17,14,15.0


# Selecting rows, selecting columns

If we use `df.loc`, then the row selector can be:

- A single index value
- A list of index values
- A boolean index

But we can also provide a second argument to `df.loc`, after a comma: A *column selector*.  This can be:

- A column name (string)
- A list of column names

In [243]:
df

Unnamed: 0,high,low,precip
Wed,25,13,0.0
Thu,21,13,6.05
Fri,17,14,15.0
Sat,19,12,1.12
Sun,21,15,0.0
Mon,24,15,0.0
Tue,25,15,1.3
Wed,23,14,1.1
Thu,23,13,0.0


In [None]:
df.loc[   ROW_SELECTOR
       ,
          COLUMN_SELECTOR   # this is optional
      ]

In [245]:
df.loc['precip']   # precip is a column, so this won't work

KeyError: 'precip'

In [246]:
df

Unnamed: 0,high,low,precip
Wed,25,13,0.0
Thu,21,13,6.05
Fri,17,14,15.0
Sat,19,12,1.12
Sun,21,15,0.0
Mon,24,15,0.0
Tue,25,15,1.3
Wed,23,14,1.1
Thu,23,13,0.0


In [247]:
# I want the precipitation on days when the high temp is above average

# row selector: find days when the high temp is above average
# column selector: 'precip'

df.loc[
    df['high'] > df['high'].mean()    # row selector
    ,
    'precip'                          # column selector
]

Wed    0.0
Mon    0.0
Tue    1.3
Wed    1.1
Thu    0.0
Name: precip, dtype: float64

In [250]:
# I want the difference between high and low temp
# on days when more than 2 mm of rain are expected to fall

# row selector: days with precip > 2 mm
# column selector: 'diff' (high minus low), a new column we create first

df['diff'] = df['high'] - df['low']

df.loc[
    df['precip'] > 2   # row selector
    ,
    'diff' # column selector
].mean()

5.5

In [254]:
df.loc[
    ['Wed', 'Thu'],  # row selector -- rows for Wed and Thu
    ['high', 'low']  # column selector -- just high and low temps
]

Unnamed: 0,high,low
Wed,25,13
Wed,23,14
Thu,21,13
Thu,23,13


# Exercise: Family/friends info, again

1. Create a data frame with 5 family/friend rows, with the following columns:
    - `first`
    - `last`
    - `age`
    - `shoesize`
2. Show the first and last names of all people with above-average shoe sizes.
3. Show the average shoe size for people with below-average age.

In [258]:
df = DataFrame([['Reuven', 'Lerner', 52, 46],
                ['Atara', 'Lerner-Friedman', 21, 42],
                ['Shikma', 'Lerner-Friedman', 19, 42],
                ['Amotz', 'Lerner-Friedman', 17, 44]],
               columns='first last age shoesize'.split())

In [259]:
df

Unnamed: 0,first,last,age,shoesize
0,Reuven,Lerner,52,46
1,Atara,Lerner-Friedman,21,42
2,Shikma,Lerner-Friedman,19,42
3,Amotz,Lerner-Friedman,17,44


In [260]:
# row selector: shoesize > shoesize.mean()
# column selector: first + last

df.loc[
    df['shoesize'] > df['shoesize'].mean()   # row selector
    ,
    ['first', 'last'] # column selector

]

Unnamed: 0,first,last
0,Reuven,Lerner
3,Amotz,Lerner-Friedman


In [262]:
# average shoe size for people with below-average age

# row selector: age < age.mean()
# column selector: shoesize

df.loc[
    df['age'] < df['age'].mean() # row selector: age < age.mean()
    ,
    'shoesize' # column selector
].mean()

42.666666666666664

In [263]:
# the most common format for the data we'll work with is CSV (comma-separated values)
# - one record per row
# - fields separated by commas

# note: CSV can use other separators, but then you have to specify the separator (the sep argument)
# to read a CSV into Pandas, we use the pd.read_csv function

# this returns a new data frame based upon the CSV file
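
A minimal sketch of reading data that uses a separator other than a comma. The semicolon-separated text below is made up, and `StringIO` stands in for a file on disk so that the example is self-contained.

```python
from io import StringIO
import pandas as pd

# Hypothetical semicolon-separated data, standing in for a real file
data = StringIO("city;high;low\nParis;21;13\nOslo;17;9\n")

# sep tells read_csv which character separates the fields
df = pd.read_csv(data, sep=';')

print(df)
```

With a real file, the first argument would be the filename, as in `pd.read_csv('taxi.csv')`.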



In [264]:
!ls

'OReilly - Session 1 — 2022-11Nov-15-python-data.ipynb'
'OReilly - Session 2 — 2022-11Nov-22-python-data.ipynb'
 README.md
 README.md~
 airlines.dat
 airports
 airports.zip
 burrito_current.csv
 celebrity_deaths_2016.csv
 data-science-exercise-files.zip
 languages.csv
 taxi.csv
 titanic3.csv


In [265]:
!head taxi.csv

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.954429626464844,40.764141082763672,1,N,-73.974754333496094,40.754093170166016,2,17,0,0.5,0,0,0.3,17.8
2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,.46,-73.971443176269531,40.758941650390625,1,N,-73.978538513183594,40.761909484863281,1,6.5,0,0.5,1,0,0.3,8.3
2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,.87,-73.978111267089844,40.738433837890625,1,N,-73.990272521972656,40.745437622070313,1,8,0,0.5,2.2,0,0.3,11
2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892333984375,40.773529052734375,1,N,-73.971527099609375,40.760330200195312,1,13.5,0,0.5,2.86,0,0.3,17.16
1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.40,-73.979087829589844,40.776771545410156,1

In [266]:
# let's create a data frame from this CSV file!

df = pd.read_csv('taxi.csv')

In [267]:
df

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.954430,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.00,0.0,0.3,17.80
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.00,0.0,0.3,8.30
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.20,0.0,0.3,11.00
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.760330,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.40,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.00,0.0,0.3,10.30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,1,2015-06-01 00:12:59,2015-06-01 00:24:18,1,2.70,-73.947792,40.814972,1,N,-73.973358,40.783638,2,11.0,0.5,0.5,0.00,0.0,0.3,12.30
9995,1,2015-06-01 00:12:59,2015-06-01 00:28:16,1,4.50,-74.004066,40.747818,1,N,-73.953758,40.779285,1,16.0,0.5,0.5,3.00,0.0,0.3,20.30
9996,2,2015-06-01 00:13:00,2015-06-01 00:37:25,1,5.59,-73.994377,40.766102,1,N,-73.903206,40.750546,2,21.0,0.5,0.5,0.00,0.0,0.3,22.30
9997,2,2015-06-01 00:13:02,2015-06-01 00:19:10,6,1.54,-73.978302,40.748531,1,N,-73.989166,40.762852,2,6.5,0.5,0.5,0.00,0.0,0.3,7.80


In [268]:
# to select which columns from the CSV file we really want, we can specify
# usecols=[LIST OF COLUMN NAMES]

df = pd.read_csv('taxi.csv',
                usecols=['passenger_count', 'trip_distance', 'total_amount'])
df

Unnamed: 0,passenger_count,trip_distance,total_amount
0,1,1.63,17.80
1,1,0.46,8.30
2,1,0.87,11.00
3,1,2.13,17.16
4,1,1.40,10.30
...,...,...,...
9994,1,2.70,12.30
9995,1,4.50,20.30
9996,1,5.59,22.30
9997,6,1.54,7.80


# Exercise: Weird taxi rides

1. Load the data, as I did, from `taxi.csv` into a data frame. We only want three columns: `passenger_count`, `trip_distance`, `total_amount`
2. How many trips went 0 miles? How much did people pay, on average, for such trips?
3. How many trips cost <= 0 dollars?  How far did people go, on average, for such trips?
4. How many trips had 0 passengers? How much did people pay, on average, for that?

In [270]:
# how many trips went 0 miles? How much did people pay for them?

# row selector: df['trip_distance'] == 0
# column selector: 'total_amount'

df.loc[
    df['trip_distance'] == 0 # row selector
    ,
    'total_amount' # column selector
].mean()

31.581940298507465
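
The cell above gives the mean payment, but not the count that the exercise also asks for. One way to get it is to sum the boolean series, since `True` counts as 1. The tiny data frame below is a made-up stand-in for the taxi data, so the numbers are illustrative only.

```python
from pandas import DataFrame

# Hypothetical miniature stand-in for the taxi data frame
df = DataFrame({'passenger_count': [1, 0, 2, 1],
                'trip_distance': [0.0, 1.5, 0.0, 3.2],
                'total_amount': [52.0, 8.3, 3.3, 14.8]})

zero_miles = df['trip_distance'] == 0   # boolean series, one value per trip

how_many = zero_miles.sum()             # number of True values
mean_paid = df.loc[zero_miles, 'total_amount'].mean()

print(how_many, mean_paid)
```

The same pattern applies to the other two questions by swapping in a different row selector and column selector.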

In [272]:
# how many trips cost <= 0 dollars? How far did people go on such trips?

# row selector: df['total_amount'] <= 0
# column selector: 'trip_distance'

df.loc[
    df['total_amount'] <= 0
    ,
    'trip_distance'
].mean()

0.6066666666666667

In [274]:
# how many trips had 0 passengers? How much did people pay, and how far did they go?

# row selector: df['passenger_count'] == 0
# column selector: ['trip_distance', 'total_amount']

df.loc[
    df['passenger_count'] == 0  # row selector
    ,
    ['trip_distance', 'total_amount']  # column selector

].mean()

trip_distance     4.60
total_amount     25.57
dtype: float64