# Agenda

- Data frames
    - Creating
    - Retrieving
    - Methods (series methods and special data frame methods)
- Loading data
    - CSV
    - Excel
    - Downloading from the Internet (retrieving files and scraping sites)

In [2]:
import pandas as pd
from pandas import Series, DataFrame

In [3]:
# create a series based on any iterable in Python (usually a list or a NumPy array)
# we can pass additional keyword arguments, especially index and dtype

# Data frames are similar; we can pass a 2D set of values (a list of lists or a 2D NumPy array), and
# also index, and also columns -- the names of the columns we want

In [4]:
df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90],
                [100, 110, 120]])
df

Unnamed: 0,0,1,2
0,10,20,30
1,40,50,60
2,70,80,90
3,100,110,120


In [5]:
type(df)

pandas.core.frame.DataFrame

# Our basic data frame

- A data frame has rows and columns
- The rows are identified with the index (i.e., the same index that we had on our series)
- The columns are identified with names that we just call "columns"

The key thing to remember: Each column is a Pandas series!

- I can retrieve a row with `.loc` or `.iloc`, as before
- I can retrieve a column with `[]` and the column's name

In [6]:
df.loc[1]

0    40
1    50
2    60
Name: 1, dtype: int64

In [7]:
df[1]  # this will retrieve the column

0     20
1     50
2     80
3    110
Name: 1, dtype: int64

In [8]:
%timeit df.loc[1]

11.4 μs ± 129 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [9]:
%timeit df.iloc[1]

10.3 μs ± 120 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [10]:
%timeit df[1]

1.33 μs ± 17.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


In [11]:
# behind the scenes, the values of our data frame are (at least conceptually) a 2D NumPy array

df.values

array([[ 10,  20,  30],
       [ 40,  50,  60],
       [ 70,  80,  90],
       [100, 110, 120]])

In [12]:
# we can retrieve in all of the ways we've already seen!

df.loc[[1, 3]]  # fancy indexing

Unnamed: 0,0,1,2
1,40,50,60
3,100,110,120


In [13]:
df[[0, 1]]  # here, we retrieve two columns

Unnamed: 0,0,1
0,10,20
1,40,50
2,70,80
3,100,110


In [14]:
# if I retrieve one row, or one column, then I get a series
df[1]

0     20
1     50
2     80
3    110
Name: 1, dtype: int64

In [15]:
df

Unnamed: 0,0,1,2
0,10,20,30
1,40,50,60
2,70,80,90
3,100,110,120


In [16]:
# We can name the index and the columns via keyword arguments when we create the data frame

df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90],
                [100, 110, 120]],
              index=list('abcd'),
              columns=list('xyz'))
df

Unnamed: 0,x,y,z
a,10,20,30
b,40,50,60
c,70,80,90
d,100,110,120


- Row names (just like the index in a series) can repeat, and that's even useful!
- Column names should not repeat (like the keys in a dict); Pandas allows it, but retrieving a repeated name then returns every matching column

You can even think of a data frame as a dict of series, where the keys are the column names and the values are Pandas series objects.
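
A rough sketch of that dict-like view, rebuilding the same data frame so the block stands alone. The column names behave like keys, and each value is a series. The variable names are only for illustration.

```python
import pandas as pd
from pandas import DataFrame

df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90],
                [100, 110, 120]],
               index=list('abcd'),
               columns=list('xyz'))

# the "keys" of the data frame are its column names
column_names = list(df.keys())
print(column_names)

# membership tests check the columns, just as "in" checks a dict's keys
has_x = 'x' in df
has_a = 'a' in df      # 'a' is a row label, not a column name
print(has_x, has_a)

# .items() yields (column name, series) pairs, much like dict.items()
for name, column in df.items():
    print(name, type(column).__name__, column.sum())
```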

In [17]:
df.loc['b']

x    40
y    50
z    60
Name: b, dtype: int64

In [18]:
df.iloc[2]

x    70
y    80
z    90
Name: c, dtype: int64

In [19]:
df['x']

a     10
b     40
c     70
d    100
Name: x, dtype: int64

In [20]:
df.loc[['a', 'b']]

Unnamed: 0,x,y,z
a,10,20,30
b,40,50,60


In [21]:
df[['x', 'z']]

Unnamed: 0,x,z
a,10,30
b,40,60
c,70,90
d,100,120


In [22]:
# if I retrieve a series, then I can run a method on it

df['x'].mean()

np.float64(55.0)

In [23]:
df['x'].describe()

count      4.000000
mean      55.000000
std       38.729833
min       10.000000
25%       32.500000
50%       55.000000
75%       77.500000
max      100.000000
Name: x, dtype: float64

In [25]:
# I can do that for the rows

df.loc['b'].describe()

count     3.0
mean     50.0
std      10.0
min      40.0
25%      45.0
50%      50.0
75%      55.0
max      60.0
Name: b, dtype: float64

As a general rule, any method that we can run on a series can also be run on a data frame. We'll get one result per column, and the results will either be a series (if we get one value per column) or a data frame (if we get a series back from the method).
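
A small sketch of the two cases, using `max` and `cumsum` as stand-ins for "a method that returns one value" and "a method that returns a series". The data frame is rebuilt here so the block runs on its own.

```python
from pandas import DataFrame, Series

df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90],
                [100, 110, 120]],
               index=list('abcd'),
               columns=list('xyz'))

# max gives one value per column, so the results come back as a series
largest = df.max()

# cumsum gives a whole series per column, so the results come back as a data frame
running_total = df.cumsum()

print(type(largest).__name__)
print(type(running_total).__name__)
```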

In [26]:
df.mean()   # this will calculate the mean for each column

x    55.0
y    65.0
z    75.0
dtype: float64

In [27]:
df.describe()   # this will invoke describe on each column

Unnamed: 0,x,y,z
count,4.0,4.0,4.0
mean,55.0,65.0,75.0
std,38.729833,38.729833,38.729833
min,10.0,20.0,30.0
25%,32.5,42.5,52.5
50%,55.0,65.0,75.0
75%,77.5,87.5,97.5
max,100.0,110.0,120.0


In [28]:
df

Unnamed: 0,x,y,z
a,10,20,30
b,40,50,60
c,70,80,90
d,100,110,120


In [30]:
# we've seen how we can retrieve either a row or a column.
# but how can we retrieve an individual value?

df.loc['b'].loc['y']

np.int64(50)

In [31]:
# even worse:

df.loc['b']['y']

np.int64(50)

In [32]:
# better is to use the 2-argument version of .loc, where you specify the row and the column

df.loc['b', 'y']

np.int64(50)

In [33]:
df.sum()

x    220
y    260
z    300
dtype: int64

In [34]:
# what if I want to sum across the rows?

df.sum(axis='columns')

a     60
b    150
c    240
d    330
dtype: int64
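
The `axis` argument names the dimension that gets collapsed, which can feel backwards at first. A sketch showing both spellings side by side, with the data frame rebuilt so the block stands alone:

```python
from pandas import DataFrame

df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90],
                [100, 110, 120]],
               index=list('abcd'),
               columns=list('xyz'))

# axis='index' (the default, also written axis=0) collapses the rows,
# leaving one result per column
per_column = df.sum(axis='index')

# axis='columns' (also written axis=1) collapses the columns,
# leaving one result per row
per_row = df.sum(axis='columns')

print(per_column)
print(per_row)
```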

In [35]:
df = DataFrame([[10, 20, 30.5],
                [40, 50, 60],
                [70, 80, 90],
                [100, 110, 120]],
              index=list('abcd'),
              columns=list('xyz'))
df

Unnamed: 0,x,y,z
a,10,20,30.5
b,40,50,60.0
c,70,80,90.0
d,100,110,120.0


In [36]:
df.dtypes   # get the dtype of each column

x      int64
y      int64
z    float64
dtype: object

In [37]:
# what happens when we retrieve row 'c'?

df.loc['c']

x    70.0
y    80.0
z    90.0
Name: c, dtype: float64

In [38]:
# what if I want to assign back to a value?

# this is a bad way to do it, as you'll see!
df.loc['c'].loc['z'] = 999 

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df.loc['c'].loc['z'] = 999
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc['c'].loc['z'] = 999


In [39]:
df

Unnamed: 0,x,y,z
a,10,20,30.5
b,40,50,60.0
c,70,80,90.0
d,100,110,120.0


In [40]:
# the solution is always to use the two-argument version of .loc,
# which guarantees that we'll assign back to the original data frame

df.loc['c', 'z'] = 999 

In [41]:
df

Unnamed: 0,x,y,z
a,10,20,30.5
b,40,50,60.0
c,70,80,999.0
d,100,110,120.0


# Exercise: Simple data frame

1. Create a 5x5 data frame in which the rows are `abcde` and the columns are `vwxyz`. The values can be random integers from 0-1,000. (You can use a 2D NumPy array to initialize the values.)
2. Retrieve row `b`
3. Retrieve rows `b` and `d`
4. Retrieve rows `b`, `c`, and `d`
5. Retrieve column `w`
6. Retrieve columns `w` and `y`
7. Retrieve columns `w`, `x`, and `y`
8. Retrieve the item at row `e`, column `v`
9. Update the value at row `e`, column `z` to be 123.456

In [42]:
import numpy as np

np.random.randint(0, 1000, 20).reshape(4, 5)

array([[ 92, 855, 155, 443, 455],
       [386, 725, 796, 590, 586],
       [756, 567, 363, 394, 451],
       [449, 357,   3, 378, 714]])

In [43]:
np.random.randint(0, 1000, [4,5])

array([[444, 675, 888, 800, 967],
       [174, 391, 834, 520, 112],
       [976, 750, 342, 754, 295],
       [367, 243, 289, 455,  79]])

In [44]:
np.random.seed(0)

df = DataFrame(np.random.randint(0, 1000, [5,5]),
               index=list('abcde'),
               columns=list('vwxyz'))
df

Unnamed: 0,v,w,x,y,z
a,684,559,629,192,835
b,763,707,359,9,723
c,277,754,804,599,70
d,472,600,396,314,705
e,486,551,87,174,600


In [45]:
# Retrieve row b

df.loc['b']

v    763
w    707
x    359
y      9
z    723
Name: b, dtype: int64

In [46]:
%timeit df.loc['b']

12.2 μs ± 70 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [47]:
%timeit df.iloc[1]

10.6 μs ± 177 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [48]:
# Retrieve rows b and d

df.loc[['b', 'd']]

Unnamed: 0,v,w,x,y,z
b,763,707,359,9,723
d,472,600,396,314,705


In [49]:
# Retrieve rows b, c, and d

df.loc['b':'d']  # use a slice, up to and including (with df.loc)

Unnamed: 0,v,w,x,y,z
b,763,707,359,9,723
c,277,754,804,599,70
d,472,600,396,314,705


In [52]:
# Retrieve column w

%timeit df['w']

1.57 μs ± 10.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


In [53]:
# you can also use a different syntax, "dot syntax," to retrieve a column

%timeit df.w

2.92 μs ± 21 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [54]:
# Retrieve columns w and y

df[['w', 'y']]

Unnamed: 0,w,y
a,559,192
b,707,9
c,754,599
d,600,314
e,551,174


In [55]:
# Retrieve columns w, x, and y

df[['w', 'x', 'y']]

Unnamed: 0,w,x,y
a,559,629,192
b,707,359,9
c,754,804,599
d,600,396,314
e,551,87,174


In [56]:
df['w':'y']  # if you give a slice to [], Pandas looks for that slice ON THE ROWS!

Unnamed: 0,v,w,x,y,z
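
To slice columns by name, one option is `.loc` with `:` as the row selector, meaning "all rows". A minimal sketch, using simple sequential values rather than the random ones above:

```python
import numpy as np
from pandas import DataFrame

df = DataFrame(np.arange(25).reshape(5, 5),
               index=list('abcde'),
               columns=list('vwxyz'))

# ':' selects every row; the label slice 'w':'y' includes both endpoints
middle_columns = df.loc[:, 'w':'y']

# the row selector and the column selector can both be slices
block = df.loc['b':'d', 'w':'y']

print(middle_columns)
print(block)
```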


In [57]:
# Retrieve the item at row e, column v

df.loc[ 'e',    # row selector
        'v']    # column selector


np.int64(486)

In [58]:
df.loc[ ['b', 'e']   # row selector is two rows
    ,
        ['v', 'z']   # column selector
]

Unnamed: 0,v,z
b,763,723
e,486,600


In [59]:
# Update the value at row e, column z to be 123.456

df.loc['e', 'z'] = 123.456

  df.loc['e', 'z'] = 123.456


In [60]:
df.dtypes

v      int64
w      int64
x      int64
y      int64
z    float64
dtype: object

In [61]:
# how should we update that column/value to avoid the problem?
# answer: replace the column with a new version of itself using float

np.random.seed(0)

df = DataFrame(np.random.randint(0, 1000, [5,5]),
               index=list('abcde'),
               columns=list('vwxyz'))
df

Unnamed: 0,v,w,x,y,z
a,684,559,629,192,835
b,763,707,359,9,723
c,277,754,804,599,70
d,472,600,396,314,705
e,486,551,87,174,600


In [62]:
df.dtypes

v    int64
w    int64
x    int64
y    int64
z    int64
dtype: object

In [63]:
# you can add a new column to a data frame via assignment
df['u'] = [10, 20, 30, 40, 50]   # needs to be the same length as the other columns

df

Unnamed: 0,v,w,x,y,z,u
a,684,559,629,192,835,10
b,763,707,359,9,723,20
c,277,754,804,599,70,30
d,472,600,396,314,705,40
e,486,551,87,174,600,50


In [64]:
# what if I assign to a column that already exists? It's replaced.

df['u'] = [100 ,200, 300, 400, 500]
df

Unnamed: 0,v,w,x,y,z,u
a,684,559,629,192,835,100
b,763,707,359,9,723,200
c,277,754,804,599,70,300
d,472,600,396,314,705,400
e,486,551,87,174,600,500


In [65]:
# I can do the same, but using the output of .astype

df['u'] = df['u'].astype(np.float64)
df

Unnamed: 0,v,w,x,y,z,u
a,684,559,629,192,835,100.0
b,763,707,359,9,723,200.0
c,277,754,804,599,70,300.0
d,472,600,396,314,705,400.0
e,486,551,87,174,600,500.0


In [66]:
df.loc['e', 'u'] = 123.456


In [67]:
df

Unnamed: 0,v,w,x,y,z,u
a,684,559,629,192,835,100.0
b,763,707,359,9,723,200.0
c,277,754,804,599,70,300.0
d,472,600,396,314,705,400.0
e,486,551,87,174,600,123.456


# Other ways to create data frames

You aren't going to be creating a *lot* of data frames manually; most will be based on data you read from elsewhere. But let's just look at a few techniques we can use for when you do need to create them.

1. List of lists (we've seen this)
2. List of dicts (the keys indicate which columns we're adding to)
3. Dict of lists (the keys indicate the column names, and the lists are the values)
4. Dict of series (same as above, but using a series rather than a list)

In [70]:
# list of dicts
# the union of all keys across all dicts will be our column names
# any missing key-value intersection is given nan as a value

df = DataFrame([{'a':10, 'b':20, 'c':30},
                {'a':100, 'b':200, 'c':300},
                {'a':1000, 'b':2000, 'd':4000}],
              index=list('xyz'))
df

Unnamed: 0,a,b,c,d
x,10,20,30.0,
y,100,200,300.0,
z,1000,2000,,4000.0


In [72]:
# dict of lists
# here, we're just defining the data frame as a dict whose values are lists/series

df = DataFrame({'a':[10, 20, 30, 40, 50],
                'b':[100, 200, 300, 400, 500],
                'c':[1000, 2000, 3000, 4000, 5000]},
              index=list('vwxyz'))
df

Unnamed: 0,a,b,c
v,10,100,1000
w,20,200,2000
x,30,300,3000
y,40,400,4000
z,50,500,5000
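
The fourth technique in the list, a dict of series, works much like a dict of lists. A sketch of how it might look; the series and their values are invented for illustration. One difference is that the values are lined up by index label rather than by position.

```python
from pandas import DataFrame, Series

a = Series([10, 20, 30], index=list('xyz'))
b = Series([300, 100, 200], index=list('zxy'))   # same labels, different order

# each key becomes a column name; each series becomes that column
df = DataFrame({'a': a, 'b': b})
print(df)

# the value labeled 'x' in b lands in row 'x', wherever it sat in the series
print(df.loc['x', 'b'])
```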


In [73]:
# what happens if a data frame has repeated column names?

df = DataFrame(np.random.randint(0, 100, [3, 3]),
               index=list('aab'),
               columns=list('xyy'))
df

Unnamed: 0,x,y,y.1
a,81,37,25
a,77,72,9
b,20,80,69


In [74]:
df['y']

Unnamed: 0,y,y.1
a,37,25
a,72,9
b,80,69


In [75]:
df['y'] = [10, 20, 30]
df

Unnamed: 0,x,y,y.1
a,81,10,10
a,77,20,20
b,20,30,30


In [77]:
df.columns.duplicated()

array([False, False,  True])
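
One possible use for that boolean array is to keep only the first column of each name. A sketch with invented values; `is_repeat` and `unique_df` are just illustrative names.

```python
from pandas import DataFrame

df = DataFrame([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]],
               index=list('aab'),
               columns=list('xyy'))

# True marks a column whose name already appeared further to the left
is_repeat = df.columns.duplicated()

# ~ flips the booleans, so .loc keeps only the first column of each name
unique_df = df.loc[:, ~is_repeat]
print(unique_df)
```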

# Exercise: Semi-realistic example

1. Create a data frame with two columns, `age` and `shoesize`. Each row will represent a different person in your family. The index for each row should be the name of the person, and the two columns should be the age and shoe size for that person. Create this twice, first using a list of dicts and then using a dict of lists.
2. What is the mean age in your family? How much does this differ from the median? Ask the same about the shoe size. 

In [80]:
# list of dicts -- we describe the rows

df = DataFrame([{'age': 23, 'shoesize':40},
                {'age': 21, 'shoesize': 40},
                {'age': 19, 'shoesize':44},
                {'age': 54, 'shoesize':46}],
              index=['Atara', 'Shikma', 'Amotz', 'Reuven'])
df

Unnamed: 0,age,shoesize
Atara,23,40
Shikma,21,40
Amotz,19,44
Reuven,54,46


In [81]:
df.loc['Reuven']

age         54
shoesize    46
Name: Reuven, dtype: int64

In [82]:
# create this as a dict of lists

df = DataFrame({'age':[23, 21, 19, 54],
                'shoesize':[40, 40, 44, 46]},
               index=['Atara', 'Shikma', 'Amotz', 'Reuven'])
df

Unnamed: 0,age,shoesize
Atara,23,40
Shikma,21,40
Amotz,19,44
Reuven,54,46


In [83]:
df['age'].mean()

np.float64(29.25)

In [84]:
df['age'].median()

np.float64(22.0)

In [85]:
df['shoesize'].mean()

np.float64(42.5)

In [86]:
df['shoesize'].median()

np.float64(42.0)

In [87]:
df.mean()

age         29.25
shoesize    42.50
dtype: float64

In [88]:
df.median()

age         22.0
shoesize    42.0
dtype: float64

In [89]:
df.describe()

Unnamed: 0,age,shoesize
count,4.0,4.0
mean,29.25,42.5
std,16.580611,3.0
min,19.0,40.0
25%,20.5,40.0
50%,22.0,42.0
75%,30.75,44.5
max,54.0,46.0


In [91]:
# df.describe returns a data frame!
# I can choose which rows I want

df.describe().loc[['mean', '50%']]

Unnamed: 0,age,shoesize
mean,29.25,42.5
50%,22.0,42.0


In [93]:
%timeit df.describe().loc[['mean', '50%']]

1.39 ms ± 33.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [92]:
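# the 'agg' method lets us run more than one aggregation method, and returns the results for all
# note: here it runs on the output of describe(), so we get the mean and median of the summary rows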
df.describe().agg(['mean', 'median'])

Unnamed: 0,age,shoesize
mean,24.510076,32.75
median,21.25,41.0
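
To aggregate the original columns instead of the summary, `agg` can be called directly on the data frame. A sketch, rebuilding the family data frame so the block stands alone:

```python
from pandas import DataFrame

df = DataFrame({'age': [23, 21, 19, 54],
                'shoesize': [40, 40, 44, 46]},
               index=['Atara', 'Shikma', 'Amotz', 'Reuven'])

# one row per aggregation method, one column per original column
result = df.agg(['mean', 'median'])
print(result)
```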


In [94]:
%timeit df.describe().agg(['mean', 'median'])

1.99 ms ± 83.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# A few more convenience methods on data frames

In [96]:
# head and tail also work!

np.random.seed(0)
df = DataFrame(np.random.randint(0, 1000, [8, 10]),
               index=list('abcdefgh'),
               columns=list('qrstuvwxyz'))
df

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
a,684,559,629,192,835,763,707,359,9,723
b,277,754,804,599,70,472,600,396,314,705
c,486,551,87,174,600,849,677,537,845,72
d,777,916,115,976,755,709,847,431,448,850
e,99,984,177,755,797,659,147,910,423,288
f,961,265,697,639,544,543,714,244,151,675
g,510,459,882,183,28,802,128,128,932,53
h,901,550,488,756,273,335,388,617,42,442


In [97]:
df.head()   # returns the first 5 rows

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
a,684,559,629,192,835,763,707,359,9,723
b,277,754,804,599,70,472,600,396,314,705
c,486,551,87,174,600,849,677,537,845,72
d,777,916,115,976,755,709,847,431,448,850
e,99,984,177,755,797,659,147,910,423,288


In [98]:
df.head(2)

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
a,684,559,629,192,835,763,707,359,9,723
b,277,754,804,599,70,472,600,396,314,705


In [99]:
df.tail()

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
d,777,916,115,976,755,709,847,431,448,850
e,99,984,177,755,797,659,147,910,423,288
f,961,265,697,639,544,543,714,244,151,675
g,510,459,882,183,28,802,128,128,932,53
h,901,550,488,756,273,335,388,617,42,442


In [100]:
df.dtypes  # which dtype for each column?

q    int64
r    int64
s    int64
t    int64
u    int64
v    int64
w    int64
x    int64
y    int64
z    int64
dtype: object

In [101]:
# this gives us information about the data frame itself 

df.info()  

<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, a to h
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   q       8 non-null      int64
 1   r       8 non-null      int64
 2   s       8 non-null      int64
 3   t       8 non-null      int64
 4   u       8 non-null      int64
 5   v       8 non-null      int64
 6   w       8 non-null      int64
 7   x       8 non-null      int64
 8   y       8 non-null      int64
 9   z       8 non-null      int64
dtypes: int64(10)
memory usage: 704.0+ bytes


In [103]:
# what about nan?

df.loc['a', 't'] = np.nan
df.loc['a', 'z'] = np.nan
df.loc['b', 'u'] = np.nan
df.loc['c', 'w'] = np.nan
df.loc['f', 'q'] = np.nan
df.loc['g', 'q'] = np.nan


In [104]:
df

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
a,684.0,559,629,,835.0,763,707.0,359,9,
b,277.0,754,804,599.0,,472,600.0,396,314,705.0
c,486.0,551,87,174.0,600.0,849,,537,845,72.0
d,777.0,916,115,976.0,755.0,709,847.0,431,448,850.0
e,99.0,984,177,755.0,797.0,659,147.0,910,423,288.0
f,,265,697,639.0,544.0,543,714.0,244,151,675.0
g,,459,882,183.0,28.0,802,128.0,128,932,53.0
h,901.0,550,488,756.0,273.0,335,388.0,617,42,442.0


In [105]:
df.dtypes

q    float64
r      int64
s      int64
t    float64
u    float64
v      int64
w    float64
x      int64
y      int64
z    float64
dtype: object

In [106]:
# how do we deal with missing data?
# - dropna
# - fillna

In [107]:
df.dropna()     # any row with a nan value is removed 

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
d,777.0,916,115,976.0,755.0,709,847.0,431,448,850.0
e,99.0,984,177,755.0,797.0,659,147.0,910,423,288.0
h,901.0,550,488,756.0,273.0,335,388.0,617,42,442.0


In [108]:
# is this what we want?  Sometimes... 
# we can tell dropna which columns it should pay attention to, and also how many good values are needed to keep the row

df.dropna(thresh=9)  # as long as a row has at least 9 non-nan values, we keep it (so 1 nan is OK)

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
b,277.0,754,804,599.0,,472,600.0,396,314,705.0
c,486.0,551,87,174.0,600.0,849,,537,845,72.0
d,777.0,916,115,976.0,755.0,709,847.0,431,448,850.0
e,99.0,984,177,755.0,797.0,659,147.0,910,423,288.0
f,,265,697,639.0,544.0,543,714.0,244,151,675.0
g,,459,882,183.0,28.0,802,128.0,128,932,53.0
h,901.0,550,488,756.0,273.0,335,388.0,617,42,442.0


In [110]:
df.dropna(thresh=10)   # this means: all values must be good, no nans need apply

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
d,777.0,916,115,976.0,755.0,709,847.0,431,448,850.0
e,99.0,984,177,755.0,797.0,659,147.0,910,423,288.0
h,901.0,550,488,756.0,273.0,335,388.0,617,42,442.0
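
`thresh` is compared against the number of non-nan values in each row. One way to see which rows will pass is to count those values with `count(axis='columns')`. A sketch with a smaller, invented data frame:

```python
import numpy as np
from pandas import DataFrame

df = DataFrame([[1, 2, 3],
                [4, np.nan, 6],
                [np.nan, np.nan, 9]],
               index=list('abc'),
               columns=list('xyz'))

# the number of non-nan values in each row
good_values = df.count(axis='columns')
print(good_values)

# keep the rows with at least 2 non-nan values
kept = df.dropna(thresh=2)

# the same rows, selected by hand with a boolean index
same = df.loc[good_values >= 2]
print(kept)
```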


In [111]:
# I can use the "subset" keyword argument to restrict which columns we care about having nan

df.dropna(subset=['q', 'u'])

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
a,684.0,559,629,,835.0,763,707.0,359,9,
c,486.0,551,87,174.0,600.0,849,,537,845,72.0
d,777.0,916,115,976.0,755.0,709,847.0,431,448,850.0
e,99.0,984,177,755.0,797.0,659,147.0,910,423,288.0
h,901.0,550,488,756.0,273.0,335,388.0,617,42,442.0


In [113]:
df.dropna(axis='columns')  # now we'll drop columns that have nan values

Unnamed: 0,r,s,v,x,y
a,559,629,763,359,9
b,754,804,472,396,314
c,551,87,849,537,845
d,916,115,709,431,448
e,984,177,659,910,423
f,265,697,543,244,151
g,459,882,802,128,932
h,550,488,335,617,42


In [115]:
df.dropna(thresh=7, axis='columns')   #  if we have at least 7 good values in the column, keep it

Unnamed: 0,r,s,t,u,v,w,x,y,z
a,559,629,,835.0,763,707.0,359,9,
b,754,804,599.0,,472,600.0,396,314,705.0
c,551,87,174.0,600.0,849,,537,845,72.0
d,916,115,976.0,755.0,709,847.0,431,448,850.0
e,984,177,755.0,797.0,659,147.0,910,423,288.0
f,265,697,639.0,544.0,543,714.0,244,151,675.0
g,459,882,183.0,28.0,802,128.0,128,932,53.0
h,550,488,756.0,273.0,335,388.0,617,42,442.0


In [112]:
help(df.dropna)

Help on method dropna in module pandas.core.frame:

dropna(
    *,
    axis: 'Axis' = 0,
    how: 'AnyAll | lib.NoDefault' = <no_default>,
    thresh: 'int | lib.NoDefault' = <no_default>,
    subset: 'IndexLabel | None' = None,
    inplace: 'bool' = False,
    ignore_index: 'bool' = False
) -> 'DataFrame | None' method of pandas.core.frame.DataFrame instance
    Remove missing values.

    See the :ref:`User Guide <missing_data>` for more on which values are
    considered missing, and how to work with missing data.

    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Determine if rows or columns which contain missing values are
        removed.

        * 0, or 'index' : Drop rows which contain missing values.
        * 1, or 'columns' : Drop columns which contain missing value.

        Only a single axis is allowed.

    how : {'any', 'all'}, default 'any'
        Determine if row or column is removed from DataFrame, when we have
        at le

In [116]:
# I'm guessing that when we say axis='columns', Pandas transposes our data frame, does the action we ask for
# and then transposes it back.

# you can do that with .transpose() or .T  (without parentheses!)

df.T

Unnamed: 0,a,b,c,d,e,f,g,h
q,684.0,277.0,486.0,777.0,99.0,,,901.0
r,559.0,754.0,551.0,916.0,984.0,265.0,459.0,550.0
s,629.0,804.0,87.0,115.0,177.0,697.0,882.0,488.0
t,,599.0,174.0,976.0,755.0,639.0,183.0,756.0
u,835.0,,600.0,755.0,797.0,544.0,28.0,273.0
v,763.0,472.0,849.0,709.0,659.0,543.0,802.0,335.0
w,707.0,600.0,,847.0,147.0,714.0,128.0,388.0
x,359.0,396.0,537.0,431.0,910.0,244.0,128.0,617.0
y,9.0,314.0,845.0,448.0,423.0,151.0,932.0,42.0
z,,705.0,72.0,850.0,288.0,675.0,53.0,442.0


In [117]:
df

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
a,684.0,559,629,,835.0,763,707.0,359,9,
b,277.0,754,804,599.0,,472,600.0,396,314,705.0
c,486.0,551,87,174.0,600.0,849,,537,845,72.0
d,777.0,916,115,976.0,755.0,709,847.0,431,448,850.0
e,99.0,984,177,755.0,797.0,659,147.0,910,423,288.0
f,,265,697,639.0,544.0,543,714.0,244,151,675.0
g,,459,882,183.0,28.0,802,128.0,128,932,53.0
h,901.0,550,488,756.0,273.0,335,388.0,617,42,442.0


In [118]:
# we can use fillna with a single value

df.fillna(99999)

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
a,684.0,559,629,99999.0,835.0,763,707.0,359,9,99999.0
b,277.0,754,804,599.0,99999.0,472,600.0,396,314,705.0
c,486.0,551,87,174.0,600.0,849,99999.0,537,845,72.0
d,777.0,916,115,976.0,755.0,709,847.0,431,448,850.0
e,99.0,984,177,755.0,797.0,659,147.0,910,423,288.0
f,99999.0,265,697,639.0,544.0,543,714.0,244,151,675.0
g,99999.0,459,882,183.0,28.0,802,128.0,128,932,53.0
h,901.0,550,488,756.0,273.0,335,388.0,617,42,442.0


In [119]:
df.mean()

q    537.333333
r    629.750000
s    484.875000
t    583.142857
u    547.428571
v    641.500000
w    504.428571
x    452.750000
y    395.500000
z    440.714286
dtype: float64

In [120]:
df.fillna(df.mean())   # each column's nan values will be replaced with the mean *of that column*

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
a,684.0,559,629,583.142857,835.0,763,707.0,359,9,440.714286
b,277.0,754,804,599.0,547.428571,472,600.0,396,314,705.0
c,486.0,551,87,174.0,600.0,849,504.428571,537,845,72.0
d,777.0,916,115,976.0,755.0,709,847.0,431,448,850.0
e,99.0,984,177,755.0,797.0,659,147.0,910,423,288.0
f,537.333333,265,697,639.0,544.0,543,714.0,244,151,675.0
g,537.333333,459,882,183.0,28.0,802,128.0,128,932,53.0
h,901.0,550,488,756.0,273.0,335,388.0,617,42,442.0


In [121]:
df.fillna({'q':1111, 'r':2222, 's':3333, 't':4444, 'u':5555, 'v':66666, 'w':7777, 'x':8888, 'y':9999, 'z':101010})

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
a,684.0,559,629,4444.0,835.0,763,707.0,359,9,101010.0
b,277.0,754,804,599.0,5555.0,472,600.0,396,314,705.0
c,486.0,551,87,174.0,600.0,849,7777.0,537,845,72.0
d,777.0,916,115,976.0,755.0,709,847.0,431,448,850.0
e,99.0,984,177,755.0,797.0,659,147.0,910,423,288.0
f,1111.0,265,697,639.0,544.0,543,714.0,244,151,675.0
g,1111.0,459,882,183.0,28.0,802,128.0,128,932,53.0
h,901.0,550,488,756.0,273.0,335,388.0,617,42,442.0
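
The dict doesn't have to name every column; columns that aren't in the dict are left unfilled. A sketch with a small, invented data frame:

```python
import numpy as np
from pandas import DataFrame

df = DataFrame({'q': [1.0, np.nan, 3.0],
                'r': [np.nan, 5.0, 6.0]},
               index=list('abc'))

# only column q is named, so the nan in column r stays where it is
filled = df.fillna({'q': 0})
print(filled)
```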


In [122]:
help(df.fillna)

Help on method fillna in module pandas.core.generic:

fillna(
    value: 'Hashable | Mapping | Series | DataFrame | None' = None,
    *,
    method: 'FillnaOptions | None' = None,
    axis: 'Axis | None' = None,
    inplace: 'bool_t' = False,
    limit: 'int | None' = None,
    downcast: 'dict | None | lib.NoDefault' = <no_default>
) -> 'Self | None' method of pandas.core.frame.DataFrame instance
    Fill NA/NaN values using the specified method.

    Parameters
    ----------
    value : scalar, dict, Series, or DataFrame
        Value to use to fill holes (e.g. 0), alternately a
        dict/Series/DataFrame of values specifying which value to use for
        each index (for a Series) or column (for a DataFrame).  Values not
        in the dict/Series/DataFrame will not be filled. This value cannot
        be a list.
    method : {'backfill', 'bfill', 'ffill', None}, default None
        Method to use for filling holes in reindexed Series:

        * ffill: propagate last valid obser

# Next up

- Files
- Reading/writing


# Reading files

Pandas handles a *huge* number of different file formats.  We'll start with CSV, then a bit about Excel, and then a few other formats, as well.

# CSV 

CSV originally stood for "comma-separated values," but it's now often read more generically as "character-separated values." The idea is pretty simple:

- The file is all text
- Every line is a record
- Fields on each line are separated (by default) by commas
- The first line of the file can (sometimes) be the names of the columns we want

Lots of problems with this:
- What if the data contains commas?
- How are we supposed to know if the first line is column names or just data?
- How is numeric data interpreted?
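
A sketch of how `read_csv` handles the first and third questions, using a tiny, invented CSV held in a string (via `io.StringIO`) instead of a file. Fields that contain commas are conventionally wrapped in double quotes, and the `dtype` keyword argument can override how a column is interpreted.

```python
from io import StringIO
import pandas as pd

text = '''name,city,zipcode
Alice,"Springfield, IL",02139
Bob,Boston,02140
'''

# the quoted field keeps its comma; zipcode is read as an integer column
df = pd.read_csv(StringIO(text))
print(df)
print(df.dtypes)

# asking for strings preserves the leading zeros
df_str = pd.read_csv(StringIO(text), dtype={'zipcode': str})
print(df_str['zipcode'])
```

As for the second question, `read_csv` assumes by default that the first line holds the column names, which is why `header` exists as a keyword argument.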


In [None]:
# goal: Create a data frame based on a CSV file
# for that, I'll use pd.read_csv



In [123]:
!head taxi.csv

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.954429626464844,40.764141082763672,1,N,-73.974754333496094,40.754093170166016,2,17,0,0.5,0,0,0.3,17.8
2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,.46,-73.971443176269531,40.758941650390625,1,N,-73.978538513183594,40.761909484863281,1,6.5,0,0.5,1,0,0.3,8.3
2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,.87,-73.978111267089844,40.738433837890625,1,N,-73.990272521972656,40.745437622070313,1,8,0,0.5,2.2,0,0.3,11
2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892333984375,40.773529052734375,1,N,-73.971527099609375,40.760330200195312,1,13.5,0,0.5,2.86,0,0.3,17.16
1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.40,-73.979087829589844,40.776771545410156,1,N,-73.982

In [124]:
# to read this CSV file into a data frame, I'll just give the filename to read_csv

df = pd.read_csv('taxi.csv')

In [125]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3


In [126]:
# how big is my data frame? I can ask using .shape

df.shape

(9999, 19)

In [127]:
df.dtypes

VendorID                   int64
tpep_pickup_datetime      object
tpep_dropoff_datetime     object
passenger_count            int64
trip_distance            float64
pickup_longitude         float64
pickup_latitude          float64
RateCodeID                 int64
store_and_fwd_flag        object
dropoff_longitude        float64
dropoff_latitude         float64
payment_type               int64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
improvement_surcharge    float64
total_amount             float64
dtype: object
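
The two datetime columns came back as `object`, meaning they were read as plain strings. One option is the `parse_dates` keyword argument to `read_csv`. A sketch on a two-row sample with only a few of the columns, held in a string so that it doesn't depend on `taxi.csv`:

```python
from io import StringIO
import pandas as pd

text = '''VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count
2,2015-06-02 11:19:29,2015-06-02 11:47:52,1
2,2015-06-02 11:19:30,2015-06-02 11:27:56,1
'''

# parse_dates names the columns that should be converted to datetimes
df = pd.read_csv(StringIO(text),
                 parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])
print(df.dtypes)

# datetime columns support arithmetic, such as computing each trip's duration
duration = df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']
print(duration)
```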

In [128]:
df['passenger_count'].mean()

np.float64(1.6594659465946595)

In [129]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   VendorID               9999 non-null   int64  
 1   tpep_pickup_datetime   9999 non-null   object 
 2   tpep_dropoff_datetime  9999 non-null   object 
 3   passenger_count        9999 non-null   int64  
 4   trip_distance          9999 non-null   float64
 5   pickup_longitude       9999 non-null   float64
 6   pickup_latitude        9999 non-null   float64
 7   RateCodeID             9999 non-null   int64  
 8   store_and_fwd_flag     9999 non-null   object 
 9   dropoff_longitude      9999 non-null   float64
 10  dropoff_latitude       9999 non-null   float64
 11  payment_type           9999 non-null   int64  
 12  fare_amount            9999 non-null   float64
 13  extra                  9999 non-null   float64
 14  mta_tax                9999 non-null   float64
 15  tip_

In [130]:
# what if I want to save this as a CSV file?
df.to_csv('mytaxi.csv')

In [131]:
!head mytaxi.csv

,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95442962646483,40.76414108276367,1,N,-73.9747543334961,40.754093170166016,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.97144317626953,40.75894165039063,1,N,-73.97853851318358,40.76190948486328,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.97811126708984,40.73843383789063,1,N,-73.99027252197266,40.74543762207031,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.94589233398438,40.773529052734375,1,N,-73.97152709960938,40.76033020019531,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.97908782958984,40.77677
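
Notice that the saved file starts with an extra, unnamed column: the index. Two common ways to handle that are to leave the index out when writing, or to tell `read_csv` which column holds it. A sketch using in-memory strings and invented values:

```python
from io import StringIO
import pandas as pd
from pandas import DataFrame

df = DataFrame({'fare': [17.0, 6.5], 'tip': [0.0, 1.0]})

# with no filename, to_csv returns the CSV text as a string
with_index = df.to_csv()
without_index = df.to_csv(index=False)
print(with_index)
print(without_index)

# when the index was written, index_col=0 reads it back in as the index
round_trip = pd.read_csv(StringIO(with_index), index_col=0)
print(round_trip)
```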

In [133]:
# this is what we can do if there are no headers
# you can pass names=[list_of_strings_to_be_used_as_headers]

pd.read_csv('mytaxi.csv',
           header=None)   # there are no column names to be found!

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.954430,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.00,0.0,0.3,17.80
1,1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.00,0.0,0.3,8.30
2,2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.20,0.0,0.3,11.00
3,3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.760330,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.40,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.00,0.0,0.3,10.30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,9994,1,2015-06-01 00:12:59,2015-06-01 00:24:18,1,2.70,-73.947792,40.814972,1,N,-73.973358,40.783638,2,11.0,0.5,0.5,0.00,0.0,0.3,12.30
9995,9995,1,2015-06-01 00:12:59,2015-06-01 00:28:16,1,4.50,-74.004066,40.747818,1,N,-73.953758,40.779285,1,16.0,0.5,0.5,3.00,0.0,0.3,20.30
9996,9996,2,2015-06-01 00:13:00,2015-06-01 00:37:25,1,5.59,-73.994377,40.766102,1,N,-73.903206,40.750546,2,21.0,0.5,0.5,0.00,0.0,0.3,22.30
9997,9997,2,2015-06-01 00:13:02,2015-06-01 00:19:10,6,1.54,-73.978302,40.748531,1,N,-73.989166,40.762852,2,6.5,0.5,0.5,0.00,0.0,0.3,7.80


In [139]:
pd.read_csv('mytaxi.csv',
           header=3)   # use row 3 (counting from 0, ignoring blank lines) as the header row

Unnamed: 0.1,Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.954430,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.00,0.0,0.3,17.80
1,1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.00,0.0,0.3,8.30
2,2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.20,0.0,0.3,11.00
3,3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.760330,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.40,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.00,0.0,0.3,10.30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,9994,1,2015-06-01 00:12:59,2015-06-01 00:24:18,1,2.70,-73.947792,40.814972,1,N,-73.973358,40.783638,2,11.0,0.5,0.5,0.00,0.0,0.3,12.30
9995,9995,1,2015-06-01 00:12:59,2015-06-01 00:28:16,1,4.50,-74.004066,40.747818,1,N,-73.953758,40.779285,1,16.0,0.5,0.5,3.00,0.0,0.3,20.30
9996,9996,2,2015-06-01 00:13:00,2015-06-01 00:37:25,1,5.59,-73.994377,40.766102,1,N,-73.903206,40.750546,2,21.0,0.5,0.5,0.00,0.0,0.3,22.30
9997,9997,2,2015-06-01 00:13:02,2015-06-01 00:19:10,6,1.54,-73.978302,40.748531,1,N,-73.989166,40.762852,2,6.5,0.5,0.5,0.00,0.0,0.3,7.80


In [140]:
# one of the most common things to do is cut down on the number of columns we read in
# I can use the "usecols" keyword argument to choose only a few columns

pd.read_csv('mytaxi.csv', header=3, usecols=['passenger_count', 'trip_distance', 'total_amount'])

Unnamed: 0,passenger_count,trip_distance,total_amount
0,1,1.63,17.80
1,1,0.46,8.30
2,1,0.87,11.00
3,1,2.13,17.16
4,1,1.40,10.30
...,...,...,...
9994,1,2.70,12.30
9995,1,4.50,20.30
9996,1,5.59,22.30
9997,6,1.54,7.80


If the data is

hello,"hi, there"  

then that line contains 2 fields, because everything inside the `""` is treated as 1 field, including the comma.
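
To see the quoting rule in action, here is a small sketch that parses that exact line from an in-memory string (the use of `io.StringIO` instead of a file is an assumption made for brevity):

```python
import io
import pandas as pd

raw = 'hello,"hi, there"\n'

# header=None because this one-line "file" has no header row
row = pd.read_csv(io.StringIO(raw), header=None)

print(row.shape)         # one row, two columns
print(row.iloc[0, 1])    # the quoted field, with its comma intact

# Going the other way, to_csv adds the quotes when a value contains the separator
written = pd.DataFrame([['hello', 'hi, there']]).to_csv(header=False, index=False)
print(written)
```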

I'll re-save the file using a tab (`\t`) as the separator.

In [141]:
df.to_csv('mytaxi.csv', sep='\t')

In [142]:
!head mytaxi.csv

	VendorID	tpep_pickup_datetime	tpep_dropoff_datetime	passenger_count	trip_distance	pickup_longitude	pickup_latitude	RateCodeID	store_and_fwd_flag	dropoff_longitude	dropoff_latitude	payment_type	fare_amount	extra	mta_tax	tip_amount	tolls_amount	improvement_surcharge	total_amount
0	2	2015-06-02 11:19:29	2015-06-02 11:47:52	1	1.63	-73.95442962646483	40.76414108276367	1	N	-73.9747543334961	40.754093170166016	2	17.0	0.0	0.5	0.0	0.0	0.3	17.8
1	2	2015-06-02 11:19:30	2015-06-02 11:27:56	1	0.46	-73.97144317626953	40.75894165039063	1	N	-73.97853851318358	40.76190948486328	1	6.5	0.0	0.5	1.0	0.0	0.3	8.3
2	2	2015-06-02 11:19:31	2015-06-02 11:30:30	1	0.87	-73.97811126708984	40.73843383789063	1	N	-73.99027252197266	40.74543762207031	1	8.0	0.0	0.5	2.2	0.0	0.3	11.0
3	2	2015-06-02 11:19:31	2015-06-02 11:39:02	1	2.13	-73.94589233398438	40.773529052734375	1	N	-73.97152709960938	40.76033020019531	1	13.5	0.0	0.5	2.86	0.0	0.3	17.16
4	1	2015-06-02 11:19:32	2015-06-02 11:32:49	1	1.4	-73.97908782958984	40.77677

In [143]:
# how can we read it in?

pd.read_csv('mytaxi.csv', sep='\t')

Unnamed: 0.1,Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.954430,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.00,0.0,0.3,17.80
1,1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.00,0.0,0.3,8.30
2,2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.20,0.0,0.3,11.00
3,3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.760330,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.40,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.00,0.0,0.3,10.30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,9994,1,2015-06-01 00:12:59,2015-06-01 00:24:18,1,2.70,-73.947792,40.814972,1,N,-73.973358,40.783638,2,11.0,0.5,0.5,0.00,0.0,0.3,12.30
9995,9995,1,2015-06-01 00:12:59,2015-06-01 00:28:16,1,4.50,-74.004066,40.747818,1,N,-73.953758,40.779285,1,16.0,0.5,0.5,3.00,0.0,0.3,20.30
9996,9996,2,2015-06-01 00:13:00,2015-06-01 00:37:25,1,5.59,-73.994377,40.766102,1,N,-73.903206,40.750546,2,21.0,0.5,0.5,0.00,0.0,0.3,22.30
9997,9997,2,2015-06-01 00:13:02,2015-06-01 00:19:10,6,1.54,-73.978302,40.748531,1,N,-73.989166,40.762852,2,6.5,0.5,0.5,0.00,0.0,0.3,7.80


In [144]:
# what if you don't specify the separator?

pd.read_csv('mytaxi.csv')

Unnamed: 0,\tVendorID\ttpep_pickup_datetime\ttpep_dropoff_datetime\tpassenger_count\ttrip_distance\tpickup_longitude\tpickup_latitude\tRateCodeID\tstore_and_fwd_flag\tdropoff_longitude\tdropoff_latitude\tpayment_type\tfare_amount\textra\tmta_tax\ttip_amount\ttolls_amount\timprovement_surcharge\ttotal_amount
0,0\t2\t2015-06-02 11:19:29\t2015-06-02 11:47:52...
1,1\t2\t2015-06-02 11:19:30\t2015-06-02 11:27:56...
2,2\t2\t2015-06-02 11:19:31\t2015-06-02 11:30:30...
3,3\t2\t2015-06-02 11:19:31\t2015-06-02 11:39:02...
4,4\t1\t2015-06-02 11:19:32\t2015-06-02 11:32:49...
...,...
9994,9994\t1\t2015-06-01 00:12:59\t2015-06-01 00:24...
9995,9995\t1\t2015-06-01 00:12:59\t2015-06-01 00:28...
9996,9996\t2\t2015-06-01 00:13:00\t2015-06-01 00:37...
9997,9997\t2\t2015-06-01 00:13:02\t2015-06-01 00:19...


# Summary of the most common `read_csv` options

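- `sep` -- the field separator (a comma by default)
- `usecols` -- read only the columns you list
- `header` -- which row contains the column names (`None` if the file has no header row)
- `names` (if you need to name the columns)
- `index_col` -- use this column as the index in your data frame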
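
Here is a minimal sketch that uses all five options in one call. The tab-separated text and the column names `label`, `count`, and `price` are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical tab-separated data with no header row
raw = "a\t1\t1.5\nb\t2\t2.5\nc\t3\t3.5\n"

df = pd.read_csv(io.StringIO(raw),
                 sep='\t',                            # fields are separated by tabs
                 header=None,                         # the data has no header row...
                 names=['label', 'count', 'price'],   # ...so we supply the column names
                 usecols=['label', 'price'],          # keep only these two columns
                 index_col='label')                   # and use 'label' as the index

print(df)
```

The result has a single `price` column, indexed by `label`.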

In [145]:
help(pd.read_csv)

Help on function read_csv in module pandas.io.parsers.readers:

read_csv(
    filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]',
    *,
    sep: 'str | None | lib.NoDefault' = <no_default>,
    delimiter: 'str | None | lib.NoDefault' = None,
    header: "int | Sequence[int] | None | Literal['infer']" = 'infer',
    names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>,
    index_col: 'IndexLabel | Literal[False] | None' = None,
    usecols: 'UsecolsArgType' = None,
    dtype: 'DtypeArg | None' = None,
    engine: 'CSVEngine | None' = None,
    converters: 'Mapping[Hashable, Callable] | None' = None,
    true_values: 'list | None' = None,
    false_values: 'list | None' = None,
    skipinitialspace: 'bool' = False,
    skiprows: 'list[int] | int | Callable[[Hashable], bool] | None' = None,
    skipfooter: 'int' = 0,
    nrows: 'int | None' = None,
    na_values: 'Hashable | Iterable[Hashable] | Mapping[Hashable, Iterable[Hashable]] | None' = None,
  

# Exercise: Taxi data

1. Read the taxi data (`taxi.csv`) into a data frame. We only care about the columns `passenger_count`, `trip_distance`, and `total_amount`.
2. What were the mean and median trip distances and total amounts in this data set?
3. What was the most common number of passengers?
4. How many trips went further than 30 miles?
5. How many trips went 0 miles?

In [146]:
df = pd.read_csv('taxi.csv',
                 usecols=['passenger_count', 'trip_distance', 'total_amount'])

df

Unnamed: 0,passenger_count,trip_distance,total_amount
0,1,1.63,17.80
1,1,0.46,8.30
2,1,0.87,11.00
3,1,2.13,17.16
4,1,1.40,10.30
...,...,...,...
9994,1,2.70,12.30
9995,1,4.50,20.30
9996,1,5.59,22.30
9997,6,1.54,7.80


In [147]:
df.dtypes

passenger_count      int64
trip_distance      float64
total_amount       float64
dtype: object

In [149]:
# this is a quick/easy way to find out how many NaN values we have in each column!

df.isna().sum()

passenger_count    0
trip_distance      0
total_amount       0
dtype: int64
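
Question 2 asks for the mean and median of two columns. One way to get them, sketched here on a tiny made-up data frame rather than the real taxi data, is to select both columns and call the same methods we used on a series. On a data frame they return one value per column:

```python
import pandas as pd

# Hypothetical stand-in for the taxi data
trips = pd.DataFrame({'trip_distance': [1.0, 2.0, 6.0],
                      'total_amount': [5.0, 10.0, 30.0]})

cols = ['trip_distance', 'total_amount']

means = trips[cols].mean()       # a series: one mean per column
medians = trips[cols].median()   # a series: one median per column

print(means)
print(medians)
```

A mean well above the median, as in this toy data, hints at a few large values pulling the average up.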

In [152]:
# What was the most common number of passengers?

df['passenger_count'].value_counts()

passenger_count
1    7207
2    1313
5     520
3     406
6     369
4     182
0       2
Name: count, dtype: int64

In [153]:
# what percentage of trips were with each number of passengers?

df['passenger_count'].value_counts(normalize=True)   # this gives us proportions (fractions of 1); multiply by 100 for percentages

passenger_count
1    0.720772
2    0.131313
5    0.052005
3    0.040604
6    0.036904
4    0.018202
0    0.000200
Name: proportion, dtype: float64

In [157]:
# How many trips went further than 30 miles?

df.loc[  df['trip_distance'] > 30  ]

Unnamed: 0,passenger_count,trip_distance,total_amount
809,1,35.51,135.13
3323,1,32.1,162.39
4224,1,31.9,252.35
4270,1,64.6,79.96
4291,1,32.4,63.36
4583,1,37.2,210.14
5470,1,34.84,137.59
8513,1,60.3,160.05
9231,1,31.5,150.05
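
The cell above shows the matching rows, but the question was "how many?" Since `True` counts as 1 and `False` as 0, summing the boolean series gives the count directly. A minimal sketch on made-up distances:

```python
import pandas as pd

# Hypothetical stand-in for the taxi data
trips = pd.DataFrame({'trip_distance': [0.0, 12.5, 31.0, 0.0, 45.2]})

long_mask = trips['trip_distance'] > 30      # boolean series, one value per row

n_long = long_mask.sum()                     # number of True values
n_zero = (trips['trip_distance'] == 0).sum()

# Equivalent, if you already have the filtered rows
n_long_alt = len(trips.loc[long_mask])

print(n_long, n_zero, n_long_alt)
```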


In [158]:
# How many trips went 0 miles?

df.loc[  df['trip_distance'] == 0  ]

Unnamed: 0,passenger_count,trip_distance,total_amount
149,1,0.0,4.30
246,1,0.0,3.30
297,2,0.0,2.30
657,1,0.0,15.35
660,1,0.0,3.30
...,...,...,...
9016,1,0.0,6.30
9087,2,0.0,69.99
9093,1,0.0,66.00
9740,1,0.0,4.30


In [160]:
# how much, on average, did people pay for the trips in which they went 0 miles?
# use the 2-argument form of .loc

df.loc[ 
    df['trip_distance'] == 0      # row selector
        ,
    ['total_amount']     # column selector
].mean()

total_amount    31.58194
dtype: float64
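
Notice that the result above is a series, not a single number. That is because the column selector was a list (`['total_amount']`), so `.loc` returned a one-column data frame. Passing the column name as a plain string returns a series, whose mean is a single value. A small sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical stand-in for the taxi data
trips = pd.DataFrame({'trip_distance': [0.0, 0.0, 3.0],
                      'total_amount': [4.0, 6.0, 20.0]})

zero = trips['trip_distance'] == 0

# List as column selector: a data frame comes back, so .mean() gives a series
as_series = trips.loc[zero, ['total_amount']].mean()

# String as column selector: a series comes back, so .mean() gives one number
as_scalar = trips.loc[zero, 'total_amount'].mean()

print(type(as_series))
print(as_scalar)
```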

In [163]:
df[df['trip_distance'] == 0]

Unnamed: 0,passenger_count,trip_distance,total_amount
149,1,0.0,4.30
246,1,0.0,3.30
297,2,0.0,2.30
657,1,0.0,15.35
660,1,0.0,3.30
...,...,...,...
9016,1,0.0,6.30
9087,2,0.0,69.99
9093,1,0.0,66.00
9740,1,0.0,4.30


In [165]:
# for trips that went longer than 30 miles, what did people pay?

df.loc[  df['trip_distance'] > 30 ,
         'total_amount'].describe()

count      9.000000
mean     150.113333
std       58.238690
min       63.360000
25%      135.130000
50%      150.050000
75%      162.390000
max      252.350000
Name: total_amount, dtype: float64

In [166]:
# for trips with 1 passenger, what was the mean total_amount?

df.loc[    df['passenger_count'] == 1,
       'total_amount'].mean()

np.float64(17.368569446371584)

In [167]:
# for trips with >1 passenger, what was the mean total_amount?

df.loc[    df['passenger_count'] > 1,
       'total_amount'].mean()

np.float64(18.02177419354839)

In [168]:
!ls *.csv

burrito_current.csv	   languages.csv  taxi.csv
celebrity_deaths_2016.csv  mytaxi.csv	  titanic3.csv


In [170]:
print(open('titanic3.csv').read()[:1000])

pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1,1,"Allen, Miss. Elisabeth Walton",female,29,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
1,0,"Allison, Miss. Helen Loraine",female,2,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
1,0,"Allison, Mr. Hudson Joshua Creighton",male,30,1,2,113781,151.5500,C22 C26,S,,135,"Montreal, PQ / Chesterville, ON"
1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
1,1,"Anderson, Mr. Harry",male,48,0,0,19952,26.5500,E12,S,3,,"New York, NY"
1,1,"Andrews, Miss. Kornelia Theodosia",female,63,1,0,13502,77.9583,D7,S,10,,"Hudson, NY"
1,0,"Andrews, Mr. Thomas Jr",male,39,0,0,112050,0.0000,A36,S,,,"Belfast, NI"
1,1,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53,2,0,11769,51.4792,C101,S,D,,"Bayside, Quee

# Exercise: Titanic data

1. Create a data frame from the Titanic data set (in `titanic3.csv`)
2. What was the mean `fare` for people who survived? What about the mean `fare` for people who didn't?
3. Do we see a difference in mean `fare` for each value of `sex` (male/female)?
4. What was the highest fare paid on the Titanic? Grab all of the data you can about the passengers who paid it.

In [171]:
df = pd.read_csv('titanic3.csv')
df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [175]:
df.loc[  df['survived'] == 1  ,    # row selector
        'fare'                     # column selector
   ].mean()

np.float64(49.361183600000004)

In [176]:
df.loc[  df['survived'] == 0  ,    # row selector
        'fare'                     # column selector
   ].mean()

np.float64(23.353830569306933)

In [177]:
df.loc[  df['sex'] == 'male'  ,    # row selector
        'fare'                     # column selector
   ].mean()

np.float64(26.154600831353918)

In [178]:
df.loc[  df['sex'] == 'female'  ,    # row selector
        'fare'                     # column selector
   ].mean()

np.float64(46.19809656652361)

In [181]:
# what was the highest fare paid?

df.loc[   df['fare'] == df['fare'].max()  ]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
49,1.0,1.0,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0.0,1.0,PC 17755,512.3292,B51 B53 B55,C,3,,"Austria-Hungary / Germantown, Philadelphia, PA"
50,1.0,1.0,"Cardeza, Mrs. James Warburton Martinez (Charlo...",female,58.0,0.0,1.0,PC 17755,512.3292,B51 B53 B55,C,3,,"Germantown, Philadelphia, PA"
183,1.0,1.0,"Lesurer, Mr. Gustave J",male,35.0,0.0,0.0,PC 17755,512.3292,B101,C,3,,
302,1.0,1.0,"Ward, Miss. Anna",female,35.0,0.0,0.0,PC 17755,512.3292,,C,3,,


# Indexes

We can use any column we want for comparisons and for retrieving data. But Pandas encourages us to make a column the index if we're going to be retrieving rows based on that column's values on a regular basis.

- How do we set something to be in the index?
- How do we change that?
- Multi-indexes

In [182]:
# to make a column into the index for our data frame, just call set_index
# note: this doesn't change the data frame; it returns a new data frame with that column as the index

df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
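
Here is a minimal sketch of that behavior on a small made-up data frame (the names `people` and `indexed` are just for illustration): `set_index` hands back a new data frame and leaves the original alone.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data
people = pd.DataFrame({'ticket': ['A1', 'B2', 'B2'],
                       'fare': [10.0, 20.0, 20.0]})

indexed = people.set_index('ticket')   # new data frame; 'ticket' is now the index

print(people.columns.tolist())    # the original still has 'ticket' as a column
print(indexed.columns.tolist())   # the new one does not
print(indexed.loc['B2'])          # every row whose index value is 'B2'
```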


In [184]:
# to keep the index on our data frame, we assign back to the data frame's variable

df = df.set_index('ticket')
df.head()

Unnamed: 0_level_0,pclass,survived,name,sex,age,sibsp,parch,fare,cabin,embarked,boat,body,home.dest
ticket,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
24160,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,211.3375,B5,S,2.0,,"St Louis, MO"
113781,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
113781,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
113781,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
113781,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [186]:
df.loc['113781']

Unnamed: 0_level_0,pclass,survived,name,sex,age,sibsp,parch,fare,cabin,embarked,boat,body,home.dest
ticket,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
113781,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
113781,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
113781,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
113781,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
113781,1.0,1.0,"Cleaver, Miss. Alice",female,22.0,0.0,0.0,151.55,,S,11.0,,
113781,1.0,1.0,"Daniels, Miss. Sarah",female,33.0,0.0,0.0,151.55,,S,8.0,,


In [187]:
df.head()

Unnamed: 0_level_0,pclass,survived,name,sex,age,sibsp,parch,fare,cabin,embarked,boat,body,home.dest
ticket,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
24160,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,211.3375,B5,S,2.0,,"St Louis, MO"
113781,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
113781,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
113781,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
113781,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [188]:
# use fancy indexing (a list of labels) to retrieve more than one index value

df.loc[['24160', '113781']]

Unnamed: 0_level_0,pclass,survived,name,sex,age,sibsp,parch,fare,cabin,embarked,boat,body,home.dest
ticket,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
24160,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,211.3375,B5,S,2.0,,"St Louis, MO"
24160,1.0,1.0,"Kreuchen, Miss. Emilie",female,39.0,0.0,0.0,211.3375,,S,2.0,,
24160,1.0,1.0,"Madill, Miss. Georgette Alexandra",female,15.0,0.0,1.0,211.3375,B5,S,2.0,,"St Louis, MO"
24160,1.0,1.0,"Robert, Mrs. Edward Scott (Elisabeth Walton Mc...",female,43.0,0.0,1.0,211.3375,B3,S,2.0,,"St Louis, MO"
113781,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
113781,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
113781,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
113781,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
113781,1.0,1.0,"Cleaver, Miss. Alice",female,22.0,0.0,0.0,151.55,,S,11.0,,
113781,1.0,1.0,"Daniels, Miss. Sarah",female,33.0,0.0,0.0,151.55,,S,8.0,,


In [189]:
df.loc['24160':'113781']

KeyError: "Cannot get left slice bound for non-unique label: '24160'"
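
The slice fails because the index contains repeated labels that are scattered through the data, so Pandas can't tell where the slice should begin. Sorting the index first puts equal labels next to each other, after which label slices work. A minimal sketch with a made-up, deliberately unsorted index:

```python
import pandas as pd

# Hypothetical data: 'b' appears twice, and not in adjacent rows
fares = pd.DataFrame({'fare': [1.0, 2.0, 3.0, 4.0]},
                     index=['b', 'a', 'b', 'c'])

try:
    fares.loc['b':'c']
    raised = False
except KeyError as e:
    raised = True
    print('KeyError:', e)

# After sorting, the labels are in order, so the slice is well defined
sorted_fares = fares.sort_index()
subset = sorted_fares.loc['b':'c']    # both 'b' rows and the 'c' row
print(subset)
```

Remember that label slices with `.loc` include the end label, unlike regular Python slices.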

In [191]:
# what if I want "ticket" to be a normal column again?
# I can use reset_index(), which turns the index back into a regular column

df = df.reset_index()
df

Unnamed: 0,ticket,pclass,survived,name,sex,age,sibsp,parch,fare,cabin,embarked,boat,body,home.dest
0,24160,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0000,0.0,0.0,211.3375,B5,S,2,,"St Louis, MO"
1,113781,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,113781,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0000,1.0,2.0,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,113781,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1.0,2.0,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,113781,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1.0,2.0,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1305,2665,3.0,0.0,"Zabour, Miss. Thamine",female,,1.0,0.0,14.4542,,C,,,
1306,2656,3.0,0.0,"Zakarian, Mr. Mapriededer",male,26.5000,0.0,0.0,7.2250,,C,,304.0,
1307,2670,3.0,0.0,"Zakarian, Mr. Ortin",male,27.0000,0.0,0.0,7.2250,,C,,,
1308,315082,3.0,0.0,"Zimmerman, Mr. Leo",male,29.0000,0.0,0.0,7.8750,,S,,,


In [192]:
# look at the index objects

df.index

RangeIndex(start=0, stop=1310, step=1)

In [193]:
df = df.set_index('ticket')

df.index

Index([   '24160',   '113781',   '113781',   '113781',   '113781',    '19952',
          '13502',   '112050',    '11769', 'PC 17609',
       ...
           '2659',     '2628',     '2647',     '2627',     '2665',     '2665',
           '2656',     '2670',   '315082',        nan],
      dtype='object', name='ticket', length=1310)

In [194]:
df = df.reset_index()

In [195]:
df.columns

Index(['ticket', 'pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')

In [196]:
x = DataFrame(np.random.randint(0, 100, [3,3]))
x

Unnamed: 0,0,1,2
0,31,1,65
1,41,57,35
2,11,46,82


In [197]:
x.columns

RangeIndex(start=0, stop=3, step=1)

In [198]:
x.index

RangeIndex(start=0, stop=3, step=1)

In [199]:
# sometimes you want a multi-index -- especially when you have hierarchical data

df.head()

Unnamed: 0,ticket,pclass,survived,name,sex,age,sibsp,parch,fare,cabin,embarked,boat,body,home.dest
0,24160,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,211.3375,B5,S,2.0,,"St Louis, MO"
1,113781,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,113781,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,113781,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,113781,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [200]:
df = df.set_index(['pclass', 'sex'])
df

Unnamed: 0_level_0,Unnamed: 1_level_0,ticket,survived,name,age,sibsp,parch,fare,cabin,embarked,boat,body,home.dest
pclass,sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1.0,female,24160,1.0,"Allen, Miss. Elisabeth Walton",29.0000,0.0,0.0,211.3375,B5,S,2,,"St Louis, MO"
1.0,male,113781,1.0,"Allison, Master. Hudson Trevor",0.9167,1.0,2.0,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
1.0,female,113781,0.0,"Allison, Miss. Helen Loraine",2.0000,1.0,2.0,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
1.0,male,113781,0.0,"Allison, Mr. Hudson Joshua Creighton",30.0000,1.0,2.0,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
1.0,female,113781,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",25.0000,1.0,2.0,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3.0,female,2665,0.0,"Zabour, Miss. Thamine",,1.0,0.0,14.4542,,C,,,
3.0,male,2656,0.0,"Zakarian, Mr. Mapriededer",26.5000,0.0,0.0,7.2250,,C,,304.0,
3.0,male,2670,0.0,"Zakarian, Mr. Ortin",27.0000,0.0,0.0,7.2250,,C,,,
3.0,male,315082,0.0,"Zimmerman, Mr. Leo",29.0000,0.0,0.0,7.8750,,S,,,


In [201]:
# I can now retrieve via the outer part of the multi-index

df.loc[1.0]

Unnamed: 0_level_0,ticket,survived,name,age,sibsp,parch,fare,cabin,embarked,boat,body,home.dest
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
female,24160,1.0,"Allen, Miss. Elisabeth Walton",29.0000,0.0,0.0,211.3375,B5,S,2,,"St Louis, MO"
male,113781,1.0,"Allison, Master. Hudson Trevor",0.9167,1.0,2.0,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
female,113781,0.0,"Allison, Miss. Helen Loraine",2.0000,1.0,2.0,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
male,113781,0.0,"Allison, Mr. Hudson Joshua Creighton",30.0000,1.0,2.0,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
female,113781,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",25.0000,1.0,2.0,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...
male,113510,0.0,"Williams-Lambert, Mr. Fletcher Fellows",,0.0,0.0,35.0000,C128,S,,,"London, England"
female,16966,1.0,"Wilson, Miss. Helen Alice",31.0000,0.0,0.0,134.5000,E39 E41,C,3,,
male,19947,1.0,"Woolner, Mr. Hugh",,0.0,0.0,35.5000,C52,S,D,,"London, England"
male,113807,0.0,"Wright, Mr. George",62.0000,0.0,0.0,26.5500,,S,,,"Halifax, NS"


In [202]:
df.loc[2.0]

Unnamed: 0_level_0,ticket,survived,name,age,sibsp,parch,fare,cabin,embarked,boat,body,home.dest
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
male,P/PP 3381,0.0,"Abelson, Mr. Samuel",30.0,1.0,0.0,24.000,,C,,,"Russia New York, NY"
female,P/PP 3381,1.0,"Abelson, Mrs. Samuel (Hannah Wizosky)",28.0,1.0,0.0,24.000,,C,10,,"Russia New York, NY"
male,248744,0.0,"Aldworth, Mr. Charles Augustus",30.0,0.0,0.0,13.000,,S,,,"Bryn Mawr, PA, USA"
male,231945,0.0,"Andrew, Mr. Edgardo Samuel",18.0,0.0,0.0,11.500,,S,,,"Buenos Aires, Argentina / New Jersey, NJ"
male,C.A. 34050,0.0,"Andrew, Mr. Frank Thomas",25.0,0.0,0.0,10.500,,S,,,"Cornwall, England Houghton, MI"
...,...,...,...,...,...,...,...,...,...,...,...,...
male,SC/PARIS 2159,0.0,"Wheeler, Mr. Edwin ""Frederick""",,0.0,0.0,12.875,,S,,,
male,244270,1.0,"Wilhelms, Mr. Charles",31.0,0.0,0.0,13.000,,S,9,,"London, England"
male,244373,1.0,"Williams, Mr. Charles Eugene",,0.0,0.0,13.000,,S,14,,"Harrow, England"
female,220844,1.0,"Wright, Miss. Marion",26.0,0.0,0.0,13.500,,S,9,,"Yoevil, England / Cottage Grove, OR"


In [203]:
# what if I want to get all first-class males?
# if I use a list, it'll be seen as fancy indexing, so we use a tuple

df.loc[(1.0, 'male')]


Unnamed: 0_level_0,Unnamed: 1_level_0,ticket,survived,name,age,sibsp,parch,fare,cabin,embarked,boat,body,home.dest
pclass,sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1.0,male,113781,1.0,"Allison, Master. Hudson Trevor",0.9167,1.0,2.0,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
1.0,male,113781,0.0,"Allison, Mr. Hudson Joshua Creighton",30.0000,1.0,2.0,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
1.0,male,19952,1.0,"Anderson, Mr. Harry",48.0000,0.0,0.0,26.5500,E12,S,3,,"New York, NY"
1.0,male,112050,0.0,"Andrews, Mr. Thomas Jr",39.0000,0.0,0.0,0.0000,A36,S,,,"Belfast, NI"
1.0,male,PC 17609,0.0,"Artagaveytia, Mr. Ramon",71.0000,0.0,0.0,49.5042,,C,,22.0,"Montevideo, Uruguay"
1.0,...,...,...,...,...,...,...,...,...,...,...,...,...
1.0,male,PC 17597,0.0,"Williams, Mr. Charles Duane",51.0000,0.0,1.0,61.3792,,C,,,"Geneva, Switzerland / Radnor, PA"
1.0,male,PC 17597,1.0,"Williams, Mr. Richard Norris II",21.0000,0.0,1.0,61.3792,,C,A,,"Geneva, Switzerland / Radnor, PA"
1.0,male,113510,0.0,"Williams-Lambert, Mr. Fletcher Fellows",,0.0,0.0,35.0000,C128,S,,,"London, England"
1.0,male,19947,1.0,"Woolner, Mr. Hugh",,0.0,0.0,35.5000,C52,S,D,,"London, England"


In [205]:
# what was the mean fare paid by men in first class?
df.loc[(1.0, 'male'), 'fare'].mean()


np.float64(69.88838491620112)
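
The same pattern on a data frame small enough to check by hand. The values are made up, and calling `sort_index()` is an optional extra here (Pandas can generally look things up more efficiently in a sorted multi-index):

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data
people = pd.DataFrame({'pclass': [1, 1, 1, 2, 2],
                       'sex': ['male', 'female', 'male', 'male', 'female'],
                       'fare': [100.0, 80.0, 60.0, 20.0, 30.0]})

mi = people.set_index(['pclass', 'sex']).sort_index()

first_class = mi.loc[1]                  # outer level only: everyone in class 1
first_class_men = mi.loc[(1, 'male')]    # tuple: one value per index level

# Tuple as row selector, column name as column selector
mean_fare = mi.loc[(1, 'male'), 'fare'].mean()
print(mean_fare)
```

The two first-class men paid 100 and 60, so the mean is 80.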

In [206]:
# what if you're tired of having sex as a part of your multi-index?

df.reset_index('sex')

Unnamed: 0_level_0,sex,ticket,survived,name,age,sibsp,parch,fare,cabin,embarked,boat,body,home.dest
pclass,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1.0,female,24160,1.0,"Allen, Miss. Elisabeth Walton",29.0000,0.0,0.0,211.3375,B5,S,2,,"St Louis, MO"
1.0,male,113781,1.0,"Allison, Master. Hudson Trevor",0.9167,1.0,2.0,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
1.0,female,113781,0.0,"Allison, Miss. Helen Loraine",2.0000,1.0,2.0,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
1.0,male,113781,0.0,"Allison, Mr. Hudson Joshua Creighton",30.0000,1.0,2.0,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
1.0,female,113781,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",25.0000,1.0,2.0,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3.0,female,2665,0.0,"Zabour, Miss. Thamine",,1.0,0.0,14.4542,,C,,,
3.0,male,2656,0.0,"Zakarian, Mr. Mapriededer",26.5000,0.0,0.0,7.2250,,C,,304.0,
3.0,male,2670,0.0,"Zakarian, Mr. Ortin",27.0000,0.0,0.0,7.2250,,C,,,
3.0,male,315082,0.0,"Zimmerman, Mr. Leo",29.0000,0.0,0.0,7.8750,,S,,,


In [207]:
# what if I want the mean ticket price for all men on the Titanic?

df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,ticket,survived,name,age,sibsp,parch,fare,cabin,embarked,boat,body,home.dest
pclass,sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1.0,female,24160,1.0,"Allen, Miss. Elisabeth Walton",29.0,0.0,0.0,211.3375,B5,S,2.0,,"St Louis, MO"
1.0,male,113781,1.0,"Allison, Master. Hudson Trevor",0.9167,1.0,2.0,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
1.0,female,113781,0.0,"Allison, Miss. Helen Loraine",2.0,1.0,2.0,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
1.0,male,113781,0.0,"Allison, Mr. Hudson Joshua Creighton",30.0,1.0,2.0,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
1.0,female,113781,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",25.0,1.0,2.0,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [208]:
# 'sex' is now a level of the index, not a column, so [] can't find it
df['sex'] == 'male'

KeyError: 'sex'
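
One way around the `KeyError`, sketched on made-up data: ask the index for the labels of one level with `get_level_values`, and build the boolean mask from those. The frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; 'pclass' and 'sex' form the multi-index
toy = pd.DataFrame(
    {
        "pclass": [1, 1, 3, 3],
        "sex": ["male", "female", "male", "female"],
        "fare": [50.0, 80.0, 8.0, 10.0],
    }
).set_index(["pclass", "sex"])

# get_level_values returns the labels of one index level, one per row,
# so comparing it with 'male' gives a boolean mask we can hand to .loc
is_male = toy.index.get_level_values("sex") == "male"
male_mean_fare = toy.loc[is_male, "fare"].mean()
print(male_mean_fare)
```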

In [209]:
# we can also use the .xs method, which lets us retrieve based on a non-outer level of a multi-index

df.xs('male', level='sex')

Unnamed: 0_level_0,ticket,survived,name,age,sibsp,parch,fare,cabin,embarked,boat,body,home.dest
pclass,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1.0,113781,1.0,"Allison, Master. Hudson Trevor",0.9167,1.0,2.0,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
1.0,113781,0.0,"Allison, Mr. Hudson Joshua Creighton",30.0000,1.0,2.0,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
1.0,19952,1.0,"Anderson, Mr. Harry",48.0000,0.0,0.0,26.5500,E12,S,3,,"New York, NY"
1.0,112050,0.0,"Andrews, Mr. Thomas Jr",39.0000,0.0,0.0,0.0000,A36,S,,,"Belfast, NI"
1.0,PC 17609,0.0,"Artagaveytia, Mr. Ramon",71.0000,0.0,0.0,49.5042,,C,,22.0,"Montevideo, Uruguay"
...,...,...,...,...,...,...,...,...,...,...,...,...
3.0,2647,0.0,"Yousif, Mr. Wazli",,0.0,0.0,7.2250,,C,,,
3.0,2627,0.0,"Yousseff, Mr. Gerious",,0.0,0.0,14.4583,,C,,,
3.0,2656,0.0,"Zakarian, Mr. Mapriededer",26.5000,0.0,0.0,7.2250,,C,,304.0,
3.0,2670,0.0,"Zakarian, Mr. Ortin",27.0000,0.0,0.0,7.2250,,C,,,
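
To finish the question about the mean fare for all men, `.xs` can be followed by a column selection and `.mean()`. This is a sketch on made-up data, not the Titanic file. It also shows `drop_level=False`, which keeps the selected level in the result instead of removing it.

```python
import pandas as pd

# Hypothetical toy data; 'pclass' and 'sex' form the multi-index
toy = pd.DataFrame(
    {
        "pclass": [1, 1, 3, 3],
        "sex": ["male", "female", "male", "female"],
        "fare": [50.0, 80.0, 8.0, 10.0],
    }
).set_index(["pclass", "sex"])

# Select on the inner level; by default the 'sex' level is dropped from the result
men = toy.xs("male", level="sex")
men_mean_fare = men["fare"].mean()
print(men_mean_fare)

# drop_level=False keeps both levels of the index in the result
men_full = toy.xs("male", level="sex", drop_level=False)
print(list(men_full.index.names))
```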


In [210]:
!ls *.csv

burrito_current.csv	   languages.csv  taxi.csv
celebrity_deaths_2016.csv  mytaxi.csv	  titanic3.csv


# Exercise: Taxi hierarchies

1. Read `taxi.csv` into a data frame. We're going to want `VendorID` and `passenger_count` to be the columns in the multi-index.
2. What are the mean `total_amount` and `trip_distance` where the `VendorID` is 1?
3. What are the mean `total_amount` and `trip_distance` where the `VendorID` is 1 and `passenger_count` is 2?
4. What are the mean `total_amount` and `trip_distance` where `passenger_count` is 3?


In [211]:
df = pd.read_csv('taxi.csv')
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3


In [214]:
# (most) anywhere you can pass a single string, you can pass a list of strings

df = df.set_index(['VendorID', 'passenger_count'])
df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
VendorID,passenger_count,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
2,1,2015-06-02 11:19:29,2015-06-02 11:47:52,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
2,1,2015-06-02 11:19:30,2015-06-02 11:27:56,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,1,2015-06-02 11:19:31,2015-06-02 11:30:30,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
2,1,2015-06-02 11:19:31,2015-06-02 11:39:02,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
1,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3


In [215]:
# we can pass a string or a list of strings to index_col, which indicates what column(s) we want to use as an index

df = pd.read_csv('taxi.csv', index_col=['VendorID', 'passenger_count'])
df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
VendorID,passenger_count,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
2,1,2015-06-02 11:19:29,2015-06-02 11:47:52,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
2,1,2015-06-02 11:19:30,2015-06-02 11:27:56,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,1,2015-06-02 11:19:31,2015-06-02 11:30:30,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
2,1,2015-06-02 11:19:31,2015-06-02 11:39:02,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
1,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3
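
The two approaches above (calling `set_index` after loading, or passing `index_col` while loading) should produce the same frame. Here is a small sketch that checks this on an in-memory CSV. The text below is a cut-down stand-in for `taxi.csv` with only four columns, and its last row is invented.

```python
import io
import pandas as pd

# A tiny stand-in for taxi.csv (assumption: only four of the real columns)
csv_text = """VendorID,passenger_count,trip_distance,total_amount
2,1,1.63,17.8
1,1,1.40,10.3
1,2,3.00,15.0
"""

# Option 1: load everything, then move two columns into the index
via_set_index = pd.read_csv(io.StringIO(csv_text)).set_index(
    ["VendorID", "passenger_count"]
)

# Option 2: tell read_csv which columns to use as the index
via_index_col = pd.read_csv(
    io.StringIO(csv_text), index_col=["VendorID", "passenger_count"]
)

print(via_set_index.equals(via_index_col))
```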


In [218]:
# What are the mean total_amount and trip_distance where the VendorID is 1?

df.loc[    1,                               # row selector
          ['total_amount', 'trip_distance'] # column selector
].mean()


total_amount     17.325272
trip_distance     3.053404
dtype: float64

In [219]:
df.loc[    2,                               # row selector
          ['total_amount', 'trip_distance'] # column selector
].mean()


total_amount     17.765027
trip_distance     3.256843
dtype: float64

In [221]:
# What are the mean total_amount and trip_distance where the VendorID is 1 and passenger_count is 2?

df.loc[  (1, 2),    # row selector
        ['total_amount', 'trip_distance']
   ].mean()



total_amount     19.076807
trip_distance     3.452027
dtype: float64

In [224]:
# What are the mean total_amount and trip_distance where passenger_count is 3?

# this is where we use .xs
df.xs(3, level='passenger_count')[['total_amount', 'trip_distance']].mean()

total_amount     17.994704
trip_distance     3.342389
dtype: float64

# Next up

- Sorting (and how this affects slices and other warnings)
- Excel
- URLs
- Reading HTML

Return at 13:30 Paris time

# Sorting our data frame (and avoiding problems)

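The issue that we've seen is that Pandas has trouble with an index whose labels are out of order. Depending on the operation, this gives us either an error (for example, when we slice with `.loc` on an unsorted index that has repeated labels) or a warning (for example, when we look up a tuple in an unsorted multi-index).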



In [226]:
df = pd.read_csv('taxi.csv', index_col='passenger_count')
df

Unnamed: 0_level_0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
passenger_count,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
1,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1.63,-73.954430,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.00,0.0,0.3,17.80
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.00,0.0,0.3,8.30
1,2,2015-06-02 11:19:31,2015-06-02 11:30:30,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.20,0.0,0.3,11.00
1,2,2015-06-02 11:19:31,2015-06-02 11:39:02,2.13,-73.945892,40.773529,1,N,-73.971527,40.760330,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
1,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1.40,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.00,0.0,0.3,10.30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1,1,2015-06-01 00:12:59,2015-06-01 00:24:18,2.70,-73.947792,40.814972,1,N,-73.973358,40.783638,2,11.0,0.5,0.5,0.00,0.0,0.3,12.30
1,1,2015-06-01 00:12:59,2015-06-01 00:28:16,4.50,-74.004066,40.747818,1,N,-73.953758,40.779285,1,16.0,0.5,0.5,3.00,0.0,0.3,20.30
1,2,2015-06-01 00:13:00,2015-06-01 00:37:25,5.59,-73.994377,40.766102,1,N,-73.903206,40.750546,2,21.0,0.5,0.5,0.00,0.0,0.3,22.30
6,2,2015-06-01 00:13:02,2015-06-01 00:19:10,1.54,-73.978302,40.748531,1,N,-73.989166,40.762852,2,6.5,0.5,0.5,0.00,0.0,0.3,7.80


In [227]:
# let's try to get a slice of all rides with 2 to 6 passengers

df.loc[2:6]   # notice that a .loc slice goes up to *and including* the end label

KeyError: 'Cannot get left slice bound for non-unique label: 2'

In [229]:
# this is because the data frame's index isn't sorted.
# it's not about the order of the *values* in the columns; it's about the order of the labels in the *index*.

# Pandas provides us with *two* methods for sorting:
# - sort_index
# - sort_values

df = df.sort_index()
df.loc[2:6]   # we get all of the rows with 2-6 passengers

Unnamed: 0_level_0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
passenger_count,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
2,2,2015-06-02 11:21:42,2015-06-02 11:30:01,1.11,-74.007446,40.732414,1,N,-73.998344,40.745461,2,7.0,0.0,0.5,0.00,0.00,0.3,7.80
2,2,2015-06-01 00:13:11,2015-06-01 00:17:44,1.31,-73.987633,40.760452,1,N,-73.989281,40.753223,2,6.0,0.5,0.5,0.00,0.00,0.3,7.30
2,2,2015-06-02 11:33:17,2015-06-02 12:05:35,5.19,-73.991402,40.748791,1,N,-74.004379,40.706070,2,23.5,0.0,0.5,0.00,0.00,0.3,24.30
2,2,2015-06-01 00:03:31,2015-06-01 00:35:41,20.75,-73.782997,40.644302,2,N,-73.952286,40.772957,1,52.0,0.0,0.5,10.00,5.54,0.3,68.34
2,1,2015-06-01 00:04:18,2015-06-01 00:12:28,1.90,-73.954506,40.804218,1,N,-73.940231,40.822208,2,8.5,0.5,0.5,0.00,0.00,0.3,9.80
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6,2,2015-06-01 00:06:14,2015-06-01 00:11:15,0.91,-73.993790,40.752651,1,N,-73.988678,40.747730,2,5.5,0.5,0.5,0.00,0.00,0.3,6.80
6,2,2015-06-04 15:20:21,2015-06-04 15:49:51,2.82,-73.984268,40.764874,1,N,-73.958168,40.768665,1,18.5,0.0,0.5,3.86,0.00,0.3,23.16
6,2,2015-06-02 11:22:29,2015-06-02 11:58:21,1.78,-73.985146,40.768227,1,N,-73.983452,40.749786,1,20.5,0.0,0.5,5.32,0.00,0.3,26.62
6,2,2015-06-01 00:04:12,2015-06-01 00:11:08,1.56,-73.953476,40.767208,1,N,-73.969826,40.753185,1,7.5,0.5,0.5,1.00,0.00,0.3,9.80
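
The failure and the fix can be reproduced on a five-row frame. This is a sketch with made-up values; `is_monotonic_increasing` is a quick way to check whether an index is already in order before slicing.

```python
import pandas as pd

# Hypothetical toy data: the index has repeated labels and is out of order
toy = pd.DataFrame(
    {"total_amount": [17.8, 8.3, 22.3, 7.8, 12.3]},
    index=pd.Index([1, 2, 1, 6, 2], name="passenger_count"),
)

print(toy.index.is_monotonic_increasing)

# Slicing by label fails because Pandas can't find one starting point for 2
raised = False
try:
    toy.loc[2:6]
except KeyError as err:
    raised = True
    print("KeyError:", err)

# After sorting the index, the same slice works and includes both ends
sorted_toy = toy.sort_index()
subset = sorted_toy.loc[2:6]
print(subset)
```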


In [230]:
# what if you want to sort the index in reverse order?

df.sort_index(ascending=False)

Unnamed: 0_level_0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
passenger_count,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
6,2,2015-06-04 15:22:01,2015-06-04 15:48:23,2.37,-74.007500,40.725880,1,N,-73.987129,40.750641,1,17.0,0.0,0.5,4.45,0.00,0.3,22.25
6,2,2015-06-04 15:20:06,2015-06-04 16:07:19,13.07,-73.954308,40.764095,1,N,-73.979317,40.626328,2,42.5,0.0,0.5,0.00,0.00,0.3,43.30
6,2,2015-06-02 11:31:41,2015-06-02 11:37:43,0.29,-73.956276,40.781658,1,N,-73.958755,40.778233,1,5.5,0.0,0.5,1.26,0.00,0.3,7.56
6,2,2015-06-02 11:31:39,2015-06-02 11:40:57,1.51,-73.956383,40.779827,1,N,-73.978317,40.789181,1,8.5,0.0,0.5,1.40,0.00,0.3,10.70
6,2,2015-06-02 11:32:30,2015-06-02 11:49:29,1.77,-73.994812,40.740093,1,N,-73.978287,40.764221,1,11.5,0.0,0.5,2.46,0.00,0.3,14.76
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1,2,2015-06-06 16:52:49,2015-06-06 16:56:30,0.56,-73.998833,40.739803,1,N,-74.005356,40.734612,2,4.5,0.0,0.5,0.00,0.00,0.3,5.30
1,2,2015-06-06 16:52:48,2015-06-06 16:59:29,1.22,-73.987938,40.732616,1,N,-74.000191,40.742870,2,6.5,0.0,0.5,0.00,0.00,0.3,7.30
1,1,2015-06-04 15:18:57,2015-06-04 15:27:20,0.70,-73.953995,40.766647,1,N,-73.960716,40.757320,1,7.0,0.0,0.5,1.55,0.00,0.3,9.35
0,1,2015-06-04 15:15:45,2015-06-04 15:32:10,1.30,-73.953949,40.778915,1,N,-73.970337,40.788288,1,11.0,0.0,0.5,2.95,0.00,0.3,14.75


In [231]:
# what if we have a multi-index? 

df = df.reset_index()
df = df.set_index(['VendorID', 'passenger_count'])

In [232]:
df.sort_index()  # this will sort our index first by the outer level, then (secondarily) by the inner level

Unnamed: 0_level_0,Unnamed: 1_level_0,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
VendorID,passenger_count,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
1,0,2015-06-01 00:03:42,2015-06-01 00:22:09,7.90,-73.885246,40.773014,1,N,-73.976089,40.741604,1,23.5,0.5,0.5,6.05,5.54,0.3,36.39
1,0,2015-06-04 15:15:45,2015-06-04 15:32:10,1.30,-73.953949,40.778915,1,N,-73.970337,40.788288,1,11.0,0.0,0.5,2.95,0.00,0.3,14.75
1,1,2015-06-04 15:18:58,2015-06-04 16:12:04,13.70,-73.987183,40.700821,2,N,-73.779518,40.645245,1,52.0,0.0,0.5,13.20,0.00,0.3,66.00
1,1,2015-06-04 15:18:57,2015-06-04 15:27:20,0.70,-73.953995,40.766647,1,N,-73.960716,40.757320,1,7.0,0.0,0.5,1.55,0.00,0.3,9.35
1,1,2015-06-04 15:18:56,2015-06-04 15:28:28,1.40,-73.977776,40.777058,1,N,-73.964386,40.792500,2,8.0,0.0,0.5,0.00,0.00,0.3,8.80
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2,6,2015-06-01 00:06:14,2015-06-01 00:11:15,0.91,-73.993790,40.752651,1,N,-73.988678,40.747730,2,5.5,0.5,0.5,0.00,0.00,0.3,6.80
2,6,2015-06-04 15:20:21,2015-06-04 15:49:51,2.82,-73.984268,40.764874,1,N,-73.958168,40.768665,1,18.5,0.0,0.5,3.86,0.00,0.3,23.16
2,6,2015-06-02 11:22:29,2015-06-02 11:58:21,1.78,-73.985146,40.768227,1,N,-73.983452,40.749786,1,20.5,0.0,0.5,5.32,0.00,0.3,26.62
2,6,2015-06-01 00:04:12,2015-06-01 00:11:08,1.56,-73.953476,40.767208,1,N,-73.969826,40.753185,1,7.5,0.5,0.5,1.00,0.00,0.3,9.80


In [234]:
# same query as before, but with a sorted data frame

df = df.sort_index()
df.loc[  (1, 2),    # row selector
        ['total_amount', 'trip_distance']
   ].mean()

total_amount     19.076807
trip_distance     3.452027
dtype: float64
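
The result is the same as before; what changes is that the lookup no longer has to cope with an unsorted multi-index, which is the situation in which Pandas may emit a `PerformanceWarning`. A small sketch on invented data, using `is_monotonic_increasing` to check the index before and after sorting:

```python
import pandas as pd

# Hypothetical toy data with the index deliberately out of order
toy = pd.DataFrame(
    {
        "VendorID": [2, 1, 2, 1, 1],
        "passenger_count": [1, 2, 6, 1, 2],
        "total_amount": [17.8, 20.0, 7.8, 10.3, 30.0],
    }
).set_index(["VendorID", "passenger_count"])

sorted_before = toy.index.is_monotonic_increasing
toy = toy.sort_index()
sorted_after = toy.index.is_monotonic_increasing
print(sorted_before, sorted_after)

# Tuple lookup on the sorted multi-index: VendorID 1, passenger_count 2
mean_amount = toy.loc[(1, 2), "total_amount"].mean()
print(mean_amount)
```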

In [241]:
# what if I want to sort by VendorID in *ascending* order, but passenger_count in descending order?

df.sort_index(level=['VendorID', 'passenger_count'], 
              ascending=False)   # a single boolean applies to every level, so both are in descending order

Unnamed: 0_level_0,Unnamed: 1_level_0,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
VendorID,passenger_count,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
2,6,2015-06-04 15:22:13,2015-06-04 15:46:33,2.30,-73.954842,40.783520,1,N,-73.981850,40.775051,1,15.5,0.0,0.5,2.00,0.00,0.3,18.30
2,6,2015-06-02 11:30:37,2015-06-02 11:59:04,3.35,-73.995918,40.732151,1,N,-73.979393,40.765453,1,18.0,0.0,0.5,3.76,0.00,0.3,22.56
2,6,2015-06-02 11:27:34,2015-06-02 11:47:02,2.32,-73.970581,40.761372,1,N,-73.991302,40.739239,1,13.0,0.0,0.5,2.76,0.00,0.3,16.56
2,6,2015-06-04 15:17:17,2015-06-04 15:47:06,3.02,-74.007057,40.727749,1,N,-73.977760,40.753738,1,19.0,0.0,0.5,0.01,0.00,0.3,19.81
2,6,2015-06-04 15:18:11,2015-06-04 15:47:17,7.50,-74.010857,40.716583,1,N,-73.956627,40.766888,2,25.0,0.0,0.5,0.00,0.00,0.3,25.80
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1,1,2015-06-02 11:30:42,2015-06-02 11:47:34,1.10,-73.962494,40.773041,1,N,-73.972572,40.759670,1,11.0,0.0,0.5,2.35,0.00,0.3,14.15
1,1,2015-06-02 11:24:28,2015-06-02 11:43:38,1.90,-74.004944,40.737541,1,N,-74.005653,40.715446,2,13.0,0.0,0.5,0.00,0.00,0.3,13.80
1,1,2015-06-02 11:31:01,2015-06-02 11:55:39,2.20,-73.955246,40.782616,1,N,-73.966713,40.757256,2,15.5,0.0,0.5,0.00,0.00,0.3,16.30
1,0,2015-06-01 00:03:42,2015-06-01 00:22:09,7.90,-73.885246,40.773014,1,N,-73.976089,40.741604,1,23.5,0.5,0.5,6.05,5.54,0.3,36.39


In [242]:
# let's do VendorID ascending and passenger_count descending
# we can pass a list of booleans! 
# they will be interpreted in the same order as the levels we mentioned

df.sort_index(level=['VendorID', 'passenger_count'], 
              ascending=[True, False])   

Unnamed: 0_level_0,Unnamed: 1_level_0,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
VendorID,passenger_count,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
1,5,2015-06-01 00:13:13,2015-06-01 00:35:56,8.30,-73.949165,40.749912,1,N,-73.979492,40.665749,2,26.0,0.5,0.5,0.00,0.00,0.3,27.30
1,5,2015-06-06 16:52:59,2015-06-06 17:06:04,3.50,-73.951843,40.793392,1,N,-73.991821,40.770599,2,13.0,0.0,0.5,0.00,0.00,0.3,13.80
1,5,2015-06-04 15:22:54,2015-06-04 15:53:23,3.00,-73.988106,40.748238,1,N,-73.961945,40.779510,2,19.5,0.0,0.5,0.00,0.00,0.3,20.30
1,4,2015-06-02 11:21:13,2015-06-02 11:42:13,6.20,-73.989975,40.750702,1,N,-74.014153,40.712116,2,22.5,0.0,0.5,0.00,0.00,0.3,23.30
1,4,2015-06-01 00:01:35,2015-06-01 00:08:09,1.90,-74.005707,40.740623,1,N,-73.986244,40.758972,2,7.5,0.5,0.5,0.00,0.00,0.3,8.80
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2,1,2015-06-02 11:30:40,2015-06-02 11:47:26,1.80,-74.005478,40.727409,1,N,-73.990883,40.737400,2,11.5,0.0,0.5,0.00,0.00,0.3,12.30
2,1,2015-06-01 00:00:00,2015-06-01 00:00:00,0.90,-73.984428,40.737209,1,N,-73.979935,40.749088,1,11.5,1.0,0.5,2.00,0.00,0.3,15.30
2,1,2015-06-02 11:31:32,2015-06-02 11:49:03,2.94,-73.954582,40.789440,1,N,-73.988762,40.773895,1,13.5,0.0,0.5,1.50,0.00,0.3,15.80
2,1,2015-06-02 11:24:29,2015-06-02 11:32:24,0.81,-73.984360,40.732327,1,N,-73.976448,40.740192,2,6.5,0.0,0.5,0.00,0.00,0.3,7.30
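
The effect of the list of booleans is easier to see on a frame small enough to read in full. A sketch with made-up index values, where every `(VendorID, passenger_count)` pair is unique so the resulting order is unambiguous:

```python
import pandas as pd

# Hypothetical toy data with a two-level index
toy = pd.DataFrame(
    {"total_amount": [10.0, 20.0, 30.0, 40.0, 50.0]},
    index=pd.MultiIndex.from_tuples(
        [(2, 1), (1, 5), (1, 1), (2, 6), (1, 4)],
        names=["VendorID", "passenger_count"],
    ),
)

# One boolean per level, in the same order as the levels we name:
# VendorID ascending, passenger_count descending
mixed = toy.sort_index(
    level=["VendorID", "passenger_count"],
    ascending=[True, False],
)
print(list(mixed.index))
```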


In [244]:
help(df.sort_index)

Help on method sort_index in module pandas.core.frame:

sort_index(
    *,
    axis: 'Axis' = 0,
    level: 'IndexLabel | None' = None,
    ascending: 'bool | Sequence[bool]' = True,
    inplace: 'bool' = False,
    kind: 'SortKind' = 'quicksort',
    na_position: 'NaPosition' = 'last',
    sort_remaining: 'bool' = True,
    ignore_index: 'bool' = False,
    key: 'IndexKeyFunc | None' = None
) -> 'DataFrame | None' method of pandas.core.frame.DataFrame instance
    Sort object by labels (along an axis).

    Returns a new DataFrame sorted by label if `inplace` argument is
    ``False``, otherwise updates the original DataFrame and returns None.

    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        The axis along which to sort.  The value 0 identifies the rows,
        and 1 identifies the columns.
    level : int or level name or list of ints or list of level names
        If not None, sort on values in specified index level(s).
    ascending : boo

In [245]:
# there's another way to sort, namely by the values!
# when we invoke sort_values, we need to provide one or more columns that will determine the ordering

df.sort_values('trip_distance')   # return a new data frame whose rows are ordered by trip_distance, from smallest to largest

Unnamed: 0_level_0,Unnamed: 1_level_0,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
VendorID,passenger_count,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
2,5,2015-06-02 11:20:46,2015-06-02 11:23:52,0.00,-73.971809,40.764050,1,N,-73.966827,40.770039,1,4.00,0.0,0.5,1.00,0.00,0.3,5.80
2,1,2015-06-04 15:23:02,2015-06-04 15:23:05,0.00,0.000000,0.000000,5,N,-73.982674,40.771660,1,60.00,0.0,0.5,12.16,0.00,0.3,72.96
1,1,2015-06-02 11:32:53,2015-06-02 11:33:15,0.00,-73.997971,40.730198,1,N,-73.997643,40.730045,3,2.50,0.0,0.5,0.00,0.00,0.3,3.30
2,1,2015-06-01 00:08:03,2015-06-01 00:08:08,0.00,-73.962463,40.755138,2,N,-73.962463,40.755138,1,52.00,0.0,0.5,10.56,0.00,0.3,63.36
2,1,2015-06-02 11:24:27,2015-06-02 11:35:33,0.00,-73.979347,40.752792,1,N,-73.970490,40.764420,1,8.00,0.0,0.5,2.20,0.00,0.3,11.00
2,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2,1,2015-06-04 15:17:25,2015-06-04 17:05:42,34.84,-73.787354,40.641670,5,N,-74.177376,40.690781,2,120.00,0.0,0.0,0.00,17.29,0.3,137.59
2,1,2015-06-02 11:21:03,2015-06-02 12:16:47,35.51,-73.789169,40.647758,3,N,-74.176750,40.662647,1,112.00,0.0,0.0,0.00,22.83,0.3,135.13
1,1,2015-06-01 00:02:42,2015-06-01 00:03:38,37.20,-73.550156,41.043472,5,N,-73.550102,41.043495,1,184.00,0.0,0.0,20.00,5.84,0.3,210.14
1,1,2015-06-01 00:04:50,2015-06-01 01:31:44,60.30,-73.994415,40.750603,5,N,-73.420250,41.137344,1,150.00,0.0,0.0,0.00,9.75,0.3,160.05


In [246]:
# of course, we can say ascending=False

df.sort_values('trip_distance', ascending=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
VendorID,passenger_count,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
1,1,2015-06-01 00:00:58,2015-06-01 01:22:05,64.60,0.000000,0.000000,5,N,0.000000,0.000000,2,69.66,0.0,0.0,0.0,10.00,0.3,79.96
1,1,2015-06-01 00:04:50,2015-06-01 01:31:44,60.30,-73.994415,40.750603,5,N,-73.420250,41.137344,1,150.00,0.0,0.0,0.0,9.75,0.3,160.05
1,1,2015-06-01 00:02:42,2015-06-01 00:03:38,37.20,-73.550156,41.043472,5,N,-73.550102,41.043495,1,184.00,0.0,0.0,20.0,5.84,0.3,210.14
2,1,2015-06-02 11:21:03,2015-06-02 12:16:47,35.51,-73.789169,40.647758,3,N,-74.176750,40.662647,1,112.00,0.0,0.0,0.0,22.83,0.3,135.13
2,1,2015-06-04 15:17:25,2015-06-04 17:05:42,34.84,-73.787354,40.641670,5,N,-74.177376,40.690781,2,120.00,0.0,0.0,0.0,17.29,0.3,137.59
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1,1,2015-06-01 00:05:05,2015-06-01 00:05:11,0.00,-73.789810,40.643852,1,N,-73.789612,40.643597,2,2.50,0.5,0.5,0.0,0.00,0.3,3.80
2,4,2015-06-02 11:33:06,2015-06-02 11:33:15,0.00,-73.901489,40.763783,3,N,-73.901489,40.763779,2,20.00,0.0,0.0,0.0,0.00,0.3,20.30
1,1,2015-06-04 15:21:29,2015-06-04 15:22:54,0.00,-73.952393,40.768887,1,N,-73.952187,40.769089,2,3.00,0.0,0.5,0.0,0.00,0.3,3.80
2,1,2015-06-01 00:00:46,2015-06-01 00:00:51,0.00,0.000000,0.000000,1,N,0.000000,0.000000,2,2.50,0.5,0.5,0.0,0.00,0.3,3.80


In [250]:
# can we sort by more than one column? Yes, exactly as we saw for sort_index -- pass a list of columns,
# and then (optionally) a list of booleans for ascending

df.sort_values(['payment_type', 'trip_distance'])

Unnamed: 0_level_0,Unnamed: 1_level_0,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
VendorID,passenger_count,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
1,1,2015-06-04 15:18:40,2015-06-04 15:19:01,0.00,-73.991425,40.760403,5,N,-73.991425,40.760403,1,20.8,0.0,0.0,4.20,0.0,0.3,25.30
1,1,2015-06-04 15:17:21,2015-06-04 15:17:27,0.00,-73.861870,40.768280,2,N,-73.861870,40.768291,1,52.0,0.0,0.5,0.00,0.0,0.3,52.80
1,1,2015-06-04 15:20:27,2015-06-04 15:20:29,0.00,-74.004616,40.742172,1,N,-74.004616,40.742172,1,2.5,0.0,0.5,6.08,0.0,0.3,9.38
1,1,2015-06-04 15:20:26,2015-06-04 15:21:28,0.00,-73.790359,40.643848,5,N,-73.790375,40.643837,1,250.0,0.0,0.0,0.00,0.0,0.3,250.30
1,1,2015-06-01 00:01:38,2015-06-01 00:01:53,0.00,-74.005486,40.711548,2,N,-74.005661,40.711765,1,52.0,0.0,0.5,10.55,0.0,0.3,63.35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2,1,2015-06-04 15:17:15,2015-06-04 15:26:40,0.89,-73.962166,40.773739,1,N,-73.962173,40.764603,4,-7.0,0.0,-0.5,0.00,0.0,-0.3,-7.80
1,1,2015-06-02 11:29:49,2015-06-02 11:34:43,0.90,-73.973488,40.738007,1,N,-73.968666,40.750500,4,5.5,0.0,0.5,0.00,0.0,0.3,6.30
1,1,2015-06-02 11:28:40,2015-06-02 11:42:56,1.50,-73.973808,40.764301,1,N,-73.962021,40.779594,4,10.5,0.0,0.5,0.00,0.0,0.3,11.30
1,3,2015-06-06 16:53:42,2015-06-06 17:22:41,3.60,-73.983917,40.725243,1,N,-73.967682,40.768555,4,20.0,0.0,0.5,0.00,0.0,0.3,20.80


In [249]:
df.columns

Index(['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_distance',
       'pickup_longitude', 'pickup_latitude', 'RateCodeID',
       'store_and_fwd_flag', 'dropoff_longitude', 'dropoff_latitude',
       'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount',
       'tolls_amount', 'improvement_surcharge', 'total_amount'],
      dtype='object')

In [251]:
# why sort the values? Most often, you don't need to do so.
# however, if you want the rows with the n smallest (or n largest) values in some column, then sorting and taking the head is one way

df.sort_values('trip_distance').head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
VendorID,passenger_count,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
2,5,2015-06-02 11:20:46,2015-06-02 11:23:52,0.0,-73.971809,40.76405,1,N,-73.966827,40.770039,1,4.0,0.0,0.5,1.0,0.0,0.3,5.8
2,1,2015-06-04 15:23:02,2015-06-04 15:23:05,0.0,0.0,0.0,5,N,-73.982674,40.77166,1,60.0,0.0,0.5,12.16,0.0,0.3,72.96
1,1,2015-06-02 11:32:53,2015-06-02 11:33:15,0.0,-73.997971,40.730198,1,N,-73.997643,40.730045,3,2.5,0.0,0.5,0.0,0.0,0.3,3.3
2,1,2015-06-01 00:08:03,2015-06-01 00:08:08,0.0,-73.962463,40.755138,2,N,-73.962463,40.755138,1,52.0,0.0,0.5,10.56,0.0,0.3,63.36
2,1,2015-06-02 11:24:27,2015-06-02 11:35:33,0.0,-73.979347,40.752792,1,N,-73.97049,40.76442,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
2,1,2015-06-01 00:07:46,2015-06-01 00:07:51,0.0,0.0,0.0,2,N,0.0,0.0,1,52.0,0.0,0.5,13.2,0.0,0.3,66.0
2,5,2015-06-04 15:16:06,2015-06-04 15:16:17,0.0,-73.782143,40.644672,1,N,0.0,0.0,2,2.5,0.0,0.5,0.0,0.0,0.3,3.3
1,1,2015-06-02 11:19:46,2015-06-02 12:26:33,0.0,-73.9832,40.766949,1,N,-73.99041,40.766872,2,2.5,0.0,0.5,0.0,0.0,0.3,3.3
2,5,2015-06-02 11:26:50,2015-06-02 11:35:21,0.0,-73.966927,40.771549,1,N,-73.98317,40.774311,2,6.0,0.0,0.5,0.0,0.0,0.3,6.8
1,1,2015-06-02 11:23:08,2015-06-02 11:23:11,0.0,-74.005409,40.750797,1,N,-74.005417,40.750797,1,2.5,0.0,0.5,3.0,0.0,0.3,6.3


In [252]:
# but there is another way to do it!
# (many trips tie at 0.0, so the ten rows chosen here can differ from the sort_values version)
df.nsmallest(n=10, columns='trip_distance')

Unnamed: 0_level_0,Unnamed: 1_level_0,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
VendorID,passenger_count,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
1,1,2015-06-04 15:19:12,2015-06-04 15:20:50,0.0,-74.00647,40.740452,1,N,-74.007622,40.740696,3,3.0,0.0,0.5,0.0,0.0,0.3,3.8
1,1,2015-06-04 15:18:40,2015-06-04 15:19:01,0.0,-73.991425,40.760403,5,N,-73.991425,40.760403,1,20.8,0.0,0.0,4.2,0.0,0.3,25.3
1,1,2015-06-04 15:17:21,2015-06-04 15:17:27,0.0,-73.86187,40.76828,2,N,-73.86187,40.768291,1,52.0,0.0,0.5,0.0,0.0,0.3,52.8
1,1,2015-06-04 15:21:29,2015-06-04 15:22:54,0.0,-73.952393,40.768887,1,N,-73.952187,40.769089,2,3.0,0.0,0.5,0.0,0.0,0.3,3.8
1,1,2015-06-04 15:19:59,2015-06-04 15:20:02,0.0,-73.951973,40.71616,1,N,-73.951973,40.71616,3,2.5,0.0,0.5,0.0,0.0,0.3,3.3
1,1,2015-06-04 15:20:27,2015-06-04 15:20:29,0.0,-74.004616,40.742172,1,N,-74.004616,40.742172,1,2.5,0.0,0.5,6.08,0.0,0.3,9.38
1,1,2015-06-04 15:20:26,2015-06-04 15:21:28,0.0,-73.790359,40.643848,5,N,-73.790375,40.643837,1,250.0,0.0,0.0,0.0,0.0,0.3,250.3
1,1,2015-06-01 00:02:33,2015-06-01 00:07:15,0.0,-73.975449,40.681915,1,N,-73.974785,40.681664,3,4.5,0.5,0.5,0.0,0.0,0.3,5.8
1,1,2015-06-01 00:02:29,2015-06-01 00:02:42,0.0,-73.781929,40.644714,1,N,-73.781914,40.644722,2,2.5,0.5,0.5,0.0,0.0,0.3,3.8
1,1,2015-06-01 00:01:38,2015-06-01 00:01:53,0.0,-74.005486,40.711548,2,N,-74.005661,40.711765,1,52.0,0.0,0.5,10.55,0.0,0.3,63.35


In [253]:
df.nlargest(n=10, columns='trip_distance')

Unnamed: 0_level_0,Unnamed: 1_level_0,tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
VendorID,passenger_count,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
1,1,2015-06-01 00:00:58,2015-06-01 01:22:05,64.6,0.0,0.0,5,N,0.0,0.0,2,69.66,0.0,0.0,0.0,10.0,0.3,79.96
1,1,2015-06-01 00:04:50,2015-06-01 01:31:44,60.3,-73.994415,40.750603,5,N,-73.42025,41.137344,1,150.0,0.0,0.0,0.0,9.75,0.3,160.05
1,1,2015-06-01 00:02:42,2015-06-01 00:03:38,37.2,-73.550156,41.043472,5,N,-73.550102,41.043495,1,184.0,0.0,0.0,20.0,5.84,0.3,210.14
2,1,2015-06-02 11:21:03,2015-06-02 12:16:47,35.51,-73.789169,40.647758,3,N,-74.17675,40.662647,1,112.0,0.0,0.0,0.0,22.83,0.3,135.13
2,1,2015-06-04 15:17:25,2015-06-04 17:05:42,34.84,-73.787354,40.64167,5,N,-74.177376,40.690781,2,120.0,0.0,0.0,0.0,17.29,0.3,137.59
2,1,2015-06-01 00:01:19,2015-06-01 00:40:12,32.4,-73.781425,40.644905,2,N,-73.974174,40.731441,1,52.0,0.0,0.5,10.56,0.0,0.3,63.36
1,1,2015-06-02 11:28:58,2015-06-02 12:13:29,32.1,-73.873085,40.774124,4,N,-73.957283,41.098221,1,129.0,0.0,0.5,27.05,5.54,0.3,162.39
1,1,2015-06-01 00:00:13,2015-06-01 00:41:05,31.9,-73.875206,40.770382,5,N,-73.549629,41.043552,1,210.0,0.0,0.0,42.05,0.0,0.3,252.35
1,1,2015-06-01 00:09:14,2015-06-01 01:03:11,31.5,-73.802437,40.677372,5,N,-74.255424,40.745316,2,140.0,0.0,0.0,0.0,9.75,0.3,150.05
2,2,2015-06-01 00:00:16,2015-06-01 00:40:35,29.78,-73.781853,40.644711,2,N,-74.006905,40.707958,1,52.0,0.0,0.5,17.5,5.54,0.3,75.84
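
A minimal sketch of how `nsmallest` and `nlargest` relate to sorting, using a tiny made-up frame (the `trips` data below is hypothetical, not the taxi file). Both approaches should return the same values, although rows that tie on `trip_distance` may come back in a different order.

```python
import pandas as pd

# Hypothetical miniature stand-in for the taxi data
trips = pd.DataFrame({
    'trip_distance': [2.5, 0.0, 64.6, 1.1, 0.0, 8.3],
    'total_amount':  [12.3, 3.3, 79.96, 6.8, 52.8, 30.1],
})

# Two ways to get the 3 shortest trips
via_sort = trips.sort_values('trip_distance').head(3)
via_nsmallest = trips.nsmallest(n=3, columns='trip_distance')

print(via_sort['trip_distance'].tolist())
print(via_nsmallest['trip_distance'].tolist())

# nlargest is the mirror image: the 2 longest trips, longest first
longest = trips.nlargest(n=2, columns='trip_distance')
print(longest)
```
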


# Exercise: Sorting indexes and values

1. Load the Titanic data into a data frame, and make `pclass` and `sex` the index (in a two-part multi-index).
2. Find all of the people from `pclass` 1-2, using a slice.
3. Find all of the people in `pclass` 2 whose `sex` is `male`, without getting a warning.
4. What are the 5 largest fares that people paid?
5. What are the 10 smallest fares that they paid?

In [274]:
df = pd.read_csv('titanic3.csv',
                 index_col=['pclass', 'sex'])
df

Unnamed: 0_level_0,Unnamed: 1_level_0,survived,name,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
pclass,sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1.0,female,1.0,"Allen, Miss. Elisabeth Walton",29.0000,0.0,0.0,24160,211.3375,B5,S,2,,"St Louis, MO"
1.0,male,1.0,"Allison, Master. Hudson Trevor",0.9167,1.0,2.0,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
1.0,female,0.0,"Allison, Miss. Helen Loraine",2.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
1.0,male,0.0,"Allison, Mr. Hudson Joshua Creighton",30.0000,1.0,2.0,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
1.0,female,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",25.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3.0,female,0.0,"Zabour, Miss. Thamine",,1.0,0.0,2665,14.4542,,C,,,
3.0,male,0.0,"Zakarian, Mr. Mapriededer",26.5000,0.0,0.0,2656,7.2250,,C,,304.0,
3.0,male,0.0,"Zakarian, Mr. Ortin",27.0000,0.0,0.0,2670,7.2250,,C,,,
3.0,male,0.0,"Zimmerman, Mr. Leo",29.0000,0.0,0.0,315082,7.8750,,S,,,


In [255]:
df.dtypes

survived     float64
name          object
age          float64
sibsp        float64
parch        float64
ticket        object
fare         float64
cabin         object
embarked      object
boat          object
body         float64
home.dest     object
dtype: object

In [256]:
df.index

MultiIndex([(1.0, 'female'),
            (1.0,   'male'),
            (1.0, 'female'),
            (1.0,   'male'),
            (1.0, 'female'),
            (1.0,   'male'),
            (1.0, 'female'),
            (1.0,   'male'),
            (1.0, 'female'),
            (1.0,   'male'),
            ...
            (3.0, 'female'),
            (3.0,   'male'),
            (3.0,   'male'),
            (3.0,   'male'),
            (3.0, 'female'),
            (3.0, 'female'),
            (3.0,   'male'),
            (3.0,   'male'),
            (3.0,   'male'),
            (nan,      nan)],
           names=['pclass', 'sex'], length=1310)

In [259]:
df.index.dtypes

pclass    float64
sex        object
dtype: object

In [270]:
# Find all of the people from pclass 1-2, using a slice.

# slicing a multi-index with .loc requires the index to be sorted first
df = df.sort_index()



In [276]:
df = df.iloc[:-1]   # remove the final row (the all-NaN one) from our data frame

In [279]:
help(df.sort_index)

Help on method sort_index in module pandas.core.frame:

sort_index(
    *,
    axis: 'Axis' = 0,
    level: 'IndexLabel | None' = None,
    ascending: 'bool | Sequence[bool]' = True,
    inplace: 'bool' = False,
    kind: 'SortKind' = 'quicksort',
    na_position: 'NaPosition' = 'last',
    sort_remaining: 'bool' = True,
    ignore_index: 'bool' = False,
    key: 'IndexKeyFunc | None' = None
) -> 'DataFrame | None' method of pandas.core.frame.DataFrame instance
    Sort object by labels (along an axis).

    Returns a new DataFrame sorted by label if `inplace` argument is
    ``False``, otherwise updates the original DataFrame and returns None.

    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        The axis along which to sort.  The value 0 identifies the rows,
        and 1 identifies the columns.
    level : int or level name or list of ints or list of level names
        If not None, sort on values in specified index level(s).
    ascending : boo

In [280]:
df.isna().sum()

survived        0
name            0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

In [286]:
# a better approach: drop the all-NaN row right after loading, before setting and sorting the index

df = pd.read_csv('titanic3.csv')
df.isna().sum()
print(df.shape)
df = df.dropna(thresh=1)   # keep any row in which we have at least one non-NaN value
print(df.shape)
df = df.set_index(['pclass', 'sex'])
df = df.sort_index()

(1310, 14)
(1309, 14)
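
To see what `thresh=1` does in isolation, here is a small sketch on a made-up three-row frame (the `small` data is hypothetical). It assumes the goal is only to discard rows that are entirely empty: `thresh=1` keeps any row with at least one non-NaN value, which behaves like `how='all'`, while a bare `dropna()` removes every row that has any NaN at all.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one complete row, one partial row, one all-NaN row
small = pd.DataFrame({
    'pclass': [1.0, 3.0, np.nan],
    'sex': ['female', 'male', np.nan],
    'fare': [211.3375, np.nan, np.nan],
})

kept = small.dropna(thresh=1)     # needs at least 1 non-NaN value per row
same = small.dropna(how='all')    # drops rows only when every value is NaN
strict = small.dropna()           # drops rows containing any NaN

print(kept.shape, same.shape, strict.shape)
```

On the Titanic data a bare `dropna()` would be far too aggressive, since columns such as `cabin` and `body` are mostly missing.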


In [287]:
df.loc[1:2]

Unnamed: 0_level_0,Unnamed: 1_level_0,survived,name,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
pclass,sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1.0,female,1.0,"Allen, Miss. Elisabeth Walton",29.0,0.0,0.0,24160,211.3375,B5,S,2,,"St Louis, MO"
1.0,female,0.0,"Allison, Miss. Helen Loraine",2.0,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
1.0,female,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",25.0,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
1.0,female,1.0,"Andrews, Miss. Kornelia Theodosia",63.0,1.0,0.0,13502,77.9583,D7,S,10,,"Hudson, NY"
1.0,female,1.0,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",53.0,2.0,0.0,11769,51.4792,C101,S,D,,"Bayside, Queens, NY"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2.0,male,0.0,"West, Mr. Edwy Arthur",36.0,1.0,2.0,C.A. 34651,27.7500,,S,,,"Bournmouth, England"
2.0,male,0.0,"Wheadon, Mr. Edward H",66.0,0.0,0.0,C.A. 24579,10.5000,,S,,,"Guernsey, England / Edgewood, RI"
2.0,male,0.0,"Wheeler, Mr. Edwin ""Frederick""",,0.0,0.0,SC/PARIS 2159,12.8750,,S,,,
2.0,male,1.0,"Wilhelms, Mr. Charles",31.0,0.0,0.0,244270,13.0000,,S,9,,"London, England"


In [288]:
# Find all of the people in class 2 and sex of male, without getting a warning.

df.loc[(2, 'male')]

Unnamed: 0_level_0,Unnamed: 1_level_0,survived,name,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
pclass,sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2.0,male,0.0,"Abelson, Mr. Samuel",30.0,1.0,0.0,P/PP 3381,24.000,,C,,,"Russia New York, NY"
2.0,male,0.0,"Aldworth, Mr. Charles Augustus",30.0,0.0,0.0,248744,13.000,,S,,,"Bryn Mawr, PA, USA"
2.0,male,0.0,"Andrew, Mr. Edgardo Samuel",18.0,0.0,0.0,231945,11.500,,S,,,"Buenos Aires, Argentina / New Jersey, NJ"
2.0,male,0.0,"Andrew, Mr. Frank Thomas",25.0,0.0,0.0,C.A. 34050,10.500,,S,,,"Cornwall, England Houghton, MI"
2.0,male,0.0,"Angle, Mr. William A",34.0,1.0,0.0,226875,26.000,,S,,,"Warwick, England"
2.0,...,...,...,...,...,...,...,...,...,...,...,...,...
2.0,male,0.0,"West, Mr. Edwy Arthur",36.0,1.0,2.0,C.A. 34651,27.750,,S,,,"Bournmouth, England"
2.0,male,0.0,"Wheadon, Mr. Edward H",66.0,0.0,0.0,C.A. 24579,10.500,,S,,,"Guernsey, England / Edgewood, RI"
2.0,male,0.0,"Wheeler, Mr. Edwin ""Frederick""",,0.0,0.0,SC/PARIS 2159,12.875,,S,,,
2.0,male,1.0,"Wilhelms, Mr. Charles",31.0,0.0,0.0,244270,13.000,,S,9,,"London, England"
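
Both retrievals above rely on the multi-index being sorted. The following sketch, on a made-up frame (the `people` data is hypothetical), checks the index's sort state with `is_monotonic_increasing` before and after `sort_index`, then repeats the slice and the tuple lookup. On an unsorted multi-index, pandas typically refuses the slice and warns about performance on the tuple lookup, which is why the sort comes first.

```python
import pandas as pd

# Hypothetical miniature version of the Titanic frame, deliberately out of order
people = pd.DataFrame({
    'pclass': [3, 1, 2, 1, 2, 3, 2],
    'sex': ['male', 'female', 'male', 'male', 'female', 'female', 'male'],
    'fare': [7.25, 211.34, 13.0, 151.55, 26.0, 14.45, 10.5],
}).set_index(['pclass', 'sex'])

sorted_before = people.index.is_monotonic_increasing   # index starts out unsorted
people = people.sort_index()
sorted_after = people.index.is_monotonic_increasing    # sorted after sort_index

# With a sorted index, a slice on the outer level works...
first_two_classes = people.loc[1:2]

# ...and a tuple selects on both levels at once
class2_men = people.loc[(2, 'male')]

print(sorted_before, sorted_after)
print(first_two_classes)
print(class2_men)
```
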


In [289]:
# What are the 5 largest fares that people paid?

# careful: sort_values puts NaN last, so tail(5) also picks up a row whose fare is missing
df.sort_values('fare').tail(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,survived,name,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
pclass,sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1.0,female,1.0,"Cardeza, Mrs. James Warburton Martinez (Charlo...",58.0,0.0,1.0,PC 17755,512.3292,B51 B53 B55,C,3.0,,"Germantown, Philadelphia, PA"
1.0,female,1.0,"Ward, Miss. Anna",35.0,0.0,0.0,PC 17755,512.3292,,C,3.0,,
1.0,male,1.0,"Lesurer, Mr. Gustave J",35.0,0.0,0.0,PC 17755,512.3292,B101,C,3.0,,
1.0,male,1.0,"Cardeza, Mr. Thomas Drake Martinez",36.0,0.0,1.0,PC 17755,512.3292,B51 B53 B55,C,3.0,,"Austria-Hungary / Germantown, Philadelphia, PA"
3.0,male,0.0,"Storey, Mr. Thomas",60.5,0.0,0.0,3701,,,S,,261.0,


In [290]:
# nlargest skips rows whose fare is NaN and returns the largest values first
df.nlargest(n=5, columns='fare')

Unnamed: 0_level_0,Unnamed: 1_level_0,survived,name,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
pclass,sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1.0,female,1.0,"Cardeza, Mrs. James Warburton Martinez (Charlo...",58.0,0.0,1.0,PC 17755,512.3292,B51 B53 B55,C,3,,"Germantown, Philadelphia, PA"
1.0,female,1.0,"Ward, Miss. Anna",35.0,0.0,0.0,PC 17755,512.3292,,C,3,,
1.0,male,1.0,"Cardeza, Mr. Thomas Drake Martinez",36.0,0.0,1.0,PC 17755,512.3292,B51 B53 B55,C,3,,"Austria-Hungary / Germantown, Philadelphia, PA"
1.0,male,1.0,"Lesurer, Mr. Gustave J",35.0,0.0,0.0,PC 17755,512.3292,B101,C,3,,
1.0,female,1.0,"Fortune, Miss. Alice Elizabeth",24.0,3.0,2.0,19950,263.0,C23 C25 C27,S,10,,"Winnipeg, MB"
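
The difference between the last two cells comes down to how missing values are handled. A minimal sketch on made-up data (the `fares` frame is hypothetical): `sort_values` places NaN at the end by default, so `tail` can return it, whereas `nlargest` leaves NaN out unless there are too few real values. Dropping NaN before sorting is one way to make the sort-based version behave.

```python
import numpy as np
import pandas as pd

# Hypothetical fares, one of them missing
fares = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'fare': [512.33, np.nan, 263.0, 7.25],
})

# NaN sorts last by default, so it lands in the tail
by_sort = fares.sort_values('fare').tail(2)

# nlargest returns the two biggest real values, largest first
by_nlargest = fares.nlargest(n=2, columns='fare')

# Dropping missing fares first fixes the sort-based approach
by_sort_fixed = fares.dropna(subset=['fare']).sort_values('fare').tail(2)

print(by_sort)
print(by_nlargest)
print(by_sort_fixed)
```
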


In [291]:
# What are the 10 smallest fares that they paid?

df.nsmallest(n=10, columns='fare')

Unnamed: 0_level_0,Unnamed: 1_level_0,survived,name,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
pclass,sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1.0,male,0.0,"Andrews, Mr. Thomas Jr",39.0,0.0,0.0,112050,0.0,A36,S,,,"Belfast, NI"
1.0,male,0.0,"Chisholm, Mr. Roderick Robert Crispin",,0.0,0.0,112051,0.0,,S,,,"Liverpool, England / Belfast"
1.0,male,0.0,"Fry, Mr. Richard",,0.0,0.0,112058,0.0,B102,S,,,
1.0,male,0.0,"Harrison, Mr. William",40.0,0.0,0.0,112059,0.0,B94,S,,110.0,
1.0,male,1.0,"Ismay, Mr. Joseph Bruce",49.0,0.0,0.0,112058,0.0,B52 B54 B56,S,C,,Liverpool
1.0,male,0.0,"Parr, Mr. William Henry Marsh",,0.0,0.0,112052,0.0,,S,,,Belfast
1.0,male,0.0,"Reuchlin, Jonkheer. John George",38.0,0.0,0.0,19972,0.0,,S,,,"Rotterdam, Netherlands"
2.0,male,0.0,"Campbell, Mr. William",,0.0,0.0,239853,0.0,,S,,,Belfast
2.0,male,0.0,"Cunningham, Mr. Alfred Fleming",,0.0,0.0,239853,0.0,,S,,,Belfast
2.0,male,0.0,"Frost, Mr. Anthony Wood ""Archie""",,0.0,0.0,239854,0.0,,S,,,Belfast


In [None]:
# method chaining -- take advantage of Python's syntax to do it all at once

# let's rewrite the following using method chaining:

# df = pd.read_csv('titanic3.csv')
# df = df.dropna(thresh=1)   # keep any row in which we have at least one non-NaN value
# df = df.set_index(['pclass', 'sex'])
# df = df.sort_index()

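df = (
    pd
    .read_csv('titanic3.csv')
    .dropna(thresh=1)                # keep any row with at least one non-NaN value
    .set_index(['pclass', 'sex'])
    .sort_index()
)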
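
To check that a chain gives the same result as the step-by-step version without needing the file, here is a sketch that reads from an in-memory string instead of `titanic3.csv`. The `csv_text` contents are made up, including a trailing row of empty fields that stands in for the all-NaN row in the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for titanic3.csv, ending with a row of empty fields
csv_text = """pclass,sex,fare
3,male,7.25
1,female,211.34
2,male,13.0
,,
"""

# Step by step, reassigning at each stage
step_by_step = pd.read_csv(io.StringIO(csv_text))
step_by_step = step_by_step.dropna(thresh=1)
step_by_step = step_by_step.set_index(['pclass', 'sex'])
step_by_step = step_by_step.sort_index()

# The same four steps as a single chained expression
chained = (
    pd
    .read_csv(io.StringIO(csv_text))
    .dropna(thresh=1)
    .set_index(['pclass', 'sex'])
    .sort_index()
)

print(chained)
print(chained.equals(step_by_step))
```

Wrapping the chain in parentheses is what lets each method call sit on its own line, so individual steps can be commented out while experimenting.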