# Agenda

- Data frames
    - Creating
    - Retrieving
    - Methods (series methods and special data frame methods)
- Loading data
    - CSV
    - Excel
    - Downloading from the Internet (retrieving files and scraping sites)

In [2]:
import pandas as pd
from pandas import Series, DataFrame

In [3]:
# create a series based on any iterable in Python (usually a list or a NumPy array)
# we can pass additional keyword arguments, especially index and dtype

# Data frames are similar; we can pass a 2D set of values (a list of lists or a 2D NumPy array), and
# also index, and also columns -- the names of the columns we want

In [4]:
df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90],
                [100, 110, 120]])
df

Unnamed: 0,0,1,2
0,10,20,30
1,40,50,60
2,70,80,90
3,100,110,120


In [5]:
type(df)

pandas.core.frame.DataFrame

# Our basic data frame

- A data frame has rows and columns
- The rows are identified with the index (i.e., the same index that we had on our series)
- The columns are identified with names that we just call "columns"

The key thing to remember: Each column is a Pandas series!

- I can retrieve a row with `.loc` or `.iloc`, as before
- I can retrieve a column with `[]` and the column's name

In [6]:
df.loc[1]

0    40
1    50
2    60
Name: 1, dtype: int64

In [7]:
df[1]  # this will retrieve the column

0     20
1     50
2     80
3    110
Name: 1, dtype: int64

In [8]:
%timeit df.loc[1]

11.4 μs ± 129 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [9]:
%timeit df.iloc[1]

10.3 μs ± 120 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [10]:
%timeit df[1]

1.33 μs ± 17.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


In [11]:
# behind the scenes of our data frame is actually a 2D NumPy array (in theory)

df.values

array([[ 10,  20,  30],
       [ 40,  50,  60],
       [ 70,  80,  90],
       [100, 110, 120]])

In [12]:
# we can retrieve in all of the ways we've already seen!

df.loc[[1, 3]]  # fancy indexing

Unnamed: 0,0,1,2
1,40,50,60
3,100,110,120


In [13]:
df[[0, 1]]  # here, we retrieve two columns

Unnamed: 0,0,1
0,10,20
1,40,50
2,70,80
3,100,110


In [14]:
# if I retrieve one row, or one column, then I get a series
df[1]

0     20
1     50
2     80
3    110
Name: 1, dtype: int64

In [15]:
df

Unnamed: 0,0,1,2
0,10,20,30
1,40,50,60
2,70,80,90
3,100,110,120


In [16]:
# We can name the index and the columns via keyword arguments when we create the data frame

df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90],
                [100, 110, 120]],
              index=list('abcd'),
              columns=list('xyz'))
df

Unnamed: 0,x,y,z
a,10,20,30
b,40,50,60
c,70,80,90
d,100,110,120


- Row names (just like the index in a series) can repeat, and that's even useful!
- Column names should not repeat (think of the keys in a dict); Pandas allows duplicates, but they lead to confusing results

You can even think of a data frame as a dict of series, where the keys are the column names and the values are Pandas series objects.
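
The dict analogy can be made concrete with a small sketch. It rebuilds the frame above and uses only dict-style operations on it; the variable names are illustrative.

```python
from pandas import DataFrame, Series

df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90],
                [100, 110, 120]],
               index=list('abcd'),
               columns=list('xyz'))

# iterating over a data frame yields its column names, like iterating over a dict's keys
column_names = list(df)

# membership tests check the column names, not the values or the index
has_x = 'x' in df
has_a = 'a' in df      # 'a' is a row label, so this is False

# .items() yields (column name, series) pairs, like dict.items()
for name, column in df.items():
    print(name, type(column).__name__, column.sum())
```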

In [17]:
df.loc['b']

x    40
y    50
z    60
Name: b, dtype: int64

In [18]:
df.iloc[2]

x    70
y    80
z    90
Name: c, dtype: int64

In [19]:
df['x']

a     10
b     40
c     70
d    100
Name: x, dtype: int64

In [20]:
df.loc[['a', 'b']]

Unnamed: 0,x,y,z
a,10,20,30
b,40,50,60


In [21]:
df[['x', 'z']]

Unnamed: 0,x,z
a,10,30
b,40,60
c,70,90
d,100,120


In [22]:
# if I retrieve a series, then I can run a method on it

df['x'].mean()

np.float64(55.0)

In [23]:
df['x'].describe()

count      4.000000
mean      55.000000
std       38.729833
min       10.000000
25%       32.500000
50%       55.000000
75%       77.500000
max      100.000000
Name: x, dtype: float64

In [25]:
# I can do that for the rows

df.loc['b'].describe()

count     3.0
mean     50.0
std      10.0
min      40.0
25%      45.0
50%      50.0
75%      55.0
max      60.0
Name: b, dtype: float64

As a general rule, any method that we can run on a series can also be run on a data frame. We'll get one result per column, and the results will either be a series (if we get one value per column) or a data frame (if we get a series back from the method).

In [26]:
df.mean()   # this will calculate the mean for each column

x    55.0
y    65.0
z    75.0
dtype: float64

In [27]:
df.describe()   # this will invoke describe on each column

Unnamed: 0,x,y,z
count,4.0,4.0,4.0
mean,55.0,65.0,75.0
std,38.729833,38.729833,38.729833
min,10.0,20.0,30.0
25%,32.5,42.5,52.5
50%,55.0,65.0,75.0
75%,77.5,87.5,97.5
max,100.0,110.0,120.0


In [28]:
df

Unnamed: 0,x,y,z
a,10,20,30
b,40,50,60
c,70,80,90
d,100,110,120


In [30]:
# we've seen how we can retrieve either a row or a column.
# but how can we retrieve an individual value?

df.loc['b'].loc['y']

np.int64(50)

In [31]:
# even worse:

df.loc['b']['y']

np.int64(50)

In [32]:
# better is to use the 2-argument version of .loc, where you specify the row and the column

df.loc['b', 'y']

np.int64(50)

In [33]:
df.sum()

x    220
y    260
z    300
dtype: int64

In [34]:
# what if I want to sum across the rows?

df.sum(axis='columns')

a     60
b    150
c    240
d    330
dtype: int64

In [35]:
df = DataFrame([[10, 20, 30.5],
                [40, 50, 60],
                [70, 80, 90],
                [100, 110, 120]],
              index=list('abcd'),
              columns=list('xyz'))
df

Unnamed: 0,x,y,z
a,10,20,30.5
b,40,50,60.0
c,70,80,90.0
d,100,110,120.0


In [36]:
df.dtypes   # get the dtype of each column

x      int64
y      int64
z    float64
dtype: object

In [37]:
# what happens when we retrieve row 'c'?

df.loc['c']

x    70.0
y    80.0
z    90.0
Name: c, dtype: float64
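
The row comes back as float64 because a series holds a single dtype: when the columns have different dtypes, Pandas picks one that can hold every value in the row. A small sketch of that, rebuilding the same frame:

```python
from pandas import DataFrame

df = DataFrame([[10, 20, 30.5],
                [40, 50, 60],
                [70, 80, 90],
                [100, 110, 120]],
               index=list('abcd'),
               columns=list('xyz'))

# each column keeps its own dtype
print(df.dtypes)

# a row crosses columns, so the ints are upcast to float64 to fit in one series
row = df.loc['c']
print(row.dtype)

# selecting only the integer columns first gives a row that stays integer
int_row = df[['x', 'y']].loc['c']
print(int_row.dtype)
```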

In [38]:
# what if I want to assign back to a value?

# this is a bad way to do it, as you'll see!
df.loc['c'].loc['z'] = 999 

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df.loc['c'].loc['z'] = 999
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc['c'].loc['z'] = 999


In [39]:
df

Unnamed: 0,x,y,z
a,10,20,30.5
b,40,50,60.0
c,70,80,90.0
d,100,110,120.0


In [40]:
# the solution is always to use the two-argument version of .loc,
# which guarantees that we'll assign back to the original data frame

df.loc['c', 'z'] = 999 

In [41]:
df

Unnamed: 0,x,y,z
a,10,20,30.5
b,40,50,60.0
c,70,80,999.0
d,100,110,120.0


# Exercise: Simple data frame

1. Create a 5x5 data frame in which the rows are `abcde` and the columns are `vwxyz`. The values can be random integers from 0-1,000. (You can use a 2D NumPy array to initialize the values.)
2. Retrieve row `b`
3. Retrieve rows `b` and `d`
4. Retrieve rows `b`, `c`, and `d`
5. Retrieve column `w`
6. Retrieve columns `w` and `y`
7. Retrieve columns `w`, `x`, and `y`
8. Retrieve the item at row `e`, column `v`
9. Update the value at row `e`, column `z` to be 123.456

In [42]:
import numpy as np

np.random.randint(0, 1000, 20).reshape(4, 5)

array([[ 92, 855, 155, 443, 455],
       [386, 725, 796, 590, 586],
       [756, 567, 363, 394, 451],
       [449, 357,   3, 378, 714]])

In [43]:
np.random.randint(0, 1000, [4,5])

array([[444, 675, 888, 800, 967],
       [174, 391, 834, 520, 112],
       [976, 750, 342, 754, 295],
       [367, 243, 289, 455,  79]])

In [44]:
np.random.seed(0)

df = DataFrame(np.random.randint(0, 1000, [5,5]),
               index=list('abcde'),
               columns=list('vwxyz'))
df

Unnamed: 0,v,w,x,y,z
a,684,559,629,192,835
b,763,707,359,9,723
c,277,754,804,599,70
d,472,600,396,314,705
e,486,551,87,174,600


In [45]:
# Retrieve row b

df.loc['b']

v    763
w    707
x    359
y      9
z    723
Name: b, dtype: int64

In [46]:
%timeit df.loc['b']

12.2 μs ± 70 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [47]:
%timeit df.iloc[1]

10.6 μs ± 177 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [48]:
# Retrieve rows b and d

df.loc[['b', 'd']]

Unnamed: 0,v,w,x,y,z
b,763,707,359,9,723
d,472,600,396,314,705


In [49]:
# Retrieve rows b, c, and d

df.loc['b':'d']  # use a slice, up to and including (with df.loc)

Unnamed: 0,v,w,x,y,z
b,763,707,359,9,723
c,277,754,804,599,70
d,472,600,396,314,705


In [52]:
# Retrieve column w

%timeit df['w']

1.57 μs ± 10.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


In [53]:
# you can also use a different syntax, "dot syntax," to retrieve a column

%timeit df.w

2.92 μs ± 21 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [54]:
# Retrieve columns w and y

df[['w', 'y']]

Unnamed: 0,w,y
a,559,192
b,707,9
c,754,599
d,600,314
e,551,174


In [55]:
# Retrieve columns w, x, and y

df[['w', 'x', 'y']]

Unnamed: 0,w,x,y
a,559,629,192
b,707,359,9
c,754,804,599
d,600,396,314
e,551,87,174


In [56]:
df['w':'y']  # if you give a slice to [], Pandas looks for that slice ON THE ROWS!

Unnamed: 0,v,w,x,y,z
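
That result is empty because the slice was applied to the row labels, which run from `a` to `e`, so nothing falls between `w` and `y`. To slice columns by label, one option is the two-argument form of `.loc` with `:` as the row selector. A minimal sketch, rebuilding the same frame with the same seed:

```python
import numpy as np
from pandas import DataFrame

np.random.seed(0)
df = DataFrame(np.random.randint(0, 1000, [5, 5]),
               index=list('abcde'),
               columns=list('vwxyz'))

# ':' selects every row; 'w':'y' is a label slice on the columns, inclusive of 'y'
middle = df.loc[:, 'w':'y']
print(middle)

# rows and columns can be sliced together
corner = df.loc['b':'d', 'w':'y']
print(corner)
```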


In [57]:
# Retrieve the item at row e, column v

df.loc[ 'e',    # row selector
        'v']    # column selector


np.int64(486)

In [58]:
df.loc[ ['b', 'e']   # row selector is two rows
    ,
        ['v', 'z']   # column selector
]

Unnamed: 0,v,z
b,763,723
e,486,600


In [59]:
# Update the value at row e, column z to be 123.456

df.loc['e', 'z'] = 123.456

  df.loc['e', 'z'] = 123.456


In [60]:
df.dtypes

v      int64
w      int64
x      int64
y      int64
z    float64
dtype: object

In [61]:
# how should we update that column/value to avoid the problem?
# answer: replace the column with a new version of itself using float

np.random.seed(0)

df = DataFrame(np.random.randint(0, 1000, [5,5]),
               index=list('abcde'),
               columns=list('vwxyz'))
df

Unnamed: 0,v,w,x,y,z
a,684,559,629,192,835
b,763,707,359,9,723
c,277,754,804,599,70
d,472,600,396,314,705
e,486,551,87,174,600


In [62]:
df.dtypes

v    int64
w    int64
x    int64
y    int64
z    int64
dtype: object

In [63]:
# you can add a new column to a data frame via assignment
df['u'] = [10, 20, 30, 40, 50]   # needs to be the same length as the other columns

df

Unnamed: 0,v,w,x,y,z,u
a,684,559,629,192,835,10
b,763,707,359,9,723,20
c,277,754,804,599,70,30
d,472,600,396,314,705,40
e,486,551,87,174,600,50


In [64]:
# what if I assign to a column that already exists? It's replaced.

df['u'] = [100 ,200, 300, 400, 500]
df

Unnamed: 0,v,w,x,y,z,u
a,684,559,629,192,835,100
b,763,707,359,9,723,200
c,277,754,804,599,70,300
d,472,600,396,314,705,400
e,486,551,87,174,600,500


In [65]:
# I can do the same, but using the output of .astype

df['u'] = df['u'].astype(np.float64)
df

Unnamed: 0,v,w,x,y,z,u
a,684,559,629,192,835,100.0
b,763,707,359,9,723,200.0
c,277,754,804,599,70,300.0
d,472,600,396,314,705,400.0
e,486,551,87,174,600,500.0


In [66]:
df.loc['e', 'u'] = 123.456


In [67]:
df

Unnamed: 0,v,w,x,y,z,u
a,684,559,629,192,835,100.0
b,763,707,359,9,723,200.0
c,277,754,804,599,70,300.0
d,472,600,396,314,705,400.0
e,486,551,87,174,600,123.456


# Other ways to create data frames

You aren't going to be creating a *lot* of data frames manually; most will be based on data you read from elsewhere. But let's just look at a few techniques we can use for when you do need to create them.

1. List of lists (we've seen this)
2. List of dicts (the keys indicate which columns we're adding to)
3. Dict of lists (the keys indicate the column names, and the lists are the values)
4. Dict of series (same as above, but using a series rather than a list)

In [70]:
# list of dicts
# the union of all keys across all dicts will be our column names
# any key that is missing from a given dict gets nan as its value in that row

df = DataFrame([{'a':10, 'b':20, 'c':30},
                {'a':100, 'b':200, 'c':300},
                {'a':1000, 'b':2000, 'd':4000}],
              index=list('xyz'))
df

Unnamed: 0,a,b,c,d
x,10,20,30.0,
y,100,200,300.0,
z,1000,2000,,4000.0


In [72]:
# dict of lists
# here, we're just defining the data frame as a dict whose values are lists/series

df = DataFrame({'a':[10, 20, 30, 40, 50],
                'b':[100, 200, 300, 400, 500],
                'c':[1000, 2000, 3000, 4000, 5000]},
              index=list('vwxyz'))
df

Unnamed: 0,a,b,c
v,10,100,1000
w,20,200,2000
x,30,300,3000
y,40,400,4000
z,50,500,5000
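
The fourth technique on the list, a dict of series, looks much like a dict of lists. One difference worth knowing: each series carries its own index, and Pandas aligns the values on those index labels rather than on position. A small sketch with made-up values:

```python
from pandas import DataFrame, Series

# two series with overlapping, but not identical, indexes
a = Series([10, 20, 30], index=list('xyz'))
b = Series([100, 200, 300], index=list('wxy'))

df = DataFrame({'a': a, 'b': b})

# the resulting index is the union of both indexes (w, x, y and z)
# labels missing from one series get nan, which turns that column into float64
print(df)
```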


In [73]:
# what happens if a data frame has repeated column names?

df = DataFrame(np.random.randint(0, 100, [3, 3]),
               index=list('aab'),
               columns=list('xyy'))
df

Unnamed: 0,x,y,y
a,81,37,25
a,77,72,9
b,20,80,69


In [74]:
df['y']

Unnamed: 0,y,y
a,37,25
a,72,9
b,80,69


In [75]:
df['y'] = [10, 20, 30]
df

Unnamed: 0,x,y,y
a,81,10,10
a,77,20,20
b,20,30,30


In [77]:
df.columns.duplicated()

array([False, False,  True])
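
That boolean array can be used as a mask to keep only the first occurrence of each column name. A sketch, assuming that dropping the later duplicates is what you want (renaming them is another option):

```python
import numpy as np
from pandas import DataFrame, Series

np.random.seed(0)
df = DataFrame(np.random.randint(0, 100, [3, 3]),
               index=list('aab'),
               columns=list('xyy'))

# True marks a column whose name already appeared further to the left
mask = df.columns.duplicated()

# ~ inverts the mask, so .loc keeps every column that is NOT a repeat
deduped = df.loc[:, ~mask]
print(deduped)

# with unique names, deduped['y'] is a series again rather than a data frame
print(type(deduped['y']))
```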

# Exercise: Semi-realistic example

1. Create a data frame with two columns, `age` and `shoesize`. Each row will represent a different person in your family. The index for each row should be the name of the person, and the two columns should be the age and shoe size for that person. Create this twice, first using a list of dicts and then using a dict of lists.
2. What is the mean age in your family? How much does this differ from the median? Ask the same about the shoe size. 

In [80]:
# list of dicts -- we describe the rows

df = DataFrame([{'age': 23, 'shoesize':40},
                {'age': 21, 'shoesize': 40},
                {'age': 19, 'shoesize':44},
                {'age': 54, 'shoesize':46}],
              index=['Atara', 'Shikma', 'Amotz', 'Reuven'])
df

Unnamed: 0,age,shoesize
Atara,23,40
Shikma,21,40
Amotz,19,44
Reuven,54,46


In [81]:
df.loc['Reuven']

age         54
shoesize    46
Name: Reuven, dtype: int64

In [82]:
# create this as a dict of lists

df = DataFrame({'age':[23, 21, 19, 54],
                'shoesize':[40, 40, 44, 46]},
               index=['Atara', 'Shikma', 'Amotz', 'Reuven'])
df

Unnamed: 0,age,shoesize
Atara,23,40
Shikma,21,40
Amotz,19,44
Reuven,54,46


In [83]:
df['age'].mean()

np.float64(29.25)

In [84]:
df['age'].median()

np.float64(22.0)

In [85]:
df['shoesize'].mean()

np.float64(42.5)

In [86]:
df['shoesize'].median()

np.float64(42.0)

In [87]:
df.mean()

age         29.25
shoesize    42.50
dtype: float64

In [88]:
df.median()

age         22.0
shoesize    42.0
dtype: float64

In [89]:
df.describe()

Unnamed: 0,age,shoesize
count,4.0,4.0
mean,29.25,42.5
std,16.580611,3.0
min,19.0,40.0
25%,20.5,40.0
50%,22.0,42.0
75%,30.75,44.5
max,54.0,46.0


In [91]:
# df.describe returns a data frame!
# I can choose which rows I want

df.describe().loc[['mean', '50%']]

Unnamed: 0,age,shoesize
mean,29.25,42.5
50%,22.0,42.0


In [93]:
%timeit df.describe().loc[['mean', '50%']]

1.39 ms ± 33.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [92]:
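# the 'agg' method lets us run more than one aggregation method, and returns the results for all
# (here it runs on the output of describe, so it aggregates the summary statistics, not the original data)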
df.describe().agg(['mean', 'median'])

Unnamed: 0,age,shoesize
mean,24.510076,32.75
median,21.25,41.0


In [94]:
%timeit df.describe().agg(['mean', 'median'])

1.99 ms ± 83.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
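
To get the mean and median of the original columns in one call, `agg` can be run on the data frame itself rather than on the output of `describe`. A short sketch with the same family data:

```python
from pandas import DataFrame

df = DataFrame({'age': [23, 21, 19, 54],
                'shoesize': [40, 40, 44, 46]},
               index=['Atara', 'Shikma', 'Amotz', 'Reuven'])

# one row per aggregation, one column per original column
summary = df.agg(['mean', 'median'])
print(summary)
```

The values in `summary` match what `df.mean()` and `df.median()` return separately.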


# A few more convenience methods on data frames

In [96]:
# head and tail also work!

np.random.seed(0)
df = DataFrame(np.random.randint(0, 1000, [8, 10]),
               index=list('abcdefgh'),
               columns=list('qrstuvwxyz'))
df

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
a,684,559,629,192,835,763,707,359,9,723
b,277,754,804,599,70,472,600,396,314,705
c,486,551,87,174,600,849,677,537,845,72
d,777,916,115,976,755,709,847,431,448,850
e,99,984,177,755,797,659,147,910,423,288
f,961,265,697,639,544,543,714,244,151,675
g,510,459,882,183,28,802,128,128,932,53
h,901,550,488,756,273,335,388,617,42,442


In [97]:
df.head()   # returns the first 5 rows

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
a,684,559,629,192,835,763,707,359,9,723
b,277,754,804,599,70,472,600,396,314,705
c,486,551,87,174,600,849,677,537,845,72
d,777,916,115,976,755,709,847,431,448,850
e,99,984,177,755,797,659,147,910,423,288


In [98]:
df.head(2)

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
a,684,559,629,192,835,763,707,359,9,723
b,277,754,804,599,70,472,600,396,314,705


In [99]:
df.tail()

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
d,777,916,115,976,755,709,847,431,448,850
e,99,984,177,755,797,659,147,910,423,288
f,961,265,697,639,544,543,714,244,151,675
g,510,459,882,183,28,802,128,128,932,53
h,901,550,488,756,273,335,388,617,42,442


In [100]:
df.dtypes  # which dtype for each column?

q    int64
r    int64
s    int64
t    int64
u    int64
v    int64
w    int64
x    int64
y    int64
z    int64
dtype: object

In [101]:
# this gives us information about the data frame itself 

df.info()  

<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, a to h
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   q       8 non-null      int64
 1   r       8 non-null      int64
 2   s       8 non-null      int64
 3   t       8 non-null      int64
 4   u       8 non-null      int64
 5   v       8 non-null      int64
 6   w       8 non-null      int64
 7   x       8 non-null      int64
 8   y       8 non-null      int64
 9   z       8 non-null      int64
dtypes: int64(10)
memory usage: 704.0+ bytes


In [103]:
# what about nan?

df.loc['a', 't'] = np.nan
df.loc['a', 'z'] = np.nan
df.loc['b', 'u'] = np.nan
df.loc['c', 'w'] = np.nan
df.loc['f', 'q'] = np.nan
df.loc['g', 'q'] = np.nan


In [104]:
df

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
a,684.0,559,629,,835.0,763,707.0,359,9,
b,277.0,754,804,599.0,,472,600.0,396,314,705.0
c,486.0,551,87,174.0,600.0,849,,537,845,72.0
d,777.0,916,115,976.0,755.0,709,847.0,431,448,850.0
e,99.0,984,177,755.0,797.0,659,147.0,910,423,288.0
f,,265,697,639.0,544.0,543,714.0,244,151,675.0
g,,459,882,183.0,28.0,802,128.0,128,932,53.0
h,901.0,550,488,756.0,273.0,335,388.0,617,42,442.0


In [105]:
df.dtypes

q    float64
r      int64
s      int64
t    float64
u    float64
v      int64
w    float64
x      int64
y      int64
z    float64
dtype: object

In [106]:
# how do we deal with missing data?
# - dropna
# - fillna

In [107]:
df.dropna()     # any row with a nan value is removed 

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
d,777.0,916,115,976.0,755.0,709,847.0,431,448,850.0
e,99.0,984,177,755.0,797.0,659,147.0,910,423,288.0
h,901.0,550,488,756.0,273.0,335,388.0,617,42,442.0


In [108]:
# is this what we want?  Sometimes... 
# we can tell dropna which columns it should pay attention to, and also how many good values are needed to keep the row

df.dropna(thresh=9)  # keep any row with at least 9 non-nan values, so 1 nan per row is OK

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
b,277.0,754,804,599.0,,472,600.0,396,314,705.0
c,486.0,551,87,174.0,600.0,849,,537,845,72.0
d,777.0,916,115,976.0,755.0,709,847.0,431,448,850.0
e,99.0,984,177,755.0,797.0,659,147.0,910,423,288.0
f,,265,697,639.0,544.0,543,714.0,244,151,675.0
g,,459,882,183.0,28.0,802,128.0,128,932,53.0
h,901.0,550,488,756.0,273.0,335,388.0,617,42,442.0


In [110]:
df.dropna(thresh=10)   # this means: all values must be good, no nans need apply

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
d,777.0,916,115,976.0,755.0,709,847.0,431,448,850.0
e,99.0,984,177,755.0,797.0,659,147.0,910,423,288.0
h,901.0,550,488,756.0,273.0,335,388.0,617,42,442.0


In [111]:
# I can use the "subset" keyword argument to restrict which columns we care about having nan

df.dropna(subset=['q', 'u'])

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
a,684.0,559,629,,835.0,763,707.0,359,9,
c,486.0,551,87,174.0,600.0,849,,537,845,72.0
d,777.0,916,115,976.0,755.0,709,847.0,431,448,850.0
e,99.0,984,177,755.0,797.0,659,147.0,910,423,288.0
h,901.0,550,488,756.0,273.0,335,388.0,617,42,442.0


In [113]:
df.dropna(axis='columns')  # now we'll drop columns that have nan values

Unnamed: 0,r,s,v,x,y
a,559,629,763,359,9
b,754,804,472,396,314
c,551,87,849,537,845
d,916,115,709,431,448
e,984,177,659,910,423
f,265,697,543,244,151
g,459,882,802,128,932
h,550,488,335,617,42


In [115]:
df.dropna(thresh=7, axis='columns')   #  if we have at least 7 good values in the column, keep it

Unnamed: 0,r,s,t,u,v,w,x,y,z
a,559,629,,835.0,763,707.0,359,9,
b,754,804,599.0,,472,600.0,396,314,705.0
c,551,87,174.0,600.0,849,,537,845,72.0
d,916,115,976.0,755.0,709,847.0,431,448,850.0
e,984,177,755.0,797.0,659,147.0,910,423,288.0
f,265,697,639.0,544.0,543,714.0,244,151,675.0
g,459,882,183.0,28.0,802,128.0,128,932,53.0
h,550,488,756.0,273.0,335,388.0,617,42,442.0
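
The `how` parameter, listed in the help output below, gives one more option: `how='all'` drops a row only when every value in it is missing. A small sketch with made-up values:

```python
import numpy as np
from pandas import DataFrame

df = DataFrame({'q': [1.0, np.nan, np.nan],
                'r': [2.0, 5.0, np.nan]},
               index=list('abc'))

# default (how='any'): rows b and c each contain a nan, so only row a survives
any_dropped = df.dropna()

# how='all': only row c is entirely nan, so rows a and b survive
all_dropped = df.dropna(how='all')

print(any_dropped)
print(all_dropped)
```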


In [112]:
help(df.dropna)

Help on method dropna in module pandas.core.frame:

dropna(
    *,
    axis: 'Axis' = 0,
    how: 'AnyAll | lib.NoDefault' = <no_default>,
    thresh: 'int | lib.NoDefault' = <no_default>,
    subset: 'IndexLabel | None' = None,
    inplace: 'bool' = False,
    ignore_index: 'bool' = False
) -> 'DataFrame | None' method of pandas.core.frame.DataFrame instance
    Remove missing values.

    See the :ref:`User Guide <missing_data>` for more on which values are
    considered missing, and how to work with missing data.

    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Determine if rows or columns which contain missing values are
        removed.

        * 0, or 'index' : Drop rows which contain missing values.
        * 1, or 'columns' : Drop columns which contain missing value.

        Only a single axis is allowed.

    how : {'any', 'all'}, default 'any'
        Determine if row or column is removed from DataFrame, when we have
        at le

In [116]:
# I'm guessing that when we say axis='columns', Pandas transposes our data frame, does the action we ask for
# and then transposes it back.

# you can do that with .transpose() or .T  (without parentheses!)

df.T

Unnamed: 0,a,b,c,d,e,f,g,h
q,684.0,277.0,486.0,777.0,99.0,,,901.0
r,559.0,754.0,551.0,916.0,984.0,265.0,459.0,550.0
s,629.0,804.0,87.0,115.0,177.0,697.0,882.0,488.0
t,,599.0,174.0,976.0,755.0,639.0,183.0,756.0
u,835.0,,600.0,755.0,797.0,544.0,28.0,273.0
v,763.0,472.0,849.0,709.0,659.0,543.0,802.0,335.0
w,707.0,600.0,,847.0,147.0,714.0,128.0,388.0
x,359.0,396.0,537.0,431.0,910.0,244.0,128.0,617.0
y,9.0,314.0,845.0,448.0,423.0,151.0,932.0,42.0
z,,705.0,72.0,850.0,288.0,675.0,53.0,442.0


In [117]:
df

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
a,684.0,559,629,,835.0,763,707.0,359,9,
b,277.0,754,804,599.0,,472,600.0,396,314,705.0
c,486.0,551,87,174.0,600.0,849,,537,845,72.0
d,777.0,916,115,976.0,755.0,709,847.0,431,448,850.0
e,99.0,984,177,755.0,797.0,659,147.0,910,423,288.0
f,,265,697,639.0,544.0,543,714.0,244,151,675.0
g,,459,882,183.0,28.0,802,128.0,128,932,53.0
h,901.0,550,488,756.0,273.0,335,388.0,617,42,442.0


In [118]:
# we can use fillna with a single value

df.fillna(99999)

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
a,684.0,559,629,99999.0,835.0,763,707.0,359,9,99999.0
b,277.0,754,804,599.0,99999.0,472,600.0,396,314,705.0
c,486.0,551,87,174.0,600.0,849,99999.0,537,845,72.0
d,777.0,916,115,976.0,755.0,709,847.0,431,448,850.0
e,99.0,984,177,755.0,797.0,659,147.0,910,423,288.0
f,99999.0,265,697,639.0,544.0,543,714.0,244,151,675.0
g,99999.0,459,882,183.0,28.0,802,128.0,128,932,53.0
h,901.0,550,488,756.0,273.0,335,388.0,617,42,442.0


In [119]:
df.mean()

q    537.333333
r    629.750000
s    484.875000
t    583.142857
u    547.428571
v    641.500000
w    504.428571
x    452.750000
y    395.500000
z    440.714286
dtype: float64

In [120]:
df.fillna(df.mean())   # each column's nan values will be replaced with the mean *of that column*

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
a,684.0,559,629,583.142857,835.0,763,707.0,359,9,440.714286
b,277.0,754,804,599.0,547.428571,472,600.0,396,314,705.0
c,486.0,551,87,174.0,600.0,849,504.428571,537,845,72.0
d,777.0,916,115,976.0,755.0,709,847.0,431,448,850.0
e,99.0,984,177,755.0,797.0,659,147.0,910,423,288.0
f,537.333333,265,697,639.0,544.0,543,714.0,244,151,675.0
g,537.333333,459,882,183.0,28.0,802,128.0,128,932,53.0
h,901.0,550,488,756.0,273.0,335,388.0,617,42,442.0


In [121]:
df.fillna({'q':1111, 'r':2222, 's':3333, 't':4444, 'u':5555, 'v':66666, 'w':7777, 'x':8888, 'y':9999, 'z':101010})

Unnamed: 0,q,r,s,t,u,v,w,x,y,z
a,684.0,559,629,4444.0,835.0,763,707.0,359,9,101010.0
b,277.0,754,804,599.0,5555.0,472,600.0,396,314,705.0
c,486.0,551,87,174.0,600.0,849,7777.0,537,845,72.0
d,777.0,916,115,976.0,755.0,709,847.0,431,448,850.0
e,99.0,984,177,755.0,797.0,659,147.0,910,423,288.0
f,1111.0,265,697,639.0,544.0,543,714.0,244,151,675.0
g,1111.0,459,882,183.0,28.0,802,128.0,128,932,53.0
h,901.0,550,488,756.0,273.0,335,388.0,617,42,442.0
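
Because `fillna` accepts a series indexed by column name, any per-column statistic works the same way as the mean. A sketch using the median, on a small made-up frame:

```python
import numpy as np
from pandas import DataFrame

df = DataFrame({'q': [1.0, np.nan, 3.0, 100.0],
                'r': [10.0, 20.0, np.nan, 40.0]},
               index=list('abcd'))

# df.median() skips nan values and returns one value per column
medians = df.median()

# each nan is replaced by the median of its own column; the original df is unchanged
filled = df.fillna(medians)
print(filled)
```

The median is less sensitive than the mean to outliers such as the 100.0 in column `q`, which is one reason to prefer it as a fill value.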


In [122]:
help(df.fillna)

Help on method fillna in module pandas.core.generic:

fillna(
    value: 'Hashable | Mapping | Series | DataFrame | None' = None,
    *,
    method: 'FillnaOptions | None' = None,
    axis: 'Axis | None' = None,
    inplace: 'bool_t' = False,
    limit: 'int | None' = None,
    downcast: 'dict | None | lib.NoDefault' = <no_default>
) -> 'Self | None' method of pandas.core.frame.DataFrame instance
    Fill NA/NaN values using the specified method.

    Parameters
    ----------
    value : scalar, dict, Series, or DataFrame
        Value to use to fill holes (e.g. 0), alternately a
        dict/Series/DataFrame of values specifying which value to use for
        each index (for a Series) or column (for a DataFrame).  Values not
        in the dict/Series/DataFrame will not be filled. This value cannot
        be a list.
    method : {'backfill', 'bfill', 'ffill', None}, default None
        Method to use for filling holes in reindexed Series:

        * ffill: propagate last valid obser

# Next up

- Files
- Reading/writing


# Reading files

Pandas handles a *huge* number of different file formats.  We'll start with CSV, then a bit about Excel, and then a few other formats, as well.

# CSV 

CSV used to stand for "comma-separated values," but now it's more generically "character-separated values". The idea is pretty simple:

- The file is all text
- Every line is a record
- Fields on each line are separated (by default) by commas
- The first line of the file can (sometimes) be the names of the columns we want

Lots of problems with this:
- What if the data contains commas?
- How are we supposed to know if the first line is column names or just data?
- How is numeric data interpreted?
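
The first of these problems is usually handled by quoting: a field wrapped in double quotes may contain the separator. A sketch using in-memory text instead of a file (the sample data is made up, and `io.StringIO` stands in for a filename):

```python
import io
import pandas as pd

# the second field of the first record contains a comma, so it is wrapped in quotes
text = '''name,city,visits
Alice,"Springfield, IL",3
Bob,Boston,5
'''

df = pd.read_csv(io.StringIO(text))
print(df)

# a different separator is passed with sep; here the fields are separated by semicolons
text2 = '''name;city;visits
Alice;Springfield, IL;3
'''
df2 = pd.read_csv(io.StringIO(text2), sep=';')
print(df2)
```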


In [None]:
# goal: Create a data frame based on a CSV file
# for that, I'll use pd.read_csv



In [123]:
!head taxi.csv

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.954429626464844,40.764141082763672,1,N,-73.974754333496094,40.754093170166016,2,17,0,0.5,0,0,0.3,17.8
2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,.46,-73.971443176269531,40.758941650390625,1,N,-73.978538513183594,40.761909484863281,1,6.5,0,0.5,1,0,0.3,8.3
2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,.87,-73.978111267089844,40.738433837890625,1,N,-73.990272521972656,40.745437622070313,1,8,0,0.5,2.2,0,0.3,11
2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892333984375,40.773529052734375,1,N,-73.971527099609375,40.760330200195312,1,13.5,0,0.5,2.86,0,0.3,17.16
1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.40,-73.979087829589844,40.776771545410156,1,N,-73.982

In [124]:
# to read this CSV file into a data frame, I'll just give the filename to read_csv

df = pd.read_csv('taxi.csv')

In [125]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3


In [126]:
# how big is my data frame? I can ask using .shape

df.shape

(9999, 19)

In [127]:
df.dtypes

VendorID                   int64
tpep_pickup_datetime      object
tpep_dropoff_datetime     object
passenger_count            int64
trip_distance            float64
pickup_longitude         float64
pickup_latitude          float64
RateCodeID                 int64
store_and_fwd_flag        object
dropoff_longitude        float64
dropoff_latitude         float64
payment_type               int64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
improvement_surcharge    float64
total_amount             float64
dtype: object
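
The two datetime columns came back as `object`, meaning Pandas kept them as strings. `read_csv` can be asked to parse them with the `parse_dates` parameter. A sketch on two records shaped like the taxi data (in-memory text stands in for the file, and only four of the columns are included):

```python
import io
import pandas as pd

text = '''VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count
2,2015-06-02 11:19:29,2015-06-02 11:47:52,1
2,2015-06-02 11:19:30,2015-06-02 11:27:56,1
'''

df = pd.read_csv(io.StringIO(text),
                 parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])

# both columns now have a datetime64 dtype instead of object
print(df.dtypes)

# datetime columns support arithmetic, e.g. the duration of each trip
duration = df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']
print(duration)
```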

In [128]:
df['passenger_count'].mean()

np.float64(1.6594659465946595)

In [129]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   VendorID               9999 non-null   int64  
 1   tpep_pickup_datetime   9999 non-null   object 
 2   tpep_dropoff_datetime  9999 non-null   object 
 3   passenger_count        9999 non-null   int64  
 4   trip_distance          9999 non-null   float64
 5   pickup_longitude       9999 non-null   float64
 6   pickup_latitude        9999 non-null   float64
 7   RateCodeID             9999 non-null   int64  
 8   store_and_fwd_flag     9999 non-null   object 
 9   dropoff_longitude      9999 non-null   float64
 10  dropoff_latitude       9999 non-null   float64
 11  payment_type           9999 non-null   int64  
 12  fare_amount            9999 non-null   float64
 13  extra                  9999 non-null   float64
 14  mta_tax                9999 non-null   float64
 15  tip_

In [130]:
# what if I want to save this as a CSV file?
df.to_csv('mytaxi.csv')

In [131]:
!head mytaxi.csv

,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95442962646483,40.76414108276367,1,N,-73.9747543334961,40.754093170166016,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.97144317626953,40.75894165039063,1,N,-73.97853851318358,40.76190948486328,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.97811126708984,40.73843383789063,1,N,-73.99027252197266,40.74543762207031,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.94589233398438,40.773529052734375,1,N,-73.97152709960938,40.76033020019531,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.97908782958984,40.77677
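
Notice the extra, unnamed first column: `to_csv` wrote the index as well. Two common ways to deal with that are to skip the index when writing, or to tell `read_csv` which column holds it. A sketch that writes to in-memory text rather than files (the small frame is made up):

```python
import io
import pandas as pd

df = pd.DataFrame({'fare_amount': [17.0, 6.5],
                   'tip_amount': [0.0, 1.0]})

# option 1: leave the index out of the file
no_index = df.to_csv(index=False)   # with no path, to_csv returns the text
print(no_index)

# option 2: write the index, then read it back as the index
with_index = df.to_csv()
restored = pd.read_csv(io.StringIO(with_index), index_col=0)
print(restored)
```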

In [133]:
# this is what we can do if there are no headers
# you can pass names=[list_of_strings_to_be_used_as_headers]

pd.read_csv('mytaxi.csv',
           header=None)   # there are no column names to be found!

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.954430,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.00,0.0,0.3,17.80
1,1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.00,0.0,0.3,8.30
2,2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.20,0.0,0.3,11.00
3,3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.760330,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.40,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.00,0.0,0.3,10.30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,9994,1,2015-06-01 00:12:59,2015-06-01 00:24:18,1,2.70,-73.947792,40.814972,1,N,-73.973358,40.783638,2,11.0,0.5,0.5,0.00,0.0,0.3,12.30
9995,9995,1,2015-06-01 00:12:59,2015-06-01 00:28:16,1,4.50,-74.004066,40.747818,1,N,-73.953758,40.779285,1,16.0,0.5,0.5,3.00,0.0,0.3,20.30
9996,9996,2,2015-06-01 00:13:00,2015-06-01 00:37:25,1,5.59,-73.994377,40.766102,1,N,-73.903206,40.750546,2,21.0,0.5,0.5,0.00,0.0,0.3,22.30
9997,9997,2,2015-06-01 00:13:02,2015-06-01 00:19:10,6,1.54,-73.978302,40.748531,1,N,-73.989166,40.762852,2,6.5,0.5,0.5,0.00,0.0,0.3,7.80
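
The `names` option mentioned in the comment above can be sketched as follows. The data here is a made-up, headerless three-line CSV held in a string (not the taxi file), and the column names are chosen only for illustration:

```python
import io
import pandas as pd

# hypothetical headerless data: three rows, no column names
raw = "1,1.63,17.8\n1,0.46,8.3\n6,1.54,7.8\n"

# header=None: the first line is data, not column names
# names=...:   the labels to use instead of 0, 1, 2
trips = pd.read_csv(io.StringIO(raw),
                    header=None,
                    names=['passenger_count', 'trip_distance', 'total_amount'])

print(trips)
```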


In [139]:
pd.read_csv('mytaxi.csv',
           header=3)   # the headers are on row 3 (counting from 0, blank lines skipped)

Unnamed: 0.1,Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.954430,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.00,0.0,0.3,17.80
1,1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.00,0.0,0.3,8.30
2,2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.20,0.0,0.3,11.00
3,3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.760330,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.40,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.00,0.0,0.3,10.30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,9994,1,2015-06-01 00:12:59,2015-06-01 00:24:18,1,2.70,-73.947792,40.814972,1,N,-73.973358,40.783638,2,11.0,0.5,0.5,0.00,0.0,0.3,12.30
9995,9995,1,2015-06-01 00:12:59,2015-06-01 00:28:16,1,4.50,-74.004066,40.747818,1,N,-73.953758,40.779285,1,16.0,0.5,0.5,3.00,0.0,0.3,20.30
9996,9996,2,2015-06-01 00:13:00,2015-06-01 00:37:25,1,5.59,-73.994377,40.766102,1,N,-73.903206,40.750546,2,21.0,0.5,0.5,0.00,0.0,0.3,22.30
9997,9997,2,2015-06-01 00:13:02,2015-06-01 00:19:10,6,1.54,-73.978302,40.748531,1,N,-73.989166,40.762852,2,6.5,0.5,0.5,0.00,0.0,0.3,7.80


In [140]:
# one of the most common things to do is cut down on the number of columns we read in
# I can use the "usecols" keyword argument to choose only a few columns

pd.read_csv('mytaxi.csv', header=3, usecols=['passenger_count', 'trip_distance', 'total_amount'])

Unnamed: 0,passenger_count,trip_distance,total_amount
0,1,1.63,17.80
1,1,0.46,8.30
2,1,0.87,11.00
3,1,2.13,17.16
4,1,1.40,10.30
...,...,...,...
9994,1,2.70,12.30
9995,1,4.50,20.30
9996,1,5.59,22.30
9997,6,1.54,7.80


If the data is

hello,"hi, there"  

then the line contains 2 fields, because everything inside the `""` is treated as 1 field, commas included.
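
A quick way to check this is to parse that one line with `read_csv`. This is a small sketch that feeds the text in through `io.StringIO` rather than a file:

```python
import io
import pandas as pd

raw = 'hello,"hi, there"\n'

# header=None because this single line is data, not column names
row = pd.read_csv(io.StringIO(raw), header=None)

print(row.shape)       # (rows, columns)
print(row.loc[0, 1])   # the quoted field, with its comma intact
```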

I'll re-save the file using a tab (`\t`) as the separator.

In [141]:
df.to_csv('mytaxi.csv', sep='\t')

In [142]:
!head mytaxi.csv

	VendorID	tpep_pickup_datetime	tpep_dropoff_datetime	passenger_count	trip_distance	pickup_longitude	pickup_latitude	RateCodeID	store_and_fwd_flag	dropoff_longitude	dropoff_latitude	payment_type	fare_amount	extra	mta_tax	tip_amount	tolls_amount	improvement_surcharge	total_amount
0	2	2015-06-02 11:19:29	2015-06-02 11:47:52	1	1.63	-73.95442962646483	40.76414108276367	1	N	-73.9747543334961	40.754093170166016	2	17.0	0.0	0.5	0.0	0.0	0.3	17.8
1	2	2015-06-02 11:19:30	2015-06-02 11:27:56	1	0.46	-73.97144317626953	40.75894165039063	1	N	-73.97853851318358	40.76190948486328	1	6.5	0.0	0.5	1.0	0.0	0.3	8.3
2	2	2015-06-02 11:19:31	2015-06-02 11:30:30	1	0.87	-73.97811126708984	40.73843383789063	1	N	-73.99027252197266	40.74543762207031	1	8.0	0.0	0.5	2.2	0.0	0.3	11.0
3	2	2015-06-02 11:19:31	2015-06-02 11:39:02	1	2.13	-73.94589233398438	40.773529052734375	1	N	-73.97152709960938	40.76033020019531	1	13.5	0.0	0.5	2.86	0.0	0.3	17.16
4	1	2015-06-02 11:19:32	2015-06-02 11:32:49	1	1.4	-73.97908782958984	40.77677

In [143]:
# how can we read it in?

pd.read_csv('mytaxi.csv', sep='\t')

Unnamed: 0.1,Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.954430,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.00,0.0,0.3,17.80
1,1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.00,0.0,0.3,8.30
2,2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.20,0.0,0.3,11.00
3,3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.760330,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.40,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.00,0.0,0.3,10.30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,9994,1,2015-06-01 00:12:59,2015-06-01 00:24:18,1,2.70,-73.947792,40.814972,1,N,-73.973358,40.783638,2,11.0,0.5,0.5,0.00,0.0,0.3,12.30
9995,9995,1,2015-06-01 00:12:59,2015-06-01 00:28:16,1,4.50,-74.004066,40.747818,1,N,-73.953758,40.779285,1,16.0,0.5,0.5,3.00,0.0,0.3,20.30
9996,9996,2,2015-06-01 00:13:00,2015-06-01 00:37:25,1,5.59,-73.994377,40.766102,1,N,-73.903206,40.750546,2,21.0,0.5,0.5,0.00,0.0,0.3,22.30
9997,9997,2,2015-06-01 00:13:02,2015-06-01 00:19:10,6,1.54,-73.978302,40.748531,1,N,-73.989166,40.762852,2,6.5,0.5,0.5,0.00,0.0,0.3,7.80


In [144]:
# what if you don't specify the separator?

pd.read_csv('mytaxi.csv')

Unnamed: 0,\tVendorID\ttpep_pickup_datetime\ttpep_dropoff_datetime\tpassenger_count\ttrip_distance\tpickup_longitude\tpickup_latitude\tRateCodeID\tstore_and_fwd_flag\tdropoff_longitude\tdropoff_latitude\tpayment_type\tfare_amount\textra\tmta_tax\ttip_amount\ttolls_amount\timprovement_surcharge\ttotal_amount
0,0\t2\t2015-06-02 11:19:29\t2015-06-02 11:47:52...
1,1\t2\t2015-06-02 11:19:30\t2015-06-02 11:27:56...
2,2\t2\t2015-06-02 11:19:31\t2015-06-02 11:30:30...
3,3\t2\t2015-06-02 11:19:31\t2015-06-02 11:39:02...
4,4\t1\t2015-06-02 11:19:32\t2015-06-02 11:32:49...
...,...
9994,9994\t1\t2015-06-01 00:12:59\t2015-06-01 00:24...
9995,9995\t1\t2015-06-01 00:12:59\t2015-06-01 00:28...
9996,9996\t2\t2015-06-01 00:13:00\t2015-06-01 00:37...
9997,9997\t2\t2015-06-01 00:13:02\t2015-06-01 00:19...
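
When the separator is wrong, every line ends up in a single column, as in the output above. One hedged way to catch this early is to look at the number of columns right after loading. The sketch below assumes a tiny made-up tab-separated string rather than the real file:

```python
import io
import pandas as pd

# hypothetical tab-separated data
raw = "passenger_count\ttrip_distance\n1\t1.63\n6\t1.54\n"

wrong = pd.read_csv(io.StringIO(raw))             # default separator is ','
right = pd.read_csv(io.StringIO(raw), sep='\t')   # matches the data

# a single wide column is a sign that the separator is wrong
print(wrong.shape)
print(right.shape)
```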


# Summary of the most common `read_csv` options

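- `sep` -- the field separator (a comma by default; use `'\t'` for tabs)
- `usecols` -- read only the columns you list
- `header` -- the row that contains the column names, or `None` if there isn't one
- `names` -- the column names to use (if you need to name the columns yourself)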
- `index_col` -- use this column as the index in your data frame
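
Several of these options are often combined in one call. The following sketch assumes a made-up semicolon-separated string with a hypothetical `trip_id` column. It uses four of the five options and leaves out `names`, because this data already has a header row:

```python
import io
import pandas as pd

# hypothetical semicolon-separated contents with a header row
raw = ("trip_id;passenger_count;trip_distance;total_amount\n"
       "a1;1;1.63;17.8\n"
       "a2;6;1.54;7.8\n")

trips = pd.read_csv(io.StringIO(raw),
                    sep=';',                # field separator
                    header=0,               # row 0 holds the column names
                    usecols=['trip_id', 'trip_distance', 'total_amount'],
                    index_col='trip_id')    # use this column as the index

print(trips)
```

Note that a column named in `index_col` also has to appear in `usecols`, since only the selected columns are read.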

In [145]:
help(pd.read_csv)

Help on function read_csv in module pandas.io.parsers.readers:

read_csv(
    filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]',
    *,
    sep: 'str | None | lib.NoDefault' = <no_default>,
    delimiter: 'str | None | lib.NoDefault' = None,
    header: "int | Sequence[int] | None | Literal['infer']" = 'infer',
    names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>,
    index_col: 'IndexLabel | Literal[False] | None' = None,
    usecols: 'UsecolsArgType' = None,
    dtype: 'DtypeArg | None' = None,
    engine: 'CSVEngine | None' = None,
    converters: 'Mapping[Hashable, Callable] | None' = None,
    true_values: 'list | None' = None,
    false_values: 'list | None' = None,
    skipinitialspace: 'bool' = False,
    skiprows: 'list[int] | int | Callable[[Hashable], bool] | None' = None,
    skipfooter: 'int' = 0,
    nrows: 'int | None' = None,
    na_values: 'Hashable | Iterable[Hashable] | Mapping[Hashable, Iterable[Hashable]] | None' = None,
  

# Exercise: Taxi data

1. Read the taxi data (`taxi.csv`) into a data frame. We only care about the columns `passenger_count`, `trip_distance`, and `total_amount`.
2. What were the mean and median trip distances and total amounts in this data set?
3. What was the most common number of passengers?
4. How many trips went further than 30 miles?
5. How many trips went 0 miles?

In [146]:
df = pd.read_csv('taxi.csv',
                 usecols=['passenger_count', 'trip_distance', 'total_amount'])

df

Unnamed: 0,passenger_count,trip_distance,total_amount
0,1,1.63,17.80
1,1,0.46,8.30
2,1,0.87,11.00
3,1,2.13,17.16
4,1,1.40,10.30
...,...,...,...
9994,1,2.70,12.30
9995,1,4.50,20.30
9996,1,5.59,22.30
9997,6,1.54,7.80


In [147]:
df.dtypes

passenger_count      int64
trip_distance      float64
total_amount       float64
dtype: object

In [149]:
# this is a quick/easy way to find out how many NaN values we have in each column!

df.isna().sum()

passenger_count    0
trip_distance      0
total_amount       0
dtype: int64

In [150]:
# What was the most common number of passengers?
# (.max() would return the largest passenger count, 6, not the most common one)

df['passenger_count'].value_counts()

passenger_count
1    7207
2    1313
5     520
3     406
6     369
4     182
0       2
Name: count, dtype: int64

In [None]:
# How many trips went further than 30 miles?
# How many trips went 0 miles?
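
One possible approach to the remaining questions, shown here as a sketch rather than the definitive solution: compute the mean and median with `agg`, and count matching rows by summing a boolean series. The small data frame below is a made-up stand-in so that the code runs on its own; with the real data, use the `df` loaded from `taxi.csv` instead.

```python
import pandas as pd

# hypothetical stand-in for the taxi data frame
df = pd.DataFrame({'passenger_count': [1, 1, 2, 1, 6],
                   'trip_distance': [0.0, 1.5, 31.2, 0.0, 4.0],
                   'total_amount': [3.3, 9.8, 95.0, 52.0, 17.3]})

# mean and median of distance and amount, in one call
summary = df[['trip_distance', 'total_amount']].agg(['mean', 'median'])

# True counts as 1, so summing a boolean series counts the matching rows
long_trips = (df['trip_distance'] > 30).sum()
zero_trips = (df['trip_distance'] == 0).sum()

print(summary)
print(long_trips, zero_trips)
```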