# Day 2 -- data frames

1. Q&A
2. Data frames -- creating, and working with them
3. Adding and removing data in our data frames
4. Useful methods for our data frames
5. Boolean indexes / mask indexes
6. Using `.loc` to retrieve rows, rows/columns
7. Reading data from outside sources
    - Download this zipfile: https://files.lerner.co.il/data-science-exercise-files.zip
    - Scraping HTML files
    - Retrieving other formats

In [4]:
import pandas as pd
from pandas import Series, DataFrame

In [5]:
temps = Series([20, 23, 25, 22, 23, 25, 22, 27, 20])

# `.loc` retrieves data in numerous ways

1. Use it to retrieve one value from a series with the (default, numeric) index
2. Use it to retrieve one value from a series with the (non-default) index of another dtype -- strings, ints, floats, etc.
3. Retrieve one or more values as a series based on a "fancy index," passing a list of index values
4. Pass `.loc` a series/list of booleans of the same length as the series itself, and you get back only those elements that correspond to a `True`
5. Generate a boolean series with a comparison operator, and then pass that to `.loc`, to get only those values in the series for which the comparison is `True`

In [6]:
temps.loc[4]

np.int64(23)

In [7]:
temps.loc[2]

np.int64(25)

In [8]:
temps.loc[200]

KeyError: 200

In [9]:
temps

0    20
1    23
2    25
3    22
4    23
5    25
6    22
7    27
8    20
dtype: int64

In [11]:
# if I pass an index when creating the series, .loc uses those labels
# (that's why we sometimes need .iloc, which always works by position)

temps = Series([20, 23, 25, 22, 23, 25, 22, 27, 20],
              index='Mon Tue Wed Thu Fri Sat Sun Mon Tue'.split())

In [12]:
temps

Mon    20
Tue    23
Wed    25
Thu    22
Fri    23
Sat    25
Sun    22
Mon    27
Tue    20
dtype: int64

In [13]:
temps.loc['Sat']

np.int64(25)

In [14]:
temps.loc['Tue']

Tue    23
Tue    20
dtype: int64

In [15]:
temps.loc[2]

KeyError: 2

In [16]:
temps.loc[[2, 4]]   # fancy indexing -- but 2 and 4 aren't labels in this index, so we get a KeyError

KeyError: "None of [Index([2, 4], dtype='object')] are in the [index]"

In [17]:
temps.loc[['Mon', 'Thu']]   # fancy indexing -- we give a list of indexes and get a series back

Mon    20
Mon    27
Thu    22
dtype: int64

In [18]:
# if you use fancy indexing, and there's only one match, you will still get a series back

temps.loc[['Thu']]

Thu    22
dtype: int64

In [19]:
# we can also use a boolean series (or list)
# in this case, the argument to .loc must be a list/series of booleans that is the same
# length as the series itself

temps.loc[[True, False, True, False, True, False, True, False, True]]

Mon    20
Wed    25
Fri    23
Sun    22
Tue    20
dtype: int64

In [21]:
# a variation on this is to generate a boolean series, and pass it to .loc

temps.loc[temps < 25]   # this generates a boolean series, and passes that series to .loc

Mon    20
Tue    23
Thu    22
Fri    23
Sun    22
Tue    20
dtype: int64

In [22]:
temps

Mon    20
Tue    23
Wed    25
Thu    22
Fri    23
Sat    25
Sun    22
Mon    27
Tue    20
dtype: int64

In [23]:
temps.index

Index(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Mon', 'Tue'], dtype='object')

In [24]:
temps.values

array([20, 23, 25, 22, 23, 25, 22, 27, 20])
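
An earlier cell noted that we sometimes need `.iloc`. As a minimal sketch (recreating `temps` so the cell stands alone), `.iloc` always selects by position, regardless of what the index contains:

```python
from pandas import Series

temps = Series([20, 23, 25, 22, 23, 25, 22, 27, 20],
               index='Mon Tue Wed Thu Fri Sat Sun Mon Tue'.split())

# .loc looks up index labels; .iloc looks up integer positions
third = temps.iloc[2]            # the value at position 2 (labelled 'Wed')
pair = temps.iloc[[2, 4]]        # fancy indexing by position returns a series

print(third)
print(pair)
```

So `temps.loc[2]` fails on this series, while `temps.iloc[2]` works no matter how the index is labelled.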

# What is a data frame?

- 2D data
- The rows have an index (just like a series)
- The columns have names (which work like the index, but vertically)
- Each column is basically a series, which means that all values in that column have the same dtype

We can create a data frame using a list of lists

In [25]:
df = DataFrame([[10, 20, 30, 40],
                [50, 60, 70, 80],
                [90, 100, 110, 120]])
df

Unnamed: 0,0,1,2,3
0,10,20,30,40
1,50,60,70,80
2,90,100,110,120


In [26]:
# we can name the data frame's index just as we did with a series -- passing the "index" kwarg
# we can name the data frame's columns in the same way, passing the "columns" kwarg

In [28]:
df = DataFrame([[10, 20, 30, 40],
              [50, 60, 70, 80],
              [90, 100, 110, 120]],
               index=list('abc'),
               columns=list('wxyz'))
df

Unnamed: 0,w,x,y,z
a,10,20,30,40
b,50,60,70,80
c,90,100,110,120


In [29]:
# retrieve from the rows using .loc
# all of the techniques we talked about still work!
# when you use .loc (in the simplest way), you get a row back
# and that row is represented as a series!

df.loc['a']

w    10
x    20
y    30
z    40
Name: a, dtype: int64

In [30]:
df.loc['b']

w    50
x    60
y    70
z    80
Name: b, dtype: int64

In [31]:
df.loc['d']

KeyError: 'd'

In [32]:
# how do I retrieve columns?
# we use []

df['x']

a     20
b     60
c    100
Name: x, dtype: int64

In [33]:
df['z']

a     40
b     80
c    120
Name: z, dtype: int64

In [34]:
# can I retrieve more than one row with a fancy index?
df.loc[['a', 'c']]  # fancy indexing on the row

Unnamed: 0,w,x,y,z
a,10,20,30,40
c,90,100,110,120


In [36]:
df[['x', 'z']]   # this gives us more than one column back -- fancy indexing on the columns

Unnamed: 0,x,z
a,20,40
b,60,80
c,100,120


# Exercise: Grocery store

1. Create a data frame in which you have two columns: One will be the price of an item (`price`), and the second will be the number of sales of that item (`sales`). The index will be the items that you sell.
2. The data frame should have 4 rows, and each item will have a price and a number of sales.
3. Retrieve all of the information for apples.
4. Retrieve all of the information for bananas.
5. Retrieve all information for apples and bananas.
6. What is the mean price for all products?

In [39]:
# because we're creating a data frame with two columns, 
# we'll use a list of lists in which every internal
# list will have two elements, the price and the number of sales

df = DataFrame([[1, 10],
                [0.5, 15],
                [3, 8],
                [2, 20]],
              columns=['price', 'sales'],
              index='apple banana mushroom pepper'.split())

df

Unnamed: 0,price,sales
apple,1.0,10
banana,0.5,15
mushroom,3.0,8
pepper,2.0,20


In [40]:
df.loc['apple']   # I want the row for the index "apple"

price     1.0
sales    10.0
Name: apple, dtype: float64

In [41]:
df.loc['banana']

price     0.5
sales    15.0
Name: banana, dtype: float64

In [42]:
# get both apple and banana

df.loc[['apple', 'banana']]

Unnamed: 0,price,sales
apple,1.0,10
banana,0.5,15


In [44]:
# What is the mean price for all products?

df['price'].mean()

np.float64(1.625)

In [46]:
# when the median is lower than the mean, it's usually because one or more values
# are much higher than the rest, and are "pulling up" the mean

df['price'].median()   

np.float64(1.5)

In [47]:
df['price']

apple       1.0
banana      0.5
mushroom    3.0
pepper      2.0
Name: price, dtype: float64

In [48]:
# AM asks: why do we use [] after .loc, and not ()?
# isn't it a method?

# official answer: no, it's not a method

# to allow us to use a slice, they used []
# slice syntax (first:second) only works inside of [], not ()

df.loc['banana':'mushroom']

Unnamed: 0,price,sales
banana,0.5,15
mushroom,3.0,8


In [50]:
df.loc[slice('banana','mushroom')]

Unnamed: 0,price,sales
banana,0.5,15
mushroom,3.0,8
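
One thing worth noticing in the output above: a `.loc` slice includes both endpoints, unlike ordinary Python slices. A small sketch, rebuilding the grocery data frame so it runs on its own, and contrasting the label slice with a positional `.iloc` slice:

```python
from pandas import DataFrame

df = DataFrame([[1, 10], [0.5, 15], [3, 8], [2, 20]],
               columns=['price', 'sales'],
               index='apple banana mushroom pepper'.split())

by_label = df.loc['banana':'mushroom']   # label slice: 'mushroom' IS included
by_position = df.iloc[1:2]               # position slice: stops before position 2

print(by_label)
print(by_position)
```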


# Modifying our data frame -- adding and removing values

1. We can add a new column by assigning to it. (If the column already exists, we replace it.)
2. We can add a new row by assigning to it via `.loc`. Here, if the index label already exists, the assignment overwrites the existing row (or every row with that label) rather than adding a second one.
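
A quick sketch of what assigning via `.loc` does, using a small hypothetical data frame: a label that isn't in the index yet appends a row, while a label that is already there replaces that row's values.

```python
from pandas import DataFrame

df = DataFrame([[1.0, 10], [0.5, 15]],
               columns=['price', 'sales'],
               index=['apple', 'banana'])

df.loc['cucumber'] = [0.25, 10]   # new label: a row is added
df.loc['apple'] = [1.5, 12]       # existing label: the row is replaced

print(df)
```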

In [53]:
# the index of price and the index of sales are identical, so we 
# can multiply one by the other

df['revenue'] = df['price'] * df['sales'] # assign based on an operation
df

Unnamed: 0,price,sales,revenue
apple,1.0,10,10.0
banana,0.5,15,7.5
mushroom,3.0,8,24.0
pepper,2.0,20,40.0


In [54]:
df['stuff']= [10, 20, 30, 40]   # assign directly
df

Unnamed: 0,price,sales,revenue,stuff
apple,1.0,10,10.0,10
banana,0.5,15,7.5,20
mushroom,3.0,8,24.0,30
pepper,2.0,20,40.0,40


In [55]:
df['other_things']= 9  # if you assign a scalar value, it'll be assigned to all rows
df

Unnamed: 0,price,sales,revenue,stuff,other_things
apple,1.0,10,10.0,10,9
banana,0.5,15,7.5,20,9
mushroom,3.0,8,24.0,30,9
pepper,2.0,20,40.0,40,9


In [56]:
# if it's not scalar, then you must assign the right number of values

df['this_will_not_work'] = [10, 20]

ValueError: Length of values (2) does not match length of index (4)

In [57]:
df

Unnamed: 0,price,sales,revenue,stuff,other_things
apple,1.0,10,10.0,10,9
banana,0.5,15,7.5,20,9
mushroom,3.0,8,24.0,30,9
pepper,2.0,20,40.0,40,9


In [59]:
# make sure to use df.loc when you add a row!

df.loc['cucumber'] = [0.25, 10, 2.50, 25, 9]
df

Unnamed: 0,price,sales,revenue,stuff,other_things
apple,1.0,10.0,10.0,10.0,9.0
banana,0.5,15.0,7.5,20.0,9.0
mushroom,3.0,8.0,24.0,30.0,9.0
pepper,2.0,20.0,40.0,40.0,9.0
cucumber,0.25,10.0,2.5,25.0,9.0


In [60]:
# how can we remove a row or column? We can use the "df.drop" method

df.drop('cucumber')  # this returns a new data frame, based on df, without the "cucumber" row

Unnamed: 0,price,sales,revenue,stuff,other_things
apple,1.0,10.0,10.0,10.0,9.0
banana,0.5,15.0,7.5,20.0,9.0
mushroom,3.0,8.0,24.0,30.0,9.0
pepper,2.0,20.0,40.0,40.0,9.0


In [61]:
df

Unnamed: 0,price,sales,revenue,stuff,other_things
apple,1.0,10.0,10.0,10.0,9.0
banana,0.5,15.0,7.5,20.0,9.0
mushroom,3.0,8.0,24.0,30.0,9.0
pepper,2.0,20.0,40.0,40.0,9.0
cucumber,0.25,10.0,2.5,25.0,9.0


In [64]:
df = df.drop('cucumber')   # this is the way to do it!

In [65]:
df

Unnamed: 0,price,sales,revenue,stuff,other_things
apple,1.0,10.0,10.0,10.0,9.0
banana,0.5,15.0,7.5,20.0,9.0
mushroom,3.0,8.0,24.0,30.0,9.0
pepper,2.0,20.0,40.0,40.0,9.0


In [66]:
# you can pass a list of indexes to drop

df.drop(['apple', 'banana'])

Unnamed: 0,price,sales,revenue,stuff,other_things
mushroom,3.0,8.0,24.0,30.0,9.0
pepper,2.0,20.0,40.0,40.0,9.0


In [67]:
df

Unnamed: 0,price,sales,revenue,stuff,other_things
apple,1.0,10.0,10.0,10.0,9.0
banana,0.5,15.0,7.5,20.0,9.0
mushroom,3.0,8.0,24.0,30.0,9.0
pepper,2.0,20.0,40.0,40.0,9.0


In [68]:
df.drop('other_things') # by default, df.drop assumes you want to remove rows

KeyError: "['other_things'] not found in axis"

In [69]:
df.drop('other_things', axis='columns')    # pass axis='columns' to remove a column instead of a row

Unnamed: 0,price,sales,revenue,stuff
apple,1.0,10.0,10.0,10.0
banana,0.5,15.0,7.5,20.0
mushroom,3.0,8.0,24.0,30.0
pepper,2.0,20.0,40.0,40.0


In [71]:
df = df.drop(['stuff', 'other_things'], axis='columns') 
df

Unnamed: 0,price,sales,revenue
apple,1.0,10.0,10.0
banana,0.5,15.0,7.5
mushroom,3.0,8.0,24.0
pepper,2.0,20.0,40.0


# Exercise: Sales tax!

1. Building on your grocery-store data frame (or mine, if you want to use it), add a new column for the sales tax, which is 10% for each item. 
2. Add another column, total revenue, which will be the product of price, sales, and (added on) the sales tax.

In [72]:
df

Unnamed: 0,price,sales,revenue
apple,1.0,10.0,10.0
banana,0.5,15.0,7.5
mushroom,3.0,8.0,24.0
pepper,2.0,20.0,40.0


In [73]:
df['sales_tax'] = 0.1    # assign 10% sales tax to each item
df

Unnamed: 0,price,sales,revenue,sales_tax
apple,1.0,10.0,10.0,0.1
banana,0.5,15.0,7.5,0.1
mushroom,3.0,8.0,24.0,0.1
pepper,2.0,20.0,40.0,0.1


In [75]:
df['total_revenue'] = df['revenue'] + (df['sales_tax'] * df['price'] * df['sales'])
df

Unnamed: 0,price,sales,revenue,sales_tax,total_revenue
apple,1.0,10.0,10.0,0.1,11.0
banana,0.5,15.0,7.5,0.1,8.25
mushroom,3.0,8.0,24.0,0.1,26.4
pepper,2.0,20.0,40.0,0.1,44.0


In [78]:
# to rename a column, use the "rename" method, with a dict handed to the "columns"
# keyword argument

df.rename(columns={'revenue':'gross_revenue'})

Unnamed: 0,price,sales,gross_revenue,sales_tax,total_revenue
apple,1.0,10.0,10.0,0.1,11.0
banana,0.5,15.0,7.5,0.1,8.25
mushroom,3.0,8.0,24.0,0.1,26.4
pepper,2.0,20.0,40.0,0.1,44.0
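
Like `drop`, `rename` returns a new data frame and leaves the original alone, so the result needs to be assigned back if we want to keep it. A minimal sketch with a small stand-in data frame:

```python
from pandas import DataFrame

df = DataFrame({'price': [1.0, 0.5], 'revenue': [10.0, 7.5]},
               index=['apple', 'banana'])

renamed = df.rename(columns={'revenue': 'gross_revenue'})

print(list(df.columns))        # the original column names are unchanged
print(list(renamed.columns))   # the new data frame has the new name

df = renamed                   # assign back to keep the change
```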


# Next up

- Useful methods (some of which you already know)
- Boolean indexes and our data frames

# Pandas rule of thumb #1 

Anywhere you can pass a single string (typically for an index or a column name), you can pass a list of strings (for multiple rows or multiple columns)

# Pandas rule of thumb #2

Almost anything you can do to a series, you can also do to a data frame.

- If the series version of the method returns a single (scalar) value, then the result from a data frame will be a series.
- If the series version of the method returns a series, then the result from a data frame will be a new data frame.

In [80]:
s = Series([10, 20, 30, 40, 50, 60, 70, 80],
          index=list('abcdefgh'))

# use a dict of lists to create a data frame; the key is the column name
# and the value is a list of values
df = DataFrame({'numbers':[10, 20, 30, 40, 50, 60, 70, 80],
               'times_10':[100, 200, 300, 400, 500, 600, 700,800]},
               index=list('abcdefgh')
              )


In [81]:
s

a    10
b    20
c    30
d    40
e    50
f    60
g    70
h    80
dtype: int64

In [82]:
df

Unnamed: 0,numbers,times_10
a,10,100
b,20,200
c,30,300
d,40,400
e,50,500
f,60,600
g,70,700
h,80,800


In [83]:
# aggregation methods

s.mean()

np.float64(45.0)

In [84]:
df.mean()

numbers      45.0
times_10    450.0
dtype: float64

In [85]:
s.min()

np.int64(10)

In [86]:
s.max()

np.int64(80)

In [87]:
df.min()

numbers      10
times_10    100
dtype: int64

In [88]:
df.max()

numbers      80
times_10    800
dtype: int64

In [89]:
s.describe()

count     8.000000
mean     45.000000
std      24.494897
min      10.000000
25%      27.500000
50%      45.000000
75%      62.500000
max      80.000000
dtype: float64

In [90]:
df.describe()

Unnamed: 0,numbers,times_10
count,8.0,8.0
mean,45.0,450.0
std,24.494897,244.948974
min,10.0,100.0
25%,27.5,275.0
50%,45.0,450.0
75%,62.5,625.0
max,80.0,800.0


In [91]:
s.head()  # first 5 values

a    10
b    20
c    30
d    40
e    50
dtype: int64

In [92]:
s.tail() # final 5 values

d    40
e    50
f    60
g    70
h    80
dtype: int64

In [93]:
s.tail(3)  # final 3 values

f    60
g    70
h    80
dtype: int64

In [94]:
# the same thing applies to a data frame!

In [95]:
df.head()

Unnamed: 0,numbers,times_10
a,10,100
b,20,200
c,30,300
d,40,400
e,50,500


In [96]:
df.tail()  

Unnamed: 0,numbers,times_10
d,40,400
e,50,500
f,60,600
g,70,700
h,80,800


In [97]:
df.tail(3)

Unnamed: 0,numbers,times_10
f,60,600
g,70,700
h,80,800


In [98]:
# what if we have some NaN values?

import numpy as np

s = Series([10, 20, np.nan, 40, 50, np.nan, 70, 80],
          index=list('abcdefgh'))

# use a dict of lists to create a data frame; the key is the column name
# and the value is a list of values
df = DataFrame({'numbers':[10, 20, np.nan, 40, 50, np.nan, 70, 80],
               'times_10':[100, 200, 300, np.nan, 500, 600, 700, np.nan]},
               index=list('abcdefgh')
              )


In [99]:
s

a    10.0
b    20.0
c     NaN
d    40.0
e    50.0
f     NaN
g    70.0
h    80.0
dtype: float64

In [100]:
df

Unnamed: 0,numbers,times_10
a,10.0,100.0
b,20.0,200.0
c,,300.0
d,40.0,
e,50.0,500.0
f,,600.0
g,70.0,700.0
h,80.0,


In [101]:
# how can we know that each column is a separate series, with its own dtype?

df.dtypes

numbers     float64
times_10    float64
dtype: object

In [102]:
s.dtypes

dtype('float64')

In [103]:
s.dtype

dtype('float64')

In [104]:
# to remove NaNs in a series, we can use dropna
s.dropna()

a    10.0
b    20.0
d    40.0
e    50.0
g    70.0
h    80.0
dtype: float64

In [105]:
s

a    10.0
b    20.0
c     NaN
d    40.0
e    50.0
f     NaN
g    70.0
h    80.0
dtype: float64

In [106]:
# when we invoke dropna on a data frame, it removes any ROW in which there is a NaN 

df.dropna()

Unnamed: 0,numbers,times_10
a,10.0,100.0
b,20.0,200.0
e,50.0,500.0
g,70.0,700.0


In [107]:
# let's not drop *everything*
# let's keep any row in which we have at least 1 non-NaN value

df.dropna(thresh=1)   # this means: if we have 1 good value, we're fine

Unnamed: 0,numbers,times_10
a,10.0,100.0
b,20.0,200.0
c,,300.0
d,40.0,
e,50.0,500.0
f,,600.0
g,70.0,700.0
h,80.0,


In [108]:
# separately, I can say: I only care about NaN in the numbers column.
# if there's NaN in numbers, drop the row. Otherwise, keep it.

df.dropna(subset='numbers')

Unnamed: 0,numbers,times_10
a,10.0,100.0
b,20.0,200.0
d,40.0,
e,50.0,500.0
g,70.0,700.0
h,80.0,
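
A short sketch of how these `dropna` options relate, on a small made-up data frame. With two columns, `thresh=2` demands two non-NaN values, which amounts to the default behavior; `how='all'` drops only the rows that are entirely NaN.

```python
import numpy as np
from pandas import DataFrame

df = DataFrame({'numbers': [10, np.nan, np.nan, 40],
                'times_10': [100, 200, np.nan, np.nan]},
               index=list('abcd'))

strict = df.dropna()                 # drop rows containing any NaN
at_least_two = df.dropna(thresh=2)   # keep rows with 2+ non-NaN values
all_missing = df.dropna(how='all')   # drop rows where every value is NaN

print(strict)
print(all_missing)
```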


In [109]:
# instead, we can also use fillna

s.fillna(999)

a     10.0
b     20.0
c    999.0
d     40.0
e     50.0
f    999.0
g     70.0
h     80.0
dtype: float64

In [110]:
df.fillna(999)

Unnamed: 0,numbers,times_10
a,10.0,100.0
b,20.0,200.0
c,999.0,300.0
d,40.0,999.0
e,50.0,500.0
f,999.0,600.0
g,70.0,700.0
h,80.0,999.0


In [112]:
# if we want, we can ask Pandas to replace NaN with something different for each column!

df.fillna({'numbers':999, 'times_10':888})

Unnamed: 0,numbers,times_10
a,10.0,100.0
b,20.0,200.0
c,999.0,300.0
d,40.0,888.0
e,50.0,500.0
f,999.0,600.0
g,70.0,700.0
h,80.0,888.0


In [113]:
# remember that when we invoke "mean" on a data frame, we get back a series
# in which the index contains the column names 

df.mean()

numbers      45.0
times_10    400.0
dtype: float64

In [114]:
df.fillna(df.mean())  # now fillna will use the values for each column, as specified in the result

Unnamed: 0,numbers,times_10
a,10.0,100.0
b,20.0,200.0
c,45.0,300.0
d,40.0,400.0
e,50.0,500.0
f,45.0,600.0
g,70.0,700.0
h,80.0,400.0


In [115]:
df.interpolate()

Unnamed: 0,numbers,times_10
a,10.0,100.0
b,20.0,200.0
c,30.0,300.0
d,40.0,400.0
e,50.0,500.0
f,60.0,600.0
g,70.0,700.0
h,80.0,700.0
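
By default, `interpolate` fills each NaN linearly between its neighbors, treating the rows as evenly spaced. A trailing NaN has no later neighbor, so it takes the last valid value, which is why `h` shows 700.0 above. A sketch on one column:

```python
import numpy as np
from pandas import Series

times_10 = Series([100, 200, 300, np.nan, 500, 600, 700, np.nan],
                  index=list('abcdefgh'))

filled = times_10.interpolate()

print(filled['d'])   # halfway between 300 and 500
print(filled['h'])   # nothing after it, so the last valid value is reused
```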


In [116]:
# the "shape" attribute (not a method!) gives us a 2-element tuple with the
# number of rows and the number of columns

df.shape

(8, 2)

In [117]:
# the "info" method gives us a summary of a data frame

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, a to h
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   numbers   6 non-null      float64
 1   times_10  6 non-null      float64
dtypes: float64(2)
memory usage: 492.0+ bytes


# Exercises: Temperatures

1. Define a data frame with the 10-day forecast for low and high temperatures in your city. (There will be two columns and 10 rows, one for each day.) When entering the data, replace some of the values with `np.nan`.
2. Calculate the mean for highs and lows.
3. If you interpolate the values, how close are they to the originals?


In [119]:
df = DataFrame({'high': [30, 25, 26, 32, 39, 33, 29,28, 29, 29],
               'lows':[17, 16, 14, 23, 23, 18, 16, 16, 16, 16]},
              index='Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu'.split())
df

Unnamed: 0,high,lows
Tue,30,17
Wed,25,16
Thu,26,14
Fri,32,23
Sat,39,23
Sun,33,18
Mon,29,16
Tue,28,16
Wed,29,16
Thu,29,16


In [120]:
df.shape

(10, 2)

In [121]:
df = DataFrame({'high': [30, 25, 26, 32, np.nan, 33, 29,np.nan, 29, 29],
               'lows':[17, 16, np.nan, 23, 23, 18, np.nan, 16, 16, 16]},
              index='Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu'.split())
df

Unnamed: 0,high,lows
Tue,30.0,17.0
Wed,25.0,16.0
Thu,26.0,
Fri,32.0,23.0
Sat,,23.0
Sun,33.0,18.0
Mon,29.0,
Tue,,16.0
Wed,29.0,16.0
Thu,29.0,16.0


In [122]:
df.mean()

high    29.125
lows    18.125
dtype: float64

In [123]:
df.interpolate()

Unnamed: 0,high,lows
Tue,30.0,17.0
Wed,25.0,16.0
Thu,26.0,19.5
Fri,32.0,23.0
Sat,32.5,23.0
Sun,33.0,18.0
Mon,29.0,17.0
Tue,29.0,16.0
Wed,29.0,16.0
Thu,29.0,16.0
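
To answer the third question, we can line the interpolated values up against the numbers we replaced with NaN. A sketch, assuming the original forecast from the earlier cell is kept in a separate data frame (the names `original` and `with_nan` are just for this example):

```python
import numpy as np
from pandas import DataFrame

days = 'Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu'.split()

original = DataFrame({'high': [30, 25, 26, 32, 39, 33, 29, 28, 29, 29],
                      'lows': [17, 16, 14, 23, 23, 18, 16, 16, 16, 16]},
                     index=days)
with_nan = DataFrame({'high': [30, 25, 26, 32, np.nan, 33, 29, np.nan, 29, 29],
                      'lows': [17, 16, np.nan, 23, 23, 18, np.nan, 16, 16, 16]},
                     index=days)

# absolute error of each interpolated value; zero where nothing was missing
error = (with_nan.interpolate() - original).abs()

print(error.max())
```

Here the worst miss is Saturday's high: the interpolated 32.5 is well below the original 39, because linear interpolation cannot reproduce a one-day spike.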


In [125]:
# what about using boolean indexes?

# (1) we can treat a column as a series, and use a boolean index with it
# this works just fine, getting the above-average values from df['high']

df['high'].loc[df['high'] > df['high'].mean()]

Tue    30.0
Fri    32.0
Sun    33.0
Name: high, dtype: float64

In [126]:
# we can run .loc on the entire data frame, and can feed it the boolean
# series that we get

# give me rows from df where the high is above average

df.loc[df['high'] > df['high'].mean()]

Unnamed: 0,high,lows
Tue,30.0,17.0
Fri,32.0,23.0
Sun,33.0,18.0


In [129]:
# show me the rows of df
# where the diff between high + low temps > 5

df.loc[df['high'] - df['lows'] > 5]

Unnamed: 0,high,lows
Tue,30.0,17.0
Wed,25.0,16.0
Fri,32.0,23.0
Sun,33.0,18.0
Wed,29.0,16.0
Thu,29.0,16.0


# Exercise: Temp comparisons

1. Create two new columns, `high_f` and `low_f` (or if you did it in Fahrenheit already, do the opposite, into Celsius).
2. Find the number of days in which the difference between the high + low temperatures in Fahrenheit is > 10 degrees.

In [130]:
df

Unnamed: 0,high,lows
Tue,30.0,17.0
Wed,25.0,16.0
Thu,26.0,
Fri,32.0,23.0
Sat,,23.0
Sun,33.0,18.0
Mon,29.0,
Tue,,16.0
Wed,29.0,16.0
Thu,29.0,16.0


In [132]:
# (0°C × 9/5) + 32 = 32°F

df['high_f'] = df['high'] * 9/5 + 32
df['low_f'] = df['lows'] * 9/5 + 32

df


Unnamed: 0,high,lows,high_f,low_f
Tue,30.0,17.0,86.0,62.6
Wed,25.0,16.0,77.0,60.8
Thu,26.0,,78.8,
Fri,32.0,23.0,89.6,73.4
Sat,,23.0,,73.4
Sun,33.0,18.0,91.4,64.4
Mon,29.0,,84.2,
Tue,,16.0,,60.8
Wed,29.0,16.0,84.2,60.8
Thu,29.0,16.0,84.2,60.8


In [136]:
# Find the number of days in which the difference between the high + low temperatures 
# in Fahrenheit is > 10 degrees.

df.loc[df['high_f'] - df['low_f'] > 10].count()

high      6
lows      6
high_f    6
low_f     6
dtype: int64
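
`count` reports the non-NaN values in each column, so we get the same 6 four times. An alternative sketch: sum the boolean series itself, since each `True` counts as 1. Rows with a NaN compare as `False`, so they are not counted.

```python
import numpy as np
from pandas import DataFrame

df = DataFrame({'high': [30, 25, 26, 32, np.nan, 33, 29, np.nan, 29, 29],
                'lows': [17, 16, np.nan, 23, 23, 18, np.nan, 16, 16, 16]},
               index='Tue Wed Thu Fri Sat Sun Mon Tue Wed Thu'.split())
df['high_f'] = df['high'] * 9/5 + 32
df['low_f'] = df['lows'] * 9/5 + 32

mask = df['high_f'] - df['low_f'] > 10   # boolean series, one entry per day
n_days = mask.sum()                      # number of True values

print(n_days)
```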

# Next up

1. Using `.loc` with data frames -- even better than before!
2. Reading data from CSV and other sources


In [137]:
df

Unnamed: 0,high,lows,high_f,low_f
Tue,30.0,17.0,86.0,62.6
Wed,25.0,16.0,77.0,60.8
Thu,26.0,,78.8,
Fri,32.0,23.0,89.6,73.4
Sat,,23.0,,73.4
Sun,33.0,18.0,91.4,64.4
Mon,29.0,,84.2,
Tue,,16.0,,60.8
Wed,29.0,16.0,84.2,60.8
Thu,29.0,16.0,84.2,60.8


In [138]:
# I want to retrieve the row for "Fri"

df.loc['Fri']

high      32.0
lows      23.0
high_f    89.6
low_f     73.4
Name: Fri, dtype: float64

In [139]:
# what if I want the value of low_f for Friday?

df.loc['Fri'].loc['low_f']

np.float64(73.4)

In [140]:
# we could also do this:
df.loc['Fri']['low_f']

np.float64(73.4)

# Don't do this!

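1. It's inefficient, because first we retrieve a row from the data frame as a temporary series, and then we retrieve a value from that series.
2. You can get yourself into trouble, especially when assigning: the assignment can end up modifying the temporary series rather than the data frame itself.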

The other option? To realize that `.loc` can take two arguments! The first argument is what I call a "row selector," and we talked earlier about many of the possibilities for that row selector:

1. One index
2. A list of indexes ("fancy indexing")
3. A boolean series
4. An expression that returns a boolean series

We can also pass a second argument, which I call the "column selector." We can pass very similar things:

1. One column name
2. A list of column names ("fancy columning?")
3. A boolean series (very rare and a bit weird)
4. An expression that returns a boolean series (even rarer and weirder)

In [141]:
df.loc['Fri', 'low_f']

np.float64(73.4)

In [142]:
df.loc['Fri', ['low_f', 'high_f']]

low_f     73.4
high_f    89.6
Name: Fri, dtype: float64

In [143]:
# we use empty, otherwise useless parentheses to give us the chance to use multiple lines

(
    df
    .loc[
            'Fri',                 # row selector
            ['low_f', 'high_f']    # column selector
    ]
)

low_f     73.4
high_f    89.6
Name: Fri, dtype: float64

In [144]:
columns_wanted = ['low_f', 'high_f']

(
    df
    .loc[
            'Fri',                 # row selector
            columns_wanted    # column selector
    ]
)

low_f     73.4
high_f    89.6
Name: Fri, dtype: float64
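
The same two-selector form is what we want for assignment, which is where chaining `[]` tends to cause trouble. A hedged sketch on a small stand-in data frame; what the chained form does depends on the pandas version and settings, so it appears only in a comment:

```python
from pandas import DataFrame

df = DataFrame({'high_f': [89.6, 91.4], 'low_f': [73.4, 64.4]},
               index=['Fri', 'Sun'])

# one .loc call with a row selector and a column selector: this updates df
df.loc['Fri', 'low_f'] = 70.0

# by contrast, df.loc['Fri']['low_f'] = 70.0 assigns into a temporary series,
# and whether df itself changes depends on the pandas version and settings

print(df.loc['Fri', 'low_f'])
```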

# Exercise: Using `.loc`

1. Retrieve all rows (from the high-low temp data frame) where the high temp is within 5 degrees of the low temp, and show only the low-temp column.
2. Retrieve all rows where the low is greater than the mean, and show the high-temp column only.

In [145]:
df

Unnamed: 0,high,lows,high_f,low_f
Tue,30.0,17.0,86.0,62.6
Wed,25.0,16.0,77.0,60.8
Thu,26.0,,78.8,
Fri,32.0,23.0,89.6,73.4
Sat,,23.0,,73.4
Sun,33.0,18.0,91.4,64.4
Mon,29.0,,84.2,
Tue,,16.0,,60.8
Wed,29.0,16.0,84.2,60.8
Thu,29.0,16.0,84.2,60.8


In [150]:
# Retrieve all rows (from the high-low temp data frame) where the high temp is within 5 degrees
# of the low temp, and show only the low-temp column.
# (In this data no day is within 5 degrees, so we use a threshold of 10 instead.)

df.loc[
    df['high'] - df['lows'] < 10,    # row selector
    'lows'                           # column selector
]

Wed    16.0
Fri    23.0
Name: lows, dtype: float64

In [151]:
df.loc[
    df['high'] - df['lows'] < 10,    # row selector
    ['high', 'lows']                           # column selector
]

Unnamed: 0,high,lows
Wed,25.0,16.0
Fri,32.0,23.0


In [156]:
# Retrieve all rows where the low is greater than the mean, and show the high-temp column only.

df.loc[
    df['lows'] > df['lows'].mean()   # row selector
    ,
    'high'
]

Fri    32.0
Sat     NaN
Name: high, dtype: float64

# Getting data from other sources

Most of the time, we're going to want to analyze data that has been produced by some other program, and put into a format that we can read. In other words, we'll want to take a file and turn it into a data frame.

How can we do that? Pandas provides a large number of functions, all starting with `read_`, that read data from a bunch of different formats and create a data frame.

Because we are not invoking a method on a data frame, but rather creating a data frame, these functions aren't run on `df`, the data frame itself, but rather via `pd`, the Pandas module name. So you can say `pd.read_csv` to read a CSV file.

CSV ("comma-separated values" or "character-separated values") is a very common, largely unspecified format. The idea is that each line in the file is one record and it can contain a number of columns. When Pandas reads a CSV file into memory, it examines each column and figures out what dtype each should be -- `float64`, `int64`, or `object`, where the last of these means that it'll store the values as strings in Python.

We're going to start by talking about how to read data from CSV files. We'll then go on to other formats, too.
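
A small sketch of that dtype inference. To keep it self-contained, it reads a few made-up lines from an in-memory `io.StringIO` object instead of a file; `read_csv` accepts either.

```python
import io
import pandas as pd

csv_text = """name,passengers,distance
a,1,1.63
b,2,0.46
c,1,2.13
"""

df = pd.read_csv(io.StringIO(csv_text))

# one dtype per column: text, whole numbers, and numbers with decimals
print(df.dtypes)
```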

In [157]:
# taxi.csv contains just under 10,000 NYC taxi rides (the rows shown below are from June 2015)


burrito_current.csv	   languages.csv  titanic3.csv
celebrity_deaths_2016.csv  taxi.csv


In [158]:
filename = 'taxi.csv'

df = pd.read_csv(filename)

In [159]:
!head $filename

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.954429626464844,40.764141082763672,1,N,-73.974754333496094,40.754093170166016,2,17,0,0.5,0,0,0.3,17.8
2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,.46,-73.971443176269531,40.758941650390625,1,N,-73.978538513183594,40.761909484863281,1,6.5,0,0.5,1,0,0.3,8.3
2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,.87,-73.978111267089844,40.738433837890625,1,N,-73.990272521972656,40.745437622070313,1,8,0,0.5,2.2,0,0.3,11
2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892333984375,40.773529052734375,1,N,-73.971527099609375,40.760330200195312,1,13.5,0,0.5,2.86,0,0.3,17.16
1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.40,-73.979087829589844,40.776771545410156,1,N,-73.982

In [160]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3


# Exercise: Taxi queries

1. Download the intro data-science exercise file from `files.lerner.co.il`.
2. Load `taxi.csv` into a data frame.
3. Find the mean `trip_distance`.
4. Were there ever trips with a 0 distance (`trip_distance`)? If so, how much did people pay (`total_amount`)?
5. Were there ever trips with <= 0 `total_amount`? If so, indicate the number of passengers and how far they went.

In [161]:
filename = 'taxi.csv'

df = pd.read_csv(filename)

df.dtypes

VendorID                   int64
tpep_pickup_datetime      object
tpep_dropoff_datetime     object
passenger_count            int64
trip_distance            float64
pickup_longitude         float64
pickup_latitude          float64
RateCodeID                 int64
store_and_fwd_flag        object
dropoff_longitude        float64
dropoff_latitude         float64
payment_type               int64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
improvement_surcharge    float64
total_amount             float64
dtype: object

In [162]:
df['trip_distance'].mean()

np.float64(3.1585108510851083)

In [166]:
# Were there ever trips with a 0 distance (trip_distance)? If so, how much did people pay (total_amount)?

df.loc[
    df['trip_distance'] == 0    # row selector
    ,
    'total_amount'    # column selector
].mean()

np.float64(31.581940298507465)

In [167]:
# were there any taxi rides without people?
# check the passenger_count to find out

df.loc[
    df['passenger_count'] == 0    # row selector
    ,
    'total_amount'    # column selector
].mean()

np.float64(25.57)

In [168]:
df.loc[
    df['passenger_count'] == 0    # row selector
    ,
    'total_amount'    # column selector
].count()

np.int64(2)

In [172]:
# Were there ever trips with <= 0 total_amount? If so, indicate the number of passengers 
# and how far they went.

df.loc[
    df['total_amount'] <= 0    # row selector
    ,
    ['passenger_count', 'trip_distance', 'total_amount']    # column selector
]

Unnamed: 0,passenger_count,trip_distance,total_amount
2903,1,0.0,-3.3
5719,1,0.89,-7.8
9276,1,0.93,-7.3


# Next up

- Other formats in general
- More CSV (especially options you can pass)
- Reading Excel
- Reading HTML
- Make sure we talk about the setting error from [][]

In [174]:
# Pandas can read from a *lot* of different formats. All of the reading functions are pd.read_*

In [175]:
pd.read_

AttributeError: module 'pandas' has no attribute 'read_'

In [None]:
df.to_

In [176]:
filename = 'taxi.csv'

df = pd.read_csv(filename)

df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3


In [177]:
df.shape

(9999, 19)

In [178]:
# we can pass a number of keyword arguments (i.e., name=value) to read_csv
# and that will customize what is read, and how our data frame is handled

df = pd.read_csv(filename,
                usecols=['passenger_count', 'trip_distance', 'total_amount'])

df.head()

Unnamed: 0,passenger_count,trip_distance,total_amount
0,1,1.63,17.8
1,1,0.46,8.3
2,1,0.87,11.0
3,1,2.13,17.16
4,1,1.4,10.3
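`usecols` doesn't have to be a list of names. It can also be a function that gets each column name and returns `True` for the ones to keep. Here is a small sketch of both forms; it reads from an in-memory string (a made-up stand-in for `taxi.csv`) so that it runs without the file.

```python
import io
import pandas as pd

# A tiny, made-up stand-in for taxi.csv
csv_text = """VendorID,passenger_count,trip_distance,total_amount
2,1,1.63,17.8
2,1,0.46,8.3
1,4,11.9,73.84
"""

# usecols as a list of column names
by_name = pd.read_csv(io.StringIO(csv_text),
                      usecols=['passenger_count', 'total_amount'])

# usecols as a function, called once per column name
by_func = pd.read_csv(io.StringIO(csv_text),
                      usecols=lambda name: name.endswith(('_count', '_amount')))

print(by_name.columns.tolist())
print(by_func.columns.tolist())
```

Reading fewer columns also means using less memory, which matters with large files.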


# CSV -- what character is our delimiter?

Traditionally, CSV stood for "comma-separated values," meaning that fields were separated by commas. Many people prefer to use other characters as the separator, such as tabs or `;`.

We can pass a `sep` keyword argument to `read_csv`, and that'll allow us to treat something else as the separator.
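As a quick sketch of why `sep` matters, here is semicolon-separated text read twice, once with the default separator and once with `sep=';'`. The data is made up and read from a string rather than a file.

```python
import io
import pandas as pd

text = """passenger_count;trip_distance;total_amount
1;1.63;17.8
1;0.46;8.3
"""

# With the default separator (a comma), each line is read as one big field
wrong = pd.read_csv(io.StringIO(text))

# Telling read_csv about the real separator gives us three columns
right = pd.read_csv(io.StringIO(text), sep=';')

print(wrong.shape)
print(right.shape)
```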

In [179]:
# let's write the taxi data to disk, but using tabs ('\t') as the delimiter

df.to_csv('mytaxi.csv', sep='\t')

In [180]:
!head 'taxi.csv'

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.954429626464844,40.764141082763672,1,N,-73.974754333496094,40.754093170166016,2,17,0,0.5,0,0,0.3,17.8
2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,.46,-73.971443176269531,40.758941650390625,1,N,-73.978538513183594,40.761909484863281,1,6.5,0,0.5,1,0,0.3,8.3
2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,.87,-73.978111267089844,40.738433837890625,1,N,-73.990272521972656,40.745437622070313,1,8,0,0.5,2.2,0,0.3,11
2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892333984375,40.773529052734375,1,N,-73.971527099609375,40.760330200195312,1,13.5,0,0.5,2.86,0,0.3,17.16
1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.40,-73.979087829589844,40.776771545410156,1,N,-73.982

In [181]:
!head 'mytaxi.csv'

	passenger_count	trip_distance	total_amount
0	1	1.63	17.8
1	1	0.46	8.3
2	1	0.87	11.0
3	1	2.13	17.16
4	1	1.4	10.3
5	1	1.4	10.55
6	1	1.8	16.3
7	4	11.9	73.84
8	1	1.27	15.8


In [182]:
# let's try to read mytaxi.csv back into Pandas!

pd.read_csv('mytaxi.csv')

Unnamed: 0,\tpassenger_count\ttrip_distance\ttotal_amount
0,0\t1\t1.63\t17.8
1,1\t1\t0.46\t8.3
2,2\t1\t0.87\t11.0
3,3\t1\t2.13\t17.16
4,4\t1\t1.4\t10.3
...,...
9994,9994\t1\t2.7\t12.3
9995,9995\t1\t4.5\t20.3
9996,9996\t1\t5.59\t22.3
9997,9997\t6\t1.54\t7.8


In [183]:
pd.read_csv('mytaxi.csv', sep='\t')

Unnamed: 0.1,Unnamed: 0,passenger_count,trip_distance,total_amount
0,0,1,1.63,17.80
1,1,1,0.46,8.30
2,2,1,0.87,11.00
3,3,1,2.13,17.16
4,4,1,1.40,10.30
...,...,...,...,...
9994,9994,1,2.70,12.30
9995,9995,1,4.50,20.30
9996,9996,1,5.59,22.30
9997,9997,6,1.54,7.80
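The extra `Unnamed: 0` column is the index that `to_csv` wrote to the file. When we read the file back, Pandas treats it as a regular column and adds a fresh index. There are two common ways to avoid that, sketched here with a small made-up data frame and in-memory buffers instead of files.

```python
import io
import pandas as pd

df = pd.DataFrame({'passenger_count': [1, 1, 4],
                   'trip_distance': [1.63, 0.46, 11.9]})

# Option 1: leave the index out of the file when writing
buf = io.StringIO()
df.to_csv(buf, sep='\t', index=False)
no_index = pd.read_csv(io.StringIO(buf.getvalue()), sep='\t')

# Option 2: write the index, then use the first column as the index when reading
buf = io.StringIO()
df.to_csv(buf, sep='\t')
with_index = pd.read_csv(io.StringIO(buf.getvalue()), sep='\t', index_col=0)

print(no_index.columns.tolist())
print(with_index.columns.tolist())
```

Option 1 makes sense when the index is just the default 0, 1, 2, ... numbering. Option 2 is the one to use when the index carries real information.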


In [184]:
pd.read_csv('mytaxi.csv', sep='\t', usecols=['passenger_count'])

Unnamed: 0,passenger_count
0,1
1,1
2,1
3,1
4,1
...,...
9994,1
9995,1
9996,1
9997,6


In [185]:
# if one of the columns on disk can/should be our index, how can we indicate that?
# (it doesn't need to be the first column, either!)

# we can pass the index_col keyword argument, telling Pandas which column to use as an index

df = pd.read_csv('taxi.csv',
            index_col='tpep_pickup_datetime')

df.head(10)

Unnamed: 0_level_0,VendorID,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
tpep_pickup_datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
2015-06-02 11:19:29,2,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
2015-06-02 11:19:30,2,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2015-06-02 11:19:31,2,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
2015-06-02 11:19:31,2,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
2015-06-02 11:19:32,1,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3
2015-06-02 11:19:33,1,2015-06-02 11:28:48,1,1.4,-73.944641,40.779465,1,N,-73.961365,40.771561,1,8.0,0.0,0.5,1.75,0.0,0.3,10.55
2015-06-02 11:19:34,1,2015-06-02 11:38:46,1,1.8,-73.992867,40.748211,1,N,-73.969772,40.748459,1,12.5,0.0,0.5,3.0,0.0,0.3,16.3
2015-06-02 11:19:35,1,2015-06-02 12:36:46,4,11.9,-73.863075,40.769253,1,N,-73.98671,40.761307,1,52.5,0.0,0.5,15.0,5.54,0.3,73.84
2015-06-02 11:19:36,2,2015-06-02 11:45:19,1,1.27,-73.991432,40.749306,1,N,-73.985062,40.759525,2,15.0,0.0,0.5,0.0,0.0,0.3,15.8
2015-06-02 11:19:38,1,2015-06-02 11:23:50,1,0.6,-73.970734,40.796207,1,N,-73.97747,40.789509,1,5.0,0.0,0.5,0.5,0.0,0.3,6.3


In [186]:
df.loc['2015-06-02 11:19:29']

Unnamed: 0_level_0,VendorID,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
tpep_pickup_datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
2015-06-02 11:19:29,2,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
2015-06-02 11:19:29,1,2015-06-02 11:26:06,1,1.1,-74.005524,40.725693,1,N,-73.992691,40.732224,2,6.5,0.0,0.5,0.0,0.0,0.3,7.3
2015-06-02 11:19:29,1,2015-06-02 11:22:47,1,0.4,-73.975044,40.790031,1,N,-73.975517,40.787106,2,4.0,0.0,0.5,0.0,0.0,0.3,4.8
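Notice that `.loc` returned several rows, because an index doesn't have to be unique. Also, with `index_col` alone the index values are strings. If we want real datetimes, one option (an addition of mine, not something we've covered yet) is to also pass `parse_dates=True`, which asks Pandas to parse the index. A sketch on a few made-up rows:

```python
import io
import pandas as pd

csv_text = """tpep_pickup_datetime,passenger_count,total_amount
2015-06-02 11:19:29,1,17.8
2015-06-02 11:19:29,1,7.3
2015-06-02 11:19:30,1,8.3
"""

by_time = pd.read_csv(io.StringIO(csv_text),
                      index_col='tpep_pickup_datetime',
                      parse_dates=True)   # parse the index as datetimes

# A label that appears more than once gives back all of the matching rows
same_second = by_time.loc['2015-06-02 11:19:29']
print(same_second)

# The index now has a datetime dtype, rather than strings
print(by_time.index.dtype)
```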


In [187]:
help(pd.read_csv)

Help on function read_csv in module pandas.io.parsers.readers:

read_csv(
    filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]',
    *,
    sep: 'str | None | lib.NoDefault' = <no_default>,
    delimiter: 'str | None | lib.NoDefault' = None,
    header: "int | Sequence[int] | None | Literal['infer']" = 'infer',
    names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>,
    index_col: 'IndexLabel | Literal[False] | None' = None,
    usecols: 'UsecolsArgType' = None,
    dtype: 'DtypeArg | None' = None,
    engine: 'CSVEngine | None' = None,
    converters: 'Mapping[Hashable, Callable] | None' = None,
    true_values: 'list | None' = None,
    false_values: 'list | None' = None,
    skipinitialspace: 'bool' = False,
    skiprows: 'list[int] | int | Callable[[Hashable], bool] | None' = None,
    skipfooter: 'int' = 0,
    nrows: 'int | None' = None,
    na_values: 'Hashable | Iterable[Hashable] | Mapping[Hashable, Iterable[Hashable]] | None' = None,
  

# Exercise: Reading and writing taxi data

1. Read in `taxi.csv`, but only keep the `passenger_count`, `trip_distance`, `tip_amount`, and `total_amount` columns.
2. Add a new column, `tip_percentage`, which is `tip_amount` divided by `total_amount`, and get the mean for that value.
3. Write all of this to a new file, `taxi_tips.csv`, with tabs between the fields.
4. Read all of this back in, and find the mean `trip_distance` where the `total_amount` was > 50.

In [189]:
df = pd.read_csv('taxi.csv',
           usecols=['passenger_count', 'trip_distance', 'tip_amount', 'total_amount'])
df

Unnamed: 0,passenger_count,trip_distance,tip_amount,total_amount
0,1,1.63,0.00,17.80
1,1,0.46,1.00,8.30
2,1,0.87,2.20,11.00
3,1,2.13,2.86,17.16
4,1,1.40,0.00,10.30
...,...,...,...,...
9994,1,2.70,0.00,12.30
9995,1,4.50,3.00,20.30
9996,1,5.59,0.00,22.30
9997,6,1.54,0.00,7.80


In [192]:
df['tip_percentage'] = df['tip_amount'] / df['total_amount']

In [193]:
df

Unnamed: 0,passenger_count,trip_distance,tip_amount,total_amount,tip_percentage
0,1,1.63,0.00,17.80,0.000000
1,1,0.46,1.00,8.30,0.120482
2,1,0.87,2.20,11.00,0.200000
3,1,2.13,2.86,17.16,0.166667
4,1,1.40,0.00,10.30,0.000000
...,...,...,...,...,...
9994,1,2.70,0.00,12.30,0.000000
9995,1,4.50,3.00,20.30,0.147783
9996,1,5.59,0.00,22.30,0.000000
9997,6,1.54,0.00,7.80,0.000000


In [194]:
df.to_csv('taxi_tips.csv', sep='\t')

In [197]:
pd.read_csv('taxi_tips.csv', sep='\t', index_col='Unnamed: 0')

Unnamed: 0,passenger_count,trip_distance,tip_amount,total_amount,tip_percentage
0,1,1.63,0.00,17.80,0.000000
1,1,0.46,1.00,8.30,0.120482
2,1,0.87,2.20,11.00,0.200000
3,1,2.13,2.86,17.16,0.166667
4,1,1.40,0.00,10.30,0.000000
...,...,...,...,...,...
9994,1,2.70,0.00,12.30,0.000000
9995,1,4.50,3.00,20.30,0.147783
9996,1,5.59,0.00,22.30,0.000000
9997,6,1.54,0.00,7.80,0.000000
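The cells above cover reading, adding the column, and the round trip to disk, but not the two means that the exercise asks for. Here is one way to sketch all four steps together. It uses a few made-up rows and an in-memory buffer in place of `taxi.csv` and `taxi_tips.csv`, so the numbers it prints are not the real answers.

```python
import io
import pandas as pd

# Made-up rows standing in for taxi.csv
csv_text = """passenger_count,trip_distance,tip_amount,total_amount
1,1.63,0.0,17.8
1,0.46,1.0,8.3
4,11.9,15.0,73.84
2,18.2,10.0,60.0
"""

# Step 1: read only the columns we need
df = pd.read_csv(io.StringIO(csv_text),
                 usecols=['passenger_count', 'trip_distance',
                          'tip_amount', 'total_amount'])

# Step 2: add the new column and get its mean
df['tip_percentage'] = df['tip_amount'] / df['total_amount']
mean_tip = df['tip_percentage'].mean()

# Step 3: write with tabs between the fields (to a buffer, not a real file)
buf = io.StringIO()
df.to_csv(buf, sep='\t', index=False)

# Step 4: read it back; the row selector is a boolean series,
# and the column selector is a single column name
df2 = pd.read_csv(io.StringIO(buf.getvalue()), sep='\t')
mean_long = df2.loc[df2['total_amount'] > 50, 'trip_distance'].mean()

print(mean_tip)
print(mean_long)
```

With the real file, replace the `io.StringIO` objects with `'taxi.csv'` and `'taxi_tips.csv'`.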


In [198]:
df = pd.read_excel('/Users/reuven/Downloads/CPILFESL.xlsx')

Unnamed: 0,FRED Graph Observations,Unnamed: 1,Unnamed: 2
0,"Federal Reserve Economic Data, Federal Reserve...",,
1,Link: https://fred.stlouisfed.org,,
2,Help: https://fredhelp.stlouisfed.org,,
3,This data may be copyrighted. Please refer to ...,,
4,File Created: 2025-05-12 2:29 pm CDT,,
5,,,
6,CPILFESL,Consumer Price Index for All Urban Consumers: ...,Data Updated: 2025-04-10
