# Agenda: CSV and data frames

1. Data frames in general
    - What are they?
    - Creating them (very simple version)
    - Retrieving from them
2. Working with CSV files
    - What options are there?
3. Retrieving from parts of a data frame
    - Applying `.loc` and boolean indexes to retrieving from a data frame

# Data frames

So far, we've been (mostly) using series. But most work in Pandas is done with a data frame:

- Two-dimensional table
- Rows (which we identify with the index)
- Columns (which we identify with column names)

Remember that every column in a data frame is a Pandas series:
- The index identifying the rows continues to identify elements of a series
- All of the series (columns) in the data frame share an index
- When we retrieve a column, then it's a (normal) series
- When we retrieve a row, Pandas basically creates a new series on the fly
- Every column has a dtype, whereas rows are a combination of whatever dtypes are in their columns

In [1]:
# it's rare to create a data frame from scratch
# but if we really want to create a data frame, we can do it with a list of lists, or a list of dicts,
# or even a dict of lists

# I'm just going to use a list of lists for my data frame

import pandas as pd
from pandas import Series, DataFrame
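
The comments above mention a list of dicts and a dict of lists, so here is a minimal sketch of both. The column names and values are made up for illustration; the point is only that both forms describe the same table.

```python
import pandas as pd

# Dict of lists: each key becomes a column name, each list becomes a column
df_from_dict = pd.DataFrame({'x': [10, 40], 'y': [20, 50]})

# List of dicts: each dict is one row, and the keys become column names
df_from_rows = pd.DataFrame([{'x': 10, 'y': 20}, {'x': 40, 'y': 50}])

# Both describe the same two-row, two-column table
same = df_from_dict.equals(df_from_rows)
```

In both cases Pandas assigns the default index (0, 1, ...) unless we pass `index=`.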

In [2]:
# Create a data frame with a list of lists
df = DataFrame([[10, 20, 30],
            [40, 50, 60], 
            [70, 80, 90], 
            [100, 110, 120]])

df

Unnamed: 0,0,1,2
0,10,20,30
1,40,50,60
2,70,80,90
3,100,110,120


In [3]:
# let's make this a bit more readable by giving both 
# an index and column names. Just as we can pass index= to 
# the Series class, we can pass columns= (as well) to the
# DataFrame class

df = DataFrame([[10, 20, 30],
            [40, 50, 60], 
            [70, 80, 90], 
            [100, 110, 120]],
              index=list('abcd'),
              columns=list('xyz'))

df

Unnamed: 0,x,y,z
a,10,20,30
b,40,50,60
c,70,80,90
d,100,110,120


In [4]:
# we can retrieve the index just as we did with a series
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [5]:
# we can retrieve the column names with .columns
df.columns

Index(['x', 'y', 'z'], dtype='object')

In [6]:
# how can I retrieve one row?
# use the index and .loc

df.loc['c']

x    70
y    80
z    90
Name: c, dtype: int64

In [7]:
# I can run any series method on this return value
df.loc['c'].mean()

80.0

In [8]:
# how do I get a column? Answer: []
df['x']

a     10
b     40
c     70
d    100
Name: x, dtype: int64

In [9]:
# we can get more than one row if we pass .loc a list of indexes
df.loc[['a', 'c']]

Unnamed: 0,x,y,z
a,10,20,30
c,70,80,90


In [10]:
# we can retrieve more than one column in the same way
df[['x', 'z']]

Unnamed: 0,x,z
a,10,30
b,40,60
c,70,90
d,100,120


In [11]:
# what kinds of operations can we run on a data frame?
# rule of thumb: (Just about) any method that you can run on a series, you can also run on a data frame
# you'll get back one result for each column

df['x'].mean()

55.0

In [12]:
df.mean()   # we get a series back, whose index is the same as df's column names

x    55.0
y    65.0
z    75.0
dtype: float64

In [13]:
df.min()

x    10
y    20
z    30
dtype: int64

In [14]:
# I can even invoke something like df.describe!

df['x'].describe()

count      4.000000
mean      55.000000
std       38.729833
min       10.000000
25%       32.500000
50%       55.000000
75%       77.500000
max      100.000000
Name: x, dtype: float64

In [15]:
df.describe()  # this will give us all of those values, for each column

Unnamed: 0,x,y,z
count,4.0,4.0,4.0
mean,55.0,65.0,75.0
std,38.729833,38.729833,38.729833
min,10.0,20.0,30.0
25%,32.5,42.5,52.5
50%,55.0,65.0,75.0
75%,77.5,87.5,97.5
max,100.0,110.0,120.0


In [16]:
# you can add a new column to a data frame just by assigning
df['w'] = ['hello', 'out', 'there', 'everyone']
df

Unnamed: 0,x,y,z,w
a,10,20,30,hello
b,40,50,60,out
c,70,80,90,there
d,100,110,120,everyone


In [17]:
df.describe()

Unnamed: 0,x,y,z
count,4.0,4.0,4.0
mean,55.0,65.0,75.0
std,38.729833,38.729833,38.729833
min,10.0,20.0,30.0
25%,32.5,42.5,52.5
50%,55.0,65.0,75.0
75%,77.5,87.5,97.5
max,100.0,110.0,120.0


In [19]:
df['w'].describe()

count         4
unique        4
top       hello
freq          1
Name: w, dtype: object

In [20]:
df

Unnamed: 0,x,y,z,w
a,10,20,30,hello
b,40,50,60,out
c,70,80,90,there
d,100,110,120,everyone


In [21]:
df.loc['a']

x       10
y       20
z       30
w    hello
Name: a, dtype: object

In [22]:
df.loc['a'].mean()

UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('int64'), dtype('<U5')) -> None
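
The error happens because row `a` now mixes integers with a string, so Pandas can't add the values together. One way around this (a sketch, not the only approach) is to keep just the numeric columns before retrieving the row. The small data frame here is a cut-down copy of the one above.

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30], [40, 50, 60]],
                  index=list('ab'), columns=list('xyz'))
df['w'] = ['hello', 'out']

# Option 1: name the numeric columns explicitly, then retrieve the row
mean_a = df[['x', 'y', 'z']].loc['a'].mean()

# Option 2: let Pandas choose the numeric columns based on their dtypes
mean_a_auto = df.select_dtypes(include='number').loc['a'].mean()
```

Both options select the columns first, so the row we get back contains only integers.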

# Exercise: Create a data frame

1. Create a data frame describing the weather forecast
    - It'll have six rows, each for a different day
    - The index will contain day names
    - The columns will be `high` and `low`, for the forecast high and low temps in your area
2. Describe the values for the `high` and `low` columns, both separately and together.
3. Calculate the predicted mean temp on both Wednesday and Friday.
4. Add a new column, precipitation, to the data frame. What happens now if you try to describe the values in the data frame? Does it work? Why?

In [26]:
df = DataFrame([[32, 20],
                [33, 21],
                [36, 23],
                [37, 24],
                [37, 23],
                [34, 21]],
              index='Tue Wed Thu Fri Sat Sun'.split(),
              columns=['high', 'low'])
df
                

Unnamed: 0,high,low
Tue,32,20
Wed,33,21
Thu,36,23
Fri,37,24
Sat,37,23
Sun,34,21


In [27]:
# We can always get the dtype of a series with the ".dtype" attribute
df['high'].dtype

dtype('int64')

In [28]:
# I can also get the dtypes of an entire data frame with ".dtypes"
df.dtypes

high    int64
low     int64
dtype: object

In [29]:
# Describe the values for the high and low columns, both separately and together.
df['high'].describe()

count     6.000000
mean     34.833333
std       2.136976
min      32.000000
25%      33.250000
50%      35.000000
75%      36.750000
max      37.000000
Name: high, dtype: float64

In [30]:
df['low'].describe()

count     6.000000
mean     22.000000
std       1.549193
min      20.000000
25%      21.000000
50%      22.000000
75%      23.000000
max      24.000000
Name: low, dtype: float64

In [31]:
df.describe()

Unnamed: 0,high,low
count,6.0,6.0
mean,34.833333,22.0
std,2.136976,1.549193
min,32.0,20.0
25%,33.25,21.0
50%,35.0,22.0
75%,36.75,23.0
max,37.0,24.0


In [34]:
# Calculate the predicted mean temp on both Wednesday and Friday.

df.loc[['Wed', 'Fri']].mean()

high    35.0
low     22.5
dtype: float64

In [45]:
# what if you want to calculate across the rows, and not via the columns?
# many, *many* methods on a data frame support the "axis" keyword argument, which can take
#  values of 0 or 1 or... I prefer to use the strings "columns" or "rows"

df.loc[['Wed', 'Fri']].mean(axis='rows')

high    35.0
low     22.5
dtype: float64

In [46]:
df.loc[['Wed', 'Fri']].mean(axis='columns')   # this means: go across, give me a new column with the result

Wed    27.0
Fri    30.5
dtype: float64

When we say `axis='rows'`, that basically means we want a new row to be calculated, with one result per column.

When we say `axis='columns'`, that basically means we want a new column to be calculated, with one result per row. 
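
If you run into code that uses the numbers instead of the strings, here is a small sketch of how they line up. It assumes a two-row data frame with the Wednesday and Friday values from the exercise.

```python
import pandas as pd

df = pd.DataFrame({'high': [33, 37], 'low': [21, 24]}, index=['Wed', 'Fri'])

per_column = df.mean(axis=0)   # same as axis='index': one result per column
per_row = df.mean(axis=1)      # same as axis='columns': one result per row
```

The default is `axis=0`, which is why `df.mean()` with no arguments gives one result per column.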

In [47]:
df

Unnamed: 0,high,low
Tue,32,20
Wed,33,21
Thu,36,23
Fri,37,24
Sat,37,23
Sun,34,21


In [48]:
df.mean(axis='columns')

Tue    26.0
Wed    27.0
Thu    29.5
Fri    30.5
Sat    30.0
Sun    27.5
dtype: float64
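
For the last part of the exercise, here is one possible sketch. The precipitation values are invented, and I'm assuming they are stored as text; if you used numbers instead, the new column would show up in `describe` like any other numeric column.

```python
import pandas as pd

df = pd.DataFrame([[32, 20], [33, 21], [36, 23], [37, 24], [37, 23], [34, 21]],
                  index='Tue Wed Thu Fri Sat Sun'.split(),
                  columns=['high', 'low'])

# Hypothetical precipitation forecasts, stored as strings
df['precipitation'] = ['none', 'none', 'light', 'heavy', 'light', 'none']

numeric_summary = df.describe()             # only high and low appear
full_summary = df.describe(include='all')   # also summarizes the text column
```

So `describe` still works, but by default it quietly skips the non-numeric column, just as it skipped `w` earlier.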

# Real data!

In [50]:
# Use ! to run a Unix command, just to get a list of files in the "../data" directory

!ls ../data/*.csv

../data/2020_sharing_data_outside.csv  ../data/olympic_athlete_events.csv
../data/CPILFESL.csv		       ../data/san+francisco,ca.csv
../data/albany,ny.csv		       ../data/sat-scores.csv
../data/boston,ma.csv		       ../data/skyscrapers.csv
../data/burrito_current.csv	       ../data/springfield,il.csv
../data/celebrity_deaths_2016.csv      ../data/springfield,ma.csv
../data/chicago,il.csv		       ../data/taxi-distance.csv
../data/eu_cpi.csv		       ../data/taxi-passenger-count.csv
../data/eu_gdp.csv		       ../data/taxi.csv
../data/ice-cream.csv		       ../data/titanic3.csv
../data/languages.csv		       ../data/us-median-cpi.csv
../data/los+angeles,ca.csv	       ../data/us-unemployment-rate.csv
../data/miles-traveled.csv	       ../data/us_gdp.csv
../data/new+york,ny.csv		       ../data/winemag-150k-reviews.csv
../data/oecd_locations.csv	       ../data/wti-daily.csv
../data/oecd_tourism.csv


In [51]:
# I want to load the taxi data (../data/taxi.csv) into a data frame

!head ../data/taxi.csv   # comma-separated values

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.954429626464844,40.764141082763672,1,N,-73.974754333496094,40.754093170166016,2,17,0,0.5,0,0,0.3,17.8
2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,.46,-73.971443176269531,40.758941650390625,1,N,-73.978538513183594,40.761909484863281,1,6.5,0,0.5,1,0,0.3,8.3
2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,.87,-73.978111267089844,40.738433837890625,1,N,-73.990272521972656,40.745437622070313,1,8,0,0.5,2.2,0,0.3,11
2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892333984375,40.773529052734375,1,N,-73.971527099609375,40.760330200195312,1,13.5,0,0.5,2.86,0,0.3,17.16
1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.40,-73.979087829589844,40.776771545410156,1,N,-73.982

In [52]:
# If I want to create a data frame based on a CSV file, the easiest thing to do
# is call pd.read_csv, and hand it a filename. This assumes that (a) fields are separated by
# commas, (b) the first row contains column names, and (c) the only data types we want are
# integers, floats, and strings. If Pandas can't interpret every value in a column as a number,
# it'll just give that column a dtype of "object," and put strings in it.

df = pd.read_csv('../data/taxi.csv')   # this means: go up a directory, then go into the "data" subdirectory, and find taxi.csv
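
When those assumptions don't hold, `read_csv` takes keyword arguments to override them. Here is a minimal sketch of three of them (`sep`, `usecols` and `dtype`). To keep it runnable without any files, it reads from a small made-up string rather than from `taxi.csv`.

```python
from io import StringIO
import pandas as pd

# A tiny stand-in for a file on disk (made-up values, separated by semicolons)
csv_text = """passenger_count;trip_distance;total_amount
1;1.63;17.8
2;0.46;8.3
"""

small_df = pd.read_csv(StringIO(csv_text),
                       sep=';',                                       # the field separator isn't a comma
                       usecols=['passenger_count', 'total_amount'],   # only load these columns
                       dtype={'passenger_count': 'int8'})             # override the dtype Pandas would pick
```

With a real file, you would pass the filename in place of the `StringIO` object and keep the same keyword arguments.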

In [53]:
# how can I know that it read it correctly?
# let's run some basic tests.
# first test: how big is our data frame? We can use the "shape" attribute

df.shape   # gives me a tuple (rows, columns)

(9999, 19)

In [54]:
# I can run the "head" method to get the first five rows
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3


In [55]:
df.tail()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
9994,1,2015-06-01 00:12:59,2015-06-01 00:24:18,1,2.7,-73.947792,40.814972,1,N,-73.973358,40.783638,2,11.0,0.5,0.5,0.0,0.0,0.3,12.3
9995,1,2015-06-01 00:12:59,2015-06-01 00:28:16,1,4.5,-74.004066,40.747818,1,N,-73.953758,40.779285,1,16.0,0.5,0.5,3.0,0.0,0.3,20.3
9996,2,2015-06-01 00:13:00,2015-06-01 00:37:25,1,5.59,-73.994377,40.766102,1,N,-73.903206,40.750546,2,21.0,0.5,0.5,0.0,0.0,0.3,22.3
9997,2,2015-06-01 00:13:02,2015-06-01 00:19:10,6,1.54,-73.978302,40.748531,1,N,-73.989166,40.762852,2,6.5,0.5,0.5,0.0,0.0,0.3,7.8
9998,1,2015-06-01 00:13:04,2015-06-01 00:36:33,1,5.8,-73.983215,40.726414,1,N,-73.924133,40.701645,1,21.0,0.5,0.5,4.45,0.0,0.3,26.75


In [56]:
# I can also run describe

df.describe()

Unnamed: 0,VendorID,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0
mean,1.516652,1.659466,3.158511,-73.014956,40.226521,1.045105,-73.054699,40.248644,1.407741,14.415892,0.118212,0.49745,1.818059,0.400433,0.29979,17.552472
std,0.499748,1.333306,4.037516,8.347871,4.599169,0.302132,8.186847,4.51052,0.501911,12.442624,0.214794,0.037667,2.634469,1.66517,0.010816,15.13799
min,1.0,0.0,0.0,-74.186302,0.0,1.0,-74.277367,0.0,1.0,-7.0,-0.5,-0.5,0.0,0.0,-0.3,-7.8
25%,1.0,1.0,1.0,-73.990997,40.738556,1.0,-73.990261,40.738478,1.0,7.0,0.0,0.5,0.0,0.0,0.3,8.8
50%,2.0,1.0,1.7,-73.979774,40.755909,1.0,-73.978256,40.75634,1.0,10.5,0.0,0.5,1.0,0.0,0.3,12.8
75%,2.0,2.0,3.3,-73.963001,40.770012,1.0,-73.961311,40.771044,2.0,17.0,0.0,0.5,2.46,0.0,0.3,19.8
max,2.0,6.0,64.6,0.0,41.064606,5.0,0.0,41.137344,4.0,250.0,1.0,0.5,42.05,70.0,0.3,252.35


In [57]:
# I can retrieve a row
df.loc[3]

VendorID                                   2
tpep_pickup_datetime     2015-06-02 11:19:31
tpep_dropoff_datetime    2015-06-02 11:39:02
passenger_count                            1
trip_distance                           2.13
pickup_longitude                  -73.945892
pickup_latitude                    40.773529
RateCodeID                                 1
store_and_fwd_flag                         N
dropoff_longitude                 -73.971527
dropoff_latitude                    40.76033
payment_type                               1
fare_amount                             13.5
extra                                    0.0
mta_tax                                  0.5
tip_amount                              2.86
tolls_amount                             0.0
improvement_surcharge                    0.3
total_amount                           17.16
Name: 3, dtype: object

In [58]:
# I can retrieve a column
df['trip_distance']

0       1.63
1       0.46
2       0.87
3       2.13
4       1.40
        ... 
9994    2.70
9995    4.50
9996    5.59
9997    1.54
9998    5.80
Name: trip_distance, Length: 9999, dtype: float64

In [59]:
# I can get more than one column
df[['trip_distance', 'total_amount']]

Unnamed: 0,trip_distance,total_amount
0,1.63,17.80
1,0.46,8.30
2,0.87,11.00
3,2.13,17.16
4,1.40,10.30
...,...,...
9994,2.70,12.30
9995,4.50,20.30
9996,5.59,22.30
9997,1.54,7.80


# Let's go deeper with `.loc`

So far, we've seen that `.loc` can be used to retrieve one element from a series, or one row from a data frame. That's true, but it's not even close to the whole story.

If we pass `.loc` one argument, then it is what I call a "row selector," meaning that it'll help us to select rows. We can pass an index, or a list of indexes, and it'll work. But we can also pass (as we've seen with series) a boolean series that tells Pandas which elements we do and don't want.
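
As a quick sketch of that last idea, here is a boolean series used as a row selector. The data frame is a shortened version of the weather forecast from the exercise, and the names `mask` and `hot_days` are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'high': [32, 33, 36, 37], 'low': [20, 21, 23, 24]},
                  index=['Tue', 'Wed', 'Thu', 'Fri'])

# Comparing a column with a value gives a boolean series with the same index as df
mask = df['high'] > 35

# Passing that boolean series to .loc keeps only the rows where it is True
hot_days = df.loc[mask]
```

Here `hot_days` contains only the rows for Thursday and Friday, the two days with a forecast high above 35.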

