# Agenda

1. Fun with promotions
2. Data frames
    - What are they?
    - How can we create simple ones by hand?
    - How do we retrieve from them?
3. Reading from CSV files
    - Some basic options
    - Things to watch out for
4. Retrieving with `.loc` -- ways to think about it
5. How to avoid a common Pandas warning / error

In [2]:
import pandas as pd
from pandas import Series, DataFrame

In [3]:
s = Series([10, 20, 30, 40, 50], dtype='int8')
s

0    10
1    20
2    30
3    40
4    50
dtype: int8

In [4]:
2 ** 8

256

In [5]:
s * 2

0     20
1     40
2     60
3     80
4    100
dtype: int8

In [6]:
s * 10  # here, we multiply by 10 and the numbers "roll over" -- because int8 can't hold some of the results

0    100
1    -56
2     44
3   -112
4    -12
dtype: int8

In [8]:
s * 200   # will this have similar problems?  no!

0     2000
1     4000
2     6000
3     8000
4    10000
dtype: int16

In [20]:
# somewhere along the line, Pandas decided to "promote" the series from int8 to int16, thus saving the day

# Pandas looks at the number by which we're multiplying -- if it fits into the current dtype, then it keeps the dtype
# and performs the operation, even if some results overflow.  But if the number is too big for the current dtype,
# then it promotes the series when it calculates.

In [21]:
s + 126

0   -120
1   -110
2   -100
3    -90
4    -80
dtype: int8

In [22]:
s + 127

0   -119
1   -109
2    -99
3    -89
4    -79
dtype: int8

In [23]:
s + 128

0    138
1    148
2    158
3    168
4    178
dtype: int16
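
Because the wraparound happens silently, one defensive option is to widen the dtype yourself before calculating, rather than relying on automatic promotion. The sketch below assumes the same values as `s` above; `np.iinfo` reports the range a dtype can hold, and `astype` returns a widened copy. The automatic promotion shown above may differ between NumPy and Pandas versions, so treat it as version-dependent.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], dtype='int8')

# what range of values can int8 hold?
limits = np.iinfo(np.int8)
print(limits.min, limits.max)

# 10 fits in int8, so the dtype stays int8 and the larger products wrap around
wrapped = s * 10
print(wrapped.tolist())

# widening to int16 first leaves room for the real answers
widened = s.astype('int16') * 10
print(widened.tolist())
print(widened.dtype)
```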

# Data frames

A data frame is a 2D table with rows and columns:

- Each row is identified by an index
- Each column is identified by a column name

Each column is basically a Pandas series. So anything that you can do on a series, you can do on a column.

- Because each column is a series, we continue to use the index to identify each element.
- All of the series (columns) in a data frame share an index
- When we retrieve a column, it's a series, like usual.
- When we retrieve a row, Pandas creates a new series on the fly
- Every column has a dtype, whereas a row combines whatever dtypes are in its columns, so a row retrieved as a series often has the "object" dtype
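
The points above can be checked on a tiny data frame. This is a minimal sketch with made-up values (the names `col` and `row` are just for illustration): it retrieves one column and one row, then looks at what type and index each one has.

```python
import pandas as pd

# two rows, one numeric column and one text column
df = pd.DataFrame([[10, 'hello'],
                   [40, 'out']],
                  index=['a', 'b'],
                  columns=['x', 'w'])

col = df['x']        # a column is a series with its own dtype
row = df.loc['a']    # a row is a new series, built from one value per column

print(type(col).__name__, col.dtype)
print(type(row).__name__, row.dtype)

# the column shares the data frame's index
print(col.index.equals(df.index))

# the row is indexed by the data frame's column names
print(row.index.equals(df.columns))
```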



In [24]:
# create a data frame with a list of lists
# each inner list describes a row in the data frame

df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90],
               [100, 110, 120]])
df

Unnamed: 0,0,1,2
0,10,20,30
1,40,50,60
2,70,80,90
3,100,110,120


In [25]:
# by default, the rows and columns are both numbered starting at 0. That's technically fine,
# but in real life you'll want to identify them with names or numbers. We can do that by 
# passing the "index" keyword argument, and also the "columns" keyword argument.

df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90],
               [100, 110, 120]],
              index=list('abcd'),
              columns=list('xyz'))
df

Unnamed: 0,x,y,z
a,10,20,30
b,40,50,60
c,70,80,90
d,100,110,120
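
A list of lists describes the data row by row. Another common option, sketched here as an alternative rather than something the cells above rely on, is to pass a dict in which each key is a column name and each value is a list with that column's values. The variable names are only for illustration.

```python
from pandas import DataFrame

# each key is a column name; each list holds that column's values
df_from_dict = DataFrame({'x': [10, 40, 70, 100],
                          'y': [20, 50, 80, 110],
                          'z': [30, 60, 90, 120]},
                         index=list('abcd'))

# the same table, built row by row
df_from_rows = DataFrame([[10, 20, 30],
                          [40, 50, 60],
                          [70, 80, 90],
                          [100, 110, 120]],
                         index=list('abcd'),
                         columns=list('xyz'))

# do both approaches produce the same data frame?
print(df_from_dict.equals(df_from_rows))
```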


In [26]:
# we can retrieve the index, just as we did with a series
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [27]:
# how do we retrieve the column names? Just use the "columns" attribute
df.columns

Index(['x', 'y', 'z'], dtype='object')

In [28]:
# how can I retrieve a row from my data frame?
# use .loc, just as we did with a series

df.loc['c']

x    70
y    80
z    90
Name: c, dtype: int64

In [29]:
# with a series, we can retrieve more than one object at a time with fancy indexing.
# can we do that now? What do we get back?

df.loc[['a', 'c']]

Unnamed: 0,x,y,z
a,10,20,30
c,70,80,90


In [30]:
# can we retrieve individual columns? Yes, just use [] without any .loc
df['x']

a     10
b     40
c     70
d    100
Name: x, dtype: int64

In [31]:
df[['x', 'z']]

Unnamed: 0,x,z
a,10,30
b,40,60
c,70,90
d,100,120


In [32]:
# we can run a method on a series, which means (normally) either on a row or a column
df['x'].mean()

55.0

In [33]:
# in most cases, a method you can run on a series can also be run on a data frame
# and in such a case, you'll get back one result for each column.

df.mean()   # this will return the mean for each column, labeling each column as well

x    55.0
y    65.0
z    75.0
dtype: float64

In [34]:
df.sum()   # let's sum all of the numbers in each column

x    220
y    260
z    300
dtype: int64

In [35]:
# What happens when I do the following:

df.sum().sum()

780

In [36]:
# lots of methods we can run:

df.min()

x    10
y    20
z    30
dtype: int64

In [37]:
# give me info about column x
df['x'].describe()

count      4.000000
mean      55.000000
std       38.729833
min       10.000000
25%       32.500000
50%       55.000000
75%       77.500000
max      100.000000
Name: x, dtype: float64

In [38]:
# run a method on a data frame, and you get back one set of answers per column -- that means a data frame!
df.describe() 

Unnamed: 0,x,y,z
count,4.0,4.0,4.0
mean,55.0,65.0,75.0
std,38.729833,38.729833,38.729833
min,10.0,20.0,30.0
25%,32.5,42.5,52.5
50%,55.0,65.0,75.0
75%,77.5,87.5,97.5
max,100.0,110.0,120.0


In [39]:
# we can get the values of a data frame as a 2D NumPy array
df.values

array([[ 10,  20,  30],
       [ 40,  50,  60],
       [ 70,  80,  90],
       [100, 110, 120]])

In [40]:
df

Unnamed: 0,x,y,z
a,10,20,30
b,40,50,60
c,70,80,90
d,100,110,120


In [41]:
df.mean()   # this gives us the mean for each column

x    55.0
y    65.0
z    75.0
dtype: float64

In [42]:
df.mean(axis='columns')  # this means: calculate across the columns, giving me a new series with one result per row

a     20.0
b     50.0
c     80.0
d    110.0
dtype: float64
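
The `axis` argument is easy to mix up, so here is a small sketch on a two-row data frame with made-up values. It shows that the default matches `axis='index'` (one result per column), and that `axis='columns'` can also be written as `axis=1` (one result per row).

```python
from pandas import DataFrame

df = DataFrame([[10, 20, 30],
                [40, 50, 60]],
               index=list('ab'),
               columns=list('xyz'))

per_column = df.mean()               # same as axis='index' or axis=0
per_row = df.mean(axis='columns')    # same as axis=1

print(per_column)   # indexed by the column names x, y, z
print(per_row)      # indexed by the row labels a, b
```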

In [43]:
# how can I add a new column to a data frame?
# just assign to it!  (you can also replace an existing column this way, if you want)

df['w'] = 'hello out there everyone'.split()
df

Unnamed: 0,x,y,z,w
a,10,20,30,hello
b,40,50,60,out
c,70,80,90,there
d,100,110,120,everyone


In [44]:
df.dtypes  # what dtypes are in each column

x     int64
y     int64
z     int64
w    object
dtype: object

In [45]:
df.describe()  # what happens with column w?

Unnamed: 0,x,y,z
count,4.0,4.0,4.0
mean,55.0,65.0,75.0
std,38.729833,38.729833,38.729833
min,10.0,20.0,30.0
25%,32.5,42.5,52.5
50%,55.0,65.0,75.0
75%,77.5,87.5,97.5
max,100.0,110.0,120.0


In [46]:
df['w'].describe()

count         4
unique        4
top       hello
freq          1
Name: w, dtype: object
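
By default, `describe` on a data frame summarizes only the numeric columns, which is why `w` disappeared above. If you want every column in one summary, one option is the `include='all'` argument. A minimal sketch with made-up values:

```python
from pandas import DataFrame

df = DataFrame([[10, 'hello'],
                [40, 'out'],
                [70, 'there']],
               columns=['x', 'w'])

numeric_only = df.describe()               # default: numeric columns only
everything = df.describe(include='all')    # also summarizes non-numeric columns

print(numeric_only.columns.tolist())
print(everything.columns.tolist())
```

Statistics that don't apply to a column (such as the mean of a text column) show up as `NaN` in the combined result.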

# Exercise: Create a data frame

1. Create a data frame containing your local weather forecast
    - It should have six rows, each for a different day
    - The index will contain day names/abbreviations
    - The columns will be `high` and `low`, showing the forecast high and low temps on each day
2. Describe the values in `high` and `low`, both separately and together.
3. Calculate the predicted mean temps on both Friday and Tuesday.
4. Add a new column, precipitation, to the data frame. What happens if you try to run the describe method on the values? Does it work?

In [47]:
df = DataFrame([[36, 27],
               [38, 22],
               [38, 23],
               [37, 23],
               [35, 22],
               [34, 21]],
              index='Thu Fri Sat Sun Mon Tue'.split(),
              columns=['high', 'low'])
df

Unnamed: 0,high,low
Thu,36,27
Fri,38,22
Sat,38,23
Sun,37,23
Mon,35,22
Tue,34,21


In [48]:
# describe the values in high, low, and both together
df['high'].describe()

count     6.000000
mean     36.333333
std       1.632993
min      34.000000
25%      35.250000
50%      36.500000
75%      37.750000
max      38.000000
Name: high, dtype: float64

In [49]:
df['low'].describe()

count     6.000000
mean     23.000000
std       2.097618
min      21.000000
25%      22.000000
50%      22.500000
75%      23.000000
max      27.000000
Name: low, dtype: float64

In [50]:
df.describe()

Unnamed: 0,high,low
count,6.0,6.0
mean,36.333333,23.0
std,1.632993,2.097618
min,34.0,21.0
25%,35.25,22.0
50%,36.5,22.5
75%,37.75,23.0
max,38.0,27.0


In [54]:
# Calculate the predicted mean temps on both Friday and Tuesday.

df.loc[['Fri', 'Tue']].mean()

high    36.0
low     21.5
dtype: float64

In [55]:
# give me the mean temp per *day*
df.loc[['Fri', 'Tue']].mean(axis='columns')

Fri    30.0
Tue    27.5
dtype: float64

In [56]:
# Add a new column, precipitation, to the data frame. What happens if you try to run the describe method on the values? Does it work?

df['precipitation'] = [0.1, 0.2, 0, 0, 0.6, 0.2]
df

Unnamed: 0,high,low,precipitation
Thu,36,27,0.1
Fri,38,22,0.2
Sat,38,23,0.0
Sun,37,23,0.0
Mon,35,22,0.6
Tue,34,21,0.2


In [57]:
df.describe()

Unnamed: 0,high,low,precipitation
count,6.0,6.0,6.0
mean,36.333333,23.0,0.183333
std,1.632993,2.097618,0.22286
min,34.0,21.0,0.0
25%,35.25,22.0,0.025
50%,36.5,22.5,0.15
75%,37.75,23.0,0.2
max,38.0,27.0,0.6


# Reading data

Data comes in a *very* wide variety of formats. Very often, it comes in "CSV" (comma separated values) format, with one record per line in the file, and the fields within each record separated by commas.

Other data often comes from:
- Excel files
- JSON files
- Database files
- Amazon S3

Pandas handles all of these really well.

Typically, you'll want to use the `pd.read_SOMETHING` function, where `SOMETHING` is the format.

Where do we have CSV files? I've put a bunch of them in the `../data/` directory.

We can see the contents of that directory in Jupyter using the `!ls` command, which runs on the command line.

In [58]:
# show me all files in ../data/ that end with CSV.

!ls ../data/*.csv

../data/2020_sharing_data_outside.csv  ../data/olympic_athlete_events.csv
../data/CPILFESL.csv		       ../data/san+francisco,ca.csv
../data/albany,ny.csv		       ../data/sat-scores.csv
../data/boston,ma.csv		       ../data/skyscrapers.csv
../data/burrito_current.csv	       ../data/springfield,il.csv
../data/celebrity_deaths_2016.csv      ../data/springfield,ma.csv
../data/chicago,il.csv		       ../data/taxi-distance.csv
../data/eu_cpi.csv		       ../data/taxi-passenger-count.csv
../data/eu_gdp.csv		       ../data/taxi.csv
../data/ice-cream.csv		       ../data/titanic3.csv
../data/languages.csv		       ../data/us-median-cpi.csv
../data/los+angeles,ca.csv	       ../data/us-unemployment-rate.csv
../data/miles-traveled.csv	       ../data/us_gdp.csv
../data/new+york,ny.csv		       ../data/winemag-150k-reviews.csv
../data/oecd_locations.csv

You can download these files from https://files.lerner.co.il, the zipfile for "data science and machine learning": https://files.lerner.co.il/data-science-exercise-files.zip
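
If you don't have the files handy, you can still try out `pd.read_csv` and a couple of its basic options. The sketch below is only an illustration: `csv_text` is a made-up stand-in for a file on disk, wrapped in `StringIO` so that `read_csv` can read it like a file. `usecols` selects which columns to load, and `nrows` limits how many data rows are read.

```python
from io import StringIO
import pandas as pd

# hypothetical stand-in for a CSV file on disk
csv_text = """passenger_count,trip_distance,total_amount
1,1.63,17.8
1,.46,8.3
1,.87,11
1,2.13,17.16
"""

# read everything
df_all = pd.read_csv(StringIO(csv_text))

# read only two of the columns, and only the first three data rows
df_some = pd.read_csv(StringIO(csv_text),
                      usecols=['trip_distance', 'total_amount'],
                      nrows=3)

print(df_all.shape)
print(df_some.shape)
print(df_some.dtypes)
```

Loading fewer columns and rows is a cheap way to peek at a large file before committing to reading all of it.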

In [62]:
# Let's load up NYC taxi data!

df = pd.read_csv('../data/taxi.csv')

In [63]:
!head ../data/taxi.csv

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.954429626464844,40.764141082763672,1,N,-73.974754333496094,40.754093170166016,2,17,0,0.5,0,0,0.3,17.8
2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,.46,-73.971443176269531,40.758941650390625,1,N,-73.978538513183594,40.761909484863281,1,6.5,0,0.5,1,0,0.3,8.3
2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,.87,-73.978111267089844,40.738433837890625,1,N,-73.990272521972656,40.745437622070313,1,8,0,0.5,2.2,0,0.3,11
2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892333984375,40.773529052734375,1,N,-73.971527099609375,40.760330200195312,1,13.5,0,0.5,2.86,0,0.3,17.16
1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.40,-73.979087829589844,40.776771545410156,1,N,-73.982

In [64]:
# how big is this data frame?

df.shape

(9999, 19)

In [65]:
# I always like to use "head" on a data set when I read it in


In [66]:
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3


In [67]:
df.describe()

Unnamed: 0,VendorID,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0,9999.0
mean,1.516652,1.659466,3.158511,-73.014956,40.226521,1.045105,-73.054699,40.248644,1.407741,14.415892,0.118212,0.49745,1.818059,0.400433,0.29979,17.552472
std,0.499748,1.333306,4.037516,8.347871,4.599169,0.302132,8.186847,4.51052,0.501911,12.442624,0.214794,0.037667,2.634469,1.66517,0.010816,15.13799
min,1.0,0.0,0.0,-74.186302,0.0,1.0,-74.277367,0.0,1.0,-7.0,-0.5,-0.5,0.0,0.0,-0.3,-7.8
25%,1.0,1.0,1.0,-73.990997,40.738556,1.0,-73.990261,40.738478,1.0,7.0,0.0,0.5,0.0,0.0,0.3,8.8
50%,2.0,1.0,1.7,-73.979774,40.755909,1.0,-73.978256,40.75634,1.0,10.5,0.0,0.5,1.0,0.0,0.3,12.8
75%,2.0,2.0,3.3,-73.963001,40.770012,1.0,-73.961311,40.771044,2.0,17.0,0.0,0.5,2.46,0.0,0.3,19.8
max,2.0,6.0,64.6,0.0,41.064606,5.0,0.0,41.137344,4.0,250.0,1.0,0.5,42.05,70.0,0.3,252.35


In [68]:
# retrieve a row
df.loc[3]


VendorID                                   2
tpep_pickup_datetime     2015-06-02 11:19:31
tpep_dropoff_datetime    2015-06-02 11:39:02
passenger_count                            1
trip_distance                           2.13
pickup_longitude                  -73.945892
pickup_latitude                    40.773529
RateCodeID                                 1
store_and_fwd_flag                         N
dropoff_longitude                 -73.971527
dropoff_latitude                    40.76033
payment_type                               1
fare_amount                             13.5
extra                                    0.0
mta_tax                                  0.5
tip_amount                              2.86
tolls_amount                             0.0
improvement_surcharge                    0.3
total_amount                           17.16
Name: 3, dtype: object

In [69]:
# retrieve a column with []

df['trip_distance']

0       1.63
1       0.46
2       0.87
3       2.13
4       1.40
        ... 
9994    2.70
9995    4.50
9996    5.59
9997    1.54
9998    5.80
Name: trip_distance, Length: 9999, dtype: float64

In [70]:
# we can now calculate on this
df['trip_distance'].mean()

3.1585108510851083

In [71]:
# we can grab more than one column with [[]]
df[['trip_distance', 'total_amount']]

Unnamed: 0,trip_distance,total_amount
0,1.63,17.80
1,0.46,8.30
2,0.87,11.00
3,2.13,17.16
4,1.40,10.30
...,...,...
9994,2.70,12.30
9995,4.50,20.30
9996,5.59,22.30
9997,1.54,7.80


In [73]:
df[['trip_distance', 'total_amount']].mean()

trip_distance     3.158511
total_amount     17.552472
dtype: float64

# `.loc`

We've used `.loc` so far to retrieve a value from a particular index. We've also seen that we can apply a boolean series to `.loc` and get back only those values where the boolean values are `True`. 

In [74]:
df.loc[   df['trip_distance'] > 50  ]

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
4270,1,2015-06-01 00:00:58,2015-06-01 01:22:05,1,64.6,0.0,0.0,5,N,0.0,0.0,2,69.66,0.0,0.0,0.0,10.0,0.3,79.96
8513,1,2015-06-01 00:04:50,2015-06-01 01:31:44,1,60.3,-73.994415,40.750603,5,N,-73.42025,41.137344,1,150.0,0.0,0.0,0.0,9.75,0.3,160.05
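
It can help to look at the boolean series on its own before handing it to `.loc`. The sketch below uses a made-up, four-row miniature of the taxi data (the name `trips` and its values are assumptions for illustration) to show that the comparison produces one `True`/`False` value per row, and that `.loc` keeps only the rows marked `True`.

```python
from pandas import DataFrame

# hypothetical miniature version of the taxi data
trips = DataFrame({'passenger_count': [1, 3, 2, 3],
                   'trip_distance': [1.63, 64.6, 0.87, 60.3]})

mask = trips['trip_distance'] > 50    # a boolean series, one value per row
print(mask)
print(mask.sum())                     # True counts as 1, so this counts the matching rows

long_trips = trips.loc[mask]          # keeps only the rows where mask is True
print(long_trips)
```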


# How can I get the trip distance when there were 3 passengers?
# option 1: Apply the boolean series only to "trip_distance"

1. Get a boolean series where passenger_count is 3.
2. Apply that boolean series to trip_distance
3. We'll have a series of trip_distance where passenger_count is 3

In [77]:
# retrieve trip_distance, but only where passenger_count is 3
df['trip_distance'][df['passenger_count'] == 3]

10       0.01
67       2.50
82      10.24
126      1.06
192      0.60
        ...  
9837     1.50
9888     6.00
9914     2.44
9946     5.00
9983     1.78
Name: trip_distance, Length: 406, dtype: float64
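
The same retrieval can also be written as a single `.loc` call that takes a row selector and a column selector together. This is a minimal sketch on made-up data (the `trips` data frame is hypothetical), comparing the two-step form with the one-step form. For reading data they return the same values; the one-step form is the one usually recommended when you later want to assign, since chained indexing is a common source of Pandas warnings.

```python
from pandas import DataFrame

# hypothetical miniature version of the taxi data
trips = DataFrame({'passenger_count': [1, 3, 2, 3],
                   'trip_distance': [1.63, 0.01, 0.87, 2.50]})

mask = trips['passenger_count'] == 3

chained = trips['trip_distance'][mask]       # two steps: pick the column, then filter
single = trips.loc[mask, 'trip_distance']    # one step: row selector, column selector

print(single)
print(chained.equals(single))
```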

In [None]:
# method chaining format

(
    df
    ['trip_distance']
    [df
    ['passeng
