# Agenda: Week 3 — real-world data

0. Q&A -- setting up an environment on your computer
1. More about CSV
2. Reading data from online sources
3. Sorting
4. Grouping
5. Pivot tables
6. Joining
7. Cleaning data

# Setting up an environment

To run the sorts of programs we're talking about in class, you'll need:

- Python - easiest way is from https://python.org/
- NumPy / Pandas / Matplotlib -- easiest way is with `pip`
- Somewhere to write that code and then run it

My preference is:

- Install Python from python.org
- Install packages via `pip`
- Edit in PyCharm (for long things) and Jupyter (for short things)
    - Install PyCharm from the JetBrains site
    - Install Jupyter with `pip`
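
Once everything is installed, a quick check along these lines can confirm that the packages are visible to the Python you're actually running. This is only a sketch; the `required` list and the printed wording are illustrative choices, not part of any official setup procedure.

```python
import importlib.util

# Packages named in the list above (import names, all lowercase)
required = ['numpy', 'pandas', 'matplotlib']

# find_spec returns None when the package can't be found by this interpreter
status = {name: importlib.util.find_spec(name) is not None for name in required}

for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

If something shows up as missing even though you installed it, you may have more than one Python on your machine, and `pip` may have installed into a different one.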

There are other ways to install Python, and they can behave differently:

- You might install it with Anaconda, an all-in-one system
- I often install (on my Mac) the version of Python from "Homebrew," an open-source installation system

Installing and running Jupyter can cause issues of its own.

If you can get Jupyter running, then you can experiment most directly with what I'm doing in class, including using the files that I'm uploading to GitHub.

If you want to use my Jupyter notebook files, then you need to download them from GitHub (you can clone the repository or just download a zip file).

There's one Jupyter file (with a `.ipynb` suffix) for each day of the course.  Put that into the directory where you're running Jupyter, and you'll be able to open it on your computer.

I'm also going to be creating HTML versions of every Jupyter notebook after each class.  That'll allow you to view them in a browser without running Jupyter.

In [1]:
# recap a bit from last week -- loading a CSV file

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [3]:
# create a data frame from a CSV file

df = pd.read_csv('taxi.csv')
df.head()                       # did we load the file correctly?

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3


# Assumptions of `read_csv`:

1. The fields on each line are separated by commas (`,`)
2. The first line in the file contains header/column names
3. Pandas will guess the `dtype` that it should give to each column:
    - If the column only contains integers, it gets dtype `np.int64`
    - If the column contains digits + decimal points, it gets dtype `np.float64`
    - Everything else is considered a string, and is stored as Python strings with dtype `object`
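
To see the dtype guessing in action without needing `taxi.csv`, here is a small sketch that reads a made-up three-row CSV from a string. The data and the names `csv_text` and `sample` are invented for illustration. Depending on your Pandas version, the text column may be reported as `object` or as a dedicated string dtype.

```python
import io
import pandas as pd

# Hypothetical three-row CSV, invented for illustration (not taken from taxi.csv)
csv_text = """passenger_count,trip_distance,store_and_fwd_flag
1,1.63,N
2,0.46,N
6,0.87,Y
"""

# io.StringIO lets read_csv treat a string as if it were a file
sample = pd.read_csv(io.StringIO(csv_text))

# One dtype per column: integers, floats, and text
inferred = sample.dtypes
print(inferred)
```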

In [4]:
# retrieve a column with []

df['trip_distance']

0       1.63
1       0.46
2       0.87
3       2.13
4       1.40
        ... 
9994    2.70
9995    4.50
9996    5.59
9997    1.54
9998    5.80
Name: trip_distance, Length: 9999, dtype: float64

In [5]:
# retrieve multiple columns with a list in []

df[['trip_distance', 'total_amount']]

Unnamed: 0,trip_distance,total_amount
0,1.63,17.80
1,0.46,8.30
2,0.87,11.00
3,2.13,17.16
4,1.40,10.30
...,...,...
9994,2.70,12.30
9995,4.50,20.30
9996,5.59,22.30
9997,1.54,7.80


In [6]:
# retrieve one row with its index, using .loc

df.loc[3]

VendorID                                   2
tpep_pickup_datetime     2015-06-02 11:19:31
tpep_dropoff_datetime    2015-06-02 11:39:02
passenger_count                            1
trip_distance                           2.13
pickup_longitude                  -73.945892
pickup_latitude                    40.773529
RateCodeID                                 1
store_and_fwd_flag                         N
dropoff_longitude                 -73.971527
dropoff_latitude                    40.76033
payment_type                               1
fare_amount                             13.5
extra                                    0.0
mta_tax                                  0.5
tip_amount                              2.86
tolls_amount                             0.0
improvement_surcharge                    0.3
total_amount                           17.16
Name: 3, dtype: object

In [10]:
# retrieve rows with df.loc and a "row selector", which can be:
# - one index
# - a list of indexes
# - a boolean series, used as a mask index

df.loc[df['passenger_count'] > 5]

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
30,2,2015-06-02 11:19:55,2015-06-02 11:32:57,6,0.81,-73.999336,40.754444,1,N,-73.996078,40.748554,2,8.5,0.0,0.5,0.00,0.0,0.3,9.30
62,2,2015-06-02 11:20:17,2015-06-02 11:24:33,6,0.51,-73.953720,40.775059,1,N,-73.960892,40.774948,2,5.0,0.0,0.5,0.00,0.0,0.3,5.80
71,2,2015-06-02 11:20:25,2015-06-02 11:35:44,6,0.88,-73.994125,40.751011,1,N,-73.985313,40.758541,1,10.0,0.0,0.5,3.24,0.0,0.3,14.04
146,2,2015-06-02 11:21:23,2015-06-02 11:33:09,6,1.21,-73.998741,40.744953,1,N,-73.980057,40.736820,1,9.0,0.0,0.5,2.45,0.0,0.3,12.25
188,2,2015-06-02 11:21:52,2015-06-02 11:36:17,6,1.84,-73.971909,40.794464,1,N,-73.963913,40.776718,1,11.0,0.0,0.5,2.36,0.0,0.3,14.16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9953,2,2015-06-01 00:13:42,2015-06-01 00:21:01,6,1.07,-73.994072,40.759239,1,N,-73.976021,40.751888,1,7.0,0.5,0.5,2.49,0.0,0.3,10.79
9961,2,2015-06-01 00:13:05,2015-06-01 00:22:38,6,2.72,-73.923241,40.767677,1,N,-73.956253,40.747513,1,10.5,0.5,0.5,2.36,0.0,0.3,14.16
9963,2,2015-06-01 00:13:07,2015-06-01 00:21:14,6,1.52,-73.972778,40.750256,1,N,-73.989075,40.762695,2,8.0,0.5,0.5,0.00,0.0,0.3,9.30
9989,2,2015-06-01 00:13:41,2015-06-01 00:17:44,6,1.34,-74.005898,40.735851,1,N,-73.991318,40.748177,1,6.0,0.5,0.5,1.46,0.0,0.3,8.76
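
The cell above shows only the boolean-mask form of the row selector. As a rough sketch of all three forms side by side, here is a tiny hypothetical data frame (`toy`, with made-up index values 10 through 40) standing in for the taxi data:

```python
import pandas as pd

# Hypothetical toy data frame; values and index labels are made up
toy = pd.DataFrame({'passenger_count': [1, 6, 2, 6],
                    'trip_distance': [1.63, 0.81, 0.46, 0.51]},
                   index=[10, 20, 30, 40])

one_row = toy.loc[20]                              # one index -> a Series
some_rows = toy.loc[[10, 40]]                      # list of indexes -> a DataFrame
big_groups = toy.loc[toy['passenger_count'] > 5]   # boolean Series used as a mask

print(one_row)
print(some_rows)
print(big_groups)
```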


In [11]:
# giving df.loc two arguments, we can provide a row selector and a column selector
# the column selector can be:
# - a column name
# - a list of column names
# - a boolean series for the columns (weird and rare)

df.loc[
    df['passenger_count'] > 5   # row selector
    ,
    ['trip_distance', 'total_amount']   # column selector
]

Unnamed: 0,trip_distance,total_amount
30,0.81,9.30
62,0.51,5.80
71,0.88,14.04
146,1.21,12.25
188,1.84,14.16
...,...,...
9953,1.07,10.79
9961,2.72,14.16
9963,1.52,9.30
9989,1.34,8.76


# What if we don't want all of the columns?

Each column that we read into memory consumes RAM. If we have a big file, we might want to choose only selected columns when we read the CSV file into memory.

We can do that by passing the `usecols` parameter to `read_csv`, and providing a list of the columns we want, either as names or as integer positions (counting from 0).

In [12]:
df = pd.read_csv('taxi.csv',
                usecols=['trip_distance', 'total_amount', 'tip_amount', 'passenger_count'])

In [13]:
df.head()

Unnamed: 0,passenger_count,trip_distance,tip_amount,total_amount
0,1,1.63,0.0,17.8
1,1,0.46,1.0,8.3
2,1,0.87,2.2,11.0
3,1,2.13,2.86,17.16
4,1,1.4,0.0,10.3


In [15]:
# usecols also accepts column positions (counting from 0) instead of names
df = pd.read_csv('taxi.csv',
                usecols=[3, 4, 15, 18])
df.head()

Unnamed: 0,passenger_count,trip_distance,tip_amount,total_amount
0,1,1.63,0.0,17.8
1,1,0.46,1.0,8.3
2,1,0.87,2.2,11.0
3,1,2.13,2.86,17.16
4,1,1.4,0.0,10.3
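
As a rough check on the RAM argument, the sketch below loads a made-up CSV three ways and compares memory use. The data and variable names are hypothetical; `memory_usage(deep=True)` is the standard data frame method for measuring bytes per column.

```python
import io
import pandas as pd

# Hypothetical CSV contents, invented for illustration
csv_text = """VendorID,passenger_count,trip_distance,store_and_fwd_flag,total_amount
2,1,1.63,N,17.8
2,1,0.46,N,8.3
1,6,0.87,Y,11.0
"""

full = pd.read_csv(io.StringIO(csv_text))
by_name = pd.read_csv(io.StringIO(csv_text),
                      usecols=['trip_distance', 'total_amount'])
by_position = pd.read_csv(io.StringIO(csv_text),
                          usecols=[2, 4])       # same two columns, by position

# Total bytes used by each data frame (including the index)
full_bytes = full.memory_usage(deep=True).sum()
subset_bytes = by_name.memory_usage(deep=True).sum()
print(full_bytes, subset_bytes)
```

Note that the columns come back in the order they appear in the file, not the order you list them in `usecols`, which is why `passenger_count` appears first in the output of `df.head()` above.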


# Headers (or not)

By default, `read_csv` treats the first row of the file as the column names. If the CSV file doesn't have a header row, that can be bad, because the first row is actually data!

You can tell Pandas to treat the first row in the file as data (not headers) by passing `header=None`.

If the header row isn't on the first line of the file, then you can pass `header=n`, where `n` is the number of the row where the column names are actually located, counting from 0 (so the first line is row 0).  Pandas will skip over the first `n` rows of the file, read the headers, and then read the data.

In [16]:
pd.read_csv('taxi.csv', header=None)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
1,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.954429626464844,40.764141082763672,1,N,-73.974754333496094,40.754093170166016,2,17,0,0.5,0,0,0.3,17.8
2,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,.46,-73.971443176269531,40.758941650390625,1,N,-73.978538513183594,40.761909484863281,1,6.5,0,0.5,1,0,0.3,8.3
3,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,.87,-73.978111267089844,40.738433837890625,1,N,-73.990272521972656,40.745437622070313,1,8,0,0.5,2.2,0,0.3,11
4,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892333984375,40.773529052734375,1,N,-73.971527099609375,40.760330200195312,1,13.5,0,0.5,2.86,0,0.3,17.16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,1,2015-06-01 00:12:59,2015-06-01 00:24:18,1,2.70,-73.947792053222656,40.814971923828125,1,N,-73.973358154296875,40.783638000488281,2,11,0.5,0.5,0,0,0.3,12.3
9996,1,2015-06-01 00:12:59,2015-06-01 00:28:16,1,4.50,-74.004066467285156,40.747817993164063,1,N,-73.953758239746094,40.779285430908203,1,16,0.5,0.5,3,0,0.3,20.3
9997,2,2015-06-01 00:13:00,2015-06-01 00:37:25,1,5.59,-73.994377136230469,40.766101837158203,1,N,-73.903205871582031,40.750545501708984,2,21,0.5,0.5,0,0,0.3,22.3
9998,2,2015-06-01 00:13:02,2015-06-01 00:19:10,6,1.54,-73.978302001953125,40.748531341552734,1,N,-73.989166259765625,40.762851715087891,2,6.5,0.5,0.5,0,0,0.3,7.8
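
Passing `header=None` to `taxi.csv`, as above, turns the real header into a row of data, which isn't what we want for that file. Here is a sketch of the two cases where a non-default `header` does help, using made-up file contents read from strings (both the data and the variable names are hypothetical):

```python
import io
import pandas as pd

# Hypothetical file: two junk lines, then the real header on row 2 (counting from 0)
offset_text = """exported by some tool
generated 2015-06-02
trip_distance,total_amount
1.63,17.8
0.46,8.3
"""
with_offset = pd.read_csv(io.StringIO(offset_text), header=2)

# Hypothetical file with no header row at all: every line is data
no_header_text = """1.63,17.8
0.46,8.3
"""
no_header = pd.read_csv(io.StringIO(no_header_text), header=None)

print(with_offset)
print(no_header)      # columns are numbered 0, 1
```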


Every data frame has two attributes that you should know about:

- `index`, which contains the index (for the rows)
- `columns`, which contains the column names (as an index object, but applied on the other axis)

You can assign to either of these if you want to modify the index or columns; just provide an index object or (if you prefer) a series or list with one element per row (for `index`) or per column (for `columns`).

In [17]:
df.index

RangeIndex(start=0, stop=9999, step=1)

In [18]:
df.columns

Index(['passenger_count', 'trip_distance', 'tip_amount', 'total_amount'], dtype='object')
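
A minimal sketch of assigning to `index` and `columns`, using a small hypothetical data frame (`toy`) rather than the taxi data:

```python
import pandas as pd

# Hypothetical 2x3 data frame with default integer index and column names
toy = pd.DataFrame([[10, 20, 30],
                    [40, 50, 60]])

toy.index = ['x', 'y']                      # a plain list works
toy.columns = pd.Index(['a', 'b', 'c'])     # so does an index object

print(toy)
```

After the assignment, the new labels work with `.loc`, so `toy.loc['y', 'b']` retrieves 50.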

In [19]:
# time and date data -- as read, the datetime columns are currently treated as strings
# next week, we'll be talking about how to parse and work with them

# the delimiter in CSV is ',' by default, but you can pass something else with sep='\t' (for tab)
# if you have a ',' inside a data field, CSV handles that by putting "" around the field

In [21]:
df = DataFrame([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90]],
              index=list('xyz'),
              columns=list('abc'))
df

Unnamed: 0,a,b,c
x,10,20,30
y,40,50,60
z,70,80,90


In [22]:
df['b'] = ['hello, out there', 'goodbye', 'whatever']
df

Unnamed: 0,a,b,c
x,10,"hello, out there",30
y,40,goodbye,60
z,70,whatever,90


In [23]:
print(df.to_csv())

,a,b,c
x,10,"hello, out there",30
y,40,goodbye,60
z,70,whatever,90
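
To round this out, here is a sketch showing that the quoted field survives a round trip through `read_csv`, and that with `sep='\t'` the comma no longer needs quoting. The data frame `toy` is a cut-down, hypothetical version of the one above.

```python
import io
import pandas as pd

# Hypothetical data frame with a comma inside one of the text fields
toy = pd.DataFrame({'a': [10, 40],
                    'b': ['hello, out there', 'goodbye']},
                   index=['x', 'y'])

# Comma-separated: the field containing a comma gets wrapped in ""
csv_text = toy.to_csv()
round_trip = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Tab-separated: the comma is ordinary text, so no quotes are needed
tsv_text = toy.to_csv(sep='\t')
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t', index_col=0)

print(csv_text)
print(tsv_text)
```

Passing `index_col=0` tells `read_csv` to use the first column as the index, rather than loading it as an unnamed data column.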

