# Agenda, day 3 — real-world data

1. Recap + Q&A
2. More on CSV
3. Reading online data
4. Sorting
5. Grouping
6. Pivot tables
7. Joining tables
8. Cleaning data

# Recap

1. Most work in Pandas is done in a data frame
    - 2D table
    - Columns -- names must be unique
    - Rows -- with an index that doesn't need to be unique
2. We can retrieve from (or assign to) a Pandas data frame using `.loc` and `.iloc`
3. In particular, using `.loc` is a key part of working with Pandas
    - In the one-argument version, we just pass a *row selector*, i.e., `df.loc[ROW_SELECTOR]`, which can be:
        - A string (for the index of the row(s) we want)
        - A list of strings (for the indexes of the row(s) we want)
        - A boolean series (indicating which rows we want, wherever there's a `True` in there)
        - A slice, for a number of rows
    - In the two-argument version, we pass both a *row selector* and a *column selector*, separated by a comma: `df.loc[ROW_SELECTOR, COLUMN_SELECTOR]`. The row selector works just as it does in the one-argument version. The column selector accepts the same kinds of values, but describes which columns we want.
4. At the end of last week, we saw that we can read data from a CSV file    
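
As a quick refresher, here is a minimal sketch of those selectors. The data frame, its column names (`x`, `y`) and its index labels are invented for illustration; they aren't from any of our data files.

```python
import pandas as pd

# Hypothetical data, made up for this sketch
df = pd.DataFrame(
    {"x": [10, 20, 30, 40], "y": [1.5, 2.5, 3.5, 4.5]},
    index=["a", "b", "c", "d"],
)

# One-argument version: just a row selector
one_row = df.loc["b"]            # a string: the row whose index is 'b'
some_rows = df.loc[["a", "c"]]   # a list of strings: several rows
big_x = df.loc[df["x"] > 20]     # a boolean series: rows where it's True
a_to_c = df.loc["a":"c"]         # a slice: with .loc, the end label is included

# Two-argument version: row selector, then column selector
y_of_big_x = df.loc[df["x"] > 20, "y"]

# Assignment uses the same selectors
df.loc[df["x"] > 20, "y"] = 0.0
```

Note that a slice in `.loc` works on index labels, so `'a':'c'` returns three rows, not two.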

# Reading from CSV files

"CSV" stands for "comma-separated values" or "character-separated values."  The idea is:

- Every row in the file represents one record
- The field separator (a comma, by default) separates fields from one another in each record.

You can imagine a CSV file looking something like this:

```
United States,English
United Kingdom,English
France,French
Germany,German
Netherlands,Dutch
```

If the above is a CSV file, then we want to read it into a data frame, such that the country will be one column and the language will be a second column.

There is a problem with what I did here: There is no header line giving the column names! If we aren't careful, then when we read the file in, the first line will be treated as the column names, not as a data row.
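
Here is a small sketch of that pitfall. To keep it self-contained, it holds the header-less text in memory with `io.StringIO` rather than reading a file from disk; that is a choice made just for this example.

```python
import io
import pandas as pd

# The header-less data from above, as a string instead of a file
raw = """United States,English
United Kingdom,English
France,French
Germany,German
Netherlands,Dutch
"""

# With the defaults, the first line is taken as the column names
df = pd.read_csv(io.StringIO(raw))

print(list(df.columns))   # ['United States', 'English']
print(len(df))            # 4 -- one of our five records was lost to the header
```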

Let's consider this, then:

```
country,language
United States,English
United Kingdom,English
France,French
Germany,German
Netherlands,Dutch
```

If we want to read a CSV file into a data frame, we use `pd.read_csv`. This function takes a filename as an argument, and falls back on defaults for its *many* options. Here are a few of the options we might want to pass to `pd.read_csv`.

(Note that all of these are passed as keyword arguments, meaning that they are of the form `name=value` in the function call.)

1. `sep` -- pass this to indicate what character is separating the fields on each line. By default, it's `','`. I often use `'\t'` (tab) for a separator, because I find it easier to read.
2. `header` -- pass this to indicate which line of the file (counting from 0) contains the column names. This is useful if you need to skip over several lines before you get to the header. If there are no column names (i.e., no header), then just pass `header=None`.
3. `usecols` -- this argument takes a list of strings (or of integers, if you really want), indicating which fields should be included in the data frame. If you pass strings, then they should be column names. If you pass integers, they should be column indexes, starting with 0.
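
The three options above can be sketched as follows. The file contents here are invented for the example (including the "notes" line at the top), and they are read from in-memory strings via `io.StringIO` so that nothing needs to exist on disk.

```python
import io
import pandas as pd

# Hypothetical tab-separated file: one line of notes, then the header, then data
raw_tsv = "my notes about this file\ncountry\tlanguage\nFrance\tFrench\nGermany\tGerman\n"

df = pd.read_csv(
    io.StringIO(raw_tsv),
    sep="\t",             # fields are separated by tabs, not commas
    header=1,             # column names are on the second line (counting from 0)
    usecols=["country"],  # keep only this column
)

# Hypothetical comma-separated file with no header at all
raw_no_header = "France,French\nGermany,German\n"

# header=None: every line is data, and the columns are numbered 0, 1, ...
df2 = pd.read_csv(io.StringIO(raw_no_header), header=None)
```

With `header=None`, the column names are integers, so a `usecols` list for such a file would contain integers as well.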

In [1]:
!ls *.csv

burrito_current.csv	   languages.csv  titanic3.csv
celebrity_deaths_2016.csv  taxi.csv


In [2]:
import pandas as pd  # needed for pd.read_csv

df = pd.read_csv('taxi.csv')

In [3]:
!head taxi.csv

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.954429626464844,40.764141082763672,1,N,-73.974754333496094,40.754093170166016,2,17,0,0.5,0,0,0.3,17.8
2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,.46,-73.971443176269531,40.758941650390625,1,N,-73.978538513183594,40.761909484863281,1,6.5,0,0.5,1,0,0.3,8.3
2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,.87,-73.978111267089844,40.738433837890625,1,N,-73.990272521972656,40.745437622070313,1,8,0,0.5,2.2,0,0.3,11
2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892333984375,40.773529052734375,1,N,-73.971527099609375,40.760330200195312,1,13.5,0,0.5,2.86,0,0.3,17.16
1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.40,-73.979087829589844,40.776771545410156,1

In [4]:
df.head()

|   | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | pickup_longitude | pickup_latitude | RateCodeID | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2015-06-02 11:19:29 | 2015-06-02 11:47:52 | 1 | 1.63 | -73.95443 | 40.764141 | 1 | N | -73.974754 | 40.754093 | 2 | 17.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.3 | 17.8 |
| 1 | 2 | 2015-06-02 11:19:30 | 2015-06-02 11:27:56 | 1 | 0.46 | -73.971443 | 40.758942 | 1 | N | -73.978539 | 40.761909 | 1 | 6.5 | 0.0 | 0.5 | 1.0 | 0.0 | 0.3 | 8.3 |
| 2 | 2 | 2015-06-02 11:19:31 | 2015-06-02 11:30:30 | 1 | 0.87 | -73.978111 | 40.738434 | 1 | N | -73.990273 | 40.745438 | 1 | 8.0 | 0.0 | 0.5 | 2.2 | 0.0 | 0.3 | 11.0 |
| 3 | 2 | 2015-06-02 11:19:31 | 2015-06-02 11:39:02 | 1 | 2.13 | -73.945892 | 40.773529 | 1 | N | -73.971527 | 40.76033 | 1 | 13.5 | 0.0 | 0.5 | 2.86 | 0.0 | 0.3 | 17.16 |
| 4 | 1 | 2015-06-02 11:19:32 | 2015-06-02 11:32:49 | 1 | 1.4 | -73.979088 | 40.776772 | 1 | N | -73.982162 | 40.758999 | 2 | 9.5 | 0.0 | 0.5 | 0.0 | 0.0 | 0.3 | 10.3 |


In [5]:
df = pd.read_csv('taxi.csv', usecols=['trip_distance', 'total_amount', 'passenger_count'])

In [6]:
df.head()

|   | passenger_count | trip_distance | total_amount |
|---|---|---|---|
| 0 | 1 | 1.63 | 17.8 |
| 1 | 1 | 0.46 | 8.3 |
| 2 | 1 | 0.87 | 11.0 |
| 3 | 1 | 2.13 | 17.16 |
| 4 | 1 | 1.4 | 10.3 |


In [7]:
df.dtypes  # what dtype is each column?

passenger_count      int64
trip_distance      float64
total_amount       float64
dtype: object