# Week 4: Text and dates

1. Text
    - Working with text data via the `str` accessor
    - Using `str` to clean integer data
    - Getting textual statistics 
    - Cleaning text + strings
2. Dates and times
    - How do dates and times work as data structures?
    - `datetime` and `timedelta` objects
    - Reading date information from CSV files
    - Retrieving via dates and times
    - Time series -- setting the index to use a datetime column
    - Resampling -- grouping via time

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [2]:
# load NYC taxi data from January, 2019
filename = '/Users/reuven/Courses/Current/data/nyc_taxi_2019-01.csv'

df = pd.read_csv(filename, 
                usecols=['passenger_count', 'trip_distance', 'total_amount'])

In [3]:
df.head()

| | passenger_count | trip_distance | total_amount |
|---|---|---|---|
| 0 | 1 | 1.5 | 9.95 |
| 1 | 1 | 2.6 | 16.3 |
| 2 | 3 | 0.0 | 5.8 |
| 3 | 5 | 0.0 | 7.55 |
| 4 | 5 | 0.0 | 55.55 |


In [4]:
df.dtypes

passenger_count      int64
trip_distance      float64
total_amount       float64
dtype: object
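
To see what `usecols` does without the full taxi file, here is a minimal sketch that reads a tiny, made-up CSV from memory. The CSV text, the extra `tip_amount` column, and the name `small_df` are assumptions for illustration only; they are not part of the course data.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for the real taxi file.
# The extra 'tip_amount' column is there only to show that usecols skips it.
csv_text = """passenger_count,trip_distance,tip_amount,total_amount
1,1.5,1.65,9.95
1,2.6,1.00,16.30
3,0.0,0.00,5.80
"""

# Only the three named columns are loaded into the data frame
small_df = pd.read_csv(io.StringIO(csv_text),
                       usecols=['passenger_count', 'trip_distance', 'total_amount'])

print(small_df.dtypes)
```

Loading only the columns you need keeps memory use down, which matters with a file that has millions of rows.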

In [6]:
# what if I want to find the 10 shortest-distance trips?

df.sort_values(by='trip_distance').head(10)

| | passenger_count | trip_distance | total_amount |
|---|---|---|---|
| 7667791 | 1 | 0.0 | 0.0 |
| 4863796 | 0 | 0.0 | 3.3 |
| 4863795 | 2 | 0.0 | 3.3 |
| 4863794 | 2 | 0.0 | 3.3 |
| 4863793 | 1 | 0.0 | 3.3 |
| 4863792 | 1 | 0.0 | 3.3 |
| 4863789 | 1 | 0.0 | 5.3 |
| 4863768 | 1 | 0.0 | 3.3 |
| 4863743 | 1 | 0.0 | 3.96 |
| 2682283 | 1 | 0.0 | 20.3 |
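
The "sort, then take the head" pattern can also be written with `nsmallest`, which the pandas documentation describes as equivalent but more performant. The sketch below uses a small, made-up data frame (`toy` is a hypothetical stand-in for the taxi data) with no ties, so both approaches return the same rows.

```python
import pandas as pd

# Hypothetical toy data standing in for the taxi data frame
toy = pd.DataFrame({
    'passenger_count': [1, 2, 1, 3, 1],
    'trip_distance': [1.5, 0.3, 2.6, 0.0, 0.9],
    'total_amount': [9.95, 4.30, 16.30, 5.80, 7.55],
})

# Approach used above: sort everything, then keep the first 3 rows
shortest_sorted = toy.sort_values(by='trip_distance').head(3)

# Alternative: ask directly for the 3 rows with the smallest trip_distance
shortest_direct = toy.nsmallest(3, 'trip_distance')

print(shortest_sorted)
print(shortest_direct)
```

When many rows share the same value, as with all of the 0.0-mile trips in the real data, which of the tied rows you get back can differ between the two approaches.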


In [7]:
# what if I want to sort first by trip_distance, and then (in the case of a tie) by total_amount?

df.sort_values(by=['trip_distance', 'total_amount']).head(20)

| | passenger_count | trip_distance | total_amount |
|---|---|---|---|
| 4890628 | 1 | 0.0 | -362.8 |
| 6308124 | 2 | 0.0 | -320.3 |
| 57093 | 1 | 0.0 | -300.3 |
| 7227721 | 1 | 0.0 | -300.3 |
| 868820 | 1 | 0.0 | -250.31 |
| 54256 | 1 | 0.0 | -224.8 |
| 57095 | 1 | 0.0 | -190.3 |
| 3339311 | 1 | 0.0 | -165.3 |
| 3310047 | 1 | 0.0 | -160.8 |
| 6153373 | 1 | 0.0 | -150.8 |
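
When sorting by several columns, `ascending` can also take a list with one value per column, so the tie-breaker can run in the opposite direction. Here is a minimal sketch on made-up data (`toy` is hypothetical, not the taxi data):

```python
import pandas as pd

# Hypothetical toy data with ties in trip_distance
toy = pd.DataFrame({
    'trip_distance': [0.0, 0.0, 1.2, 0.0, 1.2],
    'total_amount': [5.80, -3.30, 10.00, 55.55, 8.00],
})

# Both columns ascending: shortest trips first, cheapest first within a tie
both_ascending = toy.sort_values(by=['trip_distance', 'total_amount'])

# Shortest trips first, but most expensive first within a tie
mixed = toy.sort_values(by=['trip_distance', 'total_amount'],
                        ascending=[True, False])

print(both_ascending)
print(mixed)
```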


In [8]:
# grouping 

# grouping allows us to ask a question, and to get a separate answer for each
# unique value of a particular column.

# to group, we need:
# (1) a categorical column on which to group
# (2) a numeric column on which to perform our calculation
# (3) an aggregation method that will give us one value back for all rows for each categorical value

df.groupby('passenger_count')['trip_distance'].mean()

passenger_count
0    2.651561
1    2.779088
2    2.880572
3    2.840698
4    2.853084
5    2.865741
6    2.842335
7    2.561579
8    3.142759
9    1.486667
Name: trip_distance, dtype: float64
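
To make the three ingredients concrete, here is a minimal sketch on a tiny, made-up data frame (`toy` is hypothetical). It computes the grouped mean, then reproduces it by hand with one boolean mask per unique value, which is roughly the question `groupby` answers for us in one step.

```python
import pandas as pd

# Hypothetical toy data: (1) a categorical column, (2) a numeric column
toy = pd.DataFrame({
    'passenger_count': [1, 1, 2, 2, 2],
    'trip_distance': [1.0, 3.0, 2.0, 4.0, 6.0],
})

# (3) an aggregation method: one mean per unique passenger_count
grouped = toy.groupby('passenger_count')['trip_distance'].mean()

# The same calculation by hand: filter the rows for each unique value
manual = {}
for count in sorted(toy['passenger_count'].unique()):
    mask = toy['passenger_count'] == count
    manual[count] = toy.loc[mask, 'trip_distance'].mean()

print(grouped)
print(manual)
```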

In [9]:
df.groupby('passenger_count')['total_amount'].mean()

passenger_count
0    18.663658
1    15.609601
2    15.831294
3    15.604015
4    15.650307
5    15.546940
6    15.437892
7    48.278421
8    64.105517
9    31.094444
Name: total_amount, dtype: float64

In [10]:
# what if I want to get info about more than one column?
# two options:
# (1) don't specify columns in square brackets, and then every other column is aggregated
#     (in pandas 2.0+, pass numeric_only=True to mean() if some columns are non-numeric)
# (2) specify them in double square brackets, giving a list of columns

df.groupby('passenger_count')[['total_amount', 'trip_distance']].mean()

| passenger_count | total_amount | trip_distance |
|---|---|---|
| 0 | 18.663658 | 2.651561 |
| 1 | 15.609601 | 2.779088 |
| 2 | 15.831294 | 2.880572 |
| 3 | 15.604015 | 2.840698 |
| 4 | 15.650307 | 2.853084 |
| 5 | 15.54694 | 2.865741 |
| 6 | 15.437892 | 2.842335 |
| 7 | 48.278421 | 2.561579 |
| 8 | 64.105517 | 3.142759 |
| 9 | 31.094444 | 1.486667 |
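
The two options can be compared side by side on a small, made-up data frame (`toy` is hypothetical, and all of its columns are numeric). Note that with a list of columns, the result follows the order of the list rather than the order in the original data frame.

```python
import pandas as pd

# Hypothetical toy data; every column is numeric
toy = pd.DataFrame({
    'passenger_count': [1, 1, 2, 2],
    'trip_distance': [1.0, 3.0, 2.0, 4.0],
    'total_amount': [10.0, 20.0, 15.0, 25.0],
})

# Option 1: no column selection, so every non-grouping column is averaged
all_columns = toy.groupby('passenger_count').mean()

# Option 2: a list of columns inside the square brackets
chosen = toy.groupby('passenger_count')[['total_amount', 'trip_distance']].mean()

print(all_columns)
print(chosen)
```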


In [11]:
df.groupby('passenger_count')[['total_amount', 'trip_distance']].mean().sort_values(by='total_amount')

| passenger_count | total_amount | trip_distance |
|---|---|---|
| 6 | 15.437892 | 2.842335 |
| 5 | 15.54694 | 2.865741 |
| 3 | 15.604015 | 2.840698 |
| 1 | 15.609601 | 2.779088 |
| 4 | 15.650307 | 2.853084 |
| 2 | 15.831294 | 2.880572 |
| 0 | 18.663658 | 2.651561 |
| 9 | 31.094444 | 1.486667 |
| 7 | 48.278421 | 2.561579 |
| 8 | 64.105517 | 3.142759 |
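
The result of a grouped aggregation is itself a data frame whose index holds the group values, which is why it can be chained straight into `sort_values`. A minimal sketch on made-up data (`toy` is hypothetical) that sorts from highest to lowest and then turns the index back into a regular column:

```python
import pandas as pd

# Hypothetical toy data standing in for the taxi data frame
toy = pd.DataFrame({
    'passenger_count': [1, 1, 2, 2, 3],
    'total_amount': [10.0, 20.0, 15.0, 45.0, 12.0],
    'trip_distance': [1.0, 3.0, 2.0, 4.0, 2.5],
})

means = toy.groupby('passenger_count')[['total_amount', 'trip_distance']].mean()

# Highest average total_amount first
by_amount_desc = means.sort_values(by='total_amount', ascending=False)

# Move passenger_count from the index back into an ordinary column
flat = by_amount_desc.reset_index()

print(by_amount_desc)
print(flat)
```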


# Text

In Python (not Pandas), we use strings all of the time. They're really useful! But think about how Pandas does things: It stores its values inside of NumPy arra