# Agenda, day 4 — text and dates
 
1. Q&A
2. Textual data and Pandas
3. Cleaning dirty textual data
4. Statistics about text
5. Useful string methods
6. Time and date information
    - `datetime` 
    - `timedelta`
7. Calculating time deltas
8. Time series (i.e., where we have time data as our index)
9. Resampling

In [1]:
import pandas as pd
from pandas import Series, DataFrame

In [2]:
# let's assume that I have a series containing some text

s = Series('This is a sample sentence for use in my Pandas course'.split())
s

0         This
1           is
2            a
3       sample
4     sentence
5          for
6          use
7           in
8           my
9       Pandas
10      course
dtype: object

In [3]:
# how can I find out the length of each word in this series?

# could we / should we use a "for" loop and run "len" on each word?

# using a "for" loop in Pandas is almost always the wrong solution.

for one_item in s:
    print(len(one_item))

4
2
1
6
8
3
3
2
2
6
6


In [4]:
# what we want is a way to run "len" on each element
# without a "for" loop

# Pandas provides us with a way to broadcast our string methods/functionality across elements of a series

# I'd want to say

s.len()

AttributeError: 'Series' object has no attribute 'len'

In [6]:
# we can use the "str" accessor object on any series that contains strings
# in other words, we can say s.str.METHOD_NAME and there are many, many methods defined for s.str

# we get back a new series, one whose index is identical to s!

s.str.len() 

0     4
1     2
2     1
3     6
4     8
5     3
6     3
7     2
8     2
9     6
10    6
dtype: int64

In [7]:
# what if I want to find all of the words that are longer than average in the series?

s.str.len().mean()   # find the mean length of words in s

3.909090909090909

In [8]:
# what words are longer than that?

# series > float -- we run a broadcast, and get a boolean series in return
s.str.len() > s.str.len().mean()

0      True
1     False
2     False
3      True
4      True
5     False
6     False
7     False
8     False
9      True
10     True
dtype: bool

In [9]:
# now, let's retrieve the elements of s where the word length > mean
s.loc[s.str.len() > s.str.len().mean()]

0         This
3       sample
4     sentence
9       Pandas
10      course
dtype: object

# The `.str` accessor

If you want to run string methods on every element in a series, you can do so with `.str` and then the method name. What methods are available?

- Most Python string methods
- Many Python operations, implemented as methods
    - `.str.contains` implements `in`
    - `.str.get` implements `[]`
- A few other methods that are just useful, often taken from the R language    
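
Here's a minimal sketch of how a few of these line up with plain Python. The series `words` and its contents are made up for illustration; they aren't part of the course data.

```python
import pandas as pd

# hypothetical sample data, not from the course files
words = pd.Series(['apple', 'banana', 'kiwi'])

upper = words.str.upper()          # Python's str.upper, applied to each element
has_an = words.str.contains('an')  # like: 'an' in one_word
first = words.str.get(0)           # like: one_word[0]
lengths = words.str.len()          # like: len(one_word)

print(pd.DataFrame({'word': words, 'upper': upper, 'has_an': has_an,
                    'first': first, 'length': lengths}))
```

Note that `.str.contains` treats its argument as a regular expression by default, so characters such as `.` or `$` need care.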

# Exercise: Shorter than average strings

1. Ask the user to enter a sentence.
2. Turn the sentence into a series.
3. Find all of the words in the sentence that are shorter than average, and print them.

In [10]:
# you can get user input with the "input" function

s = input('Enter a string: ').strip()  # strip removes leading/trailing whitespace from the string

Enter a string: asdfasfdafasf


In [11]:
s

'asdfasfdafasf'

In [12]:
s = input('Enter a sentence: ').strip()

Enter a sentence: this is yet another test sentence for my Pandas course


In [13]:
s

'this is yet another test sentence for my Pandas course'

In [15]:
# I want to turn this string into a Pandas series. I'll use "split" to turn it into a list of strings

words = Series(s.split())
words

0        this
1          is
2         yet
3     another
4        test
5    sentence
6         for
7          my
8      Pandas
9      course
dtype: object

In [17]:
# to get the shorter-than-average words, I need:

# (1) find the average word length
# (2) find the length of each word
# (3) find which words are shorter than the average length

words.str.len().mean()

4.5

In [19]:
# which word lengths are shorter than the average length?
# we'll get a boolean series back, with True where it's shorter and False where it isn't
words.str.len() < words.str.len().mean()

0     True
1     True
2     True
3    False
4     True
5    False
6     True
7     True
8    False
9    False
dtype: bool

In [20]:
# now I need to apply that boolean series back to words, to filter out
# any of the values where the boolean series is False

# here, we use .loc to keep only those words that are shorter than average length
words.loc[  words.str.len() < words.str.len().mean()  ]

0    this
1      is
2     yet
4    test
6     for
7      my
dtype: object

In [21]:
# what happens if I do this:

s = Series('10 20 30 40 50'.split())
s

0    10
1    20
2    30
3    40
4    50
dtype: object

In [22]:
# what happens if I add these together?

# these are strings, and using + on them is a bit .. dangerous

s.sum()

'1020304050'

In [23]:
# it gets worse:

s.mean()  # this is awful -- it takes s.sum(), turns it into an integer, and then divides by 5!

204060810.0

In [25]:
# if I want the sum or the mean of the numbers in s, I need to convert the dtype to integer

import numpy as np
s = s.astype(np.int8)
s

0    10
1    20
2    30
3    40
4    50
dtype: int8

In [26]:
s.sum()

150

In [27]:
s.mean()

30.0

In [28]:
# what if the data isn't quite this nice and simple?

s = Series('10 20 30 hello goodbye 40 whatever 50'.split())
s

0          10
1          20
2          30
3       hello
4     goodbye
5          40
6    whatever
7          50
dtype: object

In [29]:
# what will happen now if I try to convert the series to np.int8? 

s.astype(np.int8)

ValueError: invalid literal for int() with base 10: 'hello'

In [30]:
# what do I want to do here?

# - identify which elements in s contain only digits
# - remove the non-digit elements from the series
# - use .astype(np.int8) on what remains

# I can use .str.isdigit(), a method taken straight from Python's string class
# this returns True if the string is non-empty and contains only digit characters (0-9, plus other Unicode digits).

s.str.isdigit()

0     True
1     True
2     True
3    False
4    False
5     True
6    False
7     True
dtype: bool

In [33]:
# use the boolean series as a mask index

s = s.loc[s.str.isdigit()].astype(np.int8)
s

0    10
1    20
2    30
5    40
7    50
dtype: int8

In [34]:
s.sum()

150

In [35]:
s.mean()

30.0

In [36]:
# what if I want to replace bad strings with NaN?

s = Series('10 20 30 hello goodbye 40 whatever 50'.split())
s

0          10
1          20
2          30
3       hello
4     goodbye
5          40
6    whatever
7          50
dtype: object

In [40]:
# use the ~ to flip the logic, as "not"

s.loc[~s.str.isdigit()] = np.nan

In [41]:
s

0     10
1     20
2     30
3    NaN
4    NaN
5     40
6    NaN
7     50
dtype: object

In [43]:
# now we can convert the values to floats, because NaN is a float
s.astype(np.float16)

0    10.0
1    20.0
2    30.0
3     NaN
4     NaN
5    40.0
6     NaN
7    50.0
dtype: float16

In [44]:
# I cannot turn them into integers, though...
s.astype(np.int8)

ValueError: cannot convert float NaN to integer
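
If you do want integers alongside missing values, one possible approach (not the one used above) is to combine `pd.to_numeric` with the nullable `'Int64'` dtype. This is a sketch under the assumption that every parseable value is a whole number:

```python
import pandas as pd

s = pd.Series('10 20 30 hello goodbye 40 whatever 50'.split())

# errors='coerce' turns anything that can't be parsed into NaN
numbers = pd.to_numeric(s, errors='coerce')
print(numbers.dtype)    # a float dtype, because NaN is a float

# the nullable 'Int64' dtype (capital I) can hold integers alongside missing values
nullable = numbers.astype('Int64')
print(nullable)
print(nullable.sum())   # missing values are skipped by default
```

The trade-off is that `'Int64'` is a pandas extension dtype rather than a NumPy one, so it behaves slightly differently from `np.int64` in some operations.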

# `.str.contains` -- checking membership



In [50]:
s = Series('this is a bunch of words for my course'.split())

In [51]:
s

0      this
1        is
2         a
3     bunch
4        of
5     words
6       for
7        my
8    course
dtype: object

In [52]:
# we can find out which of the strings contain a substring
# much as we would use "in" in regular Python as an operator

s.str.contains('i')

0     True
1     True
2    False
3    False
4    False
5    False
6    False
7    False
8    False
dtype: bool

In [53]:
# find words in s that contain 'i'
s.loc[s.str.contains('i')]

0    this
1      is
dtype: object

In [54]:
# what if I want all of those words that contain either e or i?
# option 1: use | to combine them, as an "or" operator

s.loc[s.str.contains('i') | s.str.contains('e')]

0      this
1        is
8    course
dtype: object

In [55]:
# what if I want all of those words that contain either e or i?
# option 2: use a regular expression!

s.loc[s.str.contains('[ei]')]  # this means: one of "e" or "i"

0      this
1        is
8    course
dtype: object

 # Exercises with `.str`
 
 1. Define a series of strings with both digits and non-digits as the elements.
 2. As I did before, remove the non-digit elements, turn the digits into integers, and then sum them.
 3. Find those elements that contained either `3` or `8` in them, and display them.
 4. Find those elements that contain `3`, and which are shorter than average length.

In [56]:
s = Series('123 abc 456 defg 7hi j8k 9876 135'.split())
s

0     123
1     abc
2     456
3    defg
4     7hi
5     j8k
6    9876
7     135
dtype: object

In [57]:
s.astype(np.int64)

ValueError: invalid literal for int() with base 10: 'abc'

In [61]:
# (1) find which elements of s contain only digits
# (2) use .loc to retrieve only those elements into a new series
# (3) turn the dtype of that new series into np.int64

s = s.loc[s.str.isdigit()].astype(np.int64)
s

0     123
2     456
6    9876
7     135
dtype: int64

In [62]:
s.sum()

10590

In [66]:
# Find those elements that contained either 3 or 8 in them, and display them.

s = Series('123 abc 456 defg 7hi j8k 9876 135'.split())

s.loc[ s.str.contains('3') | s.str.contains('8') ]

0     123
5     j8k
6    9876
7     135
dtype: object

In [68]:
# Find those elements that contain 3, and which are shorter than average length.

      # does s contain '3'?          is the length < the average word length
s.loc[   s.str.contains('3')      &   (s.str.len() < s.str.len().mean())  ]

0    123
7    135
dtype: object

In [69]:
# can I run .str on integers?

s = s.loc[s.str.isdigit()].astype(np.int64)
s

0     123
2     456
6    9876
7     135
dtype: int64

In [70]:
s.str.contains('3')

AttributeError: Can only use .str accessor with string values!

# Next up

- Textual statistics
- Trimming strings


In [71]:
s = Series('this is a sample sentence that is truly interesting and amazing and wonderful and shows off the interesting and amazing things we can do with Pandas'.split())
s

0            this
1              is
2               a
3          sample
4        sentence
5            that
6              is
7           truly
8     interesting
9             and
10        amazing
11            and
12      wonderful
13            and
14          shows
15            off
16            the
17    interesting
18            and
19        amazing
20         things
21             we
22            can
23             do
24           with
25         Pandas
dtype: object

In [72]:
# how can I get some statistics that describe this series?
# we know that we can use .describe on numeric series to get "descriptive statistics"
# but there's no mean, std, min, max, etc. with text.  So... what'll happen?

s.describe()

count      26
unique     20
top       and
freq        4
dtype: object

In [74]:
df = DataFrame({'a':[10, 20, 30, 40, 50, 60, 70, 80, 90, 100], 'b':'this is a sentence and it is a great one'.split()})

In [75]:
df

Unnamed: 0,a,b
0,10,this
1,20,is
2,30,a
3,40,sentence
4,50,and
5,60,it
6,70,is
7,80,a
8,90,great
9,100,one


In [78]:
# what happens if I run df.describe()? 

# if you have both numeric and non-numeric columns in your data frame,
# only the numeric ones will be described when you run df.describe()

df.describe()

Unnamed: 0,a
count,10.0
mean,55.0
std,30.276504
min,10.0
25%,32.5
50%,55.0
75%,77.5
max,100.0


In [79]:
help(df.describe)

Help on method describe in module pandas.core.generic:

describe(percentiles=None, include=None, exclude=None) -> 'NDFrameT' method of pandas.core.frame.DataFrame instance
    Generate descriptive statistics.
    
    Descriptive statistics include those that summarize the central
    tendency, dispersion and shape of a
    dataset's distribution, excluding ``NaN`` values.
    
    Analyzes both numeric and object series, as well
    as ``DataFrame`` column sets of mixed data types. The output
    will vary depending on what is provided. Refer to the notes
    below for more detail.
    
    Parameters
    ----------
    percentiles : list-like of numbers, optional
        The percentiles to include in the output. All should
        fall between 0 and 1. The default is
        ``[.25, .5, .75]``, which returns the 25th, 50th, and
        75th percentiles.
    include : 'all', list-like of dtypes or None (default), optional
        A white list of data types to include in the result. Ig

In [80]:
df.describe(include='all')

Unnamed: 0,a,b
count,10.0,10
unique,,8
top,,is
freq,,2
mean,55.0,
...,...,...
min,10.0,
25%,32.5,
50%,55.0,
75%,77.5,


In [82]:
df['b'].value_counts()

b
is          2
a           2
this        1
sentence    1
and         1
it          1
great       1
one         1
Name: count, dtype: int64

# Exercise: Text statistics

1. Create a series based on the contents of a text file. (It can be any text file!) If you don't have a good one that you want to use, you can download a text file from Project Gutenberg. (I often use Alice in Wonderland for my text.)
2. What are the 5 most common words in the file you downloaded?
3. How many distinct/different words does it contain?
4. If you lowercase all of the words using `.str.lower`, then re-run questions 2 and 3. What do you see?

In [84]:
# let's create a series based on Alice in Wonderland
s = Series(open('alice-in-wonderland.txt').read().split())

In [85]:
s.head(20)

0            ﻿The
1         Project
2       Gutenberg
3           EBook
4              of
5           Alice
6              in
7     Wonderland,
8              by
9           Lewis
10        Carroll
11           This
12          eBook
13             is
14            for
15            the
16            use
17             of
18         anyone
19       anywhere
dtype: object

In [87]:
# we can find common words with value_counts
s.value_counts().head(10)

the     732
and     362
a       321
to      311
of      300
in      211
she     197
was     160
said    129
it      122
Name: count, dtype: int64

In [88]:
s.describe()

count     12763
unique     3408
top         the
freq        732
dtype: object

In [89]:
# lowercase all of the words
s = s.str.lower()

In [90]:
# we can find common words with value_counts
s.value_counts().head(10)

the    792
and    379
a      325
to     318
of     313
she    232
in     222
was    160
you    141
it     136
Name: count, dtype: int64

In [91]:
s.describe()

count     12763
unique     3174
top         the
freq        792
dtype: object

# Cleaning and trimming strings

Very often, strings will contain characters that we want to remove. There are a few ways to deal with this:

- `.str.replace`, in which we indicate which text we want to replace, along with its replacement. If we want to remove a character, we can provide the empty string as the replacement.
- `.str.strip`, which removes whitespace from the start and end of the string
- `.str.strip(CHARS)`, which removes any of the characters in CHARS from the start and end of the string

In [93]:
s = Series('"Hello", is what she said to me.'.split())

In [94]:
s

0    "Hello",
1          is
2        what
3         she
4        said
5          to
6         me.
dtype: object

In [95]:
s.str.strip()

0    "Hello",
1          is
2        what
3         she
4        said
5          to
6         me.
dtype: object

In [100]:
# what if I want to remove the punctuation?
# strip with an argument still only removes from the front and back of the string; the middle is untouched
# it will remove any of the characters from the start or end of the string, until the first and last characters
# are not in the argument we provide.

# notice that I need to use .str twice here, because I'm running two methods
s.str.strip('".,?!$#').str.lower()

0    hello
1       is
2     what
3      she
4     said
5       to
6       me
dtype: object

In [101]:
s

0    "Hello",
1          is
2        what
3         she
4        said
5          to
6         me.
dtype: object

In [106]:
s.str.replace('"', '').str.replace(',', '')

0    Hello
1       is
2     what
3      she
4     said
5       to
6      me.
dtype: object

In [105]:
help(s.str.replace)

Help on method replace in module pandas.core.strings.accessor:

replace(pat: 'str | re.Pattern', repl: 'str | Callable', n: 'int' = -1, case: 'bool | None' = None, flags: 'int' = 0, regex: 'bool' = False) method of pandas.core.strings.accessor.StringMethods instance
    Replace each occurrence of pattern/regex in the Series/Index.
    
    Equivalent to :meth:`str.replace` or :func:`re.sub`, depending on
    the regex value.
    
    Parameters
    ----------
    pat : str or compiled regex
        String can be a character sequence or regular expression.
    repl : str or callable
        Replacement string or a callable. The callable is passed the regex
        match object and must return a replacement string to be used.
        See :func:`re.sub`.
    n : int, default -1 (all)
        Number of replacements to make from start.
    case : bool, default None
        Determines if replace is case sensitive:
    
        - If True, case sensitive (the default if `pat` is a string)
    

# Exercise: Grabbing integers from prices

1. Create a series in which the elements are strings -- prices containing a `$` sign. There are no decimals or floating-point numbers here.
2. Remove the `$`, such that we can then sum and calculate the mean of our numbers.

In [108]:
# str.replace used to assume that you want to use regular expressions (pattern-matching for text).
# at some point, you started to get warnings telling you that the default would change, and
# as of pandas 2.0 the default is regex=False. If you want to use a regexp, you need to be explicit about it.

s.str.replace('.', '?')

0    "Hello",
1          is
2        what
3         she
4        said
5          to
6         me?
dtype: object

In [109]:
pd.__version__

'2.0.2'

In [110]:
s.str.replace('.', '?', regex=True)

0    ????????
1          ??
2        ????
3         ???
4        ????
5          ??
6         ???
dtype: object

In [111]:
s = Series('$10 $20 $50 $70 $2345'.split())
s

0      $10
1      $20
2      $50
3      $70
4    $2345
dtype: object

In [112]:
# if I try to sum or get the mean of these, it won't work

s.sum()

'$10$20$50$70$2345'

In [113]:
s.mean()

TypeError: Could not convert $10$20$50$70$2345 to numeric

In [116]:
# I'm going to remove the $ from each element of the series
# option 1: use strip (this would probably be my preference)

s.str.strip('$').astype(np.int64).sum()

2495

In [117]:
s.str.strip('$').astype(np.int64).mean()

499.0

In [119]:
# use "agg" to run more than one aggregation method at once!
s.str.strip('$').astype(np.int64).agg(['sum', 'mean'])

sum     2495.0
mean     499.0
dtype: float64

In [121]:
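# option 2: use replace
# ('$' is a regexp metacharacter, so I pass regex=False to make sure it's treated as plain text)

s.str.replace('$', '', regex=False).astype(np.int64).agg(['sum', 'mean'])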

sum     2495.0
mean     499.0
dtype: float64

# Retrieving one character and multiple characters

In Python, we can retrieve one character from a string by passing `[i]`, where `i` is a numeric index. We'll get the character at that index.

Similarly, we can pass a *slice* in square brackets, where we indicate the starting point and the ending point (which is not included), as in `[start:end]` or even `[start:end:step]` if we want to ignore certain characters.

We can use both of these on strings via `.str`! The methods are known as `.get` and `.slice`.
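
As a sketch, the same retrievals can also be written with square brackets directly on `.str`, which mirrors the plain Python syntax. The sample prices here are made up for illustration.

```python
import pandas as pd

s = pd.Series(['$10', '$20', '$2345'])

first_get = s.str.get(0)            # method form
first_idx = s.str[0]                # bracket form, same result

rest_slice = s.str.slice(1, None)   # method form
rest_idx = s.str[1:]                # bracket form, same result

every_other = s.str[::2]            # a step works too

print(rest_idx)
print(every_other)
```

Which form to use is mostly a matter of taste; the method names are easier to search for, while the brackets look more like regular Python.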

In [122]:
s

0      $10
1      $20
2      $50
3      $70
4    $2345
dtype: object

In [123]:
# retrieve the first character in every string
s.str.get(0)

0    $
1    $
2    $
3    $
4    $
dtype: object

In [124]:
# retrieve the last character in every string
s.str.get(-1)

0    0
1    0
2    0
3    0
4    5
dtype: object

In [126]:
# I can retrieve a slice with .slice

s.str.slice(1, 3) # from index 1 up to (and not including) index 3

0    10
1    20
2    50
3    70
4    23
dtype: object

In [127]:
s.str.slice(1, None) # from index 1 through the end

0      10
1      20
2      50
3      70
4    2345
dtype: object

# Exercises

1. Repeat what we did before with the text file. What are the 10 most common words, if you lowercase them and remove all punctuation from the start and finish of each string?
2. Show all of the words without their first characters.

In [128]:
s = Series(open('alice-in-wonderland.txt').read().split())

s.value_counts().head(10)

the     732
and     362
a       321
to      311
of      300
in      211
she     197
was     160
said    129
it      122
Name: count, dtype: int64

In [129]:
s = s.str.lower()
s.value_counts().head(10)

the    792
and    379
a      325
to     318
of     313
she    232
in     222
was    160
you    141
it     136
Name: count, dtype: int64

In [130]:
# I'm going to use the "string" module in Python for ease

import string

s = s.str.strip(string.punctuation)
s = s.str.strip(string.whitespace)

s.value_counts().head(10)

the    807
and    404
a      328
to     327
of     318
she    237
in     227
it     183
you    171
was    168
Name: count, dtype: int64

In [131]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [132]:
string.whitespace

' \t\n\r\x0b\x0c'

In [136]:
'\x0b'

'\x0b'

In [137]:
# Show all of the words without their first characters.

# just the first character? we can use get

s.str.get(0)

0        ﻿
1        p
2        g
3        e
4        o
        ..
12758    t
12759    h
12760    a
12761    n
12762    e
Length: 12763, dtype: object

In [138]:
# get all but the first character -- we'll need a slice!

s.str.slice(1, None)

0             the
1          roject
2        utenberg
3            book
4               f
           ...   
12758           o
12759         ear
12760        bout
12761          ew
12762       books
Length: 12763, dtype: object

In [139]:
s.head(5)

0         ﻿the
1      project
2    gutenberg
3        ebook
4           of
dtype: object

In [140]:
s.loc[0] # \ufeff == the BOM, or byte order mark, a special Unicode character at the start of a file that signals its encoding and byte order (opening the file with encoding='utf-8-sig' would strip it)

'\ufeffthe'

# Unicode and strings

Strings in Python all use Unicode, which covers the characters of virtually every writing system, plus a bunch of symbols, e.g., emojis. Let's say that you want to refer to a Unicode character by its code point -- you can use the special syntax `\uXXXX` (exactly four hex digits) or `\UXXXXXXXX` (exactly eight hex digits), where the `X`s are replaced by hexadecimal digits.

This means that if your string contains `\U`, and it isn't followed by eight hex digits referring to a legit Unicode character, you'll get an error (a `SyntaxError`, when it's a string literal in your code).

I often see this when people try to open files on a Windows machine, and they refer to their own `\Users` directory. In such cases, double the backslashes, use a raw string (with an `r` before the opening quote), or use forward slashes, and you should be OK.

A related, but different, problem: if you try to read a binary file as text, its bytes probably won't form legal Unicode characters, and you'll get an error (typically a `UnicodeDecodeError`).
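
Here's a small sketch of these points in plain Python. The Windows path and the temporary file are invented for illustration; the `'utf-8-sig'` codec is part of Python's standard library and strips a leading BOM when reading.

```python
import os
import tempfile

# \u takes exactly 4 hex digits, \U takes exactly 8
e_acute = '\u00e9'          # LATIN SMALL LETTER E WITH ACUTE
grinning = '\U0001F600'     # the grinning-face emoji

# three ways to write a Windows path without tripping over \U
# (the path itself is made up)
doubled = 'C:\\Users\\me\\data.csv'
raw = r'C:\Users\me\data.csv'
forward = 'C:/Users/me/data.csv'
print(doubled == raw)

# write a file that starts with a BOM, then read it back two ways:
# plain 'utf-8' keeps the BOM, 'utf-8-sig' strips it
path = os.path.join(tempfile.mkdtemp(), 'bom.txt')
with open(path, 'w', encoding='utf-8-sig') as f:
    f.write('The Project')
with open(path, encoding='utf-8') as f:
    with_bom = f.read()
with open(path, encoding='utf-8-sig') as f:
    without_bom = f.read()
print(repr(with_bom), repr(without_bom))
```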

# Next up: Dates and times!

- Defining them
- Calculating with them
- Grouping with them
- Indexing with them

# Times and dates

A lot of data includes time-date information:

- Calendars
- Sales info 
- Travel info
- Timing information for optimization of systems
- Logfiles

Python, like many other languages, uses *two* data structures to keep track of time. We rarely think of time as two different things, but after you think about it this way, you might realize that this precision is useful.

When we talk about "time," we're really referring to two things:

- A point in time, when something happened. That point in time is uniquely described with a year, month, day, hour, minute, second, and (perhaps) even smaller measures (e.g., ms and ns). 
    - What time does a meeting start?
    - When was someone born?
    - When was the performance supposed to start?
- A span in time, which takes up a certain amount of time, but doesn't have a specific start or end date/time:
    - How long did someone live?
    - How long is this trip?

The two Python data structures work with these two ideas:

- `datetime` objects (sometimes known as "timestamps") contain unique dates and times.
- `timedelta` objects (sometimes known as "intervals") contain spans of time, but without specific years, months, etc.

You can do date math with these:

- `datetime` - `datetime` = `timedelta`
- `datetime` + `timedelta` = `datetime`

Pandas supports both of these. 
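
The date math above can be sketched as follows, first with Python's own `datetime` and `timedelta`, and then with the Pandas equivalents, `pd.Timestamp` and `pd.Timedelta`. The timestamps are arbitrary example values.

```python
from datetime import datetime, timedelta

import pandas as pd

# arbitrary example timestamps
pickup = datetime(2015, 6, 2, 11, 19, 29)
dropoff = datetime(2015, 6, 2, 11, 47, 52)

trip = dropoff - pickup                   # datetime - datetime = timedelta
print(trip)                               # 0:28:23

later = pickup + timedelta(minutes=30)    # datetime + timedelta = datetime
print(later)                              # 2015-06-02 11:49:29

# the Pandas equivalents are Timestamp and Timedelta
ts = pd.Timestamp('2015-06-02 11:19:29')
delta = pd.Timestamp('2015-06-02 11:47:52') - ts
print(delta)                              # 0 days 00:28:23
print(ts + pd.Timedelta(minutes=30))
```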

In [141]:
# Let's load the taxi data, and see what we can do.

df = pd.read_csv('taxi.csv',
                usecols=['passenger_count', 'trip_distance', 'total_amount',
                        'tpep_pickup_datetime', 'tpep_dropoff_datetime'])

In [142]:
df.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount
0,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,17.8
1,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,8.3
2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,11.0
3,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,17.16
4,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,10.3


In [143]:
# what dtypes do we have?

df.dtypes

tpep_pickup_datetime      object
tpep_dropoff_datetime     object
passenger_count            int64
trip_distance            float64
total_amount             float64
dtype: object

In [149]:
# how can I force the datetime columns to be of datetime types?
# option 1: read the file, and then apply pd.to_datetime to those columns

df = pd.read_csv('taxi.csv',
                usecols=['passenger_count', 'trip_distance', 'total_amount',
                        'tpep_pickup_datetime', 'tpep_dropoff_datetime'])
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])

df.dtypes

tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   int64
trip_distance                   float64
total_amount                    float64
dtype: object

In [156]:
# option 2: pass the parse_dates keyword argument to read_csv

df = pd.read_csv('taxi.csv',
                usecols=['passenger_count', 'trip_distance', 'total_amount',
                        'tpep_pickup_datetime', 'tpep_dropoff_datetime'],
                parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])

df.dtypes



tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   int64
trip_distance                   float64
total_amount                    float64
dtype: object

In [157]:
df.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,total_amount
0,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,17.8
1,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,8.3
2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,11.0
3,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,17.16
4,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,10.3


# The `.dt` accessor

So far, we've talked about the `.str` accessor, giving us access to string methods.

On a series of datetime values, we can use the `.dt` accessor, and thus retrieve all sorts of attributes associated with `datetime` objects.

For starters, we can retrieve `year`, `month`, `day`, `hour`, `minute`, and `second`.

In [161]:
df['tpep_pickup_datetime'].dt.day

0       2
1       2
2       2
3       2
4       2
       ..
9994    1
9995    1
9996    1
9997    1
9998    1
Name: tpep_pickup_datetime, Length: 9999, dtype: int32

# Exercise

1. Read in `taxi.csv`, making sure to parse the dates for the two datetime columns.
2. How many rides were there at each hour of the day in our data set? (You'll quickly see how non-representative the data is.)
3. How many rides took place before 12 noon, and how many took place on or after 12 noon?

In [163]:
# 1. Read in `taxi.csv`, making sure to parse the dates for the two datetime columns.

df = pd.read_csv('taxi.csv',
                parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'])
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3


In [164]:
df.dtypes

VendorID                          int64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   int64
trip_distance                   float64
pickup_longitude                float64
pickup_latitude                 float64
RateCodeID                        int64
store_and_fwd_flag               object
dropoff_longitude               float64
dropoff_latitude                float64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
dtype: object

In [167]:
# 2. How many rides were there at each hour of the day in our data set?
# (You'll quickly see how non-representative the data is.)

df['tpep_pickup_datetime'].dt.hour.value_counts()

tpep_pickup_datetime
11    4396
15    2536
0     2439
16     628
Name: count, dtype: int64

In [169]:
# 3. How many rides took place before 12 noon, and how many took place on or after 12 noon?

(df['tpep_pickup_datetime'].dt.hour < 12).value_counts()

tpep_pickup_datetime
True     6835
False    3164
Name: count, dtype: int64

# What else can I get with `.dt`?

- `day_of_week`
- `is_leap_year`
- `is_quarter_start`
- `is_quarter_end`

Lots and lots of options to check about a given date!
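
To get a feel for these, here's a minimal sketch that runs them on a handful of made-up dates (not the taxi data). I've also included `day_name`, which is a method rather than an attribute, so it needs parentheses.

```python
import pandas as pd

# made-up dates for illustration
dates = pd.Series(pd.to_datetime(['2015-06-01', '2016-02-29',
                                  '2016-04-01', '2016-12-31']))

info = pd.DataFrame({
    'date': dates,
    'day_of_week': dates.dt.day_of_week,            # Monday is 0, Sunday is 6
    'day_name': dates.dt.day_name(),                # a method, not an attribute
    'is_leap_year': dates.dt.is_leap_year,
    'is_quarter_start': dates.dt.is_quarter_start,
    'is_quarter_end': dates.dt.is_quarter_end,
})
print(info)
```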

# Time deltas

I mentioned the idea of "time deltas" before, where we can get the difference between two timestamps.

What if I were to subtract the pickup datetime from the dropoff datetime? We would know how long the taxi ride was!

Time deltas are normally presented as a number of days, followed by hours, minutes, and seconds (plus fractions of a second, down to ns, when there are any). That's (roughly) how they'll be presented when we ask to see time deltas.
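
As a sketch of what you can do with a time-delta series, the example below uses a tiny made-up data frame (the column names `pickup` and `dropoff` are invented here). It converts each delta to a plain number of minutes with `.dt.total_seconds()`, and compares against an explicit `pd.Timedelta`.

```python
import pandas as pd

# made-up pickup and dropoff times for illustration
rides = pd.DataFrame({
    'pickup': pd.to_datetime(['2015-06-02 11:19:29',
                              '2015-06-02 11:19:38',
                              '2015-06-02 23:50:00']),
    'dropoff': pd.to_datetime(['2015-06-02 11:47:52',
                               '2015-06-02 11:19:43',
                               '2015-06-03 00:20:00']),
})
rides['trip_time'] = rides['dropoff'] - rides['pickup']

# total_seconds() gives one plain float per time delta
rides['minutes'] = rides['trip_time'].dt.total_seconds() / 60

# compare against an explicit Timedelta
short = rides.loc[rides['trip_time'] < pd.Timedelta(minutes=1)]

print(rides)
print(short)
```

An explicit `pd.Timedelta` is a bit more verbose than a string such as `'00:01:00'`, but it makes the unit obvious to whoever reads the code later.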

In [171]:
df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']

0      0 days 00:28:23
1      0 days 00:08:26
2      0 days 00:10:59
3      0 days 00:19:31
4      0 days 00:13:17
             ...      
9994   0 days 00:11:19
9995   0 days 00:15:17
9996   0 days 00:24:25
9997   0 days 00:06:08
9998   0 days 00:23:29
Length: 9999, dtype: timedelta64[ns]

In [172]:
# I can assign that to a new column!

df['trip_time'] = df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']

df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,trip_time
0,2,2015-06-02 11:19:29,2015-06-02 11:47:52,1,1.63,-73.95443,40.764141,1,N,-73.974754,40.754093,2,17.0,0.0,0.5,0.0,0.0,0.3,17.8,0 days 00:28:23
1,2,2015-06-02 11:19:30,2015-06-02 11:27:56,1,0.46,-73.971443,40.758942,1,N,-73.978539,40.761909,1,6.5,0.0,0.5,1.0,0.0,0.3,8.3,0 days 00:08:26
2,2,2015-06-02 11:19:31,2015-06-02 11:30:30,1,0.87,-73.978111,40.738434,1,N,-73.990273,40.745438,1,8.0,0.0,0.5,2.2,0.0,0.3,11.0,0 days 00:10:59
3,2,2015-06-02 11:19:31,2015-06-02 11:39:02,1,2.13,-73.945892,40.773529,1,N,-73.971527,40.76033,1,13.5,0.0,0.5,2.86,0.0,0.3,17.16,0 days 00:19:31
4,1,2015-06-02 11:19:32,2015-06-02 11:32:49,1,1.4,-73.979088,40.776772,1,N,-73.982162,40.758999,2,9.5,0.0,0.5,0.0,0.0,0.3,10.3,0 days 00:13:17


In [174]:
# let's say that I want to find all trips that were < 1 minute long

df.loc[df['trip_time'] < '00:01:00']

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,trip_time
10,2,2015-06-02 11:19:38,2015-06-02 11:19:43,3,0.01,0.000000,0.000000,2,N,0.000000,0.000000,2,52.0,0.0,0.5,0.00,0.00,0.3,52.80,0 days 00:00:05
149,1,2015-06-02 11:21:25,2015-06-02 11:21:50,1,0.00,-73.978493,40.748562,1,N,-73.978493,40.748604,1,2.5,0.0,0.5,1.00,0.00,0.3,4.30,0 days 00:00:25
297,2,2015-06-02 11:20:23,2015-06-02 11:20:23,2,0.00,-73.937851,40.758236,1,N,0.000000,0.000000,2,1.5,0.0,0.5,0.00,0.00,0.3,2.30,0 days 00:00:00
516,2,2015-06-02 11:21:14,2015-06-02 11:22:03,1,0.03,-73.991943,40.740261,1,N,-73.991463,40.740086,1,2.5,0.0,0.5,0.66,0.00,0.3,3.96,0 days 00:00:49
657,1,2015-06-02 11:24:33,2015-06-02 11:24:50,1,0.00,-73.996460,40.732124,5,N,-73.996429,40.732147,1,12.0,0.0,0.0,3.05,0.00,0.3,15.35,0 days 00:00:17
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9452,1,2015-06-01 00:10:45,2015-06-01 00:11:28,1,0.10,-73.962044,40.761909,1,N,-73.960907,40.761726,3,2.5,0.5,0.5,0.00,0.00,0.3,3.80,0 days 00:00:43
9693,2,2015-06-01 00:11:08,2015-06-01 00:11:08,1,5.03,-73.991440,40.731533,1,N,-73.967010,40.802006,1,16.0,0.5,0.5,1.50,0.00,0.3,18.80,0 days 00:00:00
9761,1,2015-06-01 00:12:46,2015-06-01 00:12:46,1,0.00,-73.999535,40.738533,1,N,0.000000,0.000000,2,4.5,0.5,0.5,0.00,0.00,0.3,5.80,0 days 00:00:00
9868,1,2015-06-01 00:12:31,2015-06-01 00:12:56,2,2.40,-73.984634,40.759377,2,N,-73.984604,40.759350,2,52.0,0.0,0.5,0.00,5.54,0.3,58.34,0 days 00:00:25


In [175]:
help(s.value_counts)

Help on method value_counts in module pandas.core.base:

value_counts(normalize: 'bool' = False, sort: 'bool' = True, ascending: 'bool' = False, bins=None, dropna: 'bool' = True) -> 'Series' method of pandas.core.series.Series instance
    Return a Series containing counts of unique values.
    
    The resulting object will be in descending order so that the
    first element is the most frequently-occurring element.
    Excludes NA values by default.
    
    Parameters
    ----------
    normalize : bool, default False
        If True then the object returned will contain the relative
        frequencies of the unique values.
    sort : bool, default True
        Sort by frequencies.
    ascending : bool, default False
        Sort in ascending order.
    bins : int, optional
        Rather than count values, group them into half-open bins,
        a convenience for ``pd.cut``, only works with numeric data.
    dropna : bool, default True
        Don't include counts of NaN.
    
  