# Agenda, week 4

1. Recap and Q&A
    - Oil prices!
2. Text strings
    - The `str` accessor
    - Cleaning dirty integer data
    - Textual statistics
    - Trimming strings
3. Dates and times
    - Date and time dtypes
    - Parsing CSV files with times
    - Time deltas
    - Time series
    - Resampling

In [3]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [4]:
filename = 'oil-prices-master/data/wti-daily.csv'

df = pd.read_csv(filename)
df.head()

Unnamed: 0,Date,Price
0,1986-01-02,25.56
1,1986-01-03,26.0
2,1986-01-06,26.53
3,1986-01-07,25.85
4,1986-01-08,25.87


In [5]:
# What was the highest-ever price of oil (as per WTI)?

df['Price'].max()   # this is the highest price ever

145.31

In [6]:

df['Price'] == df['Price'].max()  # this returns a boolean series

0       False
1       False
2       False
3       False
4       False
        ...  
9222    False
9223    False
9224    False
9225    False
9226    False
Name: Price, Length: 9227, dtype: bool

In [8]:
# Use .loc with a row selector and a column selector
# our row selector will be our boolean series


df.loc[
    df['Price'] == df['Price'].max(),   # row selector
    'Date' # column selector
]

5678    2008-07-03
Name: Date, dtype: object
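
Another way to answer this kind of question is `idxmax`, which returns the index label of the largest value. The sketch below uses a tiny made-up data frame (the three rows are illustrative, not the real WTI file). Keep in mind that `idxmax` returns only the first label if there is a tie, whereas the boolean mask returns every matching row.

```python
import pandas as pd

# a tiny, made-up stand-in for the oil-price data
prices = pd.DataFrame({'Date': ['2000-01-03', '2000-01-04', '2000-01-05'],
                       'Price': [10.0, 30.0, 20.0]})

top_label = prices['Price'].idxmax()       # index label of the highest price
top_date = prices.loc[top_label, 'Date']   # look up the date in that row
print(top_label, top_date)
```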

In [10]:
# what was the minimum price ever found of WTI?


df.loc[
    df['Price'] == df['Price'].min(),   # row selector
    ['Date', 'Price']                   # column selector
]

Unnamed: 0,Date,Price
8643,2020-04-20,-36.98


In [11]:
# what were the 5 most recent values for oil prices?

df.tail(5)

Unnamed: 0,Date,Price
9222,2022-08-09,93.18
9223,2022-08-10,94.68
9224,2022-08-11,97.02
9225,2022-08-12,94.86
9226,2022-08-15,92.24


In [12]:
df.tail(20)

Unnamed: 0,Date,Price
9207,2022-07-19,106.12
9208,2022-07-20,104.45
9209,2022-07-21,98.44
9210,2022-07-22,97.71
9211,2022-07-25,99.83
9212,2022-07-26,97.74
9213,2022-07-27,100.03
9214,2022-07-28,99.11
9215,2022-07-29,101.31
9216,2022-08-01,96.59


In [15]:
# running describe on our data frame runs describe on each numeric column

df.describe()

Unnamed: 0,Price
count,9227.0
mean,45.636007
std,29.481022
min,-36.98
25%,19.94
50%,34.76
75%,65.975
max,145.31


In [14]:
df.dtypes

Date      object
Price    float64
dtype: object

In [16]:
df['Date'].describe()

count           9227
unique          9227
top       1986-01-02
freq               1
Name: Date, dtype: object

In [18]:
# read_html only works when you have HTML tables on the target site
# and when they are written in HTML, and not generated on the fly via JavaScript

url = 'https://www.bankofcanada.ca/rates/exchange/daily-exchange-rates/'

all_dfs = pd.read_html(url)


In [19]:
len(all_dfs)

1

In [20]:
df = all_dfs[0]
df.head()

Unnamed: 0,Currency,2022‑08‑16,2022‑08‑17,2022‑08‑18,2022‑08‑19,2022‑08‑22
0,Australian dollar,0.9025,0.8949,0.8962,0.8936,0.897
1,Brazilian real,0.2504,0.2494,0.2496,0.2503,0.2521
2,Chinese renminbi,0.1896,0.1904,0.1905,0.1906,0.1904
3,European euro,1.3081,1.3134,1.308,1.3049,1.2979
4,Hong Kong dollar,0.1641,0.1646,0.1648,0.1656,0.1661


In [21]:
df.shape

(23, 6)

In [22]:
df['Currency']

0      Australian dollar
1         Brazilian real
2       Chinese renminbi
3          European euro
4       Hong Kong dollar
5           Indian rupee
6      Indonesian rupiah
7           Japanese yen
8           Mexican peso
9     New Zealand dollar
10       Norwegian krone
11      Peruvian new sol
12         Russian ruble
13           Saudi riyal
14      Singapore dollar
15    South African rand
16      South Korean won
17         Swedish krona
18           Swiss franc
19      Taiwanese dollar
20          Turkish lira
21     UK pound sterling
22             US dollar
Name: Currency, dtype: object

# Text data in Pandas

We've seen that a series can contain text. As such, a column in a data frame can also contain text. When we do that, the dtype of the column (series) is known as `object`, which means that the data isn't being stored directly inside of Pandas.  Instead, Pandas is referring to Python string objects located elsewhere in memory.  This means that we have access (in theory) to all of the Python string methods and associated functionality.
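
To make that concrete, here is a minimal sketch showing that each element of an `object` series really is a regular Python `str`. The sample words are made up, and `dtype=object` is spelled out because recent Pandas versions may otherwise pick a dedicated string dtype on their own.

```python
import pandas as pd

# dtype=object is explicit here; newer Pandas versions may otherwise
# choose a dedicated string dtype for a list of strings
words = pd.Series(['hello', 'world'], dtype=object)

first = words.iloc[0]
print(words.dtype)      # the dtype of the series as a whole
print(type(first))      # the type of one element
print(first.upper())    # regular str methods work on each element
```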

In [23]:
# create a series based on a list of strings, created via str.split
# str.split without an explicit delimiter argument uses any whitespace, of any 
# length, in any combination

s = Series('this is a test of text in Pandas'.split())
s

0      this
1        is
2         a
3      test
4        of
5      text
6        in
7    Pandas
dtype: object

In [25]:
# what is the length of each word?

# option 1 for answering: a for loop

for one_item in s:
    print(len(one_item))

4
2
1
4
2
4
2
6


In [27]:
# option 1b: use a list comprehension

[len(one_item)
for one_item in s]

[4, 2, 1, 4, 2, 4, 2, 6]

**DO NOT DO THIS!**

If you ever find yourself using a `for` loop in Pandas, stop! There is almost certainly a better way to accomplish it.

In [29]:
# Better, option 2: Use the "str" accessor

# an "accessor" is a Pandas term for an attribute (i.e., coming after a .)
# that lets us access special functionality for certain types of objects

# if we have an "object" column containing strings, then we can use the str
# accessor to invoke a number of different string methods

s.str.len() 

# this invokes len() on each of the elements in s, and returns a new series
# the index of the returned series matches the index in our original series s

0    4
1    2
2    1
3    4
4    2
5    4
6    2
7    6
dtype: int64

In [30]:
# because we get a series back, we can use it in all series-type operations
s.str.len() == 2

0    False
1     True
2    False
3    False
4     True
5    False
6     True
7    False
dtype: bool

In [31]:
# apply the boolean series as a mask index, and find all of the words
# that have only two letters

s.loc[s.str.len() == 2]

1    is
4    of
6    in
dtype: object

In [33]:
# the "contains" method on the  "str" accessor lets us search in a string
# for a character/substring

# use s.loc to find all words in s that contain 'e'
s.loc[s.str.contains('e')]

3    test
5    text
dtype: object

# Exercise: Longer-than-average words

1. Ask the user to enter a sentence. 
2. Turn that sentence into a Pandas series.
3. Show all of the words that are longer than average in the sentence.

In [34]:
sentence = input('Enter a sentence: ').strip()

s = Series(sentence.split())

Enter a sentence: this is the most marvelous and fascinating and scintillating sentence on the planet


In [35]:
s

0              this
1                is
2               the
3              most
4         marvelous
5               and
6       fascinating
7               and
8     scintillating
9          sentence
10               on
11              the
12           planet
dtype: object

In [37]:
# show all words in s that are longer than the average length

s.str.len()  # this returns a series of ints, based on s, with s's index

0      4
1      2
2      3
3      4
4      9
5      3
6     11
7      3
8     13
9      8
10     2
11     3
12     6
dtype: int64

In [38]:
# this is the mean word length
s.str.len().mean()

5.461538461538462

In [39]:
# Get a boolean series, indicating where the word's length is greater than the average
s.str.len() > s.str.len().mean()

0     False
1     False
2     False
3     False
4      True
5     False
6      True
7     False
8      True
9      True
10    False
11    False
12     True
dtype: bool

In [40]:
s.loc[s.str.len() > s.str.len().mean()]

4         marvelous
6       fascinating
8     scintillating
9          sentence
12           planet
dtype: object

In [43]:
s = s.astype('string')  # new-ish functionality in Pandas

In [44]:
s

0              this
1                is
2               the
3              most
4         marvelous
5               and
6       fascinating
7               and
8     scintillating
9          sentence
10               on
11              the
12           planet
dtype: string

In [45]:
s.loc[0] = 12345
s

ValueError: Cannot set non-string value '12345' into a StringArray.
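
Here is a small sketch of the difference between the two dtypes: an `object` series accepts any Python value, while a `string` series refuses non-strings. The exception class seems to depend on the Pandas version (the output above shows `ValueError`, and I believe newer releases raise `TypeError`), so the sketch catches both. The sample words are made up.

```python
import pandas as pd

# an object series will hold anything, including an int among strings
obj_words = pd.Series(['this', 'is'], dtype=object)
obj_words.loc[0] = 12345

# a string series only accepts strings (and missing values)
str_words = pd.Series(['this', 'is'], dtype='string')
try:
    str_words.loc[0] = 12345
    rejected = False
except (TypeError, ValueError) as e:
    rejected = True
    print(f'{type(e).__name__}: {e}')
```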

In [46]:
# what happens if I have a series of integers?

s = Series([10, 20, 30, 40, 50])
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [47]:
s.describe()

count     5.000000
mean     30.000000
std      15.811388
min      10.000000
25%      20.000000
50%      30.000000
75%      40.000000
max      50.000000
dtype: float64

In [48]:
# what if I have a series of strings containing digits, and I want
# to turn the series dtype into int?

s = Series('10 20 30 40 50'.split())
s

0    10
1    20
2    30
3    40
4    50
dtype: object

In [49]:
s.astype(np.int64)

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [50]:
s = s.astype(np.int64)
s.describe()

count     5.000000
mean     30.000000
std      15.811388
min      10.000000
25%      20.000000
50%      30.000000
75%      40.000000
max      50.000000
dtype: float64

In [51]:
# but what if there is a bad value in that series?

s = Series('10 20 30 abcd 50 60'.split())
s

0      10
1      20
2      30
3    abcd
4      50
5      60
dtype: object

In [52]:
# what if I use .astype(np.int64)?

s.astype(np.int64)

ValueError: invalid literal for int() with base 10: 'abcd'

In [53]:
# let's filter out the strings in s that cannot be turned into integers
# an easy way to do this is with the "str.isdigit" accessor method
# isdigit is a builtin Python string method, which returns True if the string
# is non-empty and contains only digits.  Whitespace, minus signs, and
# decimal points all make it return False.

s.str.isdigit()

0     True
1     True
2     True
3    False
4     True
5     True
dtype: bool

In [55]:
# get only those strings that contain only digits
s.loc[s.str.isdigit()]

0    10
1    20
2    30
4    50
5    60
dtype: object

In [59]:
# first, we grab only those strings containing the digits 0-9
# via a mask/boolean index
# then, we invoke astype(np.int64) on the resulting series of strings
# we sum those integers, and get the value
s.loc[s.str.isdigit()].astype(np.int64).sum()

170

In [60]:
# If I want to change s, and turn it (permanently) into a series of integers,
# I have to do a bit more work:

s = s.loc[s.str.isdigit()].astype(np.int64)
s.sum()

170
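
One limitation of `str.isdigit` is that it rejects anything with a minus sign, so `'-5'` would be thrown away along with `'abcd'`. An alternative, sketched here as an option rather than as the approach used above, is `pd.to_numeric` with `errors='coerce'`, which turns anything it cannot parse into `NaN`. The sample values are made up.

```python
import numpy as np
import pandas as pd

raw = pd.Series('10 20 -5 abcd 50'.split())

print(raw.str.isdigit().tolist())               # '-5' fails the isdigit test

numbers = pd.to_numeric(raw, errors='coerce')   # 'abcd' becomes NaN
clean = numbers.dropna().astype(np.int64)       # drop NaN, then back to ints
print(clean.sum())
```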

# Exercise: Mean of dirty ints

1. Ask the user to enter integers, separated by spaces, in a single string.
2. Filter out any strings that contain non-digits.
3. Get the mean of those inputs.
4. Bonus extra: Get the non-digits from the input string!

In [None]:
numbers = input('Enter some numbers, separated by spaces: ').strip()

In [67]:
s = Series(numbers.split())

s.loc[s.str.isdigit()]   # return only those elements that *can* be turned into ints

0    10
1    20
2    30
4    40
5    50
6    60
dtype: object

In [69]:
s = s.loc[s.str.isdigit()].astype(np.int64)
s

0    10
1    20
2    30
4    40
5    50
6    60
dtype: int64

In [70]:
s.mean()

35.0

In [71]:
# can we get all of those elements that are *not* numeric?

s

0    10
1    20
2    30
4    40
5    50
6    60
dtype: int64

In [74]:
s = Series(numbers.split())
s

0       10
1       20
2       30
3    abcde
4       40
5       50
6       60
dtype: object

In [77]:
# I can find the things that are not numeric by using str.isdigit, and then
# flipping the logic with ~ (the "not" operator)

s[~s.str.isdigit()]

3    abcde
dtype: object

# Next up

1. Textual statistics
2. Trimming + splitting + indexing strings

In [78]:
s = Series('this is a bunch of words and the bunch of words is very exciting and the words just go on and on and on and there is no end to our words'.split())
s

0         this
1           is
2            a
3        bunch
4           of
5        words
6          and
7          the
8        bunch
9           of
10       words
11          is
12        very
13    exciting
14         and
15         the
16       words
17        just
18          go
19          on
20         and
21          on
22         and
23          on
24         and
25       there
26          is
27          no
28         end
29          to
30         our
31       words
dtype: object

In [79]:
# what happens if I run "describe" on our series?

s.describe()

count      32
unique     18
top       and
freq        5
dtype: object

# Why `describe` is useful with text

The big thing we can find out by running `describe` on a text column (series) is how many unique values there are, and (by comparing `count` with the length of the series) how many are `NaN`. This can be useful in understanding whether every value is unique (as we saw earlier with dates) or if there is a lot of overlap.

The reason this can help us is that Pandas provides ways to crunch our data down, saving memory, if we have repeated values. This is similar to what programmers would call an "enum," where we replace text with numbers, because numbers take up much less space. So if the number of unique values is much smaller than the number of values, you might want to consider looking into a "category" dtype column, rather than a text/object column.
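
Here is a rough sketch of that idea, using a made-up series with many values but only three distinct ones. The exact byte counts will vary by Pandas version and platform, so treat the comparison as illustrative.

```python
import pandas as pd

# 18,000 strings, but only 3 distinct values
colors = pd.Series(['red', 'green', 'blue'] * 6000, dtype=object)
as_category = colors.astype('category')

# deep=True asks Pandas to count the Python string objects, too
object_bytes = colors.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)

# the small integer codes that stand in for the text behind the scenes
codes = as_category.cat.codes
print(codes.head(3).tolist())
```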

In [80]:
# we can also run value_counts on our series of words

s.value_counts()

and         5
words       4
on          3
is          3
bunch       2
of          2
the         2
this        1
to          1
end         1
no          1
there       1
exciting    1
go          1
just        1
very        1
a           1
our         1
dtype: int64

In [83]:
# what words appeared more than once in s?

# select those words in s that appear more than once
s.value_counts().loc[s.value_counts() > 1]

and      5
words    4
on       3
is       3
bunch    2
of       2
the      2
dtype: int64

In [86]:
# I can search in a string using the "contains" method on the str accessor

s.loc[s.str.contains('e')]   # this is sort of like the 'in' operator in Python

7          the
12        very
13    exciting
15         the
25       there
28         end
dtype: object

In [87]:
# what if I want all of the words that contain 'i'?

s.loc[s.str.contains('i')]

0         this
1           is
11          is
13    exciting
26          is
dtype: object

In [89]:
# I want all of the words that contain either e or i
s.loc[s.str.contains('e') | s.str.contains('i')]

0         this
1           is
7          the
11          is
12        very
13    exciting
15         the
25       there
26          is
28         end
dtype: object

In [91]:
# the above works just fine *but* we can also do the same thing
# via regular expressions.  contains, by default, assumes that its
# string argument is a regexp.

# Check out RegexpCrashCourse.com -- my free, 14-day e-mail course 
# on regular expressions.

s.loc[s.str.contains('[ei]')]  # we are looking for either e or i in s

0         this
1           is
7          the
11          is
12        very
13    exciting
15         the
25       there
26          is
28         end
dtype: object
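
Because `contains` treats its argument as a regular expression by default, characters such as `.` have a special meaning. If you want a plain substring search instead, you can pass `regex=False`. A small sketch with made-up values:

```python
import pandas as pd

values = pd.Series(['3.14', '314', 'a.b'])

as_regex = values.str.contains('.')                  # '.' matches any character
as_literal = values.str.contains('.', regex=False)   # '.' means a literal dot

print(as_regex.tolist())
print(as_literal.tolist())
```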

In [93]:
s.loc[0] = np.nan
s

0          NaN
1           is
2            a
3        bunch
4           of
5        words
6          and
7          the
8        bunch
9           of
10       words
11          is
12        very
13    exciting
14         and
15         the
16       words
17        just
18          go
19          on
20         and
21          on
22         and
23          on
24         and
25       there
26          is
27          no
28         end
29          to
30         our
31       words
dtype: object

In [98]:
# If you have NaN values in your series, then str.contains will return NaN
# for them, and that'll blow up your boolean index. Use fillna(False) to
# treat those elements as non-matches.
s.loc[s.str.contains('a').fillna(False)]

2       a
6     and
14    and
20    and
22    and
24    and
dtype: object
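
`str.contains` also takes an `na` parameter, which does the same job as `fillna(False)` in a single step. A minimal sketch, with made-up words:

```python
import numpy as np
import pandas as pd

words = pd.Series(['apple', np.nan, 'kiwi', 'banana'])

# na=False tells contains to report False wherever the element is NaN
mask = words.str.contains('a', na=False)
print(words.loc[mask])
```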

# Exercise: Character frequency

1. Ask the user to enter a sentence.
2. Show the 5 most common characters used in the input sentence.

In [99]:
sentence = input('Enter a sentence: ').strip()



Enter a sentence: this is yet another amazing and fantastic and expressive sentence created for my online course


In [100]:
sentence

'this is yet another amazing and fantastic and expressive sentence created for my online course'

In [104]:
s = Series(list(sentence))
s

0     t
1     h
2     i
3     s
4      
     ..
89    o
90    u
91    r
92    s
93    e
Length: 94, dtype: object

In [106]:
s.value_counts().head(5)

     14
e    12
n     9
a     8
t     7
dtype: int64

In [109]:
# what if I want the five most common alphanumeric characters?
s.loc[s.str.isalnum()].value_counts().head(5)

e    12
n     9
a     8
t     7
s     7
dtype: int64

In [112]:
filename = 'alice-in-wonderland.txt'

s = Series(open(filename).read().split())

In [113]:
s

0             ﻿The
1          Project
2        Gutenberg
3            EBook
4               of
           ...    
12758           to
12759         hear
12760        about
12761          new
12762      eBooks.
Length: 12763, dtype: object

In [114]:
# what are the five most common words in Alice in Wonderland?
s.value_counts().head(5)

the    732
and    362
a      321
to     311
of     300
dtype: int64

In [118]:
# what are the five most common words, with at least 4 characters in them,
# in Alice in Wonderland?

s.loc[s.str.len() >= 4].value_counts().head(5)

said       129
with       111
Alice      101
that        90
Project     78
dtype: int64

In [120]:
!ls -l alice-in-wonderland.txt

-rw-r--r-- 1 reuven staff 74703 Aug 23 13:31 alice-in-wonderland.txt


In [122]:
# I want to find the most common words, but I first want to remove
# all punctuation from the words.  This will be a more reasonable
# comparison.

# I want to remove the following characters from the ends of all words:
# . ? ! "

# the strip method doesn't just remove whitespace (although that's what it does
# by default). If we give it a string as an argument, it removes any and all
# of those characters that might appear at the start or end of a string.

s.str.strip('.?!"')

0             ﻿The
1          Project
2        Gutenberg
3            EBook
4               of
           ...    
12758           to
12759         hear
12760        about
12761          new
12762       eBooks
Length: 12763, dtype: object

In [125]:
s.str.strip(',;.?!"').value_counts().head(10)

the      734
and      383
a        321
to       319
of       303
in       213
she      197
Alice    165
was      163
it       156
dtype: int64

In [126]:
s.value_counts().head(10)

the     732
and     362
a       321
to      311
of      300
in      211
she     197
was     160
said    129
it      122
dtype: int64
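
Rather than typing out each punctuation mark, one option is to pass Python's `string.punctuation` to `strip`. This is a sketch with made-up words. Note that `strip` only touches the ends of each string, so an apostrophe in the middle of a word stays put.

```python
import string
import pandas as pd

words = pd.Series(['"Hello,"', 'world!', "it's", '(really)'])

# string.punctuation contains all of the ASCII punctuation characters
stripped = words.str.strip(string.punctuation)
print(stripped.tolist())
```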

In [127]:
# I can retrieve a character via "get" -- this is sort of like []
# on a (string) element in the series.  But get is a method, that 
# takes its arguments in round parentheses, ()

s.str.get(0)   # retrieve the first character from each string

0        ﻿
1        P
2        G
3        E
4        o
        ..
12758    t
12759    h
12760    a
12761    n
12762    e
Length: 12763, dtype: object

In [128]:
s.str.get(-1)    # retrieve the final character from each string

0        e
1        t
2        g
3        k
4        f
        ..
12758    o
12759    r
12760    t
12761    w
12762    .
Length: 12763, dtype: object

In [129]:
# slices are OK, too!

s.str.slice(2, 4)

0        he
1        oj
2        te
3        oo
4          
         ..
12758      
12759    ar
12760    ou
12761     w
12762    oo
Length: 12763, dtype: object

In [131]:
s.loc[12761]

'new'
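
The `str` accessor also supports square brackets directly, which can read more like regular Python indexing and slicing. A short sketch with made-up words, showing that the two spellings give the same results:

```python
import pandas as pd

words = pd.Series(['this', 'is', 'a', 'test'])

first_chars = words.str[0]     # same as words.str.get(0)
middle = words.str[1:3]        # same as words.str.slice(1, 3)

print(first_chars.tolist())
print(middle.tolist())
```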

# Exercise: Starts and ends the same

1. Grab the `alice-in-wonderland.txt` file from the GitHub repo.
2. (If you cannot download it for some reason, just ask the user to enter a sentence.)
3. Print all of the words whose first and final letters are identical. Ignore punctuation (as best as possible).

In [132]:
filename = 'alice-in-wonderland.txt'

s = Series(open(filename).read().split())
s

0             ﻿The
1          Project
2        Gutenberg
3            EBook
4               of
           ...    
12758           to
12759         hear
12760        about
12761          new
12762      eBooks.
Length: 12763, dtype: object

In [133]:
# remove punctuation from the start and end of each word

s = s.str.strip('.!?,;:"')
s

0             ﻿The
1          Project
2        Gutenberg
3            EBook
4               of
           ...    
12758           to
12759         hear
12760        about
12761          new
12762       eBooks
Length: 12763, dtype: object

In [134]:
s = s.str.strip()
s

0             ﻿The
1          Project
2        Gutenberg
3            EBook
4               of
           ...    
12758           to
12759         hear
12760        about
12761          new
12762       eBooks
Length: 12763, dtype: object

In [135]:
s.str.len().head(5)

0    4
1    7
2    9
3    5
4    2
dtype: int64

In [136]:
s.loc[0]

'\ufeffThe'
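
That `\ufeff` is a byte-order mark (BOM), which some programs write at the start of a UTF-8 file. It is why the first word has a length of 4 rather than 3. One way to avoid it, sketched below with a temporary file standing in for the real book, is to open the file with Python's `utf-8-sig` encoding, which drops a leading BOM if there is one.

```python
import os
import tempfile
import pandas as pd

# write a small file that starts with a UTF-8 BOM (a stand-in for the real book)
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as f:
    f.write(b'\xef\xbb\xbfThe Project Gutenberg EBook')
    path = f.name

# plain utf-8 keeps the BOM as part of the first word
with open(path, encoding='utf-8') as f:
    with_bom = pd.Series(f.read().split())

# utf-8-sig removes a leading BOM, if present
with open(path, encoding='utf-8-sig') as f:
    without_bom = pd.Series(f.read().split())

os.remove(path)

print(repr(with_bom.loc[0]))
print(repr(without_bom.loc[0]))
```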

In [137]:
s

0             ﻿The
1          Project
2        Gutenberg
3            EBook
4               of
           ...    
12758           to
12759         hear
12760        about
12761          new
12762       eBooks
Length: 12763, dtype: object

In [139]:
s.loc[s.str.get(0) == s.str.get(-1)]

74               ***
79         GUTENBERG
84               ***
116             SONS
117                &
            ...     
12637              a
12642           that
12655    distributed
12661              a
12689              a
Length: 884, dtype: object
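
The comparison above is case sensitive, so a word such as `That` is missed because `T` and `t` are different characters. If you want to ignore case, one option is to lowercase the words for the comparison only. A sketch with made-up words:

```python
import pandas as pd

words = pd.Series(['That', 'dad', 'Alice', 'a', 'level', 'Rabbit'])

# compare lowercased first and final characters, but keep the original words
lowered = words.str.lower()
same_ends = words.loc[lowered.str.get(0) == lowered.str.get(-1)]
print(same_ends.tolist())
```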

In [140]:
s.str.get(0)

0        ﻿
1        P
2        G
3        E
4        o
        ..
12758    t
12759    h
12760    a
12761    n
12762    e
Length: 12763, dtype: object

In [143]:
s.str.slice(3)  # starting at index 3, through the end -- sort of like s[3:]

0             e
1          ject
2        enberg
3            ok
4              
          ...  
12758          
12759         r
12760        ut
12761          
12762       oks
Length: 12763, dtype: object

In [145]:
s.str.slice(-1)  # this will give you the final character, because it's like s[-1:]

0        e
1        t
2        g
3        k
4        f
        ..
12758    o
12759    r
12760    t
12761    w
12762    s
Length: 12763, dtype: object

In [142]:
help(s.str.slice)

Help on method slice in module pandas.core.strings.accessor:

slice(start=None, stop=None, step=None) method of pandas.core.strings.accessor.StringMethods instance
    Slice substrings from each element in the Series or Index.
    
    Parameters
    ----------
    start : int, optional
        Start position for slice operation.
    stop : int, optional
        Stop position for slice operation.
    step : int, optional
        Step size for slice operation.
    
    Returns
    -------
    Series or Index of object
        Series or Index from sliced substring from original string object.
    
    See Also
    --------
    Series.str.slice_replace : Replace a slice with a string.
    Series.str.get : Return element at position.
        Equivalent to `Series.str.slice(start=i, stop=i+1)` with `i`
        being the position.
    
    Examples
    --------
    >>> s = pd.Series(["koala", "dog", "chameleon"])
    >>> s
    0        koala
    1          dog
    2    chameleon
    dtype: o

In [None]:
# slice(0) would return the entire string, so ask for just the first character
s.loc[s.str.slice(0, 1) == s.str.get(-1)]

# Question: 

What are the commonalities and differences between 

1. Python lists
2. NumPy arrays
3. Pandas series

Commonalities:

1. They are all iterable collections in Python
2. We can use `[]` and integers to retrieve items from them.

Differences:

1. Lists can contain any number of different types. It's considered best and conventional for all elements to be of the same type, but there is no technical requirement.  By contrast, values in both NumPy arrays and Pandas series must all be of the same type (dtype).
2. Pandas provides many, many more methods than NumPy. So a Pandas series can do much more than a NumPy array. 
3. NumPy is (for the time being, so far as I know) faster than Pandas.
4. Pandas series can have more interesting indexes than either Python lists or NumPy arrays, which are restricted to integer indexes, starting at 0.
5. Retrieving from a list or NumPy array is done with integers. With Pandas, we have a variety of ways to retrieve -- `.loc`, `.iloc`.
6. If you're working with numbers, then either NumPy or Pandas will be *far* more efficient (in time and memory used) than Python lists.
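
A few of these points can be seen in a short sketch. The values and index labels are made up for illustration.

```python
import numpy as np
import pandas as pd

mixed_list = [10, 'abc', 3.5]            # a list happily mixes types
arr = np.array([10, 20, 30])             # one dtype for the whole array
ser = pd.Series([10, 20, 30], index=['a', 'b', 'c'])   # custom index labels

print([type(x).__name__ for x in mixed_list])
print(arr.dtype, ser.dtype)

print(arr[1])         # NumPy: retrieve by position
print(ser.loc['b'])   # Pandas: retrieve by index label
print(ser.iloc[1])    # Pandas: retrieve by position
```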

# Next up

1. Dates and times -- in general
2. Dates and times -- in Pandas

# Working with dates and times with computers

When we use the term "time" as people, we are actually referring to two different concepts. Computers have to make this distinction explicit:

1. We might be referring to **a specific point in time**. For example: Someone's birth date. The start of a meeting. When your car registration expires. This points to a unique point in time that we can describe via year, month, day, hour, minute, second, and (if we want) with even finer granularity.
2. We might be referring to **a stretch of time**. For example: How old you are. How long a meeting is going. How long you took to complete graduate school. How long a project has been going. When we talk about these times, we're not talking about one point that has elapsed, or that might occur in the future. Rather, we're talking about a length of time -- the distance between two of these points.

The first type of data is typically known as a **timestamp** or a **datetime**. The second is known as a **timedelta** or an **interval**.

You can even do some simple calculations with these:

- timestamp - timestamp = interval
- timestamp + interval = timestamp

More concretely:

- When the meeting ends - when the meeting starts = length of the meeting
- When the meeting starts + length of the meeting = when the meeting ends

Python supports both of these data types. So does Pandas!
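
As a minimal sketch of that arithmetic, here is the meeting example in both plain Python and Pandas. The meeting times are made up.

```python
from datetime import datetime, timedelta
import pandas as pd

# plain Python: datetime is the timestamp, timedelta is the interval
starts = datetime(2022, 8, 23, 10, 0)
ends = datetime(2022, 8, 23, 11, 30)
length = ends - starts               # timestamp - timestamp = interval
print(type(length).__name__, length)
print(starts + length == ends)       # timestamp + interval = timestamp

# Pandas: Timestamp and Timedelta play the same two roles
pd_start = pd.Timestamp('2022-08-23 10:00')
pd_length = pd.Timedelta(minutes=90)
pd_end = pd_start + pd_length
print(pd_end, pd_end - pd_start)
```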

In [None]:
# I'm going to load the taxi data from January 2019.  (If you prefer to 
# load the smaller data, taxi.csv, that's OK.)

# load these columns: passenger_count, trip_distance, total_amount,
#                     tpep_pickup_datetime, tpep_dropoff_datetime

df = pd.read_csv('../data/nyc_taxi_2019-01.csv',
                 usecols=['tpep_pickup_datetime', 'tpep_dropoff_datetime',
                         'passenger_count', 'trip_distance', 'total_amount']
                )