# Agenda, week 4

1. Recap and Q&A
    - Oil prices!
2. Text strings
    - The `str` accessor
    - Cleaning dirty integer data
    - Textual statistics
    - Trimming strings
3. Dates and times
    - Date and time dtypes
    - Parsing CSV files with times
    - Time deltas
    - Time series
    - Resampling

In [3]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [4]:
filename = 'oil-prices-master/data/wti-daily.csv'

df = pd.read_csv(filename)
df.head()

Unnamed: 0,Date,Price
0,1986-01-02,25.56
1,1986-01-03,26.0
2,1986-01-06,26.53
3,1986-01-07,25.85
4,1986-01-08,25.87


In [5]:
# What was the highest-ever price of oil (as per WTI)?

df['Price'].max()   # this is the highest price ever

145.31

In [6]:

df['Price'] == df['Price'].max()  # this returns a boolean series

0       False
1       False
2       False
3       False
4       False
        ...  
9222    False
9223    False
9224    False
9225    False
9226    False
Name: Price, Length: 9227, dtype: bool

In [8]:
# Use .loc with a row selector and a column selector
# our row selector will be our boolean series


df.loc[
    df['Price'] == df['Price'].max(),   # row selector
    'Date' # column selector
]

5678    2008-07-03
Name: Date, dtype: object
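As an alternative to the boolean mask, `idxmax` returns the index label of the largest value, and that label can go straight into `.loc`. Here is a minimal sketch, assuming a tiny hand-made data frame; the dates and prices below are illustrative stand-ins, not rows from the WTI file.

```python
import pandas as pd

# Illustrative data only; not taken from wti-daily.csv
prices = pd.DataFrame({
    'Date': ['2020-01-02', '2020-01-03', '2020-01-06'],
    'Price': [61.17, 63.00, 63.27],
})

# idxmax returns the index label of the first row holding the highest price
top_label = prices['Price'].idxmax()

# .loc with a single row label and a single column name returns one value
top_date = prices.loc[top_label, 'Date']
print(top_label, top_date)
```

Unlike the mask, `idxmax` reports only the first match if the maximum occurs more than once, so the mask is the safer choice when ties matter.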

In [10]:
# what was the lowest WTI price ever recorded?


df.loc[
    df['Price'] == df['Price'].min(),   # row selector
    ['Date', 'Price']                   # column selector
]

Unnamed: 0,Date,Price
8643,2020-04-20,-36.98


In [11]:
# what were the 5 most recent values for oil prices?

df.tail(5)

Unnamed: 0,Date,Price
9222,2022-08-09,93.18
9223,2022-08-10,94.68
9224,2022-08-11,97.02
9225,2022-08-12,94.86
9226,2022-08-15,92.24


In [12]:
df.tail(20)

Unnamed: 0,Date,Price
9207,2022-07-19,106.12
9208,2022-07-20,104.45
9209,2022-07-21,98.44
9210,2022-07-22,97.71
9211,2022-07-25,99.83
9212,2022-07-26,97.74
9213,2022-07-27,100.03
9214,2022-07-28,99.11
9215,2022-07-29,101.31
9216,2022-08-01,96.59


In [15]:
# running describe on our data frame runs describe on each numeric column

df.describe()

Unnamed: 0,Price
count,9227.0
mean,45.636007
std,29.481022
min,-36.98
25%,19.94
50%,34.76
75%,65.975
max,145.31


In [14]:
df.dtypes

Date      object
Price    float64
dtype: object

In [16]:
df['Date'].describe()

count           9227
unique          9227
top       1986-01-02
freq               1
Name: Date, dtype: object
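Because `Date` has a dtype of `object`, Pandas is treating each date as a plain string. One way to get real datetime values is `pd.to_datetime`. Below is a small sketch that uses the first rows shown earlier, typed in as a string so that it runs without the CSV file. The names `csv_text` and `sample` exist only for this example.

```python
from io import StringIO

import pandas as pd

# First few rows from the output shown earlier, inlined so the example is standalone
csv_text = """Date,Price
1986-01-02,25.56
1986-01-03,26.0
1986-01-06,26.53
"""

sample = pd.read_csv(StringIO(csv_text))

# Convert the date strings into datetime64 values
sample['Date'] = pd.to_datetime(sample['Date'])
print(sample.dtypes)

# Datetime columns support date-aware operations, such as finding the latest date
print(sample['Date'].max())
```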

In [18]:
# read_html only works when the target page contains HTML tables,
# and when those tables are in the page's HTML source rather than
# generated on the fly via JavaScript

url = 'https://www.bankofcanada.ca/rates/exchange/daily-exchange-rates/'

all_dfs = pd.read_html(url)


In [19]:
len(all_dfs)

1

In [20]:
df = all_dfs[0]
df.head()

Unnamed: 0,Currency,2022‑08‑16,2022‑08‑17,2022‑08‑18,2022‑08‑19,2022‑08‑22
0,Australian dollar,0.9025,0.8949,0.8962,0.8936,0.897
1,Brazilian real,0.2504,0.2494,0.2496,0.2503,0.2521
2,Chinese renminbi,0.1896,0.1904,0.1905,0.1906,0.1904
3,European euro,1.3081,1.3134,1.308,1.3049,1.2979
4,Hong Kong dollar,0.1641,0.1646,0.1648,0.1656,0.1661


In [21]:
df.shape

(23, 6)

In [22]:
df['Currency']

0      Australian dollar
1         Brazilian real
2       Chinese renminbi
3          European euro
4       Hong Kong dollar
5           Indian rupee
6      Indonesian rupiah
7           Japanese yen
8           Mexican peso
9     New Zealand dollar
10       Norwegian krone
11      Peruvian new sol
12         Russian ruble
13           Saudi riyal
14      Singapore dollar
15    South African rand
16      South Korean won
17         Swedish krona
18           Swiss franc
19      Taiwanese dollar
20          Turkish lira
21     UK pound sterling
22             US dollar
Name: Currency, dtype: object

# Text data in Pandas

We've seen that a series can contain text, and so a column in a data frame can contain text, too. Such a column (series) has a dtype of `object`, which means that the strings aren't stored directly in the underlying NumPy array. Instead, the array holds references to Python string objects located elsewhere in memory. This means that we have access (in theory) to all of the Python string methods and associated functionality.
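A quick way to see this is to pull one element out of a text series and check its type. This is a minimal sketch with made-up words. It passes `dtype=object` explicitly, which is an assumption made so that the example behaves the same no matter how a given Pandas version infers string columns.

```python
from pandas import Series

# dtype=object is set explicitly (an assumption for this example)
words = Series(['oil', 'price'], dtype=object)

first = words.iloc[0]
print(type(first))     # each element is an ordinary Python str
print(first.upper())   # so ordinary str methods work on it
```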

In [23]:
# create a series based on a list of strings, created via str.split
# str.split without an explicit delimiter argument uses any whitespace, of any 
# length, in any combination

s = Series('this is a test of text in Pandas'.split())
s

0      this
1        is
2         a
3      test
4        of
5      text
6        in
7    Pandas
dtype: object
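The difference between `str.split()` with and without a delimiter is easy to miss, so here is a short sketch in plain Python. The strings are made up for illustration.

```python
messy = 'this  is\ta   test\n'

# No argument: split on any run of whitespace, ignoring it at the start and end
default_split = messy.split()
print(default_split)

# Explicit single space: every space is a separator, so empty strings appear
space_split = 'this  is'.split(' ')
print(space_split)
```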

In [25]:
# what is the length of each word?

# option 1 for answering: a for loop

for one_item in s:
    print(len(one_item))

4
2
1
4
2
4
2
6


In [27]:
# option 1b: use a list comprehension

[len(one_item)
 for one_item in s]

[4, 2, 1, 4, 2, 4, 2, 6]

**DO NOT DO THIS!**

If you ever find yourself using a `for` loop to process a Pandas series, stop! There is almost certainly a better way to get the same result.

In [29]:
# Better, option 2: Use the "str" accessor

# an "accessor" is a Pandas term for an attribute (i.e., coming after a .)
# that lets us access special functionality for certain types of objects

# if we have an "object" column containing strings, then we can use the str
# accessor to invoke a number of different string methods

s.str.len() 

# this invokes len() on each of the elements in s, and returns a new series
# the index of the returned series matches the index in our original series s

0    4
1    2
2    1
3    4
4    2
5    4
6    2
7    6
dtype: int64
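One practical reason to prefer the `str` accessor over a loop is how it handles missing values. Here is a minimal sketch, assuming a series with one `None` in it (the data is made up for illustration). In a plain loop, `len(None)` raises a `TypeError`, while `.str.len()` passes the missing value through as `NaN`.

```python
from pandas import Series

# Made-up data with one missing value; dtype=object is set explicitly
s_missing = Series(['this', None, 'test'], dtype=object)

# The str accessor passes the missing value through instead of raising an error
lengths = s_missing.str.len()
print(lengths)

# A plain loop fails on the same data
try:
    [len(one_item) for one_item in s_missing]
    loop_failed = False
except TypeError:
    loop_failed = True
print(loop_failed)
```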

In [30]:
# because we get a series back, we can use it in all series-type operations
s.str.len() == 2

0    False
1     True
2    False
3    False
4     True
5    False
6     True
7    False
dtype: bool

In [31]:
# apply the boolean series as a mask index, and find all of the words
# that have exactly two letters

s.loc[s.str.len() == 2]

1    is
4    of
6    in
dtype: object

In [33]:
# the "contains" method on the "str" accessor lets us search in a string
# for a character/substring

# use s.loc to find all words in s that contain 'e'
s.loc[s.str.contains('e')]

3    test
5    text
dtype: object
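Boolean series from the `str` accessor can be combined with `&` (and) or `|` (or) before they are handed to `.loc`. The sketch below reuses the same sentence and combines the two conditions shown so far. The second part uses a made-up two-element series to show the `regex=False` option of `contains`.

```python
from pandas import Series

s = Series('this is a test of text in Pandas'.split())

# Each comparison needs its own parentheses, because & binds more tightly than ==
two_letters_with_i = s.loc[(s.str.len() == 2) & (s.str.contains('i'))]
print(two_letters_with_i)

# contains treats its argument as a regular expression by default;
# regex=False searches for the literal text instead
literal_dot = Series(['a.b', 'ab']).str.contains('.', regex=False)
print(literal_dot)
```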

# Exercise: Longer-than-average words

1. Ask the user to enter a sentence. 
2. Turn that sentence into a Pandas series.
3. Show all of the words that are longer than average in the sentence.

In [34]:
sentence = input('Enter a sentence: ').strip()

s = Series(sentence.split())

Enter a sentence: this is the most marvelous and fascinating and scintillating sentence on the planet


In [35]:
s

0              this
1                is
2               the
3              most
4         marvelous
5               and
6       fascinating
7               and
8     scintillating
9          sentence
10               on
11              the
12           planet
dtype: object

In [36]:
# show all words in s that are longer than the average length

# lengths come from the str accessor; calling s.len() directly fails with
# AttributeError: 'Series' object has no attribute 'len'
s.str.len()
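Here is a sketch of one way to complete the exercise. It uses the sentence entered above as a fixed string so that it runs without `input`. The variable names `word_lengths`, `average_length`, and `longer_than_average` are chosen for this example.

```python
from pandas import Series

sentence = ('this is the most marvelous and fascinating and '
            'scintillating sentence on the planet')
s = Series(sentence.split())

word_lengths = s.str.len()             # length of each word
average_length = word_lengths.mean()   # mean of those lengths

# keep only the words whose length is above the mean
longer_than_average = s.loc[word_lengths > average_length]
print(longer_than_average)
```

Because `word_lengths` has the same index as `s`, the comparison produces a boolean series that lines up with `s` and works directly as a mask.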