# Week 4: Text and time

1. Text
    - Dealing with text data
    - Cleaning dirty integer data
    - Textual statistics 
    - Trimming strings
2. Dates and times
    - What does it mean to have dates and times in programming / data?
    - Time deltas
    - Time series
    - Resampling 

In [2]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [3]:
# if I create a series of integers, the dtype will (by default) be an integer type (np.int64)

s = Series([10, 20, 30, 40, 50])
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [4]:
# what if, though, I have a series of strings?

s = Series('this is a bunch of words'.split())
s

0     this
1       is
2        a
3    bunch
4       of
5    words
dtype: object

The `object` dtype in Pandas means: I'm not storing these values directly in NumPy, because they're ordinary Python objects. In the back-end NumPy storage, I just have a "pointer," or a "reference," to the memory location of each Python object.

If you see a `dtype` of `object`, the odds are pretty good that it contains strings.

Pandas is moving, slowly but surely, toward having its own string types, but we don't have to worry about that right now.
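Here is a minimal sketch of what that means in practice. I pass `dtype=object` explicitly (my choice, not something the session above does) so that the result is the same regardless of which Pandas version infers what by default. The variable names `words`, `mixed`, and `typed` are just for illustration.

```python
import pandas as pd

# dtype=object is passed explicitly so this behaves the same on every Pandas version
words = pd.Series('this is a bunch of words'.split(), dtype=object)
print(words.dtype)            # object
print(type(words.iloc[0]))    # <class 'str'> ... each element is a regular Python string

# an object series can hold any mix of Python objects, not just strings
mixed = pd.Series(['abc', 10, 2.5], dtype=object)
element_types = mixed.map(type)
print(element_types)

# opting in to Pandas's dedicated string dtype
typed = pd.Series('this is a bunch of words'.split(), dtype='string')
print(typed.dtype)
```

This is why seeing `object` only tells you that strings are *likely*: the dtype itself makes no promise about what kind of Python object sits in each slot.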

Let's say I want to find out how long each of these strings is. How can I do that? Python provides me with the `len` function, so can I run that on my series?

In [5]:
len(s)  # this returns the length of the series, not of the individual strings in the series

6

In [7]:
# what about a for loop?

for one_item in s:
    print(len(one_item))    # don't do this!

4
2
1
5
2
5


Pandas provides us with a special attribute, known as an "accessor," which lets us invoke string methods on every element in our series, one at a time. Instead of writing a `for` loop ourselves, we can have Pandas do the iteration on our behalf and hand back the results as a new series. This is tidier than a loop, though, as the timings below show, it is not automatically faster.

The key, then, is to use this accessor, known as `.str`.



In [8]:
s.str    # this brings up the accessor

<pandas.core.strings.accessor.StringMethods at 0x12194fd50>

In [9]:
s.str.len()    # notice -- we're invoking the method via the str accessor

0    4
1    2
2    1
3    5
4    2
5    5
dtype: int64

After invoking `s.str.len()`, we get back a new series, with the same index as `s`, and with the same length as `s`, but whose values are the results of invoking `len` on each of the elements of `s`.

The `dtype` is now `int64`, because we get integers from running `len`.
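To make the "same index" point easier to see, here is a small sketch with a non-default index. The index labels `w`, `x`, `y`, `z` are made up for the example.

```python
import pandas as pd

# a series of strings with a custom (hypothetical) index
s = pd.Series(['this', 'is', 'a', 'bunch'],
              index=['w', 'x', 'y', 'z'], dtype=object)

lengths = s.str.len()
print(lengths)

# the result lines up with the original, label for label
same_index = lengths.index.equals(s.index)
print(same_index)

# so I can look up a length using the same label I'd use on s
print(s.loc['z'], lengths.loc['z'])
```

Because the index carries over, the result of a `.str` method can be compared with, or used to filter, the original series without any extra alignment work.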

In [10]:
# let's do a little benchmarking to see which is faster
# I'll use the Jupyter magic command %timeit to time my code

%timeit s.str.len()

73.5 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [12]:
# let's compare it with a list comprehension

%timeit Series([len(one_item) for one_item in s])

40 µs ± 565 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
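On a six-element series the list comprehension came out ahead in the timings above, so speed alone isn't a reason to prefer `.str` here. One practical difference, sketched below, is how the two approaches treat missing values. The series with a `None` in it is my own example, not data from this session.

```python
import pandas as pd

# a series with a missing value in the middle (hypothetical data)
s = pd.Series(['this', None, 'words'], dtype=object)

# the accessor skips over the missing value and gives back NaN for it
lengths = s.str.len()
print(lengths)

# the plain-Python version has no such handling: len(None) raises TypeError
comprehension_failed = False
try:
    [len(one_item) for one_item in s]
except TypeError as e:
    comprehension_failed = True
    print(f'list comprehension failed: {e}')
```

Note that the result is now a float series, since `NaN` is a float value.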


What methods do we have available to us via the `str` accessor?

- Most of the built-in `str` methods in Python
- A bunch of methods that implement Python's operators (e.g., `.str.get` and `.str.slice` for `[]`, and `.str.contains` for `in`)
- Some other methods that we got from other languages, such as R
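Here is a quick sketch of the operator-style methods from the list above, using the same words as earlier. I pass `regex=False` to `str.contains` so that it does a plain substring check, which is the closest match to Python's `in`.

```python
import pandas as pd

s = pd.Series('this is a bunch of words'.split(), dtype=object)

# [] on the accessor indexes into each string, like one_word[0]
first_letters = s.str[0]
also_first = s.str.get(0)       # the method form of the same thing

# slices work too, like one_word[:2]
first_two = s.str[:2]

# str.contains plays the role of "in": is 's' in each word?
has_s = s.str.contains('s', regex=False)

print(first_letters.tolist())   # ['t', 'i', 'a', 'b', 'o', 'w']
print(first_two.tolist())       # ['th', 'is', 'a', 'bu', 'of', 'wo']
print(has_s.tolist())           # [True, True, False, False, False, True]
```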

In [13]:
s = Series('tHiS iS a vErY wEiRd lOoKiNg sEt oF wOrDs'.split())
s

0       tHiS
1         iS
2          a
3       vErY
4      wEiRd
5    lOoKiNg
6        sEt
7         oF
8      wOrDs
dtype: object

In [14]:
s.str.lower()   # this returns a new series in which all of the letters have been forced to lowercase

0       this
1         is
2          a
3       very
4      weird
5    looking
6        set
7         of
8      words
dtype: object

In [15]:
s.str.capitalize()

0       This
1         Is
2          A
3       Very
4      Weird
5    Looking
6        Set
7         Of
8      Words
dtype: object

In [16]:
s.str.swapcase()   # the most useless method in Python's standard library

0       ThIs
1         Is
2          A
3       VeRy
4      WeIrD
5    LoOkInG
6        SeT
7         Of
8      WoRdS
dtype: object

In [17]:
# this won't have any obvious effect now, but it might in some cases

s.str.strip()   # this removes leading/trailing whitespace from our strings

0       tHiS
1         iS
2          a
3       vErY
4      wEiRd
5    lOoKiNg
6        sEt
7         oF
8      wOrDs
dtype: object
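To see `strip` actually do something, here is a sketch with some deliberately messy strings. The padded values are invented for the example.

```python
import pandas as pd

# strings with leading/trailing spaces, tabs, and newlines (hypothetical data)
padded = pd.Series(['  tHiS', 'iS  ', '\ta\n', ' vErY wEiRd '], dtype=object)

# strip removes whitespace from both ends, but leaves interior spaces alone
stripped = padded.str.strip()
print(stripped.tolist())        # ['tHiS', 'iS', 'a', 'vErY wEiRd']

# lstrip and rstrip work on just one side
left_only = padded.str.lstrip()
print(left_only.tolist())       # ['tHiS', 'iS  ', 'a\n', 'vErY wEiRd ']

# comparing lengths before and after is a quick way to spot padded values
print(padded.str.len().tolist())
print(stripped.str.len().tolist())
```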

# Exercise: Longer-than-average words

1. Create a series of at least 10 strings of different lengths.
2. Find all of the words in the series that are longer than the average word length (in your series).

In [18]:
s = Series('this is a fantastic and wonderful and extremely interesting series of words'.split())
s

0            this
1              is
2               a
3       fantastic
4             and
5       wonderful
6             and
7       extremely
8     interesting
9          series
10             of
11          words
dtype: object

In [19]:
# how can I get the lengths of the words? with .str.len()

s.str.len()

0      4
1      2
2      1
3      9
4      3
5      9
6      3
7      9
8     11
9      6
10     2
11     5
dtype: int64

In [20]:
s.str.len().mean()  # calculate the mean word length

5.333333333333333

In [23]:
# which of the words in s are longer than the mean length?

# (1) calculate the mean with s.str.len().mean()
# (2) compare with the length of each word, s.str.len()
# (3) apply that boolean series to s.loc
# (4) we get back a series of words -- those longer than the mean

s.loc[s.str.len() > s.str.len().mean()]

3      fantastic
5      wonderful
7      extremely
8    interesting
9         series
dtype: object
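The one-liner above computes `s.str.len()` twice. Here is the same idea broken into named steps, as one possible way to write it. The names `lengths` and `longer_than_mean` are mine.

```python
import pandas as pd

s = pd.Series('this is a fantastic and wonderful and extremely '
              'interesting series of words'.split(), dtype=object)

# (1) get the length of each word, once
lengths = s.str.len()

# (2) compare each length with the mean, giving a boolean series
longer_than_mean = lengths > lengths.mean()

# (3) use the boolean series as a mask on s.loc
result = s.loc[longer_than_mean]
print(result)
```

Storing the intermediate series also makes it easy to inspect each step on its own when something looks off.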

In [24]:
# a series in which some words are capitalized

s = Series('this is a Fantastic and Wonderful and extremely Interesting series of Words'.split())


In [27]:
# which of these words are *not* capitalized?

s.loc[s == s.str.lower()]

0          this
1            is
2             a
4           and
6           and
7     extremely
9        series
10           of
dtype: object
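Another way to ask the same question, sketched here as an alternative rather than a replacement, is the `str.islower` method, which returns a boolean series directly. For these all-letter words it gives the same answer as comparing with `s.str.lower()`.

```python
import pandas as pd

s = pd.Series('this is a Fantastic and Wonderful and extremely '
              'Interesting series of Words'.split(), dtype=object)

# the comparison used above
by_comparison = s.loc[s == s.str.lower()]

# asking each string directly whether it is all lowercase
by_islower = s.loc[s.str.islower()]

print(by_islower)
print(by_comparison.equals(by_islower))
```

The two can differ on strings with no letters at all: `'123'.islower()` is `False`, while `'123' == '123'.lower()` is `True`.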

In [29]:
# what if I want to find all of those words that contain the letter 'e'?

s.loc[s.str.contains('e')]

5      Wonderful
7      extremely
8    Interesting
9         series
dtype: object

In [30]:
# what if I want to find all of those words that contain the letter 'i'?

s.loc[s.str.contains('i')]

0           this
1             is
3      Fantastic
8    Interesting
9         series
dtype: object

In [31]:
# what if I want to find all of those words that contain *either* e or i?

# I could use | as an "or" to combine conditions
s.loc[(s.str.contains('e')) | (s.str.contains('i'))]

0           this
1             is
3      Fantastic
5      Wonderful
7      extremely
8    Interesting
9         series
dtype: object

In [None]:
# another way -- take advantage of the fact that "str.contains" support regular expres