# Week 4: Text and time

1. Text
    - Dealing with text data
    - Cleaning dirty integer data
    - Textual statistics 
    - Trimming strings
2. Dates and times
    - What does it mean to have dates and times in programming / data?
    - Time deltas
    - Time series
    - Resampling 

In [2]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [3]:
# if I create a series of integers, the dtype will (by default) be an integer type (np.int64)

s = Series([10, 20, 30, 40, 50])
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [4]:
# what if, though, I have a series of strings?

s = Series('this is a bunch of words'.split())
s

0     this
1       is
2        a
3    bunch
4       of
5    words
dtype: object

The `object` dtype in Pandas means: I'm not storing this in NumPy, because it's easier for me to think of it as a Python object. Really, in the back-end NumPy storage, I just have a "pointer," or a "reference," to the memory location of the Python object.

If you see a `dtype` of `object`, the odds are pretty good that it contains strings.

Pandas is moving, slowly but surely, toward having its own string types, but we don't have to worry about that right now.

Let's say I want to find out how long each of these strings is. How can I do that? Python provides me with the `len` function, so can I run that on my series?

In [5]:
len(s)  # this returns the length of the series, not of the individual strings in the series

6

In [7]:
# what about a for loop?

for one_item in s:
    print(len(one_item))    # don't do this!

4
2
1
5
2
5


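Pandas provides us with a special attribute, known as an "accessor," which lets us invoke string methods on every element in our series. Instead of writing a `for` loop, we can have Pandas do the iteration on our behalf and hand us back a new series. (With the `object` dtype, this isn't necessarily faster than looping yourself, as the benchmark below shows; the main benefits are shorter code and an index that's preserved.)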

The key, then, is to use this accessor, known as `.str`.
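
Here's a small sketch of one practical difference between the accessor and a hand-written loop, using a made-up series (`words` isn't from the session above): `.str` methods pass missing values through as `NaN`, whereas calling `len` on `NaN` in a loop raises a `TypeError`.

```python
import numpy as np
import pandas as pd

# hypothetical data: one value is missing
words = pd.Series(['this', 'is', np.nan, 'words'])

# the accessor passes the missing value through as NaN
lengths = words.str.len()
print(lengths)

# a plain loop fails on the NaN, because len() isn't defined for floats
try:
    [len(one_item) for one_item in words]
except TypeError as e:
    print(f'TypeError: {e}')
```

Notice that the result has a float dtype, because `NaN` is a float.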



In [8]:
s.str    # this brings up the accessor

<pandas.core.strings.accessor.StringMethods at 0x12194fd50>

In [9]:
s.str.len()    # notice -- we're invoking the method via the str accessor

0    4
1    2
2    1
3    5
4    2
5    5
dtype: int64

After invoking `s.str.len()`, we get back a new series, with the same index as `s`, and with the same length as `s`, but with values representing invoking `len` on each of the elements of `s`.

The `dtype` is now `int64`, because we get integers from running `len`.

In [10]:
# let's do a little benchmarking to see which is faster
# I'll use the Jupyter magic command %timeit to run my code

%timeit s.str.len()

73.5 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [12]:
# let's compare it with a list comprehension

%timeit Series([len(one_item) for one_item in s])

40 µs ± 565 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


What methods do we have available to us via the `str` accessor?

- Most of the builtin `str` methods in Python
- A bunch of methods that implement Python's operators (e.g., `[]` and `in`)
- Some other methods that we got from other languages, such as R
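
As a quick illustration of the second bullet, here's a sketch on a made-up series (`fruits` is just for illustration): `.str.contains` plays the role of `in`, and `.str[...]` plays the role of `[]`.

```python
import pandas as pd

fruits = pd.Series(['apple', 'banana', 'cherry'])

# like `'an' in one_string`, for each string (regex=False means a literal match)
has_an = fruits.str.contains('an', regex=False)

# like `one_string[0]` and `one_string[:3]`, for each string
first_letter = fruits.str[0]
first_three = fruits.str[:3]

print(has_an.tolist())
print(first_letter.tolist())
print(first_three.tolist())
```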

In [13]:
s = Series('tHiS iS a vErY wEiRd lOoKiNg sEt oF wOrDs'.split())
s

0       tHiS
1         iS
2          a
3       vErY
4      wEiRd
5    lOoKiNg
6        sEt
7         oF
8      wOrDs
dtype: object

In [14]:
s.str.lower()   # this returns a new series in which all of the letters have been forced to lowercase

0       this
1         is
2          a
3       very
4      weird
5    looking
6        set
7         of
8      words
dtype: object

In [15]:
s.str.capitalize()

0       This
1         Is
2          A
3       Very
4      Weird
5    Looking
6        Set
7         Of
8      Words
dtype: object

In [16]:
s.str.swapcase()   # the most useless method in Python's standard library

0       ThIs
1         Is
2          A
3       VeRy
4      WeIrD
5    LoOkInG
6        SeT
7         Of
8      WoRdS
dtype: object

In [17]:
# this won't have any obvious effect now, but it might in some cases

s.str.strip()   # this removes leading/trailing whitespace from our strings

0       tHiS
1         iS
2          a
3       vErY
4      wEiRd
5    lOoKiNg
6        sEt
7         oF
8      wOrDs
dtype: object

# Exercise: Longer-than-average words

1. Create a series of at least 10 strings of different lengths.
2. Find all of those words in the series that are longer than average (in your series). 

In [18]:
s = Series('this is a fantastic and wonderful and extremely interesting series of words'.split())
s

0            this
1              is
2               a
3       fantastic
4             and
5       wonderful
6             and
7       extremely
8     interesting
9          series
10             of
11          words
dtype: object

In [19]:
# how can I get the lengths of the words? with .str.len()

s.str.len()

0      4
1      2
2      1
3      9
4      3
5      9
6      3
7      9
8     11
9      6
10     2
11     5
dtype: int64

In [20]:
s.str.len().mean()  # calculate the mean word length

5.333333333333333

In [23]:
# which of the words in s are longer than the mean length?

# (1) calculate the mean with s.str.len().mean()
# (2) compare with the length of each word, s.str.len()
# (3) apply that boolean series to s.loc
# (4) we get back a series of words -- those longer than the mean

s.loc[s.str.len() > s.str.len().mean()]

3      fantastic
5      wonderful
7      extremely
8    interesting
9         series
dtype: object

In [24]:
# a series in which some words are capitalized

s = Series('this is a Fantastic and Wonderful and extremely Interesting series of Words'.split())


In [27]:
# which of these words are *not* capitalized?

s.loc[s == s.str.lower()]

0          this
1            is
2             a
4           and
6           and
7     extremely
9        series
10           of
dtype: object

In [29]:
# what if I want to find all of those words that contain the letter 'e'?

s.loc[s.str.contains('e')]

5      Wonderful
7      extremely
8    Interesting
9         series
dtype: object

In [30]:
# what if I want to find all of those words that contain the letter 'i'?

s.loc[s.str.contains('i')]

0           this
1             is
3      Fantastic
8    Interesting
9         series
dtype: object

In [31]:
# what if I want to find all of those words that contain *either* e or i?

# I could use | as an "or" to combine conditions
s.loc[(s.str.contains('e')) | (s.str.contains('i'))]

0           this
1             is
3      Fantastic
5      Wonderful
7      extremely
8    Interesting
9         series
dtype: object

In [33]:
# another way -- take advantage of the fact that "str.contains" supports regular expressions!
# https://RegexpCrashCourse.com

# I can use regexps in the contains method
# either e or i looks like this: [ei]

s.loc[s.str.contains('[ei]')]

0           this
1             is
3      Fantastic
5      Wonderful
7      extremely
8    Interesting
9         series
dtype: object
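
Two related tricks, sketched here on a copy of the same capitalized series (I'm calling it `words` to keep it separate from `s`): regexp alternation with `|` gives the same result as the character class, and passing `case=False` makes the search ignore case.

```python
import pandas as pd

words = pd.Series('this is a Fantastic and Wonderful and extremely Interesting series of Words'.split())

# 'e|i' means "e or i", just like the character class [ei]
same = words.str.contains('e|i').equals(words.str.contains('[ei]'))
print(same)

# case=False matches both 'w' and 'W'
with_w = words.loc[words.str.contains('w', case=False)]
print(with_w)
```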

In [34]:
s

0            this
1              is
2               a
3       Fantastic
4             and
5       Wonderful
6             and
7       extremely
8     Interesting
9          series
10             of
11          Words
dtype: object

In [35]:
# I asked you to find words > the mean length
# could I use describe to find that?

# using describe on a text series gives us a weird response

s.describe()

count      12
unique     11
top       and
freq        2
dtype: object

In [41]:
# you can do this, but why?

s.loc[[len(one_word) > s.str.len().describe()['mean'] 
      for one_word in s]]

3      Fantastic
5      Wonderful
7      extremely
8    Interesting
9         series
dtype: object

# What if I have a text series and want to make it numeric?

In [42]:
s = Series('10 20 30 40 50'.split())
s

0    10
1    20
2    30
3    40
4    50
dtype: object

In [43]:
s + s   # will this work?

0    1010
1    2020
2    3030
3    4040
4    5050
dtype: object

In [45]:
# we started with a string series, and got a string series back

# what if we want to actually treat our values as integers?
# what if we got them as strings, and want to change them to be integers?

# I can use .astype to get a new series back!

s = s.astype('int8')
s

0    10
1    20
2    30
3    40
4    50
dtype: int8

In [46]:
s+s

0     20
1     40
2     60
3     80
4    100
dtype: int8
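
A word of caution about small dtypes, as a sketch: `int8` only holds values from -128 through 127, and integer arithmetic on a series wraps around rather than raising an error. The numbers above stay in range, but larger ones wouldn't. (The series `small` below is made up for illustration.)

```python
import pandas as pd

small = pd.Series(['100', '120']).astype('int8')

# 200 and 240 don't fit in 8 bits, so the results wrap around silently
wrapped = small + small
print(wrapped)

# converting to a wider dtype first avoids the problem
wide = small.astype('int64')
correct = wide + wide
print(correct)
```

When in doubt, `int64` is the safer choice; you can always shrink the dtype later, once you know the range of your data.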

In [47]:
# a harder example

s = Series('10 20 30 abcd 40 50 efgh'.split())
s

0      10
1      20
2      30
3    abcd
4      40
5      50
6    efgh
dtype: object

In [48]:
# what happens if I try to use astype to get back a new series of ints?

s.astype('int8')

ValueError: invalid literal for int() with base 10: 'abcd'

In [55]:
# if I want to get a new series based on s, containing ints, I need to
# remove the elements that don't contain digits

# fortunately, s.str supports the "isdigit" method, which returns True/False
# based on whether the string only contains 0-9.

# note: this means that you cannot have - or . in your number

s.loc[s.str.isdigit()].astype('int16') 

0    10
1    20
2    30
4    40
5    50
dtype: int16
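
If your numbers might include a minus sign or a decimal point, one alternative (not what we did above) is `pd.to_numeric` with `errors='coerce'`, which turns anything it can't parse into `NaN`. A minimal sketch, with a made-up series:

```python
import pandas as pd

raw = pd.Series('10 -20 3.5 abcd 40'.split())

# unparseable strings become NaN instead of raising ValueError
numbers = pd.to_numeric(raw, errors='coerce')
print(numbers)

# drop the NaN values to keep only the real numbers
clean = numbers.dropna()
print(clean)
```

The trade-off is that the result here is a float series, so you'd need another `astype` if you want integers.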

# Exercise: Even (dirty) ints

1. Create a series containing a bunch of integers, as well as a bunch of other non-numeric values.
2. Try to turn it into a series of ints... and it'll fail.
3. Use `isdigit` to filter out the non-numeric values.
4. Once you've done that, grab only the even numbers.
5. Calculate the mean of those even numbers.

In [56]:
np.random.seed(0)   # reset the random-number generator to a known state
s = Series(np.random.randint(0, 1000, 20))
s

0     684
1     559
2     629
3     192
4     835
5     763
6     707
7     359
8       9
9     723
10    277
11    754
12    804
13    599
14     70
15    472
16    600
17    396
18    314
19    705
dtype: int64

In [57]:
s.loc[3] = 'hello'
s.loc[10] = 'goodbye'
s.loc[16] = 'whatever'

In [58]:
s

0          684
1          559
2          629
3        hello
4          835
5          763
6          707
7          359
8            9
9          723
10     goodbye
11         754
12         804
13         599
14          70
15         472
16    whatever
17         396
18         314
19         705
dtype: object

In [61]:
s = s.astype(str)
s

0          684
1          559
2          629
3        hello
4          835
5          763
6          707
7          359
8            9
9          723
10     goodbye
11         754
12         804
13         599
14          70
15         472
16    whatever
17         396
18         314
19         705
dtype: object

In [65]:
s = s.loc[s.str.isdigit()].astype('int64')
s

0     684
1     559
2     629
4     835
5     763
6     707
7     359
8       9
9     723
11    754
12    804
13    599
14     70
15    472
17    396
18    314
19    705
dtype: int64

In [67]:
# find the even elements of s

s.loc[s % 2 == 0]

0     684
11    754
12    804
14     70
15    472
17    396
18    314
dtype: int64

In [68]:
s.loc[s % 2 == 0].mean()

499.14285714285717

In [69]:
s = Series('this is a bunch of words'.split())
s

0     this
1       is
2        a
3    bunch
4       of
5    words
dtype: object

In [71]:
# use the new builtin Pandas string dtype
# (pass an instance, pd.StringDtype(), or the alias 'string'; there is no pd.String)

s = Series('this is a bunch of words'.split(), dtype=pd.StringDtype())
s

0     this
1       is
2        a
3    bunch
4       of
5    words
dtype: string

In [72]:
s = Series([10, 15, 20, 30])

s

0    10
1    15
2    20
3    30
dtype: int64

In [73]:
s.loc[s > 16]

2    20
3    30
dtype: int64

In [78]:
df = DataFrame([['x', 'a'],
                ['x', 'b'],
                ['x', 'c'],
               ['y', 'a'],
               ['y', 'b'],
               ['y', 'c']])
df

   0  1
0  x  a
1  x  b
2  x  c
3  y  a
4  y  b
5  y  c


In [79]:
df.groupby(0)[1].sum()

0
x    abc
y    abc
Name: 1, dtype: object

# Next up

1. Textual statistics
2. Splitting and retrieving
3. `value_counts` and text data
4. Trimming and getting and slicing

In [80]:
s = Series('10 20 30 40 50'.split())

s

0    10
1    20
2    30
3    40
4    50
dtype: object

In [81]:
# what happens when I calculate the mean?

s.mean()

204060810.0

In [82]:
s.sum()

'1020304050'

In [83]:
s.sum() / s.count()

TypeError: ufunc 'divide' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

In [84]:
s.mean()

204060810.0

In [85]:
1020304050 / 5

204060810.0
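
The odd number above comes from joining the strings into `'1020304050'` and then dividing by 5. To get the arithmetic mean, convert the strings to numbers first. A minimal sketch with the same values:

```python
import pandas as pd

s = pd.Series('10 20 30 40 50'.split())

# convert to integers first, then calculate the mean
mean_value = s.astype('int64').mean()
print(mean_value)
```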

In [86]:
s = Series('This is the most interesting thing that I have written today but that is not hard because it is very early in the morning where I live'.split())

In [87]:
s

0            This
1              is
2             the
3            most
4     interesting
5           thing
6            that
7               I
8            have
9         written
10          today
11            but
12           that
13             is
14            not
15           hard
16        because
17             it
18             is
19           very
20          early
21             in
22            the
23        morning
24          where
25              I
26           live
dtype: object

In [88]:
s.describe()

count     27
unique    22
top       is
freq       3
dtype: object

In [89]:
df = DataFrame(np.random.randint(0, 100, [4,5]),
              index=list('abcd'),
              columns=list('vwxyz'))
df

    v   w   x   y   z
a  39  87  46  88  81
b  37  25  77  72   9
c  20  80  69  79  47
d  64  82  99  88  49


In [90]:
df['u'] = 'this is another test'.split()
df

    v   w   x   y   z        u
a  39  87  46  88  81     this
b  37  25  77  72   9       is
c  20  80  69  79  47  another
d  64  82  99  88  49     test


In [91]:
df.dtypes

v     int64
w     int64
x     int64
y     int64
z     int64
u    object
dtype: object

In [92]:
df.describe()

               v          w          x          y          z
count        4.0        4.0        4.0        4.0        4.0
mean        40.0       68.5      72.75      81.75       46.5
std    18.129166  29.149042  21.884165   7.762087  29.456182
min         20.0       25.0       46.0       72.0        9.0
25%        32.75      66.25      63.25      77.25       37.5
50%         38.0       81.0       73.0       83.5       48.0
75%        45.25      83.25       82.5       88.0       57.0
max         64.0       87.0       99.0       88.0       81.0


In [93]:
help(df.describe)

Help on method describe in module pandas.core.generic:

describe(percentiles=None, include=None, exclude=None, datetime_is_numeric: 'bool_t' = False) -> 'NDFrameT' method of pandas.core.frame.DataFrame instance
    Generate descriptive statistics.
    
    Descriptive statistics include those that summarize the central
    tendency, dispersion and shape of a
    dataset's distribution, excluding ``NaN`` values.
    
    Analyzes both numeric and object series, as well
    as ``DataFrame`` column sets of mixed data types. The output
    will vary depending on what is provided. Refer to the notes
    below for more detail.
    
    Parameters
    ----------
    percentiles : list-like of numbers, optional
        The percentiles to include in the output. All should
        fall between 0 and 1. The default is
        ``[.25, .5, .75]``, which returns the 25th, 50th, and
        75th percentiles.
    include : 'all', list-like of dtypes or None (default), optional
        A white list of 

In [94]:
df.describe(include='all')

                v          w          x          y          z     u
count         4.0        4.0        4.0        4.0        4.0     4
unique        NaN        NaN        NaN        NaN        NaN     4
top           NaN        NaN        NaN        NaN        NaN  this
freq          NaN        NaN        NaN        NaN        NaN     1
mean         40.0       68.5      72.75      81.75       46.5   NaN
std     18.129166  29.149042  21.884165   7.762087  29.456182   NaN
min          20.0       25.0       46.0       72.0        9.0   NaN
25%         32.75      66.25      63.25      77.25       37.5   NaN
50%          38.0       81.0       73.0       83.5       48.0   NaN
75%         45.25      83.25       82.5       88.0       57.0   NaN


In [95]:
s

0            This
1              is
2             the
3            most
4     interesting
5           thing
6            that
7               I
8            have
9         written
10          today
11            but
12           that
13             is
14            not
15           hard
16        because
17             it
18             is
19           very
20          early
21             in
22            the
23        morning
24          where
25              I
26           live
dtype: object

In [96]:
s.value_counts()

is             3
the            2
that           2
I              2
This           1
hard           1
where          1
morning        1
in             1
early          1
very           1
it             1
because        1
but            1
not            1
today          1
written        1
have           1
thing          1
interesting    1
most           1
live           1
dtype: int64

In [97]:
'this is a test'.split()

['this', 'is', 'a', 'test']

In [98]:
s = Series(['this is a test', 'this is another test', 'yet another one for us to look at'])
s

0                       this is a test
1                 this is another test
2    yet another one for us to look at
dtype: object

In [99]:
# one of the methods that we can invoke on .str is split

s.str.split()

0                           [this, is, a, test]
1                     [this, is, another, test]
2    [yet, another, one, for, us, to, look, at]
dtype: object

In [100]:
s.str.split()[0]

['this', 'is', 'a', 'test']

In [101]:
# there is another .str method, called "get"
# get works just like [] does in regular Python

s.str.get(0)  # this returns the first letter of each string

0    t
1    t
2    y
dtype: object

In [103]:
# what if I have a list? Can I retrieve from there, too?
# answer: yes!  But it'll be a bit weird

s.str.split().str.get(0)

0    this
1    this
2     yet
dtype: object
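
Because `.str.split()` gives us a series of lists, we can chain other accessor methods onto it. Here's a sketch with the same three sentences: `.str.len()` on the lists counts the words in each sentence, and `.str[-1]` (which behaves like `.str.get(-1)`) grabs the last word.

```python
import pandas as pd

s = pd.Series(['this is a test',
               'this is another test',
               'yet another one for us to look at'])

# a series of lists, one list of words per sentence
words = s.str.split()

# len of each list, meaning the number of words in each sentence
word_counts = words.str.len()
print(word_counts)

# the final element of each list
last_words = words.str[-1]
print(last_words)
```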

# Exercise: Letter frequencies

Using Pandas, find the 5 most common characters that appear in a string.

In [106]:
text = 'this is a bunch of characters that will be analyzed'

s = Series(list(text))

In [107]:
s

0     t
1     h
2     i
3     s
4      
5     i
6     s
7      
8     a
9      
10    b
11    u
12    n
13    c
14    h
15     
16    o
17    f
18     
19    c
20    h
21    a
22    r
23    a
24    c
25    t
26    e
27    r
28    s
29     
30    t
31    h
32    a
33    t
34     
35    w
36    i
37    l
38    l
39     
40    b
41    e
42     
43    a
44    n
45    a
46    l
47    y
48    z
49    e
50    d
dtype: object

In [108]:
s.value_counts()

     9
a    6
t    4
h    4
l    3
e    3
c    3
s    3
i    3
n    2
b    2
r    2
u    1
o    1
f    1
w    1
y    1
z    1
d    1
dtype: int64
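
Since `value_counts` sorts from most common to least common, adding `.head(5)` finishes the exercise. Whether the space character should count is a judgment call, so this sketch shows both versions:

```python
import pandas as pd

text = 'this is a bunch of characters that will be analyzed'
s = pd.Series(list(text))

# the 5 most common characters, including the space
top_five = s.value_counts().head(5)
print(top_five)

# the 5 most common characters, ignoring spaces
top_five_letters = s.loc[s != ' '].value_counts().head(5)
print(top_five_letters)
```

Several letters are tied at 3 appearances, so which of them make it into the top 5 may depend on your version of Pandas.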

In [109]:
help(s.str.cat)

Help on method cat in module pandas.core.strings.accessor:

cat(others=None, sep=None, na_rep=None, join='left') -> 'str | Series | Index' method of pandas.core.strings.accessor.StringMethods instance
    Concatenate strings in the Series/Index with given separator.
    
    If `others` is specified, this function concatenates the Series/Index
    and elements of `others` element-wise.
    If `others` is not passed, then all values in the Series/Index are
    concatenated into a single string with a given `sep`.
    
    Parameters
    ----------
    others : Series, Index, DataFrame, np.ndarray or list-like
        Series, Index, DataFrame, np.ndarray (one- or two-dimensional) and
        other list-likes of strings must have the same length as the
        calling Series/Index, with the exception of indexed objects (i.e.
        Series/Index/DataFrame) if `join` is not None.
    
        If others is a list-like that contains a combination of Series,
        Index or np.ndarray (1-dim

In [111]:
s = Series('this is a bunch of words'.split())
s

0     this
1       is
2        a
3    bunch
4       of
5    words
dtype: object

In [112]:
# I can use get to retrieve one character

s.str.get(0)

0    t
1    i
2    a
3    b
4    o
5    w
dtype: object

In [113]:
word = 'hello'

word[2:4]  # starting at index 2, until (not including) index 4

'll'

In [114]:
# I can do the same thing with .str.slice, specifying the start and stop or start, stop, step
s.str.slice(2, 4)   # get, from each word in s, from index 2 until (not including) index 4

0    is
1      
2      
3    nc
4      
5    rd
dtype: object

In [116]:
s.str.slice(0, 2)

0    th
1    is
2     a
3    bu
4    of
5    wo
dtype: object

In [117]:
# from index 2 through the end...

s.str.slice(2, None)

0     is
1       
2       
3    nch
4       
5    rds
dtype: object

# Exercise: Summing prices

1. Create a series of strings, in which each string consists of a \$ followed by one or more digits (for prices).
2. Sum the prices. This will involve removing the \$ and also turning the values into integers.

In [118]:
s = Series('$100 $234 $102 $2 $1234'.split())

In [119]:
s

0     $100
1     $234
2     $102
3       $2
4    $1234
dtype: object

In [123]:
# option 1: use a slice to remove the first character

s.str.slice(1, None).astype('int64').sum()

1672

In [126]:
# option 2: use "str.replace" to remove the $

s.str.replace('$', '', regex=False).astype('int64').sum()  # treat our first string as literal

1672

In [139]:
# option 3: use regular expressions to keep only the digits
# (the output below reflects s as redefined in the next cells, with commas in the prices)

s.str.replace(r'[^\d]', '', regex=True)

0     100
1    2345
2     102
3       2
4    1234
dtype: object

In [136]:
s = Series('$100 $2,345 $102 $2 $1,234'.split())

In [137]:
s

0      $100
1    $2,345
2      $102
3        $2
4    $1,234
dtype: object

In [138]:
# let's remove both $ and ,

s.str.replace('[$,]', '', regex=True)

0     100
1    2345
2     102
3       2
4    1234
dtype: object
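
To finish the job on these comma-separated prices, we can chain the same steps as before: remove the unwanted characters, convert to integers, and sum. A sketch (I'm using the name `prices` here, rather than `s`, just for clarity):

```python
import pandas as pd

prices = pd.Series('$100 $2,345 $102 $2 $1,234'.split())

# remove $ and , from each string, then convert to integers and add them up
total = (prices
         .str.replace(r'[$,]', '', regex=True)
         .astype('int64')
         .sum())
print(total)
```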

In [140]:
s

0      $100
1    $2,345
2      $102
3        $2
4    $1,234
dtype: object

In [141]:
s.str.removeprefix('$')

0      100
1    2,345
2      102
3        2
4    1,234
dtype: object

In [142]:
s = 'salaries.xlsx'

# assuming that string, how can I get rid of the .xlsx suffix?
# many many many people assume you can use "strip"

s.strip('.xlsx')   # strip treats its argument as a set of characters, removing any of . x l s from both ends

'alarie'

In [143]:
s.removesuffix('.xlsx')

'salaries'
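
Here's a side-by-side sketch of the difference, assuming Python 3.9 or later (for `str.removesuffix`) and a version of Pandas that offers `.str.removesuffix`. The filenames in the series are made up for illustration.

```python
import pandas as pd

filename = 'salaries.xlsx'

# strip removes any of the characters . x l s from both ends
stripped = filename.strip('.xlsx')
print(stripped)

# removesuffix removes the exact string, and only from the end
trimmed = filename.removesuffix('.xlsx')
print(trimmed)

# the same method is available via the str accessor
filenames = pd.Series(['salaries.xlsx', 'taxes.xlsx'])
trimmed_names = filenames.str.removesuffix('.xlsx')
print(trimmed_names)
```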

# Next up

1. Dates and times!
2. Timestamp + timedelta objects
3. Converting dates, especially when reading in data
4. The `.dt` accessor
5. Time series
6. Resampling

# Dates and times

When we talk about dates and times, we're actually talking about two different types of data:

- A specific point in time, with a unique combination of year-month-day and hour-minute-second. In the programming world, this is known as a "datetime" object, or sometimes a "timestamp" object. Examples: when our class starts, when our class ends, birth dates, death dates, and meeting start times.
