# Agenda, day 4 — text and dates
 
1. Q&A
2. Textual data and Pandas
3. Cleaning dirty textual data
4. Statistics about text
5. Useful string methods
6. Time and date information
    - `datetime` 
    - `timedelta`
7. Calculating time deltas
8. Time series (i.e., where we have time data as our index)
9. Resampling

In [1]:
import pandas as pd
from pandas import Series, DataFrame

In [2]:
# let's assume that I have a series containing some text

s = Series('This is a sample sentence for use in my Pandas course'.split())
s

0         This
1           is
2            a
3       sample
4     sentence
5          for
6          use
7           in
8           my
9       Pandas
10      course
dtype: object

In [3]:
# how can I find out the length of each word in this series?

# could we / should we use a "for" loop and run "len" on each word?

# using a "for" loop in Pandas is almost always the wrong solution.

for one_item in s:
    print(len(one_item))

4
2
1
6
8
3
3
2
2
6
6


In [4]:
# what we want is a way to run "len" on each element
# without a "for" loop

# Pandas provides us with a way to broadcast our string methods/functionality across elements of a series

# I'd want to say

s.len()

AttributeError: 'Series' object has no attribute 'len'

In [6]:
# we can use the "str" accessor object on any series that contains strings
# in other words, we can say s.str.METHOD_NAME and there are many, many methods defined for s.str

# we get back a new series, one whose index is identical to s!

s.str.len() 

0     4
1     2
2     1
3     6
4     8
5     3
6     3
7     2
8     2
9     6
10    6
dtype: int64

In [7]:
# what if I want to find all of the words that are longer than average in the series?

s.str.len().mean()   # find the mean length of words in s

3.909090909090909

In [8]:
# what words are longer than that?

# series > float -- we run a broadcast, and get a boolean series in return
s.str.len() > s.str.len().mean()

0      True
1     False
2     False
3      True
4      True
5     False
6     False
7     False
8     False
9      True
10     True
dtype: bool

In [9]:
# now, let's retrieve the elements of s where the word length > mean
s.loc[s.str.len() > s.str.len().mean()]

0         This
3       sample
4     sentence
9       Pandas
10      course
dtype: object
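
In the cells above, `s.str.len()` is calculated twice in the same expression. Here's a minimal sketch of the same query with the lengths stored in a variable first. The name `lengths` is my own choice for illustration, not something Pandas requires.

```python
import pandas as pd

s = pd.Series('This is a sample sentence for use in my Pandas course'.split())

# calculate the word lengths once, and reuse the result
lengths = s.str.len()

# boolean mask: True where the word is longer than the mean length
longer = s.loc[lengths > lengths.mean()]
print(longer)
```

The result is the same as before; the only difference is that the expression is easier to read and the lengths are computed once.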

# The `.str` accessor

If you want to run string methods on every element in a series, you can do so with `.str` and then the method name. What methods are available?

- Most of Python's string methods (e.g., `.str.upper`, `.str.strip`, `.str.isdigit`)
- Many Python operations, implemented as methods
    - `.str.contains` implements `in`
    - `.str.get` implements `[]`
- A few other methods that are just useful, often taken from the R language    
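
To make that list concrete, here's a small sketch with one example from each of the first two categories. The series contents are made up for illustration.

```python
import pandas as pd

s = pd.Series(['apple', 'Banana', 'cherry'])

upper = s.str.upper()           # Python's str.upper, run on each element
has_an = s.str.contains('an')   # like writing: 'an' in one_string
first = s.str.get(0)            # like writing: one_string[0]

print(upper.tolist())    # ['APPLE', 'BANANA', 'CHERRY']
print(has_an.tolist())   # [False, True, False]
print(first.tolist())    # ['a', 'B', 'c']
```

Each call returns a new series with the same index as `s`.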

# Exercise: Shorter than average strings

1. Ask the user to enter a sentence.
2. Turn the sentence into a series.
3. Find all of the words in the sentence that are shorter than average, and print them.

In [10]:
# you can get user input with the "input" function

s = input('Enter a string: ').strip()  # strip removes leading/trailing whitespace from the string

Enter a string: asdfasfdafasf


In [11]:
s

'asdfasfdafasf'

In [12]:
s = input('Enter a sentence: ').strip()

Enter a sentence: this is yet another test sentence for my Pandas course


In [13]:
s

'this is yet another test sentence for my Pandas course'

In [15]:
# I want to turn this string into a Pandas series. I'll use "split" to turn it into a list of strings

words = Series(s.split())
words

0        this
1          is
2         yet
3     another
4        test
5    sentence
6         for
7          my
8      Pandas
9      course
dtype: object

In [17]:
# to get the shorter-than-average words, I need to:

# (1) find the average word length
# (2) find the length of each word
# (3) find which words are shorter than the average length

words.str.len().mean()

4.5

In [19]:
# which word lengths are shorter than the average length?
# we'll get a boolean series back, with True where it's shorter and False where it isn't
words.str.len() < words.str.len().mean()

0     True
1     True
2     True
3    False
4     True
5    False
6     True
7     True
8    False
9    False
dtype: bool

In [20]:
# now I need to apply that boolean series back to words, to filter out
# any of the values where the boolean series is False

# here, we use .loc to keep only those words that are shorter than average length
words.loc[  words.str.len() < words.str.len().mean()  ]

0    this
1      is
2     yet
4    test
6     for
7      my
dtype: object
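
If you want to reuse the exercise solution, one option is to wrap the steps in a function. This is only a sketch: the function name `shorter_than_average` is hypothetical, and it assumes the sentence contains at least one word.

```python
import pandas as pd

def shorter_than_average(sentence):
    """Return the words in sentence that are shorter than the mean word length."""
    words = pd.Series(sentence.split())   # one word per element
    lengths = words.str.len()             # length of each word
    return words.loc[lengths < lengths.mean()]

result = shorter_than_average('this is yet another test sentence for my Pandas course')
print(result.tolist())   # ['this', 'is', 'yet', 'test', 'for', 'my']
```

To use it with user input, you could pass in `input('Enter a sentence: ').strip()` instead of the hard-coded string.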

In [21]:
# what happens if I do this:

s = Series('10 20 30 40 50'.split())
s

0    10
1    20
2    30
3    40
4    50
dtype: object

In [22]:
# what happens if I add these together?

# these are strings, and using + on them is a bit... dangerous, because + concatenates strings

s.sum()

'1020304050'

In [23]:
# it gets worse:

# this is awful -- it takes s.sum(), converts that string to a number, and then divides by 5!
# (newer versions of Pandas may raise a TypeError here instead)
s.mean()

204060810.0

In [25]:
# if I want the sum or the mean of the numbers in s, I need to convert the dtype to integer

import numpy as np
s = s.astype(np.int8)
s

0    10
1    20
2    30
3    40
4    50
dtype: int8

In [26]:
s.sum()

150

In [27]:
s.mean()

30.0
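
A note on the dtype: `np.int8` works for these numbers, but it only has room for small values. Here's a sketch showing its range via `np.iinfo`, along with the same conversion done with `np.int64`, which I'd assume is the safer default when you don't know how big the numbers will be.

```python
import numpy as np
import pandas as pd

# np.iinfo reports the smallest and largest values an integer dtype can hold
info = np.iinfo(np.int8)
print(info.min, info.max)   # -128 127

# same conversion as above, but with a dtype that has far more room
s = pd.Series('10 20 30 40 50'.split()).astype(np.int64)
total = s.sum()
print(total)   # 150
```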

In [28]:
# what if the data isn't quite this nice and simple?

s = Series('10 20 30 hello goodbye 40 whatever 50'.split())
s

0          10
1          20
2          30
3       hello
4     goodbye
5          40
6    whatever
7          50
dtype: object

In [29]:
# what will happen now if I try to convert the series to np.int8? 

s.astype(np.int8)

ValueError: invalid literal for int() with base 10: 'hello'

In [30]:
# what do I want to do here?

# - identify which elements in s contain only digits
# - remove the non-digit elements from the series
# - use .astype(np.int8) on what remains

# I can use .str.isdigit(), a method taken straight from Python's string class
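# this returns True if the string is non-empty and contains only digit characters
# (0-9, plus some other Unicode digits). Signs and decimal points don't count as digits.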

s.str.isdigit()

0     True
1     True
2     True
3    False
4    False
5     True
6    False
7     True
dtype: bool
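
It's worth knowing what `.str.isdigit()` rejects, because it is stricter than "looks like a number". A small sketch, using made-up values:

```python
import pandas as pd

s = pd.Series(['10', '-5', '3.5', '', ' 7'])

# only '10' consists entirely of digit characters;
# the minus sign, the decimal point, the empty string,
# and the leading space all make isdigit return False
mask = s.str.isdigit()
print(mask.tolist())   # [True, False, False, False, False]
```

So if your data might contain negative numbers or floats, filtering with `.str.isdigit()` will throw those away too.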

In [33]:
# use the boolean series as a mask index

s = s.loc[s.str.isdigit()].astype(np.int8)
s

0    10
1    20
2    30
5    40
7    50
dtype: int8

In [34]:
s.sum()

150

In [35]:
s.mean()

30.0

In [36]:
# what if I want to replace bad strings with NaN?

s = Series('10 20 30 hello goodbye 40 whatever 50'.split())
s

0          10
1          20
2          30
3       hello
4     goodbye
5          40
6    whatever
7          50
dtype: object

In [40]:
# use the ~ to flip the logic, as "not"

s.loc[~s.str.isdigit()] = np.nan

In [41]:
s

0     10
1     20
2     30
3    NaN
4    NaN
5     40
6    NaN
7     50
dtype: object

In [43]:
# now we can convert the values to floats, because NaN is a float
s.astype(np.float16)

0    10.0
1    20.0
2    30.0
3     NaN
4     NaN
5    40.0
6     NaN
7    50.0
dtype: float16

In [44]:
# I cannot turn them into integers, though...
s.astype(np.int8)

ValueError: cannot convert float NaN to integer
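
There are other ways to handle this, which I didn't use above. Here's a sketch of one alternative: `pd.to_numeric` with `errors='coerce'` turns anything it can't parse into NaN in one step, and the nullable `'Int64'` dtype (note the capital "I") can hold integers alongside missing values. Be aware that `pd.to_numeric` is more permissive than `.str.isdigit()`; it also accepts strings such as `'-5'` and `'3.5'`.

```python
import pandas as pd

s = pd.Series('10 20 30 hello goodbye 40 whatever 50'.split())

# anything that cannot be parsed as a number becomes NaN
as_float = pd.to_numeric(s, errors='coerce')
print(as_float.tolist())

# the nullable integer dtype keeps the missing values as <NA>
# (this works here because every parsed value is a whole number)
as_int = as_float.astype('Int64')
print(as_int.sum())   # 150 -- missing values are skipped
```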

# `.str.contains` -- checking membership



In [50]:
s = Series('this is a bunch of words for my course'.split())

In [51]:
s

0      this
1        is
2         a
3     bunch
4        of
5     words
6       for
7        my
8    course
dtype: object

In [52]:
# we can find out which of the strings contain a substring
# much as we would use "in" in regular Python as an operator

s.str.contains('i')

0     True
1     True
2    False
3    False
4    False
5    False
6    False
7    False
8    False
dtype: bool

In [53]:
# find words in s that contain 'i'
s.loc[s.str.contains('i')]

0    this
1      is
dtype: object

In [54]:
# what if I want all of those words that contain either e or i?
# option 1: use | to combine them, as an "or" operator

s.loc[s.str.contains('i') | s.str.contains('e')]

0      this
1        is
8    course
dtype: object

In [55]:
# what if I want all of those words that contain either e or i?
# option 2: use a regular expression!

s.loc[s.str.contains('[ei]')]  # this means: one of "e" or "i"

0      this
1        is
8    course
dtype: object
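
Because `.str.contains` treats its argument as a regular expression by default, characters such as `.` have special meaning. Here's a sketch of two keyword arguments that change that behavior, `regex` and `case`, using made-up strings:

```python
import pandas as pd

s = pd.Series(['a.b', 'abc', 'A-B'])

as_regex = s.str.contains('.')                  # regex: "." matches any character
as_literal = s.str.contains('.', regex=False)   # only a literal period matches
any_case = s.str.contains('a', case=False)      # ignore upper/lower case

print(as_regex.tolist())     # [True, True, True]
print(as_literal.tolist())   # [True, False, False]
print(any_case.tolist())     # [True, True, True]
```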

# Exercises with `.str`

1. Define a series of strings with both digits and non-digits as the elements.
2. As I did before, remove the non-digit elements, turn the digits into integers, and then sum them.
3. Find those elements that contain either `3` or `8`, and display them.
4. Find those elements that contain `3`, and which are shorter than average length.

In [56]:
s = Series('123 abc 456 defg 7hi j8k 9876 135'.split())
s

0     123
1     abc
2     456
3    defg
4     7hi
5     j8k
6    9876
7     135
dtype: object

In [57]:
s.astype(np.int64)

ValueError: invalid literal for int() with base 10: 'abc'

In [61]:
# (1) find which elements of s contain only digits
# (2) use .loc to retrieve only those elements into a new series
# (3) turn the dtype of that new series into np.int64

s = s.loc[s.str.isdigit()].astype(np.int64)
s

0     123
2     456
6    9876
7     135
dtype: int64

In [62]:
s.sum()

10590

In [66]:
# Find those elements that contained either 3 or 8 in them, and display them.

s = Series('123 abc 456 defg 7hi j8k 9876 135'.split())

s.loc[ s.str.contains('3') | s.str.contains('8') ]

0     123
5     j8k
6    9876
7     135
dtype: object

In [68]:
# Find those elements that contain 3, and which are shorter than average length.

      # does s contain '3'?          is the length < the average word length
s.loc[   s.str.contains('3')      &   (s.str.len() < s.str.len().mean())  ]

0    123
7    135
dtype: object
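
The parentheses around the length comparison matter: in Python, `&` binds more tightly than `<`, so leaving them out changes the meaning of the expression. One way to sidestep the issue, sketched here with variable names of my own choosing, is to give each condition a name first:

```python
import pandas as pd

s = pd.Series('123 abc 456 defg 7hi j8k 9876 135'.split())

has_3 = s.str.contains('3')                   # does the element contain '3'?
is_short = s.str.len() < s.str.len().mean()   # is it shorter than average?

# combine the two named masks with & ("and")
result = s.loc[has_3 & is_short]
print(result.tolist())   # ['123', '135']
```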

In [69]:
# can I run .str on integers?

s = s.loc[s.str.isdigit()].astype(np.int64)
s

0     123
2     456
6    9876
7     135
dtype: int64

In [70]:
s.str.contains('3')

AttributeError: Can only use .str accessor with string values!
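
If you do need string methods on a series of integers, one option is to convert the values to strings first with `.astype(str)`. A minimal sketch, using the same numbers as above:

```python
import pandas as pd

nums = pd.Series([123, 456, 9876, 135])

# convert to strings just for the test, and keep the original integers
mask = nums.astype(str).str.contains('3')
print(nums.loc[mask].tolist())   # [123, 135]
```

The mask comes from the string version, but it is applied to the integer series, so the result still has an integer dtype.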

# Next up

- Textual statistics
- Trimming strings
