# Agenda, day 4 — text and dates
 
1. Q&A
2. Textual data and Pandas
3. Cleaning dirty textual data
4. Statistics about text
5. Useful string methods
6. Time and date information
    - `datetime` 
    - `timedelta`
7. Calculating time deltas
8. Time series (i.e., where we have time data as our index)
9. Resampling

In [1]:
import pandas as pd
from pandas import Series, DataFrame

In [2]:
# let's assume that I have a series containing some text

s = Series('This is a sample sentence for use in my Pandas course'.split())
s

0         This
1           is
2            a
3       sample
4     sentence
5          for
6          use
7           in
8           my
9       Pandas
10      course
dtype: object

In [3]:
# how can I find out the length of each word in this series?

# could we / should we use a "for" loop and run "len" on each word?

# using a "for" loop in Pandas is almost always the wrong solution.

for one_item in s:
    print(len(one_item))

4
2
1
6
8
3
3
2
2
6
6


In [4]:
# what we want is a way to run "len" on each element
# without a "for" loop

# Pandas provides us with a way to broadcast our string methods/functionality across elements of a series

# I'd want to say

s.len()

AttributeError: 'Series' object has no attribute 'len'

In [6]:
# we can use the "str" accessor on any series that contains strings
# in other words, we can say s.str.METHOD_NAME, and there are many, many methods defined for s.str

# we get back a new series, one whose index is identical to s!

s.str.len() 

0     4
1     2
2     1
3     6
4     8
5     3
6     3
7     2
8     2
9     6
10    6
dtype: int64

In [7]:
# what if I want to find all of the words that are longer than average in the series?

s.str.len().mean()   # find the mean length of words in s

3.909090909090909

In [8]:
# what words are longer than that?

# series > float -- we run a broadcast, and get a boolean series in return
s.str.len() > s.str.len().mean()

0      True
1     False
2     False
3      True
4      True
5     False
6     False
7     False
8     False
9      True
10     True
dtype: bool

In [9]:
# now, let's retrieve the elements of s where the word length > mean
s.loc[s.str.len() > s.str.len().mean()]

0         This
3       sample
4     sentence
9       Pandas
10      course
dtype: object
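
The filter above computes `s.str.len()` twice. Here is a minimal sketch of the same selection with the lengths stored in a variable first. The names `word_lengths` and `longer_than_average` are introduced only for this illustration.

```python
from pandas import Series

s = Series('This is a sample sentence for use in my Pandas course'.split())

# compute the word lengths once, then reuse them
word_lengths = s.str.len()

# boolean series: True where a word is longer than the mean length
longer_than_average = word_lengths > word_lengths.mean()

# keep only the elements of s where the boolean series is True
print(s.loc[longer_than_average])
```

Splitting the steps this way also makes it easy to inspect the intermediate boolean series when a filter returns something unexpected.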

# The `.str` accessor

If you want to run string methods on every element in a series, you can do so with `.str` and then the method name. What methods are available?

- Most of Python's string methods (e.g., `.str.lower`, `.str.strip`, `.str.startswith`)
- Many Python operations, implemented as methods
    - `.str.contains` implements `in` (by default, it treats its argument as a regular expression)
    - `.str.get` implements `[]`
- A few other methods that are just useful, often taken from the R language    
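
As a rough illustration of the kinds of methods listed above, here is a small sketch using the sample sentence from earlier. The variable names (`has_an`, `first_letters`) are made up for this example. Passing `regex=False` to `.str.contains` asks for a plain substring match, which is the closest equivalent to Python's `in`.

```python
from pandas import Series

s = Series('This is a sample sentence for use in my Pandas course'.split())

# like asking 'an' in word, for each word
# regex=False requests a plain substring match rather than a regular expression
has_an = s.str.contains('an', regex=False)
print(s.loc[has_an])

# like word[0], for each word (s.str[0] does the same thing)
first_letters = s.str.get(0)
print(first_letters)

# an ordinary Python string method, applied to each word
print(s.str.upper())
```

Each of these calls returns a new series with the same index as `s`, so the results can be used directly as boolean masks or combined with other series.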

# Exercise: Shorter than average strings

1. Ask the user to enter a sentence.
2. Turn the sentence into a series.
3. Find all of the words in the sentence that are shorter than average, and print them.

In [10]:
# you can get user input with the "input" function

s = input('Enter a string: ').strip()  # strip removes leading/trailing whitespace from the string

Enter a string: asdfasfdafasf


In [11]:
s

'asdfasfdafasf'