# Week 4: Text and dates

1. Text
    - Working with text data via the `str` accessor
    - Using `str` to clean integer data
    - Getting textual statistics 
    - Cleaning text + strings
2. Dates and times
    - How do dates and times work as data structures?
    - `datetime` and `timedelta` objects
    - Reading date information from CSV files
    - Retrieving via dates and times
    - Time series -- setting the index to use a datetime column
    - Resampling -- grouping via time

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [2]:
# load NYC taxi data from January, 2019
filename = '/Users/reuven/Courses/Current/data/nyc_taxi_2019-01.csv'

df = pd.read_csv(filename, 
                usecols=['passenger_count', 'trip_distance', 'total_amount'])

In [3]:
df.head()

Unnamed: 0,passenger_count,trip_distance,total_amount
0,1,1.5,9.95
1,1,2.6,16.3
2,3,0.0,5.8
3,5,0.0,7.55
4,5,0.0,55.55


In [4]:
df.dtypes

passenger_count      int64
trip_distance      float64
total_amount       float64
dtype: object

In [6]:
# what if I want to find the 10 shortest-distance trips?

df.sort_values(by='trip_distance').head(10)

Unnamed: 0,passenger_count,trip_distance,total_amount
7667791,1,0.0,0.0
4863796,0,0.0,3.3
4863795,2,0.0,3.3
4863794,2,0.0,3.3
4863793,1,0.0,3.3
4863792,1,0.0,3.3
4863789,1,0.0,5.3
4863768,1,0.0,3.3
4863743,1,0.0,3.96
2682283,1,0.0,20.3


In [7]:
# what if I want to sort first by trip_distance, and then (in the case of a tie) by total_amount?

df.sort_values(by=['trip_distance', 'total_amount']).head(20)

Unnamed: 0,passenger_count,trip_distance,total_amount
4890628,1,0.0,-362.8
6308124,2,0.0,-320.3
57093,1,0.0,-300.3
7227721,1,0.0,-300.3
868820,1,0.0,-250.31
54256,1,0.0,-224.8
57095,1,0.0,-190.3
3339311,1,0.0,-165.3
3310047,1,0.0,-160.8
6153373,1,0.0,-150.8
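
To check the tie-breaking by eye, here is a minimal sketch on a made-up three-row data frame. The `trips` name and its values are invented for illustration and are not taken from the taxi data. The second call uses the `ascending` parameter, which the cell above doesn't use, to set one direction per column.

```python
import pandas as pd

# made-up data: rows 0 and 2 tie on trip_distance
trips = pd.DataFrame({'trip_distance': [0.0, 1.5, 0.0],
                      'total_amount': [5.8, 9.95, 3.3]})

# sort by trip_distance first; ties are broken by total_amount
by_both = trips.sort_values(by=['trip_distance', 'total_amount'])
print(by_both)

# one direction per column: shortest trips first,
# and within a tie, the most expensive trip first
mixed = trips.sort_values(by=['trip_distance', 'total_amount'],
                          ascending=[True, False])
print(mixed)
```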


In [8]:
# grouping 

# grouping allows us to ask a question, and to get a separate answer for each
# unique value of a particular column.

# to group, we need:
# (1) a categorical column on which to group
# (2) a numeric column on which to perform our calculation
# (3) an aggregation method that will give us one value back for all rows for each categorical value

df.groupby('passenger_count')['trip_distance'].mean()

passenger_count
0    2.651561
1    2.779088
2    2.880572
3    2.840698
4    2.853084
5    2.865741
6    2.842335
7    2.561579
8    3.142759
9    1.486667
Name: trip_distance, dtype: float64
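
The three ingredients listed in the comments above are easier to verify on data small enough to calculate by hand. This is a sketch on an invented five-row data frame (`trips` and its values are assumptions, not taxi data).

```python
import pandas as pd

# invented data: two trips with 1 passenger, three trips with 2 passengers
trips = pd.DataFrame({'passenger_count': [1, 1, 2, 2, 2],
                      'trip_distance': [1.0, 3.0, 2.0, 4.0, 6.0]})

# (1) categorical column: passenger_count
# (2) numeric column: trip_distance
# (3) aggregation method: mean
means = trips.groupby('passenger_count')['trip_distance'].mean()
print(means)
```

We get one row back per unique value of `passenger_count`: the mean of 1.0 and 3.0 for one passenger, and the mean of 2.0, 4.0, and 6.0 for two.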

In [9]:
df.groupby('passenger_count')['total_amount'].mean()

passenger_count
0    18.663658
1    15.609601
2    15.831294
3    15.604015
4    15.650307
5    15.546940
6    15.437892
7    48.278421
8    64.105517
9    31.094444
Name: total_amount, dtype: float64

In [10]:
# what if I want to get info about more than one column?
# two options:
# (1) don't specify columns in square brackets, and then all numeric columns will be calculated
# (2) specify them in double square brackets, giving a list of columns

df.groupby('passenger_count')[['total_amount', 'trip_distance']].mean()

passenger_count,total_amount,trip_distance
0,18.663658,2.651561
1,15.609601,2.779088
2,15.831294,2.880572
3,15.604015,2.840698
4,15.650307,2.853084
5,15.54694,2.865741
6,15.437892,2.842335
7,48.278421,2.561579
8,64.105517,3.142759
9,31.094444,1.486667


In [11]:
df.groupby('passenger_count')[['total_amount', 'trip_distance']].mean().sort_values(by='total_amount')

passenger_count,total_amount,trip_distance
6,15.437892,2.842335
5,15.54694,2.865741
3,15.604015,2.840698
1,15.609601,2.779088
4,15.650307,2.853084
2,15.831294,2.880572
0,18.663658,2.651561
9,31.094444,1.486667
7,48.278421,2.561579
8,64.105517,3.142759


# Text

In Python (not Pandas), we use strings all of the time. They're really useful! But think about how Pandas does things: It stores its values inside of NumPy arrays. Those arrays contain C data structures.

It's fairly clear that those C data structures include 8-bit ints, 16-bit ints, 32-bit floats, and so on. But where do strings fit in?

First: You can, in theory, put strings in NumPy arrays, but then things get very messy. Also, you then have C-style strings, not Python-style strings, which means that they're far more limited.
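
One way to see the limitation is this small sketch, which assumes NumPy's default behavior of choosing a fixed-width Unicode dtype for an array of strings. The `words` array is made up for illustration.

```python
import numpy as np

# NumPy sizes the dtype to fit the longest string it was given (4 characters here)
words = np.array(['this', 'is', 'text'])
print(words.dtype)

# assigning a longer string doesn't grow the slot; the extra characters are dropped
words[1] = 'considerably longer'
print(words[1])
```

Every element gets the same amount of room, so a longer string assigned later is silently cut down to fit.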

The other option is to keep the strings as ordinary Python objects, stored by Python outside of the array. The array then holds a reference to each Python string, and Pandas follows that reference as necessary.

If we have a series whose `dtype` is listed as `object`, that theoretically means that we could store any kind of Python object there. In practice, it almost always means that we have a string column, or a column containing some strings.
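
Here is a short sketch of what `object` means in practice. The series are made up, and `dtype=object` is passed explicitly as an assumption about the setup, since newer versions of Pandas may otherwise pick a dedicated string dtype.

```python
import pandas as pd

# a series of strings, stored as references to Python objects
words = pd.Series(['this', 'is', 'text'], dtype=object)
print(words.dtype)
print(type(words.iloc[0]))   # each element is a regular Python str

# an object series accepts any Python object, so it can mix types
mixed = pd.Series(['text', 10, 3.5], dtype=object)
kinds = [type(value).__name__ for value in mixed]
print(kinds)
```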

In [12]:
s = Series('this is an example of text in Pandas for my course'.split())

In [13]:
s

0        this
1          is
2          an
3     example
4          of
5        text
6          in
7      Pandas
8         for
9          my
10     course
dtype: object

Once we have our string column, we can print it -- but how can we work with it?

For example, if I want to find out the length of each word in `s`, how can I do it?

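- Option 1: Use a `for` loop to go through each value and get it. Problem: A Python-level loop is slow and gives up what Pandas does for us. Don't do that.
- Option 2: Use the `str` accessor on the series to invoke a string method on each element of `s`. (The accessor works on a series, so with a data frame, first select the string column.)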
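
The two options can be compared with this sketch. It uses a made-up series that includes a missing value, which is an assumption added here because it is where the difference shows up most clearly.

```python
import pandas as pd

words = pd.Series(['this', None, 'example'], dtype=object)

# Option 1: the loop has to special-case the missing value itself
loop_lengths = [len(word) if word is not None else None
                for word in words]
print(loop_lengths)

# Option 2: the accessor runs len on each string and leaves the missing value as NaN
accessor_lengths = words.str.len()
print(accessor_lengths)
```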

In [15]:
s.str.len()   # via the str accessor, we get the length of each string in s

0     4
1     2
2     2
3     7
4     2
5     4
6     2
7     6
8     3
9     2
10    6
dtype: int64

# Exercise: Find longer-than-average words

1. Define a series containing strings.
2. Construct a query that'll return all strings of above-average length.
3. Print the returned results.

In [16]:
s = Series('this is a bunch of words for my Pandas course about strings and dates'.split())
s

0        this
1          is
2           a
3       bunch
4          of
5       words
6         for
7          my
8      Pandas
9      course
10      about
11    strings
12        and
13      dates
dtype: object

In [18]:
# what is the average length of a word in s?
# what are the lengths of the words?

s.str.len().mean()

4.0

In [19]:
# which words are longer than the average?

s.str.len() > s.str.len().mean()

0     False
1     False
2     False
3      True
4     False
5      True
6     False
7     False
8      True
9      True
10     True
11     True
12    False
13     True
dtype: bool

In [20]:
# use the boolean series as a mask index, to get only those items whose lengths are > the mean
s.loc[s.str.len() > s.str.len().mean()]

3       bunch
5       words
8      Pandas
9      course
10      about
11    strings
13      dates
dtype: object

In [21]:
# in Jupyter, I can say

s.str.isdigit()   # this returns a series of booleans -- True wherever the string is non-empty and contains only digits

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
dtype: bool

In [22]:
s = Series('I work 40 hours per week at my job, and another 20 hours per week on my consulting.'.split())

In [23]:
s

0               I
1            work
2              40
3           hours
4             per
5            week
6              at
7              my
8            job,
9             and
10        another
11             20
12          hours
13            per
14           week
15             on
16             my
17    consulting.
dtype: object

In [25]:
# how can we find those strings that contain only digits?

s.loc[s.str.isdigit()]   # the boolean mask keeps only the strings that contain only digits

2     40
11    20
dtype: object
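
It helps to know what `isdigit` rejects. This sketch runs it on a made-up series of edge cases (the `candidates` values are invented for illustration).

```python
import pandas as pd

# invented edge cases: a plain integer, a negative number, a decimal,
# an empty string, and digits mixed with a letter
candidates = pd.Series(['40', '-5', '3.5', '', '12a'], dtype=object)

mask = candidates.str.isdigit()
print(mask)

# only the string made up entirely of digits survives the mask
kept = candidates.loc[mask]
print(kept)
```

The minus sign and the decimal point aren't digits, so negative numbers and floats are filtered out along with the words.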

In [29]:
# I want to sum the numbers

# to do that, I have to find which elements of s are numbers.
# I'll use .astype(np.int64) to convert, but the input must be
# a series of strings that all contain only digits, which is what
# we get from the mask index based on isdigit()

s.loc[s.str.isdigit()].astype(np.int64).sum()

60
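
A different technique, not the one used in the cell above, is `pd.to_numeric` with `errors='coerce'`. It converts what it can and turns everything else into `NaN`. The sketch below uses a shortened, made-up version of the sentence.

```python
import pandas as pd

words = pd.Series('I work 40 hours per week and another 20 hours'.split(),
                  dtype=object)

# strings that can't be parsed as numbers become NaN instead of raising an error
numbers = pd.to_numeric(words, errors='coerce')
print(numbers)

# sum() skips NaN by default, so only the parsed numbers are added
total = numbers.sum()
print(total)
```

This is more permissive than `isdigit`: strings such as `'-5'` and `'3.5'` are converted as well, and the result is a float series because of the `NaN` values.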

# Exercise: Summing ints from the user

1. Ask the user to enter a list of 2- and 3-digit integers.  These will be strings, and not all will contain only digits.  Some will contain 0 digits!
2. Use the `.str` accessor to keep only those "words" that contain only digits.  
3. Turn the series into integers.
4. Calculate descriptive statistics on the data.

In [31]:
s = Series('This sentence contains 10 numbers, some of which are 20 30 97 85 and 23 . But we also have 22 and 33'.split())
s

0         This
1     sentence
2     contains
3           10
4     numbers,
5         some
6           of
7        which
8          are
9           20
10          30
11          97
12          85
13         and
14          23
15           .
16         But
17          we
18        also
19        have
20          22
21         and
22          33
dtype: object

In [35]:
# I want: Descriptive statistics on the numbers in s
# I need: remove all non-numeric strings from s, then use astype to turn the values into integers

# int64 rather than int8, which only holds -128 to 127 and can't fit most 3-digit inputs
s.loc[s.str.isdigit()].astype(np.int64).describe()


count     8.000000
mean     40.000000
std      32.372828
min      10.000000
25%      21.500000
50%      26.500000
75%      46.000000
max      97.000000
dtype: float64
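
The exercise's steps can be sketched as a single function. The name `describe_ints` and the sample text are hypothetical. In a notebook, the text could come from `input()` instead of a hard-coded string.

```python
import numpy as np
import pandas as pd

def describe_ints(text):
    """Return descriptive statistics for the all-digit words in text."""
    words = pd.Series(text.split(), dtype=object)

    # keep only those "words" that contain only digits
    digits_only = words.loc[words.str.isdigit()]

    # int64 leaves plenty of room for 3-digit values
    return digits_only.astype(np.int64).describe()

# in a notebook: stats = describe_ints(input('Enter some numbers: '))
stats = describe_ints('10 apples, 250 pears and 999 plums')
print(stats)
```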

In [36]:
s = Series('10 20 30 40 50'.split())
s

0    10
1    20
2    30
3    40
4    50
dtype: object

In [37]:
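# careful: s contains strings, not numbers, so this is not the mean of 10..50.
# the strings were concatenated into '1020304050', and that number was divided by 5
# (newer versions of Pandas may raise a TypeError here instead)
s.mean()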

204060810.0

In [38]:
# on a series of strings, sum() concatenates rather than adds
s.sum()

'1020304050'
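
To get numeric answers, convert the strings before calculating. This sketch rebuilds the same series and reuses the `astype` approach from earlier; the `numbers` name is introduced here for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series('10 20 30 40 50'.split(), dtype=object)

# convert the strings to integers first, and then calculate
numbers = s.astype(np.int64)
print(numbers.sum())
print(numbers.mean())
```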

# Next up:

1. Textual statistics
2. Trimming strings
3. `contains` 

In [39]:
filename = '/Users/reuven/Courses/Current/data/winemag-150k-reviews.csv'

df = pd.read_csv(filename)

In [40]:
df.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude
