**SA433 &#x25aa; Data Wrangling and Visualization &#x25aa; Fall 2024**

# Lesson 20. Strings and Datetimes in Pandas

## Overview

- Strings and datetimes can be awkward to work with, especially compared to numeric values


- In this lesson, we'll learn the basics of working with strings and datetimes in Pandas


- As we'll see, there are *many* methods related to string and datetime processing


- Consider this lesson just an introduction!

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## The UFO dataset 

* Let's start by importing Pandas:

In [None]:
import pandas as pd

- For this lesson, we'll use [this dataset on UFO sightings](http://bit.ly/uforeports), based on information from the [National UFO Reporting Center](http://www.nuforc.org/webreports.html)


- This dataset is also included in `data/ufo.csv`, in the same folder as this notebook


- Let's read the data into a DataFrame:

In [None]:
df = pd.read_csv('data/ufo.csv')

* Let's peek at the first 5 rows:

In [None]:
df.head()

- Let's also get some more information about the data types (dtypes) in this DataFrame:

In [None]:
df.info()

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Working with strings

- The Series object is equipped with a set of **string methods** that perform string processing operations on the entire Series at once


- These methods also conveniently skip missing values automatically (a missing value simply stays missing), unlike Python's built-in string methods, which raise an error when applied to a missing value


- You can access these methods with the `.str` attribute


- For example, `.str.lower()` converts strings to lowercase, like this:
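A minimal sketch of `.str.lower()`, run on a small made-up stand-in for the UFO data so that it works on its own. The name `sample` and the column names `City` and `State` are assumptions for illustration; with the lesson's data, you would use `df` instead.

```python
import pandas as pd

# Made-up stand-in for the UFO data (hypothetical values and column names)
sample = pd.DataFrame({
    'City': ['Ithaca', 'Willingboro', None],
    'State': ['NY', 'NJ', 'CO'],
})

# .str.lower() lowercases every string in the Series at once;
# the missing value in the last row is skipped and stays missing
city_lower = sample['City'].str.lower()
print(city_lower)
```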

- Note that you can use string methods in `.query()`, like this:
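A minimal sketch of a string method inside `.query()`, again on a small made-up stand-in (`sample` and its `City` and `State` columns are assumptions for illustration). Passing `engine='python'` is a cautious choice on our part, not a requirement stated in the lesson.

```python
import pandas as pd

# Made-up stand-in with inconsistent capitalization (hypothetical values)
sample = pd.DataFrame({
    'City': ['Ithaca', 'ITHACA', 'Holyoke'],
    'State': ['NY', 'NY', 'CO'],
})

# Lowercase the city names inside the query string, then compare.
# Both spellings of Ithaca match, regardless of capitalization.
matches = sample.query('City.str.lower() == "ithaca"', engine='python')
print(matches)
```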

- Note that this kind of query might help if your data contains values with inconsistent capitalization

- You may recall that we've already seen a few other examples of Series string methods, like `.str.split()` and `.str.cat()`

- Here are some string methods that you might find particularly useful:

| `.str` method | Description |
| :- | :- |
| `.cat()` | Concatenate strings |
| `.split()` | Split strings on delimiter |
| `.contains()` | Return boolean array indicating whether each string contains pattern/regex |
| `.replace()` | Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence |
| `.pad()` | Add whitespace to left, right, or both sides of strings |
| `.slice()` | Slice each string in the Series |
| `.slice_replace()` | Replace slice in each string with passed value |
| `.count()` | Count occurrences of pattern |
| `.startswith()` | Test if the start of each string element matches a pattern |
| `.endswith()` | Test if the end of each string element matches a pattern |
| `.len()` | Compute string lengths |
| `.strip()` | Strip whitespace from left and right sides |
| `.partition()` | Split the string at the first occurrence of a substring |
| `.lower()` | Convert strings to lowercase |
| `.upper()` | Convert strings to uppercase |
| `.title()` | Convert strings to title case |
| `.find()` | Find substring within string |

- Note that this list is incomplete, and doesn't tell you how these methods work... (e.g. what arguments do they take?)


- For more information, [here is the section on Series string methods](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#api-series-str) in the Pandas documentation

- Some parts of the documentation refer to a **regular expression** or **regex**
    - These are specially constructed sequences of characters that define a search
    - Regexes can be very useful... and somewhat complicated to use
    - There are *many* resources on the internet for learning about regexes
    - For example, [here is a nice tool](https://regexr.com/) to learn and test regular expressions
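To get a feel for how a few of the string methods in the table above behave, here is a minimal sketch on a made-up Series of city names (the Series `cities` and its values are hypothetical). The last line uses a very simple regex, `City$`, which matches strings that end in `City`.

```python
import pandas as pd

# Made-up city names, including stray whitespace and a missing value
cities = pd.Series(['  Ithaca ', 'New York Worlds Fair', 'Valley City', None])

# Remove leading and trailing whitespace
stripped = cities.str.strip()

# Number of characters in each string
lengths = stripped.str.len()

# Does each string start with 'New'?
starts_new = stripped.str.startswith('New')

# Does each string end with 'City'? ($ anchors the regex at the end)
ends_with_city = stripped.str.contains(r'City$', regex=True)

print(pd.DataFrame({
    'stripped': stripped,
    'length': lengths,
    'starts_new': starts_new,
    'ends_with_city': ends_with_city,
}))
```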

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Working with datetimes

- Note that our UFO dataset contains a column `Time` with the date and time (*datetime*) of the sighting:

In [None]:
df.head()

- Above, when we used `df.info()`, we saw that the column `Time` consists of strings

- Keeping the date and time information as a string is inconvenient &mdash; for example:
    - We can't easily group by datetime components, like month or hour
    - We can't easily perform datetime arithmetic (e.g. add a day to each datetime)

- One way to resolve this would be to use string methods


- For example, we can use `.str.split()` to tease out the different components of the column `Time`, like this:
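A minimal sketch of this approach, assuming the values in `Time` look like `month/day/year hour:minute`. It runs on a small stand-in DataFrame `sample` with illustrative values; with the lesson's data, you would start from `df` instead. Note that every component produced this way is still a string.

```python
import pandas as pd

# Stand-in for the Time column (illustrative values)
sample = pd.DataFrame({
    'Time': ['6/1/1930 22:00', '6/30/1930 20:00', '2/15/1931 14:00']
})

parts = (
    sample
    .assign(
        # Split on whitespace: date on the left, time of day on the right
        MMDDYYYY=lambda x: x['Time'].str.split(expand=True)[0],
        HHMM=lambda x: x['Time'].str.split(expand=True)[1],
        # Split the date on '/' to get month, day, and year
        month=lambda x: x['MMDDYYYY'].str.split(pat='/', expand=True)[0],
        day=lambda x: x['MMDDYYYY'].str.split(pat='/', expand=True)[1],
        year=lambda x: x['MMDDYYYY'].str.split(pat='/', expand=True)[2],
        # Split the time of day on ':' to get hour and minute
        hour=lambda x: x['HHMM'].str.split(pat=':', expand=True)[0],
        minute=lambda x: x['HHMM'].str.split(pat=':', expand=True)[1],
    )
)
print(parts)
```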

- An arguably better alternative would be to use `pd.to_datetime()` to convert a column with well-formatted strings to the **datetime64** dtype


- Pandas has a variety of built-in tools to manipulate dates and times represented with the datetime64 dtype
    

- Back to our example: we can convert the column `Time` to datetime64 with `pd.to_datetime()` by specifying the format of the dates and times in `Time`:
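A minimal sketch of the conversion, assuming the values in `Time` look like `month/day/year hour:minute`. It runs on a small stand-in DataFrame with illustrative values; the names `sample` and `new_sample` are hypothetical.

```python
import pandas as pd

# Stand-in for the Time column (illustrative values)
sample = pd.DataFrame({
    'Time': ['6/1/1930 22:00', '6/30/1930 20:00', '2/15/1931 14:00']
})

# %m = month, %d = day, %Y = 4-digit year, %H = hour (24-hour clock), %M = minute
new_sample = sample.assign(
    datetime=lambda x: pd.to_datetime(x['Time'], format='%m/%d/%Y %H:%M')
)
print(new_sample.dtypes)

# With the lesson's data, the same call on df would produce new_df:
# new_df = df.assign(
#     datetime=lambda x: pd.to_datetime(x['Time'], format='%m/%d/%Y %H:%M')
# )
```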

- We can use `.info()` to confirm that the `datetime` column does indeed contain datetime64 values:

In [None]:
new_df.info()

- The `format=...` keyword argument takes a string with standard C datetime format codes as placeholders for datetime components such as year, month, day, etc.
    - [Here is a list of standard C datetime format codes](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes) from the Python documentation

- The Series object has a variety of attributes and methods to work with datetime dtypes


- We access these attributes and methods with the `.dt` attribute


- For example, we can get the month of each date in `datetime` like this:
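A minimal sketch of the `.dt` accessor, run on a small stand-in DataFrame whose `datetime` column already holds datetime64 values (the name `sample` and its values are made up for illustration). With the lesson's data, you would use `new_df` instead.

```python
import pandas as pd

# Stand-in with a datetime64 column (illustrative values)
sample = pd.DataFrame({
    'datetime': pd.to_datetime(
        ['1930-06-01 22:00', '1930-06-30 20:00', '1931-02-15 14:00']
    )
})

# .dt.month gives the month of each datetime as an integer (January=1)
months = sample['datetime'].dt.month
print(months)
```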

- Here are some useful datetime attributes:

| `.dt` attribute | Description |
| :- | :- |
| `.year` | The year of the datetime |
| `.month` | The month as January=1, ..., December=12 |
| `.day` | The day of the datetime |
| `.hour` | The hours of the datetime |
| `.minute` | The minutes of the datetime |
| `.second` | The seconds of the datetime |
| `.microsecond` | The microseconds of the datetime |
| `.nanosecond` | The nanoseconds of the datetime |
| `.dayofweek` | The day of the week with Monday=0, Sunday=6 |

- For more information, [here is the section on datetime attributes and methods](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#api-series-dt) in the Pandas documentation 


- In addition, [here is the documentation for `pd.to_datetime()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html)


- As you can see, there are many ways to convert datetime information into the datetime dtype


- For example, we could have taken the year-month-day-hour-minute information we parsed manually with string methods and passed those to `pd.to_datetime()` like this:

In [None]:
(
    df
    .assign(
        MMDDYYYY=lambda x: x['Time'].str.split(expand=True)[0],
        HHMM=lambda x: x['Time'].str.split(expand=True)[1],
        month=lambda x: x['MMDDYYYY'].str.split(pat='/', expand=True)[0],
        day=lambda x: x['MMDDYYYY'].str.split(pat='/', expand=True)[1],
        year=lambda x: x['MMDDYYYY'].str.split(pat='/', expand=True)[2],
        hour=lambda x: x['HHMM'].str.split(pat=':', expand=True)[0],
        minute=lambda x: x['HHMM'].str.split(pat=':', expand=True)[1],
        
        # We can pass a DataFrame with columns corresponding to 
        # year, month, day, hour, and minute into pd.to_datetime()
        datetime=lambda x: pd.to_datetime(x[['year', 'month', 'day', 'hour', 'minute']])
    )
)

- We can also perform arithmetic on datetime64 dtypes with the **timedelta64** dtype


- Timedeltas are differences between datetimes, expressed in units such as days, hours, minutes, etc.


- We can convert a scalar or Series to timedelta64 with `pd.to_timedelta()`


- For example, going back to `new_df` we created earlier, we can add 1 day to each entry in `datetime` like this:
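A minimal sketch of datetime arithmetic, run on a small stand-in DataFrame with a datetime64 column named `datetime`. The names `sample`, `one_day`, and `next_day` are hypothetical; with the lesson's data, you would start from `new_df` instead.

```python
import pandas as pd

# Stand-in with a datetime64 column (illustrative values)
sample = pd.DataFrame({
    'datetime': pd.to_datetime(
        ['1930-06-01 22:00', '1930-06-30 20:00', '1931-02-15 14:00']
    )
})

# A timedelta of exactly 1 day
one_day = pd.to_timedelta(1, unit='D')

# Adding a timedelta to a datetime64 Series shifts every entry
shifted = sample.assign(next_day=lambda x: x['datetime'] + one_day)
print(shifted)
```

Note that adding one day to June 30 rolls over to July 1, which would be tedious to handle correctly with string methods alone.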

- [Here is the documentation for `pd.to_timedelta()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_timedelta.html)


- In particular, that page lists the valid values of the `unit=...` keyword argument

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Problems

For the problems below, use the UFO dataset in the DataFrame `df` we created above.

### Problem 1

Add a new column to `df` containing the city for each observation, converted to all uppercase.

### Problem 2 

Add a new column to `df` containing the city and state of each observation, separated by a comma and a space, like this: `Ithaca, NY` 

### Problem 3

We can obtain a frequency table of the colors reported as follows:

In [None]:
(
    df
    .groupby(['Colors Reported'])
    .agg(
        count=('Colors Reported', 'count')
    )
    .sort_values('count', ascending=False)
    .reset_index()
)

Note that the values in the column `Colors Reported` sometimes contain multiple color names. 

Drop all observations with missing values. Then filter the remaining observations for those that contain `ORANGE` in the column `Colors Reported`.

*Hint.* Remember to use `` ` ` `` to enclose a column name with spaces in the string passed to `.query()`.

### Problem 4

Add a column to `df` containing the day of the week corresponding to the observation date. 

*Hint.* You may find it useful to add a column containing the observation times as datetime64 values.

### Problem 5

Compute the time between observations when ordered chronologically. 

*Hint.* Convert the observation times to datetime64 values, then sort the observations based on these values. Use the `.diff()` "same size" Series method from Lesson 15 to compute the difference between consecutive rows. [Here's the documentation for `.diff()`.](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.diff.html)

*Food for thought.* What's the problem with sorting the rows directly based on the values in the `Time` column?

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Notes and sources

- From the [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html):
    - [Working with text data](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html)
    - [Time series / date functionality](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html)
    - [Time deltas](https://pandas.pydata.org/pandas-docs/stable/user_guide/timedeltas.html)