# Working with Text Data

## NOTE: The NYC Dataset

The `nyc.csv` file is pretty large (500+ MB) so I've compressed it as a `.zip` archive in this folder. 

You'll need to unpack the CSV file in order for the following code samples to work.

- On macOS, double-click the `nyc.csv.zip` file.
- On Windows, right-click the `nyc.csv.zip` file and click `Extract All`. Extract the file to this folder.

- The `nyc.csv` dataset is a collection of public sector employees in New York City.

- Let's see the count of null values and the count of present values.
- The `shape` attribute returns the dimensions (height, width) of the `DataFrame`.
- The `count` method returns the number of present values per column.
- The `null_count` method returns the number of absent (`null`) values per column.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/api/polars.read_csv.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.shape.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.count.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.null_count.html

## Case Conversion
- Polars groups related methods under an attribute/accessor (a namespace).
- String methods have a dedicated `str` namespace.

- The `str.to_lowercase` method converts all characters to lowercase.
- The `str.to_uppercase` method converts all characters to uppercase.
- The `str.to_titlecase` method converts each string to titlecase (the first letter of every word is capitalized and the remaining letters are lowercased).

- Converting all the string columns to titlecase seems ideal for this dataset.
- We can pass a Polars data type to the `pl.col` function to target columns with that data type.
- `with_columns` will elegantly replace old columns with new ones (if they have the same name).

### Further Reading
- https://docs.pola.rs/user-guide/expressions/strings/#case-conversion
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_lowercase.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_uppercase.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_titlecase.html

## Removing Whitespace

- For comparison, Python's built-in `strip` string method removes whitespace from the beginning and end of a string.
- Python's `lstrip` method removes whitespace from the beginning of a string.
- Python's `rstrip` method removes whitespace from the end of a string.

- The `str.strip_chars` method removes whitespace from the beginning and end of the string.
- The `str.strip_chars_start` method removes whitespace from the beginning of the string.
- The `str.strip_chars_end` method removes whitespace from the end of the string.

- We can method chain multiple string methods.
- Each method call requires the `str` attribute. The method is found _within_ the namespace.
- Let's first remove whitespace from all string columns, then titlecase their values.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/strings/#stripping-characters-from-the-ends
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars_start.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars_end.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_titlecase.html

## Removing Prefix and Suffix

- The `str.strip_prefix` method removes a consistent substring from the beginning of each string.
- Polars will remove the prefix exactly once.
- For example, we can remove `Office Of ` from the start of every string that begins with that text.

- The complementary `str.strip_suffix` method removes a substring from the end of each string.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_prefix.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_suffix.html

## Characters vs. Bytes
- In UTF-8, the encoding Polars uses for strings, every unaccented English letter occupies 1 byte in memory. A string like "pizza" occupies 5 bytes in memory.
- Not all characters occupy one byte. Examples include non-English characters and emojis.
- For example, a 🍕 appears visually as 1 character but occupies 4 bytes in memory.
- To access the emoji keyboard, use `Fn + E` on macOS or `Windows + .` on Windows.

- The `str.len_bytes` method returns the number of bytes used to store each row value.
- The `str.len_chars` method returns a count of the string's characters.
- ASCII (American Standard Code for Information Interchange) text is a set of 128 characters that includes the whole English alphabet (lowercase and uppercase), the digits 0-9, and symbols (e.g., `?`, `!`, `.`, `$`).
- If you are working with ASCII-only data, `len_bytes` returns the same result as `len_chars` and will be faster.

- The number of characters is not always equal to the number of bytes.
- An emoji appears as 1 visual character but may occupy multiple bytes in memory.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/strings/#the-string-namespace
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.len_bytes.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.len_chars.html

## String Slicing

- The `str.slice` method extracts a sequence of characters from each string.
- The first argument is the starting index, and the second argument is the number of characters to slice.
- Let's assign each borough a 3-letter code based on its first 3 letters.

- If the slice extends past the end of the string, Polars returns all the characters up to the end of the string.

- We can start in the middle of a string by specifying a non-zero index.

- The `str.head` method pulls characters from the beginning of the string.
- The `str.tail` method pulls characters at the end of the string.
- As with `str.slice`, if the requested amount extends past the string length, Polars will pull as many characters as possible.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/strings/#slicing
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.slice.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.head.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.tail.html

## Filtering Methods

- The `str.contains` method returns true if a row contains a given substring.

- The `str.contains_any` method returns true if the row value contains at least one substring from a set of possible substrings.

- The `str.starts_with` method checks if a substring is found at the beginning of a string.
- The `str.ends_with` method checks if a substring is found at the end of the string.

- Case-sensitivity can be an issue. There are no row values that start with a lowercase "fire".

- One solution is normalizing the text, which converts all characters to a consistent casing.
- For example, we can convert all characters to lowercase, then provide a lowercase substring.
- Regular expressions (search patterns for text) are an alternative solution that we'll cover in the next lesson.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/strings/#check-for-the-existence-of-a-pattern
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.contains.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.contains_any.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.starts_with.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.ends_with.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_lowercase.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_uppercase.html

## Regular Expressions
- A regular expression is a search pattern for text.
- A regular expression uses symbols to designate a match pattern (any digit, a character, a space, etc).
- Regular expressions can build up more complex search patterns (e.g., exactly 2 digits in a row followed by a space and any character between 'a' and 'e').

- Most string methods under `str` support regular expression arguments.
- Use raw strings (prefixed with `r`) for regular expressions. A raw string tells Python to treat backslashes literally rather than as the start of an escape sequence.
- For example, in a raw string, Python will treat `\n` as a backslash followed by `n` rather than as a newline character.
- Raw strings prevent conflicts between the regular expression syntax and Python's escape characters.
- The next example uses regex to identify strings that contain at least one digit.

- The next example finds the strings with at least 2 whitespace characters in a row.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.contains.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_titlecase.html

## Capture Groups
- A capture group is a designated chunk of the regular expression.
- Polars can extract a matched value through its capture group's index.

- The `str.extract` method pulls out a capture group's content.
- Let's say we want to identify the various _types_ of mechanics who work in New York City.
- We want to extract the _word_ before "mechanic" for entries in the `"Title"` column.
- We'll define that word before "mechanic" as one capture group.
- `(?i)` is a pattern that makes the search case-insensitive.
- `\w+` is a pattern for "1 or more word characters" (`[A-Za-z0-9_]`).
- Capture group 0 will be the full match (the word characters, the space, and `mechanic`).

- Index 1 will capture the first and only capture group.
- The capture group is the content corresponding to `\w+` (one or more word characters) inside the parentheses.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/strings/#regex-specification
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.extract.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_titlecase.html

## Replacing Values

- The `str.replace` method replaces the first occurrence of a string with another string.

- The `str.replace_all` method replaces _all_ occurrences of the substring.

- Both methods accept a regular expression as the pattern to replace.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/strings/#replace-a-pattern
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.replace.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.replace_all.html

## Find Longest Employee Names

- The longest names appear to be missing characters at the end.
- The data suggests that the source system caps names at 30 characters.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.to_titlecase.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.strip_chars.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.len_chars.html