# String Series Methods

## Overview

The previous chapters in this part focused on Series containing numeric values. This chapter focuses on methods for Series containing string data, which are processed quite differently from columns of numeric values. Remember that pandas has historically had no dedicated string data type. Instead, strings are stored using the **object** data type, which may contain any Python object, though most of the time object columns are composed entirely of strings. (Recent versions of pandas also offer a dedicated `string` data type, and the `str` methods in this chapter work with it as well.) Let's begin by selecting the `dept` column from the employee dataset.

In [None]:
import pandas as pd
emp = pd.read_csv('../data/employee.csv', parse_dates=['hire_date'])
emp.head()

In [None]:
dept = emp['dept']
dept.head()

### Attempt to take the mean

Several methods that worked on numeric columns will either not work with strings or provide very little value. For instance, the `mean` method raises an error when attempted on a string column.

In [None]:
dept.mean()
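
As a small self-contained illustration, the sketch below reproduces the failure on a made-up Series rather than the employee data. The exact exception type and message depend on the pandas version, so catching both `TypeError` and `ValueError` is an assumption made here for safety.

```python
import pandas as pd

# Hypothetical sample data standing in for the dept column
sample = pd.Series(['Police', 'Fire', 'Library'])

try:
    sample.mean()
    mean_failed = False
except (TypeError, ValueError) as err:
    # Strings cannot be averaged, so pandas raises an error
    mean_failed = True
    print(type(err).__name__, err)
```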

### Other methods do work
Many of the other methods covered in the previous chapters of this part work just fine with string columns. For example, `max` finds the 'largest' department, where the ordering of strings is alphabetical.

In [None]:
dept.max()

More specifically, ordering is based on Unicode code point values. Python 3 strings are sequences of Unicode characters, and Unicode associates each character with a number called its code point. You can use the built-in `ord` function to get the code point of a single character. Below, we find that the code point for 'a' is 97 and for 'z' is 122. Strings are compared character by character using these code points to determine whether one string is 'greater' than another.

In [None]:
ord('a'), ord('z')

Uppercase letters have lower code points than lowercase letters, so they always evaluate as 'less' than them.

In [None]:
ord('A'), ord('Z')

Unicode has code points for nearly every writing system, with the basic Latin letters coming first. This means that characters from every other writing system evaluate as larger. Below, the Greek letter gamma (γ) is compared against the Latin character 'z'.

In [None]:
'γ' > 'z'
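
To tie these comparisons together, here is a minimal sketch using plain Python and a made-up list of words. It shows that sorting follows the code points of the characters, not dictionary order.

```python
words = ['zebra', 'Apple', 'banana', 'γ']

# Code point of the first character of each word
first_code_points = {w: ord(w[0]) for w in words}

# Strings are compared code point by code point, so uppercase 'A' (65)
# sorts before lowercase 'b' (98) and 'z' (122), and Greek 'γ' (947) is last
ordered = sorted(words)

print(first_code_points)
print(ordered)
```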

### Missing values
Many other methods work with string columns exactly as they do with numeric columns. Below, we calculate the number of missing values. An object Series can contain any of three missing value representations: the numpy `NaN`, the pandas `NaT`, and the Python `None` are all counted as missing.

In [None]:
dept.isna().sum()
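
The following sketch, built on a hypothetical object Series instead of the employee data, checks that all three representations are counted.

```python
import numpy as np
import pandas as pd

# Hypothetical object Series mixing strings with three kinds of missing values
mixed = pd.Series(['Police', None, np.nan, pd.NaT, 'Fire'], dtype=object)

# isna flags None, NaN, and NaT alike; summing the booleans counts them
n_missing = mixed.isna().sum()
print(n_missing)
```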

## The `value_counts` method
The `value_counts` method is one of the most valuable methods for string columns. It returns the count of each unique value in the Series and sorts it from most to least common.

In [None]:
dept.value_counts()

### Notice what object is returned
The `value_counts` method itself returns a Series, with the unique values as the index and the counts as the values.
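
Because the result is a Series, all the usual Series tools apply to it. The sketch below uses a small made-up Series (not the employee data) to inspect the pieces of the returned object.

```python
import pandas as pd

# Hypothetical sample data
sample = pd.Series(['Police', 'Fire', 'Police', 'Library', 'Police', 'Fire'])
counts = sample.value_counts()

# The result is a Series: unique values in the index, counts as the values
print(type(counts).__name__)
print(counts.index.tolist())   # unique values, most common first
print(counts.tolist())         # matching counts

# Ordinary Series operations work on the result
most_common = counts.idxmax()    # label with the highest count
fire_count = counts.loc['Fire']  # count for a single label
```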

### Use `normalize=True` for relative frequency
We can use `value_counts` to return the relative frequency (proportion) of each occurrence instead of the raw count by setting the parameter `normalize` to `True`. For instance, this tells us that 32% of the employees are members of the police department.

In [None]:
dept.value_counts(normalize=True)
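
As a quick check on what `normalize=True` computes, this sketch (again on made-up data) compares it with dividing the raw counts by their total.

```python
import pandas as pd

# Hypothetical sample data
sample = pd.Series(['Police', 'Fire', 'Police', 'Library', 'Police', 'Fire'])

counts = sample.value_counts()
proportions = sample.value_counts(normalize=True)

# Dividing each count by the total number of counted values gives the same result
manual = counts / counts.sum()
print(proportions)
```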

### `value_counts` works for columns of all data types
The `value_counts` method works for columns of all data types and not just strings. It's just usually more informative for string columns. Let's use it on the salary column to see if we have common salaries.

In [None]:
emp['salary'].value_counts().head(10)

## Special methods just for object columns
pandas provides a collection of string methods through the `str` accessor, which is available only to Series that hold string data, such as those with the **object** data type. It provides a few dozen methods for string manipulation.

### Access with dot notation
To access these special string methods, append `.str` to the Series, followed by another dot and the specific string method. Think of the term 'accessor' as giving the Series access to more specialized string methods.

### Make each value uppercase
Let's call a simple string method to make each value in the `dept` Series uppercase. We will use the `upper` method of the `str` accessor.

In [None]:
dept.str.upper().head()
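
One way to think about the accessor is as a loop over the values that leaves missing values alone. The sketch below, on a hypothetical Series, compares `str.upper` with a hand-written equivalent. The list comprehension is only an illustration, not how pandas implements it.

```python
import pandas as pd

# Hypothetical object Series with one missing value
sample = pd.Series(['Police', None, 'Fire'], dtype=object)

# The accessor applies the operation to every string and keeps missing values missing
upper = sample.str.upper()
print(upper)

# Roughly equivalent plain-Python loop, with missing values handled by hand
manual = [v.upper() if isinstance(v, str) else v for v in sample]
print(manual)
```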

### `str` accessor API
Take a look at the [str accessor API][1] in the official documentation.

### Lots of methods but mostly easy to use
There is quite a lot of functionality to manipulate and probe strings in almost any way you can imagine. Let's work through some examples of the following popular string methods:

* `count`
* `contains`
* `find`
* `len`
* `split`
* `replace`
* Substrings with just the brackets

### `count` str method

The `count` method returns the number of non-overlapping occurrences of the passed string.

[1]: http://pandas.pydata.org/pandas-docs/stable/reference/series.html#api-series-str

In [None]:
dept.str.count('a').head()

In [None]:
dept.str.count('Department').head()
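
Two details of `count` are easy to miss: matches do not overlap, and the passed string is interpreted as a regular expression pattern. The sketch below illustrates both on made-up values.

```python
import pandas as pd

# Hypothetical sample data
sample = pd.Series(['banana', 'aaaa', 'Police'])

single = sample.str.count('a')    # occurrences of a single character
chunks = sample.str.count('ana')  # 'banana' holds two overlapping 'ana', but only one is counted

# The pattern is a regular expression, so special characters need escaping
dotted = pd.Series(['a.b.c'])
any_char = dotted.str.count('.')       # an unescaped '.' matches every character
literal_dot = dotted.str.count(r'\.')  # escaped, it matches only the dots

print(single.tolist(), chunks.tolist())
print(any_char.tolist(), literal_dot.tolist())
```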

### `contains` str method
The `contains` method returns a boolean for each value indicating whether the passed string is contained somewhere within it. Let's determine whether any departments contain the letter 'z'.

In [None]:
dept.str.contains('z').head()

We can then use this information to find the number of employees that work in departments that contain a 'z'.

In [None]:
dept.str.contains('z').sum()
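
When the Series has missing values, `contains` keeps them missing by default, which can get in the way of counting or filtering. The sketch below, on hypothetical data, uses the `na` parameter to treat missing values as `False`.

```python
import pandas as pd

# Hypothetical object Series with one missing value
sample = pd.Series(['Houston Police', None, 'Pizza Inspection'], dtype=object)

# By default, missing values stay missing in the result
raw = sample.str.contains('z')

# na=False treats missing values as "does not contain"
has_z = sample.str.contains('z', na=False)

n_with_z = has_z.sum()   # True counts as 1, False as 0
matches = sample[has_z]  # boolean selection of the matching rows
print(n_with_z)
print(matches)
```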

### `find` str method
The `find` method returns the lowest index (the integer location) at which the passed string begins within each string. If the string is not found, -1 is returned.

In [None]:
dept.str.find('a').head(10)
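
A small sketch on made-up values shows both outcomes and compares them with the built-in Python `str.find`, which follows the same rule.

```python
import pandas as pd

# Hypothetical sample data
sample = pd.Series(['Police', 'Fire', 'Health'])

# Integer location of the first 'i' in each string, or -1 when it is absent
locations = sample.str.find('i')
print(locations.tolist())

# The built-in str.find applied to each value gives the same answers
manual = [s.find('i') for s in sample]
print(manual)
```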

### `len` str method
The `len` string method returns the length of each string. Note that this is different from the built-in `len` function, which returns the number of elements in a Series.

In [None]:
dept.str.len().head()
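
The difference between the two is sketched below on a hypothetical Series that includes a missing value.

```python
import pandas as pd

# Hypothetical object Series with one missing value
sample = pd.Series(['Police', 'Fire', None], dtype=object)

lengths = sample.str.len()  # one length per string; missing values stay missing
n_elements = len(sample)    # a single number: how many elements the Series has

print(lengths)
print(n_elements)
```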

### `split` str method
The `split` method splits each string into a list of separate strings based on a given separator. By default, it splits on whitespace, treating consecutive whitespace characters as a single separator. The following splits on whitespace and returns a Series of lists.

In [None]:
dept.str.split().head()

Set the `expand` parameter to `True` to return a DataFrame:

In [None]:
dept.str.split(expand=True).head()

Here, we split on the string 'Department'. Note that the string used for splitting is removed and not contained in the result.

In [None]:
dept.str.split('Department', expand=True).head()
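
The sketch below collects the main `split` variations on made-up values. One value deliberately contains a double space to show how the default differs from splitting on a single space, and the `n` parameter limits the number of splits.

```python
import pandas as pd

# Hypothetical sample data; the second value contains a double space
sample = pd.Series(['Houston Police Department', 'Fire  Department'], dtype=object)

default_split = sample.str.split()     # splits on runs of whitespace
space_split = sample.str.split(' ')    # splits on every single space
first_only = sample.str.split(n=1)     # at most one split per string
table = sample.str.split(expand=True)  # one column per piece, padded with missing values

print(default_split.tolist())
print(space_split.tolist())
print(first_only.tolist())
print(table)
```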

### `replace` str method
Pass two string arguments to `replace`: the string you want to replace and its replacement value.

In [None]:
dept.str.replace('Houston', 'H-Town').head()
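
The `replace` method also has a `regex` parameter whose default changed in pandas 2.0 (from `True` to `False`), so passing it explicitly avoids surprises. A minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical sample data
sample = pd.Series(['Houston Police Department (HPD)', 'Houston   Fire'])

# regex=False: the pattern is matched literally, so parentheses need no escaping
literal = sample.str.replace('(HPD)', '[HPD]', regex=False)

# regex=True: the pattern is a regular expression (here, runs of whitespace)
squeezed = sample.str.replace(r'\s+', ' ', regex=True)

print(literal.tolist())
print(squeezed.tolist())
```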

### Selecting substrings with the brackets
Selecting a single character of a regular Python string is accomplished by placing the integer location of the desired character in brackets. Selecting a substring is just as simple and uses slice notation in the brackets.

pandas allows us to perform the exact same operation with its `str` accessor to select one or more characters of each string. We simply append the brackets to `str` and use the same selection process as we do with Python strings. Let's see some examples. Select the character with integer location 5 for each value in the Series:

In [None]:
dept.str[5].head()

Select the last 5 characters of each string in the Series:

In [None]:
dept.str[-5:].head()

Select the characters at integer locations 5 through 14 of each string (the stop value of the slice, 15, is excluded):

In [None]:
dept.str[5:15].head()
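
To make the parallel with plain Python strings concrete, the sketch below applies the same selections to a hypothetical two-element Series. The short second value shows what happens when a string has fewer characters than the requested location.

```python
import pandas as pd

text = 'Houston Police Department'
# Hypothetical sample data: one long string and one short string
sample = pd.Series([text, 'Fire'], dtype=object)

single = sample.str[5]          # one character per string, missing if the string is too short
tail = sample.str[-5:]          # last five characters, or the whole string if shorter
middle = sample.str[5:15]       # locations 5 through 14; the stop value is excluded
same = sample.str.slice(5, 15)  # method form of the same slice

print(text[5], text[-5:], repr(text[5:15]))  # plain Python equivalents
print(single.tolist(), tail.tolist(), middle.tolist())
```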

### Many more string-only methods
There are many more string-only methods that were not covered in this chapter and I would encourage you to explore them on your own. Many of them overlap with the built-in Python string methods.

### Regular expressions
Regular expressions help match patterns within text. Many of the string methods presented above accept regular expressions as inputs. They are an important part of doing data analysis and will be covered thoroughly in part 5 of this book.
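
As a brief preview only, the sketch below passes a few simple patterns to `contains` and `count` on made-up values. The patterns are illustrative choices, not examples taken from the later part of the book.

```python
import pandas as pd

# Hypothetical sample data
sample = pd.Series(['Houston Police Department',
                    'Health & Human Services',
                    'Library'])

starts_with_h = sample.str.contains(r'^H')       # '^' anchors the match at the start
ends_dept = sample.str.contains(r'Department$')  # '$' anchors the match at the end
n_vowels = sample.str.count(r'[aeiou]')          # character class: any lowercase vowel

print(starts_with_h.tolist())
print(ends_dept.tolist())
print(n_vowels.tolist())
```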

## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">Read in the movie dataset and set the title as the index. Assign the actor 1 column to its own Series variable. Make sure to drop missing values from this Series before assigning it. Which actor 1 has appeared in the most movies? Can you write an expression that returns this actor's name as a string?</span>

### Exercise 2
<span  style="color:green; font-size:16px">What percent of movies have the top 100 most frequent actor 1's appeared in?</span>

### Exercise 3
<span  style="color:green; font-size:16px">How many actor 1's have appeared in exactly one movie?</span>

### Exercise 4
<span  style="color:green; font-size:16px">How many actor 1's have more than 3 e's in their name? Output a unique array of just these actor names so we can manually verify them.</span>

### Exercise 5
<span  style="color:green; font-size:16px">Get a unique list of all actors that have the name 'Johnson' as part of their name.</span>

### Exercise 6
<span  style="color:green; font-size:16px">How many actor 1 names end in 'x'?</span>

### Exercise 7
<span  style="color:green; font-size:16px">The pandas string methods overlap with the built-in Python string methods. Find all the public method names that are common to both. Then find the public methods that are unique to each.</span>