# String Series Methods

The previous chapters in this part focused mainly on Series that contained numeric values. In this chapter, we focus on methods that work for Series containing string data. Columns of strings are processed quite differently from columns of numeric values. Remember that pandas has traditionally stored strings using the **object** data type, which may contain any Python object (newer versions of pandas also offer a dedicated string data type). The majority of the time, object columns are composed entirely of strings. Let's begin by reading in the employee dataset and selecting the `dept` column as a Series.

In [None]:
import pandas as pd
emp = pd.read_csv('../data/employee.csv')
dept = emp['dept']
dept.head(3)

### Attempt to take the mean

Several methods that worked on numeric columns will either not work with strings or provide little value. For instance, the `mean` method raises an error when attempted on a string column.

In [None]:
dept.mean()

### Other methods do work
Many of the other methods covered in the previous chapters of this part do work with string columns. For instance, `max` returns the department that comes last in alphabetical (lexicographic) order.

In [None]:
dept.max()

### Missing values
Many other methods work with string columns the same way they do with numeric columns. Below, we calculate the number of missing values. An object Series can contain any of three missing value representations: `NaN`, `NaT`, and the Python `None` are all counted as missing.

In [None]:
dept.isna().sum()
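
To see the three missing value representations side by side, here is a minimal sketch on a small made-up Series (the values below are hypothetical and not taken from the employee dataset). The data type is set to object explicitly to match the column described above.

```python
import numpy as np
import pandas as pd

# Hypothetical values; dtype=object matches the data type discussed above
sample = pd.Series(['Police', None, 'Fire', np.nan, pd.NaT], dtype=object)

# isna marks None, NaN, and NaT as missing
missing_mask = sample.isna()
num_missing = int(missing_mask.sum())

print(missing_mask.tolist())
print(num_missing)
```

Summing the boolean mask works because each `True` counts as 1.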

## The `value_counts` method
The `value_counts` method is one of the most valuable methods for string columns. It returns the count of each unique value in the Series and sorts it from most to least common.

In [None]:
dept.value_counts()

### Notice what object is returned

The `value_counts` method returns a Series with the unique values as the index and the count as the new values.

### Use `normalize=True` for relative frequency

We can use `value_counts` to return the relative frequency (proportion) of each occurrence instead of the raw count by setting the parameter `normalize` to `True`. For instance, this tells us that 39% of the employees are members of the police department.

In [None]:
dept.value_counts(normalize=True)
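
The following sketch uses a small made-up Series (hypothetical department names, not the real data) to show that the result of `value_counts` is itself a Series whose index holds the unique values, and that `normalize=True` turns the counts into proportions that sum to 1.

```python
import pandas as pd

# Hypothetical department values for illustration only
sample = pd.Series(['Police', 'Fire', 'Police', 'Parks', 'Police', 'Fire'])

counts = sample.value_counts()
shares = sample.value_counts(normalize=True)

# The unique values become the index and the counts become the values
print(counts.index.tolist())
print(counts.tolist())

# Because the result is a Series, a single count can be selected by label
print(counts['Police'])
print(shares['Police'])
```

Since the result is sorted from most to least common, the first index label is the most frequent value.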

### `value_counts` works for columns of all data types
The `value_counts` method works for columns of all data types and not just strings. It's just usually more informative for string columns. Let's use it on the salary column to see if we have common salaries.

In [None]:
emp['salary'].value_counts().head(3)

## Special methods just for object columns

pandas provides a collection of methods only available to object columns. These methods are not directly available using dot notation from the Series variable name, so you will not be able to find them as you normally do.

To access these special string-only methods, append `.str` to the Series variable name, followed by another dot and then the specific string method. pandas refers to this as the `str` accessor. Think of the term 'accessor' as giving the Series access to more specialized string methods. [Visit the official documentation][1] to take a look at the several dozen string-only methods available with the `str` accessor.

Let's use the title column for these string-only methods.

In [None]:
title = emp['title']
title.head()

### Make each value lowercase

Let's begin by calling a simple string method to make each value in the `title` Series lowercase. We will use the `lower` method of the `str` accessor.

[1]: https://pandas.pydata.org/pandas-docs/stable/reference/series.html#string-handling

In [None]:
title.str.lower().head()
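
As a small sketch of how the accessor behaves, the example below uses made-up titles and a made-up numeric Series (both hypothetical). It shows `str.lower` working on strings, and the `str` accessor raising an `AttributeError` when the Series holds numbers.

```python
import pandas as pd

# Hypothetical titles for illustration only
titles = pd.Series(['POLICE OFFICER', 'SENIOR ENGINEER'])
lowered = titles.str.lower()
print(lowered.tolist())

# The str accessor is not available on a numeric Series
numbers = pd.Series([1, 2, 3])
try:
    numbers.str.lower()
    accessor_failed = False
except AttributeError:
    accessor_failed = True
print(accessor_failed)
```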

### Lots of methods but mostly easy to use

There is quite a lot of functionality to manipulate and probe strings in almost any way you can imagine. We will not cover every single method possible, but instead, walk through examples of some of the more common ones such as the ones that follow here:

* `count` - Returns the number of non-overlapping occurrences of the passed string.
* `contains` - Checks to see whether each string contains the given string. Returns a boolean Series.
* `len` - Returns the number of characters in each string.
* `split` - Splits the string into multiple strings by a given separator.
* `replace` - Replaces parts of a string with other characters.

## The `count` str method

The `count` method returns the number of non-overlapping occurrences of the passed string. Here, we count the number of uppercase 'O' characters that appear in each string.

In [None]:
title.str.count('O').head()

You are not limited to single characters. Here we count the number of times 'ER' appears in each string.

In [None]:
title.str.count('ER').head()
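
To make 'non-overlapping' concrete, here is a minimal sketch on a few made-up strings (hypothetical values). The string 'OOOO' contains 'OO' starting at three different positions, but only two of those occurrences do not overlap.

```python
import pandas as pd

# Hypothetical strings for illustration only
sample = pd.Series(['OOOO', 'ENGINEER', 'OFFICER'])

# 'OOOO' holds two non-overlapping copies of 'OO'
double_o = sample.str.count('OO')
er = sample.str.count('ER')

print(double_o.tolist())
print(er.tolist())
```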

## The `contains` str method

The `contains` method returns a boolean for each value indicating whether the passed string is contained somewhere within it. Let's determine whether any titles contain the letter 'Z'.

In [None]:
title.str.contains('Z').head(3)

We can then sum this boolean Series to find the number of employees that have a title containing a 'Z'.

In [None]:
title.str.contains('Z').sum()

Let's find out which employees have the word 'POLICE' somewhere in their title.

In [None]:
title.str.contains('POLICE').head()

Summing this Series reveals the number of employees that have the word 'POLICE' somewhere in their title.

In [None]:
title.str.contains('POLICE').sum()
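
Two optional parameters of `contains` are worth a quick sketch, shown here on made-up titles (hypothetical values). By default the passed string is treated as a regular expression, so `regex=False` asks for a plain substring match, and `na=False` reports missing values as `False` so the result stays a clean boolean Series.

```python
import pandas as pd

# Hypothetical titles, including one missing value
sample = pd.Series(
    ['POLICE OFFICER', None, 'SENIOR POLICE OFFICER', 'ENGINEER'],
    dtype=object,
)

# Plain substring match; missing values are treated as not matching
mask = sample.str.contains('POLICE', regex=False, na=False)
num_police = int(mask.sum())

print(mask.tolist())
print(num_police)
```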

## The `len` str method
The `len` string method returns the length of every string. Note that this is unrelated to the built-in `len` function, which returns the number of elements in a Series.

In [None]:
title.str.len().head()
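
The difference between the two is easy to see on a small made-up Series (hypothetical titles): the built-in `len` returns a single number, while `str.len` returns one number per string.

```python
import pandas as pd

# Hypothetical titles for illustration only
titles = pd.Series(['ENGINEER', 'POLICE OFFICER', 'CLERK'])

# Built-in len: number of elements in the Series
num_elements = len(titles)

# str.len: number of characters in each string
char_counts = titles.str.len()

print(num_elements)
print(char_counts.tolist())
```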

## The `split` str method

The `split` method splits each string into multiple separate strings based on a given separator. By default, it splits on whitespace. The following splits on whitespace and returns a Series of lists.

In [None]:
title.str.split().head(3)

Set the `expand` parameter to `True` to return a DataFrame:

In [None]:
title.str.split(expand=True).head(3)

Here, we split on the string 'AN'. Note that the string used for splitting is removed and not contained in the result.

In [None]:
title.str.split('AN', expand=True).head(3)
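
Calling `split` with no separator and calling it with a single space are not quite the same thing. The sketch below uses a made-up title containing two consecutive spaces (a hypothetical value) to show the difference, along with the shape of the DataFrame returned by `expand=True`.

```python
import pandas as pd

# Hypothetical title with two spaces between the first two words
sample = pd.Series(['SENIOR  POLICE OFFICER'], dtype=object)

# No separator: runs of whitespace are treated as one separator
by_whitespace = sample.str.split()

# Explicit single space: the double space produces an empty string
by_space = sample.str.split(' ')

# expand=True places each piece in its own column
expanded = sample.str.split(expand=True)

print(by_whitespace.iloc[0])
print(by_space.iloc[0])
print(expanded.shape)
```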

## The `replace` str method

The `replace` string method allows you to replace one section of the string (a substring) with some other string. You must pass two string arguments to `replace` - the string you want to replace and its replacement value. Here, we replace 'SENIOR' with 'SR.'.

In [None]:
title.str.replace('SENIOR', 'SR.').head(3)
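
The `replace` string method also has a `regex` parameter, and its default has changed between pandas versions. Setting it explicitly avoids surprises when the text contains characters such as '.' that have a special meaning in regular expressions. A minimal sketch on made-up titles (hypothetical values):

```python
import pandas as pd

# Hypothetical titles for illustration only
titles = pd.Series(['SENIOR POLICE OFFICER', 'SR. ENGINEER'])

# regex=False treats both arguments as plain text
shortened = titles.str.replace('SENIOR', 'SR.', regex=False)
expanded = titles.str.replace('SR.', 'SENIOR', regex=False)

print(shortened.tolist())
print(expanded.tolist())
```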

## Selecting substrings with the brackets

Selecting one or more characters of a regular Python string is simple and accomplished by using either an integer or slice notation within the brackets. Let's review this concept now.

In [None]:
some_string = 'The Astros will win the world series in 2019'

Select the character at integer location 5.

In [None]:
some_string[5]

Select the characters from integer location 24 up to, but not including, integer location 36 with slice notation.

In [None]:
some_string[24:36]

You can use the same square brackets appended to the `str` accessor to select one or more characters from every string in a Series. Let's begin by selecting the character at position 10.

In [None]:
title.str[10].head(3)

In the following example, we use slice notation to select the last five characters of each string.

In [None]:
title.str[-5:].head(3)

Slice notation is used again to select the characters from integer location 5 up to, but not including, integer location 15.

In [None]:
title.str[5:15].head()
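
Strings in a Series usually differ in length, so it helps to know what happens when a position does not exist. In the sketch below (made-up titles, hypothetical values), selecting a single position beyond the end of a string gives a missing value, while a slice beyond the end gives an empty string.

```python
import pandas as pd

# Hypothetical titles of different lengths
titles = pd.Series(['CLERK', 'POLICE OFFICER'], dtype=object)

# 'CLERK' has no character at position 10, so the result is missing
tenth = titles.str[10]

# Slices never fail; they return whatever characters exist
last_five = titles.str[-5:]
middle = titles.str[5:15]

print(tenth.tolist())
print(last_five.tolist())
print(middle.tolist())
```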

## Many more string-only methods

There are many more string-only methods that were not covered in this chapter and I would encourage you to explore them on your own. Many of them overlap with the built-in Python string methods.

## Regular expressions

Regular expressions help match patterns within text. Many of the string methods presented above accept regular expressions as inputs for more advanced string maneuvering. They are an important part of doing data analysis and are covered thoroughly in their own part of this book.

## Exercises

Execute the cell below to read in the movie dataset and assign the `actor1` column to a variable as a Series. All missing values have been dropped from this Series. Use this Series for the exercises below.

In [None]:
movie = pd.read_csv('../data/movie.csv', index_col='title')
actor1 = movie['actor1'].dropna()
actor1.head(3)

### Exercise 1

<span  style="color:green; font-size:16px">Which actor 1 has appeared in the most movies? Can you write an expression that returns this actor's name as a string?</span>

### Exercise 2
<span  style="color:green; font-size:16px">What percent of movies have the top 100 most frequent actor 1's appeared in?</span>

### Exercise 3
<span  style="color:green; font-size:16px">How many actor 1's have appeared in exactly one movie?</span>

### Exercise 4
<span  style="color:green; font-size:16px">How many actor 1's have more than 3 e's in their name? Output a unique array of just these actor names so we can manually verify them.</span>

### Exercise 5
<span  style="color:green; font-size:16px">Get a unique list of all actors that have the name 'Johnson' as part of their name.</span>

### Exercise 6
<span  style="color:green; font-size:16px">How many unique actor 1 names end in 'x'?</span>

### Exercise 7
<span  style="color:green; font-size:16px">The pandas string methods overlap with the built-in Python string methods. Find all the public method names that are common to both. Then find the public methods that are unique to each.</span>