Objectives
- Use string methods to manipulate textual data columns
- Use string methods to select rows of interest (boolean indexing)
- Use `replace` as a general function for mappings

Content to cover
- str.upper/…
- df[df[].str.contains()]
- replace


In [139]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [140]:
titanic = pd.read_csv("../data/titanic.csv")
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Manipulate data columns with text

> Make all name characters lower case

In [141]:
titanic["Name"].str.lower()

0                                braund, mr. owen harris
1      cumings, mrs. john bradley (florence briggs th...
2                                 heikkinen, miss. laina
3           futrelle, mrs. jacques heath (lily may peel)
4                               allen, mr. william henry
                             ...                        
886                                montvila, rev. juozas
887                         graham, miss. margaret edith
888             johnston, miss. catherine helen "carrie"
889                                behr, mr. karl howell
890                                  dooley, mr. patrick
Name: Name, Length: 891, dtype: object

Similar to datetime objects in the [time series tutorial](9_timeseries.ipynb), which have a `dt` accessor, a number of specialized string methods are available through the `str` accessor. These methods generally have the same names as the equivalent built-in string methods for single elements, but are applied element-wise (remember [element-wise calculations from tutorial 5?](5_add_columns.ipynb)) to each of the values in the column.
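
To make the parallel with built-in string methods concrete, here is a minimal sketch. The two-row `names` Series is a hypothetical stand-in for the `Name` column, not the real data set.

```python
import pandas as pd

# Hypothetical mini-sample in the same "Surname, Title. First names" layout
names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

# Built-in string method: works on a single Python string
single = names[0].upper()

# pandas string method: same name, reached through `.str`,
# applied to every element of the Series
upper_all = names.str.upper()

print(single)
print(upper_all.tolist())
```

The result of the `str` method is again a Series with the same index, so it can be assigned directly to a new column.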

> Create a new column 'Surname' that contains the surname of the Passengers by extracting the part before the comma.

In [142]:
titanic["Name"].str.split(",")

0                             [Braund,  Mr. Owen Harris]
1      [Cumings,  Mrs. John Bradley (Florence Briggs ...
2                              [Heikkinen,  Miss. Laina]
3        [Futrelle,  Mrs. Jacques Heath (Lily May Peel)]
4                            [Allen,  Mr. William Henry]
                             ...                        
886                             [Montvila,  Rev. Juozas]
887                      [Graham,  Miss. Margaret Edith]
888          [Johnston,  Miss. Catherine Helen "Carrie"]
889                             [Behr,  Mr. Karl Howell]
890                               [Dooley,  Mr. Patrick]
Name: Name, Length: 891, dtype: object

Using the `split` method, each of the values is returned as a list of 2 elements. The first element is the part before the comma and the second element the part after the comma. 

In [143]:
titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)
titanic["Surname"]

0         Braund
1        Cumings
2      Heikkinen
3       Futrelle
4          Allen
         ...    
886     Montvila
887       Graham
888     Johnston
889         Behr
890       Dooley
Name: Surname, Length: 891, dtype: object

As we are only interested in the first part representing the surname (element 0), we can again use the `str` accessor and apply `get` to extract the relevant part. Indeed, these string methods can be chained to combine multiple operations at once!
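
As a small sketch of this chaining, the example below also extracts the part after the comma and removes its leading space with `str.strip`. The `names` Series and the variable names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical mini-sample mirroring the layout of the Name column
names = pd.Series(["Braund, Mr. Owen Harris", "Allen, Mr. William Henry"])

# Step 1: split on the comma; each value becomes a list
parts = names.str.split(",")

# Step 2: pick an element from each list with `get`
surname = parts.str.get(0)

# Chained: take element 1, then strip the space that followed the comma
rest = names.str.split(",").str.get(1).str.strip()

print(surname.tolist())
print(rest.tolist())
```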

__To user guide:__ More information on extracting parts of strings is available in :ref:`text.split`.

> Extract the passenger data about the Countess on board the Titanic.

In [144]:
titanic["Name"].str.contains("Countess")

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: Name, Length: 891, dtype: bool

In [145]:
titanic[titanic["Name"].str.contains("Countess")]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname
759,760,1,1,"Rothes, the Countess. of (Lucy Noel Martha Dye...",female,33.0,0,0,110152,86.5,B77,S,Rothes


(_Interested in her story? See [Wikipedia](https://en.wikipedia.org/wiki/No%C3%ABl_Leslie,_Countess_of_Rothes)!_)

The string method `contains` checks for each of the values in the column `Name` whether the string contains the word `Countess` and returns for each of the values `True` (`Countess` is part of the name) or `False` (`Countess` is not part of the name). This output can be used to subselect the data using conditional (boolean) indexing introduced in the [subsetting of data tutorial](subset_data.ipynb). As there was only 1 Countess on the Titanic, we get one row as a result.
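
The two steps (building the boolean mask, then using it to select rows) can be sketched separately as follows. The `passengers` table and its made-up "Countess" entry are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical three-row table; the second name is invented for the example
passengers = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Example, the Countess. of (Jane Doe)",
        "Allen, Mr. William Henry",
    ],
    "Pclass": [3, 1, 3],
})

# Step 1: one True/False value per row
mask = passengers["Name"].str.contains("Countess")

# Step 2: keep only the rows where the mask is True
selected = passengers[mask]

print(mask.tolist())
print(selected)
```

Note that the match is case-sensitive by default, so searching for `"countess"` would not select the row here.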

<div class="alert alert-info">
    
__Note__: More powerful extractions on strings are supported, as the `contains` and `extract` methods accept [regular expressions](https://docs.python.org/3/library/re.html), but these are out of scope for this tutorial.

</div>

__To user guide:__ More information on extracting parts of strings is available in :ref:`text.extract`.

> Which passenger of the Titanic has the longest name?

In [146]:
titanic["Name"].str.len()

0      23
1      51
2      22
3      44
4      24
       ..
886    21
887    28
888    40
889    21
890    19
Name: Name, Length: 891, dtype: int64

To get the longest name we first have to get the lengths of each of the names in the `Name` column. By using pandas string methods, the `len` function is applied to each of the names individually (element-wise).

In [147]:
titanic["Name"].str.len().idxmax()

307

Next, we need to get the corresponding location, preferably the index name, in the table for which the name length is the largest. The `idxmax` method does exactly that. It is not a string method and is applied to integers, so no `str` is used.

In [148]:
titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]

'Penasco y Castellana, Mrs. Victor de Satode (Maria Josefa Perez de Soto y Vallejo)'

Based on the index name of the row (`307`) and the column (`Name`), we can do a selection using the `loc` operator, introduced in the [tutorial on subsetting](3_subset_data.ipynb).
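
The three steps can also be written out one by one. The sketch below assumes a small table built from three names of the data set, with their original index labels set by hand so that label and position differ.

```python
import pandas as pd

# Three names from the data set, keeping their index labels (an assumption
# made here to show that idxmax returns a label, not a position)
passengers = pd.DataFrame(
    {"Name": [
        "Dooley, Mr. Patrick",
        "Graham, Miss. Margaret Edith",
        "Behr, Mr. Karl Howell",
    ]},
    index=[890, 887, 889],
)

# Step 1: length of each name (string method, so `.str` is needed)
lengths = passengers["Name"].str.len()

# Step 2: index label of the largest length (plain Series method, no `.str`)
longest_label = lengths.idxmax()

# Step 3: select row and column by label with `loc`
longest_name = passengers.loc[longest_label, "Name"]

print(lengths.tolist())
print(longest_label, longest_name)
```

Because `idxmax` returns the index label (`887` here) rather than the row position (`1`), it combines naturally with `loc`, which also works on labels.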

> In the 'Sex' column, replace values of 'male' by 'M' and all 'female' values by 'F'

In [149]:
titanic["Sex_short"] = titanic["Sex"].replace({"male": "M", 
                                         "female": "F"})
titanic["Sex_short"]

0      M
1      F
2      F
3      F
4      M
      ..
886    M
887    F
888    F
889    M
890    M
Name: Sex_short, Length: 891, dtype: object

Although `replace` is not a string method, it provides a convenient way to use mappings or vocabularies to translate certain values. Here, a dictionary defines the mapping `{from: to}`.
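
A minimal sketch of how such a mapping behaves is given below. The `sex` Series, including the extra `"unknown"` value, is a hypothetical example and not part of the Titanic data.

```python
import pandas as pd

# Hypothetical values; "unknown" is added to show what happens to
# entries that do not appear in the mapping
sex = pd.Series(["male", "female", "female", "unknown"])

# Whole values are matched against the dictionary keys
sex_short = sex.replace({"male": "M", "female": "F"})

print(sex_short.tolist())
```

Values without an entry in the dictionary are left unchanged, and only complete values are matched, so `"female"` is not affected by the `"male"` key.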

<div class="alert alert-warning">
    
__Note__: There is also a `str.replace` method available to replace a specific set of characters. However, when having a mapping of multiple values, this would become:

    titanic["Sex_short"] = titanic["Sex"].str.replace("female", "F")
    titanic["Sex_short"] = titanic["Sex_short"].str.replace("male", "M")

This would become cumbersome and easily lead to mistakes. Just think (or try out) yourself what would happen if those two statements are applied in the opposite order...

</div>
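
For those who prefer to check the outcome of the note above, the following sketch runs both orders on a hypothetical two-value Series. `regex=False` is passed explicitly so the patterns are treated as literal text.

```python
import pandas as pd

# Hypothetical two-value sample
sex = pd.Series(["male", "female"])

# Order from the note: "female" first, then "male"
right_order = (
    sex.str.replace("female", "F", regex=False)
       .str.replace("male", "M", regex=False)
)

# Opposite order: "male" is also a substring of "female",
# so it is replaced inside "female" before the second call runs
wrong_order = (
    sex.str.replace("male", "M", regex=False)
       .str.replace("female", "F", regex=False)
)

print(right_order.tolist())
print(wrong_order.tolist())
```

Unlike `replace` with a dictionary, `str.replace` works on substrings, which is why the order of the calls matters.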

## REMEMBER

- String methods are available using the `str` accessor.
- String methods work element-wise and can be used for conditional indexing.
- The `replace` method is a convenient method to convert values according to a given dictionary.

__To user guide:__ More information on string methods is given in :ref:`text`.