# How to manipulate textual data

In [1]:
import pandas as pd


In [2]:
titanic = pd.read_csv("../../data/titanic.csv")

titanic.head()


Let's convert all the names to lowercase:


In [None]:
titanic["Name"].str.lower()

To make each of the strings in the `Name` column lowercase, select the `Name` column, add the `str` accessor and apply the `lower` method. Each string is then converted element-wise.

Just as datetime objects have a `dt` accessor (see the time series tutorial), strings have a `str` accessor that exposes a number of specialized string methods. These methods generally share their names with the equivalent built-in Python string methods for single values, but they are applied element-wise (remember element-wise calculations?) to each value in the column.
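As a minimal sketch of this element-wise behaviour, the snippet below compares the built-in `str.lower()` on a single value with `Series.str.lower()` on a whole column. The `names` Series is a small illustrative sample (an assumption, not loaded from the Titanic file) so that the code runs on its own.

```python
import pandas as pd

# Illustrative sample, formatted like the values in the `Name` column.
names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

# Built-in string method applied to one single value.
single = names.iloc[0].lower()

# The pandas method with the same name, applied to every element at once.
lowered = names.str.lower()
print(lowered)
```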

> ❓ **Create a new column `Surname`** that contains the surname of the passengers by extracting the part before the comma.


In [None]:
titanic["Name"].str.split(",")

Using the `Series.str.split()` method, each value is returned as a list of two elements:
* The *first element* is the part before the comma.
* The *second element* is the part after the comma.
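To see what these lists look like, here is a small sketch on an illustrative `names` Series (assumed sample values, not the real file). Note that the second element keeps the space that followed the comma.

```python
import pandas as pd

# Illustrative sample, formatted like the values in the `Name` column.
names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

# Each value becomes a list: [part before the comma, part after the comma].
parts = names.str.split(",")

first_row = list(parts.iloc[0])
print(first_row)
```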

In [None]:
titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)

In [None]:
titanic["Surname"]

As we are only interested in the first part, which represents the surname (element 0), we can use the `str` accessor again and apply `Series.str.get()` to extract the relevant part. Indeed, these string methods can be chained to combine multiple operations at once!
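The chaining can be sketched on a small stand-in DataFrame. Here `df` and the `Rest` column are hypothetical names introduced only for illustration; `Rest` shows that a third method, `Series.str.strip()`, can be added to the chain to drop the leading space.

```python
import pandas as pd

# Hypothetical stand-in for the `titanic` DataFrame.
df = pd.DataFrame({"Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"]})

# split -> get: keep element 0 of each list (the surname).
df["Surname"] = df["Name"].str.split(",").str.get(0)

# split -> get -> strip: keep element 1 and remove surrounding whitespace.
df["Rest"] = df["Name"].str.split(",").str.get(1).str.strip()

print(df[["Surname", "Rest"]])
```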

More information on extracting parts of strings is available in the user guide section on splitting and replacing strings.

https://pandas.pydata.org/docs/user_guide/text.html#text-split

#### Extract the passenger data about the countesses on board the Titanic.

In [None]:
titanic["Name"].str.contains("Countess")

In [None]:
titanic[titanic["Name"].str.contains("Countess")]

The string method `Series.str.contains()` checks, for each value in the `Name` column, whether the string contains the word 'Countess' and returns `True` (Countess is part of the name) or `False` (Countess is not part of the name). This output can be used to subselect the data using conditional (boolean) indexing, introduced in the subsetting of data tutorial. As there was only one countess on the Titanic, we get one row as a result.
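A minimal sketch of the two steps, building the boolean mask and then using it for selection, on a hypothetical three-row `df` (sample names written in the style of the dataset):

```python
import pandas as pd

# Hypothetical stand-in for the `titanic` DataFrame.
df = pd.DataFrame({
    "Name": [
        "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
    ]
})

# Step 1: one True/False value per row.
mask = df["Name"].str.contains("Countess")

# Step 2: keep only the rows where the mask is True.
countesses = df[mask]
print(countesses)
```

If the column contained missing values, passing `na=False` to `str.contains()` would treat them as non-matches, so the mask stays purely boolean.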

> #### **ℹ️ Note**  
> More powerful extractions on strings are supported, as the `Series.str.contains()` and `Series.str.extract()` methods accept [regular expressions](https://docs.python.org/3/library/re.html), but these are out of scope for this tutorial.

More information on regular expressions is available in the Python documentation for the `re` module.
https://docs.python.org/3/library/re.html 
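Although regular expressions are not covered here, a brief sketch may help show what they make possible. The patterns below are illustrative assumptions: one pulls out the title between the comma and the first period, the other matches "Mr." or "Mrs." only.

```python
import pandas as pd

# Illustrative sample, formatted like the values in the `Name` column.
names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

# Capture the text between ", " and the first "." (the title).
# expand=False returns a Series because there is a single capture group.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)

# Regular expression match: "Mr." or "Mrs.", but not "Miss.".
is_mr_or_mrs = names.str.contains(r"Mrs?\.")

print(titles)
print(is_mr_or_mrs)
```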

> ❓ **Which passenger of the Titanic has the longest name?**


In [None]:
titanic["Name"].str.len()

* To get the longest name, we first have to get the lengths of each of the names in the `Name` column.
* By using pandas string methods, the `Series.str.len()` function is applied to each of the names individually, element-wise.

Next, we need the corresponding location, preferably the index label, of the row in which the name length is the largest.
* The `idxmax()` method does exactly that.
* It is not a string method: it is applied to the integer lengths returned by `Series.str.len()`, so no `str` accessor is used.


In [None]:
titanic["Name"].str.len().idxmax()

* Based on the index label of the row (`307`) and the column name (`Name`), we can select the value using the `loc` accessor, introduced in the tutorial on subsetting.

In [None]:
titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]
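The three steps (lengths, index label of the maximum, selection with `loc`) can be sketched on a hypothetical `df`. The index labels 10, 20 and 30 are chosen deliberately so that it is visible that `idxmax()` returns a label rather than a position.

```python
import pandas as pd

# Hypothetical stand-in for the `titanic` DataFrame, with a non-default index.
df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Heikkinen, Miss. Laina",
            "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        ]
    },
    index=[10, 20, 30],
)

# Step 1: number of characters in each name (a Series of integers).
lengths = df["Name"].str.len()

# Step 2: index label of the row with the largest length (no `str` needed).
longest_label = lengths.idxmax()

# Step 3: select that row and the `Name` column by label.
longest_name = df.loc[longest_label, "Name"]
print(longest_label, longest_name)
```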

> ❓ **In the "Sex" column, replace values of "male" by "M" and values of "female" by "F".**



In [None]:
titanic["Sex_short"] = titanic["Sex"].replace({"male":"M","female":"F"})
titanic["Sex_short"]

> * Although `replace()` is not a string method, it provides a convenient way to use mappings or vocabularies to translate certain values.
> * It requires a dictionary (`dict`) to define the mapping `{from: to}`.


> **⚠️ Warning**  
> There is also a `Series.str.replace()` string method, which replaces a specific set of characters within each string. However, with a mapping of multiple values, this would become:
> 
> ```python
> titanic["Sex_short"] = titanic["Sex"].str.replace("female", "F")
> titanic["Sex_short"] = titanic["Sex_short"].str.replace("male", "M")
> ```
> 
> This quickly becomes cumbersome and can easily lead to mistakes. Just think about (or try out yourself) what would happen if those two statements were applied in the opposite order...
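To make the difference concrete, here is a small sketch on an illustrative `sex` Series (assumed sample values). It compares the dictionary-based `replace()` with chained `str.replace()` calls in both orders. Because "female" contains the substring "male", replacing "male" first corrupts the "female" values.

```python
import pandas as pd

# Illustrative sample, formatted like the values in the `Sex` column.
sex = pd.Series(["male", "female", "female"])

# Whole-value mapping with a dictionary: order does not matter.
mapped = sex.replace({"male": "M", "female": "F"})

# Substring replacement, "female" first: gives the intended result.
right_order = sex.str.replace("female", "F").str.replace("male", "M")

# Substring replacement, "male" first: "female" turns into "feM",
# so the second call no longer finds anything to replace.
wrong_order = sex.str.replace("male", "M").str.replace("female", "F")

print(mapped.tolist())
print(right_order.tolist())
print(wrong_order.tolist())
```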

---

## REMEMBER  
- String methods are available using the `str` accessor.  
- String methods work element-wise and can be used for conditional indexing.  
- The `replace` method is a convenient method to convert values according to a given dictionary.