<a href="https://github.com/theonaunheim">
    <img style="border-radius: 100%; float: right;" src="static/strawberry_thief_square.png" width=10% alt="Theo Naunheim's Github">
</a>

<br style="clear: both">
<hr>
<br>

<h1 align='center'>String Series</h1>

<br>

<div style="display: table; width: 100%">
    <div style="display: table-row; width: 100%;">
        <div style="display: table-cell; width: 50%; vertical-align: middle;">
            <img src="static/aww.jpg" width="80%">
        </div>
        <div style="display: table-cell; width: 10%">
        </div>
        <div style="display: table-cell; width: 40%; vertical-align: top;">
            <blockquote>
                <p style="font-style: italic;">"Words are good servants but bad masters."</p>
                <br>
                <p>-Aldous Huxley</p>
            </blockquote>
        </div>
    </div>
</div>


<br>




<br>

<div align='left'>
    Image courtesy of <a href='https://commons.wikimedia.org/wiki/File:Red_Panda_Tennoji_2.jpg'>Kuribo</a> under the <a href='https://creativecommons.org/licenses/by/2.5/deed.en'>CC BY-SA 3.0</a>
</div>

<hr>

In [2]:
import numpy as np
import pandas as pd

# Generally

Because strings are so important in data analysis (and in programming as a whole), a `Series` of text has a special namespace for string methods called `.str`. This string methods object allows us to perform specialized vectorized string operations.

Strings are stored as the 'O' ('object') dtype, which gives you flexibility when it is time to transform these strings (e.g. `np.NaN` for missing data, Python introspection, etc.). Object-dtype `Series` generally have a `.str` namespace, but it won't be useful unless you're working with strings.

**Note**: unless you are working with fixed-width strings (e.g. 10 UTF-8 characters each), you will want to steer clear of using numpy strings for text processing. Python string operations are usually quick enough.
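
As a minimal sketch of the object storage described above (the sample values are made up for illustration, and `dtype=object` is spelled out as an assumption so the example behaves the same across pandas versions), a string method is applied to each string while missing values pass through untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two strings and one missing value.
mixed = pd.Series(['red panda', np.nan, 'Strawberry Thief'], dtype=object)

# Vectorized string method: applied to every string, NaN stays NaN.
upper = mixed.str.upper()
print(upper)

# A numeric Series has no usable .str namespace.
try:
    pd.Series([1, 2, 3]).str
except AttributeError as err:
    print(f'AttributeError: {err}')
```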

## Creation

Most of the time you will not need to worry about creating a string series. Pandas is smart enough to do that on its own; the only time you'll need to do this manually is when you are reading numbers as strings or converting numbers to strings. Conversions at the time of reading or construction can be accomplished by passing `dtype=str` to `pd.read_csv()` or to the `pd.Series()` constructor. Conversions later on can be performed using `Series.astype(str)`.

In [3]:
# Load names as strings automatically
names = pd.read_csv('data/subject_names.csv', header=None, squeeze=True).head()
names

0       Steve Smith
1     Steve Johnson
2    Steve Williams
3       Steve Brown
4       Steve Jones
Name: 0, dtype: object

In [4]:
# And here is how you would convert an existing Series using .astype() (it returns a new Series)
number_series = pd.Series([1.1, 2.2, 3.02222])
string_series = number_series.astype(str)
string_series

0        1.1
1        2.2
2    3.02222
dtype: object

**Note**: if you are loading numbers that should stay as text, you may have to specify `dtype=str` to prevent automatic conversion to a numeric type.

In [5]:
heights = pd.read_csv('data/height_in_feet.csv', header=None, squeeze=True, dtype=str).head()
heights

0    5.57
1     5.7
2    6.12
3     5.9
4    5.87
Name: 0, dtype: object
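
To see why preventing autoconversion matters, here is a small sketch that uses an in-memory CSV instead of a file. The ID values and the `csv_text` name are assumptions made up for this example; they are not part of the data used in this notebook.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file: ID codes with leading zeros.
csv_text = '00123\n00456\n07890\n'

# Default parsing infers integers, so the leading zeros are lost.
as_numbers = pd.read_csv(io.StringIO(csv_text), header=None)[0]

# dtype=str keeps every field exactly as it was written.
as_strings = pd.read_csv(io.StringIO(csv_text), header=None, dtype=str)[0]

print(as_numbers.tolist())  # [123, 456, 7890]
print(as_strings.tolist())  # ['00123', '00456', '07890']
```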

**Note**: if you care how something is translated to a string, that logic is handled by the object's `__str__()` magic method, and consequently is class-specific.

In [6]:
import re, difflib, string

pd.Series([re, difflib, string]).astype(str)

0    <module 're' from 'C:\\Program Files (x86)\\Mi...
1    <module 'difflib' from 'C:\\Program Files (x86...
2    <module 'string' from 'C:\\Program Files (x86)...
dtype: object
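
The same mechanism can be sketched with a class of your own. The `Height` class below is hypothetical and exists only to show that `astype(str)` produces whatever `__str__()` returns for each element.

```python
import pandas as pd


class Height:
    """Hypothetical class used only to illustrate __str__()."""

    def __init__(self, feet, inches):
        self.feet = feet
        self.inches = inches

    def __str__(self):
        # This is the text that astype(str) produces for each element.
        return f"{self.feet}'{self.inches}\""


heights_as_objects = pd.Series([Height(5, 7), Height(6, 1)])
heights_as_text = heights_as_objects.astype(str)
print(heights_as_text.tolist())
```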

## String series and the '+' operator

This doesn't fit neatly anywhere else, so we will tackle it here. Using the plus operator, you can concatenate one string series to another string series, or concatenate an arbitrary string to each item in the series.

In [7]:
# String series can be concatenated simply by adding them.
double_names = names + names
double_names

0          Steve SmithSteve Smith
1      Steve JohnsonSteve Johnson
2    Steve WilliamsSteve Williams
3          Steve BrownSteve Brown
4          Steve JonesSteve Jones
Name: 0, dtype: object

In [8]:
# Adding a string to a series will concatenate it to each element.
names + ' cannot be trusted. ' + names + '!'

0          Steve Smith cannot be trusted. Steve Smith!
1      Steve Johnson cannot be trusted. Steve Johnson!
2    Steve Williams cannot be trusted. Steve Williams!
3          Steve Brown cannot be trusted. Steve Brown!
4          Steve Jones cannot be trusted. Steve Jones!
Name: 0, dtype: object
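
One thing worth knowing about `+` is how it treats missing values. The sketch below uses made-up first and last names (not the notebook's data) and assumes object storage, as described earlier; it also shows `str.cat()` with `na_rep` as one possible alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a gap in each Series.
first = pd.Series(['Steve', 'Steve', np.nan], dtype=object)
last = pd.Series(['Smith', np.nan, 'Jones'], dtype=object)

# With '+', a missing value on either side makes the whole result missing.
plus_result = first + ' ' + last

# str.cat() does the same job but lets you substitute text for missing values.
cat_result = first.str.cat(last, sep=' ', na_rep='(unknown)')

print(plus_result)
print(cat_result)
```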

## So how do we use this `.str` namespace?

The `.str` namespace can be used in two main ways:

* subscription
* methods

In [9]:
# Count the attributes available on the .str namespace
dir_count = len(dir(names.str))
print(f'Our string namespace has {dir_count} methods.')

# And this is what the namespace object itself looks like
names.str

Our string namespace has 89 methods.


<pandas.core.strings.StringMethods at 0x18d1b0be6d8>

### Subscription

If you have ever worked with strings in Python before, you know that you can get a [specified portion of the string using bracket notation](https://docs.python.org/3/tutorial/introduction.html#strings).

    >>> word = 'Python'
    >>> word[0]  # character in position 0
    'P'
    >>> word[5]  # character in position 5
    'n'
    >>> # Indices may also be negative numbers, to start counting from the right:
    >>> word[-1]  # last character
    'n'
    >>> word[-2]  # second-last character
    'o'
    >>> word[-6]
    'P'
    
Pandas strings can be indexed in this manner by subscripting the `.str` methods object directly.

In [10]:
# Python string series
words = pd.Series(['Python', 'Python', 'Not Python'])

# Get the first three letters.
first_letters = words.str[0:3]

# Get last letter.
last_letters = words.str[-1]

'First letters are ' + first_letters + ' and last letter is ' + last_letters

0    First letters are Pyt and last letter is n
1    First letters are Pyt and last letter is n
2    First letters are Not and last letter is n
dtype: object

In [11]:
# Similarly, you can use ranges.
words.str[2:-1]

0        tho
1        tho
2    t Pytho
dtype: object
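
One difference from plain Python is worth a quick sketch: an out-of-range position does not raise an `IndexError`; it yields a missing value instead. The words below are made up for illustration.

```python
import pandas as pd

# Hypothetical words of different lengths.
words = pd.Series(['Python', 'Py', ''], dtype=object)

# Plain Python would raise IndexError for 'Py'[3]; the .str version gives NaN.
fourth_char = words.str[3]

# Slices never fail; they return whatever characters fall inside the range.
first_three = words.str[:3]

print(fourth_char)
print(first_three)
```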

### `.str` methods

This is where the majority of the work is done. The `.str` function namespace has access to versions of [pretty much every string method the Python standard library has](https://pandas.pydata.org/pandas-docs/stable/api.html#string-handling), and a few more that are pandas-specific. Some of the more important ones are listed below.

These methods can be grouped into the following categories:

* **Boolean Checks**;
* **Splitting, Joining, and Slicing**;
* **Find/Extract/Replace**;
* **Spacing**;
* **Miscellaneous Transformations**;
* **Encoding**; and,
* **Miscellaneous**.

#### Boolean Checks

These methods perform a check on each of the strings in your Series and return a boolean Series of True or False values (i.e. True if the test is passed; False if the test is failed). It is then possible to use that result as a boolean index to transform or filter the Series.

    # Pseudocode
    bix = series.str.method()
    
    # Use bix to get a filtered series.
    series.loc[bix]
    
    # Or more succinctly ...
    series[series.str.method()]

Some of the more useful methods are:

* [`str.contains()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html): tests whether a pattern is found anywhere in each string.
* [`str.endswith()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.endswith.html): tests whether the end of each string matches a given suffix.
* [`str.isalnum()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.isalnum.html): tests whether each string is alphanumeric.
* [`str.isalpha()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.isalpha.html): tests whether each string is alphabetic.
* [`str.isdecimal()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.isdecimal.html): tests whether each string consists only of decimal characters.
* [`str.isdigit()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.isdigit.html): tests whether each string consists only of digits.
* [`str.islower()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.islower.html): tests whether each string is lowercase.
* [`str.isnumeric()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.isnumeric.html): tests whether each string consists only of numeric characters.
* [`str.isspace()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.isspace.html): tests whether each string consists only of whitespace.
* [`str.istitle()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.istitle.html): tests whether each string is in title case.
* [`str.isupper()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.isupper.html): tests whether each string is uppercase.
* [`str.startswith()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.startswith.html): tests whether the start of each string matches a given prefix.
* [`str.match()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.match.html): tests whether each string matches a regular expression (anchored at the start of the string).
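
The three number-related checks in this list are easy to confuse, so here is a small sketch of how they differ. The sample characters are chosen for illustration, and `dtype=object` is an assumption so that the checks follow plain Python string semantics.

```python
import pandas as pd

# '12' is plain digits, '\u00b2' is a superscript two,
# '\u00bd' is the fraction one half, and 'twelve' is a word.
samples = pd.Series(['12', '\u00b2', '\u00bd', 'twelve'], dtype=object)

checks = pd.DataFrame({
    'text': samples,
    'isdecimal': samples.str.isdecimal(),
    'isdigit': samples.str.isdigit(),
    'isnumeric': samples.str.isnumeric(),
})
print(checks)
```

Each check is broader than the one before it: only `'12'` passes `isdecimal()`, the superscript also passes `isdigit()`, and the fraction passes only `isnumeric()`.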

**Note**: by default, `str.contains()` and `str.match()` take [regular expressions](https://docs.python.org/3/howto/regex.html), so if you're looking for characters with a special regex meaning, such as `)`, `?`, or `*`, you will need to escape them (e.g. with `re.escape()`) or, for `str.contains()`, pass `regex=False`. `str.match()` has no `regex` argument.
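
A short sketch of both options follows. The example lines are invented for illustration and are not taken from the notebook's data file.

```python
import re

import pandas as pd

# Hypothetical lines containing regex metacharacters.
lines = pd.Series(['What?', 'Crucifixion (first offence)', 'Nothing special'])

# Option 1: turn regex handling off, so '?' is treated as a literal character.
has_question_mark = lines.str.contains('?', regex=False)

# Option 2: keep regex on and escape the metacharacter.
has_open_paren = lines.str.contains(re.escape('('))

# str.match() has no regex switch, so escaping is the way to go there.
starts_with_literal = lines.str.match(re.escape('What?'))

print(has_question_mark.tolist())    # [True, False, False]
print(has_open_paren.tolist())       # [False, True, False]
print(starts_with_literal.tolist())  # [True, False, False]
```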

In [12]:
# Load names
names = pd.read_csv('./data/lob_characters.csv', squeeze=True)
names.head()

0                    Brian
1                      NaN
2    Centurion of the Yard
3                   Gaoler
4                      NaN
Name: characters, dtype: object

In [13]:
# This will return boolean values (including NaNs)
or_bix = names.str.contains('or')
or_bix.head()

0    False
1      NaN
2    False
3    False
4      NaN
Name: characters, dtype: object

In [17]:
# Quick note: you can't index with NaNs
try:
    names.loc[or_bix]
except ValueError:
    print('ValueError!')
    print('Cannot index with vector containing NA / NaN values')
    print('Get around that inconvenient fact with fillna().')
    
# To get around it, fillna()!
no_nan_or_bix = or_bix.fillna(False)
no_nan_or_bix.head()

ValueError!
Cannot index with vector containing NA / NaN values
Get around that inconvenient fact with fillna().


0    False
1    False
2    False
3    False
4    False
Name: characters, dtype: bool
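
As an alternative sketch to `fillna()`, the boolean checks `contains()`, `match()`, `startswith()`, and `endswith()` accept an `na` argument that fills missing entries during the check itself. The names below are a small hypothetical stand-in for the file loaded above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the character names loaded above.
names = pd.Series(['Brian', np.nan, 'Gregory', 'Gaoler'], dtype=object)

# na=False marks missing entries as False while the check runs.
or_bix = names.str.contains('or', na=False)

# The result is a plain boolean Series, so it can be used for indexing directly.
or_names = names[or_bix]
print(or_names)
```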

In [22]:
# Then we can use that bix to get our subset
print("Here are our names ending with 'er'!")
er_bix = names.str.endswith('er').fillna(False)
names[er_bix]

Here are our names ending with 'er'!


3                Gaoler
5     Harry the Haggler
6              Ex-Leper
14               Gaoler
22             Ex-Leper
Name: characters, dtype: object

In [20]:
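# contains() and match() are super powerful because they accept regular expressions.

# Contains more than one space (match() anchors at the start, hence the leading '.*').
multi_space_bix = names.str.match(r'.*\s.*\s').fillna(False)

# Contains a 'g'.
g_bix = names.str.contains('g').fillna(False)

# For reference, here is the full Series that these checks run against.
names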

0                     Brian
1                       NaN
2     Centurion of the Yard
3                    Gaoler
4                       NaN
5         Harry the Haggler
6                  Ex-Leper
7                   Gregory
8           Judith Escariot
9        Simon the Holy Man
10           Pontius Pilate
11                 Matthias
12                  Gregory
13                      NaN
14                   Gaoler
15                    Brian
16       Simon the Holy Man
17                      NaN
18                      NaN
19                  Gregory
20                  Gregory
21                  Gregory
22                 Ex-Leper
23                  Gregory
24       Simon the Holy Man
25                 Matthias
26                      NaN
Name: characters, dtype: object
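
The difference between the pattern-based checks comes down to anchoring, which the following sketch tries to make visible. It uses a handful of names copied from the output above and assumes pandas 1.1 or later for `str.fullmatch()`.

```python
import pandas as pd

names = pd.Series(['Gregory', 'Ex-Leper', 'Harry the Haggler', 'Gaoler'], dtype=object)

# contains(): the pattern may appear anywhere in the string.
le_anywhere = names.str.contains('Le')

# match(): the pattern must match starting at the first character.
le_at_start = names.str.match('Le')

# match() only needs a matching prefix ...
letters_prefix = names.str.match(r'[A-Za-z]+')

# ... whereas fullmatch() requires the entire string to match.
letters_only = names.str.fullmatch(r'[A-Za-z]+')

print(le_anywhere.tolist())     # [False, True, False, False]
print(le_at_start.tolist())     # [False, False, False, False]
print(letters_prefix.tolist())  # [True, True, True, True]
print(letters_only.tolist())    # [True, False, False, True]
```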

In [None]:
# Again, remember that contains() and match() default to regex (which can give characters such as ')', '?', and '*' a special meaning)

#### Splitting, Joining, and Slicing

These methods are used to split or join text for further processing.

* [`str.cat()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.cat.html): combine all text elements of a Series into one mega string (or, given a second Series, concatenate the two element by element).
* `str.get()`: get the Nth item of each list/tuple/string.
* `str.join()`: join the elements of each list in a Series with a particular delimiter.
* `str.partition()`: split each string into 3 elements at the first occurrence of the separator: text before separator, separator, text after separator.
* `str.rpartition()`: split each string into 3 elements at the last occurrence of the separator: text before separator, separator, text after separator.
* `str.rsplit()`: split each string on a separator into a list, working from the right.
* `str.split()`: split each string on a separator into a list, working from the left.

**Note**: `partition()` (vectorized or otherwise) is super-useful. If you haven't heard of it, you should take a look at the documentation and see if it can benefit you.

**Note**: `str.cat()` is a great way to turn multiple bodies of text into a single corpus for things like machine learning.
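
The methods above can be sketched together on a few of the character names. The variable names are made up for this example, and `dtype=object` is an assumption so that `split()` returns ordinary Python lists.

```python
import pandas as pd

names = pd.Series(['Harry the Haggler', 'Simon the Holy Man', 'Pontius Pilate'],
                  dtype=object)

# split(): one list of pieces per element.
pieces = names.str.split(' ')

# get(): pick the first piece out of each list.
first_words = pieces.str.get(0)

# join(): glue each list back together with a different delimiter.
rejoined = pieces.str.join('_')

# partition() / rpartition(): split once, at the first or last separator.
# Both return a three-column DataFrame by default.
first_split = names.str.partition(' ')
last_split = names.str.rpartition(' ')

# cat() without a second Series: collapse everything into a single string.
corpus = names.str.cat(sep='; ')

print(first_split)
print(last_split)
print(corpus)
```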

#### Find, Extract, Replace

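* [`str.extract()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html): extract the first match of each regex capture group from each string.
* [`str.extractall()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extractall.html): extract every match of each regex capture group, returning a DataFrame with one row per match.
* [`str.find()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.find.html): return the lowest index at which a substring is found in each string (-1 if it is not found).
* [`str.findall()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.findall.html): return a list of all regex matches in each string.
* [`str.index()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.index.html): like `str.find()`, but raises a `ValueError` if the substring is not found.
* [`str.replace()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html): replace occurrences of a pattern in each string with another string.
* [`str.rfind()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.rfind.html): return the highest index at which a substring is found in each string (-1 if it is not found).
* [`str.rindex()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.rindex.html): like `str.rfind()`, but raises a `ValueError` if the substring is not found.
* [`str.slice()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.slice.html): a function for string subscription that returns a value.
* [`str.slice_replace()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.slice_replace.html): a function for string subscription that replaces values.
* [`str.translate()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.translate.html): map each character through a translation table (such as one built with `str.maketrans()`).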

**Note**: `str.extract()`, `str.extractall()`, and `str.findall()` take [regular expressions](https://docs.python.org/3/howto/regex.html), and `str.replace()` does too when `regex=True` (the default in older versions of pandas; pandas 2.0 and later default to `regex=False`). If you're looking for characters such as `)`, `?`, or `*`, you will need to escape them or, for `str.replace()`, pass `regex=False`. `str.find()` always searches for a literal substring.
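
Here is a brief sketch of the main find, extract, and replace methods. The `lines` data and the group names `name` and `count` are assumptions invented for this example. `regex=True` is passed explicitly to `str.replace()` so the call does not depend on the pandas version's default.

```python
import pandas as pd

# Hypothetical data: a character name followed by a line count.
lines = pd.Series(['Brian: 12 lines', 'Gregory: 7 lines', 'Gaoler'], dtype=object)

# extract(): named capture groups become columns; non-matching rows are NaN.
parts = lines.str.extract(r'(?P<name>\w+): (?P<count>\d+)')

# findall(): every regex match in each string, as a list.
numbers = lines.str.findall(r'\d+')

# find(): position of a literal substring, or -1 when it is absent.
colon_position = lines.str.find(':')

# replace(): substitute every regex match with new text.
masked = lines.str.replace(r'\d+', '#', regex=True)

print(parts)
print(numbers.tolist())         # [['12'], ['7'], []]
print(colon_position.tolist())  # [5, 7, -1]
print(masked.tolist())
```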

#### Spacing

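* [`str.center()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.center.html): pad both sides of each string to a given width.
* [`str.ljust()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.ljust.html): pad the right side of each string to a given width (left-justify).
* [`str.lstrip()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.lstrip.html): remove leading whitespace (or other specified characters).
* [`str.pad()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.pad.html): pad each string to a given width on the left, right, or both sides.
* [`str.rjust()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.rjust.html): pad the left side of each string to a given width (right-justify).
* [`str.rstrip()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.rstrip.html): remove trailing whitespace (or other specified characters).
* [`str.strip()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.strip.html): remove leading and trailing whitespace (or other specified characters).
* [`str.wrap()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.wrap.html): wrap long strings into lines no longer than a given width.
* [`str.zfill()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.zfill.html): pad the left side of each string with `'0'` characters to a given width.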
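
A quick sketch of a common cleanup sequence using these methods follows. The messy values are made up for illustration.

```python
import pandas as pd

# Hypothetical messy numeric codes with stray whitespace.
raw = pd.Series(['  7 ', '42', ' 314'], dtype=object)

# Remove the surrounding whitespace first.
stripped = raw.str.strip()

# Pad with zeros on the left to a fixed width of 4.
zero_filled = stripped.str.zfill(4)

# Pad with a custom fill character on the left or on the right.
right_aligned = stripped.str.rjust(5, fillchar='.')
left_aligned = stripped.str.ljust(5, fillchar='.')

print(zero_filled.tolist())    # ['0007', '0042', '0314']
print(right_aligned.tolist())  # ['....7', '...42', '..314']
print(left_aligned.tolist())   # ['7....', '42...', '314..']
```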


#### Miscellaneous Transformations

* [`str.capitalize()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.capitalize.html): uppercase the first character of each string and lowercase the rest.
* [`str.lower()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.lower.html): convert each string to lowercase.
* [`str.repeat()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.repeat.html): repeat each string a given number of times.
* [`str.swapcase()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.swapcase.html): swap the case of every character in each string.
* [`str.title()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.title.html): convert each string to title case.
* [`str.upper()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.upper.html): convert each string to uppercase.
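
These behave like their plain Python counterparts, applied element by element. A small sketch with made-up input:

```python
import pandas as pd

# Hypothetical names with inconsistent casing.
names = pd.Series(['simon the holy man', 'EX-LEPER'], dtype=object)

titled = names.str.title()             # ['Simon The Holy Man', 'Ex-Leper']
capitalized = names.str.capitalize()   # ['Simon the holy man', 'Ex-leper']
lowered = names.str.lower()            # ['simon the holy man', 'ex-leper']
swapped = names.str.swapcase()         # ['SIMON THE HOLY MAN', 'ex-leper']

# repeat() duplicates each string the requested number of times.
repeated = pd.Series(['ni! '], dtype=object).str.repeat(3)

print(titled.tolist())
print(repeated.tolist())  # ['ni! ni! ni! ']
```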

#### Encoding

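* [`str.decode()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.decode.html): decode each bytes element into a string using a given encoding.
* [`str.encode()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.encode.html): encode each string into bytes using a given encoding.
* [`str.normalize()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.normalize.html): return the Unicode normal form of each string (e.g. `'NFC'` or `'NFKD'`).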
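
As a rough sketch of how the three fit together, the example below round-trips one accented word through UTF-8 and then decomposes it. The word is chosen for illustration only.

```python
import pandas as pd

# 'caf\u00e9' is 'cafe' with a single precomposed accented e.
text = pd.Series(['caf\u00e9'], dtype=object)

# encode(): str -> bytes.
as_bytes = text.str.encode('utf-8')

# decode(): bytes -> str, recovering the original text.
round_trip = as_bytes.str.decode('utf-8')

# normalize('NFD') splits the accented letter into a base letter plus a
# combining accent, so the string gets one code point longer.
decomposed = text.str.normalize('NFD')

print(as_bytes.tolist())              # [b'caf\xc3\xa9']
print(text.str.len().tolist())        # [4]
print(decomposed.str.len().tolist())  # [5]
```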

#### Miscellaneous

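* [`str.count()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.count.html): count the occurrences of a regex pattern in each string.
* [`str.get_dummies()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.get_dummies.html): split each string on a separator and return a DataFrame of indicator (dummy) variables.
* [`str.len()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.len.html): compute the length of each string.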
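
A short sketch of all three on a made-up column of pipe-separated tags (the tag values are assumptions for this example):

```python
import pandas as pd

# Hypothetical tags, separated by '|'.
tags = pd.Series(['roman|guard', 'prophet', 'roman|prisoner'], dtype=object)

# len(): number of characters in each string.
lengths = tags.str.len()

# count(): takes a regex, so the pipe has to be escaped.
separators = tags.str.count(r'\|')

# get_dummies(): one indicator column per distinct tag.
dummies = tags.str.get_dummies(sep='|')

print(lengths.tolist())     # [11, 7, 14]
print(separators.tolist())  # [1, 0, 1]
print(dummies)
```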

# Additional Learning Resources

* ### [Working with Text Data](http://pandas.pydata.org/pandas-docs/stable/text.html)
* ### [String API](https://pandas.pydata.org/pandas-docs/stable/api.html#string-handling)
* ### [Python Regular Expression How-To](https://docs.python.org/3/howto/regex.html)

---

# Next Up: [Series Part 3 Exercises](4_series_part_3_exercises.ipynb)

<br>

$\huge{W=-\Delta PE}$

<br>

<div align='left'>
    Image courtesy of <a href='https://commons.wikimedia.org/wiki/File:Population_vs_area.svg'>Skbkekas</a> under the <a href='https://creativecommons.org/licenses/by-sa/3.0/deed.en'>CC BY-SA 3.0</a>
</div>

---