# Draft Intro into Strings 

**Authorship**<br />
Original Author: Nicholas Davis<br />
Last Edit: Nicholas Davis, 4/19/2021<br />

**Test System Specs**<br />
Test System Hardware: Tesla T4<br />
Test System Software: Ubuntu 18.04-py3.7<br />
RAPIDS Version: 0.18 - Docker Install<br />
Driver: 450.80.02<br />
CUDA: 11.0<br />


**Known Working Systems**<br />
RAPIDS Versions: 0.18

## Working with text data <br />

Enterprise analytics workflows commonly require processing large-scale text data. To address this need, the RAPIDS CUDA DataFrame library (cuDF) and RAPIDS CUDA Machine Learning library (cuML) now include string processing capabilities. cuDF has a fully-featured string and regular expression processing engine. With a pandas-like API, cuDF string analytics can provide data scientists with up to 90x performance improvement with minimal changes to their code.<br />

This notebook serves as an intro to string capabilities with cuDF. Each string functionality is shown with a pandas example and its cuDF equivalent.<br />

For any additional information please reference:<br />
[cuDF Documentation](https://docs.rapids.ai/api/cudf/stable/api.html#strings)<br /><br />
[GPU-Accelerated String Processing with RAPIDS Video](https://www.nvidia.com/en-us/on-demand/session/gtcfall20-a21131/)


Before we begin, let's check out our hardware setup by running the nvidia-smi command.

In [1]:
!nvidia-smi



Wed Jul 21 06:45:44 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Quadro P5000        On   | 00000000:00:05.0 Off |                  Off |
| 26%   31C    P8     6W / 180W |      1MiB / 16278MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Text data types

There are two ways to store text data in pandas and cuDF:

1. object-dtype NumPy array.

1. StringDtype extension type.

We recommend using StringDtype to store text data.

Prior to pandas 1.0, object dtype was the only option. This was unfortunate for many reasons:

1. You can accidentally store a mixture of strings and non-strings in an object dtype array. It’s better to have a dedicated dtype.

1. object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn’t a clear way to select just text while excluding non-text but still object-dtype columns.

1. When reading code, the contents of an object dtype array are less clear than 'string'.

Currently, the performance of object dtype arrays of strings and arrays.StringArray are about the same. We expect future enhancements to significantly increase the performance and lower the memory overhead of StringArray.
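
As a small illustration of the points above, the sketch below builds a DataFrame with one StringDtype column and one object column holding mixed values, then picks out the text columns by checking the dtype. The column names (`text`, `mixed`, `number`) are made up for this example, and it uses pandas only.

```python
import pandas as pd

# Hypothetical columns, chosen only for illustration.
df = pd.DataFrame(
    {
        "text": pd.Series(["a", "b", "c"], dtype="string"),
        # object dtype silently accepts a mixture of strings and non-strings.
        "mixed": pd.Series(["a", 1, 2.5], dtype="object"),
        "number": pd.Series([1, 2, 3], dtype="int64"),
    }
)

# With a dedicated dtype, text columns can be identified unambiguously.
text_columns = [
    name for name in df.columns if isinstance(df[name].dtype, pd.StringDtype)
]
print(text_columns)
```

The `mixed` column is skipped because its dtype says nothing about whether it holds only text.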



In [2]:
import pandas as pd; print('Pandas Version:', pd.__version__)
import numpy as np
import cupy as cp
import cudf; print('CuDF Version:', cudf.__version__)
import warnings
warnings.filterwarnings('ignore')


Pandas Version: 1.2.4
CuDF Version: 21.06.01+2.g101fc0fda4



For backwards compatibility, object dtype remains the default dtype inferred for a list of strings.

In [3]:
# Pandas

pd.Series(["a", "b", "c"])

0    a
1    b
2    c
dtype: object

In [4]:
# cuDF

cudf.Series(["a", "b", "c"])


0    a
1    b
2    c
dtype: object


To explicitly request string dtype, specify the dtype.

In [5]:
pd.Series(["a", "b", "c"], dtype="string")

0    a
1    b
2    c
dtype: string

In [6]:
cudf.Series(["a", "b", "c"], dtype="str")

0    a
1    b
2    c
dtype: object


Or use astype after the Series or DataFrame is created.

In [7]:
pandasSeries = pd.Series(["a", "b", "c"])
print('Original: ')
print(pandasSeries.astype("string"))

print("\n# of 'n': ")
print(pandasSeries.str.count('n'))

Original: 
0    a
1    b
2    c
dtype: string

# of 'n': 
0    0
1    0
2    0
dtype: int64


In [8]:
cudfSeries = cudf.Series(["a", "b", "c"])
print('Original: ')
print(cudfSeries.astype("string"))

print("\n# of 'n': ")
print(cudfSeries.str.count('n'))

Original: 
0    a
1    b
2    c
dtype: object

# of 'n': 
0    0
1    0
2    0
dtype: int32



You can also use StringDtype/"string" as the dtype on non-string data and it will be converted to string dtype:

In [9]:
pandasSeries = pd.Series(["a", 2, np.nan], dtype="string")
print(pandasSeries)
type(pandasSeries[1])

0       a
1       2
2    <NA>
dtype: string


str

In [10]:
cudfSeries = cudf.Series(["a", 2, np.nan], dtype="str")
print(cudfSeries)
type(cudfSeries[1])

0       a
1       2
2    <NA>
dtype: object


str


or convert from existing pandas data:

In [11]:
pandasSeries = pd.Series([1, 2, np.nan], dtype="Int64")

pandasSeries2 = pandasSeries.astype("string")
print(pandasSeries2)
type(pandasSeries2[0])

0       1
1       2
2    <NA>
dtype: string


str

In [12]:
cudfSeries1 = cudf.Series([1, 2, np.nan], dtype="int64")

cudfSeries2 = cudfSeries1.astype("string")
print(cudfSeries2)
type(cudfSeries2[0])

0       1
1       2
2    <NA>
dtype: object


str


## Behavior differences

These are places where the behavior of StringDtype objects differs from object dtype.

For StringDtype, string accessor methods that return numeric output will always return a nullable integer dtype, rather than either int or float dtype, depending on the presence of NA values. Methods returning boolean output will return a nullable boolean dtype.

In [13]:
pandasSeries = pd.Series(["a", None, "b"], dtype="string")
print('Original: ')
print(pandasSeries)
print("# of 'a': ")
print(pandasSeries.str.count("a"))
print("\n# of 'a' after dropping n/a: ")
print(pandasSeries.dropna().str.count("a"))
print("\nCheck if numeric: ")
print(pandasSeries.str.isnumeric())


Original: 
0       a
1    <NA>
2       b
dtype: string
# of 'a': 
0       1
1    <NA>
2       0
dtype: Int64

# of 'a' after dropping n/a: 
0    1
2    0
dtype: Int64

Check if numeric: 
0    False
1     <NA>
2    False
dtype: boolean


In [14]:
cudfSeries = cudf.Series(["a", None, "b"], dtype="str")
print('Original: ')
print(cudfSeries)
print("# of 'a': ")
print(cudfSeries.str.count("a"))
print("\n# of 'a' after dropping n/a: ")
print(cudfSeries.dropna().str.count("a"))
print("\nCheck if numeric: ")
print(cudfSeries.str.isnumeric())

Original: 
0       a
1    <NA>
2       b
dtype: object
# of 'a': 
0       1
1    <NA>
2       0
dtype: int32

# of 'a' after dropping n/a: 
0    1
2    0
dtype: int32

Check if numeric: 
0    False
1     <NA>
2    False
dtype: bool



In pandas, both outputs are Int64 dtype; in cuDF, both are int32. Compare that with object dtype.

In [15]:
pandasSeries2 = pd.Series(["a", None, "b"], dtype="object")
print("# of 'a': ")
print(pandasSeries2.str.count("a"))
print("\n# of 'a' after dropping n/a: ")
pandasSeries2.dropna().str.count("a")

# of 'a': 
0    1.0
1    NaN
2    0.0
dtype: float64

# of 'a' after dropping n/a: 


0    1
2    0
dtype: int64

In [16]:
cudfSeries2 = cudf.Series(["a", None, "b"], dtype="object")
print("# of 'a': ")
print(cudfSeries2.str.count("a"))
print("\n# of 'a' after dropping n/a: ")
cudfSeries2.dropna().str.count("a")

# of 'a': 
0       1
1    <NA>
2       0
dtype: int32

# of 'a' after dropping n/a: 


0    1
2    0
dtype: int32


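When NA values are present, the pandas output dtype for object data is float64, while cuDF keeps int32 and marks the missing entry as <NA>. Methods returning boolean values on the string-dtype Series behave as follows.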

In [17]:
print("Check if digit: ")
print(pandasSeries.str.isdigit())
print("\nMatch against 'a': ")
pandasSeries.str.match("a")

Check if digit: 
0    False
1     <NA>
2    False
dtype: boolean

Match against 'a': 


0     True
1     <NA>
2    False
dtype: boolean

In [18]:
print("Check if digit: ")
print(cudfSeries.str.isdigit())
print("\nMatch against 'a': ")
cudfSeries.str.match("a")

Check if digit: 
0    False
1     <NA>
2    False
dtype: bool

Match against 'a': 


0     True
1     <NA>
2    False
dtype: bool

<br />

Some string methods, like Series.str.decode() are not available on StringArray because StringArray only holds strings, not bytes.

In comparison operations, arrays.StringArray and Series backed by a StringArray will return an object with BooleanDtype, rather than a bool dtype object. Missing values in a StringArray will propagate in comparison operations, rather than always comparing unequal like numpy.nan.
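
The comparison behavior described above can be checked with a short pandas-only sketch. The variable names are illustrative, and the example makes no claim about how cuDF handles the same comparison.

```python
import pandas as pd

values = ["a", None, "b"]
as_string = pd.Series(values, dtype="string")
as_object = pd.Series(values, dtype="object")

# StringDtype: nullable boolean result, the missing value propagates.
string_result = as_string == "a"
# object dtype: plain bool result, the missing value compares unequal.
object_result = as_object == "a"

print(string_result.dtype, string_result.isna().tolist())
print(object_result.dtype, object_result.tolist())
```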

Everything else that follows in the rest of this document applies equally to string and object dtype.


## String methods

Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. These are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods:

In [19]:
pandasSeries = pd.Series(
    ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
)
print('Original: ')
print(pandasSeries)
print('\nLowered: ')
print(pandasSeries.str.lower())
print('\nCheck if Lowered: ')
print(pandasSeries.str.islower())
print('\nUppercase: ')
print(pandasSeries.str.upper())
print('\nCheck if Uppercase: ')
print(pandasSeries.str.isupper())
print('\nDetermine Length: ')
pandasSeries.str.len()



Original: 
0       A
1       B
2       C
3    Aaba
4    Baca
5    <NA>
6    CABA
7     dog
8     cat
dtype: string

Lowered: 
0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: string

Check if Lowered: 
0    False
1    False
2    False
3    False
4    False
5     <NA>
6    False
7     True
8     True
dtype: boolean

Uppercase: 
0       A
1       B
2       C
3    AABA
4    BACA
5    <NA>
6    CABA
7     DOG
8     CAT
dtype: string

Check if Uppercase: 
0     True
1     True
2     True
3    False
4    False
5     <NA>
6     True
7    False
8    False
dtype: boolean

Determine Length: 


0       1
1       1
2       1
3       4
4       4
5    <NA>
6       4
7       3
8       3
dtype: Int64

In [20]:
cudfSeries = cudf.Series(
    ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="str"
)

print('Original: ')
print(cudfSeries)
print('\nLowered: ')
print(cudfSeries.str.lower())
print('\nCheck if Lowered: ')
print(cudfSeries.str.islower())
print('\nUppercase: ')
print(cudfSeries.str.upper())
print('\nCheck if Uppercase: ')
print(cudfSeries.str.isupper())
print('\nDetermine Length: ')
cudfSeries.str.len()


Original: 
0       A
1       B
2       C
3    Aaba
4    Baca
5    <NA>
6    CABA
7     dog
8     cat
dtype: object

Lowered: 
0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: object

Check if Lowered: 
0    False
1    False
2    False
3    False
4    False
5     <NA>
6    False
7     True
8     True
dtype: bool

Uppercase: 
0       A
1       B
2       C
3    AABA
4    BACA
5    <NA>
6    CABA
7     DOG
8     CAT
dtype: object

Check if Uppercase: 
0     True
1     True
2     True
3    False
4    False
5     <NA>
6     True
7    False
8    False
dtype: bool

Determine Length: 


0       1
1       1
2       1
3       4
4       4
5    <NA>
6       4
7       3
8       3
dtype: int32
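
The element-wise behavior shown above can be pictured in plain Python: apply the scalar string method to every element and pass missing values through unchanged. The helper name `map_skipping_missing` is hypothetical and only mimics the idea; it is not how pandas or cuDF implement the `.str` accessor.

```python
from typing import Callable, List, Optional


def map_skipping_missing(
    values: List[Optional[str]], method: Callable[[str], object]
) -> list:
    """Apply a scalar string method to each element, leaving None untouched."""
    return [None if value is None else method(value) for value in values]


data = ["A", "Aaba", None, "dog"]
lowered = map_skipping_missing(data, str.lower)
lengths = map_skipping_missing(data, len)
print(lowered)
print(lengths)
```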

In [21]:
pandasIdx = pd.Index([" jack", "jill ", " jesse ", "frank"])

pandasIdx.str.strip()

print('Right Strip: ')
print(pandasIdx.str.rstrip())

print('\nLeft Strip: ')
pandasIdx.str.lstrip()


Right Strip: 
Index([' jack', 'jill', ' jesse', 'frank'], dtype='object')

Left Strip: 


Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object')

In [22]:
cudfIdx = cudf.Index([" jack", "jill ", " jesse ", "frank"])

cudfIdx.str.strip()

print('Right Strip: ')
print(cudfIdx.str.rstrip())

print('\nLeft Strip: ')
cudfIdx.str.lstrip()


Right Strip: 
StringIndex([' jack' 'jill' ' jesse' 'frank'], dtype='object')

Left Strip: 


StringIndex(['jack' 'jill ' 'jesse ' 'frank'], dtype='object')


The string methods on Index are especially useful for cleaning up or transforming DataFrame columns. For instance, you may have columns with leading or trailing whitespace:

In [23]:
pandasDataFrame = pd.DataFrame(np.random.randn(3, 2), columns=[" Column A ", " Column B "], index=range(3))
   
pandasDataFrame

   Column A  Column B
0 -1.738195  0.177494
1 -1.751751  2.403963
2  0.629295  0.415875


In [24]:
cudfDataFrame = cudf.DataFrame(np.random.randn(3, 2), columns=[" Column A ", " Column B "], index=range(3))
   
cudfDataFrame

   Column A  Column B
0 -1.851840 -0.172700
1  0.330880  0.101063
2  0.259782 -1.017952



Since df.columns is an Index object, we can use the .str accessor.

In [25]:
print("Stripped: ")
print(pandasDataFrame.columns.str.strip())
print("\nLowered: ")
pandasDataFrame.columns.str.lower()

Stripped: 
Index(['Column A', 'Column B'], dtype='object')

Lowered: 


Index([' column a ', ' column b '], dtype='object')

In [26]:
print("Stripped: ")
print(cudfDataFrame.columns.str.strip())
print("\nLowered: ")
cudfDataFrame.columns.str.lower()

Stripped: 
Index(['Column A', 'Column B'], dtype='object')

Lowered: 


Index([' column a ', ' column b '], dtype='object')


These string methods can then be used to clean up the columns as needed. Here we are removing leading and trailing whitespaces, lower casing all names, and replacing any remaining whitespaces with underscores:

In [27]:
pandasDataFrame.columns = pandasDataFrame.columns.str.strip().str.lower().str.replace(" ", "_")
pandasDataFrame

   column_a  column_b
0 -1.738195  0.177494
1 -1.751751  2.403963
2  0.629295  0.415875


In [28]:
cudfDataFrame.columns = cudfDataFrame.columns.str.strip().str.lower().str.replace(" ", "_")
cudfDataFrame

   column_a  column_b
0 -1.851840 -0.172700
1  0.330880  0.101063
2  0.259782 -1.017952


## Splitting and replacing strings

Methods like split return a Series of lists:

In [29]:
pandasSeries3 = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"], dtype="string")
pandasSeries3.str.split("_")

0    [a, b, c]
1    [c, d, e]
2         <NA>
3    [f, g, h]
dtype: object

In [30]:
cudfSeries3 = cudf.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"], dtype="str")
cudfSeries3.str.split("_")

0    [a, b, c]
1    [c, d, e]
2         None
3    [f, g, h]
dtype: list


It is easy to expand this to return a DataFrame using expand.

In [31]:
pandasSeries3.str.split("_", expand=True)

      0     1     2
0     a     b     c
1     c     d     e
2  <NA>  <NA>  <NA>
3     f     g     h


In [32]:
cudfSeries3.str.split("_", expand=True)

      0     1     2
0     a     b     c
1     c     d     e
2  <NA>  <NA>  <NA>
3     f     g     h



When the original Series has StringDtype, the output columns will all be StringDtype as well.

It is also possible to limit the number of splits:

In [33]:
pandasSeries3.str.split("_", expand=True, n=1)

      0     1
0     a   b_c
1     c   d_e
2  <NA>  <NA>
3     f   g_h


In [34]:
cudfSeries3.str.split("_", expand=True, n=1)

      0     1
0     a   b_c
1     c   d_e
2  <NA>  <NA>
3     f   g_h



rsplit is similar to split except it works in the reverse direction, i.e., from the end of the string to the beginning of the string:

In [35]:
pandasSeries3.str.rsplit("_", expand=True, n=1)

      0     1
0   a_b     c
1   c_d     e
2  <NA>  <NA>
3   f_g     h


In [36]:
cudfSeries3.str.rsplit("_", expand=True, n=1)

      0     1
0   a_b     c
1   c_d     e
2  <NA>  <NA>
3   f_g     h
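
The difference between split and rsplit with a limited number of splits can also be checked per element with Python's built-in string methods. This is a plain-Python analogy under the assumption that `maxsplit=1` plays the role of `n=1` in the cells above; it does not use pandas or cuDF.

```python
values = ["a_b_c", "c_d_e", None, "f_g_h"]

# Split once starting from the left; missing values are passed through.
from_left = [None if v is None else v.split("_", 1) for v in values]
# Split once starting from the right.
from_right = [None if v is None else v.rsplit("_", 1) for v in values]

print(from_left)
print(from_right)
```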


## The replace method


replace optionally uses regular expressions:

In [37]:
pandasSeries4 = pd.Series(
    ["A", "B", "C", "Aaba", "Baca", "", np.nan, "CABA", "dog", "cat"],
    dtype="string",
)
print('Original: ')
print(pandasSeries4) 
print('\nReplaced: ')
pandasSeries4.str.replace("^.a|dog", "XX-XX ", regex=True)

Original: 
0       A
1       B
2       C
3    Aaba
4    Baca
5        
6    <NA>
7    CABA
8     dog
9     cat
dtype: string

Replaced: 


0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5            
6        <NA>
7        CABA
8      XX-XX 
9     XX-XX t
dtype: string

In [38]:
cudfSeries4 = cudf.Series(
    ["A", "B", "C", "Aaba", "Baca", "", np.nan, "CABA", "dog", "cat"],
    dtype="str",
)
print('Original: ')
print(cudfSeries4) 
print('\nReplaced: ')
cudfSeries4.str.replace("^.a|dog", "XX-XX ",  regex=True)

Original: 
0       A
1       B
2       C
3    Aaba
4    Baca
5        
6    <NA>
7    CABA
8     dog
9     cat
dtype: object

Replaced: 


0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5            
6        <NA>
7        CABA
8      XX-XX 
9     XX-XX t
dtype: object


If you want literal replacement of a string (equivalent to str.replace()), you can set the optional regex parameter to False, rather than escaping each character. In this case both pat and repl must be strings:

In [39]:
pandasdollars = pd.Series(["12", "-$10", "$10,000"], dtype="string")

# These lines are equivalent
print(pandasdollars.str.replace(r"-\$", "-", regex=True))
print("\nAre these equivalent? \n")
pandasdollars.str.replace("-$", "-", regex=False)

0         12
1        -10
2    $10,000
dtype: string

Are these equivalent? 



0         12
1        -10
2    $10,000
dtype: string

In [40]:
cudfDollars = cudf.Series(["12", "-$10", "$10,000"], dtype="str")

# These lines are equivalent
print(cudfDollars.str.replace(r"-\$", "-", regex=True))
print("\nAre these equivalent? \n")
cudfDollars.str.replace("-$", "-", regex=False)

0         12
1        -10
2    $10,000
dtype: object

Are these equivalent? 



0         12
1        -10
2    $10,000
dtype: object
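
Why the escaping matters can be seen with Python's `re` module on the same values. This is a standard-library sketch of the regex-versus-literal distinction, not a description of the pandas or cuDF internals.

```python
import re

dollars = ["12", "-$10", "$10,000"]

# Regex replacement with "$" escaped: matches the literal text "-$".
regex_escaped = [re.sub(r"-\$", "-", s) for s in dollars]

# Regex replacement without escaping: "$" anchors to the end of the string,
# so the pattern means "a hyphen at the very end" and nothing is replaced here.
regex_unescaped = [re.sub("-$", "-", s) for s in dollars]

# Literal replacement: no escaping needed.
literal = [s.replace("-$", "-") for s in dollars]

# re.escape builds an escaped pattern from literal text.
via_escape = [re.sub(re.escape("-$"), "-", s) for s in dollars]

print(regex_escaped, regex_unescaped, literal, via_escape)
```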

## Concatenation


There are several ways to concatenate a Series or Index, either with itself or with others, all based on Series.str.cat() and Index.str.cat() respectively.

### Concatenating a single Series into a string

The content of a Series (or Index) can be concatenated:

In [41]:
pandasSeries = pd.Series(["a", "b", "c", "d"], dtype="string")

pandasSeries.str.cat(sep=",")

'a,b,c,d'

In [42]:
cudfSeries = cudf.Series(["a", "b", "c", "d"], dtype="str")

cudfSeries.str.cat(sep=",")

'a,b,c,d'


If not specified, the keyword sep for the separator defaults to the empty string, sep='':

In [43]:
pandasSeries.str.cat()

'abcd'

In [44]:
cudfSeries.str.cat()

'abcd'


By default, missing values are ignored. Using na_rep, they can be given a representation:

In [45]:
pandasSeriesB = pd.Series(["a", "b", np.nan, "d"], dtype="string")
print('Separated by ,: ')
print(pandasSeriesB.str.cat(sep=","))
print('\nSeparated by , & -: ')
print(pandasSeriesB.str.cat(sep=",", na_rep="-"))

Separated by ,: 
a,b,d

Separated by , & -: 
a,b,-,d


In [46]:
cudfSeriesB = cudf.Series(["a", "b", np.nan, "d"], dtype="str")
print('Separated by ,: ')
print(cudfSeriesB.str.cat(sep=","))
print('\nSeparated by , & -: ')
print(cudfSeriesB.str.cat(sep=",", na_rep="-"))

Separated by ,: 
a,b,d

Separated by , & -: 
a,b,-,d
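
The handling of missing values when joining a single Series can be sketched in plain Python. The helper `join_values` is hypothetical and written only to mirror the `sep` and `na_rep` behavior shown above; it is not the library implementation.

```python
from typing import List, Optional


def join_values(
    values: List[Optional[str]], sep: str = "", na_rep: Optional[str] = None
) -> str:
    """Join strings with sep; skip None unless na_rep supplies a stand-in."""
    if na_rep is None:
        kept = [v for v in values if v is not None]
    else:
        kept = [na_rep if v is None else v for v in values]
    return sep.join(kept)


data = ["a", "b", None, "d"]
print(join_values(data, sep=","))
print(join_values(data, sep=",", na_rep="-"))
```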


### Concatenating a Series and something list-like into a Series

The first argument to cat() can be a list-like object, provided that it matches the length of the calling Series (or Index).

In [47]:
pandasSeries.str.cat(["A", "B", "C", "D"])

0    aA
1    bB
2    cC
3    dD
dtype: string

In [48]:
cudfSeries.str.cat(["A", "B", "C", "D"])

0    aA
1    bB
2    cC
3    dD
dtype: object

Missing values on either side will result in missing values in the result as well, unless na_rep is specified:

In [49]:
print('Original: ')
print(pandasSeries.str.cat(pandasSeriesB))
print('\nna_rep is specified')
pandasSeries.str.cat(pandasSeriesB, na_rep="-")

Original: 
0      aa
1      bb
2    <NA>
3      dd
dtype: string

na_rep is specified


0    aa
1    bb
2    c-
3    dd
dtype: string

In [50]:
print('Original: ')
print(cudfSeries.str.cat(cudfSeriesB))
print('\nna_rep is specified')
cudfSeries.str.cat(cudfSeriesB, na_rep="-")

Original: 
0      aa
1      bb
2    <NA>
3      dd
dtype: object

na_rep is specified


0    aa
1    bb
2    c-
3    dd
dtype: object


### Concatenating a Series and something array-like into a Series

The parameter others can also be two-dimensional. In this case, the number of rows must match the length of the calling Series (or Index).

In [51]:
pandasArray = pd.concat([pandasSeriesB, pandasSeries], axis=1)
print('Original: ')
print(pandasSeries)
print('\nConcatenating a Series and something array-like')
print(pandasArray)
pandasSeries.str.cat(pandasArray, na_rep="-")


Original: 
0    a
1    b
2    c
3    d
dtype: string

Concatenating a Series and something array-like
      0  1
0     a  a
1     b  b
2  <NA>  c
3     d  d


0    aaa
1    bbb
2    c-c
3    ddd
dtype: string

In [52]:
cudfArray = cudf.concat([cudfSeriesB, cudfSeries], axis=1)
print('Original: ')
print(cudfSeries)
print('\nConcatenating a Series and something array-like')
print(cudfArray)
cudfArray[1].str.cat(cudfArray[0], na_rep="-").str.cat(cudfSeries, na_rep="-")

Original: 
0    a
1    b
2    c
3    d
dtype: object

Concatenating a Series and something array-like
      0  1
0     a  a
1     b  b
2  <NA>  c
3     d  d


0    aaa
1    bbb
2    c-c
3    ddd
Name: 1, dtype: object


## Indexing with .str

You can use [] notation to index directly by position. If you index past the end of the string, pandas returns a missing value (<NA> for string dtype), whereas the cuDF output below shows an empty string.

In [53]:
pandasSeries = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string")
   
print('Indexed at position 0: ')
print(pandasSeries.str[0])
print('\nIndexed at position 1: ')
pandasSeries.str[1]

Indexed at position 0: 
0       A
1       B
2       C
3       A
4       B
5    <NA>
6       C
7       d
8       c
dtype: string

Indexed at position 1: 


0    <NA>
1    <NA>
2    <NA>
3       a
4       a
5    <NA>
6       A
7       o
8       a
dtype: string

In [54]:
cudfSeries = cudf.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="str")
   
print('Indexed at position 0: ')
print(cudfSeries.str[0])
print('\nIndexed at position 1: ')
cudfSeries.str[1]

Indexed at position 0: 
0       A
1       B
2       C
3       A
4       B
5    <NA>
6       C
7       d
8       c
dtype: object

Indexed at position 1: 


0        
1        
2        
3       a
4       a
5    <NA>
6       A
7       o
8       a
dtype: object
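
The two results in the outputs above (a missing value in pandas, an empty string in cuDF) resemble two plain-Python idioms: bounds-checked indexing and slicing. The helper names below are hypothetical, they assume non-negative positions, and they are an analogy rather than a description of either library's internals.

```python
from typing import Optional


def char_or_missing(text: Optional[str], position: int) -> Optional[str]:
    """Return the character at position, or None if text is missing or too short."""
    if text is None or not 0 <= position < len(text):
        return None
    return text[position]


def char_or_empty(text: Optional[str], position: int) -> Optional[str]:
    """Return a one-character slice; slicing past the end yields an empty string."""
    if text is None:
        return None
    return text[position:position + 1]


data = ["A", "Aaba", None, "dog"]
print([char_or_missing(v, 1) for v in data])
print([char_or_empty(v, 1) for v in data])
```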


## Extracting substrings

Extract first match in each subject (extract).

In [55]:
pdSeries = pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"([ab])(\d)")
print(pdSeries)

      0     1
0     a     1
1     b     2
2  <NA>  <NA>


In [56]:
cudfSeries = cudf.Series(['a1', 'b2', 'c3']).str.extract(r'([ab])(\d)')
print(cudfSeries)    

      0     1
0     a     1
1     b     2
2  <NA>  <NA>



Extracting a regular expression with one group returns a DataFrame with one column if expand=True.

In [57]:
pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"[ab](\d)", expand=True)

      0
0     1
1     2
2  <NA>


In [58]:
cudf.Series(["a1", "b2", "c3"], dtype="str").str.extract(r"[ab](\d)", expand=True)

      0
0     1
1     2
2  <NA>



It returns a Series if expand=False.

In [59]:
pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"[ab](\d)", expand=False)

0       1
1       2
2    <NA>
dtype: string

In [60]:
cudf.Series(["a1", "b2", "c3"], dtype="str").str.extract(r"[ab](\d)", expand=False)

0       1
1       2
2    <NA>
dtype: object
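
For intuition, the per-element effect of extract can be approximated with Python's `re` module: search for the first match and return its capture groups, or a missing value if nothing matches. `extract_groups` is a hypothetical helper for this sketch, not part of pandas or cuDF, and cuDF's regex engine may differ from `re` in the syntax it supports.

```python
import re
from typing import List, Optional, Tuple


def extract_groups(
    values: List[str], pattern: str
) -> List[Optional[Tuple[str, ...]]]:
    """Return the capture groups of the first match per string, or None."""
    compiled = re.compile(pattern)
    results = []
    for value in values:
        found = compiled.search(value)
        results.append(found.groups() if found else None)
    return results


two_groups = extract_groups(["a1", "b2", "c3"], r"([ab])(\d)")
one_group = extract_groups(["a1", "b2", "c3"], r"[ab](\d)")
print(two_groups)
print(one_group)
```

Each tuple corresponds to one row of the extracted result, with one entry per capture group.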


The cells below construct a Series in which each subject string has exactly one match:

In [61]:
pandasSeries = pd.Series(["a3", "b3", "c2"], dtype="string")
print(pandasSeries)

0    a3
1    b3
2    c2
dtype: string


In [62]:
cudfSeries = cudf.Series(["a3", "b3", "c2"], dtype="str")
print(cudfSeries)

0    a3
1    b3
2    c2
dtype: object



## Testing for strings that match or contain a pattern

You can check whether elements contain a pattern:

In [63]:
pattern = r"[0-9][a-z]"

pd.Series(["1", "2", "3a", "3b", "03c", "4dx"],dtype="str",
         ).str.contains(pattern)
   

0    False
1    False
2     True
3     True
4     True
5     True
dtype: bool

In [64]:
pattern = r"[0-9][a-z]"

cudf.Series(["1", "2", "3a", "3b", "03c", "4dx"],dtype="str",
         ).str.contains(pattern)
   

0    False
1    False
2     True
3     True
4     True
5     True
dtype: bool


Or whether elements match a pattern:

In [65]:
pd.Series(["1", "2", "3a", "3b", "03c", "4dx"],dtype="string",
         ).str.match(pattern)
   

0    False
1    False
2     True
3     True
4    False
5     True
dtype: boolean

In [66]:
cudf.Series(["1", "2", "3a", "3b", "03c", "4dx"],dtype="str",
         ).str.match(pattern) 

0    False
1    False
2     True
3     True
4    False
5     True
dtype: bool


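fullmatch (new in pandas 1.1.0) tests whether the entire string matches the pattern. Note that the cuDF cell below calls match rather than fullmatch, so its result for "4dx" differs from the pandas result.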

In [67]:
pd.Series(["1", "2", "3a", "3b", "03c", "4dx"],dtype="string",
         ).str.fullmatch(pattern)
    

0    False
1    False
2     True
3     True
4    False
5    False
dtype: boolean

In [68]:
cudf.Series(["1", "2", "3a", "3b", "03c", "4dx"],dtype="str",
         ).str.match(pattern)

0    False
1    False
2     True
3     True
4    False
5     True
dtype: bool
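
The distinction between contains, match, and fullmatch parallels `re.search`, `re.match`, and `re.fullmatch` in Python's standard library. The sketch below applies all three to the same values; it is an analogy for the semantics, and cuDF uses its own regex engine rather than `re`.

```python
import re

pattern = re.compile(r"[0-9][a-z]")
values = ["1", "2", "3a", "3b", "03c", "4dx"]

# contains: the pattern may match anywhere in the string.
contains = [pattern.search(v) is not None for v in values]
# match: the pattern must match at the start of the string.
match = [pattern.match(v) is not None for v in values]
# fullmatch: the pattern must match the entire string.
fullmatch = [pattern.fullmatch(v) is not None for v in values]

print(contains)
print(match)
print(fullmatch)
```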


Methods like match, fullmatch, contains, startswith, and endswith take an extra na argument so missing values can be considered True or False:

In [69]:
pandasSeries5 = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string")   
print("Strings that contain 'A':")
print(pandasSeries5.str.contains("A", na=False))
print("\nStrings that have swapped case:")
print(pandasSeries5.str.swapcase())
print("\nStrings that start with 'b':")
print(pandasSeries5.str.startswith('b'))
print("\nStrings that ends with 'a':")
print(pandasSeries5.str.endswith('a'))

Strings that contain 'A':
0     True
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8    False
dtype: boolean

Strings that have swapped case:
0       a
1       b
2       c
3    aABA
4    bACA
5    <NA>
6    caba
7     DOG
8     CAT
dtype: string

Strings that start with 'b':
0    False
1    False
2    False
3    False
4    False
5     <NA>
6    False
7    False
8    False
dtype: boolean

Strings that ends with 'a':
0    False
1    False
2    False
3     True
4     True
5     <NA>
6    False
7    False
8    False
dtype: boolean


In [70]:
cudfSeries5 = cudf.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="str")   
print("Strings that contain 'A':")
print(cudfSeries5.str.contains("A"))
print("\nStrings that have swapped case:")
print(cudfSeries5.str.swapcase())
print("\nStrings that start with 'b':")
print(cudfSeries5.str.startswith('b'))
print("\nStrings that ends with 'a':")
print(cudfSeries5.str.endswith('a'))

Strings that contain 'A':
0     True
1    False
2    False
3     True
4    False
5     <NA>
6     True
7    False
8    False
dtype: bool

Strings that have swapped case:
0       a
1       b
2       c
3    aABA
4    bACA
5    <NA>
6    caba
7     DOG
8     CAT
dtype: object

Strings that start with 'b':
0    False
1    False
2    False
3    False
4    False
5     <NA>
6    False
7    False
8    False
dtype: bool

Strings that ends with 'a':
0    False
1    False
2    False
3     True
4     True
5     <NA>
6    False
7    False
8    False
dtype: bool
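
The effect of the na argument is easiest to see side by side. The following pandas-only sketch uses a shortened, made-up Series and startswith; whether cuDF accepts the same argument for every method is not assumed here.

```python
import pandas as pd

s = pd.Series(["A", "Baca", None, "dog"], dtype="string")

default_result = s.str.startswith("B")      # missing stays missing
na_false = s.str.startswith("B", na=False)  # missing counted as False
na_true = s.str.startswith("B", na=True)    # missing counted as True

print(default_result)
print(na_false)
print(na_true)
```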
