In [1]:
import pandas as pd 
import numpy as np
print(pd.__version__)

2.0.2


pandas is a powerful data processing and analysis tool, and one reason it is so widely used is its convenient interfaces for common data types. Type-specific methods are grouped under accessors (for example, `.str` for text), so the calling style stays consistent across data types. This keeps the library easy to learn and use while remaining flexible and extensible.

# Text data type

In pandas, two data types are commonly used to represent text data: `object` and `StringDtype`. The `object` data type is the default for columns containing strings and is compatible with a wide range of data, because it can hold arbitrary Python objects. `StringDtype`, in contrast, is a dedicated data type for strings that was introduced in pandas 1.0.

## `object`
By default, when we load or create a DataFrame in pandas, columns containing text data are inferred as the object data type.

In [2]:
pd.Series(data = ["Kevin Durant", "Luka Doncic", "Jayson Tatum", 
                  "James Harden", "Stephen Curry", "Russell Westbrook"], name = "NBA players")

0         Kevin Durant
1          Luka Doncic
2         Jayson Tatum
3         James Harden
4        Stephen Curry
5    Russell Westbrook
Name: NBA players, dtype: object

## `string`

In [3]:
pd.Series(data = ["Kevin Durant", "Luka Doncic", "Jayson Tatum", 
                  "James Harden", "Stephen Curry", "Russell Westbrook"], name = "NBA players",
         dtype = pd.StringDtype())

0         Kevin Durant
1          Luka Doncic
2         Jayson Tatum
3         James Harden
4        Stephen Curry
5    Russell Westbrook
Name: NBA players, dtype: string

## Type Conversion

In [4]:
pd.Series(data = ["Kevin Durant", "Luka Doncic", "Jayson Tatum", 
                  "James Harden", "Stephen Curry", "Russell Westbrook"]).astype('string')

0         Kevin Durant
1          Luka Doncic
2         Jayson Tatum
3         James Harden
4        Stephen Curry
5    Russell Westbrook
dtype: string

In [5]:
pd.Series(data = ["Kevin Durant", "Luka Doncic", "Jayson Tatum", 
                  "James Harden", "Stephen Curry", "Russell Westbrook"], dtype = 'string').astype('object')

0         Kevin Durant
1          Luka Doncic
2         Jayson Tatum
3         James Harden
4        Stephen Curry
5    Russell Westbrook
dtype: object

## String operations
In pandas, both Series and Index objects provide a set of string processing methods that allow for convenient operations on text data. These methods are accessible through the `.str` accessor.

By using the `.str` accessor, we can apply a string operation to each element of a Series or Index object. The first cell below does this by hand with `map`; the following cells use the accessor instead.

In [6]:
pd.Series(data = ["Kevin Durant", "Luka Doncic", "Jayson Tatum", 
                  "James Harden", "Stephen Curry", "Russell Westbrook"]).map(lambda x: x.lower())

0         kevin durant
1          luka doncic
2         jayson tatum
3         james harden
4        stephen curry
5    russell westbrook
dtype: object

In [7]:
# same result using the .str accessor
pd.Series(data = ["Kevin Durant", "Luka Doncic", "Jayson Tatum", 
                  "James Harden", "Stephen Curry", "Russell Westbrook"]).str.lower()

0         kevin durant
1          luka doncic
2         jayson tatum
3         james harden
4        stephen curry
5    russell westbrook
dtype: object
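
The `map` call and the `.str` call give the same result on clean data. A minimal sketch of where they differ, assuming a hypothetical Series with one missing entry (the data and variable names here are illustrative, not from the notebook above):

```python
import pandas as pd

# Hypothetical data with one missing entry
players = pd.Series(["Kevin Durant", None, "Luka Doncic"], dtype=object)

# .str methods skip missing entries and leave them missing
lowered = players.str.lower()

# Calling the Python method directly fails on the missing entry
try:
    players.map(lambda x: x.lower())
    map_failed = False
except AttributeError:
    map_failed = True

# The accessor is also available on an Index
idx_upper = pd.Index(["pts", "ast"]).str.upper()

print(lowered)
print(map_failed)
print(idx_upper)
```

This is one reason to prefer the accessor over `map` with a lambda when a column may contain missing values.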

In [8]:
pd.Series(data = ["Kevin Durant", "Luka Doncic", "Jayson Tatum", 
                  "James Harden", "Stephen Curry", "Russell Westbrook"]).str.len()

0    12
1    11
2    12
3    12
4    13
5    17
dtype: int64

In [9]:
pd.Series(data = ["Kevin Durant", "Luka Doncic", "Jayson Tatum", 
                  "James Harden", "Stephen Curry", "Russell Westbrook"]).str.replace(' ', '_')

0         Kevin_Durant
1          Luka_Doncic
2         Jayson_Tatum
3         James_Harden
4        Stephen_Curry
5    Russell_Westbrook
dtype: object

In [10]:
NBA_players = pd.Series(data = ["Kevin Durant", "Luka Doncic", "Jayson Tatum", 
                  "James Harden", "Stephen Curry", "Russell Westbrook"])
NBA_players.str.contains('Kevin') # return boolean values 

0     True
1    False
2    False
3    False
4    False
5    False
dtype: bool

In [11]:
NBA_players[NBA_players.str.contains('Kevin')] 

0    Kevin Durant
dtype: object

## Differences between `object` and `string`

There are three main differences between the string type (`StringDtype`) and the `object` type in pandas:

* **Nullable return types**: `.str` methods that return numbers or booleans (e.g., `str.count`, `str.isdigit`) always return a nullable dtype (`Int64`, `boolean`) when applied to string type data. With object type data, the return dtype depends on whether missing values are present (for example, `int64` without missing values but `float64` with them).

* **Compatibility with string methods**: Some `.str` methods are not available for string type data. For example, `Series.str.decode()` operates on bytes, so it applies to object type data holding bytes but not to `StringDtype`, which holds only strings.

* **Missing values handling**: The string type represents missing values as `pd.NA`, while the object type typically represents them as `np.nan` (a floating-point value) or `None`.

In general, it is recommended to use the string type (`StringDtype`) for string data in pandas: it handles missing values consistently and guarantees that the column holds only text, whereas an `object` column can hold any mix of Python objects.
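
As a small sketch of the missing-value behaviour, the comparison below uses hypothetical data to show how a missing entry behaves under each dtype. The variable names are illustrative only.

```python
import pandas as pd

values = ["Kevin Durant", None, "Luka Doncic"]
obj_s = pd.Series(values, dtype=object)
str_s = pd.Series(values, dtype="string")

# object dtype: the comparison returns plain bool, the missing entry compares as False
obj_eq = obj_s == "Kevin Durant"

# string dtype: the comparison returns nullable boolean, the missing entry stays <NA>
str_eq = str_s == "Kevin Durant"

print(obj_eq)
print(str_eq)
print(str_s[1] is pd.NA)
```

With the string type, the "unknown" status of the missing entry survives the comparison instead of silently turning into `False`.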

In [12]:
# object dtype: result is float64 because of the missing value
pd.Series(data = ["Kevin Durant", "Luka Doncic", None, "Jayson Tatum", 
                  "James Harden", "Stephen Curry", "Russell Westbrook"]).str.count('A')

0    0.0
1    0.0
2    NaN
3    0.0
4    0.0
5    0.0
6    0.0
dtype: float64

In [13]:
# Int64 
pd.Series(data = ["Kevin Durant", "Luka Doncic", None, "Jayson Tatum", 
                  "James Harden", "Stephen Curry", "Russell Westbrook"],
         dtype = pd.StringDtype()).str.count('A')

0       0
1       0
2    <NA>
3       0
4       0
5       0
6       0
dtype: Int64

In [14]:
pd.Series(data = ["Kevin Durant", "Luka Doncic", None, "Jayson Tatum", 
                  "James Harden", "Stephen Curry", "Russell Westbrook"]).str.isdigit()

0    False
1    False
2     None
3    False
4    False
5    False
6    False
dtype: object

In [15]:
pd.Series(data = ["Kevin Durant", "Luka Doncic", None, "Jayson Tatum", 
                  "James Harden", "Stephen Curry", "Russell Westbrook"],
         dtype = pd.StringDtype()).str.isdigit()

0    False
1    False
2     <NA>
3    False
4    False
5    False
6    False
dtype: boolean

# Text processing methods


* `cat()`: Concatenate strings together.
* `split()`: Split strings based on a delimiter.
* `rsplit()`: Split strings from the end.
* `get()`: Index into each element to retrieve the `ith` element.
* `join()`: Join the strings within each element (e.g., a list of strings) using a separator.
* `get_dummies()`: Split strings on a delimiter and return a DataFrame of dummy variables.
* `contains()`: Return a boolean array indicating if each string contains a pattern or regex.
* `replace()`: Replace occurrences of a pattern or regex with another value.
* `repeat()`: Repeat values (`s.str.repeat(3)` is equivalent to `x * 3` for each string `x`).
* `pad()`: Pad strings with whitespace on the left, right, or both sides.
* `center()`, `ljust()`, `rjust()`, `zfill()`: Similar to their string counterparts.
* `wrap()`: Split long strings into lines of a given width.
* `slice()`: Slice each string in a Series, returning the substring between the given positions.
* `slice_replace()`: Replace a slice of each string with a specified value.
* `count()`: Count occurrences of a pattern.
* `startswith()`, `endswith()`: Check if each element starts or ends with a specified substring.
* `findall()`: Find all occurrences of patterns/regex in each string.
* `match()`: Call `re.match` on each element and return a boolean indicating whether the start of the string matches.
* `extract()`: Call `re.search` on each element and return a DataFrame with one row per element and one column per regex capture group.
* `extractall()`: Call `re.findall` on each element and return a DataFrame with one row per match and one column per captured group of each regex.
* `len()`: Compute the length of each string.
* `strip()`, `rstrip()`, `lstrip()`: Remove whitespace from both ends, from the right end only, or from the left end only.
* `partition()`, `rpartition()`: Equivalent to their string counterparts.
* `lower()`, `upper()`, `find()`, `rfind()`, `index()`, `rindex()`: Equivalent to their string counterparts.
* `capitalize()`, `swapcase()`, `translate()`: Equivalent to their string counterparts; `normalize()` returns the Unicode normal form (see `unicodedata.normalize`).
* `isalnum()`, `isalpha()`, `isdigit()`, `isspace()`, `islower()`, `isupper()`, `istitle()`, `isnumeric()`, `isdecimal()`: Check specific characteristics of each string.
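
A quick, non-exhaustive sketch of a few of the methods listed above, using a small made-up Series (the data is hypothetical):

```python
import pandas as pd

s = pd.Series(["a_b", "c_d"])

starts = s.str.startswith("a")             # does each string start with "a"?
repeated = s.str.repeat(2)                 # each string repeated twice
first_char = s.str.slice(0, 1)             # substring from position 0 to 1
rejoined = s.str.split("_").str.join("-")  # split into lists, then join each list
matched = s.str.match(r"[a-c]_")           # boolean: does the start match the regex?

print(starts.tolist())      # [True, False]
print(repeated.tolist())    # ['a_ba_b', 'c_dc_d']
print(first_char.tolist())  # ['a', 'c']
print(rejoined.tolist())    # ['a-b', 'c-d']
print(matched.tolist())     # [True, True]
```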


In [16]:
data = {
    "name": ["Tom", "Bob", "Mary", "James", "Eric", "Alice"],
    "city": ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix", "Philadelphia"],
    "hobby": ['Badminton;Table tennis', 'Baseball;American football', 'Dancing', 'Drawing;Singing', 'Swimming;Skiing', 'Running;Basketball'],
    "birth": ["2000-02-10", "1988-10-17", "2000-01-01", "1978-08-08", "2008-08-08", "1988-10-17"]
}
user_info = pd.DataFrame(data)
user_info

    name          city                       hobby       birth
0    Tom      New York      Badminton;Table tennis  2000-02-10
1    Bob   Los Angeles  Baseball;American football  1988-10-17
2   Mary       Chicago                     Dancing  2000-01-01
3  James       Houston             Drawing;Singing  1978-08-08
4   Eric       Phoenix             Swimming;Skiing  2008-08-08
5  Alice  Philadelphia          Running;Basketball  1988-10-17


## Split the text
Splitting and replacing are among the most common text processing operations. Splitting a string produces a list, and we can index or slice that list to extract the desired content. We can also expand the split result so that each piece goes into its own column.
### `str.split`

In [17]:
user_info.birth.str.split('-')

0    [2000, 02, 10]
1    [1988, 10, 17]
2    [2000, 01, 01]
3    [1978, 08, 08]
4    [2008, 08, 08]
5    [1988, 10, 17]
Name: birth, dtype: object

After splitting, the data type of the column is `object` because each cell now contains a list instead of a string. The `.str` accessor can still be used on the result: when each cell holds a list, `.str[i]` (or `.str.get(i)`) returns the `i`-th element of that list, and a slice such as `.str[1:3]` returns a sub-list. When a cell holds a plain string, `.str[i]` returns the `i`-th character instead.
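
A minimal sketch of indexing through the `.str` accessor, using a hypothetical two-row Series rather than the `user_info` frame, to contrast list cells with plain string cells:

```python
import pandas as pd

birth = pd.Series(["2000-02-10", "1988-10-17"])
parts = birth.str.split("-")   # each cell is now a list of three strings

year_from_list = parts.str[0]  # first list element of each cell
first_char = birth.str[0]      # on plain strings: first character
year_by_slice = birth.str[:4]  # on plain strings: first four characters

print(year_from_list.tolist())  # ['2000', '1988']
print(first_char.tolist())      # ['2', '1']
print(year_by_slice.tolist())   # ['2000', '1988']
```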

In [18]:
user_info['birth'].str.split('-').str[0]

0    2000
1    1988
2    2000
3    1978
4    2008
5    1988
Name: birth, dtype: object

In [19]:
user_info['birth'].str.split('-').str.get(1)

0    02
1    10
2    01
3    08
4    08
5    10
Name: birth, dtype: object

In [20]:
user_info['birth'].str.split('-').str[1:3]

0    [02, 10]
1    [10, 17]
2    [01, 01]
3    [08, 08]
4    [08, 08]
5    [10, 17]
Name: birth, dtype: object

The `expand` parameter controls whether to split the column into separate columns or keep it as a single column. The `n` parameter specifies the maximum number of splits to perform.

In [21]:
user_info['birth'].str.split('-', expand = True)

      0   1   2
0  2000  02  10
1  1988  10  17
2  2000  01  01
3  1978  08  08
4  2008  08  08
5  1988  10  17


In [22]:
date_columns = user_info['birth'].str.split('-', expand = True)

In [23]:
user_info_new = pd.concat([user_info, date_columns], axis = 1)
user_info_new

    name          city                       hobby       birth     0   1   2
0    Tom      New York      Badminton;Table tennis  2000-02-10  2000  02  10
1    Bob   Los Angeles  Baseball;American football  1988-10-17  1988  10  17
2   Mary       Chicago                     Dancing  2000-01-01  2000  01  01
3  James       Houston             Drawing;Singing  1978-08-08  1978  08  08
4   Eric       Phoenix             Swimming;Skiing  2008-08-08  2008  08  08
5  Alice  Philadelphia          Running;Basketball  1988-10-17  1988  10  17


In [24]:
# The n parameter specifies the maximum number of splits to perform
user_info['birth'].str.split('-',  n = 1)

0    [2000, 02-10]
1    [1988, 10-17]
2    [2000, 01-01]
3    [1978, 08-08]
4    [2008, 08-08]
5    [1988, 10-17]
Name: birth, dtype: object

In [25]:
user_info['birth'].str.split('-', n = 1, expand = True)

      0      1
0  2000  02-10
1  1988  10-17
2  2000  01-01
3  1978  08-08
4  2008  08-08
5  1988  10-17


If we have complex rules for splitting text, we can use regular expressions (regex) as the delimiter when using the `str.split()` function in pandas.

In [26]:
# raw string avoids invalid-escape warnings; [A-Za-z] matches letters only
user_info.city.str.split(r'[A-Za-z]\s[A-Za-z]', regex = True)

0         [Ne, ork]
1      [Lo, ngeles]
2         [Chicago]
3         [Houston]
4         [Phoenix]
5    [Philadelphia]
Name: city, dtype: object
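
Another sketch of a regex delimiter, assuming hypothetical hobby strings that mix `;` and `,` as separators. The `regex=True` argument makes the intent explicit, and `expand=True` pads shorter rows with missing values:

```python
import pandas as pd

hobbies = pd.Series(["Drawing; Singing,Dancing", "Running"])

# Split on either ';' or ',' followed by optional whitespace
parts = hobbies.str.split(r"[;,]\s*", regex=True)

# The same split expanded into columns; the shorter row is padded with missing values
wide = hobbies.str.split(r"[;,]\s*", regex=True, expand=True)

print(parts.tolist())  # [['Drawing', 'Singing', 'Dancing'], ['Running']]
print(wide)
```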

### `str.rsplit`
The `rsplit()` function in pandas is similar to the `split()` function, but it starts splitting the text from the right side instead of the left side. If we don't specify the `n` parameter, the output of `rsplit()` and `split()` will be the same.

In [27]:
user_info['birth'].str.rsplit('-', n = 1, expand = True)

         0   1
0  2000-02  10
1  1988-10  17
2  2000-01  01
3  1978-08  08
4  2008-08  08
5  1988-10  17


## Text replacement
### `str.replace`

In [28]:
user_info.birth.str.replace('-', '/')

0    2000/02/10
1    1988/10/17
2    2000/01/01
3    1978/08/08
4    2008/08/08
5    1988/10/17
Name: birth, dtype: object

In [29]:
# remove a single leading or trailing whitespace character (raw string for the regex)
user_info['city'].str.replace(r'\s$|^\s', '', regex = True)

0        New York
1     Los Angeles
2         Chicago
3         Houston
4         Phoenix
5    Philadelphia
Name: city, dtype: object

In [30]:
def func(match):
    # match.group(0) returns the entire matched substring; reverse it
    return match.group(0)[::-1]

# [A-Za-z]+ matches runs of letters; [A-z] would also match [ \ ] ^ _ `
user_info['city'].str.replace(r'[A-Za-z]+', func, regex = True)

0        weN kroY
1     soL selegnA
2         ogacihC
3         notsuoH
4         xineohP
5    aihpledalihP
Name: city, dtype: object
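
Besides a callable, the replacement can also be a string that refers back to capture groups, following the rules of Python's `re.sub`. A small sketch with hypothetical date strings:

```python
import pandas as pd

birth = pd.Series(["2000-02-10", "1988-10-17"])

# Reorder year-month-day into day/month/year using backreferences \1, \2, \3
swapped = birth.str.replace(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", regex=True)

# A callable receives the match object and can compute the replacement
next_year = birth.str.replace(
    r"^(\d{4})", lambda m: str(int(m.group(1)) + 1), regex=True
)

print(swapped.tolist())    # ['10/02/2000', '17/10/1988']
print(next_year.tolist())  # ['2001-02-10', '1989-10-17']
```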

## Text concatenation
### `str.cat`

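The `cat` method behaves differently depending on whether other objects are passed to it: with no `others` argument it joins the elements of one Series into a single string, and with `others` it concatenates element-wise.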

In [31]:
user_info.hobby

0        Badminton;Table tennis
1    Baseball;American football
2                       Dancing
3               Drawing;Singing
4               Swimming;Skiing
5            Running;Basketball
Name: hobby, dtype: object

#### For a single series:
The `cat` method concatenates all elements of the Series into a single string.

In [32]:
user_info.hobby.str.cat()  # Concatenate all elements of a series into one single string

'Badminton;Table tennisBaseball;American footballDancingDrawing;SingingSwimming;SkiingRunning;Basketball'

In [33]:
user_info.hobby.str.cat(sep = ';')

'Badminton;Table tennis;Baseball;American football;Dancing;Drawing;Singing;Swimming;Skiing;Running;Basketball'

#### For a pair of Series or DataFrames: 
The `cat` method concatenates the corresponding elements of the two objects element-wise, aligning them on their index.

In [34]:
user_info.hobby.str.cat(user_info.name, sep = '***')

0        Badminton;Table tennis***Tom
1    Baseball;American football***Bob
2                      Dancing***Mary
3             Drawing;Singing***James
4              Swimming;Skiing***Eric
5          Running;Basketball***Alice
Name: hobby, dtype: object

In [35]:
series1 = pd.Series(['a', 'b', 'c'])
series2 = pd.Series(['x', 'y', 'z'])
series1.str.cat(series2)

0    ax
1    by
2    cz
dtype: object

#### For multiple Series or DataFrames:

In [36]:
user_info.hobby.str.cat(user_info[['name', 'city']], sep = "***")

0           Badminton;Table tennis***Tom***New York
1    Baseball;American football***Bob***Los Angeles
2                          Dancing***Mary***Chicago
3                 Drawing;Singing***James***Houston
4                  Swimming;Skiing***Eric***Phoenix
5         Running;Basketball***Alice***Philadelphia
Name: hobby, dtype: object
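
The examples above contain no missing values. As a hedged sketch of how `cat` treats them, the hypothetical Series below has one missing entry. By default it is skipped when joining a single Series and yields a missing result when concatenating element-wise, and the `na_rep` parameter supplies a placeholder instead:

```python
import pandas as pd

s = pd.Series(["a", "b", None, "d"], dtype=object)

joined = s.str.cat(sep=" ")                     # missing entry is skipped
joined_rep = s.str.cat(sep=" ", na_rep="?")     # missing entry replaced by "?"

others = ["A", "B", "C", "D"]
pairwise = s.str.cat(others, sep=",")           # missing on either side gives a missing result
pairwise_rep = s.str.cat(others, sep=",", na_rep="-")

print(joined)                 # a b d
print(joined_rep)             # a b ? d
print(pairwise.tolist())
print(pairwise_rep.tolist())  # ['a,A', 'b,B', '-,C', 'd,D']
```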

#### Alignment

In [37]:
s = pd.Series(['a', 'b', 'c', 'd'], dtype = "string")
s

0    a
1    b
2    c
3    d
dtype: string

In [38]:
u = pd.Series(['b', 'd', 'a', 'c'],
              index = [1, 3, 0, 2],
              dtype ="string")
u 

1    b
3    d
0    a
2    c
dtype: string

In [39]:
s.str.cat(u)  # default join = 'left': u is aligned to the index of s before concatenating

0    aa
1    bb
2    cc
3    dd
dtype: string

In [40]:
s.str.cat(u, join = 'left')

0    aa
1    bb
2    cc
3    dd
dtype: string

In [41]:
s.str.cat(u, join = 'right')

1    bb
3    dd
0    aa
2    cc
dtype: string
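
In the cells above both Series share the same index labels, so only the row order changes. A sketch with partially overlapping indexes (hypothetical data) shows what the `join` options do when labels are missing on one side. `na_rep` is used here so that unmatched positions show up as `-`:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"])
t = pd.Series(["d", "a", "e", "c"], index=[3, 0, 4, 2])

left = s.str.cat(t, join="left", na_rep="-")    # keep the index of s
outer = s.str.cat(t, join="outer", na_rep="-")  # union of both indexes
inner = s.str.cat(t, join="inner", na_rep="-")  # only labels present in both

print(left.tolist())   # ['aa', 'b-', 'cc', 'dd']
print(outer.tolist())  # ['aa', 'b-', 'cc', 'dd', '-e']
print(inner.tolist())  # ['aa', 'cc', 'dd']
```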

## Text matching and extraction

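### `str.extract`
Extracts the capture groups of the first match of a regex pattern in each element. With `expand = True` the result is a DataFrame with one column per capture group; elements without a match yield `NaN`.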


In [42]:
# \w+ matches one or more word characters (letters, digits, underscore)
# \s+ matches one or more whitespace characters (space, tab, newline, etc.)
user_info.city.str.extract(r"(\w+)\s+", expand = True)

     0
0  New
1  Los
2  NaN
3  NaN
4  NaN
5  NaN


In [43]:
user_info.city.str.extract(r"(\w+)\s+(\w+)", expand = True)

     0        1
0  New     York
1  Los  Angeles
2  NaN      NaN
3  NaN      NaN
4  NaN      NaN
5  NaN      NaN
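
Two variations that may be useful, sketched on a hypothetical Series: named groups `(?P<name>...)` become column names, and `expand=False` with a single group returns a Series rather than a DataFrame.

```python
import pandas as pd

city = pd.Series(["New York", "Los Angeles", "Chicago"])

# Named capture groups give the result columns readable names
named = city.str.extract(r"(?P<first>\w+)\s+(?P<second>\w+)")

# With one group and expand=False the result is a Series
first_only = city.str.extract(r"(\w+)\s+", expand=False)

print(named)
print(first_only)
```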


### `str.extractall`
Unlike `str.extract()`, which only returns the first match of the pattern, `str.extractall()` returns every match. The result is a DataFrame with a hierarchical index: the first level is the original index and the second level, named `match`, numbers the matches within each element.

In [44]:
# match every word that is followed by whitespace
user_info.city.str.extractall(r"(\w+)\s+")

           0
  match     
0 0      New
1 0      Los
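
In the city data each row has at most one match, so the `match` level is always 0. A sketch with hypothetical strings that contain several matches per row makes the hierarchical index easier to see:

```python
import pandas as pd

codes = pd.Series(["a1b2", "c3", "xyz"])

# One row per match; rows without any match ("xyz") do not appear
digits = codes.str.extractall(r"(\d)")

# Selecting match 0 recovers the first match of each row
first = digits.xs(0, level="match")

print(digits)
print(first)
```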


### `str.contains`
Checks if each element contains a specified pattern and returns a boolean Series indicating the match.

In [45]:
user_info.city.str.contains("Los")

0    False
1     True
2    False
3    False
4    False
5    False
Name: city, dtype: bool

In [46]:
user_info.birth.str.contains(r"^2000")

0     True
1    False
2     True
3    False
4    False
5    False
Name: birth, dtype: bool
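
`str.contains` takes a few parameters that often matter in practice. The sketch below uses hypothetical data: `case=False` ignores case, `na=False` treats missing entries as non-matches so the result can be used as a mask, and `regex=False` matches the pattern literally.

```python
import pandas as pd

city = pd.Series(["Los Angeles", None, "St. Louis"], dtype=object)

# Case-insensitive search; missing entries count as False
ci = city.str.contains("los", case=False, na=False)

# "." as a regex matches any character; with regex=False it is a literal dot
dot_regex = city.str.contains(".", na=False)
dot_literal = city.str.contains(".", regex=False, na=False)

print(ci.tolist())           # [True, False, False]
print(dot_regex.tolist())    # [True, False, True]
print(dot_literal.tolist())  # [False, False, True]
print(city[ci].tolist())     # ['Los Angeles']
```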

### `str.get_dummies`

In [47]:
user_info.city.str.get_dummies(sep=" ")

   Angeles  Chicago  Houston  Los  New  Philadelphia  Phoenix  York
0        0        0        0    0    1             0        0     1
1        1        0        0    1    0             0        0     0
2        0        1        0    0    0             0        0     0
3        0        0        1    0    0             0        0     0
4        0        0        0    0    0             0        1     0
5        0        0        0    0    0             1        0     0
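
`get_dummies` is arguably a more natural fit for a delimited multi-value column such as `hobby`. A minimal sketch on a hypothetical three-row Series, splitting on `;`:

```python
import pandas as pd

hobby = pd.Series(["Badminton;Table tennis", "Dancing", "Dancing;Badminton"])

# One indicator column per distinct hobby, sorted by name
dummies = hobby.str.get_dummies(sep=";")

print(dummies)
print(dummies.sum().to_dict())  # how many rows mention each hobby
```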


### Others
* `str.lower()`: Converts the strings to lowercase.
* `str.upper()`: Converts the strings to uppercase.
* `str.title()`: Converts the strings to title case, where the first letter of each word is capitalized.
* `str.capitalize()`: Capitalizes the first letter of each string.
* `str.swapcase()`: Swaps the case of each character in the strings, converting lowercase to uppercase and vice versa.
* `str.casefold()`: Converts the strings to lowercase using the casefolding algorithm, which provides better support for non-English languages such as German.

In [48]:
s = pd.Series(['lower', 'CAPITALS', 'this is a sentence', 'SwApCaSe'])

In [49]:
s.str.lower()  # Converts the strings to lowercase

0                 lower
1              capitals
2    this is a sentence
3              swapcase
dtype: object

In [50]:
s.str.upper()  # Converts the strings to uppercase.

0                 LOWER
1              CAPITALS
2    THIS IS A SENTENCE
3              SWAPCASE
dtype: object

In [51]:
s.str.title() # Converts the strings to title case, where the first letter of each word is capitalized.

0                 Lower
1              Capitals
2    This Is A Sentence
3              Swapcase
dtype: object

In [52]:
s.str.swapcase()

0                 LOWER
1              capitals
2    THIS IS A SENTENCE
3              sWaPcAsE
dtype: object

In [53]:
s.str.casefold()

0                 lower
1              capitals
2    this is a sentence
3              swapcase
dtype: object
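
For the ASCII strings above, `lower()` and `casefold()` give identical results. A short sketch with the German letter `ß` (hypothetical data) shows where they differ:

```python
import pandas as pd

words = pd.Series(["Straße", "STRASSE"])

lowered = words.str.lower()    # keeps the ß
folded = words.str.casefold()  # folds ß to "ss"

print(lowered.tolist())  # ['straße', 'strasse']
print(folded.tolist())   # ['strasse', 'strasse']
```

Because both spellings fold to the same string, `casefold()` is the safer choice for case-insensitive comparisons.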

`str.center(10, fillchar='-')`: Centers the strings within a width of 10 characters and fills any remaining space with the specified fill character ('-').

`str.ljust(10, fillchar='-')`: Left aligns the strings within a width of 10 characters and fills any remaining space with the specified fill character ('-').

`str.rjust(10, fillchar='-')`: Right aligns the strings within a width of 10 characters and fills any remaining space with the specified fill character ('-').

`str.pad(width=10, side='left', fillchar='-')`: Specifies the width, alignment side ('left', 'right', 'both'), and fill character ('-') for padding the strings.

In [54]:
s.str.center(10, fillchar = '-')

0            --lower---
1            -CAPITALS-
2    this is a sentence
3            -SwApCaSe-
dtype: object

In [55]:
s.str.ljust(10, fillchar = '-')

0            lower-----
1            CAPITALS--
2    this is a sentence
3            SwApCaSe--
dtype: object

In [56]:
s.str.rjust(10, fillchar = '-')

0            -----lower
1            --CAPITALS
2    this is a sentence
3            --SwApCaSe
dtype: object

In [57]:
s.str.pad(width = 10, side = 'left', fillchar = '-')

0            -----lower
1            --CAPITALS
2    this is a sentence
3            --SwApCaSe
dtype: object
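
The other values of `side` can be sketched in the same way. The example below assumes an object-dtype Series with hypothetical data; `side='both'` gives the same result as `center`, and `side='right'` the same as `ljust`.

```python
import pandas as pd

s = pd.Series(["lower", "CAPITALS"], dtype=object)

both = s.str.pad(width=10, side="both", fillchar="-")
centered = s.str.center(10, fillchar="-")
right = s.str.pad(width=10, side="right", fillchar="-")
ljusted = s.str.ljust(10, fillchar="-")

print(both.tolist())   # ['--lower---', '-CAPITALS-']
print(right.tolist())  # ['lower-----', 'CAPITALS--']
```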

In [58]:
# zfill pads each string with leading zeros (0)
# until it reaches the specified width (here 20 characters)
s.str.zfill(20)

0    000000000000000lower
1    000000000000CAPITALS
2    00this is a sentence
3    000000000000SwApCaSe
dtype: object