- Note: These notes follow Chapter 7 of *Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython* (3rd ed.) (2023, Wes McKinney)

# 1. Python Built-In String Methods

- Python Built-In String Methods
  - `count`
  - `endswith`: Return `True` if string ends with suffix
  - `startswith`: Return `True` if string starts with prefix
  - `join`
  - `index`: Return the starting index of the first occurrence; raise `ValueError` if the substring is not found
  - `find`: Return the starting index of the first occurrence; return `-1` if the substring is not found
  - `rfind`
  - `replace`
  - `strip, rstrip, lstrip`: Trim whitespace
  - `split`
  - `lower`
  - `upper`
  - `casefold`
  - `ljust, rjust`: Left or right justify
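
Several of the methods listed above are not demonstrated in the cells that follow, so here is a minimal sketch of them. The sample string and all variable names are made up for illustration.

```python
text = "a,b,  guido"

# split + strip: break on commas, then trim each piece
pieces = [x.strip() for x in text.split(",")]

# join: glue the pieces back together with a delimiter
joined = "::".join(pieces)

# find returns -1 when the substring is missing; index raises ValueError
found_at = joined.find("guido")
missing = joined.find("z")
try:
    joined.index("z")
    index_raised = False
except ValueError:
    index_raised = True

# rfind searches from the right-hand side
last_colon = joined.rfind(":")

# replace with an empty string deletes every occurrence
no_delims = joined.replace("::", "")

# casefold is a more aggressive lower(), intended for caseless comparison
same = "Straße".casefold() == "STRASSE".casefold()

# ljust pads on the right up to a fixed width
padded = "ab".ljust(5, ".")

print(pieces, joined, found_at, missing, last_colon, no_delims, same, padded)
```

Prefer `find` when a missing substring is an expected case, and `index` when it should be treated as an error.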

#### `split()`: Break a comma-separated string into pieces

In [24]:
fruit = "apple, banana,carrot,     pineapple,apple"

print('Before using split(): ', fruit)
print('After using split(): ', fruit.split(","))

Before using split():  apple, banana,carrot,     pineapple,apple
After using split():  ['apple', ' banana', 'carrot', '     pineapple', 'apple']


#### `strip()`: Trim leading and trailing whitespace (including line breaks)

In [25]:
fruit_basket = [x.strip() for x in fruit.split(",")]

fruit_basket

['apple', 'banana', 'carrot', 'pineapple', 'apple']

#### `in` keyword: Check membership (here, whether the list contains an element)

In [26]:
"carrot" in fruit_basket

True

#### `count()`: Return the number of occurrences (here, of an element in the list)

In [27]:
fruit_basket.count("apple")

2
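
Because `fruit_basket` is a list, the two cells above test list elements rather than substrings. The sketch below contrasts the same operations on the original string and on the list. It reuses the `fruit` string from this section, and the other variable names are illustrative.

```python
fruit = "apple, banana,carrot,     pineapple,apple"
fruit_basket = [x.strip() for x in fruit.split(",")]

# On a string, `in` and count() work on substrings
pine_in_string = "pine" in fruit            # matches inside "pineapple"
apples_in_string = fruit.count("apple")     # also counts the "apple" in "pineapple"

# On a list, `in` and count() compare whole elements
pine_in_list = "pine" in fruit_basket
apples_in_list = fruit_basket.count("apple")

print(pine_in_string, apples_in_string, pine_in_list, apples_in_list)
```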

# 2. Regular Expressions

- Regular Expressions
  - Search or match string patterns in text
  - `re` module: Apply regular expressions to string
      - Pattern matching
      - Substitution
      - Splitting
  - Common regex functions and methods
      - `findall`
      - `finditer`: Return an iterator
      - `match`
      - `search`
      - `split`
      - `sub, subn`

In [5]:
# Import module
import re

## 2.1. Basics

#### `\s+`: One or more whitespace characters

In [28]:
fruit = "apple    banana\t pineapple  \tmelon"

re.split(r"\s+", fruit)

['apple', 'banana', 'pineapple', 'melon']

#### `re.compile()`: Compile the regex

In [30]:
regex = re.compile(r"\s+")

regex.split(fruit)

['apple', 'banana', 'pineapple', 'melon']

#### `findall()`: Get a list of all substrings matching the regex

In [23]:
regex.findall(fruit)

['    ', '\t ', '  \t']
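
The remaining functions from the list at the top of this section (`search`, `match`, `finditer`, `sub`, `subn`) follow the same pattern. This is a minimal sketch on a made-up string, with illustrative variable names.

```python
import re

text = "foo    bar\t baz  \tqux"
regex = re.compile(r"\s+")

# search scans the whole string and returns the first match (or None)
first = regex.search(text)
first_span = (first.start(), first.end())

# match only succeeds if the pattern matches at the start of the string
at_start = regex.match(text)

# finditer yields match objects lazily instead of building a list
spans = [m.span() for m in regex.finditer(text)]

# sub replaces every match; subn also reports how many replacements were made
dashed = regex.sub("-", text)
dashed_n = regex.subn("-", text)

print(first_span, at_start, spans, dashed, dashed_n)
```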

## 2.2. Exercise: Identify email addresses

In [42]:
# Sample email addresses
email = """Karen karen@google.com
            Lina lina@gmail.com
            Alice alice@gmail.com
            Joy joy@google.com"""

#### `re.IGNORECASE`: Make the regex case-insensitive

In [34]:
# Pattern
pattern = r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"

# Regex
regex = re.compile(pattern, flags=re.IGNORECASE)

#### `findall()`: Create a list of email addresses

In [37]:
email_list = regex.findall(email)

email_list

['karen@google.com', 'lina@gmail.com', 'alice@gmail.com', 'joy@google.com']

### Get email addresses split into three components: username, domain name, domain suffix

In [39]:
# Pattern: Wrap each of the three parts in parentheses to create capture groups
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

# Compile the regex again with the new pattern
regex = re.compile(pattern, flags=re.IGNORECASE)

#### `groups()`: Return a tuple of the captured pattern components

In [41]:
test = regex.match("meow@google.com")

test.groups()

('meow', 'google', 'com')

In [43]:
regex.findall(email)

[('karen', 'google', 'com'),
 ('lina', 'gmail', 'com'),
 ('alice', 'gmail', 'com'),
 ('joy', 'google', 'com')]
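
Two details of the grouped regex are easy to miss: `match` only succeeds at the start of the string, and `sub` can refer to captured groups with `\1`, `\2`, and so on. This short sketch shows both, using the same pattern as above. The sample line and the replacement template are illustrative.

```python
import re

pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
regex = re.compile(pattern, flags=re.IGNORECASE)

line = "Karen karen@google.com"

# match fails here: the address does not begin at position 0
from_start = regex.match(line)

# search finds the address anywhere in the line
found = regex.search(line)
parts = found.groups()

# \1, \2, \3 in the replacement refer to the three capture groups
rewritten = regex.sub(r"\1 at \2 dot \3", line)

print(from_start, parts, rewritten)
```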

# 3. String Functions in `pandas`

In [49]:
# Import module
import pandas as pd
import numpy as np

In [52]:
# Generate data
data = {"Karen": "karen@google.com",
       "Lina": "lina@gmail.com",
       "Alice": "alice@microsoft.com",
       "Joy": "joy@google.com",
       "June": np.nan}

# Create a Series
data = pd.Series(data)

data

Karen       karen@google.com
Lina          lina@gmail.com
Alice    alice@microsoft.com
Joy           joy@google.com
June                     NaN
dtype: object

In [53]:
# Check which entries are missing (NaN)
data.isna()

Karen    False
Lina     False
Alice    False
Joy      False
June      True
dtype: bool

#### `str.contains()`: Check whether each string contains a given substring

In [54]:
data.str.contains("gmail")

Karen    False
Lina      True
Alice    False
Joy      False
June       NaN
dtype: object
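
The `NaN` in the result above means it cannot be used directly as a boolean mask. One possible workaround, sketched here, is the `na` argument of `str.contains`, which fills in a value for missing entries. The data is the same as in this section, and `mask` and `gmail_users` are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.Series({"Karen": "karen@google.com",
                  "Lina": "lina@gmail.com",
                  "Alice": "alice@microsoft.com",
                  "Joy": "joy@google.com",
                  "June": np.nan})

# na=False treats missing entries as "does not contain"
mask = data.str.contains("gmail", na=False)

# The mask is now safe to use for boolean indexing
gmail_users = data[mask]

print(mask)
print(gmail_users)
```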

#### `astype()`: Change data type

In [55]:
data_string = data.astype('string')

data_string

Karen       karen@google.com
Lina          lina@gmail.com
Alice    alice@microsoft.com
Joy           joy@google.com
June                    <NA>
dtype: string

#### Regular expressions with `re` flags

In [57]:
# Pattern
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

# Apply the regex to each element, passing `re` flags such as IGNORECASE
data.str.findall(pattern, flags=re.IGNORECASE)

Karen       [(karen, google, com)]
Lina          [(lina, gmail, com)]
Alice    [(alice, microsoft, com)]
Joy           [(joy, google, com)]
June                           NaN
dtype: object

#### `str.get()`: Retrieve the element at a given position from each entry

In [61]:
# Take the first match (a tuple) from each list of matches
data_vectorized = data.str.findall(pattern, flags=re.IGNORECASE).str[0]

data_vectorized

Karen       (karen, google, com)
Lina          (lina, gmail, com)
Alice    (alice, microsoft, com)
Joy           (joy, google, com)
June                         NaN
dtype: object

In [65]:
# Get the element at index 1 (the domain name) from each tuple
data_vectorized.str.get(1)

Karen       google
Lina         gmail
Alice    microsoft
Joy         google
June           NaN
dtype: object
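
As an alternative to chaining `findall`, `str[0]`, and `str.get`, `str.extract` returns the capture groups as the columns of a DataFrame in one step. This is a minimal sketch using the same data and pattern. The column names are assigned here purely for readability.

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({"Karen": "karen@google.com",
                  "Lina": "lina@gmail.com",
                  "Alice": "alice@microsoft.com",
                  "Joy": "joy@google.com",
                  "June": np.nan})

pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

# One column per capture group; rows with no match are filled with NaN
parts = data.str.extract(pattern, flags=re.IGNORECASE)
parts.columns = ["username", "domain", "suffix"]

print(parts)
```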