# String Manipulation and Regex in Python

In Python, a **string** is a sequence of characters enclosed in single quotes (' '), double quotes (" "), or triple quotes for multi-line text. Strings are a fundamental data type for representing text. They are immutable, so every string method returns a new string instead of modifying the original.

**Counting Occurrences**:  
You can count the number of occurrences of a specific character or substring within a string using the `count` method.  
  
**Sub-String Slicing**:  
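To extract a sub-string from a string, you can use slicing. Slicing uses square brackets with `start:stop` indices, where `stop` is exclusive; for example, `s[0:10]` returns the first ten characters.  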
  
**Joining Strings**:  
To join a list of string objects into a single string with a specified separator, you can use the `join` method.

In [None]:
import re

In [None]:
words = ["Hello", "world", "Python"]
sentence = " ".join(words)
print(sentence)  # Output: "Hello world Python"

In [None]:
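# Example inputs (illustrative values so that this cell runs on its own).
x = ["Hello", "world", "Python"]
y = "banana bread and avocado toast"
x1 = "My secret code is 0000."
x2 = "Hello@world~Python#programming"

# Join the list of string objects using a space as separator.
print(" ".join(x))
# Count the number of occurrences of 'a'.
print(y.count('a'))
# Sub-string: the first 10 characters.
print(y[0:10])

# Substitute every digit with 'X' (raw strings keep backslashes intact).
print(re.sub(r'\d', 'X', x1))

# Substitute the matching special characters with a space.
x2_modified = re.sub(r'[@~*#%+-]', ' ', x2)
print(x2_modified)

# Collapse runs of whitespace into a single space.
x2_final = re.sub(r'\s+', ' ', x2_modified)
print(x2_final)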

##### Regex cheatsheet [\[1\]](https://regexr.com/)

`.` - matches any character, except newline.

`\d`, `\s`, `\S` - match a digit, a whitespace character, a non-whitespace character.

`\b`, `\B` - word boundary, not a word boundary.

`[xyz]` - matches x, y or z.

`[^xyz]` - matches anything that is not x, y or z.

`[x-z]` - matches a character between x and z.

`^xyz$` - `^` is the start of the string, `$` is the end of the string.

`\.` - use escaping to match special characters.

`\t`, `\n` - matches tab and newline.

`x*` - matches 0 or more symbols x.

`x+` - matches 1 or more symbols x.

`x?` - matches 0 or 1 symbol x.

`??`, `*?`, `+?`, etc - non-greedy (lazy) versions of the quantifiers; they match as few characters as possible.

`x{5}` - matches exactly 5 symbols x.

`x{5,}` - matches 5 or more symbols x.

`x{5,8}` - matches between 5 and 8 symbols x (no space inside the braces).

`xy|yz` - matches `xy` or `yz`.
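
A few of the cheatsheet entries can be tried directly with `re.findall` and `re.search`. The snippet below is a minimal sketch; the `sample` string and the variable names are made up for illustration.

```python
import re

# Illustrative input, not taken from the notebook.
sample = "cat bat rat 2024-01-15 <b>bold</b> <i>it</i>"

# [xyz]: one character out of a set.
set_matches = re.findall(r"[cb]at", sample)

# x{n}: exact repetition counts.
date_like = re.findall(r"\d{4}-\d{2}-\d{2}", sample)

# Greedy `.*` runs to the last '>', lazy `.*?` stops at the first one.
greedy = re.findall(r"<.*>", sample)
lazy = re.findall(r"<.*?>", sample)

# \b: match "at" only as a whole word.
whole_word = re.findall(r"\bat\b", "at bat at")

# ^ and $: the pattern must cover the entire string.
exact = bool(re.search(r"^cat$", "cat"))
not_exact = bool(re.search(r"^cat$", "cats"))

# xy|yz: alternation.
either = re.findall(r"xy|yz", "xy yz xz")

print(set_matches)   # ['cat', 'bat']
print(date_like)     # ['2024-01-15']
print(greedy)        # ['<b>bold</b> <i>it</i>']
print(lazy)          # ['<b>', '</b>', '<i>', '</i>']
print(whole_word)    # ['at', 'at']
print(exact, not_exact)  # True False
print(either)        # ['xy', 'yz']
```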

In [None]:
text = "titanic titan life"
print("\nLooking for \"titanic\":")
match_titanic = re.search("titanic", text)
print(match_titanic)

print("Looking for \"titan\":")
match_titan = re.search("titan", text)
print(match_titan)

print("\nLooking for \"life\":")
match_life = re.search("life", text)
print(match_life)



Looking for "titanic":
<re.Match object; span=(0, 7), match='titanic'>
Looking for "titan":
<re.Match object; span=(0, 5), match='titan'>

Looking for "life":
<re.Match object; span=(14, 18), match='life'>



Here are some key observations:

1. When no match is found, `search()` returns `None`.

2. The `Match` object exposes the start and end indices of the match through `match.start()` and `match.end()`.

3. If a pattern occurs several times in the text, `search()` returns only the first occurrence.

4. To retrieve all matches of a pattern in a given text, you can use the `findall()` function. In this case, the matched portions of the text are returned as a list of strings, rather than as individual `Match` objects.

5. Notice that the search for "titan" returns the span (0, 5), which lies inside the word "titanic": `search()` matches substrings, not whole words, and stops at the first hit. To find every occurrence, use `findall()`.
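
The observations above can be checked with a short sketch on the same `text`. The search term "iceberg" and the variable names are illustrative choices; `re.finditer` is added here as one way to keep the positions of every match.

```python
import re

text = "titanic titan life"

# search() returns None when nothing matches, so test the result before using it.
missing = re.search("iceberg", text)
print(missing)  # None

# start(), end() and group() describe the first match only.
first = re.search("titan", text)
print(first.start(), first.end(), first.group())  # 0 5 titan

# findall() returns every non-overlapping match as a list of strings.
all_titans = re.findall("titan", text)
print(all_titans)  # ['titan', 'titan']

# finditer() yields Match objects, which keeps the span of each match.
spans = [m.span() for m in re.finditer("titan", text)]
print(spans)  # [(0, 5), (8, 13)]
```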

In [None]:
# Substitute the matching pattern with 'X' using re.sub()
# Hiding digits
x1 = 'My secret code is 0000.  I       am         damn     far.        I like space'
x1_modified = re.sub(r'\d', 'X', x1)
x1_modified

'My secret code is XXXX.  I       am         damn     far.        I like space'

In [None]:
# Remove the excessive spaces using re.sub()
x1_final = re.sub(r'\s+', ' ', x1_modified)
x1_final

'My secret code is XXXX. I am damn far. I like space'

In [None]:
# Cleaning a messy string
x2 = "Hello@world~Python#programming"
# Replace special characters with spaces
x2_modified = re.sub(r'[@~*#%+-]', ' ', x2)
# Remove excessive spaces
x2_final = re.sub(r'\s+', ' ', x2_modified)
print(x2_final)  # Output: "Hello world Python programming"

Hello world Python programming


In [None]:
def extract_numbers(text):
    # Regex for integers and floating-point numbers, each with an optional sign.
    # The non-capturing group makes the sign apply to both alternatives.
    pattern = r'[-+]?(?:\d*\.\d+|\d+)'

    # Use findall to extract all matched numbers from the text
    numbers = re.findall(pattern, text)

    return numbers

# Test the function with example text
text1 = "There are 3 apples and 2.5 bananas in the basket."
text2 = "No numbers in this text."

numbers1 = extract_numbers(text1)
numbers2 = extract_numbers(text2)

print("Numbers in text1:", numbers1)  # Numbers in text1: ['3', '2.5']
print("Numbers in text2:", numbers2)  # Numbers in text2: []

Numbers in text1: ['3', '2.5']
Numbers in text2: []


In [None]:
def extract_dates(text):
    # Regex for date-shaped strings in three formats: YYYY-MM-DD, DD-MM-YYYY, and MM/DD/YYYY.
    # It checks only the digit layout, not whether the month or day is in range.
    date_pattern = r'\d{4}-\d{2}-\d{2}|\d{2}-\d{2}-\d{4}|\d{2}/\d{2}/\d{4}'

    # Use findall to extract all matched dates from the text
    dates = re.findall(date_pattern, text)

    return dates


# Test the function with example text
text1 = "Let's meet for a coffee at 2023-09-23 at 06:00:00, right when the caffeine fairy sprinkles magic dust."
text2 = "Are you being serious about that? Can we reschedule for a time when the sun is out on the 24-09-2023?"
text3 = "What about on 09/23/2023 at 3:30 AM, when the world is still asleep?"
text4 = "If you don't make time for me, we're gonna break up on the 09/27/2023, and I'll start a new career as a hermit."
text5 = "No funny dates in this text, just serious business."


dates1 = extract_dates(text1)
dates2 = extract_dates(text2)
dates3 = extract_dates(text3)
dates4 = extract_dates(text4)
dates5 = extract_dates(text5)


print("Dates in text1:", dates1)
print("Dates in text2:", dates2)
print("Dates in text3:", dates3)
print("Dates in text4:", dates4)
print("Dates in text5:", dates5)

Dates in text1: ['2023-09-23']
Dates in text2: ['24-09-2023']
Dates in text3: ['09/23/2023']
Dates in text4: ['09/27/2023']
Dates in text5: []
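
The pattern above accepts any string with the right digit layout, including impossible dates such as `2023-13-45`. One possible way to filter those out is to pass each candidate through `datetime.strptime`. This is a sketch under assumptions: the helper name `extract_valid_dates`, the `DATE_FORMATS` table, and the added `\b` word boundaries are illustrative and not part of the original function.

```python
import re
from datetime import datetime

# Each regex is paired with the strptime format it is assumed to represent.
DATE_FORMATS = [
    (r"\b\d{4}-\d{2}-\d{2}\b", "%Y-%m-%d"),
    (r"\b\d{2}-\d{2}-\d{4}\b", "%d-%m-%Y"),
    (r"\b\d{2}/\d{2}/\d{4}\b", "%m/%d/%Y"),
]

def extract_valid_dates(text):
    """Return date-shaped substrings that also parse as real calendar dates."""
    valid = []
    for pattern, fmt in DATE_FORMATS:
        for candidate in re.findall(pattern, text):
            try:
                datetime.strptime(candidate, fmt)
            except ValueError:
                continue  # right shape, but month or day out of range
            valid.append(candidate)
    return valid

checked = extract_valid_dates("Due 2023-09-23, not 2023-13-45 or 31/02/2023.")
print(checked)  # ['2023-09-23']
```

Results are grouped by format rather than by position in the text, because each pattern is scanned separately.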


In [None]:
# Function to extract phone numbers
def extract_phone_numbers(text):
    phone_pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
    phone_numbers = re.findall(phone_pattern, text)
    return phone_numbers

# Function to extract emails
def extract_emails(text):
    # The final [A-Za-z]{2,7} matches the top-level domain (e.g. com, org).
    email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b'
    emails = re.findall(email_pattern, text)
    return emails

difficult_job ='''Dear Applicant,

We have an exciting job opportunity for you!300k a year! We are looking for a talented individual to join our team.
Please send your resume and cover letter to hr@example.com. You should also contact John Doe: Phone - (123) 456-7890, Email - john@example.com.

And don't forget to contact Jane Smith: Phone - 555-123-4567, Email - jane@example.com.
She is a very important person, and she has a sweet tooth for chocolates. If you really want to make a good impression, think about going to this location and buying some chocolates for her.
Once you arrive at the office in this address: 123 Main St, Springfield, IL 12345, take a picture and send it to the WhatsApp group.
Then we will evaluate if you did all the steps correctly and get back to you using this email. Once you receive the email, please let us know by going to this  1234 Elm Street Springfield, IL 12345United States street and pressing a button.

Also, did I mention the time you spent on this journey? Time is valuable. John's phone number is (123) 456-7890, and Jane's phone number is 555-123-4567. You can reach them anytime.

Best regards,
Your Future Employer'''

# Test the functions on the applicant's text
phone_numbers = extract_phone_numbers(difficult_job)
emails = extract_emails(difficult_job)


print("Phone Numbers:", phone_numbers)
print("Emails:", emails)


## Tokenization
Tokenization is the process of breaking a text (a sequence of characters) into smaller units called tokens, usually words or subwords, so that it is easier to handle in natural language processing tasks. An example makes this concrete.

Imagine you have a paragraph of text:

"John's dog, Max, loves chasing after tennis balls in the park. It's his favorite activity!"

Tokenizing this text would involve splitting it into individual units or tokens, which could result in something like:

["John's", "dog", ",", "Max", ",", "loves", "chasing", "after", "tennis", "balls", "in", "the", "park", ".", "It's", "his", "favorite", "activity", "!"]

Tokenization helps in various NLP tasks like text analysis, sentiment analysis, machine translation, and more, as it provides a structured representation of the text that algorithms can work with effectively.
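
The token list shown above can be reproduced with a single regular expression. This is a minimal sketch, not a general-purpose tokenizer; the pattern and the names `paragraph` and `token_pattern` are illustrative.

```python
import re

paragraph = "John's dog, Max, loves chasing after tennis balls in the park. It's his favorite activity!"

# Either a word with an optional apostrophe suffix (John's, It's),
# or a single character that is neither a word character nor whitespace.
token_pattern = r"\w+(?:'\w+)?|[^\w\s]"

tokens = re.findall(token_pattern, paragraph)
print(tokens)
```

Keeping `John's` and `It's` as single tokens is a design choice; other tokenizers split the apostrophe suffix into its own token.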

In [2]:
import re
import nltk

from nltk.tokenize import RegexpTokenizer

In [7]:
text = "John's dog, Max, loves chasing after tennis balls in the park. It's his favorite activity..."

In [None]:
# Separate text by spaces
print(text.split())

In [8]:
# Words, or dollar amounts, or any other run of non-whitespace characters.
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
tokens = tokenizer.tokenize(text)
print(tokens)

['John', "'s", 'dog', ',', 'Max', ',', 'loves', 'chasing', 'after', 'tennis', 'balls', 'in', 'the', 'park', '.', 'It', "'s", 'his', 'favorite', 'activity', '...']


In [9]:
from nltk.tokenize import BlanklineTokenizer
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import WhitespaceTokenizer

In [12]:
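# Splits on blank lines, so a single paragraph stays one token.
tokenized_text_1 = BlanklineTokenizer().tokenize(text)
print(tokenized_text_1)
# Splits on whitespace only, so punctuation stays attached to words.
tokenized_text_3 = WhitespaceTokenizer().tokenize(text)
print(tokenized_text_3)

# Splits into runs of word characters and runs of punctuation.
tokenized_text_2 = WordPunctTokenizer().tokenize(text)
print(tokenized_text_2)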

["John's dog, Max, loves chasing after tennis balls in the park. It's his favorite activity..."]
["John's", 'dog,', 'Max,', 'loves', 'chasing', 'after', 'tennis', 'balls', 'in', 'the', 'park.', "It's", 'his', 'favorite', 'activity...']
['John', "'", 's', 'dog', ',', 'Max', ',', 'loves', 'chasing', 'after', 'tennis', 'balls', 'in', 'the', 'park', '.', 'It', "'", 's', 'his', 'favorite', 'activity', '...']
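
The last output can be approximated without NLTK. To my knowledge the NLTK documentation describes `WordPunctTokenizer` in terms of the pattern `\w+|[^\w\s]+`; the sketch below assumes that and uses plain `re.findall` to compare it with the custom pattern passed to `RegexpTokenizer` earlier. The variable names are illustrative.

```python
import re

text = "John's dog, Max, loves chasing after tennis balls in the park. It's his favorite activity..."

# Assumed equivalent of WordPunctTokenizer: word runs or punctuation runs.
word_punct = re.findall(r"\w+|[^\w\s]+", text)

# The custom pattern used with RegexpTokenizer earlier in the notebook.
custom = re.findall(r"\w+|\$[\d\.]+|\S+", text)

# Tokens produced by one pattern but not by the other.
only_in_word_punct = [t for t in word_punct if t not in custom]
only_in_custom = [t for t in custom if t not in word_punct]

print(only_in_word_punct)  # ["'", 's', "'", 's']
print(only_in_custom)      # ["'s", "'s"]
```

On this sentence the two patterns differ only in how they treat the possessive suffix.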


Note that `WordPunctTokenizer()` produces output close to that of the `RegexpTokenizer` we defined first; the visible difference is that it splits `'s` into `'` and `s`. Splitting text into word and punctuation tokens in this way is a common baseline when tokenization is discussed.