### **Dealing with Strings**
When working on data scraping tasks, string manipulation is crucial for cleaning and processing the extracted data.

In [2]:
"""
Objective: f-string printing
"""
current_page = "https://www.python.org/page-3"
page_status_code = 200

# TODO: Use f-string to print the current page and page status code
# Expected output:
# Current page: https://www.python.org/page-3
# Page status code: 200
print(f"Current page: {current_page}")
print(f"Page status code: {page_status_code}")

Current page: https://www.python.org/page-3
Page status code: 200


In [3]:
"""
Objective: Extract text from URL using .split()
"""
current_page = "https://www.python.org/page-30"

# TODO: Extract the page number from the current_page
# Expected output: "30"
print(current_page.split("-")[1])

30
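
Indexing `[1]` after `split("-")` works here because the URL contains exactly one hyphen. A minimal sketch of a more defensive variant, assuming the page number is always the text after the last hyphen (the second URL is a hypothetical example, not part of the exercise):

```python
# Hypothetical URLs: the second one has extra hyphens in its path
urls = [
    "https://www.python.org/page-30",
    "https://www.python.org/my-blog/page-7",
]

page_numbers = []
for url in urls:
    # rsplit with maxsplit=1 splits only at the LAST hyphen
    page_number = url.rsplit("-", 1)[-1]
    page_numbers.append(page_number)
    print(page_number)
```

With plain `split("-")[1]`, the second URL would yield `"blog/page"` instead of the page number.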


In [4]:
"""
Objective: Extract number from URL using isdigit()
"""
texts = "This page has 30 results"

# TODO: Split the texts into list of words
# TODO: Check if word is a number
# TODO: Print the word
# Expected output: "30"
print(texts.split())
for word in texts.split():
    if word.isdigit():
        print(word)

['This', 'page', 'has', '30', 'results']
30


In [5]:
"""
Objective: Clean Unwanted Characters using .strip()
"""
raw_data = "  \n\t###Welcome to Python!###\t\n  "

# TODO: Remove the unwanted characters from raw_data
# Expected output: "Welcome to Python!"
print(raw_data.strip(' #\n\t'))


Welcome to Python!
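
Note that `.strip()` only trims characters from the two ends of the string; anything in the middle is left alone. A small sketch to illustrate, using a hypothetical variation of `raw_data` with extra `#` characters in the middle:

```python
# Hypothetical input: same padding as above, plus "###" in the middle
raw_data = "  \n\t###Welcome to ### Python!###\t\n  "

# strip() removes the listed characters from both ends only
cleaned = raw_data.strip(" #\n\t")
print(cleaned)

# lstrip() / rstrip() trim just one side
left_only = raw_data.lstrip(" #\n\t")
right_only = raw_data.rstrip(" #\n\t")
print(repr(left_only))
print(repr(right_only))
```

To remove characters from the middle as well, `.replace()` is the usual tool.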


In [10]:
"""
Objective: Check if Content Exists
"""
button_text = "Load More"

# TODO: Check if "Load More" is in button_text
# TODO: Print "More content is available" if true
# TODO: Print "No more content" if false
# Expected output: "More content is available"
if "Load More" in button_text:
    print("More content is available")
else:
    print("No more content")

More content is available


In [12]:
"""
Objective: Check if Content Exists using .find()
"""
button_text = "Load More"

# TODO: Check if "Load More" is in button_text
# TODO: Print "More content is available" if true
# TODO: Print "No more content" if false
# Expected output: "More content is available"
if button_text.find("Load More") != -1:
    print("More content is available")
else:
    print("No more content")

More content is available


In [13]:
"""
Objective: Split Data into a List
"""
keywords = "Python, BeautifulSoup, Scrapy, Selenium, Web Scraping"

# TODO: Split keywords into a list
# Expected output: ["Python", "BeautifulSoup", "Scrapy", "Selenium", "Web Scraping"]
print(keywords.split(", "))

['Python', 'BeautifulSoup', 'Scrapy', 'Selenium', 'Web Scraping']
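
Scraped text rarely has perfectly consistent spacing around separators, and splitting on a fixed separator then leaves stray whitespace in the items. One common approach, sketched here with a hypothetical `messy_keywords` string, is to split on the bare comma and strip each item:

```python
# Hypothetical input with irregular spacing around the commas
messy_keywords = "Python,BeautifulSoup ,  Scrapy, Selenium,Web Scraping"

# Split on "," and trim whitespace from every item
keyword_list = [item.strip() for item in messy_keywords.split(",")]
print(keyword_list)
```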


In [14]:
"""
Objective: Joining list to string
"""
keywords = ["Python", "BeautifulSoup", "Scrapy", "Selenium", "Web Scraping"]

# TODO: Join keywords into a string with ", "
# Expected output: "Python, BeautifulSoup, Scrapy, Selenium, Web Scraping"
print(", ".join(keywords))

Python, BeautifulSoup, Scrapy, Selenium, Web Scraping


In [21]:
"""
Objective: Extract URL from text
"""
text = "For more information visit https://www.example.com"

# TODO: Split the text into list of words
# TODO: Check if word starts with "https://"
# TODO: Print the word
# Expected output: "https://www.example.com"
words = text.split()
print(words)
for word in words:
    if word.startswith("https://"):
        print(word)


['For', 'more', 'information', 'visit', 'https://www.example.com']
https://www.example.com


In [24]:
"""
Objective: Extract email from text
"""
text = "For more info, contact us at support@example.com or visit our website."

# TODO: Split the text into list of words
words = text.split()
print(words)
# TODO: Find the word that ends with ".com"
# Expected output: "support@example.com"
for word in words:
    if word.endswith(".com"):
        print(word)

['For', 'more', 'info,', 'contact', 'us', 'at', 'support@example.com', 'or', 'visit', 'our', 'website.']
support@example.com
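
Checking for `.com` at the end of a word is fragile: it misses other domains and fails when punctuation follows the address. A rough sketch of a slightly more tolerant approach, assuming that any word containing `@` is an email address (the sample sentence is hypothetical):

```python
# Hypothetical sentence: addresses are followed by punctuation
text = "Questions? Email support@example.com, or write to sales@example.org."

emails = []
for word in text.split():
    # Trim punctuation that sticks to the word after splitting on whitespace
    candidate = word.strip(".,;:!?")
    if "@" in candidate:
        emails.append(candidate)

print(emails)
```

This is still a heuristic rather than real email validation.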


In [27]:
"""
Objective: Generate Slug for URLs using .replace()
"""
base_url = "https://www.example.com"
section = "News and Articles"

# TODO: Format section from "News and Articles" to "news-and-articles"
slug = section.replace(" ", "-").lower()
print(slug)
# TODO: Combine base_url and section
# TODO: Print the complete URL
# Expected output: "https://www.example.com/news-and-articles"
print(f"{base_url}/{slug}")


news-and-articles
https://www.example.com/news-and-articles
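
When the same transformation is needed for many sections, it can be wrapped in a small function. The helper below, `build_section_url`, is a hypothetical name for illustration; the sketch assumes words are separated by whitespace only and that the base URL may or may not end with a slash:

```python
def build_section_url(base_url, section):
    """Return base_url joined with a lowercase, hyphen-separated slug."""
    # split() with no arguments also collapses repeated spaces
    slug = "-".join(section.lower().split())
    # rstrip("/") avoids a double slash if base_url already ends with one
    return f"{base_url.rstrip('/')}/{slug}"


first_url = build_section_url("https://www.example.com", "News and Articles")
second_url = build_section_url("https://www.example.com/", "  Latest   News ")
print(first_url)
print(second_url)
```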


In [29]:
""" 
Objective: String manipulation
"""
message = "   Hello, world! How, are you today? "

# TODO: Remove the leading and trailing spaces
stripped = message.strip()
print(stripped)
# TODO: Replace the spaces with "-"
# Expected output: "hello,-world!-how,-are-you-today?"
print(stripped.lower().replace(" ", "-"))

Hello, world! How, are you today?
hello,-world!-how,-are-you-today?


### **Reflection**
What is the difference between using .find() and "in" to check whether a substring exists in a text?

Here are the key differences between .find() and in for checking text existence:

Return value:

- in : Returns a boolean (True/False)

In [None]:
text = "Hello World"
print("Hello" in text)  # True
print("Python" in text) # False

- .find() : Returns an integer (index position or -1)

In [None]:
text = "Hello World"
print(text.find("Hello"))  # 0 (found at start)
print(text.find("Python")) # -1 (not found)

Use cases:

In [None]:
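text = "Hello World"

# Use `in` when you only need a yes/no answer
if "Hello" in text:
    print("Text found!")

# Use .find() when you also need the position of the match
position = text.find("Hello")
if position != -1:
    print(f"Found at position: {position}")

# The optional second argument is the index to start searching from
text = "Hello World Hello"
print(text.find("Hello", 5))  # 12 (index of the second "Hello")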
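
The start argument of `.find()` also makes it possible to collect every occurrence of a substring, which `in` cannot do. A minimal sketch (the variable names are illustrative):

```python
text = "Hello World Hello"
target = "Hello"

positions = []
start = 0
while True:
    # Search from `start` onwards; -1 means no further match
    index = text.find(target, start)
    if index == -1:
        break
    positions.append(index)
    # Continue searching just past the match we found
    start = index + 1

print(positions)  # [0, 12]
```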

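Performance:

- in : Usually marginally faster for a simple existence check in CPython, since it avoids a method lookup and call; it is also the more readable choice.
- .find() : Comparable in speed, but returns the position of the match, which is useful when you need to slice the text or search further.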
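
The difference can be measured with the standard `timeit` module. A rough sketch, using a hypothetical test string and repeat count; absolute numbers depend on the machine and Python version, so no expected output is shown:

```python
import timeit

# Hypothetical long text with the target at the very end
text = "Hello World " * 1000 + "Load More"

# Time 10,000 runs of each check
in_seconds = timeit.timeit(lambda: "Load More" in text, number=10_000)
find_seconds = timeit.timeit(lambda: text.find("Load More") != -1, number=10_000)

print(f"in:      {in_seconds:.4f} s")
print(f".find(): {find_seconds:.4f} s")
```

In practice the gap is small, so readability is usually the better reason to prefer `in` for plain existence checks.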

### **Exploration**
Regular expressions are an advanced way to extract data using specific patterns and rules. Regex is covered in more depth at the intermediate level, but it is worth starting to learn it now.

Here are some basic regular expressions (regex) in Python with practical examples:

In [None]:
import re

# Basic pattern matching
text = "Contact us: support@myweb.com or sales@mycompany.com"
emails = re.findall(r'[\w\.-]+@[\w\.-]+', text)
print("Emails found:", emails)  # ['support@myweb.com', 'sales@mycompany.com']

# Phone number pattern
text = "Call us at 0812345678 or (021) 123-4567"
phones = re.findall(r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}', text)
print("Phone numbers found:", phones)  # ['0812345678', '(021) 123-4567']

# URL pattern
text = "Visit https://www.myweb.com or http://test.com"
urls = re.findall(r'https?://[\w\.-]+\.\w+', text)
print("URLs found:", urls)  # ['https://www.myweb.com', 'http://test.com']
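
`re.findall` returns every match as a list. When only one value is needed, `re.search` with a capturing group is a common alternative. The sketch below revisits the page-number exercise from earlier, assuming the URL contains `page-` followed by digits:

```python
import re

current_page = "https://www.python.org/page-30"

# (\d+) captures the digits that follow "page-"
match = re.search(r"page-(\d+)", current_page)
if match:
    # group(1) is the text captured by the first pair of parentheses
    page_number = int(match.group(1))
    print(page_number)
else:
    print("No page number found")
```

`re.search` returns `None` when nothing matches, so the `if match:` check avoids an `AttributeError`.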

Common regex patterns:

- \w : Any word character (letter, number, underscore)
- \d : Any digit
- + : One or more occurrences
- * : Zero or more occurrences
- ? : Zero or one occurrence
- [] : Character set
- () : Grouping
- \. : Literal dot (escaped)
- ^ : Start of string
- $ : End of string

These patterns help you extract structured data like:

- Email addresses
- Phone numbers
- URLs
- Dates
- Custom formats

Regular expressions are powerful for data extraction and validation tasks.
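
As an illustration of the date case, here is a minimal sketch that assumes dates are written as `YYYY-MM-DD` (the sample text is hypothetical). It only checks the shape of the date, not whether the month or day is valid:

```python
import re

# Hypothetical text containing two ISO-style dates
text = "Published on 2024-01-15, updated on 2024-03-02."

# \b marks a word boundary so longer digit runs are not partially matched
dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text)
print(dates)

# With groups, findall returns one tuple of (year, month, day) per match
parts = re.findall(r"\b(\d{4})-(\d{2})-(\d{2})\b", text)
print(parts)
```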