# 🧩 Strings and Regular Expressions in Python

## 1. String Concatenation and Type Conversion

```python
value = input("Enter your favorite number: ")
print(value + " is my favorite number")

# Try numeric operation
print(value * 10)  # ❌ Repeats the string 10 times!

# Convert to integer
value_int = int(value)
print("When you multiply it by 10, you get:", value_int * 10)
```

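### 🧠 Notes:
- `input()` **always returns a string**, even when the user types digits.
- Use `int()` or `float()` to convert to a number before doing math; both raise `ValueError` if the text is not a valid number.
- Use `+` to **concatenate strings** (both operands must be strings). To print mixed types without converting them yourself, pass them to `print()` as separate arguments with `,`.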
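
Because `int()` and `float()` raise `ValueError` on non-numeric text, real user input usually needs a guard. The following is a minimal sketch of one way to do that; `parse_number` is a hypothetical helper name, not something from the lesson itself.

```python
def parse_number(text):
    """Convert user-entered text to an int or float, or return None if it is not numeric."""
    cleaned = text.strip()  # drop surrounding spaces before converting
    try:
        return int(cleaned)  # whole numbers such as "53"
    except ValueError:
        pass
    try:
        return float(cleaned)  # decimals such as "2.5"
    except ValueError:
        return None  # text such as "fifty" cannot be converted


for raw in [" 53", "2.5", "fifty"]:
    number = parse_number(raw)
    if number is None:
        print(repr(raw), "is not a number")
    else:
        print(repr(raw), "times 10 is", number * 10)
```

Returning `None` keeps the example short; a program that prompts the user might instead loop until a valid number is entered.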

---

In [26]:
# Strings: values and outputs

value = input("Enter a number: ")
print(value + " is my favorite number!")
print("When you multiply it by 10, this is what you get: ")
print(value * 10)
print(type(value))

Enter a number:  53


53 is my favorite number!
When you multiply it by 10, this is what you get: 
53535353535353535353
<class 'str'>


In [10]:
# Strings: changing values and outputs
value = input("Enter a number: ")
print(value + " is my favorite number!")
print("When you multiply it by 10, this is what you get: ")
value_int = int(value)
print(value_int * 10)
print(type(value_int))

Enter a number:  53


53 is my favorite number!
When you multiply it by 10, this is what you get: 
530
<class 'int'>


## 2. Finding Patterns and Slicing Strings

```python
first_name = "malala"
last_name = "yousafzai"

# Capitalize
first_name_cap = first_name.capitalize()
last_name_cap = last_name.capitalize()

print(first_name_cap, last_name_cap)
```

```python
note = "Award: Nobel Peace Prize"

# Find position of substring
award_location = note.find("Award: ")
print("Award found at position:", award_location)

# Slice to get the text after the prefix; 7 is the length of "Award: "
award_text = note[7:]
print("Award text:", award_text)
```
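
The hard-coded `7` only works while the prefix stays the same length. As a hedged alternative, the sketch below computes the slice start from the result of `.find()` and handles the case where the prefix is missing; the variable names `prefix`, `start`, and `missing` are illustrative assumptions.

```python
note = "Award: Nobel Peace Prize"
prefix = "Award: "

start = note.find(prefix)  # index where the prefix begins, or -1 if absent
if start == -1:
    award_text = ""  # prefix missing: fall back to an empty string
else:
    award_text = note[start + len(prefix):]  # skip past the prefix itself

print(award_text)  # Nobel Peace Prize

# When the substring is not present, .find() returns -1 rather than raising
missing = "Nobel Peace Prize".find(prefix)
print(missing)  # -1
```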

### 🧠 Notes:
- `.find()` returns the *index* (0-based position) of the first occurrence of a substring, or `-1` if the substring is not found.
- `.capitalize()`, `.upper()`, `.lower()`, and `.title()` are common string transformations.
- **Slicing syntax:**  
  `string[start:end]` (`end` is exclusive)  
  - Omitting `start` → begins at 0  
  - Omitting `end` → runs to the end of the string  
  - A third value sets the step, so `string[::-1]` reverses a string!
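
To make the notes above concrete, here is a small sketch that applies the case transformations and a few slice forms to the strings used in this section. The variable `name` is introduced only for illustration.

```python
name = "malala yousafzai"

print(name.capitalize())  # Malala yousafzai  (only the first character is uppercased)
print(name.title())       # Malala Yousafzai  (first letter of each word)
print(name.upper())       # MALALA YOUSAFZAI

note = "Award: Nobel Peace Prize"
print(note[:5])     # Award  (start omitted, so it begins at 0)
print(note[7:12])   # Nobel  (index 12 itself is excluded)
print(note[-5:])    # Prize  (negative start counts from the end)
print(note[::-1])   # ezirP ecaeP leboN :drawA
```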

---

In [27]:
# String methods
# .capitalize()

first_name = 'malala'
last_name = 'yousafzai'
note = 'award: Nobel Peace Prize'

first_name_cap = first_name.capitalize()
last_name_cap = last_name.capitalize()
print(first_name_cap)
print(last_name_cap)

# String methods: finding text
# .find()    # index of the first occurrence of the text; returns -1 if not found
# .index()   # index of the first occurrence of the text; raises ValueError if not found
# .rfind()   # index of the last occurrence of the text; returns -1 if not found
# .rindex()  # index of the last occurrence of the text; raises ValueError if not found

award_location = note.find("award: ")  # in the variable named note, find the first occurrence of "award: "
print(award_location)  # prints the index where the text starts
award_text = note[7:]  # string[start:end]: slice from position 7 (the length of "award: ") to the end
print(award_text)

Malala
Yousafzai
0
Nobel Peace Prize


## 3. Regular Expressions (Regex)

```python
import re

five_digit_zip = "12345"
nine_digit_zip = "12345-6789"
phone_number = "555-123-4567"

# --- Create a regex pattern for 5 digits in a row ---
# r"" = raw string (keeps backslashes literal)
# \d  = any digit (0–9)
# {5} = exactly five occurrences
five_digit_expression = r"\d{5}"

# --- Search for the pattern in different strings ---
# re.search() scans the entire string and returns a Match object if found
print(re.search(five_digit_expression, five_digit_zip))   # ✅ match
print(re.search(five_digit_expression, nine_digit_zip))   # ✅ match (first 5 digits)
print(re.search(five_digit_expression, phone_number))     # ❌ no match
```

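### 🧠 Notes
- `r"\d{5}"` → matches **five digits in a row**; inside a longer run of digits it still matches the first five, so add anchors (`^`, `$`) or use `re.fullmatch()` when the whole string must be exactly five digits  
- `re.search()` scans **anywhere** in the string and returns the first match  
- `re.match()` checks **only at the beginning**  
- `re.findall()` returns **all non-overlapping matches** as a list  
- A successful match returns a **Match object** (otherwise `None`), which includes details:  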
  ```python
  match = re.search(r"\d{5}", "ZIP 12345 USA")
  print(match.start(), match.end(), match.group())  # 4 9 12345
  ```
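
The differences between these functions are easiest to see side by side. The sketch below runs the same pattern through each one; the sample string `text` is made up for illustration.

```python
import re

text = "ZIP 12345 USA, alt 67890"
pattern = r"\d{5}"

first = re.search(pattern, text)        # first match anywhere in the string
at_start = re.match(pattern, text)      # None: the text does not start with digits
all_found = re.findall(pattern, text)   # every non-overlapping match, as strings

print(first.group())   # 12345
print(at_start)        # None
print(all_found)       # ['12345', '67890']

# fullmatch succeeds only if the entire string fits the pattern
print(bool(re.fullmatch(pattern, "12345")))    # True
print(bool(re.fullmatch(pattern, "123456")))   # False
print(re.search(pattern, "123456").group())    # 12345 (search still finds five of the six digits)
```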

---

### 📘 Common Regex Tokens Reference

| Token | Meaning | Example | Matches |
|--------|----------|----------|----------|
| `\d` | Digit (0–9) | `\d{3}` | `"123"` |
| `\w` | Word character (letters, digits, underscore) | `\w+` | `"Hello_123"` |
| `.` | Any character except newline | `A.B` | `"AcB"`, `"A9B"` |
| `+` | One or more of previous pattern | `\d+` | `"42"`, `"12345"` |
| `*` | Zero or more of previous pattern | `go*` | `"g"`, `"goo"` |
| `?` | Zero or one of previous pattern | `colou?r` | `"color"`, `"colour"` |

> These tokens are the *building blocks* for powerful data cleaning and validation patterns.
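
As a quick check of the table, the sketch below searches one sample string per token and collects what each pattern matched. The sample strings and the `samples` and `found` names are assumptions chosen for illustration.

```python
import re

# One illustrative input per pattern from the table
samples = {
    r"\d{3}": "Call 123 now",
    r"\w+": "Hello_123 world",
    r"A.B": "xAcBx",
    r"\d+": "order 12345 shipped",
    r"go*": "gooal",
    r"colou?r": "colour swatch",
}

found = {}
for pattern, text in samples.items():
    match = re.search(pattern, text)  # each sample is chosen so that a match exists
    found[pattern] = match.group()
    print(pattern, "->", repr(match.group()))
```

Each pattern stops as soon as the next character no longer fits: `\w+` stops at the space, and `go*` stops at the `a`.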

---

## 🧪 Quick Data Analytics Example

**Task:** Validate ZIP Codes in a dataset using regular expressions.

```python
import pandas as pd
import re

data = {
    "Name": ["Alex", "Jamie", "Jordan"],
    "ZipCode": ["12345", "9876A", "67890"],
}

df = pd.DataFrame(data)

# ^ = start of string, $ = end of string
# Matches only 5 digits, no letters or extra characters
pattern = re.compile(r"^\d{5}$")

# Apply regex to each row in 'ZipCode'
df["Valid_Zip"] = df["ZipCode"].apply(lambda z: bool(pattern.match(z)))
print(df)
```

### ✅ Output:
| Name | ZipCode | Valid_Zip |
|------|----------|-----------|
| Alex | 12345 | True |
| Jamie | 9876A | False |
| Jordan | 67890 | True |
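
One edge case worth knowing: in Python's `re`, `$` also matches just before a trailing newline, so `^\d{5}$` accepts `"12345\n"`. The sketch below contrasts that with `fullmatch()`, which requires the entire string to fit. The names `anchored`, `plain`, and `raw_zip` are illustrative, and whether this matters depends on how clean the data is.

```python
import re

anchored = re.compile(r"^\d{5}$")
plain = re.compile(r"\d{5}")

raw_zip = "12345\n"  # e.g. a value read from a file with its newline still attached

with_anchors = bool(anchored.match(raw_zip))           # $ matches before the final newline
with_fullmatch = bool(plain.fullmatch(raw_zip))        # the newline is not a digit
after_strip = bool(plain.fullmatch(raw_zip.strip()))   # whitespace removed first

print(with_anchors)    # True
print(with_fullmatch)  # False
print(after_strip)     # True
```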

---

## 🧰 Summary Checklist

| Concept | Tool / Keyword | Example |
|----------|----------------|----------|
| Combine strings | `+` | `"Hi " + "there"` |
| Convert to number | `int()` / `float()` | `int("42")` |
| Capitalize | `.capitalize()` / `.upper()` / `.title()` | `"malala".capitalize()` |
| Search | `.find()` / `.index()` | `"Nobel".find("b")` |
| Slice | `[start:end]` | `"Award"[0:3] → "Awa"` |
| Regex | `re.search()` | `re.search(r"\d{5}", "12345")` |

---

### 🌟 Pro Tip:
When working with **pandas**:
- Use vectorized `.str` methods for concise, column-wide checks that propagate missing values instead of raising an error →  
  `df['ZipCode'].str.match(r'^\d{5}$')`  
- Use `.str.contains('pattern', regex=True)` to filter rows by text pattern.  
- Fall back to `.apply()` with a custom function when the cleaning logic needs conditions that the `.str` methods cannot express.
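
The sketch below shows the `.str` methods on a small Series that includes a missing value. It assumes pandas 1.1 or later (where `Series.str.fullmatch` is available), and the sample data and variable names are made up for illustration.

```python
import pandas as pd

zips = pd.Series(["12345", "9876A", None, "12345-6789"], name="ZipCode")

# na=False treats missing values as "not valid" instead of propagating NaN
is_five_digit = zips.str.fullmatch(r"\d{5}", na=False)
has_five_digits = zips.str.contains(r"\d{5}", regex=True, na=False)

print(is_five_digit.tolist())        # [True, False, False, False]
print(has_five_digits.tolist())      # [True, False, False, True]
print(zips[is_five_digit].tolist())  # ['12345']
```

`fullmatch` answers "is this value exactly a 5-digit ZIP?", while `contains` answers "does this value include five digits in a row anywhere?", which is why the ZIP+4 entry passes only the second check.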


In [29]:
# 🔎 Import the 're' module (regular expressions), which provides pattern-matching tools
import re

# --- Example data ---
five_digit_zip = "12345"        # valid U.S. 5-digit ZIP code
nine_digit_zip = "12345-6789"   # ZIP+4 format (also valid, includes hyphen)
phone_number = "555-123-4567"   # 🚫 not a ZIP, but good for testing non-matches

# --- Create a Regular Expression pattern ---
# r"" makes a *raw string literal* (Python won't treat backslashes as escape characters)
# \d means "digit" (0–9)
# {5} means "exactly five occurrences"
five_digit_expression = r"\d{5}"   # pattern: find a run of 5 consecutive digits
print(re.search(five_digit_expression, five_digit_zip))   # Expect a match

# --- Search against each string for a match ---
# re.search() scans the entire string for the pattern
# If found → returns a Match object; if not → returns None
print(re.search(five_digit_expression, five_digit_zip))   # Expect a match ✅
print(re.search(five_digit_expression, nine_digit_zip))   # Expect a match ✅ (first 5 digits)
print(re.search(five_digit_expression, phone_number))     # Expect None ❌ (digits broken up by hyphens)


<re.Match object; span=(0, 5), match='12345'>
<re.Match object; span=(0, 5), match='12345'>
<re.Match object; span=(0, 5), match='12345'>
None


In [25]:
# 🧩 Import required libraries
import pandas as pd     # pandas → powerful data analysis and manipulation library
import re               # re → built-in regular expressions (pattern matching) module

# --- Create a small dataset (dictionary of lists) ---
data = {
    "Name": ["Alex", "Jamie", "Jordan"],    # sample names
    "ZipCode": ["12345", "9876A", "67890"]  # one invalid ZIP (contains letter 'A')
}

# --- Convert dictionary into a pandas DataFrame ---
# DataFrame = table-like structure (rows and columns)
df = pd.DataFrame(data)

# --- Compile a Regular Expression pattern ---
# r"" creates a raw string (Python won't treat backslashes as escape characters)
# ^  → start of string
# \d → digit (0–9)
# {5} → exactly five digits
# $  → end of string
# Together: match only strings that are *exactly* five digits long
pattern = re.compile(r"^\d{5}$")

# --- Apply the pattern to the 'ZipCode' column ---
# .apply() runs a function on every row value
# pattern.match(z) checks if each ZIP matches the 5-digit pattern
# bool() converts the result (Match object or None) into True or False
df["Valid_Zip"] = df["ZipCode"].apply(lambda z: bool(pattern.match(z)))

# --- Display the final DataFrame ---
print(df)


     Name ZipCode  Valid_Zip
0    Alex   12345       True
1   Jamie   9876A      False
2  Jordan   67890       True


In [55]:
# Challenge: Strings
# Your task is to take the value entered by the user, convert it to kilometers,
# and then print the result to the terminal with a text description using string concatenation.
# Remember that you can convert a string to a number with decimal places using Python's built-in float() function.

miles = input('Enter a distance in miles: ')
miles_float = float(miles)  # convert the input string to a float
km_value = miles_float * 1.609344  # kilometers = miles * 1.609344
print("This is the distance in kilometers: " + str(km_value))
print(f"This is the distance in kilometers: {km_value:.2f}")  # f-string rounded to 2 decimal places

Enter a distance in miles:  50


This is the distance in kilometers: 80.4672
This is the distance in kilometers: 80.47
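
The conversion in the challenge can also be wrapped in a small function so it can be reused and checked without typing input each time. This is only a sketch: `describe_km` and `MILES_TO_KM` are hypothetical names, and the factor is the one used in the cell above.

```python
MILES_TO_KM = 1.609344  # same conversion factor as in the challenge cell


def describe_km(miles_text):
    """Return a sentence giving the distance in kilometers, rounded to two decimals."""
    km = float(miles_text) * MILES_TO_KM  # float() raises ValueError for non-numeric text
    return f"{miles_text.strip()} miles is {km:.2f} kilometers"


print(describe_km("50"))    # 50 miles is 80.47 kilometers
print(describe_km("26.2"))  # 26.2 miles is 42.16 kilometers
```

The `:.2f` format spec only changes how the number is displayed; the value stored in `km` keeps its full precision.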

