<a href="https://colab.research.google.com/github/rameen2/NED-PGD/blob/main/Assignment_4_advanced_Regular_Expression_py.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Advanced Regular Expression**

Regular expressions (regex or regexp) are patterns used for matching and manipulating strings. They are a powerful tool for text processing, data extraction, and validation. The examples in this notebook use Python's built-in re module. Here are some advanced regex concepts and techniques:

**Grouping and Capturing:** Parentheses ( ) group parts of a regular expression together and capture the text they match. For example, (abc)+ matches one or more repetitions of "abc"; when a group is repeated like this, the group itself keeps only the last repetition it matched.
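
As a minimal sketch in Python's re module (the sample string is made up for illustration), the following shows that the whole match and the group's own capture can differ when a group is repeated:

```python
import re

# (abc)+ matches one or more repetitions of "abc"
m = re.search(r'(abc)+', 'xxabcabcabcxx')

whole_match = m.group(0)   # the entire matched text
last_group = m.group(1)    # only the last repetition captured by the group

print(whole_match)   # abcabcabc
print(last_group)    # abc
```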

**Backreferences:** You can refer to captured groups within the same regex pattern using backreferences. For example, (abc)\1 matches "abcabc" because \1 refers to the first captured group.
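
A short sketch of backreferences in Python, assuming the built-in re module; the sample sentence and variable names are invented for illustration. The second part shows a common use, finding a word that is immediately repeated:

```python
import re

# \1 must repeat exactly what group 1 captured
matches_repeat = bool(re.fullmatch(r'(abc)\1', 'abcabc'))
matches_other = bool(re.fullmatch(r'(abc)\1', 'abcabd'))
print(matches_repeat)   # True
print(matches_other)    # False

# Find words that appear twice in a row
doubled = re.findall(r'\b(\w+)\s+\1\b', 'this is is a test test case')
print(doubled)          # ['is', 'test']
```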

**Non-Capturing Groups:** Sometimes, you want to group without capturing. You can use (?: ) to create non-capturing groups. For example, (?:abc)+ matches "abcabc" but doesn't capture it.
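
The difference is easiest to see with re.findall, which returns the captured group when the pattern has one and the full match otherwise. A small sketch with a made-up sample string:

```python
import re

text = 'abcabc xyz abc'

# With a capturing group, findall returns what the group captured
capturing = re.findall(r'(abc)+', text)

# With a non-capturing group, findall returns the full matches
non_capturing = re.findall(r'(?:abc)+', text)

print(capturing)       # ['abc', 'abc']
print(non_capturing)   # ['abcabc', 'abc']
```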

**Lookahead and Lookbehind:** Lookahead ((?= )) and lookbehind ((?<= )) assertions check for a condition ahead of or behind the current position in the string without consuming characters. Their negative forms are (?! ) and (?<! ). They are useful for complex matching scenarios, such as matching digits only when a currency code precedes them.
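
A minimal sketch of both assertions in Python's re module. The sample text is invented for illustration; note that the text checked by the assertion is not part of the returned match:

```python
import re

text = 'Price: PKR 1500, discount: 10%, code: 2500X'

# Lookbehind: digits preceded by "PKR " (the prefix is not returned)
after_pkr = re.findall(r'(?<=PKR )\d+', text)

# Lookahead: digits followed by "%" (the "%" is not returned)
before_percent = re.findall(r'\d+(?=%)', text)

print(after_pkr)        # ['1500']
print(before_percent)   # ['10']
```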

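**Anchors:** Anchors like ^ (start) and $ (end) specify where a match must occur. In Python, ^ matches at the start of the whole string and $ at its end (or just before a final newline) by default; with the re.MULTILINE flag they also match at the start and end of each line. For example, ^abc matches "abc" only at the start of the string or, in multiline mode, at the start of a line.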
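
The following sketch, using a made-up three-line string, shows how the re.MULTILINE flag changes where anchors match in Python:

```python
import re

text = 'abc first\nabc second\nxyz abc'

# Without re.MULTILINE, ^ matches only at the very start of the string
start_of_string = re.findall(r'^abc', text)

# With re.MULTILINE, ^ also matches right after each newline
start_of_lines = re.findall(r'^abc', text, flags=re.MULTILINE)

# With re.MULTILINE, $ matches before each newline and at the end of the string
end_of_lines = re.findall(r'\w+$', text, flags=re.MULTILINE)

print(start_of_string)   # ['abc']
print(start_of_lines)    # ['abc', 'abc']
print(end_of_lines)      # ['first', 'second', 'abc']
```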

**Quantifiers:** Quantifiers specify how many times a character or group should be repeated. Common quantifiers include * (0 or more), + (1 or more), ? (0 or 1), {n} (exactly n times), and {n,m} (between n and m times, written without a space after the comma).
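
One way to see what each quantifier accepts is to test a few strings against it with re.fullmatch. This is only an illustrative sketch; the sample strings and the results dictionary are not part of the assignments:

```python
import re

samples = ['', 'a', 'aa', 'aaa', 'aaaa']
patterns = ['a*', 'a+', 'a?', 'a{2}', 'a{2,3}']

# For each pattern, collect the samples that it matches completely
results = {}
for pattern in patterns:
    results[pattern] = [s for s in samples if re.fullmatch(pattern, s)]
    print(pattern, '->', results[pattern])
```

Here a* accepts every sample, a+ rejects only the empty string, a? accepts the empty string and "a", a{2} accepts only "aa", and a{2,3} accepts "aa" and "aaa".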

**Character Classes:** Character classes like [a-z] match any single character within the specified range. You can negate a character class using [^ ], which matches any character not in the specified range.
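
A brief sketch of a range and a negated class in Python, using an invented sample string:

```python
import re

text = 'Plate LEA-567, room B12'

# Runs of lowercase ASCII letters
lowercase = re.findall(r'[a-z]+', text)

# Runs of characters that are neither ASCII letters nor whitespace
not_letters_or_space = re.findall(r'[^A-Za-z\s]+', text)

print(lowercase)              # ['late', 'room']
print(not_letters_or_space)   # ['-567,', '12']
```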

**Alternation:** The pipe symbol | is used for alternation, allowing you to match multiple patterns. For example, cat|dog matches either "cat" or "dog."
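
Alternation on its own also matches inside longer words, so it is often wrapped in a non-capturing group with word boundaries. A sketch with a made-up sentence:

```python
import re

text = 'My cat chased a dog, then the dogs chased a caterpillar.'

# Plain alternation also matches inside "dogs" and "caterpillar"
loose = re.findall(r'cat|dog', text)

# Grouping the alternatives between \b keeps only whole words
whole_words = re.findall(r'\b(?:cat|dog)\b', text)

print(loose)         # ['cat', 'dog', 'dog', 'cat']
print(whole_words)   # ['cat', 'dog']
```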

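**Modifiers:** Modifiers (flags) change how a regex pattern is applied to a string. Common ones are i (case-insensitive), m (multiline mode), and g (global search). In Python, the first two are passed as re.IGNORECASE (re.I) and re.MULTILINE (re.M); there is no g flag, because functions such as re.findall and re.finditer already return every match.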
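
A sketch of how these modifiers look in Python, assuming the built-in re module: i and m correspond to re.IGNORECASE and re.MULTILINE, and finding every match is done by calling re.findall rather than by setting a flag. The sample text is invented for illustration:

```python
import re

text = 'Lahore\nlahore\nKARACHI'

# Default matching is case-sensitive
case_sensitive = re.findall(r'lahore', text)

# re.IGNORECASE makes letters match regardless of case
case_insensitive = re.findall(r'lahore', text, flags=re.IGNORECASE)

# Flags can be combined with |
whole_lines = re.findall(r'^[a-z]+$', text, flags=re.IGNORECASE | re.MULTILINE)

# A flag can also be written inline at the start of the pattern
inline = re.findall(r'(?i)karachi', text)

print(case_sensitive)     # ['lahore']
print(case_insensitive)   # ['Lahore', 'lahore']
print(whole_lines)        # ['Lahore', 'lahore', 'KARACHI']
print(inline)             # ['KARACHI']
```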

**Escaping:** Special characters like ., *, +, ?, [, ], (, ), {, }, |, ^, $, \, and others need to be escaped with a backslash \ to be treated as literal characters in the pattern.
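
The sketch below shows the effect of an unescaped dot and uses re.escape, which adds the backslashes for you when the text to match comes from a variable. The sample text is made up:

```python
import re

text = 'Total: 3.50 (approx.) vs 3x50'

# An unescaped dot matches any character, so "3x50" is matched too
unescaped = re.findall(r'3.50', text)

# An escaped dot matches only a literal "."
escaped = re.findall(r'3\.50', text)

# re.escape escapes every special character in a literal string
literal = '(approx.)'
found = re.findall(re.escape(literal), text)

print(unescaped)   # ['3.50', '3x50']
print(escaped)     # ['3.50']
print(found)       # ['(approx.)']
```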

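**Named Capture Groups:** Some regex flavors support naming capture groups, making it easier to extract specific information from matches. Python's re module is one of them: each group is given a name in the pattern, and the captured text is then read back by that name instead of by position.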
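
A short sketch using Python's re module, where a group is named with the ?P prefix. The sample sentence and the group names (day, month, year) are chosen only for illustration:

```python
import re

pattern = r'(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})'
m = re.search(pattern, 'The event is on 15-08-2023.')

# Read a single group by name
year = m.group('year')

# Or get all named groups at once as a dictionary
parts = m.groupdict()

print(year)    # 2023
print(parts)   # {'day': '15', 'month': '08', 'year': '2023'}
```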

**Greedy vs. Non-Greedy:** By default, quantifiers are greedy, meaning they match as much as possible. You can make them non-greedy by adding ? after the quantifier, like *? or +?.
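
The classic illustration is matching tags in a line of markup. In this sketch (sample string invented), the greedy version runs to the last ">" while the non-greedy version stops at the first one:

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ grabs as much as possible, up to the last ">"
greedy = re.findall(r'<.+>', text)

# Non-greedy: .+? stops at the first ">" that completes a match
lazy = re.findall(r'<.+?>', text)

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
```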

**Recursion:** Some advanced regex flavors, such as PCRE, support recursion, allowing you to match nested patterns such as nested parentheses. Python's built-in re module does not support recursive patterns.



**Assignment 1: Extracting Phone Numbers**

Raw Text: Extract all valid Pakistani phone numbers from a given text.

Example:

Text: Please contact me at 0301-1234567 or 042-35678901 for further details.

In [1]:
# The re module provides regular expression support.
import re
# Sample text
text = "Please contact me at 0301-1234567 or 042-35678901 for further details."

# Define a regex pattern for Pakistani phone numbers:
#   - a four-digit prefix, a hyphen, and seven digits (mobile numbers), or
#   - a three-digit area code, a hyphen, and eight digits (landline numbers).
# The \b at each end keeps the pattern from matching inside a longer run of digits or letters.
pattern = r'\b(?:\d{4}-\d{7}|\d{3}-\d{8})\b'

# Find all matches in the text
matches = re.findall(pattern, text)

# Print the extracted phone numbers
for match in matches:
    print(match)

0301-1234567
042-35678901
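
As a possible extension, the two alternatives can be given names so that each match is also labelled as a mobile or landline number. This is only a sketch: the group names mobile and landline and the classified list are introduced here for illustration, and the match object's lastgroup attribute reports which named group matched.

```python
import re

text = "Please contact me at 0301-1234567 or 042-35678901 for further details."

# Same two formats as above, each wrapped in a named group
pattern = r'\b(?:(?P<mobile>\d{4}-\d{7})|(?P<landline>\d{3}-\d{8}))\b'

# lastgroup is the name of the group that matched
classified = [(m.lastgroup, m.group()) for m in re.finditer(pattern, text)]

for kind, number in classified:
    print(kind, number)
```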


**Assignment 2: Validating Email Addresses**

Raw Text: Validate email addresses according to Pakistani domain extensions (.pk).

Example:

Text: Contact us at info@example.com or support@domain.pk for assistance.

In [17]:
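import re
text = "Contact us at info@example.com or support@domain.pk for assistance"

# Match only addresses whose domain ends in .pk (this also covers domains such
# as .com.pk). The lookahead rejects addresses where .pk is followed by more of
# the domain, for example name@site.pk.example.com.
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.pk(?![.-]?\w)'
matches = re.findall(pattern, text)
for match in matches:
    print(match)

support@domain.pk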
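
To validate a single address rather than search inside a sentence, re.fullmatch can be used so that the whole string must fit the pattern. The following is a sketch under simple assumptions: the helper name is_pk_email and the candidate addresses are hypothetical, and the pattern is a practical approximation, not a complete email grammar.

```python
import re

# The whole address must match, and the domain must end in .pk
PK_EMAIL = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.pk')

def is_pk_email(address):
    """Return True if the address looks like an email on a .pk domain."""
    return PK_EMAIL.fullmatch(address) is not None

candidates = ['info@example.com', 'support@domain.pk',
              'sales@shop.com.pk', 'bad@@domain.pk']
results = {address: is_pk_email(address) for address in candidates}

for address, valid in results.items():
    print(address, valid)
```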


**Assignment 4: Identifying Urdu Words**

Raw Text: Identify and extract Urdu words from a mixed English-Urdu text.

Example:

Text: یہ sentence میں کچھ English words بھی ہیں۔

In [8]:
import re
text = "یہ sentence میں کچھ English words بھی ہیں۔"

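# U+0600 to U+06FF is the Unicode "Arabic" block, which contains the letters used to write Urdu.
# The block also contains punctuation such as the Urdu full stop (U+06D4),
# so a full stop attached to a word is returned as part of that word.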
pattern = r'[\u0600-\u06FF]+'

# Find all matches in the text
matches = re.findall(pattern, text)

# Print the extracted Urdu words
for match in matches:
    print(match)

یہ
میں
کچھ
بھی
ہیں۔
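
If punctuation should not count as part of a word, one option is to skip the punctuation marks of the Arabic block with a negative lookahead. This is a sketch, not the only way to do it; the list of excluded marks (comma, semicolon, question mark, full stop) is an assumption about which punctuation appears in the text.

```python
import re

text = "یہ sentence میں کچھ English words بھی ہیں۔"

# Excluded punctuation: comma (U+060C), semicolon (U+061B),
# question mark (U+061F), full stop (U+06D4).
# Each character must be in the Arabic block and must not be one of those marks.
pattern = r'(?:(?![\u060C\u061B\u061F\u06D4])[\u0600-\u06FF])+'

words = re.findall(pattern, text)
for word in words:
    print(word)
```

With this pattern the last word is returned without the trailing full stop.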


**Assignment 5: Finding Dates**

Raw Text: Find and extract dates in the format DD-MM-YYYY from a given text.

Example:

Text: The event will take place on 15-08-2023 and 23-09-2023.

In [7]:
import re
text = "The event will take place on 15-08-2023 and 23-09-2023"
# Two digits, a hyphen, two digits, a hyphen, and four digits (DD-MM-YYYY)
pattern = r'\b\d{2}-\d{2}-\d{4}\b'
matches = re.findall(pattern, text)
for match in matches:
    print(match)


15-08-2023
23-09-2023
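
A pattern of digits and hyphens checks only the shape of a date, so strings such as 31-02-2023 also match. One way to filter those out, sketched here with an invented sample text, is to pass each candidate to datetime.strptime and keep only the ones it accepts:

```python
import re
from datetime import datetime

text = "Valid: 15-08-2023 and 23-09-2023. Invalid: 31-02-2023 and 10-13-2023."
pattern = r'\b\d{2}-\d{2}-\d{4}\b'

valid_dates = []
for candidate in re.findall(pattern, text):
    try:
        # Raises ValueError for impossible dates such as 31 February
        datetime.strptime(candidate, '%d-%m-%Y')
    except ValueError:
        continue
    valid_dates.append(candidate)

print(valid_dates)   # ['15-08-2023', '23-09-2023']
```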


**Assignment 6: Extracting URLs**

Raw Text: Extract all URLs from a text that belong to Pakistani domains.

Example:

Text: Visit http://www.example.pk or https://website.com.pk for more information.

In [15]:
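import re
text = "Visit http://www.example.pk or https://website.com.pk for more information."
# Match http/https URLs whose host name ends in .pk, with an optional path.
# The lookahead rejects hosts where .pk is followed by more of the domain.
pattern = r'https?://[^\s/]+\.pk(?![.-]?\w)(?:/\S*)?'
matches = re.findall(pattern, text)
for match in matches:
    print(match)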


http://www.example.pk
https://website.com.pk
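
An alternative sketch is to extract every URL with a loose pattern and then check the host name with urllib.parse from the standard library. The sample text and the variable names candidates and pk_urls are invented for illustration; the third URL shows a case where "pk" appears in the path but the host is not a .pk domain.

```python
import re
from urllib.parse import urlparse

text = ("Visit http://www.example.pk, https://website.com.pk/contact "
        "or https://example.com/pk for more.")

# Loose extraction: anything starting with http(s):// up to whitespace or a comma
candidates = re.findall(r'https?://[^\s,]+', text)

# Keep only the URLs whose host name ends in .pk
pk_urls = [url for url in candidates
           if (urlparse(url).hostname or '').endswith('.pk')]

for url in pk_urls:
    print(url)
```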


**Assignment 7: Analyzing Currency**

Raw Text: Extract and analyze currency amounts in Pakistani Rupees (PKR) from a given text.

Example:

Text: The product costs PKR 1500, while the deluxe version is priced at Rs. 2500.

In [18]:
import re
text = "The product costs PKR 1500, while the deluxe version is priced at Rs. 2500."
# Accept both the "PKR" and "Rs." prefixes; capture the number with optional
# thousands separators and an optional two-digit decimal part.
pattern = r'\b(?:PKR|Rs\.?)\s*(\d+(?:,\d{3})*(?:\.\d{2})?)'
matches = re.findall(pattern, text)

amounts = [float(match.replace(',', '')) for match in matches]
total_amount = sum(amounts)
average_amount = total_amount / len(amounts) if amounts else 0

print("Extracted PKR amounts:", matches)
print("Total PKR amount:", total_amount)
print("Average PKR amount:", average_amount)

Extracted PKR amounts: ['1500', '2500']
Total PKR amount: 4000.0
Average PKR amount: 2000.0


**Assignment 8: Removing Punctuation**

Raw Text: Remove all punctuation marks from a text while preserving Urdu characters.

Example:

Text: کیا! آپ, یہاں؟

In [20]:
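import re
text = "کیا! آپ, یہاں؟"
# Keep word characters, whitespace, and the Arabic block (U+0600 to U+06FF), but also
# remove the punctuation marks inside that block: comma (U+060C), semicolon (U+061B),
# question mark (U+061F), and full stop (U+06D4).
pattern = r'[^\w\s\u0600-\u06FF]+|[\u060C\u061B\u061F\u06D4]+'
cleaned_text = re.sub(pattern, '', text)  # Remove punctuation marks from the text
print(cleaned_text)

کیا آپ یہاں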


**Assignment 9: Extracting City Names**

Raw Text: Extract names of Pakistani cities from a given text.

Example:

Text: Lahore, Karachi, Islamabad, and Peshawar are major cities of Pakistan.

In [25]:
import re
text = "Lahore, Karachi, Islamabad, and Peshawar are major cities of Pakistan."
cities = ["Lahore", "Karachi", "Islamabad", "Peshawar", "Rawalpindi", "Faisalabad", "Multan", "Gujranwala", "Quetta", "Sialkot"]
extracted_cities = []
for city in cities:
    if city in text:
        extracted_cities.append(city)

# Print the final list once, after the loop has finished
print(extracted_cities)

['Lahore', 'Karachi', 'Islamabad', 'Peshawar']
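
The same lookup can also be written as a single regular expression by joining the city names with the alternation operator. This is a sketch of that idea using the same list of cities; re.escape is applied in case a name ever contains a special character, and the word boundaries keep a name from matching inside a longer word.

```python
import re

text = "Lahore, Karachi, Islamabad, and Peshawar are major cities of Pakistan."
cities = ["Lahore", "Karachi", "Islamabad", "Peshawar", "Rawalpindi",
          "Faisalabad", "Multan", "Gujranwala", "Quetta", "Sialkot"]

# Build one pattern of the form \b(?:Lahore|Karachi|...)\b
pattern = r'\b(?:' + '|'.join(re.escape(city) for city in cities) + r')\b'

# findall returns the names in the order they appear in the text
extracted_cities = re.findall(pattern, text)
print(extracted_cities)   # ['Lahore', 'Karachi', 'Islamabad', 'Peshawar']
```

Unlike the membership test, this version reports a city once for every time it appears in the text.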


**Assignment 10: Analyzing Vehicle Numbers**

Raw Text: Identify and extract Pakistani vehicle registration numbers (e.g., ABC-123) from a text.

Example:

Text: I saw a car with the number plate LEA-567 near the market.

In [19]:
import re
text = "I saw a car with the number plate LEA-567 near the market."
pattern = r'\b[A-Z]{3}-\d{3}\b'
matches = re.findall(pattern, text)
for match in matches:
  print(match)

LEA-567
