In [None]:
# Given a block of text like the one below, how do we find and extract only the phone numbers and email addresses?

'''
Plotly Studio serves every level of data sophistication across teams and 9775656564 use cases.
Marketing managers build interactive dashboards from spreadsheets, while finance teams
replace static budget hrreports@gmail.com with intelligent data apps.9656565656 You could be an analyst creating your
first data visualization. Or maybe you’re a seasoned expert deploying enterprise data products.
 Whoever you are, Plotly scales with you from simple charts to sophisticated AI-driven data analysis,
 all within one unified ecosystem.
'''
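One way to answer the question above is `re.findall` with one pattern per field. This is a minimal sketch, assuming a phone number is a standalone run of exactly 10 digits and using a simplified email pattern. The names `sample`, `number_pattern`, and `email_pattern` are illustrative, and `sample` is a shortened copy of the text.

```python
import re

# Shortened copy of the sample text (assumption: phone numbers are exactly 10 digits).
sample = (
    "Plotly Studio serves every level of data sophistication across teams "
    "and 9775656564 use cases. Finance teams replace static budget "
    "hrreports@gmail.com with intelligent data apps.9656565656 You could be an analyst."
)

number_pattern = r"\b\d{10}\b"                    # standalone run of 10 digits
email_pattern = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"   # simplified, not a full validator

numbers = re.findall(number_pattern, sample)
emails = re.findall(email_pattern, sample)
print(numbers)
print(emails)
```

The rest of the notebook builds up to patterns like these step by step.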


# regex and web scraping

- https://docs.python.org/3/library/re.html
- https://docs.python.org/3/howto/regex.html
- https://regexr.com/
- https://regex101.com/
- https://regex-generator.olafneumann.org/
- https://github.com/PacktPublishing/2018-Python-Regular-Expressions---Real-World-Projects/tree/master
- https://www.w3resource.com/python-exercises/re/
- https://github.com/codebasics/py/blob/master/Advanced/regex/regex_tutorial_exercise_questions.ipynb -exercise
- https://github.com/Asabeneh/30-Days-Of-Python/blob/master/18_Day_Regular_expressions/18_regular_expressions.md
- https://github.com/CoreyMSchafer/code_snippets/tree/master/Python-Regular-Expressions


## Implement Regular Expression


# Python Regular Expressions (Regex): An In-Depth Guide

## **1. Introduction to Regular Expressions**

### **1.1 What is a Regular Expression?**
A **Regular Expression (Regex)** is a sequence of characters that defines a search pattern. This pattern can be used for searching, replacing, and manipulating strings in a flexible and efficient way. Regex is widely used in text processing tasks like validation, parsing, and string manipulation.

### **1.2 Why Use Regex?**
- **Text Searching:** Finding specific patterns in text data.
- **Validation:** Ensuring strings conform to a required format (e.g., email addresses).
- **Replacement:** Modifying or replacing parts of strings that match a pattern.
- **Parsing:** Extracting specific information from a text.

## **2. Basic Syntax of Python Regex**

### **2.1 Importing the `re` Module**
In Python, the `re` module provides support for working with regular expressions.

```python
import re
```

### **2.2 Common Metacharacters**
- **`.`**: Matches any single character except a newline.
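- **`^`**: Matches the start of a string (and the start of each line with `re.MULTILINE`).
- **`$`**: Matches the end of a string or just before a trailing newline (and the end of each line with `re.MULTILINE`).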
- **`*`**: Matches 0 or more repetitions of the preceding element.
- **`+`**: Matches 1 or more repetitions of the preceding element.
- **`?`**: Matches 0 or 1 repetition of the preceding element.
- **`[]`**: Matches any one of the characters inside the square brackets.
- **`|`**: Logical OR, matches the expression before or after the `|`.
- **`()`**: Groups expressions and captures the matched sub-expression.
- **`\`**: Escapes a special character, or denotes a special sequence.
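The metacharacters listed above can be tried one at a time. The strings and variable names below are made up for illustration.

```python
import re

any_char = re.findall(r"c.t", "cat cot c\nt cut")    # . does not match the newline
at_start = re.findall(r"^Hello", "Hello world")      # ^ anchors at the start
at_end = re.findall(r"world$", "Hello world")        # $ anchors at the end
optional = re.findall(r"colou?r", "color colour")    # ? makes the 'u' optional
one_of = re.findall(r"gr[ae]y", "gray grey griy")    # [] picks one character from a set
either = re.findall(r"cat|dog", "cat dog bird")      # | is alternation
groups = re.findall(r"(\d+)-(\d+)", "10-20 30-40")   # () captures sub-matches
literal_dot = re.findall(r"\.", "a.b.c")             # \ escapes the dot

for result in (any_char, at_start, at_end, optional, one_of, either, groups, literal_dot):
    print(result)
```

With capture groups in the pattern, `re.findall` returns tuples of the captured parts rather than whole matches.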

### **2.3 Basic Functions**
- **`re.match()`**: Checks for a match only at the beginning of the string.
- **`re.search()`**: Searches for the first location where the pattern matches.
- **`re.findall()`**: Finds all substrings where the pattern matches and returns them as a list.
- **`re.sub()`**: Replaces occurrences of the pattern with a replacement string.
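A short sketch of the four functions above applied to the same string. The string `s` and the variable names are illustrative.

```python
import re

s = "order 66 shipped, order 77 pending"

starts_with_order = re.match(r"order", s)       # Match object: "order" is at the start
starts_with_shipped = re.match(r"shipped", s)   # None: "shipped" is not at the start
first_number = re.search(r"\d+", s).group()     # first run of digits
all_numbers = re.findall(r"\d+", s)             # every run of digits, as a list
masked = re.sub(r"\d+", "#", s)                 # each run of digits replaced

print(starts_with_order, starts_with_shipped)
print(first_number, all_numbers)
print(masked)
```

`re.match()` and `re.search()` return `None` when nothing matches, so check the result before calling `.group()` on it.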

## **3. Understanding Regex Patterns**

### **3.1 Character Classes**
- **`[abc]`**: Matches any one of `a`, `b`, or `c`.
- **`[^abc]`**: Matches any character except `a`, `b`, or `c`.
- **`[a-z]`**: Matches any lowercase letter.
- **`[0-9]`**: Matches any digit.
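The four character classes above, applied to one made-up string:

```python
import re

s = "cab-42 Zed"

in_set = re.findall(r"[abc]", s)        # any one of a, b, c
not_in_set = re.findall(r"[^abc]", s)   # anything except a, b, c
lowercase = re.findall(r"[a-z]+", s)    # runs of lowercase letters
digits = re.findall(r"[0-9]+", s)       # runs of digits

print(in_set, not_in_set, lowercase, digits)
```

Note that the negated class `[^abc]` also matches the space, the hyphen, and the digits.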

### **3.2 Special Sequences**
- **`\d`**: Matches any digit. With `re.ASCII` it is equivalent to `[0-9]`; by default, for `str` patterns, it also matches other Unicode digits.
- **`\D`**: Matches any non-digit (the complement of `\d`).
- **`\s`**: Matches any whitespace character (space, tab, newline).
- **`\S`**: Matches any non-whitespace character.
- **`\w`**: Matches any word character: letters, digits, and the underscore. With `re.ASCII` it is equivalent to `[a-zA-Z0-9_]`; by default it also matches Unicode letters and digits.
- **`\W`**: Matches any non-word character (the complement of `\w`).
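A small sketch of the special sequences above, including the difference the `re.ASCII` flag makes for `\w`. The sample string is made up; `\u00e9` is the letter "é".

```python
import re

s = "Room 42\tcaf\u00e9_1"

digits = re.findall(r"\d+", s)                        # runs of digits
spaces = re.findall(r"\s", s)                         # the space and the tab
words = re.findall(r"\w+", s)                         # Unicode-aware by default
ascii_words = re.findall(r"\w+", s, flags=re.ASCII)   # only [a-zA-Z0-9_]

print(digits, spaces)
print(words)
print(ascii_words)
```

Under `re.ASCII`, the accented letter no longer counts as a word character, so the last token is split in two.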

### **3.3 Quantifiers**
- **`*`**: Matches 0 or more repetitions.
- **`+`**: Matches 1 or more repetitions.
- **`?`**: Matches 0 or 1 repetition.
- **`{n}`**: Matches exactly `n` repetitions.
- **`{n,}`**: Matches `n` or more repetitions.
- **`{n,m}`**: Matches between `n` and `m` repetitions.
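The counted quantifiers above can be compared on one string. The last two lines go slightly beyond the list: they show that quantifiers are greedy by default and that appending `?` makes them lazy, which the HTML example later in this guide relies on.

```python
import re

s = "a aa aaa aaaa"

exactly_two = re.findall(r"\ba{2}\b", s)      # whole words of exactly 2 a's
two_or_more = re.findall(r"\ba{2,}\b", s)     # whole words of 2 or more a's
two_to_three = re.findall(r"\ba{2,3}\b", s)   # whole words of 2 to 3 a's

greedy = re.findall(r"<.+>", "<b>bold</b>")   # + takes as much as possible
lazy = re.findall(r"<.+?>", "<b>bold</b>")    # +? takes as little as possible

print(exactly_two, two_or_more, two_to_three)
print(greedy, lazy)
```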


## **5. Practical Examples**

### **5.1 Email Validation**
```python
pattern = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'
email = "test.email@example.com"
if re.match(pattern, email):
    print("Valid email")
else:
    print("Invalid email")
```

### **5.2 Extracting Phone Numbers**
```python
text = "Call me at 123-456-7890 or 987.654.3210"
pattern = r'\b\d{3}[-.]\d{3}[-.]\d{4}\b'
phone_numbers = re.findall(pattern, text)
print(phone_numbers)
```

### **5.3 Removing HTML Tags**
```python
html = "<p>This is a <b>bold</b> paragraph.</p>"
pattern = r'<.*?>'
clean_text = re.sub(pattern, '', html)
print(clean_text)
```

### **5.4 Password Validation**
```python
pattern = r'^(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])(?=.*[!@#$%^&*]).{8,}$'
password = "StrongPassw0rd!"
if re.match(pattern, password):
    print("Valid password")
else:
    print("Invalid password")
```

### **5.5 Parsing Log Files**
```python
log_entry = "ERROR 2024-08-17 12:34:56 - Something went wrong"
pattern = r'^(ERROR|WARNING|INFO)\s(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})\s-\s(.*)$'
match = re.search(pattern, log_entry)
if match:
    level, date, time, message = match.groups()
    print(f"Level: {level}, Date: {date}, Time: {time}, Message: {message}")
```

## **6. Real-World Use Cases**

### **6.1 Data Cleaning in Data Science**
Regex can be used to clean and preprocess data, such as removing unwanted characters, correcting formats, or extracting specific fields.
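As a minimal sketch of this kind of cleaning, the snippet below collapses stray whitespace and pulls a numeric price out of a messy string. The input string and its format are assumptions made for illustration.

```python
import re

raw = "  Price:   $1,299.00 (incl. tax)  "

# Collapse runs of whitespace into single spaces and trim the ends.
collapsed = re.sub(r"\s+", " ", raw).strip()

# Extract the amount after the dollar sign, then drop the thousands separator.
amount = re.search(r"\$([\d,]+\.\d{2})", collapsed)
price = float(amount.group(1).replace(",", ""))

print(collapsed)
print(price)
```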

### **6.2 Web Scraping**
In web scraping, regex is useful for pulling specific values, such as product prices or titles on e-commerce sites, out of text taken from HTML or JSON responses. An HTML parser such as BeautifulSoup handles the document structure itself.

### **6.3 Input Validation**
Forms in web applications often require regex to validate inputs like email addresses, phone numbers, ZIP codes, etc.

### **6.4 Log File Analysis**
System administrators and developers often use regex to parse log files and extract meaningful data, like timestamps or error messages.

### **6.5 Natural Language Processing (NLP)**
Regex is used in NLP tasks for tokenization, text normalization, and entity extraction.
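A simple regex tokenizer can be sketched as follows. The pattern is an assumption that treats a token as a run of lowercase letters with an optional apostrophe part; real NLP tokenizers handle many more cases.

```python
import re

sentence = "Regex isn't magic; it's pattern matching!"

# Normalize to lowercase, then keep letter runs, allowing one inner apostrophe.
tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower())
print(tokens)
```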

## **7. Performance Considerations**

### **7.1 Efficiency**
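- Use raw strings (e.g., `r'pattern'`) so backslashes reach the regex engine unchanged and do not need to be doubled. This is about correctness and readability rather than speed.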
- Pre-compile regex patterns with `re.compile()` when using the same pattern multiple times.
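A sketch of pre-compiling a pattern that is reused across many lines. The name `DATE_RE` and the sample lines are illustrative. The `re` module also caches recently used patterns internally, so the gain is often modest; a compiled pattern mainly gives the pattern a name and avoids repeated lookups.

```python
import re

# Compile once, reuse for every line.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

lines = ["INFO 2024-08-17 start", "no date here", "ERROR 2024-08-18 failed"]

# Keep the date from each line that contains one (walrus operator needs Python 3.8+).
dates = [m.group() for line in lines if (m := DATE_RE.search(line))]
print(dates)
```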

### **7.2 Alternatives**
For simple string searches or replacements, consider using Python’s built-in string methods like `str.find()`, `str.replace()`, or `str.split()` instead of regex.
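For fixed substrings, the string methods named above are enough. A minimal sketch with a made-up string:

```python
text = "alpha,beta,gamma"

position = text.find("beta")        # index of the first occurrence, or -1 if absent
replaced = text.replace(",", ";")   # replace every comma
parts = text.split(",")             # split on a fixed delimiter

print(position, replaced, parts)
```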

## **8. Conclusion**

Regular expressions are powerful tools for text processing and can handle a wide range of tasks, from simple searches to complex text manipulations. Mastering regex will greatly enhance your ability to work with strings and text data in Python.



In [None]:
import re

In [None]:
data = "AEF9891 AW  AB  ABBCE  AE989 9155656456 AER"
p ="AE"
p3 = '[A-Z0-9]+'

In [None]:
re.match(p,data)
# re.match(): Checks for a match only at the beginning of the string.

<re.Match object; span=(0, 2), match='AE'>

In [None]:
# re.match(p,data).group()
d = re.match(p,data)
d.group()

'AE'

In [None]:
p2 ="AW"

In [None]:
re.match(p2,data)

In [None]:
# re.search(): Searches for the first location where the pattern matches.
re.search(p2,data)

<re.Match object; span=(8, 10), match='AW'>

In [None]:
p5='[0-9]+'
re.search(p5,data)

<re.Match object; span=(3, 7), match='9891'>

In [None]:
data

'AEF9891 AW  AB  ABBCE  AE989 9155656456 AER'

In [None]:
# re.findall(): Finds all substrings where the pattern matches and returns them as a list.
re.findall(p5,data)

['9891', '989', '9155656456']

In [None]:
# find all pattern  which start with AE
p6 = "AE[0-9A-Z]+"
re.findall(p6,data)

['AEF9891', 'AE989', 'AER']

In [None]:
# numbers only (the name digit_pattern avoids shadowing the usual pandas alias `pd`)
digit_pattern = "[0-9]+"
re.findall(digit_pattern,data)

['9891', '989', '9155656456']

In [None]:
data

'AEF9891 AW  AB  ABBCE  AE989 9155656456 AER'

In [None]:
# replace AB with ab
re.sub("AB","ab",data)

'AEF9891 AW  ab  abBCE  AE989 9155656456 AER'

In [None]:
data

'AEF9891 AW  AB  ABBCE  AE989 9155656456 AER'

In [None]:
# all
text = """12Edit the Expression & Text to see matches +91 97933453245. shahil@classes.org
            +93 9433112478 Roll over matches deepkumar@itvedant.com or the expression for details 9878.
            PCRE & +140553434231 JavaScript 23 flavors of RegEx are supported.
            Validate your expression with ajayjha@deepclasses.in Tests mode."""

In [None]:
# raw string; the dot is escaped so it matches a literal "." rather than any character
email_pattern= r"[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-z]+"

In [None]:
re.findall(email_pattern,text)

['shahil@classes.org', 'deepkumar@itvedant.com', 'ajayjha@deepclasses.in']

In [None]:
# find emails with \w (letters, digits, underscore); raw string with an escaped dot
email_p = r"\w+@\w+\.\w+"

In [None]:
re.findall(email_p,text)

['shahil@classes.org', 'deepkumar@itvedant.com', 'ajayjha@deepclasses.in']

In [None]:
num_pattern = "[0-9]{10}"
re.findall(num_pattern,text)

['9793345324', '9433112478', '1405534342']

In [None]:
# numbers that start with "+", optionally with spaces, ending in 10 digits
num_pattern = r"\+[0-9 ]*[0-9]{10}"

In [None]:
re.findall(num_pattern,text)

['+91 97933453245', '+93 9433112478', '+140553434231']

In [None]:
# "+", a 1-3 digit country code, a space, then 10-12 digits
ph = r"\+[0-9]{1,3} [0-9]{10,12}"
re.findall(ph,text)

['+91 97933453245', '+93 9433112478']

In [None]:
# same pattern without the space, to catch numbers written as one run of digits
ph1 = r"\+[0-9]{1,3}[0-9]{10,12}"
t = re.findall(ph1,text)
t[0][:2]+" "+t[0][2:]

'+1 40553434231'

In [None]:
# replace
re.sub(ph1,t[0][:2]+" "+t[0][2:], text)

'12Edit the Expression & Text to see matches +91 97933453245. shahil@classes.org\n            +93 9433112478 Roll over matches deepkumar@itvedant.com or the expression for details 9878.\n            PCRE & +1 40553434231 JavaScript 23 flavors of RegEx are supported.\n            Validate your expression with ajayjha@deepclasses.in Tests mode.'
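The replacement string above is built from the first match only, so every match would be replaced with that one number. A sketch of a per-match alternative passes a function to `re.sub`. It keeps the notebook's assumption of a one-digit country code. The helper name `add_space` and the second phone number are made up for illustration.

```python
import re

ph1 = r"\+[0-9]{1,3}[0-9]{10,12}"
sample = "PCRE & +140553434231 JavaScript, backup +121255501234 line"

def add_space(match):
    """Insert a space after '+' and the first digit (assumed one-digit country code)."""
    number = match.group()
    return number[:2] + " " + number[2:]

# re.sub calls add_space once per match and uses its return value as the replacement.
formatted = re.sub(ph1, add_space, sample)
print(formatted)
```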

In [None]:
text

'12Edit the Expression & Text to see matches +91 97933453245. shahil@classes.org\n            +93 9433112478 Roll over matches deepkumar@itvedant.com or the expression for details 9878.\n            PCRE & +140553434231 JavaScript 23 flavors of RegEx are supported.\n            Validate your expression with ajayjha@deepclasses.in Tests mode.'

In [None]:
# read the file; the with-block closes it automatically
with open("/content/sample_data/README.md",'r') as file:
    data = file.read()
print(data)

This directory includes a few sample datasets to get you started.

*   `california_housing_data*.csv` is California housing data from the 1990 US
    Census; more information is available at:
    https://docs.google.com/document/d/e/2PACX-1vRhYtsvc5eOR2FWNCwaBiKL6suIOrxJig8LcSBbmCbyYsayia_DvPOOBlXZ4CAlQ5nlDD8kTaIDRwrN/pub

*   `mnist_*.csv` is a small sample of the
    [MNIST database](https://en.wikipedia.org/wiki/MNIST_database), which is
    described at: http://yann.lecun.com/exdb/mnist/

*   `anscombe.json` contains a copy of
    [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet); it
    was originally described in

    Anscombe, F. J. (1973). 'Graphs in Statistical Analysis'. American
    Statistician. 27 (1): 17-21. JSTOR 2682899.

    and our copy was prepared by the
    [vega_datasets library](https://github.com/altair-viz/vega_datasets/blob/4f67bdaad10f45e3549984e17e1b3088c731503d/vega_datasets/_data/anscombe.json).



In [None]:
re.findall('[0-9]{4}',data)

['1990', '1973', '2682', '3549', '3088', '7315']

In [None]:
# URL pattern: http or https, then everything up to the next whitespace
url_pattern = r"https?://\S+"

In [None]:

re.findall(url_pattern,data)

['https://docs.google.com/document/d/e/2PACX-1vRhYtsvc5eOR2FWNCwaBiKL6suIOrxJig8LcSBbmCbyYsayia_DvPOOBlXZ4CAlQ5nlDD8kTaIDRwrN/pub',
 'https://en.wikipedia.org/wiki/MNIST_database),',
 'http://yann.lecun.com/exdb/mnist/',
 'https://en.wikipedia.org/wiki/Anscombe%27s_quartet);',
 'https://github.com/altair-viz/vega_datasets/blob/4f67bdaad10f45e3549984e17e1b3088c731503d/vega_datasets/_data/anscombe.json).']
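The matches above keep trailing Markdown punctuation such as `),` and `);` because `\S+` runs to the next whitespace. One possible refinement, sketched on a shortened copy of the README text, stops at a closing bracket and strips trailing punctuation. The trade-off is that URLs which legitimately contain `)` would be cut short.

```python
import re

# Shortened copy of the README text shown above.
readme = (
    "[MNIST database](https://en.wikipedia.org/wiki/MNIST_database), which is "
    "described at: http://yann.lecun.com/exdb/mnist/ and "
    "[Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet); it"
)

# Stop at whitespace or a closing ) or ], then strip trailing sentence punctuation.
url_pattern = r"https?://[^\s)\]]+"
urls = [u.rstrip(".,;") for u in re.findall(url_pattern, readme)]
print(urls)
```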

In [None]:
# Web scraping

In [None]:
# requests: sends the HTTP request and returns the response (status code, content)
!pip install requests

!pip install beautifulsoup4
# bs4 (BeautifulSoup): parses HTML so that tags can be found by name, class, or id



In [2]:
# fetch the URL and get the response
import requests
# BeautifulSoup parses the HTML in the response
from bs4 import BeautifulSoup as bs
data = requests.get(r"https://www.snapdeal.com/products/mens-footwear-sports-shoes?sort=plrty")
data.status_code


403

In [3]:
from urllib.request import Request, urlopen

req = Request(
    url='https://www.snapdeal.com/products/mens-footwear-sports-shoes?sort=plrty',
    headers={'User-Agent': 'Mozilla/5.0'}
)
webpage = urlopen(req).read()

HTTPError: HTTP Error 403: Forbidden
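Both attempts above end in a 403, which suggests the site refuses these requests. Extra headers may or may not change that. The sketch below only shows how to catch the error instead of letting the cell crash. The helper names `build_request` and `fetch` and the header values are hypothetical.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

def build_request(url):
    """Return a Request carrying browser-like headers (illustrative values)."""
    return Request(
        url,
        headers={
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
            "Accept-Language": "en-US,en;q=0.9",
        },
    )

def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except HTTPError as err:   # the server answered with an error status such as 403
        print(f"HTTP {err.code} for {url}: {err.reason}")
    except URLError as err:    # the server could not be reached at all
        print(f"Could not reach {url}: {err.reason}")
    return None
```

A call such as `fetch("https://www.snapdeal.com/products/mens-footwear-sports-shoes?sort=plrty")` would then return `None` and print the status rather than raising.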

In [None]:
# fetch the URL and get the response
import requests
# BeautifulSoup parses the HTML in the response
from bs4 import BeautifulSoup as bs
data = requests.get(r"http://yellowpages.in/hyderabad/restaurants/161059018")

In [None]:
data.status_code

200

In [None]:
# raw HTML of the page, as bytes
data.content

b'<!DOCTYPE html><html><head id="head1" prefix="og: https://ogp.me/ns# fb: https://ogp.me/ns/fb# product: https://ogp.me/ns/product#"><meta charset="utf-8" /><meta http-equiv="X-UA-Compatible" content="IE=edge" /><title>Best Restaurants in Hyderabad</title><meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no" /><link rel="shortcut icon" href="/images/favicon.ico" type="image/x-icon" /><link rel="stylesheet" href="/css/v6/popup.css" /><script src="/js/v4/jquery.min.js"></script><link rel="manifest" href="/js/v4/manifest.json" /><meta property=\'fb:app_id\' content=\'1719688271655824\' /><meta property=\'fb:admins\' content=\'100001091606750\' /><meta property=\'og:title\' itemprop=\'name\' content=\'Best Restaurants in Hyderabad\' /><meta property=\'og:description\' itemprop=\'description\' content=\'Get 1036 listings from Restaurants in Hyderabad. Find details like photos, reviews, directions, phone numbers, address and more..\' /><meta property=\'og:url

In [None]:
soup = bs(data.content,"html.parser")

In [None]:
soup.title.string

'Best Restaurants in Hyderabad'

In [None]:
# Restaurant details to collect:
# name, phone, address, URL of the restaurant's detail page

In [None]:
soup.find("div", class_="popularTitleTextBlock").text

'Kitchens of Godavari'

In [None]:
name=[]
links=[]
for i in soup.find_all("div", class_="popularTitleTextBlock"):
    # build the detail-page URL once by prepending the site domain to the href
    url = "http://yellowpages.in"+i.a["href"]
    name.append(i.text)
    links.append(url)
    print(i.text)
    print(url)
    print("=======================")

Kitchens of Godavari
http://yellowpages.in/b/kitchens-of-godavari-hitech-city-hyderabad/446870522
Amaravathi Restaurant
http://yellowpages.in/b/amaravathi-restaurant-secunderabad-hyderabad/257124433
Sukhibhava Restaurant
http://yellowpages.in/b/sukhibhava-restaurant-dilsukh-nagar-hyderabad/944833701
Lucky Restaurant
http://yellowpages.in/b/lucky-restaurant-santosh-nagar-hyderabad/973795213
Am To Pm Bawarchi Restaurant
http://yellowpages.in/b/am-to-pm-bawarchi-restaurant-vanasthalipuram-hyderabad/170410218
Daawat Biryani & Family Restaurant 
http://yellowpages.in/b/daawat-biryani-and-family-restaurant--warasiguda-hyderabad/391559060
MRCB Food Court
http://yellowpages.in/b/mrcb-food-court-nagole-hyderabad/464958962
Aroma Bakery
http://yellowpages.in/b/aroma-bakery-sainikpuri-hyderabad/162350595
Lucky Restaurant
http://yellowpages.in/b/lucky-restaurant-padmarao-nagar-hyderabad/361900206
Sahara Restaurant
http://yellowpages.in/b/sahara-restaurant-chintal-hyderabad/926027096
Prince Al Mataa

In [None]:
# extract the phone number

ph=[]
for i in soup.find_all("a", class_="businessContact"):
  print(i.text)
  ph.append(i.text)

+91 9848409505
+91 9100889993
+91 7032168786
+91 9030072003
+91 9100519991
+91 40 27090001
+91 9966090055
+91 9849838211
+91 40 65355999
+91 9949102353
+91 9700413299
+91 9696279999
+91 9949988002
+91 8008333555
+91 9391153081
+91 9652343683
+91 9440095563
+91 8497922223
+91 40 64595959
+91 040 64548885
+91 9290145876
+91 40 27712350
+91 040 6529 7620
+91 9778860172
+91 8886604392
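The scraped numbers above come in several spacings (`+91 40 27090001`, `+91 040 6529 7620`). A regex sketch that normalizes them by keeping only the digits is shown below. The helper name `normalize` is hypothetical, and the sketch does not try to interpret area codes or leading zeros.

```python
import re

# Three of the formats seen in the scraped output above.
scraped = ["+91 9848409505", "+91 40 27090001", "+91 040 6529 7620"]

def normalize(phone):
    """Drop everything except digits, then restore the leading '+'."""
    digits = re.sub(r"\D", "", phone)
    return "+" + digits

normalized = [normalize(p) for p in scraped]
print(normalized)
```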


In [None]:
# address
soup.find("div", class_="eachPopularRight").text

'+91 9848409505Hitech City Hyderabad - 500081 Directions'

In [None]:
# address only
soup.find("address", class_="businessArea").text

'Hitech City Hyderabad - 500081'

In [None]:
# the full address is on the restaurant's detail page; this is its URL
links[0]

'http://yellowpages.in/b/kitchens-of-godavari-hitech-city-hyderabad/446870522'

In [None]:
# detailed address for a single restaurant
each = requests.get(links[0])
details= bs(each.content,"html.parser")  # name the parser explicitly, as in the earlier cell
details.find("div", id="MainContent_divAdd").text

'Lower Ground Floor,Phoenix - B.K Towers,Hitech City,Hyderabad - 500081Telangana.'

In [None]:
# detailed address for every restaurant: one request per detail page
address =[]
for i in links:
  each = requests.get(i)
  print(i)
  details= bs(each.content,"html.parser")
  a = details.find("div", id="MainContent_divAdd").text
  address.append(a)

http://yellowpages.in/b/kitchens-of-godavari-hitech-city-hyderabad/446870522
http://yellowpages.in/b/amaravathi-restaurant-secunderabad-hyderabad/257124433
http://yellowpages.in/b/sukhibhava-restaurant-dilsukh-nagar-hyderabad/944833701
http://yellowpages.in/b/lucky-restaurant-santosh-nagar-hyderabad/973795213
http://yellowpages.in/b/am-to-pm-bawarchi-restaurant-vanasthalipuram-hyderabad/170410218
http://yellowpages.in/b/daawat-biryani-and-family-restaurant--warasiguda-hyderabad/391559060
http://yellowpages.in/b/mrcb-food-court-nagole-hyderabad/464958962
http://yellowpages.in/b/aroma-bakery-sainikpuri-hyderabad/162350595
http://yellowpages.in/b/lucky-restaurant-padmarao-nagar-hyderabad/361900206
http://yellowpages.in/b/sahara-restaurant-chintal-hyderabad/926027096
http://yellowpages.in/b/prince-al-mataam-arabian-food-court-tarnaka-hyderabad/254071076
http://yellowpages.in/b/tabla-restaurant-nagole-hyderabad/561565209
http://yellowpages.in/b/mfc-hot-and-fresh-jeedimetla-hyderabad/5897107

In [None]:
# email for a single restaurant; [5:] drops the leading "Email" label
details.find("div", id="MainContent_divEmail").text[5:]

'kitchenofgodavari@gmail.com'

In [None]:
email=[]
for i in links:
  each = requests.get(i)
  print(i)
  details= bs(each.content,"html.parser")
  t= details.find("div", id="MainContent_divEmail")
  # find() returns None when the page has no email block
  if t is not None:
    print(t.text)
    email.append(t.text[5:])  # drop the leading "Email" label
  else:
    email.append("NA")


http://yellowpages.in/b/kitchens-of-godavari-hitech-city-hyderabad/446870522
Emailkitchenofgodavari@gmail.com
http://yellowpages.in/b/amaravathi-restaurant-secunderabad-hyderabad/257124433
Emailamaravathibr@gmail.com
http://yellowpages.in/b/sukhibhava-restaurant-dilsukh-nagar-hyderabad/944833701
http://yellowpages.in/b/lucky-restaurant-santosh-nagar-hyderabad/973795213
http://yellowpages.in/b/am-to-pm-bawarchi-restaurant-vanasthalipuram-hyderabad/170410218
http://yellowpages.in/b/daawat-biryani-and-family-restaurant--warasiguda-hyderabad/391559060
http://yellowpages.in/b/mrcb-food-court-nagole-hyderabad/464958962
http://yellowpages.in/b/aroma-bakery-sainikpuri-hyderabad/162350595
Emailtyaralipartabani786@gamil.com
http://yellowpages.in/b/lucky-restaurant-padmarao-nagar-hyderabad/361900206
http://yellowpages.in/b/sahara-restaurant-chintal-hyderabad/926027096
http://yellowpages.in/b/prince-al-mataam-arabian-food-court-tarnaka-hyderabad/254071076
http://yellowpages.in/b/tabla-restaurant-n
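The slice `[5:]` above assumes the text always starts with exactly the five characters `Email`. A regex-based sketch that removes the label only when it is present is shown below. The helper name `clean_email` is hypothetical, and it takes the element's text (or `None` when the element is missing).

```python
import re

def clean_email(raw_text):
    """Strip a leading 'Email' label; return 'NA' when there is no text."""
    if raw_text is None:
        return "NA"
    return re.sub(r"^\s*Email\s*", "", raw_text).strip()

print(clean_email("Emailkitchenofgodavari@gmail.com"))
print(clean_email(None))
```

In the loop, this could be called as `clean_email(t.text if t is not None else None)`.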

In [None]:
name
links
ph
address
email

In [None]:
d = {"name":name, "phone":ph, "address":address, "email":email,"url":links}

i=10
d["name"][i],d["phone"][i],d["address"][i],d["email"][i] ,d["url"][i]

('Prince Al Mataam Arabian Food Court',
 '+91 9700413299',
 'Door-No. 12-13-1282/G 11, Mehtab Arcade,Tarnaka,Hyderabad - 500007Telangana.',
 'NA',
 'http://yellowpages.in/b/prince-al-mataam-arabian-food-court-tarnaka-hyderabad/254071076')

In [None]:
import pandas as pd
df = pd.DataFrame(d)
df.head()

df.to_csv("data.csv")
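`pd.DataFrame(d)` raises a `ValueError` when the lists in the dictionary have different lengths. Since `name`, `ph`, `address`, and `email` are filled by separate loops, their lengths can drift apart. A small check before building the frame might look like this. The helper name `check_columns` and the sample data are hypothetical.

```python
def check_columns(columns):
    """Return {column name: length}; raise ValueError if the lengths differ."""
    lengths = {key: len(values) for key, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths

# Made-up miniature version of the dictionary d.
sample = {"name": ["A", "B"], "phone": ["+91 1", "+91 2"], "email": ["NA", "NA"]}
print(check_columns(sample))
```

Calling `check_columns(d)` before `pd.DataFrame(d)` reports which list is short instead of failing inside pandas.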