---   

<h1 align="center">Introduction to Data Analyst and Data Science for beginners</h1>
<h1 align="center">Lecture no 17</h1>

---
<h3><div align="right">Ehtisham Sadiq</div></h3>    

## _Regular Expressions Part-I.ipynb_
https://docs.python.org/3/howto/regex.html#regex-howto

https://docs.python.org/3/library/re.html

## Baseline:
- The concept of regular expressions dates back to the 1950s, when the American mathematician `Stephen Cole Kleene` formalized the notion of a regular language.
- Today, regular expressions are used in day-to-day Data Science tasks such as:
 >- Simple pattern matching
 >- Find-and-replace operations in editors such as Sublime Text, Notepad++, Atom, etc.
 >- Information extraction
 >- Web scraping
 >- Text mining (also known as text data mining: the process of transforming unstructured text into a structured format to identify meaningful patterns and new insights. By applying analytical techniques such as Naïve Bayes, Support Vector Machines (SVM), and deep learning algorithms, companies can explore and discover hidden relationships within their unstructured data.)
 >- Natural language processing

<h1 align="center">A Gentle Introduction to Regular Expressions (Regex)</h1> <br><br>

<img align="center" width="800" height="800"  src="images/re.jpeg"  >
<img align="center" width="500" height="500"  src="images/tm.jpg"  >

<br><br><br><br><br><br><br><br><br>

# Learning Agenda
**PART-I:**
1. A gentle introduction to Regular Expressions
2. Overview of Regex Metacharacters, Anchors, Quantifiers, Escape Codes and Grouping Constructs
3. Overview of regex101
4. A Step by Step hands-on practical understanding of REs on regex101.com
5. Practical Use Cases
    - Identify valid phone numbers
    - Identify/locate valid names or city codes
    - Identify valid email addresses
    - Identify valid URLs
6. Substitution and Replacement

<br><br>**PART-II:**

**Lecture 2.18 (Regular Expressions in Python)**

## Wild Card / Meta Characters
Metacharacters (wild cards) are characters that do not match themselves literally but have a special meaning when used in a regular expression. Some commonly used ones are listed below:


| Wild Card | Description         
| :-:       |:-------------
| **^**     |Caret symbol specifies that the match must start at the beginning of the string, and in MULTILINE mode also matches immediately after each newline<br>- `^b` will check if the string starts with 'b' such as baba, boss, basic, b, by, etc.<br>- `^si` will check if the string starts with 'si' such as simple, sister, si, etc.
| **$**     |Specifies that the match must occur at the end of the string <br> - `s$` will check for a string that ends with s such as geeks, ends, s, etc.<br>- `ing$` will check for a string that ends with ing such as going, seeing, ing, etc.
| **.**     |Represents a single occurrence of any character except a newline <br> - `a.b` will check for a string that contains a, then any one character, then b, such as acb, acbd, abbb, etc.<br> - `..` will check if the string contains at least 2 characters
| **\\**    |Used to drop the special meaning of the character following it, or to introduce a special sequence. <br> - Since the dot `(.)` is a metacharacter, to search for a literal dot in a string you have to put a backslash `(\)` just before the dot `(.)` so that it loses its special meaning.
| **[...]** |Matches a single character in the listed set. If a caret is the first character inside it, it means negation<br>- `[abc]` means match any single character out of this set<br>- `[123]` means match any single digit out of this set<br>- `[a-z]` means match any single lowercase letter<br>- `[0-9]` means match any single digit<br>- `[^0-3]` means any character except 0, 1, 2, or 3<br>- `[^a-c]` means any character except a, b, or c<br>- `[0-5][0-9]` will match all the two-digit numbers from 00 to 59<br>- `[0-9A-Fa-f]` will match any hexadecimal digit.<br>- Special characters lose their special meaning inside sets, so `[(+*)]` will match any of the literal characters '(', '+', '*', or ')'.<br>- To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set, so `[()[\]{}]` and `[]()[{}]` will both match any of the bracket characters '(', ')', '[', ']', '{', or '}'.
| **^[...]**|Matches any character in the set at the beginning of the string
| **[^...]**|Matches any single character that is NOT in the listed set (negation)
| **\|**    |The OR operator: it checks whether the pattern before or after the symbol is present in the string<br>- `a\|b` will match any string that contains a or b such as acd, bcd, abcd, etc.<br>- To match a literal '\|', precede it with a backslash, or enclose it inside a character class, as in `[\|]`.
| **( )**   |Used to capture and group
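
The rows above can be tried out in Python with the standard `re` module. The following is a minimal sketch rather than part of the original lecture; the word list and the variable names are illustrative assumptions.

```python
import re

# Sample words are illustrative only (not from the lecture data).
words = ["baba", "boss", "simple", "abc", "going", "a.b", "acb"]

# ^b : the string must start with 'b'
starts_with_b = [w for w in words if re.search(r"^b", w)]

# ing$ : the string must end with 'ing'
ends_with_ing = [w for w in words if re.search(r"ing$", w)]

# a.b : 'a', then any single character, then 'b'
any_middle = [w for w in words if re.search(r"a.b", w)]

# a\.b : the escaped dot only matches a literal '.'
literal_dot = [w for w in words if re.search(r"a\.b", w)]

# [^a-c] : any single character that is not a, b, or c
not_abc = re.findall(r"[^a-c]", "abcdef")

# (cat|bat) : alternation inside a capturing group
animals = re.findall(r"(cat|bat)", "cat mat bat")

print(starts_with_b)   # ['baba', 'boss']
print(ends_with_ing)   # ['going']
print(any_middle)      # ['a.b', 'acb']
print(literal_dot)     # ['a.b']
print(not_abc)         # ['d', 'e', 'f']
print(animals)         # ['cat', 'bat']
```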

## Quantifiers
- A quantifier metacharacter immediately follows a portion of a regex and indicates how many times that portion must occur for the match to succeed. The quantifiers are `*`, `+`, `?`, `{m}`, and `{m,n}`. When used alone, `*`, `+`, and `?` are all greedy, meaning they produce the longest possible match.

| Quantifier | Description
| :-:       |:-------------
| **\***    |The preceding character/expression is repeated zero or more times
| **+**     |The preceding character/expression is repeated one or more times. <br>- `ab+c` will be matched for the string abc, abbc, dabc, but will not be matched for ac, abdc because there is no b in ac and b is not followed by c in abdc.
| **?**     |The preceding character/expression is optional (zero or one occurrence). <br>- `ab?c` will be matched for the string ac, abc, acb, dabc, dac but will not be matched for abbc because there are two b's. Similarly, it will not be matched for abdc because b is not followed by c.
| **{n,m}** |The preceding character/expression is repeated from n to m times (both inclusive). <br> - `a{2,4}` will be matched for the string aaab, baaaac, gaad, but will not be matched for strings like abc, bc because there is only one a or no a in both the cases.
| **{n}**   |The preceding character/expression is repeated n times.<br>- `a{6}` will match exactly six 'a' characters, but not five.
| **{n,}**  |The preceding character/expression is repeated at least n times
| **{,m}**  |The preceding character/expression is repeated up to m times (from zero to m)
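
To see the quantifiers and their greedy behaviour in action, here is a small hedged sketch using Python's `re` module. The candidate strings come from the table examples; the HTML snippet and the non-greedy form `.+?` (standard `re` syntax, not listed in the table) are added here only for contrast.

```python
import re

candidates = ["abc", "abbc", "dabc", "ac", "abdc"]

# ab+c : 'a', one or more 'b', then 'c'
plus_matches = [s for s in candidates if re.search(r"ab+c", s)]

# ab?c : 'a', an optional 'b', then 'c'
optional_matches = [s for s in candidates if re.search(r"ab?c", s)]

# a{2,4} is greedy: it takes as many 'a' characters as allowed (up to four)
greedy_run = re.search(r"a{2,4}", "baaaaaac").group()

# Greedy versus non-greedy on an illustrative HTML snippet
html = "<b>bold</b>"
greedy_tag = re.search(r"<.+>", html).group()   # longest possible match
lazy_tag = re.search(r"<.+?>", html).group()    # shortest possible match

print(plus_matches)      # ['abc', 'abbc', 'dabc']
print(optional_matches)  # ['abc', 'dabc', 'ac']
print(greedy_run)        # aaaa
print(greedy_tag)        # <b>bold</b>
print(lazy_tag)          # <b>
```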

## Escape Codes
- You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more.
- The following list of special sequences isn’t complete.

| Code | Description         
| :-:  |:-------------
| **\d** |Matches any decimal digit. This is equivalent to [0-9]                              
| **\D** |Matches any non-digit character. This is equivalent to [^0-9] or [^\d]                           
| **\s** |Matches any whitespace character. For ASCII text this is equivalent to [ \t\n\r\f\v]
| **\S** |Matches any non-whitespace character. This is equivalent to [^ \t\n\r\f\v] or [^\s]
| **\w** |Matches any word character (alphanumeric or underscore). For ASCII text this is equivalent to [a-zA-Z0-9_]
| **\W** |Matches any non-word character. This is equivalent to [^a-zA-Z0-9_] or [^\w]
| **\b** |Matches at a word boundary, i.e. where the specified characters are at the beginning or at the end of a word, as in r"\bain" or r"ain\b"
| **\B** |Matches where the specified characters are present, but NOT at the beginning or at the end of a word, as in r"\Bain" or r"ain\B"
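
The escape codes in the table can be checked quickly in Python. This is an illustrative sketch: the sample strings and variable names are assumptions chosen for the demonstration.

```python
import re

text = "Call 321-555-4321 after 5 pm"

# \d+ : runs of digits
digits = re.findall(r"\d+", text)

# \D+ : runs of non-digit characters
non_digits = re.findall(r"\D+", "a1b22c")

# \w+ : runs of word characters (letters, digits, underscore)
word_chars = re.findall(r"\w+", "my_name@example.com")

# \s+ : split on any run of whitespace (spaces, tabs, newlines)
pieces = re.split(r"\s+", "one  two\tthree\nfour")

# \b and \B : word boundary versus non-boundary
rain_text = "The rain in Spain"
starts_with_ain = re.findall(r"\bain", rain_text)  # no word begins with 'ain'
ends_with_ain = re.findall(r"ain\b", rain_text)    # 'rain' and 'Spain' end with it
inside_word = re.findall(r"\Bain", rain_text)      # 'ain' not at the start of a word

print(digits)           # ['321', '555', '4321', '5']
print(non_digits)       # ['a', 'b', 'c']
print(word_chars)       # ['my_name', 'example', 'com']
print(pieces)           # ['one', 'two', 'three', 'four']
print(starts_with_ain)  # []
print(ends_with_ain)    # ['ain', 'ain']
print(inside_word)      # ['ain', 'ain']
```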

##  Practice Regular Expressions
[Visit regex101](https://regex101.com/)

- Several online tools let you write and test regular expressions with color coding, pattern explanation, substitution, etc.
- Regex101 is a popular tester because it offers many features and supports several regex flavors.

#### We will perform the following tasks/activities in regex101.
- Match any single character with `.`
- Match several characters with multiple dots `....`
- Match a literal dot with `\.`
- To search for `\`, use a double backslash `\\`
- To search for `*`, use `\*`; other metacharacters are escaped the same way.
- To search for a single digit like 1, 2, 3, ..., use `\d`.
- To search for non-digit characters, use `\D`.
- To search for a word boundary, use `\b`, as in `Ha\b`. We can also put a boundary on both sides, as in `\bHa\b`.
- Use the caret symbol, as in `^Ha`, to match a string that starts with `Ha`.
- Search for a valid number using `\d`.
- Search for all valid numbers, whatever the separator, using `[.-]`, as in `\d\d\d[.-]`
- Search for a range with a character class such as `[A-Z]` or `[a-z]`. Note the difference between a `-` placed between two characters (a range) and a `-` at the start or end of the set (a literal hyphen). A range like `[A-z]` also matches the punctuation characters that lie between `Z` and `a`.
- What does `[^0-9]` return? What is the difference between `[0-9]` and `[^0-9]`?
- Find all the words that end with `at` but do not start with `b`.
- Select all the valid hexadecimal numbers using a quantifier: `0[xX][0-9a-fA-F]+\b`
- What is the difference between `\d\d\d[.-]\d\d\d[.-]\d\d\d\d` and `\d{3}[.-]\d{3}[.-]\d{4}`?
- In `M[sr]s?`, the `?` makes the preceding `s` optional.
- Select all the valid names from the given text.

abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
Ha HaHa
MetaCharacters (Need to be escaped): 
.[{()\^$|?*+
arifbutt.me
321-555-4321
123.555.1234
111#923#9234
cat
mat
bat
0x45
0X4Ad
0x2g3
0x349ABf
0x
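
The same activities can be reproduced outside regex101. The sketch below applies three of the patterns from the task list to part of the sample text with Python's `re.findall`; the variable names are illustrative, and `[^b]at` is only one possible answer to the "ends with `at`" task.

```python
import re

# Part of the practice text above
sample = """321-555-4321
123.555.1234
111#923#9234
cat
mat
bat
0x45
0X4Ad
0x2g3
0x349ABf
0x"""

# The long and the quantified phone patterns select the same numbers
phones_long = re.findall(r"\d\d\d[.-]\d\d\d[.-]\d\d\d\d", sample)
phones_short = re.findall(r"\d{3}[.-]\d{3}[.-]\d{4}", sample)

# Words ending in 'at' whose first letter is not 'b'
at_words = re.findall(r"[^b]at", sample)

# Valid hexadecimal literals: 0x or 0X followed by one or more hex digits
hex_numbers = re.findall(r"0[xX][0-9a-fA-F]+\b", sample)

print(phones_long)   # ['321-555-4321', '123.555.1234']
print(phones_short)  # ['321-555-4321', '123.555.1234']
print(at_words)      # ['cat', 'mat']
print(hex_numbers)   # ['0x45', '0X4Ad', '0x349ABf']
```

The two phone patterns differ only in notation: `\d{3}` is shorthand for `\d\d\d`.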

#### Select only valid names

Hello World
Mr. Ehtisham
Mr Tayyab
Ms Sonia
Mrs. Ayesha
Mr. B
Learning is fun
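
One possible pattern for this task builds on `M[sr]s?` from the activity list. It is a sketch under the assumption that a "valid name" is a title (Mr, Ms, Mrs), an optional dot, a space, and a capitalized word; the lecture may intend a different definition.

```python
import re

names_text = """Hello World
Mr. Ehtisham
Mr Tayyab
Ms Sonia
Mrs. Ayesha
Mr. B
Learning is fun"""

# M[sr]s? : Mr, Ms, or Mrs (the pattern would also accept 'Mss')
# \.?     : an optional literal dot
# \s      : one whitespace character
# [A-Z]\w*: a capital letter followed by zero or more word characters
NAME_PATTERN = r"M[sr]s?\.?\s[A-Z]\w*"

valid_names = re.findall(NAME_PATTERN, names_text)
print(valid_names)
# ['Mr. Ehtisham', 'Mr Tayyab', 'Ms Sonia', 'Mrs. Ayesha', 'Mr. B']
```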

#### Check validity of emails
- The regex `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9]+\.[a-zA-Z0-9.-]+` is used to select valid emails.

List of Valid Email Addresses
ehtisham@pucit.edu.pk
ehtisham.ds@pu.edu.pk
ehtishampucit@gmail.com
ehtisham.pucit@pu.edu.pk
first+123.5@example.com
abc%xyz@subdomain.example.com
my_name@example.com
first-last@example.com

List of Invalid Email Addresses
#@%^%#$@#$@#.com
abc.def@mail
abc.def@mail#archive.com
@example.com
ehtisham sadiq @example.com
Tayyab#@gmail.com
Abc.example.com
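
The email pattern can be checked against both lists in Python. This is a minimal sketch, assuming each address is tested on its own and the whole address must match (hence `re.fullmatch`); the helper name `is_valid_email` is hypothetical.

```python
import re

EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9]+\.[a-zA-Z0-9.-]+")

valid = [
    "ehtisham@pucit.edu.pk",
    "ehtisham.ds@pu.edu.pk",
    "ehtishampucit@gmail.com",
    "ehtisham.pucit@pu.edu.pk",
    "first+123.5@example.com",
    "abc%xyz@subdomain.example.com",
    "my_name@example.com",
    "first-last@example.com",
]

invalid = [
    "#@%^%#$@#$@#.com",
    "abc.def@mail",
    "abc.def@mail#archive.com",
    "@example.com",
    "ehtisham sadiq @example.com",
    "Tayyab#@gmail.com",
    "Abc.example.com",
]

def is_valid_email(address):
    """Return True if the whole address matches the lecture's pattern."""
    return EMAIL_PATTERN.fullmatch(address) is not None

valid_results = [is_valid_email(a) for a in valid]
invalid_results = [is_valid_email(a) for a in invalid]

print(all(valid_results))    # True: every address in the valid list matches
print(any(invalid_results))  # False: no address in the invalid list matches
```

Note that this pattern is a teaching simplification: for example, it rejects a domain whose first label contains a hyphen, because `[a-zA-Z0-9]+` does not include `-`.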

#### Select valid URLs

https://www.google.com
http://ehtisham.me
https://youtube.com
https://www.yahoo.com
http://facebook.com
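
The notebook does not give a pattern for the URL task, so the one below is an assumption offered as a starting point, not the lecture's answer. It accepts `http` or `https`, an optional `www.`, and a simple `name.tld` host, which is enough for the five sample URLs but far from a complete URL validator.

```python
import re

urls_text = """https://www.google.com
http://ehtisham.me
https://youtube.com
https://www.yahoo.com
http://facebook.com"""

# https?      : 'http' with an optional 's'
# ://         : the literal scheme separator
# (?:www\.)?  : an optional 'www.' (non-capturing group)
# \w+\.\w+    : host name, a literal dot, then the top-level domain
URL_PATTERN = r"https?://(?:www\.)?\w+\.\w+"

matched_urls = re.findall(URL_PATTERN, urls_text)
print(matched_urls == urls_text.splitlines())  # True
```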

## Practice Questions:

### Example 1: Write a regular expression to search for digits inside a string.
- Input : "My roll number is 25"
- Output : ['2', '5']

In [None]:
import re

target_string = "My roll number is 25"
pattern = r"\d"  # \d matches a single decimal digit
result = re.findall(pattern, target_string)  # list of every match, as strings
result

### Write a Python program that matches a string that has an a followed by zero or more b's.

In [None]:
# write your answer

### Write a Python program that matches a string that has an a followed by zero or one 'b'.

Hint: `patterns = 'ab?'`
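
Building on the hint, one possible solution is sketched below. The function name mirrors the other exercises in this notebook; the `pattern` parameter and the test strings are illustrative assumptions. The last two lines contrast `ab?` with `ab*` from the previous question.

```python
import re

def text_match(text, pattern=r"ab?"):
    # ab? : an 'a' followed by zero or one 'b'
    if re.search(pattern, text):
        return "Found a match!"
    return "Not matched!"

print(text_match("ac"))    # Found a match!  ('a' with zero 'b')
print(text_match("abbc"))  # Found a match!  ('ab' is matched)
print(text_match("xyz"))   # Not matched!    (no 'a' at all)

# ? stops after at most one 'b', while * keeps consuming every 'b'
zero_or_one = re.search(r"ab?", "abbbc").group()   # 'ab'
zero_or_more = re.search(r"ab*", "abbbc").group()  # 'abbb'
print(zero_or_one, zero_or_more)
```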

### Write a Python program to find sequences of lowercase letters joined with an underscore.

Hint: `^[a-z]+_[a-z]+$`
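
A possible solution using the hinted pattern is sketched here; the test strings are illustrative assumptions. Because the pattern is anchored with `^` and `$`, the whole string must consist of lowercase letters, one underscore, and more lowercase letters.

```python
import re

def text_match(text):
    # ^[a-z]+ : one or more lowercase letters at the start
    # _       : a literal underscore
    # [a-z]+$ : one or more lowercase letters up to the end
    pattern = r"^[a-z]+_[a-z]+$"
    if re.search(pattern, text):
        return "Found a match!"
    return "Not matched!"

print(text_match("aab_cbbbc"))   # Found a match!
print(text_match("aab_Abbbc"))   # Not matched!  (uppercase after the underscore)
print(text_match("Aaab_abbbc"))  # Not matched!  (uppercase at the start)
```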

### Write a Python program that matches a word at the beginning of a string.

In [None]:
import re

def text_match(text):
    # ^ anchors at the start of the string; \w+ then matches the first word
    pattern = r"^\w+"
    if re.search(pattern, text):
        print("Text Found!")
    else:
        print("Text not found")

text_match("The quick brown fox jumps over the lazy dog.")   # starts with a word
text_match(" The quick brown fox jumps over the lazy dog.")  # starts with a space

### Write a Python program that matches a word at the end of a string, with optional punctuation.

In [None]:
import re

def text_match(text):
    # Raw string so that \w and \S reach the regex engine unchanged.
    # \w+ : the last word, \S* : optional trailing punctuation, $ : end of string
    patterns = r'\w+\S*$'
    if re.search(patterns, text):
        return 'Found a match!'
    else:
        return 'Not matched!'

print(text_match("The quick brown fox jumps over the lazy dog."))
print(text_match("The quick brown fox jumps over the lazy dog. "))
print(text_match("The quick brown fox jumps over the lazy dog "))

### Write a Python program that matches a word containing 'z'.

In [None]:
import re

def text_match(text):
    # \w*z\w* : a 'z' with any number of word characters on either side,
    # so words that end in 'z' are matched too
    patterns = r'\w*z\w*'
    if re.search(patterns, text):
        return 'Found a match!'
    else:
        return 'Not matched!'

print(text_match("The quick brown fox jumps over the lazy dog."))
print(text_match("Python Exercises."))

### Write a Python program to match a string that contains only upper and lowercase letters, numbers, and underscores.


In [None]:
import re

def text_match(text):
    # The whole string may contain only letters, digits, and underscores
    patterns = r'^[a-zA-Z0-9_]*$'
    if re.search(patterns, text):
        return 'Found a match!'
    else:
        return 'Not matched!'

print(text_match("The quick brown fox jumps over the lazy dog."))
print(text_match("Python_Exercises_1"))

In [None]:
import re

def match_num(string):
    # Check whether the string starts with the digit 5
    pattern = re.compile(r"^5")
    return bool(pattern.match(string))

print(match_num('5-2345861'))
print(match_num('6-2345861'))