<img src=https://i.ibb.co/6gCsHd6/1200px-Pandas-logo-svg.png width="700" height="200">

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#060108; font-size:200%; text-align:center; border-radius:10px 10px;">Data Analysis with Python</p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#060108; font-size:150%; text-align:center; border-radius:10px 10px;">Session - 11 (Part - 01)</p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#4d77cf; font-size:200%; text-align:center; border-radius:10px 10px;">RegEx in Python</p>

<a id="toc"></a>

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Content</p>

* [IMPORTING LIBRARIES NEEDED IN THIS NOTEBOOK](#0)
* [RegEx in PYTHON](#1)
* [RAW STRING ("r/ R")](#2)
* [COMMON PYTHON RegEx FUNCTIONS](#3)    
* [PANDAS FUNCTIONS ACCEPTING RegEx](#4)    
* [THE END OF THE SESSION - 11 (PART - 01)](#5)

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Importing Libraries Needed in This Notebook</p>

<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

Once NumPy and Pandas are installed, import them together with Python's built-in ``re`` module:

In [1]:
import numpy as np
import pandas as pd
import re

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">RegEx in Python</p>

<a id="1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

- A **Reg**ular **Ex**pression (RegEx) is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern.

- The **Python module** **``re``** provides full support for regular expressions in Python [Source 01](https://docs.python.org/3/library/re.html#re-objects), [Source 02](https://www.tutorialspoint.com/python/python_reg_expressions.htm) & [Source 03](https://www.w3schools.com/python/python_regex.asp).


### Common Expressions

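**``\d``** Any decimal digit, i.e. ``0`` to ``9`` (with ``str`` patterns, other Unicode decimal digits also match unless the ``re.ASCII`` flag is set).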
                           
**``\D``** Matches any character which is not a decimal digit. This is the opposite of ``\d``.
                           
**``\w``** Any letter, numeric digit, or the underscore character. (Think of this as matching "word" characters.)
                           
**``\W``** Any character that is not a letter, numeric digit, or the underscore character.
                           
**``\s``** Any space, tab, or newline character. (Think of this as matching white-space characters.)
                           
**``\S``** Any character that is not a space, tab, or newline.
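
As a quick illustration of the six character classes above, here is a minimal sketch that applies each of them to one made-up sample string (the string and the variable names are assumptions chosen for this example, not part of the session data):

```python
import re

sample = "Room 42, floor_3\tok"

digits = re.findall(r"\d", sample)        # single digits
non_digits = re.findall(r"\D+", sample)   # runs of non-digit characters
words = re.findall(r"\w+", sample)        # runs of letters, digits, underscore
non_words = re.findall(r"\W+", sample)    # runs of anything that is not a word character
spaces = re.findall(r"\s", sample)        # single whitespace characters
tokens = re.findall(r"\S+", sample)       # runs of non-whitespace characters

print(digits)   # ['4', '2', '3']
print(words)    # ['Room', '42', 'floor_3', 'ok']
print(tokens)   # ['Room', '42,', 'floor_3', 'ok']
```

Each uppercase class is the complement of its lowercase counterpart, which is why ``words`` and ``non_words`` together cover the whole string.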


### Common Metacharacters

**``"[]"``**	  A set of characters	``"[a-m]"``

**``"\"``**	      Signals a special sequence (can also be used to escape special characters)

**``"."``**	      Any character (except newline character)

**``"^"``**	      Starts with	``"^hello"``

**``"$"``**	      Ends with	``"world$"``

**``"*"``**	      Match zero or more of the previous

**``"+"``**	      Match one or more of the previous

**``"?"``**	      Match zero or one of the previous

**``"{}"``**	  Match the specified number of occurrences, e.g. ``"\d{2}"`` (exactly two) or ``"\d{1,3}"`` (one to three)

**``"|"``**	      Either or	`"falls|stays"`

**``"()"``**	  Capture and group
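
The metacharacters listed above can be tried out on a few short strings. This is only a sketch: the sample strings and variable names are invented for the example.

```python
import re

line = "hello wide world"

starts = re.search(r"^hello", line)                 # ^ anchors the match at the start
ends = re.search(r"world$", line)                   # $ anchors the match at the end
either = re.findall(r"hello|world", line)           # | means "either or"
in_set = re.findall(r"[a-e]", line)                 # [] is a set of characters
dotted = re.findall(r"w.", line)                    # . is any character except newline
optional = re.findall(r"colou?r", "color colour")   # ? means zero or one
repeated = re.findall(r"o{2}", "foo boooo")         # {2} means exactly two

print(either)    # ['hello', 'world']
print(in_set)    # ['e', 'd', 'e', 'd']
print(dotted)    # ['wi', 'wo']
print(optional)  # ['color', 'colour']
print(repeated)  # ['oo', 'oo', 'oo']
```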

![image.png](attachment:image.png)


## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Raw String ("r / R")</p>

<a id="2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

- A Python raw string is created by prefixing a string literal with **'r' or 'R'**.
- A raw string treats the **``backslash (\)``** as a literal character. This is useful when a string contains backslashes that should not be interpreted as escape sequences, which is the usual situation for regex patterns [Source 01](https://blog.devgenius.io/beauty-of-raw-strings-in-python-fa627d674cbf) & [Source 02](https://stackoverflow.com/questions/26318287/what-does-r-mean-before-a-regex-pattern#:~:text=The%20r%20means%20that%20the,escape%20codes%20will%20be%20ignored.).
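
The difference between a normal and a raw string literal can be checked by looking at string lengths. The sketch below is illustrative only; the Windows-style path is a made-up value.

```python
import re

normal = "\n"   # one character: a newline
raw = r"\n"     # two characters: a backslash followed by the letter n
print(len(normal), len(raw))   # 1 2

# Both spellings describe the same regex ("one digit");
# the raw form avoids doubling every backslash.
assert re.findall("\\d", "a1b2") == re.findall(r"\d", "a1b2")

# A raw string literal cannot end in a single backslash;
# appending a normal string is one common workaround.
path = r"C:\temp" + "\\"
print(path)   # C:\temp\
```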

In [2]:
print("backslash:\\")

backslash:\


In [3]:
print(r"backslash:\\")

backslash:\\


In [4]:
print(r"backslash:\")

SyntaxError: EOL while scanning string literal (2200827200.py, line 1)

In [5]:
print("new line char: \\n")

new line char: \n


In [6]:
print(r"new line char: \\n")

new line char: \\n


In [7]:
my_string = "Hello\nWorld"


In [8]:
print(my_string)

Hello
World


In [9]:
print(r"Hello\nWorld")

Hello\nWorld


## Invalid Raw String

In [10]:
# print("\")

In [11]:
# print(r"\")

In [12]:
print(r"abc\")

SyntaxError: EOL while scanning string literal (565559609.py, line 1)

In [13]:
print(r"abc\\\")

SyntaxError: EOL while scanning string literal (3643937422.py, line 1)

In [14]:
print(r"\abc")

\abc


In [15]:
print(r"abc\ ")

abc\ 


In [16]:
print(r"abc\\")

abc\\


In [17]:
# print(r"abc\\\")

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Common Python RegEx Functions</p>

<a id="3"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

- **re.search():** Scan through string looking for a match to the pattern.
- **re.match():** Try to apply the pattern at the start of the string.
- **re.fullmatch():** Try to apply the pattern to all of the string.
- **re.findall():** Return a list of all non-overlapping matches in the string.
- **re.sub():** Return the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string by the replacement repl.
- **re.split():** Split the source string by the occurrences of the pattern, returning a list containing the resulting substrings.
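
To see how the six functions differ, the following sketch runs all of them on the same short string, which is also used later in this notebook. The variable names are assumptions made for the example.

```python
import re

text = "A78L41K"
pattern = r"\d+"

found_anywhere = re.search(pattern, text)   # first match, anywhere in the string
at_start = re.match(pattern, text)          # None, because the text starts with "A"
whole = re.fullmatch(r"[A-Z0-9]+", text)    # the whole string has to match
all_numbers = re.findall(pattern, text)     # every match, as a list of strings
masked = re.sub(pattern, "#", text)         # every match replaced by "#"
pieces = re.split(pattern, text)            # the text between the matches

print(found_anywhere.group())  # 78
print(at_start)                # None
print(all_numbers)             # ['78', '41']
print(masked)                  # A#L#K
print(pieces)                  # ['A', 'L', 'K']
```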

In [18]:
# dir(re)

In [19]:
help(re.match)  # Show the built-in documentation for re.match.

Help on function match in module re:

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a Match object, or None if no match was found.



## ``re.search(pattern, string, flags=0)``

Scan through string looking for a match to the pattern, returning a Match object, or None if no match was found [Source](https://www.pythontutorial.net/python-regex/python-regex-flags/).

#### Find numeric digits with search function

In [20]:
text = "A78L41K"

In [21]:
re.search("78",text)  # gets the first match to the pattern.

<re.Match object; span=(1, 3), match='78'>

#### with regular expressions

In [22]:
re.search(r"\d\d",text)  # the r prefix passes the backslashes to the regex engine unchanged

<re.Match object; span=(1, 3), match='78'>

In [23]:
re.search("\d\d",text).start()

1

In [24]:
re.search("\d\d",text).end()

3

In [25]:
re.search("\d\d",text).span()

(1, 3)

In [26]:
re.search("\d\d",text).group()

'78'

#### with compile() method

> ``re.compile()`` parses the pattern once and returns a reusable pattern object. This can speed things up when the same pattern is applied to a large number of documents.
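
A minimal sketch of that reuse, assuming a small hypothetical list of strings stands in for the "documents":

```python
import re

two_digits = re.compile(r"\d{2}")   # parse the pattern once

codes = ["A78L41K", "8PM19MIN", "no digits"]   # hypothetical sample data

firsts = []
for code in codes:
    m = two_digits.search(code)                # reuse the compiled object
    firsts.append(m.group() if m else None)    # None when nothing matches

print(firsts)   # ['78', '19', None]
```

The compiled object offers the same methods as the module-level functions (``search``, ``match``, ``findall``, ``sub``, ``split``), just without the pattern argument.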

In [27]:
comp = re.compile(r"\d{2}")  # r"\d\d" is equivalent; the literal "78" would also match this text

In [28]:
comp.search(text)

<re.Match object; span=(1, 3), match='78'>

In [29]:
num = comp.search(text)

In [30]:
num

<re.Match object; span=(1, 3), match='78'>

In [31]:
num.start()  # Return index of the start of the substring matched by group.

1

In [32]:
num.end()  # Return index of the end of the substring matched by group.

3

In [33]:
num.span()  #  For match object m, return the 2-tuple (m.start(group), m.end(group)).

(1, 3)

In [34]:
num.group()  # Return subgroup(s) of the match by index or name; with no argument (or 0) it returns the entire match.

'78'

#### Find non-digit characters with search function

In [35]:
text = "8PM19MIN"

In [36]:
re.search("\D",text)

<re.Match object; span=(1, 2), match='P'>

In [37]:
re.search("\D\D",text)

<re.Match object; span=(1, 3), match='PM'>

In [38]:
re.search("\D+",text)

<re.Match object; span=(1, 3), match='PM'>

In [39]:
re.search(r"\D*",text)  # "*" allows zero repetitions, so the empty string at index 0 is the first match

<re.Match object; span=(0, 0), match=''>

In [40]:
re.search("\D{2}",text)

<re.Match object; span=(1, 3), match='PM'>

In [41]:
m = re.search(r"\D+",text)  # search once and reuse the Match object
m.group(), m.start(), m.end()  # matched text and its start/end indexes in the string

('PM', 1, 3)

In [42]:
# re.search("[^0-9]",text)

#### Find phone number pattern

In [43]:
text = 'My phone number is 1234567890'

In [44]:
re.search(r"\d{10}",text)  # span gives the start and end indexes of the matched characters

<re.Match object; span=(19, 29), match='1234567890'>

In [45]:
re.search("\d+",text)

<re.Match object; span=(19, 29), match='1234567890'>

In [46]:
text = 'My phone number is 123 456 7890'

In [47]:
re.search("\d+\s*\d+\s*\d+",text)

<re.Match object; span=(19, 31), match='123 456 7890'>

In [48]:
re.search("(\d+\s*){2}\d+",text)

<re.Match object; span=(19, 31), match='123 456 7890'>

In [49]:
re.search("\d.+",text)

<re.Match object; span=(19, 31), match='123 456 7890'>

In [50]:
re.search("\d.*",text)

<re.Match object; span=(19, 31), match='123 456 7890'>

In [51]:
text = 'My phone number is 123-456-7890'

In [52]:
re.search("\d\d\d-\d\d\d-\d\d\d\d",text)

<re.Match object; span=(19, 31), match='123-456-7890'>

In [53]:
re.search("\d+-\d+-\d+",text)

<re.Match object; span=(19, 31), match='123-456-7890'>

In [54]:
re.search("(\d+-){2}\d+",text)

<re.Match object; span=(19, 31), match='123-456-7890'>

In [55]:
re.search("\d.*",text)

<re.Match object; span=(19, 31), match='123-456-7890'>

#### Find phone number pattern by grouping

In [56]:
text

'My phone number is 123-456-7890'

In [57]:
re.search("\d\d\d-\d\d\d-\d\d\d\d",text)

<re.Match object; span=(19, 31), match='123-456-7890'>

Let's add capture groups to the pattern:

In [58]:
tel_no = re.search("(\d\d\d)-(\d\d\d)-(\d\d\d\d)",text)
tel_no

<re.Match object; span=(19, 31), match='123-456-7890'>

In [59]:
tel_no.group()

'123-456-7890'

In [60]:
tel_no.group(0)

'123-456-7890'

In [61]:
tel_no.group(1)

'123'

In [62]:
tel_no.group(2)

'456'

In [63]:
tel_no.group(3)

'7890'

In [64]:
tel_no = re.search("(\d+)-(\d+)-(\d+)",text)
tel_no

<re.Match object; span=(19, 31), match='123-456-7890'>

In [65]:
tel_no = re.search("(\d*)-(\d+)-(\d+)",text)
tel_no

<re.Match object; span=(19, 31), match='123-456-7890'>

In [66]:
tel_no = re.search("(\d*)-(\d*)-(\d*)",text)
tel_no

<re.Match object; span=(19, 31), match='123-456-7890'>

#### Escape the literal parentheses and capture three groups -> (415), 555 and 1212

In [67]:
text = 'My phone number is (415) 555-1212'

In [68]:
re.search("(\(\d\d\d\))\s+(\d+)-(\d+)",text)

<re.Match object; span=(19, 33), match='(415) 555-1212'>

In [69]:
re.search("(\(\d\d\d\))\s+(\d+)-(\d+)",text).group()

'(415) 555-1212'

In [70]:
re.search("(\(\d\d\d\))\s+(\d+)-(\d+)",text).group(0)

'(415) 555-1212'

In [71]:
re.search("(\(\d\d\d\))\s+(\d+)-(\d+)",text).group(1)

'(415)'

In [72]:
re.search("(\(\d\d\d\))\s+(\d+)-(\d+)",text).group(2)

'555'

In [73]:
re.search("(\(\d\d\d\))\s+(\d+)-(\d+)",text).group(3)

'1212'

In [74]:
re.search("(\(\d\d\d\))\s+(\d+)-(\d+)",text).groups()

('(415)', '555', '1212')

In [75]:
len(re.search("(\(\d\d\d\))\s+(\d+)-(\d+)",text).groups())

3

In [76]:
m = re.search(r"(\(\d\d\d\))\s+(\d+)-(\d+)",text)  # search once and reuse the Match object
for i in range(len(m.groups())+1):  # group(0) is the whole match, 1..3 are the captured groups
    print(m.group(i))

(415) 555-1212
(415)
555
1212


## ``re.match(pattern, string, flags=0)``

Try to apply the pattern at the start of the string, returning a Match object, or None if no match was found.

Unlike ``search()``, ``match()`` only looks for the pattern at the beginning of the string; if the pattern first occurs later in the string, ``match()`` still returns None. To locate a match anywhere in the string, use ``search()`` instead.
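
Because both functions return None when nothing matches, calling ``.group()`` directly on the result raises an ``AttributeError`` in that case. One way to guard against this is sketched below; ``first_number`` is a hypothetical helper written for this example.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    m = re.search(r"\d+", text)
    return m.group() if m is not None else None

print(first_number("A78L41K"))     # 78
print(first_number("no digits"))   # None

# match() fails here because the digits are not at the start of the string
print(re.match(r"\d+", "A78L41K") is None)   # True
```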

In [77]:
text = "A78L41K"

In [78]:
re.match(r"\d+",text)  # Returns None (so no output is shown): the text starts with "A", not a digit.

In [79]:
re.match("\D\d+",text)

<re.Match object; span=(0, 3), match='A78'>

In [80]:
re.match("\w{3}",text)

<re.Match object; span=(0, 3), match='A78'>

## ``re.fullmatch(pattern, string, flags=0)``

Try to apply the pattern to the entire string, returning a Match object, or None if the whole string does not match.


In [81]:
text = "A78L41K" 

In [82]:
re.fullmatch("\D\d+\D\d{2}\D",text)

<re.Match object; span=(0, 7), match='A78L41K'>

In [83]:
re.fullmatch("\w+",text)

<re.Match object; span=(0, 7), match='A78L41K'>

## ``re.findall(pattern, string, flags=0)``

Return a list of all non-overlapping matches in the string.

#### Extract numbers from text as a list

In [84]:
text = "O 1, t 10, o 100. 100000,10"

In [85]:
re.findall("\d",text)  # Return a list of all non-overlapping matches in the string.

['1', '1', '0', '1', '0', '0', '1', '0', '0', '0', '0', '0', '1', '0']

In [86]:
re.findall("\d{2}",text)

['10', '10', '10', '00', '00', '10']

In [87]:
re.findall("\d{3}",text)

['100', '100', '000']

In [88]:
re.findall("\d{4}",text)

['1000']

In [89]:
re.findall("\d{5}",text)

['10000']

In [90]:
re.findall("\d{6}",text)

['100000']

In [91]:
re.findall("\d{7}",text)

[]

In [92]:
re.findall("\d{1,6}",text)

['1', '10', '100', '100000', '10']

In [93]:
re.findall("\d{1,3}",text)

['1', '10', '100', '100', '000', '10']

In [94]:
re.findall("\d{1,2}",text)

['1', '10', '10', '0', '10', '00', '00', '10']

#### Extract words beginning with "f"

In [95]:
text = 'which foot or hand fell fastest'

In [96]:
re.search("f[a-z]*",text)

<re.Match object; span=(6, 10), match='foot'>

In [97]:
re.match("f[a-z]*",text)

In [98]:
re.fullmatch("f[a-z]*",text)

In [99]:
re.findall("f[a-z]*",text)

['foot', 'fell', 'fastest']

#### Extract equations made up of words and numbers

In [100]:
text = 'set width=20 and height=10'

In [101]:
re.findall("\w+=\d+",text)

['width=20', 'height=10']

In [102]:
re.findall(r"(\w+)=(\d+)",text)  # With groups, findall returns only the captured parts, as a list of tuples; the "=" outside the groups is dropped.

[('width', '20'), ('height', '10')]

#### Check if the string starts with 'hello'

In [103]:
text = "hello world"

In [104]:
re.match("\w+",text)

<re.Match object; span=(0, 5), match='hello'>

In [105]:
re.findall("\w+",text)

['hello', 'world']

In [106]:
re.findall("\w+",text)[0]

'hello'

In [107]:
re.findall("^hello",text)

['hello']

In [108]:
re.findall("^hel",text)

['hel']

#### Check if the string ends with 'world'

In [109]:
re.findall("world$",text)

['world']

In [110]:
re.findall("rld$",text)

['rld']

## ``re.sub(pattern, repl, string, count=0, flags=0)``

Return the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string by the replacement repl.  repl can be either a string or a callable; if a string, backslash escapes in it are processed.  If it is a callable, it's passed the Match object and must return a replacement string to be used.

> It works like the string ``replace`` method, but the text to replace is described by a pattern: every match of the pattern is replaced with the expression you provide.
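
The description above mentions two forms of ``repl`` that a short sketch can illustrate: a string with backslash references to groups, and a callable. The helper name ``double`` is an assumption made for this example.

```python
import re

text = "2004-959-559 # This is Phone Number"

# String repl: \1, \2, \3 refer to the captured groups of the pattern
reordered = re.sub(r"(\d+)-(\d+)-(\d+)", r"\3-\2-\1", text)

def double(match):
    # Receives the Match object and returns the replacement text
    return str(int(match.group()) * 2)

# Callable repl: called once for every match of the pattern
doubled = re.sub(r"\d+", double, text)

print(reordered)  # 559-959-2004 # This is Phone Number
print(doubled)    # 4008-1918-1118 # This is Phone Number
```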

#### Remove anything other than digits

Let's remove the non-digits.

In [111]:
text = "2004-959-559 # This is Phone Number"

In [112]:
print(re.findall("\D",text))

['-', '-', ' ', '#', ' ', 'T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'P', 'h', 'o', 'n', 'e', ' ', 'N', 'u', 'm', 'b', 'e', 'r']


In [113]:
re.sub("\D","",text)

'2004959559'

#### Replace digits with "."

In [114]:
re.sub("\d",".",text)

'....-...-... # This is Phone Number'

In [115]:
re.sub("\d",".",text,count=4)

'....-959-559 # This is Phone Number'

In [116]:
re.sub("\d","0",text,count=4)

'0000-959-559 # This is Phone Number'

Let's use the replace function:

In [117]:
pd.Series(text).str.replace("\d",".",regex=True)  # we used the replace method with regex expression. 

0    ....-...-... # This is Phone Number
dtype: object

## ``re.split(pattern, string, maxsplit=0, flags=0)``

Split the source string by the occurrences of the pattern, returning a list containing the resulting substrings.

In [118]:
text = "ab56cd78_de fg3hıi49"

In [119]:
re.split("\d+",text)

['ab', 'cd', '_de fg', 'hıi', '']

In [120]:
re.split("\D+",text,maxsplit=2)

['', '56', '78_de fg3hıi49']

In [121]:
re.findall("\d+",text)

['56', '78', '3', '49']

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Pandas Functions Accepting RegEx</p>

<a id="4"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

- **count():** Count occurrences of pattern in each string of the Series/Index
- **replace():** Replace the search string or pattern with the given value
- **contains():** Test if pattern or regex is contained within a string of a Series or Index. Calls re.search() and returns a boolean
- **findall():** Find all occurrences of pattern or regular expression in the Series/Index. Equivalent to applying re.findall() on all elements
- **match():** Determine if each string matches a regular expression. Calls re.match() and returns a boolean
- **split():** Split strings around given separator/delimiter and accepts string or regular expression to split on
- **extract():** Extract capture groups in the regex pat as columns in a DataFrame and returns the captured groups

In [132]:
data = [['Evert van Dijk', 'Carmine-pink, salmon-pink streaks, stripes, flecks. #94569# Warm pink, clear carmine pink, rose pink shaded salmon.  Mild fragrance.  Large, very double, in small clusters, high-centered bloom form.  Blooms in flushes throughout the season.'],
        ['Every Good Gift', 'Red.  Flowers velvety red.  #079463895689# Moderate fragrance.  Average diameter 4".  Medium-large, full (26-40 petals), borne mostly solitary bloom form.  Blooms in flushes throughout the season.'], 
        ['Evghenya', 'Orange-pink.  75 petals.  Large, very double #68345_686# bloom form.  Blooms in flushes throughout the season.'], 
        ['Evita', 'White or white blend.  None to mild fragrance.  35 petals #9897#.  Large, full (26-40 petals), high-centered bloom form.  Blooms in flushes throughout the season.'],
        ['Evrathin', 'Light pink. [Deep pink.]  Outer petals white. Expand rarely #679754YH89#.  Mild fragrance.  35 to 40 petals.  Average diameter 2.5".  Medium, double (17-25 petals), full (26-40 petals), cluster-flowered, in small clusters bloom form.  Prolific, once-blooming spring or summer.  Glandular sepals, leafy sepals, long sepals buds.'],
        ['Evita 2', 'White, blush shading.  Mild, wild rose fragrance #AGHJS876IOP#.  20 to 25 petals.  Average diameter 1.25".  Small, very double, cluster-flowered bloom form.  Blooms in flushes throughout the season.']]
  
df = pd.DataFrame(data, columns = ['name', 'bloom']) 
df 

Unnamed: 0,name,bloom
0,Evert van Dijk,"Carmine-pink, salmon-pink streaks, stripes, fl..."
1,Every Good Gift,Red. Flowers velvety red. #079463895689# Mod...
2,Evghenya,"Orange-pink. 75 petals. Large, very double #..."
3,Evita,White or white blend. None to mild fragrance....
4,Evrathin,Light pink. [Deep pink.] Outer petals white. ...
5,Evita 2,"White, blush shading. Mild, wild rose fragran..."


## ``pandas.Series.str.count(pat, flags=0)``

Count occurrences of pattern in each string of the Series/Index.

This function is used to count the number of times a particular regex pattern is repeated in each of the string elements of the Series.

#### How many numerical values are there in each row of "bloom" feature?

In [133]:
df["bloom"]

0    Carmine-pink, salmon-pink streaks, stripes, fl...
1    Red.  Flowers velvety red.  #079463895689# Mod...
2    Orange-pink.  75 petals.  Large, very double #...
3    White or white blend.  None to mild fragrance....
4    Light pink. [Deep pink.]  Outer petals white. ...
5    White, blush shading.  Mild, wild rose fragran...
Name: bloom, dtype: object

In [134]:
df["bloom"][0]

'Carmine-pink, salmon-pink streaks, stripes, flecks. #94569# Warm pink, clear carmine pink, rose pink shaded salmon.  Mild fragrance.  Large, very double, in small clusters, high-centered bloom form.  Blooms in flushes throughout the season.'

In [135]:
type(df["bloom"][0])

str

In [138]:
df["bloom"].count()  # Series.count() returns the number of non-NA values; it is not the same as .str.count().

6

In [142]:
df["bloom"].str.count("\d+")

0     1
1     4
2     3
3     4
4    10
5     5
Name: bloom, dtype: int64

#### How many characters are there in each row of "bloom" feature?

In [143]:
df["bloom"].apply(len)

0    240
1    196
2    110
3    162
4    327
5    198
Name: bloom, dtype: int64

In [148]:
df["bloom"].str.count(".")  # "." matches any character except a newline, so here the count equals the string length

0    240
1    196
2    110
3    162
4    327
5    198
Name: bloom, dtype: int64

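#### How many sentences are there in each row of "bloom" feature?

Counting the literal dot (``\.``) only approximates the number of sentences, because decimal points such as the one in ``2.5"`` are counted as well.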

In [152]:
df["bloom"].str.count("\.")

0     5
1     6
2     4
3     5
4    11
5     7
Name: bloom, dtype: int64

In [154]:
df["bloom"][0]

'Carmine-pink, salmon-pink streaks, stripes, flecks. #94569# Warm pink, clear carmine pink, rose pink shaded salmon.  Mild fragrance.  Large, very double, in small clusters, high-centered bloom form.  Blooms in flushes throughout the season.'

#### How many times does the word "pink" occur in each row of "bloom" feature?

In [155]:
df["bloom"].str.count("pink")

0    5
1    0
2    1
3    0
4    2
5    0
Name: bloom, dtype: int64

## ``pandas.Series.str.replace(pat, repl, n=-1, case=None, flags=0, regex=None)``

Replace each occurrence of pattern/regex in the Series/Index.

Equivalent to str.replace() or re.sub(), depending on the regex value.
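
The effect of the ``regex`` argument can be seen on a two-element Series. This is a sketch with made-up values; since the default for ``regex`` has differed between pandas versions, passing it explicitly is the safer habit.

```python
import pandas as pd

s = pd.Series(["a.b", "axb"])

# regex=False: "a." is a literal two-character string (like str.replace)
literal = s.str.replace("a.", "-", regex=False)

# regex=True: "a." means "a followed by any character" (like re.sub)
as_regex = s.str.replace("a.", "-", regex=True)

print(literal.tolist())   # ['-b', 'axb']
print(as_regex.tolist())  # ['-b', '-b']
```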

#### Replace the values found between two "#" characters (including the "#" characters themselves) with "" in each row of "bloom" feature

In [156]:
df.bloom.str.replace("#\S+#","",regex=True)

0    Carmine-pink, salmon-pink streaks, stripes, fl...
1    Red.  Flowers velvety red.   Moderate fragranc...
2    Orange-pink.  75 petals.  Large, very double  ...
3    White or white blend.  None to mild fragrance....
4    Light pink. [Deep pink.]  Outer petals white. ...
5    White, blush shading.  Mild, wild rose fragran...
Name: bloom, dtype: object

In [158]:
df["bloom"] = df.bloom.str.replace("#\S+#","",regex=True)
df.bloom[0]

'Carmine-pink, salmon-pink streaks, stripes, flecks.  Warm pink, clear carmine pink, rose pink shaded salmon.  Mild fragrance.  Large, very double, in small clusters, high-centered bloom form.  Blooms in flushes throughout the season.'

The expression #94569# is removed.

In [159]:
df.bloom.str.count("#94569#")

0    0
1    0
2    0
3    0
4    0
5    0
Name: bloom, dtype: int64

## ``pandas.Series.str.contains(pat, case=True, flags=0, na=None, regex=True)``

Test if pattern or regex is contained within a string of a Series or Index.

Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.
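
The ``case`` and ``na`` parameters from the signature matter when the result is used as a filter. A minimal sketch, assuming a small hypothetical Series with one missing value:

```python
import pandas as pd

s = pd.Series(['Average Diameter 4"', "no size given", None])

# case=False ignores upper/lower case; na=False turns missing values into False
mask = s.str.contains("diameter", case=False, na=False)

print(mask.tolist())      # [True, False, False]
print(s[mask].tolist())   # ['Average Diameter 4"']
```

Without ``na=False`` the missing value would stay missing in the result, and such a mask cannot be used directly for boolean indexing.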

#### Which rows of "bloom" feature contain the word "diameter"?

In [161]:
df.bloom.str.contains("diameter")

0    False
1     True
2    False
3    False
4     True
5     True
Name: bloom, dtype: bool

In [163]:
df.loc[df.bloom.str.contains("diameter")]

Unnamed: 0,name,bloom
1,Every Good Gift,Red. Flowers velvety red. Moderate fragranc...
4,Evrathin,Light pink. [Deep pink.] Outer petals white. ...
5,Evita 2,"White, blush shading. Mild, wild rose fragran..."


Let's try some more patterns:

In [169]:
df.bloom.str.contains('\d+"')

0    False
1     True
2    False
3    False
4     True
5     True
Name: bloom, dtype: bool

In [170]:
df.loc[df.bloom.str.contains('\d+"')]

Unnamed: 0,name,bloom
1,Every Good Gift,Red. Flowers velvety red. Moderate fragranc...
4,Evrathin,Light pink. [Deep pink.] Outer petals white. ...
5,Evita 2,"White, blush shading. Mild, wild rose fragran..."


In [171]:
df.loc[df.bloom.str.contains('"\.')]

Unnamed: 0,name,bloom
1,Every Good Gift,Red. Flowers velvety red. Moderate fragranc...
4,Evrathin,Light pink. [Deep pink.] Outer petals white. ...
5,Evita 2,"White, blush shading. Mild, wild rose fragran..."


## ``pandas.Series.str.findall(pat, flags=0)``

Find all occurrences of pattern or regular expression in the Series/Index.

Equivalent to applying re.findall() to all the elements in the Series/Index.

#### Find all numeric values in each row of the "bloom" feature

In [175]:
df.bloom.str.findall("\d+")

0                                []
1                       [4, 26, 40]
2                              [75]
3                      [35, 26, 40]
4    [35, 40, 2, 5, 17, 25, 26, 40]
5                   [20, 25, 1, 25]
Name: bloom, dtype: object

#### Find diameter values in each row of the "bloom" feature

In [180]:
df.bloom.str.findall('\d+\.\d+"|\d+"')

0         []
1       [4"]
2         []
3         []
4     [2.5"]
5    [1.25"]
Name: bloom, dtype: object

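The regex engine tries the alternatives of ``|`` from left to right and stops at the first one that matches, so reordering them can change the result. In this case both orders return the same values, because ``\d+"`` cannot match at the ``2`` of ``2.5"`` (the digit is followed by ``.``, not ``"``), and the engine falls through to the decimal alternative.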
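
A case where the order of the alternatives does change the outcome is sketched below with plain ``re.findall`` on one made-up string. Dropping the trailing ``"`` from both alternatives is an assumption made to provoke the difference.

```python
import re

sample = 'Average diameter 2.5"'

# Short alternative first: "\d+" already succeeds at "2", so "2.5" is never tried
short_first = re.findall(r"\d+|\d+\.\d+", sample)

# Long alternative first: the decimal number is matched as a whole
long_first = re.findall(r"\d+\.\d+|\d+", sample)

print(short_first)  # ['2', '5']
print(long_first)   # ['2.5']
```

Putting the more specific alternative first is therefore the safer ordering.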

In [181]:
df.bloom.str.findall('\d+"|\d+\.\d+"')

0         []
1       [4"]
2         []
3         []
4     [2.5"]
5    [1.25"]
Name: bloom, dtype: object

## ``pandas.Series.str.match(pat, case=True, flags=0, na=None)``

Determine if each string starts with a match of a regular expression.
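
The following sketch contrasts ``str.match`` with its two relatives on a small made-up Series. It assumes a pandas version that provides ``Series.str.fullmatch`` (added in pandas 1.1).

```python
import pandas as pd

s = pd.Series(["pink rose", "deep pink", "pink"])

anywhere = s.str.contains("pink")   # like re.search: pattern anywhere in the string
at_start = s.str.match("pink")      # like re.match: pattern at the start of the string
entire = s.str.fullmatch("pink")    # like re.fullmatch: pattern covers the whole string

print(anywhere.tolist())  # [True, True, True]
print(at_start.tolist())  # [True, False, True]
print(entire.tolist())    # [False, False, True]
```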

#### Find the rows of pink blooms (this information is available in the first words of the rows)

In [188]:
df.bloom.str.match("pink|\w+-pink|\w+ pink")

0     True
1    False
2     True
3    False
4     True
5    False
Name: bloom, dtype: bool

In [190]:
df.loc[df.bloom.str.match("pink|\w+-pink|\w+ pink")]

Unnamed: 0,name,bloom
0,Evert van Dijk,"Carmine-pink, salmon-pink streaks, stripes, fl..."
2,Evghenya,"Orange-pink. 75 petals. Large, very double ..."
4,Evrathin,Light pink. [Deep pink.] Outer petals white. ...


In [194]:
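df.bloom.str.match(r".+pink")  # ".+" can skip ahead, so this is True whenever "pink" appears after the first character, not only in the first words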

0     True
1    False
2     True
3    False
4     True
5    False
Name: bloom, dtype: bool

In [195]:
df.loc[df.bloom.str.match(".+pink")]

Unnamed: 0,name,bloom
0,Evert van Dijk,"Carmine-pink, salmon-pink streaks, stripes, fl..."
2,Evghenya,"Orange-pink. 75 petals. Large, very double ..."
4,Evrathin,Light pink. [Deep pink.] Outer petals white. ...


## ``pandas.Series.str.split(pat=None, n=-1, expand=False, *, regex=None)``

Split strings around given separator/delimiter.

Splits the string in the Series/Index from the beginning, at the specified delimiter string.

#### Split each row of "bloom" feature into sentences at the dot character (a dot followed by a space)

In [197]:
df.bloom.str.split("\. ",expand=True)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,"Carmine-pink, salmon-pink streaks, stripes, fl...","Warm pink, clear carmine pink, rose pink shad...",Mild fragrance,"Large, very double, in small clusters, high-c...",Blooms in flushes throughout the season.,,,,
1,Red,Flowers velvety red,Moderate fragrance,"Average diameter 4""","Medium-large, full (26-40 petals), borne most...",Blooms in flushes throughout the season.,,,
2,Orange-pink,75 petals,"Large, very double bloom form",Blooms in flushes throughout the season.,,,,,
3,White or white blend,None to mild fragrance,35 petals,"Large, full (26-40 petals), high-centered blo...",Blooms in flushes throughout the season.,,,,
4,Light pink,[Deep pink.] Outer petals white,Expand rarely,Mild fragrance,35 to 40 petals,"Average diameter 2.5""","Medium, double (17-25 petals), full (26-40 pe...","Prolific, once-blooming spring or summer","Glandular sepals, leafy sepals, long sepals b..."
5,"White, blush shading","Mild, wild rose fragrance",20 to 25 petals,"Average diameter 1.25""","Small, very double, cluster-flowered bloom form",Blooms in flushes throughout the season.,,,


In [198]:
info = ["id:345, age:25, salary:1200", "id:346, age:32, salary:1500", "id:347, age:28, salary:1400"]
s = pd.Series(info)
s

0    id:345, age:25, salary:1200
1    id:346, age:32, salary:1500
2    id:347, age:28, salary:1400
dtype: object

#### Split the series to create a dataframe consisting of "id, age and salary" columns.

In [204]:
s.str.split("\D+",expand=True)

Unnamed: 0,0,1,2,3
0,,345,25,1200
1,,346,32,1500
2,,347,28,1400


In [206]:
df = s.str.split("\D+",expand=True).iloc[:,1:]
df

Unnamed: 0,1,2,3
0,345,25,1200
1,346,32,1500
2,347,28,1400


In [208]:
df.columns = ["id","age","salary"]

In [209]:
df

Unnamed: 0,id,age,salary
0,345,25,1200
1,346,32,1500
2,347,28,1400


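We can also use ``rename()`` instead of assigning to ``df.columns``. Note that the columns were already renamed above, so the labels 1, 2 and 3 no longer exist and this call leaves the DataFrame unchanged: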

In [210]:
df.rename(columns = {1:'id', 2:'age', 3:"salary"}, inplace = True)

In [211]:
df

Unnamed: 0,id,age,salary
0,345,25,1200
1,346,32,1500
2,347,28,1400


## ``pandas.Series.str.extract(pat, flags=0, expand=True)``

Extract capture **groups** in the regex pat as columns in a DataFrame.

For each subject string in the Series, extract groups from the first match of regular expression pat.

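With the default ``expand=True`` this returns a DataFrame with one column per capture group; rows without a match contain NaN.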
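
When the capture groups are named with ``(?P<name>...)``, the group names become the column names, so no separate renaming step is needed. A minimal sketch, assuming two sample rows in the "id, age, salary" format used elsewhere in this notebook:

```python
import pandas as pd

s = pd.Series(["id:345, age:25, salary:1200",
               "id:346, age:32, salary:1500"])

# Each named group becomes one column of the result
pattern = r"id:(?P<id>\d+), age:(?P<age>\d+), salary:(?P<salary>\d+)"

# extract() returns strings, so convert the columns to integers afterwards
df = s.str.extract(pattern).astype(int)

print(df.columns.tolist())    # ['id', 'age', 'salary']
print(df["salary"].tolist())  # [1200, 1500]
```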

#### Extract just numbers

In [246]:
s = pd.Series(['a3aa', 'b4aa', 'c5aa'])
s

0    a3aa
1    b4aa
2    c5aa
dtype: object

In [247]:
s.str.extract("()")  # an empty group matches the empty string at the start of every value

Unnamed: 0,0
0,
1,
2,


In [248]:
# s.str.extract(r"\d+")  # raises ValueError: the pattern contains no capture groups

In [249]:
s.str.extract("(\d+)")

Unnamed: 0,0
0,3
1,4
2,5


#### Extract just letters

In [250]:
s.str.extract("([a-zA-Z]+)")

Unnamed: 0,0
0,a
1,b
2,c


In [251]:
s.str.extract("([a-zA-Z]+)\d+([a-zA-Z]+)")  # creates as many columns as the number of groupings.

Unnamed: 0,0,1
0,a,aa
1,b,aa
2,c,aa


In [252]:
s.str.extract("(\D)\d(\D)(\D)")

Unnamed: 0,0,1,2
0,a,a,a
1,b,a,a
2,c,a,a


In [270]:
s.str.extract("([a-z]{2})"),s.str.extract("([a-z]\D+)") 

(    0
 0  aa
 1  aa
 2  aa,
     0
 0  aa
 1  aa
 2  aa)

In [253]:
s.str.split("\d+",expand=True)

Unnamed: 0,0,1
0,a,aa
1,b,aa
2,c,aa


In [271]:
s.str.split("\d*",expand=True)

Unnamed: 0,0,1,2,3,4,5
0,,a,,a,a,
1,,b,,a,a,
2,,c,,a,a,


In [255]:
s = pd.Series(['a3aa', 'b4aa', 'c5aa'])
s

0    a3aa
1    b4aa
2    c5aa
dtype: object

In [240]:
s.str.split("\D+",expand=True)

Unnamed: 0,0,1,2
0,,3,
1,,4,
2,,5,


#### Extract "id, age and salary" values to create a dataframe consisting of "id, age and salary" columns.

In [281]:
info = ["id:345, age:25, salary:1200", "id:346, age:32, salary:1500", "id:347, age:28, salary:1400"]
s = pd.Series(info)
s

0    id:345, age:25, salary:1200
1    id:346, age:32, salary:1500
2    id:347, age:28, salary:1400
dtype: object

In [288]:
df = s.str.extract("(\d+)\D+(\d+)\D+(\d+)")
df

Unnamed: 0,0,1,2
0,345,25,1200
1,346,32,1500
2,347,28,1400


In [289]:
df.rename(columns = {0:'id', 1:'age', 2:"salary"}, inplace = True)

In [290]:
df

Unnamed: 0,id,age,salary
0,345,25,1200
1,346,32,1500
2,347,28,1400


#### Extract first number

In [292]:
s= pd.Series(['40 l/100 km (comb)', 
        '38 l/100 km (comb)', '6.4 l/100 km (comb)',
       '8.3 kg/100 km (comb)', '5.1 kg/100 km (comb)',
       '5.4 l/100 km (comb)', '6.7 l/100 km (comb)',
       '6.2 l/100 km (comb)', '7.3 l/100 km (comb)',
       '6.3 l/100 km (comb)', '5.7 l/100 km (comb)',
       '6.1 l/100 km (comb)', '6.8 l/100 km (comb)',
       '7.5 l/100 km (comb)', '7.4 l/100 km (comb)',
       '3.6 kg/100 km (comb)', '0 l/100 km (comb)', 
       '7.8 l/100 km (comb)'])
s

0       40 l/100 km (comb)
1       38 l/100 km (comb)
2      6.4 l/100 km (comb)
3     8.3 kg/100 km (comb)
4     5.1 kg/100 km (comb)
5      5.4 l/100 km (comb)
6      6.7 l/100 km (comb)
7      6.2 l/100 km (comb)
8      7.3 l/100 km (comb)
9      6.3 l/100 km (comb)
10     5.7 l/100 km (comb)
11     6.1 l/100 km (comb)
12     6.8 l/100 km (comb)
13     7.5 l/100 km (comb)
14     7.4 l/100 km (comb)
15    3.6 kg/100 km (comb)
16       0 l/100 km (comb)
17     7.8 l/100 km (comb)
dtype: object

In [297]:
# s.str.extract("(\d\.\d+|\d+)")

In [301]:
# s.str.extract("(\d*\.?\d*)")

In [316]:
s.str.extract(r"(\S+)")  # first run of non-whitespace characters, i.e. the leading number

Unnamed: 0,0
0,40
1,38
2,6.4
3,8.3
4,5.1
5,5.4
6,6.7
7,6.2
8,7.3
9,6.3
10,5.7
11,6.1
12,6.8
13,7.5
14,7.4
15,3.6
16,0
17,7.8


#### Extract first and second number

In [318]:
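s.str.extract(r"(\d+\.?\d*).+/(\d+)")  # "\." is a literal dot; an unescaped "." would also capture the space after whole numbers such as 40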

Unnamed: 0,0,1
0,40,100
1,38,100
2,6.4,100
3,8.3,100
4,5.1,100
5,5.4,100
6,6.7,100
7,6.2,100
8,7.3,100
9,6.3,100
10,5.7,100
11,6.1,100
12,6.8,100
13,7.5,100
14,7.4,100
15,3.6,100
16,0,100
17,7.8,100


#### Extract date as month and year separately

In [319]:
s = pd.Series(['06/2020\n\n4.9 l/100 km (comb)',
'11/2020\n\n166 g CO2/km (comb)',                                 
'10/2019\n\n5.3 l/100 km (comb)',
'05/2022\n\n6.3 l/100 km (comb)',
'07/2019\n\n128 g CO2/km (comb)',
'06/2022\n\n112 g CO2/km (comb)',                                                 
'01/2022\n\n5.8 l/100 km (comb)',
'11/2020\n\n106 g CO2/km (comb)',
'04/2019\n\n105 g CO2/km (comb)',
'08/2020\n\n133 g CO2/km (comb)',
'04/2022\n\n133 g CO2/km (comb)'])
s

0     06/2020\n\n4.9 l/100 km (comb)
1     11/2020\n\n166 g CO2/km (comb)
2     10/2019\n\n5.3 l/100 km (comb)
3     05/2022\n\n6.3 l/100 km (comb)
4     07/2019\n\n128 g CO2/km (comb)
5     06/2022\n\n112 g CO2/km (comb)
6     01/2022\n\n5.8 l/100 km (comb)
7     11/2020\n\n106 g CO2/km (comb)
8     04/2019\n\n105 g CO2/km (comb)
9     08/2020\n\n133 g CO2/km (comb)
10    04/2022\n\n133 g CO2/km (comb)
dtype: object

In [323]:
# s.str.extract(r"(\d+)/(\d+)")
# s.str.extract(r"(\d{2}).(\d{4})")
s.str.extract(r"(\S+)/(\S+)")  # the non-whitespace runs on either side of the first "/"

Unnamed: 0,0,1
0,06,2020
1,11,2020
2,10,2019
3,05,2022
4,07,2019
5,06,2022
6,01,2022
7,11,2020
8,04,2019
9,08,2020
10,04,2022


#### Extract date and consumption value -> 4.9

In [325]:
s.str.extract("(\d+/\d+)")

Unnamed: 0,0
0,06/2020
1,11/2020
2,10/2019
3,05/2022
4,07/2019
5,06/2022
6,01/2022
7,11/2020
8,04/2019
9,08/2020
10,04/2022


In [326]:
s.str.extract(r"(\d+/\d+)\s+(\d+\.\d+|\d+)")  # "\." matches only a literal decimal point

Unnamed: 0,0,1
0,06/2020,4.9
1,11/2020,166
2,10/2019,5.3
3,05/2022,6.3
4,07/2019,128
5,06/2022,112
6,01/2022,5.8
7,11/2020,106
8,04/2019,105
9,08/2020,133
10,04/2022,133


#### Extract date as month and year separately (date in the middle of the string)

In [344]:
s = pd.Series(['\n\n4.9 06/2020 l/100 km (comb)',
'\n\n166 11/2020 g CO2/km (comb)',                                 
'\n\n5.3 10/2019 l/100 km (comb)',
'\n\n6.3 05/2022 l/100 km (comb)',
'\n\n128 07/2019 g CO2/km (comb)',
'\n\n112 06/2022 g CO2/km (comb)',                                                 
'\n\n5.8 01/2022 l/100 km (comb)'])
s

0    \n\n4.9 06/2020 l/100 km (comb)
1    \n\n166 11/2020 g CO2/km (comb)
2    \n\n5.3 10/2019 l/100 km (comb)
3    \n\n6.3 05/2022 l/100 km (comb)
4    \n\n128 07/2019 g CO2/km (comb)
5    \n\n112 06/2022 g CO2/km (comb)
6    \n\n5.8 01/2022 l/100 km (comb)
dtype: object

In [345]:
s.str.extract("(\d{2})/(\d{4})")

Unnamed: 0,0,1
0,06,2020
1,11,2020
2,10,2019
3,05,2022
4,07,2019
5,06,2022
6,01,2022


In [351]:
df = s.str.extract("\S+\s(\d+)/(\d+)")
df.rename(columns={0:"month",1:"year"})

Unnamed: 0,month,year
0,06,2020
1,11,2020
2,10,2019
3,05,2022
4,07,2019
5,06,2022
6,01,2022


## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#9d4f8c; font-size:150%; text-align:center; border-radius:10px 10px;">The End of The Session - 11 (Part - 01)</p>

<a id="5"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>