<img src=https://i.ibb.co/6gCsHd6/1200px-Pandas-logo-svg.png width="700" height="200">

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#060108; font-size:200%; text-align:center; border-radius:10px 10px;">Data Analysis with Python</p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#060108; font-size:150%; text-align:center; border-radius:10px 10px;">Session - 13</p>

## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#4d77cf; font-size:200%; text-align:center; border-radius:10px 10px;">RegEx in Python</p>

<a id="toc"></a>

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Content</p>

* [IMPORTING LIBRARIES NEEDED IN THIS NOTEBOOK](#0)
* [RegEx in PYTHON](#1)
* [RAW STRING ("r/ R")](#2)
* [COMMON PYTHON RegEx FUNCTIONS](#3)    
* [PANDAS FUNCTIONS ACCEPTING RegEx](#4)    
* [THE END OF THE SESSION - 13](#5)

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Importing Libraries Needed in This Notebook</p>

<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

Once NumPy and pandas are installed, import them together with Python's built-in `re` module:

In [7]:
import numpy as np
import pandas as pd
import re

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">RegEx in Python</p>

<a id="1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

- A **Reg**ular **Ex**pression (RegEx) is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern.

- The **Python module** **``re``** provides full support for regular expressions in Python [Source 01](https://docs.python.org/3/library/re.html#re-objects), [Source 02](https://www.tutorialspoint.com/python/python_reg_expressions.htm) & [Source 03](https://www.w3schools.com/python/python_regex.asp).


### Common Expressions

**``\d``** Any numeric digit from ``0`` to ``9``.
                           
**``\D``** Matches any character which is not a decimal digit. This is the opposite of ``\d``.
                           
**``\w``** Any letter, numeric digit, or the underscore character. (Think of this as matching "word" characters.)
                           
**``\W``** Any character that is not a letter, numeric digit, or the underscore character.
                           
**``\s``** Any space, tab, or newline character. (Think of this as matching white-space characters.)
                           
**``\S``** Any character that is not a space, tab, or newline.
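The shorthand classes above can be tried with a short sketch like the one below. The sample string is made up for illustration and is not part of the session data.

```python
import re

# Hypothetical sample text containing letters, digits, an underscore,
# spaces, a tab and punctuation.
sample = "Room 42, floor_3\tB!"

digits = re.findall(r"\d", sample)        # each decimal digit
non_digits = re.findall(r"\D+", sample)   # runs of non-digit characters
words = re.findall(r"\w+", sample)        # runs of letters, digits or underscores
spaces = re.findall(r"\s", sample)        # each whitespace character

print(digits)
print(non_digits)
print(words)
print(spaces)
```

Each uppercase class is simply the complement of its lowercase counterpart, which is why `\D+` returns exactly the pieces that `\d` skips.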


### Common Metacharacters

**``"[]"``**	  A set of characters	``"[a-m]"``

**``"\"``**	      Signals a special sequence (can also be used to escape special characters)

**``"."``**	      Any character (except newline character)

**``"^"``**	      Starts with	``"^hello"``

**``"$"``**	      Ends with	``"world$"``

**``"*"``**	      Match zero, one or more of the previous

**``"+"``**	      Match one or more of the previous

**``"?"``**	      Match zero or one of the previous

**``"{}"``**	  Match the specified number of occurrences, e.g. ``"\d{3}"`` (exactly three) or ``"\d{1,6}"`` (between one and six)

**``"|"``**	      Either or	`"falls|stays"`

**``"()"``**	  Capture and group
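As a rough illustration of the metacharacters listed above, the sketch below applies several of them to one made-up string. The string and variable names are assumptions chosen for the example.

```python
import re

text = "hello wide world"  # hypothetical sample string

starts_with = bool(re.search(r"^hello", text))   # ^ anchors at the start
ends_with = bool(re.search(r"world$", text))     # $ anchors at the end
in_set = re.findall(r"[a-d]", text)              # any single character from a to d
any_char = re.findall(r"w.", text)               # w followed by any one character
optional = re.findall(r"wo?", text)              # w followed by zero or one o
repeated = re.findall(r"l{2}", text)             # exactly two l in a row
either = re.findall(r"wide|world", text)         # one alternative or the other
grouped = re.search(r"(\w+) (\w+)$", text).groups()  # two captured groups

print(starts_with, ends_with)
print(in_set, any_char, optional, repeated, either, grouped)
```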

For regex exercises: https://regex101.com

![image.png](attachment:98f42326-051b-43fb-a114-c2141082b33b.png)

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Raw String ("r / R")</p>

<a id="2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

- Python raw string is created by prefixing a string literal with **'r' or 'R'**.
- Python raw string treats **``backslash (\)``** as a literal character. This is useful when we want to have a string that contains backslash and don’t want it to be treated as an escape character [Source 01](https://blog.devgenius.io/beauty-of-raw-strings-in-python-fa627d674cbf) & [Source 02](https://stackoverflow.com/questions/26318287/what-does-r-mean-before-a-regex-pattern#:~:text=The%20r%20means%20that%20the,escape%20codes%20will%20be%20ignored.).

In [1]:
print("Backslash: \\")
print("New line char: \\n")

Backslash: \
New line char: \n


In [2]:
print(r"Backslash: \\")
print(r"New line char: \\n")

Backslash: \\
New line char: \\n


In [None]:
my_string = "Hello\nWorld"
print(my_string)

Hello
World


In [None]:
my_string = r"Hello\nWorld"
print(my_string)

Hello\nWorld
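Raw strings matter for regular expressions because some regex sequences are also Python string escapes. The sketch below uses `\b` (a word boundary in regex, which this notebook does not otherwise cover) as an assumed example: in a normal string it is turned into a backspace character before `re` ever sees it.

```python
import re

text = "a cat sat"  # hypothetical sample string

# In a normal string, "\b" is the backspace character, so the pattern
# looks for backspace + "cat" + backspace and finds nothing.
plain = re.findall("\bcat\b", text)

# In a raw string, the backslash survives and re reads \b as a word boundary.
raw = re.findall(r"\bcat\b", text)

print(plain)
print(raw)
print(len("\n"), len(r"\n"))  # one newline character vs. backslash + n
```

Writing every pattern as a raw string avoids having to remember which sequences Python would otherwise interpret first.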


## Invalid Raw String

In [None]:
#print("\") # gives an error

In [None]:
#print(r"\") # gives an error

In [None]:
#print(r"abc\") # gives an error

In [None]:
#print(r"abc\\\") # gives an error: a raw string cannot end with an odd number of backslashes

In [5]:
print(r"abc\\")

abc\\


## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Common Python RegEx Functions</p>

<a id="3"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

- **re.search():** Scan through string looking for a match to the pattern.
- **re.match():** Try to apply the pattern at the start of the string.
- **re.fullmatch():** Try to apply the pattern to all of the string.
- **re.findall():** Return a list of all non-overlapping matches in the string.
- **re.sub():** Return the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string by the replacement repl.
- **re.split():** Split the source string by the occurrences of the pattern, returning a list containing the resulting substrings.
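To see how the six functions differ, the following sketch applies the same pattern to the same string with each of them. It reuses the `"A78L41K"` string from this session; the variable names are only illustrative.

```python
import re

text = "A78L41K"
pattern = r"\d+"  # one or more digits

found_anywhere = re.search(pattern, text)     # first match anywhere in the string
found_at_start = re.match(pattern, text)      # None: the string starts with "A"
found_whole = re.fullmatch(pattern, text)     # None: the string is not all digits
all_matches = re.findall(pattern, text)       # every non-overlapping match
replaced = re.sub(pattern, "#", text)         # each match replaced by "#"
pieces = re.split(pattern, text)              # text between the matches

print(found_anywhere.group(), found_at_start, found_whole)
print(all_matches, replaced, pieces)
```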

In [None]:
#dir(re) # shows all functions/methods in re module

In [8]:
# help() prints the docstring of a function
method = re.match
help(method)

Help on function match in module re:

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a Match object, or None if no match was found.



## ``re.search(pattern, string, flags=0)``

Scan through string looking for a match to the pattern, returning a Match object, or None if no match was found [Source](https://www.pythontutorial.net/python-regex/python-regex-flags/).

#### Find numeric digits with search function

In [9]:
text = "A78L41K"

In [10]:
re.search("78", text)

<re.Match object; span=(1, 3), match='78'>

#### with regular expressions

In [11]:
re.search(r"\d\d", text)  # the raw string passes the backslash through to the regex engine

<re.Match object; span=(1, 3), match='78'>

#### with compile() method

In [12]:
re.compile(r"\d\d")
# Returns a compiled pattern object that can be stored and reused later.

re.compile(r'\d\d', re.UNICODE)

In [13]:
type(re.compile("\d\d"))

re.Pattern

In [16]:
comp = re.compile("\d\d")

In [17]:
# Store the compiled pattern as comp, then call search() or other methods on it.
comp.search(text)
# With large datasets, compiling the pattern once and reusing it is cleaner
# than rewriting the regex for every call.

# search() returns a Match object, which offers methods such as start(), end() and span().

<re.Match object; span=(1, 3), match='78'>

In [19]:
num = comp.search(text)
num

<re.Match object; span=(1, 3), match='78'>

In [None]:
num.start()
# Index where the match starts. Indexing is 0-based, so 1 means the second character.

1

In [None]:
num.end()
# Index just past the last matched character (the end is exclusive).

3

In [None]:
num.span()
# Returns the (start, end) pair, the same span shown in the Match object.

(1, 3)

In [20]:
num.group()
# group() returns the matched text. A pattern can contain several groups:
# group(1), group(2), ... return each one, and group(0) returns the whole match.

'78'

In [21]:
num[0]

'78'

In [22]:
print(num.group())

78


#### Find non-digit characters with search function

In [23]:
text = "8PM19MIN"

In [24]:
nondeci = re.search(r"\D", text)
nondeci.group()
# Returns the first non-digit character it finds.

'P'

In [25]:
nondeci = re.search(r"\D\D", text)
nondeci.group()
# Two \D in a row match two consecutive non-digits, scanning left to right.

'PM'

In [26]:
nondeci = re.search(r"\D+", text)
nondeci.group()
# \D+ matches one or more consecutive non-digits, with no upper limit.
# The match stops after PM because a digit follows. If the text started with MIN, it would return MIN.

'PM'

In [27]:
re.search(r"\D\D\D", text).group()
# PM is only two non-digits, so three \D cannot match there; the search moves on and finds MIN.

'MIN'

#### Find phone number pattern

In [28]:
text = 'My phone number is 1234567890'

In [29]:
telno = re.search(r"\d\d\d\d\d\d\d\d\d\d", text)
telno.group()
# One \d per digit works, but it is no help when the length of the number is unknown.

'1234567890'

In [None]:
telno = re.search(r"\d+", text)
telno.group()
# \d matches a single digit; + extends the match to one or more digits.

'1234567890'

In [None]:
telno = re.search(r"\d*", text)
telno.group()
# \d* also matches zero digits, so search() succeeds immediately at the start
# of the text (which begins with "M") and returns an empty string.

''

In [30]:
text = 'My phone number is 123 456 7890'

In [34]:
telno = re.search(r"\d\d\d \d\d\d \d\d\d\d", text)
telno.group()
# The pattern must mirror the text exactly, including the spaces between the digit groups.

'123 456 7890'

In [35]:
text = 'My phone number is 123-456-7890'

In [36]:
telno = re.search(r"\d\d\d-\d\d\d-\d\d\d\d", text)
telno.group()
# Hyphens must be written into the pattern just as the spaces were.

'123-456-7890'

In [37]:
telno = re.search(r'\d'*3 + '-' + r'\d'*3 + '-' + r'\d'*4, text)
telno.group(0)
# The same pattern built by repetition: three \d, a hyphen, three \d, a hyphen, four \d,
# searched for in the text variable.

'123-456-7890'

#### Find phone number pattern by grouping

In [39]:
telno = re.search(r"(\d\d\d)-(\d\d\d)-(\d\d\d\d)", text)
telno.group() # try 1, 2, 3 as the group number
# Each pair of parentheses forms a group. Keep the hyphens outside the parentheses.

'123-456-7890'

In [40]:
telno = re.search(r"(\d*)-(\d*)-(\d*)", text)
telno.group()
# * can replace the fixed digit counts; it means zero or more digits.

'123-456-7890'

In [41]:
telno = re.search("(\d+)-(\d+)-(\d+)",text)
telno.group()

'123-456-7890'

In [42]:
telno.group(2)

'456'

In [43]:
telno.group(3)

'7890'

In [44]:
telno.group(0)

'123-456-7890'

#### Escape parentheses and create two groups: first group (415), second group 555-1212

In [46]:
text = 'My phone number is (415) 555-1212'

In [47]:
telno = re.search(r"(\(\d\d\d\)) (\d\d\d-\d\d\d\d)", text)

print(telno.group(1))
print(telno.group(2))
# The literal parentheses in the text are escaped with a backslash so they are not
# confused with the regex group parentheses; the pattern is written as a raw string.

(415)
555-1212
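Instead of calling `group(1)` and `group(2)` one at a time, the Match object's `groups()` method returns all captured groups as a tuple. A minimal sketch, reusing the phone-number text from above with `\d{3}` in place of repeated `\d`:

```python
import re

text = "My phone number is (415) 555-1212"
match = re.search(r"(\(\d{3}\)) (\d{3}-\d{4})", text)

# search() returns None when nothing matches, so check before unpacking.
if match is not None:
    area_code, local_number = match.groups()
    print(area_code)
    print(local_number)
```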


## ``re.match(pattern, string, flags=0)``

Try to apply the pattern at the start of the string, returning a Match object, or None if no match was found.

If you want to locate a match anywhere in string, use search() instead of match()

In [48]:
text = "A78L41K"

In [None]:
# num = re.match(r"\d\d", text) # returns None, because match() only looks at the beginning of the string
# num.group()                   # so this line raises AttributeError

In [None]:
alp = re.match(r"\D\d\d", text)
alp.group()
# search() returned the first match anywhere; match() only checks the very start of the
# string, much like str.startswith(). If the pattern is not found there, it returns None.

'A78'
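Because `re.match()` returns `None` when the string does not start with the pattern, calling `.group()` directly can fail. One way to guard against that is sketched below; `leading_digits` is a hypothetical helper name, not something defined in this session.

```python
import re

def leading_digits(text):
    """Return the digits at the start of text, or None if it does not start with a digit."""
    match = re.match(r"\d+", text)
    # A Match object is truthy; None is falsy.
    return match.group() if match else None

print(leading_digits("A78L41K"))
print(leading_digits("78L41K"))
```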

## ``re.fullmatch(pattern, string, flags=0)``

Try to apply the pattern to the entire string, returning a Match object, or None if the whole string does not match.

In [51]:
text = "A78L41K"

In [52]:
alpnum = re.fullmatch(r"\D\d+\D\d+\D", text)
alpnum.group()
# fullmatch requires the pattern to cover the entire string: \D is one non-digit,
# \d+ is one or more digits, followed by another \D, another \d+ and a final \D.

'A78L41K'

In [53]:
alpnum = re.fullmatch(r"\w\d+\w\d+\w", text)
alpnum.group()
# \D matches anything that is not a digit, while \w matches a letter, digit or underscore,
# so this pattern would also accept a string that starts with a digit.
# \w+ matches a whole word: a run of letters, digits or underscores with no spaces.

'A78L41K'

In [55]:
re.fullmatch(r"\w+", text).group()
# This also matches, because the text contains only letters and digits.

'A78L41K'

## ``re.findall(pattern, string, flags=0)``

Return a list of all non-overlapping matches in the string.

#### Extract numbers from text as a list

In [56]:
text = "O 1, t 10, o 100. 100000"

In [57]:
re.findall(r"\d{1}", text)
# Extract only the digits.
# {1} means exactly one occurrence.

['1', '1', '0', '1', '0', '0', '1', '0', '0', '0', '0', '0']

In [58]:
re.findall(r"\d", text)
# Without curly braces, \d also matches exactly one digit, so the result is the same.

['1', '1', '0', '1', '0', '0', '1', '0', '0', '0', '0', '0']

In [None]:
re.findall(r"\d{2}", text)
# Take digits in non-overlapping pairs.

['10', '10', '10', '00', '00']

In [None]:
re.findall(r"\d{3}", text)
# Take digits in non-overlapping runs of three.

['100', '100', '000']

In [None]:
re.findall(r"\d{4}", text)
# Only 100000 contains four consecutive digits, giving '1000'.

['1000']

In [None]:
re.findall(r"\d{1,6}", text)
# {1,6} matches between one and six digits, taking as many as possible.

['1', '10', '100', '100000']

In [None]:
re.findall(r"\d+", text)
# + means one or more digits.

['1', '10', '100', '100000']

#### Extract words beginning with "f"

In [3]:
text = 'which foot or hand fell fastest f'

In [69]:
re.findall(r"f[a-z]*", text)
# findall returns a list. Match an f followed by any lowercase letters from a to z;
# because of *, the letters are optional, so a lone "f" also matches.

['foot', 'fell', 'fastest', 'f']

In [68]:
re.findall("f[a-z]", text)

['fo', 'fe', 'fa']

In [4]:
re.findall(r"f[a-z]+", text)
# + requires at least one letter after the f, so the lone "f" is dropped.

['foot', 'fell', 'fastest']

#### Extract equations made up of words and numbers

In [5]:
text = 'set width=20 and height=10'

In [71]:
re.findall(r'\w+=\d+', text)
# Pull out the assignments and skip everything else. findall returns every match, not just the first.
# \w+ matches the word, = matches the equals sign, and \d+ matches a number of one or more digits.

['width=20', 'height=10']

In [72]:
re.findall(r'(\w+)=(\d+)', text)
# With groups, findall returns a tuple per match. The equals sign stays outside the groups.

[('width', '20'), ('height', '10')]

#### Check if the string starts with 'hello'

In [73]:
text = "hello world"

In [74]:
re.findall(r"^hello", text)
# Check whether the string starts with hello. search() with ^hello does the same, and match()
# always anchors at the start, as the next cells show. findall() returns every match as a list.

['hello']

In [None]:
re.match("^hello", text).group()

'hello'

In [None]:
re.search("^hello", text).group()

'hello'

#### Check if the string ends with 'world'

In [None]:
re.findall(r"world$", text)
# findall returns ['world'] when the string ends with world.

['world']

In [75]:
re.findall(r"hello$", text)
# The string does not end with hello, so the list is empty.

[]

## ``re.sub(pattern, repl, string, count=0, flags=0)``

Return the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string by the replacement repl.  repl can be either a string or a callable; if a string, backslash escapes in it are processed.  If it is a callable, it's passed the Match object and must return a replacement string to be used.
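The description above mentions two forms of `repl` that the examples in this section do not exercise: a string containing backslash escapes (such as group references) and a callable. A minimal sketch of both, using a sample string from earlier in the session; `double` is a hypothetical helper name.

```python
import re

text = "set width=20 and height=10"

# String repl: \1 and \2 refer to the first and second captured groups.
swapped = re.sub(r"(\w+)=(\d+)", r"\2=\1", text)

def double(match):
    """Receive the Match object and return the replacement string."""
    return str(int(match.group()) * 2)

# Callable repl: called once per match.
doubled = re.sub(r"\d+", double, text)

print(swapped)
print(doubled)
```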

#### Remove anything other than digits

In [76]:
text = "2004-959-559 # This is Phone Number"

In [77]:
re.sub(r"\D", "", text)
# Works like str.replace: whatever the pattern finds is replaced by the given string.
# Here every non-digit is replaced with nothing, leaving only the digits.

'2004959559'

#### Replace digits with "."

In [78]:
re.sub(r"\d", ".", text)
# Replace every digit with a period.

'....-...-... # This is Phone Number'

In [79]:
re.sub(r"\d", ".", text, count=4)
# count limits the number of replacements; here only the first four digits are replaced.

'....-959-559 # This is Phone Number'

In [None]:
pd.Series(text).str.replace(r"\d", ".", regex=True)
# The same operation with the pandas string accessor. The text must first be wrapped in a Series
# so that .str is available; replace() then turns the digits into periods.
# Pass regex=True explicitly so the pattern is treated as a regex
# (recent pandas versions default to a literal match).
# Reminder: Series.str.replace swaps text inside each string and always yields strings;
# Series.replace (without .str) swaps whole values and can change their type.

0    ....-...-... # This is Phone Number
dtype: object

## ``re.split(pattern, string, maxsplit=0, flags=0)``

Split the source string by the occurrences of the pattern, returning a list containing the resulting substrings.

In [80]:
text = "ab56cd78_de fg3hıi49"

In [None]:
re.split(r"\D+", text)
# Goal: list all the numbers in the text.
# findall suits this better: the text starts with non-digits,
# so split yields an empty string as the first element.

['', '56', '78', '3', '49']

In [None]:
re.split(r"\D+", text, maxsplit=2)
# maxsplit sets the maximum number of splits.

['', '56', '78_de fg3hıi49']

In [None]:
re.findall(r"\d+", text)
# findall returns the digit runs directly, with no leading empty string.

['56', '78', '3', '49']
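`re.split()` is most useful when a plain `str.split()` cannot describe the delimiter. The sketch below, on a made-up record that mixes commas and semicolons, splits on either one plus any trailing whitespace, and also shows how to drop the empty strings that appear when the text starts with a delimiter.

```python
import re

# Hypothetical record with inconsistent delimiters.
record = "id:345, age:25; salary:1200"

# Split on a comma or semicolon followed by optional whitespace.
fields = re.split(r"[,;]\s*", record)
pairs = dict(field.split(":") for field in fields)

# Filter out empty strings produced by a leading delimiter.
numbers = [piece for piece in re.split(r"\D+", "ab56cd78") if piece]

print(fields)
print(pairs)
print(numbers)
```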

## <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Pandas Functions Accepting RegEx</p>

<a id="4"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

- **count():** Count occurrences of pattern in each string of the Series/Index
- **replace():** Replace the search string or pattern with the given value
- **contains():** Test if pattern or regex is contained within a string of a Series or Index. Calls re.search() and returns a boolean
- **findall():** Find all occurrences of pattern or regular expression in the Series/Index. Equivalent to applying re.findall() on all elements
- **match():** Determine if each string matches a regular expression. Calls re.match() and returns a boolean
- **split():** Split strings around given separator/delimiter and accepts string or regular expression to split on
- **extract():** Extract capture groups in the regex pat as columns in a DataFrame and returns the captured groups
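Before moving to the larger dataset, a quick sketch on a tiny, made-up Series may help show what several of these methods return. The values and variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical three-row Series.
s = pd.Series(["id:345, age:25", "id:346", "no numbers"])

counts = s.str.count(r"\d+")        # number of digit runs per row
has_digit = s.str.contains(r"\d")   # boolean: any digit in the row?
numbers = s.str.findall(r"\d+")     # list of digit runs per row
starts_id = s.str.match(r"id")      # boolean: does the row start with "id"?

print(counts.tolist())
print(has_digit.tolist())
print(numbers.tolist())
print(starts_id.tolist())
```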

In [81]:
data = [['Evert van Dijk', 'Carmine-pink, salmon-pink streaks, stripes, flecks. #94569# Warm pink, clear carmine pink, rose pink shaded salmon.  Mild fragrance.  Large, very double, in small clusters, high-centered bloom form.  Blooms in flushes throughout the season.'],
        ['Every Good Gift', 'Red.  Flowers velvety red.  #079463895689# Moderate fragrance.  Average diameter 4".  Medium-large, full (26-40 petals), borne mostly solitary bloom form.  Blooms in flushes throughout the season.'], 
        ['Evghenya', 'Orange-pink.  75 petals.  Large, very double #68345_686# bloom form.  Blooms in flushes throughout the season.'], 
        ['Evita', 'White or white blend.  None to mild fragrance.  35 petals #9897#.  Large, full (26-40 petals), high-centered bloom form.  Blooms in flushes throughout the season.'],
        ['Evrathin', 'Light pink. [Deep pink.]  Outer petals white. Expand rarely #679754YH89#.  Mild fragrance.  35 to 40 petals.  Average diameter 2.5".  Medium, double (17-25 petals), full (26-40 petals), cluster-flowered, in small clusters bloom form.  Prolific, once-blooming spring or summer.  Glandular sepals, leafy sepals, long sepals buds.'],
        ['Evita 2', 'White, blush shading.  Mild, wild rose fragrance #AGHJS876IOP#.  20 to 25 petals.  Average diameter 1.25".  Small, very double, cluster-flowered bloom form.  Blooms in flushes throughout the season.']]
  
df = pd.DataFrame(data, columns = ['name', 'bloom']) 
df 

Unnamed: 0,name,bloom
0,Evert van Dijk,"Carmine-pink, salmon-pink streaks, stripes, fl..."
1,Every Good Gift,Red. Flowers velvety red. #079463895689# Mod...
2,Evghenya,"Orange-pink. 75 petals. Large, very double #..."
3,Evita,White or white blend. None to mild fragrance....
4,Evrathin,Light pink. [Deep pink.] Outer petals white. ...
5,Evita 2,"White, blush shading. Mild, wild rose fragran..."


## ``pandas.Series.str.count(pat, flags=0)``

Count occurrences of pattern in each string of the Series/Index.

This function is used to count the number of times a particular regex pattern is repeated in each of the string elements of the Series.

#### How many numerical values are there in each row of "bloom" feature?

In [85]:
df.bloom.count()

6

In [86]:
df.bloom[0]

'Carmine-pink, salmon-pink streaks, stripes, flecks. #94569# Warm pink, clear carmine pink, rose pink shaded salmon.  Mild fragrance.  Large, very double, in small clusters, high-centered bloom form.  Blooms in flushes throughout the season.'

In [87]:
df.bloom.str.count(r"\d+")
# Question: how many numeric values are in each row of the bloom column?
# The .str accessor applies the operation row by row.
# The first row has one number (94569), the second has four, and so on.

0     1
1     4
2     3
3     4
4    10
5     5
Name: bloom, dtype: int64

#### How many characters are there in each row of "bloom" feature?

In [88]:
df.bloom.apply(len)
# apply(len) gives the length of each string.

0    240
1    196
2    110
3    162
4    327
5    198
Name: bloom, dtype: int64

In [4]:
df.bloom.apply(lambda x : re.findall("\n", x))
# apply also accepts a lambda that calls re.findall; here it checks each row for newline characters.

0    []
1    []
2    []
3    []
4    []
5    []
Name: bloom, dtype: object

In [None]:
df.bloom.str.count(".")
# str.count() also works: the dot matches every character except a newline, so first confirm
# that the rows contain no newlines, which is what the lambda above checked.

0    240
1    196
2    110
3    162
4    327
5    198
Name: bloom, dtype: int64

#### How many sentences are there in each row of "bloom" feature?

In [89]:
df.bloom.str.count(r"\.")
# Count literal periods (escaped as \.) as a rough proxy for the number of sentences.
# Periods inside decimals such as 2.5 are counted too.

0     5
1     6
2     4
3     5
4    11
5     7
Name: bloom, dtype: int64

## ``pandas.Series.str.replace(pat, repl, n=- 1, case=None, flags=0, regex=None)``

Replace each occurrence of pattern/regex in the Series/Index.

Equivalent to str.replace() or re.sub(), depending on the regex value.
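The difference between the two modes shows up whenever the pattern contains a metacharacter. In this sketch (the Series content is made up), the dot in `"1.5"` is a literal period with `regex=False` but "any character" with `regex=True`:

```python
import pandas as pd

s = pd.Series(["1.5 and 105"])  # hypothetical sample value

# Literal match, like str.replace(): only the exact text "1.5" is replaced.
literal = s.str.replace("1.5", "N", regex=False)

# Regex match, like re.sub(): the dot matches any character, so "105" is replaced too.
as_regex = s.str.replace("1.5", "N", regex=True)

print(literal[0])
print(as_regex[0])
```

Passing `regex` explicitly avoids depending on the default, which has differed between pandas versions.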

#### Remove the text between two "#" characters (including the "#" characters themselves) in each row of the "bloom" feature

In [92]:
df["bloom"] = df.bloom.str.replace(r"#\S+#", "", regex = True)
# In each row of the bloom column, delete everything between two # characters, including the # characters.
# \S matches any non-whitespace character.
# Note regex = True, so the pattern is treated as a regex.

In [91]:
df.bloom[5]

'White, blush shading.  Mild, wild rose fragrance .  20 to 25 petals.  Average diameter 1.25".  Small, very double, cluster-flowered bloom form.  Blooms in flushes throughout the season.'

## ``pandas.Series.str.contains(pat, case=True, flags=0, na=None, regex=True)``

Test if pattern or regex is contained within a string of a Series or Index.

Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.

#### Which rows in the "bloom" feature include a "diameter" value?

In [94]:
df.bloom[1]

'Red.  Flowers velvety red.   Moderate fragrance.  Average diameter 4".  Medium-large, full (26-40 petals), borne mostly solitary bloom form.  Blooms in flushes throughout the season.'

In [93]:
df.bloom.str.contains("diameter")
# Which rows give diameter information? contains() tests each row.

0    False
1     True
2    False
3    False
4     True
5     True
Name: bloom, dtype: bool

In [None]:
df.bloom.str.contains(r'\d+"')
# A number followed by " denotes inches (e.g. 4"), so a digit followed by " is enough to find diameters.
# The result is boolean, so it can be used as a condition inside df[...] to select the rows, as in the next cell.

0    False
1     True
2    False
3    False
4     True
5     True
Name: bloom, dtype: bool

In [None]:
df[df.bloom.str.contains('\d+"')]

Unnamed: 0,name,bloom
1,Every Good Gift,Red. Flowers velvety red. Moderate fragranc...
4,Evrathin,Light pink. [Deep pink.] Outer petals white. ...
5,Evita 2,"White, blush shading. Mild, wild rose fragran..."


## ``pandas.Series.str.findall(pat, flags=0)``

Find all occurrences of pattern or regular expression in the Series/Index.

Equivalent to applying re.findall() to all the elements in the Series/Index.

#### Find all numeric values in each row of the "bloom" feature

In [95]:
df.bloom.str.findall("\d+")

0                                []
1                       [4, 26, 40]
2                              [75]
3                      [35, 26, 40]
4    [35, 40, 2, 5, 17, 25, 26, 40]
5                   [20, 25, 1, 25]
Name: bloom, dtype: object

In [96]:
for i in df.bloom.str.findall(r"\d+"):
    print(len(i))
# Print how many numbers were found in each row.

0
3
1
3
8
4


#### Find diameter values in each row of the "bloom" feature

In [None]:
df.bloom.str.findall(r'\d+\.\d+"|\d+"')
# The left alternative matches decimals (digits, a period, digits, then "); the right one
# matches whole numbers followed by ". The | means either alternative may match.

0         []
1       [4"]
2         []
3         []
4     [2.5"]
5    [1.25"]
Name: bloom, dtype: object

## ``pandas.Series.str.match(pat, case=True, flags=0, na=None)``

Determine if each string starts with a match of a regular expression.

#### Find the rows of pink blooms (this information is available in the first words of the rows)

In [None]:
df.bloom.str.match(r"pink|\w+-pink|\w+ pink")
# Find the pink blooms; the colour appears in the first words of each row.
# match() only looks at the start of the string, but pink may come first, after a hyphenated
# word, or after a word and a space, so the alternatives joined with | cover each case:
# "pink" alone, or word-pink, or a word followed by a space and pink.

0     True
1    False
2     True
3    False
4     True
5    False
Name: bloom, dtype: bool

In [None]:
df.bloom.str.match(r"pink|\w+[- ]?pink")
# Shorter: pink at the start, or a word optionally followed by a hyphen or a space, then pink.

0     True
1    False
2     True
3    False
4     True
5    False
Name: bloom, dtype: bool

In [97]:
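df.bloom.str.match(r".+pink")
# Shorter still: . is any character, + repeats it one or more times, then pink must follow.
# Caution: .+ can span the whole string, so this is True for any row that contains pink
# after its first character, not only in the first words. It gives the same result on this data.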

0     True
1    False
2     True
3    False
4     True
5    False
Name: bloom, dtype: bool
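`str.match()` relies on `re.match()`, so the behaviour of the shorter patterns can be checked with plain `re`. The sketch below uses a made-up description in which "pink" appears late in the text, to show that `.+pink` is looser than the alternation patterns even though all of them agree on the rose data above.

```python
import re

# Hypothetical description: the first word is not a pink colour.
late_pink = "White blend.  Outer petals turn pink with age."

# .+ may consume any number of characters, so this matches despite the white bloom.
loose = re.match(r".+pink", late_pink)

# This pattern only allows a single word (plus an optional hyphen or space) before pink.
strict = re.match(r"pink|\w+[- ]?pink", late_pink)

print(loose is not None)
print(strict is not None)
```

When the colour must come from the opening words, the stricter alternation is the safer choice.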

## ``pandas.Series.str.split(pat=None, n=- 1, expand=False, *, regex=None)``

Split strings around given separator/delimiter.

Splits the string in the Series/Index from the beginning, at the specified delimiter string.

#### Split each row of the "bloom" feature into sentences at the dot character

In [98]:
df.bloom.str.split(r"\. ")
# Split at each period followed by a space. Splitting on the period alone would leave
# every piece after the first with an extra leading space.

0    [Carmine-pink, salmon-pink streaks, stripes, f...
1    [Red,  Flowers velvety red,   Moderate fragran...
2    [Orange-pink,  75 petals,  Large, very double ...
3    [White or white blend,  None to mild fragrance...
4    [Light pink, [Deep pink.]  Outer petals white,...
5    [White, blush shading,  Mild, wild rose fragra...
Name: bloom, dtype: object

In [None]:
df.bloom.str.split(r"\. ", expand = True)
# expand=True puts each piece in its own column.

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,"Carmine-pink, salmon-pink streaks, stripes, fl...","Warm pink, clear carmine pink, rose pink shad...",Mild fragrance,"Large, very double, in small clusters, high-c...",Blooms in flushes throughout the season.,,,,
1,Red,Flowers velvety red,Moderate fragrance,"Average diameter 4""","Medium-large, full (26-40 petals), borne most...",Blooms in flushes throughout the season.,,,
2,Orange-pink,75 petals,"Large, very double bloom form",Blooms in flushes throughout the season.,,,,,
3,White or white blend,None to mild fragrance,35 petals,"Large, full (26-40 petals), high-centered blo...",Blooms in flushes throughout the season.,,,,
4,Light pink,[Deep pink.] Outer petals white,Expand rarely,Mild fragrance,35 to 40 petals,"Average diameter 2.5""","Medium, double (17-25 petals), full (26-40 pe...","Prolific, once-blooming spring or summer","Glandular sepals, leafy sepals, long sepals b..."
5,"White, blush shading","Mild, wild rose fragrance",20 to 25 petals,"Average diameter 1.25""","Small, very double, cluster-flowered bloom form",Blooms in flushes throughout the season.,,,


In [99]:
info = ["id:345, age:25, salary:1200", "id:346, age:32, salary:1500", "id:347, age:28, salary:1400"]
s = pd.Series(info)
s

0    id:345, age:25, salary:1200
1    id:346, age:32, salary:1500
2    id:347, age:28, salary:1400
dtype: object

#### Split the Series to create a DataFrame consisting of "id, age and salary" columns.

In [None]:
s.str.split(r"\D+", expand = True)
# Split on runs of non-digit characters and expand the pieces into columns.

Unnamed: 0,0,1,2,3
0,,345,25,1200
1,,346,32,1500
2,,347,28,1400


In [None]:
# Keep the columns we need and drop the empty column 0 that the split produced.
df = s.str.split(r"\D+", expand = True).iloc[:,1:]
df

Unnamed: 0,1,2,3
0,345,25,1200
1,346,32,1500
2,347,28,1400


In [None]:
# Rename the columns to turn the result into a proper DataFrame.
df.columns = ["id", "age", "salary"]
df

Unnamed: 0,id,age,salary
0,345,25,1200
1,346,32,1500
2,347,28,1400


## ``pandas.Series.str.extract(pat, flags=0, expand=True)``

Extract capture groups in the regex pat as columns in a DataFrame.

For each subject string in the Series, extract groups from the first match of regular expression pat.

#### Extract just numbers

In [101]:
s = pd.Series(['a3aa', 'b4aa', 'c5aa'])
s

0    a3aa
1    b4aa
2    c5aa
dtype: object

In [None]:
s.str.extract(r"(\d)")
# str.extract is a method we will use often.
# Capture the first digit in each row as a group.

Unnamed: 0,0
0,3
1,4
2,5


In [103]:
# With extract, the pattern must always contain at least one capture group:
# s.str.extract(r"\d")
# ValueError: pattern contains no capture groups

#### Extract just letters

In [None]:
s.str.extract(r"(\D)\d(\D+)")
# Extract only the letters: capture the leading letter, skip the digit,
# then capture the trailing letters. This yields two separate groups.


Unnamed: 0,0,1
0,a,aa
1,b,aa
2,c,aa


In [None]:
s.str.extract(r"(\D)\d(\D)(\D)")
# To get one column per letter, capture each letter in its own group, without +.

Unnamed: 0,0,1,2
0,a,a,a
1,b,a,a
2,c,a,a


In [None]:
s.str.extract(r"(\w)\d(\w)(\w)")
# The same works with \w in place of \D.

Unnamed: 0,0,1,2
0,a,a,a
1,b,a,a
2,c,a,a


#### Extract "id, age and salary" values to create a dataframe consisting of "id, age and salary" columns.

In [104]:
info = ["id:345, age:25, salary:1200", "id:346, age:32, salary:1500", "id:347, age:28, salary:1400"]
s = pd.Series(info)
s
# Task: build a DataFrame whose columns are id, age and salary and whose values are the numbers.

0    id:345, age:25, salary:1200
1    id:346, age:32, salary:1500
2    id:347, age:28, salary:1400
dtype: object

In [None]:
df = s.str.extract(r"(\d+)\D+(\d+)\D+(\d+)")
df
# Each (\d+) group captures one number; the \D+ between them skips the separators
# and labels (commas, spaces, letters) without capturing them.

Unnamed: 0,0,1,2
0,345,25,1200
1,346,32,1500
2,347,28,1400


In [None]:
# Assign the column names.
df.columns = ["id", "age", "salary"]
df

Unnamed: 0,id,age,salary
0,345,25,1200
1,346,32,1500
2,347,28,1400


In [None]:
# findall can also pull the numeric values out row by row.
s.str.findall(r"(\d+)")

0    [345, 25, 1200]
1    [346, 32, 1500]
2    [347, 28, 1400]
dtype: object
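As an alternative to renaming the columns afterwards, `str.extract()` uses the names of named capture groups (`(?P<name>...)`) as column names. The sketch below assumes the labels always appear in the order id, age, salary, as they do in this sample; `people` is a hypothetical variable name.

```python
import pandas as pd

info = ["id:345, age:25, salary:1200", "id:346, age:32, salary:1500"]
s = pd.Series(info)

# Each named group becomes a column with that name.
pattern = r"id:(?P<id>\d+), age:(?P<age>\d+), salary:(?P<salary>\d+)"
people = s.str.extract(pattern)

# extract returns strings; convert them when numeric values are needed.
people = people.astype(int)

print(people)
```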

#### Extract first number

In [106]:
s= pd.Series(['40 l/100 km (comb)', 
        '38 l/100 km (comb)', '6.4 l/100 km (comb)',
       '8.3 kg/100 km (comb)', '5.1 kg/100 km (comb)',
       '5.4 l/100 km (comb)', '6.7 l/100 km (comb)',
       '6.2 l/100 km (comb)', '7.3 l/100 km (comb)',
       '6.3 l/100 km (comb)', '5.7 l/100 km (comb)',
       '6.1 l/100 km (comb)', '6.8 l/100 km (comb)',
       '7.5 l/100 km (comb)', '7.4 l/100 km (comb)',
       '3.6 kg/100 km (comb)', '0 l/100 km (comb)', 
       '7.8 l/100 km (comb)'])
s

0       40 l/100 km (comb)
1       38 l/100 km (comb)
2      6.4 l/100 km (comb)
3     8.3 kg/100 km (comb)
4     5.1 kg/100 km (comb)
5      5.4 l/100 km (comb)
6      6.7 l/100 km (comb)
7      6.2 l/100 km (comb)
8      7.3 l/100 km (comb)
9      6.3 l/100 km (comb)
10     5.7 l/100 km (comb)
11     6.1 l/100 km (comb)
12     6.8 l/100 km (comb)
13     7.5 l/100 km (comb)
14     7.4 l/100 km (comb)
15    3.6 kg/100 km (comb)
16       0 l/100 km (comb)
17     7.8 l/100 km (comb)
dtype: object

In [107]:
# Method 1
s.str.extract(r"(\d+\.\d+|\d+)")
# The right side of the | handles integers such as 40 and 38; the left side handles values with a decimal point.

Unnamed: 0,0
0,40
1,38
2,6.4
3,8.3
4,5.1
5,5.4
6,6.7
7,6.2
8,7.3
9,6.3


In [108]:
# Method 2:
s.str.extract(r"(\d*\.?\d*)")

Unnamed: 0,0
0,40
1,38
2,6.4
3,8.3
4,5.1
5,5.4
6,6.7
7,6.2
8,7.3
9,6.3


In [None]:
s.str.extract(r"(\S+)")
# \S+ matches a run of non-whitespace characters, so it captures everything before the first space

Unnamed: 0,0
0,40
1,38
2,6.4
3,8.3
4,5.1
5,5.4
6,6.7
7,6.2
8,7.3
9,6.3
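
The three patterns above can be compared side by side. The sketch below uses a few values taken from the Series above, checks that the patterns agree on them, and converts the result to `float`, since `str.extract` returns strings. The names `fuel`, `patterns` and `results` are illustrative.

```python
import pandas as pd

# A few values taken from the Series above
fuel = pd.Series(["40 l/100 km (comb)", "6.4 l/100 km (comb)",
                  "8.3 kg/100 km (comb)", "0 l/100 km (comb)"])

patterns = {
    "alternation": r"(\d+\.\d+|\d+)",
    "optional_dot": r"(\d*\.?\d*)",
    "non_whitespace": r"(\S+)",
}

# expand=False returns a Series when the pattern has a single group
results = {name: fuel.str.extract(pat, expand=False)
           for name, pat in patterns.items()}

# The extracted values are strings; convert explicitly for arithmetic
first_number = results["alternation"].astype(float)
print(first_number)
```

The patterns agree here only because every string starts with the number. `(\d*\.?\d*)` can match an empty string, and `(\S+)` captures whatever the first token is, so the alternation is the more robust choice when the number is not at the start.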


#### Extract first and second number

In [None]:
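s.str.extract(r'(\d+\.?\d*).+/(\d+)')
# The task asks for two separate groups.
# Group 1 is the first number; the dot is escaped and optional so that it only matches a decimal point.
# .+/ skips the unit text (for example " l") up to the slash, and group 2 takes the number after the slash.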

Unnamed: 0,0,1
0,40,100
1,38,100
2,6.4,100
3,8.3,100
4,5.1,100
5,5.4,100
6,6.7,100
7,6.2,100
8,7.3,100
9,6.3,100


#### Extract date as month and year separately

In [109]:
s = pd.Series(['06/2020\n\n4.9 l/100 km (comb)',
'11/2020\n\n166 g CO2/km (comb)',                                 
'10/2019\n\n5.3 l/100 km (comb)',
'05/2022\n\n6.3 l/100 km (comb)',
'07/2019\n\n128 g CO2/km (comb)',
'06/2022\n\n112 g CO2/km (comb)',                                                 
'01/2022\n\n5.8 l/100 km (comb)',
'11/2020\n\n106 g CO2/km (comb)',
'04/2019\n\n105 g CO2/km (comb)',
'08/2020\n\n133 g CO2/km (comb)',
'04/2022\n\n133 g CO2/km (comb)'])
s

0     06/2020\n\n4.9 l/100 km (comb)
1     11/2020\n\n166 g CO2/km (comb)
2     10/2019\n\n5.3 l/100 km (comb)
3     05/2022\n\n6.3 l/100 km (comb)
4     07/2019\n\n128 g CO2/km (comb)
5     06/2022\n\n112 g CO2/km (comb)
6     01/2022\n\n5.8 l/100 km (comb)
7     11/2020\n\n106 g CO2/km (comb)
8     04/2019\n\n105 g CO2/km (comb)
9     08/2020\n\n133 g CO2/km (comb)
10    04/2022\n\n133 g CO2/km (comb)
dtype: object

In [110]:
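s.str.extract(r"(\d\d).(\d\d\d\d)")
# The dot between the groups stands in for the slash (it matches any character); a literal / is stricter.
# Alternatives that give the same result on this Series:
#s.str.extract(r"(\d{2}).(\d{4})")
#s.str.extract(r"(\d*).(\d*)")
#s.str.extract(r"(\d\d)/(\d*)")
#s.str.extract(r"(\d\d)/(\d\d\d\d)")
#s.str.extract(r"(\d+).(\d+)")
#s.str.extract(r"(\S+)/(\S+)")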

Unnamed: 0,0,1
0,06,2020
1,11,2020
2,10,2019
3,05,2022
4,07,2019
5,06,2022
6,01,2022
7,11,2020
8,04,2019
9,08,2020
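
`str.extract` returns strings, so the month comes back as `"06"` rather than the integer `6`. The sketch below uses two values from the Series above. It names the groups and then converts the parts to integers and to a datetime. The names `dates`, `parts` and `as_datetime` are illustrative.

```python
import pandas as pd

dates = pd.Series(["06/2020\n\n4.9 l/100 km (comb)",
                   "11/2020\n\n166 g CO2/km (comb)"])

# Named groups give the columns meaningful names
parts = dates.str.extract(r"(?P<month>\d{2})/(?P<year>\d{4})")

# The values are strings, so the leading zero in "06" is preserved
print(parts)

# Convert to integers, or to a proper datetime, when needed
month_as_int = parts["month"].astype(int)
as_datetime = pd.to_datetime(parts["month"] + "/" + parts["year"],
                             format="%m/%Y")
print(as_datetime)
```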


#### Extract date and consumption value -> 4.9

In [None]:
s.str.extract(r'(\d+/\d+)\s+(\d+\.\d+|\d+)')
# The first group is the date pattern, written in a shorter form; \s+ skips the two newlines that follow without capturing them;
# the second group is either number.number or a plain number.

Unnamed: 0,0,1
0,06/2020,4.9
1,11/2020,166
2,10/2019,5.3
3,05/2022,6.3
4,07/2019,128
5,06/2022,112
6,01/2022,5.8
7,11/2020,106
8,04/2019,105
9,08/2020,133
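
The pattern above captures the first number after the date, whether it is a fuel consumption (`l/100 km`) or a CO2 value (`g CO2/km`). If only the consumption is wanted, one option is to require the unit in the pattern, as in the sketch below. Rows without that unit then come back as `NaN`. This assumes the unit is always written exactly as ` l/100 km`. The names `records` and `consumption` are illustrative.

```python
import pandas as pd

records = pd.Series(["06/2020\n\n4.9 l/100 km (comb)",
                     "11/2020\n\n166 g CO2/km (comb)"])

# Require the literal unit after the number; \s+ also matches the newlines
consumption = records.str.extract(r"(\d+/\d+)\s+(\d+\.?\d*) l/100 km")
consumption.columns = ["date", "consumption"]

# The CO2 row does not match, so both of its columns are NaN
print(consumption)
```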


#### Extract date as month and year separately

In [111]:
s = pd.Series(['\n\n4.9 06/2020 l/100 km (comb)',
'\n\n166 11/2020 g CO2/km (comb)',                                 
'\n\n5.3 10/2019 l/100 km (comb)',
'\n\n6.3 05/2022 l/100 km (comb)',
'\n\n128 07/2019 g CO2/km (comb)',
'\n\n112 06/2022 g CO2/km (comb)',                                                 
'\n\n5.8 01/2022 l/100 km (comb)'])
s

0    \n\n4.9 06/2020 l/100 km (comb)
1    \n\n166 11/2020 g CO2/km (comb)
2    \n\n5.3 10/2019 l/100 km (comb)
3    \n\n6.3 05/2022 l/100 km (comb)
4    \n\n128 07/2019 g CO2/km (comb)
5    \n\n112 06/2022 g CO2/km (comb)
6    \n\n5.8 01/2022 l/100 km (comb)
dtype: object

In [None]:
s.str.extract(r"\S+\s(\d+)/(\d+)")
# \S matches any non-whitespace character and \s matches whitespace; they are opposites.
# \S+\s consumes the leading number and the space after it without capturing them,
# (\d+) captures the month, the slash stays outside the groups, and the second (\d+) captures the year.

Unnamed: 0,0,1
0,06,2020
1,11,2020
2,10,2019
3,05,2022
4,07,2019
5,06,2022
6,01,2022


In [None]:
s.str.extract(r"(\d+)/(\d+)")

Unnamed: 0,0,1
0,06,2020
1,11,2020
2,10,2019
3,05,2022
4,07,2019
5,06,2022
6,01,2022
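
The shorter pattern works even though a number precedes the date because the regex engine scans from left to right and returns the first position where the whole pattern matches. The sketch below shows this with plain `re.search` on one value from the Series above. The variable names are illustrative.

```python
import re

text = "\n\n4.9 06/2020 l/100 km (comb)"

# re.search scans left to right and returns the first position where the
# whole pattern matches, so "4.9" is skipped because no "/" follows it
match = re.search(r"(\d+)/(\d+)", text)
month, year = match.groups()
start, end = match.span()

print(month, year)
print(repr(text[start:end]))
```

The later `l/100` is not picked up either, because `l` is not a digit. A stricter pattern such as `(\d{2})/(\d{4})` states that intent explicitly.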


## Example For Slides

In [112]:
text = "my email address is example@gmail.com"

In [113]:
reg = re.search(r"([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.+-]+)\.([a-zA-Z0-9_.+-]+)", text)

print(reg.group(0))
print(reg.group(1))
print(reg.group(2))
print(reg.group(3))

example@gmail.com
example
gmail
com
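
`re.search` returns `None` when nothing matches, so calling `.group()` directly fails on text without an email address. The sketch below wraps the same pattern in a small helper that guards against that case, using named groups. The helper name `split_email` and the group names are chosen here for illustration.

```python
import re

# Same pattern as above, written as a raw string with named groups
EMAIL = re.compile(
    r"(?P<user>[a-zA-Z0-9_.+-]+)@(?P<domain>[a-zA-Z0-9_.+-]+)\.(?P<tld>[a-zA-Z0-9_.+-]+)"
)

def split_email(text):
    """Return the named parts of the first email-like match, or None."""
    match = EMAIL.search(text)
    # Guard against None before reading the groups
    return match.groupdict() if match else None

found = split_email("my email address is example@gmail.com")
missing = split_email("no address here")

print(found)
print(missing)
```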


In [114]:
text = "/er._%+-@42f.-.Ab/"

In [115]:
reg = re.search(r"/[\w._%+-]+@[\w.-]+\.[a-zA-Z]{2,4}/", text)
print(reg.group(0))

/er._%+-@42f.-.Ab/
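
In Python the `/` characters in the pattern above are literal slashes, not pattern delimiters, which is why the test string has to start and end with `/`. The sketch below drops them and uses `re.fullmatch` to check whether a whole string fits the pattern. The candidate strings are made up for illustration. The pattern is deliberately permissive and is not a complete email validator.

```python
import re

# The same pattern without the surrounding "/" characters
EMAIL_LIKE = re.compile(r"[\w._%+-]+@[\w.-]+\.[a-zA-Z]{2,4}")

# Hypothetical candidates, used only for illustration
candidates = ["example@gmail.com", "er._%+-@42f.-.Ab",
              "no-at-sign.com", "user@host"]

# fullmatch succeeds only if the entire string fits the pattern
is_valid = {c: EMAIL_LIKE.fullmatch(c) is not None for c in candidates}

for candidate, ok in is_valid.items():
    print(candidate, ok)
```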


## <p style="background-color:#FDFEFE; font-family:newtimeroman; color:#9d4f8c; font-size:150%; text-align:center; border-radius:10px 10px;">The End of The Session - 13</p>

<a id="5"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>