## f-Strings: A New and Improved Way to Format Strings in Python

In [1]:
### Also called “formatted string literals,” f-strings are string literals that have an f at the beginning and 
### curly braces containing expressions that will be replaced with their values. 
### The expressions are evaluated at runtime and then formatted using the __format__ protocol
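
To make the `__format__` remark concrete, here is a small sketch: whatever follows the colon inside the braces is passed to the value's `__format__` method as a format spec. The `Celsius` class and the sample values are hypothetical illustrations, not part of the original notebook.

```python
from datetime import date

price = 1234.5678
when = date(2019, 3, 15)

# Built-in types interpret the spec after the colon themselves.
formatted_price = f"{price:,.2f}"    # thousands separator, two decimals
formatted_date = f"{when:%Y-%m-%d}"  # date.__format__ accepts strftime codes


class Celsius:
    """Hypothetical class showing a custom __format__ implementation."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __format__(self, spec):
        # Delegate number formatting to float, then append the unit.
        return format(self.degrees, spec) + " °C"


formatted_temp = f"{Celsius(21.456):.1f}"

print(formatted_price)  # 1,234.57
print(formatted_date)   # 2019-03-15
print(formatted_temp)   # 21.5 °C
```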

In [2]:
#Simple Syntax
##The syntax is similar to the one you used with str.format() but less verbose.

name = "Sudhanshu"
age = 37
f"Hello, {name}. You are {age}."

'Hello, Sudhanshu. You are 37.'

In [3]:
F"Hello, {name}. You are {age}."

'Hello, Sudhanshu. You are 37.'

In [4]:
f"{2 * 37}"

'74'

In [5]:
def to_lowercase(input_string):
    return input_string.lower()

In [6]:
name = "Sudhanshu Saxena"
f"{to_lowercase(name)} is funny."

'sudhanshu saxena is funny.'

## Match_a_symbol

In [7]:
# Load regex package
import re

In [8]:
# Create a variable containing a text string
text = '$100'

In [9]:
## Apply Regex
# Find all instances of the exact match '$'
re.findall(r'\$', text)

['$']

## Match_a_unicode_character

In [10]:
# Load regex package
import re

# Create a variable containing a text string
text = 'Microsoft™.'

# Find any unicode character for a trademark
re.findall(r'\u2122', text)

['™']

## Match_a_word

In [11]:
# Load regex package
import re

# Create a variable containing a text string
text = 'The quick brown fox jumped over the lazy brown bear.'

# Find any three-character run between word boundaries (here, the three-letter words)
re.findall(r'\b...\b', text)

['The', 'fox', 'the']
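
One caveat with `\b...\b` is that `.` also matches spaces, so a two-letter word plus the following space qualifies as well. The sketch below, which uses a made-up `sample` string, compares it with `\b\w{3}\b`, which only accepts three word characters.

```python
import re

sample = "go to it now"

# '.' matches any character, so a space can be one of the three.
dot_matches = re.findall(r'\b...\b', sample)

# '\w{3}' only accepts three word characters in a row.
word_matches = re.findall(r'\b\w{3}\b', sample)

print(dot_matches)   # ['go ', 'to ', 'it ', 'now']
print(word_matches)  # ['now']
```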

## Match_any_character

In [12]:
# Find anything with a 'T' and then the next two characters
re.findall(r'T..', text)

['The']

## Match_any_of_a_list_of_symbols

In [13]:
# Find all instances of any vowel
re.findall(r'[aeiou]', text)

['e', 'u', 'i', 'o', 'o', 'u', 'e', 'o', 'e', 'e', 'a', 'o', 'e', 'a']

## Match_any_of_series_of_characters

In [14]:
# Find any of fox, snake, or bear
re.findall(r'fox|snake|bear', text)

['fox', 'bear']

## Match_any_of_series_of_words

In [15]:
# Find any of fox, snake, or bear, but only as whole words
re.findall(r'\b(fox|snake|bear)\b', text)

['fox', 'bear']
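
The parentheses in the pattern above also change what `re.findall` returns. A short sketch of the three cases follows; the `brown` pattern is an added illustration.

```python
import re

text = 'The quick brown fox jumped over the lazy brown bear.'

# One capturing group: findall returns what the group captured.
one_group = re.findall(r'\b(fox|snake|bear)\b', text)

# Non-capturing group (?:...): findall returns the whole match.
no_group = re.findall(r'\b(?:fox|snake|bear)\b', text)

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r'\b(brown) (fox|snake|bear)\b', text)

print(one_group)   # ['fox', 'bear']
print(no_group)    # ['fox', 'bear']
print(two_groups)  # [('brown', 'fox'), ('brown', 'bear')]
```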

## Match_dates

In [16]:
# Create a variable containing a text string
text = 'My birthday is 09/15/1983. My brother\'s birthday is 01/01/01. My other two brothers have birthdays of 9/3/2001 and 09/1/83.'

In [17]:
# Find any text that fits the regex
re.findall(r'\b[0-3]?[0-9]/[0-3]?[0-9]/(?:[0-9]{2})?[0-9]{2}\b', text)

['09/15/1983', '01/01/01', '9/3/2001', '09/1/83']
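
The pattern above only checks the shape of a date, so it would also accept a string such as `39/39/2001`. A possible follow-up, sketched here under the assumption that the dates are in month/day/year order, is to pass each candidate to `datetime.strptime` and keep only those that parse. The names `DATE_RE`, `parse_dates`, and `sample` are hypothetical.

```python
import re
from datetime import datetime

# Simplified shape check: 1-2 digits / 1-2 digits / 4 or 2 digits.
DATE_RE = re.compile(r'\b(\d{1,2})/(\d{1,2})/(\d{4}|\d{2})\b')


def parse_dates(text):
    """Return datetime.date objects for every valid MM/DD/YY(YY) candidate."""
    dates = []
    for m in DATE_RE.finditer(text):
        # Pick the year directive based on how many digits were captured.
        fmt = "%m/%d/%Y" if len(m.group(3)) == 4 else "%m/%d/%y"
        try:
            dates.append(datetime.strptime(m.group(), fmt).date())
        except ValueError:
            pass  # looks like a date but is not a real one
    return dates


sample = "Valid: 09/15/1983 and 9/3/2001. Invalid: 39/39/2001."
print(parse_dates(sample))
```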

## Match_email_addresses

In [18]:
# Create a variable containing a text string
text =  'My email is chris@hotmail.com, thanks! No, I am at bob@data.ninja.'

In [19]:
# Find all email addresses
re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9]+', text)

# Explanation:
# This regex has three parts
# [a-zA-Z0-9_.+-]+ Matches the username: one or more letters, digits, or _ . + -
# @[a-zA-Z0-9-]+   Matches '@' followed by the domain name: one or more letters, digits, or hyphens
# \.[a-zA-Z0-9]+   Matches a dot followed by the TLD: one or more letters or digits

['chris@hotmail.com', 'bob@data.ninja']
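
If the username and domain are needed separately, named groups are one option. This is a minimal sketch reusing the same character classes; the group names `user` and `domain` are chosen here for illustration.

```python
import re

text = 'My email is chris@hotmail.com, thanks! No, I am at bob@data.ninja.'

email_re = re.compile(
    r'(?P<user>[a-zA-Z0-9_.+-]+)@(?P<domain>[a-zA-Z0-9-]+\.[a-zA-Z0-9]+)'
)

# finditer yields Match objects, so each named group can be read by name.
parts = [(m.group('user'), m.group('domain')) for m in email_re.finditer(text)]

print(parts)  # [('chris', 'hotmail.com'), ('bob', 'data.ninja')]
```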

## Match_exact_text

In [20]:
# Create a variable containing a text string
text = 'The quick brown fox jumped over the lazy brown bear.'

In [21]:
# Find all instances of the exact match 'The'
re.findall(r'The', text)

['The']

## Match_integers_of_any_length

In [22]:
# Create a variable containing a text string
text = '21 scouts and 3 tanks fought against 4,003 protestors.'

In [23]:
# Find any character block that is an integer, with optional comma thousands separators (the pattern also accepts decimals)
re.findall(r'[1-9](?:\d{0,2})(?:,\d{3})*(?:\.\d*[1-9])?|0?\.\d*[1-9]|0', text)

['21', '3', '4,003']
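
The matches above are still strings. A small sketch of turning them into `int` values, using a simpler integer-only pattern (an assumption made here, not the notebook's regex):

```python
import re

text = '21 scouts and 3 tanks fought against 4,003 protestors.'

# Either comma-grouped digits (4,003) or a plain run of digits (21).
number_re = re.compile(r'\d{1,3}(?:,\d{3})+|\d+')

# Strip the thousands separators before converting.
values = [int(s.replace(',', '')) for s in number_re.findall(text)]

print(values)  # [21, 3, 4003]
```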

## Match_text_between_html_tags

In [24]:
# Create a variable containing a text string
text = '<p>The quick brown fox.</p><p>The lazy brown bear.</p>'

In [25]:
# Find any text between '<p>' and '</p>'
re.findall(r'<p>(.*?)</p>', text)

['The quick brown fox.', 'The lazy brown bear.']
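
The `?` after `.*` is what keeps each paragraph separate. As a quick comparison, the greedy form runs from the first `<p>` to the last `</p>`:

```python
import re

text = '<p>The quick brown fox.</p><p>The lazy brown bear.</p>'

lazy = re.findall(r'<p>(.*?)</p>', text)   # stop at the nearest </p>
greedy = re.findall(r'<p>(.*)</p>', text)  # run to the last </p>

print(lazy)    # ['The quick brown fox.', 'The lazy brown bear.']
print(greedy)  # ['The quick brown fox.</p><p>The lazy brown bear.']
```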

## Match_times

In [26]:
# Create a variable containing a text string
text = 'Chris: 12:34am. Steve: 16:30'

In [27]:
# Find any text that fits the regex
re.findall(r'([0-1]\d:[0-5]\d)\s*(?:AM|PM)?', text)

['12:34', '16:30']
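
Note that `(?:AM|PM)?` is case-sensitive and optional, so the lowercase `am` in the sample is simply skipped. If the suffix is wanted, one possible variation (a sketch, not the notebook's pattern) is to capture it and pass `re.I`:

```python
import re

text = 'Chris: 12:34am. Steve: 16:30'

# Second group captures am/pm in any case; it is '' when absent.
times = re.findall(r'([0-2]?\d:[0-5]\d)\s*(am|pm)?', text, flags=re.I)

print(times)  # [('12:34', 'am'), ('16:30', '')]
```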

In [28]:
import re

### RegEx Functions
The re module offers a set of functions that allow us to search a string for a match:

findall ---- Returns a list containing all matches<br>
search ---- Returns a Match object if there is a match anywhere in the string<br>
split ---- Returns a list where the string has been split at each match<br>
sub ---- Replaces one or many matches with a string<br>

In [29]:
import re

txt = "The rain in Spain"  # avoid naming a variable str, which shadows the built-in type
x = re.findall("ai", txt)
print(x)

['ai', 'ai']


In [30]:
txt = "The rain in Spain"
x = re.findall("Portugal", txt)
print(x)

[]


In [31]:
txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)
x

<re.Match object; span=(0, 17), match='The rain in Spain'>

In [32]:
##The split() function returns a list where the string has been split at each match:

txt = "The rain in Spain"
x = re.split(r"\s", txt)  # raw string, so \s reaches the regex engine unchanged
print(x)

['The', 'rain', 'in', 'Spain']


In [33]:
#You can control the number of occurrences by specifying the maxsplit parameter:


txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)
print(x)

['The', 'rain in Spain']


In [34]:
#The sub() function replaces the matches with the text of your choice:

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)

The9rain9in9Spain


In [35]:
#You can control the number of replacements by specifying the count parameter:

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)
print(x)

The9rain9in Spain


| Element | Description (ASCII equivalent in brackets) |
|---|---|
| `.` | Matches any character except `\n` |
| `\d` | Matches any digit, `[0-9]` |
| `\D` | Matches any non-digit character, `[^0-9]` |
| `\s` | Matches any whitespace character, `[ \t\n\r\f\v]` |
| `\S` | Matches any non-whitespace character, `[^ \t\n\r\f\v]` |
| `\w` | Matches any word character, `[a-zA-Z0-9_]` |
| `\W` | Matches any non-word character, `[^a-zA-Z0-9_]` |
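
A quick sketch of these elements in action, using a made-up `sample` string:

```python
import re

sample = "Room 42, Floor_3!"

digits = re.findall(r'\d', sample)      # single digits
spaces = re.findall(r'\s', sample)      # whitespace characters
words = re.findall(r'\w+', sample)      # runs of word characters (underscore included)
non_words = re.findall(r'\W', sample)   # everything that is not a word character
no_newline = re.findall(r'.', "a\nb")   # '.' skips the newline

print(digits)      # ['4', '2', '3']
print(spaces)      # [' ', ' ']
print(words)       # ['Room', '42', 'Floor_3']
print(non_words)   # [' ', ',', ' ', '!']
print(no_newline)  # ['a', 'b']
```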

### A Match Object is an object containing information about the search and the result.

 

The Match object has properties and methods used to retrieve information about the search, and the result:

.span() returns a tuple containing the start and end positions of the match.<br>
.string returns the string passed into the function<br>
.group() returns the part of the string where there was a match<br>
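
A short sketch of the three members listed above, using a pattern chosen here for illustration (a word starting with an uppercase "S"):

```python
import re

txt = "The rain in Spain"
m = re.search(r"\bS\w+", txt)

print(m.span())   # (12, 17)
print(m.string)   # The rain in Spain
print(m.group())  # Spain
```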

In [36]:
line = "Cats are smarter than dogs"

matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)

if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")

matchObj.group() :  Cats are smarter than dogs
matchObj.group(1) :  Cats
matchObj.group(2) :  smarter


### The search Function
This function searches for the first location in the string where the RE pattern matches, with optional flags. It returns a Match object, or None if no position matches.

Here is the syntax for this function −

re.search(pattern, string, flags=0)


In [37]:
line = "Cats are smarter than dogs"

searchObj = re.search(r'(.*) are (.*?) .*', line, re.M | re.I)

if searchObj:
    print("searchObj.group() : ", searchObj.group())
    print("searchObj.group(1) : ", searchObj.group(1))
    print("searchObj.group(2) : ", searchObj.group(2))
else:
    print("Nothing found!!")

searchObj.group() :  Cats are smarter than dogs
searchObj.group(1) :  Cats
searchObj.group(2) :  smarter
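
The two examples give the same result because the pattern happens to match from the first character. The difference shows up when the text of interest is not at the start: `re.match` only tries position 0, `re.search` scans the whole string, and `re.fullmatch` requires the entire string to match. A small sketch:

```python
import re

line = "Cats are smarter than dogs"

at_start = re.match(r'dogs', line)           # None: 'dogs' is not at position 0
anywhere = re.search(r'dogs', line)          # found further along the string
whole = re.fullmatch(r'Cats.*dogs', line)    # pattern must cover the entire string

print(at_start)         # None
print(anywhere.span())  # (22, 26)
print(whole is not None)  # True
```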


# Use cases for re

| # | Pattern | Matches |
|---|---|---|
| 1 | `[Pp]ython` | "Python" or "python" |
| 2 | `rub[ye]` | "ruby" or "rube" |
| 3 | `[aeiou]` | Any one lowercase vowel |
| 4 | `[0-9]` | Any digit; same as `[0123456789]` |
| 5 | `[a-z]` | Any lowercase ASCII letter |
| 6 | `[A-Z]` | Any uppercase ASCII letter |
| 7 | `[a-zA-Z0-9]` | Any of the above (any ASCII letter or digit) |
| 8 | `[^aeiou]` | Anything other than a lowercase vowel |
| 9 | `[^0-9]` | Anything other than a digit |
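
A few of these character classes can be tried directly with `re.findall`. The sample strings below are made up for illustration.

```python
import re

pythons = re.findall(r'[Pp]ython', "Python and python")
rubies = re.findall(r'rub[ye]', "ruby rube rubs")
non_vowels = re.findall(r'[^aeiou]', "queue")
non_digits = re.findall(r'[^0-9]', "a1b2")

print(pythons)     # ['Python', 'python']
print(rubies)      # ['ruby', 'rube']
print(non_vowels)  # ['q']
print(non_digits)  # ['a', 'b']
```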

In [38]:
import re
li = ['9999999999', '999999-999', '99999x9999']
for val in li:
    # Valid: exactly 10 characters, starting with 8 or 9, all digits
    if re.match(r'[8-9][0-9]{9}', val) and len(val) == 10:
        print('yes')
    else:
        print('no')

yes
no
no
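
The `len(val) == 10` check is needed because `re.match` only anchors the start of the string. An alternative, sketched here with a hypothetical helper name `is_valid_number`, is `re.fullmatch`, which requires the pattern to cover the entire string:

```python
import re


def is_valid_number(value):
    """Return True for a 10-digit string that starts with 8 or 9."""
    return re.fullmatch(r'[89][0-9]{9}', value) is not None


for val in ['9999999999', '999999-999', '99999x9999', '99999999990']:
    print(val, is_valid_number(val))
```

Only the first value passes; the last one has eleven digits and is rejected without a separate length check.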


In [39]:
sentence = "Samsung rolled out a beta update for the samsung Galaxy S10 a couple of days to iron out bugs related to Bluetooth tethering, volume control bar, and status bar. Today, the Samsung company released a new beta build for the S10 lineup with over a dozen bug fixes.The new firmware fixes issues related to device reset, always-on display, file moving, Wi-Fi hanging, and Bluetooth headset noise. You can check out the screenshot below for the complete list of bug fixes."

In [40]:
sentence

'Samsung rolled out a beta update for the samsung Galaxy S10 a couple of days to iron out bugs related to Bluetooth tethering, volume control bar, and status bar. Today, the Samsung company released a new beta build for the S10 lineup with over a dozen bug fixes.The new firmware fixes issues related to device reset, always-on display, file moving, Wi-Fi hanging, and Bluetooth headset noise. You can check out the screenshot below for the complete list of bug fixes.'

In [41]:
match = re.match(r"[Ss]amsung", sentence)
if match:
    print(match.group())

Samsung


### Substitute the values in the Data

import re
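words = ["mouse", "cat", "dog", "no-match"]
# Loop starts here
for element in words:
    # Look for two words starting with 'd', separated by a space and a non-word character
    m = re.match(r"(d\w+) \W(d\w+)", element)
    # Check for matching inside the loop, so every element is tested
    if m:
        print(m.groups())
# None of the sample strings fits this pattern, so nothing is printed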

In [42]:
print(re.sub(r"samsung", "Microsoft", sentence))

Samsung rolled out a beta update for the Microsoft Galaxy S10 a couple of days to iron out bugs related to Bluetooth tethering, volume control bar, and status bar. Today, the Samsung company released a new beta build for the S10 lineup with over a dozen bug fixes.The new firmware fixes issues related to device reset, always-on display, file moving, Wi-Fi hanging, and Bluetooth headset noise. You can check out the screenshot below for the complete list of bug fixes.
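
Only the lowercase `samsung` was replaced, because the pattern is case-sensitive. A sketch on a shortened, made-up `snippet` shows how `flags=re.IGNORECASE` covers both spellings, and how `re.subn` also reports the number of replacements:

```python
import re

snippet = "Samsung rolled out a beta update for the samsung Galaxy S10."

# Case-insensitive replacement catches 'Samsung' and 'samsung'.
replaced = re.sub(r"samsung", "Microsoft", snippet, flags=re.IGNORECASE)

# subn returns a (new_string, number_of_replacements) tuple.
replaced_again, n_replaced = re.subn(r"samsung", "Microsoft", snippet, flags=re.IGNORECASE)

print(replaced)    # Microsoft rolled out a beta update for the Microsoft Galaxy S10.
print(n_replaced)  # 2
```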


In [43]:
print(re.sub(r"[a-z]","0", sentence))

S000000 000000 000 0 0000 000000 000 000 0000000 G00000 S10 0 000000 00 0000 00 0000 000 0000 0000000 00 B00000000 000000000, 000000 0000000 000, 000 000000 000. T0000, 000 S000000 0000000 00000000 0 000 0000 00000 000 000 S10 000000 0000 0000 0 00000 000 00000.T00 000 00000000 00000 000000 0000000 00 000000 00000, 000000-00 0000000, 0000 000000, W0-F0 0000000, 000 B00000000 0000000 00000. Y00 000 00000 000 000 0000000000 00000 000 000 00000000 0000 00 000 00000.


In [44]:
print(re.sub(r"[a-z]","0", sentence, flags = re.I))

0000000 000000 000 0 0000 000000 000 000 0000000 000000 010 0 000000 00 0000 00 0000 000 0000 0000000 00 000000000 000000000, 000000 0000000 000, 000 000000 000. 00000, 000 0000000 0000000 00000000 0 000 0000 00000 000 000 010 000000 0000 0000 0 00000 000 00000.000 000 00000000 00000 000000 0000000 00 000000 00000, 000000-00 0000000, 0000 000000, 00-00 0000000, 000 000000000 0000000 00000. 000 000 00000 000 000 0000000000 00000 000 000 00000000 0000 00 000 00000.


In [45]:
print(re.sub(r"[a-z]", "0", sentence, count=1, flags=re.I))  # replace only the first match

0amsung rolled out a beta update for the samsung Galaxy S10 a couple of days to iron out bugs related to Bluetooth tethering, volume control bar, and status bar. Today, the Samsung company released a new beta build for the S10 lineup with over a dozen bug fixes.The new firmware fixes issues related to device reset, always-on display, file moving, Wi-Fi hanging, and Bluetooth headset noise. You can check out the screenshot below for the complete list of bug fixes.


In [46]:
print(re.sub(r"[a-z]", "0", sentence, count=10, flags=re.I))  # replace only the first ten matches

0000000 000led out a beta update for the samsung Galaxy S10 a couple of days to iron out bugs related to Bluetooth tethering, volume control bar, and status bar. Today, the Samsung company released a new beta build for the S10 lineup with over a dozen bug fixes.The new firmware fixes issues related to device reset, always-on display, file moving, Wi-Fi hanging, and Bluetooth headset noise. You can check out the screenshot below for the complete list of bug fixes.


In [47]:
sentence1 = "Samsung 2019 rolled out a beta update for the ^%#^^^samsung Galaxy SSSS10 a couple of days s to iron out bugs related to Bluetooth tethering, volume control bar, and status bar. Today, the Samsung company released a new beta build for the S10 lineup with over a dozen bug fixes.The new firmware fixes issues related to device reset, always-on display, file moving, Wi-Fi hanging, and &^$#%&^$@ . ^^^ . $$#$#$# . @$$@Bluetooth headset noise. You can check out  *%&# *#&%^ the $^#^%#@screenshot z below for the complete list of bug fixes."

In [48]:
sentence1

'Samsung 2019 rolled out a beta update for the ^%#^^^samsung Galaxy SSSS10 a couple of days s to iron out bugs related to Bluetooth tethering, volume control bar, and status bar. Today, the Samsung company released a new beta build for the S10 lineup with over a dozen bug fixes.The new firmware fixes issues related to device reset, always-on display, file moving, Wi-Fi hanging, and &^$#%&^$@ . ^^^ . $$#$#$# . @$$@Bluetooth headset noise. You can check out  *%&# *#&%^ the $^#^%#@screenshot z below for the complete list of bug fixes.'

In [49]:
modified_sentences_1 = re.sub(r"\d","", sentence1)

In [50]:
modified_sentences_1

'Samsung  rolled out a beta update for the ^%#^^^samsung Galaxy SSSS a couple of days s to iron out bugs related to Bluetooth tethering, volume control bar, and status bar. Today, the Samsung company released a new beta build for the S lineup with over a dozen bug fixes.The new firmware fixes issues related to device reset, always-on display, file moving, Wi-Fi hanging, and &^$#%&^$@ . ^^^ . $$#$#$# . @$$@Bluetooth headset noise. You can check out  *%&# *#&%^ the $^#^%#@screenshot z below for the complete list of bug fixes.'

In [51]:
modified_sentences_2 = re.sub(r"[!@#$%^&*.]","", sentence1)

In [52]:
modified_sentences_2

'Samsung 2019 rolled out a beta update for the samsung Galaxy SSSS10 a couple of days s to iron out bugs related to Bluetooth tethering, volume control bar, and status bar Today, the Samsung company released a new beta build for the S10 lineup with over a dozen bug fixesThe new firmware fixes issues related to device reset, always-on display, file moving, Wi-Fi hanging, and       Bluetooth headset noise You can check out    the screenshot z below for the complete list of bug fixes'

In [53]:
modified_sentences_3 = re.sub(r"[\d!@#$%^&*.]","", sentence1)  # \d (not /d) so that digits are removed, not '/' and 'd'

In [54]:
modified_sentences_3

'Samsung  rolled out a beta update for the samsung Galaxy SSSS a couple of days s to iron out bugs related to Bluetooth tethering, volume control bar, and status bar Today, the Samsung company released a new beta build for the S lineup with over a dozen bug fixesThe new firmware fixes issues related to device reset, always-on display, file moving, Wi-Fi hanging, and       Bluetooth headset noise You can check out    the screenshot z below for the complete list of bug fixes'

In [55]:
sentence1

'Samsung 2019 rolled out a beta update for the ^%#^^^samsung Galaxy SSSS10 a couple of days s to iron out bugs related to Bluetooth tethering, volume control bar, and status bar. Today, the Samsung company released a new beta build for the S10 lineup with over a dozen bug fixes.The new firmware fixes issues related to device reset, always-on display, file moving, Wi-Fi hanging, and &^$#%&^$@ . ^^^ . $$#$#$# . @$$@Bluetooth headset noise. You can check out  *%&# *#&%^ the $^#^%#@screenshot z below for the complete list of bug fixes.'

In [56]:
modified_sentences_3 = re.sub(r"\W"," ", sentence1)

In [57]:
modified_sentences_3

'Samsung 2019 rolled out a beta update for the       samsung Galaxy SSSS10 a couple of days s to iron out bugs related to Bluetooth tethering  volume control bar  and status bar  Today  the Samsung company released a new beta build for the S10 lineup with over a dozen bug fixes The new firmware fixes issues related to device reset  always on display  file moving  Wi Fi hanging  and                                 Bluetooth headset noise  You can check out             the        screenshot z below for the complete list of bug fixes '

In [58]:
modified_sentences_4 = re.sub(r"\w"," ", sentence1)

In [59]:
modified_sentences_4

'                                              ^%#^^^                                                                                        ,                   ,               .      ,                                                                                             .                                                     ,       -          ,            ,   -          ,     &^$#%&^$@ . ^^^ . $$#$#$# . @$$@                       .                    *%&# *#&%^     $^#^%#@                                                     .'

In [60]:
modified_sentences_5 = re.sub(r"\s+"," ", modified_sentences_3)

In [61]:
modified_sentences_5

'Samsung 2019 rolled out a beta update for the samsung Galaxy SSSS10 a couple of days s to iron out bugs related to Bluetooth tethering volume control bar and status bar Today the Samsung company released a new beta build for the S10 lineup with over a dozen bug fixes The new firmware fixes issues related to device reset always on display file moving Wi Fi hanging and Bluetooth headset noise You can check out the screenshot z below for the complete list of bug fixes '

In [62]:
modified_sentences_6 = re.sub(r"\s+[a-zA-Z]\s+"," ", modified_sentences_5)

In [63]:
modified_sentences_6

'Samsung 2019 rolled out beta update for the samsung Galaxy SSSS10 couple of days to iron out bugs related to Bluetooth tethering volume control bar and status bar Today the Samsung company released new beta build for the S10 lineup with over dozen bug fixes The new firmware fixes issues related to device reset always on display file moving Wi Fi hanging and Bluetooth headset noise You can check out the screenshot below for the complete list of bug fixes '
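
The cleaning steps above can be chained into one function. This is a minimal sketch under two assumptions: the name `clean_text` is hypothetical, and single letters are dropped with `\b[a-zA-Z]\b` rather than the notebook's `\s+[a-zA-Z]\s+`, so that letters at the edges of the string are caught as well.

```python
import re


def clean_text(raw_text):
    """Hypothetical helper chaining the substitutions shown above."""
    text = re.sub(r"\W", " ", raw_text)        # non-word characters -> space
    text = re.sub(r"\b[a-zA-Z]\b", " ", text)  # drop stray single letters
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip()


print(clean_text("Wi-Fi  is a ^%# test z."))  # Wi Fi is test
```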

In [64]:
sentence1

'Samsung 2019 rolled out a beta update for the ^%#^^^samsung Galaxy SSSS10 a couple of days s to iron out bugs related to Bluetooth tethering, volume control bar, and status bar. Today, the Samsung company released a new beta build for the S10 lineup with over a dozen bug fixes.The new firmware fixes issues related to device reset, always-on display, file moving, Wi-Fi hanging, and &^$#%&^$@ . ^^^ . $$#$#$# . @$$@Bluetooth headset noise. You can check out  *%&# *#&%^ the $^#^%#@screenshot z below for the complete list of bug fixes.'

In [65]:
text =  'Excccept'

In [66]:
modified_sentences_7 = re.sub(r"[c]{2}","", text)

In [67]:
modified_sentences_7

'Except'
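
Removing exactly two characters works for `'Excccept'` but depends on how many extra letters there are. One alternative, shown here as a sketch rather than the notebook's approach, is a backreference that collapses any run of the letter into a single one:

```python
import re

# Fixed count: with four c's, both pairs are removed and none is left.
fixed_count = re.sub(r"[c]{2}", "", "Exccccept")

# Backreference: (c)\1+ matches a 'c' followed by one or more further c's.
collapsed_three = re.sub(r"(c)\1+", r"\1", "Excccept")
collapsed_four = re.sub(r"(c)\1+", r"\1", "Exccccept")

print(fixed_count)      # Exept
print(collapsed_three)  # Except
print(collapsed_four)   # Except
```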

## Web Scraping Basics

In [68]:
# importing libraries 
import nltk 
from bs4 import BeautifulSoup 
from urllib.request import urlopen 

# extract all the contents of the text file. 
raw = urlopen("https://www.w3.org/TR/PNG/iso_8859-1.txt").read() 
raw

b"The following are the graphical (non-control) characters defined by\nISO 8859-1 (1987).  Descriptions in words aren't all that helpful,\nbut they're the best we can do in text.  A graphics file illustrating\nthe character set should be available from the same archive as this\nfile.\n\nHex Description                 Hex Description\n\n20  SPACE\n21  EXCLAMATION MARK            A1  INVERTED EXCLAMATION MARK\n22  QUOTATION MARK              A2  CENT SIGN\n23  NUMBER SIGN                 A3  POUND SIGN\n24  DOLLAR SIGN                 A4  CURRENCY SIGN\n25  PERCENT SIGN                A5  YEN SIGN\n26  AMPERSAND                   A6  BROKEN BAR\n27  APOSTROPHE                  A7  SECTION SIGN\n28  LEFT PARENTHESIS            A8  DIAERESIS\n29  RIGHT PARENTHESIS           A9  COPYRIGHT SIGN\n2A  ASTERISK                    AA  FEMININE ORDINAL INDICATOR\n2B  PLUS SIGN                   AB  LEFT-POINTING DOUBLE ANGLE QUOTATION MARK\n2C  COMMA                       AC  NOT SIGN\n2D  HYPHE

In [69]:
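# parse the downloaded bytes; naming the parser avoids a warning
# (assumes lxml is installed; it wraps plain text in <html><body><p>, as in the output below)
raw1 = BeautifulSoup(raw, "lxml")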
raw1

<html><body><p>The following are the graphical (non-control) characters defined by
ISO 8859-1 (1987).  Descriptions in words aren't all that helpful,
but they're the best we can do in text.  A graphics file illustrating
the character set should be available from the same archive as this
file.

Hex Description                 Hex Description

20  SPACE
21  EXCLAMATION MARK            A1  INVERTED EXCLAMATION MARK
22  QUOTATION MARK              A2  CENT SIGN
23  NUMBER SIGN                 A3  POUND SIGN
24  DOLLAR SIGN                 A4  CURRENCY SIGN
25  PERCENT SIGN                A5  YEN SIGN
26  AMPERSAND                   A6  BROKEN BAR
27  APOSTROPHE                  A7  SECTION SIGN
28  LEFT PARENTHESIS            A8  DIAERESIS
29  RIGHT PARENTHESIS           A9  COPYRIGHT SIGN
2A  ASTERISK                    AA  FEMININE ORDINAL INDICATOR
2B  PLUS SIGN                   AB  LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
2C  COMMA                       AC  NOT SIGN
2D  HYPHEN-MINUS 

In [70]:
# obtain the text present in raw1 (get_text() drops any tags)
raw2 = raw1.get_text() 

raw2

"The following are the graphical (non-control) characters defined by\nISO 8859-1 (1987).  Descriptions in words aren't all that helpful,\nbut they're the best we can do in text.  A graphics file illustrating\nthe character set should be available from the same archive as this\nfile.\n\nHex Description                 Hex Description\n\n20  SPACE\n21  EXCLAMATION MARK            A1  INVERTED EXCLAMATION MARK\n22  QUOTATION MARK              A2  CENT SIGN\n23  NUMBER SIGN                 A3  POUND SIGN\n24  DOLLAR SIGN                 A4  CURRENCY SIGN\n25  PERCENT SIGN                A5  YEN SIGN\n26  AMPERSAND                   A6  BROKEN BAR\n27  APOSTROPHE                  A7  SECTION SIGN\n28  LEFT PARENTHESIS            A8  DIAERESIS\n29  RIGHT PARENTHESIS           A9  COPYRIGHT SIGN\n2A  ASTERISK                    AA  FEMININE ORDINAL INDICATOR\n2B  PLUS SIGN                   AB  LEFT-POINTING DOUBLE ANGLE QUOTATION MARK\n2C  COMMA                       AC  NOT SIGN\n2D  HYPHEN

In [71]:
# tokenize the text into words. 
#nltk.download('punkt')
token = nltk.word_tokenize(raw2) 
text2 = ' '.join(token) 
text2

"The following are the graphical ( non-control ) characters defined by ISO 8859-1 ( 1987 ) . Descriptions in words are n't all that helpful , but they 're the best we can do in text . A graphics file illustrating the character set should be available from the same archive as this file . Hex Description Hex Description 20 SPACE 21 EXCLAMATION MARK A1 INVERTED EXCLAMATION MARK 22 QUOTATION MARK A2 CENT SIGN 23 NUMBER SIGN A3 POUND SIGN 24 DOLLAR SIGN A4 CURRENCY SIGN 25 PERCENT SIGN A5 YEN SIGN 26 AMPERSAND A6 BROKEN BAR 27 APOSTROPHE A7 SECTION SIGN 28 LEFT PARENTHESIS A8 DIAERESIS 29 RIGHT PARENTHESIS A9 COPYRIGHT SIGN 2A ASTERISK AA FEMININE ORDINAL INDICATOR 2B PLUS SIGN AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK 2C COMMA AC NOT SIGN 2D HYPHEN-MINUS AD SOFT HYPHEN 2E FULL STOP AE REGISTERED SIGN 2F SOLIDUS AF OVERLINE 30 DIGIT ZERO B0 DEGREE SIGN 31 DIGIT ONE B1 PLUS-MINUS SIGN 32 DIGIT TWO B2 SUPERSCRIPT TWO 33 DIGIT THREE B3 SUPERSCRIPT THREE 34 DIGIT FOUR B4 ACUTE ACCENT 35 DIGI