In [26]:
import re
import logging
from importlib import reload
reload(logging)
import sys
logging.basicConfig(format='Explanation | %(levelname)s : %(message)s', level=logging.INFO, stream=sys.stdout)
log = logging.getLogger("Zero to Hero in NLP")


### Regular Expressions
> Regular expressions specify patterns of strings that we might want to extract from a document.

### Text Normalization
> Text normalization is a set of tasks used to convert text into a more convenient, and standard form.

* <strong>Tokenization</strong>
    * A method to separate words from running text.<br>
    
    *But tokenization is much more than just separating words.*<br>
    
    * For processing tweets or texts we’ll need to tokenize emoticons like :) or hashtags like #nlproc.
    ---
* <strong>Lemmatization</strong>
    * The task of determining that two words have the same root, despite their surface differences. For example, *sang*, *sung*, and *sing* have the same root, <strong>sing</strong>.
    ℹ️ The word *sing* is known as a <strong>lemma</strong>.
    ---
* <strong>Stemming</strong> 
    * Stemming refers to a simpler version of lemmatization in which we mainly just strip suffixes from the end of the word.
    ---
* <strong>Sentence segmentation</strong>
    * Sentence segmentation is the method of breaking text into individual sentences using cues such as the period (.), the exclamation mark (!), and the question mark (?).
    ---
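The normalization tasks above can be sketched with nothing but the `re` module. This is a rough illustration, not a production approach: the names `TOKEN_RE`, `naive_stem`, and `split_sentences` are hypothetical, the emoticon pattern covers only a handful of shapes, and the stemmer strips just three suffixes.

```python
import re

# Hypothetical tweet-aware token pattern: emoticons, hashtags, words, punctuation.
TOKEN_RE = re.compile(r"[:;]-?[()DP]|#\w+|\w+|[^\w\s]")

def naive_stem(word):
    """Strip a few common suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters of stem so 'sing' is left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)

tokens = TOKEN_RE.findall("Loving #nlproc today :)")
stems = [naive_stem(w) for w in ("walking", "walked", "sang", "sing")]
sentences = split_sentences("It works. Really! Does it?")

print(tokens)     # ['Loving', '#nlproc', 'today', ':)']
print(stems)      # ['walk', 'walk', 'sang', 'sing']
print(sentences)  # ['It works.', 'Really!', 'Does it?']
```

Note that the stemmer leaves *sang* untouched: mapping it to *sing* requires lemmatization, which suffix stripping cannot do.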
    
### Edit Distance
> Edit distance is a metric that measures the similarity between two strings based on the number of edits (insertion, deletion, substitution) needed to turn one into the other.
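
As a minimal sketch of the idea, the function below computes edit distance with dynamic programming. It assumes every insertion, deletion, and substitution costs 1 (the Levenshtein variant); other cost schemes exist, and the name `edit_distance` is chosen here only for illustration.

```python
def edit_distance(source, target):
    """Minimum number of unit-cost edits that turn source into target."""
    # previous[j] = distance between the processed prefix of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into '' takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution_cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

print(edit_distance("kitten", "sitting"))  # 3
```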
    
---


<span style="color:red">English words are often separated by whitespace, but that is not always the case. For example, 'New York' and 'rock 'n' roll' should each be treated as a single token rather than being split into *New* and *York*. This is a good reason not to rely on Python's split() method for tokenization.</span>
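
A small sketch of the problem, and of one possible workaround: the list `MULTIWORD` below is a hypothetical, hand-written lexicon, and real tokenizers handle multiword expressions in more robust ways.

```python
import re

sentence = "I saw rock 'n' roll bands in New York"

# Plain whitespace splitting breaks the multiword expressions apart.
whitespace_tokens = sentence.split()

# Hypothetical lexicon of expressions that should stay together.
MULTIWORD = ["rock 'n' roll", "New York"]
# Try the multiword expressions first, then fall back to any run of non-space characters.
pattern = "|".join(re.escape(m) for m in MULTIWORD) + r"|\S+"
regex_tokens = re.findall(pattern, sentence)

print(whitespace_tokens)  # ['I', 'saw', 'rock', "'n'", 'roll', 'bands', 'in', 'New', 'York']
print(regex_tokens)       # ['I', 'saw', "rock 'n' roll", 'bands', 'in', 'New York']
```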

---

<h2 style="text-align:center">Regular Expressions</h2>

* Regular expressions are useful when we have a *pattern* to search within a *corpus* of text.
<br>
---
<strong>Corpus</strong> - A collection of documents or a single document.

---
<br>

* A regular expression search function will search through the corpus, returning all texts that match the pattern.
* Here, regular expressions are delimited by forward slashes (/).

---
A simple regular expression would be a sequence of characters.
For example, /woodchuck/, /buttercup/

> A couple of things to pay attention to:

    * Regular expressions are case-sensitive. 'Woodchuck' does not match the regex /woodchuck/
    * Note how the regex is delimited by /.
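
Both points can be checked in Python. The slashes are only a notation and are not part of the pattern passed to `re`. The `re.IGNORECASE` flag shown here is one way to relax case sensitivity; it is not something the examples in this notebook rely on.

```python
import re

text = "Woodchuck and woodchuck"

# Default matching is case-sensitive: only the lowercase word matches.
case_sensitive = re.findall(r'woodchuck', text)

# With re.IGNORECASE, both spellings match.
case_insensitive = re.findall(r'woodchuck', text, flags=re.IGNORECASE)

print(case_sensitive)    # ['woodchuck']
print(case_insensitive)  # ['Woodchuck', 'woodchuck']
```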
 

#### Square brackets [ ]

> The string of characters inside square brackets denotes a <strong>disjunction</strong> of characters to match. In other words, it specifies an OR condition over single characters.

For example,

/[wW]oodchuck/ - This regex matches <strong>w</strong>oodchuck or <strong>W</strong>oodchuck.<br>

/[abc]/ - Matches any one of <strong>a</strong>, <strong>b</strong>, or <strong>c</strong>. Matches 'Hi my n<strong>a</strong>me is John', but nothing in 'A good night'.

#### Dash -

> Inside square brackets, two characters separated by a dash denote a range.

For example,

/[2-5]/ - A range of 2 to 5. 2, 3, 4, or 5.<br>

/[a-z]/ - A range of lowercase English letters: a, b, c, ..., z. Matches '<strong>a</strong> good day.'<br>

/[A-Z]/ - A range of uppercase English letters: A, B, C, ..., Z. Matches '<strong>B</strong>est day of my life.'<br>

#### Caret ^

> If caret is the first character inside a square bracket, the sequence after it is negated. In other words, all the characters are matched except the ones after caret.

For example,

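/[^A-Z]/ - Matches any single character that is not an uppercase letter, including spaces, digits, and punctuation. In 'I WAS GOING TO the market', the first match is the space after 'I', and the first letter matched is the <strong>t</strong> of 'the'.<br>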

/[^Ss]/ - Neither S nor s. Matches '<strong>I</strong> had a bad day.'<br>'S<strong>o</strong>metimes I wish to escape'.<br>

#### Question mark ?

> If specified after a character, ? makes that character optional: it matches zero or one occurrence of it.

For example,

/woodchucks?/ - The 's' is optional. Matches woodchuck or woodchucks.<br>

/colou?r/ - The 'u' is optional. Matches color or colour.<br>

Note that the pattern is written without square brackets. Inside brackets, ? is a literal character and the letters form a character class.


In [27]:
print(re.findall(r'[Ss]ome','sometimes I feel very energetic. Somedays not.')) 
log.info(' Matches Some and some')

['some', 'Some']
Explanation | INFO :  Matches Some and some


In [28]:
log.info("The findall() method returns all non-overlapping matches of a pattern in a string. search() returns only the first match")

Explanation | INFO : The findall() method returns all non-overlapping matches of a pattern in a string. search() returns only the first match
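
A short side-by-side sketch of the two functions on the sentence used earlier. `findall()` returns a list of strings, while `search()` returns a match object (or `None`), from which the text and position can be read.

```python
import re

text = 'sometimes I feel very energetic. Somedays not.'

all_matches = re.findall(r'[Ss]ome', text)   # every non-overlapping match
first_match = re.search(r'[Ss]ome', text)    # match object for the first hit
no_match = re.search(r'xyz', text)           # None when nothing matches

print(all_matches)          # ['some', 'Some']
print(first_match.group())  # some
print(first_match.span())   # (0, 4)
print(no_match)             # None
```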


In [29]:
print(re.findall(r'h[abc][vd]','I have had a good day in the cabin ha of my backyard.')) 
log.info(' Matches three-character sequences: h, then a, b, or c, then v or d')


['hav', 'had']
Explanation | INFO :  Matches three-character sequences: h, then a, b, or c, then v or d


In [30]:
print(re.findall(r'[A-Z]','This is the Besttttt day of my Liffeee')) 
log.info(' Matches uppercase letters.')


['T', 'B', 'L']
Explanation | INFO :  Matches uppercase letters.


In [31]:
print(re.findall(r'[^A-Z]','This is  the Besttttt day of my Liffeee')) 
log.info(' Matches everything except uppercase letters.')

['h', 'i', 's', ' ', 'i', 's', ' ', ' ', 't', 'h', 'e', ' ', 'e', 's', 't', 't', 't', 't', 't', ' ', 'd', 'a', 'y', ' ', 'o', 'f', ' ', 'm', 'y', ' ', 'i', 'f', 'f', 'e', 'e', 'e']
Explanation | INFO :  Matches everything except uppercase letters.


In [32]:
print(re.findall(r'[^T]','This is  the Besttttt day of my Liffeee')) 
log.info(' Matches everything except uppercase T.')

['h', 'i', 's', ' ', 'i', 's', ' ', ' ', 't', 'h', 'e', ' ', 'B', 'e', 's', 't', 't', 't', 't', 't', ' ', 'd', 'a', 'y', ' ', 'o', 'f', ' ', 'm', 'y', ' ', 'L', 'i', 'f', 'f', 'e', 'e', 'e']
Explanation | INFO :  Matches everything except uppercase T.


In [33]:
print(re.findall(r'colou?r','Bright colors and colours')) 
log.info(' Matches if string has color or colour')

['color', 'colour']
Explanation | INFO :  Matches if string has color or colour


In [34]:
print(re.findall(r'[A-Za-z]unny days?','sunny day vs sunny days vs Sunny day vs Sunny days vs funny day')) 
log.info(' Matches any single letter followed by "unny day" or "unny days"')

['sunny day', 'sunny days', 'Sunny day', 'Sunny days', 'funny day']
Explanation | INFO :  Matches any single letter followed by "unny day" or "unny days"


#### Kleene *

> The Kleene star means “zero or more occurrences of the immediately previous character or regular expression”.

For example,

/a*/ - Matches zero or more 'a'. Matches 'baaaaaaa' and 'Hello'<br>

/aa*/ - Matches one or more 'a'. Matches 'baaaaaa' but not 'Hello', because at least one 'a' must be present.<br>


In [35]:
print(re.findall(r'a*','Hello')) 
log.info(' No \'a\' found, so a* matches the empty string at every position')

['', '', '', '', '', '']
Explanation | INFO :  No 'a' found, so a* matches the empty string at every position
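
To see where those six empty strings come from, `re.finditer()` can report the position of each match. This sketch assumes Python 3.7 or later, where an empty match is also reported right after a non-empty one.

```python
import re

# 'Hello' has 5 characters, so there are 6 positions (0 through 5)
# at which a* can match zero 'a' characters.
empty_spans = [m.span() for m in re.finditer(r'a*', 'Hello')]

# When 'a' is present, the run of a's is matched as one string,
# with empty matches before 'b' and at the end of the string.
mixed = re.findall(r'a*', 'baaa')

print(empty_spans)  # [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
print(mixed)        # ['', 'aaa', '']
```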


In [36]:
print(re.findall(r'aa*','Hello')) 
log.info(' No \'a\' found')

[]
Explanation | INFO :  No 'a' found


In [37]:
print(re.findall(r'[ab]*','Baaaaaa & aaaaaaa and a')) 
log.info('Matches if string has zero or more \'a\' or \'b\' ')

['', 'aaaaaa', '', '', '', 'aaaaaaa', '', 'a', '', '', '', 'a', '']
Explanation | INFO : Matches if string has zero or more 'a' or 'b' 


In [38]:
print(re.findall(r'[ab]*','bbbbbb')) 
log.info(' Matches if string has zero or more \'a\' or \'b\' ')

['bbbbbb', '']
Explanation | INFO :  Matches if string has zero or more 'a' or 'b' 


#### Kleene +

> The Kleene + means “one or more occurrences of the immediately previous character or regular expression”.

For example,

/a+/ - Matches one or more 'a'. Matches 'baaaaaaa' but not 'Hello'<br>

/[0-9]+/ - Matches sequence of digits<br>


In [39]:
print(re.findall(r'[0-9]+','My phone number is 9457068769')) 
log.info(' Matches sequence of digits')

['9457068769']
Explanation | INFO :  Matches sequence of digits


#### Period .

> The period specifies any single character (wildcard).

For example,

/beg.n/ - Matches any single character between 'beg' and 'n'.<br>


In [40]:
print(re.findall(r'beg.n','begin vs begun vs beg\'n')) 
log.info(' Matches anything with "beg" and "n"')

['begin', 'begun', "beg'n"]
Explanation | INFO :  Matches anything with "beg" and "n"


In [41]:
print(re.findall(r't.*d','tday vs timid vs ticked vs topped vs top')) 
log.info(' .* is greedy, so the match runs from the first "t" to the last "d"')

['tday vs timid vs ticked vs topped']
Explanation | INFO :  .* is greedy, so the match runs from the first "t" to the last "d"
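
The single long match above is a consequence of `.*` taking as much text as it can. As a hedged aside, two common ways to get shorter matches are the lazy quantifier `*?` and restricting the wildcard to word characters with `\w`; neither is used elsewhere in this notebook.

```python
import re

text = 'tday vs timid vs ticked vs topped vs top'

# Lazy: stop at the first 'd' after each 't'.
lazy = re.findall(r't.*?d', text)

# Whole words only: starts with t, ends with d, word characters in between.
words = re.findall(r'\bt\w*d\b', text)

print(lazy)   # ['td', 'timid', 'ticked', 'topped']
print(words)  # ['timid', 'ticked', 'topped']
```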


In [42]:
print(re.findall(r'beg.n','It is the beginning')) 
log.info(' Matches anything with "beg" and "n"')

['begin']
Explanation | INFO :  Matches anything with "beg" and "n"


#### Anchors 

> Anchors are special characters that anchor regular expressions to particular places in a string.

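* ^ (caret) matches the start of a line. In Python's re module it matches only at the start of the string unless the re.MULTILINE flag is set.

* $ (dollar) matches the end of a line. In Python it matches at the end of the string (or just before a trailing newline) unless re.MULTILINE is set.

* \b matches a word boundary.

* \B matches a non-word boundary.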
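
A brief sketch of two details of these anchors in Python: the effect of the `re.MULTILINE` flag on ^ and $, and what \B matches. The sample strings are made up for illustration.

```python
import re

text = "The cat\nThe dog"

# Without re.MULTILINE, ^ and $ anchor to the whole string.
start_default = re.findall(r'^The', text)
end_default = re.findall(r'cat$', text)

# With re.MULTILINE, they anchor to each line.
start_multiline = re.findall(r'^The', text, flags=re.MULTILINE)
end_multiline = re.findall(r'cat$', text, flags=re.MULTILINE)

# \B requires that there is no word boundary, so only the "the"
# inside "other" matches, not the standalone word.
inner_spans = [m.span() for m in re.finditer(r'\Bthe\B', 'the vs other')]

print(start_default, start_multiline)  # ['The'] ['The', 'The']
print(end_default, end_multiline)      # [] ['cat']
print(inner_spans)                     # [(8, 11)]
```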

In [43]:
print(re.findall(r'^The','The is a very common word in English')) 
log.info(' Matches if "The" is at the beginning of string')

['The']
Explanation | INFO :  Matches if "The" is at the beginning of string


In [44]:
print(re.findall(r'^the','The is the most common word in English')) 
log.info(' Matches if "the" is at the beginning of string. Note that "the" is present but not matched.')

[]
Explanation | INFO :  Matches if "the" is at the beginning of string. Note that "the" is present but not matched.


In [45]:
print(re.findall(r'the$','Pattern you are looking for is -> the')) 
log.info(' Matches if "the" is at the end of string.')

['the']
Explanation | INFO :  Matches if "the" is at the end of string.


In [46]:
print(re.findall(r'the$','You are the winner')) 
log.info(' Matches if "the" is at the end of string. Notice that "the" is present but not matched')

[]
Explanation | INFO :  Matches if "the" is at the end of string. Notice that "the" is present but not matched


In [47]:
print(re.findall(r'\bthe\b','the vs other')) 
log.info(' Matches words having only "the". Notice that "the" is present in "other" but not matched')

['the']
Explanation | INFO :  Matches words having only "the". Notice that "the" is present in "other" but not matched


In [48]:
print(re.findall(r'\bother\b','the vs other')) 

['other']


<h2 style="text-align:center">Disjunction, Grouping, & Precedence</h2>

#### Disjunction operator |

> The disjunction operator | is used to search for one string or another. For example, to search for "cats" or "dogs", you cannot use /[catsdogs]/, because square brackets match only a single character from the set. Use /cats|dogs/ instead.

For example,

/cats|dogs/ - This regex matches <strong>cats</strong> or <strong>dogs</strong>.<br>
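
The difference between a character class and the | operator is easy to check. The second half of this sketch touches on precedence: | applies to the whole sequence on each side, so a group is needed to limit its scope. The guppy example and the non-capturing group `(?:...)` are illustrative choices, not taken from the cells in this notebook.

```python
import re

# A character class matches ONE character from the set, not a whole word.
class_matches = re.findall(r'[catsdogs]', 'cats and dogs')
alternation_matches = re.findall(r'cats|dogs', 'cats and dogs')

# Precedence: 'guppy|ies' means "guppy" OR "ies", not "guppy" OR "guppies".
without_group = re.findall(r'guppy|ies', 'guppy guppies')
# A non-capturing group limits the alternation to the suffix.
with_group = re.findall(r'gupp(?:y|ies)', 'guppy guppies')

print(class_matches)        # ['c', 'a', 't', 's', 'a', 'd', 'd', 'o', 'g', 's']
print(alternation_matches)  # ['cats', 'dogs']
print(without_group)        # ['guppy', 'ies']
print(with_group)           # ['guppy', 'guppies']
```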

In [49]:
re.findall(r'cats|dogs','I have two cats and 3 dogs')

['cats', 'dogs']

In [50]:
re.search(r'cats|dogs','I have two cats and 3 dogs')

<re.Match object; span=(11, 15), match='cats'>

In [53]:
print(re.findall(r'[0-9]+|[A-Z]','this will return empty list'))
log.info(" It returns an empty list because there is no string of digits or (|) uppercase letter")

[]
Explanation | INFO :  It returns an empty list because there is no string of digits or (|) uppercase letter


In [56]:
print(re.findall(r'[0-9]+|[A-Z]','this will return something because it has num3er5'))
log.info(" It returns where it finds a sequence of one or more digits")

['3', '5']
Explanation | INFO :  It returns where it finds a sequence of one or more digits


In [59]:
print(re.findall(r'[0-9]+|[A-Z]','this will return something because it has num3er5 as well as UPPERCASE LETTERS'))
log.info(" It returns where it finds a sequence of one or more digits or uppercase letters")

['3', '5', 'U', 'P', 'P', 'E', 'R', 'C', 'A', 'S', 'E', 'L', 'E', 'T', 'T', 'E', 'R', 'S']
Explanation | INFO :  It returns where it finds a sequence of one or more digits or uppercase letters


In [60]:
print(re.findall(r'[0-9]+|[A-Z]+','this will return something because it has num3er5 as well as UPPERCASE LETTERS'))
log.info(" Notice the difference in output due to Kleene +")

['3', '5', 'UPPERCASE', 'LETTERS']
Explanation | INFO :  Notice the difference in output due to Kleene +
