## Natural Language Processing
  * Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.
  
## Computational Linguistics
  * Computational linguistics is an interdisciplinary field concerned with the statistical or rule-based modeling of natural language from a computational perspective, as well as the study of appropriate computational approaches to linguistic questions.
  * Example: using computers and algorithms to perform linguistic tasks such as part-of-speech tagging.

## Text manipulation in Python

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
s = 'Kids are playing in ground. The ground is very neetly maintained'

#### string 'count' method counts the number of non-overlapping occurrences of a given character or word.

In [3]:
print(s.count('a'))

4


In [4]:
print(s.count('ground'))

2


#### string 'find' method returns the character index of the first occurrence of a given character or word, or -1 if it is not found.

In [5]:
print(s.find('The'))

28


In [6]:
print(s.find('kids'))

-1


In [7]:
print(s.find('ground'))

20


In [8]:
print(s.find('a'))

5


In [9]:
print(s.index('ground'))

20
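
The `find` and `index` cells above return the same value when the substring is present. The sketch below illustrates where they differ: `find` returns -1 for a missing substring, while `index` raises `ValueError`. The variable names `first` and `second` are only illustrative.

```python
s = 'Kids are playing in ground. The ground is very neetly maintained'

# find() signals a missing substring with -1 (the search is case-sensitive)
print(s.find('kids'))

# index() raises ValueError for the same input
try:
    s.index('kids')
except ValueError as err:
    print('index() raised ValueError:', err)

# Both methods accept an optional start position;
# here it is used to locate the second 'ground'
first = s.find('ground')
second = s.find('ground', first + 1)
print(first, second)   # 20 32
```

Choosing `find` keeps the control flow simple when a missing substring is expected; `index` is useful when a missing substring should be treated as an error.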


#### upper will convert all characters into upper case

In [10]:
print(s.upper())

KIDS ARE PLAYING IN GROUND. THE GROUND IS VERY NEETLY MAINTAINED


#### lower will convert all characters into lower case

In [11]:
print(s.lower())

kids are playing in ground. the ground is very neetly maintained


#### title will convert the first character of each word to upper case and the rest to lower case

In [12]:
s_title = s.title()
print(s_title)

Kids Are Playing In Ground. The Ground Is Very Neetly Maintained


#### swapcase will convert upper-case characters to lower case and lower-case characters to upper case

In [13]:
print(s_title.swapcase())

kIDS aRE pLAYING iN gROUND. tHE gROUND iS vERY nEETLY mAINTAINED


#### replace will replace every occurrence of a sequence with the given sequence

In [14]:
print(s.replace('play', 'sing'))

Kids are singing in ground. The ground is very neetly maintained


#### startswith and endswith return True if the string starts or ends with the given sequence.

In [15]:

print(s.startswith('Ki'))

True


In [18]:
print(s.endswith('ing'))

False


In [19]:
print(s.endswith('ed'))

True


#### join will concatenate the items of a sequence, using the string as the separator

In [20]:
print(" - ".join('rama'))

r - a - m - a


In [21]:
s1 = "rama"

In [22]:
print(s1.isalnum())

True


In [23]:
print(s1.isalpha())

True


In [58]:
num = "2"

In [59]:
print(num.isdecimal())

True


In [60]:
print(num.isdigit())

True


In [68]:
sp = ""

In [69]:
print(sp.isspace())

False
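
The `is...` checks above all agree for simple ASCII input, so the following sketch compares them on a few less obvious characters. The sample characters and the `checks` dictionary are illustrative choices, not part of the original notebook.

```python
# Each tuple holds (isdecimal, isdigit, isnumeric, isspace) for one sample
samples = ['2', '\u00b2', '\u00bd', ' ', '']   # '2', superscript two, one half, space, empty

checks = {}
for text in samples:
    checks[text] = (text.isdecimal(), text.isdigit(),
                    text.isnumeric(), text.isspace())
    print(repr(text), checks[text])

# Superscript two counts as a digit but not as a decimal character;
# the fraction one half is numeric only; the empty string fails every check.
```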


In [74]:
s = " "
if s:
    print("string")
else:
    print("blank")

string


## Regular Expressions:

### Methods:
<table>
  <thead>
  <tr>
      <th>Method/Attribute</th>
      <th>Purpose</th>
  </tr>
  </thead>
  <tbody>
  <tr>
      <td>match()</td>
      <td>Determine if the RE matches at the beginning of the string.</td>
  </tr>
  <tr>
      <td>search()</td>
      <td>Scan through a string, looking for any location where this RE matches.</td>
  </tr>
  <tr>
      <td>findall()</td>
      <td>Find all substrings where the RE matches, and return them as a list.</td>
  </tr>
  <tr>
      <td>finditer()</td>
      <td>Find all substrings where the RE matches, and return them as an iterator.</td>
  </tr>  
  </tbody>
</table>  

### Match Object Methods:
<table>
  <thead>
  <tr>
      <th>Method/Attribute</th>
      <th>Purpose</th>
  </tr>
  </thead>
  <tbody>
  <tr>
      <td>group()</td>
      <td>Return the string matched by the RE.</td>
  </tr>
  <tr>
      <td>start()</td>
      <td>Return the starting position of the match.</td>
  </tr>
  <tr>
      <td>end()</td>
      <td>Return the ending position of the match.</td>
  </tr>
  <tr>
      <td>span()</td>
      <td>Return a tuple containing the (start, end) positions of the match.</td>
  </tr>  
  </tbody>
</table>    

### Metacharacters: `. ^ $ * + ? { } [ ] \ | ( )`
  * [ and ]: They are used for specifying a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them with a '-'.
  
  * Example string: `s = 'Kids are playing in ground. The ground is very neetly maintained'`

In [75]:
import re

In [78]:
s = 'Kids are playing in ground. \nThe ground is very neetly maintained'

In [79]:
print(s)

Kids are playing in ground. 
The ground is very neetly maintained


In [82]:
p = re.compile('Kids')
m = p.match(s)

print("match object : ", m)

if m:
    print("Result : ", m.group())
else:
    print("None found : ", m)

match object :  <re.Match object; span=(0, 4), match='Kids'>
Result :  Kids


In [83]:
print(p.match(s).span())

(0, 4)


In [85]:
p = re.compile('ground')
print(p.search(s).span())

(20, 26)


In [86]:
p.search(s)

<re.Match object; span=(20, 26), match='ground'>

In [93]:
p = re.compile('[rie]')
print(p.search(s).group())

i


In [94]:
p = re.compile('play')
print(p.search(s))

<re.Match object; span=(9, 13), match='play'>


In [95]:
print(p.search(s).start())

9


In [96]:
print(p.search(s).end())

13


In [97]:
print(p.search(s).span())

(9, 13)


In [112]:
s = 'Kids are playing in ground. \nThe ground is very neetly maintained'

In [113]:
print(s)

Kids are playing in ground. 
The ground is very neetly maintained


In [116]:
p = re.compile('ground', re.I)
print(p.findall(s))

for i in p.findall(s):
    print(i)

['ground', 'ground']
ground
ground


In [42]:
print(p.finditer(s))

<callable_iterator object at 0x00000257E41E7040>


In [43]:
for x in p.finditer(s):
    print(x)

<re.Match object; span=(20, 26), match='ground'>
<re.Match object; span=(33, 39), match='ground'>


In [44]:
for x in p.finditer(s):
    print(x.group(), ' - ' , x.span())

ground  -  (20, 26)
ground  -  (33, 39)


In [117]:
p1 = re.compile('[zj]')

In [118]:
print(p1.search(s))

None


In [119]:
print(p1.match(s))

None


In [120]:
s3 = 'there are 200 apples, \n400 bananas in basket'

In [121]:
print(s3)

there are 200 apples, 
400 bananas in basket


In [122]:
s3

'there are 200 apples, \n400 bananas in basket'

In [143]:
p3 = re.compile(r'[^\d]')

In [148]:
print(p3.findall(s3))

['t', 'h', 'e', 'r', 'e', ' ', 'a', 'r', 'e', ' ', ' ', 'a', 'p', 'p', 'l', 'e', 's', ',', ' ', '\n', ' ', 'b', 'a', 'n', 'a', 'n', 'a', 's', ' ', 'i', 'n', ' ', 'b', 'a', 's', 'k', 'e', 't']


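### Basic Flags: re.I, re.L, re.M, re.S, re.U, re.X
  * re.I (re.IGNORECASE) - ignore case when matching.
  * re.L (re.LOCALE) - make \w, \W, \b, \B and case-insensitive matching depend on the current locale (bytes patterns only).
  * re.M (re.MULTILINE) - make ^ and $ match at the start and end of each line, not only of the whole string.
  * re.S (re.DOTALL) - make "." match any character, including a newline.
  * re.U (re.UNICODE) - match according to Unicode rules; this is already the default for str patterns in Python 3.
  * re.X (re.VERBOSE) - allow whitespace and comments inside the pattern so that it can be written in a more readable format.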
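
As a minimal sketch of how the flags in the list above change matching, the block below runs the same patterns with and without a flag. The sample strings and variable names are made up for illustration.

```python
import re

text = 'Kids are playing\nkids are singing'

# Without re.M, ^ anchors only at the very start of the string
only_start = re.findall(r'^kids', text, re.I)
# With re.M, ^ also anchors at the start of every line
every_line = re.findall(r'^kids', text, re.I | re.M)
print(only_start, every_line)

# '.' does not cross a newline unless re.S is given
without_dotall = re.search(r'playing.kids', text)
with_dotall = re.search(r'playing.kids', text, re.S)
print(without_dotall, repr(with_dotall.group()))

# re.X ignores whitespace in the pattern and allows comments
number = re.compile(r'''
    \d+        # integer part
    (\.\d+)?   # optional fractional part
''', re.X)
price = number.search('price: 12.50').group()
print(price)
```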

In [157]:
s = 'Kids are playing in ground. \n\t The ground is very neetly maintained 343'

p = re.compile('.', re.S)

print(p.findall(s))

['K', 'i', 'd', 's', ' ', 'a', 'r', 'e', ' ', 'p', 'l', 'a', 'y', 'i', 'n', 'g', ' ', 'i', 'n', ' ', 'g', 'r', 'o', 'u', 'n', 'd', '.', ' ', '\n', '\t', ' ', 'T', 'h', 'e', ' ', 'g', 'r', 'o', 'u', 'n', 'd', ' ', 'i', 's', ' ', 'v', 'e', 'r', 'y', ' ', 'n', 'e', 'e', 't', 'l', 'y', ' ', 'm', 'a', 'i', 'n', 't', 'a', 'i', 'n', 'e', 'd', ' ', '3', '4', '3']


In [158]:
s = 'Kids are playing.\nits raining while they are playing'

In [159]:
print(s)

Kids are playing.
its raining while they are playing


In [162]:
p = re.compile('kids', re.I)
match_obj1 = p.match(s)
print(match_obj1.group())

Kids


In [164]:
match_obj2 = re.match('playing', s)
print(match_obj2.group())

AttributeError: 'NoneType' object has no attribute 'group'

In [165]:
if match_obj2:
    print(match_obj2.group())
else:
    print('Found none')

Found none


In [167]:
search_obj1 = re.search('Kids', s)
print(search_obj1.group())

Kids


In [168]:
search_obj2 = re.search('playing', s)
print(search_obj2.group())

playing


* **\$ - Dollar**
  * The dollar symbol $ matches at the end of the string (or just before a trailing newline); with re.M it also matches at the end of each line.

In [177]:
#s = 'Kids are playing.\n its raining while they are playing'
s = '''Kids are playing


its raining while they are Playing'''

In [181]:
p = re.compile('playing$',  re.M | re.I)
search_obj2 = p.finditer(s)
#print(search_obj2)
for x in search_obj2:
    print(x)

<re.Match object; span=(9, 16), match='playing'>
<re.Match object; span=(46, 53), match='Playing'>


In [179]:
search_obj2 = re.search('kids$', s)
print(search_obj2)

None


#### Metacharacter "$" is not active inside [ and ]

In [182]:
import re
s2 = 'kids are pl$aying'

In [191]:
p2 = re.compile(r'\$')

In [192]:
print(p2.findall(s2))

['$']


In [193]:
p2 = re.compile(r'[akig$]')

In [194]:
print(p2.findall(s2))

['k', 'i', 'a', '$', 'a', 'i', 'g']


* **^ - Caret**
  * The caret symbol **^** matches at the start of the string; with re.M it also matches at the start of each line.

In [199]:
search_obj2 = re.compile('^kids')
print(search_obj2.findall(s2))

['kids']


In [201]:
search_obj2 = re.compile('kids$')
print(search_obj2.findall(s2))

[]


* **. - Period**
  * A period matches any single character except newline '\n'; with the re.S flag, as used in the next cell, it matches newline as well.

In [203]:
s = 'kids are playing.\nits raining while \tthey are playing'
search_re = re.compile('.', re.S)
search_list = search_re.findall(s)
print(search_list)
print("".join(search_list))

['k', 'i', 'd', 's', ' ', 'a', 'r', 'e', ' ', 'p', 'l', 'a', 'y', 'i', 'n', 'g', '.', '\n', 'i', 't', 's', ' ', 'r', 'a', 'i', 'n', 'i', 'n', 'g', ' ', 'w', 'h', 'i', 'l', 'e', ' ', '\t', 't', 'h', 'e', 'y', ' ', 'a', 'r', 'e', ' ', 'p', 'l', 'a', 'y', 'i', 'n', 'g']
kids are playing.
its raining while 	they are playing


* **\* - Star**
  * The star symbol * matches zero or more occurrences of the pattern to its left.

In [214]:
import re
s = 'I won the match hahaha...'
search_re = re.compile('ha*')

search_re.findall(s)

['h', 'h', 'ha', 'ha', 'ha']

In [217]:
# a separate variable keeps s unchanged for the cells that follow
s_game = 'I won a game...'
search_re = re.compile('a+')

search_re.findall(s_game)

['a', 'a']

In [162]:
for x in re.finditer('ha*', s):
    print(x.group(), ' - ' , x.span())

h  -  (7, 8)
h  -  (14, 15)
ha  -  (16, 18)
ha  -  (18, 20)
ha  -  (20, 22)


* **+ - Plus**
  * The plus symbol + matches one or more occurrences of the pattern to its left.

In [163]:
re.findall('ha+', s)

['ha', 'ha', 'ha']

In [164]:
for x in re.finditer('ha+', s):
    print(x.group(), ' - ' , x.span())

ha  -  (16, 18)
ha  -  (18, 20)
ha  -  (20, 22)


* **? - Question Mark**
  * The question mark symbol ? matches zero or one occurrence of the pattern to its left.
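
The cells in this part of the notebook exercise `*` and `+`, so here is a small sketch of `?` itself. The `colou?r` pattern and the sample strings are illustrative, not taken from the notebook.

```python
import re

# 'u?' allows the 'u' to appear zero times or once, but not twice
spellings = re.findall(r'colou?r', 'color colour colouur')
print(spellings)

# 'ma?n' allows at most one 'a' between 'm' and 'n'
short_words = re.findall(r'ma?n', 'mn man maan')
print(short_words)

# Neither word in this string qualifies, because each has several 'a's
no_match = re.search(r'ma?n', 'maaan womaaaaan')
print(no_match)
```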

In [165]:
s = 'maaan womaaaaan'

In [166]:
print(re.findall('a*', s))

['', 'aaa', '', '', '', '', '', 'aaaaa', '', '']


In [167]:
print(re.findall('a+', s))

['aaa', 'aaaaa']


In [168]:
print(re.search('ma*n', s).group())

maaan


* **{} - Braces**
  * Consider this code: {n,m}. This means at least n, and at most m repetitions of the pattern to its left.

In [222]:
s = 'maaan womaaaaan'

In [235]:
print(re.finditer('a{1,6}',s))

for i in re.finditer('a{1}',s):
    print(i)

<callable_iterator object at 0x000001DE4A7409D0>
<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(2, 3), match='a'>
<re.Match object; span=(3, 4), match='a'>
<re.Match object; span=(9, 10), match='a'>
<re.Match object; span=(10, 11), match='a'>
<re.Match object; span=(11, 12), match='a'>
<re.Match object; span=(12, 13), match='a'>
<re.Match object; span=(13, 14), match='a'>


In [237]:
s = 'E90202R'
print(re.search(r'\d{2,6}', s))

<re.Match object; span=(1, 6), match='90202'>


In [238]:
print(re.search('[0-9]{2,4}', s).group())

9020


In [239]:
print(re.search('[0-9]{2,7}', s).group())

90202


* **| - Alternation**
  * Vertical bar | is used for alternation (or operator).

In [242]:
s = 'aecbeca'
print(re.search('a|b', s).group())

a


In [243]:
print(re.search('b|a', s).group())

a


In [245]:
print(re.findall('a|b', s))

['a', 'b', 'a']


* **() - Group**
  * Parentheses () are used to group sub-patterns.
  * For example, (a|b|c)xz matches any string that contains a, b or c followed by xz.

In [251]:
s = 'arecbreca'

In [253]:
# findall returns only the captured group when the pattern contains one
print(re.findall('(ar|br)ec', s))

# finditer yields match objects that cover the whole match
for i in re.finditer('(ar|br)ec', s):
    print(i)

['ar', 'br']
<re.Match object; span=(0, 4), match='arec'>
<re.Match object; span=(4, 8), match='brec'>
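
The output above shows `findall` returning only the group text while the match objects span the whole match. A minimal sketch of how to reach both parts, and of a non-capturing group `(?:...)` as one possible way to get whole matches from `findall`:

```python
import re

s = 'arecbreca'

m = re.search(r'(ar|br)ec', s)
whole = m.group(0)     # the entire match
first = m.group(1)     # text captured by the first pair of parentheses
print(whole, first, m.groups())

# (?:...) groups without capturing, so findall returns the whole matches
whole_matches = re.findall(r'(?:ar|br)ec', s)
print(whole_matches)
```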


In [260]:
s = "foo The allfootball team foo"
c = re.compile(r'\Bfoo')
print(c.search(s))

<re.Match object; span=(11, 14), match='foo'>


* [a-e] is the same as [abcde]
* [1-4] is the same as [1234]
* [0-39] is the same as [01239]
* [^abc] means any character except a or b or c
* [^0-9] means any non-digit character
* \ - Backslash - is used to escape various characters including all metacharacters. 
  * For example, `\$a` matches if a string contains `$` followed by `a`. Here, `$` is not interpreted by the regex engine in a special way.
* \A - Matches if the specified characters are at the start of a string.
  * For example, \Athe - 
    * "the sun" matches, 
    * "In the sun" no match, as "the" is not at the beginning of the string.
* \b - Matches if the specified characters are at the beginning or end of a word.
  * For example, \bfoo - 
    * "football" matches
    * "a football" matches
    * "afootball" no match
  * For example, foo\b
    * "the foo" matches
    * "the afoo test" matches
    * "the afootest" no match
* \B - Opposite of \b. Matches if the specified characters are not at the beginning or end of a word.
* \d - Matches any decimal digit. Equivalent to [0-9]
* \D - Matches any character that is not a decimal digit. Equivalent to [^0-9]
* \s - Matches where a string contains any whitespace character. Equivalent to [ \t\n\r\f\v]
* \S - Matches where a string contains any non-whitespace character. Equivalent to [^ \t\n\r\f\v].
* \w - Matches any alphanumeric character (digits and alphabets). Equivalent to [a-zA-Z0-9_]. By the way, underscore _ is also considered an alphanumeric character.
* \W - Matches any non-alphanumeric character. Equivalent to [^a-zA-Z0-9_]
* \Z - Matches if the specified characters are at the end of a string.
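
A few of the character classes and special sequences listed above can be checked with a short sketch like the one below. The sample strings (other than the `foo` sentence reused from the earlier cell) and the variable names are illustrative.

```python
import re

s = 'foo The allfootball team foo'

# \b requires a word boundary before 'foo'; \B requires that there is none
at_boundary = [m.span() for m in re.finditer(r'\bfoo', s)]
inside_word = [m.span() for m in re.finditer(r'\Bfoo', s)]
print(at_boundary, inside_word)

# \A anchors at the start of the string, \Z at the end
start_hit = re.findall(r'\Athe', 'the sun')
start_miss = re.findall(r'\Athe', 'In the sun')
end_hit = re.findall(r'sun\Z', 'the sun')
print(start_hit, start_miss, end_hit)

sample = 'Room_42 costs 99 USD\n'
digits = re.findall(r'\d+', sample)   # runs of decimal digits
words = re.findall(r'\w+', sample)    # letters, digits and underscore
spaces = re.findall(r'\s', sample)    # whitespace, including the newline
print(digits, words, spaces)

# [0-39] is the range 0-3 plus the single character 9
class_hits = re.findall(r'[0-39]', '0123456789')
print(class_hits)
```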

https://www.programiz.com/python-programming/regex