# Natural language processing tutorial
Reference: https://www.udemy.com/course/introduction-to-natural-language-processing/learn/lecture/17935522#overview

# Topics covered
- Introduction to text data and natural language processing
- Key techniques related to natural language processing
- Implement natural language processing on real datasets

# Dataset Types
- #### Structured Dataset
    - Fixed Dimensions (rows and columns)
    - Information is well organized
    - Data is available in a tabular format (stored in relational databases, or in formats such as JSON)
- #### Unstructured Dataset
    - No fixed dimension
    - Can take any form, such as image, video, audio, or text
    - Text unstructured data examples:
        - Social media: tweets, posts, and comments
        - Conversations: messages, emails, chats
        - Articles: news, blogs, transcripts
    - Text data is meaningful content written in any language, following that language's grammar and defined structures
    
Note that data in an unstructured dataset cannot be well defined in a tabular format.

# Natural language processing
Natural language processing (NLP) is the branch of data science that helps in deriving useful information and insights from text data.

### Applications of NLP:
- Analyze customer feedback and customer sentiment about a product or service in order to improve it
- Automatic categorization of customer queries (map the top queries to specific departments, and route each query to a department using business logic defined on the top phrases/queries)
- Identify patients at risk of cancer (analyze historical case notes on prescribed medicines, symptoms, etc. to identify each patient's level of risk)
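As a rough illustration of the query-routing application above, here is a minimal rule-based sketch. The department names, the keywords, and the `route_query` helper are all hypothetical and invented for this example; a real system would derive the phrases from actual customer queries.

```python
import re

# Hypothetical departments and keyword patterns, invented for illustration only.
ROUTING_RULES = {
    "billing": re.compile(r"\b(refund|invoice|payment)\b", re.IGNORECASE),
    "technical_support": re.compile(r"\b(error|crash|login)\b", re.IGNORECASE),
}


def route_query(query, default="general"):
    """Return the first department whose keyword pattern occurs in the query."""
    for department, pattern in ROUTING_RULES.items():
        if pattern.search(query):
            return department
    return default


print(route_query("I was charged twice and need a refund"))
print(route_query("The app shows an error after login"))
print(route_query("What are your opening hours?"))
```

The rules are checked in insertion order, so a query that mentions keywords from two departments goes to the first one listed.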

# Regular Expressions
- ### What are regular expressions? 
    - Patterns of special characters that have an associated textual meaning (ex: "\d" matches a digit)
    - Wildcard expressions for matching strings, finding strings, and parsing strings
    - Used for writing rule-based information mining
- ### Why do we need regular expressions?
    - Used for segmenting sentences into words, or paragraphs into sentences. This process is called tokenization
    - Used for text cleaning to remove text noise or unwanted information
    - Information retrieval from texts (chatbots, news datasets etc.)
- ### Types of regular expressions:
    - There are many regular expression patterns to match specific structures such as characters, or integers, or decimals etc.
- ### Regular expression functions available in python re package:
    - match: checks whether the pattern matches at the beginning of the string
    - search: locates the first occurrence of the pattern anywhere in the string
    - findall: finds all occurrences of the pattern in the string
    - sub: search and replace a pattern in the string
    - split: split the given text by a regular expression
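The tokenization and text-cleaning uses listed above can be sketched with `re.sub`, `re.split`, and `re.findall`. The sample text and the patterns below are assumptions chosen for illustration; real text usually needs more careful rules (abbreviations, decimals, and so on).

```python
import re

# Made-up sample text containing noise (a URL and repeated spaces).
text = "NLP is fun!!   Visit https://example.com for more. It costs $20."

# Text cleaning: drop URLs, then collapse repeated whitespace.
no_urls = re.sub(r"https?://\S+", "", text)
cleaned = re.sub(r"\s+", " ", no_urls).strip()
print(cleaned)

# Sentence segmentation: split on whitespace that follows '.', '!' or '?'.
sentences = re.split(r"(?<=[.!?])\s+", cleaned)
print(sentences)

# Word tokenization: runs of word characters (letters, digits, underscore).
words = re.findall(r"\w+", cleaned)
print(words)
```

Note that `\w+` drops punctuation and the `$` sign entirely, which may or may not be what a given task needs.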

![Types of regular expression](.\\other-data\\types_of_regular_expressions.png "Types of regular expressions")
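A few commonly used pattern types (digits, decimals, letters) can be tried out directly. This is a small sketch on an invented sample string; it does not reproduce the contents of the image.

```python
import re

# Invented sample string for illustration.
sample = "Order 66 shipped 3 items for 19.99 USD on day_2"

# \d+ : one or more digits (also picks up the digit runs inside 19.99 and day_2)
digit_runs = re.findall(r"\d+", sample)
print(digit_runs)

# \d+\.\d+ : digits, a literal dot, then digits (a simple decimal number)
decimals = re.findall(r"\d+\.\d+", sample)
print(decimals)

# [A-Za-z]+ : one or more ASCII letters
letter_runs = re.findall(r"[A-Za-z]+", sample)
print(letter_runs)

# Lookarounds keep only digit runs that are not attached to a word character or a dot.
standalone_integers = re.findall(r"(?<![\w.])\d+(?![\w.])", sample)
print(standalone_integers)
```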
    

In [1]:
# Example of using regular expressions in python:
import re
import numpy as np

# match - returns a match object only if the pattern matches at the start of the string, else None
string = "Tiger is the national animal of India"
pattern = 'Tiger'
match = re.match(pattern, string) # None if the string does not start with the pattern
result = match.group(0) if match else None # Guard against None before calling group()
print(f'Match: {result} is present as the first occurence in the sentence', '\n')

# search - returns a match object for the first occurrence of the pattern anywhere in the string, else None
string = "The national animal of India is Tiger"
pattern = 'Tiger'
result = re.search(pattern, string)
print(f'Search: {result.group(0)} is available in the sentence in the position: {result.span()}', '\n')

# findall - returns a list of all occurrences of the pattern in the sentence
string = "The national animal of India is Tiger and national sport is hockey"
pattern = 'national'
result = re.findall(pattern, string)
print(f'Total occurence of the pattern: {pattern} is {len(result)}', '\n') # findall returns a list, so len() gives the count

# We can also get the index of the pattern using re.finditer
result = re.finditer(pattern, string)
for var in result:
    print(f'Start index: {var.start()} | Ending index: {var.end()}')
    print(string[var.start():var.end()], '\n')

# Another example of findall to find all the dates in a given text
string = "Ram, ID # 2554-896 and birthday is on 03/21/1985 whereas the ID # for Kumar is 5547-65-774 and his birthday is on 11/12/2000"
pattern = r'\d{2}/\d{2}/\d{4}' # raw string: 2 digits, a forward slash, 2 digits, another forward slash, then 4 digits
result = re.findall(pattern, string)
print(f'All dates available in the sentence: {result}', '\n')

# split - splits sentences using the provided delimiter
string = 'this;is a sample,sentence'
pattern = r'[;,\s]' # A character class in square brackets acts like an OR: it matches any one of ';', ',' or a whitespace character
result = re.split(pattern, string)
print(f'Split words from sentence using the delimiter: {result}', '\n')

# sub - search and replace a pattern in a given string
string = 'cricket is a popular sport in india. Sachin is a famous cricketer'
pattern = 'cricket'
replacement = 'hockey' # Without extra constraints such as word boundaries, the pattern also matches inside 'cricketer' and turns it into 'hockeyer'
result = re.sub(pattern, replacement, string)
print(f'Modified string after sub: {result}', '\n')

Match: Tiger is present as the first occurence in the sentence 

Search: Tiger is available in the sentence in the position: (32, 37) 

Total occurence of the pattern: national is 2 

Start index: 4 | Ending index: 12
national 

Start index: 42 | Ending index: 50
national 

All dates available in the sentence: ['03/21/1985', '11/12/2000'] 

Split words from sentence using the delimiter: ['this', 'is', 'a', 'sample', 'sentence'] 

Modified string after sub: hockey is a popular sport in india. Sachin is a famous hockeyer 
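The `hockeyer` result above comes from the pattern matching inside a longer word. One way to restrict the replacement to the exact word is the `\b` word-boundary anchor, sketched here on the same sentence:

```python
import re

string = 'cricket is a popular sport in india. Sachin is a famous cricketer'

# \b marks a word boundary, so only the standalone word 'cricket' matches;
# 'cricketer' is left untouched because 'cricket' is followed by more letters.
result = re.sub(r'\bcricket\b', 'hockey', string)
print(f'Modified string after sub: {result}')
```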



In [2]:
string = "Ram, ID # 2554-896 and birthday is on 03/08/1985 whereas the ID # for Kumar is 5547-65-774 and his birthday is on 11/12/2000"
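# Raw string; the hyphen is placed last in [./-] so it is a literal separator rather than a range between '.' and '/'
# The pattern has two capturing groups, so re.findall returns (month, day) tuples instead of the full dates
pattern = r'(0[1-9]|1[0-2])[./-]([0-2][0-9]|3[0-1])[./-][0-9]{4}'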
# pattern = '0[1-9][\.-/]|1[0-2][\.-/][0-2][0-9][\.-/]|3[0-1][\.-/]'
# pattern = '0[1-9][\.-/]|1[0-2][\.-/]'
# pattern = '[0-2][0-9][\.-/]|3[0-1][\.-/]'
result = re.findall(pattern, string)
print(result)

# https://stackoverflow.com/questions/4709652/python-regex-to-match-dates
# date="13-11-2017"
#x=re.search("^([1-9] |1[0-9]| 2[0-9]|3[0-1])(.|-)([1-9] |1[0-2])(.|-|)20[0-9][0-9]$",date)
#x.group()

[('03', '08'), ('11', '12')]
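The output above contains only month and day because `re.findall` returns the captured groups when a pattern has them. A minimal sketch of one alternative: use non-capturing groups `(?:...)` to get the full dates, then check them with `datetime.strptime`. The `is_valid_date` helper is hypothetical and assumes the `MM/DD/YYYY` format with slashes.

```python
import re
from datetime import datetime

string = "Ram, ID # 2554-896 and birthday is on 03/08/1985 whereas the ID # for Kumar is 5547-65-774 and his birthday is on 11/12/2000"

# (?:...) groups without capturing, so findall returns the whole match.
pattern = r'(?:0[1-9]|1[0-2])[./-](?:0[1-9]|[12][0-9]|3[01])[./-][0-9]{4}'
dates = re.findall(pattern, string)
print(dates)


def is_valid_date(text, fmt="%m/%d/%Y"):
    """Hypothetical helper: True if text is a real calendar date in the given format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


# A regex alone accepts impossible dates such as 02/30/2000; strptime rejects them.
print([is_valid_date(d) for d in dates])
print(is_valid_date("02/30/2000"))
```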


In [29]:
string = "Contact us on training_queries@analyticsvidhya.com"
pattern = r'([\w.]+)@([\w.]+)' # raw string: group 1 is the part before '@', group 2 is the part after it
match = re.search(pattern, string)
print(match.group(0))

training_queries@analyticsvidhya.com
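The two parenthesized groups in this pattern can also be read individually. A short sketch, using the same string and pattern, that separates the user name from the domain:

```python
import re

string = "Contact us on training_queries@analyticsvidhya.com"
match = re.search(r'([\w.]+)@([\w.]+)', string)

# re.search returns None when nothing matches, so check before reading groups.
if match:
    user, domain = match.group(1), match.group(2)
    print(f'User: {user}')
    print(f'Domain: {domain}')
```

This simple pattern is fine for a demonstration, but it is not a full email validator: for example, it does not allow `+` or `-` in the address.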
