<a href="https://colab.research.google.com/github/raz0208/Techniques-For-Text-Analysis/blob/main/POS%26NER_Tagging_By_Regex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## POS Tagging (Part-of-Speech Tagging):
POS tagging assigns a part of speech (noun, verb, adjective, and so on) to each word in a sentence, based on the word's definition and its context. The tags expose the grammatical structure of the sentence.
- Example: Sentence: "The dog barked loudly."
- POS Tags:
      The → Determiner (DT)
      dog → Noun (NN)
      barked → Verb (VBD)
      loudly → Adverb (RB)

## NER Tagging (Named Entity Recognition Tagging):
NER tagging identifies named entities in text and classifies them into predefined categories such as people, organizations, locations, and dates.
- Example: Sentence: "Apple Inc. was founded in Cupertino by Steve Jobs."
- NER Tags:
      Apple Inc. → Organization (ORG)
      Cupertino → Location (LOC)
      Steve Jobs → Person (PER)

### Step 1: Import libraries and read data

In [4]:
# Import required libraries
import re
import nltk
from nltk.tokenize import word_tokenize

In [5]:
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [6]:
# Read data
text = "Apple Inc. was founded in Cupertino by Steve Jobs in 1976."

text

'Apple Inc. was founded in Cupertino by Steve Jobs in 1976.'

### Step 2: Preprocess data

In [7]:
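# Normalize the text: lowercase it and strip surrounding whitespace.
# The regex patterns in Step 3 only match lowercase letters.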
text = text.lower().strip()

print(text, "\n")

# Tokenization
tokens = word_tokenize(text)

print(tokens)

apple inc. was founded in cupertino by steve jobs in 1976. 

['apple', 'inc.', 'was', 'founded', 'in', 'cupertino', 'by', 'steve', 'jobs', 'in', '1976', '.']
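
The tokenization step can also be approximated with a single regular expression, which is handy when the NLTK data is not available. The sketch below is a stand-in under stated assumptions, not a description of how `word_tokenize` works internally: the `ABBREVIATIONS` list, `TOKEN_PATTERN`, and the `regex_tokenize` helper are all hypothetical names, and the pattern is only meant to reproduce the token list above for this sentence.

```python
import re

# Hypothetical abbreviation list: these keep their trailing period as part of the token.
ABBREVIATIONS = ["inc", "ltd", "co"]

# Alternatives are tried left to right at each position:
#   1. a known abbreviation followed by a period (e.g. "inc.")
#   2. a run of lowercase letters or digits
#   3. any single character that is neither whitespace nor alphanumeric
TOKEN_PATTERN = re.compile(
    r"\b(?:" + "|".join(ABBREVIATIONS) + r")\.|[a-z0-9]+|[^\sa-z0-9]"
)

def regex_tokenize(text):
    """Lowercase and strip the text, then split it into tokens with one regex."""
    return TOKEN_PATTERN.findall(text.lower().strip())

regex_tokens = regex_tokenize(
    "Apple Inc. was founded in Cupertino by Steve Jobs in 1976."
)
print(regex_tokens)
```

The abbreviation alternative comes first so that `inc.` stays one token, while the sentence-final period after `1976` is still split off by the third alternative.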


### Step 3: Implement tagging methods by Regex
- POS Tagging Using Regex Patterns
- Named Entity Recognition (NER) Using Regex

In [8]:
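# POS Tagging Using Regex Patterns
# Suffix heuristics only: each token is tested against the patterns in
# dictionary order and receives the tag of the first pattern that matches.
POS_TAGS = {
    'NN': r'\b[a-z]+(?:tion|ment|ness|ity|ship|ence|ance)\b',  # Nouns
    'VB': r'\b[a-z]+(?:ed|ing|s)\b',                          # Verbs (the 's' suffix also catches plural nouns)
    'JJ': r'\b[a-z]+(?:ous|ive|able|ible|al|ful)\b'           # Adjectives
}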

pos_result = []
for word in tokens:
    tag = 'UNK'  # Default tag for unknown words
    for pos, pattern in POS_TAGS.items():
        if re.fullmatch(pattern, word):
            tag = pos
            break
    pos_result.append((word, tag))

pos_result

[('apple', 'UNK'),
 ('inc.', 'UNK'),
 ('was', 'VB'),
 ('founded', 'VB'),
 ('in', 'UNK'),
 ('cupertino', 'UNK'),
 ('by', 'UNK'),
 ('steve', 'UNK'),
 ('jobs', 'VB'),
 ('in', 'UNK'),
 ('1976', 'UNK'),
 ('.', 'UNK')]
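
The output above shows the limits of suffix rules on their own: `jobs` is tagged `VB` only because it ends in `s`, and function words, numbers, and punctuation all fall through to `UNK`. One possible refinement, sketched here as an assumption rather than as the notebook's method, is to check a small closed-class lexicon first and then apply ordered rules where the first full match wins. `CLOSED_CLASS`, `ORDERED_RULES`, and `tag_token` are hypothetical names, and the tag choices are coarse heuristics.

```python
import re

# Hypothetical closed-class lexicon, consulted before any suffix rule.
CLOSED_CLASS = {
    'the': 'DT', 'a': 'DT', 'an': 'DT',
    'in': 'IN', 'by': 'IN', 'of': 'IN', 'on': 'IN',
    'was': 'VBD', 'were': 'VBD', 'is': 'VBZ',
}

# Ordered (pattern, tag) rules; the first pattern that fully matches wins.
# The bare 's' verb suffix is dropped because it also matches plural nouns.
ORDERED_RULES = [
    (r'\d+', 'CD'),                                          # cardinal numbers
    (r'[.!?]', '.'),                                         # sentence-final punctuation
    (r'[a-z]+(?:tion|ment|ness|ity|ship|ence|ance)', 'NN'),  # noun suffixes
    (r'[a-z]+(?:ous|ive|able|ible|al|ful)', 'JJ'),           # adjective suffixes
    (r'[a-z]+(?:ed|ing)', 'VB'),                             # coarse verb suffixes
]

def tag_token(token):
    """Return a tag from the lexicon, else from the first matching rule, else 'UNK'."""
    if token in CLOSED_CLASS:
        return CLOSED_CLASS[token]
    for pattern, tag in ORDERED_RULES:
        if re.fullmatch(pattern, token):
            return tag
    return 'UNK'

sample_tokens = ['apple', 'inc.', 'was', 'founded', 'in', 'cupertino',
                 'by', 'steve', 'jobs', 'in', '1976', '.']
improved_pos = [(token, tag_token(token)) for token in sample_tokens]
print(improved_pos)
```

Content words with no telling suffix, such as `apple` or `cupertino`, still come out as `UNK`; covering them would need a larger lexicon or a trained tagger.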

In [9]:
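# NER Using Regex Patterns
# Note: the loop below matches one token at a time with re.fullmatch, so
# multi-word alternatives ('steve jobs', 'new york') and 'inc' (tokenized
# as 'inc.') never match in this setup.
NER_TAGS = {
    'ORG': r'\b(?:apple|microsoft|google|inc)\b',            # Organizations
    'LOC': r'\b(?:cupertino|paris|london|new york)\b',       # Locations
    'PER': r'\b(?:steve jobs|elon musk|bill gates)\b'        # Persons
}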

ner_result = []
for word in tokens:
    entity = 'O'  # 'O' means no entity
    for ner, pattern in NER_TAGS.items():
        if re.fullmatch(pattern, word):
            entity = ner
            break
    ner_result.append((word, entity))

ner_result

[('apple', 'ORG'),
 ('inc.', 'O'),
 ('was', 'O'),
 ('founded', 'O'),
 ('in', 'O'),
 ('cupertino', 'LOC'),
 ('by', 'O'),
 ('steve', 'O'),
 ('jobs', 'O'),
 ('in', 'O'),
 ('1976', 'O'),
 ('.', 'O')]
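
In the result above, `steve` and `jobs` are both tagged `O`, because a per-token `re.fullmatch` can never match a two-word pattern such as `steve jobs`. A possible alternative, sketched here under assumptions, is to search the original untokenized text with `re.finditer` so that multi-word entities are captured as whole spans. `SPAN_PATTERNS`, `find_entities`, and the `DATE` year pattern are hypothetical additions for illustration and are not part of the notebook.

```python
import re

# Hypothetical gazetteer-style patterns, applied to the original-case text.
SPAN_PATTERNS = {
    'ORG': r'\b(?:Apple|Microsoft|Google)(?:\s+Inc\.)?',   # optional "Inc." suffix
    'LOC': r'\b(?:Cupertino|Paris|London|New York)\b',
    'PER': r'\b(?:Steve Jobs|Elon Musk|Bill Gates)\b',
    'DATE': r'\b(?:19|20)\d{2}\b',                         # four-digit years only
}

def find_entities(text):
    """Return (span_text, label) pairs ordered by their position in the text."""
    entities = []
    for label, pattern in SPAN_PATTERNS.items():
        for match in re.finditer(pattern, text):
            entities.append((match.start(), match.group(), label))
    entities.sort()  # sort by start offset
    return [(span, label) for _, span, label in entities]

raw_text = "Apple Inc. was founded in Cupertino by Steve Jobs in 1976."
span_entities = find_entities(raw_text)
print(span_entities)
# [('Apple Inc.', 'ORG'), ('Cupertino', 'LOC'), ('Steve Jobs', 'PER'), ('1976', 'DATE')]
```

Working on the raw text also keeps capitalization, which the lowercasing in Step 2 discards. The approach still only finds names that are listed in the patterns.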