### POS tagging

POS tagging (part-of-speech tagging) is the process of marking up each word in a text with its part of speech, based on both the word's definition and its context. A tagger reads text in a given language and assigns a part-of-speech tag (noun, verb, adjective, and so on) to each word. It is also called grammatical tagging.

Source: https://www.guru99.com/pos-tagging-chunking-nltk.html
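To illustrate how a tag can depend on both definition and context, here is a minimal toy sketch. The lexicon, the `toy_pos_tag` helper, and the two context rules are hypothetical and are not how `nltk` or any real tagger works; real taggers learn such decisions from annotated corpora.

```python
# Toy lexicon (an assumption for this sketch): each word lists the tags it can take.
LEXICON = {
    "i": ["PRON"],
    "the": ["DET"],
    "a": ["DET"],
    "book": ["NOUN", "VERB"],  # ambiguous: "the book" vs. "I book"
    "flight": ["NOUN"],
    "read": ["VERB"],
}


def toy_pos_tag(tokens):
    """Tag tokens using the lexicon (definition) and the previous tag (context)."""
    tagged = []
    for token in tokens:
        # Unknown words default to NOUN in this toy example
        candidates = LEXICON.get(token.lower(), ["NOUN"])
        tag = candidates[0]
        if len(candidates) > 1:
            prev_tag = tagged[-1][1] if tagged else None
            # Context rules: a determiner is followed by a noun,
            # a pronoun is followed by a verb
            if prev_tag == "DET" and "NOUN" in candidates:
                tag = "NOUN"
            elif prev_tag == "PRON" and "VERB" in candidates:
                tag = "VERB"
        tagged.append((token, tag))
    return tagged


verb_use = toy_pos_tag(["I", "book", "a", "flight"])
noun_use = toy_pos_tag(["I", "read", "the", "book"])
print(verb_use)
print(noun_use)
```

The word "book" receives a different tag in each sentence because the preceding tag differs.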

#### Chunking

Chunking in NLP is the process of grouping small pieces of information, such as individual words, into larger units. Its primary use is to form groups of “noun phrases.” It adds structure to a sentence by applying regular expressions over the tags produced by POS tagging. The resulting groups of words are called “chunks.” Chunking is also called shallow parsing.

It can be done using the `nltk` library.
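The idea of combining POS tags with regular expressions can be sketched without any library. The tagged sentence, the `chunk_noun_phrases` helper, and the noun-phrase pattern (optional determiner, any adjectives, one or more nouns, in Penn Treebank tags) are assumptions made for illustration; in `nltk` the corresponding tool is `nltk.RegexpParser`.

```python
import re

# Noun phrase: optional determiner, any number of adjectives, one or more nouns
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN[A-Z]*>)+")


def chunk_noun_phrases(tagged_tokens):
    """Group (word, tag) pairs into noun-phrase chunks with a regex over the tags."""
    starts = []   # character offset at which each token's tag begins
    pieces = []
    offset = 0
    for _, tag in tagged_tokens:
        starts.append(offset)
        piece = f"<{tag}>"
        pieces.append(piece)
        offset += len(piece)
    tag_string = "".join(pieces)  # e.g. "<DT><JJ><NN><VBD>..."

    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        first = starts.index(match.start())   # index of the first token in the chunk
        count = match.group().count("<")      # number of tags covered by the match
        chunks.append([word for word, _ in tagged_tokens[first:first + count]])
    return chunks


# Hand-tagged example sentence (no tagger is run here)
tagged = [
    ("the", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumped", "VBD"),
    ("over", "IN"), ("a", "DT"), ("lazy", "JJ"), ("dog", "NN"),
]
chunks = chunk_noun_phrases(tagged)
print(chunks)
```

Words that do not match the pattern ("jumped", "over") stay outside any chunk, which is why chunking is only a shallow parse.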

### How to do Named Entity Recognition?

There are several ways to do it:

1. Using the nltk library

```
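from nltk import pos_tag, word_tokenize
from nltk.chunk import ne_chunk

# The tokenizer, tagger and NE chunker need NLTK data packages, installed once
# with nltk.download(...); the exact package names vary by NLTK version.

def extract_ne(trees, labels):
    """Return the subtrees of `trees` whose label is in `labels`."""
    ne_list = []
    for tree in trees:
        # Plain (word, tag) tuples have no label(); only entity subtrees do
        if hasattr(tree, 'label') and tree.label() in labels:
            ne_list.append(tree)
    return ne_list

# ex is the text to analyse (assumed to be defined elsewhere)
ne_res = ne_chunk(pos_tag(word_tokenize(ex)))
labels = ['ORGANIZATION']
organizations = extract_ne(ne_res, labels)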
```


2. Using the spaCy library

```
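import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# df_wikibooks is the DataFrame from the source tutorial; any string works here
wiki_ex = df_wikibooks.iloc[11]['body_text']
doc = nlp(wiki_ex)

print('All entity types that spaCy recognised in the document above')
print({ent.label_ for ent in doc.ents})

# Collect ent.text rather than Span objects so that repeated mentions collapse
print('Persons in the document above')
print({ent.text for ent in doc.ents if ent.label_ == 'PERSON'})
print('Organizations in the document above')
print({ent.text for ent in doc.ents if ent.label_ == 'ORG'})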
```

Source: https://www.kaggle.com/code/eneszvo/ner-named-entity-recognition-tutorial

NER can also be run with a pretrained model from the Hugging Face `transformers` library. The example below uses the test set from https://www.kaggle.com/competitions/nlp-getting-started/data?select=test.csv. That dataset was created for the task "Predict which Tweets are about real disasters and which ones are not"; here it is used only to illustrate NER.

```
from transformers import pipeline
import pandas as pd
```


Load the data and preview the first rows:

```
data = pd.read_csv("./../data/tweets_classification_test.csv")
data.head()
```

|   | id | keyword | location | text |
|---|----|---------|----------|------|
| 0 | 0  |         |          | Just happened a terrible car crash |
| 1 | 2  |         |          | Heard about #earthquake is different cities, s... |
| 2 | 3  |         |          | there is a forest fire at spot pond, geese are... |
| 3 | 9  |         |          | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 |         |          | Typhoon Soudelor kills 28 in China and Taiwan |


Pick one tweet to run NER on:

```
text = data.iloc[4]["text"]
print(text)
```

```
Typhoon Soudelor kills 28 in China and Taiwan
```

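Run the NER pipeline on the tweet:

```
# grouped_entities=True merges sub-word tokens into whole entities.
# Newer transformers releases prefer aggregation_strategy="simple" for this.
generator = pipeline("ner", grouped_entities=True)
generator(text)
```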

Because no model was supplied, the pipeline defaulted to `dbmdz/bert-large-cased-finetuned-conll03-english`, revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english). Using a pipeline without specifying a model name and revision is not recommended in production. The call returns:


```
[{'entity_group': 'MISC',
  'score': 0.97845435,
  'word': 'Soudelor',
  'start': 8,
  'end': 16},
 {'entity_group': 'LOC',
  'score': 0.99978215,
  'word': 'China',
  'start': 29,
  'end': 34},
 {'entity_group': 'LOC',
  'score': 0.9997906,
  'word': 'Taiwan',
  'start': 39,
  'end': 45}]
```
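The output above is a list of dictionaries, so it is easy to post-process. The following sketch groups the detected entities by type and uses the `start` and `end` character offsets to slice the original text. The `entities_by_group` helper and its `min_score` parameter are hypothetical names introduced here; the data is copied from the output shown above, so no model is loaded.

```python
from collections import defaultdict

text = "Typhoon Soudelor kills 28 in China and Taiwan"

# Pipeline output copied from the run shown above
ner_output = [
    {"entity_group": "MISC", "score": 0.97845435, "word": "Soudelor", "start": 8, "end": 16},
    {"entity_group": "LOC", "score": 0.99978215, "word": "China", "start": 29, "end": 34},
    {"entity_group": "LOC", "score": 0.9997906, "word": "Taiwan", "start": 39, "end": 45},
]


def entities_by_group(text, ner_output, min_score=0.0):
    """Map each entity type to the text spans found for it, dropping low scores."""
    grouped = defaultdict(list)
    for ent in ner_output:
        if ent["score"] >= min_score:
            # start/end are character offsets into the original text
            grouped[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(grouped)


all_entities = entities_by_group(text, ner_output)
confident_entities = entities_by_group(text, ner_output, min_score=0.99)
print(all_entities)
print(confident_entities)
```

Raising `min_score` is a simple way to keep only the predictions the model is most confident about.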

#### Fine-tuning a model

Training a neural network from scratch is difficult and time-consuming, and for most users it is not even feasible. Instead, we can start from a model on Hugging Face that has already been trained on a large amount of text.

Through that training, such models develop a statistical understanding of the language they were trained on, but they might not be directly useful for our specific task. To make use of the model's knowledge, we can apply fine-tuning: we take the pretrained model and train it a little more on our own annotated data.

This process is called transfer learning: the knowledge a model acquired during pretraining is transferred to a new task or model. The strategy is often used in deep learning.
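When the annotated data for fine-tuning an NER model is labeled per word, one preparation step is aligning those labels with the sub-word tokens the model actually sees. The sketch below shows one common convention, not necessarily the one used by any particular tutorial: the first sub-token of each word keeps the word's label and everything else gets `-100`, the value PyTorch's cross-entropy loss ignores by default. The helper name, the label ids, and the `word_ids` list are hypothetical; with a fast Hugging Face tokenizer, such a list is returned by the encoding's `word_ids()` method.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def align_labels_with_tokens(word_labels, word_ids):
    """Expand per-word labels to per-token labels.

    word_labels: one label id per word.
    word_ids: for each token, the index of the word it came from,
              or None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)           # special token
        elif word_id != previous_word:
            aligned.append(word_labels[word_id])   # first sub-token of a word
        else:
            aligned.append(IGNORE_INDEX)           # later sub-tokens of the same word
        previous_word = word_id
    return aligned


# Hypothetical example: "Typhoon Soudelor kills", with label id 1 marking an entity
word_labels = [0, 1, 0]
# Hypothetical tokenization in which "Soudelor" is split into three sub-tokens
word_ids = [None, 0, 1, 1, 1, 2, None]

aligned = align_labels_with_tokens(word_labels, word_ids)
print(aligned)
```

The aligned list has one entry per token, so it can be passed as the labels for a token-classification model during fine-tuning.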