

# **Summary of the POS Tagging Demo Notebook**

This notebook demonstrates how to **train and evaluate a POS Tagger using a Conditional Random Field (CRF)**. It uses **two corpora**:

1. **Penn Treebank (PTB)** loaded from NLTK
2. **Universal Dependencies (UD)** loaded from a JSON file

The workflow includes:

---

## **1. Loading POS-Tagged Corpora**

* **Penn Treebank:**
  Extracts tokens and Penn POS tags (`NN`, `VBZ`, `JJ`, etc.).
* **UD Treebank:**
  Extracts words and universal POS tags (`NOUN`, `VERB`, `ADJ`, etc.).

Both are stored as `(tokens, tags)` sentence pairs.

---

## **2. Feature Extraction**

Each word is converted into a feature dictionary containing:

* word (lowercase)
* is uppercase / titlecase / digit
* word suffixes/prefixes
* previous word features
* next word features
* BOS (beginning of sentence), EOS (end of sentence)

These features help the CRF learn patterns.

---

## **3. Vectorization**

Uses `DictVectorizer` from `sklearn` to convert feature dictionaries into numerical feature vectors required for CRF training.

---

## **4. Training the CRF Tagger**

Two separate CRF models are trained:

* one on **Penn Treebank**
* one on **UD Treebank**

CRF learns:

* **emission patterns** (word → POS)
* **transition dependencies** between tags (e.g., “DT usually followed by NN”)

---

## **5. Evaluation**

Accuracy is calculated on each test set.
Shows how the CRF performs on PTB and UD using the defined scoring function.

---

## **6. Learning Aids**

Throughout the notebook, short **FAQ** sections explain:

* differences between PTB and UD
* why CRFs are used
* how features improve tagging accuracy
* how the tagging pipeline works

---

# **In short:**

This notebook walks through **building a CRF-based POS tagger from scratch**, including data loading, feature engineering, training, and evaluation for two major POS tagsets.



# So, it is time to learn to PoS Tag!

In this notebook, I'll guide you through the steps of training some models to be further utilized in our NLP Tool to do PoS Tagging. Here we won't apply any state of the art algorithm, but we won't be far either!

If you don't know how this *notebook* works, check this link: https://colab.research.google.com/notebooks/intro.ipynb#

## Getting the data (Corpus)

Let us start by where we'll get our data (our **corpus**). There are many sources, but two are the most commonly used:
* **Penn Treebank** subset from nltk (you can buy the entire Treebank, if you want, but you'll have to invest some $700~).
* The **Universal Dependencies** Treebanks, available (as of February 2020) for 90 languages (in different quality and quantity levels).

These contain the hard work of many **annotators**, which went through selected sets of sentences and annotated each one by hand, forming a corpus to be used as **supervised** input for our **machine learning algorithms**.

The following two cells will show how to import the corpus from each of these two sources.

In [1]:
#This cell loads the Penn Treebank corpus from nltk into a list variable named penn_treebank.

#No need to install nltk in google colab since it is preloaded in the environments.
#!pip install nltk
import nltk

#Ensure that the treebank corpus is downloaded
nltk.download('treebank')

#Load the treebank corpus class
from nltk.corpus import treebank

#Now we iterate over all samples from the corpus (the fileids - that are equivalent to sentences)
#and retrieve the word and the pre-labeled PoS tag. This will be added as a list of tuples with
#a list of words and a list of their respective PoS tags (in the same order).
penn_treebank = []
for fileid in treebank.fileids():
  tokens = []
  tags = []
  for word, tag in treebank.tagged_words(fileid):
    tokens.append(word)
    tags.append(tag)
  penn_treebank.append((tokens, tags))

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.




### **FAQ 1:**

**Q:** What is the purpose of downloading the Penn Treebank corpus in this code?
**A:** The Penn Treebank corpus is a pre-annotated dataset of sentences where each word is already tagged with its correct Part-of-Speech (POS) label. It’s commonly used as a benchmark dataset to train or evaluate POS taggers. This code loads that data into memory so we can use it for training or testing models.

---

### **FAQ 2:**

**Q:** Why do we use `nltk.download('treebank')` even though we are in Google Colab?
**A:** Google Colab already includes NLTK, but specific corpora like `treebank` may not be downloaded by default. Running this command ensures that the Treebank dataset is available locally in the session before it’s accessed.

---

### **FAQ 3:**

**Q:** What does the variable `penn_treebank` store after this code executes?
**A:** The variable `penn_treebank` becomes a list of tuples. Each tuple represents one sentence and contains:

* a list of **tokens** (words in the sentence), and
* a list of their corresponding **POS tags** in the same order.
  So effectively, `penn_treebank[i][0]` gives the words, and `penn_treebank[i][1]` gives their tags.

---

### **FAQ 4:**

**Q:** Why do we use `treebank.fileids()` in the loop?
**A:** Each file ID in the Treebank corpus corresponds to one sentence file. Using `treebank.fileids()` lets us iterate through all available sentences in the corpus so that we can extract words and their tags from each.

---

### **FAQ 5:**

**Q:** What is the significance of storing both tokens and tags as separate lists instead of a list of word-tag pairs?
**A:** Storing tokens and tags as parallel lists makes it easier to feed them into models that expect separate input and output sequences — for example, sequence labeling models like RNNs, LSTMs, or Transformers. It also simplifies processing and evaluation since each index directly corresponds to a word-tag pair.



In [2]:
#This cell loads the Universal Dependecies Treekbank corpus. It'll download all the packages, but we'll only use the GUM
#english package. We'll also install the conllu package, that was developed to parse data in the conLLu format, a
#format common of linguistic annotated files. We'll also have a list variable, but now named ud_treebank.

#Install conllu package, download the UD Treebanks corpus and unpack it.
!pip install conllu
!wget https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3105/ud-treebanks-v2.5.tgz
!tar zxf ud-treebanks-v2.5.tgz

#The imports needed to open and parse (interpret) the conllu file. At the end we'll have a list of dicts.
from io import open
from conllu import parse_incr

#Open the file and load the sentences to a list.
data_file = open("ud-treebanks-v2.5/UD_English-GUM/en_gum-ud-train.conllu", "r", encoding="utf-8")
ud_files = []
for tokenlist in parse_incr(data_file):
    ud_files.append(tokenlist)

#Now we iterate over all samples from the corpus and retrieve the word and the pre-labeled PoS tag (upostag). This will
#be added as a list of tuples with a list of words and a list of their respective PoS tags (in the same order).
ud_treebank = []
for sentence in ud_files:
  tokens = []
  tags = []
  for token in sentence:
    tokens.append(token['form'])
    tags.append(token['upostag'])
  ud_treebank.append((tokens, tags))

Collecting conllu
  Downloading conllu-6.0.0-py3-none-any.whl.metadata (21 kB)
Downloading conllu-6.0.0-py3-none-any.whl (16 kB)
Installing collected packages: conllu
Successfully installed conllu-6.0.0
--2025-11-16 05:12:01--  https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3105/ud-treebanks-v2.5.tgz
Resolving lindat.mff.cuni.cz (lindat.mff.cuni.cz)... 195.113.20.140
Connecting to lindat.mff.cuni.cz (lindat.mff.cuni.cz)|195.113.20.140|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://lindat.mff.cuni.cz/repository/bitstream/handle/11234/1-3105/ud-treebanks-v2.5.tgz [following]
--2025-11-16 05:12:02--  https://lindat.mff.cuni.cz/repository/bitstream/handle/11234/1-3105/ud-treebanks-v2.5.tgz
Reusing existing connection to lindat.mff.cuni.cz:443.
HTTP request sent, awaiting response... 302 Found
Location: https://lindat.mff.cuni.cz/repository/server/api/core/bitstreams/f55867a6-1e19-4ab2-843f-320e56f7a96d/content [follo



### **FAQ 1:**

**Q:** What is the Universal Dependencies (UD) Treebank, and why are we using the English-GUM dataset here?
**A:** The UD Treebank is an open, multilingual collection of grammatically annotated datasets designed for consistent cross-linguistic analysis. The **English-GUM** subset contains modern English texts annotated with part-of-speech tags and syntactic structures. It’s ideal for training and evaluating POS tagging models with standardized universal POS tags (like `NOUN`, `VERB`, `ADJ`, etc.) that are consistent across languages.

---

### **FAQ 2:**

**Q:** What is the purpose of installing and using the `conllu` package?
**A:** The `conllu` package allows Python to **read and parse files in the CoNLL-U format**, which is the standard format used by UD Treebanks. This format stores linguistic annotations (words, lemmas, POS tags, dependencies) in a structured way that can easily be converted into Python dictionaries and lists for processing.

---

### **FAQ 3:**

**Q:** What does the `parse_incr()` function do in this context?
**A:** The function `parse_incr()` reads and parses the `.conllu` file **incrementally**, one sentence at a time. This approach is memory-efficient — especially for large corpora — because it avoids loading the entire dataset into memory at once. Each parsed sentence is returned as a `TokenList` object.

---

### **FAQ 4:**

**Q:** What does the `ud_treebank` variable contain after this code runs?
**A:** Like the Penn Treebank example, `ud_treebank` is a **list of tuples**. Each tuple represents one sentence and contains two lists:

* `tokens`: all the words in that sentence (`token['form']`)
* `tags`: their corresponding **Universal POS tags** (`token['upostag']`)
  This makes it easy to feed the data into sequence labeling models for POS tagging.

---

### **FAQ 5:**

**Q:** Why do we prefer Universal POS tags (`upostag`) over language-specific tags (`xpostag`)?
**A:** Universal POS tags provide **a consistent tagging scheme across all languages**, making models more generalizable and suitable for multilingual NLP tasks. Language-specific tags vary by corpus and language, while `upostag` uses a fixed set of 17 categories defined by the UD project (like `NOUN`, `VERB`, `ADP`, etc.), simplifying training and comparison across languages.




**Word of Caution!**

Penn Treebank and UD Treebanks use *distinct tagsets*.

We won't be able to interchange them unless we make a converter - also, we'll only be able to do so from Penn->UD, because Penn Treebank has tags more detailed than UD, and we won't be able to retrieve these details from the tags without a third function and a lot of effort.

We'll only do that later, in our code.

Let us continue with the explanation of the Tagger.

#Extracting Features form Words

Next, we have to create a function that is able to extract features from our words. These features will be used to predict the PoS.

For that,  for each word, we'll pass the sentence and word index, and we'll provide a dict with the features.

To explain about the feature set (can be changed, if you want), it is composed by:
* Word: the word itself. Some words are always one PoS, others not.
* is_first, is_last: check if it is the first or last in the sentence.
* is_capitalized: first letter is caps? Maybe it is a proper noun...
* is_all_caps or is_all_lower: checks for acronyms (or common words).
* prefixes/suffixes: check word initialization/termination
* prev_word/next_word: checks the preceding and succeding word.
* has-hyphen: words with '-' may be adjectives.
* is_numeric: for numbers.
* capitals_inside: weird cases. Maybe nouns.

The basis of this feature extraction method comes from two nice articles:
* https://nlpforhackers.io/training-pos-tagger/
* https://medium.com/analytics-vidhya/pos-tagging-using-conditional-random-fields-92077e5eaa31

If you're wondering, yes, this encoding WILL need a lot of memory for training (if you're not using categorical variables).

And we'll have to replicate this in our main code.

In [3]:
#Regex module for checking alphanumeric values.
import re
def extract_features(sentence, index):
  return {
      'word':sentence[index],
      'is_first':index==0,
      'is_last':index ==len(sentence)-1,
      'is_capitalized':sentence[index][0].upper() == sentence[index][0],
      'is_all_caps': sentence[index].upper() == sentence[index],
      'is_all_lower': sentence[index].lower() == sentence[index],
      'is_alphanumeric': int(bool((re.match('^(?=.*[0-9]$)(?=.*[a-zA-Z])',sentence[index])))),
      'prefix-1':sentence[index][0],
      'prefix-2':sentence[index][:2],
      'prefix-3':sentence[index][:3],
      'prefix-3':sentence[index][:4],
      'suffix-1':sentence[index][-1],
      'suffix-2':sentence[index][-2:],
      'suffix-3':sentence[index][-3:],
      'suffix-3':sentence[index][-4:],
      'prev_word':'' if index == 0 else sentence[index-1],
      'next_word':'' if index < len(sentence) else sentence[index+1],
      'has_hyphen': '-' in sentence[index],
      'is_numeric': sentence[index].isdigit(),
      'capitals_inside': sentence[index][1:].lower() != sentence[index][1:]
  }

We now prepare the dataset for use in Machine Learning algorithms.

There are two steps (three, if we're doing deep learning, but that's for later) to it:
* Defining a function to transform the corpus to a more datsetish format.
* Then, divide the encoded data into training and testing sets.

In [4]:
#Ater defining the extract_features, we define a simple function to transform our data in a more 'datasetish' format.
#This function returns the data as two lists, one of Dicts of features and the other with the labels.
def transform_to_dataset(tagged_sentences):
  X, y = [], []
  for sentence, tags in tagged_sentences:
    sent_word_features, sent_tags = [],[]
    for index in range(len(sentence)):
        sent_word_features.append(extract_features(sentence, index)),
        sent_tags.append(tags[index])
    X.append(sent_word_features)
    y.append(sent_tags)
  return X, y

#We divide the set BEFORE encoding. Why? To have full sentences in training/testing sets. When we encode, we do not encode
#a sentence, but its words instead.

#First, for the Penn treebank.
penn_train_size = int(0.8*len(penn_treebank))
penn_training = penn_treebank[:penn_train_size]
penn_testing = penn_treebank[penn_train_size:]
X_penn_train, y_penn_train = transform_to_dataset(penn_training)
X_penn_test, y_penn_test = transform_to_dataset(penn_testing)

#Then, for UD Treebank.
ud_train_size = int(0.8*len(ud_treebank))
ud_training = ud_treebank[:ud_train_size]
ud_testing = ud_treebank[ud_train_size:]
X_ud_train, y_ud_train = transform_to_dataset(ud_training)
X_ud_test, y_ud_test = transform_to_dataset(ud_testing)

#Third step, vectorize datasets. For that we use sklearn DictVectorizer
#WARNING



### **FAQ 1:**

**Q:** What is the purpose of the `extract_features()` function in POS tagging?
**A:** The `extract_features()` function creates a set of **handcrafted features** for each word in a sentence. These features capture useful linguistic patterns — such as capitalization, prefixes, suffixes, and neighboring words — that help machine learning models (like CRFs or Logistic Regression) predict the correct POS tag for each word.

---

### **FAQ 2:**

**Q:** Why are prefix and suffix features included, and how do they help in POS tagging?
**A:** Prefixes and suffixes are strong indicators of word categories. For example, words ending with *“ing”* are often **verbs**, and those ending with *“ly”* are often **adverbs**. Including these features helps the model recognize morphological patterns and improve tagging accuracy, especially for unseen words.

---

### **FAQ 3:**

**Q:** What does the `'is_alphanumeric'` feature check using regex?
**A:** The `'is_alphanumeric'` feature uses a regular expression to check whether the word contains **both letters and digits** (e.g., “A1”, “Model3”). This helps the model handle tokens like product names, IDs, or codes — which often don’t fit neatly into standard word categories but still occur frequently in text.

---

### **FAQ 4:**

**Q:** What are the `'prev_word'` and `'next_word'` features used for?
**A:** These features provide **contextual information** by including the previous and next words in the sentence. POS tagging is a **sequence labeling** task, so the tag of a word often depends on its neighbors (e.g., “to” before a verb, or “the” before a noun). These features help the model learn such patterns.

---

### **FAQ 5:**

**Q:** There seems to be a small issue with the code for `'next_word'`. What is it?
**A:** The line

```python
'next_word':'' if index < len(sentence) else sentence[index+1],
```

should actually be

```python
'next_word':'' if index == len(sentence)-1 else sentence[index+1],
```

Otherwise, it may try to access an index beyond the last element, causing an **IndexError**. This correction ensures the feature safely handles the last word in the sentence.




# Training a Tagger

Now, we can train supervised machine learning algorithms to PoS Tagging.

We'll use the Conditional Random Fields (CRF) algorithm. Here's a brief explanation:

* **CRF**: A variation of Markov Random Field. Okay, that might not have helped. It is a discriminative model that, in a quick summary, evaluates the probabilities that a set of states are dependant or not between themselves based on a set of observations. In this case, it evaluates the probabilities that a word observed in a context (defined by the above mentioned features) belongs to a specific PoS. In training time, it takes what is the best state given the set of current observations and probabilities.

<div>
<img src="https://miro.medium.com/max/681/1*8hOWH7YF5INMF2OPhKjVxA.png" width="400"/>
</div>

Want more math? Read this: https://towardsdatascience.com/conditional-random-fields-explained-e5b8256da776

So, to achieve this, we'll use scikit learn (sklearn) and a sklearn compatible crf suite (skleran_crfsuit). If you don't know what is sklearn, [read this](https://scikit-learn.org/stable/getting_started.html).

In [5]:
#Ignoring some warnings for the sake of readability.
import warnings
warnings.filterwarnings('ignore')

#First, install sklearn_crfsuite, as it is not preloaded into Colab.
!pip install sklearn_crfsuite
from sklearn_crfsuite import CRF

#This loads the model. Specifics are:
#algorithm: methodology used to check if results are improving. Default is lbfgs (gradient descent).
#c1 and c2:  coefficients used for regularization.
#max_iterations: max number of iterations (DUH!)
#all_possible_transitions: since crf creates a "network", of probability transition states,
#this option allows it to map even "connections" not present in the data.
penn_crf = CRF(
    algorithm='lbfgs',
    c1=0.01,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
#The fit method is the default name used by Machine Learning algorithms to start training.
print("Started training on Penn Treebank corpus!")
penn_crf.fit(X_penn_train, y_penn_train)
print("Finished training on Penn Treebank corpus!")

#Same for UD
ud_crf = CRF(
    algorithm='lbfgs',
    c1=0.01,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
print("Started training on UD corpus!")
ud_crf.fit(X_ud_train, y_ud_train)
print("Finished training on UD corpus!")

Collecting sklearn_crfsuite
  Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl.metadata (4.9 kB)
Collecting python-crfsuite>=0.9.7 (from sklearn_crfsuite)
  Downloading python_crfsuite-0.9.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.3 kB)
Downloading sklearn_crfsuite-0.5.0-py2.py3-none-any.whl (10 kB)
Downloading python_crfsuite-0.9.11-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m29.5 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: python-crfsuite, sklearn_crfsuite
Successfully installed python-crfsuite-0.9.11 sklearn_crfsuite-0.5.0
Started training on Penn Treebank corpus!
Finished training on Penn Treebank corpus!
Started training on UD corpus!
Finished training on UD corpus!




### **FAQ 1:**

**Q:** What is the role of a Conditional Random Field (CRF) in POS tagging?
**A:** CRFs are **sequence modeling algorithms** that predict labels (like POS tags) by considering both the features of the current word and the **dependencies between neighboring tags**. Unlike models that classify each word independently, CRFs model the probability of an entire tag sequence, which helps improve accuracy by enforcing consistency (e.g., “a” is usually followed by a noun).

---

### **FAQ 2:**

**Q:** What does the parameter `algorithm='lbfgs'` mean in the CRF model?
**A:** The `'lbfgs'` algorithm is an **optimization method** (Limited-memory BFGS) used to find the optimal model parameters that maximize the likelihood of the training data. It’s a type of gradient descent that works efficiently even for large feature spaces, which is ideal for CRFs trained on text.

---

### **FAQ 3:**

**Q:** What are `c1` and `c2` parameters, and how do they affect training?
**A:** These are **regularization coefficients** that prevent overfitting:

* `c1` controls **L1 regularization**, which encourages sparsity by driving some feature weights to zero.
* `c2` controls **L2 regularization**, which penalizes large weights smoothly.
  Tuning these parameters balances model complexity and generalization performance.

---

### **FAQ 4:**

**Q:** Why is `all_possible_transitions=True` used in CRF initialization?
**A:** This setting tells the CRF model to **include transition probabilities between all possible tag pairs**, even if some transitions don’t appear in the training data. This makes the model more robust when it encounters new or rare tag sequences during prediction.

---

### **FAQ 5:**

**Q:** Why do we train two separate CRF models — `penn_crf` and `ud_crf` — instead of one unified model?
**A:** The two corpora (Penn Treebank and UD Treebank) use **different tagging schemes**:

* **Penn Treebank** uses a detailed English-specific tagset (like `NN`, `VBD`, `PRP`).
* **UD Treebank** uses **Universal POS tags** (like `NOUN`, `VERB`, `ADJ`).
  Training separate models ensures each CRF learns patterns consistent with its corpus and tagging standard, allowing accurate tagging in both formats.




# Checking the Results

For that, we'll use a score method named balanced f-score. This score takes into account *precision* and *recall*.

* **precision**: Considering the universe of tagged words, how many were correctly tagged?
* **recall**: Considering the universe of correct tags, how many words were really correctly tagged?

The distinction is in the direction you look. Precision looks at all tagged words to find how many are ok; Recall looks at correct tags to find how many were able to be "guessed".

F-score is then calculated using these two. I won't go into the maths of it.  If you want,
* You can read the wikipedia article here: https://en.wikipedia.org/wiki/F1_score
* Or watch a neat simple video here: https://www.youtube.com/watch?v=j-EB6RqqjGI&ab_channel=CodeEmporium

Also, here's the wikipedia image to help you understand:
<div>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/350px-Precisionrecall.svg.png"/>
</div>

We won't go into the computations either. Let the package do its thing (after all, we're interested in NLP now, not in statistics):

In [6]:
#We'll use the sklearn_crfsuit own metrics to compute f1 score.
from sklearn_crfsuite import metrics
from sklearn_crfsuite import scorers
print("## Penn ##")

#First calculate a prediction from test data, then we print the metrics for f-1 using the .flat_f1_score method.
y_penn_pred=penn_crf.predict(X_penn_test)
print("F1 score on Test Data")
print(metrics.flat_f1_score(y_penn_test, y_penn_pred,average='weighted',labels=penn_crf.classes_))
#For the sake of clarification, we do the same for train data.
y_penn_pred_train=penn_crf.predict(X_penn_train)
print("F1 score on Training Data ")
print(metrics.flat_f1_score(y_penn_train, y_penn_pred_train,average='weighted',labels=penn_crf.classes_))

# This presents class wise score. Helps see which classes (tags) are the ones with most problems.
print("Class wise score:")
print(metrics.flat_classification_report(
    y_penn_test, y_penn_pred, labels=penn_crf.classes_, digits=3
))

#Same for UD
print("## UD ##")

y_ud_pred=ud_crf.predict(X_ud_test)
print("F1 score on Test Data ")
print(metrics.flat_f1_score(y_ud_test, y_ud_pred,average='weighted',labels=ud_crf.classes_))
y_ud_pred_train=ud_crf.predict(X_ud_train)
print("F1 score on Training Data ")
print(metrics.flat_f1_score(y_ud_train, y_ud_pred_train,average='weighted',labels=ud_crf.classes_))

### Look at class wise score
print("Class wise score:")
print(metrics.flat_classification_report(
    y_ud_test, y_ud_pred, labels=ud_crf.classes_, digits=3
))


## Penn ##
F1 score on Test Data
0.9668646324625245
F1 score on Training Data 
0.9936643188628935
Class wise score:
              precision    recall  f1-score   support

         NNP      0.952     0.963     0.957      1213
           ,      1.000     1.000     1.000       592
          CD      1.000     0.999     0.999       683
         NNS      0.964     0.986     0.975       740
          JJ      0.879     0.912     0.895       731
          MD      0.993     1.000     0.996       135
          VB      0.980     0.946     0.963       313
          DT      0.992     0.993     0.992      1062
          NN      0.962     0.955     0.958      1899
          IN      0.981     0.980     0.981      1285
           .      1.000     1.000     1.000       509
         VBZ      0.958     0.936     0.947       219
         VBG      0.936     0.876     0.905       185
          CC      1.000     0.997     0.998       287
         VBD      0.965     0.945     0.955       492
         VBN      0



### **FAQ 1:**

**Q:** What is the purpose of calculating the F1 score in POS tagging evaluation?
**A:** The F1 score combines **precision** and **recall** into a single metric, showing how well the model balances correctness and completeness in its predictions. In POS tagging, it tells us how accurately the model assigns the right tag to each word — considering both false positives and false negatives. A higher F1 means better tagging performance.

---

### **FAQ 2:**

**Q:** Why do we use the `flat_f1_score` function instead of the regular `f1_score` from sklearn?
**A:** The regular `f1_score` works for flat (non-sequential) label data.
In POS tagging, each sentence is a **sequence of tags**, so `flat_f1_score` from `sklearn_crfsuite.metrics` first **flattens** all predicted and true tag sequences into single lists before computing the score. This ensures correct evaluation across all tokens in the test set.

---

### **FAQ 3:**

**Q:** What is the difference between F1 scores on **training data** and **test data**?
**A:**

* The **training F1 score** shows how well the model learned patterns in the data it was trained on.
* The **test F1 score** measures how well the model generalizes to unseen data.
  If the training score is much higher than the test score, it indicates **overfitting** — meaning the model memorized training examples instead of learning general patterns.

---

### **FAQ 4:**

**Q:** What does the “class-wise score” report tell us?
**A:** The class-wise report (generated using `flat_classification_report`) shows **precision, recall, and F1** for each POS tag (e.g., `NOUN`, `VERB`, `ADJ`).
This helps identify which tags are predicted accurately and which ones cause confusion. For example, the model might struggle more with rare tags like `PRP$` (possessive pronoun) compared to common tags like `NN`.

---

### **FAQ 5:**

**Q:** Why is the `average='weighted'` option used when computing the F1 score?
**A:** The `weighted` average ensures that tags with **more occurrences** (like `NOUN` or `VERB`) have a greater impact on the overall F1 score than rare ones. This provides a more realistic overall performance measure — especially when the dataset has **imbalanced tag frequencies**.



Not too shabby!

Remember that State of the Art results for Penn Treebank are at 97% f1.

Now, notice how UD is worse (90%)? Probably because there aren't many tags, so less variation and less classes for probability distribution.

---

But, wouldn't it be better if we could see it actually working?

That's what the following cell does. It also helps us understand what we'll have to implement in our main algorithm for it to work.

Feel free to play with the input phrase.



In [7]:
#First, we pass the sentence and "quickly tokenize it" - we've already done it in our code, so I'll just mock here with a split:
sent = "The tagger produced good results"
features = [extract_features(sent.split(), idx) for idx in range(len(sent.split()))]

#Then we tell the algorithm to make a prediction on a single input (sentence). I'll do once for Penn Treebank and once for UD.
penn_results = penn_crf.predict_single(features)
ud_results = ud_crf.predict_single(features)

#These line magics are just there to make it a neaty print, making a (word, POS) style print;
penn_tups = [(sent.split()[idx], penn_results[idx]) for idx in range(len(sent.split()))]
ud_tups = [(sent.split()[idx], ud_results[idx]) for idx in range(len(sent.split()))]

#The results come out here! Notice the difference in tags.
print(penn_tups)
print(ud_tups)

[('The', 'DT'), ('tagger', 'NN'), ('produced', 'VBN'), ('good', 'JJ'), ('results', 'NNS')]
[('The', 'DET'), ('tagger', 'NOUN'), ('produced', 'VERB'), ('good', 'ADJ'), ('results', 'NOUN')]


# Saving the Weights

We will want to load this to our NLPTools, right? So we have to save the weights. This means saving the classifier we trained to be able to classify our tokens.

To do it, we use Pickle, which is a Python package to save a readable binary file extension called "pickle". We'll later open this in our tool.



In [8]:
#import the pickle module
import pickle

#Simply dump! Use 'wb' in open to write bytes.

penn_filename = 'penn_treebank_crf_postagger.sav'
pickle.dump(penn_crf, open(penn_filename, 'wb'))

ud_filename = 'ud_crf_postagger.sav'
pickle.dump(ud_crf, open(ud_filename,'wb'))



### **FAQ 1:**

**Q:** What is happening when we call `penn_crf.predict_single(features)` and `ud_crf.predict_single(features)`?
**A:** These methods use the trained CRF models to **predict the POS tag** for each word in a new input sentence.

* `penn_crf` predicts using the **Penn Treebank tagset** (e.g., `NN`, `VBD`, `DT`),
* `ud_crf` predicts using **Universal POS tags** (e.g., `NOUN`, `VERB`, `DET`).
  This step demonstrates how the model can be applied to unseen sentences after training.

---

### **FAQ 2:**

**Q:** Why do we use `extract_features()` before making predictions?
**A:** The CRF model was trained on **feature representations** of words, not just raw text.
So before predicting, each token in the input sentence must be converted into the same feature format used during training (like capitalization, prefixes, neighboring words, etc.). This ensures the model interprets the new sentence correctly and applies learned patterns accurately.

---

### **FAQ 3:**

**Q:** Why are there differences between `penn_tups` and `ud_tups` outputs for the same sentence?
**A:** The two models are trained on **different tagging schemes**:

* The **Penn Treebank** model outputs **fine-grained English-specific tags** (e.g., `NN`, `VBZ`).
* The **UD Treebank** model outputs **universal, language-agnostic tags** (e.g., `NOUN`, `VERB`).
  Thus, while both predict POS labels, their tag sets and levels of granularity differ — demonstrating how different corpora represent linguistic structure.

---

### **FAQ 4:**

**Q:** What is the purpose of saving the CRF models using `pickle.dump()`?
**A:** Pickling allows you to **serialize (save)** a trained model to disk so you can **reload and reuse it** later without retraining.
For example, after saving `penn_treebank_crf_postagger.sav`, you can load it anytime using:

```python
penn_crf = pickle.load(open('penn_treebank_crf_postagger.sav', 'rb'))
```

This is essential for deploying your POS tagger in real-world applications or demos.

---

### **FAQ 5:**

**Q:** Why do we use `'wb'` mode when saving models with pickle?
**A:** The `'wb'` (write-binary) mode ensures that the model object is written in **binary format**, which is required by `pickle` for correctly storing complex Python objects like trained models.
Using plain text mode (`'w'`) would corrupt the data since binary objects can’t be represented as text safely.




To open the file, we just have to import the module and read the file using:

`model = pickle.load(open(filename, 'rb'))`

Great, we now have pickle files that can be loaded in our tool. Just download them using the lefthand file explorer and we're good to go!
See you back at the article!

In [9]:
!pip install spacy

# Load the English model
import spacy
nlp = spacy.load("en_core_web_sm")  # or "en_core_web_trf" for transformer-based accuracy

# Test sentence
sent = "The tagger produced good results."
doc = nlp(sent)

# Print tokens with POS tags
for token in doc:
    print(token.text, "→", token.pos_)


The → DET
tagger → NOUN
produced → VERB
good → ADJ
results → NOUN
. → PUNCT


In [10]:
import nltk

nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('punkt_tab')

text = nltk.word_tokenize("The tagger produced good results.")
tags = nltk.pos_tag(text)
print(tags)


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


[('The', 'DT'), ('tagger', 'NN'), ('produced', 'VBD'), ('good', 'JJ'), ('results', 'NNS'), ('.', '.')]


In [11]:
!pip install transformers
from transformers import pipeline

pos_tagger = pipeline(
    "token-classification",
    model="vblagoje/bert-english-uncased-finetuned-pos",
    aggregation_strategy="simple"
)

sentence = "The tagger produced good results."
results = pos_tagger(sentence)
for token in results:
    print(token['word'], "→", token['entity_group'])




config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

Some weights of the model checkpoint at vblagoje/bert-english-uncased-finetuned-pos were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Device set to use cuda:0


the → DET
tagger → NOUN
produced → VERB
good → ADJ
results → NOUN
. → PUNCT




### **FAQ 1:**

**Q:** Why are we using pretrained POS taggers instead of training our own models?
**A:** Pretrained taggers are **already trained on large, high-quality annotated corpora**, so they save time, compute, and data effort. They offer strong accuracy out of the box — perfect for demos or production systems. While custom models are good for research or domain-specific tasks, pretrained ones are ideal for general-purpose POS tagging.

---

### **FAQ 2:**

**Q:** What’s the main difference between NLTK, spaCy, and transformer-based POS taggers?
**A:**

* **NLTK:** Uses a **rule/statistical** model (Averaged Perceptron); outputs **Penn Treebank tags** like `NN`, `VBZ`.
* **spaCy:** Uses a **neural model** trained on **Universal Dependencies (UD)**; outputs simpler, consistent tags like `NOUN`, `VERB`.
* **Transformers (Hugging Face):** Use **deep contextual models** (like BERT) for **state-of-the-art accuracy**, understanding context deeply (e.g., “run” as a verb vs noun).

---

### **FAQ 3:**

**Q:** Why do NLTK and spaCy show different POS tags for the same sentence?
**A:** They follow **different tagging schemes**:

* NLTK uses the **Penn Treebank** tagset (fine-grained and English-specific).
* spaCy and transformer models use **Universal POS (UPOS)** tags, which are language-agnostic and simpler (17 standard tags).
  So the difference isn’t an error — it’s just a variation in **annotation standards**.

---

### **FAQ 4:**

**Q:** Which pretrained POS tagger should I choose for real-world applications?
**A:**

* Use **spaCy** for fast, accurate, production-level tagging with built-in NLP tools.
* Use **transformer models** (like BERT-based taggers) for **maximum accuracy** or multilingual contexts.
* Use **NLTK** when you need lightweight, quick tagging for English text or educational examples.
  Each balances **speed, accuracy, and resource usage** differently.

---

### **FAQ 5:**

**Q:** Can I fine-tune these pretrained models for my own domain (e.g., medical or legal text)?
**A:** Yes. Both **spaCy** and **Hugging Face** models can be **fine-tuned** on domain-specific corpora.

* In spaCy, you can continue training the pipeline with your own labeled data.
* In Transformers, you can fine-tune models like BERT using token classification tasks.
  This gives you the best of both worlds — pretrained knowledge plus domain adaptation.


