A POS tagger for English and Catalan, trained on datasets from Universal Dependencies.
It is good practice to run all commands and installations inside a Python virtual environment. Check out this page to learn how.
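For example, a minimal sketch using Python's built-in venv module (the environment name `.venv` is just a convention, not something this repo requires):

```bash
python -m venv .venv          # create the environment
source .venv/bin/activate     # activate it (on Windows: .venv\Scripts\activate)
```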
To install the dependencies to reproduce this project, just run in the terminal:

```bash
pip install -r requirements.txt
```
To use this implementation of the POS tagger, you just have to use our class! Inside a Python terminal, run:

```python
>>> from src.tagger import HiddenMarkovModel
>>> from src.scrapper import parse_conllu_file
>>> train = parse_conllu_file(filepath="datasets/en_gum-ud-train.conllu")
>>> tagger = HiddenMarkovModel(corpus=train).train()
>>> test = [[('hello',), ('world',)]]
>>> tagger.predict(corpus=test)
[[('hello', 'intj'), ('world', 'noun')]]
```
As the example shows, a corpus is a list of sentences, where each sentence is a list of tuples: 1-tuples for untagged tokens and (word, tag) 2-tuples for tagged ones. Read the documentation of the methods in `src/tagger.py` to understand formatting, printing, and how the input arguments work.
You can also play around with several other methods of the classes `HiddenMarkovModel` and `HiddenMarkovModelTagger`, such as `viterbi_best_path`, `get_confusion_matrix`, et cetera.
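For example, a hedged sketch of what an evaluation run might look like. The test-file path and the exact signature of `get_confusion_matrix` are assumptions here, so check the docstrings in `src/tagger.py` before relying on them:

```python
>>> from src.tagger import HiddenMarkovModel
>>> from src.scrapper import parse_conllu_file
>>> train = parse_conllu_file(filepath="datasets/en_gum-ud-train.conllu")
>>> tagger = HiddenMarkovModel(corpus=train).train()
>>> # Assumed path, mirroring the naming of the train file above
>>> test = parse_conllu_file(filepath="datasets/en_gum-ud-test.conllu")
>>> # Assumption: get_confusion_matrix accepts a gold-tagged corpus,
>>> # like predict does; the real signature may differ.
>>> tagger.get_confusion_matrix(corpus=test)
```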
In this repo we provide not only the code to use your own POS tagger, but also a set of analyses performed on two datasets: exploratory data analysis, performance evaluation, and a bit of algorithm profiling and cost assessment. We recommend going through the analyses and findings in the following order:
- Exploratory Data Analysis - To check out the data we used to train and test our model. It is found inside the folder `eda/`. Re-running the notebooks is not recommended, since some of the plots and analyses can take a while to compute; the results are already available and visible in the notebooks themselves. We also recommend opening them in a notebook viewer, since GitHub does not render some of the interactive plots.
- The algorithm - The class `HiddenMarkovModelTagger` implements the POS tagger with the Viterbi algorithm (a minimal illustrative sketch follows this list). It is wrapped by the class `HiddenMarkovModel`, which takes corpus data and returns a `HiddenMarkovModelTagger` instance intended for general use. The code is inside the folder `src/`. There, you can also find scraping and visualization/plotting modules with different functionalities.
- Evaluation - Notebooks where you can check out the results produced by our tagger implementation, validate the code, etc. They are found inside the folder `evaluation/`. A guide to the notebooks it contains:
  - `model_testing.ipynb` $\rightarrow$ A very simple, lightweight analysis demonstrating that the tagger works on simple sentences, validating the consistency of the transition and emission matrices, etc.
  - `english_model_evaluation.ipynb` $\rightarrow$ Performance analysis of the tagger trained on the English dataset, along with some relevant metrics.
  - `catalan_model_evaluation.ipynb` $\rightarrow$ Performance analysis of the tagger trained on the Catalan dataset, along with some relevant metrics.
  - `cost_analysis.ipynb` $\rightarrow$ Computational cost of training and testing, plus carbon-footprint profiling.
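To make the Viterbi step concrete, here is a minimal, self-contained sketch of the algorithm over a toy HMM. The probabilities below are made up for illustration; they are not the ones `HiddenMarkovModel` estimates from a corpus, and the toy ignores smoothing and unknown words:

```python
# Minimal Viterbi sketch for an HMM POS tagger (toy probabilities only).
from math import log

tags = ["INTJ", "NOUN"]
# P(tag_i | tag_{i-1}); "<s>" is the start-of-sentence state.
transition = {
    ("<s>", "INTJ"): 0.6, ("<s>", "NOUN"): 0.4,
    ("INTJ", "INTJ"): 0.2, ("INTJ", "NOUN"): 0.8,
    ("NOUN", "INTJ"): 0.3, ("NOUN", "NOUN"): 0.7,
}
# P(word | tag)
emission = {
    ("INTJ", "hello"): 0.9, ("INTJ", "world"): 0.1,
    ("NOUN", "hello"): 0.2, ("NOUN", "world"): 0.8,
}

def viterbi(words):
    # trellis[i][tag] = (best log-probability of any tag path ending in
    #                    `tag` at position i, backpointer to previous tag)
    trellis = [{
        tag: (log(transition[("<s>", tag)]) + log(emission[(tag, words[0])]), None)
        for tag in tags
    }]
    for i in range(1, len(words)):
        column = {}
        for tag in tags:
            # Pick the predecessor that maximizes the accumulated score.
            best_prev, best_score = max(
                (
                    (prev,
                     trellis[i - 1][prev][0]
                     + log(transition[(prev, tag)])
                     + log(emission[(tag, words[i])]))
                    for prev in tags
                ),
                key=lambda pair: pair[1],
            )
            column[tag] = (best_score, best_prev)
        trellis.append(column)
    # Backtrack from the highest-scoring final tag.
    best_last = max(tags, key=lambda tag: trellis[-1][tag][0])
    path = [best_last]
    for i in range(len(words) - 1, 0, -1):
        path.append(trellis[i][path[-1]][1])
    return list(zip(words, reversed(path)))

print(viterbi(["hello", "world"]))
# -> [('hello', 'INTJ'), ('world', 'NOUN')]
```

Working in log-space avoids numerical underflow on long sentences, which is the standard trick in Viterbi implementations.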
- The data used has been extracted from the resources provided by the Universal Dependencies Project.
- For the analysis, two datasets from different languages have been used: English and Catalan.
- The English corpus is referred to as GUM, the Georgetown University Multilayer corpus. Its purpose is research on discourse models, and it therefore contains multiple text types and may include code-switching.
- GitHub repository available here
- Train dataset available here (8548 sentences)
- Test dataset available here (1096 sentences)