
# URL NLP

## Repo structure

```
.
├── bert_types_models                       # Code used for fine-tuning BERT-type models
│   ├── bert_multilingual.ipynb             # [DON'T USE FOR TRAINING] Notebook used to iterate on the training script <NotUpdated>
│   ├── requirements.txt                    # Requirements file to install dependencies
│   ├── train_bert.py                       # Main script used to train the BERT models (bert-multilingual/camembert/flaubert)
│   ├── utils.py                            # Utility script with the classes and functions that build and train the model
│   └── models_trained                      # Directory containing all trained models
│       ├── bertm_5epochs_dropout
│       ├── bertm_5epochs_dropout_concat
│       ├── bertm_5epochs_dropout_freezing_concat
│       ├── bertm_5epochs_nodropout
│       └── camembert_5epochs_nodropout
├── data                                    # Data folder, also the output folder for preprocessed data
├── inference
│   ├── docker-compose.yml                  # Docker Compose file to build the containers: service + app
│   ├── app                                 # Streamlit application for running inference
│   │   ├── app.py
│   │   ├── Dockerfile
│   │   └── requirements.txt
│   └── service                             # FastAPI service for doing inference
│       ├── Dockerfile
│       ├── main.py
│       ├── requirements.txt
│       ├── utils.py
│       └── models                          # Directory containing the fine-tuned BERT-type model used for inference
├── preprocessing                           # Code used for preprocessing
│   ├── preprocess.py                       # Script to run preprocessing
│   └── preprocess_urls.ipynb               # Notebook used to iterate on the preprocessing script
├── statistical_embedding_models            # Code used for training the statistical models
│   └── tf_idf.ipynb                        # Notebook for statistical modeling: TF-IDF with different classifiers
├── LICENSE
└── README.md
```

## Preprocess data

* Install requirements:

```bash
cd preprocessing
pip install -r requirements.txt
```

* Preprocess the data (a rough illustration of the kind of tokenization involved follows this list):

```bash
python preprocess.py
```
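The exact steps live in `preprocess.py`. As an illustration only, and not the script's actual logic, URL preprocessing for NLP typically strips the scheme and splits the URL into word-like tokens:

```python
import re
from urllib.parse import urlparse

def tokenize_url(url: str) -> str:
    """Turn a raw URL into a space-separated token string (illustrative only)."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    # Split host and path on non-alphanumeric characters (dots, slashes, dashes, ...)
    tokens = re.split(r"[^a-zA-Z0-9]+", parsed.netloc + " " + parsed.path)
    # Drop empty/one-character fragments and lowercase the rest
    return " ".join(t.lower() for t in tokens if len(t) > 1)

print(tokenize_url("https://www.some-shop.fr/chaussures/homme/baskets-123"))
# -> "www some shop fr chaussures homme baskets 123"
```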

## Training BERT-type models

* Install requirements:

```bash
cd bert_types_models
pip install -r requirements.txt
```

* Training: make sure to preprocess the data first. To change the model, edit the `PRE_TRAINED_MODEL_NAME` variable inside the script; the models used so far are `bert-base-multilingual-uncased`, `camembert-base`, and `flaubert/flaubert_base_cased`. The script exports the trained model to `models_trained/`, together with the `MultiLabelBinarizer` object needed in production, and reports scores during training (a sketch of reloading these artifacts follows the results table below).

```bash
python train_bert.py
```
* Model results:

| Model | Accuracy | Hamming loss | AUC | F1 score (macro) | F1 score (micro) | F1 score (weighted) |
|---|---|---|---|---|---|---|
| bertm_5epochs_dropout | 0.073055 | 0.00949833 | 0.611556 | 0.244681 | 0.51026 | 0.406705 |
| bertm_5epochs_dropout_concat | 0.187065 | 0.00833813 | 0.741633 | 0.528397 | 0.638434 | 0.608079 |
| bertm_5epochs_dropout_freezing_concat | 0.0251423 | 0.0113899 | 0.537301 | 0.093941 | 0.268745 | 0.197125 |
| bertm_5epochs_nodropout | 0.0770082 | 0.00946042 | 0.61398 | 0.25053 | 0.515028 | 0.411921 |
| camembert_5epochs_nodropout | 0.0477546 | 0.0102866 | 0.566295 | 0.134595 | 0.438133 | 0.308513 |
| flaubert_5epochs_nodropout | 0.000158128 | 0.012761 | 0.501113 | 0.00339176 | 0.0116508 | 0.00889494 |
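Once training has run, the exported artifacts can be reloaded to score new URLs. A minimal sketch, assuming the full model object was saved with `torch.save` and the `MultiLabelBinarizer` was pickled alongside it (the file names, the forward signature, and the 0.5 threshold are assumptions; check `models_trained/` and `utils.py` for the real ones):

```python
import pickle
import torch
from transformers import AutoTokenizer

MODEL_DIR = "models_trained/bertm_5epochs_dropout_concat"  # assumed layout

# MultiLabelBinarizer fitted during training (assumed to be pickled next to the model)
with open(f"{MODEL_DIR}/mlb.pkl", "rb") as f:
    mlb = pickle.load(f)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
model = torch.load(f"{MODEL_DIR}/model.bin", map_location="cpu")  # assumes the whole model object was saved
model.eval()

enc = tokenizer("www some shop fr chaussures homme",
                return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    logits = model(enc["input_ids"], enc["attention_mask"])  # forward signature per utils.py

# Multi-label output: threshold sigmoid probabilities, then map back to label names
preds = (torch.sigmoid(logits) > 0.5).int().numpy()
print(mlb.inverse_transform(preds))
```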

## Training statistical models

* The training procedure is explained in the notebook (`statistical_embedding_models/tf_idf.ipynb`); a minimal pipeline sketch follows the results table below.
* Model results:

| Model | Accuracy | Hamming loss | AUC | F1 score (macro) | F1 score (micro) | F1 score (weighted) |
|---|---|---|---|---|---|---|
| LR | 0.121936 | 0.00891236 | 0.653643 | 0.396298 | 0.538858 | 0.49323 |
| LSVC | 0.171484 | 0.00831222 | 0.70922 | 0.503603 | 0.607955 | 0.577387 |
| XGB | 0.151981 | 0.00873008 | 0.702646 | 0.488 | 0.584884 | 0.555727 |
| MLP | 0.101116 | 0.00913117 | 0.635244 | 0.300006 | 0.543177 | 0.457907 |
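The notebook is the reference for the full comparison; the core idea is a TF-IDF representation of the preprocessed URLs fed to a multi-label classifier. A minimal sketch with the LSVC variant (the toy data, vectorizer settings, and one-vs-rest wrapping are assumptions, not the notebook's exact configuration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy data standing in for the preprocessed URLs from data/
urls = ["www some shop fr chaussures homme", "news site fr politique economie"]
labels = [["shopping"], ["news", "politics"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # binary indicator matrix, one column per label

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # word unigrams + bigrams
    OneVsRestClassifier(LinearSVC()),     # one binary SVM per label
)
clf.fit(urls, Y)

pred = clf.predict(["boutique chaussures femme fr"])
print(mlb.inverse_transform(pred))  # e.g. [('shopping',)]
```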

## Inference

* Run:

```bash
cd inference
docker-compose up --build
```

* Note: you may want to use my trained model; get it from here and put both files in `inference/service/models/`. A sketch of calling the service directly follows this list.
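Once the containers are up, the Streamlit app is the intended front end, but the FastAPI service can also be queried directly. A minimal sketch, assuming the service is exposed on `localhost:8000` with a `/predict` endpoint taking a JSON body (the route, port, and payload shape are assumptions; check `inference/service/main.py` and `docker-compose.yml` for the real ones):

```python
import requests

# Endpoint and payload shape are assumptions; see inference/service/main.py.
resp = requests.post(
    "http://localhost:8000/predict",
    json={"url": "https://www.some-shop.fr/chaussures/homme"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # predicted categories for the URL
```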

## Feedback and points to check
