
# URL NLP

## Repo structure

```
.
├── bert_types_models                       # Code used for fine-tuning BERT-type models
│   ├── bert_multilingual.ipynb             # [DON'T USE FOR TRAINING] Notebook used to iterate on the training script <NotUpdated>
│   ├── requirements.txt                    # Requirements file to install dependencies
│   ├── train_bert.py                       # Main script used to train the BERT models (bert-multilingual/camembert/flaubert)
│   ├── utils.py                            # Utility script with the classes and functions that build and train the model
│   └── models_trained                      # Directory containing all trained models
│       ├── bertm_5epochs_dropout
│       ├── bertm_5epochs_dropout_concat
│       ├── bertm_5epochs_dropout_freezing_concat
│       ├── bertm_5epochs_nodropout
│       └── camembert_5epochs_nodropout
├── data                                    # Data folder, also the output folder for preprocessed data
├── inference
│   ├── docker-compose.yml                  # Docker Compose file to build the containers: service + app
│   ├── app                                 # Streamlit application for running inference
│   │   ├── app.py
│   │   ├── Dockerfile
│   │   └── requirements.txt
│   └── service                             # FastAPI service for doing inference
│       ├── Dockerfile
│       ├── main.py
│       ├── requirements.txt
│       ├── utils.py
│       └── models                          # Directory containing the fine-tuned BERT-type model used for inference
├── preprocessing                           # Code used for preprocessing
│   ├── preprocess.py                       # Script to run preprocessing
│   └── preprocess_urls.ipynb               # Notebook used to iterate on the preprocessing script
├── statistical_embedding_models            # Code used for training the statistical models
│   └── tf_idf.ipynb                        # Notebook for statistical modeling: TF-IDF with different classifiers
├── LICENSE
└── README.md
```

## Preprocess data

* Install requirements:

```bash
cd preprocessing
pip install -r requirements.txt
```

* Preprocess the data (a rough illustration of the kind of tokenization involved follows this list):

```bash
python preprocess.py
```
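The exact steps live in `preprocess.py`. As an illustration only, and not the script's actual logic, URL preprocessing for NLP typically strips the scheme and splits the URL into word-like tokens:

```python
import re
from urllib.parse import urlparse

def tokenize_url(url: str) -> str:
    """Turn a raw URL into a space-separated token string (illustrative only)."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    # Split host and path on non-alphanumeric characters (dots, slashes, dashes, ...)
    tokens = re.split(r"[^a-zA-Z0-9]+", parsed.netloc + " " + parsed.path)
    # Drop empty/one-character fragments and lowercase the rest
    return " ".join(t.lower() for t in tokens if len(t) > 1)

print(tokenize_url("https://www.some-shop.fr/chaussures/homme/baskets-123"))
# -> "www some shop fr chaussures homme baskets 123"
```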

## Training BERT-type models

* Install requirements:

```bash
cd bert_types_models
pip install -r requirements.txt
```

* Training: make sure to preprocess the data first. To change the model, edit the `PRE_TRAINED_MODEL_NAME` variable inside the script; the models used so far are `bert-base-multilingual-uncased`, `camembert-base`, and `flaubert/flaubert_base_cased`. The script exports the trained model to `models_trained/`, together with the `MultiLabelBinarizer` object needed in production, and reports scores during training (a sketch of reloading these artifacts follows the results table below).

```bash
python train_bert.py
```
* Model results:

| Model | Accuracy | Hamming loss | AUC | F1 score (macro) | F1 score (micro) | F1 score (weighted) |
|---|---|---|---|---|---|---|
| bertm_5epochs_dropout | 0.073055 | 0.00949833 | 0.611556 | 0.244681 | 0.51026 | 0.406705 |
| bertm_5epochs_dropout_concat | 0.187065 | 0.00833813 | 0.741633 | 0.528397 | 0.638434 | 0.608079 |
| bertm_5epochs_dropout_freezing_concat | 0.0251423 | 0.0113899 | 0.537301 | 0.093941 | 0.268745 | 0.197125 |
| bertm_5epochs_nodropout | 0.0770082 | 0.00946042 | 0.61398 | 0.25053 | 0.515028 | 0.411921 |
| camembert_5epochs_nodropout | 0.0477546 | 0.0102866 | 0.566295 | 0.134595 | 0.438133 | 0.308513 |
| flaubert_5epochs_nodropout | 0.000158128 | 0.012761 | 0.501113 | 0.00339176 | 0.0116508 | 0.00889494 |
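Once training has run, the exported artifacts can be reloaded to score new URLs. A minimal sketch, assuming the full model object was saved with `torch.save` and the `MultiLabelBinarizer` was pickled alongside it (the file names, the forward signature, and the 0.5 threshold are assumptions; check `models_trained/` and `utils.py` for the real ones):

```python
import pickle
import torch
from transformers import AutoTokenizer

MODEL_DIR = "models_trained/bertm_5epochs_dropout_concat"  # assumed layout

# MultiLabelBinarizer fitted during training (assumed to be pickled next to the model)
with open(f"{MODEL_DIR}/mlb.pkl", "rb") as f:
    mlb = pickle.load(f)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
model = torch.load(f"{MODEL_DIR}/model.bin", map_location="cpu")  # assumes the whole model object was saved
model.eval()

enc = tokenizer("www some shop fr chaussures homme",
                return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    logits = model(enc["input_ids"], enc["attention_mask"])  # forward signature per utils.py

# Multi-label output: threshold sigmoid probabilities, then map back to label names
preds = (torch.sigmoid(logits) > 0.5).int().numpy()
print(mlb.inverse_transform(preds))
```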

## Training statistical models

* The training procedure is explained in the notebook (`statistical_embedding_models/tf_idf.ipynb`); a minimal pipeline sketch follows the results table below.
* Model results:

| Model | Accuracy | Hamming loss | AUC | F1 score (macro) | F1 score (micro) | F1 score (weighted) |
|---|---|---|---|---|---|---|
| LR | 0.121936 | 0.00891236 | 0.653643 | 0.396298 | 0.538858 | 0.49323 |
| LSVC | 0.171484 | 0.00831222 | 0.70922 | 0.503603 | 0.607955 | 0.577387 |
| XGB | 0.151981 | 0.00873008 | 0.702646 | 0.488 | 0.584884 | 0.555727 |
| MLP | 0.101116 | 0.00913117 | 0.635244 | 0.300006 | 0.543177 | 0.457907 |
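The notebook is the reference for the full comparison; the core idea is a TF-IDF representation of the preprocessed URLs fed to a multi-label classifier. A minimal sketch with the LSVC variant (the toy data, vectorizer settings, and one-vs-rest wrapping are assumptions, not the notebook's exact configuration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy data standing in for the preprocessed URLs from data/
urls = ["www some shop fr chaussures homme", "news site fr politique economie"]
labels = [["shopping"], ["news", "politics"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # binary indicator matrix, one column per label

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # word unigrams + bigrams
    OneVsRestClassifier(LinearSVC()),     # one binary SVM per label
)
clf.fit(urls, Y)

pred = clf.predict(["boutique chaussures femme fr"])
print(mlb.inverse_transform(pred))  # e.g. [('shopping',)]
```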

## Inference

* Run:

```bash
cd inference
docker-compose up --build
```

* Note: you may want to use my trained model; get it from here and put both files in `inference/service/models/`. A sketch of calling the service directly follows this list.
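Once the containers are up, the Streamlit app is the intended front end, but the FastAPI service can also be queried directly. A minimal sketch, assuming the service is exposed on `localhost:8000` with a `/predict` endpoint taking a JSON body (the route, port, and payload shape are assumptions; check `inference/service/main.py` and `docker-compose.yml` for the real ones):

```python
import requests

# Endpoint and payload shape are assumptions; see inference/service/main.py.
resp = requests.post(
    "http://localhost:8000/predict",
    json={"url": "https://www.some-shop.fr/chaussures/homme"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # predicted categories for the URL
```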

## Feedback and points to check
