<a href="https://colab.research.google.com/github/victor-roris/mediumseries/blob/master/NLP/NLPModel_MultiLabel_XGBoost_TFIDFVectorization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Model with XGBoost

In this notebook we train an XGBoost model to predict the category of a text. The text is vectorized with the TF-IDF approach.

XGBoost: https://xgboost.readthedocs.io/en/latest/

Sklearn TFIDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Notebook adapted from: https://github.com/makcedward/nlp/blob/master/sample/nlp-model_interpretation.ipynb

## Fetch data

We use the 20 newsgroups dataset, which scikit-learn can download directly.

In [0]:
import pandas as pd
import numpy as np 

In [2]:
from sklearn.datasets import fetch_20newsgroups
train_raw_df = fetch_20newsgroups(subset='train')
test_raw_df = fetch_20newsgroups(subset='test')

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [3]:
print(f'Number of raw training examples: {len(train_raw_df.data)}')
print(f'Number of raw test examples: {len(test_raw_df.data)}')

Number of raw training examples: 11314
Number of raw test examples: 7532


In [4]:
category_names = np.unique(np.array(train_raw_df.target_names))
print(f'Number of different categories : {len(category_names)}')
print(f'Category list: {category_names}')

Number of different categories : 20
Category list: ['alt.atheism' 'comp.graphics' 'comp.os.ms-windows.misc'
 'comp.sys.ibm.pc.hardware' 'comp.sys.mac.hardware' 'comp.windows.x'
 'misc.forsale' 'rec.autos' 'rec.motorcycles' 'rec.sport.baseball'
 'rec.sport.hockey' 'sci.crypt' 'sci.electronics' 'sci.med' 'sci.space'
 'soc.religion.christian' 'talk.politics.guns' 'talk.politics.mideast'
 'talk.politics.misc' 'talk.religion.misc']


In [5]:
print('Example of entry:')
# Look up the category name through the entry's own label index
print(f'\t - LABEL : {train_raw_df.target[0]} - {train_raw_df.target_names[train_raw_df.target[0]]}')
print(f'\t - {train_raw_df.data[0]}')

Example of entry:
	 - LABEL : 7 - rec.autos
	 - From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----
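
Each integer in `target` is an index into `target_names`, so the category name of an entry is found by indexing the name list with that entry's label. A minimal sketch of the lookup, using hypothetical toy values (`toy_target_names` and `toy_target` are stand-ins, not the real dataset):

```python
import numpy as np

# Hypothetical stand-ins for train_raw_df.target_names and train_raw_df.target
toy_target_names = ["alt.atheism", "comp.graphics", "rec.autos"]
toy_target = np.array([2, 0, 2, 1])

# Name of the first entry: index the name list with the entry's label
first_name = toy_target_names[toy_target[0]]
print(first_name)

# Names of all entries at once, via NumPy integer-array indexing
all_names = np.array(toy_target_names)[toy_target]
print(all_names)
```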







## Prepare data for the model

In [0]:
x_train = train_raw_df.data
y_train = train_raw_df.target

x_test = test_raw_df.data
y_test = test_raw_df.target

## Model training

The model cannot take raw text as input; it needs a numeric vectorization of the text. We therefore build a pipeline that combines the vectorizer and the model in a single object.

The text is vectorized with TF-IDF encoding.
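
To illustrate how a pipeline lets raw strings flow through a vectorizer into a classifier, here is a minimal sketch on a hypothetical four-document corpus. It assumes scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier` so that it runs without extra dependencies; the toy texts, labels and hyperparameters are illustrative only.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: label 0 = cars, label 1 = space
toy_texts = [
    "the engine of this car is loud",
    "new tires and brakes for my car",
    "the rocket reached orbit today",
    "the satellite left orbit last night",
]
toy_labels = [0, 0, 1, 1]

# The pipeline first vectorizes the raw strings, then passes the
# resulting sparse matrix to the classifier (a stand-in for XGBClassifier)
toy_pipeline = make_pipeline(
    TfidfVectorizer(),
    GradientBoostingClassifier(n_estimators=10, random_state=0),
)
toy_pipeline.fit(toy_texts, toy_labels)

# Raw text goes in; predicted integer labels come out
toy_predictions = toy_pipeline.predict(["my car needs new brakes"])
print(toy_predictions)
```

Because the vectorizer lives inside the pipeline, the same object can be called on unseen raw text without a separate `transform` step.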

In [0]:
# TF-IDF vectorization of the text
from sklearn.feature_extraction.text import TfidfVectorizer

# Pipeline execution to combine vectorization and model execution
from sklearn.pipeline import make_pipeline 

# Definition of generic models
from xgboost import XGBClassifier  # XGBoost

In [8]:
# Fit the vectorizer on the training documents to encode the text as a TF-IDF representation
# (pipeline.fit refits the vectorizer later, so this call is optional)
vec = TfidfVectorizer()
vec.fit(x_train)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)
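
To make the effect of the default settings concrete, here is a minimal sketch that fits a `TfidfVectorizer` on a hypothetical three-document corpus and inspects the result. The corpus and the `toy_` names are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
toy_docs = [
    "the car is red",
    "the car is fast",
    "the rocket is fast",
]

toy_vec = TfidfVectorizer()
toy_matrix = toy_vec.fit_transform(toy_docs)

# One row per document, one column per vocabulary term
print(toy_matrix.shape)  # (3, 6)

# Vocabulary terms in column order, paired with their learned IDF weights
terms = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
idf = dict(zip(terms, toy_vec.idf_))
print(idf)

# With norm='l2' (the default), every document row has unit length
row_norms = np.sqrt(np.asarray(toy_matrix.multiply(toy_matrix).sum(axis=1))).ravel()
print(row_norms)
```

Terms that occur in every document, such as "the", receive the lowest IDF weight, while terms that occur in a single document, such as "red", receive the highest.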

*XGBClassifier* : Implementation of the scikit-learn API for XGBoost classification.

https://xgboost.readthedocs.io/en/latest/python/python_api.html?highlight=xgbclassifier#xgboost.XGBClassifier


In [0]:
# Create the model
estimator = XGBClassifier()

In [0]:
# Create the pipeline with the vectorizer and the model
pipeline = make_pipeline(vec, estimator)

In [11]:
# Train the pipeline (vectorizer and model)
pipeline.fit(x_train, y_train)

Pipeline(memory=None,
         steps=[('tfidfvectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token...
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, gamma=0, learning_rate=0.1,
                               max_delta_step=0, max_depth=3,
                        

## Model evaluation
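
By default `classification_report` labels each row with the integer class index, as in the report in this section. A minimal sketch on hypothetical toy labels shows the optional `target_names` argument, which prints class names instead, and `output_dict=True`, which returns the metrics as a dictionary. The toy labels and names are illustrative assumptions.

```python
from sklearn.metrics import classification_report

# Hypothetical toy labels for a two-class problem
toy_names = ["rec.autos", "sci.space"]
toy_true = [0, 0, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 0]

# Text report with class names instead of integer indices
report = classification_report(toy_true, toy_pred, target_names=toy_names)
print(report)

# Same metrics as a nested dictionary, convenient for further processing
report_dict = classification_report(
    toy_true, toy_pred, target_names=toy_names, output_dict=True
)
print(report_dict["rec.autos"]["precision"])  # 0.5
```

In the notebook itself, the names could be passed as `target_names=test_raw_df.target_names`, assuming the label order matches the dataset's `target_names` list.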

In [12]:
from sklearn.metrics import classification_report

predictions = pipeline.predict(x_test)

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.76      0.62      0.68       319
           1       0.65      0.70      0.67       389
           2       0.70      0.71      0.71       394
           3       0.61      0.69      0.65       392
           4       0.76      0.77      0.76       385
           5       0.78      0.64      0.71       395
           6       0.75      0.86      0.80       390
           7       0.85      0.74      0.79       396
           8       0.88      0.85      0.87       398
           9       0.82      0.85      0.83       397
          10       0.89      0.86      0.88       399
          11       0.88      0.80      0.84       396
          12       0.42      0.60      0.49       393
          13       0.84      0.72      0.78       396
          14       0.83      0.84      0.83       394
          15       0.81      0.89      0.85       398
          16       0.63      0.77      0.69       364
          17       0.96    