# Using `bouquet` for NLP Classification with XGBoost
This notebook demonstrates how to use the `bouquet` framework to build a classifier that predicts the newsgroup of a news clipping from its text. <br>
Dataset source: [20 Newsgroups home page](http://qwone.com/~jason/20Newsgroups/)

In [1]:
import sys
sys.path.append("../")

import os
import tarfile

import numpy as np
import pandas as pd

from src.inference_model import XGBoostModel

In [2]:
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()
# assert len(logger.handlers) == 1
# handler = logger.handlers[0]
# handler.setLevel(logging.INFO)

## Dataset creation
Extract the archive and collate the text files of each split (train and test) into a DataFrame.

In [3]:
# unzip newsgroups dataset
local_origin = "../data/20news-bydate.tar.gz"
local_destination = "../data/"

with tarfile.open(local_origin, "r:gz") as tar:
    tar.extractall(local_destination)

In [27]:
# read in newsgroups dataset
def collate_text_files(data_path: str) -> pd.DataFrame:
    """
    Reads all text files located in subfolders of a directory and stores them
    into a dataframe.
    """
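    # walk every subfolder and collect the full path of each file
    files = [os.path.join(dirpath, f)
             for dirpath, _, filenames in os.walk(data_path)
             for f in filenames]
    texts = []

    for filepath in files:
        try:
            # uses the platform default encoding
            with open(filepath, "r") as f:
                texts.append(f.read())
        except UnicodeDecodeError:
            # mark undecodable files with an empty string so they can be dropped later
            texts.append("")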

    df = pd.DataFrame({
        "file_path": files,
        "text": texts
    })
    print(df.shape)

    return df

In [28]:
train_df = collate_text_files("../data/20news-bydate-train")

(11314, 2)


In [29]:
# check how many extractions failed
print(train_df.loc[train_df["text"] == ""].shape)

# and drop those records
train_df = train_df.loc[train_df["text"] != ""].reset_index(drop=True)
print(train_df.shape)

(44, 2)
(11270, 2)
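
The 44 empty records come from files that raise `UnicodeDecodeError` under the platform's default encoding. As an alternative to dropping them, the sketch below reads a file as Latin-1, which maps every byte to a character and so never fails to decode. The helper name `read_text_lenient` is hypothetical and not part of this notebook or `bouquet`, and whether Latin-1 is the right encoding for every post is an assumption.

```python
import os
import tempfile


def read_text_lenient(filepath: str, encoding: str = "latin-1") -> str:
    """Read a text file with an explicit encoding (hypothetical helper)."""
    # latin-1 can decode any byte sequence, so no UnicodeDecodeError is raised
    with open(filepath, "r", encoding=encoding) as f:
        return f.read()


# Small demonstration on a temporary file containing a byte that is invalid UTF-8
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "wb") as f:
        f.write(b"caf\xe9")
    sample_text = read_text_lenient(sample_path)

print(sample_text)
```

Swapping this into `collate_text_files` would change the record counts shown in this notebook, since no file would be dropped.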


In [30]:
# do the same for test files
test_df = collate_text_files("../data/20news-bydate-test")
test_df = test_df.loc[test_df["text"] != ""].reset_index(drop=True)
print(test_df.shape)

(7532, 2)
(7503, 2)


In [33]:
# create target class column
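# the newsgroup is the name of the folder that contains each file
train_df["newsgroup"] = train_df["file_path"].apply(lambda x: os.path.basename(os.path.dirname(x)))
test_df["newsgroup"] = test_df["file_path"].apply(lambda x: os.path.basename(os.path.dirname(x)))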

In [34]:
# save datasets for easier loading next time
train_df.to_csv("../data/20news_bydate_train.csv", index=False)
test_df.to_csv("../data/20news_bydate_test.csv", index=False)

### Quick EDA

In [35]:
train_df.groupby("newsgroup").size()

newsgroup
alt.atheism                 475
comp.graphics               579
comp.os.ms-windows.misc     591
comp.sys.ibm.pc.hardware    588
comp.sys.mac.hardware       567
comp.windows.x              590
misc.forsale                583
rec.autos                   593
rec.motorcycles             598
rec.sport.baseball          591
rec.sport.hockey            600
sci.crypt                   594
sci.electronics             588
sci.med                     592
sci.space                   593
soc.religion.christian      599
talk.politics.guns          546
talk.politics.mideast       564
talk.politics.misc          465
talk.religion.misc          374
dtype: int64

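The classes are roughly balanced: most have between 546 and 600 training documents, with `talk.religion.misc` (374), `talk.politics.misc` (465), and `alt.atheism` (475) somewhat smaller.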
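
One way to quantify the balance is the ratio between the largest and smallest class. The sketch below uses the counts copied from the output above; the variable names are illustrative only.

```python
# Training counts copied from the groupby output above
class_counts = {
    "alt.atheism": 475,
    "comp.graphics": 579,
    "comp.os.ms-windows.misc": 591,
    "comp.sys.ibm.pc.hardware": 588,
    "comp.sys.mac.hardware": 567,
    "comp.windows.x": 590,
    "misc.forsale": 583,
    "rec.autos": 593,
    "rec.motorcycles": 598,
    "rec.sport.baseball": 591,
    "rec.sport.hockey": 600,
    "sci.crypt": 594,
    "sci.electronics": 588,
    "sci.med": 592,
    "sci.space": 593,
    "soc.religion.christian": 599,
    "talk.politics.guns": 546,
    "talk.politics.mideast": 564,
    "talk.politics.misc": 465,
    "talk.religion.misc": 374,
}

total = sum(class_counts.values())
largest = max(class_counts, key=class_counts.get)
smallest = min(class_counts, key=class_counts.get)

# ratio of the most frequent class to the least frequent class
imbalance_ratio = class_counts[largest] / class_counts[smallest]

print(f"total documents: {total}")
print(f"largest class:   {largest} ({class_counts[largest]})")
print(f"smallest class:  {smallest} ({class_counts[smallest]})")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio this close to 1 suggests that class weighting or resampling is unlikely to be the first thing to try.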

## XGBoost Modeling

In [3]:
import nltk
nltk.download("punkt_tab")
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

[nltk_data] Downloading package punkt_tab to /Users/sarah/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package punkt to /Users/sarah/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /Users/sarah/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /Users/sarah/nltk_data...


True

In [6]:
model_kwargs = {
    "seed": 13,
    "learning_rate": 0.1,
    "gamma": 0,
    "max_depth": 5,
    "subsample": 0.7,
    "colsample_bytree": 0.7,
    "lambda": 1,
    "alpha": 0,
    "objective": "multi:softmax",
    "multi_strategy": "one_output_per_tree",
    "eval_metric": ["merror", "mlogloss"]
}

In [7]:
xgb_model = XGBoostModel(data_path="~/Documents/mission-control/bouquet/data/20news-bydate",
                         target="newsgroup",
                         xgb_kwargs=model_kwargs,
                         max_features=10000,
                         save_path="~/Documents/mission-control/bouquet/artifacts/20news-bydate")

In [8]:
xgb_model.run()

INFO:inference_model:Cleaning text
INFO:inference_model:Vectorizing text
INFO:inference_model:Text cleaned and vectorized in 46.02586007118225 seconds
INFO:inference_model:Encoding class labels
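
The label-encoding step logged above is needed because XGBoost's multiclass objectives expect integer labels in the range 0 to `num_class - 1`. How `XGBoostModel` does this internally is not shown in this notebook; the following is a minimal standard-library sketch, and `encode_labels` is a hypothetical helper.

```python
def encode_labels(labels):
    """Map class names to integers 0..n_classes-1 (hypothetical helper)."""
    # sort so the mapping is deterministic across runs
    classes = sorted(set(labels))
    class_to_index = {name: i for i, name in enumerate(classes)}
    encoded = [class_to_index[name] for name in labels]
    return encoded, classes


# Toy labels for illustration
example_labels = ["sci.space", "alt.atheism", "sci.space", "rec.autos"]
encoded, classes = encode_labels(example_labels)

print(encoded)
print(classes)
```

Keeping `classes` around lets integer predictions be mapped back to newsgroup names with `classes[prediction]`.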


The results are reasonable (macro F1 of 0.75), but there is clear room for improvement.
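
For reference, macro F1 is the unweighted mean of the per-class F1 scores, so each of the 20 newsgroups counts equally regardless of its size. The sketch below computes it from scratch on toy labels; the data is made up for illustration and is not output from this notebook's model.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for cls in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels, not results from this notebook
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
toy_score = macro_f1(y_true, y_pred)

print(f"macro F1: {toy_score:.4f}")
```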