In [1]:
import pandas as pd
import xgboost as xgb

Plan of the model:

1. tokenize the input texts into words, collect the vocabulary across the whole dataset, and build a count-based encoding as the feature vector
2. fit a model for a multi-class classification problem
3. extract feature importances from the model and pick the top 4 features from the feature vector (global feature importance); alternatively, pick the four most important features for a single prediction (local feature importance)

Questions:

1. how to fight the class imbalance
2. how to deal with OOV words
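
Two possible starting points for these questions, sketched on invented toy data. For imbalance, scikit-learn can reweight classes inversely to their frequency (`class_weight="balanced"`). For OOV words, a word-level `CountVectorizer` silently drops anything it has not seen, whereas character n-grams (`analyzer="char_wb"`) let an unseen or misspelled word share features with known ones. Whether either option helps on this dataset is untested here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.utils.class_weight import compute_class_weight

# Toy data, invented for illustration only.
titles = [
    "cybersecurity engineer",
    "network engineer",
    "software engineer",
    "systems engineer",
    "retail sales consultant",
]
labels = np.array(["A", "A", "A", "A", "B"])

# --- Class imbalance: weights are n_samples / (n_classes * class_count) ---
classes = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
print(dict(zip(classes, weights)))

# The same weighting can be applied inside the classifier.
word_vec = CountVectorizer().fit(titles)
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(word_vec.transform(titles), labels)

# --- OOV: a misspelled word is unknown to the word-level vocabulary ---
oov_word_row = word_vec.transform(["cyberecurity"])
print("word-level non-zeros:", oov_word_row.nnz)

# Character n-grams inside word boundaries still overlap with known words.
char_vec = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit(titles)
oov_char_row = char_vec.transform(["cyberecurity"])
print("char-level non-zeros:", oov_char_row.nnz)
```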

A nice tutorial to consult: https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py

And another one using the same 20Newsgroups dataset: https://reslan-tinawi.github.io/2020/05/26/text-classification-using-sklearn-and-nltk.html
And the code: https://github.com/Reslan-Tinawi/20-newsgroups-Text-Classification/blob/master/Text%20Classification.ipynb

#### Text vectorization

1. Using CountVectorizer from scikit-learn, I try:
  * raw word counts
  * one-hot
  * 2-grams

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html 
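
The three variants map onto `CountVectorizer` parameters roughly like this. The two titles are made up for illustration. Note that `ngram_range=(1, 2)` keeps unigrams alongside 2-grams, which is a choice made for this sketch; `ngram_range=(2, 2)` would keep 2-grams only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy titles, invented for illustration only.
titles = ["senior data analyst", "data data engineer"]

raw = CountVectorizer()                       # raw word counts
onehot = CountVectorizer(binary=True)         # presence/absence (one-hot style)
bigram = CountVectorizer(ngram_range=(1, 2))  # unigrams plus 2-grams

X_raw = raw.fit_transform(titles)
X_onehot = onehot.fit_transform(titles)
X_bigram = bigram.fit_transform(titles)

# "data" occurs twice in the second title: counted in raw, clipped in one-hot.
data_col = raw.vocabulary_["data"]
print(X_raw[1, data_col], X_onehot[1, onehot.vocabulary_["data"]])

# The 2-gram vocabulary contains word pairs such as "data engineer".
print(sorted(bigram.vocabulary_))
```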

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

train_df = pd.read_csv('../data/train_df.csv')
train_df[['Title']]

Unnamed: 0,Title
0,senior program analyst navy strike fighter sen...
1,senior intelligence analyst iii job
2,retail wireless sales consultant part
3,test automation engineer w selenium and gerkin...
4,public sector portfolio management senior asso...
...,...
12093,tier help desk technical support rep
12094,road technician mechanic
12095,health care policy editor
12096,cyberecurity engineer senior


In [10]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_df.Title)

In [11]:
X_train

<12098x4310 sparse matrix of type '<class 'numpy.int64'>'
	with 52588 stored elements in Compressed Sparse Row format>

In [9]:
feature_names = vectorizer.get_feature_names_out()
feature_names

array(['aamf', 'abap', 'abc', ..., 'zone', 'zoning', 'zos'], dtype=object)

In [13]:
y_train = train_df.Code
y_train

0        15-1121.00
1        33-3021.06
2        41-2031.00
3        15-1121.00
4        11-1021.00
            ...    
12093    33-3021.06
12094    49-3023.02
12095    41-2031.00
12096    15-1122.00
12097    15-1122.00
Name: Code, Length: 12098, dtype: object

In [15]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(X_train, y_train)

In [16]:
test_df = pd.read_csv('../data/test_df.csv')
X_test = vectorizer.transform(test_df.Title)
y_test = test_df.Code

pred = clf.predict(X_test)

In [18]:
from sklearn import metrics

score = metrics.accuracy_score(y_test, pred)
print(f"accuracy:   {score:0.3f}")

accuracy:   0.431
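
Since class imbalance is an open question for this data, plain accuracy may hide poor performance on rare classes. The sketch below, on hand-made labels rather than the real predictions, shows how `balanced_accuracy_score` and macro-averaged F1 from `sklearn.metrics` diverge from accuracy when a minority class is partly missed.

```python
from sklearn import metrics

# Hand-made labels for illustration: class "B" is rare and half missed.
y_true = ["A", "A", "A", "A", "B", "B"]
y_pred = ["A", "A", "A", "A", "A", "B"]

acc = metrics.accuracy_score(y_true, y_pred)
bal_acc = metrics.balanced_accuracy_score(y_true, y_pred)  # mean per-class recall
macro_f1 = metrics.f1_score(y_true, y_pred, average="macro")

print(f"accuracy:          {acc:0.3f}")
print(f"balanced accuracy: {bal_acc:0.3f}")
print(f"macro F1:          {macro_f1:0.3f}")

# Per-class precision, recall, and F1 in one table.
print(metrics.classification_report(y_true, y_pred))
```

On the real data the same calls would take `y_test` and `pred` in place of the toy lists.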
