
# Commodity Code Prediction

* In a large enterprise procurement system, users manually assign commodity codes to items based on free-text descriptions and limited structured attributes such as item type and lifecycle phase.

* Manual classification is slow, inconsistent, and error-prone, leading to downstream issues in reporting, compliance, and supplier analytics.

* The goal is to build a machine learning model that automatically predicts the correct commodity code by jointly learning from item descriptions (text) and structured item attributes, enabling faster onboarding and improved data quality.

In [None]:
import pandas as pd
import numpy as np

np.random.seed(42)

n_samples = 6000

descriptions = [
    'steel bolt hex head',
    'plastic container food grade',
    'electrical wiring copper',
    'industrial lubricant oil',
    'packaging cardboard box',
    'precision medical instrument',
    'chemical solvent liquid'
]

data = {
    'item_desc': np.random.choice(descriptions, n_samples),
    'item_type': np.random.choice(
                ['raw_material', 'component', 'finished_good'],
                n_samples),
    'lifecycle_phase': np.random.choice(
                ['Design', 'Production', 'Obsolete'],
                n_samples),
    'is_custom_part': np.random.choice([0, 1], n_samples, p=[0.7, 0.3]),
    'avg_unit_cost': np.random.uniform(5, 500, n_samples),
    'commodity_code': np.random.choice(
                ['METALS', 'PLASTICS', 'ELECTRICAL', 'CHEMICALS', 'PACKAGING'],
                n_samples)
}

df = pd.DataFrame(data=data)
df.head()

| | item_desc | item_type | lifecycle_phase | is_custom_part | avg_unit_cost | commodity_code |
|---|---|---|---|---|---|---|
| 0 | chemical solvent liquid | component | Obsolete | 0 | 324.066553 | ELECTRICAL |
| 1 | industrial lubricant oil | finished_good | Design | 1 | 73.020672 | PLASTICS |
| 2 | packaging cardboard box | component | Production | 0 | 37.413284 | METALS |
| 3 | chemical solvent liquid | finished_good | Production | 0 | 436.413297 | CHEMICALS |
| 4 | electrical wiring copper | raw_material | Obsolete | 0 | 366.713777 | ELECTRICAL |


# Encode Target

In [None]:
# columns= already names the axis, so axis=1 is redundant
X = df.drop(columns=['commodity_code'])
y = df['commodity_code']

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(df['commodity_code'])
y[:2]

array([1, 4])

In [None]:
type(y)

numpy.ndarray

In [None]:
import os

# Set the log level before importing TensorFlow;
# setting it after the import may have no effect on the C++ logs.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import tensorflow as tf

# convert y into one_hot numpy array
y = tf.keras.utils.to_categorical(y)
y[:2]

array([[0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.]], dtype=float32)
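
The two encoding steps above can be reproduced by hand, which makes the index-to-class mapping explicit. The sketch below assumes, as LabelEncoder does, that classes are numbered in sorted order; the variable names are illustrative and not part of the notebook.

```python
import numpy as np

# The five commodity codes from the synthetic data, in sorted order
# (LabelEncoder also numbers classes in sorted order).
classes = np.array(sorted(['METALS', 'PLASTICS', 'ELECTRICAL', 'CHEMICALS', 'PACKAGING']))

# Labels of the first two rows shown in df.head()
labels = np.array(['ELECTRICAL', 'PLASTICS'])

# Integer encoding: position of each label in the sorted class list
y_int = np.searchsorted(classes, labels)

# One-hot encoding: pick rows of the identity matrix
y_one_hot = np.eye(len(classes), dtype=np.float32)[y_int]

# Decoding goes the other way: argmax, then index into the class list
decoded = classes[y_one_hot.argmax(axis=1)]

print(y_int)
print(y_one_hot)
print(decoded)
```

With the fitted encoder, le.inverse_transform(pred.argmax(axis=1)) should perform the same decoding on model predictions.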

# Train / Test split


In [None]:
from sklearn.model_selection import train_test_split
import numpy as np

X_train, X_test, y_train, y_test = train_test_split(X,
                                                   y,
                                                   test_size=0.2,
                                                   random_state=42,
                                                   stratify=y)

In [None]:
X_train.shape, y_train.shape

((4800, 5), (4800, 5))

In [None]:
X_test.shape, y_test.shape

((1200, 5), (1200, 5))

# Text processing (Tokenizer)

In [None]:
max_words = 200
max_len = 20

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=max_words,
                                                 oov_token="<oov>")
tokenizer.fit_on_texts(X_train['item_desc'])

* max_words = 200: This caps the vocabulary size. The tokenizer only emits its own index for the most frequent words in the training descriptions; any rarer word is mapped to the out-of-vocabulary token instead. The synthetic descriptions here contain only 23 distinct words, so the cap is never reached in this notebook.

* max_len = 20: This variable defines the maximum length of a sentence (sequence). It is not used in the Tokenizer call above; it is used in the next step (padding) to ensure all text sequences have the same length (cutting sentences longer than 20 words and adding zeros to shorter ones).

* num_words=max_words (200): This enforces the limit set above. fit_on_texts still counts every word, but texts_to_sequences only emits indices below num_words, i.e. 1 to 199. Because index 1 is reserved for the OOV token when one is set, that leaves the 198 most frequent words with an index of their own.

* oov_token="<oov>": OOV stands for Out Of Vocabulary. This is a crucial safety net.
    * Without this: If the tokenizer encounters a word it hasn't seen before (or a word outside the kept vocabulary), it simply drops it, losing information about the sentence structure.

    * With this: Any word outside the kept vocabulary is replaced with the "<oov>" token. This allows the model to recognize that "some word exists here, even if I don't know what it is."
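
The indexing rules described above can be sketched in plain Python. This is a simplified stand-in written for illustration, not the Keras implementation: it only lowercases and splits on whitespace, and the helper names fit_vocab and to_sequences are hypothetical.

```python
from collections import Counter

def fit_vocab(texts, oov_token="<oov>"):
    """Build a word -> index map: OOV gets 1, then words by descending frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequences(texts, word_index, num_words, oov_token="<oov>"):
    """Map words to indices; unseen words and indices >= num_words become OOV."""
    oov = word_index[oov_token]
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word, oov)
            seq.append(idx if idx < num_words else oov)
        sequences.append(seq)
    return sequences

train_texts = ["steel bolt hex head", "steel wiring copper"]
word_index = fit_vocab(train_texts)

# With num_words=4 only indices 1..3 survive: <oov>, "steel", "bolt".
# "titanium" was never seen and "hex" ranks too low, so both map to 1.
sequences = to_sequences(["steel bolt titanium hex"], word_index, num_words=4)
print(sequences)
```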

In [None]:
X_train_text = tf.keras.preprocessing.sequence.pad_sequences(
    tokenizer.texts_to_sequences(X_train['item_desc']),
    maxlen=max_len
)

X_test_text = tf.keras.preprocessing.sequence.pad_sequences(
    tokenizer.texts_to_sequences(X_test['item_desc']),
    maxlen=max_len
)

* X_train['item_desc']
    * Input: This is the raw text column of the training DataFrame, holding descriptions such as "steel bolt hex head" or "chemical solvent liquid".

* tokenizer.texts_to_sequences(...)
    * Action: It converts the raw text strings into lists of integers.
    * How it works: It uses the tokenizer fitted on the training descriptions above to look up each word in its vocabulary dictionary.
    * Result: A list of lists, where each list represents a sentence, and each integer represents a specific word.
    * Note: The Keras method is named texts_to_sequences (plural). A call to texts_to_sequence (singular) is a typo unless you have a custom wrapper.

* tf.keras.preprocessing.sequence.pad_sequences(...)
    * Action: It standardizes the length of the integer lists.

    * Why it's needed: Deep learning models usually require input tensors to be rectangular (i.e., every sample must have the exact same length).

    * How it works:

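        * If a sequence is shorter than max_len: It adds zeros (padding) to the beginning (the default, padding='pre') or to the end (padding='post') to match the length.

        * If a sequence is longer than max_len: It cuts off (truncates) the extra words so it fits. By default (truncating='pre') the words are removed from the beginning of the sequence.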
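
The padding and truncation behaviour can be illustrated with a small NumPy sketch. It mimics only the default 'pre' padding and 'pre' truncation; pad_pre is a hypothetical helper, not a Keras function.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` tokens of long sequences."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all padding
        kept = seq[-maxlen:]            # 'pre' truncation drops tokens from the front
        out[row, -len(kept):] = kept    # 'pre' padding leaves zeros at the front
    return out

padded = pad_pre([[5, 6, 7], [1, 2, 3, 4, 5, 6]], maxlen=5)
print(padded)
```

The first row gains two leading zeros; the second row loses its first token.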

# Feature Preprocessing


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

In [None]:
num_features = [
    'avg_unit_cost'
]

cat_features = [
    'item_type',
    'lifecycle_phase',
    'is_custom_part'
]

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('one_hot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num_pipeline', num_pipeline, num_features),
    ('cat_pipeline', cat_pipeline, cat_features)
])

preprocessor

ColumnTransformer: transformers=[('num_pipeline', ...), ('cat_pipeline', ...)], remainder='drop', sparse_threshold=0.3, verbose=False, verbose_feature_names_out=True, force_int_remainder_cols='deprecated'

    num_pipeline
        SimpleImputer: strategy='median', copy=True, add_indicator=False, keep_empty_features=False
        StandardScaler: copy=True, with_mean=True, with_std=True

    cat_pipeline
        SimpleImputer: strategy='most_frequent', copy=True, add_indicator=False, keep_empty_features=False
        OneHotEncoder: categories='auto', sparse_output=True, dtype=numpy.float64, handle_unknown='ignore', feature_name_combiner='concat'


In [None]:
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

In [None]:
X_train_processed[:2]

array([[-1.06277228,  0.        ,  0.        ,  1.        ,  1.        ,
         0.        ,  0.        ,  1.        ,  0.        ],
       [-1.46451837,  0.        ,  1.        ,  0.        ,  0.        ,
         1.        ,  0.        ,  0.        ,  1.        ]])
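
Each processed row has 9 columns: one scaled avg_unit_cost value followed by one-hot blocks for item_type (3), lifecycle_phase (3) and is_custom_part (2). As a rough NumPy sketch of what the two pipelines compute (not the scikit-learn code itself, and with made-up toy values), the statistics are learned on the training data only and then reused on the test data:

```python
import numpy as np

def one_hot(values, categories):
    """One column per known category; an unknown value yields an all-zero row."""
    values = np.asarray(values)
    categories = np.asarray(categories)
    return (values[:, None] == categories[None, :]).astype(float)

# Toy numbers, chosen only for illustration
train_cost = np.array([10.0, 20.0, 30.0])
test_cost = np.array([20.0, 40.0])

# "fit" on train: learn mean and (population) standard deviation
mean, std = train_cost.mean(), train_cost.std()

# "transform": apply the train statistics to both splits
train_scaled = (train_cost - mean) / std
test_scaled = (test_cost - mean) / std

# Categories in sorted order, as learned from the training column
item_type_categories = ['component', 'finished_good', 'raw_material']
encoded = one_hot(['raw_material', 'component', 'unseen_type'], item_type_categories)

print(train_scaled, test_scaled)
print(encoded)
```

The all-zero row for 'unseen_type' corresponds to handle_unknown='ignore' in the OneHotEncoder above.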

In [None]:
type(X_train_processed)

numpy.ndarray

In [None]:
type(X_test_processed)

numpy.ndarray

In [None]:
# X_train_processed = X_train_processed.toarray()
# type(X_train_processed)

# Functional API model

In [None]:
print(max_len)
print(max_words)

text_input = tf.keras.layers.Input(shape=(max_len, ), name='text_input')

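# input_length is omitted: the Input layer already fixes the sequence
# length, and newer Keras versions deprecate the argument.
embedding = tf.keras.layers.Embedding(
    input_dim=max_words,  # vocabulary size: one row per token index
    output_dim=64,        # size of each word vector
)(text_input)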

text_features = tf.keras.layers.GlobalAveragePooling1D()(embedding)

20
200


* text_input = tf.keras.layers.Input(shape=(max_len, ), name='text_input')
    * Role: Defines the entry point for data.

    * shape=(max_len, ): Tells the model to expect a 1D array of integers with exactly max_len items for each sample.

    * Concept: This acts as a placeholder tensor that will be fed the X_train_text data you processed earlier.

* embedding = tf.keras.layers.Embedding(input_dim=max_words, output_dim=64, ...)(text_input)

    * Role: Converts integers into dense vectors.

    * input_dim=max_words: The layer creates a lookup matrix with this many rows.

    * output_dim=64: Every single word is transformed into a vector of 64 distinct numbers.

    * Transformation:

        * Input: (batch_size, max_len) — A 2D matrix of integers.

        * Output: (batch_size, max_len, 64) — A 3D tensor where each word now has "depth" (features).

* text_features = tf.keras.layers.GlobalAveragePooling1D()(embedding)
    * Role: Summarizes the sequence into a single vector.

    * How it works: It takes the average of all word vectors in the sequence.

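    * Each padded sequence has max_len = 20 positions, so you have 20 vectors of size 64. Padding positions are included, because the Embedding layer is not created with mask_zero=True.

    * This layer adds them all up and divides by 20.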

    * Why: It is a very computationally efficient way to flatten the data while retaining the "average meaning" of the sentence. It discards word order but keeps semantic content.

    * Transformation:

        * Input: (batch_size, max_len, 64)

        * Output: (batch_size, 64) — A 2D matrix. The sequence dimension (max_len) is gone.
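
The shape changes described above can be traced with NumPy alone. This is a sketch under the assumption that an embedding is a row lookup in a (max_words, 64) matrix and that pooling is a mean over the sequence axis; the random matrix stands in for learned weights. The masked variant at the end is an optional alternative, not what the model above does.

```python
import numpy as np

rng = np.random.default_rng(0)
max_words, embed_dim, max_len = 200, 64, 20

# Stand-in for the learned embedding weights: one row per token index
embedding_matrix = rng.normal(size=(max_words, embed_dim))

# Two pre-padded sequences of token indices (0 = padding)
batch = np.zeros((2, max_len), dtype=int)
batch[0, -4:] = [2, 3, 4, 5]
batch[1, -3:] = [6, 7, 8]

# Embedding lookup: (batch, max_len) -> (batch, max_len, 64)
embedded = embedding_matrix[batch]

# Global average pooling: mean over all 20 positions, padding included
pooled = embedded.mean(axis=1)

# Alternative: average over the real tokens only
mask = (batch != 0)[..., None]
masked_pooled = (embedded * mask).sum(axis=1) / mask.sum(axis=1)

print(embedding_matrix.size)   # number of embedding parameters
print(embedded.shape, pooled.shape, masked_pooled.shape)
```

The parameter count of 200 x 64 = 12800 matches the Embedding row in the model summary.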

# Structured Branch

In [None]:
structured_input = tf.keras.layers.Input(
    shape=(X_train_processed.shape[1], ),
    name='structured_input'
)

structured_features = tf.keras.layers.Dense(32, activation='relu')(structured_input)

combined = tf.keras.layers.Concatenate()([text_features, structured_features])
x = tf.keras.layers.Dense(64, activation='relu')(combined)
x = tf.keras.layers.Dense(32, activation='relu')(x)
output = tf.keras.layers.Dense(y.shape[1], activation='softmax')(x)

model = tf.keras.Model(
    inputs=[text_input, structured_input],
    outputs=output,
    name="model"
)

initial_learning_rate = 0.001
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate,
    decay_steps=1000,
    decay_rate=0.96,
    staircase=True
)

model.compile(
    loss=tf.keras.losses.CategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
    metrics=[tf.keras.metrics.CategoricalAccuracy()]
)

model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 text_input (InputLayer)        [(None, 20)]         0           []                               
                                                                                                  
 embedding (Embedding)          (None, 20, 64)       12800       ['text_input[0][0]']             
                                                                                                  
 structured_input (InputLayer)  [(None, 9)]          0           []                               
                                                                                                  
 global_average_pooling1d (Glob  (None, 64)          0           ['embedding[0][0]']              
 alAveragePooling1D)                                                                          
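
A note on the learning-rate schedule in the compile step: with staircase=True, the documented ExponentialDecay rule is initial_rate * decay_rate ** floor(step / decay_steps). The sketch below re-implements that formula in plain Python (the function name is illustrative) and plugs in the step count of this notebook: 80% of 4800 training rows at batch size 32 gives 120 steps per epoch, so 5 epochs are 600 steps.

```python
import math

def exponential_decay(step, initial_lr=0.001, decay_steps=1000,
                      decay_rate=0.96, staircase=True):
    """Learning rate at a given optimizer step, following the documented formula."""
    exponent = step / decay_steps
    if staircase:
        exponent = math.floor(exponent)
    return initial_lr * decay_rate ** exponent

n_fit = 4800 * 8 // 10                    # rows left after validation_split=0.2
steps_per_epoch = math.ceil(n_fit / 32)   # batch_size=32
total_steps = steps_per_epoch * 5         # epochs=5

print(steps_per_epoch, total_steps)
print(exponential_decay(total_steps - 1))  # rate at the last step of the run
print(exponential_decay(1000))             # first step at which the rate drops
```

Under these settings the rate stays at its initial value for the whole run, because the first decay step would only occur at step 1000.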

# Model training

In [None]:
import os
from datetime import datetime

log_dir = "logs/fit/" + datetime.now().strftime('%Y%m%d-%H%M%S')

callbacks = [
    tf.keras.callbacks.ModelCheckpoint(
        filepath="myModel_{epoch:02d}.keras",
        save_best_only=True,
        monitor='val_loss',
        verbose=2),

    tf.keras.callbacks.EarlyStopping(
        monitor='val_loss',
        min_delta=1e-2,
        patience=5,
        verbose=2),

    tf.keras.callbacks.TensorBoard(
        log_dir=log_dir,
        histogram_freq=1
    )
]

In [None]:
model.fit(
    [X_train_text, X_train_processed],
    y_train,
    epochs=5,
    batch_size=32,
    validation_split=0.2,
    callbacks=callbacks,
    verbose=2
)

Epoch 1/5

Epoch 1: val_loss improved from inf to 1.61357, saving model to myModel_01.keras
120/120 - 1s - loss: 1.5998 - categorical_accuracy: 0.2294 - val_loss: 1.6136 - val_categorical_accuracy: 0.2062 - 602ms/epoch - 5ms/step
Epoch 2/5

Epoch 2: val_loss did not improve from 1.61357
120/120 - 0s - loss: 1.5984 - categorical_accuracy: 0.2339 - val_loss: 1.6157 - val_categorical_accuracy: 0.2073 - 447ms/epoch - 4ms/step
Epoch 3/5

Epoch 3: val_loss did not improve from 1.61357
120/120 - 0s - loss: 1.5955 - categorical_accuracy: 0.2380 - val_loss: 1.6144 - val_categorical_accuracy: 0.1823 - 437ms/epoch - 4ms/step
Epoch 4/5

Epoch 4: val_loss did not improve from 1.61357
120/120 - 0s - loss: 1.5933 - categorical_accuracy: 0.2482 - val_loss: 1.6214 - val_categorical_accuracy: 0.1813 - 446ms/epoch - 4ms/step
Epoch 5/5

Epoch 5: val_loss did not improve from 1.61357
120/120 - 0s - loss: 1.5908 - categorical_accuracy: 0.2456 - val_loss: 1.6193 - val_categorical_accuracy: 0.2146 - 448ms/epo

<keras.callbacks.History at 0x7f6bb02278e0>
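
To see how the EarlyStopping settings interact with epochs=5, the patience rule can be sketched in a few lines of Python. This is a simplified reading of the callback (an epoch counts as an improvement only if val_loss drops by more than min_delta), not the Keras source, and the function name is hypothetical. The losses are the val_loss values from the log above.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=1e-2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss   # real improvement: reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

logged_val_loss = [1.6136, 1.6157, 1.6144, 1.6214, 1.6193]

stop_default = early_stopping_epoch(logged_val_loss)             # patience=5
stop_short = early_stopping_epoch(logged_val_loss, patience=3)

print(stop_default)
print(stop_short)
```

With patience=5 and only 5 epochs, at most 4 epochs without improvement can accumulate, so the callback cannot fire in this run; a smaller patience or a larger epochs value would be needed for it to have an effect.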

# Evaluation

In [None]:
loss, acc = model.evaluate(
    [X_test_text, X_test_processed],
    y_test
)
print(f"Loss: {loss:.2f}")
print(f"Accuracy: {acc:.2f}")

Loss: 1.62
Accuracy: 0.20
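
These numbers are what random guessing gives. In the synthetic data the commodity_code column is drawn uniformly at random, independently of every feature, so there is nothing for the model to learn. A quick check of the chance-level values for five equally likely classes:

```python
import math

n_classes = 5

# Always predicting a uniform distribution over the classes
chance_accuracy = 1 / n_classes
chance_loss = -math.log(1 / n_classes)  # categorical cross-entropy of a uniform guess

print(f"Chance accuracy: {chance_accuracy:.2f}")
print(f"Chance loss: {chance_loss:.4f}")
```

A loss near ln(5) and an accuracy near 0.20 therefore indicate that the pipeline runs end to end, not that the architecture is weak; real labels that depend on the descriptions are needed to judge the model.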


# Feature Importance (Functional API)

In [None]:
import numpy as np
from sklearn.metrics import accuracy_score

def model_acc(X_text, X_processed, y_true):
    preds = model.predict([X_text, X_processed])
    return accuracy_score(y_true.argmax(axis=1), preds.argmax(axis=1))

baseline_acc = model_acc(X_test_text, X_test_processed, y_test)
baseline_acc



0.2

In [None]:
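importances = []

# Permutation importance: shuffle one structured column at a time and
# measure how much test accuracy drops relative to the baseline.
for i in range(X_test_processed.shape[1]):
    X_perm = X_test_processed.copy()
    np.random.shuffle(X_perm[:, i])  # in-place shuffle of column i only

    acc_perm = model_acc(X_test_text, X_perm, y_test)
    importances.append(baseline_acc - acc_perm)

sorted(importances, reverse=True)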



[0.012500000000000011,
 0.010000000000000009,
 0.0050000000000000044,
 0.00416666666666668,
 0.0025000000000000022,
 -0.0008333333333333248,
 -0.0025000000000000022,
 -0.004166666666666652,
 -0.009999999999999981]
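
The importances above range from about -0.01 to +0.0125. With 1200 test rows and an accuracy of 0.2, the standard error of the accuracy itself is roughly sqrt(0.2 * 0.8 / 1200) ≈ 0.0115, so these differences are within noise. One common way to make that visible is to repeat each shuffle several times and report the mean and spread. The sketch below shows the idea on a toy classifier where the answer is known; the function and the toy data are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean and std of the accuracy drop per column over repeated shuffles."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    drops = np.zeros((X.shape[1], n_repeats))
    for col in range(X.shape[1]):
        for rep in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops[col, rep] = baseline - np.mean(predict(X_perm) == y)
    return drops.mean(axis=1), drops.std(axis=1)

# Toy setup: the label depends on column 0 only
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(500, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

def toy_predict(data):
    return (data[:, 0] > 0).astype(int)

mean_imp, std_imp = permutation_importance(toy_predict, X_toy, y_toy)
print(mean_imp)
print(std_imp)
```

Column 0 shows a large drop while the unused columns show exactly zero. For the model in this notebook, a feature would count as important only if its mean drop clearly exceeds the spread across repeats.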

In [None]:
preprocessor.get_feature_names_out()

In [None]:
importance_df = pd.Series(importances, index=preprocessor.get_feature_names_out())
importance_df = importance_df.sort_values(ascending=True)
importance_df

In [None]:
import matplotlib.pyplot as plt

importance_df.plot(kind='bar')
plt.ylabel('Accuracy drop when permuted')
plt.tight_layout()
plt.show()