# Mapping product names to product SKUs

##### There are two datasets:
##### - Product Catalog, with 4 columns: `Product SKU`, `Type`, `Brand`, & `Formula`
##### - Product Names from PoS transactions, with a single column: `Product Name`
##### The product names need to be mapped to the available product SKUs. If necessary, new product SKUs may be generated to cover the existing product names or ones that arrive in the future.

In [1]:
# Import pandas to make DataFrames from the available datasets
import pandas as pd

In [2]:
# Read datasets and assign them in each DataFrame variable
product_catalog = pd.read_excel('/Users/sakabumi/project/dsw-2023/Product Catalog.xlsx')
product_pos = pd.read_excel('/Users/sakabumi/project/dsw-2023/Product Name from PoS Transactions.xlsx')

In [3]:
product_catalog.head()

Unnamed: 0,Product SKU,Brand,Type,Formula
0,Urea Petro,PIHC,Urea,
1,Urea PIM,PIHC,Urea,
2,Urea Nitrea,PIHC,Urea,
3,Urea Daun Buah,PIHC,Urea,
4,Urea Pusri,PIHC,Urea,


In [4]:
# Check how many NaN values there are
product_catalog.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 187 entries, 0 to 186
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Product SKU  187 non-null    object
 1   Brand        187 non-null    object
 2   Type         187 non-null    object
 3   Formula      123 non-null    object
dtypes: object(4)
memory usage: 6.0+ KB


In [5]:
# Check how many unique values each column has, to guide further data exploration
product_catalog.describe()

Unnamed: 0,Product SKU,Brand,Type,Formula
count,187,187,187,123
unique,186,9,10,57
top,ENTEC 13-10-20,PIHC,Majemuk,15-15-15
freq,2,43,123,15


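Of the 187 entries, there are:
- 186 unique product SKUs,
- 9 brands,
- 10 types, &
- 57 unique formulas

One SKU (`ENTEC 13-10-20`) appears twice, so the duplicate row is dropped in the next step.

The number of non-null `Formula` values (123) equals the frequency of Type "Majemuk" (123), which suggests that `Formula` is filled in only for that Type.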
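
The link between `Formula` and Type "Majemuk" can also be checked directly instead of being inferred from matching counts. A minimal sketch, assuming a small made-up catalog (the rows below are illustrative, not taken from the real file):

```python
import pandas as pd

# Hypothetical mini catalog, for illustration only
catalog = pd.DataFrame({
    "Product SKU": ["SKU A", "SKU B", "SKU C", "SKU D"],
    "Type": ["Urea", "Majemuk", "Majemuk", "ZA"],
    "Formula": [None, "15-15-15", "16-16-16", None],
})

# Count rows with and without a formula, per type
formula_by_type = pd.crosstab(catalog["Type"], catalog["Formula"].notna())
print(formula_by_type)

# Types that have at least one non-null formula
types_with_formula = set(catalog.loc[catalog["Formula"].notna(), "Type"])
print(types_with_formula)
```

Running the same two statements on `product_catalog` would confirm or refute the observation for the real data.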

In [9]:
product_catalog = product_catalog.drop_duplicates()

In [10]:
product_catalog.describe()

Unnamed: 0,Product SKU,Brand,Type,Formula
count,186,186,186,122
unique,186,9,10,57
top,Urea Petro,PIHC,Majemuk,15-15-15
freq,1,43,122,15


In [11]:
product_catalog.info()

<class 'pandas.core.frame.DataFrame'>
Index: 186 entries, 0 to 186
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Product SKU  186 non-null    object
 1   Brand        186 non-null    object
 2   Type         186 non-null    object
 3   Formula      122 non-null    object
dtypes: object(4)
memory usage: 7.3+ KB


In [12]:
product_catalog = product_catalog.reset_index(drop=True)

In [13]:
product_catalog.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186 entries, 0 to 185
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Product SKU  186 non-null    object
 1   Brand        186 non-null    object
 2   Type         186 non-null    object
 3   Formula      122 non-null    object
dtypes: object(4)
memory usage: 5.9+ KB


In [14]:
product_pos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44002 entries, 0 to 44001
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Product Name  44001 non-null  object
dtypes: object(1)
memory usage: 343.9+ KB


In [15]:
product_pos.describe()

Unnamed: 0,Product Name
count,44001
unique,44001
top,Pupuk Urea N 46%
freq,1


In [16]:
# Rename column names in the new df "product_catalog" & "product_pos"
product_catalog = product_catalog.rename(columns={
    "Product SKU": "product_sku",
    "Brand": "brand",
    "Type": "type",
    "Formula": "formula"
})
product_pos = product_pos.rename(columns={"Product Name": "product_name"})

# Convert the text columns in "product_catalog" (all except `formula`) & "product_pos" to uppercase
product_catalog[['product_sku', 'brand', 'type']] = product_catalog[['product_sku', 'brand', 'type']].astype(str).apply(lambda col: col.str.upper())
product_pos['product_name'] = product_pos['product_name'].str.upper()

## Preprocess the columns by cleaning the values

In [17]:
import re

# Create a function to do the cleaning
def preprocess_text(text):
    # Trim leading and trailing spaces
    text = text.strip()
    
    # Remove special characters, non-alphanumeric characters, and punctuation
    text = re.sub(r'[^\w\s]', '', text)
    
    # Remove multiple spaces and replace them with a single space
    text = re.sub(r'\s+', ' ', text)
    
    return text

In [18]:
# Cleaning the product names & formulas
preprocessed_product_names = [preprocess_text(str(name)) for name in product_pos['product_name']]
preprocessed_formula = [preprocess_text(str(formula)) for formula in product_catalog['formula']]

# Put the preprocessed values back into their columns
product_pos['product_name'] = preprocessed_product_names
product_catalog['formula'] = preprocessed_formula

In [19]:
product_catalog.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186 entries, 0 to 185
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   product_sku  186 non-null    object
 1   brand        186 non-null    object
 2   type         186 non-null    object
 3   formula      186 non-null    object
dtypes: object(4)
memory usage: 5.9+ KB


In [20]:
product_catalog.head()

Unnamed: 0,product_sku,brand,type,formula
0,UREA PETRO,PIHC,UREA,nan
1,UREA PIM,PIHC,UREA,nan
2,UREA NITREA,PIHC,UREA,nan
3,UREA DAUN BUAH,PIHC,UREA,nan
4,UREA PUSRI,PIHC,UREA,nan


The `NaN` values in column `formula` become the string `nan`, because every value is passed through `str()` before `preprocess_text()` is applied.
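
If the missing values should stay missing instead, one option is to skip them while cleaning. A sketch, assuming the same cleaning steps as `preprocess_text()` and a made-up `formula` series; `Series.map(..., na_action='ignore')` leaves `NaN` untouched:

```python
import re

import numpy as np
import pandas as pd

def preprocess_text(text):
    # Same cleaning steps as the notebook's function
    text = text.strip()
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text

# Illustrative values only
formula = pd.Series(["15-15-15", np.nan, " 16-16-8 "])

# str() converts NaN to the string 'nan' before cleaning
as_strings = [preprocess_text(str(value)) for value in formula]

# na_action='ignore' skips NaN, so missing values stay missing
cleaned = formula.map(preprocess_text, na_action='ignore')

print(as_strings)
print(cleaned)
```

With this variant there would be no `'nan'` string to remove from the list of formulas later; the missing values could be dropped with `dropna()` instead.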

In [21]:
product_catalog.describe()

Unnamed: 0,product_sku,brand,type,formula
count,186,186,186,186
unique,186,9,10,58
top,UREA PETRO,PIHC,MAJEMUK,nan
freq,1,43,122,64


In `formula`, the count rises from 122 to 186 because the 64 missing values are now the string `nan`, which counts as a non-null value.
The remaining (non-`nan`) values can be used to tag product names as type `MAJEMUK` whenever a product name contains one of the formulas.

In [22]:
# Left-join the PoS dataset with the product SKUs from `product_catalog`,
# matching `product_name` against `product_sku` exactly.
product_pos = product_pos.merge(
    product_catalog['product_sku'],
    left_on='product_name',
    right_on='product_sku',
    how='left')

In [23]:
product_pos.head()

Unnamed: 0,product_name,product_sku
0,PUPUK UREA N 46,
1,PUPUK AMONIUM SULFAT ZA,
2,PUPUK SUPER FOSFAT SP36,
3,PUPUK NPK PHONSKA,
4,PUPUK NPK FORMULA KHUSUS,


In [24]:
product_pos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44002 entries, 0 to 44001
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   product_name  44002 non-null  object
 1   product_sku   39 non-null     object
dtypes: object(2)
memory usage: 687.7+ KB


It turns out that only 39 of the 44,002 product names match a product SKU exactly.
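
To see how an exact-match join separates matched from unmatched names, `merge` accepts `indicator=True`, which adds a `_merge` column. A small sketch; the name `UREA PETRO 50KG` is made up to show a near miss:

```python
import pandas as pd

# Illustrative data; only the exact string "UREA PETRO" matches a SKU
pos = pd.DataFrame({
    "product_name": ["UREA PETRO", "UREA PETRO 50KG", "PUPUK UREA N 46"]
})
catalog = pd.DataFrame({"product_sku": ["UREA PETRO", "UREA PUSRI"]})

merged = pos.merge(
    catalog[["product_sku"]],
    left_on="product_name",
    right_on="product_sku",
    how="left",
    indicator=True,  # 'both' = matched, 'left_only' = no SKU found
)

match_counts = merged["_merge"].value_counts()
print(match_counts)
```

Even a small suffix such as a pack size is enough to break an exact match, which is consistent with the low match count above.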

## Mapping

In [25]:
# List down the unique values of each column: `type` & `formula`
# The lists then can be used in a function to map the product names
list_product_type = list(product_catalog['type'].unique())
list_formula = list(product_catalog['formula'].unique())

In [26]:
list_product_type

['UREA',
 'NITROGEN',
 'ZA',
 'ZK',
 'MIKRO',
 'FOSFAT',
 'ORGANIK',
 'MAJEMUK',
 'KALIUM',
 'MG']

In [27]:
list_formula

['nan',
 '151515',
 '121217',
 '12622',
 '16168',
 '20200',
 '16200',
 '161616',
 '15920',
 '3068',
 '281010',
 '201010',
 '201018',
 '13627',
 '181014',
 '121120',
 '131324',
 '92525',
 '15150',
 '12610',
 '05234',
 '8939',
 '15156',
 '13827',
 '7635',
 '151022',
 '21147',
 '18614',
 '18810',
 '201012',
 '28613',
 '81519',
 '9156',
 '12624',
 '05232',
 '15015',
 '12600',
 '121236',
 '181818',
 '61828',
 '01617',
 '13046',
 '1370463',
 '131121',
 '2577',
 '121118',
 '18126',
 '19919',
 '05035',
 '18460',
 '7634',
 '161018',
 '131020',
 '15520',
 '131111',
 '151020',
 '15200',
 '20614']

In [28]:
# Remove `nan` from `list_formula`
list_formula.remove('nan')

In [29]:
# Create a function to map product type to product name
def map_product_type(product_name):
    for product_type in list_product_type:
        if product_type in product_name:
            return product_type
    return "UNKNOWN"

# Create a function to map product type `Majemuk` to product name which contains a certain formula
def map_product_type_majemuk(product_name):
    for formula in list_formula:
        if formula in product_name:
            return "MAJEMUK"
    return "UNKNOWN"
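
Note that `in` performs substring matching, so a short type such as `ZA` or `MG` also matches inside longer words. A hedged alternative is to match whole words with a regular-expression word boundary. The function names and the sample product names below are hypothetical:

```python
import re

list_product_type = ['UREA', 'NITROGEN', 'ZA', 'ZK', 'MIKRO',
                     'FOSFAT', 'ORGANIK', 'MAJEMUK', 'KALIUM', 'MG']

def map_product_type_substring(product_name):
    # Same behaviour as map_product_type(): plain substring test
    for product_type in list_product_type:
        if product_type in product_name:
            return product_type
    return "UNKNOWN"

def map_product_type_tokens(product_name):
    # Match the type only when it appears as a whole word
    for product_type in list_product_type:
        if re.search(rf'\b{re.escape(product_type)}\b', product_name):
            return product_type
    return "UNKNOWN"

print(map_product_type_substring("ZAMAN BARU"))  # ZA (matched inside "ZAMAN")
print(map_product_type_tokens("ZAMAN BARU"))     # UNKNOWN
print(map_product_type_tokens("PUPUK ZA 50KG"))  # ZA
```

The trade-off is that `\b` treats digits as word characters, so a name written without a space (for example a type glued to a pack size) would no longer match.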

In [30]:
# Mapping process using function `map_product_type()` & `map_product_type_majemuk()`
mapped_product_types = [map_product_type(str(name)) for name in product_pos['product_name']]
mapped_product_types_majemuk = [map_product_type_majemuk(str(name)) for name in product_pos['product_name']]

# Store the mapped product types as new columns in product_pos
product_pos['product_type_map'] = mapped_product_types
product_pos['product_type_map_majemuk'] = mapped_product_types_majemuk

In [31]:
product_pos.head()

Unnamed: 0,product_name,product_sku,product_type_map,product_type_map_majemuk
0,PUPUK UREA N 46,,UREA,UNKNOWN
1,PUPUK AMONIUM SULFAT ZA,,ZA,UNKNOWN
2,PUPUK SUPER FOSFAT SP36,,FOSFAT,UNKNOWN
3,PUPUK NPK PHONSKA,,UNKNOWN,UNKNOWN
4,PUPUK NPK FORMULA KHUSUS,,UNKNOWN,UNKNOWN


In [32]:
product_pos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44002 entries, 0 to 44001
Data columns (total 4 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   product_name              44002 non-null  object
 1   product_sku               39 non-null     object
 2   product_type_map          44002 non-null  object
 3   product_type_map_majemuk  44002 non-null  object
dtypes: object(4)
memory usage: 1.3+ MB


In [33]:
product_pos.groupby(['product_type_map'])['product_name'].count()

product_type_map
FOSFAT         11
KALIUM         22
MAJEMUK        18
MG             68
MIKRO          37
NITROGEN        4
ORGANIK        44
UNKNOWN     42371
UREA          727
ZA            656
ZK             44
Name: product_name, dtype: int64

There are 42,371 product names with no mapped product type (`product_type_map` is `UNKNOWN`).

In [34]:
product_pos.groupby(['product_type_map_majemuk'])['product_name'].count()

product_type_map_majemuk
MAJEMUK     1143
UNKNOWN    42859
Name: product_name, dtype: int64

Most `product_name`s are still unmapped: 42,859 of 44,002 contain no catalog formula, and 42,371 matched no product type.

In [35]:
import numpy as np

# Where a formula matched, overwrite "product_type_map" with "MAJEMUK"; otherwise keep the current value
product_pos['product_type_map'] = np.where(
    product_pos['product_type_map_majemuk'] != 'UNKNOWN',
    product_pos['product_type_map_majemuk'],
    product_pos['product_type_map']
)

In [36]:
product_pos.groupby(['product_type_map'])['product_name'].count()

product_type_map
FOSFAT         11
KALIUM         22
MAJEMUK      1153
MG             16
MIKRO          35
NITROGEN        4
ORGANIK        44
UNKNOWN     41292
UREA          725
ZA            656
ZK             44
Name: product_name, dtype: int64

In [37]:
# List down the unique values in column `brand`
list_brand = list(product_catalog['brand'].unique())

In [38]:
list_brand

['PIHC',
 'MUTIARA',
 'MAHKOTA',
 'PAK TANI',
 'YARA',
 'TAWON',
 'DGW/HEXTAR',
 'BASF',
 'LAOYING']

In [39]:
# Create a function to map brand to product name
def map_brand(product_name):
    for brand in list_brand:
        if brand in product_name:
            return brand
    return "UNKNOWN"

In [40]:
# Mapping process using function `map_brand()`
# Convert each `name` to `str` before passing it to `map_brand()` to avoid this error:
#   TypeError: argument of type 'float' is not iterable

mapped_brands = [map_brand(str(name)) for name in product_pos['product_name']]

# Store the mapped brands as a new column in product_pos
product_pos['brand_map'] = mapped_brands

In [41]:
product_pos.groupby(['brand_map'])['product_name'].count()

brand_map
BASF            8
LAOYING        45
MAHKOTA       149
MUTIARA       404
PAK TANI      252
TAWON         273
UNKNOWN     42610
YARA          261
Name: product_name, dtype: int64
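
Two catalog brands, `PIHC` and `DGW/HEXTAR`, never appear in the result above. For `DGW/HEXTAR`, one likely reason is that `preprocess_text()` strips the `/` from product names, so the full brand string can no longer occur as a substring. A sketch of one workaround, assuming that each part of a slash-separated brand may appear on its own in a product name (the alias table, the function name, and the sample name `NPK HEXTAR 151515` are hypothetical):

```python
list_brand = ['PIHC', 'MUTIARA', 'MAHKOTA', 'PAK TANI', 'YARA',
              'TAWON', 'DGW/HEXTAR', 'BASF', 'LAOYING']

# Assumption: "DGW/HEXTAR" may be written as either "DGW" or "HEXTAR"
brand_aliases = {
    alias.strip(): brand
    for brand in list_brand
    for alias in brand.split('/')
}

def map_brand_with_aliases(product_name):
    # Return the catalog brand whose alias occurs in the product name
    for alias, brand in brand_aliases.items():
        if alias in product_name:
            return brand
    return "UNKNOWN"

print(map_brand_with_aliases("NPK HEXTAR 151515"))  # DGW/HEXTAR
print(map_brand_with_aliases("KCL MAHKOTA"))        # MAHKOTA
```

Whether the PoS names really use these aliases would need to be checked against the data before relying on this.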

In [42]:
product_pos[product_pos['brand_map'] != 'UNKNOWN'].head()

Unnamed: 0,product_name,product_sku,product_type_map,product_type_map_majemuk,brand_map
81,KCL MAHKOTA,,UNKNOWN,UNKNOWN,MAHKOTA
82,MUTIARA,,UNKNOWN,UNKNOWN,MUTIARA
83,MUTIARA GROWER,,UNKNOWN,UNKNOWN,MUTIARA
366,ZA TAWON50KG,,ZA,UNKNOWN,TAWON
410,DAP TAWON,,UNKNOWN,UNKNOWN,TAWON


In [43]:
# Remove column `product_type_map_majemuk` since it is no longer used
product_pos = product_pos.drop(['product_type_map_majemuk'], axis=1)

In [44]:
product_pos[product_pos['brand_map'] != 'UNKNOWN'].head()

Unnamed: 0,product_name,product_sku,product_type_map,brand_map
81,KCL MAHKOTA,,UNKNOWN,MAHKOTA
82,MUTIARA,,UNKNOWN,MUTIARA
83,MUTIARA GROWER,,UNKNOWN,MUTIARA
366,ZA TAWON50KG,,ZA,TAWON
410,DAP TAWON,,UNKNOWN,TAWON


In [45]:
product_pos[product_pos['product_type_map'] != 'UNKNOWN'].count()

product_name        2710
product_sku           22
product_type_map    2710
brand_map           2710
dtype: int64

There are 2,710 records whose product types are not equal to `UNKNOWN`

In [46]:
product_pos[product_pos['brand_map'] != 'UNKNOWN'].count()

product_name        1392
product_sku            2
product_type_map    1392
brand_map           1392
dtype: int64

There are 1,392 records whose brands are not equal to `UNKNOWN`

## Trial: creating models

The models will be used to map `product_name` to `type`, `brand`, and then `product_sku`

In [47]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn import metrics

In [56]:
# Creating dataframes as raw data to be used in the model training
df_map_type = product_pos[product_pos['product_type_map'] != 'UNKNOWN']
df_map_brand = product_pos[product_pos['brand_map'] != 'UNKNOWN']

In [57]:
df_map_type.describe()

Unnamed: 0,product_name,product_sku,product_type_map,brand_map
count,2193,11,2193,2193
unique,2193,11,10,8
top,PUPUK UREA N 46,UREA PETRO,MAJEMUK,UNKNOWN
freq,1,1,890,1889


In [58]:
df_map_brand.describe()

Unnamed: 0,product_name,product_sku,product_type_map,brand_map
count,1174,2,1174,1174
unique,1174,2,4,7
top,KCL MAHKOTA,ZA PAK TANI,UNKNOWN,MUTIARA
freq,1,1,870,328


In [61]:
# Split the data into training and testing sets
X_train_type, X_test_type, y_train_type, y_test_type = train_test_split(df_map_type['product_name'], df_map_type['product_type_map'], test_size=0.2, random_state=42)
X_train_brand, X_test_brand, y_train_brand, y_test_brand = train_test_split(df_map_brand['product_name'], df_map_brand['brand_map'], test_size=0.2, random_state=42)

# Create a text classification pipeline
model_type = make_pipeline(TfidfVectorizer(), MultinomialNB())
model_brand = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Train the models
model_type.fit(X_train_type, y_train_type)
model_brand.fit(X_train_brand, y_train_brand)

# Make predictions on the test sets
predictions_type = model_type.predict(X_test_type)
predictions_brand = model_brand.predict(X_test_brand)

# Evaluate the models
print("Accuracy - model mapping types:", metrics.accuracy_score(y_test_type, predictions_type))
print("Classification Report:\n", metrics.classification_report(y_test_type, predictions_type))

print("Accuracy - model mapping brands:", metrics.accuracy_score(y_test_brand, predictions_brand))
print("Classification Report:\n", metrics.classification_report(y_test_brand, predictions_brand))

Accuracy - model mapping types: 0.8815489749430524
Classification Report:
               precision    recall  f1-score   support

      FOSFAT       0.00      0.00      0.00         2
      KALIUM       0.00      0.00      0.00         8
     MAJEMUK       0.85      1.00      0.92       175
          MG       0.00      0.00      0.00         3
       MIKRO       0.00      0.00      0.00         7
    NITROGEN       0.00      0.00      0.00         1
     ORGANIK       1.00      0.56      0.71         9
        UREA       0.93      0.97      0.95       118
          ZA       0.88      0.85      0.87       108
          ZK       1.00      0.12      0.22         8

    accuracy                           0.88       439
   macro avg       0.47      0.35      0.37       439
weighted avg       0.84      0.88      0.85       439

Accuracy - model mapping brands: 0.948936170212766
Classification Report:
               precision    recall  f1-score   support

        BASF       0.00      0.00   



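The accuracy of both models is fairly high: 0.8815 for `type` & 0.9489 for `brand`. The per-class scores for `type` are uneven, though: FOSFAT, KALIUM, MG, MIKRO, and NITROGEN all score 0.00, and the macro-average F1 is only 0.37.
Both models will be used to map the remaining product names to types & brands.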
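
Accuracy alone can look good on imbalanced labels even when minority classes are never predicted. A toy illustration (made-up labels, not the notebook's data) comparing accuracy with macro F1 and balanced accuracy:

```python
from sklearn import metrics

# Toy labels: 8 samples of a majority class, 2 of a minority class
y_true = ["MAJEMUK"] * 8 + ["MG"] * 2
# A model that always predicts the majority class
y_pred = ["MAJEMUK"] * 10

accuracy = metrics.accuracy_score(y_true, y_pred)
macro_f1 = metrics.f1_score(y_true, y_pred, average="macro", zero_division=0)
balanced = metrics.balanced_accuracy_score(y_true, y_pred)

# accuracy=0.80 macro_f1=0.44 balanced_accuracy=0.50
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f} balanced_accuracy={balanced:.2f}")
```

Macro-averaged scores weight every class equally, so they are a useful companion to accuracy when a few classes dominate.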

In [62]:
product_pos['product_type_test'] = model_type.predict(product_pos['product_name'])
product_pos['product_brand_test'] = model_brand.predict(product_pos['product_name'])

product_pos.head(10)

Unnamed: 0,product_name,product_sku,product_type_map,brand_map,product_type_test,product_brand_test
0,PUPUK UREA N 46,,UREA,UNKNOWN,UREA,TAWON
1,PUPUK AMONIUM SULFAT ZA,,ZA,UNKNOWN,ZA,TAWON
2,PUPUK SUPER FOSFAT SP36,,FOSFAT,UNKNOWN,ZA,TAWON
3,PUPUK NPK PHONSKA,,UNKNOWN,UNKNOWN,MAJEMUK,TAWON
4,PUPUK NPK FORMULA KHUSUS,,UNKNOWN,UNKNOWN,MAJEMUK,MUTIARA
5,PUPUK ORGANIK GRANUL,,ORGANIK,UNKNOWN,UREA,TAWON
6,PUPUK ORGANIK CAIR,,ORGANIK,UNKNOWN,UREA,MUTIARA
7,PRODUK LAIN,,UNKNOWN,UNKNOWN,MAJEMUK,MUTIARA
8,RONDAP,,UNKNOWN,UNKNOWN,MAJEMUK,MUTIARA
9,SEKOR,,UNKNOWN,UNKNOWN,MAJEMUK,MUTIARA


Based on the observation above, if `product_type_map` or `brand_map` is already filled by the keyword mapping, there is no need to use the model's result for that column.
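
Because the models were trained without an `UNKNOWN` class, they assign some type and brand to every name. One hedged option is to keep a prediction only when the model is reasonably confident. The sketch below uses a tiny made-up training set; `predict_with_threshold` and the 0.6 threshold are assumptions, not part of the original workflow:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set (not the real data)
names = ["UREA PETRO", "UREA PUSRI", "PUPUK UREA",
         "ZA PETRO", "ZA TAWON", "PUPUK ZA"]
labels = ["UREA", "UREA", "UREA", "ZA", "ZA", "ZA"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(names, labels)

def predict_with_threshold(model, texts, threshold=0.6):
    # Keep a prediction only when its top class probability reaches the threshold
    probabilities = model.predict_proba(texts)
    best = probabilities.argmax(axis=1)
    confident = probabilities.max(axis=1) >= threshold
    return np.where(confident, model.classes_[best], "UNKNOWN")

print(predict_with_threshold(model, ["UREA NITREA", "RONDAP"]))
```

A name with no known tokens, such as `RONDAP` here, falls back to the class priors and stays `UNKNOWN` instead of receiving an arbitrary label.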

In [63]:
# If the values in either `product_type_map` or `brand_map` are `UNKNOWN`, use the values from the models' result.
# Otherwise, use the initial mapping result.

product_pos['product_type_map'] = np.where(
    product_pos['product_type_map'] == 'UNKNOWN',
    product_pos['product_type_test'],
    product_pos['product_type_map']
)

product_pos['brand_map'] = np.where(
    product_pos['brand_map'] == 'UNKNOWN',
    product_pos['product_brand_test'],
    product_pos['brand_map']
)

In [64]:
product_pos.head(10)

Unnamed: 0,product_name,product_sku,product_type_map,brand_map,product_type_test,product_brand_test
0,PUPUK UREA N 46,,UREA,TAWON,UREA,TAWON
1,PUPUK AMONIUM SULFAT ZA,,ZA,TAWON,ZA,TAWON
2,PUPUK SUPER FOSFAT SP36,,FOSFAT,TAWON,ZA,TAWON
3,PUPUK NPK PHONSKA,,MAJEMUK,TAWON,MAJEMUK,TAWON
4,PUPUK NPK FORMULA KHUSUS,,MAJEMUK,MUTIARA,MAJEMUK,MUTIARA
5,PUPUK ORGANIK GRANUL,,ORGANIK,TAWON,UREA,TAWON
6,PUPUK ORGANIK CAIR,,ORGANIK,MUTIARA,UREA,MUTIARA
7,PRODUK LAIN,,MAJEMUK,MUTIARA,MAJEMUK,MUTIARA
8,RONDAP,,MAJEMUK,MUTIARA,MAJEMUK,MUTIARA
9,SEKOR,,MAJEMUK,MUTIARA,MAJEMUK,MUTIARA


In [65]:
product_pos.describe()

Unnamed: 0,product_name,product_sku,product_type_map,brand_map,product_type_test,product_brand_test
count,40021,23,40021,40021,40021,40021
unique,40021,23,10,7,5,6
top,PUPUK UREA N 46,MESTAC,MAJEMUK,MUTIARA,MAJEMUK,MUTIARA
freq,1,1,21906,32726,21986,32750


In [66]:
product_pos.info()

<class 'pandas.core.frame.DataFrame'>
Index: 40021 entries, 0 to 44001
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   product_name        40021 non-null  object
 1   product_sku         23 non-null     object
 2   product_type_map    40021 non-null  object
 3   brand_map           40021 non-null  object
 4   product_type_test   40021 non-null  object
 5   product_brand_test  40021 non-null  object
dtypes: object(6)
memory usage: 2.1+ MB


In [67]:
# Remove columns `product_type_test` & `product_brand_test` since they are no longer used
product_pos = product_pos.drop(['product_type_test', 'product_brand_test'], axis=1)

In [68]:
product_pos.head()

Unnamed: 0,product_name,product_sku,product_type_map,brand_map
0,PUPUK UREA N 46,,UREA,TAWON
1,PUPUK AMONIUM SULFAT ZA,,ZA,TAWON
2,PUPUK SUPER FOSFAT SP36,,FOSFAT,TAWON
3,PUPUK NPK PHONSKA,,MAJEMUK,TAWON
4,PUPUK NPK FORMULA KHUSUS,,MAJEMUK,MUTIARA


Since the product types & brands have been mapped, it is time to create a model to predict the product SKUs

In [69]:
# Create a new column that concatenates `product_name`, `product_type_map`, & `brand_map`, to be used as input for the model that predicts the product SKUs
product_pos['product_name_type_brand'] = product_pos['product_name'] + '-' + product_pos['product_type_map'] + '-' + product_pos['brand_map']

In [70]:
product_pos.head()

Unnamed: 0,product_name,product_sku,product_type_map,brand_map,product_name_type_brand
0,PUPUK UREA N 46,,UREA,TAWON,PUPUK UREA N 46-UREA-TAWON
1,PUPUK AMONIUM SULFAT ZA,,ZA,TAWON,PUPUK AMONIUM SULFAT ZA-ZA-TAWON
2,PUPUK SUPER FOSFAT SP36,,FOSFAT,TAWON,PUPUK SUPER FOSFAT SP36-FOSFAT-TAWON
3,PUPUK NPK PHONSKA,,MAJEMUK,TAWON,PUPUK NPK PHONSKA-MAJEMUK-TAWON
4,PUPUK NPK FORMULA KHUSUS,,MAJEMUK,MUTIARA,PUPUK NPK FORMULA KHUSUS-MAJEMUK-MUTIARA


In [71]:
product_pos[product_pos['product_sku'].notnull()].info()

<class 'pandas.core.frame.DataFrame'>
Index: 23 entries, 415 to 35401
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   product_name             23 non-null     object
 1   product_sku              23 non-null     object
 2   product_type_map         23 non-null     object
 3   brand_map                23 non-null     object
 4   product_name_type_brand  23 non-null     object
dtypes: object(5)
memory usage: 1.1+ KB


In [72]:
product_pos[product_pos['product_sku'].notnull()].describe()

Unnamed: 0,product_name,product_sku,product_type_map,brand_map,product_name_type_brand
count,23,23,23,23,23
unique,23,23,4,5,23
top,MESTAC,MESTAC,MAJEMUK,MUTIARA,MESTAC-UREA-LAOYING
freq,1,1,8,9,1


In [73]:
# Creating a dataframe as raw data to be used in the model training
df_map_sku = product_pos[product_pos['product_sku'].notnull()]

In [74]:
df_map_sku.head(10)

Unnamed: 0,product_name,product_sku,product_type_map,brand_map,product_name_type_brand
415,MESTAC,MESTAC,UREA,LAOYING,MESTAC-UREA-LAOYING
504,FERTIPHOS,FERTIPHOS,MAJEMUK,PAK TANI,FERTIPHOS-MAJEMUK-PAK TANI
637,UREA PETRO,UREA PETRO,UREA,TAWON,UREA PETRO-UREA-TAWON
734,ZA PETRO,ZA PETRO,ZA,TAWON,ZA PETRO-ZA-TAWON
740,MESTIKALI,MESTIKALI,MAJEMUK,MUTIARA,MESTIKALI-MAJEMUK-MUTIARA
4832,NITRALITE,NITRALITE,UREA,MUTIARA,NITRALITE-UREA-MUTIARA
4906,UREA PUSRI,UREA PUSRI,UREA,TAWON,UREA PUSRI-UREA-TAWON
4953,ZK PETRO,ZK PETRO,ZK,MUTIARA,ZK PETRO-ZK-MUTIARA
5138,ZA PAK TANI,ZA PAK TANI,ZA,PAK TANI,ZA PAK TANI-ZA-PAK TANI
5506,UREA NITREA,UREA NITREA,UREA,TAWON,UREA NITREA-UREA-TAWON


In [81]:
# # Split the data into training and testing sets
# X_train_sku, X_test_sku, y_train_sku, y_test_sku = train_test_split(df_map_sku['product_name'], df_map_sku['product_sku'], test_size=0.3, random_state=42)

# # Create a text classification pipeline
# model_sku = make_pipeline(TfidfVectorizer(), MultinomialNB())

# # Train the model
# model_sku.fit(X_train_sku, y_train_sku)

# # Make predictions on the test set
# predictions_sku = model_sku.predict(X_test_sku)

# # Evaluate the model
# print("Accuracy:", metrics.accuracy_score(y_test_sku, predictions_sku))
# print("Classification Report:\n", metrics.classification_report(y_test_sku, predictions_sku))

Honestly, I am stuck here. Only 23 of roughly 40,000 product names are mapped to product SKUs, and each of those 23 SKUs appears exactly once, so there is not enough labeled data to train and evaluate a SKU classifier. I still have no idea how to proceed from the current data situation.
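
One possible direction, offered only as a sketch: instead of training a classifier, compare each product name with the catalog SKUs by string similarity and keep the closest one above a cutoff. The example uses Python's standard `difflib`; the helper `closest_sku`, the 0.7 cutoff, and the name `UREA PETRO 50KG` are assumptions for illustration:

```python
import difflib

# A few SKUs from the catalog, for illustration
catalog_skus = ["UREA PETRO", "UREA PUSRI", "ZA PETRO", "MESTAC"]

def closest_sku(product_name, skus, cutoff=0.7):
    # Return the most similar SKU, or None if nothing reaches the cutoff
    matches = difflib.get_close_matches(product_name, skus, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest_sku("UREA PETRO 50KG", catalog_skus))  # UREA PETRO
print(closest_sku("RONDAP", catalog_skus))           # None
```

Matches found this way would still need manual review, and names with no close SKU could become candidates for new SKUs.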