# Harmonize data catalogs: Optimize operations by navigating POS variability

<table align="left">
  <td>
<a href="https://colab.research.google.com/github/carloabimanyu/dsw-data-challenge-2023/blob/master/notebook.ipynb" target='_blank'>
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
<a href="https://github.com/carloabimanyu/dsw-data-challenge-2023/blob/master/notebook.ipynb" target='_blank'>
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>       
</table>
<br/><br/><br/>

## Overview

This notebook demonstrates how to preprocess product-name text and compute similarity scores, using a TF-IDF vectorizer and cosine similarity, in order to manage a data catalog.

### Objective

Managing the data catalog supports the following objectives:
- Operational Efficiency
- Data Integrity & Quality
- Aiding Decision-Making

### Dataset
The datasets used in this project are:
1. POS data: the provided dataset containing product names across multiple POS
2. Data catalog: the provided dataset containing standardized product names, brands, types, and formulas
3. External data collection: a fertilizer catalog collected from various sources

## Installation
Run the following command to clone the repository.

In [10]:
! git clone https://github.com/carloabimanyu/dsw-data-challenge-2023.git

Install sparse_dot_topn.

In [None]:
! pip install sparse-dot-topn

### Import library and define constants

In [2]:
colab_path = '/content/dsw-data-challenge-2023/'

import sys
sys.path.append('./')
sys.path.append(colab_path)

import re
import numpy as np
import pandas as pd

from src import utils
from src.product import Product
from src.preprocessing import preprocessing_catalog, preprocessing_pos, preprocessing_external

config = utils.load_config()

# Uncomment the lines below when running in Colab
# config['catalog_data_path'] = colab_path + config['catalog_data_path']
# config['pos_data_path'] = colab_path + config['pos_data_path']
# config['external_data_path'] = colab_path + config['external_data_path']

### Load dataset

In [3]:
catalog = pd.read_excel(config['catalog_data_path'], sheet_name=config['catalog_data_sheet'])
catalog = preprocessing_catalog.preprocessing(catalog)

pos = pd.read_excel(config['pos_data_path'], sheet_name=config['pos_data_sheet'])
pos = pos.dropna()
pos['Product Name'] = pos['Product Name'].apply(lambda name: Product(name))
pos = preprocessing_pos.preprocessing(pos)

external = pd.read_csv(config['external_data_path'])
external['Nama Produk'] = external['Nama Produk'].apply(lambda name: Product(name))
external = preprocessing_external.preprocessing(external, config)

In [4]:
catalog.head()

Unnamed: 0,Product SKU,Brand,Type,Formula
0,Urea Petro,PIHC,Urea,
1,Urea PIM,PIHC,Urea,
2,Urea Nitrea,PIHC,Urea,
3,Urea Daun Buah,PIHC,Urea,
4,Urea Pusri,PIHC,Urea,


In [5]:
pos.head()

Unnamed: 0,Product Name,Name,Formula,Metrics
0,Pupuk Urea N 46%,Pupuk Urea N,,46%
1,Pupuk Amonium Sulfat ZA,Pupuk Amonium Sulfat ZA,,
2,Pupuk Super Fosfat SP-36,Pupuk Super Fosfat SP36,,
3,Pupuk NPK Phonska,Pupuk NPK Phonska,,
4,Pupuk NPK Formula Khusus,Pupuk NPK Formula Khusus,,


In [6]:
external.head()

Unnamed: 0,Brand,Product SKU,Formula,Type
0,Yara,NPK 15-09-20 YARAMILA WINNER,15-09-20,Majemuk
1,Yara,NPK 16-16-16 YARAMILA UNIK,16-16-16,Majemuk
2,Yara,NPK 25-7-7 YARAMILA FASTER,25-7-7,Majemuk
3,Yara,NPK YARAMILA COMPLEX,,Others
4,Yara,NPK YARAMILA PALMAE,,Others


## Calculate similarity
### sparse_dot_topn

In [7]:
# ignore_index=True already produces a fresh 0..n-1 index
all_catalog = pd.concat([catalog, external], ignore_index=True)

In [8]:
pos['Text'] = pos.apply(lambda row: row['Name'] if pd.isnull(row['Formula']) else f'{row["Name"]}{row["Formula"]}', axis=1)

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix
from src.similarity import spdt
from src.similarity.ngrams import ngrams

In [10]:
# Stack catalog SKUs and POS names into one Series (ignore_index=True renumbers it)
products = pd.concat([all_catalog['Product SKU'], pos['Text']], axis=0, ignore_index=True)

vectorizer = TfidfVectorizer(min_df=2, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(products)
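
The `ngrams` analyzer imported from `src.similarity.ngrams` is not shown in this notebook. As a rough illustration of what a character n-gram analyzer for `TfidfVectorizer` could look like, here is a minimal sketch. The function name `char_ngrams`, the cleanup rules, and the default `n=3` are assumptions for illustration and may differ from the project's implementation.

```python
import re


def char_ngrams(text, n=3):
    """Hypothetical analyzer: split text into overlapping character n-grams."""
    # Lowercase and replace punctuation that often varies between POS entries
    cleaned = re.sub(r"[,\-./]", " ", str(text).lower())
    # Collapse repeated whitespace so "MUTIARA    16" and "Mutiara 16" agree
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Slide a window of width n over the cleaned string
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


print(char_ngrams("Urea PIM"))
print(char_ngrams("SP-36") == char_ngrams("sp 36"))
```

Under these assumed cleanup rules, spelling variants that differ only in case, hyphens, or spacing produce identical n-gram lists, so their TF-IDF vectors are identical as well.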

In [11]:
matches = spdt.awesome_cossim_top(
    tf_idf_matrix,
    tf_idf_matrix.transpose(),
    ntop=1000,
    lower_bound=0.6
)
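
To make the top-n step easier to follow, the sketch below reproduces the idea with plain NumPy and SciPy on a tiny matrix. It assumes that `spdt.awesome_cossim_top` multiplies the two matrices and keeps, per row, at most `ntop` scores above `lower_bound`. The helper name `cossim_top_dense` is hypothetical, and the dense intermediate makes it suitable for small inputs only.

```python
import numpy as np
from scipy.sparse import csr_matrix


def cossim_top_dense(A, B, ntop, lower_bound=0.0):
    """Naive top-n of A @ B per row; illustration only, not memory efficient."""
    scores = (A @ B).toarray()
    rows, cols, vals = [], [], []
    for i, row in enumerate(scores):
        # Column indices of the ntop largest scores in this row, highest first
        for j in np.argsort(-row)[:ntop]:
            if row[j] > lower_bound:
                rows.append(i)
                cols.append(j)
                vals.append(row[j])
    return csr_matrix((vals, (rows, cols)), shape=scores.shape)


# Rows are unit-length, so the dot product equals the cosine similarity
A = csr_matrix(np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]))
sims = cossim_top_dense(A, A.T, ntop=2, lower_bound=0.5)
print(sims.toarray())
```

Because `TfidfVectorizer` L2-normalizes each row by default, multiplying the TF-IDF matrix by its transpose gives cosine similarities directly, which is why no separate normalization step appears in the cell above.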

In [12]:
matches_df = spdt.get_matches_df(matches, products, top=200)
# Drop perfect matches (similarity ~1.0), which include every name matched with itself
matches_df = matches_df[matches_df['similarity'] < 0.9999]
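
`spdt.get_matches_df` is defined in the repository and not shown here. A minimal sketch of how a sparse similarity matrix can be unpacked into a `left_side` / `right_side` / `similarity` table follows. The function name `matches_to_frame` and the meaning given to `top` (keep the first `top` stored entries) are assumptions, not the project's confirmed behavior.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix


def matches_to_frame(sparse_matrix, names, top=None):
    """Hypothetical helper: one row per stored (row, column, score) entry."""
    coo = sparse_matrix.tocoo()
    values = np.asarray(names)
    frame = pd.DataFrame({
        "left_side": values[coo.row],
        "right_side": values[coo.col],
        "similarity": coo.data,
    })
    # Assumed meaning of `top`: keep only the first `top` stored entries
    return frame if top is None else frame.head(top)


names = pd.Series(["Urea Petro", "UREA PETRO NS"])
toy = csr_matrix(np.array([[1.0, 0.7], [0.0, 1.0]]))
frame = matches_to_frame(toy, names)
# Same filter as in the notebook: discard self-matches and exact duplicates
near = frame[frame["similarity"] < 0.9999]
print(near)
```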

In [13]:
matches_df.sample(10)

Unnamed: 0,left_side,right_side,similarity
129,Urea PIM,Urea,0.622444
43,Urea Petro,Urea NS Petro,0.731097
74,Urea Petro,urea,0.629945
120,Urea PIM,Urea,0.622444
28,Urea Petro,UREA PETRO NS,0.812807
36,Urea Petro,pupuk urea petro,0.74206
39,Urea Petro,Urea NS Petro,0.731097
114,Urea PIM,Nitrea pim,0.702142
35,Urea Petro,PUPUK UREA PETRO,0.74206
108,Urea PIM,urea pt pim,0.742819


In [14]:
matches_df.shape

(155, 3)

In [15]:
products.to_pickle('data/processed/catalog.pkl')

In [16]:
products.sample(n=10)

7104              Greta 
5697     OBAT TIKUS CAIR
17600         amistattof
42169           SP36 NS 
6075          sawi 10 gr
10152       Terminal 3L 
8995           NPK 3068 
39487         bitop 1 lt
9178              Lavoro
31748     talang karpet 
dtype: object

In [20]:
user_input = 'MUTIARA 16-16-16'
all_products = pd.concat([pd.Series([user_input]), products], ignore_index=True)

vectorizer = TfidfVectorizer(min_df=2, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(all_products)

matches = spdt.awesome_cossim_top(
    tf_idf_matrix,
    tf_idf_matrix.transpose(),
    ntop=10,
    lower_bound=0
)

matches_df = spdt.get_matches_df(matches, all_products, top=10)
# matches_df = matches_df[matches_df['similarity'] < 0.9999]
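
The cell above refits the vectorizer on every product each time a new query arrives. One possible alternative, sketched below, is to fit once on the catalog and only transform the query. This sketch uses scikit-learn's built-in character analyzer as a stand-in for the project's `ngrams` function, a toy product list, and a hypothetical helper `top_matches`; `min_df` is omitted because the toy data is too small for it.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the combined catalog and POS names
toy_products = pd.Series([
    "MUTIARA 16-16-16", "Mutiara 16-16-16", "Urea Petro", "Urea PIM", "ZA Petro",
])

# Fit once on the catalog; character trigrams stand in for the custom analyzer
toy_vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
catalog_matrix = toy_vectorizer.fit_transform(toy_products)


def top_matches(query, k=3):
    """Score one query against the fitted catalog and return the k best rows."""
    query_vec = toy_vectorizer.transform([query])
    # Rows are L2-normalized, so the dot product is the cosine similarity
    scores = (query_vec @ catalog_matrix.T).toarray().ravel()
    order = np.argsort(-scores)[:k]
    return pd.DataFrame({
        "right_side": toy_products.iloc[order].to_numpy(),
        "similarity": scores[order],
    })


result = top_matches("mutiara 16-16-16", k=2)
print(result)
```

With this design the IDF weights come from the catalog alone, so a query never changes the scores of other queries. N-grams in the query that the catalog has never seen are simply ignored by `transform`.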

In [21]:
matches

<44358x44358 sparse matrix of type '<class 'numpy.float64'>'
	with 443173 stored elements in Compressed Sparse Row format>

In [22]:
matches_df[matches_df['left_side'] == user_input]

Unnamed: 0,left_side,right_side,similarity
0,MUTIARA 16-16-16,MUTIARA 16-16-16,1.0
1,MUTIARA 16-16-16,MUTIARA 16-16-16,1.0
2,MUTIARA 16-16-16,Mutiara 16-16-16,1.0
3,MUTIARA 16-16-16,Mutiara 16-16-16,1.0
4,MUTIARA 16-16-16,Mutiara 16 16 16,1.0
5,MUTIARA 16-16-16,Mutiara 16 16 16,1.0
6,MUTIARA 16-16-16,Mutiara 16-16-16,1.0
7,MUTIARA 16-16-16,Mutiara 16 16 16,1.0
8,MUTIARA 16-16-16,mutiara 16 16-16,1.0
9,MUTIARA 16-16-16,Mutiara 16-16-16,1.0


In [23]:
matches_df['left_side'].nunique()

1

In [25]:
matches_df.iloc[0]

left_side        MUTIARA 16-16-16
right_side    MUTIARA    16-16-16
similarity                    1.0
Name: 0, dtype: object

In [146]:
matches_df.size

300

In [28]:
products

0                    Urea Petro
1                      Urea PIM
2                   Urea Nitrea
3                Urea Daun Buah
4                    Urea Pusri
                  ...          
44352        Extra one 680 EC  
44353        Extra One 680 SC  
44354           JARING ARWANA  
44355          Terong Puma F1  
44356    Terong Liberto Hijau  
Length: 44357, dtype: object

In [30]:
products[products == 'MUTIARA 16-16-16']

7849    MUTIARA 16-16-16
dtype: object

In [34]:
all_catalog[all_catalog['Brand'] == 'PIHC'].head(40)

Unnamed: 0,Product SKU,Brand,Type,Formula
0,Urea Petro,PIHC,Urea,
1,Urea PIM,PIHC,Urea,
2,Urea Nitrea,PIHC,Urea,
3,Urea Daun Buah,PIHC,Urea,
4,Urea Pusri,PIHC,Urea,
5,Nitralite,PIHC,Nitrogen,
6,ZA Petro,PIHC,ZA,
7,ZA Plus Petro,PIHC,ZA,
8,ZK Petro,PIHC,ZK,
9,Petro-CAS,PIHC,Mikro,
