# Harmonize data catalogs: Optimize operations by navigating POS variability

<table align="left">
  <td>
<a href="https://colab.research.google.com/github/carloabimanyu/dsw-data-challenge-2023/blob/master/notebook.ipynb" target='_blank'>
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
<a href="https://github.com/carloabimanyu/dsw-data-challenge-2023/blob/master/notebook.ipynb" target='_blank'>
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>       
</table>
<br/><br/><br/>

## Overview

This notebook demonstrates how to preprocess product-name text and compute similarity scores, using a TF-IDF vectorizer and cosine similarity, in order to harmonize a data catalog.

### Objective

Managing the data catalog supports the following objectives:
- Operational Efficiency
- Data Integrity & Quality
- Aiding Decision-Making

### Dataset
The project uses the following datasets:
1. POS data: the provided dataset of product names collected across multiple POS
2. Data catalog: the provided dataset of standardized product names with brand, type, and formula
3. External data collection: fertilizer catalogs collected from various sources

## Installation
Run the following command to clone the repository.

In [10]:
! git clone https://github.com/carloabimanyu/dsw-data-challenge-2023.git

Install the `sparse-dot-topn` package.

In [None]:
! pip install sparse-dot-topn

### Import library and define constants

In [1]:
colab_path = '/content/dsw-data-challenge-2023/'

import sys
sys.path.append('./')
sys.path.append(colab_path)

import re
import numpy as np
import pandas as pd

from src import utils
from src.product import Product
from src.preprocessing import preprocessing_catalog, preprocessing_pos, preprocessing_external

config = utils.load_config()

# Uncomment the lines below when running in Colab
# config['catalog_data_path'] = colab_path + config['catalog_data_path']
# config['pos_data_path'] = colab_path + config['pos_data_path']
# config['external_data_path'] = colab_path + config['external_data_path']

### Load dataset

In [3]:
catalog = pd.read_excel(config['catalog_data_path'], sheet_name=config['catalog_data_sheet'])
catalog = preprocessing_catalog.preprocessing(catalog)

pos = pd.read_excel(config['pos_data_path'], sheet_name=config['pos_data_sheet'])
pos = pos.dropna()
pos['Product Name'] = pos['Product Name'].apply(lambda name: Product(name))
pos = preprocessing_pos.preprocessing(pos)

external = pd.read_csv(config['external_data_path'])
external['Nama Produk'] = external['Nama Produk'].apply(lambda name: Product(name))
external = preprocessing_external.preprocessing(external, config)

In [4]:
catalog.head()

| | Product SKU | Brand | Type | Formula |
|---|---|---|---|---|
| 0 | Urea Petro | PIHC | Urea | |
| 1 | Urea PIM | PIHC | Urea | |
| 2 | Urea Nitrea | PIHC | Urea | |
| 3 | Urea Daun Buah | PIHC | Urea | |
| 4 | Urea Pusri | PIHC | Urea | |


In [5]:
pos.head()

| | Product Name | Name | Formula | Metrics |
|---|---|---|---|---|
| 0 | Pupuk Urea N 46% | Pupuk Urea N | | 46% |
| 1 | Pupuk Amonium Sulfat ZA | Pupuk Amonium Sulfat ZA | | |
| 2 | Pupuk Super Fosfat SP-36 | Pupuk Super Fosfat SP36 | | |
| 3 | Pupuk NPK Phonska | Pupuk NPK Phonska | | |
| 4 | Pupuk NPK Formula Khusus | Pupuk NPK Formula Khusus | | |
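
The `Name`, `Formula`, and `Metrics` columns in this output come from `Product` and `preprocessing_pos`, whose source is not shown in this notebook. As a rough illustration of the idea only, the sketch below splits a raw product name with two regular expressions. The function `split_product_name` and both patterns are assumptions made for this example, not the project's implementation, and the sketch does not normalize tokens such as `SP-36`.

```python
import re

# Hypothetical patterns, chosen for illustration only
FORMULA_RE = re.compile(r"\b\d{1,2}(?:-\d{1,2}){2,}\b")  # e.g. 15-15-15
METRIC_RE = re.compile(r"\b\d+(?:\.\d+)?\s*%")            # e.g. 46%

def split_product_name(raw):
    """Split a raw product name into name, formula, and metrics parts."""
    formula = FORMULA_RE.search(raw)
    metric = METRIC_RE.search(raw)
    name = raw
    # Remove whatever was recognized so only the descriptive name remains
    for match in (formula, metric):
        if match:
            name = name.replace(match.group(0), " ")
    name = re.sub(r"\s+", " ", name).strip()
    return {
        "name": name,
        "formula": formula.group(0) if formula else None,
        "metrics": metric.group(0) if metric else None,
    }

parsed_urea = split_product_name("Pupuk Urea N 46%")
parsed_npk = split_product_name("NPK 16-16-16 YARAMILA UNIK")
```

Separating the numeric parts from the descriptive name lets the later similarity step compare names and formulas on their own terms.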


In [6]:
external.head()

| | Brand | Product SKU | Formula | Type |
|---|---|---|---|---|
| 0 | Yara | NPK 15-09-20 YARAMILA WINNER | 15-09-20 | Majemuk |
| 1 | Yara | NPK 16-16-16 YARAMILA UNIK | 16-16-16 | Majemuk |
| 2 | Yara | NPK 25-7-7 YARAMILA FASTER | 25-7-7 | Majemuk |
| 3 | Yara | NPK YARAMILA COMPLEX | | Others |
| 4 | Yara | NPK YARAMILA PALMAE | | Others |


## Calculate similarity
### sparse_dot_topn

In [7]:
# Combine the provided catalog with the externally collected catalog.
# ignore_index=True already produces a fresh 0..n-1 index.
all_catalog = pd.concat([catalog, external], ignore_index=True)

In [8]:
# Matching text: the name alone, or the name immediately followed by the formula when one exists
pos['Text'] = pos.apply(lambda row: row['Name'] if pd.isnull(row['Formula']) else f'{row["Name"]}{row["Formula"]}', axis=1)

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix
from src.similarity import spdt
from src.similarity.ngrams import ngrams

In [11]:
# Stack catalog SKUs and POS texts into one Series so both share a single TF-IDF vocabulary
products = pd.concat([all_catalog['Product SKU'], pos['Text']], ignore_index=True)

vectorizer = TfidfVectorizer(min_df=2, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(products)
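
The `ngrams` analyzer passed to `TfidfVectorizer` lives in `src.similarity.ngrams` and is not shown here. A common choice for fuzzy name matching is to split each string into overlapping character n-grams, so that spelling and spacing variants still share most of their features. The sketch below is a minimal version of that idea; the function name `char_ngrams`, the default `n=3`, and the cleaning rules are assumptions for illustration and may differ from the project's analyzer.

```python
import re

def char_ngrams(text, n=3):
    """Split a string into overlapping character n-grams (hypothetical helper)."""
    # Lowercase and turn common separators into spaces before slicing
    cleaned = re.sub(r"[,\-./]", " ", str(text).lower())
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Slide a window of width n across the cleaned string
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]

grams_a = char_ngrams("SP-36 Petro")
grams_b = char_ngrams("Petro Sp 36")
shared = set(grams_a) & set(grams_b)  # n-grams the two spellings have in common
```

A callable like this can be passed as `analyzer=` to `TfidfVectorizer`, which then builds its vocabulary from the returned n-grams instead of whole words.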

In [12]:
matches = spdt.awesome_cossim_top(
    tf_idf_matrix,
    tf_idf_matrix.transpose(),
    ntop=10,
    lower_bound=0.6
)
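
`TfidfVectorizer` L2-normalizes each row by default, so the dot product of the matrix with its transpose yields cosine similarities. The call above appears to keep, for each row, at most `ntop` of those scores and to discard scores that do not clear `lower_bound`. The dense NumPy sketch below illustrates that computation on a tiny example. The function `cossim_top_dense` is a hypothetical reference written for this explanation, not the `spdt` implementation, and it would be far too slow and memory-hungry for a real catalog, which is why a sparse routine is used instead.

```python
import numpy as np

def cossim_top_dense(a, b, ntop, lower_bound=0.0):
    """Dense reference for a top-n cosine similarity search (illustrative only).

    Assumes the rows of `a` and the columns of `b` are L2-normalized,
    so each dot product equals a cosine similarity.
    """
    scores = a @ b                                        # all pairwise similarities
    scores = np.where(scores > lower_bound, scores, 0.0)  # zero out weak matches
    result = np.zeros_like(scores)
    for i, row in enumerate(scores):
        top = np.argsort(row)[::-1][:ntop]                # indices of the ntop largest scores
        result[i, top] = row[top]
    return result

# Three unit-length vectors standing in for TF-IDF rows
unit_rows = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
top_scores = cossim_top_dense(unit_rows, unit_rows.T, ntop=2, lower_bound=0.5)
```

Each row of `top_scores` keeps its two strongest matches, and the diagonal holds the self-similarity of 1 that every item has with itself.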

In [17]:
matches_df = spdt.get_matches_df(matches, products, top=200)
# Drop (near-)perfect scores: each product matches itself, and exact duplicates score 1
matches_df = matches_df[matches_df['similarity'] < 0.9999]
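
The helper `spdt.get_matches_df` is not shown in this notebook. One plausible way to turn a sparse similarity matrix into a table of pairs is to read its coordinate (COO) form, which lists the row index, column index, and value of every stored score. The sketch below assumes that approach; `matches_to_frame` and its `top` parameter (here simply a cap on the number of rows returned) are hypothetical and may not match the project's helper.

```python
import pandas as pd
from scipy.sparse import csr_matrix

def matches_to_frame(sparse_matrix, names, top=None):
    """Flatten a sparse similarity matrix into (left, right, score) rows."""
    coo = sparse_matrix.tocoo()  # parallel arrays of row, col, and data
    frame = pd.DataFrame({
        "left_side": [names[i] for i in coo.row],
        "right_side": [names[j] for j in coo.col],
        "similarity": coo.data,
    })
    return frame if top is None else frame.head(top)

demo_names = ["SP-36 Petro", "Petro Sp 36", "Urea Pusri"]
demo_scores = csr_matrix([[1.0, 0.87, 0.0],
                          [0.87, 1.0, 0.0],
                          [0.0, 0.0, 1.0]])
demo_df = matches_to_frame(demo_scores, demo_names)
# Same idea as the `similarity < 0.9999` filter: remove self-matches
demo_df = demo_df[demo_df["similarity"] < 0.9999]
```

After filtering, only the two cross-matches between `SP-36 Petro` and `Petro Sp 36` remain, one for each direction.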

In [18]:
matches_df.sample(10)

| | left_side | right_side | similarity |
|---|---|---|---|
| 180 | Petro Niphos 20-20+13S | Petro Niphos | 0.607072 |
| 103 | SP-36 Petro | Petro Sp 36 | 0.865918 |
| 159 | NPK Kebomas 12-6-22+3Mg | PUPUK NPK KEBOMAS 12-6-22+3Mg | 0.923851 |
| 155 | Phonska Plus 15-15-15+9S+0.2Zn | Phonska Plus 15 15 15 | 0.765986 |
| 167 | NPK Kebomas 12-6-22+3Mg | Kebomas 12-6-22 | 0.672364 |
| 162 | NPK Kebomas 12-6-22+3Mg | NPK KEBOMAS 12-6-22 | 0.726857 |
| 111 | Rock Phosphate PETRO | Rock Phosphate | 0.882283 |
| 78 | ZA Plus Petro | Za Plus Petro 50 | 0.886411 |
| 150 | Phonska Plus 15-15-15+9S+0.2Zn | Phonska plus 15 15 15 | 0.765986 |
| 105 | SP-36 Petro | SP 36 PETRO CASH | 0.782065 |


In [22]:
all_catalog.to_pickle('data/processed/AllCatalog.pkl')