# Harmonize data catalogs: Optimize operations by navigating POS variability

<table align="left">
  <td>
<a href="https://colab.research.google.com/github/carloabimanyu/dsw-data-challenge-2023/blob/master/notebook.ipynb" target='_blank'>
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
<a href="https://github.com/carloabimanyu/dsw-data-challenge-2023/blob/master/notebook.ipynb" target='_blank'>
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>       
</table>
<br/><br/><br/>

## Overview

This notebook demonstrates how to preprocess product names and score their similarity with a TF-IDF vectorizer and cosine similarity in order to manage a data catalog.

### Objective

Managing the data catalog serves the following objectives:
- Operational Efficiency
- Data Integrity & Quality
- Aiding Decision-Making

### Dataset
This project uses the following datasets:
1. POS data: the provided dataset of product names as recorded across multiple points of sale (POS)
2. Data catalog: the provided dataset of standardized product names with brand, type, and formula
3. External data collection: fertilizer catalogs collected from various sources

## Installation
Run the following command to clone the repository.

In [10]:
! git clone https://github.com/carloabimanyu/dsw-data-challenge-2023.git

Install sparse_dot_topn.

In [None]:
! pip install sparse-dot-topn

### Import libraries and define constants

In [36]:
colab_path = '/content/dsw-data-challenge-2023/'

import sys
sys.path.append('./')
sys.path.append(colab_path)

import re
import numpy as np
import pandas as pd

from src import utils
from src.product import Product
from src.preprocessing import preprocessing_catalog, preprocessing_pos, preprocessing_external

config = utils.load_config()

# Uncomment the following lines when running in Colab
# config['catalog_data_path'] = colab_path + config['catalog_data_path']
# config['pos_data_path'] = colab_path + config['pos_data_path']
# config['external_data_path'] = colab_path + config['external_data_path']
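
The manual toggle above can also be automated. The following is a minimal sketch, not part of the repository: `running_in_colab` and `resolve_paths` are hypothetical helpers, the check for `google.colab` in `sys.modules` is a common heuristic rather than an official API, and the file name in the example is made up.

```python
import sys

# Config entries that hold relative data paths (taken from the cell above).
PATH_KEYS = ("catalog_data_path", "pos_data_path", "external_data_path")


def running_in_colab() -> bool:
    # Heuristic: the google.colab package is loaded in Colab kernels.
    return "google.colab" in sys.modules


def resolve_paths(config: dict, prefix: str, keys=PATH_KEYS) -> dict:
    """Return a copy of config with prefix prepended to the selected path entries."""
    resolved = dict(config)
    for key in keys:
        if key in resolved:
            resolved[key] = prefix + resolved[key]
    return resolved


# Hypothetical config entry, for illustration only.
example = {"catalog_data_path": "data/catalog.xlsx", "other_setting": 1}
if running_in_colab():
    example = resolve_paths(example, "/content/dsw-data-challenge-2023/")
print(example)
```

Returning a copy instead of mutating `config` keeps the cell safe to re-run, since the prefix is never applied twice to the same dictionary.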

### Load dataset

In [37]:
catalog = utils.pickle_load(config['catalog_data_processed_path'])
pos = utils.pickle_load(config['pos_data_processed_path'])
external = utils.pickle_load(config['external_data_processed_path'])
catalog_external = utils.pickle_load(config['catalog_external_processed_path'])
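
The body of `utils.pickle_load` is not shown in this notebook. As a rough sketch of what such a helper might look like, assuming it is a thin wrapper around the standard `pickle` module (`pickle_dump` and the sample record are illustrative only):

```python
import pickle
import tempfile
from pathlib import Path


def pickle_dump(obj, path):
    """Serialize obj to path with the highest available pickle protocol."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def pickle_load(path):
    """Read back an object written by pickle_dump. Only load files you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round trip with a small stand-in record (not the real processed data).
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "catalog.pkl"
    pickle_dump([{"Product SKU": "Urea Petro", "Brand": "PIHC", "Type": "Urea"}], target)
    restored = pickle_load(target)

print(restored)
```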

In [38]:
catalog.head()

|   | Product SKU | Brand | Type | Formula |
|---|---|---|---|---|
| 0 | Urea Petro | PIHC | Urea | |
| 1 | Urea PIM | PIHC | Urea | |
| 2 | Urea Nitrea | PIHC | Urea | |
| 3 | Urea Daun Buah | PIHC | Urea | |
| 4 | Urea Pusri | PIHC | Urea | |


In [39]:
pos.head()

|   | Product SKU | Brand | Type | Formula | Metrics | Full Name |
|---|---|---|---|---|---|---|
| 0 | Pupuk Urea N | | | | 46% | Pupuk Urea N 46% |
| 1 | Pupuk Amonium Sulfat ZA | | | | | Pupuk Amonium Sulfat ZA |
| 2 | Pupuk Super Fosfat SP36 | | | | | Pupuk Super Fosfat SP-36 |
| 3 | Pupuk NPK Phonska | | | | | Pupuk NPK Phonska |
| 4 | Pupuk NPK Formula Khusus | | | | | Pupuk NPK Formula Khusus |


In [40]:
external.head()

|   | Product SKU | Brand | Type | Formula |
|---|---|---|---|---|
| 0 | NPK BOOSTER PREMIUM | DGW/Hextar | Others | |
| 1 | HX - NITRO | DGW/Hextar | Others | |
| 2 | KNO3 CRYSTAL | DGW/Hextar | Others | |
| 3 | KNO3 PRILL | DGW/Hextar | Others | |
| 4 | CAKRA PANDAWA DAPS | DGW/Hextar | Others | |


In [41]:
catalog_external.head()

|   | Product SKU | Brand | Type | Formula |
|---|---|---|---|---|
| 0 | Urea Petro | PIHC | Urea | |
| 1 | Urea PIM | PIHC | Urea | |
| 2 | Urea Nitrea | PIHC | Urea | |
| 3 | Urea Daun Buah | PIHC | Urea | |
| 4 | Urea Pusri | PIHC | Urea | |


## Calculate similarity
### sparse_dot_topn

In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix
from src.similarity import spdt
from src.similarity.ngrams import ngrams
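
The `ngrams` analyzer imported above lives in the repository and is not reproduced here. As an illustration of the general idea, a character n-gram analyzer might look like the sketch below. The name `char_ngrams`, the lowercasing, the punctuation stripping, and the default `n=3` are all assumptions, not the repository's implementation.

```python
import re
from typing import List


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Split text into overlapping character n-grams after light cleaning."""
    # Assumed cleaning: lowercase, keep only letters, digits, and spaces.
    cleaned = re.sub(r"[^a-z0-9 ]+", "", text.lower())
    # Collapse runs of whitespace so spacing differences do not create new n-grams.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


print(char_ngrams("Urea PIM"))
# ['ure', 'rea', 'ea ', 'a p', ' pi', 'pim']
```

Character n-grams tolerate small spelling and spacing differences better than whole-word tokens, which suits product names typed inconsistently across POS systems.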

In [43]:
data = pd.concat(
    [
        catalog_external,
        pos
    ], ignore_index=True
)

In [45]:
# min_df=2 drops n-grams that appear in fewer than two product names
vectorizer = TfidfVectorizer(min_df=2, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(data['Product SKU'])

In [46]:
matches = spdt.awesome_cossim_top(
    tf_idf_matrix,
    tf_idf_matrix.transpose(),
    ntop=1000,
    lower_bound=0.6
)
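
Conceptually, the call above multiplies the TF-IDF matrix by its transpose and keeps, for each row, only the strongest scores above a threshold. Because `TfidfVectorizer` L2-normalizes rows by default, each dot product is a cosine similarity. The sketch below shows the same idea in plain Python on three names taken from this notebook. It uses raw trigram counts instead of TF-IDF weights, so its scores will not match the notebook's, and `char_ngrams`, `l2_normalize`, and `top_matches` are hypothetical names.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    # Assumed analyzer: lowercase character trigrams.
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def l2_normalize(counts):
    """Scale a sparse vector (dict) to unit length."""
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {k: v / norm for k, v in counts.items()} if norm else {}


def top_matches(vectors, ntop=2, lower_bound=0.6):
    """For each vector, keep up to ntop (index, similarity) pairs above lower_bound."""
    results = []
    for left in vectors:
        scored = []
        for j, right in enumerate(vectors):
            # Dot product of unit vectors equals cosine similarity.
            sim = sum(w * right.get(term, 0.0) for term, w in left.items())
            if sim > lower_bound:
                scored.append((j, sim))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        results.append(scored[:ntop])
    return results


names = ["Urea Petro", "ZA Petro", "Urea"]
vectors = [l2_normalize(Counter(char_ngrams(name))) for name in names]
matches_sketch = top_matches(vectors, ntop=2, lower_bound=0.6)
for name, row in zip(names, matches_sketch):
    print(name, [(names[j], round(sim, 3)) for j, sim in row])
```

This brute-force loop compares every pair, which is fine for a handful of names. The sparse top-n product is used in the notebook because it avoids materializing the full similarity matrix.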

In [47]:
matches_df = spdt.get_matches_df(matches, data['Product SKU'], top=200)
# Drop (near-)exact matches, which include each name matched with itself
matches_df = matches_df[matches_df['similarity'] < 0.9999]

In [48]:
matches_df.sample(10)

|   | left_side | right_side | similarity |
|---|---|---|---|
| 127 | Urea PIM | urea | 0.621515 |
| 57 | Urea Petro | ZA Petro | 0.652933 |
| 170 | Urea Nitrea | Urea Nitrea Prill | 0.832967 |
| 140 | Urea PIM | Urea | 0.621515 |
| 169 | Urea Nitrea | Urea Nitrea Prill | 0.832967 |
| 82 | Urea Petro | Urea | 0.62971 |
| 112 | Urea PIM | Nitrea pim | 0.702758 |
| 128 | Urea PIM | Urea | 0.621515 |
| 60 | Urea Petro | ZA Petro | 0.652933 |
| 157 | Urea Nitrea | Urea Nitrea NS | 0.904624 |
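
The sample above lists some pairs more than once (for example Urea Petro / ZA Petro, and Urea PIM against both "urea" and "Urea"), presumably because the same product name occurs in several POS rows. One possible cleanup, sketched here on rows copied from the sample, is to collapse pairs that are identical up to letter case and order. `dedupe_pairs` is a hypothetical helper, not part of the repository.

```python
def dedupe_pairs(rows):
    """Keep the first occurrence of each pair, ignoring letter case and pair order."""
    seen = set()
    unique = []
    for left, right, similarity in rows:
        # frozenset makes (a, b) and (b, a) map to the same key.
        key = frozenset((left.casefold(), right.casefold()))
        if key in seen:
            continue
        seen.add(key)
        unique.append((left, right, similarity))
    return unique


# Rows copied from the sample output above.
rows = [
    ("Urea PIM", "urea", 0.621515),
    ("Urea Petro", "ZA Petro", 0.652933),
    ("Urea PIM", "Urea", 0.621515),
    ("Urea Petro", "ZA Petro", 0.652933),
]
deduped = dedupe_pairs(rows)
for row in deduped:
    print(row)
```

Whether case variants should be merged depends on how the harmonized catalog is meant to be used, so treat the case folding as one option rather than a requirement.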
