# Harmonize data catalogs: Optimize operations by navigating POS variability

<table align="left">
  <td>
<a href="https://colab.research.google.com/github/carloabimanyu/dsw-data-challenges-2023/blob/main/notebook.ipynb" target='_blank'>
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
<a href="https://github.com/carloabimanyu/dsw-data-challenges-2023/blob/main/notebook.ipynb" target='_blank'>
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>       
</table>
<br/><br/><br/>

## Overview

This notebook demonstrates how to preprocess product-name text and compute similarity, using a TF-IDF vectorizer over n-grams and a cosine similarity measure, in order to match point-of-sale (POS) product names against a data catalog.

### Objective

Managing the data catalog supports the following objectives:
- Operational Efficiency
- Data Integrity & Quality
- Aiding Decision-Making

### Dataset
The datasets used in this project are:
1. POS data: the provided dataset containing product names from multiple POS systems
2. Data catalog: the provided dataset containing standardized product names, brands, types, and formulas
3. External data: fertilizer catalogs collected from various sources

## Installation
Run the following command to clone the repository.

In [10]:
# ! git clone https://github.com/carloabimanyu/dsw-data-challenge-2023.git

### Import library and define constants

In [1]:
import sys
sys.path.append('./')

import re
import numpy as np
import pandas as pd

from src import utils
from src.product import Product
from src.preprocessing import preprocessing_catalog, preprocessing_pos, preprocessing_external

config = utils.load_config()

### Load dataset

In [2]:
# Standardized catalog: read the configured sheet, then clean it.
catalog = pd.read_excel(config['catalog_data_path'], sheet_name=config['catalog_data_sheet'])
catalog = preprocessing_catalog.preprocessing(catalog)

# POS data: drop rows with missing values, wrap each name in Product, then clean.
pos = pd.read_excel(config['pos_data_path'], sheet_name=config['pos_data_sheet'])
pos = pos.dropna()
pos['Product Name'] = pos['Product Name'].apply(lambda name: Product(name))
pos = preprocessing_pos.preprocessing(pos)

# External catalog: wrap each name ('Nama Produk' = product name) in Product, then clean.
external = pd.read_csv(config['external_data_path'])
external['Nama Produk'] = external['Nama Produk'].apply(lambda name: Product(name))
external = preprocessing_external.preprocessing(external, config)
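
The loading cell above reads five keys from `config`. A minimal sketch of a sanity check for those keys, assuming `config` behaves like a plain dict; `REQUIRED_KEYS`, `missing_keys`, and the example paths are hypothetical and not part of the project.

```python
# Keys read by the loading cell; the helper names below are hypothetical.
REQUIRED_KEYS = (
    "catalog_data_path",
    "catalog_data_sheet",
    "pos_data_path",
    "pos_data_sheet",
    "external_data_path",
)


def missing_keys(config):
    """Return the required keys that are absent from a dict-like config."""
    return [key for key in REQUIRED_KEYS if key not in config]


# Placeholder values for illustration only; real paths come from the project config.
example_config = {
    "catalog_data_path": "data/catalog.xlsx",
    "catalog_data_sheet": "Sheet1",
    "pos_data_path": "data/pos.xlsx",
}
print(missing_keys(example_config))
```

Running a check like this before the `pd.read_excel` calls turns a late `KeyError` into an explicit list of what is missing.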

In [3]:
catalog.head()

| | Product SKU | Brand | Type | Formula |
|---|---|---|---|---|
| 0 | Urea Petro | PIHC | Urea | |
| 1 | Urea PIM | PIHC | Urea | |
| 2 | Urea Nitrea | PIHC | Urea | |
| 3 | Urea Daun Buah | PIHC | Urea | |
| 4 | Urea Pusri | PIHC | Urea | |


In [4]:
pos.head()

| | Product Name | Name | Formula | Metrics |
|---|---|---|---|---|
| 0 | Pupuk Urea N 46% | Pupuk Urea N | | 46% |
| 1 | Pupuk Amonium Sulfat ZA | Pupuk Amonium Sulfat ZA | | |
| 2 | Pupuk Super Fosfat SP-36 | Pupuk Super Fosfat SP36 | | |
| 3 | Pupuk NPK Phonska | Pupuk NPK Phonska | | |
| 4 | Pupuk NPK Formula Khusus | Pupuk NPK Formula Khusus | | |
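
The output above shows each raw product name split into a cleaned name, a formula, and a metric such as `46%`. The sketch below illustrates one way to do such a split with regular expressions. It is an assumption for illustration, not the project's `Product` class or `preprocessing_pos` logic; `FORMULA_RE`, `METRIC_RE`, and `split_product_name` are hypothetical names.

```python
import re

# Hypothetical patterns; the project's Product class and preprocessing may differ.
FORMULA_RE = re.compile(r"\b\d{1,2}(?:-\d{1,2}){2,}\b")  # e.g. 16-16-16
METRIC_RE = re.compile(r"\b\d+(?:[.,]\d+)?\s*%")          # e.g. 46%


def split_product_name(raw):
    """Split a raw product name into (name, formula, metrics)."""
    formula = FORMULA_RE.search(raw)
    metric = METRIC_RE.search(raw)
    name = raw
    for match in (formula, metric):
        if match:
            # Remove the extracted part from the name.
            name = name.replace(match.group(0), " ")
    # Drop remaining hyphens (SP-36 -> SP36) and collapse repeated whitespace.
    name = re.sub(r"\s+", " ", name.replace("-", "")).strip()
    return (
        name,
        formula.group(0) if formula else None,
        metric.group(0) if metric else None,
    )


print(split_product_name("Pupuk Urea N 46%"))
print(split_product_name("NPK 16-16-16 YARAMILA UNIK"))
```

Extracting the formula and metric before cleaning the name keeps digits such as the `36` in `SP-36` attached to the name, which matches the `SP36` value shown in the table.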


In [15]:
external.head()

| | Brand | Product SKU | Formula | Type |
|---|---|---|---|---|
| 0 | Yara | NPK 15-09-20 YARAMILA WINNER | 15-09-20 | Majemuk |
| 1 | Yara | NPK 16-16-16 YARAMILA UNIK | 16-16-16 | Majemuk |
| 2 | Yara | NPK 25-7-7 YARAMILA FASTER | 25-7-7 | Majemuk |
| 3 | Yara | NPK YARAMILA COMPLEX | | Others |
| 4 | Yara | NPK YARAMILA PALMAE | | Others |


## Calculate similarity

In [12]:
catalog.head(2)

| | Product SKU | Brand | Type | Formula |
|---|---|---|---|---|
| 0 | Urea Petro | PIHC | Urea | |
| 1 | Urea PIM | PIHC | Urea | |


In [13]:
external.head(2)

| | Brand | Nama Produk | Formula | Type |
|---|---|---|---|---|
| 0 | Yara | NPK 15-09-20 YARAMILA WINNER | 15-09-20 | Majemuk |
| 1 | Yara | NPK 16-16-16 YARAMILA UNIK | 16-16-16 | Majemuk |


In [16]:
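# Stack the standardized catalog and the external catalog into one reference table.
# ignore_index=True already produces a fresh 0..n-1 index, so no reset_index is needed.
all_catalog = pd.concat([catalog, external], ignore_index=True)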
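
`pd.concat` aligns frames on column names, and the outputs above show the external name column both as `Nama Produk` and as `Product SKU`, depending on which cell ran last. The toy sketch below shows why the names have to agree before concatenating. The frames `catalog_demo` and `external_demo` are made up for illustration.

```python
import pandas as pd

# Toy frames for illustration; not the project data.
catalog_demo = pd.DataFrame({"Product SKU": ["Urea Petro"], "Brand": ["PIHC"]})
external_demo = pd.DataFrame({"Nama Produk": ["NPK YARAMILA COMPLEX"], "Brand": ["Yara"]})

# Mismatched column names: concat keeps both columns and fills the gaps with NaN.
misaligned = pd.concat([catalog_demo, external_demo], ignore_index=True)
print(misaligned["Product SKU"].isna().sum())

# Renaming first puts every product name in a single column.
aligned = pd.concat(
    [catalog_demo, external_demo.rename(columns={"Nama Produk": "Product SKU"})],
    ignore_index=True,
)
print(aligned["Product SKU"].tolist())
```

If the name column were left as `Nama Produk`, the later selection of `all_catalog['Product SKU']` would contain missing values for every external product.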

In [6]:
# Matching text: the cleaned name, with the formula appended when one was extracted.
pos['Text'] = pos.apply(
    lambda row: row['Name'] if pd.isnull(row['Formula']) else f'{row["Name"]}{row["Formula"]}',
    axis=1
)

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix
from src.similarity import spdt
from src.similarity.ngrams import ngrams

In [9]:
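# Fit on catalog SKUs and POS texts together so both share one TF-IDF vocabulary.
products = pd.concat([all_catalog['Product SKU'], pos['Text']], axis=0, ignore_index=True)

vectorizer = TfidfVectorizer(
    min_df=2,  # ignore n-grams that occur in fewer than 2 product names
    analyzer=ngrams
)
tf_idf_matrix = vectorizer.fit_transform(products)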
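
When `analyzer` is a callable, `TfidfVectorizer` passes it each raw string and uses the returned list as that document's features. The project's `ngrams` function is not shown here, so the sketch below is an assumed stand-in: a character trigram analyzer with light normalization. `char_ngrams` is a hypothetical name, and the real cleaning rules may differ.

```python
import re


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lightly normalized string."""
    # Lowercase and collapse anything that is not a letter or digit into one space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    # Strings shorter than n characters produce an empty list.
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


print(char_ngrams("Urea PIM"))
# ['ure', 'rea', 'ea ', 'a p', ' pi', 'pim']
```

Character n-grams tolerate typos and spelling variants better than whole-word tokens, which suits product names entered by hand at different POS terminals.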

In [10]:
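# Multiply the TF-IDF matrix by its transpose to score every product against every other.
matches = spdt.awesome_cossim_top(
    tf_idf_matrix,
    tf_idf_matrix.transpose(),
    ntop=10,          # keep at most 10 matches per product
    lower_bound=0.6   # discard pairs scoring below 0.6
)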

<44187x9648 sparse matrix of type '<class 'numpy.float64'>'
	with 418515 stored elements in Compressed Sparse Row format>
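
`TfidfVectorizer` L2-normalizes each row by default, so the product of the matrix with its transpose gives cosine similarities. The sketch below shows the same idea on a small dense array: normalize, multiply, then keep the best `ntop` scores per row above `lower_bound`. It is a simplified stand-in, not the project's `spdt.awesome_cossim_top`; `top_matches` is a hypothetical name, and whether the real threshold is strict or inclusive is an assumption.

```python
import numpy as np


def top_matches(vectors, ntop=10, lower_bound=0.0):
    """Return (row, col, score) triples for the best cosine matches per row."""
    # L2-normalize rows so the dot product of two rows is their cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)
    scores = unit @ unit.T
    results = []
    for row, row_scores in enumerate(scores):
        # Column indices ordered from highest to lowest score, truncated to ntop.
        best = np.argsort(-row_scores, kind="stable")[:ntop]
        results.extend(
            (row, int(col), float(row_scores[col]))
            for col in best
            if row_scores[col] > lower_bound  # assumed strict threshold
        )
    return results


demo_vectors = np.array([[1.0, 0.0], [2.0, 1.0], [0.0, 1.0]])
for row, col, score in top_matches(demo_vectors, ntop=2, lower_bound=0.6):
    print(row, col, round(score, 3))
```

A dense product like this is only practical for small inputs. With tens of thousands of products, a sparse top-n multiplication avoids materializing the full similarity matrix.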

In [None]:
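# get_matches_df is not imported or defined in the cells shown; define or import it before running this cell.
matches_df = get_matches_df(matches, products, top=200)
# Keep only pairs scoring below 1, which filters out identical strings and self-matches.
matches_df = matches_df[matches_df['similarity'] < 1]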
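
The cell above calls `get_matches_df`, which does not appear in the imports shown. The sketch below is one plausible shape for such a helper, assuming it unpacks the sparse similarity matrix into one row per matched pair. Only the `similarity` column name comes from the notebook. The `left_side` and `right_side` names and the meaning of `top` are assumptions.

```python
import pandas as pd
from scipy.sparse import csr_matrix


def get_matches_df(sparse_matrix, name_vector, top=None):
    """Turn a sparse similarity matrix into one row per (left, right) pair."""
    coo = sparse_matrix.tocoo()
    names = list(name_vector)
    frame = pd.DataFrame(
        {
            "left_side": [names[i] for i in coo.row],    # hypothetical column name
            "right_side": [names[j] for j in coo.col],   # hypothetical column name
            "similarity": coo.data,
        }
    )
    # Assumption: `top` caps the number of stored pairs that are returned.
    return frame if top is None else frame.head(top)


# Toy similarity matrix for two product names.
demo_matrix = csr_matrix([[1.0, 0.8], [0.8, 1.0]])
demo_df = get_matches_df(demo_matrix, ["Urea Petro", "Urea PIM"])
print(demo_df[demo_df["similarity"] < 1])
```

In the toy output, the two diagonal entries (each name matched with itself) are filtered out by the `similarity < 1` condition, leaving only the cross-product pairs.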

In [None]:
matches_df.sample(10)