<img src="1.png" width="400">

# UNSPSC code project - RPI Capstone

**Tips**
- This document is for Johnson & Johnson internal use only.
- This document is divided into two parts: Brief Summary & Technical Document

## Part I : Brief Summary

**All data are from Yilin Chen** (Thanks!)

- Paths : All received files are in the csvFiles folder (./FinalCode/defaultInput/csvFiles)

**In this project, we take two general approaches to the problem.**
- Traditional Model: Predict each partition of the UNSPSC code based on naming patterns.
- Deep Learning Model: A hybrid model that combines several machine learning techniques.

### 1.1 Business questions and data understanding

- Question: Use current data from SAP to predict the UNSPSC code
- Approach: Data visualization

**Step 1**: Group all features by UNSPSC to pick out the important features.
- Some features clearly help the model determine the specific UNSPSC code.

**Step 2:** Choose the description fields used to train the model
- The descriptions within each UNSPSC code tend to share repeated words, so an NLP method makes sense.
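
One simple way to make Step 1 concrete is to measure, for each feature, how often a feature value maps to exactly one UNSPSC code. The sketch below is only an illustration of that idea, not the project's actual analysis: the function name `single_code_share` and the toy rows are hypothetical.

```python
from collections import defaultdict

def single_code_share(rows, feature, target="Unspsc"):
    """Fraction of rows whose feature value maps to exactly one target code."""
    codes_per_value = defaultdict(set)
    for row in rows:
        codes_per_value[row[feature]].add(row[target])
    pure_rows = sum(1 for row in rows if len(codes_per_value[row[feature]]) == 1)
    return pure_rows / len(rows)

# Hypothetical toy rows; the values are illustrative only, not project data
rows = [
    {"Spart": "A", "Zzwerks": "CA02", "Unspsc": 42291501},
    {"Spart": "A", "Zzwerks": "BR06", "Unspsc": 42291501},
    {"Spart": "B", "Zzwerks": "CA02", "Unspsc": 42295400},
    {"Spart": "B", "Zzwerks": "BR06", "Unspsc": 42295400},
]
scores = {f: single_code_share(rows, f) for f in ("Spart", "Zzwerks")}
```

A score near 1 means the feature almost determines the code on its own; a score near 0 means it carries little information by itself.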

<img src="Unspsc.png" width=500>
<img src="Visual.png" width = 800>

### 1.2 Data cleaning and preparation 

**The Code Model is explained later in a separate section.**

- Text Fields: Remove commas, digits, ...
- Merge: Merge the Prdha and GMDN files with UNSPSC Full Data Update 2.csv to make full use of the descriptions
- Null Values: The text model needs descriptions, so all rows with null descriptions are deleted.
- Keep other features: Product dimensions and locations (they are used in the final model)
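
The text cleaning and null handling above could look roughly like the following. This is a minimal sketch under assumptions: the helper names are hypothetical, lowercasing is an added assumption, and the sample description is made up for illustration.

```python
import re

def clean_description(text):
    """Lowercase, drop digits and punctuation, collapse whitespace."""
    text = text.lower()                      # assumption: case is not informative
    text = re.sub(r"[^a-z\s]", " ", text)    # removes commas, digits, other symbols
    return re.sub(r"\s+", " ", text).strip()

def drop_null_descriptions(rows, field="Material Description"):
    """Keep only rows whose description is a non-empty string."""
    return [r for r in rows if isinstance(r.get(field), str) and r[field].strip()]

# Hypothetical sample rows, not project data
rows = [
    {"Material Description": "NORIAN DRILLABLE, 5CC"},
    {"Material Description": None},
]
cleaned = [clean_description(r["Material Description"])
           for r in drop_null_descriptions(rows)]
```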

**This is an executable document. All data will be provided later.**

### 1.3 Model constructions

- Use all fields from "UNSPSC Full Data Update 2.csv" and clean the data.
- For the descriptions, 3 separate models are built on fastText embeddings to create new features (autoencoder).
- Deep learning techniques (RNN - GRU) are used for the autoencoder part.
- Merge all features and use a tree-based model to predict the final UNSPSC code.

<img src="model.png" width=500>

### 1.4 Code Model

- All data also come from "UNSPSC Full Data Update 2.csv"
- Use all fields and fill missing values with -1.
- Use a random forest to predict the UNSPSC code as a whole.
- Important Fields: 'Zzp3 Low Level', 'Z Gmdn Preferred Term Code', 'Prdha', 'Spart'

<img src="code.png" width=500>

### 1.5 Model evaluations and optimizations

<img src="perfor.png" width=800>

### 1.6 Discussion with findings
- High Accuracy: An overfitting problem may exist.
    - However, the generalization ability of the model is less important here because future data will likely come from the same source.
    - During training we adopted several methods to avoid overfitting. Although the results may look too good to be true, we believe the overfitting problem is not severe. (See our demo.)
- Code model and Text model
    - Given limited time and resources, the code model performs better.
    - If features like "Z.." can be used, the code model will outperform the text model. We did not realize at first how valuable these features are.
    
**Next Steps:**
- Apply these features and techniques to other data sources to check the stability of the models

## Part II : Technical Document

### 1.1 Code Model

The code model is very simple, so only the code is provided here.

```python
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection  import train_test_split
from sklearn.metrics import accuracy_score

products = pd.read_csv('C:/Modeling/UNSPSC Full Data Update 2.csv')

# Encode every object (string) column as integer codes (label encoding, not one-hot)
for column_name in products.columns:
    if products[column_name].dtype == object:
        print(column_name)
        products[column_name] = pd.factorize(products[column_name])[0]

# Replace all null values with -1
products.fillna(-1, inplace=True)

# Create each partition of the code
products['Unspsc_str'] = products['Unspsc'].apply(str)
# Split UNSPSC into 2-digit pairs
products['segment'] = products['Unspsc_str'].str[:2]
products['family'] = products['Unspsc_str'].str[2:4]
products['class'] = products['Unspsc_str'].str[4:6]
products['commodity'] = products['Unspsc_str'].str[6:]


# Split Train and Test Set
train, test = train_test_split(products, test_size=0.3, random_state=0)
train_X = train[['Zzp3 Low Level', 'Z Gmdn Preferred Term Code', 'Prdha', 'Spart']]
train_y = train['commodity']
test_X = test[['Zzp3 Low Level', 'Z Gmdn Preferred Term Code', 'Prdha', 'Spart']]
test_y = test['commodity']

# Build Random Forest Model
clf = RandomForestClassifier()
clf.fit(train_X, train_y)
preds = clf.predict(test_X)

# Check the wrong predictions
pd.crosstab(test_y, preds, rownames=['actual'], colnames=['preds'])

# Show Model Performance
print(accuracy_score(train_y, clf.predict(train_X)))
print(accuracy_score(test_y, preds))

# Show Model Feature Importance
list(zip(train_X, clf.feature_importances_))
```
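
The partitioning step in the listing above slices the 8-digit code into four 2-digit parts. The same logic for a single code, as a small standalone sketch (the helper name `split_unspsc` and the validation are assumptions added for illustration):

```python
def split_unspsc(code):
    """Split an 8-digit UNSPSC code into its four 2-digit partitions."""
    digits = str(code)
    if len(digits) != 8 or not digits.isdigit():
        raise ValueError(f"expected an 8-digit UNSPSC code, got {code!r}")
    return {
        "segment": digits[:2],
        "family": digits[2:4],
        "class": digits[4:6],
        "commodity": digits[6:],
    }

parts = split_unspsc(42291501)
```

The length check is worth keeping in practice: a code stored as a float would otherwise be sliced as a string such as `42291501.0` and yield a wrong commodity part.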

### 1.2 Text Model

#### 1.2.1 Main Function 
- Currently, all functions are in one module : mainFun.

Tree Structure :

--unspsc

    |-- __init__ : Define the path
    |-- dataPre :
    |-- FeatureMerge
    |-- Token
    |-- TokenInput
    |-- dataInput
    |-- dataPre
    |-- finalPrediction
    |-- modelNetTrain
    |-- modelPre
    |-- modelTrain
    |-- stagingFeature
    |-- tokenSide
    |-- wordEmbedding 


#### 1.2.2 Initialization

- Step 1: Import the mainFun module, which contains all related functions.
``` python
import mainFun
```

- Step 2: Initialization
    - With all functions loaded, we can check all default paths and methods used in this module.
``` python
# 1.Default Path
model = mainFun.unspsc() 
# 2.Self-defined Path
filePath = '../FinalCode/defaultInput/csvFiles/UNSPSC Full Data Update 2.csv'
model = mainFun.unspsc(SapPath = filePath) 
```

- Step 3: Select Paths
    * weightsPath : path for all neural network parameters
    * wordEmPath : embedding matrix after model training
    * GmdnPath : all GMDN categories
    * PrhdaPath : current mapping file
    * SapPath : data directly from the SAP system
    * EMBEDDING_FILE : path for loading the fastText file


##### 1.2.2 (a) Code Demo Default

``` python
# Initial
import mainFun
model = mainFun.unspsc()
filterAll, y_train = model.dataPre()
```

```
Using TensorFlow backend.

Data Preparation Finished
```


##### 1.2.2 (b) Code Demo Self-defined Paths
- Use new files to create a new object.

``` python
# Folder path
rootPath = '/Users/ValarMorghulis/Johnson_Johnson/FinalCode/'
# Data from the SAP system
SapPath = rootPath + 'defaultInput/csvFiles/UNSPSC Full Data Update 2.csv'
# PRHDA file
PrhdaPath= rootPath + 'defaultInput/csvFiles/Prhda.csv'
# GMDN description
GmdnPath = rootPath + 'defaultInput/csvFiles/GMDN_desp.csv'
# fastText word embedding file
embedPath = rootPath + 'defaultInput/wordEmbeddingMartix/crawl-300d-2M.vec'
# All pre-trained parameters
weightsPath = rootPath + 'defaultInput/preTrainedWeights/'
# Word embedding files of all pre-trained models
wordEmPath = rootPath + 'defaultInput/wordEmbeddingMartix/'

# Initial
import mainFun
model = mainFun.unspsc(SapPath = SapPath,
                       PrhdaPath= PrhdaPath,
                       GmdnPath = GmdnPath,
                       embedPath = embedPath,
                       weightsPath = weightsPath,
                       wordEmPath = wordEmPath)
filterAll, y_train = model.dataPre()
```

```
Data Preparation Finished
```


#### 1.2.3 Data

Given all data paths, we merge all data together using an **inner join**. We then apply the following steps to the data:
1. Drop rows with a null GMDN name, Material Description, or Prdha Description.
2. Standardize all Prdha codes. (E.g., some Prdha codes are not 18 digits long, so a function pads every code to 18 digits.)

**Extra Explanation:**

We select only a limited set of features from the table:

- Product Dimensions : Breit, Brgew, Hoehe, Laeng, Volum, Ntgew (width, gross weight, height, length, volume, net weight)
- Text Fields : Material Description, Gmdnptdefinition (long), Gmdnptname (short), Minor_name, Major_name, Material (all text)
- Digits : Ean11 (GTIN), Unspsc, Prdha
- Others : Zzwerks (location), Material_top3 (first three chars of material description) 

**Related Modules:**
* dataPre : Creates and prepares all required data
* prdha_zero : Pads all Prdha codes to 18 digits (fills with leading zeros)
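
The padding step can be sketched in a few lines. This is not the module's `prdha_zero` implementation, only an illustration of the idea; the function name `pad_prdha` and the length check are assumptions.

```python
def pad_prdha(code, width=18):
    """Left-pad a Prdha code with zeros up to `width` characters."""
    text = str(code).strip()
    if len(text) > width:
        raise ValueError(f"Prdha code longer than {width} characters: {text!r}")
    return text.zfill(width)

# A 17-digit code, as it appears when Prdha is read as a number
padded = pad_prdha(55078377837047824)
```

Reading the column as text (for example with `dtype=str` in `pd.read_csv`) would avoid losing the leading zeros in the first place.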

*Part of the data:*

``` python
filterAll.iloc[0,:]
```

```
Breit                                                                   0
Brgew                                                                   1
Hoehe                                                                   0
Laeng                                                                   0
Volum                                                                   0
Zzwerks                                                              CA02
Ntgew                                                                   1
Material Description                                     NORIAN DRILLABLE
Material                                                   07.704.003S-US
Ean11                                                          1.0887e+13
Gmdnptdefinition        A sterile bioabsorbable device made of synthet...
Gmdnptname                                 Bone matrix implant, synthetic
Unspsc                                                           42291501
Prdha
```

``` python
filterAll.iloc[:,10:15].head()
```

| | Gmdnptdefinition | Gmdnptname | Unspsc | Prdha | Minor_name |
|---|---|---|---|---|---|
| 0 | A sterile bioabsorbable device made of synthet... | Bone matrix implant, synthetic | 42291501 | 55078377837047824 | Biomaterials |
| 1 | A sterile bioabsorbable device made of synthet... | Bone matrix implant, synthetic | 42291501 | 55078377837047824 | Biomaterials |
| 2 | A sterile bioabsorbable device made of synthet... | Bone matrix implant, synthetic | 42291501 | 55078377837047824 | Biomaterials |
| 3 | A sterile bioabsorbable device made of synthet... | Bone matrix implant, synthetic | 42291501 | 55078377837047824 | Biomaterials |
| 4 | A sterile bioabsorbable device made of synthet... | Bone matrix implant, synthetic | 42291501 | 55078377837047824 | Biomaterials |


#### 1.2.3 Modeling 

This part is the most complex. It can be divided into two parts:
- Model Training
- Model Output

##### 1.2.3.1 Model Training

Currently, all modules are encapsulated in the final method **stagingFeature**.

**Step 1: Sentence to Words**
- First, the Keras text preprocessing tool parses each sentence and splits it into words.
- Related Modules : Token, TokenInput, tokenSide
- We have **3** models, one per text field. This step creates a separate bag of words for each model. The vocabulary sizes differ because some text fields contain more distinct words (more information).
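
To illustrate what this step produces, here is a simplified pure-Python stand-in for a word tokenizer. It is not the Keras implementation and not the module's `Token` code; it only mirrors the general idea of indexing words by frequency (index 0 reserved for padding). The sample texts are illustrative.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent first; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}

def to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in t.lower().split() if w in word_index]
            for t in texts]

# Illustrative text only; each text field gets its own vocabulary
gmdn_names = ["bone matrix implant synthetic", "bone screw"]
material_names = ["norian drillable"]

gmdn_index = build_word_index(gmdn_names)
material_index = build_word_index(material_names)
sequences = to_sequences(gmdn_names, gmdn_index)
```

Because each field has its own index, the vocabulary size (and therefore the embedding layer size) differs per model.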

**Step 2: Word Embedding**
- Next, the fastText vectors turn these words into a matrix. The embedding matrix is the first layer of the neural network.
- Related Modules : wordEmbedding
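
A minimal sketch of how an embedding matrix can be built from a fastText `.vec` text file (a `count dim` header line, then one `word v1 v2 ...` line per word). The function names are hypothetical and the in-memory file is a tiny stand-in, so this shows the idea rather than the module's `wordEmbedding` code.

```python
import io

def load_vectors(handle):
    """Read fastText .vec text: a 'count dim' header, then 'word v1 v2 ...' per line."""
    _, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; unknown words stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = vectors[word]
    return matrix

# Tiny in-memory stand-in for crawl-300d-2M.vec (3 dimensions instead of 300)
fake_vec = io.StringIO("2 3\nbone 0.1 0.2 0.3\nscrew 0.4 0.5 0.6\n")
vectors, dim = load_vectors(fake_vec)
matrix = build_embedding_matrix({"bone": 1, "implant": 2, "screw": 3}, vectors, dim)
```

Row 0 stays zero for padding, and words missing from the pre-trained file (here `implant`) also keep a zero vector.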

**Step 3: RNN**
- The modeling part uses a recurrent neural network: for natural language processing, we use a bidirectional GRU. For the detailed model, see the example below.

**Step 4 : Encapsulation**
- All these functions are encapsulated in one module, so the model can be trained simply by changing a few parameters.
- Advanced Technique: we use an **autoencoder** for information extraction and compression. As a result, the output is the inner layer of the RNN model.
- Related Module : modelTrain

**Example: GMDN**
- Use pre-cleaned data : filterAll (x), y_train (y)
- Text Type : gmdn
- Pre_trained : False (use new data to train the model)
- Summary : True (show the model structure)
- epochs : number of training epochs (50-60s per epoch; more epochs give higher accuracy)
- Input : False (this is designed for the interface)

**Important**
- Input : Determines whether new input data (x) is used.
- Pre_trained : Determines whether the model needs to be trained.

If you choose to train, a new file whose name begins with "New" is written to the "preTrainedWordEmbed" folder.

``` python
gmdn_output = model.modelTrain(filterAll,y_train,'gmdn',
                               Pre_trained=False,Summary=True,epochs = 20,Input=False)
```

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 300)         210000    
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, None, 300)         0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, None, 160)         182880    
_________________________________________________________________
global_average_pooling1d_1 ( (None, 160)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 100)               16100     
_________________________________________________________________
dense_2 (Dense)              (None, 73)                7373      
Total params: 416,353
Trainable params: 416,353
Non-trainable params: 0
_________________________________________________________________
```

Here, the GMDN model already reaches 80.81% accuracy.
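
The parameter counts in the GMDN summary above can be reproduced with a little arithmetic. The sizes used below (a 700-row embedding, 80 GRU units per direction) are not stated in the source; they are assumptions inferred from the printed shapes and counts, and the GRU formula is the classic one with a single bias per gate.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def gru_params(input_dim, units):
    """Classic GRU: 3 gates, each with input weights, recurrent weights and a bias."""
    return 3 * (units * (input_dim + units) + units)

def dense_params(inputs, outputs):
    return inputs * outputs + outputs

# Assumed sizes, inferred from the summary (not taken from the source code)
vocab_size, embed_dim, gru_units, hidden, classes = 700, 300, 80, 100, 73

counts = {
    "embedding": embedding_params(vocab_size, embed_dim),
    "bidirectional": 2 * gru_params(embed_dim, gru_units),   # forward + backward
    "dense_hidden": dense_params(2 * gru_units, hidden),     # 160 -> 100
    "dense_output": dense_params(hidden, classes),           # 100 -> 73 classes
}
total = sum(counts.values())
```

Under these assumptions the per-layer numbers and the total match the summary, which suggests the 100-unit dense layer is the compressed feature vector returned by `modelTrain`.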

``` python
material_output = model.modelTrain(filterAll,y_train,'material',
                               Pre_trained=False,Summary=True,epochs = 20,Input=False)
```

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 300)         1200000   
_________________________________________________________________
spatial_dropout1d_2 (Spatial (None, None, 300)         0         
_________________________________________________________________
bidirectional_2 (Bidirection (None, None, 160)         182880    
_________________________________________________________________
global_average_pooling1d_2 ( (None, 160)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 100)               16100     
_________________________________________________________________
dense_4 (Dense)              (None, 73)                7373      
Total params: 1,406,353
Trainable params: 1,406,353
Non-trainable params: 0
_________________________________________________________________
```


Here, the material model already reaches 80.88% accuracy.

``` python
prdha_output = model.modelTrain(filterAll,y_train,'prdha',
                               Pre_trained=False,Summary=True,epochs = 20,Input=False)
```

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, None, 300)         300000    
_________________________________________________________________
spatial_dropout1d_3 (Spatial (None, None, 300)         0         
_________________________________________________________________
bidirectional_3 (Bidirection (None, None, 160)         182880    
_________________________________________________________________
global_average_pooling1d_3 ( (None, 160)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 200)               32200     
_________________________________________________________________
dense_6 (Dense)              (None, 73)                14673     
Total params: 529,753
Trainable params: 529,753
Non-trainable params: 0
_________________________________________________________________
```

Here, the Prdha model reaches 46.95% accuracy.

``` python
gmdn_output.shape # (row,column)
```

```
(20701, 100)
```

##### 1.2.3.2 Pre-trained Model

Pre-trained models are also provided in this part, so the pre-trained parameters can be used for prediction.

``` python
pre_gmdn_output, pre_material_output, pre_prdha_output = model.stagingFeature(filterAll,y_train,Pre_trained=True,Summary=False,epochs=20)
```

##### 1.2.3.3 Merge All Features

- This part is very simple: merge all the previous word embedding matrices with the other features, such as product dimensions, locations, and other information.

``` python
pre_featuresReady = model.FeatureMerge(pre_gmdn_output,pre_material_output,pre_prdha_output,filterAll)
featuresReady = model.FeatureMerge(gmdn_output,material_output,prdha_output,filterAll)
```

``` python
pre_featuresReady.iloc[:3,:8]
```

| | Breit | Brgew | Hoehe | Laeng | Volum | Location_BR06 | Location_BR08 | Location_CA02 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 |

``` python
pre_featuresReady.shape # (row,column)
```

```
(20701, 417)
```
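
The merge itself amounts to one-hot encoding the location and concatenating row-aligned blocks column-wise. The sketch below illustrates that idea with toy data; the helper names are hypothetical and it is not the `FeatureMerge` implementation. If the three text blocks contribute 100, 100, and 200 columns (the dense-layer widths in the summaries above), that would leave 17 of the 417 columns for dimensions, locations, and other features.

```python
def one_hot(values, prefix):
    """One indicator column per distinct value, named like 'Location_CA02'."""
    categories = sorted(set(values))
    names = [f"{prefix}_{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return names, rows

def merge_features(*blocks):
    """Concatenate row-aligned feature blocks column-wise."""
    n_rows = len(blocks[0])
    if any(len(b) != n_rows for b in blocks):
        raise ValueError("all feature blocks need the same number of rows")
    return [sum((list(b[i]) for b in blocks), []) for i in range(n_rows)]

# Toy stand-ins: 2 rows, 1-column text blocks instead of 100/100/200 columns
dimensions = [[0.0, 1.0], [0.5, 2.0]]
gmdn_feat, material_feat, prdha_feat = [[0.1], [0.2]], [[0.3], [0.4]], [[0.5], [0.6]]
loc_names, loc_rows = one_hot(["CA02", "BR06"], "Location")

merged = merge_features(dimensions, loc_rows, gmdn_feat, material_feat, prdha_feat)
```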

##### 1.2.3.4 Boosting Tree Prediction

- With all the previous features, the final model is an XGBoost boosted tree model.
- Here, we only show 1 round with accuracy 98.07% (35 rounds reach 99.95%).

``` python
finalPrediction = model.finalPrediction(featuresReady, y_train, num_round=1,preTrained=False,InputTest=True)
```

```
[0]	train-merror:0.017391	test-merror:0.019804
```
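
XGBoost's `merror` metric is the multiclass classification error rate (wrong predictions divided by all predictions), so accuracy is `1 - merror`; a test-merror of 0.019804 corresponds to roughly 98.0% test accuracy. A minimal sketch of the same computation, with a hypothetical helper and illustrative labels:

```python
def merror(y_true, y_pred):
    """Multiclass error rate: wrong predictions divided by all predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

# Illustrative labels only (class ids in the 0..72 range of the 73 classes)
y_true = [0, 0, 1, 72, 72]
y_pred = [0, 1, 1, 72, 72]
error = merror(y_true, y_pred)
accuracy = 1 - error
```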


``` python
pre_finalPrediction = model.finalPrediction(featuresReady, y_train, num_round=30,preTrained=True,InputTest=True)
```

``` python
pre_finalPrediction
```

```
array([ 0.,  0.,  0., ..., 72., 72., 72.], dtype=float32)
```

``` python
y_train[0] 
```

```
array([ 0,  0,  0, ..., 72, 72, 72])
```

### Appendix : Reference Materials
- Books: Flask Web Development: Developing Web Applications with Python 
- Tech : plotly, Dash