# Expense classification using Wordvectors
This file explains a methodology of expense classification using pretrained Wordvectors. This example uses two pretrained vectors, they are

* fast text (https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M.vec.zip)

* Glove (http://nlp.stanford.edu/data/glove.6B.zip)


## Algorithm Description

### Required Downloads & Installations
* Download the Wordvectors from above link, unzip in a folder of choice
* Download and install gensim and sklearn
### Algorithm
* Load the Wordvectors
* Read the training and the validation datafiles, extract target classes and predictors
    * For this example we are considering the **expense category** as target class and **expense description** as predictor
* Pass the expense description as input to feature engineering utility and get the sentence vectors
    * Feature engineering utility is in Data_Prep_Utils.py
    * It takes sentence list and the Wordvector models as inputs, returns sentence vectors as outputs
* Pass the training data sentence vector and target class to the Train_Model utility in ML_Utils.py
    * Train_Model currently implements a random forest with default settings. Returns trained model and the hold out data
* Validate the model by publishing the prediction probability, F1 score and comparison against the actual classes
### Assumptions
* All columns will have data - hence missing data treatment is not performed
* The files provided for algorithm will have the same format, hence will have **expense category** and **expense description** columns

## How to Run Algorithm for other files
Go to the fourth executable code cell and change the path of training and the validation files

Training_Sentences, Training_Labels = DPU.Get_Data(**'Your Training file path/Training_filename.csv'**)

Validation_Sentances, Validation_Labels = DPU.Get_Data(**'Your validataion file path/validation_filename.csv'**)

# Import Relavent Packages

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.model_selection import train_test_split
from gensim.models import KeyedVectors
# Load Pretrained Glove Model
import DF_Clean_Up as Clean_Up
import Data_Prep_Utils as DPU
import Text_Proc_Utils as TPU
import ML_Utils as MU



# Load Pretrained Fast_Text Wordvectors

In [4]:
Fast_Text_Model = TPU.Get_Word2Vec_Model('C:\Sundar\Project\Wave_ML_Challenge\ml-challenge-expenses\wiki-news-300d-1M.vec')
print (Fast_Text_Model.most_similar('dinner'))

[('supper', 0.8135534524917603), ('dinners', 0.7936621904373169), ('banquet', 0.7714864015579224), ('luncheon', 0.7681770324707031), ('lunch', 0.767798900604248), ('meal', 0.7627254128456116), ('Dinner', 0.7396844625473022), ('breakfast', 0.7091683149337769), ('brunch', 0.6875953674316406), ('meals', 0.6836446523666382)]


# Load Pretrained Glove Wordvectors

In [5]:
Glove_Model = TPU. Get_Word2Vec_Model_From_Glove('C:\Sundar\Project\Wave_ML_Challenge\ml-challenge-expenses\glove.6B.300d.txt')
print (Glove_Model.most_similar('dinner'))

[('dinners', 0.7367478609085083), ('breakfast', 0.7218676805496216), ('lunch', 0.7212514281272888), ('luncheon', 0.6610217094421387), ('guests', 0.6499863266944885), ('banquet', 0.646759033203125), ('meal', 0.6411886215209961), ('brunch', 0.5942021608352661), ('meals', 0.5799444913864136), ('gala', 0.5731527209281921)]


# Split Raw data into Predictors and Response

In [8]:
Training_Sentences, Training_Labels = DPU.Get_Data('./training_data_example.csv')
Validation_Sentances, Validation_Labels = DPU.Get_Data('./validation_data_example.csv')

# Generate Features using Wordvectors

In [9]:
Fast_Text_Training_Features = DPU.Get_Feature_Vectors(Training_Sentences,Fast_Text_Model)
Fast_Text_Trial = dict(zip(Training_Sentences,Fast_Text_Training_Features))

Glove_Training_Features = DPU.Get_Feature_Vectors(Training_Sentences,Glove_Model)
Glove_Trial =  dict(zip(Training_Sentences,Glove_Training_Features))

# Model Training and Performance (Fast text)

In [10]:
Model,Train_Score,X_Test,y_Test = MU.Train_Model(Fast_Text_Training_Features,Training_Labels)
results = MU.Validate_Model(Model,X_Test,y_Test)
results[1,'train_score'] = Train_Score

predicted_class = Model.predict(X_Test)
Actual_Vs_Pred = dict(zip(y_Test,predicted_class))

print("")
print("Validation results with Validation dataset")
print("==========================================")
print(results)
print("")
print("Actual class vs predicted class")
print("===============================")
print(Actual_Vs_Pred)


Prediction Probabilities
[[ 0.13333333  0.06666667  0.73333333  0.06666667]
 [ 0.13333333  0.2         0.13333333  0.53333333]
 [ 0.26666667  0.          0.13333333  0.6       ]
 [ 0.          0.          1.          0.        ]
 [ 0.4         0.06666667  0.13333333  0.4       ]
 [ 0.13333333  0.66666667  0.2         0.        ]]

Validation results with Validation dataset
      classifier  train_score  test_score  (1, train_score)
0              0          0.0    0.000000               1.0
1  Random Forest          NaN    0.833333               1.0

Actual class vs predicted class
{'Meals and Entertainment': 'Meals and Entertainment', 'Travel': 'Travel', 'Office Supplies': 'Computer - Hardware', 'Computer - Software': 'Computer - Software'}


# Model Validation and Performance (Fast text)

In [11]:
X_Test = DPU.Get_Feature_Vectors(Validation_Sentances,Fast_Text_Model)
y_Test = Validation_Labels

results = MU.Validate_Model(Model,X_Test,y_Test)
results[1,'train_score'] = Train_Score

predicted_class = Model.predict(X_Test)
Actual_Vs_Pred = dict(zip(y_Test,predicted_class))

print("")
print("Validation results with Validation dataset")
print("==========================================")
print(results)
print("")
print("Actual class vs predicted class")
print("===============================")
print(Actual_Vs_Pred)


Prediction Probabilities
[[ 0.          0.          0.          1.        ]
 [ 0.13333333  0.13333333  0.6         0.13333333]
 [ 0.6         0.2         0.06666667  0.13333333]
 [ 0.4         0.06666667  0.13333333  0.4       ]
 [ 0.33333333  0.          0.26666667  0.4       ]
 [ 0.13333333  0.26666667  0.26666667  0.33333333]
 [ 0.13333333  0.06666667  0.73333333  0.06666667]
 [ 0.          0.          1.          0.        ]
 [ 0.          0.          0.93333333  0.06666667]
 [ 0.          0.          1.          0.        ]
 [ 0.          0.          1.          0.        ]
 [ 0.          0.          1.          0.        ]]

Validation results with Validation dataset
      classifier  train_score  test_score  (1, train_score)
0              0          0.0    0.000000               1.0
1  Random Forest          NaN    0.833333               1.0

Actual class vs predicted class
{'Travel': 'Travel', 'Meals and Entertainment': 'Meals and Entertainment', 'Computer - Hardware': 'Comput

# Model Training and Performance (Glove)

In [12]:
Glove_Trained_Model,Train_Score,X_Test,y_Test = MU.Train_Model(Glove_Training_Features,Training_Labels)

results = MU.Validate_Model(Glove_Trained_Model,X_Test,y_Test)
results[1,'train_score'] = Train_Score

predicted_class = Glove_Trained_Model.predict(X_Test)
Actual_Vs_Pred = dict(zip(y_Test,predicted_class))

print("")
print("Validation results with Validation dataset")
print("==========================================")
print(results)
print("")
print("Actual class vs predicted class")
print("===============================")
print(Actual_Vs_Pred)

Prediction Probabilities
[[ 0.13333333  0.06666667  0.53333333  0.26666667]
 [ 0.26666667  0.13333333  0.2         0.4       ]
 [ 0.          0.06666667  0.26666667  0.66666667]
 [ 0.          0.          1.          0.        ]
 [ 0.33333333  0.06666667  0.33333333  0.26666667]
 [ 0.06666667  0.8         0.06666667  0.06666667]]

Validation results with Validation dataset
      classifier  train_score  test_score  (1, train_score)
0              0          0.0    0.000000               1.0
1  Random Forest          NaN    0.833333               1.0

Actual class vs predicted class
{'Meals and Entertainment': 'Meals and Entertainment', 'Travel': 'Travel', 'Office Supplies': 'Computer - Hardware', 'Computer - Software': 'Computer - Software'}


# Model Validation and Performance (Glove)

In [13]:
X_Test = DPU.Get_Feature_Vectors(Validation_Sentances,Glove_Model)
y_Test = Validation_Labels

results = MU.Validate_Model(Glove_Trained_Model,X_Test,y_Test)
results[1,'train_score'] = Train_Score

predicted_class = Glove_Trained_Model.predict(X_Test)
Actual_Vs_Pred = dict(zip(y_Test,predicted_class))

print("")
print("Validation results with Validation dataset")
print("==========================================")
print(results)
print("")
print("Actual class vs predicted class")
print("===============================")
print(Actual_Vs_Pred)

Prediction Probabilities
[[ 0.          0.          0.          1.        ]
 [ 0.13333333  0.          0.6         0.26666667]
 [ 0.46666667  0.26666667  0.13333333  0.13333333]
 [ 0.33333333  0.06666667  0.33333333  0.26666667]
 [ 0.2         0.          0.53333333  0.26666667]
 [ 0.          0.2         0.13333333  0.66666667]
 [ 0.06666667  0.13333333  0.73333333  0.06666667]
 [ 0.          0.          1.          0.        ]
 [ 0.          0.          0.86666667  0.13333333]
 [ 0.          0.          1.          0.        ]
 [ 0.          0.          1.          0.        ]
 [ 0.          0.          1.          0.        ]]

Validation results with Validation dataset
      classifier  train_score  test_score  (1, train_score)
0              0          0.0    0.000000               1.0
1  Random Forest          NaN    0.833333               1.0

Actual class vs predicted class
{'Travel': 'Travel', 'Meals and Entertainment': 'Meals and Entertainment', 'Computer - Hardware': 'Comput