# Spark-based Expense Classification

This notebook is the launchpad for running the expense classification algorithm built on Spark with PySpark. To run it, the following conditions must be met on the end user's machine:

1. Spark is installed on the end user's machine.
2. The `SPARK_HOME` variable is set in the `.bashrc` file.

## Approach Overview

1. Data ingestion and extraction: loading a CSV file is straightforward with the Spark CSV package (https://github.com/databricks/spark-csv). This package converts CSV files to Spark DataFrames, so the data is handled as a native Spark data structure. All the data management utilities are packaged in the **Data_to_Spark_Utils.py** file.
<br><br>
2. Once the CSV files are ingested, only two columns are retained: the expense category and the expense description. I treat the problem as an NLP task. The expense category column provides the labels (classes), and the expense description column provides the text from which features are extracted. These features are used to train a classifier.
<br><br>
3. The next step is to build a machine learning pipeline with the Spark machine learning library (https://spark.apache.org/docs/2.2.0/ml-pipeline.html). Pipeline construction, data preparation, model specification, training, and validation functions are in the **ML_Utils.py** file. For this exercise, we consider the following: <br>
    1. regexTokenizer: Tokenization (with regular expressions)
    2. stopwordsRemover: Stop-word removal
    3. countVectors: Count vectors (“document-term vectors”)
    4. TF-IDF features
<br><br>
4. The feature stages are assembled into a pipeline. Once the pipeline is constructed, the raw data is passed through it to extract features and labels, which are then passed to the classifiers.
<br><br>
5. For this exercise, logistic regression, logistic regression with cross-validation, and random forests are implemented.
<br><br>
6. The evaluation metrics generated by Spark's multiclass evaluator are used to judge the performance of the models.
<br><br>
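
To make the feature steps listed above concrete, here is a minimal plain-Python sketch (not the Spark implementation in **ML_Utils.py**) of tokenization, stop-word removal, count vectors, and TF-IDF weighting. The tiny stop-word set and the helper names are assumptions for illustration. The IDF formula `log((N + 1) / (df + 1))` is the smoothed form documented for Spark's `IDF` estimator.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list; Spark's StopWordsRemover ships a much longer default.
STOP_WORDS = {"with", "to", "the", "a"}

def tokenize(text):
    """Lowercase and split on non-word characters, dropping empty tokens."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]

def tf_idf(docs):
    """Return (sorted vocabulary, one {term: weight} dict per document)."""
    token_lists = [remove_stop_words(tokenize(d)) for d in docs]
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    n_docs = len(docs)
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    # Term frequency is the raw count, as in a count vector.
    weights = [{t: c * idf[t] for t, c in Counter(tokens).items()}
               for tokens in token_lists]
    return sorted(doc_freq), weights

docs = ["Dinner with client", "Taxi ride", "Client dinner", "Dinner"]
vocab, weights = tf_idf(docs)
print(vocab)
for doc, w in zip(docs, weights):
    print(doc, {t: round(v, 3) for t, v in w.items()})
```

In this toy corpus "dinner" appears in three of the four descriptions and therefore gets the lowest weight. With descriptions this short, each vector has only one or two non-zero entries.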
    
## Directions to run the Algorithm on other files
Change the training and validation file paths in the second executable cell. The notebook takes care of the rest.

## Key Observations

1. The data set is small, so TF-IDF and the other features built here do not represent the classes well. Hence the performance across models is poor.
<br><br>
2. In the Python implementation, I used word embeddings to perform classification. The results are far superior, as the word embeddings allow us to construct sentence vectors.
<br><br>
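
As an illustration of how word embeddings can yield sentence vectors, here is a minimal plain-Python sketch that averages the vectors of the words in a description. Averaging is one common choice and an assumption here; this notebook does not say how the Python implementation combines word vectors. The three-dimensional `EMBEDDINGS` table is invented for the example, whereas a real model would supply pre-trained vectors.

```python
# Hypothetical toy embeddings (3 dimensions); the values are invented for illustration.
EMBEDDINGS = {
    "dinner": [0.9, 0.1, 0.0],
    "lunch": [0.8, 0.2, 0.0],
    "client": [0.5, 0.0, 0.5],
    "taxi": [0.0, 0.9, 0.1],
    "ride": [0.1, 0.8, 0.1],
}

def sentence_vector(text, embeddings=EMBEDDINGS):
    """Average the embeddings of known words; return None if no word is known."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

dinner = sentence_vector("Client dinner")
lunch = sentence_vector("Team lunch")   # "team" is unknown and is skipped
taxi = sentence_vector("Taxi ride")
print("dinner vs lunch:", round(cosine(dinner, lunch), 3))
print("dinner vs taxi: ", round(cosine(dinner, taxi), 3))
```

Unlike TF-IDF, two descriptions do not need to share a token to end up close together, which is the property that helps with short expense descriptions.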

### Import relevant packages

In [1]:
# To find Spark home and initialize Spark instance to work with Jupyter
import findspark
findspark.init()

# Import pyspark to work with Spark from a Python IDE
import pyspark
import random

from pyspark.sql import SQLContext
from pyspark import SparkContext

# User defined utilities
import Data_to_Spark_Utils as D2S
import ML_Utils as MU

### Set up Spark Context and Ingest CSV as Spark DataFrames

In [2]:
sc = SparkContext()
Training_Data = D2S.Ingest_CSV_in_Spark(sc,"/home/muthusundaram/Python_trials/Wave_ML_Challenge/training_data_example.csv" )
Validation_Data = D2S.Ingest_CSV_in_Spark(sc,"/home/muthusundaram/Python_trials/Wave_ML_Challenge/validation_data_example.csv" )

### Benefits of using DataFrames with a SQL-like interface

In [3]:
Training_Data.columns

['date',
 'category',
 'employee id',
 'expense description',
 'pre-tax amount',
 'tax name',
 'tax amount']

In [4]:
drop_list = ['date','employee id', 'pre-tax amount', 'tax name',  'tax amount']
Training_Data = D2S.Drop_Coulmns_from_DataFrame(Training_Data,drop_list)
Validation_Data = D2S.Drop_Coulmns_from_DataFrame(Validation_Data,drop_list)
Validation_Data.show()

+--------------------+--------------------+
|            category| expense description|
+--------------------+--------------------+
|              Travel|           Taxi ride|
|Meals and Enterta...|  Dinner with Family|
| Computer - Hardware|Macbook Air Computer|
|     Office Supplies|               Paper|
|     Office Supplies|                Pens|
|              Travel|Airplane ticket t...|
|Meals and Enterta...|    Starbucks coffee|
|Meals and Enterta...|              Dinner|
|Meals and Enterta...|  Dinner with client|
|Meals and Enterta...|              Dinner|
|Meals and Enterta...|              Dinner|
|Meals and Enterta...|              Dinner|
+--------------------+--------------------+
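
For readers without access to **Data_to_Spark_Utils.py**, the ingestion and column-dropping steps above might look roughly like the sketch below. This is an assumption about what the helpers do, not their actual source: `load_labeled_text` is a hypothetical name, and it uses Spark's built-in CSV reader (`DataFrameReader.csv`, available since Spark 2.0) rather than the external spark-csv package. The import sits inside the function so that defining it does not require a configured Spark installation; calling it does.

```python
def load_labeled_text(sc, path, drop_list):
    """Read a CSV with a header row into a DataFrame and drop unwanted columns.

    sc: an active SparkContext
    path: location of the CSV file
    drop_list: names of the columns to discard
    """
    from pyspark.sql import SQLContext

    sql_context = SQLContext(sc)
    # header=True takes the column names from the first row of the file.
    df = sql_context.read.csv(path, header=True)
    # DataFrame.drop accepts several column names at once (Spark 2.0+).
    return df.drop(*drop_list)
```

With the names used in the cells above, a call would read `Training_Data = load_labeled_text(sc, training_path, drop_list)`, where `training_path` stands for the CSV path.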



In [5]:
Training_Data.printSchema()

root
 |-- category: string (nullable = true)
 |-- expense description: string (nullable = true)



In [6]:
Validation_Data.printSchema()

root
 |-- category: string (nullable = true)
 |-- expense description: string (nullable = true)



In [7]:
from pyspark.sql.functions import col
Training_Data.groupBy("Category") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+--------------------+-----+
|            Category|count|
+--------------------+-----+
|Meals and Enterta...|   10|
|              Travel|    6|
| Computer - Software|    4|
| Computer - Hardware|    3|
|     Office Supplies|    1|
+--------------------+-----+
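
The class counts above give a useful reference point for the accuracy figures reported later. As a small worked example (plain Python, using only the counts printed above), the sketch below computes each class's share of the training data and the accuracy a trivial classifier would reach on this data by always predicting the most frequent class.

```python
# Class counts copied from the groupBy output above.
class_counts = {
    "Meals and Entertainment": 10,
    "Travel": 6,
    "Computer - Software": 4,
    "Computer - Hardware": 3,
    "Office Supplies": 1,
}

total = sum(class_counts.values())
shares = {name: count / total for name, count in class_counts.items()}

# A classifier that ignores its input and always answers with the largest class.
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_class] / total

for name, share in shares.items():
    print("%-25s %.3f" % (name, share))
print("Majority-class baseline accuracy: %.3f" % baseline_accuracy)
```

The hold-out and validation sets may have a different class mix, so this baseline is only a rough yardstick for the scores below.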



In [8]:
Training_Data.groupBy("expense description") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+--------------------+-----+
| expense description|count|
+--------------------+-----+
|              Dinner|    4|
|           Taxi ride|    3|
|Airplane ticket t...|    2|
|Dropbox Subscription|    2|
|  Dinner with client|    1|
|     Flight to Miami|    1|
|              iPhone|    1|
| iCloud Subscription|    1|
|          Team lunch|    1|
|    Microsoft Office|    1|
|  HP Laptop Computer|    1|
|   Coffee with Steve|    1|
|               Paper|    1|
|Dinner with poten...|    1|
|       Client dinner|    1|
|Macbook Air Computer|    1|
|    Starbucks coffee|    1|
+--------------------+-----+



### Construct machine learning pipeline

In [9]:
# from pyspark.ml import Pipeline
pipeline = MU.Construct_Pipeline()
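
Since **ML_Utils.py** is not shown here, the following is a hedged sketch of what a function like `MU.Construct_Pipeline` could assemble from the stages named in the approach overview. The function name `build_text_pipeline`, the intermediate column names, and the tokenizer pattern are assumptions. The stage classes themselves (`RegexTokenizer`, `StopWordsRemover`, `CountVectorizer`, `IDF`, `StringIndexer`, `Pipeline`) are part of `pyspark.ml`. Imports are placed inside the function so the definition loads without Spark; calling it requires an active SparkContext.

```python
def build_text_pipeline(text_col="expense description", label_col="category"):
    """Assemble tokenizer -> stop-word remover -> count vectors -> IDF -> label indexer."""
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import (
        CountVectorizer, IDF, RegexTokenizer, StopWordsRemover, StringIndexer,
    )

    # Split descriptions on non-word characters (assumed pattern).
    tokenizer = RegexTokenizer(inputCol=text_col, outputCol="words", pattern="\\W")
    # Drop common English stop words using Spark's default list.
    remover = StopWordsRemover(inputCol="words", outputCol="filtered")
    # Turn each token list into a sparse term-count vector.
    counts = CountVectorizer(inputCol="filtered", outputCol="raw_features")
    # Rescale counts by inverse document frequency.
    idf = IDF(inputCol="raw_features", outputCol="features")
    # Map category strings to numeric labels; the most frequent category gets 0.0.
    indexer = StringIndexer(inputCol=label_col, outputCol="label")

    return Pipeline(stages=[tokenizer, remover, counts, idf, indexer])
```

A pipeline like this is usually fitted once on the training data, and the fitted model is then reused to transform the validation data so that the vocabulary, IDF weights, and label indices stay the same across both sets.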

### Feed Raw data to Pipeline for Feature Engineering

In [10]:
# Engineer the features based on constructed pipeline for both training and validation set
Training_Dataset = MU.Construct_ML_Dataset(pipeline,Training_Data)
Validation_Dataset = MU.Construct_ML_Dataset(pipeline,Validation_Data)

### Build Classification Models

In [11]:
# Train Logistic regression model
LR_Model, LR_Test_Data = MU.Train_Model(Training_Dataset, "LR")

# Train Logistic regression model with cross validation
LRCV_Model, LRCV_Test_Data = MU.Train_Model(Training_Dataset, "LRCV")

# Train Random Forest model
RF_Model, RF_Test_Data = MU.Train_Model(Training_Dataset, "RF")

Training Dataset Count: 16
Test Dataset Count: 8
Training Dataset Count: 16
Test Dataset Count: 8
Training Dataset Count: 16
Test Dataset Count: 8
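
The "LRCV" variant tunes logistic regression with cross-validation on the 16 training rows reported above. To illustrate what k-fold cross-validation does with so few rows, the plain-Python sketch below partitions 16 row indices into folds. The fold count `k=3`, the seed, and the helper name `k_fold_indices` are assumptions. Spark's `CrossValidator` assigns rows to folds randomly, so its fold sizes are only approximately equal.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, validation_idx) pairs; each row is validated exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Take every k-th index to form k near-equal folds.
    folds = [indices[i::k] for i in range(k)]
    for i, validation_idx in enumerate(folds):
        train_idx = [j for fold_no, fold in enumerate(folds)
                     if fold_no != i for j in fold]
        yield train_idx, validation_idx

for fold_no, (train_idx, validation_idx) in enumerate(k_fold_indices(16, k=3)):
    print("fold %d: %d train rows, %d validation rows"
          % (fold_no, len(train_idx), len(validation_idx)))
```

With only five or six validation rows per fold and five categories, a fold can easily contain no example of the rarer classes, which makes the cross-validated scores noisy.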


### Calculate and Publish Performance Metrics (for both test/hold-out data and validation data)

In [19]:
evaluator,predictions  = MU.Validate_Model(LR_Model,LR_Test_Data)

print("Summary Stats")
print("F(1) Score         = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "f1"}))
print("Weighted Precision = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "weightedPrecision"}))
print("Weighted Recall    = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "weightedRecall"}))
print("Accuracy           = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "accuracy"}))

+---------------------+-----------------------+------------------------------+-----+----------+
|  expense description|               category|                   probability|label|prediction|
+---------------------+-----------------------+------------------------------+-----+----------+
|        Client dinner|Meals and Entertainment|[0.674359311290608,0.095459...|  0.0|       0.0|
|               Dinner|Meals and Entertainment|[0.674359311290608,0.095459...|  0.0|       0.0|
| Dropbox Subscription|    Computer - Software|[0.23892761729025916,0.2293...|  2.0|       0.0|
|    Coffee with Steve|Meals and Entertainment|[0.23892761729025916,0.2293...|  0.0|       0.0|
|Airplane ticket to NY|                 Travel|[0.23892761729025916,0.2293...|  1.0|       0.0|
|            Taxi ride|                 Travel|[0.23892761729025916,0.2293...|  1.0|       0.0|
|     Starbucks coffee|Meals and Entertainment|[0.23892761729025916,0.2293...|  0.0|       0.0|
|Airplane ticket to NY|                 

In [20]:
evaluator,predictions = MU.Validate_Model(LR_Model,Validation_Dataset)

print("Summary Stats")
print("F(1) Score         = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "f1"}))
print("Weighted Precision = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "weightedPrecision"}))
print("Weighted Recall    = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "weightedRecall"}))
print("Accuracy           = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "accuracy"}))

+------------------------+-----------------------+------------------------------+-----+----------+
|     expense description|               category|                   probability|label|prediction|
+------------------------+-----------------------+------------------------------+-----+----------+
|      Dinner with Family|Meals and Entertainment|[0.4668343323157805,0.15839...|  0.0|       0.0|
|      Dinner with client|Meals and Entertainment|[0.4668343323157805,0.15839...|  0.0|       0.0|
|                  Dinner|Meals and Entertainment|[0.4668343323157805,0.15839...|  0.0|       0.0|
|                  Dinner|Meals and Entertainment|[0.4668343323157805,0.15839...|  0.0|       0.0|
|                  Dinner|Meals and Entertainment|[0.4668343323157805,0.15839...|  0.0|       0.0|
|                  Dinner|Meals and Entertainment|[0.4668343323157805,0.15839...|  0.0|       0.0|
|               Taxi ride|                 Travel|[0.23892761729025916,0.2293...|  1.0|       0.0|
|    Macbo

In [21]:
evaluator,predictions = MU.Validate_Model(LRCV_Model,LRCV_Test_Data)

print("Summary Stats")
print("F(1) Score         = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "f1"}))
print("Weighted Precision = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "weightedPrecision"}))
print("Weighted Recall    = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "weightedRecall"}))
print("Accuracy           = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "accuracy"}))

+-------------------+-----------------------+------------------------------+-----+----------+
|expense description|               category|                   probability|label|prediction|
+-------------------+-----------------------+------------------------------+-----+----------+
|      Client dinner|Meals and Entertainment|[0.8213359733256489,0.05153...|  0.0|       0.0|
|             Dinner|Meals and Entertainment|[0.8213359733256489,0.05153...|  0.0|       0.0|
+-------------------+-----------------------+------------------------------+-----+----------+

Summary Stats
F(1) Score         = 0.369047619047619
Weighted Precision = 0.5208333333333334
Weighted Recall    = 0.375
Accuracy           = 0.375
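
To show how the four summary statistics relate to each other, the plain-Python sketch below recomputes them from label/prediction pairs. It follows the usual definitions of weighted precision, recall, and F1 (per-class scores weighted by each class's share of the true labels), with a class that is never predicted given precision 0. Treat it as an illustration rather than Spark's exact code. The input is the seven fully visible rows of the first logistic regression table (cell In [19]); the truncated eighth row is left out, so the numbers are not expected to match that cell's own summary.

```python
from collections import Counter

def weighted_metrics(labels, predictions):
    """Compute weighted F1, precision, recall, and accuracy for a multiclass problem."""
    n = len(labels)
    support = Counter(labels)          # number of true rows per class
    predicted = Counter(predictions)   # number of predicted rows per class
    true_pos = Counter(l for l, p in zip(labels, predictions) if l == p)

    precision = recall = f1 = 0.0
    for cls, count in support.items():
        p = true_pos[cls] / predicted[cls] if predicted[cls] else 0.0
        r = true_pos[cls] / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = count / n             # class share of the true labels
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    accuracy = sum(true_pos.values()) / n
    return {"f1": f1, "weightedPrecision": precision,
            "weightedRecall": recall, "accuracy": accuracy}

# label and prediction columns of the seven complete rows in cell In [19]
labels = [0.0, 0.0, 2.0, 0.0, 1.0, 1.0, 0.0]
predictions = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
for name, value in weighted_metrics(labels, predictions).items():
    print("%-18s = %.4f" % (name, value))
```

Under these definitions weighted recall always equals accuracy, which is why the two values printed above are both 0.375.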


In [22]:
evaluator,predictions = MU.Validate_Model(LRCV_Model,Validation_Dataset)

print("Summary Stats")
print("F(1) Score         = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "f1"}))
print("Weighted Precision = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "weightedPrecision"}))
print("Weighted Recall    = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "weightedRecall"}))
print("Accuracy           = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "accuracy"}))

+-------------------+-----------------------+------------------------------+-----+----------+
|expense description|               category|                   probability|label|prediction|
+-------------------+-----------------------+------------------------------+-----+----------+
| Dinner with Family|Meals and Entertainment|[0.528261292898544,0.139022...|  0.0|       0.0|
|             Dinner|Meals and Entertainment|[0.528261292898544,0.139022...|  0.0|       0.0|
| Dinner with client|Meals and Entertainment|[0.528261292898544,0.139022...|  0.0|       0.0|
|             Dinner|Meals and Entertainment|[0.528261292898544,0.139022...|  0.0|       0.0|
|             Dinner|Meals and Entertainment|[0.528261292898544,0.139022...|  0.0|       0.0|
|             Dinner|Meals and Entertainment|[0.528261292898544,0.139022...|  0.0|       0.0|
+-------------------+-----------------------+------------------------------+-----+----------+

Summary Stats
F(1) Score         = 0.6217948717948718
Weigh

In [23]:
evaluator,predictions = MU.Validate_Model(RF_Model,RF_Test_Data)

print("Summary Stats")
print("F(1) Score         = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "f1"}))
print("Weighted Precision = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "weightedPrecision"}))
print("Weighted Recall    = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "weightedRecall"}))
print("Accuracy           = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "accuracy"}))

+---------------------+-----------------------+------------------------------+-----+----------+
|  expense description|               category|                   probability|label|prediction|
+---------------------+-----------------------+------------------------------+-----+----------+
|        Client dinner|Meals and Entertainment|[0.385303005846474,0.181004...|  0.0|       0.0|
|               Dinner|Meals and Entertainment|[0.385303005846474,0.181004...|  0.0|       0.0|
| Dropbox Subscription|    Computer - Software|[0.3576406681841363,0.19050...|  2.0|       0.0|
|    Coffee with Steve|Meals and Entertainment|[0.3576406681841363,0.19050...|  0.0|       0.0|
|Airplane ticket to NY|                 Travel|[0.3576406681841363,0.19050...|  1.0|       0.0|
|            Taxi ride|                 Travel|[0.3576406681841363,0.19050...|  1.0|       0.0|
|     Starbucks coffee|Meals and Entertainment|[0.3576406681841363,0.19050...|  0.0|       0.0|
|Airplane ticket to NY|                 

In [24]:
evaluator,predictions = MU.Validate_Model(RF_Model,Validation_Dataset)

print("Summary Stats")
print("F(1) Score         = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "f1"}))
print("Weighted Precision = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "weightedPrecision"}))
print("Weighted Recall    = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "weightedRecall"}))
print("Accuracy           = %s" % evaluator.evaluate(predictions,{evaluator.metricName: "accuracy"}))

+------------------------+-----------------------+------------------------------+-----+----------+
|     expense description|               category|                   probability|label|prediction|
+------------------------+-----------------------+------------------------------+-----+----------+
|      Dinner with Family|Meals and Entertainment|[0.385303005846474,0.181004...|  0.0|       0.0|
|      Dinner with client|Meals and Entertainment|[0.385303005846474,0.181004...|  0.0|       0.0|
|                  Dinner|Meals and Entertainment|[0.385303005846474,0.181004...|  0.0|       0.0|
|                  Dinner|Meals and Entertainment|[0.385303005846474,0.181004...|  0.0|       0.0|
|                  Dinner|Meals and Entertainment|[0.385303005846474,0.181004...|  0.0|       0.0|
|                  Dinner|Meals and Entertainment|[0.385303005846474,0.181004...|  0.0|       0.0|
|               Taxi ride|                 Travel|[0.3576406681841363,0.19050...|  1.0|       0.0|
|    Macbo

In [25]:
sc.stop()