# Transaction Labeller Project

As a budget-conscious Data Scientist, I like to analyze my transactions by category to better understand where my money is going. For several months I labelled my transactions by hand, and I decided it was time to let a machine do the work.

The goal of this project is to use the description of each transaction to predict which 'Category' I should assign it to. This is a text classification task which falls into the realm of **Supervised ML**. This is a quick pass at the task with some ideas for improving the model going forward.

In [1]:
import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt
%matplotlib inline

### Read in and have a look at the data
This is a dataset which I have hand labelled. The `Category` variable is going to be our target. Note we have **imbalanced classes**. Food transactions are much more common than the others.

In [2]:
df = pd.read_csv('../data/labelled_clean.csv')

In [3]:
df.head()

Unnamed: 0,date,amount,description,Category
0,8/1/17,-65.1,THE PEDALER BIKE SHOP EL SOBRANTE CA,Stuff
1,8/1/17,-30.35,BETTE'S OCEANVIEW DINER BERKELEY CA,Food
2,8/1/17,-27.0,PURCHASE AUTHORIZED ON 08/01 CA DMV EL CERRITO...,Car
3,8/1/17,-23.67,ORCHARD SUPPLY #350,Stuff
4,8/1/17,-15.0,PURCHASE AUTHORIZED ON 08/01 CA DMV EL CERRITO...,Car


In [4]:
df.Category.value_counts()

Food             346
Stuff             76
Travel            70
Car               55
Health            38
Entertainment     36
Cash              18
School            10
Rent              10
Name: Category, dtype: int64

### Clean text data and label encode the target variable
Here we begin to work the data into shape for the modelling to come. At a quick glance, numbers and special characters don't seem to be relevant, so I strip them out of the strings. I also numerically encode the `Category` variable to create `Target`.


In [5]:
# clean description text (lowercase and alphabet chars only)
def clean_desc(x):
    return ''.join(char.lower() for char in x if char.isalpha() or char == ' ')
df['desc_clean'] = df['description'].map(lambda x: clean_desc(x))

# Label Encode Target variable
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Target'] = le.fit_transform(df['Category'])
df.head()

Unnamed: 0,date,amount,description,Category,desc_clean,Target
0,8/1/17,-65.1,THE PEDALER BIKE SHOP EL SOBRANTE CA,Stuff,the pedaler bike shop el sobrante ca,7
1,8/1/17,-30.35,BETTE'S OCEANVIEW DINER BERKELEY CA,Food,bettes oceanview diner berkeley ca,3
2,8/1/17,-27.0,PURCHASE AUTHORIZED ON 08/01 CA DMV EL CERRITO...,Car,purchase authorized on ca dmv el cerrito fo ...,0
3,8/1/17,-23.67,ORCHARD SUPPLY #350,Stuff,orchard supply,7
4,8/1/17,-15.0,PURCHASE AUTHORIZED ON 08/01 CA DMV EL CERRITO...,Car,purchase authorized on ca dmv el cerrito fo ...,0


### Modeling
Grab the **feature** column `desc_clean` and the `Target` column, then do a train/test split.

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

In [7]:
X,y = df['desc_clean'].values, df.Target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
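
Because the classes are imbalanced, a plain random split can leave the rarest categories nearly or entirely absent from a small test set. One option is the `stratify` argument of `train_test_split`, which keeps class proportions roughly equal in both halves. The sketch below uses made-up stand-in data (`X_toy`, `y_toy` are hypothetical names, not the notebook's variables) just to show the effect.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-in data (hypothetical): 90 'Food' rows and 10 'Rent' rows.
X_toy = [f'txn {i}' for i in range(100)]
y_toy = ['Food'] * 90 + ['Rent'] * 10

# stratify=y_toy asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=1, stratify=y_toy)

print(Counter(y_te))
```

In the notebook itself this would amount to adding `stratify=y` to the existing call, assuming every category has at least two rows.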

Let's try a couple of text vectorization techniques (turning words into count vectors) and models. Since there aren't many words per *document* (and especially few repeats), I expect `CountVectorizer` and `TfidfVectorizer` to yield similar results. Random Forest should do a pretty good job out of the box by identifying and splitting on tokens that are predictive. I include `MultinomialNB` for comparison, as it is also commonly used for text classification problems.

In [8]:
cVec = CountVectorizer()
tfVec = TfidfVectorizer()

rf = RandomForestClassifier()
nb = MultinomialNB()

In [9]:
from sklearn.pipeline import Pipeline
for mod in [rf,nb]:
    for vect in [cVec,tfVec]:
        steps = [('vect', vect),
                  ('clf', mod)]
        pipeline = Pipeline(steps)
        clf = pipeline.fit(X_train,y_train)
        print(str(type(vect)).split('.')[-1].strip("'>"))
        print(str(type(mod)).split('.')[-1].strip("'>"))
        print(clf.score(X_test,y_test))
        print('')

CountVectorizer
RandomForestClassifier
0.8333333333333334

TfidfVectorizer
RandomForestClassifier
0.803030303030303

CountVectorizer
MultinomialNB
0.8181818181818182

TfidfVectorizer
MultinomialNB
0.803030303030303
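
These four scores come from a single 10% hold-out. They are consistent with a 66-row test set, where a gap of 0.03 corresponds to two transactions, so the ranking is noisy. A rough way to get a steadier comparison is k-fold cross-validation. The sketch below is an assumption about how that could look, run on a tiny made-up corpus (`docs` and `labels` are hypothetical) rather than the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in for X and y (hypothetical descriptions and labels).
docs = ['diner berkeley', 'cafe oakland', 'pizza berkeley', 'taqueria oakland',
        'bike shop', 'orchard supply', 'hardware store', 'book shop']
labels = ['Food'] * 4 + ['Stuff'] * 4

pipe = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])

# Each of the 4 folds holds out one 'Food' and one 'Stuff' row.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)
scores = cross_val_score(pipe, docs, labels, cv=cv)

# Mean accuracy and its spread across folds.
print(scores.mean(), scores.std())
```

Putting the vectorizer inside the pipeline means the vocabulary is refit on each training fold, so no test-fold tokens leak into training.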





### RandomForest with CountVectorizer wins!!
All models yield pretty similar results. RF with CountVec is the simplest and most intuitive out-of-the-box approach, so I will start with this and see if I can improve upon it later with a more custom approach.

#### Let's fit the final model to all the labelled data

In [10]:
cVec = CountVectorizer()
rf = RandomForestClassifier()
steps = [('vect', cVec),('clf', rf)]
pipeline = Pipeline(steps)
clf = pipeline.fit(X,y)

### Use Model to Label Unlabelled data
First load the unlabelled data and clean it just as we did with the labelled dataset, then make predictions. I also include a column with the maximum predicted probability across all labels, which serves as a proxy for the model's confidence in its result. I will then go through the predicted outcomes and hand label the rows that the model had trouble with.

In [11]:
# load unlabelled data and clean description column
df_unlabelled = pd.read_csv('../data/unlabelled_clean.csv')
df_unlabelled['desc_clean'] = df_unlabelled['description'].map(lambda x: clean_desc(x))
# Make predictions and label them
df_unlabelled['Target'] = clf.predict(df_unlabelled['desc_clean'].values)
df_unlabelled['confidence'] = np.max(clf.predict_proba(df_unlabelled['desc_clean'].values),axis=1)
df_unlabelled['model_label'] = le.inverse_transform(df_unlabelled['Target'])
# write output to CSV 
df_unlabelled[['date','amount','description','model_label','confidence']].to_csv('../data/unlabelled_preds.csv')
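
To make the manual review pass quicker, the low-confidence rows can be pulled to the top. This is only a sketch: `preds` is a made-up stand-in for the predictions frame, and `REVIEW_THRESHOLD` is an assumed cut-off, not a value used in this project.

```python
import pandas as pd

# Hypothetical stand-in for the predictions written to unlabelled_preds.csv.
preds = pd.DataFrame({
    'description': ['THE YOGA ROOM', 'ORCHARD SUPPLY #350', 'SOME CAFE'],
    'model_label': ['Food', 'Stuff', 'Food'],
    'confidence': [0.5, 1.0, 0.8],
})

REVIEW_THRESHOLD = 0.9  # assumed cut-off; tune to taste

# Keep rows the model is unsure about, least confident first.
to_review = preds[preds['confidence'] < REVIEW_THRESHOLD].sort_values('confidence')
print(to_review)
```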


### Evaluate Model vs Hand Label
We can see that the model does better, though still not perfectly, when `confidence == 1`. It seems that many of the errors come from the model defaulting to the most common label, `Food`, when it doesn't have a good token to pick up on.
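
One thing that might reduce the pull toward the majority class is weighting classes inversely to their frequency. `RandomForestClassifier` accepts `class_weight='balanced'` for this. The sketch below is untested on the real data and uses a made-up toy corpus (`docs`, `labels` are hypothetical), so treat it as an idea to try rather than a known fix.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Toy stand-in data (hypothetical): many 'Food' rows, few 'Travel' rows.
docs = ['cafe berkeley'] * 8 + ['diner oakland'] * 8 + ['airbnb booking'] * 2
labels = ['Food'] * 16 + ['Travel'] * 2

pipe_balanced = Pipeline([
    ('vect', CountVectorizer()),
    # class_weight='balanced' weights each class inversely to its frequency.
    ('clf', RandomForestClassifier(n_estimators=100, class_weight='balanced',
                                   random_state=1)),
])
pipe_balanced.fit(docs, labels)

pred = pipe_balanced.predict(['airbnb booking'])
print(pred)
```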

A couple of disappointing misses are Airbnb, which should always be `Travel`, and variations of Amazon, which should belong to `Stuff`. I think my cleaning scheme does not tokenize strings without spaces, like `Amazon.com*MT9W93221`, very well. The `amount` column might also contain some useful information for the model. These are a few things to keep in mind for the next iterations.
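
On the tokenization point: `clean_desc` deletes non-letters outright, so `Amazon.com*MT9W93221` collapses into the single token `amazoncommtw`. A possible alternative, sketched below, replaces each run of non-letters with a space instead. `clean_desc_spaced` is a hypothetical name, and unlike `str.isalpha()` this regex only keeps ASCII letters.

```python
import re

def clean_desc_spaced(text):
    """Hypothetical variant of clean_desc: non-letters become spaces."""
    # Replace each run of non-ASCII-letter characters with one space,
    # trim the ends, then lowercase.
    return re.sub(r'[^A-Za-z]+', ' ', text).strip().lower()

print(clean_desc_spaced('Amazon.com*MT9W93221'))  # amazon com mt w
```

With `CountVectorizer`'s default token pattern, which keeps tokens of two or more characters, this would expose `amazon` as its own feature.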

All in all, this is still a useful tool and I am happy with the result!

In [12]:
df_hl = pd.read_csv('../data/unlabelled_preds_hl.csv',index_col=0)

In [13]:
df_hl[df_hl.hand_label.notnull()]

Unnamed: 0,date,amount,description,model_label,confidence,hand_label
97,7/17/18,-18.0,THE YOGA ROOM,Food,0.5,Entertainment
182,8/24/18,-30.28,"BP#9415381KAHOL, INC LEXINGTON KY",Food,0.5,Car
203,10/24/18,-2.99,PAYPAL INST XFER 181024 ITUNESAPPST COLIN BROC...,Travel,0.5,Entertainment
204,10/9/18,-9.99,PAYPAL INST XFER 181007 ITUNESAPPST COLIN BROC...,Travel,0.5,Entertainment
209,9/24/18,-2.99,PAYPAL INST XFER 180924 ITUNESAPPST COLIN BROC...,Travel,0.5,Entertainment
211,9/7/18,-9.99,PAYPAL INST XFER 180907 ITUNESAPPST COLIN BROC...,Travel,0.5,Entertainment
219,8/24/18,-2.99,PAYPAL INST XFER 180824 ITUNESAPPST COLIN BROC...,Travel,0.5,Entertainment
223,8/7/18,-9.99,PAYPAL INST XFER 180807 ITUNESAPPST COLIN BROC...,Travel,0.5,Entertainment
229,7/24/18,-2.99,PAYPAL INST XFER 180724 ITUNESAPPST COLIN BROC...,Travel,0.5,Entertainment
234,7/13/18,-4.99,PAYPAL INST XFER 180713 ITUNESAPPST COLIN BROC...,Travel,0.5,Entertainment
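
To put a number on how the model does at `confidence == 1` versus below it, the agreement rate can be computed per bucket. The sketch below assumes `hand_label` is filled in only where I overrode the model, and that blank rows mean the model's label was accepted. `df_demo` is a made-up stand-in for `df_hl`.

```python
import pandas as pd

# Hypothetical stand-in for df_hl; hand_label assumed blank where the model was accepted.
df_demo = pd.DataFrame({
    'model_label': ['Food', 'Food', 'Travel', 'Stuff'],
    'confidence': [0.5, 1.0, 0.5, 1.0],
    'hand_label': ['Entertainment', None, 'Entertainment', None],
})

# Final label: the hand label where present, otherwise the model's label.
df_demo['final_label'] = df_demo['hand_label'].fillna(df_demo['model_label'])
df_demo['model_correct'] = df_demo['final_label'] == df_demo['model_label']

# Bucket rows by whether the model was fully confident.
df_demo['bucket'] = df_demo['confidence'].eq(1).map(
    {True: 'confidence == 1', False: 'confidence < 1'})

# Share of rows the model got right in each bucket.
summary = df_demo.groupby('bucket')['model_correct'].mean()
print(summary)
```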
