# Categorizing my personal bank transactions

### Goal: Predict what group a transaction belongs to using KNeighborsClassifier from scikit-learn

The transaction data in this notebook is my own personal day-to-day income and expenses. It is pulled from the Checkbook program I use to keep track of my finances. I started keeping track in January 2018 and have accumulated around 320 transactions since.

There are five predetermined categories a transaction could fall into: **income, fuel, food, entertainment,** and **needs.**

### Loading data

In [3]:
# Import necessary libraries
import pandas as pd
import re
from sklearn.neighbors import KNeighborsClassifier

'''
Data structure:

  plusminus  amount description           type
0         +  121.55     desc1           income
1         -   -6.74     Walmart  entertainment
2         +   92.83     desc2           income
3         -  -10.60     Walmart  entertainment
4         +   91.18     desc3           income

'''

data = pd.read_csv('latest.csv')
data.tail().head(3)

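|     | plusminus | amount | description     | type          |
|-----|-----------|--------|-----------------|---------------|
| 325 | -         | -1.25  | Omaha parking   | entertainment |
| 326 | -         | -12.0  | Spaghetti works | food          |
| 327 | -         | -9.31  | Arbys omaha     | food          |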


*The data is pulled from a database via a process not shown here.*

## Prepare data

### Gathering all words used in transaction descriptions

Collect all unique words and put them in a list.
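
As a small illustration of this step, the sketch below builds a vocabulary from three sample descriptions taken from the rows shown above. The `tokenize` helper and the sorted ordering are assumptions added for illustration; sorting makes the word order reproducible, which a plain `set` does not guarantee.

```python
import re

def tokenize(description):
    """Split a description into lowercase word tokens."""
    return [word.lower() for word in re.findall(r'\w+', description)]

# Sample descriptions taken from the rows displayed above
descriptions = ['Omaha parking', 'Spaghetti works', 'Arbys omaha']

# A set comprehension removes duplicates; sorting fixes the order
vocabulary = sorted({word for desc in descriptions for word in tokenize(desc)})
print(vocabulary)  # ['arbys', 'omaha', 'parking', 'spaghetti', 'works']
```

Note that "Omaha" and "omaha" collapse into one entry because every token is lowercased before it is added.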

In [4]:
'''

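Data structure for allWords:
['x', 'y', ..., 'z']
Every unique word is put in the list.

Data structure for cols:
['plusminus', 'amount', 'x', 'y', ..., 'z']
The two fixed columns followed by every word in allWords.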

'''

allWords = [] # Any word used in any description
for line in set(data.description): # For every unique description in the data
    for word in re.findall(r'\w+', line): # Raw string avoids an invalid escape warning
        allWords.append(word.lower())
allWords = list(set(allWords)) # Remove duplicate words, then convert back to a list
allWords

cols = ['plusminus','amount']
for word in allWords:
    cols.append(word)

len(cols) # Number of feature columns: 'plusminus', 'amount', plus one per unique word

234

### Building binary vectors for each transaction (one hot encoding)

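Use the unique word list to encode each description as a binary vector: 1 if the word appears in the description and 0 if it does not. Because a description can contain several words, a vector can hold more than one 1 (strictly speaking, a binary bag-of-words encoding rather than a pure one hot encoding).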
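
The encoding of a single description can be sketched as follows. The five-word vocabulary and the `encode_description` helper are hypothetical, chosen only to keep the example short; the real vocabulary in this notebook has over 200 words.

```python
import re

def encode_description(description, vocabulary):
    """Return a 0/1 list with one entry per vocabulary word."""
    words = {w.lower() for w in re.findall(r'\w+', description)}
    return [1 if word in words else 0 for word in vocabulary]

# Hypothetical five-word vocabulary, for illustration only
vocabulary = ['arbys', 'omaha', 'parking', 'spaghetti', 'works']

print(encode_description('Arbys omaha', vocabulary))    # [1, 1, 0, 0, 0]
print(encode_description('Omaha parking', vocabulary))  # [0, 1, 1, 0, 0]
```

Both descriptions share the word "omaha", so their vectors agree in that position, which is what lets a distance-based model treat them as similar.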

In [5]:
'''

Data structure for a binary vector:
[(1 for positive or 0 for negative), (amount of transaction), 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

'''

# Manipulating the data for a binary vector
saveArr = [] # Big array that will hold all vectors

# Saving 1 if '+' or 0 if '-'
for pm in list(data.plusminus):
    if pm == '+':
        saveArr.append([1])
    else:
        saveArr.append([0])
saveArr

# Appending the amounts (int() truncates each amount to a whole number)
for i in range(len(data.type)):
    saveArr[i].append(int(data.iloc[i, 1]))

# Appending the description vector
# A binary vector of descriptions that are 1 if word in, 0 if not
for i in range(len(data.type)):
    
    thisdesc = re.findall(r'\w+', data.iloc[i, 2]) # Words in the current description
    ld=[]
    # Lowered description for comparison to allWords list
    for item in thisdesc:
        ld.append(item.lower())
        
    # For every word in allWords, test if it is in the current description
    for word in allWords:
        if word in ld: # If word in the current description
            saveArr[i].append(1) # Append 1 if true, 0 if not
        else:
            saveArr[i].append(0)
            
len(saveArr)

330

### Grab the target categories for training

Assign each transaction its designated category so the model can learn.
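
A minimal sketch of turning category names into integer targets is shown below. The sample labels are made up for illustration, and sorting the unique labels is an assumption of this sketch: it gives each category the same integer on every run, whereas the order of a plain `set` can vary.

```python
# Sample labels, in the same form as the `type` column
labels = ['income', 'entertainment', 'income', 'food', 'fuel', 'need']

# Sorting gives every run the same label-to-integer mapping
classes = sorted(set(labels))
label_to_index = {label: i for i, label in enumerate(classes)}

targets = [label_to_index[label] for label in labels]
print(classes)  # ['entertainment', 'food', 'fuel', 'income', 'need']
print(targets)  # [3, 0, 3, 1, 2, 4]
```

A predicted integer can be mapped back to its name with `classes[index]`.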

In [6]:
'''

Length of targets should match the length of saveArr!
This is because each row in the data (saveArr) needs to have a target value.

Data structure for targets:
[2, 4, 2, 4, 2, 4, 2, 4, 1, 2, 4, 4, 2, 0, 1, 0, 2, 1, 1, 1,
0, 1, 2, 2, 1, 1, 2, 0, 0, 2, 1, 1, 0, 2, 3, 1, 1, 1, 1, 2, 
                        ...
0, 1, 2, 2, 1, 2, 1, 2, 2, 3, 4, 2, 1, 1, 2, 2, 2, 1, 0, 2, 1,
2, 3, 4, 2]

'''

targets = [] # List of all target values
types = list(set(list(data.type)))
# e.g. types = ['food', 'income', 'fuel', 'entertainment', 'need']
# Set order is arbitrary, so the index assigned to each type can differ between runs


for i in range(len(data.type)): # For every i row in the data
    for j in range(len(types)): # For every j index in types
        
        # If the type equals the type of the row of the data
        if types[j] == data.iloc[i, 3]:
            targets.append(j) # Append the index of the type in the type vector
            
len(targets)

330

## Data analysis

### Fit the data with KNeighborsClassifier

In [13]:
'''

Input for:
knn.fit(training_set, target_values)

Input for:
X = [dtp1, dtp2, dtp3...] 
X takes data rows to predict; should match structure of saveArr[i]

Expected output data of prediction:
[t1, t2, t3, t4...]
Types of predicted data

'''

# Data frame with rows of each vector in saveArr 
final = pd.DataFrame(saveArr,columns=cols)
final.head()

# Create a KNN classifier that votes among the 10 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=10)

# Fit the model on all but the last 6 rows, which are held back from training
max_train = int(data.shape[0])-6
knn.fit(saveArr[:max_train], targets[:max_train])

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                     weights='uniform')
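
To make concrete what a k-nearest-neighbors classifier does at prediction time, here is a minimal pure-Python sketch of majority voting with Euclidean distance (the `p=2` Minkowski metric in the output above). The `knn_predict` function and the toy data are hypothetical. This is not scikit-learn's implementation, which uses faster neighbor searches and may break ties differently.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest rows."""
    # Euclidean distance from the query to every training row
    distances = [(math.dist(row, query), label)
                 for row, label in zip(train_X, train_y)]
    # Keep the k closest rows and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data in the layout [plusminus, amount, word_a, word_b]
train_X = [[0, -10, 1, 0], [0, -12, 1, 0], [0, -9, 0, 1], [1, 100, 0, 0]]
train_y = ['food', 'food', 'fuel', 'income']

print(knn_predict(train_X, train_y, [0, -11, 1, 0], k=3))  # food
```

In this toy example the two `food` rows sit at distance 1 from the query and the `fuel` row at about 2.45, so `food` wins the vote 2 to 1.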

### Predict the categories

Using the model created above, we can now predict categories of transactions the model has not seen. That is, use what the machine has learned and apply it to new, foreign data that needs to be categorized.

We will try to accurately predict the types of the transactions immediately below.

In [16]:
data[max_train+1:].head(3)

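|     | plusminus | amount | description     | type          |
|-----|-----------|--------|-----------------|---------------|
| 325 | -         | -1.25  | Omaha parking   | entertainment |
| 326 | -         | -12.0  | Spaghetti works | food          |
| 327 | -         | -9.31  | Arbys omaha     | food          |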
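
Holding back the last rows for testing can be sketched as below. The `holdout_split` helper is a hypothetical name, not part of this notebook, and it differs from the cells here in one respect: it slices both halves at the same index, so every row lands in exactly one set. The notebook's cells start the test rows at `max_train+1`, which leaves row `max_train` out of both training and testing.

```python
def holdout_split(rows, n_test):
    """Split rows so that the last n_test rows form the test set."""
    split = len(rows) - n_test
    return rows[:split], rows[split:]

rows = list(range(330))  # Stand-in for 330 feature vectors
train, test = holdout_split(rows, 6)

print(len(train), len(test))  # 324 6
print(test)                   # [324, 325, 326, 327, 328, 329]
```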


In [14]:
# Provide data whose class labels are to be predicted: rows max_train+1 onward
# Note: row max_train itself is neither trained on nor predicted here
X = [saveArr[max_train+i] for i in range(1, data.shape[0]-max_train)]

# Prints the data provided
#print(X)

# Store predicted class labels of X
prediction = knn.predict(X)

# Prints the predicted class labels of X
predict_values = []
print(prediction)
for _type in prediction:
    print('Prediction is:',types[_type])
    predict_values.append(types[_type])

[4 4 4 3 1]
Prediction is: food
Prediction is: food
Prediction is: food
Prediction is: income
Prediction is: entertainment


In [15]:
# DataFrame construction
conclusion = pd.DataFrame({'prediction':predict_values,
                          'answer':data.type[max_train+1:]})
# Add correct indicator
conclusion['correct?'] = conclusion.prediction == conclusion.answer

# Get percent correct
total = conclusion.answer.count()
correct_predictions = (conclusion[conclusion['correct?'] == True]).answer.count()

percent_correct = str((correct_predictions/total) * 100) + '%'
print("Accuracy: {}".format(percent_correct))
conclusion

Accuracy: 60.0%


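|     | prediction    | answer        | correct? |
|-----|---------------|---------------|----------|
| 325 | food          | entertainment | False    |
| 326 | food          | food          | True     |
| 327 | food          | food          | True     |
| 328 | income        | income        | True     |
| 329 | entertainment | need          | False    |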


## *Conclusion*

Three of the five held-out transactions (60%) were correctly categorized using KNeighborsClassifier from scikit-learn, an example of supervised machine learning. The model looked at the 10 nearest neighbors to decide which category a new transaction falls into. It learned from the training data and its target values, then applied what it learned to new, unseen data. With only five test transactions, this accuracy figure is a rough indication rather than a reliable estimate.

What would make this model better? More transactions that capture a wider range of my spending habits. As time goes on, new kinds of transactions will be added to the training data, giving the model more examples to draw on next time.