# Credit Account Suggester Training
This notebook trains a model that suggests a credit account for a transaction. The features are the TF-IDF-vectorized Description and the Credit Amount, the target is the Credit Account, and the model is a RandomForestClassifier.

## 1. Import Required Libraries
Import pandas, numpy, scikit-learn, joblib, and any other required libraries for data manipulation and analysis.

In [3]:
# Import Required Libraries
import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

## 2. Load and Explore Transaction Data
Load the transaction data from the provided CSV file and inspect its columns, dtypes, and non-null counts to understand the data structure.

In [4]:
# Load the transaction data
csv_path = "data/Sheet 1-Table 1 Snapshot-1 Snapshot.csv"
df = pd.read_csv(csv_path)
df.head()

| | Date | Debit Account | Debit Amount | Credit Account | Credit Amount | Description |
|---|---|---|---|---|---|---|
| 0 | 2024/5/1 | KR10101-Toss Checking Account | 328 | KR70002-Interest Income | 328 | Interest Received |
| 1 | 2024/5/2 | KR60005-Personal Expenses | 17500 | KR10101-Toss Checking Account | 17500 | Myothurin |
| 2 | 2024/5/2 | KR60002-Food Expenses | 9400 | KR10101-Toss Checking Account | 9400 | Foreign Mart |
| 3 | 2024/5/3 | KR60005-Personal Expenses | 8900 | KR10101-Toss Checking Account | 8900 | iCloud+ |
| 4 | 2024/5/3 | KR60002-Food Expenses | 30000 | KR10101-Toss Checking Account | 30000 | 양꼬치 w 서진 |


In [5]:
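# Display basic statistics and inspect columns
df.info()
# Note: a cell renders only its last expression, so the describe() result is
# discarded here; wrap it in display(...) to see it.
df.describe(include='all')
df.columns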

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1329 entries, 0 to 1328
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Date            1329 non-null   object
 1   Debit Account   1329 non-null   object
 2   Debit Amount    1329 non-null   int64 
 3   Credit Account  1329 non-null   object
 4   Credit Amount   1329 non-null   int64 
 5   Description     1329 non-null   object
dtypes: int64(2), object(4)
memory usage: 62.4+ KB


Index(['Date', 'Debit Account', 'Debit Amount', 'Credit Account',
       'Credit Amount', 'Description'],
      dtype='object')
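
Because Credit Account is the prediction target, its class distribution is worth a look before training. The following is a minimal sketch on hypothetical toy rows (not the notebook's data); in the notebook the same calls would be applied to `df['Credit Account']`.

```python
import pandas as pd

# Hypothetical rows; in the notebook this would be df["Credit Account"].
toy_accounts = pd.Series(
    [
        "KR10101-Toss Checking Account",
        "KR10101-Toss Checking Account",
        "KR10101-Toss Checking Account",
        "KR70002-Interest Income",
    ],
    name="Credit Account",
)

# Count how many transactions fall into each credit account.
class_counts = toy_accounts.value_counts()

# Classes with a single example cannot appear in both the train and test split.
rare_classes = class_counts[class_counts < 2].index.tolist()

print(class_counts)
print("Classes with a single example:", rare_classes)
```

A heavily skewed distribution would suggest reading the later accuracy figure with some caution.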

## 3. Preprocess Data for Analysis
Drop rows with missing values in the key columns, encode the Credit Account labels as integers, vectorize Description with TF-IDF, and stack the result with Credit Amount to form the feature matrix.

In [6]:
# Drop rows with missing values in key columns
# (.copy() gives an independent frame, so adding a column below is safe)
key_columns = ['Description', 'Credit Amount', 'Credit Account']
df = df.dropna(subset=key_columns).copy()

# Encode Credit Account labels
credit_account_encoder = LabelEncoder()
df['Credit Account Encoded'] = credit_account_encoder.fit_transform(df['Credit Account'])

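# TF-IDF vectorization for Description (keeps the 100 most frequent terms)
# Note: the vectorizer is fit on all rows, including those later held out for testing
tfidf = TfidfVectorizer(max_features=100)
description_tfidf = tfidf.fit_transform(df['Description']).toarray()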

# Prepare feature matrix
X = np.hstack([
    description_tfidf,
    df[['Credit Amount']].values
])

y = df['Credit Account Encoded'].values
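
Fitting the vectorizer on every row before splitting lets vocabulary and IDF statistics from the test rows reach the model. One way to avoid that, sketched here under assumptions, is to wrap the feature steps and the classifier in a scikit-learn `Pipeline` so that everything is fit on the training split only. The toy rows and the names `toy`, `features`, and `model` are hypothetical; the descriptions and account names are borrowed from the preview above, and some amounts are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy rows standing in for the real transaction CSV.
toy = pd.DataFrame({
    "Description": ["Interest Received", "Foreign Mart", "iCloud+", "Foreign Mart",
                    "Interest Received", "iCloud+", "Foreign Mart", "Interest Received"],
    "Credit Amount": [328, 9400, 8900, 12000, 410, 8900, 7600, 295],
    "Credit Account": ["KR70002-Interest Income", "KR10101-Toss Checking Account",
                       "KR10101-Toss Checking Account", "KR10101-Toss Checking Account",
                       "KR70002-Interest Income", "KR10101-Toss Checking Account",
                       "KR10101-Toss Checking Account", "KR70002-Interest Income"],
})

features = ColumnTransformer([
    # A single column name hands the vectorizer a 1-D series of strings.
    ("desc", TfidfVectorizer(max_features=100), "Description"),
    # The numeric column is passed through unchanged.
    ("amount", "passthrough", ["Credit Amount"]),
])
model = Pipeline([
    ("features", features),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    toy[["Description", "Credit Amount"]], toy["Credit Account"],
    test_size=0.25, random_state=42,
)

# fit() learns the TF-IDF vocabulary from the training rows only.
model.fit(X_train, y_train)
toy_preds = model.predict(X_test)
print(toy_preds)
```

A side benefit of this design is that the whole pipeline can be saved as a single object, so the vectorizer and classifier cannot drift apart.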

## 4. Define Credit Account Suggestion Model
Train a RandomForestClassifier to predict the credit account type using the prepared features.

In [7]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.4f}")
# Only 20 of the 24 encoded classes occur in y_test/y_pred, so pass all class
# indices via `labels`; otherwise target_names (24 entries) does not line up.
all_labels = np.arange(len(credit_account_encoder.classes_))
print(classification_report(y_test, y_pred, labels=all_labels,
                            target_names=credit_account_encoder.classes_, zero_division=0))

Test Accuracy: 0.7218
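
A single accuracy figure such as the 0.7218 above is easier to judge next to a reference point. One common check, sketched here on hypothetical toy labels rather than the notebook's data, compares against scikit-learn's `DummyClassifier`, which always predicts the most frequent training class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: class 0 dominates the data.
y_train_toy = np.array([0, 0, 0, 0, 0, 0, 1, 2])
y_test_toy = np.array([0, 0, 0, 1])

# The dummy model ignores features, so placeholder zeros are enough.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)

baseline_accuracy = accuracy_score(y_test_toy, baseline.predict(X_test_toy))
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```

In the notebook the same baseline could be fit on `X_train, y_train` and scored on `X_test, y_test`; the trained model is only informative to the extent that it beats that number.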

## 5. Save Model and Encoders
Save the trained model, TF-IDF vectorizer, and label encoder for integration into the Flask app.

In [8]:
# Save model, vectorizer, and encoder
joblib.dump(clf, 'credit_account_suggester.joblib')
joblib.dump(tfidf, 'credit_account_label_vectorizer.joblib')
joblib.dump(credit_account_encoder, 'credit_account_label_encoder.joblib')

['credit_account_label_encoder.joblib']
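
A consumer of these files has to rebuild the feature row exactly as it was built during training: TF-IDF columns first, Credit Amount last. The sketch below shows one way that could look. The helper `suggest_credit_account` and the toy artifacts are hypothetical, and nothing here reflects how the actual Flask app is written.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder


def suggest_credit_account(description, credit_amount, clf, tfidf, encoder):
    """Rebuild the [TF-IDF | Credit Amount] feature row and decode the prediction."""
    text_features = tfidf.transform([description]).toarray()
    row = np.hstack([text_features, [[credit_amount]]])
    return encoder.inverse_transform(clf.predict(row))[0]


# Hypothetical toy artifacts so the sketch runs on its own.
descriptions = ["Interest Received", "Foreign Mart", "iCloud+", "Interest Received"]
amounts = np.array([[328], [9400], [8900], [410]])
accounts = ["KR70002-Interest Income", "KR10101-Toss Checking Account",
            "KR10101-Toss Checking Account", "KR70002-Interest Income"]

toy_tfidf = TfidfVectorizer(max_features=100)
toy_encoder = LabelEncoder()
toy_X = np.hstack([toy_tfidf.fit_transform(descriptions).toarray(), amounts])
toy_clf = RandomForestClassifier(n_estimators=10, random_state=42)
toy_clf.fit(toy_X, toy_encoder.fit_transform(accounts))

with tempfile.TemporaryDirectory() as tmp:
    paths = {name: os.path.join(tmp, f"{name}.joblib")
             for name in ("clf", "tfidf", "encoder")}
    joblib.dump(toy_clf, paths["clf"])
    joblib.dump(toy_tfidf, paths["tfidf"])
    joblib.dump(toy_encoder, paths["encoder"])
    # Reload the artifacts the way a separate process would.
    loaded = {name: joblib.load(path) for name, path in paths.items()}

suggestion = suggest_credit_account(
    "Interest Received", 328, loaded["clf"], loaded["tfidf"], loaded["encoder"]
)
print(suggestion)
```

All three artifacts must come from the same training run, since the classifier's input columns depend on the vectorizer's vocabulary and its outputs on the encoder's class order.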

## 6. Display Example Predictions
Show the suggested credit accounts for the first five transactions in the test set.

In [9]:
# Show example predictions
examples = X_test[:5]
preds = clf.predict(examples)
pred_labels = credit_account_encoder.inverse_transform(preds)

for i, label in enumerate(pred_labels):
    print(f"Example {i+1}: Suggested Credit Account: {label}")

Example 1: Suggested Credit Account: KR10101-Toss Checking Account
Example 2: Suggested Credit Account: KR10101-Toss Checking Account
Example 3: Suggested Credit Account: KR30201-Friends Debt
Example 4: Suggested Credit Account: KR10101-Toss Checking Account
Example 5: Suggested Credit Account: KR10101-Toss Checking Account
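
For a suggester, a short ranked list can be more useful than a single label. The following sketch uses `predict_proba` to return the top candidates per row. The helper `top_k_suggestions` and the synthetic data are hypothetical and only illustrate the idea.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder


def top_k_suggestions(clf, encoder, X_rows, k=3):
    """Return, per row, the k most probable labels with their probabilities."""
    proba = clf.predict_proba(X_rows)                # shape: (n_rows, n_classes_seen)
    top = np.argsort(proba, axis=1)[:, ::-1][:, :k]  # column indices, best first
    results = []
    for row_idx, cols in enumerate(top):
        # clf.classes_ maps probability columns back to encoded labels, which
        # matters when some classes never appeared in the training split.
        labels = encoder.inverse_transform(clf.classes_[cols])
        results.append([(label, float(proba[row_idx, col]))
                        for label, col in zip(labels, cols)])
    return results


# Hypothetical synthetic data: three well-separated classes in two dimensions.
rng = np.random.default_rng(0)
toy_encoder = LabelEncoder().fit(["Checking", "Debt", "Income"])
toy_y = np.repeat([0, 1, 2], 10)
toy_X = rng.normal(loc=toy_y[:, None] * 3.0, scale=0.5, size=(30, 2))
toy_clf = RandomForestClassifier(n_estimators=20, random_state=42).fit(toy_X, toy_y)

ranked = top_k_suggestions(toy_clf, toy_encoder, toy_X[:2], k=2)
for i, candidates in enumerate(ranked):
    print(f"Example {i+1}: {candidates}")
```

In the notebook the call would presumably be `top_k_suggestions(clf, credit_account_encoder, X_test[:5])`.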
