# Install Libraries

Before starting, install the required libraries with the following commands:
```bash
pip install scikit-learn
pip install pandas
```

# Import Packages

Import all required packages into the notebook.

In [1]:
from sklearn.metrics import classification_report, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion
from sklearn.svm import LinearSVC
import pandas as pd
import pickle
import re
import os

# Load Dataset

Now load the cleaned dataset into the notebook.

In [2]:
df = pd.read_csv("datasets/Dataset_cleaned.csv")

In [3]:

df.head()

| Unnamed: 0 | category | country | currency | transaction_description_c |
| --- | --- | --- | --- | --- |
| 0 | Utilities & Services | USA | USD | mobile center |
| 1 | Transportation | UK | GBP | megabus online |
| 2 | Utilities & Services | AUSTRALIA | AUD | mobile hotspot online weekday |
| 3 | Financial Services | INDIA | INR | pnc bank india digital wallet |
| 4 | Entertainment & Recreation | UK | GBP | cinema uk holiday |


In [4]:
df.shape

(400000, 4)

# Train-Test Splitting

Split the dataset into `train` and `test` sets for the model. Passing `stratify = y` keeps the class proportions the same in both splits, and `test_size = 0.2` holds out 20% of the rows for testing.

In [5]:
X = df['transaction_description_c']
y = df['category']

In [6]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size = 0.2,
    random_state = 42,
    stratify = y
)
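
To see what `stratify` does, here is a minimal sketch on a made-up, deliberately imbalanced label set. The toy data and the `toy_*` names are illustrative assumptions and are not part of the dataset above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, deliberately imbalanced labels (hypothetical data for illustration only)
toy_X = pd.Series([f"sample {i}" for i in range(100)])
toy_y = pd.Series(["A"] * 80 + ["B"] * 20)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X,
    toy_y,
    test_size=0.2,
    random_state=42,
    stratify=toy_y,
)

# With stratification, each split keeps the original 80/20 class ratio
train_ratio = toy_y_train.value_counts(normalize=True)
test_ratio = toy_y_test.value_counts(normalize=True)
print(train_ratio)
print(test_ratio)
```

Without `stratify`, the ratios in a random split can drift away from the original ones, which matters most for small or rare classes.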

# TF-IDF Vectorization

TF-IDF (Term Frequency - Inverse Document Frequency) converts text into numerical features that the model can train on, which makes it one of the most important steps in an NLP pipeline. Here, we use word-level **unigrams** and **bigrams**, which work well for short-text features, together with character n-grams of length 3 to 5 (`char_wb`), which can help with spelling variations such as `Tacobell` versus `Taco bell`.

We limit `max_features` to `10000` for each vectorizer so that only the most frequent tokens are kept, which keeps the model lightweight.

We also set `min_df` to `2`, so only tokens that appear in at least `2` documents are kept. This avoids treating extremely rare tokens as features.
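
To make the weighting concrete, here is a small hand computation of TF-IDF on a made-up three-document corpus. It is a sketch that assumes scikit-learn's default smoothed formula, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row. It covers unigrams only, and the corpus and the `tfidf_row` helper are hypothetical.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of cleaned transaction descriptions
docs = ["uber ride", "uber eats order", "amazon order"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(tokens)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

row = tfidf_row(tokenized[0])
# "ride" occurs in one document and "uber" in two, so "ride" gets the larger weight
print(row)
```

The rarer a term is across documents, the higher its weight, which is why a distinctive token can matter more than a common one.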

In [7]:
# Defining the Word Vectorizer behaviour
word_vectorizer = TfidfVectorizer(
    analyzer='word',
    ngram_range=(1,2),
    max_features=10000,
    min_df=2
)

In [8]:
# Defining the Character Vectorizer behaviour
char_vectorizer = TfidfVectorizer(
    analyzer='char_wb',
    ngram_range=(3,5),
    max_features=10000,
    min_df=2
)

In [9]:
# Combined Vectorizers
vectorizer = FeatureUnion([
    ("word_tfidf", word_vectorizer),
    ("char_tfidf", char_vectorizer)
])

In [10]:
# Applying Vectorizer
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

In [11]:
# Checking Vectorized shapes
print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

Train shape: (320000, 16111)
Test shape: (80000, 16111)


# Model Training

In [12]:
model = LinearSVC()

In [13]:
model.fit(X_train, y_train)

| Parameter | Value |
| --- | --- |
| penalty | 'l2' |
| loss | 'squared_hinge' |
| dual | 'auto' |
| tol | 0.0001 |
| C | 1.0 |
| multi_class | 'ovr' |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| verbose | 0 |


# Model Evaluation

In [14]:
y_pred = model.predict(X_test)

In [15]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.985425
                            precision    recall  f1-score   support

       Charity & Donations       1.00      1.00      1.00      7965
Entertainment & Recreation       1.00      1.00      1.00      8011
        Financial Services       1.00      1.00      1.00      8005
             Food & Dining       0.99      0.99      0.99      7972
        Government & Legal       0.99      0.97      0.98      7984
      Healthcare & Medical       0.98      0.96      0.97      8034
                    Income       0.98      1.00      0.99      8026
         Shopping & Retail       0.93      0.97      0.95      7982
            Transportation       0.99      0.99      0.99      8007
      Utilities & Services       0.99      0.99      0.99      8014

                  accuracy                           0.99     80000
                 macro avg       0.99      0.99      0.99     80000
              weighted avg       0.99      0.99      0.99     80000
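
In the report above, `Shopping & Retail` has the lowest precision (0.93), which means some rows from other categories are being predicted as `Shopping & Retail`. A confusion matrix can show which ones. The sketch below uses hypothetical toy labels so that it runs standalone; in the notebook, `y_test` and `y_pred` would take their place.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical labels so the sketch runs standalone;
# in the notebook, use y_test and y_pred instead
toy_true = ["Food", "Food", "Retail", "Retail", "Health", "Health"]
toy_pred = ["Food", "Food", "Retail", "Retail", "Retail", "Health"]
labels = ["Food", "Health", "Retail"]

cm = confusion_matrix(toy_true, toy_pred, labels=labels)

# Rows are true classes, columns are predicted classes
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print(cm_df)
```

Off-diagonal cells are the misclassifications: here one `Health` row lands in the `Retail` column.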



# Manual Testing

In [16]:
def clean_text(text):
    # Lowercase text
    text = text.lower()
    
    # Remove TXN patterns
    text = re.sub(r'txn\d+', '', text)

    # Remove hash-number patterns (e.g. #123)
    text = re.sub(r'#\d+', '', text)
    
    # Remove any remaining digits
    text = re.sub(r'\d+', '', text)
    
    # Remove special characters
    text = re.sub(r'[^a-z\s]', '', text)
    
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text
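
To check what the cleaning steps do, here is a short sketch that runs a condensed copy of `clean_text` on a few made-up inputs. The sample strings are hypothetical, and the function is repeated only so that the block runs on its own.

```python
import re

def clean_text(text):
    """Condensed copy of the notebook's clean_text, repeated so this runs standalone."""
    text = text.lower()
    text = re.sub(r'txn\d+', '', text)     # transaction IDs such as txn12345
    text = re.sub(r'#\d+', '', text)       # references such as #789
    text = re.sub(r'\d+', '', text)        # any remaining digits
    text = re.sub(r'[^a-z\s]', '', text)   # everything except a-z and whitespace
    return re.sub(r'\s+', ' ', text).strip()

samples = ["TXN12345 Amazon Order #789!!", "Uber  Ride 24/7", "Caf\u00e9 Nero"]
cleaned = [clean_text(s) for s in samples]
print(cleaned)
# → ['amazon order', 'uber ride', 'caf nero']
```

Note the last sample: accented letters fall outside `a-z` and are dropped, so merchant names with non-ASCII characters are shortened rather than transliterated.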

In [17]:
def predict_category(text):
    # Clean input text
    cleaned = clean_text(text)
    
    # Convert to vector using trained vectorizer
    vector = vectorizer.transform([cleaned])
    
    # Predict category
    prediction = model.predict(vector)[0]
    
    return prediction

In [18]:
predict_category("Amazon Order")

'Shopping & Retail'

In [19]:
predict_category("Uber Ride")

'Transportation'

In [20]:
predict_category("Tacobell")

'Food & Dining'

In [21]:
predict_category("Taco bell")

'Food & Dining'

In [22]:
predict_category("Chicken Pot Pie")

'Food & Dining'

In [23]:
predict_category("Lotus Resorts")

'Entertainment & Recreation'
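
`LinearSVC` does not provide `predict_proba`, but `decision_function` returns one signed score per class, which can give a rough sense of how decisive a prediction is. The sketch below is an illustration on hypothetical toy data with `toy_*` names; the same call could be made on the trained `model` and `vectorizer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy training data for illustration only
toy_texts = [
    "uber ride", "megabus online",
    "taco bell", "pizza order",
    "cinema ticket", "movie night",
]
toy_labels = [
    "Transportation", "Transportation",
    "Food & Dining", "Food & Dining",
    "Entertainment", "Entertainment",
]

toy_vectorizer = TfidfVectorizer()
toy_model = LinearSVC()
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

# One score per class; the predicted class is the one with the highest score
query = toy_vectorizer.transform(["uber ride"])
scores = toy_model.decision_function(query)[0]
ranked = sorted(zip(toy_model.classes_, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

These scores are distances from the decision boundaries, not probabilities, so they are best used for ranking or for flagging low-margin predictions.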

# Exporting Model

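Since we have trained the model using a vectorizer and an SVM, we can save both so that they can be reused later. The vectorizer must be saved along with the model, because new text has to be transformed with the same fitted vocabulary before prediction.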

In [24]:
# Create models folder if not exists
os.makedirs("models", exist_ok=True)

In [25]:
# Save model
with open("models/model.pkl", "wb") as f:
    pickle.dump(model, f)

In [26]:
# Save vectorizer
with open("models/vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

# Test Loading

We need to make sure that the exported model can be loaded and used elsewhere, so we load the exported files back and test them.

In [28]:
with open("models/model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

In [29]:
with open("models/vectorizer.pkl", "rb") as f:
    loaded_vectorizer = pickle.load(f)

In [30]:
def predict_loaded(text):
    cleaned = clean_text(text)
    vector = loaded_vectorizer.transform([cleaned])
    return loaded_model.predict(vector)[0]

In [31]:
predict_loaded("tacobell")

'Food & Dining'

In [32]:
predict_loaded("Amazon Delivery")

'Shopping & Retail'
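
Saving two files means they have to be kept in sync. One possible alternative, sketched here as an assumption rather than what the notebook does, is to wrap the vectorizer and classifier in a scikit-learn `Pipeline` and pickle a single object. The toy data, the `toy_*` names, and the `pipeline.pkl` filename are hypothetical, and a temporary directory is used so nothing is written to `models/`.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; the notebook would fit on the real training split
toy_texts = ["uber ride", "megabus online", "taco bell", "pizza order"]
toy_labels = ["Transportation", "Transportation", "Food & Dining", "Food & Dining"]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),
])
toy_pipeline.fit(toy_texts, toy_labels)

# One file now carries both the vectorizer and the classifier
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_pipeline, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline accepts cleaned text directly, with no separate transform step
same = list(restored.predict(toy_texts)) == list(toy_pipeline.predict(toy_texts))
print(same)
```

Either way, pickled estimators should be loaded with the same scikit-learn version that created them, and pickle files should only be loaded from trusted sources.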