# Dataset Collection


The dataset is collected from Hugging Face and contains approximately 4.5M entries. Each entry is labeled with a category, a country, and a currency.

## Dataset Structure

Each record has the following fields:
```json
{
  "transaction_description": "string",
  "category": "string", 
  "country": "string",
  "currency": "string"
}
```
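As a quick illustration of this schema, the following minimal sketch checks that a record carries all four fields as strings. The helper `is_valid_record` and the constant `REQUIRED_FIELDS` are hypothetical names introduced here; they are not part of the dataset or of any library.

```python
# Field names taken from the record structure above.
REQUIRED_FIELDS = ("transaction_description", "category", "country", "currency")


def is_valid_record(record: dict) -> bool:
    """Return True if every required field is present and holds a string."""
    return all(isinstance(record.get(field), str) for field in REQUIRED_FIELDS)


example = {
    "transaction_description": "Uber Ride",
    "category": "Transportation",
    "country": "UK",
    "currency": "GBP",
}
print(is_valid_record(example))  # True
```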

## Example Records

| transaction_description | category                      | country    | currency |
|--------------------------|-------------------------------|------------|----------|
| McDonald's #1234         | Food & Dining                | USA        | USD      |
| Uber Ride                | Transportation               | UK         | GBP      |
| Amazon Purchase          | Shopping & Retail            | CANADA     | CAD      |
| Netflix Subscription     | Entertainment & Recreation   | AUSTRALIA  | AUD      |
| Pharmacy Purchase        | Healthcare & Medical         | INDIA      | INR      |

# Installing Libraries
Before proceeding, install the required Python libraries:
```bash
pip install datasets
pip install pandas
```

# Import Packages
Import the necessary packages into the notebook.

```python
from datasets import load_dataset
import pandas as pd
```

# Download Dataset
Download the dataset from Hugging Face.

```python
dataset = load_dataset("mitulshah/transaction-categorization")
```

```
Generating train split: 100%|██████████| 4501043/4501043 [00:00<00:00, 12456768.02 examples/s]
```


```python
print(dataset)
```

```
DatasetDict({
    train: Dataset({
        features: ['transaction_description', 'category', 'country', 'currency'],
        num_rows: 4501043
    })
})
```


```python
dataset['train'].to_csv("datasets/Dataset.csv")
```

```
Creating CSV from Arrow format: 100%|██████████| 4502/4502 [00:03<00:00, 1212.99ba/s]

216732132
```
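The export above writes to `datasets/Dataset.csv`, which presumably assumes that the `datasets/` directory already exists. A minimal sketch for creating it beforehand is shown below; `ensure_parent_dir` is a hypothetical helper name, not something provided by the `datasets` library.

```python
from pathlib import Path


def ensure_parent_dir(csv_path: str) -> Path:
    """Create the parent directory of csv_path if needed and return it."""
    parent = Path(csv_path).parent
    parent.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return parent


# Hypothetical usage before exporting:
# ensure_parent_dir("datasets/Dataset.csv")
# dataset["train"].to_csv("datasets/Dataset.csv")
```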

```python
dataset['train']
```

```
Dataset({
    features: ['transaction_description', 'category', 'country', 'currency'],
    num_rows: 4501043
})
```

# Dataset Scaling
Working with 4.5M entries is resource-intensive, so we scale the dataset down to 400k entries, sampled by shuffling with a fixed seed so the subset is reproducible.

```python
small_dataset = dataset['train'].shuffle(seed=42).select(range(400000))
small_dataset.to_csv("datasets/Dataset_small.csv")
```

```
Creating CSV from Arrow format: 100%|██████████| 400/400 [00:01<00:00, 216.80ba/s]

19264648
```
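When changing the subset size, it can help to make sure the requested number of rows never exceeds what the split actually holds, since `select(range(n))` would otherwise ask for rows that do not exist. The sketch below is one way to do that and to report the sampled fraction; `capped_sample_size` is a hypothetical helper, and the row count is the `num_rows` value printed earlier.

```python
def capped_sample_size(requested: int, available: int) -> int:
    """Return the number of rows to select, never more than are available."""
    if requested < 0:
        raise ValueError("requested must be non-negative")
    return min(requested, available)


total_rows = 4_501_043  # num_rows reported for the train split
n = capped_sample_size(400_000, total_rows)
print(n, f"{n / total_rows:.1%}")  # 400000 8.9%
```

The result could then be passed to the sampling step as `dataset['train'].shuffle(seed=42).select(range(n))`.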

# Dataset Check
Checking the scaled-down dataset now helps prevent issues later on.
(Better to be safe than sorry!)

```python
df = pd.read_csv("datasets/Dataset_small.csv")
```

```python
df.head()
```

|   | transaction_description         | category                   | country   | currency |
|---|---------------------------------|----------------------------|-----------|----------|
| 0 | Mobile Center TXN797664         | Utilities & Services       | USA       | USD      |
| 1 | Megabus Online                  | Transportation             | UK        | GBP      |
| 2 | Mobile Hotspot Online - Weekday | Utilities & Services       | AUSTRALIA | AUD      |
| 3 | PNC Bank - INDIA (Digital Wallet) | Financial Services       | INDIA     | INR      |
| 4 | Cinema - UK - Holiday           | Entertainment & Recreation | UK        | GBP      |


```python
df.shape
```

```
(400000, 4)
```
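Beyond the shape, a few more checks can catch problems early. The sketch below is one possible set of checks, not the notebook's own method: it verifies the column names, counts missing values and duplicate rows, and tallies the categories. The function `summarize_dataset` and the constant `EXPECTED_COLUMNS` are hypothetical names, and the tiny frame at the end is made up purely for demonstration.

```python
import pandas as pd

# Column names as listed in the dataset features.
EXPECTED_COLUMNS = ["transaction_description", "category", "country", "currency"]


def summarize_dataset(df: pd.DataFrame) -> dict:
    """Collect basic sanity-check figures for a transactions DataFrame."""
    return {
        "columns_ok": list(df.columns) == EXPECTED_COLUMNS,
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "category_counts": df["category"].value_counts().to_dict(),
    }


# Illustrative data only; in the notebook, pass the `df` loaded above.
demo = pd.DataFrame(
    {
        "transaction_description": ["Megabus Online", "Megabus Online", "Cinema - UK - Holiday"],
        "category": ["Transportation", "Transportation", "Entertainment & Recreation"],
        "country": ["UK", "UK", "UK"],
        "currency": ["GBP", "GBP", "GBP"],
    }
)
print(summarize_dataset(demo))
```

In the notebook itself, `summarize_dataset(df)` would give the same summary for the 400k-row subset.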