# Obtain a large classification database. The database needs to have at least 2 classes, more than 5 features and over 200 samples. Each group has to use a unique dataset.


The Dataset: https://www.kaggle.com/datasets/waqi786/global-black-money-transactions-dataset

# Provide a description of the dataset used including explanation of various features.


# Explanation:

This dataset provides an overview of black money transactions across different countries, focusing on financial activities tied to illegal dealings. It includes details such as transaction amounts and risk scores, which makes it useful for studying financial crime trends or building anti-money-laundering tools.


# Dataset:

Transaction ID: Unique identifier for each transaction. (e.g., TX0000001)

Country: Country where the transaction occurred. (e.g., USA, China)

Amount (USD): Transaction amount in US Dollars. (e.g., 150000.00)

Transaction Type: Type of transaction. (e.g., Offshore Transfer, Property Purchase)

Date of Transaction: The date and time of the transaction. (e.g., 2022-03-15 14:32:00)

Person Involved: Name or identifier of the person/entity involved. (e.g., Person_1234)

Industry: Industry associated with the transaction. (e.g., Real Estate, Finance)

Destination Country: Country where the money was sent. (e.g., Switzerland)

Reported by Authority: Whether the transaction was reported to authorities. (e.g., True/False)

Source of Money: Origin of the money. (e.g., Legal, Illegal)

Money Laundering Risk Score: Risk score indicating the likelihood of money laundering (1-10). (e.g., 8)

Shell Companies Involved: Number of shell companies used in the transaction. (e.g., 3)

Financial Institution: Bank or financial institution involved in the transaction. (e.g., Bank_567)

Tax Haven Country: Tax haven country to which the money was transferred. (e.g., Cayman Islands)

# Pre-process and clean the dataset as appropriate.

Pre-processing

In [None]:
# Import the libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

# Read the dataset
df_main = pd.read_csv('Big_Black_Money_Dataset.csv')
# Display floats with two decimal places
pd.set_option('display.float_format', '{:.2f}'.format)
# Convert the transaction timestamp from string to datetime
df_main['Date of Transaction'] = pd.to_datetime(df_main['Date of Transaction'])

# df_main holds the data after general preprocessing only;
# work on copies (or new names) so the original is not overwritten
df_main.head()


Unnamed: 0,Transaction ID,Country,Amount (USD),Transaction Type,Date of Transaction,Person Involved,Industry,Destination Country,Reported by Authority,Source of Money,Money Laundering Risk Score,Shell Companies Involved,Financial Institution,Tax Haven Country
0,TX0000000001,Brazil,3267530.48,Offshore Transfer,2013-01-01 00:00:00,Person_1101,Construction,USA,True,Illegal,6,1,Bank_40,Singapore
1,TX0000000002,China,4965766.73,Stocks Transfer,2013-01-01 01:00:00,Person_7484,Luxury Goods,South Africa,False,Illegal,9,0,Bank_461,Bahamas
2,TX0000000003,UK,94167.5,Stocks Transfer,2013-01-01 02:00:00,Person_3655,Construction,Switzerland,True,Illegal,1,3,Bank_387,Switzerland
3,TX0000000004,UAE,386420.14,Cash Withdrawal,2013-01-01 03:00:00,Person_3226,Oil & Gas,Russia,False,Illegal,7,2,Bank_353,Panama
4,TX0000000005,South Africa,643378.43,Cryptocurrency,2013-01-01 04:00:00,Person_7975,Real Estate,USA,True,Illegal,1,9,Bank_57,Luxembourg


In [None]:
# One Hot Encoding categorical features
df_hotencoding = pd.get_dummies(df_main, columns=['Country', 'Transaction Type'])
df_hotencoding.head()

Unnamed: 0,Transaction ID,Amount (USD),Date of Transaction,Person Involved,Industry,Destination Country,Reported by Authority,Source of Money,Money Laundering Risk Score,Shell Companies Involved,...,Country_South Africa,Country_Switzerland,Country_UAE,Country_UK,Country_USA,Transaction Type_Cash Withdrawal,Transaction Type_Cryptocurrency,Transaction Type_Offshore Transfer,Transaction Type_Property Purchase,Transaction Type_Stocks Transfer
0,TX0000000001,3267530.48,2013-01-01 00:00:00,Person_1101,Construction,USA,True,Illegal,6,1,...,False,False,False,False,False,False,False,True,False,False
1,TX0000000002,4965766.73,2013-01-01 01:00:00,Person_7484,Luxury Goods,South Africa,False,Illegal,9,0,...,False,False,False,False,False,False,False,False,False,True
2,TX0000000003,94167.5,2013-01-01 02:00:00,Person_3655,Construction,Switzerland,True,Illegal,1,3,...,False,False,False,True,False,False,False,False,False,True
3,TX0000000004,386420.14,2013-01-01 03:00:00,Person_3226,Oil & Gas,Russia,False,Illegal,7,2,...,False,False,True,False,False,True,False,False,False,False
4,TX0000000005,643378.43,2013-01-01 04:00:00,Person_7975,Real Estate,USA,True,Illegal,1,9,...,True,False,False,False,False,False,True,False,False,False


In [None]:
# Replace boolean values with integers: 1 for True, 0 for False
df_nobool = df_hotencoding.replace({True: 1, False: 0})
df_nobool.head()

Unnamed: 0,Transaction ID,Amount (USD),Date of Transaction,Person Involved,Industry,Destination Country,Reported by Authority,Source of Money,Money Laundering Risk Score,Shell Companies Involved,...,Country_South Africa,Country_Switzerland,Country_UAE,Country_UK,Country_USA,Transaction Type_Cash Withdrawal,Transaction Type_Cryptocurrency,Transaction Type_Offshore Transfer,Transaction Type_Property Purchase,Transaction Type_Stocks Transfer
0,TX0000000001,3267530.48,2013-01-01 00:00:00,Person_1101,Construction,USA,1,Illegal,6,1,...,0,0,0,0,0,0,0,1,0,0
1,TX0000000002,4965766.73,2013-01-01 01:00:00,Person_7484,Luxury Goods,South Africa,0,Illegal,9,0,...,0,0,0,0,0,0,0,0,0,1
2,TX0000000003,94167.5,2013-01-01 02:00:00,Person_3655,Construction,Switzerland,1,Illegal,1,3,...,0,0,0,1,0,0,0,0,0,1
3,TX0000000004,386420.14,2013-01-01 03:00:00,Person_3226,Oil & Gas,Russia,0,Illegal,7,2,...,0,0,1,0,0,1,0,0,0,0
4,TX0000000005,643378.43,2013-01-01 04:00:00,Person_7975,Real Estate,USA,1,Illegal,1,9,...,1,0,0,0,0,0,1,0,0,0
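The two encoding steps above (one-hot encoding, then mapping booleans to 0/1) can also be done without a global `replace`. The sketch below is one possible alternative, not the notebook's method: it uses a tiny stand-in frame with the same column names instead of the real CSV, passes `dtype=int` to `pd.get_dummies`, and casts any remaining boolean columns with `astype(int)`. The names `df_demo` and `df_encoded` are hypothetical.

```python
import pandas as pd

# Tiny stand-in frame (illustrative rows, not the full dataset)
df_demo = pd.DataFrame({
    'Country': ['Brazil', 'China', 'UK'],
    'Transaction Type': ['Offshore Transfer', 'Stocks Transfer', 'Stocks Transfer'],
    'Reported by Authority': [True, False, True],
    'Amount (USD)': [3267530.48, 4965766.73, 94167.50],
})

# dtype=int makes get_dummies emit 0/1 indicator columns directly
df_encoded = pd.get_dummies(df_demo, columns=['Country', 'Transaction Type'], dtype=int)

# Cast any remaining boolean columns to 0/1 as well
bool_cols = df_encoded.select_dtypes(include='bool').columns
df_encoded[bool_cols] = df_encoded[bool_cols].astype(int)

print(df_encoded.dtypes)
```

Restricting the cast to boolean columns avoids touching text columns that happen to contain the strings being replaced.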


In [None]:
# Count missing entries (and their percentage) for each column
for i in range(len(df_nobool.columns)):
    missing_data = df_nobool.iloc[:, i].isna().sum()
    perc = missing_data / len(df_nobool) * 100
    print(f'Feature {i+1} >> Missing entries: {missing_data}  |  Percentage: {round(perc, 2)}')

# Number of features (columns)
num_features = df_nobool.shape[1]

# Number of samples (rows)
num_samples = df_nobool.shape[0]

# For counting unique classes in categorical columns
categorical_columns = df_nobool.select_dtypes(include=['object']).columns
classes_count = {col: df_nobool[col].nunique() for col in categorical_columns}

# Display the counts
print(f'Number of features: {num_features}\n')
print(f'Number of samples: {num_samples}\n')
print('Unique classes in categorical columns:')
for col, count in classes_count.items():
    print(f'{col}: {count}')

Feature 1 >> Missing entries: 0  |  Percentage: 0.0
Feature 2 >> Missing entries: 0  |  Percentage: 0.0
Feature 3 >> Missing entries: 0  |  Percentage: 0.0
Feature 4 >> Missing entries: 0  |  Percentage: 0.0
Feature 5 >> Missing entries: 0  |  Percentage: 0.0
Feature 6 >> Missing entries: 0  |  Percentage: 0.0
Feature 7 >> Missing entries: 0  |  Percentage: 0.0
Feature 8 >> Missing entries: 0  |  Percentage: 0.0
Feature 9 >> Missing entries: 0  |  Percentage: 0.0
Feature 10 >> Missing entries: 0  |  Percentage: 0.0
Feature 11 >> Missing entries: 0  |  Percentage: 0.0
Feature 12 >> Missing entries: 0  |  Percentage: 0.0
Feature 13 >> Missing entries: 0  |  Percentage: 0.0
Feature 14 >> Missing entries: 0  |  Percentage: 0.0
Feature 15 >> Missing entries: 0  |  Percentage: 0.0
Feature 16 >> Missing entries: 0  |  Percentage: 0.0
Feature 17 >> Missing entries: 0  |  Percentage: 0.0
Feature 18 >> Missing entries: 0  |  Percentage: 0.0
Feature 19 >> Missing entries: 0  |  Percentage: 0.0
... (output truncated)
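The per-column missing-value report above labels columns by position. A minimal sketch of the same check keyed by column name is shown here, assuming a small stand-in frame with deliberately injected gaps (the values and the names `df_demo` and `missing_summary` are illustrative only).

```python
import numpy as np
import pandas as pd

# Stand-in frame with deliberately injected gaps (illustrative only)
df_demo = pd.DataFrame({
    'Amount (USD)': [3267530.48, np.nan, 94167.50, 386420.14],
    'Industry': ['Construction', 'Luxury Goods', None, 'Oil & Gas'],
    'Money Laundering Risk Score': [6, 9, 1, 7],
})

# isna().sum() counts gaps per column; isna().mean() gives the fraction missing
missing_summary = pd.DataFrame({
    'missing': df_demo.isna().sum(),
    'percent': (df_demo.isna().mean() * 100).round(2),
})
print(missing_summary)
```

Indexing the summary by column name makes it easier to see which feature needs imputation or removal.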

In [None]:
# The Dataset to be used:
Data = df_nobool.copy()
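At this point `Data` still contains text columns (for example Industry and Destination Country), identifier columns, and a datetime column, which scikit-learn estimators cannot consume directly. The sketch below shows one possible way to turn such a frame into a numeric feature matrix and a label vector. It rests on several assumptions that the notebook does not state: that `Source of Money` is the classification target, that `Illegal` is the positive class, and that the identifier columns carry no predictive signal. The helper `build_xy` and the demo rows are hypothetical.

```python
import pandas as pd

def build_xy(data, target='Source of Money',
             id_cols=('Transaction ID', 'Person Involved', 'Financial Institution')):
    """Return a numeric feature matrix X and a 0/1 label vector y."""
    data = data.copy()
    # Assumed label encoding: 'Illegal' -> 1, anything else -> 0
    y = (data.pop(target) == 'Illegal').astype(int)
    # Drop identifier-like columns if present
    data = data.drop(columns=[c for c in id_cols if c in data.columns])
    # Replace the timestamp with simple numeric parts
    if 'Date of Transaction' in data.columns:
        ts = pd.to_datetime(data.pop('Date of Transaction'))
        data['year'] = ts.dt.year
        data['month'] = ts.dt.month
        data['hour'] = ts.dt.hour
    # One-hot encode whatever text columns remain
    X = pd.get_dummies(data, dtype=int)
    return X, y

# Illustrative rows only (the 'Legal' label is made up for the demo)
demo = pd.DataFrame({
    'Transaction ID': ['TX0000000001', 'TX0000000002', 'TX0000000003'],
    'Amount (USD)': [3267530.48, 4965766.73, 94167.50],
    'Date of Transaction': ['2013-01-01 00:00:00', '2013-01-01 01:00:00', '2013-01-01 02:00:00'],
    'Industry': ['Construction', 'Luxury Goods', 'Construction'],
    'Source of Money': ['Illegal', 'Legal', 'Illegal'],
})
X, y = build_xy(demo)
print(X.columns.tolist())
```

With the real data this would be called as `X, y = build_xy(Data)`, after confirming the target choice with the group.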

# Use the following approaches to classify the dataset, and use GridSearchCV to tune the parameters of each model. Can you obtain better results in this step for any of the models? Discuss your observations. Randomly (or based on a certain hypothesis) remove some features and re-evaluate the models. Document your observations with respect to model performance.
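As a starting point for the five model sections, the sketch below shows one way to compare a default model against a `GridSearchCV`-tuned one. It runs on synthetic data from `make_classification` as a stand-in for the prepared feature matrix, and the parameter grids are illustrative assumptions rather than recommended ranges. The names `candidates` and `results` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared feature matrix and labels
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Illustrative grids only; the ranges are assumptions, not tuned recommendations.
# Scale-sensitive models are wrapped in a pipeline with StandardScaler.
candidates = {
    'Logistic Regression': (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {'logisticregression__C': [0.1, 1.0, 10.0]},
    ),
    'Decision Tree': (
        DecisionTreeClassifier(random_state=0),
        {'max_depth': [3, 5, None], 'min_samples_leaf': [1, 5]},
    ),
    'Random Forest': (
        RandomForestClassifier(random_state=0),
        {'n_estimators': [50, 100], 'max_depth': [5, None]},
    ),
    'SGD': (
        make_pipeline(StandardScaler(), SGDClassifier(random_state=0)),
        {'sgdclassifier__alpha': [1e-4, 1e-3, 1e-2]},
    ),
    'SVM': (
        make_pipeline(StandardScaler(), SVC()),
        {'svc__C': [0.1, 1.0, 10.0], 'svc__kernel': ['linear', 'rbf']},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    # Baseline: default hyperparameters, scored on the held-out test split
    baseline = model.fit(X_train, y_train).score(X_test, y_test)
    # Tuned: 3-fold cross-validated grid search on the training split only
    search = GridSearchCV(model, grid, cv=3, scoring='accuracy')
    search.fit(X_train, y_train)
    results[name] = {
        'baseline': baseline,
        'tuned': search.score(X_test, y_test),
        'best_params': search.best_params_,
    }
    print(f"{name}: baseline={baseline:.3f}, "
          f"tuned={results[name]['tuned']:.3f}, best={search.best_params_}")
```

Keeping the test split out of the grid search means the tuned score is not inflated by the parameter selection itself.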

Logistic Regression: Saif, Dwip


In [None]:
# Logistic Regression

Decision Tree: Nitish, Sehaj

In [None]:
# Decision Tree

Random Forest: Egor, Ash

In [None]:
# Random Forest

SGD: Devanshi, James, Abraham

In [None]:
# Stochastic Gradient Descent

SVM: Eric, Moosa

In [None]:
# Support Vector Machines

# Conclusion and comparison
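One way to organize the comparison, including the feature-removal experiment requested in the assignment, is a table of cross-validated scores before and after dropping features. The sketch below assumes synthetic data with hypothetical feature names `f0` to `f9` and only two of the five models for brevity. The random choice of three features to drop is an arbitrary illustration, not a hypothesis about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical feature names f0..f9
X_arr, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)
X = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(X_arr.shape[1])])

models = {
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

# Randomly pick three features to drop (seeded for reproducibility)
rng = np.random.default_rng(0)
dropped = list(rng.choice(X.columns.to_numpy(), size=3, replace=False))
X_reduced = X.drop(columns=dropped)

rows = []
for name, model in models.items():
    # Mean 5-fold accuracy on the full and the reduced feature sets
    full = cross_val_score(model, X, y, cv=5, scoring='accuracy').mean()
    reduced = cross_val_score(model, X_reduced, y, cv=5, scoring='accuracy').mean()
    rows.append({'model': name, 'all features': full,
                 'reduced': reduced, 'change': reduced - full})

comparison = pd.DataFrame(rows).set_index('model')
print(f'Dropped features: {dropped}')
print(comparison.round(3))
```

The `change` column makes it easy to see which models are most sensitive to the removed features.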


# Present your work, including your approach and findings, during the class on September 24th or 26th, 2024. Each group will have a maximum of 15 minutes to present their project. It is advised that your PowerPoint file be no longer than 15 slides.

# Prepare a written technical report of no more than 15 pages discussing the problem statement, the steps conducted, a summary of findings, and conclusions. Submit the report and the notebook file (with proper headings, explanatory comments, and code sections) by midnight on September 29th, 2024.