Hello everyone,

In this practice assignment, we will answer a set of analysis questions based on the dataset described below.

Dataset Overview

The Fraud dataset consists of 100 rows and 5 columns, representing various attributes of financial transactions. The dataset includes the following features:
TransactionAmount: The monetary value of each transaction. This column contains some missing values that need to be addressed.
TransactionType: The type of transaction, such as 'ATM', 'Online', or 'In-store'.
CustomerAge: The age of the customer involved in the transaction.
CustomerLocation: The type of area where the customer is located: 'Suburban', 'Urban', or 'Rural'.
Fraud: An indicator specifying whether a transaction is fraudulent (1 for True, 0 for False).

Our job is to analyze this dataset and answer a few analysis questions that the business wants answered.

In [1]:
import pandas as pd
import numpy as np
df=pd.read_csv('fraud_detection_data.csv')
df.dtypes

TransactionAmount    float64
TransactionType       object
CustomerAge          float64
CustomerLocation      object
Fraud                float64
dtype: object
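
CustomerAge and Fraud load as float64 rather than integers. One common cause (an assumption here, since the file itself is not shown) is missing values: pandas stores NaN as a float, so an integer column with gaps is upcast. A minimal sketch with made-up CSV text, not the assignment file:

```python
import io
import pandas as pd

# Made-up CSV text standing in for fraud_detection_data.csv (assumption).
csv_text = "CustomerAge,Fraud\n25,1\n,0\n40,\n"
sample = pd.read_csv(io.StringIO(csv_text))

# Each column has one gap, so both are read as float64.
print(sample.dtypes)

# Missing values per column, useful as a first check on any new dataset.
print(sample.isnull().sum())
```

Running the same two checks on `df` shows which columns actually contain gaps.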

#### Q.1) Filter the dataset to include only transactions with a transaction amount greater than 300 dollars, and return the TransactionAmount and TransactionType columns for those transactions.

In [2]:
def filtered_data(data):
    # write your code here
    # Use the `data` argument (not the global df) so the function works on any frame passed in.
    result = data[data['TransactionAmount'] > 300][['TransactionAmount', 'TransactionType']]
    return result

In [3]:
# Assert statements (Test cases)
assert filtered_data(data=df).shape == (60, 2),"Filtered record shape might not be correct or columns selected might be wrong."
assert filtered_data(data=df)['TransactionAmount'].min() > 300, "Make sure that you have filtered the data correctly."
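
The row filter and the column selection can also be written as a single `.loc` call. The sketch below uses a small made-up frame and a hypothetical helper name (`filter_above`). It also shows that rows with a missing amount are dropped, because a comparison with NaN evaluates to False.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "TransactionAmount": [120.0, 450.5, np.nan, 300.0, 980.0],
    "TransactionType": ["ATM", "Online", "In-store", "ATM", "Online"],
})

def filter_above(data, threshold=300):
    # Boolean mask: True only where the amount is strictly above the threshold.
    mask = data["TransactionAmount"] > threshold
    # Select rows and columns in one step.
    return data.loc[mask, ["TransactionAmount", "TransactionType"]]

result = filter_above(toy)
print(result)
```

Only the rows with amounts 450.5 and 980.0 remain: exactly 300.0 is not strictly greater than 300, and the NaN row fails the comparison.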

#### Q.2) Calculate the total transaction amount for each transaction type.

In [4]:
def total_txn_amnt(data):
    # write your code here
    # Group by transaction type and sum the amounts (missing amounts are skipped).
    transaction_totals = data.groupby("TransactionType")['TransactionAmount'].sum()
    return transaction_totals

In [5]:
# Assert statements (Test cases)
assert round(total_txn_amnt(data=df)['In-store'],2) == 12815.86,"Make sure that you have calculated all the values correctly"
assert round(total_txn_amnt(data=df)['Online'],2) == 13022.09,"Make sure that you have calculated all the values correctly"
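
Since TransactionAmount contains missing values, it helps to know how a grouped sum treats them. A small sketch on made-up data: by default NaN is skipped, and a group with no valid amounts sums to 0.0. Passing `min_count=1` returns NaN for such a group instead.

```python
import numpy as np
import pandas as pd

# Made-up rows; the only In-store amount is missing.
toy = pd.DataFrame({
    "TransactionType": ["ATM", "ATM", "Online", "Online", "In-store"],
    "TransactionAmount": [100.0, np.nan, 50.0, 25.5, np.nan],
})

# Default behaviour: NaN values are ignored within each group.
totals = toy.groupby("TransactionType")["TransactionAmount"].sum()
print(totals)

# Require at least one valid value per group, otherwise report NaN.
strict_totals = toy.groupby("TransactionType")["TransactionAmount"].sum(min_count=1)
print(strict_totals)
```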

#### Q.3) Analyze the relationship between transaction amount and fraud occurrence.
#### Split CustomerAge into a new feature AgeGroup with 3 values: '18-25' (18 <= CustomerAge <= 25), '26-35' (26 <= CustomerAge <= 35), and '36+' (CustomerAge > 35).
#### Create a new feature called HighValueTransaction that is True if the TransactionAmount is above the median transaction amount, and False otherwise.
#### Then calculate the fraud rate (percentage of fraudulent transactions) for each combination of AgeGroup and HighValueTransaction, and return these fraud rates.

#### The output dataframe should contain 3 columns: 'AgeGroup', 'HighValueTransaction', 'FraudRate'.

In [7]:
def analyze_fraud_patterns(df):
    # write your code here
    # Step 1: Create AgeGroup column
    df['AgeGroup'] = df['CustomerAge'].apply(lambda age: '18-25' if 18 <= age <= 25 else ('26-35' if 26 <= age <= 35 else '36+'))
    
    # Step 2: Create HighValueTransaction column
    median_amount = df['TransactionAmount'].median()
    df['HighValueTransaction'] = df['TransactionAmount'] > median_amount
    
    # Step 3: Calculate fraud rate for each combination of AgeGroup and HighValueTransaction
    fraud_rates = df.groupby(['AgeGroup', 'HighValueTransaction'])['Fraud'].mean().reset_index()
    fraud_rates.rename(columns={'Fraud': 'FraudRate'}, inplace=True)
    fraud_rates['FraudRate'] = fraud_rates['FraudRate'] * 100
    
    return fraud_rates

In [8]:
# Apply the function to the dataset
fraud_analysis = analyze_fraud_patterns(df)
print(fraud_analysis)

  AgeGroup  HighValueTransaction  FraudRate
0    18-25                 False  25.000000
1    18-25                  True  25.000000
2    26-35                 False  50.000000
3    26-35                  True  25.000000
4      36+                 False  16.666667
5      36+                  True   6.250000


In [9]:
# Assert statements (Test cases)
assert 'AgeGroup' in fraud_analysis.columns, "Check if AgeGroup column is present in the data or not"
assert 'HighValueTransaction' in fraud_analysis.columns,"Check if HighValueTransaction column is present in the data or not"
assert 'FraudRate' in fraud_analysis.columns,"Check if FraudRate column is present in the data or not"
assert fraud_analysis[fraud_analysis['AgeGroup'] == '18-25']['FraudRate'].mean() > 0,"Check if you have calculated the FraudRate correctly for AgeGroup or not"
assert fraud_analysis[fraud_analysis['HighValueTransaction'] == True]['FraudRate'].mean() > 0,"Check if you have calculated the FraudRate correctly for HighValueTransaction or not"
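
The lambda used for AgeGroup assigns '36+' to every age that fails the first two range checks, which includes missing ages, ages under 18, and non-integer ages such as 25.5. As an alternative sketch (not the assignment's reference solution), `pd.cut` bins the ages explicitly and leaves missing ages as NaN, so the results can differ from the lambda version. The data below is made up.

```python
import numpy as np
import pandas as pd

# Made-up rows; the last customer has no recorded age.
toy = pd.DataFrame({
    "CustomerAge": [19, 24, 30, 35, 52, np.nan],
    "TransactionAmount": [50.0, 400.0, 120.0, 900.0, 610.0, 75.0],
    "Fraud": [1, 0, 0, 1, 0, 1],
})

# Bins are right-inclusive: (17, 25], (25, 35], (35, inf).
toy["AgeGroup"] = pd.cut(
    toy["CustomerAge"],
    bins=[17, 25, 35, np.inf],
    labels=["18-25", "26-35", "36+"],
)

# True when the amount is strictly above the median (260.0 for this toy data).
toy["HighValueTransaction"] = toy["TransactionAmount"] > toy["TransactionAmount"].median()

# Mean of a 0/1 column is the fraud share; multiply by 100 for a percentage.
rates = (
    toy.groupby(["AgeGroup", "HighValueTransaction"], observed=True)["Fraud"]
    .mean()
    .mul(100)
    .reset_index(name="FraudRate")
)
print(rates)
```

The row with the missing age is left out of the grouped result because `groupby` drops NaN keys by default.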

#### Q.4) Find the number of transactions and the average transaction amount for each customer location, sorted in descending order of the number of transactions.

In [11]:
def avg_no_txns(data):
    # write your code here
    location_stats = data.groupby('CustomerLocation').agg(TransactionCount=('TransactionAmount','count'),AvgTransactionAmount=('TransactionAmount','mean'))
    location_stats = location_stats.sort_values(by="TransactionCount",ascending=False)
    return location_stats

In [12]:
location_stats=avg_no_txns(data=df)

In [13]:
# Assert statements (Test cases) 
assert location_stats.loc['Suburban', 'TransactionCount'] == 30,"Make sure you have calculated the values correctly"
# Compare floats with a tolerance rather than exact equality.
assert np.isclose(location_stats.loc['Urban', 'AvgTransactionAmount'], 502.1501891278586),"Make sure you have calculated the values correctly"
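
Two details of this aggregation are easy to miss: `'count'` counts only non-missing amounts, whereas `size()` counts rows, and a computed mean is safer to check with a tolerance than with `==`. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up rows; one Rural transaction has a missing amount.
toy = pd.DataFrame({
    "CustomerLocation": ["Urban", "Urban", "Urban", "Rural", "Rural", "Suburban"],
    "TransactionAmount": [0.1, 0.2, 0.3, 10.0, np.nan, 5.0],
})

stats = (
    toy.groupby("CustomerLocation")
    .agg(
        TransactionCount=("TransactionAmount", "count"),   # non-missing amounts only
        AvgTransactionAmount=("TransactionAmount", "mean"),
    )
    .sort_values(by="TransactionCount", ascending=False)
)

# Row counts per location, including rows whose amount is missing.
rows_per_location = toy.groupby("CustomerLocation").size()

urban_mean = stats.loc["Urban", "AvgTransactionAmount"]
# Tolerance-based comparison avoids failures caused by floating-point rounding.
print(np.isclose(urban_mean, 0.2))
print(stats)
print(rows_per_location)
```

Here Rural has a TransactionCount of 1 but two rows, so which one counts as "number of transactions" depends on whether rows with a missing amount should be included.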

#### Q.5) Find the total number of missing values in the entire dataset, and fill the missing values in the "TransactionAmount" column with that column's average value.

In [14]:
def missing_imp(data):
    # your code here
    total_missing_values = data.isnull().sum()
    average_transaction_amount = data['TransactionAmount'].mean()
    data['TransactionAmount'].fillna(average_transaction_amount,inplace=True)
    return total_missing_values,data

In [15]:
df_missing=missing_imp(data=df)[1]   
total_missing_values_after=df_missing['TransactionAmount'].isnull().sum().sum()
total_missing_values=missing_imp(data=df)[0]

In [16]:
# Assert statements (Test cases) 
assert total_missing_values == 40, "Check if you have calculated the missing values correctly or not"
assert total_missing_values_after == 0, "Check if you have filled the missing values in TransactionAmount correctly or not"

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
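
The error comes from the first assert: `data.isnull().sum()` returns one count per column (a Series), so `total_missing_values == 40` is itself a Series, and `assert` cannot reduce it to a single True or False. Summing once more gives a scalar. A minimal reproduction on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up frame with three missing values in total.
toy = pd.DataFrame({
    "TransactionAmount": [100.0, np.nan, 250.0],
    "CustomerAge": [30.0, np.nan, np.nan],
})

per_column = toy.isnull().sum()    # Series: one count per column
total = toy.isnull().sum().sum()   # scalar: count over the whole frame

try:
    # Comparing a Series to a number yields a Series, which has no single truth value.
    assert per_column == 3
except ValueError as err:
    print("ValueError:", err)

# The scalar version compares cleanly.
assert total == 3
print(per_column)
print(total)
```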

## Solution

In [None]:
# Q.1) Solution:
def filtered_data(data):
    # Filter rows where TransactionAmount is above 300, then keep the two requested columns.
    result = data[data['TransactionAmount'] > 300][['TransactionAmount', 'TransactionType']]
    return result

In [None]:
#Q.2) Solution:
def total_txn_amnt(data):
    # Group the data by TransactionType and aggregate TransactionAmount with sum.
    transaction_totals = data.groupby('TransactionType')['TransactionAmount'].sum()
    return transaction_totals

In [None]:
# Q.3) Solution:
def analyze_fraud_patterns(df):
    # Step 1: Create AgeGroup column
    df['AgeGroup'] = df['CustomerAge'].apply(lambda age: '18-25' if 18 <= age <= 25 else ('26-35' if 26 <= age <= 35 else '36+'))
    
    # Step 2: Create HighValueTransaction column
    median_amount = df['TransactionAmount'].median()
    df['HighValueTransaction'] = df['TransactionAmount'] > median_amount
    
    # Step 3: Calculate fraud rate for each combination of AgeGroup and HighValueTransaction
    fraud_rates = df.groupby(['AgeGroup', 'HighValueTransaction'])['Fraud'].mean().reset_index()
    fraud_rates.rename(columns={'Fraud': 'FraudRate'}, inplace=True)
    fraud_rates['FraudRate'] = fraud_rates['FraudRate'] * 100
    
    return fraud_rates

In [None]:
# Q.4) Solution
def avg_no_txns(data):
    # using groupby and aggregation functions
    location_stats = data.groupby('CustomerLocation').agg(TransactionCount=('TransactionAmount', 'count'), AvgTransactionAmount=('TransactionAmount', 'mean'))
    location_stats = location_stats.sort_values(by='TransactionCount', ascending=False)
    return location_stats

In [None]:
# Q.5) Solution:
def missing_imp(data):
    # .isnull().sum().sum() gives the total number of missing values as a single number
    total_missing_values = data.isnull().sum().sum()
    # Calculate the average TransactionAmount and use fillna to fill the missing values.
    average_transaction_amount = data['TransactionAmount'].mean()
    # Assign the result back; fillna(inplace=True) on a selected column is deprecated in recent pandas.
    data['TransactionAmount'] = data['TransactionAmount'].fillna(average_transaction_amount)
    return total_missing_values, data
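
When `missing_imp` modifies the frame it receives, a second call on the same frame counts missing values after imputation instead of before it. One way to sidestep this, sketched below with a hypothetical helper name (`impute_amount`) and made-up data, is to work on a copy and unpack both results from a single call.

```python
import numpy as np
import pandas as pd

def impute_amount(data):
    """Hypothetical variant of missing_imp that leaves its input untouched."""
    result = data.copy()
    # Count missing values before anything is filled.
    total_missing = int(result.isnull().sum().sum())
    mean_amount = result["TransactionAmount"].mean()
    result["TransactionAmount"] = result["TransactionAmount"].fillna(mean_amount)
    return total_missing, result

# Made-up frame: two missing amounts and one missing age.
toy = pd.DataFrame({
    "TransactionAmount": [100.0, np.nan, 300.0, np.nan],
    "CustomerAge": [20.0, 30.0, np.nan, 40.0],
})

# One call returns both the pre-imputation count and the filled frame.
total_missing, filled = impute_amount(toy)
print(total_missing)
print(filled["TransactionAmount"].tolist())
```

The original `toy` frame still contains its missing amounts afterwards, so the count stays reproducible if the cell is run again.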