###Feature Engineering

We create new features that may help detect fraud more effectively:

* **Risk score based on distance** → flags transactions that are unusually far from the user's home as potentially risky; a transaction is considered risky if `distance_from_home` is greater than the 75th percentile of all distance values.
* **Flag high-value transactions** → flags transactions with `ratio_to_median_purchase_price` greater than 3 as 'high-value'; high-value means the transaction amount is more than three times the median purchase price in the user's transaction history.
* **Detect suspicious pattern** → *online + new retailer + high value* is treated as a suspicious transaction.
* **Security score** → chip + PIN; a simple security measure based on whether a chip and/or a PIN was used in the transaction. A higher score indicates that a more secure transaction method was used.
* **Movement velocity** → flags transactions where the distance from the previous transaction is unusually large; if `distance_from_last_transaction` is greater than the 90th percentile of all its values, the transaction is flagged as risky.

In [None]:
#Risk score based on distance
df['distance_risk'] = np.where(df['distance_from_home'] > df['distance_from_home'].quantile(0.75),1,0)
df['distance_risk']

Unnamed: 0,distance_risk
0,1
1,0
2,0
3,0
4,1
...,...
999995,0
999996,0
999997,0
999998,0


In [None]:
#Flag high-value transactions
df['high_value'] = np.where(df['ratio_to_median_purchase_price']>3,1,0)
df['high_value']

Unnamed: 0,high_value
0,0
1,0
2,0
3,0
4,0
...,...
999995,0
999996,0
999997,0
999998,0


In [None]:
#Suspicious pattern (online + new retailer + high value)
df['suspicious_pattern'] =(
    (df['online_order'] == 1) &
    (df['repeat_retailer'] == 0) &
    (df['high_value'] == 1)
).astype(int)
df['suspicious_pattern']

Unnamed: 0,suspicious_pattern
0,0
1,0
2,0
3,0
4,0
...,...
999995,0
999996,0
999997,0
999998,0


In [None]:
#Security score (chip + pin)
df['security_score'] = df['used_chip']  + df['used_pin_number']
df['security_score']

Unnamed: 0,security_score
0,1.0
1,0.0
2,0.0
3,1.0
4,1.0
...,...
999995,1.0
999996,1.0
999997,1.0
999998,0.0


In [None]:
#High velocity
df['high_velocity'] = np.where(df['distance_from_last_transaction'] > df['distance_from_last_transaction'].quantile(0.9),1,0)
df['high_velocity']

Unnamed: 0,high_velocity
0,0
1,0
2,0
3,0
4,0
...,...
999995,0
999996,0
999997,0
999998,0
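
One quick way to judge whether flags like the ones above carry any signal is to compare the fraud rate for each flag value. The sketch below does this on a tiny hand-made DataFrame; the data and the helper name `fraud_rate_by_flag` are hypothetical and only illustrate the idea, they are not results from the real dataset.

```python
import pandas as pd

# Toy data: hypothetical values, NOT taken from the real dataset.
toy = pd.DataFrame({
    "distance_risk": [1, 0, 0, 1, 1, 0, 0, 1],
    "high_value":    [0, 0, 1, 1, 0, 0, 1, 1],
    "fraud":         [0, 0, 1, 1, 0, 0, 0, 1],
})

def fraud_rate_by_flag(frame, flag_cols, target="fraud"):
    """Return the mean of `target` for each value (0/1) of every flag column."""
    rates = {col: frame.groupby(col)[target].mean() for col in flag_cols}
    # Rows are the flag values (0 and 1), columns are the flag names.
    return pd.DataFrame(rates)

rates = fraud_rate_by_flag(toy, ["distance_risk", "high_value"])
print(rates)
```

A flag whose fraud rate is clearly higher for value 1 than for value 0 is a candidate for being useful to the model. On the real `df`, the same helper could be called with the engineered column names.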


###Handle Class Imbalance

* **SMOTE (Synthetic Minority Over-sampling Technique)** → creates synthetic samples of the minority class (fraud transactions in this case) based on existing minority samples; the preferred approach when we don't want to lose data from the majority class.
* **Random Under-Sampling** → randomly removes instances from the majority class to balance the dataset; useful when we have a huge dataset and helps with computational efficiency, but can lead to loss of potentially useful information from the majority class.
* **Random Over-Sampling** → balances the classes by duplicating instances from the minority class; simpler than SMOTE.
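
To make the difference between duplicating and synthesizing concrete, here is a minimal NumPy sketch of the interpolation idea behind SMOTE: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the line between them. This is a simplified illustration, not the `imblearn` implementation; the function name `smote_like_samples` and its parameters are made up for this example.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified sketch)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between all minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)             # random base samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour per base
    lam = rng.random((n_new, 1))                               # weight in [0, 1)

    # New point = base + lam * (neighbour - base), i.e. on the connecting segment.
    return X_min[base] + lam * (X_min[picked] - X_min[base])

# Hypothetical minority samples with two features each.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_minority, n_new=5)
print(synthetic)
```

Random over-sampling would instead return exact copies of rows in `X_minority`, which is why it is simpler but adds no new points.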

When to use which technique:

The best approach depends on experimentation. Try different techniques and evaluate their impact on your model's performance using precision, recall, F-score, or AUC.

* SMOTE → moderate to large datasets; when the minority class needs to grow without simply duplicating existing samples.
* Random Under-Sampling → massive datasets where computational resources are a concern; significantly reduces the size of the dataset, with a potential loss of information.
* Random Over-Sampling → smaller datasets, or cases where an increase in dataset size is not a major concern; might lead to overfitting if not used carefully.
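
As a small illustration of the metrics mentioned above, the sketch below computes precision, recall, and F1 for the fraud class from true and predicted labels. The labels are hypothetical, and the helper `precision_recall_f1` is written out by hand only to show the arithmetic; `sklearn.metrics` provides `precision_score`, `recall_score`, `f1_score`, and `roc_auc_score` for real use. It assumes the predictions come from data the model was not trained on.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive (fraud = 1) class."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)

    tp = np.sum(y_true & y_pred)     # fraud correctly flagged
    fp = np.sum(~y_true & y_pred)    # legitimate flagged as fraud
    fn = np.sum(y_true & ~y_pred)    # fraud that was missed

    precision = float(tp / (tp + fp)) if (tp + fp) else 0.0
    recall = float(tp / (tp + fn)) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels, NOT model output from this notebook.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Running the same comparison once per resampling technique gives a like-for-like basis for choosing between them.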

In [None]:
#Prepare feature columns
feature_cols = [col for col in df.columns if col != 'fraud']
X = df[feature_cols]
y = df['fraud']

In [None]:
#Trying smote
smote = SMOTE(random_state=42)
X_balanced, y_balanced = smote.fit_resample(X,y)

print(f"Original distribution: {np.bincount(y)}")
print(f"Balanced distribution: {np.bincount(y_balanced)}")

Original distribution: [912597  87403]
Balanced distribution: [912597 912597]


In [None]:
#Trying out under sampling
rus = RandomUnderSampler(random_state=42)
X_balanced, y_balanced = rus.fit_resample(X,y)

print(f"Original distribution: {np.bincount(y)}")
print(f"Balanced distribution: {np.bincount(y_balanced)}")

Original distribution: [912597  87403]
Balanced distribution: [87403 87403]


In [None]:
#Trying out oversampling
fraud_cases = df[df['fraud'] == 1]
non_fraud_cases = df[df['fraud'] == 0]

fraud_upsampled = resample(fraud_cases,replace=True,
                           n_samples=len(non_fraud_cases),random_state=42)

balanced_df = pd.concat([non_fraud_cases,fraud_upsampled])
X_balanced = balanced_df[feature_cols]
y_balanced = balanced_df['fraud']

print(f"Original distribution: {np.bincount(y)}")
print(f"Balanced distribution: {np.bincount(y_balanced)}")

Original distribution: [912597  87403]
Balanced distribution: [912597 912597]


In our case, SMOTE and random over-sampling both balance the classes at 912,597 instances each (1,825,194 rows in total).

That is far more data than under-sampling keeps (87,403 instances per class, 174,806 rows in total).

Since SMOTE generates new synthetic minority samples rather than exact duplicates, we'll proceed with SMOTE.