<a href="https://colab.research.google.com/github/rifkifauzi24/Ecommerce-Loyalty-and-Fraud-Analysis/blob/main/Ecommerce_Loyalty_and_Fraud_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
sahilislam007_e_commerce_customer_analytics_loyalty_vs_fraud_path = kagglehub.dataset_download('sahilislam007/e-commerce-customer-analytics-loyalty-vs-fraud')

print('Data source import complete.')


<div style="text-align:center; border-radius:15px; padding:15px; color:white; margin:0; font-family: 'Orbitron', sans-serif; background: #2E0249; background: #11001C; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.3); overflow:hidden; margin-bottom: 1em;">
  <div style="font-size:150%; color:#FEE100"><b>E-commerce Customer Analytics: Loyalty vs Fraud</b></div>
  <div>This notebook was created with the help of <a href="https://devra.ai/ref/kaggle" style="color:#6666FF">Devra AI</a></div>
</div>

# Table of Contents

- [Introduction](#Introduction)
- [Data Loading](#Data-Loading)
- [Data Cleaning and Preprocessing](#Data-Cleaning-and-Preprocessing)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Predictive Modeling](#Predictive-Modeling)
- [Conclusion](#Conclusion)

In [None]:
# Import necessary libraries and suppress warnings
import warnings
warnings.filterwarnings('ignore')

import os
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt  # the backend is selected by %matplotlib inline below

import seaborn as sns

# Ensure inline plotting if running in Jupyter
%matplotlib inline

# scikit-learn imports for modeling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc, classification_report

sns.set_theme(style='whitegrid')

# Introduction

Customer loyalty and fraudulent behavior interact in interesting ways in e-commerce. This notebook explores a synthetic dataset built around that relationship. If you find these insights useful, please consider upvoting this notebook.

We start by loading and cleaning the data, then move on to exploratory data analysis and a baseline predictive model for fraud.

In [None]:
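# Data Loading
data_file = 'synthetic_ecommerce_churn_dataset.csv'

# If the kagglehub cell above was run, prefer the copy in the downloaded dataset folder
# (assumption: the CSV sits at the top level of that folder)
dataset_dir = globals().get('sahilislam007_e_commerce_customer_analytics_loyalty_vs_fraud_path')
if dataset_dir and os.path.exists(os.path.join(dataset_dir, data_file)):
    data_file = os.path.join(dataset_dir, data_file)

try:
    # Check that the file exists at the resolved path
    if not os.path.exists(data_file):
        raise FileNotFoundError(f"The file {data_file} does not exist. Check the dataset path.")

    # Default UTF-8 decoding also reads plain ASCII files
    df = pd.read_csv(data_file)
    print('Dataframe loaded successfully. Here is a preview:')
    display(df.head())
except FileNotFoundError as e:
    # A wrong file path is the most common failure here, so report it instead of crashing
    print(e)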
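
When the CSV is not in the working directory, it may sit in a subfolder of the directory that `kagglehub.dataset_download` returned. The following is a minimal sketch of a recursive lookup. The helper `find_csv` is hypothetical (it is not part of the original notebook), and the demonstration runs against a temporary directory rather than the real dataset.

```python
import os
import tempfile

def find_csv(root_dir, file_name):
    """Return the first path under root_dir whose file name equals file_name, or None."""
    for current_dir, _subdirs, files in os.walk(root_dir):
        if file_name in files:
            return os.path.join(current_dir, file_name)
    return None

# Demonstration on a throwaway directory tree (the nested layout is made up)
with tempfile.TemporaryDirectory() as tmp:
    nested = os.path.join(tmp, 'versions', '1')
    os.makedirs(nested)
    target = os.path.join(nested, 'synthetic_ecommerce_churn_dataset.csv')
    with open(target, 'w') as fh:
        fh.write('age,is_fraudulent\n30,0\n')

    found = find_csv(tmp, 'synthetic_ecommerce_churn_dataset.csv')
    missing = find_csv(tmp, 'not_there.csv')
    found_matches_target = (found == target)
    print(found_matches_target, missing)
```

In the notebook, the same helper could be pointed at the path variable set in the first cell, and its result passed to `pd.read_csv` when it is not `None`.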

# Data Cleaning and Preprocessing

In this section, we fix data-quality issues before analysis. Some columns, such as 'customer_since', are stored as strings even though they represent dates, so we convert them to datetime objects for easier manipulation. Values that cannot be parsed become NaT (missing). We then count missing values per column and drop the affected rows.

In [None]:
# Data Cleaning and Preprocessing

if 'df' in globals():
    # Convert 'customer_since' to datetime if it exists.
    # errors='coerce' sets unparseable values to NaT instead of raising an exception.
    if 'customer_since' in df.columns:
        df['customer_since'] = pd.to_datetime(df['customer_since'], errors='coerce')

    # Check for missing values
    missing_values = df.isnull().sum()
    print('Missing values by column:')
    print(missing_values)

    # A simple approach: drop rows with missing values (could be revisited in future analysis)
    df.dropna(inplace=True)
    print('\nData shape after dropping missing values:', df.shape)
else:
    print('DataFrame is not loaded. Please check the data file path.')
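
Dropping every row with a missing value is the simplest option, but it can discard a lot of data. One alternative is imputation. The sketch below uses a toy frame with made-up values (only the column names `age` and `gender` come from this notebook) and fills numeric gaps with the median and categorical gaps with the most frequent value. Whether this is appropriate for the real dataset is an assumption to check against the missing-value counts printed above.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (values are illustrative only)
toy = pd.DataFrame({
    'age': [25.0, np.nan, 40.0, 31.0],
    'gender': ['F', 'M', None, 'M'],
})

imputed = toy.copy()
for col in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[col]):
        # Numeric column: fill gaps with the column median
        imputed[col] = imputed[col].fillna(imputed[col].median())
    else:
        # Non-numeric column: fill gaps with the most frequent value
        imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
```

All four rows are kept, whereas `dropna` would have kept only two.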

# Exploratory Data Analysis

Now we explore the data visually. We use histograms for numeric distributions, count plots for categorical variables, a correlation heatmap, and a pair plot of the numeric features. The heatmap and pair plot are where a relationship between customer loyalty and fraudulent behavior would show up, provided both columns are numeric.

In [None]:
if 'df' in globals():
    plt.figure(figsize=(12, 6))
    # Histogram: Distribution of age
    plt.subplot(1, 2, 1)
    sns.histplot(df['age'], kde=True, color='skyblue')
    plt.title('Age Distribution')

    # Histogram: Distribution of avg_order_value
    plt.subplot(1, 2, 2)
    sns.histplot(df['avg_order_value'], kde=True, color='olive')
    plt.title('Average Order Value Distribution')
    plt.tight_layout()
    plt.show()

    # Count plot for categorical variables
    plt.figure(figsize=(14, 4))
    plt.subplot(1, 3, 1)
    sns.countplot(x='gender', data=df, palette='pastel')
    plt.title('Gender Distribution')

    plt.subplot(1, 3, 2)
    sns.countplot(x='country', data=df, palette='muted')
    plt.title('Country Distribution')

    plt.subplot(1, 3, 3)
    sns.countplot(x='preferred_category', data=df, palette='bright')
    plt.title('Preferred Category')
    plt.tight_layout()
    plt.show()

    # Correlation Heatmap for numeric features (if 4 or more numeric columns are present)
    numeric_df = df.select_dtypes(include=[np.number])
    if numeric_df.shape[1] >= 4:
        plt.figure(figsize=(10, 8))
        corr = numeric_df.corr()
        sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
        plt.title('Correlation Heatmap')
        plt.show()

    # Pair Plot to explore relationships between numeric features
    sns.pairplot(numeric_df, diag_kind='hist')
    plt.show()
else:
    print('DataFrame is not loaded. Skipping EDA.')
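
To look at loyalty and fraud directly, one option is to compare the fraud rate across loyalty bands. The sketch below runs on a toy frame with made-up values. It assumes `loyalty_score` is numeric and `is_fraudulent` is coded as 0/1, which should be verified on the real data before reusing it.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    'loyalty_score': [10, 20, 30, 40, 50, 60, 70, 80],
    'is_fraudulent': [1, 1, 0, 1, 0, 0, 0, 0],
})

# Split customers into two equal-sized loyalty bands at the median
toy['loyalty_band'] = pd.qcut(toy['loyalty_score'], q=2, labels=['low', 'high'])

# The mean of a 0/1 column is the share of fraudulent customers in each band
fraud_rate = toy.groupby('loyalty_band', observed=True)['is_fraudulent'].mean()
print(fraud_rate)
```

On the real dataset, more bands (for example `q=4`) would give a finer view, at the cost of fewer customers per band.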

# Predictive Modeling

In this section, we build a predictor for fraudulent behavior. The target variable is 'is_fraudulent', and we train a logistic regression model on a mix of numeric and one-hot encoded categorical features. We split the data into training and testing sets, fit the model, and report its accuracy, a confusion matrix, an ROC curve, and a classification report.

Note: If the dataset is small or the features are noisy, expect modest performance. Fraud labels are also often imbalanced, in which case accuracy alone can look good while the model misses most fraud cases, so read it alongside the other metrics. Treat this model as a baseline for further tuning.

In [None]:
if 'df' in globals():
    # Prepare the data for modeling
    # We choose 'is_fraudulent' as the target variable
    target = 'is_fraudulent'

    # Select a mix of numerical and categorical features for prediction
    feature_cols = ['age', 'avg_order_value', 'total_orders', 'last_purchase', 'email_open_rate', 'loyalty_score', 'churn_risk', 'gender', 'country', 'preferred_category']

    # Subset the dataframe to only include the features that exist in the dataset
    feature_cols = [col for col in feature_cols if col in df.columns]

    X = df[feature_cols]
    y = df[target]

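    # One-hot encode the categorical features that are actually present
    # (a hard-coded list would raise a KeyError if one of them was filtered out above)
    categorical_cols = [col for col in ['gender', 'country', 'preferred_category'] if col in X.columns]
    X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)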

    # Split the data into training and testing sets; stratify keeps the fraud ratio the same in both
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

    # Initialize and train the logistic regression model
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # Predict on the test set and evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f'Accuracy of the logistic regression model: {accuracy:.2f}')

    # Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title('Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()

    # ROC Curve
    y_prob = model.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)

    plt.figure(figsize=(6, 4))
    plt.plot(fpr, tpr, label=f'ROC Curve (area = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend(loc='lower right')
    plt.show()

    # For further insights, we can display a classification report
    print('\nClassification Report:')
    print(classification_report(y_test, y_pred))
else:
    print('DataFrame is not loaded. Skipping predictive modeling.')
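
If fraud cases turn out to be rare, two common adjustments to a logistic regression baseline are feature scaling and class weighting. The sketch below compares a plain and a class-weighted pipeline on synthetic data from `make_classification`. The data, the roughly 10% positive rate, and all variable names are assumptions for illustration and not properties of the notebook's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an imbalanced fraud problem (about 10% positives)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=4,
    weights=[0.9, 0.1], random_state=42,
)

# Stratified split so both sets keep the same class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42,
)

# Same model twice; the second reweights classes inversely to their frequency
plain = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
weighted = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight='balanced'),
)
plain.fit(X_tr, y_tr)
weighted.fit(X_tr, y_tr)

# Recall on the positive class: the share of actual positives that were caught
recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(f'Recall without class weights: {recall_plain:.2f}')
print(f'Recall with class_weight="balanced": {recall_weighted:.2f}')
```

Class weighting usually trades some precision for higher recall on the rare class, so both numbers are worth checking in the classification report.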

# Conclusion

In this notebook, we examined e-commerce customer data to study the relationship between loyalty and fraud. The analysis covered data cleaning, several visualizations, and a logistic regression baseline for predicting fraudulent behavior. There is room for further work, such as more complex models, feature engineering, or time-based trends in customer behavior.

Future work could also use permutation importance to understand feature contributions and investigate how individual categorical variables affect fraud detection. The defensive handling of file loading errors shown here may help other notebook authors who run into the same path problems.
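
As a starting point for the permutation importance idea, here is a minimal sketch using `sklearn.inspection.permutation_importance`. It runs on synthetic data with generic feature indices, so the ranking it prints says nothing about the real dataset. In the notebook, the fitted `model`, `X_test`, and `y_test` would take the place of the demo objects.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: 3 informative features plus 2 pure-noise features
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Shuffle each feature 10 times on held-out data and record the drop in score
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)

# Rank features from most to least important by mean score drop
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f'feature_{idx}: {mean:.3f} +/- {std:.3f}')
```

Computing the importances on the test set rather than the training set reflects what the model relies on for unseen data.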

In [None]:
print('Thank you for exploring this analysis. Happy coding!')