# Fraud Detection in Electricity and Gas Distribution

## Introduction to the Project
Fraud in electricity and gas distribution is a significant challenge for utilities and energy providers globally. Fraudulent activities not only lead to financial losses but also compromise the reliability and efficiency of energy distribution systems. With the integration of smart meters and digitalization in the energy sector, data analytics and machine learning techniques are increasingly being used to detect and mitigate fraud.

This project aims to develop a robust solution to detect and mitigate fraudulent activities in electricity and gas distribution systems, using data analytics and machine learning techniques to protect the reliability and efficiency of energy distribution networks.

---

## Common Types of Fraud in Electricity and Gas Distribution
1. **Meter Tampering**: Physically altering meters to under-record consumption.
2. **Illegal Connections**: Unauthorized connections to electricity grids or gas pipelines.
3. **Billing Fraud**: Manipulating billing systems to reduce payable amounts.
4. **Energy Theft**: Direct tapping into energy lines to bypass metering systems.
5. **Identity Fraud**: Using false information to create accounts or avoid detection.

---

## Challenges Faced During Implementation
1. **Data Quality**: Dealing with incomplete, noisy, or inconsistent data from meters and sensors.
2. **Scalability**: Processing large volumes of data from millions of smart meters and sensors.
3. **False Positives**: Striking a balance between detection accuracy and minimizing incorrect flags.
4. **Privacy Concerns**: Ensuring compliance with data protection and privacy regulations.
5. **Integration Issues**: Merging fraud detection systems with existing legacy infrastructure.
6. **Adaptability**: Keeping up with evolving fraud tactics and updating detection models dynamically.

---

## Widely Used Data Analytics Techniques and Machine Learning Algorithms
### Data Analytics Techniques
1. **Descriptive Analytics**: Analyzing historical consumption data to identify patterns and anomalies.
2. **Predictive Analytics**: Forecasting potential fraudulent activities based on past behaviors.
3. **Network Analysis**: Mapping energy distribution networks to detect irregularities.

### Machine Learning Algorithms
- **Supervised Learning**:
  - **Logistic Regression**: Predicting the likelihood of fraud using labeled datasets.
  - **Random Forests**: Classifying normal and fraudulent activities effectively.
- **Unsupervised Learning**:
  - **Clustering (e.g., K-Means)**: Grouping unusual consumption patterns for further investigation.
  - **Autoencoders**: Detecting deviations in energy usage patterns.
- **Anomaly Detection**:
  - **Isolation Forests**: Identifying potential fraud cases by isolating outliers.
  - **Support Vector Machines (SVM)**: Detecting irregular consumption trends.
- **Deep Learning**:
  - **Recurrent Neural Networks (RNNs)**: Processing time-series data for fraudulent activity detection.
  - **Convolutional Neural Networks (CNNs)**: Analyzing spatial data from smart meters.

---

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
import warnings
warnings.filterwarnings('ignore')

# Data Exploration and Understanding

In [None]:
# List every file under the Kaggle input directory to see which data files are available
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#Loading dataset
invoice_train = pd.read_csv("/kaggle/input/fraud-detection-in-electricity-and-gas-consumption/invoice_train.csv",low_memory=False)
Client_train = pd.read_csv("/kaggle/input/fraud-detection-in-electricity-and-gas-consumption/client_train.csv",low_memory=False)

In [None]:
invoice_train.head()

In [None]:
Client_train.head()

In [None]:
# quick look at our data types 
invoice_train.info()
print()
Client_train.info()

### Dataset Overview

In this project, we explored two primary datasets: **Invoice Train Table** and **Client Train Table**, providing detailed insights into customer behaviors, service usage, and potential fraudulent activities. Below is a summary of the datasets:

#### Invoice Train Table
- **Shape**: (4,476,749 rows, 16 columns)  
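- **Key Columns**: Includes `client_id`, `invoice_date`, `tarif_type`, `counter_number`, `consommation_level_1` through `consommation_level_4`, `old_index`, `new_index`, and more. (`total_consumption` is not a raw column; it is derived later during feature engineering.)  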
- **Data Types**: Predominantly numerical (`int64`), with a few categorical and date columns (`object`).  
- **Purpose**: This table captures transactional data for electricity and gas consumption, serving as the foundation for analyzing consumption patterns, anomalies, and trends.

#### Client Train Table
- **Shape**: (135,493 rows, 6 columns)  
- **Key Columns**: Includes `disrict`, `client_id`, `client_catg`, `region`, `creation_date`, and `target`.  
- **Data Types**: A mix of categorical (`object`) and numerical (`int64`, `float64`) data.  
- **Purpose**: This dataset provides client-level demographic and categorical information, such as regional distribution, service creation dates, and fraud target indicators.

#### Observations
1. **Invoice Table Size**: The significantly larger size of the Invoice Train Table indicates detailed transaction-level data over multiple periods, enabling temporal analysis and granular pattern discovery.
2. **Client Table Features**: The Client Train Table complements the Invoice data by adding client attributes, essential for segmentation and fraud analysis.

These datasets provide a comprehensive view of customer transactions and profiles, forming the basis for feature engineering, insights generation, and predictive modeling.


# Data Cleaning and Preprocessing

In [None]:
# Finding missing values in dataset
print(Client_train.isnull().sum())

print()
print(invoice_train.isnull().sum())

- No missing values were observed in either table.

### Checking Relationships to Merge Data

In [None]:
#checking relationship to merge data

# Count unique client_ids in each table
unique_clients_invoice = invoice_train['client_id'].nunique()
unique_clients_client = Client_train['client_id'].nunique()

print(f"Unique clients in invoice_train: {unique_clients_invoice}")
print(f"Unique clients in Client_train: {unique_clients_client}")

# Count the number of invoices per client
client_invoice_counts = invoice_train.groupby('client_id').size()

# Check cardinality
max_invoices_per_client = client_invoice_counts.max()
min_invoices_per_client = client_invoice_counts.min()

print(f"Maximum number of invoices per client: {max_invoices_per_client}")
print(f"Minimum number of invoices per client: {min_invoices_per_client}")

# Identify relationship type
# A one-to-* relationship requires client_id to be unique in Client_train
if not Client_train['client_id'].is_unique:
    relationship = "Many-to-Many"
elif max_invoices_per_client == 1:
    relationship = "One-to-One"
else:
    relationship = "One-to-Many"

print(f"The relationship between Client_train and invoice_train is: {relationship}")

In [None]:
# Merging the client and invoice data on client_id
df = pd.merge(Client_train, invoice_train, on="client_id")

In [None]:
# Convert creation_date and invoice_date to datetime and fix inconsistent date format
df['creation_date'] = pd.to_datetime(df['creation_date'], format='%d/%m/%Y', errors='coerce')
df['invoice_date'] = pd.to_datetime(df['invoice_date'], format='%Y-%m-%d', errors='coerce')

### Data Cleaning and Preprocessing

In the initial steps, we performed necessary data cleaning and preprocessing to ensure consistency and reliability in our analysis. Below are the key steps:

#### Missing Values
- Both datasets were found to have **complete data across all columns**, ensuring a robust foundation for analysis without requiring extensive imputation or handling of missing values.

#### Checking Relationships for Merging
- We examined the relationship between `Client_train` and `invoice_train` to determine how the data could be merged:
  - **Unique Clients in `invoice_train`**: 135,493  
  - **Unique Clients in `Client_train`**: 135,493  
  - **Maximum Invoices per Client**: 439  
  - **Minimum Invoices per Client**: 1  
  - **Relationship Identified**: **One-to-Many**  
    - Each `client_id` in `Client_train` maps to one or more rows in `invoice_train`, making the relationship suitable for a merge.

#### Relationship Logic
- **One-to-One**: Not applicable, as some clients in `invoice_train` have multiple invoices.
- **Many-to-Many**: Not applicable, as `client_id` is unique in `Client_train`.
- **One-to-Many**: This is the appropriate relationship, as each `client_id` in `Client_train` is linked to one or more records in `invoice_train`.

#### Merging Datasets
- The datasets were merged using the `client_id` key, resulting in a combined dataset (`df`) for further analysis.
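
The one-to-many assumption can also be enforced at merge time. The sketch below is not the notebook's code: it uses small made-up frames (`clients` and `invoices` are hypothetical stand-ins for `Client_train` and `invoice_train`) and the `validate` and `indicator` options of `pd.merge` to check key uniqueness and to surface rows without a match.

```python
import pandas as pd

# Toy stand-ins for Client_train and invoice_train (values are made up)
clients = pd.DataFrame({"client_id": ["c1", "c2", "c3"], "target": [0, 1, 0]})
invoices = pd.DataFrame({
    "client_id": ["c1", "c1", "c2", "c4"],
    "new_index": [120, 180, 90, 40],
})

# validate raises pandas.errors.MergeError if client_id repeats on the left side;
# indicator adds a _merge column showing which table each row came from
merged = pd.merge(clients, invoices, on="client_id", how="outer",
                  validate="one_to_many", indicator=True)

# Rows present in only one of the two tables
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["client_id", "_merge"]])
```

With the default inner join used in the notebook, unmatched rows would be dropped silently; the outer join with `indicator=True` is only a diagnostic step here.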

#### Fixing Date Format Inconsistencies
- Columns such as `invoice_date` and `creation_date` had inconsistent date formats.
- These were standardized to ensure temporal consistency, enabling accurate time-based analyses such as customer lifetime, invoice gaps, and seasonality.

By ensuring clean and consistent data, we laid the groundwork for reliable insights and feature engineering.
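
Because the date conversion uses `errors='coerce'`, values that do not match the given format silently become `NaT`. A minimal sketch, on a made-up series rather than the real columns, of how the number of failed conversions could be checked:

```python
import pandas as pd

# Hypothetical raw values: two match the day/month/year format, two do not
raw = pd.Series(["31/12/1994", "05/03/2001", "2001-03-05", "not a date"])
parsed = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")

# Values that were present before parsing but are NaT afterwards failed to match
failed = raw[parsed.isna() & raw.notna()]
print(f"{len(failed)} of {len(raw)} values failed to parse")
```

Running the same check on `creation_date` and `invoice_date` before overwriting them would show whether any rows were lost to the fixed formats.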


# Feature Engineering

In [None]:
# Total Consumption (sum of consommation levels)
df['total_consumption'] = df[['consommation_level_1', 'consommation_level_2', 
                               'consommation_level_3', 'consommation_level_4']].sum(axis=1)

In [None]:
# Time based features
df['invoice_year'] = df['invoice_date'].dt.year
df['invoice_month'] = df['invoice_date'].dt.month
df['creation_year'] = df['creation_date'].dt.year
df['creation_day_of_week'] = df['creation_date'].dt.dayofweek  # already datetime after the conversion above
df['invoice_quarter'] = df['invoice_date'].dt.to_period('Q')

# Find the most recent invoice date for each client
df['most_recent_invoice_date'] = df.groupby('client_id')['invoice_date'].transform('max')

# Invoice Gap
df['invoice_gap'] = (df['most_recent_invoice_date'] - df['invoice_date']).dt.days

# Calculate client age as the difference between the most recent invoice date and the creation date
df['client_age_in_Days'] = (df['most_recent_invoice_date'] - df['creation_date']).dt.days

#Consumption Difference:
df['consumption_diff'] = df['new_index'] - df['old_index']

# Client Category Fraud Rate
client_catg_fraud_rate = df.groupby('client_catg')['target'].mean().to_dict()
df['client_catg_fraud_rate'] = df['client_catg'].map(client_catg_fraud_rate)

# Fraud Risk Flag
df['high_risk'] = (df['client_catg_fraud_rate'] > 0.7)

# Client Lifetime Buckets
df['client_lifetime_bucket'] = pd.cut(df['client_age_in_Days'], bins=[0, 365, 1825, 3650, np.inf],
                                      labels=['<1 Year', '1-5 Years', '5-10 Years', '>10 Years'])

### Feature Engineering Summary

To enhance our dataset and derive meaningful insights, we engineered several features that provided valuable context and analytical depth. Below is a summary of the key features developed:

#### Total Consumption
- **Description**: Computed as the sum of all consumption levels (`consommation_level_1` to `consommation_level_4`) to represent the overall consumption on each invoice, serving as a key metric for consumption analysis.

---

#### Time-Based Features
- **Invoice Year, Month, and Quarter**: Extracted from `invoice_date` to identify seasonal and periodic trends.
- **Creation Day of Week and Year**: Extracted from `creation_date` to analyze patterns based on the day of account creation.

---

#### Most Recent Invoice Date
- **Description**: Identified the latest `invoice_date` for each client to provide a reference point for time-based calculations. Useful for determining recent activity and as a baseline for further calculations.

---

#### Invoice Gap
- **Description**: Calculated as the number of days between each invoice's `invoice_date` and the client's `most_recent_invoice_date`, to analyze customer activity patterns and help identify inactive accounts.
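
A related feature, not computed in this notebook, is the gap between consecutive invoices of the same client. The sketch below uses a made-up frame (`inv` and the column name `days_since_prev_invoice` are hypothetical) to show one way it could be derived with a grouped `diff`.

```python
import pandas as pd

# Hypothetical invoices for two clients
inv = pd.DataFrame({
    "client_id": ["a", "a", "a", "b", "b"],
    "invoice_date": pd.to_datetime(
        ["2015-01-10", "2015-05-10", "2016-05-09", "2014-03-01", "2014-07-01"]),
})

# Sort so that diff() compares each invoice with the previous one of the same client
inv = inv.sort_values(["client_id", "invoice_date"])
inv["days_since_prev_invoice"] = (
    inv.groupby("client_id")["invoice_date"].diff().dt.days
)
print(inv)
```

The first invoice of each client has no predecessor, so its gap is missing by construction.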
 
 ---

#### Client Age (in Days)
- **Description**: Computed as the difference in days between the `most_recent_invoice_date` and the `creation_date` to represent the lifetime of the client within the system.

---

#### Consumption Difference
- **Description**: Calculated as the difference between `new_index` and `old_index` to capture the net change in the meter reading for each invoice period.
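
One possible use of this feature is a consistency check against the billed consumption. The sketch below runs on made-up values and rests on an assumption that is not verified in this notebook: that the index difference and `total_consumption` should normally agree (it ignores any counter coefficient). The flag names `negative_diff` and `mismatch` are hypothetical.

```python
import pandas as pd

# Hypothetical invoice rows
inv = pd.DataFrame({
    "old_index": [100, 250, 9990],
    "new_index": [250, 400, 15],
    "total_consumption": [150, 120, 25],
})

inv["consumption_diff"] = inv["new_index"] - inv["old_index"]

# A negative difference could be a meter replacement, a counter rollover, or a data error
inv["negative_diff"] = inv["consumption_diff"] < 0

# Rows where the meter readings and the billed consumption disagree
inv["mismatch"] = inv["consumption_diff"] != inv["total_consumption"]
print(inv)
```

Such flags are candidates for review rather than evidence of fraud on their own.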
 
---

#### Client Category Fraud Rate
- **Description**: Computed the fraud rate for each `client_catg` (client category) as the mean of the `target` column (fraud indicator) to add a contextual fraud risk metric for each client category.
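
Since the rate is taken over the merged table, clients with many invoices weigh more than clients with few. A minimal sketch on made-up data (`toy` is hypothetical) contrasts that invoice-level mean with a client-level mean that counts each client once:

```python
import pandas as pd

# Hypothetical merged rows: client "a" is fraudulent and has four invoices
toy = pd.DataFrame({
    "client_id":   ["a", "a", "a", "a", "b", "c"],
    "client_catg": [11, 11, 11, 11, 11, 12],
    "target":      [1, 1, 1, 1, 0, 0],
})

# Invoice-level mean: each client counts once per invoice
invoice_level = toy.groupby("client_catg")["target"].mean()

# Client-level mean: keep one row per client before averaging
client_level = (
    toy.drop_duplicates("client_id").groupby("client_catg")["target"].mean()
)
print(invoice_level[11], client_level[11])
```

Note also that any rate derived from `target` carries label information; if it is later used as a model input, it would need to be computed on training data only.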
  
---

#### Fraud Risk Flag
- **Description**: Flagged clients as "high risk" if their `client_catg_fraud_rate` exceeded 0.7. This enables targeted monitoring of high-risk clients.
 
---

#### Client Lifetime Buckets
- **Description**: Segmented clients into buckets based on their age in the system (`client_age_in_Days`):
  - `<1 Year`
  - `1-5 Years`
  - `5-10 Years`
  - `>10 Years`

This helps in analyzing customer tenure and associated behaviors.
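
The bucket edges deserve a quick check, because `pd.cut` uses right-closed intervals by default. The sketch below applies the same bins to a few made-up ages (`ages` is hypothetical) to show which values fall outside every bucket.

```python
import numpy as np
import pandas as pd

ages = pd.Series([-30, 0, 365, 366, 4000])  # hypothetical client ages in days
bins = [0, 365, 1825, 3650, np.inf]
labels = ['<1 Year', '1-5 Years', '5-10 Years', '>10 Years']

# Default behaviour: intervals are (0, 365], (365, 1825], ... so 0 is excluded
default_cut = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 is kept
inclusive_cut = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"age": ages, "default": default_cut, "inclusive": inclusive_cut}))
```

Negative ages (an invoice dated before the account creation date) remain unassigned in both cases and may be worth inspecting separately.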

---

#### Value of Engineered Features
These features provided a deeper understanding of customer behavior, consumption trends, and fraud patterns, enabling targeted strategies and actionable insights for stakeholders.


# Business Overview and Customer Insights

- Proportional Split of Energy Services
- Number of clients for gas and electricity over time
- Regional Distribution of Clients
- Tariff-Based Customer Segmentation
- Average Consumption Across Tariffs
- Client Lifetime Distribution

In [None]:
Proportional_Split = df['counter_type'].value_counts()

# Plot the pie chart
plt.figure(figsize=(7, 7))
plt.pie(Proportional_Split, labels=Proportional_Split.index, autopct='%1.1f%%', startangle=90, colors=plt.cm.Paired.colors)
plt.title('Proportional Split (Electricity vs Gas)', fontsize=16)
plt.axis('equal') 
plt.show()


### Proportional Split of Energy: Electricity vs. Gas

#### Overview
The pie chart shows the proportional split between electricity and gas in the business, measured as each counter type's share of invoice records in the merged dataset.

#### Key Insights

1. **Electricity Dominance**:
   - Electricity accounts for **68.8%** of invoice records, making it the primary service provided by the business.

2. **Gas Contribution**:
   - Gas constitutes **31.2%** of invoice records, playing a significant but smaller role compared to electricity.

#### Implications
- **Business Focus**: The higher proportion of electricity distribution suggests that the business may prioritize resources and strategic planning around electricity-related services and infrastructure.
- **Growth Potential in Gas**: The smaller share of gas offers potential opportunities for targeted marketing or service expansion to increase its contribution.
- **Balanced Energy Portfolio**: Despite electricity's dominance, gas remains a substantial part of the business, emphasizing the need for a balanced approach in service delivery and customer engagement.
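
Because the merged table holds one row per invoice, `value_counts()` on `counter_type` measures shares of invoice rows. A minimal sketch on made-up data (`toy` is hypothetical) shows how a share of unique clients could be computed alongside it:

```python
import pandas as pd

# Hypothetical merged rows; client "a" has both an electricity and a gas counter
toy = pd.DataFrame({
    "client_id":    ["a", "a", "a", "a", "b", "c"],
    "counter_type": ["ELEC", "ELEC", "ELEC", "GAZ", "ELEC", "GAZ"],
})

# Share of invoice rows per counter type (what the pie chart shows)
row_share = toy["counter_type"].value_counts(normalize=True)

# Share of unique clients that have at least one invoice of each type
client_share = (
    toy.groupby("counter_type")["client_id"].nunique() / toy["client_id"].nunique()
)
print(row_share, client_share, sep="\n")
```

Client shares can add up to more than 100% because one client may hold both counter types.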


In [None]:
# Filter out the year 2019
df_filtered = df[df['creation_year'] < 2019]

# Group by creation_year and counter_type, then count rows (one row per invoice in the merged table)
customers_by_year = df_filtered.groupby(['creation_year', 'counter_type']).size().unstack(fill_value=0)

# Plot the data
ax = customers_by_year.plot(kind='line', figsize=(12, 6), colormap="Set2", marker='o')

# Add title and labels
plt.title('Number of Clients Over the Years by Counter Type', fontsize=16)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Number of Clients', fontsize=12)
plt.legend(title='Counter Type')
plt.grid(True, linestyle='--', alpha=0.1)

# Adjust X-axis to show every second year with 45-degree rotation
years_to_show = customers_by_year.index[::2].tolist()  # Display every second year
years_to_show.append(customers_by_year.index[-1])  # Add the last year

ax.set_xticks(years_to_show)  # Set the ticks to the selected years
plt.xticks(rotation=45)

# Show the plot
plt.show()

### Client Trends for Gas and Electricity Over the Years

#### Overview
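This line chart visualizes the trends in the number of clients for gas and electricity services over the years, offering insights into growth, adoption, and market dynamics. For a fair comparison, the year 2019 was removed from the dataset as it lacked client data. Because the counts come from the merged invoice-level table, each client is counted once per invoice, so the figures below are invoice records grouped by the client's creation year rather than unique clients.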

#### Key Insights

1. **Overall Growth and Decline Trends**:
   - **Electricity**: The count for electricity showed a general increase until around 2007, peaking at **140,549 records**. Post-2007, a gradual decline was observed, with a sharp drop after 2010. By 2018, the count had fallen to **10,793 records**.
   - **Gas**: The count for gas services also grew steadily until 2008, peaking at **71,774 records**. As with electricity, a substantial decline followed, with the count falling to **1,381 records** by 2018.

2. **Significant Growth Periods**:
   - Both gas and electricity services saw notable growth between **1999 and 2007**, likely due to increased service coverage or adoption rates during that period.

3. **Steep Declines Post-2010**:
   - A sharp decline in the number of clients for both services was observed after 2010, which may be attributed to changes in market conditions, economic factors, or the emergence of alternative energy sources.

4. **Electricity as a Dominant Service**:
   - Throughout the analyzed years, electricity consistently maintained a significantly larger client base compared to gas, emphasizing its critical role in the energy sector.

5. **Transition in Recent Years**:
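   - The decline in client numbers for both services became more pronounced after 2016, signaling potential market saturation or changes in energy policies. Because recently created accounts have had less time to accumulate invoices, part of this decline is expected from the way the counts are built.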

#### Implications
This analysis provides a historical perspective on client adoption for gas and electricity services, which is crucial for identifying growth opportunities, planning service improvements, and strategizing for market expansion.


In [None]:
# Get region-wise client count
region_client_count = df['region'].value_counts().sort_values(ascending=False)

# Plot region-wise number of clients
plt.figure(figsize=(12, 6))
sns.barplot(x=region_client_count.index, y=region_client_count.values, palette="viridis",order=region_client_count.index)

# Set Y-axis ticks to the nearest 100,000
step_size = 100000  
max_count = region_client_count.max()
plt.yticks(range(0, max_count + step_size, step_size))

# Format Y-axis labels as integers with commas
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{int(x):,}'))

# Data labels on each bar
for i, value in enumerate(region_client_count.values):
    plt.text(i, value + 1, str(value), ha='center', fontsize=10, rotation = 60)

plt.title('Region-Wise Number of Clients', fontsize=16)
plt.xlabel('Region', fontsize=14)
plt.ylabel('Number of Clients', fontsize=14)
plt.xticks(rotation=45)
plt.show()

### Regional Distribution of Clients

#### Overview
This bar chart highlights the distribution of clients across various regions, providing insights into regional energy usage and client density. The bars count rows of the merged invoice-level table, so each client is counted once per invoice and the figures are best read as invoice records per region. The data reveals significant disparities across regions, which can be used to guide resource allocation and service planning.

#### Key Insights

1. **Top Regions with Highest Client Numbers**:
   - **Region 101**: Dominates with approximately **1.07 million invoice records**, the highest volume of any region.
   - **Region 311**: Follows with around **510,000 records**.
   - **Region 104**: Registers roughly **401,000 records**, securing the third position.

2. **Mid-Level Client Density Regions**:
   - Regions such as **301, 107, 103, 303, and 306** have between **195,000 and 316,000 records**.

3. **Regions with Lower Client Numbers**:
   - Regions like **105, 371, 308, and 106** show a significantly smaller base, ranging between **24,000 and 94,000 records**.
   - **Region 199** and **Region 206** have the fewest records, with near-negligible figures (**2 and 1,004**, respectively).

#### Implications
The regional client distribution offers a foundational understanding for targeting service improvements, regional campaigns, or addressing specific issues. Regions with lower client density may require exploration to identify growth potential, while high-density areas can be optimized for efficiency and customer satisfaction.


In [None]:
# Plot tariff-wise number of clients
tariff_client_count = df['tarif_type'].value_counts().sort_values(ascending=False)

plt.figure(figsize=(8, 5))
sns.barplot(x=tariff_client_count.index, y=tariff_client_count.values, palette="magma" ,order=tariff_client_count.index)
plt.title('Tariff-Wise Number of Clients', fontsize=16)
plt.xlabel('Tariff Type', fontsize=14)
plt.ylabel('Number of Clients in millions', fontsize=14)

# Show value labels on each bar
for i, value in enumerate(tariff_client_count.values):
    plt.text(i, value + 1, str(value), ha='center', fontsize=10, rotation = 35)
    
plt.show()

### Distribution of Clients by Tariff Type

#### Overview
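This bar chart displays the number of clients segmented by tariff type, offering insights into the popularity and usage of different tariff plans. As the counts are rows of the merged invoice-level table, they represent invoice records per tariff rather than unique clients.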

#### Key Insights

1. **Most Popular Tariff Types**:
   - **Tariff Type 11**: Accounts for the highest count, with approximately **2.68 million invoice records**.
   - **Tariff Type 40**: Follows with around **1.38 million records**.
   - Together, these two tariff types dominate the dataset, encompassing the vast majority of records.

2. **Moderately Used Tariff Types**:
   - **Tariff Type 10**: Registers approximately **276,000 records**, indicating moderate usage.
   - **Tariff Type 15**: Has around **72,000 records**, showing niche adoption.

3. **Least Used Tariff Types**:
   - **Tariff Types 45, 13, 14, and 12**: Each has between **10,000 and 18,000 records**, reflecting limited usage.
   - Several other tariff types (**21, 8, 30, 24, 18, 42, 27**) show very low adoption, with some having fewer than **50 records**.

4. **Noteworthy Observations**:
   - A stark contrast exists between the top two tariff types and the rest, highlighting a concentration of clients in a few plans.
   - The presence of several niche tariffs with minimal clients suggests either specialized services or legacy tariff plans.

#### Implications
The tariff-wise distribution underscores opportunities for optimization:
- **Popular Tariffs**: Ensure robust infrastructure and services for high-demand tariffs (e.g., Types 11 and 40).
- **Low-Usage Tariffs**: Investigate reasons for limited adoption—whether due to limited availability, high costs, or lack of awareness—and assess the need for potential restructuring or marketing efforts.


In [None]:
# Grouping data by tariff type and calculating the average total consumption
avg_consumption_by_tariff = df.groupby('tarif_type')['total_consumption'].mean().reset_index()

# Plotting the average consumption by tariff type
plt.figure(figsize=(10, 6))
sns.barplot(data=avg_consumption_by_tariff, x='tarif_type', y='total_consumption', palette='viridis')
plt.title("Average Consumption by Tariff Type")
plt.xlabel("Tariff Type")
plt.ylabel("Average Total Consumption")
plt.xticks(rotation=45)
plt.show()


### Average Consumption by Tariff Type

#### Overview
This bar chart illustrates the average energy consumption associated with each tariff type. The comparison highlights the variance in consumption levels, reflecting customer behavior and tariff-specific characteristics.

#### Key Insights

1. **High Consumption Tariffs**:
   - **Tariff Type 45**: Stands out with the highest average consumption per invoice.
   - **Tariff Type 13**: Registers significant average consumption, following closely behind Tariff Type 45.

2. **Moderate Consumption Tariffs**:
   - **Tariff Types 9, 11, 14, and 21**: Show moderate average consumption levels.
   - **Tariff Type 29**: Also demonstrates moderate usage, although lower than the aforementioned tariffs.

3. **Low Consumption Tariffs**:
   - **Tariff Types 10, 12, and 18**: Show relatively low average consumption, indicating smaller consumption per invoice.
   - **Tariff Types 15, 8, and 30**: Register minimal consumption, suggesting niche or restricted application.

4. **Zero Consumption Tariffs**:
   - **Tariff Types 24, 27, and 42**: Show no recorded consumption, potentially indicating discontinued or underutilized plans.

#### Implications
The disparity in consumption across tariff types provides actionable insights:
- **Focus on High-Consumption Tariffs**: Ensure reliable infrastructure and support to accommodate the demands of tariffs with high consumption.
- **Investigate Low/Zero Consumption Tariffs**: Assess reasons for limited usage—whether due to client preferences, cost, or outdated plans—and consider potential adjustments or rebranding.
- **Optimize Moderate Consumption Tariffs**: Explore ways to enhance adoption or efficiency for tariffs in the middle range of consumption.


In [None]:
# Client Lifetime Distribution
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='client_age_in_Days', bins=30, kde=True, color='skyblue')
plt.title("Client Lifetime Distribution")
plt.xlabel("Client Age in Days")
plt.ylabel("Frequency")
plt.show()


### Client Age Distribution

The histogram illustrates the distribution of client age in days, representing how long clients have been associated with the company. Most records fall within the range of **2,000 to 8,000 days** (roughly 5.5 to 22 years), indicating a substantial base of long-standing clients. Beyond **8,000 days**, the count decreases steadily, following a roughly linear decline up to **15,000 days**, so very long-term associations are comparatively rare.


# Fraud Analysis and Insights
- Fraud vs non-fraud distribution
- Fraud distribution by region
- Fraud distribution between gas and electricity
- Fraud distribution between gas and electricity over the years
- Number of fraud cases by month for gas and electricity counter types
- Consumption journey: fraudulent, non-fraudulent, and average customers

In [None]:
# Calculate the distribution of fraud and non-fraud cases
# sort_index() orders the counts as target 0, then 1, so they line up with the labels
fraud_count = df['target'].value_counts().sort_index()
labels = ['Non-Fraud', 'Fraud']
sizes = fraud_count.values

# Create a figure with two subplots
plt.figure(figsize=(12, 6))

# Bar chart for Fraud vs Non-Fraud distribution
plt.subplot(1, 2, 1)
sns.countplot(x='target', data=df)
plt.title('Fraud vs Non-Fraud Distribution', fontsize=16)
plt.xticks([0, 1], ['Non-Fraud', 'Fraud'], fontsize=12)
plt.ylabel('Count in millions', fontsize=14)
plt.xlabel('Category', fontsize=14)

# Pie chart for Fraud vs Non-Fraud distribution
plt.subplot(1, 2, 2)
plt.pie(
    sizes,
    labels=labels,
    autopct='%1.1f%%',
    startangle=140,
    textprops={'fontsize': 12},
)
plt.title('Fraud vs Non-Fraud Distribution (Pie Chart)', fontsize=16)

# Display the charts
plt.tight_layout()
plt.show()

# Count fraud and non-fraud cases
fraud_counts = df['target'].value_counts()
print("Non-Fraud Cases:", fraud_counts.get(0, 0))
print("Fraud Cases:", fraud_counts.get(1, 0))

### Interpretation: Fraud vs Non-Fraud Distribution
The bar graph and pie chart reveal a significant imbalance in the dataset:

- Non-Fraud Cases: **4,123,637 (92.1%)**
- Fraud Cases: **353,112 (7.9%)**

These counts are rows of the merged table, so each one is an invoice belonging to a non-fraudulent or a fraudulent client. The large majority of records are non-fraudulent, with only a small fraction linked to fraud. This imbalance needs careful handling when developing machine learning models, because a model can reach high accuracy while missing most fraud cases. Balancing techniques such as oversampling, undersampling, or algorithms that account for class imbalance may be required.
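
One common way to account for the imbalance is to weight each class inversely to its frequency. The sketch below applies the usual "balanced" heuristic, `n_samples / (n_classes * class_count)`, to the invoice-level counts reported above; it is an illustration, not a step performed in this notebook, and client-level counts would give different weights.

```python
# Invoice-level counts reported above
n_non_fraud = 4_123_637
n_fraud = 353_112

n_total = n_non_fraud + n_fraud
n_classes = 2

# Balanced heuristic: rarer classes receive proportionally larger weights
weights = {
    0: n_total / (n_classes * n_non_fraud),
    1: n_total / (n_classes * n_fraud),
}
print({label: round(w, 3) for label, w in weights.items()})
```

Under this heuristic a fraud record weighs roughly twelve times as much as a non-fraud record.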

In [None]:
fraud_cases_region = df[df['target'] == 1].groupby('region').size().sort_values(ascending=False)

# Plot fraud cases region-wise
plt.figure(figsize=(10, 6))
sns.barplot(x=fraud_cases_region.index, y=fraud_cases_region.values, palette="rocket", order=fraud_cases_region.index)
plt.title('Number of Fraud Cases Region-Wise')
plt.xlabel('Region', fontsize=14)
plt.ylabel('Number of Fraud Cases', fontsize=14)

# Show value labels on each bar
for i, value in enumerate(fraud_cases_region.values):
    plt.text(i, value + 1, str(value), ha='center', fontsize=10)

plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


### Regional Fraud Distribution

The bar chart displays the distribution of fraud cases across various regions. The key observations are:

- **Top Regions with Fraud Cases**: Region 311 has the highest number of fraud cases (57,936), followed by Region 101 (51,578) and Region 103 (42,586).
- **Moderate Fraud Regions**: Regions like 104, 107, and 303 show a significant but comparatively lower number of fraud cases, ranging between 16,000 and 32,000.
- **Lowest Fraud Regions**: Regions such as 106, 379, 206, and 399 report the fewest fraud cases, with Region 206 having just 41 and Region 399 only 33 cases.

Raw counts partly reflect how many invoice records each region has. Set against the approximate regional totals reported earlier, Region 311 (57,936 fraud records out of roughly 510,000, about 11%) stands out far more than Region 101 (51,578 out of roughly 1.07 million, about 5%), so rates rather than counts should guide targeted investigation and preventive measures.
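
To compare regions of different sizes, the fraud count can be reported together with a fraud rate. A minimal sketch on made-up data (`toy` and the output column names are hypothetical), using named aggregation on the `target` column:

```python
import pandas as pd

# Hypothetical merged rows: region 101 is larger, region 311 has the higher rate
toy = pd.DataFrame({
    "region": [101] * 10 + [311] * 4,
    "target": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 0, 0],
})

summary = (
    toy.groupby("region")["target"]
       .agg(fraud_count="sum", records="count", fraud_rate="mean")
       .sort_values("fraud_rate", ascending=False)
)
print(summary)
```

In this toy example region 101 leads on count while region 311 leads on rate, which is why the two rankings should be read together.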


In [None]:
fraud_data = df[df['target'] == 1]

# Calculate fraud distribution based on 'counter_type' (Electricity and Gas)
fraud_distribution = fraud_data['counter_type'].value_counts()

# Plot the pie chart
plt.figure(figsize=(7, 7))
plt.pie(fraud_distribution, labels=fraud_distribution.index, autopct='%1.1f%%', startangle=90, colors=plt.cm.Paired.colors)
plt.title('Fraud Distribution Among Electricity and Gas', fontsize=16)
plt.axis('equal')  # Equal aspect ratio ensures the pie chart is circular.
plt.show()

### Fraud Distribution Among Electricity and Gas Services

The pie chart illustrates the proportion of fraud cases between electricity and gas services:

- **Electricity**: Accounts for the majority of fraud cases, with 66.9% of the total.
- **Gas**: Contributes to 33.1% of the fraud cases.

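Electricity accounts for more fraud cases in absolute terms, but its share of fraud (66.9%) is close to, and slightly below, its 68.8% share of all invoice records. The higher count therefore mainly reflects volume, and the fraud rate in gas appears marginally higher, so robust detection measures are needed in both segments.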
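
The comparison between fraud share and record share can be turned into an approximate fraud rate per service. The sketch below uses only the rounded percentages and totals quoted in this notebook and assumes both pie charts cover the same two categories (`ELEC` and `GAZ`), so the results are indicative rather than exact.

```python
# Figures quoted in this notebook (percentages are rounded)
overall_fraud_rate = 353_112 / 4_476_749
record_share = {"ELEC": 0.688, "GAZ": 0.312}  # share of all invoice records
fraud_share = {"ELEC": 0.669, "GAZ": 0.331}   # share of fraud records

# rate(type) = fraud records of that type / all records of that type
implied_rate = {
    kind: overall_fraud_rate * fraud_share[kind] / record_share[kind]
    for kind in record_share
}
for kind, rate in implied_rate.items():
    print(f"{kind}: {rate:.1%}")
```

Computing `df.groupby('counter_type')['target'].mean()` directly on the merged table would give the exact values.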


In [None]:
# Filter fraud cases (target == 1)
fraud_cases = df[df['target'] == 1].copy()

# Create 'year' column to represent the year
fraud_cases['year'] = fraud_cases['invoice_date'].dt.year.astype(str)

# Group by 'counter_type' and 'year' to get the fraud count
fraud_distribution = fraud_cases.groupby(['counter_type', 'year']).size().reset_index(name='fraud_count')

# Filter data for gas and electricity only
fraud_distribution = fraud_distribution[fraud_distribution['counter_type'].isin(['ELEC', 'GAZ'])]

# Plotting
plt.figure(figsize=(14, 8))

sns.lineplot(data=fraud_distribution, x='year', y='fraud_count', hue='counter_type', marker='o')

# Adding data labels to the plot
for line in plt.gca().get_lines():
    for x, y in zip(line.get_xdata(), line.get_ydata()):
        plt.text(x, y, str(int(y)), color=line.get_color(), ha='center', va='bottom', fontsize=10)

# Formatting the plot
plt.title('Number of Fraud Cases Over Years for Gas and Electricity Counter Types', fontsize=16)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Fraud Cases Count', fontsize=12)
plt.xticks(rotation=45)
plt.legend(title='Counter Type')

# Display the plot
plt.tight_layout()
plt.show()

### Fraud Distribution Over Years for Electricity and Gas Services

The line chart compares the number of fraud cases over the years for electricity and gas services. Key observations include:

- **Electricity**: Fraud cases increased gradually from **2005 to 2017**, peaking at **20,202 cases** in 2017. They then declined, dropping to **11,850 by 2019**.
- **Gas**: As with electricity, fraud cases in gas services rose steadily, from **792 cases in 2005** to a peak of **11,951 in 2017**. After 2017 they declined, falling to **7,057 by 2019**.

Fraud cases for both electricity and gas services therefore rose substantially over the years, peaked in 2017, and then declined, potentially indicating the impact of improved fraud detection measures.
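To put the post-2017 decline in relative terms, the short calculation below uses only the counts quoted in the observations above. The helper name `pct_decline` is made up for this illustration.

```python
# Yearly counts quoted in the observations above (2017 peak and 2019)
peak_2017 = {"ELEC": 20_202, "GAZ": 11_951}
count_2019 = {"ELEC": 11_850, "GAZ": 7_057}


def pct_decline(peak: int, later: int) -> float:
    """Relative drop from `peak` to `later`, in percent."""
    return (peak - later) / peak * 100


declines = {k: pct_decline(peak_2017[k], count_2019[k]) for k in peak_2017}
for counter_type, drop in declines.items():
    print(f"{counter_type}: {drop:.1f}% below the 2017 peak")
```

Both services end up roughly 41% below their 2017 peak, so the relative decline is nearly identical even though the absolute counts differ.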


In [None]:
# Create 'month' and 'year' columns to represent the month and year
fraud_cases['month'] = fraud_cases['invoice_date'].dt.month
fraud_cases['year'] = fraud_cases['invoice_date'].dt.year.astype(str)

# Group by 'counter_type', 'year', and 'month' to get the fraud count and then calculate average per month
monthly_fraud = fraud_cases.groupby(['counter_type', 'year', 'month']).size().reset_index(name='fraud_count')

# Average the per-year monthly counts across years, for each counter type and calendar month
monthly_fraud_avg = monthly_fraud.groupby(['counter_type', 'month'])['fraud_count'].mean().reset_index(name='avg_fraud_count')

# Plotting
plt.figure(figsize=(14, 8))

sns.lineplot(data=monthly_fraud_avg, x='month', y='avg_fraud_count', hue='counter_type', marker='o')

# Add data labels
for line in plt.gca().get_lines():
    for x, y in zip(line.get_xdata(), line.get_ydata()):
        plt.text(x, y, f'{int(round(y))}', color='black', ha='center', va='bottom', fontsize=10)
        
# Formatting the plot
plt.title('Average Number of Fraud Cases Over Months for Gas and Electricity Counter Types', fontsize=16)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Average Fraud Cases Count', fontsize=12)
plt.xticks(ticks=range(1, 13), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.legend(title='Counter Type')

# Display the plot
plt.tight_layout()
plt.show()

### Monthly Trends of Average Fraud Cases in Electricity and Gas Services

The line chart examines the average number of fraud cases per month for electricity and gas services to identify months more prone to fraudulent activities. Key observations include:

- **Electricity**: 
  - Fraud cases peak in **February** (1,652 cases on average) and remain relatively high in January and March.
  - A noticeable dip occurs from June to December, with the lowest average in **July** (1,149 cases).

- **Gas**:
  - Fraud cases peak in **February** (881 cases on average), with January also high.
  - A significant decline is seen from **June to December**, reaching the lowest average in **December** (479 cases).

This trend suggests that February is the most fraud-prone month for both electricity and gas services, potentially due to seasonal factors or billing cycles.
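One way to read the peak and trough months off a table like `monthly_fraud_avg`, rather than from the chart, is with `idxmax` and `idxmin` per counter type. This is a minimal sketch: the frame `toy_avg` and its values are made up, and only the column names mirror the notebook.

```python
import pandas as pd

# Toy monthly averages shaped like `monthly_fraud_avg` (values are made up)
toy_avg = pd.DataFrame({
    "counter_type":    ["ELEC"] * 4 + ["GAZ"] * 4,
    "month":           [1, 2, 7, 12] * 2,
    "avg_fraud_count": [150.0, 165.0, 115.0, 120.0, 85.0, 88.0, 60.0, 48.0],
})

# Row labels of the highest and lowest average within each counter type
peak_idx = toy_avg.groupby("counter_type")["avg_fraud_count"].idxmax()
low_idx = toy_avg.groupby("counter_type")["avg_fraud_count"].idxmin()

# Look up the month on those rows and line the two results up side by side
summary = pd.DataFrame({
    "peak_month": toy_avg.loc[peak_idx].set_index("counter_type")["month"],
    "low_month": toy_avg.loc[low_idx].set_index("counter_type")["month"],
})
print(summary)
```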


In [None]:
# Step 1: Collect the unique IDs of fraudulent and honest clients (candidate pools; the comparison below uses fixed IDs)
fraudulent_client = df_filtered[df_filtered['target'] == 1]['client_id'].unique()
honest_client = df_filtered[df_filtered['target'] == 0]['client_id'].unique()

# Step 2: Define the function to compare consumption
def compare_consumption():
    # Use one fixed fraudulent and one fixed honest client (hard-coded, not sampled)
    fraudulent_client_id = 'train_Client_125420'
    honest_client_id = 'train_Client_72532'

    # Print selected client IDs for transparency
    print(f"Selected Fraudulent Client ID: {fraudulent_client_id}")
    print(f"Selected Honest Client ID: {honest_client_id}")

    # Step 3: Keep invoices dated 2005-01-01 or later
    start_date = "2005-01-01"
    counter_invoice_train = df_filtered[df_filtered['invoice_date'] >= start_date]

    # Step 4: Calculate monthly average consumption for all clients, fraudulent client, and honest client

    def plot_consumption(counter_type):
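        # Filter by counter type; copy so the new column below is added to an
        # independent frame rather than a slice (avoids SettingWithCopyWarning)
        counter_invoice = counter_invoice_train[counter_invoice_train['counter_type'] == counter_type].copy()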

        # Monthly average consumption for all clients
        counter_invoice['invoice_month'] = pd.to_datetime(counter_invoice['invoice_date']).dt.month
        monthly_avg_con = (counter_invoice.groupby('invoice_month')['consommation_level_1']
                           .mean().reset_index())

        # Monthly average consumption for the honest client
        honest_monthly_avg_con = counter_invoice[counter_invoice['client_id'] == honest_client_id].copy()
        honest_monthly_avg_con['invoice_month'] = pd.to_datetime(honest_monthly_avg_con['invoice_date']).dt.month
        honest_monthly_avg_con = (honest_monthly_avg_con.groupby('invoice_month')['consommation_level_1']
                                  .mean().reset_index())

        # Monthly average consumption for the fraudulent client
        fraudulent_monthly_avg_con = counter_invoice[counter_invoice['client_id'] == fraudulent_client_id].copy()
        fraudulent_monthly_avg_con['invoice_month'] = pd.to_datetime(fraudulent_monthly_avg_con['invoice_date']).dt.month
        fraudulent_monthly_avg_con = (fraudulent_monthly_avg_con.groupby('invoice_month')['consommation_level_1']
                                      .mean().reset_index())

        # Plotting
        plt.figure(figsize=(12, 6))

        # Plot monthly average for all clients
        plt.plot(monthly_avg_con['invoice_month'], monthly_avg_con['consommation_level_1'], label="Overall Avg", color="#073763", linewidth=1.5)

        # Plot honest client consumption
        plt.plot(honest_monthly_avg_con['invoice_month'], honest_monthly_avg_con['consommation_level_1'], 
                 label="Honest Customer", color="#3e9c15", linewidth=1.2)

        # Plot fraudulent client consumption
        plt.plot(fraudulent_monthly_avg_con['invoice_month'], fraudulent_monthly_avg_con['consommation_level_1'], 
                 label="Fraudulent Customer", color="#cc0000", linestyle="--", linewidth=1.2)

        # Add data labels
        for line in plt.gca().get_lines():
            for x, y in zip(line.get_xdata(), line.get_ydata()):
                plt.text(x, y, f'{int(round(y))}', color='black', ha='center', va='bottom', fontsize=10)

        # Customize plot
        plt.xticks(ticks=range(1, 13), labels=["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])
        plt.title(f"Consumption Comparison: Fraudulent vs Honest vs Overall Avg\n{counter_type} Consumption (From 2005 Onward)")
        plt.xlabel("Month")
        plt.ylabel("Consumption")
        plt.legend(loc="lower center", bbox_to_anchor=(0.5, -0.2), ncol=3)
        plt.grid(axis="y", linestyle="--", alpha=0.7)
        plt.tight_layout()
        plt.show()

    # Compare for both counter types
    plot_consumption("ELEC")
    plot_consumption("GAZ")

# Step 5: Call the function to compare consumption
compare_consumption()


### Comparison of Honest and Fraudulent Client Journeys with Average Consumption

These charts compare monthly consumption for one honest client and one fraudulent client (both IDs are fixed in the code) against the overall average, separately for electricity and gas services. Key insights include:

- **Fraudulent Client**: 
  - Displays a **zigzag pattern** in monthly consumption that deviates significantly from the average.
  - This erratic pattern points to inconsistent behavior of the kind associated with fraudulent activity.

- **Overall Trend**: 
  - The distinct behavior of fraudulent clients compared to honest ones can be observed across multiple examples, which suggests the pattern is recurring rather than a one-off.
  - This type of analysis is useful for identifying and investigating suspected customers based on irregular consumption behavior.
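The "zigzag" impression can also be put into a number. The sketch below scores a monthly series by its mean absolute month-to-month change, scaled by the mean level. The helper `zigzag_score` and both example series are hypothetical; this is one possible measure and not something computed in the notebook.

```python
import pandas as pd


def zigzag_score(monthly: pd.Series) -> float:
    """Mean absolute month-to-month change, scaled by the mean level.

    Hypothetical helper: 0 means perfectly flat consumption; larger values
    mean bigger swings between consecutive months.
    """
    return monthly.diff().abs().mean() / monthly.mean()


# Made-up monthly averages for two clients (not taken from the dataset)
steady = pd.Series([100, 110, 105, 100, 95, 100])
erratic = pd.Series([100, 20, 180, 10, 150, 30])

print(f"steady:  {zigzag_score(steady):.2f}")
print(f"erratic: {zigzag_score(erratic):.2f}")
```

A score like this could be computed per client and compared between the `target == 1` and `target == 0` groups to check whether the visual pattern holds beyond individual examples.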


## Future Scope of the Project

So far, the project has covered data understanding, cleaning, preprocessing, and exploratory data analysis (EDA) on an imbalanced dataset. Several enhancements remain to improve the outcomes and insights derived from the data:

1. **Model Development and Evaluation**:
   - Develop machine learning models to predict fraudulent activities.
   - Address the data imbalance using techniques like oversampling (SMOTE), undersampling, or cost-sensitive algorithms.
   - Evaluate models with appropriate metrics like F1-score, precision-recall curves, and AUC-ROC to handle the imbalance effectively.

2. **Advanced Feature Engineering**:
   - Create new features, such as seasonality-based consumption patterns or rolling averages, to capture behavioral insights.
   - Perform feature selection to identify the most influential predictors for fraud detection.

3. **Time-Series Analysis**:

4. **Real-Time Fraud Detection System**:
   - Implement a real-time monitoring system that flags unusual consumption patterns indicative of fraud.
   - Integrate the system with dashboards for immediate action by stakeholders.

5. **Customer Segmentation**:
   - Cluster customers based on consumption behavior to identify groups more prone to fraudulent activities.
   - Design targeted interventions or personalized services for different customer segments.

6. **Enhanced Data Collection**:
   - Incorporate additional data points such as customer demographics, weather conditions, and billing history.
   - Explore IoT integration for more granular and real-time data collection.
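To illustrate why the evaluation item above favors F1-score and precision-recall over plain accuracy, the sketch below scores a trivial "model" that never flags anyone on made-up, heavily imbalanced labels. The function `precision_recall_f1` is a hand-rolled helper written for this illustration; in practice a library implementation would normally be used.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive (fraud = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when nothing is predicted or present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels: 5 fraud cases among 100 clients
y_true = [1] * 5 + [0] * 95
always_honest = [0] * 100  # a "model" that never flags anyone

accuracy = sum(t == p for t, p in zip(y_true, always_honest)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, always_honest)
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

The accuracy looks high because most clients are honest, while recall and F1 are zero because no fraud case is caught.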

By focusing on these areas, the project can evolve into a comprehensive system for fraud detection, customer behavior analysis, and energy consumption optimization, ultimately delivering greater value to stakeholders.
