# Import Data 

In [None]:
import pandas as pd
import numpy as np

campaign_leads = pd.read_csv("campaign_leads.csv")
campaigns = pd.read_csv("campaigns.csv")
insights = pd.read_csv("insights.csv")
lead_status_changes = pd.read_csv("lead_status_changes.csv")

campaign_leads.head(), campaigns.head(), insights.head(), lead_status_changes.head()


In [None]:
datasets = {
    "campaign_leads": campaign_leads,
    "campaigns": campaigns,
    "insights": insights,
    "lead_status_changes": lead_status_changes
}

for name, df in datasets.items():
    print(f"\n===== {name} - Shape: {df.shape} =====")
    df.info()  # info() prints directly and returns None, so no display() wrapper is needed


# ‚úÖ üìå Section 1: Dataset Overview (Rows & Columns)

Before performing any analysis, I inspected the structure of all four datasets to understand their scale and schema.  
Here is a summary of the number of rows and columns in each dataset:

| Dataset | Rows | Columns |
|--------|-------|----------|
| campaign_leads | 56,965 | 7 |
| campaigns | 7,364 | 6 |
| insights | 68,733 | 7 |
| lead_status_changes | 38,925 | 3 |

This overview shows how much data is available for leads, campaigns, ad performance, and sales status updates.  
It also shows that the tables differ widely in size (from 7,364 campaigns to 68,733 insight rows), so they need careful handling, especially when merging.
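
The row and column counts in the table above can be collected programmatically instead of being read off the `info()` output. The following is a minimal sketch; `summarize_shapes` and the toy frames are hypothetical names used for illustration, and in the notebook the `datasets` dict defined earlier would be passed in.

```python
import pandas as pd

def summarize_shapes(datasets):
    """Return one row per dataset with its row and column counts."""
    return pd.DataFrame(
        [
            {"dataset": name, "rows": df.shape[0], "columns": df.shape[1]}
            for name, df in datasets.items()
        ]
    )

# Hypothetical toy stand-ins for the real CSV-backed frames.
toy_datasets = {
    "campaign_leads": pd.DataFrame({"id": [1, 2, 3], "campaign_id": [10, 10, 11]}),
    "campaigns": pd.DataFrame({"id": [10, 11]}),
}
shape_summary = summarize_shapes(toy_datasets)
print(shape_summary)
```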

# ‚úÖ üìå Section 2: Key Observations from Initial Exploration

## 2. Key Observations

### ✔ No Missing Values
All four datasets have **0 missing values**, which is uncommon in lead-generation data.  
This suggests either:
- the system enforces required fields, or  
- missing information is stored as a placeholder value rather than a null (e.g., the `UNKNOWN` status), so a null check alone understates the gaps.
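
One way to test the second explanation is to count placeholder strings alongside true nulls. This is a sketch under assumptions: the helper name, the placeholder list, and the toy columns other than `lead_status` are hypothetical and should be adapted to the real schema.

```python
import pandas as pd

def missing_and_placeholder_counts(df, placeholders=("UNKNOWN", "", "N/A")):
    """Count true nulls and placeholder strings per column."""
    nulls = df.isna().sum()
    # isin() simply returns False for values that are not in the list,
    # so numeric columns report zero placeholder hits.
    placeholder_hits = df.apply(lambda col: col.isin(list(placeholders)).sum())
    return pd.DataFrame({"nulls": nulls, "placeholders": placeholder_hits})

# Hypothetical toy data; `source` is an illustrative column name.
toy_leads = pd.DataFrame(
    {
        "lead_id": [1, 2, 3],
        "lead_status": ["UNKNOWN", "NEW_LEAD", "UNKNOWN"],
        "source": ["fb", None, "fb"],
    }
)
report = missing_and_placeholder_counts(toy_leads)
print(report)
```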

### ✔ No Duplicates Except in Two Tables
- `insights` contains **16 duplicated rows**  
- `lead_status_changes` contains **1 duplicated row**

This indicates that ingestion jobs may occasionally insert the same record twice.
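
Duplicate counts like these can be obtained with `DataFrame.duplicated()`, which flags every fully identical row after its first occurrence. A minimal sketch on hypothetical toy data follows; `duplicate_counts` is an illustrative helper, not part of the original notebook.

```python
import pandas as pd

def duplicate_counts(datasets):
    """Map each dataset name to its number of fully duplicated rows."""
    return {name: int(df.duplicated().sum()) for name, df in datasets.items()}

# Hypothetical toy frames: the first contains one exact repeat.
toy = {
    "lead_status_changes": pd.DataFrame(
        {
            "lead_id": [1, 1, 2],
            "status": ["NO_ANSWER", "NO_ANSWER", "NEW_LEAD"],
            "created_at": ["2024-10-14 13:50:52", "2024-10-14 13:50:52", "2024-10-15 09:00:00"],
        }
    ),
    "campaigns": pd.DataFrame({"id": [10, 11]}),
}
dup_counts = duplicate_counts(toy)
print(dup_counts)
```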

### ✔ Lead Status Distribution Is Highly Skewed
A significant number of leads fall under:
- `UNKNOWN`: **27,582** leads  
- `NEW_LEAD`: **14,724** leads  

Together, these two statuses account for about **74% of all leads** (42,306 of 56,965), none of which has any meaningful sales activity logged.
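
The share can be checked with simple arithmetic on the counts reported above, and the same figure can be derived from a status column with `value_counts(normalize=True)`. This sketch assumes one row per lead in `campaign_leads` and a status column named `lead_status`; the toy frame is hypothetical.

```python
import pandas as pd

# Arithmetic check using the reported counts
# (assumes one row per lead in campaign_leads).
unknown, new_lead, total_leads = 27_582, 14_724, 56_965
untracked_share = (unknown + new_lead) / total_leads
print(f"{untracked_share:.1%}")

# Same idea on a status column, shown on hypothetical toy data.
toy = pd.DataFrame({"lead_status": ["UNKNOWN", "UNKNOWN", "NEW_LEAD", "QUALIFIED"]})
status_share = toy["lead_status"].value_counts(normalize=True)
# reindex() guards against a status that is absent from the data.
toy_untracked = status_share.reindex(["UNKNOWN", "NEW_LEAD"], fill_value=0).sum()
print(toy_untracked)
```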

### ✔ Clear Multi-level Data Structure
- One campaign → many leads  
- One campaign → many daily insights  
- One lead → potentially many status changes  

Because of these one-to-many relationships, joining everything into a single dataframe would repeat rows (row explosion) and inflate any totals computed on the result.
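
The effect is easy to reproduce on a small example. The sketch below uses hypothetical toy data (one campaign, two leads, three daily insight rows) and assumes both tables share a `campaign_id` key; it contrasts a naive merge with aggregating each table to campaign level first.

```python
import pandas as pd

# Hypothetical toy data: one campaign with 2 leads and 3 daily insight rows.
leads = pd.DataFrame({"lead_id": [1, 2], "campaign_id": [10, 10]})
daily = pd.DataFrame({"campaign_id": [10, 10, 10], "spend": [100.0, 50.0, 25.0]})

# Naive merge: every lead is paired with every daily row (2 x 3 = 6 rows),
# so summing spend counts each daily row once per lead.
naive = leads.merge(daily, on="campaign_id")
naive_spend = naive["spend"].sum()

# Safer: aggregate each table to one row per campaign, then merge.
lead_counts = leads.groupby("campaign_id", as_index=False).agg(n_leads=("lead_id", "count"))
spend_totals = daily.groupby("campaign_id", as_index=False).agg(spend=("spend", "sum"))
per_campaign = lead_counts.merge(spend_totals, on="campaign_id", how="left")

print(len(naive), naive_spend)
print(per_campaign)
```

Aggregating before joining keeps one row per campaign, so totals such as spend or lead count stay correct.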


# ‚úÖ üìå Section 3: Data Quality Issues

## 3. Data Quality Issues Identified

### 🔸 3.1 Duplicates in Status Updates
One duplicated row was found in `lead_status_changes`:

| lead_id | status | created_at |
|---------|---------|-------------|
| 129714 | NO_ANSWER | 2024-10-14 13:50:52 |
| 129714 | NO_ANSWER | 2024-10-14 13:50:52 |

This suggests that the system logged the same status update twice for the same lead at the exact same timestamp.

### 🔸 3.2 Duplicates in Ad Insights
16 duplicated rows were found in `insights`.  
Example:

| campaign_id | reach | spend | clicks | impressions | created_at |
|-------------|--------|--------|---------|---------------|--------------|
| 21108 | 8130 | 348.65 | 222 | 14305 | 2024-07-19 |
| 21108 | 8130 | 348.65 | 222 | 14305 | 2024-07-19 |

This is a common pattern in Facebook Ads ingestion, where the same daily record is sometimes pulled twice.
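
Exact repeats like the ones above can be inspected and then removed before any aggregation. The following is a minimal sketch on a toy frame: the first two rows copy the example above, and the third row is hypothetical data added so that something remains distinct.

```python
import pandas as pd

insights_toy = pd.DataFrame(
    {
        "campaign_id": [21108, 21108, 21108],
        "reach": [8130, 8130, 7900],
        "spend": [348.65, 348.65, 300.00],
        "clicks": [222, 222, 200],
        "impressions": [14305, 14305, 13000],
        "created_at": ["2024-07-19", "2024-07-19", "2024-07-20"],
    }
)

# keep=False marks every member of a duplicate group, useful for review.
dupes = insights_toy[insights_toy.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence of each identical row.
deduped = insights_toy.drop_duplicates().reset_index(drop=True)

print(len(dupes), len(deduped))
```

Whether deduplication should compare full rows or only a key such as `campaign_id` plus `created_at` depends on how the ingestion job works, which is one of the open questions for stakeholders.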

### 🔸 3.3 High Volume of Untracked Leads
- `UNKNOWN`: 27,582 leads  
- `NEW_LEAD`: 14,724 leads  

This means **74% of leads show no follow-up or sales interaction**, making funnel analysis and ML modeling challenging.

### 🔸 3.4 Conversion Definition Is Not Explicit
There is no clear field that explicitly marks a lead as "Converted."  
Instead, conversion must be inferred from the `lead_status` values.


# ‚úÖ üìå Section 4: Defining Conversion (Based on Available Data)

## 4. Conversion Definition

Since no explicit "conversion" field exists, I derived conversion from the lead statuses.

A lead is considered **converted** if it reached a high-value sales stage.  
Based on the observed statuses, the following values represent successful outcomes:

**Conversion statuses:**
- DONE_DEAL  
- ALREADY_BOUGHT  
- MEETING_DONE  
- QUALIFIED  
- HIGH_INTEREST  
- RESALE_REQUEST  

**Total Converted Leads:** 4,436  
**Conversion Rate:** ~7.8% (4,436 of 56,965 leads)

This definition will be refined after consulting stakeholders, but it provides a strong analytical baseline for BI dashboards and predictive modeling.
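
The definition above can be expressed as a boolean flag on the leads table. This is a sketch, assuming the status lives in a column named `lead_status`; `add_conversion_flag` is a hypothetical helper and the toy frame is illustrative only.

```python
import pandas as pd

# Statuses treated as conversions under the working definition above.
CONVERSION_STATUSES = {
    "DONE_DEAL",
    "ALREADY_BOUGHT",
    "MEETING_DONE",
    "QUALIFIED",
    "HIGH_INTEREST",
    "RESALE_REQUEST",
}

def add_conversion_flag(leads, status_col="lead_status"):
    """Return a copy of `leads` with a boolean `converted` column."""
    out = leads.copy()
    out["converted"] = out[status_col].isin(CONVERSION_STATUSES)
    return out

# Hypothetical toy data.
toy_leads = pd.DataFrame(
    {"lead_status": ["UNKNOWN", "DONE_DEAL", "NEW_LEAD", "QUALIFIED", "NO_ANSWER"]}
)
flagged = add_conversion_flag(toy_leads)
conversion_rate = flagged["converted"].mean()
print(conversion_rate)
```

Keeping the status list in one constant makes it straightforward to update once stakeholders confirm the official definition.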


# ‚úÖ üìå Section 5: Questions for Stakeholders

## 5. Questions for Stakeholders

To ensure accurate modeling and reporting, I would need clarification on the following:

### ❓ 5.1 Conversion Definition
- Which statuses should officially count as a "successful" lead?
- Do sales teams follow a standard funnel, or do statuses vary by client?

### ❓ 5.2 Meaning of `UNKNOWN`
- Does `UNKNOWN` mean the sales team never contacted the lead?
- Is it a system default when no status is provided?
- Should UNKNOWN be interpreted as a failed lead?

### ❓ 5.3 Missing Sales Activity
- Why do 74% of leads have no meaningful status updates?
- Is this expected behavior or data loss?
- Do clients update statuses manually or via API?

### ❓ 5.4 Insights Duplication
- Are duplicated insight rows a known ingestion issue?
- Should they be deduplicated daily before analytics?

### ❓ 5.5 Lead Lifecycle
- What is the expected maximum time from NEW_LEAD to a final status?
- Are statuses like FOLLOW_UP or CALL_AGAIN considered mid-funnel or success indicators?
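
To ground the lifecycle discussion, the observed time between a lead's first and last status update could be measured from `lead_status_changes`. The sketch below uses hypothetical toy data with the `lead_id`, `status`, and `created_at` columns shown earlier; it measures elapsed time between updates only and does not assume which statuses are final.

```python
import pandas as pd

# Hypothetical toy status history.
changes = pd.DataFrame(
    {
        "lead_id": [1, 1, 1, 2],
        "status": ["NEW_LEAD", "FOLLOW_UP", "DONE_DEAL", "NEW_LEAD"],
        "created_at": [
            "2024-10-01 09:00:00",
            "2024-10-03 09:00:00",
            "2024-10-11 09:00:00",
            "2024-10-02 12:00:00",
        ],
    }
)
changes["created_at"] = pd.to_datetime(changes["created_at"])

# First and last update per lead, then the elapsed time between them.
lifecycle = changes.groupby("lead_id")["created_at"].agg(
    first_update="min", last_update="max"
)
lifecycle["duration"] = lifecycle["last_update"] - lifecycle["first_update"]
print(lifecycle)
```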
