{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
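   "source": "# Phase 1: Data Acquisition & EDA\n\nThis notebook covers the initial exploration of the customer churn dataset. Data ingestion and feature engineering are handled by helper functions imported from `../src`, so the notebook itself stays focused on inspection, plots, and a first pass at feature engineering."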
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Import the libraries used for analysis and plotting\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nimport sys\n\n# Add the src directory to the module search path so the project modules can be imported\nsys.path.append('../src')\nfrom ingest_data import get_clean_data\nfrom feature_engineering import create_engagement_features, one_hot_encode_categorical\n\n# --- Step 1: Data Ingestion and Inspection ---\n# Load and clean the data with the project's ingestion function\ndata_path = '../data/CreditCardCustomers.csv'\ndf = get_clean_data(data_path)\n\n# Inspect column types, non-null counts, and the first few rows\nprint(\"DataFrame Info:\")\ndf.info()\nprint(\"\\nDataFrame Head:\")\nprint(df.head())\n"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "## Exploratory Data Analysis (EDA)"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Plot the class balance of the target variable `ChurnStatus`\nplt.figure(figsize=(8, 6))\nsns.countplot(x='ChurnStatus', data=df)\nplt.title('Distribution of Churn Status (0: Retained, 1: Churned)')\nplt.show()\n\n# Plot the distribution of a key numerical feature (`Age`)\nplt.figure(figsize=(10, 6))\nsns.histplot(df['Age'], bins=30, kde=True)\nplt.title('Distribution of Customer Age')\nplt.xlabel('Age')\nplt.show()\n\n# Compare churn counts across a categorical feature (`Card_Category`)\nplt.figure(figsize=(12, 6))\nsns.countplot(x='Card_Category', hue='ChurnStatus', data=df)\nplt.title('Churn Status by Card Category')\nplt.show()\n\n# Compare `TransactionAmount` for churned vs. retained customers\nplt.figure(figsize=(10, 6))\nsns.boxplot(x='ChurnStatus', y='TransactionAmount', data=df)\nplt.title('Transaction Amount by Churn Status')\nplt.show()\n\n# Stacked bar chart: percentage of each churn status within each gender\ngender_churn = pd.crosstab(df['Gender'], df['ChurnStatus'], normalize='index') * 100\ngender_churn.plot(kind='bar', stacked=True, figsize=(8, 6))\nplt.title('Churn Rate by Gender')\nplt.ylabel('Percentage')\nplt.xticks(rotation=0)\nplt.show()\n"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "## Feature Engineering"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Apply the project's feature engineering functions\ndf = create_engagement_features(df)\n\n# Define the categorical columns to one-hot encode\ncategorical_cols = ['Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category']\ndf = one_hot_encode_categorical(df, columns=categorical_cols)\n\n# Display the first rows of the engineered DataFrame to verify the new columns\nprint(\"\\nFinal Engineered DataFrame Head:\")\nprint(df.head())\n"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "## Scalability & Production Readiness (Innovation)\n\nThe feature engineering above runs in pandas, which holds the whole dataset in memory on a single machine. That works for this CSV, but it would not scale to a production environment at a company such as American Express, where customer data can be far larger than one machine's memory. To handle that scale, the pipeline would be migrated to a distributed computing framework.\n\nSpecifically, the data would be processed with **PySpark**. The data would be loaded into a Spark DataFrame, and every transformation would be expressed with PySpark's DataFrame API. The same code could then run on a cluster of machines, parallelizing the workload and reducing processing time on large inputs."
  }
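  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "The migration described above can be sketched as follows. This is a minimal illustration, not part of the project code: it assumes PySpark 3.x is installed and that the raw CSV contains the same categorical columns that are one-hot encoded earlier in this notebook. It skips the cleaning done by `get_clean_data` and the logic of `create_engagement_features`, since neither implementation is shown here. The application name and the `_idx` / `_ohe` column suffixes are made up for the example.\n\n```python\n# Sketch only: assumes PySpark 3.x and a CSV that Spark can read from this path.\nfrom pyspark.sql import SparkSession\nfrom pyspark.ml import Pipeline\nfrom pyspark.ml.feature import StringIndexer, OneHotEncoder\n\n# Hypothetical application name, chosen for this example\nspark = SparkSession.builder.appName('churn_feature_engineering').getOrCreate()\n\n# Read the raw file directly into a Spark DataFrame.\n# Assumption: no cleaning is applied here, unlike get_clean_data in the pandas path.\nsdf = spark.read.csv('../data/CreditCardCustomers.csv', header=True, inferSchema=True)\n\n# Assumption: these columns exist in the raw file under the same names\ncategorical_cols = ['Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category']\nindex_cols = [c + '_idx' for c in categorical_cols]\nonehot_cols = [c + '_ohe' for c in categorical_cols]\n\n# Map each category to a numeric index, then expand the indices into one-hot vectors\nindexer = StringIndexer(inputCols=categorical_cols, outputCols=index_cols, handleInvalid='keep')\nencoder = OneHotEncoder(inputCols=index_cols, outputCols=onehot_cols)\n\nencoded = Pipeline(stages=[indexer, encoder]).fit(sdf).transform(sdf)\nencoded.select(onehot_cols).show(5, truncate=False)\n```\n\nUnlike a pandas-style one-hot encoding, which typically adds one column per category, Spark's `OneHotEncoder` stores each encoded feature as a single sparse vector column and drops the last category by default (`dropLast=True`), so downstream code would need to account for that difference."
  }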
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}