{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Customer Churn Data Exploration\n",
    "\n",
    "This notebook explores the Telco Customer Churn dataset to understand key characteristics and patterns before model development.\n",
    "\n",
    "## Data Overview\n",
    "\n",
    "The dataset contains information about customers of a telecom company, including:\n",
    "- Demographics (gender, senior citizen status, partners, dependents)\n",
    "- Services subscribed (phone, internet, security features, etc.)\n",
    "- Account information (tenure, contract type, payment method)\n",
    "- Billing information (monthly charges, total charges)\n",
    "- Churn status (whether the customer left the company)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "source": [
    "# Import libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "# Set plot style\n",
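    "# The 'seaborn-whitegrid' style name was removed in Matplotlib 3.8; set the grid style through seaborn instead\n",
    "sns.set_style('whitegrid')\n",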
    "sns.set_palette('colorblind')\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "source": [
    "# Load the dataset\n",
    "file_path = '../data/WA_Fn-UseC_-Telco-Customer-Churn.csv'\n",
    "df = pd.read_csv(file_path)\n",
    "\n",
    "# Display the first few rows\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Structure Analysis\n",
    "\n",
    "Let's examine the basic structure of our dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "source": [
    "# Basic information about the dataset\n",
    "print(f\"Dataset shape: {df.shape}\")\n",
    "print(f\"Number of unique customers: {df['customerID'].nunique()}\")\n",
    "\n",
    "# Display data types and missing values\n",
    "print(\"\\nData Types:\")\n",
    "print(df.dtypes)\n",
    "\n",
    "print(\"\\nMissing Values:\")\n",
    "print(df.isnull().sum())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "source": [
    "# Summary statistics for numeric columns\n",
    "df.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "source": [
    "# Summary statistics for categorical columns\n",
    "df.describe(include=['object'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "source": [
    "# Examine the 'TotalCharges' column which might have issues\n",
    "print(\"Data type of TotalCharges:\", df['TotalCharges'].dtype)\n",
    "print(\"Sample values of TotalCharges:\")\n",
    "print(df['TotalCharges'].head(10))\n",
    "\n",
    "# Convert to numeric and check for missing values\n",
    "df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')\n",
    "print(\"\\nMissing values after conversion:\", df['TotalCharges'].isnull().sum())\n",
    "\n",
    "# Check patterns in missing values\n",
    "missing_total_charges = df[df['TotalCharges'].isnull()]\n",
    "print(\"\\nRows with missing TotalCharges:\")\n",
    "missing_total_charges"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Observations about missing values\n",
    "\n",
    "- The `TotalCharges` column is stored as an object (string) type in the original dataset\n",
    "- When converting to numeric, some values become NaN\n",
    "- Looking at the rows with missing values, we can see they all have tenure of 0 months\n",
    "- This makes sense: customers with 0 tenure haven't been billed yet, so their total charges are 0\n",
    "- We'll handle this during data preprocessing by setting these to 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploratory Data Analysis\n",
    "\n",
    "### 1. Churn Distribution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "source": [
    "# Calculate churn rate\n",
    "churn_count = df['Churn'].value_counts()\n",
    "churn_rate = churn_count['Yes'] / len(df) * 100\n",
    "\n",
    "print(\"Churn counts:\")\n",
    "print(churn_count)\n",
    "print(f\"Churn Rate: {churn_rate:.2f}%\")\n",
    "\n",
    "# Visualize churn distribution\n",
    "plt.figure(figsize=(10, 6))\n",
    "ax = sns.countplot(x='Churn', data=df)\n",
    "\n",
    "# Add count labels on bars\n",
    "for p in ax.patches:\n",
    "    ax.annotate(f'{int(p.get_height()):,}',\n",
    "                (p.get_x() + p.get_width() / 2., p.get_height()),\n",
    "                ha = 'center', va = 'bottom',\n",
    "                fontsize=12)\n",
    "\n",
    "plt.title('Distribution of Customer Churn')\n",
    "plt.ylabel('Number of Customers')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Demographic Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "source": [
    "# Gender distribution and churn\n",
    "plt.figure(figsize=(12, 5))\n",
    "\n",
    "plt.subplot(1, 2, 1)\n",
    "sns.countplot(x='gender', data=df)\n",
    "plt.title('Gender Distribution')\n",
    "\n",
    "plt.subplot(1, 2, 2)\n",
    "sns.countplot(x='gender', hue='Churn', data=df)\n",
    "plt.title('Churn by Gender')\n",
    "plt.show()\n",
    "\n",
    "# Calculate and display churn rate by gender\n",
    "churn_by_gender = df.groupby('gender')['Churn'].apply(lambda x: (x == 'Yes').mean() * 100)\n",
    "print(\"Churn Rate by Gender:\")\n",
    "print(churn_by_gender)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "source": [
    "# Senior Citizen, Partner, and Dependents analysis\n",
    "fig, axes = plt.subplots(1, 3, figsize=(18, 6))\n",
    "\n",
    "# Senior Citizen\n",
    "sns.countplot(x='SeniorCitizen', hue='Churn', data=df, ax=axes[0])\n",
    "axes[0].set_title('Churn by Senior Citizen Status')\n",
    "axes[0].set_xticklabels(['No', 'Yes'])\n",
    "\n",
    "# Partner\n",
    "sns.countplot(x='Partner', hue='Churn', data=df, ax=axes[1])\n",
    "axes[1].set_title('Churn by Partner Status')\n",
    "\n",
    "# Dependents\n",
    "sns.countplot(x='Dependents', hue='Churn', data=df, ax=axes[2])\n",
    "axes[2].set_title('Churn by Dependents Status')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# Calculate churn rates for each demographic factor\n",
    "demo_factors = ['SeniorCitizen', 'Partner', 'Dependents']\n",
    "for factor in demo_factors:\n",
    "    churn_rate = df.groupby(factor)['Churn'].apply(lambda x: (x == 'Yes').mean() * 100)\n",
    "    print(f\"\\nChurn Rate by {factor}:\")\n",
    "    if factor == 'SeniorCitizen':\n",
    "        # Map 0/1 to No/Yes for display\n",
    "        churn_rate.index = ['No', 'Yes']\n",
    "    print(churn_rate)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Service Usage Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "source": [
    "# Internet Service analysis\n",
    "plt.figure(figsize=(10, 6))\n",
    "ax = sns.countplot(x='InternetService', hue='Churn', data=df)\n",
    "\n",
    "for container in ax.containers:\n",
    "    ax.bar_label(container, fmt='%d')\n",
    "    \n",
    "plt.title('Churn by Internet Service Type')\n",
    "plt.show()\n",
    "\n",
    "# Calculate churn rate by internet service\n",
    "churn_by_internet = df.groupby('InternetService')['Churn'].apply(lambda x: (x == 'Yes').mean() * 100)\n",
    "print(\"Churn Rate by Internet Service:\")\n",
    "print(churn_by_internet)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "source": [
    "# Analyze additional services\n",
    "additional_services = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', \n",
    "                        'TechSupport', 'StreamingTV', 'StreamingMovies']\n",
    "\n",
    "# Create a figure with subplots for each service\n",
    "fig, axes = plt.subplots(3, 2, figsize=(15, 15))\n",
    "axes = axes.flatten()\n",
    "\n",
    "# Plot churn by each service\n",
    "for i, service in enumerate(additional_services):\n",
    "    sns.countplot(x=service, hue='Churn', data=df, ax=axes[i])\n",
    "    axes[i].set_title(f'Churn by {service}')\n",
    "    axes[i].tick_params(axis='x', rotation=45)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# Calculate churn rate for each service\n",
    "print(\"Churn Rate by Additional Services:\")\n",
    "for service in additional_services:\n",
    "    churn_rate = df.groupby(service)['Churn'].apply(lambda x: (x == 'Yes').mean() * 100)\n",
    "    print(f\"\\n{service}:\")\n",
    "    print(churn_rate)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Contract and Billing Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "source": [
    "# Contract type analysis\n",
    "plt.figure(figsize=(10, 6))\n",
    "ax = sns.countplot(x='Contract', hue='Churn', data=df)\n",
    "\n",
    "for container in ax.containers:\n",
    "    ax.bar_label(container, fmt='%d')\n",
    "    \n",
    "plt.title('Churn by Contract Type')\n",
    "plt.show()\n",
    "\n",
    "# Calculate churn rate by contract\n",
    "churn_by_contract = df.groupby('Contract')['Churn'].apply(lambda x: (x == 'Yes').mean() * 100)\n",
    "print(\"Churn Rate by Contract:\")\n",
    "print(churn_by_contract)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "source": [
    "# Payment method and paperless billing analysis\n",
    "fig, axes = plt.subplots(1, 2, figsize=(16, 6))\n",
    "\n",
    "# Payment Method\n",
    "sns.countplot(y='PaymentMethod', hue='Churn', data=df, ax=axes[0])\n",
    "axes[0].set_title('Churn by Payment Method')\n",
    "\n",
    "# Paperless Billing\n",
    "sns.countplot(x='PaperlessBilling', hue='Churn', data=df, ax=axes[1])\n",
    "axes[1].set_title('Churn by Paperless Billing')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "# Calculate churn rates\n",
    "print(\"Churn Rate by Payment Method:\")\n",
    "print(df.groupby('PaymentMethod')['Churn'].apply(lambda x: (x == 'Yes').mean() * 100))\n",
    "\n",
    "print(\"\\nChurn Rate by Paperless Billing:\")\n",
    "print(df.groupby('PaperlessBilling')['Churn'].apply(lambda x: (x == 'Yes').mean() * 100))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5. Tenure and Charges Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "source": [
    "# Fix TotalCharges - convert to numeric\n",
    "df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')\n",
    "# Fill missing values with 0 for customers with 0 tenure\n",
    "df.loc[(df['TotalCharges'].isnull()) & (df['tenure'] == 0), 'TotalCharges'] = 0\n",
    "\n",
    "# Tenure distribution\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.histplot(data=df, x='tenure', hue='Churn', multiple='stack', bins=20)\n",
    "plt.title('Tenure Distribution by Churn Status')\n",
    "plt.xlabel('Tenure (months)')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "source": [
    "# Create tenure groups for better visualization\n",
    "tenure_bins = [0, 12, 24, 36, 48, 60, 72]\n",
    "tenure_labels = ['0-12', '13-24', '25-36', '37-48', '49-60', '61-72']\n",
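    "# Right-closed bins with include_lowest=True, so tenure 0 and tenure 72 both land in a group and the labels match the bin edges\n",
    "df['tenure_group'] = pd.cut(df['tenure'], bins=tenure_bins, labels=tenure_labels, right=True, include_lowest=True)\n",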
    "\n",
    "# Churn rate by tenure group\n",
    "plt.figure(figsize=(10, 6))\n",
    "tenure_churn = df.groupby('tenure_group')['Churn'].apply(lambda x: (x == 'Yes').mean() * 100)\n",
    "tenure_churn.plot(kind='bar')\n",
    "plt.title('Churn Rate by Tenure Group')\n",
    "plt.ylabel('Churn Rate (%)')\n",
    "plt.xlabel('Tenure Group (months)')\n",
    "plt.show()\n",
    "\n",
    "print(\"Churn Rate by Tenure Group:\")\n",
    "print(tenure_churn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "source": [
    "# Monthly charges analysis\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.histplot(data=df, x='MonthlyCharges', hue='Churn', multiple='stack', bins=20)\n",
    "plt.title('Monthly Charges Distribution by Churn Status')\n",
    "plt.xlabel('Monthly Charges ($)')\n",
    "plt.show()\n",
    "\n",
    "# Monthly charges statistics by churn\n",
    "monthly_stats = df.groupby('Churn')['MonthlyCharges'].describe()\n",
    "print(\"Monthly Charges Statistics by Churn:\")\n",
    "print(monthly_stats)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Relationships Analysis\n",
    "\n",
    "Let's examine relationships between key features and their impact on churn."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "source": [
    "# Convert categorical churn to numeric for correlation analysis\n",
    "df['Churn_numeric'] = df['Churn'].map({'Yes': 1, 'No': 0})\n",
    "\n",
    "# Calculate correlation between numeric features\n",
    "numeric_cols = ['tenure', 'MonthlyCharges', 'TotalCharges', 'Churn_numeric']\n",
    "corr = df[numeric_cols].corr()\n",
    "\n",
    "# Visualize correlation matrix\n",
    "plt.figure(figsize=(10, 8))\n",
    "sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')\n",
    "plt.title('Correlation Matrix of Numeric Features')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "source": [
    "# Relationship between tenure and monthly charges\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.scatterplot(x='tenure', y='MonthlyCharges', hue='Churn', data=df)\n",
    "plt.title('Relationship between Tenure, Monthly Charges, and Churn')\n",
    "plt.xlabel('Tenure (months)')\n",
    "plt.ylabel('Monthly Charges ($)')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": None,
   "metadata": {},
   "source": [
    "# Analyze churn by contract and internet service\n",
    "plt.figure(figsize=(12, 7))\n",
    "contract_internet_churn = df.groupby(['Contract', 'InternetService'])['Churn_numeric'].mean() * 100\n",
    "contract_internet_churn = contract_internet_churn.unstack()\n",
    "sns.heatmap(contract_internet_churn, annot=True, fmt='.1f', cmap='YlOrRd')\n",
    "plt.title('Churn Rate (%) by Contract Type and Internet Service')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Key Insights Summary\n",
    "\n",
    "Based on the exploratory data analysis, here are the key insights about customer churn:\n",
    "\n",
    "1. **Overall Churn Rate**: The overall churn rate is approximately 26.5%.\n",
    "\n",
    "2. **Demographics**:\n",
    "   - Senior citizens have a higher churn rate compared to non-seniors.\n",
    "   - Customers without partners or dependents are more likely to churn.\n",
    "\n",
    "3. **Services**:\n",
    "   - Fiber optic internet service customers have the highest churn rate.\n",
    "   - Customers without online security and tech support have higher churn rates.\n",
    "\n",
    "4. **Contract and Billing**:\n",
    "   - Month-to-month contracts have significantly higher churn than one- or two-year contracts.\n",
    "   - Electronic check payment method is associated with higher churn.\n",
    "   - Paperless billing customers have higher churn rates.\n",
    "\n",
    "5. **Tenure and Charges**:\n",
    "   - New customers (lower tenure) are much more likely to churn.\n",
    "   - Higher monthly charges correlate with increased churn, especially for newer customers.\n",
    "   - The negative correlation between tenure and churn suggests that customer loyalty increases over time.\n",
    "\n",
    "## Next Steps\n",
    "\n",
    "Based on these insights, our next steps will include:\n",
    "\n",
    "1. **Data Preprocessing**:\n",
    "   - Handle missing values in the TotalCharges column\n",
    "   - Convert categorical variables to appropriate numerical representations\n",
    "\n",
    "2. **Feature Engineering**:\n",
    "   - Create tenure groups for better segmentation\n",
    "   - Develop service usage metrics (e.g., total number of services)\n",
    "   - Create contract risk factors\n",
    "   - Calculate customer lifetime value approximations\n",
    "\n",
    "3. **Model Development**:\n",
    "   - Train classification models to predict churn\n",
    "   - Analyze feature importance to identify key churn drivers\n",
    "   - Develop targeted retention strategies based on model insights"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
