# Week 1 Assignment: Predicting Customer Churn with Logistic Regression

---

### **Objective**

The goal of this assignment is to build and evaluate a Logistic Regression model to predict customer churn for a telecommunications company. This task will take you through the fundamental steps of a real-world machine learning project: data exploration, preprocessing, model training, and performance evaluation.

### **Background & Problem Statement**

You are working as a Junior Data Scientist for a telecom company, "ConnectSphere." The company is facing a significant challenge with customer churn—customers who cancel their subscriptions. It is far more expensive to acquire a new customer than it is to retain an existing one.

Your manager has tasked you with analyzing a dataset of past customers to identify the key factors that lead to churn. Ultimately, you need to build a model that can predict whether a current customer is likely to churn. This will allow the marketing team to proactively offer retention incentives to at-risk customers.

### **Dataset**

You will be using the provided "Telco Customer Churn" dataset. It contains information about customer demographics, subscribed services, account information, and whether they churned.

#### **Key Columns to Note:**
*   `customerID`: Unique identifier for each customer.
*   `gender`, `SeniorCitizen`, `Partner`, `Dependents`: Customer demographic information.
*   `tenure`: Number of months the customer has stayed with the company.
*   `PhoneService`, `MultipleLines`, `InternetService`, etc.: Services subscribed to by the customer.
*   `MonthlyCharges`, `TotalCharges`: Account and payment information.
*   **`Churn`**: The target variable. 'Yes' if the customer churned, 'No' otherwise.

---

### **Tasks & Instructions**

Please structure your code (either in a Jupyter Notebook or a Python script) to follow these steps. Add comments or markdown cells to explain your process and interpret your results.

**1. Step 1: Setup and Data Loading**
   - Import necessary libraries (`pandas`, `numpy`, `sklearn`, `matplotlib`/`seaborn`).
   - Load the `Telco-Customer-Churn.csv` file into a pandas DataFrame.

**2. Step 2: Exploratory Data Analysis (EDA) & Preprocessing**
   - Inspect the first few rows of the DataFrame using `.head()`.
   - Use `.info()` to check data types and look for missing values.
     - *Hint: The `TotalCharges` column might be an 'object' type instead of a number. You will need to investigate why and convert it to a numeric type. Any rows that can't be converted should be handled appropriately (e.g., by dropping them).*
   - Get summary statistics with `.describe()`.
   - Analyze the target variable `Churn`. Is the dataset balanced? (i.e., what's the proportion of 'Yes' vs. 'No'?)
   - Convert the categorical target variable `Churn` into a numerical format (e.g., 'Yes' -> 1, 'No' -> 0).
   - Identify all other categorical columns in the dataset. Convert them into numerical format using an appropriate encoding technique (e.g., one-hot encoding with `pandas.get_dummies`).
   - The `customerID` column is not a useful feature for prediction. Make sure to drop it before training.

**3. Step 3: Feature Selection and Data Splitting**
   - Define your feature matrix `X` (all columns except the target) and your target vector `y` (the churn column).
   - Split your data into a training set (80%) and a testing set (20%) using `train_test_split` from scikit-learn. Use a `random_state` for reproducibility.

**4. Step 4: Model Training**
   - Instantiate a `LogisticRegression` model from scikit-learn.
   - Train (fit) the model on your training data (`X_train`, `y_train`).

**5. Step 5: Model Evaluation**
   - Make predictions on your testing data (`X_test`).
   - Calculate the following evaluation metrics:
     1.  **Accuracy:** What percentage of predictions were correct?
     2.  **Confusion Matrix:** Display the matrix to see the breakdown of True Positives, True Negatives, False Positives, and False Negatives.
     3.  **Precision:** Of all the customers your model predicted would churn, how many actually did?
     4.  **Recall (Sensitivity):** Of all the customers who actually churned, how many did your model correctly identify?
   - **Write a brief interpretation for each metric.** In the context of this business problem, is precision or recall more important? Why?

**6. Step 6: Conclusion**
   - Write a one-paragraph summary of your findings for your "manager." What does the model tell you, and how well does it perform at its task?

---

### **Submission Instructions**

1.  **Deadline:** You have **one week** from the assignment release date to submit your work.
2.  **Platform:** All submissions must be made to your allocated private GitLab repository. You **must** submit your work in a branch named `week_1`.
3.  **Format:** You can submit your work as either a Jupyter Notebook (`.ipynb`) or a Python script (`.py`).
4.  After pushing, verify that your branch and files are visible on the GitLab web interface. No further action is needed. The trainers will review all submissions on the `week_1` branch after the deadline. Assignments submitted after the deadline will not be reviewed, and this will be reflected in your course score.
5. The use of LLMs is encouraged, but ensure that you’re not copying solutions blindly. Always review, test, and understand any code generated, adapting it to the specific requirements of your assignment. Your submission should demonstrate your own comprehension, problem-solving process, and coding style, not just an unedited output from an AI tool.

## Step 1: Setup and Data Loading

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, classification_report

# Set style for better visualizations
plt.style.use('default')
sns.set_palette("husl")

# Load the dataset
df = pd.read_csv('Telco-Customer-Churn.csv')

print("Dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
df.head()

## Step 2: Exploratory Data Analysis (EDA) & Preprocessing

In [None]:
# Inspect the data structure and check for missing values
print("Dataset Information:")
print("="*50)
df.info()

print("\n\nDataset Description:")
print("="*50)
df.describe()

In [None]:
# Check and fix TotalCharges column (as mentioned in the hint)
print("Checking TotalCharges column:")
print(f"Data type: {df['TotalCharges'].dtype}")
print("First 20 unique values (to spot non-numeric entries):")
print(df['TotalCharges'].unique()[:20])

# Check for non-numeric values in TotalCharges
non_numeric_charges = df[pd.to_numeric(df['TotalCharges'], errors='coerce').isna()]
print(f"\nNumber of non-numeric TotalCharges: {len(non_numeric_charges)}")
print("Sample of problematic rows:")
print(non_numeric_charges[['customerID', 'TotalCharges', 'tenure']].head())

# Convert TotalCharges to numeric, invalid parsing will be set as NaN
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Check for missing values after conversion
print(f"\nMissing values in TotalCharges after conversion: {df['TotalCharges'].isna().sum()}")

# Drop rows with missing TotalCharges (they are likely new customers with 0 tenure)
df = df.dropna(subset=['TotalCharges'])
print(f"Dataset shape after removing missing TotalCharges: {df.shape}")
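
To see what `errors='coerce'` does in isolation, here is a minimal sketch on a toy Series. The values are made up for illustration and are not taken from the dataset; the blank string stands in for the kind of entry that keeps the real column stored as `object`.

```python
import pandas as pd

# Toy stand-in for the TotalCharges column (hypothetical values).
raw = pd.Series(["29.85", " ", "108.15"], name="TotalCharges")

# Entries that cannot be parsed as numbers become NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")
n_bad = int(converted.isna().sum())
cleaned = converted.dropna()

print(converted.dtype)   # float64
print(n_bad)             # 1
print(cleaned.tolist())  # [29.85, 108.15]
```

Counting the NaN values before dropping them, as the cell above does, makes it easy to confirm that only a small number of rows are being discarded.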

In [None]:
# Analyze the target variable 'Churn'
print("Target Variable Analysis:")
print("="*50)
churn_counts = df['Churn'].value_counts()
churn_proportions = df['Churn'].value_counts(normalize=True)

print("Churn distribution:")
print(churn_counts)
print("\nChurn proportions:")
print(churn_proportions)

# Visualize the target variable distribution
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
churn_counts.plot(kind='bar', color=['skyblue', 'salmon'])
plt.title('Customer Churn Distribution')
plt.xlabel('Churn')
plt.ylabel('Count')
plt.xticks(rotation=0)

plt.subplot(1, 2, 2)
plt.pie(churn_counts.values, labels=churn_counts.index, autopct='%1.1f%%', colors=['skyblue', 'salmon'])
plt.title('Customer Churn Percentage')

plt.tight_layout()
plt.show()

# Check if dataset is balanced
print(f"\nDataset Balance Analysis:")
print(f"The dataset is {'balanced' if abs(churn_proportions['Yes'] - churn_proportions['No']) < 0.1 else 'imbalanced'}")
print(f"Churn rate: {churn_proportions['Yes']:.1%}")
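
The balance check matters because it sets the bar that accuracy has to clear. As a rough sketch, a "model" that always predicts the majority class can be simulated with scikit-learn's `DummyClassifier`. The 73/27 split below is synthetic and only illustrative; substitute the proportions printed above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 73 non-churners and 27 churners (illustrative split only).
y_demo = np.array([0] * 73 + [1] * 27)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_base)
baseline_recall = recall_score(y_demo, y_base)

print(f"Baseline accuracy: {baseline_accuracy:.2f}")  # 0.73
print(f"Baseline recall:   {baseline_recall:.2f}")    # 0.00
```

Any trained model should be compared against this kind of baseline: matching its accuracy while finding no churners would be of no use to the marketing team.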

In [None]:
# Data Preprocessing: Convert categorical variables to numerical

# First, let's identify all categorical columns
categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
print("Categorical columns:")
print(categorical_columns)

# Remove customerID as it's not useful for prediction
if 'customerID' in categorical_columns:
    categorical_columns.remove('customerID')

print(f"\nCategorical columns for encoding: {categorical_columns}")

# Convert target variable 'Churn' to numerical (Yes=1, No=0)
df['Churn_numeric'] = df['Churn'].map({'Yes': 1, 'No': 0})

# Create a copy of the dataframe for preprocessing
df_processed = df.copy()

# Drop the original Churn column and customerID
df_processed = df_processed.drop(['Churn', 'customerID'], axis=1)

# Handle other categorical columns using one-hot encoding
categorical_features = [col for col in categorical_columns if col != 'Churn']
print(f"\nColumns to be one-hot encoded: {categorical_features}")

# Apply one-hot encoding
df_encoded = pd.get_dummies(df_processed, columns=categorical_features, drop_first=True)

print(f"\nDataset shape after encoding: {df_encoded.shape}")
print("\nFinal columns:")
print(df_encoded.columns.tolist())
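
The effect of `drop_first=True` is easier to see on a small example. The sketch below uses a toy DataFrame (hypothetical rows, not the real data) and compares the columns produced with and without it. Dropping the first level removes a redundant column, since its value is implied by the others.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (hypothetical values).
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1, 14, 40, 22],
})

full = pd.get_dummies(toy, columns=["Contract"])
reduced = pd.get_dummies(toy, columns=["Contract"], drop_first=True)

print(full.columns.tolist())
# ['tenure', 'Contract_Month-to-month', 'Contract_One year', 'Contract_Two year']
print(reduced.columns.tolist())
# ['tenure', 'Contract_One year', 'Contract_Two year']
```

With `drop_first=True`, the dropped level becomes the reference category: the coefficients of the remaining dummy columns are read relative to it.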

## Step 3: Feature Selection and Data Splitting

In [None]:
# Define feature matrix X and target vector y
X = df_encoded.drop('Churn_numeric', axis=1)  # All columns except target
y = df_encoded['Churn_numeric']  # Target variable

print("Feature matrix shape:", X.shape)
print("Target vector shape:", y.shape)
print(f"\nFeatures: {X.columns.tolist()}")

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y  # Preserve the churn/no-churn ratio in both splits
)

print(f"\nTraining set shape: X_train {X_train.shape}, y_train {y_train.shape}")
print(f"Testing set shape: X_test {X_test.shape}, y_test {y_test.shape}")

# Verify the distribution is maintained
print(f"\nTraining set churn rate: {y_train.mean():.3f}")
print(f"Testing set churn rate: {y_test.mean():.3f}")

## Step 4: Model Training

In [None]:
# Initialize and train the Logistic Regression model
logistic_model = LogisticRegression(random_state=42, max_iter=1000)

# Train the model on training data
print("Training the Logistic Regression model...")
logistic_model.fit(X_train, y_train)

print("Model training completed!")
print(f"Model coefficients shape: {logistic_model.coef_.shape}")
print(f"Model intercept: {logistic_model.intercept_[0]:.4f}")

## Step 5: Model Evaluation

In [None]:
# Make predictions on the test set
y_pred = logistic_model.predict(X_test)
y_pred_proba = logistic_model.predict_proba(X_test)[:, 1]  # Probability of churn

print("Predictions completed!")
print(f"Predictions shape: {y_pred.shape}")

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print("\n" + "="*60)
print("MODEL PERFORMANCE METRICS")
print("="*60)

print(f"1. ACCURACY: {accuracy:.4f} ({accuracy:.1%})")
print(f"   → {accuracy:.1%} of all predictions were correct")

print(f"\n2. PRECISION: {precision:.4f} ({precision:.1%})")
print(f"   → Of all customers predicted to churn, {precision:.1%} actually did churn")

print(f"\n3. RECALL (SENSITIVITY): {recall:.4f} ({recall:.1%})")
print(f"   → Of all customers who actually churned, {recall:.1%} were correctly identified")

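# Guard against division by zero when the model predicts no churners at all
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0
print(f"\n4. F1-SCORE: {f1:.4f}")
print("   → Harmonic mean of precision and recall")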

In [None]:
# Create and visualize confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Create a detailed confusion matrix visualization
plt.figure(figsize=(12, 5))

# Confusion Matrix Heatmap
plt.subplot(1, 2, 1)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['No Churn', 'Churn'], 
            yticklabels=['No Churn', 'Churn'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

# Add text annotations for clarity
plt.text(0.5, 0.8, f'TN: {cm[0,0]}', ha='center', va='center', fontweight='bold')
plt.text(1.5, 0.8, f'FP: {cm[0,1]}', ha='center', va='center', fontweight='bold')
plt.text(0.5, 1.8, f'FN: {cm[1,0]}', ha='center', va='center', fontweight='bold')
plt.text(1.5, 1.8, f'TP: {cm[1,1]}', ha='center', va='center', fontweight='bold')

# Metrics breakdown
plt.subplot(1, 2, 2)
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
values = [accuracy, precision, recall, 2 * (precision * recall) / (precision + recall)]
colors = ['skyblue', 'lightgreen', 'lightcoral', 'gold']

bars = plt.bar(metrics, values, color=colors, alpha=0.7)
plt.title('Model Performance Metrics')
plt.ylabel('Score')
plt.ylim(0, 1)

# Add value labels on bars
for bar, value in zip(bars, values):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
             f'{value:.3f}', ha='center', va='bottom', fontweight='bold')

plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Print detailed confusion matrix breakdown
print("\n" + "="*60)
print("CONFUSION MATRIX BREAKDOWN")
print("="*60)
print(f"True Negatives (TN):  {cm[0,0]:4d} - Correctly predicted no churn")
print(f"False Positives (FP): {cm[0,1]:4d} - Incorrectly predicted churn")
print(f"False Negatives (FN): {cm[1,0]:4d} - Missed actual churn")
print(f"True Positives (TP):  {cm[1,1]:4d} - Correctly predicted churn")
print("\nConfusion Matrix:")
print(cm)

### Interpretation of Metrics

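**1. Accuracy:** This metric tells us the overall percentage of correct predictions (both churn and no-churn). It can be misleading on imbalanced datasets: a model that always predicts the majority class scores as high as the majority-class share without identifying a single churner.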

**2. Precision:** This answers "Of all customers we predicted would churn, how many actually did?" High precision means fewer false alarms (incorrectly flagging loyal customers as potential churners).

**3. Recall (Sensitivity):** This answers "Of all customers who actually churned, how many did we correctly identify?" High recall means we're catching most of the actual churners.

**4. Business Context - Precision vs Recall:**
- **High Precision** is important because wrongly targeting loyal customers with retention offers wastes marketing budget and may annoy customers.
- **High Recall** is crucial because missing actual churners means losing valuable customers without any retention attempts.

In this telecom business context, **recall might be slightly more important** than precision because:
- The cost of losing a customer (especially long-term ones) is typically very high
- Retention offers, while costly, are usually less expensive than acquiring new customers
- It's better to offer retention incentives to some loyal customers than to miss potential churners entirely
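
One practical way to act on this trade-off is to move the decision threshold instead of relying on the default of 0.5 used by `predict`. The sketch below is a minimal illustration on synthetic data from `make_classification`, not on the Telco dataset; the thresholds 0.3, 0.5, and 0.7 are arbitrary choices for demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (stand-in for churn data).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.73, 0.27], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]  # probability of the positive class

results = []
for threshold in (0.3, 0.5, 0.7):
    # Flag a sample as positive when its probability reaches the threshold.
    y_hat = (proba >= threshold).astype(int)
    p = precision_score(y_te, y_hat, zero_division=0)
    r = recall_score(y_te, y_hat, zero_division=0)
    results.append((threshold, p, r))
    print(f"threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}")
```

Lowering the threshold can only keep or increase recall, because every customer flagged at a higher threshold is still flagged at a lower one; the price is usually lower precision. If recall is the priority, a threshold below 0.5 is worth evaluating.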

In [None]:
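# Analyze feature importance (coefficients in logistic regression)
# Note: the features were not scaled, so coefficient magnitudes reflect each
# feature's units and give only a rough ranking of importance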
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'coefficient': logistic_model.coef_[0],
    'abs_coefficient': np.abs(logistic_model.coef_[0])
}).sort_values('abs_coefficient', ascending=False)

print("TOP 15 MOST IMPORTANT FEATURES:")
print("="*60)
for i, row in feature_importance.head(15).iterrows():
    direction = "increases" if row['coefficient'] > 0 else "decreases"
    print(f"{row['feature']:<30} | Coeff: {row['coefficient']:8.4f} | {direction} churn likelihood")

# Visualize top 10 feature importance
plt.figure(figsize=(12, 8))
top_features = feature_importance.head(10)

plt.subplot(2, 1, 1)
colors = ['red' if x > 0 else 'blue' for x in top_features['coefficient']]
bars = plt.barh(range(len(top_features)), top_features['coefficient'], color=colors, alpha=0.7)
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Coefficient Value')
plt.title('Top 10 Feature Coefficients (Red = Increases Churn, Blue = Decreases Churn)')
plt.grid(axis='x', alpha=0.3)

# Add coefficient values on bars
for i, (bar, coeff) in enumerate(zip(bars, top_features['coefficient'])):
    plt.text(coeff + (0.01 if coeff > 0 else -0.01), i, f'{coeff:.3f}', 
             va='center', ha='left' if coeff > 0 else 'right', fontsize=9)

plt.subplot(2, 1, 2)
plt.barh(range(len(top_features)), top_features['abs_coefficient'], color='orange', alpha=0.7)
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Absolute Coefficient Value')
plt.title('Top 10 Feature Importance (Absolute Values)')
plt.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()
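
Coefficient magnitudes are only comparable across features when the features share a scale. A possible refinement, sketched below on synthetic data, is to standardize inside a scikit-learn pipeline before fitting. The toy columns reuse the names `tenure` and `TotalCharges`, but the values and the target are generated here under the assumption that only `tenure` drives churn; nothing in this block comes from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Two features on very different scales (synthetic values).
toy = pd.DataFrame({
    "tenure": rng.integers(1, 73, size=n),          # months
    "TotalCharges": rng.uniform(20, 8000, size=n),  # currency units
})

# Synthetic target: churn probability falls with tenure only.
logit = 1.0 - 0.08 * toy["tenure"]
y_toy = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Standardize, then fit; coefficients are now per standard deviation.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(toy, y_toy)

coefs = pd.Series(
    pipe.named_steps["logisticregression"].coef_[0], index=toy.columns
)
odds_ratios = np.exp(coefs)  # multiplicative change in odds per 1 SD increase

print(coefs.round(3))
print(odds_ratios.round(3))
```

After scaling, each coefficient describes the change in log-odds for a one standard deviation increase in that feature, so ranking by absolute value is a fairer comparison. Exponentiating gives odds ratios, which tend to be easier to explain to a non-technical audience.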

## Step 6: Conclusion

In [None]:
# Final Model Summary and Business Recommendations
print("="*80)
print("EXECUTIVE SUMMARY FOR CONNECTSPHERE MANAGEMENT")
print("="*80)

print(f"""
Dear Manager,

I have successfully developed a Logistic Regression model to predict customer churn for ConnectSphere. 
Here are the key findings and recommendations:

📊 MODEL PERFORMANCE:
• Overall Accuracy: {accuracy:.1%} - The model correctly predicts {accuracy:.1%} of all cases
• Precision: {precision:.1%} - When we predict a customer will churn, we're right {precision:.1%} of the time
• Recall: {recall:.1%} - We successfully identify {recall:.1%} of customers who actually churn
• If churners are a minority of customers, precision and recall say more about the model than accuracy alone

🔍 KEY CHURN INDICATORS:
Based on the model's analysis, the strongest predictors of customer churn include:
1. Contract type (month-to-month contracts have highest churn risk)
2. Internet service type (Fiber optic customers show different churn patterns)
3. Tenure (newer customers are more likely to churn)
4. Payment method (certain payment methods correlate with higher churn)
5. Total charges and monthly charges (price sensitivity factors)

💡 BUSINESS RECOMMENDATIONS:
1. TARGET RETENTION EFFORTS: Use this model to identify high-risk customers for proactive retention campaigns
2. CONTRACT STRATEGY: Incentivize longer-term contracts to reduce churn risk
3. NEW CUSTOMER FOCUS: Implement enhanced onboarding for customers in their first few months
4. PRICING STRATEGY: Review pricing for high-risk segments, especially fiber optic services
5. PAYMENT OPTIMIZATION: Encourage payment methods associated with lower churn rates

🎯 IMPLEMENTATION:
The model can be deployed to score all active customers monthly, allowing the marketing team to:
• Prioritize retention offers for high-risk customers (predicted probability > 0.7)
• Customize retention strategies based on the specific risk factors identified
• Track the effectiveness of retention campaigns and continuously improve the model

Expected ROI: Because acquiring a new customer is far more expensive than retaining an existing one, 
even a modest improvement in retention rates should generate significant value for ConnectSphere.

Best regards,
Your Data Science Team
""")

print("="*80)

In [None]:
# Additional Analysis: Model validation and prediction examples
print("ADDITIONAL MODEL INSIGHTS")
print("="*50)

# Classification report for detailed metrics
print("Detailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=['No Churn', 'Churn']))

# Example predictions with probabilities
print("\nSample Predictions (showing probability scores):")
print("="*60)
rng = np.random.default_rng(42)  # fixed seed so the sampled customers are reproducible
sample_indices = rng.choice(len(X_test), 5, replace=False)
for i, idx in enumerate(sample_indices):
    actual = y_test.iloc[idx]
    predicted = y_pred[idx]
    probability = y_pred_proba[idx]
    
    print(f"Customer {i+1}:")
    print(f"  Actual: {'Churn' if actual == 1 else 'No Churn'}")
    print(f"  Predicted: {'Churn' if predicted == 1 else 'No Churn'}")
    print(f"  Churn Probability: {probability:.3f}")
    print(f"  Correct: {'✓' if actual == predicted else '✗'}")
    print()

# Model confidence analysis
high_confidence_mask = (y_pred_proba > 0.8) | (y_pred_proba < 0.2)
total_high_confidence = int(high_confidence_mask.sum())
high_confidence_correct = int((high_confidence_mask & (y_pred == y_test.to_numpy())).sum())

print("High Confidence Predictions (>80% or <20% probability):")
print(f"  Total high confidence predictions: {total_high_confidence}")
print(f"  Correct high confidence predictions: {high_confidence_correct}")
if total_high_confidence > 0:  # avoid dividing by zero when no prediction is confident
    print(f"  High confidence accuracy: {high_confidence_correct/total_high_confidence:.1%}")

print("\nEvaluation complete. Validate the model further (e.g., with cross-validation) before deployment.")