In [1]:
import pandas as pd
from pathlib import Path

In [2]:
# Path to the cleaned (processed) dataset
DATA_PATH = Path("../data/processed/Clean_Customer-Churn-Dataset.csv")

In [3]:
try:
    df = pd.read_csv(DATA_PATH)
    display(df.head())
except FileNotFoundError:
    print(f"Error: The file was not found at {DATA_PATH}")
except Exception as e:
    print(f"An error occurred while reading the file: {e}")

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,numAdminTickets,numTechTickets,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,0,0,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,No,No,One year,No,Mailed check,56.95,1889.5,0,0,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,0,0,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,0,3,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,0,0,Yes


In [4]:
# Create a new feature: average monthly charge over tenure.
# Vectorized: divide only where tenure is non-zero; rows with tenure 0 get 0
# to avoid division by zero.
nonzero_tenure = df['tenure'] != 0
df['AverageMonthlyCharge'] = (
    (df['TotalCharges'] / df['tenure'].where(nonzero_tenure)).where(nonzero_tenure, 0.0)
)

# Display the head of the DataFrame with the new feature
display(df[['tenure', 'TotalCharges', 'AverageMonthlyCharge']].head())

Unnamed: 0,tenure,TotalCharges,AverageMonthlyCharge
0,1,29.85,29.85
1,34,1889.5,55.573529
2,2,108.15,54.075
3,45,1840.75,40.905556
4,2,151.65,75.825


### Purpose of the `AverageMonthlyCharge` Feature

The `AverageMonthlyCharge` feature is `TotalCharges` divided by `tenure` for each customer. Customers with a `tenure` of 0 are assigned 0 to avoid division by zero.

The feature captures the average cost a customer has incurred per month over their tenure. `MonthlyCharges` is the current monthly cost and `TotalCharges` is the cumulative cost, so `AverageMonthlyCharge` adds a view of the customer's spending across the entire duration of their service.

**Why it was created:**

*   **Potential Indicator of Value/Usage:** A higher `AverageMonthlyCharge` might indicate a customer who has consistently subscribed to more services or higher-tier plans, while a lower value could suggest more basic service usage or periods spent on cheaper plans.
*   **Revealing Patterns Not Obvious from Individual Features:** Comparing this feature with `MonthlyCharges` can reveal churn-related patterns that are not obvious from `tenure`, `MonthlyCharges`, or `TotalCharges` in isolation. For example, consider two customers with the same high `MonthlyCharges`. A recent subscriber to expensive services has low `TotalCharges` and an `AverageMonthlyCharge` close to the current charge. A long-term customer who upgraded over time has a much higher `TotalCharges` but an `AverageMonthlyCharge` below the current charge. These two customers could have different churn risk profiles.
*   **Improving Predictive Power:** By providing a different angle on the customer's spending habits over time, this feature can potentially improve the predictive power of churn models. It might help the model better differentiate between customer segments with different value propositions or service engagement levels.

In essence, `AverageMonthlyCharge` serves as a potentially valuable feature for understanding customer behavior and improving the accuracy of churn prediction and customer segmentation.
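The comparison between the current and the historical average charge can be sketched on a few rows. This is a minimal illustration, not part of the original pipeline: the first five rows are copied from the preview above, the sixth row (with `tenure` 0) is hypothetical, and the `ChargeDrift` column name is an assumption introduced here.

```python
import pandas as pd

# First five rows come from the notebook preview; the last row is a
# hypothetical customer with tenure 0, added to exercise the zero case.
sample = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2, 0],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70, 20.00],
    "TotalCharges": [29.85, 1889.50, 108.15, 1840.75, 151.65, 0.00],
})

# Average charge per month of tenure; rows with tenure 0 are set to 0.
nonzero_tenure = sample["tenure"] != 0
sample["AverageMonthlyCharge"] = (
    (sample["TotalCharges"] / sample["tenure"].where(nonzero_tenure))
    .where(nonzero_tenure, 0.0)
)

# Hypothetical helper column: current charge minus historical average.
# Left as NaN where tenure is 0, since no history exists to compare against.
sample["ChargeDrift"] = (
    sample["MonthlyCharges"] - sample["AverageMonthlyCharge"]
).where(nonzero_tenure)

print(sample)
```

A negative `ChargeDrift` means the customer currently pays less than their historical average, and a positive value means their charges have risen over time. Whether this difference helps a churn model is something to check empirically.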

### Save the final dataset 

In [5]:
# Define the path for the final dataset (cleaned data plus the engineered feature)

final_data_path = Path("../data/processed/final_Customer-Churn-Dataset.csv")
df.to_csv(final_data_path, index=False)
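A quick round-trip check can confirm that the saved file reloads with the expected columns. The sketch below is an assumption-laden illustration: it writes a tiny stand-in frame to a temporary directory (the `final_example.csv` name is hypothetical) rather than touching the real `final_data_path`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the real DataFrame; values are taken from the preview above.
frame = pd.DataFrame({
    "tenure": [1, 34],
    "TotalCharges": [29.85, 1889.50],
    "AverageMonthlyCharge": [29.85, 1889.50 / 34],
})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output location inside a throwaway directory.
    out_path = Path(tmp) / "processed" / "final_example.csv"
    # Create the target folder first; to_csv does not create missing directories.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # index=False keeps the row index out of the file.
    frame.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

same_columns = list(reloaded.columns) == list(frame.columns)
same_shape = reloaded.shape == frame.shape
print(same_columns, same_shape)
```

Writing with `index=False` matters here: without it, the row index is saved as an unnamed first column, which pandas labels `Unnamed: 0` when the file is read back.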