# Churn Prediction: Data Ingestion and Preprocessing

## Objective
In this notebook, we will focus on the initial stages of the churn prediction project:
- Loading the raw data from the **Telco Customer Churn** dataset.
- Exploring the structure and contents of the dataset through **Exploratory Data Analysis (EDA)**.
- Cleaning and preprocessing the data to make it ready for machine learning tasks.

## Workflow
1. **Load the Dataset**:
   - Dynamically locate and load the dataset stored in the `data/raw/` directory.
2. **Exploratory Data Analysis (EDA)**:
   - Understand the dataset structure, including data types, missing values, and feature distributions.
   - Identify any necessary transformations or cleaning steps.
3. **Data Cleaning and Preprocessing**:
   - Handle missing values, encode categorical variables, and scale numerical features.
   - Save the cleaned and preprocessed dataset to the `data/processed/` directory for downstream tasks.

---

This notebook is part of the larger **Churn Prediction Project**, which aims to predict customer churn using machine learning. The processed data from this notebook will serve as input for model training and evaluation in subsequent steps.


In [16]:
import os
import pandas as pd

# Step 1: Load the dataset downloaded from Kaggle (see .scripts/kaggle_data_download.py).
# The notebook runs one level below the project root, so project_root (the parent of the
# current working directory) lets us build the path to the data dynamically.
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
file_path = os.path.join(project_root, 'data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv')

data = pd.read_csv(file_path)

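If the notebook is started from a different working directory, the relative path above silently points to the wrong place and `read_csv` fails with a less obvious error. A minimal sketch of a loader that checks for the file first is shown here; the helper name `load_raw_csv` is hypothetical and not part of the project.

```python
from pathlib import Path

import pandas as pd


def load_raw_csv(project_root, relative_path):
    """Load a CSV stored under the project root, failing early if it is missing.

    `load_raw_csv` is an illustrative helper, not an existing project function.
    """
    csv_path = Path(project_root) / relative_path
    if not csv_path.is_file():
        # Report the full resolved path so a wrong working directory is easy to spot.
        raise FileNotFoundError(f"Expected dataset at {csv_path}")
    return pd.read_csv(csv_path)
```

In this notebook it could be called as `load_raw_csv(Path.cwd().parent, "data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv")`.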

In [17]:
# Step 2: Preview the first few rows to understand the dataset's structure and contents.
print("First 5 rows of data")
print(data.head())


First 5 rows of data
   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0  7590-VHVEG  Female              0     Yes         No       1           No   
1  5575-GNVDE    Male              0      No         No      34          Yes   
2  3668-QPYBK    Male              0      No         No       2          Yes   
3  7795-CFOCW    Male              0      No         No      45           No   
4  9237-HQITU  Female              0      No         No       2          Yes   

      MultipleLines InternetService OnlineSecurity  ... DeviceProtection  \
0  No phone service             DSL             No  ...               No   
1                No             DSL            Yes  ...              Yes   
2                No             DSL            Yes  ...               No   
3  No phone service             DSL            Yes  ...              Yes   
4                No     Fiber optic             No  ...               No   

  TechSupport StreamingTV StreamingMovies

**The output above shows the first few rows and columns of the data file we are working with.**

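The preview above hides some columns behind `...` because pandas truncates wide frames by default. A small sketch of two ways to see every column follows; it uses a made-up wide frame (`feature_0` to `feature_29` are placeholder names) rather than the churn data.

```python
import pandas as pd

# Hypothetical wide frame standing in for a dataset with many columns.
wide = pd.DataFrame({f"feature_{i}": range(3) for i in range(30)})

# Option 1: temporarily lift the column limit for this one preview.
with pd.option_context("display.max_columns", None, "display.width", 200):
    full_preview = repr(wide.head())
print(full_preview)

# Option 2: transpose the preview so each column becomes a row.
transposed = wide.head().T
print(transposed)
```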
In [20]:
# Step 3: Use info() to get an overview of the data, including column names, non-null counts, and data types.
# info() prints its report directly and returns None, so it is not wrapped in print().
print("\nBasic information about the dataset:")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [21]:
# Step 4: Compute summary statistics for the numerical columns to understand their distributions.
print("\nSummary statistics for numerical columns:")
print(data.describe())


Summary statistics for numerical columns:
       SeniorCitizen       tenure  MonthlyCharges
count    7043.000000  7043.000000     7043.000000
mean        0.162147    32.371149       64.761692
std         0.368612    24.559481       30.090047
min         0.000000     0.000000       18.250000
25%         0.000000     9.000000       35.500000
50%         0.000000    29.000000       70.350000
75%         0.000000    55.000000       89.850000
max         1.000000    72.000000      118.750000

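By default `describe()` only summarizes numeric columns, which is why just three columns appear above. A minimal sketch of getting the equivalent summary for the non-numeric columns is shown below, using a small made-up frame instead of the full dataset.

```python
import pandas as pd

# Toy frame with two text columns and one numeric column (values are illustrative).
sample = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
    "Churn": ["Yes", "No", "Yes", "No"],
    "tenure": [1, 34, 2, 45],
})

# Excluding numeric columns yields count, unique, top, and freq for the rest.
categorical_summary = sample.describe(exclude="number")
print(categorical_summary)
```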

In [24]:
# Step 5: Check for missing values by using isnull() and sum() to count them in each column.
print("\nMissing Values in each column:")
print(data.isnull().sum())


Missing Values in each column:
customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

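`isnull()` only counts values that pandas already treats as missing (such as `NaN`); an empty or whitespace-only string is an ordinary value and is not counted. The sketch below, on a made-up two-column frame with hypothetical IDs, shows one way to count such blank entries in text columns.

```python
import pandas as pd

# Illustrative data: one TotalCharges entry is a single space, not NaN.
sample = pd.DataFrame({
    "customerID": ["0001-AAAAA", "0002-BBBBB", "0003-CCCCC"],
    "TotalCharges": ["29.85", " ", "108.15"],
})

# isnull() sees nothing missing here.
null_counts = sample.isnull().sum()

# Count entries that are empty once surrounding whitespace is stripped.
text_columns = sample.select_dtypes(exclude="number")
blank_counts = text_columns.apply(
    lambda col: col.astype(str).str.strip().eq("").sum()
)
print(blank_counts)
```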

In [25]:
# Step 6: Iterate through columns with object data types and print unique values
print("\nUnique values in categorical columns:")
for col in data.select_dtypes(include='object').columns:
    print(f"{col}: {data[col].unique()}")


Unique values in categorical columns:
customerID: ['7590-VHVEG' '5575-GNVDE' '3668-QPYBK' ... '4801-JZAZL' '8361-LTMKD'
 '3186-AJIEK']
gender: ['Female' 'Male']
Partner: ['Yes' 'No']
Dependents: ['No' 'Yes']
PhoneService: ['No' 'Yes']
MultipleLines: ['No phone service' 'No' 'Yes']
InternetService: ['DSL' 'Fiber optic' 'No']
OnlineSecurity: ['No' 'Yes' 'No internet service']
OnlineBackup: ['Yes' 'No' 'No internet service']
DeviceProtection: ['No' 'Yes' 'No internet service']
TechSupport: ['No' 'Yes' 'No internet service']
StreamingTV: ['No' 'Yes' 'No internet service']
StreamingMovies: ['No' 'Yes' 'No internet service']
Contract: ['Month-to-month' 'One year' 'Two year']
PaperlessBilling: ['Yes' 'No']
PaymentMethod: ['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']
TotalCharges: ['29.85' '1889.5' '108.15' ... '346.45' '306.6' '6844.5']
Churn: ['No' 'Yes']

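Note that `TotalCharges` appears in this list, which means it was read as text even though its values look numeric. A minimal sketch of converting such a column is shown below on a short made-up series; with `errors="coerce"`, entries that cannot be parsed become `NaN` and can then be found with `isnull()`.

```python
import pandas as pd

# Illustrative values; the whitespace-only entry cannot be parsed as a number.
raw_total = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# Unparseable entries are turned into NaN instead of raising an error.
numeric_total = pd.to_numeric(raw_total, errors="coerce")
print(numeric_total)
print("Entries that failed to convert:", numeric_total.isna().sum())
```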

## Handling Missing Values

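### Context: No Missing Values Reported in This Dataset
`isnull()` reports no missing values in this dataset, so no missing-data handling is required at this point. One caveat: `TotalCharges` was loaded as `object` rather than as a numeric type, which suggests that some entries are not valid numbers (for example, blank strings). Such entries are not counted by `isnull()` and are worth checking before modeling. It is still important to consider how missing values would be addressed if they were present. Here’s what we would do if we encountered missing values in other datasets: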

### Approaches for Handling Missing Values
1. **Drop Rows or Columns**:
   - If the percentage of missing values in a column or row is very high (e.g., >50%), it might be reasonable to drop it.
   - Example:
     ```python
     data.dropna(axis=0, inplace=True)  # Drop rows that contain any missing value
     data.dropna(axis=1, inplace=True)  # Or: drop columns that contain any missing value
     ```

2. **Identify Rows with Missing Values**:
   - Inspect the affected rows before deciding whether to drop or impute them.
   - Example:
     ```python
     missing_rows = data[data.isnull().any(axis=1)]
     print("Rows with missing values:")
     print(missing_rows)
     ```

3. **Imputation (Filling Missing Values)**:
   - Replace missing values with meaningful statistics (e.g., mean, median, or mode) or values inferred from other data.
   - Examples:
     - **Numerical Data**: Fill with mean or median.
       ```python
       # Assign the result back: fillna(inplace=True) on a selected column may not update `data` in recent pandas versions.
       data['NumericalColumn'] = data['NumericalColumn'].fillna(data['NumericalColumn'].mean())
       ```
     - **Categorical Data**: Fill with the mode (most frequent value).
       ```python
       data['CategoricalColumn'] = data['CategoricalColumn'].fillna(data['CategoricalColumn'].mode()[0])
       ```

4. **Predict Missing Values**:
   - Use machine learning models to predict the missing values based on other features in the dataset. This approach is more advanced and is most useful when the missingness depends on other observed features rather than occurring at random.

5. **Use Domain Knowledge**:
   - In some cases, domain expertise can help fill missing values with logical substitutes.
   - Example: If a column "Has_Internet_Service" is missing and the "InternetService" column is "No", fill the missing values with "No".

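Two of the approaches above, dropping sparse columns and filling from domain knowledge, can be sketched as follows. The frame is made up for illustration: `mostly_empty` is a hypothetical column, `Has_Internet_Service` is the hypothetical column from the example above, and the 50% cut-off is the rule of thumb mentioned earlier rather than a fixed rule.

```python
import numpy as np
import pandas as pd

# Illustrative data only; not taken from the churn dataset.
sample = pd.DataFrame({
    "InternetService": ["DSL", "No", "Fiber optic", "No"],
    "Has_Internet_Service": ["Yes", np.nan, "Yes", np.nan],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})

# Drop columns whose share of missing values exceeds 50%.
missing_share = sample.isnull().mean()
reduced = sample.loc[:, missing_share <= 0.5].copy()

# Domain knowledge: a customer whose InternetService is "No" has no internet service.
no_internet = reduced["InternetService"].eq("No") & reduced["Has_Internet_Service"].isnull()
reduced.loc[no_internet, "Has_Internet_Service"] = "No"
print(reduced)
```

Computing the missing share per column first keeps the threshold explicit, whereas a plain `dropna(axis=1)` would drop any column with even one missing value.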
### Key Considerations
- **Understand the Context**: Analyze the column's role in the dataset before deciding how to handle missing values.
- **Avoid Bias**: Imputing with a mean or mode can introduce bias, for example by shrinking the spread of the column. More advanced techniques such as KNN imputation can reduce this.
- **Impact on Modeling**: Document how missing value handling might affect model performance and interpretations.

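The bias concern above can be made concrete with a small check. This sketch uses a made-up series and shows that mean imputation leaves the mean unchanged but reduces the standard deviation, because every filled value sits exactly at the center of the distribution.

```python
import numpy as np
import pandas as pd

# Illustrative values with two missing entries.
charges = pd.Series([10.0, 20.0, 30.0, 40.0, np.nan, np.nan], name="charges")

# Fill the gaps with the mean of the observed values.
mean_filled = charges.fillna(charges.mean())

# std() skips NaN, so this compares observed spread with post-imputation spread.
original_std = charges.std()
imputed_std = mean_filled.std()
print(f"std before imputation: {original_std:.2f}")
print(f"std after mean imputation: {imputed_std:.2f}")
```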
Since `isnull()` reports no missing values in this dataset, we can skip this step and proceed with data preprocessing.
