## Preprocessing the Telco Customer Churn Dataset

### Objective
In this notebook, we will preprocess the Telco Customer Churn dataset to prepare it for machine learning. The steps include:
1. Encoding categorical variables into numerical formats.
2. Scaling numerical features to normalize their ranges.
3. Splitting the data into training and testing sets for model evaluation.

This step is critical for ensuring that the dataset is ready for modeling in the next phase.

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd
import os

## Step 1: Loading the Dataset


In [12]:
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
file_path = os.path.join(project_root, 'data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv')
data = pd.read_csv(file_path)
# To check that the data loaded correctly, uncomment the print below.
# print(data.head())

## Step 2: Defining Features and Target

Let's split our dataset into features and target.

The features are the columns that may influence churn, and the prediction target is the `Churn` column.

The `customerID` column is a unique identifier with no predictive value, so we drop it.


In [13]:
X = data.drop(columns=['customerID', 'Churn'])  # drop the irrelevant identifier column and the target column
Y = (data['Churn'] == 'Yes').astype(int)  # encode Churn as binary: 1 for 'Yes', 0 otherwise
# Uncomment the prints below to check the result.
#print(X.head())
#print(X.columns)
#print(Y.head())


## Step 3: Identifying Categorical and Numerical Columns

### Why Is This Step Important?
In this step, we separate the dataset into **categorical columns** and **numerical columns**. This distinction is crucial because:
- Machine learning models treat numerical and categorical data differently.
- Categorical data needs to be encoded into numerical formats before being passed to the model.
- Numerical data often needs scaling or normalization to ensure uniform ranges for all features.

By identifying categorical and numerical columns early, we can apply the appropriate preprocessing techniques to each type of data.


### Understanding "No Internet Service"
In the Telco Customer Churn dataset, several columns carry a value meaning that the customer has no internet service. In the commonly distributed CSV, `InternetService` records this as `No`, while the dependent service columns record it as `No internet service`; the text below writes it as `"No Internet Service"` for readability. This is important to understand because:
- It represents a category where the customer has **no internet service**, rather than being a missing or invalid value.
- If we treat this value incorrectly (e.g., as missing data), we could lose important information.

### How We Address It:
1. **Treat "No Internet Service" as a Valid Category**:
   - When encoding categorical features, `"No Internet Service"` is treated as a valid category and converted into its own one-hot encoded column (a binary column where `1` indicates `"No Internet Service"`).

2. **Ensure Consistency Across Features**:
   - Columns like `OnlineSecurity`, `OnlineBackup`, and similar services also carry this value, which indicates that the customer does not have internet. These are also treated as valid categories.

### Example:
Suppose the `InternetService` column contains the following values:
```
['DSL', 'Fiber optic', 'No Internet Service']
```
After one-hot encoding, these values will become:
```
DSL: [1, 0, 0]
Fiber optic: [0, 1, 0]
No Internet Service: [0, 0, 1]
```
This ensures that the model can learn from customers without internet service as a separate category.
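
The mapping above can be reproduced with scikit-learn's `OneHotEncoder`. This is a minimal sketch on a made-up toy column (the `toy` frame is an assumption for illustration, not the real dataset); it uses the three illustrative values from the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy column (hypothetical data) with the three illustrative values from the example.
toy = pd.DataFrame(
    {"InternetService": ["DSL", "Fiber optic", "No Internet Service", "DSL"]}
)

encoder = OneHotEncoder(handle_unknown="ignore")
# fit_transform returns a sparse matrix by default; convert it to a dense array for display.
encoded = encoder.fit_transform(toy[["InternetService"]]).toarray()

# The encoder sorts categories, which determines the order of the output columns.
print(list(encoder.categories_[0]))
print(encoded)
```

Each row contains a single `1` in the column of its category, so customers without internet service end up with their own indicator column.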


### Handling Additional Data Types
In some datasets, additional data types may exist, requiring specific handling. For example:

1. **Datetime Columns**:
   - If the dataset has dates (e.g., `signup_date` or `last_activity`), you may need to:
     - Extract useful features (e.g., month, day of the week, or the difference between two dates).
     - Convert these columns into numerical formats or cyclic representations (e.g., sine and cosine encoding for months).

     ```python
     X['signup_month'] = pd.to_datetime(X['signup_date']).dt.month  # Parse the dates, then extract the month
     ```

2. **Boolean Columns**:
   - Columns with `True`/`False` values (e.g., `Has_Contract`) can be directly converted into `1` and `0`.

     ```python
     X['Has_Contract'] = X['Has_Contract'].astype(int)
     ```

3. **Text Columns**:
   - For text data (e.g., `comments` or `feedback`), you may need:
     - Natural Language Processing (NLP) techniques, such as tokenization followed by a vectorization like TF-IDF or learned embeddings like Word2Vec.

4. **Mixed-Type Columns**:
   - Some columns may contain a mix of numbers and strings (e.g., "10GB" or "Unlimited"). These need manual intervention to extract meaningful numerical features.
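
Two of the cases above, cyclic encoding of a month and extracting numbers from a mixed-type column, can be sketched as follows. The `demo` frame and its `signup_date` and `data_plan` columns are hypothetical and do not exist in the Telco dataset; treating `"Unlimited"` as a separate flag is one possible design choice, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical columns for illustration only; they are not part of the Telco dataset.
demo = pd.DataFrame({
    "signup_date": ["2023-01-15", "2023-06-30", "2023-12-01"],
    "data_plan": ["10GB", "Unlimited", "2GB"],
})

# Datetime: parse the strings, extract the month, then map it onto a circle
# so that December (12) and January (1) end up close together.
demo["signup_date"] = pd.to_datetime(demo["signup_date"])
month = demo["signup_date"].dt.month
demo["month_sin"] = np.sin(2 * np.pi * month / 12)
demo["month_cos"] = np.cos(2 * np.pi * month / 12)

# Mixed-type: keep "Unlimited" as its own binary flag and pull the number
# out of values such as "10GB". Rows without a number become NaN.
demo["plan_unlimited"] = (demo["data_plan"] == "Unlimited").astype(int)
demo["plan_gb"] = pd.to_numeric(
    demo["data_plan"].str.extract(r"(\d+(?:\.\d+)?)GB", expand=False),
    errors="coerce",
)

print(demo)
```

The `NaN` left in `plan_gb` for unlimited plans still needs a decision (imputation or a sentinel value) before modeling.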


In [16]:
categorical_columns = X.select_dtypes(include=['object']).columns  # pandas stores string columns with the object dtype
numerical_columns = X.select_dtypes(include=['int64','float64']).columns

print("Categorical columns:", categorical_columns)
print("Numerical columns:", numerical_columns)

Categorical columns: Index(['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
       'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
       'PaperlessBilling', 'PaymentMethod', 'TotalCharges'],
      dtype='object')
Numerical columns: Index(['SeniorCitizen', 'tenure', 'MonthlyCharges'], dtype='object')
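
Note that the output above lists `TotalCharges` among the categorical columns, which means pandas read it with the object dtype. One-hot encoding it would create one column per distinct amount. A likely cause, which should be verified on the real data, is that some entries are not parseable as numbers (blank strings, for instance). A minimal sketch of the conversion on a made-up frame (the `toy` values are assumptions):

```python
import pandas as pd

# Toy frame (hypothetical values) where one TotalCharges entry is a blank string.
toy = pd.DataFrame({
    "tenure": [0, 12, 24],
    "TotalCharges": [" ", "350.5", "1200.0"],
})
print(toy["TotalCharges"].dtype)  # read as object because of the blank entry

# errors="coerce" turns unparseable entries into NaN instead of raising.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
print(toy["TotalCharges"].isna().sum())  # number of entries that could not be parsed

# After the conversion the column is picked up as numerical.
numeric_cols = toy.select_dtypes(include="number").columns
print(list(numeric_cols))
```

If the same pattern holds in the real data, this conversion would go before the column lists are computed, and the resulting `NaN` rows would need to be dropped or imputed before scaling.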


## Step 4: Building a Preprocessing Pipeline

### Why Do We Need a Preprocessing Pipeline?
To ensure our data is ready for machine learning, we need to convert it into numerical values that a model can consume. This involves two main tasks:

1. **Scaling Numerical Features**:
   - Numerical features like `tenure` (which ranges from 0 to 72 months) and `MonthlyCharges` (which ranges from roughly 18 to 119) are on different scales.
   - Without scaling, these differences can hurt models that rely on distances or gradient-based optimization (e.g., k-nearest neighbors or logistic regression) by giving undue weight to features with larger ranges. Tree-based models are largely insensitive to scale.
   - `StandardScaler` standardizes each numerical feature to a mean of 0 and a standard deviation of 1, so all features end up on a comparable scale.

2. **Encoding Categorical Features**:
   - Machine learning models require numerical inputs, so categorical features (e.g., `gender`, `InternetService`) must be converted into numerical formats.
   - We use one-hot encoding to represent each category as a binary column. For example, a `gender` column with values `Male` and `Female` is transformed into two binary columns, one per category, holding 0 or 1.
   - Step 3 covers this encoding in more detail.

By combining these transformations into a preprocessing pipeline, we can ensure that our data is consistently prepared for both training and testing.
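
To make the scaling step concrete, here is a small sketch of what `StandardScaler` computes. The three values are made-up `tenure`-like numbers, and the manual formula is included only for comparison.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative tenure-like values (hypothetical), shaped as one feature column.
values = np.array([[0.0], [24.0], [72.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(values)

# StandardScaler computes (x - mean) / std, using the population
# standard deviation (ddof=0), which is also NumPy's default.
manual = (values - values.mean()) / values.std()

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

After the transformation the column has mean 0 and standard deviation 1, regardless of its original range.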


In [None]:
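from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Scale the numerical columns and one-hot encode the categorical ones.
# handle_unknown='ignore' encodes categories unseen during fitting as all zeros.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_columns),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_columns),
    ]
)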
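
The objective also lists a train/test split. The sketch below shows one way a preprocessor of this shape might be fitted on the training portion only and then reused on the test portion, so that no statistics from the test set leak into the scaler or encoder. It runs on a made-up toy frame; `toy_X`, `toy_y`, `toy_preprocessor`, the 25% test size, and `random_state=42` are all assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the features and target (hypothetical values).
toy_X = pd.DataFrame({
    "tenure": [1, 5, 12, 24, 36, 48, 60, 72],
    "MonthlyCharges": [20.0, 35.5, 50.0, 65.0, 70.5, 80.0, 95.0, 110.0],
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month",
                 "One year", "Two year", "Month-to-month", "One year"],
})
toy_y = pd.Series([1, 0, 0, 1, 0, 0, 1, 0])

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["tenure", "MonthlyCharges"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
    ]
)

# Split first, then fit the preprocessor on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42
)
X_train_t = toy_preprocessor.fit_transform(X_train)
# transform (not fit_transform) reuses the training means, scales, and categories.
X_test_t = toy_preprocessor.transform(X_test)

print(X_train_t.shape, X_test_t.shape)
```

Because the encoder uses `handle_unknown="ignore"`, a category that appears only in the test rows is encoded as all zeros rather than raising an error, and both outputs keep the same number of columns.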