# About Dataset

### Context
This dataset contains customer information from a telecommunications company, including details about their demographics, services subscribed, account information, and most importantly - whether they churned (left the company). Customer churn prediction is a critical business problem in the telecom industry, as acquiring new customers is typically more expensive than retaining existing ones.

### Data Description
The dataset contains 7043 customer records with 21 features, including the target variable.

#### Feature Categories:

**Demographic Information:**
- `customerID`: Unique identifier for each customer
- `gender`: Customer gender (Male, Female)
- `SeniorCitizen`: Whether the customer is a senior citizen (1, 0)
- `Partner`: Whether the customer has a partner (Yes, No)
- `Dependents`: Whether the customer has dependents (Yes, No)

**Service Information:**
- `PhoneService`: Whether the customer has a phone service (Yes, No)
- `MultipleLines`: Whether the customer has multiple lines (Yes, No, No phone service)
- `InternetService`: Customer's internet service provider (DSL, Fiber optic, No)
- `OnlineSecurity`: Whether the customer has online security (Yes, No, No internet service)
- `OnlineBackup`: Whether the customer has online backup (Yes, No, No internet service)
- `DeviceProtection`: Whether the customer has device protection (Yes, No, No internet service)
- `TechSupport`: Whether the customer has tech support (Yes, No, No internet service)
- `StreamingTV`: Whether the customer has streaming TV (Yes, No, No internet service)
- `StreamingMovies`: Whether the customer has streaming movies (Yes, No, No internet service)

**Account Information:**
- `tenure`: Number of months the customer has stayed with the company
- `Contract`: The contract term (Month-to-month, One year, Two year)
- `PaperlessBilling`: Whether the customer has paperless billing (Yes, No)
- `PaymentMethod`: Customer's payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- `MonthlyCharges`: The amount charged to the customer monthly
- `TotalCharges`: The total amount charged to the customer

**Target Variable:**
- `Churn`: Whether the customer churned (Yes, No)

### Business Problem
The goal is to predict customer churn to enable proactive retention strategies. By identifying customers at high risk of leaving, the company can target them with special offers, improved service, or other retention tactics.

### Data Characteristics
- **Number of instances:** 7,043
- **Number of features:** 21 (including target)
- **Missing values:** Minimal (blank strings in `TotalCharges` for new customers, which only surface as missing after numeric conversion)
- **Class distribution:** Imbalanced (approximately 73% No Churn, 27% Yes Churn)
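The class shares quoted above can be checked against the raw `Churn` counts that the data-loading cell of this notebook prints (5,174 "No" and 1,869 "Yes"). A minimal arithmetic sketch:

```python
# Churn counts as printed by the data-loading cell in this notebook.
churn_counts = {"No": 5174, "Yes": 1869}

total = sum(churn_counts.values())
shares = {label: round(100 * n / total, 1) for label, n in churn_counts.items()}

print(total)   # 7043
print(shares)  # {'No': 73.5, 'Yes': 26.5}
```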

### Potential Challenges
1. **Class imbalance** between churned and non-churned customers
2. **Mixed data types** - numerical, categorical, and binary features
3. **Correlated features** among service subscriptions
4. **Data leakage** considerations - ensuring features are available at prediction time

### Source
This is a publicly available dataset commonly used for churn prediction modeling and can be found on platforms like Kaggle.

# Data Loading

In [403]:
import os
import pandas as pd
from kaggle.api.kaggle_api_extended import KaggleApi

RND_SEED = 42

def download_telco_churn_dataset(data_dir='data'):
    """
    Download the Telco Customer Churn dataset from Kaggle.
    """
    # Create the data folder if it does not exist
    os.makedirs(data_dir, exist_ok=True)

    try:
        # Initialize the Kaggle API
        api = KaggleApi()
        api.authenticate()

        # Download the dataset
        dataset_name = 'blastchar/telco-customer-churn'
        api.dataset_download_files(dataset_name, path=data_dir, unzip=True)

        print(f"✅ Dataset successfully downloaded to: {data_dir}")

        # Check the downloaded files
        files = os.listdir(data_dir)
        print(f"📁 Downloaded files: {files}")

        # Load the data as a sanity check
        csv_file = [f for f in files if f.endswith('.csv')][0]
        df = pd.read_csv(os.path.join(data_dir, csv_file))
        print(f"📊 Dataset shape: {df.shape}")
        print(f"🎯 Target variable 'Churn': {df['Churn'].value_counts().to_dict()}")

        return df

    except Exception as e:
        print(f"❌ Error while downloading: {e}")
        return pd.DataFrame()

if __name__ == "__main__":
    telco = download_telco_churn_dataset('../data/raw')

Dataset URL: https://www.kaggle.com/datasets/blastchar/telco-customer-churn
✅ Dataset successfully downloaded to: ../data/raw
📁 Downloaded files: ['WA_Fn-UseC_-Telco-Customer-Churn.csv']
📊 Dataset shape: (7043, 21)
🎯 Target variable 'Churn': {'No': 5174, 'Yes': 1869}


# EDA

## A look at the big picture

In [404]:
telco.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [405]:
telco.describe()

| | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| count | 7043.0 | 7043.0 | 7043.0 |
| mean | 0.162147 | 32.371149 | 64.761692 |
| std | 0.368612 | 24.559481 | 30.090047 |
| min | 0.0 | 0.0 | 18.25 |
| 25% | 0.0 | 9.0 | 35.5 |
| 50% | 0.0 | 29.0 | 70.35 |
| 75% | 0.0 | 55.0 | 89.85 |
| max | 1.0 | 72.0 | 118.75 |
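`describe()` summarises only numeric columns by default, which would explain why `TotalCharges` is absent from the summary above: it loads as text. The sketch below reproduces the effect on a tiny made-up frame (the values are illustrative, not taken from the dataset).

```python
import pandas as pd

# Toy frame: TotalCharges holds numbers as strings, plus one blank entry.
toy = pd.DataFrame({
    "tenure": [1, 34, 0],
    "MonthlyCharges": [29.85, 56.95, 52.55],
    "TotalCharges": ["29.85", "1889.5", " "],
})

numeric_summary = toy.describe()
is_numeric = pd.api.types.is_numeric_dtype(toy["TotalCharges"])

print(list(numeric_summary.columns))  # ['tenure', 'MonthlyCharges']
print(is_numeric)                     # False
```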


## Create Test Dataset

In [406]:
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

telco_train, telco_test = train_test_split(telco, test_size=0.2, random_state=RND_SEED, stratify=telco["Churn"])
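Passing `stratify=telco["Churn"]` keeps the churn ratio roughly the same in both splits. A minimal sketch on synthetic labels (invented here, with a similar 73/27 imbalance) shows one way to check that:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 730 "No" and 270 "Yes" labels.
toy = pd.DataFrame({"Churn": ["No"] * 730 + ["Yes"] * 270})
toy["feature"] = range(len(toy))

toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Churn"]
)

# Share of churned customers in each split.
train_share = (toy_train["Churn"] == "Yes").mean()
test_share = (toy_test["Churn"] == "Yes").mean()
print(round(train_share, 3), round(test_share, 3))
```

Without `stratify`, the two shares can drift apart by chance, which matters more for the smaller test split.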

## Cleaning Data

In [407]:
telco = telco_train.copy()

First, let's sort out the `TotalCharges` column: it should be a float, but some rows contain empty strings.

In [408]:
from sklearn.impute import SimpleImputer

# Unparseable entries (the empty strings) become NaN
telco["TotalCharges"] = pd.to_numeric(telco["TotalCharges"], errors="coerce")

# Fill the resulting gaps with the training-set median
imputer = SimpleImputer(strategy="median")
telco["TotalCharges"] = imputer.fit_transform(telco[["TotalCharges"]]).ravel()

telco["TotalCharges"].isna().sum()

np.int64(0)
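The two steps above can be seen in isolation on a toy column (values invented for illustration): `pd.to_numeric(..., errors="coerce")` turns anything unparseable into `NaN`, and `SimpleImputer(strategy="median")` then fills those gaps with the median of the remaining values.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({"TotalCharges": ["10.0", " ", "30.0", "20.0"]})

# Blank strings cannot be parsed, so they become NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# The median of the observed values (10, 20, 30) is 20.
imputer = SimpleImputer(strategy="median")
toy["TotalCharges"] = imputer.fit_transform(toy[["TotalCharges"]]).ravel()

print(n_missing)                     # 1
print(toy["TotalCharges"].tolist())  # [10.0, 20.0, 30.0, 20.0]
```

On the held-out split, the already fitted imputer would be applied with `transform` rather than `fit_transform`, so that the test data does not influence the median.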

Convert all "Yes/No" string columns to `int` "1/0".

In [409]:
bool_cols = ["Partner", "Dependents", "PhoneService", "PaperlessBilling", "Churn"]

for col in bool_cols:
    telco[col] = (telco[col] == "Yes").astype("int")

Now for the columns with three values: yes / no / no service. I think the third option should be merged into "No" and the column label-encoded as 0/1.

In [410]:
three_ans_cols = ["MultipleLines", "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies"]

for col in three_ans_cols:
    telco[col] = (telco[col] == "Yes").astype("int")
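Comparing against "Yes" silently maps any other value, including typos, to 0. A slightly more defensive variant, sketched below with a hypothetical helper `encode_yes_flag` (not part of the original notebook), checks the values first:

```python
import pandas as pd

# Values the data description lists for the yes/no and three-valued columns.
ALLOWED = {"Yes", "No", "No internet service", "No phone service"}

def encode_yes_flag(series: pd.Series) -> pd.Series:
    """Return 1 for "Yes" and 0 otherwise, after checking for unexpected values."""
    unexpected = set(series.dropna().unique()) - ALLOWED
    if unexpected:
        raise ValueError(f"Unexpected values in {series.name}: {sorted(unexpected)}")
    return (series == "Yes").astype("int")

toy = pd.Series(["Yes", "No", "No internet service"], name="OnlineSecurity")
encoded = encode_yes_flag(toy)
print(encoded.tolist())  # [1, 0, 0]
```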

In [411]:
telco.drop(columns="customerID", inplace=True)

In [412]:
from sklearn.preprocessing import LabelEncoder

label_enc = LabelEncoder()
telco["gender"] = label_enc.fit_transform(telco["gender"])

telco["gender"].value_counts()

gender
1    2833
0    2801
Name: count, dtype: int64
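`LabelEncoder` assigns integer codes in sorted order of the class labels, so with the two values listed in the data description, "Female" should map to 0 and "Male" to 1. A quick way to confirm the mapping, shown here on a toy list:

```python
from sklearn.preprocessing import LabelEncoder

toy_gender = ["Male", "Female", "Female", "Male"]

enc = LabelEncoder()
codes = enc.fit_transform(toy_gender)

# classes_ lists the labels in the order of their integer codes.
mapping = {label: code for code, label in enumerate(enc.classes_)}
print(mapping)         # {'Female': 0, 'Male': 1}
print(codes.tolist())  # [1, 0, 0, 1]
```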

### OneHotEncoding

In [413]:
from sklearn.preprocessing import OneHotEncoder

cat_columns = ["InternetService", "PaymentMethod", "Contract"]

ohe = OneHotEncoder(sparse_output=False, dtype="int")
ohe.fit(telco[cat_columns])
telco_internet_encoded = pd.DataFrame(ohe.transform(telco[cat_columns]), columns=ohe.get_feature_names_out(), index=telco.index)

In [414]:
telco = pd.concat([telco, telco_internet_encoded], axis=1)
telco.drop(columns=cat_columns, inplace=True)

In [415]:
telco.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5634 entries, 3738 to 5639
Data columns (total 27 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   gender                                   5634 non-null   int64  
 1   SeniorCitizen                            5634 non-null   int64  
 2   Partner                                  5634 non-null   int64  
 3   Dependents                               5634 non-null   int64  
 4   tenure                                   5634 non-null   int64  
 5   PhoneService                             5634 non-null   int64  
 6   MultipleLines                            5634 non-null   int64  
 7   OnlineSecurity                           5634 non-null   int64  
 8   OnlineBackup                             5634 non-null   int64  
 9   DeviceProtection                         5634 non-null   int64  
 10  TechSupport                              5634 non-

Yay, at least now all the types look sane, and we can move on to visualizing all sorts of things.

## Visualize

Let's look at the correlations with the target.

In [417]:
telco.corr()["Churn"].sort_values(ascending=False)

Churn                                      1.000000
Contract_Month-to-month                    0.406401
InternetService_Fiber optic                0.312656
PaymentMethod_Electronic check             0.309214
MonthlyCharges                             0.198040
PaperlessBilling                           0.197981
SeniorCitizen                              0.145599
StreamingTV                                0.072397
StreamingMovies                            0.063786
MultipleLines                              0.043766
PhoneService                               0.017928
gender                                    -0.002208
DeviceProtection                          -0.061624
OnlineBackup                              -0.082428
PaymentMethod_Mailed check                -0.089311
PaymentMethod_Bank transfer (automatic)   -0.125121
InternetService_DSL                       -0.128639
PaymentMethod_Credit card (automatic)     -0.137780
Partner                                   -0.145717
TechSupport 
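Sorting by the signed value puts strong negative correlations at the bottom of the list. To rank features by strength regardless of sign, one option is to drop the target's self-correlation and sort by absolute value. A sketch on an invented toy frame (column names and values are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    "Churn":          [1, 0, 1, 0, 1, 0],
    "month_to_month": [1, 0, 1, 0, 1, 1],
    "tenure":         [2, 60, 5, 48, 1, 30],
    "gender":         [0, 1, 1, 0, 1, 0],
})

# Correlation of every feature with the target, without the trivial 1.0 entry.
corr_with_target = toy.corr()["Churn"].drop("Churn")

# Reorder by absolute strength while keeping the original signs.
ranked = corr_with_target.reindex(
    corr_with_target.abs().sort_values(ascending=False).index
)
print(ranked.round(2))
```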

## Preprocessing Pipeline

In [None]:
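from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector

# Assumed completion of this unfinished cell: it mirrors the manual steps above
# and expects TotalCharges to be numeric already (see the pd.to_numeric call).
# SimpleImputer, OneHotEncoder and cat_columns come from the earlier cells.
preprocessing = ColumnTransformer([
    ("impute_total", SimpleImputer(strategy="median"), ["TotalCharges"]),
    ("onehot", OneHotEncoder(sparse_output=False, dtype="int"), cat_columns),
], remainder="passthrough")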