As a data scientist at a telecommunications company, you have been tasked with developing a classification model to predict customer churn (cancellation of subscriptions) based on their historical behavior and demographic information. The company wants to understand which customers are likely to churn in order to develop targeted customer retention programs.

**Dataset:**
You have been provided with a dataset named "Telco-Customer-Churn.csv". Each row in the dataset represents a customer, and each column contains attributes related to the customer's services, account information, and demographic details. The dataset includes the following information:

1.   Customers who left within the last month (Churn)
2.   Services that each customer has signed up for (phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies)
3.   Customer account information (how long they've been a customer, contract type, payment method, paperless billing, monthly charges, and total charges)
4.   Demographic information about customers (gender, age range, and whether they have partners and dependents)

**Objective:**
Your task is to develop a classification model that can accurately predict whether a customer is likely to churn based on the available information. The model should be trained on a portion of the dataset and evaluated on the remaining unseen data.

**Approach:**
1.   Load and preprocess the dataset: Load the dataset and handle any missing values. Encode categorical variables using appropriate encoding techniques.
2.   Split the dataset: Split the dataset into training and testing sets to train the model on a subset of data and evaluate its performance on unseen data.
3.   Feature scaling: Perform feature scaling on the numerical features to ensure they are on a similar scale, using techniques like StandardScaler.
4.   Model selection and training: Train multiple classification models such as K-Nearest Neighbors, Random Forest, Support Vector Machines, and Logistic Regression.
5.   Model evaluation: Evaluate the trained models on the testing set using appropriate evaluation metrics such as accuracy.
6.   Select the best model: Compare the performance of different models and select the best-performing model based on the chosen evaluation metric.
7.   Predict churn: Use the selected model to predict churn for new, unseen data.

**Deliverables:**


1.   Preprocessed dataset without missing values.
2.   Trained classification models with their respective evaluation metrics.
3.   The best-performing model for churn prediction.
4.   Predictions of churn for new, unseen data.

**Note:** The final model can be used by the telecommunications company to identify customers who are likely to churn, enabling them to take proactive measures to retain those customers.

Remember to adapt the problem statement and approach as needed, based on any specific requirements or modifications provided by the interviewer or the organization you are applying to.

In [2]:
import numpy as np
import pandas as pd


In [5]:
# Load and preprocess the dataset
dataset = pd.read_csv('Telco-Customer-Churn.csv')
# Check for missing values
dataset.isna().any()
# if there is any missing values drop them
dataset.dropna()


Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,...,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5,No
7039,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9,No
7040,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No
7041,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.6,Yes


In [8]:
# Encode categorical variables using appropriate encoding techniques.
from sklearn.preprocessing import LabelEncoder
dataset.columns
categorical_columns = ['gender','Partner', 'Dependents',
       'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'Churn']
laber_encoder = LabelEncoder()
for column in categorical_columns:
  dataset[column] = laber_encoder.fit_transform(dataset[column])
dataset

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,0,0,1,0,1,0,1,0,0,...,0,0,0,0,0,1,2,29.85,29.85,0
1,5575-GNVDE,1,0,0,0,34,1,0,0,2,...,2,0,0,0,1,0,3,56.95,1889.5,0
2,3668-QPYBK,1,0,0,0,2,1,0,0,2,...,0,0,0,0,0,1,3,53.85,108.15,1
3,7795-CFOCW,1,0,0,0,45,0,1,0,2,...,2,2,0,0,1,0,0,42.30,1840.75,0
4,9237-HQITU,0,0,0,0,2,1,0,1,0,...,0,0,0,0,0,1,2,70.70,151.65,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,1,0,1,1,24,1,2,0,2,...,2,2,2,2,1,1,3,84.80,1990.5,0
7039,2234-XADUH,0,0,1,1,72,1,2,1,0,...,2,0,2,2,1,1,1,103.20,7362.9,0
7040,4801-JZAZL,0,0,1,1,11,0,1,0,2,...,0,0,0,0,0,1,2,29.60,346.45,0
7041,8361-LTMKD,1,1,1,0,4,1,2,1,0,...,0,0,0,0,0,1,3,74.40,306.6,1


In [14]:
# Split the dataset: Split the dataset into training and testing sets to train the model on a
# subset of data and evaluate its performance on unseen data.

from sklearn.model_selection import train_test_split
dataset.info()
X = dataset.iloc[:,1:18]
y = dataset.iloc[:,20:]
X_train, X_test,y_train,y_test = train_test_split(X,y,train_size=0.8, random_state = 42)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   int64  
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   int64  
 4   Dependents        7043 non-null   int64  
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   int64  
 7   MultipleLines     7043 non-null   int64  
 8   InternetService   7043 non-null   int64  
 9   OnlineSecurity    7043 non-null   int64  
 10  OnlineBackup      7043 non-null   int64  
 11  DeviceProtection  7043 non-null   int64  
 12  TechSupport       7043 non-null   int64  
 13  StreamingTV       7043 non-null   int64  
 14  StreamingMovies   7043 non-null   int64  
 15  Contract          7043 non-null   int64  
 16  PaperlessBilling  7043 non-null   int64  


(5634, 17)

In [18]:
# Feature scaling:  using techniques like StandardScaler.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_train
X_test = scaler.fit_transform(X_test)
X_test

array([[-0.94946537, -0.44854601,  1.04724696, ..., -0.81439391,
         0.81577242,  0.38277347],
       [ 1.05322431, -0.44854601, -0.95488461, ..., -0.81439391,
         0.81577242, -1.48664706],
       [-0.94946537, -0.44854601,  1.04724696, ...,  1.54425156,
        -1.22583208,  1.31748374],
       ...,
       [ 1.05322431, -0.44854601,  1.04724696, ...,  0.36492882,
        -1.22583208, -1.48664706],
       [-0.94946537, -0.44854601,  1.04724696, ...,  1.54425156,
         0.81577242, -0.55193679],
       [ 1.05322431, -0.44854601, -0.95488461, ..., -0.81439391,
         0.81577242,  1.31748374]])

In [22]:
# Model selection and training: K-Nearest Neighbors, Random Forest, Support Vector Machines, and Logistic Regression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# initialize the models
models = {'K-Nearest Neighbors':KNeighborsClassifier(n_neighbors=3),
          'Random Forest': RandomForestClassifier(n_estimators=1000),
          'Support Vector Machine': SVC(kernel='linear'),
          'Logistic Regression':LogisticRegression()}
# fit, predict and get accuracy score
max_score = 0
for model_name,model in models.items():
  model.fit(X_train,y_train)
  y_pred = model.predict(X_test)
  accuracy = accuracy_score(y_test,y_pred)
  print(f'{model_name} has accuracy score of: {accuracy: .4f}')
  if max_score < accuracy:
    max_score = accuracy
    best_model = model_name
print(f'{best_model} is best model with an accuracy score of:{max_score: .4f}')


  return self._fit(X, y)


K-Nearest Neighbors has accuracy score of:  0.7459


  model.fit(X_train,y_train)


Random Forest has accuracy score of:  0.7899


  y = column_or_1d(y, warn=True)


Support Vector Machine has accuracy score of:  0.7935
Logistic Regression has accuracy score of:  0.7999
Logistic Regression is best model with an accuracy score of: 0.7999


  y = column_or_1d(y, warn=True)
