# Telco Customer Churn Analysis

## Introduction to the Dataset

This project analyzes telecommunication customer data to build a model that predicts customer churn. The dataset is hosted on Kaggle and originates from the IBM Sample Data Sets.

## Data Preprocessing

Each row in the dataset represents a customer, and each column represents an attribute of that customer. The `customerID` column, which is not needed for the analysis, is removed. `TotalCharges` is converted to a numeric type, and the 11 rows where it is missing are dropped. String data is then transformed into numerical form using label encoding.

## Model Training

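Before modeling, the dataset is split into training (70%) and testing (30%) data, stratified on the churn label. A StandardScaler is fit on the training data and applied to both sets. A logistic regression model is then trained, once on the raw features and once on the standardized features, to predict customer churn.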

## Model Evaluation

The accuracy and classification report of the trained model are printed to evaluate its performance. These metrics show how well churn is predicted for each class. Identifying which variables matter most requires a further step, such as inspecting the model's coefficients, and the result can inform customer retention strategies.


In [4]:
import pandas as pd
df = pd.read_csv("Telco_Customer_Churn_Log_Reg.csv")

## Check types

In [5]:
df.dtypes

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

## See Labels

In [6]:
labels = df.columns

print(labels)

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')


In [7]:
# Remove customerID, since it is irrelevant to the prediction
df = df.drop(['customerID'], axis=1)
print(df.head(3))

   gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0  Female              0     Yes         No       1           No   
1    Male              0      No         No      34          Yes   
2    Male              0      No         No       2          Yes   

      MultipleLines InternetService OnlineSecurity OnlineBackup  \
0  No phone service             DSL             No          Yes   
1                No             DSL            Yes           No   
2                No             DSL            Yes          Yes   

  DeviceProtection TechSupport StreamingTV StreamingMovies        Contract  \
0               No          No          No              No  Month-to-month   
1              Yes          No          No              No        One year   
2               No          No          No              No  Month-to-month   

  PaperlessBilling     PaymentMethod  MonthlyCharges TotalCharges Churn  
0              Yes  Electronic check           29.85        29.85    No

## NaN check

In [8]:
columns_to_convert = ['TotalCharges', 'MonthlyCharges']  # List of column names to convert

for column in columns_to_convert:
    df[column] = pd.to_numeric(df[column], errors='coerce')

# Check for NaN values in specific columns
nan_labels = df[columns_to_convert].columns[df[columns_to_convert].isna().any()].tolist()

print(nan_labels)
df.isnull()[nan_labels].sum()


['TotalCharges']


TotalCharges    11
dtype: int64
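
To see why `errors='coerce'` produces NaN values, note that any entry that cannot be parsed as a number becomes NaN. The sketch below uses a made-up Series. It assumes the problematic `TotalCharges` entries are blank strings, which this notebook does not verify.

```python
import pandas as pd

# Hypothetical sample values; the blank string stands in for an unparseable entry
raw = pd.Series(["29.85", " ", "108.15"])

# errors='coerce' turns anything that is not a valid number into NaN
converted = pd.to_numeric(raw, errors="coerce")

# Number of entries that failed to parse
print(converted.isna().sum())
```

Printing the original strings for the NaN rows before converting would confirm what those entries actually contain.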

In [9]:
null_total_charges = df['TotalCharges'].isnull()
null_rows = df.loc[null_total_charges]

# Print the indices of the null rows
print(null_rows.index)


Index([488, 753, 936, 1082, 1340, 3331, 3826, 4380, 5218, 6670, 6754], dtype='int64')


In [10]:
# Count the rows before dropping
rows_before = len(df)

# Drop the rows where TotalCharges is NaN
df = df.dropna(subset=['TotalCharges'])

# Calculate and print the number of rows dropped
rows_dropped = rows_before - len(df)
print(f"Dropped {rows_dropped} rows.")


Dropped 11 rows.


No NaN values remain in `TotalCharges`.

## Data processing

In [11]:
# Print the unique values of every column

for column in df.columns:
    print(f"Unique items in column '{column}':")
    print(df[column].unique())
    print()

Unique items in column 'gender':
['Female' 'Male']

Unique items in column 'SeniorCitizen':
[0 1]

Unique items in column 'Partner':
['Yes' 'No']

Unique items in column 'Dependents':
['No' 'Yes']

Unique items in column 'tenure':
[ 1 34  2 45  8 22 10 28 62 13 16 58 49 25 69 52 71 21 12 30 47 72 17 27
  5 46 11 70 63 43 15 60 18 66  9  3 31 50 64 56  7 42 35 48 29 65 38 68
 32 55 37 36 41  6  4 33 67 23 57 61 14 20 53 40 59 24 44 19 54 51 26 39]

Unique items in column 'PhoneService':
['No' 'Yes']

Unique items in column 'MultipleLines':
['No phone service' 'No' 'Yes']

Unique items in column 'InternetService':
['DSL' 'Fiber optic' 'No']

Unique items in column 'OnlineSecurity':
['No' 'Yes' 'No internet service']

Unique items in column 'OnlineBackup':
['Yes' 'No' 'No internet service']

Unique items in column 'DeviceProtection':
['No' 'Yes' 'No internet service']

Unique items in column 'TechSupport':
['No' 'Yes' 'No internet service']

Unique items in column 'StreamingTV':
['No' 'Ye

In [12]:
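# Convert string categories to integers
from sklearn.preprocessing import LabelEncoder

# Initialize a LabelEncoder
le = LabelEncoder()

# Columns to encode: every remaining column (customerID was already dropped).
# Note: this list also contains the numeric columns SeniorCitizen, tenure,
# MonthlyCharges and TotalCharges. LabelEncoder replaces each numeric value
# with its index among the sorted unique values, so the magnitudes are lost.
cols_to_encode = ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
        'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
        'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
        'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
        'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']

# Refit the encoder on each listed column and replace the column with its codes
for col in cols_to_encode:
    df[col] = le.fit_transform(df[col])
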
print(df.head(3))

   gender  SeniorCitizen  Partner  Dependents  tenure  PhoneService  \
0       0              0        1           0       0             0   
1       1              0        0           0      33             1   
2       1              0        0           0       1             1   

   MultipleLines  InternetService  OnlineSecurity  OnlineBackup  \
0              1                0               0             2   
1              0                0               2             0   
2              0                0               2             2   

   DeviceProtection  TechSupport  StreamingTV  StreamingMovies  Contract  \
0                 0            0            0                0         0   
1                 2            0            0                0         1   
2                 0            0            0                0         0   

   PaperlessBilling  PaymentMethod  MonthlyCharges  TotalCharges  Churn  
0                 1              2             142            74   
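
The output above shows that `MonthlyCharges` and `TotalCharges` were turned into integer codes (29.85 became 142). An alternative, sketched here on a made-up three-row frame, is to encode only the non-numeric columns and leave the numeric ones untouched. The frame `toy` and the dictionary `encoders` are hypothetical names introduced for illustration. This is not what the notebook does, and the results reported later come from the original encoding.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows that mimic a few of the dataset's columns
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "Churn": ["No", "No", "Yes"],
})

# Select only the non-numeric columns for encoding
categorical_cols = toy.select_dtypes(exclude="number").columns

# Keep one fitted encoder per column so the codes can be mapped back later
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

print(toy)
```

Keeping one encoder per column allows `encoders["Contract"].inverse_transform(...)` to recover the original labels.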

## Split train/test

In [13]:
from sklearn.model_selection import train_test_split

x = df.drop(['Churn'],axis=1)
y = df['Churn']

# stratify=y keeps the churn ratio the same in the training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1, stratify=y)

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(4922, 19)
(2110, 19)
(4922,)
(2110,)
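
The effect of `stratify` can be checked on a small example. The sketch below uses made-up labels with a 70/30 imbalance and confirms that both splits keep the same share of positives. All variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a 70/30 class imbalance (made-up data)
y_toy = np.array([0] * 70 + [1] * 30)
x_toy = np.arange(100).reshape(-1, 1)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

# Share of positives in each split; both should match the overall rate
print(y_tr.mean(), y_te.mean())
```

Running `y_train.mean()` and `y_test.mean()` on the real split gives the same check for the churn label.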


## Standardization

In [14]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(x_train)

x_train_std = sc.transform(x_train)
x_test_std = sc.transform(x_test)

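# Inspect the standardized data
# Note: head(2) shows the first two training rows, while [1:3] shows the second and third
print(x_train.head(2))
x_train_std[1:3,]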

      gender  SeniorCitizen  Partner  Dependents  tenure  PhoneService  \
1864       0              1        1           0      25             1   
5830       1              0        0           1      29             1   

      MultipleLines  InternetService  OnlineSecurity  OnlineBackup  \
1864              0                0               0             0   
5830              2                2               1             1   

      DeviceProtection  TechSupport  StreamingTV  StreamingMovies  Contract  \
1864                 2            0            0                2         1   
5830                 1            1            1                1         1   

      PaperlessBilling  PaymentMethod  MonthlyCharges  TotalCharges  
1864                 0              3             556          3314  
5830                 0              0              99          2011  


array([[ 0.99150298, -0.43824216, -0.9636822 ,  1.52501661, -0.09405287,
         0.3252982 ,  1.11866127,  1.51777186,  0.24910529,  0.1052227 ,
         0.1070938 ,  0.23657362,  0.01173025,  0.00804988,  0.36680198,
        -1.19708825, -1.48109276, -1.21650014, -0.55632981],
       [-1.00856984, -0.43824216,  1.03768649, -0.65573056,  1.53554574,
         0.3252982 ,  1.11866127,  0.16442528, -0.91749435,  1.24347793,
         1.24557263,  1.3998257 ,  1.14381392,  1.14009235,  0.36680198,
         0.8353603 ,  0.39799954,  1.62025702,  1.69522124]])
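
For reference, `StandardScaler` subtracts the training mean from each column and divides by the training standard deviation. The test set is transformed with those same training statistics. A minimal sketch on made-up numbers, with hypothetical variable names:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Fit on the training data only, then transform the test data
scaler = StandardScaler().fit(train)
scaled_test = scaler.transform(test)

# The same result computed by hand from the training mean and std
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(scaled_test, manual)
```

Fitting the scaler on the training data alone keeps information about the test set out of the preprocessing step.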

## Prediction

In [15]:
# Without standardization
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)  # higher max_iter than the default (100); this resolved an earlier error
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

from sklearn.metrics import accuracy_score, classification_report

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)

Accuracy: 0.8004739336492891
Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.90      0.87      1549
           1       0.66      0.53      0.58       561

    accuracy                           0.80      2110
   macro avg       0.75      0.71      0.73      2110
weighted avg       0.79      0.80      0.79      2110
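
To put the accuracy in context, compare it with a baseline that always predicts the majority class. The sketch below uses only the support counts printed in the report above.

```python
# Support counts taken from the classification report
support_no_churn = 1549
support_churn = 561

# Accuracy of a model that always predicts "no churn"
baseline_accuracy = support_no_churn / (support_no_churn + support_churn)

print(f"Majority-class baseline: {baseline_accuracy:.4f}")
```

The baseline is about 0.73, so the model's accuracy of about 0.80 is a gain of roughly seven percentage points. The recall of 0.53 for class 1 shows that nearly half of the churning customers are still missed.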



In [16]:
# With standardization
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

model = LogisticRegression(max_iter=1000)  # set max_iter
model.fit(x_train_std, y_train)  # use the standardized x_train
y_pred = model.predict(x_test_std)  # use the standardized x_test

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)

Accuracy: 0.8033175355450237
Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.90      0.87      1549
           1       0.66      0.53      0.59       561

    accuracy                           0.80      2110
   macro avg       0.75      0.72      0.73      2110
weighted avg       0.79      0.80      0.80      2110
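
The introduction mentions identifying the variables that matter for predicting churn. One common way to approach this with logistic regression is to rank the coefficients of a model fit on standardized features. The sketch below runs on synthetic data from `make_classification`. The feature names and all variables are hypothetical, and the numbers say nothing about the Telco dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; feature names are hypothetical
x_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=1
)
feature_names = [f"feature_{i}" for i in range(x_demo.shape[1])]

# Standardize so that coefficient sizes are comparable across features
x_demo_std = StandardScaler().fit_transform(x_demo)
clf = LogisticRegression(max_iter=1000).fit(x_demo_std, y_demo)

# Rank features by the absolute size of their coefficient
order = np.argsort(-np.abs(clf.coef_[0]))
ranking = [(feature_names[i], float(clf.coef_[0][i])) for i in order]

for name, coef in ranking:
    print(f"{name}: {coef:+.3f}")
```

In this notebook the same idea would use `model.coef_[0]` together with `x.columns`. Coefficients of label-encoded columns with more than two categories should be read with care, because the integer codes impose an arbitrary order.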



## Activity

#### Build a Random Forest Model for classification

In [None]:
## Import Random Forest

In [None]:
## Fit Random Forest

In [None]:
## Predict Random Forest

In [None]:
## Evaluate Random Forest

In [None]:
## Visualize Random Forest

In [None]:
## Look up the parameters of Random Forest, and try a few different values

In [None]:
## Compare logistic regression to Random Forest

#### Observations and Conclusions

In [None]:
## What are your observations on Random Forest? Which model performs better, logistic regression or Random Forest?
## How did changing the parameters affect the performance of Random Forest?
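
One possible starting point for the activity is sketched below. It runs on synthetic data so that it is self-contained. The variable names and the parameter values (`n_estimators=200`, `max_depth=8`) are illustrative assumptions, not recommended settings, and the accuracies it prints do not describe the Telco dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 19 features, like the encoded Telco frame
x_syn, y_syn = make_classification(
    n_samples=1000, n_features=19, n_informative=8, random_state=1
)
xs_train, xs_test, ys_train, ys_test = train_test_split(
    x_syn, y_syn, test_size=0.3, random_state=1, stratify=y_syn
)

# Fit a random forest; the parameter values are arbitrary examples
rf = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=1)
rf.fit(xs_train, ys_train)
rf_acc = accuracy_score(ys_test, rf.predict(xs_test))

# Fit logistic regression on the same split for comparison
lr = LogisticRegression(max_iter=1000).fit(xs_train, ys_train)
lr_acc = accuracy_score(ys_test, lr.predict(xs_test))

print(f"Random forest accuracy: {rf_acc:.3f}")
print(f"Logistic regression accuracy: {lr_acc:.3f}")
```

To apply this to the notebook, replace the synthetic arrays with `x_train`, `x_test`, `y_train` and `y_test`, then vary parameters such as `n_estimators`, `max_depth` and `min_samples_leaf`.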