### Problem Statement
As a data scientist at a telecommunications company, you have been tasked with developing a classification model to predict customer churn (cancellation of subscriptions) based on their historical behavior and demographic information. The company wants to understand which customers are likely to churn in order to develop targeted customer retention programs.

Dataset: You have been provided with a dataset named "Telco-Customer-Churn.csv". Each row in the dataset represents a customer, and each column contains attributes related to the customer's services, account information, and demographic details. The dataset includes the following information:

- Customers who left within the last month (Churn)
- Services that each customer has signed up for (phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies)
- Customer account information (how long they've been a customer, contract type, payment method, paperless billing, monthly charges, and total charges)
- Demographic information about customers (gender, age range, and whether they have partners and dependents)

Objective: Your task is to develop a classification model that can accurately predict whether a customer is likely to churn based on the available information. The model should be trained on a portion of the dataset and evaluated on the remaining unseen data.

Approach:

Load and preprocess the dataset: Load the dataset and handle any missing values. Encode categorical variables using appropriate encoding techniques.

Split the dataset: Split the dataset into training and testing sets to train the model on a subset of data and evaluate its performance on unseen data.

Feature scaling: Perform feature scaling on the numerical features to ensure they are on a similar scale, using techniques like StandardScaler.

Model selection and training: Train multiple classification models such as K-Nearest Neighbors, Random Forest, Support Vector Machines, and Logistic Regression.

Model evaluation: Evaluate the trained models on the testing set using appropriate evaluation metrics such as accuracy.

Select the best model: Compare the performance of different models and select the best-performing model based on the chosen evaluation metric.

Predict churn: Use the selected model to predict churn for new, unseen data.

Deliverables:

- Preprocessed dataset without missing values.
- Trained classification models with their respective evaluation metrics.
- The best-performing model for churn prediction.
- Predictions of churn for new, unseen data.

Note: The final model can be used by the telecommunications company to identify customers who are likely to churn, enabling them to take proactive measures to retain those customers.


### 1. Load and preprocess the dataset
Load the dataset and handle any missing values. Encode categorical variables using appropriate encoding techniques.

In [1]:
import pandas as pd

# Load the dataset
data = pd.read_csv('Telco-Customer-Churn.csv')

# Display the first few rows of the dataset
data.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [4]:
# Handle Missing Values
# Check for missing values
data.isnull().any()


customerID          False
gender              False
SeniorCitizen       False
Partner             False
Dependents          False
tenure              False
PhoneService        False
MultipleLines       False
InternetService     False
OnlineSecurity      False
OnlineBackup        False
DeviceProtection    False
TechSupport         False
StreamingTV         False
StreamingMovies     False
Contract            False
PaperlessBilling    False
PaymentMethod       False
MonthlyCharges      False
TotalCharges        False
Churn               False
dtype: bool

The check reports no null values in any column. If there were any, we could handle them in one of the following ways:

- Deleting the affected rows
- Replacing nulls with a custom value
- Replacing nulls with the mean, median, or mode of the column
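
The three strategies listed above can be sketched on a small made-up frame. The `toy` data and its values are assumptions for illustration only, not rows from the real file. The last lines also show a caveat: `isnull()` only flags `NaN`/`None`, so blank strings in a text column pass the check unnoticed. Whether this particular file contains such blanks is not shown in the output above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the churn data (values are made up for illustration)
toy = pd.DataFrame({
    "tenure": [1, 34, np.nan, 45],
    "MonthlyCharges": [29.85, np.nan, 53.85, 42.30],
    "Contract": ["Month-to-month", None, "One year", "Month-to-month"],
})

# Option 1: delete every row that has at least one missing value
dropped = toy.dropna()

# Option 2: replace nulls with a custom value per column
custom = toy.fillna({"tenure": 0, "Contract": "Unknown"})

# Option 3: replace nulls with a statistic of the column
imputed = toy.copy()
imputed["tenure"] = imputed["tenure"].fillna(imputed["tenure"].median())
imputed["MonthlyCharges"] = imputed["MonthlyCharges"].fillna(imputed["MonthlyCharges"].mean())
imputed["Contract"] = imputed["Contract"].fillna(imputed["Contract"].mode()[0])

# Caveat: a blank string is not null, but it is not a usable number either
raw = pd.Series(["29.85", " ", "108.15"])
hidden_missing = pd.to_numeric(raw, errors="coerce").isnull().sum()
```

Deleting rows is the simplest option but discards data, while imputing keeps every row at the cost of introducing estimated values.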

In [5]:
# Encode Categorical Variables

from sklearn.preprocessing import LabelEncoder

# Encode categorical variables as integer codes using LabelEncoder.
# fit_transform refits the encoder on every call, so one instance can be reused per column.
label_encoder = LabelEncoder()

# List of columns to encode
columns_to_encode = ['gender', 'Contract', 'PaymentMethod', 'Partner', 'Dependents', 'PhoneService', 'OnlineSecurity', 'OnlineBackup',
                     'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling']

for column in columns_to_encode:
    data[column] = label_encoder.fit_transform(data[column])

# Define features and target.
# Besides the identifier and the target, drop MultipleLines, InternetService,
# and TotalCharges, which were not encoded above.
X = data.drop(['customerID', 'Churn', 'MultipleLines', 'InternetService', 'TotalCharges'], axis=1)
y = data['Churn']
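
`LabelEncoder` assigns integer codes in sorted order, which implies a ranking for columns with more than two unordered categories, such as `PaymentMethod`. Distance-based models like KNN treat those codes as magnitudes. One common alternative, shown here as a sketch on made-up rows and not as the method used in this notebook, is to map binary columns to 0/1 and one-hot encode the rest with `pd.get_dummies`.

```python
import pandas as pd

# Made-up rows using two of the column names from the dataset
toy = pd.DataFrame({
    "PaymentMethod": ["Electronic check", "Mailed check", "Bank transfer (automatic)"],
    "Partner": ["Yes", "No", "No"],
})

# Binary Yes/No column: a 0/1 mapping carries no artificial ordering
toy["Partner"] = toy["Partner"].map({"No": 0, "Yes": 1})

# Nominal column with more than two categories: one indicator column per category
encoded = pd.get_dummies(toy, columns=["PaymentMethod"], dtype=int)
```

The result has one `PaymentMethod_<category>` column per payment method, each holding 0 or 1.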

### 2. Split the dataset

In [7]:
from sklearn.model_selection import train_test_split

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
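
Churn datasets are often imbalanced, and a purely random split can leave the test set with a different churn rate than the training set. A possible refinement, sketched on synthetic data with made-up class proportions, is to pass `stratify` so both sets keep the original class ratio. The names `X_demo` and `y_demo` are placeholders, not variables from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 30% in the positive class (made-up proportions)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array(["Yes"] * 30 + ["No"] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, the 20-row test set keeps the 30/70 class ratio
test_churn_rate = (y_te == "Yes").mean()
```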

### 3. Feature scaling

In [9]:
from sklearn.preprocessing import StandardScaler

scaling = StandardScaler()
X_train = scaling.fit_transform(X_train)
X_test = scaling.transform(X_test)
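
The cell above fits the scaler on the training set only and reuses it for the test set, so no information from the test rows leaks into the preprocessing. A small worked example with made-up numbers shows what `StandardScaler` computes, namely `z = (x - mean) / std` with the training mean and the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from training rows only
test_scaled = scaler.transform(test)        # reuses the training mean and std

# Same result by hand: z = (x - mean) / std, where np.std defaults to ddof=0
manual = (test - train.mean()) / train.std()
```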

### 4. Model selection and training
### 5. Model evaluation

In [13]:
# Train multiple classification models: K-Nearest Neighbors, Random Forest, Support Vector Machines, and Logistic Regression.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize classifiers
knn = KNeighborsClassifier(n_neighbors=3)
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
svc = SVC(kernel='linear', random_state=42)
log_reg = LogisticRegression(random_state=42)

# Train and evaluate models
models = [knn, rfc, svc, log_reg]
model_names = ['KNN', 'Random Forest', 'SVM', 'Logistic Regression']

# Initialize variables to store the maximum accuracy and the best model
max_accuracy = 0
best_model = None

for model, name in zip(models, model_names):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{name} Accuracy: {accuracy * 100:.2f}%")
    
    # Check if the current model has better accuracy than the previous ones
    if accuracy > max_accuracy:
        max_accuracy = accuracy
        best_model = model
      

KNN Accuracy: 73.95%
Random Forest Accuracy: 78.78%
SVM Accuracy: 81.83%
Logistic Regression Accuracy: 81.90%
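
Accuracy alone can be misleading when one class dominates, because a model that never predicts churn still scores the share of non-churners. The sketch below uses hand-made labels, not the notebook's results, to show how recall and a confusion matrix expose that case.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: 8 non-churners and 2 churners
y_true = ["No"] * 8 + ["Yes"] * 2

# A trivial "model" that predicts "No" for everyone
y_naive = ["No"] * 10

acc = accuracy_score(y_true, y_naive)                  # share of correct predictions
rec = recall_score(y_true, y_naive, pos_label="Yes")   # share of churners that were caught
cm = confusion_matrix(y_true, y_naive, labels=["No", "Yes"])  # rows: actual, columns: predicted
```

Here the accuracy is 0.8 while the recall for churners is 0.0, so a retention program built on this model would reach none of the customers who actually leave.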


### 6. Select the best model

In [16]:
print(f"Best Model: {best_model}")
print(f"Best Model Accuracy: {max_accuracy * 100:.2f}%")

Best Model: LogisticRegression(random_state=42)
Best Model Accuracy: 81.90%
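
The gap between SVM and Logistic Regression above is under 0.1 percentage points on a single split, so the ranking may change with a different `random_state`. One way to get a steadier comparison, sketched here on synthetic data from `make_classification` and not on the churn file, is k-fold cross-validation with the scaler wrapped in a pipeline so it is refit inside each fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Each candidate bundles scaling with the classifier
candidates = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)),
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(random_state=42)),
}

# Mean accuracy over 5 folds for each candidate
cv_means = {}
for name, pipe in candidates.items():
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()

best_name = max(cv_means, key=cv_means.get)
```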


### 7. Predict churn

In [23]:

# The held-out test set stands in for new, unseen data here
y_pred_best = best_model.predict(X_test)
y_pred_best

array(['Yes', 'No', 'No', ..., 'No', 'No', 'No'], dtype=object)
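
For a customer who is not in the original file, the same preprocessing has to be applied with the objects fitted on the training data. The following is a minimal sketch under stated assumptions: the training rows and the new customer are made up, only two feature columns are used, and one fitted encoder is kept per column in a hypothetical `encoders` dictionary (the notebook above reuses a single encoder and so does not retain the per-column mappings).

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Tiny made-up training set with two of the columns used above
train = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
    "tenure": [1, 34, 2, 60],
    "Churn": ["Yes", "No", "Yes", "No"],
})

# Keep one fitted encoder per column so new data gets the same integer codes
encoders = {"Contract": LabelEncoder().fit(train["Contract"])}
features = train[["Contract", "tenure"]].copy()
features["Contract"] = encoders["Contract"].transform(features["Contract"])

# Fit the scaler and the model on the training features
scaler = StandardScaler().fit(features)
model = LogisticRegression(random_state=42).fit(scaler.transform(features), train["Churn"])

# A hypothetical new customer, preprocessed with the already-fitted objects
new_customer = pd.DataFrame({"Contract": ["One year"], "tenure": [5]})
new_customer["Contract"] = encoders["Contract"].transform(new_customer["Contract"])
prediction = model.predict(scaler.transform(new_customer))
```

`prediction` is a one-element array holding `'Yes'` or `'No'`. Calling `fit` again on the new rows would assign different codes and scales, so only `transform` is used on them.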