# Task
Calculate and print the accuracy, precision, recall, and F1-score of the Decision Tree Classifier from `y_test` and the model's predictions (`y_pred` for the base model, `y_pred_tuned` for the tuned model). Then summarize the data preprocessing steps, the Decision Tree model built, its performance metrics, and any key insights gained.

# 1. Import Libraries & Load Data

Import pandas, NumPy, and the scikit-learn utilities used in the later steps, then load `netflix_customer_churn.csv` into a DataFrame named `df`. The first rows are displayed in the next step to confirm that the file loaded correctly.



In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    classification_report
)


In [2]:
df = pd.read_csv('netflix_customer_churn.csv')

# 2. Basic Data Exploration

In [3]:
print("First 5 rows:")
print(df.head())

First 5 rows:
                            customer_id  age  gender subscription_type  \
0  a9b75100-82a8-427a-a208-72f24052884a   51   Other             Basic   
1  49a5dfd9-7e69-4022-a6ad-0a1b9767fb5b   47   Other          Standard   
2  4d71f6ce-fca9-4ff7-8afa-197ac24de14b   27  Female          Standard   
3  d3c72c38-631b-4f9e-8a0e-de103cad1a7d   53   Other           Premium   
4  4e265c34-103a-4dbb-9553-76c9aa47e946   56   Other          Standard   

   watch_hours  last_login_days   region  device  monthly_fee  churned  \
0        14.73               29   Africa      TV         8.99        1   
1         0.70               19   Europe  Mobile        13.99        1   
2        16.32               10     Asia      TV        13.99        0   
3         4.51               12  Oceania      TV        17.99        1   
4         1.89               13   Africa  Mobile        13.99        1   

  payment_method  number_of_profiles  avg_watch_time_per_day favorite_genre  
0      Gift Card  

In [4]:
print("\nInfo:")
df.info()



Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   customer_id             5000 non-null   object 
 1   age                     5000 non-null   int64  
 2   gender                  5000 non-null   object 
 3   subscription_type       5000 non-null   object 
 4   watch_hours             5000 non-null   float64
 5   last_login_days         5000 non-null   int64  
 6   region                  5000 non-null   object 
 7   device                  5000 non-null   object 
 8   monthly_fee             5000 non-null   float64
 9   churned                 5000 non-null   int64  
 10  payment_method          5000 non-null   object 
 11  number_of_profiles      5000 non-null   int64  
 12  avg_watch_time_per_day  5000 non-null   float64
 13  favorite_genre          5000 non-null   object 
dtypes: float64(3), int64(4), object(7)

In [5]:
print("\nMissing values:")
print(df.isnull().sum())


Missing values:
customer_id               0
age                       0
gender                    0
subscription_type         0
watch_hours               0
last_login_days           0
region                    0
device                    0
monthly_fee               0
churned                   0
payment_method            0
number_of_profiles        0
avg_watch_time_per_day    0
favorite_genre            0
dtype: int64


# 3. Drop Unnecessary Column

In [6]:
df = df.drop('customer_id', axis=1)

# 4. Separate Features & Target

In [7]:
X = df.drop('churned', axis=1)
y = df['churned']

# 5. Train-Test Split

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
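
The `stratify=y` argument keeps the churn rate the same in both splits. The sketch below illustrates this on hypothetical toy labels (`X_toy` and `y_toy` are made up for the example and are not the notebook's data), assuming a 60/40 class mix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 60 samples of class 0 and 40 of class 1
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    random_state=42,
    stratify=y_toy,  # preserve the class proportions in both splits
)

# Share of the positive class in each split
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"Positive rate, train: {train_rate:.2f}  test: {test_rate:.2f}")
```

In the notebook itself, the same check would be `y_train.mean()` against `y_test.mean()`.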

# 6. Separate Numerical & Categorical Columns

In [9]:
numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X_train.select_dtypes(include=['object']).columns


# 7. Scale Numerical Features

In [10]:
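scaler = StandardScaler()

# Fit the scaler on the training split only, then reuse its statistics
# on the test split so that no test information leaks into preprocessing
X_train_num = scaler.fit_transform(X_train[numerical_features])
X_test_num = scaler.transform(X_test[numerical_features])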

# 8. Encode Categorical Features

In [11]:
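# handle_unknown='ignore' encodes categories unseen during fit as all zeros;
# sparse_output=False returns a dense array that np.hstack can combine
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

X_train_cat = encoder.fit_transform(X_train[categorical_features])
X_test_cat = encoder.transform(X_test[categorical_features])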


# 9. Combine Processed Features

In [12]:
X_train_processed = np.hstack((X_train_num, X_train_cat))
X_test_processed = np.hstack((X_test_num, X_test_cat))

print("Final Training Shape:", X_train_processed.shape)
print("Final Testing Shape:", X_test_processed.shape)


Final Training Shape: (4000, 35)
Final Testing Shape: (1000, 35)


# 10. Train Base Decision Tree

In [13]:
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train_processed, y_train)

y_pred = dt_model.predict(X_test_processed)

print("\nBase Model Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))


Base Model Performance:
Accuracy: 0.98
Precision: 0.9763313609467456
Recall: 0.9840954274353877
F1 Score: 0.9801980198019802
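
With an untuned tree already scoring this high, it can be useful to check which features drive the splits. The sketch below shows the general pattern using `feature_importances_` on hypothetical toy data (the `signal` and `noise` columns are invented for illustration, and the label depends only on `signal`).

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: the label is fully determined by the 'signal' column
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
toy_y = (toy_X["signal"] > 0).astype(int)

tree = DecisionTreeClassifier(random_state=42).fit(toy_X, toy_y)

# Impurity-based importances, one value per input column, summing to 1
importances = pd.Series(
    tree.feature_importances_, index=toy_X.columns
).sort_values(ascending=False)
print(importances)
```

For `dt_model`, the matching column names could be assembled from `numerical_features` followed by `encoder.get_feature_names_out(categorical_features)`, since that is the order used in `np.hstack`.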


# 11. Hyperparameter Tuning


In [14]:
param_grid = {
    'max_depth': [5, 10, 15, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train_processed, y_train)

print("\nBest Hyperparameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)


Best Hyperparameters: {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 10}
Best Cross-Validation Accuracy: 0.9817500000000001
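
To gauge the cost of this search, the sketch below counts the parameter combinations in the grid with `ParameterGrid` and multiplies by the number of folds. It assumes the same `param_grid` and `cv=5` as above; `n_candidates` and `n_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'max_depth': [5, 10, 15, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Every combination of the listed values is one candidate model
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per cross-validation fold
n_fits = n_candidates * 5

print("Candidates:", n_candidates)            # 5 * 3 * 3 = 45
print("Cross-validation fits:", n_fits)       # 45 * 5 = 225
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training set.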


# 12. Train Tuned Model

In [15]:
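# GridSearchCV has already refit the best parameter combination on the full
# training set (refit=True by default), so no further training is needed here
best_model = grid_search.best_estimator_
y_pred_tuned = best_model.predict(X_test_processed)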

# 13. Final Evaluation

In [16]:
print("\nTuned Model Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred_tuned))
print("Precision:", precision_score(y_test, y_pred_tuned))
print("Recall:", recall_score(y_test, y_pred_tuned))
print("F1 Score:", f1_score(y_test, y_pred_tuned))



Tuned Model Performance:
Accuracy: 0.979
Precision: 0.9839357429718876
Recall: 0.974155069582505
F1 Score: 0.9790209790209791


# 14. Confusion Matrix

In [17]:
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_tuned))


Confusion Matrix:
[[489   8]
 [ 13 490]]
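
The tuned model's four headline metrics follow directly from this matrix. As a sanity check, the sketch below recomputes them from the printed counts, assuming scikit-learn's layout (rows are true labels, columns are predicted labels) and class 1 (churned) as the positive class.

```python
import numpy as np

# Counts copied from the confusion matrix printed above
cm = np.array([[489, 8],
               [13, 490]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")   # 0.9790
print(f"Precision: {precision:.4f}")  # 0.9839
print(f"Recall:    {recall:.4f}")     # 0.9742
print(f"F1 Score:  {f1:.4f}")         # 0.9790
```

These values agree with the tuned model's printed metrics: 8 non-churners were flagged as churners, and 13 churners were missed.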


# 15. Classification Report

In [18]:
print("\nClassification Report:")
print(classification_report(y_test, y_pred_tuned))


Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.98      0.98       497
           1       0.98      0.97      0.98       503

    accuracy                           0.98      1000
   macro avg       0.98      0.98      0.98      1000
weighted avg       0.98      0.98      0.98      1000

