## Churn Model Building

### Importing Libraries

In [24]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.over_sampling import SMOTE  
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.pipeline import Pipeline
from imblearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from imblearn.combine import SMOTEENN
from sklearn.metrics import confusion_matrix
import pickle

### Loading Data

In [2]:
df = pd.read_csv('Final_df.csv')
df.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,tenure_group
0,0,0,1,0,0,1,0,0,2,0,0,0,0,0,1,2,29.85,29.85,0,0
1,1,0,0,0,1,0,0,2,0,2,0,0,0,1,0,3,56.95,1889.5,0,2
2,1,0,0,0,1,0,0,2,2,0,0,0,0,0,1,3,53.85,108.15,1,0
3,1,0,0,0,0,1,0,2,0,2,2,0,0,1,0,0,42.3,1840.75,0,3
4,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,2,70.7,151.65,1,0


## Splitting data into input variables X and target variable y

In [3]:
X =  df.drop('Churn', axis=1)
y = df['Churn']
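
The `df.head()` output above shows an `Unnamed: 0` column, which is typically a row index written out by an earlier `to_csv` call. Because `X` is built by dropping only `Churn`, that column stays in the feature matrix. A minimal sketch of one way to exclude it, using a small hypothetical frame (`demo_df`, not the real `Final_df.csv`):

```python
import pandas as pd

# Hypothetical stand-in for Final_df.csv with the same kind of leftover index column.
demo_df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "Churn": [0, 0, 1],
})

# Drop the target and the leftover index column.
# errors="ignore" keeps the call safe if the index column is absent.
X_demo = demo_df.drop(columns=["Churn", "Unnamed: 0"], errors="ignore")
y_demo = demo_df["Churn"]

print(list(X_demo.columns))
```

If the column really is a saved index, reading the file with `pd.read_csv('Final_df.csv', index_col=0)` would likely have the same effect.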

## Splitting data into train and test sets

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
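
With an imbalanced target, a plain random split can leave the test set with a slightly different churn rate than the training set. A sketch of a stratified split, assuming synthetic labels with a 25% positive rate (the data and the `_demo` names are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced binary target (25% positives).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 250 + [0] * 750)

# stratify=y_demo keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```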

## Checking various ML algorithms at a preliminary stage to see which best fits our business problem

In [5]:
def model_check(X_train, X_test, y_train, y_test):
    pipelines = {
        'Random Forest': make_pipeline(StandardScaler(), SimpleImputer(), RandomForestClassifier()),
        'SVM': make_pipeline(StandardScaler(), SVC(probability=True)),
        'Naive Bayes': make_pipeline(StandardScaler(), GaussianNB()),
        'Decision Tree': make_pipeline(StandardScaler(), DecisionTreeClassifier()),
        'K-Nearest Neighbors': make_pipeline(StandardScaler(), KNeighborsClassifier())
    }

    for step, pipeline in pipelines.items():
        pipeline.fit(X_train, y_train)
        y_pred = pipeline.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)

        print(f"{step} Metrics:")
        print(f"Accuracy: {accuracy}")
        print(f"Precision: {precision}")
        print(f"Recall: {recall}")
        print(f"F1 Score: {f1}")
        print('*' * 20)
        

In [6]:
model_check(X_train, X_test, y_train, y_test)

Random Forest Metrics:
Accuracy: 0.7810945273631841
Precision: 0.6204379562043796
Recall: 0.45454545454545453
F1 Score: 0.5246913580246914
********************
SVM Metrics:
Accuracy: 0.7910447761194029
Precision: 0.6515151515151515
Recall: 0.45989304812834225
F1 Score: 0.5391849529780565
********************
Naive Bayes Metrics:
Accuracy: 0.7391613361762616
Precision: 0.5065176908752328
Recall: 0.7272727272727273
F1 Score: 0.5971459934138309
********************
Decision Tree Metrics:
Accuracy: 0.7263681592039801
Precision: 0.48405797101449277
Recall: 0.446524064171123
F1 Score: 0.46453407510431155
********************
K-Nearest Neighbors Metrics:
Accuracy: 0.7356076759061834
Precision: 0.5027624309392266
Recall: 0.48663101604278075
F1 Score: 0.4945652173913044
********************


### From the above results, we decide to go with the Decision Tree and Random Forest classifiers and check which is better suited to our data.

In [7]:
def pre_model_dt_rf(X_train, X_test, y_train, y_test):
    
    model_dt = DecisionTreeClassifier(criterion = "gini",
                                  max_depth=6, 
                                  min_samples_leaf=8)

    model_dt.fit(X_train,y_train)

    y_pred = model_dt.predict(X_test)
    print("DecisionTreeClassifier Result")
    
    print(model_dt.score(X_test,y_test))
    print(classification_report(y_test, y_pred, labels=[0,1]))

    model_rf = RandomForestClassifier(n_estimators=100,
                                      criterion='gini',
                                      max_depth=6,
                                      min_samples_leaf=8)
    
    model_rf.fit(X_train, y_train)
    y_pred = model_rf.predict(X_test)
    
    print("RandomForestClassifier Result")
    
    print(model_rf.score(X_test,y_test))
    print(classification_report(y_test, y_pred, labels=[0,1]))
    
    

In [8]:
pre_model_dt_rf(X_train, X_test, y_train, y_test)

DecisionTreeClassifier Result
0.7633262260127932
              precision    recall  f1-score   support

           0       0.84      0.83      0.84      1033
           1       0.55      0.57      0.56       374

    accuracy                           0.76      1407
   macro avg       0.70      0.70      0.70      1407
weighted avg       0.77      0.76      0.76      1407

RandomForestClassifier Result
0.7889125799573561
              precision    recall  f1-score   support

           0       0.82      0.91      0.86      1033
           1       0.65      0.44      0.53       374

    accuracy                           0.79      1407
   macro avg       0.74      0.68      0.70      1407
weighted avg       0.77      0.79      0.77      1407



### As you can see, the accuracy is quite low. Because this is an imbalanced dataset, we shouldn't use accuracy as our metric for judging the model: accuracy is misleading on imbalanced datasets.

### Hence, we need to check recall, precision, and F1 score for the minority class. It is evident that all three are low for Class 1, i.e. churned customers.
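
To make the point about accuracy concrete, the sketch below builds labels with the same class counts as the test-set support reported above (1033 non-churned, 374 churned) and scores a trivial predictor that always outputs class 0. The predictor is hypothetical and only serves as a baseline:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Labels with the same class counts as the test-set support above.
y_true = np.array([0] * 1033 + [1] * 374)

# A trivial "model" that predicts no churn for every customer.
y_all_zero = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_all_zero)
baseline_recall = recall_score(y_true, y_all_zero, zero_division=0)

print("accuracy:", round(baseline_accuracy, 4))
print("recall for class 1:", baseline_recall)
```

This baseline reaches roughly 0.73 accuracy while finding no churned customers at all, which is why the minority-class recall, precision, and F1 score are the more informative numbers here.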

### Hence, we move ahead with SMOTEENN (SMOTE upsampling + ENN). From the above results we can clearly see that our model predicts class 0 (no churn) more accurately but does not perform well on class 1, so we use SMOTEENN, which combines oversampling of the minority class with Edited Nearest Neighbours undersampling, to get a more accurate result.

In [30]:
def final_model(X,y):
    sm = SMOTEENN()
    X, y = sm.fit_resample(X,y)
    
    ## split data
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=1)
    
    #using smoteenn technique
    print("Decision Tree Classifier with SMOTEENN ")
    model = DecisionTreeClassifier(criterion = "gini",
                                   max_depth=6,
                                   min_samples_leaf=8)
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)
    model_score = model.score(X_test,y_test)
    print('model_score', model_score)
    print(classification_report(y_test,y_pred))
    print('*'*50)
    
    #using Random forest
    print("Random Forest Classifier with SMOTEENN ")
    model1 = RandomForestClassifier(n_estimators=100,
                                  criterion='gini',
                                  max_depth=6,
                                  min_samples_leaf=8)
    model1.fit(X_train,y_train)
    y_pred = model1.predict(X_test)
    model_score = model1.score(X_test,y_test)  # score the Random Forest (model1), not the Decision Tree
    print('model_score', model_score)
    print(classification_report(y_test,y_pred))
    print('*'*50)
    
    with open('Final_model.pkl', 'wb') as model_file:
        pickle.dump(model1, model_file)
    print('pickle file saved!')    

In [31]:
final_model(X,y)

Decision Tree Classifier with SMOTEENN 
model_score 0.9171648163962425
              precision    recall  f1-score   support

           0       0.93      0.88      0.90       523
           1       0.91      0.95      0.93       648

    accuracy                           0.92      1171
   macro avg       0.92      0.91      0.92      1171
weighted avg       0.92      0.92      0.92      1171

**************************************************
Random Forest Classifier with SMOTEENN 
model_score 0.9171648163962425
              precision    recall  f1-score   support

           0       0.92      0.90      0.91       523
           1       0.92      0.94      0.93       648

    accuracy                           0.92      1171
   macro avg       0.92      0.92      0.92      1171
weighted avg       0.92      0.92      0.92      1171

**************************************************
pickle file saved!


## Documentation

The choice of the best classifier depends on your specific objectives and the trade-offs you are willing to make. The points below refer to the baseline metrics printed by `model_check`.

1. **SVM (Support Vector Machine):**
   - It has the highest accuracy (0.79) and the highest precision (0.65) among the classifiers.
   - Its recall is low (0.46), so its F1 score (0.54) is only moderate.
   - SVM is known for its ability to handle complex decision boundaries.

2. **Random Forest:**
   - Its accuracy (0.78) and precision (0.62) are slightly lower than SVM's, and its recall (0.45) is similarly low.
   - It's a robust classifier that handles a variety of data types and supports feature importance analysis.

3. **Naive Bayes:**
   - It has the highest recall among the classifiers, suggesting that it's good at identifying positive cases (churn).
   - It has the highest F1 score (0.60), at the cost of lower precision (0.51) and accuracy (0.74).

4. **K-Nearest Neighbors:**
   - Its precision (0.50) and recall (0.49) are close to each other, but both are modest, giving an F1 score of 0.49.
   - K-Nearest Neighbors can be simple to understand and implement.

5. **Decision Tree:**
   - While it has the lowest accuracy and F1 score, it may still be a viable choice if interpretability is essential.
   - Decision trees are easy to explain and understand, making them valuable in scenarios where interpretability is crucial.
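
The comparison above can be checked directly against the printed metrics. A small sketch that tabulates the values reported by `model_check` (rounded to four decimals) and looks up the best model per metric; the `metrics` and `best` names are illustrative:

```python
import pandas as pd

# Baseline metrics as printed by model_check, rounded to four decimals.
metrics = pd.DataFrame(
    {
        "accuracy": [0.7811, 0.7910, 0.7392, 0.7264, 0.7356],
        "precision": [0.6204, 0.6515, 0.5065, 0.4841, 0.5028],
        "recall": [0.4545, 0.4599, 0.7273, 0.4465, 0.4866],
        "f1": [0.5247, 0.5392, 0.5971, 0.4645, 0.4946],
    },
    index=[
        "Random Forest",
        "SVM",
        "Naive Bayes",
        "Decision Tree",
        "K-Nearest Neighbors",
    ],
)

# For each metric, the name of the model with the highest value.
best = metrics.idxmax()
print(best)
```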

Your choice should consider the business context and objectives of your telecom churn project. Here are some factors to consider:

- **Balancing Precision and Recall:** If minimizing false positives (precision) is more critical to you, go with SVM, which has the highest precision here, with Random Forest close behind. If it's more important to capture as many true positives as possible (recall), consider Naive Bayes.

- **Complexity:** SVM and Random Forest are more complex models, while Naive Bayes and K-Nearest Neighbors are simpler to understand and implement.

- **Interpretability:** Decision trees are highly interpretable, which can be beneficial in certain situations.

- **Performance:** If overall accuracy is the primary goal, SVM performs well. If you need a balance between precision and recall, consider Naive Bayes, which has the highest F1 score in this comparison.
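
The precision/recall balance discussed in the factors above does not have to be settled only by switching classifiers. For models that expose `predict_proba`, it can also be adjusted by moving the decision threshold. A minimal sketch on synthetic data; the dataset, the `_demo` names, and the thresholds 0.3, 0.5, and 0.7 are illustrative assumptions, not values from this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the churn data.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.73, 0.27], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

clf = RandomForestClassifier(
    n_estimators=100, max_depth=6, min_samples_leaf=8, random_state=0
).fit(X_tr, y_tr)

# Predicted probability of the positive (churn) class.
proba = clf.predict_proba(X_te)[:, 1]

results = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = (proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, y_pred, zero_division=0),
        recall_score(y_te, y_pred),
    )
    print(threshold, results[threshold])
```

Lowering the threshold flags more customers as likely churners, so recall can only stay the same or rise, usually at the expense of precision.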
