### Importing Libraries

In [1]:
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from imblearn.combine import SMOTEENN

The imports above cover data handling, model building, evaluation, and resampling:

* pandas (imported as pd) reads the CSV file and holds the data in a DataFrame.
* metrics is Scikit-learn's evaluation module. Scikit-learn (sklearn) is a widely used machine learning library for Python, and this module provides functions for measuring model performance.
* train_test_split, from Scikit-learn's model_selection module, splits a dataset into training and testing subsets.
* recall_score, classification_report, and confusion_matrix are individual functions from the metrics module. recall_score calculates recall for a classification task, classification_report produces a per-class summary of several metrics, and confusion_matrix counts correct and incorrect predictions for each class.
* DecisionTreeClassifier, from Scikit-learn's tree module, implements the decision tree algorithm for classification.
* SMOTEENN, from the combine module of imblearn (a library for handling imbalanced datasets), is a hybrid sampling technique that combines SMOTE (Synthetic Minority Over-sampling Technique) with ENN (Edited Nearest Neighbors) to address class imbalance.

Overall, this code imports necessary libraries and modules for data manipulation, model evaluation, classification, and handling imbalanced datasets. It sets up the environment for further data analysis and machine learning tasks.

#### Reading csv

In [2]:
df=pd.read_csv("tel_churn.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,SeniorCitizen,MonthlyCharges,TotalCharges,Churn,gender_Female,gender_Male,Partner_No,Partner_Yes,Dependents_No,...,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,tenure_group_1 - 12,tenure_group_13 - 24,tenure_group_25 - 36,tenure_group_37 - 48,tenure_group_49 - 60,tenure_group_61 - 72
0,0,0,29.85,29.85,0,1,0,0,1,1,...,0,0,1,0,1,0,0,0,0,0
1,1,0,56.95,1889.5,0,0,1,1,0,1,...,0,0,0,1,0,0,1,0,0,0
2,2,0,53.85,108.15,1,0,1,1,0,1,...,0,0,0,1,1,0,0,0,0,0
3,3,0,42.3,1840.75,0,0,1,1,0,1,...,1,0,0,0,0,0,0,1,0,0
4,4,0,70.7,151.65,1,1,0,1,0,1,...,0,0,1,0,1,0,0,0,0,0


In [3]:
df=df.drop('Unnamed: 0',axis=1)

In [4]:
x=df.drop('Churn',axis=1)
x

Unnamed: 0,SeniorCitizen,MonthlyCharges,TotalCharges,gender_Female,gender_Male,Partner_No,Partner_Yes,Dependents_No,Dependents_Yes,PhoneService_No,...,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,tenure_group_1 - 12,tenure_group_13 - 24,tenure_group_25 - 36,tenure_group_37 - 48,tenure_group_49 - 60,tenure_group_61 - 72
0,0,29.85,29.85,1,0,0,1,1,0,1,...,0,0,1,0,1,0,0,0,0,0
1,0,56.95,1889.50,0,1,1,0,1,0,0,...,0,0,0,1,0,0,1,0,0,0
2,0,53.85,108.15,0,1,1,0,1,0,0,...,0,0,0,1,1,0,0,0,0,0
3,0,42.30,1840.75,0,1,1,0,1,0,1,...,1,0,0,0,0,0,0,1,0,0
4,0,70.70,151.65,1,0,1,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
5,0,99.65,820.50,1,0,1,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
6,0,89.10,1949.40,0,1,1,0,0,1,0,...,0,1,0,0,0,1,0,0,0,0
7,0,29.75,301.90,1,0,1,0,1,0,1,...,0,0,0,1,1,0,0,0,0,0
8,0,104.80,3046.05,1,0,0,1,1,0,0,...,0,0,1,0,0,0,1,0,0,0
9,0,56.15,3487.95,0,1,1,0,0,1,0,...,1,0,0,0,0,0,0,0,0,1


In [5]:
y=df['Churn']
y

0       0
1       0
2       1
3       0
4       1
5       1
6       0
7       0
8       1
9       0
10      0
11      0
12      0
13      1
14      0
15      0
16      0
17      0
18      1
19      0
20      1
21      0
22      1
23      0
24      0
25      0
26      1
27      1
28      0
29      1
       ..
7002    0
7003    0
7004    0
7005    0
7006    0
7007    1
7008    0
7009    0
7010    1
7011    0
7012    0
7013    0
7014    0
7015    1
7016    0
7017    0
7018    0
7019    0
7020    0
7021    1
7022    0
7023    1
7024    0
7025    0
7026    0
7027    0
7028    0
7029    0
7030    1
7031    0
Name: Churn, Length: 7032, dtype: int64

In the code above, df is a pandas DataFrame and Churn is one of its columns. The code y = df['Churn'] assigns the values of the 'Churn' column to the variable y.

In other words, y now contains a pandas Series object that represents the 'Churn' column from the DataFrame df. This is often done in machine learning tasks to separate the target variable (in this case, the 'Churn' column) from the rest of the data, as the target variable is typically the variable we want to predict or analyze.
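Before modelling, it can help to check how the target classes are distributed, since class imbalance affects how the later scores should be read. The following is a minimal sketch on a small made-up DataFrame (the name toy and its values are assumptions for illustration, not rows from tel_churn.csv):

```python
import pandas as pd

# Hypothetical toy frame standing in for df; only the 'Churn' column matters here.
toy = pd.DataFrame({"Churn": [0, 0, 1, 0, 1, 0, 0, 0]})

target = toy["Churn"]
counts = target.value_counts()                 # absolute number of rows per class
shares = target.value_counts(normalize=True)   # fraction of rows per class

print(counts)
print(shares)
```

Running the same two value_counts() calls on y would show the class balance of the real data.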

##### Train Test Split

In [6]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)

The line of code x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2) uses the train_test_split function from the Scikit-learn library to split the data into training and testing sets.

Here's what each argument represents:

* x represents the features or input data that will be used to train the model.
* y represents the target variable or the output that we want to predict or analyze.
* test_size=0.2 specifies that 20% of the data should be allocated for testing, while the remaining 80% will be used for training.

After executing this line of code, four new variables are created:

* x_train contains the training data for the features.
* x_test contains the testing data for the features.
* y_train contains the training data for the target variable.
* y_test contains the testing data for the target variable.

By splitting the data into training and testing sets, we can train a machine learning model using the x_train and y_train data and then evaluate its performance on the unseen data using the x_test and y_test data. This helps us assess how well the model generalizes to new, unseen data. Because no random_state is passed, the rows assigned to each subset change from run to run, so the scores reported below will vary slightly between runs.
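The split described above can be made repeatable, and can keep the class ratio the same in both subsets, by passing random_state and stratify. This is a sketch on synthetic stand-in data; the array names, the seed 42, and the 27% positive rate are assumptions for illustration, not values from the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 5 features, 27% positives (assumed ratio).
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 5))
labels = np.array([1] * 270 + [0] * 730)

f_train, f_test, l_train, l_test = train_test_split(
    features,
    labels,
    test_size=0.2,
    random_state=42,   # fixed seed: the same rows land in each subset on every run
    stratify=labels,   # keep the class ratio (about) the same in both subsets
)

print(f_train.shape, f_test.shape)
print(l_train.mean(), l_test.mean())  # share of positives in each subset
```

Stratifying matters most when one class is much rarer than the other, because a purely random split can leave the test set with noticeably more or fewer minority rows than the training set.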

#### Decision Tree Classifier

In [7]:
model_dt=DecisionTreeClassifier(criterion = "gini",random_state = 100,max_depth=6, min_samples_leaf=8)

The line of code model_dt = DecisionTreeClassifier(criterion="gini", random_state=100, max_depth=6, min_samples_leaf=8) creates an instance of the DecisionTreeClassifier class from the Scikit-learn library and assigns it to the variable model_dt.

Here's an explanation of the parameters used in the DecisionTreeClassifier constructor:

* criterion="gini": This parameter specifies the criterion to measure the quality of a split. In this case, "gini" is used, which refers to the Gini impurity. Other options include "entropy" for information gain.
* random_state=100: This parameter sets the random seed for reproducibility. It ensures that the same random splits are generated each time the model is trained, allowing for consistent results.
* max_depth=6: This parameter limits the maximum depth of the decision tree. It controls the complexity of the tree and helps prevent overfitting. In this case, the maximum depth is set to 6.
* min_samples_leaf=8: This parameter sets the minimum number of samples required to be at a leaf node. A split is only accepted if it leaves at least this many training samples in each branch, which stops the tree from creating leaves that describe just a handful of rows and so helps limit overfitting. In this case, the minimum samples per leaf is set to 8.

By creating an instance of the DecisionTreeClassifier with these parameter values, we are specifying the configuration for the decision tree model. This model can be trained on the data using the fit() method and then used to make predictions on new data.
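To see that these constraints are actually enforced, the fitted tree can be inspected. The sketch below fits the same configuration on synthetic data from make_classification (an assumption, used only so the example is self-contained) and checks the depth and the smallest leaf:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the churn dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

tree = DecisionTreeClassifier(
    criterion="gini", random_state=100, max_depth=6, min_samples_leaf=8
)
tree.fit(X_demo, y_demo)

# In the fitted tree structure, a node with no left child (-1) is a leaf.
is_leaf = tree.tree_.children_left == -1
smallest_leaf = int(tree.tree_.n_node_samples[is_leaf].min())

print("depth:", tree.get_depth())                # never more than max_depth
print("number of leaves:", tree.get_n_leaves())
print("smallest leaf size:", smallest_leaf)      # never less than min_samples_leaf
```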

Overfitting is a common problem in machine learning where a model learns the training data too well, to the point that it starts to perform poorly on unseen or new data. In other words, an overfit model has "memorized" the training data and fails to generalize well to unknown examples.

Here are some key characteristics and consequences of overfitting:

* High Training Accuracy, Low Test Accuracy: An overfit model often achieves a high accuracy or performance on the training data because it has learned the specific patterns and noise present in that data. However, when evaluated on new data (test data), the accuracy drops significantly.
* Complex Model: Overfitting can occur when the model becomes too complex, capturing noise or irrelevant patterns in the training data. It may have a large number of parameters or a high degree of freedom, making it highly flexible in fitting the training data.
* Lack of Generalization: The primary issue with overfitting is that the model fails to generalize well to unseen data. It becomes overly sensitive to the training data, making it less effective at making accurate predictions on new examples.
* Poor Performance on Unseen Data: When an overfit model encounters new data that it hasn't seen during training, it may struggle to make accurate predictions. It may produce unreliable and misleading results, which can lead to poor decision-making in real-world applications.

To address overfitting, several techniques can be employed, including:

*  Simplifying the model: Reducing the complexity of the model, such as by decreasing the number of parameters or using simpler algorithms, can help mitigate overfitting.
*  Regularization: Techniques like L1 or L2 regularization can be applied to add a penalty term to the model's loss function, discouraging excessive complexity and reducing overfitting.
*  Cross-validation: Using cross-validation techniques can help assess the model's performance on multiple subsets of the data, providing a more robust evaluation of its generalization ability.
*  Increasing training data: Providing more diverse and representative training data can help the model learn better and reduce overfitting by capturing a broader range of patterns.

By addressing overfitting, a model can achieve better generalization and perform well on unseen data, which is crucial for real-world applications and reliable predictions.
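The gap between training and test accuracy described above can be observed directly. The following sketch compares an unconstrained tree with a depth-limited one on synthetic, deliberately noisy data; the dataset, the seeds, and the max_depth=4 setting are assumptions chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.1 randomly reassigns about 10% of labels, giving the tree noise to memorize.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# No limits: the tree keeps splitting until every training row is classified correctly.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Limited depth and leaf size: a simpler model that cannot memorize individual rows.
shallow = DecisionTreeClassifier(
    max_depth=4, min_samples_leaf=8, random_state=0
).fit(X_tr, y_tr)

for name, model in [("unconstrained", deep), ("constrained", shallow)]:
    print(name,
          "| depth:", model.get_depth(),
          "| train acc:", round(model.score(X_tr, y_tr), 3),
          "| test acc:", round(model.score(X_te, y_te), 3))
```

A large difference between the train and test columns for a model is the usual sign of overfitting; the exact numbers depend on the data and the seed.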

In a machine learning decision tree, a leaf (also referred to as a terminal node) is a node that does not split further. It represents a final prediction or a class label assigned to a subset of the data. Each leaf node in a decision tree corresponds to a specific outcome or decision.

When constructing a decision tree, the internal nodes represent features or attributes used for splitting the data, and the leaf nodes represent the final decision or prediction based on those splits.

In a classification task, each leaf node represents a specific class label or category that the model assigns to the instances that reach that node. For example, in a binary classification problem where the goal is to classify whether an email is spam or not, a leaf node could represent either the "spam" class or the "not spam" class.

In a regression task, the leaf nodes represent the predicted continuous values for the instances that reach those nodes. For example, in a decision tree predicting house prices based on features such as size and location, a leaf node may represent a specific predicted house price.

Regarding "gini" in machine learning, it refers to the Gini impurity. Gini impurity is a measure of impurity or disorder used in decision tree algorithms to determine the quality of a split. It quantifies the likelihood of misclassifying a randomly chosen element from the dataset if it were randomly labeled according to the class distribution in the subset.

In the context of decision trees, the Gini impurity is calculated for each potential split on a feature, and the split with the lowest Gini impurity is chosen as the best split. The goal is to minimize the Gini impurity and create splits that separate the data into pure or homogeneous subsets in terms of the target variable.

A Gini impurity value of 0 indicates a completely pure node where all instances belong to the same class. For a two-class problem such as this one, the maximum value is 0.5, reached when the two classes are equally represented in the node. With k classes the maximum is 1 - 1/k.

The "gini" criterion is commonly used in decision tree algorithms, such as the CART (Classification and Regression Trees) algorithm, to determine the splitting criteria and build the tree based on minimizing impurity or maximizing information gain.
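The Gini calculation described above is small enough to write out by hand. This is an illustrative sketch in plain Python; the function names gini_impurity and split_impurity are made up for this example and are not Scikit-learn functions:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted average impurity of the two child nodes of a split."""
    total = len(left) + len(right)
    return (len(left) / total) * gini_impurity(left) \
        + (len(right) / total) * gini_impurity(right)

pure = gini_impurity([0, 0, 0, 0])        # one class only
balanced = gini_impurity([0, 0, 1, 1])    # two classes, evenly mixed
three_way = gini_impurity([0, 1, 2])      # three classes, evenly mixed
candidate = split_impurity([0, 0, 0, 1], [1, 1, 1, 1])

print(pure, balanced, round(three_way, 4), candidate)
```

When choosing a split, the tree compares values like candidate across all possible splits and keeps the one with the lowest weighted impurity.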

In [8]:
model_dt.fit(x_train,y_train)

DecisionTreeClassifier(max_depth=6, min_samples_leaf=8, random_state=100)

The line of code model_dt.fit(x_train, y_train) trains the decision tree model (model_dt) using the training data (x_train and y_train). The fit() method is a common function in machine learning libraries that trains a model on the provided data.

In this case, the fit() method is called on the model_dt object, which is an instance of the DecisionTreeClassifier class. The x_train parameter represents the training features, which are the input variables used to make predictions. The y_train parameter represents the corresponding target variable or the expected output for each training instance.

By calling fit(x_train, y_train), the decision tree model is trained to learn the patterns and relationships between the features (x_train) and the target variable (y_train). It adjusts its internal parameters based on the training data, optimizing itself to make accurate predictions.

After the training process is complete, the decision tree model (model_dt) is ready to make predictions on new, unseen data.

In [9]:
y_pred=model_dt.predict(x_test)
y_pred

array([0, 0, 1, ..., 0, 0, 0], dtype=int64)

The line of code y_pred = model_dt.predict(x_test) uses the trained decision tree model (model_dt) to make predictions on the test data (x_test). The predict() method is a common function in machine learning libraries that predicts the target variable or class labels based on the provided input features.

In this case, the predict() method is called on the model_dt object with x_test as the parameter. x_test represents the test data or the input features for which we want to make predictions.

By executing y_pred = model_dt.predict(x_test), the decision tree model uses the learned patterns and rules from the training phase to predict the target variable or class labels for the test data. The predicted values are stored in the y_pred variable.

After making predictions, y_pred will contain the predicted class labels or target variable values for the corresponding instances in the x_test dataset. These predicted values can be compared with the actual target variable values (y_test) to evaluate the performance of the model.


In [10]:
model_dt.score(x_test,y_test)

0.7818052594171997

The line of code model_dt.score(x_test, y_test) calculates the accuracy score of the decision tree model (model_dt) on the test data (x_test and y_test). The score() method is a convenient function provided by many machine learning libraries to evaluate the performance of a model.

In this case, the score() method is called on the model_dt object with x_test and y_test as the parameters. x_test represents the test features, and y_test represents the corresponding true target variable values or class labels.

The score() method internally uses the model to make predictions on the x_test data and then compares the predicted values with the true y_test values. It calculates the accuracy of the model by determining the proportion of correctly predicted instances.

The return value of model_dt.score(x_test, y_test) is the accuracy score, which is a value between 0 and 1. A score of 1 indicates that the model made all predictions correctly, while a score of 0 means that the model did not predict any instances correctly.

By evaluating the model's accuracy score on the test data, we can assess how well the decision tree model generalizes to unseen examples and how accurately it predicts the target variable values or class labels.

An accuracy score of 0.7818052594171997 indicates that approximately 78.18% of the instances in the test data were correctly predicted by the model. While it's difficult to provide a definitive judgment without additional context, here are some general guidelines to help interpret the score:

If this accuracy score is significantly higher than a random or baseline prediction, it can be considered reasonably good. However, the definition of "reasonably good" depends on the specific problem and the expectations for accuracy in that domain.

If there are other models or benchmarks available, it would be useful to compare the accuracy score of model_dt with those models. If the score is comparable or higher, it suggests that the decision tree model is performing well relative to the alternatives.

Additionally, it is important to consider the problem domain and the associated requirements. Some domains may require high accuracy, while others may prioritize different metrics such as precision, recall, or F1-score. Evaluating the model's performance using multiple metrics can provide a more comprehensive assessment.

In summary, an accuracy score of 0.7818052594171997 can be considered relatively good depending on the specific context, but it is always recommended to consider additional factors, such as the problem domain and comparison with other models or benchmarks, for a more accurate interpretation.








In [11]:
print(classification_report(y_test, y_pred, labels=[0,1]))

              precision    recall  f1-score   support

           0       0.82      0.89      0.86      1023
           1       0.63      0.49      0.55       384

    accuracy                           0.78      1407
   macro avg       0.73      0.69      0.70      1407
weighted avg       0.77      0.78      0.77      1407




The classification_report() function is a utility in the Scikit-learn library that generates a text report showing various evaluation metrics for a classification model. In this case, the classification_report() function is used to print a report for the predicted labels (y_pred) compared to the true labels (y_test).

The line of code print(classification_report(y_test, y_pred, labels=[0,1])) generates a classification report with specific labels [0, 1]. The report includes several evaluation metrics such as precision, recall, F1-score, and support for each class label.

Here's an explanation of some key metrics in the classification report:

Precision: Precision measures the proportion of true positive predictions (correctly predicted positive instances) out of all instances predicted as positive. A high precision indicates a low false positive rate.

Recall: Recall, also known as sensitivity or true positive rate, measures the proportion of true positive predictions out of all actual positive instances. A high recall indicates a low false negative rate.

F1-score: The F1-score is the harmonic mean of precision and recall. It provides a balanced measure of both precision and recall.

Support: Support represents the number of instances in each class. It indicates the number of actual occurrences of each class label in the test data.
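The first three definitions above can be written compactly in terms of counts for a given class. Here TP, FP, and FN stand for true positives, false positives, and false negatives; these symbols are shorthand introduced for this formula:

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$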

By specifying the labels parameter as [0, 1], the classification report will be generated only for these specific labels.

Printing the classification report helps assess the performance of the model on each class label individually, providing insights into how well the model is performing for different classes.

The provided classification report gives a detailed evaluation of the model's performance for two class labels, 0 and 1. Let's analyze the key metrics and what they mean in this context:

For class label 0:

* Precision: 0.82, meaning that 82% of the instances predicted as class 0 actually belong to class 0.
* Recall: 0.89, meaning that 89% of the actual class 0 instances were correctly identified by the model. Only about 11% were mistaken for class 1.
* F1-score: 0.86, the harmonic mean of precision and recall, indicating reasonably good performance on this class.
* Support: 1023, the number of actual occurrences of class 0 in the test data.

For class label 1:

* Precision: 0.63, meaning that 63% of the instances predicted as class 1 actually belong to class 1. More than a third of the class 1 predictions are false alarms.
* Recall: 0.49, meaning that only 49% of the actual class 1 instances were identified by the model. The model misses about half of them, which is a high false negative rate.
* F1-score: 0.55, reflecting the weak precision and recall on this class.
* Support: 384, the number of actual occurrences of class 1 in the test data.

Overall:

* Accuracy: 0.78, meaning that approximately 78% of the instances in the test data were predicted correctly.
* Macro avg: the unweighted mean of the per-class values, giving equal weight to each class. Here the macro average precision, recall, and F1-score are 0.73, 0.69, and 0.70, respectively.
* Weighted avg: the mean of the per-class values weighted by support (number of instances in each class). Here the weighted average precision, recall, and F1-score are 0.77, 0.78, and 0.77, respectively.

In summary, the model performs clearly better for class 0, with higher precision, recall, and F1-score, and struggles with class 1. It's important to consider the specific requirements of the problem and the trade-offs between precision and recall when interpreting these results.
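The averages in the report, and a simple baseline to compare the accuracy against, can be recomputed from the per-class rows. The sketch below copies the rounded figures from the report above, so its results agree with the report only up to rounding; the helper names are made up for this example:

```python
# Per-class figures copied from the report above (already rounded to two decimals).
report = {
    0: {"precision": 0.82, "recall": 0.89, "support": 1023},
    1: {"precision": 0.63, "recall": 0.49, "support": 384},
}
total = sum(row["support"] for row in report.values())

def macro_avg(metric):
    """Plain mean over classes: every class counts equally."""
    return sum(row[metric] for row in report.values()) / len(report)

def weighted_avg(metric):
    """Mean over classes weighted by how many test rows each class has."""
    return sum(row[metric] * row["support"] for row in report.values()) / total

# Accuracy of a model that always predicts the most frequent class.
majority_baseline = max(row["support"] for row in report.values()) / total

print("macro precision:   ", round(macro_avg("precision"), 3))
print("weighted precision:", round(weighted_avg("precision"), 3))
print("weighted recall:   ", round(weighted_avg("recall"), 3))
print("majority baseline: ", round(majority_baseline, 3))
```

The baseline comes out at roughly 0.73, so an accuracy of 0.78 is only a modest improvement over always predicting class 0. This is why the per-class recall is the more informative number here.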

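The dataset is imbalanced: class 1 has far fewer instances than class 0, and the model has a relatively high false negative rate on it. The next step is to address this imbalance by resampling.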

In [12]:
sm = SMOTEENN()
X_resampled, y_resampled = sm.fit_resample(x,y)

The code sm = SMOTEENN() initializes an instance of the SMOTEENN class, which is a combination of the Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbors (ENN).

SMOTEENN is a technique commonly used in imbalanced classification problems to address the issue of class imbalance. It first oversamples the minority class with SMOTE, then uses ENN to remove samples whose nearest neighbors mostly belong to a different class, which yields a more balanced and cleaner dataset.

The fit_resample() method of SMOTEENN is then called on the input features x and target variable y. This method resamples the data by applying both the SMOTE and ENN steps. Older releases of imbalanced-learn exposed this method as fit_sample(), which has since been removed.

The resampled data is returned as X_resampled (the new feature matrix) and y_resampled (the new target variable array), which can be used for training machine learning models on the balanced dataset to mitigate the impact of class imbalance.

In [13]:
xr_train,xr_test,yr_train,yr_test=train_test_split(X_resampled, y_resampled,test_size=0.2)

The code xr_train, xr_test, yr_train, yr_test = train_test_split(X_resampled, y_resampled, test_size=0.2) splits the resampled data (X_resampled and y_resampled) into training and test sets.

The train_test_split() function is a commonly used utility in machine learning libraries, which divides a dataset into random train and test subsets. In this case, the resampled feature matrix X_resampled and target variable array y_resampled are split into four sets: xr_train (training features), xr_test (test features), yr_train (training target variable), and yr_test (test target variable).

The test_size parameter is set to 0.2, which means that 20% of the resampled data will be allocated to the test sets, while the remaining 80% will be used for training. The data splitting is done randomly to ensure a representative distribution of instances in both sets.

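Splitting the data this way lets the model be trained on one portion of the resampled data (xr_train and yr_train) and evaluated on another (xr_test and yr_test). Because the resampling was applied before the split, xr_test is itself made up of resampled rows, including synthetic ones, so scores on it are not directly comparable with the earlier scores on x_test and y_test.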

In [14]:
model_dt_smote=DecisionTreeClassifier(criterion = "gini",random_state = 100,max_depth=6, min_samples_leaf=8)

The line of code model_dt_smote = DecisionTreeClassifier(criterion="gini", random_state=100, max_depth=6, min_samples_leaf=8) creates an instance of the DecisionTreeClassifier class with specific parameters.

Here's an explanation of the parameters used:

* criterion: This parameter specifies the criterion used for splitting the decision tree nodes. In this case, "gini" is used, which refers to the Gini impurity criterion. Gini impurity measures the degree of impurity in a node, and the decision tree algorithm aims to minimize it during the tree construction process.

* random_state: This parameter sets the random seed for reproducibility. By setting it to 100, the random number generation for the decision tree will be consistent across different runs, resulting in the same tree structure if the other parameters and data remain unchanged.

* max_depth: This parameter determines the maximum depth or the maximum number of levels in the decision tree. In this case, the maximum depth is set to 6, limiting the tree's complexity and preventing it from growing too deep.

* min_samples_leaf: This parameter sets the minimum number of samples required to be at a leaf node. It controls the minimum size of leaf nodes in the decision tree. In this case, a minimum of 8 samples is required for a node to be considered as a leaf.

By creating an instance of DecisionTreeClassifier with these parameters, model_dt_smote is initialized as a decision tree classifier that will use the Gini criterion for splitting nodes, have a maximum depth of 6, and a minimum of 8 samples at each leaf node. This model will be trained and evaluated using the resampled data obtained through the SMOTEENN technique.

In [15]:
model_dt_smote.fit(xr_train,yr_train)
yr_predict = model_dt_smote.predict(xr_test)
model_score_r = model_dt_smote.score(xr_test, yr_test)
print(model_score_r)
print(metrics.classification_report(yr_test, yr_predict))

0.934412265758092

              precision    recall  f1-score   support

           0       0.97      0.88      0.93       540
           1       0.91      0.98      0.94       634

    accuracy                           0.93      1174
   macro avg       0.94      0.93      0.93      1174
weighted avg       0.94      0.93      0.93      1174




The code above trains the model_dt_smote decision tree classifier on the resampled training data, then prints its accuracy and classification report for the resampled test data. Line by line:

* The fit() method trains the classifier on the resampled training data (xr_train as features and yr_train as target values). This step constructs the decision tree.
* The predict() method uses the trained model to make predictions on the resampled test data (xr_test). The predicted values are stored in yr_predict.
* The score() method calculates the accuracy of the model on the resampled test data (xr_test and yr_test), that is, the proportion of correctly predicted instances.
* print(model_score_r) prints that accuracy score.
* The classification_report() function from the metrics module generates a text report for the predictions (yr_test as true labels and yr_predict as predicted labels), with precision, recall, F1-score, and support for each class.

By printing the accuracy score and classification report, you can assess the performance of the decision tree classifier (model_dt_smote) on the resampled test data and gain insights into its predictive capabilities for each class label.



The provided output includes the accuracy score and classification report for the model_dt_smote decision tree classifier on the resampled test data. Let's analyze the results:

Accuracy Score: The accuracy score is 0.934412265758092, which means that approximately 93.44% of the instances in the resampled test data were correctly predicted by the model. This indicates a high level of accuracy in the predictions.

Classification Report:

For class label 0:

* Precision: 0.97, meaning that 97% of the instances predicted as class 0 actually belong to class 0.
* Recall: 0.88, meaning that 88% of the actual class 0 instances were correctly identified by the model.
* F1-score: 0.93, the harmonic mean of precision and recall, indicating good performance.
* Support: 540, the number of actual occurrences of class 0 in the resampled test data.

For class label 1:

* Precision: 0.91, meaning that 91% of the instances predicted as class 1 actually belong to class 1.
* Recall: 0.98, meaning that 98% of the actual class 1 instances were correctly identified by the model. The false negative rate is very low.
* F1-score: 0.94, reflecting high precision and recall together.
* Support: 634, the number of actual occurrences of class 1 in the resampled test data.

Overall:

* The weighted average precision, recall, and F1-score are 0.94, 0.93, and 0.93, respectively. The weighted average accounts for the different number of instances in each class.
* The macro average precision, recall, and F1-score are 0.94, 0.93, and 0.93, respectively. The macro average gives equal weight to each class.
* The accuracy, precision, recall, and F1-score values are high for both class labels.

In summary, the model_dt_smote decision tree classifier shows strong performance on the resampled test data, with high accuracy and good precision, recall, and F1-score for both class labels. These results suggest that the model is effective at classifying instances in the resampled data.

In this churn prediction problem, class 0 represents customers who do not churn (stay with the service), and class 1 represents customers who churn (cancel their subscription). The confusion matrix and classification report therefore show how well the model predicts each of these two groups.

In [16]:
print(metrics.confusion_matrix(yr_test, yr_predict))

[[477  63]
 [ 14 620]]


The confusion matrix is a 2x2 matrix representing the two class labels (0 and 1). Let's break it down:

True Negatives (TN): The number of instances that were correctly predicted as class 0. In this case, there are 477 true negatives, which means 477 instances that were actually class 0 were correctly classified as class 0.

False Positives (FP): The number of instances that were incorrectly predicted as class 1 (false alarms or type I errors). In this case, there are 63 false positives, which means 63 instances that were actually class 0 were wrongly classified as class 1.

False Negatives (FN): The number of instances that were incorrectly predicted as class 0 (misses or type II errors). In this case, there are 14 false negatives, which means 14 instances that were actually class 1 were wrongly classified as class 0.

True Positives (TP): The number of instances that were correctly predicted as class 1. In this case, there are 620 true positives, which means 620 instances that were actually class 1 were correctly classified as class 1.

The confusion matrix provides important information about the performance of the model, allowing for a more detailed analysis of the model's ability to correctly classify instances into their respective classes.

The code print(metrics.confusion_matrix(yr_test, yr_predict)) generates the confusion matrix for the predictions made by the model_dt_smote decision tree classifier on the resampled test data. The confusion matrix provides a tabular representation of the model's performance by counting the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
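As a quick check, the four counts can be unpacked from the printed matrix and turned back into the headline metrics. The sketch below hard-codes the matrix shown in the output above (rather than recomputing it from `yr_test` and `yr_predict`), so the variable names are assumptions made for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above: rows are true labels, columns are predictions.
cm = np.array([[477, 63],
               [14, 620]])

# For a binary problem, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_class1 = tp / (tp + fp)   # of everything predicted as class 1, how much was right
recall_class1 = tp / (tp + fn)      # of all true class 1 instances, how many were found

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"accuracy={accuracy:.3f}, precision(1)={precision_class1:.3f}, recall(1)={recall_class1:.3f}")
```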

###### Now we see much better results: accuracy of about 93%, and very good recall, precision, and F1-score for the minority class.

###### Let's try with some other classifier.

#### Random Forest Classifier

Random Forest Classifier and Decision Tree are both popular machine learning algorithms used for classification tasks, but they have some key differences. Here's a comparison between Random Forest Classifier and Decision Tree:

Decision Tree:

- A Decision Tree is a simple and interpretable algorithm that predicts the target variable by recursively partitioning the feature space into smaller regions based on the feature values.
- It makes decisions based on a series of binary splits at each node of the tree, choosing the feature and threshold that give the best value of the split criterion (e.g., Gini impurity or information gain).
- Decision Trees tend to be prone to overfitting, meaning they can create complex trees that fit the training data too closely and may not generalize well to unseen data.

Random Forest Classifier:

- Random Forest is an ensemble learning method that builds many Decision Trees and aggregates their predictions to make the final prediction.
- Each Decision Tree in the Random Forest is trained on a bootstrap sample of the training data (sampling with replacement), and a random subset of the features is considered at each split.
- Random Forest reduces the overfitting problem of individual Decision Trees by averaging the predictions of multiple trees, which helps to improve the model's generalization performance.
- It can handle high-dimensional data and capture complex interactions between features.

Advantages of Decision Tree:

- Decision Trees are easy to understand and interpret, providing human-readable rules that explain the decision-making process.
- They can handle both numerical and categorical features without requiring extensive data preprocessing.
- Decision Trees can capture non-linear relationships between features and the target variable.

Advantages of Random Forest Classifier:

- Random Forest combines the predictions of multiple Decision Trees, leading to more robust and accurate predictions.
- It reduces the risk of overfitting by introducing randomness in the model training process.
- Random Forest can handle large datasets with high dimensionality and noisy data.

In summary, Decision Trees are simple and interpretable but can be prone to overfitting, while Random Forest Classifier is an ensemble of Decision Trees that provides better generalization performance and handles complex datasets. However, Random Forest may be more computationally expensive and less interpretable compared to a single Decision Tree. The choice between the two algorithms depends on the specific requirements of the problem and the trade-offs between interpretability, accuracy, and computational resources.
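The mechanism described above (bootstrap samples, random feature subsets, aggregated predictions) can be sketched by hand with plain decision trees. This is a simplified illustration on synthetic data, not how scikit-learn implements `RandomForestClassifier` internally: in particular, scikit-learn averages the trees' predicted probabilities, whereas this sketch uses a hard majority vote. The dataset and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=600, n_features=12, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd number, so a majority vote between two classes cannot tie
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes each split consider only a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Collect every tree's prediction, then take the majority vote per test sample.
votes = np.stack([tree.predict(X_test) for tree in trees])  # shape: (n_trees, n_test)
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)

ensemble_acc = (ensemble_pred == y_test).mean()
single_tree_acc = (trees[0].predict(X_test) == y_test).mean()
print(f"first tree alone: {single_tree_acc:.3f}, majority vote of {n_trees} trees: {ensemble_acc:.3f}")
```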

In general, when it comes to predicting churn rate, Random Forest Classifier tends to perform better than a single Decision Tree. Here's why:

Handling complex relationships: Churn prediction involves analyzing various factors and their interactions that contribute to customer churn. Random Forest Classifier can handle complex relationships between features by combining multiple Decision Trees. It captures different patterns and interactions present in the data, leading to improved predictive performance.

Handling high-dimensional data: Churn prediction often involves working with datasets that have a high number of features or attributes. Random Forest Classifier copes well with high-dimensional data and is less prone to overfitting than a single tree. It considers a random subset of features at each split, so different trees rely on different features, which reduces the risk of spurious correlations and improves generalization.

Dealing with class imbalance: Churn prediction tasks often suffer from class imbalance, where the number of churned customers (positive class) is significantly smaller than the number of non-churned customers (negative class). Averaging over many trees can make a Random Forest more stable than a single Decision Tree on imbalanced data, but bootstrapping and random feature selection do not remove the problem on their own. As the results below show, recall for the minority class stays low until the data is resampled (here with SMOTEENN); setting class_weight='balanced' is another common option.

Robustness and generalization: Random Forest Classifier reduces the risk of overfitting by aggregating predictions from multiple Decision Trees. It combines the strengths of individual trees and reduces the impact of noise or outliers in the data. This leads to more robust and accurate predictions, improving the model's ability to generalize well to unseen data.

Overall, due to its ability to handle complex relationships, high-dimensional data, class imbalance, and improve generalization, Random Forest Classifier is generally considered a better choice for churn rate prediction compared to a single Decision Tree. However, it's important to note that the performance of the algorithm can still depend on the specific characteristics of the dataset and the tuning of hyperparameters. It's always recommended to experiment and evaluate different algorithms to find the best solution for a particular churn prediction problem.
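One way to follow the advice to experiment is to cross-validate both classifiers on the same data and compare accuracy with minority-class recall. The sketch below uses a synthetic, imbalanced dataset as a stand-in for the churn data (an assumption; the sample sizes, class weights, and hyperparameters are chosen only for illustration), so the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for churn data: roughly 25% positive (churn) class.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=6,
    weights=[0.75, 0.25], random_state=0,
)

models = {
    "decision_tree": DecisionTreeClassifier(max_depth=6, min_samples_leaf=8, random_state=100),
    "random_forest": RandomForestClassifier(
        n_estimators=100, max_depth=6, min_samples_leaf=8, random_state=100
    ),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validation, scoring accuracy and recall of the positive class.
    cv = cross_validate(model, X, y, cv=5, scoring=["accuracy", "recall"])
    results[name] = {
        "accuracy": cv["test_accuracy"].mean(),
        "recall": cv["test_recall"].mean(),
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, recall={scores['recall']:.3f}")
```

Looking at recall alongside accuracy matters here because a model can reach high accuracy on imbalanced data while still missing many churners.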

In [17]:
from sklearn.ensemble import RandomForestClassifier

The code snippet from sklearn.ensemble import RandomForestClassifier imports the RandomForestClassifier class from the sklearn.ensemble module in scikit-learn.

Ensemble learning refers to the technique of combining multiple machine learning models to make predictions. Random Forest is an ensemble learning method that combines multiple decision trees to create a more robust and accurate model.

The RandomForestClassifier class in scikit-learn is an implementation of the Random Forest algorithm designed for classification tasks. Like other scikit-learn classifiers, it exposes the standard estimator interface (fit, predict, predict_proba, score), with the score method provided by the ClassifierMixin base class.

By importing RandomForestClassifier, you gain access to the functionality and parameters needed to create, train, and use a Random Forest Classifier in your code. You can create an instance of the RandomForestClassifier class and configure its parameters to customize the behavior of the classifier.

For example, you can set parameters such as n_estimators (the number of decision trees in the Random Forest), max_depth (the maximum depth of each decision tree), min_samples_leaf (the minimum number of samples required to be at a leaf node), and many more. These parameters allow you to control the complexity, generalization ability, and performance of the Random Forest Classifier.

Once you have created an instance of the RandomForestClassifier, you can use it to train the model on labeled training data, make predictions on new unseen data, and evaluate the performance of the classifier using various evaluation metrics.

In summary, importing RandomForestClassifier from sklearn.ensemble provides you with the necessary tools and functionality to create and use a Random Forest Classifier for classification tasks in scikit-learn.
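The parameters and workflow described above can be tried out on a small synthetic dataset before touching the churn data. In this sketch the data (`X_demo`, `y_demo`) and the choice of 50 trees are assumptions made for illustration; the attributes used (`get_params`, `estimators_`, `predict_proba`, `feature_importances_`) are part of scikit-learn's public API.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Configure the forest: number of trees, depth limit, and minimum leaf size.
clf = RandomForestClassifier(
    n_estimators=50, max_depth=6, min_samples_leaf=8, random_state=100
)
clf.fit(X_demo, y_demo)

# Inspect the configuration and the fitted ensemble.
params = clf.get_params()
depths = [tree.get_depth() for tree in clf.estimators_]
print("n_estimators:", params["n_estimators"], "| fitted trees:", len(clf.estimators_))
print("deepest tree:", max(depths))

# Class probabilities for the first three rows, and per-feature importances.
proba = clf.predict_proba(X_demo[:3])
importances = clf.feature_importances_
print(proba)
print(importances.round(3))
```

Lowering `max_depth` or raising `min_samples_leaf` makes each tree simpler, which usually trades some training accuracy for better generalization.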

In [18]:
model_rf=RandomForestClassifier(n_estimators=100, criterion='gini', random_state = 100,max_depth=6, min_samples_leaf=8)

In [19]:
model_rf.fit(x_train,y_train)

RandomForestClassifier(max_depth=6, min_samples_leaf=8, random_state=100)

In [20]:
y_pred=model_rf.predict(x_test)

In [21]:
model_rf.score(x_test,y_test)

0.7953091684434968

In [22]:
print(classification_report(y_test, y_pred, labels=[0,1]))

              precision    recall  f1-score   support

           0       0.82      0.92      0.87      1023
           1       0.69      0.45      0.55       384

    accuracy                           0.80      1407
   macro avg       0.75      0.69      0.71      1407
weighted avg       0.78      0.80      0.78      1407




In [23]:
sm = SMOTEENN()
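# fit_resample is the current imbalanced-learn method name (fit_sample is not available in recent releases)
X_resampled1, y_resampled1 = sm.fit_resample(x, y)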

In [24]:
xr_train1,xr_test1,yr_train1,yr_test1=train_test_split(X_resampled1, y_resampled1,test_size=0.2)

In [25]:
model_rf_smote=RandomForestClassifier(n_estimators=100, criterion='gini', random_state = 100,max_depth=6, min_samples_leaf=8)

In [26]:
model_rf_smote.fit(xr_train1,yr_train1)

RandomForestClassifier(max_depth=6, min_samples_leaf=8, random_state=100)

In [27]:
yr_predict1 = model_rf_smote.predict(xr_test1)

In [28]:
model_score_r1 = model_rf_smote.score(xr_test1, yr_test1)

In [29]:
print(model_score_r1)
print(metrics.classification_report(yr_test1, yr_predict1))

0.9427350427350427

              precision    recall  f1-score   support

           0       0.95      0.92      0.93       518
           1       0.94      0.96      0.95       652

    accuracy                           0.94      1170
   macro avg       0.94      0.94      0.94      1170
weighted avg       0.94      0.94      0.94      1170




In [30]:
print(metrics.confusion_matrix(yr_test1, yr_predict1))

[[478  40]
 [ 27 625]]


###### With the RF Classifier we also get quite good results, in fact better than with the Decision Tree.

###### We could go further and try more classifiers to compare model performance, but that's not covered here, so you can try it yourself :)

#### Performing PCA

In [31]:
# Applying PCA
from sklearn.decomposition import PCA
pca = PCA(0.9)  # keep enough components to explain 90% of the variance
xr_train_pca = pca.fit_transform(xr_train1)
xr_test_pca = pca.transform(xr_test1)
explained_variance = pca.explained_variance_ratio_
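Passing a float between 0 and 1 to `PCA` asks scikit-learn to keep the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below illustrates this on synthetic data built from three latent factors (the data and the names `X_demo` and `pca_demo` are assumptions for illustration, not the churn features).

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 observed features driven by 3 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(500, 10))

# Keep just enough components to explain at least 90% of the variance.
pca_demo = PCA(0.9)
X_reduced = pca_demo.fit_transform(X_demo)

cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print("components kept:", pca_demo.n_components_)
print("cumulative explained variance:", cumulative.round(3))
```

Keeping 90% of the variance does not guarantee that the discarded directions were irrelevant to the target, which is one plausible reason a classifier can score lower after PCA.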

In [32]:
model=RandomForestClassifier(n_estimators=100, criterion='gini', random_state = 100,max_depth=6, min_samples_leaf=8)

In [33]:
model.fit(xr_train_pca,yr_train1)

RandomForestClassifier(max_depth=6, min_samples_leaf=8, random_state=100)

In [34]:
yr_predict_pca = model.predict(xr_test_pca)

In [35]:
model_score_r_pca = model.score(xr_test_pca, yr_test1)

In [36]:
print(model_score_r_pca)
print(metrics.classification_report(yr_test1, yr_predict_pca))

0.7239316239316239

              precision    recall  f1-score   support

           0       0.72      0.61      0.66       518
           1       0.72      0.81      0.77       652

    accuracy                           0.72      1170
   macro avg       0.72      0.71      0.71      1170
weighted avg       0.72      0.72      0.72      1170




##### With PCA the results got worse (accuracy dropped from about 0.94 to 0.72), so let's finalise the model created by the RF Classifier with SMOTEENN and save it so that we can use it at a later stage :)

#### Pickling the model

In [37]:
import pickle

In [38]:
filename = 'model.sav'

In [39]:
# Use a context manager so the file is closed once the model is written
with open(filename, 'wb') as f:
    pickle.dump(model_rf_smote, f)

In [40]:
# Load the model back, closing the file afterwards
with open(filename, 'rb') as f:
    load_model = pickle.load(f)

In [41]:
model_score_r1 = load_model.score(xr_test1, yr_test1)

In [42]:
model_score_r1

0.9427350427350427

##### Our final model, i.e. the RF Classifier with SMOTEENN, is now ready and saved in model.sav. We will use it to build APIs so that the model can be accessed from the UI.
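Before wiring the saved file into an API, it can be worth confirming that a pickled model predicts exactly like the original. The following is a minimal round-trip sketch on a small synthetic model; the file name `demo_model.sav`, the temporary directory, and the demo data are assumptions for illustration, not part of the notebook.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model (illustrative only).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
demo_model = RandomForestClassifier(n_estimators=20, random_state=100).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.sav")

    # Save: the context manager closes the file once the dump is complete.
    with open(path, "wb") as f:
        pickle.dump(demo_model, f)

    # Load the model back from disk.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(demo_model.predict(X_demo), restored.predict(X_demo))
print("round trip preserved predictions:", same_predictions)
```

Two practical caveats: only unpickle files from a source you trust, and load the model with the same scikit-learn version that was used to save it, since pickles are not guaranteed to be compatible across versions.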