# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox
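
To make the metric choice above more concrete, here is a minimal sketch of how accuracy, precision, recall, and F1 come out of the same set of predictions. The true values are the ones listed in the assignment; the predicted values are hypothetical and only serve as an illustration, and `classification_metrics` is a helper name made up for this sketch.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-churners correctly passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed churners

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


y_true = [1, 0, 0, 1, 0]  # true values given in the assignment
y_pred = [1, 0, 1, 0, 0]  # hypothetical predictions, for illustration only

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With these made-up predictions, accuracy is 0.6 while precision and recall are both 0.5, which shows how accuracy alone can look better than the model's ability to find churners.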

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
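
For the percentile challenge above, one possible approach is to sort the churn probabilities predicted on the training data and count how many fall at or below the new prediction. The sketch below uses only the standard library; `percentile_of_score` is a hypothetical helper name and the training probabilities are made-up values, not results from this notebook.

```python
from bisect import bisect_right


def percentile_of_score(train_probs, prob):
    """Percentage of training probabilities that are less than or equal to `prob`."""
    ordered = sorted(train_probs)
    # bisect_right returns how many sorted values are <= prob
    return 100.0 * bisect_right(ordered, prob) / len(ordered)


# Made-up training probabilities, for illustration only
train_probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.95]

pct = percentile_of_score(train_probs, 0.78)
print(f"A churn probability of 0.78 is at percentile {pct:.0f} of the training predictions")
```

In a real module, `train_probs` would come from running the saved model on the training set once and storing the result next to the model file.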

In [3]:
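import pandas as pd
# Compatibility shim: pandas 2.x removed DataFrame.iteritems, which some older libraries still call
pd.DataFrame.iteritems = pd.DataFrame.items
df = pd.read_csv(r"C:\Users\SAI29\Downloads\cleaned_churn_data.csv")
df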

Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,TotalCharges_scaled,ChargesPerTenure,MonthlyChargesPercentage
0,1,1,0,0,29.85,29.85,1,-0.994194,29.850000,100.000000
1,34,0,1,1,56.95,1889.50,1,-0.173740,55.573529,3.014025
2,2,0,0,1,53.85,108.15,0,-0.959649,54.075000,49.791956
3,45,1,1,3,42.30,1840.75,1,-0.195248,40.905556,2.297976
4,2,0,0,0,70.70,151.65,0,-0.940457,75.825000,46.620508
...,...,...,...,...,...,...,...,...,...,...
7027,24,0,1,1,84.80,1990.50,1,-0.129180,82.937500,4.260236
7028,72,0,1,2,103.20,7362.90,1,2.241056,102.262500,1.401622
7029,11,1,0,0,29.60,346.45,1,-0.854514,31.495455,8.543801
7030,4,0,0,1,74.40,306.60,0,-0.872095,76.650000,24.266145


In [1]:
!pip install pycaret

Collecting pycaret
  Obtaining dependency information for pycaret from https://files.pythonhosted.org/packages/3e/6f/b3d59fac3869a7685e68aecdd35c336800bce8c8d3b45687bb82cf9a2848/pycaret-3.3.2-py3-none-any.whl.metadata
  Using cached pycaret-3.3.2-py3-none-any.whl.metadata (17 kB)

In [2]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [4]:
automl = setup(df, target='Churn')

Unnamed: 0,Description,Value
0,Session id,4904
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7032, 10)"
4,Transformed data shape,"(7032, 10)"
5,Transformed train set shape,"(4922, 10)"
6,Transformed test set shape,"(2110, 10)"
7,Numeric features,9
8,Preprocess,True
9,Imputation type,simple


In [8]:
print(automl.dataset)

      tenure  PhoneService  Contract  PaymentMethod  MonthlyCharges  \
363       47             0         1              1       20.549999   
6924      65             0         1              2      110.800003   
4413      34             0         0              1       88.849998   
2636       5             0         0              2       59.900002   
1848       1             0         0              0       45.150002   
...      ...           ...       ...            ...             ...   
40        10             0         1              1       49.549999   
2916       9             0         0              0       95.500000   
2832       1             0         0              1       20.500000   
3993      39             0         0              0       99.750000   
303       68             1         1              2       60.299999   

      TotalCharges  TotalCharges_scaled  ChargesPerTenure  \
363     945.700012            -0.590133         20.121277   
6924   7245.899902       

In [9]:
from pycaret.classification import ClassificationExperiment

exp1 = ClassificationExperiment()

In [10]:
automl = exp1.setup(data=df, target='Churn', session_id=123)

Unnamed: 0,Description,Value
0,Session id,123
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7032, 10)"
4,Transformed data shape,"(7032, 10)"
5,Transformed train set shape,"(4922, 10)"
6,Transformed test set shape,"(2110, 10)"
7,Numeric features,9
8,Preprocess,True
9,Imputation type,simple


In [11]:
best_model = exp1.compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7981,0.836,0.909,0.8319,0.8685,0.4358,0.4452,0.766
lda,Linear Discriminant Analysis,0.7954,0.8316,0.9078,0.8298,0.8669,0.428,0.4378,0.027
ridge,Ridge Classifier,0.7952,0.8316,0.925,0.8195,0.869,0.4075,0.4244,0.027
gbc,Gradient Boosting Classifier,0.7928,0.8366,0.9048,0.8291,0.8651,0.4211,0.43,0.665
ada,Ada Boost Classifier,0.7889,0.8332,0.8918,0.8327,0.8611,0.4228,0.4278,0.216
lightgbm,Light Gradient Boosting Machine,0.7842,0.8225,0.8915,0.8281,0.8585,0.4063,0.4126,0.244
xgboost,Extreme Gradient Boosting,0.7729,0.8089,0.8796,0.8234,0.8504,0.38,0.3845,0.119
rf,Random Forest Classifier,0.7664,0.7988,0.8705,0.8219,0.8453,0.3685,0.372,0.509
knn,K Neighbors Classifier,0.758,0.745,0.8744,0.8111,0.8414,0.3335,0.3384,0.051
et,Extra Trees Classifier,0.7554,0.78,0.8531,0.8208,0.8365,0.351,0.3526,0.296
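
The table above is sorted by accuracy, which is the default. As a quick check of how much the choice of metric matters, the sketch below copies the cross-validated scores from that table and looks up the top model for each metric. `SCORES`, `COLUMNS`, and `best_by` are names made up for this illustration; they are not part of PyCaret.

```python
# Cross-validated means copied from the compare_models() table above
COLUMNS = ("Accuracy", "AUC", "Recall", "Prec.", "F1")
SCORES = {
    "lr":       (0.7981, 0.8360, 0.9090, 0.8319, 0.8685),
    "lda":      (0.7954, 0.8316, 0.9078, 0.8298, 0.8669),
    "ridge":    (0.7952, 0.8316, 0.9250, 0.8195, 0.8690),
    "gbc":      (0.7928, 0.8366, 0.9048, 0.8291, 0.8651),
    "ada":      (0.7889, 0.8332, 0.8918, 0.8327, 0.8611),
    "lightgbm": (0.7842, 0.8225, 0.8915, 0.8281, 0.8585),
    "xgboost":  (0.7729, 0.8089, 0.8796, 0.8234, 0.8504),
    "rf":       (0.7664, 0.7988, 0.8705, 0.8219, 0.8453),
    "knn":      (0.7580, 0.7450, 0.8744, 0.8111, 0.8414),
    "et":       (0.7554, 0.7800, 0.8531, 0.8208, 0.8365),
}


def best_by(metric):
    """Return the model ID with the highest score in the given metric column."""
    col = COLUMNS.index(metric)
    return max(SCORES, key=lambda model_id: SCORES[model_id][col])


winners = {metric: best_by(metric) for metric in COLUMNS}
print(winners)
```

Logistic Regression only leads on accuracy; the other metrics favor different models, although the margins are small. To have PyCaret rank by another metric directly, `compare_models` accepts a `sort` argument, for example `exp1.compare_models(sort='AUC')`.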


In [12]:
leaderboard = exp1.get_leaderboard()
print(leaderboard)

                            Model Name  \
Index                                    
0                  Logistic Regression   
1               K Neighbors Classifier   
2                          Naive Bayes   
3             Decision Tree Classifier   
4                  SVM - Linear Kernel   
5                     Ridge Classifier   
6             Random Forest Classifier   
7      Quadratic Discriminant Analysis   
8                 Ada Boost Classifier   
9         Gradient Boosting Classifier   
10        Linear Discriminant Analysis   
11              Extra Trees Classifier   
12           Extreme Gradient Boosting   
13     Light Gradient Boosting Machine   
14                    Dummy Classifier   

                                                   Model  Accuracy     AUC  \
Index                                                                        
0      (TransformerWrapper(exclude=None,\n           ...    0.7981  0.8360   
1      (TransformerWrapper(exclude=None,\n         

In [14]:
from pycaret.classification import create_model

# Row 5 of the leaderboard is the Ridge Classifier; the "Model" column holds the fitted pipeline object, not its name
model_pipeline = leaderboard.iloc[5]["Model"]
selected_model = create_model(model_pipeline)

Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.8093,0.8556,0.942,0.8237,0.8789,0.4405,0.4633
1,0.8012,0.8363,0.9061,0.8367,0.87,0.4504,0.457
2,0.8272,0.8459,0.9475,0.8386,0.8898,0.4975,0.5178
3,0.7907,0.8319,0.9061,0.8262,0.8643,0.4107,0.4192
4,0.7886,0.8298,0.9224,0.8142,0.8649,0.3875,0.404
5,0.7866,0.8419,0.9224,0.8122,0.8638,0.3799,0.3969
6,0.8089,0.8589,0.9224,0.8346,0.8763,0.4612,0.4726
7,0.7825,0.8431,0.9114,0.8144,0.8601,0.3784,0.3908
8,0.8028,0.8357,0.9252,0.8267,0.8732,0.4365,0.4508
9,0.7825,0.8286,0.8975,0.8223,0.8583,0.3948,0.4019


In [15]:
best_model

In [16]:
df.iloc[-2:-1].shape

(1, 10)

In [17]:
print(df.iloc[-1].shape)
print(df.iloc[-2:-1].shape) 

(10,)
(1, 10)


In [18]:
from pycaret.classification import predict_model

# Select the second-to-last row (index 7030) as a one-row 2D DataFrame
new_data = df.iloc[-2:-1]

# Predict using the best model
predictions = predict_model(best_model, data=new_data)

# Show predictions
print(predictions)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,1.0,0,0.0,0.0,0.0,,0.0


      tenure  PhoneService  Contract  PaymentMethod  MonthlyCharges  \
7030       4             0         0              1       74.400002   

      TotalCharges  TotalCharges_scaled  ChargesPerTenure  \
7030    306.600006            -0.872095         76.650002   

      MonthlyChargesPercentage  Churn  prediction_label  prediction_score  
7030                 24.266146      0                 0            0.5473  


In [19]:
predict_model(best_model, df.iloc[-2:-1])

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,1.0,0,0.0,0.0,0.0,,0.0


Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,TotalCharges_scaled,ChargesPerTenure,MonthlyChargesPercentage,Churn,prediction_label,prediction_score
7030,4,0,0,1,74.400002,306.600006,-0.872095,76.650002,24.266146,0,0,0.5473


In [20]:
save_model(best_model, 'LR')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges',
                                              'TotalCharges_scaled',
                                              'ChargesPerTenure',
                                              'MonthlyChargesPercentage'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features...
                                                               fill_value=None,
                             

In [21]:
import pickle

with open('LR_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [22]:
with open('LR_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [23]:
new_data = df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)
loaded_model.predict(new_data)

array([0], dtype=int8)

In [24]:
loaded_lr = load_model('LR')

Transformation Pipeline and Model Successfully Loaded


In [26]:
predict_model(loaded_lr, new_data)

Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,TotalCharges_scaled,ChargesPerTenure,MonthlyChargesPercentage,prediction_label,prediction_score
7030,4,0,0,1,74.400002,306.600006,-0.872095,76.650002,24.266146,0,0.5473
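
The assignment asks for the probability of churn, while the output above gives a label and a score. Assuming `prediction_score` is the probability of the predicted label (which is how PyCaret reports it by default, as far as I understand), the churn probability can be recovered as sketched below. `churn_probability` is a hypothetical helper name, not a PyCaret function.

```python
def churn_probability(prediction_label, prediction_score, churn_label=1):
    """Convert a (label, score) pair into the probability of churn.

    Assumes prediction_score is the probability of the predicted label,
    so a row predicted as "no churn" has churn probability 1 - score.
    """
    if prediction_label == churn_label:
        return prediction_score
    return 1.0 - prediction_score


# Values from the row predicted above: label 0 with score 0.5473
p = churn_probability(0, 0.5473)
print(round(p, 4))
```

Under that assumption, the customer at index 7030 has a churn probability of about 0.45, not 0.55. An alternative is to call `predict_model(loaded_lr, new_data, raw_score=True)`, which returns a separate score column per class.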


In [93]:
from IPython.display import Code

Code('Churn.py')

In [94]:
%run Churn.py

Transformation Pipeline and Model Successfully Loaded


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,0.7975,0.8386,0.9097,0.8306,0.8684,0.4336,0.4421


predictions:
0       Yes
1       Yes
2        No
3       Yes
4        No
       ... 
7027    Yes
7028    Yes
7029    Yes
7030     No
7031    Yes
Name: Churn, Length: 7032, dtype: object


# Summary

In this notebook, I used PyCaret to compare classification models for predicting customer churn.
I ranked the models by accuracy, PyCaret's default metric, then saved the best model and reloaded it
to make predictions on new data.

Key steps:
- Used PyCaret for automatic model selection and evaluation.
- Chose Logistic Regression as the best model because it had the highest cross-validated accuracy (0.7981) in `compare_models`.
- Saved and reloaded the trained model.
- Developed a function to predict probabilities of churn for new data.
- Tested the function on 'new_churn_data.csv' and obtained predictions.

Results
- The Logistic Regression model achieved an accuracy of 79.75% in the `Churn.py` run shown above, whose output lists 7,032 predictions.
- The final predictions for the new data were printed, showing 'Yes' and 'No' labels indicating customer churn.