# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
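
The percentile challenge above can be illustrated with a minimal sketch. It uses only the standard library; the function name `percentile_of_score` and the list of training probabilities are hypothetical and stand in for the churn probabilities a real model would produce on the training set.

```python
from bisect import bisect_right


def percentile_of_score(train_probs, new_prob):
    """Return the percentage of training probabilities that are <= new_prob."""
    ordered = sorted(train_probs)
    if not ordered:
        raise ValueError("train_probs must not be empty")
    # bisect_right counts how many sorted values are <= new_prob
    rank = bisect_right(ordered, new_prob)
    return 100.0 * rank / len(ordered)


# Hypothetical training-set churn probabilities, for illustration only
train_probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]

print(percentile_of_score(train_probs, 0.78))
```

With these made-up values, eight of the ten training probabilities are at or below 0.78, so the prediction sits at the 80th percentile.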

# Load data

In [28]:
# Import pandas and load prepared_churn_data.csv from week 2, using customerID as the index

import pandas as pd
df = pd.read_csv('prepared_churn_data.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,tenure_MonthlyCharges_ratio,tenure_TotalCharges_ratio,MonthlyCharges_TotalCharges_ratio
7590-VHVEG,1,0,0,0,29.85,29.85,0,0.033501,0.033501,1.000000
5575-GNVDE,34,1,1,1,56.95,1889.50,0,0.597015,0.017994,0.030140
3668-QPYBK,2,1,0,1,53.85,108.15,1,0.037140,0.018493,0.497920
7795-CFOCW,45,0,1,2,42.30,1840.75,0,1.063830,0.024447,0.022980
9237-HQITU,2,1,0,0,70.70,151.65,1,0.028289,0.013188,0.466205
...,...,...,...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,1,84.80,1990.50,0,0.283019,0.012057,0.042602
2234-XADUH,72,1,1,3,103.20,7362.90,0,0.697674,0.009779,0.014016
4801-JZAZL,11,0,0,0,29.60,346.45,0,0.371622,0.031751,0.085438
8361-LTMKD,4,1,0,1,74.40,306.60,1,0.053763,0.013046,0.242661


# AutoML with pycaret

In [2]:
# Install pycaret from conda-forge
conda install -c conda-forge pycaret -y


Note: you may need to restart the kernel to use updated packages.


In [3]:
# Downgrade scikit-learn from 1.1.1 to 0.23.2 (done from both the Anaconda prompt and Jupyter Notebook)
conda install -c conda-forge scikit-learn=0.23.2


Note: you may need to restart the kernel to use updated packages.


In [29]:
# Import the setup, compare_models, predict_model, save_model, and load_model functions from pycaret
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [30]:
# Set up the AutoML experiment with Churn as the target
automl = setup(df, target='Churn')

,Description,Value
0,session_id,2644
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7043, 10)"
5,Missing Values,False
6,Numeric Features,6
7,Categorical Features,3
8,Ordinal Features,False
9,High Cardinality Features,False


In [31]:
# Show the processed data. setup() returns a tuple, and the position of each element is not obvious:
# other indices returned values such as -1, False, and True. By trial and error, I found the
# processed feature data at index 4.
automl[4]

customerID,tenure,MonthlyCharges,TotalCharges,tenure_MonthlyCharges_ratio,tenure_TotalCharges_ratio,MonthlyCharges_TotalCharges_ratio,PhoneService_1,Contract_0,Contract_1,Contract_2,PaymentMethod_0,PaymentMethod_1,PaymentMethod_2,PaymentMethod_3
4981-FLTMF,57.0,65.199997,3687.850098,0.874233,0.015456,0.017680,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4121-AGSIN,58.0,24.500000,1497.900024,2.367347,0.038721,0.016356,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
7733-UDMTP,57.0,55.000000,3094.050049,1.036364,0.018422,0.017776,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
8387-UGUSU,15.0,20.049999,284.299988,0.748130,0.052761,0.070524,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
8679-LZBMD,44.0,90.650002,3974.149902,0.485383,0.011072,0.022810,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4662-EKDPQ,2.0,62.049999,118.300003,0.032232,0.016906,0.524514,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
7284-BUYEC,5.0,50.950001,229.399994,0.098135,0.021796,0.222101,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2541-YGPKE,42.0,63.700001,2763.350098,0.659341,0.015199,0.023052,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
2984-MIIZL,4.0,74.800003,321.899994,0.053476,0.012426,0.232370,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0


In [32]:
# Compare models. Linear Discriminant Analysis is the best model by accuracy, at 0.7986 (about 79.9%)
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lda,Linear Discriminant Analysis,0.7986,0.8394,0.5129,0.6459,0.57,0.4409,0.447,0.008
lr,Logistic Regression,0.797,0.8424,0.5175,0.6382,0.5699,0.4392,0.4443,0.023
ridge,Ridge Classifier,0.7966,0.0,0.4528,0.6636,0.5369,0.4125,0.4257,0.006
gbc,Gradient Boosting Classifier,0.7882,0.8365,0.4615,0.6309,0.531,0.3988,0.4081,0.211
ada,Ada Boost Classifier,0.7872,0.8366,0.4591,0.6251,0.5286,0.3955,0.4038,0.072
catboost,CatBoost Classifier,0.7868,0.8301,0.4685,0.6244,0.5331,0.3989,0.4071,1.721
lightgbm,Light Gradient Boosting Machine,0.7805,0.8212,0.4794,0.6019,0.5316,0.391,0.3966,0.121
xgboost,Extreme Gradient Boosting,0.7763,0.8127,0.47,0.5897,0.522,0.3786,0.3834,0.235
rf,Random Forest Classifier,0.7696,0.8006,0.4536,0.5747,0.506,0.3586,0.3635,0.154
knn,K Neighbors Classifier,0.7657,0.7406,0.4271,0.5671,0.4864,0.3387,0.3449,0.016
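
The columns in the comparison table can mean quite different things for churn. The LDA row, for example, pairs an accuracy of 0.7986 with a recall of only 0.5129, so roughly half of the actual churners are missed. The sketch below shows how accuracy, precision, recall, and F1 come out of the same set of predictions. The true values are the ones given in the assignment for the new data; the predicted labels are hypothetical, and `classification_metrics` is an illustrative helper, not a pycaret function.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-churners left alone
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


y_true = [1, 0, 0, 1, 0]  # true values for the new data, from the assignment
y_pred = [1, 0, 0, 0, 0]  # hypothetical predictions, for illustration only

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this made-up case the accuracy is 0.8 even though one of the two churners is missed (recall 0.5), which is why a metric other than accuracy may be a better choice when catching churners matters most.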


In [33]:
# Show the best model with its hyperparameters;
# the best model is LinearDiscriminantAnalysis (lda)
best_model

LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
                           solver='svd', store_covariance=False, tol=0.0001)

In [34]:
# Select one row to use as a new observation. df.iloc[-2:-1] keeps the second-to-last row as a DataFrame;
# its shape is (1, 10): 1 row and 10 columns
df.iloc[-2:-1].shape

(1, 10)

In [35]:
# Use predict_model from pycaret to predict with the best model on the one-row DataFrame df.iloc[-2:-1].
# The result adds Label and Score columns, with values 0 and 0.5471 respectively.
predict_model(best_model, df.iloc[-2:-1])

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,tenure_MonthlyCharges_ratio,tenure_TotalCharges_ratio,MonthlyCharges_TotalCharges_ratio,Label,Score
8361-LTMKD,4,1,0,1,74.4,306.6,1,0.053763,0.013046,0.242661,0,0.5471
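
The assignment asks for the probability of churn, which is not necessarily the same as the Score column. The sketch below assumes that Score is the probability of the predicted Label rather than of the churn class; this is my understanding of pycaret 2.x and is consistent with the outputs in this notebook (both Scores are above 0.5), but it should be checked against the installed version. `churn_probability` is a hypothetical helper, not part of pycaret.

```python
def churn_probability(label, score):
    """Convert a (Label, Score) pair into the probability of churn (class 1).

    Assumption: Score is the probability of the predicted label, so for a
    predicted label of 0 the churn probability is 1 - Score.
    """
    return score if label == 1 else 1.0 - score


# (Label, Score) pairs taken from the outputs in this notebook
print(round(churn_probability(0, 0.5471), 4))
print(round(churn_probability(1, 0.5615), 4))
```

Under that assumption, the customer with Label 0 and Score 0.5471 has a churn probability of about 0.4529, which is close to the decision boundary.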


# Saving and loading our model

In [36]:
# Save the transformation pipeline and the LDA model to a pickle file (LDA.pkl)
save_model(best_model, 'LDA')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Churn',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_strate...
                 ('dummy', Dummify(target='Churn')),
                 ('fix_perfect', Remove_100(target='Churn')),
                 ('clean_names', Clean_Colum_Names()),
                 ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
                 ('dfs

In [37]:
# Also save the model directly with pickle: open the file in binary write mode ('wb') and dump the model
import pickle

with open('LDA_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [38]:
# Open LDA_model.pk in binary read mode ('rb') and load the model back
with open('LDA_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)
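
The write-then-read pattern in the two cells above can be checked in isolation. The sketch below round-trips a stand-in object through a temporary file; `model_stub` and `round_trip` are hypothetical names used only for illustration, and the dictionary takes the place of `best_model`. Note that pickling `best_model` directly stores that object only, whereas the `save_model` output above reports that the transformation pipeline was saved along with the model.

```python
import os
import pickle
import tempfile


def round_trip(obj, filename="LDA_model.pk"):
    """Pickle obj to a temporary file in binary mode and load it back."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, filename)
        with open(path, "wb") as f:   # 'wb' = write binary
            pickle.dump(obj, f)
        with open(path, "rb") as f:   # 'rb' = read binary
            return pickle.load(f)


# Stand-in for best_model; a real estimator would be pickled the same way
model_stub = {"name": "LinearDiscriminantAnalysis", "solver": "svd", "tol": 0.0001}

restored = round_trip(model_stub)
print(restored == model_stub)
```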

In [41]:
# Define new_data as a copy of one row and drop the Churn column (axis=1)
new_data = df.iloc[-2:-1].copy() 

new_data.drop('Churn', axis = 1, inplace = True)


In [42]:
# Predict on the new data with the in-memory model
predictions = predict_model(best_model, data=new_data)

In [43]:
# Load the saved pipeline and model from LDA.pkl
loaded_lda = load_model('LDA')

Transformation Pipeline and Model Successfully Loaded


In [44]:
# Predict on the new data with the loaded model
predict_model(loaded_lda, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,tenure_MonthlyCharges_ratio,tenure_TotalCharges_ratio,MonthlyCharges_TotalCharges_ratio,Label,Score
8361-LTMKD,4,1,0,1,74.4,306.6,0.053763,0.013046,0.242661,0,0.5471


# Making a Python module to make predictions

In [96]:
# Display the source code of prediction_churn_data.py
from IPython.display import Code
Code('prediction_churn_data.py')

In [97]:
# Run the script to make predictions on the new data
%run prediction_churn_data.py

Transformation Pipeline and Model Successfully Loaded
predictions:
            tenure  PhoneService  Contract  PaymentMethod  MonthlyCharges  \
customerID                                                                  
7832-POPKP      62             1         0              2           101.7   

            TotalCharges      Churn  tenure_MonthlyCharges_ratio  \
customerID                                                         
7832-POPKP       3106.56  50.105806                          NaN   

            tenure_TotalCharges_ratio  MonthlyCharges_TotalCharges_ratio  \
customerID                                                                 
7832-POPKP                        NaN                                NaN   

            Label   Score  
customerID                 
7832-POPKP      1  0.5615  
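
In the output above, the three ratio columns are NaN and the Churn column holds 50.105806, which equals TotalCharges / tenure (3106.56 / 62). This suggests, though the notebook does not confirm it, that the columns of the new file do not match the columns the model was trained on. A minimal sketch of a column check that could run before predicting is shown below. The expected column names are taken from the training data in this notebook; `check_columns` and the column name `charge_per_tenure` are hypothetical.

```python
# Feature columns the model was trained on (from the prepared week 2 data)
EXPECTED_COLUMNS = [
    "tenure", "PhoneService", "Contract", "PaymentMethod",
    "MonthlyCharges", "TotalCharges",
    "tenure_MonthlyCharges_ratio", "tenure_TotalCharges_ratio",
    "MonthlyCharges_TotalCharges_ratio",
]


def check_columns(actual, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names, preserving order."""
    missing = [col for col in expected if col not in actual]
    unexpected = [col for col in actual if col not in expected]
    return missing, unexpected


# Hypothetical header of a new data file, for illustration only
new_columns = [
    "tenure", "PhoneService", "Contract", "PaymentMethod",
    "MonthlyCharges", "TotalCharges", "charge_per_tenure",
]

missing, unexpected = check_columns(new_columns)
print("missing:", missing)
print("unexpected:", unexpected)
```

With a pandas dataframe, the same check could be called as `check_columns(list(df.columns))` before passing the data to `predict_model`.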


# Summary

The first objective of the project was to run AutoML with pycaret and then save and load the resulting model. After importing pandas, I loaded the prepared week 2 file, prepared_churn_data.csv. Pycaret was installed with conda, and scikit-learn was pinned to version 0.23.2. I imported the setup, compare_models, predict_model, save_model, and load_model functions from pycaret.classification and set Churn as the target in setup. The processed data is available at automl[4]. compare_models reports several metrics for each model and ranks the models by accuracy by default. Linear Discriminant Analysis (lda) was the best model, with an accuracy of 0.7986 (about 79.9%). To test a prediction, I selected a single row with df.iloc[-2:-1], which has shape (1, 10): one row and 10 columns. Using predict_model from pycaret, the model returned a Label of 0 and a Score of 0.5471 for that row. save_model was used to save the model and its transformation pipeline as a pickle file (LDA.pkl). I also saved the model with the pickle module, writing the file in binary mode and then reading it back in binary mode. Finally, I loaded the saved model with load_model and predicted on the new data, which again gave a Label of 0 and a Score of 0.5471.

The second objective of the project was to use a Python file to make predictions. The script prediction_churn_data.py makes predictions for new data, new_churn_data.csv in this case. For the new data, it predicted a Label of 1 with a Score of 0.5615. A GitHub repository was created and the code was uploaded.
 