# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
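
For the percentile challenge above, one possible approach is to sort the training-set probabilities once and locate each new probability with a binary search. The sketch below is only an illustration: `percentile_of` is a hypothetical helper, and the training probabilities are made-up values, not results from the churn data.

```python
from bisect import bisect_right


def percentile_of(value, sorted_training_probs):
    """Percent of training probabilities that are less than or equal to value."""
    rank = bisect_right(sorted_training_probs, value)
    return 100.0 * rank / len(sorted_training_probs)


# Made-up training probabilities, for illustration only
training_probs = sorted([0.05, 0.10, 0.12, 0.20, 0.31, 0.40, 0.48, 0.55, 0.62, 0.91])

pct = percentile_of(0.78, training_probs)
print(f"0.78 is at the {pct:.0f}th percentile")
```

In practice the sorted list would come from the model's predicted probabilities on the training set, which requires a model that can output probabilities.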

In [186]:
# Importing Libraries
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [187]:
# loading the prepped_churn_data.csv into a df
df = pd.read_csv('prepped_churn_data.csv', index_col = 'customerID')
df.drop('totcharges_to_tenure_ratio', axis=1, inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 7590-VHVEG to 3186-AJIEK
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tenure          7032 non-null   int64  
 1   PhoneService    7032 non-null   int64  
 2   Contract        7032 non-null   int64  
 3   PaymentMethod   7032 non-null   int64  
 4   MonthlyCharges  7032 non-null   float64
 5   TotalCharges    7032 non-null   float64
 6   Churn           7032 non-null   int64  
dtypes: float64(2), int64(5)
memory usage: 439.5+ KB


In [171]:
!conda install -c conda-forge pycaret -y

Channels:
 - conda-forge
 - defaults
Platform: win-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



In [188]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [189]:
automl = setup(df, target='Churn')

| | Description | Value |
|---|---|---|
| 0 | Session id | 7415 |
| 1 | Target | Churn |
| 2 | Target type | Binary |
| 3 | Original data shape | (7032, 7) |
| 4 | Transformed data shape | (7032, 7) |
| 5 | Transformed train set shape | (4922, 7) |
| 6 | Transformed test set shape | (2110, 7) |
| 7 | Numeric features | 6 |
| 8 | Preprocess | True |
| 9 | Imputation type | simple |


#### Session ID (7415):
- The random seed for this session. Reusing the same value makes the results reproducible.

#### Target Variable (Churn):
- The target variable is binary (Churn or No Churn).

#### Data Shape (7032, 7):
- There were 7032 rows and 7 columns in the original dataset.

#### Transformed Data Shape (7032, 7):
- After preprocessing and transformation, the data keeps its original shape.

#### Transformed Train Set Shape (4922, 7):
- The transformed training set has 4922 rows and 7 columns after the data is split into training and test sets.

#### Transformed Test Set Shape (2110, 7):
- The transformed test dataset, consisting of 2110 rows and 7 columns, will be utilized for model assessment.

#### Numerical Features (6):
- The number of feature columns treated as numeric. Here that is all six non-target columns.

#### Preprocess (True):
- Preprocessing is enabled. Every feature in this dataset is already numeric, so no categorical encoding is needed, and scaling is not applied unless `normalize=True` is passed to `setup()`.

#### Simple Imputation:
- Missing values are filled by simple imputation (mean for numerical features and mode for categorical features).

#### Fold Generator (StratifiedKFold):
- The stratified K-Fold cross-validation technique is employed in the model assessment process.

#### Fold Number (10):
- Count of folds used in model training and assessment for cross-validation.

#### CPU Jobs (-1):
- The quantity of CPU cores utilized for concurrent operations during the training of a model. All accessible cores are indicated with a -1.

#### Use GPU (False):
- Indicates if using GPU acceleration for specific calculations is turned on (currently set to False).

#### Log Experiment (False):
- This option determines whether or not experiment results should be logged; it is presently set to False.

#### Experiment Name (clf-default-name):
- The experiment's default name.

#### USI (0b0e):
- The Unique Session ID, a short identifier for this session.

***These details show that the dataset has been split into training and test sets and is ready for model training and assessment. The preprocessing pipeline handles missing values through simple imputation; because every feature is already numeric, no categorical encoding is required. Model performance is assessed with stratified 10-fold cross-validation.***
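
The train and test shapes reported by `setup()` follow from the row count and the split fraction. The sketch below reproduces that arithmetic, assuming PyCaret's default `train_size` of 0.7 and the 10 folds listed above (neither is set explicitly in this notebook). `split_sizes` is a hypothetical helper, and rounding the training share down is an assumption that happens to match the reported shapes.

```python
import math


def split_sizes(n_rows, train_size=0.7):
    """Return (train_rows, test_rows) for a fractional train size."""
    n_train = math.floor(n_rows * train_size)
    return n_train, n_rows - n_train


n_train, n_test = split_sizes(7032)
print(n_train, n_test)  # 4922 2110

# With 10 folds, each fold holds out roughly a tenth of the training rows
n_folds = 10
fold_sizes = [n_train // n_folds + (1 if i < n_train % n_folds else 0) for i in range(n_folds)]
print(fold_sizes)
```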

In [190]:
best_model = compare_models(sort='Prec.')

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| ridge | Ridge Classifier | 0.7893 | 0.0 | 0.4458 | 0.6533 | 0.5291 | 0.3997 | 0.4123 | 0.012 |
| gbc | Gradient Boosting Classifier | 0.7922 | 0.831 | 0.4679 | 0.6524 | 0.5442 | 0.4145 | 0.4245 | 0.129 |
| lr | Logistic Regression | 0.7928 | 0.8256 | 0.5046 | 0.6402 | 0.5641 | 0.4307 | 0.4361 | 0.033 |
| catboost | CatBoost Classifier | 0.7893 | 0.8298 | 0.4863 | 0.6364 | 0.5509 | 0.4166 | 0.4232 | 1.133 |
| ada | Ada Boost Classifier | 0.7806 | 0.8271 | 0.4618 | 0.6163 | 0.5274 | 0.3886 | 0.3957 | 0.057 |
| lda | Linear Discriminant Analysis | 0.7832 | 0.8141 | 0.4924 | 0.616 | 0.5466 | 0.4067 | 0.4115 | 0.011 |
| lightgbm | Light Gradient Boosting Machine | 0.7782 | 0.8193 | 0.4931 | 0.6021 | 0.5417 | 0.3973 | 0.4011 | 0.096 |
| xgboost | Extreme Gradient Boosting | 0.7684 | 0.8112 | 0.4901 | 0.5766 | 0.5293 | 0.3772 | 0.3797 | 0.034 |
| rf | Random Forest Classifier | 0.7662 | 0.7979 | 0.4618 | 0.576 | 0.5119 | 0.3607 | 0.3649 | 0.158 |
| knn | K Neighbors Classifier | 0.7617 | 0.7456 | 0.4335 | 0.5699 | 0.4915 | 0.3398 | 0.3457 | 0.025 |


The `best_model` object now holds the top-ranked model. By default, `compare_models` ranks models by accuracy; passing the `sort` argument selects a different scoring metric. Here we pass `sort='Prec.'`, so the table above is ordered by precision, which is defined as TP / (TP + FP).
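
Precision and recall are easy to confuse because they share a numerator: precision is TP / (TP + FP), while TP / (TP + FN) is recall. The short sketch below computes both, plus accuracy, from confusion-matrix counts. The counts are hypothetical and are not taken from the churn models.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical confusion-matrix counts, for illustration only
tp, fp, fn, tn = 30, 10, 20, 140

print(precision(tp, fp))         # 0.75
print(recall(tp, fn))            # 0.6
print(accuracy(tp, tn, fp, fn))  # 0.85
```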

#### Model Performance:
- The table shows the accuracy, precision, recall, area under the curve (AUC), F1 score, kappa, Matthews correlation coefficient (MCC), and training time in seconds for many machine learning models.

#### Best Model Selection:
- By default, the model with the highest accuracy is treated as the best model. Because we sorted by precision, the Ridge Classifier ranks first here, with a precision of 0.6533.

#### Alternative Scoring Metrics:
- By adjusting the sort parameter in the compare_models() method, we may also assess models using different scoring metrics like accuracy, recall, or AUC.

#### Model Comparison:
- We can determine which models work better for a particular dataset and job by comparing the performance characteristics of several models. For example, we may see differences across several models in terms of performance measures like accuracy, precision, recall, and AUC.

#### Trade-offs:
- We may prioritize some metrics over others based on the requirements of the problem. For example, we may prefer models with higher recall in a binary classification task where detecting true positives is important. Similarly, precision is the more important metric when false positives are expensive.

#### Training Time:
- The table also reports each model's training time, which indicates how computationally efficient and scalable it is, especially for applications that need quick turnaround or involve large datasets.

***The output of compare_models() aids in choosing the best model, taking into account trade-offs between different performance indicators, for the job and dataset provided.***
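
To make the trade-off concrete, the sketch below re-ranks the first five rows of the `compare_models()` table above by different metrics. The scores are copied from that table; `best_by` is a hypothetical helper, not a PyCaret function.

```python
# Cross-validated scores copied from the first five rows of the table above
results = {
    "ridge":    {"Accuracy": 0.7893, "AUC": 0.0,    "Recall": 0.4458, "Prec.": 0.6533, "F1": 0.5291},
    "gbc":      {"Accuracy": 0.7922, "AUC": 0.831,  "Recall": 0.4679, "Prec.": 0.6524, "F1": 0.5442},
    "lr":       {"Accuracy": 0.7928, "AUC": 0.8256, "Recall": 0.5046, "Prec.": 0.6402, "F1": 0.5641},
    "catboost": {"Accuracy": 0.7893, "AUC": 0.8298, "Recall": 0.4863, "Prec.": 0.6364, "F1": 0.5509},
    "ada":      {"Accuracy": 0.7806, "AUC": 0.8271, "Recall": 0.4618, "Prec.": 0.6163, "F1": 0.5274},
}


def best_by(metric):
    """Return the model ID with the highest score for the given metric."""
    return max(results, key=lambda name: results[name][metric])


for metric in ("Prec.", "Accuracy", "AUC", "Recall", "F1"):
    print(metric, "->", best_by(metric))
```

Only precision favors the Ridge Classifier; accuracy, recall, and F1 favor Logistic Regression, and AUC favors Gradient Boosting. The Ridge Classifier's AUC of 0.0 most likely reflects that it does not output predicted probabilities, which matters if churn probabilities are required.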

In [191]:
best_model

In [192]:
df.iloc[-2:-1].shape

(1, 7)

In [193]:
predict_model(best_model, df.iloc[-2:-1])

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|---|
| 0 | Ridge Classifier | 1.0 | 0 | 1.0 | 1.0 | 1.0 | | 0.0 |


| customerID | tenure | PhoneService | Contract | PaymentMethod | MonthlyCharges | TotalCharges | Churn | prediction_label |
|---|---|---|---|---|---|---|---|---|
| 8361-LTMKD | 4 | 1 | 0 | 3 | 74.400002 | 306.600006 | 1 | 1 |


In [194]:
save_model(best_model, 'RP_Churn')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean'))),
                 ('categorical_imputer',
                  TransformerWrapper(exclude=None, include=[],
                                     transfo

In [195]:
import pickle

with open('RP_Churn_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

We import `pickle`, create a new pickle file named `RP_Churn_model.pk`, and dump `best_model` into it.

In [196]:
with open('RP_Churn_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

Next, we read the pickle file we just created and load it into the `loaded_model` variable for later use.
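
The dump and load steps form a round trip: whatever object is written comes back unchanged. The sketch below shows this with a stand-in dictionary instead of a trained model, so it runs without PyCaret. The file name and the dictionary contents are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; any picklable Python object behaves the same way
model = {"name": "Ridge Classifier", "coefficients": [0.1, -0.2, 0.3]}

# Write to a temporary directory so the example leaves the working folder untouched
path = os.path.join(tempfile.mkdtemp(), "example_model.pk")

with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # True
```

Note that pickling `best_model` directly stores only the estimator object, whereas `save_model()` reports saving the transformation pipeline together with the model.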

In [197]:
new_data = df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)
loaded_model.predict(new_data)

array([1], dtype=int8)

We select the second-to-last row of `df` with negative indexing (`df.iloc[-2:-1]`) and copy it into a new DataFrame, `new_data`.

We then drop the `Churn` column from `new_data` (`axis=1` refers to columns), because the target variable must not be included in the features used for prediction.


Finally, we call the `predict()` method of `loaded_model` on `new_data`. It returns the predicted labels as an array.
The output `array([1], dtype=int8)` means that the predicted label for this data point is 1, that is, the customer is predicted to churn.
Overall, this cell predicts the churn label for a single data point (`new_data`) using the model restored from the pickle file (`loaded_model`).
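
Plain Python lists follow the same slicing rule as `.iloc`, which makes the `[-2:-1]` selection easy to check in isolation. The rows below are hypothetical stand-ins for DataFrame rows.

```python
# Hypothetical rows standing in for the DataFrame
rows = [
    {"customerID": "A", "tenure": 1, "Churn": 0},
    {"customerID": "B", "tenure": 4, "Churn": 1},
    {"customerID": "C", "tenure": 66, "Churn": 0},
]

# [-2:-1] starts at the second-to-last item and stops before the last one
second_to_last = rows[-2:-1]
print(len(second_to_last))  # 1

# Drop the target key so that only the features remain
features = [{k: v for k, v in row.items() if k != "Churn"} for row in second_to_last]
print(features)  # [{'customerID': 'B', 'tenure': 4}]
```

The slice returns a one-element list rather than a bare item, just as `df.iloc[-2:-1]` returns a one-row DataFrame rather than a Series.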






In [198]:
loaded_pipeline = load_model('RP_Churn')

Transformation Pipeline and Model Successfully Loaded


In [199]:
predict_model(loaded_pipeline, new_data)

| customerID | tenure | PhoneService | Contract | PaymentMethod | MonthlyCharges | TotalCharges | prediction_label |
|---|---|---|---|---|---|---|---|
| 8361-LTMKD | 4 | 1 | 0 | 3 | 74.400002 | 306.600006 | 1 |


In [200]:
from IPython.display import Code

Code('RP_Churn_pred.py')
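
The contents of `RP_Churn_pred.py` are not shown in this export. The sketch below is one way such a module might look, not the actual file. `label_names` and `predict_churn` are hypothetical names; `prediction_label` is the column that `predict_model()` returned earlier in this notebook, and `Churn_prediction` is the series name printed by the script.

```python
# Mapping from the numeric label to the text printed by the script
LABELS = {1: "Churn", 0: "No churn"}


def label_names(labels):
    """Convert 0/1 labels into 'No churn' / 'Churn' strings."""
    return [LABELS[int(value)] for value in labels]


def predict_churn(df, model_name="RP_Churn"):
    """Load the saved PyCaret pipeline and return a text prediction per row.

    df is a pandas DataFrame holding the same feature columns used in training.
    """
    # Imported here so that label_names() can be used without PyCaret installed
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)
    predictions = predict_model(model, data=df)
    predictions["Churn_prediction"] = label_names(predictions["prediction_label"])
    return predictions["Churn_prediction"]
```

A caller would read the new data first, for example with `pd.read_csv('new_churn_data.csv', index_col='customerID')` (the index column is an assumption based on the printed output), and then print the result of `predict_churn(...)`.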

In [201]:
%run RP_Churn_pred.py

Transformation Pipeline and Model Successfully Loaded


predictions:
customerID
9305-CKSKC       Churn
1452-KNGVK    No churn
6723-OKKJM    No churn
7832-POPKP       Churn
6348-TACGU    No churn
Name: Churn_prediction, dtype: object


I appreciate PyCaret, a low-code library that produces strong results by replacing hundreds of lines of code with a few function calls. After following the FTE instructions, the script predicted "No churn" for three of the five customers in the new data. I used `compare_models()` to identify the best model, and I learned how to write a Python `.py` file that loads the saved model and makes predictions on new data.
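
The script output can be checked against the true values listed in the assignment. The sketch below copies the five predictions printed above and the true values `[1, 0, 0, 1, 0]`, then counts the matches.

```python
# Predictions printed by the script, in the same row order as the new data
predicted = ["Churn", "No churn", "No churn", "Churn", "No churn"]

# True values given in the assignment
true_values = [1, 0, 0, 1, 0]

predicted_labels = [1 if p == "Churn" else 0 for p in predicted]
matches = sum(p == t for p, t in zip(predicted_labels, true_values))

print(f"{matches} of {len(true_values)} predictions match")  # 5 of 5 predictions match
```

Five rows are too few to estimate performance reliably; the cross-validated scores in the comparison table remain the better guide.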

# Summary

#### Data Loading:
- The churn data was imported into a dataframe (df) including customer churn-related attributes.

#### AutoML Setup:
- We called PyCaret's `setup()` function on the dataset (`df`) with `Churn` as the target variable to prepare the data for automated machine learning (AutoML).

#### Model Comparison:
- We used the `compare_models()` function to compare different machine learning models and received a table of each model's performance metrics, ranked by precision.

#### Model Selection:
- According to the precision metric, the Ridge Classifier is the best-performing model, with a precision of 0.6533.

#### Model Saving:
- `compare_models()` returned the top-ranked model (Ridge) as `best_model`. We saved it to disk with `save_model()` and, separately, with `pickle`, so that it can serve as our final model for making predictions.

#### Preparing New Data:
- We prepared new data (`new_data`) by selecting one row from the dataset and removing the target variable (`Churn`), since that is the value we want to predict.

#### Prediction:
- We used the `predict_model()` function with the saved Ridge model to predict the churn label for `new_data`. The output contains only a `prediction_label` column; no churn probability is returned for this model.

#### Analysis and Conclusion:
- Lastly, we ran `RP_Churn_pred.py` on the new data. Its five predictions (Churn, No churn, No churn, Churn, No churn) match the true values [1, 0, 0, 1, 0] given in the assignment.

***In summary, PyCaret's features were utilized to automate and streamline the machine learning workflow, which included data preparation, model selection, prediction, and analysis.***
