# Mobile Carrier Subscriber Analysis

<div style="border: 2px solid #007bff; padding: 10px; border-radius: 5px;">

# Project Overview

This project develops a Machine Learning model that analyzes subscriber behavior and recommends one of the carrier's newer plans: Smart or Ultra.

## Methodology  

Code for working with and manipulating datasets and Machine Learning models is contained in several classes and functions. This approach avoids repetitive code, and allows the functionality to be included easily in other projects.

* Support Code
    * Enums
        * MLModel
        * MLModelMethod
        * MLModelSearch
        * MLModelStorageParadigm
    * Classes for facilitating analysis of the dataset:
        * DataframeColumn
        * DataframeInfo
    * Functions
        * is_float_string
        * analyze_column_data
        * analyze_dataset
        * any_to_list
    * Classes for working with Machine Learning models:
        * MLModelSets
        * MLDataset
        * MLModelResult
        * MLDecisionTree
        * MLRandomForest
        * MLLogisticRegression
        * MLLinearRegression
        * MLThreeSetSplit
</div>

<div style="border: 2px solid #007bff; padding: 10px; border-radius: 5px;">

# Environment Setup and Required Libraries
</div>


In [1]:
from enum import Enum
from abc import ABC, abstractmethod
import pandas as pd
import numpy as np
import random
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from skopt import gp_minimize  # Gaussian Process optimization
from skopt.space import Integer, Real, Categorical
from skopt.utils import use_named_args

<div style="border: 2px solid #007bff; padding: 10px; border-radius: 5px;">
    
# Supporting Classes and Functions
</div>

<div style="border: 2px solid #3F9B0B; padding: 10px; border-radius: 5px;">
    
## Enums
</div>

## MLModel

### Summary
An enum for specifying a Machine Learning model type.

In [2]:
class MLModel(Enum):
    # No trailing commas: a trailing comma would make the member value a tuple such as (0,)
    Undefined = 0
    DecisionTree = 1
    RandomForestGrid = 2
    RandomForestRandomized = 3
    RandomForestBayesianOptimization = 4
    LogisticRegression = 5

## MLModelMethod

### Summary
An enum for specifying the method applied in the evaluation of a model.

In [3]:
class MLModelMethod(Enum):
    # No trailing commas: a trailing comma would make the member value a tuple such as (0,)
    Undefined = 0
    Accuracy = 1
    RMSE = 2

## MLModelSearch

### Summary
An enum for specifying the type of search to use for evaluating model parameters.

In [4]:
class MLModelSearch(Enum):
    # No trailing commas: a trailing comma would make the member value a tuple such as (0,)
    Undefined = 0
    Grid = 1
    Randomized = 2
    BayesianOptimization = 3

## MLModelStorageParadigm

### Summary
An enum for specifying the paradigm for storing model results.

In [5]:
class MLModelStorageParadigm(Enum):
    # No trailing commas: a trailing comma would make the member value a tuple such as (0,)
    Overwrite = 0
    LowestValue = 1
    HighestValue = 2

<div style="border: 2px solid #3F9B0B; padding: 10px; border-radius: 5px;">
    
## Functions / Classes
</div>

## DataframeColumn

### Summary
Describes a column within a dataframe.

### Attributes:
* **`data_type`** (`str`): The data type of the column. Read-only.  
* **`name`** (`str`): The name of the column. Read-only.  
* **`non_null_count`** (`int`): The number of non-null values in the column. Read-only.  

#### Constructor (`__init__`)
Initializer.

##### Arguments:
* **`name`** (`str`): The name of the column.  
* **`non_null_count`** (`int`): The number of non-null values in the column.   
* **`data_type`** (`str`): The data type of the column.   


### Public Methods

#### `dfc_list_from_df(df)`
Static method that creates a list of DataframeColumn objects for the columns in a specified dataframe.

##### Arguments:
* **`df`** (`DataFrame`): The dataframe.  

##### Returns:
  * `[DataframeColumn]` A list of DataframeColumn objects for the columns in the specified dataframe.


In [6]:
class DataframeColumn:
    @staticmethod
    def dfc_list_from_df(df):        
        df_columns = df.columns.tolist()
        df_non_null_count = df.count().tolist()
        df_data_types = df.dtypes.tolist()
        n = len(df_columns)
        dfc_list = []

        for c in range(n):
            dfc = DataframeColumn(df_columns[c],
                                  df_non_null_count[c],
                                  df_data_types[c])
            dfc_list.append(dfc)
        
        return dfc_list

    def __init__(self, name, non_null_count, data_type):
        self._name = name
        self._non_null_count = non_null_count
        self._data_type = data_type

    @property
    def name(self):
        return self._name

    @property
    def non_null_count(self):
        return self._non_null_count

    @property
    def data_type(self):
        return self._data_type
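
As a usage sketch, the three per-column lists that `dfc_list_from_df` combines can be inspected directly on a small dataframe. The column names and values below are made up for illustration and are not taken from the project's dataset.

```python
import pandas as pd

# Hypothetical data: one float column with a missing value, one integer column
df = pd.DataFrame({'calls': [40.0, 85.0, None], 'messages': [83, 56, 86]})

# The same three lists that dfc_list_from_df reads from the dataframe
names = df.columns.tolist()
non_null_counts = df.count().tolist()   # count() skips missing values
dtype_names = [dtype.name for dtype in df.dtypes.tolist()]

for name, count, dtype_name in zip(names, non_null_counts, dtype_names):
    print(f'{name:<15}{count:>10}   {dtype_name}')
```

Each `DataframeColumn` then carries one entry from each list, which is why `data_type.name` is available when the summary table is printed.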

## DataframeInfo

### Summary
Provides a detailed description of a dataframe.

### Attributes:
* **`columns`** (`[DataframeColumn]`): A list of DataframeColumn objects representing the columns within the dataframe. Read-only.
* **`duplicate_row_count`** (`int`): The number of duplicate rows. Read-only.
* **`row_count`** (`int`): The number of rows. Read-only.

#### Constructor (`__init__`)
Initializer.

##### Arguments:
* **`df`** (`Dataframe`): The dataframe.  

### Public Methods

#### `info(self)`
Prints detailed information describing the dataframe.


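Sample Output (the per-column blocks that follow the summary table are printed by `analyze_column_data` when the dataframe is processed with `analyze_dataset`):  

```text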
Rows: 3214  
Duplicate rows: 0  

Column           Non-null   Data type   
calls                3214   float64     
minutes              3214   float64     
messages             3214   float64     
mb_used              3214   float64     

Column:             calls
Data type:          float64
Non-null:           3214
N/A count:          0
Unique values:      184
Integer values:     3214
Non-integer values: 0



Column:             minutes
Data type:          float64
Non-null:           3214
N/A count:          0
Unique values:      3144
Integer values:     60
Non-integer values: 3154


Column:             messages
Data type:          float64
Non-null:           3214
N/A count:          0
Unique values:      180
Integer values:     3214
Non-integer values: 0


Column:             mb_used
Data type:          float64
Non-null:           3214
N/A count:          0
Unique values:      3203
Integer values:     34
Non-integer values: 3180
```

In [7]:
class DataframeInfo:
    def __init__(self, df):
        self._row_count = len(df)
        self._duplicate_row_count = df.duplicated().sum()
        self._columns = DataframeColumn.dfc_list_from_df(df)

    @property
    def row_count(self):
        return self._row_count

    @property
    def duplicate_row_count(self):
        return self._duplicate_row_count

    @property
    def columns(self):
        return self._columns
    
    def info(self):
        print(f'Rows: {self.row_count}')
        print(f'Duplicate rows: {self.duplicate_row_count}')
        print()
        col_headers = ['Column', 'Non-null', 'Data type']
        col_width = [15, 10, 12]
        print(
            f'{col_headers[0]:<{col_width[0]}}{col_headers[1]:>{col_width[1]}}   {col_headers[2]:<{col_width[2]}}')

        for c in self.columns:
            print(
                f'{c.name:<{col_width[0]}}{c.non_null_count:>{col_width[1]}}   {c.data_type.name:<{col_width[2]}}')

#### `is_float_string(value)`
Determines whether a string can be converted to a float.  

##### Arguments:
* **`value`** (`str`): The string value to test.  

##### Returns:
* `bool`: `True` if the string can be converted to a float; otherwise, `False`.  

In [8]:
def is_float_string(value):
    if value is None:
        return False
    try:
        float(value)
        return True
    except (ValueError, TypeError):
        # TypeError covers non-string inputs such as lists
        return False
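
A few edge cases are worth keeping in mind when this helper is used to classify column values. The sketch below repeats a compact stand-in for the function so that it runs on its own; the sample strings are illustrative only.

```python
def is_float_string(value):
    # Compact stand-in for the helper above
    if value is None:
        return False
    try:
        float(value)
        return True
    except (ValueError, TypeError):
        return False

samples = ['3.14', ' 42 ', '1e3', 'nan', 'inf', '1,5', 'abc', '', None]
results = {sample: is_float_string(sample) for sample in samples}

for sample, result in results.items():
    print(f'{sample!r:>8} -> {result}')
```

Note that `float()` accepts surrounding whitespace, scientific notation, and the special strings `nan` and `inf`, so these all count as convertible, while a decimal comma such as `1,5` does not.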

#### `analyze_column_data(series, dataframe_column)`
Analyzes the data within a column, and prints a summary of the data.

##### Arguments:
* **`series`** (`Series`): The data contained in the column.  
* **`dataframe_column`** (`DataframeColumn`): The corresponding DataframeColumn object.  

Sample Output:  

```text
Column:             calls
Data type:          float64
Non-null:           3214
N/A count:          0
Unique values:      184
Integer values:     3214
Non-integer values: 0
```

In [9]:
def analyze_column_data(series, dataframe_column):
    series_length = len(series)
    is_float_type = (dataframe_column.data_type.name == 'float64')
    integer_analysis = ''

    if is_float_type:
        integer_value_count = series.apply(lambda x: x.is_integer()).sum()
        non_integer_value_count = series_length - integer_value_count
        integer_analysis = f"""Integer values:     {integer_value_count}
Non-integer values: {non_integer_value_count}
"""

    is_object_data_type = (dataframe_column.data_type.name == 'object')
    object_analysis = ''

    if is_object_data_type:
        # Use the same float-conversion test for both counts so they sum to the series length
        numeric_value_count = series.apply(is_float_string).sum()
        non_numeric_value_count = series_length - numeric_value_count

        object_analysis = f"""Numeric values:     {numeric_value_count}
Non-numeric values: {non_numeric_value_count}
"""

    analysis = f"""
Column:             {dataframe_column.name}
Data type:          {dataframe_column.data_type.name}
Non-null:           {dataframe_column.non_null_count}
N/A count:          {series.isna().sum()}
Unique values:      {series.nunique()}"""

    print(analysis)

    if is_float_type:
        print(integer_analysis)

    if is_object_data_type:
        print(object_analysis)

#### `analyze_dataset(df)`
Analyzes the data in a dataset, and returns a DataframeInfo object containing the results.

##### Arguments:
* **`df`** (`DataFrame`): The dataset to analyze.

##### Returns:
* `DataframeInfo`: An object containing the results of the analysis.

In [10]:
def analyze_dataset(df):
    df_info = DataframeInfo(df)
    df_info.info()

    # Print the per-column analysis for each column in turn
    for col in df_info.columns:
        analyze_column_data(df[col.name], col)

    return df_info

#### `any_to_list(a)`
Returns a list from an input value.

##### Arguments:
* **`a`** (`any`): The value to convert to a list. See Notes.    

##### Returns:
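* `list`: A list built from the input value. See Notes.

##### Notes:
1. The following data types are supported:  
    * list (returns the input value unchanged)
    * np.ndarray (converted with `tolist()`)
    * int, float, str (returns a one-element list containing the value converted to a string)
    * tuple (returns a one-element list containing the tuple converted to a list)
    * dict (a list of the values is returned)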

&emsp;Any other data type returns an empty list, and a message is printed to the console.

In [11]:
def any_to_list(a):
    if isinstance(a, list):
        return a
    elif isinstance(a, np.ndarray):
        return a.tolist()
    elif isinstance(a, (int, float, str)):
        return [str(a)]
    elif isinstance(a, tuple):
        return [list(a)]
    elif isinstance(a, dict):
        return list(a.values())
    else:
        print(f'Unrecognized data type: {type(a)}')
        return []
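
The branches above behave differently enough that a quick check of each input type is useful. The sketch below copies the function as written so that it runs on its own; the sample inputs are illustrative only.

```python
import numpy as np

def any_to_list(a):
    # Copy of the function above, repeated so the block is self-contained
    if isinstance(a, list):
        return a
    elif isinstance(a, np.ndarray):
        return a.tolist()
    elif isinstance(a, (int, float, str)):
        return [str(a)]
    elif isinstance(a, tuple):
        return [list(a)]
    elif isinstance(a, dict):
        return list(a.values())
    else:
        print(f'Unrecognized data type: {type(a)}')
        return []

from_list = any_to_list(['calls', 'minutes'])
from_array = any_to_list(np.array([1, 2, 3]))
from_int = any_to_list(5)              # scalars are converted to strings
from_tuple = any_to_list((1, 2))       # the tuple becomes one nested list
from_dict = any_to_list({'a': 1, 'b': 2})
from_set = any_to_list({1, 2})         # unsupported: prints a message

print(from_list, from_array, from_int, from_tuple, from_dict, from_set)
```

Callers that need the original numeric value, or a flat list from a tuple, should account for the string conversion and the nesting shown here.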

## MLModelSets(ABC)

### Summary
Abstract class for use with classes that represent source datasets for model training.

### Attributes:
* **`df_test`** (`MLDataset`): The test dataset. Read-only.  
* **`df_train`** (`MLDataset`): The training dataset. Read-only.  
* **`df_valid`** (`MLDataset`): The validation dataset. Read-only.  
* **`random_state`** (`int`): The random state to use. Read-only.  

In [12]:
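class MLModelSets(ABC):
    # Each member is read as an attribute elsewhere, so declare it as an abstract read-only property
    @property
    @abstractmethod
    def df_train(self):
        pass

    @property
    @abstractmethod
    def df_test(self):
        pass

    @property
    @abstractmethod
    def df_valid(self):
        pass

    @property
    @abstractmethod
    def random_state(self):
        pass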
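
The model classes read these members as attributes (for example `random_state` without parentheses), so a concrete subclass is expected to supply them as read-only properties. The following is a minimal sketch of that pattern; `ModelSets` and `InMemorySets` are hypothetical, reduced stand-ins and not classes from this project.

```python
from abc import ABC, abstractmethod

class ModelSets(ABC):  # hypothetical, reduced stand-in for MLModelSets
    @property
    @abstractmethod
    def df_train(self):
        pass

    @property
    @abstractmethod
    def random_state(self):
        pass

class InMemorySets(ModelSets):  # hypothetical concrete subclass
    def __init__(self, train, random_state):
        self._train = train
        self._random_state = random_state

    @property
    def df_train(self):
        return self._train

    @property
    def random_state(self):
        return self._random_state

sets = InMemorySets(train=[1, 2, 3], random_state=12345)
print(sets.random_state)

# The abstract base cannot be instantiated while abstract members remain
try:
    ModelSets()
    base_instantiable = True
except TypeError:
    base_instantiable = False
print(base_instantiable)
```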

## MLDataset

### Summary
Represents a Machine Learning dataset.

### Attributes:
* **`features`** (`[Series]`): The features in the dataset. Read-only.  
* **`target`** (`ndarray`): The target in the dataset, flattened to a one-dimensional array with `np.ravel`. Read-only.  

#### Constructor (`__init__`)
Initializer.

##### Arguments:
* **`features`** (`[Series]`): The features in the dataset.   
* **`target`** (`Series`): The target in the dataset.   

In [13]:
class MLDataset:
    def __init__(self, features, target):
        self._features = features
        self._target = np.ravel(target)
    
    @property
    def features(self):
        return self._features
    
    @property
    def target(self):
        return self._target
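
The constructor passes the target through `np.ravel`. A short sketch of what that does to a single-column target follows; the values are made up for illustration.

```python
import numpy as np

# Hypothetical target stored as a column vector, as a one-column selection would be
column_target = np.array([[0], [1], [1], [0]])

flat_target = np.ravel(column_target)

print(column_target.shape)  # two-dimensional: one row per sample, one column
print(flat_target.shape)    # one-dimensional: one entry per sample
```

scikit-learn estimators expect a one-dimensional target for single-output problems, so flattening once in the constructor avoids repeating the conversion at every `fit` call.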

## MLModelResult

### Summary
Container for results of a model test.

### Attributes:
* **`model`** (`any`): The model that corresponds to the score. Read-only.  
* **`model_method`** (`MLModelMethod`): The method used for testing. Read-only.
* **`parameters`** (`any`): The parameters that correspond to the score, for example a `dict` of best parameters or a single `max_depth` value. Read-only.
* **`score`** (`float`): The score corresponding to the model. Read-only.

#### Constructor (`__init__`)
Initializer.

##### Arguments:
* **`model_method`** (`MLModelMethod`): The method used for testing.  
 
### Public Methods:

#### `reset(self)`
Resets the `model`, `score`, and `parameters` properties to None. The `model_method` property is left unchanged.

#### `store_result(self, model, score, parameters, storage_paradigm)`
Stores the result of a model test.

##### Arguments:
* **`model`** (`any`): The model.  
* **`score`** (`float`): The score for the model.
* **`parameters`** (`any`): The parameters that generated the score.  
* **`storage_paradigm`** (`MLModelStorageParadigm`, *default*= `MLModelStorageParadigm.Overwrite`): The paradigm to use for storing results.

In [14]:
class MLModelResult:
    def __init__(self,
                 model_method
                ):
        self._model_method = model_method
        self.reset()

    @property
    def model_method(self):
        return self._model_method

    @property
    def model(self):
        return self._model
        
    @property
    def score(self):
        return self._score

    @property
    def parameters(self):
        return self._parameters

    def reset(self):
        self._score = None
        self._parameters = None
        self._model = None

    def store_result(self,
                     model,
                     score,
                     parameters,
                     storage_paradigm=MLModelStorageParadigm.Overwrite
                    ):
        def overwrite_values():
            self._model = model
            self._score = score
            self._parameters = parameters
        
        match storage_paradigm:
            case MLModelStorageParadigm.Overwrite:
                overwrite_values()
            case MLModelStorageParadigm.HighestValue:
                if self._score is None or self._score < score:
                    overwrite_values()
            case MLModelStorageParadigm.LowestValue:
                if self._score is None or self._score > score:
                    overwrite_values()
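
The effect of each storage paradigm is easiest to see on a short sequence of scores. The sketch below is a simplified stand-in for the comparison logic in `store_result`; the `Paradigm` enum, the helper names, and the scores are hypothetical.

```python
from enum import Enum

class Paradigm(Enum):  # hypothetical stand-in for MLModelStorageParadigm
    OVERWRITE = 0
    LOWEST_VALUE = 1
    HIGHEST_VALUE = 2

def should_store(current, candidate, paradigm):
    # Nothing stored yet, or unconditional overwrite
    if current is None or paradigm is Paradigm.OVERWRITE:
        return True
    if paradigm is Paradigm.HIGHEST_VALUE:
        return candidate > current
    return candidate < current

def kept_score(scores, paradigm):
    kept = None
    for score in scores:
        if should_store(kept, score, paradigm):
            kept = score
    return kept

scores = [0.71, 0.79, 0.76]
highest = kept_score(scores, Paradigm.HIGHEST_VALUE)
lowest = kept_score(scores, Paradigm.LOWEST_VALUE)
latest = kept_score(scores, Paradigm.OVERWRITE)
print(highest, lowest, latest)
```

`HighestValue` suits metrics where larger is better, such as accuracy, while `LowestValue` suits error metrics such as RMSE.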


## MLDecisionTree

### Summary
Represents a Decision Tree Classifier.

### Attributes:
* **`accuracy_results`** (`MLModelResult`): The results of the accuracy test with the highest score. Read-only.
* **`ml_modelsets`** (`MLModelSets`): The associated model sets object. Internal.
* **`rmse_results`** (`MLModelResult`): The results of the rmse test with the lowest score. Read-only.

#### Constructor (`__init__`)
Initializer.

##### Arguments:
* **`ml_modelsets`** (`MLModelSets`): The associated model sets object.  

### Public Methods:

#### `tune(self, max_depth, min_samples_split, log)`
Determines the best accuracy model for the Decision Tree classifier, and updates the accuracy_results property.

##### Arguments:
* **`max_depth`** (`[int]`, *default*= `None`): The list of values for the max_depth parameter.  The _calculate_max_depth() method is called on the value passed in this parameter.
* **`min_samples_split`** (`[int]`, *default*= `None`): The list of values for the min_samples_split parameter. The _calculate_min_samples_split() method is called on the value passed in this parameter.
* **`log`** (`bool`, *default*= `True`): If True, results are output to the console.

#### `rmse(self, max_depth, log)`
Determines the best RMSE model for the Decision Tree classifier, and updates the rmse_results property. The model with the lowest RMSE is kept.

##### Arguments:
* **`max_depth`** (`range`, *default*= `None`): The range of max_depth values to evaluate.  The _calculate_max_depth() method is called on the value passed in this parameter.
* **`log`** (`bool`, *default*= `True`): If True, results are output to the console.
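
Because the RMSE is computed on predicted class labels, it is tied directly to the error rate when the two plans are encoded as 0 and 1. That encoding is an assumption, since it is not shown in this section. Under it, every squared error is either 0 or 1, so RMSE equals the square root of one minus accuracy. The sketch below checks this on made-up labels.

```python
import math

def rmse(y_true, y_pred):
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical 0/1 labels with two misclassified samples out of eight
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

rmse_value = rmse(y_true, y_pred)
accuracy_value = accuracy(y_true, y_pred)
print(rmse_value, accuracy_value)
```

Under this assumption, ranking models by lowest RMSE gives the same order as ranking them by highest accuracy on the same data.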

### Private Methods:

#### `_calculate_max_depth(self, proposed_value)`
Returns the values for max_depth based on the proposed_value. This method is intended to be used to supply default values for the max_depth parameter of the public methods, if the max_depth parameter is None.

##### Arguments:
* **`proposed_value`** (`[int]`, *default*= `None`): The max depth of the tree. If proposed_value is None, the value returned is [None, 10, 20].

##### Returns:
The value to use for the max_depth parameter.

#### `_calculate_min_samples_split(self, proposed_value)`
Returns the values for min_samples_split based on the proposed_value. This method is intended to be used to supply default values for the min_samples_split parameter of the public methods, if the min_samples_split parameter is None.

##### Arguments:
* **`proposed_value`** (`[int]`, *default*= `None`): The list of values for the min_samples_split parameter. If proposed_value is None, the value returned is [2, 5, 10].

##### Returns:
The value to use for the min_samples_split parameter.

In [15]:
class MLDecisionTree:
    def __init__(self, 
                 ml_modelsets,
                ):
        self._ml_modelsets = ml_modelsets
        self._accuracy_results = MLModelResult(MLModelMethod.Accuracy)
        self._rmse_results = MLModelResult(MLModelMethod.RMSE)

    @property
    def accuracy_results(self):
        return self._accuracy_results

    @property
    def rmse_results(self):
        return self._rmse_results
    
    def _calculate_max_depth(
            self,
            proposed_value=None
    ):
        if proposed_value is None:
            max_depth = [None, 10, 20]
        else:
            max_depth = proposed_value

        return max_depth
    
    def _calculate_min_samples_split(
            self,
            proposed_value=None
    ):
        if proposed_value is None:
            min_samples_split = [2, 5, 10]
        else:
            min_samples_split = proposed_value
        
        return min_samples_split

    def tune(self,
             max_depth=None,
             min_samples_split=None,
             log=True
            ):
        self._accuracy_results.reset()

        if log:
            print()
            print(f'------------------ Decision Tree (Accuracy) ------------------')

        max_depth = self._calculate_max_depth(max_depth)
        min_samples_split = self._calculate_min_samples_split(min_samples_split)

        param_grid = {
            'max_depth': max_depth,
            'min_samples_split': min_samples_split
        }

        dtree = DecisionTreeClassifier(random_state = self._ml_modelsets.random_state)
        grid_search = GridSearchCV(dtree,
                                   param_grid,
                                   cv=5,
                                   scoring='accuracy',
                                   n_jobs=-1
                                   )
        grid_search.fit(self._ml_modelsets.df_train.features, self._ml_modelsets.df_train.target)
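        # Store the refit best estimator; dtree itself is never fitted because GridSearchCV fits clones
        self._accuracy_results.store_result(grid_search.best_estimator_, grid_search.best_score_, grid_search.best_params_)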

        if log:
            print()
            print(f'Best parameters: {grid_search.best_params_}')
            print(f'Best score: {grid_search.best_score_}')

    def rmse(self,
             max_depth=None,
             log=True
            ):
        self._rmse_results.reset()
        max_depth = self._calculate_max_depth(max_depth)

        if log:
            print()
            print(f'------------------- Decision Tree (RMSE) -------------------')

        for depth in max_depth:
            model = DecisionTreeClassifier(random_state=self._ml_modelsets.random_state, max_depth=depth)
            model.fit(self._ml_modelsets.df_main.features, self._ml_modelsets.df_main.target)
            predictions = model.predict(self._ml_modelsets.df_test.features)
            rmse = mean_squared_error(self._ml_modelsets.df_test.target, predictions) ** 0.5
            self._rmse_results.store_result(model, rmse, depth, storage_paradigm=MLModelStorageParadigm.LowestValue)

            if log:
                print(f'max_depth = {depth} : {rmse}')

        if log:
            print()
            print(f'Best model: {self._rmse_results.parameters} ({self._rmse_results.score})')
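
For reference, the grid search performed in `tune` can be sketched on its own. The data below is synthetic (generated with `make_classification`) and stands in for the features and target supplied by the model sets object; the parameter grid uses the default values listed above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the carrier dataset
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

dtree = DecisionTreeClassifier(random_state=0)
param_grid = {'max_depth': [None, 10, 20], 'min_samples_split': [2, 5, 10]}
search = GridSearchCV(dtree, param_grid, cv=5, scoring='accuracy')
search.fit(X, y)

# GridSearchCV fits clones, so the estimator passed in stays unfitted
template_is_fitted = hasattr(dtree, 'tree_')
# With the default refit=True, best_estimator_ is refit on all of X and y
best_is_fitted = hasattr(search.best_estimator_, 'tree_')

print(search.best_params_, round(search.best_score_, 3))
print(template_is_fitted, best_is_fitted)
```

The fitted model that corresponds to `best_score_` and `best_params_` is therefore `best_estimator_`, not the estimator instance handed to `GridSearchCV`.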

## MLRandomForest

### Summary
Represents a Random Forest Classifier.

### Attributes:
* **`accuracy_results`** (`MLModelResult`): The results of the accuracy test with the highest score. Read-only.
* **`ml_modelsets`** (`MLModelSets`): The associated model sets object. Internal.
* **`model_search`** (`MLModelSearch`): The search method to use for evaluating model parameters. Internal.
* **`rmse_results`** (`MLModelResult`): The results of the rmse test with the lowest score. Read-only.

#### Constructor (`__init__`)
Initializer.

##### Arguments:
* **`ml_modelsets`** (`MLModelSets`): The associated model sets object.  
* **`model_search`** (`MLModelSearch`): The search method to use for evaluating model parameters.

### Public Methods:

#### `calculate_best_model(self, n_estimators, max_depth, min_samples_split, log)`
Calculates the best model for the Random Forest Classifier.

##### Arguments:
* **`n_estimators`** (`range`, *default*= `None`): The range of estimators to use. See Notes.
* **`max_depth`** (`[int]`, *default*= `None`): The max depth of the tree.  See Notes.
* **`min_samples_split`** (`[int]`, *default*= `None`): The list of values for the min_samples_split parameter. See Notes.
* **`log`** (`bool`, *default*= `True`): If True, results are output to the console.

##### Notes:
1. If n_estimators is None, the value assigned will be based on the model_search attribute:
    * Grid: [10, 50, 100, 200]
    * Randomized: random.randint(10, 200)
    * Bayesian: (100, 500), i.e., (low, high)
2. To pass a value for n_estimators, use the following guidelines for the parameter, based on the model_search attribute:
    * Grid: pass [int]
    * Randomized: int
    * Bayesian: (low, high)
3. If max_depth is None, the value assigned will be based on the model_search attribute:
    * Grid: [None, 10, 20]
    * Randomized: [None, 10, 20, 30]
    * Bayesian: (2, 20), i.e., (low, high)
4. To pass a value for max_depth, use the following guidelines for the parameter, based on the model_search attribute:
    * Grid: pass [int]
    * Randomized: int
    * Bayesian: (low, high)
5. If min_samples_split is None, the value assigned will be based on the model_search attribute:
    * Grid: [2, 5, 10]
    * Randomized: random.randint(2, 11)
    * Bayesian: (1e-5, 1e-2), i.e., (low, high)
6. To pass a value for min_samples_split, use the following guidelines for the parameter, based on the model_search attribute:
    * Grid: pass [int]
    * Randomized: int
    * Bayesian: (low, high)
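
The default-selection rules in the notes above can be sketched as a lookup keyed by search type, shown here for n_estimators. The `Search` enum and `default_n_estimators` function are hypothetical stand-ins that mirror the documented defaults rather than the class's actual helper.

```python
import random
from enum import Enum

class Search(Enum):  # hypothetical stand-in for MLModelSearch
    GRID = 1
    RANDOMIZED = 2
    BAYESIAN = 3

def default_n_estimators(search, proposed=None):
    # A caller-supplied value always wins
    if proposed is not None:
        return proposed
    defaults = {
        Search.GRID: lambda: [10, 50, 100, 200],
        # randint includes both end points
        Search.RANDOMIZED: lambda: [random.randint(10, 200)],
        # (low, high) bounds of the search space
        Search.BAYESIAN: lambda: (100, 500),
    }
    return defaults[search]()

grid_default = default_n_estimators(Search.GRID)
randomized_default = default_n_estimators(Search.RANDOMIZED)
bayesian_default = default_n_estimators(Search.BAYESIAN)
explicit = default_n_estimators(Search.GRID, proposed=[25, 75])
print(grid_default, randomized_default, bayesian_default, explicit)
```

The shape of the value differs by search type: a list of candidates for grid search, a single drawn value wrapped in a list for randomized search, and a pair of bounds for Bayesian optimization.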

#### `rmse(self, max_depth, n_estimators)`
Determines the best RMSE for the Random Forest Classifier, and updates the rmse_results property.

##### Arguments:
* **`max_depth`** (`[int]`, *default*= `None`): The max depth of the tree.  See Notes.
* **`n_estimators`** (`range`, *default*= `None`): The range of estimators to use. See Notes.
* **`log`** (`bool`, *default*= `True`): If True, results are output to the console.

##### Notes:
1. If max_depth is None, the value assigned will be based on the model_search attribute:
    * Grid: [None, 10, 20]
    * Randomized: [None, 10, 20, 30]
    * Bayesian: (2, 20), i.e., (low, high)
2. To pass a value for max_depth, use the following guidelines for the parameter, based on the model_search attribute:
    * Grid: pass [int]
    * Randomized: int
    * Bayesian: (low, high)
3. If n_estimators is None, the value assigned will be based on the model_search attribute:
    * Grid: [10, 50, 100, 200]
    * Randomized: random.randint(10, 200)
    * Bayesian: (100, 500), i.e., (low, high)
4. To pass a value for n_estimators, use the following guidelines for the parameter, based on the model_search attribute:
    * Grid: pass [int]
    * Randomized: int
    * Bayesian: (low, high)

### Private Methods:

#### `_calculate_max_depth(self, proposed_value)`
Returns the values for max_depth based on the proposed_value. This method is intended to be used to supply default values for the max_depth parameter of the public methods, if the max_depth parameter is None.

##### Arguments:
* **`proposed_value`** (`[int]`, *default*= `None`): The max depth of the tree.  See Notes.

##### Returns:
The value to use for the max_depth parameter.

##### Notes:
1. If max_depth is None, the value assigned will be based on the model_search attribute:
    * Grid: [None, 10, 20]
    * Randomized: [None, 10, 20, 30]
    * Bayesian: (2, 20), i.e., (low, high)
2. To pass a value for max_depth, use the following guidelines for the parameter, based on the model_search attribute:
    * Grid: pass [int]
    * Randomized: int
    * Bayesian: (low, high)

#### `_calculate_min_samples_split(self, proposed_value)`
Returns the values for min_samples_split based on the proposed_value. This method is intended to be used to supply default values for the min_samples_split parameter of the public methods, if the min_samples_split parameter is None.

##### Arguments:
* **`proposed_value`** (`[int]`, *default*= `None`): The list of values for the min_samples_split parameter. See Notes.

##### Returns:
The value to use for the min_samples_split parameter.

##### Notes:
1. If min_samples_split is None, the value assigned will be based on the model_search attribute:
    * Grid: [2, 5, 10]
    * Randomized: random.randint(2, 11)
    * Bayesian: (1e-5, 1e-2), i.e., (low, high)
2. To pass a value for min_samples_split, use the following guidelines for the parameter, based on the model_search attribute:
    * Grid: pass [int]
    * Randomized: int
    * Bayesian: (low, high)

#### `_calculate_n_estimators(self, proposed_value)`
Returns the values for n_estimators based on the proposed_value. This method is intended to be used to supply default values for the n_estimators parameter of the public methods, if the n_estimators parameter is None.

##### Arguments:
* **`proposed_value`** (`[int]`, *default*= `None`): The list of values for the n_estimators parameter. See Notes.

##### Returns:
The value to use for the n_estimators parameter.

##### Notes:
1. If n_estimators is None, the value assigned will be based on the model_search attribute:
    * Grid: [10, 50, 100, 200]
    * Randomized: random.randint(10, 200)
    * Bayesian: (100, 500), i.e., (low, high)
2. To pass a value for n_estimators, use the following guidelines for the parameter, based on the model_search attribute:
    * Grid: pass [int]
    * Randomized: int
    * Bayesian: (low, high)

#### `_calculate_best_grid_search_model(self, rf, max_depth, min_samples_split, n_estimators, log)`
Calculates the best Grid Search model for the Random Forest Classifier.

##### Arguments:
* **`rf`** (`RandomForestClassifier`): The RandomForestClassifier model.
* **`max_depth`** (`[int]`, *default*= `None`): The max depth of the tree. See Notes.
* **`min_samples_split`** (`[int]`, *default*= `None`): The list of values for the min_samples_split parameter.
* **`n_estimators`** (`range`, *default*= `None`): The range of estimators to use. See Notes.
* **`log`** (`bool`, *default*= `True`): If True, results are output to the console.

##### Notes:
1. If max_depth is None, the value used will be [None, 10, 20].
2. If min_samples_split is None, the value used will be [2, 5, 10].
3. If n_estimators is None, the value used will be [10, 50, 100, 200].

#### `_calculate_best_randomized_search_model(self, rf, max_depth, min_samples_split, n_estimators, log)`
Calculates the best Randomized Search model for the Random Forest Classifier.

##### Arguments:
* **`rf`** (`RandomForestClassifier`): The RandomForestClassifier model.
* **`max_depth`** (`[int]`, *default*= `None`): The max depth of the tree. See Notes.
* **`min_samples_split`** (`[int]`, *default*= `None`): The list of values for the min_samples_split parameter.
* **`n_estimators`** (`range`, *default*= `None`): The range of estimators to use. See Notes.
* **`log`** (`bool`, *default*= `True`): If True, results are output to the console.

##### Notes:
1. If max_depth is None, the value used will be [None, 10, 20, 30].
2. If min_samples_split is None, the value used will be random.randint(2, 11).
3. If n_estimators is None, the value used will be random.randint(10, 200).

#### `_calculate_best_bayesian_optimization_model(self, rf, max_depth, min_samples_split, n_estimators, log)`
Calculates the best Bayesian Optimization model for the Random Forest Classifier.

##### Arguments:
* **`rf`** (`RandomForestClassifier`): The RandomForestClassifier model.
* **`max_depth`** (`[int]`, *default*= `None`): The max depth of the tree. See Notes.
* **`min_samples_split`** (`[int]`, *default*= `None`): The list of values for the min_samples_split parameter.
* **`n_estimators`** (`range`, *default*= `None`): The range of estimators to use. See Notes.
* **`log`** (`bool`, *default*= `True`): If True, results are output to the console.

##### Notes:
1. If max_depth is None, the value used will be (2, 20), i.e., (low, high).
2. If min_samples_split is None, the value used will be (1e-5, 1e-2), i.e., (low, high).
3. If n_estimators is None, the value used will be (100, 500), i.e., (low, high).


In [16]:
class MLRandomForest:
    def __init__(self,
                 ml_modelsets,
                 model_search,
                ):
        self._ml_modelsets = ml_modelsets
        self._model_search = model_search
        self._accuracy_results = MLModelResult(MLModelMethod.Accuracy)
        self._rmse_results = MLModelResult(MLModelMethod.RMSE)

    @property
    def accuracy_results(self):
        return self._accuracy_results

    @property
    def rmse_results(self):
        return self._rmse_results
    
    def _calculate_max_depth(
            self,
            proposed_value=None
    ):
        if proposed_value is None:
            match self._model_search:
                case MLModelSearch.Grid:
                    max_depth = [None, 10, 20]
                case MLModelSearch.Randomized:
                    max_depth = [None, 10, 20, 30]
                case MLModelSearch.BayesianOptimization:
                    max_depth = (2, 20)
        else:
            max_depth = proposed_value

        return max_depth
    
    def _calculate_min_samples_split(
            self,
            proposed_value=None
    ):
        if proposed_value is None:
            match self._model_search:
                case MLModelSearch.Grid:
                    min_samples_split = [2, 5, 10]
                case MLModelSearch.Randomized:
                    min_samples_split = [random.randint(2, 11)]
                case MLModelSearch.BayesianOptimization:
                    min_samples_split = (1e-5, 1e-2)
        else:
            min_samples_split = proposed_value

        return min_samples_split
    
    def _calculate_n_estimators(
            self,
            proposed_value=None
    ):
        if proposed_value is None:
            match self._model_search:
                case MLModelSearch.Grid:
                    n_estimators = [10, 50, 100, 200]
                case MLModelSearch.Randomized:
                    n_estimators = [random.randint(10, 200)]
                case MLModelSearch.BayesianOptimization:
                    n_estimators = (100, 500)
        else:
            n_estimators = proposed_value

        return n_estimators

    def calculate_best_model(self,
                             n_estimators=None,
                             max_depth=None,
                             min_samples_split=None,
                             log=True
                            ):
        self._accuracy_results.reset()
        rf = RandomForestClassifier()

        match self._model_search:
            case MLModelSearch.Grid:
                if log:
                    print()
                    print(f'------------ Random Forest (Accuracy): Grid ------------------')
    
                self._calculate_best_grid_search_model(
                    rf,
                    max_depth,
                    min_samples_split,
                    n_estimators,
                    log
                    )
            case MLModelSearch.Randomized:
                if log:
                    print()
                    print(f'--------- Random Forest (Accuracy): Randomized ---------------')
    
                self._calculate_best_randomized_search_model(
                    rf,
                    max_depth,
                    min_samples_split,
                    n_estimators,
                    log
                    )
            case MLModelSearch.BayesianOptimization:
                if log:
                    print()
                    print(f'---- Random Forest (Accuracy): Bayesian Optimization ----------')

                self._calculate_best_bayesian_optimization_model(
                    rf,
                    max_depth,
                    min_samples_split,
                    n_estimators,
                    log
                    )

    def _calculate_best_grid_search_model(
            self,
            rf,
            max_depth=None,
            min_samples_split=None,
            n_estimators=None,
            log=True
            ):
        max_depth = self._calculate_max_depth(max_depth)
        min_samples_split = self._calculate_min_samples_split(min_samples_split)
        n_estimators = self._calculate_n_estimators(n_estimators)

        param_grid = {
            'n_estimators': n_estimators,
            'max_depth': max_depth,
            'min_samples_split': min_samples_split
        }

        grid_search = GridSearchCV(rf,
                                   param_grid,
                                   cv=5,
                                   scoring='accuracy',
                                   n_jobs=-1
                                   )
        grid_search.fit(self._ml_modelsets.df_train.features, self._ml_modelsets.df_train.target)
        # Store the refitted best estimator rather than the unfitted template.
        self._accuracy_results.store_result(grid_search.best_estimator_, grid_search.best_score_, grid_search.best_params_)

        if log:
            print()
            print(f'Best parameters: {grid_search.best_params_}')
            print(f'Best score: {grid_search.best_score_}')

    def _calculate_best_randomized_search_model(
            self,
            rf,
            max_depth=None,
            min_samples_split=None,
            n_estimators=None,
            log=True
            ):
        max_depth = self._calculate_max_depth(max_depth)
        min_samples_split = self._calculate_min_samples_split(min_samples_split)
        n_estimators = self._calculate_n_estimators(n_estimators)

        param_dist = {
            'n_estimators': n_estimators,
            'max_depth': max_depth,
            'min_samples_split': min_samples_split
        }

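        # Note: the random values were already drawn above; this is still an exhaustive
        # GridSearchCV over those values, not scikit-learn's RandomizedSearchCV.
        grid_search = GridSearchCV(rf,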
                                   param_dist,
                                   cv=5,
                                   scoring='accuracy',
                                   n_jobs=-1
                                   )
        grid_search.fit(self._ml_modelsets.df_train.features, self._ml_modelsets.df_train.target)
        # Store the refitted best estimator rather than the unfitted template.
        self._accuracy_results.store_result(grid_search.best_estimator_, grid_search.best_score_, grid_search.best_params_)

        if log:
            print()
            print(f'Best parameters: {grid_search.best_params_}')
            print(f'Best score: {grid_search.best_score_}')

    def _calculate_best_bayesian_optimization_model(
            self,
            rf,
            max_depth=None,
            min_samples_split=None,
            n_estimators=None,
            log=True
            ):
        max_depth = self._calculate_max_depth(max_depth)
        min_samples_split = self._calculate_min_samples_split(min_samples_split)
        n_estimators = self._calculate_n_estimators(n_estimators)
        rf_space = [
            Integer(low=max_depth[0], high=max_depth[1], name='max_depth'),
            Real(low=min_samples_split[0], high=min_samples_split[1], name='min_samples_split'),
            Integer(low=n_estimators[0], high=n_estimators[1], name='n_estimators')
        ]

        @use_named_args(rf_space)
        def objective(**params):
            model = RandomForestClassifier(
                random_state=self._ml_modelsets.random_state,
                **params
            )

            score = cross_val_score(
                model,
                self._ml_modelsets.df_train.features,
                self._ml_modelsets.df_train.target,
                cv=5,
                n_jobs=-1,
                scoring='accuracy'
                ).mean()
            
            return -score

        res_gp = gp_minimize(
            func=objective,
            dimensions=rf_space,
            n_calls=50,
            n_random_starts=10,
            random_state=self._ml_modelsets.random_state
        )

        best_hyperparameters = dict(zip([d.name for d in rf_space], res_gp.x))
        best_score = -res_gp.fun
        self._accuracy_results.store_result(rf, best_score, best_hyperparameters)

        if log:
            print()
            print(f'Best cross-validation accuracy found: {best_score:.4f}')
            print(f'Best hyperparameters: {best_hyperparameters}')

    def rmse(self,
             max_depth=None,
             n_estimators=None,
             log=True
            ):
        self._rmse_results.reset()
        max_depth = self._calculate_max_depth(max_depth)
        n_estimators = self._calculate_n_estimators(n_estimators)
        
        if log:
            print()
            print(f'------------------- Random Forest (RMSE) -------------------')

        for est in n_estimators:
            for depth in max_depth:
                model = RandomForestClassifier(random_state=self._ml_modelsets.random_state,                                          
                                               n_estimators=est,
                                               max_depth=depth
                                              )
                model.fit(self._ml_modelsets.df_main.features, self._ml_modelsets.df_main.target)
                predictions = model.predict(self._ml_modelsets.df_test.features)
                result = mean_squared_error(self._ml_modelsets.df_test.target, predictions) ** 0.5
                self._rmse_results.store_result(model, result, (est, depth), storage_paradigm=MLModelStorageParadigm.LowestValue)

                if log:
                    # Logs the best result stored so far, not the current (est, depth) pair.
                    print(f'RMSE model (n_estimators, max_depth = {self._rmse_results.parameters}): {self._rmse_results.score}')

        if log:
            print()
            print(f'Best model: (n_estimators, max_depth = {self._rmse_results.parameters}): {self._rmse_results.score}')
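
The Randomized branch above draws single random values for `min_samples_split` and `n_estimators` and passes them to `GridSearchCV`. For comparison, scikit-learn also provides `RandomizedSearchCV`, which samples a fixed number of parameter combinations from distributions. The following is a minimal sketch and not part of the project code: the synthetic data from `make_classification`, the distribution bounds, `cv=3`, and `n_iter=5` are assumptions chosen only for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training features and target (assumption).
X, y = make_classification(n_samples=200, n_features=4, random_state=12345)

# Distributions to sample from; randint(low, high) excludes `high`.
param_distributions = {
    'n_estimators': randint(10, 51),
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': randint(2, 12),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=12345),
    param_distributions,
    n_iter=5,            # number of sampled parameter combinations
    cv=3,
    scoring='accuracy',
    random_state=12345,  # makes the sampled combinations reproducible
)
search.fit(X, y)

print(f'Best parameters: {search.best_params_}')
print(f'Best score: {search.best_score_:.4f}')
```

With this approach the number of model fits is controlled by `n_iter` rather than by the size of the grid.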

## MLLogisticRegression

### Summary
Wraps a scikit-learn `LogisticRegression` model and records its accuracy on the training, validation, and test sets.

### Attributes:
* **`ml_modelsets`** (`MLModelSets`): The associated model sets object. Internal.
* **`model`** (`LogisticRegression`): The model corresponding to the scores. Read-only.
* **`test_score`** (`float`): The test set score. Read-only.  
* **`train_score`** (`float`): The training set score. Read-only.
* **`valid_score`** (`float`): The validation set score. Read-only.

#### Constructor (`__init__`)
Initializer.

##### Arguments:
* **`ml_modelsets`** (`MLModelSets`): The associated model sets object.  

### Public Methods:

#### `calculate_scores(self, log)`
Calculates the scores of the various sets.

##### Arguments:
* **`log`** (`bool`, *default*= `True`): If True, results are output to the console.

In [17]:
class MLLogisticRegression:
    def __init__(self,
                 ml_modelsets,
                ):
        self._ml_modelsets = ml_modelsets
        self._model = None
        self._train_score = 0.0
        self._valid_score = 0.0
        self._test_score = 0.0

    @property
    def model(self):
        return self._model
        
    @property
    def train_score(self):
        return self._train_score

    @property
    def valid_score(self):
        return self._valid_score

    @property
    def test_score(self):
        return self._test_score

    def calculate_scores(self,
                         log=True
                        ):
        self._model = LogisticRegression(random_state=self._ml_modelsets.random_state,
                                         solver='liblinear'
                                        )
        self._model.fit(self._ml_modelsets.df_train.features, self._ml_modelsets.df_train.target)
        self._train_score = self._model.score(self._ml_modelsets.df_train.features, self._ml_modelsets.df_train.target)
        self._valid_score = self._model.score(self._ml_modelsets.df_valid.features, self._ml_modelsets.df_valid.target)
        self._test_score = self._model.score(self._ml_modelsets.df_test.features, self._ml_modelsets.df_test.target)

        if log:
            print()
            print(f'--------------- Logistic Regression (Accuracy) ---------------')
            print(f'Accuracy of the logistic regression model on the training set: {self._train_score}')
            print(f'Accuracy of the logistic regression model on the validation set: {self._valid_score}')
            print(f'Accuracy of the logistic regression model on the test set: {self._test_score}')

## MLLinearRegression

### Summary
Wraps a scikit-learn `LinearRegression` model and records its RMSE on the test set.

### Attributes:
* **`ml_modelsets`** (`MLModelSets`): The associated model sets object. Internal.
* **`model`** (`LinearRegression`): The model corresponding to the RMSE. Read-only.
* **`rmse_result`** (`float`): The calculated RMSE. Read-only.  

#### Constructor (`__init__`)
Initializer.

##### Arguments:
* **`ml_modelsets`** (`MLModelSets`): The associated model sets object.  

### Public Methods:

#### `rmse(self, log)`
Calculates the RMSE, and updates the object properties.

##### Arguments:
* **`log`** (`bool`, *default*= `True`): If True, results are output to the console.

In [18]:
class MLLinearRegression:
    def __init__(self,
                 ml_modelsets,
                ):
        self._ml_modelsets = ml_modelsets
        self._model = None
        self._rmse_result = 0.0

    @property
    def model(self):
        return self._model
        
    @property
    def rmse_result(self):
        return self._rmse_result

    def rmse(self,
             log=True
            ):
        # Keep the fitted model on the instance so the `model` property returns it.
        self._model = LinearRegression()
        self._model.fit(self._ml_modelsets.df_main.features, self._ml_modelsets.df_main.target)
        predictions = self._model.predict(self._ml_modelsets.df_test.features)
        self._rmse_result = mean_squared_error(self._ml_modelsets.df_test.target, predictions) ** 0.5

        if log:
            print()
            print(f'----------------- Linear Regression (RMSE) -----------------')
            print(f'Linear Regression: {self._rmse_result}')

## MLThreeSetSplit

### Summary
Creates machine learning datasets from a source dataset. Includes methods to determine the best model, and to provide a summary of the model calculations.

### Attributes:
* **`accuracy_threshold`** (`float`): The lowest accuracy value that should be considered for a final model. Read-write.
* **`best_model`** (`MLModel`): The model type that produces the most accurate model. Internal.
* **`best_score`** (`float`): The best model classifier score. Internal.
* **`decision_tree`** (`MLDecisionTree`): The Decision Tree Classifier. Internal.
* **`df`** (`dataset`): The dataframe object. Internal.
* **`df_main`** (`dataset`): The main dataset, created by the initial train_test_split. Read-only.
* **`df_target`** (`dataset`): The target from the complete source dataset. Read-only.
* **`df_test`** (`dataset`): The test dataset, created by the initial train_test_split. Read-only.
* **`df_train`** (`dataset`): The training dataset. Read-only.
* **`df_valid`** (`dataset`): The validation dataset. Read-only.
* **`features_to_ignore`** (`[str]`): A list of columns (other than target) to ignore for training. Internal.
* **`linear_regression`** (`MLLinearRegression`): The Linear Regression. Internal.
* **`logistic_regression`** (`MLLogisticRegression`): The Logistic Regression. Internal.
* **`random_forest_bayesian_optimization`** (`MLRandomForest`): The Random Forest Classifier for use with a Bayesian Optimization. Internal.
* **`random_forest_grid`** (`MLRandomForest`): The Random Forest Classifier for use with a Grid Search. Internal.
* **`random_forest_randomized`** (`MLRandomForest`): The Random Forest Classifier for use with a Randomized Search. Internal.
* **`random_state`** (`int`): The random state to use. Internal.
* **`target`** (`[str]`): A list of columns to use as target. Internal.
* **`test_size`** (`float`): The portion of the source dataset to use for testing. Internal.
* **`valid_size`** (`float`): The portion of the main dataset to use for validation. Internal.

#### Constructor (`__init__`)
Initializer.

##### Arguments:
* **`df`** (`dataset`): The dataframe object.
* **`target`** (`[str]`): A list of columns to use as target.
* **`features_to_ignore`** (`[str]`, *default*= `[]`): A list of columns (other than target) to ignore for training.
* **`test_size`** (`float`, *default*= `0.2`): The portion of the source dataset to use for testing. See Notes.
* **`valid_size`** (`float`, *default*= `0.25`): The portion of the main dataset to use for validation. See Notes.
* **`random_state`** (`int`, *default*= `12345`): The random state to use.
* **`accuracy_threshold`** (`float`, *default*= `0.0`): The lowest accuracy value that should be considered for a final model.

##### Notes:
1. The main (working) dataset is the result of the split between the source dataset and the test dataset. If the ratio for train:validation:test is to be 3:1:1, specify 0.2 for the test set, and 0.25 for the validation set (25% of 80% = 20% of the source dataset). These are the default values for the \_\_init\_\_ method.
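
The arithmetic in the note above can be checked with a short sketch. The helper names `split_fractions` and `split_counts` are hypothetical and not part of the project code. The row counts assume that each held-out set is rounded up, which matches how scikit-learn's `train_test_split` treats a float `test_size`.

```python
import math

def split_fractions(test_size=0.2, valid_size=0.25):
    """Return (train, valid, test) as fractions of the source dataset."""
    main = 1.0 - test_size           # share kept for training + validation
    valid = main * valid_size        # validation share of the source dataset
    train = main - valid             # the remainder is the training share
    return train, valid, test_size

def split_counts(n_rows, test_size=0.2, valid_size=0.25):
    """Return (train, valid, test) row counts, rounding held-out sets up."""
    n_test = math.ceil(n_rows * test_size)
    n_main = n_rows - n_test
    n_valid = math.ceil(n_main * valid_size)
    return n_main - n_valid, n_valid, n_test

train, valid, test = split_fractions()
print(f'train={train:.2f}, valid={valid:.2f}, test={test:.2f}')

# Row counts for a source dataset of 3214 rows.
print(split_counts(3214))
```

The default values therefore give roughly 60% / 20% / 20% of the source dataset, i.e., the 3:1:1 ratio.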

### Public Methods:

#### `train_test_split(self)`
* Initializes the following attributes:
    * df_main
    * df_test
    * df_train
    * df_valid
    * decision_tree
    * random_forest_grid
    * random_forest_randomized
    * random_forest_bayesian_optimization
    * logistic_regression
    * linear_regression

#### `analyze_models(self, max_depth, min_samples_split, n_estimators, log)`

Splits the dataset, then tunes and scores the various models (Decision Tree, Random Forest, Logistic Regression). The best model is selected afterwards, by `model_summary`, based on the accuracy threshold.

##### Arguments:
* **`max_depth`** (`[int]`, *default*= `None`): The values to try for the max depth of the tree. Forwarded to the Decision Tree; the Random Forest searches use their defaults. See Notes.
* **`min_samples_split`** (`[int]`, *default*= `None`): The list of values for the min_samples_split parameter. Forwarded to the Decision Tree; the Random Forest searches use their defaults. See Notes.
* **`n_estimators`** (`range`, *default*= `None`): The range of estimators to use. Forwarded to each Random Forest search. See Notes.
* **`log`** (`bool`, *default*= `True`): If True, results are output to the console.

##### Notes:
1. If max_depth is None, the value assigned will be based on the model_search attribute:
    * Grid: [None, 10, 20]
    * Randomized: [None, 10, 20, 30]
    * Bayesian: (2, 20), i.e., (low, high)
2. To pass a value for max_depth, use the following guidelines for the parameter, based on the model_search attribute:
    * Grid: pass [int]
    * Randomized: pass [int]
    * Bayesian: pass (low, high)
3. If min_samples_split is None, the value assigned will be based on the model_search attribute:
    * Grid: [2, 5, 10]
    * Randomized: [random.randint(2, 11)], i.e., a single randomly drawn value
    * Bayesian: (1e-5, 1e-2), i.e., (low, high), interpreted as a fraction of the samples
4. To pass a value for min_samples_split, use the following guidelines for the parameter, based on the model_search attribute:
    * Grid: pass [int]
    * Randomized: pass [int]
    * Bayesian: pass (low, high)
5. If n_estimators is None, the value assigned will be based on the model_search attribute:
    * Grid: [10, 50, 100, 200]
    * Randomized: [random.randint(10, 200)], i.e., a single randomly drawn value
    * Bayesian: (100, 500), i.e., (low, high)
6. To pass a value for n_estimators, use the following guidelines for the parameter, based on the model_search attribute:
    * Grid: pass [int]
    * Randomized: pass [int]
    * Bayesian: pass (low, high)
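
To give a feel for the cost of the Grid defaults listed in the Notes above, the sketch below counts the parameter combinations and the resulting cross-validation fits. The variable names are illustrative, and `cv_folds = 5` mirrors the `cv=5` argument used in the `GridSearchCV` call.

```python
from itertools import product

# Grid defaults as listed in the Notes above.
grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
}
cv_folds = 5  # mirrors cv=5 in the GridSearchCV call

# Every combination of one value per hyperparameter.
combinations = list(product(*grid.values()))
n_fits = len(combinations) * cv_folds

print(f'{len(combinations)} combinations x {cv_folds} folds = {n_fits} fits')
```

Adding one more value to any list multiplies the number of fits, which is worth keeping in mind before passing larger lists.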

#### `model_summary(self)`
Returns a summary of the model analysis.

* **Returns:**
    * `str` The summary of the model analysis.

Sample output:
```text
------------------ Model Summary ------------------
Best model within accuracy threshold: Random Forest (Grid)

parameters: {'max_depth': 10, 'min_samples_split': 10, 'n_estimators': 100}
accuracy:   82.00%
```

#### `sanity_check(self, max_depth, n_estimators, log)`
Calls the rmse() function for each model type, and, optionally, prints the results.

##### Arguments:
* **`max_depth`** (`range`, *default*= `None`): The range of values to use with Decision Tree and Random Forest models. See Notes.
* **`n_estimators`** (`range`, *default*= `None`): The range of values to use as estimators with Random Forest models. See Notes.
* **`log`** (`bool`, *default*= `True`): If True, results are output to the console.

##### Notes:
1. If max_depth is None, the value assigned will be based on the model_search attribute:
    * Grid: [None, 10, 20]
    * Randomized: [None, 10, 20, 30]
    * Bayesian: (2, 20), i.e., (low, high)
2. To pass a value for max_depth, use the following guidelines for the parameter, based on the model_search attribute:
    * Grid: pass [int]
    * Randomized: pass [int]
    * Bayesian: pass (low, high)
3. If n_estimators is None, the value assigned will be based on the model_search attribute:
    * Grid: [10, 50, 100, 200]
    * Randomized: [random.randint(10, 200)], i.e., a single randomly drawn value
    * Bayesian: (100, 500), i.e., (low, high)
4. To pass a value for n_estimators, use the following guidelines for the parameter, based on the model_search attribute:
    * Grid: pass [int]
    * Randomized: pass [int]
    * Bayesian: pass (low, high)

### Private Methods:

#### `_evaluate_models(self)`
Evaluates the models, to determine which is the best model for analysis.
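
The selection rule can be sketched as a standalone function. This is a simplified illustration rather than the class's actual code: `select_best` is a hypothetical helper, and the scores are the rounded cross-validation accuracies from the recorded run in section 3.1.

```python
def select_best(scores, threshold):
    """Return (name, score) of the highest score at or above threshold.

    Candidates are checked in insertion order, and a later candidate must be
    strictly better to replace an earlier one, so ties keep the earlier name.
    """
    best_name, best_score = None, 0.0
    for name, score in scores.items():
        if score >= threshold and score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

# Rounded cross-validation accuracies from the recorded run.
scores = {
    'Decision Tree': 0.7853,
    'Random Forest (Grid)': 0.8195,
    'Random Forest (Randomized)': 0.8138,
    'Random Forest (Bayesian Optimization)': 0.8180,
}

print(select_best(scores, threshold=0.75))
print(select_best(scores, threshold=0.90))
```

When no candidate reaches the threshold, the sketch returns `(None, 0.0)`, which corresponds to the "no best model" case in the summary.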

In [19]:
class MLThreeSetSplit(MLModelSets):
    def __init__(self, 
                 df, 
                 target, 
                 features_to_ignore=[], 
                 test_size=0.2, 
                 valid_size=0.25,
                 random_state=12345,
                 accuracy_threshold=0.0
                ):
        self._df = df
        self._target = any_to_list(target)
        self._features_to_ignore = any_to_list(features_to_ignore)
        self._test_size = test_size
        self._valid_size = valid_size
        self._random_state = random_state
        self._accuracy_threshold = accuracy_threshold
        self._df_target = df[self._target]
        self._df_main = None
        self._df_test = None
        self._df_train = None
        self._df_valid = None
        self._best_model = MLModel.Undefined
        self._best_score = 0.0
        self._decision_tree = None
        self._random_forest_grid = None
        self._random_forest_randomized = None
        self._random_forest_bayesian_optimization = None
        self._logistic_regression = None
        self._linear_regression = None

    @property
    def df_target(self):
        return self._df_target

    @property
    def df_main(self):
        return self._df_main

    @property
    def df_test(self):
        return self._df_test

    @property
    def df_train(self):
        return self._df_train

    @property
    def df_valid(self):
        return self._df_valid

    @property
    def random_state(self):
        return self._random_state

    @property
    def accuracy_threshold(self):
        return self._accuracy_threshold

    @accuracy_threshold.setter
    def accuracy_threshold(self, value):
        self._accuracy_threshold = float(value)

    def train_test_split(self):
        if len(self._features_to_ignore) > 0:
            tts_features = self._df.drop(self._features_to_ignore, axis=1)
        else:
            tts_features = self._df
        
        tts_features = tts_features.drop(self._target, axis=1)
        tts_target = self._df[self._target]

        # Create main (for training and validation) and test sets
        features_main, features_test, target_main, target_test = train_test_split(tts_features, 
                                                                                  tts_target,
                                                                                  test_size=self._test_size, 
                                                                                  random_state=self._random_state
                                                                                 )
        self._df_main = MLDataset(features_main,
                                  target_main
                                  )
        self._df_test = MLDataset(features_test,
                                  target_test
                                 )

        # Create training and validation sets
        features_train, features_valid, target_train, target_valid = train_test_split(features_main, 
                                                                                      target_main,
                                                                                      test_size=self._valid_size, 
                                                                                      random_state=self._random_state
                                                                                     )
        self._df_train = MLDataset(features_train,
                                   target_train
                                  )
        self._df_valid = MLDataset(features_valid,
                                   target_valid
                                  )

        self._decision_tree = MLDecisionTree(self)
        self._random_forest_grid = MLRandomForest(self, MLModelSearch.Grid)
        self._random_forest_randomized = MLRandomForest(self, MLModelSearch.Randomized)
        self._random_forest_bayesian_optimization = MLRandomForest(self, MLModelSearch.BayesianOptimization)
        self._logistic_regression = MLLogisticRegression(self)
        self._linear_regression = MLLinearRegression(self)

    def analyze_models(
            self,
            max_depth=None,
            min_samples_split=None,
            n_estimators=None,
            log=True
            ):
        self.train_test_split()
        self._decision_tree.tune(
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            log=log
            )
        self._random_forest_grid.calculate_best_model(
            n_estimators=n_estimators,
            log=log
            )
        self._random_forest_randomized.calculate_best_model(
            n_estimators=n_estimators,
            log=log
            )
        self._random_forest_bayesian_optimization.calculate_best_model(
            n_estimators=n_estimators,
            log=log
            )
        self._logistic_regression.calculate_scores(log=log)

    def _evaluate_models(self):
        # Models in order of Accuracy: DecisionTree (low), LogisticRegression (medium), RandomForest (high)
        self._best_score = 0.0
        self._best_model = MLModel.Undefined

        if self._decision_tree.accuracy_results.score >= self._accuracy_threshold:
            self._best_model = MLModel.DecisionTree
            self._best_score = self._decision_tree.accuracy_results.score

        if self._logistic_regression.test_score >= self._accuracy_threshold and \
            self._logistic_regression.test_score > self._best_score:
            self._best_model = MLModel.LogisticRegression
            self._best_score = self._logistic_regression.test_score

        if self._random_forest_grid.accuracy_results.score >= self._accuracy_threshold and \
            self._random_forest_grid.accuracy_results.score > self._best_score:
            self._best_model = MLModel.RandomForestGrid
            self._best_score = self._random_forest_grid.accuracy_results.score

        if self._random_forest_randomized.accuracy_results.score >= self._accuracy_threshold and \
            self._random_forest_randomized.accuracy_results.score > self._best_score:
            self._best_model = MLModel.RandomForestRandomized
            self._best_score = self._random_forest_randomized.accuracy_results.score

        if self._random_forest_bayesian_optimization.accuracy_results.score >= self._accuracy_threshold and \
            self._random_forest_bayesian_optimization.accuracy_results.score > self._best_score:
            self._best_model = MLModel.RandomForestBayesianOptimization
            self._best_score = self._random_forest_bayesian_optimization.accuracy_results.score

    def model_summary(self):
        self._evaluate_models()
        best_model = "Unable to determine best model"
        model_description = ""
        
        if self._best_model == MLModel.Undefined:
            best_model = "Undefined"
        elif self._best_model == MLModel.DecisionTree:
            best_model = "Decision Tree"
            model_description = f"""
max_depth: {self._decision_tree.accuracy_results.depth}
accuracy:  {self._decision_tree.accuracy_results.score:.2%}
"""
        elif self._best_model == MLModel.RandomForestGrid:
            best_model = "Random Forest (Grid)"
            model_description = f"""
parameters: {self._random_forest_grid.accuracy_results.parameters}
accuracy:   {self._random_forest_grid.accuracy_results.score:.2%}
"""            
        elif self._best_model == MLModel.RandomForestRandomized:
            best_model = "Random Forest (Randomized)"
            model_description = f"""
parameters: {self._random_forest_randomized.accuracy_results.parameters}
accuracy:   {self._random_forest_randomized.accuracy_results.score:.2%}
"""            
        elif self._best_model == MLModel.RandomForestBayesianOptimization:
            best_model = "Random Forest (Bayesian Optimization)"
            model_description = f"""
parameters: {self._random_forest_bayesian_optimization.accuracy_results.parameters}
accuracy:   {self._random_forest_bayesian_optimization.accuracy_results.score:.2%}
"""            
        elif self._best_model == MLModel.LogisticRegression:
            best_model = "Logistic Regression"
            model_description = f"""
accuracy:   {self._logistic_regression.test_score:.2%}
"""     
        else:
            best_model = "Unknown value"

        summary = f"""------------------ Model Summary ------------------
Best model within accuracy threshold: {best_model}
{model_description}
"""
        return summary

    def sanity_check(self,
                     max_depth=None,
                     n_estimators=None,
                     log=True
                    ):
        self._decision_tree.rmse(max_depth=max_depth,
                                 log=log
                                )
        self._random_forest_grid.rmse(max_depth=max_depth,
                                      n_estimators=n_estimators,
                                      log=log
                                     )
        self._linear_regression.rmse(log=log)

## 1&emsp;Load and Analyze the Dataset

In [20]:
users_behavior = pd.read_csv('datasets/users_behavior.csv')
analyze_dataset(users_behavior)

Rows: 3214
Duplicate rows: 0

Column           Non-null   Data type   
calls                3214   float64     
minutes              3214   float64     
messages             3214   float64     
mb_used              3214   float64     
is_ultra             3214   int64       

Column:             calls
Data type:          float64
Non-null:           3214
N/A count:          0
Unique values:      184
Integer values:     3214
Non-integer values: 0


Column:             minutes
Data type:          float64
Non-null:           3214
N/A count:          0
Unique values:      3149
Integer values:     57
Non-integer values: 3157


Column:             messages
Data type:          float64
Non-null:           3214
N/A count:          0
Unique values:      180
Integer values:     3214
Non-integer values: 0


Column:             mb_used
Data type:          float64
Non-null:           3214
N/A count:          0
Unique values:      3203
Integer values:     26
Non-integer values: 3188


Column:         

<__main__.DataframeInfo at 0x16d1027b0>

### 1.1&emsp;Data Analysis Summary

There are 3214 records in the dataset.  
The data types of the columns are shown above.  
There are no missing values in any of the columns.  
Importing the CSV file into Excel allowed for further data inspection.  

* Issues / Patterns / Anomalies
    * calls
        * This column has zero non-integer values.
        * Convert to int64.
    * minutes
        * The Excel import reveals that many values in this column have significantly more than two decimal places.
        * For this analysis, fractional minutes are not significant.
        * Convert the values to int64, effectively rounding down to zero decimal places.
    * messages
        * This column has zero non-integer values.
        * Convert to int64.
    * mb_used
        * The Excel import reveals that many values in this column have significantly more than two decimal places.
        * For this analysis, fractional megabytes of data are not significant.
        * Convert the values to int64, effectively rounding down to zero decimal places.
    * is_ultra
        * This column has two unique values.
        * The values must remain numeric in order to use the column as a target in RMSE calculations.
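
The effect of converting a `float64` column to `int64` can be checked on a small example. The sketch below uses hypothetical values, not rows from the dataset, and relies on the fact that `astype('int64')` drops the fractional part, which equals rounding down when no value is negative.

```python
import pandas as pd

# Hypothetical values, chosen only to show the rounding behavior.
sample = pd.DataFrame({'minutes': [311.9, 516.75, 0.99],
                       'mb_used': [19915.42, 22696.96, 0.5]})

# The fractional part is dropped; values are not rounded to the nearest integer.
as_int = sample.astype({'minutes': 'int64', 'mb_used': 'int64'})

print(as_int)
print(as_int.dtypes)
```

If rounding to the nearest integer were preferred instead, calling `round()` on the columns before the conversion would be one option.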

## 2&emsp;Data Preparation

### Manipulate the dataset based on the conclusions reached in the Data Analysis Summary.

In [21]:
# Change data types, as needed.
users_behavior = users_behavior.astype({'calls': 'int64', 
                                        'minutes': 'int64',
                                        'messages': 'int64',
                                        'mb_used': 'int64'
                                       })

# Print info() and head() for verification.
users_behavior.info()
users_behavior.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   calls     3214 non-null   int64
 1   minutes   3214 non-null   int64
 2   messages  3214 non-null   int64
 3   mb_used   3214 non-null   int64
 4   is_ultra  3214 non-null   int64
dtypes: int64(5)
memory usage: 125.7 KB


| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40 | 311 | 83 | 19915 | 0 |
| 1 | 85 | 516 | 56 | 22696 | 0 |
| 2 | 77 | 467 | 86 | 21060 | 0 |
| 3 | 106 | 745 | 81 | 8437 | 1 |
| 4 | 66 | 418 | 1 | 14502 | 0 |


## 3&emsp;Create main set, training set, validation set, and test set.

In [22]:
ml_source = MLThreeSetSplit(users_behavior,
                            'is_ultra',
                            accuracy_threshold=0.75  # the 75% threshold used in the Conclusion
                           )

### 3.1&emsp;Analyze Models

In [23]:
ml_source.analyze_models()


------------------ Decision Tree (Accuracy) ------------------

Best parameters: {'max_depth': 10, 'min_samples_split': 10}
Best score: 0.7852943947244466

------------ Random Forest (Accuracy): Grid ------------------

Best parameters: {'max_depth': 10, 'min_samples_split': 5, 'n_estimators': 50}
Best score: 0.8195007065473388

--------- Random Forest (Accuracy): Randomized ---------------

Best parameters: {'max_depth': 10, 'min_samples_split': 4, 'n_estimators': 190}
Best score: 0.8137971872686898

---- Random Forest (Accuracy): Bayesian Optimization ----------

Best cross-validation accuracy found: 0.8180
Best hyperparameters: {'max_depth': np.int64(7), 'min_samples_split': 0.003993452899784223, 'n_estimators': np.int64(423)}

--------------- Logistic Regression (Accuracy) ---------------
Accuracy of the logistic regression model on the training set: 0.7017634854771784
Accuracy of the logistic regression model on the validation set: 0.6951788491446346
Accuracy of the logistic regr

### 3.2&emsp;Print Model Summary

In [24]:
print(ml_source.model_summary())

------------------ Model Summary ------------------
Best model within accuracy threshold: Random Forest (Grid)

parameters: {'max_depth': 10, 'min_samples_split': 5, 'n_estimators': 50}
accuracy:   81.95%
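
For readers who want to see how a result of this shape can be produced, here is a minimal sketch of a grid search over the same three hyperparameters using `GridSearchCV`. The synthetic data, the grid values, and `cv=3` are assumptions chosen so the example runs quickly; they are not the settings used by `analyze_models()`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification data standing in for the real features.
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

# Assumed grid; every combination is scored with cross-validation.
param_grid = {
    "max_depth": [5, 10],
    "min_samples_split": [2, 5],
    "n_estimators": [10, 50],
}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="accuracy", cv=3)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best score:", search.best_score_)
```

`best_score_` is the mean cross-validated accuracy of the best combination, which is the kind of figure reported as "Best score" in the output above.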




### Conclusion  

* The accuracy threshold is 75%.
* The best Decision Tree (78.53%) and Random Forest models both met the accuracy threshold.
* Of these, the Random Forest (Grid) model had the highest accuracy score (81.95%).
* The Logistic Regression model did not reach the accuracy threshold (roughly 70% on both the training and validation sets).
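
The selection rule described in these bullets can be sketched in a few lines. The scores are copied from the output above; the function name `best_within_threshold` is hypothetical, and using the validation accuracy for Logistic Regression alongside cross-validation scores for the other models is a simplifying assumption for illustration.

```python
ACCURACY_THRESHOLD = 0.75  # threshold stated in the conclusion

# Scores copied from the notebook output; Logistic Regression uses its
# validation-set accuracy (an assumption made for this comparison).
scores = {
    "Decision Tree": 0.7852943947244466,
    "Random Forest (Grid)": 0.8195007065473388,
    "Random Forest (Randomized)": 0.8137971872686898,
    "Random Forest (Bayesian)": 0.8180,
    "Logistic Regression": 0.6951788491446346,
}


def best_within_threshold(scores, threshold):
    """Return the highest-scoring model that meets the threshold, or None."""
    qualifying = {name: s for name, s in scores.items() if s >= threshold}
    if not qualifying:
        return None
    return max(qualifying, key=qualifying.get)


best = best_within_threshold(scores, ACCURACY_THRESHOLD)
print(f"{best}: {scores[best]:.2%}")
```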

## 4&emsp;Sanity Check

In [26]:
ml_source.sanity_check()


------------------- Decision Tree (RMSE) -------------------
max_depth = None : 0.5334823534681324
max_depth = 10 : 0.48299222871755937
max_depth = 20 : 0.5392812450616766

Best model: 10 (0.48299222871755937)

------------------- Random Forest (RMSE) -------------------
RMSE model (n_estimators, max_depth = (10, None)): 0.4748741340974856
RMSE model (n_estimators, max_depth = (10, 10)): 0.45136737131523463
RMSE model (n_estimators, max_depth = (10, 10)): 0.45136737131523463
RMSE model (n_estimators, max_depth = (10, 10)): 0.45136737131523463
RMSE model (n_estimators, max_depth = (50, 10)): 0.447908566541585
RMSE model (n_estimators, max_depth = (50, 10)): 0.447908566541585
RMSE model (n_estimators, max_depth = (50, 10)): 0.447908566541585
RMSE model (n_estimators, max_depth = (50, 10)): 0.447908566541585
RMSE model (n_estimators, max_depth = (50, 10)): 0.447908566541585
RMSE model (n_estimators, max_depth = (50, 10)): 0.447908566541585
RMSE model (n_estimators, max_depth = (200, 10))

### Conclusion  

* The Linear Regression model has the best (lowest) RMSE, 0.443.
* The best Random Forest model (n_estimators = 200, max_depth = 10) is only 0.003 above the Linear Regression model, a virtual tie for the best RMSE.
* The best Decision Tree model (max_depth = 10), at 0.483, is about 0.036 higher than the Random Forest model.

Given the Random Forest model's results on both measures (the highest accuracy and an RMSE in a virtual tie for the best), it is judged the highest-quality model overall.
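
As a side note on reading RMSE for a 0/1 target: when a model outputs hard class labels, each squared error is either 0 or 1, so the RMSE equals the square root of the error rate. The sketch below checks this on made-up labels. Whether `sanity_check()` scores hard labels or continuous predictions is not shown in this section, and the identity does not hold for a regressor such as Linear Regression that outputs continuous values.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Made-up labels and hard 0/1 predictions (two of ten are wrong).
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1, 0, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 0])

# RMSE is the square root of the mean squared error.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
accuracy = accuracy_score(y_true, y_pred)

# For hard 0/1 predictions, MSE equals the error rate (1 - accuracy).
print("RMSE:", rmse)
print("sqrt(1 - accuracy):", np.sqrt(1 - accuracy))
```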