In this notebook, I'll automate model training and pick the best algorithm from the results. This is a classification problem, and I'll try three algorithms: K-Nearest Neighbors, SVM, and Random Forest. First, import the common packages. To report the training results, also install `beautifultable`.

In [None]:
%pip install beautifultable

In [None]:
import argparse
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pickle

from beautifultable import BeautifulTable
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# classifier algorithms 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Quick Look & Missing Values

Preview the first rows to get a sense of the dataset.

In [None]:
df = pd.read_csv('../input/mobile-price-classification/train.csv')
df.head()

We also need to handle missing values in the dataset. We'll either (i) remove a feature column if it contains too many missing values, or (ii) impute the missing values where possible. Get a quick summary this way:

In [None]:
print("\nMissing values percentage in each column: ")
print(df.isnull().sum() / len(df) * 100)

Cool, there are no missing values in any column.
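
For a dataset that does have gaps, option (i) could look like the sketch below. The helper name `drop_sparse_columns`, the 50% threshold, and the toy dataframe are all assumptions made for illustration; none of them come from this dataset.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, max_missing_pct=50.0):
    """Keep only the columns whose share of missing values is at or below the threshold."""
    missing_pct = frame.isnull().mean() * 100  # percentage of missing values per column
    keep = missing_pct[missing_pct <= max_missing_pct].index
    return frame[keep]


# Toy data (hypothetical): "notes" is 75% missing, "ram" is complete.
toy = pd.DataFrame({
    "ram": [512, 1024, 2048, 4096],
    "notes": [np.nan, np.nan, np.nan, "ok"],
})
cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

Columns that survive the threshold but still contain a few gaps would then be left to the imputers in the pipeline.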

# Target & Config

To keep the automation general, I find it easier to create a config dictionary as follows. The dict contains all the parameters we'll need to pass to the autoML function.
* target: The target column name we want to predict. Type: str
* irrelevant_cols: Column name(s) irrelevant to the target column that are better removed. As an example, I passed `phone_wallpaper`, a column that wouldn't help predict our target column, `price_range`. That column doesn't actually exist in this dataset, so the entry is commented out. Type: list of str
* num_features: Feature columns containing numerical values. Type: list of str
* cat_features: Feature columns containing categorical values. Type: list of str
* num_imputer: Imputation strategy for num_features. Type: str
* cat_imputer: Imputation strategy for cat_features. Type: str
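* scaler: Scaler applied to num_features, which some algorithms need. Type: scikit-learn transformer instance, e.g. `RobustScaler()`
* encoder: Encoder applied to cat_features, since the algorithms expect numeric input. Type: scikit-learn transformer instance, e.g. `OneHotEncoder(handle_unknown='ignore')`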

In this dataset, the feature columns are quite straightforward. We can quickly infer whether each of them relates to `price_range`. I personally would use all the features available, as follows.

In [None]:
config = {
    'target': 'price_range',
    #'irrelevant_cols' : ["phone_wallpaper"],
    'num_features': ["battery_power", "clock_speed", "fc", "int_memory", "m_dep", "mobile_wt", "n_cores", "pc", "px_height",
                     "px_width", "ram", "sc_h", "sc_w", "talk_time"],
    'cat_features': ["blue", "dual_sim", "four_g", "three_g", "touch_screen", "wifi"],
    'num_imputer': 'mean',
    'cat_imputer': 'most_frequent',
    'scaler': RobustScaler(),
    'encoder': OneHotEncoder(handle_unknown='ignore')
}
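
Because the config is written by hand, a quick sanity check before training can catch typos in column names. The sketch below is one possible check, not part of the original workflow: `validate_config` is a hypothetical helper, and the toy config and column list are made up to show a failing case.

```python
def validate_config(config, columns):
    """Return a list of problems found in the config (empty if none)."""
    problems = []
    features = config["num_features"] + config["cat_features"]

    # Every listed feature should exist in the dataset.
    missing = [c for c in features if c not in columns]
    if missing:
        problems.append(f"columns not in dataset: {missing}")

    # The target must not leak into the features.
    if config["target"] in features:
        problems.append("target is listed as a feature")

    # A column should be either numerical or categorical, not both.
    overlap = sorted(set(config["num_features"]) & set(config["cat_features"]))
    if overlap:
        problems.append(f"columns listed as both numerical and categorical: {overlap}")
    return problems


# Toy example (hypothetical): "phone_wallpaper" is not a column of the dataset.
toy_config = {
    "target": "price_range",
    "num_features": ["ram", "battery_power"],
    "cat_features": ["wifi", "phone_wallpaper"],
}
toy_columns = ["ram", "battery_power", "wifi", "price_range"]
problems = validate_config(toy_config, toy_columns)
print(problems)
```

In the notebook, the same check could be run as `validate_config(config, list(df.columns))`.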

# Algorithms and Parameters to Automate

We can use a simple dictionary like this one to pass the algorithms and parameters to the function.

In [None]:
algos_params_classifier = {
    'KNNClassifier': {
        'algo': KNeighborsClassifier(),  # change parameter values here
        'parameter': {  # change tuning parameter values here
            'algo__n_neighbors': np.arange(1, 51, 2),
            'algo__weights': ['uniform', 'distance'],
            'algo__p': [1, 2]
        }
    },
    'SVMClassifier': {
        'algo': SVC(),  # common max_iter=500
        'parameter': {
            'algo__gamma': np.array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]),
            'algo__C': np.array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]),
        }

    },
    'RandomForestClassifier': {
        'algo': RandomForestClassifier(),  # scaling won't help RF; to explore, use feature importance instead
        'parameter': {
            'algo__n_estimators': [100, 150, 200],
            'algo__max_depth': [20, 50, 80],
            'algo__max_features': [0.3, 0.6, 0.8],
            'algo__min_samples_leaf': [1, 5, 10]
        }
    }  
}
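
It helps to know how many combinations each grid contains, because the search in the AutoML section samples `n_iter=50` candidates per algorithm. The sketch below counts them. It uses plain lists and ranges with the same number of values as the grids above, so it runs without NumPy; `grid_size` and `toy_grids` are names made up for this illustration.

```python
from math import prod


def grid_size(param_grid):
    """Number of distinct parameter combinations in a grid of discrete values."""
    return prod(len(values) for values in param_grid.values())


# Stand-ins with the same number of values per parameter as the grids above.
toy_grids = {
    "KNNClassifier": {
        "algo__n_neighbors": range(1, 51, 2),
        "algo__weights": ["uniform", "distance"],
        "algo__p": [1, 2],
    },
    "SVMClassifier": {
        "algo__gamma": [10.0 ** e for e in range(-3, 4)],
        "algo__C": [10.0 ** e for e in range(-3, 4)],
    },
    "RandomForestClassifier": {
        "algo__n_estimators": [100, 150, 200],
        "algo__max_depth": [20, 50, 80],
        "algo__max_features": [0.3, 0.6, 0.8],
        "algo__min_samples_leaf": [1, 5, 10],
    },
}

sizes = {name: grid_size(grid) for name, grid in toy_grids.items()}
print(sizes)  # {'KNNClassifier': 100, 'SVMClassifier': 49, 'RandomForestClassifier': 81}
```

With 50 sampled candidates, the search can cover the SVM grid completely but at most half of the KNN grid and roughly 60% of the Random Forest grid.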

# Preprocessing

Define a generic preprocessing function we can use later in the autoML pipeline.

In [None]:
def preprocessor(num_features, cat_features, num_imputer=None, cat_imputer=None, scaler=None, encoder=None):
    '''
    :param num_features: list of numerical column names
    :param cat_features: list of categorical column names
    :param num_imputer, cat_imputer: 'mean', 'median', 'most_frequent', 'constant'
    :param scaler: MinMaxScaler(), StandardScaler(), RobustScaler()
    :param encoder: OneHotEncoder(handle_unknown='ignore')
    :return: unfitted ColumnTransformer that imputes, scales and encodes the features
    '''
    # numerical columns: impute, then scale
    numerical_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy=num_imputer)),
        ("scaler", scaler)
    ])

    # categorical columns: impute, then encode
    categorical_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy=cat_imputer)),
        ("encoder", encoder)
    ])

    # columns not listed in num_features or cat_features are dropped (ColumnTransformer default)
    column_transformer = ColumnTransformer([
        ('numeric', numerical_pipeline, num_features),
        ('categoric', categorical_pipeline, cat_features),
    ])

    return column_transformer
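
Note that the function returns an unfitted transformer, not transformed data. To see what such a transformer produces, here is a self-contained sketch on a toy dataframe. The two columns and their values are made up for illustration, and the transformer is rebuilt inline with the same steps rather than calling the function above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy data (hypothetical): one numerical column with a gap, one binary categorical column.
toy = pd.DataFrame({
    "ram": [512.0, 1024.0, np.nan, 4096.0],
    "wifi": [0, 1, 1, 0],
})

numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", RobustScaler()),
])
categoric = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
prep = ColumnTransformer([
    ("numeric", numeric, ["ram"]),
    ("categoric", categoric, ["wifi"]),
])

# fit_transform learns the imputation, scaling and encoding from the data and applies them.
out = prep.fit_transform(toy)
print(out.shape)  # 4 rows; 1 scaled numerical column + 2 one-hot columns
```

Inside the AutoML pipeline this fitting step happens automatically, on the training folds only, which keeps information from the validation folds out of the imputer and scaler.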

# AutoML

Define a function that takes the dictionary `algos_params_classifier` as an argument. The function loops over the algorithms, fits each one on the train dataset, and outputs:
* Train and test scores (accuracy) of the best model found by the search, measured on the train split and on the held-out test split.
* `diff`, a quick check of the fit that I usually use. `diff` is the absolute difference between the train and test scores. The smaller the `diff`, the less the model overfits, so the better it should generalize to unseen data. A small `diff` can also come from an underfit model, so read it together with the test score.
* The best parameters found by RandomizedSearchCV, kept in each saved model as `best_params_`.

In [None]:
def automl_train(algos_params_dict):
        '''
        :param algos_params_dict: ML algorithms and parameters to train the dataset
        :return: list of paths to the trained models saved as .pickle files
        '''
        # dataset_splitting
        X = df.drop(columns=config['target'])
        y = df[config['target']]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
        
        
        # add report table
        table = BeautifulTable(max_width=150)
        table.column_headers = ["model", "train score", "test score", "diff"]
        
        
        # autoML
        print(f'Train test results\n')
        global saved_models
        saved_models = []
        for algo, algo_params in algos_params_dict.items():
            # key, value = algo, algo_params; algo = algo type, algo_params = (i) algo (ii) parameters
            pipeline = Pipeline([
                ("prep", preprocessor(num_imputer=config['num_imputer'], cat_imputer=config['cat_imputer'], \
                         num_features=config['num_features'], cat_features=config['cat_features'], \
                         scaler=config['scaler'], encoder=config['encoder'])),
                ("algo", algo_params['algo'])
            ])

            parameter = algo_params['parameter']
 
            # selection mode: RandomizedSearchCV
            model = RandomizedSearchCV(pipeline, param_distributions=parameter, cv=3, n_iter=50, n_jobs=-1, verbose=1,
                                       random_state=42) # for more verbosity: verbose=degree
            model.fit(X_train, y_train)
            model_name = f'model_{algo}.pickle'
            model_path = Path.cwd() / model_name
            # use a context manager so the file handle is closed after writing
            with open(model_path, 'wb') as f:
                pickle.dump(model, f)
            saved_models.append(model_path)
            # score once and reuse the values for the report row
            train_score = model.score(X_train, y_train)
            test_score = model.score(X_test, y_test)
            diff = abs(train_score - test_score)
            table.append_row([algo, train_score, test_score, diff])
        print(table)
        
        return saved_models
    
automl_train(algos_params_classifier)
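
When reading the table, it can help to rank the models by `diff` and by test score side by side. The sketch below does that with made-up numbers; the three models and their scores are hypothetical and are not results from this notebook.

```python
# Hypothetical (train score, test score) pairs, for illustration only.
hypothetical_scores = {
    "model_a": (0.99, 0.90),    # high scores, larger gap
    "model_b": (0.93, 0.92),    # high scores, small gap
    "model_c": (0.600, 0.595),  # low scores, smallest gap
}

rows = [
    (name, train, test, abs(train - test))
    for name, (train, test) in hypothetical_scores.items()
]

by_diff = sorted(rows, key=lambda r: r[3])                # smallest diff first
by_test = sorted(rows, key=lambda r: r[2], reverse=True)  # highest test score first

for name, train, test, diff in by_diff:
    print(f"{name}: train={train:.3f} test={test:.3f} diff={diff:.3f}")
```

In this toy case the smallest `diff` belongs to `model_c`, which scores poorly on both splits, while `model_b` has the best test score with an almost equally small gap. That is why `diff` is best read as a tie-breaker among models with comparable test scores.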

# Predict on Unseen Test Data

From the result table, I would personally pick the SVM model for prediction and further improvement, as it has the smallest `diff` value. My working assumption is that the model with the smallest `diff` has a good fit, neither overfitting nor underfitting. Here we load the test data and the SVM model dumped by the `automl_train` function, predict `price_range` on the test set, and recheck the test dataframe (here `df_test`).

In [None]:
def predict_testset(testset_path, pick_model, models):
    df_test = pd.read_csv(testset_path)
    pred = None
    for model_path in models:  # use the list passed in, not the global variable
        if pick_model in str(model_path):
            with open(model_path, 'rb') as f:
                load_model = pickle.load(f)
            pred = load_model.predict(df_test)
            break
    if pred is None:
        raise ValueError(f'No saved model matches {pick_model!r}')

    df_test['prediction'] = pred
    print(df_test.head())


predict_testset(testset_path='../input/mobile-price-classification/test.csv', pick_model='SVMClassifier', models=saved_models)

That's it! We've predicted the price range in the column `prediction`. Unfortunately, this test set doesn't have a `price_range` column, so we can't compare against it and calculate the prediction accuracy to evaluate the fit.
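
One workaround is to evaluate on data where the labels are known, such as the 20% split held out inside `automl_train`. The sketch below shows the computation on made-up labels; `y_true` and `y_pred` are hypothetical stand-ins for the held-out labels and the model's predictions.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical price_range labels (0 to 3) and predictions.
y_true = [0, 1, 2, 3, 1, 2]
y_pred = [0, 1, 2, 2, 1, 3]

acc = accuracy(y_true, y_pred)
# Count each (true, predicted) pair that disagrees to see which classes get confused.
errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)

print(f"accuracy: {acc:.3f}")
print(dict(errors))
```

Here the only mistakes are between the neighbouring classes 2 and 3, the kind of detail a single accuracy number hides.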