# Introduction

> FLAML is a lightweight Python library that finds accurate machine learning models automatically, efficiently and economically. It frees users from selecting learners and hyperparameters for each learner. It is fast and economical. The simple and lightweight design makes it easy to extend, such as adding customized learners or metrics. FLAML is powered by a new, cost-effective hyperparameter optimization and learner selection method invented by Microsoft Research. FLAML leverages the structure of the search space to choose a search order optimized for both cost and error. For example, the system tends to propose cheap configurations at the beginning stage of the search, but quickly moves to configurations with high model complexity and large sample size when needed in the later stage of the search. For another example, it favors cheap learners in the beginning but penalizes them later if the error improvement is slow. The cost-bounded search and cost-based prioritization make a big difference in the search efficiency under budget constraints.

[Source](https://github.com/microsoft/FLAML)

# Installation

Follow the installation steps in the [FLAML GitHub README](https://github.com/microsoft/FLAML#installation).

In [None]:
!pip install "flaml[notebook]"

# Imports

In [None]:
import numpy as np
import pandas as pd

from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px

from flaml import AutoML
from flaml.data import get_output_from_log

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# EDA

In [None]:
pizza_v1_df = pd.read_csv("../input/pizza-price-prediction/pizza_v1.csv")
pizza_v2_df = pd.read_csv("../input/pizza-price-prediction/pizza_v2.csv")

print(f"Shape of pizza_v1 dataframe: {pizza_v1_df.shape}")
print(f"Shape of pizza_v2 dataframe: {pizza_v2_df.shape}")

In [None]:
pizza_v1_df.sample(20)

In [None]:
pizza_v2_df.sample(20)

In [None]:
pizza_v1_df.dtypes

In [None]:
pizza_v2_df.dtypes

In [None]:
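# Convert price_rupiah from a string to an integer
from typing import Optional

def convert_price_from_string_to_int(price: str) -> Optional[int]:
    try:
        # drop the two-character currency prefix, then the thousands separators
        return int(price[2:].replace(",", ""))
    except (TypeError, ValueError):
        # report values that cannot be parsed; they become missing values in the column
        print(f"Error converting from string to int {price}")
        return None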
        
pizza_v1_df['target'] = pizza_v1_df['price_rupiah'].apply(convert_price_from_string_to_int)
pizza_v2_df['target'] = pizza_v2_df['price_rupiah'].apply(convert_price_from_string_to_int)
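
The conversion above assumes that every price is a string with a two-character currency prefix followed by comma-grouped digits. A slightly more defensive variant is sketched below: it strips every non-digit character instead of relying on the prefix length. `parse_rupiah` is a hypothetical helper, and the sample strings are made up for illustration rather than taken from the dataset.

```python
import re
from typing import Optional


def parse_rupiah(price: str) -> Optional[int]:
    """Return the integer amount in a price string, or None if it has no digits.

    Hypothetical helper for illustration; assumes prices have no decimal part.
    """
    digits = re.sub(r"\D", "", str(price))  # drop prefix, commas, and spaces
    return int(digits) if digits else None


sample_prices = ["Rp235,000", "Rp 72,000", "n/a"]  # made-up sample values
parsed = [parse_rupiah(p) for p in sample_prices]
print(parsed)
```

Returning `None` for unparseable values keeps the failure visible as a missing value instead of silently producing a wrong number.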

In [None]:
# In pizza_v2 the diameter is stored as a string (e.g. with an "inch" suffix), so convert it to float

def convert_diameter_from_string_to_float_for_v2(diameter: str) -> float:
    try:
        # remove the unit suffix and surrounding whitespace before parsing
        return float(diameter.replace("inch", "").strip())
    except (AttributeError, ValueError):
        # report values that cannot be parsed; they become missing values in the column
        print(f"Error converting from string to float {diameter}")
        return float("nan")

pizza_v2_df['diameter'] = pizza_v2_df['diameter'].apply(convert_diameter_from_string_to_float_for_v2)

In [None]:
# drop the previous column price_rupiah
pizza_v1_df = pizza_v1_df.drop('price_rupiah', axis=1)
pizza_v2_df = pizza_v2_df.drop('price_rupiah', axis=1)

In [None]:
print(f"Unique values in v1 variant: {pizza_v1_df['variant'].unique()}")
print(f"Unique values in v2 variant: {pizza_v2_df['variant'].unique()}")

In [None]:
# In variant, "spicy_tuna" and "spicy tuna" are the same category, so rename "spicy tuna" to "spicy_tuna"
pizza_v1_df.loc[pizza_v1_df['variant'] == "spicy tuna", "variant"] = "spicy_tuna"
pizza_v2_df.loc[pizza_v2_df['variant'] == "spicy tuna", "variant"] = "spicy_tuna"
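
Inconsistent spellings like this one can also be caught generically. The sketch below uses a hypothetical helper, `normalize_label`, that trims whitespace and joins words with underscores. Apart from `spicy tuna` and `spicy_tuna`, the sample labels are made up for illustration.

```python
def normalize_label(label: str) -> str:
    """Collapse runs of whitespace into single underscores (hypothetical helper)."""
    return "_".join(label.split())


labels = ["spicy tuna", "spicy_tuna", " double  mix ", "classic"]  # partly made up
normalized = [normalize_label(label) for label in labels]
print(sorted(set(normalized)))
```

On a DataFrame, the same helper could be applied with `pizza_v1_df['variant'].map(normalize_label)`.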

In [None]:
print(f"After merging duplicate labels, unique values in v1 variant: {pizza_v1_df['variant'].unique()}")
print(f"After merging duplicate labels, unique values in v2 variant: {pizza_v2_df['variant'].unique()}")

## Analysis graphs for V1

In [None]:
pizza_v1_df.groupby(by=['topping']).size().index

In [None]:
# Count graphs for categorical variables for v1

count_cols = ['company', 'topping', 'variant', 'size', 'extra_sauce', 'extra_cheese']

fig = make_subplots(rows=3, cols=2, subplot_titles=count_cols)

n_cols = 2

for i, col_name in enumerate(count_cols):
    group_by = pizza_v1_df.groupby(by=[col_name]).size()
    index, values = group_by.index, group_by.values
    # subplot_titles are assigned in row-major order, so place each trace the same way
    row, col = i // n_cols + 1, i % n_cols + 1
    fig.add_trace(
        go.Bar(x=values, y=index, name=col_name, orientation='h'),
        row=row,
        col=col
    )
    fig.update_xaxes(title_text="Count", row=row, col=col)

fig.update_layout(
    title="Count Plots for Categorical Variables V1",
    autosize=False,
    width=1440,
    height=1280,
)
    
fig.show()
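
Plotly's `make_subplots` assigns `subplot_titles` in row-major order (left to right, then top to bottom), so a trace lines up with its title only when the loop index is mapped to a grid cell in the same order. A minimal sketch of that mapping, using a hypothetical helper `grid_position`:

```python
def grid_position(index: int, n_cols: int) -> tuple:
    """Map a 0-based index to a 1-based (row, col) cell in row-major order."""
    row, col = divmod(index, n_cols)
    return row + 1, col + 1


columns = ['company', 'topping', 'variant', 'size', 'extra_sauce', 'extra_cheese']
positions = {name: grid_position(i, n_cols=2) for i, name in enumerate(columns)}
print(positions)
```

With two columns, `topping` (index 1) lands in row 1, column 2, which is where the second subplot title is drawn.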

In [None]:
# correlation between diameter and price of pizza

fig = px.scatter(
    pizza_v1_df, x='diameter', y='target',
    opacity=0.75, trendline='ols', title="Regression between diameter and price V1"
)

fig.show()
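
The `trendline='ols'` option fits an ordinary least squares line through the points. As a rough illustration of what such a fit computes for a single feature, here is the closed-form solution in plain Python. `fit_line` is a hypothetical helper, not part of Plotly, and the diameters and prices are made-up numbers.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


diameters = [8.0, 12.0, 14.0, 18.0]               # made-up values
prices = [70000.0, 110000.0, 130000.0, 170000.0]  # made-up, exactly linear
slope, intercept = fit_line(diameters, prices)
print(slope, intercept)
```

Plotly Express relies on the `statsmodels` package for this option, so it needs to be installed for the trendline to render.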

## Analysis graphs for V2

In [None]:
# Count graphs for categorical variables for v2

count_cols = ['company', 'topping', 'variant', 'size', 'extra_sauce', 'extra_cheese', 'extra_mushrooms']

fig = make_subplots(rows=3, cols=3, subplot_titles=count_cols)

n_cols = 3

for i, col_name in enumerate(count_cols):
    group_by = pizza_v2_df.groupby(by=[col_name]).size()
    index, values = group_by.index, group_by.values
    # subplot_titles are assigned in row-major order, so place each trace the same way
    row, col = i // n_cols + 1, i % n_cols + 1
    fig.add_trace(
        go.Bar(x=values, y=index, name=col_name, orientation='h'),
        row=row,
        col=col
    )
    fig.update_xaxes(title_text="Count", row=row, col=col)

fig.update_layout(
    title="Count Plots for Categorical Variables V2",
    autosize=False,
    width=1440,
    height=1280,
)
    
fig.show()

In [None]:
# correlation between diameter and price of pizza

fig = px.scatter(
    pizza_v2_df, x='diameter', y='target',
    opacity=0.75, trendline='ols', title="Regression between diameter and price V2"
)
fig.show()

# "Dummy"fying Data 

In [None]:
# select the categorical columns and replace them by dummies
categorical_cols_for_v1 = ['company', 'topping', 'variant', 'size', 'extra_sauce', 'extra_cheese']
categorical_cols_for_v2 = ['company', 'topping', 'variant', 'size', 'extra_sauce', 'extra_cheese', 'extra_mushrooms']

pizza_v1_df = pd.get_dummies(pizza_v1_df, columns=categorical_cols_for_v1, drop_first=True)
pizza_v2_df = pd.get_dummies(pizza_v2_df, columns=categorical_cols_for_v2, drop_first=True)
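
With `drop_first=True`, a categorical column with k distinct values becomes k - 1 indicator columns; the dropped category is implied when all of its sibling indicators are 0. The sketch below mimics that idea in plain Python for a single column. It is a conceptual illustration with a hypothetical helper and made-up labels, not pandas' implementation.

```python
def one_hot_drop_first(values, prefix):
    """One-hot encode `values`, omitting the first category in sorted order."""
    categories = sorted(set(values))
    kept = categories[1:]  # the first category becomes the implicit baseline
    return {
        f"{prefix}_{cat}": [int(v == cat) for v in values]
        for cat in kept
    }


sizes = ["small", "large", "medium", "large"]  # made-up sample labels
encoded = one_hot_drop_first(sizes, prefix="size")
print(encoded)
```

Here `large` is the baseline: a row with `size_medium == 0` and `size_small == 0` can only be `large`, so no information is lost by dropping its column.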

# AutoML Model Execution

In [None]:
TIME_BUDGET = 600

def train_and_return_automl_model_for_dataset(X_train, y_train, version: str) -> AutoML:
    
    automl = AutoML()

    settings = {
        'time_budget': TIME_BUDGET,
        'metric': 'r2',
        'estimator_list': ['lgbm', 'catboost', 'rf', 'extra_tree'],
        'task': 'regression',
        'log_file_name': f"{version}.log",
        'seed': 987654321
    }
    
    automl.fit(X_train=X_train, y_train=y_train, **settings)
    
    return automl

In [None]:
X_v1 = pizza_v1_df.drop('target', axis=1)
y_v1 = pizza_v1_df['target']

X_v2 = pizza_v2_df.drop('target', axis=1)
y_v2 = pizza_v2_df['target']

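# fix random_state so the train/test split is reproducible across runs
X_train_v1, X_test_v1, y_train_v1, y_test_v1 = train_test_split(X_v1, y_v1, test_size=0.2, random_state=42)
X_train_v2, X_test_v2, y_train_v2, y_test_v2 = train_test_split(X_v2, y_v2, test_size=0.2, random_state=42)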

assert X_train_v2.shape[1] == X_v2.shape[1]

automl_v1 = train_and_return_automl_model_for_dataset(X_train_v1, y_train_v1, "v1")
automl_v2 = train_and_return_automl_model_for_dataset(X_train_v2, y_train_v2, "v2")

In [None]:
print(f'X_v1 shape:{X_v1.shape}')
print(f'y_v1 shape:{y_v1.shape}')
print(f'X_v2 shape:{X_v2.shape}')
print(f'y_v2 shape:{y_v2.shape}')

# Log History

In [None]:
def return_valid_loss_history(model_version: str):
    # read the log up to the full search budget; a smaller value would cut off later trials
    time_history, best_valid_loss_history, valid_loss_history, config_history, metric_history = \
        get_output_from_log(filename=f"{model_version}.log", time_budget=TIME_BUDGET)

    return (time_history, valid_loss_history)

v1_time_history, v1_valid_loss_history = return_valid_loss_history("v1")
v2_time_history, v2_valid_loss_history = return_valid_loss_history("v2")

fig = make_subplots(rows=1, cols=2, subplot_titles=("V1 Validation History", "V2 Validation History"))

fig.add_trace(
    go.Scatter(x=v1_time_history, y=np.array(v1_valid_loss_history)),
    row=1, col=1
)


fig.add_trace(
    go.Scatter(x=v2_time_history, y=np.array(v2_valid_loss_history)),
    row=1, col=2
)

fig.update_xaxes(title_text="Wall Clock Time (s)", row=1, col=1)
fig.update_xaxes(title_text="Wall Clock Time (s)", row=1, col=2)

# with metric='r2', FLAML minimizes 1 - R2, so lower is better
fig.update_yaxes(title_text="Validation Loss (1 - R2)", row=1, col=1)
fig.update_yaxes(title_text="Validation Loss (1 - R2)", row=1, col=2)

fig.show()
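
The function above discards `best_valid_loss_history`, which, as its name suggests, tracks the best validation loss seen so far. Assuming it is simply the running minimum of `valid_loss_history` (an assumption about the log contents, not something verified here), it can be recomputed from the per-trial losses. The loss values below are made up for illustration.

```python
from itertools import accumulate

valid_loss_history = [0.42, 0.35, 0.38, 0.21, 0.25]  # made-up losses per trial
# running minimum: the best loss found up to and including each trial
best_so_far = list(accumulate(valid_loss_history, min))
print(best_so_far)
```

Plotting the running minimum gives a monotonically non-increasing curve, which is often easier to read than the raw per-trial losses.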

In [None]:
def get_r2_score_from_best_estimator(X_test, y_test, best_estimator):
    
    y_pred = best_estimator.predict(X_test)
    
    r2 = r2_score(y_test, y_pred)
    
    return r2

v1_best_model = automl_v1.model.estimator
v2_best_model = automl_v2.model.estimator

print(f'Best model for v1: {v1_best_model}')
print(f'Best model for v2: {v2_best_model}')

In [None]:
v1_r2_score = get_r2_score_from_best_estimator(X_test_v1, y_test_v1, v1_best_model)
v2_r2_score = get_r2_score_from_best_estimator(X_test_v2, y_test_v2, v2_best_model)

print(f'R2 score from best model for v1 : {v1_r2_score:.4f}')
print(f'R2 score from best model for v2 : {v2_r2_score:.4f}')
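
For reference, the R2 score compares the model's squared error with that of a baseline that always predicts the mean of the true values: R2 = 1 - SS_res / SS_tot. Below is a plain-Python sketch of the formula with made-up numbers. `r2` is a hypothetical helper; the notebook itself uses `sklearn.metrics.r2_score`.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # baseline error
    return 1.0 - ss_res / ss_tot


y_true = [100.0, 150.0, 200.0]  # made-up targets
perfect = r2(y_true, y_true)                    # predictions equal the targets
baseline = r2(y_true, [150.0, 150.0, 150.0])   # always predicting the mean
print(perfect, baseline)
```

A score of 1 means perfect predictions, 0 matches the mean baseline, and negative values are worse than the baseline.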

# Conclusion

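- The best models found by FLAML score well on R2 for the held-out test sets of both datasets, with no manual model selection or hyperparameter tuning and a search budget of 600 seconds per dataset.
- What are your thoughts on Microsoft FLAML?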