# __Create a classification model using AutoML__

## Preface

- Tutorial difficulty : ★☆☆☆☆
- 4 min read
- Languages : [SQL](https://en.wikipedia.org/wiki/SQL) (100%)
- File location : tutorial_en/thanosql_ml/classification/automl_classification.ipynb
- References : [(Kaggle) Titanic - Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic/overview)

## Tutorial Introduction

<div class="admonition note">
    <h4 class="admonition-title">Understanding Classification Operations</h4>
    <p>Classification is a form of <a href="https://en.wikipedia.org/wiki/Machine_learning">machine learning</a> used to predict the category (class) to which a target belongs. Classification tasks include both binary classification, which assigns one of two classes (for example, male or female), and multi-class classification, which assigns one of several classes (for example, predicting an animal species such as dog, cat, or rabbit). <br></p>
</div>

To predict whether a potential customer will react positively to a particular marketing promotion, you can use your company's [Customer Relationship Management (CRM)](https://en.wikipedia.org/wiki/Customer_relationship_management) data (demographic information, customer behavior and search data, etc.). In this case, the <a href="https://en.wikipedia.org/wiki/Feature_(machine_learning)">features</a> in the CRM data serve as the input, and the target value to be predicted is whether the customer's response to the promotion is positive (1 or True) or negative (0 or False). With such a classification model, you can predict the reactions of customers who have not yet been exposed to the promotion and show it only to the customers who are likely to respond, which improves marketing efficiency.
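
The split between input features and the target value can be sketched in a few lines of Python. This is only an illustration of the data layout: the records, the column names (`age`, `visits_last_month`, `searched_promo`, `responded`), and the values are all hypothetical and do not come from any real CRM system or from ThanoSQL.

```python
# Hypothetical CRM records; every column name and value is made up for illustration.
crm_rows = [
    {"age": 34, "visits_last_month": 5, "searched_promo": True, "responded": 1},
    {"age": 51, "visits_last_month": 0, "searched_promo": False, "responded": 0},
    {"age": 27, "visits_last_month": 9, "searched_promo": True, "responded": 1},
]

TARGET = "responded"  # 1 = positive reaction, 0 = negative reaction

# Features are every column except the target column.
features = [{k: v for k, v in row.items() if k != TARGET} for row in crm_rows]

# The target is the single value the model learns to predict.
targets = [row[TARGET] for row in crm_rows]

print(features[0])
print(targets)
```

A classification model is trained on pairs of feature rows and target values like these, and is then asked to predict the target for rows where it is unknown.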

__The following are example uses of a ThanoSQL classification model.__

- A classification model enables early detection of user churn, so you can respond proactively. Historical data helps you identify the characteristics of users who have left and find users who are likely to leave, so that you can take appropriate action in advance. This can help prevent customer churn and increase sales.

- You can predict which [segment](https://en.wikipedia.org/wiki/Market_segmentation) a user of your online platform belongs to. Service users differ in their characteristics, behaviors, and needs. A classification model uses these characteristics to identify fine-grained groups, so you can develop strategies tailored to each group.

<div class="admonition note">
    <h4 class="admonition-title">In this tutorial</h4>
    <p>👉 Create a survivor classification model using the <mark style="background-color:#FFD79C"> <strong>Titanic: Machine Learning from Disaster</strong></mark> dataset, a beginner-level competition on <a href="https://www.kaggle.com/">Kaggle</a>, a well-known machine learning competition platform. The goal of this competition is as follows:
    (For reference, the dataset is a list of passengers who were on board when the Titanic sank on April 15, 1912.)</p>
</div>

__Predicting which passengers survived the Titanic__

ThanoSQL provides automated machine learning (__Auto-ML__) tools. This tutorial uses Auto-ML to predict which passengers survived the Titanic. ThanoSQL's Auto-ML automates model development and lets you build an end-to-end machine learning pipeline (data collection and storage, model development, and deployment) in a single language, without data science expertise.

__Automated ML has the following advantages:__

1. You can implement and deploy machine learning solutions without extensive programming or data science knowledge.
2. You save time and resources when deploying the models you develop.
3. You can quickly solve problems and support decision-making using the data you already have.

Now let's use ThanoSQL to create a simple classification model that predicts which passengers survived the Titanic.

## __0. Prepare Dataset__

To use the ThanoSQL query syntax, you must first create an API token and run the query below, as described in the [ThanoSQL Workspace](https://docs.thanosql.ai/en/getting_started/how_to_use_ThanoSQL/#5-thanosql-workspace) guide.

In [None]:
%load_ext thanosql
%thanosql API_TOKEN=<Issued_API_TOKEN>

### __Prepare Dataset__

In [None]:
%%thanosql
GET THANOSQL DATASET titanic_data
OPTIONS (overwrite=True)

<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>The "<strong>GET THANOSQL DATASET</strong>" query syntax saves the desired dataset to the workspace. </li>
        <li>"<strong>OPTIONS</strong>" specifies the options to use for <strong>GET THANOSQL DATASET</strong>.
        <ul>
            <li>"overwrite" : Sets whether to overwrite a dataset with the same name if one already exists. If True, the old dataset is replaced with the new dataset (True|False, DEFAULT : False) </li>
        </ul>
        </li>
    </ul>
</div>

In [None]:
%%thanosql
COPY titanic_train 
OPTIONS (overwrite=True)
FROM "tutorial_data/titanic_data/titanic_train.csv"

In [None]:
%%thanosql
COPY titanic_test 
OPTIONS (overwrite=True)
FROM "thanosql-dataset/titanic_data/titanic_test.csv"

<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>The "<strong>COPY</strong>" query syntax specifies the name of the dataset to be saved in the DB. </li>
        <li>"<strong>OPTIONS</strong>" specifies the options to use for <strong>COPY</strong>.
        <ul>
            <li>"overwrite" : Sets whether to overwrite a dataset with the same name if one already exists in the DB. If True, the old dataset is replaced with the new dataset (True|False, DEFAULT : False) </li>
        </ul>
        </li>
    </ul>
</div>
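
To make the idea of "load a CSV file into a named table, optionally replacing an existing one" concrete, here is a minimal Python sketch. It is not how ThanoSQL implements <strong>COPY</strong>: it uses an in-memory SQLite database as a stand-in, the helper name `copy_csv_to_table` is hypothetical, the two-column sample is made up, and raising an error when the table exists and `overwrite` is False is an assumption made only for this illustration.

```python
import csv
import io
import sqlite3


def copy_csv_to_table(conn, table, csv_text, overwrite=False):
    """Load CSV text into a table (hypothetical stand-in for a COPY-style load)."""
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone()
    # Assumption for this sketch: refuse to touch an existing table unless asked to.
    if exists and not overwrite:
        raise ValueError(f"table {table!r} already exists")

    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)

    # With overwrite=True the old table is dropped and rebuilt from the CSV.
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)


conn = sqlite3.connect(":memory:")
sample = "passengerid,survived\n1,0\n2,1\n"  # made-up two-row sample

copy_csv_to_table(conn, "titanic_train", sample)
copy_csv_to_table(conn, "titanic_train", sample, overwrite=True)  # replaces the table

row_count = conn.execute("SELECT COUNT(*) FROM titanic_train").fetchone()[0]
print(row_count)
```

Because the second call replaces the table instead of appending to it, the table still holds only the rows from the CSV after both calls.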

## __1. Check Dataset__

To create the survivor prediction classification model, we use the <mark style="background-color:#FFEC92 "><strong>titanic_train</strong></mark> table stored in the ThanoSQL DB. Check the table contents by executing the query below.

In [None]:
%%thanosql
SELECT * 
FROM titanic_train
LIMIT 5 

<div class="admonition note">
    <h4 class="admonition-title">Understanding Data</h4>
    <p>The <mark style="background-color:#FFEC92 "><strong>titanic_train</strong></mark> dataset contains the following information.</p>
    <ul>
        <li><mark style="background-color:#D7D0FF">passengerid</mark> : Passenger ID</li>
        <li><mark style="background-color:#D7D0FF">survived</mark> : Whether the passenger survived (0 = No, 1 = Yes)</li>
        <li><mark style="background-color:#D7D0FF">pclass</mark> : Passenger ticket class</li>
        <li><mark style="background-color:#D7D0FF">name</mark> : Passenger name</li>
        <li><mark style="background-color:#D7D0FF">sex</mark> : Passenger gender</li>
        <li><mark style="background-color:#D7D0FF">age</mark> : Passenger age</li>
        <li><mark style="background-color:#D7D0FF">sibsp</mark> : Number of siblings or spouses on board</li>
        <li><mark style="background-color:#D7D0FF">parch</mark> : Number of parents or children on board</li>
        <li><mark style="background-color:#D7D0FF">ticket</mark> : Ticket number</li>
        <li><mark style="background-color:#D7D0FF">fare</mark> : Fare</li>
        <li><mark style="background-color:#D7D0FF">cabin</mark> : Cabin</li>
        <li><mark style="background-color:#D7D0FF">embarked</mark> : Boarding point or port</li>
    </ul>
</div>
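
Columns such as <mark style="background-color:#D7D0FF">age</mark> and <mark style="background-color:#D7D0FF">cabin</mark> can contain empty values, so it can help to count missing entries per column before training. The following is a minimal sketch using Python's standard `csv` module. The inline sample is a small made-up excerpt with only four of the columns, not the actual table contents.

```python
import csv
import io

# Made-up excerpt with a subset of the columns; empty fields represent missing values.
sample_csv = """passengerid,age,cabin,embarked
1,22,,S
2,38,C85,C
3,,,S
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Count the empty fields in each column.
missing = {col: sum(1 for row in rows if row[col] == "") for col in rows[0]}

print(missing)
```

A column with many missing entries is a candidate either for dropping or for imputation.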

In this tutorial, we train the model without the <mark style="background-color:#D7D0FF">name</mark>, <mark style="background-color:#D7D0FF">ticket</mark>, and <mark style="background-color:#D7D0FF">cabin</mark> columns, which would require additional preprocessing with extra query statements. The query in the next step also drops the <mark style="background-color:#D7D0FF">passengerid</mark> column.

## __2. Create a classification model__

Create a survivor prediction classification model using the <mark style="background-color:#FFEC92 ">titanic_train</mark> data from the previous step. Execute the query syntax below to create a model named <mark style="background-color:#E9D7FD ">titanic_automl_classification</mark>.  
(Estimated duration of query execution: 8 min)

In [None]:
%%thanosql
BUILD MODEL titanic_automl_classification
USING AutomlClassifier 
OPTIONS (
    target='survived', 
    impute_type='iterative',  
    features_to_drop=['name', 'ticket', 'passengerid', 'cabin'],
    time_left_for_this_task=300,
    overwrite=True
    ) 
AS 
SELECT * 
FROM titanic_train

<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>Create and train a model named <mark style="background-color:#E9D7FD ">titanic_automl_classification</mark> using the query syntax "<strong>BUILD MODEL</strong>".</li>
        <li>"<strong>OPTIONS</strong>" specifies the options to use for model creation.
        <ul>
            <li>"target" : The name of the column containing the target value of the classification model </li>
            <li>"impute_type" : Sets how empty values (NaNs) in the data table are handled ('simple'|'iterative' , DEFAULT: 'simple') </li>
            <li>"features_to_drop" : List of names of columns in the data table to exclude from training </li>
            <li>"time_left_for_this_task" : Time allowed for finding a suitable classification prediction model (DEFAULT: 300)</li>
            <li>"overwrite" : Sets whether to overwrite a model with the same name if one already exists. If True, the old model is replaced with the new model (True|False, DEFAULT : False) </li>
        </ul>
        </li>
    </ul>
</div>
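
To give a feel for what imputation means, here is a minimal Python sketch of one common "simple" strategy: replacing each missing numeric value with the mean of the observed values in the same column. This is an assumption about what a simple imputer typically does, not a description of ThanoSQL's internals, and the age values are made up. Iterative imputation, by contrast, generally estimates each missing value from the other columns and is not shown here.

```python
from statistics import mean

# Made-up ages; None marks a missing value.
ages = [22.0, None, 38.0, 26.0, None, 34.0]

# Mean of the observed values only: (22 + 38 + 26 + 34) / 4 = 30.0
observed = [age for age in ages if age is not None]
fill_value = mean(observed)

# Replace each missing entry with the column mean; keep observed values as they are.
imputed = [fill_value if age is None else age for age in ages]

print(imputed)
```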

<div class="admonition warning">
    <h4 class="admonition-title">Warning</h4>
    <p>When creating an Auto-ML classification model, if you pass parameters other than those listed in <a href="https://docs.thanosql.ai/en/how-to_guides/OPTIONS/#1-automlclassifier-algorithm">OPTIONS</a>, the model is still created, but the values set for those parameters are ignored.</p>
</div>

## __3. Evaluate the generated model__

Execute the query statement below to evaluate the performance of the predictive model created in the previous step.

In [None]:
%%thanosql 
EVALUATE USING titanic_automl_classification 
OPTIONS (
    target = 'survived'
    )
AS
SELECT *
FROM titanic_train

<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <ul>
        <li>The "<strong>EVALUATE USING</strong>" query syntax evaluates the model named <mark style="background-color:#E9D7FD ">titanic_automl_classification</mark> built in the previous step. </li>
        <li>"<strong>OPTIONS</strong>" specifies the options to use for model evaluation.
        <ul>
            <li>"target" : The name of the column that is the target value in the classification prediction model.</li>
        </ul>
        </li>
    </ul>
</div>
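
The metrics returned by <strong>EVALUATE USING</strong> are not listed in this tutorial. As an illustration of what a classification metric measures, the sketch below computes accuracy, the fraction of predictions that match the actual target values. The helper name `accuracy` and the label lists are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that equal the actual labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: 1 = survived, 0 = did not survive.
actual = [0, 1, 1, 0, 1]
predicted = [0, 1, 0, 0, 1]

score = accuracy(actual, predicted)  # 4 of 5 predictions match
print(score)
```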

<div class="admonition warning">
    <h4 class="admonition-title">Dataset for evaluation</h4>
    <p>The evaluation dataset should be a portion of the data that is set aside and not used for training. This tutorial evaluates on the training data only for convenience.</p>
</div>
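
Setting aside an evaluation set can be sketched as follows. This is a generic hold-out split in plain Python, not a ThanoSQL feature: the helper name `train_eval_split`, the 20% evaluation fraction, and the fixed seed are assumptions chosen for the example.

```python
import random


def train_eval_split(rows, eval_fraction=0.2, seed=42):
    """Shuffle a copy of rows and hold out eval_fraction of them for evaluation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# Stand-in for ten passenger rows, identified only by ID.
passenger_ids = list(range(1, 11))

train_ids, eval_ids = train_eval_split(passenger_ids)

print(len(train_ids), len(eval_ids))
```

The model would then be trained only on the training portion and evaluated on the held-out portion, so the evaluation reflects how the model handles rows it has not seen.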

## __4. Predict survivors using the generated model__

Using the survivor prediction model created in the previous step, predict whether each passenger survived based on the passenger information. Use the test dataset (the <mark style="background-color:#FFEC92 "><strong>titanic_test</strong></mark> table, which was not used for training).

In [None]:
%%thanosql 
PREDICT USING titanic_automl_classification
AS 
SELECT * 
FROM titanic_test

<div class="admonition note">
    <h4 class="admonition-title">Query Details</h4>
    <p>The "<strong>PREDICT USING</strong>" query syntax uses the <mark style="background-color:#E9D7FD ">titanic_automl_classification</mark> model created in the previous step for prediction. "<strong>PREDICT</strong>" requires no additional options because it follows the procedure of the generated model.</p>
</div>

## __5. In Conclusion__

In this tutorial, we created a Titanic survivor classification model using the <mark style="background-color:#FFD79C"><strong>Titanic: Machine Learning from Disaster</strong></mark> data from [Kaggle](https://www.kaggle.com/). As this is a beginner-level tutorial, we focused on the overall process rather than on improving accuracy. If you'd like to learn more about building advanced classification models, we recommend going through the intermediate tutorial.

The next tutorial, [Creating an Intermediate Classification Prediction Model], takes a closer look at the "__OPTIONS__" for improving accuracy and builds more sophisticated classification models using the various "__OPTIONS__" provided by ThanoSQL's AutoML. At the advanced level, you can quantify unstructured data and include it as a training feature in AutoML to create a classification model. After completing the intermediate and advanced levels, try creating a classification model for your own service or product.

- [How to Upload to ThanoSQL DB](https://docs.thanosql.ai/en/how-to_guides/ThanoSQL_connecting/data_upload/)
- [Creating an Intermediate Image Classification Model]
- [Image conversion and creating My model using Auto-ML]
- [Deploy My Image Classification model](https://docs.thanosql.ai/en/how-to_guides/ThanoSQL_connecting/thanosql_api/rest_api_thanosql_query/)

<div class="admonition tip">
    <h4 class="admonition-title">Inquiries about deploying a model for your own service</h4>
    <p>If you have any difficulties in creating your own model using ThanoSQL or applying it to the service, please feel free to contact us below😊</p>
    <p>For inquiries about building a classification model: contact@smartmind.team</p>
</div>