<img src="https://cybersecurity-excellence-awards.com/wp-content/uploads/2017/06/366812.png">

<h1><center>Darwin Supervised Classification Model Building </center></h1>

Prior to getting started, there are a few things you want to do:
1. Set the dataset path.
2. Enter your username and password to ensure that you're able to log in successfully.

Once you're up and running, here are a few things to be mindful of:
1. For every run, look up the job status (i.e. requested, failed, running, or completed) and wait for the job to complete before proceeding.
2. If you're not satisfied with your model and think that Darwin can do better by exploring a larger search space, use the resume function.
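
The wait-then-proceed pattern from the first point can be sketched as a small polling loop. This is only an illustration: `wait_until_done` and the `get_status` callable are hypothetical stand-ins, not part of the Darwin SDK, which already provides `ds.wait_for_job` as used later in this notebook. The status strings are the ones listed above.

```python
import time

TERMINAL = {"failed", "completed"}  # statuses after which polling stops

def wait_until_done(get_status, poll_seconds=1.0, max_polls=100):
    """Poll get_status() until it returns a terminal status.

    get_status is a hypothetical zero-argument callable returning one of
    'requested', 'failed', 'running', 'completed'.
    """
    for _ in range(max_polls):
        status = get_status().lower()
        if status in TERMINAL:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("job did not finish within the polling budget")

# Toy usage: a fake job that moves through the states in order
states = iter(["requested", "running", "running", "completed"])
final_status = wait_until_done(lambda: next(states), poll_seconds=0)
print(final_status)
```

Returning the terminal status instead of a boolean lets the caller tell a failed job apart from a completed one before moving on.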

## Import libraries

In [1]:
# Import necessary libraries
import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import Image
from time import sleep
import os
import numpy as np
from sklearn.metrics import classification_report

from amb_sdk.sdk import DarwinSdk

ModuleNotFoundError: No module named 'amb_sdk'

## Setup

**Login to Darwin**<br>
Enter your registered username and password below to login to Darwin.

In [2]:
# Login
ds = DarwinSdk()
ds.set_url('https://amb-demo-api.sparkcognition.com/v1/')
status, msg = ds.auth_login_user('yasser@utexas.edu', 'En5dC4ZGwL')

if not status:
    print(msg)
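
To keep credentials out of the notebook itself, one option is to read them from environment variables. The sketch below is a suggestion, not something the Darwin SDK requires; `load_credentials` and the variable names `DARWIN_USERNAME` and `DARWIN_PASSWORD` are hypothetical.

```python
import os

def load_credentials(env=os.environ):
    """Return (username, password) from the given environment mapping.

    DARWIN_USERNAME and DARWIN_PASSWORD are hypothetical variable names
    chosen for this sketch; use whatever names suit your setup.
    """
    try:
        return env["DARWIN_USERNAME"], env["DARWIN_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"Set the {missing.args[0]} environment variable first")

# Toy usage with an explicit mapping instead of the real environment
user, password = load_credentials({"DARWIN_USERNAME": "me@example.com",
                                   "DARWIN_PASSWORD": "not-a-real-password"})
print(user)
```

The returned pair could then be passed to `ds.auth_login_user(user, password)` in place of the string literals.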

**Data Path** <br>
In the cell below, set the path to the directory that contains your dataset. The default is Darwin's example datasets; this notebook points to `listings/`.

In [3]:
path = 'listings/'

## Data Upload and Clean

**Read dataset and view a file snippet**

After setting up the dataset path, the next step is to upload the dataset from your local device to the server. <br> In the cell below, you need to specify the dataset_name if you want to use your own data.

In [4]:
dataset_name = 'austin_listings.csv'
df = pd.read_csv(os.path.join(path, dataset_name))
df.head()

| Unnamed: 0 | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1078 | *UT/Hyde Park Craftsman Apartment | 4635658 | Tracy | | 78705 | 30.30123 | -97.73674 | Entire home/apt | 85 | 1 | 208 | 2017-07-14 | 1.63 | 3 | 43 |
| 1 | 2265 | Zen-East in the Heart of Austin | 2466 | Paddy | | 78702 | 30.2775 | -97.71398 | Entire home/apt | 225 | 2 | 23 | 2018-09-16 | 0.19 | 3 | 125 |
| 2 | 5245 | Green, Colorful, Clean & Cozy home | 2466 | Paddy | | 78702 | 30.27577 | -97.71379 | Private room | 100 | 28 | 9 | 2018-03-14 | 0.07 | 3 | 3 |
| 3 | 5456 | Walk to 6th, Rainey St and Convention Ctr | 8028 | Sylvia | | 78702 | 30.26112 | -97.73448 | Entire home/apt | 95 | 2 | 472 | 2019-02-22 | 3.88 | 1 | 302 |
| 4 | 5769 | NW Austin Room | 8186 | Elizabeth | | 78729 | 30.45596 | -97.7837 | Private room | 40 | 1 | 240 | 2019-02-24 | 2.21 | 1 | 72 |


**Upload dataset to Darwin**

In [5]:
# Upload dataset
status, dataset = ds.upload_dataset(os.path.join(path, dataset_name))
if not status:
    print(dataset)

400: BAD REQUEST - {"message": "Dataset already exists"}
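
A re-run of the upload cell produces the "Dataset already exists" response shown above, which is usually harmless. The sketch below shows one way to tell that case apart from other failures. It assumes the error comes back as a string in exactly the printed form, which is not confirmed here; `parse_error_message` and `is_already_uploaded` are hypothetical helpers, not SDK functions.

```python
import json

def parse_error_message(response_text):
    """Extract the 'message' field from text such as
    '400: BAD REQUEST - {"message": "..."}'.
    Returns the raw text if no JSON payload is found."""
    _, sep, payload = response_text.partition(" - ")
    if not sep:
        return response_text
    try:
        return json.loads(payload).get("message", response_text)
    except (ValueError, AttributeError):
        # Payload was not valid JSON, or not a JSON object
        return response_text

def is_already_uploaded(response_text):
    """True when the response says the dataset is already on the server."""
    return parse_error_message(response_text) == "Dataset already exists"

msg = '400: BAD REQUEST - {"message": "Dataset already exists"}'
print(is_already_uploaded(msg))
```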



**Clean dataset**

In [6]:
# clean dataset
import urllib
target = "price"
status, job_id = ds.clean_data(dataset_name, target = target)



if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)

AttributeError: 'module' object has no attribute 'parse'

## Create and Train Model 

We will now build a model that learns to predict the target column.<br> In the default cancer dataset, the target column is "Diagnosis"; in this notebook the target is `price`, as set in the cleaning cell above. <br> You will have to specify your own target name for a custom dataset. <br> You can also increase max_train_time for longer training.


In [14]:
model = target + "_model0"
status, job_id = ds.create_model(dataset_names = dataset_name,
                                 model_name = model,
                                 max_train_time = '00:02')
if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)

400: BAD REQUEST - {"message": "Dataset has not been cleaned. Please clean dataset before training model."}
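
The training cells pass `max_train_time` as strings such as `'00:02'` and `'00:05'`. Assuming that layout is hours and minutes (`'HH:MM'`), which this notebook does not state explicitly, a small hypothetical helper can build the string from a minute count:

```python
def format_train_time(total_minutes):
    """Format a minute count as 'HH:MM'.

    Assumes max_train_time uses an hours:minutes layout; that is an
    inference from values like '00:02', not a documented fact here.
    """
    if total_minutes < 0:
        raise ValueError("total_minutes must be non-negative")
    hours, minutes = divmod(int(total_minutes), 60)
    return f"{hours:02d}:{minutes:02d}"

print(format_train_time(2))
print(format_train_time(90))
```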



## Extra Training (Optional)
Run the following cell for extra training. No additional parameters need to be specified; the cell reuses `dataset_name` and `model` from above.

In [None]:
# Train some more
status, job_id = ds.resume_training_model(dataset_names = dataset_name,
                                          model_name = model,
                                          max_train_time = '00:05')
                                          
if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)

## Analyze Model
`analyze_model` returns the feature importances as ranked by the model. <br> They give a general view of which features have the largest impact on the model's predictions.

In [None]:
# Retrieve feature importance of built model
status, artifact = ds.analyze_model(model)
sleep(1)
if status:
    ds.wait_for_job(artifact['job_name'])
    # Download the artifact only after the analysis job has been accepted and finished
    status, feature_importance = ds.download_artifact(artifact['artifact_name'])
else:
    print(artifact)

Show the 10 most important features of the model.

In [None]:
feature_importance[:10]
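
Slicing with `[:10]` returns the top ten only if the importances are already sorted in descending order. The sketch below sorts explicitly. It works on a plain dictionary and assumes the downloaded artifact can be converted to one (for a pandas Series, `feature_importance.to_dict()` would do that); `top_features` is a hypothetical helper and the scores are made up.

```python
def top_features(importances, n=10):
    """Return the n (feature, score) pairs with the highest scores."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]

# Made-up scores for illustration only; these are not Darwin results
example = {"room_type": 0.41, "latitude": 0.22, "longitude": 0.19,
           "minimum_nights": 0.11, "number_of_reviews": 0.07}
for name, score in top_features(example, n=3):
    print(f"{name:<20}{score:.2f}")
```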

## Predictions
**Perform model prediction on the training dataset.**

In [None]:
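status, artifact = ds.run_model(dataset_name, model)
sleep(1)
# Wait only if the run request was accepted; otherwise show the error
if status:
    ds.wait_for_job(artifact['job_name'])
else:
    print(artifact)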

Download predictions from Darwin's server.

In [None]:
status, prediction = ds.download_artifact(artifact['artifact_name'])
prediction.head()

Create plots comparing predictions with actual target

In [None]:
# Map each class label to an integer code so both series share one y-axis
unq = prediction[target].unique()[::-1]
p = np.zeros((len(prediction),))
a = np.zeros((len(prediction),))
for i, q in enumerate(unq):
    p += i * (prediction[target] == q).values
    a += i * (df[target] == q).values
# Plot predictions vs actual
plt.plot(a)
plt.plot(p)
plt.legend(['Actual', 'Predicted'])
# Label the integer codes on the y-axis with the original class names
plt.yticks(range(len(unq)), unq)
print(classification_report(df[target], prediction[target]))
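
The loop in the plotting cell turns each class label into its position in `unq`, so that actual and predicted labels can be drawn as two lines on the same axis. A toy version of that mapping, using plain lists and a hypothetical `encode_labels` helper, looks like this. Unlike the loop above, which leaves unknown labels at 0, this sketch raises a `KeyError` for a label that is not in `classes`.

```python
def encode_labels(labels, classes):
    """Map each label to its position in classes (the y-axis order)."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels]

# Toy labels for illustration; not taken from any dataset in this notebook
classes = ["B", "M"]
actual = ["M", "B", "B", "M"]
predicted = ["M", "B", "M", "M"]

a = encode_labels(actual, classes)
p = encode_labels(predicted, classes)
print(a)
print(p)
```

Wherever the two encoded sequences differ, the plotted lines separate, which marks a misclassified row.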

**Perform model prediction on a test dataset that wasn't used in training.** <br>
Upload test dataset

In [None]:
test_data = 'cancer_test.csv'
status, dataset = ds.upload_dataset(os.path.join(path, test_data))
if not status:
    print(dataset)

Clean the test data.

In [None]:
# clean test dataset
status, job_id = ds.clean_data(test_data, target = target, model_name = model)

if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)

Run model on test dataset.

In [None]:
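status, artifact = ds.run_model(test_data, model)
sleep(1)
# Wait only if the run request was accepted; otherwise show the error
if status:
    ds.wait_for_job(artifact['job_name'])
else:
    print(artifact)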

Create plots comparing predictions with actual target

In [None]:
# Create plots comparing predictions with actual target
status, prediction = ds.download_artifact(artifact['artifact_name'])
# Reload the test file so that df holds the actual targets for the test rows
df = pd.read_csv(os.path.join(path, test_data))
# Map each class label to an integer code so both series share one y-axis
unq = prediction[target].unique()[::-1]
p = np.zeros((len(prediction),))
a = np.zeros((len(prediction),))
for i, q in enumerate(unq):
    p += i * (prediction[target] == q).values
    a += i * (df[target] == q).values
# Plot predictions vs actual
plt.plot(a)
plt.plot(p)
plt.legend(['Actual', 'Predicted'])
# Label the integer codes on the y-axis with the original class names
plt.yticks(range(len(unq)), unq)
print(classification_report(df[target], prediction[target]))
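
`classification_report` prints precision, recall, F1 score, and support for each class. As a rough illustration of the first two columns, the sketch below computes precision and recall for one class by hand. `precision_recall` is a hypothetical helper and the labels are toy data, not results from this notebook.

```python
def precision_recall(actual, predicted, positive):
    """Precision and recall for one class, computed from paired labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

actual = ["M", "B", "B", "M"]
predicted = ["M", "B", "M", "M"]
print(precision_recall(actual, predicted, positive="M"))
print(precision_recall(actual, predicted, positive="B"))
```

Precision answers "of the rows predicted as this class, how many were right", while recall answers "of the rows that truly belong to this class, how many were found".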

## Find out which machine learning model Darwin used

In [None]:
status, model_type = ds.lookup_model_name(model)
print(model_type['description']['best_genome'])