<a href="https://colab.research.google.com/github/volkangurel/titanic-layer/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installation

You will need Python 3.7.12 for this demo. You can ignore any pip resolver errors.

In [None]:
from platform import python_version

required = "3.7.12"
if python_version() != required:
    print(f"Python {required} is required to run this demo. Your version is {python_version()}")

In [None]:
!rm -rf titanicv2
!git clone https://github.com/mecevit/titanicv2.git
!pip install wheel
!pip install titanicv2/layer_sdk-0.8.15.post50.dev0+g2adb2162d8.dirty-py3-none-any.whl -qqq

Let's check that Layer was installed successfully.

In [None]:
!layer --version

Layer, version 0.8.15.post50.dev+g2adb2162d8.dirty


# Register and Login

To use Layer, you have to register and log in. Run the following cell, click the link, register, and paste the code into the input field.

In [None]:
from layer.v2.assertions import greatexpectations, assert_true, assert_valid_values, assert_not_null, assert_unique
from layer.v2.decorators import dataset, model
from layer.v2.dependencies import File
from layer.v2 import LayerProject
from layer.client import Dataset
import layer

# layer.logout()
layer.login("https://dev-judgment-day.layer.co/")

# Let's dive in!

Everything is ready, so let's start building our model. In this notebook, we will build a model that predicts which Titanic passengers survived. We will be using the famous [Kaggle Titanic](https://www.kaggle.com/c/titanic/data?select=train.csv) dataset.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from pathlib import Path
import pandas as pd

def clean_sex(sex):
    result = 0
    if sex == "female":
        result = 0
    elif sex == "male":
        result = 1
    return result


def clean_age(data):
    age = data[0]
    pclass = data[1]
    if pd.isnull(age):
        if pclass == 1:
            return 37
        elif pclass == 2:
            return 29
        else:
            return 24
    else:
        return age

def dummy_passengers():
    # Based on passenger 2 (high passenger class female)
    passenger2 = {'PassengerId': 2,
                  'Pclass': 1,
                  'Name': ' Mrs. John',
                  'Sex': 'female',
                  'Age': 38.0,
                  'SibSp': 1,
                  'Parch': 0,
                  'Ticket': 'PC 17599',
                  'Fare': 71.2833,
                  'Embarked': 'C'}

    return passenger2


def get_passenger_features(df):
    df['Sex'] = df['Sex'].apply(clean_sex)
    df['Age'] = df[['Age', 'Pclass']].apply(clean_age, axis=1)
    return df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]


def test_survival_probability(model: RandomForestClassifier) -> bool:
    """
    We have 2 directional expectations here:
    - Changing gender from female to male should decrease survival probability.
    - Changing Pclass from 1 to 3 should decrease survival probability.

    Reference:
    https://eugeneyan.com/writing/testing-ml/

    :param model: Trained survival model
    :return: True if both expectations hold, False otherwise
    """
    p2 = dummy_passengers()

    # Get original survival probability of passenger 2
    test_df = pd.DataFrame.from_dict([p2], orient='columns')
    X = get_passenger_features(test_df)
    p2_prob = model.predict_proba(X)[0][1]  # 0.99

    # Change gender from female to male
    p2_male = p2.copy()
    p2_male['Sex'] = 'male'
    test_df = pd.DataFrame.from_dict([p2_male], orient='columns')
    X = get_passenger_features(test_df)
    p2_male_prob = model.predict_proba(X)[0][1]  # 0.53

    # Change class from 1 to 3
    p2_class = p2.copy()
    p2_class['Pclass'] = 3
    test_df = pd.DataFrame.from_dict([p2_class], orient='columns')
    X = get_passenger_features(test_df)
    p2_class_prob = model.predict_proba(X)[0][1]  # 0.0

    # Both changes (female to male, class 1 to 3) should decrease survival probability.
    return p2_male_prob < p2_prob and p2_class_prob < p2_prob

data_file = 'titanicv2/titanic.csv'
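
The directional check in `test_survival_probability` needs a trained model, so it is hard to try in isolation. The following is a minimal sketch of the same idea on made-up data, where survival depends only on `Sex`. The toy frame, `toy_model`, and the `survival_prob` helper are illustrative assumptions and are not part of the Layer project.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up training data: survival depends only on Sex (0 = female, 1 = male).
rows = []
for sex in (0, 1):
    for pclass in (1, 3):
        rows += [{"Pclass": pclass, "Sex": sex, "Survived": 1 - sex}] * 20
toy = pd.DataFrame(rows)

toy_model = RandomForestClassifier(n_estimators=20, random_state=0)
toy_model.fit(toy[["Pclass", "Sex"]], toy["Survived"])


def survival_prob(model, pclass, sex):
    """Probability of the positive class (column 1) for a single passenger."""
    X = pd.DataFrame([{"Pclass": pclass, "Sex": sex}])
    return model.predict_proba(X)[0][1]


p_female = survival_prob(toy_model, pclass=1, sex=0)
p_male = survival_prob(toy_model, pclass=1, sex=1)

# Directional expectation: changing Sex to male must lower the probability.
directional_ok = p_male < p_female
print(p_female, p_male, directional_ok)
```

Because the check only compares two probabilities, it stays valid when the model is retrained and the exact values move.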

In [None]:
@dataset('passengers', dependencies=[File(data_file)])
@assert_valid_values('Sex', ['male', 'female'])
@assert_unique("PassengerId")
def read_and_clean_dataset():
    df = pd.read_csv(data_file)
    return df

@dataset('features')
@assert_valid_values('Sex', [0,1])
def extract_features():
    df = layer.get_dataset("passengers").to_pandas()
    # df = read_and_clean_dataset()
    df['Sex'] = df['Sex'].apply(clean_sex)
    df['Age'] = df[['Age', 'Pclass']].apply(clean_age, axis=1)
    df = df.drop(["PassengerId", "Name", "Cabin", "Ticket", "Embarked"], axis=1)
    return df

@model(name='survival_model')
@assert_true(test_survival_probability)
# @fabric("f-spark-large")
# @schedule(reactive=[Dataset("features")])
def train():
    # df = extract_features()
    df = layer.get_dataset("features").to_pandas()
    X = df.drop(["Survived"], axis=1)
    y = df["Survived"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    random_forest = RandomForestClassifier(n_estimators=100)
    random_forest.fit(X_train, y_train)
    y_pred = random_forest.predict(X_test)
    layer.log_metric("accuracy", accuracy_score(y_test, y_pred))
    layer.log_metric("f1", f1_score(y_test, y_pred))
    return random_forest



# ++ init Layer
layer_project = LayerProject(name="titanic", requirements=File("titanicv2/requirements.txt"), debug=True)

# ++ To run the whole project on Layer infra (the order of the functions is not important)
layer_project.run([train, read_and_clean_dataset, extract_features])

# ++ To build individual assets on Layer infra
# layer_project.run([read_and_clean_dataset])
# layer_project.run([extract_features])
# layer_project.run([train])

# ++ To debug the code locally, just call the functions:
# df = read_and_clean_dataset()
# df = extract_features()
# df.head()
# train()
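
The `@assert_valid_values` and `@assert_unique` decorators above attach data checks to each dataset. As a rough sketch of what such checks amount to, here are plain-pandas versions. This is an assumption about their semantics, not Layer's implementation, and `check_valid_values` and `check_unique` are hypothetical helper names.

```python
import pandas as pd


def check_valid_values(df, column, allowed):
    """Return True if every value in `column` is one of `allowed`."""
    return bool(df[column].isin(allowed).all())


def check_unique(df, column):
    """Return True if `column` contains no duplicate values."""
    return bool(df[column].is_unique)


# Made-up frame that passes both checks.
sample = pd.DataFrame({"PassengerId": [1, 2, 3],
                       "Sex": ["male", "female", "female"]})
valid_sex = check_valid_values(sample, "Sex", ["male", "female"])
unique_ids = check_unique(sample, "PassengerId")

# Made-up frame that fails both checks.
bad = pd.DataFrame({"PassengerId": [1, 1],
                    "Sex": ["male", "unknown"]})
bad_valid_sex = check_valid_values(bad, "Sex", ["male", "female"])
bad_unique_ids = check_unique(bad, "PassengerId")

print(valid_sex, unique_ids, bad_valid_sex, bad_unique_ids)
```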

# Results

After you train your model, you can see all your datasets, features, and model experiments in the Layer interface:

https://dev-judgment-day.layer.co/

You can also reuse any of the entities you have created. Let's fetch the model and the features you just built and do some batch predictions.

In [None]:
survival_model = layer.get_model("survival_model")
survival_model.metrics

{'accuracy': [(1643901290564, 0.7947761194029851)],
 'f1': [(1643901290579, 0.7417840375586854)]}
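
The two metrics logged in `train()` come from `accuracy_score` and `f1_score`. To make the numbers easier to interpret, this small sketch recomputes both by hand on made-up labels and compares them with scikit-learn. The label lists are illustrative only.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up ground truth and predictions (1 = survived).
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

# Accuracy: share of predictions that match the ground truth.
accuracy_by_hand = sum(t == p for t, p in pairs) / len(pairs)
# F1: harmonic mean of precision and recall for the positive class.
f1_by_hand = 2 * tp / (2 * tp + fp + fn)

accuracy_sklearn = accuracy_score(y_true, y_pred)
f1_sklearn = f1_score(y_true, y_pred)
print(accuracy_by_hand, f1_by_hand)
```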

In [None]:
train = survival_model.get_train()
train

RandomForestClassifier()

In [None]:
passenger_features = layer.get_dataset("features").to_pandas()
passenger_features.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare
0,0,3,1,22.0,1,0,7.25
1,1,1,0,38.0,1,0,71.2833
2,1,3,0,26.0,0,0,7.925
3,1,1,0,35.0,1,0,53.1
4,0,3,1,35.0,0,0,8.05
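
In this frame `Sex` is already encoded as 0/1 and `Age` has no gaps, which is the result of `clean_sex` and `clean_age`. As a rough sketch, the same transformation can be written without row-wise `apply`. The small `raw` frame below is made up for illustration, and this vectorized form is an alternative, not the code the project runs.

```python
import numpy as np
import pandas as pd

# Made-up raw rows with missing ages.
raw = pd.DataFrame({
    "Pclass": [1, 2, 3, 3],
    "Sex": ["female", "male", "male", "female"],
    "Age": [np.nan, np.nan, np.nan, 26.0],
})

# Same mapping as clean_sex: female -> 0, male -> 1, anything else -> 0.
sex_encoded = raw["Sex"].map({"female": 0, "male": 1}).fillna(0).astype(int)

# Same defaults as clean_age: 37 for class 1, 29 for class 2, 24 otherwise.
default_age = raw["Pclass"].map({1: 37, 2: 29}).fillna(24)
age_filled = raw["Age"].fillna(default_age)

cleaned = raw.assign(Sex=sex_encoded, Age=age_filled)
print(cleaned)
```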


In [None]:
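# Caution: `Sex` is already encoded as 0/1 in the "features" dataset, so the
# clean_sex call inside get_passenger_features resets it to 0 for every row.
X = get_passenger_features(passenger_features)
# Survival probability (class 1) for the first passenger
train.predict_proba(X.values)[0][1]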


0.52
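
The cell above scores only the first row. A batch prediction over every row could look like the following sketch. It reuses the five feature rows shown earlier and trains a small local stand-in model (`toy_model`, an assumption for illustration) in place of the model fetched from Layer, so the probabilities it produces are not the real model's.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

feature_cols = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]

# The five rows displayed by passenger_features.head() above.
rows = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": [1, 0, 0, 0, 1],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Stand-in for the model returned by survival_model.get_train().
toy_model = RandomForestClassifier(n_estimators=10, random_state=0)
toy_model.fit(rows[feature_cols].values, rows["Survived"])

# Column 1 of predict_proba holds the probability of class 1 (survived).
probs = toy_model.predict_proba(rows[feature_cols].values)[:, 1]
scored = rows.assign(SurvivalProb=probs)
print(scored[["Survived", "SurvivalProb"]])
```

Selecting the feature columns directly keeps the existing 0/1 `Sex` encoding intact.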