In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import typing

# Introduction

The premise of this notebook is to work through how an AutoML system might be created. I intend to take this concept further in future notebooks and eventually create my own AutoML release to tackle common tasks. This will be a learning experience for me and will hopefully be useful for you, the reader.

For this notebook I will attempt to create a series of functions that are entirely generic and then apply them to the problem to make a submission. This is quite different from the usual approach of creating a specific solution to the problem at hand. It is my hope that these generic functions can eventually be placed into classes to form an AutoML pipeline.

The "from scratch" in the title means that I won't be forking an existing AutoML model or copying and pasting blocks of code from one.

Further notebooks on the development of the RabbitML library are below: 

[AutoML from Scratch #1](https://www.kaggle.com/code/taranmarley/automl-from-scratch-1/notebook)

[AutoML from Scratch #2](https://www.kaggle.com/code/taranmarley/automl-from-scratch-2/notebook)

[AutoML from Scratch #3](https://www.kaggle.com/code/taranmarley/automl-from-scratch-3/notebook)

[AutoML from Scratch #4](https://www.kaggle.com/code/taranmarley/automl-from-scratch-4/notebook)

[AutoML from Scratch #5](https://www.kaggle.com/code/taranmarley/automl-from-scratch-5/notebook)

[AutoML from Scratch #6](https://www.kaggle.com/code/taranmarley/automl-from-scratch-6/notebook)

[AutoML from Scratch #7](https://www.kaggle.com/code/taranmarley/automl-from-scratch-7/notebook)

[AutoML from Scratch #8](https://www.kaggle.com/code/taranmarley/automl-from-scratch-8/notebook)

[AutoML from Scratch #9](https://www.kaggle.com/code/taranmarley/automl-from-scratch-9/notebook)

# Load Data

In [None]:
df = pd.read_csv("../input/titanic/train.csv")
test_df = pd.read_csv("../input/titanic/test.csv")

In [None]:
df.head()

# Detect NaNs

First let's create a generic function that gives us feedback on NaNs in the data. I will attempt to make a pythonic docstring since we will be using this later on.

In [None]:
def detect_NaNs(df_temp : pd.DataFrame, name = '', silent : bool = False, plot : bool = True):
    """
    Detect NaNs in a provided dataframe and return the columns that NaNs were detected in     
    
    Parameters
    ----------
    df_temp : pd.DataFrame
        Dataframe to detect NaN values in
    name : str
        Name of the dataframe which helps give a more descriptive read out
    silent : bool
        Whether the print statements should fire
    plot : bool
        Whether to return a plot of the counts of NaNs in the data
    
    Returns
    -------
    typing.List
        List of columns in the provided dataframe that contain NaN values
    """
    count_nulls = df_temp.isnull().sum().sum()
    columns_with_NaNs = []
    # Count NaNs by column
    if count_nulls > 0:
        for col in df_temp.columns:
            if df_temp[col].isnull().sum() > 0:
                columns_with_NaNs.append(col)
    # Print out the NaN values
    if not silent:            
        if name != '': 
            print('Detecting NaNs in', name)
        print('NaNs in data:', count_nulls)
        if count_nulls > 0:
            print('******')
            for col in columns_with_NaNs:
                print('NaNs in', col + ": ", df_temp[col].isnull().sum())
            print('******')
        print('')
    # Plot the NaN values in columns in bar plot
    if plot and count_nulls > 0:
        sns.barplot(x=df_temp[columns_with_NaNs].isnull().sum().index, y=df_temp[columns_with_NaNs].isnull().sum().values)
        plt.show()
    return columns_with_NaNs

Let's run this to check that it works as expected

In [None]:
nan_columns = detect_NaNs(df, "Training Data")
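The column detection at the heart of `detect_NaNs` can also be done without an explicit loop. Here is a minimal sketch on a made-up dataframe (the `toy` frame and its column names are only for illustration, not part of the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two columns contain missing values, one does not
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "cabin": [None, None, "C85"],
    "fare": [7.25, 71.28, 8.05],
})

# Count NaNs per column, then keep only the columns with at least one NaN
nan_counts = toy.isnull().sum()
columns_with_nans = nan_counts[nan_counts > 0].index.tolist()
print(columns_with_nans)
```

The per-column counts in `nan_counts` are the same numbers the function prints and plots.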

# Fill NaNs in Dataframe

Now I will create a function that fills in NaN values and returns the dataframe with the filled values, plus a new indicator column for each filled column that records which rows originally held a NaN, in case the missingness itself is significant.

In [None]:
def fill_nans_create_columns(df_temp : pd.DataFrame, columns : typing.List, value : float = 0):
    """
    Fill NaNs in the provided columns and create indicator columns flagging which rows were NaN.
    
    Parameters
    ----------
    df_temp : pd.DataFrame
        Dataframe to modify
    columns : typing.List
        Columns of the provided dataframe to modify
    value : float
        Value to replace the NaN values with
    
    Returns
    -------
    pd.DataFrame
        Modified Dataframe with NaNs filled and new columns signifying the rows that contained NaNs
    """
    for col in columns:
        df_temp[col + "_was_null"] = df_temp[col].isnull().astype(int)
        df_temp[col] = df_temp[col].fillna(value)
    return df_temp

In [None]:
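df = fill_nans_create_columns(df, nan_columns, 0)
# The test data can have NaNs in different columns, so detect them separately
test_nan_columns = detect_NaNs(test_df, "Test Data")
test_df = fill_nans_create_columns(test_df, test_nan_columns, 0)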
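To see why the extra indicator column matters, here is a small sketch of the same two steps on a made-up column (the `toy` frame is hypothetical). Once the NaN is filled with 0, a genuine 0 and a missing value would look identical without the flag:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
toy = pd.DataFrame({"age": [22.0, np.nan, 35.0]})

# Record where the NaNs were before they are overwritten
toy["age_was_null"] = toy["age"].isnull().astype(int)
# Then fill the NaNs with the chosen value
toy["age"] = toy["age"].fillna(0)
print(toy)
```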

# Detect Duplicates

Now I will look at detecting duplicates in the data with a generic function.

In [None]:
def detect_duplicates(df_temp : pd.DataFrame, silent : bool = False): 
    """
    Detect duplicate rows in the data, ignoring ID columns, and print the number found.
    
    Parameters
    ----------
    df_temp : pd.DataFrame
        Dataframe to detect duplicates in
    silent : bool
        Whether to run print statements 
    """
    # Filter out identity columns
    cols_to_use = []
    id_cols = []
    for col in df_temp.columns:
        if len(df_temp[col].unique()) != len(df_temp[col]):
            cols_to_use.append(col)
        else:
            id_cols.append(col)
    df_temp = df_temp.copy()[cols_to_use]    
    count_dupes = df_temp.duplicated().sum()
    if not silent:
        print('Duplicates in data: ', str(count_dupes))
        print('When filtering out id columns: ', str(id_cols))
    

In [None]:
detect_duplicates(df)
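The reason for filtering out the ID columns first is that a unique ID makes every row look distinct. A minimal sketch on made-up data (the `toy` frame is hypothetical) shows the difference:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical apart from the id
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "sex": ["male", "female", "male"],
    "pclass": [3, 1, 3],
})

# With the id column included no row repeats
dupes_with_id = toy.duplicated().sum()

# Keep only columns that do not have one unique value per row
non_id_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) != len(toy)]
dupes_without_id = toy[non_id_cols].duplicated().sum()

print(dupes_with_id, dupes_without_id)
```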

# Break Up Object Columns by Spaces

It would be helpful if the object columns were broken up by the spaces in them, so that multiple columns are created.

In [None]:
def break_up_by_string(df_temp : pd.DataFrame, splitting_string : str):
    """
    Break up columns by string to create new columns from each split.

    Parameters
    ----------
    df_temp : pd.DataFrame
        Dataframe to start splitting up object columns
    splitting_string : str
        String to split up columns by
        
        
    Returns
    -------
    pd.DataFrame
        modified dataframe with extra columns containing split up values
    """
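    obj_cols = df_temp.select_dtypes(include=[object]).columns
    # Only split columns in which the splitting string actually occurs
    for col in obj_cols:
        # regex=False treats the splitting string as a literal, not a pattern
        if df_temp[col].str.contains(splitting_string, regex=False).sum() > 0: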
            df2 = df_temp[col].str.split(splitting_string, expand=True)
            # Rename columns
            rename_dict = {}
            for rename_col in df2.columns:
                if (splitting_string != " "):
                    rename_dict[rename_col] = col + splitting_string + str(rename_col)
                else:
                    rename_dict[rename_col] = col + str(rename_col)
            df2 = df2.rename(columns=rename_dict)
            df2 = df2.fillna(0)
            df_temp = pd.concat([df_temp,df2], axis=1) 
    return df_temp

df = break_up_by_string(df, ' ')
test_df = break_up_by_string(test_df, ' ')
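To make the splitting step concrete, here is a small sketch of what `str.split` with `expand=True` produces on two made-up names (the values are only for illustration). Rows with fewer pieces are padded with missing values, which is why the function above fills the new columns afterwards:

```python
import pandas as pd

# Hypothetical toy column: the second name has one more word than the first
names = pd.Series(["Braund, Mr. Owen", "Heikkinen, Miss. Laina Maria"], name="Name")

# Each piece between spaces becomes its own column
parts = names.str.split(" ", expand=True)
# Name the new columns after the source column, as the function does for spaces
parts = parts.rename(columns={i: f"Name{i}" for i in parts.columns})
print(parts)
```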

# Remove the ID Columns

ID columns commonly have no predictive power. In this dataset names might actually be useful, but I'm aiming for an automated system here, and certain kinds of feature engineering may be hard to replicate generically. I don't believe even the best thought-out automated feature engineering is likely to beat human feature engineering at this time.

In [None]:
def detect_id_columns(df_temp : pd.DataFrame):
    """
    Detect which columns are ID columns, those for which one unique value exists for each row.
    
    Parameters
    ----------
    df_temp : pd.DataFrame
        Dataframe to detect ID columns
        
    Returns
    -------
    typing.List
        List of Identity columns that were detected
    """
    id_cols = []
    for col in df_temp.columns:
        if len(df_temp[col].unique()) == len(df_temp[col]):
            id_cols.append(col)
    return id_cols

id_cols = detect_id_columns(df)
df = df.drop(columns=id_cols)
test_df = test_df.drop(columns=id_cols)
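One caveat of this uniqueness rule is worth seeing on a small example. In the made-up frame below (hypothetical values), a continuous feature that happens to have no repeated values is flagged alongside the real ID column, so on small datasets the heuristic may be too aggressive:

```python
import pandas as pd

# Hypothetical toy data: Fare has no repeated values in this tiny sample
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Pclass": [3, 1, 3, 2],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Same rule as detect_id_columns: one unique value per row
id_like = [col for col in toy.columns if toy[col].is_unique]
print(id_like)
```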

# Remove Unshared Columns

The train and test dataframes need to share the same feature columns, so any column that appears in only one of them is dropped.

In [None]:
def drop_unshared_columns(df_temp : pd.DataFrame, df_temp_2 : pd.DataFrame, exclude_columns : typing.List):
    """
    Detect which columns are not shared between the two dataframes, except for those listed in exclude_columns.
    Drop them in place.
    
    Parameters
    ----------
    df_temp : pd.DataFrame
        Dataframe to check for shared columns        
    df_temp_2 : pd.DataFrame
        Second dataframe to check for shared columns
    exclude_columns : typing.List
        Columns not to remove in this process
    """    
    for col in df_temp_2.columns:
        if col not in df_temp.columns and col not in exclude_columns:
            df_temp_2.drop(columns=col, inplace=True)
    for col in df_temp.columns:
        if col not in df_temp_2.columns and col not in exclude_columns:
            df_temp.drop(columns=col, inplace=True)

drop_unshared_columns(df, test_df, ['Survived'])
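The same idea can be written without modifying the inputs in place. This is only a sketch of an alternative, not what the notebook uses: `keep_shared_columns` is a hypothetical helper and the two frames are made up.

```python
import pandas as pd

def keep_shared_columns(train, test, exclude_columns):
    """Return copies of both frames restricted to shared or excluded columns."""
    shared = set(train.columns) & set(test.columns)
    keep = shared | set(exclude_columns)
    # Preserve each frame's original column order
    train_out = train[[c for c in train.columns if c in keep]]
    test_out = test[[c for c in test.columns if c in keep]]
    return train_out, test_out

# Hypothetical toy frames with one unshared column each plus the target
train = pd.DataFrame({"Survived": [0, 1], "Age": [22, 38], "Embarked_was_null": [0, 0]})
test = pd.DataFrame({"Age": [34, 47], "Fare_was_null": [0, 1]})

train_out, test_out = keep_shared_columns(train, test, ["Survived"])
print(list(train_out.columns), list(test_out.columns))
```

Returning new frames means the caller has to reassign the results, but the originals stay untouched if something goes wrong further down.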
    

# Feature Encoding

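We will now encode the features in an automated fashion. Object columns with fewer unique classes than the cutoff are one-hot encoded with `pd.get_dummies`; the rest are label encoded to keep the dimensionality under control.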

In [None]:
from sklearn import preprocessing

def encode_columns(df : pd.DataFrame, columns : pd.Series, test_df : pd.DataFrame = None, cutoff : int = 12):
    """
    Encode columns based on the number of unique values in each column
    
    Parameters
    ----------
    df : pd.DataFrame
        Dataframe to encode columns in 
    columns : pd.Series
        Columns to encode
    test_df : pd.DataFrame
        Test dataframe to encode based on classes in the Dataframe
    cutoff : int
        The cut off number of classes to choose between label encoding and get dummies. This keeps the dimensionality under control
        
    Returns
    -------
    (pd.DataFrame, pd.DataFrame)
        Original dataframe and the test dataframe
    """    
    for col in columns:
        le = preprocessing.LabelEncoder()
        classes_to_encode = df[col].astype(str).unique().tolist()
        classes_to_encode.sort()
        classes_to_encode.append('None')
        le.fit(classes_to_encode)
        if len(le.classes_) < cutoff:
            df = pd.get_dummies(df, columns = [col])
            if test_df is not None:
                test_df = pd.get_dummies(test_df, columns = [col])
        else:
            # Remember the raw training values before they are label encoded
            seen_values = df[col].unique()
            df[col] = le.transform(df[col].astype(str))
            if test_df is not None:
                # Map labels unseen in training to the placeholder class 'None'
                test_df[col] = test_df[col].where(test_df[col].isin(seen_values), 'None')
                test_df[col] = le.transform(test_df[col].astype(str))
    return df, test_df

df, test_df = encode_columns(df, df.select_dtypes(include=['object']).columns, test_df, 12)

We then drop unshared columns again to get rid of the dummy columns created by `pd.get_dummies` that exist in only one of the two dataframes.

In [None]:
drop_unshared_columns(df, test_df, ['Survived'])
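Dropping is the approach I use here, but a possible alternative is to align the test dummies to the training dummies so that a class missing from the test data becomes a column of zeros instead of being removed from both. A minimal sketch on made-up data (the `Embarked` values below are only for illustration):

```python
import pandas as pd

# Hypothetical toy data: class "Q" appears in train but not in test
train = pd.DataFrame({"Embarked": ["S", "C", "Q"]})
test = pd.DataFrame({"Embarked": ["S", "S", "C"]})

train_d = pd.get_dummies(train, columns=["Embarked"], dtype=int)
test_d = pd.get_dummies(test, columns=["Embarked"], dtype=int)

# Give the test frame exactly the training columns, filling gaps with 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_d)
```

Note that `reindex` also discards any dummy column that exists only in the test data, which matches the fact that a model trained on `train_d` has never seen it.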

# Quantile Transformation

Quantile transformation can be helpful in my experience for analysis and learning, so this will be my first scaling function.

In [None]:
from sklearn.preprocessing import QuantileTransformer

def quantile_transform_column_wise(df_temp : pd.DataFrame, target_col : str = ""):
    """
    Transform values in dataframe to quantile uniform distribution
    
    Parameters
    ----------
    df_temp : pd.DataFrame
        Dataframe to quantile transform 
    target_col : str
        This is the target col and is not transformed
        
    Returns
    -------
    pd.DataFrame
        Modified dataframe
    """    
    df_temp = df_temp.copy()
    # Cap the number of quantiles at the number of rows
    n_samples : int = min(1000, len(df_temp))
    for col in df_temp.columns:
        if col != target_col:
            transformed = QuantileTransformer(random_state=1, n_quantiles=n_samples).fit_transform(df_temp[col].values.reshape(-1, 1))
            df_temp[col] = pd.Series(transformed[:,0], index=df_temp[col].index, name=df_temp[col].name)
    return df_temp

df = quantile_transform_column_wise(df, "Survived")
test_df = quantile_transform_column_wise(test_df, "Survived")
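The function above fits a fresh `QuantileTransformer` on whatever dataframe it is given, so train and test are each scaled by their own quantiles. A common alternative, which I have not used in this notebook, is to fit on the training data only and reuse the fitted transformer on the test data. Here is a minimal sketch of that idea on made-up data (the `fare` column and its values are just for illustration):

```python
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

# Hypothetical toy data: the test set contains a value far above the train range
train = pd.DataFrame({"fare": [5.0, 10.0, 20.0, 40.0, 80.0]})
test = pd.DataFrame({"fare": [10.0, 30.0, 1000.0]})

# Cap the number of quantiles at the number of training rows
n_quantiles = min(1000, len(train))
qt = QuantileTransformer(n_quantiles=n_quantiles, random_state=1)

# Learn the quantiles from train, then apply the same mapping to test
train_q = pd.DataFrame(qt.fit_transform(train), columns=train.columns, index=train.index)
test_q = pd.DataFrame(qt.transform(test), columns=test.columns, index=test.index)
print(test_q)
```

With this approach a test value that also occurs in the training data is mapped to the same quantile in both frames, and test values outside the training range are clipped to the ends of the [0, 1] interval.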

# Examine Pipeline

We have something reasonable now, though I already have many ideas for improvement. Let's see if we can start from the top and run this as a pipeline that takes the dataframes and applies all the modifications in order.

In [None]:
def data_pipeline(df : pd.DataFrame, test_df : pd.DataFrame, target_col : str = "", plot : bool = False, silent : bool = True):
    """
    Perform a series of modifications to make a tabular dataset more suitable for machine learning analysis
    
    Parameters
    ----------
    df : pd.DataFrame
        Dataframe to transform 
    test_df : pd.DataFrame
        Test dataframe to transform  
    target_col : str
        This is the target col and is not transformed or removed
    plot : bool
        This tells it whether to plot graphs in called functions
    silent : bool
        This tells it whether to avoid print statements in called functions
        
    Returns
    -------
    (pd.DataFrame, pd.DataFrame)
        Modified dataframe versions of the provided dataframes
    """    
    # Detect NaNs
    nan_columns : typing.List = detect_NaNs(df, "Training Data", silent, plot)
    test_nan_columns : typing.List = detect_NaNs(test_df, "Test Data", silent, plot)
    # Fill NaNs
    df = fill_nans_create_columns(df, nan_columns, 0)
    test_df = fill_nans_create_columns(test_df, test_nan_columns, 0)
    # Detect duplicates
    detect_duplicates(df, silent)
    # Break up the columns by spaces
    df = break_up_by_string(df, ' ')
    test_df = break_up_by_string(test_df, ' ')
    # Break up the columns by slash
    df = break_up_by_string(df, '/')
    test_df = break_up_by_string(test_df, '/')
    # Remove id columns  
    id_cols = detect_id_columns(df.copy())
    df = df.drop(columns=id_cols)
    test_df = test_df.drop(columns=id_cols)    
    # Remove unshared columns   
    drop_unshared_columns(df, test_df, [target_col])
    # Encode columns
    df, test_df = encode_columns(df, df.select_dtypes(include=['object']).columns, test_df, 12)
    # Remove new unshared columns   
    drop_unshared_columns(df, test_df, [target_col])
    # Quantile transform features
    df = quantile_transform_column_wise(df, target_col)
    test_df = quantile_transform_column_wise(test_df, target_col)
    return df, test_df

df, test_df = data_pipeline(pd.read_csv("../input/titanic/train.csv"), pd.read_csv("../input/titanic/test.csv"), "Survived")

# Quick Prediction Test

I will now do a quick machine learning test using logistic regression on the data from the pipeline. We are definitely not trying to set a record score here, just to see our work in practice.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

y = df["Survived"].values
X = df.drop(columns="Survived").values

skf = StratifiedKFold(n_splits=5)
lr = LogisticRegression(random_state=1, max_iter=200)
scores = []
for train_index, test_index in skf.split(X, y):
    lr.fit(X[train_index], y[train_index])
    scores.append(lr.score(X[test_index], y[test_index]))

print("Score:", sum(scores) / len(scores))
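The manual fold loop above can also be written with scikit-learn's `cross_val_score`, which by default uses the estimator's own `score` method (accuracy for `LogisticRegression`). Here is a minimal sketch on synthetic data from `make_classification`, which is my stand-in for this example and not the Titanic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data, only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)

skf = StratifiedKFold(n_splits=5)
lr = LogisticRegression(random_state=1, max_iter=200)

# One accuracy score per fold, each from a freshly fitted clone of lr
fold_scores = cross_val_score(lr, X_demo, y_demo, cv=skf)
print("Score:", fold_scores.mean())
```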

For comparison, let's now run the same test on the csv file as is, with only the minimum modifications needed for logistic regression to run: dropping the object columns and filling NaNs with 0. This will show whether our modifications achieve anything.

In [None]:
df = pd.read_csv("../input/titanic/train.csv")
y = df["Survived"].values
X = df.drop(columns="Survived").select_dtypes(exclude=['object']).fillna(0).values

skf = StratifiedKFold(n_splits=5)
lr = LogisticRegression(random_state=1, max_iter=200)
scores = []
for train_index, test_index in skf.split(X, y):
    lr.fit(X[train_index], y[train_index])
    scores.append(lr.score(X[test_index], y[test_index]))

print("Score:", sum(scores) / len(scores))

We can see the score has changed dramatically, which suggests our pipeline is effective compared with simply opening the csv. Because the pipeline is generic, it may offer improved accuracy on similar datasets through automated feature engineering, without requiring the user to do anything.

# Conclusion

Feature engineering is a core part of AutoML, and while what I have here doesn't necessarily replace human feature engineering, it is nice to have this baseline for future projects to compare results to. It already provides effort-free feature engineering that I can try in future projects.

There is definitely room for expansion and I will come back to this in a future notebook. Eventually I hope to have a large enough body of code neatly tucked into classes to create a cohesive pip package that people can use and contribute to. Thanks for reading and I'd love to hear your suggestions and feedback.

Part #2 is done here: https://www.kaggle.com/taranmarley/automl-from-scratch-2

# Create Submission

Just for fun I will now create a submission from what I have so far.

In [None]:
df, test_df = data_pipeline(pd.read_csv("../input/titanic/train.csv"), pd.read_csv("../input/titanic/test.csv"), "Survived")
lr = LogisticRegression(random_state=1, max_iter=200)
y = df["Survived"].values
X = df.drop(columns="Survived").values
lr.fit(X, y)
submission_df = pd.read_csv("../input/titanic/gender_submission.csv")
test_X = test_df.values
submission_df["Survived"] = lr.predict(test_X)
submission_df.to_csv("submission.csv", index=False)

At the time of writing, this submission is in the top 55%, which in my opinion really isn't too bad for untuned logistic regression with entirely generic feature engineering. Obviously future work will be much better than that.