# Tabular Playground Series - July 2022

By Michael Mortenson

The dataset for this challenge is simulated manufacturing control data. The goal is to use unsupervised learning (clustering) to identify different control states. We are not told the number of control states, the units, time dependencies, or any other information about the data.

## Domain Insights

Manufacturing a product requires raw materials to be processed through a series of steps (and often many machines). No step is perfectly exact, so the engineering design specifies tolerances for the manufacture of each piece of a product. The job of manufacturing control is to ensure that production stays efficient by detecting problems as machine parts begin to wear, devices lose their calibration, or other problems arise.

Since we are not told what each column in the data represents, we will need to analyze each one to decide how to treat it.

In [1]:
# Useful Packages
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

In [2]:
# Read in data
X = pd.read_csv('../input/tabular-playground-series-jul-2022/data.csv')
# X_id = X['id'].copy()
# X = X.drop(columns=['id']).copy()
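
Because the columns are unlabeled, a per-column summary is one way to start deciding how to treat each one (for example, continuous measurement versus discrete count). The sketch below is only an illustration: `summarize_columns` is a hypothetical helper, and the toy frame with its column names stands in for the real data.

```python
import numpy as np
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, distinct-value count, missing count, and range per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
        "n_missing": df.isna().sum(),
        "min": df.min(),
        "max": df.max(),
    })

# Toy stand-in for the competition data (hypothetical column names and values)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "id": np.arange(100),
    "f_cont": rng.normal(size=100),           # continuous-looking feature
    "f_count": rng.integers(0, 5, size=100),  # small set of integer values
})

summary = summarize_columns(demo)
print(summary)
```

A column with few distinct integer values may deserve different treatment from a continuous one, which is the kind of distinction this summary is meant to surface.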

I'm interested in anomaly detection because this is a manufacturing control problem.

In [3]:
from sklearn.ensemble import IsolationForest
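
As a quick illustration of how `IsolationForest` reports its results, the sketch below fits a forest on made-up 2-D data (the data and the `contamination=0.02` setting are assumptions chosen for the example, not values from this notebook). `predict` returns +1 for inliers and -1 for outliers, so anomalies are counted by comparing against -1 rather than by summing the array.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Made-up data: a dense cloud plus a few far-away points
rng = np.random.default_rng(0)
inliers = rng.normal(0, 1, size=(500, 2))
outliers = rng.uniform(8, 10, size=(10, 2))
data = np.vstack([inliers, outliers])

# contamination sets the expected fraction of outliers (assumed value here)
forest = IsolationForest(random_state=0, contamination=0.02).fit(data)

labels = forest.predict(data)            # +1 = inlier, -1 = outlier
n_outliers = int((labels == -1).sum())   # count the -1 entries explicitly
print(n_outliers, "of", len(data), "samples flagged")
```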

In [4]:
def get_anomalies(df, contamination='auto', verbose=True):
    """Fit an IsolationForest on df and return its predictions (+1 inlier, -1 outlier)."""

    # Remove the 'id' column so it is not used as a feature
    if 'id' in df.columns:
        df = df.drop(columns=['id'])
    else:
        print('No id column in df')

    # Create and fit an isolation forest model
    clf = IsolationForest(random_state=0, contamination=contamination).fit(df)

    # Get anomalies: predict() returns +1 for inliers and -1 for outliers
    anomalies_array = clf.predict(df)

    # Report some details
    if verbose:
        n_anomalies = (anomalies_array == -1).sum()
        print(f"{n_anomalies} of {len(df)} samples ({round(n_anomalies / len(df) * 100, 4)}%) deemed anomalous")

    return anomalies_array

In [5]:
# Determine whether each sample is a general outlier
anoms = get_anomalies(X)

996 of 98000 samples (1.0163%) deemed anomalous


In [6]:
# Get id for each general anomaly
anom_id1 = X.id[anoms < 1]

# Drop the points categorized as anomalies
X1 = X.drop(index=anom_id1.index)


In [7]:
# # Now, there are 4 general groups of labels in the dataset (based on the box & whisker plots).
# g1 = [f"f_{x:02d}" for x in range(7)]
# g1.append('id')
# g2 = [f"f_{x:02d}" for x in range(7, 14)]
# g2.append('id')
# g3 = [f"f_{x:02d}" for x in range(14, 22)]
# g3.append('id')
# g4 = [f"f_{x:02d}" for x in range(22, 29)]
# g4.append('id')

# # Let's work our way through each one and pull out the anomalies

In [8]:
# # The first group
# X_g1 = X1[g1].copy()

# # Determine whether each sample in the group is a general outlier
# anoms = get_anomalies(X_g1)

# # Get id for each g1 anomaly
# anom_id2 = X_g1.id[anoms < 1]

# # Drop the points categorized as anomalies
# X1 = X1.drop(labels=anom_id2.index)

In [9]:
# # The second group
# X_g2 = X1[g2].copy()

# # Determine whether each sample in the group is a general outlier
# anoms = get_anomalies(X_g2)

# # Get id for each g2 anomaly
# anom_id3 = X_g2.id[anoms < 1]

# # Drop the points categorized as anomalies
# X1 = X1.drop(labels=anom_id3.index)

In [10]:
# # The third group
# X_g3 = X1[g3].copy()

# # Determine whether each sample in the group is a general outlier
# anoms = get_anomalies(X_g3)

# # Get id for each g3 anomaly
# anom_id4 = X_g3.id[anoms < 1]

# # Drop the points categorized as anomalies
# X1 = X1.drop(labels=anom_id4.index)

In [11]:
# # The fourth group
# X_g4 = X1[g4].copy()

# # Determine whether each sample in the group is a general outlier
# anoms = get_anomalies(X_g4)

# # Get id for each g4 anomaly
# anom_id5 = X_g4.id[anoms < 1]

# # Drop the points categorized as anomalies
# X1 = X1.drop(labels=anom_id5.index)

In [12]:
# # # Determine whether each sample is a general outlier
# anoms = get_anomalies(X1)

# # Get id for each general anomaly
# anom_id1 = X1.id[anoms < 1]

# # Drop the points categorized as anomalies
# # X1 = X.copy()
# # X1 = X1.drop(labels=anom_id1.index)

In [13]:
# Collect the cluster labels
label_map_dict = {}
for idx in anom_id1:
    label_map_dict[idx] = 0
# for idx in anom_id2:
#     label_map_dict[idx] = 1
# for idx in anom_id2:
#     label_map_dict[idx] = 2
# for idx in anom_id3:
#     label_map_dict[idx] = 3
# for idx in anom_id4:
#     label_map_dict[idx] = 4
# for idx in anom_id5:
#     label_map_dict[idx] = 5
for idx in X1.id:
    label_map_dict[idx] = 1       #6

# Check that we got all of them
if len(label_map_dict) == len(X):
    print("All samples labeled")

All samples labeled


In [14]:
# Add the predicted clusters to the dataset
X['Predicted'] = X['id'].map(label_map_dict)
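
The dictionary above assigns label 0 to anomalies and label 1 to everything else. Assuming the rows are still in the same order as the array returned by the isolation forest, the same two-way labeling can be sketched without a loop using `np.where`. The frame and the `flags` array below are made-up stand-ins, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: four samples and a hypothetical IsolationForest.predict output
demo = pd.DataFrame({"id": [10, 11, 12, 13]})
flags = np.array([1, -1, 1, 1])   # +1 = inlier, -1 = outlier

# Label outliers 0 and inliers 1, matching the mapping used above
demo["Predicted"] = np.where(flags == -1, 0, 1)
print(demo)
```

This relies on positional alignment between the frame and the prediction array, whereas the id-keyed dictionary keeps working even if rows are reordered.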

In [15]:
# Submit results
# rename() returns a new DataFrame, so keep the result
y = X[['Predicted', 'id']].rename(columns={'id': 'Id'})
y.to_csv("submission.csv", index=False)
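
Before uploading, it can help to sanity-check the frame being written. The sketch below is a minimal example under stated assumptions: `check_submission` is a hypothetical helper, the column names `Id` and `Predicted` are assumed, and the toy frame contains made-up values.

```python
import pandas as pd

def check_submission(sub: pd.DataFrame, n_expected: int) -> None:
    """Raise AssertionError if the frame is not a complete, unique labeling."""
    assert len(sub) == n_expected, "row count mismatch"
    assert sub["Id"].is_unique, "duplicate ids"
    assert sub["Predicted"].notna().all(), "unlabeled samples"

# Toy frame standing in for the real submission (values are made up)
toy = pd.DataFrame({"Predicted": [1, 0, 1], "Id": [0, 1, 2]})
check_submission(toy, n_expected=3)

# How many samples received each label
label_counts = toy["Predicted"].value_counts().to_dict()
print(label_counts)
```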