# SECOM Manufacturing Quality Analysis  
## Data Understanding Phase

This notebook performs preliminary data analysis on the SECOM manufacturing dataset.
The goal of this phase is to understand the structure, quality, and characteristics of the data
before performing any cleaning, feature engineering, or modeling.

No transformations that affect the dataset are finalized in this phase.
All observations made here will inform decisions in later stages of the project.


### Environment Check and Library Imports

In [2]:
import os
import sys

print("Current working directory:")
print(os.getcwd())

print("\nPython search path:")
for p in sys.path:
    print(p)


Current working directory:
g:\GitHub\SECOM-Process-Sensor-Analysis\notebooks

Python search path:
c:\Python314\python314.zip
c:\Python314\DLLs
c:\Python314\Lib
c:\Python314

C:\Users\Ted\AppData\Roaming\Python\Python314\site-packages
c:\Python314\Lib\site-packages


In [3]:
import importlib.util
import sys
from pathlib import Path

# CHANGE THIS PATH to where dpf.py actually lives
dpf_path = Path("G:\\GitHub\\SECOM-Process-Sensor-Analysis\\dpf.py")

spec = importlib.util.spec_from_file_location("dpf", dpf_path)
dpf = importlib.util.module_from_spec(spec)
sys.modules["dpf"] = dpf
spec.loader.exec_module(dpf)

# Now test
dpf.Check


<function dpf.Check(df)>

In [8]:
# Core data libraries
import pandas as pd
import numpy as np

# Visualization libraries (used later, not heavily in this phase)
import matplotlib.pyplot as plt
import seaborn as sns

# Utility inspection function
import dpf


### Dataset Loading

The SECOM dataset consists of:
- A sensor measurement file containing hundreds of process variables
- A labels file indicating pass/fail outcomes and timestamps

The data is loaded without predefined headers, as provided in the original dataset.


In [9]:
# Load SECOM sensor data
secom_data = pd.read_csv(
    "../data/secom.data",
    delimiter=" ",
    header=None,
    na_values="NaN"
)

# Load SECOM labels
secom_labels = pd.read_csv(
    "../data/secom_labels.data",
    delimiter=" ",
    header=None,
    na_values="NaN"
)

print("Sensor data shape:", secom_data.shape)
print("Labels shape:", secom_labels.shape)


Sensor data shape: (1567, 590)
Labels shape: (1567, 2)


### Column Naming

Each sensor feature is assigned a sequential name (Feature_1, Feature_2, …).
The label dataset contains a pass/fail indicator and a timestamp.


In [10]:
# Assign feature names
secom_data.columns = [
    f"Feature_{i}" for i in range(1, secom_data.shape[1] + 1)
]

# Assign label names
secom_labels.columns = ["Pass/Fail", "Timestamp"]

# Convert timestamp to datetime
secom_labels["Timestamp"] = pd.to_datetime(
    secom_labels["Timestamp"],
    dayfirst=True,
    errors="coerce"
)
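
Because `errors="coerce"` silently turns unparseable values into `NaT`, it can help to confirm that nothing was lost during the conversion. The following is a minimal sketch, not part of the original notebook; `parse_timestamps` is a hypothetical helper, and the sample strings are illustrative values written in day/month/year order.

```python
import pandas as pd


def parse_timestamps(raw: pd.Series) -> pd.Series:
    """Parse day-first timestamps and fail loudly if any value could not be parsed."""
    parsed = pd.to_datetime(raw, dayfirst=True, errors="coerce")
    # Values that were present in the raw data but became NaT after parsing
    failed = parsed.isna() & raw.notna()
    if failed.any():
        raise ValueError(f"{int(failed.sum())} timestamp(s) could not be parsed")
    return parsed


# Hypothetical sample values (day/month/year), used only for illustration
sample = pd.Series(["19/07/2008 11:55:00", "19/07/2008 12:32:00"])
parsed_sample = parse_timestamps(sample)
print(parsed_sample.dt.month.tolist())  # [7, 7]
```

In the notebook, the same check could be applied to the raw `Timestamp` column before it is overwritten.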


### Dataset Integration

The sensor measurements and labels are combined into a single dataset.
The timestamp is placed at the beginning and the target variable at the end
for clarity and consistency.


In [11]:
# Concatenate datasets
secom_dataset = pd.concat([secom_data, secom_labels], axis=1)

# Reorder columns
secom_dataset = secom_dataset[
    ["Timestamp"] + list(secom_data.columns) + ["Pass/Fail"]
]

secom_dataset.head()


| | Timestamp | Feature_1 | Feature_2 | Feature_3 | Feature_4 | Feature_5 | Feature_6 | Feature_7 | Feature_8 | Feature_9 | ... | Feature_582 | Feature_583 | Feature_584 | Feature_585 | Feature_586 | Feature_587 | Feature_588 | Feature_589 | Feature_590 | Pass/Fail |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-07-19 11:55:00 | 3030.93 | 2564.0 | 2187.7333 | 1411.1265 | 1.3602 | 100.0 | 97.6133 | 0.1242 | 1.5005 | ... | NaN | 0.5005 | 0.0118 | 0.0035 | 2.363 | NaN | NaN | NaN | NaN | -1 |
| 1 | 2008-07-19 12:32:00 | 3095.78 | 2465.14 | 2230.4222 | 1463.6606 | 0.8294 | 100.0 | 102.3433 | 0.1247 | 1.4966 | ... | 208.2045 | 0.5019 | 0.0223 | 0.0055 | 4.4447 | 0.0096 | 0.0201 | 0.006 | 208.2045 | -1 |
| 2 | 2008-07-19 13:17:00 | 2932.61 | 2559.94 | 2186.4111 | 1698.0172 | 1.5102 | 100.0 | 95.4878 | 0.1241 | 1.4436 | ... | 82.8602 | 0.4958 | 0.0157 | 0.0039 | 3.1745 | 0.0584 | 0.0484 | 0.0148 | 82.8602 | 1 |
| 3 | 2008-07-19 14:43:00 | 2988.72 | 2479.9 | 2199.0333 | 909.7926 | 1.3204 | 100.0 | 104.2367 | 0.1217 | 1.4882 | ... | 73.8432 | 0.499 | 0.0103 | 0.0025 | 2.0544 | 0.0202 | 0.0149 | 0.0044 | 73.8432 | -1 |
| 4 | 2008-07-19 15:22:00 | 3032.24 | 2502.87 | 2233.3667 | 1326.52 | 1.5334 | 100.0 | 100.3967 | 0.1235 | 1.5031 | ... | NaN | 0.48 | 0.4766 | 0.1045 | 99.3032 | 0.0202 | 0.0149 | 0.0044 | 73.8432 | -1 |
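
Column-wise `pd.concat` aligns rows by index, so it only behaves like a side-by-side join when both frames share the same index. The sketch below shows one way to guard that assumption before combining; `combine_checked` is a hypothetical helper and the two small frames merely stand in for `secom_data` and `secom_labels`.

```python
import pandas as pd


def combine_checked(features: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
    """Concatenate column-wise only when both frames describe the same rows."""
    if len(features) != len(labels):
        raise ValueError(f"Row count mismatch: {len(features)} vs {len(labels)}")
    if not features.index.equals(labels.index):
        # Misaligned indexes would make concat pad unmatched rows with NaN
        raise ValueError("Indexes differ; rows would not line up")
    return pd.concat([features, labels], axis=1)


# Tiny hypothetical frames, used only for illustration
features = pd.DataFrame({"Feature_1": [1.0, 2.0], "Feature_2": [3.0, 4.0]})
labels = pd.DataFrame({"Pass/Fail": [-1, 1]})
combined = combine_checked(features, labels)
print(combined.shape)  # (2, 3)
```

Both SECOM files are read with default integer indexes and have 1567 rows, so a check like this would be expected to pass here; it mainly protects against later filtering or re-indexing of one frame.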


### Initial Data Quality Assessment

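A preliminary check with `dpf.Check` reports each column's data type, missing-value count and percentage, and number of unique values.
It confirms that missing values are present, which is expected in real-world sensor data and will be addressed in later phases.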


In [15]:
summary = dpf.Check(secom_dataset)


Initating Data Checking Process...
Shape of the DataFrame:
Shape: 1567 rows, 592 columns

                      Dtype  Missing  Missing %  Unique
Timestamp    datetime64[ns]        0       0.00    1534
Feature_1           float64        6       0.38    1520
Feature_2           float64        7       0.45    1504
Feature_3           float64       14       0.89     507
Feature_4           float64       14       0.89     518
...                     ...      ...        ...     ...
Feature_587         float64        1       0.06     322
Feature_588         float64        1       0.06     260
Feature_589         float64        1       0.06     120
Feature_590         float64        1       0.06     611
Pass/Fail             int64        0       0.00       2

[592 rows x 4 columns]

First 5 rows:
            Timestamp  Feature_1  Feature_2  Feature_3  Feature_4  Feature_5  \
0 2008-07-19 11:55:00    3030.93    2564.00  2187.7333  1411.1265     1.3602   
1 2008-07-19 12:32:00    3095.78    246

In [21]:
# Check for missing values
if secom_dataset.isnull().any().any():
    print("Missing values detected in the dataset.")



Missing values detected in the dataset.
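
A yes/no check does not show where the gaps are concentrated. As a possible follow-up, the sketch below ranks columns by how much data they are missing; `missing_summary` is a hypothetical helper (not the `dpf.Check` implementation), and the demo frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return columns that contain missing values, most incomplete first."""
    missing = df.isna().sum()
    summary = pd.DataFrame({
        "Missing": missing,
        "Missing %": (missing / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value
    return summary[summary["Missing"] > 0].sort_values("Missing", ascending=False)


# Small hypothetical frame, used only for illustration
demo = pd.DataFrame({
    "Feature_1": [1.0, np.nan, 3.0, 4.0],
    "Feature_2": [np.nan, np.nan, np.nan, 4.0],
    "Feature_3": [1.0, 2.0, 3.0, 4.0],
})
demo_summary = missing_summary(demo)
print(demo_summary)
```

Calling `missing_summary(secom_dataset)` would give a ranked view that could inform which features to drop or impute in later phases.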


In [None]:
# Save the combined, uncleaned dataset to secom_data.csv in the current working directory
secom_dataset.to_csv("secom_data.csv", index=False)

In [None]:
# Statistical summary
secom_dataset.describe().transpose()


| | count | mean | min | 25% | 50% | 75% | max | std |
|---|---|---|---|---|---|---|---|---|
| Timestamp | 1567 | 2008-09-09 18:37:39.859604224 | 2008-07-19 11:55:00 | 2008-08-22 00:55:30 | 2008-09-11 08:06:00 | 2008-09-29 11:33:00 | 2008-10-17 06:07:00 | NaN |
| Feature_1 | 1561.0 | 3014.452896 | 2743.24 | 2966.26 | 3011.49 | 3056.65 | 3356.35 | 73.621787 |
| Feature_2 | 1560.0 | 2495.850231 | 2158.75 | 2452.2475 | 2499.405 | 2538.8225 | 2846.44 | 80.407705 |
| Feature_3 | 1553.0 | 2200.547318 | 2060.66 | 2181.0444 | 2201.0667 | 2218.0555 | 2315.2667 | 29.513152 |
| Feature_4 | 1553.0 | 1396.376627 | 0.0 | 1081.8758 | 1285.2144 | 1591.2235 | 3715.0417 | 441.69164 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Feature_587 | 1566.0 | 0.021458 | -0.0169 | 0.013425 | 0.0205 | 0.0276 | 0.1028 | 0.012358 |
| Feature_588 | 1566.0 | 0.016475 | 0.0032 | 0.0106 | 0.0148 | 0.0203 | 0.0799 | 0.008808 |
| Feature_589 | 1566.0 | 0.005283 | 0.001 | 0.0033 | 0.0046 | 0.0064 | 0.0286 | 0.002867 |
| Feature_590 | 1566.0 | 99.670066 | 0.0 | 44.3686 | 71.9005 | 114.7497 | 737.3048 | 93.891919 |
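
With several hundred features, the summary is hard to scan by eye. One pattern worth flagging is a feature that never changes, since it carries no information for modeling. The sketch below lists such columns; `constant_columns` is a hypothetical helper and the demo frame is invented for illustration, so it makes no claim about which SECOM features are actually constant.

```python
import numpy as np
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list[str]:
    """List numeric columns with at most one distinct observed value."""
    numeric = df.select_dtypes(include="number")
    # nunique() ignores NaN by default, so a column holding one repeated value
    # plus gaps (or only gaps) is also reported
    return [col for col in numeric.columns if numeric[col].nunique() <= 1]


# Small hypothetical frame, used only for illustration
demo = pd.DataFrame({
    "Feature_A": [100.0, 100.0, 100.0],
    "Feature_B": [1.0, 2.0, np.nan],
    "Feature_C": [np.nan, 5.0, 5.0],
})
constant = constant_columns(demo)
print(constant)  # ['Feature_A', 'Feature_C']
```

Any columns reported for `secom_dataset` would only be candidates for removal; consistent with this phase, no column is dropped here.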


### Preliminary Timestamp Analysis

In [19]:
data = secom_dataset.copy()

data['year'] = data['Timestamp'].dt.year
data['month'] = data['Timestamp'].dt.month
data['day'] = data['Timestamp'].dt.day
data['weekday'] = data['Timestamp'].dt.weekday  # Monday=0, Sunday=6
data['hour'] = data['Timestamp'].dt.hour
data['minute'] = data['Timestamp'].dt.minute

print("Years:", data['year'].unique())
print("Months:", data['month'].unique())
print("Weekdays:", data['weekday'].unique())


Years: [2008]
Months: [ 7  8  9 10]
Weekdays: [5 6 0 1 2 4 3]
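
The inspection output earlier reports 1534 unique timestamps across 1567 rows, so some timestamps repeat. A possible next step is to count those repeats and see how observations and failures are spread over the months. The sketch below is illustrative only: `timestamp_profile` is a hypothetical helper, the demo rows are invented, and it assumes that `1` marks a failure and `-1` a pass in `Pass/Fail`.

```python
import pandas as pd


def timestamp_profile(df, time_col="Timestamp", target_col="Pass/Fail", fail_value=1):
    """Return (rows repeating an earlier timestamp, per-month row counts and failure rate)."""
    repeated = int(df[time_col].duplicated().sum())
    # Assumption: fail_value marks a failed unit
    is_fail = df[target_col].eq(fail_value)
    by_month = (
        is_fail.groupby(df[time_col].dt.month)
        .agg(rows="size", fail_rate="mean")
        .rename_axis("month")
    )
    return repeated, by_month


# Small hypothetical frame, used only for illustration
demo = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2008-07-19 11:55:00",
        "2008-07-19 11:55:00",
        "2008-08-01 09:00:00",
        "2008-08-02 10:00:00",
    ]),
    "Pass/Fail": [-1, 1, -1, -1],
})
repeated, by_month = timestamp_profile(demo)
print(repeated)
print(by_month)
```

Grouping by `dt.weekday` or `dt.hour` instead of `dt.month` follows the same pattern.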
