# SENSOR SELECTION CASE STUDY

### IoT-Based Predictive Maintenance for Manufacturing Equipment

**BUSINESS SCENARIO**

**Company:** TechManufacture Inc.

**Problem:** Predicting equipment failures in a production line

**CURRENT SITUATION:**
- 12 IoT sensors installed on critical manufacturing equipment
- Sensors monitor temperature, vibration, pressure, and operational metrics
- Each sensor costs $500/month for maintenance and data transmission
- Total cost: $6,000/month or $72,000/year

**CHALLENGE:**
- Management wants to reduce operational costs
- Need to identify which sensors are truly necessary
- Cannot compromise on prediction accuracy
- This is a PRE-MODELING task - we need to decide BEFORE building ML models

**OBJECTIVE:**
Use EDA techniques to systematically identify which sensors can be removed while maintaining the ability to predict equipment failures.

**DATASET DETAILS:**
- 2,000 observations collected over 6 months
- 12 sensor measurements per observation
- Binary target: equipment_failure (0 = normal, 1 = failure)
- Sensor types: temperature, vibration, pressure, speed, current, voltage, power factor


### IMPORTS AND CONFIG

In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from statsmodels.stats.outliers_influence import variance_inflation_factor
import warnings
warnings.filterwarnings('ignore')

In [10]:
np.random.seed(42)
n_samples = 2000

### STEP 1: DATA GENERATION - Creating Realistic Sensor Data

In [11]:
# Generating sensor data with specific patterns...

# TEMPERATURE SENSORS (3 sensors)
# temp_core: Core temperature - highly predictive of failures
temp_core = np.random.normal(75, 8, n_samples)  # Normal operation around 75°C
temp_core[1800:] += np.random.uniform(15, 25, 200)  # Failures show high temp

# temp_ambient: Ambient temperature - less variable, less predictive
temp_ambient = np.random.normal(22, 2, n_samples)  # Room temperature

# temp_exhaust: Exhaust temperature - HIGHLY CORRELATED with core temp
temp_exhaust = temp_core * 0.85 + np.random.normal(5, 2, n_samples)


In [12]:
# VIBRATION SENSORS (3 sensors)
# vibration_x: X-axis vibration - predictive of mechanical failures
vibration_x = np.random.gamma(2, 2, n_samples)
vibration_x[1800:] += np.random.uniform(8, 15, 200)  # High vibration before failure

# vibration_y: Y-axis vibration - somewhat correlated with X
vibration_y = vibration_x * 0.6 + np.random.gamma(1.5, 1.5, n_samples)

# vibration_z: Z-axis - NEAR CONSTANT (sensor malfunction/not useful)
vibration_z = np.random.normal(0.5, 0.001, n_samples)  # Almost no variance


In [13]:
# PRESSURE SENSORS (2 sensors)
# pressure_inlet: Inlet pressure - normal operation
pressure_inlet = np.random.normal(100, 5, n_samples)

# pressure_outlet: Outlet pressure - HIGHLY CORRELATED with inlet
pressure_outlet = pressure_inlet * 0.95 + np.random.normal(2, 1, n_samples)


In [14]:
# OPERATIONAL SENSORS (4 sensors)
# motor_speed: Motor RPM - predictive of failures
motor_speed = np.random.normal(1800, 50, n_samples)
motor_speed[1800:] += np.random.uniform(-200, -100, 200)  # Speed drops before failure

# motor_current: Electrical current - predictive
motor_current = np.random.normal(15, 2, n_samples)
motor_current[1800:] += np.random.uniform(5, 10, 200)  # Current spikes before failure

# voltage_supply: Supply voltage - VERY STABLE (grid supply)
voltage_supply = np.random.normal(220, 0.5, n_samples)  # Almost constant

# power_factor: Power efficiency - less predictive
power_factor = np.random.uniform(0.85, 0.95, n_samples)

In [15]:
# Create DataFrame
sensor_data = pd.DataFrame({
    'temp_core': temp_core,
    'temp_ambient': temp_ambient,
    'temp_exhaust': temp_exhaust,
    'vibration_x': vibration_x,
    'vibration_y': vibration_y,
    'vibration_z': vibration_z,
    'pressure_inlet': pressure_inlet,
    'pressure_outlet': pressure_outlet,
    'motor_speed': motor_speed,
    'motor_current': motor_current,
    'voltage_supply': voltage_supply,
    'power_factor': power_factor
})

In [16]:
sensor_data.head()

Unnamed: 0,temp_core,temp_ambient,temp_exhaust,vibration_x,vibration_y,vibration_z,pressure_inlet,pressure_outlet,motor_speed,motor_current,voltage_supply,power_factor
0,78.973713,22.996443,71.590365,1.034029,0.773871,0.499038,105.869505,101.911063,1783.1925,14.832178,219.832933,0.9105
1,73.893886,24.280298,64.137083,0.735611,1.636718,0.500435,94.909812,90.621499,1794.028842,14.939575,220.076285,0.910947
2,80.181508,25.161081,72.288734,7.210694,6.105532,0.499314,101.590797,98.328942,1801.761265,14.932829,220.103354,0.928143
3,87.184239,19.969812,80.69424,2.951314,3.182083,0.498511,102.068239,99.071263,1845.818941,17.595548,219.857668,0.932297
4,73.126773,20.378285,66.563005,1.532191,6.906448,0.499071,95.25689,93.762405,1775.076796,12.783893,220.338451,0.852723


In [17]:
# Create target variable (equipment failure)
# Failures are influenced by specific sensors
failure_score = (
    (temp_core - 75) * 0.4 +           # High temperature
    vibration_x * 2 +                   # High vibration
    (1800 - motor_speed) * 0.1 +       # Low speed
    (motor_current - 15) * 3            # High current
)


In [18]:
failure_score

array([ 4.83482776,  1.44461613, 16.11635197, ..., 88.67513998,
       55.23883322, 64.11161593], shape=(2000,))

In [19]:
# Add noise and create binary outcome
failure_score += np.random.normal(0, 10, n_samples)
equipment_failure = (failure_score > np.percentile(failure_score, 85)).astype(int)

sensor_data['equipment_failure'] = equipment_failure

In [23]:
# Dataset Preview
sensor_data.head()

Unnamed: 0,temp_core,temp_ambient,temp_exhaust,vibration_x,vibration_y,vibration_z,pressure_inlet,pressure_outlet,motor_speed,motor_current,voltage_supply,power_factor,equipment_failure
0,78.973713,22.996443,71.590365,1.034029,0.773871,0.499038,105.869505,101.911063,1783.1925,14.832178,219.832933,0.9105,0
1,73.893886,24.280298,64.137083,0.735611,1.636718,0.500435,94.909812,90.621499,1794.028842,14.939575,220.076285,0.910947,0
2,80.181508,25.161081,72.288734,7.210694,6.105532,0.499314,101.590797,98.328942,1801.761265,14.932829,220.103354,0.928143,0
3,87.184239,19.969812,80.69424,2.951314,3.182083,0.498511,102.068239,99.071263,1845.818941,17.595548,219.857668,0.932297,0
4,73.126773,20.378285,66.563005,1.532191,6.906448,0.499071,95.25689,93.762405,1775.076796,12.783893,220.338451,0.852723,0


In [21]:
# Total Samples
n_samples

2000

In [22]:
# Failure rate %
equipment_failure.mean() * 100

np.float64(15.0)
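
The failure rate comes out at exactly 15% by construction: labelling every score above the 85th percentile as a failure flags the top 15% of observations, whatever the scores look like. A minimal sketch with stand-in scores (the seed and the `scores` array are illustrative, not the notebook's `failure_score`):

```python
import numpy as np

rng = np.random.default_rng(0)        # illustrative seed, not the notebook's
scores = rng.normal(0, 10, 2000)      # stand-in for failure_score

# Everything strictly above the 85th percentile is labelled a failure.
threshold = np.percentile(scores, 85)
labels = (scores > threshold).astype(int)

print(f"Failures: {labels.sum()} of {len(labels)} ({labels.mean() * 100:.1f}%)")
```

A consequence worth keeping in mind is that the class balance here is a design choice of the simulation rather than something learned from the sensors.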

In [24]:
# Statistical Summary
sensor_data.describe().round(2)

Unnamed: 0,temp_core,temp_ambient,temp_exhaust,vibration_x,vibration_y,vibration_z,pressure_inlet,pressure_outlet,motor_speed,motor_current,voltage_supply,power_factor,equipment_failure
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,77.34,22.0,70.65,5.19,5.37,0.5,100.21,97.21,1784.34,15.71,220.0,0.9,0.15
std,9.86,2.01,8.61,4.5,3.28,0.0,5.01,4.87,68.86,3.02,0.48,0.03,0.36
min,49.07,16.02,44.91,0.04,0.16,0.5,85.33,80.54,1510.37,7.69,218.25,0.85,0.0
25%,70.58,20.6,64.99,2.15,2.97,0.5,96.76,93.85,1751.1,13.77,219.68,0.88,0.0
50%,76.47,22.0,69.89,3.71,4.55,0.5,100.2,97.21,1792.08,15.22,220.0,0.9,0.0
75%,82.77,23.33,75.45,6.63,6.84,0.5,103.65,100.44,1829.9,16.93,220.31,0.92,0.0
max,123.17,29.85,109.01,25.26,25.22,0.5,122.4,119.61,1980.14,29.38,221.61,0.95,1.0


In [25]:
# Shuffle the rows
sensor_data = sensor_data.sample(frac=1, random_state=42).reset_index(drop=True)
# Save to CSV for reference
sensor_data.to_csv('sensor_data.csv', index=False)
print("\n✓ Data saved to 'sensor_data.csv'")


✓ Data saved to 'sensor_data.csv'


### STEP 2: EXPLORATORY DATA ANALYSIS - Understanding the Data

In [26]:
# Separate features and target
X = sensor_data.drop('equipment_failure', axis=1)
y = sensor_data['equipment_failure']

In [27]:
# Feature matrix shape
X.shape

(2000, 12)

In [28]:
# Target Distribution
y.value_counts(normalize=True)

equipment_failure
0    0.85
1    0.15
Name: proportion, dtype: float64

#### TECHNIQUE 1: VARIANCE ANALYSIS

**CONCEPT:** 

Variance measures how widely a sensor's readings spread around their mean.

Low variance means the readings are almost constant. Such a sensor is unlikely to help prediction, because it barely changes between normal operation and failure.
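
As a rough illustration of the concept, the sketch below computes the sample variance by hand and compares it with `Series.var()`, which also divides by n - 1. The two stand-in sensors and the seed are assumptions made for this example; they only mimic the kind of distributions used in Step 1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # illustrative seed
informative = pd.Series(rng.normal(75, 8, 500))        # varies a lot
near_constant = pd.Series(rng.normal(0.5, 0.001, 500)) # barely moves

def sample_variance(x):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x = np.asarray(x, dtype=float)
    return ((x - x.mean()) ** 2).sum() / (len(x) - 1)

for name, s in [("informative", informative), ("near_constant", near_constant)]:
    print(f"{name:15s} manual={sample_variance(s):.6f}  pandas={s.var():.6f}")
```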

In [29]:
# Calculate variance for each sensor
variances = X.var().sort_values(ascending=True)

print("Variance Analysis Results:")
for sensor, var_value in variances.items():
    print(f"  {sensor:20s} : {var_value:12.4f}")


Variance Analysis Results:
  vibration_z          :       0.0000
  power_factor         :       0.0008
  voltage_supply       :       0.2323
  temp_ambient         :       4.0448
  motor_current        :       9.0935
  vibration_y          :      10.7906
  vibration_x          :      20.2204
  pressure_outlet      :      23.7452
  pressure_inlet       :      25.1258
  temp_exhaust         :      74.0920
  temp_core            :      97.2579
  motor_speed          :    4742.0285


In [30]:
# INTERPRETATION

variance_threshold = 1.0
low_var_sensors = variances[variances < variance_threshold].index.tolist()
high_var_sensors = variances[variances >= variance_threshold].index.tolist()
print(f"Variance Threshold: {variance_threshold}")

Variance Threshold: 1.0


In [32]:
print(f"LOW VARIANCE SENSORS (Should Remove): {len(low_var_sensors)}")
for sensor in low_var_sensors:
    print(f"   • {sensor:20s} - Variance: {variances[sensor]:.4f}")


LOW VARIANCE SENSORS (Should Remove): 3
   • vibration_z          - Variance: 0.0000
   • power_factor         - Variance: 0.0008
   • voltage_supply       - Variance: 0.2323


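* Sensor: 'vibration_z'
    - Variance = 0.0000
    - Almost no variation: likely a sensor malfunction, or the sensor is not capturing a useful signal
* Sensor: 'power_factor'
    - Variance = 0.0008
    - Readings stay between 0.85 and 0.95, so the raw variance is small by construction
* Sensor: 'voltage_supply'
    - Variance = 0.2323
    - Very stable grid voltage: minimal information gain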


In [33]:
# Create variance comparison dataframe
variance_df = pd.DataFrame({
    'Sensor': variances.index,
    'Variance': variances.values,
    'Status': ['REMOVE' if v < variance_threshold else 'KEEP' for v in variances.values]
})
# Variance Summary Table
print(variance_df.to_string(index=False))

         Sensor    Variance Status
    vibration_z    0.000001 REMOVE
   power_factor    0.000814 REMOVE
 voltage_supply    0.232279 REMOVE
   temp_ambient    4.044805   KEEP
  motor_current    9.093513   KEEP
    vibration_y   10.790573   KEEP
    vibration_x   20.220396   KEEP
pressure_outlet   23.745160   KEEP
 pressure_inlet   25.125772   KEEP
   temp_exhaust   74.092005   KEEP
      temp_core   97.257898   KEEP
    motor_speed 4742.028474   KEEP


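- Sensors with variance < 1.0 show minimal variation, in their raw units, across all measurements.
- These sensors are unlikely to help differentiate between normal operation and equipment failure.
- Removing them is therefore unlikely to hurt prediction capability.
- Caveat: raw variance depends on measurement units, so the 1.0 cutoff is a heuristic. power_factor only spans 0.85 to 0.95 and is flagged largely because of its scale, so it deserves a second look before removal.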
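
Because raw variance depends on units, one possible cross-check (an addition here, not part of the original analysis) is the coefficient of variation, std / |mean|. The sketch below rebuilds it from the means in the statistical summary and the variances in the table above. Treat it as a rough guide only: for quantities without a natural zero, such as temperatures in °C, the ratio is less meaningful.

```python
import pandas as pd

# Means from the describe() output and variances from the variance table above.
stats = pd.DataFrame(
    {
        "mean": [0.5, 0.9, 220.0, 22.0, 15.71, 5.37,
                 5.19, 97.21, 100.21, 70.65, 77.34, 1784.34],
        "variance": [0.000001, 0.000814, 0.232279, 4.044805, 9.093513, 10.790573,
                     20.220396, 23.745160, 25.125772, 74.092005, 97.257898, 4742.028474],
    },
    index=["vibration_z", "power_factor", "voltage_supply", "temp_ambient",
           "motor_current", "vibration_y", "vibration_x", "pressure_outlet",
           "pressure_inlet", "temp_exhaust", "temp_core", "motor_speed"],
)

# Coefficient of variation: spread relative to the typical reading.
stats["cv"] = stats["variance"] ** 0.5 / stats["mean"].abs()
print(stats["cv"].sort_values().round(4))
```

On these figures vibration_z and voltage_supply remain the two smallest by a wide margin, while the relative spread of power_factor is close to that of motor_speed, which suggests its low raw variance is largely a matter of scale.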

In [34]:
# Calculate potential savings
removed_count = len(low_var_sensors)
savings_monthly = removed_count * 500
savings_annual = savings_monthly * 12

print(f"COST SAVINGS FROM VARIANCE ANALYSIS:")
print(f"Sensors to remove: {removed_count}")
print(f"Monthly savings: ${savings_monthly:,}")
print(f"Annual savings: ${savings_annual:,}")

COST SAVINGS FROM VARIANCE ANALYSIS:
Sensors to remove: 3
Monthly savings: $1,500
Annual savings: $18,000


#### TECHNIQUE 2: CORRELATION ANALYSIS

**CONCEPT:** 

- Correlation measures how closely two sensors' readings move together (a linear relationship).
- High correlation (|r| > 0.9) means the two sensors capture largely redundant information:
    - keeping both provides minimal additional value
    - one can be removed while retaining most of the information
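
To make the idea concrete, here is a small sketch that computes Pearson's r by hand for a redundant pair and an unrelated pair. The arrays `a`, `b`, `c` and the seed are stand-ins invented for this example, loosely modelled on the pressure and ambient-temperature sensors from Step 1.

```python
import numpy as np

rng = np.random.default_rng(1)             # illustrative seed
a = rng.normal(100, 5, 1000)               # stand-in "primary" sensor
b = 0.95 * a + rng.normal(2, 1, 1000)      # mostly a rescaled copy of a
c = rng.normal(22, 2, 1000)                # unrelated sensor

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their spreads."""
    xd, yd = x - x.mean(), y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())

print(f"r(a, b) = {pearson_r(a, b):.3f}")  # redundant pair, close to 1
print(f"r(a, c) = {pearson_r(a, c):.3f}")  # unrelated pair, close to 0
```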

In [35]:
# Calculate correlation matrix
correlation_matrix = X.corr()

# Full Correlation Matrix
correlation_matrix.round(3)

Unnamed: 0,temp_core,temp_ambient,temp_exhaust,vibration_x,vibration_y,vibration_z,pressure_inlet,pressure_outlet,motor_speed,motor_current,voltage_supply,power_factor
temp_core,1.0,-0.043,0.973,0.46,0.389,0.0,0.037,0.038,-0.403,0.448,0.0,0.012
temp_ambient,-0.043,1.0,-0.033,-0.028,-0.041,0.032,0.02,0.013,0.007,-0.0,-0.016,0.032
temp_exhaust,0.973,-0.033,1.0,0.452,0.385,-0.003,0.034,0.035,-0.397,0.441,0.006,0.003
vibration_x,0.46,-0.028,0.452,1.0,0.831,0.001,0.002,0.003,-0.521,0.559,-0.005,0.003
vibration_y,0.389,-0.041,0.385,0.831,1.0,-0.017,-0.024,-0.023,-0.441,0.473,-0.02,0.004
vibration_z,0.0,0.032,-0.003,0.001,-0.017,1.0,0.004,0.004,0.036,0.002,-0.023,-0.008
pressure_inlet,0.037,0.02,0.034,0.002,-0.024,0.004,1.0,0.979,-0.019,0.009,0.023,0.014
pressure_outlet,0.038,0.013,0.035,0.003,-0.023,0.004,0.979,1.0,-0.02,0.004,0.015,0.025
motor_speed,-0.403,0.007,-0.397,-0.521,-0.441,0.036,-0.019,-0.02,1.0,-0.501,0.009,-0.042
motor_current,0.448,-0.0,0.441,0.559,0.473,0.002,0.009,0.004,-0.501,1.0,-0.008,0.038


In [37]:
# Highly Correlated Sensor Pairs
correlation_threshold = 0.9
high_corr_pairs = []

for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        corr_value = correlation_matrix.iloc[i, j]
        if abs(corr_value) > correlation_threshold:
            high_corr_pairs.append({
                'Sensor 1': correlation_matrix.columns[i],
                'Sensor 2': correlation_matrix.columns[j],
                'Correlation': corr_value
            })

print(f"{len(high_corr_pairs)} highly correlated pairs (|correlation| > {correlation_threshold})")
high_corr_pairs

2 highly correlated pairs (|correlation| > 0.9)


[{'Sensor 1': 'temp_core',
  'Sensor 2': 'temp_exhaust',
  'Correlation': np.float64(0.9725999011314618)},
 {'Sensor 1': 'pressure_inlet',
  'Sensor 2': 'pressure_outlet',
  'Correlation': np.float64(0.9788324868012145)}]
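
Both flagged pairs follow from how Step 1 generated the data: `temp_exhaust` is `0.85 * temp_core` plus noise with standard deviation 2, and `pressure_outlet` is `0.95 * pressure_inlet` plus noise with standard deviation 1. For y = a·x + independent noise, Pearson's r equals a·σx / sqrt(a²·σx² + σn²). A small check using the variances reported in Technique 1 (the helper name `expected_corr` is introduced here for illustration):

```python
import math

def expected_corr(slope, var_x, noise_std):
    """Pearson r between x and y = slope * x + independent noise."""
    signal = slope * math.sqrt(var_x)
    return signal / math.sqrt(slope ** 2 * var_x + noise_std ** 2)

# Variances of temp_core and pressure_inlet from the variance analysis.
r_temp = expected_corr(0.85, 97.2579, 2.0)
r_pressure = expected_corr(0.95, 25.1258, 1.0)

print(f"temp_core / temp_exhaust:         {r_temp:.3f}")
print(f"pressure_inlet / pressure_outlet: {r_pressure:.3f}")
```

The predicted values agree with the measured correlations to about three decimals, so in this dataset the redundancy is a property of the simulation rather than a coincidence of the sample.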

In [38]:
# DETAILED ANALYSIS OF CORRELATED PAIRS

In [39]:
# First highly correlated pair out of two 
pair = high_corr_pairs[0]
sensor_1, sensor_2 = pair['Sensor 1'], pair['Sensor 2']
corr = pair['Correlation']
print(f"Analyzing Pair: {sensor_1} & {sensor_2} (Correlation: {corr:.3f})")

Analyzing Pair: temp_core & temp_exhaust (Correlation: 0.973)


In [40]:
# Sensor 1 Mean and Variance
mean_1 = X[sensor_1].mean()
var_1 = X[sensor_1].var()
print(f"{sensor_1} - Mean: {mean_1:.3f}, Variance: {var_1:.3f}")

# Sensor 2 Mean and Variance
mean_2 = X[sensor_2].mean()
var_2 = X[sensor_2].var()
print(f"{sensor_2} - Mean: {mean_2:.3f}, Variance: {var_2:.3f}")

temp_core - Mean: 77.336, Variance: 97.258
temp_exhaust - Mean: 70.655, Variance: 74.092


**Comment:**

- Exhaust temperature is directly driven by core temperature.
- They move together - one is redundant.
- RECOMMENDATION: Keep temp_core (more direct measurement)

In [41]:
# Second highly correlated pair out of two
pair = high_corr_pairs[1]
sensor_1, sensor_2 = pair['Sensor 1'], pair['Sensor 2']
corr = pair['Correlation']
print(f"Analyzing Pair: {sensor_1} & {sensor_2} (Correlation: {corr:.3f})")

Analyzing Pair: pressure_inlet & pressure_outlet (Correlation: 0.979)


In [42]:
# Sensor 1 Mean and Variance
mean_1 = X[sensor_1].mean()
var_1 = X[sensor_1].var()
print(f"{sensor_1} - Mean: {mean_1:.3f}, Variance: {var_1:.3f}")
# Sensor 2 Mean and Variance
mean_2 = X[sensor_2].mean()
var_2 = X[sensor_2].var()
print(f"{sensor_2} - Mean: {mean_2:.3f}, Variance: {var_2:.3f}")

pressure_inlet - Mean: 100.214, Variance: 25.126
pressure_outlet - Mean: 97.207, Variance: 23.745


**Comment:** 

- Inlet and outlet pressure are mechanically linked.
- Knowing one gives you information about the other.
- RECOMMENDATION: Keep pressure_inlet (primary source)
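
The recommendations above rest on domain reasoning. As an optional, data-driven cross-check, one could keep whichever sensor in a pair is more strongly associated with the target. The helper `pick_sensor_to_keep` and the `demo` frame below are hypothetical and use stand-in data, not the notebook's `sensor_data`; with pairs this strongly correlated the two scores can be close, so the result is a tiebreaker rather than a verdict.

```python
import numpy as np
import pandas as pd

def pick_sensor_to_keep(df, pair, target):
    """Return (keep, drop): keep the sensor with the larger |corr| to the target."""
    strength = df[list(pair)].corrwith(df[target]).abs()
    keep = strength.idxmax()
    drop = pair[0] if keep == pair[1] else pair[1]
    return keep, drop

rng = np.random.default_rng(2)  # illustrative seed
n = 1000
core = rng.normal(75, 8, n)
exhaust = 0.85 * core + rng.normal(5, 2, n)   # noisy rescaled copy of core
failure = (core + rng.normal(0, 4, n) > np.percentile(core, 85)).astype(int)
demo = pd.DataFrame({"core": core, "exhaust": exhaust, "failure": failure})

keep, drop = pick_sensor_to_keep(demo, ("core", "exhaust"), "failure")
print(f"keep={keep}, drop={drop}")
```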

In [43]:
# Correlation heatmap data summary
# Correlation Strength Distribution

In [49]:
# Get upper triangle correlations (excluding diagonal)
upper_triangle = correlation_matrix.where(
    np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool)
)
correlations_flat = upper_triangle.stack().abs()

corr_ranges = [
    (0.0, 0.3, "Very Weak"),
    (0.3, 0.5, "Weak"),
    (0.5, 0.7, "Moderate"),
    (0.7, 0.9, "Strong"),
    (0.9, 1.0, "Very Strong")
]

for low, high, label in corr_ranges:
    count = ((correlations_flat >= low) & (correlations_flat < high)).sum()
    pct = count / len(correlations_flat) * 100
    print(f"   {label:12s} ({low:.1f}-{high:.1f}): {count:3d} pairs ({pct:5.1f}%)")


   Very Weak    (0.0-0.3):  50 pairs ( 75.8%)
   Weak         (0.3-0.5):  10 pairs ( 15.2%)
   Moderate     (0.5-0.7):   3 pairs (  4.5%)
   Strong       (0.7-0.9):   1 pairs (  1.5%)
   Very Strong  (0.9-1.0):   2 pairs (  3.0%)


In [50]:
# Sensors to remove based on correlation
corr_remove = ['temp_exhaust', 'pressure_outlet']

In [51]:
print(f"COST SAVINGS FROM CORRELATION ANALYSIS:")
print(f"Additional sensors to remove: {len(corr_remove)}")
print(f"Monthly savings: ${len(corr_remove) * 500:,}")
print(f"Annual savings: ${len(corr_remove) * 500 * 12:,}")

print(f"RUNNING TOTAL:")
print(f"Total sensors to remove so far: {len(low_var_sensors) + len(corr_remove)}")
print(f"Cumulative monthly savings: ${(len(low_var_sensors) + len(corr_remove)) * 500:,}")
print(f"Cumulative annual savings: ${(len(low_var_sensors) + len(corr_remove)) * 500 * 12:,}")

COST SAVINGS FROM CORRELATION ANALYSIS:
Additional sensors to remove: 2
Monthly savings: $1,000
Annual savings: $12,000
RUNNING TOTAL:
Total sensors to remove so far: 5
Cumulative monthly savings: $2,500
Cumulative annual savings: $30,000
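
For context, the running total can be restated against the original $6,000/month budget. The sketch below is a standalone recomputation with the sensor lists typed in by hand; it uses a set union so that a sensor flagged by both techniques would be counted only once (no sensor is flagged twice here, so the result matches the figures above).

```python
TOTAL_SENSORS = 12
COST_PER_SENSOR = 500  # USD per sensor per month

low_variance = ["vibration_z", "power_factor", "voltage_supply"]
redundant = ["temp_exhaust", "pressure_outlet"]

# A set union avoids double-counting a sensor flagged by both techniques.
to_remove = set(low_variance) | set(redundant)
remaining = TOTAL_SENSORS - len(to_remove)

monthly_before = TOTAL_SENSORS * COST_PER_SENSOR
monthly_after = remaining * COST_PER_SENSOR
reduction_pct = (monthly_before - monthly_after) / monthly_before * 100

print(f"Sensors remaining: {remaining}")
print(f"Monthly cost: ${monthly_before:,} -> ${monthly_after:,} ({reduction_pct:.1f}% lower)")
```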


#### TECHNIQUE 3: VARIANCE INFLATION FACTOR (VIF)