# Sensor Component Failure Detection

## 1) Problem statement.

**Data:** Sensor Data

**Problem statement:**
- The system in focus is the Air Pressure System (APS), which generates pressurized air that is used in various truck functions, such as braking and gear changes. The dataset's positive class corresponds to failures of a specific component of the APS. The negative class corresponds to trucks with failures of components not related to the APS.

- The goal is to reduce the cost caused by unnecessary repairs, so false predictions must be minimized.

| Predicted class \ True class | Positive | Negative |
| --- | --- | --- |
| Positive | - | `Cost_1` |
| Negative | `Cost_2` | - |


`Cost_1` = 10 and `Cost_2` = 500

- The total cost of a prediction model is the sum of `Cost_1` multiplied by the number of instances with type 1 failure and `Cost_2` multiplied by the number of instances with type 2 failure, resulting in a `Total_cost`. Here `Cost_1` is the cost of an unnecessary check by a mechanic at a workshop (a false positive), while `Cost_2` is the cost of missing a faulty truck, which may cause a breakdown (a false negative).
- `Total_cost = Cost_1 * No_Instances_type_1 + Cost_2 * No_Instances_type_2`

- From the problem statement above, we need to reduce both false positives and false negatives. More importantly, we have to **reduce false negatives, since the cost of a false negative is 50 times higher than the cost of a false positive.**
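
The cost formula can be made concrete with a small helper. This is a minimal sketch, not the notebook's own evaluation code: the function name `total_cost` and the toy label lists are hypothetical, and it assumes the labels use the same `pos`/`neg` strings as the `class` column.

```python
COST_1 = 10   # cost of a false positive (unnecessary workshop check)
COST_2 = 500  # cost of a false negative (missed faulty truck)


def total_cost(y_true, y_pred, positive="pos"):
    """Return COST_1 * (false positives) + COST_2 * (false negatives)."""
    pairs = list(zip(y_true, y_pred))
    # Type 1 failure: predicted positive, actually negative
    false_positives = sum(1 for t, p in pairs if t != positive and p == positive)
    # Type 2 failure: predicted negative, actually positive
    false_negatives = sum(1 for t, p in pairs if t == positive and p != positive)
    return COST_1 * false_positives + COST_2 * false_negatives


# Hypothetical labels, for illustration only
y_true = ["neg", "neg", "pos", "pos", "neg"]
y_pred = ["pos", "neg", "neg", "pos", "neg"]
print(total_cost(y_true, y_pred))  # one false positive + one false negative = 510
```

With one error of each kind, the false negative accounts for 500 of the 510 cost units, which is why missed failures dominate the objective.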

## Challenges and other objectives

- Many null values need to be handled in almost all columns.
- No low-latency requirement.
- Interpretability is not important.
- Misclassification leads to unnecessary repair costs.
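
One way to gauge the null-value challenge is to compute the share of missing values per column. The sketch below uses a hypothetical four-row frame named `toy` in place of the real data, and the 70% cutoff `MISSING_THRESHOLD` is an assumed value chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the sensor data
toy = pd.DataFrame({
    "class": ["pos", "neg", "neg", "neg"],
    "aa_000": [10, 20, 30, 40],
    "ab_000": [np.nan, np.nan, np.nan, 0.0],
    "ac_000": [1.0, np.nan, 3.0, 4.0],
})

# Percentage of missing values per column, largest first
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)

# Columns whose missing share exceeds the assumed cutoff
MISSING_THRESHOLD = 70.0
high_missing = missing_pct[missing_pct > MISSING_THRESHOLD].index.tolist()
print(high_missing)
```

The same two expressions should work unchanged on the full DataFrame once it is loaded; whether to drop or impute the flagged columns is a separate modelling decision.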

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
sensor=pd.read_csv("aps_failure_training_set1.csv", na_values="na")

In [4]:
sensor

Unnamed: 0,class,aa_000,ab_000,ac_000,ad_000,ae_000,af_000,ag_000,ag_001,ag_002,...,ee_002,ee_003,ee_004,ee_005,ee_006,ee_007,ee_008,ee_009,ef_000,eg_000
0,pos,153204,0.0,1.820000e+02,,0.0,0.0,0.0,0.0,0.0,...,129862.0,26872.0,34044.0,22472.0,34362.0,0.0,0.0,0.0,0.0,0.0
1,pos,453236,,2.926000e+03,,0.0,0.0,0.0,0.0,222.0,...,7908038.0,3026002.0,5025350.0,2025766.0,1160638.0,533834.0,493800.0,6914.0,0.0,0.0
2,pos,72504,,1.594000e+03,1052.0,0.0,0.0,0.0,244.0,178226.0,...,1432098.0,372252.0,527514.0,358274.0,332818.0,284178.0,3742.0,0.0,0.0,0.0
3,pos,762958,,,,,,776.0,281128.0,2186308.0,...,,,,,,,,,,
4,pos,695994,,,,,,0.0,0.0,0.0,...,1397742.0,495544.0,361646.0,28610.0,5130.0,212.0,0.0,0.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36183,neg,153002,,6.640000e+02,186.0,0.0,0.0,0.0,0.0,0.0,...,998500.0,566884.0,1290398.0,1218244.0,1019768.0,717762.0,898642.0,28588.0,0.0,0.0
36184,neg,2286,,2.130707e+09,224.0,0.0,0.0,0.0,0.0,0.0,...,10578.0,6760.0,21126.0,68424.0,136.0,0.0,0.0,0.0,0.0,0.0
36185,neg,112,0.0,2.130706e+09,18.0,0.0,0.0,0.0,0.0,0.0,...,792.0,386.0,452.0,144.0,146.0,2622.0,0.0,0.0,0.0,0.0
36186,neg,80292,,2.130706e+09,494.0,0.0,0.0,0.0,0.0,0.0,...,699352.0,222654.0,347378.0,225724.0,194440.0,165070.0,802280.0,388422.0,0.0,0.0


In [5]:
sensor.shape

(36188, 171)

In [6]:
sensor.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36188 entries, 0 to 36187
Columns: 171 entries, class to eg_000
dtypes: float64(169), int64(1), object(1)
memory usage: 47.2+ MB


In [7]:
sensor.describe()

Unnamed: 0,aa_000,ab_000,ac_000,ad_000,ae_000,af_000,ag_000,ag_001,ag_002,ag_003,...,ee_002,ee_003,ee_004,ee_005,ee_006,ee_007,ee_008,ee_009,ef_000,eg_000
count,36188.0,8292.0,34047.0,26988.0,34601.0,34601.0,35809.0,35809.0,35809.0,35809.0,...,35809.0,35809.0,35809.0,35809.0,35809.0,35809.0,35809.0,35809.0,34458.0,34459.0
mean,65910.16,0.71177,353522300.0,318544.7,7.2343,11.606543,195.2347,1508.277,12507.18,115692.8,...,485362.1,229320.8,483784.6,440101.9,368694.3,371805.1,148511.7,8897.664,0.083464,0.209234
std,164123.8,3.054033,792648600.0,52253980.0,186.437282,234.405353,18528.62,43713.6,180154.3,885338.0,...,1254188.0,594805.4,1251106.0,1331837.0,1220688.0,1722483.0,515326.5,53163.75,3.78902,8.613915
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,866.0,0.0,16.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2986.0,1190.0,2732.0,3680.0,566.0,118.0,0.0,0.0,0.0,0.0
50%,31026.0,0.0,152.0,128.0,0.0,0.0,0.0,0.0,0.0,0.0,...,237850.0,113784.0,226356.0,195248.0,95594.0,42966.0,4278.0,0.0,0.0,0.0
75%,50068.5,0.0,964.0,432.0,0.0,0.0,0.0,0.0,0.0,0.0,...,447012.0,222286.0,474868.0,410328.0,279192.0,170608.0,143230.0,2018.0,0.0,0.0
max,2746564.0,100.0,2130707000.0,8584298000.0,21050.0,20070.0,3376892.0,4109372.0,10552860.0,29047300.0,...,31232720.0,16769290.0,27477580.0,57435240.0,31607810.0,37278560.0,19267400.0,3810078.0,362.0,1146.0


In [8]:
sensor.dtypes

class      object
aa_000      int64
ab_000    float64
ac_000    float64
ad_000    float64
           ...   
ee_007    float64
ee_008    float64
ee_009    float64
ef_000    float64
eg_000    float64
Length: 171, dtype: object

In [9]:
sensor.columns

Index(['class', 'aa_000', 'ab_000', 'ac_000', 'ad_000', 'ae_000', 'af_000',
       'ag_000', 'ag_001', 'ag_002',
       ...
       'ee_002', 'ee_003', 'ee_004', 'ee_005', 'ee_006', 'ee_007', 'ee_008',
       'ee_009', 'ef_000', 'eg_000'],
      dtype='object', length=171)

In [10]:
pd.DataFrame(sensor.dtypes,columns=['dtypes'])

Unnamed: 0,dtypes
class,object
aa_000,int64
ab_000,float64
ac_000,float64
ad_000,float64
...,...
ee_007,float64
ee_008,float64
ee_009,float64
ef_000,float64


In [11]:
sensor["class"].value_counts()

class
neg    35188
pos     1000
Name: count, dtype: int64

In [13]:
sensor['class'].dtypes

dtype('O')

In [14]:
sensor['aa_000'].dtypes

dtype('int64')

In [15]:
numerical_columns=[feature for feature in sensor.columns if sensor[feature].dtypes!='O']
categorical_columns=[feature for feature in sensor.columns if sensor[feature].dtypes=='O']

In [16]:
categorical_columns

['class']
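
The dtype-based split above could also be written with `DataFrame.select_dtypes`. This is a sketch on a hypothetical two-row frame named `toy`, not the notebook's code. It treats every non-numeric column as categorical, which matches the `'O'` check for this dataset but may differ on frames that contain boolean or datetime columns.

```python
import pandas as pd

# Hypothetical stand-in with one label column and two numeric columns
toy = pd.DataFrame({
    "class": ["pos", "neg"],
    "aa_000": [1, 2],
    "ab_000": [0.5, None],  # None becomes NaN, so the column is float64
})

# Numeric columns (int and float), in their original order
numerical_columns = toy.select_dtypes(include="number").columns.tolist()
# Everything that is not numeric is treated as categorical here
categorical_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(numerical_columns)
print(categorical_columns)
```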

In [17]:
# Requires the dtale and autoviz packages (e.g. pip install dtale autoviz)
import dtale

In [18]:
d=dtale.show(sensor)

In [19]:
# run below command to open in browser
#d.open_browser()

In [20]:
from autoviz.AutoViz_Class import AutoViz_Class

AV = AutoViz_Class()
viz=AV.AutoViz(sensor)

Imported v0.1.905. Please call AutoViz in this sequence:
    AV = AutoViz_Class()
    %matplotlib inline
    dfte = AV.AutoViz(filename, sep=',', depVar='', dfte=None, header=0, verbose=1, lowess=False,
               chart_format='svg',max_rows_analyzed=150000,max_cols_analyzed=30, save_plot_dir=None)
Shape of your Data Set loaded: (36188, 171)
#######################################################################################
######################## C L A S S I F Y I N G  V A R I A B L E S  ####################
#######################################################################################
Classifying variables in data set...
    Number of Numeric Columns =  167
    Number of Integer-Categorical Columns =  1
    Number of String-Categorical Columns =  0
    Number of Factor-Categorical Columns =  0
    Number of String-Boolean Columns =  1
    Number of Numeric-Boolean Columns =  1
    Number of Discrete String Columns =  0
    Number of NLP String Columns =  0
    Numbe

Unnamed: 0,Data Type,Missing Values%,Unique Values%,Minimum Value,Maximum Value,DQ Issue
class,object,0.0,0.0,,,No issue
aa_000,int64,0.0,45.0,0.0,2746564.0,"Column has 3483 outliers greater than upper bound (123872.25) or lower than lower bound(-72937.75). Cap them or remove them., Column has a high correlation with ['ah_000', 'an_000', 'ao_000', 'ap_000', 'ba_001', 'ba_002', 'ba_003', 'ba_004', 'bb_000', 'bg_000', 'bh_000', 'bi_000', 'bt_000', 'bu_000', 'bv_000', 'bx_000', 'by_000', 'cc_000', 'ci_000', 'cn_004', 'cq_000', 'cs_005', 'ds_000', 'dt_000']. Consider dropping one of them."
ab_000,float64,77.086327,,0.0,100.0,"27896 missing values. Impute them with mean, median, mode, or a constant value such as 123., Column has 1627 outliers greater than upper bound (0.00) or lower than lower bound(0.00). Cap them or remove them."
ac_000,float64,5.916326,,0.0,2130706664.0,"2141 missing values. Impute them with mean, median, mode, or a constant value such as 123., Column has 6482 outliers greater than upper bound (2386.00) or lower than lower bound(-1406.00). Cap them or remove them."
ad_000,float64,25.422792,,0.0,8584297742.0,"9200 missing values. Impute them with mean, median, mode, or a constant value such as 123., Column has 2472 outliers greater than upper bound (1044.00) or lower than lower bound(-588.00). Cap them or remove them."
ae_000,float64,4.385432,,0.0,21050.0,"1587 missing values. Impute them with mean, median, mode, or a constant value such as 123., Column has 1135 outliers greater than upper bound (0.00) or lower than lower bound(0.00). Cap them or remove them."
af_000,float64,4.385432,,0.0,20070.0,"1587 missing values. Impute them with mean, median, mode, or a constant value such as 123., Column has 1179 outliers greater than upper bound (0.00) or lower than lower bound(0.00). Cap them or remove them., Column has a high correlation with ['ae_000']. Consider dropping one of them."
ag_000,float64,1.047309,,0.0,3376892.0,"379 missing values. Impute them with mean, median, mode, or a constant value such as 123., Column has 150 outliers greater than upper bound (0.00) or lower than lower bound(0.00). Cap them or remove them."
ag_001,float64,1.047309,,0.0,4109372.0,"379 missing values. Impute them with mean, median, mode, or a constant value such as 123., Column has 580 outliers greater than upper bound (0.00) or lower than lower bound(0.00). Cap them or remove them."
ag_002,float64,1.047309,,0.0,10552856.0,"379 missing values. Impute them with mean, median, mode, or a constant value such as 123., Column has 2070 outliers greater than upper bound (0.00) or lower than lower bound(0.00). Cap them or remove them., Column has a high correlation with ['ag_001']. Consider dropping one of them."


Number of All Scatter Plots = 465
Image size of 1500x87200 pixels is too large. It must be less than 2^16 in each direction.
Could not draw Pair Scatter Plots
All Plots done
Time to run AutoViz = 60 seconds 

 ###################### AUTO VISUALIZATION Completed ########################
