## Defect Detection

### Dataset Information

Project Assignment

    Defect Detection in Manufacturing Systems

    In today's highly competitive markets, customers expect defect-free products. High defect ratios in manufacturing incur unwanted additional costs that decrease profit margins. In this competition, students are expected to use and calibrate machine learning (ML) algorithms for predicting and detecting defects at the manufacturing stage of a product used in fabric detergent production.

    The participants will compare the prediction performance of different ML techniques and determine the most appropriate modelling algorithm. The data used in this study was provided by one of the leading manufacturing companies in Turkey. Students are strongly encouraged to apply feature engineering to the originally provided features.


### Attribute Information


Explanation of features in the dataset:

1. Date: Date of measurement
2. Shift Period: Measurement shift period
3. Measurement Time: Time information of measurement
4. Experience: Experience information of measurement (a metric related to the measurement, not to the operator). It indicates the conditions under which the measurement was taken and is a categorical variable.
5. Percentage of dye (%): Percentage of coloring
6. Flow rate: Flow rate of coloring
7. Part: Which part of product is colored.
8. Teflon: Teflon rate of part
9. Thickness: Thickness rate of coloring.
10. Strap: Information on which strap is used in coloring. Categorical; identifies the ID of the hanger apparatus used for coloring.
11. Defective or not defective: Defect information of coloring. This variable is the target variable in the prediction stage.


In [None]:
install.packages("readxl")

Installing package into ‘/home/nbuser/R’
(as ‘lib’ is unspecified)


In [117]:
library(readxl)
data = read_excel("/home/nbuser/library/defect_detection_dataset.xlsx")

In [118]:
head(data)

Date,Shift,Time,Operator,Experience,DyePercentage,Flow rate,Part,Teflon,Thickness,Strap,IsDefective
2017-11-01,16/24,1899-12-31 18:30:00,26000392,2.5,106,34,A,20,25,1,Non-defective
2017-11-02,08/16,1899-12-31 08:35:00,110485,,102,34,A,21,27,2,Non-defective
2017-11-02,16/24,1899-12-31 18:00:00,111891,3.0,107,34,A,22,25,3,Non-defective
2017-11-03,00/08,1899-12-31 02:30:00,709050,5.0,110,34,A,23,24,4,Non-defective
2017-11-03,08/16,1899-12-31 08:15:00,112241,2.5,110,34,A,24,18,5,Non-defective
2017-11-03,08/16,1899-12-31 09:15:00,112241,2.5,130,34,A,25,26,6,Non-defective


In [119]:
experience <- data[["Experience"]]
flow_rate <- data[["Flow rate"]]
part <- data[["Part"]]
shift <- data[["Shift"]]
date <- data[["Date"]]
time <- data[["Time"]]
is_defective <- data[["IsDefective"]]

In [7]:
flow_rate

#### Label Encoding

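Label encoding is applied to the columns treated as categorical (flow_rate, part, shift, date, time, is_defective, experience). Each column is converted to a `factor`, and the integer code of each level replaces the original value.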

In [120]:
factor_fr <- factor(flow_rate)
factor_part <- factor(part)
factor_shift <- factor(shift)
factor_date <- factor(date)
factor_time <- factor(time)
factor_isdefective <- factor(is_defective)
factor_exp <- factor(experience)

flow_rate <- as.numeric(factor_fr)
part <- as.numeric(factor_part)
shift <- as.numeric(factor_shift)
date <- as.numeric(factor_date)
time <- as.numeric(factor_time)
is_defective <- as.numeric(factor_isdefective)
experience <- as.numeric(factor_exp)

In [121]:
data[["Flow rate"]] <- flow_rate
data[["Part"]] <- part
data[["Shift"]] <- shift
data[["Date"]] <- date
data[["Time"]] <- time
data[["IsDefective"]] <- is_defective
data[["Experience"]] <- experience
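The encoding step can be illustrated on a few toy values. This is a minimal sketch with made-up vectors (`status` and `experience_raw` are hypothetical names, not columns read from the file); it shows how `factor()` orders levels and that the resulting codes are level indices rather than the original numbers.

```r
# Toy vectors; the values are illustrative, not read from the dataset file.
status <- c("Non-defective", "Defective", "Non-defective")
status_factor <- factor(status)

# Levels are sorted alphabetically, so "Defective" -> 1 and "Non-defective" -> 2.
levels(status_factor)
status_codes <- as.numeric(status_factor)
status_codes  # 2 1 2

# For numeric-looking values the code is the level index, not the value itself.
experience_raw <- c(2.5, NA, 3.0, 5.0, 2.5)
experience_codes <- as.numeric(factor(experience_raw))
experience_codes  # 1 NA 2 3 1 (missing values stay NA)

# The original numbers can be recovered by going through the level labels.
experience_back <- as.numeric(as.character(factor(experience_raw)))
experience_back  # 2.5 NA 3.0 5.0 2.5
```

Because the codes depend on which levels are present, the same raw value can receive a different code in a different sample of the data.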

In [122]:
data

Date,Shift,Time,Operator,Experience,DyePercentage,Flow rate,Part,Teflon,Thickness,Strap,IsDefective
1,3,50,26000392,1,106,3,1,20,25,1,2
2,2,20,110485,,102,3,1,21,27,2,2
2,3,48,111891,2,107,3,1,22,25,3,2
3,1,11,709050,4,110,3,1,23,24,4,2
3,2,19,112241,1,110,3,1,24,18,5,2
3,2,24,112241,1,130,3,1,25,26,6,2
3,2,37,110485,,129,3,1,26,26,7,2
3,3,44,111891,2,124,3,1,27,22,8,2
4,3,44,26000392,1,130,3,1,28,25,9,2
5,2,31,111891,2,132,3,1,29,20,10,2


In [123]:
data$Operator <- NULL
data$Strap <- NULL

In [21]:
head(data)

Date,Shift,Time,Experience,DyePercentage,Flow rate,Part,Teflon,Thickness,IsDefective
1,3,50,1.0,106,3,1,20,25,2
2,2,20,,102,3,1,21,27,2
2,3,48,2.0,107,3,1,22,25,2
3,1,11,4.0,110,3,1,23,24,2
3,2,19,1.0,110,3,1,24,18,2
3,2,24,1.0,130,3,1,25,26,2


In [124]:
X <- data
y <- data[["IsDefective"]]
X$IsDefective <- NULL

In [125]:
head(X)

Date,Shift,Time,Experience,DyePercentage,Flow rate,Part,Teflon,Thickness
1,3,50,1.0,106,3,1,20,25
2,2,20,,102,3,1,21,27
2,3,48,2.0,107,3,1,22,25
3,1,11,4.0,110,3,1,23,24
3,2,19,1.0,110,3,1,24,18
3,2,24,1.0,130,3,1,25,26


In [10]:
install.packages("lattice")
library(mice)

Installing package into ‘/home/nbuser/R’
(as ‘lib’ is unspecified)


In [14]:
methods(mice)

“function 'mice' appears not to be S3 generic; found functions that look like S3 methods”

 [1] mice.impute.2l.lmer      mice.impute.2l.norm      mice.impute.2l.pan      
 [4] mice.impute.2lonly.mean  mice.impute.2lonly.norm  mice.impute.2lonly.pmm  
 [7] mice.impute.cart         mice.impute.lda          mice.impute.logreg      
[10] mice.impute.logreg.boot  mice.impute.mean         mice.impute.midastouch  
[13] mice.impute.norm         mice.impute.norm.boot    mice.impute.norm.nob    
[16] mice.impute.norm.predict mice.impute.passive      mice.impute.pmm         
[19] mice.impute.polr         mice.impute.polyreg      mice.impute.quadratic   
[22] mice.impute.rf           mice.impute.ri           mice.impute.sample      
[25] mice.mids                mice.theme              
see '?methods' for accessing help and source code

In [43]:
install.packages("bnstruct")

Installing package into ‘/home/nbuser/R’
(as ‘lib’ is unspecified)


In [48]:
library("bnstruct")

Loading required package: bitops
Loading required package: Matrix
Loading required package: igraph

Attaching package: ‘igraph’

The following objects are masked from ‘package:stats’:

    decompose, spectrum

The following object is masked from ‘package:base’:

    union


Attaching package: ‘bnstruct’

The following object is masked from ‘package:mice’:

    complete



In [56]:
library(VIM)

Loading required package: colorspace
Loading required package: grid
Loading required package: data.table
VIM is ready to use. 
 Since version 4.0.0 the GUI is in its own package VIMGUI.

          Please use the package to use the new (and old) GUI.

Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues

Attaching package: ‘VIM’

The following object is masked from ‘package:datasets’:

    sleep



In [60]:
?kNN()

#### kNN Imputation

Imputation methods can be applied to reduce the effects of missing data. `kNN imputation` is one of these methods: for a row with a missing value, it looks at the k most similar rows and, for a categorical column, uses their classes to fill in the gap. The method is applied here, using the kNN algorithm with k = 5, to the `Experience` feature, which has missing (NA) values.

In [126]:
X <- kNN(X, variable = c("Experience"), k = 5)
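To make the idea concrete, here is a simplified base-R sketch of nearest-neighbour imputation by majority vote. The helper `knn_impute_mode` and the toy data frame are hypothetical and written only for illustration; the sketch assumes numeric, complete, non-constant predictor columns and uses Euclidean distance on standardized predictors. VIM's `kNN()` uses its own distance measure and aggregation rules, so its results can differ.

```r
# Hypothetical helper: fill NAs in `target` with the most frequent value
# among the k nearest rows that have an observed `target`.
knn_impute_mode <- function(df, target, k = 5) {
  predictors <- setdiff(names(df), target)
  feats <- scale(as.matrix(df[, predictors, drop = FALSE]))
  missing_rows <- which(is.na(df[[target]]))
  complete_rows <- which(!is.na(df[[target]]))
  for (i in missing_rows) {
    # Euclidean distance from row i to every row with an observed target.
    diffs <- sweep(feats[complete_rows, , drop = FALSE], 2, feats[i, ])
    d <- sqrt(rowSums(diffs^2))
    nearest <- complete_rows[order(d)[seq_len(min(k, length(d)))]]
    # Majority vote among the neighbours.
    votes <- table(df[[target]][nearest])
    df[[target]][i] <- as.numeric(names(votes)[which.max(votes)])
  }
  df
}

# Toy data with two clearly separated groups (illustrative values only).
toy <- data.frame(
  Teflon = c(20, 21, 22, 40, 41, 42),
  Thickness = c(25, 26, 25, 10, 11, 10),
  Experience = c(1, NA, 1, 4, 4, NA)
)
imputed <- knn_impute_mode(toy, "Experience", k = 2)
imputed$Experience  # 1 1 1 4 4 4
```

Each missing value takes the class of the group its row sits in, which is the behaviour the paragraph above describes.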

In [127]:
head(X)

Date,Shift,Time,Experience,DyePercentage,Flow rate,Part,Teflon,Thickness,Experience_imp
1,3,50,1,106,3,1,20,25,False
2,2,20,1,102,3,1,21,27,True
2,3,48,2,107,3,1,22,25,False
3,1,11,4,110,3,1,23,24,False
3,2,19,1,110,3,1,24,18,False
3,2,24,1,130,3,1,25,26,False


In [128]:
X$Experience_imp <- NULL

In [129]:
head(X)

Date,Shift,Time,Experience,DyePercentage,Flow rate,Part,Teflon,Thickness
1,3,50,1,106,3,1,20,25
2,2,20,1,102,3,1,21,27
2,3,48,2,107,3,1,22,25
3,1,11,4,110,3,1,23,24
3,2,19,1,110,3,1,24,18
3,2,24,1,130,3,1,25,26


In [130]:
data[["Experience"]] <- X$Experience

In [131]:
data

Date,Shift,Time,Experience,DyePercentage,Flow rate,Part,Teflon,Thickness,IsDefective
1,3,50,1,106,3,1,20,25,2
2,2,20,1,102,3,1,21,27,2
2,3,48,2,107,3,1,22,25,2
3,1,11,4,110,3,1,23,24,2
3,2,19,1,110,3,1,24,18,2
3,2,24,1,130,3,1,25,26,2
3,2,37,1,129,3,1,26,26,2
3,3,44,2,124,3,1,27,22,2
4,3,44,1,130,3,1,28,25,2
5,2,31,2,132,3,1,29,20,2


#### Normalization

DyePercentage, Teflon, and Thickness are numerical rather than categorical. These attributes are therefore standardized with `scale()`, which centers each column to zero mean and scales it to unit standard deviation.

In [132]:
dfNormTeflon <- scale(data["Teflon"])
dfNormDyePercentage <- scale(data["DyePercentage"])
dfNormThickness <- scale(data["Thickness"])

In [133]:
data[["Teflon"]] <- dfNormTeflon
data[["DyePercentage"]] <- dfNormDyePercentage
data[["Thickness"]] <- dfNormThickness
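As a quick check of what `scale()` computes with its default arguments, the sketch below compares it with a manual z-score on a handful of illustrative values (the vector `thickness` is made up for this example).

```r
# Illustrative values only.
thickness <- c(25, 27, 25, 24, 18, 26)

# Manual z-score: subtract the mean, divide by the sample standard deviation.
z_manual <- (thickness - mean(thickness)) / sd(thickness)

# scale() returns a one-column matrix; as.numeric() turns it into a plain vector.
z_scale <- as.numeric(scale(thickness))

all.equal(z_manual, z_scale)  # TRUE
mean(z_scale)                 # numerically zero
sd(z_scale)                   # 1
```

Wrapping the result in `as.numeric()` is optional, but it keeps the column a plain numeric vector instead of a matrix.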

In [134]:
head(data)

Date,Shift,Time,Experience,DyePercentage,Flow rate,Part,Teflon,Thickness,IsDefective
1,3,50,1,0.995462,3,1,0.4940906,-0.366527,2
2,2,20,1,0.8242053,3,1,0.5932727,0.1344315,2
2,3,48,2,1.0382761,3,1,0.6924548,-0.366527,2
3,1,11,4,1.1667186,3,1,0.7916369,-0.6170062,2
3,2,19,1,1.1667186,3,1,0.890819,-2.1198816,2
3,2,24,1,2.0230018,3,1,0.9900011,-0.1160477,2


#### Splitting data

In [135]:
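# Randomly pick 75% of the row indices for training; the rest form the test set.
train_idx <- sample.int(n = nrow(data), size = floor(0.75 * nrow(data)), replace = FALSE)
train_data <- data[train_idx, ]
test_data  <- data[-train_idx, ]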
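The split above draws random row indices without a fixed seed, so the train and test sets change from run to run. A minimal sketch of a reproducible 75/25 split follows; the data frame `toy` and the seed value are arbitrary choices for illustration, not part of the original notebook.

```r
# Stand-in data frame; in the notebook, `data` would take its place.
toy <- data.frame(id = 1:20, IsDefective = rep(c(1, 2), times = 10))

set.seed(42)  # arbitrary seed, chosen only for this example
train_idx <- sample.int(n = nrow(toy), size = floor(0.75 * nrow(toy)))
train_toy <- toy[train_idx, ]
test_toy <- toy[-train_idx, ]

nrow(train_toy)                               # 15
nrow(test_toy)                                # 5
length(intersect(train_toy$id, test_toy$id))  # 0, the two sets do not overlap
```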

The target feature must be a `factor` so that a suitable classification `model` can be fitted.

In [142]:
train_data$IsDefective <- as.factor(train_data$IsDefective)
class(train_data$IsDefective)

A logistic regression model is fitted to the training data using the `glm` function with `family=binomial`.

In [144]:
model <- glm(IsDefective ~ ., data=train_data, family=binomial)

In [145]:
summary(model)


Call:
glm(formula = IsDefective ~ ., family = binomial, data = train_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.4186  -0.2250   0.1556   0.4250   1.0791  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    1.910476   1.254653   1.523    0.128    
Date          -0.008431   0.014543  -0.580    0.562    
Shift         -0.604811   0.870012  -0.695    0.487    
Time           0.019822   0.040211   0.493    0.622    
Experience    -0.030355   0.369732  -0.082    0.935    
DyePercentage -0.411387   0.335469  -1.226    0.220    
`Flow rate`    0.138057   0.142161   0.971    0.331    
Part          -0.222135   0.354055  -0.627    0.530    
Teflon         0.303462   0.335431   0.905    0.366    
Thickness     -3.972168   0.551031  -7.209 5.65e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 312.35  on 244  degrees of freedom

In [146]:
predictTrain = predict(model, type="response")
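When the response of a binomial `glm` is a factor, the first level is treated as failure and the remaining level as success, so `predict(..., type = "response")` returns the probability of the second level. The sketch below illustrates this on the built-in `mtcars` data, used here only as a stand-in for the notebook's data.

```r
# mtcars ships with R; `am` is 0 (automatic) or 1 (manual).
cars <- mtcars
cars$am <- factor(cars$am)
levels(cars$am)  # "0" "1": the first level is the reference (failure)

fit <- glm(am ~ wt, data = cars, family = binomial)
p <- predict(fit, type = "response")  # probability of the second level, "1"

# Rows whose actual class is the second level get higher probabilities on average.
mean_by_class <- tapply(p, cars$am, mean)
mean_by_class
table(actual = cars$am, predicted_second_level = p > 0.5)
```

In this notebook the encoded target has levels 1 and 2, so by the same rule the predicted probabilities refer to class 2.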

In [147]:
summary(predictTrain)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0000061 0.2989648 0.8670079 0.6653061 0.9767098 0.9999424 

In [149]:
table(train_data$IsDefective, predictTrain > 0.5)

   
    FALSE TRUE
  1    77    5
  2     0  163

In [169]:
predictTest = predict(model, type = "response", newdata = test_data)

In [170]:
table(test_data$IsDefective, predictTest > 0.5)

   
    FALSE TRUE
  1    24    6
  2     1   51

In this table, the rows are the actual classes (1 = defective, 2 = non-defective after label encoding) and the columns show whether the predicted probability of class 2 exceeds 0.5. The number of correctly classified observations is 75 (24 + 51): 24 defective items are predicted as defective, and 51 non-defective items are predicted as non-defective. 6 defective items are wrongly predicted as non-defective, and 1 non-defective item is wrongly predicted as defective.

In [171]:
(24+51)/(24+51+7)

So, the accuracy of the model on the test data is about 0.91.
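The same arithmetic can be done from a confusion matrix object, which also gives per-class measures. This sketch re-enters the test-set counts from the table above by hand; the names `cm`, `defect_recall`, and `defect_precision` are introduced here for illustration.

```r
# Counts copied from the test-set table (rows: actual class, columns: predicted class 2?).
cm <- matrix(c(24, 1, 6, 51), nrow = 2,
             dimnames = list(actual = c("1", "2"),
                             predicted_class2 = c("FALSE", "TRUE")))
cm

# Share of all test rows that are classified correctly.
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # 75 / 82, about 0.915

# Of the actually defective items (class 1), how many were flagged as defective?
defect_recall <- cm["1", "FALSE"] / sum(cm["1", ])
defect_recall  # 24 / 30 = 0.8

# Of the items flagged as defective, how many were actually defective?
defect_precision <- cm["1", "FALSE"] / sum(cm[, "FALSE"])
defect_precision  # 24 / 25 = 0.96
```

Accuracy alone hides the fact that 6 of the 30 defective items are missed, which matters when the goal is to catch defects.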