# Product entity yield type prediction
## Domain
Semiconductor manufacturing process
## Business Context
A complex modern semiconductor manufacturing process is normally under constant
surveillance via the monitoring of signals/variables collected from sensors and/or
process measurement points. However, not all of these signals are equally valuable in
a specific monitoring system.
The measured signals contain a combination of useful information, irrelevant
information, and noise. Engineers typically have many more signals than are
actually required. If we consider each type of signal as a feature, then feature
selection may be applied to identify the most relevant signals. Process engineers
may then use these signals to determine key factors contributing to yield excursions
downstream in the process. This enables higher process throughput, shorter time to
learning, and lower per-unit production costs.
These signals can be used as features to predict the yield type. By analyzing and
trying out different combinations of features, the essential signals that affect the
yield type can be identified.
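
As a rough illustration of the feature-selection idea described above, the sketch below filters a synthetic signal matrix in two steps: it drops constant signals, then keeps the signals most associated with the pass/fail label. The data, the sizes, and the choice of `VarianceThreshold` plus `SelectKBest` are assumptions made for this example, not the method used later in the notebook.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

# Synthetic stand-in for sensor signals (assumption: not the real dataset).
rng = np.random.default_rng(0)
n_samples = 200
y = np.where(rng.random(n_samples) < 0.2, 1, -1)  # 1 = fail, -1 = pass
X = rng.normal(size=(n_samples, 20))
X[:, 3] += 2.0 * (y == 1)  # signal 3 shifts when the entity fails
X[:, 7] = 100.0            # signal 7 is constant, so it carries no information

# Step 1: drop signals with zero variance.
variance_filter = VarianceThreshold(threshold=0.0)
X_var = variance_filter.fit_transform(X)
kept = np.flatnonzero(variance_filter.get_support())

# Step 2: keep the 5 signals with the highest ANOVA F-score against the label.
selector = SelectKBest(f_classif, k=5).fit(X_var, y)
selected = kept[selector.get_support()]

print("signals kept after variance filter:", len(kept))
print("selected signal indices:", selected)
```

A univariate filter like this scores each signal independently, so it can miss signals that only matter in combination; it is meant as a cheap first pass.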
## Objective
We will build a classifier to predict the Pass/Fail yield of a particular process entity and
analyze whether all the features are required to build the model or not.
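
To make the objective concrete, here is a minimal sketch of a pass/fail classifier evaluated with a confusion matrix and the F1 score for the fail class. It runs on synthetic, imbalanced data; the logistic-regression pipeline, the class weighting, and all sizes are assumptions chosen for illustration rather than the model this notebook builds.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data (assumption): roughly 15% of entities fail.
rng = np.random.default_rng(42)
n_samples, n_features = 400, 10
y = np.where(rng.random(n_samples) < 0.15, 1, -1)  # 1 = fail, -1 = pass
X = rng.normal(size=(n_samples, n_features))
X[:, 0] += 2.0 * (y == 1)  # only feature 0 is informative

# Stratify so the rare fail class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" reweights samples to offset the class imbalance.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred, labels=[-1, 1])
f1_fail = f1_score(y_test, y_pred, pos_label=1)
print(cm)
print("F1 (fail class):", f1_fail)
```

With so few failures, plain accuracy can look high for a model that never predicts a fail, which is why the sketch reports the confusion matrix and the fail-class F1 instead.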
## Dataset description
- sensor-data.csv : (1567, 592)
- The data consists of 1567 examples each with 591 features.
- The dataset presented in this case represents a selection of such features, where each example represents a single production entity with associated measured features, and the labels represent a simple pass/fail yield for in-house line testing. In the target column, “-1” corresponds to a pass and “1” corresponds to a fail; the timestamp is for that specific test point.
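
The label convention above (-1 = pass, 1 = fail) can be handled as in the following sketch, which separates features from the target and remaps the labels to 0/1, a form some libraries expect. The tiny frame and the `sensor_a`/`sensor_b` column names are hypothetical; only the `Pass/Fail` column name and the -1/1 coding come from the dataset description.

```python
import pandas as pd

# Hypothetical miniature frame; the sensor values are made up for illustration.
toy = pd.DataFrame({
    "sensor_a": [0.1, 0.4, 0.3, 0.9],
    "sensor_b": [5.0, 5.2, 4.8, 5.1],
    "Pass/Fail": [-1, -1, 1, -1],
})

# Separate features from the target column.
X = toy.drop(columns="Pass/Fail")

# Remap -1 (pass) to 0 and 1 (fail) to 1.
y = toy["Pass/Fail"].map({-1: 0, 1: 1})

# With 0/1 labels, the mean is the fraction of failing entities.
fail_rate = y.mean()
print(X.shape, list(y), fail_rate)
```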
## The curse of dimensionality
For a given sample size, there is a maximum number of features above which the performance of our classifier will degrade rather than improve. In most cases, the information that is lost by discarding some features is (more than) compensated by a more accurate mapping in the lower-dimensional space.
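
The effect can be illustrated with a small experiment: keep the sample size fixed, keep one informative feature, and append more and more pure-noise features. The sketch below does this on synthetic data with a k-nearest-neighbours classifier; the classifier, the sample size, and the noise counts are all assumptions chosen for the demonstration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_samples = 200  # fixed sample size for every run
y = np.repeat([-1, 1], n_samples // 2)

# One informative feature: its mean is -1.5 for passes and +1.5 for fails.
signal = rng.normal(loc=1.5 * y, scale=1.0).reshape(-1, 1)

accuracy_by_noise = {}
for n_noise in (0, 10, 100, 500):
    # Append n_noise features that are unrelated to the label.
    noise = rng.normal(size=(n_samples, n_noise))
    X = np.hstack([signal, noise])
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
    accuracy_by_noise[n_noise] = scores.mean()

for n_noise, acc in accuracy_by_noise.items():
    print(f"{n_noise:>3} noise features: mean CV accuracy = {acc:.3f}")
```

Because the distances used by k-nearest neighbours become dominated by the noise dimensions, accuracy should fall as noise features are added even though the informative feature is still present.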

In [2]:
import os
import pandas as pd
import numpy as np
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.image import imread
import plotly.express as px
import cv2
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torchvision import datasets, transforms
# from torchviz import make_dot
from torch.utils.data import DataLoader, random_split
from tqdm import tqdm
from PIL import Image

import holoviews as hv
from holoviews import opts
#hv.extension('bokeh')
import json
import shap

from lime import lime_image
from lime.wrappers.scikit_image import SegmentationAlgorithm


import plotly.graph_objects as go

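# Note: pandas_profiling has been renamed ydata-profiling; on newer installs
# the equivalent import is `from ydata_profiling import ProfileReport`.
from pandas_profiling import ProfileReport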
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import VotingRegressor

In [5]:
# Load the data, dropping the first column of the CSV
df = pd.read_csv('./data/uci-secom.csv').iloc[:, 1:]
df.head()

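|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 581 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | Pass/Fail |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3030.93 | 2564.0 | 2187.7333 | 1411.1265 | 1.3602 | 100.0 | 97.6133 | 0.1242 | 1.5005 | 0.0162 | ... | NaN | 0.5005 | 0.0118 | 0.0035 | 2.363 | NaN | NaN | NaN | NaN | -1 |
| 1 | 3095.78 | 2465.14 | 2230.4222 | 1463.6606 | 0.8294 | 100.0 | 102.3433 | 0.1247 | 1.4966 | -0.0005 | ... | 208.2045 | 0.5019 | 0.0223 | 0.0055 | 4.4447 | 0.0096 | 0.0201 | 0.006 | 208.2045 | -1 |
| 2 | 2932.61 | 2559.94 | 2186.4111 | 1698.0172 | 1.5102 | 100.0 | 95.4878 | 0.1241 | 1.4436 | 0.0041 | ... | 82.8602 | 0.4958 | 0.0157 | 0.0039 | 3.1745 | 0.0584 | 0.0484 | 0.0148 | 82.8602 | 1 |
| 3 | 2988.72 | 2479.9 | 2199.0333 | 909.7926 | 1.3204 | 100.0 | 104.2367 | 0.1217 | 1.4882 | -0.0124 | ... | 73.8432 | 0.499 | 0.0103 | 0.0025 | 2.0544 | 0.0202 | 0.0149 | 0.0044 | 73.8432 | -1 |
| 4 | 3032.24 | 2502.87 | 2233.3667 | 1326.52 | 1.5334 | 100.0 | 100.3967 | 0.1235 | 1.5031 | -0.0031 | ... | NaN | 0.48 | 0.4766 | 0.1045 | 99.3032 | 0.0202 | 0.0149 | 0.0044 | 73.8432 | -1 |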
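
The five rows above already show missing values and a column (`5`) that is 100.0 in every displayed row. A quick check for both is sketched below. It uses only a few columns copied from those rows, so it says nothing about the full dataset; on the real data the same calls would run on `df` instead of the small `head` frame assumed here.

```python
import numpy as np
import pandas as pd

# A few columns copied from the five rows shown above (not the full dataset).
head = pd.DataFrame({
    "5": [100.0, 100.0, 100.0, 100.0, 100.0],
    "581": [np.nan, 208.2045, 82.8602, 73.8432, np.nan],
    "586": [np.nan, 0.0096, 0.0584, 0.0202, 0.0202],
    "Pass/Fail": [-1, -1, 1, -1, -1],
})
features = head.drop(columns="Pass/Fail")

# Fraction of missing values per column.
missing_fraction = features.isna().mean()

# Columns with at most one distinct non-missing value cannot help a classifier.
constant_columns = [
    col for col in features.columns if features[col].nunique(dropna=True) <= 1
]

print(missing_fraction)
print("constant columns:", constant_columns)
```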


In [None]:
profile = ProfileReport(df, title='Semiconductor Manufacturing Process Dataset', html={'style':{'full_width':True}})
profile
