# Data Information

https://www.kaggle.com/paresh2047/uci-semcom

**UCI SECOM Dataset**


Semiconductor manufacturing process dataset


Data Structure: The data consists of 2 files: the dataset file SECOM, containing 1567 examples each with 591 features (a 1567 x 591 matrix), and a labels file containing the classification and date-time stamp for each example.


As with any real-life data, this dataset contains null values, and the share of nulls varies from feature to feature. This needs to be taken into account when investigating the data, either through pre-processing or within the technique applied.
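A minimal sketch of how the per-feature share of null values could be inspected before deciding what to drop. The toy frame, its column names, and the 50% cutoff (`max_missing`) are assumptions for illustration, not values taken from the SECOM data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the feature matrix (values are made up)
toy = pd.DataFrame({
    "f0": [1.0, np.nan, 3.0, 4.0],
    "f1": [np.nan, np.nan, np.nan, 1.0],
    "f2": [0.5, 0.6, 0.7, 0.8],
})

# Fraction of missing values per feature, sorted from most to least sparse
missing_fraction = toy.isnull().mean().sort_values(ascending=False)
print(missing_fraction)

# Hypothetical cutoff: keep features with at most 50% missing values
max_missing = 0.5
kept_columns = sorted(missing_fraction[missing_fraction <= max_missing].index)
print(kept_columns)
```

Working with a fraction rather than an absolute count keeps the cutoff meaningful if the number of rows changes.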


The data is stored in a raw text file, with each line representing an individual example and the features separated by spaces. Null values are represented by the 'NaN' value, as in MATLAB.
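The raw layout described above could be parsed roughly as follows. This is a sketch on two made-up rows held in memory; the notebook itself loads a CSV version of the data instead.

```python
import io
import pandas as pd

# Two made-up rows in the raw layout: space-separated values, 'NaN' for nulls
raw_text = "3030.93 2564.00 NaN 1.5\n3095.78 NaN 2230.42 0.9\n"

# header=None because the raw layout has no column names;
# pandas treats the string 'NaN' as a missing value by default
raw = pd.read_csv(io.StringIO(raw_text), sep=" ", header=None)

print(raw.shape)
print(raw.isnull().sum().tolist())
```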

In [None]:
import numpy as np 
import pandas as pd
import warnings
warnings.simplefilter("ignore")

# Dataset Preprocessing

In [None]:
dataset=pd.read_csv("../input/uci-semcom/uci-secom.csv")
dataset.head()

In [None]:
# Count missing values per column and collect the columns with more than 900 nulls
d = dataset.isnull().sum()
j = []
for i in d.keys():
    if d[i] > 900:
        print(i, d[i])
        j.append(i)

In [None]:
dataset.drop(j, axis = 1, inplace = True)
dataset.replace(np.nan, 0, inplace = True)

In [None]:
dataset.head()

## Dataset Separation

In [None]:
X=dataset.drop(['Pass/Fail','Time'],axis=1) #Predictors
y=dataset['Pass/Fail'] #Response
X.head()

## Training and Test Split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Scoring Function

In [None]:
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score

In [None]:
def print_score(classifier,X_train,y_train,X_test,y_test,train=True):
    if train:
        print("Training results:\n")
        print('Accuracy Score: {0:.4f}\n'.format(accuracy_score(y_train,classifier.predict(X_train))))
        print('Classification Report:\n{}\n'.format(classification_report(y_train,classifier.predict(X_train))))
        print('Confusion Matrix:\n{}\n'.format(confusion_matrix(y_train,classifier.predict(X_train))))
        res = cross_val_score(classifier, X_train, y_train, cv=10, n_jobs=-1, scoring='accuracy')
        print('Average Accuracy:\t{0:.4f}\n'.format(res.mean()))
        print('Standard Deviation:\t{0:.4f}'.format(res.std()))
    else:
        print("Test results:\n")
        print('Accuracy Score: {0:.4f}\n'.format(accuracy_score(y_test,classifier.predict(X_test))))
        print('Classification Report:\n{}\n'.format(classification_report(y_test,classifier.predict(X_test))))
        print('Confusion Matrix:\n{}\n'.format(confusion_matrix(y_test,classifier.predict(X_test))))

In [None]:
from sklearn.tree import DecisionTreeClassifier as DT

classifier = DT(criterion='entropy',random_state=42)
classifier.fit(X_train,y_train)

In [None]:
print_score(classifier,X_train,y_train,X_test,y_test,train=False)

# Feature Selection

## Filter - ANOVA

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

In [None]:
# Select the best features

selected = SelectKBest(score_func=f_classif, k=10)
anova_fit = selected.fit(X_train, y_train)

In [None]:
np.set_printoptions(precision=3)
print(anova_fit.scores_)

The elements of the array above are the evaluation scores (ANOVA F-values) of every attribute, or feature, in the dataset. A higher score indicates that the feature is relatively more informative than the others, so it is given priority during selection. In this feature selection implementation, the 10 best features are chosen based on the ANOVA computation.
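To make the meaning of the scores concrete, here is a small sketch on synthetic data in which only one feature depends on the class. The data, the shift of 2.0, and the names `X_toy` and `y_toy` are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200
y_toy = np.repeat([0, 1], n // 2)

# Feature 0 shifts with the class; features 1 and 2 are pure noise
X_toy = rng.normal(size=(n, 3))
X_toy[:, 0] += 2.0 * y_toy

# f_classif returns one F-value and one p-value per feature
f_scores, p_values = f_classif(X_toy, y_toy)
print(f_scores)

# With k=1 the selector keeps only the highest-scoring feature
selector = SelectKBest(score_func=f_classif, k=1).fit(X_toy, y_toy)
print(selector.get_support(indices=True))
```

The feature whose mean differs between the classes receives a far larger F-value than the noise features, which is why it is the one selected.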

In [None]:
# Rank feature indices by ANOVA score (highest first), skipping features whose score is NaN
scores = anova_fit.scores_
features_columns = np.array([i for i in np.argsort(scores)[::-1] if not np.isnan(scores[i])])

print(features_columns[:10])

The result above lists the column indices of the 10 most important features in this dataset. Below are the first 5 samples after filtering the data down to only those 10 features.
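Column indices can be harder to read than column names. One possible way to recover the names is the selector's boolean support mask, sketched here on a made-up frame; the columns `a` to `d` and the shift applied to `c` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 50)
frame = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
frame["c"] += 3.0 * labels  # make column "c" informative

selector = SelectKBest(score_func=f_classif, k=2).fit(frame, labels)

# get_support() returns a boolean mask aligned with the input columns
selected_names = frame.columns[selector.get_support()].tolist()
print(selected_names)
```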

In [None]:
features_by_anova = anova_fit.transform(X_train)
print(features_by_anova[0:5,:])

## Embedded - Ridge

In [None]:
from sklearn.linear_model import Ridge

In [None]:
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

In [None]:
def pretty_print_coefs(coefs, names = None, sort = False):
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        lst = sorted(lst,  key = lambda x:-np.abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name) for coef, name in lst)

In [None]:
print("Ridge model:", pretty_print_coefs(ridge.coef_))

This yields a coefficient for every feature from the Ridge model. Ridge regression applies L2 regularization, which shrinks coefficients toward zero without setting them exactly to zero. When several features receive similar coefficient values, this can indicate that they are correlated, because Ridge tends to spread weight across correlated features. Features whose coefficients are close to zero in absolute value have little influence and are candidates for elimination, in line with the goal of feature selection; a negative sign only indicates the direction of the relationship, not a lack of importance.
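A short sketch of ranking features by coefficient magnitude rather than by sign. It uses synthetic data and standardizes the features first so that coefficients are on a comparable scale; that step is an assumption of this sketch and is not part of the notebook's Ridge cell.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))

# Target depends strongly (and negatively) on feature 1, weakly on feature 0,
# and not at all on feature 2
y_toy = 0.5 * X_toy[:, 0] - 3.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=300)

X_scaled = StandardScaler().fit_transform(X_toy)
model = Ridge(alpha=1.0).fit(X_scaled, y_toy)

# Rank features by absolute coefficient, largest first
ranking = np.argsort(-np.abs(model.coef_))
print(model.coef_)
print(ranking)
```

Feature 1 ranks first even though its coefficient is negative, while feature 2, whose coefficient is near zero, ranks last.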

## Feature Selection Comparison

Using ridge regression (embedded) requires more analysis to decide which features to select and takes longer to run than ANOVA (filter). Other algorithms still need to be tried to find a more suitable feature selection method.

# Feature Extraction

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
%matplotlib inline

## PCA

As a first step, the features in the data must be standardized. Features with a larger range can otherwise dominate features with a smaller range, which biases the result.
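The effect of standardization can be seen on a tiny made-up example with two features on very different scales. The sketch fits the scaler on the training rows only and reuses the same statistics for the test rows, which is one common way to keep test data out of the fitting step.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: the second feature has a much larger range than the first
train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
test = np.array([[2.0, 300.0], [4.0, 700.0]])

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler().fit(train)
train_std = scaler.transform(train)
test_std = scaler.transform(test)

print(train_std.mean(axis=0))  # approximately 0 per feature
print(train_std.std(axis=0))   # approximately 1 per feature
print(test_std)
```

After scaling, both features contribute on the same scale regardless of their original units.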

In [None]:
# Fit the scaler on the training data only, then reuse it for the other splits
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
X_std = scaler.transform(X)

In [None]:
# Initialize PCA and fit it on the standardized training data only
# Keep the minimum number of components that retains 85% of the variance, to limit overfitting
pca = PCA(n_components=0.85)
pca.fit(X_train_std)

In [None]:
print('Variance of each component:', pca.explained_variance_ratio_)
print('\nTotal features:', pca.n_components_)
print('Total Variance Explained:', round(pca.explained_variance_ratio_.sum() * 100, 2))

In [None]:
# Transform train and test datasets
X_train_pca = pca.transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

print('X_train_pca shape:', X_train_pca.shape)
print('X_test_pca shape:', X_test_pca.shape)

Next, the model is trained on the PCA-transformed training data.
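The scaling, PCA, and training steps can also be chained in a single scikit-learn pipeline, so that every step is fitted on the training split only. The sketch below uses synthetic data from `make_classification` as a stand-in; the sample and feature counts are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the SECOM dataset)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Scale, reduce to the components retaining 85% of the variance, then classify
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.85),
    DecisionTreeClassifier(criterion="entropy", random_state=42),
)
pipe.fit(X_tr, y_tr)

n_components = pipe.named_steps["pca"].n_components_
test_accuracy = pipe.score(X_te, y_te)
print(n_components)
print(test_accuracy)
```

Calling `pipe.score` or `pipe.predict` on the test split applies the already-fitted scaler and PCA, so no separate transform calls are needed.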

In [None]:
from sklearn.tree import DecisionTreeClassifier as DT

classifier = DT(criterion='entropy',random_state=42)
classifier.fit(X_train_pca,y_train)

In [None]:
print_score(classifier,X_train_pca,y_train,X_test_pca,y_test,train=False)

## LDA

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()

# Fit on the training split only so the test split stays unseen
lda.fit(X_train, y_train)

In [None]:
# Transform train and test datasets
X_train_lda = lda.transform(X_train)
X_test_lda = lda.transform(X_test)

print('X_train_lda shape:', X_train_lda.shape)
print('X_test_lda shape:', X_test_lda.shape)
print('Variance of each component:', lda.explained_variance_ratio_)

Next, the model is trained on the LDA-transformed training data.
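LDA can produce at most (number of classes - 1) components, so a two-class problem such as Pass/Fail is projected onto a single axis. A minimal sketch on synthetic two-class data (the data and the names `X_toy`, `y_toy`, and `lda_toy` are assumptions) shows the resulting shape.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1], 60)

# Five features; only the first one separates the two classes
X_toy = rng.normal(size=(120, 5))
X_toy[:, 0] += 2.0 * y_toy

lda_toy = LinearDiscriminantAnalysis().fit(X_toy, y_toy)
X_proj = lda_toy.transform(X_toy)

# Two classes give a single discriminant component
print(X_proj.shape)
```

The downstream classifier therefore sees one feature per sample, however many features the original data had.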

In [None]:
from sklearn.tree import DecisionTreeClassifier as DT

classifier = DT(criterion='entropy',random_state=42)
classifier.fit(X_train_lda,y_train)

In [None]:
print_score(classifier,X_train_lda,y_train,X_test_lda,y_test,train=False)

# Division of Work

1. 13517002 - Isa : Feature Extraction
2. 13517095 - Naufal : Data preprocessing, baseline model
3. 13517098 - Anzaldi : Feature Selection