# Task
# [Steel Plates Faults Data Set](http://archive.ics.uci.edu/ml/datasets/Steel+Plates+Faults)

 ![](https://img.shields.io/badge/sector-steel-gray.svg)
 ![](https://img.shields.io/badge/labeled-yes-blue.svg)
 ![](<https://img.shields.io/badge/simulation-no-red.svg>)
 ![](https://img.shields.io/badge/time--series-no-red.svg)

Parameter | Value
---- | ----
Data Set Characteristics | Multivariate
Attribute Characteristics	| Integer, Real
Associated Tasks	| Classification
Number of Instances	| 1941
Number of Attributes	| 27
Date Donated | 2010-10-26
Source | UCI Machine Learning Repository
Dataset size | 299KB

## Source

Semeion, Research Center of Sciences of Communication, Via Sersale 117, 00128, Rome, Italy.\
www.semeion.it


## Data Set Information
Type of dependent variables (7 Types of Steel Plates Faults):
1. Pastry
2. Z_Scratch
3. K_Scatch
4. Stains
5. Dirtiness
6. Bumps
7. Other_Faults


## Attribute Information

27 independent variables:
1. X_Minimum
2. X_Maximum
3. Y_Minimum
4. Y_Maximum
5. Pixels_Areas
6. X_Perimeter
7. Y_Perimeter
8. Sum_of_Luminosity
9. Minimum_of_Luminosity
10. Maximum_of_Luminosity
11. Length_of_Conveyer
12. TypeOfSteel_A300
13. TypeOfSteel_A400
14. Steel_Plate_Thickness
15. Edges_Index
16. Empty_Index
17. Square_Index
18. Outside_X_Index
19. Edges_X_Index
20. Edges_Y_Index
21. Outside_Global_Index
22. LogOfAreas
23. Log_X_Index
24. Log_Y_Index
25. Orientation_Index
26. Luminosity_Index
27. SigmoidOfAreas


## References   
- [UCI](http://archive.ics.uci.edu/ml/datasets/Steel+Plates+Faults): Database source link.

## Citation Request:

1. M Buscema, S Terzi, W Tastle, A New Meta-Classifier, in NAFIPS 2010, Toronto (Canada), 26-28 July 2010, 978-1-4244-7858-6/10 2010 IEEE
2. M Buscema, MetaNet: The Theory of Independent Judges, in Substance Use & Misuse, 33(2), 439-461, 1998

In [1]:
import copy
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px

from pandas_profiling import ProfileReport
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, mean_absolute_error, accuracy_score, f1_score
from sklearn.preprocessing import StandardScaler



In [2]:
df = pd.read_csv('./data/steel+plates+faults/Faults.NNA', sep = '\t', header = None)
df.tail()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,24,25,26,27,28,29,30,31,32,33
1936,249,277,325780,325796,273,54,22,35033,119,141,...,-0.4286,0.0026,0.7254,0,0,0,0,0,0,1
1937,144,175,340581,340598,287,44,24,34599,112,133,...,-0.4516,-0.0582,0.8173,0,0,0,0,0,0,1
1938,145,174,386779,386794,292,40,22,37572,120,140,...,-0.4828,0.0052,0.7079,0,0,0,0,0,0,1
1939,137,170,422497,422528,419,97,47,52715,117,140,...,-0.0606,-0.0171,0.9919,0,0,0,0,0,0,1
1940,1261,1281,87951,87967,103,26,22,11682,101,133,...,-0.2,-0.1139,0.5296,0,0,0,0,0,0,1


In [3]:
# onehot -> label
label = df.iloc[:, -7:].apply(lambda x: (x * [1,2,3,4,5,6,7]).sum(), axis = 1)
df = df.iloc[:, :-7]
df['label'] = label
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,18,19,20,21,22,23,24,25,26,label
0,42,50,270900,270944,267,17,44,24220,76,108,...,0.4706,1.0,1.0,2.4265,0.9031,1.6435,0.8182,-0.2913,0.5822,1
1,645,651,2538079,2538108,108,10,30,11397,84,123,...,0.6,0.9667,1.0,2.0334,0.7782,1.4624,0.7931,-0.1756,0.2984,1
2,829,835,1553913,1553931,71,8,19,7972,99,125,...,0.75,0.9474,1.0,1.8513,0.7782,1.2553,0.6667,-0.1228,0.215,1
3,853,860,369370,369415,176,13,45,18996,99,126,...,0.5385,1.0,1.0,2.2455,0.8451,1.6532,0.8444,-0.1568,0.5212,1
4,1289,1306,498078,498335,2409,60,260,246930,37,126,...,0.2833,0.9885,1.0,3.3818,1.2305,2.4099,0.9338,-0.1992,1.0,1
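
The one-hot to label conversion in the cell above can be illustrated on a toy frame. This is a minimal sketch, assuming every row has exactly one fault flag set; the helper name `onehot_to_label` and the three-column toy data are hypothetical and stand in for the dataset's seven fault columns.

```python
import pandas as pd


def onehot_to_label(onehot: pd.DataFrame) -> pd.Series:
    """Map one-hot rows to 1-based integer labels (column position + 1)."""
    values = onehot.to_numpy()
    # Guard against rows with zero or several flags, which argmax would hide.
    if not (values.sum(axis=1) == 1).all():
        raise ValueError("each row must have exactly one fault flag set")
    return pd.Series(values.argmax(axis=1) + 1, index=onehot.index, name="label")


# Toy data (hypothetical): three fault columns instead of seven.
toy = pd.DataFrame([[1, 0, 0], [0, 0, 1], [0, 1, 0]])
toy_labels = onehot_to_label(toy)
print(toy_labels.tolist())  # [1, 3, 2]
```

Compared with the weighted-sum version, this variant fails loudly on a malformed row instead of silently producing a label such as 0 or a sum of two class indices.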


The detailed modeling process can be found in use case 1; here we go directly to modeling.

In [4]:
X=df.drop('label', axis=1)
y=df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train.shape, X_test.shape, y_train.shape,y_test.shape

((1358, 27), (583, 27), (1358,), (583,))
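
The split above is purely random. Because the classes are imbalanced (the test split holds only 13 samples of category 5 but 193 of category 7), a stratified split may give more stable per-class metrics. The sketch below is an illustration only, not the notebook's method: it uses synthetic data, and the names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in (hypothetical): 200 samples, 3 features, imbalanced labels.
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([1] * 150 + [2] * 40 + [3] * 10)

# stratify=y_demo keeps each class's share the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)
print(np.bincount(y_te)[1:])  # per-class counts in the test split
```

For the notebook's data, the same idea would amount to passing `stratify=y` to the existing `train_test_split` call.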

In [5]:
sc_x = StandardScaler()
X_train = pd.DataFrame(sc_x.fit_transform(X_train), columns=X.columns.values)
X_test = pd.DataFrame(sc_x.transform(X_test), columns=X.columns.values)

In [6]:
from sklearn.metrics import classification_report

In [7]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))


              precision    recall  f1-score   support

           1       0.69      0.47      0.56        51
           2       0.91      0.80      0.85        61
           3       0.98      0.94      0.96       122
           4       0.88      0.96      0.92        23
           5       0.92      0.85      0.88        13
           6       0.67      0.70      0.69       120
           7       0.69      0.77      0.73       193

    accuracy                           0.78       583
   macro avg       0.82      0.78      0.80       583
weighted avg       0.78      0.78      0.78       583



The above shows the performance of a Random Forest on this task. Precision is the fraction of samples the model assigned to a class that truly belong to it, and recall is the fraction of a class's true samples that the model retrieves. For example, the precision of category 5 is 0.92, which means that 92% of the samples the model labeled as category 5 are correct. However, because the model assigned some samples that actually belong to category 5 to other categories, it found only 85% of all category 5 samples. This also shows that a single metric is not a good indication of the model's performance.
The F1 score is the harmonic mean of precision and recall. Larger values correspond to better results.
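
As a worked example, the three metrics can be computed by hand from true positive, false positive, and false negative counts. The counts below are hypothetical: they were chosen to be consistent with the category 5 row of the report above (support 13), not read from the model's actual confusion matrix.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw counts for one class."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts: 11 correct hits, 1 false alarm, 2 missed samples.
p, r, f1 = precision_recall_f1(tp=11, fp=1, fn=2)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.92 recall=0.85 f1=0.88
```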

So is this the best result that machine learning methods can achieve? No. There are two ways to further improve the model's performance. One is feature engineering, which generates new, useful features to get better results from the model. The other is automated machine learning (AutoML), which improves performance by automatically selecting the model and its hyperparameters. Here we show the second approach through an example.
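
The idea of selecting parameters automatically can be sketched at a very small scale with scikit-learn's `GridSearchCV`. This is an illustration under stated assumptions rather than part of the original workflow: the data is synthetic, and the grid values are arbitrary choices, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical), not the steel plates dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Small illustrative grid; every combination is scored with 3-fold CV.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1_macro",
    cv=3,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

A grid search only tunes one fixed model, whereas an AutoML tool also searches over different model types and preprocessing steps.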

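Automated machine learning (AutoML) searches over candidate models and hyperparameters automatically instead of tuning them by hand.
TPOT is a Python AutoML tool that uses genetic programming to optimize machine learning pipelines built on scikit-learn.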

In [8]:
from tpot import TPOTClassifier

In [9]:
# Create the TPOT classifier
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)

# Train the model
tpot.fit(X_train, y_train)

# Print the score
print(f"Test score: {tpot.score(X_test, y_test)}")

# Export the best pipeline to a Python script
tpot.export('tpot_best_model.py')


Generation 1 - Current best internal CV score: 0.7599278272194486

Generation 2 - Current best internal CV score: 0.7599278272194486

Generation 3 - Current best internal CV score: 0.7599278272194486

Generation 4 - Current best internal CV score: 0.7599278272194486

Generation 5 - Current best internal CV score: 0.7680296288257

Best pipeline: ExtraTreesClassifier(input_matrix, bootstrap=True, criterion=entropy, max_features=0.8, min_samples_leaf=1, min_samples_split=4, n_estimators=100)
Test score: 0.79073756432247


In [11]:
predictions = tpot.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           1       0.68      0.45      0.54        51
           2       0.92      0.92      0.92        61
           3       1.00      0.94      0.97       122
           4       0.88      0.96      0.92        23
           5       1.00      0.85      0.92        13
           6       0.69      0.68      0.68       120
           7       0.70      0.79      0.74       193

    accuracy                           0.79       583
   macro avg       0.84      0.80      0.81       583
weighted avg       0.79      0.79      0.79       583



We can see that automated machine learning slightly improves the final performance of the model: accuracy rises from 0.78 to 0.79 and macro-averaged F1 from 0.80 to 0.81. It is also worth noting that TPOT has many configurable parameters, such as scoring and population_size, and these can greatly affect the performance of the final model.

Also, except on small datasets, TPOT typically runs for hours or even days. One useful property of TPOT is that it can be interrupted at any time.

Finally, AutoML is not guaranteed to improve the performance of the model.
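
Since the gain from a single train/test split can be within noise, one way to check whether a candidate model really helps is to compare cross-validated scores. The following is a minimal sketch on synthetic data; the two candidates and the name `cv_scores` are assumptions for illustration, not the pipelines evaluated above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (hypothetical), not the steel plates dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "extra_trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
}

# Score each candidate with 5-fold cross-validation on macro F1.
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1_macro")
    for name, model in candidates.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the difference between the mean scores is smaller than their spread across folds, the improvement is probably not meaningful.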
