# <center>DATA SCI 423 | Final Project</center>

# <center>Understanding Alloy Steel Composition-Property <br/> Relationships Using Machine Learning</center>

## <center> Raymond Wang, Yangdongling Liu </center>

## I. Introduction

We study composition-property relationships in alloy steels using machine learning models. A correlation heatmap shows the degree of correlation among alloy steel compositions and properties, and identifies several strongly correlated quantities (e.g. hardness and tensile strength). The performance of machine learning algorithms in predicting composition-property relationships is benchmarked using grid search, and XGBoost outperforms the other methods for alloy steels. Further analysis shows that XGBoost predicts thermal conductivity from chemical composition with satisfactory accuracy, and that model performance can be improved with more training data. Our findings suggest that machine learning methods could provide more insight into alloy steel composition-property relationships than relying on human physical intuition or experience.

<img src="workflow.png" alt="drawing" width="500"/>

## II. Data Acquisition

The project starts by acquiring materials data from online resources. The author developed automated scraping code in Python that collects information on materials meeting the search criteria and saves it to local CSV files.

Materials data is collected from \href{http://www.matweb.com/index.aspx}{MatWeb}, where alloy steels containing manganese, chromium, and nickel are set as the target materials for scraping. Each material entry spans four columns and multiple rows, as shown in Table \ref{tab:web_data}. Data under `English' uses different units from data under `Metric'; this project only processes data in metric units.

Physical properties (e.g. bulk modulus, thermal conductivity) in different units, chemical compositions (e.g. weight of Mn, Ni, Cr), and the potentially helpful comments, are saved to local files with unchanged format. Since there is no data-editing involved in this step, the local data is guaranteed to be the same as its online version.

One of the technical challenges during data acquisition is IP blocking by the web server. MatWeb blocks an IP address (possibly permanently) once it detects excessive access within a short period of time. In the author's experience, one IP address gives access to information on 100$\sim$200 materials per day. One advantage of the developed code is its high portability: the only two dependencies are the \texttt{selenium} and \texttt{pandas} Python packages, which can be installed easily. To accelerate the project, the author used six machines with Unix-like systems for development and production. Eventually the program extracted 855 alloy steels that meet the criteria. The code (\hyperlink{scrape}{\texttt{scrape.py}}) is available in SI.

Simple post-processing is performed once all the data has been saved locally. This step fixes file names and contents that contain non-UTF-8 characters and removes unwanted line breaks. Relevant code (\hyperlink{pprocess}{\texttt{post\_processing.py}}) is available in SI.
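The post-processing step can be illustrated with a minimal sketch. This is not the content of \texttt{post\_processing.py}; the function names `clean_text` and `safe_filename`, the Latin-1 fallback, and the interpretation of "unwanted line breaks" as runs of blank lines are all assumptions made for illustration.

```python
import re


def clean_text(raw: bytes) -> str:
    """Decode scraped bytes to text and remove stray blank lines."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: Latin-1 maps every byte to a character, so it never fails.
        text = raw.decode("latin-1")
    # Normalise Windows and old-Mac newlines, then collapse runs of blank lines.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return re.sub(r"\n{2,}", "\n", text).strip()


def safe_filename(name: str) -> str:
    """Replace runs of characters other than ASCII letters, digits, '.', '_', '-' with '_'."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("_")
```

For example, `safe_filename("AISI 4340 Steel, annealed.csv")` gives a name without spaces or commas, which is easier to handle across the different machines used for scraping.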

In [None]:
# put some code here
import seaborn as sn

## III. Featurization 

The data collected from the last step is still string-based, e.g. 97% is interpreted as a sequence of characters instead of a floating point number. Therefore, it is necessary to convert these strings to machine-readable form before any further data analysis.

Since there are multiple material properties and not all of them are available for every material, instances with missing values will be dropped. To keep a relatively large training set (e.g. several hundred instances), we only convert and keep physical properties with more than 100 measured data points. These physical properties are: density, hardness (Vickers), thermal conductivity, specific heat capacity, CTE-linear, electrical resistivity, elongation at break, bulk modulus, modulus of elasticity, shear modulus, Poisson's ratio, tensile strength at yield, and tensile strength at ultimate. Ten element types are considered: Fe, Mn, Cr, Ni, Mo, Cu, C, S, Si, and P.

Floating point numbers are extracted from the string data and stored in a \texttt{pandas.DataFrame}. The converted dataset is shown in Table \ref{tab:featurization}. Unavailable data points are converted to \texttt{nan} instead of `N/A'. The `\%' sign and other units are dropped so that only the floating point numbers remain. Since the data will be standardized before numerical processing, the only requirement is that all data in the same column share the same unit or percentage sign. The code (\hyperlink{featurize}{\texttt{featurize.py}}) is available in SI.
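The string-to-number conversion described above can be sketched with a regular expression. This is an illustrative sketch, not the code in \texttt{featurize.py}; the helper name `to_float` and the choice to keep only the first number in a cell (e.g. the lower bound of a range) are assumptions.

```python
import math
import re

# Matches the first signed decimal number, with an optional exponent, in a string.
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def to_float(cell: str) -> float:
    """Return the first number found in `cell`, or nan when there is none."""
    # Remove thousands separators so that "1,200 MPa" is read as 1200.
    match = _NUMBER.search(cell.replace(",", ""))
    return float(match.group()) if match else math.nan


# Hypothetical raw cells in the style of the scraped data.
raw = ["97 %", "7.85 g/cc", "N/A", "<= 0.035 %"]
values = [to_float(cell) for cell in raw]
```

A list of such values per column can then be passed to \texttt{pandas.DataFrame}, where \texttt{nan} marks the unavailable entries.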

In [None]:
# put some code here
import pandas as pd

## IV. Correlation Analysis

The size of the dataset after featurization is 726$\times$23 (726 instances and 23 features). It is impractical for humans (at least for the author) to learn patterns directly from such a large amount of data. The author therefore starts by examining the correlation pattern of these variables. Instances with \texttt{nan} entries are dropped from the dataset, and eventually 254 materials are used to generate the heatmap, as shown in Figure \ref{fig:heatmap}. The code (\hyperlink{correlation}{\texttt{correlation.py}}) can be found in SI.
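The two steps behind the heatmap, dropping incomplete instances and computing pairwise correlations, can be sketched in \texttt{pandas}. The table below is a toy stand-in with made-up values, and the use of the Pearson coefficient (the \texttt{pandas} default) is an assumption, since the text does not name the coefficient.

```python
import pandas as pd

# Toy stand-in for the featurized table; the values are made up for illustration.
df = pd.DataFrame({
    "C": [0.2, 0.4, 0.6, 0.8, None],
    "Hardness": [150.0, 200.0, 250.0, 300.0, 320.0],
    "Elongation": [30.0, 25.0, 20.0, 15.0, 12.0],
})

complete = df.dropna()   # keep only instances where every column is available
corr = complete.corr()   # pairwise Pearson correlation coefficients
```

The resulting square matrix `corr` is the kind of object that a plotting call such as `seaborn.heatmap(corr)` would render as a heatmap.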

Some of the findings agree well with the way elements contribute to alloy steel properties as reported online, e.g. carbon decreases ductility of steel, and could lead to a small shear modulus. However, the information from correlation analysis is more qualitative than quantitative. If we want more quantitative descriptions of the materials composition-property relationship, more sophisticated methods are needed. In the next section we discuss alloy steel data analysis using machine learning models.

In [None]:
# put some code here
import seaborn as sn

<img src="CorrelationHeatmap.png" alt="drawing" width="800"/>

## V. Benchmarking Machine Learning Model Performance

The author systematically benchmarks the performance of different machine learning algorithms in predicting various physical properties.

Each training run takes one physical property as the target and treats the remaining 12 properties plus the aforementioned 10 element types as input features. Instances with \texttt{nan} are dropped from the dataset. Numerical data (both X and y) are standardized using \texttt{StandardScaler} from \texttt{sklearn}.
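The arrangement of targets and features described above can be sketched as a loop over properties. The generator name `iter_targets` and the miniature table are hypothetical; the real dataset has 13 property columns and 10 element columns.

```python
import pandas as pd


def iter_targets(df, properties):
    """Yield (target, X, y), using each property in turn as the target."""
    complete = df.dropna()  # drop every instance that has a missing value
    for target in properties:
        X = complete.drop(columns=[target])  # remaining properties plus composition
        y = complete[target]
        yield target, X, y


# Hypothetical miniature table: two element columns and two property columns.
toy = pd.DataFrame({
    "Fe": [97.0, 96.5, 95.0],
    "C": [0.2, 0.4, None],
    "Density": [7.85, 7.84, 7.80],
    "Hardness": [150.0, 200.0, 250.0],
})
splits = {name: (X, y) for name, X, y in iter_targets(toy, ["Density", "Hardness"])}
```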

Ten machine learning algorithms are benchmarked here:
\begin{enumerate}
	\item Linear Regression	
	\item Least-angle regression with Lasso
	\item Kernel Ridge
	\item Linear SVR
	\item SGD Regression
	\item MLP Regressor
	\item AdaBoost Regression
	\item Random Forest Regression
	\item Gradient Boosting Regression
	\item Extreme Gradient Boosting
\end{enumerate}

The dataset is divided into training (75\%) and testing (25\%) parts. Grid search is used for hyper-parameter tuning. The hyper-parameter sets are shown in Table \ref{tab:hyperparam}. The optimal hyper-parameter setting is determined by 5-fold cross-validation performed on the training data only, with the R2 score as the metric. Relevant code (\hyperlink{ml}{\texttt{benchmark.py}}) can be found in SI. Results are shown in Figure \ref{fig:benchmark}.
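The split, standardization, and grid search can be sketched for one of the benchmarked algorithms. This is a sketch on synthetic data, not the code in \texttt{benchmark.py}. It uses \texttt{LassoLars} with the `alpha` grid from Table \ref{tab:hyperparam}; fitting the scalers on the training part only and the fixed `random_state` are assumptions that the text does not specify.

```python
import numpy as np
from sklearn.linear_model import LassoLars
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the alloy steel table.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)

# 75% training, 25% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Standardize X and y; the scalers are fitted on the training part only (assumption).
x_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler().fit(y_train.reshape(-1, 1))
X_train_s, X_test_s = x_scaler.transform(X_train), x_scaler.transform(X_test)
y_train_s = y_scaler.transform(y_train.reshape(-1, 1)).ravel()
y_test_s = y_scaler.transform(y_test.reshape(-1, 1)).ravel()

# 5-fold cross-validated grid search on the training data, scored by R2.
param_grid = {"alpha": [1e-4, 1e-3, 1e-2, 0.1, 1]}
search = GridSearchCV(LassoLars(), param_grid, cv=5, scoring="r2")
search.fit(X_train_s, y_train_s)

# Evaluate the refitted best model on the held-out test part.
test_r2 = search.score(X_test_s, y_test_s)
```

After fitting, `search.best_params_` holds the selected `alpha`. The same pattern applies to the other algorithms by swapping the estimator and its parameter grid.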

The y-axis is the calculated root mean square error (RMSE) divided by the difference between the maximum and minimum values in the target data. This normalization step is essential for directly comparing the performance of different machine learning algorithms in predicting quantities at various scales. The \texttt{sklearn} built-in performance scores (e.g. R2-score) are not used for this comparison.
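The normalized error on the y-axis, $\mathrm{RMSE} / (y_{\max} - y_{\min})$, can be written as a short helper. The function name `normalized_rmse` is hypothetical, and taking the range from the true target values is an assumption about which data the text refers to.

```python
import numpy as np


def normalized_rmse(y_true, y_pred) -> float:
    """RMSE divided by the range (max minus min) of the true target values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return float(rmse / (y_true.max() - y_true.min()))


# The same relative error at two very different scales gives the same score.
small = normalized_rmse([0.0, 10.0], [1.0, 9.0])
large = normalized_rmse([0.0, 10000.0], [1000.0, 9000.0])
```

Because the score is invariant to the scale of the target, a property measured in GPa and one measured in W/m-K can share an axis.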

Interestingly, some trends can be identified across different algorithms as well as physical properties. In general, linear regression, Lasso, linear SVR and SGD algorithms lead to larger variance in normalized RMSE, while the rest have relatively better performance. XGBoost algorithm achieves the best performance among the 10 benchmarked methods.

The physical properties are divided into two panels due to the different scales of RMSE. We find it hard to obtain good predictions for hardness and tensile strength, which may not be a coincidence: the correlation heatmap shows that hardness is strongly and positively correlated with the tensile strengths. This may suggest that these quantities depend on other factors not considered here, such as processing and microscopic structure. Such features would need to be included in order to build better machine learning models.

<img src="Benchmark.png" alt="drawing" width="500"/>

## VI. Learning with XGBoost

We further investigate machine learning model performance on a more challenging problem. In the previous section, a single property is predicted using both the alloy chemical composition and the other known physical properties. In practical situations, however, the other properties are sometimes also unknown (e.g. for a new family of alloy steels). Would machine learning be able to distill the composition-property relationship to facilitate alloy steel design?

We choose the Extreme Gradient Boosting (XGBoost) algorithm for further analysis of the alloy steel composition-property relationship. This time the features only include composition information, i.e. the weight percentage of each element. The performance of XGBoost in independently predicting 11 different physical properties is shown in Figure \ref{fig:compo-prop}. The R2-score is used as the performance metric in this case. The implementation of this part is very similar to that in \hyperlink{ml}{\texttt{benchmark.py}} with some small tweaks. The code is not attached in SI but can be found on the author's \href{https://github.com/raymond931118/scraping}{GitHub}.

Among the 11 tested properties, 8 do not show a strong relationship to composition: both the training score and the test score are below 0.5. Specific heat capacity and electrical resistivity get a high score in training but score poorly in testing. This probably means XGBoost does not capture the underlying relationship from the training set and only fits the training data numerically. In other words, the model suffers from overfitting. Note that the R2-score can be negative, which means the model is even worse than a horizontal line through the mean target value.
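The remark about negative scores follows from the definition $R^2 = 1 - SS_{\mathrm{res}} / SS_{\mathrm{tot}}$. The sketch below is a plain-Python version of that standard definition, written for illustration rather than taken from the project code.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y = [1.0, 2.0, 3.0]
perfect = r2_score(y, [1.0, 2.0, 3.0])    # exact predictions
baseline = r2_score(y, [2.0, 2.0, 2.0])   # always predicting the mean
reversed_ = r2_score(y, [3.0, 2.0, 1.0])  # worse than predicting the mean
```

Predicting the mean everywhere gives a score of zero, so any model with a larger residual sum of squares than that baseline scores below zero.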

XGBoost performs well in predicting thermal conductivity from composition information, where the test score is even higher than the training score. To validate this finding, the learning curve for predicting thermal conductivity is shown in Figure \ref{fig:learning_curve}. As the number of training samples increases, the cross-validation score quickly increases and converges toward the high training score. This suggests that adding more training samples can further improve the performance of the XGBoost model in predicting thermal conductivity.
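A learning curve of this kind can be computed with \texttt{sklearn}'s `learning_curve`. The sketch below runs on synthetic data and uses `GradientBoostingRegressor` as a stand-in for XGBoost so that it depends only on \texttt{sklearn}; the training sizes, the number of estimators, and the 5-fold setting are assumptions, not the settings behind the figure.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import learning_curve

# Synthetic data: three "composition" features and a nonlinear target.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)

# Stand-in for XGBoost (assumption made to keep the example dependency-light).
model = GradientBoostingRegressor(n_estimators=50, random_state=0)

# With 5-fold CV on 200 samples, each training fold has at most 160 samples.
sizes, train_scores, val_scores = learning_curve(
    model, X, y, train_sizes=[32, 80, 160], cv=5, scoring="r2"
)

train_mean = train_scores.mean(axis=1)  # mean training R2 per training size
val_mean = val_scores.mean(axis=1)      # mean cross-validation R2 per training size
```

Plotting `train_mean` and `val_mean` against `sizes` gives the two curves of a learning-curve figure; the gap between them at the largest size indicates how much room additional data might still close.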

<img src="XGBoostR2.png" alt="drawing" width="500"/>

<img src="XGBoostLearncurve.png" alt="drawing" width="300"/>

## VII. Conclusions

We find that machine learning can provide both qualitative and quantitative insights into alloy steel composition-property relationships. Gradient-boosting-based algorithms outperform the other machine learning algorithms in our benchmark. XGBoost is particularly successful in predicting thermal conductivity from alloy chemical composition, and its performance can be improved by adding more training samples.

## Supporting Information

\begin{table*}[t]
\begin{ruledtabular}
\caption{Hyper-parameters used in different machine learning algorithms for grid search.\label{tab:hyperparam}}
\centering
\begin{tabular}{ccc}
\sffamily Algorithm & \sffamily Parameter & \sffamily Values \\
\hline
Linear Regression & default & default \\
\hline
Lasso & `alpha' & \{ 1E-4, 0.001, 0.01, 0.1, 1 \}\\
\hline
& `kernel' & \{`linear', `poly', `rbf', `sigmoid' \} \\
Kernel Ridge & `alpha' & \{ 1E-4, 1E-2, 0.1, 1 \} \\
& `gamma' & \{ 0.01, 0.1, 1, 10 \} \\
\hline
Linear SVR & `C' & \{ 1E-6, 1E-4, 0.1, 1 \} \\
& `loss' & \{ `epsilon\_insensitive', `squared\_epsilon\_insensitive'\} \\
\hline
SGD & `alpha' & \{1E-6, 1E-4, 0.01, 1 \} \\
& `penalty' & \{`l2', `l1', `elasticnet'\} \\
\hline
& `activation' & \{`logistic', `tanh', `relu'\} \\
MLP & `solver' & \{`lbfgs', `adam', `sgd'\} \\
& `learning\_rate' & \{`constant', `invscaling', `adaptive'\} \\
\hline
AdaBoost & `n\_estimators' & \{10, 100, 1000\} \\
& `learning\_rate' & \{0.01, 0.1, 1, 10\} \\
\hline
& `n\_estimators' & \{10, 100, 1000\} \\
RandForest& `min\_weight\_fraction\_leaf' & \{0.0, 0.25, 0.5\} \\
& `max\_features' & \{`sqrt', `log2', None\} \\
\hline
& `n\_estimators' &  \{10, 100, 1000\}\\
GradBoost & `min\_weight\_fraction\_leaf' &  \{0.0, 0.25, 0.5\}\\
& `max\_features' &  \{`sqrt', `log2', None\}\\
\hline
& `n\_estimators' &  \{10, 50, 100, 250, 500, 1000\}\\
& `learning\_rate' &  \{1E-4, 0.01, 0.05, 0.1, 0.2\}\\
XGBoost & `gamma' &  \{0, 0.1, 0.2, 0.3, 0.4\}\\
& `max\_depth' &  \{6\}\\
& `subsample' &  \{0.5, 0.75, 1\}\\
\end{tabular}
\end{ruledtabular}
\end{table*}