## Goals

This project was inspired by my understanding that companies often have many roles which are at the same time technical and non-technical. These roles are formally business-side (as opposed to IT-side), and should therefore not be technical, but in practice they require strong technical competencies.
People working in these roles usually have a strong STEM background, understand statistics and machine learning, and use them daily. They also write their own SQL code. 
*However, they mostly do not code in Python or R for data analysis and advanced analytics, relying instead on (expensive) commercial tools.*

**My first goal, therefore, was to produce a visual wrapper around Python for data science.** The aim was not to produce a beautiful GUI, but to estimate how difficult it would be to come up with a GUI that would be natural to use in data science projects. This GUI would have to enable analysts to perform usual data analysis, model building, and model evaluation tasks through a point-and-click interface. This program offers a graphical interface around many *numpy*, *pandas*, and *scikit-learn* functionalities.

Second, I wanted to experiment with linear regression, for which purpose **I built a tool that generates data for regression analysis (supervised learning) with an exactly known probability distribution.** In this way I could completely control and understand the performance of machine learning algorithms.

Third, I wanted to **evaluate the viability and performance of a one-click machine learning solution.** For this purpose **I built a simple automated machine learning tool** which I named optiLearn, as it finds the optimal machine learning model in the space of all available features (variables) and model parameters.

**All three goals were very successfully achieved** and wrapped in a *not-entirely-ugly* graphical user interface which is quite natural to use for the given tasks.

## Layout overview

The machine learning **GUI consists of four areas dedicated to different types of operations**.

The entire left side of the main window, under the bold text "*Select data source*", is dedicated to data selection or generation of (synthetic) data.

The right side is split into three vertical areas. 
On top, under the title "*Select and train model*" is the area for machine learning model selection, train-test splitting, and model training.

The middle area holds the model-tuning controls and the model-scoring dashboard. They come under the titles "*Tune model parameters*" and "*Model scoring and evaluation*", respectively.

Finally, the last line on the bottom says "*Make the magic happen*", and this functionality actually performs automated machine learning. Upon clicking the optiLearn button, the application performs automated feature selection and automated parameter tuning (which is implicitly also model selection), and then outputs a dialog window with details about the best found model.

Let us quickly review each of the four areas.

## Data sources

The area on the left has a drop-down menu on top. In the above image the menu shows the selection "Generate data".
When "Generate data" is selected, the button below the drop-down menu says "Refresh". Upon clicking that button, new data is generated according to the request entered in the three edit forms below, and it is displayed in the table at the bottom.

If the three forms are empty, as they are in the picture, clicking the "Refresh" button generates 100 records of two predictive variables and the target variable, and displays the data in the table.

**Let's see how the application builds synthetic datasets for regression analysis (supervised learning) problems.** 

In real-world applications, you would collect variables that you suspect could be used to predict the values of some other variable of interest. For example, you could try to calculate the price of an apartment from the number of its rooms, *number_of_rooms*, its construction date, *date*, and a variable *is_smoker* saying whether the current owner is a smoker. By building a model, you could discover that, say, the price increases with the number of rooms (so a 4-room apartment is two times as expensive as a 2-room apartment, all other things being equal), that it goes down rapidly with age, and that it is independent of whether the owner is a smoker. You would conclude that the first two variables are predictive and the last one is redundant, and you could describe your pricing model to someone by giving them the "coefficients" of the first two variables. In the above example, the coefficient of the variable *number_of_rooms* would be very close to 1.

We simulate the above real-world situation as follows.

The forms allow the user to specify the type of data. In particular, one can determine the number of **predictive features**, the number of **redundant features**, and the number of records that will be generated.
**Target is generated as a linear combination of predictive features, with some Gaussian noise sprinkled on top.**
Going back to the apartment-pricing example, this is *as if* we knew exactly which variables determine the price of the apartment and to what extent. All that is left is some Gaussian noise coming from people randomly achieving slightly higher or lower prices in the market.

Both predictive and redundant features are drawn from the same distribution, which is a Gaussian centered at zero, with a rather large (fixed) variance. **Redundant features do not participate in the calculation of the target**, which is precisely what makes them redundant. They are here just to confuse the machine learning model: they play the role of unpredictive, irrelevant data that you might have in your database and might try to use for modeling, not knowing that it is unpredictive for the problem at hand. It is like the *is_smoker* variable in the apartment-pricing example.

When the new data is generated, it is shown in the main table. The header will be populated with column names, which will be called *p1*, *p2*, and so on for predictive variables, and *r1*, *r2*, etc. for redundant variables. 

The linear coefficients used to get the target from the predictive features are displayed in the "*Model scoring and evaluation*" part (area 3 in the picture on top), in the line saying "*True coefficients are: { x, y, z, ...}*".
Note that there are as many coefficients as there are predictive features. Redundant features do not enter the calculation of the target, so they have no true coefficient; equivalently, their true coefficient is zero.

The goal of the model builder will be precisely to build a model that best recovers these **true coefficients**, as this is literally how the target is built from the predictive features.
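The data generation described above can be sketched in a few lines of numpy. This is a minimal illustration rather than the application's actual code: the function name, the feature and noise standard deviations, and the range from which the true coefficients are drawn are all assumptions.

```python
import numpy as np
import pandas as pd

def generate_regression_data(n_predictive=2, n_redundant=0, n_records=100,
                             feature_std=10.0, noise_std=1.0, seed=None):
    """Return (data, true_coefficients) for a synthetic linear regression problem."""
    rng = np.random.default_rng(seed)
    n_features = n_predictive + n_redundant
    # Predictive and redundant features alike come from the same zero-centered Gaussian.
    X = rng.normal(0.0, feature_std, size=(n_records, n_features))
    # Only predictive features get a coefficient (assumed here: uniform in [-5, 5]).
    true_coefficients = rng.uniform(-5.0, 5.0, size=n_predictive)
    # Target = linear combination of the predictive columns + Gaussian noise.
    noise = rng.normal(0.0, noise_std, size=n_records)
    y = X[:, :n_predictive] @ true_coefficients + noise
    columns = ([f"p{i + 1}" for i in range(n_predictive)]
               + [f"r{i + 1}" for i in range(n_redundant)])
    data = pd.DataFrame(X, columns=columns)
    data["target"] = y
    return data, true_coefficients

data, true_coefficients = generate_regression_data(3, 2, 50, seed=0)
```

Because the redundant columns never touch `y`, any coefficient a model assigns to them is pure estimation noise.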

Going back to the "Generate data" drop-down menu, the second available selection is "Load CSV". Upon choosing this option, the button below the menu changes caption to "Select data", and all of the parameter input forms get greyed out.

Clicking the "Select data" button opens the usual file selection dialog, allowing you to browse the file system and select any *.csv* file to import into the app. When you open a *.csv* file from the disk, the data is imported and shown in the table at the bottom of GUI area 1, and the column headers are populated with the headers in the *.csv* file.
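For illustration, reading a *.csv* file and separating headers from rows might look roughly like this with pandas. Whether the application uses `pd.read_csv` internally is an assumption, and the in-memory text below is only a stand-in for a file on disk.

```python
import io
import pandas as pd

# Stand-in for a file on disk; pd.read_csv accepts a file path just as well.
csv_text = "rooms,age,price\n2,10,100\n4,10,200\n"
data = pd.read_csv(io.StringIO(csv_text))

headers = list(data.columns)   # would populate the table header
rows = data.values.tolist()    # one list per table row
```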

## Model selection and training

In order to reach the goals mentioned in the beginning we clearly want to have a number of different models to choose from. At the same time, there is no need to offer every single model available in scikit-learn. This is particularly important because different models have different prerequisites and are evaluated in different ways.

Since I didn't want to build a big comprehensive solution offering every imaginable machine learning tool, I decided to limit the application to linear regression models. The drop-down menu in the area 2 of the GUI offers **four linear regression models** to choose from:
* Linear regression
* Ridge regression
* Lasso regression
* ElasticNet

Even though ElasticNet covers all of the remaining models as special cases (i.e. as sub-spaces of its parameter space), there are two things to consider: on a theoretical level these models have different justifications and rationale for using them, and on a practical level the ElasticNet implementation in scikit-learn becomes unstable in the parameter space covering linear regression (see [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html) for more details). So at least on a practical level it is necessary to make the distinction.

At the theoretical level, the difference between the models can be described as follows. The regression models learn by minimizing one of the objective functions in the following class of functions:

$$ \frac{1}{2 n} |y-\beta\cdot X|^2 + \alpha_1 \left\|\beta\right\|_{L1} + \frac{1}{2}\alpha_2 \left\|\beta\right\|_{L2}^2 $$

Here $n$ is the number of records, $y$ is the vector of target values, $\beta\cdot X$ are the model predictions and $\beta$ are the learned weights. So the first term in the objective function is the (rescaled) sum of squared errors. The second and third terms penalize large weights (coefficients). L1 and L2 are the names of two commonly used norms: the L1 norm is the sum of the absolute values of the weights (Manhattan distance), and the L2 norm is the usual Euclidean distance, which enters the objective squared.

**Different regression models are obtained by specifying the objective function to be minimized**.
* Linear regression minimizes sum of squared errors (so $\alpha_1 = \alpha_2 = 0$)
* Ridge regression uses L2 regularization (so $\alpha_1=0$ and $\alpha_2>0$)
* Lasso regression uses L1 regularization ($\alpha_1>0$ and $\alpha_2=0$)
* ElasticNet uses the most general objective function ($\alpha_1>0$ and $\alpha_2>0$)
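As a rough sketch, the four menu entries could be mapped onto scikit-learn estimators as follows, reading the L2 penalty as the squared Euclidean norm, as scikit-learn does. The function `make_model` and its arguments are hypothetical and not taken from the application. One detail deserves attention: scikit-learn's `Ridge` does not divide the squared error by $2n$, so matching the objective above requires rescaling $\alpha_2$ by the number of records.

```python
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

def make_model(name, alpha_1=0.0, alpha_2=0.0, n_records=1):
    """Build a scikit-learn estimator from the L1 penalty alpha_1 and L2 penalty alpha_2."""
    if name == "Linear regression":
        return LinearRegression()
    if name == "Ridge regression":
        # Ridge minimizes |y - Xb|^2 + alpha * |b|^2 (no 1/(2n) factor),
        # so the objective above corresponds to alpha = n * alpha_2.
        return Ridge(alpha=n_records * alpha_2)
    if name == "Lasso regression":
        # Lasso uses the same 1/(2n) scaling as the objective above.
        return Lasso(alpha=alpha_1)
    if name == "ElasticNet":
        alpha = alpha_1 + alpha_2
        return ElasticNet(alpha=alpha, l1_ratio=alpha_1 / alpha)
    raise ValueError(f"Unknown model: {name}")

model = make_model("ElasticNet", alpha_1=0.3, alpha_2=0.1)
```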

In sklearn, **ElasticNet is unstable when**

$$\frac{\alpha_1}{\alpha_1+\alpha_2} \rightarrow 0$$

This happens when the model reduces to Ridge regression ($\alpha_1 \rightarrow 0$ with $\alpha_2 > 0$). Simple linear regression, where both penalties vanish, is a further limiting case.

Once the user chooses the model, they can train it by clicking the "Train" button. Note that every time "Train" is clicked, slightly different results will be reported in the area dedicated to "Model scoring and evaluation". This is because every time "Train" is clicked, the data is randomly split into a train and a test subset, and then the model is trained on the train set and evaluated on both sets. You can control the share of data used for the test set by changing the value in the input at the bottom of area 2 of the GUI.
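The effect of the random split can be reproduced with scikit-learn's `train_test_split`. The snippet below is a sketch on made-up data; the helper `train_once` and the choice of `Ridge` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(0.0, 10.0, size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.0]) + rng.normal(0.0, 1.0, size=200)

def train_once(X, y, test_share=0.2):
    # No random_state is passed, so every call draws a fresh random split,
    # which is why repeated training reports slightly different scores.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_share)
    model = Ridge(alpha=1.0).fit(X_train, y_train)
    # score() returns R^2 for scikit-learn regressors.
    return model.score(X_train, y_train), model.score(X_test, y_test)

r2_train, r2_test = train_once(X, y)
```

Passing a fixed `random_state` to `train_test_split` would make the reported scores reproducible between runs.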

## Model tuning, scoring, and evaluation

The third part of the GUI, area 3 in the picture above, allows the user to choose model parameters. Depending on the choice of the model, only the applicable parameters will be enabled, and the other parameter(s) will be greyed out.

After you manually choose the $\alpha_1$ and/or $\alpha_2$ coefficients and allocate a certain share of the data to the test set, you can run your model by clicking the "Train" button. The results will be shown in the lower part of area 3 of the GUI, under the title "Model scoring and evaluation".

In particular, the **GUI displays the $R^2$ score achieved on the train and the test set**. Recall that $R^2$ is calculated as:
$$R^2 = 1 - \frac{u}{v}$$
where $u$ is the residual sum of squares:
$$ u = \sum(y_{true} - y_{prediction})^2 $$
and $v$ is the total sum of squares:
$$ v = \sum(y_{true} - \bar{y}_{true})^2 $$

In the above expressions $y_{true}$ is the target variable, $y_{prediction}$ is the model prediction, and $\bar{x}$ denotes the mean of $x$.

Just as a reminder, **the best possible score is $R^2=1$**, reached by a perfectly predictive model. The score is 0 for a baseline model that always predicts the average of $y$, and it can be negative for models that do worse than this baseline.
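The three regimes of the score are easy to check numerically. The helper below is a direct transcription of the formula, written only for illustration (scikit-learn's `r2_score` computes the same quantity).

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    u = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    v = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - u / v

y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r2_score_manual(y, y)                             # 1.0
baseline = r2_score_manual(y, np.full_like(y, y.mean()))    # 0.0
worse = r2_score_manual(y, y[::-1])                         # -3.0
```

In the last case the predictions are the targets in reverse order, so $u = 20$ and $v = 5$, which gives $R^2 = -3$.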

In addition to the $R^2$-score, we display the true coefficients of the predictive variables used to generate the data, and the coefficients estimated by the machine learning algorithm for all variables. If the dataset contains $P$ predictive features and $R$ redundant features, then ideally the first $P$ estimated coefficients should be very close to the true coefficients, and the next $R$ coefficients should be as close as possible to zero.

The model evaluation part of the GUI contains one final feature, which is the button "Visualize". Upon clicking it, you will see **a dialog screen showing an analysis of residuals**. 

Residuals are the differences between the true target values and the predictions. They should be close to zero over the full range of predicted values, and they allow us to see how the model performs in any given range of the target variable.
This can be explored in the **residual plot** at the top of the residuals analysis dialog screen, which shows residuals normalized to the predicted value. (The vertical axis shows residuals as a percentage of the predicted value, keeping track of the correct sign.)

In the lower part of the residuals analysis window you will see a histogram showing **distribution of residuals**. This should be bell-curve like and centered around zero. The larger the dataset size, the more pronounced this shape will be with data generated inside the application.
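A stand-alone version of these two plots could look roughly as follows. This is a sketch on made-up predictions, not the dialog code from the application: the function name is hypothetical, residuals are taken as true value minus prediction, and the off-screen `Agg` backend is used only so the snippet runs without a display (inside a PyQt window a Qt backend would be used instead).

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; an assumption for running headless
import matplotlib.pyplot as plt
import numpy as np

def plot_residuals(y_true, y_pred, bins=30):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    # Signed residuals as a percentage of the predicted value;
    # predictions very close to zero will blow this ratio up.
    relative = 100.0 * residuals / y_pred
    fig, (ax_top, ax_bottom) = plt.subplots(2, 1, figsize=(6, 8))
    ax_top.scatter(y_pred, relative, s=10)
    ax_top.axhline(0.0, color="black", linewidth=1)
    ax_top.set_xlabel("Predicted value")
    ax_top.set_ylabel("Residual (% of prediction)")
    ax_bottom.hist(residuals, bins=bins)
    ax_bottom.set_xlabel("Residual")
    ax_bottom.set_ylabel("Count")
    fig.tight_layout()
    return fig, residuals

rng = np.random.default_rng(1)
y_pred = rng.uniform(50.0, 150.0, size=500)
y_true = y_pred + rng.normal(0.0, 2.0, size=500)
fig, residuals = plot_residuals(y_true, y_pred)
```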

## Automated machine learning

The last part of the GUI, area 4, is dedicated to automated machine learning functionality.
For this purpose, **I designed a simple methodology to determine the optimal choice of data, model, and model parameters, given the original dataset**.

Upon generating new data with $P$ predictive features, $R$ redundant features, and $n$ records, one can click on the "optiLearn" button to "make the magic happen".

The optiLearn methodology is quite straightforward and logical, but at the same time it might be a little complex to explain.
You can think of it as consisting of a couple of loops:
* there is the outermost feature-selection loop
* next, there is internal parameter-selection (model selection) loop
* finally, there is inner-most k-fold cross-validation step to help with model selection

The outermost loop is a feature selection loop. It starts with the entire dataset shown in the table in area 1 of the GUI. This table has both predictive and redundant features. Having found the best possible model for the entire dataset, optiLearn will drop the least predictive feature and try to build a model on the reduced dataset. If a better model is found using fewer variables, then the process continues and optiLearn attempts to remove the next least-predictive variable. **The process continues until the balance between model simplicity and model performance is reached, and the most parsimonious model is returned as the best choice.**
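A single step of this elimination can be sketched as below. How the application ranks features is not spelled out here, so the rule used in the sketch (drop the feature with the smallest absolute fitted coefficient, which is reasonable when all features share the same scale) is an assumption, as are the function name and the fixed ElasticNet parameters.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def drop_least_predictive(X, y, names, alpha=0.1, l1_ratio=0.5):
    """Fit ElasticNet, then return (reduced X, kept names, dropped name)."""
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(X, y)
    # Assumed ranking rule: smallest |coefficient| means least predictive.
    weakest = int(np.argmin(np.abs(model.coef_)))
    keep = [i for i in range(X.shape[1]) if i != weakest]
    return X[:, keep], [names[i] for i in keep], names[weakest]

rng = np.random.default_rng(2)
X = rng.normal(0.0, 10.0, size=(300, 3))
# The third column never enters the target, so it is redundant.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 1.0, size=300)
X_reduced, kept, dropped = drop_least_predictive(X, y, ["p1", "p2", "r1"])
```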

Now, for the given dataset, optiLearn enters the model-selection loop. In the model selection loop it determines new values for model parameters $\alpha_1$ and $\alpha_2$. optiLearn works with ElasticNet only, so it will scan for the optimal model in the parameter space spanned by these two variables. As boundary cases it can reproduce Lasso and Ridge-like models.

Having fixed the dataset, and having fixed model parameters, we want to estimate model quality. This is the only slightly-more-tricky part, as 5 model versions are used to estimate the model quality.
In particular, **optiLearn uses k-fold cross-validation for this purpose**, with $k=5$. 

For given model parameters, optiLearn will split the train set into five subsets, and then use each of the 5 possible combinations of 4 subsets to train a new model. Each model is tested on the remaining, fifth, subset of data. This results in 5 models, each trained on roughly 80% of the train set and validated on the remaining 20% (the validation set).
**The average score of these five models is taken as model score for the given choice of parameters and data.**
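In scikit-learn terms, this step corresponds to something like the following. It is a sketch on made-up data; in particular, using $R^2$ as the per-fold score and shuffling the folds are assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(3)
X_train = rng.normal(0.0, 10.0, size=(200, 4))
y_train = X_train[:, :2] @ np.array([2.0, -1.0]) + rng.normal(0.0, 1.0, size=200)

model = ElasticNet(alpha=0.1, l1_ratio=0.5)
folds = KFold(n_splits=5, shuffle=True, random_state=0)
# One score per fold: train on four subsets, validate on the fifth.
fold_scores = cross_val_score(model, X_train, y_train, cv=folds)
cv_score = fold_scores.mean()   # the score assigned to this parameter choice
```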

If the model results in a score that outperforms the best score seen so far, then the new model and the current choice of variables (data) become the new best choice.

At this point we can change the model parameters, and repeat the k-fold cross-validation step, looking for the new best model for the given dataset. **optiLearn runs over approximately 1000 combinations of $\alpha_1$ and $\alpha_2$ parameters logarithmically spaced in intervals $0<\alpha_1\le 1$ and $0<\alpha_2\le 1$**.
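A grid of this kind can be built with `np.logspace`. The lower bound of $10^{-4}$ and the 32 × 32 layout (1024 combinations) are assumptions chosen to land near the quoted size; only the upper bound of 1 comes from the description above.

```python
import numpy as np

# Assumed lower bound 1e-4; the upper bound 1 is taken from the text.
alpha_1_values = np.logspace(-4, 0, 32)
alpha_2_values = np.logspace(-4, 0, 32)

grid = []
for a1 in alpha_1_values:
    for a2 in alpha_2_values:
        alpha = a1 + a2          # scikit-learn's overall penalty strength
        l1_ratio = a1 / alpha    # share of the penalty that is L1
        grid.append((alpha, l1_ratio))
```

Each `(alpha, l1_ratio)` pair can be passed directly to `ElasticNet`. Since both penalties are strictly positive, `l1_ratio` stays strictly between 0 and 1.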

Finally, having found the model with the best parameters, given the current data, we eliminate the least predictive variable, and repeat the process on the reduced dataset.
Once the best choice of model and predictive features has been found, the best model is scored on the test set - which is the first time that it sees this data.

An important point is that more complex models, with more variables at their disposal, could outperform simpler models. On the other hand, such models are more difficult to interpret and understand. What optiLearn tries to achieve is to understand which variables are non-predictive, and remove them from the model.
**A simpler model, with a score almost as good as a more complex model, is considered more parsimonious, and can be preferred based on this quality.**

In order to come up with good, parsimonious, models, **optiLearn employs Bayesian Information Criterion or BIC**.
Calculating the BIC of a linear model is a topic of its own, and is quite interesting, but for now I will just refer the reader to [this](http://statweb.stanford.edu/~jtaylo/courses/stats203/notes/selection.pdf) Stanford lecture and provide only a high-level view.

Page 11 gives the usual definition of BIC in terms of log-likelihood, as

$$ BIC(\mathcal{M}) = -2\log{\mathcal{L}(\mathcal{M})} + p(\mathcal{M})\ \log{n}$$

where $\mathcal{M}$ stands for a particular model, $\mathcal{L}$ is the likelihood, $p(\mathcal{M})$ is the number of parameters (coefficients) in the model, and $n$ is the number of records in the dataset.

Likelihood of a "correct" linear model is simply a Gaussian function of its residuals (going back to the residual distribution plot), from which we can calculate maximum likelihood estimators of the coefficients $\beta$ of the model, and of the variance. Using this log-likelihood in the $BIC$ formula above we can arrive at an expression similar to the expression for Akaike Information Criterion (or AIC) shown on page 13 of the reference.

(I use BIC instead of the AIC because it results in more parsimonious models, i.e. it imposes feature-selection more strongly.)
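To make this concrete: for Gaussian errors, plugging the maximum-likelihood variance $\hat{\sigma}^2 = RSS/n$ into the log-likelihood gives $-2\log\mathcal{L} = n\log(RSS/n)$ plus a constant that is the same for every model fitted to the same data. With the standard BIC penalty $p(\mathcal{M})\log n$, a score can then be computed as in the sketch below. The function `gaussian_bic` is written for illustration and is not the application's `getBIC`.

```python
import math
import numpy as np

def gaussian_bic(y_true, y_pred, n_params):
    """BIC of a linear model with Gaussian errors, up to a model-independent constant."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    rss = float(np.sum((y_true - y_pred) ** 2))
    # -2 log L at the maximum-likelihood variance rss / n is n * log(rss / n) + constant.
    # A perfect fit (rss == 0) is not handled in this sketch.
    return n * math.log(rss / n) + n_params * math.log(n)

# Same fit quality, one extra parameter: the larger model gets the higher (worse) BIC.
bic_small = gaussian_bic([1, 2, 3, 4], [1, 2, 3, 5], n_params=2)
bic_large = gaussian_bic([1, 2, 3, 4], [1, 2, 3, 5], n_params=3)
```

Lower values are better, so an extra coefficient is only worth keeping if it reduces the residual sum of squares enough to offset the additional $\log n$.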

**Long story short, the models are chosen based on the BIC score, and the BIC score on the test set is shown in the dialog that you see upon completion of the optiLearn model search.** For comparison with the manually found models, you can also see the model's $R^2$ score on the test set.

Additionally, you will see the optimal values of parameters $\alpha$ and *$l_1$-ratio*, which are scikit-learn parameters for the ElasticNet regressor:

$$ \alpha = \alpha_1 + \alpha_2 $$
$$ l_1\text{-ratio} = \frac{\alpha_1}{\alpha} $$

Admittedly, this is not the most convenient form to compare to the manually selected models, but the transformation is easy to do by hand.
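Inverting the two relations above gives $\alpha_1 = \alpha \cdot l_1\text{-ratio}$ and $\alpha_2 = \alpha\,(1 - l_1\text{-ratio})$. As a small helper (the function name is hypothetical):

```python
def to_alpha_pair(alpha, l1_ratio):
    """Convert scikit-learn's (alpha, l1_ratio) into the (alpha_1, alpha_2) used in this article."""
    alpha_1 = alpha * l1_ratio
    alpha_2 = alpha * (1.0 - l1_ratio)
    return alpha_1, alpha_2

# alpha = 0.4 with l1_ratio = 0.75 corresponds to alpha_1 = 0.3 and alpha_2 = 0.1
# (up to floating-point rounding).
alpha_1, alpha_2 = to_alpha_pair(alpha=0.4, l1_ratio=0.75)
```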

Finally, regarding feature-selection, selected predictive features are provided by name. **In all tests I have done so far, optiLearn has successfully removed all redundant variables and has kept all predictive features!**
The last line of the output of optiLearn is a list of learned coefficients - these can be compared to the coefficients learned by the manually chosen model, and to the list of true coefficients. 
Note that optiLearn will probably produce much simpler models than manual model selection, as manual model selection always keeps all variables: even if redundant variables have very small coefficients, they still enter the model and therefore also the interpretation of results.

## Application architecture

In this final section I would like to briefly comment on the design of the application itself. **I created the GUI in PyQt**, which is the reason it doesn't look as beautiful as some other modern GUIs. On the other hand, for quick prototyping of a desktop GUI application it served its purpose nicely.

Main graphical elements such as labels, drop-down menus and push buttons all have PyQt classes which are simple to use.
The only things which were slightly more complicated to achieve were making the table work, connecting with matplotlib (pyplot) for residuals analysis, and achieving dynamic behavior (such as the button in area 1 of the GUI changing purpose when the drop-down menu choice changes).

If you use the app you will notice that **I didn't invest much time in handling exceptions and border cases**. I did pay attention to some obvious things: for example, the share of data reserved for the test set can be entered as *10%, 10* or *0.1*. All of these seem like a natural way to say "ten percent", so they will all work. On the other hand, a lot of other error handling is missing, since I'm just assuming the user will not input nonsense into the program. That's good enough for the current proof-of-concept.
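One way to accept all three spellings is sketched below. The function is hypothetical, and the rule that a bare number of 1 or more is read as a percentage is an assumption about how such input could be disambiguated.

```python
def parse_test_share(text):
    """Turn user input such as '10%', '10' or '0.1' into a fraction between 0 and 1."""
    cleaned = text.strip()
    is_percent = cleaned.endswith("%")
    value = float(cleaned.rstrip("%").strip())
    # Assumption: an explicit percent sign, or a bare number >= 1, means percent.
    if is_percent or value >= 1.0:
        value /= 100.0
    if not 0.0 < value < 1.0:
        raise ValueError(f"Test share must be between 0% and 100%, got {text!r}")
    return value

shares = [parse_test_share(s) for s in ("10%", "10", "0.1")]
```

All three inputs end up as the fraction 0.1, while something like `"100"` or `"abc"` raises a `ValueError` that the GUI could turn into a message for the user.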

**As for the application architecture, it uses model-view-controller pattern**, see [here](https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller) for one description.

The ***view* part of the pattern** is achieved by two classes.

There is a ```TableView``` class that inherits from PyQt's *QTableWidget* class. It only has two simple methods to set and update data in the table displayed in area 1 of the GUI.

The main GUI class is called ```appGUI``` and it handles both the main window and the dialogs that pop up when you want to display the analysis of residuals or when you use optiLearn. It contains methods to create the different areas of the GUI:
* ```_createDataSelection``` creates area 1, so everything related to data loading or data generation.
* ```_createControls``` creates area 2, namely everything responsible for manual model selection and tuning.
* ```_createReport``` creates area 3, the part that displays $R^2$ scores, true coefficients and estimated coefficients. It also contains the part related to residuals analysis.
* ```_createOptiLearn``` is responsible for the optiLearn part of the main window, and for the pop-up dialog.

The ***controller* part of the pattern** is achieved by one class, ```Controller```. This class has access to both the main GUI object and the model objects. The controller controls all actions via the ```_connectSignals``` method. In this method, all dynamic objects such as buttons and drop-down menus have their **signals** mapped to **slots** of other objects, GUI or otherwise. In this way, when you click a button somewhere, something happens. Most signals such as "button clicked" or "drop-down menu activated" are mapped to slots that are other methods within the ```Controller``` class. These methods perform some relatively complex tasks:
* ```_updateData``` updates data in the GUI and in the model parts of the architecture.
* ```_loadCSV``` handles opening a *.csv* file from disk.
* ```_selectDataSource``` configures area 1 depending on whether loading data from disk or in-app data generation was selected in the drop-down menu.
* ```_trainModel``` trains the machine learning model and displays model scores.
* ```_setTunableParameters``` controls which parameters are tunable, depending on which regression model is chosen in the drop-down menu.
* ```_plotResiduals``` and ```_optiLearn``` do the obvious things.

Finally, **the *model* part of the pattern** contains two classes. One is ```provideData```, used to generate data within the application, and the other is ```MLmodel```, responsible for all things related to machine learning.

```MLmodel``` has the following methods:
* ```updateData``` cleans up existing state (erases model, scores, etc.) and prepares new data.
* ```splitTrainTest``` wraps around scikit-learn's ```train_test_split``` to produce the train set and the test set from the entire dataset.
* ```trainModel``` performs model training and scoring; it trains a different model depending on the selection in the drop-down menu.
* ```modelQuality``` simply returns model scores and coefficients.
* ```optiLearn``` and ```getBIC``` together realize optiLearn: the former is responsible for all the "magic", and the latter is a helper function used to get the Bayesian Information Criterion to score different models in the model selection process.
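The division of labour between the three parts can be illustrated without any Qt at all. In the sketch below, plain lists of callbacks stand in for Qt signals, and all class and method names are simplified stand-ins rather than the application's actual classes. In PyQt the wiring line would instead use the signal's `connect` method, along the lines of `button.clicked.connect(self._trainModel)`.

```python
class Model:
    """Holds state and does the work; knows nothing about the GUI."""
    def __init__(self):
        self.score = None

    def train(self, test_share):
        # Placeholder for real training: a made-up score for illustration only.
        self.score = 1.0 - test_share
        return self.score

class View:
    """Displays things and exposes 'signals' as lists of callbacks."""
    def __init__(self):
        self.train_clicked = []   # stand-in for a button's clicked signal
        self.report = ""

    def click_train(self):
        # Emitting the signal means calling every connected slot.
        for slot in self.train_clicked:
            slot()

class Controller:
    """Connects view signals to slots that drive the model and update the view."""
    def __init__(self, view, model):
        self.view = view
        self.model = model
        self._connect_signals()

    def _connect_signals(self):
        self.view.train_clicked.append(self._train_model)

    def _train_model(self):
        score = self.model.train(test_share=0.2)
        self.view.report = f"score = {score:.2f}"

view, model = View(), Model()
controller = Controller(view, model)
view.click_train()   # view.report is now "score = 0.80"
```

The point of the pattern is visible even at this scale: the view never calls the model directly, and the model never touches a widget.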

## Conclusion

In this article, I've given you a relatively detailed overview of the machine learning GUI application that I wrote as a small personal project. **The goal of the project was to have a proof-of-concept for a GUI-based automated machine learning application**. Another goal was to have a simple way to generate controlled data with known distribution in order to test machine learning algorithms and feature-selection functionalities. Finally, I wanted to build a one-click machine learning solution which would find the best subset of data and the best model at the same time.

I believe that after this thorough overview you can agree with me that **the result is a success**, and that the end product is quite cool.

**Let's recap.** I've shown you how the data can be imported from the disk, or generated within the app. Next, we've seen how machine learning models can be built via the GUI in a traditional way on the available data. This approach allows for manual selection of the train-test split and of the model parameters. There were a number of model-scoring indicators, and there was also support for the analysis of residuals. Finally, I've described in detail a simple methodology that I developed for automated machine learning. It consists of feature selection and model selection loops.

I'd like to briefly discuss some **things that I didn't include in the app, but which would be natural extensions of the project**.

The optiLearn methodology is very easy to extend with an additional step: before going into feature selection, we could also generate many new features by simple operations such as log-transformations. Such an additional step would make the **feature engineering** part even more comprehensive.

Regarding data preprocessing, I didn't include any tools for handling **missing data**, but such a step could easily be included both for manual workflow and for automated machine learning via optiLearn. Similarly, for variables which have $\mathcal{O}(n)$ distinct values in a dataset with $n$ records, it is usually useful to "bin" distinct values into histogram-like bins. It is a process that can be automated as well.

Regarding the model catalogue, I restricted the app to linear models only, but it is a simple matter to add **more machine learning models**. This, however, would require the reporting part of the app to be more complex, since different models are analyzed in different ways (using different metrics and asking different questions).

All-in-all I would say that the application is rather self-contained and does its job perfectly. At the same time, there are many interesting and natural ways in which it can be extended.

I would like to **thank you** for taking the time to read this article and I sincerely hope it was interesting and useful to you.
As always, you can find the code on the accompanying GitHub page. **All it takes to test out the application yourself is to download the script and run it with Python!**