# The 2025-2026 Computing Challenge: Machine Learning Classification for Materials Science


This year the Computing Challenge revolves around writing code to build **machine learning classifiers** for materials science applications. Machine learning has become an essential tool in modern materials science, enabling researchers to predict material properties, discover new materials, and understand complex relationships between structure and function. Applications include **predicting material properties**, **accelerating materials discovery**, **optimizing synthesis conditions**, and **classifying materials based on their characteristics**.

You are part of an engineering firm, Materials.AI.ML, whose main expertise is analysing client databases to help clients **reduce the cost of testing**. In this project, a client has two requests, each with an associated dataset (Dataset 1 and Dataset 2). Solving the client's problems will require you to train, test and present the results of a dataset-specific algorithm.

1. **Dataset 1**: For each sample of a new 5-element alloy, you are given 10 features:

> 1. Density (g/cm³)
> 2. Vacancy content (fraction)
> 3. Melting temperature (K)
> 4. Heat conductivity (W/m·K)
> 5. Band gap (eV)
> 6. Crystallinity index
> 7. Thermal expansion coefficient (1/K)
> 8. Young's modulus (GPa)
> 9. Hardness (GPa)
> 10. Lattice parameter (Å)


measured by the client. For each sample, the client also measured electrical conductivity and then labelled the sample as conductive (1) or non-conductive. For unexplained reasons, measuring conductivity for these samples is extremely complicated and expensive, so the client wants an algorithm that can predict whether or not a sample will be conductive. However, measuring the other features comes with a cost too, although it is much smaller (and the same for each feature). Based on your results, you have to **come up with a final recommendation** about which features are best to measure in order to reduce costs while retaining the highest possible classification accuracy.

2. **Dataset 2**: In a secret project, you are only given the values of 8 features measured for a new compound, and a label for classification. In this case, the client simply wants you to build an algorithm with the best possible classification accuracy. Additionally, the client wants to know the **smallest number of datapoints required to obtain a classifier with 70% accuracy, if it is at all possible**.

To answer these questions, you will need to consider data analysis and preparation, then decide which classifiers to train, evaluate their performance, and visualize the results.


### Suggestions and things to consider

You need to implement classifiers for both datasets. While you may use machine learning libraries (like scikit-learn) for the actual classifier implementation, you should understand what they do and structure your code appropriately. Also, remember to use your brain and materials knowledge when considering what is the best course of action, or you risk wasting valuable resources. 

**In general, you should**:
> 0) Make sense of the data and have a look at it before you start implementing an ML algorithm.

> 1) Clean the data if necessary. 

> 2) Avoid complexity if something simple works just as well. At equal performance, simpler == better.

> 3) Always measure performance by using appropriate metrics (accuracy, precision, recall, F1-score).

> 4) Investigate feature importance. You can do this by:
>    - Training the classifier and examining the learned weights/coefficients
>    - Training with and without certain features and comparing performance
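
The two approaches to feature importance listed above can be sketched as follows. This is only an illustration under stated assumptions: the data are a synthetic stand-in generated with `make_classification` (not the client data), logistic regression is one possible choice of classifier, and the helper `make_model` is a hypothetical name. Features are standardised first so that coefficient magnitudes are comparable across features with different units.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the client data (assumption): 300 samples, 10 features.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)


def make_model():
    """Hypothetical helper: scaling followed by a linear classifier."""
    return make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))


# Approach 1: inspect the learned coefficients (one weight per feature).
model = make_model().fit(X, y)
coefs = model.named_steps["logisticregression"].coef_.ravel()
ranking = np.argsort(-np.abs(coefs))  # feature indices, largest |weight| first

# Approach 2: retrain without each feature and compare cross-validated accuracy.
baseline = cross_val_score(make_model(), X, y, cv=5).mean()
drops = {}
for j in range(X.shape[1]):
    X_without = np.delete(X, j, axis=1)  # remove column j
    score = cross_val_score(make_model(), X_without, y, cv=5).mean()
    drops[j] = baseline - score  # positive value: accuracy fell without feature j

print("Features ranked by |coefficient|:", ranking)
print("Accuracy drop when each feature is removed:", drops)
```

The two rankings need not agree, for example when features are correlated with each other, so it is worth comparing them before making a recommendation.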

**Important**: You should structure your code using classes. For example, you might have:
- A `Classifier` class (or separate classes for different classifier types)
- An `Evaluator` class for computing metrics and visualizing results
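
One possible way to lay out such classes is sketched below. It is not the required structure: the method names (`fit`, `predict`, `metrics`, `confusion`), the default logistic regression estimator, and the synthetic data from `make_classification` are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class Classifier:
    """Thin wrapper: feature scaling followed by a scikit-learn estimator."""

    def __init__(self, estimator=None):
        if estimator is None:
            estimator = LogisticRegression(max_iter=1000)  # assumed default
        self.model = make_pipeline(StandardScaler(), estimator)

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def predict(self, X):
        return self.model.predict(X)


class Evaluator:
    """Computes the metrics named in the text for a binary problem."""

    def metrics(self, y_true, y_pred):
        return {
            "accuracy": accuracy_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred, zero_division=0),
            "recall": recall_score(y_true, y_pred, zero_division=0),
            "f1": f1_score(y_true, y_pred, zero_division=0),
        }

    def confusion(self, y_true, y_pred):
        # Rows are true classes, columns are predicted classes.
        return confusion_matrix(y_true, y_pred)


# Usage on synthetic stand-in data (assumption, not the client data).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = Classifier().fit(X_train, y_train)
y_pred = clf.predict(X_test)

evaluator = Evaluator()
report = evaluator.metrics(y_test, y_pred)
cm = evaluator.confusion(y_test, y_pred)
print(report)
print(cm)
```

Passing a different scikit-learn estimator to `Classifier(...)` lets the same evaluation code be reused when comparing several classifier types.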


### Evaluation and Visualization

You need to evaluate your classifiers and visualize the results:

> 1) **Confusion matrices**: Create confusion matrices for both datasets to see which classes are confused with each other.

> 2) **Learning curves**: For Dataset 2, plot accuracy as a function of training set size to understand how much data is needed.

> 3) **Feature importance visualization**: For Dataset 1, create a bar plot showing the importance/weight of each feature in the classifier.
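
As a minimal sketch of the learning-curve idea for Dataset 2, scikit-learn's `learning_curve` function can compute cross-validated accuracy at several training set sizes. The data below are synthetic (an assumption, with 8 features to mirror Dataset 2), logistic regression is just one possible classifier, and the variable names `target` and `smallest_size` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for Dataset 2 (assumption): 400 samples, 8 features.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           random_state=0)

# Cross-validated accuracy at 8 training set sizes, from 10% to 100%
# of the data available for training in each fold.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5, scoring="accuracy", shuffle=True, random_state=0)

mean_val = val_scores.mean(axis=1)  # average over the 5 folds
std_val = val_scores.std(axis=1)    # spread over the folds

# Smallest tested training size whose mean validation accuracy reaches 70%.
target = 0.70
reached = np.flatnonzero(mean_val >= target)
smallest_size = int(sizes[reached[0]]) if reached.size else None

for n, m, s in zip(sizes, mean_val, std_val):
    print(f"n_train={n:4d}  accuracy={m:.3f} +/- {s:.3f}")
print("Smallest tested size reaching the target:", smallest_size)
```

The arrays `sizes` and `mean_val` can then be passed to a plotting library to draw the curve. The answer obtained this way is only as fine-grained as the tested sizes, and the fold-to-fold spread in `std_val` indicates how much to trust it.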


## Material to be Submitted

> 1) **The code prepared to solve the challenge**, implementing classifiers, evaluation, and visualization

> 2) **For Dataset 1**:
>    - A very short (<<500 words) written report providing your recommendation to the client regarding which measurements would reduce cost (see previous text), with an explanation of the rationale for this recommendation. The explanation might require tables or plots; this is up to you.

> 3) **For Dataset 2**:
>    - A very short (<<500 words) written report providing your recommendation to the client
>    - Confusion matrices for each classifier you tested, with a note on any interesting observations or difficulties encountered, and on why you think a given classifier worked best, if any did.
>    - A comparison plot showing accuracy of different classifiers tested
>    - A learning curve plot (accuracy vs. training set size). **Use this plot to answer the client question.**

## Submission and Marking

**Only a single person per group will have to submit the code and the plots on behalf of the whole group**. 

However: 

**each single person should separately submit a peer-review** of the contribution of different team members (more below). 

The submission must be done via Blackboard.

The peer-review evaluation, which can be submitted as a Word or text file, should be named "peer-review evaluation" and contain the names of your group members and, for each person, a mark of 0, 50 or 100 evaluating their contribution to the project (obviously, do not rate your own contribution...).

The final total mark you will receive will be 70% of the group mark for the exercise + 30% coming from the peer evaluation. **If a person has more than a single peer evaluation of 0%, this person will have their total mark set to 0**, unless specific mitigating circumstances can be provided.

### Marking Criteria

> 1. Implementation of the overall Machine Learning pipeline and its different parts. The code should implement the various steps in a way that translates this problem into correct algorithms to solve it. This includes proper data analysis, train-test splitting, classifier training, and evaluation. **20 out of 100 Marks**

> 2. **Appropriate structuring of the code**. The code must be organized into classes with clear responsibilities (e.g., Preprocessor, Classifier, Evaluator classes). **20 out of 100 Marks**

> 3. **Use of the appropriate functionalities in scikit-learn**. **20 out of 100 Marks** 

> 4. Analysis of the results obtained, including the correct recommendation to the client (and the reasoning behind it). **40 out of 100 Marks** 

## Some Remarks and How to Go About Solving This Challenge

- There is **not** a single solution in terms of how to structure the code. However, well-structured code will probably be divided into classes with specific attributes. To build a step-by-step approach to the problem, you might have:
  - A `Preprocessor` class that handles data cleaning / normalisation and train-test splitting
  - A `BinaryClassifier` class (or a more general `Classifier` class)
  - An `Evaluator` class that computes metrics and creates visualizations

- In implementing the code, make use of libraries like scikit-learn to write compact code; you do not need to re-code things from scratch.

- Before implementing any algorithm, take some time to reflect on which algorithm might be best, what assumptions different algorithms might make regarding the data, or even whether or not they are needed at all. The focus is on understanding, not on code implementation (which should be almost trivial).

- The general approach I would suggest is the following:

1. First, **read the whole text of the problem from start to finish**. If you got to this point, you have basically done it already!
2. Try and visualize the data provided to make sense of it. Consider how different features might be useful for making predictions, or not. 
3. Only after carefully considering the data, try implementing different algorithms. Start with the simplest ones that make sense given the data; do not go immediately to the most complex one!
4. Once you have trained your algorithms, always visualise the results. Visualisation helps with understanding and with deciding the next step, or when to stop.
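
Before any plotting, a quick numerical look at the data (step 2 above) can already flag issues that need cleaning. The sketch below is a hypothetical helper, `summarize`, written with NumPy only; the tiny array it is applied to is made up for illustration and is not the client data.

```python
import numpy as np


def summarize(X, y):
    """Return class balance, missing-value counts and per-class feature means."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    return {
        # How many samples carry each label (imbalance affects accuracy).
        "class_counts": {c.item(): int(n) for c, n in zip(classes, counts)},
        # Number of NaN entries in each feature column.
        "missing_per_feature": np.isnan(X).sum(axis=0).tolist(),
        # Mean of each feature within each class, ignoring NaN entries.
        "class_means": {c.item(): np.nanmean(X[y == c], axis=0).tolist()
                        for c in classes},
    }


# Made-up toy data (assumption): 4 samples, 2 features, one missing value.
X_toy = [[1.0, 10.0],
         [2.0, np.nan],
         [3.0, 30.0],
         [4.0, 40.0]]
y_toy = [0, 0, 1, 1]

summary = summarize(X_toy, y_toy)
print(summary)
```

Features whose per-class means barely differ relative to their spread are candidates for closer inspection, although a small difference in means alone does not prove that a feature is uninformative.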