# AMLD project: Bone fracture localization and classification

Team members:
- Jieyu Lian
- Søren Mingon Esbensen
- Sebastian Faurby

## 1) Introduction

### Central Problem and Domain

This project explores a real-world problem by implementing advanced machine learning methods. We chose object classification and localization in the domain of medical imaging. Accurate detection and localization of fractures in musculoskeletal X-ray images is essential for assisting radiologists in timely diagnosis. Manual interpretation is often challenging because fracture patterns can be subtle and small. In this project, we explore deep learning-based object detection methods to automate this task, comparing the performance of **YOLOv11**, a real-time lightweight model, with **Faster R-CNN (ResNet50 + FPN)**, a high-accuracy two-stage model.

Our goal is to explore the scope of clinical applications of these models, taking detection accuracy and computational efficiency into account. The project therefore falls under the domain of medical image analysis and computer-aided diagnosis (CAD). Deep convolutional neural networks are applied to grayscale radiographs to detect and localize fracture regions automatically.

### Data Characteristics
We use the **FracAtlas** dataset (Abedeen et al., 2023), a benchmark collection of annotated musculoskeletal radiographs, which includes:
- *Image types*: Grayscale radiographs of different bones and joints.
- *Annotations*: YOLO-format bounding boxes with binary class labels (fractured vs non-fractured).
- *Challenges*:
  - Class imbalance: fractured cases are less frequent.
  - Fractures are often small and difficult to localize.
  - Limited contrast and high noise due to real-world clinical settings.
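
To make the annotation format concrete, the sketch below converts one YOLO-format label line (class index, then box centre, width and height, all normalised by the image size) into pixel corner coordinates. The function name `yolo_to_corners` and the example label are illustrative assumptions and are not taken from the dataset.

```python
def yolo_to_corners(label_line, img_w, img_h):
    """Convert 'cls xc yc w h' (normalised) to (cls, x_min, y_min, x_max, y_max) in pixels."""
    cls, xc, yc, w, h = label_line.split()
    xc, yc, w, h = float(xc), float(yc), float(w), float(h)
    x_min = (xc - w / 2) * img_w
    y_min = (yc - h / 2) * img_h
    x_max = (xc + w / 2) * img_w
    y_max = (yc + h / 2) * img_h
    return int(cls), x_min, y_min, x_max, y_max


# Hypothetical label: a box centred in a 400 x 200 image, covering half of each side.
print(yolo_to_corners("0 0.5 0.5 0.5 0.5", img_w=400, img_h=200))
# (0, 100.0, 50.0, 300.0, 150.0)
```

Corner coordinates of this kind are what most plotting and IoU routines expect, so a conversion like this is usually the first step when inspecting annotations.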

---

#### References

- Wei, W., Huang, Y., Zheng, J., Rao, Y., Wei, Y., Tan, X., OuYang, H. YOLOv11-based multi-task learning for enhanced bone fracture detection and classification in X-ray images. Journal of Radiation Research and Applied Sciences 18(1), 101309 (2025). ISSN 1687-8507. https://doi.org/10.1016/j.jrras.2025.101309

- Abedeen, I., Rahman, M.A., Prottyasha, F.Z. et al. FracAtlas: A Dataset for Fracture Classification, Localization and Segmentation of Musculoskeletal Radiographs. Sci Data 10, 521 (2023). https://doi.org/10.1038/s41597-023-02432-4

## 2) EDA



#### Overview
- 4,083 X-ray scans from 3 major hospitals in Bangladesh that have been manually annotated for bone fracture classification, localization and segmentation.


#### Distribution


- There are 717 images with fractures (17.6%), containing 922 fracture instances, and 3,366 images without fractures (82.4%), which indicates a clear imbalance between fractures and non-fractures.

<img src="EDA%20plots/label_distribution_2.png" alt="Label Distribution" width="40%">
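
The percentages above follow directly from the image counts. The short calculation below reproduces them and also derives two quantities that are not stated in the text but follow from the same counts: the average number of annotated fractures per fractured image and the ratio of non-fractured to fractured images. The variable names are our own.

```python
# Counts as reported in the distribution summary above
n_fractured_images = 717
n_non_fractured_images = 3366
n_fracture_instances = 922

n_total = n_fractured_images + n_non_fractured_images
fractured_share = n_fractured_images / n_total
boxes_per_fractured_image = n_fracture_instances / n_fractured_images
imbalance_ratio = n_non_fractured_images / n_fractured_images

print(f"total images: {n_total}")
print(f"fractured share: {fractured_share:.1%}")
print(f"fractures per fractured image: {boxes_per_fractured_image:.2f}")
print(f"non-fractured per fractured image: {imbalance_ratio:.1f}")
```

A ratio of this size is one reason to stratify the data split and to look at precision and recall rather than plain accuracy.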



#### Distribution of body parts of fractures and non-fractures 
- We see a clear imbalance in the distribution of body parts: there are very few hips and shoulders compared to legs and hands.

<div style="display: flex; gap: 10px;">
  <img src="EDA%20plots/fractured_body_parts_2.png" alt="Fractured Body Parts" width="45%">
  <img src="EDA%20plots/non_fractured_body_parts_2.png" alt="Non-Fractured Body Parts" width="45%">
</div>



#### Inspection 

The picture shows 5 X-rays with fractures, including bounding boxes around the fractures, and 5 X-rays without fractures.

What is worth noting from the X-rays is:
- Varying box sizes
- Rotations and flips
- Noise, such as labels for the radiologists
- Varying pixel intensity

<img src="EDA%20plots/example_images_with_bboxes.png" alt="Example images with bounding boxes" width="60%">




#### Embeddings

The purpose of looking at the embeddings is to see whether there are high-level feature differences, such as bone edges, texture patterns or pixel density, between the fractured and non-fractured pictures that the model might be able to find. To compute the embeddings, we used two pretrained networks: the Faster R-CNN ResNet50 backbone and RadImageNet.
- Faster R-CNN ResNet50 is a common architecture that ships with the torchvision package of the PyTorch ecosystem.
- RadImageNet provides networks pretrained specifically on radiological images such as MRI and CT scans.

##### t-SNE
One way to look at the difference between these high-dimensional embeddings is to visualize the two groups, fractures (blue dots) and non-fractures (orange dots), in a 2D space, where each point represents an X-ray.

What is worth noting from the t-SNE is:
- There is a clear overlap between the fractures and the non-fractures, which suggests that the differences between the feature representations are not very distinct.
- We also see distinct clusters, each containing both groups. This could be due to pictures clustering by visual characteristics such as body part (arms, hands, shoulders etc.) and viewing angle.
- RadImageNet seems to find more of a difference between the high-level features of the two groups.

<div style="display: flex; gap: 10px;">
  <img src="EDA%20plots/tsne_embeddings_frcnn.png" alt="t-SNE Faster-RCNN" width="35%">
  <img src="EDA%20plots/t-SNE%20with%20RadImageNet.png" alt="t-SNE RadImageNet" width="35%">
</div>


##### PCA
To further examine the embeddings, we perform a PCA to see whether the data can be linearly separated into groups in a 2-dimensional space.

In the PCA plot we see that:
- There is a heavy overlap between the two classes, and there is no clear linear separation along the first two components.
- The overall variance of the points around the center seems to be rather large, with some outliers. This variance could be explained by different body parts, angles or overexposed pictures rather than by anything class-specific.

<img src="EDA%20plots/pca_embeddings_frcnn.png" alt="PCA of faster R-CNN ResNet50 Embeddings" width="40%">

##### Outliers
To further understand the outliers shown in the PCA plot, Isolation Forest, an unsupervised outlier detection method, is applied to the embeddings.

The flagged X-rays suggest that the outliers are due to:
- Overexposed or underexposed scans.
- Irrelevant body parts, metal implants, other people, casts, low bone density, or two pictures in one scan.


<div style="display: flex; gap: 10px;">
  <img src="EDA plots/embedding_fractured_outlier_1.png" alt="Outlier 1" width="30%">
  <img src="EDA plots/embedding_fractured_outlier_19.png" alt="Outlier 2" width="30%"> 
  <img src="EDA plots/embedding_fractured_outlier_5.png" alt="Outlier 3" width="30%"> 
</div>

<div style="display: flex; gap: 10px; margin-top: 10px;">
  <img src="EDA plots/embedding_fractured_outlier_13.png" alt="Outlier 4" width="30%">
  <img src="EDA plots/embedding_fractured_outlier_11.png" alt="Outlier 5" width="30%"> 
  <img src="EDA plots/embedding_fractured_outlier_8.png" alt="Outlier 6" width="30%"> 
</div>





## 3) Models, training strategies and evaluation

### Models

We compare two object detection models, YOLOv11 and Faster R-CNN (ResNet50 + FPN), on localizing and classifying fractures in radiographs.

#### YOLOv11
The YOLOv11 model is the latest in a series of state-of-the-art models for object detection, instance segmentation, pose estimation and oriented object detection.
Besides convolutional layers, the architecture consists of the following components: C3K2 (Cross Stage Partial with kernel size 2), both in the neck and the head, and SPPF (Spatial Pyramid Pooling - Fast) and C2PSA (Convolutional block with Parallel Spatial Attention), both in the head. These improve the model's performance in several ways, including better feature extraction and faster computation.
The backbone is used specifically for feature extraction, using CNNs to transform an input image into feature maps at various resolutions. The neck is the intermediate processing stage, whose purpose is to aggregate and enhance high-dimensional feature representations. The head is used for predictions and generates the final outputs.
The YOLOv11 models come in various sizes, from nano (n) to extra large (x). Due to computational limitations, YOLOv11s was used.


The loss function to be minimized includes Distribution Focal Loss (DFL), which models each bounding box edge as a distribution over possible locations, bounding box regression loss and class probability loss.

$$
\text{L}_{YOLOv11} = \text{w}_{dfl} \cdot \text{L}_{dfl} + \text{w}_{box} \cdot \text{L}_{box} + \text{w}_{cls} \cdot \text{L}_{cls}
$$

The model was initialized with the default pretrained YOLOv11 object detection weights and then tuned and trained.

The hyperparameters tuned were learning rate, weight decay, classification-loss weight, box-loss weight, dropout, the probability of horizontal flips applied to images before processing, and scale, which controls the zoom ratio applied to the pictures. The picture below shows the original YOLOv11 architecture provided by Ultralytics.


![Image description](models/YOLO/architecture_yolov11.png)


#### Faster R-CNN (ResNet50 + FPN)
Faster R-CNN is a two-stage object detection model that significantly improved on the earlier R-CNN models. The radiograph first passes through the backbone, in this case ResNet50 with FPN, which produces a feature map. At every point on the feature map, anchor boxes of different sizes and aspect ratios are generated, each representing a possible object location. These anchor boxes are the starting points for the predicted bounding boxes. A region proposal network (RPN) then predicts whether each anchor box contains background or an object, and refines the anchor box to better align with objects by predicting offsets.
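
As a rough illustration of anchor generation, the sketch below places a set of anchors at every cell of a small feature map. The sizes, aspect ratios, stride and the cell-centre convention are assumptions chosen for readability; they are not the values used in this project or the exact convention of any particular library.

```python
import math


def anchors_at(cx, cy, sizes=(32, 64, 128), aspect_ratios=(0.5, 1.0, 2.0)):
    """Return anchors (x_min, y_min, x_max, y_max) centred at (cx, cy).

    Each anchor has area size**2 and height/width equal to the aspect ratio.
    """
    boxes = []
    for size in sizes:
        for ratio in aspect_ratios:
            w = size / math.sqrt(ratio)
            h = size * math.sqrt(ratio)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


def grid_anchors(feat_w, feat_h, stride, **kwargs):
    """Generate anchors for every feature-map cell, in input-image pixel coordinates."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Assumed convention: anchors are centred on the middle of each cell.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            anchors.extend(anchors_at(cx, cy, **kwargs))
    return anchors


anchors = grid_anchors(feat_w=4, feat_h=4, stride=16)
print(len(anchors))  # 4 * 4 cells * 9 anchors per cell = 144
```

Even this tiny grid yields 144 candidate boxes, which is why the RPN has to score and filter anchors before the second stage sees them.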

The RPN is trained with two losses:
1. Objectness loss, a binary classification loss that measures how well the model distinguishes between foreground and background.
2. RPN box regression loss, a smooth L1 loss that measures the accuracy of the initial bounding box proposals generated by the RPN.

The proposed regions are then handled by the RoI (Region of Interest) pooling layer, which outputs a fixed-size feature map for each proposal. This approach increases the computational efficiency.

The last part consists of two parallel fully connected layers:
1. A classification head that predicts the class of the object using cross-entropy.
2. A bounding box regression head that refines the coordinates of the detected object using smooth L1 loss.
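
Smooth L1 loss is used for both box regression terms. A minimal pure-Python sketch is shown below; the threshold `beta=1.0` and the example offsets are assumptions for illustration and do not reflect the settings used in training.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 loss over paired coordinates.

    Quadratic for small errors (|diff| < beta), linear for large ones,
    which makes it less sensitive to outlier boxes than a squared error.
    """
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff ** 2 / beta
        else:
            total += diff - 0.5 * beta
    return total / len(pred)


# Hypothetical predicted vs. target box offsets
print(smooth_l1([0.5, 0.0, 3.0, 0.0], [0.0, 0.0, 0.0, 0.0]))
```

In this example the small error of 0.5 is penalised quadratically, while the large error of 3.0 only contributes linearly.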

The loss function to be minimized is then expressed as:

$$
\text{L}_{total} = \text{w}_{cls} \cdot \text{L}_{cls} + \text{w}_{box} \cdot \text{L}_{box} + \text{w}_{obj} \cdot \text{L}_{obj} + \text{w}_{rpn} \cdot \text{L}_{rpn}
$$
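
The weighted sum can be sketched as follows. The short keys (`cls`, `box`, `obj`, `rpn`) mirror the symbols in the equation, and the loss values and weights are made-up numbers; torchvision's detection models return a dictionary of loss terms in training mode, which is the shape this sketch assumes.

```python
def total_loss(losses, weights):
    """Weighted sum of named loss components; missing weights default to 1.0."""
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())


# Hypothetical per-batch loss values and tuned weights
losses = {"cls": 0.40, "box": 0.30, "obj": 0.20, "rpn": 0.10}
weights = {"cls": 1.0, "box": 2.0, "obj": 1.0, "rpn": 0.5}

print(total_loss(losses, weights))
```

Treating the weights as hyperparameters, as done in this project, lets the tuning procedure trade localization quality against classification quality.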

The model was tuned and trained using ResNet50_fpn_v2 default weights. 

The hyperparameters tuned were the learning rate, weight decay, classification loss weight, box regression weight, objectness weight and region proposal weight.

The picture below shows the architecture of the Faster R-CNN model as first proposed by Ren et al. (2015).

![Image description](models/RCNN/architecture_frcnn.png)


### Training setup and strategies
To be aligned with the paper by Wei et al., the input size for both models is set to 416 x 416.
Both models used AdamW as the optimizer and were trained with a batch size of 16.
The data was split into % training, % validation and % test sets. The hyperparameters were then tuned by training on the training set and evaluating on the validation set. For both models, this process was split into two smaller tuning stages:

- In the first stage, a smaller image resolution (320x320) was used, with the purpose of tuning the model to extract the more general features.

- In the last stage, a larger image resolution (416x416) was used and the backbone was frozen, with the purpose of fine-tuning the model to learn the more fine-grained features, including improving the bounding boxes.

### Evaluation
To evaluate the models, well-known metrics for object detection tasks were used.
- mAP50, the mean Average Precision at an Intersection over Union (IoU) threshold of 0.5, to measure the detection accuracy.
- mAP50-95, a stricter version of mAP50 that averages over IoU thresholds from 0.5 to 0.95.
- Precision, the ratio of true positive detections to all predicted positives, i.e. how many of the predicted boxes were correct.
- Recall, the ratio of true positive detections to all ground-truth objects, i.e. how many of the real fractures were found.
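
To show how these metrics relate to box overlap, the sketch below computes the IoU of two boxes and then derives precision and recall for one image by greedily matching predictions to ground-truth boxes at an IoU threshold of 0.5. This is a simplified illustration with made-up boxes, not the evaluation code used in the project; mAP additionally averages precision over confidence thresholds.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predictions, ground_truths, iou_threshold=0.5):
    """Greedy matching of (score, box) predictions, highest score first.

    Each ground-truth box can be matched at most once.
    """
    matched = set()
    tp = 0
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            tp += 1
    fp = len(predictions) - tp
    fn = len(ground_truths) - tp
    precision = tp / (tp + fp) if predictions else 0.0
    recall = tp / (tp + fn) if ground_truths else 0.0
    return precision, recall


# Hypothetical image: two annotated fractures, one good and one spurious prediction
ground_truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [(0.9, (0, 0, 10, 8)), (0.6, (50, 50, 60, 60))]
print(precision_recall(predictions, ground_truths))  # (0.5, 0.5)
```

Here one of two predictions is correct (precision 0.5) and one of two fractures is found (recall 0.5).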


---

**References**
- Wei, W., Huang, Y., Zheng, J., Rao, Y., Wei, Y., Tan, X., & OuYang, H. (2025). *YOLOv11-based multi-task learning for enhanced bone fracture detection and classification in X-ray images*. *Journal of Radiation Research and Applied Sciences*, 18(1), 101309. https://www.sciencedirect.com/science/article/pii/S1687850725000214

- Redmon, J., Divvala, S., Girshick, R. & Farhadi, A., *You Only Look Once: Unified, Real-Time Object Detection*, 2015,
https://doi.org/10.48550/arXiv.1506.02640

- Ren, S., He, K., Girshick, R. & Sun, J., *Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks*, 2015, 
https://doi.org/10.48550/arXiv.1506.01497


## 4) Results

Now that the models have been trained and tuned, this section presents the results and takes a closer look at how well they performed. The results are split into three parts:

1. **Training behaviour**: How did the models learn over time?

2. **Quantitative Evaluation**: How well do they perform on unseen data?

3. **Qualitative examples**: What do the predictions actually look like? How do they perform in different cases?


### 1. Training Behaviour

We start by looking at the learning curves from training. These include the different losses of both models during the final training, for both training and validation data.
Recall that the architecture of these models consists of several parts, each of which contributes a term to the total loss.

#### Faster RCNN - model 1: no regularization

The first Faster R-CNN model was trained for 130 epochs (after hyperparameter tuning) with no regularization. The figure below shows the learning curves of the non-regularized model, with the loss of each model component per epoch. The last three graphs of the figure show the performance metrics that we used for all trained models.

The model we got from the final training was clearly overfitting to the data. In the very beginning the losses fall for both training and validation data, but after only a few epochs the validation losses start to increase while the training losses continue to fall. We knew there was a high chance of this happening since we did not apply any kind of regularization.

The metric graphs in the figure below (the last three graphs at the bottom) give the clearest signs of overfitting: the model obtains almost perfect mAP, precision and recall on the training data, but performs quite poorly on the validation data. The training was stopped after 130 epochs and no further inference was done with this model, since it made more sense to train a new model with regularization.

![Image description](plots/rcnn_training.png)

#### Faster RCNN - model 2: a regularized model

The second Faster R-CNN model was regularized by applying weight decay to the parameter weights. Furthermore, we used data augmentation on the input, adding flips, rotations, changes in contrast and gamma adjustments (non-linear transformations of pixel intensities).
The figure below shows the learning curves on the validation dataset for the regularized Faster R-CNN model. Note that the training curves are missing because an error in our code prevented them from being saved, and we did not have time to retrain the model. Part of the comparison is therefore incomplete, since we are not able to check how much the regularization helped. One thing we could observe is that there is more noise per epoch. This could be a sign of less overfitting, since we added weight decay and more variability to the data. In terms of performance, the model did not perform any better than the first model, and in some cases even worse.
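
For object detection, geometric augmentations such as flips must be applied to the bounding boxes as well as to the pixels. The sketch below shows this for a horizontal flip with a normalised YOLO-style box; the helper names and the toy image are our own, and the project's actual augmentation pipeline may differ.

```python
def hflip_image(rows):
    """Mirror an image stored as a list of pixel rows."""
    return [row[::-1] for row in rows]


def hflip_yolo_box(box):
    """Mirror a normalised box (xc, yc, w, h) around the vertical image axis."""
    xc, yc, w, h = box
    return (1.0 - xc, yc, w, h)


# Toy 2 x 4 image whose right half is bright; the box covers that right half.
image = [[0, 0, 9, 9],
         [0, 0, 9, 9]]
box = (0.75, 0.5, 0.5, 1.0)

flipped_image = hflip_image(image)
flipped_box = hflip_yolo_box(box)
print(flipped_image)  # bright region is now on the left
print(flipped_box)    # (0.25, 0.5, 0.5, 1.0)
```

Intensity-only augmentations such as contrast or gamma changes leave the boxes untouched, so only the geometric ones need this extra step.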

It is also worth noting that the model was only trained for 80 epochs due to lack of time. This will be taken into account when comparing it to the YOLO model, since the comparison is not a fair one. The mAP, precision and recall graphs suggest that the curves might improve if more epochs were applied, but this is not a certainty.

If the model were to be trained again, it would be wise to consider applying other regularization techniques, such as dropout. Also, since the dataset is not very large, one could take advantage of more robust validation methods, such as k-fold cross-validation or stratified k-fold cross-validation, in order to get better hyperparameters during tuning.
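
A minimal sketch of stratified k-fold splitting on image-level labels is given below, using only the standard library. The 1:4 toy class ratio, the seed and the function name are assumptions for illustration; libraries such as scikit-learn offer a ready-made equivalent.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs whose class proportions mirror the full set."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a near-equal share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        val_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train_idx, val_idx


# Toy labels: 1 = fractured image, 0 = non-fractured image
labels = [1] * 20 + [0] * 80
for train_idx, val_idx in stratified_k_fold(labels, k=5):
    n_pos = sum(labels[i] for i in val_idx)
    print(len(train_idx), len(val_idx), n_pos)
```

Each validation fold keeps the same share of fractured images as the full set, so hyperparameters are compared on folds with comparable difficulty.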

![Image description](plots/rcnn_training_reg.png)

#### YOLO model

The YOLOv11 model showed promising performance during training. The figure below presents a set of learning curves, including loss values for different components of the model's loss function (box loss, classification loss, DFL loss, and total loss), as well as evaluation metrics like mAP, precision, and recall on the validation set.

**Loss curves (top row):**
From the loss plots, we can see that the model improves rapidly in the early epochs. Both training and validation losses decrease, though the training losses drop more sharply, which is expected. A visible gap remains between the training and validation curves across all loss types—this suggests some degree of overfitting. However, this gap remains relatively stable over time, rather than widening, which indicates that the model's regularization mechanisms (e.g., dropout, data augmentation, or weight decay) were at least partially effective in controlling overfitting.
In short, while some overfitting is present, it appears to be kept in check, and the model’s performance continues to improve steadily across the training period.

**Metric curves (bottom row):**
The bottom row shows key performance metrics on the validation data per epoch: mAP50, mAP50-95, precision and recall. These metrics help evaluate both classification and localization quality. Notably, both mAP50 and mAP50-95 continue to improve throughout training and have not clearly plateaued by epoch 200. This suggests that the model might have benefited from a longer training period, potentially achieving even higher accuracy, though at the risk of further overfitting.

Precision and recall curves also trend upward, with precision showing more variability. This might indicate sensitivity to specific types of examples (e.g., different fracture sizes or image qualities), which could be worth analyzing further.

![Image description](plots/yolo_training.png)

### 2. Quantitative evaluation

Now that we have trained the final models, it is time to check whether they perform well on unseen (test) data. To check their performance we calculate different metrics for each model on the test dataset. This part of the data has not been given to the model at any point of the training phase.

| Model               |N. epochs| mAP50     | mAP50-95  | Precision | Recall    | F1 Score  |
|---------------------|---------|-----------|-----------|-----------|-----------|-----------|
|Faster RCNN (No Reg.)| 130     | 0.3318    | 0.1258    | 0.6067    | 0.4154    | 0.4932    |
| Faster RCNN (Reg)   | 80      | 0.3992    | 0.1443    | 0.6067    | 0.4154    | 0.4932    |
| YOLOv11             | 200     | 0.5438    | 0.2402    | 0.6534    | 0.5462    | 0.5950    |

\*\* *Reg: Regularised model*

In the table above we observe that the Ultralytics YOLOv11 model performed better than our self-built Faster R-CNN models:

- YOLOv11 outperforms both Faster R-CNN models on every metric: mAP50, mAP50-95, precision, recall, and F1 score.

- The improvements are especially notable in mAP50 and F1 score, indicating more accurate and balanced detection performance overall.

- YOLOv11's higher mAP50-95 score suggests better consistency across different IoU (Intersection over Union) thresholds.
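
The F1 score in the table is the harmonic mean of precision and recall. The snippet below recomputes it from the rounded precision and recall values in the table as a sanity check; small deviations in the last digit are expected because the inputs are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the results table
reported = {
    "Faster RCNN": (0.6067, 0.4154),
    "YOLOv11": (0.6534, 0.5462),
}
for model, (p, r) in reported.items():
    print(f"{model}: F1 = {f1_score(p, r):.3f}")
```

Because the harmonic mean is pulled towards the smaller of the two values, the lower recall of Faster R-CNN weighs heavily on its F1 score.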

### 3. Qualitative examples

To get a view of how the models would work in a production scenario, we look at some example bone scans and compare the model predictions to the ground truth given by the dataset annotations. Note that these predictions have been hand-picked to show the different scenarios. The overall performance of the models is described in the quantitative comparison.

Figures 1 and 2 show different cases for the predictions made on the test data by Faster R-CNN and YOLOv11s, respectively.
* Figures 1a) and 2a) show cases where the models classified and localized the fractures successfully.
* Figures 1b) and 2b) show cases where the models had to detect several fractures. This proved to be tricky for the models in the cases we analysed, especially for Faster R-CNN.
* Figure 1c) shows cases where the model did not find the fractures (false negatives), and Figure 2c) shows cases where the model correctly classified a (true) negative (this prediction is essentially just the scan that was given as input).


![Image description](models/rcnn_examples.png)

![Image description](models/yolo_examples.png)

## 5) Discussion and further studies
#### Fairness of comparison
As stated above, the YOLO model performed better than the Faster R-CNN model across all metrics. At first glance, this comparison shows that YOLOv11 is the better model. While informative, the comparison is not entirely fair due to differences in training conditions. YOLOv11 was trained with more extensive hyperparameter tuning and for 200 epochs, allowing it to better adapt to the dataset and optimize its performance. In contrast, the regularized Faster R-CNN was only trained for 80 epochs, with limited time and computational resources, potentially leaving its performance under-optimized. These disparities in training duration and tuning effort likely contribute to YOLOv11's superior metrics and make it difficult to draw definitive conclusions about the inherent strengths of the architectures themselves.

Therefore, from a purely metric-based perspective the results are of limited use, since the models are not fairly compared. On the other hand, the models could not be fairly compared precisely because of the practical challenges that Faster R-CNN presents in contrast to YOLOv11. For example, the Faster R-CNN model was more complicated to build and required more computational power. As a result, 1) the final Faster R-CNN model was scheduled for training after the YOLO model and 2) with this delayed start it could not be trained for as long as the YOLO model.
However, the learning curves suggest that it was far from certain that the Faster R-CNN model would have come close to the YOLO model in performance even if its training had been prolonged. Therefore, the overall conclusion is that the YOLO model was the better of the two models in this project.

#### Improvements to be made and key takeaways

As we observed in the results, both models were able to classify and localize the fractures to some extent, but there was a lot of room for improvement for both. The study from which we took inspiration reported a mAP50 of 96% using the YOLOv11 model, which indicates that we were far from achieving an optimal solution. It is important to note that their dataset consisted of 15,666 radiographs, of which 47% contained fractures.

For example, we had some issues with overfitting. This was to be expected given the number of observations compared to the number of parameters in the models. To address this issue, we recommend two improvements:

1) More regularization methods: in the results chapter, we observed that the Faster R-CNN model without regularization did not perform well at all on unseen data even though it was better than the regularized model on the training data (this is the textbook case of overfitting). Even with some regularization, both final models showed signs of overfitting, which is why we recommend adding even more regularization to get the models to generalize better. Note that even where the models generalized better, their performance was still poor on validation data. This leads us to recommendation 2.

2) Add more data: this is another clear step to take to combat overfitting. If we add more data and keep the number of parameters constant, we expect the model to learn better from the data and overfit less. Even though we added some transformations and augmentations, these were kept to a minimum. More data augmentation techniques could have been used to add more observations to the dataset. This was clearly a bottleneck for improving the performance of the models. Adding more data would not only help with overfitting, but also with the general classification and localization performance of the models.

Therefore, if one is not able to get more annotated data, as we experienced (only a fraction of the study's data was made freely available), then one of the key takeaways of this project is data augmentation. Trying more ways of augmenting the data could increase the data available for training, and pretraining with autoencoders (AE) or even variational autoencoders (VAE) to capture high-dimensional representations could be a way of denoising the data. Another approach to get more data would be to use GANs to create synthetic images with bounding boxes, as done in Tu et al. This would require adding a generator model that produces "false" data and a discriminator model that is trained to tell the true and false observations apart. The literature has shown that this type of training has led to good results, which is why it might be beneficial to try in further studies.

The data issue could also have been approached from other angles. For example, the data provided included a metadata file containing different parameters that describe each scan of a given bone. One could have used this information as extra input to help the model during training. It was, for example, thanks to this metadata that we were able to identify some of the outliers in the data. This leads us to another point of discussion, which is the handling of outliers. As we observed during EDA, some pictures were flagged as outliers due to overexposure or implants. Handling these outliers, either by removing them, transforming them or downweighting them, might help the model focus on informative patterns.

As shown during EDA, the dataset in this project suffered from high class imbalance, meaning that most of the observations/scans were non-fractures. This was taken into account when splitting the data, which was done with a stratified split. However, during training this was at most partially addressed, and only in the YOLO model, whose loss includes Distribution Focal Loss (DFL). DFL is a focal-style loss, although it acts on box localization rather than on class frequencies. Classification focal losses are known to focus on the hard-to-detect examples. Therefore, it is plausible that a significant improvement could be gained by adding a focal loss variant to the Faster R-CNN model.
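
As an illustration of how a focal-loss term down-weights easy examples, the sketch below implements the standard binary focal loss for a single prediction and compares it with plain cross-entropy. The values `gamma=2.0` and `alpha=0.25` are commonly used defaults and are assumptions here; this is the classification focal loss, not the Distribution Focal Loss that YOLO applies to box regression, and it was not part of the models trained in this project.

```python
import math


def binary_cross_entropy(p, y):
    """Cross-entropy for one prediction. p: probability of the positive class; y: 0 or 1."""
    p = min(max(p, 1e-7), 1 - 1e-7)
    return -math.log(p if y == 1 else 1 - p)


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one prediction: cross-entropy scaled by (1 - p_t) ** gamma."""
    p = min(max(p, 1e-7), 1 - 1e-7)
    p_t = p if y == 1 else 1 - p
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)


# A confidently correct ("easy") and a poorly predicted ("hard") positive example
for name, p in [("easy", 0.95), ("hard", 0.30)]:
    ce = binary_cross_entropy(p, 1)
    fl = binary_focal_loss(p, 1)
    print(f"{name}: cross-entropy = {ce:.4f}, focal = {fl:.4f}")
```

The hard example dominates the focal loss far more strongly than it dominates the cross-entropy, which is the mechanism that could help with the many easy background regions in mostly non-fractured scans.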

For further studies, one could additionally classify the fractures inside the bounding boxes by severity, which is relevant in this domain since it could help doctors and radiologists provide the right type of treatment. This would of course require a new dimension of annotations, along with the error margin of the people making them.
In this project we have focused on the Faster R-CNN and YOLOv11 models. In future studies it could be interesting to broaden the scope to other models such as RetinaNet or Mask R-CNN. This project was limited to two models by time constraints, which proved to be the greatest challenge during the project.

**References**

- *Tu, E., Burkow, J., Tsai, A., Junewick, J., Perez, F. A., Otjen, J., & Alessio, A. M. (2024). Near-pair patch generative adversarial network for data augmentation of focal pathology object detection models. Journal of Medical Imaging, 11(3), 034505. https://doi.org/10.1117/1.JMI.11.3.034505*
