# Parallel Quicksort Performance Analysis
## Science Methodology, Confidence Interval & Performance Evaluation (solo work)
**Author:** Andrei Bituleanu

This experiment aims to evaluate the performance of a parallel quicksort implementation in C using Pthreads, compared against its sequential and built-in counterparts.

The primary objective is to understand:
- How the number of threads (`THREAD_LEVEL`) or array size impacts sorting performance.
- At what point parallelism provides a measurable speed-up.
- How thread management overhead affects runtime efficiency.

All tests were executed on a machine with an 8-core processor (2.90 GHz) using GCC (pthread) and analyzed in a Jupyter Notebook environment for visualization and computation reproducibility.

**Reproducibility Goal:** All results and visualizations in this report are generated from code cells executed below.

### All source code below comes from the parallelQuicksort.c file by Joshua Stough and Arnaud Legrand:

**https://gricad-jupyter.univ-grenoble-alpes.fr/hub/user-redirect/lab/tree/Science-Methodology/exercice1/quicksort.ipynb**

**Each run produces the following format:**
Thread value = X
Sequential quicksort took: Y sec.
Parallel quicksort took: Z sec.
Built-in quicksort took: W sec.
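
Because every run prints this fixed four-line format, the timings can be turned into numbers with two regular expressions. The snippet below is a minimal sketch rather than part of the original program: the helper name `parse_run` and the dictionary keys are assumptions chosen for illustration, and the sample string reuses the values of the sanity test shown in this report. Lines that were cut off before the number are simply skipped.

```python
import re

# Matches lines such as "Sequential quicksort took: 0.126768 sec."
TIMING_RE = re.compile(
    r"^(Sequential|Parallel|Built-in) quicksort took: ([0-9.]+) sec\.",
    re.MULTILINE,
)
# Matches the "Thread value = X" header line.
THREAD_RE = re.compile(r"^Thread value = (\d+)", re.MULTILINE)


def parse_run(output):
    """Turn the text of one run into a dict of numbers (hypothetical helper)."""
    parsed = {}
    match = THREAD_RE.search(output)
    if match:
        parsed["thread_value"] = int(match.group(1))
    # Keys become "sequential", "parallel" and "built-in".
    for name, seconds in TIMING_RE.findall(output):
        parsed[name.lower()] = float(seconds)
    return parsed


sample = """Thread value = 1
Sequential quicksort took: 0.126768 sec.
Parallel quicksort took: 0.221530 sec.
Built-in quicksort took: 0.353628 sec."""

print(parse_run(sample))
```

Applied to the `output` strings stored by the experiment cells, a helper like this would give one numeric row per run.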

Below is a sanity test to make sure everything works as intended. As you can see, by default, the sequential version is significantly faster than its parallel counterpart.

In [12]:
%cd /home/bituleaa/notebooks/Science-Methodology/exercice1
!./M2R-ParallelQuicksort/src/parallelQuicksort 1000000

/home/bituleaa/notebooks/Science-Methodology/exercice1
Thread value = 1 
Sequential quicksort took: 0.126768 sec.
Parallel quicksort took: 0.221530 sec.
Built-in quicksort took: 0.353628 sec.


### Experimental Scaling and Thread-Level Analysis

After verifying the program’s correctness with a baseline test (`N = 1,000,000`), I extended the analysis to study how parallel quicksort performance scales with larger array sizes.

Initially, I experimented with increasing array sizes defined by `N = 10,000,000 × i`, where `i` is the iteration number.  
At this stage, the number of threads was constant (`THREAD_LEVEL = 10`).  
The results revealed that the sequential quicksort consistently and often narrowly outperformed the parallel version at 10 threads.

This indicates that the limitation does not lie in the idea of parallelism itself, but rather in its implementation strategy.

To document the first scaling experiment, where I varied the array size while keeping the thread count constant at 10,  
the following code snippet outlines the procedure used to collect the runtimes from which confidence intervals are derived.

Start by running the next cell once; it compiles the file.

In [None]:
!gcc -O2 -pthread -DTHREAD_LEVEL=10 M2R-ParallelQuicksort/src/parallelQuicksort.c -o M2R-ParallelQuicksort/src/parallelQuicksort
!chmod +x M2R-ParallelQuicksort/src/parallelQuicksort

In [14]:
import subprocess
import pandas as pd

exe_path = "./M2R-ParallelQuicksort/src/parallelQuicksort"

# 5 repetitions per test so that a confidence interval can be computed
repetitions = 5
results = []

for i in range(1, 3):  # N = 10M, 20M
    N = 10_000_000 * i
    print(f"\n=== Running test with N = {N:,} ===")
    for r in range(repetitions):
        result = subprocess.run([exe_path, str(N)], capture_output=True, text=True)
        output = result.stdout.strip()
        print(f"Run {r+1}:")
        print(output)
        results.append({"N": N, "output": output})

df_raw = pd.DataFrame(results)
df_raw.to_csv("parallel_quicksort_raw.csv", index=False)
print("\n Raw results saved to parallel_quicksort_raw.csv")


=== Running test with N = 10,000,000 ===
Run 1:
Thread value = 1 
Sequential quicksort took: 1.478864 sec.
Parallel quicksort took: 1.541873 sec.
Built-in quicksort took: 4.117032 sec.
Run 2:
Thread value = 1 
Sequential quicksort took: 1.480496 sec.
Parallel quicksort took: 1.476274 sec.
Built-in quicksort took: 4.230843 sec.
Run 3:
Thread value = 1 
Sequential quicksort took: 1.425557 sec.
Parallel quicksort took: 1.524230 sec.
Built-in quicksort took: 4.168868 sec.
Run 4:
Thread value = 1 
Sequential quicksort took: 1.455553 sec.
Parallel quicksort took: 1.625081 sec.
Built-in quicksort took: 4.153235 sec.
Run 5:
Thread value = 1 
Sequential quicksort took: 1.467240 sec.
Parallel quicksort took: 1.589459 sec.
Built-in quicksort took: 4.166741 sec.

=== Running test with N = 20,000,000 ===
Run 1:
Thread value = 1 
Sequential quicksort took: 3.178551 sec.
Parallel quicksort took: 3.206185 sec.
Built-in quicksort took: 8.655065 sec.
Run 2:
Thread value = 1 
Sequential quicksort took: 

From the results, it becomes clear that, for both array sizes tested, the sequential version is faster than the parallel one in nearly every run. The margin is narrow, and in Run 2 at N = 10,000,000 the parallel version was marginally faster.
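
To put a number on that margin, the five N = 10,000,000 runs printed above can be summarised as a mean with a 95% confidence interval. This is a sketch under stated assumptions: it uses a Student t-interval, which assumes roughly normally distributed run times, and the hard-coded critical value 2.776 is only valid for exactly five samples. The helper name `mean_ci95` is hypothetical.

```python
import math
import statistics

# Runtimes in seconds of the five N = 10,000,000 runs printed above.
sequential = [1.478864, 1.480496, 1.425557, 1.455553, 1.467240]
parallel = [1.541873, 1.476274, 1.524230, 1.625081, 1.589459]

# Two-sided 95% Student t critical value for 4 degrees of freedom (n = 5).
T_CRIT_DF4 = 2.776


def mean_ci95(samples):
    """Return (mean, half-width) of a 95% t-interval for exactly 5 samples."""
    if len(samples) != 5:
        raise ValueError("T_CRIT_DF4 is only valid for exactly 5 samples")
    mean = statistics.mean(samples)
    # Standard error of the mean, based on the sample standard deviation.
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, T_CRIT_DF4 * sem


for label, data in (("sequential", sequential), ("parallel", parallel)):
    mean, half_width = mean_ci95(data)
    print(f"{label}: {mean:.4f} s ± {half_width:.4f} s")
```

With only five repetitions the interval is wide, so overlapping intervals would mean the difference between the two versions is not clearly established at this sample size.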

**You may want to view the generated CSV file, `parallel_quicksort_raw.csv`.**

Since increasing the array size does not notably change the ratio between the two algorithms' runtimes, we can exclude the risk of
a Simpson's paradox and move on to the next experiment, where the thread count changes and the array size stays constant at an arbitrarily high value.
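
The ratio mentioned here can be made explicit as a speed-up, computed separately for each array size so that sizes are never pooled together. The snippet is an illustrative sketch: the function name `speedup` is an assumption, and the data are copied from the complete runs printed above (only one complete run is shown for N = 20,000,000). A value below 1 means the parallel version is slower.

```python
import statistics

# (sequential, parallel) runtimes in seconds, copied from the complete runs above.
runs = {
    10_000_000: [
        (1.478864, 1.541873),
        (1.480496, 1.476274),
        (1.425557, 1.524230),
        (1.455553, 1.625081),
        (1.467240, 1.589459),
    ],
    20_000_000: [(3.178551, 3.206185)],  # only one complete run is shown
}


def speedup(seq_times, par_times):
    """Mean sequential time divided by mean parallel time."""
    return statistics.mean(seq_times) / statistics.mean(par_times)


for n, pairs in runs.items():
    seq, par = zip(*pairs)  # split the pairs into two tuples of timings
    print(f"N = {n:,}: speed-up = {speedup(seq, par):.3f}")
```

Keeping one ratio per array size is the design choice that matters: averaging over all sizes at once is exactly the kind of aggregation in which Simpson's paradox can appear.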

### Experimental Direction 2: Changing Thread Levels at a Constant Array Size N

After confirming that increasing the array size does not make the parallel quicksort outperform the sequential one,
the next step is to investigate how the number of threads (`THREAD_LEVEL`) influences performance for a fixed, large array size.

In this experiment:

- The array size `N` is kept constant at 10,000,000 elements.
- The number of threads (`THREAD_LEVEL`) is varied across the values 1, 2, 4, 6, 8, 10.
- Each configuration is executed five times to account for run-to-run variability and to allow a confidence interval to be computed.
- All results (sequential, parallel, built-in) are saved automatically to a CSV file for reproducibility and for the D3.js visualization.

The goal is to determine whether increasing the thread level ever leads to performance gains,
or whether synchronization and thread management overhead remain dominant even at large workloads.
As in the first experiment, each test is repeated 5 times so that a confidence interval can be computed.

You may now run the code snippet below to reproduce the results for yourself:

In [1]:
import subprocess
import pandas as pd

exe_path = "./M2R-ParallelQuicksort/src/parallelQuicksort"

THREAD_LEVELS = [1, 2, 4, 6, 8, 10]
ARRAY_SIZE = 10_000_000
REPETITIONS = 5

results = []

for t in THREAD_LEVELS:
    print(f"\n=== Testing THREAD_LEVEL = {t} with N = {ARRAY_SIZE:,} ===")

    # Recompile with new THREAD_LEVEL
    compile_cmd = [
        "gcc", "-O2", "-pthread",
        f"-DTHREAD_LEVEL={t}",
        "M2R-ParallelQuicksort/src/parallelQuicksort.c",
        "-o", "M2R-ParallelQuicksort/src/parallelQuicksort"
    ]
    subprocess.run(compile_cmd, check=True)

    for r in range(REPETITIONS):
        result = subprocess.run([exe_path, str(ARRAY_SIZE)], capture_output=True, text=True)
        output = result.stdout.strip()
        print(f"Run {r+1}:\n{output}\n")
        results.append({"Thread_Level": t, "N": ARRAY_SIZE, "Output": output})

# Save raw outputs
df_threads_raw = pd.DataFrame(results)
df_threads_raw.to_csv("parallel_quicksort_threads_raw.csv", index=False)
print("\n✅ Thread-level experiment results saved to parallel_quicksort_threads_raw.csv")



=== Testing THREAD_LEVEL = 1 with N = 10,000,000 ===
Run 1:
Thread value = 1 
Sequential quicksort took: 1.470688 sec.
Parallel quicksort took: 1.637856 sec.
Built-in quicksort took: 4.050624 sec.

Run 2:
Thread value = 1 
Sequential quicksort took: 1.420977 sec.
Parallel quicksort took: 1.483066 sec.
Built-in quicksort took: 4.177301 sec.

Run 3:
Thread value = 1 
Sequential quicksort took: 1.486452 sec.
Parallel quicksort took: 1.535989 sec.
Built-in quicksort took: 4.163392 sec.

Run 4:
Thread value = 1 
Sequential quicksort took: 1.457046 sec.
Parallel quicksort took: 1.580551 sec.
Built-in quicksort took: 4.172396 sec.

Run 5:
Thread value = 1 
Sequential quicksort took: 1.592987 sec.
Parallel quicksort took: 1.498176 sec.
Built-in quicksort took: 4.148829 sec.


=== Testing THREAD_LEVEL = 2 with N = 10,000,000 ===
Run 1:
Thread value = 2 
Sequential quicksort took: 1.427151 sec.
Parallel quicksort took: 1.786101 sec.
Built-in quicksort took: 4.188842 sec.

Run 2:
Thread value = 

In [None]:
# I modified the code to remove the THREAD_LEVEL error; it is now defined by default with
# the following lines:

#ifndef THREAD_LEVEL
#define THREAD_LEVEL 1
#endif

The table in the file generated by the code snippet contains the raw results. For a more complete
visualisation, below is the d3.js graph of the run above. Reproducing the test should give very similar results:
the machine can slightly influence the absolute timings, but the shape of the graph should be virtually
the same. The code is available at:
**Science-Methodology/exercice1/M2R-ParallelQuicksort/src/parallelQuicksort.c**

Let's look at the results.

Note: For transparency, the entire HTML file is stored in **Science-Methodology/exercice1/d3.txt** in case you want to see
the values obtained for every run and thread count.

![Parallel Quicksort Performance](quicksort_threads.png)

The graph shows how the three quicksort versions perform when we change the number of threads 
while keeping the array size fixed at 10 million elements.

The sequential algorithm stays stable: its speed does not depend on the thread level, which makes sense
since it never spawns threads. The parallel algorithm gets slower as we add more threads. Furthermore, even
with a thread level of 1, the sequential algorithm is already faster on average for this code and use case.

As the thread level increases, the parallel version creates more threads than it can usefully exploit and spends
a lot of time managing them instead of sorting the array.

In conclusion, regardless of the thread level, and even when the array contains tens of millions
of elements, the sequential algorithm remains the fastest for this use case.
Parallelism would only become beneficial if each thread handled a large enough workload to offset the extra management
and synchronization costs. That is not the case here, neither when varying the thread level
with an array size of 10,000,000 nor when varying the array size itself.
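
The break-even condition described in this conclusion can be illustrated with a deliberately simple cost model. Everything in this sketch is an assumption made for illustration, not a measurement: the work is taken to split perfectly across workers (real quicksort partitions are uneven), each thread is charged a fixed cost, and the two overhead values are hypothetical. Only the sequential time of about 1.46 s comes from the N = 10,000,000 runs above.

```python
def modeled_parallel_time(t_seq, workers, overhead_per_thread):
    """Toy model: perfectly divided work plus a fixed cost for each thread."""
    return t_seq / workers + overhead_per_thread * workers


def is_worthwhile(t_seq, workers, overhead_per_thread):
    """True when the modeled parallel time beats the sequential time."""
    return modeled_parallel_time(t_seq, workers, overhead_per_thread) < t_seq


T_SEQ = 1.46  # seconds, roughly the mean sequential time for N = 10,000,000

# Hypothetical per-thread overheads (seconds): one cheap, one expensive.
for overhead in (0.01, 0.5):
    for workers in (2, 4, 8):
        verdict = "faster" if is_worthwhile(T_SEQ, workers, overhead) else "slower"
        print(f"overhead={overhead} s, workers={workers}: parallel is {verdict}")
```

In this model, adding workers only helps while the time saved by dividing the work exceeds the per-thread cost, which matches the qualitative behaviour observed in the experiments.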