
fix/improve HRTF normalization #27

Closed
ahicks92 opened this issue Nov 12, 2020 · 18 comments

@ahicks92
Collaborator

We have a really good story around ITD, but a not-so-great one around HRTF normalization. Specifically, we end up throwing out too much of the frequency-dependent effects. I'll need to sit down for a weekend or something and hammer on it until we get better quality out of the normalization. It may be possible to borrow from the Matlab scripts that come with the MIT dataset.
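For context, one common normalization approach that keeps the direction-dependent detail is diffuse-field equalization: divide every measured response by the power-averaged magnitude across all directions. The sketch below is a generic illustration of that idea, not necessarily what mit.py does or will do.

```python
import numpy as np

def diffuse_field_equalize(hrirs, fft_len=512):
    # hrirs: array of shape (num_directions, ir_length).
    # Compute each direction's magnitude response, divide out the RMS
    # magnitude averaged over all directions, and go back to the time domain.
    spectra = np.fft.rfft(hrirs, n=fft_len, axis=1)
    diffuse = np.sqrt(np.mean(np.abs(spectra) ** 2, axis=0))
    equalized = spectra / np.maximum(diffuse, 1e-9)
    return np.fft.irfft(equalized, n=fft_len, axis=1)
```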

@ahicks92 ahicks92 added this to the 1.0.0 milestone Nov 12, 2020
ahicks92 added a commit that referenced this issue Feb 14, 2021
Fix the minimum phase conversion, improve the behind/in-front difference with
a lowpass, remove the base frequencies. Generally clean up and debug mit.py
into something that's maintainable. It was last thrown together before the
library even existed, as a quick proof of concept. Also, increase the FFT length
for the magnitude responses to match the sample rate; this gives much better
processing when averaging power and so on.

Refs #27. Doesn't close it because we want to collect feedback.
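For reference, the standard cepstral-method minimum-phase reconstruction that the commit alludes to looks roughly like the sketch below. This is a generic illustration, not the exact code in mit.py.

```python
import numpy as np

def minimum_phase_from_magnitude(magnitude):
    # magnitude: full, symmetric FFT magnitude response of even length.
    # Take the real cepstrum of the log magnitude, fold it onto the causal
    # side, and exponentiate to get the minimum-phase spectrum.
    n = len(magnitude)
    log_mag = np.log(np.maximum(magnitude, 1e-9))
    cepstrum = np.fft.ifft(log_mag).real
    window = np.zeros(n)
    window[0] = 1.0
    window[1:n // 2] = 2.0
    window[n // 2] = 1.0
    min_phase_spectrum = np.exp(np.fft.fft(cepstrum * window))
    return np.fft.ifft(min_phase_spectrum).real
```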
@SemihBudak

Hello,

I've been testing the last commit and have some feedback and questions.

I've been using OpenAL Soft for my games and am in the process of migrating to Synthizer. As it stands, Synthizer does not perform as I expect, so could you please let me know if I'm missing a setting of some sort with the following scenario? Note that HRTF is enabled.

  • I have a listener at 0, 0, 0
  • I have a sound source at 1, 0, 0
  • The sound comes from the right, which is correct
  • I start incrementing the x coordinate of the source to move it farther to the right, but the only thing that changes is mostly the volume until it hits 50, which is the range of that source. By 'mostly the volume', I mean that the sound is still partially present on the left side, and I cannot accurately judge the distance of the sound
  • With OpenAL Soft, there's a sense of depth (I'm not sure how to describe it) that really indicates the source is moving away and farther to the right, rather than just getting quieter. It's hard to put into words as I am not very familiar with the nomenclature

I can provide demos for both OpenAL Soft and Synthizer if required, in which you can move the listener around the source to test the problem I've described.

As always, good work.

@ahicks92
Collaborator Author

It's possible that they're doing something I'm not but I can't look at their implementation because it's GPL.

There are a few possibilities here:

  • You're using a reverb and OpenALSoft is fading the source into the reverb with distance. Synthizer will do that in future; I haven't implemented it yet because I need to design it. Suffice it to say that OpenAL-style effect sends have limitations and it's not as straightforward as it seems, but that's a whole other discussion.
  • OpenALSoft has an implementation of near-field effects. If this is the case, then 1 meter to 2 meters will sound significantly different, but 50 meters to 100 meters, with enough distance model configuration to make it audible at those ranges, will sound like Synthizer.
  • OpenALSoft has in some way disregarded being physically accurate. The difference between this and near-field effects is that, if that's the case, they're doing something at all ranges.

We'll want to determine which. For reverb, that's a simple enough answer because you either are or aren't using one. For near-field effects, modifying your demo to do the same thing but with larger ranges will tell us; I think you have to touch max distance, but there might be another thing to tweak. In general, if the effect only shows up at close ranges and you can't reproduce it at larger ranges just by reconfiguring the distances, that's near-field. If neither of those is relevant, then they're probably doing something interesting and I'll want demo sounds or demo programs or whatever you've got, but it'd be helpful if you could play around and eliminate the other possibilities first.
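To make the larger-ranges test concrete, here's a small plain-Python sketch (not Synthizer or OpenAL API calls) showing that an OpenAL-style inverse distance model attenuates 1 m to 2 m by exactly the same factor as 50 m to 100 m, so plain distance attenuation on its own can't produce a close-range-only difference:

```python
def inverse_distance_gain(distance, ref_distance=1.0, rolloff=1.0):
    # OpenAL-spec inverse distance model.
    return ref_distance / (ref_distance + rolloff * (distance - ref_distance))

for near, far in [(1.0, 2.0), (50.0, 100.0)]:
    ratio = inverse_distance_gain(near) / inverse_distance_gain(far)
    print(f"{near:g} m -> {far:g} m: gain drops by a factor of {ratio:g}")

# Both pairs print a factor of 2; if OpenALSoft only sounds qualitatively
# different up close, something beyond distance attenuation is at work.
```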

How's the rest of it? One of the problems is that I'm too close to the audio to do blind listening tests. Does it function as you'd expect at larger ranges or is there also a problem there?

And lastly, are you using a custom dataset or the one built into OpenALSoft?

@SemihBudak

I'm not using any reverb myself, and I don't think OpenAL Soft uses one in the background. For both libraries, I went with no configuration other than enabling HRTF. I haven't changed OpenAL Soft's HRTF dataset either; it's just the default.

It looks like it is near-field effects, as the difference becomes crystal clear when the source is close. For example, when the source is 2 meters above and to the left, I can literally turn my head to point at it, but when it's more than about 10, it gets difficult to pinpoint with the same accuracy.

I'll try to get you the demos today.

@ahicks92
Collaborator Author

Ok. That's basically what I needed to know. I'll get that to you when I can, but it's going to require tracking down literature. I might be able to do some short-term things, though.

@ahicks92
Collaborator Author

Are you a C or Python user? If I give you links to specific commits, can you work with that?

I probably won't have anything for a week or so, but it would be nice if I could put out betas or something without having to spam releases all the time.

@SemihBudak

I'm wrapping it in C#, but I can definitely work with commits; I can compile it myself for my wrapper.

@SemihBudak

It looks like SYZ_DISTANCE_MODEL_EXPONENTIAL is the closest to their default implementation. I'm still experimenting and will let you know if I can find a match to do a better quality check. In either case, it sounds awesome.

@ahicks92
Collaborator Author

Yeah, that's probably a factor too. Have you seen this? OpenAL has all of the same distance models, but I believe their defaults are different. Perhaps mine should be tweaked to match, though we'll have to wait a bit for 0.9 since that's a breaking change. Also, there are definitely docs improvements to make here.

You're supposed to match them to whatever environment you're representing, but they don't map well to physical properties, so you kind of have to do it by ear. If you find some good defaults that just make you go "WOW" out of the box, post them here and I'll probably just use them. It's quite difficult for the person writing the audio library to evaluate it, because a lot of how hearing works is that it sounds better if you know what you're hearing.
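For reference, the OpenAL-spec distance model formulas look roughly like the sketch below. Synthizer's SYZ_DISTANCE_MODEL_* values are analogous, but the parameter defaults shown here are illustrative assumptions, not either library's actual defaults.

```python
# Unclamped OpenAL-spec distance models; d is the listener-to-source distance.
def gain_inverse(d, ref=1.0, rolloff=1.0):
    return ref / (ref + rolloff * (d - ref))

def gain_linear(d, ref=1.0, max_d=50.0, rolloff=1.0):
    return max(0.0, 1.0 - rolloff * (d - ref) / (max_d - ref))

def gain_exponential(d, ref=1.0, rolloff=1.0):
    return (d / ref) ** -rolloff

for d in (1.0, 2.0, 10.0, 50.0):
    print(f"{d:5.1f} m  inverse={gain_inverse(d):.3f}  "
          f"linear={gain_linear(d):.3f}  exponential={gain_exponential(d):.3f}")
```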

I'll still be looking into near-field effects, though. It would be neat if we could get it, and Creative definitely has it. That video implies that OpenALSoft doesn't (OpenALSoft starts at 1:39). But I have trouble believing that, because my preliminary research suggests it's beyond super easy, in the sense that someone who knows more than me could just wave their hand and add some terms to a formula, and OpenALSoft has been doing theirs for 10 years. Maybe there's an unexpired Creative patent floating around.

@SemihBudak

I've tried to replicate what Creative has with OpenALSoft, but no matter how slowly I move the source, there's just one snap point where the sound jumps across to the other side. I would definitely love for that effect to exist. I'll keep testing to see if I'm missing a setting.

@ahicks92 ahicks92 modified the milestone: 1.0.0 Feb 28, 2021
@ahicks92
Collaborator Author

Going to go ahead and close this out because it seems as though there are no complaints. I'm deferring near-field stuff under the "it's as good as OpenALSoft" justification, though perhaps this issue will come back if it turns out we aren't as good as them. Fortunately, most future changes to the HRTF should no longer be nearly as painful as the first round of this issue.

@Tetsujinfr

Tetsujinfr commented Mar 17, 2021

Reading this thread, I have a couple of questions on the HRTF rendering:

  1. How do you set the distance model type in Python, and what is the default model?
  2. I do not hear the Y dimension of the sound at all; is it only me? Is it possible to use different SOFA profiles with the lib?
  3. Do you plan on adding near-field correction? That seems to make a great deal of difference, and personally I am interested in accurate sound spatialisation at small radii from the listener's position.
  4. Do you have plans for volumetric sound sources, described by Oculus in 2017 as letting "sound designers model objects of virtually any size in a way that sounds realistic. Rather than trying to pinpoint the source of a sound, designers can give a sound a radius"? There is a demo here.
  5. When playing with the source positions, it seems the sound only changes for integer position coordinates. For example, when I move x from -1.0 to +1.0 through 200 steps, I only hear an abrupt change from full left to full right. Looking at the 3D media player Python example, the 'pos' variable is cast as a float, so I am a bit surprised it does not work with non-integers. Am I doing something wrong?

Thanks a lot and sorry for the long list of questions.

@ahicks92
Collaborator Author

The object reference lists all the properties you can set, and the tutorial does show how to set properties. Everything that takes an enum value is bound to Python enums. You might have to read the Python bindings' source code to work a little bit of it out, but I've had multiple people figure this out without a problem, so it would be good to understand where exactly the difficulty is. It may not be entirely clear why you'd want to set something, but everything you could possibly want to set is documented.

I think you're using coordinates wrong. If you move a sound from x = 1 to x = -1 without involving y or z, you're moving the sound through the center of the head. You need to move multiple coordinates at once in a realistic fashion, for example (1, 1, 0), (0.5, 1, 0), etc. If you're not trying to map things to an actual 3D space and just need to throw pan values at the library, then you can grab PannedSource and control the gain, azimuth, and elevation yourself; I suspect this is what you want. This would also address your "it only works for integer positions" observation: if you move a sound through the center of the head, it can only be on the right, in the middle, or on the left.
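To illustrate the difference, here is a small sketch comparing the two kinds of sweep. The axis convention (+x right, +y forward) is an assumption for the example; the actual mapping depends on the listener's orientation.

```python
import math

steps = 9

# Path A: sweep straight through the head. Only full right, centre, and
# full left are distinguishable, regardless of how many steps you use.
path_a = [(1.0 - 2.0 * i / (steps - 1), 0.0, 0.0) for i in range(steps)]

# Path B: keep a constant 1 m radius in front of the listener, so the
# azimuth (and therefore the HRTF filtering) changes smoothly.
path_b = [(math.cos(math.pi * i / (steps - 1)),
           math.sin(math.pi * i / (steps - 1)),
           0.0)
          for i in range(steps)]

for a, b in zip(path_a, path_b):
    print(f"through head: {a}   constant radius: {b}")
```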

If by near-field correction you mean something like the video I linked above, yes, eventually. If by near-field correction you mean working out an entire second set of HRIRs and crossfading or somesuch, probably not. No one publishes that data as open source, I haven't found any good resources on doing it myself through mathematical analysis, and making the data myself requires a good bit of specialized hardware and access to an anechoic chamber for the duration of the project.

I'm going to guess that maybe you're sighted. I'm blind, and when I listen to the Oculus video I can't even reliably tell what is and isn't a volumetric source, and their HRTF is full of artifacts. They're probably using an ambisonics implementation, which allows for volumetric sources at the cost of quality and accuracy.

In practice, the best that's possible is "this source sounds small" and "this source sounds big", with no real subtlety between them, and whether it even works would depend on the input sound as well. Most of the rest of the "size" perception comes from knowing what the source is (e.g. trains are big, cars are smaller than trains, the source is bigger than a point, hey it's a car, boom, you have "perceived" how big it is) or from seeing it and combining that with what you're hearing.

I can go into more detail as to why this is specifically hard, but suffice it to say that I don't consider the gain worth the trade-off and time spent, especially since it probably also means doing patent research to find something that won't get us shut down by Oculus or whoever else's toes this is bound to step on.

Sorry I can't be more optimistic here. I absolutely agree volumetric sources would be cool.

@Tetsujinfr

Thanks for your answers.
On volumetric sources, point taken, and agreed that the video is not super clear about which sources are large-volume or small-volume when you close your eyes. It would be cool, but it's not clear that it can really work, so let's forget this.

On the near-field correction, yes, I meant what is in the video you shared. In that video I find Creative's near-field effect a bit disturbing, but it works, I think. OpenAL does not work from that perspective.

Regarding the HRTF spatialisation, yes, I am trying to map things to an actual 3D space. Before going full steam with 3D movements, I was just assessing my perception of the spatialisation in each dimension, hence playing with only x, y, or z independently of each other. I know that the x dimension usually renders well, but that height and depth are more challenging to perceive properly. For instance, when trying Google Resonance, I realized that Y and Z were just not perceivable, at least to me. Maybe I have big ears compared to the default HRTF params, haha. I still do not understand why moving the sound source to 0.1 or 0.5 or 1.0 does not make any audible difference.

Regarding properties, I definitely missed the references you mentioned. Can you please point me to the object reference list file or doc? For the Python bindings, is there a particular source file I should look at?

Thanks a lot

@ahicks92
Collaborator Author

I think the problem you're having with 0.1 vs. 1.0 sounding the same is the same thing @SemihBudak ran into earlier in this thread: you need to set the distance model params differently. I will make them match the OpenAL defaults in the near future, which will help at least a bit. Every HRTF implementation out there that I know of that isn't a sound card from Creative models the head as a point, and the only thing distance controls is the volume.

If you hit synthizer.github.io, the properties are all documented in the object reference and map to Python in the straightforward manner of SYZ_P_FOO_BAR to obj.foo_bar. There is also a Python type stub, which is probably the easiest thing to look at, though I don't think it has made it onto PyPI yet.

Some people can hear height with HRTF. Some can't. It's subtle, and the problem is that it's incredibly personal. Even in an entirely real environment, height perception is usually much worse than you'd give it credit for. For me, with Synthizer/OpenALSoft/etc., it's most obvious if the source is at about 45 degrees off center, but your mileage may vary. Azimuth has both the gain difference and the interaural time delay, but height cues are entirely about how sound bounces off your individual ears/head, which isn't something that can be averaged in a dataset. Maybe I can do better at some point; occasionally I find a demo from someone who has, notably the Blue Ripple people, but (again) all of that research is seemingly highly proprietary, so who knows whether I can figure out how to duplicate it.

@Tetsujinfr

Tetsujinfr commented Mar 19, 2021

OK, thanks. I have done some testing and have some feedback.
I am sharing a modified 3D media player example Python file (.py file attached as a .zip file), with the following additional commands to try:

  • 'circlexy', which animates the sound in a circle around the listener's position, in the x-y plane
  • 'circlexz', which animates the sound in a circle around the listener's position, in the x-z plane
  • 'distance dist_in_units', followed by the distance in real units (default is 2), which only applies to the circlexy or circlexz animated sounds and represents the radius of the circle around the listener
  • 'duration time_in_seconds', followed by the duration in seconds (default is 16), which is how long the circle animations last

example_3D_mediaplayer.zip

This way we can compare renders of similar sound simulations, but with our different perceptions.
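For anyone without the attachment, the animation presumably boils down to position math roughly like the sketch below (this is an illustration, not the attached script itself; the real script would feed positions like these to a Synthizer source):

```python
import math

def circle_positions(radius=2.0, duration=16.0, steps=200, plane="xy"):
    # Yield (x, y, z, seconds_to_wait) points on a circle of `radius`
    # around the origin (the listener), spread over `duration` seconds.
    dt = duration / steps
    for i in range(steps):
        angle = 2.0 * math.pi * i / steps
        if plane == "xy":
            yield radius * math.cos(angle), radius * math.sin(angle), 0.0, dt
        else:  # "xz"
            yield radius * math.cos(angle), 0.0, radius * math.sin(angle), dt
```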

  1. For the circling sound along x and y, I only hear a sound passing behind my head but not in front of it. I just do not get the front part; it renders almost the same as the rear part. Also, it sounds slightly above my head, at about scalp level. To me that is typical of the HRTF parameters being used and head specificities, no? The type of issue you fix with custom SOFA files, no?
  2. For the circling sound along x and z, I hear the arc above my head, but when the sound circles below my head there is a discontinuity, which is very strange. Not sure if you can hear it too. Is it just me?
  3. Regarding the depth of the sound: now that I have played with the sounds dynamically, I understand what @SemihBudak meant. Indeed, a sound positioned at 1,1,1 or at 10,10,10 sounds like it is at the same distance; one is louder and the other softer, but they feel like they are in the exact same position. Having played with Steam Audio in VR quite a bit, they have a way to simulate depth, that is for sure, and it works quite well. Depth for me is a must, really critical, especially if I do not have visual support to help estimate sound positioning. I am no specialist, but now I understand that the way the HRTF is currently rendered is like having sounds on a sphere around the listener's head, where the volume is adjustable but not the radius of the sphere.

Regarding CPU usage, the library is quite efficient so far; I do not observe significant usage. I did only play with one source, though.

@ahicks92
Collaborator Author

It is normal for a horizontal circle to sound like it is at the level of your head, because the position you are feeding the library is the position of the listener's head.

The discontinuity is because most of the datasets don't contain data below an elevation of about -45 degrees. At some point I will figure out synthesizing some fake ones, but so far this has only been a problem in theory.

The lack of depth is probably the lack of near-field effects, plus the fact that you haven't added a reverb. That will get you a little bit of the way there, but you'll probably need to play with manually tweaking gains for the time being. It's also possible that the Steam people are using an occlusion filter to simulate air, but that particular implementation does things like raycasting level geometry that I don't have the resources for. Most of the "depth" stuff you're talking about isn't just HRTF but the library as a whole, and I'm working on it.

In the interest of setting expectations appropriately, you're not going to get what you want on a short time horizon. This is a pre-1.0 weekend project. I'm competitive in the OSS space, but when it comes to "how does [insert proprietary solution] do x", I have to reverse engineer it from first principles, because all of the research on this stuff happens inside the VR companies and they're not telling us how they do it. The publicly available stuff is something like 5 years behind what Steam, Oculus, etc. are doing, not just on audio but on everything they're involved in.

@Tetsujinfr

OK, I understand the situation. Thanks for setting expectations; I appreciate that this is not easy at all and that the VR companies have lots of resources to work on this domain.

I need to try your reverb functions; they're on my todo list. I have been told reverb helps a lot with sound spatialisation, but I am not so convinced for outdoor scenarios, especially if one does not use vision on top of sound perception.

What about SOFA profiles, is that something your library manages, or will manage, in your view? When I played with SOFA profiles in the past (a YouTube video playing a wave sound on a beach), to be honest I had difficulty picking out any one in particular; they all kind of sounded the same to me. What is the default dataset you are using?

@ahicks92
Collaborator Author

Even if you buy the best HRTF that exists, you will not be able to get an exact distance estimate out of anyone. The best you can do is "yeah, it sounds close" or "maybe it's far away". A lot of the distance perception is also you combining sight with the demo. I'm getting the impression that you think HRTF alone does way more than it can. Even the best blind people in the real world can't get beyond "pretty close" or "hm, probably far away", unless we're talking about echolocation, which encodes the distance in the amount of time it takes the sound generated by the echolocator to return.

The primary cues for distance are the balance of the direct path against the reverb (at a distance, most of what you hear is the reverb and late reflections; up close, the direct sound dominates), how much sounds move when you turn your head (closer sounds move more), and so on. My best guess for e.g. Steam is that they're fading out a stereo crossover and faking it, then just deciding that sounds beyond a meter or two aren't going to be simulated to that level of detail. I also wouldn't be surprised if their demos/library put in a reverb by default.

Since I already know what you're working on, it is important to point out that HRTF isn't magically a lot better for blind people: we aren't going to get much more out of it than you can with your eyes closed. We're just better at processing what we do get. Much of the perceived quality and accuracy comes from combining sound and sight without realizing it; in particular, the Steam Audio demos have audible artifacts to me, even if they do convey a feeling of "space" better. Things like audiogames always add additional mechanisms that provide more information to one degree or another when precise distance judgements are required.

The dataset that's currently being used is MIT Kemar: https://sound.media.mit.edu/resources/KEMAR.html

This is what OpenALSoft uses. I may eventually look at switching to CIPIC, which samples the lower elevations of the KEMAR head and has a higher horizontal resolution.

In general, changing the HRTF can help if you find exactly the right one, but almost no one bothers because it's a laborious process for a not hugely significant gain. I'm not going to support the SOFA format directly. I might support it indirectly later, but getting an HRTF from a dataset to what actually goes into a library like this requires a lot of processing, as demonstrated by our Python script for MIT. A lot of that is reusable, but there is typically some degree of dataset-specific processing required, so at best it'd have to be a custom file format and at worst a reimplementation of a lot of NumPy's functionality in C++ (most deps for serious C++ math aren't even close to public domain, which is a goal of this library).

It's also worth noting that a large part of why I don't use more recent data is that most of the recent data isn't available for commercial use without prior permission from the authors. This may or may not factor into your decisions as to what you end up using.
