Run inference on my own data #2
Hi,

Quick and dirty: yes. If you can be bothered creating annotations that are in the same format as the AVA videos, then you can recreate the directory structure of the "download" directory and stick your files in there, mimicking an AVA video. Then you can generate the model performance metrics on that.

If you're thinking of doing slightly more interesting things, like attempting to use it as a (live) video classifier, then it requires a bit of code modification. I will sketch out roughly what I think needs to be done to get it going for that purpose.

Data: see line 67 in 190c1c6.

Predict: next, you want to load the keras model and call the prediction function. The examples in this project are roughly doing this, but to evaluate against other data. See line 56 in 190c1c6.

Hope this helps!
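For illustration, here is a minimal sketch of that predict step, assuming a Keras checkpoint and the input shapes discussed later in this thread (5 video frames resized to 100x100 plus a 20x13 MFCC window); the file name, input ordering and exact shapes are placeholders, not the project's actual values:

```python
import numpy as np
from tensorflow import keras

# Hypothetical checkpoint path; substitute whatever your training step produced.
model = keras.models.load_model("asd_model.h5")

# Assumed input layout: a batch of 5 video frames resized to 100x100 (RGB),
# plus the MFCC features covering the same time span (20 audio frames x 13 coefficients).
video_batch = np.zeros((1, 5, 100, 100, 3), dtype=np.float32)  # placeholder data
audio_batch = np.zeros((1, 20, 13), dtype=np.float32)          # placeholder data

probs = model.predict([video_batch, audio_batch])
print(probs)  # e.g. probability of speaking vs. not speaking
```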
@tuanchien thanks for your reply. I'll see what I can do following your idea... Hope I'll be able to run inference with my own data. Could you also point me to the AVA file format which I should follow to mimic the AVA data with my own frames? Nevermind, I've found the AVA paper with the CSV file structure. I guess I'll try to handle this. I'll update my status later.
@tuanchien hello again.
Thanks in advance. P.S.:
You need both audio and video data for this particular model to work. If you just had audio or just video, you would want to use something simpler. You can, for example, build a model with just the video part of the model, or just the audio part of the model. What I mean by extra footage is: if you have extra footage before frame 1 in your mp4 that you cropped out, you could stick that back in and then start labelling from what used to be frame 1. If you don't, then you could put in blank frames at the start if you want, but the predictions will almost certainly be terrible for those frames at the start.
Well, about audio/video only, I meant that if some particular frame sequence is too noisy (for example, some sort of external noise from a drill behind the wall), we could at least see what the model outputs in video_pred. Or couldn't we? Well, I have a 6-minute video and I've extracted pictures and audio from it from start to end. Though there is some sort of step, since not every frame was taken from the video; I guess it is the fps option in the config file. I meant not blank frames, but copies of the first frame (though how many?).
You could look at those video/audio-only outputs, sure. They might give an indication, but I wouldn't rely on it as an accurate indication of what's going on. It would pay to manually inspect the problematic parts. There is a parameter called ann_stride in the config.yaml that indicates how many frames to skip when generating annotations. This effectively controls the frame skip. I would recommend just padding out the beginning by, say, 3 s worth of frames and audio (conservative estimate). For each video frame, the audio feature extraction needs enough previous audio data to calculate the feature.
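As a rough sketch of that padding suggestion, assuming you hold the video as an array of frames and the audio as a 1-D waveform (the fps, sample rate and pad length below are illustrative, not values from the config):

```python
import numpy as np

def pad_start(frames, waveform, fps=25, sample_rate=16000, pad_seconds=3.0):
    """Repeat the first frame and prepend silence so the earliest predictions
    have enough audio history for feature extraction."""
    n_pad_frames = int(round(pad_seconds * fps))
    n_pad_samples = int(round(pad_seconds * sample_rate))
    padded_frames = np.concatenate([np.repeat(frames[:1], n_pad_frames, axis=0), frames])
    padded_audio = np.concatenate([np.zeros(n_pad_samples, dtype=waveform.dtype), waveform])
    return padded_frames, padded_audio
```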
Alright, thanks for the answers. I was able to launch the model on several videos, though I needed to preprocess them using your extract code. The question is: how do I avoid that? I mean, I need to run it now in a near-real-time situation, where we don't have the whole video.

As I understand, the model takes 5 frames as input, resized to 100x100 (I've checked that using cv2.imwrite on each frame of the input), but I don't get what the audio input is. It has dimensions 13, 20 (I'm not considering (16,), as it is the batch size, or (,1)). How is it correlated with the 5 frames? As I'm seeing in your code, 20 is the sequence length of the audio input (like 5 for the video frames). Does this mean 4 audio frames for each video frame? Sorry if I'm bothering you too much =)

Probably, as I understood your last words about "audio feature extraction needs enough previous audio data to calculate the feature", it could mean that I'm right that for each video frame we need several audio frames. But the question about (,13,) is still there. Is it some kind of resize? Nevermind, I've checked the extract mfccs code and your paper once again. The 13 you're getting from librosa.feature.mfcc. As I understand, it is some sort of preprocessing of the raw audio input. Now I need to preprocess my audio input the same way each time I want to launch the model.
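For reference, the shape bookkeeping described here can be sketched like this, assuming 13 MFCC coefficients and 4 MFCC frames per video frame (so a 5-frame video window pairs with a 20x13 audio window); the file name and parameter values are illustrative:

```python
import librosa
import numpy as np

waveform, sr = librosa.load("clip.wav", sr=16000)          # hypothetical audio file
mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13)  # shape (13, n_audio_frames)
mfcc = mfcc.T                                              # shape (n_audio_frames, 13)

frames_per_video_frame = 4
video_window = 5
start_video_frame = 10                                     # arbitrary example position

a0 = start_video_frame * frames_per_video_frame
a1 = a0 + video_window * frames_per_video_frame
audio_window = mfcc[a0:a1]                                 # shape (20, 13)
```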
Well, I've met some problems while trying to emulate real-time usage of your model. It would be really great if you could look at this code and tell me if I'm wrong somewhere:
The function get_line_with_id gives me the next line with speaker id 0, so I'm sending only one speaker to the model every time. I'm trying to get 5 frames and the corresponding audio (4 audio frames for each video frame). The problem is that the model always returns that the speaker is speaking. I'm currently trying to investigate where I'm wrong, but it would be great if you could assist here. Thanks in advance.
To start with, I suggest double-checking that all the parameters match what your model expects, e.g. the fps, and that the functions you call are returning what you expect. You may want to visualise the video frames and audio array to see if they match what your annotation says. Your while loop is also effectively skipping 5 frames each iteration; make sure that is the behaviour you want. Good luck debugging.
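To make the stride point concrete, here is a sketch of a sliding window that advances one video frame per iteration instead of five, under the same shape assumptions as above (5 video frames paired with 4 MFCC frames each); model, frames and mfcc are whatever you loaded and extracted earlier:

```python
import numpy as np

def score_all_frames(model, frames, mfcc, window=5, mfcc_per_frame=4):
    """Slide a 5-frame window with a stride of 1 so every position gets a prediction."""
    predictions = []
    for start in range(len(frames) - window + 1):
        v = frames[start:start + window]                                    # (5, 100, 100, ...)
        a = mfcc[start * mfcc_per_frame:(start + window) * mfcc_per_frame]  # (20, 13)
        predictions.append(model.predict([v[np.newaxis], a[np.newaxis]]))
    return predictions
```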
Hello again. I was on holiday, so I've just returned to this task. My problem in that code was that my video_frames were [0, 255] int32, not [0, 1] float32. After fixing that, I could get some results. But I still have some questions regarding the audio frames; in fact, about two parameters: mfcc_window_size and stride in your config file.
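For anyone hitting the same issue, the fix amounts to rescaling the frames before prediction:

```python
import numpy as np

def normalize_frames(frames):
    """Convert [0, 255] integer frames to the float32 [0, 1] range the model expects."""
    return np.asarray(frames, dtype=np.float32) / 255.0
```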
See line 137 in 190c1c6: the window size corresponds to n_fft, and the stride to the hop_length, of librosa.feature.mfcc.
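In code terms, that mapping looks roughly like this (the numeric values below are illustrative, not the project's actual config values):

```python
import librosa

waveform, sr = librosa.load("clip.wav", sr=16000)  # hypothetical audio file
mfcc_window_size = 512                             # illustrative value, in samples
stride = 128                                       # illustrative value, in samples

mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13,
                            n_fft=mfcc_window_size, hop_length=stride)
```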
Well, unfortunately there is no information on n_fft and hop_length on the page from your comment. Some info can be found here: http://man.hubwiz.com/docset/LibROSA.docset/Contents/Resources/Documents/generated/librosa.core.stft.html, but it is still not clear. So it seems my guess on how to get the exact value of mfcc_window_size is wrong? The main question was why you set those numbers for stride and mfcc_win_size, and how you calculated them... Oh well, I guess I'll just test some of my ideas and different values...
The parameters came from https://arxiv.org/abs/1906.10555 |
It seems the parameters depend on which sampling rate you use. The MFCC window size is the time window multiplied by the sampling rate. Here is another paper's implementation for reference: https://github.com/Rudrabha/Wav2Lip/blob/deeec76ee8dba10cad6ef133e068659faf707f1e/hparams.py#L43
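Written out, that conversion is just a multiplication; the 25 ms / 10 ms values below are a common speech-processing choice, used purely as an example:

```python
sample_rate = 16000
window_seconds = 0.025   # 25 ms analysis window
stride_seconds = 0.010   # 10 ms hop between windows

n_fft = int(window_seconds * sample_rate)       # 400 samples
hop_length = int(stride_seconds * sample_rate)  # 160 samples
```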
Thank you, @eddiecong. That clarifies my question.
Never mind, @DaddyWesker. I am working on the same task as you, writing the inference code for this model in real time. I will share my inference code when I finish; hopefully we can create a PR with an inference function.
@eddiecong I've currently passed the same "window" to both n_fft=window and hop_length=window, and I'm getting exactly the number of features I need for the video.
Hello!
Thanks for sharing your model! Can you tell me, is it possible to run inference on my own mp4 file? I guess I need to extract frames and audio from it as you do with the downloaded data; is that right, or is there another way?