Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New dataset #12

Closed
cinjon opened this issue Jun 24, 2019 · 6 comments
Closed

New dataset #12

cinjon opened this issue Jun 24, 2019 · 6 comments

Comments

@cinjon
Copy link

cinjon commented Jun 24, 2019

Hi there, thanks for releasing your code. I've went through it with the intention of adding a new dataset and, as far as I can tell, the main thing that needs to be done is to generate the video_anno file, which is a large json consisting of:

I understand that the annotations field is meant to be a list of {'label': , 'segment': [start, end]}, but can you verify what the other three are meant to be? It's not clear if duration_second is according to a normalized FPS or if it's just the timestamp in the video. It's also unclear what the difference is between duration_frame and feature_frame.

In what units is the start and end of segment, i.e. is it relative to the actual time in the video or a normalized time?

Additionally, I will not be modifying the video to be 100 frames each. It seems like you did that for ActivityNet but the paper doesn't mention anything similar for Thumos. What was your strategy for Thumos?

Finally, what's the story with video_df in _get_base_data? It seems like it loads the full data in every time. That's 11G uncompressed. Is this right?

@wzmsltw
Copy link
Owner

wzmsltw commented Jun 25, 2019

Hi
‘duration second' is the duration of video in form of second, and 'duration frame' is the orignial number of video frames. During feature extraction, I adopt a 16 frame snippet, thus actually 16*n frame is used during feature extraction. the 'feature_frame' is 16 * feature_len.

In THUMOS-14, I adopt sliding window fashion to prepare data, you can leave you email here then I can send you corresponding codes.

No, in _get_base_data, feature of each video is load separately, not full data.

@wzmsltw wzmsltw closed this as completed Jun 25, 2019
@cinjon
Copy link
Author

cinjon commented Aug 9, 2019

Apologies for the delay, but I don't quite understand everything. It appears that:

  1. duration_second is just the time of the original video.
  2. duration_frame is the number of frames of the original video.

I am fine with those.

For feature_frame, though, I don't get it. Here's an example:

('v_--6bJUbfpnQ',
 {'duration_second': 26.75,
  'duration_frame': 647,
  'annotations': [{'segment': [2.578755070202808, 24.914101404056165],
    'label': 'Drinking beer'}],
  'feature_frame': 624})

In that one, why is feature_frame 624?

I found that ~ featureFrame = len(readData(videoName))*16 in data_process.py, but readData references two csvs that are not otherwise referenced or in the directory. Is the temporal / spatial directories that it is pulling from supposed to be flow and rgb? If so, then why are these concatenated before applying the 16x multiplier?

Overall, I just dont quite get how the faeture_frame works, but it's clearly important for computing the corrected_second in training. If you could clarify this, I'd really appreciate it.

Last, I just want to verify - do I need to change anything in order to run the code AND the trained models on a dataset that has an arbitrary number of frames for each video?

@cinjon
Copy link
Author

cinjon commented Aug 26, 2019

@wzmsltw, friendly bump in case this got lost in the shuffle.

@wzmsltw
Copy link
Owner

wzmsltw commented Aug 27, 2019

@cinjon
Since I extract feature for each 16 frames snippet, so the corresponding frame number used for extracting feature is len(feature)*16 = feature_frame.
corrected_second is adopted for result alignment. But acutally, this alignment has little impact on final result, you can directly use corrected_second = duration_second.

@cinjon
Copy link
Author

cinjon commented Aug 27, 2019

Another question - is it right that this codebase is setup only to work with feature vectors that cover the entire video? As far as I can tell, dataset.py needs to be adjusted in the scenario when the model does not get the entire interpolated video at once like is done with the 100 vectors for the ActivityNet videos in the paper. For example, it appears that the gt_bbox computation in _get_train_label should be changed so that the model is predicting only over the time duration given (say 120 seconds) rather than assuming that the time is over the full video_second.

Is that right or am I misunderstanding something?

(Ok, in that case I am going to ignore feature_frame and just treat it as the same as duration_frame.)

@tobiascz
Copy link

tobiascz commented Sep 25, 2019

@cinjon how is your progress to try this code on THUMOS?

If I understand this correctly I have to extract the snippet level features using TSN (https://github.com/yjxiong/anet2016-cuhk). But the anet2016-cuhk is pretrained on activity net so you first have to finetune the network on THUMOS and then extract the snippet level features from THUMOS and do the TEM, PGM & finally the PEM training? Is this correct?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants