Violence Detection using Dense Multi Head Self-Attention with Bidirectional Convolutional LSTM

This is the code for the paper:

ViolenceNet: Dense Multi Head Self-Attention with Bidirectional Convolutional LSTM for Detecting Violence
Fernando José Rendón Segador, Juan Antonio Álvarez-García, Fernando Enríquez, Oscar Deniz

The paper addresses the problem of detecting violent actions in videos. It analyzes state-of-the-art deep learning models and proposes a new neural network model that aims to match or exceed them.

Abstract

Introducing efficient automatic violence detection in video surveillance or audiovisual content monitoring systems would greatly facilitate the work of closed-circuit television (CCTV) operators, rating agencies or those in charge of monitoring social network content. In this paper we present a new deep learning architecture that uses an adapted version of DenseNet for three dimensions, a multi-head self-attention layer and a bidirectional convolutional long short-term memory (LSTM) module to encode relevant spatio-temporal features and determine whether a video is violent or not. Furthermore, an ablation study of the input frames is carried out, comparing dense optical flow with adjacent frames subtraction and measuring the influence of the attention layer; it shows that the combination of optical flow and the attention mechanism improves results by up to 4.4%. Experiments conducted on four of the most widely used datasets for this problem match or, in some cases, exceed the state of the art, while reducing the number of network parameters needed (4.5 million) and improving efficiency in test accuracy (from 95.6% on the most complex dataset to 100% on the simplest one) and inference time (less than 0.3 s for the longest clips). Finally, to check whether the generated model is able to generalize violence, a cross-dataset analysis is performed, which shows the complexity of this approach: using three datasets to train and testing on the remaining one, the accuracy drops in the worst case to 70.08% and in the best case to 81.51%, which points to future work oriented towards anomaly detection in new datasets.

Requirements

This project is implemented in Python, using the TensorFlow and Keras libraries to develop the model.

OpenCV v4.4.0.46
IPython v7.16.1
Keras v2.4.3
Scikit-Image v0.17.2
Scikit-Learn v0.24.2
TensorFlow v2.5.0
Livelossplot v0.5.3
Matplotlib v3.3.2
NumPy v1.19.2

Clone the repository:

git clone https://github.com/FernandoJRS/violence-detection-deeplearning

Install the requirements:

pip install -r requirements.txt
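To confirm that an environment matches the pinned versions listed under Requirements, a small check along the following lines may help. It is only a sketch: the keys in `PINNED` are assumed PyPI distribution names (for example `opencv-python` for OpenCV), since this README lists library names rather than package names, and `check_versions` is a hypothetical helper that is not part of the repository. It relies on `importlib.metadata`, which needs Python 3.8 or later.

```python
from importlib import metadata

# Pinned versions from the Requirements list. The keys are ASSUMED PyPI
# distribution names; adjust them if requirements.txt uses different ones.
PINNED = {
    "opencv-python": "4.4.0.46",
    "ipython": "7.16.1",
    "Keras": "2.4.3",
    "scikit-image": "0.17.2",
    "scikit-learn": "0.24.2",
    "tensorflow": "2.5.0",
    "livelossplot": "0.5.3",
    "matplotlib": "3.3.2",
    "numpy": "1.19.2",
}


def check_versions(pinned, lookup=metadata.version):
    """Return {package: (expected, found)} for every mismatch.

    `found` is None when the package is not installed. `lookup` can be
    replaced to test the function without touching the real environment.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches
```

Calling `check_versions(PINNED)` returns an empty dictionary when every listed package is installed at exactly the pinned version.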

Model Architecture

The following figure shows the architecture of our model.

Section A of the figure shows the architecture of the ViolenceNet model, which takes the optical flow as input. It is composed of four parts: a DenseNet-121 spatio-temporal encoder, a multi-head self-attention layer, a bidirectional convolutional 2D LSTM (BiConvLSTM2D) layer and a classifier. The number of components in each Dense Block is indicated below it. The variable h is the number of heads used in parallel by the multi-head self-attention layer, and Q, K and V are its inputs. Section B shows the internal architecture of a 5-component DenseBlock (x5).
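To make the roles of h, Q, K and V concrete, here is a minimal NumPy sketch of generic scaled dot-product multi-head self-attention. It is not the layer used in this repository: the input is assumed to be a sequence of T flattened feature vectors, the projection matrices `w_q`, `w_k`, `w_v` and `w_o` are random placeholders, and the function name is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, h):
    """Generic multi-head self-attention over a (T, d_model) sequence.

    Q, K and V are all projections of the same input x (self-attention).
    Returns the (T, d_model) output and the (h, T, T) attention weights.
    """
    t, d_model = x.shape
    if d_model % h != 0:
        raise ValueError("d_model must be divisible by the number of heads")
    d_head = d_model // h

    def split_heads(m):
        # (T, d_model) -> (h, T, d_head): each head sees its own slice.
        return m.reshape(t, h, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Scaled dot-product scores between every pair of time steps, per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)

    heads = weights @ v  # (h, T, d_head)
    concat = heads.transpose(1, 0, 2).reshape(t, d_model)
    return concat @ w_o, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 16))  # 8 time steps, 16 features (made up)
    w = [rng.standard_normal((16, 16)) for _ in range(4)]
    out, attn = multi_head_self_attention(x, *w, h=4)
    print(out.shape, attn.shape)
```

Each of the h heads attends over the whole sequence independently, and the head outputs are concatenated before the final projection, which is why the output keeps the input's shape.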

Results

This section shows the results of the experiments carried out on four datasets: Hockey Fights (HF), Movies Fights (MF), Violent Flows (VF) and Real Life Violence Situations (RLVS).

Ablation Study With Attention Mechanism, 5-Fold Cross-Validation

Dataset	Input	Test Accuracy (with Attention)	Test Accuracy (without Att.)	Test Inference Time (with Attention)	Test Inference Time (without Att.)
HF	Optical Flow	99.20 ± 0.6%	99.00 ± 1.0%	0.1397 ± 0.0024 s	0.1626 ± 0.0034 s
HF	Pseudo-Optical Flow	97.50 ± 1.0%	97.20 ± 1.0%	0.1397 ± 0.0024 s	0.1626 ± 0.0034 s
MF	Optical Flow	100.00 ± 0.0%	100.00 ± 0.0%	0.1916 ± 0.0093 s	0.2019 ± 0.0045 s
MF	Pseudo-Optical Flow	100.00 ± 0.0%	100.00 ± 0.0%	0.1916 ± 0.0093 s	0.2019 ± 0.0045 s
VF	Optical Flow	96.90 ± 0.5%	94.00 ± 1.0%	0.2991 ± 0.0030 s	0.3114 ± 0.0073 s
VF	Pseudo-Optical Flow	94.80 ± 0.5%	92.50 ± 0.5%	0.2991 ± 0.0030 s	0.3114 ± 0.0073 s
RLVS	Optical Flow	95.60 ± 0.6%	93.40 ± 1.0%	0.2767 ± 0.020 s	0.3019 ± 0.0059 s
RLVS	Pseudo-Optical Flow	94.10 ± 0.8%	92.20 ± 0.8%	0.2767 ± 0.020 s	0.3019 ± 0.0059 s
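The abstract describes the Pseudo-Optical Flow input as adjacent frames subtraction. The following NumPy sketch shows that idea only; it assumes grayscale `uint8` frames stacked as a `(T, H, W)` array and keeps the signed difference, which may differ from the preprocessing in the notebook. The function name is hypothetical.

```python
import numpy as np


def pseudo_optical_flow(frames):
    """Subtract each frame from the one that follows it.

    frames: array of shape (T, ...) with T >= 2.
    Returns an int16 array of shape (T - 1, ...) with signed differences.
    """
    frames = np.asarray(frames)
    if frames.ndim < 2 or frames.shape[0] < 2:
        raise ValueError("need at least two frames")
    # Cast first so that uint8 subtraction cannot wrap around.
    signed = frames.astype(np.int16)
    return signed[1:] - signed[:-1]
```

A clip with T frames yields T - 1 difference images, so static regions become zero and only changes between consecutive frames remain.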

Performance Comparison For One Iteration

Dataset	Input	Training Accuracy	Training Loss	Test Accuracy Violence	Test Accuracy Non-Violence	Test Accuracy
HF	Optical Flow	100%	1.20×10−5	99.00%	100.00%	99.50%
HF	Pseudo-Optical Flow	99%	1.35×10−5	97.00%	98.00%	97.50%
MF	Optical Flow	100%	1.18×10−5	100%	100%	100%
MF	Pseudo-Optical Flow	100%	1.19×10−5	100%	100%	100%
VF	Optical Flow	98%	1.50×10−4	97.00%	96.00%	96.50%
VF	Pseudo-Optical Flow	97%	2.94×10−4	95.00%	94.00%	94.50%
RLVS	Optical Flow	97%	3.10×10−4	96.00%	95.00%	95.50%
RLVS	Pseudo-Optical Flow	95%	7.31×10−4	94.00%	93.00%	93.50%
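In every row of the table above, the overall test accuracy equals the mean of the violence and non-violence accuracies, which is what a test split with equally sized classes would produce. The sketch below shows how the three columns relate; the label encoding (1 for violence, 0 for non-violence), the function name and the sample sizes in the usage note are assumptions for illustration.

```python
def per_class_accuracy(y_true, y_pred, positive=1):
    """Return accuracy on the positive class, the negative class and overall.

    y_true, y_pred: equal-length sequences of labels.
    positive: label treated as "violence" (assumed to be 1 here).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pos_hits = [p == t for t, p in zip(y_true, y_pred) if t == positive]
    neg_hits = [p == t for t, p in zip(y_true, y_pred) if t != positive]
    if not pos_hits or not neg_hits:
        raise ValueError("both classes must be present in y_true")
    return {
        "violence": sum(pos_hits) / len(pos_hits),
        "non_violence": sum(neg_hits) / len(neg_hits),
        "overall": (sum(pos_hits) + sum(neg_hits)) / len(y_true),
    }
```

For example, with 100 clips per class, 99 correct violent predictions and 100 correct non-violent predictions give 0.99, 1.0 and 0.995, the same pattern as the HF Optical Flow row.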

Cross-Dataset Experiment, 5-Fold Cross-Validation

Dataset Training	Dataset Testing	Test Accuracy Optical Flow	Test Accuracy Pseudo-Optical Flow
HF	MF	65.18 ± 0.34%	64.86 ± 0.41%
HF	VF	62.56 ± 0.33%	61.22 ± 0.22%
HF	RLVS	58.22 ± 0.24%	57.36 ± 0.22%
MF	HF	54.92 ± 0.33%	53.50 ± 0.12%
MF	VF	52.32 ± 0.34%	51.77 ± 0.30%
MF	RLVS	56.72 ± 0.19%	55.80 ± 0.20%
VF	HF	65.16 ± 0.59%	64.76 ± 0.49%
VF	MF	60.02 ± 0.24%	59.48 ± 0.16%
VF	RLVS	58.76 ± 0.49%	58.32 ± 0.27%
RLVS	HF	69.24 ± 0.27%	68.86 ± 0.14%
RLVS	MF	75.82 ± 0.17%	74.64 ± 0.22%
RLVS	VF	67.84 ± 0.32%	66.68 ± 0.22%
HF + MF + VF	RLVS	70.08 ± 0.19%	69.84 ± 0.14%
HF + MF + RLVS	VF	76.00 ± 0.20%	75.68 ± 0.14%
HF + RLVS + VF	MF	81.51 ± 0.09%	80.49 ± 0.05%
RLVS + MF + VF	HF	79.87 ± 0.33%	78.63 ± 0.01%
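The table above covers two kinds of splits: training on one dataset and testing on another, and training on three datasets while testing on the remaining one. As a sketch, the combinations can be enumerated as follows; `cross_dataset_splits` is a hypothetical helper, and the order in which it yields the leave-one-out rows is not meant to match the table.

```python
from itertools import permutations

DATASETS = ("HF", "MF", "VF", "RLVS")


def cross_dataset_splits(datasets=DATASETS):
    """Return a list of (train_datasets, test_dataset) pairs.

    First every ordered pair of distinct datasets (train on one, test on
    another), then every leave-one-out split (train on all but one).
    """
    splits = [((train,), test) for train, test in permutations(datasets, 2)]
    for test in datasets:
        train = tuple(d for d in datasets if d != test)
        splits.append((train, test))
    return splits


if __name__ == "__main__":
    for train, test in cross_dataset_splits():
        print(" + ".join(train), "->", test)
```

With four datasets this gives 12 pairwise splits plus 4 leave-one-out splits, which matches the 16 rows of the table.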

Evaluation

We provide a Jupyter Notebook with instructions to train and test our model.

To run the application, the datasets must first be downloaded, which requires uploading the kaggle.json file found in this repository.

The following steps must be followed:

First step

Download the kaggle.json file to a local environment.

Second step

Open the Jupyter Notebook.

Third step

Execute the first cell of the Jupyter Notebook.

Fourth step

After executing the first cell, it will ask you to load the kaggle.json file. Upload it.

Fifth step

After the datasets are downloaded, run the remaining cells of the notebook.
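When working outside the notebook's upload prompt, the Kaggle API client conventionally looks for credentials in `~/.kaggle/kaggle.json` and expects the file not to be readable by other users. The sketch below copies the file there and restricts its permissions. `install_kaggle_credentials` is a hypothetical helper and not part of this repository; how the notebook handles the file internally is not shown here.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_credentials(src, home=None):
    """Copy a kaggle.json file to <home>/.kaggle/kaggle.json.

    src:  path to the downloaded kaggle.json file.
    home: base directory; defaults to the current user's home directory.
    Returns the path of the installed file.
    """
    src = Path(src)
    if not src.is_file():
        raise FileNotFoundError(src)
    base = Path(home) if home is not None else Path.home()
    target_dir = base / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copyfile(src, target)
    # Owner read/write only (0o600); has limited effect on Windows.
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target
```

A typical call would be `install_kaggle_credentials("kaggle.json")`, run from the directory where the file was downloaded.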

Citation

This section gives the information to cite the paper: ViolenceNet: Dense Multi-Head Self-Attention with Bidirectional Convolutional LSTM for Detecting Violence (2021), doi: https://doi.org/10.3390/electronics10131601.

@article{rendon2021violencenet,
  title={ViolenceNet: Dense Multi-Head Self-Attention with Bidirectional Convolutional LSTM for Detecting Violence},
  author={Rend{\'o}n-Segador, Fernando J and {\'A}lvarez-Garc{\'\i}a, Juan A and Enr{\'\i}quez, Fernando and Deniz, Oscar},
  journal={Electronics},
  volume={10},
  number={13},
  pages={1601},
  year={2021},
  publisher={Multidisciplinary Digital Publishing Institute}
}

Acknowledgements

This research is partially supported by The Spanish Ministry of Economy and Competitiveness through the project VICTORY (grant no.: TIN2017-82113-C2-1-R).
