# Video summary and visual question answering

In [None]:
import ammico

### Read your data into AMMICO
The ammico package reads one or several input video files from a folder for processing. You can read all videos in a folder, include subfolders via the `recursive` option, and select the file extensions to consider (e.g. "mp4"). The files are located with the ammico function `find_videos`, which accepts the following inputs:

| input key | input type | possible input values |
| --------- | ---------- | --------------------- |
| `path` | `str` | the directory containing the video files (defaults to the location set by environment variable `AMMICO_DATA_HOME`) |
| `pattern` | `str\|list` | the file extensions to consider (defaults to "mp4", "mov", "avi", "mkv", "webm") |
| `recursive` | `bool` | include subdirectories recursively (defaults to `True`) |
| `limit` | `int` | maximum number of files to read (defaults to `5`, for all videos set to `None` or `-1`) |
| `random_seed` | `str` | the random seed for shuffling the videos; applies when only a few videos are read and the selection should be preserved (defaults to `None`) |

The `find_videos` function returns a nested dictionary keyed by file id. Each entry contains the path to the corresponding file and is otherwise empty.
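To make the shape of such a dictionary concrete, the sketch below builds a similar structure with the standard library only. It is an illustration, not ammico's implementation: the helper name `sketch_find_videos`, the use of the file stem as the id, and the `"filename"` key are all assumptions made for this example.

```python
from pathlib import Path


def sketch_find_videos(path, pattern=("mp4", "mov", "avi", "mkv", "webm"),
                       recursive=True, limit=5):
    """Illustrative stand-in for a video search; NOT ammico's own code."""
    if isinstance(pattern, str):
        pattern = [pattern]
    # Normalise extensions so "mp4", ".mp4" and "MP4" are treated alike.
    extensions = {"." + ext.lower().lstrip(".") for ext in pattern}
    root = Path(path)
    candidates = root.rglob("*") if recursive else root.glob("*")
    files = sorted(
        p for p in candidates if p.is_file() and p.suffix.lower() in extensions
    )
    # None or a negative value means "no limit" in this sketch.
    if limit is not None and limit >= 0:
        files = files[:limit]
    # Assumed layout: {file id: {"filename": path}} with nothing else yet.
    # Files sharing a stem would overwrite each other in this simple version.
    return {p.stem: {"filename": str(p)} for p in files}
```

Calling `sketch_find_videos("/insert/your/path/here/", pattern="mp4", limit=-1)` would return one entry per `.mp4` file found under that folder.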

In [None]:
video_dict = ammico.find_videos(
    path="/insert/your/path/here/",  # path to the folder with videos
    limit=-1,  # -1 means no limit on the number of files (see the table above for the default)
    pattern="mp4",  # file extensions to look for
)

### Define all AI models

The cell below loads the model for VQA tasks. By default, it loads a large model on the GPU if your device supports CUDA; otherwise it loads a smaller model on the CPU. You can also specify other settings (e.g., a small model on the GPU).

In [None]:
model = ammico.MultimodalSummaryModel()

The cell below loads the model for audio-to-text extraction, which makes the VQA results more precise. By default, it loads a small model on the GPU (if your device supports CUDA). You can also specify the size of the audio model ("small", "base", "large") or the device ("cuda" or "cpu"). A larger model can improve the transcription of the audio track, but consumes more RAM or VRAM.

In [None]:
audio_model = ammico.model.AudioToTextModel(model_size="small", device="cuda")
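The call above requests `device="cuda"` explicitly, which may not work on a machine without a CUDA-capable GPU. A minimal sketch of a fallback follows, assuming PyTorch is the backend that provides CUDA support; the helper `pick_device` is hypothetical and not part of ammico.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed, so no CUDA check is possible.
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"
```

The result could then be passed as `device=pick_device()` when creating the audio model.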

The cell below creates an object that analyzes videos and generates a summary and/or answers questions using a specific VQA model and an audio model. The audio model is optional, since you may want to analyze only the visual part of a video.

In [None]:
vid_summary_model = ammico.VideoSummaryDetector(
    summary_model=model, audio_model=audio_model, subdict=video_dict
)

### Video Summary

To start your work with videos, you should call the `analyse_videos_from_dict` method.

You can specify what kind of analysis you want to perform with `analysis_type`. `"summary"` will generate a summary for all videos in your dictionary, `"questions"` will prepare answers to your questions for all videos, and `"summary_and_questions"` will do both.



In [None]:
summary_dict = vid_summary_model.analyse_videos_from_dict(analysis_type="summary")

### Video VQA


Besides generating summaries, the same model can be used in VQA mode. To do this, define the questions that will be applied to all videos in your dictionary.

In [None]:
questions = ["What did people in the frame say?"]

In [None]:
vqa_results = vid_summary_model.analyse_videos_from_dict(
    analysis_type="questions",
    list_of_questions=questions,
)
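Because the three `analysis_type` values differ in whether they need questions, it can help to check the arguments before starting a long-running analysis. The sketch below is one possible way to do that. The helper `build_analysis_kwargs` is hypothetical, and it assumes that `"summary_and_questions"` also expects `list_of_questions`.

```python
ANALYSIS_TYPES = ("summary", "questions", "summary_and_questions")


def build_analysis_kwargs(analysis_type, questions=None):
    """Validate the arguments and return them as a keyword dictionary."""
    if analysis_type not in ANALYSIS_TYPES:
        raise ValueError(
            f"analysis_type must be one of {ANALYSIS_TYPES}, got {analysis_type!r}"
        )
    kwargs = {"analysis_type": analysis_type}
    # Assumption: both question modes need at least one question.
    if "questions" in analysis_type:
        cleaned = [q.strip() for q in (questions or []) if q and q.strip()]
        if not cleaned:
            raise ValueError("at least one non-empty question is required")
        kwargs["list_of_questions"] = cleaned
    return kwargs
```

The returned dictionary could then be unpacked into the call, for example `vid_summary_model.analyse_videos_from_dict(**build_analysis_kwargs("questions", questions))`.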