# Analyzing video content with `ammico`

This is a tutorial notebook to get you started with video summarization and visual question answering (VQA).
You can run this notebook on Google Colab, locally, or on your own HPC resource. For production data processing, we recommend running the analysis locally on a GPU-supported machine. You can also use the Colab GPU runtime or purchase additional runtime. Note, however, that Google Colab comes with pre-installed libraries that can lead to dependency conflicts.

The first code cell only runs on Google Colab; on all other machines, you need to create a conda environment first and install `ammico` from the Python Package Index using  
```pip install ammico```  
Alternatively you can install the development version from the GitHub repository  
```pip install git+https://github.com/ssciwr/AMMICO.git```

On Google Colab, select "TPU" as runtime, otherwise the notebook may not run:  

<div style="display: flex; justify-content: space-around; align-items: center;">
  <img src="../_static/select_runtime.png" alt="Select Runtime" style="width: 45%;">
  <img src="../_static/runtime_options.png" alt="Runtime Options" style="width: 45%;">
</div>

Next, uninstall the pre-installed `transformers` and `peft` packages, since these lead to dependency conflicts, and then install `ammico`. To do so, execute the cell below by pressing Shift+Enter.

In [None]:
# Runs only on Google Colab; on other machines this cell does nothing
if "google.colab" in str(get_ipython()):
    # uv is a fast Python package manager, see https://github.com/astral-sh/uv
    %pip install uv
    # Uninstall conflicting packages
    !uv pip uninstall peft transformers
    # Install ammico as the latest version from GitHub, which will pull in the compatible dependencies
    !uv pip install git+https://github.com/ssciwr/ammico.git

Now you need to restart the kernel to load the new dependencies. For this, click on "Runtime -> Restart Session" or press Ctrl+M.

<div style="display: flex; justify-content: space-around; align-items: center;">
  <img src="../_static/restart_session.png" alt="Restart Session" style="width: 45%;">
</div>

Now you are ready to import ammico.

In [None]:
import ammico

This imports all the functionality from `ammico`. To analyze videos, you need to upload them to Google Colab or [connect to your Google Drive](https://colab.research.google.com/notebooks/io.ipynb). To upload files (note that these will not persist beyond the runtime session of the notebook), click on the folder symbol ("Files") in the left navbar and press the upload button.

<div style="display: flex; justify-content: space-around; align-items: center;">
  <img src="../_static/select_files.png" alt="Select Files" style="width: 45%;">
  <img src="../_static/upload_files.png" alt="Upload Files" style="width: 45%;">
</div>



# 1. Read your data into AMMICO

`ammico` reads in files from a directory. You can iterate through directories recursively and filter by file extension. Note that the order of the files may vary between operating systems. Reading in these files creates a dictionary `video_dict`, with one entry per video file, containing the file path and filename. This dictionary is the main data structure that `ammico` operates on, and it is extended successively with each detector run as explained below.

For reading in the files, the ammico function `find_videos` is used, with optional keywords:

| input key | input type | possible input values |
| --------- | ---------- | --------------------- |
| `path` | `str` | the directory containing the video files (defaults to the location set by environment variable `AMMICO_DATA_HOME`) |
| `pattern` | `str\|list` | the file extensions to consider (defaults to "mp4", "mov", "avi", "mkv", "webm") |
| `recursive` | `bool` | include subdirectories recursively (defaults to `True`) |
| `limit` | `int` | maximum number of files to read (defaults to `20`; to read all videos, set to `None` or `-1`) |
| `random_seed` | `int` | the random seed for shuffling the videos; applies when only a few videos are read and the selection should be preserved (defaults to `None`) |

In [None]:
# Define your data path
data_path = "."  # the current directory

# Find files and create the video dictionary
video_dict = ammico.find_videos(
    path=data_path,
    limit=10,  # Limit the number of files to process (optional)
)
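
To make the keywords in the table more concrete, here is a minimal sketch of what such a file discovery step could look like. The helper `find_video_files` and the choice of dictionary keys are hypothetical and written for illustration only; this is not how `ammico.find_videos` is implemented, and the real function may shuffle, key, or limit differently.

```python
import random
import tempfile
from pathlib import Path

VIDEO_EXTENSIONS = ("mp4", "mov", "avi", "mkv", "webm")


def find_video_files(path, pattern=VIDEO_EXTENSIONS, recursive=True, limit=20, random_seed=None):
    """Hypothetical stand-in for illustration; not ammico's implementation."""
    extensions = {pattern.lower()} if isinstance(pattern, str) else {p.lower() for p in pattern}
    root = Path(path)
    candidates = root.rglob("*") if recursive else root.glob("*")
    # Sort so that the result does not depend on the OS-specific listing order
    files = sorted(
        f for f in candidates if f.is_file() and f.suffix.lower().lstrip(".") in extensions
    )
    if random_seed is not None:
        # A fixed seed makes the shuffled selection reproducible
        random.Random(random_seed).shuffle(files)
    if limit is not None and limit >= 0:
        files = files[:limit]
    # One entry per file; the key (relative path without extension) is an assumption
    return {
        f.relative_to(root).with_suffix("").as_posix(): {"filename": str(f)} for f in files
    }


# Demo on a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.mp4", "b.MOV", "notes.txt", "sub/c.webm"):
        target = Path(tmp, name)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    demo_dict = find_video_files(tmp, limit=None)
    print(sorted(demo_dict))  # ['a', 'b', 'sub/c']
```

Sorting before shuffling is what makes a given `random_seed` select the same files on every operating system in this sketch.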

# 2. Run the content analysis: Video summary

As an example, we will create a video caption ("summary") using the [Qwen2.5-VL vision-language model family](https://huggingface.co/collections/Qwen/qwen25-vl). Two model variants are supported:

1. `Qwen2.5-VL-3B-Instruct`, which requires approximately 3 GB of VRAM to load.
2. `Qwen2.5-VL-7B-Instruct` (default), which requires 8.5 GB of VRAM to load.

The optimal length of the video is more than 30s and less than ~2-3 minutes. The former is due to possible inaccuracies with the automated language detection from the audio, which requires sufficient data to be accurate (however, you may also specify the language). The latter is due to the high compute demand for long videos.
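
If you want to screen your videos against this guideline before running the analysis, a small check like the following may help. It is a sketch with hypothetical helper names and is not part of `ammico`; the 180 s upper bound is our reading of "~2-3 minutes". With OpenCV, the frame count and frame rate can be read from a `cv2.VideoCapture` via `cap.get(cv2.CAP_PROP_FRAME_COUNT)` and `cap.get(cv2.CAP_PROP_FPS)`, although some containers report only approximate values.

```python
MIN_SECONDS = 30.0
MAX_SECONDS = 180.0  # assumed upper end of the "~2-3 minutes" guideline


def video_duration_seconds(frame_count, fps):
    """Duration in seconds from a frame count and a frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frame_count / fps


def is_recommended_length(duration_s, min_s=MIN_SECONDS, max_s=MAX_SECONDS):
    """True if the duration lies strictly between the two bounds."""
    return min_s < duration_s < max_s


# Example: 2250 frames at 25 frames per second
duration = video_duration_seconds(frame_count=2250, fps=25.0)
print(duration, is_recommended_length(duration))  # 90.0 True
```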

First, the model needs to be specified and loaded into memory. This will take several minutes.

In [None]:
model = ammico.MultimodalSummaryModel()  # load the default model
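
If you are unsure which variant fits your GPU, the memory figures quoted above can be turned into a simple lookup. The helper `pick_variant` below is hypothetical and only returns a variant name; how a non-default variant is passed to `ammico` is not covered here, so please check the documentation for that. The figures cover loading the model only, so leave some headroom for the analysis itself.

```python
# VRAM needed to load each variant, in GB, as quoted in this tutorial
VARIANT_VRAM_GB = {
    "Qwen2.5-VL-7B-Instruct": 8.5,
    "Qwen2.5-VL-3B-Instruct": 3.0,
}


def pick_variant(free_vram_gb):
    """Return the largest variant that fits, or None (hypothetical helper)."""
    fitting = [name for name, need in VARIANT_VRAM_GB.items() if need <= free_vram_gb]
    return max(fitting, key=VARIANT_VRAM_GB.get) if fitting else None


print(pick_variant(16.0))  # Qwen2.5-VL-7B-Instruct
print(pick_variant(6.0))   # Qwen2.5-VL-3B-Instruct
print(pick_variant(2.0))   # None
```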

To analyze the audio content of the video, `ammico` uses the [WhisperX model family](https://github.com/m-bain/whisperX) for audio transcription, which builds on the Whisper models developed by OpenAI (see the [WhisperX paper](https://arxiv.org/abs/2303.00747)). The available model sizes are:

1. `small`
2. `base`
3. `large`

These models can also detect many languages and provide translations; however, they are more accurate for longer videos.

In [None]:
audio_model = ammico.model.AudioToTextModel(model_size="small", device="cuda")
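
The cell above hard-codes `device="cuda"`, which requires a CUDA-capable GPU. A small guard like the one below picks the device at runtime instead. Note that this is a sketch: whether `AudioToTextModel` also accepts `device="cpu"` is an assumption we have not verified here, and transcription on a CPU would in any case be considerably slower.

```python
def pick_device():
    """Return "cuda" if PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no way to use the GPU from this process
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

You could then pass `device=device` in place of the literal string, provided the class supports the value that was selected.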

Then, we create an instance of the Python class that handles the video summary and visual question answering tasks:

In [None]:
vid_summary_vqa = ammico.VideoSummaryDetector(
    summary_model=model, audio_model=audio_model, subdict=video_dict
)

After this, we can create the video captions. Depending on the length and number of videos and the hardware provided, this can take several minutes.

In [None]:
summary = vid_summary_vqa.analyse_videos_from_dict(analysis_type="summary")

The results are provided in the updated dictionary. Execute the cell below to see the first frame of the video that was analyzed together with the generated caption (summary).

In [None]:
import cv2
import matplotlib.pyplot as plt
from pprint import pprint


def display_first_frame(video_path):
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read first frame from {video_path}")
    # Convert BGR -> RGB for matplotlib
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    plt.figure(figsize=(8, 6))
    plt.imshow(frame_rgb)
    plt.axis("off")
    plt.tight_layout()
    plt.show()


for key in summary.keys():
    # Display the first frame of the video
    video_path = summary[key]["filename"]
    display_first_frame(video_path)
    pprint(summary[key]["summary"], width=100, compact=True)

# 3. Run the content analysis: Visual question answering

You may also ask questions about the videos. For this, provide a list of questions and pass it to the Python class that you have instantiated above. Note that question answering takes longer than video summarization. Ideally, you would carry out both tasks together in one execution, as below:

In [None]:
list_of_questions = [
    "Who is in the picture?",
    "Is Trump in the picture, answer with only yes or no?",
]  # add or replace with your own questions

In [None]:
summary_and_answers = vid_summary_vqa.analyse_videos_from_dict(
    analysis_type="summary_and_questions", list_of_questions=list_of_questions
)

Again, for your convenience, we display the first frame of each video together with the answers to the questions below:

In [None]:
for key in summary_and_answers.keys():
    # Display the first frame of the video
    video_path = summary_and_answers[key]["filename"]
    display_first_frame(video_path)

    for answer in summary_and_answers[key]["vqa_answers"]:
        pprint(answer, width=100, compact=True)
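
When you ask several questions, it can be handy to see each answer next to its question. The sketch below does this on mock data. It assumes that `vqa_answers` holds one answer per question, in the same order as the question list; this ordering is an assumption that you should verify on your own results. The helper name and the mock entries are made up for illustration.

```python
example_questions = [
    "Who is in the picture?",
    "Is the scene indoors, answer with only yes or no?",
]

# Mock result for illustration only; real answers come from the model
mock_results = {
    "video_1": {
        "filename": "video_1.mp4",
        "vqa_answers": ["A person at a desk.", "Yes"],
    },
}


def pair_questions_with_answers(questions, results):
    """Map each question to its answer, assuming matching order (hypothetical helper)."""
    paired = {}
    for key, entry in results.items():
        answers = entry.get("vqa_answers", [])
        if len(answers) != len(questions):
            raise ValueError(f"{key}: expected {len(questions)} answers, got {len(answers)}")
        paired[key] = dict(zip(questions, answers))
    return paired


qa = pair_questions_with_answers(example_questions, mock_results)
for question, answer in qa["video_1"].items():
    print(f"{question} -> {answer}")
```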

# 4. Export the results

To export the results for further processing, convert the video dictionary into a pandas dataframe.

In [None]:
video_df = ammico.get_dataframe(video_dict)

Inspect the dataframe:

In [None]:
video_df.head(5)

Export the dataframe to a csv file:

In [None]:
video_df.to_csv("./data_out.csv")
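
One thing to keep in mind when exporting is that list-valued entries, such as the answers to several questions, end up as a single text cell in a CSV file. The sketch below shows one possible way to keep such cells machine-readable by storing them as JSON. It uses only the standard library and mock data, the helper `to_csv_text` is hypothetical, and it is an alternative to, not a description of, what `ammico.get_dataframe` produces.

```python
import csv
import io
import json

# Mock results for illustration only
mock_export = {
    "video_1": {"filename": "video_1.mp4", "summary": "A short clip.", "vqa_answers": ["yes", "no"]},
    "video_2": {"filename": "video_2.mp4", "summary": "Another clip.", "vqa_answers": ["no", "no"]},
}


def to_csv_text(results):
    """Serialize a nested result dictionary to CSV text (hypothetical helper)."""
    columns = sorted({col for entry in results.values() for col in entry})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["key"] + columns, lineterminator="\n")
    writer.writeheader()
    for key, entry in results.items():
        row = {"key": key}
        for col in columns:
            value = entry.get(col, "")
            # Lists and dicts are stored as JSON so they can be parsed back reliably
            row[col] = json.dumps(value) if isinstance(value, (list, dict)) else value
        writer.writerow(row)
    return buffer.getvalue()


csv_text = to_csv_text(mock_export)
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(json.loads(rows[0]["vqa_answers"]))  # ['yes', 'no']
```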

# 5. Check out further notebooks or create your own!

Congratulations! You have used `ammico` for a video analysis task. Check out [the documentation](https://github.com/ssciwr/AMMICO) for further tutorials on how to extract text from images or analyze image content! Do not hesitate to [get in touch](https://github.com/ssciwr/AMMICO/issues) with questions, feedback or any technical issues!