# Object Detection

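Object detection is the task of identifying and locating one or more object classes in an image or video. Unlike plain image classification, object detection must report both where each object in the image is and which class it belongs to.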

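Object detection typically involves the following steps:

1. **Object recognition**: identify the class of each object in the image, usually with a deep learning model such as a convolutional neural network (CNN).
  
2. **Object localization**: locate each recognized object in the image, usually as a rectangular bounding box.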
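A single detection therefore pairs a class label with a box. As a minimal sketch, assuming the dictionary layout that the helper code later in this notebook reads (`label`, `score`, and a `box` with `xmin`, `ymin`, `xmax`, `ymax` in pixels), the corner coordinates can be turned into a width, height, and area. The function name `box_size` and the sample values are hypothetical and serve only as illustration.

```python
def box_size(detection):
    """Return (width, height, area) in pixels for one detection dict."""
    box = detection["box"]
    width = box["xmax"] - box["xmin"]
    height = box["ymax"] - box["ymin"]
    return width, height, width * height


# Hypothetical detection, invented for illustration only.
sample_detection = {
    "label": "person",
    "score": 0.98,
    "box": {"xmin": 40, "ymin": 20, "xmax": 140, "ymax": 220},
}

width, height, area = box_size(sample_detection)
print(f"{sample_detection['label']}: {width}x{height} px, area {area} px^2")
```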

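Object detection is used in many areas, including but not limited to:

1. **Smart security**: for example, face recognition and detecting people or objects in surveillance video.
  
2. **Autonomous driving**: self-driving vehicles identify and track surrounding vehicles, pedestrians, and traffic signs to drive safely.
  
3. **Retail and logistics**: for example, product detection and counting, and inventory management.
  
4. **Medical image processing**: identifying and detecting lesions, organs, and other structures in medical images.

In short, object detection both recognizes and locates the objects in an image, and its applications range from smart security to medical image processing.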

In [None]:
!gdown '1txG_6i_yAndp_y-liVma8tHPRCIvc1wh' --output lesson8.zip
!unzip lesson8.zip


- If you would like to run this code on your own machine, you can install the following:

```
    !pip install transformers
    !pip install gradio
    !pip install timm
    !pip install inflect
    !pip install phonemizer
```

In [None]:
#!pip install transformers
!pip install -q -U gradio timm inflect phonemizer
!sudo apt-get update
!sudo apt-get install -y espeak-ng && pip install py-espeak-ng

**Note:** `py-espeak-ng` is only available on Linux operating systems.

To run locally on a Linux machine, run the following commands:
```
    sudo apt-get update
    sudo apt-get install espeak-ng
    pip install py-espeak-ng
```

### Build an `object-detection` Pipeline with the 🤗 Transformers Library

- The model was released with the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Carion et al. (2020).

In [None]:
#  @title helper.py
# !cat helper.py
import io
import matplotlib.pyplot as plt
import requests
import inflect
from PIL import Image

def load_image_from_url(url):
    return Image.open(requests.get(url, stream=True).raw)

def render_results_in_image(in_pil_img, in_results):
    plt.figure(figsize=(16, 10))
    plt.imshow(in_pil_img)

    ax = plt.gca()

    for prediction in in_results:

        x, y = prediction['box']['xmin'], prediction['box']['ymin']
        w = prediction['box']['xmax'] - prediction['box']['xmin']
        h = prediction['box']['ymax'] - prediction['box']['ymin']

        ax.add_patch(plt.Rectangle((x, y),
                                   w,
                                   h,
                                   fill=False,
                                   color="green",
                                   linewidth=2))
        ax.text(
           x,
           y,
           f"{prediction['label']}: {round(prediction['score']*100, 1)}%",
           color='red'
        )

    plt.axis("off")

    # Save the modified image to a BytesIO object
    img_buf = io.BytesIO()
    plt.savefig(img_buf, format='png',
                bbox_inches='tight',
                pad_inches=0)
    img_buf.seek(0)
    modified_image = Image.open(img_buf)

    # Close the plot to prevent it from being displayed
    plt.close()

    return modified_image

def summarize_predictions_natural_language(predictions):
    summary = {}
    p = inflect.engine()

    for prediction in predictions:
        label = prediction['label']
        if label in summary:
            summary[label] += 1
        else:
            summary[label] = 1

    result_string = "In this image, there are "
    for i, (label, count) in enumerate(summary.items()):
        count_string = p.number_to_words(count)
        result_string += f"{count_string} {label}"
        if count > 1:
          result_string += "s"

        result_string += " "

        if i == len(summary) - 2:
          result_string += "and "

    # Strip the trailing space (no commas are added above) and end the sentence
    result_string = result_string.rstrip(', ') + "."

    return result_string


##### To ignore warnings #####
import warnings
import logging
from transformers import logging as hf_logging

def ignore_warnings():
    # Ignore specific Python warnings
    warnings.filterwarnings("ignore", message="Some weights of the model checkpoint")
    warnings.filterwarnings("ignore", message="Could not find image processor class")
    warnings.filterwarnings("ignore", message="The `max_size` parameter is deprecated")

    # Adjust logging for libraries using the logging module
    logging.basicConfig(level=logging.ERROR)
    hf_logging.set_verbosity_error()

########

- Here is some code that suppresses warning messages.

In [None]:
from helper import load_image_from_url, render_results_in_image

from transformers.utils import logging
logging.set_verbosity_error()

from helper import ignore_warnings
ignore_warnings()

from transformers import pipeline
od_pipe = pipeline('object-detection', 'facebook/detr-resnet-50')

Info about [facebook/detr-resnet-50](https://huggingface.co/facebook/detr-resnet-50).

Explore the [Hugging Face Hub for more object detection models](https://huggingface.co/models?pipeline_tag=object-detection&sort=trending).

### Use the Pipeline

In [None]:
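from PIL import Image

# Load the sample image
raw_image = Image.open('huggingface_friends.jpg')
# resize() returns a new image and leaves raw_image unchanged; its result
# is not stored here, so the detector below still sees the full-size image
raw_image.resize((569, 491))

# Run the detector; the result is a list with one dict per detected object
pipeline_output = od_pipe(raw_image)

len(pipeline_output)

# Draw the predicted boxes and labels onto the image
processed_image = render_results_in_image(raw_image, pipeline_output)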



- Use the helper function `render_results_in_image` to draw the results returned by the pipeline onto the image.
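Because every prediction carries a `score`, low-confidence boxes can be dropped before they are drawn. The following is a minimal sketch of that idea, not part of the original helper: the function name `filter_by_score`, the `0.9` cutoff, and the sample predictions are all hypothetical, and the dictionary layout is assumed to match what `render_results_in_image` reads.

```python
def filter_by_score(predictions, min_score=0.9):
    """Keep only the predictions whose confidence is at least min_score."""
    return [p for p in predictions if p["score"] >= min_score]


# Hypothetical pipeline output, invented for illustration only.
sample_output = [
    {"score": 0.99, "label": "person",
     "box": {"xmin": 10, "ymin": 20, "xmax": 110, "ymax": 300}},
    {"score": 0.62, "label": "fork",
     "box": {"xmin": 200, "ymin": 150, "xmax": 230, "ymax": 210}},
    {"score": 0.95, "label": "person",
     "box": {"xmin": 300, "ymin": 25, "xmax": 420, "ymax": 310}},
]

confident = filter_by_score(sample_output, min_score=0.9)
for p in confident:
    # Same label format that the helper writes next to each box
    print(f"{p['label']}: {round(p['score'] * 100, 1)}%")
```

In the notebook, the filtered list could then be passed to `render_results_in_image` in place of the full `pipeline_output`.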

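### Use `Gradio` as a Simple Interface

- Use [Gradio](https://www.gradio.app) to create a demo for the object detection app.
- The demo makes the app look friendly and easy to use.
- You can also share the demo with your friends and colleagues.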

In [None]:
import os
import gradio as gr

def get_pipeline_prediction(pil_image):
  outputs = od_pipe(pil_image)
  processed_image = render_results_in_image(pil_image, outputs)
  return processed_image

demo = gr.Interface(
  fn=get_pipeline_prediction,
  inputs=gr.Image(label="Input image",
                  type="pil"),
  outputs=gr.Image(label="Output image",
                   type="pil")
)

demo.launch(share=True)

In [None]:
demo.close()

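### Make an AI-Powered Audio Assistant

- Combine the object detector with a text-to-speech model to describe aloud what is in the image.

- Inspect the output of the object detection pipeline.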

In [None]:
pipeline_output

od_pipe

raw_image = Image.open('huggingface_friends.jpg')
raw_image.resize((284, 245))
from helper import summarize_predictions_natural_language
text = summarize_predictions_natural_language(pipeline_output)
text
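To make the counting step behind that sentence easier to follow, here is a standard-library-only sketch of the same idea. It is an illustrative variant, not the helper's implementation: the name `summarize_counts` is hypothetical, it writes counts as digits rather than words, it pluralizes by naively appending "s", and the sample predictions are invented.

```python
from collections import Counter


def summarize_counts(predictions):
    """Build a one-sentence summary of how many objects of each label exist."""
    # Counter keeps labels in the order they were first seen
    counts = Counter(p["label"] for p in predictions)
    if not counts:
        return "No objects were detected in this image."

    parts = [f"{n} {label}{'s' if n > 1 else ''}" for label, n in counts.items()]
    if len(parts) == 1:
        body = parts[0]
    else:
        # Comma-separate all but the last item, then join the last with "and"
        body = ", ".join(parts[:-1]) + " and " + parts[-1]
    return f"This image contains {body}."


# Hypothetical predictions, invented for illustration only.
sample_predictions = [{"label": "dog"}, {"label": "cat"}, {"label": "dog"}]
print(summarize_counts(sample_predictions))
```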

### Generate Audio Narration of an Image

text-to-speech

kakao-enterprise/vits-ljs

In [None]:
tts_pipe = pipeline('text-to-speech','kakao-enterprise/vits-ljs')
narrated_text = tts_pipe(text)

More info about [kakao-enterprise/vits-ljs](https://huggingface.co/kakao-enterprise/vits-ljs).

### Play the Generated Audio

In [None]:
from IPython.display import Audio as IPythonAudio
IPythonAudio(narrated_text['audio'][0], rate=narrated_text['sampling_rate'])
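Outside a notebook there is no inline audio player, so one option is to save the narration to a WAV file instead. The sketch below uses only the standard library and rests on an assumption that this notebook does not confirm: that the waveform is a mono sequence of floats in the range [-1.0, 1.0]. The function name `write_wav` is hypothetical, and a synthetic tone stands in for the model output.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav(path, samples, sampling_rate):
    """Write mono float samples (assumed in [-1.0, 1.0]) as 16-bit PCM WAV."""
    frames = bytearray()
    for sample in samples:
        # Clip to the assumed range, then scale to a signed 16-bit integer
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(clipped * 32767))

    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))


# Stand-in waveform: half a second of a 440 Hz tone (not model output)
rate = 16000
tone = [0.3 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 2)]

out_path = os.path.join(tempfile.gettempdir(), "narration_demo.wav")
write_wav(out_path, tone, rate)
```

If the assumption about the waveform holds, the notebook's result could presumably be saved with `write_wav("narration.wav", narrated_text["audio"][0], narrated_text["sampling_rate"])`.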