
# Google Cloud Speech-to-Text API speaker_diarization (English only)

- The Google Cloud Speech-to-Text API supports speaker diarization for the en-US, en-IN, and es-ES language codes. (**Korean is not supported.**)
- https://cloud.google.com/speech-to-text/docs/multiple-voices
- The speaker diarization API is in beta.
- google.cloud.**speech_v1p1beta1**.SpeechClient().streaming_recognize(streaming_config, requests)

In [1]:
import io
import os

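# Note: speech.types and speech.enums (used below) were removed in
# google-cloud-speech 2.0, so this notebook assumes an older release.
from google.cloud import speech_v1p1beta1 as speech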

client = speech.SpeechClient()

stream_file = os.path.join(os.getcwd(), 'audio_file', 'commercial_mono.wav')

with io.open(stream_file, 'rb') as audio_file:
    content = audio_file.read()

# In practice, stream should be a generator yielding chunks of audio data.
stream = [content]

requests = (speech.types.StreamingRecognizeRequest(audio_content=chunk)
            for chunk in stream)

config = speech.types.RecognitionConfig(
    encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    language_code='en-US',
    enable_speaker_diarization=True,
    diarization_speaker_count=2,
    enable_automatic_punctuation=True,
    enable_word_time_offsets=True,
    model='default')

streaming_config = speech.types.StreamingRecognitionConfig(config=config)

# streaming_recognize returns an iterator over streaming responses.
responses = client.streaming_recognize(streaming_config, requests)
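
The cell above sends the whole file as a single chunk, while its comment notes that a real stream should be a generator yielding chunks of audio data. A minimal sketch of such a generator follows. The helper name `audio_chunks`, the 4096-byte chunk size, and the in-memory placeholder data are assumptions for illustration, not part of the notebook or the API.

```python
import io


def audio_chunks(source, chunk_size=4096):
    """Yield successive byte chunks from a binary file-like object."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            # An empty read means the end of the stream was reached.
            break
        yield chunk


# Demo with an in-memory stand-in for an opened audio file (placeholder bytes).
fake_audio = io.BytesIO(b"\x00\x01" * 5000)  # 10,000 bytes
chunks = list(audio_chunks(fake_audio, chunk_size=4096))
print([len(c) for c in chunks])  # [4096, 4096, 1808]
```

In the notebook, something like `stream = audio_chunks(audio_file)` could stand in for `stream = [content]`. Because the generator is lazy, the file would need to stay open until the responses have been consumed.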

In [2]:
type(responses)

google.api_core.grpc_helpers._StreamingResponseIterator

In [3]:
for response in responses:
    # Once the transcription has settled, the first result will contain the is_final result. 
    # The other results will be for subsequent portions of the audio.

    for result in response.results:
        print('Finished: {}'.format(result.is_final))
        print('Stability: {}'.format(result.stability))
        alternatives = result.alternatives
        # The alternatives are ordered from most likely to least.
        for alternative in alternatives:
            print('Confidence: {}'.format(alternative.confidence))
            print('Transcript: {}'.format(alternative.transcript))

            # Speaker tags cover every word recognized so far, not just this result.
            print('Speaker (cumulative):', end=" ")
            for word in alternative.words:
                print(word.speaker_tag, end=" ")
            print()

Finished: True
Stability: 0.0
Confidence: 0.9251322746276855
Transcript: Okay, I'm here.
Speaker (cumulative): 2 2 2 
Finished: True
Stability: 0.0
Confidence: 0.9856593608856201
Transcript:  Hi, I'd like to buy a Chromecast and I was wondering whether you could help me with that.
Speaker (cumulative): 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 
Finished: True
Stability: 0.0
Confidence: 0.9087250232696533
Transcript:  Sicily which color would you like we have blue black and red?
Speaker (cumulative): 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 
Finished: True
Stability: 0.0
Confidence: 0.9783868789672852
Transcript:  Let's get the black one.
Speaker (cumulative): 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 
Finished: True
Stability: 0.0
Confidence: 0.9030610918998718
Transcript:  Okay, great. Would you like the new Chromecast Ultra model or the regular compass?
Speaker (cumulative): 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 
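
Each final result in the output above repeats the speaker tags of every word recognized so far, so the word list of the last result should be enough to rebuild the whole conversation. The sketch below groups consecutive words with the same tag into speaker turns. The `Word` namedtuple is a stand-in for the API's word objects (only the `word` and `speaker_tag` fields are used), and both the helper name `speaker_turns` and the sample words are hypothetical.

```python
from collections import namedtuple
from itertools import groupby

# Stand-in for the API's word objects; only the two fields used here.
Word = namedtuple("Word", ["word", "speaker_tag"])


def speaker_turns(words):
    """Group consecutive words that share a speaker_tag into (tag, text) turns."""
    return [
        (tag, " ".join(w.word for w in group))
        for tag, group in groupby(words, key=lambda w: w.speaker_tag)
    ]


# Hypothetical word list shaped like alternative.words of the last final result.
words = [
    Word("Okay,", 2), Word("I'm", 2), Word("here.", 2),
    Word("Hi,", 1), Word("I'd", 1), Word("like", 1),
]

for tag, text in speaker_turns(words):
    print("Speaker {}: {}".format(tag, text))
```

With real responses, the input would presumably be `result.alternatives[0].words` taken from the last result whose `is_final` is true. Earlier tags can be revised in later results, as the first three words in the output above show, so grouping only the final list avoids mixing stale assignments.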