# Local transcription using Whisper
The following setup transcribes interviews locally, without sending potentially sensitive user data to an external service. It relies on separately recorded audio files for the interviewer and the participant, which makes speaker detection easy.

## Prerequisites
A local setup for [Jupyter Notebooks](https://jupyter.org/).

[ffmpeg](https://ffmpeg.org/) must be installed and available on the command line.
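
Whisper calls ffmpeg to decode audio files, so it can help to confirm that the executable is reachable before starting. The following is a minimal sketch; the helper name `find_ffmpeg` is an assumption made for illustration and is not part of Whisper.

```python
import shutil

def find_ffmpeg():
    """Return the path to the ffmpeg executable, or None if it is not on PATH."""
    return shutil.which('ffmpeg')

ffmpegPath = find_ffmpeg()
if ffmpegPath is None:
    print('ffmpeg was not found on PATH; Whisper will not be able to decode audio files.')
else:
    print(f'Using ffmpeg at {ffmpegPath}')
```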

At the time of writing, [Whisper](https://github.com/openai/whisper) only works with Python 3.9 or lower because of a dependency on PyTorch.

## Input data
This setup relies on separate audio files for the interviewer and the participant. The easiest way to generate them is to record with [Zoom](https://zoom.us/) and select "Record a separate audio file for each participant" under Settings, Recording.

When not speaking, microphones should be muted. If possible, rely on video calls and nodding to encourage participants to continue to talk.

## Install Whisper and dependencies

In [None]:
%pip install git+https://github.com/openai/whisper.git

## Set up variables
Configure an interviewer and participant name here.

The audio files will be expected in the `data/<participantName>/` directory.

In [None]:
import whisper
import json
import os

interviewerName = 'Interviewer'
participantName = 'P1'

participantAudio = 'audioP1.m4a'
interviewerAudio = 'audioInterviewer.m4a'
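
Because transcription takes a while, it may be worth checking that both audio files are where the notebook expects them before loading the model. This is a sketch only: `missing_audio_files` and its `dataDir` parameter are hypothetical names, and the file names repeat the example values from the cell above.

```python
import os

def missing_audio_files(participantName, fileNames, dataDir='data'):
    """Return the expected audio paths that do not exist on disk."""
    paths = [os.path.join(dataDir, participantName, name) for name in fileNames]
    return [path for path in paths if not os.path.isfile(path)]

# Same values as in the cell above, repeated so this sketch runs on its own.
missing = missing_audio_files('P1', ['audioP1.m4a', 'audioInterviewer.m4a'])
for path in missing:
    print(f'Missing audio file: {path}')
```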

## Load model
The following code creates an output directory for intermediate data and loads the Whisper model.

A full list of available models can be found [here](https://github.com/openai/whisper#available-models-and-languages).

In [None]:
# makedirs also creates the parent `out/` directory if it does not exist yet
os.makedirs(f'out/{participantName}/', exist_ok=True)

model = whisper.load_model('medium.en')

## Transcribe interview audios using Whisper
Transcribe the participant and interviewer audio separately using Whisper. Transcription with the `medium.en` model takes roughly 50% of the original interview duration.

Intermediate data of the transcription is saved as JSON in `out/<participantName>/`.

In [None]:
interviewee = model.transcribe(audio=f'data/{participantName}/{participantAudio}', verbose=False)

with open(f'out/{participantName}/{participantName}.json', 'w', encoding='utf-8') as f:
    json.dump(interviewee, f, ensure_ascii=False, indent=4)

In [None]:
interviewer = model.transcribe(audio=f'data/{participantName}/{interviewerAudio}', verbose=False)

with open(f'out/{participantName}/{interviewerName}.json', 'w', encoding='utf-8') as f:
    json.dump(interviewer, f, ensure_ascii=False, indent=4)
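
The saved JSON is a dictionary whose `segments` list is what the following cells work with; each segment carries `id`, `start`, `end` and `text`. The sketch below inspects such a result. `summarize_result` is a hypothetical helper, and `exampleResult` is a hand-written stand-in that contains only the keys this notebook uses, not real Whisper output.

```python
def summarize_result(result):
    """Return (number of segments, end time in seconds of the last segment)."""
    segments = result['segments']
    lastEnd = segments[-1]['end'] if segments else 0.0
    return len(segments), lastEnd

# Hand-written stand-in with the keys used by this notebook, not real Whisper output.
exampleResult = {
    'text': ' Hello. How are you?',
    'segments': [
        {'id': 0, 'start': 0.0, 'end': 1.5, 'text': ' Hello.'},
        {'id': 1, 'start': 1.5, 'end': 3.0, 'text': ' How are you?'},
    ],
}

count, lastEnd = summarize_result(exampleResult)
print(f'{count} segments, last segment ends at {lastEnd:.1f}s')
```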

## Clean up and combine segments
Segments from both audio files are combined into one JSON file. If multiple segments by the same speaker follow each other, they are merged into one larger segment and all original ids are preserved. This ensures a clear data trail from the original audio, to the individual transcriptions, to the combined segments.

The combined segments are saved as `out/<participantName>/combined.json`.

In [None]:
import json

# Read with the same encoding that was used when writing the files
with open(f'out/{participantName}/{interviewerName}.json', 'r', encoding='utf-8') as inputInterviewer:
    interviewer = json.load(inputInterviewer)
with open(f'out/{participantName}/{participantName}.json', 'r', encoding='utf-8') as inputInterviewee:
    interviewee = json.load(inputInterviewee)

combinedSegments = []

for segment in interviewer['segments']:
    combinedSegments.append({
        'speaker': interviewerName,
        'text': segment['text'].strip(' '),
        'original_ids': [segment['id']],
        'start': segment['start'],
        'end': segment['end'],
    })

for segment in interviewee['segments']:
    combinedSegments.append({
        'speaker': participantName,
        'text': segment['text'].strip(' '),
        'original_ids': [segment['id']],
        'start': segment['start'],
        'end': segment['end'],
    })

combinedSegments = sorted(combinedSegments, key=lambda d: d['start'])

outputSegments = []
for segment in combinedSegments:
    # Whisper sometimes emits a lone '.' as text when it cannot detect speech
    if segment['text'] == '.':
        continue

    if (len(outputSegments) == 0 or segment['speaker'] != outputSegments[-1]['speaker']):
        outputSegments.append(segment)
    else:
        outputSegments[-1]['original_ids'].append(segment['original_ids'][0])
        outputSegments[-1]['end'] = segment['end']
        outputSegments[-1]['text'] += ' ' + segment['text']

with open(f'out/{participantName}/combined.json', 'w', encoding='utf-8') as f:
    json.dump(outputSegments, f, ensure_ascii=False, indent=4)
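
The merging rule can be illustrated on a small hand-made example. This is a sketch that mirrors the logic of the cell above in a standalone function; `merge_consecutive` and `toySegments` are names invented for this illustration, and the toy data is not taken from a real interview.

```python
def merge_consecutive(segments):
    """Merge neighbouring segments by the same speaker, keeping all original ids."""
    merged = []
    for segment in sorted(segments, key=lambda s: s['start']):
        if segment['text'] == '.':
            continue  # skip placeholder segments without detected speech
        if merged and merged[-1]['speaker'] == segment['speaker']:
            merged[-1]['original_ids'] += segment['original_ids']
            merged[-1]['end'] = segment['end']
            merged[-1]['text'] += ' ' + segment['text']
        else:
            # Copy the segment so the input list is left unchanged
            merged.append({**segment, 'original_ids': list(segment['original_ids'])})
    return merged

toySegments = [
    {'speaker': 'Interviewer', 'text': 'How was it?', 'original_ids': [0], 'start': 0.0, 'end': 2.0},
    {'speaker': 'P1', 'text': 'It went well.', 'original_ids': [0], 'start': 2.5, 'end': 4.0},
    {'speaker': 'P1', 'text': 'Mostly.', 'original_ids': [1], 'start': 4.0, 'end': 5.0},
]

for segment in merge_consecutive(toySegments):
    print(segment['speaker'], segment['original_ids'], segment['text'])
```

The two consecutive `P1` segments end up as a single entry with `original_ids` `[0, 1]`, which is what keeps the link back to the per-speaker JSON files.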

## Create output
Finally, the combined segments are rendered as a Markdown file in `out/<participantName>.md`, including timestamps. Any other output format could be rendered here instead.

In [None]:
output = f'# Interview {participantName}\n\n'

for segment in outputSegments:
    # Truncate to whole seconds first so that rounding can never produce ":60"
    startMin, startSec = divmod(int(segment['start']), 60)
    endMin, endSec = divmod(int(segment['end']), 60)
    output += f'## [{segment["speaker"]}] ({startMin:02d}:{startSec:02d} - {endMin:02d}:{endSec:02d})\n'
    output += f'{segment["text"]}\n\n'

with open(f'out/{participantName}.md', 'w', encoding='utf-8') as outputFile:
    print(output, file=outputFile)

## Manual cleaning
After creating the output Markdown file, copy it to the `cleaned` directory and manually fix any mistakes that occurred during transcription. Zoom also provides a combined audio file of the whole interview that can be used to compare the transcription with the actual audio.

A useful tool is [VLC](https://www.videolan.org/vlc/) with [global hotkeys](https://wiki.videolan.org/VLC_HowTo/Global_hotkeys/) to jump back or forward 10 seconds and to change the playback speed.

## Exporting as PDF
Markdown files can be converted to PDF using [pandoc](https://pandoc.org/). 

An example command that converts `cleaned/P1.md` to `P1.pdf` with 0.5 inch margins is: `pandoc -V geometry:margin=0.5in -o P1.pdf cleaned/P1.md`.
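
When several interviews need to be exported, the same command can be assembled in Python. The sketch below only builds the argument list for the command shown above; `pandoc_command` is a hypothetical helper, and actually running it assumes pandoc is installed.

```python
import os

def pandoc_command(markdownPath, margin='0.5in'):
    """Build the pandoc argument list for converting one Markdown file to PDF."""
    baseName = os.path.splitext(os.path.basename(markdownPath))[0]
    return ['pandoc', '-V', f'geometry:margin={margin}', '-o', f'{baseName}.pdf', markdownPath]

print(' '.join(pandoc_command('cleaned/P1.md')))

# To run the conversion from the notebook (requires pandoc to be installed):
# import subprocess
# subprocess.run(pandoc_command('cleaned/P1.md'), check=True)
```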