<a href="https://colab.research.google.com/github/vitsiupia/projektPython/blob/main/ami_train_val_test_split.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

If we want "accuracy" to be reported during training, we also need validation data.

We assume that our training data consists of transcript-summary pairs.


🤔... But do we really have a summary for every transcript? Let's check!

In [1]:
# Download the data from GitHub.
!wget 'https://github.com/vitsiupia/projektPython/raw/main/ami_meetings_preprocessed.zip'

--2023-05-20 17:21:12--  https://github.com/vitsiupia/projektPython/raw/main/ami_meetings_preprocessed.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/vitsiupia/projektPython/main/ami_meetings_preprocessed.zip [following]
--2023-05-20 17:21:12--  https://raw.githubusercontent.com/vitsiupia/projektPython/main/ami_meetings_preprocessed.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1038298 (1014K) [application/zip]
Saving to: ‘ami_meetings_preprocessed.zip’


2023-05-20 17:21:12 (90.0 MB/s) - ‘ami_meetings_preprocessed.zip’ saved [1038298/1038298]



In [2]:
import zipfile
import os
import shutil
import random

# Fix the seed so that results are stable across multiple runs.
SEED = 13
random.seed(SEED)

In [3]:
# Extract the data.
with zipfile.ZipFile('ami_meetings_preprocessed.zip') as zip_ref:
    zip_ref.extractall()

In [4]:
transcripts_dir = 'ami_meetings_preprocessed/transcripts/'
abstractive_dir = 'ami_meetings_preprocessed/abstractive/'

# File names.
transcripts_files = set(os.listdir(transcripts_dir))
abstractive_files = set(os.listdir(abstractive_dir))

# Meeting IDs only (the part of each file name before the first dot).
transcripts_files_shortened = {filename.split('.')[0] for filename in transcripts_files}
abstractive_files_shortened = {filename.split('.')[0] for filename in abstractive_files}

# Transcripts that have no summary, and summaries that have no transcript.
missing_transcripts = transcripts_files_shortened - abstractive_files_shortened
missing_abstractive = abstractive_files_shortened - transcripts_files_shortened

if not missing_transcripts and not missing_abstractive:
    print("Every transcript file has a corresponding summary, and vice versa.")
else:
    if missing_transcripts:
        print("No summary for the following transcript files:")
        for filename in missing_transcripts:
            print(filename)

    if missing_abstractive:
        print("No transcript for the following summary files:")
        for filename in missing_abstractive:
            print(filename)

No summary for the following transcript files:
IN1005
EN2001b
IN1013
IN1008
IN1001
IN1009
EN2002a
IN1007
EN2006b
EN2006a
EN2003a
EN2002d
EN2001e
IB4001
EN2009c
EN2002c
IN1012
IN1014
IN1002
EN2002b
IB4004
EN2001d
EN2009d
EN2004a
IB4002
EN2009b
EN2005a
EN2001a
IN1016


In [5]:
len(os.listdir(transcripts_dir))

171

In [6]:
len(missing_transcripts)

29

Now everything is clear! These 29 transcripts without summaries will serve us later for testing the trained model.
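
The same check can be phrased as a small helper that works on plain lists of file names, which makes it easy to try out without touching the disk. This is only a minimal sketch: the function name `partition_by_summary` and the sample file names are made up for illustration, and it assumes the `.transcript.txt` / `.abssumm.txt` naming used in this notebook.

```python
def partition_by_summary(transcript_files, summary_files):
    """Split transcript file names into those with and without a summary."""
    # Meeting ID = part of the file name before the first dot.
    summary_ids = {name.split('.')[0] for name in summary_files}
    paired, unpaired = [], []
    for name in sorted(transcript_files):
        meeting_id = name.split('.')[0]
        if meeting_id in summary_ids:
            paired.append(name)
        else:
            unpaired.append(name)
    return paired, unpaired


# Hypothetical file names, for illustration only.
paired, unpaired = partition_by_summary(
    ['XX0001a.transcript.txt', 'XX0002b.transcript.txt'],
    ['XX0001a.abssumm.txt'],
)
print("train/val candidates:", paired)
print("test candidates:", unpaired)
```

Transcripts in `paired` can be used for training and validation; those in `unpaired` have no reference summary, so they are only usable as test inputs.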

In [7]:
print("Number of examples to split into training and validation sets:")
len(os.listdir(transcripts_dir)) - len(missing_transcripts)

Number of examples to split into training and validation sets:


142

In [8]:
transcripts_folder = 'ami_meetings_preprocessed/transcripts'
summaries_folder = 'ami_meetings_preprocessed/abstractive'
test_folder = 'meetings_split/test'
train_folder = 'meetings_split/train'
val_folder = 'meetings_split/val'

# Create directories.
for directory in [train_folder + '/transcripts', train_folder + '/summaries',
                  val_folder + '/transcripts', val_folder + '/summaries',
                  test_folder + '/transcripts']:
    os.makedirs(directory, exist_ok=True)

In [9]:
# Go over transcripts and move transcripts without summaries to the test/transcripts folder. 
for transcript_file in os.listdir(transcripts_folder):
    summary_file = transcript_file.replace('.transcript.txt', '.abssumm.txt')    # respective summary file
    # If the summary doesn't exist, then use the transcript for testing and move it to test/transcripts folder.
    if not os.path.exists(os.path.join(summaries_folder, summary_file)):  
        shutil.move(os.path.join(transcripts_folder, transcript_file), os.path.join(test_folder, 'transcripts', transcript_file))

In [10]:
# All transcript files left after moving transcripts used for testing.
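# Sorted first: os.listdir returns names in arbitrary order, and a seeded shuffle is only reproducible for a fixed input order.
transcript_files = sorted(f for f in os.listdir(transcripts_folder) if f.endswith('.transcript.txt'))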
random.shuffle(transcript_files)

# Train/Val Split
train_size = int(0.85 * len(transcript_files))
train_files = transcript_files[:train_size]
val_files = transcript_files[train_size:]

for transcript_file in train_files:
    shutil.move(os.path.join(transcripts_folder, transcript_file), os.path.join(train_folder, 'transcripts', transcript_file))
    summary_file = transcript_file.replace('.transcript.txt', '.abssumm.txt')   # respective summary file.
    shutil.move(os.path.join(summaries_folder, summary_file), os.path.join(train_folder, 'summaries', summary_file))

for transcript_file in val_files:
    shutil.move(os.path.join(transcripts_folder, transcript_file), os.path.join(val_folder, 'transcripts', transcript_file))
    summary_file = transcript_file.replace('.transcript.txt', '.abssumm.txt')   # respective summary file.
    shutil.move(os.path.join(summaries_folder, summary_file), os.path.join(val_folder, 'summaries', summary_file))
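
The split logic above can also be isolated into a pure function, which is easier to check than code that moves files. The sketch below is an assumption-laden illustration, not the notebook's actual cell: `split_train_val` and the placeholder meeting names are hypothetical, and it uses a local `random.Random` instance instead of the global seed.

```python
import random


def split_train_val(names, train_fraction=0.85, seed=13):
    """Return (train, val) lists; deterministic for a given seed and set of names."""
    ordered = sorted(names)        # fixed starting order, independent of the input order
    rng = random.Random(seed)      # local generator, leaves the global random state alone
    rng.shuffle(ordered)
    train_size = int(train_fraction * len(ordered))
    return ordered[:train_size], ordered[train_size:]


# Placeholder names standing in for the 142 paired transcripts.
names = [f"meeting_{i:03d}" for i in range(142)]
train, val = split_train_val(names)
print(len(train), len(val))        # 120 22, since int(0.85 * 142) == 120
```

With 142 paired examples and an 85% training fraction, this gives 120 training and 22 validation examples, and the two sets never overlap.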

In [11]:
for folder_path in ["meetings_split/train/transcripts", "meetings_split/train/summaries",
                    "meetings_split/val/transcripts", "meetings_split/val/summaries", "meetings_split/test/transcripts"]:
    file_count = len(os.listdir(folder_path))
    print(f"Number of files in folder '{folder_path}': {file_count}")

Number of files in folder 'meetings_split/train/transcripts': 120
Number of files in folder 'meetings_split/train/summaries': 120
Number of files in folder 'meetings_split/val/transcripts': 22
Number of files in folder 'meetings_split/val/summaries': 22
Number of files in folder 'meetings_split/test/transcripts': 29
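
Matching counts are a good sign, but they do not prove that every transcript sits next to its own summary. A possible extra check is sketched below; the helper `unpaired_transcripts` is hypothetical, and it assumes the `transcripts/` and `summaries/` subfolders and the file suffixes used in this notebook.

```python
import os


def unpaired_transcripts(split_dir):
    """List transcripts in split_dir/transcripts that have no file in split_dir/summaries."""
    transcripts = os.listdir(os.path.join(split_dir, 'transcripts'))
    summaries = set(os.listdir(os.path.join(split_dir, 'summaries')))
    return sorted(
        name for name in transcripts
        if name.replace('.transcript.txt', '.abssumm.txt') not in summaries
    )


for split_dir in ['meetings_split/train', 'meetings_split/val']:
    if os.path.isdir(split_dir):   # skipped when the folders do not exist
        print(split_dir, unpaired_transcripts(split_dir))
```

An empty list for both folders means that each transcript has its respective summary in the same split.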


OK, now let's export this to a zip archive and download it.

In [12]:
def zipdir(path, ziph):
    """Add every file under `path` to the open zip file handle `ziph`."""
    for root, dirs, files in os.walk(path):
        for file in files:
            ziph.write(os.path.join(root, file))

# The context manager closes the archive even if writing fails.
with zipfile.ZipFile('meetings_split.zip', 'w', zipfile.ZIP_DEFLATED) as zipf:
    zipdir('meetings_split', zipf)
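
Before downloading, it may be worth confirming that the archive contains what we expect. The following is a minimal sketch using only the standard `zipfile` module; the helper name `count_by_folder` is an assumption introduced here for illustration.

```python
import os
import zipfile
from collections import Counter


def count_by_folder(zip_path):
    """Count archive members per parent folder (zip member names always use '/')."""
    with zipfile.ZipFile(zip_path) as archive:
        names = [n for n in archive.namelist() if not n.endswith('/')]
    return Counter(n.rsplit('/', 1)[0] if '/' in n else '' for n in names)


if os.path.exists('meetings_split.zip'):   # skipped when the archive is absent
    for folder, count in sorted(count_by_folder('meetings_split.zip').items()):
        print(folder, count)
```

The per-folder counts should agree with the numbers printed for the `meetings_split` folders earlier.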