<a href="https://colab.research.google.com/github/vitsiupia/projektPython/blob/main/ami_train_val_test_split.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

If we want "accuracy" to be reported during training, we also need validation data.

We assume that our training data consists of transcript-summary pairs.


🤔... But do we really have a summary for every transcript? Let's check!

In [1]:
# Download the data from GitHub
!wget 'https://github.com/vitsiupia/projektPython/raw/main/ami_meetings_preprocessed.zip'

--2023-05-14 19:29:34--  https://github.com/vitsiupia/projektPython/raw/main/ami_meetings_preprocessed.zip
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/vitsiupia/projektPython/main/ami_meetings_preprocessed.zip [following]
--2023-05-14 19:29:34--  https://raw.githubusercontent.com/vitsiupia/projektPython/main/ami_meetings_preprocessed.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1038298 (1014K) [application/zip]
Saving to: ‘ami_meetings_preprocessed.zip’


2023-05-14 19:29:34 (18.7 MB/s) - ‘ami_meetings_preprocessed.zip’ saved [1038298/1038298]



In [2]:
# Unpack the archive
import zipfile

# The context manager closes the archive even if extraction fails
with zipfile.ZipFile('ami_meetings_preprocessed.zip') as zip_ref:
    zip_ref.extractall()

In [3]:
import os

transcripts_dir = 'ami_meetings_preprocessed/transcripts/'
abstractive_dir = 'ami_meetings_preprocessed/abstractive/'

transcripts_files = set(os.listdir(transcripts_dir))
abstractive_files = set(os.listdir(abstractive_dir))

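# Keep only the meeting ID, i.e. the part of the file name before the first dot
transcripts_files_shortened = {filename.split('.')[0] for filename in transcripts_files}
abstractive_files_shortened = {filename.split('.')[0] for filename in abstractive_files}

# missing_transcripts: meeting IDs that have a transcript but no summary
missing_transcripts = transcripts_files_shortened - abstractive_files_shortened
# missing_abstractive: meeting IDs that have a summary but no transcript
missing_abstractive = abstractive_files_shortened - transcripts_files_shortened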

if len(missing_transcripts) == 0 and len(missing_abstractive) == 0:
    print("Every transcript file has a matching summary, and vice versa.")
else:
    if len(missing_transcripts) > 0:
        print("No summary for the following transcript files:")
        for filename in missing_transcripts:
            print(filename)

    if len(missing_abstractive) > 0:
        print("No transcript for the following summary files:")
        for filename in missing_abstractive:
            print(filename)

No summary for the following transcript files:
IN1009
EN2001b
EN2002d
IN1012
IB4001
IB4004
IN1001
EN2006a
EN2004a
EN2009c
EN2001a
IB4002
EN2009b
EN2009d
EN2005a
EN2002b
IN1002
IN1008
EN2002c
IN1005
IN1007
EN2003a
IN1013
EN2001d
EN2001e
IN1014
EN2006b
IN1016
EN2002a


In [4]:
len(os.listdir(transcripts_dir))

171

In [5]:
len(missing_transcripts)

29

Now everything is clear! The 29 transcripts without a summary (out of 171) will later serve to test the trained model.
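The same pairing check can be sketched as a small pure function that works on lists of file names instead of directories. This is only an illustration: `partition_by_summary` is a hypothetical helper, not part of the notebook, `XX0001a` is a made-up meeting ID, and the `.transcript.txt` and `.abssumm.txt` suffixes are assumed to apply to every file.

```python
def partition_by_summary(transcript_names, summary_names):
    """Split meeting IDs into those with and without a summary (hypothetical helper)."""
    transcript_suffix = '.transcript.txt'
    summary_suffix = '.abssumm.txt'
    # Strip the known suffixes to obtain bare meeting IDs
    transcript_ids = {n[:-len(transcript_suffix)] for n in transcript_names
                      if n.endswith(transcript_suffix)}
    summary_ids = {n[:-len(summary_suffix)] for n in summary_names
                   if n.endswith(summary_suffix)}
    paired = sorted(transcript_ids & summary_ids)    # usable for train/val
    unpaired = sorted(transcript_ids - summary_ids)  # candidates for the test set
    return paired, unpaired

# Illustrative names only; 'XX0001a' is made up
paired, unpaired = partition_by_summary(
    ['XX0001a.transcript.txt', 'IN1009.transcript.txt'],
    ['XX0001a.abssumm.txt'],
)
print(paired)    # ['XX0001a']
print(unpaired)  # ['IN1009']
```

Sorting the results makes the output order stable, unlike iterating over a set directly.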

In [6]:
print("This is how many items we can split into training and validation data:")
len(os.listdir(transcripts_dir)) - len(missing_transcripts)

This is how many items we can split into training and validation data:


142
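As a quick arithmetic check, and assuming the 0.85 training fraction that the split code in this notebook uses, the set sizes can be worked out from the counts above. The literal numbers below are copied from the cell outputs, so this is a sketch rather than a measurement.

```python
n_transcripts = 171   # total transcripts (output of cell 4)
n_unpaired = 29       # transcripts without a summary (output of cell 5)

n_paired = n_transcripts - n_unpaired
train_size = int(0.85 * n_paired)   # int() truncates 120.7 down to 120
val_size = n_paired - train_size

print(n_paired, train_size, val_size)  # 142 120 22
```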

In [7]:
import os
import shutil

transcripts_folder = 'ami_meetings_preprocessed/transcripts'
summaries_folder = 'ami_meetings_preprocessed/abstractive'
test_folder = 'meetings_split/test'
train_folder = 'meetings_split/train'
val_folder = 'meetings_split/val'

# create test folder and move transcripts without summaries
os.makedirs(test_folder, exist_ok=True)
for file in os.listdir(transcripts_folder):
    if not file.endswith('.transcript.txt'):
        continue
    summary_file = file.replace('.transcript.txt', '.abssumm.txt')
    if not os.path.exists(os.path.join(summaries_folder, summary_file)):
        shutil.move(os.path.join(transcripts_folder, file), os.path.join(test_folder, file))

# create train and val folders and move remaining transcripts
os.makedirs(train_folder, exist_ok=True)
os.makedirs(val_folder, exist_ok=True)
transcript_files = [f for f in os.listdir(transcripts_folder) if f.endswith('.transcript.txt')]
train_size = int(0.85 * len(transcript_files))
train_files = transcript_files[:train_size]
val_files = transcript_files[train_size:]

for file in train_files:
    shutil.move(os.path.join(transcripts_folder, file), os.path.join(train_folder, file))
    summary_file = file.replace('.transcript.txt', '.abssumm.txt')
    shutil.move(os.path.join(summaries_folder, summary_file), os.path.join(train_folder, summary_file))

for file in val_files:
    shutil.move(os.path.join(transcripts_folder, file), os.path.join(val_folder, file))
    summary_file = file.replace('.transcript.txt', '.abssumm.txt')
    shutil.move(os.path.join(summaries_folder, summary_file), os.path.join(val_folder, summary_file))
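Note that the split above takes the files in whatever order `os.listdir` returns them, which Python documents as arbitrary. If a reproducible, shuffled split is wanted, one possible variant is sketched below. The helper `split_train_val`, the seed value, and the generated file names are all assumptions made for illustration; the notebook itself does not shuffle.

```python
import random

def split_train_val(names, train_fraction=0.85, seed=42):
    """Return (train, val) lists after a reproducible shuffle (hypothetical helper)."""
    ordered = sorted(names)                # remove dependence on listing order
    random.Random(seed).shuffle(ordered)   # seeded, so reruns give the same split
    train_size = int(train_fraction * len(ordered))
    return ordered[:train_size], ordered[train_size:]

# Made-up file names standing in for the 142 paired transcripts
names = [f'meeting_{i:03d}.transcript.txt' for i in range(142)]
train_files, val_files = split_train_val(names)
print(len(train_files), len(val_files))  # 120 22
```

Using a dedicated `random.Random(seed)` instance keeps the shuffle independent of any other code that touches the global random state.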

OK, now let's export this to a zip archive and download it.

In [10]:
import os
import zipfile

def zipdir(path, ziph):
    """Add every file under `path` to the open ZipFile handle `ziph`."""
    for root, dirs, files in os.walk(path):
        for file in files:
            ziph.write(os.path.join(root, file))

# The context manager closes the archive even if writing fails
with zipfile.ZipFile('meetings_split.zip', 'w', zipfile.ZIP_DEFLATED) as zipf:
    zipdir('meetings_split', zipf)
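Before downloading, it may be worth checking that the archive contains what we expect. The sketch below counts entries per split folder; `count_by_split` is a hypothetical helper, and the demo runs on a tiny throwaway archive with made-up entries so that it is self-contained.

```python
import os
import tempfile
import zipfile
from collections import Counter

def count_by_split(zip_path):
    """Count archive entries per split folder, e.g. 'train', 'val', 'test'."""
    counts = Counter()
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            parts = name.split('/')
            if len(parts) >= 3:   # expected layout: meetings_split/<split>/<file>
                counts[parts[1]] += 1
    return dict(counts)

# Self-contained demo on a temporary archive with made-up entries
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, 'demo.zip')
    with zipfile.ZipFile(demo_zip, 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.writestr('meetings_split/train/a.transcript.txt', 'text')
        zf.writestr('meetings_split/train/a.abssumm.txt', 'summary')
        zf.writestr('meetings_split/test/b.transcript.txt', 'text')
    demo_counts = count_by_split(demo_zip)

print(demo_counts)  # {'train': 2, 'test': 1}
```

Calling `count_by_split('meetings_split.zip')` on the real archive should, if the counts earlier in this notebook hold and every file follows the expected naming, show roughly 240 entries for `train` (120 pairs), 44 for `val` (22 pairs), and 29 for `test`.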