<a href="https://colab.research.google.com/github/shiwangi27/googlecolab/blob/main/Combine_multiple_datasets_into_one_Hindi_OpenLSR_and_Common_Voice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## How to make use of 🤗 Datasets' *concatenate_datasets* function to combine multiple datasets into one

This short Google Colab notebook explains how to combine multiple datasets into one using the convenient `concatenate_datasets(...)` function of 🤗 Datasets.

Let's assume you would like to train your speech recognition model on the Hindi portion of the [Common Voice](https://huggingface.co/datasets/common_voice) dataset and you would like to use additional Hindi training data from OpenSLR, which comes as audio files plus plain-text transcription files.

First, let's install datasets.

In [None]:
%%capture 
!pip install datasets==1.5

Next, we load the Hindi training and validation splits of Common Voice and use them together as the training dataset.

In [None]:
from datasets import load_dataset

common_voice_train = load_dataset("common_voice", "hi", split="train+validation")

Reusing dataset common_voice (/root/.cache/huggingface/datasets/common_voice/hi/6.1.0/0041e06ab061b91d0a23234a2221e87970a19cf3a81b20901474cffffeb7869f)


For speech recognition we only need to keep the path to the audio file and the transcription. So we will remove all other columns.

In [None]:
common_voice_train = common_voice_train.remove_columns(['client_id', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'])
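Rather than listing every unwanted column by hand, the list can also be derived from the columns we want to keep. The following is a minimal sketch on a plain list of names; `KEEP` and `columns_to_drop` are hypothetical names introduced here, and the column order in `cv_columns` is only illustrative.

```python
# Columns needed for speech recognition; everything else can be dropped.
KEEP = {"path", "sentence"}

def columns_to_drop(column_names, keep=KEEP):
    """Return the names in `column_names` that are not in `keep`, in order."""
    return [name for name in column_names if name not in keep]

# Stand-in for common_voice_train.column_names (order is illustrative).
cv_columns = ["client_id", "path", "sentence", "up_votes", "down_votes",
              "age", "gender", "accent", "locale", "segment"]
to_drop = columns_to_drop(cv_columns)
print(to_drop)
```

In the notebook this would be used as `common_voice_train.remove_columns(columns_to_drop(common_voice_train.column_names))`.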

Load OpenSLR data downloaded from:

- Train Data: https://www.openslr.org/resources/103/Hindi_train.zip 
- Test Data: https://www.openslr.org/resources/103/Hindi_test.zip

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


I have already saved the data to my Drive as a zip archive, which extracts to the folder `OpenSLR_Hindi_Data/`.

In [None]:
!unzip /content/drive/MyDrive/Projects/OpenSLR_Hindi_Data.zip -d .

Streaming output truncated to the last 5000 lines.
  inflating: ./OpenSLR_Hindi_Data/train/audio/5597_076.wav  
  inflating: ./OpenSLR_Hindi_Data/train/audio/5333_050.wav  
  inflating: ./OpenSLR_Hindi_Data/train/audio/0715_079.wav  
  inflating: ./OpenSLR_Hindi_Data/train/audio/1170_026.wav  
  inflating: ./OpenSLR_Hindi_Data/train/audio/1091_018.wav  
  inflating: ./OpenSLR_Hindi_Data/train/audio/2671_076.wav  
  inflating: ./OpenSLR_Hindi_Data/train/audio/1678_002.wav  
  inflating: ./OpenSLR_Hindi_Data/train/audio/0924_046.wav  
  inflating: ./OpenSLR_Hindi_Data/train/audio/0592_080.wav  
  inflating: ./OpenSLR_Hindi_Data/train/audio/2379_027.wav  
  inflating: ./OpenSLR_Hindi_Data/train/audio/2488_092.wav  
  inflating: ./OpenSLR_Hindi_Data/train/audio/1134_038.wav  
  inflating: ./OpenSLR_Hindi_Data/train/audio/4646_051.wav  
  inflating: ./OpenSLR_Hindi_Data/train/audio/4526_019.wav  
  inflating: ./OpenSLR_Hindi_Data/train/audio/1869_038.wav  
  inflating: ./OpenS

Load the transcripts into pandas DataFrames.

In [None]:
import pandas as pd

In [None]:
train_data = pd.read_csv('OpenSLR_Hindi_Data/train/transcription.txt',header = None)
train_data.columns = ["label"]

test_data = pd.read_csv('OpenSLR_Hindi_Data/test/transcription.txt',header = None)
test_data.columns = ["label"] 




The OpenSLR transcription files have no header row. Each line holds a filename followed by its transcript, so we split each line at the first space into two columns: `filename` and `transcripts`.


In [None]:
# Split only at the first space (n=1) so the transcript text stays intact.
train_data[['filename','transcripts']] = train_data["label"].str.split(" ", n=1, expand=True)
test_data[['filename','transcripts']] = test_data["label"].str.split(" ", n=1, expand=True)
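To make the splitting step concrete, here is a small standard-library sketch of what happens to a single line. It assumes the `<id> <transcript>` layout described above; `split_transcript_line` is a hypothetical helper, and the example line is assembled from an ID and sentence that appear in this notebook's output.

```python
def split_transcript_line(line):
    """Split '<id> <transcript>' at the first run of whitespace (space or tab)."""
    parts = line.strip().split(maxsplit=1)
    if not parts:
        # Empty line: nothing to split.
        return "", ""
    filename = parts[0]
    transcript = parts[1] if len(parts) > 1 else ""
    return filename, transcript

# Example line in the assumed layout.
example = "0071_051 अभी नहीं बेटा कल पहन लेना"
filename, transcript = split_transcript_line(example)
print(filename)
print(transcript)
```

Because `str.split(maxsplit=1)` without a separator splits on any whitespace, this variant would handle a tab-separated file as well as a space-separated one.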

In [None]:
# Drop label column
train_data = train_data[["filename","transcripts"]]
test_data = test_data[["filename","transcripts"]]

print("(Train Shape, Test Shape): (%s, %s)" % (train_data.shape, test_data.shape))

(Train Shape, Test Shape): ((99925, 2), (3843, 2))


Next, we check how many distinct transcripts the data contains and sample a subset of examples for training.



In [None]:
unique_train_data = train_data["transcripts"].unique()
unique_train_data.shape

(4506,)
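The counts above imply heavy repetition: 99,925 rows but only 4,506 distinct transcripts. A rough sketch of how the repetition could be inspected with the standard library follows; the `transcripts` list is a made-up stand-in for `train_data["transcripts"].tolist()`.

```python
from collections import Counter

# Toy stand-in for train_data["transcripts"].tolist()
transcripts = [
    "यह है मोटा राजा",
    "लेकिन दादाजी भड़क गए",
    "यह है मोटा राजा",
    "चंपा की कामना थी कि",
    "यह है मोटा राजा",
    "लेकिन दादाजी भड़क गए",
]

counts = Counter(transcripts)
n_rows = len(transcripts)
n_unique = len(counts)
print(n_rows, n_unique)

# Most frequently repeated sentences first.
for sentence, count in counts.most_common(2):
    print(count, sentence)

# Same ratio for the figures reported in this notebook:
# average number of recordings per distinct sentence.
avg_repeats = 99925 / 4506
print(round(avg_repeats, 1))
```

On the real data, this kind of count shows which sentences dominate the training set, which may matter when deciding how to sample.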

In [None]:
unique_train_data

array(['यह है मोटा राजा', 'मोटे राजा का है दुबला कुत्ता',
       'मोटा राजा व दुबला कुत्ता घूमने निकले', ...,
       'ऐ आबएरूदएगंगा वह दिन हैं याद तुझको',
       'उतरा तिरे किनारे जब कारवाँ हमारा',
       'मज़्हब नहीं सिखाता आपस में बैर रखना'], dtype=object)

In [None]:
train_data_sampled = train_data.sample(n=20000, axis=0, replace=False, random_state=42)
train_data_sampled

| | filename | transcripts |
|---|---|---|
| 22186 | train/audio/1498_092.wav | कहतेकहते वह पेड़ पर खड़े हो गए |
| 833 | train/audio/0071_051.wav | अभी नहीं बेटा कल पहन लेना |
| 78293 | train/audio/4698_035.wav | पर आज उसे देखकर उनके प्राण सूख गये |
| 63160 | train/audio/3760_034.wav | और गांधी को जेल से रिहा करने |
| 54622 | train/audio/3340_002.wav | चंपा की कामना थी कि |
| ... | ... | ... |
| 55143 | train/audio/3363_092.wav | महानाविक कब तक आयेंगे बाहर पूछो तो |
| 20214 | train/audio/1364_013.wav | लेकिन दादाजी भड़क गए |
| 31380 | train/audio/2030_097.wav | महाराज की बात सुनकर जादूगर बोला |
| 22160 | train/audio/1498_002.wav | कहतेकहते वह पेड़ पर खड़े हो गए |


In [None]:
train_data_sampled["transcripts"].unique().shape

(4356,)

In [None]:
unique_test_data = test_data["transcripts"].unique()
unique_test_data.shape

(386,)

In [None]:
unique_test_data

array(['और अपने पेट को माँ की स्वादिष्ट गरमगरम जलेबियाँ हड़पते',
       'मुनिया ने उन्हें मछली पकड़ने की बंसीे ले कर जाते हुए देखा',
       'दो मछलियाँ सामने से तैरती हुई निकल गयीं एक पतली और दूसरी गोल',
       'मुनिया ने हँसते हुए कहा यह तो अप्पा के पैरों से भी बड़ी है',
       'हर पोंगल पर उसे एक कोलम बनाने दिया जाता था',
       'वह फ़र्श पर कोलम बनाती सीढ़ियों पर दीवारों पर',
       'सब को सुशीला के कोलम बहुत पसन्द आते',
       'एक दिन वायुसेना ने उस से कोलम बनाने में मदद माँगी',
       'सुशीला ने विमानचालकों को बताया कि उड़ान भरते हुए विमान कैसे गोता खाएँ',
       'और मिनटों में आसमान में बड़ा और रंगबिरंगा कोलम दिखने लगा',
       'क्या तुम बता सकते हो कि सुशीला अगला कोलम कहाँ बनाएगी',
       'लोग नए दिन का स्वागत करने के लिए अपने घर के बाहर कोलम बनाते हैं',
       'हमारे देश में सभी जगह फ़र्श और दीवारों पर चित्र बनाए जाते हैं',
       'बसंत पंचमी के दिन भारत के कई प्रदेशो मे लोग ग्यान संगीत और',
       'मेरे घर के पास वाले पेड़ पर बहुत सारी चिड़ियाँ बैठी हैँ',
       'बसंत रितु मे कोई भी नयी

In [None]:
import os

Next, create the full audio path from each filename.



In [None]:
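def create_audio_file_paths(filename, train=True):
  # Reduce the input to its bare utterance ID so that an existing directory
  # prefix or ".wav" suffix is not added a second time (e.g. when the cell
  # is re-run on already converted filenames).
  stem = os.path.splitext(os.path.basename(filename))[0]
  split = "train" if train else "test"
  return os.path.join("OpenSLR_Hindi_Data", split, "audio", stem + ".wav")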


In [None]:
# Replace each filename with its full audio path
train_data_sampled['filename'] = train_data_sampled['filename'].map(create_audio_file_paths)
test_data['filename'] = test_data['filename'].map(lambda x: create_audio_file_paths(x, train=False))
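Before going further it can be worth checking that the generated paths actually point to files on disk. A path that repeats the `train/audio` prefix, for example, would not resolve. The sketch below uses only the standard library and a throwaway directory; `find_missing_files` is a hypothetical helper, not part of the original notebook.

```python
import os
import tempfile

def find_missing_files(paths):
    """Return the paths that do not point to an existing file."""
    return [p for p in paths if not os.path.isfile(p)]

# Demonstration in a temporary directory (stand-in for OpenSLR_Hindi_Data/).
with tempfile.TemporaryDirectory() as root:
    audio_dir = os.path.join(root, "train", "audio")
    os.makedirs(audio_dir)
    present = os.path.join(audio_dir, "0071_051.wav")
    # Path with the directory prefix repeated; no such file exists.
    absent = os.path.join(audio_dir, "train", "audio", "0071_051.wav")
    open(present, "wb").close()
    missing = find_missing_files([present, absent])

print(len(missing))
```

In the notebook, `find_missing_files(train_data_sampled["filename"])` should return an empty list if every path is correct.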

In [None]:
train_data_sampled

| | filename | transcripts |
|---|---|---|
| 22186 | OpenSLR_Hindi_Data/train/audio/train/audio/149... | कहतेकहते वह पेड़ पर खड़े हो गए |
| 833 | OpenSLR_Hindi_Data/train/audio/train/audio/007... | अभी नहीं बेटा कल पहन लेना |
| 78293 | OpenSLR_Hindi_Data/train/audio/train/audio/469... | पर आज उसे देखकर उनके प्राण सूख गये |
| 63160 | OpenSLR_Hindi_Data/train/audio/train/audio/376... | और गांधी को जेल से रिहा करने |
| 54622 | OpenSLR_Hindi_Data/train/audio/train/audio/334... | चंपा की कामना थी कि |
| ... | ... | ... |
| 55143 | OpenSLR_Hindi_Data/train/audio/train/audio/336... | महानाविक कब तक आयेंगे बाहर पूछो तो |
| 20214 | OpenSLR_Hindi_Data/train/audio/train/audio/136... | लेकिन दादाजी भड़क गए |
| 31380 | OpenSLR_Hindi_Data/train/audio/train/audio/203... | महाराज की बात सुनकर जादूगर बोला |
| 22160 | OpenSLR_Hindi_Data/train/audio/train/audio/149... | कहतेकहते वह पेड़ पर खड़े हो गए |


We are going to combine the sampled training data and the test data from OpenSLR and use the result for training. For evaluation, we will use the Common Voice test split.

In [None]:
openslr_data = pd.concat([train_data_sampled, test_data], ignore_index=True)

In [None]:
openslr_data

| | filename | transcripts |
|---|---|---|
| 0 | OpenSLR_Hindi_Data/train/audio/train/audio/149... | कहतेकहते वह पेड़ पर खड़े हो गए |
| 1 | OpenSLR_Hindi_Data/train/audio/train/audio/007... | अभी नहीं बेटा कल पहन लेना |
| 2 | OpenSLR_Hindi_Data/train/audio/train/audio/469... | पर आज उसे देखकर उनके प्राण सूख गये |
| 3 | OpenSLR_Hindi_Data/train/audio/train/audio/376... | और गांधी को जेल से रिहा करने |
| 4 | OpenSLR_Hindi_Data/train/audio/train/audio/334... | चंपा की कामना थी कि |
| ... | ... | ... |
| 23838 | OpenSLR_Hindi_Data/test/audio/test/audio/6033_... | जहाँ मन आपसे प्रेरित हो कर निरन्तरप्रगतिशील वि... |
| 23839 | OpenSLR_Hindi_Data/test/audio/test/audio/6033_... | जहाँ मन आपसे प्रेरित हो कर निरन्तरप्रगतिशील वि... |
| 23840 | OpenSLR_Hindi_Data/test/audio/test/audio/6033_... | जहाँ मन आपसे प्रेरित हो कर निरन्तरप्रगतिशील वि... |
| 23841 | OpenSLR_Hindi_Data/test/audio/test/audio/6033_... | जहाँ मन आपसे प्रेरित हो कर निरन्तरप्रगतिशील वि... |


In [None]:
!mkdir -p openslr_jsons

In [None]:
import json

In [None]:
for i, row in openslr_data.iterrows():
    json_dump = {
        "filename": row["filename"], 
        "transcripts": row["transcripts"]
    }
    with open(f'openslr_jsons/example_{i}.json', 'w') as outfile:
        json.dump(json_dump, outfile)
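Writing one file per example produces tens of thousands of small files. As an alternative, all records could go into a single JSON Lines file (one JSON object per line). The sketch below is only an illustration using the standard library: `write_jsonl` and `read_jsonl` are hypothetical helpers, and `rows` is a two-row stand-in for `openslr_data.to_dict("records")`.

```python
import json
import os
import tempfile

# Illustrative stand-in for openslr_data.to_dict("records")
rows = [
    {"filename": "OpenSLR_Hindi_Data/train/audio/0071_051.wav",
     "transcripts": "अभी नहीं बेटा कल पहन लेना"},
    {"filename": "OpenSLR_Hindi_Data/train/audio/1364_013.wav",
     "transcripts": "लेकिन दादाजी भड़क गए"},
]

def write_jsonl(records, path):
    """Write one JSON object per line, keeping Devanagari text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "openslr.jsonl")
    write_jsonl(rows, out_path)
    loaded = read_jsonl(out_path)

print(loaded == rows)
```

Passing `ensure_ascii=False` together with an explicit UTF-8 encoding keeps the Hindi text human-readable in the file instead of escaping it as `\uXXXX` sequences.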

Now we can use the `Dataset.from_pandas(...)` method of 🤗 Datasets to turn the DataFrame into a `Dataset` object.

**Note**: Besides pandas DataFrames, local data files in many other formats, such as JSON, are also supported. Check out the official [docs](https://huggingface.co/docs/datasets/loading_datasets.html#from-a-pandas-dataframe) for more information.

In [None]:
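from datasets import Dataset

# Keep the DataFrame in `openslr_data` and store the converted Dataset under
# the name used in the following cells.
openslr_train_data = Dataset.from_pandas(openslr_data)

In [None]:
openslr_train_data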

Dataset({
    features: ['filename', 'transcripts'],
    num_rows: 23843
})

In [None]:
openslr_train_data.column_names

['filename', 'transcripts']

Now we need to make sure that the column names of `openslr_train_data` match those of `common_voice_train`. We therefore rename the columns of `openslr_train_data` accordingly.

In [None]:
openslr_train_data = openslr_train_data.rename_column("filename", "path")
openslr_train_data = openslr_train_data.rename_column("transcripts", "sentence")

In [None]:
common_voice_train

Dataset({
    features: ['path', 'sentence'],
    num_rows: 292
})

In [None]:
openslr_train_data

Dataset({
    features: ['path', 'sentence'],
    num_rows: 23843
})

Finally, we can concatenate both datasets into one 😎.

In [None]:
import datasets

train_dataset = datasets.concatenate_datasets([common_voice_train, openslr_train_data])

ValueError: ignored
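`concatenate_datasets` raises a `ValueError` when the datasets' features do not match, so a mismatch in column names or types is one likely cause of an error at this step. The sketch below shows one way such a mismatch could be located. `diff_features` is a hypothetical helper that works on any two name-to-type mappings; the plain dictionaries are stand-ins for `common_voice_train.features` and `openslr_train_data.features`.

```python
def diff_features(left, right):
    """Describe how two {column_name: type} mappings differ."""
    problems = []
    for name in left:
        if name not in right:
            problems.append(f"only in left: {name}")
        elif left[name] != right[name]:
            problems.append(
                f"type mismatch for {name}: {left[name]} vs {right[name]}"
            )
    for name in right:
        if name not in left:
            problems.append(f"only in right: {name}")
    # Same columns and types, but listed in a different order.
    if not problems and list(left) != list(right):
        problems.append("same columns, different order")
    return problems

# Stand-ins for the features of the two datasets.
cv = {"path": "string", "sentence": "string"}
unrenamed = {"filename": "string", "transcripts": "string"}
renamed = {"path": "string", "sentence": "string"}

print(diff_features(cv, unrenamed))
print(diff_features(cv, renamed))
```

In the notebook, calling `diff_features(common_voice_train.features, openslr_train_data.features)` before concatenating would list any remaining differences; an empty list means the names and types agree.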

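Once this call succeeds, `train_dataset` contains the rows of both datasets. Note that `concatenate_datasets` requires the datasets to have matching features and raises a `ValueError`, like the one recorded above, when they differ. We can check the result by looking at its first and last examples.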

In [None]:
train_dataset[0]

{'path': '/home/dummy/dummy_file_0.wav',
 'sentence': 'Hello this is the transcription of the sound file no. 0'}

In [None]:
train_dataset[-1]

{'path': '/root/.cache/huggingface/datasets/downloads/extracted/1957a5008174870315768d2ada035ac8430e5033d49a8b8a60ae9e5f554795ff/cv-corpus-6.1-2020-12-11/ab/clips/common_voice_ab_20813183.mp3',
 'sentence': 'Нас иузымдырӡои ажәҩанқәеи адгьыли знапаҿы иҟоу, аҳрагьы зтәу Аллаҳ имацара шиакәу?'}