Manipulating the MASSIVE dataset using Python

Introduction

This repository contains Python code for working with Amazon's MASSIVE dataset. MASSIVE is a parallel dataset of more than 1 million utterances across 52 languages, annotated for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types.

In this project, the dataset's files, which originally come in the .jsonl format, are converted to Excel-readable .xlsx files. The data is also used to generate new .jsonl files and one large .json file containing translations of utterances from the train partition.
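The .jsonl and .json steps described above can be sketched with the standard library alone. This is a minimal illustration, not the repository's actual scripts: the helper names (`read_jsonl`, `write_jsonl`, `filter_partition`, `build_translations`) are hypothetical, and the sketch assumes each record carries the `id`, `partition` and `utt` fields and that utterances in different locale files share the same `id`.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a .jsonl file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def write_jsonl(records, path):
    """Write an iterable of dicts as one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as handle:
        for record in records:
            # ensure_ascii=False keeps non-Latin utterances human-readable
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def filter_partition(records, partition):
    """Keep only the records whose 'partition' field matches (e.g. 'train')."""
    return [r for r in records if r.get("partition") == partition]


def build_translations(pivot_records, target_records):
    """Pair utterances that share an 'id' across two locale files.

    Assumption: the same 'id' identifies the same utterance in every locale.
    """
    target_by_id = {r["id"]: r["utt"] for r in target_records}
    return [
        {"id": r["id"], "utt": r["utt"], "translation": target_by_id[r["id"]]}
        for r in pivot_records
        if r["id"] in target_by_id
    ]
```

A possible use would be to read two locale files, keep the train partition of each with `filter_partition`, and pass the result of `build_translations` to `json.dump` to produce the combined .json file.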

You can read more about the dataset here.

Project installation

Open your terminal and create a Python virtual environment to hold the dependencies required to run this project. The project was created using Python 3.11.5, which can be installed automatically when working with Anaconda environments or downloaded directly from here.

If you prefer to use Python's venv module:
```
python3 -m venv environment_name
```
You can read more on working with Python and pip here.

If you prefer to use Anaconda:
```
conda create -n environment_name python=3.11.5
```
You can read more on working with Anaconda here.

With the environment activated, use pip to install the project's dependencies from requirements.txt:
```
pip install -r requirements.txt
```
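To confirm that the interpreter and environment match the setup described above, a small check like the following could be run before continuing. It is only a sketch: the function names are hypothetical, and treating 3.11 as the minimum version is an assumption based on the 3.11.5 release mentioned earlier.

```python
import os
import sys

# Assumption: the project was built on Python 3.11.5, so 3.11 is used as the floor here.
REQUIRED = (3, 11)


def meets_minimum(version, minimum=REQUIRED):
    """Return True if the (major, minor, ...) version is at least the minimum."""
    return tuple(version[:2]) >= tuple(minimum)


def in_isolated_environment():
    """Best-effort check for an active venv or conda environment.

    venv sets sys.prefix to the environment folder, so it differs from
    sys.base_prefix. conda exports CONDA_PREFIX, which is also set when
    conda's own base environment is active.
    """
    in_venv = sys.prefix != sys.base_prefix
    in_conda = bool(os.environ.get("CONDA_PREFIX"))
    return in_venv or in_conda


if __name__ == "__main__":
    print("Python version OK:", meets_minimum(sys.version_info))
    print("Isolated environment active:", in_isolated_environment())
```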
Fork and clone this repository.

Run the following command in your terminal to clone the forked repository:
```
git clone <repository link> <folder name>
```
Download the MASSIVE dataset. Version 1.1 of the dataset, which was used for this project, can be downloaded here. You will need WinRAR to extract the compressed folder.
Retrieve the data folder from the extracted folder and copy it into the src folder of your local repository.

The resulting path should look something like this:
```
C:\Users\username\my_project\src\data
```
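Before running anything, it can help to confirm that the data folder landed in the right place. The sketch below lists the .jsonl files found in a given folder; the function name `find_locale_files` is hypothetical, and it assumes the dataset files sit directly inside the data folder with a .jsonl extension.

```python
from pathlib import Path


def find_locale_files(data_dir):
    """Return the .jsonl files inside data_dir, sorted by path.

    Raises FileNotFoundError when the folder is missing, which usually
    means the data folder was not copied into src.
    """
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"Expected the dataset folder at {data_dir}")
    return sorted(data_dir.glob("*.jsonl"))
```

Called from the repository root as `find_locale_files(Path("src") / "data")`, an empty result or an error suggests the folder is in the wrong location.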
Install Git Bash, which is usually included when you install Git. You can download Git from here.

Running the project

Upon completing the project installation steps:

Open your git bash terminal and navigate to the project's src folder.
Run the following commands to execute the bash file and generate the project's output files.

To make the bash file executable:
```
chmod +x generator.sh
```
To run the bash file and generate the project's output:
```
./generator.sh
```

Disclaimer

Due to the large number of files being processed and generated, the process of generating the output could take a few minutes.

Output files

The project's output files were backed up on Google Drive and can be accessed here.

