tkxwaweru/python_data_manipulation

Manipulating the MASSIVE dataset using Python

Introduction

This repository contains Python code that works with Amazon's MASSIVE dataset. MASSIVE is a parallel dataset of more than 1 million utterances across 52 languages, annotated for the Natural Language Understanding tasks of intent prediction and slot annotation. The utterances span 60 intents and include 55 slot types.

In this project, the dataset's files, which originally come in the .jsonl format, are converted to Excel-readable .xlsx files. The data is also used to generate new .jsonl files and a large .json file that shows translations of utterances from the train partition.
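
As a rough illustration of the .jsonl to .xlsx conversion described above, the sketch below reads one file line by line and writes it to a workbook. It is not the repository's actual code: the function names `read_jsonl`, `flatten_record` and `jsonl_to_xlsx` are hypothetical, and the Excel step assumes that pandas and openpyxl are installed.

```python
import json


def read_jsonl(path):
    """Return one dict per non-empty line of a .jsonl file."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def flatten_record(record):
    """Turn nested lists/dicts into JSON strings so each value fits in one cell."""
    return {
        key: json.dumps(value, ensure_ascii=False)
        if isinstance(value, (list, dict))
        else value
        for key, value in record.items()
    }


def jsonl_to_xlsx(jsonl_path, xlsx_path):
    """Write the records of one .jsonl file to an .xlsx workbook.

    Assumption: pandas and openpyxl are available. They are imported here
    so that the two helpers above work without them.
    """
    import pandas as pd

    rows = [flatten_record(record) for record in read_jsonl(jsonl_path)]
    pd.DataFrame(rows).to_excel(xlsx_path, index=False)  # one row per utterance
```

Calling `jsonl_to_xlsx("data/en-US.jsonl", "en-US.xlsx")` would then produce one workbook for one locale; the paths here are placeholders and depend on where the data folder was placed.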

You can read more about the dataset here.
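
For orientation, each line of a MASSIVE .jsonl file is a JSON object with fields such as `id`, `locale`, `partition`, `utt` and `annot_utt`, and the same `id` refers to the same utterance in every locale. The sketch below shows one way the train-partition translations could be grouped under that assumption. The function name `train_translations` and the choice of `en-US` as the pivot locale are illustrative, not taken from this repository.

```python
import json
from collections import defaultdict


def train_translations(records_by_locale, pivot="en-US"):
    """Group train-partition utterances by id across locales.

    records_by_locale maps a locale code to a list of records, each a dict
    with at least "id", "partition" and "utt".
    """
    translations = defaultdict(dict)
    for locale, records in records_by_locale.items():
        for record in records:
            if record["partition"] == "train":
                translations[record["id"]][locale] = record["utt"]
    # Keep only the ids that are present in the pivot locale.
    return {
        utt_id: by_locale
        for utt_id, by_locale in translations.items()
        if pivot in by_locale
    }


# Made-up sample records, for illustration only.
sample = {
    "en-US": [{"id": "1", "partition": "train", "utt": "wake me up"}],
    "de-DE": [{"id": "1", "partition": "train", "utt": "weck mich auf"}],
}
print(json.dumps(train_translations(sample), ensure_ascii=False, indent=2))
```

Passing `ensure_ascii=False` keeps non-Latin scripts readable in the resulting .json file instead of escaping them.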

Project installation

  1. Open your terminal and create a Python virtual environment to hold the dependencies this project needs. The project was created with Python 3.11.5, which can be installed as part of an Anaconda environment or downloaded directly from here.

    If you prefer to use Python's venv module:

    python3 -m venv environment_name
    

    You can read more on working with Python and pip here.

    If you prefer to use Anaconda (naming the Python version makes conda install it into the new environment):

    conda create -n environment_name python=3.11.5
    

    You can read more on working with Anaconda here.

    Once the repository has been cloned (step 2), activate your environment and use pip to install the project's dependencies from the folder that contains requirements.txt:

    pip install -r requirements.txt
    
  2. Fork and clone this repository.

    Run the following command in your terminal to clone the forked repository:

    git clone <repository link> <folder name>
    
  3. Download the MASSIVE dataset. Version 1.1, which was used for this project, can be downloaded here. You will need an archive tool such as WinRAR to extract the compressed folder.

  4. Copy the data folder from the extracted folder into the src folder of your local repository.

    The file hierarchy for this should be something like this:

    C:\Users\username\my_project\src\data
    
  5. Install Git Bash, which is normally included when you install Git on Windows. You can download Git from here.
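
Before running the project, it can help to confirm that the data folder from step 4 is where the scripts expect it. The sketch below is an optional check and not part of the repository: `list_locale_files` is a hypothetical helper that assumes the data folder holds the dataset's .jsonl files directly.

```python
from pathlib import Path


def list_locale_files(data_dir):
    """Return the sorted names of the .jsonl files found in data_dir."""
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        raise FileNotFoundError(f"Expected a data folder at {data_dir}")
    return sorted(path.name for path in data_dir.glob("*.jsonl"))
```

Running `list_locale_files("src/data")` from the repository root should list one file per locale; an empty list suggests the folder was copied one level too deep or too shallow.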

Running the project

Upon completing the project installation steps:

  1. Open a Git Bash terminal and navigate to the project's src folder.

  2. Run the following commands to make the Bash script executable and generate the project's output files.

    To make the bash file executable:

    chmod +x generator.sh
    

    To run the bash file and generate the project's output:

    ./generator.sh
    

Disclaimer

Due to the large number of files being processed and generated, the process of generating the output could take a few minutes.

Output files

The project's output files were backed up on Google Drive and can be accessed here.
