Cortx Take Home Problem

Cortx Take Home Problem: Created a deep learning model that can be trained and run with under 16GB GPU using HuggingFace’s Transformers Library that achieves a high result on Google’s Open Domain Question Answering Long Answer task.

Instructions to install and run code

Create an AWS account.
Under the All Services Section select EC2 under the compute section
Click Launch Instance a. Type Deep Learning AMI (Ubuntu 18.04) and select it b. Then select t2.micro for setting up (change to p2.xlarge at a later time) c. Add 150 GiB of storage d. Create a .pem key and launch instance
In the Elastic Block Store section: Click Volumes and Create a SSD volume with 50 GiB of memory. Then click actions and attach to same volume.
Use Instance: a. SSH using the following line: ssh -i .pem -L 8000:localhost:8888 ubuntu@ec2-.us-east-.compute.amazonaws.com ( fill based on your instance) b. Add SWAP memory from the new volume attached. i. Type lsblk into the terminal ii. sudo mkswap /dev/* (volume name) iii. sudo swapon /dev/ (*volume name)
Clone the following repository: git clone https://github.com/sapereir/CortxHomeProblem.git
Download the necessary data and model from the following link: https://drive.google.com/drive/folders/17k8dXden_vkVhq5DX__ggZ-FbRpnIQxz?usp=sharing Feel free to download the data directly from Google here: https://ai.google.com/research/NaturalQuestions/download; I used the simplified version but either could be used. However, the non-simplified version sometimes has a runtime error. Additionally, feel free to train the model directly instead of using my pre-trained models. You could use the following link to connect ec2-instance with google drive but the data is small enough to do directly: https://zapier.com/apps/amazon-ec2/integrations/google-drive
Running the code: a. Type jupyter notebook into the terminal b. Change necessary cells (commented) or run cells. c. Long term training i. Type tmux new -s "name" into terminal ii. Type the following command into the terminal: jupyter nbconvert --to script .ipynb ( file name) iii. Type python .py ( file name) iv. Click control-b + d to exit tmux window v. To reattch to window type tmux attach -t "name" into terminal

F1 Score with Precision and Recall

Model 1

F1 Score: 2.0000 Precision: 2.4000 Recall: 1.7143

Model 2

F1 Score: 1.4667 Precision: 1.3750 Recall: 1.5714

Model 3; Final Model

F1 Score: 3.0667 Precision: 4.6000 Recall: 2.3000

Instructions to replicate these Results

All Hyperparameters are in run.ipynb and in the writeup.

Preliminary Model 1

Trained on nq-train-00.jsonl.gz and nq-val-00.jsonl.gz. Just run all the cells but just on one training file. Results will most likely be similar. Training time took about 1 to 1.5 hours.

Preliminary Model 2

Trained on nq-train-simplified.jsonl.gz and nq-val-simplified.jsonl.gz. This will require changing the input to convert_func to be val=False for training only as document_html doesn't exist. This was trained for about 7 hours but the results were poor because document html exists in validation but not in training and thus lots of information was lost in training.

Model 3; Final Model

Just run all the cells in the jupyter notebook run.ipynb. This will run 6 training files and 1 validation file. The training took about 12 hours to complete. Please take a look at the change in loss over each file too. Additionally, with sufficient exposure to additional training files over more epochs could result in a higher score. Lastly, this was only done over a subset of validation data and will improve to a certain point over a larger set.

Writeup

Please take a look at the repo and read Cortx_Writeup.pdf

Opinion and Suggestions

I really enjoyed the assignment because developing sufficient models that extract answers from entire page of content verus paragraphs is relatively a new problem in the field. I haven't dealt with something at this scale that needs to be robust. I learned a lot primarily from reading a lot of tensorflow code that can be translated to pytorch. Additionally, I learned how I effective I must be in having an efficient data pipeline. Laslty, it prompted me to gauge the importance in distrbuted models. I love how that problem was open-ended and I didn't have to reimplement architechture based on a particular research paper. I don't have too many suggestions as this was an open ended problem and that is the best aspect of the project but I think a common problems people run into could be very useful as they are probably very similar.

Favorite Charity

Link: https://www.10000degrees.org/ 10,000 Degrees helps students from low-income backgrounds get to and through college.

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
Cortx Writeup.zip		Cortx Writeup.zip
Cortx_Writeup.pdf		Cortx_Writeup.pdf
Loss.png		Loss.png
README.md		README.md
data_loader.py		data_loader.py
data_pre_process.py		data_pre_process.py
model.py		model.py
run.ipynb		run.ipynb
validation.py		validation.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Cortx Take Home Problem

Instructions to install and run code

F1 Score with Precision and Recall

Model 1

Model 2

Model 3; Final Model

Instructions to replicate these Results

Preliminary Model 1

Preliminary Model 2

Model 3; Final Model

Writeup

Opinion and Suggestions

Favorite Charity

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Cortx Take Home Problem

Instructions to install and run code

F1 Score with Precision and Recall

Model 1

Model 2

Model 3; Final Model

Instructions to replicate these Results

Preliminary Model 1

Preliminary Model 2

Model 3; Final Model

Writeup

Opinion and Suggestions

Favorite Charity

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages