- Navigate to https://www.kaggle.com/c/google-quest-challenge/data for the challenge page.
- Run `cd lstm` to navigate to the LSTM folder.
- Run `pip3 install -r requirements.txt`.
- Run `python3 preprocess_data.py` to preprocess the QUEST dataset.
- Run `python3 train.py` to train the model.
- Run `python3 test.py` to generate a test submission file for Kaggle.
- The final, best model will be saved in the `models/` folder.
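The exact steps inside `preprocess_data.py` are not shown here, but a typical first pass over the QUEST text fields (which contain HTML from StackExchange) looks like this hypothetical sketch; the function name and regexes are illustrative, not taken from the repo:

```python
import re

def preprocess(text):
    """Illustrative cleanup: strip HTML tags, lowercase, and tokenize."""
    text = re.sub(r"<[^>]+>", " ", text)   # drop HTML tags such as <p>...</p>
    text = text.lower()
    return re.findall(r"[a-z0-9']+", text)  # keep simple word tokens

tokens = preprocess("<p>How do I sort a List in Python?</p>")
```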
Steps to obtain the training accuracy (test accuracy is displayed on Kaggle) using the Spearman score:

- Run `cd log_reg` to navigate to the log_reg directory.
- Run `pip3 install -r requirements.txt`.
- Run `python3 logistic_regression.py` to generate the features, then train and save the model (this will take around 5 minutes).
- The above command will display the Spearman score for each label and the overall average Spearman score, and will also generate a test submission file `submission.csv` for Kaggle.
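The per-label scoring reported above is the mean column-wise Spearman correlation between predictions and labels. As a pure-Python illustration (not the repo's implementation; in practice a library routine such as `scipy.stats.spearmanr` is the usual choice):

```python
def ranks(xs):
    """1-based ranks with ties assigned their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

# Mean column-wise Spearman over labels (label names here are made up):
preds = {"label_a": [0.1, 0.4, 0.8], "label_b": [0.9, 0.2, 0.5]}
truth = {"label_a": [0.0, 0.5, 1.0], "label_b": [1.0, 0.0, 0.5]}
score = sum(spearman(preds[k], truth[k]) for k in preds) / len(preds)
```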
- The naive Bayes model requires only the `numpy`, `pandas`, `re`, and `collections` libraries to run.
- Run `cd naive-bayes` to navigate to the naive Bayes folder.
- Run `python3 naive-bayes.py` to execute the naive Bayes model. This will train the model on the `train.csv` dataset and save a file `submission.csv` containing the predicted labels for the `test.csv` dataset.
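For reference, the core of a multinomial naive Bayes text classifier can be written with only the standard libraries listed above. This is a generic sketch with made-up example data, not the contents of `naive-bayes.py`:

```python
import re
from collections import Counter
from math import log

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def train_nb(docs, labels):
    """Multinomial naive Bayes over a binary label, with Laplace smoothing."""
    counts = {0: Counter(), 1: Counter()}
    priors = Counter(labels)
    for doc, y in zip(docs, labels):
        counts[y].update(tokenize(doc))
    vocab = set(counts[0]) | set(counts[1])
    return counts, priors, vocab

def predict_nb(model, doc):
    counts, priors, vocab = model
    total = sum(priors.values())
    best, best_lp = None, float("-inf")
    for y in priors:
        lp = log(priors[y] / total)                      # log prior
        denom = sum(counts[y].values()) + len(vocab)     # smoothed denominator
        for w in tokenize(doc):
            lp += log((counts[y][w] + 1) / denom)        # smoothed log likelihood
        if lp > best_lp:
            best, best_lp = y, lp
    return best
```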
The initial code repository is located at https://github.com/ElizaLo/Question-Answering-based-on-SQuAD.

- Run `cd qa` to navigate to the qa folder.
- Run `pip3 install -r requirements.txt` to install all Python 3 libraries.
- Run `python3 -m spacy download en` to download the necessary spaCy models. These should be located in your `Python/Python36/Lib/site-packages/en_core_web_sm/` folder.
- Modify `data_dir`, `spicy_en`, and `glove` in `config.py` to be, respectively, the location where you want to download SQuAD, the location of your spaCy models from the previous step, and the location of your pre-trained GloVe embeddings downloaded from https://nlp.stanford.edu/projects/glove/ and extracted to that folder.
- Run `python3 make_dataset.py` to preprocess the SQuAD dataset and create the vocabulary.
- Run `python3 train.py` to train and save the BiDAF model.
- Run `python3 test.py` to test the BiDAF model.
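The vocabulary step of `make_dataset.py` can be illustrated with a generic sketch; the function names and special tokens below are illustrative, and the actual script's details may differ:

```python
from collections import Counter

def build_vocab(tokenized_docs, min_freq=1, specials=("<pad>", "<unk>")):
    """Map tokens to integer ids, reserving ids for padding and unknowns."""
    freq = Counter(tok for doc in tokenized_docs for tok in doc)
    itos = list(specials) + [t for t, c in freq.most_common() if c >= min_freq]
    stoi = {t: i for i, t in enumerate(itos)}
    return stoi, itos

def encode(doc, stoi):
    """Replace out-of-vocabulary tokens with the <unk> id."""
    unk = stoi["<unk>"]
    return [stoi.get(t, unk) for t in doc]
```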
Before running the programs below, please download the SQuAD 1.1 dataset from https://www.wolframcloud.com/objects/d91733a5-57f5-40fe-8e09-2f5285d21fe6, create a `data` directory, and place the file into that directory. Ensure the file is titled `SQuAD-v1.1.csv` before continuing.

- Run `python3 analysis.py` to generate all plots of the QUEST dataset's features, which are placed in the `plots/` directory.
- Run `python3 quest_analysis.py` to generate QUEST dataset statistics.
- Run `python3 squad_analysis.py` to generate SQuAD dataset statistics.
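The kind of statistics these scripts report can be sketched as follows (illustrative only; the actual scripts' output and metrics may differ):

```python
def length_stats(texts):
    """Token-length summary of a list of text fields."""
    lengths = sorted(len(t.split()) for t in texts)
    n = len(lengths)
    return {
        "count": n,
        "mean": sum(lengths) / n,
        "median": lengths[n // 2],  # upper median for even counts
        "max": lengths[-1],
    }
```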
Steps to generate error analysis plots (plots are already present in the `plots_error_analysis` folder; run these steps if they are not):

- Run `cd log_reg` to navigate to the log_reg directory.
- Run `python3 logistic_regression.py` if you have not already done so above.
- Run `python3 error_analysis.py` to generate the error analysis plots in the `plots_error_analysis` folder.
- First, we parse the SQuAD JSON into CSV for a more readable format. From the parent directory, run `python3 parse_squad_to_csv.py`. This will generate `train-v1.1.csv` and `dev-v1.1.csv` in the `squad_dataset` folder.
- Run `cd log_reg` to navigate to the log_reg directory.
- Run `python3 read_and_label_squad.py`. This will use the saved logistic regression models to label SQuAD, and will take around 5 minutes.
- The labeled CSVs, `dev-v1.1_labeled.csv` and `train-v1.1_labeled.csv`, will be generated in the `squad_labelled` folder.
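Assuming the standard SQuAD v1.1 JSON schema (`data` → `paragraphs` → `qas`), the flattening performed by `parse_squad_to_csv.py` can be sketched as below; the column names are illustrative, not necessarily those the script emits:

```python
import csv
import json

def squad_json_to_csv(json_path, csv_path):
    """Flatten SQuAD v1.1 JSON into one (question, context, answer) row per QA pair."""
    with open(json_path) as f:
        squad = json.load(f)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "title", "context", "question", "answer"])
        for article in squad["data"]:
            for para in article["paragraphs"]:
                for qa in para["qas"]:
                    # Keep the first gold answer; dev questions may list several.
                    answer = qa["answers"][0]["text"] if qa["answers"] else ""
                    writer.writerow([qa["id"], article["title"], para["context"],
                                     qa["question"], answer])
```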
- Run `cd log_reg` to navigate to the log_reg directory.
- Run `python3 analysis_labeled.py`.
- The plots will be generated in the `plots_labeled` folder.