This project was created for academic purposes and is also publicly available. More information about the competition can be found here.
- Anaconda
- Kaggle API:
  - If you are using Windows, you will have to use the Windows Subsystem for Linux to run the bash script that downloads the dataset, or you can download the data directly from Kaggle.
- Install the prerequisite libraries:
pip install -r requirements.txt # or conda env create -f environment.yml
I recommend using a conda virtual environment and the conda command instead of pip.
Go to `notebook.py` to evaluate each cell.
Week 1: Feb 1 - Feb 5: Chest X-ray project week 1
Week 2: Feb 8 - Feb 12: Chest X-ray project week 2
Week 3: Feb 15 - Feb 19: Chest X-ray project week 3
Week 4: Feb 22 - Feb 26: Chest X-ray project week 4
Week 5: March 1 - March 5: Chest X-ray project week 5
Week 6: March 8 - March 12: Chest X-ray project week 6
Week 7: March 15 - March 19: Chest X-ray project week 7
Week 8: March 22 - March 26: Chest X-ray project week 8
- [3 hours] Analyzing the nature of the data.
- [2 hours] Analyzing which models should be used: TF-IDF and ResNet transfer learning.
- [1 hour] Analyzing the columns of the training dataset.
- [6 hours] Writing the `dataset` class
- [1 hour] Set up conda environment
- [3 hours] Creating the whole pipeline
- [1 hour] Writing pandas dataframe data analysis (groupby, concatenation)
- [1 hour] Writing `notebook.py`.
- [1 hour] Set up a virtual environment. Look at the files `requirements.txt` and `environment.yml`:
  - Add the tensorflow library.
  - Add the OpenCV library.
  - Add the pytorch library.
  - Add other related libraries: tensorboard, tfmodel, etc.
  - Update documentation about the environment requirements and installation.
- [2 hours] Research the TF-IDF model and code it in the notebook.
- [3 hours] Research Facebook's FAISS text-to-vector store, used for finding the nearest text vectors instead of using plain nearest-neighbor algorithms.
- [2 hours] Research how to apply K-nearest neighbors to text. Train the K-nearest-neighbor model on the dataset of over 65,000 instances.
- [1 hour] Finish the notebook pipeline using sklearn without tuned parameters and create a submission.
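The untuned sklearn pipeline described above can be sketched as TF-IDF features fed into a K-nearest-neighbor search. This is a minimal illustration, not the project's actual code; the sample report texts and the query are made up:

```python
# Hedged sketch: TF-IDF vectors + nearest-neighbor lookup with sklearn.
# Corpus texts are illustrative placeholders, not the competition data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

corpus = [
    "no acute cardiopulmonary abnormality",
    "mild cardiomegaly without effusion",
    "right lower lobe pneumonia",
    "clear lungs, no pneumonia",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse (n_docs, n_terms) matrix

# cosine distance pairs naturally with TF-IDF's length normalization
knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
distances, indices = knn.kneighbors(vectorizer.transform(["possible pneumonia"]))
print(indices[0])  # indices of the two closest reports
```

With no parameter tuning, everything here runs on sklearn defaults, which matches the "without parameters tuned" state of that week's pipeline.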
- [1 hour] Write a `utils` class for image preprocessing and for RAM, CPU, and GPU control when training the pytorch model. Install libraries and update `environment.yml` and `requirements.txt`.
- [4 hours] Write the pytorch transfer-learning pipeline in `notebook.py`:
  - Pytorch dataset
  - Image transpose
  - An initialize-model method that dynamically initializes a model by name.
- [2 hours] Read and research the FAISS algorithms from Facebook:
  - Reference: https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/
  - Run a test on 1 million random vectors with dimension 50.
  - This algorithm is very fast for vector search.
- [2 hours] Train and export the KNN numpy vector model (distances, indices). Prepare for next week's prediction.
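Exporting a KNN model as `(distances, indices)` arrays can be sketched with a brute-force numpy search. The array sizes, `k`, and file names below are illustrative, not the project's actual values:

```python
# Hedged sketch: brute-force KNN producing the (distances, indices) arrays,
# then exporting them with numpy. Shapes and k are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
train = rng.standard_normal((100, 50))   # 100 training vectors, dimension 50
queries = rng.standard_normal((5, 50))   # 5 query vectors

# pairwise Euclidean distances, shape (n_queries, n_train)
d = np.linalg.norm(queries[:, None, :] - train[None, :, :], axis=-1)

k = 3
indices = np.argsort(d, axis=1)[:, :k]              # k nearest per query
distances = np.take_along_axis(d, indices, axis=1)  # their distances, sorted

np.save("knn_indices.npy", indices)      # reload next week with np.load
np.save("knn_distances.npy", distances)
```

This is exactly the output shape FAISS's search also returns, which makes the stored arrays a drop-in starting point for next week's prediction step.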
- [2 hours] Read the paper *Billion-scale similarity search with GPUs* and try to work through the algorithms.
- [0.5 hours] Write utility functions, which can be seen in `utils.py`:
  - `preprocess()` method to use in preprocessing the test set.
  - `rgb2gray()` converts a 3-channel image to a grayscale image.
  - `ConvNet` tensorflow model class which evaluates the images.
  - Other supporting layer methods such as `down_conv()`, `dropout_layer()`, etc.
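A helper like `rgb2gray()` is usually a weighted sum over the color channels. This sketch uses the common ITU-R BT.601 luminance weights; the actual implementation in `utils.py` may differ:

```python
# Hedged sketch of an rgb2gray() helper; weights follow the common
# BT.601 convention (0.299 R + 0.587 G + 0.114 B), an assumption here.
import numpy as np

def rgb2gray(img):
    """Convert an (H, W, 3) RGB image to a single-channel (H, W) image."""
    weights = np.array([0.299, 0.587, 0.114])
    return img @ weights  # weighted sum over the last (channel) axis

img = np.ones((4, 4, 3))  # dummy all-white RGB image
gray = rgb2gray(img)
print(gray.shape)  # (4, 4)
```

Since the weights sum to 1, a uniform white image stays at intensity 1.0 after conversion.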
- [4.5 hours] Examine ways to convert words to vectors:
  - [1.5 hours] Read, research, and implement TF-IDF vectorizers in the file `vectorizers.py`
  - [1 hour] Read about bag-of-words.
  - [1.5 hours] Read about how to use the Embedding layer in deep learning to embed text into vectors.
  - [0.5 hours] Write a test template for the word-vectorizer classes in the file `word2vec.py`
- [0.5 hours] Test FAISS with vectors and visualize the results in `faiss_test.py`
- [1.5 hours] Work on the FAISS application in the actual program `notebook.py`, fitting the data vectors from the TF-IDF vectorizers. Plan to use text embeddings next week.