This project was created for academic purposes and is also publicly available. More information about the competition can be found here.
- Anaconda
- Kaggle API:
  - If you are using Windows, you will have to use the Windows Subsystem for Linux to run the bash script that downloads the dataset, or you can download the data directly from Kaggle.
- Install the prerequisite libraries:
pip install -r requirements.txt # or conda env create -f environment.yml
I recommend using a conda virtual environment and the conda command instead of pip.
Go to `notebook.py` to evaluate each cell.
Week 1: Feb 1 - Feb 5: Chest X-ray project week 1
Week 2: Feb 8 - Feb 12: Chest X-ray project week 2
Week 3: Feb 15 - Feb 19: Chest X-ray project week 3
Week 4: Feb 22 - Feb 26: Chest X-ray project week 4
Week 5: March 1 - March 5: Chest X-ray project week 5
Week 6: March 8 - March 12: Chest X-ray project week 6
Week 7: March 15 - March 19: Chest X-ray project week 7
Week 8: March 22 - March 26: Chest X-ray project week 8
- [3 hours] Analyzing the nature of the data.
- [2 hours] Analyzing which models should be used: TF-IDF and ResNet transfer learning.
- [1 hour] Analyzing the columns of the training dataset.
- [6 hours] Writing the `dataset` class
- [1 hour] Set up conda environment
- [3 hours] Creating the whole pipeline
- [1 hour] Writing pandas dataframe data analysis (groupby, concatenation)
- [1 hour] Writing `notebook.py`.
- [1 hour] Set up a virtual environment. Look at the files `requirements.txt` and `environment.yml`:
  - Add the tensorflow library.
  - Add the OpenCV library.
  - Add the pytorch library.
  - Add other related libraries: tensorboard, tfmodel, etc.
  - Update documentation about the environment requirements and installation.
- [2 hours] Research the TF-IDF model and code it in the notebook.
- [3 hours] Research Facebook's FAISS text-to-vector store, used for finding the nearest text vectors instead of using plain nearest-neighbor algorithms.
- [2 hours] Research how to apply K-nearest neighbors to text. Train the K-nearest-neighbor model on the dataset of over 65,000 instances.
- [1 hour] Finish the notebook pipeline using sklearn without tuned parameters and create a submission.
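The untuned sklearn pipeline described above can be sketched as TF-IDF features fed into a K-nearest-neighbor search. This is a minimal illustration, not the project's actual code; the sample report texts and the query are made up:

```python
# Hedged sketch: TF-IDF vectors + nearest-neighbor lookup with sklearn.
# Corpus texts are illustrative placeholders, not the competition data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

corpus = [
    "no acute cardiopulmonary abnormality",
    "mild cardiomegaly without effusion",
    "right lower lobe pneumonia",
    "clear lungs, no pneumonia",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse (n_docs, n_terms) matrix

# cosine distance pairs naturally with TF-IDF's length normalization
knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
distances, indices = knn.kneighbors(vectorizer.transform(["possible pneumonia"]))
print(indices[0])  # indices of the two closest reports
```

With no parameter tuning, everything here runs on sklearn defaults, which matches the "without parameters tuned" state of that week's pipeline.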
- [1 hour] Write a `utils` class for image preprocessing and for RAM, CPU, and GPU control when training the pytorch model. Install libraries and update `environment.yml` and `requirements.txt`.
- [4 hours] Write the pytorch transfer-learning pipeline in `notebook.py`:
  - Pytorch dataset
  - Image transpose
  - An initialize-model method that dynamically initializes a model by name.
- [2 hours] Read and research the FAISS algorithms from Facebook:
  - Reference: https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/
  - Run a test on 1 million random vectors with dimension 50.
  - This algorithm is very fast for vector search.
- [2 hours] Train and export the KNN numpy vector model (distances, indices). Prepare for next week's prediction.
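Exporting a KNN model as `(distances, indices)` arrays can be sketched with a brute-force numpy search. The array sizes, `k`, and file names below are illustrative, not the project's actual values:

```python
# Hedged sketch: brute-force KNN producing the (distances, indices) arrays,
# then exporting them with numpy. Shapes and k are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
train = rng.standard_normal((100, 50))   # 100 training vectors, dimension 50
queries = rng.standard_normal((5, 50))   # 5 query vectors

# pairwise Euclidean distances, shape (n_queries, n_train)
d = np.linalg.norm(queries[:, None, :] - train[None, :, :], axis=-1)

k = 3
indices = np.argsort(d, axis=1)[:, :k]              # k nearest per query
distances = np.take_along_axis(d, indices, axis=1)  # their distances, sorted

np.save("knn_indices.npy", indices)      # reload next week with np.load
np.save("knn_distances.npy", distances)
```

This is exactly the output shape FAISS's search also returns, which makes the stored arrays a drop-in starting point for next week's prediction step.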
- [2 hours] Read the paper *Billion-scale similarity search with GPUs* and try to work through the algorithms.
- [0.5 hours] Write utility functions, which can be seen in `utils.py`:
  - `preprocess()` method to use in preprocessing the test set.
  - `rgb2gray()` converts a 3-channel image to a grayscale image.
  - `ConvNet` tensorflow model class which evaluates the images.
  - Other supporting layer methods such as `down_conv()`, `dropout_layer()`, etc.
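A helper like `rgb2gray()` is usually a weighted sum over the color channels. This sketch uses the common ITU-R BT.601 luminance weights; the actual implementation in `utils.py` may differ:

```python
# Hedged sketch of an rgb2gray() helper; weights follow the common
# BT.601 convention (0.299 R + 0.587 G + 0.114 B), an assumption here.
import numpy as np

def rgb2gray(img):
    """Convert an (H, W, 3) RGB image to a single-channel (H, W) image."""
    weights = np.array([0.299, 0.587, 0.114])
    return img @ weights  # weighted sum over the last (channel) axis

img = np.ones((4, 4, 3))  # dummy all-white RGB image
gray = rgb2gray(img)
print(gray.shape)  # (4, 4)
```

Since the weights sum to 1, a uniform white image stays at intensity 1.0 after conversion.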
- [4.5 hours] Examine ways to convert words to vectors:
  - [1.5 hours] Read, research, and implement TF-IDF vectorizers in the file `vectorizers.py`
  - [1 hour] Read about bag-of-words.
  - [1.5 hours] Read about how to use the Embedding layer in deep learning to embed text into vectors.
  - [0.5 hours] Write a test template for the word-vectorizer classes in the file `word2vec.py`
- [0.5 hours] Test FAISS with vectors and visualize the results in `faiss_test.py`
- [1.5 hours] Work on the FAISS application in the actual program `notebook.py`, fitting the data vectors from the TF-IDF vectorizers. Plan to use text embeddings next week.