Aster is a Python-based bot (also usable as a module) that writes baseline starter kernels for competitions and datasets hosted on Kaggle. It currently works with two types of datasets: numerical datasets (with continuous and/or categorical columns) and text datasets with a single text/document field.
- Can create kernels for both competitions and datasets
- Can create kernels for binary and multi-class classification problems
- Can create kernels on text datasets and numerical datasets
- Performs Quick Exploration, Preprocessing, Feature Engineering, and Modelling
- Adapts the visuals to the data, for example word clouds for text data and pair plots for numerical datasets
- Uses a config to create new kernels
Aster first reads the inputs given by the user in the config and identifies the types of columns present in the dataset. Based on this information, it dynamically chooses the most relevant code and text templates and appends them to the baseline kernel. For example, if the dataset is a text classification dataset, Aster generates word clouds and skips correlation charts, pair plots, and categorical variable distributions. If the dataset is a non-text classification dataset, Aster instead chooses templates such as distributions of categorical variables and missing value treatment.
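The selection step described above can be sketched roughly as follows. This is a minimal illustration, not Aster's actual implementation: the template names and the `select_templates` function are hypothetical, and only the `_TAG` values (`doc`, `num`) and the default of `num` come from the config table in this README.

```python
# Hypothetical template names; Aster's real templates may be named and grouped differently.
COMMON_TEMPLATES = ["environment_preparation", "load_dataset",
                    "dataset_snapshot", "target_distribution"]
TEXT_TEMPLATES = ["tfidf_vectorizer", "wordcloud"]
NUMERIC_TEMPLATES = ["missing_values", "variable_types",
                     "variable_correlations", "label_encoding"]


def select_templates(config):
    """Return an ordered list of template names for the given config (illustrative only)."""
    tag = config.get("_TAG", "num")  # "num" is the documented default
    if tag == "doc":
        specific = TEXT_TEMPLATES      # text data: word clouds, no correlation charts
    elif tag == "num":
        specific = NUMERIC_TEMPLATES   # numerical data: correlations, encodings, etc.
    else:
        raise ValueError("unsupported _TAG: %r" % tag)
    return COMMON_TEMPLATES + specific


titanic_templates = select_templates({"COMPETITION": "titanic",
                                      "_TARGET_COL": "Survived"})
spooky_templates = select_templates({"COMPETITION": "spooky-author-identification",
                                     "_TARGET_COL": "author",
                                     "_TAG": "doc",
                                     "_TEXT_COL": "text"})
```

With these assumed names, the text config picks up the word cloud template and skips the correlation template, while the numerical config does the opposite.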
Aster creates the following contents, depending on the type of data.
1. Environment Preparation
2. Quick Exploration
   - 2.1 Load Dataset
   - 2.2 Dataset Snapshot and Summary
   - 2.3 Target Variable Distribution
   - 2.4 Missing Values
   - 2.5 Variable Types
   - 2.6 Variable Correlations
3. Preprocessing
   - 3.1 Label Encoding
   - 3.2 Missing Values Treatment
   - 3.3 Feature Engineering (text fields)
     - 3.3.1 TF-IDF Vectorizer
     - 3.3.2 Top Keywords - Wordcloud
   - 3.4 Train Test Split
4. Modelling
   - 4.1 Logistic Regression
   - 4.2 Decision Tree
   - 4.3 Random Forest
   - 4.4 ExtraTrees Classifier
   - 4.5 Extreme Gradient Boosting
5. Feature Importance
6. Model Ensembling
   - 6.1 A simple Blender
7. Creating Submission
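As an illustration of what a "simple blender" in the ensembling step could look like, here is a minimal sketch that averages the class probabilities predicted by several models. It is an assumption about the general technique, not the code Aster generates; `blend_probabilities` and `predict_labels` are hypothetical helpers, and the probabilities are made-up example values.

```python
def blend_probabilities(model_probs, weights=None):
    """Weighted average of per-class probabilities from several models.

    model_probs: one entry per model, each a list of rows,
                 each row a list of class probabilities.
    weights:     optional per-model weights; equal weights if omitted.
    """
    n_models = len(model_probs)
    if weights is None:
        weights = [1.0] * n_models
    total = float(sum(weights))
    n_rows = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    blended = []
    for r in range(n_rows):
        row = [sum(w * probs[r][c] for w, probs in zip(weights, model_probs)) / total
               for c in range(n_classes)]
        blended.append(row)
    return blended


def predict_labels(blended, classes):
    """Pick the class with the highest blended probability for each row."""
    return [classes[max(range(len(row)), key=row.__getitem__)] for row in blended]


# Made-up probabilities from two models for two rows of a binary problem.
logreg_probs = [[0.8, 0.2], [0.4, 0.6]]
forest_probs = [[0.6, 0.4], [0.3, 0.7]]

blended = blend_probabilities([logreg_probs, forest_probs])
labels = predict_labels(blended, classes=[0, 1])
```

Averaging probabilities rather than hard labels lets a confident model outweigh an uncertain one; passing `weights` would favour the models that score better on validation data.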
Example for a numerical dataset (Titanic competition):
from aster.aster import aster
config = {"COMPETITION": "titanic",
          "_TARGET_COL": "Survived",
          "_ID_COL": "PassengerId"}
ast = aster(config)  # aster object with config
ast._prepare()       # prepare the kernel
ast._push()          # push the kernel to Kaggle
Example for a text dataset (Spooky Author Identification competition):
from aster.aster import aster
config = {"COMPETITION": "spooky-author-identification",
          "_TARGET_COL": "author",
          "_ID_COL": "id",
          "_TAG": "doc",
          "_TEXT_COL": "text"}
ast = aster(config)  # aster object with config
ast._prepare()       # prepare the kernel
ast._push()          # push the kernel to Kaggle
Aster uses a config and its key-value pairs to write kernels for different datasets. Not all keys are mandatory; most of them are optional. The following table lists them.
Key | Example Value | Default | Optional/Mandatory | Definition |
---|---|---|---|---|
DATASET | iris | "" | optional | Name of the dataset to be used |
COMPETITION | titanic | "" | optional | Name of the competition |
_TARGET_COL | Survived | "" | mandatory | target column name |
_ID_COL | PassengerId | "" | optional | id column name |
_TRAIN_FILE | train | train | optional | name of the train file |
_TEST_FILE | test | test | optional | name of the test file |
_TAG | doc | num | optional (only for text) | doc : text dataset, num : numerical dataset |
_TEXT_COL | text | "" | optional (only for text) | name of the column containing text data |
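To make the defaults in the table concrete, the following sketch fills in missing keys and checks the one mandatory key. The `DEFAULTS` dictionary mirrors the table above; the `resolve_config` helper is hypothetical and is not part of Aster's API, which may handle defaults differently.

```python
# Defaults copied from the table above.
DEFAULTS = {
    "DATASET": "",
    "COMPETITION": "",
    "_TARGET_COL": "",
    "_ID_COL": "",
    "_TRAIN_FILE": "train",
    "_TEST_FILE": "test",
    "_TAG": "num",
    "_TEXT_COL": "",
}


def resolve_config(user_config):
    """Merge user values over the documented defaults (hypothetical helper)."""
    config = dict(DEFAULTS)
    config.update(user_config)
    # _TARGET_COL is the only key the table marks as mandatory.
    if not config["_TARGET_COL"]:
        raise ValueError("_TARGET_COL is mandatory")
    return config


resolved = resolve_config({"COMPETITION": "titanic", "_TARGET_COL": "Survived"})
```

Here `resolved` keeps the user's `COMPETITION` and `_TARGET_COL` values and falls back to `num`, `train`, and `test` for `_TAG`, `_TRAIN_FILE`, and `_TEST_FILE`.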
- Titanic Baseline Kernel : https://www.kaggle.com/shivamb/bot-generated-baseline-kernel-id-26988
- Spooky Author Baseline Kernel : https://www.kaggle.com/shivamb/bot-generated-baseline-kernel-id-18345
- Iris Dataset : https://www.kaggle.com/shivamb/bot-generated-baseline-kernel-id-06423
- Diabetes Dataset : https://www.kaggle.com/shivamb/bot-generated-baseline-kernel-id-96823
- Mushrooms Dataset : https://www.kaggle.com/shivamb/bot-generated-baseline-kernel-id-60200
Aster can be installed directly from GitHub using the following commands:
git clone https://github.com/shivam5992/aster.git
cd aster
python setup.py install
- Dynamic Code Selection Improvements
- Add More Content
  - Automated Feature Engineering
  - Hyperparameter Tuning
- Extend Datatypes
  - Regression Problems - Numerical Data
  - Image Classification
P.S. - I derived the name "Aster" from ASTER, an imaging instrument aboard NASA's Terra satellite (https://asterweb.jpl.nasa.gov/), which aims to provide next-generation remote sensing imaging capabilities from outer space.