Text classification competition: Avito Category Prediction
Features:
- Product name (Russian text)
- Description (Russian text)
Target:
- Category (50 classes)
Text preprocessing:
- remove punctuation and extra symbols
- lowercase
- lemmatize using pymystem3
- remove stopwords using NLTK
- remove short words with length < 3
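
The preprocessing steps above could be sketched roughly as follows. This is an illustration, not the original competition code: the `preprocess` function, its parameters, and the choice to drop digits along with punctuation are assumptions. The lemmatizer and stopword list are passed in so the sketch runs without extra dependencies. With pymystem3 and NLTK installed, `Mystem().lemmatize` and `stopwords.words("russian")` can be plugged in as shown in the trailing comment.

```python
import re

# Assumption: "extra symbols" means everything except Latin/Cyrillic letters
# and whitespace, so digits are dropped too.
_NON_LETTERS = re.compile(r"[^A-Za-zА-Яа-яЁё\s]+")


def preprocess(text, lemmatize=None, stopwords=frozenset(), min_len=3):
    """Clean one text field (hypothetical helper, not the author's code)."""
    # 1. Remove punctuation and extra symbols.
    cleaned = _NON_LETTERS.sub(" ", text)
    # 2. Lowercase.
    cleaned = cleaned.lower()
    # 3. Lemmatize; fall back to whitespace tokenization if no lemmatizer is given.
    tokens = lemmatize(cleaned) if lemmatize else cleaned.split()
    # Mystem returns whitespace and a trailing "\n" as separate tokens; strip them.
    tokens = [t.strip() for t in tokens]
    # 4. Remove stopwords, 5. remove words shorter than min_len characters.
    kept = [t for t in tokens if t and t not in stopwords and len(t) >= min_len]
    return " ".join(kept)


# With the libraries named in the list above (requires nltk.download("stopwords")):
# from pymystem3 import Mystem
# from nltk.corpus import stopwords as nltk_stopwords
# preprocess(text, lemmatize=Mystem().lemmatize,
#            stopwords=set(nltk_stopwords.words("russian")))
```

Returning a single space-joined string keeps the output directly usable as input to a text vectorizer.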
Text vectorization: TfidfVectorizer(ngram_range=(1, 2))
Model: SGDClassifier(n_jobs=-1, alpha=0.0000002, tol=1e-4)
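
A minimal sketch of how the vectorizer and classifier might be chained with scikit-learn. The `build_model` helper, the `random_state` argument, and the toy data are assumptions added for illustration. The source does not say how the name and description fields were combined, so a single preprocessed string per item is assumed here. With its default hinge loss, SGDClassifier trains a linear SVM.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline


def build_model(random_state=0):
    """TF-IDF on unigrams and bigrams followed by a linear SGD classifier."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        # Hyper-parameters as listed above; random_state is an added
        # assumption so that repeated runs give the same result.
        SGDClassifier(n_jobs=-1, alpha=0.0000002, tol=1e-4,
                      random_state=random_state),
    )


# Toy, already-preprocessed examples (made up, not competition data).
texts = [
    "продавать велосипед горный отличный состояние",
    "детский велосипед новый",
    "квартира центр ремонт",
    "сдавать квартира комната",
]
labels = ["bikes", "bikes", "realty", "realty"]

model = build_model()
model.fit(texts, labels)
predictions = model.predict(["горный велосипед"])
```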
Tuned the hyper-parameters using Grid Search
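
The grid search could look something like the sketch below, using scikit-learn's GridSearchCV. The parameter grid, the number of folds, and the toy data are assumptions, since the source does not state which values were searched.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(tol=1e-4, random_state=0)),
])

# Hypothetical search space; the values actually tried are not given.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [2e-7, 1e-6, 1e-5],
}

# Toy, already-preprocessed examples (made up, not competition data).
texts = [
    "велосипед горный отличный состояние",
    "детский велосипед новый",
    "велосипед шоссейный рама",
    "продавать велосипед колесо",
    "квартира центр ремонт",
    "сдавать квартира комната",
    "квартира новостройка этаж",
    "продавать квартира балкон",
]
labels = ["bikes"] * 4 + ["realty"] * 4

# cv=2 only because the toy set is tiny; real data would allow more folds.
search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=2)
search.fit(texts, labels)

best_params = search.best_params_  # best combination found on the toy data
best_score = search.best_score_    # mean cross-validated accuracy for it
```

Searching over the whole pipeline rather than the classifier alone lets the vectorizer settings be tuned together with `alpha`, and refits the TF-IDF vocabulary inside each fold.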
Got second place with accuracy=0.91686