# Data Science Challenge

(refer to README for the challenge description)

My first approach was to stick to the basics: TF-IDF + Naive Bayes. The results were poor, so I tried several flavors of SVM.

Why?

Naive Bayes is simple and fast, and it has produced great results for text classification in the past. I tried it first to get a baseline.

SVM seems to be one of the best options these days, short of neural networks, I guess, so it was my next step. The classification accuracy increased a lot.

I didn't reach the required accuracy of over 90%, probably because my data preparation is poor. I tried stemming, but the results didn't improve much. Latent Semantic Analysis might help to identify synonyms, but it seems like too much right now.

As you asked for >90% accuracy in 2-4 hours, my feeling is that I'm missing something really obvious here. But...

Anyway... my next steps would be:

1. Talk to someone more experienced with that (in a real situation, of course).
2. Search a little more for another algorithm.
3. Experiment with feature engineering. The algorithms don't work miracles, but playing with the features usually leads to big improvements. That's where I'm probably missing something, like separating Pros and Cons.
4. Spend a little more time on grid search. It's slow, but it only costs computer time, and it can improve some models a lot.
5. Precision and recall. I wouldn't finish this without calculating them. IMHO, they give a good picture of the model's performance. Accuracy can be misleading.
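
To illustrate the last point, here is a small sketch on made-up labels (not the challenge data): a model that always predicts the majority class reaches 90% accuracy while never finding the rare class. The helper name `per_class_precision_recall` is mine, just for illustration.

```python
from collections import Counter

def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} computed from two label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)
    actual = Counter(y_true)
    report = {}
    for label in labels:
        # Guard against division by zero when a class is never predicted/seen.
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / actual[label] if actual[label] else 0.0
        report[label] = (precision, recall)
    return report

# Toy, made-up data: 9 "customer" reviews and 1 "integrity" review.
y_true = ["customer"] * 9 + ["integrity"]
# A lazy model that always answers with the majority class.
y_pred = ["customer"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
report = per_class_precision_recall(y_true, y_pred)
print(accuracy)             # 0.9
print(report["integrity"])  # (0.0, 0.0)
```

The accuracy looks fine, but precision and recall for `integrity` are both zero, which is exactly the kind of problem accuracy alone would hide.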

Also, I would create separate models for each language. Stemming can make a bigger difference in languages with more verb inflection (mostly the Romance languages).

Finally, I made a big mistake here: I tried to use the whole data set for all algorithms. That is fast for MultinomialNB and SGDClassifier, but it was really slow with the other algorithms I tried (SVC, NuSVC, XGBoost). I spent a lot of time just waiting instead of playing more with the features.
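
One way to avoid that wait would be to experiment on a small stratified sample first. This is only a sketch: `stratified_sample` is a hypothetical helper, and the toy texts and labels stand in for the real review texts and their labels.

```python
import random
from collections import defaultdict

def stratified_sample(texts, labels, fraction, seed=0):
    """Keep roughly `fraction` of each class, with at least one example per class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    kept = []
    for indices in by_label.values():
        size = max(1, round(len(indices) * fraction))
        kept.extend(rng.sample(indices, size))
    kept.sort()  # preserve the original document order
    return [texts[i] for i in kept], [labels[i] for i in kept]

# Made-up stand-in data with an 80/20 class split.
texts = [f"review {i}" for i in range(100)]
labels = ["customer"] * 80 + ["integrity"] * 20

small_texts, small_labels = stratified_sample(texts, labels, fraction=0.1)
print(len(small_texts), small_labels.count("integrity"))  # 10 2
```

scikit-learn's `train_test_split` also accepts a `stratify` argument, which could probably do the same job directly in the notebook.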

## Preparing

First run:

    pipenv install
    pipenv shell
    python -m ipykernel install --user --name=<env-name>
    jupyter notebook

Once the kernel is installed:

    pipenv run jupyter notebook

In [51]:
import pandas as pd

# Easier to compare results
pd.options.display.float_format = '{:.5f}'.format

import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/ronie/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [138]:
import pickle
import re
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split, cross_validate, cross_val_predict, GridSearchCV
from sklearn.metrics import accuracy_score

## Load data and check the format

In [4]:
raw = pd.read_pickle('data/labelled_dataset.pickle')
raw.head()

Unnamed: 0,text,labelmax
0,Pros - The people who work here are brilliant ...,customer
1,Pros Start-up vibes Fast growing company Tech-...,customer
2,"Pros The team is great, I love the ambition of...",collaboration
3,"Pros The company is constantly growing, and at...",adaptability
4,Pros Cool office. Friendly people. Good atmosp...,collaboration


In [None]:
raw.shape  # shape is an attribute, not a method

In [5]:
raw.text[0]

"Pros - The people who work here are brilliant (intelligent, hard-working etc.) - Exciting career opportunities, plenty of room to grow! - Great company culture, social events etc. - Management really value everyone's opinion and are open to ideas - Ambitious company, always working to grow and improve - I feel like my work is really valued, and outstanding performance is always recognised by management Cons - Salary isn't great compared to other grad jobs, hopefully this will improve as the company grows - Occasionally have to work weekends, would much rather there was a separate weekend team - Communication could be better about changes in the company, future plans etc. Advice to Management - Keep listening to your employees - Don't become cold and corporate. Airsorted needs to keep the company spirit, even when it is a big global company."

In [6]:
raw.labelmax.value_counts()

customer         26981
collaboration    21067
result           18948
adaptability     17204
detail            4030
integrity         2815
null               535
Name: labelmax, dtype: int64
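
With classes this unbalanced, it helps to know what a model that always answers with the biggest class would score. A quick sketch using the counts printed above (leaving out the `null` rows, which have no classification):

```python
# Label counts copied from the value_counts() output, without 'null'.
label_counts = {
    "customer": 26981, "collaboration": 21067, "result": 18948,
    "adaptability": 17204, "detail": 4030, "integrity": 2815,
}

total = sum(label_counts.values())
majority_label, majority_count = max(label_counts.items(), key=lambda item: item[1])

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = majority_count / total
print(majority_label, round(baseline_accuracy, 3))  # customer 0.296
```

Any model worth keeping should beat this number by a wide margin, and the two smallest classes are where macro recall is most likely to suffer.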

## Cleaning data

In [7]:
clean = raw[raw.labelmax != 'null'] # No classification

In [8]:
# I tried stemming too early in a previous attempt.
# Applied as one of the last steps, it increases the model accuracy by 5 to 6 percentage points, from ~0.76 to ~0.82.
# That takes the error rate from about 1/4 to about 1/5, not bad. However, the model now spends too much time training and
# experimentation slowed down a lot. Maybe I should reduce the dataset in this phase.

from nltk.stem.snowball import EnglishStemmer

stemmer = EnglishStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

## Helpers

These little functions help to test with different models.

In [146]:
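class TextFeatures(BaseEstimator, TransformerMixin):
    "Length and bullet-count features for the Pros and Cons sections of a review"
    def fit(self, x, y=None):
        return self

    def transform(self, texts):
        rslt = []
        for text in texts:
            # DOTALL lets '.' cross line breaks. Reviews without both sections
            # fall back to treating the whole text as Pros instead of raising.
            match = re.match(r'Pros(.+)Cons(.+)', text, re.DOTALL)
            pros, cons = match.groups() if match else (text, '')
            rslt.append({'pros_length': len(pros),
                         'pros_topics': pros.count('-'),
                         'cons_length': len(cons),
                         'cons_topics': cons.count('-')})
        return rslt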
    
def pipeline_with(classifier):
    "Base pipeline for text classification using bag of words"
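    # Note: scikit-learn ignores stop_words when analyzer is a callable,
    # so stop words would have to be filtered inside stemmed_words.
    word_counter = CountVectorizer(stop_words='english',
                                   analyzer=stemmed_words,
                                   max_df=0.98, # Remove domain specific words like "Pros" and "Cons"
                                   min_df=3)  # Remove words rarely seen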

    return Pipeline([('union', FeatureUnion([('bow',
                                              Pipeline([('frequency', word_counter),
                                                        ('tfidf', TfidfTransformer())])),
                                             ('text',
                                              Pipeline([('features', TextFeatures()),
                                                        ('vectorizer', DictVectorizer(sparse=False)),
                                                        ('scaler', MinMaxScaler())]))])),
                     ('model', classifier)])

def fast_accuracy_for(model, features, target):
    "Simple train test split, so we spend less time testing"
    features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2)
    pipeline = pipeline_with(model)
    pipeline.fit(features_train, target_train)
    target_predicted = pipeline.predict(features_test)
    return accuracy_score(target_test, target_predicted)
    

def scores_for(model, features, target):
    "Use cross validation for more reliable results, replace 'fast_accuracy_for' with this after some tests."
    scores = cross_validate(pipeline_with(model), features, target,
                              scoring=['accuracy', 'recall_macro', 'precision_macro'],
                              return_train_score=False,
                              cv=5, n_jobs=3)
    return {k: np.mean(v) for k, v in scores.items()}
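
The `TextFeatures` helper above depends on each review following the "Pros ... Cons ..." layout seen in the sample text. Below is a sketch of a more forgiving splitter that also separates the "Advice to Management" part. The function name `split_review` and the fallback behaviour are my assumptions, not something the challenge data guarantees.

```python
import re

# Lazy groups split at the first standalone "Cons"; the advice part is optional.
SECTION_PATTERN = re.compile(
    r"Pros\b(?P<pros>.*?)\bCons\b(?P<cons>.*?)"
    r"(?:\bAdvice to Management\b(?P<advice>.*))?$",
    re.DOTALL,
)

def split_review(text):
    """Split a review into its pros, cons and advice sections."""
    match = SECTION_PATTERN.match(text)
    if match is None:
        # Assumption: a review without the expected layout is kept whole as "pros".
        return {"pros": text.strip(), "cons": "", "advice": ""}
    return {name: (value or "").strip()
            for name, value in match.groupdict().items()}

# Short made-up review in the same layout as the sample.
parts = split_review(
    "Pros - Great people Cons - Low salary Advice to Management - Keep listening"
)
print(parts["pros"])    # - Great people
print(parts["cons"])    # - Low salary
print(parts["advice"])  # - Keep listening
```

Each section could then feed its own bag of words, which is one way to try the "separate Pros and Cons" idea.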

## Naive Bayes

Naive Bayes is my first try at classifying texts. Sometimes it is good enough, and it's fast, so I use it as a baseline.

In [128]:
from sklearn.naive_bayes import MultinomialNB
print(scores_for(MultinomialNB(), clean.text, clean.labelmax))

{'fit_time': array([151.58923173, 146.28640103, 113.25049996]), 'score_time': array([139.91203356, 209.29278708, 213.56147528]), 'test_accuracy': array([0.51345919, 0.49680374, 0.49855006]), 'test_recall_macro': array([0.35050865, 0.33543207, 0.33260133]), 'test_precision_macro': array([0.37630908, 0.45693172, 0.4397979 ])}


## SVM

SVM does a great job with text classification. As expected, the performance here makes a great jump from Naive Bayes.

In [11]:
from sklearn.linear_model import SGDClassifier
print(fast_accuracy_for(SGDClassifier(max_iter=1000), clean.text, clean.labelmax))

0.7942775550551925


In [147]:
from sklearn.svm import LinearSVC
print(scores_for(LinearSVC(), clean.text, clean.labelmax))

{'fit_time': 155.57943754196168, 'score_time': 101.06812071800232, 'test_accuracy': 0.8296331983860641, 'test_recall_macro': 0.7475973719554054, 'test_precision_macro': 0.8177883532026818}


I should play a little more with the features, but before that, let's try to get the most out of this algorithm. Ensembles are fairly cheap to implement and test, and sometimes they improve the model with just one or two lines of code (and a lot of processing).

In this case, a very small sample gives the same accuracy, which probably means we don't need so many samples.

In [139]:
from sklearn.ensemble import BaggingClassifier
model = BaggingClassifier(LinearSVC(), max_samples=0.1, n_estimators=15)
print(scores_for(model, clean.text, clean.labelmax))

KeyboardInterrupt: 

In [None]:
parameters = {'model__max_samples': [0.2, 0.4, 0.6, 0.8],
              'model__n_estimators': [10, 50, 100, 500],}

model = pipeline_with(BaggingClassifier(LinearSVC()))
grid = GridSearchCV(model, parameters, n_jobs=-1)
grid.fit(clean.text, clean.labelmax)
print(grid.best_score_)
print(grid.best_params_)

Let's see what the predicted probabilities look like for the LinearSVC.

(As it has no "predict_proba" method, we need to put it inside a CalibratedClassifierCV)
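
A minimal sketch of that wrapping on synthetic data, just to show the mechanics. The data from `make_classification` and the `cv=3` setting are assumptions for the example and have nothing to do with the review data.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in data; in the notebook the features come from the text pipeline.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# LinearSVC only exposes decision_function, not predict_proba.
has_proba = hasattr(LinearSVC(), "predict_proba")
print(has_proba)  # False

# The wrapper fits the SVM on cross-validation folds and maps its scores to probabilities.
calibrated = CalibratedClassifierCV(LinearSVC(), cv=3)
calibrated.fit(X, y)

probabilities = calibrated.predict_proba(X[:5])
print(probabilities.shape)        # (5, 3)
print(probabilities.sum(axis=1))  # each row sums to 1
```

Each row holds one probability per class, in the order given by `calibrated.classes_`.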

In [23]:
from sklearn.calibration import CalibratedClassifierCV

model = CalibratedClassifierCV(LinearSVC())
prediction = cross_val_predict(pipeline_with(model), clean.text, clean.labelmax,
                               method='predict_proba',
                               cv=5, n_jobs=-1)

array([[2.29012638e-01, 6.53272858e-02, 7.02847748e-01, 1.70583291e-05,
        1.47171430e-05, 2.78055290e-03],
       [2.32969359e-01, 2.28206531e-01, 5.38774786e-01, 4.40663724e-06,
        2.26231892e-06, 4.26544256e-05],
       [3.37507621e-02, 9.62128793e-01, 5.50265983e-04, 1.44398853e-03,
        1.39634793e-03, 7.29842829e-04],
       ...,
       [1.56789131e-03, 9.92471497e-01, 1.08110477e-03, 1.62034729e-03,
        2.46655598e-05, 3.23449382e-03],
       [1.78588282e-03, 5.28711306e-01, 9.33329978e-02, 3.75461398e-01,
        1.01236476e-05, 6.98291614e-04],
       [1.21908089e-01, 3.34450535e-01, 7.01324078e-04, 3.57798761e-04,
        5.39491511e-01, 3.09074221e-03]])

In [69]:
# This is actually a curried function. It's a bit verbose in
# Python, but I prefer it to creating full classes
# that do just one thing.
def select_col(columns):
    def max_index(values):
        index, _ = max(enumerate(values), key=lambda e: e[1])
        return columns[index]
    return max_index

col_names = sorted(set(clean.labelmax))
get_predicted = select_col(col_names)

df = pd.DataFrame(prediction, columns=col_names)
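# Assign by position: clean's index has gaps after dropping the 'null' rows,
# so index-based alignment would pair labels with the wrong predictions.
df['classified'] = clean.labelmax.values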
df['predicted'] = list(map(get_predicted, prediction))
df

Unnamed: 0,adaptability,collaboration,customer,detail,integrity,result,classified,predicted
0,0.22901,0.06533,0.70285,0.00002,0.00001,0.00278,customer,customer
1,0.23297,0.22821,0.53877,0.00000,0.00000,0.00004,customer,customer
2,0.03375,0.96213,0.00055,0.00144,0.00140,0.00073,collaboration,collaboration
3,0.96971,0.01834,0.00763,0.00001,0.00004,0.00427,adaptability,adaptability
4,0.45953,0.53793,0.00241,0.00002,0.00004,0.00007,collaboration,collaboration
5,0.00624,0.97068,0.00116,0.00001,0.00005,0.02186,collaboration,collaboration
6,0.00859,0.88693,0.07386,0.03008,0.00042,0.00012,collaboration,collaboration
7,0.09093,0.34031,0.56535,0.00170,0.00070,0.00100,customer,customer
8,0.34056,0.08486,0.51683,0.02330,0.00065,0.03381,customer,customer
9,0.00050,0.87131,0.12193,0.00025,0.00171,0.00430,collaboration,collaboration
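
The same selection can be written with NumPy's `argmax` instead of a Python-level loop. A small sketch, using the first two probability rows from the array output above and the sorted label names:

```python
import numpy as np

col_names = ["adaptability", "collaboration", "customer",
             "detail", "integrity", "result"]

# First two probability rows copied from the cross_val_predict output.
prediction = np.array([
    [2.29012638e-01, 6.53272858e-02, 7.02847748e-01,
     1.70583291e-05, 1.47171430e-05, 2.78055290e-03],
    [2.32969359e-01, 2.28206531e-01, 5.38774786e-01,
     4.40663724e-06, 2.26231892e-06, 4.26544256e-05],
])

# argmax(axis=1) gives the column index of the highest probability per row.
predicted = np.array(col_names)[prediction.argmax(axis=1)]
print(predicted.tolist())  # ['customer', 'customer']
```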


## XGBoost

I had never used it before, but it's a winner in Kaggle competitions, so let's try it for text classification.

Results:

- It's way slower than the others.
- No good results in the first run (acc ~0.5). I probably need to prepare the data better for text classification.
- I got some warnings from scipy, but it seems expected.

It was a very naive approach and fairly cheap to try (although I spent 30 min waiting for a response, bad move).

In [None]:
#from xgboost import XGBClassifier
#print(fast_accuracy_for(XGBClassifier(), clean.text, clean.labelmax))

## Improve best options

Naive Bayes and SVM are great. Let's try to improve their parameters (this took a long time). Here I experiment with just one algorithm, but I would try the others too.

Kind of unexpected, but the default parameters were the best ones. Good on the one hand, but I was a little disappointed, as it took a long time just to say "yep, the defaults are really OK".

Note: this took a whole day running, as expected :p

In [32]:
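# These names assume a flat pipeline with 'frequency', 'tfidf' and 'model' steps.
# With the nested pipeline from the Helpers section, the first four would need
# the 'union__bow__' prefix (e.g. 'union__bow__frequency__max_df').
parameters = {'frequency__ngram_range': [(1, 1), (1, 2), (1, 3)],
              'frequency__max_df': [0.5, 0.7, 0.9],
              'frequency__min_df': [1, 2],
              'tfidf__sublinear_tf': [True, False],
              'model__C': [10, 1, 0.1, 0.001]}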

model = pipeline_with(LinearSVC()) 
grid = GridSearchCV(model, parameters, n_jobs=-1)
grid.fit(clean.text, clean.labelmax)
print(grid.best_score_)
print(grid.best_params_)

0.8221099456312813
{'frequency__max_df': 0.9, 'frequency__min_df': 2, 'frequency__ngram_range': (1, 1), 'model__C': 1, 'tfidf__sublinear_tf': False}
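
To see why this search is so slow, it helps to count the fits. A sketch with the same grid; the number of folds is an assumption, since the default `cv` of GridSearchCV depends on the scikit-learn version.

```python
from itertools import product

parameters = {'frequency__ngram_range': [(1, 1), (1, 2), (1, 3)],
              'frequency__max_df': [0.5, 0.7, 0.9],
              'frequency__min_df': [1, 2],
              'tfidf__sublinear_tf': [True, False],
              'model__C': [10, 1, 0.1, 0.001]}

# Grid search tries every combination of the listed values.
combinations = list(product(*parameters.values()))

n_folds = 5  # assumption: 5-fold cross validation
total_fits = len(combinations) * n_folds
print(len(combinations), total_fits)  # 144 720
```

If each fit took around 150 seconds, like the LinearSVC `fit_time` reported earlier, that would be on the order of a day of compute before parallelism, which fits the experience. A smaller grid or a data sample would cut this down a lot.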


## Run with unlabelled data

I prefer to enrich the JSON with the predicted label instead of creating a separate file with the labels only. In my experience, it's easier to work with.

In [31]:
import json
from pathlib import Path

def read_json(path):
    with path.open() as file:
        return json.load(file)
    
def save_json(path, data):
    with path.open('w') as file:
        json.dump(data, file)

data = pd.read_pickle('data/labelled_dataset.pickle')
data = data[data.labelmax != 'null']  # Same cleaning as before: drop rows with no classification

model = pipeline_with(LinearSVC())
model.fit(data.text, data.labelmax)

Path('./data/labelled-dataset').mkdir(exist_ok=True)
path = Path('./data/unlabelled-dataset').glob('*.json')
for input_file in path:
    data = read_json(input_file)
    if len(data) == 0: continue
        
    features = [doc['text'] for doc in data]
    predicted = model.predict(features)
    
    for doc, target in zip(data, predicted):
        doc['predicted'] = target
    
    # 'file' was undefined here; the loop variable is input_file
    output_file = input_file.parents[1].joinpath('labelled-dataset', input_file.name)
    save_json(output_file, data)
