# Data Science Challenge

(Refer to the README for the challenge description.)

My first approach was to stick to the basics: TF-IDF + Naive Bayes. The results were poor, so I tried several flavors of SVM.

Why?

Naive Bayes is simple and fast, and it has a good track record in text classification. I tried it first to get a baseline.

SVMs are still among the strongest options for text classification short of neural networks, so they were my next step. Classification accuracy increased a lot, from about 0.50 with Naive Bayes to about 0.77 with LinearSVC.

I didn't reach the required accuracy of over 90%, probably because my data preparation is poor. I tried stemming, but the results didn't improve much. Latent Semantic Analysis might help to identify synonyms, but it seems like too much right now.

As you asked for >90% accuracy in 2-4 hours, my feeling is that I'm missing something really obvious here. But...

Anyway... my next steps would be:

1. Talk to someone more experienced with this kind of problem (in a real situation, of course).
2. Search a little more for other algorithms.
3. Experiment with feature engineering. Algorithms don't work miracles, but playing with the features usually leads to big improvements. That's where I'm probably missing something, like handling the Pros and Cons sections separately.
4. Spend a little more time on grid search. It's slow, but it only costs computer time, and it can improve some models a lot.
5. Calculate precision and recall. I wouldn't consider this finished without them. IMHO, they give a good picture of model performance; accuracy alone can be misleading, especially with classes as unbalanced as these.
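
To illustrate the last point, here is a minimal sketch of a per-class precision and recall report using scikit-learn's `classification_report`. The labels below are toy values invented for the example, not results from this notebook; in practice `y_true` and `y_pred` would come from a held-out split.

```python
from sklearn.metrics import classification_report

# Toy labels, invented for illustration only (not real results)
y_true = ["customer", "customer", "customer", "detail", "detail", "integrity"]
y_pred = ["customer", "customer", "customer", "customer", "detail", "customer"]

# zero_division=0 avoids warnings for classes that are never predicted
print(classification_report(y_true, y_pred, zero_division=0))

# The same numbers as a dictionary, handy for further checks
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
print(report["integrity"]["recall"])  # the rare class is never found
```

In this toy case accuracy looks acceptable (4 of 6 correct) while the rare `integrity` class has zero recall, which is the kind of problem accuracy hides.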

Also, I would create separate models for each language. Stemming can make a bigger difference in languages with more verb inflection (mostly the Romance languages).

Finally, I made a big mistake here: I tried to use the whole data set for every algorithm. That's fast for MultinomialNB and SGDClassifier, but it was really slow with the other algorithms I tried (SVC, NuSVC, XGBoost). I spent a lot of time just waiting instead of experimenting more with the features.
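
One way to avoid that wait would be to prototype on a stratified subsample first. The sketch below is an assumption about how that could look, not something I ran here: the toy frame only mimics the `text` and `labelmax` columns, and the 20% fraction is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the labelled data (column names mirror the notebook)
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(100)],
    "labelmax": ["customer"] * 60 + ["collaboration"] * 30 + ["detail"] * 10,
})

# Keep 20% of the rows while preserving the label proportions;
# the discarded 80% is ignored here
sample, _ = train_test_split(
    toy, train_size=0.2, stratify=toy["labelmax"], random_state=0
)
print(len(sample))
print(sample["labelmax"].value_counts())
```

The slow models can then be compared on `sample`, and only the promising ones retrained on the full data.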

## Preparing

First run:

    pipenv install
    pipenv shell
    python -m ipykernel install --user --name=<env-name>
    jupyter notebook

Once the kernel is installed:

    pipenv run jupyter notebook

In [1]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/ronie/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
import pickle
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score

## Load data and check the format

In [3]:
raw = pd.read_pickle('data/labelled_dataset.pickle')
raw.head()

Unnamed: 0,text,labelmax
0,Pros - The people who work here are brilliant ...,customer
1,Pros Start-up vibes Fast growing company Tech-...,customer
2,"Pros The team is great, I love the ambition of...",collaboration
3,"Pros The company is constantly growing, and at...",adaptability
4,Pros Cool office. Friendly people. Good atmosp...,collaboration


In [4]:
raw.text[0]

"Pros - The people who work here are brilliant (intelligent, hard-working etc.) - Exciting career opportunities, plenty of room to grow! - Great company culture, social events etc. - Management really value everyone's opinion and are open to ideas - Ambitious company, always working to grow and improve - I feel like my work is really valued, and outstanding performance is always recognised by management Cons - Salary isn't great compared to other grad jobs, hopefully this will improve as the company grows - Occasionally have to work weekends, would much rather there was a separate weekend team - Communication could be better about changes in the company, future plans etc. Advice to Management - Keep listening to your employees - Don't become cold and corporate. Airsorted needs to keep the company spirit, even when it is a big global company."
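
The review above follows a "Pros ... Cons ... Advice to Management" layout, so one feature-engineering idea is to split each text into those sections. This is a rough sketch under the assumption that the three headers appear verbatim and capitalised; `split_sections` is a hypothetical helper, and it will mis-split reviews that use those words inside the body.

```python
import re

# Section headers as they appear in the sample review (assumed to be consistent)
SECTION_PATTERN = re.compile(r"\b(Pros|Cons|Advice to Management)\b")

def split_sections(text):
    "Return a dict mapping each section header to its text"
    # With a capturing group, split() yields [prefix, header, body, header, body, ...]
    parts = SECTION_PATTERN.split(text)
    sections = {}
    for header, body in zip(parts[1::2], parts[2::2]):
        # Join repeated headers instead of overwriting earlier text
        sections[header] = (sections.get(header, "") + " " + body).strip()
    return sections

example = "Pros Cool office. Friendly people. Cons Low salary. Advice to Management Keep listening."
sections = split_sections(example)
print(sections["Pros"])
print(sections["Cons"])
```

Each section could then get its own vectorizer, so that "great" under Pros and "great" under Cons become different features.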

In [11]:
raw.labelmax.value_counts()

customer         26981
collaboration    21067
result           18948
adaptability     17204
detail            4030
integrity         2815
null               535
Name: labelmax, dtype: int64

## Cleaning data

In [13]:
clean = raw[raw.labelmax != 'null'] # No classification
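
The class counts printed above also give a quick reference point: the accuracy of always predicting the most frequent label. The sketch below just redoes that arithmetic with the counts copied from the output, leaving out the `null` rows.

```python
# Counts copied from the value_counts() output above, without the 'null' rows
counts = {
    "customer": 26981,
    "collaboration": 21067,
    "result": 18948,
    "adaptability": 17204,
    "detail": 4030,
    "integrity": 2815,
}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a model that always predicts the most frequent class
baseline = counts[majority_label] / total
print(majority_label, total, round(baseline, 3))  # customer 91045 0.296
```

Any model worth keeping should score well above this figure.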

In [7]:
# I tried to use stemming early on, no good results. Little improvement and it slowed the process down a lot!

# from nltk.stem.snowball import EnglishStemmer

# stemmer = EnglishStemmer()
# # Stop words must be removed by the inner analyzer: CountVectorizer ignores
# # stop_words when analyzer is a callable.
# analyzer = CountVectorizer(stop_words='english').build_analyzer()

# def stemmed_words(doc):
#     return (stemmer.stem(w) for w in analyzer(doc))

# word_counter = CountVectorizer(analyzer=stemmed_words,
#                                max_df=0.95, # Drop terms found in over 95% of documents, like "Pros" and "Cons"
#                                min_df=2)  # Drop terms found in only one document

## Helpers

These little functions help to test different models.

In [8]:
def pipeline_with(classifier):
    "Base pipeline for text classification using bag of words"
    word_counter = CountVectorizer(stop_words='english',
                                   max_df=0.95, # Drop terms found in over 95% of documents, like "Pros" and "Cons"
                                   min_df=2)  # Drop terms found in only one document

    return Pipeline([('frequency', word_counter),
                     ('tfidf', TfidfTransformer()),
                     ('model', classifier)])

def fast_accuracy_for(model, features, target):
    "Simple train test split, so we spend less time testing"
    features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2)
    pipeline = pipeline_with(model)
    pipeline.fit(features_train, target_train)
    target_predicted = pipeline.predict(features_test)
    return accuracy_score(target_test, target_predicted)
    

def accuracy_for(model, features, target):
    "Use cross validation for more reliable results, replace 'fast_accuracy_for' with this after some tests."
    accuracy = cross_val_score(pipeline_with(model), features, target, scoring='accuracy', cv=5, n_jobs=-1)
    return accuracy.mean()

## Naive Bayes

Naive Bayes is my first attempt at classifying the texts. Sometimes it is good enough, and it's fast, so I use it as a baseline.

In [9]:
from sklearn.naive_bayes import MultinomialNB
print(accuracy_for(MultinomialNB(), clean.text, clean.labelmax))

0.4991591807521781


## SVM

SVMs do a great job with text classification. As expected, accuracy makes a big jump from Naive Bayes. SGDClassifier with its default hinge loss is a linear SVM trained with stochastic gradient descent.

In [10]:
from sklearn.linear_model import SGDClassifier
print(accuracy_for(SGDClassifier(max_iter=5), clean.text, clean.labelmax))

0.7368770066607537


In [14]:
from sklearn.svm import LinearSVC
print(accuracy_for(LinearSVC(), clean.text, clean.labelmax))

0.7671477575325867


## XGBoost

I had never used it before, but it's a frequent winner in Kaggle competitions, so let's try it for text classification.

Results:

- It's way slower than the others.
- No good results in the first run (acc ~0.5). I probably need to prepare the data better for text classification.
- I got some warnings from scipy, but it seems expected.

It was a very naive approach and cheap to set up, but I spent 30 minutes waiting for a result, which was a bad move.

In [None]:
#from xgboost import XGBClassifier
#print(fast_accuracy_for(XGBClassifier(), clean.text, clean.labelmax))

## Improve best options

Naive Bayes and SVM are the most promising options. Let's try to tune their parameters (this took a long time). Here I experiment with just one algorithm, but I would do the same with the others.

In [None]:
parameters = {'frequency__ngram_range': [(1, 1), (1, 2), (1, 3)],
              'frequency__max_df': [0.5, 0.7, 0.9],
              'frequency__min_df': [1, 2],
              'tfidf__sublinear_tf': [True, False]}

model = pipeline_with(LinearSVC()) 
grid = GridSearchCV(model, parameters, n_jobs=-1)
grid.fit(clean.text, clean.labelmax)
print(grid.best_score_)
print(grid.best_params_)
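
To see why this search takes so long, it helps to count the fits it triggers. The sketch below uses scikit-learn's `ParameterGrid` on the same grid; the fold count of 5 is an assumption (it matches `accuracy_for` and the default of recent scikit-learn versions, but older versions used 3).

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
parameters = {'frequency__ngram_range': [(1, 1), (1, 2), (1, 3)],
              'frequency__max_df': [0.5, 0.7, 0.9],
              'frequency__min_df': [1, 2],
              'tfidf__sublinear_tf': [True, False]}

n_candidates = len(ParameterGrid(parameters))  # 3 * 3 * 2 * 2 combinations
cv_folds = 5  # Assumption: 5-fold cross validation
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 36 180
```

Every one of those fits re-vectorizes the training folds, and the trigram settings produce far larger vocabularies, so trimming the grid or searching on a subsample first should save a lot of time.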

## Run with unlabelled data

I prefer to enrich the JSON with the predicted label instead of creating a separate file with only the labels. In my experience, that is easier to work with.

In [31]:
import json
from pathlib import Path

def read_json(path):
    with path.open() as file:
        return json.load(file)
    
def save_json(path, data):
    with path.open('w') as file:
        json.dump(data, file)

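data = pd.read_pickle('data/labelled_dataset.pickle')
data = data[data.labelmax != 'null']  # Drop unclassified rows, as in "Cleaning data"

model = pipeline_with(LinearSVC())  # Same settings as the model evaluated above
model.fit(data.text, data.labelmax)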

Path('./data/labelled-dataset').mkdir(exist_ok=True)
path = Path('./data/unlabelled-dataset').glob('*.json')
for input_file in path:
    data = read_json(input_file)
    if len(data) == 0: continue
        
    features = [doc['text'] for doc in data]
    predicted = model.predict(features)
    
    for doc, target in zip(data, predicted):
        doc['predicted'] = target
    
    # Write next to the input folder: data/labelled-dataset/<same file name>
    output_file = input_file.parents[1].joinpath('labelled-dataset', input_file.name)
    save_json(output_file, data)
