## Import necessary libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, hamming_loss, confusion_matrix
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

## Read data

Load your training data and pass column names for the columns you have saved.

In [None]:
data = pd.read_csv(..., sep=',', header=None, names=[..., ...], index_col=False)

In [None]:
data

## Stop words

Use a stop-word list to remove less meaningful words. Stop words carry little meaning on their own and occur frequently in almost any text, so removing them reduces noise in the features. Read the list of stop words, then strip and decode each entry as in the first exercise.

Hint: retweets are marked in the text, so you might add this marker to your stop-word list.
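
As a rough illustration of the strip-and-decode step, the sketch below normalizes a small in-memory list. The sample words, the `normalize` helper, and the `"rt"` marker are made up for illustration. The standard library's `unicodedata` stands in for `unidecode` here so that the snippet has no extra dependency; the two can give different results for characters that have no Unicode decomposition.

```python
import unicodedata

def normalize(word):
    """Strip surrounding whitespace and fold accented characters to ASCII."""
    stripped = word.strip()
    decomposed = unicodedata.normalize("NFKD", stripped)
    # Dropping non-ASCII bytes removes the combining accent marks
    return decomposed.encode("ascii", "ignore").decode("ascii")

# Hypothetical lines as they might come back from f.readlines()
raw_lines = ["für\n", "über\n", " und \n"]
stopwords = [normalize(line) for line in raw_lines]

# Hypothetical retweet marker; check how retweets appear in your own data
stopwords.append("rt")
print(stopwords)
```

Decoding matters because the pipeline further down sets `strip_accents='unicode'`: tokens are accent-stripped before they are compared against the stop-word list, so the list should be in the same form.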

In [None]:
import io
import unidecode

with io.open(..., mode=..., encoding=...) as f:
    content = f.readlines()
content = [... for x in content]
content = [... for x in content]

In [None]:
content

## Split data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['label'], test_size=0.1, random_state=1234)

In [None]:
X_train.shape

In [None]:
y_train.shape

## Define ML pipeline

Define an ML pipeline below by setting the respective parameters to values of your choice.

The [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) combines the functionality of a CountVectorizer and a TfidfTransformer (read [this short explanation](http://www.tfidf.com/) to grasp the fundamental idea of TF-IDF). The parameters you can experiment with are:

**`ngram_range`:** Set `ngram_range` to `(1, 1)` to output only one-word tokens, `(1, 2)` for one-word and two-word tokens, `(2, 3)` for two-word and three-word tokens, etc.
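
To see what these settings do, the following sketch fits a vectorizer on a two-sentence toy corpus (made up for illustration) and lists the n-grams it extracts for each `ngram_range`. The `vocabs` dictionary is only a helper for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat"]  # toy corpus, made up for illustration

vocabs = {}
for ngram_range in [(1, 1), (1, 2), (2, 3)]:
    vectorizer = TfidfVectorizer(ngram_range=ngram_range)
    vectorizer.fit(docs)
    # vocabulary_ maps each extracted n-gram to its column index
    vocabs[ngram_range] = sorted(vectorizer.vocabulary_)
    print(ngram_range, vocabs[ngram_range])
```

Note how `(2, 3)` no longer contains any single words, and how the vocabulary grows quickly as the range widens.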

**`analyzer`:** `ngram_range` works hand in hand with `analyzer`. Set `analyzer` to `"word"` to output words and phrases, or to `"char"` to output character n-grams.

**Advanced:** If you want your output to have both "word" and "char" features, use sklearn's `FeatureUnion`.
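
A minimal sketch of such a combination, assuming a toy corpus and arbitrarily chosen n-gram ranges: each vectorizer is fitted separately and `FeatureUnion` concatenates their output columns. The step names `"word"` and `"char"` are free-form labels picked for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

docs = ["good morning", "good night"]  # toy corpus, made up for illustration

features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 1))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 3))),
])

X = features.fit_transform(docs)

# Look up the fitted vectorizers by their step names
fitted = dict(features.transformer_list)
n_word = len(fitted["word"].vocabulary_)
n_char = len(fitted["char"].vocabulary_)

# The combined matrix has one column per word feature plus one per char feature
print(X.shape, n_word, n_char)
```

A `FeatureUnion` like this could take the place of the single `TfidfVectorizer` step in the pipeline.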


**`max_df`:** Ignore terms with a document frequency higher than this value. A float between `0.0` and `1.0` is read as a proportion of documents, an integer as an absolute document count.
Since stop words generally appear in most documents, a float such as `0.95` can filter them out by dropping every term that occurs in more than 95% of the documents. This assumes that all such frequent terms are stop words, which might not be the case; it really depends on your text data. In some lines of work the most frequent words or phrases are NOT stop words, because you work with dense text (such as search query data) on very specific topics.

**`min_df`:** Ignore terms with a document frequency lower than this value. As with `max_df`, a float between `0.0` and `1.0` is a proportion of documents and an integer is an absolute document count. Use `min_df` as an integer to remove rare words: terms that occur only once or twice add little value and are usually obscure. There are generally a lot of them, so ignoring them with, say, `min_df=5` can greatly reduce memory consumption and data size.
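
The effect of both thresholds can be checked on a small example. The corpus and the `vocabulary` helper below are made up for illustration; the thresholds are chosen so that each one removes something from four documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "the" is in all 4 documents, "cat" in 2, every other word in 1
docs = ["the cat", "the cat sleeps", "the dog", "the bird"]

def vocabulary(**kwargs):
    """Fit a vectorizer with the given settings and return its sorted vocabulary."""
    return sorted(TfidfVectorizer(**kwargs).fit(docs).vocabulary_)

print(vocabulary())            # no filtering: all terms are kept
print(vocabulary(max_df=0.9))  # drops terms found in more than 90% of documents
print(vocabulary(min_df=2))    # drops terms found in fewer than 2 documents
```

With only four documents, `max_df=0.9` removes just "the", while `min_df=2` keeps only "cat" and "the". On real data the right values depend on corpus size, so it is worth inspecting the resulting vocabulary.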

**Advanced:** `token_pattern` lets you define tokens with a regex pattern, e.g. `\b\w\w+\b`, which requires tokens to be at least 2 characters long. One-letter words like "I" and "a" are dropped, as are single-digit numbers (0 to 9). Apostrophes and other punctuation are not matched either, so they act as token separators. The pattern is only used if `analyzer == 'word'`.
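
The pattern can be tried out on its own with Python's `re` module before passing it to the vectorizer. The sentence below is a made-up example; note that the vectorizer also lowercases text by default, which this sketch does not do.

```python
import re

# Pattern from the text above: runs of at least two word characters
token_pattern = re.compile(r"\b\w\w+\b")

text = "I can't eat 5 or 12 apples a day"  # made-up example sentence
tokens = token_pattern.findall(text)
print(tokens)
```

The one-character pieces "I", "a", "5", and the "t" left over after the apostrophe do not appear in the result, while "12" survives because it has two characters.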

As a classifier we will use a support vector machine with a linear kernel. Linear SVMs tend to work well on the high-dimensional, sparse features that TF-IDF produces.

In [None]:
pipeline = Pipeline([
    (
        'tfidv',
        TfidfVectorizer(
            ngram_range=(... , ...), 
            analyzer= ..., 
            strip_accents = 'unicode', 
            use_idf = True, # NOTE: use_idf=False with norm=None (and sublinear_tf=False) behaves like sklearn's CountVectorizer and just returns counts.
            stop_words= ... ,
            sublinear_tf=True, 
            max_features=100, # if not None, build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
            min_df= ..., 
            max_df= ...
        )
    ),
    (
        'lin_svc',
        svm.SVC(
            C=1.0,
            probability=True,
            kernel='linear'
        )
    )
])

## Train model

In [None]:
pipeline.fit(X_train, y_train)

## Score model

In [None]:
def score_model(true, pred):
    print('Accuracy:', accuracy_score(true, pred))
    print('F1:', f1_score(true, pred, average='weighted'))
    print('Precision:', precision_score(true, pred, average='weighted'))
    print('Hamming loss:', hamming_loss(true, pred))


score_model(y_test, pipeline.predict(X_test))
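
The aggregate scores above do not show which classes get confused with each other. `confusion_matrix` is imported at the top of the notebook but not used so far; the sketch below shows how it could be read, using made-up labels and predictions rather than the notebook's data.

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions, for illustration only
y_true = ["pos", "neg", "neg", "pos", "neg"]
y_pred = ["pos", "neg", "pos", "pos", "neg"]

# Fixing the label order makes rows and columns easy to interpret:
# rows are true classes, columns are predicted classes
labels = ["neg", "pos"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

In this toy case one "neg" example was predicted as "pos", which shows up off the diagonal in the first row. In the notebook the call would take `y_test` and `pipeline.predict(X_test)` as arguments.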

## Save model

Save your final model as `YOURTEAM_model.pkl` using joblib's `dump()` function, with `compress=3`.

In [None]:
joblib.dump(..., ..., ...)
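
For reference, a save-and-load round trip with joblib might look like the sketch below. The dictionary stands in for a fitted model and the file name is hypothetical; a temporary directory is used so that nothing is left behind on disk.

```python
import os
import tempfile

import joblib

# Stand-in object; in the notebook this would be the fitted pipeline
model = {"name": "toy model", "weights": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_model.pkl")  # hypothetical file name
    # compress takes a level from 0 (none) to 9 (smallest file, slowest)
    joblib.dump(model, path, compress=3)
    restored = joblib.load(path)

print(restored == model)
```

Loading the file back right after saving is a cheap check that the model was written correctly.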