**In this assignment you will learn how to predict tags for posts from StackOverflow. To solve this task you will use a multilabel classification approach.**

**Libraries**
In this task you will need the following libraries:

* Numpy: a package for scientific computing.
* Pandas: a library providing high-performance, easy-to-use data structures and data analysis tools for Python.
* scikit-learn: a tool for data mining and data analysis.
* NLTK: a platform to work with natural language.

### Text preprocessing

For this we will use the NLTK library.

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

In this task you will deal with a dataset of post titles from StackOverflow. The data is split into 3 sets: *train*, *validation* and *test*. All corpora (except for *test*) contain the titles of the posts and the corresponding tags (100 tags are available). The *test* set is provided for Coursera's grading and doesn't contain answers. Load the corpora using *pandas* and look at the data:

In [None]:
from ast import literal_eval
import pandas as pd
import numpy as np

In [None]:
# read a tab-separated file and parse the 'tags' column from a string into a Python list
def read_data(filename):
    data = pd.read_csv(filename, sep='\t')
    data['tags'] = data['tags'].apply(literal_eval)
    return data
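
The call to `literal_eval` in `read_data` suggests that each cell of the `tags` column is stored as the text of a Python list. A minimal sketch of that conversion, assuming a cell of this shape (the value below is made up for illustration):

```python
from ast import literal_eval

# A tags cell as it might appear in the TSV file: a list literal inside a string.
raw_cell = "['php', 'mysql']"  # illustrative value, not taken from the dataset

# literal_eval parses Python literals only, so it cannot run arbitrary code.
parsed = literal_eval(raw_cell)

print(type(raw_cell).__name__, '->', type(parsed).__name__)
print(parsed)
```

Without this step every row would hold one long string, and iterating over it would yield single characters instead of tags.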

In [None]:
train = read_data('../input/train.tsv')
validation = read_data('../input/validation.tsv')

In [None]:
train.head()

In [None]:
import re
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')  # symbols that are replaced by a space
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')  # every other symbol is deleted
STOPWORDS = set(stopwords.words('english'))  # set of English stop words

def text_prepare(text):
    """
        text: a string
        return: modified initial string
    """
    text = text.lower()  # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(" ", text)  # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub("", text)  # delete symbols which are in BAD_SYMBOLS_RE from text
    # Drop stop words token by token. Splitting on whitespace also collapses
    # repeated spaces and trims both ends, and it copes with consecutive
    # stop words and with titles that end up empty.
    return " ".join(word for word in text.split() if word not in STOPWORDS)

In [None]:
def test_text_prepare():
    examples = ["SQL Server - any equivalent of Excel's CHOOSE function?",
                "How to free c++ memory vector<int> * arr?"]
    answers = ["sql server equivalent excels choose function", 
               "free c++ memory vectorint arr"]
    for ex, ans in zip(examples, answers):
        if text_prepare(ex) != ans:
            return "Wrong answer for the case: '%s'" % ex
    return 'Basic tests are passed.'

In [None]:
print(test_text_prepare())

In [None]:
X_train, y_train = train['title'].values, train['tags'].values
X_val, y_val = validation['title'].values, validation['tags'].values

In [None]:
X_train = [text_prepare(x) for x in X_train]
X_val = [text_prepare(x) for x in X_val]

In [None]:
X_train[:3]

**WordsTagsCount**
Find the 3 most popular tags and the 3 most popular words in the train data.

In [None]:
from collections import Counter
tag_counts = Counter()
word_counts = Counter()

In [None]:
for sentence in X_train:
    for word in sentence.split():
        word_counts[word] += 1

In [None]:
for tags in y_train:
    for tag in tags:
        tag_counts[tag] += 1

In [None]:
print(tag_counts.most_common(3))
print(word_counts.most_common(3))

**Transforming text to a vector**
Machine Learning algorithms work with numeric data and we cannot use the provided text data "as is". There are many ways to transform text data to numeric vectors. In this task you will try to use two of them.

**Bag of words**
One of the well-known approaches is a bag-of-words representation. To create this transformation, follow the steps:

* Find the N most popular words in the train corpus and enumerate them. Now we have a dictionary of the most popular words.
* For each title in the corpora, create a zero vector with dimension equal to N.
* For each text in the corpora, iterate over the words that are in the dictionary and increase the corresponding coordinate by 1.

In [None]:
# most common words sorted by count in descending order
most_common_words = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)[:6000]

In [None]:
DICT_SIZE = 5000

In [None]:
WORDS_TO_INDEX = {p[0]:i for i,p in enumerate(most_common_words[:DICT_SIZE])}
INDEX_TO_WORDS = {WORDS_TO_INDEX[k]:k for k in WORDS_TO_INDEX}
ALL_WORDS = WORDS_TO_INDEX.keys()

In [None]:
def my_bag_of_words(text, words_to_index, dict_size):
    """
        text: a string
        words_to_index: a dict mapping each dictionary word to its index
        dict_size: size of the dictionary
        return a vector which is a bag-of-words representation of 'text'
    """
    result_vector = np.zeros(dict_size) # create vector with all zeroes
    for word in text.split():
        if word in words_to_index:
            result_vector[words_to_index[word]] += 1
    return result_vector

In [None]:
def test_my_bag_of_words():
    words_to_index = {'hi': 0, 'you': 1, 'me': 2, 'are': 3}
    examples = ['hi how are you']
    answers = [[1, 1, 0, 1]]
    for ex, ans in zip(examples, answers):
        if (my_bag_of_words(ex, words_to_index, 4) != ans).any():
            return "Wrong answer for the case: '%s'" % ex
    return 'Basic tests are passed.'

In [None]:
print(test_my_bag_of_words())

In [None]:
from scipy import sparse as sp_sparse

In [None]:
X_train_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_train])
X_val_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_val])
print('X_train shape ', X_train_mybag.shape)
print('X_val shape ', X_val_mybag.shape)
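
Building one dense vector per title and stacking them works, but it can be slow on a large corpus. As a possible alternative (not part of the assignment), scikit-learn's `CountVectorizer` accepts a fixed vocabulary and produces the sparse matrix directly. The sketch below uses a toy dictionary and toy titles, which are assumptions for illustration only:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy vocabulary and titles (illustrative, not the real data).
toy_words_to_index = {'hi': 0, 'you': 1, 'me': 2, 'are': 3}
toy_titles = ['hi how are you', 'me me you']

# A fixed vocabulary keeps the column order identical to the dictionary.
# token_pattern=r'\S+' treats every whitespace-separated chunk as a token,
# and lowercase=False leaves the already preprocessed text untouched.
toy_vectorizer = CountVectorizer(vocabulary=toy_words_to_index,
                                 token_pattern=r'\S+', lowercase=False)
toy_bag = toy_vectorizer.transform(toy_titles)  # sparse matrix, no dense rows built

print(toy_bag.toarray())
```

Words outside the dictionary (here `how`) are simply ignored, which matches the behaviour of `my_bag_of_words`.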

**TF-IDF**
The second approach extends the bag-of-words framework by taking into account the total frequencies of words in the corpora. It helps to penalize words that are too frequent and provides a better feature space.

Implement the function tfidf_features using the class TfidfVectorizer from scikit-learn. Use the train corpus to train a vectorizer. Don't forget to take a look at the arguments that you can pass to it. We suggest that you filter out words that are too rare (occurring in fewer than 5 titles) and words that are too frequent (occurring in more than 90% of the titles). Also, use bigrams along with unigrams in your vocabulary.
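
To make the "penalize frequent words" idea concrete, the sketch below recomputes the inverse document frequency by hand on three toy titles (made up for illustration) and compares it with what `TfidfVectorizer` stores. It assumes the vectorizer's default settings, under which scikit-learn uses the smoothed form idf(t) = ln((1 + n) / (1 + df(t))) + 1 and then scales each row to unit L2 norm.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_titles = ['python list sort', 'python dict', 'java list']  # illustrative

toy_vectorizer = TfidfVectorizer(token_pattern=r'\S+')
toy_tfidf = toy_vectorizer.fit_transform(toy_titles)

# Document frequency of each term, in vocabulary (column) order.
n_docs = len(toy_titles)
doc_freq = np.asarray((toy_tfidf > 0).sum(axis=0)).ravel()

# Smoothed IDF as used by scikit-learn's defaults.
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Print each word with its document frequency and IDF weight.
for word, col in sorted(toy_vectorizer.vocabulary_.items(), key=lambda p: p[1]):
    print(word, doc_freq[col], round(manual_idf[col], 3))
```

Words that occur in more titles get a smaller IDF, so they contribute less to the final vector than rare, more specific words.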

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
def tfidf_features(X_train, X_val):
    """
        X_train, X_val: samples
        return TF-IDF vectorized representation of each sample and vocabulary
    """
    # \S+ matches a run of one or more non-whitespace characters, so every
    # whitespace-separated token (including 'c++' and 'c#') is kept intact.
    tfidf_vectorizer = TfidfVectorizer(token_pattern=r'(\S+)', min_df=5, max_df=0.9, ngram_range=(1, 2))
    # Fit on the train corpus only, then reuse the same vocabulary for validation.
    X_train = tfidf_vectorizer.fit_transform(X_train)
    X_val = tfidf_vectorizer.transform(X_val)
    return X_train, X_val, tfidf_vectorizer.vocabulary_

In [None]:
X_train_tfidf, X_val_tfidf, tfidf_vocab = tfidf_features(X_train, X_val)

In [None]:
tfidf_reversed_vocab = {i:word for word,i in tfidf_vocab.items()}

Once you have done text preprocessing, always have a look at the results. Be very careful at this step, because the performance of future models will drastically depend on it.

In this case, check whether you have c++ or c# in your vocabulary, as they are obviously important tokens in our tags prediction task:

In [None]:
print('c++' in tfidf_vocab)
print('c#' in tfidf_vocab)

**MultiLabel classifier**
As we have noticed before, in this task each example can have multiple tags. To deal with this kind of prediction, we need to transform the labels into binary form, and the prediction will be a mask of 0s and 1s. For this purpose it is convenient to use MultiLabelBinarizer from sklearn.

In [None]:
y_train

In [None]:
# convert the tag lists into a binary indicator (multi-hot) matrix
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(classes=sorted(tag_counts.keys()))
y_train = mlb.fit_transform(y_train)
y_val = mlb.transform(y_val)  # reuse the binarizer fitted on train; do not refit on validation
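
To see what the binarized labels look like, here is a small sketch on three toy tag lists (made up for illustration). Each column corresponds to one tag, in the order given by `classes`:

```python
from sklearn.preprocessing import MultiLabelBinarizer

toy_tags = [['python', 'pandas'], ['java'], ['python']]  # illustrative tag lists

toy_mlb = MultiLabelBinarizer(classes=['java', 'pandas', 'python'])
toy_y = toy_mlb.fit_transform(toy_tags)
print(toy_y)
# [[0 1 1]
#  [1 0 0]
#  [0 0 1]]

# inverse_transform maps a 0/1 mask back to tuples of tag names.
print(toy_mlb.inverse_transform(toy_y))
```

The same `inverse_transform` call is what turns predicted masks back into readable tags later on.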

Implement the function train_classifier for training a classifier. In this task we suggest using the One-vs-Rest approach, which is implemented in the OneVsRestClassifier class. In this approach k classifiers (k = number of tags) are trained. As the base classifier, use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time, because the number of classifiers to train is large.
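
The "k classifiers" statement can be checked on a tiny example. The sketch below uses made-up toy data with 3 tags and default LogisticRegression settings (both are assumptions for illustration, not the assignment's configuration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data (illustrative): 6 samples, 2 features, 3 tags encoded as a 0/1 mask.
toy_X = np.array([[0., 1.], [1., 0.], [1., 1.], [0., 0.], [2., 1.], [1., 2.]])
toy_y = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1],
                  [0, 1, 0],
                  [1, 0, 1]])

toy_ovr = OneVsRestClassifier(LogisticRegression())
toy_ovr.fit(toy_X, toy_y)

# One fitted binary LogisticRegression per tag, i.e. per column of toy_y.
print(len(toy_ovr.estimators_))
print([type(est).__name__ for est in toy_ovr.estimators_])
```

Each estimator answers a single yes/no question ("does this title have tag i?"), which is why training time grows with the number of tags.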

In [None]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier

In [None]:
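def train_classifier(X_train, y_train):
    # The 'liblinear' solver supports the L1 penalty; the default 'lbfgs' solver does not.
    logreg = LogisticRegression(C=3, penalty='l1', solver='liblinear')
    onevrest = OneVsRestClassifier(logreg)
    onevrest.fit(X_train, y_train)
    return onevrest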

In [None]:
classifier_mybag = train_classifier(X_train_mybag, y_train)
classifier_tfidf = train_classifier(X_train_tfidf, y_train)

Now you can create predictions for the data. You will need two types of predictions: labels and scores.



In [None]:
y_val_predicted_labels_mybag = classifier_mybag.predict(X_val_mybag)
y_val_predicted_scores_mybag = classifier_mybag.decision_function(X_val_mybag)

y_val_predicted_labels_tfidf = classifier_tfidf.predict(X_val_tfidf)
y_val_predicted_scores_tfidf = classifier_tfidf.decision_function(X_val_tfidf)
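
The two kinds of prediction are closely related: for a One-vs-Rest model built on LogisticRegression, a tag is predicted when its decision score is positive. A minimal sketch on made-up toy data (an assumption for illustration) that checks this relationship:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data (illustrative): 6 samples, 2 features, 3 tags encoded as a 0/1 mask.
toy_X = np.array([[0., 1.], [1., 0.], [1., 1.], [0., 0.], [2., 1.], [1., 2.]])
toy_y = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1],
                  [0, 1, 0],
                  [1, 0, 1]])

toy_ovr = OneVsRestClassifier(LogisticRegression()).fit(toy_X, toy_y)

toy_labels = toy_ovr.predict(toy_X)            # 0/1 mask, one column per tag
toy_scores = toy_ovr.decision_function(toy_X)  # real-valued score per sample and tag

print(toy_scores.shape)
# Thresholding the scores at zero should reproduce the predicted labels.
print(np.array_equal(toy_labels, (toy_scores > 0).astype(int)))
```

Labels are what accuracy and F1 need, while ranking metrics such as the area under the ROC curve need the scores.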

In [None]:
y_val_pred_inversed = mlb.inverse_transform(y_val_predicted_labels_tfidf)
y_val_inversed = mlb.inverse_transform(y_val)

**Evaluation**
To evaluate the results we will use several classification metrics:

* **Accuracy**
* **F1-score**
* **Area under ROC-curve**
* **Area under precision-recall curve**

Make sure you are familiar with all of them. How would you expect them to work in the multi-label scenario? Read about micro/macro/weighted averaging in the sklearn documentation.
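
The averaging modes are easiest to compare on a tiny example. In the sketch below (toy labels made up for illustration), the per-tag F1 scores are 0.8 and 2/3. Macro averaging takes their plain mean, micro averaging pools true positives, false positives and false negatives over all tags first, and weighted averaging weights each tag's F1 by its number of true occurrences. Note that `accuracy_score` on a multi-label mask is subset accuracy: a sample counts as correct only if every tag matches.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy multi-label ground truth and predictions (illustrative): 4 samples, 2 tags.
toy_true = np.array([[1, 0], [1, 1], [0, 1], [1, 0]])
toy_pred = np.array([[1, 0], [1, 0], [0, 1], [0, 0]])

toy_accuracy = accuracy_score(toy_true, toy_pred)
toy_f1 = {avg: f1_score(toy_true, toy_pred, average=avg)
          for avg in ('macro', 'micro', 'weighted')}

print('subset accuracy:', toy_accuracy)
for avg, value in toy_f1.items():
    print(avg, round(value, 4))
```

On imbalanced tag sets the three averages can differ a lot, because macro treats rare and common tags equally while micro and weighted are dominated by the common ones.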

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score 
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score

In [None]:
def print_evaluation_scores(y_val, predicted):
    # Subset accuracy: a sample counts as correct only if every tag matches.
    print('Accuracy:', accuracy_score(y_val, predicted))
    # F1 averaged over tags, weighted by the number of true instances of each tag.
    print('F1 (weighted):', f1_score(y_val, predicted, average='weighted'))

In [None]:
print('Bag-of-words')
print_evaluation_scores(y_val, y_val_predicted_labels_mybag)
print('Tfidf')
print_evaluation_scores(y_val, y_val_predicted_labels_tfidf)

**ROC curve for the case of multi-label classification**
The provided function roc_auc can plot it for you. The input parameters of this function are:

* true labels
* decision functions scores
* number of classes
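
The `roc_auc` helper itself is not shown in this notebook. As a rough stand-in, the sketch below computes the areas that such a plot would report, taking the same three inputs. The function name `roc_auc_per_class` and the toy data are hypothetical, and the sketch only computes numbers; it does not draw the curves.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

def roc_auc_per_class(y_true, y_score, n_classes):
    """Hypothetical helper: ROC AUC for each class plus the micro-average."""
    result = {}
    for i in range(n_classes):
        fpr, tpr, _ = roc_curve(y_true[:, i], y_score[:, i])
        result[i] = auc(fpr, tpr)
    # Micro-average: treat every (sample, class) pair as one binary decision.
    fpr, tpr, _ = roc_curve(y_true.ravel(), y_score.ravel())
    result['micro'] = auc(fpr, tpr)
    return result

# Toy labels and decision scores (illustrative): 4 samples, 2 classes.
toy_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
toy_scores = np.array([[2.0, -0.1], [-0.5, 0.3], [1.5, -0.2], [-1.0, -2.0]])

toy_auc = roc_auc_per_class(toy_true, toy_scores, n_classes=2)
print(toy_auc)
```

Passing decision scores rather than 0/1 labels matters here, since the ROC curve is traced by sweeping a threshold over the scores.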

**The hyperparameters can be tuned. The best result I got was with C = 3 and penalty = 'l1'.**
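
One simple way to do such tuning is to loop over a small grid and compare the weighted F1-score on held-out data. The sketch below is only an illustration: the grid values are assumptions, and synthetic data from `make_multilabel_classification` stands in for the real TF-IDF features.

```python
from itertools import product

from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data (illustrative stand-in for the real features).
X, y = make_multilabel_classification(n_samples=200, n_features=20,
                                      n_classes=4, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

results = {}
for C, penalty in product([0.1, 1, 3, 10], ['l1', 'l2']):
    # 'liblinear' supports both the L1 and the L2 penalty.
    base = LogisticRegression(C=C, penalty=penalty, solver='liblinear')
    model = OneVsRestClassifier(base).fit(X_tr, y_tr)
    results[(C, penalty)] = f1_score(y_va, model.predict(X_va),
                                     average='weighted', zero_division=0)

best_params = max(results, key=results.get)
print(best_params, round(results[best_params], 3))
```

The setting that wins on synthetic data says nothing about the StackOverflow corpus; the point is only the shape of the search loop.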

**Analysis of the most important features**

Finally, it is usually a good idea to look at the features (words or n-grams) that are used with the largest weights in your logistic regression model.

Implement the function print_words_for_tag to find them. Go back to the sklearn documentation on OneVsRestClassifier and LogisticRegression if needed.

In [None]:
def print_words_for_tag(classifier, tag, tags_classes, index_to_words, all_words):
    """
        classifier: trained classifier
        tag: particular tag
        tags_classes: a list of classes names from MultiLabelBinarizer
        index_to_words: index_to_words transformation
        all_words: all words in the dictionary
        
        return nothing, just print top 5 positive and top 5 negative words for current tag
    """
    print('Tag:\t{}'.format(tag))
    
    idx = tags_classes.index(tag)
    # Coefficients of the binary classifier trained for this tag
    # (taken from the 'classifier' argument, not from a global variable).
    coef = classifier.estimators_[idx].coef_[0]
    # Feature indices sorted by coefficient, largest first.
    order = sorted(range(len(coef)), key=lambda i: coef[i], reverse=True)

    top_positive_words = [index_to_words[i] for i in order[:5]]  # top-5 words sorted by the coefficients
    top_negative_words = [index_to_words[i] for i in order[-5:]]  # bottom-5 words sorted by the coefficients
    print('Top positive words:\t{}'.format(', '.join(top_positive_words)))
    print('Top negative words:\t{}\n'.format(', '.join(top_negative_words)))

In [None]:
print_words_for_tag(classifier_tfidf, 'c', mlb.classes, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier_tfidf, 'c++', mlb.classes, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier_tfidf, 'linux', mlb.classes, tfidf_reversed_vocab, ALL_WORDS)