<a href="https://colab.research.google.com/github/vernor/LightGBM/blob/master/week1_MultilabelClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predict tags on StackOverflow with linear models

In this assignment you will learn how to predict tags for posts from [StackOverflow](https://stackoverflow.com). To solve this task you will use a multilabel classification approach.

### Libraries

In this task you will need the following libraries:
- [Numpy](http://www.numpy.org) — a package for scientific computing.
- [Pandas](https://pandas.pydata.org) — a library providing high-performance, easy-to-use data structures and data analysis tools for Python.
- [scikit-learn](http://scikit-learn.org/stable/index.html) — a tool for data mining and data analysis.
- [NLTK](http://www.nltk.org) — a platform to work with natural language.

### Data

The following cell will download all data required for this assignment into the folder `week1/data`.

In [0]:
! wget https://raw.githubusercontent.com/hse-aml/natural-language-processing/master/setup_google_colab.py -O setup_google_colab.py

import setup_google_colab

setup_google_colab.setup_week1()  

import sys

sys.path.append("..")

from common.download_utils import download_week1_resources

download_week1_resources()

--2019-03-24 22:07:47--  https://raw.githubusercontent.com/hse-aml/natural-language-processing/master/setup_google_colab.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2330 (2.3K) [text/plain]
Saving to: ‘setup_google_colab.py’


2019-03-24 22:07:47 (32.1 MB/s) - ‘setup_google_colab.py’ saved [2330/2330]

File data/train.tsv is already downloaded.
File data/validation.tsv is already downloaded.
File data/test.tsv is already downloaded.
File data/text_prepare_tests.tsv is already downloaded.


### Grading
We will create a grader instance below and use it to collect your answers. Note that these outputs are stored locally inside the grader and are uploaded to the platform only after you run the submission function in the last part of this assignment. If you want to make a partial submission, you can run that cell at any time.

### Text preprocessing

For this and most of the following assignments you will need to use a list of stop words. It can be downloaded from *nltk*:

In [0]:
from grader import Grader

grader = Grader()

import nltk

nltk.download('stopwords')

from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer

from ast import literal_eval

import pandas as pd

import numpy as np

import re

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In this task you will deal with a dataset of post titles from StackOverflow. The data is split into 3 sets: *train*, *validation* and *test*. All corpora (except for *test*) contain titles of the posts and corresponding tags (100 tags are available). The *test* set is provided for Coursera's grading and doesn't contain answers. Load the corpora using *pandas* and look at the data:

In [0]:
def read_data(filename):
    data = pd.read_csv(filename, sep='\t')
    data['tags'] = data['tags'].apply(literal_eval)
    return data
  
train = read_data('data/train.tsv')
validation = read_data('data/validation.tsv')
test = pd.read_csv('data/test.tsv', sep='\t')


X_train, y_train = train['title'].values, train['tags'].values

X_val, y_val = validation['title'].values, validation['tags'].values

X_test = test['title'].values

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

tokenizer = nltk.tokenize.WhitespaceTokenizer()

As you can see, the *title* column contains the titles of the posts and the *tags* column contains the tags. Note that the number of tags per post is not fixed: a post can have as many as necessary.

For more comfortable usage, initialize *X_train*, *X_val*, *X_test*, *y_train*, *y_val*.

One of the best-known difficulties when working with natural data is that it's unstructured. For example, if you use it "as is" and extract tokens just by splitting the titles on whitespace, you will see that there are many "weird" tokens like *3.5?*, *"Flip*, etc. To prevent such problems, it's usually useful to prepare the data somehow. In this task you'll write a function which will also be used in the other assignments.
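To make the problem concrete, here is a minimal sketch that compares naive whitespace splitting with a lightly cleaned version. The title is a made-up example, and the character class is only an illustration of the kind of filtering this task asks for.

```python
import re

# Hypothetical title, chosen to contain the kinds of "weird" tokens mentioned above.
title = 'How to "Flip" a list in Python 3.5?'

# Naive approach: split on whitespace only, so punctuation stays glued to the words.
naive_tokens = title.split()

# More careful approach: lowercase, then drop everything except
# digits, lowercase letters, spaces and the symbols #, + and _.
cleaned = re.sub(r'[^0-9a-z #+_]', '', title.lower())
clean_tokens = cleaned.split()

print(naive_tokens)
print(clean_tokens)
```

Tokens such as `3.5?` and `"Flip"` appear only in the naive version; after cleaning they become ordinary vocabulary entries.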

**Task 1 (TextPrepare).** Implement the function *text_prepare* following the instructions. After that, run the function *test_text_prepare* to test it on tiny cases and submit it to Coursera.

In [0]:
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')

BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')

STOPWORDS = set(stopwords.words('english'))

examples = ["SQL Server - any equivalent of Excel's CHOOSE function?",
                  "How to free c++ memory vector<int> * arr?"]

def text_prepare(text):  
  
  text = text.lower()

  text = re.sub(REPLACE_BY_SPACE_RE,' ',text)

  text = re.sub(BAD_SYMBOLS_RE,'',text)

  text = ' '.join([w for w in tokenizer.tokenize(text) if w not in STOPWORDS])
  
  return text

In [0]:
def test_text_prepare():
    examples = ["SQL Server - any equivalent of Excel's CHOOSE function?",
                "How to free c++ memory vector<int> * arr?"]
    answers = ["sql server equivalent excels choose function", 
               "free c++ memory vectorint arr"]
    for ex, ans in zip(examples, answers):
        if text_prepare(ex) != ans:
            return "Wrong answer for the case: '%s'" % ex
    return 'Basic tests are passed.'
  
test_text_prepare()

'Basic tests are passed.'

Run your implementation for questions from file *text_prepare_tests.tsv* to earn the points.

In [0]:
prepared_questions = []
for line in open('data/text_prepare_tests.tsv', encoding='utf-8'):
    line = text_prepare(line.strip())
    prepared_questions.append(line)
text_prepare_results = '\n'.join(prepared_questions)

grader.submit_tag('TextPrepare', text_prepare_results)

Current answer for task TextPrepare is:
 sqlite php readonly
creating multiple textboxes dynamically
self one prefer javascript
save php date...


Now we can preprocess the titles using the function *text_prepare*, making sure that the titles don't contain bad symbols:

In [0]:
text_prepare_results

'sqlite php readonly\ncreating multiple textboxes dynamically\nself one prefer javascript\nsave php date string mysql database timestamp\nfill dropdownlist data xml file aspnet application\nprogrammatically trigger jqueryui draggables drag event\nget value method argument via reflection java\nknockout mapingfromjs observablearray json object data gets lost\nfacebook connect localhost weird stuff\nfullcalendar prev next click\nsyntaxerror unexpected token\neffective way float double comparison\ngem install rails fails dns error\nlistshuttle component richfaces getting updated\nlaravel responsedownload show images laravel\nwrong rspec test\ncalendar display using java swing\npython selenium import regular firefox profile addons\nrandom number 2 variables values\naltering http responses firefox extension\nstart session python web application\nalign radio buttons horizontally django forms\ncount number rows sqlite database\nwordpress wp_rewrite rules\nremoving sheet excel 2005 using php\np

In [0]:
X_train = [text_prepare(x) for x in X_train]
X_val = [text_prepare(x) for x in X_val]
X_test = [text_prepare(x) for x in X_test]

In [0]:
X_train[:3]

['draw stacked dotplot r',
 'mysql select records datetime field less specified value',
 'terminate windows phone 81 app']

For each tag and for each word calculate how many times they occur in the train corpus. 

**Task 2 (WordsTagsCount).** Find 3 most popular tags and 3 most popular words in the train data and submit the results to earn the points.
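One way to do this counting without a vectorizer is `collections.Counter`. The sketch below runs on a hypothetical three-title corpus; `toy_corpus`, `toy_tags_counts` and `toy_words_counts` are illustrative names, not part of the assignment.

```python
from collections import Counter

# Hypothetical miniature corpus: (prepared title, list of tags) pairs.
toy_corpus = [
    ('read csv file python', ['python', 'csv']),
    ('parse json python', ['python', 'json']),
    ('read json file java', ['java', 'json']),
]

toy_tags_counts = Counter()
toy_words_counts = Counter()
for title, tags in toy_corpus:
    toy_tags_counts.update(tags)            # one increment per tag occurrence
    toy_words_counts.update(title.split())  # one increment per word occurrence

print(toy_tags_counts.most_common(2))
print(toy_words_counts.most_common(3))
```

Both counters behave like the `{'some_word_or_tag': frequency}` dictionaries used in this task, so they can be sorted or passed to `most_common` directly.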

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

tag_process = train.tags.map(" ".join)

basicvectorizer_tags = CountVectorizer(token_pattern=r'(\S+)')

features_tags = basicvectorizer_tags.fit_transform(tag_process)

count_raw = pd.DataFrame(features_tags.todense(),columns = basicvectorizer_tags.get_feature_names()).sum().to_dict()

In [0]:
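# Note: the vectorizer below is fitted on the raw titles (train.title), so stop words
# such as 'to' and 'in' dominate the counts. Fit it on the preprocessed X_train instead
# to count cleaned tokens.
# title_process = train.title.map(text_prepare)

basicvectorizer_title = CountVectorizer(token_pattern=r'(\S+)')

features_title = basicvectorizer_title.fit_transform(train.title)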

In [0]:
# Dictionary of all tags from train corpus with their counts.
tags_counts = dict(zip(basicvectorizer_tags.get_feature_names(), features_tags.sum(axis = 0).tolist()[0]))
# Dictionary of all words from train corpus with their counts.
words_counts = dict(zip(basicvectorizer_title.get_feature_names(), features_title.sum(axis = 0).tolist()[0]))

######################################
######### YOUR CODE HERE #############
######################################

In [0]:
tags_counts.items()

dict_items([('.net', 3872), ('ajax', 1767), ('algorithm', 419), ('android', 2818), ('angularjs', 1353), ('apache', 441), ('arrays', 2277), ('asp.net', 3939), ('asp.net-mvc', 1244), ('c', 3119), ('c#', 19077), ('c++', 6469), ('class', 509), ('cocoa-touch', 507), ('codeigniter', 786), ('css', 1769), ('csv', 435), ('database', 740), ('date', 560), ('datetime', 557), ('django', 1835), ('dom', 400), ('eclipse', 992), ('entity-framework', 649), ('excel', 443), ('facebook', 508), ('file', 582), ('forms', 872), ('function', 487), ('generics', 420), ('google-maps', 408), ('hibernate', 807), ('html', 4668), ('html5', 842), ('image', 672), ('ios', 3256), ('iphone', 1909), ('java', 18661), ('javascript', 19078), ('jquery', 7510), ('json', 2026), ('jsp', 680), ('laravel', 525), ('linq', 964), ('linux', 793), ('list', 693), ('loops', 389), ('maven', 432), ('mongodb', 350), ('multithreading', 1118), ('mysql', 3092), ('node.js', 771), ('numpy', 502), ('objective-c', 4338), ('oop', 425), ('opencv', 401

We assume that *tags_counts* and *words_counts* are dictionaries like `{'some_word_or_tag': frequency}`. After applying the sorting procedure, the results will look like this: `[('most_popular_word_or_tag', frequency), ('less_popular_word_or_tag', frequency), ...]`. The grader expects the results in the following format (two comma-separated strings with a line break):

    tag1,tag2,tag3
    word1,word2,word3

Pay attention that in this assignment you should not submit frequencies or some additional information.
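As a small illustration of the sorting and formatting steps, the sketch below ranks two hypothetical count dictionaries and builds the two-line answer string. The helper `top_keys` and the toy counts are assumptions made for this example only.

```python
def top_keys(counts, n=3):
    """Return the n keys with the highest counts, most frequent first."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [key for key, _ in ranked[:n]]

# Hypothetical counts, not taken from the real train corpus.
toy_tags = {'python': 5, 'java': 7, 'c#': 9, 'php': 2}
toy_words = {'file': 4, 'using': 8, 'error': 3, 'string': 6}

# Keys only, no frequencies: one line for tags, one line for words.
answer = '%s\n%s' % (','.join(top_keys(toy_tags)), ','.join(top_keys(toy_words)))
print(answer)
```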

In [0]:
most_common_tags = sorted(tags_counts.items(), key=lambda x: x[1], reverse=True)[:3]

most_common_words = sorted(words_counts.items(), key=lambda x: x[1], reverse=True)[:3]

grader.submit_tag('WordsTagsCount', '%s\n%s' % (','.join(tag for tag, _ in most_common_tags), 
                                                 ','.join(word for word, _ in most_common_words)))

most_common_tags

most_common_words

Current answer for task WordsTagsCount is:
 javascript,c#,java
to,in,a...


[('javascript', 19078), ('c#', 19077), ('java', 18661)]

[('to', 35065), ('in', 30817), ('a', 24323)]

### Transforming text to a vector

Machine Learning algorithms work with numeric data and we cannot use the provided text data "as is". There are many ways to transform text data to numeric vectors. In this task you will try to use two of them.

#### Bag of words

One of the well-known approaches is a *bag-of-words* representation. To create this transformation, follow the steps:
1. Find the *N* most popular words in the train corpus and enumerate them. Now we have a dictionary of the most popular words.
2. For each title in the corpora create a zero vector of dimension *N*.
3. For each text in the corpora iterate over the words which are in the dictionary and increase the corresponding coordinate by 1.

Let's try to do it for a toy example. Imagine that we have *N* = 4 and the list of the most popular words is 

    ['hi', 'you', 'me', 'are']

Then we need to numerate them, for example, like this: 

    {'hi': 0, 'you': 1, 'me': 2, 'are': 3}

And we have the text, which we want to transform to the vector:

    'hi how are you'

For this text we create a corresponding zero vector 

    [0, 0, 0, 0]
    
And iterate over all words, and if the word is in the dictionary, we increase the value of the corresponding position in the vector:

    'hi':  [1, 0, 0, 0]
    'how': [1, 0, 0, 0] # word 'how' is not in our dictionary
    'are': [1, 0, 0, 1]
    'you': [1, 1, 0, 1]

The resulting vector will be 

    [1, 1, 0, 1]
   
Implement the described encoding in the function *my_bag_of_words* with a dictionary size of 5000. To find the most common words use the train data. You can test your code using the function *test_my_bag_of_words*.
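The toy example above can be sketched in a few lines of plain Python. This is only an illustration with lists instead of NumPy arrays; `bag_of_words_sketch` is a hypothetical name, not the function the grader expects.

```python
def bag_of_words_sketch(text, words_to_index):
    """Count how often each dictionary word occurs in a whitespace-separated text."""
    vector = [0] * len(words_to_index)
    for word in text.split():
        index = words_to_index.get(word)
        if index is not None:      # words outside the dictionary are ignored
            vector[index] += 1
    return vector

toy_index = {'hi': 0, 'you': 1, 'me': 2, 'are': 3}
print(bag_of_words_sketch('hi how are you', toy_index))  # [1, 1, 0, 1]
print(bag_of_words_sketch('you and you', toy_index))     # [0, 2, 0, 0]
```

The second call shows why the coordinate is incremented rather than set to 1: repeated words raise the count.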

In [0]:
DICT_SIZE = 5000

dict_1 = sorted(words_counts.items(), key=lambda x: x[1], reverse=True)[:DICT_SIZE]

dict_1_pre_0 = [i[0] for i in sorted(words_counts.items(), key=lambda x: x[1], reverse=True)[:DICT_SIZE]]

WORDS_TO_INDEX = dict(zip(dict_1_pre_0, range(DICT_SIZE)))

INDEX_TO_WORDS = dict(zip(range(DICT_SIZE), dict_1_pre_0))

ALL_WORDS = WORDS_TO_INDEX.keys()

def my_bag_of_words(text, words_to_index, dict_size):
  
    """
        text: a string
        
        words_to_index: dictionary mapping each known word to its vector position
        
        dict_size: size of the dictionary
        
        return a vector which is a bag-of-words representation of 'text'
    """
    
    result_vector = np.zeros(dict_size)
    
    token = text.split(' ')
                       
    for i in token:
      ind_1 = words_to_index.get(i)
      if ind_1 is not None:
        # Increase the coordinate by 1 for every occurrence of a known word.
        result_vector[ind_1] += 1

    return result_vector

In [0]:
def test_my_bag_of_words():
    words_to_index = {'hi': 0, 'you': 1, 'me': 2, 'are': 3}
    examples = ['hi how are you']
    answers = [[1, 1, 0, 1]]
    for ex, ans in zip(examples, answers):
        if (my_bag_of_words(ex, words_to_index, 4) != ans).any():
            return "Wrong answer for the case: '%s'" % ex
    return 'Basic tests are passed.'

In [0]:
print(test_my_bag_of_words())

Basic tests are passed.


Now apply the implemented function to all samples (this might take up to a minute):

In [0]:
from scipy import sparse as sp_sparse

In [0]:
X_train_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_train])

X_val_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_val])

X_test_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_test])

print('X_train shape ', X_train_mybag.shape)

print('X_val shape ', X_val_mybag.shape)

print('X_test shape ', X_test_mybag.shape)

X_train shape  (100000, 5000)
X_val shape  (30000, 5000)
X_test shape  (20000, 5000)


As you might notice, we transform the data to a sparse representation to store the useful information efficiently. There are many [types](https://docs.scipy.org/doc/scipy/reference/sparse.html) of such representations; here we use the [csr](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix) matrix, which sklearn algorithms handle well.
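To see what the csr (compressed sparse row) layout stores, here is a hand-rolled sketch that builds the three arrays of the format from a small dense matrix. The function `to_csr` is a hypothetical helper written for illustration; in the notebook itself `scipy.sparse.csr_matrix` does this work.

```python
def to_csr(dense_rows):
    """Build the (data, indices, indptr) arrays of the CSR format."""
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # its column index
        indptr.append(len(data))      # where the next row starts in data
    return data, indices, indptr

dense = [[1, 0, 0, 2],
         [0, 0, 0, 0],
         [0, 3, 0, 0]]
data, indices, indptr = to_csr(dense)

# Number of non-zero elements per row, read directly from indptr.
nonzeros_per_row = [indptr[i + 1] - indptr[i] for i in range(len(dense))]
print(data, indices, indptr, nonzeros_per_row)
```

Only three values are stored for twelve cells, which is why a 100000 x 5000 bag-of-words matrix fits comfortably in memory.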

**Task 3 (BagOfWords).** For the 11th row in *X_train_mybag* find how many non-zero elements it has. In this task the answer (variable *non_zero_elements_count*) should be a number, e.g. 20.

In [0]:
row = X_train_mybag[10].toarray()[0]

# Count the non-zero coordinates directly; summing the row only agrees while every entry is 0 or 1.
non_zero_elements_count = int(np.count_nonzero(row))

grader.submit_tag('BagOfWords', str(non_zero_elements_count))

Current answer for task BagOfWords is:
 6.0...


#### TF-IDF

The second approach extends the bag-of-words framework by taking into account the total frequencies of words in the corpora. It helps to penalize too frequent words and provides a better feature space.

Implement the function *tfidf_features* using the class [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from *scikit-learn*. Use the *train* corpus to train a vectorizer. Don't forget to take a look at the arguments that you can pass to it. We suggest that you filter out too rare words (those occurring in fewer than 5 titles) and too frequent words (those occurring in more than 90% of the titles). Also, use bigrams along with unigrams in your vocabulary.
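To show how the weighting penalizes frequent words, the sketch below computes TF-IDF by hand on a hypothetical three-title corpus. It assumes the smoothed IDF formula that scikit-learn uses by default, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization; filtering by `min_df` and `max_df` and bigrams are left out to keep it short.

```python
import math
from collections import Counter

# Hypothetical prepared titles.
docs = ['java string split', 'python string format', 'python list sort']
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many titles each word occurs.
df = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tfidf_vector(tokens):
    """Term frequency times IDF, scaled to unit Euclidean length."""
    tf = Counter(tokens)
    weights = {word: count * idf(word) for word, count in tf.items()}
    norm = math.sqrt(sum(value * value for value in weights.values()))
    return {word: value / norm for word, value in weights.items()}

vec = tfidf_vector(tokenized[0])
print(vec)
```

In the first title, `java` occurs in one document and `string` in two, so `java` receives the larger weight.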

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [0]:
def tfidf_features(X_train, X_val, X_test):
    """
        X_train, X_val, X_test — samples        
        return TF-IDF vectorized representation of each sample and vocabulary
    """
    # Create TF-IDF vectorizer with a proper parameters choice
    # Fit the vectorizer on the train set
    # Transform the train, test, and val sets and return the result
    
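    # Unigrams and bigrams; tokens are whitespace-separated so that 'c++' and 'c#' survive.
    tfidf_vectorizer = TfidfVectorizer(min_df=5, max_df=0.9, ngram_range=(1, 2), token_pattern=r'(\S+)')

    # Fit the vocabulary and IDF weights on the train set only.
    X_train = tfidf_vectorizer.fit_transform(X_train)
    
    # Reuse the fitted vectorizer; refitting would give each set its own vocabulary.
    X_val = tfidf_vectorizer.transform(X_val)
    
    X_test = tfidf_vectorizer.transform(X_test)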
    
    return X_train, X_val, X_test, tfidf_vectorizer.vocabulary_

Once you have done text preprocessing, always have a look at the results. Be very careful at this step, because the performance of future models will drastically depend on it. 

In this case, check whether you have c++ or c# in your vocabulary, as they are obviously important tokens in our tags prediction task:

In [0]:
X_train_tfidf, X_val_tfidf, X_test_tfidf, tfidf_vocab = tfidf_features(X_train, X_val, X_test)

tfidf_reversed_vocab = {i:word for word,i in tfidf_vocab.items()}

In [0]:
######### YOUR CODE HERE #############

If you can't find them, we need to understand how we lost them. It happened during the built-in tokenization of TfidfVectorizer. Luckily, we can influence this process. Go back to the function above and use the '(\S+)' regexp as the *token_pattern* in the constructor of the vectorizer.

Now use this transformation for the data and check again.
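The effect of the token pattern can be reproduced with the `re` module alone. The sketch below compares scikit-learn's default `token_pattern`, `(?u)\b\w\w+\b`, with the whitespace-based pattern on a made-up title.

```python
import re

DEFAULT_PATTERN = r'(?u)\b\w\w+\b'   # default token_pattern: runs of 2+ word characters
WHITESPACE_PATTERN = r'(\S+)'        # pattern suggested in this assignment

title = 'c++ vs c# vs java'          # hypothetical prepared title

default_tokens = re.findall(DEFAULT_PATTERN, title)
whitespace_tokens = re.findall(WHITESPACE_PATTERN, title)

print(default_tokens)     # ['vs', 'vs', 'java']
print(whitespace_tokens)  # ['c++', 'vs', 'c#', 'vs', 'java']
```

With the default pattern, `+` and `#` are not word characters and the lone `c` is too short to be a token, so both language names vanish from the vocabulary.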

In [0]:
######### YOUR CODE HERE #############

### MultiLabel classifier

As we have noticed before, in this task each example can have multiple tags. To deal with this kind of prediction, we need to transform the labels into a binary form, and the prediction will be a mask of 0s and 1s. For this purpose it is convenient to use [MultiLabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html) from *sklearn*.
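As a rough picture of what the binarizer produces, here is a plain-Python sketch of the forward and inverse mapping. `binarize` and `inverse` are hypothetical helpers written for illustration; they mimic the idea, not the exact API, of `MultiLabelBinarizer`.

```python
def binarize(tag_lists, classes):
    """Turn each list of tags into a 0/1 mask over the ordered classes."""
    index = {tag: i for i, tag in enumerate(classes)}
    rows = []
    for tags in tag_lists:
        row = [0] * len(classes)
        for tag in tags:
            row[index[tag]] = 1
        rows.append(row)
    return rows

def inverse(rows, classes):
    """Turn each 0/1 mask back into a tuple of tags."""
    return [tuple(c for c, flag in zip(classes, row) if flag) for row in rows]

classes = sorted(['python', 'java', 'json'])   # ['java', 'json', 'python']
y = binarize([['python', 'json'], ['java'], []], classes)
print(y)
print(inverse(y, classes))
```

An empty tag list maps to an all-zero row, which is also how a prediction with no tags appears after the inverse transform.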

In [0]:
from sklearn.preprocessing import MultiLabelBinarizer

In [0]:
# train = read_data('data/train.tsv')
# validation = read_data('data/validation.tsv')
# test = pd.read_csv('data/test.tsv', sep='\t')


X_train, y_train = train['title'].values, train['tags'].values

X_val, y_val = validation['title'].values, validation['tags'].values

X_test = test['title'].values

X_train = [text_prepare(x) for x in X_train]

X_val = [text_prepare(x) for x in X_val]

X_test = [text_prepare(x) for x in X_test]

In [0]:
X_train[99]

'mappedby reference unknown target entity property'

In [0]:
X_train_val_test = X_train + X_val + X_test

basicvectorizer_all = CountVectorizer(min_df = 1, max_df = 0.95, ngram_range=(1, 1),token_pattern= '(\S+)')

BOW_all = basicvectorizer_all.fit_transform(X_train_val_test)

BOW_all.shape

BOW_train = BOW_all[:100000]

BOW_val = BOW_all[100000:130000]

BOW_test = BOW_all[130000:150000]

BOW_train.shape

BOW_val.shape

BOW_val.sum(axis = 1).min()

BOW_train.sum(axis = 1).min()

(150000, 40314)

(100000, 40314)

(30000, 40314)

1

1

In [0]:
BOW_val.shape

np.where(BOW_val.sum(axis = 1) ==0)

(30000, 405333)

(array([], dtype=int64), array([], dtype=int64))

In [0]:
y_train = train['tags'].values

y_val =  validation['tags'].values

mlb = MultiLabelBinarizer(classes=sorted(tags_counts.keys()))

y_train = mlb.fit_transform(y_train)

# Reuse the binarizer fitted on the train labels instead of refitting it on validation.
y_val = mlb.transform(y_val)

Implement the function *train_classifier* for training a classifier. In this task we suggest using the One-vs-Rest approach, which is implemented in the [OneVsRestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html) class. In this approach *k* classifiers (= number of tags) are trained. As a basic classifier, use [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time, because the number of classifiers to train is large.
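The One-vs-Rest idea can be illustrated without any library: train one independent binary logistic regression per label column. The sketch below does this with plain batch gradient descent on a hypothetical four-sample dataset. All names, the learning rate and the number of epochs are assumptions for illustration, and it is far slower and less robust than `OneVsRestClassifier(LogisticRegression())`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_binary_logreg(X, y, lr=0.5, epochs=2000):
    """Fit one unregularized logistic regression by batch gradient descent."""
    n_features = len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for features, target in zip(X, y):
            logit = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(logit) - target
            grad_b += error
            for j, x in enumerate(features):
                grad_w[j] += error * x
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias

def train_one_vs_rest(X, Y):
    """Train k binary classifiers, one per label column of Y."""
    n_labels = len(Y[0])
    return [train_binary_logreg(X, [row[k] for row in Y]) for k in range(n_labels)]

def predict_one_vs_rest(models, features):
    """A label is predicted when its classifier's score is positive."""
    return [int(sum(w * x for w, x in zip(weights, features)) + bias > 0)
            for weights, bias in models]

# Hypothetical data: two bag-of-words features, two tags.
# Tag 0 is present exactly when word 0 occurs, tag 1 when word 1 occurs.
X = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y = [[1, 0], [0, 1], [1, 1], [0, 0]]

models = train_one_vs_rest(X, Y)
print([predict_one_vs_rest(models, x) for x in X])
```

Because each label has its own classifier, a sample can receive zero, one or several tags, which is exactly what the multilabel setting needs.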

In [0]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

def train_classifier(X_train, y_train):
    """
      X_train, y_train — training data
      
      return: trained classifier
    """
    
    OVR = OneVsRestClassifier(LogisticRegression()).fit(X_train,y_train)
    
    return OVR
    

In [0]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier

Train the classifiers for different data transformations: *bag-of-words* and *tf-idf*.

In [0]:
BOW_train.shape

y_train.shape

(100000, 7813)

(100000, 100)

In [0]:
classifier_mybag = train_classifier(BOW_train, y_train)

# classifier_tfidf = train_classifier(X_train_tfidf, y_train)



Now you can create predictions for the data. You will need two types of predictions: labels and scores.
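For logistic regression the two kinds of predictions are closely related: the probability is the sigmoid of the decision score, and a label is predicted when the score is positive (probability above 0.5). The sketch below checks this on two scores copied from the output printed later in this notebook; the third score is a made-up positive value.

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

# First two values come from the decision-function output in this notebook
# (positions 0 and 10 of the first validation example); the third is hypothetical.
scores = [-3.71433998, -1.40428296, 0.8]

probabilities = [sigmoid(s) for s in scores]
labels = [int(s > 0) for s in scores]

print(probabilities)
print(labels)
```

The first two probabilities agree with the `predict_proba` values 0.02379168 and 0.19713735 shown in that output, and both scores are negative, so neither tag is predicted.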

In [0]:
BOW_val.sum(axis = 1).min()

0

In [0]:
BOW_val.sum(axis = 1).shape

BOW_val.sum(axis = 1).min()

BOW_val.shape

np.where(BOW_val[1,:].todense() == 1)

(30000, 1)

1

(30000, 40314)

(array([0, 0, 0, 0, 0, 0, 0]),
 array([ 1979,  4581,  6490,  6662,  6684, 38142, 39321]))

In [0]:
type(classifier_mybag)

NoneType

In [0]:
y_val_predicted_labels_mybag = classifier_mybag.predict(BOW_val)

y_val_predicted_scores_mybag = classifier_mybag.decision_function(BOW_val)

y_val_predicted_prob_mybag = classifier_mybag.predict_proba(BOW_val)

In [0]:
y_val_predicted_labels_mybag_revise = y_val_predicted_labels_mybag.copy()

for i in range(len(y_val_predicted_labels_mybag_revise)):
  
  x = y_val_predicted_labels_mybag_revise[i]
  
  score = y_val_predicted_prob_mybag[i]
  
  if x.sum() == 0:
    
    ind = np.where(score == score.max())
    
    x[ind] = 1

In [0]:
y_val_predicted_labels_mybag_revise[0].sum()

1

In [0]:
  i = 0
  
  x = y_val_predicted_labels_mybag_revise[i]
  
  score = y_val_predicted_prob_mybag
  
  if x.sum() == 0:
    
    ind = np.where(score == score.max())
    
    x[ind] = 1

100

In [0]:
y_val_predicted_labels_mybag_revise[0].shape

y_val_predicted_labels_mybag_revise[0].sum()

(100,)

100

In [0]:
validation.iloc[5,:].title

'library not found for.....?'

In [0]:
ind = 30

pd.options.display.max_colwidth = 100

pd.concat([validation[:ind], 
           pd.Series(mlb.inverse_transform(y_val_predicted_labels_mybag[:ind])),
          pd.Series(mlb.inverse_transform(y_val_predicted_labels_mybag_revise[:ind]))],
          axis = 1)

Unnamed: 0,title,tags,0,1
0,Why odbc_exec always fail?,"[php, sql]",(),"(c#,)"
1,Access a base classes variable from within a child class,[javascript],(),"(class,)"
2,"Content-Type ""application/json"" not required in rails","[ruby-on-rails, ruby]","(ruby-on-rails,)","(ruby-on-rails,)"
3,Sessions in Sinatra: Used to Pass Variable,"[ruby, session]","(ruby,)","(ruby,)"
4,"Getting error - type ""json"" does not exist - in Postgresql during rake db migrate","[ruby-on-rails, ruby, json]","(json, ruby-on-rails)","(json, ruby-on-rails)"
5,library not found for.....?,"[c++, iphone, ios, xcode]",(),"(c#,)"
6,.csproj File - Programmatic adding/deleting files,[c#],(),"(c#,)"
7,TypeError: makedirs() got an unexpected keyword argument 'exists_ok',"[python, django]","(python,)","(python,)"
8,How to Pan a div using JQuery,"[javascript, jquery, html]","(javascript, jquery)","(javascript, jquery)"
9,Hibernate intermediate/advanced tutorials,"[java, hibernate]","(hibernate, java)","(hibernate, java)"


In [0]:
np.where(y_val_predicted_labels_mybag.sum(axis = 1) > 0)

(array([    2,     3,     4, ..., 29995, 29996, 29998]),)

In [0]:
validation.tags[0]

validation.tags[1]

validation.tags[2]

['php', 'sql']

['javascript']

['ruby-on-rails', 'ruby']

In [0]:
ind = 0

np.where(y_val[ind] == 1)

np.where(y_val[ind] == np.max(y_val[ind]))

np.where(y_val_predicted_labels_mybag[ind] > 0)

y_val_predicted_scores_mybag[ind]

y_val_predicted_prob_mybag[ind]

(array([60, 79]),)

(array([60, 79]),)

(array([], dtype=int64),)

array([-3.71433998, -5.41500174, -5.9087868 , -3.66715473, -5.08714229,
       -6.21922424, -5.69322619, -3.02050159, -5.10976064, -3.77603978,
       -1.40428296, -3.03438931, -6.72585617, -5.32375851, -6.08137548,
       -5.35051278, -7.35145568, -4.96844721, -6.5256824 , -6.08388806,
       -5.31804745, -5.82663018, -5.18897035, -5.56469178, -7.93708856,
       -6.89650549, -5.87657399, -6.11200187, -7.13297366, -6.64571331,
       -7.25596942, -5.38076077, -4.21384017, -4.5839305 , -5.484181  ,
       -3.59758327, -3.0108355 , -1.63223876, -3.1033512 , -4.17534711,
       -5.22057712, -6.11961208, -6.64047162, -5.77398403, -4.53681998,
       -6.00442354, -6.89028787, -6.40407498, -6.50608521, -4.8041737 ,
       -4.90567766, -5.92648413, -5.79635804, -3.30246621, -5.96141754,
       -5.34938369, -5.72538055, -6.60975153, -6.39096926, -5.3150334 ,
       -1.69596372, -6.47313834, -3.11636651, -6.00059949, -6.00617924,
       -5.13441011, -5.12067613, -6.02984341, -6.31970444, -3.90

array([0.02379168, 0.00442962, 0.00270813, 0.02491257, 0.00613774,
       0.00198683, 0.0033574 , 0.04650823, 0.00600129, 0.0224    ,
       0.19713735, 0.04589624, 0.00119806, 0.00485075, 0.00227982,
       0.0047233 , 0.00064125, 0.00690591, 0.00146317, 0.00227411,
       0.0048784 , 0.00293933, 0.00554681, 0.00381614, 0.00035712,
       0.00101029, 0.00279653, 0.00221121, 0.00079771, 0.00129789,
       0.00070545, 0.00458322, 0.01457392, 0.01011138, 0.00413477,
       0.02665963, 0.04693875, 0.1635239 , 0.04296923, 0.0151372 ,
       0.00537516, 0.00219448, 0.00130471, 0.00309773, 0.01059397,
       0.00246174, 0.00101659, 0.00165207, 0.00149209, 0.00812885,
       0.00735   , 0.00266075, 0.0030294 , 0.03548668, 0.00256964,
       0.00472861, 0.0032515 , 0.00134535, 0.00167382, 0.00489305,
       0.15499316, 0.00154199, 0.04243718, 0.00247114, 0.00245743,
       0.00585603, 0.00593653, 0.0024001 , 0.00179724, 0.01978204,
       0.03074196, 0.00742552, 0.00085132, 0.00631658, 0.00057

In [0]:
y_val_predicted_labels_tfidf = classifier_tfidf.predict(X_val_tfidf)
y_val_predicted_scores_tfidf = classifier_tfidf.decision_function(X_val_tfidf)

AttributeError: ignored

Now take a look at how the classifier that uses TF-IDF works for a few examples:

In [0]:
y_val_pred_inversed = mlb.inverse_transform(y_val_predicted_labels_tfidf)
y_val_inversed = mlb.inverse_transform(y_val)
for i in range(3):
    print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_val[i],
        ','.join(y_val_inversed[i]),
        ','.join(y_val_pred_inversed[i])
    ))

Now we need to compare the results of different predictions, e.g. to see whether the TF-IDF transformation helps or to try different regularization techniques in logistic regression. For all these experiments, we need to set up an evaluation procedure.

### Evaluation

To evaluate the results we will use several classification metrics:
 - [Accuracy](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)
 - [F1-score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)
 - [Area under ROC-curve](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html)
 - [Area under precision-recall curve](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score) 
 
Make sure you are familiar with all of them. How would you expect things to work in the multi-label scenario? Read about micro/macro/weighted averaging following the sklearn links provided above.
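To make the difference between the averaging modes tangible, the sketch below computes subset accuracy and micro, macro and weighted F1 by hand on a hypothetical two-label problem. The helper names are illustrative; in the notebook the same quantities come from `accuracy_score` and `f1_score` with the `average` argument.

```python
def label_counts(y_true, y_pred, label):
    """True positives, false positives and false negatives for one label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[label] and p[label])
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t[label] and p[label])
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[label] and not p[label])
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical ground truth and predictions for 4 samples and 2 labels.
y_true = [[1, 0], [1, 1], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 1]]
n_labels = 2

counts = [label_counts(y_true, y_pred, k) for k in range(n_labels)]
per_label_f1 = [f1_from_counts(*c) for c in counts]
support = [tp + fn for tp, fp, fn in counts]   # number of true positives per label

macro_f1 = sum(per_label_f1) / n_labels                          # plain mean over labels
weighted_f1 = sum(f * s for f, s in zip(per_label_f1, support)) / sum(support)
micro_f1 = f1_from_counts(*[sum(col) for col in zip(*counts)])   # pool all counts first
subset_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(subset_accuracy, micro_f1, macro_f1, weighted_f1)
```

Subset accuracy only rewards rows where every label matches, which is why it tends to look much lower than the F1 variants in multi-label tasks.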

In [0]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score 
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score

Implement the function *print_evaluation_scores* which calculates and prints to stdout:
 - *accuracy*
 - *F1-score macro/micro/weighted*
 - *Precision macro/micro/weighted*

In [0]:
def print_evaluation_scores(y_val, predicted):
    
    ######################################
    ######### YOUR CODE HERE #############
    ######################################

In [0]:
y_val_predicted_labels_mybag[:5]

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0,

In [0]:
[np.where(i) for i in y_val]

[(array([60, 79]),),
 (array([38]),),
 (array([69, 70]),),
 (array([69, 74]),),
 (array([40, 69, 70]),),
 (array([11, 35, 36, 98]),),
 (array([10]),),
 (array([20, 62]),),
 (array([32, 38, 39]),),
 (array([31, 37]),),
 (array([ 0, 10, 12]),),
 (array([10, 96]),),
 (array([60]),),
 (array([37]),),
 (array([10, 96]),),
 (array([10, 96]),),
 (array([62, 81]),),
 (array([20, 84]),),
 (array([ 3, 37]),),
 (array([69, 70]),),
 (array([19, 38, 60]),),
 (array([62, 81]),),
 (array([10, 43, 81]),),
 (array([ 3, 38]),),
 (array([66]),),
 (array([37]),),
 (array([20]),),
 (array([17, 37]),),
 (array([70]),),
 (array([37]),),
 (array([ 1, 38, 39]),),
 (array([ 3, 37]),),
 (array([11]),),
 (array([ 0, 10, 32]),),
 (array([ 9, 61]),),
 (array([60]),),
 (array([16, 66]),),
 (array([37, 79]),),
 (array([10, 43]),),
 (array([32, 38]),),
 (array([60]),),
 (array([35, 36, 53]),),
 (array([60]),),
 (array([ 0, 10, 43]),),
 (array([10, 49, 75]),),
 (array([38]),),
 (array([14, 60]),),
 (array([37]),),
 (ar

In [0]:
accuracy_score(y_val, y_val_predicted_labels_mybag)

0.36033333333333334

In [0]:
print('Bag-of-words')
print_evaluation_scores(y_val, y_val_predicted_labels_mybag)
# print('Tfidf')
# print_evaluation_scores(y_val, y_val_predicted_labels_tfidf)

Bag-of-words


NameError: ignored

You might also want to plot some generalization of the [ROC curve](http://scikit-learn.org/stable/modules/model_evaluation.html#receiver-operating-characteristic-roc) for the case of multi-label classification. The provided function *roc_auc* can do it for you. The input parameters of this function are:
 - true labels
 - decision functions scores
 - number of classes

In [0]:
from metrics import roc_auc
%matplotlib inline

In [0]:
n_classes = len(tags_counts)
roc_auc(y_val, y_val_predicted_scores_mybag, n_classes)

In [0]:
n_classes = len(tags_counts)
roc_auc(y_val, y_val_predicted_scores_tfidf, n_classes)

**Task 4 (MultilabelClassification).** Once we have the evaluation set up, we suggest that you experiment a bit with training your classifiers. We will use the *weighted F1-score* as the evaluation metric. Our recommendations:
- compare the quality of the bag-of-words and TF-IDF approaches and choose one of them.
- for the chosen one, try *L1* and *L2* regularization in Logistic Regression with different coefficients (e.g. C equal to 0.1, 1, 10, 100).

You could also try other improvements to the preprocessing or the model, if you want.
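
One possible way to organize the experiment suggested above is a small loop over penalties and values of C. This is a sketch on synthetic data generated with `make_multilabel_classification`, not a solution on the assignment features. The `liblinear` solver is an assumption, picked because it supports both L1 and L2 penalties, and the names `results` and `best` are hypothetical.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the bag-of-words / TF-IDF features and tag matrix.
X, Y = make_multilabel_classification(n_samples=300, n_features=30,
                                      n_classes=5, random_state=0)
X_tr, X_va, Y_tr, Y_va = train_test_split(X, Y, test_size=0.3, random_state=0)

results = {}
for penalty in ('l1', 'l2'):
    for C in (0.1, 1, 10, 100):
        # One binary logistic regression per tag, all with the same settings.
        base = LogisticRegression(penalty=penalty, C=C, solver='liblinear')
        clf = OneVsRestClassifier(base).fit(X_tr, Y_tr)
        # Weighted F1: per-tag F1 averaged with weights equal to tag support.
        results[(penalty, C)] = f1_score(Y_va, clf.predict(X_va),
                                         average='weighted', zero_division=0)

best = max(results, key=results.get)
for (penalty, C), score in sorted(results.items()):
    print('%s  C=%-5s  weighted F1 = %.3f' % (penalty, C, score))
print('best setting:', best)
```

On the real data the same loop would be run once for the chosen feature representation, with the validation set used for comparison.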

In [0]:
######################################
######### YOUR CODE HERE #############
######################################

When you are happy with the quality, create predictions for the *test* set, which you will submit to Coursera.

In [0]:
pd.Series(mlb.inverse_transform(y_val_predicted_labels_mybag)).map(len).value_counts()

validation.tags.map(len).value_counts()

1    13136
2     8324
0     6063
3     1991
4      414
5       67
6        5
dtype: int64

2    12402
1    10521
3     5343
4     1514
5      220
Name: tags, dtype: int64

In [0]:
ind = 10

pd.options.display.max_colwidth = 100

pd.concat([validation[:ind], 
           pd.Series(mlb.inverse_transform(y_val_predicted_labels_mybag[:ind]))], axis = 1)

Unnamed: 0,title,tags,0
0,Why odbc_exec always fail?,"[php, sql]",()
1,Access a base classes variable from within a child class,[javascript],()
2,"Content-Type ""application/json"" not required in rails","[ruby-on-rails, ruby]","(ruby-on-rails,)"
3,Sessions in Sinatra: Used to Pass Variable,"[ruby, session]","(ruby,)"
4,"Getting error - type ""json"" does not exist - in Postgresql during rake db migrate","[ruby-on-rails, ruby, json]","(json, ruby-on-rails)"
5,library not found for.....?,"[c++, iphone, ios, xcode]",()
6,.csproj File - Programmatic adding/deleting files,[c#],()
7,TypeError: makedirs() got an unexpected keyword argument 'exists_ok',"[python, django]","(python,)"
8,How to Pan a div using JQuery,"[javascript, jquery, html]","(javascript, jquery)"
9,Hibernate intermediate/advanced tutorials,"[java, hibernate]","(hibernate, java)"


In [0]:
test_predictions = ######### YOUR CODE HERE #############
test_pred_inversed = mlb.inverse_transform(test_predictions)

test_predictions_for_submission = '\n'.join('%i\t%s' % (i, ','.join(row)) for i, row in enumerate(test_pred_inversed))
grader.submit_tag('MultilabelClassification', test_predictions_for_submission)
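
The formatting step in the cell above can be checked on a toy example before submitting. This sketch assumes a hypothetical three-tag binarizer `mlb_demo` and a hand-written prediction matrix. It only illustrates how `inverse_transform` and the `'%i\t%s'` template turn a binary matrix into one tab-separated line per post.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical binarizer with a fixed tag order (not the assignment's mlb).
mlb_demo = MultiLabelBinarizer(classes=['java', 'php', 'python'])
mlb_demo.fit([['java', 'php', 'python']])

# Three toy posts: one tag, two tags, and no predicted tag at all.
toy_predictions = np.array([[0, 1, 0],
                            [1, 0, 1],
                            [0, 0, 0]])

# Each row becomes a tuple of tag names; an all-zero row becomes ().
inversed = mlb_demo.inverse_transform(toy_predictions)

# Same template as the submission cell: "<row index>\t<comma-separated tags>".
lines = '\n'.join('%i\t%s' % (i, ','.join(row))
                  for i, row in enumerate(inversed))
print(lines)
```

A post with no predicted tags produces a line that ends right after the tab, which matches the empty tuples visible in the validation preview.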

### Analysis of the most important features

Finally, it is usually a good idea to look at the features (words or n-grams) that receive the largest weights in your logistic regression model.

Implement the function *print_words_for_tag* to find them. Refer back to the sklearn documentation on [OneVsRestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html) and [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) if needed.
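
As a rough illustration of the idea, the sketch below trains a tiny one-vs-rest model on a made-up corpus and reads the per-tag weights from `estimators_` and `coef_`. The helper `top_words`, the toy texts, and the parameter `k` are hypothetical. The helper returns lists instead of printing and omits the `all_words` argument, so it is not the graded function.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up corpus: two short "titles" per tag.
texts = ['pointer malloc free struct', 'malloc pointer segfault',
         'class template vector stl', 'template class iterator vector',
         'bash shell grep kernel', 'kernel shell permissions bash']
tags = [['c'], ['c'], ['c++'], ['c++'], ['linux'], ['linux']]

mlb_toy = MultiLabelBinarizer()
Y = mlb_toy.fit_transform(tags)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
clf = OneVsRestClassifier(LogisticRegression(C=10)).fit(X, Y)

# Map column index -> word, the reverse of the vectorizer vocabulary.
index_to_words = {i: w for w, i in vectorizer.vocabulary_.items()}

def top_words(classifier, tag, tags_classes, index_to_words, k=5):
    """Return (top-k positive words, top-k negative words) for one tag."""
    # OneVsRestClassifier keeps one fitted binary estimator per class.
    estimator = classifier.estimators_[list(tags_classes).index(tag)]
    coef = estimator.coef_[0]      # one weight per feature
    order = np.argsort(coef)       # ascending by weight
    top_positive = [index_to_words[i] for i in order[-k:][::-1]]
    top_negative = [index_to_words[i] for i in order[:k]]
    return top_positive, top_negative

pos, neg = top_words(clf, 'c', mlb_toy.classes_, index_to_words, k=3)
print('Top positive words:\t{}'.format(', '.join(pos)))
print('Top negative words:\t{}'.format(', '.join(neg)))
```

Words that appear only in posts carrying the tag end up with positive weights, and words seen only in other posts end up with negative ones, which is the pattern to look for on the real model.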

In [0]:
def print_words_for_tag(classifier, tag, tags_classes, index_to_words, all_words):
    """
        classifier: trained classifier
        tag: particular tag
        tags_classes: a list of classes names from MultiLabelBinarizer
        index_to_words: index_to_words transformation
        all_words: all words in the dictionary
        
        return nothing, just print top 5 positive and top 5 negative words for current tag
    """
    print('Tag:\t{}'.format(tag))
    
    # Extract an estimator from the classifier for the given tag.
    # Extract feature coefficients from the estimator. 
    
    ######################################
    ######### YOUR CODE HERE #############
    ######################################
    
    top_positive_words = # top-5 words sorted by the coefficients.
    top_negative_words = # bottom-5 words sorted by the coefficients.
    print('Top positive words:\t{}'.format(', '.join(top_positive_words)))
    print('Top negative words:\t{}\n'.format(', '.join(top_negative_words)))

In [0]:
print_words_for_tag(classifier_tfidf, 'c', mlb.classes, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier_tfidf, 'c++', mlb.classes, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier_tfidf, 'linux', mlb.classes, tfidf_reversed_vocab, ALL_WORDS)

### Authorization & Submission
To submit assignment parts to the Coursera platform, please enter your e-mail and token into the variables below. You can generate a token on this programming assignment's page. <b>Note:</b> The token expires 30 minutes after generation.

In [0]:
grader.status()

You want to submit these parts:
Task TextPrepare:
 sqlite php readonly
creating multiple textboxes dynamically
self one prefer javascript
save php date...
Task WordsTagsCount:
 javascript,java,php
using,php,java...
Task BagOfWords:
 6.0...
Task MultilabelClassification:
 ----------...


In [0]:
STUDENT_EMAIL = # EMAIL 
STUDENT_TOKEN = # TOKEN 
grader.status()

If you want to submit these answers, run the cell below.

In [0]:
grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)