
In [1]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [0]:
import numpy as np
import pandas as pd
from ast import literal_eval

In [0]:
def read_data(filename):
  data = pd.read_csv(filename, sep='\t')
  data['tags'] = data['tags'].apply(literal_eval)
  return data

In [0]:
train = read_data('train.tsv')
validation = read_data('validation.tsv')
test = pd.read_csv('test.tsv', sep='\t')

In [5]:
print(train.head(5))
print(validation.head(5))
print(test.head(5))

                                               title                  tags
0                How to draw a stacked dotplot in R?                   [r]
1  mysql select all records where a datetime fiel...          [php, mysql]
2             How to terminate windows phone 8.1 app                  [c#]
3  get current time in a specific country via jquery  [javascript, jquery]
4                      Configuring Tomcat to Use SSL                [java]
                                               title                         tags
0                         Why odbc_exec always fail?                   [php, sql]
1  Access a base classes variable from within a c...                 [javascript]
2  Content-Type "application/json" not required i...        [ruby-on-rails, ruby]
3         Sessions in Sinatra: Used to Pass Variable              [ruby, session]
4  Getting error - type "json" does not exist - i...  [ruby-on-rails, ruby, json]
                                               title
1  ge

We can see that the number of tags per title is not fixed.
For convenience, we store the titles and tags in X_train, X_val, X_test, y_train, and y_val.

In [0]:
X_train, y_train = train['title'].values, train['tags'].values
X_val, y_val = validation['title'].values, validation['tags'].values
X_test = test['title'].values
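
The tags column is stored in the TSV files as text, which is why read_data passes it through literal_eval. A minimal sketch of that conversion, using a made-up cell value rather than a row from the dataset:

```python
from ast import literal_eval

# Hypothetical raw cell value, as pandas would read it from the TSV file.
raw_cell = "['php', 'mysql']"

# literal_eval parses a Python literal without executing arbitrary code;
# here it turns the string into a real list of tags.
tags = literal_eval(raw_cell)

print(type(raw_cell).__name__, '->', type(tags).__name__)
print(tags, len(tags))
```

Because each parsed list can have a different length, the tags cannot be used directly as a fixed-width target and are binarized later in the notebook.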

In [7]:
print(X_train)
print(y_train)

['How to draw a stacked dotplot in R?'
 'mysql select all records where a datetime field is less than a specified value'
 'How to terminate windows phone 8.1 app' ...
 'Python Pandas Series of Datetimes to Seconds Since the Epoch'
 'jqGrid issue grouping - Duplicate rows get appended every time sort is changed'
 'Create a List of primitive int?']
[list(['r']) list(['php', 'mysql']) list(['c#']) ...
 list(['python', 'datetime', 'pandas']) list(['javascript', 'jquery'])
 list(['java', 'list', 'generics'])]


Let's clean the text: lowercase it, replace some special symbols with spaces, drop the remaining unwanted characters, and remove stopwords.

In [0]:
import re

In [0]:
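# Raw strings keep the backslashes intact and avoid invalid-escape warnings.
re_replace_by_space = re.compile(r'[/(){}\[\]\|@,;]')  # symbols replaced by a space
re_bad_words = re.compile(r'[^0-9a-z #+_]')  # any other unwanted character is deleted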
STOPWORDS = set(stopwords.words('english'))

def text_prepare(text):
  text = text.lower()
  text = re_replace_by_space.sub(' ', text)
  text = re_bad_words.sub('', text)
  text = ' '.join(x for x in text.split() if x and x not in STOPWORDS) #deleting stopwords

  return text

In [0]:
def test_text_prepare():
  examples = ["SQL Server - any equivalent of Excel's CHOOSE function?",
                "How to free c++ memory vector<int> * arr?"]
  answers = ["sql server equivalent excels choose function", 
               "free c++ memory vectorint arr"]
  for ex, ans in zip(examples, answers):
    if text_prepare(ex) != ans:
      return "Wrong answer for the case: '%s'" % ex

  return 'test passed!!'

In [11]:
print(test_text_prepare())

test passed!!
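
To see what each stage of text_prepare contributes, the sketch below records the intermediate strings for one of the test examples. The helper name trace_text_prepare and the tiny TOY_STOPWORDS set are assumptions made for illustration, so the block runs without the NLTK data; the two regular expressions are the ones defined above.

```python
import re

re_replace_by_space = re.compile(r'[/(){}\[\]\|@,;]')
re_bad_words = re.compile(r'[^0-9a-z #+_]')
# Assumption: a tiny stand-in for the NLTK English stopword list.
TOY_STOPWORDS = {'how', 'to', 'a', 'the', 'of', 'any'}

def trace_text_prepare(text):
    """Return the intermediate strings produced by each cleaning stage."""
    lowered = text.lower()
    spaced = re_replace_by_space.sub(' ', lowered)
    stripped = re_bad_words.sub('', spaced)
    filtered = ' '.join(w for w in stripped.split() if w not in TOY_STOPWORDS)
    return [lowered, spaced, stripped, filtered]

stages = trace_text_prepare("How to free c++ memory vector<int> * arr?")
for stage in stages:
    print(repr(stage))
```

Note that '+' and '#' survive the cleaning, so tokens such as c++ and c# remain distinguishable from c.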


In [12]:
prepared_questions = []
for line in open('text_prepare_tests.tsv', encoding='utf-8'):
  line = text_prepare(line.strip())
  prepared_questions.append(line)

text_prepare_result = '\n'.join(prepared_questions)
  
print(text_prepare_result)

sqlite php readonly
creating multiple textboxes dynamically
self one prefer javascript
save php date string mysql database timestamp
fill dropdownlist data xml file aspnet application
programmatically trigger jqueryui draggables drag event
get value method argument via reflection java
knockout mapingfromjs observablearray json object data gets lost
facebook connect localhost weird stuff
fullcalendar prev next click
syntaxerror unexpected token
effective way float double comparison
gem install rails fails dns error
listshuttle component richfaces getting updated
laravel responsedownload show images laravel
wrong rspec test
calendar display using java swing
python selenium import regular firefox profile addons
random number 2 variables values
altering http responses firefox extension
start session python web application
align radio buttons horizontally django forms
count number rows sqlite database
wordpress wp_rewrite rules
removing sheet excel 2005 using php
php fatal error function na

In [0]:
X_train = [text_prepare(x) for x in X_train]
X_val = [text_prepare(x) for x in X_val]
X_test = [text_prepare(x) for x in X_test]

In [14]:
X_train[:5]

['draw stacked dotplot r',
 'mysql select records datetime field less specified value',
 'terminate windows phone 81 app',
 'get current time specific country via jquery',
 'configuring tomcat use ssl']

In [0]:
from collections import Counter

tags_counts = Counter() # Dictionary of all tags from train corpus with their counts.
words_counts = Counter() # Dictionary of all words from train corpus with their counts.

for tags in y_train:
  for tag in tags:
    tags_counts[tag] += 1

for words in X_train:
  for word in words.split():
    words_counts[word] += 1

In [16]:
print(tags_counts)
print(words_counts)

Counter({'javascript': 19078, 'c#': 19077, 'java': 18661, 'php': 13907, 'python': 8940, 'jquery': 7510, 'c++': 6469, 'html': 4668, 'objective-c': 4338, 'asp.net': 3939, '.net': 3872, 'ruby-on-rails': 3344, 'ios': 3256, 'c': 3119, 'mysql': 3092, 'android': 2818, 'ruby': 2326, 'arrays': 2277, 'json': 2026, 'vb.net': 1918, 'iphone': 1909, 'django': 1835, 'css': 1769, 'ajax': 1767, 'r': 1727, 'string': 1573, 'winforms': 1468, 'swift': 1465, 'regex': 1442, 'angularjs': 1353, 'xml': 1347, 'spring': 1346, 'wpf': 1289, 'sql': 1272, 'asp.net-mvc': 1244, 'multithreading': 1118, 'eclipse': 992, 'linq': 964, 'xcode': 900, 'forms': 872, 'html5': 842, 'windows': 838, 'hibernate': 807, 'linux': 793, 'codeigniter': 786, 'node.js': 771, 'swing': 759, 'database': 740, 'list': 693, 'ruby-on-rails-3': 692, 'jsp': 680, 'image': 672, 'entity-framework': 649, 'web-services': 633, 'spring-mvc': 618, 'visual-studio-2010': 588, 'sql-server': 585, 'file': 582, 'sockets': 579, 'visual-studio': 574, 'date': 560, '

In [0]:
# Select the three most frequent tags and words from the counters
most_common_tags = sorted(tags_counts.items(), key=lambda x: x[1], reverse=True)[:3]
most_common_words = sorted(words_counts.items(), key=lambda x: x[1], reverse=True)[:3]

In [18]:
print(most_common_tags)
print(most_common_words)

[('javascript', 19078), ('c#', 19077), ('java', 18661)]
[('using', 8278), ('php', 5614), ('java', 5501)]
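
The same selection can also be written with Counter.most_common, which is part of the standard library. A small sketch on made-up tag lists (toy_tags is a stand-in for y_train, not real data):

```python
from collections import Counter

# Toy tag lists standing in for y_train.
toy_tags = [['php', 'mysql'], ['php'], ['java'], ['java', 'list'], ['php', 'java']]

# Count every tag across all examples.
toy_counts = Counter(tag for tags in toy_tags for tag in tags)

# The approach used in the notebook: sort all items by count, keep the top 2.
top_sorted = sorted(toy_counts.items(), key=lambda x: x[1], reverse=True)[:2]
# The built-in shortcut.
top_common = toy_counts.most_common(2)

print(top_sorted)
print(top_common)
```

Both forms return (item, count) pairs, so either can feed the later steps unchanged.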


##Transforming text to a vector
#Bag of words

In [0]:
DICT_SIZE = 5000
INDEX_TO_WORDS = sorted(words_counts.keys(), key=lambda x: words_counts[x], reverse=True)[:DICT_SIZE]
WORDS_TO_INDEX = {word:i for i, word in enumerate(INDEX_TO_WORDS)}
ALL_WORDS = WORDS_TO_INDEX.keys()

def my_bag_of_words(text, words_to_index, dict_size):
  """
        text: a string
        words_to_index: a dict mapping each vocabulary word to its index in the vector
        dict_size: size of the dictionary
        return a vector which is a bag-of-words representation of 'text'
  """
  result_vector = np.zeros(dict_size)

  for word in text.split():
      if word in words_to_index:
          result_vector[words_to_index[word]] += 1
  return result_vector

In [0]:
def test_my_bag_of_words():
    words_to_index = {'hi': 0, 'you': 1, 'me': 2, 'are': 3}
    examples = ['hi how are you']
    answers = [[1, 1, 0, 1]]
    for ex, ans in zip(examples, answers):
        # Fail if any position differs (.all() would only catch a vector that is wrong everywhere).
        if (my_bag_of_words(ex, words_to_index, 4) != ans).any():
            return "Wrong answer for the case: '%s'" % ex
    return 'Basic tests are passed.'

In [21]:
print(test_my_bag_of_words())

Basic tests are passed.


In [22]:
print(INDEX_TO_WORDS)
print(WORDS_TO_INDEX)



In [0]:
from scipy import sparse as sp_sparse

In [24]:
#Now apply the implemented function to all samples (this might take up to a minute):

X_train_mybag = sp_sparse.vstack(sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_train)
X_val_mybag = sp_sparse.vstack(sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_val)
X_test_mybag = sp_sparse.vstack(sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_test)

print('X_train shape', X_train_mybag.shape)
print('X_val shape', X_val_mybag.shape)
print('X_test shape', X_test_mybag.shape)

X_train shape (100000, 5000)
X_val shape (30000, 5000)
X_test shape (20000, 5000)
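
The shapes above suggest why each row is wrapped in a sparse matrix before stacking. A rough back-of-the-envelope sketch follows; the 8 bytes per value assume float64 (the default for np.zeros), and the average of 7 non-zero entries per row and the 4-byte index size are assumptions for illustration, not figures reported by the notebook.

```python
# Shape copied from the output above.
n_rows, n_cols = 100000, 5000
bytes_per_value = 8  # assumes float64

# A dense matrix stores every cell, including the zeros.
dense_bytes = n_rows * n_cols * bytes_per_value
print('Dense size: %.1f GB' % (dense_bytes / 1e9))

# Assumption: about 7 non-zero entries per row on average.
avg_nonzero = 7
# CSR keeps one value and one column index per non-zero entry, plus row pointers.
# Assumes 8-byte values and 4-byte indices.
sparse_bytes = n_rows * avg_nonzero * (8 + 4) + (n_rows + 1) * 4
print('Approximate CSR size: %.1f MB' % (sparse_bytes / 1e6))
```

Under these assumptions the sparse form is several hundred times smaller than the dense one.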


In [25]:
row = X_train_mybag[10].toarray()[0]
non_zero_element_count = np.count_nonzero(row)

print(non_zero_element_count)

7


##TF-IDF

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [0]:
def tfidf_features(X_train, X_val, X_test):

  tfidf_vectorizer = TfidfVectorizer(min_df=5, max_df=0.9, 
                                     ngram_range=(1,2), 
                                     token_pattern=r'(\S+)')  # raw string; split on whitespace only
  
  X_train = tfidf_vectorizer.fit_transform(X_train)
  X_val = tfidf_vectorizer.transform(X_val)
  X_test = tfidf_vectorizer.transform(X_test)
    
  return X_train, X_val, X_test, tfidf_vectorizer.vocabulary_



In [0]:
X_train_tfidf, X_val_tfidf, X_test_tfidf, tfidf_vocab = tfidf_features(X_train, X_val, X_test)
tfidf_reversed_vocab = {i:word for word,i in tfidf_vocab.items()}
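
To make the weighting concrete, here is a minimal pure-Python sketch of a TF-IDF computation on a toy corpus. It assumes the scikit-learn defaults as I understand them (smoothed idf of ln((1 + n) / (1 + df)) + 1, raw term counts, L2 normalization) and ignores min_df, max_df, and bigrams; it is an illustration, not a replacement for TfidfVectorizer.

```python
import math
from collections import Counter

# Toy corpus of already-cleaned titles (made up for illustration).
docs = ["java string", "java list", "python list sort"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency.
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    # Term count times idf, then scale the vector to unit L2 length.
    tf = Counter(doc)
    weights = {t: tf[t] * idf(t) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(tokenized[0])
print(vec)
```

In the first toy title, 'string' gets a higher weight than 'java' because it occurs in fewer documents.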

In [29]:
print(tfidf_vocab['c++'])
print(tfidf_vocab['c#'])
print(tfidf_vocab['java'])

1976
1879
8265
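
The lookups above work because of the custom token_pattern. The default pattern in TfidfVectorizer, (?u)\b\w\w+\b, only keeps runs of two or more word characters, so it would drop tokens such as c++ and c#. A short sketch with the standard re module on a made-up string:

```python
import re

text = 'c++ vs c# string'

# Pattern equivalent to the vectorizer's default: 2+ word characters.
default_like = re.findall(r'(?u)\b\w\w+\b', text)
# Pattern used in this notebook: any run of non-whitespace characters.
whitespace_split = re.findall(r'(\S+)', text)

print(default_like)
print(whitespace_split)
```

With the default-style pattern the single letter c is discarded entirely, which would remove exactly the words that distinguish the c, c++, and c# tags.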


##MultiLabel classifier


As noted earlier, each example in this task can have multiple tags. To handle this kind of prediction, we transform the labels into binary form, so that each prediction is a mask of 0s and 1s. MultiLabelBinarizer from sklearn is convenient for this purpose.

In [0]:
from sklearn.preprocessing import MultiLabelBinarizer

In [31]:
mlb = MultiLabelBinarizer(classes=sorted(tags_counts.keys()))
y_train = mlb.fit_transform(y_train)
y_val = mlb.transform(y_val)  # reuse the encoding fitted on the training tags

print(y_train)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
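
The printed matrix is mostly zeros because each row marks only a few of the known tags. A minimal pure-Python sketch of the same idea, with a four-tag vocabulary chosen for illustration (the binarize and inverse helpers are hypothetical names, not scikit-learn functions):

```python
# Toy vocabulary; sorted so every tag has a stable column.
classes = sorted(['r', 'php', 'mysql', 'c#'])
index = {tag: i for i, tag in enumerate(classes)}

def binarize(tags):
    # One column per known tag; 1 where the tag is present.
    row = [0] * len(classes)
    for tag in tags:
        row[index[tag]] = 1
    return row

def inverse(row):
    # Map a 0/1 mask back to the tag names, in column order.
    return tuple(tag for tag, flag in zip(classes, row) if flag)

mask = binarize(['php', 'mysql'])
print(classes)
print(mask)
print(inverse(mask))
```

The inverse step returns tags in column order, which is why predicted labels later appear alphabetically sorted.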


In [0]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier

In [0]:
def train_classifier(X_train, y_train, penalty='l2', C=1.0):
  lr = LogisticRegression(penalty=penalty, C=C, solver='liblinear')
  ovr = OneVsRestClassifier(lr)
  ovr.fit(X_train, y_train)
  return ovr

In [0]:
classifier_mybag = train_classifier(X_train_mybag, y_train)
classifier_tfidf = train_classifier(X_train_tfidf, y_train)

In [0]:
#Now you can create predictions for the data. 
#You will need two types of predictions: labels and scores.

y_val_predicted_labels_mybag = classifier_mybag.predict(X_val_mybag)
y_val_predicted_scores_mybag = classifier_mybag.decision_function(X_val_mybag)

y_val_predicted_labels_tfidf = classifier_tfidf.predict(X_val_tfidf)
y_val_predicted_scores_tfidf = classifier_tfidf.decision_function(X_val_tfidf)
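
The two kinds of predictions are closely related. For logistic regression the decision score is the log-odds of the positive class, so labelling an example positive when its score is above 0 is the same as requiring a probability above 0.5. A small sketch of that relationship on made-up scores (this illustrates the idea and is not scikit-learn's code):

```python
import math

def sigmoid(score):
    # Convert a log-odds score into a probability.
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical decision scores for one tag across four titles.
toy_scores = [-2.3, -0.1, 0.4, 3.0]

labels_from_scores = [1 if s > 0 else 0 for s in toy_scores]
labels_from_probs = [1 if sigmoid(s) > 0.5 else 0 for s in toy_scores]

print(labels_from_scores)
print(labels_from_probs)
```

Scores keep the ranking information that the 0/1 labels discard, which is why ranking metrics such as ROC AUC are computed from scores.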

In [39]:
y_val_pred_inversed = mlb.inverse_transform(y_val_predicted_labels_tfidf)
y_val_inversed = mlb.inverse_transform(y_val)
for i in range(8):
    print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_val[i],
        ','.join(y_val_inversed[i]),
        ','.join(y_val_pred_inversed[i])
    ))

Title:	odbc_exec always fail
True labels:	php,sql
Predicted labels:	


Title:	access base classes variable within child class
True labels:	javascript
Predicted labels:	


Title:	contenttype application json required rails
True labels:	ruby,ruby-on-rails
Predicted labels:	json,ruby-on-rails


Title:	sessions sinatra used pass variable
True labels:	ruby,session
Predicted labels:	


Title:	getting error type json exist postgresql rake db migrate
True labels:	json,ruby,ruby-on-rails
Predicted labels:	ruby-on-rails


Title:	library found
True labels:	c++,ios,iphone,xcode
Predicted labels:	


Title:	csproj file programmatic adding deleting files
True labels:	c#
Predicted labels:	


Title:	typeerror makedirs got unexpected keyword argument exists_ok
True labels:	django,python
Predicted labels:	python




In [0]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score 
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score

In [0]:
def print_evaluation_scores(y_val, predicted):
  print('Accuracy:', accuracy_score(y_val, predicted))
  print('F1-score macro:', f1_score(y_val, predicted, average='macro'))
  print('F1-score micro:', f1_score(y_val, predicted, average='micro'))
  print('F1-score weighted:', f1_score(y_val, predicted, average='weighted'))
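  # Note: the 'Precision' lines below report average_precision_score, a summary of the
  # precision-recall curve, computed here on hard 0/1 labels; they are not plain precision.
  print('Precision macro:', average_precision_score(y_val, predicted, average='macro'))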
  print('Precision micro:', average_precision_score(y_val, predicted, average='micro'))
  print('Precision weighted:', average_precision_score(y_val, predicted, average='weighted'))


In [42]:
print('Bag-of-words')
print_evaluation_scores(y_val, y_val_predicted_labels_mybag)
print('Tfidf')
print_evaluation_scores(y_val, y_val_predicted_labels_tfidf)

Bag-of-words
Accuracy: 0.358
F1-score macro: 0.5047325582597497
F1-score micro: 0.6710820449370445
F1-score weighted: 0.6486950381244107
Precision macro: 0.34458812912520126
Precision micro: 0.4812849070834009
Precision weighted: 0.5108520393587743
Tfidf
Accuracy: 0.33393333333333336
F1-score macro: 0.44570945215918634
F1-score micro: 0.6418233967551946
F1-score weighted: 0.6143634328155098
Precision macro: 0.3020320489939477
Precision micro: 0.4570020540292232
Precision weighted: 0.4851114604464971
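
The gap between accuracy and the F1 scores is easier to read with a tiny worked example. For multilabel input, accuracy_score counts a row as correct only when every label matches; micro F1 pools all label decisions; macro F1 averages the per-label F1 values. The sketch below recomputes these by hand on made-up data (it treats an undefined F1 as 0):

```python
# Toy data: 3 titles, 3 tags.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Per-label counts of true positives, false positives, false negatives.
counts = []
for j in range(3):
    tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
    counts.append((tp, fp, fn))

subset_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1(*c) for c in counts) / len(counts)
micro_f1 = f1(*[sum(col) for col in zip(*counts)])

print(subset_accuracy, macro_f1, micro_f1)
```

One missed tag makes a whole row count as wrong for accuracy, while the F1 scores still credit the tags that were predicted correctly.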


In [0]:
from sklearn.metrics import roc_auc_score
%matplotlib inline

In [55]:
n_classes = len(tags_counts)
roc_auc_score(y_val, y_val_predicted_scores_mybag)

0.9408409212102764

In [56]:
roc_auc_score(y_val, y_val_predicted_scores_tfidf)

0.9481824423737995

In [57]:
for penalty in ('l1', 'l2'):
    for C in (0.1, 0.6, 1, 3):
        print('Penalty:', penalty, 'C=', C)
        classifier_mybag = train_classifier(X_train_mybag, y_train, penalty, C)
        classifier_tfidf = train_classifier(X_train_tfidf, y_train, penalty, C)
        y_val_predicted_labels_mybag = classifier_mybag.predict(X_val_mybag)

        y_val_predicted_labels_tfidf = classifier_tfidf.predict(X_val_tfidf)
        print('Bag-of-words')
        print('F1-score weighted:', f1_score(y_val, y_val_predicted_labels_mybag, average='weighted'))
        print('Tfidf')
        print('F1-score weighted:', f1_score(y_val, y_val_predicted_labels_tfidf, average='weighted'))

Penalty: l1 C= 0.1
Bag-of-words
F1-score weighted: 0.6116000654698222
Tfidf
F1-score weighted: 0.5664251398311194
Penalty: l1 C= 0.6
Bag-of-words
F1-score weighted: 0.6521164770726211
Tfidf
F1-score weighted: 0.641622122067735
Penalty: l1 C= 1
Bag-of-words
F1-score weighted: 0.6561055351063915
Tfidf
F1-score weighted: 0.6524175635735291
Penalty: l1 C= 3
Bag-of-words
F1-score weighted: 0.6581916700247518
Tfidf
F1-score weighted: 0.6632622106009176
Penalty: l2 C= 0.1
Bag-of-words
F1-score weighted: 0.5919941381102238
Tfidf
F1-score weighted: 0.3922289028503371
Penalty: l2 C= 0.6
Bag-of-words
F1-score weighted: 0.6424203076969988
Tfidf
F1-score weighted: 0.5872068920146579
Penalty: l2 C= 1
Bag-of-words
F1-score weighted: 0.6486950381244107
Tfidf
F1-score weighted: 0.6143634328155098
Penalty: l2 C= 3
Bag-of-words
F1-score weighted: 0.6546506903323904
Tfidf
F1-score weighted: 0.645291460191507
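
Rather than reading the best setting off the printed log, the scores can be collected in a dictionary and the maximum picked programmatically. The sketch below uses the TF-IDF values copied from the output above; the dictionary name tfidf_f1 is an assumption for illustration.

```python
# Weighted F1 on the validation set for the TF-IDF features,
# keyed by (penalty, C); values copied from the printed output.
tfidf_f1 = {
    ('l1', 0.1): 0.5664251398311194,
    ('l1', 0.6): 0.641622122067735,
    ('l1', 1): 0.6524175635735291,
    ('l1', 3): 0.6632622106009176,
    ('l2', 0.1): 0.3922289028503371,
    ('l2', 0.6): 0.5872068920146579,
    ('l2', 1): 0.6143634328155098,
    ('l2', 3): 0.645291460191507,
}

# Pick the key with the highest score.
best_params = max(tfidf_f1, key=tfidf_f1.get)
print(best_params, tfidf_f1[best_params])
```

In the real loop, the same dictionary could be filled inside the two for statements instead of printing each score.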


In [0]:
classifier_tfidf = train_classifier(X_train_tfidf, y_train, penalty='l1', C=3)

In [64]:
test_predictions = classifier_tfidf.predict(X_test_tfidf)
test_pred_inversed = mlb.inverse_transform(test_predictions)

test_predictions_for_submission = '\n'.join('%i\t%s' % (i, ','.join(row)) for i, row in enumerate(test_pred_inversed))

print(test_predictions_for_submission)

0	mysql,php
1	html,javascript,jquery
2	
3	javascript,jquery
4	android,java
5	parsing,php,xml
6	json,php
7	java,swing
8	python
9	html
10	jquery
11	r
12	php
13	ruby-on-rails,ruby-on-rails-3
14	c#
15	python
16	c++
17	ajax,html,javascript,jquery,ruby-on-rails
18	
19	c,linux,sockets
20	python
21	pandas,python
22	c++,multithreading
23	
24	php,wordpress
25	arrays,c++
26	ruby,ruby-on-rails
27	c#,wpf
28	python
29	r
30	html,javascript,jquery
31	c#
32	html,javascript
33	python
34	hibernate,java,spring
35	
36	c#,wpf,xaml
37	javascript
38	php
39	java
40	java,sockets
41	c#
42	javascript,jquery
43	eclipse,java
44	c#
45	php
46	
47	
48	
49	c++,eclipse
50	javascript,jquery
51	c#
52	arrays,c++
53	
54	google-maps,javascript
55	
56	python
57	c#
58	ios,javascript,objective-c
59	dom,html,javascript
60	java
61	date,javascript
62	c#
63	
64	django,python
65	c,python
66	java,string
67	file,python,string
68	
69	javascript
70	javascript,jquery
71	c++
72	python
73	
74	python
75	ajax,php
76	
77	
78	c#
79	php
80	html

In [0]:
def print_words_for_tag(classifier, tag, tags_classes, index_to_words, all_words):
  print('Tag:\t{}'.format(tag))

  coef = classifier.coef_[tags_classes.index(tag)]
    
  top_positive_words = [index_to_words[idx] for idx in coef.argsort()[-1:-6:-1]]  # top-5 words, largest coefficient first
  top_negative_words = [index_to_words[idx] for idx in coef.argsort()[:5]]  # bottom-5 words, most negative coefficient first
  print('Top positive words:\t{}'.format(', '.join(top_positive_words)))
  print('Top negative words:\t{}\n'.format(', '.join(top_negative_words)))


In [66]:
print_words_for_tag(classifier_tfidf, 'c', mlb.classes, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier_tfidf, 'c++', mlb.classes, tfidf_reversed_vocab, ALL_WORDS)
print_words_for_tag(classifier_tfidf, 'linux', mlb.classes, tfidf_reversed_vocab, ALL_WORDS)

Tag:	c
Top positive words:	c, malloc, scanf, fscanf, c++ java
Top negative words:	php, begin, javascript, java, python

Tag:	c++
Top positive words:	c++, qt, stdstring, boost, stl
Top negative words:	php, java, c++ stl, javascript, jquery

Tag:	linux
Top positive words:	linux, kernel space, system call, dlopen, killed
Top negative words:	aspnet, nokogiri, codeigniter, javascript, c#

