# <u>StackOverflow Tag Predictor</u>
StackOverflow lets users post questions that other users can answer. The site uses tags to organize questions effectively. Here we predict the tags for a given question title. Tags such as C, C++, and Python are widely used.

In [1]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from ast import literal_eval
import pandas as pd
import numpy as np
from tqdm import tqdm
import re
from collections import defaultdict
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

[nltk_data] Downloading package stopwords to /content/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### <u>Data loading</u>

In [0]:
# read a tab-separated data file into a DataFrame
def load_data(dirname):
    # load the data file (dirname is the path to a .tsv file)
    data = pd.read_csv(dirname, sep='\t')
    # parse the string representation of each tag list into a Python list
    data['tags'] = data['tags'].apply(literal_eval)
    return data
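
The `tags` column is presumably stored in the TSV as the text form of a Python list, which is why `load_data` applies `literal_eval` to it. A minimal sketch of that conversion on a made-up value (the sample string is an assumption, not a row from the dataset):

```python
from ast import literal_eval

# Hypothetical raw cell value, as it might appear in the TSV file
raw_tags = "['c#', 'mysql', 'winforms']"

# literal_eval parses Python literals (lists, strings, numbers, ...)
# without executing arbitrary code, unlike eval()
parsed_tags = literal_eval(raw_tags)

print(type(raw_tags).__name__, '->', type(parsed_tags).__name__)  # str -> list
print(parsed_tags)
```

Without this step each row would hold a single string, and iterating over it later would yield characters rather than tags.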

In [0]:
# load training and validation data
train_data = load_data('drive/Colab Notebooks/dataset/train.tsv')
val_data = load_data('drive/Colab Notebooks/dataset/validation.tsv')

In [64]:
train_data[45:50]

Unnamed: 0,title,tags
45,Apache POI: change page format for Excel works...,"[java, excel]"
46,Python Tkinter indeterminate progress bar not ...,[python]
47,how to store image path in mysql database usin...,"[c#, mysql, winforms]"
48,jdbc connection error: not associated with a m...,[java]
49,How to use multiple tables in one model in yii2,"[php, mysql]"


In [0]:
# training data
X_train = train_data['title'].values 
y_train = train_data['tags'].values
# validation data
X_val = val_data['title'].values
y_val = val_data['tags'].values

In [13]:
print(X_train.shape)
print(y_train.shape)
print(X_val.shape)
print(y_val.shape)

(100000,)
(100000,)
(30000,)
(30000,)


### <u>Text Preprocessing</u>
We lowercase the text, remove punctuation, unnecessary whitespace, and other special characters, and drop English stopwords.

In [0]:
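# preprocess text
STOPWORDS = set(stopwords.words('english'))  # built once, not on every call

def preprocess_data(text):
    # convert to lowercase
    text = text.lower()
    # replace separator punctuation with spaces
    text = re.sub(r'[/(){}\[\]\|@,;]', ' ', text)
    # remove every character except digits, lowercase letters, space, '#', '+', '_'
    text = re.sub(r'[^0-9a-z #+_]', '', text)
    # drop stopwords; split/join also collapses repeated whitespace
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)
    return text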

In [0]:
# preprocess the data
X_train = [preprocess_data(text) for text in X_train]
X_val = [preprocess_data(text) for text in X_val]
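
To make the effect of the two regular expressions concrete, here is a small self-contained sketch of the same steps. The function name `preprocess_demo`, the tiny `DEMO_STOPWORDS` set, and the sample titles are assumptions for illustration; the notebook itself uses NLTK's full English stopword list.

```python
import re

# Small hand-picked stopword set used only for this illustration
DEMO_STOPWORDS = {'how', 'to', 'in', 'a', 'the', 'with', 'is'}

def preprocess_demo(text):
    text = text.lower()
    # separators become spaces so that neighbouring words stay apart
    text = re.sub(r'[/(){}\[\]|@,;]', ' ', text)
    # drop everything except digits, lowercase letters, space, '#', '+', '_'
    text = re.sub(r'[^0-9a-z #+_]', '', text)
    return ' '.join(w for w in text.split() if w not in DEMO_STOPWORDS)

print(preprocess_demo('How to parse JSON (nested) in C#?'))
# parse json nested c#
print(preprocess_demo('Sorting a std::vector with C++11 lambdas'))
# sorting stdvector c++11 lambdas
```

Note that `#` and `+` survive, so language names such as `c#` and `c++` stay intact, while characters like `:` are deleted without inserting a space, which merges `std::vector` into `stdvector`.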

Find word and tag frequencies

In [0]:
def compute_frequency(X_train, y_train):
    # dictionary of all tags with their frequency.
    tag_counts = defaultdict(int)
    # dictionary of all words with their frequency.
    word_counts = defaultdict(int)

    # find tag counts
    for _,tags in tqdm(enumerate(y_train)):
        for tag in tags:
            #print(tag)
            tag_counts[tag] += 1

    # for words
    for _,senten in tqdm(enumerate(X_train)):
        for word in senten.split():
            word_counts[word] += 1
    
    return word_counts, tag_counts

In [8]:
word_counts, tag_counts = compute_frequency(X_train, y_train)

100000it [00:00, 1006728.80it/s]
100000it [00:00, 419402.72it/s]


We will create a vocabulary dictionary of the top **N** words in the training data. We need two mappings:<br>
1) Words to index<br>
2) Index to words

In [0]:
# for creating word to index and vice versa mappings
def create_vocabulary_mappings(X_train, word_counts, DICT_SIZE=4500):
    # word to index mapping
    word_to_idx = {word:idx for idx,(word,f) in enumerate(
                sorted(word_counts.items(), key=lambda v:v[1], reverse=True)[:DICT_SIZE])}
    # reverse index to word mapping
    idx_to_word= {word_to_idx[word]:word for word in word_to_idx.keys()}
    
    return word_to_idx, idx_to_word

In [0]:
DICT_SIZE = 4500
word_to_idx, idx_to_word = create_vocabulary_mappings(X_train, word_counts, DICT_SIZE=DICT_SIZE)
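
The two mappings can be illustrated on a toy corpus. This sketch uses `collections.Counter.most_common` instead of the explicit sort above; the titles, the `_demo` names, and `TOP_N` are made up for the example.

```python
from collections import Counter

# Toy preprocessed titles (made up for illustration)
titles = ['python list sort', 'java list', 'python dict', 'python']

word_counts_demo = Counter(word for title in titles for word in title.split())

TOP_N = 2  # stands in for DICT_SIZE
# most_common returns (word, count) pairs, most frequent first
word_to_idx_demo = {
    word: idx for idx, (word, _) in enumerate(word_counts_demo.most_common(TOP_N))
}
idx_to_word_demo = {idx: word for word, idx in word_to_idx_demo.items()}

print(word_to_idx_demo)  # {'python': 0, 'list': 1}
print(idx_to_word_demo)  # {0: 'python', 1: 'list'}
```

Words outside the top `TOP_N` (here `sort`, `java`, `dict`) get no index and are ignored by the feature extraction that follows.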

We will now try two feature representations: Bag of Words (BOW) and TF-IDF. First we create a function for **BOW**, using the 4500 most common words.

### Bag of Words

In [0]:
# for creating BOW representation
def create_bag_of_words(text, word_to_idx, DICT_SIZE):
    # initial vector for holding the features
    feature_vector = np.zeros(DICT_SIZE)
    
    # update the word frequencies
    for word in text.split():
        if word in word_to_idx:
            feature_vector[word_to_idx[word]] += 1
    
    return feature_vector

In [12]:
# create the bag-of-words feature vectors
# each dense vector is converted to a CSR sparse matrix and the rows are stacked
X_train_bow = sparse.vstack([sparse.csr_matrix(create_bag_of_words(text, word_to_idx, DICT_SIZE)) for text in X_train])
X_val_bow = sparse.vstack([sparse.csr_matrix(create_bag_of_words(text, word_to_idx, DICT_SIZE)) for text in X_val])

print('X_train shape ', X_train_bow.shape)
print('X_val shape ', X_val_bow.shape)

X_train shape  (100000, 4500)
X_val shape  (30000, 4500)
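
Building one dense vector per title and stacking the results works, but it allocates a full `DICT_SIZE`-length array for every row. As a possible alternative (not what the notebook does), the sparse matrix can be assembled directly from (row, column) index pairs. The helper `bow_matrix_demo` and the toy inputs are hypothetical names for this sketch.

```python
import numpy as np
from scipy import sparse

def bow_matrix_demo(texts, word_to_idx):
    """Build a CSR bag-of-words matrix from (row, column) index pairs."""
    rows, cols = [], []
    for row, text in enumerate(texts):
        for word in text.split():
            col = word_to_idx.get(word)
            if col is not None:  # skip out-of-vocabulary words
                rows.append(row)
                cols.append(col)
    data = np.ones(len(rows), dtype=np.float64)
    # duplicate (row, col) pairs are summed, which yields the word counts
    return sparse.csr_matrix((data, (rows, cols)),
                             shape=(len(texts), len(word_to_idx)))

vocab = {'python': 0, 'list': 1, 'java': 2}   # toy vocabulary
texts = ['python list list', 'java unknown']  # toy titles
bow = bow_matrix_demo(texts, vocab)
print(bow.toarray())
# [[1. 2. 0.]
#  [0. 0. 1.]]
```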


### TF-IDF

In [0]:
# creates tf-idf feature vector
def create_tfidf_features(X_train, X_val):
    # fit for training data
    tfidf = TfidfVectorizer(ngram_range=(1,2), max_df=0.9, min_df=5, token_pattern=r'(\S+)')
    # apply for training and validation set
    X_train = tfidf.fit_transform(X_train)
    X_val = tfidf.transform(X_val)
    
    return X_train, X_val, tfidf.vocabulary_

In [0]:
X_train_tfidf, X_val_tfidf, tfidf_vocab = create_tfidf_features(X_train, X_val)
tfidf_reverse_vocab = {i:word for word,i in tfidf_vocab.items()}
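
The custom `token_pattern` matters for this dataset because the default pattern only keeps runs of two or more word characters, which discards tokens such as `c++` and `c#`. A small sketch on made-up titles (the `docs` list and variable names are assumptions) shows the difference:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['c++ vector sort', 'c# list sort', 'python list']  # toy titles

# Default token pattern: tokens are runs of 2+ word characters
default_vec = TfidfVectorizer()
default_vec.fit(docs)

# Whitespace-delimited tokens, as in create_tfidf_features
whitespace_vec = TfidfVectorizer(token_pattern=r'(\S+)')
whitespace_vec.fit(docs)

print(sorted(default_vec.vocabulary_))
# ['list', 'python', 'sort', 'vector']
print(sorted(whitespace_vec.vocabulary_))
# ['c#', 'c++', 'list', 'python', 'sort', 'vector']
```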

## <u>Classifier</u>
A question can have multiple tags, so we represent the output as a binary vector with one entry per tag, where 1 means the tag is present and 0 means it is absent. We use MultiLabelBinarizer from scikit-learn for this.

In [0]:
# create an instance
mlb_object = MultiLabelBinarizer(classes=sorted(tag_counts.keys()))
# transform the tags 
y_train = mlb_object.fit_transform(y_train)
y_val = mlb_object.transform(y_val)
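
As a quick illustration of the encoding, the sketch below binarizes three made-up tag lists and converts them back. The `_demo` names and the tag lists are assumptions for the example.

```python
from sklearn.preprocessing import MultiLabelBinarizer

tags_demo = [['java', 'excel'], ['python'], ['python', 'excel']]  # toy tag lists

# Fixing the class order up front keeps the column layout predictable
mlb_demo = MultiLabelBinarizer(classes=['excel', 'java', 'python'])
y_demo = mlb_demo.fit_transform(tags_demo)

print(mlb_demo.classes_)   # column order: excel, java, python
print(y_demo)
# [[1 1 0]
#  [0 0 1]
#  [1 0 1]]
print(mlb_demo.inverse_transform(y_demo))
# [('excel', 'java'), ('python',), ('excel', 'python')]
```

Each column corresponds to one tag, so every row can contain any number of ones.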

## <u>Training</u>
We will experiment with different classifiers, using the One-vs-All (one-vs-rest) approach: one binary classifier is trained per tag.

In [0]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier

In [0]:
import pickle

In [0]:
# define the classifier and fit it to the training data
def train_classifier(X_train, y_train, inner_clf):
    # define the classifier
    clf = OneVsRestClassifier(inner_clf)
    # train it
    clf.fit(X_train, y_train)
    return clf
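
To show what `OneVsRestClassifier` does with a binary label matrix, here is a minimal sketch on toy data. All values and the `_demo` names are made up; the real features are the sparse BOW and TF-IDF matrices built earlier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data: 6 samples, 2 features, 2 labels (all values made up)
X_demo = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 1]], dtype=float)
Y_demo = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 1]])

ovr_demo = OneVsRestClassifier(LogisticRegression())
ovr_demo.fit(X_demo, Y_demo)

print(len(ovr_demo.estimators_))        # one fitted estimator per label column
print(ovr_demo.predict(X_demo).shape)   # same shape as Y_demo
```

Because each label gets its own independent classifier, a prediction row may contain several ones or none at all.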

In [0]:
# classifiers for one vs all
ridge_clf = RidgeClassifier()
rf_clf = RandomForestClassifier(n_estimators = 10, max_depth=10, n_jobs=-1, verbose=0)
lr_clf = LogisticRegression(penalty="l2", C=1)
svm_clf = LinearSVC(penalty="l2", C=0.06)

In [0]:
# for bag of words
clf_bow_rf = train_classifier(X_train_bow, y_train, rf_clf)
# for tf-idf
clf_tfidf_rf = train_classifier(X_train_tfidf, y_train, rf_clf)

In [0]:
# for bag of words
clf_bow_lr = train_classifier(X_train_bow, y_train, lr_clf)
# for tf-idf
clf_tfidf_lr = train_classifier(X_train_tfidf, y_train, lr_clf)


In [0]:
# for bag of words
clf_bow_svm = train_classifier(X_train_bow, y_train, svm_clf)
# for tf-idf
clf_tfidf_svm = train_classifier(X_train_tfidf, y_train, svm_clf)


In [0]:
# for bag of words
clf_bow_ridge = train_classifier(X_train_bow, y_train, ridge_clf)
# for tf-idf
clf_tfidf_ridge = train_classifier(X_train_tfidf, y_train, ridge_clf)

## <u>Evaluation metrics</u>

In [0]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score 
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score

In [0]:
# prints subset accuracy (in %), then the macro-averaged average precision score
def evaluate_classifiers(y_val, predicted):
    print('Accuracy: '+ str(accuracy_score(y_val, predicted)*100))
    print(average_precision_score(y_val, predicted))
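
For multilabel indicator inputs, `accuracy_score` computes subset accuracy: a sample counts as correct only if its whole tag set matches. The sketch below contrasts it with a micro-averaged F1 score on made-up labels. Reporting F1 here is a suggestion, not something the notebook does, although `f1_score` is already imported.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: the first sample misses one of its two tags
y_true_demo = np.array([[1, 1, 0], [0, 0, 1], [1, 0, 0]])
y_pred_demo = np.array([[1, 0, 0], [0, 0, 1], [1, 0, 0]])

# Subset accuracy: 2 of 3 rows match exactly
subset_acc = accuracy_score(y_true_demo, y_pred_demo)
# Micro F1 pools all label decisions: TP=3, FP=0, FN=1, so F1 = 6/7
micro_f1 = f1_score(y_true_demo, y_pred_demo, average='micro')

print(subset_acc, micro_f1)
```

A partially correct prediction lowers subset accuracy as much as a completely wrong one, which is worth keeping in mind when reading the accuracy figures reported for each classifier.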

In [0]:
# make predictions 
pred_val_bow_ridge = clf_bow_ridge.predict(X_val_bow)
pred_val_tfidf_ridge = clf_tfidf_ridge.predict(X_val_tfidf)

In [0]:
pred_val_bow_rf = clf_bow_rf.predict(X_val_bow)
pred_val_tfidf_rf = clf_tfidf_rf.predict(X_val_tfidf)


In [0]:
pred_val_bow_lr = clf_bow_lr.predict(X_val_bow)
pred_val_tfidf_lr = clf_tfidf_lr.predict(X_val_tfidf)


In [0]:

pred_val_bow_svm = clf_bow_svm.predict(X_val_bow)
pred_val_tfidf_svm = clf_tfidf_svm.predict(X_val_tfidf)


In [53]:
print('SVM')
print('Bag of words')
evaluate_classifiers(y_val, pred_val_bow_svm)
print('Tf-IDF')
evaluate_classifiers(y_val, pred_val_tfidf_svm)

SVM
Bag of words
Accuracy: 36.93333333333334
0.35772793986296214
Tf-IDF
Accuracy: 37.63
0.3605364578590816


In [50]:
print('Ridge')
print('Bag of words')
evaluate_classifiers(y_val, pred_val_bow_ridge)
print('Tf-IDF')
evaluate_classifiers(y_val, pred_val_tfidf_ridge)

Ridge
Bag of words
Accuracy: 34.696666666666665
0.3470226517064863
Tf-IDF
Accuracy: 36.199999999999996
0.35857002371520363


In [52]:
print('Logistic Regression')
print('Bag of words')
evaluate_classifiers(y_val, pred_val_bow_lr)
print('Tf-IDF')
evaluate_classifiers(y_val, pred_val_tfidf_lr)

Logistic Regression
Bag of words
Accuracy: 35.733333333333334
0.3438171474531022
Tf-IDF
Accuracy: 33.39333333333334
0.30203064788106676


In [44]:
print('Random Forest')
print('Bag of words')
evaluate_classifiers(y_val, pred_val_bow_rf)
print('Tf-IDF')
evaluate_classifiers(y_val, pred_val_tfidf_rf)

Random Forest
Bag of words
Accuracy: 0.18333333333333332
0.020434962476275707
Tf-IDF
Accuracy: 0.11666666666666668
0.020050132944348178


## Some Validation results

### For TF-IDF

In [63]:
# convert the predictions back to the tag names they represent
pred_val_inverse = mlb_object.inverse_transform(pred_val_tfidf_svm)
# convert the true labels back to tag names
y_val_inverse = mlb_object.inverse_transform(y_val)

for i in range(10):
    print('Query:\t' + str(X_val[i]))
    print('True tags:\t' + str(y_val_inverse[i]))
    print('Predicted tags:\t' + str(pred_val_inverse[i]))
    print()

Query:	properly return reference class member
True tags:	('eclipse', 'hibernate', 'java', 'web-services')
Predicted tags:	('hibernate', 'java', 'mysql', 'web-services')

Query:	temporarily stop form events either raised handled
True tags:	('ios', 'swift')
Predicted tags:	()

Query:	write program accept number command line display sum digits number using recursive funcion
True tags:	('javascript', 'php')
Predicted tags:	('function', 'javascript', 'php')

Query:	void invalid type variable paint
True tags:	('ruby-on-rails',)
Predicted tags:	('ruby-on-rails',)

Query:	making copy azure database
True tags:	('c#', 'database', 'wpf')
Predicted tags:	('c#', 'wpf')

Query:	redux reducer called
True tags:	('javascript',)
Predicted tags:	('javascript',)

Query:	trying update sql server ce failing
True tags:	('java',)
Predicted tags:	('java',)

Query:	solve orgxmlsaxsaxparseexception linenumber 1 columnnumber 1 content allowed prolog
True tags:	('laravel', 'php')
Predicted tags:	('laravel', 'php')

### For Bag of Words (BOW)

In [62]:
# convert the predictions back to the tag names they represent
pred_val_inverse = mlb_object.inverse_transform(pred_val_bow_svm)
# convert the true labels back to tag names
y_val_inverse = mlb_object.inverse_transform(y_val)

for i in range(10):
    print('Query:\t' + str(X_val[i]))
    print('True tags:\t' + str(y_val_inverse[i]))
    print('Predicted tags:\t' + str(pred_val_inverse[i]))
    print()

Query:	create swift animation class flips two images
True tags:	('ios', 'swift')
Predicted tags:	('swift',)

Query:	setting default activity android
True tags:	('android', 'java')
Predicted tags:	('android', 'java')

Query:	rspec passing get params put action
True tags:	('ruby-on-rails',)
Predicted tags:	('ruby-on-rails',)

Query:	retrieving sql server output variables c#
True tags:	('asp.net', 'c#')
Predicted tags:	('c#',)

Query:	get value property corresponding sqlalchemy instrumentedattribute
True tags:	('python',)
Predicted tags:	('python',)

Query:	way get guice grapher work
True tags:	('java',)
Predicted tags:	('java',)

Query:	get access violation reading location error code
True tags:	('c++',)
Predicted tags:	('c++',)

Query:	get computed elements table django queryset
True tags:	('django',)
Predicted tags:	('django',)

Query:	add google maps marker firebase data
True tags:	('android', 'google-maps', 'java')
Predicted tags:	('google-maps', 'javascript')

Query:	return permutat