# Objective

Build a model that automatically predicts tags for a given Stack Exchange question from the text of the question.
![Stack Exchange logo](https://cdn.sstatic.net/Sites/stackoverflow/company/img/logos/se/se-logo.svg?v=d29f0785ebb7)

__Dataset Specs__: Over 85,000 questions

__License__

All Stack Exchange user contributions are licensed under [CC-BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) with [attribution required](http://blog.stackoverflow.com/2009/06/attribution-required/).

<br>

***

In [1]:
# mount Google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Steps
1. Loading Data
2. Text Cleaning
3. Merge Tags with Questions
4. Dataset Preparation
5. Feature Engineering using TF-IDF
6. Model Building
    1. Naive Bayes
    2. Logistic Regression
    3. Model Building Summary

7. Final Question Tagging Pipeline



In [2]:
# Importing Required Libraries

# for string matching
import re 

# for handling data
import pandas as pd

# for numerical computing
import numpy as np

# for handling html data
from bs4 import BeautifulSoup

# for NLP related tasks
import spacy
nlp=spacy.load('en_core_web_sm')

# Loading Data

In [3]:
# load questions
questions_df = pd.read_csv('/content/drive/MyDrive/Questions.csv',encoding='latin-1')
print('Shape=>',questions_df.shape)
questions_df.head()

Shape=> (85085, 6)


Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body
0,6,5.0,2010-07-19T19:14:44Z,272,The Two Cultures: statistics vs. machine learn...,"<p>Last year, I read a blog post from <a href=..."
1,21,59.0,2010-07-19T19:24:36Z,4,Forecasting demographic census,<p>What are some of the ways to forecast demog...
2,22,66.0,2010-07-19T19:25:39Z,208,Bayesian and frequentist reasoning in plain En...,<p>How would you describe in plain English the...
3,31,13.0,2010-07-19T19:28:44Z,138,What is the meaning of p values and t values i...,<p>After taking a statistics course and then t...
4,36,8.0,2010-07-19T19:31:47Z,58,Examples for teaching: Correlation does not me...,"<p>There is an old saying: ""Correlation does n..."


1. Id: Question ID
2. OwnerUserId: User ID
3. CreationDate: Date of posting question
4. Score: Net vote score (upvotes minus downvotes) of the question
5. Title: Title of the question
6. Body: Text body of the question

In [5]:
# load tags
tags_df = pd.read_csv('/content/drive/MyDrive/Tags.csv')
print('Shape=>',tags_df.shape)
tags_df.head()

Shape=> (244228, 2)


Unnamed: 0,Id,Tag
0,1,bayesian
1,1,prior
2,1,elicitation
3,2,distributions
4,2,normality


# Text Cleaning

Let's define a function to clean the text data.

In [6]:
def cleaner(text):

  # strip HTML tags
  text = BeautifulSoup(text,"html").get_text()
  
  # keep alphabetic characters only; everything else becomes a space
  text = re.sub("[^a-zA-Z]", " ", text)

  # convert text to lower case
  text = text.lower()
  
  # collapse runs of whitespace into a single space
  text = re.sub(r"\s+", " ", text)

  # create a spaCy doc object
  doc = nlp(text)

  # remove stopwords and lemmatize the remaining tokens
  tokens = [token.lemma_ for token in doc if not token.is_stop]

  return " ".join(tokens)

In [7]:
# Pre-processing Questions
questions_df['cleaned_text'] = questions_df['Body'].apply(cleaner)

In [8]:
questions_df['Body'][1]

"<p>What are some of the ways to forecast demographic census with some validation and calibration techniques?</p>\n\n<p>Some of the concerns:</p>\n\n<ul>\n<li>Census blocks vary in sizes as rural\nareas are a lot larger than condensed\nurban areas. Is there a need to account for the area size difference?</li>\n<li>if let's say I have census data\ndating back to 4 - 5 census periods,\nhow far can i forecast it into the\nfuture?</li>\n<li>if some of the census zone change\nlightly in boundaries, how can i\naccount for that change?</li>\n<li>What are the methods to validate\ncensus forecasts? for example, if i\nhave data for existing 5 census\nperiods, should I model the first 3\nand test it on the latter two? or is\nthere another way?</li>\n<li>what's the state of practice in\nforecasting census data, and what are\nsome of the state of the art methods?</li>\n</ul>\n"

In [9]:
questions_df['cleaned_text'][1]

'way forecast demographic census validation calibration technique concern census block vary size rural area lot large condense urban area need account area size difference let s census datum date census period far forecast future census zone change lightly boundary account change method validate census forecast example datum exist census period model test way s state practice forecast census datum state art method'
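
The cleaned text above contains fragments such as `let s`, which come from the regex step replacing the apostrophe with a space. The sketch below isolates just the regex and lower-casing steps so that effect can be seen directly. It is a simplified stand-in, not the full `cleaner`: it skips the HTML stripping and the spaCy stopword/lemmatization step, `basic_clean` is a hypothetical helper name, and the trailing `.strip()` is an addition.

```python
import re

def basic_clean(text):
    # Replace every non-letter (digits, punctuation, apostrophes) with a space
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # Lower-case the text
    text = text.lower()
    # Collapse runs of whitespace and trim the ends (the trim is an addition)
    return re.sub(r"\s+", " ", text).strip()

sample = "if let's say I have census data dating back to 4 - 5 census periods"
cleaned = basic_clean(sample)
print(cleaned)
```

Digits are dropped entirely and `let's` is split into `let` and `s`, which is why a stray `s` token survives into the lemmatized output.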

# Merge Tags with Questions

Let's now explore the tags data.

In [10]:
tags_df.head()

Unnamed: 0,Id,Tag
0,1,bayesian
1,1,prior
2,1,elicitation
3,2,distributions
4,2,normality


In [11]:
# count of unique tags
len(tags_df['Tag'].unique())

1315

In [12]:
tags_df['Tag'].value_counts()

r                            13236
regression                   10959
machine-learning              6089
time-series                   5559
probability                   4217
                             ...  
gmm                              1
matconvnet                       1
mcar                             1
hierarchical-softmax             1
american-community-survey        1
Name: Tag, Length: 1315, dtype: int64

In [13]:
# remove "-" from the tags
tags_df['Tag']= tags_df['Tag'].apply(lambda x:re.sub("-"," ",x))

In [14]:
# group tags by question Id (select the 'Tag' column before applying)
tags_df = tags_df.groupby('Id')['Tag'].apply(lambda x:x.values).reset_index(name='tags')
tags_df.head()

Unnamed: 0,Id,tags
0,1,"[bayesian, prior, elicitation]"
1,2,"[distributions, normality]"
2,3,"[software, open source]"
3,4,"[distributions, statistical significance]"
4,6,[machine learning]
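
For readers less familiar with `groupby`, the grouping step amounts to collecting all tags that share a question Id. A minimal pure-Python sketch of the same idea follows; the `rows` list is a hand-copied excerpt of the first five tag rows shown earlier, and `tags_by_id` is a hypothetical name used only for illustration.

```python
from collections import defaultdict

# (Id, Tag) pairs copied from the head of the tags table
rows = [
    (1, "bayesian"),
    (1, "prior"),
    (1, "elicitation"),
    (2, "distributions"),
    (2, "normality"),
]

# Collect the tags of each question Id in order of appearance
tags_by_id = defaultdict(list)
for question_id, tag in rows:
    tags_by_id[question_id].append(tag)

print(dict(tags_by_id))
```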


In [15]:
# merge tags and questions
df = pd.merge(questions_df,tags_df,how='inner',on='Id')

In [16]:
df = df[['Id','Body','cleaned_text','tags']]
print('Shape=>',df.shape)
df.head()

Shape=> (85085, 4)


Unnamed: 0,Id,Body,cleaned_text,tags
0,6,"<p>Last year, I read a blog post from <a href=...",year read blog post brendan o connor entitled ...,[machine learning]
1,21,<p>What are some of the ways to forecast demog...,way forecast demographic census validation cal...,"[forecasting, population, census]"
2,22,<p>How would you describe in plain English the...,describe plain english characteristic distingu...,"[bayesian, frequentist]"
3,31,<p>After taking a statistics course and then t...,take statistic course try help fellow student ...,"[hypothesis testing, t test, p value, interpre..."
4,36,"<p>There is an old saying: ""Correlation does n...",old say correlation mean causation teach tend ...,"[correlation, teaching]"


There are over 85,000 unique questions and over 1,300 unique tags.

# Dataset Preparation

In [17]:
# check frequency of occurrence of each tag
freq= {}
for i in df['tags']:
  for j in i:
    if j in freq.keys():
      freq[j] = freq[j] + 1
    else:
      freq[j] = 1

Let's find out the most frequent tags.

In [18]:
# sort the dictionary in descending order
freq = dict(sorted(freq.items(), key=lambda x:x[1],reverse=True))

In [19]:
freq.items()

dict_items([('r', 13236), ('regression', 10959), ('machine learning', 6089), ('time series', 5559), ('probability', 4217), ('hypothesis testing', 3869), ('self study', 3732), ('distributions', 3501), ('logistic', 3316), ('classification', 2881), ('correlation', 2871), ('statistical significance', 2666), ('bayesian', 2656), ('anova', 2505), ('normal distribution', 2181), ('multiple regression', 2054), ('mixed model', 1998), ('clustering', 1952), ('neural networks', 1897), ('mathematical statistics', 1888), ('confidence interval', 1776), ('categorical data', 1703), ('generalized linear model', 1614), ('variance', 1576), ('data visualization', 1549), ('estimation', 1533), ('forecasting', 1422), ('t test', 1418), ('pca', 1395), ('sampling', 1363), ('cross validation', 1344), ('repeated measures', 1335), ('spss', 1296), ('svm', 1283), ('chi squared', 1261), ('maximum likelihood', 1209), ('predictive models', 1189), ('multivariate analysis', 1116), ('survival', 1081), ('references', 1076), (

In [21]:
# Top 20 most frequent tags
common_tags = list(freq.keys())[:20]
common_tags

['r',
 'regression',
 'machine learning',
 'time series',
 'probability',
 'hypothesis testing',
 'self study',
 'distributions',
 'logistic',
 'classification',
 'correlation',
 'statistical significance',
 'bayesian',
 'anova',
 'normal distribution',
 'multiple regression',
 'mixed model',
 'clustering',
 'neural networks',
 'mathematical statistics']
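
The counting, sorting, and top-N selection done above can also be expressed with `collections.Counter` from the standard library. This is an alternative sketch on toy data, not the code that produced the output above; `tag_lists` is a made-up stand-in for `df['tags']`.

```python
from collections import Counter

# Toy stand-in for df['tags']: one list of tags per question
tag_lists = [
    ["r", "regression"],
    ["r", "time series"],
    ["regression", "r"],
    ["bayesian"],
]

# Count how often each tag occurs across all questions
freq = Counter(tag for tags in tag_lists for tag in tags)

# most_common(n) returns the n most frequent (tag, count) pairs, highest first
top_tags = [tag for tag, _ in freq.most_common(2)]
print(top_tags)
```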

We will keep only the questions that carry at least two of the above 20 tags (the filter below requires `len(temp) > 1`), and we discard any of their tags that fall outside the top 20.

In [22]:
x=[]
y=[]

for i in range(len(df['tags'])):
  
  temp=[]
  for j in df['tags'][i]:
    if j in common_tags:
      temp.append(j)

  if len(temp) > 1:  # keep questions with at least two of the top 20 tags
    x.append(df['cleaned_text'][i])
    y.append(temp)

In [23]:
# number of questions left
len(x)

19689

In [26]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()

# Getting Labels 
y = mlb.fit_transform(y)
y.shape

(19689, 20)

In [27]:
y[0,:]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1])

In [28]:
mlb.classes_

array(['anova', 'bayesian', 'classification', 'clustering', 'correlation',
       'distributions', 'hypothesis testing', 'logistic',
       'machine learning', 'mathematical statistics', 'mixed model',
       'multiple regression', 'neural networks', 'normal distribution',
       'probability', 'r', 'regression', 'self study',
       'statistical significance', 'time series'], dtype=object)
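
Each row of `y` is a multi-hot vector: one position per tag, in the alphabetical order shown in `mlb.classes_`, with a 1 wherever the question carries that tag. The sketch below reproduces that encoding by hand on toy data to make the layout explicit. `binarize` is a hypothetical helper, not part of scikit-learn.

```python
def binarize(label_lists):
    # Sorted set of all labels, mirroring the alphabetical class order above
    classes = sorted({label for labels in label_lists for label in labels})
    index = {label: i for i, label in enumerate(classes)}

    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1  # mark each tag the question carries
        rows.append(row)
    return classes, rows

classes, rows = binarize([["r", "time series"], ["regression", "r"]])
print(classes)
print(rows)
```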

We can now split the dataset into training set and validation set. 

In [29]:
from sklearn.model_selection import train_test_split
x_tr,x_val,y_tr,y_val=train_test_split(x, y, test_size=0.2, random_state=0,shuffle=True)

In [30]:
print('x_tr:',len(x_tr),'y_tr:',len(y_tr))
print('x_val:',len(x_val),'y_val:',len(y_val))

x_tr: 15751 y_tr: 15751
x_val: 3938 y_val: 3938
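
The split sizes follow from simple arithmetic: 20% of 19,689 is 3,937.8, and the validation size printed above is 3,938, leaving 15,751 for training. The sketch below is a hand-rolled shuffle-and-split that assumes the validation count is rounded up; `simple_split` is a hypothetical helper and it does not reproduce the exact shuffling order of `train_test_split`.

```python
import math
import random

def simple_split(items, test_size=0.2, seed=0):
    # Shuffle the indices reproducibly
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)

    # Assumption: the held-out count is rounded up
    n_test = math.ceil(len(items) * test_size)

    held_out = [items[i] for i in indices[:n_test]]
    train = [items[i] for i in indices[n_test:]]
    return train, held_out

train, held_out = simple_split(list(range(19689)))
print(len(train), len(held_out))
```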


# Feature Engineering using TF-IDF

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [32]:
# initialize TFIDF
word_vectorizer = TfidfVectorizer(max_features=5000)

In [33]:
# Fitting Vectorizer on Train set
word_vectorizer.fit(x_tr)

TfidfVectorizer(max_features=5000)

In [34]:
# create TF-IDF vectors for Train Set
train_word_features = word_vectorizer.transform(x_tr)

In [35]:
train_word_features

<15751x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 702662 stored elements in Compressed Sparse Row format>

In [36]:
# create TF-IDF vectors for Validation Set
test_word_features = word_vectorizer.transform(x_val)
test_word_features

<3938x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 173367 stored elements in Compressed Sparse Row format>
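
To make the TF-IDF weighting less of a black box, here is a small pure-Python sketch. It assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization of each document vector, which is my understanding of the scikit-learn defaults; it ignores `max_features` and uses plain whitespace tokenization, so it is an illustration rather than a re-implementation of `TfidfVectorizer`. The function name `tfidf` is hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency (assumed default formula)
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts
        weights = {term: tf[term] * idf[term] for term in tf}
        # L2-normalize; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

vecs = tfidf(["census forecast census", "regression forecast"])
print(vecs[0])
```

In the first toy document, `census` gets a higher weight than `forecast` because it occurs twice and appears in only one of the two documents.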

# Model Building

## Naive Bayes

In [37]:
# Importing for modeling
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

In [38]:
# Defining Model
nb_model=OneVsRestClassifier(MultinomialNB())

In [39]:
# Training Model
nb_model.fit(train_word_features,y_tr)

OneVsRestClassifier(estimator=MultinomialNB())

In [40]:
# Make predictions for train set
train_pred_nb=nb_model.predict_proba(train_word_features)

In [41]:
train_pred_nb[:5]

array([[1.02437139e-02, 5.76674670e-03, 3.96562972e-03, 6.34662625e-03,
        2.73760331e-02, 5.40371722e-03, 2.77042415e-02, 9.48605255e-03,
        2.28828855e-02, 8.03994194e-03, 1.24810683e-02, 1.63038916e-02,
        4.68594572e-03, 2.52561149e-03, 9.83704682e-03, 7.16014907e-01,
        2.27452298e-01, 1.79155861e-02, 2.60858612e-02, 9.87267383e-01],
       [1.90195560e-02, 5.64104337e-03, 5.11723906e-02, 1.18906514e-02,
        8.77914652e-03, 6.00480550e-03, 1.87623159e-02, 5.30565809e-02,
        2.85461797e-01, 3.33650360e-03, 1.06823692e-02, 9.24720664e-03,
        1.29972966e-02, 3.43882932e-03, 9.27999282e-03, 7.03286210e-01,
        1.91248515e-01, 5.94174360e-03, 1.90205495e-02, 3.74730350e-02],
       [5.21129613e-02, 4.75731786e-04, 1.12222899e-03, 6.73545256e-04,
        6.29958090e-03, 1.30236219e-03, 7.27122817e-03, 1.85967724e-02,
        2.43749603e-03, 4.71425192e-04, 4.55835231e-01, 6.34276019e-03,
        1.66767572e-04, 6.39069225e-04, 1.15051852e-03, 8.4256

The predictions are probabilities for each of the 20 tags, so we need a threshold to convert them to 0 or 1.

Let's define a set of candidate threshold values and select the one that gives the best weighted F1-score on the train set.

In [42]:
# Function for converting probabilities into classes or tags based on a threshold value
def classify(pred_prob,threshold):
  y_pred_seq = []

  for i in pred_prob:
    temp=[]
    for j in i:
      if j>=threshold:
        temp.append(1)
      else:
        temp.append(0)
    y_pred_seq.append(temp)

  return y_pred_seq

In [43]:
# Function for finding optimum value of threshold
def optimum_threshold(actual,pred_prob):
  #define candidate threshold values
  thresholds  = np.arange(0,0.5,0.01)

  score=[]
  for value in thresholds:
    # Getting classes for each threshold
    pred_classes= classify(pred_prob,value) 
    # Getting F1-score for every threshold
    score.append(f1_score(actual,pred_classes,average="weighted"))

  return thresholds[score.index(max(score))]    

In [44]:
# Finding Optimum value
print("Optimal threshold=>",optimum_threshold(y_tr,train_pred_nb))

Optimal threshold=> 0.19


In [46]:
# Getting classes using optimum threshold
train_pred_nb_class=classify(train_pred_nb,0.19)

In [47]:
train_pred_nb_class[:5]

[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]]

In [48]:
mlb.inverse_transform(np.array(train_pred_nb_class[:5]))

[('r', 'regression', 'time series'),
 ('machine learning', 'r', 'regression'),
 ('mixed model', 'r', 'regression'),
 ('logistic', 'regression'),
 ('regression', 'self study')]
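
As a worked example of the thresholding step, the sketch below applies the 0.19 threshold to the first row of Naive Bayes probabilities and maps the result back to tag names. The probabilities are copied from the output above and rounded to three significant figures, and the class list is copied from `mlb.classes_`; the variable names are chosen for illustration only.

```python
# Tag order copied from mlb.classes_
classes = ['anova', 'bayesian', 'classification', 'clustering', 'correlation',
           'distributions', 'hypothesis testing', 'logistic',
           'machine learning', 'mathematical statistics', 'mixed model',
           'multiple regression', 'neural networks', 'normal distribution',
           'probability', 'r', 'regression', 'self study',
           'statistical significance', 'time series']

# First row of train_pred_nb, rounded
probs = [0.0102, 0.00577, 0.00397, 0.00635, 0.0274, 0.00540, 0.0277,
         0.00949, 0.0229, 0.00804, 0.0125, 0.0163, 0.00469, 0.00253,
         0.00984, 0.716, 0.227, 0.0179, 0.0261, 0.987]

threshold = 0.19

# 1 where the probability reaches the threshold, else 0
multi_hot = [1 if p >= threshold else 0 for p in probs]

# Keep the names of the tags whose flag is set
tags = tuple(name for name, flag in zip(classes, multi_hot) if flag)
print(tags)
```

Only three probabilities clear the threshold (0.716, 0.227, and 0.987), which matches the first row of `train_pred_nb_class` and the first tuple returned by `inverse_transform`.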

In [49]:
# Evaluating on Training Set
print("F1-score on Train Set:",f1_score(y_tr,train_pred_nb_class,average="weighted"))

F1-score on Train Set: 0.603040537907899


In [51]:
# Make Predictions on Validation Set
val_pred_nb=nb_model.predict_proba(test_word_features)

# Getting Classes
val_pred_nb_class=classify(val_pred_nb,0.19)

# Evaluating on Validation Set
print("F1-score on Validation Set:",f1_score(y_val,val_pred_nb_class,average="weighted"))

F1-score on Validation Set: 0.5470589813343284
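
The scores above use `average="weighted"`, which computes an F1-score per tag and averages them weighted by each tag's support (its number of true occurrences). The sketch below spells out that calculation in pure Python on toy data. `weighted_f1` is a hypothetical helper written for illustration, and treating an undefined per-tag F1 as 0 is an assumption of this sketch.

```python
def weighted_f1(y_true, y_pred):
    n_labels = len(y_true[0])
    weighted_sum = 0.0
    total_support = 0

    for j in range(n_labels):
        # Per-label confusion counts
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)

        # F1 = 2*TP / (2*TP + FP + FN); taken as 0 when undefined
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0

        support = tp + fn  # number of true occurrences of this label
        weighted_sum += f1 * support
        total_support += support

    return weighted_sum / total_support if total_support else 0.0

y_true = [[1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [0, 1], [0, 1]]
score = weighted_f1(y_true, y_pred)
print(score)
```

In this toy case the first label has F1 = 2/3 and the second has F1 = 1, each with support 2, so the weighted average is 5/6. Because frequent tags such as `r` and `regression` have the largest support, they dominate the weighted score in this notebook.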


## Logistic Regression

In [52]:
from sklearn.linear_model import LogisticRegression

In [53]:
# Defining Model
lr_model=OneVsRestClassifier(LogisticRegression())

# Training Model
lr_model.fit(train_word_features,y_tr)

OneVsRestClassifier(estimator=LogisticRegression())

In [54]:
# Make Predictions on Train Set
train_pred_lr=lr_model.predict_proba(train_word_features)

In [55]:
train_pred_lr[:5]

array([[0.02232066, 0.02437395, 0.01205742, 0.02121636, 0.02757488,
        0.01359249, 0.07287913, 0.01479795, 0.05802962, 0.04228092,
        0.0231282 , 0.01721852, 0.01569654, 0.0109303 , 0.02453161,
        0.53324371, 0.16421499, 0.06501881, 0.03638357, 0.99027601],
       [0.05956202, 0.01805246, 0.04287631, 0.01613613, 0.01507844,
        0.00817473, 0.04438653, 0.04625617, 0.45426204, 0.01418559,
        0.02894936, 0.02097675, 0.04528465, 0.01213117, 0.01502465,
        0.87893967, 0.23480989, 0.02554664, 0.0370895 , 0.03655322],
       [0.11210599, 0.00626056, 0.00794955, 0.01954793, 0.03789541,
        0.00668734, 0.01667922, 0.01424425, 0.00587191, 0.00476268,
        0.95939482, 0.01760574, 0.00672746, 0.00737833, 0.00560277,
        0.84066708, 0.11672936, 0.01821074, 0.03763426, 0.02329905],
       [0.02236576, 0.02481519, 0.05332397, 0.00717125, 0.01414123,
        0.0452816 , 0.32298003, 0.80088649, 0.040129  , 0.03515908,
        0.00840884, 0.07061227, 0.00860896, 0

In [56]:
# Finding Optimum value
print("Optimal threshold=>",optimum_threshold(y_tr,train_pred_lr))

Optimal threshold=> 0.25


In [57]:
# Getting classes using optimum threshold
train_pred_lr_class=classify(train_pred_lr,0.25)
train_pred_lr_class[:5]

[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]]

In [58]:
# Evaluating on Training Set
print("F1-score on Train Set:",f1_score(y_tr,train_pred_lr_class,average="weighted"))

F1-score on Train Set: 0.7463131200357592


In [59]:
# Make Predictions on Validation Set
val_pred_lr=lr_model.predict_proba(test_word_features)

# Getting Classes
val_pred_lr_class=classify(val_pred_lr,0.25)

# Evaluating on Validation Set
print("F1-score on Validation Set:",f1_score(y_val,val_pred_lr_class,average="weighted"))

F1-score on Validation Set: 0.6732004042658751


## Model Building Summary
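|        Model        | Train F1 (weighted) | Validation F1 (weighted) |
|:-------------------:|:-------------------:|:------------------------:|
|     Naive Bayes     |        0.6030       |          0.5471          |
| Logistic Regression |        0.7463       |          0.6732          |

Logistic Regression outperforms Naive Bayes on both sets, with a weighted F1-score of 0.6732 versus 0.5471 on the validation set. Both models score lower on validation than on train.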

# Final Question Tagging Pipeline

In [60]:
def tagging(question):
  # Text Cleaning
  cleaned_question=cleaner(question)
  
  # Feature Engineering
  vector=word_vectorizer.transform([cleaned_question])
  
  # Predicting Probabilities
  pred_prob=lr_model.predict_proba(vector)
  
  # Converting Probabilities into classes
  # (0.34 is hard-coded here; the threshold search above selected 0.25 for this model)
  pred_class=classify(pred_prob,0.34)

  return mlb.inverse_transform(np.array(pred_class))

In [63]:
tagging("Is Machine learning good?")

[('machine learning',)]