In this notebook, I use the dataset "StackSample: 10% of Stack Overflow Q&A", keeping only the questions and the tags. 
I will implement a tag suggestion system, trying both classical machine learning models and a deep learning model built on learned word embeddings. I'll then compare the performance of the two approaches. 

This notebook is divided into 2 parts:
* PART 1: Cleaning data and EDA
* PART 2: Classical classifiers (SGD classifier, Multinomial Naive Bayes classifier, Random Forest classifier, ...)


**PART 1: Cleaning Data and Exploratory Data Analysis**

**1.1 Setting up the dataset for later training**

First, import the libraries used throughout the notebook.

In [None]:
import pandas as pd
import numpy as np


import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import seaborn as sns

import warnings

import pickle
import time

import re
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import ToktokTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from string import punctuation

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer
from sklearn.metrics import confusion_matrix
from sklearn.metrics import hamming_loss


import logging
from scipy.sparse import hstack

warnings.filterwarnings("ignore")
plt.style.use('bmh')
%matplotlib inline

In [None]:
# Setting a random seed in order to keep the same random results each time I run the notebook
np.random.seed(seed=11)

In [None]:
import os 
print(os.listdir("../input"))

In [None]:
# Importing the database 

df = pd.read_csv("../input/Questions.csv", encoding="ISO-8859-1")

In [None]:
df.head(5)

In [None]:
tags = pd.read_csv("../input/Tags.csv", encoding="ISO-8859-1", dtype={'Tag': str})

In [None]:
tags.head(5)

In [None]:
df.info()

In [None]:
tags.info()

First, I want to merge both dataframes. To do that, I have to group the tags by post Id, since a post can have multiple tags. I'll use the groupby function and then merge the dataframes on the Id. 
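As a minimal sketch of this group-then-merge idea, here is the same pattern on toy data. The two small frames below are made up for illustration and are not rows from the dataset; `questions` and `toy_tags` are hypothetical names.

```python
import pandas as pd

# Toy stand-ins for Questions.csv and Tags.csv (hypothetical rows).
questions = pd.DataFrame({"Id": [1, 2, 3], "Title": ["q one", "q two", "q three"]})
toy_tags = pd.DataFrame({"Id": [1, 1, 2], "Tag": ["python", "pandas", "c#"]})

# One row per question: its tags joined into a single space-separated string.
grouped = toy_tags.groupby("Id")["Tag"].apply(" ".join).reset_index(name="Tags")

# The default inner merge keeps only questions that have at least one tag,
# so question 3 (no tags) disappears from the result.
merged = questions.merge(grouped, on="Id")
print(merged)
```

Note that the inner merge silently drops untagged questions, which is the desired behavior here since an untagged post cannot be used for training.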

In [None]:
tags['Tag'] = tags['Tag'].astype(str)

In [None]:
grouped_tags = tags.groupby("Id")['Tag'].apply(lambda tags: ' '.join(tags))

In [None]:
grouped_tags.head(5)

In [None]:
grouped_tags.reset_index()

In [None]:
grouped_tags_final = pd.DataFrame({'Id':grouped_tags.index, 'Tags':grouped_tags.values})

In [None]:
grouped_tags_final.head(5)

In [None]:
df.drop(columns=['OwnerUserId', 'CreationDate', 'ClosedDate'], inplace=True)

In [None]:
df = df.merge(grouped_tags_final, on='Id')

In [None]:
df.head(5)

Now, I'll keep only questions with a score greater than 5. I'm doing that for 2 reasons:
* 1- It requires fewer computational resources from Kaggle.
* 2- These posts are probably of better quality and better tagged, since they have many upvotes. 


In [None]:
df = df[df['Score']>5].copy()

**1.2 Cleaning Data**

In [None]:
print('Duplicate entries: {}'.format(df.duplicated().sum()))
df.drop_duplicates(inplace = True)

This is a very good dataset since there are no missing values or duplicate values. 

In [None]:
df.drop(columns=['Id', 'Score'], inplace=True)

**1.2.1 Tags**

Let's do some cleaning on the Tags column. I decided to keep only the 20 most popular tags because it is easier to predict the right tag from 20 candidates than from 14,000, and because we want to keep broad tags rather than very specific ones: this is only a recommendation for a post, and the user can add more specific tags themselves. 
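The filtering idea can be sketched on toy data as follows. The tag lists and the `TOP_N` cutoff are made up for this illustration; the notebook cells below apply the same steps to the real `Tags` column.

```python
from collections import Counter

# Toy tag lists, one list per question (made-up data).
toy_tag_lists = [
    ["python", "pandas"],
    ["python", "regex"],
    ["c#"],
    ["python", "c#"],
    ["java"],
]

TOP_N = 2  # hypothetical cutoff for this toy example

# Count every tag occurrence and keep the TOP_N most frequent tags.
counts = Counter(tag for tags in toy_tag_lists for tag in tags)
top_tags = {tag for tag, _ in counts.most_common(TOP_N)}

# Drop rare tags, then drop questions left with no tag at all.
filtered = [[t for t in tags if t in top_tags] for tags in toy_tag_lists]
filtered = [tags for tags in filtered if tags]

print(sorted(top_tags))
print(filtered)
```

Using a set for `top_tags` keeps the membership test cheap, which matters when the filter runs over every tag of every question.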

In [None]:
from collections import Counter

In [None]:
df.head(5)

In [None]:
df['Tags'] = df['Tags'].apply(lambda x: x.split())

In [None]:
all_tags = [item for sublist in df['Tags'].values for item in sublist]

In [None]:
len(all_tags)

In [None]:
my_set = set(all_tags)
unique_tags = list(my_set)
len(unique_tags)

In [None]:
counts = Counter(all_tags)

In [None]:
print(counts.most_common(20))

In [None]:
frequencies_words = counts.most_common(20)
tags_features = [word[0] for word in frequencies_words]

In [None]:
print(tags_features)

In [None]:
def most_common(tags):
    # Keep only the tags that are among the most frequent ones
    return [tag for tag in tags if tag in tags_features]

In [None]:
df['Tags'] = df['Tags'].apply(lambda x: most_common(x))
df['Tags'] = df['Tags'].apply(lambda x: x if len(x)>0 else None)

In [None]:
df.shape

In [None]:
df.dropna(subset=['Tags'], inplace=True)

In [None]:
df.shape

We lose about 10,000 rows here, but it is for the greater good. 

**1.2.2 Body**

For the next two columns, Body and Title, I'll apply several text-processing steps:
* Removing HTML markup 
* Lowercasing text
* Expanding abbreviations 
* Removing punctuation (but keeping words like c# since it's the most popular tag)
* Lemmatizing words
* Removing stop words
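To make these steps concrete, here is a simplified sketch that uses only the standard library. It is a stand-in for illustration, not the behavior of the preprocessing package used below: `simple_clean` and `STOP_WORDS` are hypothetical names, the stop-word list is deliberately tiny, and lemmatization and abbreviation expansion are left out.

```python
import html
import re

# Tiny illustrative stop-word list (hypothetical; a real run would use a
# fuller list such as the one shipped with NLTK).
STOP_WORDS = {"a", "an", "the", "in", "i", "do", "how"}

def simple_clean(text):
    text = re.sub(r"<[^>]+>", " ", text)         # remove HTML tags
    text = html.unescape(text).lower()           # decode entities, lowercase
    text = re.sub(r"https?://\S+", " ", text)    # remove URLs
    text = re.sub(r"[^a-z0-9#+\s]", " ", text)   # drop punctuation, keep '#' and '+'
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

print(simple_clean("<p>How do I parse JSON in C#? See https://example.com/docs</p>"))
```

Keeping `#` and `+` in the allowed character class is what lets tokens such as c# and c++ survive the punctuation step.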

In [None]:
!pip install git+https://github.com/laxmimerit/preprocess_kgptalkie.git

In [None]:
import preprocess_kgptalkie as ps

In [None]:
def get_clean(x):
    x = str(x).lower().replace('\\', '').replace('_', ' ')
    x = ps.cont_exp(x)
    x = ps.remove_emails(x)
    x = ps.remove_urls(x)
    x = ps.remove_html_tags(x)
    x = ps.remove_accented_chars(x)
    x = ps.remove_special_chars(x)
    x = re.sub("(.)\\1{2,}", "\\1", x)
    return x

In [None]:
df['Body'] = df['Body'].apply(lambda x: get_clean(x))

In [None]:
df['Title'] = df['Title'].apply(lambda x: get_clean(x))

In [None]:
df['Text'] = df['Title'] + " " +  df['Body']

In [None]:
df.head()

**PART 2: Classical classifiers**

**2.1 Data preparation**

Now our data is almost ready to be fed to a classifier. I just need to:
* Binarize the tags
* Apply TF-IDF to the combined Title and Body text

The TF-IDF parameters matter a lot for tag prediction, since we don't want the vectorizer to drop tokens like c# or .net. To do that, we need to use the following pattern: token_pattern=r"(?u)\S\S+"
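The effect of the token pattern can be checked with plain regular expressions, since scikit-learn tokenizes by applying the pattern with `findall`. The sketch below compares scikit-learn's default pattern with the one above on a made-up sentence; it assumes the earlier cleaning has left characters such as `#` and `.` in place.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"  # scikit-learn's default token_pattern
CUSTOM_PATTERN = r"(?u)\S\S+"       # pattern suggested in the text above

sample = "parsing json in c# and asp.net with c++"  # made-up example text

default_tokens = re.findall(DEFAULT_PATTERN, sample)
custom_tokens = re.findall(CUSTOM_PATTERN, sample)

print(default_tokens)  # 'c#' and 'c++' vanish, 'asp.net' is split in two
print(custom_tokens)   # every whitespace-separated token of 2+ characters is kept
```

Both patterns still discard single-character tokens, so a bare "c" is dropped either way.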

In [None]:
y = df['Tags']

In [None]:
multilabel = MultiLabelBinarizer()
y = multilabel.fit_transform(y)

In [None]:
tfidf = TfidfVectorizer(analyzer = 'word', token_pattern=r"(?u)\S\S+", max_features=1000)
X = tfidf.fit_transform(df['Text'])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0) # Do 70/30 split

Now it's finally ready. 

**2.2 One vs Rest**

To evaluate our models, I'll use the Jaccard score, which is well suited to multi-label classification. 
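As a small worked example of the metric, the sketch below computes the per-sample Jaccard score (tags in both sets divided by tags in either set) and the Hamming loss on made-up multi-hot labels. The arrays are hypothetical and only serve to show the arithmetic.

```python
import numpy as np

# Rows are posts, columns are tags (made-up multi-hot labels).
y_true = np.array([[1, 1, 0],
                   [0, 1, 0],
                   [1, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 1, 1]])

intersection = np.minimum(y_true, y_pred).sum(axis=1)  # tags present in both
union = np.maximum(y_true, y_pred).sum(axis=1)         # tags present in either
per_sample = intersection / union                      # 1/2, 1/1, 1/3

mean_jaccard = per_sample.mean() * 100                 # average over posts, in percent
hamming = (y_true != y_pred).mean() * 100              # share of wrong label cells, in percent

print(per_sample, mean_jaccard, hamming)
```

The division is safe here because every post has at least one true tag, so the union is never empty; posts without tags were dropped earlier.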

In [None]:
def avg_jacard(y_true,y_pred):
    '''
    see https://en.wikipedia.org/wiki/Multi-label_classification#Statistics_and_evaluation_metrics
    https://www.oreilly.com/library/view/mastering-machine-learning/9781788299879/87b63eb8-f52c-496a-b73b-42f8aef549fb.xhtml
    '''
    jacard = np.minimum(y_true,y_pred).sum(axis=1) / np.maximum(y_true,y_pred).sum(axis=1)
    
    return jacard.mean()*100

def print_score(y_pred, clf):
    print("Clf: ", clf.__class__.__name__)
    print("Jacard score: {}".format(avg_jacard(y_test, y_pred)))
    print("Hamming loss: {}".format(hamming_loss(y_test, y_pred)*100))
    print("---")    

In [None]:
sgd = SGDClassifier()
lr = LogisticRegression()
svc = LinearSVC()

for classifier in [sgd, lr, svc]:
    clf = OneVsRestClassifier(classifier)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print_score(y_pred, classifier)

**2.3 Confusion matrix**

In [None]:
for i in range(y_train.shape[1]):
    print(multilabel.classes_[i])
    print(confusion_matrix(y_test[:,i], y_pred[:,i]))
    print("")

#### Data Store

In [None]:
df[['Text', 'Tags']].to_csv('stackoverflow.csv')

## Deep Learning

In [None]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split

from sklearn.metrics import confusion_matrix
from sklearn.metrics import hamming_loss


# warnings.filterwarnings("ignore")
# plt.style.use('bmh')
# %matplotlib inline

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Flatten,Embedding,Activation, Dropout
from tensorflow.keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D 

In [None]:
import tensorflow as tf
print(tf.__version__)

In [None]:
from sklearn.model_selection import train_test_split
import ast

In [None]:
df = pd.read_csv('./stackoverflow.csv', index_col = 0)

In [None]:
df.head()

In [None]:
df['Tags'] = df['Tags'].apply(lambda x: ast.literal_eval(x))

In [None]:
df['Tags']

In [None]:
df['Tags'].iloc[0]

In [None]:
multilabel = MultiLabelBinarizer()
y = multilabel.fit_transform(df['Tags'])

In [None]:
multilabel.classes_

In [None]:
text = df['Text'].tolist()

In [None]:
text[:2]

In [None]:
token = Tokenizer()
token.fit_on_texts(text)

In [None]:
y

In [None]:
y.shape

In [None]:
len(token.word_counts)

In [None]:
vocab_size = len(token.word_index) + 1 #https://keras.io/api/layers/core_layers/embedding/
vocab_size

In [None]:
x = ['i love i rt the']
# x = [1, 2, 3, 4, 6]

In [None]:
token.texts_to_sequences(x)

In [None]:
encoded_text = token.texts_to_sequences(text)

In [None]:
max_length = 100
X = pad_sequences(encoded_text, maxlen=max_length, padding='post')

In [None]:
X.shape, y.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.3)

## Model Building

In [None]:
from tensorflow.keras.optimizers import Adam


In [None]:
vec_size = 50
def get_model():
  model = Sequential()
  model.add(Embedding(vocab_size, vec_size, input_length=max_length))

  model.add(Conv1D(32, 2, activation = 'relu'))
  model.add(MaxPooling1D(2))
  model.add(Dropout(0.2))

  model.add(Conv1D(64, 3, activation = 'relu'))
  model.add(MaxPooling1D(2))
  model.add(Dropout(0.3))

#   model.add(Dense(128, activation='relu'))
#   model.add(Dropout(0.2))

  model.add(Dense(128, activation='relu'))

  model.add(GlobalMaxPooling1D())

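  # Sigmoid gives an independent probability per tag (multi-label output)
  model.add(Dense(y.shape[1], activation='sigmoid'))

  return model



### Jaccard score: multi-label classification evaluation metric

In [None]:
import tensorflow as tf
def avg_jacard(y_true,y_pred):
    '''
    Soft Jaccard score computed per sample, then averaged over the batch.
    see https://en.wikipedia.org/wiki/Multi-label_classification#Statistics_and_evaluation_metrics
    '''
    y_true = tf.cast(y_true, y_pred.dtype)
    intersection = tf.reduce_sum(tf.minimum(y_true, y_pred), axis=-1)
    union = tf.reduce_sum(tf.maximum(y_true, y_pred), axis=-1)
    jacard = intersection / (union + 1e-7)  # small constant avoids division by zero
    
    return tf.reduce_mean(jacard)

In [None]:
model = get_model()
# binary_crossentropy treats each tag as its own yes/no problem
model.compile(optimizer='adam', loss = 'binary_crossentropy', metrics = [avg_jacard])
model.fit(X_train, y_train, epochs = 1, validation_data = (X_test, y_test), batch_size = 128)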

### Model Testing

In [None]:
x = "I have some HTML tables which I extract from a third party program which I'd like to show without using a javascript the user gets to see 4 categories and each category has multiple options. From each category only 1 item can be selected"

In [None]:
def get_clean(x):
    x = str(x).lower().replace('\\', '').replace('_', ' ')
    x = ps.cont_exp(x)
    x = ps.remove_emails(x)
    x = ps.remove_urls(x)
    x = ps.remove_html_tags(x)
    x = ps.remove_accented_chars(x)
    x = ps.remove_special_chars(x)
    x = re.sub("(.)\\1{2,}", "\\1", x)
    return x

In [None]:
def get_encoded(x):
#   x = get_clean(x)
  x = token.texts_to_sequences([x])
  x = pad_sequences(x, maxlen=max_length, padding = 'post')
  return x

In [None]:
coded = get_encoded(x)

In [None]:
coded

In [None]:
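# predict_classes was removed in recent TensorFlow releases;
# threshold the per-tag probabilities to get a multi-hot prediction instead
pred = (model.predict(coded) > 0.5).astype(int)
pred

In [None]:
multilabel.inverse_transform(pred)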

In [None]:
multilabel.classes_[11]

In [None]:
multilabel.classes_