# StackOverflow assistant bot

This project combines different concepts learned about [Natural Language Processing](https://www.coursera.org/learn/language-processing) into a simple dialog chatbot capable of:

* Answering user software-programming-related questions (using StackOverflow dataset);
* Chit-chatting and simulating a dialogue on all non-programming-related questions.

The chit-chat mode uses a pre-trained Neural Network Engine available from [ChatterBot](https://chatterbot.readthedocs.io/en/stable/).

## Project description

The ChatBot constantly waits for user input, which can be either a software-development-related question or more general dialogue input.

The intent of the user is identified using a classifier model trained on the following datasets:

1. `data/tagged_posts.tsv`: StackOverflow posts, tagged with one programming language (*positive samples*).
2. `data/dialogues.tsv`: dialogue phrases from movie subtitles (*negative samples*).

Software-related questions are further classified by programming language (C/C++, Python, Java, etc.), and the candidate languages are ranked by the predicted probability of each class.

In [1]:
from utils import *

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/alberto/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## Part I. Intent and language recognition

The goal of this project is to write a ChatBot capable of **maintaining an entertaining dialogue** with the user as well as **understanding and answering software programming-related questions**.

For a programming-related question, the Bot should also be able to understand which programming language the user is referring to. This step speeds up the search for the most appropriate answer in our StackOverflow dataset.
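The two-stage decision described above can be sketched as a small routing function. This is only an illustration: `route_question`, `toy_intent` and `toy_language` are hypothetical names that stand in for the intent recognizer and the tag classifier trained later in this notebook.

```python
def route_question(question, intent_of, language_of):
    """Return ('dialogue', None) or ('stackoverflow', <language tag>).

    intent_of and language_of are hypothetical callables standing in for
    the intent recognizer and the tag classifier.
    """
    if intent_of(question) == 'dialogue':
        return 'dialogue', None
    # Only programming questions go through the language classifier.
    return 'stackoverflow', language_of(question)


# Toy stand-ins, only here to make the sketch runnable.
def toy_intent(question):
    return 'stackoverflow' if 'python' in question.lower() else 'dialogue'


def toy_language(question):
    return 'python'


result_chat = route_question('How are you today?', toy_intent, toy_language)
result_code = route_question('How to sort a list in Python?', toy_intent, toy_language)
print(result_chat, result_code)
```

Running the language classifier only after the intent check means chit-chat input never touches the StackOverflow search at all.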

In [3]:
import numpy as np
import pandas as pd
import joblib
import re

from sklearn.feature_extraction.text import TfidfVectorizer

### Data preparation

Transform the text into features using a TF-IDF transformation.

In [5]:
def tfidf_features(X_train, X_test, vectorizer_path):
    """
    Performs TF-IDF transformation and dumps the model.

    - Train a vectorizer on X_train data.
    - Transform X_train and X_test data.
    - Pickle the trained vectorizer to 'vectorizer_path'.
    """

    # Raw string avoids an invalid-escape warning; \S+ treats every
    # whitespace-separated chunk as one token.
    tfidf_vectorizer = TfidfVectorizer(
        encoding='utf-8',
        min_df=5, max_df=0.9,
        ngram_range=(1, 2),
        token_pattern=r'(\S+)')

    tfidf_vectorizer.fit(X_train)

    X_train = tfidf_vectorizer.transform(X_train)
    X_test = tfidf_vectorizer.transform(X_test)

    # The context manager closes the file even if dumping fails.
    with open(vectorizer_path, 'wb') as output:
        joblib.dump(tfidf_vectorizer, output)

    return X_train, X_test
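To see why the custom `token_pattern` matters, here is a small sketch on a toy corpus (the three titles are made up for illustration). The default scikit-learn pattern keeps only runs of two or more word characters, so names such as `c++` and `c#` would be lost, while `\S+` keeps them. `min_df` and `max_df` are left at their defaults because the toy corpus is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = [
    'c++ vector sort',
    'c# list sort',
    'python list sort',
]

# Same token pattern and n-gram range as in tfidf_features above.
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2), token_pattern=r'(\S+)')
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)
vocabulary = toy_vectorizer.vocabulary_

# Default pattern, for comparison only.
default_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
default_vectorizer.fit(toy_corpus)
default_vocabulary = default_vectorizer.vocabulary_

print(sorted(vocabulary))
print(sorted(default_vocabulary))
```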

Load examples of the two classes, using a subsample of the StackOverflow data to balance the classes.

In [6]:
sample_size = 200000

dialogue_df = pd.read_csv('data/dialogues.tsv', sep='\t').sample(sample_size, random_state=0)

stackoverflow_df = pd.read_csv('data/tagged_posts.tsv', sep='\t').sample(sample_size, random_state=0)

Check what the data looks like:

In [7]:
dialogue_df.head()

Unnamed: 0,text,tag
82925,"Donna, you are a muffin.",dialogue
48774,He was here last night till about two o'clock....,dialogue
55394,"All right, then make an appointment with her s...",dialogue
90806,"Hey, what is this-an interview? We're supposed...",dialogue
107758,Yeah. He's just a friend of mine I was trying ...,dialogue


In [8]:
stackoverflow_df.head()

Unnamed: 0,post_id,title,tag
2168983,43837842,Efficient Algorithm to compose valid expressio...,python
1084095,15747223,Why does this basic thread program fail with C...,c_cpp
1049020,15189594,Link to scroll to top not working,javascript
200466,3273927,Is it possible to implement ping on windows ph...,c#
1200249,17684551,GLSL normal mapping issue,c_cpp


Apply *text_prepare* function to preprocess the data:

In [9]:
from utils import text_prepare

In [10]:
dialogue_df['text'] = [text_prepare(text) for text in dialogue_df['text']]

stackoverflow_df['title'] = [text_prepare(title) for title in stackoverflow_df['title']]

### Intent recognition

Implement binary classification on the TF-IDF representations of the texts.

Labels will be either `dialogue` for general questions or `stackoverflow` for programming-related questions. 

First, prepare the data for this task:
* concatenate `dialogue` and `stackoverflow` examples into one sample
* split it into train and test 
* transform it into TF-IDF features

In [11]:
from sklearn.model_selection import train_test_split

In [12]:
X = np.concatenate([dialogue_df['text'].values, stackoverflow_df['title'].values])
y = ['dialogue'] * dialogue_df.shape[0] + ['stackoverflow'] * stackoverflow_df.shape[0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
print('Train size = {}, test size = {}'.format(len(X_train), len(X_test)))

X_train_tfidf, X_test_tfidf = tfidf_features(X_train, X_test, RESOURCE_PATH['TFIDF_VECTORIZER'])

Train size = 360000, test size = 40000


Train the **intent recognizer** using LogisticRegression on the train set. 

Use the following parameters:

1. *penalty='l2'*, 
2. *C=10*, 
3. *random_state=0*. 

Print out the accuracy on the test set to check whether everything looks good.

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [14]:
intent_recognizer = LogisticRegression(penalty='l2', C=10, random_state=0)
intent_recognizer.fit(X_train_tfidf, y_train)

LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [15]:
# Check test accuracy.
y_test_pred = intent_recognizer.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Test accuracy = {}'.format(test_accuracy))

Test accuracy = 0.99185


Dump the classifier to use it in the running bot.

In [16]:
with open(RESOURCE_PATH['INTENT_RECOGNIZER'], 'wb') as f:
    joblib.dump(intent_recognizer, f)

### Programming language classification

Implement a multi-class logistic regression on the TF-IDF representations of the texts in order to classify the user's programming-related question into the relevant programming language.


In [18]:
X = stackoverflow_df['title'].values
y = stackoverflow_df['tag'].values

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print('Train size = {}, test size = {}'.format(len(X_train), len(X_test)))

Train size = 160000, test size = 40000


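Let's reuse the TF-IDF vectorizer created above. It was fitted on a mix of dialogue phrases and StackOverflow titles, so its vocabulary already covers the titles used here; refitting it on titles alone should not make a large difference.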

In [20]:
vectorizer = joblib.load(open(RESOURCE_PATH['TFIDF_VECTORIZER'], 'rb'))

X_train_tfidf, X_test_tfidf = vectorizer.transform(X_train), vectorizer.transform(X_test)

Train the **tag classifier** using the `OneVsRestClassifier` wrapper over `LogisticRegression`.

Use the following parameters: 

1. *penalty='l2'*, 
2. *C=5*, 
3. *random_state=0*.

In [21]:
from sklearn.multiclass import OneVsRestClassifier

In [22]:
tag_classifier = OneVsRestClassifier(
    LogisticRegression(penalty='l2', C=5, random_state=0)
)

tag_classifier.fit(X_train_tfidf, y_train)

OneVsRestClassifier(estimator=LogisticRegression(C=5, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=0, solver='lbfgs',
                                                 tol=0.0001, verbose=0,
                                                 warm_start=False),
                    n_jobs=None)

In [23]:
# Check test accuracy.
y_test_pred = tag_classifier.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Test accuracy = {}'.format(test_accuracy))

Test accuracy = 0.80055
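
Accuracy only looks at the top prediction, while the project description mentions ranking languages by class probability. A minimal sketch of such a ranking follows, trained on a handful of made-up titles; `rank_tags` and all `toy_*` names are hypothetical and not part of the project code.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

toy_titles = [
    'python list comprehension', 'python dict keys', 'python pandas dataframe',
    'java arraylist iterator', 'java spring bean', 'java hashmap keys',
    'javascript promise async', 'javascript dom event', 'javascript array map',
]
toy_tags = ['python'] * 3 + ['java'] * 3 + ['javascript'] * 3

toy_vectorizer = TfidfVectorizer(token_pattern=r'(\S+)')
toy_clf = OneVsRestClassifier(LogisticRegression(C=5, random_state=0))
toy_clf.fit(toy_vectorizer.fit_transform(toy_titles), toy_tags)


def rank_tags(question, vectorizer, classifier):
    """Return (tag, probability) pairs sorted from most to least likely."""
    probabilities = classifier.predict_proba(vectorizer.transform([question]))[0]
    order = np.argsort(probabilities)[::-1]
    return [(classifier.classes_[i], float(probabilities[i])) for i in order]


ranking = rank_tags('python pandas dataframe', toy_vectorizer, toy_clf)
print(ranking)
```

With a ranked list, the Bot could fall back to the second most likely language when the search within the first one returns nothing useful.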


Dump the classifier to use it in the running bot.

In [24]:
with open(RESOURCE_PATH['TAG_CLASSIFIER'], 'wb') as f:
    joblib.dump(tag_classifier, f)

## Part II. Ranking questions with embeddings

Vectorized representations of text (embeddings) are used to find relevant answers to user questions by computing their similarity to existing threads in our StackOverflow dataset.

Computing such representations for every candidate answer in *online mode* (i.e. while the Bot is running and answering questions from many users) would be computationally expensive, so the answer vectors are pre-computed and stored in a *database*. The pre-computed representations are grouped by non-overlapping tags (i.e. programming languages), so that each search is performed within a single tag. This makes the Bot more efficient and avoids keeping the whole embedding database in memory.
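
A minimal sketch of the search within one tag, assuming cosine similarity is the similarity measure (the text does not name one). `rank_candidates` and the toy arrays are hypothetical and only illustrate the idea.

```python
import numpy as np


def rank_candidates(question_vec, candidate_vectors):
    """Return candidate row indices sorted by decreasing cosine similarity."""
    question_norm = np.linalg.norm(question_vec)
    candidate_norms = np.linalg.norm(candidate_vectors, axis=1)
    # Guard against zero vectors (e.g. a question with no known words).
    denominators = np.maximum(question_norm * candidate_norms, 1e-12)
    similarities = candidate_vectors @ question_vec / denominators
    return np.argsort(-similarities, kind='stable')


# Toy pre-computed index for a single tag.
toy_tag_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]], dtype=np.float32)
toy_tag_post_ids = [101, 102, 103]

order = rank_candidates(np.array([1.0, 0.1], dtype=np.float32), toy_tag_vectors)
best_post_id = toy_tag_post_ids[order[0]]
print(best_post_id)
```

Because the matrix holds only the posts of one tag, the matrix-vector product touches a fraction of the full dataset.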

The following code employs StarSpace embeddings pre-trained in *supervised mode* on Stack Overflow posts. Alternatively, Google's [pre-trained word vectors](https://code.google.com/archive/p/word2vec/) could be used.

In [26]:
starspace_embeddings, embeddings_dim = load_embeddings('data/ss_embeddings.tsv')

In [27]:
len(starspace_embeddings)

95058

Load the whole post dataset in order to pre-compute the embeddings on all possible answers.

In [28]:
posts_df = pd.read_csv('data/tagged_posts.tsv', sep='\t')

Look at the distribution of posts for programming languages (tags) and find the most common ones. 

In [29]:
counts_by_tag = posts_df.groupby('tag').size()

In [30]:
counts_by_tag

tag
c#            394451
c_cpp         281300
java          383456
javascript    375867
php           321752
python        208607
r              36359
ruby           99930
swift          34809
vb             35044
dtype: int64

For each `tag`, create two data structures that will serve as the online search index:

* `tag_post_ids` — a list of post_ids with shape `(counts_by_tag[tag],)`. It will be needed to show the title and link to the thread;
* `tag_vectors` — a matrix with shape `(counts_by_tag[tag], embeddings_dim)` where embeddings for each answer are stored.

Implement the code that computes these structures and dumps them to files. The computation should take several minutes.
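
The code below relies on `question_to_vec` from `utils`, which is not shown in this notebook. One common way to turn a title into a single vector is to average the embeddings of its known words; the sketch below assumes that approach, and `mean_embedding` is a hypothetical name, so the actual helper may differ.

```python
import numpy as np


def mean_embedding(question, embeddings, dim):
    """Average the vectors of the words in `question` that have an embedding."""
    known = [embeddings[word] for word in question.split() if word in embeddings]
    if not known:
        # No known word: fall back to a zero vector of the right size.
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0).astype(np.float32)


toy_embeddings = {
    'python': np.array([1.0, 0.0], dtype=np.float32),
    'list': np.array([0.0, 1.0], dtype=np.float32),
}
vec = mean_embedding('python list unknownword', toy_embeddings, 2)
empty_vec = mean_embedding('nothing known here', toy_embeddings, 2)
print(vec, empty_vec)
```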

In [31]:
import os
os.makedirs(RESOURCE_PATH['THREAD_EMBEDDINGS_FOLDER'], exist_ok=True)

for tag, count in counts_by_tag.items():
        
    tag_posts = posts_df[posts_df['tag'] == tag]
    
    tag_post_ids = tag_posts['post_id'].tolist()
    
    tag_vectors = np.zeros((count, embeddings_dim), dtype=np.float32)
    
    for i, title in enumerate(tag_posts['title']):
        tag_vectors[i, :] = question_to_vec(title, starspace_embeddings, embeddings_dim)

    # Dump post ids and vectors to a file.
    filename = os.path.join(RESOURCE_PATH['THREAD_EMBEDDINGS_FOLDER'], os.path.normpath('%s.pkl' % tag))
    with open(filename, 'wb') as f:
        joblib.dump((tag_post_ids, tag_vectors), f)
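
At query time, only the file for the predicted tag needs to be read back. A small round-trip sketch, using a temporary folder and toy data rather than the real `RESOURCE_PATH` entries:

```python
import os
import tempfile

import joblib
import numpy as np

toy_post_ids = [11, 22, 33]
toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype=np.float32)

with tempfile.TemporaryDirectory() as folder:
    filename = os.path.join(folder, 'python.pkl')

    # Same (post_ids, vectors) tuple layout as in the loop above.
    with open(filename, 'wb') as f:
        joblib.dump((toy_post_ids, toy_vectors), f)

    # Load the index of one tag only.
    with open(filename, 'rb') as f:
        loaded_post_ids, loaded_vectors = joblib.load(f)

print(loaded_post_ids, loaded_vectors.shape)
```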