# Final project: StackOverflow assistant bot

Congratulations on coming this far and solving the programming assignments! In this final project, we will combine everything we have learned about Natural Language Processing to construct a *dialogue chat bot*, which will be able to:
* answer programming-related questions (using StackOverflow dataset);
* chit-chat and simulate dialogue on all non-programming-related questions.

For a chit-chat mode we will use a pre-trained neural network engine available from [ChatterBot](https://github.com/gunthercox/ChatterBot).
Those who are aiming for an honors certificate in our course, or who are just curious, will train their own models for chit-chat.
![](https://imgs.xkcd.com/comics/twitter_bot.png)
©[xkcd](https://xkcd.com)

In [1]:
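"""
This is the final course project. It draws on everything learned in this Natural Language Processing course to build a dialogue bot that can:
    answer programming-related questions (based on the StackOverflow dataset)
    chat and simulate dialogue on questions unrelated to programming
"""

'\nThis is the final course project. It draws on everything learned in this Natural Language Processing course to build a dialogue bot that can:\n    answer programming-related questions (based on the StackOverflow dataset)\n    chat and simulate dialogue on questions unrelated to programming\n'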

### Data description

To detect the *intent* of user questions we will need two text collections:
- `tagged_posts.tsv` — StackOverflow posts, tagged with one programming language (*positive samples*).
- `dialogues.tsv` — dialogue phrases from movie subtitles (*negative samples*).


In [2]:
import sys
sys.path.append("..")
from common.download_utils import download_project_resources

download_project_resources()

File data/dialogues.tsv is already downloaded.
File data/tagged_posts.tsv is already downloaded.


For questions that have programming-related intent, we will proceed as follows: predict the programming language (only one tag per question is allowed here) and rank candidates within that tag using embeddings.
For the ranking part, you will need:
- `word_embeddings.tsv` — word embeddings, that you  trained with StarSpace in the 3rd assignment. It's not a problem if you didn't do it, because we can offer an alternative solution for you.

In [3]:
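"""
    For programming-related questions, we predict the programming language of each question (only one tag per question is allowed) and use embeddings to rank the candidate posts within that tag.
    The ranking step needs word_embeddings.tsv, which was trained with StarSpace (it is used to build sentence vectors and hence compute similarity).
    I trained StarSpace earlier, but my file is named modelSaveFile.tsv.
"""

'\n    For programming-related questions, we predict the programming language of each question (only one tag per question is allowed) and use embeddings to rank the candidate posts within that tag.\n    The ranking step needs word_embeddings.tsv, which was trained with StarSpace (it is used to build sentence vectors and hence compute similarity).\n    I trained StarSpace earlier, but my file is named modelSaveFile.tsv.\n'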

As a result of this notebook, you should obtain the following new objects that you will then use in the running bot:

- `intent_recognizer.pkl` — intent recognition model;
- `tag_classifier.pkl` — programming language classification model;
- `tfidf_vectorizer.pkl` — vectorizer used during training;
- `thread_embeddings_by_tags` — folder with thread embeddings, arranged by tags.
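
The later cells refer to these objects through a `RESOURCE_PATH` dictionary imported from *utils.py*. As a rough sketch only: the keys below are the ones this notebook uses, while the dictionary name `RESOURCE_PATH_SKETCH`, the helper `missing_resources`, and the relative paths are assumptions made for illustration; the real values live in *utils.py* and may differ.

```python
import os

# Hypothetical layout: the keys match the ones used in this notebook,
# but the actual paths are defined in utils.py and may differ.
RESOURCE_PATH_SKETCH = {
    'INTENT_RECOGNIZER': 'intent_recognizer.pkl',
    'TAG_CLASSIFIER': 'tag_classifier.pkl',
    'TFIDF_VECTORIZER': 'tfidf_vectorizer.pkl',
    'THREAD_EMBEDDINGS_FOLDER': 'thread_embeddings_by_tags',
}


def missing_resources(paths):
    """Return the keys whose file or folder does not exist yet."""
    return [key for key, path in paths.items() if not os.path.exists(path)]


# Before starting the bot, this list should be empty.
print(missing_resources(RESOURCE_PATH_SKETCH))
```

A check like this can be run at the end of the notebook to confirm that every artifact was actually written.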
    

In [4]:
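"""
    intent recognition model
    programming language classification model
    vectorizer used during training
    folder with thread embeddings, arranged by tag
"""

'\n    intent recognition model\n    programming language classification model\n    vectorizer used during training\n    folder with thread embeddings, arranged by tag\n'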

Some functions will be reused by this notebook and the scripts, so we put them into *utils.py* file. Don't forget to open it and fill in the gaps!

In [5]:
from utils import *

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ironman/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Part I. Intent and language recognition

We want to write a bot that will not only **answer programming-related questions** but will also be able to **maintain a dialogue**. We would also like to detect the *intent* of the user from the question (we could have had a 'Question answering mode' check-box in the bot, but that wouldn't be fun at all, would it?). So the first thing we need to do is to **distinguish programming-related questions from general ones**.

It would also be good to predict which programming language a particular question refers to. By doing so, we will speed up question search by a factor of up to the number of languages (10 here), and exercise our *text classification* skill a bit. :)

In [6]:
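"""
    We are writing a bot that not only answers programming-related questions but can also maintain a dialogue, so we need to detect the user's intent from the question.
    The first step is to distinguish programming-related questions from general ones.
    The second step is to predict the programming language most relevant to the question, which speeds up the search (this uses text classification).
"""

'\n    We are writing a bot that not only answers programming-related questions but can also maintain a dialogue, so we need to detect the user\'s intent from the question.\n    The first step is to distinguish programming-related questions from general ones.\n    The second step is to predict the programming language most relevant to the question, which speeds up the search (this uses text classification).\n'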

In [7]:
import numpy as np
import pandas as pd
import pickle
import re

from sklearn.feature_extraction.text import TfidfVectorizer

### Data preparation

In the first assignment (Predict tags on StackOverflow with linear models), you have already learnt how to preprocess texts and do TF-IDF transformations. Reuse your code here. In addition, you will also need to [dump](https://docs.python.org/3/library/pickle.html#pickle.dump) the TF-IDF vectorizer with pickle to use it later in the running bot.

In [8]:
def tfidf_features(X_train, X_test, vectorizer_path):
    """Performs TF-IDF transformation and dumps the model."""
    
    # Train a vectorizer on X_train data.
    # Transform X_train and X_test data.
    
    # Pickle the trained vectorizer to 'vectorizer_path'
    # Don't forget to open the file in writing bytes mode.
    
    ######################################
    ######### YOUR CODE HERE #############
    ######################################
    # Parameters to consider: max_df, min_df, ngram_range, token_pattern, max_features.
    # max_features is especially important given the number of training samples.
    #tfidf_vectorizer = TfidfVectorizer(max_df=0.9, min_df=5, stop_words='english', ngram_range=(1, 2), token_pattern=r'(\S+)', max_features=5000)
    vectorizer = TfidfVectorizer(max_df=0.9, min_df=5, stop_words='english', ngram_range=(1, 2), token_pattern=r'(\S+)')
    tfidf_vectorizer = vectorizer.fit(X_train)
    
    # Transform both sets with the fitted vectorizer.
    # I originally used fit_transform() here by mistake, which made the
    # training and test set dimensions inconsistent.
    X_train = tfidf_vectorizer.transform(X_train)
    X_test = tfidf_vectorizer.transform(X_test)
    
    # Save the fitted vectorizer; the file is opened in 'wb' mode and closed automatically.
    with open(vectorizer_path, 'wb') as f:
        pickle.dump(vectorizer, f)
    
    return X_train, X_test
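
The dimension mismatch mentioned in the comments is easy to reproduce or rule out on toy data. The following is a minimal sketch, not the assignment solution: the texts and the names `train_texts`, `test_texts`, and `restored` are made up for illustration, and `pickle.dumps`/`pickle.loads` stand in for writing to and reading from `vectorizer_path`.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy texts, invented for this illustration.
train_texts = ["python list sort", "java string split", "hello friend"]
test_texts = ["sort java list", "goodbye friend"]

vectorizer = TfidfVectorizer(token_pattern=r'(\S+)')
train_matrix = vectorizer.fit_transform(train_texts)  # learn the vocabulary on train only
test_matrix = vectorizer.transform(test_texts)        # reuse that vocabulary for test

# Round trip through pickle, similar to what the running bot does with the dumped file.
restored = pickle.loads(pickle.dumps(vectorizer))
restored_matrix = restored.transform(test_texts)

# Both matrices share the same number of columns (the train vocabulary size).
print(train_matrix.shape, test_matrix.shape)
```

Calling `fit_transform` on the test texts instead would build a second vocabulary, so the column counts would generally differ and the classifier could not be applied.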

Now, load examples of the two classes. Use a subsample of the StackOverflow data to balance the classes. You will need the full data later.

In [9]:
# There should be 200,000 positive and 200,000 negative examples.
sample_size = 200000

# No need to reload for now.

dialogue_df = pd.read_csv('data/dialogues.tsv', sep='\t').sample(sample_size, random_state=0)
stackoverflow_df = pd.read_csv('data/tagged_posts.tsv', sep='\t').sample(sample_size, random_state=0)



In [72]:
stackoverflow_df.index

Int64Index([2168983, 1084095, 1049020,  200466, 1200249, 1274044, 1948078,
            1580601,  839533, 1811118,
            ...
              70423,  715062, 1417588, 1074334,  844119,  189987, 1975274,
             445525,  187718,  163186],
           dtype='int64', length=200000)

Check what the data looks like:

In [10]:
dialogue_df.head()

Unnamed: 0,text,tag
82925,"Donna, you are a muffin.",dialogue
48774,He was here last night till about two o'clock....,dialogue
55394,"All right, then make an appointment with her s...",dialogue
90806,"Hey, what is this-an interview? We're supposed...",dialogue
107758,Yeah. He's just a friend of mine I was trying ...,dialogue


In [11]:
stackoverflow_df.head()

Unnamed: 0,post_id,title,tag
2168983,43837842,Efficient Algorithm to compose valid expressio...,python
1084095,15747223,Why does this basic thread program fail with C...,c_cpp
1049020,15189594,Link to scroll to top not working,javascript
200466,3273927,Is it possible to implement ping on windows ph...,c#
1200249,17684551,GLSL normal mapping issue,c_cpp


Apply *text_prepare* function to preprocess the data:

In [12]:
from utils import text_prepare

In [13]:
print(type(dialogue_df["text"]))
print(type(dialogue_df["text"].index))
for i in dialogue_df["text"].index[0:10]:
    print(i)
# print(dialogue_df["text"][82925])

# dialogue_df["text"][82925] = "dadaf"
# print(dialogue_df["text"][82925])

<class 'pandas.core.series.Series'>
<class 'pandas.core.indexes.numeric.Int64Index'>
82925
48774
55394
90806
107758
193506
28588
95628
47491
119356


In [15]:
# Takes about a minute to run.
dialogue_df['text'] = dialogue_df['text'].map(text_prepare)######### YOUR CODE HERE #############

# I did not know how to process a pandas Series at first, so I used the for loops below.
# Preprocessing already succeeded, so they are not run again.

# for index in dialogue_df["text"].index:
#     dialogue_df["text"][index] = text_prepare(dialogue_df["text"][index])
# for index in stackoverflow_df["title"].index:
#     stackoverflow_df["title"][index] = text_prepare(stackoverflow_df["title"][index])



In [16]:
stackoverflow_df['title'] = stackoverflow_df['title'].map(text_prepare)######### YOUR CODE HERE #############

In [17]:
# OK, this works.
dialogue_df.head()

Unnamed: 0,text,tag
82925,donna muffin,dialogue
48774,last night till two oclock hear really got stu...,dialogue
55394,right make appointment see,dialogue
90806,hey thisan interview supposed making love,dialogue
107758,yeah hes friend mine trying help,dialogue


### Intent recognition

We will do a binary classification on TF-IDF representations of texts. Labels will be either `dialogue` for general questions or `stackoverflow` for programming-related questions. First, prepare the data for this task:
- concatenate `dialogue` and `stackoverflow` examples into one sample
- split it into train and test in proportion 9:1, use *random_state=0* for reproducibility
- transform it into TF-IDF features

In [18]:
from sklearn.model_selection import train_test_split

In [19]:
# All texts are preprocessed first; only then are they split, turned into TF-IDF vectors, and used to train the classifier.
X = np.concatenate([dialogue_df['text'].values, stackoverflow_df['title'].values])
y = ['dialogue'] * dialogue_df.shape[0] + ['stackoverflow'] * stackoverflow_df.shape[0]

######### YOUR CODE HERE ##########
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
print('Train size = {}, test size = {}'.format(len(X_train), len(X_test)))

X_train_tfidf, X_test_tfidf = tfidf_features(X_train, X_test, RESOURCE_PATH['TFIDF_VECTORIZER'])  # the vectorizer is saved to the given path

Train size = 360000, test size = 40000


Train the **intent recognizer** using LogisticRegression on the train set with the following parameters: *penalty='l2'*, *C=10*, *random_state=0*. Print out the accuracy on the test set to check whether everything looks good.

In [20]:
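"""
    Intent recognition: decide whether a question is programming-related.
    Solved with LogisticRegression.
"""

'\n    Intent recognition: decide whether a question is programming-related.\n    Solved with LogisticRegression.\n'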

In [21]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [22]:
######################################
######### YOUR CODE HERE #############
######################################
intent_recognizer = LogisticRegression(penalty="l2", C=10, random_state=0)
intent_recognizer.fit(X_train_tfidf, y_train)

LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [23]:
# Check test accuracy.
# At test time the input is likewise preprocessed text converted to TF-IDF vectors, which are fed to the classifier for prediction.
y_test_pred = intent_recognizer.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Test accuracy = {}'.format(test_accuracy))

Test accuracy = 0.991075


Dump the classifier to use it in the running bot.

In [24]:
with open(RESOURCE_PATH['INTENT_RECOGNIZER'], 'wb') as f:
    pickle.dump(intent_recognizer, f)

### Programming language classification 

We will train one more classifier for the programming-related questions. It will predict exactly one tag (=programming language) and will be also based on Logistic Regression with TF-IDF features. 

First, let us prepare the data for this task.

In [25]:
X = stackoverflow_df['title'].values
y = stackoverflow_df['tag'].values

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print('Train size = {}, test size = {}'.format(len(X_train), len(X_test)))

Train size = 160000, test size = 40000


Let us reuse the TF-IDF vectorizer that we have already created above. It should not make a huge difference which data was used to train it.

In [27]:
with open(RESOURCE_PATH['TFIDF_VECTORIZER'], 'rb') as f:
    vectorizer = pickle.load(f)

X_train_tfidf, X_test_tfidf = vectorizer.transform(X_train), vectorizer.transform(X_test)

Train the **tag classifier** using OneVsRestClassifier wrapper over LogisticRegression. Use the following parameters: *penalty='l2'*, *C=5*, *random_state=0*.

In [28]:
from sklearn.multiclass import OneVsRestClassifier

In [29]:
######################################
######### YOUR CODE HERE #############
######################################
tag_classifier = OneVsRestClassifier(LogisticRegression(penalty="l2", C=5, random_state=0))
tag_classifier.fit(X_train_tfidf, y_train)   # I wondered whether y_train needs one-hot encoding here; OneVsRestClassifier accepts the string labels directly.

OneVsRestClassifier(estimator=LogisticRegression(C=5, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
          n_jobs=1)

In [30]:
# Check test accuracy.
y_test_pred = tag_classifier.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Test accuracy = {}'.format(test_accuracy))

Test accuracy = 0.8002


Dump the classifier to use it in the running bot.

In [31]:
with open(RESOURCE_PATH['TAG_CLASSIFIER'], 'wb') as f:
    pickle.dump(tag_classifier, f)
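
To see how the two classifiers from Part I fit together, here is a small sketch of the routing step on toy data. It is an illustration under assumptions rather than the bot's actual code: the training texts, the `route` function, and the variable names are invented here, and the real bot would load the pickled models instead of fitting them inline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy stand-ins for the real training data (assumed for illustration).
dialogue = ["how are you", "nice to meet you", "see you tomorrow"]
posts = {"python": ["python list comprehension", "python dict sort"],
         "java": ["java string split", "java arraylist sort"]}
post_texts = [title for titles in posts.values() for title in titles]
post_tags = [tag for tag, titles in posts.items() for _ in titles]

# One shared vectorizer, as in the notebook.
vectorizer = TfidfVectorizer(token_pattern=r'(\S+)')
X_all = vectorizer.fit_transform(dialogue + post_texts)

intent = LogisticRegression(C=10, random_state=0).fit(
    X_all, ['dialogue'] * len(dialogue) + ['stackoverflow'] * len(post_texts))
tagger = OneVsRestClassifier(LogisticRegression(C=5, random_state=0)).fit(
    vectorizer.transform(post_texts), post_tags)


def route(question):
    """Return ('dialogue', None) or ('stackoverflow', predicted_tag)."""
    features = vectorizer.transform([question])
    if intent.predict(features)[0] == 'dialogue':
        return 'dialogue', None
    return 'stackoverflow', tagger.predict(features)[0]


print(route("python sort list"))
```

The tag classifier is consulted only when the intent recognizer reports a programming question, which is why both models must share the same vectorizer.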

## Part II. Ranking questions with embeddings

To find a relevant answer (a thread from StackOverflow) to a question you will use vector representations to calculate similarity between the question and existing threads. We already have the `question_to_vec` function from assignment 3, which can create such a representation based on word vectors.
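
As a reminder of the idea, a common way to build such a representation is to average the vectors of the words that have an embedding. The sketch below assumes that approach; the function name `question_to_vec_sketch` and the toy embeddings are hypothetical, and the version in your *utils.py* is the one the notebook actually uses.

```python
import numpy as np


def question_to_vec_sketch(question, embeddings, dim):
    """Average the vectors of known words; return zeros if none are known."""
    vectors = [embeddings[word] for word in question.split() if word in embeddings]
    if not vectors:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)


# Toy two-dimensional embeddings, invented for this example.
toy_embeddings = {"python": np.array([1.0, 0.0]), "list": np.array([0.0, 1.0])}

# The unknown word is skipped, so the result is the mean of the two known vectors.
print(question_to_vec_sketch("python list unknownword", toy_embeddings, 2))  # [0.5 0.5]
```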

However, it would be costly to compute such a representation for all possible answers in the *online mode* of the bot (e.g. when the bot is running and answering questions from many users). This is why you will create a *database* with pre-computed representations. These representations will be arranged by non-overlapping tags (programming languages), so that the search for an answer can be performed within one tag at a time. This makes our bot more efficient and means the whole database does not have to be kept in RAM.

Load StarSpace embeddings which were trained on Stack Overflow posts. These embeddings were trained in *supervised mode* for duplicates detection on the same corpus that is used in search. We can count on these representations to help us find closely related answers for a question.

If for some reason you didn't train StarSpace embeddings in assignment 3, you can use [pre-trained word vectors](https://code.google.com/archive/p/word2vec/) from Google. All instructions on how to work with these vectors were provided in the same assignment. However, we highly recommend using the StarSpace embeddings, because they are better suited to this task. If you choose to use Google's embeddings, delete the words that are not in the StackOverflow data.

In [32]:
from utils import load_embeddings

In [33]:
embedding_file ="/home/ironman/D/python/nlp_Russia/natural-language-processing/week3/Starspace/modelSaveFile.tsv" 
starspace_embeddings, embeddings_dim = load_embeddings(embedding_file)

In [34]:
print(embeddings_dim)

100


Since we want to precompute representations for all possible answers, we need to load the whole posts dataset, unlike what we did for the intent classifier:

In [35]:
posts_df = pd.read_csv('data/tagged_posts.tsv', sep='\t')

In [36]:
posts_df.head()

Unnamed: 0,post_id,title,tag
0,9,Calculate age in C#,c#
1,16,Filling a DataSet or DataTable from a LINQ que...,c#
2,39,Reliable timer in a console application,c#
3,42,Best way to allow plugins for a PHP application,php
4,59,"How do I get a distinct, ordered list of names...",c#


Look at the distribution of posts for programming languages (tags) and find the most common ones. 
You might want to use pandas [groupby](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) and [count](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html) methods:

In [37]:
counts_by_tag = posts_df.groupby("tag").count()######### YOUR CODE HERE #############
counts_by_tag = pd.Series(counts_by_tag["post_id"].values, index=counts_by_tag.index)
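
For comparison, the same per-tag counts can be obtained more directly. This is a small sketch on a made-up frame (`toy_posts` is hypothetical); both `groupby(...).size()` and `value_counts()` return a Series indexed by tag, so either could feed the loop further down.

```python
import pandas as pd

# Tiny stand-in for posts_df, invented for this example.
toy_posts = pd.DataFrame({
    "post_id": [9, 16, 42, 59],
    "title": ["a", "b", "c", "d"],
    "tag": ["c#", "c#", "php", "c#"],
})

by_groupby = toy_posts.groupby("tag").size()          # Series: tag -> number of rows
by_value_counts = toy_posts["tag"].value_counts()     # same counts, sorted by frequency

print(by_groupby.to_dict())
```

Note that `value_counts()` orders tags by frequency while `groupby` orders them alphabetically, which only matters if the iteration order is important.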

In [78]:
posts_df.loc[29989,['post_id', 'title', 'tag']]

post_id                        688529
title      C wINnet InternetConnect()
tag                             c_cpp
Name: 29989, dtype: object

In [38]:
print(type(posts_df.groupby("tag")))
print(type(posts_df.groupby("tag").count()))
print(type(counts_by_tag))
#print(counts_by_tag)
#print(posts_df["tag"].count())
#print(posts_df.set_index(["post_id","tag"]).count(level="tag"))
# for tag, count in posts_df.set_index(["post_id","tag"]).count(level="tag").items():
#     print(tag)
for tag, count in counts_by_tag.items():
    print(tag, count)

<class 'pandas.core.groupby.DataFrameGroupBy'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
c# 394451
c_cpp 281300
java 383456
javascript 375867
php 321752
python 208607
r 36359
ruby 99930
swift 34809
vb 35044
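
These counts also let us sanity-check the speed-up claimed in Part I. The sketch below assumes that incoming questions follow the same tag distribution as the dataset, which is an assumption made here and not something the notebook measures. Under it, the expected number of vectors scanned per query is the sum of n_i^2 / N over all tags, so the speed-up relative to scanning everything is N divided by that value.

```python
# Per-tag counts copied from the output above.
counts = {
    "c#": 394451, "c_cpp": 281300, "java": 383456, "javascript": 375867,
    "php": 321752, "python": 208607, "r": 36359, "ruby": 99930,
    "swift": 34809, "vb": 35044,
}

total = sum(counts.values())

# Assumption: a query lands in tag i with probability n_i / N,
# and then n_i vectors have to be scanned.
expected_scanned = sum(n * n for n in counts.values()) / total
speedup = total / expected_scanned

print(total, round(expected_scanned), round(speedup, 2))
```

Because the tags are unbalanced, the speed-up comes out below the ideal factor of 10, which would only be reached if all tags had the same number of posts.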


In [39]:
# For testing only.
#print(len(posts_df["tag"].index))
print(type(posts_df.count().items()))
for tag, count in posts_df.count().items():
    print(tag, count)

<class 'zip'>
post_id 2171575
title 2171575
tag 2171575


Now for each `tag` you need to create two data structures, which will serve as online search index:
* `tag_post_ids` — a list of post_ids with shape `(counts_by_tag[tag],)`. It will be needed to show the title and link to the thread;
* `tag_vectors` — a matrix with shape `(counts_by_tag[tag], embeddings_dim)` where embeddings for each answer are stored.

Implement the code that computes these structures and dumps them to files. It should take several minutes to run.

In [40]:
"""
Each tag has two data structures:
    tag_post_ids: a list of post_ids
    tag_vectors: an embedding matrix
"""

'\nEach tag has two data structures:\n    tag_post_ids: a list of post_ids\n    tag_vectors: an embedding matrix\n'

In [41]:
#print(posts_df[posts_df['tag'] == "c#"].index)
my_dict = {"aa":"bb", "aaa":"bbb"}
print(len(list(my_dict.values())[0]))  # explicit conversion needed: turn dict_values into a list

2


In [44]:
# The key outcome of this whole notebook is the per-tag list of (post ids, vectors).
# It actually took only about 6 minutes to run.
import os
os.makedirs(RESOURCE_PATH['THREAD_EMBEDDINGS_FOLDER'], exist_ok=True)

for tag, count in counts_by_tag.items():
    tag_posts = posts_df[posts_df['tag'] == tag]
    #print(tag_posts["title"])
    
    # Use the post_id column (not the DataFrame row index) so the ids can link to the thread.
    tag_post_ids = tag_posts['post_id'].values######### YOUR CODE HERE #############
    #print(tag_post_ids)
    tag_vectors = np.zeros((count, embeddings_dim), dtype=np.float32)  # one sentence vector per title
    
    for i, title in enumerate(tag_posts['title']):
        #print("before_test_prepare:", type(title))
        # Normalize the title before computing its sentence vector.
        title = text_prepare(title)
        #print("after_test_prepare:", title)
        # Row i holds the sentence vector of the i-th title.
        tag_vectors[i, :] = question_to_vec(title, starspace_embeddings, embeddings_dim)######### YOUR CODE HERE #############

    # Dump post ids and vectors to a file.
    # One file per tag, containing the pair (post ids, sentence vectors).
    filename = os.path.join(RESOURCE_PATH['THREAD_EMBEDDINGS_FOLDER'], os.path.normpath('%s.pkl' % tag))
    
    # The files were already generated, so the dump is left commented out; regenerate later if needed.
    #pickle.dump((tag_post_ids, tag_vectors), open(filename, 'wb'))

In [54]:
# Try out pairwise_distances_argmin.
from sklearn.metrics.pairwise import pairwise_distances_argmin

In [58]:
a= np.arange(1,17).reshape(8,2)
b = np.arange(5,7).reshape(1,-1)
print(a)
print(b)
pairwise_distances_argmin(b, a)

[[ 1  2]
 [ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]
 [11 12]
 [13 14]
 [15 16]]
[[5 6]]


array([2])

In [None]:
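"""
    Check how the reply is produced once the most similar question is found (it is just a template plus the id of the best candidate, forming a URL that can be looked up on StackOverflow).
    Finding the most similar sentence now works.
"""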
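
Putting the last two notes together, the lookup inside one tag can be sketched as follows. This is an illustration under assumptions: it ranks by cosine similarity computed with NumPy, whereas the test cell above uses `pairwise_distances_argmin` with its default metric; the function `best_thread`, the toy ids and vectors, and the URL template are all hypothetical stand-ins for what the bot scripts do.

```python
import numpy as np


def best_thread(question_vec, post_ids, vectors):
    """Return the post id whose vector has the highest cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(question_vec)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero vectors
    similarities = vectors @ question_vec / norms
    return post_ids[int(np.argmax(similarities))]


# Toy index for a single tag: three post ids and their 2-d vectors (made up).
toy_ids = np.array([9, 16, 39])
toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]], dtype=np.float32)

post_id = best_thread(np.array([0.1, 0.9]), toy_ids, toy_vectors)

# Assumed answer template: a link built from the best post id.
print('https://stackoverflow.com/questions/%s' % post_id)
```

In the real bot, `toy_ids` and `toy_vectors` would be replaced by the pair loaded from the pickle file of the predicted tag.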

In [None]:
"""
    Have a son like Zhao Jiadi; be a man like Chen Fusheng.
"""