# Final project: StackOverflow assistant bot

Congratulations on coming this far and solving the programming assignments! In this final project, we will combine everything we have learned about Natural Language Processing to construct a *dialogue chat bot*, which will be able to:
* answer programming-related questions (using StackOverflow dataset);
* chit-chat and simulate dialogue on all non-programming-related questions.

For the chit-chat mode we will use a pre-trained neural network engine available from [ChatterBot](https://github.com/gunthercox/ChatterBot).
Those who are aiming for an honors certificate in our course, or who are just curious, will train their own chit-chat models.
![](https://imgs.xkcd.com/comics/twitter_bot.png)
©[xkcd](https://xkcd.com)

### Data description

To detect the *intent* of users' questions we will need two text collections:
- `tagged_posts.tsv` — StackOverflow posts, tagged with one programming language (*positive samples*).
- `dialogues.tsv` — dialogue phrases from movie subtitles (*negative samples*).


In [2]:
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    ! wget https://raw.githubusercontent.com/hse-aml/natural-language-processing/master/setup_google_colab.py -O setup_google_colab.py
    import setup_google_colab
    setup_google_colab.setup_project()

import sys
sys.path.append("..")
from common.download_utils import download_project_resources

download_project_resources()


For questions that have programming-related intent, we will proceed as follows: predict the programming language (only one tag per question is allowed here) and rank candidates within the tag using embeddings.
For the ranking part, you will need:
- `word_embeddings.tsv` — word embeddings, that you  trained with StarSpace in the 3rd assignment. It's not a problem if you didn't do it, because we can offer an alternative solution for you.
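
To make the ranking step concrete, here is a minimal sketch of ranking candidate threads against a question by cosine similarity. The toy vectors, the thread ids, and the helper names `cosine` and `rank_candidates` are invented for illustration; the real bot would work with vectors built from `word_embeddings.tsv`, and cosine similarity is only one reasonable choice of score.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_candidates(question_vec, candidates):
    """Return candidate ids sorted from most to least similar to the question.

    `candidates` maps a thread id to its embedding (toy data in this sketch).
    """
    scored = [(cosine(question_vec, vec), thread_id)
              for thread_id, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [thread_id for _, thread_id in scored]

# Toy 3-dimensional embeddings, made up for this example.
question = [1.0, 0.0, 1.0]
threads = {
    101: [0.9, 0.1, 0.8],  # nearly the same direction as the question
    102: [0.0, 1.0, 0.0],  # orthogonal to the question
    103: [1.0, 1.0, 0.0],  # partially aligned
}
ranking = rank_candidates(question, threads)
print(ranking)
```

Because candidates are only compared within one predicted tag, the dictionary passed to `rank_candidates` would hold the threads of that tag alone.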

As a result of this notebook, you should obtain the following new objects that you will then use in the running bot:

- `intent_recognizer.pkl` — intent recognition model;
- `tag_classifier.pkl` — programming language classification model;
- `tfidf_vectorizer.pkl` — vectorizer used during training;
- `thread_embeddings_by_tags` — folder with thread embeddings, arranged by tags.
    

Some functions will be reused by this notebook and the scripts, so we put them into *utils.py* file. Don't forget to open it and fill in the gaps!

In [9]:
from utils import *

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Part I. Intent and language recognition

We want to write a bot that will not only **answer programming-related questions** but also be able to **maintain a dialogue**. We would also like to detect the *intent* of the user from the question (we could have had a 'Question answering mode' check-box in the bot, but it wouldn't be fun at all, would it?). So the first thing we need to do is to **distinguish programming-related questions from general ones**.

It would also be good to predict which programming language a particular question refers to. By doing so, we will speed up question search by a factor of the number of languages (10 here), and exercise our *text classification* skill a bit. :)

In [10]:
import numpy as np
import pandas as pd
import pickle
import re

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

### Data preparation

In the first assignment (Predict tags on StackOverflow with linear models), you have already learnt how to preprocess texts and do TF-IDF transformations. Reuse your code here. In addition, you will also need to [dump](https://docs.python.org/3/library/pickle.html#pickle.dump) the TF-IDF vectorizer with pickle to use it later in the running bot.
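
As a reminder of how dumping and loading with pickle fits together, here is a small round-trip sketch. The `fitted_state` dictionary is a hypothetical stand-in for a fitted vectorizer (any picklable object is handled the same way), and the temporary directory is used only so the example does not touch the project files.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted vectorizer; not the real object.
fitted_state = {"vocabulary": {"python": 0, "java": 1}, "idf": [1.0, 1.4]}

path = os.path.join(tempfile.mkdtemp(), "tfidf_vectorizer.pkl")

# Write in binary mode; the `with` block closes the file even if dumping fails.
with open(path, "wb") as f:
    pickle.dump(fitted_state, f)

# Later (for example inside the running bot) load it back, again in binary mode.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == fitted_state)
```

The object must be loaded with the same class definitions available as when it was dumped, which is why the running bot needs the same library versions as the notebook.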

First, load the data.

In [11]:
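from random import sample

# read_csv splits on commas by default, so each row arrives as a single
# string whose fields are still separated by tabs.
a = pd.read_csv('movie_lines.tsv', header=None)

def cleaner(x):
    # Keep only the last tab-separated field, which is the spoken line itself.
    out = x.split('\t')
    return out[-1]

lol = [cleaner(l.item()) for l in a.values]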

In [12]:
lol1 = sample(lol,2000)

In [13]:
lol1

['Good swell.',
 "We're not watching the sea.",
 'I got a call to this apartment report of a disturbance --',
 'I thought you said Krueger burned to death.',
 "It ain't always a blessing.  My brother here?",
 "You're not like him at all.  I don't know exactly what's going on around here lately but don't make me start worrying about you.",
 "You don't even want to touch me.",
 "I'm sure of it. Just as I'm sure there's a wish for Shangri-La in everyone's heart. I have never seen the outside world. But I understand there are millions and millions of people who are supposed to be mean and greedy. Yet I just know that secretly they are all hoping to find a garden spot where there is peace and security where there's beauty and comfort where they wouldn't have to be mean and greedy. Oh I just wish the whole world might come to this valley.",
 "C'mon man...",
 'Sign here please.',
 'Ich weiss alles.',
 "Oh sure it was all mean old Marcel's idea. Give me a break! We didn't tell you because it's

In [14]:
def read_lines(path):
    # Read a text file into a list of stripped lines and close it afterwards.
    with open(path, 'r') as f:
        return [l.strip() for l in f]

nav_int = read_lines('nav_int.txt')
size_int = read_lines('size_intent.txt')
bbt_store_list = read_lines('bbt_store_list.txt')

# Each line lists the spellings of one store separated by '.';
# the last two lines of the file are skipped.
bbt_store_list = [x.split('.') for x in bbt_store_list][:-2]

# Map the base name (the first spelling) to all of its spellings.
bbt_store_dict = {el[0]: el for el in bbt_store_list}

# Flat list of every spelling of every store.
all_store_names = [name for el in bbt_store_list for name in el]

Now we replace the `insert` placeholder in every template sentence with each store name, and also add a copy of each sentence that ends with a question mark.

In [15]:
appended_nav = []
appended_size = []

for store_name in all_store_names:
    
    for nav_eg in nav_int:
        new_data = nav_eg.replace('insert', store_name)
        appended_nav.append(new_data)
        appended_nav.append(new_data+"?")
    
    for size_eg in size_int:
        new_d = size_eg.replace('insert', store_name)
        appended_size.append(new_d)
        appended_size.append(new_d+"?")

print(len(appended_nav))
print(len(appended_size))

2880
1920


Combine everything into one data set.

In [16]:
nav_frame = pd.DataFrame({"Data":appended_nav,"Label":[1]*len(appended_nav)})
size_frame = pd.DataFrame({"Data":appended_size,"Label":[2]*len(appended_size)})
diag_frame = pd.DataFrame({"Data":lol1,"Label":[0]*len(lol1)})

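data_set = pd.concat([nav_frame, size_frame, diag_frame]).reset_index(drop=True)
# Drop empty lines; copy so that adding a column below works on an independent frame.
data_set = data_set[data_set['Data'] != ""].copy()
# Flag rows that end with a question mark.
data_set['question'] = data_set['Data'].apply(lambda x: x.endswith("?"))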

In [17]:
data_set = data_set.sample(frac=1)
train_set = data_set[0:5000]
test_set = data_set[5000:]

In [18]:
train_docs = train_set['Data'].values.tolist()
train_labels = train_set['Label']

test_docs = test_set['Data'].values.tolist()
test_labels = test_set['Label']

# settings that you use for count vectorizer will go here
tfidf_vectorizer=TfidfVectorizer(use_idf=True)
train_tfidf=tfidf_vectorizer.fit_transform(train_docs)
test_tfidf=tfidf_vectorizer.transform(test_docs)
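
To see what the vectorizer computes, here is a hand-rolled sketch of TF-IDF in plain Python. It follows the default settings of `TfidfVectorizer` as documented by scikit-learn (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalisation, tokens of two or more word characters). The helper names `tokenize`, `fit_idf` and `transform` and the toy corpus are made up for the example; this is an illustration, not a replacement for the library.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring the default token pattern of TfidfVectorizer.
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_idf(docs):
    """Smoothed IDF per term: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1
            for term, count in df.items()}

def transform(doc, idf):
    """TF-IDF weights for one document, L2-normalised; unseen terms are dropped."""
    counts = Counter(t for t in tokenize(doc) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

# Toy corpus, invented for the example.
docs = ["where is gongcha", "should i upsize", "where is liho"]
idf = fit_idf(docs)
vec = transform("where is itea", idf)
print(vec)
```

Note that a word never seen during fitting (`itea` here) contributes nothing to the vector, which is why the vectorizer must be fitted on the training documents only and then reused unchanged on test data and in the running bot.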

### Intent recognizer

In [19]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
intent_clf = RandomForestClassifier()

intent_clf.fit(train_tfidf, train_labels)
test_preds = intent_clf.predict(test_tfidf)

# Get accuracy
acc = np.mean(test_preds == test_labels.values)
print("Acc: " + str(acc))
print("Confusion: ")
print(confusion_matrix(test_labels, test_preds))

Acc: 0.996104618809
Confusion: 
[[511   4   3]
 [  0 784   0]
 [  0   0 495]]
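
The confusion matrix carries more information than the single accuracy number. The sketch below recomputes accuracy from the matrix printed above and derives per-class recall and precision. It assumes the scikit-learn convention (rows are true labels, columns are predictions) and the label order 0 = dialogue, 1 = navigation, 2 = size used when the data frames were built.

```python
# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels, in the order 0, 1, 2.
confusion = [
    [511, 4, 3],
    [0, 784, 0],
    [0, 0, 495],
]

n_classes = len(confusion)
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(n_classes))
accuracy = correct / total

# Recall: share of each true class that was predicted correctly (row-wise).
recall = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]

# Precision: share of each predicted class that was right (column-wise).
precision = [confusion[i][i] / sum(row[i] for row in confusion)
             for i in range(n_classes)]

print(accuracy)
print(recall)
print(precision)
```

Read this way, all seven errors are dialogue lines mistaken for navigation (4) or size (3) requests, while no navigation or size request was ever classified as dialogue.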


In [55]:
# Save the classifier and vectorizer
with open('intent_rf.pkl', 'wb') as f:
    pickle.dump(intent_clf, f)
with open('tfidf_trans.pkl', 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)

In [28]:
bbt_store_dict

{'alley': ['alley', 'the alley'],
 'bober tea': ['bober tea', 'bober', 'bobertea'],
 'chicha sanchen': ['chicha sanchen',
  'chicha',
  'sanchen',
  'chi cha',
  'san chen'],
 'each a cup': ['each a cup', 'eachacup', 'each acup', 'eacha cup'],
 'gongcha': ['gongcha', 'gong cha'],
 'heytea': ['heytea', 'hey tea'],
 'hitea': ['hitea'],
 'holin': ['holin'],
 'itea': ['itea'],
 'jenjudan': ['jenjudan', 'jen ju dan'],
 'liho': ['liho', 'li ho'],
 'milksha': ['milksha'],
 'playmade': ['playmade', 'play made'],
 'sharetea': ['sharetea', 'share tea'],
 'tiger sugar': ['tiger sugar', 'tiger sugar'],
 'tptea': ['tptea', 'tp tea'],
 'woobbee': ['woobbee'],
 'xinfutang': ['xinfutang',
  'xingfutang',
  'xin fu tang',
  'xing fu tang',
  'xft']}

In [34]:
import re

def store_recognizer(input_string, store_list, all_store_dict):
    '''
    Find a bubble tea store name in the input string.
    Always returns the base form of the name that is in the database,
    or None if no known name is found.
    '''

    for store_name in store_list:
        # Escape the name so it is matched literally, and require word
        # boundaries so that e.g. 'itea' does not match inside 'hitea'.
        if re.search(r'\b' + re.escape(store_name) + r'\b', input_string):
            # Find the base form
            for key_string, val_list in all_store_dict.items():
                if store_name in val_list:
                    return key_string
    return None
    

In [50]:
def get_location(input_string):
    # Return the first run of six digits (a postal code), or None if there is none.
    output = re.findall(r"[0-9]{6}", input_string)
    return output[0] if output else None
    

In [42]:
!ls

Intent_trainer.ipynb  bbt_store_list.txt   dialogues.tsv    size_intent.txt
README.md	      common		   main_bot.py	    utils.py
__pycache__	      data		   movie_lines.tsv  week5-project.ipynb
bbt_locations.csv     dialogue_manager.py  nav_int.txt


In [51]:
data_loc = pd.read_csv('bbt_locations.csv')

In [52]:
data_loc

Unnamed: 0,Brand,Location_Name,Location_Address
0,gongcha,1 Raffles Place (1RP),"1 Raffles Place, #B1-39, Shopping Podium Tower..."
1,gongcha,313 Somerset (313),"313 Orchard Road, #01-37, Singapore 238895"
2,gongcha,Ang Mo Kio (AMK),"703 Ang Mo Kio Ave 3, #01-2531, Singapore 560703"
3,gongcha,Bugis Junction (BJ),"Bugis Junction, #03-08, The Victoria Street , ..."
4,gongcha,Causeway Point (CWP),"Causeway Point 1 Woodlands Square, #02-K10, Si..."
5,gongcha,Century Square (CSQ),"2 Tampines Central 5, #01-39, Century Square, ..."
6,gongcha,Changi Hospital (CGH),"6 Simei Street 3, #01-03, Singapore 529898"
7,gongcha,Compass One (COM),"Compass One 1, #02-40/41, Sengkang Square, Sin..."
8,gongcha,Eunos MRT (EUNOS),"30 Eunos Crescent, #01-10, Eunos MRT Station, ..."
9,gongcha,FUNAN Centre,"109 North Bridge Road, Singapore 179097"


In [53]:
current_location = '570163'
bubble_tea_store = 'gongcha'

# Copy the slice so that adding a column does not raise SettingWithCopyWarning.
sub_frame = data_loc[data_loc['Brand']==bubble_tea_store].copy()
sub_frame['Postal'] = sub_frame['Location_Address'].apply(get_location)


In [54]:
sub_frame.head()

Unnamed: 0,Brand,Location_Name,Location_Address,Postal
0,gongcha,1 Raffles Place (1RP),"1 Raffles Place, #B1-39, Shopping Podium Tower...",48616
1,gongcha,313 Somerset (313),"313 Orchard Road, #01-37, Singapore 238895",238895
2,gongcha,Ang Mo Kio (AMK),"703 Ang Mo Kio Ave 3, #01-2531, Singapore 560703",560703
3,gongcha,Bugis Junction (BJ),"Bugis Junction, #03-08, The Victoria Street , ...",188021
4,gongcha,Causeway Point (CWP),"Causeway Point 1 Woodlands Square, #02-K10, Si...",738099


In [38]:
def get_distance(current_location, bubble_tea_store):
    return 5


In [37]:
test = store_recognizer('bring me to itea', all_store_names, bbt_store_dict)
print(test)

itea


In [26]:
input_sentence = "should i upsize"

conv = tfidf_vectorizer.transform([input_sentence])
pred = intent_clf.predict(conv)

# predict() returns an array, so look at its single element.
if pred[0] == 0:
    print("Dialogue")
elif pred[0] == 1:
    print('Navigation')
else:
    print("Size")

Size


Pickle the RF Classifier as well as the vectorizer

Tea Shop Parser and Location Parser


In [33]:
!ls

Intent_trainer.ipynb  common		   intent_rf.pkl    tfidf_trans.pkl
README.md	      data		   main_bot.py	    utils.py
__pycache__	      db.sqlite3	   movie_lines.tsv  week5-project.ipynb
bbt_locations.csv     dialogue_manager.py  nav_int.txt
bbt_store_list.txt    dialogues.tsv	   size_intent.txt


In [85]:
response.status_code == 200


True


In [79]:
import requests
import pandas as pd
import numpy as np

def get_coords(input_postal):
    # Look up the X/Y coordinates of a postal code with the OneMap search API.
    # Note: verify=False disables TLS certificate verification.
    response = requests.get('https://developers.onemap.sg/commonapi/search',
                            params={'searchVal':input_postal,
                                   'returnGeom':'Y',
                                   'getAddrDetails':'N'},
                            verify=False)

    json_response = response.json()

    if json_response['found'] >= 1:
        x_coord = float(json_response['results'][0]['X'])
        y_coord = float(json_response['results'][0]['Y'])

    else:
        # Fall back to a fixed default point when the postal code is not found.
        x_coord = 30554.79254
        y_coord = 36683.05713
    
    return x_coord, y_coord

def calc_euclidean(coord1, coord2):
    return np.linalg.norm(np.array(coord1) - np.array(coord2), ord=2)

def calc_fastest_time(input_postal, brand, bbt_locations):
    x_coord, y_coord = get_coords(input_postal)
    # Copy the slice so that adding a column does not raise SettingWithCopyWarning.
    subframe = bbt_locations[bbt_locations['Brand'] == brand].copy()
    # Straight-line distance from the user to every outlet of the brand.
    subframe['distance'] = [calc_euclidean([x_coord, y_coord], [candx, candy])
                            for candx, candy in zip(subframe['X'], subframe['Y'])]
    subframe = subframe.sort_values('distance')
    best_place = subframe.iloc[0]
    best_dist = best_place['distance']
    best_add = best_place['Location_Address']
    # 60.4 is the constant used in the original code; its unit is not stated.
    time_taken = best_dist/60.4
    
    # Also return the sorted frame, since the call below unpacks three values.
    return time_taken, best_add, subframe

input_postal = '760817'
bbt_locations = pd.read_csv('bbt_locations.csv',  engine='python')
brand = 'liho'

a,b,c = calc_fastest_time(input_postal, brand, bbt_locations)



In [80]:
c

Unnamed: 0.1,Unnamed: 0,id,Brand,Location_Name,Location_Address,Zipcodes,SEARCHVAL,X,Y,LATITUDE,LONGITUDE,LONGTITUDE,distance
242,243,243,liho,LiHO @ Admiralty Place,"HDB Admiralty Place #01-15, 677 Woodlands Aven...",730677,ADMIRALTY PLACE,24482.05904,46785.54945,1.439386,103.801706,103.801706,4693.379250
207,208,208,liho,LiHO @ Jubilee Square,"Jubilee Square #01-14 , 61 Ang Mo Kio Avenue 8...",569814,61 ANG MO KIO AVENUE 8 SINGAPORE 569814,29608.90388,39310.59346,1.371786,103.847776,103.847776,4710.039274
234,235,235,liho,LiHO @ Greenwich V,"Greenwich V #01-20, 1 Seletar Road, 807011",807011,MORIAH SCHOOLHOUSE LLP,31986.01400,41036.67829,1.387395,103.869136,103.869136,4769.440208
225,226,226,liho,LiHO @ Ang Mo Kio Hub,"Ang Mo Kio Hub #02-66, 53 Ang Mo Kio Avenue 3,...",569933,DBS NTUC AMK HUB,29688.75118,39027.48929,1.369225,103.848493,103.848493,5003.733049
218,219,219,liho,LiHO @ Woodlands MRT Xchange,"Woodlands MRT Station (NS9) #01-16, 30 Woodlan...",738343,WOODLANDS MRT STATION (NS9),22741.59819,46501.78610,1.436820,103.786067,103.786067,5993.391179
208,209,209,liho,LiHO @ Thomson Plaza,"Thomson Plaza #01-104, 301 Upper Thomson Road,...",574408,DBS THOMSON BRANCH,27747.89951,37457.72240,1.355029,103.831053,103.831053,6313.686217
205,206,206,liho,LiHO @ Hougang 1,"Hougang 1 #01-37, 1 Hougang Street 91, 538692",538692,DBS NTUC HOUGANG POINT,33133.88460,39739.86400,1.375667,103.879450,103.879450,6465.460045
209,210,210,liho,LiHO @ Junction 8,"Junction 8 Shopping Centre #02-18A, 9 Bishan P...",579837,GOLDEN VILLAGE (GV BISHAN),29718.54699,36954.95700,1.350482,103.848761,103.848761,7004.222961
245,246,246,liho,LiHO @ Waterway Point,"Waterway Point (Shopping Centre) #B2-K10, 83 P...",828761,OCBC WATERWAY POINT BRANCH,35657.70133,43144.64168,1.406458,103.902130,103.902130,7610.223132
219,220,220,liho,LiHO @ Hougang Mall,"Hougang Mall #02-24, 90 Hougang Avenue 10, 538766",538766,HOUGANG MALL,34722.37003,39382.68690,1.372437,103.893724,103.893724,7962.815057
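
The lookup in `calc_fastest_time` boils down to a nearest-neighbour search over straight-line distances. Here is a minimal sketch of that idea in plain Python. The two outlets and their X/Y values are copied from the table above; the starting point `origin` and the helper name `nearest_outlet` are hypothetical, and the sketch assumes X and Y are planar coordinates in the same unit, so that the Euclidean distance is meaningful.

```python
import math

def nearest_outlet(origin, outlets):
    """Return (name, distance) of the outlet closest to `origin`.

    `origin` is an (x, y) pair and `outlets` maps a name to an (x, y) pair,
    all expressed in the same planar coordinate system.
    """
    def distance_to(name):
        x, y = outlets[name]
        return math.hypot(x - origin[0], y - origin[1])

    best = min(outlets, key=distance_to)
    return best, distance_to(best)

# X/Y values copied from the first two rows of the table above.
outlets = {
    "LiHO @ Admiralty Place": (24482.05904, 46785.54945),
    "LiHO @ Jubilee Square": (29608.90388, 39310.59346),
}
# Hypothetical starting point, chosen only for the example.
origin = (24500.0, 46800.0)

name, distance = nearest_outlet(origin, outlets)
print(name, distance)
```

A straight-line distance ignores roads and transport, so any travel time derived from it is only a rough estimate.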


In [None]:
def tea_shop_parser(input_sentence):
    return 'gongcha'  # String

def location_parser(input_sentence):
    return '570163'  # string of postal code
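
The two parsers above are placeholders that always return the same values. One possible way to fill them in is sketched below, using regular expressions. The small `STORE_ALIASES` table is a hypothetical excerpt in the shape of `bbt_store_dict`, and returning `None` when nothing is found is a design assumption rather than something the stubs prescribe.

```python
import re

# Hypothetical excerpt in the shape of `bbt_store_dict`:
# base name -> all accepted spellings.
STORE_ALIASES = {
    "gongcha": ["gongcha", "gong cha"],
    "liho": ["liho", "li ho"],
}

def tea_shop_parser(input_sentence, aliases=STORE_ALIASES):
    """Return the base name of the first store mentioned, or None."""
    text = input_sentence.lower()
    for base_name, spellings in aliases.items():
        for spelling in spellings:
            # Match the spelling literally and only as a whole word or phrase.
            if re.search(r"\b" + re.escape(spelling) + r"\b", text):
                return base_name
    return None

def location_parser(input_sentence):
    """Return the first standalone six-digit postal code as a string, or None."""
    match = re.search(r"\b\d{6}\b", input_sentence)
    return match.group(0) if match else None

sentence = "Take me to Gong Cha near 570163"
print(tea_shop_parser(sentence), location_parser(sentence))
```

Returning `None` lets the caller ask a follow-up question (for example, "which postal code are you at?") instead of silently using a default.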

In [None]:
import os
os.makedirs(RESOURCE_PATH['THREAD_EMBEDDINGS_FOLDER'], exist_ok=True)

for tag, count in counts_by_tag.items():
    tag_posts = posts_df[posts_df['tag'] == tag]
    
    tag_post_ids = ######### YOUR CODE HERE #############
    
    tag_vectors = np.zeros((count, embeddings_dim), dtype=np.float32)
    for i, title in enumerate(tag_posts['title']):
        tag_vectors[i, :] = ######### YOUR CODE HERE #############

    # Dump post ids and vectors to a file.
    filename = os.path.join(RESOURCE_PATH['THREAD_EMBEDDINGS_FOLDER'], os.path.normpath('%s.pkl' % tag))
    with open(filename, 'wb') as f:
        pickle.dump((tag_post_ids, tag_vectors), f)
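
For the per-title vectors in the loop above, one common choice is the mean of the embeddings of the words in the title. The sketch below shows that idea in plain Python; it is an assumption about what could go into the gap, not the reference solution. The function name `question_to_vec`, whitespace tokenisation, and the toy two-dimensional embeddings are all hypothetical.

```python
def question_to_vec(question, embeddings, dim):
    """Average the embeddings of the known words in `question`.

    `embeddings` maps a word to a list of `dim` floats. Words without an
    embedding are ignored; if none is known, a zero vector is returned.
    """
    vectors = [embeddings[word] for word in question.split() if word in embeddings]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) groups the i-th components of all word vectors together.
    return [sum(component) / len(vectors) for component in zip(*vectors)]

# Toy embeddings, made up for this example.
embeddings = {"python": [1.0, 0.0], "list": [0.0, 1.0]}

vec = question_to_vec("sort a python list", embeddings, dim=2)
print(vec)
```

Averaging keeps the vector size independent of the title length, which is what allows all titles of a tag to be stacked into one matrix.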