# <center>Varun_Kumar(142002018)</center>

## <center>Building search engine using tf-idf for stack overflow </center>

### About Dataset

<b> Context </b>

Full text of all questions and answers from Stack Overflow that are tagged with the python tag. Useful for natural language processing and community analysis. See also the dataset of R questions.

<b> Content</b>

This dataset is organized as three tables:

   1. Questions contains the title, body, creation date, score, and owner ID for each Python question.
   2. Answers contains the body, creation date, score, and owner ID for each of the answers to these questions. The ParentId column links back to the Questions table.
   3. Tags contains the tags on each question besides the Python tag.

Questions may be deleted by the user who posted them. They can also be closed by community vote, if the question is deemed off-topic for instance. Such questions are not included in this dataset.

The dataset contains questions all questions asked between August 2, 2008 and Ocotober 19, 2016.

<b> License </b>

    All Stack Overflow user contributions are licensed under CC-BY-SA 3.0 with attribution required.



In [1]:
import os
import re
import time
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity 

### loading Questions.csv

In [2]:
data=pd.read_csv("Questions.csv", delimiter=',', encoding='ISO-8859-1')

### i am taking 100000 rows

In [3]:
data=data.head(100000)

In [4]:
data.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body
0,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...
1,502,147.0,2008-08-02T17:01:58Z,27,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...
2,535,154.0,2008-08-02T18:43:54Z,40,Continuous Integration System for a Python Cod...,<p>I'm starting work on a hobby project with a...
3,594,116.0,2008-08-03T01:15:08Z,25,cx_Oracle: How do I iterate over a result set?,<p>There are several ways to iterate over a re...
4,683,199.0,2008-08-03T13:19:16Z,28,Using 'in' to match an attribute of Python obj...,<p>I don't remember whether I was dreaming or ...


### Shape of data

In [5]:
data.shape

(100000, 6)

In [6]:
data['title_body']=data['Title']+data['Body']

In [7]:
data.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body,title_body
0,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,How can I find the full path to a font from it...
1,502,147.0,2008-08-02T17:01:58Z,27,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...,Get a preview JPEG of a PDF on Windows?<p>I ha...
2,535,154.0,2008-08-02T18:43:54Z,40,Continuous Integration System for a Python Cod...,<p>I'm starting work on a hobby project with a...,Continuous Integration System for a Python Cod...
3,594,116.0,2008-08-03T01:15:08Z,25,cx_Oracle: How do I iterate over a result set?,<p>There are several ways to iterate over a re...,cx_Oracle: How do I iterate over a result set?...
4,683,199.0,2008-08-03T13:19:16Z,28,Using 'in' to match an attribute of Python obj...,<p>I don't remember whether I was dreaming or ...,Using 'in' to match an attribute of Python obj...


### preprocess text data

In [8]:
def preprocess(x):

    x = x.lower()

    # remove tags
    x = re.sub("</?.*?>"," <> ", x)

    # remove special characters and digits
    x = re.sub("(\\d|\\W)+"," ", x).strip()
    return x

In [9]:
data['title_body']  = data['title_body'] .fillna("").apply(preprocess)

In [10]:
data.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body,title_body
0,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,how can i find the full path to a font from it...
1,502,147.0,2008-08-02T17:01:58Z,27,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...,get a preview jpeg of a pdf on windows i have ...
2,535,154.0,2008-08-02T18:43:54Z,40,Continuous Integration System for a Python Cod...,<p>I'm starting work on a hobby project with a...,continuous integration system for a python cod...
3,594,116.0,2008-08-03T01:15:08Z,25,cx_Oracle: How do I iterate over a result set?,<p>There are several ways to iterate over a re...,cx_oracle how do i iterate over a result set t...
4,683,199.0,2008-08-03T13:19:16Z,28,Using 'in' to match an attribute of Python obj...,<p>I don't remember whether I was dreaming or ...,using in to match an attribute of python objec...


### converting title_body into vector using tf-idf

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=5000)
vectorizer.fit(data['title_body']) 
# we use the fitted CountVectorizer to convert the text to vector
main = vectorizer.transform(data['title_body'])
main.get_shape()
#vectorizer.get_feature_names()

(100000, 5000)

In [12]:
#doc_array = vectorizer.transform(data['title_body']).toarray()
#frequency_matrix = pd.DataFrame(doc_array,index=data['title_body'],columns=vectorizer.get_feature_names())
frequency_matrix = pd.DataFrame(main.toarray(),columns=vectorizer.get_feature_names())
frequency_matrix.head()

Unnamed: 0,__,__add__,__attribute__â,__builtin__,__call__,__class__,__del__,__dict__,__doc__,__eq__,...,zipfile,zlib,zmq,zone,zoom,zope,½ï,¾à,ï¼,ï¾
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [13]:
#frequency_matrix.to_csv('vector.csv', index=False)

### loading Answers.csv file and rename column

In [14]:
ans=pd.read_csv("Answers.csv", delimiter=',', encoding='ISO-8859-1')
ans.columns = ans.columns.str.replace('Id', 'IN')
ans.columns = ans.columns.str.replace('ParentIN', 'Id')
ans.columns = ans.columns.str.replace('Body', 'Answer')
ans.columns = ans.columns.str.replace('Score', 'Rating')

In [15]:
ans.head()

Unnamed: 0,IN,OwnerUserIN,CreationDate,Id,Rating,Answer
0,497,50.0,2008-08-02T16:56:53Z,469,4,<p>open up a terminal (Applications-&gt;Utilit...
1,518,153.0,2008-08-02T17:42:28Z,469,2,<p>I haven't been able to find anything that d...
2,536,161.0,2008-08-02T18:49:07Z,502,9,<p>You can use ImageMagick's convert utility f...
3,538,156.0,2008-08-02T18:56:56Z,535,23,<p>One possibility is Hudson. It's written in...
4,541,157.0,2008-08-02T19:06:40Z,535,20,"<p>We run <a href=""http://buildbot.net/trac"">B..."


### taking input(title and body) from user

In [16]:
title_inp=input(str())
body_inp=input(str())
text_d=title_inp+body_inp
data_fr = {'file': [text_d],'fie': [text_d]}
data_fr = pd.DataFrame(data_fr)
data_fr["file"]= data_fr["file"]. fillna("").apply(preprocess)
data_fr = vectorizer.transform(data_fr['file'])
doc_arra = pd.DataFrame(data_fr.toarray(),columns=vectorizer.get_feature_names())

How can I find the full path to a font



### input text covert into vector 

In [17]:
doc_arra.iloc[0:1]

Unnamed: 0,__,__add__,__attribute__â,__builtin__,__call__,__class__,__del__,__dict__,__doc__,__eq__,...,zipfile,zlib,zmq,zone,zoom,zope,½ï,¾à,ï¼,ï¾
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
from scipy.spatial.distance import cosine
data["sim"]=1
data["sim"]=data["sim"].astype(float)
for i in range(100000):
    data["sim"][i]=1-cosine(doc_arra.iloc[0:1], frequency_matrix.iloc[i:i+1])

In [19]:
data.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body,title_body,sim
0,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...,how can i find the full path to a font from it...,0.562909
1,502,147.0,2008-08-02T17:01:58Z,27,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...,get a preview jpeg of a pdf on windows i have ...,0.021914
2,535,154.0,2008-08-02T18:43:54Z,40,Continuous Integration System for a Python Cod...,<p>I'm starting work on a hobby project with a...,continuous integration system for a python cod...,0.018605
3,594,116.0,2008-08-03T01:15:08Z,25,cx_Oracle: How do I iterate over a result set?,<p>There are several ways to iterate over a re...,cx_oracle how do i iterate over a result set t...,0.026863
4,683,199.0,2008-08-03T13:19:16Z,28,Using 'in' to match an attribute of Python obj...,<p>I don't remember whether I was dreaming or ...,using in to match an attribute of python objec...,0.0128


### Output

In [22]:
print("Score = "+str(data['sim'].max()))
aa=data[data.sim == data.sim.max()]
aa=aa.head(1)
ans = pd.merge(aa, ans, on='Id')
ans=ans[ans.Rating == ans.Rating.max()]
print("Rating = "+str(ans["Rating"].max()))
print("*******************************************************************************************")
print("Tile")
print(list(aa['Title']))
print("*******************************************************************************************")
print("Body")
print(list(aa['Body']))
print("*******************************************************************************************")
print("Answer")
print(list(ans['Answer']))

Score = 0.5629086099700047
Rating = 12
*******************************************************************************************
Tile
['How can I find the full path to a font from its display name on a Mac?']
*******************************************************************************************
Body
["<p>I am using the Photoshop's javascript API to find the fonts in a given PSD.</p>\n\n<p>Given a font name returned by the API, I want to find the actual physical font file that that font name corresponds to on the disc.</p>\n\n<p>This is all happening in a python program running on OSX so I guess I'm looking for one of:</p>\n\n<ul>\n<li>Some Photoshop javascript</li>\n<li>A Python function</li>\n<li>An OSX API that I can call from python</li>\n</ul>\n"]
*******************************************************************************************
Answer
["<p>Unfortunately the only API that isn't deprecated is located in the ApplicationServices framework, which doesn't have a bridge s

## Elstic Search 

### loading dataset

In [20]:
FILE_PATH = os.path.join('Questions.csv')
print('Reading the Questions file...')
df = pd.read_csv(FILE_PATH, delimiter=',', encoding='ISO-8859-1')
print('done')

Reading the Questions file...
done


### i am taking 100000 rows

In [21]:
df=df.head(100000)

In [22]:
df.head()

Unnamed: 0,Id,OwnerUserId,CreationDate,Score,Title,Body
0,469,147.0,2008-08-02T15:11:16Z,21,How can I find the full path to a font from it...,<p>I am using the Photoshop's javascript API t...
1,502,147.0,2008-08-02T17:01:58Z,27,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...
2,535,154.0,2008-08-02T18:43:54Z,40,Continuous Integration System for a Python Cod...,<p>I'm starting work on a hobby project with a...
3,594,116.0,2008-08-03T01:15:08Z,25,cx_Oracle: How do I iterate over a result set?,<p>There are several ways to iterate over a re...
4,683,199.0,2008-08-03T13:19:16Z,28,Using 'in' to match an attribute of Python obj...,<p>I don't remember whether I was dreaming or ...


### preprocess text data

In [23]:
def preprocess(title, body=None):
    """ Preprocess the input, i.e. lowercase, remove html tags, special character and digits."""
    text = ''
    if body is None:
        text = title
    else:
        text = title + body
    # to lower case
    text = text.lower()

    # remove tags
    text = re.sub("</?.*?>"," <> ", text)

    # remove special characters and digits
    text = re.sub("(\\d|\\W)+"," ", text).strip()
    return text

In [24]:
# Preprocess the corpus
data = [preprocess(title, body) for title, body in zip(df['Title'], df['Body'])]

In [25]:
def create_index(es_client):
    """ Creates an Elasticsearch index."""
    is_created = False
    # Index settings
    settings = {
        "settings": {
            "number_of_shards": 2,
            "number_of_replicas": 1
        },
        "mappings": {
            "dynamic": "true",
            "_source": {
            "enabled": "true"
            },
            "properties": {
                "body": {
                    "type": "text"
                }
            }
        }
    }
    print('Creating `Question` index...')
    try:
        if es_client.indices.exists(INDEX_NAME):
            es_client.indices.delete(index=INDEX_NAME, ignore=[404])
        es_client.indices.create(index=INDEX_NAME, body=settings)
        is_created = True
        print('index `Question` created successfully.')
    except Exception as ex:
        print(str(ex))
    finally:
        return is_created
    return is_created



def index_data(es_client, data, BATCH_SIZE=100000):
    """ Indexs all the rows in data (python questions)."""
    docs = []
    count = 0
    for line in data:
        js_object = {}
        js_object['body'] = line
        docs.append(js_object)
        count += 1

        if count % BATCH_SIZE == 0:
            index_batch(docs)
            docs = []
            print('Indexed {} documents.'.format(count))
    if docs:
        index_batch(docs)
        print('Indexed {} documents.'.format(count))

    es_client.indices.refresh(index=INDEX_NAME)
    print("Done indexing.")


def index_batch(docs):
    """ Indexes a batch of documents."""
    requests = []
    for i, doc in enumerate(docs):
        request = doc
        request["_op_type"] = "index"
        request["_index"] = INDEX_NAME
        request["body"] = doc['body']
        requests.append(request)
    bulk(es_client, requests)



In [27]:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
import json
import time

### connecting with Elasticsearch local server

In [28]:
INDEX_NAME = 'python_questions'


# Create the client instance
es_client= Elasticsearch("http://localhost:9200")

create_index(es_client)
index_data(es_client, data)

Creating `Question` index...
Positional arguments can't be used with Elasticsearch API methods. Instead only use keyword arguments.
Indexed 100000 documents.
Done indexing.


In [29]:
def run_query_loop():
    """ Asks user to enter a query to search."""
    while True:
        try:
            handle_query()
        except KeyboardInterrupt:
            break
    return


def handle_query():
    """ Searches the user query and finds the best matches using elasticsearch."""
    query = input("Enter query: ")

    search_start = time.time()
    search = {"size": SEARCH_SIZE,"query": {"match": {"body": query}}}
    print(search)
    response = es_client.search(index=INDEX_NAME, body=search)
    search_time = time.time() - search_start
    print()
    print("{} total hits.".format(response["hits"]["total"]["value"]))
    print("search time: {:.2f} ms".format(search_time * 1000))
    for hit in response["hits"]["hits"]:
        print("id: {}, score: {}".format(hit["_id"], hit["_score"]))
        print(hit["_source"])
        print()

### output using Elasticsearch

In [30]:
SEARCH_SIZE = 2
run_query_loop()

Enter query: How can I find the full path to a font
{'size': 2, 'query': {'match': {'body': 'How can I find the full path to a font'}}}

10000 total hits.
search time: 133.79 ms
id: -E5DGoIB0sjRDY1oPeN3, score: 23.978268
{'body': 'how can i find the full path to a font from its display name on a mac i am using the photoshop s javascript api to find the fonts in a given psd given a font name returned by the api i want to find the actual physical font file that that font name corresponds to on the disc this is all happening in a python program running on osx so i guess i m looking for one of some photoshop javascript a python function an osx api that i can call from python'}

id: t09DGoIB0sjRDY1oUzPT, score: 17.170975
{'body': 'find full path of the python interpreter how do i find the full path of the currently running python interpreter from within the currently executing python script'}

