In [1]:
#housekeeping
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Objective

Given a set of ~5000 RFI questions collected from past RFIs, we will implement a question "similarity search".

A user will be able to input an RFI question in its natural form, and this program will return the historical RFI questions that are most similar to it.

Since the historical questions have already been answered, this tool provides the potential to programmatically fill in portions of an RFI!

**Credit:** parts of this code are taken from https://radimrehurek.com/gensim/tutorial.html

**Python modules used:** pandas, nltk and gensim. They can be installed using pip.


## Part 1: Meet the data

We start with a raw csv file of historical RFI questions.

We'll load it into a pandas DataFrame, which is a good framework for visualizing tabular data.

In [2]:
from IPython.display import display
import pandas as pd

pd.set_option('display.max_colwidth', -1)

data = pd.read_csv('RFX01-10172016.csv') #Pandas DataFrame
#Note this dataset is only a few MB in size so we read it all into RAM at once
#For large datasets (>4GB) we would read from disk in chunks

print("Total number of questions: {}".format(len(data)))
display(data.head())

Total number of questions: 4927


Unnamed: 0,id,category,question,answer,origin,date,url
0,1,,"Does your company (you) have an Information Security Policy document that is approved by management and published and communicated to all employees, contractors, temporary personnel, and relevant external parties?","Yes, Google is committed to the security of all information stored on its computer systems. This commitment is regularly reinforced by management and is outlined in the Google Code of Conduct, which is posted on Google’s website at https://abc.xyz/investor/other/code-of-conduct.html Google’s Security Philosophy is also outlined at the following page: http://www.google.com/intl/en/corporate/security.html.\nThe foundation of Google’s commitment to security is its set of security policies that cover physical, network and computer systems, applications services, systems services, account, data, data center security, corporate services, change management, and incident response. These policies are reviewed on a regular basis by Google's executive team to help ensure their continued effectiveness and accuracy.\nIn addition to these security policies, with which all persons employed by Google must comply, Google has a dedicated Security Team, which has responsibility for a security program that includes awareness raising, internal advocacy, training, strategic reviews, reviews and compliance audits. Google also requires that all third-party contractors comply with these security policies as well.",McKesson,2016-07-15,https://docs.google.com/a/google.com/document/d/1YLVFPVFxmKXrpy4C96EqfFLR-RHIW0w0Ke8w8X5qquI/edit?usp=sharing
1,2,,"If “yes,” then please provide a copy.",Google does not provide a copy of the security policy to customers.,McKesson,2016-07-15,https://docs.google.com/a/google.com/document/d/1YLVFPVFxmKXrpy4C96EqfFLR-RHIW0w0Ke8w8X5qquI/edit?usp=sharing
2,3,,Are your Information Security Policies based upon a standard framework or set of regulatory requirements (e.g. ISO27001 or NIST)?,"There are elements of NIST 800 in our practices and we are certified against ISO 27001. We are also PCI DSS 3.0 certified, the 3.1 audit is underway and will be completed this quarter. We have just certified against ISO 27017 and ISO 27081 relating to cloud security and privacy. These standards serve us well with our global customer base.",McKesson,2016-07-15,https://docs.google.com/a/google.com/document/d/1YLVFPVFxmKXrpy4C96EqfFLR-RHIW0w0Ke8w8X5qquI/edit?usp=sharing
3,4,,"Security Roles. How are the security roles and responsibilities of employees, contractors, temporary personnel, and third party users of your company’s information systems defined and documented in accordance with your Information Security Policy?","Google develops, disseminates, and regularly reviews formal, documented information and asset security policies that address purpose, scope, roles, and responsibilities.",McKesson,2016-07-15,https://docs.google.com/a/google.com/document/d/1YLVFPVFxmKXrpy4C96EqfFLR-RHIW0w0Ke8w8X5qquI/edit?usp=sharing
4,5,,"Security Breach. What is your policy, procedure and escalation process in event of a security breach/incident?","Google has an incident management process for security events that may affect the confidentiality, integrity, or availability of its systems or data. This process specifies courses of action, procedures for notification, escalation, mitigation, and documentation. To help ensure the swift resolution of security incidents, the Google information security team is available 24x7 to all Google employees.In addition, Google proactively searches for security incidents on an ongoing basis, by actively reviewing inbound security reports, monitoring public mailing lists and blog posts, and tracking automated perimeter systems. When an information security incident occurs, Google’s security staff responds promptly in a manner commensurate with the threat level. Notification of the incident may be generated automatically by Google’s monitoring systems or manually by a Google employee.Google works closely with the security community to track reported issues in Google services and open source tools.",McKesson,2016-07-15,https://docs.google.com/a/google.com/document/d/1YLVFPVFxmKXrpy4C96EqfFLR-RHIW0w0Ke8w8X5qquI/edit?usp=sharing


*^first few rows of the pandas DataFrame.*

## Part 2: Pre-process data

### Extract list of questions

The column that we care about for implementing our similarity search is the questions column.

We will extract it from the pandas DataFrame and convert it to a built-in Python list of questions.

To keep with NLP parlance we will refer to each question as a 'document'. The list of documents is our 'corpus'.


In [3]:
documents = data['question'].tolist()
documents[0] 

'Does your company (you) have an Information Security Policy document that is approved by management and published and communicated to all employees, contractors, temporary personnel, and relevant external parties?'

*^the first document (question) in our corpus (list of questions)*

###  Remove punctuation and make case insensitive

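We lower-case every document and strip punctuation so that the same word always maps to the same token regardless of capitalization or adjacent punctuation (e.g. "Policy" and "policy?" both become "policy").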

In [4]:
import string

documents_nopunc = [unicode(document.lower().translate(None, string.punctuation),errors='replace')
             for document in documents] #changing to unicode to avoid NLTK conflicts later
documents_nopunc[0]

u'does your company you have an information security policy document that is approved by management and published and communicated to all employees contractors temporary personnel and relevant external parties'

*^document is now lower case and without punctuation*
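
The cell above relies on Python 2's two-argument `str.translate(None, chars)` and on the `unicode` built-in, neither of which exists in Python 3. As a minimal sketch, assuming Python 3, the same step could look like this (the helper name `strip_punctuation` is hypothetical and not part of the original notebook):

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(document):
    """Lower-case a document and remove ASCII punctuation."""
    return document.lower().translate(_PUNCT_TABLE)

print(strip_punctuation("Do you support IPv6?"))  # do you support ipv6
```

Note that `string.punctuation` only covers ASCII characters, so typographic quotes such as ’ are left in place by this sketch.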

###  Remove stopwords

In NLP parlance, stopwords are "low information" words (e.g. at, that, him). Removing them leaves the words that carry the meaning of the question.

In [5]:
import nltk #natural language toolkit

#if you get an error saying 'corpora/stopwords' not found,
    #uncomment the below line and follow the prompts to download the corpus
#nltk.download()

stopwords = nltk.corpus.stopwords.words("english") #As of 10-22-2016 there were 153 stopwords

texts = [[word for word in document.split() if word not in stopwords]
         for document in documents_nopunc] #texts is now a list of lists (one list of tokens per document)
texts[0]

[u'company',
 u'information',
 u'security',
 u'policy',
 u'document',
 u'approved',
 u'management',
 u'published',
 u'communicated',
 u'employees',
 u'contractors',
 u'temporary',
 u'personnel',
 u'relevant',
 u'external',
 u'parties']

*^the first document after tokenizing and removing stopwords* 
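
`nltk.corpus.stopwords.words("english")` returns a list, so `word not in stopwords` scans that list for every token. That is fine for a corpus this small, but converting the list to a set makes each lookup constant-time. A minimal sketch, using a tiny hypothetical stopword subset so it runs without NLTK (`remove_stopwords` and `stopword_set` are illustrative names, not from the notebook):

```python
# Hypothetical, tiny stopword subset so the sketch runs without NLTK.
# In the notebook this would be set(nltk.corpus.stopwords.words("english")).
stopword_set = {"do", "does", "you", "your", "have", "an", "that", "is", "by", "and", "to", "all"}

def remove_stopwords(document, stopword_set):
    """Split on whitespace and drop every token found in stopword_set."""
    return [word for word in document.split() if word not in stopword_set]

print(remove_stopwords("do you support ipv6", stopword_set))  # ['support', 'ipv6']
```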

### Stem the text

Stemming means reducing words to their root form (e.g. running -> run). The resulting stem is not always a real word.

In [6]:
stemmer = nltk.stem.snowball.SnowballStemmer("english")
texts = [[stemmer.stem(word) for word in text] for text in texts]
texts[0]

[u'compani',
 u'inform',
 u'secur',
 u'polici',
 u'document',
 u'approv',
 u'manag',
 u'publish',
 u'communic',
 u'employe',
 u'contractor',
 u'temporari',
 u'personnel',
 u'relev',
 u'extern',
 u'parti']

*^the first document after stemming. pre-processing complete!* 

## Part 3: Vector Representation

In order to do any computation on these words, we need a numeric representation of them, called a vector.

We will use a library called gensim to generate vector representations of each document (RFI question) in our corpus.

The simplest vector representation of a document is called "bag-of-words": it is simply a frequency count of the words in the document.

First we will generate a dictionary of the words in the corpus.

In [7]:
from gensim import corpora

#first count unique words
dictionary = corpora.Dictionary(texts)
print(dictionary) 

2017-03-06 15:19:34,364 : INFO : 'pattern' package not found; tag filters are not available for English
2017-03-06 15:19:34,386 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2017-03-06 15:19:34,529 : INFO : built Dictionary(4213 unique tokens: [u'osdbappetc', u'leveragesupport', u'profici', u'untrust', u'ansibl']...) from 4927 documents (total 58373 corpus positions)


Dictionary(4213 unique tokens: [u'osdbappetc', u'leveragesupport', u'profici', u'untrust', u'ansibl']...)


*^count of how many unique words are in the corpus. This will also be the length of our vector representation.*

### Generate bag-of-words vector representations of the documents

In [8]:
corpus_bow = [dictionary.doc2bow(text) for text in texts]
corpus_bow[300]

[(17, 1), (103, 1), (275, 1), (298, 1), (311, 2), (673, 1), (888, 1), (891, 1)]

*^the bag of words vector representation of the document with index 300*

#### A note on how to interpret a bag of words vector representation

Take the vector representation: [(4,2),(3003,1)]

This means the word in the dictionary with id:4 appears 2 times in the document, and the word in the dictionary with id:3003 appears once. No other words in our dictionary appear in the document.
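
To make the sparse format concrete, here is a small pure-Python sketch of the two steps: assigning an id to each unique token, then counting ids per document. It is an illustration only, not gensim's implementation; the function names and the toy data are hypothetical, and gensim may assign ids in a different order.

```python
from collections import Counter

def build_dictionary(texts):
    """Assign an integer id to each unique token, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc2bow(text, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from the dictionary are ignored."""
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())

# Toy corpus of already pre-processed documents (hypothetical data).
toy_texts = [["support", "ipv6"], ["secur", "polici", "secur"]]
token2id = build_dictionary(toy_texts)

print(doc2bow(["secur", "polici", "secur", "unknown"], token2id))  # [(2, 2), (3, 1)]
```

Only the words that actually occur are stored, which is why a vector with thousands of dimensions can be written as a handful of pairs.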

### Generate Tf-Idf vector representations

Now we have a vector representation that we can do computation with!

However, this representation is pretty dumb.

For one thing, it treats all words as equally valuable. For another, it ignores any context or relationship between words.

Representations that understand context are a very active area of NLP research.

However, we can address the first issue easily using a technique called "term frequency-inverse document frequency", Tf-Idf for short. Tf-Idf counts how many documents in the corpus contain each word and weights rarer words higher, so a word that appears in almost every question contributes little to the similarity score.
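
The weighting can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not gensim's code: it uses a base-2 logarithm of (number of documents / document frequency) and scales each vector to unit length, which I believe matches the `TfidfModel` defaults. The function names and toy corpus are hypothetical.

```python
import math

def fit_idf(corpus_bow):
    """Compute log2(N / df) per token id, where df is the number of
    documents that contain the id and N is the number of documents."""
    num_docs = len(corpus_bow)
    doc_freq = {}
    for bow in corpus_bow:
        for token_id, _count in bow:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    return {token_id: math.log2(num_docs / df) for token_id, df in doc_freq.items()}

def to_tfidf(bow, idf):
    """Weight raw counts by IDF, then scale the vector to unit length."""
    weighted = [(token_id, count * idf[token_id]) for token_id, count in bow if token_id in idf]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0:
        return []
    # Drop zero weights (tokens that occur in every document).
    return [(token_id, w / norm) for token_id, w in weighted if w > 0]

# Toy bag-of-words corpus: id 0 is common (3 of 4 documents), ids 1-3 are rare.
toy_corpus = [[(0, 1), (1, 1)], [(0, 1)], [(0, 1), (2, 2)], [(3, 1)]]
idf = fit_idf(toy_corpus)
print(to_tfidf(toy_corpus[0], idf))
```

In the first toy document both words occur once, yet the rare word (id 1) ends up with a much larger weight than the common one (id 0).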

In [9]:
from gensim import models

tfidf = models.TfidfModel(corpus_bow) #initialize model
corpus_tfidf = tfidf[corpus_bow]
corpus_tfidf[300]

2017-03-06 15:19:34,703 : INFO : collecting document frequencies
2017-03-06 15:19:34,704 : INFO : PROGRESS: processing document #0
2017-03-06 15:19:34,737 : INFO : calculating IDF weights for 4927 documents and 4212 features (52998 matrix non-zeros)


[(17, 0.10911823407392553),
 (103, 0.10264638952738894),
 (275, 0.20675534930899012),
 (298, 0.28208901511763435),
 (311, 0.6922670284752578),
 (673, 0.24396783852207624),
 (888, 0.33049484641040283),
 (891, 0.45525174831155446)]

*^the tf-idf vector representation of the document with index 300*

Note that the word ids are the same as in the bag-of-words vector, but the relative weightings are quite different.

For example, ids 103 and 888 both appear the same number of times in the document (once), yet id 888 is weighted roughly three times higher (0.33 vs. 0.10). The next cell shows how to look up the word behind a dictionary id.

In [10]:
print("Word 102: {}".format(dictionary[102]))
print("Word 887: {}".format(dictionary[887]))

Word 102: identifyfil
Word 887: passwordpin


*^`dictionary[id]` returns the stemmed token for an id. A word that is common across the corpus receives a low Tf-Idf weight, while a rare word receives a high one.*

## Part 4: Generate similarity indexes

For comparison's sake we will build indexes for both our vector representations: bag of words and tf-idf

In [11]:
from gensim import similarities

index_bow = similarities.MatrixSimilarity(corpus_bow)
index_tfidf = similarities.MatrixSimilarity(corpus_tfidf)

2017-03-06 15:19:34,829 : INFO : creating matrix with 4927 documents and 4213 features
2017-03-06 15:19:35,175 : INFO : creating matrix with 4927 documents and 4213 features


## Part 5: Build vector representations of our input question

### Input Question 
Input any RFI question here. Play with it and try your own!

In [12]:
question = "Do you support IPv6?"
#question = "Describe your ISO compliance"

### Pre-process input question

To build a comparable vector representation we must do all the same pre-processing on our input question that we did on our corpus.

Here I will consolidate those steps into a single function I can pass an input question to.

In [13]:
def normalize(text):
    text = unicode(text.lower().translate(None, string.punctuation),errors='replace')
    text = [word for word in text.split() if word not in stopwords]
    text = [stemmer.stem(word) for word in text]
    return text

question_normalized = normalize(question)
question_normalized

[u'support', u'ipv6']

### Generate input question vectors

In [14]:
question_bow = dictionary.doc2bow(question_normalized)
question_tfidf = tfidf[question_bow]

print("Bag of words:\n{}".format(question_bow))
print("Tf-Idf:\n{}".format(question_tfidf))

Bag of words:
[(71, 1), (2520, 1)]
Tf-Idf:
[(71, 0.2890904017817321), (2520, 0.9573018017311348)]


## Part 6: Execute similarity search

Now that we have a representation of our search question in the same vector space as our corpus documents, we can calculate cosine similarity.

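Cosine similarity refers to the cosine of the angle between two vectors. In general it lies between -1 and 1. Because our bag-of-words and Tf-Idf weights are never negative, the scores here fall between 0 and 1, where 1 means the two vectors point in the same direction and 0 means they share no words.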
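
As a sanity check on what the similarity index computes, here is a pure-Python sketch of cosine similarity for sparse `(id, weight)` vectors. It is an illustration rather than gensim's implementation; `cosine_similarity` is a hypothetical helper, and the one-word document vector is an assumption made for the example.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors given as (id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b.get(token_id, 0.0) for token_id, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

question_bow = [(71, 1), (2520, 1)]  # bag-of-words vector of the input question (In [14])
one_word_doc = [(71, 1)]             # assumed: a document made up of only the word with id 71

print(round(cosine_similarity(question_bow, one_word_doc), 4))  # 0.7071
```

The value is 1/sqrt(2): the two vectors share one of the question's two words. If the one-word question "Support" maps to the vector assumed here, this is consistent with the 0.707 score it receives in the bag-of-words results.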

### Tf-Idf Results

In [15]:
def display_sims(sims,number_of_results):
    #sort results
    sims = sorted(enumerate(sims), key=lambda item: -item[1])[:number_of_results]
    
    #display
    for sim in sims:
        print("Similarity Score: {}".format(sim[1]))
        display(data.loc[sim[0]].to_frame().drop(['id','category']))
        
display_sims(index_tfidf[question_tfidf],3)

Similarity Score: 0.710563600063


Unnamed: 0,1958
question,IPv6 and IPv4 Support Access
answer,Y - IPv4 is supported and IPv6 is supported for select consumer services.
origin,Foot Locker
date,2016-07-01
url,https://docs.google.com/spreadsheets/d/1AWGCQU51Rw-zmO7UEkzR64J-W5G392itY07VWPRsuyI


Similarity Score: 0.482242643833


Unnamed: 0,2277
question,Simultaneous IPv4 and IPv6 utilization where necessary
answer,[Google] Google Compute Engine currently only supports IPv4.
origin,Mayo
date,2015-07-01
url,https://docs.google.com/document/d/1u8eCTx7yeuamnEndpS65KgvjRo1aZK0ODj6CwlYuAPQ


Similarity Score: 0.289090394974


Unnamed: 0,1985
question,Support
answer,Y- Google internal support offered. Link to support options can be found here: https://cloud.google.com/support/
origin,Foot Locker
date,2016-07-01
url,https://docs.google.com/spreadsheets/d/1AWGCQU51Rw-zmO7UEkzR64J-W5G392itY07VWPRsuyI


### Bag of words results

Run this cell for comparison only. The bag-of-words results are expected to be worse, because every word counts equally regardless of how common it is in the corpus.

In [16]:
display_sims(index_bow[question_bow],3)

Similarity Score: 0.707106769085


Unnamed: 0,1958
question,IPv6 and IPv4 Support Access
answer,Y - IPv4 is supported and IPv6 is supported for select consumer services.
origin,Foot Locker
date,2016-07-01
url,https://docs.google.com/spreadsheets/d/1AWGCQU51Rw-zmO7UEkzR64J-W5G392itY07VWPRsuyI


Similarity Score: 0.707106769085


Unnamed: 0,1985
question,Support
answer,Y- Google internal support offered. Link to support options can be found here: https://cloud.google.com/support/
origin,Foot Locker
date,2016-07-01
url,https://docs.google.com/spreadsheets/d/1AWGCQU51Rw-zmO7UEkzR64J-W5G392itY07VWPRsuyI


Similarity Score: 0.577350258827


Unnamed: 0,4100
question,"Is SFTP supported with your solution. If not, when it will it be supported?"
answer,Yes
origin,Pfizer Cloud Storage
date,2015-12-01
url,https://docs.google.com/spreadsheets/d/1vYWxHnTsQmAtWK750m22xeLmsmYzirhcFUDJUd_zuXY
