# Project Manager Helper

Problem

Over the years there were numerous projects started inside Nokia. Information about these projects exists in unstructured text format (*.doc,* *.pdf,* *.ppt* and other formats). Additionally, companies that have been aquired by Nokia also have a similar project descriptions.

When planning to start a new project, project manager would like to get an overview of existing projects within the company that are the most similar to his project idea. Useful information can include statistics on team size, budget, timeline. As well as a spectre of people, technologies, partner companies and development tools being presented in such projects.

Proposed solution

One way to solve this problem requitres splitting it into two smaller procedures.

Step 1: Narrow down the whole set of documents to the relevant subset describing the same technology.
        This can be solved by extracting keywords from an input document and comparing them to the keywords found
        in each document in a database of previous projects.
        IBM Watson Natural Language Understanding service can be used to extract keywords.

        Input: brief description of a project given in a plain text
        Output: list of the most similar project descriptions. Additionally a similarity score can be provided.
        Search space: all documents in a database.
  
Step 2: Discover patterns, dependencies and relationships within narrowed subset of the documents.
        IBM Watson Discovery service can be used to conduct such analysis.
        Input: subset of documents
        Output: statistics, relations, etc.
        
This notebook explains the first step of the process.

Since there is no existing dataset that can be used, we will be working with a NIPS dataset. It consists of 403 research papers presented during the annual NIPS conference which covers topics of AI, machine and deep learning, data scienece, etc.
Additionally, a .csv file is provided which includes title of each article and its full text.

For the first step we will use .csv file.

Importing required modules. Make sure to install them locally beforehand.

In [3]:
import json
import pandas as pd
import math
from collections import Counter
from watson_developer_cloud import NaturalLanguageUnderstandingV1
from watson_developer_cloud.natural_language_understanding_v1 import Features, KeywordsOptions

Creating an instance of a service. Version, username and password can be found in 'View Credentials' tab for this specific service.

In [5]:
natural_language_understanding = NaturalLanguageUnderstandingV1(
    version='2017-02-27',
    username='ff57ed6a-a245-40bf-b6a0-a709612b9a7e',
    password='FOEh24Cuwubq')

Loading csv file with article text and titles. 

In [8]:
#Name of the file to read as a string
csv_file = 'Papers.csv'

#Use pandas module to read csv file
articles = pd.read_csv(csv_file)

#Choosing only Title and PaperTExt columns.
articles = articles.loc[:,['Title', 'PaperText']]

We can check if out file was succsesfully read. Let's output first 5 titles.

In [9]:
print articles.Title.head()

0    Double or Nothing: Multiplicative Incentive Me...
1    Learning with Symmetric Label Noise: The Impor...
2     Algorithmic Stability and Uniform Generalization
3    Adaptive Low-Complexity Sequential Inference f...
4    Covariance-Controlled Adaptive Langevin Thermo...
Name: Title, dtype: object


Creating variables to store keywords and titles for our documents.

In [22]:
#Empty array to store a set of keywords for each article as a separate element
doc_vectors = []

#Empty array to store titles
titles = []

#Amount of articles in the dataset (= amount of rows)
size = len(articles.index)
print 'Dataset includes %d articles.' % size

Dataset includes 403 articles.


Now we will iterate over every article in a dataset and extract keywords for that article using IBM Watson Natural Language Understanding service an instance of which we have already created.

In [23]:
#Iterate over each article, keep track of its index
for index, row in articles.iterrows():
    #Create an empty array to populate it with 
    doc_vector = []
    
    #Extract title and append it to the list of titles
    title = row['Title']
    titles.append(title)
    
    #Extract text
    text = row['PaperText']
    
    #To keep track of the process we will print out index of a document being analyzed at the moment
    if index % 10 == 0:
        print 'Extracting keywords for document %d out of %d' % (index, size)
        
    #Use try-except to catch possible errors 
    try:
        #Feed text to an instance of IBM Watson NLU service. We ask to return only keywords.
        keywords = natural_language_understanding.analyze(text=text, features=Features(keywords=KeywordsOptions()))['keywords']
    except Exception as e:
        print e.message
        pass
    
    #Service returns a list of keywords and relevance scores but we only need keywords
    for keyword in keywords:
        doc_vector.append(keyword['text'])
        
    #Append keywords extracted from an article to the list of keywords
    doc_vectors.append(doc_vector)

Extracting keywords for document 0 out of 403
Extracting keywords for document 10 out of 403
Extracting keywords for document 20 out of 403
Extracting keywords for document 30 out of 403
Extracting keywords for document 40 out of 403
Extracting keywords for document 50 out of 403
Extracting keywords for document 60 out of 403
Extracting keywords for document 70 out of 403
Extracting keywords for document 80 out of 403
Extracting keywords for document 90 out of 403
Extracting keywords for document 100 out of 403
Extracting keywords for document 110 out of 403
Extracting keywords for document 120 out of 403
unknown language detected
Extracting keywords for document 130 out of 403
Extracting keywords for document 140 out of 403
Extracting keywords for document 150 out of 403
Extracting keywords for document 160 out of 403
Extracting keywords for document 170 out of 403
Extracting keywords for document 180 out of 403
Extracting keywords for document 190 out of 403
Extracting keywords for d

In [29]:
#Add extracted keywords as a new column
word_vector = pd.Series(doc_vectors)
articles['word_vector'] = word_vector.values

Let's check results of keyword extraction. Change index to any number in range 0-402

In [34]:
print '"%s"' % articles.loc[26, 'Title']
print ''
print 'Extracted keywords: '
print ''
for keyword in articles.loc[26, 'word_vector']:
    print keyword

"Bidirectional Recurrent Convolutional Networks for Multi-Frame Super-Resolution"

Extracted keywords: 

temporal dependency
hidden layer
recurrent convolutional network
conditional convolutions
multi-frame sr
feedforward convolution
SR methods
multi-frame SR methods
single-image sr
PSNR Time
current hidden layer
BRCN
recurrent neural networks
input layer
recurrent convolutional sub-network
Recurrent Convolutional Networks
conditional convolution
optical flow
video SR
temporal dependency modelling
bidirectional scheme
complex motions
video sequences
Feedforward convolution models
motion estimation
backward recurrent network
convolutional neural network
video frames
multi-frame SR method
forward recurrent network
bidirectional recurrent scheme
conditional convolutional connections
feedforward convolution focus
visual spatial dependency
single-image SR methods
video SR methods
recurrent convolutions
powerful temporal dependency
Information Processing Systems
efficient multi-frame SR
IEEE

Now we will read an imaginary description of a new project. You can write your own and put it in the root folder inside new_project.txt.

In [35]:
with open('new_project.txt', 'r') as f:
    text_to_compare = f.read().replace('\n', '')

In [36]:
print text_to_compare

Espoo Karage is a new space for all Nokia employees. They can learn and practice new skills and technologies such as artificial intelligence, machine learning, deep learning and data analysis. Working stations, various sensors and cameras are provided for a free and unlimited use. The use of Karage will be analyzed using low resolution thermal and digital cameras, feedback system based on IBM Watson sentimental analysis module. Convolutional neural networks will be applied to images from cameras for object detection, image segmentation and image processing. Data extracted from video stream is used to make prediction about utilization of the space in upcoming days. Time series are analied using RNN and linear regression models. Accuracy of these models is compared based on ROC AUC and cross enthropy. Results of this analysis will be presented to the manager of the Karage so he can access how successful this project is and what improvements can be made.


We will apply the same procedure to extract keywords from that document.

In [40]:
keywords_to_compare = [i['text'] for i in natural_language_understanding.analyze(text=text_to_compare, features=Features(keywords=KeywordsOptions()))['keywords']]

print 'Keywords found in a document:' + '\n'

for keyword in keywords_to_compare:
    print keyword

Keywords found in a document:

Convolutional neural networks
Watson sentimental analysis
linear regression models
Espoo Karage
ROC AUC
Nokia employees
digital cameras
new space
artificial intelligence
image segmentation
various sensors
Working stations
new skills
machine learning
deep learning
low resolution
data analysis
object detection
video stream
image processing
Time series
utilization
improvements
Accuracy
prediction
module
technologies
IBM
images
Results
manager


To compare similarity of two given sets of keywords we will use a fairly naive approach which calculates an intersection of two sets of keywords. The bigger is that intesection the higher is the similarity between two documents.

In [44]:
def naive_similarity(word_vector01, word_vector02):
    n = len(word_vector01)
    counter = 0
    matches = []
    for word in word_vector01:
        if word in word_vector02:
            counter += 1
            matches.append(word)
    return counter

We will loop over all sets of keywords and find similarity with a given document.

In [45]:
similarities = []

for index, row in articles.iterrows():
    temp_similarity = naive_similarity(keywords_to_compare, row['word_vector'])
    similarities.append(temp_similarity)

We can now check results.

In [54]:
max_similarity = similarities.index(max(similarities))

print'Article with the highest similarity is "%s"' % articles.ix[index, 'Title'] + '\n'
print'Number of matching keywords are:' + '\n'

matches = set(keywords_to_compare).intersection(articles.ix[max_similarity, 'word_vector'])

for word in matches:
    print word

Article with the highest similarity is "Variational Dropout and the Local Reparameterization Trick"

Number of matching keywords are:

deep learning
low resolution
image processing
