This kernel highlights a real business problem and proposes a modular way of thinking about how to build an efficient search engine.

In [1]:
from IPython.display import Image

I have been browsing the [Kaggle job page](https://www.kaggle.com/jobs) to explore the latest openings in data science and machine learning and to learn which new skills I need to focus on for the future.

Since I’m interested in NLP, I jumped right into the search box, typed “NLP software engineer”, and got a “No Jobs to show” message!

In [1]:
Image("../input/search-results-examples-images/kagglesearchengine-20190412t113812z-001/kaggleSearchEngine/kse-1.png")

I was sure there were at least a few jobs that matched my search!

I then searched for just “NLP” and, to my surprise, it returned a couple of results!

In [1]:
Image("../input/search-results-examples-images/kagglesearchengine-20190412t113812z-001/kaggleSearchEngine/kse-2.png")

I tried more search queries to understand how the search actually works on Kaggle's website.
I tried “NLP engineer” and got one result.

In [1]:
Image("../input/search-results-examples-images/kagglesearchengine-20190412t113812z-001/kaggleSearchEngine/kse-3.png")

Now let's search with the same keywords in a different order. I tried “engineer NLP”, and again the result was “No Jobs to show”!

In [1]:
Image("../input/search-results-examples-images/kagglesearchengine-20190412t113812z-001/kaggleSearchEngine/kse-4.png")

**From the business point of view**, this is a point of failure: the search sends the user to a dead end, and their journey will most likely end with a bad experience.

To proceed, we need to understand why this happened and how to use NLP to build a search engine that is reliable and capable of returning relevant results, which will also improve the user experience.

From the test cases above we can guess how the search function works: it appears to return only exact matches of the query within the job descriptions.

SQL example:
```sql
select * from kaggleJobs where description regexp '[[:<:]]search keywords[[:>:]]';
```
Python example:
```python
if "search keywords" in description:
    print("Found 2 jobs!")
else:
    print("No Jobs to show")
```

This only checks whether the exact search text appears in the job description. As we saw above, that fails as soon as the words are reordered or an extra word is added.
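
To make the failure mode concrete, here is a minimal sketch that reproduces the behaviour observed above. The job description is invented for illustration, and `exact_match` and `any_order_match` are hypothetical helpers, not Kaggle's actual implementation.

```python
# Hypothetical job description, used only for illustration.
description = "We are hiring an NLP engineer to build search tools"

def exact_match(query, text):
    """Mimic the guessed behaviour: the whole query must appear verbatim."""
    return query.lower() in text.lower()

def any_order_match(query, text):
    """Order-independent alternative: every query word must appear somewhere."""
    words = set(text.lower().split())
    return all(token in words for token in query.lower().split())

for query in ["NLP", "NLP engineer", "engineer NLP", "NLP software engineer"]:
    print(query, "->", exact_match(query, description), any_order_match(query, description))
```

Matching on individual words fixes the word-order problem, but it still rejects “NLP software engineer” because one word is missing. That is why the rest of this notebook ranks jobs by similarity instead of applying an all-or-nothing test.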

**How Can NLP Help Solve This Real Business Problem?**

In this notebook we’ll explain how to use NLP to build a simple, basic search engine that can handle multiple cases and find the most relevant search results.

**Corpus & Reading in the data**

The first step is to acquire the data. For that, I collected all the open jobs on Kaggle. Let’s load and browse the dataset.

In [1]:
import pandas as pd
data = pd.read_csv('../input/jobs-list/kaggleJobs.csv')

In [1]:
data.head()

As we can see from the dataset, the only columns that contain textual content about the jobs are:
* **result__companyName:** contains the company name
* **result__content:** contains the job description
* **result__name:** contains the job title

We'll select these three columns and ignore the rest for now.

In [1]:
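# .copy() gives us an independent DataFrame, so adding a column later does not trigger SettingWithCopyWarning
searchData = data[['result__companyName', 'result__content', 'result__name']].copy()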

**Pre-processing the raw text**

Now we need to clean the data, because the job descriptions arrive as raw HTML. We will also remove punctuation.

In [1]:
import re
import string

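def cleanHtml(raw_html):
    # Replace HTML tags, slashes, dots, hyphens and commas with a space so adjacent words stay separated
    tags = r'<.*?>|/|\.|-|,'
    noTags = re.sub(tags, ' ', raw_html)
    # Strip any remaining punctuation characters
    cleantext = noTags.translate(str.maketrans('', '', string.punctuation))
    return cleantext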

In [1]:
searchData['result__content__clean'] = searchData['result__content'].apply(cleanHtml)

**Data Transformation**

After the initial preprocessing phase, we need to transform the text into meaningful numeric vectors.
We are going to use the TF-IDF approach. Term Frequency-Inverse Document Frequency, or TF-IDF for short, rescales the frequency of each word in a document by how often that word appears across all documents, so that words such as “the”, which are frequent everywhere, are penalized [1].
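
As a rough illustration of the idea, the sketch below computes a textbook TF-IDF score (relative term frequency times `log(N / df)`) by hand on an invented three-document corpus. This is an assumption-laden toy: `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ, but the ranking intuition is the same.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
docs = [
    "the nlp engineer builds the search engine",
    "the data scientist builds the model",
    "the engineer deploys the model",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    """Textbook TF-IDF: relative term frequency times log(N / df)."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

# Score three terms of the first document
for term in ["the", "engineer", "nlp"]:
    print(term, round(tf_idf(term, tokenized[0]), 3))
```

“the” occurs in every document, so its weight drops to zero, while “nlp”, which occurs in only one document, gets the highest weight of the three.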

We'll use TfidfVectorizer from sklearn to convert our corpus into a matrix of TF-IDF features.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

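vectorizer = TfidfVectorizer(analyzer='word', min_df=0.0005, stop_words='english')

# Index company name, cleaned description and job title together as one document per job
corpus = searchData['result__companyName'] + " " + searchData['result__content__clean'] + " " + searchData['result__name']
tfidfMatrix = vectorizer.fit_transform(corpus)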

**Cosine similarity**

TF-IDF turns each text into a real-valued vector in a shared vector space. We can then obtain the cosine similarity of any pair of vectors by taking their dot product and dividing it by the product of their norms. The closer the result is to 1, the more similar the two texts are.
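
The formula can be sketched in a few lines of plain Python. The vectors below are made-up weights over a hypothetical three-term vocabulary, and returning 0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

# Made-up weights over a three-term vocabulary: [engineer, nlp, sales]
query = [0.7, 0.7, 0.0]
job_a = [0.5, 0.5, 0.0]   # points in the same direction as the query
job_b = [0.0, 0.0, 1.0]   # shares no terms with the query

print(cosine(query, job_a))
print(cosine(query, job_b))
```

Because only the direction of the vectors matters, `job_a` scores (up to floating-point rounding) as a perfect match even though its weights are smaller than the query's, while `job_b` scores zero.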

That's it!

Now we can run search queries. Let's try one of the queries that failed to return results, such as "engineer NLP".

We'll transform the query text into a document-term matrix. Then we'll measure the cosine similarity between queryTFIDF and our original TF-IDF matrix.

In [1]:
queryText = "engineer NLP"
queryTFIDF = vectorizer.transform([queryText])

In [1]:
from sklearn.metrics.pairwise import cosine_similarity

def sim():
    # Cosine similarity between the current query (global queryTFIDF) and every job
    similarityMatrix = cosine_similarity(queryTFIDF, tfidfMatrix).flatten()
    # Indices of the 10 highest scores, best match first
    sortedResultsIndices = similarityMatrix.argsort()[:-11:-1]
    # Drop jobs that share no terms with the query
    return [idx for idx in sortedResultsIndices if similarityMatrix[idx] > 0]

data.iloc[sim(),[4, 5, 16, 23]]

**Hooray we have results!**

We can tell that the first two jobs are the most relevant to our search keywords, so it works fine!

**But** we still have more to do to make our engine reliable: we need to handle incomplete words and misspellings.

**Jaro–Winkler distance**

The Jaro–Winkler distance is a string metric that measures how similar two sequences are, giving extra weight to strings that share a common prefix. We are going to use Jaro–Winkler to correct misspellings and to predict incomplete words based on the content of our corpus.
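
To show what the metric computes, here is a self-contained sketch of Jaro and Jaro–Winkler similarity in plain Python. It follows the standard definition and is not NLTK's code, so scores may differ slightly from `jaro_winkler_similarity` in edge cases; the vocabulary at the end is hypothetical.

```python
def jaro(s1, s2):
    """Jaro similarity: 1.0 for identical strings, 0.0 for nothing in common."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are at most `window` positions apart
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk both strings' matched characters in order and count disagreements
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2  # transpositions are half the out-of-order matches
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1, max_l=4):
    """Boost the Jaro score for a shared prefix of up to max_l characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_l:
            break
        prefix += 1
    return score + prefix * p * (1 - score)

# Hypothetical vocabulary: pick the term closest to a misspelled query word
vocabulary = ["engineer", "engineering", "designer", "nlp"]
best = max(vocabulary, key=lambda term: jaro_winkler("enginner", term))
print(best)
```

The prefix bonus is what makes the metric useful for incomplete words: a query that starts like a vocabulary term is pulled towards it even when the endings differ.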

In [1]:
# To use jaro_winkler_similarity make sure that you have NLTK v3.4
# You can check nltk version by running following lines:
# import nltk
# print('The nltk version is {}.'.format(nltk.__version__))

!pip install -U nltk==3.4

In [1]:
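from nltk.metrics.distance import jaro_winkler_similarity as jaro

def jaro_winkler(word):
    # On scikit-learn >= 1.0, use vectorizer.get_feature_names_out() instead
    features = vectorizer.get_feature_names()
    # Similarity between the query word and every term in the vocabulary
    scores = [round(jaro(x, word.lower(), p=0.1, max_l=10), 3) for x in features]
    # Vocabulary indices ordered from most to least similar
    indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the three closest terms
    return [features[i] for i in indices[0:3]]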

Let's intentionally misspell both words in "engineer NLP"

In [1]:
queryText = "enginner NLD"
queryTextCheck = " ".join([" ".join(jaro_winkler(w)) for w in queryText.split()])
queryTFIDF = vectorizer.transform([queryTextCheck])
data.iloc[sim(),[4, 5, 16, 23]]

That's nice! We still get relevant results even when the input contains typos or other improper input.

**Future work**

This is a basic engine that needs to be optimized on many levels.

We may need to apply more preprocessing and cleaning, such as stemming and/or lemmatization.

We may also tune the TfidfVectorizer by using n-grams and by optimizing its other parameters.
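
As a quick illustration of what n-grams would add, the sketch below generates word unigrams and bigrams with a hypothetical helper, `word_ngrams`. In scikit-learn the equivalent switch is the `ngram_range` parameter of `TfidfVectorizer`, for example `ngram_range=(1, 2)`.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n words across the token list
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams("NLP software engineer"))
# ['nlp', 'software', 'engineer', 'nlp software', 'software engineer']
```

With bigrams in the vocabulary, a job that contains the phrase “software engineer” can score higher than one that merely mentions both words far apart.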

We may tune the `p` and `max_l` values of the Jaro–Winkler algorithm.

Last but not least, we may start to think about adding extra layers to understand the context of the text, synonyms/antonyms, and more.

Finally, to bring this work to life, I deployed everything to a server so **YOU CAN TRY IT YOURSELF**
> Please note that some links may be out of date

In [1]:
from IPython.display import HTML
HTML('<iframe src=https://kaggle-search-engine-demo.herokuapp.com/ width=600 height=350></iframe>')

Thank you