# NLP Data Science Technical Interview
![NLP Data Science](NLP_data_science.jpg)


## Problem
A fictional B2B company sells products to online shops and wants to create a list of potential customers.
They scraped a large number of major German web sites and labelled some of the data with a flag denoting whether the web site is an online shop or not.


### Task 1
Develop a classifier which is able to predict whether a web site is an online shop by looking at the HTML content of its main page.

### Task 2
Using this classifier, create predictions for each of the web sites which are unclassified so far ("dataset 2"). Provide the predictions as a CSV file containing the domain name and a flag that denotes if the respective web site is an online shop.

### Task 3
Explain your approach and its technical details to our team.

### Task 4
Alas, the VP Sales of the company does not trust black box models and thus wants to understand what the model learned and how it comes to its decisions. In order to convince him, illustrate which content a web page needs to contain or how it needs to be structured to be classified as an online shop by your model. The VP Sales is a non-technical person and does not have deep knowledge about data analysis, so keep this part of the presentation as easy to understand as possible.


**NOTE**: We will combine task 3 into task 1, so that as we build the solution, it will also be explained.


## Principles
To solve this problem we will stick to a few basic principles.

* The principle of finding the **lowest hanging fruit**. That is, we will find the thing that brings the most value with the least effort.
* One of the core principles of **Agile**. We will solve the problem **incrementally** and **iteratively**. We don't want to get bogged down in perfecting one aspect and use up all of our time obsessing over it.

In [1]:
import pandas as pd

from utils.data_utils import DataUtils
from utils.HTML_extractor import strip_tags
from nlu_engine import NLUEngine
from nlu_engine import IntentMatcher
from nlu_engine import LR
from nlu_engine import Analytics
from nlu_engine import TfidfEncoder
from nlu_engine import DataUtils as du

In [None]:
#TODO: rename training_data stuff to labeled_data!
#TODO: clean up the NLU Engine stuff and only keep what is used here
#TODO: add in an evaluation pipeline to evaluate all of the classifiers (saving the evaluations as a combined report in a csv)
#TODO: create a word cloud venn diagram from the ranked features as a center piece for task 4

**NOTE**: As a general coding approach and best practice, we will abstract as much of the code as possible into classes to avoid cluttering up the notebook. This will make it easier to understand and modify the code later on. We will also use docstrings to make the classes and their methods more understandable.

# Kick off
We need to do a bit of preparation before we start with the first task.

First things first, let's have a look at our data! How about we start with the training data csv then the unclassified data set?

In [None]:
classified_csv_path = 'data/dataset1.csv'

classified_df = DataUtils.load_csv(classified_csv_path)
classified_df


In [None]:
number_of_entries = len(classified_df)
# Count the rows flagged as shops
number_of_shops = (classified_df["is_shop"] == 1).sum()
percent_of_shops = (number_of_shops / number_of_entries) * 100
print(
    f'out of {number_of_entries} entries, {percent_of_shops:.1f}% are shops')


In [None]:
unclassified_csv_path = 'data/dataset2.csv'

unclassified_df = DataUtils.load_csv(unclassified_csv_path)
unclassified_df

Okey dokey, we have loaded two CSV files into pandas. They contain the domain names and the flags denoting whether the respective web site is an online shop. How about we check out an HTML file?

In [None]:
example_html_path = 'data/scraped_html/1a-buerotechnik.de.html'

html = DataUtils.load_html(example_html_path)
html

Yep, that looks like HTML alright. But it isn't so useful for our purpose of classifying web sites as online shops. We will need to somehow turn this HTML into something useful for our classifier.

**NOTE**: The comments of this specific example mention `Shopsoftware by Gambio GmbH`. It might be interesting to see how many of the pages contain such a blatant reference to being a shop. For right now, however, we will stick to the task of stripping down the HTML and classifying the site as an online shop via the text of the website itself.

**Did you know** that one of the most time-consuming steps in NLP data science is cleaning text data scraped from websites? 😉

In [None]:
extracted_text = strip_tags(html)
extracted_text
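
`strip_tags` comes from this project's `utils/HTML_extractor.py`, whose source is not shown here. To give a rough idea of what such a helper might do, here is a minimal sketch built on the standard library's `html.parser`. The names `_TextExtractor` and `strip_tags_sketch`, and the decision to skip the contents of `<script>` and `<style>`, are assumptions for illustration and not the project's actual implementation.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, ignoring <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def get_text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def strip_tags_sketch(html: str) -> str:
    """Hypothetical stand-in for the project's strip_tags helper."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Shop</h1><p>In den Warenkorb</p>"
    "<script>var x = 1;</script></body></html>"
)
print(strip_tags_sketch(sample))
```

HTML comments are dropped by this sketch as well, so a hint like the `Shopsoftware by Gambio GmbH` comment would not survive it.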

Looking at a few other examples, the output looks pretty good for our purpose. Data can always be cleaner, but let's not get stuck cleaning this data further at this time and say it is good enough for now. We will see if we need to do some additional cleaning further down the road.

Now that we have a way to read and clean up the HTML into normal text strings, we should probably apply that to all of the HTML data and while we are at it, merge it to the classified and unclassified csvs.

To do this we will:
* Get a file list from the HTML files in the directory
* Strip the ".html" extension from the file names
* Match those stripped file names with the domain names in the csvs
* Merge the parsed HTML to the csvs into a new column called `text`

Getting the file list is pretty straightforward. We will write a function that does this for us and throw it into the `data_utils.py` file. Here's the output (for the first 10):

In [None]:
SCRAPED_DATA_DIR = 'data/scraped_html/'

file_list = DataUtils.get_file_list(SCRAPED_DATA_DIR)
print(file_list[:10])

We will need to write a function that goes through the list of files in the directory and returns a dataframe with the domains and text.
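
The real function lives in `data_utils.py` and is not reproduced in this notebook. As a rough sketch of the steps listed above, something along these lines would do the job. The names `process_html_files_sketch` and `crude_extract` are hypothetical, and the regex-based tag removal is only a crude stand-in for `strip_tags`.

```python
import re
import tempfile
from pathlib import Path


def process_html_files_sketch(directory, extract_text):
    """Return one {'domain', 'text'} row per .html file in `directory`."""
    rows = []
    for path in sorted(Path(directory).glob("*.html")):
        if not path.is_file():
            continue
        html = path.read_text(encoding="utf-8", errors="replace")
        # "example-shop.de.html" -> "example-shop.de": stem drops only the last suffix.
        rows.append({"domain": path.stem, "text": extract_text(html)})
    return rows


def crude_extract(html):
    # Crude stand-in for the project's strip_tags helper (assumption).
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())


# Tiny demo on a temporary directory with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "example-shop.de.html").write_text(
        "<p>In den Warenkorb</p>", encoding="utf-8")
    Path(tmp, "notes.txt").write_text("ignored", encoding="utf-8")
    rows = process_html_files_sketch(tmp, crude_extract)

print(rows)
```

Passing `rows` to `pd.DataFrame` would then give a dataframe with a `domain` and a `text` column.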

**NOTE**: In the original files, there was a `.ipynb_checkpoints` directory in the `data/scraped_html` directory that we will not be using, so I tossed it out.

In [None]:
text_df = DataUtils.process_html_files(SCRAPED_DATA_DIR)
text_df


This dataframe looks pretty good. Next we will merge this dataframe with the training and unclassified dataframes by the domain column.

In [None]:
classified_text_df = pd.merge(classified_df, text_df, on="domain")
unclassified_text_df = pd.merge(unclassified_df, text_df, on="domain")
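
One thing to keep in mind: `pd.merge` performs an inner join by default, so any domain without a matching HTML file silently disappears from the result. Here is a small sketch of how one might check for that, using made-up toy dataframes (`labels_df` and `pages_df` are hypothetical names, not the project's data).

```python
import pandas as pd

# Toy data: "c.de" has a label but no scraped page.
labels_df = pd.DataFrame(
    {"domain": ["a.de", "b.de", "c.de"], "is_shop": [1, 0, 1]})
pages_df = pd.DataFrame(
    {"domain": ["a.de", "b.de"], "text": ["Warenkorb", "Impressum"]})

# Default inner join: only domains present in both frames survive.
inner = pd.merge(labels_df, pages_df, on="domain")

# Left join with an indicator column shows which domains had no match.
checked = pd.merge(labels_df, pages_df, on="domain",
                   how="left", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "domain"].tolist()

print(len(inner), missing)
```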

In [None]:
classified_text_df

In [None]:
classified_text_df.to_csv('data/classified_text_df.csv')

In [None]:
unclassified_text_df

In [None]:
unclassified_text_df.to_csv('data/unclassified_text_df.csv')

Uh huh, looking pretty good with these dataframes. I'd say we are ready for the next major step...

# Task 1

Let's do some machine learning stuff! Well, almost... There is the matter of figuring out our approach and of course doing the pre-processing that needs to be done.

Sticking with our principles, we will start with the easiest solution, then benchmark it and see where it gets us. So instead of doing some crazy SOTA stuff, we will encode the text into TF-IDF vectors and use those as features for a "classic" intent classifier.

This is very similar to classifying emails as `spam` or `ham`, another binary classification task. It also resembles the multi-class intent matching used in NLU, for example matching utterances to intents in a voice assistant.

### Why are we using TF-IDF?
Well, based on experience, we have found that TF-IDF makes pretty good features for an intent classifier in tasks like this, yet it is super easy to do and computationally inexpensive. It is usually a good place to start.

Our **hypothesis** here is that weighting the terms in each document by their frequency, while down-weighting terms that appear in many documents and disregarding word order, gives a good sense of the overall importance of each term and therefore makes good features. If so, this should be better than a simple bag of words (BoW) approach, which only counts the frequency of each term. However, it might not be as good as approaches that use deeper context to get more features (such as word embeddings). But then again, it could give us really great results!
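
To make the idea concrete, here is a tiny hand-rolled sketch of the textbook weighting, `tf * ln(N / df)`, on three made-up "pages". It is only meant to show the intuition. `sklearn`'s `TfidfVectorizer` uses a smoothed IDF and L2-normalises each vector by default, so its numbers will differ.

```python
import math
from collections import Counter

# Three made-up "pages": two shop-like, one news-like.
docs = [
    "warenkorb kasse versand impressum",
    "warenkorb versand bestellen impressum",
    "nachrichten wetter sport impressum",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Textbook TF-IDF: relative term frequency times ln(N / df)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }


scores = tfidf(tokenized[0])
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {score:.3f}")
```

"impressum" appears on every page, so its score drops to zero, while "kasse", which appears on only one page, ranks highest. A plain bag of words would have given all four terms the same count.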

Of course we could also use a stop word list with BoW or even with TF-IDF, but we will stick to plain TF-IDF for now and see where that gets us. We just love that low-hanging fruit!

### Application
To do all of this, we will go with the easy-to-use `sklearn` library. It is what we like to call "old school cool". It is a great way to get started and explore the relationships of the features, train models, and evaluate them. Luckily, I happen to have written an open source NLU engine that uses the same library, so we will use parts of that. Please see this [Secret Sauce AI repo](https://github.com/secretsauceai/NLU-engine-prototype-benchmarks) for more information (watch out: this NLU project is still a major work in progress!). Because the project isn't done yet, there is no PyPI package for it, so we will just grab the pieces we want and go from there.


Let's load the training and unclassified dataframes from the previous step.

In [2]:
classified_text_df = DataUtils.load_csv('data/classified_text_df.csv')
unclassified_text_df = DataUtils.load_csv('data/unclassified_text_df.csv')

## Preprocessing TF-IDF

Kicking this step off, we will first create a TF-IDF vectorizer.

In [3]:
tfidf_vectors, tfidf_vectorizer = TfidfEncoder.create_vectorizer(
    classified_text_df['text'])

To have a better understanding, let's have a look at the top 100 TF-IDF terms in the training data.

In [None]:
top_100_features = TfidfEncoder.get_top_n_features(
    tfidf_vectorizer, tfidf_vectors, 100)
top_100_features

In [None]:
#TODO: refactor the NLU Engine stuff to fit with this project (ie remove and rename stuff)

## Machine learning
Ohhhh yeah, we are ready to go!

In [None]:
#TODO: re-write code for intent classification
#TODO: add in way to evaluate all models and get a report on the results
#TODO: select the best one, train it on all of the classified data, then run it on the unevaluated data and look at a random sampling of the results and report on the results (use ipysheet with boolean column for evaluation)

In [None]:
LR_model = IntentMatcher.train_classifier(LR, tfidf_vectors, classified_text_df['is_shop'])

In [None]:
classified_text_df

In [None]:
IntentMatcher.predict_labels(LR_model, tfidf_vectors)

In [None]:
predictions = Analytics.cross_validate_classifier(LR, tfidf_vectors, classified_text_df['is_shop'])

In [None]:
report = Analytics.generate_report(
    LR, predictions, classified_text_df['is_shop'])
report

In [None]:
encoding = 'TFIDF'
report_df = Analytics.convert_report_to_df(classifier=LR, report=report, encoding=encoding)
report_df

In [None]:
#TODO: discuss what precision, recall, f1 score, accuracy, macro avg, and weighted avg are and how they are used in the evaluation process
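
Until that discussion is written up, here is a rough guide to reading the report, assuming it follows the usual definitions. Precision asks "of the sites we flagged as shops, how many really are shops?", recall asks "of the real shops, how many did we catch?", F1 is the harmonic mean of the two, and accuracy is the share of all predictions that are correct. The macro average is the plain mean of the per-class scores, and the weighted average weights each class by how many examples it has. The function name `binary_metrics` and the toy labels below are made up for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1 and accuracy for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Toy example: 3 real shops, 5 non-shops.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For a sales lead list, the trade-off matters: low precision wastes the sales team's time on non-shops, while low recall means missed potential customers.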

In [None]:
report_df = NLUEngine.evaluate_intent_classifier(tfidf_vectors=tfidf_vectors, labels=classified_text_df['is_shop'], classifier=LR)
report_df

In [5]:
from nlu_engine import NB, DT, RF, KN, ADA, SVM

classifiers = [NB, LR, DT, RF, KN, ADA, SVM]

In [None]:
classifier_reports_df = NLUEngine.evaluate_all_classifiers(classifiers=classifiers, x_train=tfidf_vectors, y_train=classified_text_df['is_shop'])

In [None]:
dense_array = du.get_dense_array(classifier=NB, x_train=tfidf_vectors)

In [None]:
dense_array

In [6]:
prediction = Analytics.cross_validate_classifier(classifier=NB, x_train=tfidf_vectors, y_train=classified_text_df['is_shop'])

Cross validating with GaussianNB()


TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

In [None]:
#TODO figure out why the sparse matrix isn't converting to a dense array for the NB classifier
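
A possible lead on this one: in the cells above, `dense_array` is computed, but `cross_validate_classifier` is still called with `tfidf_vectors`. Unless that method converts internally, the sparse matrix reaches `GaussianNB` unchanged. Here is a minimal sketch of the underlying behaviour in plain `sklearn`, with made-up toy texts. It is not the NLU engine's code.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Made-up toy data: two shop-like and two non-shop pages.
texts = [
    "warenkorb kasse versand",
    "warenkorb bestellen versand",
    "nachrichten wetter sport",
    "verein termine kontakt",
]
labels = [1, 1, 0, 0]

X_sparse = TfidfVectorizer().fit_transform(texts)
print(issparse(X_sparse))  # the vectorizer returns a SciPy sparse matrix

# GaussianNB needs dense input, so convert explicitly before fit/predict.
X_dense = X_sparse.toarray()
gnb = GaussianNB().fit(X_dense, labels)
gnb_pred = gnb.predict(X_dense)

# MultinomialNB accepts the sparse matrix directly.
mnb = MultinomialNB().fit(X_sparse, labels)
mnb_pred = mnb.predict(X_sparse)

print(gnb_pred, mnb_pred)
```

Keep in mind that a dense array of documents by vocabulary size can get large for scraped web pages, so a Naive Bayes variant that works on sparse input may be the more practical choice.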