# NLP Data Science Technical Interview
![NLP Data Science](NLP_data_science.jpg)


## Problem
A fictional B2B company sells products to online shops and wants to create a list of potential customers.
They scraped a large number of major German web sites and labelled some of the data with a flag denoting if the web site is an online shop or not.


### Task 1
Develop a classifier which is able to predict whether a web site is an online shop by looking at the HTML content of its main page.

### Task 2
Using this classifier, create predictions for each of the web sites which are unclassified so far ("dataset 2"). Provide the predictions as a CSV file containing the domain name and a flag that denotes if the respective web site is an online shop.

### Task 3
Explain your approach and its technical details to our team.

### Task 4
Alas, the VP Sales of the company does not trust black box models and thus wants to understand what the model learned and how it comes to its decisions. In order to convince him, illustrate which content a web page needs to contain or how it needs to be structured to be classified as an online shop by your model. The VP Sales is a non-technical person and does not have deep knowledge about data analysis, so keep this part of the presentation as easy to understand as possible.


**NOTE**: We will combine task 3 into task 1, so that as we build the solution, it will also be explained.


## Principles
To solve this problem we will stick to a few basic principles.

* The principle of finding the **lowest hanging fruit**. That is, we will find the thing that brings the most value with the least effort.
* One of the core principles of **Agile**. We will solve the problem **incrementally** and **iteratively**. We don't want to get bogged down in perfecting one aspect and use up all of our time obsessing over one thing.

In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import os

from utils.data_utils import DataUtils
from nlu_engine import NLUEngine
from nlu_engine import LabelEncoder
from nlu_engine import TfidfEncoder

In [2]:
#TODO: clean up the NLU Engine stuff and only keep what is used here
#TODO: add in an evaluation pipeline to evaluate all of the classifiers (saving the evaluations as a combined report in a csv)
#TODO: create a word cloud venn diagram from the ranked features as a center piece for task 4

**NOTE**: As a general coding approach and best practice, we will abstract as much of the code as possible into classes to avoid cluttering up the notebook. This will make it easier to understand and modify the code later on. We will also use docstrings to make the classes and their methods more understandable.

# Kick off
We need to do a bit of preparation before we start with the first task.

First things first, let's have a look at our data! How about we start with the training data csv then the unclassified data set?

In [3]:
training_csv_path = 'data/dataset1.csv'

training_df = DataUtils.load_csv(training_csv_path)
training_df

Unnamed: 0,domain,is_shop
0,111.com,0
1,12xl.de,1
2,1a-buerotechnik.de,1
3,1a-yachtcharter.de,0
4,1blu.de,0
...,...,...
855,zeus-zukunft.de,0
856,zhaw.ch,0
857,zoeliakie-gourmet.at,1
858,zookauf-zwahr.de,1


In [4]:
unclassified_csv_path = 'data/dataset2.csv'

unclassified_df = DataUtils.load_csv(unclassified_csv_path)
unclassified_df

Unnamed: 0,domain
0,77records.de
1,absperrtechnik24.de
2,ackermedia.de
3,acris-ecommerce.at
4,adepto-shop.de
...,...
195,xeon-hosting.de
196,xodox.de
197,zaubermode.de
198,zipf-immobilien24.de


Okay dokey, we have two CSV files loaded into pandas. Both contain the domain names, and the training set also contains the flag denoting if the respective web site is an online shop. How about we check out an HTML file?

In [5]:
example_html_path = 'data/scraped_html/1a-buerotechnik.de.html'

example_html_souped = DataUtils.load_html(example_html_path)
example_html_souped

<!DOCTYPE html>
<html dir="ltr" lang="de"><head>
<!--

		Shopsoftware by Gambio GmbH (c) 2005-2017 [www.gambio.de]

		Gambio GmbH offers you highly scalable E-Commerce-Solutions and Services.
		The Shopsoftware is redistributable under the GNU General Public License (Version 2) [http://www.gnu.org/licenses/gpl-2.0.html].
		based on: E-Commerce Engine Copyright (c) 2006 xt:Commerce, created by Mario Zanier & Guido Winger and licensed under GNU/GPL.
		Information and contribution at http://www.xt-commerce.com

		Please visit our website: www.gambio.de

		-->
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="index,follow,noodp,noydir" name="robots"/>
<meta content="de" name="language"/>
<meta content="K.-R. Kugler" name="author"/>
<meta content="2GrzpAAS3QpIFK9CfFOW6H1zd6mIJXw4EhkiwtvAalQ" name="google-site-verification_"/>
<meta content="(c) 1a-Bürotechnik 2001-2016" name="copyright"/>
<meta content=

Yep, that looks like HTML alright. But it isn't so useful for our purpose of classifying web sites as online shops. We will need to somehow turn this HTML into something useful for our classifier. Let's do that step-by-step.

**NOTE**: The comments in this specific example mention `Shopsoftware by Gambio GmbH`. It might be interesting to see if many of the pages contain such a blatant reference to being a shop. However, for right now we will stick to the task of stripping down the HTML and classifying it as an online shop via the text of the website itself.

**Did you know?** One of the most time-consuming steps in NLP data science is cleaning text data scraped from websites. 😉

In [6]:
text = DataUtils.extract_text_from_html(example_html_path)
text

'Lastschrift, Kreditkarte und Rechnungskauf ohne Paypal-Konto - 1a-Bürotechnik Ihr Discount-Versand, schnell-kompetent-preiswert <div class="noscript_notice"> JavaScript ist in Ihrem Browser deaktiviert. Aktivieren Sie JavaScript, um alle Funktionen des Shops nutzen und alle Inhalte sehen zu können. </div> Kundenlogin Kundengruppe: Gast Merkzettel Warenkorb Suchen Über uns ►►► Rechnungskauf für Behörden Go Startseite Ihr Warenkorb keine Produkte LOTUS™ In Kürze bei uns verfügbar! Aktenvernichter HSM Securio B24 1,9x15mm Cross-Cut, Stufe P5 01.01.2018 Aktenvernichter HSM Securio B32  0,78x11mm, Stufe P6, autom. Ölung 01.01.2018 Unsere Empfehlungen AVerVision F17-8M Full HD, AVer... 349,98 EUR LED-Deckenleuchte MAULstart, weiß, 120 cm, 35 W 82,80 EUR Aktenvernichter HSM Securio AF500, 1,9x15mm,... 749,98 EUR Mobiles Whiteboard MAULstandard Emaille,... 318,73 EUR Aktenvernichter Fellowes Powershred M-7C... 49,73 EUR Tonrec Toner 31010L ersetzt Lexmark T640,... 64,28 EUR OLYMPIA Registrier

Hmmm, it's okay. Not perfect, but in keeping with our incremental and iterative principle, let's not get stuck cleaning this data further right now and call it good enough. We will see if we need to do some additional cleaning further down the road.
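
The implementation of `DataUtils.extract_text_from_html` is not shown in this notebook. To give a rough idea of what such a helper does, here is a minimal sketch that uses only Python's standard `html.parser` module instead of BeautifulSoup. The name `html_to_text` and the decision to drop the contents of `<script>` and `<style>` tags are assumptions made for illustration, not the project's actual code.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped tag.
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Hypothetical helper: return the text of an HTML string on one line."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    # Collapse runs of whitespace (newlines, tabs) into single spaces.
    return " ".join(" ".join(collector.chunks).split())


# Made-up sample page, not taken from the dataset.
sample = """<html><head><title>Mein Shop</title>
<style>body {color: red;}</style></head>
<body><h1>Warenkorb</h1><script>var x = 1;</script>
<p>349,98 EUR</p></body></html>"""
print(html_to_text(sample))
```

HTML comments, such as the Gambio notice in the example page, are dropped automatically because the parser reports them separately from text nodes.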

Now that we have a way to read the HTML and clean it up into normal text strings, we should apply that to all of the HTML data and, while we are at it, merge the result with the training and unclassified CSV data.

To do this we will:
* Get a file list from the HTML files in the directory
* Strip the ".html" extension from the file names
* Match those stripped file names with the domain names in the csvs
* Merge the parsed HTML text into the CSV data as a new column called `text`
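
The first two steps of this list can be sketched with the standard library alone. The helper name `list_domains` is hypothetical and the files in the demonstration are made up. The real `DataUtils` functions may work differently.

```python
from pathlib import Path
import tempfile


def list_domains(html_dir):
    """Hypothetical helper: map each domain to its HTML file in html_dir."""
    domains = {}
    for path in sorted(Path(html_dir).iterdir()):
        # Skip sub-directories such as .ipynb_checkpoints and non-HTML files.
        if not path.is_file() or path.suffix != ".html":
            continue
        # "12xl.de.html" -> "12xl.de": only the final ".html" is removed.
        domains[path.stem] = path
    return domains


# Tiny demonstration on a throwaway directory with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "12xl.de.html").write_text("<p>Shop</p>", encoding="utf-8")
    (Path(tmp) / "zhaw.ch.html").write_text("<p>Uni</p>", encoding="utf-8")
    (Path(tmp) / ".ipynb_checkpoints").mkdir()
    demo_domains = list(list_domains(tmp))
    print(demo_domains)
```

Using `Path.stem` rather than a string replace means a domain that happens to contain `.html` elsewhere in its name would not be mangled.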

Getting the file list is pretty straightforward. We will write a function that does this for us and throw it into the `data_utils.py` file. Here's the output:

In [7]:
file_list = DataUtils.get_file_list('data/scraped_html')
file_list

['toskanaferien.de.html',
 'pc-ostsee.de.html',
 'hammerkauf.de.html',
 'mobile-laden.com.html',
 'lutter.net.html',
 'meinemarkenmode.de.html',
 'sonnenhotels.de.html',
 'gelesi.de.html',
 'perfectmallorca.de.html',
 'bellanea.de.html',
 'comfortsun-shop.de.html',
 'rzamt.ch.html',
 'hexim.de.html',
 'euronics.ch.html',
 'kufatec.de.html',
 'duftholz.de.html',
 'gitarre-bestellen.de.html',
 'paintball.de.html',
 'inchweb.de.html',
 'medicalcorner24.com.html',
 'help99.de.html',
 'galdem.de.html',
 'abramo.de.html',
 'fleckerl.de.html',
 'beepworld.de.html',
 'dsguided.com.html',
 'baucompany24.de.html',
 'prosite.de.html',
 'koi-kichi.de.html',
 'r2g.de.html',
 'posylka.de.html',
 'die-planaren-exploratoren.de.html',
 'get-more24.de.html',
 'dessertweine.de.html',
 'worldtrip.de.html',
 'agentur-murr.de.html',
 'beerenweine.de.html',
 'bestoftechnic.de.html',
 'addtronic.com.html',
 'fernuni-hagen.de.html',
 'sonneninsel-teneriffa.de.html',
 'kreuzfahrten-center.com.html',
 'derhobbyk

We will need to write a function that goes through the list of files in the directory and returns a dataframe with the domains and text.

**NOTE**: In the original files, there was a `.ipynb_checkpoints` directory in the `data/scraped_html` directory that we will not be using, so I tossed it out.

In [8]:
text_df = DataUtils.process_html_files('data/scraped_html')
text_df

Unnamed: 0,domain,text
0,toskanaferien.de,Home - Toskanaferien Home Urlaubsort Lari Dolc...
1,pc-ostsee.de,PCO - Privat Charter Ostsee - Yachtcharter Dir...
2,hammerkauf.de,Hammerkauf Online-Shop - Baumarkt <iframe src=...
3,mobile-laden.com,Mobile-Laden.com - Handyzubehör zu Top Preisen...
4,lutter.net,Immobilien Makler Rostock Häuser Wohnungen Kau...
...,...,...
1055,nakur.de,Nakur Warenkorb 0 Artikel Ihr Konto Kasse Anme...
1056,babyviduals.de,"Babynahrung tiefgekühlt von Babyviduals, Bio-B..."
1057,taschenkaufhaus.de,"Taschen, Rucksäcke und Reisegepäck: Taschenkau..."
1058,gastroimmobilien.net,gastroimmobilien.net - Gastro Immobilien Das S...


This dataframe looks pretty good. Next we will merge this dataframe with the training and unclassified dataframes by the domain column.

In [9]:
training_text_df = pd.merge(training_df, text_df, on="domain")
unclassified_text_df = pd.merge(unclassified_df, text_df, on="domain")
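
One thing worth keeping in mind: `pd.merge` performs an inner join by default, so any domain without a matching HTML file is dropped without a warning. The sketch below, on made-up miniature dataframes, shows one way to check for that with a left join and `indicator=True`. The names `labels_df`, `pages_df` and `missing-page.de` are invented for this example.

```python
import pandas as pd

# Made-up miniature versions of the label table and the extracted-text table.
labels_df = pd.DataFrame(
    {"domain": ["12xl.de", "zhaw.ch", "missing-page.de"], "is_shop": [1, 0, 1]}
)
pages_df = pd.DataFrame(
    {"domain": ["12xl.de", "zhaw.ch"], "text": ["Warenkorb Kasse", "Hochschule"]}
)

# Default behaviour (how="inner"): the domain without HTML silently disappears.
inner_df = pd.merge(labels_df, pages_df, on="domain")

# A left join with indicator=True keeps every labelled domain and records
# in the "_merge" column whether a matching page was found.
checked_df = pd.merge(labels_df, pages_df, on="domain", how="left", indicator=True)
missing_domains = checked_df.loc[
    checked_df["_merge"] == "left_only", "domain"
].tolist()

print(len(inner_df), "rows after the inner join")
print("Domains without HTML:", missing_domains)
```

If the list of missing domains is empty for both datasets, the inner join used above loses nothing.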

In [10]:
training_text_df

Unnamed: 0,domain,is_shop,text
0,111.com,0,dns-net.ch | DNS-NET Services GmbH Our domain ...
1,12xl.de,1,Bekleidung in Übergrößen für Herren | Herren...
2,1a-buerotechnik.de,1,"Lastschrift, Kreditkarte und Rechnungskauf ohn..."
3,1a-yachtcharter.de,0,Yachtcharter - 9196 Yachten online chartern Mo...
4,1blu.de,0,Neue Internetpräsenz. hosted by www.1blu.de Hi...
...,...,...,...
855,zeus-zukunft.de,0,Zeus Zukunft - Ihr Spezialist für Einkommenssi...
856,zhaw.ch,0,Willkommen an der ZHAW | ZHAW Zürcher Hochschu...
857,zoeliakie-gourmet.at,1,Vital Gourmet 03177 25 295 / 0676 60 45 9 45 ...
858,zookauf-zwahr.de,1,Startseite 03581 / 405152 Login Kontakt Warenk...


In [13]:
training_text_df.to_csv('data/training_text_df.csv')

In [11]:
unclassified_text_df

Unnamed: 0,domain,text
0,77records.de,DJ Equipment | DJ Zubehör ★ 77records.de <div ...
1,absperrtechnik24.de,"Absperrpfosten, Schilder, Fahrradständer und m..."
2,ackermedia.de,Hosting - Avernis Domain bestellen / Domains u...
3,acris-ecommerce.at,E-Commerce & Shopware Agentur aus Linz OÖ ► AC...
4,adepto-shop.de,ADEPTO - Reinigungsfachhandel - Versandkosten ...
...,...,...
195,xeon-hosting.de,Xeon-Hosting.de TeamSpeak3 Live Chat Warenkorb...
196,xodox.de,"XODOX - Webdesign, Server, E-Mail & SEO aus Fo..."
197,zaubermode.de,kunterbunte Kindermode und Babykleidung - güns...
198,zipf-immobilien24.de,Zipf und Partner Immobilien GmbH - Immobilien ...


In [14]:
unclassified_text_df.to_csv('data/unclassified_text_df.csv')

Uh huh, looking pretty good with these dataframes. I'd say we are ready for the next major step...

# Task 1

Let's do some machine learning stuff! Well, almost... There is the matter of figuring out our approach and, of course, doing the pre-processing that needs to be done.

Sticking with our principles, we will start with the easiest solution, then benchmark it and see where it gets us. So instead of doing some crazy SOTA stuff, we will encode the text into TF-IDF vectors and use those as features for a "classic" intent classifier.

This is very similar to classifying emails as `spam` or `ham`, another binary classification task. It can also be seen as a close relative of multi-class intent matching in NLU, where utterances are matched to intents, as in a voice assistant.

### Why are we using TF-IDF?
Well, based on experience, we have found that TF-IDF vectors make pretty good features for an intent classifier in tasks like this, yet they are super easy to compute and computationally inexpensive. TF-IDF is usually a good place to start.

Our **hypothesis** here is that weighting the terms in each document by their frequency, while lowering the weight of terms that are found in a lot of documents and disregarding word order, is a good way to get a sense of the overall importance of the terms and therefore makes good features. This should be better than a simple bag of words (BoW) approach, which would just use the frequency of each term. However, it might not be as good as other approaches that use deeper context to get more features (such as word embeddings). But then again, it could give us really great results!
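
To make the hypothesis concrete, here is a small worked example in plain Python. It uses the smoothed formula that `sklearn`'s `TfidfVectorizer` applies by default, `idf(t) = ln((1 + N) / (1 + df(t))) + 1`, followed by L2 normalisation of each document vector. The three toy documents are made up, whitespace tokenisation is a simplification, and whether the NLU engine's `TfidfEncoder` keeps these defaults is an assumption.

```python
import math
from collections import Counter

# Made-up toy documents: two "shops" and one "university" page.
docs = [
    "warenkorb kasse und versand",
    "hochschule studium und forschung",
    "warenkorb und preise und versand",
]


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2 norm)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term frequency (raw count) times inverse document frequency.
        raw = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


vectors = tfidf(docs)
for term, weight in sorted(vectors[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:12s}{weight:.3f}")
```

Every word in the first document occurs exactly once, yet `und`, which appears in all three documents, ends up with the lowest weight, while `kasse`, which appears only in that document, gets the highest. A plain bag of words would have given all four words the same value.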

Of course, we could also filter out stop words with BoW or even with TF-IDF, but we will stick to plain TF-IDF for now and see where that gets us. We just love those low hanging fruits!

### Application
To do all of this, we will go with the easy-to-use `sklearn` library. It is what we like to call "old school cool". It is a great way to get started and explore the relationships of the features, train models, and evaluate them. Luckily, I happen to have written an open source NLU engine that uses the same library, so we will use parts of that. Please see this [Secret Sauce AI repo](https://github.com/secretsauceai/NLU-engine-prototype-benchmarks) for more information (watch out: this NLU project is still a major work in progress!). Because the project isn't done, there is no PyPI package for it yet, so we will just grab the pieces we want and go from there.


## Preprocessing TF-IDF and label encoding

In [12]:
tfidf_vectors, tfidf_vectorizer = TfidfEncoder.encode_training_vectors(training_text_df)

In [15]:
#TODO: refactor the NLU Engine stuff to fit with this project (ie remove and rename stuff)

<860x110781 sparse matrix of type '<class 'numpy.float64'>'
	with 421820 stored elements in Compressed Sparse Row format>