# Acquisition

In [129]:
# import the required packages
import numpy as np 
import pandas as pd

import re

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Case Overview

- A B2B company conducted an **acquisition campaign** in which they tried to convert leads into customers.
  In addition, they registered which leads eventually converted into profitable customers and which leads didn't.
    
    
- The **B2B** company also gathered information about the **characteristics of the leads**:
    
   
    a) The commercial dataset: this is a dataset which was bought from commercial vendors who are specialized in
       collecting data from companies such as revenue, net profit, cashflow, number of employees etc. 
       This data is often expensive and contains a lot of missing values.
    
    
    b) The web dataset: this is a dataset which was scraped from the web.
       This is a very cheap source of information but needs a lot of preprocessing since it is unstructured.
       Initially, it contained the textual data found on the website (if available) of each company. 
       In a next step, a text mining algorithm (Singular Value Decomposition) was used to convert this textual 
       data into 200 numerical variables (SVD_1, SVD_2, ..., SVD_200). 
       These variables somehow represent concepts which were found across all the
       websites and the values for these variables represent how much a company focused on this concept on its 
       website. This dataset also contains the Target variable which represents whether a company was converted 
       into a profitable customer or not.
      
- The B2B company wants to **build a model** that tries to find a relationship between the characteristics of the leads on the one hand and the probability of converting into a profitable customer on the other hand. This would allow the B2B company to **only target leads with a high probability of converting** into a profitable customer.

The following picture gives a **quick overview** of the case:

<img src="./Data/acquisition_case_workflow.png" width="800">

The following picture visualizes how **Singular Value Decomposition** (**SVD**) works:

<img src="./data/lsi.png" width="500">

# 1. Data Exploration

In [130]:
# import web data
web_data = pd.read_csv("../case_acquisition/data/web1.csv", encoding="latin1")

# import commercial data, but with NACE_code as a string variable to keep leading zero 
commercial_data = pd.read_csv("../case_acquisition/data/commercial1.csv", encoding="latin1", dtype={'NACE_code': object})

In [131]:
# inspect the first three observations of the web data
web_data.head(3)

Unnamed: 0,URI,NAME,TARGET,_DOCUMENT_,_SVD_1,_SVD_2,_SVD_3,_SVD_4,_SVD_5,_SVD_6,...,_SVD_192,_SVD_193,_SVD_194,_SVD_195,_SVD_196,_SVD_197,_SVD_198,_SVD_199,_SVD_200,_SVDLEN_
0,file://C:\sas\small2\n2-1313http%3A--www.melis...,n2-1313http%3A--www.melisard.de-.txt.htm,0,1,0.390678,-0.118327,0.199925,-0.200753,0.053916,-0.10415,...,-0.013452,0.011185,-0.002033,0.005218,-0.007452,0.007244,-0.00544,0.012882,-0.000899,62.478443
1,file://C:\sas\small2\n2-1327http%3A--www.auto-...,n2-1327http%3A--www.auto-boehler.de-.txt.htm,0,2,0.46309,-0.121661,-0.149195,0.222979,-0.097737,0.000127,...,-0.066052,0.02864,0.020099,-0.013749,-0.006084,-0.004447,0.032138,0.020648,-0.009978,36.631883
2,file://C:\sas\small2\n2-1346http%3A--www.ac-me...,n2-1346http%3A--www.ac-metallteile.de-.txt.htm,0,3,0.425443,-0.192835,-0.162892,0.22198,0.106563,-0.120675,...,0.010257,0.00659,0.058038,0.016583,-0.005697,-0.013547,-0.00541,-0.015903,0.006741,43.768915


In [132]:
# inspect web data shape
web_data.shape

(14227, 205)

In [133]:
# inspect the last five observations of the commercial data
commercial_data.tail()

Unnamed: 0,F1,Company_name,NACE_code,Op__Rev_th_EUR_Last_avail__yr,Web_site_addresses,Cash_flow_th_EUR_Last_avail__yr,Number_of_employees_Last_avail__,Total_assets_th_EUR_Last_avail__,Long_term_debt_th_EUR_Last_avail,Loans_th_EUR_Last_avail__yr,Capital_th_EUR_Last_avail__yr,Sales_th_EUR_Last_avail__yr,Gross_profit_th_EUR_Last_avail__,Profit_margin___Last_avail__yr,Liquidity_ratio_x_Last_avail__yr,Average_cost_of_employee__th__EU,Profit_per_employee__th__EUR_Las,Total_assets_per_employee__th__E,Earnings_yield_______current
375871,101613.0,ZZF ZWEIRADZENTRUM FERNWALD-STEINBACH GMBH,4540,1201.0,www.zzf-gmbh.de,,,2603.584,1324.195,,749.948,1856.0,,,,,,,
375872,371556.0,WIRTSCHAFTSGEMEINSCHAFT ZOOLOGISCHER FACHBETRI...,9499,,www.zzf.de,,,11585.822,1534.59,,206.0,,,,6.354512,,,,
375873,275499.0,HTVG-GES. FÜR TECHNOLOGIEENTWICKLUNG U. VERMÖG...,7490,,www.zzh-herten.de,,,16624.0,9182.0,-4538.0,-2786.0,,,,-0.469264,,,,
375874,96747.0,ZAHNRAD- UND ZERSPANUNGSTECHNIK STELLJES GMBH,2815,1919.0,www.zzt-gmbh.de,,,8178.178,262.672,9408.0,889.565,1764.0,,,3.221941,,,,
375875,375875.0,ZZV ZEITUNGS- UND ZEITSCHRIFTEN VERTRIEB GMBH,5813,,www.zzv-gmbh.de,,,-1240.258,-863.0,-9949.0,-428.435,,,,4.14215,,,,


In [134]:
# inspect commercial data shape
commercial_data.shape

(375876, 19)

# 2. Data Preparation

## 2.1. Merge Commercial and Web data

As a first step, we are going to merge the commercial and web datasets together to obtain a large dataset with **all possible explanatory variables**. Since the website address can be found in both datasets, we are going to **merge the commercial data with the web data by the website address**.

In the web data, the website address can be found in the `NAME` variable, but it is not yet in the correct format.
The website address is always located between the substring "3A--" and its domain extension ".at" or ".de" (the companies from which the websites were scraped were all located in Germany or Austria). 

So for example the website address for the observation with `NAME` variable "n2-1327http%3A--www.auto-boehler.de-.txt.htm" is www.auto-boehler.de
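The extraction rule described above can be sketched with Python's `re` module. This is only an illustration, assuming that the domain extension is always directly followed by a `-` in `NAME`, as in the names shown earlier; the helper `extract_address` and the two made-up names are hypothetical.

```python
import re

# Capture everything between "3A--" and the first ".at" or ".de" that is
# directly followed by "-". Dots are escaped so they match literally.
ADDRESS_PATTERN = re.compile(r"3A--(.+?\.(?:at|de))-")

def extract_address(name):
    """Return the website address inside a NAME value, or None if no .at/.de address is found."""
    match = ADDRESS_PATTERN.search(name)
    if match is None:
        return None
    address = match.group(1)
    # Prefix "www." when it is missing so that addresses can match the commercial data.
    return address if address.startswith("www.") else "www." + address

names = [
    "n2-1327http%3A--www.auto-boehler.de-.txt.htm",  # taken from the web data
    "n2-0001http%3A--example-shop.at-.txt.htm",      # made-up name without "www."
    "n2-0002http%3A--www.example.com-.txt.htm",      # made-up name without .at/.de
]
addresses = [extract_address(n) for n in names]
print(addresses)
```

Names without a matching extension yield `None`, which makes it easy to filter them out before merging.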

**Exercise 1**:
    
1) **Remove** every observation from the web data where the domain extension .at or .de cannot be found in the `NAME` variable (we are only interested in companies with a valid website address)
       
       
2) Extract all the website addresses from the ``NAME`` variable and store them in a **new column** called ``Web_site_addresses``
    
    
3) **Add** "www." to websites where this is missing (this is necessary to obtain the correct matches between the  web data and commercial data)

In [135]:
# work on a copy so that the original web data is not modified in place
web_data_permute = web_data.copy()
web_data_permute["NAME"].head()

0          n2-1313http%3A--www.melisard.de-.txt.htm
1      n2-1327http%3A--www.auto-boehler.de-.txt.htm
2    n2-1346http%3A--www.ac-metallteile.de-.txt.htm
3      n2-1361http%3A--www.awelldigital.de-.txt.htm
4         n2-1393http%3A--www.paulbooch.de-.txt.htm
Name: NAME, dtype: object

In [136]:
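#1 keep only the observations whose NAME contains the domain extension .at or .de
web_data_permute = web_data_permute[web_data_permute['NAME'].str.contains(r'\.(?:at|de)', na=False)].copy()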

In [137]:
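#2 extract the address between "3A--" and "-.txt.htm" (dots are escaped so they match literally)
web_data_permute["Web_site_addresses"] = web_data_permute['NAME'].str.extract(r'3A--(.+?)-\.txt\.htm', expand=False)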
web_data_permute.head()

Unnamed: 0,URI,NAME,TARGET,_DOCUMENT_,_SVD_1,_SVD_2,_SVD_3,_SVD_4,_SVD_5,_SVD_6,...,_SVD_193,_SVD_194,_SVD_195,_SVD_196,_SVD_197,_SVD_198,_SVD_199,_SVD_200,_SVDLEN_,Web_site_addresses
0,file://C:\sas\small2\n2-1313http%3A--www.melis...,n2-1313http%3A--www.melisard.de-.txt.htm,0,1,0.390678,-0.118327,0.199925,-0.200753,0.053916,-0.10415,...,0.011185,-0.002033,0.005218,-0.007452,0.007244,-0.00544,0.012882,-0.000899,62.478443,www.melisard.de
1,file://C:\sas\small2\n2-1327http%3A--www.auto-...,n2-1327http%3A--www.auto-boehler.de-.txt.htm,0,2,0.46309,-0.121661,-0.149195,0.222979,-0.097737,0.000127,...,0.02864,0.020099,-0.013749,-0.006084,-0.004447,0.032138,0.020648,-0.009978,36.631883,www.auto-boehler.de
2,file://C:\sas\small2\n2-1346http%3A--www.ac-me...,n2-1346http%3A--www.ac-metallteile.de-.txt.htm,0,3,0.425443,-0.192835,-0.162892,0.22198,0.106563,-0.120675,...,0.00659,0.058038,0.016583,-0.005697,-0.013547,-0.00541,-0.015903,0.006741,43.768915,www.ac-metallteile.de
3,file://C:\sas\small2\n2-1361http%3A--www.awell...,n2-1361http%3A--www.awelldigital.de-.txt.htm,0,4,0.616054,-0.252506,-0.229162,0.217503,0.214909,0.093692,...,-0.00678,0.034025,0.026627,0.019547,-0.012901,-0.00458,0.031348,0.020085,30.409164,www.awelldigital.de
4,file://C:\sas\small2\n2-1393http%3A--www.paulb...,n2-1393http%3A--www.paulbooch.de-.txt.htm,0,5,0.387577,-0.200605,-0.154955,0.029873,-0.036174,0.028454,...,-0.033938,0.025743,0.021395,0.039515,0.104573,-0.029109,-0.028996,-0.010934,28.295083,www.paulbooch.de


In [138]:
#3 add "www." only where the address does not already start with it
web_data_permute["Web_site_addresses"] = np.where(web_data_permute["Web_site_addresses"].str.startswith("www.", na=False), web_data_permute["Web_site_addresses"], 'www.' + web_data_permute["Web_site_addresses"])

#or
#web_data_permute["Web_site_addresses"] = web_data_permute["Web_site_addresses"].apply(lambda x: x if x.startswith("www.") else "www." + x)

In [139]:
web_data = web_data_permute


Let's check whether your code was correct by comparing the `NAME` variable with the newly created `Web_site_addresses` variable for the **first 5 observations** in the web data.

In [140]:
# check
web_data[["NAME", "Web_site_addresses"]].head(5)

Unnamed: 0,NAME,Web_site_addresses
0,n2-1313http%3A--www.melisard.de-.txt.htm,www.melisard.de
1,n2-1327http%3A--www.auto-boehler.de-.txt.htm,www.auto-boehler.de
2,n2-1346http%3A--www.ac-metallteile.de-.txt.htm,www.ac-metallteile.de
3,n2-1361http%3A--www.awelldigital.de-.txt.htm,www.awelldigital.de
4,n2-1393http%3A--www.paulbooch.de-.txt.htm,www.paulbooch.de


**Exercise 2**:
    
1) **Merge** the web data and commercial data by the website address into a new DataFrame called basetable
    
    
2) **Inspect** the first 3 observations of the basetable

In [141]:
# merge commercial and web data on the website address (inner join keeps only matched companies)
basetable = pd.merge(left=commercial_data, right=web_data, on="Web_site_addresses", how="inner")
basetable.head()

Unnamed: 0,F1,Company_name,NACE_code,Op__Rev_th_EUR_Last_avail__yr,Web_site_addresses,Cash_flow_th_EUR_Last_avail__yr,Number_of_employees_Last_avail__,Total_assets_th_EUR_Last_avail__,Long_term_debt_th_EUR_Last_avail,Loans_th_EUR_Last_avail__yr,...,_SVD_192,_SVD_193,_SVD_194,_SVD_195,_SVD_196,_SVD_197,_SVD_198,_SVD_199,_SVD_200,_SVDLEN_
0,35867.0,1-2-3 GEBÄUDEMANAGEMENT GMBH,6832,4807.0,www.1-2-3gm.de,,,723.714,-733.529,-1475.0,...,-0.045802,-0.020668,-0.034812,-0.019691,-0.008253,0.038824,-0.023298,0.064798,0.011593,26.117231
1,78669.0,1A PERSONALPARTNER GMBH,7820,2133.0,www.1a-personalpartner.de,,,7987.706,872.0,8132.0,...,0.012936,0.005614,0.025265,0.020918,0.02704,0.007349,-0.036119,-0.039724,0.012309,30.332482
2,49946.0,2 K KREATIVKONZEPT GESELLSCHAFT FÜR EFFEKTIVE ...,7311,2842.0,www.2-k.de,,,-3959.834,-571.0,-5418.0,...,0.010125,0.006644,0.009632,0.006227,0.022338,0.004339,-0.006715,0.019218,-0.029916,60.150309
3,55094.0,3CAP TECHNOLOGIES GMBH,7112,2696.0,www.3cap.de,,,-4088.19,-463.0,-6055.0,...,0.009281,0.003556,-0.012276,0.011094,-0.002663,0.006079,0.002901,-0.005705,-0.00812,69.091369
4,78671.0,3D-SCHILLING GMBH,2896,1992.0,www.3d-schilling.de,,,4847.704,360.953,5933.0,...,0.03255,-0.029513,0.017344,-0.006178,0.002543,0.023631,0.003898,0.004457,0.044888,56.614633


Now we can inspect the **distribution of the dependent variable**, which is represented by the `TARGET` variable. This variable indicates whether a company was successfully converted into a profitable customer (`TARGET`=1) or not (`TARGET`=0).

In [142]:
# inspect the distribution of profitable vs non-profitable customers of the new basetable
basetable["TARGET"].value_counts(normalize=True)

1    0.64343
0    0.35657
Name: TARGET, dtype: float64

## 2.2. Variable Creation

The `NACE` code is a variable that came from the commercial data and indicates the **economic sector** of the company 
and could be an important predictor. Let's have a look at how many different economic sectors the companies from our dataset are active in:

In [143]:
print("Number of different economic sectors: %s" %len(set(basetable["NACE_code"])))

Number of different economic sectors: 526


The high number of different economic sectors present in our dataset could result in the model not finding a clear relationship between an economic sector and the probability of successful conversion.
Hence, we are going to **extract the first digit** of the ``NACE_code``, which equals the general economic sector in which the company is active. 

For example, the economic sector *"Manufacture of Electrical Household Appliances"* with NACE code 2751 is a subcategory of the general economic sector *"Manufacturing"*, which is represented by the number 2 in the NACE code. This will **reduce the number of different categories** significantly and will help the model in finding a relationship between the economic sector of a company and the dependent variable.
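Extracting the first digit only works reliably because `NACE_code` was read as a string: a code with a leading zero would lose its first digit if it were parsed as an integer. A minimal sketch, in which the helper `first_digit` is hypothetical and the code `"0111"` is only an illustrative value:

```python
def first_digit(nace_code):
    """Return the first digit of a NACE code that is stored as a string."""
    return int(nace_code[0])

codes_as_strings = ["2751", "6832", "0111"]   # "0111" illustrates a leading zero
codes_as_integers = [int(code) for code in codes_as_strings]

# First digits when the codes are kept as strings
from_strings = [first_digit(code) for code in codes_as_strings]
# First digits after a round trip through int: the leading zero is gone
from_integers = [first_digit(str(code)) for code in codes_as_integers]

print(from_strings)
print(from_integers)
```

The two lists differ only for the code with the leading zero, which would end up in the wrong general sector.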

**Exercise 3**:
    
1) Extract the **first digit** of the ``NACE_code`` variable and store it in a new column called ``NACE_1``. **Compare** the ``NACE_code`` and ``NACE_1`` variables for the first 3 observations to inspect whether your code was correct.


In [144]:
basetable.dtypes

F1                               float64
Company_name                      object
NACE_code                         object
Op__Rev_th_EUR_Last_avail__yr    float64
Web_site_addresses                object
                                  ...   
_SVD_197                         float64
_SVD_198                         float64
_SVD_199                         float64
_SVD_200                         float64
_SVDLEN_                         float64
Length: 224, dtype: object

In [145]:
basetable["NACE_1"] = basetable["NACE_code"].apply(lambda x: int(x[0]))
basetable[["NACE_1","NACE_code"]].head()

Unnamed: 0,NACE_1,NACE_code
0,6,6832
1,7,7820
2,7,7311
3,7,7112
4,2,2896


Finally, we are going to **drop** all the variables we don't need for the analysis.

In [146]:
# define the columns we are going to drop
columns_to_drop = ["F1", "Company_name", "Web_site_addresses", "URI", "NAME","_SVDLEN_", "_DOCUMENT_", "NACE_code"]
# drop these columns from the basetable
basetable = basetable.drop(columns_to_drop, axis=1)

# 3. Train - Val - Test Split

For this specific case, we will **train several models** and **select the best performing model as our final model**.
Therefore, we will split our data into **3 different sets**: the *training set*, *validation set* and *test set*.

   - The ``training set`` will be used for training all the models 
   - The ``validation set`` will be used for evaluating all the models and selecting the best model
   - The ``test set`` will be used for evaluating the best model

In [147]:
# split data randomly into a training set (40%) and a remainder (60%); seed set to 33 to replicate same results
basetable_train, basetable_test = train_test_split(basetable, test_size=0.6, random_state=33)
# split the remainder randomly into a validation and a test set (30% of the data each)
basetable_val, basetable_test = train_test_split(basetable_test, test_size=0.5, random_state=33)

In [148]:
# check shapes
print(basetable_train.shape)
print(basetable_val.shape)
print(basetable_test.shape)

(3293, 217)
(2470, 217)
(2471, 217)


# 4. Missing Value Imputation

Next, we are going to **handle missing values**. In general, there exist **2 strategies** for dealing with missing values: 

1) Removing variables with missing values.
2) Imputing the missing values of the variables. 

Missing values of ``numeric variables`` are often imputed by the **mean** of the observed values of that variable, while missing values of ``categorical variables`` are often imputed by the **mode** of the observed values of that variable. These statistics are **calculated on the training set** and are used to **impute the missing values of the training, validation AND test set**.
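The principle can be sketched on toy data in plain Python. All values and column names below are made up, and `None` stands in for a missing value; the notebook itself applies the same idea with pandas `fillna`.

```python
from statistics import mean, mode

# Toy data: None marks a missing value. All numbers and sector labels are made up.
train = {"revenue": [10.0, None, 30.0, 20.0], "sector": ["2", "7", None, "7"]}
val = {"revenue": [None, 50.0], "sector": [None, "6"]}

def observed(values):
    """Keep only the non-missing entries of a column."""
    return [v for v in values if v is not None]

# Statistics are computed on the training set only ...
fill_values = {
    "revenue": mean(observed(train["revenue"])),  # numeric variable -> mean
    "sector": mode(observed(train["sector"])),    # categorical variable -> mode
}

def impute(dataset):
    """Replace missing entries with the statistics learned on the training set."""
    return {col: [fill_values[col] if v is None else v for v in values]
            for col, values in dataset.items()}

# ... and reused for every split, so nothing is learned from validation or test data.
train_imputed = impute(train)
val_imputed = impute(val)
print(val_imputed)
```

Reusing the training statistics keeps the validation and test sets untouched by any fitting step, which is what makes their performance estimates trustworthy.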

In [149]:
# inspect missing values per variable in training set
for col in basetable_train.columns:
    col_missings = basetable_train[col].isnull().sum()
    if col_missings > 0:
        print(col, " : ", col_missings)

Op__Rev_th_EUR_Last_avail__yr  :  1468
Cash_flow_th_EUR_Last_avail__yr  :  2534
Number_of_employees_Last_avail__  :  2127
Total_assets_th_EUR_Last_avail__  :  437
Long_term_debt_th_EUR_Last_avail  :  454
Loans_th_EUR_Last_avail__yr  :  1387
Capital_th_EUR_Last_avail__yr  :  460
Sales_th_EUR_Last_avail__yr  :  1665
Gross_profit_th_EUR_Last_avail__  :  3274
Profit_margin___Last_avail__yr  :  2765
Liquidity_ratio_x_Last_avail__yr  :  1365
Average_cost_of_employee__th__EU  :  2707
Profit_per_employee__th__EUR_Las  :  2127
Total_assets_per_employee__th__E  :  2521
Earnings_yield_______current  :  3281


For this specific case, we are first going to **drop variables** with more than 50% of the observations missing in the training set. For variables with less than 50% of the observations missing in the training set, we are going to **impute** the missing values with their mean.

**Exercise 4**:
    
1) Get the **percentage of missing values** per variable in training set.
    
    
2) **Store** the names of the variables with more than 50% of the observations missing **in a list** and **drop** these from the basetable.
       
       
3) **Drop** the variables from the train, validation and test set.

In [150]:
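# 1) share of missing values per variable, computed on the training set only
missing_share = basetable_train.isnull().mean()

# 2) names of the variables with more than 50% of the observations missing
high_missing_columns = missing_share[missing_share > 0.5].index.tolist()

# 3) drop these variables from the training, validation and test set
basetable_train = basetable_train.drop(columns=high_missing_columns)
basetable_val = basetable_val.drop(columns=high_missing_columns)
basetable_test = basetable_test.drop(columns=high_missing_columns)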


**Exercise 5**:
    
1) Get the **mean** of all the numeric variables of the training set.
    
    
2) **Impute** the missing values of the training, validation and test set.
    
    
3) Inspect number of **missing values** in training, validation and test set.

In [151]:
#get names numeric columns
numeric_columns = basetable_train.select_dtypes(include=np.number).columns.tolist()
numeric_columns

['Op__Rev_th_EUR_Last_avail__yr',
 'Total_assets_th_EUR_Last_avail__',
 'Long_term_debt_th_EUR_Last_avail',
 'Loans_th_EUR_Last_avail__yr',
 'Capital_th_EUR_Last_avail__yr',
 'Liquidity_ratio_x_Last_avail__yr',
 'TARGET',
 '_SVD_1',
 '_SVD_2',
 '_SVD_3',
 '_SVD_4',
 '_SVD_5',
 '_SVD_6',
 '_SVD_7',
 '_SVD_8',
 '_SVD_9',
 '_SVD_10',
 '_SVD_11',
 '_SVD_12',
 '_SVD_13',
 '_SVD_14',
 '_SVD_15',
 '_SVD_16',
 '_SVD_17',
 '_SVD_18',
 '_SVD_19',
 '_SVD_20',
 '_SVD_21',
 '_SVD_22',
 '_SVD_23',
 '_SVD_24',
 '_SVD_25',
 '_SVD_26',
 '_SVD_27',
 '_SVD_28',
 '_SVD_29',
 '_SVD_30',
 '_SVD_31',
 '_SVD_32',
 '_SVD_33',
 '_SVD_34',
 '_SVD_35',
 '_SVD_36',
 '_SVD_37',
 '_SVD_38',
 '_SVD_39',
 '_SVD_40',
 '_SVD_41',
 '_SVD_42',
 '_SVD_43',
 '_SVD_44',
 '_SVD_45',
 '_SVD_46',
 '_SVD_47',
 '_SVD_48',
 '_SVD_49',
 '_SVD_50',
 '_SVD_51',
 '_SVD_52',
 '_SVD_53',
 '_SVD_54',
 '_SVD_55',
 '_SVD_56',
 '_SVD_57',
 '_SVD_58',
 '_SVD_59',
 '_SVD_60',
 '_SVD_61',
 '_SVD_62',
 '_SVD_63',
 '_SVD_64',
 '_SVD_65',
 ...

In [152]:
# mean of every numeric variable, computed on the training set only
mean_dict = basetable_train[numeric_columns].mean().to_dict()

# impute the missing values of all three sets with the training means
basetable_train = basetable_train.fillna(mean_dict)
basetable_val = basetable_val.fillna(mean_dict)
basetable_test = basetable_test.fillna(mean_dict)

print(basetable_train.isnull().sum().sum())
print(basetable_val.isnull().sum().sum())
print(basetable_test.isnull().sum().sum())



0
0
0


# 5. Standardization

Next, we are going to **standardize the numeric features**. Here this is done with min-max scaling, which rescales every feature to the range [0, 1] on the training set. The statistics required for scaling (the minimum and maximum of each feature) are first extracted from the ``training set``. Afterwards, the features from the training, validation AND test set are scaled by using these statistics.
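As a rough sketch of what min-max scaling does, the snippet below rescales a made-up feature with (x - min) / (max - min), where the minimum and maximum come from the training values only. The helper names are hypothetical; the notebook itself relies on `MinMaxScaler`.

```python
def fit_min_max(values):
    """Learn the minimum and maximum of a feature on the training set."""
    return min(values), max(values)

def min_max_scale(values, lo, hi):
    """Rescale values with (x - min) / (max - min) using the training statistics."""
    return [(v - lo) / (hi - lo) for v in values]

train_feature = [2.0, 4.0, 6.0, 10.0]   # made-up training values
val_feature = [4.0, 12.0]               # made-up validation values

lo, hi = fit_min_max(train_feature)
train_scaled = min_max_scale(train_feature, lo, hi)
val_scaled = min_max_scale(val_feature, lo, hi)

print(train_scaled)
print(val_scaled)
```

Because the statistics come from the training set, a validation or test value outside the training range (12.0 here) ends up outside [0, 1]. This is expected and not a sign of an error.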

In [153]:
# define all the numeric features
all_columns = basetable_train.columns
numeric_features = [obs for obs in all_columns if obs not in ["TARGET", "NACE_1"]]

In [154]:
# import min max scaler
from sklearn.preprocessing import MinMaxScaler

# initialize the scaler
scaler = MinMaxScaler()

# fit scaler on all the numeric variables from training set
scaler.fit(basetable_train[numeric_features])

# scale features
basetable_train[numeric_features] = scaler.transform(basetable_train[numeric_features])
basetable_val[numeric_features] = scaler.transform(basetable_val[numeric_features])
basetable_test[numeric_features] = scaler.transform(basetable_test[numeric_features])

In [155]:
# check
basetable_train.describe()

Unnamed: 0,Op__Rev_th_EUR_Last_avail__yr,Total_assets_th_EUR_Last_avail__,Long_term_debt_th_EUR_Last_avail,Loans_th_EUR_Last_avail__yr,Capital_th_EUR_Last_avail__yr,Liquidity_ratio_x_Last_avail__yr,TARGET,_SVD_1,_SVD_2,_SVD_3,...,_SVD_192,_SVD_193,_SVD_194,_SVD_195,_SVD_196,_SVD_197,_SVD_198,_SVD_199,_SVD_200,NACE_1
count,3293.0,3293.0,3293.0,3293.0,3293.0,3293.0,3293.0,3293.0,3293.0,3293.0,...,3293.0,3293.0,3293.0,3293.0,3293.0,3293.0,3293.0,3293.0,3293.0,3293.0
mean,0.001911,0.000947,0.000677,0.001609,0.025302,0.042203,0.638627,0.595768,0.216884,0.166162,...,0.423436,0.356348,0.396669,0.428151,0.491102,0.501764,0.63813,0.507029,0.583741,4.739144
std,0.020596,0.01944,0.017998,0.017656,0.022804,0.057795,0.480471,0.166911,0.211721,0.162343,...,0.067853,0.069343,0.062097,0.098082,0.068814,0.098024,0.075021,0.101492,0.073703,2.176934
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7.7e-05,8.2e-05,2.7e-05,0.000788,0.023246,0.02088,0.0,0.502885,0.095755,0.071363,...,0.386163,0.318323,0.364895,0.37204,0.45492,0.450688,0.597383,0.449773,0.542004,3.0
50%,0.001177,0.00015,4.7e-05,0.001609,0.023752,0.042203,1.0,0.624307,0.138669,0.117178,...,0.423389,0.357531,0.396761,0.426903,0.491756,0.503089,0.637795,0.505825,0.58595,4.0
75%,0.001911,0.000244,0.000105,0.001609,0.024357,0.042203,1.0,0.713176,0.218617,0.201285,...,0.459552,0.394064,0.431065,0.480279,0.526504,0.555955,0.678895,0.562664,0.624408,7.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,9.0


# 6. Modeling

Now that we have done all the data preprocessing, we can **start building our models**.

## 6.1. Logistic Regression

Instead of fitting one logistic regression model on the training set, 
we are going to **fit multiple models** on the training set: 

   - One model which is trained only with variables coming from the commercial data
   - One model which is trained only with variables coming from the web data
   - One model which is trained on the variables coming from the commercial and web data

This will allow us to compare the **predictive performance** of each model and to **investigate** whether the augmentation of the commercial data with the scraped web data increases the predictive performance of the model.

**Exercise 6**:
    
1) Create **a function** that fits a logistic regression model on a user-specified training set with a user-specified list of independent features and a certain dependent variable (**1**). Once the model is fit, the function should also make predictions on the user-specified evaluation set (**2**). These predictions should be the predicted probabilities instead of the predicted categories and can be obtained by the ``predict_proba`` function of a trained sklearn model.
       
    - The function should thus accept **4 parameters**: 
        - A list of the names of the features
        - The name of the dependent variable
        - The training dataset
        - The evaluation dataset
    - The function should **return** the predicted probabilities for each observation from the evaluation dataset

In [156]:
# complete the function
def fit_lr_with_features(features, dependent_variable, train_data, test_data):
    # initialize logistic regression model
    lr_model = LogisticRegression(max_iter=1000)

    # fit logistic regression model on training set
    lr_model.fit(X=train_data[features], y=train_data[dependent_variable])

    # get predicted probabilities (one column per class) for the evaluation set
    test_preds = lr_model.predict_proba(X=test_data[features])

    # return predictions
    return test_preds

Next, we **define the features** coming from the web data and the commercial data.

In [157]:
# define features coming from web data
web_data_features = ["_SVD_%s" %i for i in range(1, 201)]

# define features coming from commercial data
com_data_features = ["Op__Rev_th_EUR_Last_avail__yr",
                      "Total_assets_th_EUR_Last_avail__", 
                      "Long_term_debt_th_EUR_Last_avail",
                      "Loans_th_EUR_Last_avail__yr",
                      "Capital_th_EUR_Last_avail__yr",
                      "Liquidity_ratio_x_Last_avail__yr",
                      "NACE_1"]

# define features coming from web and commercial data
all_data_features = web_data_features + com_data_features

Now that we defined the features coming from the different datasets, we can use our function that we created in the previous exercise to train and evaluate a logistic regression model **for each separate set of features**.

**Exercise 7**:
    
1) Get the predicted probabilities for the observations from the validation set of the model which was trained with the independent variables coming from the **web data** and the dependent variable being ``TARGET``
       
2) Get the predicted probabilities for the observations from the validation set of the model which was trained with the independent variables coming from the **commercial data** and the dependent variable being ``TARGET``
    
3) Get the predicted probabilities for the observations from the validation set of the model which was trained with the independent variables coming from the **web and commercial data** and the dependent variable being ``TARGET``

In [158]:
# get predictions for model trained on features coming from web data
web_val_preds = fit_lr_with_features(features=web_data_features, dependent_variable="TARGET", train_data=basetable_train, test_data=basetable_val)


# get predictions for model trained on features coming from commercial data
com_val_preds = fit_lr_with_features(features=com_data_features, dependent_variable="TARGET", train_data=basetable_train, test_data=basetable_val)


# get predictions for model trained on features coming from web and commercial data
all_val_preds = fit_lr_with_features(features=all_data_features, dependent_variable="TARGET", train_data=basetable_train, test_data=basetable_val)

## 6.2. Model Selection

Next, we can **evaluate every model** by comparing its predicted probabilities with the true values.
The ``AUC`` is a good evaluation metric for a binary model since it represents the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example. Hence, an AUC of 1 represents a perfect model, while an AUC of 0.5 represents a random model.
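The ranking interpretation can be checked by hand on a small example. The sketch below counts, over all (positive, negative) pairs, how often the positive example receives the higher score; the function name `pairwise_auc` and the data are made up for illustration, and ties are counted as half a correct ranking.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as half a correct ranking."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 0, 1, 0]               # made-up true labels
scores = [0.9, 0.8, 0.7, 0.4, 0.2]     # made-up predicted probabilities

auc = pairwise_auc(labels, scores)
print(round(auc, 3))
```

Here 5 of the 6 positive-negative pairs are ranked correctly, so the AUC is 5/6. The double loop is only meant to make the definition concrete; for real data, `roc_auc_score` is the practical choice.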

In [None]:
# get auc score of the web model
fpr_w, tpr_w, threshold = roc_curve(y_true=basetable_val["TARGET"], y_score=web_val_preds[:, 1])
auc_w = roc_auc_score(y_true=basetable_val["TARGET"], y_score=web_val_preds[:, 1])

# get auc score of the commercial model
fpr_c, tpr_c, threshold = roc_curve(y_true=basetable_val["TARGET"], y_score=com_val_preds[:, 1])
auc_c = roc_auc_score(y_true=basetable_val["TARGET"], y_score=com_val_preds[:, 1])

# get auc score of the complete model
fpr_a, tpr_a, threshold = roc_curve(y_true=basetable_val["TARGET"], y_score=all_val_preds[:, 1])
auc_a = roc_auc_score(y_true=basetable_val["TARGET"], y_score=all_val_preds[:, 1])

# create plot with roc curves of each model
plt.figure(figsize=(6, 6))
plt.plot(fpr_w, tpr_w, label="Web model -- auc: %s" %round(auc_w, 2))
plt.plot(fpr_c, tpr_c, label="Com model -- auc: %s" %round(auc_c, 2))
plt.plot(fpr_a, tpr_a, label="All model -- auc: %s" %round(auc_a, 2))
plt.plot([0, 1], [0, 1],'r--', label="Random model -- auc: 0.5")
plt.legend(loc = 'upper left')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

The models trained on the features coming from the **web data** (Web model) and on **all the features** (All model) have the highest AUC. However, the model trained on the web data (Web model) has fewer features and hence is **less complex**. Therefore, we will choose the model trained on the web data as our final model (i.e., Occam's razor). We will **evaluate this model** on the test set to get our final unbiased estimate of the model performance.

# 7. Model Evaluation

**Exercise 8**:
    
Get the **predicted probabilities** for all the observations in the test set by using the best performing model.

In [None]:
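# train final model (web features only) on training set and get predictions on test set
final_model_test_preds = fit_lr_with_features(features=web_data_features, dependent_variable="TARGET", train_data=basetable_train, test_data=basetable_test)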

## 7.1. AUC

**Exercise 9**:
    
Get the ``AUC`` of the **final model** on the test set and **plot** the ``ROC`` curve.

In [None]:
# Write your code here




## 7.2. Lift

Finally, we will evaluate our model in terms of the ``lift``. The lift measures how much better the model performs than a random model **for different chunks of data**. More specifically, the observations are **first sorted** by their predicted probability of being positive (in this case, being successfully converted into a customer). Next, these observations are **grouped into chunks** (most often percentiles, each containing 1% of the observations). The first chunk contains the observations with the highest predicted probabilities, the second chunk those with the next highest, and so on. We can therefore expect the share of true positives to be highest in the first chunk and to decrease from chunk to chunk.
Finally, for each chunk, the model performance is **compared with the random model** by dividing the proportion of true positives in the chunk by the overall proportion of positives (which is what a random model would achieve on average). The result is a **lift score** for each chunk, indicating how many times better the model performs than the random model for that particular chunk. 

For **example**: 
Suppose that there are 1000 companies in the test set and that the proportion of successfully converted companies
equals 0.6.
First we will sort these companies by their predicted probability of being successfully converted into a customer.
Next we will split these ranked companies into percentiles, such that each chunk will contain 10 companies.
Now suppose that in the first chunk there are 9 companies whose true label is 1 and 1 company whose true label is 0. 
Then the lift score of this chunk is 0.9 / 0.6 = 1.5, meaning that the model is 1.5 times better than the random model for this chunk. 
Now suppose that in the second chunk there are only 7 companies whose true label is 1 and 3 companies whose true label is 0. 
Then the lift score of this chunk is 0.7 / 0.6 ≈ 1.17, meaning that the model is about 1.17 times better than the random model for this chunk.
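The arithmetic of this example can be reproduced with a few lines of plain Python. The sketch assumes the labels are already sorted by predicted probability (highest first); the function name `lift_per_chunk` is hypothetical, and the labels of the remaining 980 companies are made up so that the overall proportion of positives equals 0.6.

```python
def lift_per_chunk(labels_sorted, n_chunks):
    """Lift of each chunk: share of positives in the chunk divided by the
    overall share of positives. `labels_sorted` must already be ordered by
    predicted probability, highest first."""
    base_rate = sum(labels_sorted) / len(labels_sorted)
    chunk_size = len(labels_sorted) // n_chunks
    lifts = []
    for i in range(n_chunks):
        chunk = labels_sorted[i * chunk_size:(i + 1) * chunk_size]
        lifts.append((sum(chunk) / len(chunk)) / base_rate)
    return lifts

# 1000 companies, 600 of them converted (base rate 0.6), 100 chunks of 10.
first_chunk = [1] * 9 + [0] * 1        # 9 of 10 converted
second_chunk = [1] * 7 + [0] * 3       # 7 of 10 converted
# Remaining 980 companies: 584 converted, order chosen arbitrarily for this sketch.
rest = [1] * 584 + [0] * 396
labels_sorted = first_chunk + second_chunk + rest

lifts = lift_per_chunk(labels_sorted, n_chunks=100)
print(round(lifts[0], 2), round(lifts[1], 2))
```

A chunk that contains only converted companies reaches the highest possible lift, 1 / 0.6 ≈ 1.67 in this example.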

<img src="./data/lift.png" width="500">

**Exercise 10**:
    
1) **Extract the label** and the **predicted probability** of being successfully converted into a customer of each observation in the test set and **store** these 2 variables into a new DataFrame.
        
2) **Sort** the observations of this new DataFrame by the predicted probability in *descending* order.
    
3) Get the **proportion** of companies being successfully converted in the test set.

In [None]:
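# join dependent variable and predictions of test set in new DataFrame
# (the column name "label" is chosen here; "prediction" is the name used in the lift loop)
label_pred = pd.DataFrame({"label": basetable_test["TARGET"].values, "prediction": final_model_test_preds[:, 1]})


# sort DataFrame by predicted probability, highest first
label_pred = label_pred.sort_values(by="prediction", ascending=False)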

# check
label_pred.head(5)

In [None]:
# get global proportion of profitable customers in test set
prop_profitable_cust_test = basetable_test["TARGET"].mean()

# check 
print("Proportion of profitable customers: %s" %prop_profitable_cust_test)

In [None]:
# calculate max lift
max_lift = 1. / prop_profitable_cust_test
# check
print("Max lift: %s" %max_lift)

**Exercise 11**:
    
Now that the companies are ranked according to their predicted probability of being successfully converted 
into a customer, we can calculate the lift score for each percentile. 
In the following code, companies are already properly assigned to each chunk by making use of a for loop.

1) Complete the code by **calculating the lift score for each chunk** and store this lift score in the ``lift_scores`` list.
        
        
2) **Plot the lift scores for each chunk** with the chunk number on the X-axis and the lift score on the Y-axis.

In [None]:
# initialize list for storing lift scores 
lift_scores = []

# loop through percentiles
for i in reversed(range(100)):
    # divide dataframes into percentiles of predicted probabilities
    start_perc = label_pred["prediction"].quantile(i / 100.)
    end_perc = label_pred["prediction"].quantile((i+1) / 100.)
    chunk = label_pred[(label_pred["prediction"] >= start_perc) &  (label_pred["prediction"] < end_perc)]
    
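    # get lift score of each chunk: share of positives in the chunk divided by the global share
    # (assumes the true labels are stored in a "label" column of label_pred)
    lift_score = chunk["label"].mean() / prop_profitable_cust_test
    
    # add lift score to lift_scores list
    lift_scores.append(lift_score)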

In [None]:
# plot lift curve
plt.figure(figsize=(8, 5))
plt.plot(range(100), lift_scores, label="Lift curve")
plt.plot(range(100), [max_lift for i in range(100)], label="Max lift")
plt.plot(range(100), [1 for i in range(100)], label="Random model")
plt.xlabel("Percentile")
plt.ylabel("Lift")
plt.legend(loc="lower left")
plt.show()