# Athens University of Economics and Business
# M.Sc. in Data Science
# Course: Data Mining
# Author:  Spiros Politis
# Homework: 1

---

The goal of this assignment is to implement a simple workflow that will assess the similarity between bank customers and suggest for any input customer a list of his/her 10 most similar other customers. In order to calculate the similarity between customers you will first have to compute the dissimilarity for every given attribute as discussed in lecture “Measuring Data Similarity”. In order to fulfill this assignment, you will have to perform the following tasks:

## Import required packages and custom libraries

In [1]:
from IPython.display import Markdown as md
import numpy as np
import pandas as pd
import sys

In [2]:
sys.path.append("./code/")

from DmHomework1 import UtilHelper, ConverterHelper, MeasuresHelper

## Import and pre-process the dataset with bank customers

### Question

You will download the bank.csv dataset from e-class. This dataset is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls in order to assess whether the product (bank term deposit) would be subscribed or not. The dataset includes 43192 bank customer profiles with 10 attributes each. The class attribute should be ignored. The last attribute is an array containing the bank products (1-20) each customer has. A full description of the dataset and the attributes is provided in the bank-names.txt file. For any numerical missing values, you should replace them with the average value of the attribute (keeping the integer part of the average).

### Answer

In [3]:
util_helper = UtilHelper.UtilHelper()
converter_helper = ConverterHelper.ConverterHelper()

#### Ingest data

In [4]:
df = pd.read_csv(
    "data/bank.csv", 
    sep = ";", 
    header = 0, 
    names = [
        "age", 
        "job", 
        "marital", 
        "education", 
        "default", 
        "balance", 
        "housing", 
        "loan", 
        "class", 
        "products"
    ], 
    converters = {
        "age": np.float64,
        "job": str, 
        "marital": str, 
        "education": str, 
        "default": str, 
        "balance": np.float64, 
        "housing": str, 
        "loan": str, 
        "class": str, 
        "products": str
    }, 
    na_values = [" "], 
    keep_default_na = False, 
    engine = "python"
)

#### Inspect data

We briefly inspect the data set, its length and its column data types, to get a feel for the data and to identify any changes that should be made.

In [5]:
df.head(5)

| | age | job | marital | education | default | balance | housing | loan | class | products |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33.0 | entrepreneur | married | secondary | no | 2.0 | yes | yes | no | 13161719 |
| 1 | 35.0 | management | married | tertiary | no | 231.0 | yes | no | no | 4816 |
| 2 | | management | single | tertiary | no | 447.0 | yes | yes | no | 716 |
| 3 | 42.0 | entrepreneur | divorced | tertiary | yes | 2.0 | yes | no | no | 1381011121819 |
| 4 | 58.0 | retired | married | primary | no | 121.0 | yes | no | no | 4567111819 |


In [6]:
util_helper.log.info("Data set length: {}".format(len(df)))

INFO: Data set length: 43191


In [7]:
util_helper.log.info("Data set types:\n{}".format(df.dtypes))

INFO: Data set types:
age           object
job           object
marital       object
education     object
default       object
balance      float64
housing       object
loan          object
class         object
products      object
dtype: object


#### Drop unused columns

##### Column '*class*'

In [8]:
df = df.drop(["class"], axis = 1)

#### Handle NAs

We are required to impute missing values for the numerical columns in the data set. We shall check for NAs in the following columns:

- '*age*'
- '*balance*'

Checking for missing values on *age*:

In [9]:
util_helper.check_na(df, "age")



Column *age* contains NA values, so we replace them with the integer part of the age mean:

In [10]:
# Replace NAs with the integer part of the mean.
df["age"] = df["age"].fillna(np.int32(df["age"].mean()))

# Sanity check.
util_helper.check_na(df, "age")

INFO: age does not contain NA values.
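
As a small worked illustration of the imputation rule (replace missing values with the integer part of the mean), the sketch below uses a toy series instead of the real data. The values are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy ages with one missing value (illustrative values, not taken from bank.csv).
ages = pd.Series([33.0, 35.0, np.nan, 42.0])

# Series.mean() skips NaN by default: (33 + 35 + 42) / 3 = 36.67
mean_age = ages.mean()

# Keep only the integer part of the mean, as the assignment requires.
fill_value = int(mean_age)

imputed = ages.fillna(fill_value)
print(imputed.tolist())  # [33.0, 35.0, 36.0, 42.0]
```

Truncating with `int()` and rounding give different results here (36 versus 37), which is why the assignment states the rule explicitly.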


Checking for missing values on *balance*:

In [11]:
util_helper.check_na(df, "balance")

INFO: balance does not contain NA values.


No action required for column *balance*.

#### Convert data types

We create a copy of the original DataFrame, which will be used for the dissimilarity metric calculations after its columns have been converted to appropriate data types. The original DataFrame is kept for display purposes, since it retains the original column values.

In [12]:
df_t = df.copy(deep = True)

##### Conversion of numeric types

In [13]:
# Converting age to np.int32.
df_t["age"] = np.int32(df_t["age"])

##### Conversion of nominal types

In [14]:
# Converting nominal variables to numeric.
df_t = converter_helper.nominal_to_numeric(df = df_t, column = "job")
df_t = converter_helper.nominal_to_numeric(df = df_t, column = "marital")
df_t = converter_helper.nominal_to_numeric(df = df_t, column = "default")
df_t = converter_helper.nominal_to_numeric(df = df_t, column = "housing")
df_t = converter_helper.nominal_to_numeric(df = df_t, column = "loan")
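
The body of `nominal_to_numeric` is not shown in this report. One common way to map nominal labels to integer codes is through pandas categorical codes, sketched below. The function name `nominal_to_codes` is hypothetical, and this is not necessarily how *ConverterHelper* does it.

```python
import pandas as pd

def nominal_to_codes(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df with `column` replaced by integer category codes."""
    out = df.copy()
    # Categories are sorted alphabetically; the codes carry no order semantics.
    out[column] = out[column].astype("category").cat.codes
    return out

toy = pd.DataFrame({"marital": ["married", "single", "divorced", "married"]})
coded = nominal_to_codes(toy, "marital")
print(coded["marital"].tolist())  # [1, 2, 0, 1]
```

Because the codes are arbitrary labels, they should only ever be compared for equality, never subtracted.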

##### Conversion of ordinal types

Converting ordinal variables requires knowledge about the order of their values. Let's identify the distinct values the variable can take.

In [15]:
pd.unique(df_t["education"])

array(['secondary', 'tertiary', 'primary'], dtype=object)

Order should be $primary \lt secondary \lt tertiary$.

In [16]:
# Converting ordinal variables to numeric.
df_t = converter_helper.ordinal_to_numeric(
    df = df_t, 
    column = "education", 
    categories = [
        "primary", 
        "secondary", 
        "tertiary"
    ]
)
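
For ordinal variables the codes must follow the stated order. A minimal sketch, assuming an ordered pandas `Categorical` is an acceptable stand-in for `ordinal_to_numeric` (whose body is not shown here); the name `ordinal_to_ranks` is hypothetical.

```python
import pandas as pd

def ordinal_to_ranks(df: pd.DataFrame, column: str, categories: list) -> pd.DataFrame:
    """Return a copy of df with `column` replaced by its rank in `categories`."""
    out = df.copy()
    # The position of each label in `categories` becomes its integer rank.
    ordered = pd.Categorical(out[column], categories=categories, ordered=True)
    out[column] = ordered.codes
    return out

toy = pd.DataFrame({"education": ["secondary", "tertiary", "primary"]})
ranked = ordinal_to_ranks(toy, "education", ["primary", "secondary", "tertiary"])
print(ranked["education"].tolist())  # [1, 2, 0]
```

Unlike the nominal case, differences between these ranks are meaningful, which is what the ordinal dissimilarity measure relies on.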

##### Conversion of set variable types

We shall proceed with converting column *products* from a comma-separated list of values to a list object. The reason for doing so is to be able to perform set operations more efficiently.

In [17]:
# Converting comma-separated values to List.
df_t = converter_helper.comma_separated_to_list(df = df_t, column = "products")
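
A conversion of this kind could look like the sketch below. It assumes the raw values are strings such as `"13,16,17,19"`, as the text above describes; the function name `split_products` is hypothetical and the toy values are illustrative.

```python
import pandas as pd

def split_products(value: str) -> list:
    """Parse a string such as "13,16,17,19" into [13, 16, 17, 19]."""
    # Empty tokens are skipped so that an empty string yields an empty list.
    return [int(token) for token in value.split(",") if token.strip()]

toy = pd.DataFrame({"products": ["13,16,17,19", "4,8,16", ""]})
toy["products"] = toy["products"].apply(split_products)
print(toy["products"].tolist())  # [[13, 16, 17, 19], [4, 8, 16], []]
```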

##### Remarks and sanity check

We have converted the DataFrame types as follows:

In [18]:
util_helper.log.info("Data set types\n{}".format(df_t.dtypes))

INFO: Data set types
age            int32
job             int8
marital         int8
education       int8
default         int8
balance      float64
housing         int8
loan            int8
products      object
dtype: object


Note that for nominal, ordinal and set variables we have created **new** DataFrame columns to accommodate the converted types (e.g. '*job*' -> '*job_numeric*', '*education*' -> '*education_numeric*' etc.). It is these columns that we will be using for our computation, while leaving the original columns intact for consistency / display purposes.

## Compute data (dis-)similarity

### Question

In order to measure the similarity between the bank customers you could form the dissimilarity matrix for all given attributes. As described in lecture "Measuring Data Similarity", for every given attribute you first distinguish its type (categorical, ordinal, numerical or set) and then compute the dissimilarity of its values accordingly. For set similarity use the Jaccard similarity between sets. Then, you can calculate the average of the computed dissimilarities in order to form the dissimilarity over all attributes. Depending on the machine used to implement this assignment, you should decide whether it is feasible to compute the dissimilarity matrices or to have the computations performed on-the-fly for a pair of customers.

### Answer

#### Determining feasibility of batch computation

Our initial thought is to create dissimilarity matrices for every variable in the data set. To do this, we need to create an $m \times n \times n$ matrix, where $m=9$ (number of variables), $n=43190$ (length of dataset). Also, the data type of the NumPy matrix should be at least *np.float32*, so as to accommodate measures that require float precision (e.g. Jaccard dissimilarity).

Proceeding with experimenting on the memory requirements:

In [19]:
m = 9
n = 43190

try:
    mat = np.zeros((m, n, n), dtype = np.float32)
except MemoryError as me:
    util_helper.log.error("Memory error: {}.".format(me))

ERROR: Memory error: Unable to allocate 62.5 GiB for an array with shape (9, 43190, 43190) and data type float32.


Although only the lower triangle of each dissimilarity matrix is actually needed, a dense in-memory representation forces us to store it, inefficiently, as the full $m \times n \times n$ matrix mentioned previously. To circumvent this, we could use *SciPy* sparse matrix representations.

However, this would be overkill for the scope of the assignment.

We conclude that it is infeasible to batch produce the measures, so we shall proceed with on-the-fly computation: for a given customer, we compute one dissimilarity column vector per variable. Note that, in our effort to keep computation as efficient as possible, we have avoided explicit for loops in favor of vectorized Pandas / NumPy computations.
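
The memory figures behind this decision can be reproduced with plain arithmetic. The sketch below compares the full array from the failed allocation, a hypothetical layout that stores only the strict lower triangle, and the per-customer vectors used by the on-the-fly approach.

```python
m = 9                # number of variables
n = 43190            # number of customers used in the allocation attempt above
bytes_per_value = 4  # size of one np.float32

GIB = 2 ** 30
MIB = 2 ** 20

# Full m x n x n array.
full_bytes = m * n * n * bytes_per_value
# Strict lower triangle only: n * (n - 1) / 2 values per variable.
condensed_bytes = m * (n * (n - 1) // 2) * bytes_per_value
# One dissimilarity vector per variable for a single reference customer.
on_the_fly_bytes = m * n * bytes_per_value

print(f"full:       {full_bytes / GIB:.1f} GiB")        # full:       62.5 GiB
print(f"condensed:  {condensed_bytes / GIB:.1f} GiB")   # condensed:  31.3 GiB
print(f"on-the-fly: {on_the_fly_bytes / MIB:.1f} MiB")  # on-the-fly: 1.5 MiB
```

Even the triangular layout would need roughly half of the full allocation, which is still beyond a typical workstation, whereas the per-customer vectors fit in a couple of megabytes.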

#### Performing dissimilarity computations

The code for producing dissimilarity metrics has been developed in the Python class '*MeasuresHelper*'. The calculating functions are described below:

- '*numeric_dissimilarity*': produces the dissimilarity vector for **numeric** variables. Numeric dissimilarity is defined as 

<br />

$$d(a, b) = \frac{\mid a - b \mid}{\max(value) - \min(value)}$$

<br />

- '*nominal_dissimilarity*': produces the dissimilarity vector for **nominal** variables. Nominal dissimilarity is defined as

<br />

$$d(a, b) = \begin{cases} 1 & \text{if } a \neq b \\ 0 & \text{otherwise} \end{cases}$$

<br />

- '*ordinal_dissimilarity*': produces the dissimilarity vector for **ordinal** variables. Ordinal dissimilarity is defined as 

<br />

$$d(a, b) = \frac{\mid rank(a) - rank(b) \mid}{\max(rank) - \min(rank)}$$

<br />

- '*set_dissimilarity*': produces the dissimilarity vector for **set** type variables. Set dissimilarity is defined as 

<br />

$$d(S_{1}, S_{2}) = 1 - (\mid S_{1} \cap S_{2} \mid  / \mid S_{1} \cup S_{2} \mid)$$

<br />

The interface for all functions is uniform and requires:

- The **source data set** as a Pandas DataFrame.
- The **column (variable)** for which to perform the computation.
- The **index of the data set entry** against which the computation is to be performed.

All functions return the dissimilarity vector of the entry under examination against every entry in the data set, for the given variable.
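
The four definitions and the shared interface can be sketched as follows. These are illustrative re-implementations written for this explanation, not the code of *MeasuresHelper* (which is not shown here); the toy DataFrame and the handling of constant columns and empty sets are assumptions of the sketch.

```python
import pandas as pd

def numeric_dissimilarity(df, column, index_to_compare):
    """|a - b| scaled by the value range of the column."""
    values = df[column].astype(float)
    value_range = values.max() - values.min()
    if value_range == 0:
        # Constant column: every pair of values is identical.
        return pd.Series(0.0, index=df.index)
    return (values - values.loc[index_to_compare]).abs() / value_range

def nominal_dissimilarity(df, column, index_to_compare):
    """1.0 where the value differs from the reference entry, 0.0 otherwise."""
    return (df[column] != df.loc[index_to_compare, column]).astype(float)

def ordinal_dissimilarity(df, column, index_to_compare):
    """Same formula as the numeric case, applied to integer ranks."""
    return numeric_dissimilarity(df, column, index_to_compare)

def set_dissimilarity(df, column, index_to_compare):
    """Jaccard dissimilarity: 1 - |intersection| / |union|."""
    reference = set(df.loc[index_to_compare, column])

    def jaccard(other):
        other = set(other)
        union = reference | other
        # Two empty sets are treated as identical (an assumption of this sketch).
        return 0.0 if not union else 1.0 - len(reference & other) / len(union)

    return df[column].apply(jaccard)

toy = pd.DataFrame({
    "age": [30, 40, 50],
    "job": [0, 1, 0],
    "education": [0, 1, 2],
    "products": [[1, 2], [2, 3], [1, 2]],
})
print(numeric_dissimilarity(toy, "age", 0).tolist())        # [0.0, 0.5, 1.0]
print(nominal_dissimilarity(toy, "job", 0).tolist())        # [0.0, 1.0, 0.0]
print(ordinal_dissimilarity(toy, "education", 0).tolist())  # [0.0, 0.5, 1.0]
print(set_dissimilarity(toy, "products", 0).tolist())
```

The first three functions are fully vectorized. The set version in this sketch still applies a Python function per row, so it is the most likely bottleneck on the full data set.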

Finally, function '*all*' combines the dissimilarity measures for all 9 variables we need to examine into a single matrix, with one row per customer and one column per variable, plus a column holding their mean.
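
The layout of that combined matrix is inferred from how it is indexed later in this report (`dissimilarity_matrix[:, 9]` is sorted as the mean dissimilarity). A minimal sketch of such a layout, assuming the mean is appended as the last column and using toy vectors for 3 variables instead of 9:

```python
import numpy as np

# Hypothetical per-variable dissimilarity vectors for n = 4 customers and
# m = 3 variables, all measured against customer 0 (toy values).
per_variable = [
    np.array([0.0, 0.2, 0.9, 0.1]),  # e.g. a numeric variable
    np.array([0.0, 1.0, 1.0, 0.0]),  # e.g. a nominal variable
    np.array([0.0, 0.5, 1.0, 0.0]),  # e.g. an ordinal variable
]

# One row per customer, one column per variable.
measures = np.column_stack(per_variable)

# Append the unweighted mean over variables as an extra, last column.
mean_column = measures.mean(axis=1, keepdims=True)
dissimilarity_matrix = np.hstack([measures, mean_column])

print(dissimilarity_matrix.shape)  # (4, 4)
print(dissimilarity_matrix[:, -1])
```

With 9 variables the mean would land at column index 9, which matches the indexing used in the cells that follow.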

Let's proceed with testing our dissimilarity measures against customer_id = 1.

In [20]:
measures_helper_dis = MeasuresHelper.MeasuresHelper()

In [21]:
index_to_compare = util_helper.customer_id_to_index(1)

##### Testing numeric dissimilarity

In [22]:
measures_helper_dis.numeric_dissimilarity(df = df_t, column = "age", index_to_compare = index_to_compare)

DEBUG: computing numeric dissimilarity for age


0        0.000000
1        0.025974
2        0.090909
3        0.116883
4        0.324675
           ...   
43186    0.233766
43187    0.493506
43188    0.506494
43189    0.311688
43190    0.051948
Name: age, Length: 43191, dtype: float64

##### Testing nominal dissimilarity

In [23]:
measures_helper_dis.nominal_dissimilarity(df = df_t, column = "job", index_to_compare = index_to_compare)

DEBUG: computing nominal dissimilarity for job


0        0.0
1        1.0
2        1.0
3        0.0
4        1.0
        ... 
43186    1.0
43187    1.0
43188    1.0
43189    1.0
43190    0.0
Length: 43191, dtype: float64

##### Testing ordinal dissimilarity

In [24]:
measures_helper_dis.ordinal_dissimilarity(df = df_t, column = "education", index_to_compare = index_to_compare)

DEBUG: computing ordinal dissimilarity for education


0        0.0
1        0.5
2        0.5
3        0.5
4        0.5
        ... 
43186    0.5
43187    0.5
43188    0.0
43189    0.0
43190    0.0
Name: education, Length: 43191, dtype: float64

##### Testing set dissimilarity

In [25]:
measures_helper_dis.set_dissimilarity(df = df_t, column = "products", index_to_compare = index_to_compare)

DEBUG: computing set dissimilarity for products


0        0.000000
1        0.857143
2        0.833333
3        0.700000
4        0.909091
           ...   
43186    1.000000
43187    0.666667
43188    0.846154
43189    0.857143
43190    0.909091
Length: 43191, dtype: float64

##### Testing all dissimilarity measures

In [26]:
dissimilarity_matrix = measures_helper_dis.all(df = df_t, index_to_compare = index_to_compare)

DEBUG: computing all dissimilarity measures
DEBUG: computing numeric dissimilarity for age
DEBUG: computing nominal dissimilarity for job
DEBUG: computing nominal dissimilarity for marital
DEBUG: computing ordinal dissimilarity for education
DEBUG: computing nominal dissimilarity for default
DEBUG: computing numeric dissimilarity for balance
DEBUG: computing nominal dissimilarity for housing
DEBUG: computing nominal dissimilarity for loan
DEBUG: computing set dissimilarity for products


Reference customer:

In [27]:
pd.DataFrame(df.loc[0, :]).transpose()

| | age | job | marital | education | default | balance | housing | loan | products |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | 13161719 |


Top-10 most **similar** customers to 1:

In [28]:
# Sort by mean dissimilarity, ascending, get top-10 excluding first (own)
top_10_most_similar = dissimilarity_matrix[:, 9].argsort()[1:11]

In [29]:
df.loc[top_10_most_similar, :]

| | age | job | marital | education | default | balance | housing | loan | products |
|---|---|---|---|---|---|---|---|---|---|
| 25848 | 42.0 | entrepreneur | married | secondary | no | 215.0 | yes | yes | 135161720 |
| 16640 | 46.0 | entrepreneur | married | secondary | no | 0.0 | yes | yes | 3716171920 |
| 12328 | 33.0 | entrepreneur | married | secondary | no | -627.0 | yes | yes | 234791216171819 |
| 1246 | 33.0 | entrepreneur | married | secondary | no | 0.0 | yes | yes | 3567121617 |
| 24632 | 33.0 | entrepreneur | married | secondary | no | 2.0 | yes | yes | 23567911121617 |
| 33281 | 37.0 | entrepreneur | married | secondary | no | 0.0 | yes | yes | 591617 |
| 34864 | 31.0 | entrepreneur | married | secondary | no | 162.0 | yes | yes | 58910121316171819 |
| 32931 | 34.0 | entrepreneur | married | secondary | no | 2.0 | yes | yes | 81112131719 |
| 24714 | 38.0 | entrepreneur | married | secondary | no | 0.0 | yes | yes | 2351416 |
| 12950 | 37.0 | entrepreneur | married | secondary | no | 2055.0 | yes | yes | 145891516181920 |


Top-10 most **dissimilar** customers to 1:

In [30]:
# Sort by mean dissimilarity, descending, get top-10
top_10_most_dissimilar = dissimilarity_matrix[:, 9].argsort()[::-1][0:10]

In [31]:
df.loc[top_10_most_dissimilar, :]

| | age | job | marital | education | default | balance | housing | loan | products |
|---|---|---|---|---|---|---|---|---|---|
| 28487 | 59.0 | management | divorced | tertiary | yes | 0.0 | no | no | 218 |
| 11009 | 56.0 | management | divorced | tertiary | yes | -1968.0 | no | no | 410111213151820 |
| 8707 | 56.0 | housemaid | divorced | primary | yes | 1238.0 | no | no | 15 |
| 14502 | 56.0 | blue-collar | divorced | primary | yes | -1.0 | no | no | 578101516 |
| 38325 | 51.0 | management | single | tertiary | no | 102127.0 | no | no | 7911 |
| 39224 | 45.0 | unemployed | divorced | primary | yes | 11.0 | no | no | 13 |
| 37764 | 59.0 | management | divorced | tertiary | yes | 0.0 | no | no | 56810111216171819 |
| 15804 | 52.0 | management | divorced | tertiary | yes | 0.0 | no | no | 1618 |
| 10440 | 37.0 | management | single | tertiary | yes | -25.0 | no | no | 2510 |
| 17296 | 30.0 | management | single | tertiary | yes | 35.0 | no | no | 25 |


## Nearest Neighbor (NN) search

### Question

Using the dissimilarities computed as discussed in the previous step, you will calculate the 10-NN (most similar) customers for the customers with ids listed below (customer id=line number in the csv file starting from line 2):

$1230, 5032, 10001, 24035, 28948, 35099, 37693, 39543, 40002, 42192$

For this task, your script must take the customer id as input and return the list of that customer's 10 nearest neighbors (excluding the given customer).

### Answer

The top-10 most similar customers are computed as follows. *MeasuresHelper.get_most_similar_customers()* is a wrapper function that accommodates the interface requirements of the assignment. It first calls *MeasuresHelper.all()*, which returns the dissimilarity matrix for all variables. It then calls *MeasuresHelper.nn()*, which performs an argsort on the per-row mean dissimilarity of the matrix and returns the top-10 indices.
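
The selection step can be sketched as below. This is an illustration, not the actual *MeasuresHelper.nn()* code: the function names, the id-to-index mapping (customer id 1 corresponds to row 0, as in the earlier test) and the toy values are assumptions. The sketch returns row indices and removes the reference customer explicitly, so the result stays correct even when another customer has a mean dissimilarity of exactly 0.

```python
import numpy as np

def customer_id_to_index(customer_id: int) -> int:
    # Assumed mapping: customer ids start at 1, DataFrame rows at 0.
    return customer_id - 1

def nearest_neighbours(mean_dissimilarity: np.ndarray, index_to_compare: int, k: int = 10) -> list:
    """Return the row indices of the k smallest mean dissimilarities."""
    # A stable sort keeps ties in their original row order.
    order = np.argsort(mean_dissimilarity, kind="stable")
    # Drop the reference row explicitly instead of assuming it sorts first.
    order = order[order != index_to_compare]
    return order[:k].tolist()

# Toy mean dissimilarities against customer id 1 (row 0).
# Row 3 is an exact duplicate of the reference customer.
mean_dissimilarity = np.array([0.0, 0.40, 0.10, 0.0, 0.25])
print(nearest_neighbours(mean_dissimilarity, customer_id_to_index(1), k=3))  # [3, 2, 4]
```

Slicing the sorted indices with `[1:11]` gives the same answer whenever the reference customer sorts first; the explicit filter only matters when ties at zero occur.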

#### Customer #1230

In [32]:
customer_id = 1230

In [33]:
top_10_most_similar_customers = measures_helper_dis.get_most_similar_customers(df = df_t, customer_id = customer_id)

DEBUG: computing all dissimilarity measures
DEBUG: computing numeric dissimilarity for age
DEBUG: computing nominal dissimilarity for job
DEBUG: computing nominal dissimilarity for marital
DEBUG: computing ordinal dissimilarity for education
DEBUG: computing nominal dissimilarity for default
DEBUG: computing numeric dissimilarity for balance
DEBUG: computing nominal dissimilarity for housing
DEBUG: computing nominal dissimilarity for loan
DEBUG: computing set dissimilarity for products


In [34]:
util_helper.log.info("Top-10 most similar customers to customer_id = {}\n\n{}".format(customer_id, top_10_most_similar_customers))

INFO: Top-10 most similar customers to customer_id = 1230

[34264, 36776, 6745, 1123, 4716, 4130, 34561, 31168, 28874, 11413]


#### Customer #5032

In [35]:
customer_id = 5032

In [36]:
top_10_most_similar_customers = measures_helper_dis.get_most_similar_customers(df = df_t, customer_id = customer_id)

DEBUG: computing all dissimilarity measures
DEBUG: computing numeric dissimilarity for age
DEBUG: computing nominal dissimilarity for job
DEBUG: computing nominal dissimilarity for marital
DEBUG: computing ordinal dissimilarity for education
DEBUG: computing nominal dissimilarity for default
DEBUG: computing numeric dissimilarity for balance
DEBUG: computing nominal dissimilarity for housing
DEBUG: computing nominal dissimilarity for loan
DEBUG: computing set dissimilarity for products


In [37]:
util_helper.log.info("Top-10 most similar customers to customer_id = {}\n\n{}".format(customer_id, top_10_most_similar_customers))

INFO: Top-10 most similar customers to customer_id = 5032

[37099, 15219, 39790, 14081, 18977, 37519, 25665, 31639, 26179, 16556]


#### Customer #10001

In [38]:
customer_id = 10001

In [39]:
top_10_most_similar_customers = measures_helper_dis.get_most_similar_customers(df = df_t, customer_id = customer_id)

DEBUG: computing all dissimilarity measures
DEBUG: computing numeric dissimilarity for age
DEBUG: computing nominal dissimilarity for job
DEBUG: computing nominal dissimilarity for marital
DEBUG: computing ordinal dissimilarity for education
DEBUG: computing nominal dissimilarity for default
DEBUG: computing numeric dissimilarity for balance
DEBUG: computing nominal dissimilarity for housing
DEBUG: computing nominal dissimilarity for loan
DEBUG: computing set dissimilarity for products


In [40]:
util_helper.log.info("Top-10 most similar customers to customer_id = {}\n\n{}".format(customer_id, top_10_most_similar_customers))

INFO: Top-10 most similar customers to customer_id = 10001

[37359, 2010, 9860, 1971, 1662, 34269, 2677, 4600, 2162, 42802]


#### Customer #24035

In [41]:
customer_id = 24035

In [42]:
top_10_most_similar_customers = measures_helper_dis.get_most_similar_customers(df = df_t, customer_id = customer_id)

DEBUG: computing all dissimilarity measures
DEBUG: computing numeric dissimilarity for age
DEBUG: computing nominal dissimilarity for job
DEBUG: computing nominal dissimilarity for marital
DEBUG: computing ordinal dissimilarity for education
DEBUG: computing nominal dissimilarity for default
DEBUG: computing numeric dissimilarity for balance
DEBUG: computing nominal dissimilarity for housing
DEBUG: computing nominal dissimilarity for loan
DEBUG: computing set dissimilarity for products


In [43]:
util_helper.log.info("Top-10 most similar customers to customer_id = {}\n\n{}".format(customer_id, top_10_most_similar_customers))

INFO: Top-10 most similar customers to customer_id = 24035

[14036, 14034, 27941, 28537, 23013, 24981, 4345, 25197, 33503, 34276]


#### Customer #28948

In [44]:
customer_id = 28948

In [45]:
top_10_most_similar_customers = measures_helper_dis.get_most_similar_customers(df = df_t, customer_id = customer_id)

DEBUG: computing all dissimilarity measures
DEBUG: computing numeric dissimilarity for age
DEBUG: computing nominal dissimilarity for job
DEBUG: computing nominal dissimilarity for marital
DEBUG: computing ordinal dissimilarity for education
DEBUG: computing nominal dissimilarity for default
DEBUG: computing numeric dissimilarity for balance
DEBUG: computing nominal dissimilarity for housing
DEBUG: computing nominal dissimilarity for loan
DEBUG: computing set dissimilarity for products


In [46]:
util_helper.log.info("Top-10 most similar customers to customer_id = {}\n\n{}".format(customer_id, top_10_most_similar_customers))

INFO: Top-10 most similar customers to customer_id = 28948

[25650, 31123, 2006, 32988, 352, 32927, 15652, 1816, 30763, 9165]


#### Customer #35099

In [47]:
customer_id = 35099

In [48]:
top_10_most_similar_customers = measures_helper_dis.get_most_similar_customers(df = df_t, customer_id = customer_id)

DEBUG: computing all dissimilarity measures
DEBUG: computing numeric dissimilarity for age
DEBUG: computing nominal dissimilarity for job
DEBUG: computing nominal dissimilarity for marital
DEBUG: computing ordinal dissimilarity for education
DEBUG: computing nominal dissimilarity for default
DEBUG: computing numeric dissimilarity for balance
DEBUG: computing nominal dissimilarity for housing
DEBUG: computing nominal dissimilarity for loan
DEBUG: computing set dissimilarity for products


In [49]:
util_helper.log.info("Top-10 most similar customers to customer_id = {}\n\n{}".format(customer_id, top_10_most_similar_customers))

INFO: Top-10 most similar customers to customer_id = 35099

[6412, 10700, 39019, 2674, 8480, 1393, 25702, 40759, 28243, 1759]


#### Customer #37693

In [50]:
customer_id = 37693

In [51]:
top_10_most_similar_customers = measures_helper_dis.get_most_similar_customers(df = df_t, customer_id = customer_id)

DEBUG: computing all dissimilarity measures
DEBUG: computing numeric dissimilarity for age
DEBUG: computing nominal dissimilarity for job
DEBUG: computing nominal dissimilarity for marital
DEBUG: computing ordinal dissimilarity for education
DEBUG: computing nominal dissimilarity for default
DEBUG: computing numeric dissimilarity for balance
DEBUG: computing nominal dissimilarity for housing
DEBUG: computing nominal dissimilarity for loan
DEBUG: computing set dissimilarity for products


In [52]:
util_helper.log.info("Top-10 most similar customers to customer_id = {}\n\n{}".format(customer_id, top_10_most_similar_customers))

INFO: Top-10 most similar customers to customer_id = 37693

[35682, 8660, 8, 29671, 31126, 42897, 37803, 3504, 39763, 35265]


#### Customer #39543

In [53]:
customer_id = 39543

In [54]:
top_10_most_similar_customers = measures_helper_dis.get_most_similar_customers(df = df_t, customer_id = customer_id)

DEBUG: computing all dissimilarity measures
DEBUG: computing numeric dissimilarity for age
DEBUG: computing nominal dissimilarity for job
DEBUG: computing nominal dissimilarity for marital
DEBUG: computing ordinal dissimilarity for education
DEBUG: computing nominal dissimilarity for default
DEBUG: computing numeric dissimilarity for balance
DEBUG: computing nominal dissimilarity for housing
DEBUG: computing nominal dissimilarity for loan
DEBUG: computing set dissimilarity for products


In [55]:
util_helper.log.info("Top-10 most similar customers to customer_id = {}\n\n{}".format(customer_id, top_10_most_similar_customers))

INFO: Top-10 most similar customers to customer_id = 39543

[32254, 39701, 41965, 38697, 25513, 6615, 17637, 29873, 42116, 41193]


#### Customer #40002

In [56]:
customer_id = 40002

In [57]:
top_10_most_similar_customers = measures_helper_dis.get_most_similar_customers(df = df_t, customer_id = customer_id)

DEBUG: computing all dissimilarity measures
DEBUG: computing numeric dissimilarity for age
DEBUG: computing nominal dissimilarity for job
DEBUG: computing nominal dissimilarity for marital
DEBUG: computing ordinal dissimilarity for education
DEBUG: computing nominal dissimilarity for default
DEBUG: computing numeric dissimilarity for balance
DEBUG: computing nominal dissimilarity for housing
DEBUG: computing nominal dissimilarity for loan
DEBUG: computing set dissimilarity for products


In [58]:
util_helper.log.info("Top-10 most similar customers to customer_id = {}\n\n{}".format(customer_id, top_10_most_similar_customers))

INFO: Top-10 most similar customers to customer_id = 40002

[43051, 29781, 30250, 361, 28938, 39841, 38985, 38978, 38690, 29722]


#### Customer #42192

In [59]:
customer_id = 42192

In [60]:
top_10_most_similar_customers = measures_helper_dis.get_most_similar_customers(df = df_t, customer_id = customer_id)

DEBUG: computing all dissimilarity measures
DEBUG: computing numeric dissimilarity for age
DEBUG: computing nominal dissimilarity for job
DEBUG: computing nominal dissimilarity for marital
DEBUG: computing ordinal dissimilarity for education
DEBUG: computing nominal dissimilarity for default
DEBUG: computing numeric dissimilarity for balance
DEBUG: computing nominal dissimilarity for housing
DEBUG: computing nominal dissimilarity for loan
DEBUG: computing set dissimilarity for products


In [61]:
util_helper.log.info("Top-10 most similar customers to customer_id = {}\n\n{}".format(customer_id, top_10_most_similar_customers))

INFO: Top-10 most similar customers to customer_id = 42192

[12434, 39442, 28656, 16255, 15359, 14169, 27634, 10744, 27636, 42983]


## Assignment handout

1) A report (pdf) describing in detail any processing and conversion you made to the original data and the reasons it was necessary. The report will also contain examples of how to use your script and its output to the list of customers provided at step 3. The first page of the report should clearly state the names and student ids of the members of the group.

2) The program/script you implemented for calculating the dissimilarity matrix. Implementation can be done in any programming language and should be accompanied by the necessary comments and remarks.

3) The pdf and the required programs/scripts should be uploaded to e-class by the assignment deadline. You should create a compressed (e.g. zip/tar) file containing the report, your code and any other files required for executing your script (you do not need to include the original dataset). The name of the compressed file should include the student ids of the members of the group.

---