# Part 1 -  Data Linkage

## Naive Data Linkage

In [1]:
import textdistance as td
import pandas as pd
import numpy as np

#Part 1 - to generate the matching pairs between amazon_small and google_small

# generating an empty dataframe for combining the pair
id_df = pd.DataFrame(columns = {'idAmazon', 'idGoogleBase'})
amazon = pd.read_csv("amazon_small.csv")
google = pd.read_csv("google_small.csv")
amazon_df = pd.DataFrame(amazon)
google_df = pd.DataFrame(google)

# to replace special characters from the title to have a more accurate comparison
amazon_df['title'] = amazon_df['title'].str.replace(r'[^\w\s]+', '')
google_df['name'] = google_df['name'].str.replace(r'[^\w\s]+', '')

#the minimum threshold for determining if a pair is a match
title_threshold = 0.21

for index_ama, row_ama in amazon_df.iterrows():
    # iterating through the amazon rows to compare each title with
    # google's title using jaccard and adding the best one to id_df
    max_id = None
    max_sim_title = 0
    amazon_title = row_ama["title"].split()
    for index_goog, row_goog in google_df.iterrows():
        google_title = row_goog["name"].split()
        s_title = td.jaccard(amazon_title, google_title)
        if (s_title > title_threshold) and (s_title > max_sim_title):
            # when the given pair satisfies both the condition
            max_sim_title = s_title
            max_id = row_goog["idGoogleBase"]
    temp_data = pd.DataFrame({"idAmazon": [row_ama["idAmazon"]], "idGoogleBase" : [max_id]})
    id_df = id_df.append(temp_data, ignore_index = True)
#removing the null conditions from the dataframe
id_df = id_df[pd.notnull(id_df['idGoogleBase'])]
print(id_df)

       idAmazon                                       idGoogleBase
0    b0002itt84  http://www.google.com/base/feeds/snippets/5505...
1    b000pgvk5s  http://www.google.com/base/feeds/snippets/1872...
2    b0001wn0m2  http://www.google.com/base/feeds/snippets/1778...
3    b00004t2un  http://www.google.com/base/feeds/snippets/1750...
4    b000ov0gao  http://www.google.com/base/feeds/snippets/1837...
5    b000fdetxq  http://www.google.com/base/feeds/snippets/1327...
6    b0001y7poo  http://www.google.com/base/feeds/snippets/4468...
7    b000nvkyse  http://www.google.com/base/feeds/snippets/1492...
8    b000h13a2w  http://www.google.com/base/feeds/snippets/9946...
9    b0009jhv1s  http://www.google.com/base/feeds/snippets/1808...
10   b000qfqa1w  http://www.google.com/base/feeds/snippets/1840...
11   b000ar96bm  http://www.google.com/base/feeds/snippets/1687...
12   b0008glghc  http://www.google.com/base/feeds/snippets/2838...
14   b000ndibws  http://www.google.com/base/feeds/snippets/948

In [2]:
# finding the recall and precision 
ground_truth = pd.read_csv("amazon_google_truth_small.csv")

# finding the false positive 
false_pos = pd.concat([id_df, ground_truth, ground_truth]).drop_duplicates(keep = False)
false_positive = len(false_pos)


#finding the false negative
false_neg = pd.concat([ground_truth, id_df, id_df]).drop_duplicates(keep = False)
false_negative = len(false_neg)

true_positive = len(id_df) - false_positive

# to find the recall value of the dataset 
recall = true_positive/(true_positive + false_negative)
print(recall)

# to find the precision of the dataset
precision = true_positive/(true_positive + false_positive)
print(precision)

0.9153846153846154
0.9153846153846154


As one can see, the linkage method applied delivers a high precision and recall. The Jaccard Similarity function has been used as it is ideal due its nature as a Token-based algorithm. As we were identifying similarities between names, which were not necessarily in the same order, a token based approach seemed to be the best. The threshold was set to low as in many cases, the differences between the names were quite high, resulting in a requirement for a low threshold. Combined with a high precision and low false-positives, the overall performance of the method is reliable for linkage applications

## Blocking for efficient data linkage

In [3]:
# to make blocks of the dataset and make pair of them 

def creating_blocks(data_df, name, id_dat):  
    # creates blocks of the dataset, blocks are lists of various ranges
    # overlapping to cover all the cases, it returns a list of lists in
    # the end with name is the title and id_dat is the id on idGoogleBase
    list1 = []
    list2 = []
    list3 = []
    list4 = []
    list5 = []
    list6 = []
    list7 = []
    for index_ama, row_dat in data_df.iterrows():
        # goes through rows of datasets and assigns them a block which is 
        # a list according to their prices
        data_title = row_dat[name].split()
        data_price = row_dat["price"]
        if (data_price > 0 and data_price < 18.5):
            list1.append([data_title, row_dat[id_dat]])

        if (data_price > 10.5 and data_price < 30.1):
            list2.append([data_title, row_dat[id_dat]])

        if (data_price > 16.5 and data_price < 48.5):
            list3.append([data_title, row_dat[id_dat]])

        if (data_price > 40.1 and data_price < 70.5):
            list4.append([data_title, row_dat[id_dat]])

        if (data_price > 60.5 and data_price < 122.5):
            list5.append([data_title, row_dat[id_dat]])

        if (data_price > 108.5 and data_price < 220.5):
            list5.append([data_title, row_dat[id_dat]])

        if (data_price > 206.5 and data_price < 401.5):
            list6.append([data_title, row_dat[id_dat]])

        if (data_price > 386.5):
            list7.append([data_title, row_dat[id_dat]])

    list_of_list = [list1, list2, list3, list4, list5, list6, list7] 
    return list_of_list


In [4]:
# generating an empty dataframe for combining the pair
id_df = pd.DataFrame(columns = {'idAmazon', 'idGoogleBase'})
amazon = pd.read_csv("amazon.csv")
google = pd.read_csv("google.csv")
amazon_df = pd.DataFrame(amazon)
google_df = pd.DataFrame(google)

# taking out the special characters for better comparisons 
amazon_df['title'] = amazon_df['title'].str.replace(r'[^\w\s]+', '')
google_df['name'] = google_df['name'].str.replace(r'[^\w\s]+', '')

for index, row in google_df.iterrows():
    # changing the gbp to aud
    if ("gbp" in row["price"]):
        row['price'] = row['price'].replace('gbp', '')
        google_df.loc[index, "price"] = row['price']
        google_df.loc[index, "price"] = str(float(row['price']) * 1.83)


google_df = google_df.astype({"price": float}) 


# using the create block function to make the blocks of google and amazon datasets 
list_ama = creating_blocks(amazon_df, "title", "idAmazon")
list_goog = creating_blocks(google_df, "name", "id")

In [5]:
id_df = pd.DataFrame(columns = {'idAmazon', 'idGoogleBase'})
for ama_block, goog_block in zip(list_ama, list_goog):
    # going through the 2 blocks together and then comparing titles 
    for amazon_item in ama_block:
        max_id = None
        max_sim_title = 0
        for google_item in goog_block:
            s_title = td.jaccard(amazon_item[0],google_item[0])
            if(s_title > max_sim_title):
                # when the given pair have a better comparison than 
                # the previous best 
                max_sim_title = s_title
                max_id = google_item[1]
                
        # adding the best pair to the dataframe 
        temp_data = pd.DataFrame({"idAmazon": [amazon_item[1]], "idGoogleBase" : [max_id]})
        id_df = id_df.append(temp_data, ignore_index = True)
id_df

Unnamed: 0,idAmazon,idGoogleBase
0,b00029bqa2,http://www.google.com/base/feeds/snippets/1840...
1,b000fm18vi,http://www.google.com/base/feeds/snippets/3620...
2,b0009i9tqy,http://www.google.com/base/feeds/snippets/1230...
3,b00024yohy,http://www.google.com/base/feeds/snippets/9205...
4,b00020633g,http://www.google.com/base/feeds/snippets/4783...
5,b0009stm6g,http://www.google.com/base/feeds/snippets/7117...
6,b0007iqg2q,http://www.google.com/base/feeds/snippets/1264...
7,b000099sin,http://www.google.com/base/feeds/snippets/1838...
8,b000gaoo7y,http://www.google.com/base/feeds/snippets/1838...
9,b000g017kg,http://www.google.com/base/feeds/snippets/1264...


In [6]:
# finding the pair completeness and reduction ratio
ground_truth = pd.read_csv("amazon_google_truth.csv")
# finding the false positive 
false_pos = pd.concat([id_df, ground_truth, ground_truth]).drop_duplicates(keep = False)
false_positive = len(false_pos)

#finding the false negative
false_neg = pd.concat([ground_truth, id_df, id_df]).drop_duplicates(keep = False)
false_negative = len(false_neg)

# finding the true positives 
true_positive = len(id_df) - false_positive

# finding the pair completeness of the 
possible_pair = true_positive/(true_positive + false_negative)
print(possible_pair)

reduction_ratio = 1 - (true_positive + false_positive)/len(ground_truth)
print(reduction_ratio)

0.6332747510251904
-0.2238461538461538


Blocking is used to make pairs efficiently, without going through searching for the best pair in the dataset. For this case title, manufacturer and description of both the datasets have a lot of discrepancies and are not idle to form blocks. Price on the other hand if one observes aren’t too different in both the datasets making it a suitable candidate for blocking. So here we chose pricing for blocking. To take care of the fringe cases we made the blocks intermingle with each for covering as many best cases as possible. 

Because of our intermingling we got a pair completeness (PC) of 0.633,  This caused few duplication of amazon keys with google keys causing a reduction ratio of -0.2238	


# Part 2 - Classification

## Pre-processing

### Impute missing values 

In [7]:
def finding_feature(yeast_data, col_name, name):
    # this function finds mean or median, as specified by the name argument
    # and fills the column's none with the mean or median respectively
    
    # to filter all the none values of columns 
    not_null_col = yeast_data[pd.notnull(yeast_data[col_name])]
    
    if( name == "mean"):
        mean_val = not_null_col[col_name].mean()
        yeast_data[col_name].fillna(mean_val, inplace = True)
        return yeast_data

    if (name == "median"):
        median_val = not_null_col.median()
        yeast_data[col_name].fillna(median_val, inplace = True)
        return yeast_data
    
    
def change_dataframe(yeast_data, name):
    # this function imputes mean and median on
    # all the columns of yeast_data using finding feature
    for cols in yeast_data.columns:
        if (cols == "Sample" or cols == "Class"):
            continue
        yeast_data = finding_feature(yeast_data, cols, name)
    
    return yeast_data

In [8]:
def format_print(dataset_1, dataset_2, para_1, para_2):
    # This function prints standard deviation, mean, median, min and max for
    # the given data sets and takes the feature of that dataset as its arguments alongside
    for i in dataset_1.columns:
        if (i == "Sample" or i == "Class"):
            continue
        mean_dat_mean = dataset_1[i].mean()
        med_dat_mean = dataset_2[i].mean()
        mean_dat_med = dataset_1[i].median()
        med_dat_med = dataset_2[i].median()
        mean_dat_std = dataset_1[i].std()
        med_dat_std = dataset_2[i].std()
        print("min for {} ({}): ".format(para_1, i) + str(min(dataset_1[i])) + 20*" "+ "min for {} ({}):  ".format(para_2, i) + str(min(dataset_2[i])))
        print("max for {} ({}): ".format(para_1, i) + str(max(dataset_1[i])) + 20*" " + "max for {} ({}): ".format(para_2, i) + str(max(dataset_2[i])))
        print("mean for {} ({}): ".format(para_1, i) + str(mean_dat_mean) + 16*" " + "mean for for {} ({}): ".format(para_2, i) + str(med_dat_mean))
        print("median for {} ({}): ".format(para_1, i) + str(mean_dat_med) + 13*" " + "median for {} ({}) :".format(para_2, i) + str(med_dat_med))
        print("std for {} ({}): ".format(para_1, i) + str(mean_dat_std) + 20*" " + "std for {} ({})  ".format(para_2, i) + str(med_dat_std))

In [9]:
yeast_data = pd.read_csv("all_yeast.csv")

# imputes mean value for nan positions
yeast_data_mean = change_dataframe(yeast_data, "mean")

# imputes median value for nan positions
yeast_data_med = change_dataframe(yeast_data, "median")

# prints all the neccessary information asked
format_print(yeast_data_mean, yeast_data_med, "mean", "median")

min for mean (mcg): 0.11                    min for median (mcg):  0.11
max for mean (mcg): 1.0                    max for median (mcg): 1.0
mean for mean (mcg): 0.4993491124260346                mean for for median (mcg): 0.4993491124260346
median for mean (mcg): 0.4993491124260354             median for median (mcg) :0.4993491124260354
std for mean (mcg): 0.131356908343175                    std for median (mcg)  0.131356908343175
min for mean (gvh): 0.13                    min for median (gvh):  0.13
max for mean (gvh): 1.0                    max for median (gvh): 1.0
mean for mean (gvh): 0.4998757763975147                mean for for median (gvh): 0.4998757763975147
median for mean (gvh): 0.49             median for median (gvh) :0.49
std for mean (gvh): 0.12194528481262444                    std for median (gvh)  0.12194528481262444
min for mean (alm): 0.21                    min for median (alm):  0.21
max for mean (alm): 7.501819407                    max for median (alm): 7.501

Median is the better option here, even though from the result we can see that there isnt much of a difference between mean and median, median would be preferred because mean can be disrupted with outliers whereas median can't be 

### Scale the features

In [10]:
# Standardizing the data on median imputed dataset  
from sklearn import preprocessing
scaler  = preprocessing.StandardScaler()

yeast_data_num = yeast_data_med[yeast_data_med.columns[1:-1]]
names = yeast_data_num.columns

# using the standard scaler to standardize the dataset 
standard_df = scaler.fit_transform(yeast_data_num)
standard_df = pd.DataFrame(standard_df, columns=names)
standard_df

Unnamed: 0,mcg,gvh,alm,mit,erl,pox,vac,nuc
0,6.141898e-01,9.033670e-01,-1.794472e-01,-8.944555e-01,-7.560451e-02,-0.099131,-1.552229e-01,-3.943446e-01
1,-5.281221e-01,1.395557e+00,-1.293895e-01,3.741294e-02,-7.560451e-02,-0.099131,1.706607e-01,-3.943446e-01
2,1.071115e+00,9.853986e-01,-7.933174e-02,-7.613314e-01,-7.560451e-02,-0.099131,1.706607e-01,-3.943446e-01
3,6.141898e-01,-4.911708e-01,3.211302e-01,-8.944555e-01,-7.560451e-02,-0.099131,2.358374e-01,-3.943446e-01
4,-6.042763e-01,-4.911708e-01,-1.293895e-01,1.834588e+00,-7.560451e-02,-0.099131,-1.552229e-01,-3.943446e-01
5,8.111091e-02,-8.192973e-01,2.710725e-01,-6.282074e-01,-7.560451e-02,6.509628,-9.004621e-02,2.451962e-14
6,4.956778e-03,3.291456e-01,-1.293895e-01,2.566770e+00,-7.560451e-02,-0.099131,1.706607e-01,-3.943446e-01
7,-8.454807e-16,-4.091391e-01,4.212457e-01,-4.285213e-01,-7.560451e-02,-0.099131,4.965443e-01,3.967709e-01
8,3.857274e-01,1.019027e-03,7.716500e-01,6.364712e-01,-7.560451e-02,-0.099131,-9.004621e-02,-3.943446e-01
9,-7.565845e-01,-9.013289e-01,4.713035e-01,-7.613314e-01,-7.560451e-02,-0.099131,4.965443e-01,1.330657e-01


In [11]:
# using mean center method to scale the data on median imputed dataset
mean_center = yeast_data_med

for i in mean_center.columns:
    if (i == 'Sample' or i == 'Class'):
        continue
    # finding the mean of every column and subtracting it from every
    # value in the column 
    mean_col = mean_center[i].mean()
    mean_center[i] = mean_center[i].sub(mean_col)
mean_center = mean_center.set_index('Sample')
dataset = mean_center
dataset

Unnamed: 0_level_0,mcg,gvh,alm,mit,erl,pox,vac,nuc,Class
Sample,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
1,8.065089e-02,1.101242e-01,-3.584804e-02,-1.343792e-01,-6.921003e-03,-0.0075,-2.381570e-02,-5.981599e-02,non-CYT
2,-6.934911e-02,1.701242e-01,-2.584804e-02,5.620763e-03,-6.921003e-03,-0.0075,2.618430e-02,-5.981599e-02,non-CYT
3,1.406509e-01,1.201242e-01,-1.584804e-02,-1.143792e-01,-6.921003e-03,-0.0075,2.618430e-02,-5.981599e-02,non-CYT
4,8.065089e-02,-5.987578e-02,6.415196e-02,-1.343792e-01,-6.921003e-03,-0.0075,3.618430e-02,-5.981599e-02,non-CYT
5,-7.934911e-02,-5.987578e-02,-2.584804e-02,2.756208e-01,-6.921003e-03,-0.0075,-2.381570e-02,-5.981599e-02,non-CYT
6,1.065089e-02,-9.987578e-02,5.415196e-02,-9.437924e-02,-6.921003e-03,0.4925,-1.381570e-02,-8.881784e-16,CYT
7,6.508876e-04,4.012422e-02,-2.584804e-02,3.856208e-01,-6.921003e-03,-0.0075,2.618430e-02,-5.981599e-02,non-CYT
8,8.326673e-16,-4.987578e-02,8.415196e-02,-6.437924e-02,-6.921003e-03,-0.0075,7.618430e-02,6.018401e-02,non-CYT
9,5.065089e-02,1.242236e-04,1.541520e-01,9.562076e-02,-6.921003e-03,-0.0075,-1.381570e-02,-5.981599e-02,non-CYT
10,-9.934911e-02,-1.098758e-01,9.415196e-02,-1.143792e-01,-6.921003e-03,-0.0075,7.618430e-02,2.018401e-02,CYT


In [12]:
# printing the mean, median, std, min and max of every column in mean centering and standardization  
format_print(mean_center, standard_df, "mean centering", "standardisation")

min for mean centering (mcg): -0.3893491124260346                    min for standardisation (mcg):  -2.965054263149991
max for mean centering (mcg): 0.5006508875739655                    max for standardisation (mcg): 3.8126632401994027
mean for mean centering (mcg): 9.208716448861895e-16                mean for for standardisation (mcg): -1.4513697222188704e-16
median for mean centering (mcg): 8.326672684688674e-16             median for standardisation (mcg) :-8.454806771486998e-16
std for mean centering (mcg): 0.13135690834317476                    std for standardisation (mcg)  1.0003370975993267
min for mean centering (gvh): -0.3698757763975147                    min for standardisation (gvh):  -3.034151432770152
max for mean centering (gvh): 0.5001242236024853                    max for standardisation (gvh): 4.10260072823934
mean for mean centering (gvh): 8.25971516399868e-16                mean for for standardisation (gvh): -3.654608810844939e-17
median for mean centering (gv

## Comparing Classification Algorithms

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier 
from sklearn import metrics

#Creating arrays for training the classifiers
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 8].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

#Implementing a k=5 Nearest Neighbour Classifier
classifier_5 = KNeighborsClassifier(n_neighbors=5)
classifier_5.fit(x_train, y_train)
y_pred_5 = classifier_5.predict(x_test)

#Implementing a k=10 Nearest Neighbour Classifier
classifier_10 = KNeighborsClassifier(n_neighbors=10)
classifier_10.fit(x_train, y_train)
y_pred_10 = classifier_10.predict(x_test)

#Implementing a Decision Tree Classifier
classifier_dt = DecisionTreeClassifier()
classifier_dt = classifier_dt.fit(x_train,y_train)
y_pred_dt = classifier_dt.predict(x_test)

#Printing the accuracy of the classifier methods. Note that the values changes as the training and test data is randomly selected
print("k-NN = 5 Accuracy: ",metrics.accuracy_score(y_test, y_pred_5))
print("k-NN = 10 Accuracy: ",metrics.accuracy_score(y_test, y_pred_10))
print("Decision Tree Accuracy: ",metrics.accuracy_score(y_test, y_pred_dt))

k-NN = 5 Accuracy:  0.6836734693877551
k-NN = 10 Accuracy:  0.7122448979591837
Decision Tree Accuracy:  0.7020408163265306


The k-NN = 10 had the highest accuracy amongst all the methods tested (Most of the time). In the experiment, the training and test data were devided into a 2:1 ration with the training data being used to train the above mentioned classifiers. They were then applied and their accuracies were calculated. k-NN = 10 being the one with the highest accuracy suggests that the data is of the form such that it is easy to classify it by forming groups that have neighbouring points.

## Feature Engineering

In [14]:
#Create a dictionary to store all the new features. The features are simply all possible products of the original features
data_dict = {}
for i in range(0,8):
    for j in range(i+1,8):
        data_dict["column{0}".format(10*i+j)] = dataset.iloc[:,i].values*dataset.iloc[:,j].values

#Converting dictionary to dataframe and then joining dataframe to original dataset
index=range(1,1485)
itp_data = pd.DataFrame(data_dict,index)
dataset_new = dataset.join(itp_data)

#Stores the Mutual Information Score in a dictionary, sorts it and converts it into a Dataframe
mi_dict = {}
for i in range(0,37):
    if (i != 8):
        mi_dict["{0}".format(i)] = (metrics.mutual_info_score(dataset_new.iloc[:,i], dataset_new['Class']))
sorted_list = sorted(mi_dict.items(), key=lambda item: item[1], reverse = True)
sorted_features = pd.DataFrame(sorted_list)

#Stores the top n columns which have been ranked on the basis of their Mutual Information and then converting it into a usable dataframe.
column_dict = {}
columns = sorted_features[0].astype(int)
for i in range(15):
    column_dict["column{0}".format(i)] = dataset_new.iloc[:,columns[i]].values
index=range(1,1485)
new_data = pd.DataFrame(column_dict, index = index)
new_data['Class'] = dataset['Class']

#Selecting the respective data columns required for training the k-NN classifier
x = new_data.iloc[:, :-1].values
y = new_data.iloc[:, -1].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

#Implementing the improved k-NN classifier using interation term pairs and printing the accuracy measure
classifier_10 = KNeighborsClassifier(n_neighbors=10)
classifier_10.fit(x_train, y_train)
y_pred_10 = classifier_10.predict(x_test)
print("k-NN = 10 Improved Accuracy: ",metrics.accuracy_score(y_test, y_pred_10))

k-NN = 10 Improved Accuracy:  0.6448979591836734


In [15]:
from sklearn.cluster import KMeans

#Creating clusters and then assigning them to the dataset for training the Decision Tree using clustered data
req_data = dataset.iloc[:,0:8]
clusters = KMeans(n_clusters=2).fit(req_data)
labels = clusters.labels_
dataset['clusters'] = labels
dataset = dataset[['mcg','gvh','alm','mit','erl','pox','vac','nuc','clusters','Class']]
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 8].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

#Implementing the improved k-NN classifier using clustering labels and printing the accuracy measure
classifier_10 = KNeighborsClassifier(n_neighbors=10)
classifier_10.fit(x_train, y_train)
y_pred_10 = classifier_10.predict(x_test)
print("k-NN = 10 Improved Accuracy: ",metrics.accuracy_score(y_test, y_pred_10))

k-NN = 10 Improved Accuracy:  0.9979591836734694


We selected the number of generated features for the interaction term pairs on the basis of their overall effect on the classifier, and the number of clusters for the clustering labels on the basis of the realtion we established between number of clusters and their effect on accuracy.

The feature selection+generation with interaction term pairs provides a minor improvement over conventional clustering as the accuracy went up only 1% (68% to 69%) in the most optimal application of the generated features (Using top 15 as they showed most improvement in accuracy)

The feature generation with clustering labels delivers an incredible bump in accuracy, with it the classifier becoming a perfect classifier (68% to 100%) with use of 2 clusters. We suspect this to be the case as the data is very naturally divided into two clusters. 