# iteration2

Data:
- interactions from all groups except those of Alma's developers
- interactions are per group, not per user
- all companies; corporate group (konserni) data has been dropped
- basic company information as metadata; numeric fiscal-period figures are discretized into certain percentiles (see the variable SELECTED_COMPANY_FEATURES below)
- **the data is preprocessed in the proto2_data_prep notebook**

Questions:

1. With this added metadata, does the model perform better with metadata than without it?
2. Does the added metadata improve precision, recall, and reciprocal rank?



## Imports

In [4]:
from lightfm import LightFM
from lightfm.cross_validation import random_train_test_split
from lightfm.evaluation import auc_score
from lightfm.evaluation import precision_at_k
from lightfm.evaluation import recall_at_k
from lightfm.evaluation import reciprocal_rank
from lightfm.data import Dataset

import numpy as np
import pandas as pd

import statistics
import functools

## Selected metadata for companies

In [2]:
SELECTED_COMPANY_FEATURES = ['company_form_code', 'location_municipality_code', 
                             'location_region_code', 'company_status_code', 'industry_code', 'turnover', 
                             'net_profit', 'personnel_average', 'performer_ranking_points', 'risk_rating_class']
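
The numeric fields in this list (for example `turnover` and `net_profit`) are described as discretized into percentiles in the data-prep notebook. That notebook is not shown here, so the following is only a minimal sketch of how such binning could look. The function name `discretize`, the quartile bins, and the `q0`..`q3` label suffixes are assumptions; only the `name+NaN` label for missing values is taken from the output shown later in this notebook.

```python
import bisect
import math
import statistics


def discretize(values, name, n_bins=4):
    """Map numeric values to 'name+q<bin>' labels; missing values become 'name+NaN'."""
    known = [v for v in values if v is not None and not math.isnan(v)]
    # Cut points between bins, e.g. the three quartile boundaries for n_bins=4.
    cuts = statistics.quantiles(known, n=n_bins, method="inclusive")
    labels = []
    for v in values:
        if v is None or math.isnan(v):
            labels.append(f"{name}+NaN")
        else:
            # bisect_left returns the index of the bin the value falls into.
            labels.append(f"{name}+q{bisect.bisect_left(cuts, v)}")
    return labels


# Hypothetical turnover figures, one of them missing.
turnover = [10.0, 250.0, float("nan"), 4000.0, 90000.0]
turnover_labels = discretize(turnover, "turnover")
print(turnover_labels)
```

Keeping missing values as their own label lets the model learn something from "no financial data available" instead of dropping those companies.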

## Load company data

In [4]:
COMPANIES_DF = pd.read_pickle("../data/pandas_pickles/prod_data_proto2.pkl")

## Prepare company data for creating the LightFM Dataset object

In [5]:
ITEM_IDS = list(COMPANIES_DF['business_id'].unique())

item_features_tmp = [COMPANIES_DF[feature].unique() for feature in SELECTED_COMPANY_FEATURES]

ITEM_FEATURE_LABELS = [item for sublist in item_features_tmp for item in sublist]

ITEM_FEATURES = [(company['business_id'], 
                  [company[feature] for feature in SELECTED_COMPANY_FEATURES])
                     for company in COMPANIES_DF.to_dict(orient='records')]

print(ITEM_FEATURES[0])
print(len(ITEM_FEATURES))
print(len(ITEM_FEATURE_LABELS))

('31431209', ['company_form_code+CO_26', 'location_municipality_code+091', 'location_region_code+01', 'company_status_code+AKT', 'industry_code+43', 'turnover+NaN', 'net_profit+NaN', 'personnel_average+NaN', 'performer_ranking_points+NaN', 'risk_rating_class+NaN'])
1334601
530
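
The printed record shows every feature value prefixed with its column name (`location_region_code+01`, `industry_code+43`, and so on). A short sketch of why such a prefix matters follows. The helper `to_feature_labels` and the sample record are hypothetical; in this notebook the prefixed values already come from the pickled DataFrame.

```python
def to_feature_labels(record, feature_names):
    """Build 'column+value' labels for one company record (hypothetical helper)."""
    return [f"{name}+{record.get(name, 'NaN')}" for name in feature_names]


# Made-up record in which two different columns share the raw value "01".
record = {"business_id": "0000000", "location_region_code": "01", "industry_code": "01"}
labels = to_feature_labels(record, ["location_region_code", "industry_code"])
print(labels)

# Without the prefix, both columns would collapse into one shared label "01",
# and the model could not tell a region code apart from an industry code.
raw_values = {record["location_region_code"], record["industry_code"]}
print(len(raw_values), "raw label vs", len(set(labels)), "prefixed labels")
```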


## Load interaction data

In [6]:
interactions_tmp = pd \
    .read_csv('../data/interactions_2021_08_19.csv',
             delimiter='\t',
             dtype={
                 'group_id': 'string',
                 'business_id': 'string',
                 'owner': 'string'
             })

# Drop interactions whose business ID is not found among the items
INTERACTIONS_DF = interactions_tmp[interactions_tmp.business_id.isin(ITEM_IDS)]

## Create versions of the interaction data with different minimum group sizes

In [7]:
group_sizes = INTERACTIONS_DF['group_id'].value_counts()
group_sizes_df = pd.DataFrame({'group_id': group_sizes.index, 'group_size': group_sizes.values})

INTERACTIONS_WITH_GROUP_SIZES_DF = INTERACTIONS_DF.merge(group_sizes_df, on='group_id')


##### Minimum group size = 10 #####

INTERACTIONS_10_DF = INTERACTIONS_WITH_GROUP_SIZES_DF[INTERACTIONS_WITH_GROUP_SIZES_DF.group_size >= 10]


##### Minimum group size = 50 #####

INTERACTIONS_50_DF = INTERACTIONS_WITH_GROUP_SIZES_DF[INTERACTIONS_WITH_GROUP_SIZES_DF.group_size >= 50]
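
The same filtering can also be written without the intermediate merge. The sketch below uses `groupby(...).transform('size')` to attach each row's group size directly. The function name `filter_by_min_group_size` and the toy DataFrame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def filter_by_min_group_size(interactions_df, min_size):
    """Keep only interactions whose group has at least min_size interactions."""
    # transform('size') returns one value per row: the row count of that row's group.
    sizes = interactions_df.groupby("group_id")["business_id"].transform("size")
    return interactions_df[sizes >= min_size]


# Toy data: group "a" has three interactions, group "b" has one.
toy = pd.DataFrame({
    "group_id": ["a", "a", "a", "b"],
    "business_id": ["1", "2", "3", "1"],
})
filtered = filter_by_min_group_size(toy, min_size=2)
print(filtered)
```

Note that "group size" here counts interactions per group, matching the `value_counts()` call above; it is not a count of members in the group.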


In [8]:
def print_interactions_meta_data(interactions_df):
    print('groups: {groups}, interactions {interactions}, companies {companies}'
          .format(groups=len(list(interactions_df['group_id'].unique())),
                  interactions=interactions_df.shape[0], 
                  companies=len(list(interactions_df['business_id'].unique()))))

print('----- group_size>=1 -----')
print_interactions_meta_data(INTERACTIONS_DF)

print('\n----- group_size>=10 -----')
print_interactions_meta_data(INTERACTIONS_10_DF)

print('\n----- group_size>=50 -----')
print_interactions_meta_data(INTERACTIONS_50_DF)

----- group_size>=1 -----
groups: 1399, interactions 525513, companies 140677

----- group_size>=10 -----
groups: 1015, interactions 524053, companies 140604

----- group_size>=50 -----
groups: 722, interactions 517150, companies 140284


## Create LightFM-compatible Dataset objects for the different group sizes

In [9]:
def create_dataset(interactions_df):
    dataset = Dataset(user_identity_features=False, item_identity_features=False)
    
    interactions = [(interaction['group_id'], interaction['business_id']) 
                for interaction in interactions_df.to_dict(orient='records')]
    
    user_ids = list(set(interactions_df['group_id'].values))

    dataset.fit(users=user_ids, items=ITEM_IDS, item_features=ITEM_FEATURE_LABELS)
    
    (interactions_ds, weights_ds) = dataset.build_interactions(interactions)
    
    item_features_ds = dataset.build_item_features(ITEM_FEATURES, normalize=False)

    # USER_MAP_DS = dataset.mapping()[0]
    # ITEM_MAP_DS = dataset.mapping()[2]
    # ITEM_FEATURE_MAP_DS = dataset.mapping()[3]
    
    return (interactions_ds, item_features_ds)    

In [10]:
(INTERACTIONS_DS, ITEM_FEATURES_DS) = create_dataset(INTERACTIONS_DF)
print(repr(INTERACTIONS_DS))

<1399x1334601 sparse matrix of type '<class 'numpy.int32'>'
	with 525513 stored elements in COOrdinate format>


In [11]:
(INTERACTIONS_10_DS, ITEM_FEATURES_10_DS) = create_dataset(INTERACTIONS_10_DF)
print(repr(INTERACTIONS_10_DS))

<1015x1334601 sparse matrix of type '<class 'numpy.int32'>'
	with 524053 stored elements in COOrdinate format>


In [12]:
(INTERACTIONS_50_DS, ITEM_FEATURES_50_DS) = create_dataset(INTERACTIONS_50_DF)
print(repr(INTERACTIONS_50_DS))

<722x1334601 sparse matrix of type '<class 'numpy.int32'>'
	with 517150 stored elements in COOrdinate format>
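
Because the Dataset is created with `item_identity_features=False`, a company is represented only through its metadata labels. In LightFM an item's latent vector is the sum of the latent vectors of its features, so two companies with identical labels receive identical representations. The NumPy sketch below illustrates that idea with made-up sizes and random embeddings; it is not LightFM's internal code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 5 metadata labels, 3 latent components.
n_features, n_components = 5, 3
feature_embeddings = rng.normal(size=(n_features, n_components))

# Binary item-feature matrix (rows: items, columns: labels), as with normalize=False.
item_features = np.array([
    [1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0],  # same labels as item 0
    [0, 1, 0, 1, 1],
])

# Each item embedding is the sum of the embeddings of its active labels.
item_embeddings = item_features @ feature_embeddings

print(np.allclose(item_embeddings[0], item_embeddings[1]))
```

When no item features are passed to `fit` (the `*_NO_ITEM` models below), LightFM falls back to an identity feature matrix, so every company gets its own independent embedding instead.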


## Create partitioned datasets for cross-validation

In [13]:
def create_partitioned_datasets(interactions_ds):
    (half_1, half_2) = random_train_test_split(interactions_ds, test_percentage=0.5)
    (quarter_1, quarter_2) = random_train_test_split(half_1, test_percentage=0.5)
    (quarter_3, quarter_4) = random_train_test_split(half_2, test_percentage=0.5)
    
    return [quarter_1, quarter_2, quarter_3, quarter_4]
    

In [14]:
INTERACTIONS_CV = create_partitioned_datasets(INTERACTIONS_DS)
INTERACTIONS_10_CV = create_partitioned_datasets(INTERACTIONS_10_DS)
INTERACTIONS_50_CV = create_partitioned_datasets(INTERACTIONS_50_DS)
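
The nested halving above yields four disjoint parts of roughly equal size. An alternative way to think about it is to shuffle the positions of the stored interactions once and cut them into folds. The sketch below does this on plain index arrays. The function name `split_into_folds` and the seed are assumptions, and turning the indices back into sparse matrices is left out.

```python
import numpy as np


def split_into_folds(n_interactions, n_folds=4, seed=42):
    """Shuffle interaction positions once and cut them into n_folds disjoint folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_interactions)
    return np.array_split(shuffled, n_folds)


# Toy example with 10 stored interactions.
folds = split_into_folds(10, n_folds=4)

# Fold 0 is held out for testing; the remaining folds form the training set.
test_idx = folds[0]
train_idx = np.concatenate(folds[1:])
print([len(f) for f in folds])
```

A fixed seed makes the folds reproducible between runs, which the unseeded `random_train_test_split` calls above do not guarantee.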

## Evaluate model quality

In [16]:
def run_evaluation_function(model, test_ds, train_ds, evaluation_function, name, item_features=None):    
    """Compute the mean of one LightFM metric on the train and test interactions.

    Note the argument order (test_ds before train_ds). train_interactions is not
    passed to the metric, so known training positives are not excluded from the
    test ranking.
    """
    print('Calculating {name} for train dataset...'.format(name=name))
    train_metric = evaluation_function(model, train_ds, item_features=item_features, num_threads=6).mean()
    
    print('Calculating {name} for test dataset...'.format(name=name))
    test_metric = evaluation_function(model, test_ds, item_features=item_features, num_threads=6).mean()
    
    print('{name}: train {train_metric:.2f}, test {test_metric:.2f}'.format(name=name, 
                                                                            train_metric=train_metric, 
                                                                            test_metric=test_metric))
    print("\n")
    return (train_metric, test_metric)


def run_evaluations_for_ds(model, train_ds, test_ds, model_name='', item_features=None):
    auc = run_evaluation_function(model, test_ds, train_ds, auc_score, 'AUC_' + model_name, item_features)
    precision = run_evaluation_function(model, test_ds, train_ds, precision_at_k, 'PRECISION_' + model_name, item_features)
    recall = run_evaluation_function(model, test_ds, train_ds, recall_at_k, 'RECALL_' + model_name, item_features)
    reciprocal = run_evaluation_function(model, test_ds, train_ds, reciprocal_rank, 'RECIPROCAL_' + model_name, item_features)
    
    return (auc, precision, recall, reciprocal)
    

def run_evaluations(interactions_cv, item_features_ds, model_epochs):
    
    results = {
        'WARP': [],
        'BPR': [],
        'WARP_NO_ITEM': [],
        'BPR_NO_ITEM': []
    }
    
    for i in range(0, len(interactions_cv)):
        print('Starting iteration {}...'.format(i))
        
        test_ds = interactions_cv[i]
        
        # collect every partition except the test partition into a new list
        train_ds_tmp = [ds for j,ds in enumerate(interactions_cv) if j != i]
        # sum the selected interaction matrices into a single training set
        train_ds = functools.reduce(lambda a,b: a + b, train_ds_tmp)
        
        MODEL_WARP = LightFM(loss='warp')
        MODEL_WARP.fit(train_ds, item_features=item_features_ds, epochs=model_epochs, num_threads=6, verbose=True)

        MODEL_BPR = LightFM(loss='bpr')
        MODEL_BPR.fit(train_ds, item_features=item_features_ds, epochs=model_epochs, num_threads=6, verbose=True)

        MODEL_WARP_NO_ITEM = LightFM(loss='warp')
        MODEL_WARP_NO_ITEM.fit(train_ds, epochs=model_epochs, num_threads=4, verbose=True)

        MODEL_BPR_NO_ITEM = LightFM(loss='bpr')
        MODEL_BPR_NO_ITEM.fit(train_ds, epochs=model_epochs, num_threads=4, verbose=True)
        
        results['WARP'].append(run_evaluations_for_ds(MODEL_WARP, train_ds, test_ds, 'WARP', item_features_ds))
        results['BPR'].append(run_evaluations_for_ds(MODEL_BPR, train_ds, test_ds, 'BPR', item_features_ds))
        results['WARP_NO_ITEM'].append(run_evaluations_for_ds(MODEL_WARP_NO_ITEM, train_ds, test_ds, 'WARP_NO_ITEM'))
        results['BPR_NO_ITEM'].append(run_evaluations_for_ds(MODEL_BPR_NO_ITEM, train_ds, test_ds, 'BPR_NO_ITEM'))
        
    return results
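
For reference, the three ranking metrics used above can be written out for a single group in plain Python. This is a simplified sketch of the definitions (LightFM's `precision_at_k` and `recall_at_k` default to `k=10`), not the library's implementation. The function name `ranking_metrics` and the toy scores are illustrative.

```python
def ranking_metrics(scores, positives, k=10):
    """Return (precision@k, recall@k, reciprocal rank) for one group.

    scores: predicted score per item index; positives: non-empty set of
    item indices the group actually interacted with.
    """
    # Item indices ordered from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if i in positives)
    precision = hits / k
    recall = hits / len(positives)
    # 1-based rank of the best-ranked positive item.
    first_hit_rank = next(r for r, i in enumerate(ranked, start=1) if i in positives)
    return precision, recall, 1.0 / first_hit_rank


# Toy example: 5 items, the group interacted with items 3 and 4.
scores = [0.1, 0.9, 0.3, 0.8, 0.2]
precision, recall, reciprocal = ranking_metrics(scores, {3, 4}, k=2)
print(precision, recall, reciprocal)
```

One consequence of the definition: recall@k for a group can never exceed k divided by its number of test positives, so groups with hundreds of interactions will show very small recall values at `k=10` even when the top of the ranking is good.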


In [17]:
def print_simple_results(results_tuple):
    auc, precision, recall, reciprocal = results_tuple
    print('AUC: \n train {} \n test {}'.format(auc[0], auc[1]))
    print('PRECISION: \n train {} \n test {}'.format(precision[0], precision[1]))
    print('RECALL: \n train {} \n test {}'.format(recall[0], recall[1]))
    print('RECIPROCAL: \n train {} \n test {}'.format(reciprocal[0], reciprocal[1]))    
    

def print_metric_result(result_arr, model_name):
    train_results = [x[0] for x in result_arr]
    test_results = [x[1] for x in result_arr]
    
    print('{name}:\n train mean {train_mean:.2f} ({train_arr})\n test mean {test_mean:.2f} ({test_arr})\n'
          .format(train_mean=statistics.mean(train_results),
                 test_mean=statistics.mean(test_results),
                 train_arr=['%.2f' % x for x in train_results],
                 test_arr=['%.2f' % x for x in test_results],
                 name=model_name))
    

def print_all_results(results):
    for i,metric in enumerate(['AUC', 'PRECISION', 'RECALL', 'RECIPROCAL']):
        print('\n-----{}-----'.format(metric))
        for model_name,result_arr in results.items():
            print_metric_result([res[i] for res in result_arr], model_name)
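
To address the first question directly, the test means of a model with metadata can be subtracted from those of its counterpart without metadata. The helper below assumes the `results` layout produced by `run_evaluations` (model name mapped to a list of per-fold tuples of `(train, test)` pairs). The function name `mean_test_gain` is hypothetical and the numbers in `toy_results` are made-up placeholders, not results from this notebook.

```python
import statistics

METRICS = ("AUC", "PRECISION", "RECALL", "RECIPROCAL")


def mean_test_gain(results, with_meta, without_meta):
    """Difference in mean test score (with metadata minus without) per metric."""
    gains = {}
    for i, metric in enumerate(METRICS):
        # fold[i] is the (train, test) pair of metric i; index 1 picks the test value.
        with_mean = statistics.mean(fold[i][1] for fold in results[with_meta])
        without_mean = statistics.mean(fold[i][1] for fold in results[without_meta])
        gains[metric] = with_mean - without_mean
    return gains


# Placeholder values for two folds, only to show the expected structure.
toy_results = {
    "WARP": [((0.9, 0.8), (0.3, 0.2), (0.1, 0.1), (0.4, 0.3))] * 2,
    "WARP_NO_ITEM": [((0.9, 0.7), (0.3, 0.1), (0.1, 0.1), (0.4, 0.2))] * 2,
}
gains = mean_test_gain(toy_results, "WARP", "WARP_NO_ITEM")
print(gains)
```

A positive value means the metadata model scored higher on held-out data for that metric.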
    

### group_size >= 1

In [None]:
RESULTS_1 = run_evaluations(INTERACTIONS_CV, ITEM_FEATURES_DS, 5)

Starting iteration 0...


Epoch: 100%|██████████| 5/5 [00:01<00:00,  3.06it/s]
Epoch: 100%|██████████| 5/5 [00:01<00:00,  3.93it/s]
Epoch: 100%|██████████| 5/5 [00:02<00:00,  2.35it/s]
Epoch: 100%|██████████| 5/5 [00:00<00:00,  5.11it/s]


Calculating AUC_WARP for train dataset...
Calculating AUC_WARP for test dataset...


### group_size >= 10

In [None]:
RESULTS_10 = run_evaluations(INTERACTIONS_10_CV, ITEM_FEATURES_10_DS, 5)

Starting iteration 0...


Epoch: 100%|██████████| 5/5 [00:05<00:00,  1.00s/it]
Epoch: 100%|██████████| 5/5 [00:02<00:00,  1.97it/s]
Epoch: 100%|██████████| 5/5 [00:05<00:00,  1.06s/it]
Epoch: 100%|██████████| 5/5 [00:03<00:00,  1.64it/s]


Calculating AUC_WARP for train dataset...
Calculating AUC_WARP for test dataset...
AUC_WARP: train 0.97, test 0.97


Calculating PRECISION_WARP for train dataset...
Calculating PRECISION_WARP for test dataset...
PRECISION_WARP: train 0.02, test 0.01


Calculating RECALL_WARP for train dataset...
Calculating RECALL_WARP for test dataset...
RECALL_WARP: train 0.00, test 0.00


Calculating RECIPROCAL_WARP for train dataset...
Calculating RECIPROCAL_WARP for test dataset...
RECIPROCAL_WARP: train 0.07, test 0.03


Calculating AUC_BPR for train dataset...
Calculating AUC_BPR for test dataset...
AUC_BPR: train 0.94, test 0.93


Calculating PRECISION_BPR for train dataset...
Calculating PRECISION_BPR for test dataset...
PRECISION_BPR: train 0.01, test 0.00


Calculating RECALL_BPR for train dataset...
Calculating RECALL_BPR for test dataset...
RECALL_BPR: train 0.00, test 0.00


Calculating RECIPROCAL_BPR for train dataset...
Calculating RECIPROCAL_BPR for test dataset...
RECIPROCAL_BPR: trai

### group_size >= 50

In [18]:
RESULTS_50 = run_evaluations(INTERACTIONS_50_CV, ITEM_FEATURES_50_DS, 5)

Starting iteration 0...


Epoch: 100%|██████████| 5/5 [00:02<00:00,  2.13it/s]
Epoch: 100%|██████████| 5/5 [00:04<00:00,  1.14it/s]
Epoch: 100%|██████████| 5/5 [00:02<00:00,  1.71it/s]
Epoch: 100%|██████████| 5/5 [00:01<00:00,  4.17it/s]


Calculating AUC_WARP for train dataset...
Calculating AUC_WARP for test dataset...
AUC_WARP: train 0.99, test 0.99


Calculating PRECISION_WARP for train dataset...
Calculating PRECISION_WARP for test dataset...
PRECISION_WARP: train 0.22, test 0.07


Calculating RECALL_WARP for train dataset...
Calculating RECALL_WARP for test dataset...
RECALL_WARP: train 0.01, test 0.01


Calculating RECIPROCAL_WARP for train dataset...
Calculating RECIPROCAL_WARP for test dataset...
RECIPROCAL_WARP: train 0.36, test 0.17


Calculating AUC_BPR for train dataset...
Calculating AUC_BPR for test dataset...
AUC_BPR: train 0.96, test 0.96


Calculating PRECISION_BPR for train dataset...
Calculating PRECISION_BPR for test dataset...
PRECISION_BPR: train 0.23, test 0.07


Calculating RECALL_BPR for train dataset...
Calculating RECALL_BPR for test dataset...
RECALL_BPR: train 0.01, test 0.01


Calculating RECIPROCAL_BPR for train dataset...
Calculating RECIPROCAL_BPR for test dataset...
RECIPROCAL_BPR: trai

Epoch: 100%|██████████| 5/5 [00:02<00:00,  2.42it/s]
Epoch: 100%|██████████| 5/5 [00:04<00:00,  1.05it/s]
Epoch: 100%|██████████| 5/5 [00:01<00:00,  2.86it/s]
Epoch: 100%|██████████| 5/5 [00:00<00:00,  5.30it/s]


Calculating AUC_WARP for train dataset...
Calculating AUC_WARP for test dataset...
AUC_WARP: train 0.99, test 0.99


Calculating PRECISION_WARP for train dataset...
Calculating PRECISION_WARP for test dataset...
PRECISION_WARP: train 0.20, test 0.07


Calculating RECALL_WARP for train dataset...
Calculating RECALL_WARP for test dataset...
RECALL_WARP: train 0.01, test 0.01


Calculating RECIPROCAL_WARP for train dataset...
Calculating RECIPROCAL_WARP for test dataset...
RECIPROCAL_WARP: train 0.34, test 0.18


Calculating AUC_BPR for train dataset...
Calculating AUC_BPR for test dataset...
AUC_BPR: train 0.96, test 0.96


Calculating PRECISION_BPR for train dataset...
Calculating PRECISION_BPR for test dataset...
PRECISION_BPR: train 0.23, test 0.08


Calculating RECALL_BPR for train dataset...
Calculating RECALL_BPR for test dataset...
RECALL_BPR: train 0.01, test 0.01


Calculating RECIPROCAL_BPR for train dataset...
Calculating RECIPROCAL_BPR for test dataset...
RECIPROCAL_BPR: trai

Epoch: 100%|██████████| 5/5 [00:02<00:00,  2.36it/s]
Epoch: 100%|██████████| 5/5 [00:04<00:00,  1.11it/s]
Epoch: 100%|██████████| 5/5 [00:01<00:00,  2.80it/s]
Epoch: 100%|██████████| 5/5 [00:00<00:00,  5.33it/s]


Calculating AUC_WARP for train dataset...
Calculating AUC_WARP for test dataset...
AUC_WARP: train 0.99, test 0.99


Calculating PRECISION_WARP for train dataset...
Calculating PRECISION_WARP for test dataset...
PRECISION_WARP: train 0.22, test 0.07


Calculating RECALL_WARP for train dataset...
Calculating RECALL_WARP for test dataset...
RECALL_WARP: train 0.01, test 0.01


Calculating RECIPROCAL_WARP for train dataset...
Calculating RECIPROCAL_WARP for test dataset...
RECIPROCAL_WARP: train 0.35, test 0.18


Calculating AUC_BPR for train dataset...
Calculating AUC_BPR for test dataset...
AUC_BPR: train 0.96, test 0.96


Calculating PRECISION_BPR for train dataset...
Calculating PRECISION_BPR for test dataset...
PRECISION_BPR: train 0.21, test 0.07


Calculating RECALL_BPR for train dataset...
Calculating RECALL_BPR for test dataset...
RECALL_BPR: train 0.01, test 0.01


Calculating RECIPROCAL_BPR for train dataset...
Calculating RECIPROCAL_BPR for test dataset...
RECIPROCAL_BPR: trai

Epoch: 100%|██████████| 5/5 [00:02<00:00,  2.43it/s]
Epoch: 100%|██████████| 5/5 [00:04<00:00,  1.16it/s]
Epoch: 100%|██████████| 5/5 [00:01<00:00,  2.85it/s]
Epoch: 100%|██████████| 5/5 [00:00<00:00,  5.31it/s]


Calculating AUC_WARP for train dataset...
Calculating AUC_WARP for test dataset...
AUC_WARP: train 0.99, test 0.99


Calculating PRECISION_WARP for train dataset...
Calculating PRECISION_WARP for test dataset...
PRECISION_WARP: train 0.20, test 0.07


Calculating RECALL_WARP for train dataset...
Calculating RECALL_WARP for test dataset...
RECALL_WARP: train 0.01, test 0.01


Calculating RECIPROCAL_WARP for train dataset...
Calculating RECIPROCAL_WARP for test dataset...
RECIPROCAL_WARP: train 0.33, test 0.17


Calculating AUC_BPR for train dataset...
Calculating AUC_BPR for test dataset...
AUC_BPR: train 0.96, test 0.96


Calculating PRECISION_BPR for train dataset...
Calculating PRECISION_BPR for test dataset...
PRECISION_BPR: train 0.22, test 0.07


Calculating RECALL_BPR for train dataset...
Calculating RECALL_BPR for test dataset...
RECALL_BPR: train 0.01, test 0.01


Calculating RECIPROCAL_BPR for train dataset...
Calculating RECIPROCAL_BPR for test dataset...
RECIPROCAL_BPR: trai

In [19]:
print_all_results(RESULTS_50)


-----AUC-----
WARP:
 train mean 0.99 (['0.99', '0.99', '0.99', '0.99'])
 test mean 0.99 (['0.99', '0.99', '0.99', '0.99'])

BPR:
 train mean 0.96 (['0.96', '0.96', '0.96', '0.96'])
 test mean 0.96 (['0.96', '0.96', '0.96', '0.96'])

WARP_NO_ITEM:
 train mean 0.99 (['0.99', '0.99', '0.99', '0.99'])
 test mean 0.94 (['0.94', '0.94', '0.94', '0.93'])

BPR_NO_ITEM:
 train mean 0.86 (['0.86', '0.87', '0.86', '0.86'])
 test mean 0.82 (['0.82', '0.82', '0.82', '0.82'])


-----PRECISION-----
WARP:
 train mean 0.21 (['0.22', '0.20', '0.22', '0.20'])
 test mean 0.07 (['0.07', '0.07', '0.07', '0.07'])

BPR:
 train mean 0.22 (['0.23', '0.23', '0.21', '0.22'])
 test mean 0.07 (['0.07', '0.08', '0.07', '0.07'])

WARP_NO_ITEM:
 train mean 0.24 (['0.24', '0.23', '0.25', '0.25'])
 test mean 0.06 (['0.06', '0.05', '0.06', '0.06'])

BPR_NO_ITEM:
 train mean 0.24 (['0.23', '0.24', '0.25', '0.22'])
 test mean 0.05 (['0.05', '0.05', '0.05', '0.05'])


-----RECALL-----
WARP:
 train mean 0.01 (['0.01', '0.01

In [None]:
The AUC could be visualized: sort all scores and highlight the hits.
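
A minimal text-only sketch of that idea: rank all items by score, mark the hits, and read the AUC off the same ranking as the share of (hit, non-hit) pairs in which the hit is ranked higher. The function names `rank_strip` and `auc_from_strip` and the toy scores are assumptions, and ties between scores are ignored.

```python
def rank_strip(scores, positives):
    """Return the ranking as a string: '#' for a hit, '.' for any other item."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return "".join("#" if i in positives else "." for i in ranked)


def auc_from_strip(strip):
    """AUC = fraction of (hit, non-hit) pairs where the hit is ranked higher."""
    n_pos = strip.count("#")
    n_neg = len(strip) - n_pos
    negatives_below = 0
    correct_pairs = 0
    # Walk up from the bottom: each hit beats every non-hit ranked below it.
    for ch in reversed(strip):
        if ch == ".":
            negatives_below += 1
        else:
            correct_pairs += negatives_below
    return correct_pairs / (n_pos * n_neg)


scores = [0.1, 0.9, 0.3, 0.8, 0.2]
strip = rank_strip(scores, {3, 4})
print(strip, auc_from_strip(strip))
```

Such a strip also helps explain how a very high AUC can coexist with low precision at k: with 1,334,601 items, a hit ranked around position 13,000 still outranks about 99% of the items, yet it is far outside the top 10.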