# iteration2

Data:
- interactions come from the groups of all users except Alma's developers
- interactions are per group, not per user
- all companies are included; corporate groups are included by prefixing the business ID with "K-"
- metadata consists of basic company information; numeric financial-statement figures are discretized into fixed percentiles (see the variable SELECTED_COMPANY_FEATURES below)
- **the data is preprocessed in the proto2_data_prep notebook**
- the model uses the WARP loss
- minimum group size = 2

Questions:

1. With the added metadata, does the model perform better with metadata than without?
2. Should item_identity_features be on or off?
3. Do the added metadata improve precision?



## Imports

In [100]:
from lightfm import LightFM
from lightfm.cross_validation import random_train_test_split
from lightfm.evaluation import auc_score
from lightfm.evaluation import precision_at_k
from lightfm.evaluation import recall_at_k
from lightfm.evaluation import reciprocal_rank
from lightfm.data import Dataset

import numpy as np
import pandas as pd

import statistics
import functools

WORKING_DIRECTORY = '/mnt/c/git/masters-thesis-code/jupyter/code/'

## Selected company metadata

In [101]:
SELECTED_COMPANY_FEATURES = ['company_form_code', 'location_municipality_code', 
                             'location_region_code', 'company_status_code', 'industry_code', 'turnover', 
                             'net_profit', 'personnel_average', 'performer_ranking_points', 'risk_rating_class']

## Loading the company data

In [102]:
COMPANIES_DF = pd.read_pickle(WORKING_DIRECTORY + "data/pandas_pickles/prod_data_proto2.pkl")

## Preparing the company data for building the LightFM Dataset object

In [103]:
ITEM_IDS = list(COMPANIES_DF['business_id'].unique())

item_features_tmp = [COMPANIES_DF[feature].unique() for feature in SELECTED_COMPANY_FEATURES]

ITEM_FEATURE_LABELS = [item for sublist in item_features_tmp for item in sublist]

ITEM_FEATURES = [(company['business_id'], 
                  [company[feature] for feature in SELECTED_COMPANY_FEATURES])
                     for company in COMPANIES_DF.to_dict(orient='records')]

print(ITEM_FEATURES[0])
print(len(ITEM_FEATURES))
print(len(ITEM_FEATURE_LABELS))


('31431209', ['company_form_code+CO_26', 'location_municipality_code+091', 'location_region_code+01', 'company_status_code+AKT', 'industry_code+43', 'turnover+NaN', 'net_profit+NaN', 'personnel_average+NaN', 'performer_ranking_points+NaN', 'risk_rating_class+NaN'])
1337868
530
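To make the shape of these structures concrete, here is a minimal sketch that rebuilds the label list and the per-item feature tuples from two made-up company records. Plain dicts stand in for the DataFrame, and the records, their values, and the lowercase variable names are illustrative assumptions rather than rows of the real data. Because every value carries its column name as a prefix, equal raw values in different columns remain distinct labels.

```python
# Toy stand-in for COMPANIES_DF.to_dict(orient='records'); values are made up.
records = [
    {"business_id": "A", "company_form_code": "company_form_code+CO_16",
     "turnover": "turnover+0.98", "net_profit": "net_profit+0.98"},
    {"business_id": "B", "company_form_code": "company_form_code+CO_26",
     "turnover": "turnover+NaN", "net_profit": "net_profit+0.98"},
]
selected_features = ["company_form_code", "turnover", "net_profit"]

# All distinct labels over all selected columns (the role of ITEM_FEATURE_LABELS).
feature_labels = sorted({rec[f] for rec in records for f in selected_features})

# One (item id, [labels]) tuple per company (the role of ITEM_FEATURES).
item_features = [(rec["business_id"], [rec[f] for f in selected_features])
                 for rec in records]

print(item_features[0])
print(len(feature_labels))
```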


In [104]:
COMPANIES_DF[COMPANIES_DF['business_id'] == '01370820']

Unnamed: 0,business_id,company_name,company_form_code,location_municipality_code,location_region_code,company_status_code,industry_code,turnover,net_profit,personnel_average,performer_ranking_points,risk_rating_class
1333142,1370820,Leipomo Rosten Oy,company_form_code+CO_16,location_municipality_code+853,location_region_code+02,company_status_code+AKT,industry_code+10,turnover+0.98,net_profit+0.98,personnel_average+0.98,performer_ranking_points+0.6,risk_rating_class+GREEN
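The row above carries labels such as `turnover+0.98`, which come from the percentile discretization done in the proto2_data_prep notebook. That notebook is not shown here, so the following is only a sketch of how such labels could be produced. The cut points, the rule "smallest cut point at or above the value's percentile rank", and the function name `to_bucket_label` are all assumptions.

```python
from bisect import bisect_left, bisect_right

# Hypothetical cut points; the real ones are defined in proto2_data_prep.
CUT_POINTS = [0.2, 0.4, 0.6, 0.8, 0.98, 1.0]


def to_bucket_label(feature, value, sorted_values, cut_points=CUT_POINTS):
    """Return '<feature>+<cut point>' for the smallest cut point that is at
    least the value's empirical percentile rank, or '<feature>+NaN'."""
    if value is None:
        return f"{feature}+NaN"
    # Share of observations that are less than or equal to the value.
    rank = bisect_right(sorted_values, value) / len(sorted_values)
    index = min(bisect_left(cut_points, rank), len(cut_points) - 1)
    return f"{feature}+{cut_points[index]}"


turnovers = sorted(range(1, 101))  # made-up turnover values 1..100
print(to_bucket_label("turnover", 50, turnovers))
print(to_bucket_label("turnover", None, turnovers))
```

Turning numeric values into a small set of categorical labels like this is what lets them be passed to LightFM as ordinary item features alongside codes such as the industry or region.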


## Loading the interaction data

In [105]:
interactions_tmp = pd.read_csv(
    WORKING_DIRECTORY + 'data/interactions_2021_08_19.csv',
    delimiter='\t',
    dtype={
        'group_id': 'string',
        'business_id': 'string',
        'owner': 'string'
    })
             
interactions_tmp

Unnamed: 0,group_id,business_id,owner
0,3a63222b-86b2-4293-bd2e-171011190ae6,31291154,603e1524d377150007c2dbea
1,c2626398-faac-4ff3-b02d-cdc64b50cdaa,01681709,60646431ae18cb00063ed63f
2,c2626398-faac-4ff3-b02d-cdc64b50cdaa,15055514,60646431ae18cb00063ed63f
3,c2626398-faac-4ff3-b02d-cdc64b50cdaa,01876143,60646431ae18cb00063ed63f
4,c2626398-faac-4ff3-b02d-cdc64b50cdaa,01863991,60646431ae18cb00063ed63f
...,...,...,...
548193,8b0915ff-a0cb-4520-9160-8d783a6bf308,25610905,6110c56241e21e000857ca77
548194,8b0915ff-a0cb-4520-9160-8d783a6bf308,21281841,6110c56241e21e000857ca77
548195,8b0915ff-a0cb-4520-9160-8d783a6bf308,23633484,6110c56241e21e000857ca77
548196,8b0915ff-a0cb-4520-9160-8d783a6bf308,23552270,6110c56241e21e000857ca77


In [106]:
# add interactions for the corporate-group entities (business IDs prefixed with 'K-')
concern_interactions = interactions_tmp.copy()
concern_interactions['business_id'] = 'K-' + concern_interactions['business_id'].astype(str)
concern_interactions = concern_interactions[concern_interactions.business_id.isin(ITEM_IDS)]
concern_interactions

Unnamed: 0,group_id,business_id,owner
1,c2626398-faac-4ff3-b02d-cdc64b50cdaa,K-01681709,60646431ae18cb00063ed63f
2,c2626398-faac-4ff3-b02d-cdc64b50cdaa,K-15055514,60646431ae18cb00063ed63f
3,c2626398-faac-4ff3-b02d-cdc64b50cdaa,K-01876143,60646431ae18cb00063ed63f
5,c2626398-faac-4ff3-b02d-cdc64b50cdaa,K-05363070,60646431ae18cb00063ed63f
6,c2626398-faac-4ff3-b02d-cdc64b50cdaa,K-01387534,60646431ae18cb00063ed63f
...,...,...,...
548074,8b0915ff-a0cb-4520-9160-8d783a6bf308,K-02106319,6110c56241e21e000857ca77
548110,8b0915ff-a0cb-4520-9160-8d783a6bf308,K-20333371,6110c56241e21e000857ca77
548137,8b0915ff-a0cb-4520-9160-8d783a6bf308,K-07027249,6110c56241e21e000857ca77
548162,8b0915ff-a0cb-4520-9160-8d783a6bf308,K-02011774,6110c56241e21e000857ca77


In [107]:
# combine the corporate-group interactions with the regular ones and drop interactions whose business ID is not found in the company data
INTERACTIONS_DF = pd.concat([interactions_tmp, concern_interactions])
INTERACTIONS_DF = INTERACTIONS_DF[INTERACTIONS_DF.business_id.isin(ITEM_IDS)]
INTERACTIONS_DF[INTERACTIONS_DF['business_id'] == 'K-02011774']

Unnamed: 0,group_id,business_id,owner
66053,66aea578-682d-45de-a200-77fa79a8c5e7,K-02011774,5db92c0ebc3e9100062ac0b0
155209,6c894b42-cbc6-4d18-8cf1-39ee91d2bf53,K-02011774,6058a712ae18cb00063ed639
206717,608ac4ff-ab75-425b-a4f1-74d5defd43a3,K-02011774,5fbe6408f464de0006491d9e
220714,4d2c2290-72be-40fe-8506-42718d2aac25,K-02011774,5f5613da9769490006b0ebb3
247346,58f4cb31-0d40-42b2-9268-10d6dc64f1a0,K-02011774,5f5613da9769490006b0ebb3
338294,62c0017d-ff32-4584-8c85-562e9a1e8329,K-02011774,5fab9e2dc07a1900066bab26
367059,2177a613-3b16-483e-9c9b-dc20bd4225f6,K-02011774,608863795602580007e70ddf
406747,3a44f374-56ee-4500-8829-c4b942b7afec,K-02011774,6033ae80484e8a0006fe437a
434180,725e7cd0-7031-4988-b4cb-69672c611514,K-02011774,5db83a1cbc3e9100062ac0ab
444333,acfdacd6-0de0-4568-a819-c3c2e15ef221,K-02011774,6033ae80484e8a0006fe437a
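The two cells above can be summarized in a small pure-Python sketch: every interaction is mirrored onto the "K-" prefixed ID, and only rows whose ID exists in the company data are kept. The IDs and groups below are made up, and sets and lists stand in for the DataFrames.

```python
# Made-up data: item ids known from the company data, and raw interactions.
item_ids = {"01681709", "K-01681709", "15055514"}
interactions = [("g1", "01681709"), ("g1", "15055514"), ("g2", "99999999")]

# Mirror every interaction onto the 'K-' id of the same company.
mirrored = [(group, "K-" + business_id) for group, business_id in interactions]

# Keep only interactions whose business id exists in the company data.
combined = [(group, business_id)
            for group, business_id in interactions + mirrored
            if business_id in item_ids]

print(combined)
```

In this toy data the unknown ID `99999999` is dropped, and only the company that actually has a "K-" entry gains a mirrored interaction.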


## Creating versions of the interaction data with different minimum group sizes

In [108]:
group_sizes = INTERACTIONS_DF['group_id'].value_counts()
group_sizes_df = pd.DataFrame({'group_id': group_sizes.index, 'group_size': group_sizes.values})

INTERACTIONS_WITH_GROUP_SIZES_DF = INTERACTIONS_DF.merge(group_sizes_df, on='group_id')


##### Minimum group size = 2 #####

INTERACTIONS_2_DF = INTERACTIONS_WITH_GROUP_SIZES_DF[INTERACTIONS_WITH_GROUP_SIZES_DF.group_size >= 2]


In [109]:
def print_interactions_meta_data(interactions_df):
    print('groups: {groups}, interactions {interactions}, companies {companies}'
          .format(groups=len(interactions_df['group_id'].unique()),
                  interactions=interactions_df.shape[0], 
                  companies=len(interactions_df['business_id'].unique())))

print('----- group_size>=1 -----')
print_interactions_meta_data(INTERACTIONS_DF)

print('\n----- group_size>=2 -----')
print_interactions_meta_data(INTERACTIONS_2_DF)

----- group_size>=1 -----
groups: 1402, interactions 598822, companies 143840

----- group_size>=2 -----
groups: 1335, interactions 598755, companies 143839
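The counts above drop by 67 groups and 67 interactions (1402 to 1335 and 598822 to 598755), which is what one would expect when exactly the single-interaction groups are removed. The filtering step can be sketched without pandas as follows; the toy interactions and the helper name `filter_min_group_size` are assumptions for illustration.

```python
from collections import Counter

# Made-up (group_id, business_id) pairs; g2 has a single interaction.
interactions = [("g1", "a"), ("g1", "b"), ("g2", "a"),
                ("g3", "c"), ("g3", "d"), ("g3", "a")]

# Number of interactions per group, analogous to value_counts() on group_id.
group_sizes = Counter(group for group, _ in interactions)


def filter_min_group_size(rows, sizes, min_size):
    """Keep only rows whose group has at least min_size interactions."""
    return [row for row in rows if sizes[row[0]] >= min_size]


kept = filter_min_group_size(interactions, group_sizes, 2)
print(len(interactions) - len(kept), "interaction(s) removed")
```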


## Creating LightFM Dataset objects for the test cases

In [111]:
def create_dataset(interactions_df, item_identity_features):
    """Build the LightFM interaction and item-feature matrices for one test case."""
    dataset = Dataset(user_identity_features=False, item_identity_features=item_identity_features)
    
    interactions = [(interaction['group_id'], interaction['business_id']) 
                for interaction in interactions_df.to_dict(orient='records')]
    
    user_ids = list(set(interactions_df['group_id'].values))

    # Groups act as LightFM users and companies as items
    dataset.fit(users=user_ids, items=ITEM_IDS, item_features=ITEM_FEATURE_LABELS)
    
    (interactions_ds, weights_ds) = dataset.build_interactions(interactions)
    
    item_features_ds = dataset.build_item_features(ITEM_FEATURES, normalize=False)

    # USER_MAP_DS = dataset.mapping()[0]
    # ITEM_MAP_DS = dataset.mapping()[2]
    # Print the number of distinct item features known to the dataset
    item_feature_map = dataset.mapping()[3]
    print(len(item_feature_map))
    
    return (interactions_ds, item_features_ds)

In [112]:
# item_identity_features off
(INTERACTIONS_2_DS, ITEM_FEATURES_2_DS) = create_dataset(INTERACTIONS_2_DF, False)
INTERACTIONS_2_DS

530


<1335x1337868 sparse matrix of type '<class 'numpy.int32'>'
	with 598755 stored elements in COOrdinate format>

In [113]:
# item_identity_features on
(INTERACTIONS_2_IDENTITY_DS, ITEM_FEATURES_2_IDENTITY_DS) = create_dataset(INTERACTIONS_2_DF, True)
INTERACTIONS_2_IDENTITY_DS

1338398


<1335x1337868 sparse matrix of type '<class 'numpy.int32'>'
	with 598755 stored elements in COOrdinate format>
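The two printed feature counts differ by exactly the number of items: 530 without identity features and 1338398 = 1337868 + 530 with them. The sketch below mimics how the feature index grows when every item also gets its own indicator feature. It is a simplified stand-in rather than LightFM's code, and the function name `build_item_feature_mapping` and the toy IDs are assumptions.

```python
def build_item_feature_mapping(item_ids, feature_labels, item_identity_features):
    """Mimic how the item-feature index grows when identity features are on."""
    mapping = {}
    if item_identity_features:
        # Each item contributes one indicator feature named after its own id.
        for item_id in item_ids:
            mapping.setdefault(item_id, len(mapping))
    for label in feature_labels:
        mapping.setdefault(label, len(mapping))
    return mapping


items = ["A", "B", "K-A"]
labels = ["industry_code+10", "industry_code+43"]
print(len(build_item_feature_mapping(items, labels, False)))  # 2
print(len(build_item_feature_mapping(items, labels, True)))   # 5

# The same arithmetic with the counts printed in this notebook.
assert 1337868 + 530 == 1338398
```

With identity features on, the model learns a separate embedding per company in addition to the shared metadata embeddings, which is why the feature dimension grows by the number of items.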

## Creating partitioned datasets for cross-validation

In [114]:
def create_partitioned_datasets(interactions_ds):
    """Split the interactions into five disjoint folds of equal size."""
    # Each split takes one fifth of the original data out of what remains:
    # 0.2 of 1, 0.25 of 0.8, 1/3 of 0.6 and 0.5 of 0.4
    (rest, fifth_1) = random_train_test_split(interactions_ds, test_percentage=0.2)
    (rest, fifth_2) = random_train_test_split(rest, test_percentage=0.25)
    (rest, fifth_3) = random_train_test_split(rest, test_percentage=0.3333333)
    (fifth_4, fifth_5) = random_train_test_split(rest, test_percentage=0.5)
    
    return [fifth_1, fifth_2, fifth_3, fifth_4, fifth_5]
    

In [115]:
INTERACTIONS_2_CV = create_partitioned_datasets(INTERACTIONS_2_DS)
INTERACTIONS_2_CV

[<1335x1337868 sparse matrix of type '<class 'numpy.int32'>'
 	with 119751 stored elements in COOrdinate format>,
 <1335x1337868 sparse matrix of type '<class 'numpy.int32'>'
 	with 119751 stored elements in COOrdinate format>,
 <1335x1337868 sparse matrix of type '<class 'numpy.int32'>'
 	with 119751 stored elements in COOrdinate format>,
 <1335x1337868 sparse matrix of type '<class 'numpy.int32'>'
 	with 119751 stored elements in COOrdinate format>,
 <1335x1337868 sparse matrix of type '<class 'numpy.int32'>'
 	with 119751 stored elements in COOrdinate format>]
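Each fold holds 119751 of the 598755 interactions, i.e. exactly one fifth. The arithmetic behind the successive test percentages 0.2, 0.25, 0.3333333 and 0.5 can be checked with exact fractions; this is an idealized sketch in which 1/3 stands in for 0.3333333.

```python
from fractions import Fraction

# Share taken from the remaining data at each successive split.
test_percentages = [Fraction(1, 5), Fraction(1, 4), Fraction(1, 3), Fraction(1, 2)]

remaining = Fraction(1)
fold_shares = []
for percentage in test_percentages:
    taken = remaining * percentage
    fold_shares.append(taken)
    remaining -= taken
# Whatever is left after the last split forms the fifth fold.
fold_shares.append(remaining)

print(fold_shares)
print(598755 * fold_shares[0])
```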

In [116]:
INTERACTIONS_2_IDENTITY_CV = create_partitioned_datasets(INTERACTIONS_2_IDENTITY_DS)
INTERACTIONS_2_IDENTITY_CV

[<1335x1337868 sparse matrix of type '<class 'numpy.int32'>'
 	with 119751 stored elements in COOrdinate format>,
 <1335x1337868 sparse matrix of type '<class 'numpy.int32'>'
 	with 119751 stored elements in COOrdinate format>,
 <1335x1337868 sparse matrix of type '<class 'numpy.int32'>'
 	with 119751 stored elements in COOrdinate format>,
 <1335x1337868 sparse matrix of type '<class 'numpy.int32'>'
 	with 119751 stored elements in COOrdinate format>,
 <1335x1337868 sparse matrix of type '<class 'numpy.int32'>'
 	with 119751 stored elements in COOrdinate format>]

## Evaluating model quality

In [117]:
NUM_THREADS = 12

def run_evaluation_function(model, test_ds, train_ds, evaluation_function, name, item_features=None):    
    """Compute one metric on the test fold and return (train_metric, test_metric).

    Train-set evaluation is disabled (commented out below), so train_metric is
    always the placeholder -1.0.
    """
    #print('Calculating {name} for train dataset...'.format(name=name))
    #train_results = evaluation_function(model, train_ds, item_features=item_features, num_threads=NUM_THREADS)
    #np.savetxt('iteration2-train-results-{}.txt'.format(name), train_results)
    #train_metric = train_results.mean()
    
    print('Calculating {name} for test dataset...'.format(name=name))
    # train_ds is passed so that known training interactions are excluded from the ranking
    test_results = evaluation_function(model, test_ds, train_ds, item_features=item_features, num_threads=NUM_THREADS)
    np.savetxt('iteration2-test-results-{}.txt'.format(name), test_results)
    test_metric = test_results.mean()
    
    print('{name}: train {train_metric:.2f}, test {test_metric:.2f}'.format(name=name, 
                                                                            train_metric=-1.0, 
                                                                            test_metric=test_metric))
    print('\n')
    return (-1.0, test_metric)

def run_evaluations_for_ds(model, train_ds, test_ds, model_name, item_features=None):
    auc = run_evaluation_function(model, test_ds, train_ds, auc_score, 'AUC_' + model_name, item_features)
    precision = run_evaluation_function(model, test_ds, train_ds, precision_at_k, 'PRECISION_' + model_name, item_features)
    #recall = run_evaluation_function(model, test_ds, train_ds, recall_at_k, 'RECALL_' + model_name, item_features)
    #reciprocal = run_evaluation_function(model, test_ds, train_ds, reciprocal_rank, 'RECIPROCAL_' + model_name, item_features)
    
    return (auc, precision) #, recall, reciprocal)
    

def run_evaluations(name, interactions_cv, item_features_ds):
    
    results = {
        name: []
    }
    
    for i in range(0, len(interactions_cv)):
        print('Starting iteration {}...'.format(i))    
            
        test_ds = interactions_cv[i]
        
        # collect every fold except the test fold into a new list
        train_ds_tmp = [ds for j,ds in enumerate(interactions_cv) if j != i]
        # sum the selected interaction matrices into a single training set
        train_ds = functools.reduce(lambda a,b: a + b, train_ds_tmp)
        
        MODEL = LightFM(loss='warp')
        MODEL.fit(train_ds, item_features=item_features_ds, epochs=5, num_threads=NUM_THREADS, verbose=True)
        
        results[name].append(run_evaluations_for_ds(MODEL, train_ds, test_ds, '{}_{}'.format(name, i), item_features_ds))
        
    return results
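For reference, the precision metric used above can be sketched for a single group in plain Python. LightFM's `precision_at_k` uses k=10 by default and leaves known training interactions out of the ranking when they are supplied. The function below follows that definition but is an illustrative stand-in, not the library's implementation; its name and the toy scores are assumptions.

```python
def precision_at_k_single_user(scores, test_positives, train_positives, k=10):
    """Fraction of the k top-scored items that are held-out positives.

    scores maps item id -> predicted score for one user (group).
    Items already seen in training are left out of the ranking.
    """
    candidates = [item for item in scores if item not in train_positives]
    top_k = sorted(candidates, key=lambda item: scores[item], reverse=True)[:k]
    hits = sum(1 for item in top_k if item in test_positives)
    return hits / k


toy_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
print(precision_at_k_single_user(toy_scores, {"c"}, {"a"}, k=2))
```

Under this definition, a mean test precision of about 0.08 to 0.09 at k=10 corresponds to a little under one held-out company among the ten top-ranked ones per group on average.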


In [118]:
def print_metric_result(result_arr, model_name):
    train_results = [x[0] for x in result_arr]
    test_results = [x[1] for x in result_arr]
    
    print('{name}:\n train mean {train_mean:.2f} ({train_arr})\n test mean {test_mean:.2f} ({test_arr})\n'
          .format(train_mean=statistics.mean(train_results),
                 test_mean=statistics.mean(test_results),
                 train_arr=['%.2f' % x for x in train_results],
                 test_arr=['%.2f' % x for x in test_results],
                 name=model_name))
    

def print_all_results(results):
    for i,metric in enumerate(['AUC', 'PRECISION']): #, 'RECALL', 'RECIPROCAL']):
        print('\n-----{}-----'.format(metric))
        for model_name,result_arr in results.items():
            print_metric_result([res[i] for res in result_arr], model_name)
    

### Results

In [119]:
RESULTS_IDENTITY_OFF = run_evaluations('IDENTITY_FEATURES_OFF', INTERACTIONS_2_CV, ITEM_FEATURES_2_DS)
RESULTS_IDENTITY_ON = run_evaluations('IDENTITY_FEATURES_ON', INTERACTIONS_2_IDENTITY_CV, ITEM_FEATURES_2_IDENTITY_DS)

Starting iteration 0...


Epoch: 100%|████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  1.92it/s]


Calculating AUC_IDENTITY_FEATURES_OFF_0 for test dataset...
AUC_IDENTITY_FEATURES_OFF_0: train -1.00, test 0.99


Calculating PRECISION_IDENTITY_FEATURES_OFF_0 for test dataset...
PRECISION_IDENTITY_FEATURES_OFF_0: train -1.00, test 0.08


Starting iteration 1...


Epoch: 100%|████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  1.69it/s]


Calculating AUC_IDENTITY_FEATURES_OFF_1 for test dataset...
AUC_IDENTITY_FEATURES_OFF_1: train -1.00, test 0.99


Calculating PRECISION_IDENTITY_FEATURES_OFF_1 for test dataset...
PRECISION_IDENTITY_FEATURES_OFF_1: train -1.00, test 0.09


Starting iteration 2...


Epoch: 100%|████████████████████████████████████████████████████████████| 5/5 [00:03<00:00,  1.45it/s]


Calculating AUC_IDENTITY_FEATURES_OFF_2 for test dataset...
AUC_IDENTITY_FEATURES_OFF_2: train -1.00, test 0.99


Calculating PRECISION_IDENTITY_FEATURES_OFF_2 for test dataset...
PRECISION_IDENTITY_FEATURES_OFF_2: train -1.00, test 0.08


Starting iteration 3...


Epoch: 100%|████████████████████████████████████████████████████████████| 5/5 [00:03<00:00,  1.50it/s]


Calculating AUC_IDENTITY_FEATURES_OFF_3 for test dataset...
AUC_IDENTITY_FEATURES_OFF_3: train -1.00, test 0.99


Calculating PRECISION_IDENTITY_FEATURES_OFF_3 for test dataset...
PRECISION_IDENTITY_FEATURES_OFF_3: train -1.00, test 0.08


Starting iteration 4...


Epoch: 100%|████████████████████████████████████████████████████████████| 5/5 [00:03<00:00,  1.57it/s]


Calculating AUC_IDENTITY_FEATURES_OFF_4 for test dataset...
AUC_IDENTITY_FEATURES_OFF_4: train -1.00, test 0.99


Calculating PRECISION_IDENTITY_FEATURES_OFF_4 for test dataset...
PRECISION_IDENTITY_FEATURES_OFF_4: train -1.00, test 0.08


Starting iteration 0...


Epoch: 100%|████████████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.25it/s]


Calculating AUC_IDENTITY_FEATURES_ON_0 for test dataset...
AUC_IDENTITY_FEATURES_ON_0: train -1.00, test 0.99


Calculating PRECISION_IDENTITY_FEATURES_ON_0 for test dataset...
PRECISION_IDENTITY_FEATURES_ON_0: train -1.00, test 0.09


Starting iteration 1...


Epoch: 100%|████████████████████████████████████████████████████████████| 5/5 [00:03<00:00,  1.26it/s]


Calculating AUC_IDENTITY_FEATURES_ON_1 for test dataset...
AUC_IDENTITY_FEATURES_ON_1: train -1.00, test 0.99


Calculating PRECISION_IDENTITY_FEATURES_ON_1 for test dataset...
PRECISION_IDENTITY_FEATURES_ON_1: train -1.00, test 0.09


Starting iteration 2...


Epoch: 100%|████████████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.20it/s]


Calculating AUC_IDENTITY_FEATURES_ON_2 for test dataset...
AUC_IDENTITY_FEATURES_ON_2: train -1.00, test 0.99


Calculating PRECISION_IDENTITY_FEATURES_ON_2 for test dataset...
PRECISION_IDENTITY_FEATURES_ON_2: train -1.00, test 0.09


Starting iteration 3...


Epoch: 100%|████████████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.07it/s]


Calculating AUC_IDENTITY_FEATURES_ON_3 for test dataset...
AUC_IDENTITY_FEATURES_ON_3: train -1.00, test 0.99


Calculating PRECISION_IDENTITY_FEATURES_ON_3 for test dataset...
PRECISION_IDENTITY_FEATURES_ON_3: train -1.00, test 0.09


Starting iteration 4...


Epoch: 100%|████████████████████████████████████████████████████████████| 5/5 [00:03<00:00,  1.49it/s]


Calculating AUC_IDENTITY_FEATURES_ON_4 for test dataset...
AUC_IDENTITY_FEATURES_ON_4: train -1.00, test 0.99


Calculating PRECISION_IDENTITY_FEATURES_ON_4 for test dataset...
PRECISION_IDENTITY_FEATURES_ON_4: train -1.00, test 0.09




In [120]:
print_all_results(RESULTS_IDENTITY_OFF)


-----AUC-----
IDENTITY_FEATURES_OFF:
 train mean -1.00 (['-1.00', '-1.00', '-1.00', '-1.00', '-1.00'])
 test mean 0.99 (['0.99', '0.99', '0.99', '0.99', '0.99'])


-----PRECISION-----
IDENTITY_FEATURES_OFF:
 train mean -1.00 (['-1.00', '-1.00', '-1.00', '-1.00', '-1.00'])
 test mean 0.08 (['0.08', '0.09', '0.08', '0.08', '0.08'])



In [121]:
print_all_results(RESULTS_IDENTITY_ON)


-----AUC-----
IDENTITY_FEATURES_ON:
 train mean -1.00 (['-1.00', '-1.00', '-1.00', '-1.00', '-1.00'])
 test mean 0.99 (['0.99', '0.99', '0.99', '0.99', '0.99'])


-----PRECISION-----
IDENTITY_FEATURES_ON:
 train mean -1.00 (['-1.00', '-1.00', '-1.00', '-1.00', '-1.00'])
 test mean 0.09 (['0.09', '0.09', '0.09', '0.09', '0.09'])
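A test AUC of 0.99 next to a precision of 0.08 to 0.09 is not contradictory when the catalogue is very large. The sketch below computes AUC for one group as the probability that a held-out positive is scored above a non-interacted item, and shows that a positive ranked just outside the top 10 of 1000 candidates still gives an AUC of about 0.99. The function name and the toy scores are assumptions, and the code illustrates the metric's definition rather than LightFM's implementation.

```python
def auc_single_user(scores, test_positives, train_positives):
    """Probability that a held-out positive outscores a negative (ties count half)."""
    candidates = {item: score for item, score in scores.items()
                  if item not in train_positives}
    positives = [s for item, s in candidates.items() if item in test_positives]
    negatives = [s for item, s in candidates.items() if item not in test_positives]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# 1000 made-up items; item i has score -i, so item 0 is ranked first.
toy_scores = {i: -float(i) for i in range(1000)}
held_out = {10}  # the only positive sits at rank 11, just outside the top 10

print(round(auc_single_user(toy_scores, held_out, set()), 4))
```

Here precision at k=10 would be 0 for this group even though almost every negative is ranked below the positive.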

