# iteration3

Data:
- interactions from every group except those of Alma's developers
- interactions are per group, not per user
- all companies; corporate groups are included by prefixing the business ID with "K-"
- basic company information as metadata; numeric fiscal-year figures are discretized into custom classes that loosely follow percentiles
- **data is preprocessed in the iteration3_feature_selection notebook**
- WARP model in use
- minimum group size = 2
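
The discretization itself happens in the preprocessing notebook, which is not shown here. Purely as an illustration, the sketch below bins a numeric column at quantile edges and emits labels in the `column+class` form that appears in the company data further down. The function name `discretize_feature`, the quantile choices, and the purely numeric class names are assumptions; the actual custom classes (for example `top`) are defined elsewhere.

```python
import numpy as np
import pandas as pd

def discretize_feature(values, name, quantiles=(0.2, 0.4, 0.6, 0.8)):
    """Return one 'name+class' label per value; missing values become 'name+NaN'."""
    series = pd.Series(values, dtype='float64')
    # Quantile edges are computed from the non-missing values only
    edges = series.quantile(list(quantiles)).to_numpy()
    # Index of the bin each value falls into (0 = lowest bin)
    bins = np.searchsorted(edges, series.to_numpy(), side='right')
    return [f'{name}+NaN' if np.isnan(value) else f'{name}+{bin_index}'
            for value, bin_index in zip(series.to_numpy(), bins)]

# Hypothetical input values, not taken from the real data
labels = discretize_feature([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, np.nan], 'turnover')
print(labels)
```

Keeping the column name inside each label means that equal class names from different columns stay distinct features.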

Questions:

1. How do the new, modified financial-statement classes, together with dropping location_municipality, perform?
2. How does the Gini-index-weighted item_feature matrix perform with different weightings?
3. What happens if NaN features are not passed in?

## Imports

In [121]:
from lightfm import LightFM
from lightfm.cross_validation import random_train_test_split
from lightfm.evaluation import auc_score
from lightfm.evaluation import precision_at_k
from lightfm.evaluation import recall_at_k
from lightfm.evaluation import reciprocal_rank
from lightfm.data import Dataset

import numpy as np
import pandas as pd

import statistics
import functools

from sklearn.model_selection import train_test_split

WORKING_DIRECTORY = '/mnt/d/git/masters-thesis-code/jupyter/code/'

## Selected company metadata

In [122]:
SELECTED_COMPANY_FEATURES = ['company_form_code', 
                             'location_region_code', 'company_status_code', 'industry_code', 'turnover', 
                             'net_profit', 'personnel_average', 'performer_ranking_points', 'risk_rating_class']

## Loading the company data

In [123]:
COMPANIES_DF = pd.read_pickle(WORKING_DIRECTORY + "data/pandas_pickles/company_data_iteration3.pkl")

ITEM_IDS = list(COMPANIES_DF['business_id'].unique())

item_features_tmp = [COMPANIES_DF[feature].unique() for feature in SELECTED_COMPANY_FEATURES]

ITEM_FEATURE_LABELS = [item for sublist in item_features_tmp for item in sublist]

In [37]:
COMPANIES_DF[COMPANIES_DF['business_id'] == 'K-01370820']

Unnamed: 0,business_id,company_name,company_form_code,location_region_code,company_status_code,industry_code,turnover,net_profit,personnel_average,performer_ranking_points,risk_rating_class
1333143,K-01370820,Leipomo Rosten Oy,company_form_code+CO_16,location_region_code+02,company_status_code+AKT,industry_code+10,turnover+top,net_profit+top,personnel_average+3,performer_ranking_points+1,risk_rating_class+GREEN


## Loading the interaction data

In [132]:
interactions_tmp = pd \
    .read_csv(WORKING_DIRECTORY + 'data/interactions_2021_08_19.csv',
             delimiter='\t',
             dtype={
                 'group_id': 'string',
                 'business_id': 'string',
                 'owner': 'string'
             })

# drop groups that contain only one interaction
group_sizes = interactions_tmp['group_id'].value_counts()
group_sizes_df = pd.DataFrame({'group_id': group_sizes.index, 'group_size': group_sizes.values})

INTERACTIONS_WITH_GROUP_SIZES_DF = interactions_tmp.merge(group_sizes_df, on='group_id')             
interactions_tmp = INTERACTIONS_WITH_GROUP_SIZES_DF[INTERACTIONS_WITH_GROUP_SIZES_DF.group_size >= 2]
#interactions_tmp = INTERACTIONS_WITH_GROUP_SIZES_DF[INTERACTIONS_WITH_GROUP_SIZES_DF.group_size <= 3000]
interactions_tmp.sort_values('group_size')

Unnamed: 0,group_id,business_id,owner,group_size
155696,3e9dd356-2b21-45ae-9ee4-7cd6cc122fe1,07577937,5e87095492119e00066e7158,2
106198,31503959-943a-4081-abcc-dc80e5cb0402,15093748,5db034c64320cd0006d2b788,2
313746,cab22fae-db47-46b6-b902-3d9a1b1051f6,01163004,5e4534bc7bf061000697e940,2
313747,cab22fae-db47-46b6-b902-3d9a1b1051f6,10410900,5e4534bc7bf061000697e940,2
545392,0967d6ed-88b7-4023-a720-f09f7051f24d,17944788,5efdbc656488210007bc27f6,2
...,...,...,...,...
8042,a5c6ce2e-22ab-4871-bd72-e5da294b33cc,16029641,5e1489f3c2f568000654ecbb,3999
8043,a5c6ce2e-22ab-4871-bd72-e5da294b33cc,16030167,5e1489f3c2f568000654ecbb,3999
8044,a5c6ce2e-22ab-4871-bd72-e5da294b33cc,16030415,5e1489f3c2f568000654ecbb,3999
8031,a5c6ce2e-22ab-4871-bd72-e5da294b33cc,16001948,5e1489f3c2f568000654ecbb,3999
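
The same minimum-group-size filter can also be written without the intermediate merge. A minimal sketch on toy data, assuming only the `group_id` column name from the cell above; the helper `filter_min_group_size` is hypothetical:

```python
import pandas as pd

def filter_min_group_size(interactions_df, min_size=2):
    """Keep only rows whose group has at least min_size interaction rows."""
    group_size = interactions_df.groupby('group_id')['group_id'].transform('size')
    return interactions_df[group_size >= min_size]

# Toy data: group 'b' has a single interaction and is dropped
toy = pd.DataFrame({
    'group_id': ['a', 'a', 'b', 'c', 'c', 'c'],
    'business_id': ['01', '02', '03', '04', '05', '06'],
})
filtered = filter_min_group_size(toy)
print(filtered['group_id'].tolist())
```

As in the notebook, "group size" here means the number of interaction rows per group.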


In [133]:
# add interactions for the corporate-group ("K-") companies
concern_interactions = interactions_tmp.copy()
concern_interactions['business_id'] = 'K-' + concern_interactions['business_id'].astype(str)
concern_interactions = concern_interactions[concern_interactions.business_id.isin(ITEM_IDS)]
concern_interactions

Unnamed: 0,group_id,business_id,owner,group_size
5,c2626398-faac-4ff3-b02d-cdc64b50cdaa,K-01681709,60646431ae18cb00063ed63f,1862
6,c2626398-faac-4ff3-b02d-cdc64b50cdaa,K-15055514,60646431ae18cb00063ed63f,1862
7,c2626398-faac-4ff3-b02d-cdc64b50cdaa,K-01876143,60646431ae18cb00063ed63f,1862
9,c2626398-faac-4ff3-b02d-cdc64b50cdaa,K-05363070,60646431ae18cb00063ed63f,1862
10,c2626398-faac-4ff3-b02d-cdc64b50cdaa,K-01387534,60646431ae18cb00063ed63f,1862
...,...,...,...,...
548074,8b0915ff-a0cb-4520-9160-8d783a6bf308,K-02106319,6110c56241e21e000857ca77,131
548110,8b0915ff-a0cb-4520-9160-8d783a6bf308,K-20333371,6110c56241e21e000857ca77,131
548137,8b0915ff-a0cb-4520-9160-8d783a6bf308,K-07027249,6110c56241e21e000857ca77,131
548162,8b0915ff-a0cb-4520-9160-8d783a6bf308,K-02011774,6110c56241e21e000857ca77,131


In [134]:
# combine the corporate-group interactions with the regular ones and drop interactions whose business ID is not found
INTERACTIONS_DF = pd.concat([interactions_tmp, concern_interactions])
INTERACTIONS_DF = INTERACTIONS_DF[INTERACTIONS_DF.business_id.isin(ITEM_IDS)]
INTERACTIONS_DF[INTERACTIONS_DF['business_id'] == 'K-02011774']

USER_IDS = list(set(INTERACTIONS_DF['group_id'].values))
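
To make the corporate-group step concrete: every interaction is duplicated with a `K-` prefix, and a row is kept only when its business ID exists in the company data. A small sketch with made-up IDs; the helper name `add_concern_interactions` and the toy values are illustrative only:

```python
import pandas as pd

def add_concern_interactions(interactions_df, item_ids):
    """Append 'K-'-prefixed copies of the interactions and keep rows with known IDs."""
    concern = interactions_df.copy()
    concern['business_id'] = 'K-' + concern['business_id'].astype(str)
    combined = pd.concat([interactions_df, concern])
    return combined[combined['business_id'].isin(item_ids)]

toy = pd.DataFrame({'group_id': ['g1', 'g1', 'g2'],
                    'business_id': ['111', '222', '333']})
# '333' is missing from the (toy) company data; only '222' has a corporate-group entry
known_items = ['111', '222', 'K-222']
result = add_concern_interactions(toy, known_items)
print(result['business_id'].tolist())
```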


In [135]:
def print_interactions_meta_data(interactions_df):
    print('groups: {groups}, interactions: {interactions}, companies: {companies}'
          .format(groups=len(list(interactions_df['group_id'].unique())),
                  interactions=interactions_df.shape[0], 
                  companies=len(list(interactions_df['business_id'].unique()))))

print('----- group_size>=2 -----')
print_interactions_meta_data(INTERACTIONS_DF)

----- group_size>=2 -----
groups: 1312, interactions: 598703, companies: 143839


## Creating partitioned datasets for cross-validation

## Creating Dataset objects that LightFM understands

In [136]:
def create_item_features_ds():
    return [(company['business_id'], 
                [company[feature] for feature in SELECTED_COMPANY_FEATURES])
                    for company in COMPANIES_DF.to_dict(orient='records')]


In [137]:
def calculate_gini_for_word(word, train_interactions_df, alpha):
    # feature labels have the form '<column>+<class>', so the column name is the part before '+'
    col_name = word.split('+')[0]
    matches_df = COMPANIES_DF[COMPANIES_DF[col_name] == word]
    
    matched_docs_total = matches_df.shape[0]
    
    match_bids = list(matches_df['business_id'].unique())
    
    matching_interactions_df = train_interactions_df[train_interactions_df['business_id'].isin(match_bids)]
    
    # companies with this feature value that appear in at least one training interaction
    interacted_docs_count = matching_interactions_df['business_id'].unique().shape[0]
    non_interacted_docs_count = matched_docs_total - interacted_docs_count
    
    # Gini impurity of the interacted / non-interacted split: 0 = pure, 0.5 = evenly split
    gini_index = 1 - ((interacted_docs_count / matched_docs_total) ** 2 +
                      (non_interacted_docs_count / matched_docs_total) ** 2)
    
    # the feature weight is alpha minus the impurity
    return (word, alpha - gini_index, interacted_docs_count, matched_docs_total)

def create_gini_weighted_item_features(train_interactions_df, alpha):
    feature_weights = {}

    for word in ITEM_FEATURE_LABELS:
        gini = calculate_gini_for_word(word, train_interactions_df, alpha)
        feature_weights[word] = gini[1]

    return [(company['business_id'], 
                {k: feature_weights[k] for k in [company[feature] for feature in SELECTED_COMPANY_FEATURES]})
                    for company in COMPANIES_DF.to_dict(orient='records')]
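
For reference, `calculate_gini_for_word` computes the Gini impurity of the split "interacted vs. not interacted" among the companies that carry a feature value, and uses `alpha - gini` as the weight. The impurity ranges from 0 (all or none of the companies were interacted with) to 0.5 (exactly half), so the weight ranges from `alpha - 0.5` to `alpha`. A standalone sketch of the arithmetic, with a hypothetical helper name and made-up counts:

```python
def gini_weight(interacted_count, total_count, alpha):
    """Weight = alpha - Gini impurity of the interacted / non-interacted split."""
    p_interacted = interacted_count / total_count
    p_not_interacted = 1 - p_interacted
    gini = 1 - (p_interacted ** 2 + p_not_interacted ** 2)
    return alpha - gini

# 30 of 100 companies with this feature value were interacted with:
# gini = 1 - (0.09 + 0.49) = 0.42
print(round(gini_weight(30, 100, alpha=1.0), 4))
# An evenly split feature value has the maximum impurity of 0.5
print(round(gini_weight(50, 100, alpha=0.5), 4))
```

With `alpha = 0.5` an evenly split feature value receives weight zero, whereas larger values of `alpha` shrink the relative difference between feature values.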


In [145]:
def create_dataset(train_interactions_df, test_interactions_df, alpha=None):
    dataset = Dataset(user_identity_features=False)

    
    train_interactions = [(interaction['group_id'], interaction['business_id']) 
                for interaction in train_interactions_df.to_dict(orient='records')]

    test_interactions = [(interaction['group_id'], interaction['business_id']) 
            for interaction in test_interactions_df.to_dict(orient='records')]
    
    dataset.fit(users=USER_IDS, items=ITEM_IDS, item_features=ITEM_FEATURE_LABELS)

    (train_interactions_ds, _) = dataset.build_interactions(train_interactions)
    (test_interactions_ds, _) = dataset.build_interactions(test_interactions)
    
    if alpha is None:
        item_features_ds = dataset.build_item_features(create_item_features_ds(), normalize=False)
        return (train_interactions_ds, test_interactions_ds, item_features_ds, dataset.mapping())

    else:
        item_features_ds = dataset.build_item_features(create_gini_weighted_item_features(train_interactions_df, alpha), normalize=False)
        return (train_interactions_ds, test_interactions_ds, item_features_ds)

## Evaluating model quality

In [140]:
NUM_THREADS = 10

def run_evaluation_function(model, test_ds, train_ds, evaluation_function, name, item_features=None):    
    print('Calculating {name} for train dataset...'.format(name=name))
    train_results = evaluation_function(model, train_ds, item_features=item_features, num_threads=NUM_THREADS)
    np.savetxt('iteration3-train-results-{}.txt'.format(name), train_results)
    train_metric = train_results.mean()
    
    print('Calculating {name} for test dataset...'.format(name=name))
    test_results = evaluation_function(model, test_ds, train_ds, item_features=item_features, num_threads=NUM_THREADS)
    np.savetxt('iteration3-test-results-{}.txt'.format(name), test_results)
    test_metric = test_results.mean()
    
    print('{name}: train {train_metric:.4f}, test {test_metric:.4f}'.format(name=name, 
                                                                            train_metric=train_metric, 
                                                                            test_metric=test_metric))
    print('\n')
    return (train_metric, test_metric)

def run_evaluations_for_ds(model, train_ds, test_ds, model_name, item_features=None):
    auc = run_evaluation_function(model, test_ds, train_ds, auc_score, 'AUC_' + model_name, item_features)
    precision = run_evaluation_function(model, test_ds, train_ds, precision_at_k, 'PRECISION_' + model_name, item_features)
    #recall = run_evaluation_function(model, test_ds, train_ds, recall_at_k, 'RECALL_' + model_name, item_features)
    #reciprocal = run_evaluation_function(model, test_ds, train_ds, reciprocal_rank, 'RECIPROCAL_' + model_name, item_features)
    
    return (auc, precision) #, recall, reciprocal)
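
As a reminder of what the two metrics measure for a single group: precision at k is the share of the k highest-scored items that are known positives, and AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative. The sketch below computes both with plain NumPy on made-up scores. It is not LightFM's implementation (which, among other things, can exclude training interactions from the ranking), and the function names are hypothetical.

```python
import numpy as np

def precision_at_k_single(scores, positives, k=10):
    """Share of the k highest-scored item indices that are known positives."""
    top_k = np.argsort(-np.asarray(scores))[:k]
    return len(set(top_k.tolist()) & set(positives)) / k

def auc_single(scores, positives):
    """Probability that a random positive is scored above a random negative."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.zeros(len(scores), dtype=bool)
    is_positive[list(positives)] = True
    pos_scores = scores[is_positive]
    neg_scores = scores[~is_positive]
    # Compare every positive against every negative
    wins = (pos_scores[:, None] > neg_scores[None, :]).sum()
    return wins / (len(pos_scores) * len(neg_scores))

toy_scores = [0.9, 0.1, 0.8, 0.3, 0.2]
toy_positives = [0, 3]  # item indices the group has interacted with
print(precision_at_k_single(toy_scores, toy_positives, k=2))
print(auc_single(toy_scores, toy_positives))
```

With well over a hundred thousand items and comparatively few positives per group, AUC can be close to 1 while precision at k stays low, which is the pattern in the results of this notebook.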


In [52]:
def run_evaluations(interactions_df_cv):

    results = {
        'NO_WEIGHTING': [],
        'GINI_05': [],
        'GINI_10': [],
        'GINI_25': [],
        'GINI_50': []
    }
    
    for i in range(0, len(interactions_df_cv)):
        print('Starting iteration {}...'.format(i))
            
        test_interactions_df = interactions_df_cv[i]
        
        # collect every partition except the test partition into a new list
        train_df_tmp = [ds for j,ds in enumerate(interactions_df_cv) if j != i]
        # concatenate the selected partitions into the training set
        train_interactions_df = pd.concat(train_df_tmp)

        print('test_interactions', test_interactions_df.shape)
        print('train_interactions', train_interactions_df.shape)
        
        # (result key, alpha) pairs; alpha=None trains on unweighted item features
        configurations = [('NO_WEIGHTING', None), ('GINI_05', 0.5), ('GINI_10', 1.0),
                          ('GINI_25', 2.5), ('GINI_50', 5.0)]

        for name, alpha in configurations:
            # create_dataset also returns the id mappings when alpha is None, so keep only the first three values
            (train_interactions_ds, test_interactions_ds, item_features_ds) = \
                create_dataset(train_interactions_df, test_interactions_df, alpha)[:3]

            MODEL = LightFM(loss='warp')
            MODEL.fit(train_interactions_ds, item_features=item_features_ds, epochs=5, num_threads=NUM_THREADS, verbose=True)

            results[name].append(run_evaluations_for_ds(MODEL, train_interactions_ds, test_interactions_ds, '{}_{}'.format(name, i), item_features_ds))

    return results

In [56]:
def print_metric_result(result_arr, model_name):
    train_results = [x[0] for x in result_arr]
    test_results = [x[1] for x in result_arr]
    
    print('{name}:\n train mean {train_mean:.4f} ({train_arr})\n test mean {test_mean:.4f} ({test_arr})\n'
          .format(train_mean=statistics.mean(train_results),
                 test_mean=statistics.mean(test_results),
                 train_arr=['%.4f' % x for x in train_results],
                 test_arr=['%.4f' % x for x in test_results],
                 name=model_name))
    

def print_all_results(results):
    for i,metric in enumerate(['AUC', 'PRECISION']): #, 'RECALL', 'RECIPROCAL']):
        print('\n-----{}-----'.format(metric))
        for model_name,result_arr in results.items():
            print_metric_result([res[i] for res in result_arr], model_name)
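
When comparing weightings across folds, the spread is as informative as the mean. A small sketch, using the per-fold test AUC values that the run in this notebook reports for two of the configurations; `summarize_folds` is a hypothetical helper:

```python
import statistics

def summarize_folds(values):
    """Return (mean, sample standard deviation) of per-fold metric values."""
    return statistics.mean(values), statistics.stdev(values)

# Per-fold test AUC values copied from the printed results
test_auc = {
    'NO_WEIGHTING': [0.9889, 0.9881, 0.9883, 0.9896, 0.9895],
    'GINI_10': [0.9893, 0.9901, 0.9887, 0.9890, 0.9897],
}
for name, values in test_auc.items():
    mean, stdev = summarize_folds(values)
    print('{}: mean {:.4f}, stdev {:.4f}'.format(name, mean, stdev))
```

For these two configurations the gap between the means is of the same order as the fold-to-fold spread, so the difference should be read with caution.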
    

### Results

In [141]:
def create_partitioned_datasets(interactions_df):
    (rest, fifth_1) = train_test_split(interactions_df, test_size=0.2)
    (rest, fifth_2) = train_test_split(rest, test_size=0.25)
    (rest, fifth_3) = train_test_split(rest, test_size=0.3333333)
    (fifth_4, fifth_5) = train_test_split(rest, test_size=0.5)
    
    return [fifth_1, fifth_2, fifth_3, fifth_4, fifth_5]
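
The chained `train_test_split` calls above carve out five roughly equal parts. A similar partition can be sketched by shuffling the row positions once and splitting them into `n_folds` chunks. The helper name and the fixed seed are assumptions added for reproducibility; the notebook's own function does not set a seed.

```python
import numpy as np
import pandas as pd

def partition_rows(df, n_folds=5, seed=0):
    """Shuffle row positions and split them into n_folds disjoint parts."""
    rng = np.random.default_rng(seed)
    shuffled_positions = rng.permutation(len(df))
    return [df.iloc[chunk] for chunk in np.array_split(shuffled_positions, n_folds)]

# Toy frame with 13 rows, so the folds cannot all be the same size
toy = pd.DataFrame({'group_id': list('abcdefghijklm'), 'business_id': range(13)})
folds = partition_rows(toy)
print([len(fold) for fold in folds])
```

Every row lands in exactly one fold, and fold sizes differ by at most one.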
    

In [142]:
INTERACTIONS_CV = create_partitioned_datasets(INTERACTIONS_DF)
for cv in INTERACTIONS_CV:
    print(cv.shape)

(119741, 4)
(119741, 4)
(119741, 4)
(119740, 4)
(119740, 4)


In [54]:
RESULTS = run_evaluations(INTERACTIONS_CV)

Starting iteration 0...
test_interactions (119741, 4)
train_interactions (478962, 4)


Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  1.94it/s]


Calculating AUC_NO_WEIGHTING_0 for train dataset...
Calculating AUC_NO_WEIGHTING_0 for test dataset...
AUC_NO_WEIGHTING_0: train 0.9957, test 0.9889


Calculating PRECISION_NO_WEIGHTING_0 for train dataset...
Calculating PRECISION_NO_WEIGHTING_0 for test dataset...
PRECISION_NO_WEIGHTING_0: train 0.1931, test 0.0957




Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  1.84it/s]


Calculating AUC_GINI_05_0 for train dataset...
Calculating AUC_GINI_05_0 for test dataset...
AUC_GINI_05_0: train 0.9946, test 0.9869


Calculating PRECISION_GINI_05_0 for train dataset...
Calculating PRECISION_GINI_05_0 for test dataset...
PRECISION_GINI_05_0: train 0.1400, test 0.0703




Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  2.07it/s]


Calculating AUC_GINI_10_0 for train dataset...
Calculating AUC_GINI_10_0 for test dataset...
AUC_GINI_10_0: train 0.9959, test 0.9893


Calculating PRECISION_GINI_10_0 for train dataset...
Calculating PRECISION_GINI_10_0 for test dataset...
PRECISION_GINI_10_0: train 0.1773, test 0.0938




Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  2.03it/s]


Calculating AUC_GINI_25_0 for train dataset...
Calculating AUC_GINI_25_0 for test dataset...
AUC_GINI_25_0: train 0.9916, test 0.9808


Calculating PRECISION_GINI_25_0 for train dataset...
Calculating PRECISION_GINI_25_0 for test dataset...
PRECISION_GINI_25_0: train 0.0841, test 0.0350




Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.11it/s]


Calculating AUC_GINI_50_0 for train dataset...
Calculating AUC_GINI_50_0 for test dataset...
AUC_GINI_50_0: train 0.7287, test 0.7073


Calculating PRECISION_GINI_50_0 for train dataset...
Calculating PRECISION_GINI_50_0 for test dataset...
PRECISION_GINI_50_0: train 0.0024, test 0.0012


Starting iteration 1...
test_interactions (119741, 4)
train_interactions (478962, 4)


Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  2.06it/s]


Calculating AUC_NO_WEIGHTING_1 for train dataset...
Calculating AUC_NO_WEIGHTING_1 for test dataset...
AUC_NO_WEIGHTING_1: train 0.9955, test 0.9881


Calculating PRECISION_NO_WEIGHTING_1 for train dataset...
Calculating PRECISION_NO_WEIGHTING_1 for test dataset...
PRECISION_NO_WEIGHTING_1: train 0.1850, test 0.0934




Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  1.83it/s]


Calculating AUC_GINI_05_1 for train dataset...
Calculating AUC_GINI_05_1 for test dataset...
AUC_GINI_05_1: train 0.9947, test 0.9875


Calculating PRECISION_GINI_05_1 for train dataset...
Calculating PRECISION_GINI_05_1 for test dataset...
PRECISION_GINI_05_1: train 0.1433, test 0.0661




Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  1.99it/s]


Calculating AUC_GINI_10_1 for train dataset...
Calculating AUC_GINI_10_1 for test dataset...
AUC_GINI_10_1: train 0.9959, test 0.9901


Calculating PRECISION_GINI_10_1 for train dataset...
Calculating PRECISION_GINI_10_1 for test dataset...
PRECISION_GINI_10_1: train 0.1862, test 0.0905




Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  1.98it/s]


Calculating AUC_GINI_25_1 for train dataset...
Calculating AUC_GINI_25_1 for test dataset...
AUC_GINI_25_1: train 0.9898, test 0.9802


Calculating PRECISION_GINI_25_1 for train dataset...
Calculating PRECISION_GINI_25_1 for test dataset...
PRECISION_GINI_25_1: train 0.0817, test 0.0343




Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.00it/s]


Calculating AUC_GINI_50_1 for train dataset...
Calculating AUC_GINI_50_1 for test dataset...
AUC_GINI_50_1: train 0.7631, test 0.7427


Calculating PRECISION_GINI_50_1 for train dataset...
Calculating PRECISION_GINI_50_1 for test dataset...
PRECISION_GINI_50_1: train 0.0034, test 0.0021


Starting iteration 2...
test_interactions (119741, 4)
train_interactions (478962, 4)


Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  2.09it/s]


Calculating AUC_NO_WEIGHTING_2 for train dataset...
Calculating AUC_NO_WEIGHTING_2 for test dataset...
AUC_NO_WEIGHTING_2: train 0.9957, test 0.9883


Calculating PRECISION_NO_WEIGHTING_2 for train dataset...
Calculating PRECISION_NO_WEIGHTING_2 for test dataset...
PRECISION_NO_WEIGHTING_2: train 0.1836, test 0.0905




Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  1.86it/s]


Calculating AUC_GINI_05_2 for train dataset...
Calculating AUC_GINI_05_2 for test dataset...
AUC_GINI_05_2: train 0.9947, test 0.9870


Calculating PRECISION_GINI_05_2 for train dataset...
Calculating PRECISION_GINI_05_2 for test dataset...
PRECISION_GINI_05_2: train 0.1394, test 0.0635




Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  2.05it/s]


Calculating AUC_GINI_10_2 for train dataset...
Calculating AUC_GINI_10_2 for test dataset...
AUC_GINI_10_2: train 0.9959, test 0.9887


Calculating PRECISION_GINI_10_2 for train dataset...
Calculating PRECISION_GINI_10_2 for test dataset...
PRECISION_GINI_10_2: train 0.1761, test 0.0903




Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  2.13it/s]


Calculating AUC_GINI_25_2 for train dataset...
Calculating AUC_GINI_25_2 for test dataset...
AUC_GINI_25_2: train 0.9923, test 0.9827


Calculating PRECISION_GINI_25_2 for train dataset...
Calculating PRECISION_GINI_25_2 for test dataset...
PRECISION_GINI_25_2: train 0.1262, test 0.0469




Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:03<00:00,  1.37it/s]


Calculating AUC_GINI_50_2 for train dataset...
Calculating AUC_GINI_50_2 for test dataset...
AUC_GINI_50_2: train 0.9171, test 0.9054


Calculating PRECISION_GINI_50_2 for train dataset...
Calculating PRECISION_GINI_50_2 for test dataset...
PRECISION_GINI_50_2: train 0.0073, test 0.0024


Starting iteration 3...
test_interactions (119740, 4)
train_interactions (478963, 4)


Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  2.07it/s]


Calculating AUC_NO_WEIGHTING_3 for train dataset...
Calculating AUC_NO_WEIGHTING_3 for test dataset...
AUC_NO_WEIGHTING_3: train 0.9958, test 0.9896


Calculating PRECISION_NO_WEIGHTING_3 for train dataset...
Calculating PRECISION_NO_WEIGHTING_3 for test dataset...
PRECISION_NO_WEIGHTING_3: train 0.1886, test 0.0943




Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  1.85it/s]


Calculating AUC_GINI_05_3 for train dataset...
Calculating AUC_GINI_05_3 for test dataset...
AUC_GINI_05_3: train 0.9946, test 0.9871


Calculating PRECISION_GINI_05_3 for train dataset...
Calculating PRECISION_GINI_05_3 for test dataset...
PRECISION_GINI_05_3: train 0.1440, test 0.0677




Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  2.04it/s]


Calculating AUC_GINI_10_3 for train dataset...
Calculating AUC_GINI_10_3 for test dataset...
AUC_GINI_10_3: train 0.9958, test 0.9890


Calculating PRECISION_GINI_10_3 for train dataset...
Calculating PRECISION_GINI_10_3 for test dataset...
PRECISION_GINI_10_3: train 0.1858, test 0.0911




Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  2.07it/s]


Calculating AUC_GINI_25_3 for train dataset...
Calculating AUC_GINI_25_3 for test dataset...
AUC_GINI_25_3: train 0.9927, test 0.9824


Calculating PRECISION_GINI_25_3 for train dataset...
Calculating PRECISION_GINI_25_3 for test dataset...
PRECISION_GINI_25_3: train 0.1285, test 0.0497




Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.10it/s]


Calculating AUC_GINI_50_3 for train dataset...
Calculating AUC_GINI_50_3 for test dataset...
AUC_GINI_50_3: train 0.7829, test 0.7706


Calculating PRECISION_GINI_50_3 for train dataset...
Calculating PRECISION_GINI_50_3 for test dataset...
PRECISION_GINI_50_3: train 0.0054, test 0.0016


Starting iteration 4...
test_interactions (119740, 4)
train_interactions (478963, 4)


Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  1.97it/s]


Calculating AUC_NO_WEIGHTING_4 for train dataset...
Calculating AUC_NO_WEIGHTING_4 for test dataset...
AUC_NO_WEIGHTING_4: train 0.9958, test 0.9895


Calculating PRECISION_NO_WEIGHTING_4 for train dataset...
Calculating PRECISION_NO_WEIGHTING_4 for test dataset...
PRECISION_NO_WEIGHTING_4: train 0.1898, test 0.0976




Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  1.86it/s]


Calculating AUC_GINI_05_4 for train dataset...
Calculating AUC_GINI_05_4 for test dataset...
AUC_GINI_05_4: train 0.9948, test 0.9875


Calculating PRECISION_GINI_05_4 for train dataset...
Calculating PRECISION_GINI_05_4 for test dataset...
PRECISION_GINI_05_4: train 0.1344, test 0.0668




Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  2.09it/s]


Calculating AUC_GINI_10_4 for train dataset...
Calculating AUC_GINI_10_4 for test dataset...
AUC_GINI_10_4: train 0.9958, test 0.9897


Calculating PRECISION_GINI_10_4 for train dataset...
Calculating PRECISION_GINI_10_4 for test dataset...
PRECISION_GINI_10_4: train 0.1731, test 0.0935




Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  2.11it/s]


Calculating AUC_GINI_25_4 for train dataset...
Calculating AUC_GINI_25_4 for test dataset...
AUC_GINI_25_4: train 0.9878, test 0.9777


Calculating PRECISION_GINI_25_4 for train dataset...
Calculating PRECISION_GINI_25_4 for test dataset...
PRECISION_GINI_25_4: train 0.0824, test 0.0329




Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.16it/s]


Calculating AUC_GINI_50_4 for train dataset...
Calculating AUC_GINI_50_4 for test dataset...
AUC_GINI_50_4: train 0.7878, test 0.7731


Calculating PRECISION_GINI_50_4 for train dataset...
Calculating PRECISION_GINI_50_4 for test dataset...
PRECISION_GINI_50_4: train 0.0041, test 0.0012




In [57]:
print_all_results(RESULTS)


-----AUC-----
NO_WEIGHTING:
 train mean 0.9957 (['0.9957', '0.9955', '0.9957', '0.9958', '0.9958'])
 test mean 0.9889 (['0.9889', '0.9881', '0.9883', '0.9896', '0.9895'])

GINI_05:
 train mean 0.9947 (['0.9946', '0.9947', '0.9947', '0.9946', '0.9948'])
 test mean 0.9872 (['0.9869', '0.9875', '0.9870', '0.9871', '0.9875'])

GINI_10:
 train mean 0.9958 (['0.9959', '0.9959', '0.9959', '0.9958', '0.9958'])
 test mean 0.9894 (['0.9893', '0.9901', '0.9887', '0.9890', '0.9897'])

GINI_25:
 train mean 0.9908 (['0.9916', '0.9898', '0.9923', '0.9927', '0.9878'])
 test mean 0.9808 (['0.9808', '0.9802', '0.9827', '0.9824', '0.9777'])

GINI_50:
 train mean 0.7959 (['0.7287', '0.7631', '0.9171', '0.7829', '0.7878'])
 test mean 0.7798 (['0.7073', '0.7427', '0.9054', '0.7706', '0.7731'])


-----PRECISION-----
NO_WEIGHTING:
 train mean 0.1880 (['0.1931', '0.1850', '0.1836', '0.1886', '0.1898'])
 test mean 0.0943 (['0.0957', '0.0934', '0.0905', '0.0943', '0.0976'])

GINI_05:
 train mean 0.1402 (['0.140

In [146]:
test_interactions_df = INTERACTIONS_CV[0]
train_df_tmp = [ds for j,ds in enumerate(INTERACTIONS_CV) if j != 0]
train_interactions_df = pd.concat(train_df_tmp)

(train_interactions_ds, test_interactions_ds, item_features_ds, mapping) = create_dataset(train_interactions_df, test_interactions_df)

MODEL = LightFM(loss='warp')
MODEL.fit(train_interactions_ds, item_features=item_features_ds, epochs=5, num_threads=NUM_THREADS, verbose=True)

Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  1.95it/s]


<lightfm.lightfm.LightFM at 0x7ffa058bc130>

In [144]:
run_evaluations_for_ds(MODEL, train_interactions_ds, test_interactions_ds, '{}_{}'.format('MODEL', 0), item_features_ds)

Calculating AUC_MODEL_0 for train dataset...
Calculating AUC_MODEL_0 for test dataset...
AUC_MODEL_0: train 0.9957, test 0.9896


Calculating PRECISION_MODEL_0 for train dataset...
Calculating PRECISION_MODEL_0 for test dataset...
PRECISION_MODEL_0: train 0.1878, test 0.0939




((0.9957112, 0.9896353), (0.18777692, 0.093921736))

In [157]:
item_mapping = mapping[2]
item_mapping = {v: k for k, v in item_mapping.items()}

user_mapping = mapping[0]
user_mapping = {v: k for k, v in user_mapping.items()}

for user in [0,200,300,755]:
    scores = MODEL.predict(user, list(range(0, train_interactions_ds.shape[1])), item_features_ds)
    top10 = np.argsort(-scores)[0:10]
    results = []
    for i in top10:
        company_name = COMPANIES_DF[COMPANIES_DF.business_id == item_mapping[i]].company_name
        print(company_name)

    print(results)
    print('-----\n')

1060052    Dustin Finland Oy
Name: company_name, dtype: object
149124    SAS Institute Oy
Name: company_name, dtype: object
405905    HCL Technologies Limited, Helsingin sivuliike
Name: company_name, dtype: object
271980    North European Oil Trade Oy
Name: company_name, dtype: object
1140895    DNA Welho Oy
Name: company_name, dtype: object
924366    Huawei Technologies Oy (Finland) Co. Ltd
Name: company_name, dtype: object
136525    Oy Dell Ab
Name: company_name, dtype: object
164811    Bang & Bonsomer Group Ab
Name: company_name, dtype: object
473224    McKinsey & Company, Inc. Finland, sivuliike Su...
Name: company_name, dtype: object
814485    Oy Teboil Ab
Name: company_name, dtype: object
[]
-----

471400    Kohi-Group Oy
Name: company_name, dtype: object
1134317    Mäkelä Alu Oy
Name: company_name, dtype: object
838198    Luoman Oy
Name: company_name, dtype: object
1148309    Luoman Puutuote Oy
Name: company_name, dtype: object
1165680    Ab Solving Oy
Name: company_name, dtype:
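
The lookup in the loop above can be wrapped in a small helper that returns the top-k item IDs directly. A sketch on made-up scores, assuming an index-to-ID dictionary shaped like the inverted `item_mapping`; `top_k_items` is a hypothetical name:

```python
import numpy as np

def top_k_items(scores, index_to_item, k=10):
    """Return the IDs of the k highest-scored items, best first."""
    order = np.argsort(-np.asarray(scores))[:k]
    return [index_to_item[int(i)] for i in order]

toy_scores = [0.1, 0.9, 0.5]
toy_mapping = {0: 'A', 1: 'B', 2: 'C'}  # internal index -> item ID
print(top_k_items(toy_scores, toy_mapping, k=2))
```

Returning a plain list keeps the ranking order, which a later `isin` filter on the company table would not.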

In [239]:
item_mapping = mapping[2]
item_mapping = {v: k for k, v in item_mapping.items()}

user_mapping = mapping[0]
user_mapping = {v: k for k, v in user_mapping.items()}

user = 103

user_id = user_mapping[user]
known_positives = INTERACTIONS_DF[INTERACTIONS_DF.group_id == user_id].business_id.tolist()
COMPANIES_DF[COMPANIES_DF.business_id.isin(known_positives)]

Unnamed: 0,business_id,company_name,company_form_code,location_region_code,company_status_code,industry_code,turnover,net_profit,personnel_average,performer_ranking_points,risk_rating_class
425,09444046,Cross Wrap Oy,company_form_code+CO_16,location_region_code+11,company_status_code+AKT,industry_code+28,turnover+4,net_profit+5,personnel_average+NaN,performer_ranking_points+2,risk_rating_class+GREEN
1852,02824927,Aimo Virtanen Oy,company_form_code+CO_16,location_region_code+02,company_status_code+AKT,industry_code+25,turnover+4,net_profit+3,personnel_average+1,performer_ranking_points+2,risk_rating_class+GREEN
2082,08369996,Jita Oy,company_form_code+CO_16,location_region_code+06,company_status_code+AKT,industry_code+22,turnover+top,net_profit+top,personnel_average+2,performer_ranking_points+2,risk_rating_class+GREEN
2099,25497076,LeaseGreen Suomi Oy,company_form_code+CO_16,location_region_code+01,company_status_code+AKT,industry_code+43,turnover+top,net_profit+0,personnel_average+2,performer_ranking_points+0,risk_rating_class+RED
2760,31012434,"ADB Safegate BV, Suomen sivuliike",company_form_code+CO_19,location_region_code+01,company_status_code+AKT,industry_code+43,turnover+NaN,net_profit+NaN,personnel_average+NaN,performer_ranking_points+NaN,risk_rating_class+NaN
...,...,...,...,...,...,...,...,...,...,...,...
1332145,10227508,GE Power Finland Oy,company_form_code+CO_16,location_region_code+01,company_status_code+AKT,industry_code+71,turnover+top,net_profit+5,personnel_average+2,performer_ranking_points+1,risk_rating_class+NaN
1332956,23441685,Paja Finanssipalvelut Oy,company_form_code+CO_16,location_region_code+01,company_status_code+AKT,industry_code+70,turnover+top,net_profit+top,personnel_average+3,performer_ranking_points+top,risk_rating_class+GREEN
1332964,01365271,Teollisuusmaalaamo M Fingerroos Oy,company_form_code+CO_16,location_region_code+02,company_status_code+AKT,industry_code+43,turnover+3,net_profit+2,personnel_average+1,performer_ranking_points+1,risk_rating_class+GREEN
1333453,07554014,TECHNIA Oy,company_form_code+CO_16,location_region_code+01,company_status_code+AKT,industry_code+62,turnover+4,net_profit+5,personnel_average+1,performer_ranking_points+2,risk_rating_class+GREEN
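
The lookup above relies on inverting an id-to-index mapping. A minimal sketch of keeping both directions under separate names, so that re-running a cell cannot flip the mapping back. The toy ids and the names `group_to_index` and `index_to_group` are hypothetical and do not come from the notebook:

```python
# Toy stand-in for an external-id -> internal-index mapping (hypothetical values).
group_to_index = {"group_a": 0, "group_b": 1, "group_c": 2}

# Build the reverse direction under its own name instead of rebinding the original.
index_to_group = {index: group for group, index in group_to_index.items()}

# The inversion is only lossless if every internal index is unique.
assert len(index_to_group) == len(group_to_index)

user = 1                        # internal index
user_id = index_to_group[user]  # external id
print(user_id)
```

With two separately named dictionaries the cell is idempotent, and the forward mapping stays available for later lookups.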


In [242]:
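# Score every item (internal indices 0..n_items-1) for the selected user.
scores = MODEL.predict(user, list(range(train_interactions_ds.shape[1])), item_features=item_features_ds)
# NOTE: despite the "top10" names, the slice keeps the 100 highest-scoring items.
top10 = np.argsort(-scores)[:100]

# Map internal item indices back to business ids.
top10bids = [item_mapping[i] for i in top10]

# isin() returns the rows in COMPANIES_DF order, not in ranking order.
COMPANIES_DF[COMPANIES_DF.business_id.isin(top10bids)]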

Unnamed: 0,business_id,company_name,company_form_code,location_region_code,company_status_code,industry_code,turnover,net_profit,personnel_average,performer_ranking_points,risk_rating_class
30923,22873734,Tasowheel Gears Oy,company_form_code+CO_16,location_region_code+06,company_status_code+AKT,industry_code+28,turnover+top,net_profit+5,personnel_average+2,performer_ranking_points+2,risk_rating_class+GREEN
86732,K-04340643,Hypap Oy,company_form_code+CO_16,location_region_code+06,company_status_code+AKT,industry_code+17,turnover+top,net_profit+5,personnel_average+2,performer_ranking_points+0,risk_rating_class+GREEN
95873,K-20452813,Stresstech Oy,company_form_code+CO_16,location_region_code+13,company_status_code+AKT,industry_code+26,turnover+top,net_profit+5,personnel_average+3,performer_ranking_points+2,risk_rating_class+NaN
100138,01642056,Weckman Steel Oy,company_form_code+CO_16,location_region_code+07,company_status_code+AKT,industry_code+25,turnover+top,net_profit+top,personnel_average+3,performer_ranking_points+3,risk_rating_class+GREEN
110401,01544907,Muototerä Oy,company_form_code+CO_16,location_region_code+06,company_status_code+AKT,industry_code+25,turnover+4,net_profit+5,personnel_average+2,performer_ranking_points+2,risk_rating_class+GREEN
...,...,...,...,...,...,...,...,...,...,...,...
1217392,28188148,Mincon Nordic Oy,company_form_code+CO_16,location_region_code+06,company_status_code+AKT,industry_code+28,turnover+top,net_profit+top,personnel_average+2,performer_ranking_points+2,risk_rating_class+GREEN
1233626,04013300,Arctic Machine Oy,company_form_code+CO_16,location_region_code+13,company_status_code+AKT,industry_code+28,turnover+top,net_profit+5,personnel_average+2,performer_ranking_points+2,risk_rating_class+NaN
1257912,25085581,Wärtsilä Projects Oy,company_form_code+CO_16,location_region_code+15,company_status_code+AKT,industry_code+28,turnover+top,net_profit+0,personnel_average+2,performer_ranking_points+0,risk_rating_class+YELLOW
1297154,27702539,Javasko Machines Oy,company_form_code+CO_16,location_region_code+06,company_status_code+AKT,industry_code+25,turnover+4,net_profit+5,personnel_average+2,performer_ranking_points+2,risk_rating_class+GREEN
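
The table above is listed in `COMPANIES_DF` row order, so the model's ranking is not visible in it. A small sketch of one way to keep the ranking order and attach the score to each row. All data and names here (`companies`, `index_to_bid`, `top_n`) are hypothetical stand-ins, and the approach assumes each `business_id` occurs exactly once in the company table:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the company table, the score vector and the
# internal-index -> business_id mapping.
companies = pd.DataFrame({
    "business_id": ["A", "B", "C", "D"],
    "company_name": ["Alpha Oy", "Beta Oy", "Gamma Oy", "Delta Oy"],
})
scores = np.array([0.1, 0.9, 0.4, 0.7])
index_to_bid = {0: "A", 1: "B", 2: "C", 3: "D"}

top_n = 3
top_idx = np.argsort(-scores)[:top_n]           # best score first
top_bids = [index_to_bid[i] for i in top_idx]

# .loc with a list returns the rows in the order of that list,
# which preserves the ranking (requires unique business_id values).
ranked = companies.set_index("business_id").loc[top_bids].reset_index()
ranked["rank"] = np.arange(1, len(ranked) + 1)
ranked["score"] = scores[top_idx]
print(ranked)
```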


In [243]:
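# Inner join on business_id: companies that are both known positives and in the top-100 list.
# Both sides are slices of COMPANIES_DF, so the _x and _y columns carry identical values.
hits = pd.merge(COMPANIES_DF[COMPANIES_DF.business_id.isin(known_positives)], COMPANIES_DF[COMPANIES_DF.business_id.isin(top10bids)], how='inner', on=['business_id'])
hits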

Unnamed: 0,business_id,company_name_x,company_form_code_x,location_region_code_x,company_status_code_x,industry_code_x,turnover_x,net_profit_x,personnel_average_x,performer_ranking_points_x,...,company_name_y,company_form_code_y,location_region_code_y,company_status_code_y,industry_code_y,turnover_y,net_profit_y,personnel_average_y,performer_ranking_points_y,risk_rating_class_y
0,22873734,Tasowheel Gears Oy,company_form_code+CO_16,location_region_code+06,company_status_code+AKT,industry_code+28,turnover+top,net_profit+5,personnel_average+2,performer_ranking_points+2,...,Tasowheel Gears Oy,company_form_code+CO_16,location_region_code+06,company_status_code+AKT,industry_code+28,turnover+top,net_profit+5,personnel_average+2,performer_ranking_points+2,risk_rating_class+GREEN
1,K-20452813,Stresstech Oy,company_form_code+CO_16,location_region_code+13,company_status_code+AKT,industry_code+26,turnover+top,net_profit+5,personnel_average+3,performer_ranking_points+2,...,Stresstech Oy,company_form_code+CO_16,location_region_code+13,company_status_code+AKT,industry_code+26,turnover+top,net_profit+5,personnel_average+3,performer_ranking_points+2,risk_rating_class+NaN
2,01642056,Weckman Steel Oy,company_form_code+CO_16,location_region_code+07,company_status_code+AKT,industry_code+25,turnover+top,net_profit+top,personnel_average+3,performer_ranking_points+3,...,Weckman Steel Oy,company_form_code+CO_16,location_region_code+07,company_status_code+AKT,industry_code+25,turnover+top,net_profit+top,personnel_average+3,performer_ranking_points+3,risk_rating_class+GREEN
3,01544907,Muototerä Oy,company_form_code+CO_16,location_region_code+06,company_status_code+AKT,industry_code+25,turnover+4,net_profit+5,personnel_average+2,performer_ranking_points+2,...,Muototerä Oy,company_form_code+CO_16,location_region_code+06,company_status_code+AKT,industry_code+25,turnover+4,net_profit+5,personnel_average+2,performer_ranking_points+2,risk_rating_class+GREEN
4,09317207,thyssenkrupp Aerospace Finland Oy,company_form_code+CO_16,location_region_code+13,company_status_code+AKT,industry_code+25,turnover+top,net_profit+top,personnel_average+2,performer_ranking_points+2,...,thyssenkrupp Aerospace Finland Oy,company_form_code+CO_16,location_region_code+13,company_status_code+AKT,industry_code+25,turnover+top,net_profit+top,personnel_average+2,performer_ranking_points+2,risk_rating_class+GREEN
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75,19585950,Urjala Works Oy,company_form_code+CO_16,location_region_code+06,company_status_code+AKT,industry_code+25,turnover+top,net_profit+4,personnel_average+2,performer_ranking_points+1,...,Urjala Works Oy,company_form_code+CO_16,location_region_code+06,company_status_code+AKT,industry_code+25,turnover+top,net_profit+4,personnel_average+2,performer_ranking_points+1,risk_rating_class+GREEN
76,28188148,Mincon Nordic Oy,company_form_code+CO_16,location_region_code+06,company_status_code+AKT,industry_code+28,turnover+top,net_profit+top,personnel_average+2,performer_ranking_points+2,...,Mincon Nordic Oy,company_form_code+CO_16,location_region_code+06,company_status_code+AKT,industry_code+28,turnover+top,net_profit+top,personnel_average+2,performer_ranking_points+2,risk_rating_class+GREEN
77,04013300,Arctic Machine Oy,company_form_code+CO_16,location_region_code+13,company_status_code+AKT,industry_code+28,turnover+top,net_profit+5,personnel_average+2,performer_ranking_points+2,...,Arctic Machine Oy,company_form_code+CO_16,location_region_code+13,company_status_code+AKT,industry_code+28,turnover+top,net_profit+5,personnel_average+2,performer_ranking_points+2,risk_rating_class+NaN
78,27702539,Javasko Machines Oy,company_form_code+CO_16,location_region_code+06,company_status_code+AKT,industry_code+25,turnover+4,net_profit+5,personnel_average+2,performer_ranking_points+2,...,Javasko Machines Oy,company_form_code+CO_16,location_region_code+06,company_status_code+AKT,industry_code+25,turnover+4,net_profit+5,personnel_average+2,performer_ranking_points+2,risk_rating_class+GREEN
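
The index of the hits table runs from 0 to 78, which suggests that 79 of the 100 recommended companies are known positives for this group. If `INTERACTIONS_DF` also contains the interactions the model was trained on, part of these hits may simply be training examples. A hedged sketch of separating the two cases with plain sets; the ids and the name `train_positives` are hypothetical, and in the notebook that set would have to be derived from the training split:

```python
# Hypothetical ids standing in for top10bids, known_positives and the
# positives that were part of the training split.
recommended = ["A", "B", "C", "D", "E"]
known_positives = {"A", "C", "E", "F"}
train_positives = {"A", "C"}  # assumed to be a subset of known_positives

# Recommended companies the group has interacted with at all.
hits = [bid for bid in recommended if bid in known_positives]

# Hits that the model did not see during training.
new_hits = [bid for bid in hits if bid not in train_positives]

hit_share = len(hits) / len(recommended)
new_hit_share = len(new_hits) / len(recommended)
print(f"all hits: {hit_share:.2f}, unseen hits: {new_hit_share:.2f}")
```

Only the second share says anything about generalisation; the first mostly shows that the model reproduces interactions it has already seen.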
