# iteration3

Data:
- interactions from all groups except those of Alma's developers
- interactions are per group, not per user
- all companies; corporate groups are included by prefixing the business ID with "K-"
- basic company information as metadata; the numeric financial-period figures are discretized into custom classes that roughly follow percentiles
- **data preprocessed in the iteration3_feature_selection notebook**
- model trained with the WARP loss
- minimum group size = 2
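
The discretization of numeric financial figures into `column+label` feature strings (such as `turnover+top` in the company data) can be sketched as follows. This is only an illustration: the class boundaries, the label names other than `top`, and the `NaN` label for missing values are assumptions, not the values used in the preprocessing notebook.

```python
from bisect import bisect_right

# Hypothetical class boundaries and labels, for illustration only.
TURNOVER_BOUNDS = [100_000, 1_000_000, 10_000_000]
TURNOVER_LABELS = ['low', 'mid', 'high', 'top']


def discretize(column, value, bounds, labels):
    """Map a numeric value to a 'column+label' feature string."""
    if value is None:
        # assumed convention for missing values
        return '{}+NaN'.format(column)
    # bisect_right returns the index of the class the value falls into
    return '{}+{}'.format(column, labels[bisect_right(bounds, value)])


print(discretize('turnover', 50_000, TURNOVER_BOUNDS, TURNOVER_LABELS))
print(discretize('turnover', 25_000_000, TURNOVER_BOUNDS, TURNOVER_LABELS))
print(discretize('turnover', None, TURNOVER_BOUNDS, TURNOVER_LABELS))
```

Prefixing every label with its column name keeps the feature vocabulary unambiguous when several columns share the same class names.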

Questions:

1. How do the new, modified financial-statement classes and dropping location_municipality perform?
2. How does the Gini-index-weighted item_feature matrix perform with different weightings?
3. What happens if NaN features are not provided?

## Imports

In [30]:
from lightfm import LightFM
from lightfm.cross_validation import random_train_test_split
from lightfm.evaluation import auc_score
from lightfm.evaluation import precision_at_k
from lightfm.evaluation import recall_at_k
from lightfm.evaluation import reciprocal_rank
from lightfm.data import Dataset

import numpy as np
import pandas as pd

import statistics
import functools

from sklearn.model_selection import train_test_split

WORKING_DIRECTORY = '/mnt/c/git/masters-thesis-code/jupyter/code/'

## Selected company metadata

In [17]:
SELECTED_COMPANY_FEATURES = ['company_form_code', 
                             'location_region_code', 'company_status_code', 'industry_code', 'turnover', 
                             'net_profit', 'personnel_average', 'performer_ranking_points', 'risk_rating_class']

## Loading the company data

In [18]:
COMPANIES_DF = pd.read_pickle(WORKING_DIRECTORY + "data/pandas_pickles/company_data_iteration3.pkl")

ITEM_IDS = list(COMPANIES_DF['business_id'].unique())

item_features_tmp = [COMPANIES_DF[feature].unique() for feature in SELECTED_COMPANY_FEATURES]

ITEM_FEATURE_LABELS = [item for sublist in item_features_tmp for item in sublist]

In [28]:
COMPANIES_DF[COMPANIES_DF['business_id'] == 'K-01370820']

Unnamed: 0,business_id,company_name,company_form_code,location_region_code,company_status_code,industry_code,turnover,net_profit,personnel_average,performer_ranking_points,risk_rating_class
1333143,K-01370820,Leipomo Rosten Oy,company_form_code+CO_16,location_region_code+02,company_status_code+AKT,industry_code+10,turnover+top,net_profit+top,personnel_average+3,performer_ranking_points+1,risk_rating_class+GREEN


## Loading the interaction data

In [19]:
interactions_tmp = pd \
    .read_csv(WORKING_DIRECTORY + 'data/interactions_2021_08_19.csv',
             delimiter='\t',
             dtype={
                 'group_id': 'string',
                 'business_id': 'string',
                 'owner': 'string'
             })

# drop groups that contain only one interaction
group_sizes = interactions_tmp['group_id'].value_counts()
group_sizes_df = pd.DataFrame({'group_id': group_sizes.index, 'group_size': group_sizes.values})

INTERACTIONS_WITH_GROUP_SIZES_DF = interactions_tmp.merge(group_sizes_df, on='group_id')             
interactions_tmp = INTERACTIONS_WITH_GROUP_SIZES_DF[INTERACTIONS_WITH_GROUP_SIZES_DF.group_size >= 2]
#interactions_tmp = INTERACTIONS_WITH_GROUP_SIZES_DF[INTERACTIONS_WITH_GROUP_SIZES_DF.group_size <= 3000]
interactions_tmp.sort_values('group_size')

Unnamed: 0,group_id,business_id,owner,group_size
155696,3e9dd356-2b21-45ae-9ee4-7cd6cc122fe1,07577937,5e87095492119e00066e7158,2
106198,31503959-943a-4081-abcc-dc80e5cb0402,15093748,5db034c64320cd0006d2b788,2
313746,cab22fae-db47-46b6-b902-3d9a1b1051f6,01163004,5e4534bc7bf061000697e940,2
313747,cab22fae-db47-46b6-b902-3d9a1b1051f6,10410900,5e4534bc7bf061000697e940,2
545392,0967d6ed-88b7-4023-a720-f09f7051f24d,17944788,5efdbc656488210007bc27f6,2
...,...,...,...,...
8042,a5c6ce2e-22ab-4871-bd72-e5da294b33cc,16029641,5e1489f3c2f568000654ecbb,3999
8043,a5c6ce2e-22ab-4871-bd72-e5da294b33cc,16030167,5e1489f3c2f568000654ecbb,3999
8044,a5c6ce2e-22ab-4871-bd72-e5da294b33cc,16030415,5e1489f3c2f568000654ecbb,3999
8031,a5c6ce2e-22ab-4871-bd72-e5da294b33cc,16001948,5e1489f3c2f568000654ecbb,3999


In [20]:
# add interactions for the corporate-group ("K-" prefixed) companies
concern_interactions = interactions_tmp.copy()
concern_interactions['business_id'] = 'K-' + concern_interactions['business_id'].astype(str)
concern_interactions = concern_interactions[concern_interactions.business_id.isin(ITEM_IDS)]
concern_interactions

Unnamed: 0,group_id,business_id,owner,group_size
5,c2626398-faac-4ff3-b02d-cdc64b50cdaa,K-01681709,60646431ae18cb00063ed63f,1862
6,c2626398-faac-4ff3-b02d-cdc64b50cdaa,K-15055514,60646431ae18cb00063ed63f,1862
7,c2626398-faac-4ff3-b02d-cdc64b50cdaa,K-01876143,60646431ae18cb00063ed63f,1862
9,c2626398-faac-4ff3-b02d-cdc64b50cdaa,K-05363070,60646431ae18cb00063ed63f,1862
10,c2626398-faac-4ff3-b02d-cdc64b50cdaa,K-01387534,60646431ae18cb00063ed63f,1862
...,...,...,...,...
548074,8b0915ff-a0cb-4520-9160-8d783a6bf308,K-02106319,6110c56241e21e000857ca77,131
548110,8b0915ff-a0cb-4520-9160-8d783a6bf308,K-20333371,6110c56241e21e000857ca77,131
548137,8b0915ff-a0cb-4520-9160-8d783a6bf308,K-07027249,6110c56241e21e000857ca77,131
548162,8b0915ff-a0cb-4520-9160-8d783a6bf308,K-02011774,6110c56241e21e000857ca77,131


In [37]:
# combine the corporate-group interactions with the regular ones and drop interactions whose business ID is not found in the company data
INTERACTIONS_DF = pd.concat([interactions_tmp, concern_interactions])
INTERACTIONS_DF = INTERACTIONS_DF[INTERACTIONS_DF.business_id.isin(ITEM_IDS)]
INTERACTIONS_DF[INTERACTIONS_DF['business_id'] == 'K-02011774']

USER_IDS = list(set(INTERACTIONS_DF['group_id'].values))
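
The two cells above can be illustrated on toy data without pandas. The sketch below uses made-up group IDs and a hypothetical helper name; it only shows the idea of duplicating each interaction under its "K-" ID when such an item exists, and then keeping only interactions whose ID is known.

```python
# Hypothetical toy data: (group_id, business_id) pairs and the known item IDs.
interactions = [('g1', '01370820'), ('g1', '07577937'), ('g2', '01370820')]
item_ids = {'01370820', '07577937', 'K-01370820'}


def with_corporate_group_rows(interactions, item_ids):
    """Add a 'K-' copy of each interaction when that K- ID exists, then keep known IDs only."""
    extra = [(g, 'K-' + b) for g, b in interactions if 'K-' + b in item_ids]
    return [(g, b) for g, b in interactions + extra if b in item_ids]


combined = with_corporate_group_rows(interactions, item_ids)
print(combined)
```

Only `01370820` has a "K-" counterpart in this toy item set, so two extra rows are added, one for each group that interacted with it.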


In [25]:
def print_interactions_meta_data(interactions_df):
    print('groups: {groups}, interactions {interactions}, companies {companies}'
          .format(groups=len(list(interactions_df['group_id'].unique())),
                  interactions=interactions_df.shape[0], 
                  companies=len(list(interactions_df['business_id'].unique()))))

print('----- group_size>=2 -----')
print_interactions_meta_data(INTERACTIONS_DF)

----- group_size>=2 -----
groups: 1312, interactions 598703, companies 143839


## Creating the partitioned datasets for cross-validation

## Creating Dataset objects that LightFM understands

In [49]:
def create_item_features_ds():
    return [(company['business_id'], 
                [company[feature] for feature in SELECTED_COMPANY_FEATURES])
                    for company in COMPANIES_DF.to_dict(orient='records')]


In [50]:
def calculate_gini_for_word(word, train_interactions_df, alpha):
    col_name = word.split('+')[0]
    matches_df = COMPANIES_DF[COMPANIES_DF[col_name] == word]
    
    matched_docs_total = matches_df.shape[0]
    
    match_bids = list(matches_df['business_id'].unique())
    
    matching_interactions_df = train_interactions_df[train_interactions_df['business_id'].isin(match_bids)]
    
    interacted_docs_count = matching_interactions_df['business_id'].unique().shape[0]
    non_interacted_docs_count = matched_docs_total - interacted_docs_count
    
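    # Gini impurity of the interacted / non-interacted split among the companies that carry
    # this feature value: 0 when all (or none) of them have interactions, 0.5 when evenly split
    gini_index = 1 - ((interacted_docs_count / matched_docs_total) ** 2 + \
                    (non_interacted_docs_count / matched_docs_total) ** 2)

    # the feature weight is alpha minus the impurity, so purer feature values get larger weights
    return (word, alpha - gini_index, interacted_docs_count, matched_docs_total)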

def create_gini_weighted_item_features(train_interactions_df, alpha):
    feature_weights = {}

    for word in ITEM_FEATURE_LABELS:
        gini = calculate_gini_for_word(word, train_interactions_df, alpha)
        feature_weights[word] = gini[1]

    return [(company['business_id'], 
                {k: feature_weights[k] for k in [company[feature] for feature in SELECTED_COMPANY_FEATURES]})
                    for company in COMPANIES_DF.to_dict(orient='records')]
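
As a worked example of the weighting, the sketch below recomputes `alpha - gini_index` from plain counts. The helper name `gini_weight` is hypothetical and the counts are made up; the formula is the one used in `calculate_gini_for_word`.

```python
def gini_weight(interacted, total, alpha):
    """Return alpha minus the Gini impurity of the interacted/non-interacted split."""
    p = interacted / total
    gini = 1 - (p ** 2 + (1 - p) ** 2)
    return alpha - gini


# no company with this feature value has interactions: impurity 0
print(gini_weight(0, 100, 1.0))
# half of the companies have interactions: impurity 0.5, the two-class maximum
print(gini_weight(50, 100, 1.0))
# the same split with alpha = 0.5 gives a weight of zero
print(gini_weight(50, 100, 0.5))
```

Because the two-class impurity lies between 0 and 0.5, the weights fall in the range [alpha - 0.5, alpha]. With alpha = 1.0 that is [0.5, 1.0], which is consistent with the values printed for the GINI_10 feature matrix further down. With alpha = 0.5 an evenly split feature value gets weight 0 and effectively drops out of the matrix.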


In [51]:
def create_dataset(train_interactions_df, test_interactions_df, alpha=None):
    dataset = Dataset(user_identity_features=False)

    
    train_interactions = [(interaction['group_id'], interaction['business_id']) 
                for interaction in train_interactions_df.to_dict(orient='records')]

    test_interactions = [(interaction['group_id'], interaction['business_id']) 
            for interaction in test_interactions_df.to_dict(orient='records')]
    
    dataset.fit(users=USER_IDS, items=ITEM_IDS, item_features=ITEM_FEATURE_LABELS)

    (train_interactions_ds, _) = dataset.build_interactions(train_interactions)
    (test_interactions_ds, _) = dataset.build_interactions(test_interactions)
    
    if alpha is None:
        item_features_ds = dataset.build_item_features(create_item_features_ds(), normalize=False)
        print(item_features_ds)
        return (train_interactions_ds, test_interactions_ds, item_features_ds)

    else:
        item_features_ds = dataset.build_item_features(create_gini_weighted_item_features(train_interactions_df, alpha), normalize=False)
        print(item_features_ds)
        return (train_interactions_ds, test_interactions_ds, item_features_ds)

## Evaluating model quality

In [45]:
NUM_THREADS = 12

def run_evaluation_function(model, test_ds, train_ds, evaluation_function, name, item_features=None):    
    #print('Calculating {name} for train dataset...'.format(name=name))
    #train_results = evaluation_function(model, train_ds, item_features=item_features, num_threads=NUM_THREADS)
    #np.savetxt('iteration2-train-results-{}.txt'.format(name), train_results)
    #train_metric = train_results.mean()
    
    print('Calculating {name} for test dataset...'.format(name=name))
    test_results = evaluation_function(model, test_ds, train_ds, item_features=item_features, num_threads=NUM_THREADS)
    np.savetxt('iteration3-test-results-{}.txt'.format(name), test_results)
    test_metric = test_results.mean()
    
    print('{name}: train {train_metric:.2f}, test {test_metric:.2f}'.format(name=name, 
                                                                            train_metric=-1.0, 
                                                                            test_metric=test_metric))
    print('\n')
    return (-1.0, test_metric)

def run_evaluations_for_ds(model, train_ds, test_ds, model_name, item_features=None):
    auc = run_evaluation_function(model, test_ds, train_ds, auc_score, 'AUC_' + model_name, item_features)
    precision = run_evaluation_function(model, test_ds, train_ds, precision_at_k, 'PRECISION_' + model_name, item_features)
    #recall = run_evaluation_function(model, test_ds, train_ds, recall_at_k, 'RECALL_' + model_name, item_features)
    #reciprocal = run_evaluation_function(model, test_ds, train_ds, reciprocal_rank, 'RECIPROCAL_' + model_name, item_features)
    
    return (auc, precision) #, recall, reciprocal)
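
To help read the precision figures, here is a minimal sketch of precision at k for a single group. The function name and the toy ranking are hypothetical; it assumes the usual definition (the share of the top k ranked items that are held-out positives) and k = 10, which is the default in LightFM's `precision_at_k`.

```python
def precision_at_k_for_user(ranked_items, relevant_items, k=10):
    """Fraction of the k highest-ranked items that are known positives."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    return hits / k


# toy ranking of ten items, of which one ('c') is a held-out positive
ranked = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
relevant = {'c', 'z'}
print(precision_at_k_for_user(ranked, relevant))
```

Under these assumptions, a mean test precision of about 0.09 means that slightly fewer than one of the ten top-ranked companies per group is a held-out test interaction, on average.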


In [57]:
def run_evaluations(interactions_df_cv):

    results = {
        'NO_WEIGHTING': [],
        'GINI_05': [],
        'GINI_10': [],
        'GINI_25': [],
        'GINI_50': []
    }
    
    for i in range(0, len(interactions_df_cv)):
        print('Starting iteration {}...'.format(i))
            
        test_interactions_df = interactions_df_cv[i]
        
        # collect every fold except the test fold into a new list
        train_df_tmp = [ds for j,ds in enumerate(interactions_df_cv) if j != i]
        # concatenate the selected folds into the training set
        train_interactions_df = pd.concat(train_df_tmp)

        print('test_interactions', test_interactions_df.shape)
        print('train_interactions', train_interactions_df.shape)
        
        # train and evaluate one model per weighting scheme; alpha=None means no Gini weighting
        for name, alpha in [('NO_WEIGHTING', None), ('GINI_05', 0.5), ('GINI_10', 1.0),
                            ('GINI_25', 2.5), ('GINI_50', 5.0)]:
            (train_interactions_ds, test_interactions_ds, item_features_ds) = create_dataset(train_interactions_df, test_interactions_df, alpha)

            MODEL = LightFM(loss='warp')
            MODEL.fit(train_interactions_ds, item_features=item_features_ds, epochs=5, num_threads=NUM_THREADS, verbose=True)

            results[name].append(run_evaluations_for_ds(MODEL, train_interactions_ds, test_interactions_ds, '{}_{}'.format(name, i), item_features_ds))

    return results

In [47]:
def print_metric_result(result_arr, model_name):
    train_results = [x[0] for x in result_arr]
    test_results = [x[1] for x in result_arr]
    
    print('{name}:\n train mean {train_mean:.2f} ({train_arr})\n test mean {test_mean:.2f} ({test_arr})\n'
          .format(train_mean=statistics.mean(train_results),
                 test_mean=statistics.mean(test_results),
                 train_arr=['%.2f' % x for x in train_results],
                 test_arr=['%.2f' % x for x in test_results],
                 name=model_name))
    

def print_all_results(results):
    for i,metric in enumerate(['AUC', 'PRECISION']): #, 'RECALL', 'RECIPROCAL']):
        print('\n-----{}-----'.format(metric))
        for model_name,result_arr in results.items():
            print_metric_result([res[i] for res in result_arr], model_name)
    

### Results

In [38]:
def create_partitioned_datasets(interactions_df):
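    # carve off 1/5, then 1/4, 1/3 and 1/2 of what remains, which yields five folds of (almost) equal size
    # note: no random_state is passed, so the folds differ from run to run
    (rest, fifth_1) = train_test_split(interactions_df, test_size=0.2)
    (rest, fifth_2) = train_test_split(rest, test_size=0.25)
    (rest, fifth_3) = train_test_split(rest, test_size=0.3333333)
    (fifth_4, fifth_5) = train_test_split(rest, test_size=0.5)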
    
    return [fifth_1, fifth_2, fifth_3, fifth_4, fifth_5]
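
If reproducible folds are wanted, one option is to shuffle once with a fixed seed and deal the rows out round-robin. The sketch below is a standard-library alternative, not the method used in this notebook; the function name and the seed value are assumptions.

```python
import random


def partition_into_folds(rows, n_folds=5, seed=42):
    """Shuffle the rows with a fixed seed and split them into n_folds near-equal folds."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # every n_folds-th element, starting at offset i, goes to fold i
    return [shuffled[i::n_folds] for i in range(n_folds)]


# row indices standing in for the 598703 interactions
folds = partition_into_folds(range(598703))
print([len(fold) for fold in folds])  # [119741, 119741, 119741, 119740, 119740]
```

The resulting index lists could then be used to select rows from the interactions DataFrame, for example with `iloc`.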
    

In [42]:
INTERACTIONS_CV = create_partitioned_datasets(INTERACTIONS_DF)
for cv in INTERACTIONS_CV:
    print(cv.shape)

(119741, 4)
(119741, 4)
(119741, 4)
(119740, 4)
(119740, 4)


In [56]:
RESULTS = run_evaluations(INTERACTIONS_CV)

Starting iteration 0...
test_interactions (119741, 4)
train_interactions (478962, 4)
  (0, 0)	1.0
  (0, 1337868)	0.966211
  (0, 1337933)	0.62889767
  (0, 1337953)	0.8259553
  (0, 1337958)	0.59567994
  (0, 1338048)	0.93935204
  (0, 1338055)	0.9567706
  (0, 1338063)	0.8756648
  (0, 1338069)	0.9274255
  (0, 1338075)	0.92000103
  (1, 1)	1.0
  (1, 1337869)	0.5261126
  (1, 1337933)	0.62889767
  (1, 1337953)	0.8259553
  (1, 1337959)	0.91632456
  (1, 1338048)	0.93935204
  (1, 1338055)	0.9567706
  (1, 1338063)	0.8756648
  (1, 1338069)	0.9274255
  (1, 1338075)	0.92000103
  (2, 2)	1.0
  (2, 1337868)	0.966211
  (2, 1337934)	0.9253642
  (2, 1337953)	0.8259553
  (2, 1337959)	0.91632456
  :	:
  (1337865, 1338048)	0.93935204
  (1337865, 1338055)	0.9567706
  (1337865, 1338063)	0.8756648
  (1337865, 1338069)	0.9274255
  (1337865, 1338075)	0.92000103
  (1337866, 1337866)	1.0
  (1337866, 1337876)	0.571888
  (1337866, 1337933)	0.62889767
  (1337866, 1337953)	0.8259553
  (1337866, 1337970)	0.55351424
  (133

Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:03<00:00,  1.26it/s]


Calculating AUC_GINI_10_0 for test dataset...
AUC_GINI_10_0: train -1.00, test 0.99


Calculating PRECISION_GINI_10_0 for test dataset...
PRECISION_GINI_10_0: train -1.00, test 0.09


Starting iteration 1...
test_interactions (119741, 4)
train_interactions (478962, 4)
  (0, 0)	1.0
  (0, 1337868)	0.96602416
  (0, 1337933)	0.6287351
  (0, 1337953)	0.8259346
  (0, 1337958)	0.5962555
  (0, 1338048)	0.9391766
  (0, 1338055)	0.95661515
  (0, 1338063)	0.87560636
  (0, 1338069)	0.9271499
  (0, 1338075)	0.91978145
  (1, 1)	1.0
  (1, 1337869)	0.5261458
  (1, 1337933)	0.6287351
  (1, 1337953)	0.8259346
  (1, 1337959)	0.91627455
  (1, 1338048)	0.9391766
  (1, 1338055)	0.95661515
  (1, 1338063)	0.87560636
  (1, 1338069)	0.9271499
  (1, 1338075)	0.91978145
  (2, 2)	1.0
  (2, 1337868)	0.96602416
  (2, 1337934)	0.92542636
  (2, 1337953)	0.8259346
  (2, 1337959)	0.91627455
  :	:
  (1337865, 1338048)	0.9391766
  (1337865, 1338055)	0.95661515
  (1337865, 1338063)	0.87560636
  (1337865, 1338069)	0.9271499

Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:03<00:00,  1.28it/s]


Calculating AUC_GINI_10_1 for test dataset...
AUC_GINI_10_1: train -1.00, test 0.99


Calculating PRECISION_GINI_10_1 for test dataset...
PRECISION_GINI_10_1: train -1.00, test 0.09


Starting iteration 2...
test_interactions (119741, 4)
train_interactions (478962, 4)
  (0, 0)	1.0
  (0, 1337868)	0.96600163
  (0, 1337933)	0.6295944
  (0, 1337953)	0.825954
  (0, 1337958)	0.595632
  (0, 1338048)	0.93920237
  (0, 1338055)	0.95651954
  (0, 1338063)	0.875514
  (0, 1338069)	0.9272669
  (0, 1338075)	0.91987234
  (1, 1)	1.0
  (1, 1337869)	0.5262554
  (1, 1337933)	0.6295944
  (1, 1337953)	0.825954
  (1, 1337959)	0.9158871
  (1, 1338048)	0.93920237
  (1, 1338055)	0.95651954
  (1, 1338063)	0.875514
  (1, 1338069)	0.9272669
  (1, 1338075)	0.91987234
  (2, 2)	1.0
  (2, 1337868)	0.96600163
  (2, 1337934)	0.9252087
  (2, 1337953)	0.825954
  (2, 1337959)	0.9158871
  :	:
  (1337865, 1338048)	0.93920237
  (1337865, 1338055)	0.95651954
  (1337865, 1338063)	0.875514
  (1337865, 1338069)	0.9272669
  (133786

Epoch: 100%|██████████████████████████████████████████████████████████████████████████████| 5/5 [00:04<00:00,  1.23it/s]


Calculating AUC_GINI_10_2 for test dataset...


KeyboardInterrupt: 

In [None]:
print_all_results(RESULTS_IDENTITY_OFF)


-----AUC-----
IDENTITY_FEATURES_OFF:
 train mean -1.00 (['-1.00', '-1.00', '-1.00', '-1.00', '-1.00'])
 test mean 0.99 (['0.99', '0.99', '0.99', '0.99', '0.99'])


-----PRECISION-----
IDENTITY_FEATURES_OFF:
 train mean -1.00 (['-1.00', '-1.00', '-1.00', '-1.00', '-1.00'])
 test mean 0.08 (['0.08', '0.09', '0.08', '0.08', '0.08'])



In [None]:
print_all_results(RESULTS_IDENTITY_ON)


-----AUC-----
IDENTITY_FEATURES_ON:
 train mean -1.00 (['-1.00', '-1.00', '-1.00', '-1.00', '-1.00'])
 test mean 0.99 (['0.99', '0.99', '0.99', '0.99', '0.99'])


-----PRECISION-----
IDENTITY_FEATURES_ON:
 train mean -1.00 (['-1.00', '-1.00', '-1.00', '-1.00', '-1.00'])
 test mean 0.09 (['0.09', '0.09', '0.09', '0.09', '0.09'])

