# Proto1

Questions:

1. Does the model produce sensible recommendations at all?
2. WARP or BPR?
3. What happens if the item features are left out entirely?
4. item_identity_features and user_identity_features: on or off? (On second thought, this probably belongs in proto2.)


## Import the required libraries

In [75]:
from lightfm import LightFM
from lightfm.cross_validation import random_train_test_split
from lightfm.evaluation import auc_score
from lightfm.evaluation import precision_at_k
from lightfm.evaluation import recall_at_k
from lightfm.evaluation import reciprocal_rank
from lightfm.data import Dataset

import numpy as np
import pandas as pd

## Define the company features to use

In [17]:
SELECTED_COMPANY_FEATURES = ['location_region_code', 'industry_code']

## Load the raw company data

In [18]:
COMPANIES_RAW = pd \
        .read_csv('data/prod_data_companies_2021_08_22.csv',
                  delimiter='\t',
                  na_values='(null)',
                  dtype={
                      'business_id': 'string',
                      'company_name': 'string',
                      'company_form': 'string',
                      'company_form_code': 'string',
                      'location_region': 'string',
                      'location_region_code': 'string',
                      'location_municipality': 'string',
                      'location_municipality_code': 'string',
                      'industry_code': 'string',
                      'company_status': 'string',
                      'company_status_code': 'string',
                      'personnel_class': 'string'
                  }
                  )

## Process the company data and pick out the selected features

In [19]:
ITEM_IDS = list(COMPANIES_RAW['business_id'].values)

# the codes should probably be prefixed so they stay unique when more features are added
item_feature_labels_tmp = [COMPANIES_RAW[feature].dropna().unique() for feature in SELECTED_COMPANY_FEATURES]

ITEM_FEATURE_LABELS = [item for sublist in item_feature_labels_tmp for item in sublist]

# Pair each company with its non-missing feature values
ITEM_FEATURES = [(company['business_id'],
                  [company[feature] for feature in SELECTED_COMPANY_FEATURES if pd.notna(company[feature])])
                 for company in COMPANIES_RAW.to_dict(orient='records')]

print(ITEM_FEATURES[0:10])

[('01423486', ['02', '68201']), ('15697971', ['01', '87302']), ('02373820', ['01', '88999']), ('02105471', ['19', '68201']), ('01556668', ['06', '68201']), ('01556668', ['06', '68201']), ('15697971', ['01', '87302']), ('02026351', ['01', '47730']), ('01165149', ['01', '68201']), ('01497530', ['07', '87301'])]
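
The comment in the cell above suggests prefixing the codes so that labels from different columns cannot collide (a region code and an industry code could in principle share the same value). A minimal sketch of that idea on plain dictionaries follows. The `prefixed_features` helper and the `column:value` label format are illustrative assumptions, not something the notebook defines.

```python
SELECTED_COMPANY_FEATURES = ['location_region_code', 'industry_code']

def prefixed_features(company, feature_names):
    """Return labels such as 'industry_code:68201', skipping missing values."""
    return [f"{name}:{company[name]}"
            for name in feature_names
            if company.get(name) is not None]

# Toy records standing in for COMPANIES_RAW.to_dict(orient='records').
# None stands in for a missing value here; with the real data the check
# would be pd.notna(...) instead.
companies = [
    {'business_id': '01423486', 'location_region_code': '02', 'industry_code': '68201'},
    {'business_id': '01591746', 'location_region_code': None, 'industry_code': '47610'},
]

item_features = [(c['business_id'], prefixed_features(c, SELECTED_COMPANY_FEATURES))
                 for c in companies]

# Unique labels in a stable order, suitable for Dataset.fit(item_features=...)
item_feature_labels = sorted({label for _, labels in item_features for label in labels})

print(item_features)
print(item_feature_labels)
```

Building the label list from a set also removes duplicates, which the flattened list in the cell above does not do across columns.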


## Load the interaction data

In [20]:
interactions_tmp = pd \
    .read_csv('data/interactions_2021_08_19.csv',
             delimiter='\t',
             dtype={
                 'group_id': 'string',
                 'business_id': 'string',
                 'owner': 'string'
             })

# Drop interactions whose business ID (y-tunnus) is not found among the items
INTERACTIONS_RAW = interactions_tmp[interactions_tmp.business_id.isin(ITEM_IDS)]

print(interactions_tmp.shape)
print(INTERACTIONS_RAW.shape)

(548198, 3)
(346309, 3)


## Process the interaction data

For now, at least, I treat a group as the "user". My assumption is that a group of companies is the level at which we want to generate recommendations.

In [21]:
USER_IDS = list(INTERACTIONS_RAW['group_id'].unique())

INTERACTIONS = [(interaction['group_id'], interaction['business_id']) 
                for interaction in INTERACTIONS_RAW.to_dict(orient='records')]

print(len(INTERACTIONS))
print(INTERACTIONS[0:10])

346309
[('c2626398-faac-4ff3-b02d-cdc64b50cdaa', '01681709'), ('c2626398-faac-4ff3-b02d-cdc64b50cdaa', '15055514'), ('c2626398-faac-4ff3-b02d-cdc64b50cdaa', '01876143'), ('c2626398-faac-4ff3-b02d-cdc64b50cdaa', '01863991'), ('c2626398-faac-4ff3-b02d-cdc64b50cdaa', '05363070'), ('c2626398-faac-4ff3-b02d-cdc64b50cdaa', '01387534'), ('c2626398-faac-4ff3-b02d-cdc64b50cdaa', '01372818'), ('c2626398-faac-4ff3-b02d-cdc64b50cdaa', '18348689'), ('c2626398-faac-4ff3-b02d-cdc64b50cdaa', '01421229'), ('c2626398-faac-4ff3-b02d-cdc64b50cdaa', '01446661')]


## Create a Dataset object that LightFM understands

Questions:
1. Are the results better with identity features on or off?
2. Is it better with or without normalization?

In [22]:
DATASET = Dataset(user_identity_features=False, item_identity_features=False)

# there are no user features, at least not yet
DATASET.fit(users=USER_IDS, items=ITEM_IDS, item_features=ITEM_FEATURE_LABELS)

ITEM_FEATURES_DS = DATASET.build_item_features(ITEM_FEATURES, normalize=False)

(INTERACTIONS_DS, WEIGHTS_DS) = DATASET.build_interactions(INTERACTIONS)

USER_MAP_DS = DATASET.mapping()[0]
ITEM_MAP_DS = DATASET.mapping()[2]
ITEM_FEATURE_MAP_DS = DATASET.mapping()[3]

# print(ITEM_FEATURE_MAP_DS)
print(USER_MAP_DS)

{'c2626398-faac-4ff3-b02d-cdc64b50cdaa': 0, 'f0977af3-6923-4dbe-b553-1736c30803f6': 1, '19c2f8ca-916e-4a23-87dc-a185e1ca672d': 2, 'd87579d9-a2fb-48e7-adbe-3082633a93fb': 3, '2e04eb78-9845-49dd-8f58-31bbdc63340b': 4, '77a00cc6-9077-4339-b4c1-305dc88a9fc0': 5, '65ef0910-d1a7-458d-86c0-159d2b190e5b': 6, '68035d16-77f7-4767-bea7-472ea391973a': 7, 'a5c6ce2e-22ab-4871-bd72-e5da294b33cc': 8, 'e379a73d-ff66-4143-9603-f37f083a4b24': 9, '4b364551-1533-4758-b909-c5b95f3eabfd': 10, '00d04fee-2558-4354-9058-68c66e4bf0bc': 11, '286954d8-e3d8-4b27-b3a1-456d979d5f2a': 12, '8fdff56a-55e9-43a8-8b3a-8f7c15214d7f': 13, '22492a19-87ba-4da8-8d16-aa8fb8bb86ca': 14, '3b2c2dd7-ca83-41dd-8e33-204041db7851': 15, '3d1a4452-7594-4691-b651-34a82e5e92aa': 16, '86c8920e-3e18-4a3d-96d8-a15e74aec10b': 17, '7b303b54-74bc-49db-9308-981c52c2eb42': 18, 'fb0814c4-59e4-44d3-8b61-b145dd851366': 19, '37ba7e7f-a7e8-409a-ae45-faf419beade3': 20, '6c894b42-cbc6-4d18-8cf1-39ee91d2bf53': 21, 'd2f0e1f1-441a-422c-9a96-a9cae80627f8': 2
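
To make the two questions above concrete, here is a toy sketch of what a single item's feature row looks like under the different settings. It reflects my understanding of the LightFM documentation (identity features add one indicator per item, and normalization makes each row's weights sum to 1). The `feature_row` helper and the `item:` label prefix are illustrative assumptions, not LightFM's internal representation.

```python
def feature_row(item_id, labels, identity=False, normalize=False):
    """Weights one item would get in the item-feature matrix, as a dict."""
    row = {label: 1.0 for label in labels}
    if identity:
        # An extra feature unique to this item, so it can get its own embedding.
        row[f"item:{item_id}"] = 1.0
    if normalize and row:
        total = sum(row.values())
        row = {label: weight / total for label, weight in row.items()}
    return row

# Settings used in this notebook: no identity feature, no normalization.
print(feature_row('01423486', ['02', '68201']))

# Alternative: identity feature on, rows normalized to sum to 1.
print(feature_row('01423486', ['02', '68201'], identity=True, normalize=True))
```

One consequence of turning identity features off: two companies with the same region and industry code get exactly the same row, so the model cannot tell them apart and will score them identically.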

## Create the LightFM model

Questions:
1. What are suitable hyperparameters for the model?
2. In particular, which loss function is best?
3. What are the best hyperparameters for fitting the model (for example, the number of epochs)?

In [23]:
MODEL = LightFM(loss='warp')

MODEL.fit(INTERACTIONS_DS, item_features=ITEM_FEATURES_DS, epochs=10, verbose=True)

Epoch: 100%|██████████| 1/1 [00:00<00:00,  3.10it/s]


<lightfm.lightfm.LightFM at 0x7f8c0a964df0>
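
One way to approach the hyperparameter questions above is a small random search. The sketch below is only that, a sketch: the search ranges are guesses, the helper names are hypothetical, and the `validation` matrix is assumed to be a held-out split that is separate from the final test set. The LightFM import is kept inside the function so that the sampler can be tried on its own.

```python
import random

def sample_hyperparameters(rng):
    """Draw one candidate configuration (ranges are illustrative guesses)."""
    return {
        "loss": rng.choice(["warp", "bpr"]),
        "no_components": rng.choice([16, 32, 64]),
        "learning_rate": 10 ** rng.uniform(-3, -1),
        "item_alpha": 10 ** rng.uniform(-7, -4),
        "epochs": rng.randint(5, 30),
    }

def random_search(train, validation, item_features, n_trials=10, seed=0):
    """Fit n_trials models and return (best mean AUC, best configuration)."""
    from lightfm import LightFM
    from lightfm.evaluation import auc_score

    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        config = sample_hyperparameters(rng)
        # epochs belongs to fit(), everything else to the constructor
        model_params = {k: v for k, v in config.items() if k != "epochs"}
        model = LightFM(random_state=seed, **model_params)
        model.fit(train, item_features=item_features, epochs=config["epochs"])
        score = auc_score(model, validation, item_features=item_features).mean()
        if best is None or score > best[0]:
            best = (float(score), config)
    return best

print(sample_hyperparameters(random.Random(0)))
```

With the notebook's variables this could be called as `random_search(TRAIN, VALIDATION, ITEM_FEATURES_DS)`, where `VALIDATION` would have to be carved out of the interactions first; AUC as the selection metric is also just one possible choice.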

## Print example recommendations

In [24]:
USER_ID = 'fe368d6c-ef4d-46c9-ac94-a22eff3ac1f3'

known_positives = [interaction[1] for interaction in INTERACTIONS if interaction[0] == USER_ID]
known_positives_df = COMPANIES_RAW.loc[COMPANIES_RAW['business_id'].isin(known_positives)]

print("User %s:" % USER_ID)
print("Known positives:")
print(known_positives_df[['business_id', 'company_name', 'industry_code', 'location_region_code']].head(10))
    
print("\n\n")

user_id_ds = USER_MAP_DS[USER_ID]
scores = MODEL.predict(user_id_ds, list(ITEM_MAP_DS.values()), item_features=ITEM_FEATURES_DS)

scores_df = pd.DataFrame.from_records(zip(list(ITEM_MAP_DS.keys()), scores), columns=['business_id', 'scores'])
merged_scores_df = pd.merge(scores_df, COMPANIES_RAW, on='business_id')

print("Top hits:")
print(merged_scores_df[['business_id', 'company_name', 'industry_code', 'location_region_code', 'scores']]
      .sort_values('scores', axis=0, ascending=False).head(10))

User fe368d6c-ef4d-46c9-ac94-a22eff3ac1f3:
Known positives:
     business_id                         company_name industry_code  \
198     01591746               Vuoksen Kirjakauppa Oy         47610   
244     11068254                       Rasnel-Hold Oy         68209   
363     09304537              Parikkalan Autorahti Oy         49410   
395     22370715                         Ice Power Oy         43220   
568     17588365           Rakennusvalvonta Toikka Oy         71126   
753     06961267                       Lepojoutsen Oy         55101   
1073    25890126                Saimaan Työpalvelu Oy         78200   
1172    24869536                              JORO OY         49410   
1199    19031093  Asianajotoimisto Heikki Oikkonen Oy         69101   
1484    21289202                  PTu Konsultointi Oy         95110   

     location_region_code  
198                  <NA>  
244                  <NA>  
363                    09  
395                    09  
568               

## Evaluate model quality

### Split the interactions into training and test sets for evaluating recommendation quality

In [62]:
(TRAIN, TEST) = random_train_test_split(INTERACTIONS_DS, test_percentage=0.2)

print(TRAIN.sum())
print(TEST.sum())

277047.0
69262.0


### Train the models

WARP and BPR

In [90]:
MODEL_WARP = LightFM(loss='warp')
MODEL_WARP.fit(TRAIN, item_features=ITEM_FEATURES_DS, epochs=10, verbose=True)

MODEL_BPR = LightFM(loss='bpr')
MODEL_BPR.fit(TRAIN, item_features=ITEM_FEATURES_DS, epochs=10, verbose=True)

Epoch: 100%|██████████| 10/10 [00:02<00:00,  3.71it/s]
Epoch: 100%|██████████| 10/10 [00:02<00:00,  3.82it/s]


<lightfm.lightfm.LightFM at 0x7f8c0c083b80>

#### Compute ROC AUC

In [91]:
print("Calculating AUC for WARP...")

WARP_TRAIN_AUC = auc_score(MODEL_WARP, TRAIN, item_features=ITEM_FEATURES_DS).mean()
WARP_TEST_AUC = auc_score(MODEL_WARP, TEST, item_features=ITEM_FEATURES_DS).mean()

print("Calculating AUC for BPR...")

BPR_TRAIN_AUC = auc_score(MODEL_BPR, TRAIN, item_features=ITEM_FEATURES_DS).mean()
BPR_TEST_AUC = auc_score(MODEL_BPR, TEST, item_features=ITEM_FEATURES_DS).mean()

print('AUC_WARP: train %.2f, test %.2f.' % (WARP_TRAIN_AUC, WARP_TEST_AUC))
print('AUC_BPR: train %.2f, test %.2f.' % (BPR_TRAIN_AUC, BPR_TEST_AUC))

Calculating AUC for WARP...
Calculating AUC for BPR...
AUC_WARP: train 0.95, test 0.92.
AUC_BPR: train 0.93, test 0.89.
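
As a reminder of what this number means: per user, AUC can be read as the probability that a randomly chosen positive item is scored above a randomly chosen negative one. The toy function below computes it for a single user by brute force; `user_auc` and the example scores are made up for illustration and are not how LightFM computes it internally.

```python
def user_auc(scores, positives):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for i, s in enumerate(scores) if i in positives]
    neg = [s for i, s in enumerate(scores) if i not in positives]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Five items, items 0 and 3 are the user's positives.
scores = [0.9, 0.1, 0.8, 0.3, 0.2]
print(user_auc(scores, positives={0, 3}))  # 5 of 6 pairs are ordered correctly
```

Because every positive is compared against all negatives, AUC can be high even when few positives reach the very top of the list, which is worth keeping in mind when comparing it with the top-k metrics.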


#### Compute precision_at_k (k=10)

In [96]:
print("Calculating precision at k=10 for WARP...")

WARP_TRAIN_PRECISION = precision_at_k(MODEL_WARP, TRAIN, k=10, item_features=ITEM_FEATURES_DS).mean()
WARP_TEST_PRECISION = precision_at_k(MODEL_WARP, TEST, k=10, item_features=ITEM_FEATURES_DS).mean()

print("Calculating precision at k=10 for BPR...")

BPR_TRAIN_PRECISION = precision_at_k(MODEL_BPR, TRAIN, k=10, item_features=ITEM_FEATURES_DS).mean()
BPR_TEST_PRECISION = precision_at_k(MODEL_BPR, TEST, k=10, item_features=ITEM_FEATURES_DS).mean()

print('PRECISION_WARP: train %.2f, test %.2f.' % (WARP_TRAIN_PRECISION, WARP_TEST_PRECISION))
print('PRECISION_BPR: train %.2f, test %.2f.' % (BPR_TRAIN_PRECISION, BPR_TEST_PRECISION))

Calculating precision at k=10 for WARP...
Calculating precision at k=10 for BPR...
PRECISION_WARP: train 0.11, test 0.03.
PRECISION_BPR: train 0.11, test 0.03.
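
One thing that may depress the test figures: the cell above does not pass `train_interactions` to the evaluation function, so companies the group already had in the training data can occupy the top 10 and count as misses against the test set. The toy example below shows the effect on a single user; `toy_precision_at_k` and its data are illustrative assumptions only.

```python
def toy_precision_at_k(ranked_items, positives, k=10, exclude=()):
    """Fraction of the first k ranked items (after dropping `exclude`) that are positives."""
    excluded = set(exclude)
    top_k = [item for item in ranked_items if item not in excluded][:k]
    return sum(item in positives for item in top_k) / k

ranked = ['a', 'b', 'c', 'd', 'e', 'f']   # model's ranking, best first
train_positives = {'a', 'b'}              # already known from training
test_positives = {'c'}                    # held out

# Training positives fill the top 2, so the test item is never seen.
print(toy_precision_at_k(ranked, test_positives, k=2))
# With training positives removed from the ranking, 'c' moves into the top 2.
print(toy_precision_at_k(ranked, test_positives, k=2, exclude=train_positives))
```

Whether this matters here would have to be checked by rerunning the evaluation with the training matrix passed in.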


#### Compute recall_at_k (k=10)

In [93]:
print("Calculating recall at k=10 for WARP...")

WARP_TRAIN_RECALL = recall_at_k(MODEL_WARP, TRAIN, k=10, item_features=ITEM_FEATURES_DS).mean()
WARP_TEST_RECALL = recall_at_k(MODEL_WARP, TEST, k=10, item_features=ITEM_FEATURES_DS).mean()

print("Calculating recall at k=10 for BPR...")

BPR_TRAIN_RECALL = recall_at_k(MODEL_BPR, TRAIN, k=10, item_features=ITEM_FEATURES_DS).mean()
BPR_TEST_RECALL = recall_at_k(MODEL_BPR, TEST, k=10, item_features=ITEM_FEATURES_DS).mean()

print('RECALL_WARP: train %.2f, test %.2f.' % (WARP_TRAIN_RECALL, WARP_TEST_RECALL))
print('RECALL_BPR: train %.2f, test %.2f.' % (BPR_TRAIN_RECALL, BPR_TEST_RECALL))

Calculating recall at k=10 for WARP...
Calculating recall at k=10 for BPR...
RECALL_WARP: train 0.02, test 0.02.
RECALL_BPR: train 0.02, test 0.02.
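
Recall at k has a ceiling that depends on how many positives a user has: at most k of them fit into the top k. The sketch below spells this out. The helper names and the group size of 500 are hypothetical and only there to show the arithmetic; the actual group sizes in the data are not examined here.

```python
def toy_recall_at_k(ranked_items, positives, k=10):
    """Share of a user's positives that appear in the first k ranked items."""
    hits = sum(item in positives for item in ranked_items[:k])
    return hits / len(positives)

def max_recall_at_k(n_positives, k=10):
    """Best achievable recall at k for a user with n_positives positives."""
    return min(k, n_positives) / n_positives

print(toy_recall_at_k(['a', 'b', 'c', 'd'], positives={'a', 'd', 'x', 'y'}, k=2))
print(max_recall_at_k(500, k=10))  # a hypothetical group of 500 companies
```

So if groups tend to be large, low recall at k=10 does not by itself say much about the model.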


#### Compute reciprocal_rank

In [94]:
print("Calculating reciprocal_rank for WARP...")

WARP_TRAIN_RECIPROCAL = reciprocal_rank(MODEL_WARP, TRAIN, item_features=ITEM_FEATURES_DS).mean()
WARP_TEST_RECIPROCAL = reciprocal_rank(MODEL_WARP, TEST, item_features=ITEM_FEATURES_DS).mean()

print("Calculating reciprocal_rank for BPR...")

BPR_TRAIN_RECIPROCAL = reciprocal_rank(MODEL_BPR, TRAIN, item_features=ITEM_FEATURES_DS).mean()
BPR_TEST_RECIPROCAL = reciprocal_rank(MODEL_BPR, TEST, item_features=ITEM_FEATURES_DS).mean()

print('RECIPROCAL_WARP: train %.2f, test %.2f.' % (WARP_TRAIN_RECIPROCAL, WARP_TEST_RECIPROCAL))
print('RECIPROCAL_BPR: train %.2f, test %.2f.' % (BPR_TRAIN_RECIPROCAL, BPR_TEST_RECIPROCAL))

Calculating reciprocal_rank for WARP...
Calculating reciprocal_rank for BPR...
RECIPROCAL_WARP: train 0.22, test 0.10.
RECIPROCAL_BPR: train 0.21, test 0.09.
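
For reference, the reciprocal rank of one user is 1 divided by the position of their best-ranked positive item. A toy version, with a made-up helper name and made-up data:

```python
def toy_reciprocal_rank(ranked_items, positives):
    """1 / rank of the first positive in the ranking, or 0.0 if there is none."""
    for rank, item in enumerate(ranked_items, start=1):
        if item in positives:
            return 1.0 / rank
    return 0.0

ranked = ['a', 'b', 'c', 'd', 'e']
print(toy_reciprocal_rank(ranked, positives={'c', 'e'}))  # first hit at rank 3
```

A single user whose first hit sits at rank 10 would score 0.10. The reported value is a mean over users, though, so it mixes many different ranks and should not be read as "the first hit is typically at rank 10".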


## Lessons from the prototype

- The AUC (about 0.9 on the test set) suggests that the model may be able to infer something after all?
- WARP looks slightly better than BPR on AUC (test 0.92 vs. 0.89) and reciprocal rank (test 0.10 vs. 0.09), while precision and recall at k=10 are identical?
- The item features do next to nothing, because they do not adequately describe what connects the companies within a group?
    - EITHER the selected features are too generic / not enough to capture the factors that the companies in a group have in common
    - OR the groups have no internal logic that could be captured with features (or even with recommendations)
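
The claim about item features (and question 3 at the top) could be checked directly by fitting the same model with and without them. Below is a rough sketch under stated assumptions: the helper names are hypothetical, AUC is used as the only metric, and the scoring function is passed in as a parameter so that the comparison logic can be tried without LightFM installed.

```python
def lightfm_fit_and_score(train, test, item_features):
    """Fit a WARP model and return its mean test AUC (settings mirror this notebook)."""
    from lightfm import LightFM
    from lightfm.evaluation import auc_score

    model = LightFM(loss='warp', random_state=0)
    model.fit(train, item_features=item_features, epochs=10)
    return float(auc_score(model, test, item_features=item_features).mean())

def feature_ablation(fit_and_score, train, test, item_features):
    """Compare a model that uses item features with one that gets none."""
    with_features = fit_and_score(train, test, item_features)
    # With item_features=None LightFM falls back to one embedding per item,
    # i.e. plain collaborative filtering.
    without_features = fit_and_score(train, test, None)
    return {
        'with_features': with_features,
        'without_features': without_features,
        'delta': with_features - without_features,
    }
```

With the notebook's variables the call would be `feature_ablation(lightfm_fit_and_score, TRAIN, TEST, ITEM_FEATURES_DS)`. A delta close to zero or negative would support the first lesson above, but a single random split is weak evidence either way.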