# Product matching task

**Research goal:**  

For each product in the store, find one or more items in a competitor's assortment that are close to it according to a given metric.

**Task:**

Develop a product-matching model that satisfies the required metric:
- develop an algorithm that, for every product in the store, proposes several of the most similar products from the competitor's assortment;
- evaluate the algorithm's quality with the accuracy@5 metric.

  
**Work plan:**

- load and examine the provided data;
- perform the necessary data preprocessing;
- carry out exploratory data analysis;
- run a correlation analysis of the features, draw conclusions about multicollinearity and remove it if necessary;
- prepare the features in a pipeline;
- select the best model and check its quality;
- analyze feature importance and draw conclusions about feature significance;
- write up conclusions and recommendations for each step of the study;
- write up an overall conclusion and recommendations.

**Available data:**

- `base.csv` - an anonymized set of products. Each product is represented by a unique id (0-base, 1-base, 2-base) and a feature vector of dimension 72.
- `train.csv` - the training dataset. Each row is one product with a known unique id (0-query, 1-query, …), a feature vector, and the id of the product from base.csv that is most similar to it (according to experts).
- `validation.csv` - a dataset of products (unique id and feature vector) for which the closest products from base.csv must be found.
- `validation_answer.csv` - the correct answers for the previous file.

In [1]:
# FAISS
# Annoy
# Qdrant

In [2]:
%%capture

# Standard library
import os
import re
import sys
import time
import warnings
from datetime import datetime
from math import ceil

# Upgrade and install the required packages
!"{sys.executable}" -m pip install -U numba
!"{sys.executable}" -m pip install numpy==1.26.4
!"{sys.executable}" -m pip install scipy==1.13.1
!"{sys.executable}" -m pip install pandas==1.4.4
!"{sys.executable}" -m pip install --upgrade scikit-learn
!"{sys.executable}" -m pip install --upgrade matplotlib
!"{sys.executable}" -m pip install --upgrade seaborn
!"{sys.executable}" -m pip install --upgrade jinja2==3.1.4
!"{sys.executable}" -m pip install catboost
!"{sys.executable}" -m pip install missingno
!"{sys.executable}" -m pip install phik
!"{sys.executable}" -m pip install shap
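!"{sys.executable}" -m pip install tqdm
# Assumption: the CPU build of FAISS is enough here; `import faiss` below needs it installed
!"{sys.executable}" -m pip install faiss-cpu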

# Third-party libraries
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import phik
import seaborn as sns
import shap
from IPython.display import display, HTML
from matplotlib.axes._axes import _log as matplotlib_axes_logger
from matplotlib.ticker import MultipleLocator
import missingno as msno
from pandas.plotting import register_matplotlib_converters
from scipy import stats as st

# # scikit-learn
# from sklearn.base import BaseEstimator, TransformerMixin
# from sklearn.compose import ColumnTransformer
# from sklearn.dummy import DummyRegressor
# from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
# from sklearn.experimental import enable_halving_search_cv
# from sklearn.impute import SimpleImputer
# from sklearn.inspection import permutation_importance
# from sklearn.linear_model import LinearRegression, Lasso, Ridge
# from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
# (
#     GridSearchCV, RandomizedSearchCV, HalvingGridSearchCV, train_test_split
# )
# from sklearn.neighbors import KNeighborsRegressor
# from sklearn.pipeline import Pipeline
# from sklearn.preprocessing import (
#     LabelEncoder, MinMaxScaler, OneHotEncoder, OrdinalEncoder, RobustScaler, StandardScaler
# )
# from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import shuffle

# FAISS
import faiss

# Additional libraries
from tqdm import tqdm

# Additional settings: silence matplotlib axes logging and Python warnings
matplotlib_axes_logger.setLevel('ERROR')
warnings.filterwarnings("ignore")
register_matplotlib_converters()

# Fixed visualization settings
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', None)
sns.set(rc={'figure.figsize': (20, 10)})
mpl.rcParams.update({'font.size': 11})
sns.set_style("whitegrid")

In [21]:
def unload_df(file_name, parse_dates=None, sep=None, dec=',', index_col=0):
    """
    Looks for the file on the network share and locally, loads it and returns a pandas DataFrame.
    Can also parse date columns.

    :param file_name: name of the file to load
    :param sep: column separator in the file (for example, ',' for CSV)
    :param dec: decimal separator character (default ',')
    :param parse_dates: list of columns to parse as dates
    :param index_col: column to use as the index (default 0)
    :return: the loaded DataFrame, or None if the file was not found or failed to load
    """
    if parse_dates is None:
        parse_dates = []
    file_path_net = f'/datasets/{file_name}'
    file_path_local = file_name

    try:
        if os.path.exists(file_path_net):
            file_path = file_path_net
        elif os.path.exists(file_path_local):
            file_path = file_path_local
        else:
            print(f'{file_name} was not found anywhere')
            return None

        df = pd.read_csv(file_path, parse_dates=parse_dates, sep=sep, decimal=dec, index_col=index_col)
        location = "the network" if file_path == file_path_net else "local storage"
        print(f'{file_name} successfully loaded from {location}')
        return df
    except Exception as e:
        print(f'An error occurred while loading: {e}')
        return None
    
def look_on(df):
    display(df.head())
    df.info()

In [22]:
base = unload_df('sample/base.csv')
train = unload_df('sample/train.csv')
validation_answer = unload_df('sample/validation_answer.csv')
validation = unload_df('sample/validation.csv')

sample/base.csv successfully loaded from local storage
sample/train.csv successfully loaded from local storage
sample/validation_answer.csv successfully loaded from local storage
sample/validation.csv successfully loaded from local storage


In [23]:
look_on(base)

Id,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71
4207931-base,-43.946243,15.364378,17.515854,-132.31146,157.06442,-4.069252,-340.63086,-57.55014,128.39822,45.090958,-126.84374,4.494522,-99.84231,44.926903,177.52173,-12.29179,38.47036,105.35765,-142.46024,-80.16326,-110.368935,1047.517357,-69.59462,66.31354,84.87387,813.770071,-81.03878,16.162964,-98.24488,159.53406,27.554913,-209.18428,62.05977,-529.295053,114.59833,90.469894,-20.256914,-164.768,-133.31387,-41.25296,-10.251193,8.289038,-131.31271,75.7045,-16.483078,40.771038,-146.09674,-143.40768,49.807987,63.43448,-30.25008,20.470263,78.07991,-128.91531,92.32768,63.88557,-141.17464,142.90259,-93.068596,-568.421584,-90.01869,-129.01567,-71.92717,30.711966,-90.190475,-24.931271,66.972534,106.346634,-44.270622,155.98834,-1074.464888,-25.066608
2710972-base,-73.00489,4.923342,-19.750746,-136.52908,99.90717,-70.70911,-567.401996,-128.89015,109.914986,201.4722,-186.2265,29.896042,-99.770996,0.126302,136.19049,-35.22474,-30.321323,-43.148834,-162.85175,-79.71451,-75.78487,1507.231274,-69.654564,43.640663,-4.779669,813.770071,43.976913,11.924875,-50.228523,166.0082,-59.505333,-115.33252,72.18324,-735.671365,96.3223,85.79636,-22.03033,-147.54501,-108.38295,-45.084892,-15.004004,-1.532826,-46.456585,197.57895,-56.199876,60.29871,-102.65334,-108.967964,58.512012,-9.678028,-85.4483,-68.68608,71.5902,-232.42569,91.706856,63.290657,-137.33595,-47.124687,-148.0574,-543.787056,-160.6516,-133.46222,-109.04466,20.916021,-171.20139,-110.596844,67.7301,8.909615,-9.470253,133.29536,-545.897014,-72.91323
1371460-base,-85.56557,-0.493598,-48.374817,-157.98502,96.80951,-81.71021,-22.297688,79.76867,124.357086,105.71518,-149.80756,-54.50168,-21.037973,-24.88766,128.38864,-58.558483,34.862656,19.784412,-130.9182,-79.03223,-166.63525,1507.231274,-8.495993,61.205086,25.895348,813.770071,-140.76886,20.87279,-123.95757,126.34781,11.713674,-125.025154,152.6859,-1018.469545,-22.4446,73.89764,9.190645,-156.51881,-92.18573,-34.92676,-13.277475,16.026424,-33.853546,119.60452,-52.525341,71.20475,-178.70294,-88.2785,30.501453,16.651737,-88.377014,-55.883583,70.18298,-89.233925,92.00578,76.458725,-131.14087,40.914352,-157.90054,-394.319235,-87.107025,-120.772545,-58.82165,41.369606,-132.9345,-43.016839,67.871925,141.77824,69.04852,111.72038,-1111.038833,-23.087206
3438601-base,-105.56409,15.393871,-46.223934,-158.11488,79.514114,-48.94448,-93.71301,38.581398,123.39796,110.324326,-161.188,-68.51979,-0.60733,38.733696,120.74344,-14.109269,28.868027,-29.85881,-94.30395,-79.33981,-138.98427,1507.231274,-131.88538,70.03136,32.736595,813.770071,-62.37086,13.763219,-31.872276,139.5527,9.836465,-150.22113,80.1402,-537.183707,3.091667,129.69933,-63.429424,-169.02724,-119.77007,-28.637785,-8.315162,2.752385,-160.29382,85.08689,-18.25175,90.374054,1.479935,-121.98305,65.85266,8.355225,34.118896,-57.069756,70.4618,-127.90541,94.31428,71.25994,-135.57787,-39.982346,-159.75156,-230.147648,-95.22116,-148.81409,-87.90729,-58.80687,-147.7948,-155.830237,68.974754,21.39751,126.098785,139.7332,-1282.707248,-74.52794
422798-base,-74.63888,11.315012,-40.204174,-161.7643,50.507114,-80.77556,-640.923467,65.225,122.34494,191.46585,-156.98384,-76.65021,-75.67497,12.624029,145.33752,-35.774258,11.598761,-11.460761,-201.35443,-77.779366,-120.9684,548.736883,19.851685,17.943344,27.06332,813.770071,-85.48378,21.236433,-95.07102,132.61092,13.526038,-160.47684,104.71937,-304.174382,-15.385452,91.418655,-36.474556,-157.43959,-102.83162,-56.78271,-19.969252,-0.598189,-222.22879,33.441666,-56.09211,71.27603,-8.713509,-86.09938,8.488903,-14.959278,86.812996,-29.666779,64.417755,-56.716187,80.90172,69.3969,-137.74811,23.644325,-101.447716,-341.115945,22.303604,-131.1714,-30.002094,53.64293,-149.82323,176.921371,69.47328,-43.39518,-58.947716,133.84064,-1074.464888,-1.164146


<class 'pandas.core.frame.DataFrame'>
Index: 291813 entries, 4207931-base to 274130-base
Data columns (total 72 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   0       291813 non-null  float64
 1   1       291813 non-null  float64
 2   2       291813 non-null  float64
 3   3       291813 non-null  float64
 4   4       291813 non-null  float64
 5   5       291813 non-null  float64
 6   6       291813 non-null  float64
 7   7       291813 non-null  float64
 8   8       291813 non-null  float64
 9   9       291813 non-null  float64
 10  10      291813 non-null  float64
 11  11      291813 non-null  float64
 12  12      291813 non-null  float64
 13  13      291813 non-null  float64
 14  14      291813 non-null  float64
 15  15      291813 non-null  float64
 16  16      291813 non-null  float64
 17  17      291813 non-null  float64
 18  18      291813 non-null  float64
 19  19      291813 non-null  float64
 20  20      291813 non-null  float64
 21 

In [24]:
look_on(train)

Id,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,Target
109249-query,-24.021454,3.122524,-80.947525,-112.329994,191.09018,-66.90313,-759.626065,-75.284454,120.55149,131.1317,-149.21106,-102.31221,21.387623,11.277594,143.2214,-22.01157,-3.618249,-16.00548,-133.38228,-78.89356,-65.69053,407.773575,-11.660624,67.00815,24.975033,813.770071,40.051064,17.933155,-75.435745,149.8172,-23.413877,-178.09557,133.78647,-906.571061,113.35556,83.94226,-16.592659,-146.52074,-120.23786,-27.341612,-8.845615,1.027612,-175.64772,167.73582,-32.931559,47.86096,-196.2475,-118.81005,-4.762772,-114.87768,37.397278,-55.616966,56.627056,-108.43317,87.37256,76.51343,-136.27057,3.652915,-164.57451,-635.284275,-75.647255,-116.67934,-41.234684,-24.60167,-167.76077,133.678516,68.1846,26.317545,11.938202,148.54932,-778.563381,-46.87775,66971-base
34137-query,-82.03358,8.115866,-8.793022,-182.9721,56.645336,-52.59761,-55.720337,130.05925,129.38335,76.20288,-137.79942,33.30165,-2.868191,-34.31877,189.06479,-19.33755,-14.20821,-71.110245,-157.74814,-78.70069,-91.741875,1054.2056,-41.84563,102.12862,72.55905,813.770071,-37.957787,17.598982,-159.9754,140.02528,-8.819328,-147.05518,113.81987,-529.295053,70.67494,55.976795,8.817799,-134.14812,-73.679794,-57.566544,-4.338496,-3.270682,-144.4992,144.6502,-37.903276,58.913525,-105.36284,-125.66783,19.367283,-29.087658,-35.02135,26.627962,55.718437,-110.52611,83.513374,75.92613,-135.68242,-7.429803,-180.64502,11.470171,16.464691,-121.807236,-90.81445,54.448433,-120.894806,-12.292085,66.608116,-27.997612,10.091335,95.809265,-1022.691531,-88.564705,1433819-base
136121-query,-75.71964,-0.223386,-86.18613,-162.06406,114.320114,-53.3946,-117.261013,-24.857851,124.8078,112.190155,-200.92596,-38.86518,-80.61127,14.343805,156.62129,-22.498169,-26.359468,-109.03487,-106.92659,-79.74731,-69.87683,1507.231274,-20.058287,34.334927,23.592144,813.770071,-49.50386,22.1662,-85.74016,134.83647,-69.56985,-139.88724,67.377045,-341.781842,54.161224,81.89166,36.421352,-159.99583,-131.91608,-20.495195,-13.976569,-2.355247,-216.22865,238.83649,-56.611536,43.36664,7.191841,-159.48369,-19.338009,-51.409897,36.81954,32.53688,80.68102,-232.40741,84.05369,59.08618,-139.8595,78.40944,-115.940575,2.426572,7.594826,-126.520134,-73.14896,-5.609123,-93.02988,-80.997871,63.733383,11.378683,62.932007,130.97539,-1074.464888,-74.861176,290133-base
105191-query,-56.58062,5.093593,-46.94311,-149.03912,112.43643,-76.82051,-324.995645,-32.833107,119.47865,120.07479,-61.347084,-28.6706,-102.79018,-36.19432,157.18976,-33.31824,7.448413,-47.230713,-178.04608,-78.78652,-106.23544,1507.231274,-63.414307,38.099255,-89.79535,813.770071,-107.43239,10.052701,-71.91738,147.74005,-18.750763,-143.79562,67.20731,-366.139446,112.1877,78.14481,-41.08541,-132.75719,-89.44503,-19.267069,-14.866466,7.775788,-104.30211,74.622894,-59.875136,76.40647,-77.79702,-92.01658,19.3373,-37.922787,37.27127,111.63957,94.91295,-179.7254,86.60148,62.698364,-122.16293,29.87394,-53.50812,-0.938894,-36.919907,-144.555,-96.79859,21.624313,-158.88037,179.597294,69.89136,-33.804955,233.91461,122.868546,-1074.464888,-93.775375,1270048-base
63983-query,-52.72565,9.027046,-92.82965,-113.11101,134.12497,-42.423073,-759.626065,8.261169,119.49023,172.36536,-186.64139,-84.9438,-92.339966,-30.229528,167.86163,-22.635653,0.014536,-9.796367,-213.1018,-78.59006,-98.7283,1250.423749,-43.892487,86.28845,-1.549826,813.770071,-110.35698,24.055641,-96.57827,156.5823,45.12424,-123.888504,118.03511,-607.946912,52.31141,76.7478,-14.161914,-143.53851,-124.886215,-64.78333,-17.706848,15.446568,-53.554455,174.38162,-23.140892,76.41933,-73.357605,-128.12526,-34.57149,-2.756741,44.027752,-13.445387,62.028725,-99.98626,79.376854,49.96618,-131.30576,-71.27052,-262.39697,-21.395427,-43.73464,-127.42511,-81.566216,13.807772,-208.65004,41.742014,66.52242,41.36293,162.72305,111.26131,-151.162805,-33.83145,168591-base


<class 'pandas.core.frame.DataFrame'>
Index: 9999 entries, 109249-query to 13504-query
Data columns (total 73 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       9999 non-null   float64
 1   1       9999 non-null   float64
 2   2       9999 non-null   float64
 3   3       9999 non-null   float64
 4   4       9999 non-null   float64
 5   5       9999 non-null   float64
 6   6       9999 non-null   float64
 7   7       9999 non-null   float64
 8   8       9999 non-null   float64
 9   9       9999 non-null   float64
 10  10      9999 non-null   float64
 11  11      9999 non-null   float64
 12  12      9999 non-null   float64
 13  13      9999 non-null   float64
 14  14      9999 non-null   float64
 15  15      9999 non-null   float64
 16  16      9999 non-null   float64
 17  17      9999 non-null   float64
 18  18      9999 non-null   float64
 19  19      9999 non-null   float64
 20  20      9999 non-null   float64
 21  21      9999 non-null   

In [25]:
look_on(validation)

Id,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71
196680-query,-59.38342,8.563436,-28.203072,-134.22534,82.73661,-150.57217,-129.178969,23.670555,125.66636,108.809586,-129.48387,-178.98306,-109.600174,-8.799808,172.95998,-20.794373,-30.065893,-14.889741,-213.47429,-81.44286,-92.55872,1507.231274,96.50842,87.97525,55.862797,813.770071,1.647972,16.160482,-77.401474,166.08685,-7.085945,-114.40581,116.56427,-481.586956,40.185913,73.085365,-37.582203,-140.10822,-113.26041,-64.86323,-16.001427,7.223721,-5.791832,154.65631,-34.690983,52.748238,-34.976818,-160.45952,-28.526081,11.436787,107.38664,33.11757,56.67899,-43.842407,95.18327,51.950043,-123.31064,-10.645209,-52.291348,-525.623407,53.718872,-129.38846,-103.48163,79.56453,-120.31357,54.218155,68.50073,32.681908,84.19686,136.41296,-1074.464888,-21.233612
134615-query,-103.91215,9.742726,-15.209915,-116.3731,137.6988,-85.530075,-776.123158,44.48153,114.67121,95.23129,-166.03618,-66.35983,-36.001366,3.264235,73.0693,-29.384926,22.245693,62.49841,-114.18031,-80.017426,-56.034016,914.81209,-23.072426,64.59154,47.07409,813.770071,1.761437,24.459257,-177.63837,157.88023,-15.6488,-174.11716,37.697598,-701.605866,18.38345,81.50202,22.23146,-129.41878,-117.69812,-53.36446,-4.394635,11.10895,-109.88005,102.26328,-47.268603,52.33637,31.617912,13.088348,0.388435,-55.594444,-37.935482,-46.97078,50.4821,-132.51833,88.67881,81.240204,-130.75761,4.710941,-114.01305,-433.616738,-119.45599,-129.18834,-51.19377,49.299644,-101.89454,105.560548,67.80104,13.633057,108.05138,111.864456,-841.022331,-76.56798
82675-query,-117.92328,-3.504554,-64.29939,-155.18713,156.82137,-34.082264,-537.423653,54.078613,121.97396,59.321335,-90.08289,4.986931,-52.51456,52.529945,140.47353,-4.860558,-18.06383,-36.5374,-137.92374,-79.66107,-70.73312,1507.231274,-7.057582,26.21356,-2.779066,813.770071,-69.70441,16.080505,-90.43261,137.94106,24.971474,-138.86641,92.28719,-735.671365,68.33519,78.20822,14.04361,-147.51697,-113.89963,-18.748684,-8.779379,-8.737224,-177.38287,156.10245,-35.756027,65.31769,-262.90784,-96.01807,55.713432,22.165249,151.10054,-24.815138,70.9211,-121.11931,91.86982,87.153366,-138.0755,-3.30969,14.035965,-107.596636,-152.85394,-118.99784,-115.176155,48.63613,-132.17967,-0.988696,68.11125,107.065216,134.61765,134.08,27.773269,-32.401714
162076-query,-90.880554,4.888542,-39.647797,-131.7501,62.36212,-105.59327,-347.132493,-83.35175,133.91331,201.14609,-193.19345,-31.961876,-11.191006,-28.481222,157.13997,-39.51394,-20.431585,30.671173,-131.63226,-79.8416,-74.22269,1507.231274,-75.13584,34.67843,-14.997078,401.379624,-29.014805,17.788988,-87.42479,160.81638,-13.624538,-137.01877,89.403885,-388.662473,-0.446587,73.49353,3.99568,-144.55515,-125.87352,-35.733467,-9.979044,2.092319,-114.457405,158.60924,-58.275016,96.41683,-166.10669,-36.61077,95.94446,-43.66269,33.86911,30.89594,65.87759,-106.50322,94.52601,72.289566,-152.20987,29.090012,-188.34215,-327.117943,43.21247,-139.8522,-112.29379,54.884007,-177.56935,-116.374997,67.88766,136.89398,124.89447,117.70775,-566.34398,-90.905556
23069-query,-66.94674,10.562773,-73.78183,-149.39787,2.93866,-51.288853,-587.189361,-2.764402,126.56105,131.90062,-131.9364,-35.794685,-155.97958,-2.110109,137.72418,-11.544052,-12.95752,2.028175,-129.12962,-79.461266,-72.88312,1507.231274,-47.1126,7.837235,-8.623394,813.770071,-60.251694,11.591301,-82.7948,134.84439,4.764982,-114.47928,83.1504,-156.24989,41.852833,42.16045,-58.56596,-146.39613,-90.59503,-53.295376,-12.213371,-8.682546,-142.69327,71.629135,-57.668621,55.122387,10.182793,-100.19081,-45.052837,-46.877544,10.418076,106.135445,75.15257,-110.0515,83.05377,97.87521,-135.77925,11.59474,-135.25359,-336.361313,19.853195,-126.754234,-116.440605,47.279976,-162.654,107.409409,67.78526,-60.97649,142.68571,82.2643,-345.340457,-48.572525


<class 'pandas.core.frame.DataFrame'>
Index: 10000 entries, 196680-query to 43566-query
Data columns (total 72 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       10000 non-null  float64
 1   1       10000 non-null  float64
 2   2       10000 non-null  float64
 3   3       10000 non-null  float64
 4   4       10000 non-null  float64
 5   5       10000 non-null  float64
 6   6       10000 non-null  float64
 7   7       10000 non-null  float64
 8   8       10000 non-null  float64
 9   9       10000 non-null  float64
 10  10      10000 non-null  float64
 11  11      10000 non-null  float64
 12  12      10000 non-null  float64
 13  13      10000 non-null  float64
 14  14      10000 non-null  float64
 15  15      10000 non-null  float64
 16  16      10000 non-null  float64
 17  17      10000 non-null  float64
 18  18      10000 non-null  float64
 19  19      10000 non-null  float64
 20  20      10000 non-null  float64
 21  21      10000 non-null 

In [26]:
look_on(validation_answer)

Id,Expected
196680-query,1087368-base
134615-query,849674-base
82675-query,4183486-base
162076-query,2879258-base
23069-query,615229-base


<class 'pandas.core.frame.DataFrame'>
Index: 10000 entries, 196680-query to 43566-query
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Expected  10000 non-null  object
dtypes: object(1)
memory usage: 156.2+ KB


In [27]:
train.Target.value_counts()

803095-base     3
804781-base     3
272609-base     3
839040-base     3
79670-base      3
               ..
2222522-base    1
24444-base      1
501422-base     1
826204-base     1
505720-base     1
Name: Target, Length: 9672, dtype: int64


## 2.3 EDA

Even though the data is anonymized, EDA is still useful here. Do all columns follow the same distribution? Are there columns that would be of little use to the model? Are any columns strongly correlated with each other? Would it make sense, as a first step, to feed the model only the most informative features rather than all of them? Are there missing values or exact duplicates, and if so, what should be done with them? Are there anomalies in the distributions? The next important question is whether the data needs scaling. One way to answer it is to measure the metric with and without feature scaling.
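Several of the questions above (missing values, duplicates, uninformative columns, strongly correlated columns) can be checked with a few pandas calls. The sketch below is only an illustration: the helper name `quick_eda`, the 0.9 correlation threshold and the toy DataFrame are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def quick_eda(df: pd.DataFrame, corr_threshold: float = 0.9) -> dict:
    """Collect a few basic data-quality facts about a numeric feature table."""
    report = {
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        # Columns with a single unique value carry no information
        "constant_columns": [c for c in df.columns if df[c].nunique() <= 1],
    }
    # Absolute pairwise Pearson correlations; keep the upper triangle only
    corr = df.corr().abs()
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    report["correlated_pairs"] = [
        (a, b, round(float(v), 3))
        for (a, b), v in pairs.items()
        if v > corr_threshold  # NaN entries compare False and are skipped
    ]
    return report


# Toy data standing in for the real feature table (hypothetical values)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["0", "1", "2"])
toy["3"] = toy["0"] * 2.0  # perfectly correlated with column "0"
toy["4"] = 7.0             # constant column
report = quick_eda(toy)
print(report)
```

On the real data the same helper could be called as `quick_eda(base)`. Computing the full correlation matrix for 72 columns is cheap, while the duplicate check may take noticeably longer on the full table.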

## 2.4 Target metric

Our target metric is accuracy@n. What exactly is it? Recall that

$$
Accuracy = \frac{Correct\ predictions}{All\ predictions}
$$

Picture the metric being computed in a loop over all of the model's answers. Each prediction contains not one answer but n of them. If the correct answer is among the proposed candidates, both the numerator and the denominator increase by 1; if none of the candidates is correct, only the denominator increases by 1. In our task n = 5. A good goal is accuracy@5 ≥ 0.7. Note that accuracy@1 is just ordinary accuracy.
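The loop described above can be sketched in plain Python. The function name `accuracy_at_n` and the toy ids are illustrative assumptions; n is simply the number of candidates passed in for each query.

```python
from typing import Sequence


def accuracy_at_n(targets: Sequence[str], candidates: Sequence[Sequence[str]]) -> float:
    """Share of queries whose true match appears among the proposed candidates."""
    if len(targets) != len(candidates):
        raise ValueError("targets and candidates must have the same length")
    # Each query adds 1 to the denominator; a hit also adds 1 to the numerator
    hits = sum(target in proposed for target, proposed in zip(targets, candidates))
    return hits / len(targets)


# Toy example with n = 2 candidates per query (ids are made up)
toy_targets = ["10-base", "20-base", "30-base"]
toy_candidates = [
    ["10-base", "11-base"],  # hit
    ["21-base", "22-base"],  # miss
    ["31-base", "30-base"],  # hit
]
score = accuracy_at_n(toy_targets, toy_candidates)
print(f"accuracy@2 = {score:.3f}")
```

Two of the three toy queries have their target among the candidates, so the score is 2/3.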

## 📊 Create a FAISS [index](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes) for a small dataset


[Guideline](https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index)

Hint: to load vectors into FAISS, use numpy [ascontiguousarray](https://numpy.org/doc/stable/reference/generated/numpy.ascontiguousarray.html), which returns an array stored in one [unbroken block](https://www.educative.io/answers/what-is-the-numpyascontiguousarray-function-in-python) of memory.
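To make the hint concrete, here is a small NumPy-only sketch of what "contiguous" means. The toy array is an assumption used purely for illustration; FAISS itself is not called here.

```python
import numpy as np

# A strided column slice of a C-ordered array is a non-contiguous view
table = np.arange(12, dtype="float64").reshape(3, 4)
view = table[:, ::2]
print(view.flags["C_CONTIGUOUS"])

# Copy into one unbroken float32 block, the layout and dtype FAISS works with
vectors = np.ascontiguousarray(view, dtype="float32")
print(vectors.flags["C_CONTIGUOUS"], vectors.dtype)
```

Passing `dtype="float32"` directly to `np.ascontiguousarray` does the conversion and the layout fix in a single step.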

In [29]:
dims = base.shape[1]
n_cells = 20
quantizer = faiss.IndexFlatL2(dims)
idx_l2 = faiss.IndexIVFFlat(quantizer, dims, n_cells)

In [30]:
%%time
# Convert once to a contiguous float32 array and reuse it for training and adding
base_vectors = np.ascontiguousarray(base.values, dtype='float32')
idx_l2.train(base_vectors)
idx_l2.add(base_vectors)

CPU times: user 178 ms, sys: 107 ms, total: 285 ms
Wall time: 207 ms


In [31]:
# Map FAISS row positions (0, 1, 2, ...) back to the original base ids
base_index = dict(enumerate(base.index))

## 🔍 Search

In [32]:
targets = train["Target"]
train.drop("Target", axis=1, inplace=True)

In [33]:
%%time
candidate_number = 5
r, idx = idx_l2.search(np.ascontiguousarray(train.values).astype('float32'), candidate_number)

CPU times: user 2.34 s, sys: 17 ms, total: 2.36 s
Wall time: 325 ms


## 📈 Accuracy@candidate_number calculation

In [34]:
hits = 0
for target, neighbors in zip(targets.values.tolist(), idx.tolist()):
    # Map FAISS row positions back to base ids and check for the expected match
    hits += int(target in [base_index[i] for i in neighbors])
print(f'Accuracy @ {candidate_number} = {hits / len(idx):.1%}')

Accuracy @ 5 = 11.5%
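One hedged way to sanity-check this number on a small batch is to compare it with an exact nearest-neighbor search, with and without feature scaling (the question raised in the EDA section). The sketch below uses plain NumPy rather than FAISS. The helper names `exact_top_k` and `standardize` and the toy vectors are assumptions, and whether scaling actually helps on the real data has to be measured.

```python
import numpy as np


def exact_top_k(base_vecs: np.ndarray, query_vecs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k nearest base rows (squared L2) for each query row."""
    # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2, computed for all pairs at once
    d2 = (
        (query_vecs ** 2).sum(axis=1)[:, None]
        - 2.0 * query_vecs @ base_vecs.T
        + (base_vecs ** 2).sum(axis=1)[None, :]
    )
    return np.argsort(d2, axis=1)[:, :k]


def standardize(base_vecs: np.ndarray, query_vecs: np.ndarray):
    """Scale both sets with the per-column mean and std of the base set."""
    mean = base_vecs.mean(axis=0)
    std = base_vecs.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unchanged
    return (base_vecs - mean) / std, (query_vecs - mean) / std


# Toy data: the second column has a much larger scale than the first
toy_base = np.array([[0.0, 0.0], [1.0, 1000.0], [5.0, 10.0]])
toy_query = np.array([[0.9, 100.0]])
raw_order = exact_top_k(toy_base, toy_query, k=3)
scaled_order = exact_top_k(*standardize(toy_base, toy_query), k=3)
print(raw_order, scaled_order)
```

In the toy data the nearest neighbor changes once the large-scale column stops dominating the distance. The helper builds a full queries × base distance matrix, so on the real data it only fits in memory for a small batch of queries.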


## ❓❓❓ What's next?

For the full dataset, it is strongly recommended to test your code on a small batch before loading the whole dataset into FAISS.

You can do your own research:
- change the number of cells
- change the number of candidates
- change the index type
- add other ML models to improve the FAISS result
- change the accelerator (hint: the search method on GPU differs slightly from the corresponding method on CPU)
- ...

Remember that Colab gives you only 12 GB of RAM, so delete variables and objects when necessary.

**Good Luck!**