<h1><b>Problem definition</b></h1>

The Otto Group is one of the world's largest e-commerce companies, with subsidiaries in more than 20 countries, including Crate & Barrel (USA), Otto.de (Germany) and 3 Suisses (France). The company sells millions of products worldwide every day, and thousands of products are added to its catalog.

A consistent analysis of the performance of its products is crucial. However, because of the global diversity of the company's products, many identical products end up classified differently. The quality of the company's product analysis therefore depends heavily on the ability to classify similar products accurately. The better the classification, the more insights we can generate about our product range.

<h2><b>Libraries used</b></h2>
* matplotlib
* seaborn
* numpy
* pandas
* imblearn
    * over_sampling
* sklearn
    * preprocessing
    * model_selection
    * ensemble
    * metrics
* sys
* math
* xgboost

In [None]:
# coding: utf-8

# for data analysis and plotting
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

print('Importing sklearn wrapper for xgboost... ', end='')
# Wrapper xgboost -> sklearn
import sys
import math

sys.path.append('xgboost/wrapper/')
import xgboost as xgb

<h1><b>XGBoost classifier</b></h1>

In [None]:
class XGBoostClassifier:
    """Minimal sklearn-style wrapper around xgb.train for multiclass problems."""

    def __init__(self, num_boost_round=10, **params):
        self.clf = None
        self.num_boost_round = num_boost_round
        self.params = params
        # Always predict per-class probabilities
        self.params.update({'objective': 'multi:softprob'})

    def fit(self, X, y, num_boost_round=None):
        num_boost_round = num_boost_round or self.num_boost_round
        # Map the sorted class labels to the integers 0..n-1 used as xgboost labels
        self.label2num = {label: i for i, label in enumerate(sorted(set(y)))}
        dtrain = xgb.DMatrix(X, label=[self.label2num[label] for label in y])
        self.clf = xgb.train(params=self.params, dtrain=dtrain, num_boost_round=num_boost_round)

    def predict(self, X):
        # Return the label with the highest predicted probability
        num2label = {i: label for label, i in self.label2num.items()}
        Y = self.predict_proba(X)
        y = np.argmax(Y, axis=1)
        return np.array([num2label[i] for i in y])

    def predict_proba(self, X):
        dtest = xgb.DMatrix(X)
        return self.clf.predict(dtest)

    def score(self, X, y):
        # Inverse log loss, so that a higher score is better
        Y = self.predict_proba(X)
        return 1 / logloss(y, Y)

    def get_params(self, deep=True):
        return self.params

    def set_params(self, **params):
        if 'num_boost_round' in params:
            self.num_boost_round = params.pop('num_boost_round')
        # The objective stays fixed to 'multi:softprob'
        if 'objective' in params:
            del params['objective']
        self.params.update(params)
        return self


def logloss(y_true, Y_pred):
    """Multiclass log loss; a zero probability for the true class gives an infinite loss."""
    label2num = {name: i for i, name in enumerate(sorted(set(y_true)))}
    total = sum(
        math.log(y[label2num[label]]) if y[label2num[label]] > 0 else -np.inf
        for y, label in zip(Y_pred, y_true)
    )
    return -total / len(Y_pred)

print('Done.')
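
The `logloss` helper above returns an infinite loss as soon as one true class receives probability zero. A common way to keep the metric finite is to clip the probabilities first. The following is only a sketch of that idea: the function name `clipped_logloss`, the `eps` value and the explicit `labels` argument are assumptions made for illustration and are not part of the original notebook.

```python
import math

def clipped_logloss(y_true, Y_pred, labels, eps=1e-15):
    """Multiclass log loss with probabilities clipped to [eps, 1 - eps].

    `labels` gives the class order of the columns of Y_pred (an assumption;
    the notebook infers the order from the labels present in y_true).
    """
    index = {label: i for i, label in enumerate(labels)}
    total = 0.0
    for row, label in zip(Y_pred, y_true):
        # Clip the probability of the true class so that log() stays finite
        p = min(max(row[index[label]], eps), 1 - eps)
        total += math.log(p)
    return -total / len(y_true)

# Toy example with two classes and three samples
labels = ['Class_1', 'Class_2']
y_true = ['Class_1', 'Class_2', 'Class_1']
Y_pred = [[0.9, 0.1], [0.2, 0.8], [0.0, 1.0]]  # last row is confidently wrong
loss = clipped_logloss(y_true, Y_pred, labels)
print(loss)  # finite, although one true-class probability is 0
```

Passing the label order explicitly also avoids a misaligned column index when a class happens to be missing from `y_true`.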

<h3><b>Reading the data</b></h3>

In [None]:
# Read the data
print('Reading training data... ', end='')
data = pd.read_csv("../input/train.csv")
data['id'] = data['id'].astype(str)

data_target = data['target']
data_features = data.drop(['target', 'id'], axis=1)
data_columns = data_features.columns
print('Done.')

<h1><b>Data analysis</b></h1>

In [None]:
# Create the figure first so that figsize applies to the count plot
plt.figure(num=None, figsize=(20, 30), dpi=80, facecolor='w', edgecolor='k')
order = sorted(set(data['target']))
sns.countplot(x='target', data=data, order=order)
plt.grid()
plt.title("Number of Products in Each Class")
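
The count plot above shows how many products fall into each class. The same check can be done numerically. The sketch below uses only the standard library and a made-up list of labels (in the notebook the labels would come from `data['target']`), and summarises the imbalance as the ratio between the largest and the smallest class. The helper name `class_imbalance` is hypothetical.

```python
from collections import Counter

def class_imbalance(labels):
    """Return per-class counts and the ratio largest/smallest class size."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return dict(sorted(counts.items())), ratio

# Hypothetical labels, for illustration only
toy_labels = ['Class_1', 'Class_2', 'Class_2', 'Class_2', 'Class_3', 'Class_3']
counts, ratio = class_imbalance(toy_labels)
print(counts)  # {'Class_1': 1, 'Class_2': 3, 'Class_3': 2}
print(ratio)   # 3.0
```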

In [None]:
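# Sum each feature over all training rows, leaving out the non-numeric 'target' and 'id' columns
wt = data.drop(['target', 'id'], axis=1).sum()
wt.sort_values().plot(kind='barh', figsize=(15,20))
plt.grid()
plt.title("Weight Of Features")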

<h2><b>Normalization and balancing of the data</b></h2>

In [None]:
# Normalize the training data
print('Normalizing training data... ', end='')
scaler = StandardScaler()
scaler.fit(data_features, data_target)
scaled_data = scaler.transform(data_features)

data_features = pd.DataFrame(scaled_data, columns=data_columns)
print('Done.')

# Rebuild the training table (scaled features plus the target column)
# Note: no resampling (e.g. SMOTE) is applied in this cell, despite the variable name
print('Building new training data table... ', end='')
data_rebuilt = np.column_stack((data_features, data_target))
data_columns_2 = np.asarray(data_columns.tolist() + ['target'])
train_scaled_balanced = pd.DataFrame(data_rebuilt, columns=data_columns_2)
print('Done.')

# Test data
print('Reading test data... ', end='')
test = pd.read_csv("../input/test.csv")

test_features = test.drop(['id'], axis=1)
test_id = test['id']
test_columns = test.columns
print('Done.')

# Normalize the test data with the scaler fitted on the training data
print('Normalizing test data... ', end='')
scaled_test_features = scaler.transform(test_features)
print('Done.')

# Rebuild the test table (id column plus the scaled features)
print('Building new test data table... ', end='')
data_rebuilt = np.column_stack((test_id, scaled_test_features))
test_scaled = pd.DataFrame(data_rebuilt, columns=test_columns)
test_scaled['id'] = test_scaled['id'].astype(int)
print('Done.')

# Write the output files
train_scaled_balanced.to_csv('train_sb.csv', index=False)
test_scaled.to_csv('test_s.csv', index=False)
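
The cell above fits the scaler on the training features and then reuses it to transform the test features. The sketch below illustrates that idea in plain Python for a single feature column: the mean and standard deviation come from the training values only, and the test values are transformed with those same statistics. The helper names and the toy numbers are assumptions made for illustration. The sketch uses the population standard deviation and falls back to 1.0 for a constant column to avoid dividing by zero.

```python
import statistics

def fit_standardizer(train_values):
    """Compute the mean and population standard deviation of the training values."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, (std if std > 0 else 1.0)  # constant column: avoid division by zero

def standardize(values, mean, std):
    """Apply z = (x - mean) / std using statistics computed elsewhere."""
    return [(x - mean) / std for x in values]

train_col = [0.0, 2.0, 4.0, 6.0]  # toy training feature
test_col = [3.0, 9.0]             # toy test feature

mean, std = fit_standardizer(train_col)
train_col_scaled = standardize(train_col, mean, std)
test_col_scaled = standardize(test_col, mean, std)  # reuses the training statistics
print(mean, std)
print(test_col_scaled)
```

Computing the statistics on the training data alone keeps information about the test set out of the preprocessing step.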

-------------------------------------------------------------Everything below this line is old-------------------------------------------------------------

<h1><b>Analyze by describing data</b></h1>

<h2><b>Which features are available in the dataset?</b></h2>

In [None]:
print(train_df.columns.values)

In [None]:
# preview the data
train_df.head()

In [None]:
train_df.tail()

In [None]:
train_df.info()
print('_'*40)
test_df.info()

In [None]:
train_df.describe()

In [None]:
train_df.describe(include=['O'])

In [None]:
# Create the figure first so that figsize applies to the count plot
plt.figure(num=None, figsize=(20, 30), dpi=80, facecolor='w', edgecolor='k')
order = sorted(set(train_df['target']))
sns.countplot(x='target', data=train_df, order=order)
plt.grid()
plt.title("Number of Products in Each Class")

In [None]:
cls1 = train_df[train_df.target=='Class_1']
wt = cls1.sum()
wt.drop(['target','id']).sort_values().plot(kind='barh', figsize=(15,20))
plt.grid()
plt.title("Weight Of Features in Class_1")

In [None]:
cls1 = train_df[train_df.target=='Class_2']
wt = cls1.sum()
wt.drop(['target','id']).sort_values().plot(kind='barh', figsize=(15,20))
plt.grid()
plt.title("Weight Of Features in Class_2")

In [None]:
cls1 = train_df[train_df.target=='Class_3']
wt = cls1.sum()
wt.drop(['target','id']).sort_values().plot(kind='barh', figsize=(15,20))
plt.grid()
plt.title("Weight Of Features in Class_3")

In [None]:
cls1 = train_df[train_df.target=='Class_4']
wt = cls1.sum()
wt.drop(['target','id']).sort_values().plot(kind='barh', figsize=(15,20))
plt.grid()
plt.title("Weight Of Features in Class_4")

In [None]:
cls1 = train_df[train_df.target=='Class_5']
wt = cls1.sum()
wt.drop(['target','id']).sort_values().plot(kind='barh', figsize=(15,20))
plt.grid()
plt.title("Weight Of Features in Class_5")

In [None]:
cls1 = train_df[train_df.target=='Class_6']
wt = cls1.sum()
wt.drop(['target','id']).sort_values().plot(kind='barh', figsize=(15,20))
plt.grid()
plt.title("Weight Of Features in Class_6")

In [None]:
cls1 = train_df[train_df.target=='Class_7']
wt = cls1.sum()
wt.drop(['target','id']).sort_values().plot(kind='barh', figsize=(15,20))
plt.grid()
plt.title("Weight Of Features in Class_7")

In [None]:
cls1 = train_df[train_df.target=='Class_8']
wt = cls1.sum()
wt.drop(['target','id']).sort_values().plot(kind='barh', figsize=(15,20))
plt.grid()
plt.title("Weight Of Features in Class_8")

In [None]:
cls1 = train_df[train_df.target=='Class_9']
wt = cls1.sum()
wt.drop(['target','id']).sort_values().plot(kind='barh', figsize=(15,20))
plt.grid()
plt.title("Weight Of Features in Class_9")
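
The nine cells above repeat the same computation once per class. One way to express it a single time is to accumulate the feature totals per class in one pass. The sketch below does this on a tiny made-up table, using plain dictionaries instead of the `train_df` DataFrame. The function name, the column names `feat_1` and `feat_2`, and the row values are all hypothetical.

```python
from collections import defaultdict

def feature_totals_by_class(rows, target_key='target', skip=('id',)):
    """Sum every feature column separately for each target class."""
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        cls = row[target_key]
        for name, value in row.items():
            if name == target_key or name in skip:
                continue  # leave out the label and identifier columns
            totals[cls][name] += value
    return {cls: dict(feats) for cls, feats in totals.items()}

# Hypothetical rows, for illustration only
rows = [
    {'id': 1, 'feat_1': 1, 'feat_2': 0, 'target': 'Class_1'},
    {'id': 2, 'feat_1': 2, 'feat_2': 5, 'target': 'Class_1'},
    {'id': 3, 'feat_1': 0, 'feat_2': 3, 'target': 'Class_2'},
]
totals = feature_totals_by_class(rows)
for cls in sorted(totals):
    # Features sorted by total, mirroring sort_values() in the cells above
    print(cls, sorted(totals[cls].items(), key=lambda kv: kv[1]))
```

With pandas, `train_df.groupby('target').sum()` should give comparable per-class totals for the numeric columns, which could then be plotted class by class in a loop.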