An attempt to see how far I can get in the [Mercari Price Suggestion Challenge](https://www.kaggle.com/c/mercari-price-suggestion-challenge) starting from zero knowledge.

## Format

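- name: if the listing name contained a price, the price has been removed and replaced with [rm]
- shipping: whether shipping is free (1 if the seller pays the shipping fee, 0 if the buyer pays)
- item_condition_id: [1=New, 2=Like New, 3=Good, 4=Fair, 5=Poor](https://www.kaggle.com/c/mercari-price-suggestion-challenge/discussion/44228)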

| train_id | name                                 | item_condition_id | category_name                                      | brand_name      | price | shipping | item_description                                                                                                                                                                                                                                                             | 
|----------|--------------------------------------|-------------------|----------------------------------------------------|-----------------|-------|----------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| 
| 0        | MLB Cincinnati Reds T Shirt Size XL  | 3                 | Men/Tops/T-shirts                                  |                 | 10.0  | 1        | No description yet                                                                                                                                                                                                                                                           | 
| 1        | Razer BlackWidow Chroma Keyboard     | 3                 | Electronics/Computers & Tablets/Components & Parts | Razer           | 52.0  | 0        | This keyboard is in great condition and works like it came out of the box. All of the ports are tested and work perfectly. The lights are customizable via the Razer Synapse app on your PC.                                                                                 | 
| 2        | AVA-VIV Blouse                       | 1                 | Women/Tops & Blouses/Blouse                        | Target          | 10.0  | 1        | Adorable top with a hint of lace and a key hole in the back! The pale pink is a 1X, and I also have a 3X available in white!                                                                                                                                                 | 
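
As a rough illustration of how the `[rm]` marker and the `No description yet` placeholder could be turned into flags, here is a minimal sketch on made-up rows. The column names `has_rm` and `has_description` follow the commented-out lines in the text-processing cell of this notebook; the sample values are invented.

```python
import pandas as pd

# Toy rows; the names and descriptions are made up for illustration.
df = pd.DataFrame({
    'name': ['Blouse [rm] obo', 'Razer BlackWidow Chroma Keyboard'],
    'item_description': ['No description yet', 'Works like new'],
})

# 1 when a price was stripped from the name and replaced with [rm].
# regex=False makes the brackets match literally.
df['has_rm'] = df['name'].str.contains('[rm]', regex=False).astype('int8')

# 1 when the seller wrote something other than the placeholder text.
df['has_description'] = (df['item_description'] != 'No description yet').astype('int8')

print(df[['has_rm', 'has_description']])
```

The vectorized string methods avoid a Python-level `apply` over every row, which may matter on a training file of this size.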

# Process Data

## Read tsv

In [None]:
import math
import gc
import numpy as np
import pandas as pd

submit = True

In [None]:
# NOTE: https://qiita.com/hadacchi/items/ff6528364fed2404eef0
# NOTE: specify dtype for memory...
def read(ic, file_name):
    print('read ' + file_name)

    base_dir = '../input/'
    dtype = {ic: 'uint32', 'shipping': 'uint8', 'price': 'float16', 'item_condition_id': 'uint8'}
    return pd.read_table(base_dir + file_name, index_col=[ic], dtype=dtype)

df_train = read('train_id', 'train.tsv')
df_test = read('test_id', 'test.tsv')

In [None]:
def process_texts(df):
    print('process_text')

    #df['has_rm'] = df['name'].apply(lambda x: 1 if '[rm]' in x else 0).astype('int8')

    #df['has_description'] = df['item_description'].apply(lambda x: 1 if x != 'No description yet' else 0).astype('int8')
    df.drop(['name', 'item_description'], axis=1, inplace=True)
    return df

df_train = process_texts(df_train)
df_test = process_texts(df_test)
gc.collect()

In [None]:
brand_names = df_train['brand_name'].value_counts().where(lambda x : x > 10).dropna().index
# Identity mapping for brands that appear more than 10 times; rarer brands are left out.
brand_name_master = {b: b for b in brand_names}
del brand_names

def process_brand_name(df):
    print('process_brand_name')

    # NOTE: NaN == NaN -> false
    df['has_brand_name'] = df['brand_name'].apply(lambda x: 1 if x == x else 0).astype('int8')
    df['brand_name_m'] = df['brand_name'].apply(lambda x: brand_name_master.get(x, np.nan))
    df.drop(['brand_name'], axis=1, inplace=True)
    return df

df_train = process_brand_name(df_train)
df_test = process_brand_name(df_test)

del brand_name_master
gc.collect()
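
The brand handling above keeps only brands that occur often enough and treats everything else as missing. A possible vectorized variant of the same idea, shown on a made-up series, is sketched below. The threshold `min_count = 2` is a hypothetical value chosen so the toy data shows both outcomes (the notebook uses 10).

```python
import numpy as np
import pandas as pd

# Made-up brand column with one missing value.
brands = pd.Series(['Nike', 'Nike', 'Nike', 'Razer', np.nan, 'Target', 'Nike'])
min_count = 2  # hypothetical threshold for this toy example

counts = brands.value_counts()              # NaN is excluded by default
frequent = counts[counts > min_count].index

# 1 when any brand was given, 0 when the field was empty.
has_brand_name = brands.notna().astype('int8')

# Keep frequent brands; rare or missing brands become NaN.
brand_name_m = brands.where(brands.isin(frequent))

print(pd.DataFrame({'has_brand_name': has_brand_name, 'brand_name_m': brand_name_m}))
```

Because rare brands end up as NaN, `get_dummies` later produces an all-zero row for them, while `has_brand_name` still records that a brand was present.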

In [None]:
def process_category_name(df):
    print('process_category_name')

    return df

df_train = process_category_name(df_train)
df_test = process_category_name(df_test)

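[Converting categorical data into dummy variables](https://qiita.com/airtoxin/items/d66a22c5c7074e23be17#%E3%82%AB%E3%83%86%E3%82%B4%E3%83%AA%E3%82%AB%E3%83%AB%E3%83%87%E3%83%BC%E3%82%BF%E3%82%92%E3%83%80%E3%83%9F%E3%83%BC%E5%A4%89%E6%95%B0%E5%8C%96)

As in the link above, the pandas preprocessing converts qualitative data into quantitative data, and multiple regression is then run on the result (quantification theory type I).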
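
To make the dummy-variable idea concrete, here is a small sketch on invented data, where the price depends only on a three-level label. It uses plain least squares from NumPy rather than the scikit-learn model of this notebook; the column name `condition` and the prices are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: price is fully determined by the condition label.
df = pd.DataFrame({
    'condition': ['new', 'good', 'poor', 'new', 'good', 'poor'],
    'price': [30.0, 20.0, 10.0, 30.0, 20.0, 10.0],
})

# drop_first=True drops the first level in sorted order ('good'),
# which then acts as the baseline absorbed by the intercept.
X = pd.get_dummies(df[['condition']], columns=['condition'], drop_first=True).astype(float)

# Least squares with an explicit intercept column.
A = np.column_stack([np.ones(len(X)), X.to_numpy()])
coef, *_ = np.linalg.lstsq(A, df['price'].to_numpy(), rcond=None)

print(list(X.columns))
print(coef)  # intercept, then one offset per remaining level
```

Each coefficient is the price offset of that level relative to the dropped baseline, which is how a qualitative column ends up with a numeric effect.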

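TODO: For now the train and test columns are simply aligned to a single model, but this needs to be revisited later. For example, if train has Handmade but test only has Handmade/Cool, it seems wasteful not to use the Handmade training data.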

In [None]:
def process_categorical(df):
    print('process_categorical')

    df = pd.get_dummies(df, columns=['item_condition_id', 'category_name'], drop_first=True)
    gc.collect()
    
    # DON'T drop_first
    df = pd.get_dummies(df, columns=['brand_name_m'])
    gc.collect()

    return df

df_train = process_categorical(df_train)
gc.collect()
print(df_train.shape)
df_test = process_categorical(df_test)
gc.collect()
print(df_test.shape)

## Target variable

(trivial) The names X and y follow the usual [convention](http://docs.pyq.jp/python/machine_learning/tips/scikit-learn.html).

In [None]:
# as_matrix() was removed in pandas 1.0; .values returns the same NumPy array.
y = df_train['price'].values
df_train.drop(['price'], axis=1, inplace=True)
gc.collect()

## Explanatory variables

X is stored as a sparse matrix ([reference](http://nakano-tomofumi.hatenablog.com/entry/2017/11/06/181223)).
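
To give a feel for why a sparse matrix helps with one-hot data, the sketch below builds a made-up matrix with a single 1 per row and compares its dense and CSR memory footprints. The sizes are arbitrary assumptions, far smaller than the real training data.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_rows, n_cols = 1000, 500  # arbitrary toy sizes

# One-hot style matrix: exactly one 1 in each row, zeros elsewhere.
dense = np.zeros((n_rows, n_cols), dtype=np.float32)
dense[np.arange(n_rows), rng.integers(0, n_cols, size=n_rows)] = 1.0

X = sparse.csr_matrix(dense)

# A CSR matrix stores only the non-zero values plus two index arrays.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes

print('dense bytes :', dense.nbytes)
print('sparse bytes:', sparse_bytes)
print('non-zeros   :', X.nnz)
```

The dense size grows with rows times columns, while the sparse size grows with the number of non-zero entries, which for dummy variables is roughly one per row per encoded column.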

In [None]:
print('align train.tsv & test.tsv')
# https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding
df_train, df_test = df_train.align(df_test, join='left', axis=1, fill_value=0)
gc.collect()
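
What `align(..., join='left', axis=1, fill_value=0)` does to mismatched dummy columns can be seen on a toy pair of frames. The category values below are made up to mirror the Handmade versus Handmade/Cool case mentioned in the TODO.

```python
import pandas as pd

# Made-up categories: each side has one category the other lacks.
train = pd.DataFrame({'category_name': ['Handmade', 'Men/Tops']})
test = pd.DataFrame({'category_name': ['Handmade/Cool', 'Men/Tops']})

train_d = pd.get_dummies(train, columns=['category_name'])
test_d = pd.get_dummies(test, columns=['category_name'])

# join='left' keeps exactly the train columns, in train order.
# Columns missing from test are created and filled with 0;
# columns that exist only in test are dropped.
train_a, test_a = train_d.align(test_d, join='left', axis=1, fill_value=0)

print(list(train_a.columns))
print(list(test_a.columns))
```

After alignment both frames have identical columns, so a model fitted on train can predict on test, at the cost of ignoring categories that only appear in test.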

In [None]:
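train_columns = df_train.columns
# DataFrame.to_sparse() was removed in pandas 1.0; the .sparse accessor (pandas >= 0.25) replaces it.
sparse_dtype = pd.SparseDtype('uint8', fill_value=0)
print('generate sparse matrix X')
X = df_train.astype(sparse_dtype).sparse.to_coo()
print('generate sparse matrix T')
T = df_test.astype(sparse_dtype).sparse.to_coo()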

# Training

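Log-transform the variables. According to https://atarimae.biz/archives/13161, the model with and without the log transform assumes:

- As is → y is proportional to x
- Taking logs → the ratio of the rate of change of y to the rate of change of x is constant

so the two differ in what relationship between x and y is assumed.

Neither one is the correct choice in general; the question is which assumption fits the data at hand better. My understanding is that this so-called elasticity is constant often enough that the log transform is used in many cases.

Here the explanatory variables contain no quantitative data, only the binary columns that get_dummies produces from qualitative data, so log-transforming them or not should not change the result.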
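
The claim about binary explanatory variables can be checked numerically. The sketch below is an illustration on synthetic data, using NumPy least squares instead of the notebook's scikit-learn model; the helper `fit_predict` and the generating coefficients are made up. Since `log1p(0) = 0` and `log1p(1) = ln 2`, transforming a 0/1 column only rescales it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary dummies and a positive, price-like target.
X = rng.integers(0, 2, size=(200, 3)).astype(float)
y = np.exp(1.0 + X @ np.array([0.5, -0.3, 0.2]) + rng.normal(0, 0.1, 200))

def fit_predict(features, target):
    """Least squares with an intercept; returns coefficients and fitted values."""
    A = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef, A @ coef

# Target is log-transformed in both fits; only the features differ.
coef_raw, pred_raw = fit_predict(X, np.log1p(y))
coef_log, pred_log = fit_predict(np.log1p(X), np.log1p(y))

print(np.allclose(pred_raw, pred_log))
print(coef_raw[1:], coef_log[1:] * np.log(2))
```

The fitted values coincide and the slope coefficients differ only by the factor ln 2, so for dummy variables the transform on the feature side is a matter of scaling, not of model choice.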

In [None]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split

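def rmsle(a, p):
    # Root mean squared logarithmic error between actual prices a and predictions p.
    a = np.asarray(a, dtype=np.float64)
    p = np.maximum(np.asarray(p, dtype=np.float64), 0)  # negative predictions count as 0
    assert len(p) == len(a)
    return np.sqrt(np.mean((np.log1p(p) - np.log1p(a)) ** 2))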

clf = linear_model.LinearRegression()
if submit:
    print('fit')
    %time clf.fit(X, np.log1p(y))  # %time runs the fit a single time
else:
    print('train_test_split')
    X_train, X_test, y_train, a = train_test_split(X, y, test_size=0.2)
    print('fit')
    %time clf.fit(X_train, np.log1p(y_train))  # %time runs the fit a single time
    print('predict')
    p = clf.predict(X_test)
    print('calculate score')
    score = rmsle(a, np.expm1(p))
    print('v-score: ' + str(score))

print("=== Coefficients ===")
print(pd.DataFrame({
  "Name": train_columns,
  "Coefficients": clf.coef_
}).sort_values(by='Coefficients'))

print("=== Intercept ===")
print(clf.intercept_)
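
One way to read the training cell: because the model is fitted to `log1p(price)`, its squared error in log space is the same quantity that the `rmsle` score measures after converting predictions back with `expm1`. The sketch below checks this on a few made-up numbers; the `rmsle` here is a standalone NumPy version written for this example, and the predictions are invented.

```python
import numpy as np

def rmsle(actual, predicted):
    """Root mean squared logarithmic error, clipping negative predictions to 0."""
    actual = np.asarray(actual, dtype=np.float64)
    predicted = np.maximum(np.asarray(predicted, dtype=np.float64), 0)
    return np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

actual = np.array([10.0, 52.0, 10.0])   # prices
log_pred = np.array([2.5, 3.9, 2.3])    # made-up predictions of log1p(price)

# Plain RMSE measured in log1p space.
rmse_log = np.sqrt(np.mean((log_pred - np.log1p(actual)) ** 2))

# RMSLE measured on prices after undoing the transform.
score = rmsle(actual, np.expm1(log_pred))

print(rmse_log, score)
```

The two numbers agree up to floating-point error, which suggests that ordinary least squares on the log-transformed target is a reasonable match for this metric.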

In [None]:
file_name = 'submission.csv'

if submit:
    result = clf.predict(T)
    result_df = pd.DataFrame({'test_id': df_test.index, 'price': np.expm1(result)}, columns=['test_id', 'price'])
    result_df['price'] = result_df['price'].where(result_df['price'] >= 0, 0)
    result_df.to_csv(file_name, index=False)
    print('see ' + file_name)