References
* https://www.kaggle.com/kailex/r-eda-for-elo-ensemble-learning
* https://www.kaggle.com/artgor/elo-eda-and-models

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import os
print(os.listdir("../input"))

**Data**

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
merchants = pd.read_csv('../input/merchants.csv')
new_merchant_t = pd.read_csv('../input/new_merchant_transactions.csv')
his_trans = pd.read_csv('../input/historical_transactions.csv')

In [None]:
print(train.shape)
print(test.shape)
print(merchants.shape)
print(new_merchant_t.shape)
print(his_trans.shape)

In [None]:
train.dtypes

In [None]:
train['first_active_month'] = pd.to_datetime(train['first_active_month']).dt.strftime('%Y-%m')

In [None]:
print(new_merchant_t.dtypes)
print(his_trans.dtypes)

In [None]:
new_merchant_t['city_id'] = new_merchant_t['city_id'].astype(object)
new_merchant_t['merchant_category_id'] = new_merchant_t['merchant_category_id'].astype(object)
new_merchant_t['category_2'] = new_merchant_t['category_2'].astype(object)
new_merchant_t['state_id'] = new_merchant_t['state_id'].astype(object)
new_merchant_t['subsector_id'] = new_merchant_t['subsector_id'].astype(object)
new_merchant_t['purchase_date'] = pd.to_datetime(new_merchant_t['purchase_date'])

his_trans['city_id'] = his_trans['city_id'].astype(object)
his_trans['merchant_category_id'] = his_trans['merchant_category_id'].astype(object)
his_trans['category_2'] = his_trans['category_2'].astype(object)
his_trans['state_id'] = his_trans['state_id'].astype(object)
his_trans['subsector_id'] = his_trans['subsector_id'].astype(object)
his_trans['purchase_date'] = pd.to_datetime(his_trans['purchase_date'])

In [None]:
print(merchants.dtypes)

In [None]:
merchants['merchant_group_id'] = merchants['merchant_group_id'].astype(object)
merchants['merchant_category_id'] = merchants['merchant_category_id'].astype(object)
merchants['subsector_id'] = merchants['subsector_id'].astype(object)
merchants['city_id'] = merchants['city_id'].astype(object)
merchants['state_id'] = merchants['state_id'].astype(object)
merchants['category_2'] = merchants['category_2'].astype(object)

train.csv
* card_id :	Unique card identifier
* first_active_month :	'YYYY-MM', month of first purchase
* feature_1 :	Anonymized card categorical feature
* feature_2 :	Anonymized card categorical feature
* feature_3 :	Anonymized card categorical feature
* target :	Loyalty numerical score calculated 2 months after historical and evaluation period


historical_transactions.csv, new_merchant_transactions.csv
* card_id :	Card identifier
* month_lag :	month lag to reference date
* purchase_date :	Purchase date
* authorized_flag :	'Y' if approved, 'N' if denied
* category_3 :	anonymized category
* installments :	number of installments of purchase
* category_1 :	anonymized category
* merchant_category_id :	Merchant category identifier (anonymized )
* subsector_id :	Merchant category group identifier (anonymized )
* merchant_id :	Merchant identifier (anonymized)
* purchase_amount :	Normalized purchase amount
* city_id :	City identifier (anonymized )
* state_id :	State identifier (anonymized )
* category_2 :	anonymized category


merchants.csv
* merchant_id :	Unique merchant identifier
* merchant_group_id :	Merchant group (anonymized )
* merchant_category_id :	Unique identifier for merchant category (anonymized )
* subsector_id :	Merchant category group (anonymized )
* numerical_1 :	anonymized measure
* numerical_2 :	anonymized measure
* category_1 :	anonymized category
* most_recent_sales_range :	Range of revenue (monetary units) in last active month --> A > B > C > D > E
* most_recent_purchases_range :	Range of quantity of transactions in last active month --> A > B > C > D > E
* avg_sales_lag3 :	Monthly average of revenue in last 3 months divided by revenue in last active month
* avg_purchases_lag3 :	Monthly average of transactions in last 3 months divided by transactions in last active month
* active_months_lag3 :	Quantity of active months within last 3 months
* avg_sales_lag6 :	Monthly average of revenue in last 6 months divided by revenue in last active month
* avg_purchases_lag6 :	Monthly average of transactions in last 6 months divided by transactions in last active month
* active_months_lag6 :	Quantity of active months within last 6 months
* avg_sales_lag12 :	Monthly average of revenue in last 12 months divided by revenue in last active month
* avg_purchases_lag12 :	Monthly average of transactions in last 12 months divided by transactions in last active month
* active_months_lag12 :	Quantity of active months within last 12 months
* category_4 :	anonymized category
* city_id :	City identifier (anonymized )
* state_id :	State identifier (anonymized )
* category_2 :	anonymized category

In [None]:
train.head(10)

The task is to predict a loyalty score ('target') for each card_id in test.csv and sample_submission.csv.

**1. train**

In [None]:
train['target'].describe()

The mean is below zero, and a few values are extreme relative to the rest.

In [None]:
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=train.target)

In [None]:
f, ax = plt.subplots(figsize=(10, 8))
sns.distplot(train['target'])

Looking at the data, target clusters near 0. Values below -30 look like possible outliers.
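
Before dropping anything, it can help to quantify the suspected outliers. The snippet below is a minimal sketch on a toy frame (`toy_train` and its values are made up for illustration, not taken from train.csv). It counts rows below the -30 threshold and checks how many distinct values they take, which can hint at whether they are a sentinel rather than genuine scores.

```python
import pandas as pd

# Toy stand-in for train; the values are illustrative only.
toy_train = pd.DataFrame({
    "card_id": ["C_1", "C_2", "C_3", "C_4", "C_5"],
    "target": [0.39, -0.82, -33.22, 1.45, -33.22],
})

# Flag rows below the -30 threshold discussed above.
is_outlier = toy_train["target"] < -30

n_outliers = int(is_outlier.sum())       # how many rows are affected
outlier_share = is_outlier.mean()        # fraction of all rows
distinct_outlier_values = toy_train.loc[is_outlier, "target"].nunique()

print(n_outliers, round(outlier_share, 3), distinct_outlier_values)
```

On the real data the same three numbers can be computed by replacing `toy_train` with `train`.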

In [None]:
# calculate the correlation matrix (numeric columns only; card_id and first_active_month are strings)
corr = train.select_dtypes(include='number').corr()

# plot the heatmap
sns.heatmap(corr, 
        xticklabels=corr.columns,
        yticklabels=corr.columns, annot=True)

In [None]:
col_list = train.columns.tolist()
col_list = col_list[2:]
f = pd.melt(train, value_vars=col_list)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")

In [None]:
col_list = test.columns.tolist()
col_list = col_list[2:]
f = pd.melt(test, value_vars=col_list)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")

For each feature, train and test have similar distributions.
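
The visual comparison can be backed by numbers. The sketch below uses toy data (`toy_train`, `toy_test`, and the helper `compare_shares` are hypothetical names introduced here). It puts the per-category shares of one feature side by side and reports the absolute gap.

```python
import pandas as pd

# Toy stand-ins for one categorical feature in train and test.
toy_train = pd.DataFrame({"feature_1": [1, 2, 3, 3, 5, 2, 3, 4]})
toy_test = pd.DataFrame({"feature_1": [3, 2, 3, 1, 5, 4, 3, 3]})

def compare_shares(train_col, test_col):
    """Return per-category shares in train and test plus their absolute gap."""
    shares = pd.concat(
        [
            train_col.value_counts(normalize=True).rename("train"),
            test_col.value_counts(normalize=True).rename("test"),
        ],
        axis=1,
    ).fillna(0.0).sort_index()
    shares["abs_diff"] = (shares["train"] - shares["test"]).abs()
    return shares

shares = compare_shares(toy_train["feature_1"], toy_test["feature_1"])
print(shares)
```

A small maximum in `abs_diff` supports the "similar distribution" reading; what counts as small is a judgment call.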

**2. transaction data - historical, new**

In [None]:
his_trans.head()

In [None]:
his_trans.describe()

In [None]:
new_merchant_t.describe()

(What does a negative purchase_amount mean? -> Presumably the values have been transformed, since the field is described as normalized.)

In [None]:
fig, ax = plt.subplots(1, 4, figsize = (16, 6));
his_trans['authorized_flag'].value_counts().sort_index().plot(kind='bar', ax=ax[0], color='teal', title='authorized_flag');
his_trans['category_1'].value_counts().sort_index().plot(kind='bar', ax=ax[1], color='brown', title='category_1');
his_trans['category_2'].value_counts().sort_index().plot(kind='bar', ax=ax[2], color='gold', title='category_2');
his_trans['category_3'].value_counts().sort_index().plot(kind='bar', ax=ax[3], color='purple', title='category_3');
plt.suptitle('Counts of categories: historical transactions');
fig, ax = plt.subplots(1, 4, figsize = (16, 6));
new_merchant_t['authorized_flag'].value_counts().sort_index().plot(kind='bar', ax=ax[0], color='teal', title='authorized_flag');
new_merchant_t['category_1'].value_counts().sort_index().plot(kind='bar', ax=ax[1], color='brown', title='category_1');
new_merchant_t['category_2'].value_counts().sort_index().plot(kind='bar', ax=ax[2], color='gold', title='category_2');
new_merchant_t['category_3'].value_counts().sort_index().plot(kind='bar', ax=ax[3], color='purple', title='category_3');
plt.suptitle('Counts of categories: new merchant transactions');

Looked at some of the categorical variables in the transaction data. The historical and new merchant data show similar patterns, except for authorized_flag: the new merchant data contains no 'N'. For category_1, 'Y' is slightly more common in the historical data; for category_3, 'B' has a slightly smaller share in the historical data.

In [None]:
his_trans[his_trans['authorized_flag']=='N'].head(10)

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (16, 6));
his_trans['installments'].value_counts().sort_index().plot(kind='bar', ax=ax[0], color='teal', title='installments');
his_trans['month_lag'].value_counts().sort_index().plot(kind='bar', ax=ax[1], color='brown', title='month_lag');
plt.suptitle('Counts of installments and month_lag: historical transactions');

fig, ax = plt.subplots(1, 2, figsize = (16, 6));
new_merchant_t['installments'].value_counts().sort_index().plot(kind='bar', ax=ax[0], color='teal', title='installments');
new_merchant_t['month_lag'].value_counts().sort_index().plot(kind='bar', ax=ax[1], color='brown', title='month_lag');
plt.suptitle('Counts of installments and month_lag: new merchant transactions');

(installments == 999: is this an outlier?)

In [None]:
his_trans[his_trans['installments']==999].head()

In [None]:
his_trans[(his_trans['installments']==999) & (his_trans.authorized_flag=='Y')]

Some transactions with installments == 999 have authorized_flag 'Y'.

In [None]:
his_trans[his_trans['installments']==-1].head()

installments contains -1 and 999: outliers or missing values? How should they be handled? If 999 means more than 12 installments, one option is to recode 999 as, for example, 13.
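
The two treatments floated above can be compared on a toy frame. This is only a sketch: `toy_trans` is made up, and the reading of 999 as "more than 12 installments" is an unverified assumption carried over from the note.

```python
import pandas as pd

# Toy stand-in for the transactions table.
toy_trans = pd.DataFrame({
    "card_id": ["C_1", "C_1", "C_2", "C_2", "C_3"],
    "installments": [1, 999, -1, 3, 12],
})

inst = toy_trans["installments"]

# Option A: treat both -1 and 999 as missing.
toy_trans["installments_nan"] = inst.mask(inst.isin([-1, 999]))

# Option B: keep -1 as missing but map 999 to 13,
# ASSUMING 999 stands for "more than 12 installments" (unverified).
toy_trans["installments_capped"] = inst.mask(inst == -1).replace(999, 13)

# Per-card means under the raw values and under each option.
means = toy_trans.groupby("card_id")[
    ["installments", "installments_nan", "installments_capped"]
].mean()
print(means)
```

The raw mean for a card with a single 999 row is dominated by that row, which is why the choice matters for any mean-based feature.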

The ID-type variables (city_id, merchant_category_id, state_id, subsector_id) also contain -1.

In [None]:
his_trans[his_trans['city_id']==-1]

In [None]:
his_trans[(his_trans['city_id']==-1)&(his_trans['state_id']!=-1)]

If city_id is -1, state_id is also -1.

In [None]:
print(len(his_trans[(his_trans['city_id']!=-1)&(his_trans['state_id']==-1)]))
his_trans[(his_trans['city_id']!=-1)&(his_trans['state_id']==-1)].head()

But the converse does not hold. There are some cases where city_id is not -1 but state_id is -1.

-1 in the ID-type variables: missing value or outlier? How to handle it is still an open question.
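
The two checks above can be condensed into one cross-tabulation, and one candidate treatment is to recode -1 as missing. The sketch below uses a toy frame (`toy_trans` and its values are illustrative); whether -1 really means "missing" is an assumption, not something the data description states.

```python
import pandas as pd

# Toy stand-in for the ID columns of the transactions table.
toy_trans = pd.DataFrame({
    "city_id": [-1, -1, 69, 88, 17],
    "state_id": [-1, -1, 9, -1, 22],
})

# Label each row by whether the ID equals -1, then cross-tabulate.
city_flag = toy_trans["city_id"].eq(-1).map(
    {True: "city_id == -1", False: "city_id != -1"}
)
state_flag = toy_trans["state_id"].eq(-1).map(
    {True: "state_id == -1", False: "state_id != -1"}
)
flags = pd.crosstab(city_flag, state_flag)
print(flags)

# Candidate treatment (an assumption): recode -1 as missing.
id_cols = ["city_id", "state_id"]
as_missing = toy_trans[id_cols].mask(toy_trans[id_cols].eq(-1))
n_missing = as_missing.isnull().sum()
print(n_missing)
```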

**Data Preprocess**

Check for missing values

In [None]:
merchants.isnull().sum()

In [None]:
merchants[merchants.avg_sales_lag3.isnull()]

The missing values in the avg_sales_lag columns all occur in the same rows.
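
That observation can be verified programmatically rather than by eye. The sketch below runs on a toy frame (`toy_merchants` is made up). It compares the rows where any lag column is null with the rows where all of them are null.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merchants table.
toy_merchants = pd.DataFrame({
    "merchant_id": ["M_1", "M_2", "M_3", "M_4"],
    "avg_sales_lag3": [1.1, np.nan, 0.9, np.nan],
    "avg_sales_lag6": [1.0, np.nan, 1.2, np.nan],
    "avg_sales_lag12": [1.3, np.nan, 0.8, np.nan],
})

lag_cols = ["avg_sales_lag3", "avg_sales_lag6", "avg_sales_lag12"]
null_flags = toy_merchants[lag_cols].isnull()

# Rows where at least one lag column is null vs. rows where all of them are.
any_null = null_flags.any(axis=1)
all_null = null_flags.all(axis=1)

# If the two masks are identical, the nulls always occur together.
nulls_co_occur = bool((any_null == all_null).all())
print(nulls_co_occur, int(any_null.sum()))
```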

In [None]:
new_merchant_t.isnull().sum()

In [None]:
his_trans.isnull().sum()

**train + historical**

First, remove the rows of train whose target is below -30.

In [None]:
train = train[train.target>-30]

1. purchase_amount (Normalized purchase amount)

I took this to be the purchase amount, but could it be the purchase count?

In [None]:
# purchase_amount size per card_id (number of purchases)
c_his = his_trans.groupby("card_id")
c_his = c_his["purchase_amount"].size().reset_index()
c_his.columns = ["card_id","purchase_amount_size"]
train = pd.merge(train, c_his, on="card_id", how="left")

In [None]:
data = pd.concat([train['target'], train['purchase_amount_size']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
plt.scatter(x='purchase_amount_size', y="target", data=data)
plt.xlabel("purchase_amount_size")
plt.ylabel("target")
plt.show()

In [None]:
# purchase_amount mean per card_id
c_his = his_trans.groupby("card_id")
c_his = c_his["purchase_amount"].mean().reset_index()
c_his.columns = ["card_id","purchase_amount_mean"]
train = pd.merge(train, c_his, on="card_id", how="left")

In [None]:
data = pd.concat([train['target'], train['purchase_amount_mean']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
plt.scatter(x='purchase_amount_mean', y="target", data=data)
plt.xlabel("purchase_amount_mean")
plt.ylabel("target")
plt.show()

Check the rightmost point.

In [None]:
train[train['purchase_amount_mean']>400000]

The lone extreme point is presumably caused by the very large max seen in his_trans.describe(), so I checked. Is this purchase_amount an outlier?
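
One way to see how much a single extreme transaction distorts a per-card mean is to compute the median alongside it. The sketch below uses a toy frame (`toy_trans` and its amounts are invented); using the median as a robust alternative is a suggestion, not something the notebook does.

```python
import pandas as pd

# Toy stand-in for the transactions table, with one extreme amount for C_1.
toy_trans = pd.DataFrame({
    "card_id": ["C_1", "C_1", "C_1", "C_2", "C_2", "C_2"],
    "purchase_amount": [-0.70, -0.65, 6_000_000.0, -0.72, -0.60, -0.55],
})

# Mean, median, and max per card in one pass.
per_card = toy_trans.groupby("card_id")["purchase_amount"].agg(
    ["mean", "median", "max"]
)
print(per_card)
```

For C_1 the mean jumps far above 400000 while the median stays near the typical values, so the median (or a clipped mean) may be a safer feature if the extreme value turns out to be an artifact.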

In [None]:
his_trans[his_trans['purchase_amount']>400000]

In [None]:
# if remove that point,
data = data[data.purchase_amount_mean < 400000]
f, ax = plt.subplots(figsize=(8, 6))
plt.scatter(x='purchase_amount_mean', y="target", data=data)
plt.xlabel("purchase_amount_mean")
plt.ylabel("target")
plt.show()

2. installments(number of installments of purchase)

Understood as the number of installment months.

In [None]:
c_his = his_trans.groupby("card_id")
c_his = c_his["installments"].mean().reset_index()
c_his.columns = ["card_id","installments_mean"]
train = pd.merge(train, c_his, on="card_id", how="left")

In [None]:
data = pd.concat([train['target'], train['installments_mean']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
plt.scatter(x='installments_mean', y="target", data=data)
plt.xlabel("installments_mean")
plt.ylabel("target")
plt.show()

installments_mean includes the -1 and 999 installments. Can a purchase really have that many installments? The large means are the effect of 999. How should 999 be handled? If 999 stands for 12 or more months, replace it with 13? Remove it? Use some other value?

3. month_lag(month lag to reference date)

In [None]:
c_his = his_trans.groupby("card_id")
c_his = c_his["month_lag"].max().reset_index()
c_his.columns = ["card_id","month_lag_recent"]
train = pd.merge(train, c_his, on="card_id", how="left")

In [None]:
data = pd.concat([train['target'], train['month_lag_recent']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
plt.scatter(x='month_lag_recent', y="target", data=data)
plt.xlabel("month_lag_recent")
plt.ylabel("target")
plt.show()

Most cards have a recent purchase.

Looking at the period from first_active_month to the most recent purchase:

In [None]:
his_trans['purchase_date_month'] = his_trans['purchase_date'].dt.strftime('%Y-%m')

In [None]:
c_his = his_trans.groupby("card_id")
c_his = c_his["purchase_date_month"].max().reset_index()
c_his.columns = ["card_id","purchase_date_month_recent"]
train = pd.merge(train, c_his, on="card_id", how="left")

In [None]:
train['purchase_period'] = pd.to_datetime(train['purchase_date_month_recent'])-pd.to_datetime(train['first_active_month'])

In [None]:
# convert the timedelta to whole 30-day periods (approximate months) in one vectorized step
train['purchase_period'] = train['purchase_period'].dt.days // 30
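
Dividing days by 30 only approximates months (February to March, for instance, is 28 days and floors to 0). A calendar-month difference can be computed directly from the year and month components. The sketch below uses a toy frame (`toy` and its dates are illustrative) and is offered as an alternative, not as the notebook's method.

```python
import pandas as pd

# Toy stand-in for the two month columns in train; None mimics a card with no match.
toy = pd.DataFrame({
    "first_active_month": ["2017-06", "2018-02", "2016-08"],
    "purchase_date_month_recent": ["2018-02", "2018-03", None],
})

first = pd.to_datetime(toy["first_active_month"], format="%Y-%m")
recent = pd.to_datetime(toy["purchase_date_month_recent"], format="%Y-%m")

# Calendar-month difference: 12 * year gap + month gap.
toy["purchase_period_months"] = (
    (recent.dt.year - first.dt.year) * 12 + (recent.dt.month - first.dt.month)
)

# For comparison: the days // 30 approximation.
toy["purchase_period_approx"] = (recent - first).dt.days // 30
print(toy)
```

Rows without a matching transaction stay missing (NaN) under both variants.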

In [None]:
# calculate the correlation matrix (numeric columns only)
corr = train.select_dtypes(include='number').corr()

# plot the heatmap
sns.heatmap(corr, 
        xticklabels=corr.columns,
        yticklabels=corr.columns, annot=True)

Added the transaction features to train and re-checked the correlations. Still nothing strong.
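
With many columns the heatmap gets hard to read; ranking the features by absolute correlation with target gives the same information as a short list. The sketch below uses a toy frame (`toy_train`, `feature_a`, and `feature_b` are invented for illustration).

```python
import pandas as pd

# Toy stand-in for train with one strongly and one weakly related feature.
toy_train = pd.DataFrame({
    "card_id": ["C_1", "C_2", "C_3", "C_4", "C_5"],
    "target": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feature_a": [2.0, 4.0, 6.0, 8.0, 10.0],
    "feature_b": [1.0, -1.0, 1.0, -1.0, 1.0],
})

# Correlate numeric columns only, then rank by |correlation with target|.
corr = toy_train.select_dtypes(include="number").corr()
ranked = corr["target"].drop("target").abs().sort_values(ascending=False)
print(ranked)
```

Pearson correlation only captures linear relationships, so a low rank here does not rule out a feature being useful to a tree-based model.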

**train + new_merchant_transactions**

In [None]:
new_merchant_t.head()

1. purchase_amount (Normalized purchase amount)

In [None]:
# purchase_amount size per card_id (number of purchases)
c_his = new_merchant_t.groupby("card_id")
c_his = c_his["purchase_amount"].size().reset_index()
c_his.columns = ["card_id","purchase_amount_size_new"]
train = pd.merge(train, c_his, on="card_id", how="left")

In [None]:
data = pd.concat([train['target'], train['purchase_amount_size_new']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
plt.scatter(x='purchase_amount_size_new', y="target", data=data)
plt.xlabel("purchase_amount_size_new")
plt.ylabel("target")
plt.show()

In [None]:
# purchase_amount mean per card_id
c_his = new_merchant_t.groupby("card_id")
c_his = c_his["purchase_amount"].mean().reset_index()
c_his.columns = ["card_id","purchase_amount_mean_new"]
train = pd.merge(train, c_his, on="card_id", how="left")

In [None]:
data = pd.concat([train['target'], train['purchase_amount_mean_new']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
plt.scatter(x='purchase_amount_mean_new', y="target", data=data)
plt.xlabel("purchase_amount_mean_new")
plt.ylabel("target")
plt.show()

Looking at the relation between purchase_amount_mean and target: in the historical data the shape was almost flat, whereas in the new data it looks somewhat more like a normal distribution.

2. installments(number of installments of purchase)

In [None]:
c_his = new_merchant_t.groupby("card_id")
c_his = c_his["installments"].mean().reset_index()
c_his.columns = ["card_id","installments_mean_new"]
train = pd.merge(train, c_his, on="card_id", how="left")

In [None]:
data = pd.concat([train['target'], train['installments_mean_new']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
plt.scatter(x='installments_mean_new', y="target", data=data)
plt.xlabel("installments_mean_new")
plt.ylabel("target")
plt.show()

There are values that sit somewhat apart from the rest.

In [None]:
train[train.installments_mean_new>30]

Could it be because installments is 999?

In [None]:
new_merchant_t[new_merchant_t.installments==999]

The new_merchant_transactions data contains two cases with installments == 999.

In [None]:
# if remove that point,
data = data[data.installments_mean_new<30]
f, ax = plt.subplots(figsize=(8, 6))
plt.scatter(x='installments_mean_new', y="target", data=data)
plt.xlabel("installments_mean_new")
plt.ylabel("target")
plt.show()

The plot looks a bit odd: the points are spread out in some places and lined up in others.

3. month_lag(month lag to reference date)

In [None]:
c_his = new_merchant_t.groupby("card_id")
c_his = c_his["month_lag"].max().reset_index()
c_his.columns = ["card_id","month_lag_recent_new"]
train = pd.merge(train, c_his, on="card_id", how="left")

In [None]:
data = pd.concat([train['target'], train['month_lag_recent_new']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
plt.scatter(x='month_lag_recent_new', y="target", data=data)
plt.xlabel("month_lag_recent_new")
plt.ylabel("target")
plt.show()

Looking at the period from first_active_month to the most recent purchase:

In [None]:
new_merchant_t['purchase_date_month'] = new_merchant_t['purchase_date'].dt.strftime('%Y-%m')

In [None]:
c_his = new_merchant_t.groupby("card_id")
c_his = c_his["purchase_date_month"].max().reset_index()
c_his.columns = ["card_id","purchase_date_month_recent_new"]
train = pd.merge(train, c_his, on="card_id", how="left")

In [None]:
train['purchase_period_new'] = pd.to_datetime(train['purchase_date_month_recent_new'])-pd.to_datetime(train['first_active_month'])

In [None]:
train.head()

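One open question: the task is something like personalized merchant recommendation, yet so far everything has been looked at per customer only.
Ideas for more variables: the mean purchase interval, and the most frequent city, state, merchant_id, and other ID variables per card. Perhaps the merchants data can then be joined on that merchant_id. (How best to use the merchants data is still unclear.)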


Most frequent value per card_id (city_id, state_id, merchant_id, merchant_category_id, category_1, category_2, category_3, subsector_id)
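
A minimal sketch of two of the proposed variables, on toy data: the most frequent ID value per card and the mean gap between consecutive purchases. `toy_trans`, `most_frequent`, and the output column names are hypothetical; ties in the mode resolve to the smallest value because `Series.mode()` returns sorted results.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the transactions table.
toy_trans = pd.DataFrame({
    "card_id": ["C_1", "C_1", "C_1", "C_2", "C_2"],
    "city_id": [69, 69, 88, 17, 17],
    "merchant_id": ["M_a", "M_b", "M_b", "M_c", "M_c"],
    "purchase_date": pd.to_datetime([
        "2017-06-01", "2017-06-05", "2017-06-11", "2017-07-01", "2017-07-03",
    ]),
})

def most_frequent(values):
    """Most frequent non-null value, or NaN if the group has none."""
    modes = values.mode()
    return modes.iloc[0] if len(modes) else np.nan

# Sort by date so that diff() measures gaps between consecutive purchases.
grouped = toy_trans.sort_values("purchase_date").groupby("card_id")

card_features = grouped.agg(
    city_id_mode=("city_id", most_frequent),
    merchant_id_mode=("merchant_id", most_frequent),
)

# Mean gap in days between consecutive purchases of the same card.
gaps = grouped["purchase_date"].diff().dt.days
card_features["purchase_gap_mean_days"] = gaps.groupby(toy_trans["card_id"]).mean()
print(card_features)
```

The resulting `merchant_id_mode` column could serve as the key for joining the merchants table, as suggested above. A Python-level function inside `agg` is slow on tens of millions of rows, so a `value_counts`-based approach may be needed at full scale.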

In [None]:
# Fill null by most frequent data
#df_trans['category_2'].fillna(1.0,inplace=True)
#df_trans['category_3'].fillna('A',inplace=True)
#df_trans['merchant_id'].fillna('M_ID_00a6ca8a8a',inplace=True)
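
The commented-out lines hard-code the fill values. A sketch that derives them from the data instead, on a toy frame (`toy_trans` is made up; `df_trans` in the comments above is not defined elsewhere in this notebook, so the real target frame is left open):

```python
import numpy as np
import pandas as pd

# Toy stand-in with nulls in the three columns named in the comments above.
toy_trans = pd.DataFrame({
    "category_2": [1.0, np.nan, 1.0, 3.0],
    "category_3": ["A", "B", None, "A"],
    "merchant_id": ["M_x", None, "M_x", "M_y"],
})

fill_cols = ["category_2", "category_3", "merchant_id"]

# mode() skips nulls by default; iloc[0] takes the first (smallest) mode on ties.
fill_values = {col: toy_trans[col].mode().iloc[0] for col in fill_cols}

# fillna accepts a per-column mapping of fill values.
filled = toy_trans.fillna(value=fill_values)
print(fill_values)
print(int(filled.isnull().sum().sum()))
```

Computing the modes on the training portion only and reusing them for test avoids leaking test information into the fill values.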

In [None]:
train.isnull().sum()

In [None]:
train.dtypes

Once all the variables are built, the plan for next week is to build prediction models and compare them!