# MERKLE Data Scientist

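## 1. Exploratory data analysis – before modeling, describe what you observed in the data.

The dataset is large (train = over 60 million rows, test = over 4 million rows), so a random 1% sample of train is loaded as the training data.
The target click takes the values [1, 0], and click=1 accounts for only about 17% of the training data. There are 22 features, none with missing values. The date variable hour (datetime) is stored in a special format, so it is converted and two new variables, weekday and hour_of_day, are derived from it. Most of the remaining features (e.g. C1, C14, C15, ...) have no clear definition and are unevenly distributed, but they are kept for now.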


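## 2. Feature engineering – what processing did you apply to the features used for modeling?

Features with dtype=object are converted to numeric values with hash() so that they can be used with sklearn. DecisionTreeClassifier() feature_importances_ is then used to keep the variables with importance > 0.01.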

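## 3. Model tuning – the trained model is tuned against validation data; explain the tuning details.

DecisionTreeClassifier() is used and tuned with Grid search CV, where max_depth = 10 performed best; cross_validation KFold is then used as a check against overfitting.
In the CV check, max_depth = 10 and 12 gave nearly the same results, so depth = 10 was chosen in the end, with a Roc_auc_score of about 54%.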

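## 4. Reasoning – we also want to understand the thinking behind these steps: why did you apply these treatments and adjustments?
The above is only a rough summary of the final choices. Before settling on them I consulted references and tried many model options. Others may have obtained very good results with XGBoost and variable grouping, so I wanted to try a different way of transforming the features (converting them to numeric), adding some combined numeric variables (not adopted in the end, because they might introduce collinearity), and a different model (DecisionTreeClassifier) to see the results.
A possible next step: unevenly distributed data and unclear definitions affect which variables and models can be used. Drawing the sample so that each variable is evenly distributed might help, but it would make the data less realistic.
Finally, for the submission: predict and predict_proba give different submission scores, and both can serve as a reference.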
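
The balanced-sampling idea above can be sketched as follows. This is only an illustration on a made-up toy frame, not part of the notebook's pipeline: the column names click and banner_pos mirror the dataset, but the values, the helper name downsample_majority, and the 1:1 class ratio are all assumptions.

```python
import pandas as pd

def downsample_majority(df, label="click", random_state=0):
    """Hypothetical helper: keep as many rows of each class as the rarest class has."""
    n_min = df[label].value_counts().min()
    # groupby(...).sample draws n_min rows from each class without replacement
    return df.groupby(label).sample(n=n_min, random_state=random_state)

# Toy data with roughly the 17% click rate noted in section 1
toy = pd.DataFrame({
    "click": [1] * 17 + [0] * 83,
    "banner_pos": [0, 1] * 50,
})
balanced = downsample_majority(toy)
print(balanced["click"].value_counts())
```

Downsampling changes the base click rate, so probabilities predicted by a model trained on such a sample would no longer match the true CTR without recalibration, which is the loss of realism mentioned above.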

## Import data

In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

from matplotlib import pyplot as plt
import seaborn as sns
import gzip

test_file = '../input/avazu-ctr-prediction/test.gz'
samplesubmision_file = '../input/avazu-ctr-prediction/sampleSubmission.gz'


In [None]:
chunksize = 10 ** 6
sampled_chunks = []

# Read train.csv in chunks and keep a random 1% of each chunk
for num_of_chunk, chunk in enumerate(pd.read_csv('../input/trainingtraincsv/train.csv',chunksize=chunksize), start=1):
    sampled_chunks.append(chunk.sample(frac=0.01))
    print(f'Process {num_of_chunk} is done.')

# Concatenate once instead of growing the DataFrame inside the loop
train_raw = pd.concat(sampled_chunks, axis = 0, ignore_index = True)

In [None]:
train_len = len(train_raw)
print("Number of training Set:",train_len)

## Exploring Data

### - The First Look

In [None]:
train = train_raw.iloc[: , 1:]
train.info()

In [None]:
#print(train.head(5))
#print(train.columns)
#print(train.dtypes)

### - Check for y variable
### Note: about 17% of the rows in the training sample are clicks

In [None]:
# Define X and y
X = train.loc[:, ~train.columns.isin(['click'])]
y = train.click

# Sample CTR
train.click.value_counts().plot(kind='bar')
print(train.click.value_counts())
print("Sample CTR :", y.sum()/len(y))

### - Check for X variables

In [None]:
for i in train.columns:
    print(i,':\n','unique num: ',train[i].nunique(),'\n total num: ',train[i].count())
    print(train[i].value_counts()/len(X),'\n')

In [None]:

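def field_plot():
    # Plot click counts for every low-cardinality feature (fewer than 20 unique values).
    # The list is built locally so re-running the cell does not accumulate duplicates.
    ff = [i for i in X.columns if train[i].nunique() < 20]
    for f in ff:
        print(f'Distribution of {f} :', X[f].nunique())
        print(X[f].value_counts()/len(X),'\n')
        plt.figure(figsize=(15,10))
        sns.countplot(x=f,hue='click',data=train)

field_plot()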

In [None]:
# Select and print numeric columns ('number' covers every integer and float width)
numeric_X = train.select_dtypes(include='number')
print(numeric_X.columns.tolist())

In [None]:
# Select and print categorical columns 
categorical_X = train.select_dtypes(include=['object'])
print(categorical_X.columns.tolist())

### - Check for Missing Value: There is no missing data for columns and rows.

In [None]:
# Print missing values by column 
print(train.isnull().sum(axis = 0))

In [None]:
# Print total number of missing values in rows
print(train.isnull().sum(axis = 1).sum())

## Feature Engineering

### - Combine the train and test data so feature engineering is applied to both

In [None]:
all_data = pd.concat([train, pd.read_csv(test_file, compression='gzip')]).drop(['id'], axis=1)

In [None]:
print('Number of train data: ', len(train))
print('Number of test data: ', len(all_data)-len(train))

### - Data field [hour]: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.

In [None]:
#Original data
print(all_data['hour'].head(5))

### - Define new column: 'hour of day' and 'weekday'

In [None]:

# extract hour of day
all_data['hour'] = pd.to_datetime(all_data['hour'], format = '%y%m%d%H')
all_data['hour_of_day'] = all_data['hour'].dt.hour
#print(all_data['hour_of_day'].sample(10))

# extract weekday
all_data['weekday'] = (all_data['hour'].dt.dayofweek)
#print(all_data['weekday'].sample(10))
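
As a small worked example of the conversion above, the sketch below parses two YYMMDDHH values and extracts the same two fields. The second value (14102100) is an illustrative assumption; casting to string before parsing is a precaution added here, not something the notebook does.

```python
import pandas as pd

# YYMMDDHH integers, as stored in the hour column
raw_hour = pd.Series([14091123, 14102100])

# Cast to string so the format codes line up with the digits
parsed = pd.to_datetime(raw_hour.astype(str), format="%y%m%d%H")

hour_of_day = parsed.dt.hour.tolist()     # 0-23
weekday = parsed.dt.dayofweek.tolist()    # Monday=0 ... Sunday=6
print(hour_of_day, weekday)
```

Note that dayofweek counts from Monday = 0, so a value of 3 means Thursday.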

### - plot datetime columns

In [None]:
# Get and plot total clicks by hour of day
all_data.groupby('hour_of_day')['click'].sum().plot.bar(figsize=(12,6))
plt.ylabel('Number of clicks')
plt.title('Number of clicks by hour of day')
plt.show()
plt.figure(figsize=(15,10))
sns.countplot(x='hour_of_day',hue='click',data=all_data)

In [None]:
# Get and plot total clicks by weekday
all_data.groupby('weekday')['click'].sum().plot.bar(figsize=(12,6))
plt.ylabel('Number of clicks')
plt.title('Number of clicks by weekday')
plt.show()

### - Converting categorical variables

In [None]:
# Iterate over categorical columns and apply hash function
categorical_cols = all_data.select_dtypes(include = ['object']).columns.tolist()
for col in categorical_cols:
    all_data[col] = all_data[col].apply(hash)
print(all_data.head(5))
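
One caveat with the built-in hash(): Python salts string hashes per process unless PYTHONHASHSEED is set, so the hashed values can differ after a kernel restart. A minimal sketch of a run-independent alternative follows, assuming zlib.crc32 is an acceptable substitute; the helper name stable_hash and the sample strings are hypothetical and not part of the notebook.

```python
import zlib
import pandas as pd

def stable_hash(value):
    """Hypothetical helper: map a value to a 32-bit integer that is identical across runs."""
    return zlib.crc32(str(value).encode("utf-8"))

# Made-up category strings standing in for an object-dtype column
toy = pd.DataFrame({"site_id": ["1fbe01fe", "85f751fd", "1fbe01fe"]})
toy["site_id_hashed"] = toy["site_id"].map(stable_hash)
print(toy)
```

Equal strings always map to the same integer, so train and test stay consistent even when they are hashed in separate sessions.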

## Feature Selection

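all_data holds the train rows followed by the test rows, so the first train_len rows are the training data and the rest are the test data.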

In [None]:
# Define train data and test data
train = all_data[:train_len]
test = all_data[train_len:]

# Use the train data for the feature selection calculation
X = train.loc[:, ~train.columns.isin(['click','hour'])]
y = train['click']
#X.info()
#y.info()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier
#from sklearn.ensemble import ExtraTreesClassifier
#from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
#import shap
#from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve , auc, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_score, recall_score





In [None]:
# DecisionTreeClassifier
# Set up classifier using training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 0)
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)


In [None]:
# Sort features by importance from the fitted DTC clf
ft_imp=pd.DataFrame(clf.feature_importances_,columns=["DTC"])
ft_imp.index=X.columns
# sort_values returns a new frame, so assign it back before displaying
ft_imp = ft_imp.sort_values(['DTC'], ascending=False)
ft_imp

In [None]:
# Keep the features whose importance is > 0.01 (selected by mask, so row order does not matter)
selected_features = ft_imp.index[ft_imp['DTC'] > 0.01]
feature_len = len(selected_features)

y = train[['click']]
X = train[selected_features]

test = test[selected_features]

#using new feature group to split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 0)
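
The importance threshold used above can also be expressed with scikit-learn's SelectFromModel. The sketch below is an assumption-laden illustration on synthetic data (make_classification with 10 made-up columns), not the notebook's own selection; note that SelectFromModel keeps features with importance greater than or equal to the threshold, whereas the cell above uses a strict >.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the hashed feature matrix (assumption: 10 columns, 3 informative)
X_arr, y_toy = make_classification(n_samples=500, n_features=10, n_informative=3,
                                   n_redundant=0, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])

# Fit a tree and keep the columns whose importance is >= 0.01
selector = SelectFromModel(DecisionTreeClassifier(random_state=0), threshold=0.01)
selector.fit(X_toy, y_toy)

selected = X_toy.columns[selector.get_support()].tolist()
importances = pd.Series(selector.estimator_.feature_importances_, index=X_toy.columns)
print(importances.sort_values(ascending=False))
print("Selected:", selected)
```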

In [None]:
# Refit the classifier on the selected features and predict the held-out split
y_pred = clf.fit(X_train, y_train.values.ravel()).predict(X_test)

## Regularization

### - Max_depth

In [None]:
# Create list of hyperparameters 
max_depth = [5,10,12,20,30]
param_grid = {'max_depth': max_depth}

# Use Grid search CV to find best parameters 
print("starting DTC grid search.. ")
dtc = DecisionTreeClassifier()
clf = GridSearchCV(estimator = dtc, param_grid = param_grid, scoring = 'roc_auc')
clf.fit(X_train, y_train)
print("Best Score: ")
print(clf.best_score_)
print("Best Estimator: ")
print(clf.best_estimator_)
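
To see how close the candidate depths are, rather than only the single best one, the grid search results can be tabulated from cv_results_. The sketch below runs on synthetic data with a roughly 17% positive class; the data, the variable names, and cv=5 are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the training split
X_toy, y_toy = make_classification(n_samples=1000, n_features=10,
                                   weights=[0.83, 0.17], random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [5, 10, 12, 20, 30]},
                      scoring="roc_auc", cv=5)
search.fit(X_toy, y_toy)

# One row per max_depth with its mean and spread across the folds
results = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score"]
]
print(results.sort_values("mean_test_score", ascending=False))
```

When two depths differ by less than the fold-to-fold standard deviation, picking the shallower tree is a reasonable tie-break.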

### - Cross validation : KFold

To avoid overfitting

In [None]:
# Iterate over different levels of max depth and set up k-fold
for max_depth_val in [ 5, 10, 12,20,30]:
    k_fold = KFold(n_splits = 8, random_state = 0, shuffle=True)
    clf = DecisionTreeClassifier(max_depth = max_depth_val)
    print("Evaluating Decision Tree for max_depth = %s" %(max_depth_val))
    y_pred = clf.fit(X_train, y_train).predict(X_test)

    # Calculate precision for cross validation and test
    cv_precision = cross_val_score(clf, X_train, y_train, cv = k_fold, scoring = 'precision_weighted')
    precision = precision_score(y_test, y_pred, average = 'weighted')
    print("Cross validation Precision: %s" %(cv_precision))
    print("Test Precision: %s" %(precision))
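
The loop above scores precision, while the grid search scores ROC AUC. A hedged sketch of running the same K-fold comparison on ROC AUC follows, using StratifiedKFold so every fold keeps the minority click rate. The synthetic data and the reduced depth list are assumptions, and StratifiedKFold is a swapped-in variant rather than the KFold used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the training split
X_toy, y_toy = make_classification(n_samples=1000, n_features=10,
                                   weights=[0.83, 0.17], random_state=0)

# Stratified folds preserve the class ratio in every split
cv = StratifiedKFold(n_splits=8, shuffle=True, random_state=0)

mean_auc = {}
for depth in [5, 10, 12]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="roc_auc")
    mean_auc[depth] = scores.mean()
    print(f"max_depth={depth}: mean ROC AUC = {scores.mean():.3f} (std {scores.std():.3f})")
```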

## Final

In [None]:
clf = DecisionTreeClassifier(max_depth = 10)
clf.fit(X,y.values.ravel())

In [None]:
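# Note: this scores hard 0/1 predictions on the same rows clf was fit on (training fit, not held-out performance)
y_pred = clf.predict(X)
print("Roc_auc_score: ",roc_auc_score(y,y_pred)*100,"%")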
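
ROC AUC is a ranking metric, so it is normally computed from predicted probabilities on rows the model has not seen. Passing hard 0/1 labels collapses the ROC curve to a single operating point. The sketch below contrasts the variants on synthetic data; the data and all variable names are assumptions, and the numbers it prints say nothing about the actual dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the click data
X_toy, y_toy = make_classification(n_samples=2000, n_features=10,
                                   weights=[0.83, 0.17], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2,
                                          random_state=0, stratify=y_toy)

model = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X_tr, y_tr)

auc_labels = roc_auc_score(y_te, model.predict(X_te))              # hard 0/1 predictions, held-out rows
auc_proba = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])   # predicted P(click=1), held-out rows
auc_train = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])   # same rows used for fitting

print(f"held-out AUC from labels:        {auc_labels:.3f}")
print(f"held-out AUC from probabilities: {auc_proba:.3f}")
print(f"training AUC from probabilities: {auc_train:.3f}")
```

The same distinction is one likely reason the predict and predict_proba submissions score differently.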

In [None]:
#predict
submission = pd.read_csv(samplesubmision_file, compression='gzip', index_col='id')
submission[submission.columns[0]] = clf.predict(test)
submission.to_csv('submission_pred.csv')

In [None]:
#predict_proba
submission = pd.read_csv(samplesubmision_file, compression='gzip', index_col='id')
submission[submission.columns[0]] = clf.predict_proba(test)[:,1]
submission.to_csv('submission_prob.csv')

In [None]:
submission.sample(40)