[View in Colaboratory](https://colab.research.google.com/github/taiki323/kaggle_training/blob/master/Google_Analytics_Customer_Revenue_Prediction.ipynb)

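# Task
- Analyze customer data from the Google Merchandise Store and predict revenue per customer.

- For each fullVisitorId, predict the natural log of total revenue as PredictedLogRevenue. Submissions are evaluated by RMSE.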

![target](https://github.com/taiki323/image_house/blob/master/target1.PNG?raw=true)
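
The metric described above can be sketched as follows: per-session revenue is summed for each fullVisitorId, the natural log is taken, and predictions are scored by RMSE. This is only an illustration. The `+1` inside the log (`np.log1p`) is an assumption made here so that visitors with zero revenue map to 0, and the `revenue` column name and the toy numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical per-session revenue; the column name `revenue` is illustrative.
sessions = pd.DataFrame({
    "fullVisitorId": ["a", "a", "b", "c"],
    "revenue": [10.0, 5.0, 0.0, 100.0],
})

# Sum revenue per visitor, then take the natural log.
# log1p (= ln(x + 1)) is an assumption so that zero revenue maps to 0.
target = np.log1p(sessions.groupby("fullVisitorId")["revenue"].sum())


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Baseline: predict 0.0 for every visitor.
baseline = np.zeros(len(target))
print(rmse(target, baseline))
```

A constant all-zero prediction gives a simple reference score that any model should beat.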

# Setup

In [0]:
!pip install kaggle
from googleapiclient.discovery import build
import io, os
from googleapiclient.http import MediaIoBaseDownload
from google.colab import auth

auth.authenticate_user()

drive_service = build('drive', 'v3')
results = drive_service.files().list(
        q="name = 'kaggle.json'", fields="files(id)").execute()
kaggle_api_key = results.get('files', [])

filename = "/content/.kaggle/kaggle.json"
os.makedirs(os.path.dirname(filename), exist_ok=True)

request = drive_service.files().get_media(fileId=kaggle_api_key[0]['id'])
fh = io.FileIO(filename, 'wb')
downloader = MediaIoBaseDownload(fh, request)
done = False
while done is False:
    status, done = downloader.next_chunk()
    print("Download %d%%." % int(status.progress() * 100))
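# File modes are octal: 0o600 means read/write for the owner only (decimal 600 sets a different mode).
os.chmod(filename, 0o600)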

In [2]:
# !mkdir .kaggle
!mkdir ~/.kaggle
!mkdir work
%cd work
!cp /content/.kaggle/kaggle.json ~/.kaggle/kaggle.json
!kaggle competitions download -c google-analytics-customer-revenue-prediction

/content/work
Downloading sample_submission.csv.zip to /content/work
  0% 0.00/5.22M [00:00<?, ?B/s]
100% 5.22M/5.22M [00:00<00:00, 56.8MB/s]
Downloading test.csv.zip to /content/work
 62% 33.0M/53.3M [00:00<00:00, 55.4MB/s]
100% 53.3M/53.3M [00:00<00:00, 120MB/s] 
Downloading train.csv.zip to /content/work
 78% 45.0M/57.5M [00:00<00:00, 64.2MB/s]
100% 57.5M/57.5M [00:00<00:00, 121MB/s] 


In [0]:
!unzip '*.zip'

Archive:  sample_submission.csv.zip
  inflating: sample_submission.csv   

Archive:  train.csv.zip
  inflating: train.csv               

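# Preprocessing

## Inspecting the data
Each row corresponds to one visit to the store.
- fullVisitorId: unique user ID
- channelGrouping: the channel through which the user reached the store (e.g. affiliates)
- date: date of the visit
- device: the device the user used
- geoNetwork: the user's geographic region
- sessionId: session ID
- socialEngagementType: the only value present is "Not Socially Engaged"
- totals: aggregate values for the whole session
- trafficSource: information about the traffic source
- visitId: session identifier; unique only within a single user
- visitNumber: session number for the user; 1 for the first visit
- visitStartTime: time at which the visit started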
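
As a small sketch of how the ID and time columns listed above relate, the snippet below uses two rows copied from the `train.head(3)` output in this notebook. It parses `date` as a YYYYMMDD value and treats `visitStartTime` as POSIX seconds in UTC; the latter is an assumption about the column, not something this notebook confirms. The `visitStart` column name is introduced here for illustration.

```python
import pandas as pd

# Two rows copied from the train.head(3) output; IDs are kept as strings.
visits = pd.DataFrame({
    "fullVisitorId": ["1131660440785968503", "377306020877927890"],
    "sessionId": ["1131660440785968503_1472830385",
                  "377306020877927890_1472880147"],
    "visitId": [1472830385, 1472880147],
    "date": [20160902, 20160902],
    "visitStartTime": [1472830385, 1472880147],
})

# `date` is stored as a YYYYMMDD integer.
visits["date"] = pd.to_datetime(visits["date"].astype(str), format="%Y%m%d")

# Assumption: `visitStartTime` is POSIX seconds (UTC).
visits["visitStart"] = pd.to_datetime(visits["visitStartTime"], unit="s")

# In these rows, sessionId equals fullVisitorId + "_" + visitId.
rebuilt = visits["fullVisitorId"] + "_" + visits["visitId"].astype(str)
print((rebuilt == visits["sessionId"]).all())
print(visits[["date", "visitStart"]])
```

Because visitId is unique only within a user, pairing it with fullVisitorId (as sessionId does in these rows) is the safer key for a session.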

In [24]:
test['visitNumber'].value_counts()

1      604370
2       89994
3       35119
4       18729
5       11699
6        8025
7        5781
8        4307
9        3266
10       2603
11       2071
12       1710
13       1437
14       1219
15       1053
16        928
17        813
18        735
19        654
20        576
21        509
22        467
23        411
24        376
25        357
26        331
27        304
28        283
29        263
30        248
        ...  
354         1
355         1
356         1
357         1
358         1
359         1
360         1
338         1
335         1
313         1
322         1
314         1
315         1
316         1
317         1
318         1
319         1
320         1
321         1
323         1
334         1
324         1
327         1
328         1
329         1
330         1
331         1
332         1
333         1
457         1
Name: visitNumber, Length: 446, dtype: int64

In [2]:
import pandas as pd
import numpy as np
import re
import sklearn
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

import warnings
warnings.filterwarnings('ignore')

# Going to use these 5 base models for the stacking
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier, 
                              GradientBoostingClassifier, ExtraTreesClassifier)
from sklearn.svm import SVC
# sklearn.cross_validation was removed in scikit-learn 0.20; KFold now lives in model_selection.
from sklearn.model_selection import KFold

In [0]:
train = pd.read_csv("/content/work/train.csv")
test = pd.read_csv("/content/work/test.csv")

In [20]:
print(train.shape)
print(test.shape)
train.head(3)

(903653, 12)
(804684, 12)


| | channelGrouping | date | device | fullVisitorId | geoNetwork | sessionId | socialEngagementType | totals | trafficSource | visitId | visitNumber | visitStartTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Organic Search | 20160902 | `{"browser": "Chrome", "browserVersion": "not a...` | 1131660440785968503 | `{"continent": "Asia", "subContinent": "Western...` | 1131660440785968503_1472830385 | Not Socially Engaged | `{"visits": "1", "hits": "1", "pageviews": "1",...` | `{"campaign": "(not set)", "source": "google", ...` | 1472830385 | 1 | 1472830385 |
| 1 | Organic Search | 20160902 | `{"browser": "Firefox", "browserVersion": "not ...` | 377306020877927890 | `{"continent": "Oceania", "subContinent": "Aust...` | 377306020877927890_1472880147 | Not Socially Engaged | `{"visits": "1", "hits": "1", "pageviews": "1",...` | `{"campaign": "(not set)", "source": "google", ...` | 1472880147 | 1 | 1472880147 |
| 2 | Organic Search | 20160902 | `{"browser": "Chrome", "browserVersion": "not a...` | 3895546263509774583 | `{"continent": "Europe", "subContinent": "South...` | 3895546263509774583_1472865386 | Not Socially Engaged | `{"visits": "1", "hits": "1", "pageviews": "1",...` | `{"campaign": "(not set)", "source": "google", ...` | 1472865386 | 1 | 1472865386 |
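
The `device`, `geoNetwork`, `totals`, and `trafficSource` columns hold JSON strings, as the `train.head(3)` output shows. One possible way to expand them into ordinary columns is sketched below. The helper name `flatten_json_columns` and the toy frame are hypothetical; the toy rows use only keys that are visible in the truncated output. The sketch assumes pandas 1.0 or later, where `pd.json_normalize` is available at the top level.

```python
import json

import pandas as pd

JSON_COLUMNS = ["device", "geoNetwork", "totals", "trafficSource"]


def flatten_json_columns(df, columns=JSON_COLUMNS):
    """Expand JSON-string columns into `<column>.<key>` columns (hypothetical helper)."""
    df = df.copy()
    for col in columns:
        if col not in df.columns:
            continue  # skip columns that are absent from this frame
        # Parse each JSON string into a dict, then spread the keys into columns.
        parsed = pd.json_normalize(df[col].map(json.loads).tolist())
        parsed.columns = [f"{col}.{key}" for key in parsed.columns]
        parsed.index = df.index  # keep rows aligned with the source frame
        df = df.drop(columns=col).join(parsed)
    return df


# Toy rows built from keys visible in the truncated output; not the real data.
sample_rows = pd.DataFrame({
    "channelGrouping": ["Organic Search", "Organic Search"],
    "device": ['{"browser": "Chrome"}', '{"browser": "Firefox"}'],
    "totals": ['{"visits": "1", "hits": "1", "pageviews": "1"}'] * 2,
})

flat = flatten_json_columns(sample_rows)
print(flat.columns.tolist())
```

The expanded values are still strings (for example `"1"`), so numeric fields would need a conversion such as `pd.to_numeric` before modeling.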


In [13]:
sample = pd.read_csv("/content/work/sample_submission.csv")
sample.head()

| | fullVisitorId | PredictedLogRevenue |
|---|---|---|
| 0 | 259678714014 | 0.0 |
| 1 | 49363351866189 | 0.0 |
| 2 | 53049821714864 | 0.0 |
| 3 | 59488412965267 | 0.0 |
| 4 | 85840370633780 | 0.0 |


0.0
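
The fullVisitorId values in the sample submission are parsed as integers by default. If the identifiers in the CSV files carry leading zeros (an assumption; this notebook does not verify it), integer parsing would silently drop them and the IDs would no longer match the submission format. A minimal sketch of the difference, using hypothetical CSV text:

```python
import io

import pandas as pd

# Hypothetical CSV text; the ID "000123" is illustrative, not real data.
csv_text = "fullVisitorId,PredictedLogRevenue\n000123,0.0\n"

# Default parsing infers an integer column and drops the leading zeros.
as_int = pd.read_csv(io.StringIO(csv_text))

# Forcing the column to str keeps the identifier exactly as written.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"fullVisitorId": str})

print(as_int["fullVisitorId"].iloc[0])
print(as_str["fullVisitorId"].iloc[0])
```

The same `dtype={"fullVisitorId": str}` argument can be passed when reading `train.csv` and `test.csv`, so that IDs stay consistent across all three files.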