# Classification

## Team Name
>### Sigma  

## Team Member
>### 조현윤, 이상협, 정하연  

## Objective
> ### in search of better methods of using this behavioral data to predict which individuals they should approach—and even when and how to approach them.
> ### to create a classification algorithm that accurately identifies which customers have the most potential business value for Red Hat based on their characteristics and activities.
> ### to predict the potential business value of a person who has performed a specific activity.

## Evaluation
> ### valuated on area under the ROC curve between the predicted and the observed outcome.

## Submission File
> ### For each activity_id in the test set, you must predict a probability for the 'outcome' variable, represented by a number between 0 and 1.
~~~~
activity_id,outcome
act1_1,0
act1_100006,0
act1_100050,0
~~~~

## Data
> ### uses two separate data files that may be joined together to create a single, unified data table: a people file and an activity file.
> ### The people file contains all of the unique people (and the corresponding characteristics) that have performed activities over time. Each row in the people file represents a unique person. Each person has a unique people_id.
> ### The activity file contains all of the unique activities (and the corresponding activity characteristics) that each person has performed over time. Each row in the activity file represents a unique activity performed by a person on a certain date. Each activity has a unique activity_id.
> ### The activity file contains several different categories of activities. 
>> Type 1 activities are different from type 2-7 activities because there are more known characteristics associated with type 1 activities (nine in total) than type 2-7 activities (which have only one associated characteristic).
> ### To develop a predictive model with this data, you will likely need to join the files together into a single data set. The two files can be joined together using person_id as the common key. All variables are categorical, with the exception of 'char_38' in the people file, which is a continuous numerical variable.

## Reference 
[kaggel Predicting Red Hat Business Value](https://www.kaggle.com/c/predicting-red-hat-business-value)

### Load Python Package

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from datetime import datetime
from datetime import date
import seaborn as sns
import statsmodels.api as sm
import statsmodels.stats.api as sms
import statsmodels.stats.stattools as stools
import scipy as sp
%matplotlib inline

  from pandas.core import datetools


In [2]:
pd.__version__

'0.21.0'

## Exploratory Data Analysis (EDA )¶

## Load Data Set

In [3]:
# activity data set
act_Train = pd.read_csv('./data/act_train.csv')
act_Test = pd.read_csv('./data/act_test.csv')
# people data set
people = pd.read_csv('./data/people.csv')

In [4]:
def preprocess_acts(data, train_set=True):

    # Getting rid of data feature for now

    data.date = pd.to_datetime(data.date)


    # Add features from date
    data["year"] = data.date.apply(lambda x: x.year)
    data["month"] = data.date.apply(lambda x: x.month)
    data["day"] = data.date.apply(lambda x: x.day)
    data = data.drop(["date"], axis = 1)
    return data

def preprocess_people(data):



    data.date = pd.to_datetime(data.date)
    #  Values in the people df is Booleans and Strings
    columns = list(data.columns)
    columns.remove("date")
    bools = columns[11:]
    strings = columns[1:11]

    for col in bools:
        data[col] = pd.to_numeric(data[col]).astype(int)


    # Add features from date
    data["year_p"] = data.date.apply(lambda x: x.year)
    data["month_p"] = data.date.apply(lambda x: x.month)
    data["day_p"] = data.date.apply(lambda x: x.day)


    data = data.drop(['date'], axis=1)
    return data

In [5]:
action_people = preprocess_people(people)
actions_train = preprocess_acts(act_Train)
actions_test = preprocess_acts(act_Test, train_set=False)

In [109]:
people.head()

Unnamed: 0,people_id,char_1,group_1,char_2,date,char_3,char_4,char_5,char_6,char_7,...,char_32,char_33,char_34,char_35,char_36,char_37,char_38,year_p,month_p,day_p
0,ppl_100,type 2,group 17304,type 2,2021-06-29,type 5,type 5,type 5,type 3,type 11,...,0,0,1,1,1,0,36,2021,6,29
1,ppl_100002,type 2,group 8688,type 3,2021-01-06,type 28,type 9,type 5,type 3,type 11,...,1,1,1,1,1,0,76,2021,1,6
2,ppl_100003,type 2,group 33592,type 3,2022-06-10,type 4,type 8,type 5,type 2,type 5,...,1,1,1,0,1,1,99,2022,6,10
3,ppl_100004,type 2,group 22593,type 3,2022-07-20,type 40,type 25,type 9,type 4,type 16,...,1,1,1,1,1,1,76,2022,7,20
4,ppl_100006,type 2,group 6534,type 3,2022-07-27,type 40,type 25,type 9,type 3,type 8,...,0,0,0,1,1,0,84,2022,7,27


In [6]:
trainMerge = pd.merge(actions_train, people, on='people_id')
trainMerge.tail()


Unnamed: 0,people_id,activity_id,activity_category,char_1_x,char_2_x,char_3_x,char_4_x,char_5_x,char_6_x,char_7_x,...,char_32,char_33,char_34,char_35,char_36,char_37,char_38,year_p,month_p,day_p
2197286,ppl_99994,act2_4668076,type 4,,,,,,,,...,1,0,1,1,1,1,95,2023,1,6
2197287,ppl_99994,act2_4743548,type 4,,,,,,,,...,1,0,1,1,1,1,95,2023,1,6
2197288,ppl_99994,act2_536973,type 2,,,,,,,,...,1,0,1,1,1,1,95,2023,1,6
2197289,ppl_99994,act2_688656,type 4,,,,,,,,...,1,0,1,1,1,1,95,2023,1,6
2197290,ppl_99994,act2_715089,type 2,,,,,,,,...,1,0,1,1,1,1,95,2023,1,6


In [100]:
trainMerge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2197291 entries, 0 to 2197290
Data columns (total 57 columns):
people_id            object
activity_id          object
activity_category    object
char_1_x             object
char_2_x             object
char_3_x             object
char_4_x             object
char_5_x             object
char_6_x             object
char_7_x             object
char_8_x             object
char_9_x             object
char_10_x            object
outcome              int64
year                 int64
month                int64
day                  int64
char_1_y             object
group_1              object
char_2_y             object
date                 object
char_3_y             object
char_4_y             object
char_5_y             object
char_6_y             object
char_7_y             object
char_8_y             object
char_9_y             object
char_10_y            bool
char_11              bool
char_12              bool
char_13              bool
cha

In [168]:
trainMerge[trainMerge['date_x'] ==trainMerge['date_y']].groupby('outcome').count()

Unnamed: 0_level_0,people_id,activity_id,date_x,activity_category,char_1_x,char_2_x,char_3_x,char_4_x,char_5_x,char_6_x,...,char_29,char_30,char_31,char_32,char_33,char_34,char_35,char_36,char_37,char_38
outcome,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,83034,83034,83034,83034,31196,31196,31196,31196,31196,31196,...,83034,83034,83034,83034,83034,83034,83034,83034,83034,83034
1,49239,49239,49239,49239,16178,16178,16178,16178,16178,16178,...,49239,49239,49239,49239,49239,49239,49239,49239,49239,49239


In [126]:
for idx in trainMerge.columns:
    if 'type 0' in list(trainMerge[idx].unique()):
        print (idx, 'type 0 ')
    else:pass

# 조현윤님 code

In [7]:
import pandas as pd
import numpy as np
import datetime
import pandas as pd
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.metrics import roc_auc_score
import xgboost as xgb
import random
from operator import itemgetter
import time
import copy
import scipy as sp
from sklearn.cross_validation import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.preprocessing import LabelEncoder
import xgboost

def create_feature_map(features):
    outfile = open('xgb.fmap', 'w')
    for i, feat in enumerate(features):
        outfile.write('{0}\t{1}\tq\n'.format(i, feat))
    outfile.close()


def get_importance(gbm, features):
    create_feature_map(features)
    importance = gbm.get_fscore(fmap='xgb.fmap')
    importance = sorted(importance.items(), key=itemgetter(1), reverse=True)
    return importance


def intersect(a, b):
    return list(set(a) & set(b))


def get_features(train, test):
    trainval = list(train.columns.values)
    testval = list(test.columns.values)
    output = intersect(trainval, testval)
    output.remove('people_id')
    return sorted(output)

def Load_DataSet():
    print ("loading .....  act_train")
    act_train = pd.read_csv('./data/act_train.csv',
                            dtype={'people_id': np.str,
                                'activity_id': np.str,
                                'outcome': np.int8},
                         parse_dates=['date'])
    print ("loading ......  act_test")
    act_test = pd.read_csv('./data/act_test.csv',
                           dtype={'people_id': np.str,
                                'activity_id': np.str},
                         parse_dates=['date'])
    print ("loading ..... people")
    people = pd.read_csv('./data/people.csv',
                         dtype={'people_id': np.str,
                               'activity_id': np.str,
                               'char_38': np.int32},
                        parse_dates=['date'])
    print ("Merge ......")
    trainMerge= act_train.merge(people, on="people_id")
    testMerge = act_test.merge(people, on="people_id")
    print('Number of active people in train : {}'.format(trainMerge['people_id'].nunique()))
    print('Number of active people in test : {}'.format(testMerge['people_id'].nunique()))
    print ("Processing Train")
    for idx in trainMerge.columns:
        print (idx)
        if idx not in ['people_id', 'activity_id', 'date_x','date_y', 'char_38', 'outcome']:
            if trainMerge[idx].dtype == 'object':
                trainMerge.fillna('type 0', inplace = True)
                trainMerge[idx] = trainMerge[idx].apply(lambda x:x.split(' ')[1]).astype(np.int32)
            elif trainMerge[idx].dtype == 'bool':
                trainMerge[idx] = trainMerge[idx].astype(np.int8)
    trainMerge['date_x'] = pd.to_datetime(trainMerge['date_x'])
    trainMerge['date_y'] = pd.to_datetime(trainMerge['date_y'])
    trainMerge['year_x'] = trainMerge['date_x'].dt.year
    trainMerge['month_x'] = trainMerge['date_x'].dt.month
    trainMerge['day_x'] = trainMerge['date_x'].dt.day
    trainMerge['weekday_x'] = trainMerge['date_x'].dt.weekday
    trainMerge['weekend_x'] = ((trainMerge.weekday_x == 0) | (trainMerge.weekday_x == 6)).astype(int)
    trainMerge = trainMerge.drop('date_x', axis = 1)

    trainMerge['year_y'] = trainMerge['date_y'].dt.year
    trainMerge['month_y'] = trainMerge['date_y'].dt.month
    trainMerge['day_y'] = trainMerge['date_y'].dt.day
    trainMerge['weekday_y'] = trainMerge['date_y'].dt.weekday
    trainMerge['weekend_y'] = ((trainMerge.weekday_y == 0) | (trainMerge.weekday_y == 6)).astype(int)
    trainMerge = trainMerge.drop('date_y', axis = 1)

    print ("Processing Test")
    for idx in testMerge.columns:
        print (idx)
        if idx not in ['people_id', 'activity_id', 'date_x','date_y', 'char_38', 'outcome']:
            if testMerge[idx].dtype == 'object':
                testMerge.fillna('type 0', inplace = True)
                testMerge[idx] = testMerge[idx].apply(lambda x:x.split(' ')[1]).astype(np.int32)
            elif testMerge[idx].dtype == 'bool':
                testMerge[idx] = testMerge[idx].astype(np.int8)
    testMerge['date_x'] = pd.to_datetime(testMerge['date_x'])
    testMerge['date_y'] = pd.to_datetime(testMerge['date_y'])
    testMerge['year_x'] = testMerge['date_x'].dt.year
    testMerge['month_x'] = testMerge['date_x'].dt.month
    testMerge['day_x'] = testMerge['date_x'].dt.day
    testMerge['weekday_x'] = testMerge['date_x'].dt.weekday
    testMerge['weekend_x'] = ((testMerge.weekday_x == 0) | (testMerge.weekday_x == 6)).astype(int)
    testMerge = testMerge.drop('date_x', axis = 1)

    testMerge['year_y'] = testMerge['date_y'].dt.year
    testMerge['month_y'] = testMerge['date_y'].dt.month
    testMerge['day_y'] = testMerge['date_y'].dt.day
    testMerge['weekday_y'] = testMerge['date_y'].dt.weekday
    testMerge['weekend_y'] = ((testMerge.weekday_y == 0) | (testMerge.weekday_y == 6)).astype(int)
    testMerge = testMerge.drop('date_y', axis = 1)
    return trainMerge, testMerge


def run(train, test, random_state=0):
    eta = 0.5
    max_depth = 5
    subsample = 0.6
    colsample_bytree = 0.8
    start_time = time.time()
    params ={
        "objective": "binary:logistic",
        "booster" : "gbtree",
        "eval_metric": "auc",
        "max_depth" : max_depth,
        "subsample": subsample,
        "colsample_bytree": colsample_bytree,
        "silent":1,
        "seed": random_state
    }
    num_boost_round = 120
    early_stopping_rounds = 10
    test_size = 0.1
    X_train, X_valid = train_test_split(train, test_size=test_size, random_state=random_state)
    y_train = X_train['outcome']
    y_valid = X_valid['outcome']
    X_train = X_train.drop(['people_id','activity_id','outcome'], axis = 1)
    X_valid = X_valid.drop(['people_id','activity_id','outcome'], axis = 1)
    dtrain = xgb.DMatrix(X_train, y_train)
    dvalid = xgb.DMatrix(X_valid, y_valid)
    watchlist = [(dtrain, 'train'), (dvalid, 'eval')]
    gbm = xgb.train(params, dtrain, num_boost_round, evals=watchlist, early_stopping_rounds=early_stopping_rounds, verbose_eval=True)

    check = gbm.predict(xgb.DMatrix(X_valid), ntree_limit=gbm.best_iteration+1)
    score = roc_auc_score(y_valid, check)
    testActivityId = test['activity_id']
    test = test.drop(['people_id','activity_id'],axis = 1)
    test_prediction = gbm.predict(xgb.DMatrix(test), ntree_limit=gbm.best_iteration+1)
    imp = get_importance(gbm,X_train.columns)
    print ('importance array: ', imp)
    out = pd.concat([testActivityId,pd.DataFrame(test_prediction.round())],axis = 1)
    out.rename({0:'outcome'},axis = 1, inplace = True)
    return out

def Main():
    train, test = Load_DataSet()
    out = run(train,test)
    out.to_csv('./submission02.csv',index = False)









if __name__ == "__main__":
    Main()





loading .....  act_train
loading ......  act_test
loading ..... people
Merge ......
Number of active people in train : 151295
Number of active people in test : 37823
Processing Train
people_id
activity_id
date_x
activity_category
char_1_x
char_2_x
char_3_x
char_4_x
char_5_x
char_6_x
char_7_x
char_8_x
char_9_x
char_10_x
outcome
char_1_y
group_1
char_2_y
date_y
char_3_y
char_4_y
char_5_y
char_6_y
char_7_y
char_8_y
char_9_y
char_10_y
char_11
char_12
char_13
char_14
char_15
char_16
char_17
char_18
char_19
char_20
char_21
char_22
char_23
char_24
char_25
char_26
char_27
char_28
char_29
char_30
char_31
char_32
char_33
char_34
char_35
char_36
char_37
char_38
Processing Test
people_id
activity_id
date_x
activity_category
char_1_x
char_2_x
char_3_x
char_4_x
char_5_x
char_6_x
char_7_x
char_8_x
char_9_x
char_10_x
char_1_y
group_1
char_2_y
date_y
char_3_y
char_4_y
char_5_y
char_6_y
char_7_y
char_8_y
char_9_y
char_10_y
char_11
char_12
char_13
char_14
char_15
char_16
char_17
char_18
char_19
char_20
c

## merge한 상태에서 모든 featrue 영향력 check

### X(data set) 모음

In [114]:
#date 둘 다 넣은 set
X = trainMerge.drop(['people_id','activity_id','outcome','date'],axis = 1)
y = trainMerge['outcome']

In [116]:
#date 둘 다 뺀 set
X = trainMerge.drop(['people_id','activity_id','date_x','date_y','outcome'],axis = 1)
y = trainMerge['outcome']

In [228]:
#date x하나 넣은 set
X = trainMerge.drop(['people_id','activity_id','date_y','outcome'],axis = 1)
y = trainMerge['outcome']

In [9]:
#date y하나 넣은 set
X = trainMerge.drop(['people_id','activity_id','date_x','outcome'],axis = 1)
y = trainMerge['outcome']

## label encoding

In [115]:
from sklearn.preprocessing import LabelEncoder

#LabelEncoder.fit(X,y)
for idx in X.columns:
    X[idx] = X[idx].fillna('type 0')
    X[idx] = LabelEncoder().fit_transform(X[idx])

## xgboost

In [175]:
X = trainMerge.drop(['people_id','activity_id','outcome'],axis = 1)
y = trainMerge['outcome']

In [230]:
for idx in X.columns:
    X[idx] = X[idx].fillna('type 0')
    X[idx] = LabelEncoder().fit_transform(X[idx])

In [48]:
import xgboost



In [106]:
#date 둘다 뺀 set
X = trainMerge.drop(['people_id','activity_id','date_x','date_y','outcome'],axis = 1)
y = trainMerge['outcome']

In [130]:
X = trainMerge.drop(['people_id','activity_id','date','outcome'],axis = 1)
y = trainMerge['outcome']

In [131]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2197291 entries, 0 to 2197290
Data columns (total 56 columns):
activity_category    object
char_1_x             object
char_2_x             object
char_3_x             object
char_4_x             object
char_5_x             object
char_6_x             object
char_7_x             object
char_8_x             object
char_9_x             object
char_10_x            object
year                 int64
month                int64
day                  int64
char_1_y             object
group_1              object
char_2_y             object
char_3_y             object
char_4_y             object
char_5_y             object
char_6_y             object
char_7_y             object
char_8_y             object
char_9_y             object
char_10_y            int64
char_11              int64
char_12              int64
char_13              int64
char_14              int64
char_15              int64
char_16              int64
char_17              int64
ch

In [132]:
model_xgb = xgboost.XGBClassifier(n_estimators=50, max_depth=2)
%time
Xx = model_xgb.fit(X_train,y_train)

CPU times: user 6 µs, sys: 4 µs, total: 10 µs
Wall time: 13.8 µs


In [133]:
score = Xx.score(X_test, y_test)
"Mean accuracy of XgBooster : {0}".format(score)

'Mean accuracy of XgBooster : 0.8413640407865125'

In [134]:
#importance 확인
for ix in range(len(X.columns)):
    print (X.columns[ix], Xx.feature_importances_[ix])

activity_category 0.0
char_1_x 0.0
char_2_x 0.0
char_3_x 0.0
char_4_x 0.0
char_5_x 0.0
char_6_x 0.0
char_7_x 0.0
char_8_x 0.0
char_9_x 0.0
char_10_x 0.0
year 0.0
month 0.0
day 0.0
char_1_y 0.04
group_1 0.266667
char_2_y 0.133333
char_3_y 0.0133333
char_4_y 0.0
char_5_y 0.0
char_6_y 0.12
char_7_y 0.08
char_8_y 0.0
char_9_y 0.0
char_10_y 0.0
char_11 0.0
char_12 0.0
char_13 0.0
char_14 0.0
char_15 0.0
char_16 0.0
char_17 0.0
char_18 0.0
char_19 0.0
char_20 0.0
char_21 0.0
char_22 0.0
char_23 0.0
char_24 0.0
char_25 0.0
char_26 0.0
char_27 0.0
char_28 0.0
char_29 0.00666667
char_30 0.00666667
char_31 0.0
char_32 0.0
char_33 0.0
char_34 0.0
char_35 0.0
char_36 0.0
char_37 0.0
char_38 0.333333
year_p 0.0
month_p 0.0
day_p 0.0


## 영향력이 컸었던 char_10,뿐 아니라, char1~10이 모두 0이 됨
### people의 char_no의 영향력이 중요?

# Logistic Regression

In [139]:
from sklearn.preprocessing import LabelEncoder

#LabelEncoder.fit(X,y)
for idx in X.columns:
    X[idx] = X[idx].fillna('type 0')
    X[idx] = LabelEncoder().fit_transform(X[idx])

In [112]:
X.tail()

Unnamed: 0,date_x,activity_category,char_1_x,char_2_x,char_3_x,char_4_x,char_5_x,char_6_x,char_7_x,char_8_x,...,char_29,char_30,char_31,char_32,char_33,char_34,char_35,char_36,char_37,char_38
2197286,334,3,0,0,0,0,0,0,0,0,...,1,1,1,1,0,1,1,1,1,95
2197287,256,3,0,0,0,0,0,0,0,0,...,1,1,1,1,0,1,1,1,1,95
2197288,186,1,0,0,0,0,0,0,0,0,...,1,1,1,1,0,1,1,1,1,95
2197289,289,3,0,0,0,0,0,0,0,0,...,1,1,1,1,0,1,1,1,1,95
2197290,333,1,0,0,0,0,0,0,0,0,...,1,1,1,1,0,1,1,1,1,95


In [109]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=100)
# Fit the model to our training data
lr = lr.fit(X, y)
score = lr.score(X, y)
"Mean accuracy of Logistic Regression: {0}".format(score)

'Mean accuracy of Logistic Regression: 0.8341648875820271'

# Decision Tree

In [135]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [117]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)
model_tree = DecisionTreeClassifier().fit(X_train, y_train)

In [136]:
from sklearn.model_selection import cross_val_score

scores=cross_val_score(model_tree, X_test, y_test, scoring="accuracy", cv=5)
print(scores)

[ 0.967779    0.96717562  0.96726665  0.96711836  0.96741418]


## Random Forest

In [72]:
from sklearn.model_selection import train_test_split

In [141]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=6)

In [142]:
from sklearn.ensemble import ExtraTreesClassifier

model_forest = ExtraTreesClassifier(n_estimators=100)


model_forest.fit(X_train,y_train)


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [143]:
score = model_forest.score(X_test, y_test)
"Mean accuracy of Random Forest: {0}".format(score)

'Mean accuracy of Random Forest: 0.9934214568367015'

## Support Vector Machine

In [None]:
from sklearn.svm import SVC
svc_1 = SVC(C=0.3,kernel='rbf', gamma=20).fit(X_train, y_train)

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
score = svc_1.score(X_test,y_test)

"C=0.3, test size=0.2, Score is: {0}".format(score)

In [None]:
#date 둘 다 넣은 set
testx = test_x.drop(['people_id','activity_id','outcome'],axis = 1)
testy = testMerge['outcome']

## making test set

In [144]:
testmerge = pd.merge(actions_test, people, on='people_id')

In [145]:
testmerge2 = pd.merge(actions_test, people, on='people_id')

In [146]:
testmerge.head(3)

Unnamed: 0,people_id,activity_id,activity_category,char_1_x,char_2_x,char_3_x,char_4_x,char_5_x,char_6_x,char_7_x,...,char_32,char_33,char_34,char_35,char_36,char_37,char_38,year_p,month_p,day_p
0,ppl_100004,act1_249281,type 1,type 5,type 10,type 5,type 1,type 6,type 1,type 1,...,1,1,1,1,1,1,76,2022,7,20
1,ppl_100004,act2_230855,type 5,,,,,,,,...,1,1,1,1,1,1,76,2022,7,20
2,ppl_10001,act1_240724,type 1,type 12,type 1,type 5,type 4,type 6,type 1,type 1,...,1,1,1,1,1,1,90,2022,10,14


In [91]:
testmerge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 498687 entries, 0 to 498686
Data columns (total 55 columns):
people_id            498687 non-null object
activity_category    498687 non-null object
char_1_x             40092 non-null object
char_2_x             40092 non-null object
char_3_x             40092 non-null object
char_4_x             40092 non-null object
char_5_x             40092 non-null object
char_6_x             40092 non-null object
char_7_x             40092 non-null object
char_8_x             40092 non-null object
char_9_x             40092 non-null object
char_10_x            458595 non-null object
year                 498687 non-null int64
month                498687 non-null int64
day                  498687 non-null int64
char_1_y             498687 non-null object
group_1              498687 non-null object
char_2_y             498687 non-null object
date                 498687 non-null object
char_3_y             498687 non-null object
char_4_y             

In [147]:
for idx in testmerge:
    testmerge[idx] = testmerge[idx].fillna('type 0')
    testmerge[idx] = LabelEncoder().fit_transform(testmerge[idx])

In [158]:
Xtest = testmerge.drop(['people_id','activity_id','date'],axis = 1)


In [157]:
Xtest.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 498687 entries, 0 to 498686
Data columns (total 57 columns):
activity_category    498687 non-null int64
char_1_x             498687 non-null int64
char_2_x             498687 non-null int64
char_3_x             498687 non-null int64
char_4_x             498687 non-null int64
char_5_x             498687 non-null int64
char_6_x             498687 non-null int64
char_7_x             498687 non-null int64
char_8_x             498687 non-null int64
char_9_x             498687 non-null int64
char_10_x            498687 non-null int64
year                 498687 non-null int64
month                498687 non-null int64
day                  498687 non-null int64
char_1_y             498687 non-null int64
group_1              498687 non-null int64
char_2_y             498687 non-null int64
date                 498687 non-null int64
char_3_y             498687 non-null int64
char_4_y             498687 non-null int64
char_5_y             498687 n

In [159]:
test_result_y = model_forest.predict(Xtest)

In [155]:
test_result_y

array([1, 1, 1, ..., 0, 0, 0])

In [129]:
test_result_y = test_result_y.transpose

In [134]:
test_result_y

<function ndarray.transpose>

In [46]:
type(testmerge)

pandas.core.frame.DataFrame

In [160]:
activity_test = pd.concat([testmerge2['activity_id'],pd.Series(test_result_y)], axis =1 )

In [157]:
activity_test.head(3)

Unnamed: 0,activity_id,0
0,act1_249281,1
1,act2_230855,1
2,act1_240724,1


In [161]:
activity_test.columns.values[1] = "outcome"

In [22]:
activity_test = pd.concat([testmerge2['activity_id'],pd.Series(test_result_y)], axis =1 )

In [162]:
activity_test

Unnamed: 0,activity_id,outcome
0,act1_249281,0
1,act2_230855,0
2,act1_240724,1
3,act1_83552,1
4,act2_1043301,1
5,act2_112890,1
6,act2_1169930,1
7,act2_1924448,1
8,act2_1953554,1
9,act2_1971739,1


In [163]:
activity_test.to_csv('submission_action.csv', index=False)

### voting classifier

## ROC_Curve


In [237]:
score = model_forest.score(X_train, y_train)
"Mean accuracy of Random Forest cv: {0}".format(score)

'Mean accuracy of Random Forest cv: 0.9843750711103223'

In [239]:
from sklearn.metrics import roc_curve, auc , roc_auc_score, accuracy_score

In [249]:
# Determine the false positive and true positive rates
fpr, tpr, _ = roc_curve(y, model_tree.predict_proba(X)[:,1])
 
# Calculate the AUC
roc_auc = auc(fpr, tpr)
print ('ROC AUC: %0.2f' % roc_auc)
 
# Plot of a ROC curve for a specific class
plt.figure()
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()

ValueError: bad input shape (2197291, 2)

## K-means

In [194]:
from sklearn.cluster import KMeans
dfX = people.copy()

In [195]:
for idx in dfX.columns:
    dfX[idx] = dfX[idx].fillna('type 0')
    dfX[idx] = LabelEncoder().fit_transform(dfX[idx])

In [111]:
people.groupby('group_1')['people_id'].count()

group_1
group 1         1
group 10        1
group 100       3
group 1000     51
group 10001     1
group 10002     1
group 10003     1
group 10004     8
group 10005     1
group 10006     1
group 10008    12
group 1001     56
group 10011     1
group 10012     6
group 10013     1
group 10016     1
group 10018     1
group 10019     2
group 1002      1
group 10020     1
group 10021     1
group 10022     1
group 10023     1
group 10025    62
group 1003      4
group 10030     1
group 10032     4
group 10033     1
group 10036     1
group 10037     1
               ..
group 9961      1
group 9962      3
group 9963      3
group 9964      1
group 9965      1
group 9967      1
group 9968      4
group 9969      2
group 997      18
group 9970     10
group 9973      2
group 9974      4
group 9976      1
group 9977      7
group 9978      2
group 998       2
group 9980      5
group 9981      1
group 9982      1
group 9984      2
group 9985      4
group 9986      1
group 9987      1
group 9988      1
gr

In [196]:
for num in range(100,251):
    kmModel = KMeans(n_clusters=num, init='random',n_init=5, max_iter=10).fit(dfX)
    print (num, kmModel.score(dfX))

100 -1.56993300627e+12
101 -1.60538375643e+12
102 -1.57720156582e+12
103 -1.61791004237e+12
104 -1.57511934685e+12
105 -1.51589342419e+12


KeyboardInterrupt: 

In [None]:
kmModle.score(dfX)

In [85]:
act_Train.groupby('date')['activity_id'].count()

date
2022-07-17    1240
2022-07-18     718
2022-07-19    3734
2022-07-20    5360
2022-07-21    4611
2022-07-22    4627
2022-07-23    3826
2022-07-24     711
2022-07-25     656
2022-07-26    4374
2022-07-27    5730
2022-07-28    4430
2022-07-29    5316
2022-07-30    3497
2022-07-31     726
2022-08-01     722
2022-08-02    3608
2022-08-03    4741
2022-08-04    5854
2022-08-05    5751
2022-08-06    4193
2022-08-07     752
2022-08-08     772
2022-08-09    4346
2022-08-10    6660
2022-08-11    4810
2022-08-12    3818
2022-08-13    4971
2022-08-14     697
2022-08-15     233
              ... 
2023-08-02    3493
2023-08-03    6700
2023-08-04    8824
2023-08-05    4521
2023-08-06    1421
2023-08-07     556
2023-08-08    4205
2023-08-09    5150
2023-08-10    4008
2023-08-11    4326
2023-08-12    3987
2023-08-13     956
2023-08-14     660
2023-08-15    3078
2023-08-16    6870
2023-08-17    5696
2023-08-18    3385
2023-08-19      58
2023-08-20       9
2023-08-21      18
2023-08-22    5186
2023-08