## Column Descriptions


- ACTION: 1 if the resource was approved, 0 if it was not
- RESOURCE: An ID for each resource
- MGR_ID: The EMPLOYEE ID of the manager of the current EMPLOYEE ID record; an employee may have only one manager at a time
- ROLE_ROLLUP_1: Company role grouping category id 1 (e.g. US Engineering)
- ROLE_ROLLUP_2: Company role grouping category id 2 (e.g. US Retail)
- ROLE_DEPTNAME: Company role department description (e.g. Retail)
- ROLE_TITLE: Company role business title description (e.g. Senior Engineering Retail Manager)
- ROLE_FAMILY_DESC: Company role family extended description (e.g. Retail Manager, Software Engineering)
- ROLE_FAMILY: Company role family description (e.g. Retail Manager)
- ROLE_CODE: Company role code; this code is unique to each role (e.g. Manager)


## References

- http://www.chioka.in/kaggle-competition-solutions/
- https://github.com/codelibra/Amazon-Employee-Access-Challenge/blob/master/Amazon-Employee-Access-Challenge.ipynb

In [1]:
# Load basic libraries

import pandas as pd, numpy as np
%matplotlib inline
%pylab inline
import seaborn  as sns 
import pylab as pl
import matplotlib.pyplot as plt

Populating the interactive namespace from numpy and matplotlib


In [2]:
# load CSVs

df_train = pd.read_csv('~/Kaggle.com_mini_project/Amazon_access/train.csv')
df_test = pd.read_csv('~/Kaggle.com_mini_project/Amazon_access/test.csv')
sampleSubmission = pd.read_csv('~/Kaggle.com_mini_project/Amazon_access/sampleSubmission.csv')

In [12]:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score, mean_absolute_error


def sample_split(data):
    #data =  data[selected_feature]
    # Assumes the target (ACTION) is the first column and the rest are features
    data_ = data.values.astype(float)
    Y = data_[:, 0]
    X = data_[:, 1:]
    test_size = .3
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=3)
    return X_train, X_test, y_train, y_test


def reg_analysis(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    prediction = model.predict(X_test)
    #Calculate Variance score
    Variance_score = explained_variance_score(y_test, prediction)
    print ('Variance score : %.2f' %Variance_score)
    #Mean Absolute Error
    MAE = mean_absolute_error(y_test, prediction)
    print ('Mean Absolute Error : %.2f' %MAE)
    #Root Mean Squared Error
    RMSE = mean_squared_error(y_test, prediction)**0.5
    print ('Root Mean Squared Error : %.2f' %RMSE)
    #R² score, the coefficient of determination
    r2s = r2_score(y_test, prediction)
    print ('R2  score : %.2f' %r2s)
    return model
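
Since ACTION is a binary target, classification metrics may be more informative than the regression scores printed by `reg_analysis`. The following is a minimal sketch, not part of the original notebook: `clf_analysis` is a hypothetical helper name, `LogisticRegression` is only a placeholder model, and the data is synthetic rather than `train.csv`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split


def clf_analysis(model, X_train, X_test, y_train, y_test):
    """Hypothetical counterpart to reg_analysis for a binary target."""
    model.fit(X_train, y_train)
    prediction = model.predict(X_test)
    # Rows are true classes, columns are predicted classes (order: 0, 1)
    cm = confusion_matrix(y_test, prediction, labels=[0, 1])
    print(cm)
    # Per-class precision, recall and F1
    print(classification_report(y_test, prediction, labels=[0, 1]))
    # ROC AUC is computed from predicted probabilities, not hard labels
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print('ROC AUC : %.3f' % auc)
    return cm, auc


# Synthetic, imbalanced stand-in data (assumption: not the competition data)
rng = np.random.RandomState(3)
X_demo = rng.normal(size=(2000, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=2000) > -1.5).astype(int)
Xa, Xb, ya, yb = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=3, stratify=y_demo)

cm, auc = clf_analysis(LogisticRegression(), Xa, Xb, ya, yb)
```

The confusion matrix and the per-class recall show how the minority class is handled, which a single accuracy or R2 number hides.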

In [13]:
X_train, X_test, y_train, y_test = sample_split(df_train)

# Imbalanced data

In [8]:
df_train.ACTION.value_counts()

1    30872
0     1897
Name: ACTION, dtype: int64

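## About 94% of the training data is approved access (ACTION = 1)

(only about 6% is not approved)

The training data needs to be resampled, for two reasons:

1. A model trained on such skewed data tends to make many Type I / Type II errors (false positives and false negatives) on the minority class.
2. Predicting ACTION = 1 for every row already gives an accuracy of about 94%, so accuracy alone says little about the model.

See https://en.wikipedia.org/wiki/Type_I_and_type_II_errors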
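
The point about accuracy can be checked directly from the class counts. This is a small sketch, assuming only the `value_counts()` output shown earlier (30872 approved, 1897 not approved); the "model" here is a hypothetical baseline that always predicts the majority class.

```python
import numpy as np

# Class counts taken from the value_counts() output above
n_approved, n_denied = 30872, 1897

# Labels in the same proportions as df_train.ACTION
y_true = np.concatenate([np.ones(n_approved, dtype=int),
                         np.zeros(n_denied, dtype=int)])

# A baseline that ignores its input and always predicts ACTION = 1
y_pred = np.ones_like(y_true)

accuracy = (y_pred == y_true).mean()
# Recall on the minority class: share of non-approved requests that are caught
denied_recall = (y_pred[y_true == 0] == 0).mean()

print('accuracy      : %.4f' % accuracy)
print('denied recall : %.4f' % denied_recall)
```

The baseline reaches roughly 94% accuracy while never identifying a single non-approved request, which is why accuracy on its own is a weak yardstick here.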

## Approaches



In [23]:
y = df_train['ACTION']
X = df_train[df_train.columns.difference(['ACTION'])]

### 1) Oversampling

In [24]:
from imblearn.over_sampling import RandomOverSampler


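# Convert the feature frame to a NumPy array
# (DataFrame.as_matrix() was removed in pandas 1.0)
X = X.values
# Apply the random over-sampling
# (fit_sample was renamed to fit_resample in imbalanced-learn)
ros = RandomOverSampler()
X_oversampled, y_oversampled = ros.fit_resample(X, y.values)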

In [39]:
pd.DataFrame(y_oversampled)[0].value_counts()

1    30872
0    30872
Name: 0, dtype: int64

In [53]:
X_train_overs, X_test_overs, y_train_overs, y_test_overs = \
 train_test_split(X_oversampled, y_oversampled)

In [58]:
print ('len of X_train_overs :', len(X_train_overs))

print ('len of X_test_overs :', len(X_test_overs))

len of X_train_overs : 22938
len of X_test_overs : 9831
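
One caveat with resampling before splitting: copies of the same minority row can land in both the training and the test part, which can make the test score look better than it would on unseen data. A minimal sketch of the alternative order (split first, then over-sample the training part only) follows. It is an illustration under assumptions: `oversample_minority` is a hypothetical NumPy stand-in for `RandomOverSampler`, and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def oversample_minority(X, y, seed=0):
    """Hypothetical helper: duplicate random minority rows until classes balance."""
    rng = np.random.RandomState(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    # Sample with replacement, as random over-sampling does
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]


# Synthetic stand-in with roughly the same imbalance as ACTION
rng = np.random.RandomState(3)
X_demo = rng.randint(0, 1000, size=(5000, 9))
y_demo = (rng.rand(5000) < 0.94).astype(int)

# 1) Split first; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=3, stratify=y_demo)

# 2) Over-sample the training part only; the test part keeps its natural ratio
X_tr_bal, y_tr_bal = oversample_minority(X_tr, y_tr)

print('train before :', np.bincount(y_tr))
print('train after  :', np.bincount(y_tr_bal))
print('test         :', np.bincount(y_te))
```

With this order the test set contains only original rows in their natural class ratio, so a score computed on it is closer to what the model would see in practice.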


### 2) Undersampling

In [26]:
from imblearn.under_sampling import RandomUnderSampler 
# X was already converted to a NumPy array in the over-sampling cell
# Apply the random under-sampling
rus = RandomUnderSampler()
X_undersampled, y_undersampled = rus.fit_resample(X, y.values)

In [40]:
pd.DataFrame(y_undersampled)[0].value_counts()

1    1897
0    1897
Name: 0, dtype: int64

In [60]:
X_train_unders, X_test_unders, y_train_unders, y_test_unders = \
train_test_split(X_undersampled, y_undersampled)

In [61]:
print ('len of X_train_unders :', len(X_train_unders))

print ('len of X_test_unders :', len(X_test_unders))

len of X_train_unders : 2845
len of X_test_unders : 949
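
For intuition, random under-sampling can be sketched in plain NumPy: keep as many rows from each class as the minority class has, chosen at random without replacement. This illustrates the idea only and is not imbalanced-learn's implementation; `undersample_majority` is a hypothetical helper, and the feature column is a placeholder row index.

```python
import numpy as np


def undersample_majority(X, y, seed=0):
    """Hypothetical helper: keep an equal-sized random subset of every class."""
    rng = np.random.RandomState(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    kept = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        # Without replacement: every kept row is a distinct original row
        kept.append(rng.choice(idx, size=n_keep, replace=False))
    keep = np.sort(np.concatenate(kept))
    return X[keep], y[keep]


# Labels with the class counts reported for df_train.ACTION
y_demo = np.concatenate([np.ones(30872, dtype=int), np.zeros(1897, dtype=int)])
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature: row index

X_small, y_small = undersample_majority(X_demo, y_demo)
print(np.bincount(y_small))
print('rows kept: %d of %d' % (len(y_small), len(y_demo)))
```

Only 3794 of the 32769 rows survive (about 12%), so under-sampling balances the classes at the cost of discarding most of the approved examples.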


# ML

In [55]:
# SVC (support vector classifier) on the over-sampled data

from sklearn import svm
clf_svr = svm.SVC()


clf_svr.fit(X_train_overs,y_train_overs)
clf_svr.predict(X_test_overs)
clf_svr.score(X_test_overs,y_test_overs)

0.99954651464109878

In [62]:
# SVC (support vector classifier) on the under-sampled data

from sklearn import svm
clf_svr = svm.SVC()


clf_svr.fit(X_train_unders,y_train_unders)
clf_svr.predict(X_test_unders)
clf_svr.score(X_test_unders,y_test_unders)

0.52054794520547942