#### This kernel is for Porto Seguro’s Safe Driver Prediction competition and is copied from 'Data Preparation & Exploration' by Bert Carremans

#### Data Preparation & Exploration : [URL](https://www.kaggle.com/bertcarremans/data-preparation-exploration) 

*Thanks for sharing the kernel, Bert Carremans*

### Loading packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import VarianceThreshold
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier

pd.set_option('display.max_columns', 100)

### Loading data

In [None]:
train = pd.read_csv('../input/porto-seguro-safe-driver-prediction/train.csv')
test = pd.read_csv('../input/porto-seguro-safe-driver-prediction/test.csv')

### Data at first sight

Here is an excerpt of the data description for the competition:

* Features that belong to similar groupings are tagged as such in the feature names (ind, reg, car, calc)
* Feature names include the postfix bin to indicate binary features and cat to indicate categorical features
* Features without these designations are either continuous or ordinal
* Values of -1 indicate that the feature was missing from the observation
* The target column signifies whether or not a claim was filed for that policy holder

We indeed see the following:
* binary variables
* categorical variables of which the category values are integers
* other variables with integer or float values
* variables with -1 representing missing values
* the target variable and an ID variable

In [None]:
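# drop_duplicates() returns a new DataFrame rather than modifying train in place,
# so compare the two shapes to see whether any duplicate rows exist
print(train.drop_duplicates().shape)
train.shape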

In [None]:
test.shape

In [None]:
test.info()

Again, with the info() method we see that the data type is integer or float. No null values are present in the data set. That's normal because missing values are replaced by -1. We'll look into that later.
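
Because missing values are coded as -1 rather than NaN, `info()` and `isnull()` cannot see them. A minimal sketch on a made-up toy DataFrame (the values are invented for illustration, not taken from the competition data) shows two ways to surface them:

```python
import numpy as np
import pandas as pd

# Toy frame; the values are invented for illustration only
toy = pd.DataFrame({
    'ps_reg_03': [0.5, -1.0, 0.7, -1.0],
    'ps_car_11': [2, 3, -1, 1],
    'ps_ind_01': [0, 1, 2, 3],
})

# isnull() finds nothing, because -1 is an ordinary number
nulls_reported = toy.isnull().sum()

# Option 1: count the -1 sentinel directly
missing_counts = (toy == -1).sum()

# Option 2: convert the sentinel to NaN so pandas treats it as missing
missing_after_replace = toy.replace(-1, np.nan).isnull().sum()

print(missing_counts)
```

The second option is handy when a later step expects real NaN values; the first leaves the data untouched.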

### Metadata

To facilitate the data management, we'll store meta-information about the variables in a DataFrame. This will be helpful when we want to select specific variables for analysis, visualization, and modeling.

Concretely we will store:
* role: input, ID, target
* level: nominal, interval, ordinal, binary
* keep: True or False
* dtype: int, float, str

In [None]:
train.columns

In [None]:
data = []
for f in train.columns:
    # defining the role
    if f == 'target':
        role = 'target'
    elif f == 'id':
        role = 'id'
    else:
        role = 'input'
        
    # defining the level
    if 'bin' in f or f == 'target':
        level = 'binary'
    elif 'cat' in f or f == 'id':
        level = 'nominal'
    elif train[f].dtype == float:
        level = 'interval'
    elif train[f].dtype == int:
        level = 'ordinal'
        
    # initialize keep to True for all variables except for id
    keep = True
    if f == 'id':
        keep = False
        
    # defining the data type
    dtype = train[f].dtype
    
    # creating a dictionary that contains all the metadata for the variable
    f_dict = {
        'varname' : f,
        'role' : role,
        'level' : level,
        'keep' : keep,
        'dtype' : dtype
    }
    data.append(f_dict)
    
meta = pd.DataFrame(data, columns = ['varname', 'role', 'level', 'keep', 'dtype'])
meta.set_index('varname', inplace = True)

In [None]:
# Example to extract all nominal variables that are not dropped

meta[(meta.level == 'nominal') & (meta.keep)].index

In [None]:
# Below the number of variables per role and level are displayed.

pd.DataFrame({'count' : meta.groupby(['role', 'level'])['role'].size()}).reset_index()

### Descriptive statistics

We can also apply the describe method on the dataframe. However, it doesn't make much sense to calculate the mean, std, ... on categorical variables and the id variable. We'll explore the categorical variables visually later.

Thanks to our meta file we can easily select the variables on which we want to compute the descriptive statistics. To keep things clear, we'll do this per data type.

In [None]:
# interval variables

v = meta[(meta.level == 'interval') & (meta.keep)].index
train[v].describe()

#### reg variables
* only ps_reg_03 has missing values
* the range (min to max) differs between the variables. We could apply scaling (e.g. StandardScaler), but it depends on the classifier we will want to use.

#### car variables
* ps_car_12 and ps_car_14 have missing values
* again, the range differs and we could apply scaling.

#### calc variables
* no missing values
* this seems to be some kind of ratio as the maximum is 0.9
* all three _calc variables have very similar distributions

#### Overall, we can see that the range of the interval variables is rather small. Perhaps some transformation (e.g. log) is already applied in order to anonymize the data?

In [None]:
# Ordinal variables

v = meta[(meta.level == 'ordinal') & (meta.keep)].index
train[v].describe()

* Only one variable has missing values: ps_car_11
* We could apply scaling to deal with the different ranges

In [None]:
# Binary variables

v = meta[(meta.level == 'binary') & (meta.keep)].index
train[v].describe()

* The a priori in the train data (the proportion of records with target = 1) is 3.645%, which is strongly imbalanced.
* From the means we can conclude that for most variables the value is zero in most cases.

### Handling imbalanced classes

As we mentioned above, the proportion of records with target=1 is far less than target=0. This can lead to a model that has great accuracy but does not have any added value in practice. Two possible strategies to deal with this problem are:

* oversampling records with target=1
* undersampling records with target=0

There are many more strategies of course and MachineLearningMastery.com gives a nice overview. As we have a rather large training set, we can go for undersampling.
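
The undersampling rate used in the next cell follows from requiring that the share of target=1 records after sampling equals the desired a priori p: nb_1 / (nb_1 + rate * nb_0) = p, which gives rate = (1 - p) * nb_1 / (p * nb_0). A small check with made-up class counts (not the real counts of this data set):

```python
# Made-up class counts, for illustration only
nb_0, nb_1 = 9600, 400
desired_apriori = 0.10

# Solve nb_1 / (nb_1 + rate * nb_0) = desired_apriori for rate
rate = ((1 - desired_apriori) * nb_1) / (nb_0 * desired_apriori)
kept_0 = int(rate * nb_0)

# Share of target = 1 after keeping only kept_0 majority records
new_apriori = nb_1 / (nb_1 + kept_0)
print(rate, kept_0, round(new_apriori, 4))
```

Because the number of kept records is truncated to an integer, the resulting share can differ from the desired value by a tiny amount.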

In [None]:
desired_apriori = 0.10

# get the indices per target value
idx_0 = train[train.target == 0].index
idx_1 = train[train.target == 1].index

# get original number of records per target value
nb_0 = len(train.loc[idx_0])
nb_1 = len(train.loc[idx_1])

# calculate the undersampling rate and resulting number of records with target = 0
undersampling_rate = ((1 - desired_apriori) * nb_1) / (nb_0 * desired_apriori)
undersampled_nb_0 = int(undersampling_rate * nb_0)

print('rate to undersample records with target = 0 : {}'.format(undersampling_rate))
print('number of records with target = 0 after undersampling : {}'.format(undersampled_nb_0))

# randomly select records with target = 0 to get at the desired apriori
# shuffle() randomly reorders the values; n_samples limits how many are returned
undersampled_idx = shuffle(idx_0, random_state = 37,
                          n_samples = undersampled_nb_0)

# construct list with remaining indices
idx_list = list(undersampled_idx) + list(idx_1)

# return undersample dataframe
train = train.loc[idx_list].reset_index(drop = True)

### Association analysis
url : https://hezzong.tistory.com/entry/python-%EC%97%B0%EA%B4%80%EA%B7%9C%EC%B9%99%EB%B6%84%EC%84%9DA-Priori-Algorithm

### Data quality checks

#### checking missing values
Missing values are represented as -1.

In [None]:
vars_with_missing = []

for f in train.columns:
    missings = train[train[f] == -1][f].count()
    if missings > 0 :
        vars_with_missing.append(f)
        missings_perc = missings/train.shape[0]
        
        print('variable {} has {} records ({:.2%}) with missing values'.format(f, missings, missings_perc))

print('in total, there are {} variables with missing values'.format(len(vars_with_missing)))

* ps_car_03_cat and ps_car_05_cat have a large proportion of records with missing values. Remove these variables.

* For the other categorical variables with missing values, we can leave the missing value -1 as such.

* ps_reg_03 (continuous) has missing values for 18% of all records. Replace by the mean.

* ps_car_11 (ordinal) has only 5 records with missing values. Replace by the mode.

* ps_car_12 (continuous) has only 1 record with a missing value. Replace by the mean.

* ps_car_14 (continuous) has missing values for 7% of all records. Replace by the mean.

In [None]:
# dropping the variables with too many missing values
vars_to_drop = ['ps_car_03_cat', 'ps_car_05_cat']
train.drop(vars_to_drop, inplace = True, axis = 1)
meta.loc[(vars_to_drop), 'keep'] = False

# imputing with the mean or mode
mean_imp = SimpleImputer(missing_values = -1, strategy = 'mean')
mode_imp = SimpleImputer(missing_values = -1, strategy = 'most_frequent')

train['ps_reg_03'] = mean_imp.fit_transform(train[['ps_reg_03']]).ravel()
train['ps_car_12'] = mean_imp.fit_transform(train[['ps_car_12']]).ravel()
train['ps_car_14'] = mean_imp.fit_transform(train[['ps_car_14']]).ravel()
train['ps_car_11'] = mode_imp.fit_transform(train[['ps_car_11']]).ravel()

#### NumPy has three functions that turn a multidimensional array into a one-dimensional one.

They are ravel(), reshape(), and flatten().

For reference, "ravel" means "to untangle": here, unraveling multiple dimensions into one.
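
A short sketch of the three options on a small made-up array. `SimpleImputer.fit_transform` returns a 2-D array of shape (n_samples, 1) when given a single column, which is why the imputation cell above calls `ravel()` before assigning the result back to a DataFrame column.

```python
import numpy as np

col = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1), like a single imputed column

a = col.ravel()       # 1-D; a view of the original data when possible
b = col.reshape(-1)   # 1-D; -1 lets NumPy infer the length
c = col.flatten()     # 1-D; always a copy

print(a.shape, b.shape, c.shape)
```

The practical difference is memory: `flatten()` always copies, while `ravel()` and `reshape(-1)` avoid the copy when the array layout allows it.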

#### Checking the cardinality of the categorical variables

Cardinality refers to the number of different values in a variable. As we will create dummy variables from the categorical variables later on, we need to check whether there are variables with many distinct values. We should handle these variables differently as they would result in many dummy variables.

In [None]:
v = meta[(meta.level == 'nominal') & (meta.keep)].index

for f in v:
    dist_values = train[f].value_counts().shape[0]
    print('variable {} has {} distinct values'.format(f, dist_values))

Only ps_car_11_cat has many distinct values, although it is still reasonable.
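
The functions below replace each category of ps_car_11_cat with a smoothed mean of the target. Reading the code, the weight given to a category's own mean is a sigmoid of its record count, 1 / (1 + exp(-(count - min_samples_leaf) / smoothing)), and the remaining weight goes to the overall target mean (the prior). A minimal sketch of that blend with made-up numbers (the helper name `smoothed_mean` is hypothetical and only mirrors the formula, not the full function):

```python
import math

def smoothed_mean(cat_mean, count, prior, min_samples_leaf=100, smoothing=10):
    """Blend a category's target mean with the global prior (illustrative helper)."""
    weight = 1 / (1 + math.exp(-(count - min_samples_leaf) / smoothing))
    return prior * (1 - weight) + cat_mean * weight

prior = 0.10  # made-up overall target mean

rare = smoothed_mean(cat_mean=0.50, count=5, prior=prior)      # few records: stays near the prior
common = smoothed_mean(cat_mean=0.50, count=500, prior=prior)  # many records: stays near its own mean
print(rare, common)
```

Categories with few records are pulled toward the prior, which limits how much a handful of observations can influence the encoded value.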



In [None]:
def add_noise(series, noise_level):
    return series * (1 + noise_level * np.random.randn(len(series)))

In [None]:
def target_encode(trn_series = None,
                  tst_series = None,
                  target = None,
                  min_samples_leaf = 1,
                  smoothing = 1,
                  noise_level = 0):
    
    assert len(trn_series) == len(target)
    assert trn_series.name == tst_series.name
    temp = pd.concat([trn_series, target], axis = 1)
    
    # compute target mean
    averages = temp.groupby(by = trn_series.name)[target.name].agg(['mean', 'count'])
    
    # compute smoothing
    smoothing = 1 / (1 + np.exp(-(averages['count'] - min_samples_leaf) / smoothing))
    
    # apply average function to all target data
    prior = target.mean()
    
    # the bigger the count the less full_avg is taken into account
    averages[target.name] = prior * (1 - smoothing) + averages['mean'] * smoothing
    averages.drop(['mean', 'count'], axis = 1, inplace = True)
    
    # apply averages to trn and tst series
    ft_trn_series = pd.merge(
        trn_series.to_frame(trn_series.name),
        averages.reset_index().rename(columns = {'index': target.name, target.name : 'average'}),
        on = trn_series.name,
        how = 'left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    
    # pd.merge does not keep the index so restore it
    ft_trn_series.index = trn_series.index
    ft_tst_series = pd.merge(
        tst_series.to_frame(tst_series.name),
        averages.reset_index().rename(columns = {'index' : target.name, target.name : "average"}),
        on = tst_series.name,
        how = 'left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    
    # pd.merge does not keep the index so restore it
    ft_tst_series.index = tst_series.index
    return add_noise(ft_trn_series, noise_level), add_noise(ft_tst_series, noise_level)

In [None]:
train_encoded, test_encoded = target_encode(train["ps_car_11_cat"], 
                             test["ps_car_11_cat"], 
                             target=train.target, 
                             min_samples_leaf=100,
                             smoothing=10,
                             noise_level=0.01)

train['ps_car_11_cat_te'] = train_encoded
train.drop('ps_car_11_cat', axis=1, inplace=True)
meta.loc['ps_car_11_cat','keep'] = False  # Updating the meta
test['ps_car_11_cat_te'] = test_encoded
test.drop('ps_car_11_cat', axis=1, inplace=True)

## Exploratory Data Visualization

### Categorical variables

Let's look into the categorical variables and the proportion of customers with target = 1

In [None]:
v = meta[(meta.level == 'nominal') & (meta.keep)].index

for f in v:
    plt.figure()
    fig, ax = plt.subplots(figsize=(20,10))
    # Calculate the percentage of target=1 per category value
    cat_perc = train[[f, 'target']].groupby([f],as_index=False).mean()
    cat_perc.sort_values(by='target', ascending=False, inplace=True)
    # Bar plot
    # Order the bars descending on target mean
    sns.barplot(ax=ax, x=f, y='target', data=cat_perc, order=cat_perc[f])
    plt.ylabel('% target', fontsize=18)
    plt.xlabel(f, fontsize=18)
    plt.tick_params(axis='both', which='major', labelsize=18)
    plt.show();

As we can see from the variables with missing values, it is a good idea to keep the missing values as a separate category value, instead of replacing them by the mode for instance. The customers with a missing value appear to have a much higher (in some cases much lower) probability to ask for an insurance claim.

#### Interval variables

Checking the correlations between interval variables. A heatmap is a good way to visualize the correlation between variables. The code below is based on an example by Michael Waskom

In [None]:
def corr_heatmap(v):
    correlations = train[v].corr()
    
    # create color map ranging between two colors
    cmap = sns.diverging_palette(220, 10, as_cmap = True)
    
    fig, ax = plt.subplots(figsize = (10, 10))
    sns.heatmap(correlations, cmap = cmap, vmax = 1.0, center = 0, fmt = '.2f',
                square = True, linewidths = 0.5, annot = True,
               cbar_kws={'shrink' : 0.75})
    plt.show();

v = meta[(meta.level == 'interval') & (meta.keep)].index
corr_heatmap(v)

There are strong correlations between the variables:

* ps_reg_02 and ps_reg_03 (0.7)
* ps_car_12 and ps_car_13 (0.67)
* ps_car_12 and ps_car_14 (0.58)
* ps_car_13 and ps_car_15 (0.67)
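
Rather than reading pairs off the heatmap, the same kind of list can be pulled from the correlation matrix directly. A sketch on synthetic data (the column names `a`, `b`, `c` and the 0.5 cutoff are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
a = rng.randn(500)
toy = pd.DataFrame({
    'a': a,
    'b': a + 0.1 * rng.randn(500),   # built to correlate strongly with a
    'c': rng.randn(500),             # independent noise
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

strong_pairs = pairs[pairs.abs() > 0.5]
print(strong_pairs)
```
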

Seaborn has some handy plots to visualize the (linear) relationship between variables. We could use a pairplot to visualize the relationship between the variables. But because the heatmap already showed the limited number of correlated variables, we'll look at each of the highly correlated variables separately.

**NOTE:** I take a sample of the train data to speed up the process.

In [None]:
# Randomly sampling a given fraction of a DataFrame
# To draw a random sample at a given fraction, pass a float between 0 and 1 to the frac parameter.
# URL : https://rfriend.tistory.com/602

s = train.sample(frac = 0.1)

#### ps_reg_02 and ps_reg_03
As the regression line shows, there is a linear relationship between these variables. Thanks to the hue parameter we can see that the regression lines for target=0 and target=1 are the same.

In [None]:
# The easiest way to think of hue is as a grouping.
# For example, a hue called gender contains Male and Female.
# So when you want to draw the plot separately per group, specify e.g. hue='target'.

sns.lmplot(x='ps_reg_02', y='ps_reg_03', data=s, hue='target', palette='Set1', scatter_kws={'alpha':0.3})
plt.show()

In [None]:
# ps_car_12 and ps_car_13

sns.lmplot(x = 'ps_car_12', y = 'ps_car_13', data = s, hue = 'target',
           palette = 'Set1', scatter_kws={'alpha' : 0.3})
plt.show();

In [None]:
# ps_car_12 and ps_car_14
sns.lmplot(x = 'ps_car_12', y = 'ps_car_14', data = s, hue = 'target',
           palette = 'Set1', scatter_kws = {'alpha' : 0.3})
plt.show()

In [None]:
# ps_car_13 and ps_car_15

sns.lmplot(x = 'ps_car_15', y = 'ps_car_13', data = s, hue = 'target',
           palette = 'Set1', scatter_kws = {'alpha' : 0.3})
plt.show()

All right, so now what? How can we decide which of the correlated variables to keep? We could perform Principal Component Analysis (PCA) on the variables to reduce the dimensions. In the AllState Claims Severity Competition I made this kernel to do that. But as the number of correlated variables is rather low, we will let the model do the heavy lifting.

#### Checking the correlations between ordinal variables

In [None]:
v = meta[(meta.level == 'ordinal') & (meta.keep)].index
corr_heatmap(v)

## Feature engineering

#### Creating dummy variables

The values of the categorical variables do not represent any order or magnitude. For instance, category 2 is not twice the value of category 1. Therefore we can create dummy variables to deal with that. We drop the first dummy variable as this information can be derived from the other dummy variables generated for the categories of the original variable.
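
A tiny sketch of what `drop_first = True` does, on a made-up nominal column (`color_cat` is a hypothetical name, not a column in this data set):

```python
import pandas as pd

toy = pd.DataFrame({'color_cat': [0, 1, 2, 1]})   # hypothetical nominal column

dummies = pd.get_dummies(toy, columns=['color_cat'], drop_first=True)
print(dummies)
# Category 0 has no column of its own: it is the row where all remaining dummies are zero
```

Three categories become two columns, and the dropped category is still recoverable from the other two.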

In [None]:
v = meta[(meta.level == 'nominal') & (meta.keep)].index

print('before dummification we have {} variables in train'.format(train.shape[1]))

train = pd.get_dummies(train, columns = v, drop_first = True)

print('after dummification we have {} variables in train'.format(train.shape[1]))

#### Creating interaction variables
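
With degree = 2 and interaction_only = False, PolynomialFeatures adds the squares and the pairwise products of the selected columns. The sketch below computes the same terms by hand with pandas on a toy frame (columns `a` and `b` are made up), naming them in the `a^2`, `a b` style that scikit-learn generates. It only illustrates what gets created and is not a replacement for the transformer.

```python
from itertools import combinations_with_replacement

import pandas as pd

toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

# Build every degree-2 term by hand: squares (a^2, b^2) and the cross product (a b)
terms = {}
for left, right in combinations_with_replacement(toy.columns, 2):
    name = f'{left}^2' if left == right else f'{left} {right}'
    terms[name] = toy[left] * toy[right]

interactions = pd.DataFrame(terms)
print(interactions)
```
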

In [None]:
v = meta[(meta.level == 'interval') & (meta.keep)].index
poly = PolynomialFeatures(degree = 2, interaction_only = False,
                          include_bias = False)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
interactions = pd.DataFrame(data = poly.fit_transform(train[v]),
                            columns = poly.get_feature_names_out(v))

# remove the original columns
interactions.drop(v, axis = 1, inplace = True)

# concat the interaction variables to the train data

print('before creating interactions we have {} variables in train'.format(train.shape[1]))

train = pd.concat([train, interactions], axis = 1)

print('after creating interactions we have {} variables in train'.format(train.shape[1]))

## Feature Selection

#### Removing features with low or zero variance

Personally, I prefer to let the classifier algorithm choose which features to keep. But there is one thing that we can do ourselves: removing features with no or very low variance. Sklearn has a handy method to do that: VarianceThreshold. By default it removes features with zero variance. This will not be applicable for this competition as we saw there are no zero-variance variables in the previous steps. But if we removed features with less than 1% variance, we would remove 31 variables.
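
Many of the columns at this point are 0/1 dummies, and for such a column the variance is p(1 - p), where p is the share of ones. A threshold of 0.01 therefore removes binary columns whose rarer value occurs in roughly 1% of the records or fewer. A sketch with two made-up binary columns, using only the standard library:

```python
from statistics import pvariance

threshold = 0.01

# Two made-up binary columns with 1,000 rows each
rare_flag = [1] * 5 + [0] * 995      # 0.5% ones
common_flag = [1] * 50 + [0] * 950   # 5% ones

for name, values in [('rare_flag', rare_flag), ('common_flag', common_flag)]:
    p = sum(values) / len(values)
    var = pvariance(values)          # population variance, equals p * (1 - p) for 0/1 data
    print(name, p, var, 'removed' if var <= threshold else 'kept')
```
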

In [None]:
selector = VarianceThreshold(threshold = 0.01)
# fit to train without id and target variables
selector.fit(train.drop(['id', 'target'], axis = 1))

# function to toggle boolean array elements
f = np.vectorize(lambda x : not x)

v = train.drop(['id', 'target'], axis = 1).columns[f(selector.get_support())]
print('{} variables have too low variance.'.format(len(v)))
print('These variables are {}'.format(list(v)))

We would lose quite a few variables if we selected based on variance. But because we do not have that many variables, we'll let the classifier choose. For data sets with many more variables this could reduce the processing time.

Sklearn also comes with other feature selection methods. One of these methods is SelectFromModel in which you let another classifier select the best features and continue with these. Below I'll show you how to do that with a Random Forest.

#### Selecting features with a Random Forest and SelectFromModel

Here we'll base feature selection on the feature importances of a random forest. With Sklearn's SelectFromModel you can then specify how many variables you want to keep. You can set a threshold on the level of feature importance manually. But we'll simply select the top 50% best variables.
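
With threshold = 'median', SelectFromModel keeps the features whose importance is at least the median importance, which is roughly the top half. A sketch of that rule in plain NumPy with made-up importances (the feature names are hypothetical):

```python
import numpy as np

# Made-up importances for six hypothetical features
feat_names = np.array(['f1', 'f2', 'f3', 'f4', 'f5', 'f6'])
importances = np.array([0.30, 0.05, 0.20, 0.10, 0.25, 0.10])

threshold = np.median(importances)   # midpoint of the two middle values
keep = importances >= threshold      # features at or above the threshold are kept

print(threshold, list(feat_names[keep]))
```
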

> The code in the cell below is borrowed from the GitHub repo of Sebastian Raschka. This repo contains code samples of his book Python Machine Learning, which is an absolute must to read.

In [None]:
X_train = train.drop(['id', 'target'], axis = 1)
y_train = train['target']

feat_labels = X_train.columns

rf = RandomForestClassifier(n_estimators = 100, random_state = 0,
                            n_jobs = -1)

rf.fit(X_train, y_train)
importances = rf.feature_importances_

indices = np.argsort(rf.feature_importances_)[::-1]

# print features from most to least important; indices holds the sorted order
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]],
                                importances[indices[f]]))

In [None]:
sfm = SelectFromModel(rf, threshold = 'median', prefit = True)
print('number of features before selection: {}'.format(X_train.shape[1]))

n_features = sfm.transform(X_train).shape[1]

print('Number of features after selection: {}'.format(n_features))
selected_vars = list(feat_labels[sfm.get_support()])

In [None]:
train = train[selected_vars + ['target']]

#### Feature scaling
As mentioned before, we can apply standard scaling to the training data. Some classifiers perform better when this is done.
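
Standard scaling centers each column on zero and rescales it to unit variance. A hand-rolled NumPy sketch on made-up numbers, using the population standard deviation (which is also what StandardScaler uses):

```python
import numpy as np

x = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])   # made-up data with two columns on different scales

# Standard scaling: subtract the column mean, divide by the column standard deviation
scaled = (x - x.mean(axis=0)) / x.std(axis=0)

print(scaled.mean(axis=0), scaled.std(axis=0))
```

Note that `fit_transform` returns a new array and leaves `train` unchanged, so its result has to be assigned to a variable to be used later.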

In [None]:
scaler = StandardScaler()
scaler.fit_transform(train.drop(['target'], axis = 1))

#### Conclusion

Hopefully this notebook helped you with some tips on how to start with this competition. Feel free to vote for it. And if you have questions, post a comment.