# Feature exploration and dataset preparation
---
In this kernel we're going to explore and prepare the data that will be used in our models:

1. Explore the dataset and look at some of the most important features, with the help of other notebooks on Kaggle
2. Clean the data: remove unused features, replace null values and outliers
3. Standardize

# Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# load the train and test data files
train = pd.read_csv("../input/santander-customer-satisfaction/train.csv", index_col=0)
test = pd.read_csv("../input/santander-customer-satisfaction/test.csv", index_col=0)

# 1. Initial exploration and feature analysis

In [None]:
print(train.shape)
print(test.shape)

Our training dataset has 370 columns (the features plus the TARGET column). The test dataset will be used to submit the predictions to the Kaggle competition, so we'll split our training dataset into training data and test data to train and validate our models.

Note that according to the description of the dataset in the Kaggle competition, the TARGET variable determines if the customer is satisfied (0) or not (1). We'll look into the target feature in more detail later on.

Now we're going to have a look at our training dataset and the name and data type of our features:

In [None]:
train.head()

In [None]:
train.describe()

It looks like most of the features are numerical. Let's have a look at all the column names to see if they follow any naming convention:

In [None]:
train.columns.values

After a first look, we can see three big groups of features:

1. Features starting with **imp_**, **num_**, **saldo_**: Probably from *importe* (amount), *numerico* (numerical), and *saldo* (balance). Should be numerical.
2. Features starting with **delta_imp_**, **delta_num_**: A feature linked and probably calculated based on the previous features. Should be numerical.
3. Features starting with **ind_**: Looks like an indicator flag (probably categorical). Should be 1/0.
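
The grouping above can be checked by counting columns per prefix. This is a minimal sketch on a handful of made-up column names that imitate the dataset's naming style; the `prefix_of` helper and the example names are assumptions for illustration, and on the real data you would pass `train.columns` instead.

```python
import pandas as pd

# Hypothetical column names imitating the dataset's naming convention
columns = [
    "var3", "var15", "imp_op_var39_ult1", "ind_var1_0", "num_var4",
    "saldo_var30", "delta_imp_aport_var13_1y3", "delta_num_aport_var13_1y3",
    "var38", "TARGET",
]

# Prefixes mentioned in the text
prefixes = ["delta_imp_", "delta_num_", "imp_", "num_", "saldo_", "ind_"]

def prefix_of(name):
    """Return the first prefix that the column name starts with, else 'other'."""
    for p in prefixes:
        if name.startswith(p):
            return p
    return "other"

# Count how many columns fall in each group
prefix_counts = pd.Series([prefix_of(c) for c in columns]).value_counts()
print(prefix_counts)
```

Columns that land in the `other` bucket are the ones worth inspecting individually.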

Let's verify that we only have numerical data in our dataset:

In [None]:
train.dtypes.value_counts()

As expected, we can see that all the features are numerical. Now let's explore some of the most important features. As most of the features have non-descriptive names, we're going to rely on the information from other notebooks in Kaggle ([https://www.kaggle.com/cast42/exploring-features](https://www.kaggle.com/cast42/exploring-features)).

Other than the groups of features above, these are the other features in our dataset:

- var3
- var15
- var38
- TARGET feature

## 1.1 var3: Country
According to some Kaggle users, the var3 feature would correspond to the customer country. Let's explore this feature:

In [None]:
train.var3.value_counts()

It seems that **-999999** is a placeholder used when the country is unknown. We'll replace it with the most common value later (**2**, which probably corresponds to Spain). Let's see how satisfaction varies across the other most frequent countries:

In [None]:
# filter by top countries, excluding the most common one (2)
top_countries = train[(train.var3 != -999999) & (train.var3 != 2)].groupby('var3').filter(lambda x: len(x) > 80)

# plot number of satisfied / unsatisfied customers by country
sns.catplot(x='var3', hue='TARGET', kind='count', data=top_countries);

It doesn't look like there's a strong correlation between the country and the customer satisfaction.

## 1.2 var15: Customer Age
It seems that the **var15** feature corresponds to the customer age:

In [None]:
train.var15.value_counts()

This makes sense looking at the data: the most common age is 23 years. Let's see how age relates to customer satisfaction:

In [None]:
print(train[(train.var15 < 23)].shape)
print(train[(train.var15 < 23)].TARGET.sum())

There are no unsatisfied customers below 23 years. Let's look at the satisfaction rate per age:

In [None]:
g = sns.catplot(x='var15', y='TARGET', kind='bar', data=train[(train.var15 > 22) & (train.var15 < 100)], aspect=3)

# TARGET is 0/1, so each bar is a fraction in [0, 1]; xmax=1 maps 1.0 to 100%
for ax in g.axes.flat:
    ax.yaxis.set_major_formatter(PercentFormatter(xmax=1))

plt.show()

We can see that customers become increasingly unsatisfied from around 23 to 40 years old.
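
The same rate can be read as numbers rather than from the plot. A minimal sketch on made-up data (the `toy` frame and the output column names are assumptions for illustration): because TARGET is 0/1, its mean per age group is the share of unsatisfied customers.

```python
import pandas as pd

# Toy data standing in for the real training set (values are made up)
toy = pd.DataFrame({
    "var15":  [23, 23, 23, 23, 30, 30, 40, 40],
    "TARGET": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Mean of a 0/1 column = fraction of ones; also keep the group size,
# since rates computed on very few customers are noisy
rate_by_age = toy.groupby("var15")["TARGET"].agg(
    unsatisfied_pct="mean", n_customers="count"
)
rate_by_age["unsatisfied_pct"] *= 100  # express as a percentage
print(rate_by_age)
```

Keeping the group size next to the rate helps judge how much to trust the bars for the older, sparsely populated ages.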

## 1.3 var38: Mortgage
According to some users, **var38** corresponds to the mortgage value:

In [None]:
train.var38.value_counts()

The most common value is 117310, which according to some users may correspond to the median value of a mortgage in Spain. Let's see the relationship between the mortgage and the customer satisfaction:

In [None]:
# np.isclose avoids relying on exact float equality with the typed-in literal
train[~np.isclose(train.var38, 117310.979016494) & (train.var38 < 300000)].var38.hist(bins=20);

In [None]:
# same filter, restricted to unsatisfied customers (TARGET == 1)
train[~np.isclose(train.var38, 117310.979016494) & (train.var38 < 300000) & (train.TARGET == 1)].var38.hist(bins=20);

Both histograms seem to follow a similar distribution, so we don't see a direct relation between the mortgage amount and customer satisfaction.
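
Because the two groups differ greatly in size, comparing shares per bin is easier than comparing raw counts. The sketch below uses synthetic values (the log-normal sample and the random subset are assumptions, not the real var38 data) and shared bin edges so the two histograms are directly comparable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: all customers, and a small random subset playing
# the role of the unsatisfied customers (made-up data)
all_values = rng.lognormal(mean=11.5, sigma=0.5, size=5000)
unsatisfied = rng.choice(all_values, size=200, replace=False)

# Same 20 bins for both groups, limited to values below 300000 as in the text
bins = np.linspace(0, 300000, 21)
counts_all, _ = np.histogram(all_values, bins=bins)
counts_uns, _ = np.histogram(unsatisfied, bins=bins)

# Convert counts to shares so the group sizes no longer matter
share_all = counts_all / counts_all.sum()
share_uns = counts_uns / counts_uns.sum()

# Largest per-bin difference between the two distributions
print(np.abs(share_all - share_uns).max())
```

A small maximum difference supports the visual impression that the two distributions have a similar shape.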

## 1.4 TARGET: Customer satisfaction

As mentioned before, our target feature is the customer satisfaction: **0** for satisfied customers and **1** for unsatisfied customers. Let's look at the distribution of the classes:

In [None]:
train.TARGET.value_counts(normalize=True) * 100

Less than **4%** of our customers are unsatisfied. We will probably need to do some resampling (either upsampling or downsampling) to balance the classes. We'll also need to take this into account when splitting our dataset into train data and test data!
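
One common way to account for the imbalance when splitting is a stratified split, which keeps the class proportions the same in both parts. A minimal sketch on made-up data, assuming scikit-learn's `train_test_split` with its `stratify` argument; the `toy` frame, the 4% rate and the 80/20 split are illustrative choices, not the settings used later.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Toy dataset with 4% positives (made-up data)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "TARGET": np.r_[np.ones(40, dtype=int), np.zeros(960, dtype=int)],
})

X = toy.drop(columns="TARGET")
y = toy["TARGET"]

# stratify=y keeps the share of unsatisfied customers equal in both splits
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.mean(), y_val.mean())
```

Without `stratify`, a random split of such a rare class could leave the validation part with noticeably more or fewer positives than the training part.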

# 2. Data cleaning

We'll proceed to do some data cleaning before building any models:

1. We've observed that some columns are empty (all zeroes), so we'll proceed to remove them.
2. As mentioned before, we'll replace the placeholder in the Country feature with the most common country (2).

## 2.1 Remove empty columns

In [None]:
train.shape

In [None]:
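# keep only the columns where at least one value is not 0
# .copy() gives an independent dataframe, so later assignments don't trigger chained-assignment warnings
train = train.loc[:, (train != 0).any(axis=0)].copy()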

In [None]:
train.shape
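
Before dropping, it can help to list which columns would be removed. A minimal sketch on a made-up frame (the `toy` data and column names are assumptions for illustration):

```python
import pandas as pd

# Toy frame: "b" and "d" are all zeroes, "c" is constant but non-zero
toy = pd.DataFrame({
    "a": [0, 1, 0],
    "b": [0, 0, 0],
    "c": [2, 2, 2],
    "d": [0, 0, 0],
})

# Columns where every value equals 0
all_zero = toy.columns[(toy == 0).all(axis=0)]
print(list(all_zero))

# Same filter as in the cell above: keep columns with any non-zero value
cleaned = toy.loc[:, (toy != 0).any(axis=0)].copy()
print(list(cleaned.columns))
```

Note that this filter only catches all-zero columns: a column that is constant but non-zero (like `c` here) is kept, so it would need a separate check if such columns exist in the real data.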

## 2.2 Replace placeholder in the Country feature

In [None]:
train.var3 = train.var3.replace(-999999, 2)
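
The replacement value can also be derived from the data instead of being hard-coded. A minimal sketch on a made-up series (the `toy` values are assumptions for illustration), computing the most common value while ignoring the placeholder:

```python
import pandas as pd

PLACEHOLDER = -999999

# Toy country codes including the placeholder (made-up data)
toy = pd.Series([2, 2, PLACEHOLDER, 8, 2, PLACEHOLDER], name="var3")

# Most frequent value among the known entries
most_common = toy[toy != PLACEHOLDER].mode().iloc[0]

# Replace the placeholder with that value
cleaned = toy.replace(PLACEHOLDER, most_common)
print(most_common, cleaned.tolist())
```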

# 3. Standardize
It's recommended to standardize the data (zero mean and unit variance per feature) for the models we're going to build. Normalization, i.e. rescaling each feature to a fixed range, is not required:

In [None]:
train.head()

In [None]:
train.describe()

In [None]:
# standardize our training dataset values and convert the result to a new dataframe
# we won't standardize the TARGET feature
train_scaled = StandardScaler().fit_transform(train.drop('TARGET', axis=1).values)
train_scaled_df = pd.DataFrame(train_scaled, index=train.index, columns=train.drop('TARGET', axis=1).columns)
train_scaled_df['TARGET'] = train['TARGET']
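
To make explicit what the scaler does: each column has its mean subtracted and is divided by its standard deviation. A minimal sketch on a tiny made-up array (the values are assumptions for illustration), comparing a manual computation with `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up matrix: 3 rows, 2 features on very different scales
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 60.0],
])

# Manual standardization, column by column.
# np.std uses the population standard deviation (ddof=0) by default,
# which is what StandardScaler uses as well.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, regardless of its original units.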

# Output

We'll use this processed dataset as the input for our models:

In [None]:
train_scaled_df.to_csv('train_clean_standarized.csv')
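
Since the customer ID is stored as the dataframe index, it is written as the first column of the CSV and should be read back with `index_col=0`. A minimal round-trip sketch on a made-up frame written to a temporary directory (the `toy` data and the file name `toy_clean.csv` are assumptions for illustration):

```python
import os
import tempfile

import pandas as pd

# Toy frame with an ID index, mimicking the processed dataset (made-up values)
toy = pd.DataFrame(
    {"var15": [-0.5, 1.2], "TARGET": [0, 1]},
    index=pd.Index([1, 3], name="ID"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_clean.csv")
    toy.to_csv(path)  # the index is written as the first column
    restored = pd.read_csv(path, index_col=0)  # and restored as the index here

print(restored)
```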