# Ex. 3

## Calculating AUC
> The AUC value assesses how well a model can order observations from a low probability of being a target to a high probability of being a target. In Python, the __roc_auc_score__ function can be used to calculate the AUC of a model. It takes the true values of the target and the predictions as arguments.

> You will make predictions again and then calculate the AUC of the model with __roc_auc_score__.
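
To make the idea of "ordering observations" concrete, here is a minimal sketch that computes the AUC by hand as the share of (target, non-target) pairs in which the target gets the higher prediction. The function name pairwise_auc and the toy values are assumptions for illustration; this is not how roc_auc_score is implemented internally, although it should give the same number on data like this.

```python
def pairwise_auc(y_true, scores):
    """Share of (target, non-target) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one target and one non-target")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0   # target ranked above non-target
            elif p == n:
                correct += 0.5   # a tie counts as half a correct pair
    return correct / (len(positives) * len(negatives))


# Toy data (hypothetical): two non-targets followed by two targets
toy_y = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_y, toy_scores)
print(toy_auc)  # 0.75: three of the four pairs are ordered correctly
```

An AUC of 1 means every target is ranked above every non-target, while 0.5 is what random ordering would give on average.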

## Instructions

- The model logreg from the last chapter has been created and fitted for you, and the dataframe X contains the predictor columns of the basetable. Make predictions for the objects in the basetable.
- Select the second column of predictions, as it contains the predictions for the target.
- The true values of the target are loaded in y. Use the roc_auc_score function to calculate the AUC of the model.

In [None]:
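from sklearn.metrics import roc_auc_score

# Make predictions
predictions = logreg.predict_proba(X)
# Keep the second column: the predicted probability of being a target
predictions_target = predictions[:,1]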

# Calculate the AUC value
auc = roc_auc_score(y, predictions_target)
print(round(auc,2))

# Ex. 5 

## Using different sets of variables
>Adding more variables, and therefore more complexity, to your logistic regression model does not automatically make it more accurate. In this exercise you can verify whether adding 3 variables to a model leads to a more accurate model.

>variables_1 and variables_2 are available in your environment: you can print them to the console to explore what they look like.

## Instructions
- Fit the logreg model using variables_2, which contains 3 additional variables compared to variables_1.
- Make predictions for this model.
- Calculate the AUC of this model.

In [None]:
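from sklearn import linear_model
from sklearn.metrics import roc_auc_score

# Create appropriate dataframes
X_1 = basetable[variables_1]
X_2 = basetable[variables_2]
# Use a 1-D Series for the target, which is the shape scikit-learn expects
y = basetable["target"]

# Create the logistic regression model
logreg = linear_model.LogisticRegression()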

# Make predictions using the first set of variables and assign the AUC to auc_1
logreg.fit(X_1, y)
predictions_1 = logreg.predict_proba(X_1)[:,1]
auc_1 = roc_auc_score(y, predictions_1)

# Make predictions using the second set of variables and assign the AUC to auc_2
logreg.fit(X_2, y)
predictions_2 = logreg.predict_proba(X_2)[:,1]
auc_2 = roc_auc_score(y, predictions_2)

# Print auc_1 and auc_2
print(round(auc_1,2))
print(round(auc_2,2))

# Ex. 6

## Selecting the next best variable
>The forward stepwise variable selection method starts with an empty variable set and proceeds in steps, where in each step the next best variable is added. To implement this procedure, two handy functions have been implemented for you.

>The auc function takes a variable set, variables, and calculates the AUC of the model that uses this variable set as predictors. The next_best function determines which variable should be added to the variable list in the next step.

>In this exercise, you will experiment with these functions to better understand their purpose. You will calculate the AUC of a given variable set, calculate which variable should be added next, and verify that this indeed results in an optimal AUC.
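
The course does not show how auc and next_best are implemented. The sketch below is one plausible version, assuming a scikit-learn logistic regression that is fitted and evaluated on the same basetable. The function bodies and the toy dataframe (with its signal and noise columns) are assumptions for illustration only.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def auc(variables, target, basetable):
    """Hypothetical helper: fit on `variables` and return the in-sample AUC."""
    X = basetable[variables]
    y = basetable[target].values.ravel()  # flatten the one-column target frame
    logreg = LogisticRegression()
    logreg.fit(X, y)
    predictions = logreg.predict_proba(X)[:, 1]
    return roc_auc_score(y, predictions)


def next_best(current_variables, candidate_variables, target, basetable):
    """Hypothetical helper: return the candidate that yields the highest AUC
    when it is added to the current variables."""
    best_auc = -1.0
    best_variable = None
    for candidate in candidate_variables:
        auc_candidate = auc(current_variables + [candidate], target, basetable)
        if auc_candidate > best_auc:
            best_auc = auc_candidate
            best_variable = candidate
    return best_variable


# Toy basetable (hypothetical): "signal" separates the targets, "noise" does not
toy = pd.DataFrame({
    "signal": [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9],
    "noise": [1, 2, 1, 2, 1, 2, 1, 2],
    "target": [0, 0, 0, 0, 1, 1, 1, 1],
})
print(next_best([], ["noise", "signal"], ["target"], toy))
```

Because next_best refits the model once per candidate, each step of the selection costs as many model fits as there are candidates left.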

## Instructions
- The auc function has been implemented for you. Calculate the AUC of a model that uses "max_gift", "mean_gift" and "min_gift" as predictors. You should pass these variables in a list as the first argument to the auc function.
- The next_best function has been implemented for you. Calculate which variable should be added next, given that "max_gift", "mean_gift" and "min_gift" are currently in the model, and "age" and "gender_F" are the candidate next predictors. The first argument of the next_best function is a list with the current variables, while the second argument is a list with the candidate predictors.
- Calculate the AUC of a model that uses "max_gift", "mean_gift", "min_gift" and "age" as predictors.
- Calculate the AUC of a model that uses "max_gift", "mean_gift", "min_gift" and "gender_F" as predictors.

In [None]:
# Calculate the AUC of a model that uses "max_gift", "mean_gift" and "min_gift" as predictors
auc_current = auc(["max_gift", "mean_gift", "min_gift"], ["target"], basetable)
print(round(auc_current,4))

# Calculate which variable among "age" and "gender_F" should be added to the variables "max_gift", "mean_gift" and "min_gift"
next_variable = next_best(["max_gift", "mean_gift", "min_gift"], ["age", "gender_F"], ["target"], basetable)
print(next_variable)

# Calculate the AUC of a model that uses "max_gift", "mean_gift", "min_gift" and "age" as predictors
auc_current_age = auc(["max_gift", "mean_gift", "min_gift", "age"], ["target"], basetable)
print(round(auc_current_age,4))

# Calculate the AUC of a model that uses "max_gift", "mean_gift", "min_gift" and "gender_F" as predictors
auc_current_gender_F = auc(["max_gift", "mean_gift", "min_gift", "gender_F"], ["target"], basetable)
print(round(auc_current_gender_F,4))

# Ex. 7

## Finding the order of variables
>The forward stepwise variable selection procedure starts with an empty set of variables and adds predictors one by one. In each step, the predictor that has the highest AUC in combination with the current variables is selected.

>In this exercise you will learn to implement the forward stepwise variable selection procedure. To this end, you can use the next_best function that has been implemented for you. It can be used as follows:

- next_best(current_variables, candidate_variables, target, basetable)

>where current_variables is the list of variables that are already in the model and candidate_variables is the list of variables that can be added next.

## Instructions
- Use the function next_best to calculate the next best variable and assign it to next_variable.
- Update the current_variables list.
- Update the candidate_variables list.

In [None]:
# Find the candidate variables
candidate_variables = list(basetable.columns.values)
candidate_variables.remove("target")

# Initialize the current variables
current_variables = []

# The forward stepwise variable selection procedure
number_iterations = 5
for i in range(0, number_iterations):
    next_variable = next_best(current_variables, candidate_variables, ["target"], basetable)
    current_variables = current_variables + [next_variable]
    candidate_variables.remove(next_variable)
    print("Variable added in step " + str(i+1)  + " is " + next_variable + ".")
print(current_variables)

# Ex. 8

## Correlated variables
>The first 10 variables that are added to the model are the following:

['max_gift', 'number_gift', 'time_since_last_gift', 'mean_gift', 'income_high', 'age', 'country_USA', 'gender_F', 'income_low', 'country_UK']
> As you can see, min_gift is not added. Does this mean that it is a bad variable? You can test the performance of the variable by using it in a model as a single variable and calculating the AUC. How does the AUC of min_gift compare to the AUC of income_high? To this end, you can use the function auc():

- auc(variables, target, basetable)

It can happen that a good variable is not added because it is highly correlated with a variable that is already in the model. You can test this by calculating the correlation between these variables:

>import numpy
numpy.corrcoef(basetable["variable_1"],basetable["variable_2"])[0,1]
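
The [0,1] index is needed because numpy.corrcoef returns a full correlation matrix rather than a single number. The short sketch below shows this on two toy arrays, where the second is an exact linear function of the first. The array names and values are made up for illustration.

```python
import numpy as np

# Toy values (hypothetical); the second array is a linear function of the first
toy_min_gift = np.array([10.0, 20.0, 30.0, 40.0])
toy_mean_gift = 2.0 * toy_min_gift + 5.0

# corrcoef returns a 2x2 matrix: ones on the diagonal and the
# correlation between the two arrays in the off-diagonal cells
corr_matrix = np.corrcoef(toy_min_gift, toy_mean_gift)
print(corr_matrix.shape)

# Either off-diagonal cell holds the correlation between the two arrays
correlation = corr_matrix[0, 1]
print(round(correlation, 2))
```

A correlation close to 1 or -1 suggests that the second variable carries little information that the first does not already provide.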

## Instructions
- Calculate the AUC of the model using the variable min_gift only.
- Calculate the AUC of the model using the variable income_high only.
- Calculate the correlation between the variables min_gift and mean_gift.

In [None]:
import numpy as np

# Calculate the AUC of the model using min_gift only
auc_min_gift = auc(['min_gift'], ["target"], basetable)
print(round(auc_min_gift,2))

# Calculate the AUC of the model using income_high only
auc_income_high = auc(['income_high'], ["target"], basetable)
print(round(auc_income_high,2))

# Calculate the correlation between min_gift and mean_gift
correlation = np.corrcoef(basetable["min_gift"], basetable["mean_gift"])[0,1]
print(round(correlation,2))

# Ex. 10

## Partitioning
>In order to properly evaluate a model, one can partition the data into a train set and a test set. The train set contains the data the model is built on, and the test set is used to evaluate the model. This division is done randomly, but when the target incidence is low, it may be necessary to stratify, that is, to make sure that the train and test sets contain an equal percentage of targets.

>In this exercise you will partition the data with stratification and verify that the train and test data have equal target incidence. The train_test_split method has already been imported, and the X and y dataframes are available in your workspace.
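
The effect of stratification is easiest to see on a small example. The sketch below uses a made-up dataset of 100 observations with 10 targets, so the target incidence is 10%. The toy column, the toy names, and the random_state argument (added only to make the split reproducible) are assumptions that are not part of the exercise.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 observations, of which 10 are targets
toy_X = pd.DataFrame({"age": range(100)})
toy_y = pd.Series([1] * 10 + [0] * 90, name="target")

# 50-50 split; stratify=toy_y keeps the share of targets equal in both halves
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.5, stratify=toy_y, random_state=42
)

# Target incidence = number of targets / number of observations
train_incidence = y_train.mean()
test_incidence = y_test.mean()
print(train_incidence, test_incidence)
```

Without the stratify argument, the 10 targets could by chance be divided unevenly, for example 7 in one half and 3 in the other.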

## Instructions
- Partition these dataframes with stratification using the train_test_split function. Make sure that the train and test sets are the same size and have equal target incidence.
- Calculate the target incidence of the train set. This is the number of targets in the train set divided by the number of observations in the train set.
- Calculate the target incidence of the test set.

In [None]:
# Load pandas and the partitioning function
import pandas as pd
from sklearn.model_selection import train_test_split

# Create dataframes with variables and target
X = basetable.drop('target', axis=1)
y = basetable["target"]

# Carry out 50-50 partitioning with stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.50, stratify = y)

# Create the final train and test basetables
train = pd.concat([X_train, y_train], axis=1)
test = pd.concat([X_test, y_test], axis=1)

# Check whether train and test have same percentage targets
print(round(sum(train["target"])/len(train), 2))
print(round(sum(test["target"])/len(test), 2))

# Ex. 11

## Evaluating a model on test and train
>The function auc_train_test calculates the AUC of a model that is built on a train set and evaluated on a test set:

- auc_train, auc_test = auc_train_test(variables, target, train, test)

>where variables is a list of the names of the variables that are used in the model.

>In this exercise, you will apply this function, and check whether the train and test AUC are similar.
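
The implementation of auc_train_test is not shown in the course. One plausible version, assuming a scikit-learn logistic regression that is fitted on the train set only, could look like the sketch below. The function body and the toy train and test sets are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def auc_train_test(variables, target, train, test):
    """Hypothetical helper: fit on train, return (train AUC, test AUC)."""
    X_train, y_train = train[variables], train[target].values.ravel()
    X_test, y_test = test[variables], test[target].values.ravel()
    logreg = LogisticRegression()
    logreg.fit(X_train, y_train)  # the test set is never used for fitting
    auc_train = roc_auc_score(y_train, logreg.predict_proba(X_train)[:, 1])
    auc_test = roc_auc_score(y_test, logreg.predict_proba(X_test)[:, 1])
    return auc_train, auc_test


# Toy sets (hypothetical): the train set is perfectly ordered by "signal",
# the test set contains one target that scores below a non-target
toy_train = pd.DataFrame({
    "signal": [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9],
    "target": [0, 0, 0, 0, 1, 1, 1, 1],
})
toy_test = pd.DataFrame({
    "signal": [0.2, 0.6, 0.4, 0.8],
    "target": [0, 0, 1, 1],
})
auc_train, auc_test = auc_train_test(["signal"], ["target"], toy_train, toy_test)
print(round(auc_train, 2))
print(round(auc_test, 2))
```

A train AUC that is clearly higher than the test AUC usually indicates that the model fits the train set better than it generalizes to new data.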

## Instructions
- The basetable is loaded. Partition the basetable such that the train set contains 70% of the data, and make sure that the train and test sets have equal target incidence.
- Calculate the train and test AUC of the model using "age" and "gender_F" as predictors using the auc_train_test function.

In [None]:
# Load pandas and the partitioning function
import pandas as pd
from sklearn.model_selection import train_test_split

# Create dataframes with variables and target
X = basetable.drop('target', axis=1)
y = basetable["target"]

# Carry out 70-30 partitioning with stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, stratify = y)

# Create the final train and test basetables
train = pd.concat([X_train, y_train], axis=1)
test = pd.concat([X_test, y_test], axis=1)

# Apply the auc_train_test function
auc_train, auc_test = auc_train_test(["age","gender_F"], ["target"], train, test)
print(round(auc_train,2))
print(round(auc_test,2))