# Machine Learning Topics: Feature Engineering, Cross Validation, and Bias vs Variance

## Goals

- <b>Feature engineering</b>: transform a dirty dataset into a machine learning ready dataset.
- Learn the importance of <b>cross validation</b>: why and how it's used.
- The <b>bias vs variance</b> trade-off, aka the eternal dilemma of machine learning.
- Make and interpret learning and validation curves

### Feature Engineering
["Feature engineering is the process of using domain knowledge of the data to create features 
that make machine learning algorithms work"](https://en.wikipedia.org/wiki/Feature_engineering)

We are creating new features from old ones.
<br><br>
Our job: transform the [Titanic dataset](https://www.kaggle.com/c/titanic) into one that can be used for machine learning, specifically for predicting whether or not a passenger survived the sinking of the Titanic.

In [None]:
#imports


In [None]:
#load in the dataset

path = "../../data/titanic.csv"

titanic = 

#lowercase column names


#Set passengerid column as index


#view data


<b>Data dictionary:</b>

PassengerID: A column added by Kaggle to identify each row and make submissions easier

Survived: Whether the passenger survived or not and the value we are predicting (0=No, 1=Yes)

Pclass:	The class of the ticket the passenger purchased (1=1st, 2=2nd, 3=3rd)

Sex: The passenger’s sex

Age: The passenger’s age in years

SibSp: The number of siblings or spouses the passenger had aboard the Titanic

Parch: The number of parents or children the passenger had aboard the Titanic

Ticket: The passenger’s ticket number

Fare: The fare the passenger paid

Cabin: The passenger’s cabin number

Embarked: The port where the passenger embarked (C=Cherbourg, Q=Queenstown, S=Southampton)

Inspection time.

In [None]:
#Call .info()


We have three columns with null values. What should we do with them?

First up: fixing the age column.

We're not going to drop it because we don't want to reduce the size of the dataset by a significant amount. We're going to use a technique called "[imputation](https://machinelearningmastery.com/handle-missing-data-python/)" to get around this issue.
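
As a rough illustration of median imputation, here is a minimal sketch on a small made-up column. The `toy` dataframe and its values are hypothetical stand-ins, not rows from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real age column
toy = pd.DataFrame({"age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0]})

# Median of the observed ages (missing values are skipped by default)
median_age = toy["age"].median()

# Replace every missing age with that median
toy["age"] = toy["age"].fillna(median_age)

print(median_age)                 # 36.5
print(toy["age"].isnull().sum())  # 0
```

The median is often preferred over the mean here because a few very old or very young passengers pull it around less.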

In [None]:
#Fill in the null values of the age column using the median age.


#Confirm there are no nulls


What about cabin? It gets dropped, because most of its values are missing.

In [None]:
#Look at unique values


In [None]:
#Drop the cabin column from the data



Lastly, the embarked column.

In [None]:
#View uniques



What should we do?

The solution is to drop the rows that contain nulls, because there are only two of them.

In [None]:
#Check to see which columns have nulls


In [None]:
#drop rows with null values



Let's look at the details of the data again.

In [None]:
#Call .info() on titanic



We've dealt with the missing data issue. Now we need to handle the text data. Our objective here is to turn words into numbers.

What do you think that means?

First order of business, deciding which of the string/object columns to keep and drop.

In [None]:
#List of object dtype columns
object_columns = ["name", "sex", "ticket", "embarked"]

#Look at titanic data with just the columns in object_columns


Which ones do we drop?

Name and ticket get the ax

In [None]:
#Drop name and ticket columns from the data



#View data


At this point, we have two string columns left: sex and embarked. Let's turn them into numbers by making dummy variables.
<br><br>
1. Convert male to 0 and female to 1 in the sex column.
2. Make dummy variables from the embarked column.
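
The two steps above can be sketched on a tiny made-up dataframe. The `toy` frame and the variable names `sex_codes` and `emb_dummies` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical three-passenger dataframe
toy = pd.DataFrame({"sex": ["male", "female", "female"],
                    "embarked": ["S", "C", "Q"]})

# Binary column: look each value up in a dictionary
sex_codes = toy["sex"].map({"male": 0, "female": 1})

# Multi-category column: one indicator column per category
emb_dummies = pd.get_dummies(toy["embarked"], prefix="emb")

print(sex_codes.tolist())         # [0, 1, 1]
print(list(emb_dummies.columns))  # ['emb_C', 'emb_Q', 'emb_S']
```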

In [None]:
#Make a dictionary where the keys are male and female and the values are 0 and 1
gender_dict =


#Map dictionary onto the sex column and reassign it to sex.

titanic["sex"] =

In [None]:
#Use pd.get_dummies to make dummy variables from the embarked column

#Pass embarked column, then set prefix to "emb" and call .head()


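What's the issue here?

Redundant columns! Every extra column adds a dimension (see the curse of dimensionality), and here one of the three carries no new information.

We don't need all three columns: a passenger who didn't embark at Q or S must have embarked at C. We didn't make separate columns for male and female, so why should we do that for C, Q, and S?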
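
To see why one dummy column can go, here is a small sketch on made-up port values (the `ports` series is hypothetical). It rebuilds the dropped column from the two that remain.

```python
import pandas as pd

# Hypothetical embarkation ports for four passengers
ports = pd.Series(["S", "C", "Q", "S"], name="embarked")

# All three indicator columns vs. the reduced set
full = pd.get_dummies(ports, prefix="emb").astype(int)
reduced = pd.get_dummies(ports, prefix="emb", drop_first=True).astype(int)

# A row with zeros in both remaining columns must belong to the dropped category
recovered_c = 1 - reduced["emb_Q"] - reduced["emb_S"]

print(list(reduced.columns))                             # ['emb_Q', 'emb_S']
print(recovered_c.tolist() == full["emb_C"].tolist())    # True
```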

In [None]:
#Make dummy variables from the embarked column, but this time set drop_first = True
#Assign dummies to variable called emb_dums



#Look at emb_dums


Combine this dataframe of dummy variables with our original dataset.

In [None]:
#1. Drop embarked column


#2. Concatenate the titanic and emb_dums dataframes and overwrite titanic variable


#3. View new concatenated dataframe


In [None]:
#Check to see if all variables are numeric


Great! Our dataset is now ready for machine learning.
<br><br>

But time for a quick exercise. Write a function that takes an uncleaned version of the Titanic dataset, applies the feature engineering techniques we used above, and outputs a clean, machine-learning-ready dataset.

In [None]:
#Function goes here
def titanic_fe(df):
    
    return df


In [None]:
#Test the function on a reimported titanic dataset
#Reuse the path variable defined when we first loaded the data
titanic2 = pd.read_csv(path)

#Pass titanic2 into the titanic_fe function


We're ready to do some machine learning, but first let's discuss the null accuracy.
<br><br>
The null accuracy is the benchmark for our model's performance: the accuracy we would get by always predicting the most frequent class. It equals the largest share in the target variable's distribution.

In [None]:
#Call .value_counts(normalize=True) on survived column



Our null accuracy is 61.75%. That means we have to create a model that classifies the data at a better rate than 61.75%.

If we just said everyone died, we'd be 61.75% accurate without even going through the trouble of building a model.
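
A minimal sketch of the calculation, using ten made-up labels rather than the real survived column (the `survived` series below is hypothetical):

```python
import pandas as pd

# Hypothetical labels: 6 passengers died (0), 4 survived (1)
survived = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Share of each class in the target
class_shares = survived.value_counts(normalize=True)

# Null accuracy: the share of the most frequent class
null_accuracy = class_shares.max()
majority_class = class_shares.idxmax()

print(majority_class, null_accuracy)  # 0 0.6
```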

## Train/test sets and cross validation

In [None]:
#Import train_test_split and cross_val_score functions


We are going to split our titanic dataset into two sets: training and testing.

In [None]:
#First extract features and target variables

X = 
y = 

#Input X and y into the train_test_split function, set test_size to .25, random_state = 4
X_train, X_test, y_train, y_test = train_test_split()

- X_train = The features of the data we use to fit the model

- X_test = The features of the data we use to make and test predictions with

- y_train = The target variable of the data we use to fit the model

- y_test = The target variable of the data we use to make and test predictions with
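
To make the four pieces concrete, here is a small sketch on made-up arrays. The names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, and `y_te` are hypothetical, chosen so they don't overwrite the notebook's own variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, 2 features, alternating labels
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

# Hold out 25% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=4)

print(X_tr.shape, X_te.shape)  # (15, 2) (5, 2)
print(y_tr.shape, y_te.shape)  # (15,) (5,)
```

No row appears in both sets, which is what lets the test score stand in for performance on unseen data.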

In [None]:
#Fit a decision tree model on X_train and y_train. Do not specify max_depth



#Evaluate the model by scoring the X_train and y_train


Yay! We got a high score! Or did we????

In [None]:
#Evaluate the model on the test set



<b>Huge drop in accuracy score. How come?</b>

Let's bring back the model plotting function to visualize an overfit model against a test set.

In [None]:
#Make some fake data again
from sklearn.datasets import make_classification

#Generate fake data that is 1000 x 2.
data = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, 
                    class_sep=.20, random_state = 34)
#Assign features to XX

#Assign target variable to yy


#Set style and size



#Plot features and use yy to color-encode the points


In [None]:
#Train test split on XX and yy


#fit model on the training set 


#Evaluate model on training data


Yay! Perfect model!

In [None]:
#Load in plot_decision_boundary function
#Requires numpy as np and matplotlib.pyplot as plt from the imports cell
def plot_decision_boundary(model, X, y):
    #Build a 100 x 100 grid that spans the range of both features
    X_max = X.max(axis=0)
    X_min = X.min(axis=0)
    xticks = np.linspace(X_min[0], X_max[0], 100)
    yticks = np.linspace(X_min[1], X_max[1], 100)
    xx, yy = np.meshgrid(xticks, yticks)
    #Predict a class for every grid point and reshape back to the grid
    ZZ = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = ZZ >= 0.5
    Z = Z.reshape(xx.shape)
    #Shade the predicted regions, then overlay the actual points
    plt.rcParams["figure.figsize"] = (10,7)
    fig, ax = plt.subplots()
    ax.contourf(xx, yy, Z, cmap=plt.cm.bwr, alpha=0.2)
    ax.scatter(X[:,0], X[:,1], c=y, alpha=0.4, s = 50, cmap="rainbow")

In [None]:
#Visualize the model and the testing data

#Pass in pre-trained model that was trained on the training set
#Pass in the testing data.


How does that look to you? Where in the plot is the model overfit?

Let's check to see how well the model classifies the testing data

In [None]:
#Evaluate model on testing data


Let's try that whole process again to see if we get different scores

<br><br>
Run this code several times and observe the changes in the testing score

In [None]:
#Different train/test split but with no random_state set
X_train, X_test, y_train, y_test = train_test_split(X ,y, test_size = .25)

#Fit model
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

training_score = model.score(X_train, y_train)

print ("The training score is {:.3f} percent".format(training_score*100))

testing_score = model.score(X_test, y_test)
print ("The testing score is {:.3f} percent".format(testing_score*100))

In [None]:
#Let's make this a for loop

#Initialize list that we'll use for our testing scores
testscorelist = []

#Iterate over range 10

testscorelist

### Cross Validation
<br><br>

"Cross-validation, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set." 

https://en.wikipedia.org/wiki/Cross-validation_(statistics)

<br><br>



<b>K-Fold Cross Validation</b>
![Image](https://i.stack.imgur.com/1fXzJ.png)

<br><br>
"[In K Fold cross validation](https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f), the data is divided into k subsets. Now the holdout method is repeated k times, such that each time, one of the k subsets is used as the test set/ validation set and the other k-1 subsets are put together to form a training set. The error estimation is averaged over all k trials to get total effectiveness of our model. As can be seen, every data point gets to be in a validation set exactly once, and gets to be in a training set k-1 times. This significantly reduces bias as we are using most of the data for fitting, and also significantly reduces variance as most of the data is also being used in validation set. Interchanging the training and test sets also adds to the effectiveness of this method. As a general rule and empirical evidence, K = 5 or 10 is generally preferred, but nothing’s fixed and it can take any value."

Let's use the cross_val_score function to perform 5-fold cross validation.
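
To show what happens under the hood, here is a sketch that writes the fold loop by hand on synthetic data and compares it with `cross_val_score`. The data and the names `X_demo`, `y_demo`, `manual_scores`, and `demo_cv_scores` are made up for illustration. It relies on scikit-learn using stratified folds when `cv` is an integer and the model is a classifier.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)

# Manual version: fit on 4 folds, score on the held-out fold, repeat 5 times
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = clone(tree)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

# The same procedure in one call
demo_cv_scores = cross_val_score(tree, X_demo, y_demo, cv=5, scoring="accuracy")

print(np.round(manual_scores, 3))
print(np.round(demo_cv_scores, 3))
```

The two printed arrays should match, since both run the same splits with the same fixed `random_state`.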

In [None]:
#Call cross_val_score, input an unfitted DT model, X, y, set cv = 5 and scoring = "accuracy"
cv_scores = 

#Call cv_scores
cv_scores

We see there's a degree of variance in the scores across folds, which is why we report the mean.

In [None]:
#What's the average score?


<b>Class exercise:</b>

Test to see the relationship between max_depth and the average cv_score. What happens when you increase or decrease max_depth?

Once you're done playing around with that, make a line plot of depth values from 1 to 20 against the average cross-validated score for each depth value.

In [None]:
#Answer



What is the best depth value?

Train a model with the best depth value and evaluate it on a test set

In [None]:
#Train and test
X_train, X_test, y_train, y_test = train_test_split(X ,y, 
                                                    test_size = .25,
                                                    random_state = 42)
#Fit model with depth 6 and random_state = 42


#Score model on test set
testscore = 

print ("The test score is {:.3f} percent".format(testscore*100))

How does that compare to the null accuracy?

In [None]:
#Subtract null accuracy from testscore




Not too bad.

Time to make a confusion matrix
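
As a quick preview on made-up labels (the `y_true` and `y_pred` lists below are hypothetical), a confusion matrix breaks accuracy down into the four kinds of outcome:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for eight passengers
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are the actual classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of predictions on the diagonal
accuracy = (tn + tp) / cm.sum()

print(cm.tolist())                              # [[3, 1], [2, 2]]
print(accuracy, accuracy_score(y_true, y_pred)) # 0.625 0.625
```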

In [None]:
#Import confusion_matrix and accuracy_score functions


In [None]:
#Calculate accuracy_score using sklearn

#Make predictions on test set
preds = 

#Call accuracy_score on y_test and preds


In [None]:
#Pass in y_test and preds into confusion_matrix function


The best depth is one that is not too small but not too large.

We need to find the depth that strikes the right balance between <b>bias</b> and <b>variance</b>
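
One way to see the two extremes is to compare a depth-1 tree with an unrestricted tree on noisy synthetic data. This is a sketch under assumptions: the data comes from `make_classification`, not the Titanic set, and the names `X_demo`, `y_demo`, and `results` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: flip_y randomly flips about 10% of the labels
X_demo, y_demo = make_classification(n_samples=500, n_features=2,
                                     n_informative=2, n_redundant=0,
                                     class_sep=0.5, flip_y=0.1,
                                     random_state=0)

results = {}
for depth in (1, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    train_score = tree.fit(X_demo, y_demo).score(X_demo, y_demo)
    cv_score = cross_val_score(tree, X_demo, y_demo, cv=5).mean()
    results[depth] = (train_score, cv_score)

for depth, (train_score, cv_score) in results.items():
    print("max_depth={}: train={:.3f}, cv={:.3f}".format(depth, train_score, cv_score))
```

The unrestricted tree memorizes the training data, so its training score is perfect while its cross-validated score falls well short of that. The depth-1 tree cannot even fit the training data fully.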

### Bias vs Variance
<br><br>
<b>Bias:</b> The simplifying assumptions made by the model to make the target function easier to approximate.

<b>Variance:</b> The amount that the estimate of the target function will change given different training data.

From: https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/
<br><br>
[Legendary data science blog post](http://scott.fortmann-roe.com/docs/BiasVariance.html)

<b>Bias error:</b> The difference between the expected (or average) prediction of our model and the correct value which we are trying to predict. Bias measures how far off in general these models' predictions are from the correct value.

<b>Variance error:</b> The error due to variance is taken as the variability of a model prediction for a given data point. Imagine you can repeat the entire model building process multiple times. The variance is how much the predictions for a given point vary between different realizations of the model.

Graphic illustration of bias vs variance:

![b v v](https://i.stack.imgur.com/r7QFy.png)

Credit: Scott Fortmann-Roe

What do you see here? How would you interpret this graphic?

<b>Depicting bias vs variance with validation and learning curves</b>

Validation Curve:

![Lc](http://scott.fortmann-roe.com/docs/docs/BiasVariance/biasvariance.png)

<br><br>
Learning Curve:
![lc](https://chrisalbon.com/images/machine_learning_flashcards/Learning_Curve_print.png)
<br><br>
["Graph that compares the performance of a model on training and testing data over a varying number of training instances"](http://www.ritchieng.com/machinelearning-learning-curve/)

Graph 1: Plot validation curve of model complexity versus error rates for training and test sets

In [None]:
#1. Train test split
X_train, X_test, y_train, y_test = train_test_split(X ,y, test_size = .25,
                                                    random_state = 38)

#2. Initialize lists of errors for train and test sets

train_errors = []
test_errors = []

#3. Set range of depth values from 1 to 20
depths = range(1,21)
#4a. Iterate over depth values.
#4b. Fit a DT model for each depth value.
#4c. Evaluate the model on both the train and test sets.
#4d. Append the error rates (1 - score) to train_errors and test_errors


    
#5. Make two line plots. Plot depths vs train_errors and plot depths vs test_errors
#Give them different colors and labels



Link to validation plot code from Chris Albon: https://chrisalbon.com/machine_learning/model_evaluation/plot_the_validation_curve/

Graph 2: Plot learning curve of training sizes vs training and testing errors

In [None]:
#Credit Chris Albon

#1. Import learning_curve from sklearn
from sklearn.model_selection import learning_curve

#2. Create CV training and test scores for various training set sizes
#Use max_depth = 5 for DT model
train_sizes, train_scores, test_scores = learning_curve(DecisionTreeClassifier(max_depth=5), 
                                                        X, 
                                                        y,
                                                        # Number of folds in cross-validation
                                                        cv=5,
                                                        # Evaluation metric
                                                        scoring='accuracy', 
                                                        # 30 different sizes of the training set
                                                        train_sizes=np.linspace(0.01, 1.0, 30))

#3. train_scores and test_scores are 30x5. We need to compute the average of each 5-fold cv
train_scores = train_scores.mean(axis =1)
test_scores = test_scores.mean(axis = 1)

#4. Draw lines
plt.plot(train_sizes, train_scores, color="r",  label="Training score")
plt.plot(train_sizes, test_scores, color="g", label="Testing score")

#5. Create plot
plt.title("Learning Curve")
plt.xlabel("Training Set Size")
plt.ylabel("Accuracy Score")
plt.ylim(0.6, 1.1)
plt.legend(loc="best")
plt.tight_layout()
plt.show()

Bonus!!
<br><br>
Let's see the most important features and visualize the decision tree

In [None]:
#Fit DT model on X and y with max_depth 4

dt = 


In [None]:
#Call .feature_importances_


In [None]:
#Let's put that in a dataframe
fi = 

In [None]:
#Sort it



Visualize the tree!

In [None]:
from sklearn.tree import export_graphviz
import graphviz

#Export the decision tree as a graphviz .dot file. We have to export it and then re-import it
export_graphviz(dt, out_file='titanic.dot',
                feature_names=X.columns,
                class_names=["dead", "alive"])
with open("titanic.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)

### Class work 

For the rest of class, work on improving your model as much as possible. 

- See what happens when you drop different features
- Try different combinations of them
- Try making pclass into dummy variables instead of treating it as a numeric feature
- Make predictions of "fake passengers". Input a bunch of features to see what happens.
- Once you've made the best possible model, make some more validation and learning curves.
- You're also welcome to try the iris dataset or the churn rate dataset, or to make your own data with sklearn.

### Resources:

Bias vs variance:

https://ml.berkeley.edu/blog/2017/07/13/tutorial-4/

https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/

http://www.machinelearningtutorial.net/2017/01/26/the-bias-variance-tradeoff/

https://followthedata.wordpress.com/2012/06/02/practical-advice-for-machine-learning-bias-variance/


<br><br>
Cross validation:

https://stats.stackexchange.com/questions/1826/cross-validation-in-plain-english

https://www.analyticsvidhya.com/blog/2015/11/improve-model-performance-cross-validation-in-python-r/

https://www.openml.org/a/estimation-procedures/1


<br><br>
Titanic dataset projects:

https://www.kaggle.com/maielld1/titanic-dataquest-tutorial

https://github.com/agconti/kaggle-titanic

https://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html