# Lesson 10 - Accuracy Metrics
Welcome to the Accuracy Metrics lesson. In this lesson we will cover:
- **Metrics for Classification**
- **Metrics for Regression**

The lab for Lesson 10 will consist of all the exercises throughout the notebook.

For this lesson we will again be using the Titanic Survival Dataset from Kaggle to predict survival of passengers.

Let's review the column values once more as a reminder of the data we are using:

- **Survived**: Outcome of survival (0 = No; 1 = Yes)
- **Pclass**: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- **Name**: Name of passenger
- **Sex**: Sex of the passenger
- **Age**: Age of the passenger (Some entries contain ?)
- **SibSp**: Number of siblings and spouses of the passenger aboard
- **Parch**: Number of parents and children of the passenger aboard
- **Ticket**: Ticket number of the passenger
- **Fare**: Fare paid by the passenger
- **Cabin**: Cabin number of the passenger (Some entries contain ?)
- **Embarked**: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)
- **Boat**: Lifeboat (if survived)
- **Body**: Body Number (if did not survive and body was recovered)
- **Home.Dest**: Home / Destination

## Sklearn Library 

In [None]:
# If you do not have scikit-learn installed, install it by removing the # from the conda command below.
# If you already have it, you can skip this step.
#!conda install -c anaconda scikit-learn

## Building Your ML Model - End to End

In our lab today, we will go through the machine learning process end to end, to reiterate concepts we have already learned and then discuss the ones we learned today.

Let's start by importing our libraries...

In [None]:
# Exercise 1
# Import the matplotlib library 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

%matplotlib inline

### Read in your data

In [None]:
# Exercise 2 read in the titanic dataset 


### Explore the Data

We will now explore our data to get a feel for what we are dealing with. Remember, this is typically the first step of the process: if you do not understand your data, you will not be able to predict with it.

In [None]:
titanic_data.info()

In [None]:
#First we must group the dataset by Pclass and Survived to gather the total count
group = titanic_data.groupby(['pclass', 'survived'])
pclass_survived = group.size().unstack()


# Creating a histogram of age by survival
hist = px.histogram(titanic_data,x = "age", opacity = 0.7, color = "survived")
hist

Below I will demonstrate how I explored the data. The first thing you will notice is that I created two lists:

- categorical columns 
- numerical columns

I do this to make it easier to access the columns in each group when I need to apply transformations.

In [None]:
categorical_columns = ["name","sex","cabin","embarked", "home.dest"]
numerical_columns = ["pclass","survived","age","sibsp","parch","ticket","fare","boat","body"]

Now I will look at the values from my categorical columns

In [None]:
for column in categorical_columns:
    print("-----------------------------")
    print("For column {}, the values are: \n{} \n".format(column, titanic_data[column].value_counts()))
    print("-----------------------------")

Now let's take a look at how our numerical values are distributed.

In [None]:
for col in numerical_columns:
    if col in ["pclass","survived","sibsp","parch"]:
        titanic_data.hist(column=col)

### Prepare The data

The next step is to start preparing our data. Here we will transform columns and clean up the data before we begin to engineer any new features.

Let's start by replacing our good ole friend the "?" with a proper missing value, so that we can decide how to fill in missing values later.

In [None]:
titanic_data = titanic_data.replace({'?': None})
titanic_data

Let's count how many missing values we currently have.

In [None]:
titanic_data.isnull().sum()

In the step below I use the list of numerical columns to convert them all to numeric in one go. You could type out each column individually, but this way you save time and can focus on the task at hand.

In [None]:
# Convert numeric columns to numeric
titanic_data[numerical_columns] = titanic_data[numerical_columns].apply(pd.to_numeric, errors='coerce')

In the next step I shorten the cabin values to only the cabin letter and not the number. I am doing this to make our next steps easier, but we are losing information that could be important.

In [None]:
titanic_data["cabin"] = titanic_data["cabin"].str[0]

### Feature Engineering 

As discussed in our previous lesson, feature engineering is used to enhance our data so we can draw more insights from it. Here we will transform strings to numbers, potentially remove missing data, one hot encode data, etc.

In [None]:
# Label Encoding cabin
deck = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "T": 8}
titanic_data["cabin"] = titanic_data["cabin"].map(deck)

In [None]:
# One hot encoding sex
titanic_data = pd.get_dummies(titanic_data,columns= ["sex"], prefix="gender")
titanic_data

In the cell below I am implementing a feature engineering technique called **target mean encoding**. We will only glance at the code, but I wanted to give you a demonstration of the more complex techniques that are used in industry.

In [None]:
target_mean = titanic_data["survived"].mean()
embarked_agg = titanic_data.groupby("embarked")["survived"].agg(['count','mean'])
counts = embarked_agg['count']
means = embarked_agg['mean']
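# Smooth each port's survival rate toward the overall rate (weight of 100 pseudo-observations)
smoothed_mean = (counts * means + 100 * target_mean) / (counts + 100)
titanic_data["embarked"] = titanic_data["embarked"].map(smoothed_mean)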
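To make the arithmetic behind smoothed target mean encoding easier to follow, here is a minimal sketch in plain Python. The rows and the smoothing weight `m = 2` are made up for illustration (the lesson's cell uses a weight of 100), and this is one common formulation of the technique rather than the only one.

```python
from collections import defaultdict

# Hypothetical toy data: (port of embarkation, survived) pairs
rows = [("S", 0), ("S", 1), ("S", 0), ("S", 0), ("C", 1), ("C", 1), ("Q", 0)]

m = 2  # smoothing weight (assumed here; the lesson's cell uses 100)

# Overall survival rate across all rows
global_mean = sum(survived for _, survived in rows) / len(rows)

# Per-port row count and number of survivors
counts = defaultdict(int)
sums = defaultdict(int)
for port, survived in rows:
    counts[port] += 1
    sums[port] += survived

# Blend each port's own mean with the overall mean.
# Ports with few rows are pulled more strongly toward the overall mean.
encoding = {
    port: (sums[port] + m * global_mean) / (counts[port] + m)
    for port in counts
}

for port, value in sorted(encoding.items()):
    print(port, round(value, 3))
```

Port "Q" has a single passenger who did not survive, so its raw mean is 0. The smoothed value sits between 0 and the overall mean, which keeps one row from dictating the encoding.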

Now let's deal with our missing data.

In [None]:
titanic_data.isnull().sum()

As we can see above, we have a lot of missing values. Our dataset has about 1309 rows, and cabin and body each have more than 1000 values missing. This is where we can check whether dropping or imputing values makes a difference in our final results. In the cell below, add or remove columns that you believe will affect our model due to missing values.

In [None]:
# Here we will drop columns we deem unnecessary. Remember this spot as we will come back 
titanic_data = titanic_data.drop(columns=["name","boat","home.dest"])

In [None]:
titanic_data.info()

Now that we have removed the columns we deemed unnecessary, let's fill in the missing values using one of the following strategies:

- **mean**: replace missing values using the mean along each column. Can only be used with numeric data.

- **median**: replace missing values using the median along each column. Can only be used with numeric data.

- **most_frequent**: replace missing using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.

- **constant**:  replace missing values with fill_value. Can be used with strings or numeric data.

We will be using the `SimpleImputer` class from the sklearn library.

In [None]:
from sklearn.impute import SimpleImputer
# specify method below after strategy=. Mean will be the default value 
imp = SimpleImputer(strategy="mean")
imp.fit(titanic_data)
titanic_data_final = imp.transform(titanic_data)
titanic_data_final = pd.DataFrame(titanic_data_final, columns=titanic_data.columns)

# Create an independent copy for regression
titanic_data_reg = titanic_data_final.copy()

titanic_data_final
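To see how the four strategies differ, here is a small sketch that fills the same missing value four ways. The `ages` column is made up for illustration, and `fill_value=0.0` is an arbitrary choice for the `constant` strategy.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column of ages; the last value is missing
ages = np.array([[20.0], [20.0], [30.0], [90.0], [np.nan]])

filled = {}
for strategy in ["mean", "median", "most_frequent"]:
    imputer = SimpleImputer(strategy=strategy)
    # fit_transform learns the fill value and applies it in one step
    filled[strategy] = float(imputer.fit_transform(ages)[-1, 0])

# The "constant" strategy needs an explicit fill_value
constant_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
filled["constant"] = float(constant_imputer.fit_transform(ages)[-1, 0])

for strategy, value in filled.items():
    print(strategy, value)
```

The single large value (90) pulls the mean well above the median, which is one reason the median is often preferred for skewed columns such as fares.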

In [None]:
# Round the cabin value if we did not drop it
if "cabin" in titanic_data_final.columns:
    titanic_data_final["cabin"] = titanic_data_final["cabin"].round()
titanic_data_final


Now let's check our results.

In [None]:
titanic_data_final.isnull().sum()

## Modeling

We went over all of the above to emphasize that without good data, we do not have a good model. All of the above is part of the modeling process. 

Now that we have our **final titanic dataset** ready, we can approach the steps to model our data.

### Types of Modeling
As mentioned in the lecture, we have:

- Unsupervised Models
    - Clustering
    - Dimensionality Reduction
- Supervised Models
    - Regression
    - Classification

We are focusing on the supervised models.

**Classification**: Models are used to classify your target variable, for example to predict what color my shirt is, or whether a transaction is fraudulent or not. Classification is used to determine a label.

**Regression**: Models are used to predict numeric values, for example the stock price for Apple tomorrow, or my yearly income for next year.

Each type has many specific models. A few examples are neural networks, LSTMs, random forests, XGBoost, and decision trees for supervised learning, and PCA and K-Means for unsupervised learning. All of these models have their own advantages, disadvantages, and specific uses, but we will go over these in another lesson. If you would like to get a head start, you can take a look at the [sklearn documentation](https://scikit-learn.org/stable/user_guide.html).

### Specify Target Variable

The first step is to separate out the value that we aim to predict. We do this because otherwise the model has no way to know what we are asking of it. If we did not specify our target variable, it would be similar to expecting ice cream without asking for it.

In [None]:
target = titanic_data_final["survived"]
target

### Specify the features

After specifying the target variable, we now drop it from `titanic_data_final` so that we have a dataset of just our features. What are features?

**Features:** Measurable values to be analyzed. 

An easier way to understand features is to imagine them as the inputs(columns) to an algorithm that helps make the decisions or path to take

In [None]:
features = titanic_data_final.drop('survived', axis = 1)
features

### Let's check what could be good features
Now that we have our features we could use modeling to help us decide which features to use in our final model. We will not go over the code below in detail, but we will use the graph to see what might be good features to use.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)
model = RandomForestClassifier()

model.fit(X_train, y_train)

feature_weights = model.feature_importances_
feature_weights

# Show top 5 features 
indices = np.argsort(feature_weights)[::-1]
columns = X_train.columns.values[indices[:5]]
values = feature_weights[indices][:5]

# Create the plot
fig = plt.figure(figsize=(9, 5))
plt.title("Normalized Weights for First Five Most Predictive Features", fontsize=16)
plt.bar(np.arange(5), values, width=0.6, align="center", color='#00A000', \
       label="Feature Weight")
plt.bar(np.arange(5) - 0.3, np.cumsum(values), width=0.2, align="center", color='#00A0A0', \
       label="Cumulative Feature Weight")
plt.xticks(np.arange(5), columns)
plt.xlim((-0.5, 4.5))
plt.ylabel("Weight", fontsize=12)
plt.xlabel("Feature", fontsize=12)

plt.legend(loc='upper center')
plt.tight_layout()
plt.show()  

### Split Data into Train and Test
Before we submit our features and target to a model, we first need to split our data into a training set and a test set. If we trained our model on the complete dataset, how could we test it? How could we know whether our model is better than just guessing?

By splitting the data we provide a system of checks and balances: we can verify the model against data it has not seen, which is closer to how it would be used in real life.

We will use the `train_test_split` function from the sklearn library to split our data into 70% training and 30% testing. You can play with this to see how your results vary. Just change the **test_size** input to a decimal value from 0.1 to 0.9.

In [None]:
# Split the data into train and test. 
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)
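As a quick illustration of what `test_size` controls, the sketch below splits a made-up 10-row dataset with a few different values and records how many rows land in each set. The arrays `X` and `y` are hypothetical stand-ins for `features` and `target`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 feature columns, binary target
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

sizes = {}
for test_size in [0.2, 0.3, 0.5]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    # Store (train rows, test rows) for each setting
    sizes[test_size] = (len(X_tr), len(X_te))
    print(test_size, "-> train rows:", len(X_tr), "test rows:", len(X_te))
```

Fixing `random_state` makes the split repeatable, so any change in your metrics comes from the setting you changed and not from a different shuffle.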

#### Let's take a quick look at our train features values

In [None]:
X_train.head()

#### Let's take a quick look at our train target values

In [None]:
y_train.head()

### Create the Model

We are finally at the step of creating our model. As this will be your second time viewing a model, we will mainly use the default settings. Models typically have a variety of levers that can be adjusted, but we will cover this in another lesson.

In [None]:
# Import the classifier from sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree


# Define the classifier, and fit it to the data. Here we are creating a variable type of the DecisionTreeClassifier
# that we will use to perform actions with. 
model = DecisionTreeClassifier()

If you would like to see all of the actions that you can take with [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) follow the link

### Train the model
We will now train our model. Training uses the training set created by `train_test_split`. This step can be viewed as teaching our model. The model then stores what it has learned (for a decision tree, a set of decision rules), which it will use for prediction.

In [None]:
# Train the model
model.fit(X_train, y_train)

Below is a graph of how our model made decisions. As you can see, it made lots of decisions based on the data we provided.
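One lightweight way to inspect a fitted tree's decisions is sklearn's `export_text`, which prints the learned rules as indented text. The sketch below uses a made-up six-row dataset and a separate `toy_model`, not the lesson's `model`; `tree.plot_tree` is the graphical counterpart if you prefer a drawing.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [age, fare] -> survived
X_toy = [[5, 30], [8, 25], [30, 7], [40, 8], [35, 80], [50, 90]]
y_toy = [1, 1, 0, 0, 1, 1]

toy_model = DecisionTreeClassifier(random_state=0)
toy_model.fit(X_toy, y_toy)

# Each line is a threshold test on a feature; leaves show the predicted class
rules = export_text(toy_model, feature_names=["age", "fare"])
print(rules)
```

On the full Titanic data the printout will be much longer, since an unrestricted tree keeps splitting until it fits the training set.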

### Predict using your model
Now we will ask our model to predict values for our testing set. The model has not seen these values, so it is a good test of how we are doing.

In [None]:
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

### Lets check our results

Remember, with classification models we are predicting what category an observation falls into. To judge the performance of our model we have the following metrics available to us:
- accuracy
- precision
- recall 

One way to calculate these metrics is to use the values from the confusion matrix. Let's start by importing metrics from the `sklearn` library.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, balanced_accuracy_score 
from sklearn.metrics import precision_score, recall_score

Now let's quickly look at our accuracy.

In [None]:
# Find the accuracy of your predictions
test_accuracy = accuracy_score(y_test, y_test_pred)
print(test_accuracy)

**Exercise 3:** How do you think we did?

**(Double Click Here)** Answer:

# Model Accuracy
Let's dive into our next topic, model accuracy. Let's start by asking a question...

What does this value tell us about our model?
   - It tells us how well our model was able to predict the results
   - Our model accurately predicted whether or not someone survived the Titanic about 74% of the time (your exact number may differ from run to run)
   - How many passengers does that translate into?
   

In [None]:
# let's save the number of passengers in the test set in a variable
pass_in_test = y_test.count()
print('The number of passengers in our test dataset was', pass_in_test)
print('The number of passengers whose outcome (survived/did not survive) our model predicted correctly was', round(test_accuracy * pass_in_test))

Let's dive deeper..

- Out of the passengers who actually survived, how many did our model predict would survive?
- Out of the passengers who actually did not survive, how many did our model predict would not survive?

It is now time to look at our confusion matrix to get these answers...

![confusionmatrix](confusion_matrix.png "confusion matrix")

Now let's generate our own confusion matrix below using the `sklearn.metrics` module.

In [None]:
titanic_confusion_matrix = confusion_matrix(y_test, y_test_pred)
print(titanic_confusion_matrix)
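It helps to know how sklearn lays out the matrix: rows are the actual classes and columns are the predicted classes, so for labels 0 and 1 the cells read `[[TN, FP], [FN, TP]]`. A minimal sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = survived, 0 = did not survive
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

Unpacking the four cells this way makes it harder to mix up false positives (top right) with false negatives (lower left).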

We can also visualize the confusion matrix...

In [None]:
# visualize the confusion matrix
plt.figure(figsize = (5,3))
sns.heatmap(titanic_confusion_matrix, annot=True, fmt="d")  # annot=True prints the count in each cell

Now let's jump into the metrics that we saw before, and shed some light on how they are calculated.

***Precision:***
- From our confusion matrix we can see that our model predicted that 169 passengers survived. However, out of these:
    - 51 were false positives (top right cell of our matrix), leaving 118 true positives
    - Our precision was about 0.7 (TP/(TP+FP) = 118/169)
    - Note that this value is lower than our accuracy, so our model is not performing quite as well as accuracy alone suggested
    
***Recall:***
- Our model predicted that 51 passengers who actually survived would not survive (lower left cell of our matrix). These are false negatives.
    - Depending on the context of the problem, a predictive model can't afford to miss positive results. This is especially true in cancer diagnosis, for example
    - For this reason we measure recall: the ratio of correctly predicted positives to the actual number of positives in our test dataset
    - The goal in this case is to minimize the number of false negatives (cases where the model incorrectly predicted a negative result)
    - In the case of the Titanic, the recall is about 0.7 (TP/(TP+FN))
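The precision and recall figures above can be reproduced by hand from the counts quoted in this section. This is only a sketch of the arithmetic; a fresh run of the notebook will give a different confusion matrix, so plug in your own counts.

```python
# Counts quoted in the discussion above (your own run will differ)
predicted_survived = 169   # everyone the model predicted as survived: TP + FP
fp = 51                    # predicted survived, actually did not
fn = 51                    # predicted did not survive, actually survived
tp = predicted_survived - fp

precision = tp / (tp + fp)  # of those predicted to survive, how many did?
recall = tp / (tp + fn)     # of those who survived, how many did we find?

print("TP:", tp)
print("precision:", round(precision, 3))
print("recall:", round(recall, 3))
```

Here precision and recall come out equal only because the number of false positives happens to match the number of false negatives.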

Sklearn also provides a very handy way to quickly calculate all of these values using the `classification_report` function, shown below.

In [None]:
print(classification_report(y_test,y_test_pred))

Now let's tie everything together and use the accuracy metrics we used before to see how we did.

In [None]:

train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
train_accuracy_balanced = balanced_accuracy_score(y_train, y_train_pred)
test_accuracy_balanced = balanced_accuracy_score(y_test, y_test_pred)
train_precision = precision_score(y_train, y_train_pred)
test_precision = precision_score(y_test, y_test_pred)
train_recall = recall_score(y_train, y_train_pred)
test_recall = recall_score(y_test, y_test_pred)

print('The training accuracy is {}%'.format(train_accuracy*100))
print('The training precision is {}%'.format(train_precision*100))
print('The training recall is {}%'.format(train_recall*100))
print('The test accuracy is {}%'.format(test_accuracy * 100))
print('The test precision is {}%'.format(test_precision * 100))
print('The test recall is {}%'.format(test_recall * 100))

**Exercise 4:** If you had to choose one accuracy metric to judge your model, which would you choose?

**(Double Click Here)** Answer:

## Regression Metrics

Remember, with regression models we are trying to predict a numerical value. How could we validate whether our prediction of that value is correct? Could we look at the numbers and see if they match? Could we look at the difference between the values?

**Exercise 5:** If you were going to judge a regression model for accuracy, how would you do it? 

**(Double Click Here)** Answer:

Now let's dive into the two ways we can validate our regression models.

- **Mean absolute error (MAE)**: MAE is very simple. Let's say I predicted that DogeCoin was going to be 50 dollars tomorrow, and the actual value was 38. My absolute error would be |50 (predicted) - 38 (actual)| = 12. MAE is simply the average of these absolute differences over all of your predictions.


- **Mean squared error (MSE)**: MSE is just like MAE, except that we square each difference before averaging, so for our example we would have (50 (predicted) - 38 (actual))^2 = 12^2 = 144. Squaring gives larger errors a larger impact, which is useful when large misses are much worse than small ones.
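Both metrics are averages over all predictions, which a single example can hide. The sketch below computes them by hand for three made-up predictions, the first of which is the DogeCoin example.

```python
# Hypothetical predictions and actual prices, in dollars
predicted = [50, 40, 30]
actual = [38, 42, 30]

# Signed error for each prediction: 12, -2, 0
errors = [p - a for p, a in zip(predicted, actual)]

# MAE: average of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

# MSE: average of the squared errors
mse = sum(e ** 2 for e in errors) / len(errors)

print("MAE:", round(mae, 2))
print("MSE:", round(mse, 2))
```

The single 12-dollar miss contributes 144 of the 148 total squared error, which shows how strongly MSE weights the largest errors.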

Let's now go over a quick implementation. We will change our prediction target in the titanic dataset from survived to the fare each passenger paid. We will follow the same modeling steps as above and build a quick regression model to validate.

In [None]:
# Import our Regression model
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree

# Build our target and features data
reg_target = titanic_data_reg["fare"]
reg_features = titanic_data_reg.drop("fare", axis=1)

In [None]:
# Split our data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(reg_features, reg_target, test_size=0.3, random_state=42)

# Create our model and train it on the training set
reg_model = DecisionTreeRegressor()
reg_model.fit(X_train,y_train)

In [None]:
# Predict using our regression model
y_train_pred = reg_model.predict(X_train)
y_test_pred = reg_model.predict(X_test)

In [None]:
# Validate our regression model using MAE and MSE
from sklearn.metrics import mean_absolute_error, mean_squared_error

print("The mean absolute Error is: {}".format(mean_absolute_error(y_test, y_test_pred)))
print("The mean Squared Error is: {}".format(mean_squared_error(y_test, y_test_pred)))

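Now what can we interpret from these results? Your exact numbers will vary from run to run. In the run these notes describe, the mean absolute error was about 32, meaning our fare predictions were off by about 32 British pounds on average. The mean squared error was about 3457, but that number is in squared pounds, so it cannot be compared directly with the MAE. Taking its square root gives a root mean squared error of roughly 59 pounds, which is noticeably larger than the MAE and tells us that a few predictions missed by a lot.

So how do we know which error metric to use? Picking between MAE and MSE comes down to the application. Because MSE squares each error, a few large misses dominate it, so use MSE when large errors are especially costly and you want the model penalized heavily for them. MAE treats every unit of error the same, so it is the better choice when outliers should not dominate the score.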
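To make the outlier point concrete, here is a small sketch comparing the two metrics on made-up error lists that differ by a single large miss. The helper functions `mae` and `mse` are defined here for illustration and are not part of sklearn.

```python
import math

def mae(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

def mse(errors):
    """Mean of the squared errors."""
    return sum(e ** 2 for e in errors) / len(errors)

# Hypothetical prediction errors, in pounds
small_errors = [2, -3, 1, -2]
with_outlier = [2, -3, 1, -40]   # same as above, but one large miss

for name, errs in [("no outlier", small_errors), ("one outlier", with_outlier)]:
    print(name, "| MAE:", mae(errs), "| MSE:", mse(errs),
          "| RMSE:", round(math.sqrt(mse(errs)), 2))
```

One bad prediction multiplies the MAE by a bit under 6 but the MSE by roughly 90. The square root of the MSE (RMSE) is also printed because it is back in the original units, which makes it easier to read alongside MAE.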

# Congratulations Future Data Scientist/Machine Learning Engineer! 

## You've now added many awesome techniques to your toolbox. 

### Congrats on your final lesson!