# Objective 02 - Begin with baselines for classification
## Overview
When we fit a model to our data and look at the model score, we need something to compare that score against. As we covered in Module 1, a baseline is a simple estimate or prediction against which a model is compared. A baseline can be determined from descriptive statistics, such as the mean value of a variable, or even from a simple linear regression on two variables. We started by finding a baseline for a linear regression problem, where the target is a continuous variable. From our penguin data set, we used a baseline estimate of the ratio of flipper length to body mass.

In this module, we are going to focus on classification problems: a classification model predicts which class an observation belongs to. Classification problems deal with discrete (non-continuous) target variables. As an example of a classification problem, consider the Old Faithful geyser data set, where we have some information about how long each eruption lasts and the length of time between eruptions. There are two features (`duration` and `waiting`) and one target containing two classes (`kind`). We want to use the `duration` and `waiting` features to predict which class a geyser eruption belongs to (`long` or `short`).

## Classification Baseline
Before we set up our model, we need to start with a baseline. For a classification problem, a common starting place is to find the most common class and use that as a baseline. We'll start by considering a binary classification problem, where there are only two classes. In our geyser data set, there are two kinds of geyser eruptions, short and long, so this data set is suitable for a binary classification problem.

Why do we use the most common class as a starting baseline? If we think about how we make a prediction, we would be most likely to be correct if our guess is the most common class. Let's explore the data set to find the most common class and then calculate the accuracy of this baseline.
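To make this reasoning concrete before turning to the geyser data, here is a minimal sketch using only the Python standard library. The toy `labels` list and the helper `accuracy_of_constant_guess` are hypothetical names introduced for illustration; they are not part of the lesson's data. The sketch compares how often each possible constant guess would be correct.

```python
from collections import Counter

# Hypothetical toy labels for illustration (not taken from the geyser data)
labels = ["long", "short", "long", "long", "short"]


def accuracy_of_constant_guess(labels, guess):
    """Return the fraction of labels that equal a single, fixed guess."""
    return sum(label == guess for label in labels) / len(labels)


# The most common class and how many times it occurs
majority_class, majority_count = Counter(labels).most_common(1)[0]

# Compare every possible constant guess; the majority class scores highest
for guess in sorted(set(labels)):
    print(guess, accuracy_of_constant_guess(labels, guess))
```

Whatever the data, no other constant guess can score higher than the majority class, which is why it makes a sensible first baseline.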

### Follow Along

The eruption data set is available in the `seaborn` plotting library data sets, so it's easy to load and use. We'll start the usual way by loading the data and viewing it.

In [None]:
# Import numpy and seaborn
import numpy as np
import seaborn as sns

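# Load the Old Faithful geyser data set bundled with seaborn
geyser = sns.load_dataset("geyser")

# Show the first five rows, then the summary statistics
display(geyser.head())
display(geyser.describe())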

|   | duration | waiting | kind  |
|---|----------|---------|-------|
| 0 | 3.600    | 79      | long  |
| 1 | 1.800    | 54      | short |
| 2 | 3.333    | 74      | long  |
| 3 | 2.283    | 62      | short |
| 4 | 4.533    | 85      | long  |

|       | duration   | waiting    |
|-------|------------|------------|
| count | 272.000000 | 272.000000 |
| mean  | 3.487783   | 70.897059  |
| std   | 1.141371   | 13.594974  |
| min   | 1.600000   | 43.000000  |
| 25%   | 2.162750   | 58.000000  |
| 50%   | 4.000000   | 76.000000  |
| 75%   | 4.454250   | 82.000000  |
| max   | 5.100000   | 96.000000  |

The next step is to see how many observations are in each class. We can use the `value_counts()` method on the `kind` column.

In [None]:
# Count the observations for each type of eruption
geyser['kind'].value_counts()

long     172
short    100
Name: kind, dtype: int64

Here, we have more observations for the `long` class. If we predicted `long` for every observation, regardless of its duration and waiting values, we would be correct about 63% of the time on this data set (the number of `long` values divided by the total number of observations: 172/272 ≈ 0.63). This baseline accuracy is the number we would like to beat when we actually train and fit a model to our data set.
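The same arithmetic can be sketched in pandas. This is one possible way to compute it, not the only one. As an assumption made so the example runs on its own, the `kind` column is rebuilt from the counts shown above (172 `long`, 100 `short`); in the notebook, `geyser["kind"]` could be used instead.

```python
import pandas as pd

# Assumption: rebuild the `kind` column from the counts shown above
# (172 long, 100 short) so the sketch runs without loading the data.
# In the notebook, geyser["kind"] could be used instead.
kind = pd.Series(["long"] * 172 + ["short"] * 100, name="kind")

# Share of each class; the largest share is the baseline accuracy
class_shares = kind.value_counts(normalize=True)
majority_class = class_shares.idxmax()
baseline_accuracy = class_shares.max()

# The same number, computed as the accuracy of always predicting
# the majority class for every observation
always_majority = pd.Series(majority_class, index=kind.index)
accuracy = (always_majority == kind).mean()

print(majority_class, round(baseline_accuracy, 3))
```

Passing `normalize=True` to `value_counts()` returns proportions instead of raw counts, so the largest proportion is the baseline accuracy directly.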

### Challenge
Using the penguin data set and the iris data set, find the most common classes for the target variables ("sex" for the penguin data and "species" for the iris data). For each of these data sets, what is the baseline?

## Additional Resources
- [Old Faithful Geyser Data](http://www.stat.cmu.edu/~larry/all-of-statistics/=data/faithful.dat)