# Introduction
In this example we will use the Iris dataset we discussed in the slides. Our question is this: given a model trained on a portion of the Iris data, how well can we predict the remaining labels?

A more detailed version of this notebook can be found at:
http://bit.ly/3mHipAa

In [2]:
import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()

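|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|--------------|-------------|--------------|-------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |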


We will use an extremely simple generative model known as **Gaussian naive Bayes**, which proceeds by assuming each class is drawn from an axis-aligned Gaussian distribution.

Gaussian naive Bayes is fast and has no hyperparameters to choose, so it is a good baseline classifier to try before exploring whether more sophisticated models can improve on it.
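To make the "axis-aligned Gaussian" assumption concrete, here is a minimal NumPy sketch of the idea: each class gets one mean and one variance per feature, and a sample is assigned to the class with the highest log prior plus summed per-feature Gaussian log-likelihood. The function names, the toy data, and the small variance floor are assumptions made for illustration. This is not scikit-learn's implementation.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means, and feature variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Tiny floor keeps variances strictly positive (an assumption of this sketch).
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Pick the class maximizing log p(c) + sum_j log N(x_j | mean_cj, var_cj)."""
    diff = X[:, None, :] - means[None, :, :]            # shape (n, classes, features)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :]).sum(axis=2)
    log_post = np.log(priors)[None, :] + log_lik        # shape (n, classes)
    return classes[np.argmax(log_post, axis=1)]

# Hypothetical toy data: two well-separated 2-D blobs.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                   rng.normal(5.0, 1.0, size=(50, 2))])
y_toy = np.array(["a"] * 50 + ["b"] * 50)

params = fit_gaussian_nb(X_toy, y_toy)
toy_pred = predict_gaussian_nb(X_toy, *params)
toy_accuracy = np.mean(toy_pred == y_toy)
```

Because the features are treated as independent within each class, the model only needs a mean and a variance per feature and class, which is why fitting is so cheap.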

We would like to evaluate the model on data it has not seen before, and so we will split the data into a training set and a testing set. This could be done by hand, but it is more convenient to use the `train_test_split` utility function:

In [4]:
from sklearn.model_selection import train_test_split

In [6]:
X_iris = iris.drop('species', axis=1)
y_iris = iris['species']

Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris,
                                                random_state=1)
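For comparison, a by-hand split might look roughly like the sketch below: shuffle the row indices, then hold out a fraction of them for testing. The helper name `manual_train_test_split`, the demo arrays, and the seed are hypothetical. The 25% test fraction mirrors the default of `train_test_split`, but the shuffle itself will not reproduce scikit-learn's row selection.

```python
import numpy as np

def manual_train_test_split(X, y, test_fraction=0.25, seed=1):
    """Shuffle row indices and hold out a fraction of them as the test set."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(np.ceil(test_fraction * len(X)))   # round the test size up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Hypothetical stand-in data with the same number of rows as Iris (150).
X_demo = np.arange(300).reshape(150, 2)
y_demo = np.arange(150) % 3

Xtr, Xte, ytr, yte = manual_train_test_split(X_demo, y_demo)
print(len(Xtr), len(Xte))
```

With 150 rows and a 25% test fraction, this holds out 38 rows and trains on the remaining 112.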

In [7]:
from sklearn.naive_bayes import GaussianNB # 1. choose model class
model = GaussianNB()                       # 2. instantiate model
model.fit(Xtrain, ytrain)                  # 3. fit model to data
y_model = model.predict(Xtest)             # 4. predict on new data

Finally, we can use the `accuracy_score` utility to see the fraction of predicted labels that match their true values:

In [8]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)

0.9736842105263158
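The score above is simply the share of test labels predicted correctly. A minimal sketch of the same calculation in plain NumPy follows. The helper name `manual_accuracy` and the four example labels are made up for illustration.

```python
import numpy as np

def manual_accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hypothetical labels: three of the four predictions match.
true_labels = ["setosa", "versicolor", "virginica", "virginica"]
pred_labels = ["setosa", "versicolor", "versicolor", "virginica"]

demo_accuracy = manual_accuracy(true_labels, pred_labels)
print(demo_accuracy)  # 0.75
```

Read this way, 0.9736842105263158 equals 37/38, which is consistent with a 38-row test set in which a single label was mispredicted.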