# First Model with Scikit-Learn

We start by building predictive models on pandas dataframes using only numerical features.
In particular, we will highlight:
* the scikit-learn API: `.fit(X, y)`, `.predict(X)`, `.score(X, y)`; and
* how to evaluate the generalization performance of a model with a train-test split.

We are going to use the adult census data again, this time from `adult-census-numeric.csv`, which contains only the numerical variables.

In [30]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Disable jedi autocompleter
%config Completer.use_jedi = False

In [31]:
# Load only numerical variables from the data
df = pd.read_csv("data/adult-census-numeric.csv")
df.head(5)

Unnamed: 0,age,capital-gain,capital-loss,hours-per-week,class
0,41,0,0,92,<=50K
1,48,0,0,40,<=50K
2,60,0,0,25,<=50K
3,37,0,0,45,<=50K
4,73,3273,0,40,<=50K


Our target, the column we would like to predict, is `class`. The remaining columns are the data used to train our predictive model.

The first step is to separate the target from the data.

In [32]:
target = df['class']
data = df.drop(columns=['class',])
data.head()

Unnamed: 0,age,capital-gain,capital-loss,hours-per-week
0,41,0,0,92
1,48,0,0,40
2,60,0,0,25
3,37,0,0,45
4,73,3273,0,40


In [33]:
data.shape

(39073, 4)

## Fit a Model and Make Predictions

We will build a classification model using the "K-nearest neighbors" strategy. To predict the target of a new sample, a k-nearest neighbors model takes the $k$ closest samples in the training set and predicts the majority target among them.
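The majority-vote idea can be sketched in a few lines of NumPy. This is only an illustration, not scikit-learn's implementation: the helper `knn_predict`, the Euclidean distance, and the tiny one-feature dataset are all assumptions made up for the example.

```python
import numpy as np
from collections import Counter


def knn_predict(X_train, y_train, x_new, k=5):
    """Predict one label by majority vote among the k nearest training samples."""
    # Euclidean distance from the new sample to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Most frequent label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Made-up toy data: a single feature (age) and two classes
X_toy = np.array([[20.0], [22.0], [25.0], [50.0], [55.0], [60.0]])
y_toy = np.array(["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"])

# The three training samples closest to age 52 all belong to the same class
prediction = knn_predict(X_toy, y_toy, np.array([52.0]), k=3)
print(prediction)
```

With `k=3`, the neighbors of the new sample are the samples aged 50, 55 and 60, so the vote is unanimous.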

<div class="alert alert-block alert-warning">
<b>Caution:</b> <br>
We use a K-nearest neighbors model here. However, be aware that it is seldom useful
in practice. We use it because it is an intuitive algorithm. In the next
notebook, we will introduce better models.</div>

The `fit` method is called to train the model from the input and target data.

In [34]:
# to display a nice model diagram
from sklearn import set_config
set_config(display='diagram')

In [35]:
from sklearn.neighbors import KNeighborsClassifier

# 1. we first declare the model using the classifier's constructor
model = KNeighborsClassifier()

# 2. Then we train the model using the training data and training targets
model.fit(data, target)

Learning can be represented as follows:

![image.png](attachment:2e91861a-a8b8-483f-9c11-7e1df1fcdcd0.png)

The method `fit` is composed of two elements:  

1. A learning algorithm,
2. Some model states.

The learning algorithm takes the training data and training target as input and sets the model states. These model states will be used later to either predict (for classifiers and regressors) or transform data (for transformers).
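To make the notion of model states more concrete, the sketch below fits a classifier on a small made-up dataset and inspects a few attributes that `fit` sets. In scikit-learn, attributes learned during `fit` end with an underscore. The names `X_toy`, `y_toy` and `toy_model` are placeholders for this illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up toy data: two numerical features and two classes
X_toy = np.array([[20, 0], [22, 0], [25, 10], [50, 0], [55, 5], [60, 0]])
y_toy = np.array(["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"])

toy_model = KNeighborsClassifier(n_neighbors=3)
toy_model.fit(X_toy, y_toy)

# Attributes set by `fit` (trailing underscore by convention)
print(toy_model.classes_)         # class labels seen during fit
print(toy_model.n_features_in_)   # number of input features
print(toy_model.n_samples_fit_)   # number of training samples stored
```

For a k-nearest neighbors model, the state is essentially the stored training set, which is later searched at prediction time.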

Let's use our model to make some predictions using the same dataset.

In [36]:
# 3. We predict the outcome for the same data the model was trained on
target_predicted = model.predict(data)

We can illustrate the prediction mechanism as follows:

![image.png](attachment:65225a2a-0bd5-41d8-9b52-bda9355091eb.png)

To predict, a model uses a prediction function that will use the input data together with the model states.  

Let's now have a look at the computed predictions.

In [37]:
# compare the first five true targets with the predicted targets
target[:5] == target_predicted[:5]

0    False
1     True
2     True
3     True
4     True
Name: class, dtype: bool

In [38]:
print("Number of correct predictions: "
      f"{(target == target_predicted).sum()} / {len(target)}")

Number of correct predictions: 32135 / 39073


To get a better assessment, we can compute the average success rate:

In [39]:
(target == target_predicted).mean()

0.8224349294909529

## Train-Test Data Split

<div class="alert alert-block alert-warning">
When building a machine learning model, it is important to evaluate the trained model on data that was not used to fit it. Only then does the score measure generalization rather than memorization of the training set.
</div>

The data used to fit a model is called training data, while the data used to assess a model is called testing data. Usually we have a single data set and split it into training and testing parts with a scikit-learn helper such as `train_test_split`. In this example, the data already comes as two CSV files, one containing only training data and the other only testing data. So, let's start by loading the testing data set.
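For the more common case of a single data set, one scikit-learn helper for this kind of split is `train_test_split`. The sketch below uses a made-up array rather than the census data; the `test_size` and `random_state` values are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Hold out 20% of the samples for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

Each sample ends up in exactly one of the two parts, so the model never sees the test samples during `fit`.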

In [40]:
# load test data set
adult_census_test = pd.read_csv('data/adult-census-numeric-test.csv')

In [41]:
adult_census_test.shape

(9769, 5)

From this newly loaded data, we separate the input features from the target to predict.  
Note that the target corresponds to `y` and the data to `X` in the scikit-learn API.

In [42]:
target_test = adult_census_test['class']
data_test = adult_census_test.drop(columns=['class',])

Instead of computing the predictions and then the average success rate by hand, we can use the method `score`. For classifiers, this method returns the mean accuracy.

In [43]:
accuracy = model.score(data_test, target_test)
model_name = model.__class__.__name__

print(f"The test accuracy using a {model_name} is {accuracy:.3f}")

The test accuracy using a KNeighborsClassifier is 0.807


Let's check the underlying mechanism when the score method is called:

![image.png](attachment:96c5aec9-7b9c-4ea9-b474-173076aab011.png)

To compute the score, the predictor first computes the predictions (using the predict method) and then uses a scoring function to compare the true target $y$ and the predictions. Finally, the score is returned.
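This mechanism can be checked on a small example: for a classifier, calling `score` should give the same value as calling `predict` and passing the result to `accuracy_score`. The data below is made up for the illustration, and the variable names are placeholders.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data: a single feature (age) and two classes
X_toy = np.array([[20], [22], [25], [50], [55], [60]])
y_toy = np.array(["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"])
toy_model = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# Made-up held-out samples with their true targets
X_new = np.array([[21], [58], [30], [45]])
y_new = np.array(["<=50K", ">50K", ">50K", ">50K"])

# Manual route: predict, then compare with the true targets
predictions = toy_model.predict(X_new)
manual = accuracy_score(y_new, predictions)

# Shortcut: `score` performs both steps internally
builtin = toy_model.score(X_new, y_new)

print(manual, builtin)
```

Both routes give the same number, since `score` on a classifier is a convenience wrapper around prediction followed by accuracy.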

Compared with the accuracy of 0.822 obtained by wrongly evaluating the model on the training set, the accuracy of 0.807 on the held-out test set is lower: the training-set evaluation was indeed optimistic.

This shows the importance of always testing the generalization performance of predictive models on a different set than the one used to train them. We will discuss later in more detail how predictive models should be evaluated.