# 📝 Exercise M1.03

The goal of this exercise is to compare the statistical performance of our
classifier (81% accuracy) to some baseline classifiers that would ignore the
input data and instead make constant predictions.

- What would be the score of a model that always predicts `' >50K'`?
- What would be the score of a model that always predicts `' <=50K'`?
- Is 81% accuracy a good score for this problem?

Use a `DummyClassifier` with a train-test split to evaluate its accuracy on
the test set. The scikit-learn
[documentation on dummy estimators](https://scikit-learn.org/stable/modules/model_evaluation.html#dummy-estimators)
shows a few examples of how to evaluate the statistical performance of these
baseline models.

In [1]:
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

We will first split our dataset to have the target separated from the data
used to train our predictive model.

In [2]:
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=target_name)

We start by selecting only the numerical columns as seen in the previous
notebook.

In [44]:
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]

data_numeric = data[numerical_columns]

Split the dataset into train and test sets.

In [11]:
from sklearn.model_selection import train_test_split

# Hold out 10% of the rows for testing; fix the seed for reproducibility.
data_train, data_test, target_train, target_test = train_test_split(
    data_numeric, target, test_size=0.1, random_state=20
)
print(data_train)
print(data_test)
print(target_test)

       age  capital-gain  capital-loss  hours-per-week
40391   19             0             0              10
43087   46             0             0              60
1415    43             0             0              35
27848   56             0             0              40
4751    23             0             0              30
...    ...           ...           ...             ...
23452   24             0             0               8
23775   31             0             0              40
37135   50         15024             0              40
27098   35          5013             0              70
48483   28             0             0              45

[43957 rows x 4 columns]
       age  capital-gain  capital-loss  hours-per-week
21601   28             0             0              40
45922   46             0             0              36
29979   37             0             0              40
37053   71             0             0              75
18094   22             0             0 

Use a `DummyClassifier` such that the resulting classifier will always
predict the class `' >50K'`. What is the accuracy score on the test set?
Repeat the experiment by always predicting the class `' <=50K'`.

Hint: you can refer to the parameter `strategy` of the `DummyClassifier`
to achieve the desired behaviour.
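
As a minimal sketch of how `strategy` behaves, the toy example below fits dummy classifiers on made-up data (the arrays `X` and `y` are illustrative assumptions, not the census dataset). A constant predictor's accuracy is simply the fraction of samples carrying the predicted label.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data, made up for illustration: the features are ignored by dummy models.
X = np.zeros((10, 3))
y = np.array([" <=50K"] * 7 + [" >50K"] * 3)

# strategy="constant" always predicts the label passed as `constant`.
always_high = DummyClassifier(strategy="constant", constant=" >50K").fit(X, y)
always_low = DummyClassifier(strategy="constant", constant=" <=50K").fit(X, y)
# strategy="most_frequent" predicts the majority class seen during fit.
most_frequent = DummyClassifier(strategy="most_frequent").fit(X, y)

high_score = always_high.score(X, y)
low_score = always_low.score(X, y)
frequent_score = most_frequent.score(X, y)
print(high_score, low_score, frequent_score)
```

On this toy data the majority class is `' <=50K'`, so `most_frequent` and the constant `' <=50K'` predictor should give the same accuracy.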

In [45]:
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

# help(DummyClassifier)
# Constant baselines: always predict one given class.
inf50K = DummyClassifier(strategy="constant", constant=" <=50K")
sup50K = DummyClassifier(strategy="constant", constant=" >50K")
# Other baselines that ignore the input features.
most_frequent = DummyClassifier(strategy="most_frequent")
stratified = DummyClassifier(strategy="stratified")
prior = DummyClassifier(strategy="prior")
uniform = DummyClassifier(strategy="uniform")
# Reference model that actually uses the numerical features.
model = LogisticRegression()

inf50K.fit(data_train, target_train)
sup50K.fit(data_train, target_train)
most_frequent.fit(data_train, target_train)
stratified.fit(data_train, target_train)
prior.fit(data_train, target_train)
uniform.fit(data_train, target_train)
model.fit(data_train, target_train)

print(target_test.value_counts())
print(f"The test set contains {(target_test == ' <=50K').sum()} rows labeled <=50K, i.e. {(target_test == ' <=50K').sum() * 100 / target_test.shape[0]:.1f}%")
print(f"The test set contains {(target_test == ' >50K').sum()} rows labeled >50K, i.e. {(target_test == ' >50K').sum() * 100 / target_test.shape[0]:.1f}%")
print()
print(f"Dummy constant:<=50K score: {inf50K.score(data_test, target_test):.3f}")
print(f"Dummy constant:>50K score: {sup50K.score(data_test, target_test):.3f}")
print(f"Dummy most_frequent score: {most_frequent.score(data_test, target_test):.3f}")
print(f"Dummy stratified score: {stratified.score(data_test, target_test):.3f}")
print(f"Dummy prior score: {prior.score(data_test, target_test):.3f}")
print(f"Dummy uniform score: {uniform.score(data_test, target_test):.3f}")
print()
print(f"Logistic regression score: {model.score(data_test, target_test):.3f}")

 <=50K    3702
 >50K     1183
Name: class, dtype: int64
The test set contains 3702 rows labeled <=50K, i.e. 75.8%
The test set contains 1183 rows labeled >50K, i.e. 24.2%

Dummy constant:<=50K score: 0.758
Dummy constant:>50K score: 0.242
Dummy most_frequent score: 0.758
Dummy stratified score: 0.631
Dummy prior score: 0.758
Dummy uniform score: 0.499

Logistic regression score: 0.803
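
The constant baselines can be checked by hand from the class counts printed above: a model that always predicts one class is right exactly as often as that class appears in the test set. The short sketch below redoes that arithmetic with the counts and the logistic regression score copied from the output; the variable names are illustrative only.

```python
# Class counts and logistic regression score copied from the output above.
n_low, n_high = 3702, 1183
logistic_acc = 0.803
n_total = n_low + n_high

# Accuracy of a constant predictor = share of the predicted class.
acc_always_low = n_low / n_total
acc_always_high = n_high / n_total
# Improvement of the logistic regression over the majority-class baseline.
gain = logistic_acc - acc_always_low

print(f"always ' <=50K': {acc_always_low:.3f}")   # 0.758
print(f"always ' >50K':  {acc_always_high:.3f}")  # 0.242
print(f"gain over majority baseline: {gain:.3f}")  # 0.045
```

Read this way, the logistic regression on these four numerical features beats the majority-class baseline by only about 4.5 percentage points on this split, which suggests that an accuracy around 80% is a modest result for this problem.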
