# Dummy Model

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_validate
import seaborn as sns
sns.set(style='darkgrid', context='talk', palette='Set2')


## Read the data

In [29]:
df = pd.read_csv('../data/train_data.csv')
# create a BMI-style column (weight / height^2)
# note: it is not listed in numerical_features below, so the column selection drops it
df['weightOverHeightSquared'] = df['Weight'] / df['Height'] ** 2

numerical_features = ['Weight', 'Height', 'Age', 'Meal_Count', 'Phys_Act', 'Water_Consump']
categorical_features = ['Smoking', 'Alcohol_Consump', 'Transport']
target = 'Body_Level'

df = df[numerical_features + categorical_features + [target]]

# encode categorical features
df = pd.get_dummies(df, columns=categorical_features)


## Prepare the data

In [30]:
X = df.drop('Body_Level', axis=1) # X contains all features except the target
# X = df.drop('Alcohol_Consump_Always', axis=1) 

# take the weight and height columns only
# X = X[['Weight', 'Height']]
y = df['Body_Level'] # y contains only the target
y = y.map({'Body Level 1': 0, 'Body Level 2': 1, 'Body Level 3': 2, 'Body Level 4': 3}) # encode target to numerical values

## Dummy Model
The DummyClassifier class in scikit-learn provides several baseline strategies that ignore the input features: always predicting the most frequent class label, predicting a class label uniformly at random, or sampling predictions from the class distribution of the training set. A real model should beat these baselines before it is considered useful.


**Strategies for generating predictions:**
1. "most_frequent":
   - The predict method always returns the most frequent class label in the y passed to fit.
   - The predict_proba method returns the matching one-hot encoded vector.
2. "prior":
   - The predict method always returns the most frequent class label in the y passed to fit (like "most_frequent").
   - The predict_proba method always returns the empirical class distribution of y, also known as the empirical class prior distribution.
3. "stratified":
   - The predict_proba method randomly samples one-hot vectors from a multinomial distribution parametrized by the empirical class prior probabilities.
   - The predict method returns the class label that got probability one in the one-hot vector of predict_proba. Each sampled row of both methods is therefore independent and identically distributed.
4. "uniform":
   - Generates predictions uniformly at random from the list of unique classes observed in y, i.e. each class has equal probability.
5. "constant":
   - Always predicts a constant label that is provided by the user. This is useful for metrics that evaluate a non-majority class.
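
The behaviour of each strategy can be checked on a tiny example. The sketch below uses hypothetical toy data (`X_toy`, `y_toy`, not the notebook's dataset) with class frequencies 0.6, 0.3 and 0.1; the features are all zeros because DummyClassifier ignores them. The `random_state` values are an assumption added only to make the random strategies repeatable.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy data: DummyClassifier ignores the features entirely
X_toy = np.zeros((10, 1))
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])  # class frequencies 0.6, 0.3, 0.1

most_frequent = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
prior = DummyClassifier(strategy="prior").fit(X_toy, y_toy)
stratified = DummyClassifier(strategy="stratified", random_state=0).fit(X_toy, y_toy)
uniform = DummyClassifier(strategy="uniform", random_state=0).fit(X_toy, y_toy)
constant = DummyClassifier(strategy="constant", constant=2).fit(X_toy, y_toy)

mf_pred = most_frequent.predict(X_toy[:3])           # always the majority class (0)
mf_proba = most_frequent.predict_proba(X_toy[:1])    # one-hot on the majority class
prior_pred = prior.predict(X_toy[:3])                # same labels as most_frequent
prior_proba = prior.predict_proba(X_toy[:1])         # empirical class distribution
strat_proba = stratified.predict_proba(X_toy)        # randomly sampled one-hot rows
uniform_proba = uniform.predict_proba(X_toy[:1])     # equal probability per class
constant_pred = constant.predict(X_toy[:3])          # always the user-supplied label (2)

print("most_frequent:", mf_pred, mf_proba)
print("prior:        ", prior_pred, prior_proba)
print("stratified:   ", strat_proba[:3])
print("uniform:      ", uniform_proba)
print("constant:     ", constant_pred)
```

Note that "most_frequent" and "prior" differ only in predict_proba, so metrics computed from hard predictions (accuracy, F1) are identical for the two, while probability-based metrics such as log loss are not.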



In [31]:
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

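# Fit a uniform-random baseline on the full data set.
# Note: despite the variable name, this is not ZeroR, which corresponds to strategy='most_frequent'.
zeroR = DummyClassifier(strategy='uniform')
zeroR.fit(X, y)

# cross_validate clones and refits the estimator on every fold, so the fit above does not affect these scores
cv_results = cross_validate(zeroR, X, y, cv=10, scoring=['f1_macro', 'accuracy', 'f1_micro'])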

print('accuracy: ', cv_results['test_accuracy'].mean())
print('f1_macro: ', cv_results['test_f1_macro'].mean())
print('f1_micro: ', cv_results['test_f1_micro'].mean())

accuracy:  0.2582751744765702
f1_macro:  0.23978499507790643
f1_micro:  0.2582751744765702
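
To put the scores above in context, it can help to run several strategies through the same cross-validation. The sketch below does this on synthetic data, since the notebook's CSV is not assumed to be available: `X_syn`, `y_syn` and the class proportions are hypothetical, and `random_state=0` is added only for repeatability. The macro F1 scorer is built with `zero_division=0` so that classes that are never predicted count as 0 without raising warnings.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_validate

# Synthetic, imbalanced 4-class target (hypothetical proportions, not the real data)
rng = np.random.default_rng(0)
n_samples = 1000
X_syn = np.zeros((n_samples, 1))  # features are ignored by DummyClassifier
y_syn = rng.choice(4, size=n_samples, p=[0.10, 0.15, 0.30, 0.45])

scoring = {
    "accuracy": "accuracy",
    "f1_micro": "f1_micro",
    # classes that are never predicted get an F1 of 0 instead of a warning
    "f1_macro": make_scorer(f1_score, average="macro", zero_division=0),
}

results = {}
for strategy in ["most_frequent", "stratified", "uniform"]:
    clf = DummyClassifier(strategy=strategy, random_state=0)
    cv = cross_validate(clf, X_syn, y_syn, cv=10, scoring=scoring)
    # average each metric over the 10 folds
    results[strategy] = {name: cv[f"test_{name}"].mean() for name in scoring}
    print(strategy, results[strategy])
```

Two things in the notebook's output follow from general properties rather than from this data set. For single-label multiclass problems, micro-averaged F1 equals accuracy, which is why the two values printed above are identical. And uniform guessing over four classes has an expected accuracy of 1/4, which matches the observed value of about 0.258. On an imbalanced target, "most_frequent" typically scores higher accuracy than "uniform", so it is often the stricter accuracy baseline.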
