## Classification with Naive Bayes Classifier for Hong Kong Horse Racing
This notebook uses a different model for the same task as [my other notebook with a Deep Neural Network](https://www.kaggle.com/cullensun/deep-learning-model-for-horse-racing). The Naive Bayes classifier is a probabilistic classifier based on Bayes' theorem. Let's see if it does better.

## Import packages

In [None]:
import pandas as pd
import numpy as np
import sklearn.preprocessing as preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

## Read inputs
Here, I select the features that I think are useful. I also join runs.csv and races.csv because they are related and each contains some features for the classification.

In [None]:
races = pd.read_csv(r"../input/hkracing/races.csv", delimiter=",", header=0, index_col='race_id')
races_data = races[['venue', 'race_no', 'config', 'surface', 'distance', 'going', 'horse_ratings', 'race_class']]
runs = pd.read_csv(r"../input/hkracing/runs.csv", delimiter=",", header=0)
runs_data = runs[['race_id', 'result', 'won', 'horse_age', 'horse_country', 'horse_type', 'horse_rating',
                  'declared_weight', 'actual_weight', 'draw', 'win_odds']] 
data = runs_data.join(races_data, on='race_id')
# drop race_id after join because it's not a feature
data = data.drop(columns=['race_id'])
print(data.head())

## Data preprocessing
- Drop rows with NaN.
- Use different encoders for ordinal and nominal columns.


In [None]:
# remove rows with NaN
print(data[data.isnull().any(axis=1)])
print('data shape before drop NaN rows', data.shape)
data.dropna(inplace=True)
data.reset_index(drop=True, inplace=True)
print('data shape after drop NaN rows', data.shape)

# encode ordinal columns: config, going, horse_ratings
encoder = preprocessing.OrdinalEncoder()
data['config'] = encoder.fit_transform(data['config'].values.reshape(-1, 1))
data['going'] = encoder.fit_transform(data['going'].values.reshape(-1, 1))
data['horse_ratings'] = encoder.fit_transform(data['horse_ratings'].values.reshape(-1, 1))

# encode nominal columns: horse_country, horse_type, venue
lb_encoder = preprocessing.LabelEncoder()
data['horse_country'] = lb_encoder.fit_transform(data['horse_country'])
data['horse_type'] = lb_encoder.fit_transform(data['horse_type'])
data['venue'] = lb_encoder.fit_transform(data['venue'])

print(data.dtypes)
print(data.head())
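
One caveat about the ordinal step above: by default, `OrdinalEncoder` numbers categories in sorted (alphabetical) order, which may not match their real-world ranking. The sketch below shows how an explicit order could be passed through the `categories` parameter. The `going` values and their ranking are toy assumptions for illustration, not the actual contents of races.csv.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy values (assumption): not the real 'going' categories from races.csv
toy = pd.DataFrame({'going': ['SOFT', 'FIRM', 'GOOD', 'FIRM']})

# Default behaviour: categories are sorted, so FIRM=0, GOOD=1, SOFT=2
default_codes = OrdinalEncoder().fit_transform(toy[['going']]).ravel()

# Explicit order (assumed ranking from softest to firmest): SOFT=0, GOOD=1, FIRM=2
explicit_encoder = OrdinalEncoder(categories=[['SOFT', 'GOOD', 'FIRM']])
explicit_codes = explicit_encoder.fit_transform(toy[['going']]).ravel()

print('default :', default_codes)
print('explicit:', explicit_codes)
```

If the alphabetical order happens to match the intended ranking, the default is fine; otherwise the explicit list keeps the codes meaningful.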

## Feature selection
Feature selection matters a lot. When I tried adding or removing columns before fitting, I found that dropping some features changed the performance significantly. I then found this [article](https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e) and learned how to find the best features. To my surprise, win_odds is a very important feature.

After finding the important features, I simply use them as the fitting data.

In [None]:
# feature selection
# result and won are outputs, the rest are inputs
X = data.drop(columns=['result', 'won'])
y = data['won']

# apply SelectKBest class to extract top 10 best features
best_features = SelectKBest(score_func=chi2, k=10)
fit = best_features.fit(X, y)
df_scores = pd.DataFrame(fit.scores_)
df_columns = pd.DataFrame(X.columns)
# concat two dataframes for better visualization 
feature_scores = pd.concat([df_columns, df_scores], axis=1)
feature_scores.columns = ['features', 'score']  
print(feature_scores.nlargest(10, 'score')) 

# choose the top 10 features only
X = data[['win_odds', 'draw', 'declared_weight', 'actual_weight', 'horse_rating', 
          'horse_country', 'venue', 'race_no', 'horse_ratings', 'race_class']]
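
The column list above is typed by hand from the printed scores. As an alternative, the fitted selector can report which columns it kept through `get_support()`. The sketch below runs on synthetic data; the column names `informative`, `noise_a` and `noise_b` are made up for illustration and do not come from the racing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Synthetic binary target and three non-negative features (chi2 requires non-negative values)
y_toy = rng.integers(0, 2, size=500)
X_toy = pd.DataFrame({
    'informative': y_toy * 5 + rng.integers(0, 3, size=500),  # depends strongly on the target
    'noise_a': rng.integers(0, 10, size=500),                 # independent of the target
    'noise_b': rng.integers(0, 10, size=500),                 # independent of the target
})

# Fit the selector, then read the kept column names from the boolean mask
selector = SelectKBest(score_func=chi2, k=2).fit(X_toy, y_toy)
selected = list(X_toy.columns[selector.get_support()])
X_selected = X_toy[selected]

print(selected)
```

This keeps the selected columns in sync with the scores if `k` or the input features change later.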

## Fitting with Naive Bayes Classifier 
Here, I use k-fold cross-validation and calculate the precision. Luckily, it takes just 3 lines of code with the help of sklearn.

In [None]:
mnb = MultinomialNB()
scores = cross_val_score(mnb, X, y, cv=10, scoring='precision')
average_precision = sum(scores) / len(scores) 
print(f'MultinomialNB average precision: {average_precision}')
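
To judge whether an average precision is good, it helps to compare it with the fraction of winners in the data, since betting at random would be expected to achieve roughly that precision. The sketch below illustrates the comparison on synthetic count data; the 10% win rate and the two features are assumptions for illustration, not values from the racing dataset. It also spells out the stratified folds explicitly with shuffling.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(42)
n = 1000

# Synthetic imbalanced target: roughly 10% positives (assumed rate)
y_toy = (rng.random(n) < 0.1).astype(int)

# Two non-negative count features whose rates depend on the class
X_toy = np.column_stack([
    rng.poisson(lam=np.where(y_toy == 1, 8, 2)),
    rng.poisson(lam=np.where(y_toy == 1, 2, 8)),
])

# Stratified folds keep the share of positives similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(MultinomialNB(), X_toy, y_toy, cv=cv, scoring='precision')

base_rate = y_toy.mean()  # expected precision of betting at random
print(f'mean precision: {scores.mean():.3f}, base rate: {base_rate:.3f}')
```

On the real data, the same comparison would use `y.mean()` as the baseline.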

## Conclusion 

I chose `precision` as the evaluation metric because we are interested in betting on winners.

$\text{precision} = \frac{TP}{TP + FP}$

If `precision = 0.1`, it means that out of 10 bets on a horse to win, only one is correct. Generally speaking, the model generalizes poorly.
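
As a small worked example of the formula, the sketch below counts true and false positives by hand for made-up predictions (the labels are illustrative, not model output): 10 bets placed, 1 of them on an actual winner.

```python
# Made-up labels: 1 = horse won (y_true) or we bet on it to win (y_pred)
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# True positives: we bet on the horse and it won
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
# False positives: we bet on the horse and it lost
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

precision = tp / (tp + fp)
print(f'TP={tp}, FP={fp}, precision={precision}')  # TP=1, FP=9, precision=0.1
```

Note that the missed winner (the second-to-last entry) does not affect precision; it would only lower recall.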
