# Data Mining and Statistics
## Session 5 - Classification - Answers
*Peter Stikker - Haarlem, the Netherlands - v 1.1*

----

In [10]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import CategoricalNB
try:
    import matplotlib.pyplot as plt
    print('PyPlot already installed, only imported')
except ImportError:
    !pip install matplotlib
    import matplotlib.pyplot as plt
    print('PyPlot was not installed, installed and imported')
try:
    import seaborn as sn
    print('seaborn already installed, only imported')
except ImportError:
    !pip install seaborn
    import seaborn as sn
    print('seaborn was not installed, installed and imported')

PyPlot already installed, only imported
seaborn already installed, only imported


**Exercise 1**

Determine who is most likely to win in a match between a Southpaw fighter and an Orthodox fighter.

Use the UFC2019.csv dataset and of course a naive Bayesian analysis.

***My answer***

First load the data.

In [11]:
UFCdata=pd.read_csv('data/UFC2019.csv',sep = ',', header=0)
UFCdata.head()

Unnamed: 0,R_fighter,B_fighter,Referee,date,location,Winner,title_bout,weight_class,no_of_rounds,B_current_lose_streak,...,R_win_by_KO/TKO,R_win_by_Submission,R_win_by_TKO_Doctor_Stoppage,R_wins,R_Stance,R_Height_cms,R_Reach_cms,R_Weight_lbs,B_age,R_age
0,Henry Cejudo,Marlon Moraes,Marc Goddard,2019-06-08,"Chicago, Illinois, USA",Red,True,Bantamweight,5,0.0,...,2.0,0.0,0.0,8.0,Orthodox,162.56,162.56,135.0,31.0,32.0
1,Valentina Shevchenko,Jessica Eye,Robert Madrigal,2019-06-08,"Chicago, Illinois, USA",Red,True,Women's Flyweight,5,0.0,...,0.0,2.0,0.0,5.0,Southpaw,165.1,167.64,125.0,32.0,31.0
2,Tony Ferguson,Donald Cerrone,Dan Miragliotta,2019-06-08,"Chicago, Illinois, USA",Red,False,Lightweight,3,0.0,...,3.0,6.0,1.0,14.0,Orthodox,180.34,193.04,155.0,36.0,35.0
3,Jimmie Rivera,Petr Yan,Kevin MacDonald,2019-06-08,"Chicago, Illinois, USA",Blue,False,Bantamweight,3,0.0,...,1.0,0.0,0.0,6.0,Orthodox,162.56,172.72,135.0,26.0,29.0
4,Tai Tuivasa,Blagoy Ivanov,Dan Miragliotta,2019-06-08,"Chicago, Illinois, USA",Blue,False,Heavyweight,3,0.0,...,2.0,0.0,0.0,3.0,Southpaw,187.96,190.5,264.0,32.0,26.0


I don't need all the data, only the winner and the stance of each fighter:

In [12]:
#Create subset of data and remove missing values
UFCsel=UFCdata[["Winner","R_Stance","B_Stance"]]
UFCsel=UFCsel.dropna()
UFCsel.head()

Unnamed: 0,Winner,R_Stance,B_Stance
0,Red,Orthodox,Orthodox
1,Red,Southpaw,Orthodox
2,Red,Orthodox,Orthodox
3,Blue,Orthodox,Switch
4,Blue,Southpaw,Southpaw


Now let's assign each field to the appropriate type and then convert it to a numerical version:

In [None]:
#Convert the pandas columns to categorical
UFCsel["Winner"]=pd.Categorical(UFCsel["Winner"])
UFCsel["R_Stance"]=pd.Categorical(UFCsel["R_Stance"])
UFCsel["B_Stance"]=pd.Categorical(UFCsel["B_Stance"])

#get the numerical codes as a numpy array with columns Winner, R_Stance, B_Stance
selNum = np.column_stack((
    UFCsel["Winner"].cat.codes,
    UFCsel["R_Stance"].cat.codes,
    UFCsel["B_Stance"].cat.codes,
))

selNum[0:5]

array([[2, 1, 1],
       [2, 3, 1],
       [2, 1, 1],
       [0, 1, 4],
       [0, 3, 3]], dtype=int8)

Creating the model and testing its accuracy:

In [None]:
#Set the independent and dependent variable(s)
X=selNum[:,1:3]     ##the R_Stance and B_Stance codes
y=selNum[:,0]       ##the Winner code
X

array([[1, 1],
       [3, 1],
       [1, 1],
       ...,
       [3, 1],
       [1, 1],
       [1, 1]], dtype=int8)

In [None]:
#Create the model
clf = CategoricalNB()
clf = clf.fit(X, y)

#Show some results
print(clf.score(X,y))

0.6780667622363301


So the model predicts the winner correctly for about 68% of the fights it was trained on.
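Because `clf.score(X,y)` above evaluates the model on the same rows it was fitted on, the result is probably a bit optimistic. A minimal sketch of a held-out check follows, assuming randomly generated stand-in data (not the UFC data). The names `X_demo`, `y_demo`, `clf_demo` and `baseline` are made up for this illustration. It also compares the model with simply always predicting the most frequent class, which is a useful reference point when one class dominates.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB

# Hypothetical stand-in data: two categorical features coded 0..4 and a
# three-class target, generated at random (NOT the UFC data).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(500, 2))
y_demo = rng.integers(0, 3, size=500)

# Keep 30% of the rows apart for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# min_categories tells the model how many codes each feature can take, so a
# code that happens to be missing from the training rows causes no error later.
clf_demo = CategoricalNB(min_categories=5).fit(X_train, y_train)

train_acc = clf_demo.score(X_train, y_train)
test_acc = clf_demo.score(X_test, y_test)

# Reference point: always predict the most frequent class of the training rows
majority = np.bincount(y_train).argmax()
baseline = np.mean(y_test == majority)

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
print(f"baseline:       {baseline:.3f}")
```

With the real data, `X_demo` and `y_demo` would be replaced by `X` and `y` from the cell above.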

To see which codes were used for Southpaw and Orthodox, we can look at the categories of the two Stance columns:

In [None]:
print(UFCsel["R_Stance"].cat.categories)
print(UFCsel["B_Stance"].cat.categories)

Index(['Open Stance', 'Orthodox', 'Sideways', 'Southpaw', 'Switch'], dtype='object')
Index(['Open Stance', 'Orthodox', 'Sideways', 'Southpaw', 'Switch'], dtype='object')


So Southpaw is coded as 3 and Orthodox as 1 (counting starts at 0).
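Instead of reading the code off the printed index, it can also be looked up by label. A minimal sketch, assuming a small hand-made list of stances (not the UFC data); `stances`, `southpaw_code` and `orthodox_code` are illustrative names. `get_loc` returns the position of a label within the categories, which is the code pandas assigns to it.

```python
import pandas as pd

# Hypothetical list containing the five stances seen in the categories above
stances = pd.Categorical(
    ["Orthodox", "Southpaw", "Orthodox", "Switch", "Open Stance", "Sideways"]
)

# The position of a label in .categories is its numerical code
southpaw_code = stances.categories.get_loc("Southpaw")
orthodox_code = stances.categories.get_loc("Orthodox")
print(southpaw_code, orthodox_code)
```

Note that R_Stance and B_Stance are converted separately, so the same code means the same stance in both columns only because both happen to have the same five categories.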

To get the prediction:

In [None]:
#One fight: red corner Southpaw (3), blue corner Orthodox (1)
myTest = np.array([[3, 1]])
clf.predict(myTest)

array([2], dtype=int8)

The predicted winner is... class 2.

Which winner is this?

In [None]:
UFCsel["Winner"].cat.categories

Index(['Blue', 'Draw', 'Red'], dtype='object')

Ah, the predicted winner is 'Red', which in this test is the Southpaw fighter.
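The prediction only gives the most likely class. To see how the naive Bayes calculation gets there, the sketch below fits a model on a few made-up fights (not the UFC data) and compares `predict_proba` with a calculation by hand: the prior of each class times the smoothed conditional probability of each stance, normalised to sum to 1. It assumes the default `alpha=1` smoothing of `CategoricalNB`; `toy`, `clf_toy`, `query` and `manual` are names invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import CategoricalNB

# Hypothetical toy fights (NOT the UFC data): winner plus both stances
toy = pd.DataFrame({
    "Winner":   ["Red", "Red", "Blue", "Red", "Blue", "Red", "Blue", "Red"],
    "R_Stance": ["Orthodox", "Southpaw", "Orthodox", "Southpaw",
                 "Orthodox", "Southpaw", "Southpaw", "Orthodox"],
    "B_Stance": ["Orthodox", "Orthodox", "Southpaw", "Orthodox",
                 "Orthodox", "Southpaw", "Orthodox", "Southpaw"],
}).astype("category")

X_toy = np.column_stack([toy["R_Stance"].cat.codes.to_numpy(),
                         toy["B_Stance"].cat.codes.to_numpy()])
y_toy = toy["Winner"].cat.codes.to_numpy()
clf_toy = CategoricalNB().fit(X_toy, y_toy)

# Southpaw in the red corner against Orthodox in the blue corner
query = np.array([[toy["R_Stance"].cat.categories.get_loc("Southpaw"),
                   toy["B_Stance"].cat.categories.get_loc("Orthodox")]])

# By hand: P(class) * product of P(stance | class), with add-one smoothing
scores = []
for c in np.unique(y_toy):
    in_class = X_toy[y_toy == c]
    score = len(in_class) / len(y_toy)               # prior P(class)
    for i, value in enumerate(query[0]):
        n_categories = int(X_toy[:, i].max()) + 1    # number of codes for feature i
        count = int(np.sum(in_class[:, i] == value))
        score *= (count + 1) / (len(in_class) + n_categories)
    scores.append(score)
manual = np.array(scores) / np.sum(scores)           # normalise to sum to 1

proba = clf_toy.predict_proba(query)[0]
for label, p_model, p_manual in zip(toy["Winner"].cat.categories, proba, manual):
    print(f"{label}: model {p_model:.3f}, by hand {p_manual:.3f}")
```

On the real model, `clf.predict_proba(myTest)` would show in the same way how close the other classes are to the predicted one.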

**Exercise 2**

Another example, taken from https://www.saedsayad.com/naive_bayesian.htm. The data is already available as 'playGolf.csv'. Load this data and create a model to predict whether we can play golf or not.

If you have time to spare, you could look into converting the categories into numerical codes with the LabelEncoder of sklearn.

***My answer***

In [None]:
df=pd.read_csv('data/playGolf.csv', sep = ';', header=0)
df

Unnamed: 0,Outlook,Temp,Humidity,Windy,Play
0,rainy,hot,high,False,no
1,rainy,hot,high,True,no
2,overcast,hot,high,False,yes
3,sunny,mild,high,False,yes
4,sunny,cool,normal,False,yes
5,sunny,cool,normal,True,no
6,overcast,cool,normal,True,yes
7,rainy,mild,high,False,no
8,rainy,cool,normal,False,yes
9,sunny,mild,normal,False,yes


In [None]:
df["Outlook"]=pd.Categorical(df["Outlook"])
df["Temp"]=pd.Categorical(df["Temp"])
df["Humidity"]=pd.Categorical(df["Humidity"])
df["Windy"]=pd.Categorical(df["Windy"])
df["Play"]=pd.Categorical(df["Play"])

#get the numerical codes as a numpy array, one column per field
arr = np.column_stack([df[col].cat.codes for col in ["Outlook", "Temp", "Humidity", "Windy", "Play"]])

#the first four columns are the predictors, the last one (Play) is the target
X=arr[:,0:4]
y=arr[:,4]

clf = CategoricalNB()
model = clf.fit(X, y)
model.score(X,y)

0.9285714285714286
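For the optional part of the exercise, here is a sketch of the same conversion using sklearn's encoders instead of pandas categoricals. It uses `LabelEncoder` for the target and, as my own choice, `OrdinalEncoder` for the predictors, since `LabelEncoder` handles only one column at a time. The data are just the ten rows displayed above, typed in by hand, so the result need not match the score shown for the file. Names such as `golf`, `feature_encoder`, `target_encoder` and `new_day` are made up for this example.

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Only the ten rows displayed above, typed in by hand
golf = pd.DataFrame({
    "Outlook":  ["rainy", "rainy", "overcast", "sunny", "sunny",
                 "sunny", "overcast", "rainy", "rainy", "sunny"],
    "Temp":     ["hot", "hot", "hot", "mild", "cool",
                 "cool", "cool", "mild", "cool", "mild"],
    "Humidity": ["high", "high", "high", "high", "normal",
                 "normal", "normal", "high", "normal", "normal"],
    "Windy":    [False, True, False, False, False,
                 True, True, False, False, False],
    "Play":     ["no", "no", "yes", "yes", "yes",
                 "no", "yes", "no", "yes", "yes"],
})

feature_cols = ["Outlook", "Temp", "Humidity", "Windy"]

# OrdinalEncoder converts all predictor columns at once; dtype=int gives
# integer codes. Everything is cast to text first so all columns are alike.
feature_encoder = OrdinalEncoder(dtype=int)
X_golf = feature_encoder.fit_transform(golf[feature_cols].astype(str))

# LabelEncoder converts the single target column
target_encoder = LabelEncoder()
y_golf = target_encoder.fit_transform(golf["Play"])

golf_model = CategoricalNB().fit(X_golf, y_golf)

# Predict for a new (hypothetical) day and translate the code back to a label
new_day = pd.DataFrame({"Outlook": ["sunny"], "Temp": ["cool"],
                        "Humidity": ["high"], "Windy": ["True"]})
pred_code = golf_model.predict(feature_encoder.transform(new_day))
pred_label = target_encoder.inverse_transform(pred_code)
print(pred_label[0])
```

The advantage of the encoders is that they remember the mapping, so new observations can be converted with `transform` and predictions turned back into labels with `inverse_transform`.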