### Classification using scikit-learn (with pandas)

In [None]:
import csv
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

In [None]:
# For compatibility across multiple platforms
import os
IB = os.environ.get('INSTABASE_URI',None) is not None
open = ib.open if IB else open

### Read Cities.csv into dataframe

We create artificial categories as follows: 
- <b>cold</b> : temperature < 5
- <b>cool</b> : 5 <= temperature < 9 
- <b>warm</b> : 9 <= temperature < 15 
- <b>hot</b> : 15 <= temperature 

We add a column 'category' for the temperature category as defined above. 
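
As an aside, the same thresholds can be applied without an explicit loop. The sketch below uses `pd.cut` on a small made-up dataframe (the `sample` data is hypothetical, not taken from Cities.csv); `right=False` makes every bin closed on the left, which matches the `<=` / `<` pattern in the list above.

```python
import pandas as pd

# Hypothetical temperatures chosen to sit on and around each boundary
sample = pd.DataFrame({'temperature': [4.9, 5.0, 8.9, 9.0, 14.9, 15.0]})

# Bin edges for cold / cool / warm / hot; right=False gives [lower, upper) bins
bins = [float('-inf'), 5, 9, 15, float('inf')]
labels = ['cold', 'cool', 'warm', 'hot']
sample['category'] = pd.cut(sample['temperature'], bins=bins,
                            labels=labels, right=False)

print(sample)
```

The loop in the next cell produces the same kind of column and is easier to follow step by step; the vectorized form is mainly useful on larger dataframes.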

In [None]:
# Note: For a dataframe D with the default integer index, D.loc[i] is the i-th row of D (counting from 0)
f = open('datasets/Cities.csv','r')  # the old 'rU' mode was removed in Python 3.11
cities = pd.read_csv(f)
cats = []
for i in range(len(cities)):
    if cities.loc[i]['temperature'] < 5:
        cats.append('cold')
    elif cities.loc[i]['temperature'] < 9:
        cats.append('cool')
    elif cities.loc[i]['temperature'] < 15:
        cats.append('warm')
    else: cats.append('hot')
cities['category'] = cats
print("cold:", len(cities[(cities.category == 'cold')]))
print("cool:", len(cities[(cities.category == 'cool')]))
print("warm:", len(cities[(cities.category == 'warm')]))
print("hot:", len(cities[(cities.category == 'hot')]))

Split the cities dataset into training and test sets. 
- <b>Training set</b> - the subset of the data that will be used to "train" our model to learn to predict the different categories. 
- <b>Test set</b> - serves as the new never-before-seen data that will be used to evaluate the model's accuracy.

Note: the training set should be much larger than the test set. 
In the following, the first 85% of the rows form the training set, and the remaining 15% serve as our test set.
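
Slicing off the first rows keeps the file's original order, so the split depends on how the CSV happens to be sorted. As a hedged alternative sketch, scikit-learn's `train_test_split` shuffles the rows before splitting. The `toy` dataframe below is made up for illustration and is not the cities data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the cities dataframe
toy = pd.DataFrame({'x': range(100), 'label': ['a', 'b'] * 50})

# Shuffle, then hold out 15% of the rows as the test set.
# random_state fixes the shuffle so the split is repeatable.
train, test = train_test_split(toy, test_size=0.15, random_state=0)

print('Training set', len(train), 'items')
print('Test set', len(test), 'items')
```

Note that the shuffled pieces keep their original index labels, so code that looks rows up with `.loc[numtrain+i]` (as in the cells below) would need a `reset_index(drop=True)` first.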

In [None]:
numitems = len(cities)
percenttrain = 0.85
numtrain = int(numitems*percenttrain)
numtest = numitems - numtrain
print('Training set', numtrain, 'items')
print('Test set', numtest, 'items')
citiesTrain = cities[0:numtrain]
citiesTest = cities[numtrain:]

### Model 1: K-nearest-neighbors classification

Predict the temperature category from two features: longitude and latitude. 

With Python's scikit-learn, machine learning generally involves three steps:
1. Instantiate the model.
    - In the following cell, this is done by calling the `KNeighborsClassifier()` constructor. 
    - The constructor takes the number of neighbors as its first parameter (8 in this case), which we can tweak later on. 
2. Call `classifier.fit()`, which lets the model learn from the training set.
3. Predict the categories for the test set with `classifier.predict()`.
    - From the predictions, we can then compute the <b>accuracy</b> of the model. 
    - The accuracy is the proportion of test-set items whose category was predicted correctly. 
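
The three steps can be sketched on a tiny made-up dataset before applying them to the cities data. Everything in this example (`X_train`, `y_train`, `X_test`, `y_test`, and the choice of 3 neighbors) is hypothetical; `accuracy_score` from `sklearn.metrics` is used here as a shortcut for the manual counting loop in the next cell.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy data: one feature, two well-separated classes
X_train = [[0], [1], [2], [10], [11], [12]]
y_train = ['low', 'low', 'low', 'high', 'high', 'high']
X_test = [[1.5], [10.5]]
y_test = ['low', 'high']

model = KNeighborsClassifier(n_neighbors=3)   # step 1: instantiate
model.fit(X_train, y_train)                   # step 2: learn from training set
predicted = model.predict(X_test)             # step 3: predict for test set

# Accuracy = fraction of test items predicted correctly
accuracy = accuracy_score(y_test, predicted)
print('Predicted:', list(predicted))
print('Accuracy:', accuracy)
```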

In [None]:
features = ['longitude', 'latitude']
neighbors = 8
classifier = KNeighborsClassifier(neighbors)
classifier.fit(citiesTrain[features], citiesTrain['category'])
predictions = classifier.predict(citiesTest[features])
# Calculate accuracy
numtrain = len(citiesTrain)
numtest = len(citiesTest)
correct = 0
for i in range(numtest):
    print('Predicted:', predictions[i], ' Actual:', citiesTest.loc[numtrain+i]['category'])
    if predictions[i] == citiesTest.loc[numtrain+i]['category']: correct +=1
print('Accuracy:', float(correct)/float(numtest))

<b>Your task: </b> In the previous cell, comment out the print statement, then try other values for `neighbors` and other features. What is the highest accuracy you can get?
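
One way to explore several values of `neighbors` at once is a small loop. The sketch below runs on synthetic data from `make_blobs` (an assumption standing in for the cities data) and uses the classifier's `score()` method, which returns the mean accuracy on the data it is given.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 200 points in 4 clusters (not the Cities dataset)
X, y = make_blobs(n_samples=200, centers=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)

# Fit one model per candidate value and record its test accuracy
scores = {}
for k in [1, 3, 5, 8, 15]:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    scores[k] = model.score(X_test, y_test)
    print('neighbors =', k, ' accuracy =', scores[k])

# Value of k with the highest recorded accuracy
best_k = max(scores, key=scores.get)
print('Best number of neighbors:', best_k)
```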

### <font color="green">Your Turn: K-nearest-neighbors on World Cup Data</font>

Predict position from one or more of minutes, shots, passes, tackles, saves.

The following cell does all the set-up, including reordering the data (sorting by surname) to avoid team bias.
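
To see what the reordering does, here is a minimal sketch on a made-up table (the `toy` rows and names are hypothetical). Sorting by surname interleaves the teams, and `reset_index(drop=True)` renumbers the rows 0, 1, 2, ... so that later lookups such as `.loc[numtrain+i]` refer to positions in the new order. A random shuffle with `sample(frac=1)` is shown as an alternative way to break up the grouping.

```python
import pandas as pd

# Hypothetical mini-table, grouped by team like a typical roster file
toy = pd.DataFrame({
    'team':    ['A', 'A', 'A', 'B', 'B', 'B'],
    'surname': ['Zed', 'Moe', 'Abe', 'Yan', 'Lee', 'Cho'],
})

# Reordering used in this notebook: sort by surname, then renumber the rows
by_name = toy.sort_values(by='surname').reset_index(drop=True)
print(by_name)

# Alternative (an assumption, not the notebook's method): random shuffle
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)
print(shuffled)
```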

In [None]:
f = open('datasets/Players.csv','r')
players = pd.read_csv(f)
players = players.sort_values(by='surname')
players = players.reset_index(drop=True)
numitems = len(players)
percenttrain = 0.95
numtrain = int(numitems*percenttrain)
numtest = numitems - numtrain
print('Training set', numtrain, 'items')
print('Test set', numtest, 'items')
playersTrain = players[0:numtrain]
playersTest = players[numtrain:]

The following cell does the classification (i.e. instantiates the model, fits the model to the training set, and predicts categories for the test set).

Q1: Try different features and different numbers of neighbors. What's the highest accuracy you can get?

In [None]:
features = ['minutes', 'shots', 'passes', 'tackles', 'saves']
neighbors = 10
classifier = KNeighborsClassifier(neighbors)
classifier.fit(playersTrain[features], playersTrain['position'])
predictions = classifier.predict(playersTest[features])
# Calculate accuracy
numtrain = len(playersTrain)
numtest = len(playersTest)
correct = 0
for i in range(numtest):
#    print('Predicted:', predictions[i], ' Actual:', playersTest.loc[numtrain+i]['position'])
    if predictions[i] == playersTest.loc[numtrain+i]['position']: correct +=1
print('Accuracy:', float(correct)/float(numtest))

### Model 2:  Decision tree classification

In [None]:
features = ['longitude','latitude']
split = 20
dt = DecisionTreeClassifier(min_samples_split=split) # parameter is optional
dt.fit(citiesTrain[features],citiesTrain['category'])
predictions = dt.predict(citiesTest[features])
# Calculate accuracy
numtrain = len(citiesTrain)
numtest = len(citiesTest)
correct = 0
for i in range(numtest):
#    print('Predicted:', predictions[i], ' Actual:', citiesTest.loc[numtrain+i]['category'])
    if predictions[i] == citiesTest.loc[numtrain+i]['category']: correct +=1
print('Accuracy:', float(correct)/float(numtest))

<b>Your task: </b>In the previous cell, try other values for `split` and other features. What's the highest accuracy you can get?
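
For context, `min_samples_split` is the minimum number of samples a node must contain before the tree is allowed to split it, so larger values tend to give smaller trees. The sketch below illustrates this on synthetic data from `make_blobs` (an assumption standing in for the cities data) by counting leaves with `get_n_leaves()`.

```python
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with overlapping clusters (not the Cities dataset)
X, y = make_blobs(n_samples=200, centers=4, cluster_std=3.0, random_state=0)

# Fit one tree per setting and record how many leaves it ends up with
leaves = {}
for split in [2, 20, 50]:
    tree = DecisionTreeClassifier(min_samples_split=split, random_state=0)
    tree.fit(X, y)
    leaves[split] = tree.get_n_leaves()
    print('min_samples_split =', split, '-> leaves:', leaves[split])
```

A smaller tree is not automatically more or less accurate on the test set, which is why it is worth trying several values.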

### Model 3: Random Forest - "forest" of decision trees

In [None]:
# Predict temperature category from other features
features = ['longitude', 'latitude']
trees = 10
rf = RandomForestClassifier(n_estimators=trees)
rf.fit(citiesTrain[features],citiesTrain['category'])
predictions = rf.predict(citiesTest[features])
# Calculate accuracy
numtrain = len(citiesTrain)
numtest = len(citiesTest)
correct = 0
for i in range(numtest):
#    print('Predicted:', predictions[i], ' Actual:', citiesTest.loc[numtrain+i]['category'])
    if predictions[i] == citiesTest.loc[numtrain+i]['category']: correct +=1
print('Accuracy:', float(correct)/float(numtest))

<b>Your task: </b>In the previous cell, try other values for `trees` and other features. What's the highest accuracy you can get?
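
To make the "forest of decision trees" idea concrete, the sketch below fits a small forest on synthetic data (again an assumption, not the cities data) and inspects its `estimators_` attribute, which holds the individual fitted trees. The forest combines the predictions of these trees into a single prediction.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 points in 4 clusters
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X, y)

# One fitted decision tree per estimator
print('Number of trees:', len(forest.estimators_))
print('All decision trees:',
      all(isinstance(t, DecisionTreeClassifier) for t in forest.estimators_))

# The forest's combined prediction for the first point
print('Forest prediction:', forest.predict(X[:1])[0])
```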

### <font color="green">Your Turn: Decision tree and forest of trees on World Cup Data</font>

Q1: (Single Tree) Predict position from one or more of minutes, shots, passes, tackles, saves. 

Try different features and different values for min_samples_split. What's the highest accuracy you can get?

In [None]:
features = ['minutes', 'shots', 'passes', 'tackles', 'saves']
split = 10
dt = DecisionTreeClassifier(min_samples_split=split) # parameter is optional
dt.fit(playersTrain[features],playersTrain['position'])
predictions = dt.predict(playersTest[features])
# Calculate accuracy
numtrain = len(playersTrain)
numtest = len(playersTest)
correct = 0
for i in range(numtest):
#    print('Predicted:', predictions[i], ' Actual:', playersTest.loc[numtrain+i]['position'])
    if predictions[i] == playersTest.loc[numtrain+i]['position']: correct +=1
print('Accuracy:', float(correct)/float(numtest))

Q2: (Random Forest) Predict position from one or more of minutes, shots, passes, tackles, saves. 

Try different values for n_estimators. What's the highest accuracy you can get?

In [None]:
features = ['minutes', 'shots', 'passes', 'tackles', 'saves']
trees = 10
rf = RandomForestClassifier(n_estimators=trees)
rf.fit(playersTrain[features],playersTrain['position'])
predictions = rf.predict(playersTest[features])
# Calculate accuracy
numtrain = len(playersTrain)
numtest = len(playersTest)
correct = 0
for i in range(numtest):
#    print('Predicted:', predictions[i], ' Actual:', playersTest.loc[numtrain+i]['position'])
    if predictions[i] == playersTest.loc[numtrain+i]['position']: correct +=1
print('Accuracy:', float(correct)/float(numtest))