## Introduction
This script shows how to use the scikit-learn package (`sklearn`) to classify data. Two examples are considered: the IRIS dataset and the TITANIC dataset (available from https://www.kaggle.com/c/titanic; specifically, the file https://www.kaggle.com/c/titanic/download/train.csv is needed).

In [48]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn import tree

## 1. First dataset
The IRIS data

In [49]:
# load the dataset
from sklearn import datasets
iris = datasets.load_iris()

data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
data.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [50]:
# split the dataset into the data describing the flower (X) and the class label (y)
y = data['species']
X = data.drop('species', axis = 1)

# build the classification tree
t = tree.DecisionTreeClassifier()
t = t.fit(X, y)

In [51]:
help(tree.DecisionTreeClassifier)

Help on class DecisionTreeClassifier in module sklearn.tree.tree:

class DecisionTreeClassifier(BaseDecisionTree, sklearn.base.ClassifierMixin)
 |  A decision tree classifier.
 |  
 |  Read more in the :ref:`User Guide <tree>`.
 |  
 |  Parameters
 |  ----------
 |  criterion : string, optional (default="gini")
 |      The function to measure the quality of a split. Supported criteria are
 |      "gini" for the Gini impurity and "entropy" for the information gain.
 |  
 |  splitter : string, optional (default="best")
 |      The strategy used to choose the split at each node. Supported
 |      strategies are "best" to choose the best split and "random" to choose
 |      the best random split.
 |  
 |  max_features : int, float, string or None, optional (default=None)
 |      The number of features to consider when looking for the best split:
 |  
 |          - If int, then consider `max_features` features at each split.
 |          - If float, then `max_features` is a percentage and
 |

In [52]:
# save the classification tree to a .dot file
# the file can be converted to a .pdf with Graphviz using the command:
#   dot -Tpdf iris.dot -o iris.pdf

with open("/home/dominik/Dokumenty/Studia/Data-mining/Lista5-trees/iris.dot", "w") as f:
    tree.export_graphviz(t, out_file=f, feature_names=X.columns)
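
When Graphviz is not at hand, a fitted tree can also be inspected as plain text. The following is a minimal sketch, assuming scikit-learn 0.21 or newer (where `tree.export_text` is available). It refits a small tree on IRIS so that it runs on its own; `max_depth=2` and `random_state=0` are illustrative choices, not settings from this notebook.

```python
from sklearn import datasets, tree

# Refit a shallow tree on IRIS so the example is self-contained.
iris = datasets.load_iris()
clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# export_text returns the decision rules as an indented string.
rules = tree.export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indentation level in the printed rules corresponds to one level of the tree, and the `class:` lines mark the leaves.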

In [53]:
# evaluate the classifier on the training data
t.score(X, y)

1.0

In [54]:
# It is fairer to evaluate a classifier on data that was not used to build it.
# The dataset is therefore split into two parts: training data and test data.
# NOTE: the helper column 'train' is not dropped below, so it also ends up in X as a feature.

data['train'] = np.random.uniform(0, 1, len(data))

data_train = data[data['train'] <= 0.65]
data_test = data[data['train'] > 0.65]

y = data_train['species']
X = data_train.drop('species', axis = 1)

t = tree.DecisionTreeClassifier()
t = t.fit(X, y)

print(t.score(X, y))

y = data_test['species']
X = data_test.drop('species', axis = 1)

print(t.score(X, y))

1.0
0.924528301887
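
The same kind of split can be obtained without adding a helper column to the data frame. The sketch below assumes scikit-learn 0.18 or newer (for `sklearn.model_selection`); `test_size=0.35` mirrors the 0.65 threshold used above, while `stratify` and `random_state=0` are additions made here for a balanced and repeatable split.

```python
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Hold out about 35% of the rows; stratify keeps the class proportions equal
# in both parts, random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.35,
    stratify=iris.target, random_state=0)

clf = tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = clf.score(X_train, y_train)  # accuracy on data the tree has seen
test_acc = clf.score(X_test, y_test)     # accuracy on held-out data
print(train_acc, test_acc)
```

Because the features and labels are passed in directly, no bookkeeping column can slip into the feature matrix.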


## 2. Second dataset
The TITANIC data (available from https://www.kaggle.com/c/titanic; specifically, the file https://www.kaggle.com/c/titanic/download/train.csv is needed).

In [55]:
# load the dataset from a file
data = pd.read_csv("/home/dominik/Dokumenty/Studia/Data-mining/Lista5-trees/train.csv")
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [56]:
# remove attributes irrelevant to classification from the dataset
data = data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis = 1)
# drop the rows that still contain missing values
data = data.dropna()
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S
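
`dropna()` discards every row with at least one missing value. A possible alternative is to fill the gaps instead. The sketch below uses a tiny made-up frame (the values are hypothetical, not rows from `train.csv`) and assumes pandas 0.21 or newer for `isna`; the median and the most frequent value are just one common choice of fill-in.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing Age and one missing Embarked value.
toy = pd.DataFrame({'Age': [22.0, np.nan, 26.0, 35.0],
                    'Embarked': ['S', 'C', None, 'S']})

missing_per_column = toy.isna().sum()  # how many values each column lacks
dropped = toy.dropna()                 # what the notebook does: drop incomplete rows

# Alternative: keep all rows and fill the gaps.
filled = toy.copy()
filled['Age'] = filled['Age'].fillna(filled['Age'].median())
filled['Embarked'] = filled['Embarked'].fillna(filled['Embarked'].mode()[0])

print(missing_per_column)
print(len(dropped), len(filled))
```

Dropping keeps the data untouched but shrinks the sample; filling keeps every passenger at the cost of introducing estimated values.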


In [57]:
# re-encode the nominal attributes in the dataset as integer codes
data['Sex'] = pd.Categorical(data['Sex']).codes
data['Embarked'] = pd.Categorical(data['Embarked']).codes
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,22.0,1,0,7.25,2
1,1,1,0,38.0,1,0,71.2833,0
2,1,3,0,26.0,0,0,7.925,2
3,1,1,0,35.0,1,0,53.1,2
4,0,3,1,35.0,0,0,8.05,2
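
The integer codes are assigned in the sorted order of the category labels, which is why `male` became 1 and `S` became 2 above. The sketch below recovers that mapping on a small made-up sample and shows one-hot encoding with `pd.get_dummies` as a possible alternative that avoids implying an order among the ports; the sample values are illustrative only.

```python
import pandas as pd

# Hypothetical sample mimicking the 'Sex' and 'Embarked' columns.
toy = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                    'Embarked': ['S', 'C', 'Q']})

# Mapping from integer code back to the original label.
sex_cat = pd.Categorical(toy['Sex'])
sex_mapping = dict(enumerate(sex_cat.categories))
print(sex_mapping)  # {0: 'female', 1: 'male'}

# One-hot alternative: one indicator column per port of embarkation.
one_hot = pd.get_dummies(toy, columns=['Embarked'])
print(one_hot.columns.tolist())
```

A decision tree can split on integer codes as well, so the notebook's encoding works; one-hot columns merely make each split on a nominal attribute easier to read.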


In [58]:
# split the dataset into the data describing the passenger (X) and the class label (y)
y = data['Survived']
X = data.drop('Survived', axis = 1)

# build the classification tree (default criterion: Gini impurity)
tGini = tree.DecisionTreeClassifier()
tGini = tGini.fit(X, y)

In [59]:
# save the classification tree to a .dot file
# the file can be converted to a .pdf with Graphviz using the command:
#   dot -Tpdf titanic.dot -o titanic.pdf

with open("/home/dominik/Dokumenty/Studia/Data-mining/Lista5-trees/titanic.dot", "w") as f:
    tree.export_graphviz(tGini, out_file=f, feature_names=X.columns)

In [60]:
# evaluate the classifier on the training data
tGini.score(X, y)

0.9859550561797753

In [61]:
# the same tree, but with entropy (information gain) as the split criterion
tEntropy = tree.DecisionTreeClassifier(criterion='entropy')
tEntropy = tEntropy.fit(X, y)

In [62]:
tEntropy.score(X,y)

0.9859550561797753
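
The two trees differ only in the `criterion` used to rate candidate splits. For class proportions p_k in a node, the usual definitions are Gini impurity 1 - sum(p_k^2) and entropy -sum(p_k * log2(p_k)). The helper functions below are written for this illustration only and are not part of scikit-learn.

```python
import numpy as np

def gini(p):
    """Gini impurity of a node with class proportions p."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

def entropy(p):
    """Entropy (in bits) of a node with class proportions p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # a class with zero proportion contributes nothing
    return float(np.sum(p * np.log2(1.0 / p)))

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # evenly mixed node: 0.5 1.0
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # pure node: 0.0 0.0
```

Both measures are zero for a pure node and largest for an evenly mixed one, which fits with the two criteria producing similar scores in this notebook.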

In [63]:
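# It is fairer to evaluate a classifier on data that was not used to build it.
# The dataset is therefore split into two parts: training data and test data
# (problem set 5, task 2b).
# NOTE: in the cells below X still contains the helper column 'train', which may distort
# the test scores; it is dropped from X only from In [74] onward.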

In [64]:
data['train'] = np.random.uniform(0, 1, len(data))

data_train = data[data['train'] <= 0.75]
data_test = data[data['train'] > 0.75]

print(data_test.shape)
print(data_train.shape)

(200, 9)
(512, 9)


In [65]:
y = data_train['Survived']
X = data_train.drop('Survived', axis = 1)

In [66]:
tGini = tree.DecisionTreeClassifier()
tGini = tGini.fit(X, y)
tGini.score(X,y)

1.0

In [67]:
tEntropy = tree.DecisionTreeClassifier(criterion='entropy')
tEntropy = tEntropy.fit(X, y)
tEntropy.score(X,y)

1.0

In [68]:
y = data_test['Survived']
X = data_test.drop('Survived', axis = 1)

In [69]:
print(tGini.score(X,y))
print(tEntropy.score(X,y))

0.44
0.43


In [70]:
tGini.tree_.max_depth

20
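
One way to check whether a helper column such as `train` has leaked into a model is to look at `feature_importances_`. The sketch below runs on synthetic data; the `signal` and `label` columns and all constants are made up for illustration and do not come from the Titanic file.

```python
import numpy as np
import pandas as pd
from sklearn import tree

rng = np.random.RandomState(0)

# Synthetic data: one informative feature and a noisy binary label.
toy = pd.DataFrame({'signal': rng.normal(size=300)})
toy['label'] = (toy['signal'] + rng.normal(scale=1.0, size=300) > 0).astype(int)
toy['train'] = rng.uniform(0, 1, len(toy))  # helper column meant only for splitting

toy_train = toy[toy['train'] <= 0.75]
X_leaky = toy_train.drop('label', axis=1)   # 'train' is still among the features
clf = tree.DecisionTreeClassifier(random_state=0).fit(X_leaky, toy_train['label'])

# Share of the total impurity reduction attributed to each feature.
importances = pd.Series(clf.feature_importances_, index=X_leaky.columns)
print(importances)
```

A non-zero importance for `train` would mean the tree has split on the bookkeeping column, whose values in the test part all lie above the 0.75 threshold and were never seen during fitting.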

In [71]:
data['train'] = np.random.uniform(0, 1, len(data))

data_train = data[data['train'] <= 0.75]
data_test = data[data['train'] > 0.75]

y = data_train['Survived']
X = data_train.drop('Survived', axis = 1)

print(data_test.shape)
print(data_train.shape)

tGini = tree.DecisionTreeClassifier(max_depth=5)
tGini = tGini.fit(X, y)
print(tGini.score(X,y))

tEntropy = tree.DecisionTreeClassifier(criterion='entropy', max_depth=5)
tEntropy = tEntropy.fit(X, y)
print(tEntropy.score(X,y))

(168, 9)
(544, 9)
0.871323529412
0.865808823529


In [72]:
y = data_test['Survived']
X = data_test.drop('Survived', axis = 1)

print(tGini.score(X,y))
print(tEntropy.score(X,y))

0.815476190476
0.779761904762


In [73]:
with open("/home/dominik/Dokumenty/Studia/Data-mining/Lista5-trees/titanic.dot", "w") as f:
    tree.export_graphviz(tEntropy, out_file=f, feature_names=X.columns)

In [74]:
data['train'] = np.random.uniform(0, 1, len(data))

data_train = data[data['train'] <= 0.75]
data_test = data[data['train'] > 0.75]

y = data_train['Survived']
X = data_train.drop('Survived', axis = 1).drop('train',axis=1)

print(data_test.shape)
print(data_train.shape)

tGini = tree.DecisionTreeClassifier(max_depth=7,min_samples_split=15,min_samples_leaf=7)
tGini = tGini.fit(X, y)
print(tGini.score(X,y))

tEntropy = tree.DecisionTreeClassifier(criterion='entropy', max_depth=5,min_samples_split=15,min_samples_leaf=7)
tEntropy = tEntropy.fit(X, y)
print(tEntropy.score(X,y))

(184, 9)
(528, 9)
0.876893939394
0.863636363636


In [75]:
y = data_test['Survived']
X = data_test.drop('Survived', axis = 1).drop('train',axis=1)

print(tGini.score(X,y))
print(tEntropy.score(X,y))

0.733695652174
0.728260869565
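
The cells above vary `max_depth`, `min_samples_split` and `min_samples_leaf` by hand. A systematic search over such settings could look like the sketch below, which assumes scikit-learn 0.18 or newer and uses IRIS only so that it runs without `train.csv`. The grid values echo those tried in this notebook but are otherwise arbitrary.

```python
from sklearn import datasets, tree
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# Candidate settings; every combination is scored with 5-fold cross-validation.
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 15],
    'min_samples_leaf': [1, 7],
}
search = GridSearchCV(tree.DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(iris.data, iris.target)

print(search.best_params_)  # the combination with the highest mean CV accuracy
print(search.best_score_)
```

For the Titanic data the same call would take the encoded `X` and `y` from this notebook in place of `iris.data` and `iris.target`.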


In [76]:
with open("/home/dominik/Dokumenty/Studia/Data-mining/Lista5-trees/titanic.dot", "w") as f:
    tree.export_graphviz(tEntropy, out_file=f, feature_names=X.columns)

In [77]:
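def titanicTreeCrossValid(params):
    """Estimate accuracy with a random 10-way split of the global `data`.

    `params` is a dict with the keys 'criterion', 'max_depth',
    'min_samples_split' and 'min_samples_leaf'.
    """
    # give every row a random number in [0, 1); fold i holds the rows in [i*0.1, (i+1)*0.1)
    data['train'] = np.random.uniform(0, 1, len(data))
    score = 0
    for i in range(10):
        in_fold = (data['train'] >= i*0.1) & (data['train'] < (i+1)*0.1)
        data_test = data[in_fold]
        data_train = data[~in_fold]

        # fit on the other nine folds; the helper column 'train' is not a feature
        y = data_train['Survived']
        X = data_train.drop(['Survived', 'train'], axis=1)

        t = tree.DecisionTreeClassifier(criterion=params['criterion'],
                                        max_depth=params['max_depth'],
                                        min_samples_split=params['min_samples_split'],
                                        min_samples_leaf=params['min_samples_leaf'])
        t = t.fit(X, y)

        # evaluate on the held-out fold (computed once, then printed and accumulated)
        y = data_test['Survived']
        X = data_test.drop(['Survived', 'train'], axis=1)
        fold_score = t.score(X, y)
        print(fold_score)
        score += fold_score / 10

    print('Average score: ', score)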
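
scikit-learn also ships a ready-made routine for this kind of evaluation. The sketch below assumes scikit-learn 0.18 or newer and again uses IRIS so that it runs on its own. Unlike the function above, `StratifiedKFold` produces folds of nearly equal size with matching class proportions; the tree settings and `random_state=0` are illustrative.

```python
from sklearn import datasets, tree
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()

# Ten shuffled folds, each preserving the class proportions of the full data.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=0)

# One accuracy value per fold; the classifier is refitted for every fold.
scores = cross_val_score(clf, iris.data, iris.target, cv=cv)
print(scores)
print('Average score:', scores.mean())
```

With equal-sized folds the plain mean of the fold scores is a fair summary, whereas the randomly sized folds above weight small and large folds alike.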

In [78]:
print('Simple Gini:')
titanicTreeCrossValid({'criterion': 'gini', 'max_depth' : None, 'min_samples_split': 2,'min_samples_leaf':1})
print('\nSimple entropy:')
titanicTreeCrossValid({'criterion': 'entropy', 'max_depth' : None, 'min_samples_split': 2,'min_samples_leaf':1})
print('\nGini with max depth:')
titanicTreeCrossValid({'criterion': 'gini', 'max_depth' : 5, 'min_samples_split': 2,'min_samples_leaf':1})
print('\nEntropy with max depth:')
titanicTreeCrossValid({'criterion': 'entropy', 'max_depth' : 5, 'min_samples_split': 2,'min_samples_leaf':1})
print('\nGini with max depth, leafs and split restrictions:')
titanicTreeCrossValid({'criterion': 'gini', 'max_depth' : 6, 'min_samples_split': 15,'min_samples_leaf':7})
print('\nEntropy with max depth, leafs and split restrictions:')
titanicTreeCrossValid({'criterion': 'entropy', 'max_depth' : 6, 'min_samples_split': 15,'min_samples_leaf':7})

Simple Gini:
0.796610169492
0.802816901408
0.706666666667
0.777777777778
0.7625
0.760563380282
0.68
0.764705882353
0.808823529412
0.821917808219
Average score:  0.768238211561

Simple entropy:
0.790322580645
0.818181818182
0.761904761905
0.734939759036
0.807692307692
0.706896551724
0.805970149254
0.684210526316
0.794520547945
0.676923076923
Average score:  0.758156207962

Gini with max depth:
0.79012345679
0.767857142857
0.909090909091
0.847222222222
0.732394366197
0.776119402985
0.733333333333
0.73417721519
0.815384615385
0.775
Average score:  0.788070266405

Entropy with max depth:
0.815384615385
0.835294117647
0.811594202899
0.837837837838
0.796875
0.820895522388
0.808823529412
0.80303030303
0.753086419753
0.753424657534
Average score:  0.803624620589

Gini with max depth, leafs and split restrictions:
0.825
0.772727272727
0.825396825397
0.8
0.837837837838
0.760563380282
0.771084337349
0.757142857143
0.84
0.716666666667
Average score:  0.79064191774

Entropy with max depth, leafs an