# seeds Data Set - https://archive.ics.uci.edu/ml/datasets/seeds#

## Data Set Information

The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. High quality visualization of the internal kernel structure was detected using a soft X-ray technique. It is non-destructive and considerably cheaper than other more sophisticated imaging techniques like scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. Studies were conducted using combine harvested wheat grain originating from experimental fields, explored at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.

The data set can be used for the tasks of classification and cluster analysis.

## Attribute Information

To construct the data, seven geometric parameters of wheat kernels were measured: 
1. area A, 
2. perimeter P, 
3. compactness C = 4*pi*A/P^2, 
4. length of kernel, 
5. width of kernel, 
6. asymmetry coefficient, 
7. length of kernel groove. 
All of these parameters are real-valued and continuous.
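
As a quick sanity check on the compactness definition above, the sketch below evaluates C = 4*pi*A/P^2 for two simple shapes. The helper name `compactness` is made up for this illustration and is not part of the data set. A perfect circle gives C = 1, and any other shape gives a smaller value.

```python
import math

def compactness(area, perimeter):
    """Compactness C = 4*pi*A / P^2 (hypothetical helper for illustration)."""
    return 4 * math.pi * area / perimeter ** 2

# A circle of radius 2: A = pi*r^2 and P = 2*pi*r, so C should be 1.
r = 2.0
circle_c = compactness(math.pi * r ** 2, 2 * math.pi * r)

# A unit square: A = 1 and P = 4, so C = pi/4.
square_c = compactness(1.0, 4.0)

print(circle_c, square_c)
```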

# Lab 2 Task 2

For each run below, present the confusion matrix and the accuracy value.

a. Try two runs and describe how the results differ between the two cases:
i. with non-normalized data, i.e. the original data.
ii. with data normalized to the interval [0,1]

Tip for normalization:
https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/

b. In these runs, vary the number of training and test cases. Set random_state=None in the DecisionTreeClassifier constructor (so that the test and training observations really are drawn at random each time you run), run each item below 100 times (everything from start to finish), and compute the mean performance over these 100 runs:
i. train=75%, test=25%
ii. choose a train/test ratio yourself

Present one of the best executions with respect to the performance (accuracy value) of the classification algorithm. Always show the confusion matrix and the accuracy.

In [None]:
# Import data set
import pandas as pd

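column_names = ['area A', 'perimeter P', 'compactness C', 'kernel length', 'kernel width', 'asymmetry coefficient', 'kernel groove length', 'wheat variant']

# sep=r'\s+' splits on any run of whitespace (delim_whitespace=True is deprecated in recent pandas versions)
data = pd.read_csv('seeds_dataset.txt', sep=r'\s+', header=None, usecols=list(range(8)), names=column_names)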

In [None]:
# Do some exploratory data analysis
import seaborn as sns

display(data.describe().round(1))
data.info()
sns.pairplot(data, hue='wheat variant')

In [None]:
# Split data into training and test sets
from sklearn.model_selection import train_test_split
x = data.drop('wheat variant', axis=1)
y = data['wheat variant']

x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size=0.25)
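
For the normalized run in task a.ii, one possible approach is to scale the features to [0, 1] with scikit-learn's `MinMaxScaler` after the split. The sketch below uses a synthetic array as a stand-in for the seven feature columns (an assumption made so the example runs on its own). The names `x_demo`, `y_demo`, `x_train_scaled` and `x_test_scaled` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the seven feature columns and the class label (assumption).
rng = np.random.default_rng(0)
x_demo = rng.uniform(low=0.0, high=20.0, size=(210, 7))
y_demo = np.repeat([1, 2, 3], 70)

x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=0
)

# Fit the scaler on the training data only, then reuse it for the test data.
scaler = MinMaxScaler(feature_range=(0, 1))
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

print(x_train_scaled.min(axis=0), x_train_scaled.max(axis=0))
```

Because the scaler is fitted on the training data alone, individual test values may fall slightly outside [0, 1]. Decision trees split on per-feature thresholds, so min-max scaling usually has little or no effect on their accuracy, which is worth keeping in mind when comparing the two runs.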

In [None]:
# Train the decision tree model and predict the test set labels
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(x_training_data, y_training_data)
predictions = model.predict(x_test_data)

In [None]:
# Show the predicted labels for the test set
predictions

In [None]:
# Measure the accuracy of the model
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn import tree
import matplotlib.pyplot as plt

# Plot the fitted decision tree
tree.plot_tree(model)
plt.show()

# Print the classification report
print('\nClassification report:\n', classification_report(y_test_data, predictions))

# Show the confusion matrix (rows: actual class, columns: predicted class)
print('\nConfusion matrix:')
pd.crosstab(y_test_data, predictions, rownames=['Actual'], colnames=['Predicted'])
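
For task b, the whole split/train/evaluate cycle can be wrapped in a function and repeated 100 times. The following is a minimal sketch, not the definitive solution: the helper `run_once` is hypothetical, and the bundled iris data is used only as a stand-in so the example runs without `seeds_dataset.txt`. Note that in scikit-learn the random split is controlled by the `random_state` of `train_test_split` (default `None`), while the `random_state` of `DecisionTreeClassifier` only affects the tree construction.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def run_once(x, y, test_size):
    """Split, train and evaluate once; returns (accuracy, confusion matrix)."""
    x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=test_size, random_state=None)
    clf = DecisionTreeClassifier(random_state=None)
    clf.fit(x_tr, y_tr)
    pred = clf.predict(x_te)
    # Fix the label order so every matrix has the same shape.
    return accuracy_score(y_te, pred), confusion_matrix(y_te, pred, labels=np.unique(y))

# Stand-in data (assumption): replace with x and y from the seeds data frame.
x_demo, y_demo = load_iris(return_X_y=True)

results = [run_once(x_demo, y_demo, test_size=0.25) for _ in range(100)]
accuracies = np.array([acc for acc, _ in results])
best_acc, best_cm = max(results, key=lambda r: r[0])

print('Mean accuracy over 100 runs:', accuracies.mean())
print('Best accuracy:', best_acc)
print('Confusion matrix of the best run:\n', best_cm)
```

Changing `test_size` (for example to 0.4) gives the second train/test ratio asked for in item b.ii.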