# Dataset Tutorial

This notebook walks through loading and preprocessing a dataset for the exercise in ```dataset.py```.

## Step 0: Install required packages

Before doing this exercise, please make sure that you have installed these required packages:
```
 - pandas
 - numpy
 - scikit-learn
```

You can install these packages using ```pip``` by running these lines:

In [9]:
%pip install numpy
%pip install pandas
%pip install scikit-learn


Note: you may need to restart the kernel to use updated packages.



These packages are also required for the other exercises.
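
If you are unsure whether the installation succeeded, a quick check like the one below can help. It is a minimal sketch that uses only the standard library; the helper name `find_missing` is made up for this example. Note that `scikit-learn` is imported under the name `sklearn`.

```python
import importlib.util

# Import names differ from pip names: scikit-learn is imported as "sklearn".
REQUIRED = {"numpy": "numpy", "pandas": "pandas", "scikit-learn": "sklearn"}


def find_missing(required):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for pip_name, import_name in required.items()
        if importlib.util.find_spec(import_name) is None
    ]


missing = find_missing(REQUIRED)
print("Missing packages:", missing if missing else "none")
```

If the list is not empty, rerun the matching `%pip install` line and restart the kernel.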

## Step 1: Import libraries and load the dataset

To load a dataset and preprocess its features, ```pandas``` and ```numpy``` are recommended.
In this guide, we will use the ```Wine Quality Dataset``` from Kaggle to demonstrate the process of loading the data and training a model.

In [10]:
import pandas as pd
import numpy as np

# Load the dataset using the pandas library.
df = pd.read_csv("./dataset/winequalityN.csv")

Print the first five samples to confirm that the dataset is loaded correctly:

In [11]:
df.head()

| Unnamed: 0 | type | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | white | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | white | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | white | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | white | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | white | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |


The output shows how many features the dataset has and the type of each one ```(float, string, etc.)```. In this task, the classification label is the ```"type"``` column.
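
Beyond `df.head()`, a few standard pandas calls summarize column types, missing values, and class counts. The sketch below runs on a tiny made-up frame (the values are illustrative, not taken from the dataset) so that it is self-contained; the same calls apply to `df`.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the wine data (values are illustrative).
sample = pd.DataFrame({
    "type": ["white", "red", "white"],
    "pH": [3.0, np.nan, 3.26],
    "quality": [6, 5, 6],
})

# Data type of each column (text columns show up as object or a string dtype).
print(sample.dtypes)

# Number of missing values per column.
nan_counts = sample.isna().sum()
print(nan_counts)

# Number of samples in each class of the label column.
class_counts = sample["type"].value_counts()
print(class_counts)
```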

## Step 2: Preprocess the data

After loading the data, we preprocess it before using it to train a model. This process includes:
 - Converting string labels to number labels.
 - Filling in the ```NaN``` values.
 - Identifying the features ```X``` and the labels ```y```.
 - Splitting the dataset into training and validation subsets, if it does not already come with separate ones. (optional)

In [12]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
# Convert the labels in the "type" column into numbers
le = preprocessing.LabelEncoder()
le.fit(df["type"])

# Check the unique elements in the column.
# The returned array lists the classes, and the index of each
# element is the number that class is converted to.
print(f"List of unique elements:{le.classes_}")

# To convert from string labels to int labels, use le.transform
df['type'] = le.transform(df["type"])

# print out the first 5 elements:
df.head()


List of unique elements:['red' 'white']


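| Unnamed: 0 | type | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 1 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 1 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 1 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |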
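
The encoder above maps `red` to 0 and `white` to 1 because `classes_` is sorted. The sketch below, run on a short made-up list of labels, shows how to read that mapping explicitly and how `inverse_transform` converts numbers back to the original strings, which is handy when reporting predictions.

```python
from sklearn.preprocessing import LabelEncoder

# Short made-up label list for illustration.
labels = ["white", "red", "white", "red"]

le = LabelEncoder()
encoded = le.fit_transform(labels)  # fit and transform in one call

# classes_ is sorted, so each class's index is its integer code.
mapping = {str(name): int(code) for code, name in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns integer codes back into the original strings.
decoded = le.inverse_transform(encoded)
print(list(decoded))
```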


In [13]:
# There are many ways to fill in NaN values.
# In this case, we simply replace every NaN value with 0.
# The dataset you will use in the exercise also contains NaN values,
# so apply this same step there.
df = df.fillna(0)

# Before training the model, we need to define the labels and
# the input features. In this example, the label is the "type"
# column, and the remaining columns are the features.

label_column = "type"
y = df[label_column]
X = df.loc[:, df.columns != label_column]

# Since the model takes in a numpy array, we convert 
# both the features and labels to numpy arrays.

y = y.to_numpy()
X = X.to_numpy()

# Finally, to have data for validating the trained model,
# we split the dataset into two subsets: training and validation.
# The ratio between the training and the validation set is 9:1.
# stratify=y keeps the class proportions the same in both subsets.

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y)
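
The cell above replaces missing values with 0, which is the step the exercise asks for. As one possible alternative, missing values can be filled with each column's median. The sketch below is only an illustration on made-up numbers; the column names are borrowed from the dataset, the values are not.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; only the column names come from the dataset.
sample = pd.DataFrame({
    "pH": [3.0, np.nan, 3.2, 3.4],
    "sulphates": [0.45, 0.49, np.nan, 0.40],
})

# Median of each numeric column, ignoring NaN.
medians = sample.median(numeric_only=True)

# fillna accepts a Series and matches its index against the column names.
filled = sample.fillna(medians)
print(filled)
```

When using a statistic such as the median, it is usually safer to compute it on the training subset only and reuse it for the validation subset, so that no information from the validation data leaks into training.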


## Step 3: Train the model
With the training set ready, we can then train the model and get predictions from it.

In [25]:
from k_nearest_neighbors import KNNClassifier
from decision_tree import DecisionTree

# Train the model using the "fit" function

tree = DecisionTree(max_depth=6, mode="ig")
tree.fit(X_train, y_train)
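
For a reference point while working on your own `DecisionTree`, scikit-learn's `DecisionTreeClassifier` follows the same fit/predict pattern. The sketch below assumes that `mode="ig"` stands for information gain, which this notebook does not confirm; under that assumption `criterion="entropy"` is the closest scikit-learn setting. It runs on synthetic data from `make_classification`, not on the wine data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for the wine features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=12, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42, stratify=y_demo)

# criterion="entropy" chooses splits by information gain (assumed to match
# mode="ig"); max_depth caps how deep the tree may grow.
clf = DecisionTreeClassifier(max_depth=6, criterion="entropy", random_state=42)
clf.fit(X_tr, y_tr)

pred = clf.predict(X_va)
print(f"Tree depth: {clf.get_depth()}, predictions: {pred.shape[0]}")
```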


After training, you can validate your results on the validation set. The ```accuracy_score``` function from ```sklearn.metrics``` reports the fraction of predictions that match the true labels.

In [26]:
from sklearn.metrics import accuracy_score

y_pred = tree.predict(X_val)
# accuracy_score expects the true labels first, then the predictions.
print(f"Validation accuracy: {accuracy_score(y_val, y_pred):.2%}")

Validation accuracy: 99.23%
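
To make the metric concrete, the sketch below computes accuracy by hand on a few made-up labels and compares it with `accuracy_score`. The arrays are illustrative only and are unrelated to the result above.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions: four of the five positions agree.
y_true = np.array([1, 0, 1, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0])

# Accuracy is the fraction of positions where prediction and label agree.
manual = float(np.mean(y_true == y_hat))

# accuracy_score expects the true labels first, then the predictions.
from_sklearn = float(accuracy_score(y_true, y_hat))

print(f"manual: {manual:.2%}, sklearn: {from_sklearn:.2%}")
```

Both values are 80.00% here, since 4 of the 5 predictions are correct.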
