## Supervised Machine Learning Algorithms 

Supervised learning is the type of machine learning in which models are trained on well-"labelled" training data and then use what they learned to predict the output for new data. Labelled data means that each input is already tagged with the correct output.

Supervised learning is the process of providing both the input data and the correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x) to the output variable (y).

In the real world, supervised learning can be used for risk assessment, image classification, fraud detection, spam filtering, etc.

In this section, we will cover:
* K-Nearest Neighbors 
* Decision/Regression Trees
* Random Forest

## K-Nearest Neighbors (KNN) ML Algorithm

* K-Nearest Neighbors (K-NN) is one of the simplest supervised machine learning algorithms.
* The K-NN algorithm assumes that similar cases lie close to each other, so it puts a new case into the category that is most common among the available cases it most resembles.
* The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be assigned to a well-suited category using the K-NN algorithm.
* The K-NN algorithm can be used for regression [to predict a numeric target variable] as well as for classification [to predict a categorical target variable], but **it is mostly used for classification problems.**
* K-NN is a **non-parametric algorithm**, which means it does not make any assumptions about the distribution of the underlying data.
* The K-Nearest Neighbors algorithm works by determining a new data point's class (in the case of classification) or value (in the case of regression) from its most similar neighbors.
    * How does it determine which data points are the most similar? Generally, this is done with a distance calculation, such as the Euclidean distance or the Manhattan distance.
        * Euclidean distance is a widely used distance metric. It is based on the Pythagorean theorem and gives the straight-line (shortest) distance between two points.
        * Manhattan distance measures the distance that a taxi cab would have to travel if it could only move along a grid of streets, making right-angle turns.
        * The Minkowski distance is a metric in a normed vector space that generalizes both the Euclidean distance and the Manhattan distance.
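
As a rough illustration of the three distance measures above, the sketch below computes them with NumPy for two made-up points. The helper name `minkowski` and the sample values are assumptions chosen for illustration; they are not part of scikit-learn or of the penguins data.

```python
import numpy as np

# Two hypothetical penguins described by (bill_length_mm, bill_depth_mm)
a = np.array([39.1, 18.7])
b = np.array([42.1, 14.7])

def minkowski(u, v, p):
    """Minkowski distance of order p between two points."""
    return np.sum(np.abs(u - v) ** p) ** (1 / p)

manhattan = minkowski(a, b, p=1)  # p = 1: sum of absolute differences
euclidean = minkowski(a, b, p=2)  # p = 2: straight-line distance

print(manhattan, euclidean)
```

Setting `p` to 1 or 2 recovers the Manhattan and Euclidean distances, which is the sense in which the Minkowski distance generalizes both.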
        
Official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier 

We will take an example of using the KNN algorithm for classification purposes.

In this section, we will use a built-in dataset in the seaborn package (https://seaborn.pydata.org/generated/seaborn.load_dataset.html): the dataset focuses on predicting the species of a penguin based on its physical characteristics. Seaborn is another great Python package that focuses on data exploration and visualization.

In [1]:
# Importing packages
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from seaborn import load_dataset

In [2]:
# loading the dataset
df = load_dataset('penguins')

In [3]:
df.head()

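|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---------|--------|----------------|---------------|-------------------|-------------|-----|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |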


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


In [5]:
df.nunique()

species                3
island                 3
bill_length_mm       164
bill_depth_mm         80
flipper_length_mm     55
body_mass_g           94
sex                    2
dtype: int64

#### Some information about the variables available in this dataset:
* species: The species of the penguin
* island: The island on which the penguin's data was taken
* bill_length_mm:The length of the penguin’s bill, measured in millimetres
* bill_depth_mm: The depth of the penguin’s bill, measured in millimetres
* flipper_length_mm: The length of the penguin’s flipper, measured in millimetres
* body_mass_g: The mass of the penguin, measured in grams
* sex: The sex of the penguin

#### To practice:
1. We will set up KNN algorithm with a target value and one feature
2. Then set up KNN algorithm with a target value and all numerical features 
3. Set up KNN algorithm with a target value and all features (including categorical features) 

And for each of the above, we will look at the accuracy of the model and how to evaluate it. 

#### 1. Using just one feature:
For example, we will classify the species of the penguin based only on the bill's length.

In [6]:
# Cleaning the data: dropping NA values
df = df.dropna()

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 333 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 20.8+ KB


In [8]:
# Determining our X (features array: usually a multi-dimensional array)
# and y (target array: expected to be of a single dimension)

X = df[['bill_length_mm']] # Usually more than one feature, which is why we pass a list of column names. In this case, it is just one.
y = df['species']

We will now split our data into a training dataset and a testing dataset. 

This is a very important step because a poorly split dataset, or one that’s not split at all, can hide two common problems in machine learning: **underfitting** and **overfitting** a model.

**Underfitting is a problem that occurs when a model doesn’t capture the relationships between different variables.**

A common cause is including the wrong variables. It can also occur when the wrong type of model is applied to a given problem, for example, applying a linear model to data whose underlying relationship is actually polynomial. An underfit model performs poorly on both training and testing data, which makes the problem easy to spot.

**On the other hand, overfitting occurs when the model attempts to find overly complex relationships between variables that don’t actually exist.**

This typically happens when the model learns from both the true relationships (the “signal”) and from random variation or variables that have little influence (the “noise”). Generally, these models perform exceptionally well on training data but quite poorly on testing data.
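
To make the overfitting pattern concrete, here is a minimal sketch that compares training and testing accuracy for a KNN model with a single neighbor. The synthetic dataset from `make_classification` and all parameter values are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption for illustration); flip_y adds label noise
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)

# With a single neighbor, every training point is its own nearest neighbor,
# so the model memorizes the training data, noise included
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

A large gap between the two numbers is the usual symptom of overfitting; an underfit model would instead score poorly on both.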

**To split the data, we will use sklearn's train_test_split function.** The only required arguments are the **arrays** to split: here, the features array and the target array.

You can also set options such as:
* **test_size**: defaults to None; if train_size is also None, it is set to 0.25 of the data
* **train_size**: defaults to None, in which case it is the complement of test_size
* **random_state**: defaults to None. It controls the shuffling applied to the data before the split. It is good practice to pass it an integer for reproducibility, similar to setting a random seed
* **shuffle**: defaults to True; whether or not to shuffle the data before splitting. If it is set to False, stratify must be None
* **stratify**: defaults to None; if not None, the data is split in a stratified fashion using the given array as class labels. This is especially helpful when classifying an imbalanced dataset, where the classes are not equally represented.

A simple explanation of what stratifying is:
"stratifying preserves the proportion of how data is distributed in the target column - and depicts that same proportion of distribution in the train_test_split. Take for example, if the problem is a binary classification problem, and the target column is having proportion of 80% = yes, and 20% = no. Since there are 4 times more 'yes' than 'no' in the target column, by splitting into train and test without stratifying, we might run into the trouble of having only the 'yes' falling into our training set, and all the 'no' falling into our test set.(i.e, the training set might not have 'no' in its target column)

Hence by Stratifying, the target column for the training set has 80% of 'yes' and 20% of 'no', and also, the target column for the test set has 80% of 'yes' and 20% of 'no' respectively.

Hence, Stratify makes even distribution of the target(label) in the train and test set - just as it is distributed in the original dataset." [https://stackoverflow.com/a/72092663]

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
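
The 80% / 20% scenario from the quote can be reproduced with a small sketch. The target array and the placeholder feature below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 'yes' and 20 'no'
y = np.array(['yes'] * 80 + ['no'] * 20)
X = np.arange(100).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=100, stratify=y)

# Share of 'yes' labels in each split
train_share = np.mean(y_train == 'yes')
test_share = np.mean(y_test == 'yes')
print(train_share, test_share)
```

With `stratify=y`, both splits keep the original 80/20 proportion; without it, the proportions would depend on the random shuffle.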

In [11]:
df['species'].value_counts() # Somewhat imbalanced data, so we will use stratified sampling

Adelie       146
Gentoo       119
Chinstrap     68
Name: species, dtype: int64

In [12]:
# Splitting the data into a training dataset and a testing dataset

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 100, stratify = y)

Before using the KNeighborsClassifier function, let's understand its parameters. Remember you can always check the official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier

Parameters:
* n_neighbors: default is 5. The number of neighbors the algorithm uses to classify a data point. An odd number is often recommended, since it avoids tied votes in binary classification.
* weights: default is 'uniform', where all neighbors count equally. With 'distance', points are weighted by the inverse of their distance, so closer neighbors of a query point have a greater influence than neighbors that are further away.
* algorithm: the algorithm used to compute the nearest neighbors; the default is 'auto', which attempts to pick the most appropriate algorithm based on the values passed to the fit method.
* leaf_size: default is 30; only relevant to the tree-based algorithms (BallTree and KDTree).
* p: default is 2. It is the power parameter for the Minkowski metric. p = 1 is equivalent to the Manhattan distance (l1), and p = 2 to the Euclidean distance (l2).
* metric: default is 'minkowski'. The metric to use for distance computation; with p = 2 it results in the standard Euclidean distance.
* metric_params: dict, default is None. Additional keyword arguments for the metric function.
* n_jobs: default is None, which means 1. It is the number of parallel jobs to run for the neighbors search; -1 uses all processors.
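
The effect of the `weights` parameter is easiest to see on a tiny example. The sketch below uses three made-up one-feature points, chosen so that the two settings disagree; it is an illustration, not a recommendation of either setting.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data: one 'A' point close to the query, two 'B' points farther away
X = np.array([[0.0], [3.0], [4.0]])
y = np.array(['A', 'B', 'B'])
query = np.array([[0.5]])

uniform = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X, y)
distance = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X, y)

pred_uniform = uniform.predict(query)[0]    # plain majority vote among the 3 neighbors
pred_distance = distance.predict(query)[0]  # votes weighted by 1 / distance
print(pred_uniform, pred_distance)
```

With uniform weights the two 'B' points outvote the single 'A' point. With distance weights the nearby 'A' point (weight 1/0.5 = 2) outweighs the two 'B' points (1/2.5 + 1/3.5, roughly 0.69).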

In [13]:
# First set up the classifier

clf = KNeighborsClassifier(n_neighbors = 5, weights = 'distance', p = 2)

In [14]:
# Fitting our model

clf.fit(X_train, y_train)

KNeighborsClassifier(weights='distance')

In [15]:
# Now onto predicting

predictions = clf.predict(X_test)

In [16]:
print(predictions)

['Adelie' 'Adelie' 'Gentoo' 'Adelie' 'Adelie' 'Gentoo' 'Adelie' 'Adelie'
 'Chinstrap' 'Gentoo' 'Adelie' 'Chinstrap' 'Adelie' 'Gentoo' 'Gentoo'
 'Chinstrap' 'Gentoo' 'Gentoo' 'Adelie' 'Chinstrap' 'Chinstrap'
 'Chinstrap' 'Gentoo' 'Chinstrap' 'Gentoo' 'Chinstrap' 'Gentoo' 'Gentoo'
 'Adelie' 'Chinstrap' 'Chinstrap' 'Chinstrap' 'Gentoo' 'Chinstrap'
 'Chinstrap' 'Gentoo' 'Gentoo' 'Adelie' 'Adelie' 'Gentoo' 'Chinstrap'
 'Adelie' 'Gentoo' 'Adelie' 'Adelie' 'Gentoo' 'Adelie' 'Adelie' 'Adelie'
 'Chinstrap' 'Adelie' 'Gentoo' 'Adelie' 'Adelie' 'Chinstrap' 'Adelie'
 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Gentoo' 'Adelie' 'Chinstrap'
 'Adelie' 'Adelie' 'Gentoo' 'Adelie' 'Gentoo' 'Chinstrap' 'Chinstrap'
 'Gentoo' 'Adelie' 'Adelie' 'Adelie' 'Chinstrap' 'Chinstrap' 'Chinstrap'
 'Adelie' 'Gentoo' 'Chinstrap' 'Adelie' 'Adelie' 'Gentoo' 'Adelie']


In [17]:
# Predicting on something not from the dataset

predictions = clf.predict([[44.2]]) # x: a penguin's bill length
print(predictions)

['Gentoo']




### Validating our KNN algorithm
Since our original dataset is pre-labelled, we can assess how accurate our model is. 

Because we split our data into training and testing data, we can evaluate the model’s performance on the testing data, which the model hasn’t seen yet. This gives a fair estimate of how well the model will perform on new data.

In classification problems, one helpful measure of a model’s effectiveness is the accuracy score: the proportion of correct predictions out of all predictions.

When we made predictions using the X_test array, sklearn returned an array of predictions. We already know the true values for these: they’re stored in y_test.

We can use the sklearn function accuracy_score() to return a proportion between 0 and 1 that measures the algorithm's effectiveness.
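
As a quick check of what the accuracy score measures, the sketch below computes it by hand and with `accuracy_score()` on a few made-up labels (the label arrays are assumptions for illustration).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true and predicted labels for five penguins
y_true = np.array(['Adelie', 'Gentoo', 'Chinstrap', 'Adelie', 'Gentoo'])
y_pred = np.array(['Adelie', 'Gentoo', 'Adelie', 'Adelie', 'Chinstrap'])

manual = np.mean(y_true == y_pred)        # correct predictions / all predictions
library = accuracy_score(y_true, y_pred)  # same quantity computed by sklearn
print(manual, library)
```

Three of the five predictions match, so both values come out to 0.6.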

In [19]:
predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))

0.6666666666666666


Not bad, but we can do better by adding more of the available features.

#### 2. Using all numeric features:
Here, we will classify the penguins' species based on all numeric features: bill length, bill depth, flipper length, and body mass.

In [20]:
# Reloading the data, just to give ourselves a clean slate
df = load_dataset('penguins')

# Cleaning the data
df = df.dropna()

# Creating our X and y
X = df.select_dtypes(include='number')
y = df['species']

In [22]:
X.head()

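|   | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|----------------|---------------|-------------------|-------------|
| 0 | 39.1 | 18.7 | 181.0 | 3750.0 |
| 1 | 39.5 | 17.4 | 186.0 | 3800.0 |
| 2 | 40.3 | 18.0 | 195.0 | 3250.0 |
| 4 | 36.7 | 19.3 | 193.0 | 3450.0 |
| 5 | 39.3 | 20.6 | 190.0 | 3650.0 |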


In [23]:
# Splitting the data into a training dataset and a testing dataset

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 100, stratify = y)

In [24]:
# Setting up the classifier

clf = KNeighborsClassifier(n_neighbors = 5, weights = 'distance', p = 2)

In [25]:
# Fitting our model

clf.fit(X_train, y_train)

KNeighborsClassifier(weights='distance')

In [26]:
# Testing our model/making predictions

predictions = clf.predict(X_test)

In [27]:
# Checking accuracy score 

print(accuracy_score(y_test, predictions))

0.8095238095238095


A great improvement! Let's also try adding our categorical variables, sex and island.

#### 3. Using all available features:
Machine learning models work with numerical data. Because of this, we need to transform the data in our categorical columns into numbers in order for our algorithm to work successfully.

There are a number of different ways in which we can encode our categorical data. One of these methods is known as **one-hot encoding.** This process converts each unique value in a categorical column into its own binary column. So, for example, after applying it to the sex column, we will have 2 new columns: female and male. After applying it to the island column, we will have 3 new columns, one for each island. Each new column is binary: 0 means the value is not present, while 1 means the value is present.

We will not encode the categories as 0, 1, 2 because that would imply an order among them that does not exist.

### Preprocessing before applying our third model

In [30]:
## One-hot encoding our categorical variables 

# Creating our updated/comprehensive X

X = df.drop(columns = ['species'])

# Resplitting our data based on the current X and y

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 100, stratify = y)

# Using make_column_transformer() to apply OneHotEncoder()

# Documentation> https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html 

column_transformer = make_column_transformer(
    (OneHotEncoder(), ['sex', 'island']), # apply it on these columns
    remainder='passthrough') # keep all other columns unchanged

# using the fit_transform() function to apply the column_transformer on the X_train data

X_train = column_transformer.fit_transform(X_train)


X_train = pd.DataFrame(data=X_train, columns=column_transformer.get_feature_names_out())

Our data is made up of different variables, and some have larger ranges than others. For example, the transformed categorical variables only take the values 0 and 1, while body mass ranges from 2700 to 6300. Because KNN relies on distances, it is a good idea to scale the data so that the variables with larger ranges do not "dominate" the algorithm.

In this case, we suggest Min-Max normalization, which rescales all values to a range from 0 to 1 while preserving the shape of the original distribution.
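
The Min-Max calculation itself is short: x' = (x - min) / (max - min). The sketch below applies it by hand and with `MinMaxScaler` to a few made-up body masses (the values are assumptions for illustration).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up body masses in grams, as a single feature column
body_mass = np.array([[2700.0], [3600.0], [4500.0], [6300.0]])

# Min-Max formula: x' = (x - min) / (max - min)
manual = (body_mass - body_mass.min()) / (body_mass.max() - body_mass.min())

# The same rescaling done by scikit-learn
scaled = MinMaxScaler().fit_transform(body_mass)
print(scaled.ravel())
```

The smallest value maps to 0, the largest to 1, and the spacing between values is preserved proportionally.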

To apply the Min-Max scaling, we will do very similar steps to the above

In [57]:
# Reloading and recleaning the data; just to ensure nothing is out of place
df = load_dataset('penguins')
df = df.dropna()

# Recreating our comprehensive X and y
X = df.drop(columns = ['species'])
y = df['species']

# Splitting the data into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 100, stratify = y)

# Using make_column_transformer() to apply OneHotEncoder() and MinMaxScaler()
# Documentation > https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

column_transformer = make_column_transformer(
    (OneHotEncoder(), ['sex', 'island']),
    (MinMaxScaler(), ['bill_depth_mm', 'bill_length_mm', 'flipper_length_mm', 'body_mass_g']),
    remainder='passthrough')

# using the fit_transform() function to apply the column_transformer on the X_train data

X_train = column_transformer.fit_transform(X_train)

X_train = pd.DataFrame(data=X_train, columns=column_transformer.get_feature_names_out())

# Apply the same to the testing data, but using the transform() function instead of fit_transform()

X_test = column_transformer.transform(X_test)
X_test = pd.DataFrame(data=X_test, columns=column_transformer.get_feature_names_out())

### Applying our third model:

In [58]:
# Setting up our classifier
clf = KNeighborsClassifier(n_neighbors = 5, weights = 'distance', p = 2)

In [59]:
# Training/fitting it

clf.fit(X_train, y_train)

KNeighborsClassifier(weights='distance')

In [60]:
# Testing/predicting 

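predictions = clf.predict(X_test)

In [61]:
# Checking the accuracy of the third model

print(accuracy_score(y_test, predictions))

0.8095238095238095

The score shown above was printed while `predictions` still held the numeric-only model's results, because the original cell assigned to the misspelled name `predictons`; that is why it matches the earlier score exactly. With the assignment corrected, rerunning the cells prints this model's own score.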


### More advanced learning opportunities:

1. Other methods of validating and evaluating the algorithm:

A confusion matrix, also known as an error matrix, is a powerful tool for evaluating the performance of classification models. It shows, for each true class, how many samples were assigned to each predicted class.

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

Helpful tutorial: https://www.jcchouinard.com/confusion-matrix-in-scikit-learn/ 

2. Hyper-parameter tuning:

Hyper-parameters are the settings that you specify while building a machine learning model, rather than values learned from the data. This includes, for example, the number of neighbours to consider or the type of distance to use.

Hyper-parameter tuning, then, refers to the process of adjusting these values to achieve a higher accuracy score. One way to do this is simply to plug in different values and see which hyper-parameters return the highest score.

This, however, is quite time-consuming. Scikit-Learn comes with a class, **GridSearchCV**, which makes the process simpler. You provide a dictionary of values to run through, and sklearn returns the values that worked best.

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

GridSearchCV evaluates each combination using **a process of cross-validation.** This means that it cycles through different splits of the data into training and validation parts, which gives a more reliable estimate of performance and helps guard against overfitting to a single split. For example, with 5 folds, each 20% slice of the data takes a turn as the held-out part while the model is trained on the remaining 80%.
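
A minimal sketch of both ideas, grid search followed by a confusion matrix, might look as follows. It uses the Iris dataset bundled with scikit-learn as a stand-in for the penguins data, and the parameter grid is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Iris ships with scikit-learn; it stands in here for the penguins data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=100, stratify=y)

# Candidate hyper-parameter values (an arbitrary illustrative grid)
param_grid = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'p': [1, 2],
}

# cv=5: each candidate is scored with 5-fold cross-validation on the training data
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, search.predict(X_test))
print(cm)
```

After fitting, `search` behaves like the best model found, refitted on all of the training data, so it can be used directly for prediction on the held-out test set.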

## Decision Trees ML Algorithm
* Decision tree classifiers are supervised machine learning models.
* Decision trees **can also be used for regression problems.**
* Decision tree classifiers work like flowcharts.
    * Each decision node of a tree represents a decision point that splits into two child nodes.
    * Each child node represents an outcome of the decision and can itself become a decision node.
    * Eventually, the different decisions lead to a final classification.
* The top node is called the root node. Each of the decision points is called a decision node. A final node, where no further decision is made and a classification is returned, is referred to as a leaf node.
* Decision trees are easy to understand, and their decision-making process is easy to interpret.
* They are generally faster to train than other algorithms, such as neural networks.
* Their complexity is a by-product of the data’s attributes and dimensions.
* They can handle high-dimensional data, often with good accuracy.
* They are a **non-parametric method**, meaning that they do not depend on probability distribution assumptions.
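
To see the flowchart structure described above, a fitted tree can be printed as text. This sketch uses the Iris dataset bundled with scikit-learn purely as a small stand-in; `max_depth=2` is an arbitrary choice to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Iris is bundled with scikit-learn and is used here only as a small stand-in dataset
iris = load_iris()

# max_depth=2 keeps the printed flowchart short
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Indented lines with a threshold are decision nodes;
# lines containing "class:" are leaf nodes
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

The first condition printed corresponds to the root node, and each level of indentation is one step further down the tree.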



The following part is already automated by the scikit-learn package; however, it is useful to understand the background.

The algorithm can use a number of different criteria to decide how to split the dataset into a series of decisions. One of them is **Gini Impurity.**

Useful resources:
- https://stats.stackexchange.com/a/69048 [Explains Gini Gain and its calculations simply]
- https://towardsdatascience.com/decision-trees-explained-entropy-information-gain-gini-index-ccp-pruning-4d78070db36c

Gini Impurity refers to a measurement of the likelihood of incorrect classification of a new instance of a random variable *if that instance was randomly classified according to the distribution of class labels from the dataset.*

**The Gini Impurity measures the likelihood that an item will be misclassified if it’s randomly assigned a class based on the data’s distribution.**

When training a decision tree, the best split is chosen by maximizing the Gini Gain, which is calculated by subtracting the weighted impurities of the branches from the original impurity.
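
The calculation can be sketched in a few lines of plain Python. The function names `gini_impurity` and `gini_gain` and the label lists are hypothetical, written for illustration; scikit-learn performs the equivalent computation internally.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two branches."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

# Made-up survival labels (1 = survived, 0 = did not)
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]

print(gini_impurity(parent))           # impurity before the split
print(gini_gain(parent, left, right))  # improvement achieved by the split
```

Here the parent impurity is 1 - (3/8)² - (5/8)² = 0.46875. The left branch has impurity 0.375 and the right branch is pure (0), so the weighted impurity is 0.1875 and the Gini Gain is 0.28125.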

In this exercise, we will use the Titanic dataset.

To practice, we will:
1. Apply the Decision Tree to all numeric data
2. Apply the Decision Tree to all available features

In [2]:
# Loading the data
df = pd.read_csv('TitanicFull.csv') # make sure to use your file path

In [63]:
# Summary of the dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [64]:
df.nunique()

PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64

In [3]:
# Dropping columns that are unnecessary or mostly null

df.drop(['Name', 'Ticket', 'Cabin'], axis = 1, inplace=True)

In [4]:
# Fixing NA values

df['Age'] = df['Age'].fillna(df['Age'].mean())

In [67]:
df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       2
dtype: int64

In [5]:
# Dropping the 2 missing values for Embarked 
df.dropna(inplace=True) 

In [69]:
df['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

### 1. Apply the Decision Tree to all numeric data

In [7]:
# Creating our X and y
# Survived is an integer data type as well, but it cannot be in both our X and y,
# so we drop it here. PassengerId and Fare are dropped as well.
X = df.select_dtypes(include='number').drop(columns=['Survived', 'PassengerId', 'Fare'])

y = df['Survived']

In [8]:
X

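|   | Pclass | Age | SibSp | Parch |
|---|--------|-----|-------|-------|
| 0 | 3 | 22.000000 | 1 | 0 |
| 1 | 1 | 38.000000 | 1 | 0 |
| 2 | 3 | 26.000000 | 0 | 0 |
| 3 | 1 | 35.000000 | 1 | 0 |
| 4 | 3 | 35.000000 | 0 | 0 |
| ... | ... | ... | ... | ... |
| 886 | 2 | 27.000000 | 0 | 0 |
| 887 | 1 | 19.000000 | 0 | 0 |
| 888 | 3 | 29.699118 | 1 | 2 |
| 889 | 1 | 26.000000 | 0 | 0 |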


In [97]:
# Since the data is a bit imbalanced, we will use stratify when splitting data 
df['Survived'].value_counts()

0    549
1    340
Name: Survived, dtype: int64

In [114]:
# Splitting data into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 100, stratify = y)

Before using the DecisionTreeClassifier function, let's understand its parameters. Remember you can always check the official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

Parameters:
* criterion: its default is 'gini'. The function to measure the quality of a split. The alternatives are 'entropy' and 'log_loss'
* splitter: its default is 'best'. The strategy used to choose the split at each node. Either 'best' or 'random'
* max_depth: its default is None. The maximum depth of the tree. If None, the nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples
* min_samples_split: its default is 2. The minimum number of samples required to split a node.
* min_samples_leaf: its default is 1. The minimum number of samples required to be at a leaf node.
* min_weight_fraction_leaf: its default is 0.0. The minimum weighted fraction of the sum of weights of all the input samples required to be at a leaf node.
* max_features: the default is None. The number of features to consider when looking for the best split. Can be:

    – int (that exact number of features),

    – float (that fraction of the features),

    – 'auto' (accepted only by older scikit-learn versions; equivalent to 'sqrt'),

    – 'sqrt' (the square root of the number of features),

    – 'log2' (the base-2 logarithm of the number of features),

    – None (all of the features)

* random_state: its default is None. The control for the randomness of the estimator.
* max_leaf_nodes: its default is None. Grow a tree with at most this number of leaf nodes. If None, then an unlimited number is possible.
* min_impurity_decrease: its default is 0.0. A node will be split if the split decreases the impurity by at least this value.


**In this section, we will focus on criterion, max_depth, max_features, and splitter.** Since we are beginners, we can rely on the sensible defaults that scikit-learn provides for the more complex decisions.
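
As a sketch of how these four parameters are passed, the example below sets each of them explicitly. The synthetic data and the specific values are assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=100, stratify=y)

clf = DecisionTreeClassifier(
    criterion='entropy',   # measure split quality with entropy instead of Gini
    splitter='best',       # evaluate candidate splits and keep the best one
    max_depth=3,           # stop growing after three levels of decisions
    max_features='sqrt',   # consider sqrt(n_features) features at each split
    random_state=100,      # make the result reproducible
)
clf.fit(X_train, y_train)

depth = clf.get_depth()
score = clf.score(X_test, y_test)
print(depth, score)
```

Limiting `max_depth` is one common way to keep a tree from memorizing the training data, at the cost of some flexibility.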

In [115]:
# Creating our classifier 

clf = DecisionTreeClassifier()

In [116]:
# Training the algorithm 

clf.fit(X_train, y_train)

DecisionTreeClassifier()

In [117]:
# Testing/predicting the algorithm

predictions = clf.predict(X_test)

In [118]:
print(predictions[:5])

[0 0 1 0 0]


In [119]:
# Measuring accuracy of our model 
print(accuracy_score(y_test, predictions))

0.6591928251121076


### 2. Apply the Decision Tree to all features

Remember we have to first transform the categorical variables using OneHotEncoder().

In [9]:
# Creating our X and y
X = df.copy()

# Survived is part of the dataset, but it cannot be in both our X and y
X.drop(['Survived', 'PassengerId', 'Fare'], axis = 1, inplace=True)


y = df['Survived']

In [10]:
X.head()

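|   | Pclass | Sex | Age | SibSp | Parch | Embarked |
|---|--------|-----|-----|-------|-------|----------|
| 0 | 3 | male | 22.0 | 1 | 0 | S |
| 1 | 1 | female | 38.0 | 1 | 0 | C |
| 2 | 3 | female | 26.0 | 0 | 0 | S |
| 3 | 1 | female | 35.0 | 1 | 0 | S |
| 4 | 3 | male | 35.0 | 0 | 0 | S |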


In [11]:
# Splitting data into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 100, stratify = y)

In [12]:
# Transforming our columns for the train data:

column_transformer = make_column_transformer(
    (OneHotEncoder(), ['Sex', 'Embarked']),
    remainder='passthrough')

X_train = column_transformer.fit_transform(X_train)

X_train = pd.DataFrame(data=X_train, columns=column_transformer.get_feature_names_out())

In [13]:
# Transforming our columns for the test data:

X_test = column_transformer.transform(X_test)

X_test = pd.DataFrame(data=X_test, columns=column_transformer.get_feature_names_out())

In [14]:
# Creating our classifier 

clf = DecisionTreeClassifier()

# Training the algorithm 

clf.fit(X_train, y_train)

# Testing/predicting the algorithm

predictions = clf.predict(X_test)

In [15]:
print(predictions[:5])

[0 0 1 1 0]


In [16]:
# Measuring accuracy of our model 
print(accuracy_score(y_test, predictions))

0.7892376681614349


Many machine learning algorithms are based on distance calculations, but decision trees are not: they split on one feature at a time using thresholds. Thus, we do not need to worry about scaling or normalizing data when using decision tree algorithms.
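
One way to check this claim is to train the same tree on raw and on Min-Max scaled features and compare the predictions. The sketch below does so on the Iris dataset bundled with scikit-learn, used here only as a stand-in; on this data the two sets of predictions should match.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=100, stratify=y)

# Tree trained on the raw features
raw_tree = DecisionTreeClassifier(random_state=100).fit(X_train, y_train)
raw_pred = raw_tree.predict(X_test)

# Tree trained on Min-Max scaled features (scaler fitted on the training data only)
scaler = MinMaxScaler().fit(X_train)
scaled_tree = DecisionTreeClassifier(random_state=100).fit(
    scaler.transform(X_train), y_train)
scaled_pred = scaled_tree.predict(scaler.transform(X_test))

same = np.array_equal(raw_pred, scaled_pred)
print(same)
```

Scaling changes the threshold values stored in the tree but not the order of the samples along each feature, which is all the splitting procedure depends on.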


### More advanced learning opportunities:

1. Other methods of validating and evaluating the algorithm:

A confusion matrix, also known as an error matrix, is a powerful tool for evaluating the performance of classification models. It shows, for each true class, how many samples were assigned to each predicted class.

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

Helpful tutorial: https://www.jcchouinard.com/confusion-matrix-in-scikit-learn/ 

2. Hyper-parameter tuning:

Hyper-parameters are the settings that you specify while building a machine learning model, rather than values learned from the data. This includes, for example, how the algorithm splits the data (either by entropy or Gini impurity), as well as the other parameters mentioned above.

Hyper-parameter tuning, then, refers to the process of adjusting these values to achieve a higher accuracy score. One way to do this is simply to plug in different values and see which hyper-parameters return the highest score.

This, however, is quite time-consuming. Scikit-Learn comes with a class, **GridSearchCV**, which makes the process simpler. You provide a dictionary of values to run through, and sklearn returns the values that worked best.

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

GridSearchCV evaluates each combination using **a process of cross-validation.** This means that it cycles through different splits of the data into training and validation parts, which gives a more reliable estimate of performance and helps guard against overfitting to a single split. For example, with 5 folds, each 20% slice of the data takes a turn as the held-out part while the model is trained on the remaining 80%.
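
Cross-validation can also be run on its own, without a grid search. The sketch below scores a decision tree with `cross_val_score` on the breast cancer dataset bundled with scikit-learn, which stands in here for the Titanic data; `max_depth=3` is an arbitrary illustrative choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Bundled scikit-learn dataset, used here as a stand-in for the Titanic data
X, y = load_breast_cancer(return_X_y=True)

clf = DecisionTreeClassifier(max_depth=3, random_state=100)

# cv=5 splits the data into 5 folds; each fold is held out once for scoring
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print(scores, scores.mean())
```

The result is one accuracy value per fold; their mean is usually a steadier estimate than the score from a single train/test split.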

## Random Forests ML Algorithm 
* A random forest classifier is an ensemble algorithm.
* An ensemble algorithm combines multiple instances of another algorithm to produce a result.
* Decision trees are prone to overfitting.
    * By "planting" more decision trees, one can reduce the overfitting that happens with a single decision tree.
    * A random forest automates the creation of many decision trees, each trained on a random sample of the data. Each tree receives a vote on how to classify. Some of these votes will be wildly overfitted and inaccurate. However, when a hundred trees are created, the classification returned by the most trees is very likely to be the most accurate.
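
The voting idea can be sketched by hand with a handful of decision trees, each trained on a bootstrap sample. This is a simplified illustration on the breast cancer dataset bundled with scikit-learn, not how `RandomForestClassifier` is implemented: scikit-learn's version averages the trees' predicted probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # binary labels: 0 or 1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=100, stratify=y)

rng = np.random.default_rng(100)
n_trees = 25  # arbitrary small, odd-sized forest so votes cannot tie
all_votes = []

for i in range(n_trees):
    # Bootstrap sample: draw training rows with replacement
    rows = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    tree.fit(X_train[rows], y_train[rows])
    all_votes.append(tree.predict(X_test))

# Majority vote: predict 1 when more than half of the trees say 1
votes = np.array(all_votes)  # shape: (n_trees, n_test_samples)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

print(accuracy_score(y_test, forest_pred))
```

Individual trees disagree on some samples, but their errors tend to differ, so the majority is often right where single trees are wrong.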
    
For this example, we will be using the penguins dataset again, but try applying the same steps to the Titanic dataset too!

In [17]:
# loading the dataset
df = load_dataset('penguins')

# Cleaning the data
df = df.dropna()

Since most ML algorithms cannot deal with categorical variables directly, we need to one-hot encode them with OneHotEncoder(). As we remember, they are the variables 'sex' and 'island'. We will do it a little bit differently here: on the entire dataset instead of on the train and test arrays.

Note: 
* The .categories_ attribute is a list containing one array of category names per encoded column
* The encoded object is a sparse matrix holding the one-hot encoded values. By converting it to an explicit array with .toarray(), the data can be mapped to DataFrame columns.


In [18]:
one_hot = OneHotEncoder()

# For island
encoded1 = one_hot.fit_transform(df[['island']])
df[one_hot.categories_[0]] = encoded1.toarray()

# For sex
encoded2 = one_hot.fit_transform(df[['sex']])
df[one_hot.categories_[0]] = encoded2.toarray()

In [19]:
df

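|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | Biscoe | Dream | Torgersen | Female | Male |
|---|---------|--------|----------------|---------------|-------------------|-------------|-----|--------|-------|-----------|--------|------|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |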


In [23]:
#Dropping the original categorical variables from the dataframe
df.drop(['sex', 'island'], axis = 1, inplace=True)

Documentation for the RandomForestClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

The parameters are very similar to those for the Decision Tree.
Note:
- n_estimators is the number of trees in the forest. The default is 100.

In [24]:
# Creating our X and y
X = df.copy()
X.drop(['species'], axis = 1, inplace=True)

y = df['species']

# Splitting our data into training and testing dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100, stratify=y)

In [25]:
# Creating our classifier 

forest_clf = RandomForestClassifier(n_estimators=100, random_state=100)

In [26]:
# Training the model

forest_clf.fit(X_train,y_train)

RandomForestClassifier(random_state=100)

In [27]:
# Testing/Predicting using our model
predictions = forest_clf.predict(X_test)

In [28]:
# Checking its accuracy
print(accuracy_score(y_test, predictions))

0.98


At-home practice: apply the random forest algorithm to the Titanic data and the decision tree algorithm to the penguins data. Compare and contrast the accuracy of the predictions.