# Data Mining & Machine Learning

## Algorithms in use:

  * Random Forest
  * Decision Tree
  * KNN (K Nearest Neighbors)

## Our purpose is to predict which passengers will survive, based on the passenger features

<img src="images_for_notebook/DataMining_10.png"/>
<img src="images_for_notebook/DataMining_11.png"/>

<br>
<br>
<hr class="dotted">
<br>
<br>

# Load Data

In [None]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

titanic = pd.read_csv('titanic_raw.csv')

# titanic.shape
titanic.head(5)

We have 891 examples and 12 columns (11 features/variables and one label). Let's look at the data.

<br>
<br>
<hr class="dotted">
<br>
<br>

# Data Cleaning

In [None]:
titanic = titanic.rename(columns=str.lower)  # Rename columns to lower-case letters
titanic['survived'] = (titanic['survived'] == 'Yes').astype(int)  # Label to numeric: 'Yes' -> 1, anything else -> 0
titanic = titanic.drop(['ticket', 'name'], axis=1)  # Drop some features which aren't informative
titanic['has_cabin'] = titanic['cabin'].notna().astype(int)  # Create new column "has_cabin": 1 if a cabin is recorded, else 0
titanic = titanic.drop(['cabin'], axis=1)  # Drop the old column "cabin"
titanic = pd.get_dummies(titanic)  # Categorical values to one-hot ("one-hot" encoding represents each category as its own binary column)
titanic['age'] = titanic['age'].fillna(titanic['age'].median())  # Fill missing "age" values with the median
titanic
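
To make the one-hot step in the cleaning cell concrete, here is a minimal sketch on a made-up three-row frame. The `toy` frame and its values are illustrative assumptions, not rows from the Titanic file. `pd.get_dummies` leaves numeric columns alone and replaces each text column with one indicator column per category.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({'age': [22, 38, 26],
                    'sex': ['male', 'female', 'male']})

encoded = pd.get_dummies(toy)  # 'sex' becomes 'sex_female' and 'sex_male'

# Depending on the pandas version the indicator columns are bool or uint8,
# so cast to int to always see 0/1
encoded = encoded.astype(int)
print(encoded)
```

The same thing happens to the text columns of `titanic`, which is why the cleaned frame ends up with more columns than the raw one.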

<h1 style="color:#FF0000">Notice!</h1>
<p style="color:#FF0000">This time we kept the 'passengerid' column</p>

In [None]:
# Let's convert all data to float because some modules warn against other types
titanic = titanic.astype(float)

<br>
<br>
<hr class="dotted">
<br>
<br>

# Check

In [None]:
# No missing values!
titanic.isna().sum()

In [None]:
# Check all values are indeed numeric (float)
titanic.dtypes

<br>
<br>
<hr class="dotted">
<br>
<br>

# Supervised Learning

<img src="images_for_notebook/SupervisedLearning_10.png"/>
<img src="images_for_notebook/SupervisedLearning_11.png"/>

# Train and Test split

<img src="images_for_notebook/Train_Split_10.png"/>
<img src="images_for_notebook/Train_Split_11.png"/>

In [None]:
titanic

### We only have 891 examples, so let's use 200 for testing and the rest for training, then split each set into inputs and labels
#### We'll use the Scikit-Learn library
<br>

 <p><u>Scikit-Learn</u></p>
Scikit-Learn is one of the most popular libraries for machine learning in Python. It includes functions for reading, writing and manipulating data, as well as many optimized machine learning algorithms.

The name derives from the combination <b>SCI</b>py tool<b>KIT</b> for machine <b>LEARN</b>ing. The library builds on numpy and scipy, and works well with pandas.

"sklearn" is the name you use to import "scikit-learn" in Python.

In [None]:
from sklearn.model_selection import train_test_split
test_size = 200
train, test = train_test_split(titanic, test_size=test_size, random_state=0, shuffle=True)
label = 'survived'
x_train = train.drop(label, axis=1)
y_train = train[label]
x_test, y_test = test.drop(label, axis=1), test[label]

In [None]:
train

In [None]:
test

In [None]:
x_train.shape, y_train.shape, x_test.shape, y_test.shape

In [None]:
x_train

In [None]:
y_train

In [None]:
x_test

In [None]:
y_test

<br>
<br>
<hr class="dotted">
<br>
<br>

# Decision Tree
## A tool to classify complex data

<img src="images_for_notebook/DecisionTree_10.png"/>
<img src="images_for_notebook/DecisionTree_11.png"/>
<img src="images_for_notebook/DecisionTree_12.png"/>
<img src="images_for_notebook/DecisionTree_13.png"/>
<img src="images_for_notebook/DecisionTree_14.png"/>
<img src="images_for_notebook/DecisionTree_15.png"/>
<img src="images_for_notebook/DecisionTree_16.png"/>
<img src="images_for_notebook/DecisionTree_17.png"/>
<img src="images_for_notebook/DecisionTree_18.png"/>
<img src="images_for_notebook/DecisionTree_19.png"/>

In [None]:
from sklearn.tree import DecisionTreeClassifier # Importing the algorithm

clf = DecisionTreeClassifier(max_depth=3)
# define the algorithm:
    # arg:
        # max_depth = The maximum depth of the tree. (If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples)


clf.fit(x_train, y_train) # running\training the algorithm with the train data

y_test_pred_DecisionTree = clf.predict(x_test) # making a prediction based on "test" data features


output = pd.DataFrame({'passengerid': x_test.passengerid, 'survived_what_actualy_happened':y_test, 'survived_predicted_by_model': y_test_pred_DecisionTree}) # saving results to DataFrame
output.to_csv('my_DecisionTree_Prediction.csv', index=False) # saving results to csv

In [None]:
# Plotting the decision tree
# Importing the necessary libraries
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from IPython.display import SVG
from graphviz import Source
from IPython.display import display

# This is a function that we can always use for plotting decision trees; it expects 3 arguments, as follows
def plot_tree(tree, features, labels):
    graph = Source(export_graphviz(tree, feature_names=features, class_names=labels, filled = True))
    display(SVG(graph.pipe(format='svg')))

# Using the function above, with its 3 arguments:
    # tree     --> the fitted classifier
    # features --> the feature (column) names
    # labels   --> the class names, in ascending order of the classes (0 = Died, 1 = Survived)
plot_tree(clf, x_train.columns, ['Died', 'Survived'])


<h3 style="color:#FF0000">Notice!</h3>
If running the plot cell for the model raises the error "No module named 'graphviz'", open the Anaconda Prompt and run the following commands:<br>
"conda install -c anaconda pydot"<br>
"conda install -c conda-forge python-graphviz"



<img src="images_for_notebook/DecisionTree_20.png"/>
<img src="images_for_notebook/DecisionTree_21.png"/>
<img src="images_for_notebook/DecisionTree_22.png"/>

<br>
<br>
<hr class="dotted">
<br>
<br>

# Random Forest model

<img src="images_for_notebook/RandomForest_10.png"/>
<img src="images_for_notebook/RandomForest_11.png"/>
<img src="images_for_notebook/RandomForest_12.png"/>
<img src="images_for_notebook/RandomForest_13.png"/>

<br>
<br>
We'll build what's known as a random forest model. This model is made up of several "trees" (there are three trees in the picture below, but we'll construct 100!). Each decision tree is built on a randomly selected sample of the data, considers each passenger's data individually, and votes on whether that passenger survived. The random forest model then makes a democratic decision: the outcome with the most votes wins!
<br>
<br>
<br>
<br>

<img src="images_for_notebook/RandomForest_1.png"/>

<br>
The code cell below looks for patterns in the different columns\variables of the data.
It constructs the trees in the random forest model based on patterns in the "train" data, before generating predictions for the passengers in "test" data. The code also saves these new predictions in a new CSV file.
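
The voting idea can be checked by hand. The sketch below uses synthetic data from `make_classification` (an assumption, so it runs without the Titanic file) and reads the individual trees from the fitted forest's `estimators_` attribute. One detail worth knowing: scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities rather than by counting hard votes, so the two can occasionally disagree on close calls.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=25, max_depth=3, random_state=1)
forest.fit(X, y)

# Each fitted tree gives its own class probabilities for every row
per_tree_proba = np.array([tree.predict_proba(X) for tree in forest.estimators_])

# Averaging over the trees reproduces the forest's own probabilities
avg_proba = per_tree_proba.mean(axis=0)

# Hard majority vote, for comparison: count the trees whose top class is 1
hard_votes = (per_tree_proba.argmax(axis=2) == 1).sum(axis=0)
majority = (hard_votes > len(forest.estimators_) / 2).astype(int)

agreement = (majority == forest.predict(X)).mean()
print('share of rows where the hard vote and the forest agree:', agreement)
```

An odd number of trees is used here so that the hard vote can never end in a tie.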
<br>
<br>

In [None]:
from sklearn.ensemble import RandomForestClassifier # Importing the algorithm

model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=1)
# define the algorithm:
    # arg:
        # n_estimators = number of trees
        # max_depth = the maximum depth of the trees
        # random_state =
            # each tree is built from a random selection of samples and features; the random_state parameter seeds these random choices.
            # if you call this with random_state=1 (or any other fixed value), you'll get the same result every time you run it.
            
model.fit(x_train, y_train) # running\training the algorithm with the train data

y_test_pred_RandomForest = model.predict(x_test) # making a prediction based on "test" data features


output = pd.DataFrame({'passengerid': x_test.passengerid, 'survived_what_actualy_happened':y_test, 'survived_predicted_by_model': y_test_pred_RandomForest}) # saving results to DataFrame
output.to_csv('my_RandomForest_Prediction.csv', index=False) # saving results to csv


<br>
<br>
<hr class="dotted">
<br>
<br>

# KNN - K Nearest Neighbors

## Show me your neighbors and I'll tell you who you are

<img src="images_for_notebook/Knn_10.png"/>
<img src="images_for_notebook/Knn_11.png"/>
<img src="images_for_notebook/Knn_12.png"/>
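
The idea fits in a few lines of numpy. This is a minimal sketch, assuming plain Euclidean distance and a simple majority vote; the helper `knn_predict` and the six toy points are hypothetical and only illustrate the principle. `KNeighborsClassifier` follows the same idea by default, with a faster neighbor search.

```python
import numpy as np
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(train_points - query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors
    return Counter(train_labels[nearest].tolist()).most_common(1)[0][0]

# Two made-up clusters: class 0 near (1, 1), class 1 near (6, 6)
points = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
labels = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(points, labels, np.array([2, 2]))  # sits next to the class-0 cluster
pred_b = knn_predict(points, labels, np.array([5, 5]))  # sits next to the class-1 cluster
print(pred_a, pred_b)
```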

In [None]:
from sklearn.neighbors import KNeighborsClassifier  # Importing the algorithm

clf = KNeighborsClassifier(n_neighbors=3)
# define the algorithm:
    # arg:
        # n_neighbors = number of neighbors
        
clf.fit(x_train, y_train) # running\training the algorithm with the train data

y_test_pred_Knn = clf.predict(x_test) # making a prediction based on "test" data features

output = pd.DataFrame({'passengerid': x_test.passengerid, 'survived_what_actualy_happened':y_test, 'survived_predicted_by_model': y_test_pred_Knn}) # saving results to DataFrame
output.to_csv('my_Knn_Prediction.csv', index=False) # saving results to csv



<br>
<br>
<hr class="dotted">
<br>
<br>

# Accuracy

<img src="images_for_notebook/Accuracy_10.png"/>


### As this is a classification problem, we can use accuracy as our evaluation metric
### Let's import that from sklearn
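
Accuracy is simply the number of correct predictions divided by the total number of predictions. A small sketch with made-up labels (the two lists are illustrative assumptions) shows that `accuracy_score` computes exactly this ratio:

```python
from sklearn.metrics import accuracy_score

actual    = [1, 0, 1, 1, 0]  # made-up true labels
predicted = [1, 0, 0, 1, 0]  # made-up predictions: 4 of the 5 match

# By hand: count the matches and divide by the number of examples
manual_acc = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# The same value from scikit-learn
sklearn_acc = accuracy_score(actual, predicted)
print(manual_acc, sklearn_acc)
```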
<br>

In [None]:
from sklearn.metrics import accuracy_score # importing "accuracy_score" from "sklearn.metrics"

In [None]:
# Evaluation for Decision Tree
test_acc = accuracy_score(y_test, y_test_pred_DecisionTree)
test_acc

In [None]:
# Evaluation for Random Forest
test_acc = accuracy_score(y_test, y_test_pred_RandomForest)
test_acc

In [None]:
# Evaluation for Knn
test_acc = accuracy_score(y_test, y_test_pred_Knn)
test_acc

### We can easily see that the "Decision Tree" algorithm had the best result; "Knn", on the other hand, did poorly

<br>
<br>
<hr class="dotted">
<br>
<br>

# Overfitting
## How do we improve our models?
## First, we need to understand what "overfitting" means

<img src="images_for_notebook/Overfitting_10.png"/>
<img src="images_for_notebook/Overfitting_11.png"/>
<img src="images_for_notebook/Overfitting_12.png"/>

<br>
<br>
<hr class="dotted">
<br>
<br>

# Overfitting in Decision Tree

<img src="images_for_notebook/Overfitting_DecisionTree_10.png"/>
<img src="images_for_notebook/Overfitting_DecisionTree_11.png"/>

In [None]:
# Decision Tree -- max_depth = 3 (like before)

clf = DecisionTreeClassifier(max_depth=3)
clf.fit(x_train, y_train)
y_test_pred_DecisionTree = clf.predict(x_test)
test_acc = accuracy_score(y_test, y_test_pred_DecisionTree)
test_acc

In [None]:
# Decision Tree -- max_depth = 5

clf = DecisionTreeClassifier(max_depth=5)
clf.fit(x_train, y_train)
y_test_pred_DecisionTree = clf.predict(x_test)
test_acc = accuracy_score(y_test, y_test_pred_DecisionTree)
test_acc

In [None]:
# Decision Tree -- max_depth = 2

clf = DecisionTreeClassifier(max_depth=2)
clf.fit(x_train, y_train)
y_test_pred_DecisionTree = clf.predict(x_test)
test_acc = accuracy_score(y_test, y_test_pred_DecisionTree)
test_acc

In [None]:
# Decision Tree -- max_depth = 99

clf = DecisionTreeClassifier(max_depth=99)
clf.fit(x_train, y_train)
y_test_pred_DecisionTree = clf.predict(x_test)
test_acc = accuracy_score(y_test, y_test_pred_DecisionTree)
test_acc

## max_depth=3 --> looks like the best one
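
The four cells above can be condensed into one loop. The sketch below also records the accuracy on the training data, because a large gap between train and test accuracy is the usual sign of overfitting. It runs on synthetic, deliberately noisy data from `make_classification` (an assumption, so the example is self-contained); the names `X_tr`, `X_te`, `y_tr`, `y_te` and `results` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 20% randomly flipped labels (assumption)
X, y = make_classification(n_samples=600, n_features=8, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=200, random_state=0)

results = {}
for depth in [2, 3, 5, 99]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, tree.predict(X_tr))  # how well the tree fits data it has seen
    test_acc = accuracy_score(y_te, tree.predict(X_te))   # how well it generalizes to unseen data
    results[depth] = (train_acc, test_acc)
    print(f'max_depth={depth:>2}  train={train_acc:.3f}  test={test_acc:.3f}')
```

With the notebook's own data you would use `x_train`, `y_train`, `x_test` and `y_test` in place of the synthetic split.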

<br>
<br>
<hr class="dotted">
<br>
<br>

# Overfitting in Random Forest

In [None]:
# Random Forest -- n_estimators = 100 & max_depth = 3 (like before)

model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=1)        
model.fit(x_train, y_train)
y_test_pred_RandomForest = model.predict(x_test)
test_acc = accuracy_score(y_test, y_test_pred_RandomForest)
test_acc

In [None]:
# Random Forest -- n_estimators = 200 & max_depth = 3

model = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=1)        
model.fit(x_train, y_train)
y_test_pred_RandomForest = model.predict(x_test)
test_acc = accuracy_score(y_test, y_test_pred_RandomForest)
test_acc

In [None]:
# Random Forest -- n_estimators = 150 & max_depth = 3

model = RandomForestClassifier(n_estimators=150, max_depth=3, random_state=1)        
model.fit(x_train, y_train)
y_test_pred_RandomForest = model.predict(x_test)
test_acc = accuracy_score(y_test, y_test_pred_RandomForest)
test_acc

In [None]:
# Random Forest -- n_estimators = 50 & max_depth = 3

model = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=1)        
model.fit(x_train, y_train)
y_test_pred_RandomForest = model.predict(x_test)
test_acc = accuracy_score(y_test, y_test_pred_RandomForest)
test_acc

In [None]:
# Random Forest -- n_estimators = 10 & max_depth = 3

model = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=1)        
model.fit(x_train, y_train)
y_test_pred_RandomForest = model.predict(x_test)
test_acc = accuracy_score(y_test, y_test_pred_RandomForest)
test_acc

## Looks like "n_estimators" should be about 100 (n_estimators = 100 ~ 81.5% accuracy)
## Let's "play" with  "max_depth"
<br>

In [None]:
# Random Forest -- n_estimators = 100 & max_depth = 5

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)        
model.fit(x_train, y_train)
y_test_pred_RandomForest = model.predict(x_test)
test_acc = accuracy_score(y_test, y_test_pred_RandomForest)
test_acc

In [None]:
# Random Forest -- n_estimators = 100 & max_depth = 7

model = RandomForestClassifier(n_estimators=100, max_depth=7, random_state=1)        
model.fit(x_train, y_train)
y_test_pred_RandomForest = model.predict(x_test)
test_acc = accuracy_score(y_test, y_test_pred_RandomForest)
test_acc

In [None]:
# Random Forest -- n_estimators = 100 & max_depth = 9

model = RandomForestClassifier(n_estimators=100, max_depth=9, random_state=1)        
model.fit(x_train, y_train)
y_test_pred_RandomForest = model.predict(x_test)
test_acc = accuracy_score(y_test, y_test_pred_RandomForest)
test_acc

In [None]:
# Random Forest -- n_estimators = 100 & max_depth = 21

model = RandomForestClassifier(n_estimators=100, max_depth=21, random_state=1)        
model.fit(x_train, y_train)
y_test_pred_RandomForest = model.predict(x_test)
test_acc = accuracy_score(y_test, y_test_pred_RandomForest)
test_acc

In [None]:
# Random Forest -- n_estimators = 100 & max_depth = 25

model = RandomForestClassifier(n_estimators=100, max_depth=25, random_state=1)        
model.fit(x_train, y_train)
y_test_pred_RandomForest = model.predict(x_test)
test_acc = accuracy_score(y_test, y_test_pred_RandomForest)
test_acc

## We've improved our model
## "n_estimators=100" & "max_depth=25" is a pretty good fit ~ 84%
<br>
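
The cells above pick `n_estimators` and `max_depth` by looking at the test accuracy, which tends to make the final test score optimistic because the test set has influenced the choice. One common alternative, sketched here as a suggestion rather than as the method of this notebook, is to compare settings with cross-validation on the training data only, using scikit-learn's `GridSearchCV`. The data is synthetic (an assumption) and the grid is kept tiny so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=100, random_state=0)

param_grid = {'n_estimators': [10, 50], 'max_depth': [3, 5]}  # illustrative values
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid,
                      cv=3, scoring='accuracy')
search.fit(X_tr, y_tr)  # only the training data is used to choose the settings

print('best settings:', search.best_params_)
print('cross-validated accuracy:', search.best_score_)
# The test set is touched once, at the very end
print('test accuracy:', search.score(X_te, y_te))
```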

# The impact of the number of trees
![RandomForest](https://miro.medium.com/max/675/1*EFBVZvHEIoMdYHjvAZg8Zg.gif "random forest")

<br>
<br>
<hr class="dotted">
<br>
<br>

# Overfitting in Knn

<img src="images_for_notebook/Overfitting_Knn_10.png"/>
<img src="images_for_notebook/Overfitting_Knn_11.png"/>

In [None]:
# Knn -- n_neighbors = 3 (like before)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(x_train, y_train)
y_test_pred_Knn = clf.predict(x_test)
test_acc = accuracy_score(y_test, y_test_pred_Knn)
test_acc

In [None]:
# Knn -- n_neighbors = 7

clf = KNeighborsClassifier(n_neighbors=7)
clf.fit(x_train, y_train)
y_test_pred_Knn = clf.predict(x_test)
test_acc = accuracy_score(y_test, y_test_pred_Knn)
test_acc

In [None]:
# Knn -- n_neighbors = 9

clf = KNeighborsClassifier(n_neighbors=9)
clf.fit(x_train, y_train)
y_test_pred_Knn = clf.predict(x_test)
test_acc = accuracy_score(y_test, y_test_pred_Knn)
test_acc

In [None]:
# Knn -- n_neighbors = 33

clf = KNeighborsClassifier(n_neighbors=33)
clf.fit(x_train, y_train)
y_test_pred_Knn = clf.predict(x_test)
test_acc = accuracy_score(y_test, y_test_pred_Knn)
test_acc

In [None]:
# Knn -- n_neighbors = 100

clf = KNeighborsClassifier(n_neighbors=100)
clf.fit(x_train, y_train)
y_test_pred_Knn = clf.predict(x_test)
test_acc = accuracy_score(y_test, y_test_pred_Knn)
test_acc

## It seems the algorithm overfits a bit. We also know that KNN can suffer badly when features are on different scales, so let's scale the x values first
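
As a small illustration of what standardization does, the sketch below scales two made-up columns that live on very different scales (the `data` array is an assumption). `StandardScaler` subtracts each column's mean and divides by its standard deviation, so afterwards every column has mean 0 and standard deviation 1 and contributes comparably to a distance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales
data = np.array([[1.0, 100.0],
                 [2.0, 200.0],
                 [3.0, 300.0]])

scaled = StandardScaler().fit_transform(data)

# The same thing by hand: subtract each column's mean, divide by its standard deviation
manual = (data - data.mean(axis=0)) / data.std(axis=0)

print(scaled)
```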

In [None]:
from sklearn.preprocessing import StandardScaler # import the scaler

scaler = StandardScaler() # define

x_train_scaled = scaler.fit_transform(x_train) # Learn each column's mean and std from the train data, then transform it
x_test_scaled = scaler.transform(x_test) # Transform the test data using the mean and std learned from the train data

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(x_train_scaled, y_train)

y_test_pred_Knn_Scaled = clf.predict(x_test_scaled)


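# Note: if the cells were run in order, y_test_pred_Knn holds the unscaled predictions from the most recent run above (n_neighbors=100)
output = pd.DataFrame({'passengerid': x_test.passengerid, 'survived_what_actualy_happened':y_test, 'survived_predicted_by_model': y_test_pred_Knn, 'survived_predicted_by_model_scaled':y_test_pred_Knn_Scaled}) # saving results to DataFrame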
output.to_csv('my_Knn_Prediction_scaled.csv', index=False) # saving results to csv


test_acc = accuracy_score(y_test, y_test_pred_Knn_Scaled)
test_acc

## Much better!

<br>
<br>
<hr class="dotted">
<br>
<br>