<a href="https://colab.research.google.com/github/ravidata-25/DecisionTreesFoundations/blob/main/Aritificial_Neural_Networks_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **What are Artificial Neural Networks (ANNs)?**

A useful way to introduce ANNs is through an analogy with the human brain.

The Neuron: At its core, the brain is made of billions of interconnected neurons. An individual neuron is not very smart, but connected together, neurons form a powerful network. An ANN is a computational model inspired by this structure. The "neurons" in an ANN are called nodes.


Layers: These nodes are organized into layers. Every ANN has at least two, an input layer and an output layer, and most also have hidden layers in between:

An Input Layer: This is where the data comes in. You have one node for each feature in your dataset (e.g., one for CreditScore, one for Age, one for Balance, etc.).

An Output Layer: This layer produces the final prediction. For a binary classification problem like ours ("Will the customer exit? Yes/No"), this layer often has just one node that outputs a probability (e.g., a 78% chance the customer will leave).

Hidden Layers: Between the input and output layers, we can have one or more hidden layers. These are the "processing" centers of the network. It's in these layers that the ANN learns to identify complex patterns and non-linear relationships in the data.

The network learns how to combine the input features in interesting ways to make a better prediction. A network with many hidden layers is called a "deep" neural network, which is where the term "deep learning" comes from.

Connections and Weights: Every node in one layer is connected to nodes in the next layer. Each connection has a weight associated with it. This weight represents the strength or importance of that connection. During training, the network's main goal is to find the optimal set of weights that produces the most accurate predictions.


Activation Functions: Inside each node (except in the input layer), an activation function is applied to the weighted sum of the inputs the node receives. This function determines how strongly the node "activates" and passes its signal on. It introduces non-linearity into the model, which is crucial: without it, any stack of layers collapses into a single linear transformation, so the ANN would be no more expressive than a linear model and could not capture intricate patterns.
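
To make the role of weights and activation functions concrete, here is a minimal NumPy sketch. The layer sizes, weights, and inputs are made-up illustrative values, not part of the case study. It shows that two layers stacked without an activation are equivalent to one linear layer, and that placing ReLU between them breaks that equivalence.

```python
import numpy as np

def relu(z):
    """Return 0 for negative inputs and the input itself otherwise."""
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)

# Hypothetical shapes: 3 input features -> 4 hidden nodes -> 1 output node.
x = rng.normal(size=(5, 3))                      # 5 example rows, 3 features
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 1)), rng.normal(size=1)

# Each node computes a weighted sum of its inputs plus a bias.
# Two layers with NO activation in between...
stacked_linear = (x @ W1 + b1) @ W2 + b2

# ...are equivalent to a single linear layer with merged weights.
W_merged = W1 @ W2
b_merged = b1 @ W2 + b2
single_linear = x @ W_merged + b_merged

# With ReLU between the layers, the output is no longer that linear map.
with_activation = relu(x @ W1 + b1) @ W2 + b2

print(np.allclose(stacked_linear, single_linear))
print(np.allclose(with_activation, single_linear))
```

The first comparison holds for any choice of weights; the second differs as soon as at least one hidden node receives a negative weighted sum.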



### **Why and When Do We Use ANNs?**

This is a key question, because ANNs are not always the best tool for the job.

When to Use ANNs:

Complex, Non-Linear Problems: ANNs excel when the relationship between the input features and the output is complex and not easily captured by simpler models like linear or logistic regression. Customer churn is a perfect example; the decision to leave a bank is likely a combination of many subtle factors, not a simple straight-line relationship.


Large Datasets: Neural networks are data-hungry. They need a lot of examples to learn the optimal weights. With thousands or millions of data points (like in our dataset of 10,000 customers), they can uncover patterns that would be invisible in smaller datasets.


High-Dimensional Data: They work very well on problems with many input features, such as image recognition (where every pixel is a feature) or natural language processing.


When Predictive Performance is Paramount: If your primary goal is to get the most accurate prediction possible, and you care less about understanding the why behind the prediction, ANNs are a top choice.


### **When to Consider Alternatives:**

Small Datasets: On small datasets, ANNs are prone to overfitting—essentially memorizing the training data instead of learning a general pattern. Simpler models like Logistic Regression, Decision Trees, or SVMs often perform better and are less computationally expensive.

Need for Interpretability: ANNs are often called "black boxes." It is very difficult to look at the thousands or millions of weights and understand exactly why the model made a specific prediction. If you need to explain the decision-making process to a regulator or a business stakeholder (e.g., "This customer was denied a loan because their debt-to-income ratio was too high"), a Decision Tree or Logistic Regression model would be a much better choice.

In [None]:
#----------------------- Artificial Neural Network for classification --------------------#
#importing required libraries
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.compose import ColumnTransformer
# What it is: This is a very useful tool for applying different preprocessing steps to different columns of the data.
# Why we need it: We want to OneHotEncode the 'Geography' column but leave the other columns (like 'Age' and 'Balance') alone.
#  ColumnTransformer lets us do exactly that in one clean step.


from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, accuracy_score

In [None]:
#----------------------- Data Pre-processing ----------------------#
# Checking the tensorflow version
print(tf.__version__)


2.19.0


In [None]:
# Loading the data
bank_data = pd.read_csv("/content/Artificial_Neural_Network_Case_Study_data.csv")

In [None]:
bank_data.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [None]:
bank_data.shape

(10000, 14)

In [None]:
bank_data.isnull().sum()

Unnamed: 0,0
RowNumber,0
CustomerId,0
Surname,0
CreditScore,0
Geography,0
Gender,0
Age,0
Tenure,0
Balance,0
NumOfProducts,0


In [None]:
# Taking all rows and the columns from CreditScore up to (but not including) the last column as X (feature matrix)
# RowNumber, CustomerId and Surname are identifiers that are not needed for modelling, so we skip them and start at CreditScore
X = bank_data.iloc[:,3:-1].values
print("Independent variables are:", X)

'''
bank_data.iloc is how we select data in Pandas by its numerical position.
The [:, 3:-1] part is the key. The first colon : means 'give me all the rows'.
The 3:-1 means 'start at the 4th column (index 3) and go up to, but not including, the last column'.
Why start at column 3? Because RowNumber, CustomerId, and Surname are just identifiers.
They don't have any predictive power, so we exclude them.
The .values at the end converts our data from a Pandas table into a NumPy array, which is the format our model requires.

'''

# Taking all rows but only the last column as y (dependent variable)
y = bank_data.iloc[:, -1].values
print("Dependent variable is:", y)


'''
This is simpler. We're telling it to take all rows (:) and only the very last column (-1),
which is our 'Exited' column. This column contains the 'answers' our model needs to learn from.

'''

Independent variables are: [[619 'France' 'Female' ... 1 1 101348.88]
 [608 'Spain' 'Female' ... 0 1 112542.58]
 [502 'France' 'Female' ... 1 0 113931.57]
 ...
 [709 'France' 'Female' ... 0 1 42085.58]
 [772 'Germany' 'Male' ... 1 0 92888.52]
 [792 'France' 'Female' ... 1 0 38190.78]]
Dependent variable is: [1 0 1 ... 1 1 0]
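
The positional slices used above follow ordinary Python slicing rules, so they can be checked on a tiny example. This is a sketch on a plain NumPy array rather than a DataFrame (the slice semantics are the same); the two rows are the first two rows shown by `bank_data.head()`, and the names `rows`, `X_demo`, and `y_demo` are illustrative.

```python
import numpy as np

# Column order as shown in the head() output above.
columns = ["RowNumber", "CustomerId", "Surname", "CreditScore", "Geography",
           "Gender", "Age", "Tenure", "Balance", "NumOfProducts", "HasCrCard",
           "IsActiveMember", "EstimatedSalary", "Exited"]
rows = np.array([
    [1, 15634602, "Hargrave", 619, "France", "Female", 42, 2, 0.0, 1, 1, 1, 101348.88, 1],
    [2, 15647311, "Hill", 608, "Spain", "Female", 41, 1, 83807.86, 1, 0, 1, 112542.58, 0],
], dtype=object)

feature_names = columns[3:-1]   # CreditScore ... EstimatedSalary
X_demo = rows[:, 3:-1]          # all rows, column 3 up to (not including) the last
y_demo = rows[:, -1]            # all rows, last column only

print(feature_names)
print(X_demo.shape, y_demo.shape)
```

Ten feature columns remain at this point; the count grows to 12 later because one-hot encoding replaces the single Geography column with three.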


In [None]:
# Encoding the Gender column as integers; LabelEncoder assigns codes in sorted label order (Female -> 0, Male -> 1), not randomly
le = LabelEncoder()
X[:,2] = le.fit_transform(X[:,2])
print(X)

'''
le = LabelEncoder(): First, we create an instance of the LabelEncoder object, which we'll call le.
X[:,2] = le.fit_transform(X[:,2]): This is the main action.
X[:,2] selects all rows (:) but only the third column (2), which is our 'Gender' column.
The fit_transform method does two things at once:
fit: It looks at the column and learns all the unique categories ('Female' and 'Male').
transform: It then converts each of those categories into an integer. For example, it will assign 'Female' to 0 and 'Male' to 1.
Finally, we replace the original 'Gender' column with these new numerical values.

'''


[[619 'France' 0 ... 1 1 101348.88]
 [608 'Spain' 0 ... 0 1 112542.58]
 [502 'France' 0 ... 1 0 113931.57]
 ...
 [709 'France' 0 ... 0 1 42085.58]
 [772 'Germany' 1 ... 1 0 92888.52]
 [792 'France' 0 ... 1 0 38190.78]]
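
What `fit_transform` does here can be imitated with NumPy alone. The following is a sketch of the idea, not scikit-learn's implementation; the `gender` sample is made up.

```python
import numpy as np

gender = np.array(["Female", "Male", "Female", "Male", "Male"])  # made-up sample

# "fit": find the distinct labels in sorted order.
# "transform": replace each label by its position in that sorted list.
classes, codes = np.unique(gender, return_inverse=True)
codes = codes.ravel()

print(classes)
print(codes)
```

Because the labels are sorted before codes are assigned, 'Female' always maps to 0 and 'Male' to 1, regardless of which appears first in the data.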


In [None]:
# One-hot encoding the Geography column; [1] is the index of the column to encode, and all other columns pass through unchanged
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(),[1])], remainder = 'passthrough')
X = np.array(ct.fit_transform(X))
print(X)


'''
We could use LabelEncoder again, which would turn France, Germany, and Spain into 0, 1, and 2. But this creates a subtle problem.
It implies an ordinal relationship: that Spain (2) is somehow 'greater' than Germany (1), which is 'greater' than France (0).
Our model might mistakenly learn this non-existent order, which is bad.


To avoid this, we use a better technique for columns with more than two categories: One-Hot Encoding.
This creates new binary columns for each category.
ct = ColumnTransformer(...): We use ColumnTransformer to apply this change only to the geography column while leaving all
other columns untouched.
transformers = [('encoder', OneHotEncoder(), [1])]: This is the core instruction. We're telling it:
Apply a OneHotEncoder()...
...to the column at index 1 (which is our 'Geography' column).
remainder = 'passthrough': This is very important. It tells the ColumnTransformer to just let all the other columns
(CreditScore, Age, etc.) pass through without any changes.
X = np.array(ct.fit_transform(X)): We apply this transformation to our entire feature matrix X.
The 'Geography' column is replaced by three new columns at the very beginning of our dataset, representing France, Germany, and Spain.
'''

[[1.0 0.0 0.0 ... 1 1 101348.88]
 [0.0 0.0 1.0 ... 0 1 112542.58]
 [1.0 0.0 0.0 ... 1 0 113931.57]
 ...
 [1.0 0.0 0.0 ... 0 1 42085.58]
 [0.0 1.0 0.0 ... 1 0 92888.52]
 [1.0 0.0 0.0 ... 1 0 38190.78]]
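
The one-hot idea itself fits in a few lines of NumPy. This is only a sketch of the encoding on a made-up `geography` sample; the notebook itself relies on `OneHotEncoder` inside `ColumnTransformer`.

```python
import numpy as np

geography = np.array(["France", "Spain", "France", "Germany"])  # made-up sample

# Sorted distinct categories and, for each row, the index of its category.
categories, codes = np.unique(geography, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i.
one_hot = np.eye(len(categories))[codes.ravel()]

print(categories)
print(one_hot)
```

Each row contains exactly one 1, and no category is numerically 'greater' than another, which is the point of the technique. The column order (France, Germany, Spain) matches the three leading columns in the output above.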


In [None]:
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Printing the dimensions of each split to see the number of rows and columns in each of them
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)


'''
Think of it like preparing for an exam. The train set is your textbook and practice problems.
The test set is the final, unseen exam. Your grade on the final exam is the true measure of what you've learned.

X_train, X_test, y_train, y_test = train_test_split(...): This function takes our full dataset (X and y) and shuffles
it randomly before splitting it into four new sets:
X_train: The features we will use to teach our model.
y_train: The corresponding correct answers for X_train.
X_test: The features we will use to evaluate our model's performance on unseen data.
y_test: The corresponding correct answers for X_test, which we'll use to grade the model's predictions.
test_size = 0.2: This parameter tells the function to hold back 20% of the data for the test set.
 This means 80% will be used for training. This is a common and good starting point for the split ratio.

random_state = 0: This is for reproducibility. By setting a random_state, we ensure that every time we run this code,
 the data is shuffled and split in the exact same way. This is crucial for getting consistent results when we are developing and comparing models.
The print statements confirm the split.

We started with 10,000 customers. Now we have 8,000 for training ((8000, 12)) and 2,000 for testing ((2000, 12)),
just as we specified.

'''

(8000, 12) (2000, 12)
(8000,) (2000,)
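
The shuffle-then-split idea can be sketched with NumPy. The helper `simple_train_test_split` is a hypothetical illustration, not scikit-learn's implementation, so it will not pick the same rows as `train_test_split(random_state = 0)`; it only shows why a fixed seed makes the split repeatable.

```python
import numpy as np

def simple_train_test_split(X, y, test_size=0.2, seed=0):
    """Shuffle row indices with a seeded generator, then cut off the test share."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Placeholder data with the same shape as our feature matrix and target.
X_demo = np.arange(10_000 * 12).reshape(10_000, 12)
y_demo = np.arange(10_000) % 2

X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```

Calling the helper again with the same `seed` returns exactly the same rows, which is the behaviour `random_state` gives us in the real function.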



In [None]:
# Data Scaling/normalization of the features that will go to the NN
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)


'''
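Our features live on very different scales: Balance and EstimatedSalary run into the tens or hundreds of thousands, while Tenure and NumOfProducts are single digits.
For a neural network, this difference in scale is a problem: features with larger values can dominate the learning process, causing the model to train slowly or inefficiently.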

We solve this with Feature Scaling. The goal is to put all our features onto a similar scale, so they all contribute fairly.

sc = StandardScaler(): We create an instance of the StandardScaler.
This particular scaler transforms the data so that each feature has a mean of 0 and a standard deviation of 1.
X_train = sc.fit_transform(X_train): This is a two-step process on our training data.

fit: The scaler calculates the mean and standard deviation for each column in X_train.
It learns the scale of our training data.

transform: It then uses these calculated values to scale the training data.


X_test = sc.transform(X_test): This is a very important distinction. For the test set, we only use transform.
Why not fit_transform again? Because we must use the same scaling parameters (the same mean and standard deviation) that we learned from the training set. The test set represents new, unseen data, and we must process it in the exact same way we processed our training data. We are pretending we don't know the distribution of the test set, just like we wouldn't in the real world.


'''

In [None]:
X_train

array([[-1.01460667, -0.5698444 ,  1.74309049, ...,  0.64259497,
        -1.03227043,  1.10643166],
       [-1.01460667,  1.75486502, -0.57369368, ...,  0.64259497,
         0.9687384 , -0.74866447],
       [ 0.98560362, -0.5698444 , -0.57369368, ...,  0.64259497,
        -1.03227043,  1.48533467],
       ...,
       [ 0.98560362, -0.5698444 , -0.57369368, ...,  0.64259497,
        -1.03227043,  1.41231994],
       [-1.01460667, -0.5698444 ,  1.74309049, ...,  0.64259497,
         0.9687384 ,  0.84432121],
       [-1.01460667,  1.75486502, -0.57369368, ...,  0.64259497,
        -1.03227043,  0.32472465]])
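
The fit-on-train, transform-both rule can be reproduced by hand. This is a minimal sketch on made-up data (two columns on very different scales); the arrays `train` and `test` are illustrative and unrelated to the bank dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up features: a small integer column and a large monetary column.
train = np.column_stack([rng.integers(0, 11, 800),
                         rng.uniform(0, 250_000, 800)]).astype(float)
test = np.column_stack([rng.integers(0, 11, 200),
                        rng.uniform(0, 250_000, 200)]).astype(float)

# "fit": learn the mean and standard deviation from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)   # population standard deviation (ddof=0)

# "transform": apply the training statistics to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(3), test_scaled.std(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1 exactly. The test columns are only close to those values, which is expected: they were scaled with statistics they did not contribute to.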

In [None]:
#----------------------- Building the model -----------------------#

# Initializing the ANN by calling the Sequential class from Keras in TensorFlow
ann = tf.keras.models.Sequential()


'''
Think of this first line as laying the foundation for a building.
We're creating an empty container that we will add layers to, one after the other.

ann = tf.keras.models.Sequential():
tf.keras is the user-friendly API within TensorFlow that we use to build models.

A Sequential model is the simplest kind of Keras model. It simply means we are building our network layer by layer, in a linear stack.
We will add the input layer, then a hidden layer, then another hidden layer, and finally the output layer.
We are assigning this empty model object to the variable ann (short for Artificial Neural Network).

'''

In [None]:
ann

<Sequential name=sequential, built=False>

In [None]:
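# Adding the first "fully connected" HIDDEN layer to the Sequential ANN by calling Dense class
# The input layer is not added explicitly; Keras infers the input size (12 features) the first time the model sees data
# Number of Units = 6 and Activation Function = ReLU (rectifier)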
ann.add(tf.keras.layers.Dense(units = 6, activation = 'relu'))

'''
Now we add our first 'floor' of neurons to our empty model. This will be our first hidden layer.
ann.add(...): This is how we add a new layer to our Sequential model.

tf.keras.layers.Dense(...): A Dense layer is the most basic and common type of layer.
'Dense' simply means that every neuron in this layer is connected to every neuron in the previous layer.

Let's look at the two key parameters we've set:
units = 6: This defines the number of neurons in this layer. We've chosen 6.
Why 6? The number of neurons in a hidden layer is a hyperparameter.

There's no single perfect answer; it's often found through experimentation.
A common rule of thumb is to choose a number somewhere between the number of input features (we have 12) and
the number of output neurons (we'll have 1). Starting with 6 is a reasonable choice.
activation = 'relu': This is the activation function for the neurons in this layer.

relu stands for Rectified Linear Unit. It's the most popular activation function for hidden layers.
What it does: It's a very simple function. If the input to the neuron is negative, it outputs 0.
If the input is positive, it outputs the input value itself.

Why use it? It's computationally efficient and helps the network learn complex patterns
while being less prone to the vanishing-gradient problem that affects older activation functions such as sigmoid and tanh in hidden layers.

'''
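
What a Dense layer with ReLU computes can be written out in NumPy. This is a sketch with random, untrained weights; the names `W`, `b`, and `hidden` are illustrative, and the real layer manages its own weights internally.

```python
import numpy as np

def relu(z):
    """Return 0 for negative inputs and the input itself otherwise."""
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 12))              # 4 customers, 12 scaled features
W = rng.normal(scale=0.1, size=(12, 6))   # one weight per (feature, neuron) pair
b = np.zeros(6)                           # one bias per neuron

# Dense layer: weighted sum of all inputs for each neuron, then the activation.
hidden = relu(x @ W + b)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))
print(hidden.shape)
```

Every one of the 6 neurons receives all 12 inputs, which is what 'fully connected' means, and the layer turns each customer's 12 features into 6 non-negative values.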


In [None]:
#----------------------------------------------------------------------------------
# Adding the second "fully connected" HIDDEN layer to the Sequential ANN by calling Dense class
#----------------------------------------------------------------------------------
# Number of Units = 6 and Activation Function = Rectifier
ann.add(tf.keras.layers.Dense(units = 6, activation = 'relu'))


'''
We're now adding another floor to our building: a second hidden layer. This is what makes our network 'deeper'.
The code is identical to the previous step, and for good reason. We are adding another layer with the same structure.
ann.add(tf.keras.layers.Dense(units = 6, activation = 'relu')):

Again, we use a Dense layer, meaning every neuron in this new layer will be connected to all 6 neurons from the first hidden layer.
We are again choosing 6 units and the relu activation function. Keeping the architecture symmetrical like this (6 -> 6)
 is a common practice, though not a strict rule.

Why add a second layer?

By adding more layers, we allow the network to learn more complex and abstract patterns from the data. The first layer
might learn simple relationships between features, and the second layer can then learn patterns from the outputs of the first layer.
This hierarchical learning is what gives deep learning its power.
Our network structure so far is: Input Layer -> Hidden Layer 1 (6 neurons) -> Hidden Layer 2 (6 neurons).
'''


In [None]:
#----------------------------------------------------------------------------------
# Adding "fully connected" OUTPUT layer to the Sequential ANN by calling Dense class
#----------------------------------------------------------------------------------
# Number of Units = 1 and Activation Function = Sigmoid
ann.add(tf.keras.layers.Dense(units = 1, activation = 'sigmoid'))


'''
We've built the processing floors of our network; now we need the top floor: the output layer.
This layer is responsible for producing the final prediction.

The structure of this layer is very specific to the problem we're trying to solve.

units = 1: We set the number of neurons to 1.

Why one? Because we are solving a binary classification problem.
 The final output we want is a single number: the probability that the customer will exit.
 A single neuron is all we need to output that one number.

activation = 'sigmoid': This is a different activation function, and its choice is deliberate.

What it does: The sigmoid function is a special S-shaped curve that squashes any input value into a range between 0 and 1.

Why use it here? This is perfect for our output. A value of 0.8 can be interpreted as an 80% probability of churn.
A value of 0.1 means a 10% probability.
It's the ideal activation function for binary classification because it directly gives us the probability we need
to make our final decision.

So, with this step, our network architecture is complete. It looks like this:

Input Layer -> Hidden Layer 1 (6 relu neurons) -> Hidden Layer 2 (6 relu neurons) -> Output Layer (1 sigmoid neuron)

'''
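
The complete 12 -> 6 -> 6 -> 1 architecture can be traced end to end in NumPy. This is a sketch with random, untrained weights, so the printed probabilities are meaningless as predictions; it only shows how data flows through the layers and how many trainable parameters the network has. The names `params` and `forward` are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

layer_sizes = [12, 6, 6, 1]   # inputs -> hidden 1 -> hidden 2 -> output
rng = np.random.default_rng(0)

# Random (untrained) weights and zero biases for each layer.
params = [(rng.normal(scale=0.5, size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x):
    for W, b in params[:-1]:
        x = relu(x @ W + b)            # hidden layers use ReLU
    W_out, b_out = params[-1]
    return sigmoid(x @ W_out + b_out)  # output is a value between 0 and 1

x = rng.normal(size=(3, 12))           # 3 made-up, already-scaled customers
probs = forward(x)
n_params = sum(W.size + b.size for W, b in params)

print(probs.ravel())
print(n_params)   # (12*6 + 6) + (6*6 + 6) + (6*1 + 1) = 127
```

Training is the process of adjusting those 127 numbers so that the output probabilities match the observed 'Exited' labels as closely as possible.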


In [None]:
#----------------------- Training the model -----------------------#
# Compiling the ANN
# Optimizer = Adam, loss function = binary cross-entropy (for a binary dependent variable); accuracy is only reported as a metric, it is not what gets optimized
ann.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])


'''
We've designed our neural network's architecture, but it doesn't know how to learn yet.

The compile step is where we give it the tools and instructions for the learning process. We need to define three key things:

optimizer = 'adam': The optimizer is the engine that drives the learning.

What it does: During training, the optimizer's job is to adjust the network's weights in a way that minimizes the loss.
Adam is a variant of stochastic gradient descent that adapts the step size for each weight.

Why 'adam'? The Adam optimizer is a highly effective, popular, and robust choice.
It's efficient and generally works well across a wide range of problems, making it an excellent default choice.

loss = 'binary_crossentropy': The loss function is how we measure the model's error.

What it does: For each prediction, the loss function calculates a 'penalty' or 'cost' based on how far
the prediction was from the actual true value. The goal of training is to make the total loss as low as possible.

Why 'binary_crossentropy'? This is the standard loss function for a binary (two-class)
classification problem when the output of the model is a probability from a sigmoid function.

It penalizes the model heavily for being confident and wrong.

metrics = ['accuracy']: Metrics are used to monitor the training and testing steps.


What it does: While the loss function is what the optimizer tries to minimize, it's not always easy for humans to interpret.
Accuracy, on the other hand, is very intuitive: "What percentage of predictions are correct?"
Why use it? We tell the model that during training, in addition to calculating the loss, we also want it to keep track of the accuracy. This helps us gauge the model's performance in a way that's easy to understand.
In summary, we've just told our model: 'Your goal is to minimize the binary_crossentropy loss. The tool you will use to do this is the adam optimizer. And along the way, please report back on your accuracy so we can see how you're doing.'

'''
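
The claim that binary cross-entropy punishes confident mistakes can be checked numerically. This is a hand-written sketch of the formula, not the Keras implementation; the clipping constant `eps` is an assumption added here to avoid taking the logarithm of zero.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    p = np.clip(y_prob, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_true = np.array([1.0])   # the customer really did exit

loss_confident_right = binary_crossentropy(y_true, np.array([0.9]))
loss_unsure_right = binary_crossentropy(y_true, np.array([0.6]))
loss_confident_wrong = binary_crossentropy(y_true, np.array([0.01]))

print(loss_confident_right, loss_unsure_right, loss_confident_wrong)
```

The three losses come out at roughly 0.11, 0.51, and 4.61: predicting 1% for a customer who actually left costs about forty times more than predicting 90%.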


In [None]:
# Training the ANN on the training set
# batch_size = 32 (the Keras default), number of epochs = 100
ann.fit(X_train, y_train, batch_size = 32, epochs = 100)

'''
Alright class, our model is designed and compiled.
It's time to start the learning process.
This is where we show the model our training data and let it figure out the patterns. This is done with the .fit() method.


Let's break down the arguments:
X_train, y_train: This is the most important part.
 We are providing the model with our training features (X_train) and the corresponding correct answers (y_train).
 The model will look at X_train, make a prediction, compare it to y_train, calculate the loss,
 and then use the optimizer to update its internal weights to do better next time.


batch_size = 32: The model doesn't look at all 8,000 training samples at once.
That would be computationally expensive. Instead, it looks at them in small groups or 'batches'.
Here, it takes a batch of 32 customers, makes predictions, calculates the average loss for that batch,
and updates its weights. Then it takes the next 32, and so on, until it has gone through all 8,000 training samples.
That is 8,000 / 32 = 250 batches per epoch, which is the 250/250 shown in the training log.
32 is a common and efficient default value.


epochs = 100: An epoch is one full pass through the entire training dataset.
We've told our model to go through all 8,000 training samples (in batches of 32) a total of 100 times.
Why multiple epochs? One pass is not enough for the network to learn the complex patterns.
By repeatedly seeing the data, the optimizer can gradually fine-tune the weights, getting closer and closer to the best solution.
As you can see from the output, for each of the 100 epochs, the model reports its progress.
You will generally see the loss go down and the accuracy go up as the training progresses,
which tells us that the model is learning successfully."


'''

Epoch 1/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.7629 - loss: 0.5751
Epoch 2/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7941 - loss: 0.4592
Epoch 3/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7997 - loss: 0.4435
Epoch 4/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.8118 - loss: 0.4271
Epoch 5/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.8135 - loss: 0.4226
Epoch 6/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.8256 - loss: 0.4182
Epoch 7/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.8251 - loss: 0.4115
Epoch 8/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.8284 - loss: 0.4040
Epoch 9/100
250/250 ...

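The batch and epoch arithmetic described above can be checked with a few lines of plain Python. This is only a sketch: the 8,000-sample figure comes from the explanation, and the variable names are illustrative rather than part of the notebook.

```python
import math

n_train = 8000     # training samples, as stated in the explanation
batch_size = 32
epochs = 100

# One weight update per batch; the last batch may be smaller if the division is not exact
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch)  # 250
print(total_updates)    # 25000
```

The 250 steps per epoch match the `250/250` counter in the training log.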

In [None]:
#----------------------- Evaluating the Model ---------------------#
# Goal: use the trained ANN to predict the probability that a customer leaves the bank
# Predicting the churn probability for a single observation

# Geography: France
# Credit Score: 600
# Gender: Male
# Age: 40 years old
# Tenure: 3 years
# Balance: $60,000
# Number of Products: 2
# Has a credit card: yes
# Active member: yes
# Estimated Salary: $50,000

print(ann.predict(sc.transform([[1, 0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]])))
print(ann.predict(sc.transform([[1, 0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]])) > 0.5)
# This customer has about a 4% chance of leaving the bank


'''
"Now that our model is trained, let's use it for its intended purpose:
 making a prediction on a new, single customer. This shows us how the bank could use this model in a real-world scenario.

We've defined a hypothetical customer in the comments.
Our first step is to translate these characteristics into the numerical array format the model understands.
This includes the one-hot encoding for 'France' (1, 0, 0) and the label encoding for 'Male' (1).
Look at the first print statement:

print(ann.predict(sc.transform([[...]])))
First, we must pass this new customer's data through our scaler using sc.transform().
Every new piece of data must go through the exact same preprocessing steps as our training data.
Then, we feed this scaled data into ann.predict().

The output, [[0.0399...]], is the raw probability from our sigmoid output neuron.
This means our model predicts there is roughly a 4% chance this specific customer will leave the bank.
(The exact value can differ slightly between training runs, because the weights are initialized randomly.)
Now for the second print statement:

print(ann.predict(sc.transform([[...]])) > 0.5)
Often, we don't just want a probability; we want a definite 'Yes' or 'No' decision.
We set a threshold, typically 50% (or 0.5). If the predicted probability is greater than 0.5,
we classify the customer as 'Exited'. If not, they are classified as 'Stays'.
Since 4% is much less than 50%, the expression evaluates to False.
The model's final verdict is that this customer will not leave the bank.
This is a practical example of how to turn the model's output into an actionable business insight."

'''

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 98ms/step
[[0.03989722]]
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 36ms/step
[[False]]
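Translating a customer's characteristics into the numerical row by hand is error-prone, so a small helper can make the mapping explicit. The sketch below is hypothetical: `encode_customer`, the column order, and the codes for Germany, Spain and Female are assumptions inferred from the single-customer example (only France = (1, 0, 0) and Male = 1 appear in the notebook), so they should be checked against the actual preprocessing code.

```python
# Hypothetical helper: column order and category codes are assumptions
# inferred from the single-customer example, not taken from the preprocessing code.
GEOGRAPHY_ONE_HOT = {"France": [1, 0, 0], "Germany": [0, 1, 0], "Spain": [0, 0, 1]}
GENDER_CODE = {"Female": 0, "Male": 1}

def encode_customer(geography, credit_score, gender, age, tenure, balance,
                    num_products, has_card, is_active, salary):
    """Return one feature row in the same order as the example call."""
    return [
        *GEOGRAPHY_ONE_HOT[geography],
        credit_score, GENDER_CODE[gender], age, tenure, balance,
        num_products, int(has_card), int(is_active), salary,
    ]

row = encode_customer("France", 600, "Male", 40, 3, 60000, 2, True, True, 50000)
print(row)  # [1, 0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]
```

With the notebook's fitted scaler and model, `ann.predict(sc.transform([row]))` would then be equivalent to the call shown in the cell.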



In [None]:
# Predicted churn probabilities for every customer in the test set
y_pred_prob = ann.predict(X_test)


'''
"Predicting for one customer is useful, but to truly know how good our model is,
we need to evaluate it on the entire unseen test set. Remember, this is the data our model has never been exposed to before.

y_pred_prob = ann.predict(X_test):
Here, instead of passing in one customer, we're passing in the entire X_test dataset,
which contains the features for our 2,000 test customers.

The model will rapidly make a prediction for every single one of them.

The result, which we store in y_pred_prob, is an array of shape (2000, 1) holding one probability for each customer in the test set.

This gives us the raw probabilistic output for our entire test set, which we will use in the next steps
to calculate our final performance metrics."
'''

[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 2ms/step



In [None]:
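# Convert probabilities to binary predictions (True = exits, False = stays)
y_pred = (y_pred_prob > 0.5)
# Show predictions (first column) next to the true labels (second column)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), axis=1))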



'''
"We have our list of predicted probabilities, but to calculate things like accuracy,
we need a final binary prediction: 0 (Stays) or 1 (Exits).

y_pred = (y_pred_prob > 0.5): This is the same logic we used for the single customer,
but now we're applying it to our entire array of 2,000 predicted probabilities.
For every probability in y_pred_prob, this line checks if it is greater than 0.5.

The result, y_pred, is a new array of True and False values. (In Python, True is treated as 1 and False is treated as 0.)
This is our final set of predictions.

print(np.concatenate(...)): The second line is just for visualization.
It's a handy way to compare our model's predictions with the actual results side-by-side.

np.concatenate is a NumPy function that joins arrays together.
We are taking our predictions (y_pred) and the true answers (y_test) and joining them into a two-column array.
The .reshape(...) part is just to make sure both arrays are arranged as vertical columns before they are joined.
Because y_test holds integers, NumPy converts the True/False predictions to 1/0 in the combined array.
The output shows this comparison clearly. The first column is our model's prediction,
and the second column is the ground truth. For example, in the second row [0 1], our model predicted 0 (Stays),
but the customer actually exited (1). This was an incorrect prediction. The other rows shown are correct predictions."

'''

[[0 0]
 [0 1]
 [0 0]
 ...
 [0 0]
 [0 0]
 [0 0]]
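To see the thresholding and side-by-side comparison on a scale small enough to check by eye, here is a minimal sketch with made-up numbers. The probabilities and labels below are illustrative only and are not taken from the model; `np.column_stack` is used as a shorter alternative to the reshape-and-concatenate pattern.

```python
import numpy as np

# Made-up probabilities and true labels, for illustration only
probs = np.array([[0.04], [0.31], [0.72], [0.50]])
y_true = np.array([0, 1, 1, 0])

preds = probs > 0.5   # boolean array of shape (4, 1)

# Column 1: prediction (as 0/1), column 2: ground truth
side_by_side = np.column_stack((preds.astype(int), y_true))
print(side_by_side)
# [[0 0]
#  [0 1]
#  [1 1]
#  [0 0]]
```

Note that a probability of exactly 0.50 is classified as 0 (Stays), because the comparison is strictly greater than 0.5.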



In [None]:
# Confusion matrix and accuracy on the test set
from sklearn.metrics import confusion_matrix, accuracy_score

# Store the result as `cm`. Assigning it to the name `confusion_matrix` would overwrite
# the imported function, and re-running the cell would then raise
# "TypeError: 'numpy.ndarray' object is not callable".
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix", cm)
print("Accuracy Score", accuracy_score(y_test, y_pred))


'''
Let's start with the accuracy score:

print("Accuracy Score", accuracy_score(y_test, y_pred))

Accuracy is the simplest metric. It answers the question: "Out of all predictions, what percentage did the model get right?"

The result 0.865 means our model was correct on 86.5% of the 2,000 customers in the unseen test set.
This is a solid result, but keep in mind that 1,595 of the 2,000 test customers stayed,
so a model that always predicts 'stays' would already reach 79.75% accuracy.

Next, and more importantly, the Confusion Matrix:

print("Confusion Matrix", cm)

Accuracy can sometimes be misleading, especially if one class is much more common than the other.
A confusion matrix gives us a much richer understanding of our model's performance by breaking down the correct and incorrect predictions for each class.
Let's interpret the matrix: [[1508, 87], [183, 222]]

Top-Left (1508): True Negatives. These are the customers who did not leave, and our model correctly predicted they would not leave.
 This is the biggest group and a good result.
Top-Right (87): False Positives. These are customers who did not leave, but our model incorrectly predicted they would leave.
(The model was 'falsely positive' about them leaving).
Bottom-Left (183): False Negatives. This is often the most important number for business.
These are customers who did leave, but our model incorrectly predicted they would stay.
These are the churners we failed to identify.
Bottom-Right (222): True Positives. These are customers who did leave, and our model correctly predicted they would leave.
These are the successes where the bank could now intervene.
By combining the accuracy score with the detailed insights from the confusion matrix,
we get a complete picture of our model's strengths and weaknesses.
'''
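The four cells of the confusion matrix can be turned into summary metrics with a few lines of NumPy. The sketch below uses the matrix [[1508, 87], [183, 222]] quoted in the explanation. Precision and recall are not computed in the notebook itself; they are added here only to illustrate what the matrix reveals beyond accuracy.

```python
import numpy as np

# Matrix quoted in the explanation: rows = actual class, columns = predicted class
cm = np.array([[1508, 87],
               [183, 222]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that were correct
precision = tp / (tp + fp)        # of customers flagged as churners, how many really left
recall = tp / (tp + fn)           # of customers who really left, how many were flagged

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.865 precision=0.718 recall=0.548
```

The recall of about 0.55 shows that, despite the 86.5% accuracy, the model identifies only a little over half of the customers who actually left.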