<center>
  <h1 style="text-align: center; font-size: 48px;">Alphabet Soup</h1>
  <h2 style="text-align: center; font-size: 30px;">To <i>Fund</i> or Not to <i>Fund</i></h2>
Alphabet Soup, a nonprofit foundation, extensively assessed their historical venture funding data using a variety of machine learning algorithms to enhance the success rate of future ventures.
</center>

# Documentation of the Machine Learning Process

## 1. Preprocessing
Preprocessing refers to the steps and techniques applied to raw data before it is used for analysis or model training. Its purpose is to clean, transform, and prepare the data so that it is suitable for those tasks.

These steps come before the main analysis or modeling phase. They are a crucial first stage in the data science workflow because the quality and suitability of the data directly affect the accuracy and effectiveness of any subsequent analysis or model.

Preprocessing involves tasks like handling missing values, encoding categorical variables, scaling numerical features, removing outliers, and normalizing data. By performing these operations before feeding the data into a model, the data scientist ensures that the data is in a format that the algorithm can handle and that it is free from issues that could negatively affect the results.
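
As a minimal sketch of two of these tasks (handling missing values and limiting outliers), the snippet below works on a small made-up table. The column names `amount` and `org_type`, the median/placeholder imputation, and the 5th to 95th percentile clipping range are assumptions chosen for illustration, not steps taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative, not from charity_data.csv.
raw = pd.DataFrame({
    "amount": [5000.0, np.nan, 6692.0, 142590.0, 5000.0],
    "org_type": ["Trust", "Association", None, "Trust", "Association"],
})

clean = raw.copy()

# Handle missing values: median for the numeric column, a placeholder for the categorical one.
clean["amount"] = clean["amount"].fillna(clean["amount"].median())
clean["org_type"] = clean["org_type"].fillna("Unknown")

# Limit the influence of outliers by clipping to the 5th-95th percentile range.
lower, upper = clean["amount"].quantile([0.05, 0.95])
clean["amount"] = clean["amount"].clip(lower, upper)

print(clean)
```

Whether to impute or drop, and how aggressively to clip, depends on the data; the choices here are only one reasonable default.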

#### Import Necessary Dependencies (Libraries and Modules)

In [1]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import log_loss
import numpy as np
import pandas as pd
import tensorflow as tf



####  Import and read the charity_data.csv.

In [2]:
df_application = pd.read_csv("https://static.bc-edx.com/data/dl-1-2/m21/lms/starter/charity_data.csv")
df_application.head()

| | EIN | NAME | APPLICATION_TYPE | AFFILIATION | CLASSIFICATION | USE_CASE | ORGANIZATION | STATUS | INCOME_AMT | SPECIAL_CONSIDERATIONS | ASK_AMT | IS_SUCCESSFUL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10520599 | BLUE KNIGHTS MOTORCYCLE CLUB | T10 | Independent | C1000 | ProductDev | Association | 1 | 0 | N | 5000 | 1 |
| 1 | 10531628 | AMERICAN CHESAPEAKE CLUB CHARITABLE TR | T3 | Independent | C2000 | Preservation | Co-operative | 1 | 1-9999 | N | 108590 | 1 |
| 2 | 10547893 | ST CLOUD PROFESSIONAL FIREFIGHTERS | T5 | CompanySponsored | C3000 | ProductDev | Association | 1 | 0 | N | 5000 | 0 |
| 3 | 10553066 | SOUTHSIDE ATHLETIC ASSOCIATION | T3 | CompanySponsored | C2000 | Preservation | Trust | 1 | 10000-24999 | N | 6692 | 1 |
| 4 | 10556103 | GENETIC RESEARCH INSTITUTE OF THE DESERT | T3 | Independent | C1000 | Heathcare | Trust | 1 | 100000-499999 | N | 142590 | 1 |


In [3]:
# Count the values in USE_CASE to locate the observed typo ("Heathcare")
df_application["USE_CASE"].value_counts()

Preservation     28095
ProductDev        5671
CommunityServ      384
Heathcare          146
Other                3
Name: USE_CASE, dtype: int64

In [4]:
# Correct typo
df_application["USE_CASE"] = df_application["USE_CASE"].replace("Heathcare", "Healthcare")

In [5]:
# Drop the non-beneficial ID columns, 'EIN' and 'NAME'.
df_application_features = df_application.drop(columns = ['EIN', 'NAME'])

# View the new features dataframe
df_application_features.head()

| | APPLICATION_TYPE | AFFILIATION | CLASSIFICATION | USE_CASE | ORGANIZATION | STATUS | INCOME_AMT | SPECIAL_CONSIDERATIONS | ASK_AMT | IS_SUCCESSFUL |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T10 | Independent | C1000 | ProductDev | Association | 1 | 0 | N | 5000 | 1 |
| 1 | T3 | Independent | C2000 | Preservation | Co-operative | 1 | 1-9999 | N | 108590 | 1 |
| 2 | T5 | CompanySponsored | C3000 | ProductDev | Association | 1 | 0 | N | 5000 | 0 |
| 3 | T3 | CompanySponsored | C2000 | Preservation | Trust | 1 | 10000-24999 | N | 6692 | 1 |
| 4 | T3 | Independent | C1000 | Healthcare | Trust | 1 | 100000-499999 | N | 142590 | 1 |


In [6]:
# Determine the number of unique values in each column.
print(df_application_features.nunique())

APPLICATION_TYPE            17
AFFILIATION                  6
CLASSIFICATION              71
USE_CASE                     5
ORGANIZATION                 4
STATUS                       2
INCOME_AMT                   9
SPECIAL_CONSIDERATIONS       2
ASK_AMT                   8747
IS_SUCCESSFUL                2
dtype: int64


In [7]:
# Look at APPLICATION_TYPE value counts for binning
print(df_application_features['APPLICATION_TYPE'].value_counts())

T3     27037
T4      1542
T6      1216
T5      1173
T19     1065
T8       737
T7       725
T10      528
T9       156
T13       66
T12       27
T2        16
T25        3
T14        3
T29        2
T15        2
T17        1
Name: APPLICATION_TYPE, dtype: int64


In [8]:
# Choose a cutoff value and create a list of application types to be replaced
cutoff_value = 500

# use the variable name `application_types_to_replace`
# Compute the value counts once and reuse them for the filter
application_type_counts = df_application_features['APPLICATION_TYPE'].value_counts()
application_types_to_replace = list(
    application_type_counts[application_type_counts < cutoff_value].index)

# Replace all rare application types in a single call
df_application_features['APPLICATION_TYPE'] = df_application_features['APPLICATION_TYPE'].replace(
    application_types_to_replace, "Other")

# Check to make sure binning was successful
df_application_features['APPLICATION_TYPE'].value_counts()

T3       27037
T4        1542
T6        1216
T5        1173
T19       1065
T8         737
T7         725
T10        528
Other      276
Name: APPLICATION_TYPE, dtype: int64
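
The same binning pattern is applied to more than one column in this notebook, so it can be wrapped in a small helper. The sketch below is one possible way to do that; `bin_rare_categories` is a hypothetical function name introduced here for illustration, and the toy series stands in for a real column.

```python
import pandas as pd

def bin_rare_categories(series, cutoff, other_label="Other"):
    """Replace categories that occur fewer than `cutoff` times with `other_label`.

    Hypothetical helper for illustration; not part of pandas or of the original notebook.
    """
    counts = series.value_counts()
    rare = counts[counts < cutoff].index
    # Keep values that are not rare; replace the rest with the catch-all label.
    return series.where(~series.isin(rare), other_label)

# Toy column standing in for APPLICATION_TYPE.
app_type = pd.Series(["T3"] * 5 + ["T4"] * 3 + ["T9", "T2"])
binned = bin_rare_categories(app_type, cutoff=2)
print(binned.value_counts())
```

A helper like this keeps the cutoff explicit per column, which makes it easier to try several cutoffs when tuning.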

In [9]:
# Look at CLASSIFICATION value counts for binning
print(df_application_features['CLASSIFICATION'].value_counts())


In [10]:
# You may find it helpful to look at CLASSIFICATION value counts > 1
classification_counts = df_application_features['CLASSIFICATION'].value_counts()
classification_counts_filtered = classification_counts[classification_counts > 1]
print(classification_counts_filtered)

C1000    17326
C2000     6074
C1200     4837
C3000     1918
C2100     1883
C7000      777
C1700      287
C4000      194
C5000      116
C1270      114
C2700      104
C2800       95
C7100       75
C1300       58
C1280       50
C1230       36
C1400       34
C7200       32
C2300       32
C1240       30
C8000       20
C7120       18
C1500       16
C1800       15
C6000       15
C1250       14
C8200       11
C1238       10
C1278       10
C1235        9
C1237        9
C7210        7
C2400        6
C1720        6
C4100        6
C1257        5
C1600        5
C1260        3
C2710        3
C0           3
C3200        2
C1234        2
C1246        2
C1267        2
C1256        2
Name: CLASSIFICATION, dtype: int64


In [11]:
# Choose a cutoff value and create a list of classifications to be replaced
cutoff_value = 1000

# use the variable name `classifications_to_replace`
# Compute the value counts once and reuse them for the filter
classification_counts = df_application_features['CLASSIFICATION'].value_counts()
classifications_to_replace = list(classification_counts[classification_counts < cutoff_value].index)

# Replace all rare classifications in a single call
df_application_features['CLASSIFICATION'] = df_application_features['CLASSIFICATION'].replace(
    classifications_to_replace, "Other")

# Check to make sure binning was successful
df_application_features['CLASSIFICATION'].value_counts()

C1000    17326
C2000     6074
C1200     4837
Other     2261
C3000     1918
C2100     1883
Name: CLASSIFICATION, dtype: int64

---
## Feature Engineering/Transformation, Scaling, and Defining the Model
- Feature engineering is the process of transforming raw data into meaningful features that can be used to improve the performance of machine learning models. It involves extracting, selecting, and creating new features from the available data. Here are three basic examples of feature engineering:
    - One-Hot Encoding:
        - Example: Imagine you have a categorical feature like "Gender" with values 'Male' and 'Female'.
        - Feature Engineering: One-hot encode the "Gender" feature into two binary features, 'Is_Male' and 'Is_Female'. A '1' in the 'Is_Male' column indicates the sample is male, and a '1' in the 'Is_Female' column indicates the sample is female.
    - Feature Scaling:
        - Example: You have two features, "Age" (ranging from 0 to 100) and "Income" (ranging from $20,000 to $100,000).
        - Feature Engineering: Scale the "Age" and "Income" features to a similar range, such as 0 to 1. This keeps the feature with the larger numeric range from dominating model training simply because of its units.
    - Polynomial Features:
        - Example: You have a single feature "x" and want to fit a polynomial regression model.
        - Feature Engineering: Create additional polynomial features, such as "x^2", "x^3", etc., to capture the nonlinear relationship between the feature "x" and the target variable. This allows the model to learn more complex patterns.

- These are just a few basic examples of feature engineering. In practice, feature engineering can involve various techniques like handling missing values, creating interaction terms, binning, and much more. The goal is to transform the data in a way that captures important patterns and relationships, making it easier for machine learning algorithms to learn and make accurate predictions.
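
The three examples above can be sketched on a small made-up table. The data values are hypothetical, and note that `pd.get_dummies` names the encoded columns `Gender_Female` and `Gender_Male` rather than `Is_Female` and `Is_Male`.

```python
import pandas as pd

# Hypothetical toy data mirroring the examples above.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Age": [20, 60, 100],
    "Income": [20000, 60000, 100000],
})

# One-hot encoding: one binary column per category.
encoded = pd.get_dummies(df, columns=["Gender"], dtype=int)

# Min-max scaling: map each numeric column onto the range [0, 1].
for col in ["Age", "Income"]:
    col_min, col_max = encoded[col].min(), encoded[col].max()
    encoded[col] = (encoded[col] - col_min) / (col_max - col_min)

# Polynomial feature: add the square of an existing column.
encoded["Age_squared"] = encoded["Age"] ** 2

print(encoded)
```

After scaling, "Age" and "Income" both span 0 to 1, so neither dominates because of its original units.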

In [12]:
# Convert categorical data to numeric with `pd.get_dummies`
df_application_features = pd.get_dummies(df_application_features)

In [13]:
# Split our preprocessed data into our features and target arrays
X = df_application_features.drop(columns=['IS_SUCCESSFUL'])
y = df_application_features['IS_SUCCESSFUL']

# Split the preprocessed data into a training and testing dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [14]:
# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the StandardScaler
X_scaler = scaler.fit(X_train)

# Scale the data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

In [15]:
# Define the model - deep neural net, i.e., the number of input features and hidden nodes for each layer.
number_input_features = len(X_train_scaled[0])
hidden_nodes_layer1 = 80
hidden_nodes_layer2 = 30

nn = tf.keras.models.Sequential()

# First hidden layer
nn.add(tf.keras.layers.Dense(units=hidden_nodes_layer1, input_dim=number_input_features, activation='relu'))

# Second hidden layer
nn.add(tf.keras.layers.Dense(units=hidden_nodes_layer2, activation='relu'))

# Output layer
nn.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))

# Check the structure of the model
nn.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 80)                3520      
                                                                 
 dense_1 (Dense)             (None, 30)                2430      
                                                                 
 dense_2 (Dense)             (None, 1)                 31        
                                                                 
Total params: 5981 (23.36 KB)
Trainable params: 5981 (23.36 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


## Compile the Model
Compiling a neural network model means specifying the necessary components that define the model's learning process. During compilation, you need to define three key aspects:

Loss function: The loss function quantifies the difference between the predicted output and the actual target value. In the given code, 'binary_crossentropy' is used as the loss function. It indicates that the model is being trained for a binary classification problem. Binary cross-entropy is commonly used for binary classification tasks, and it measures the difference between the predicted probability distribution and the true binary labels.

Optimizer: The optimizer is responsible for adjusting the weights of the neural network during the training process to minimize the loss function. In this case, 'adam' is used as the optimizer. Adam (Adaptive Moment Estimation) is a popular optimization algorithm that combines the advantages of both AdaGrad and RMSprop. It efficiently adapts the learning rates for each parameter during training.

Metrics: Metrics are additional evaluation criteria used during training to monitor the model's performance. In the given code, ['accuracy'] is used as the metric, which means the model's accuracy will be tracked during training. Accuracy is a common metric for classification tasks, representing the proportion of correctly classified samples to the total number of samples.

After the model is compiled with these components, it is ready for training using the fit() method, where it will iterate through the data, compute the loss, update the weights using the optimizer, and monitor the specified metrics throughout the training process.
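
To make the loss function concrete, the sketch below computes binary cross-entropy by hand with NumPy. The function name `binary_cross_entropy`, the clipping constant `eps`, and the sample labels and probabilities are illustrative assumptions; Keras uses its own implementation internally.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities.

    Illustrative helper; `eps` guards against log(0) and is an assumed value.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return float(-np.mean(losses))

y_true = [1, 0, 1, 1]

# Predictions that lean toward the correct label give a low loss.
confident = binary_cross_entropy(y_true, [0.9, 0.1, 0.8, 0.7])

# Always predicting 0.5 gives ln(2), a useful reference point for an uninformed model.
uninformed = binary_cross_entropy(y_true, [0.5, 0.5, 0.5, 0.5])

print(confident, uninformed)
```

The ln(2) reference point (about 0.693) gives a rough scale for judging a reported loss: values well below it indicate the model has learned something beyond a coin flip.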

In [16]:
nn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

---
## FeedForward Neural Network Model:
### Train and Evaluate

In [17]:
# Train the model
nn.fit(X_train_scaled, y_train, epochs=100, verbose=1)

Epoch 1/100
...
Epoch 78/100
(per-epoch metrics were not captured; output truncated)

<keras.src.callbacks.History at 0x7faf0c1bbdf0>

In [18]:
# Evaluate the model using the test data
model_loss, model_accuracy = nn.evaluate(X_test_scaled,y_test,verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

268/268 - 1s - loss: 0.5579 - accuracy: 0.7305 - 772ms/epoch - 3ms/step
Loss: 0.5579472780227661, Accuracy: 0.7304956316947937


---
## Results (FeedForward Neural Network)
- Correctly predicts whether a venture funded by Alphabet Soup will be successful `73.05%` of the time on the test set.
- With this architecture and 100 training epochs, the model reaches a binary cross-entropy loss of approximately `0.5579` on the test set, which indicates a moderate level of performance on the binary classification task.
    - The binary cross-entropy loss measures the dissimilarity between the true labels and the predicted probabilities generated by the model. For reference, always predicting a probability of 0.5 gives a loss of ln 2 ≈ 0.693, so a loss of `0.5579` means the model has learned useful signal, but its predicted probabilities are still, on average, fairly far from the true labels.

In [19]:
# Export our model to HDF5 file
nn.save("trained_charity.h5")



---
## Optimizing the Model
### Changing Hyperparameters Automatically:
- `# of Epochs` (iterations) (10, 20, 40, 80)
- `Optimizers` (Adam, SGD, RMSprop, Adagrad, Adadelta, Nadam)
- `Activation functions` (ReLU, Sigmoid, Tanh, Leaky ReLU, ELU, Swish, Softmax)
- `Loss functions` (Binary Cross-Entropy (Log Loss), Hinge Loss, Categorical Cross-Entropy, Sparse Categorical Cross-Entropy, Kullback-Leibler (KL) Divergence, Mean Squared Error (MSE), Mean Absolute Error (MAE), Huber Loss, Pairwise Ranking Loss)
- `# of Nodes` (10, 20, 40)
- `# of Hidden Layers` (1, 2, 4)
- `Types of Classifiers` (MLPClassifier, SVC, RandomForestClassifier)
- `Training-Test split ratio` (80%-20%, 75%-25%, 70%-30%)
- `# of cross-validation folds` (cv = 2, 3, 4, 5)
- `Initial Weights`
- `Initial Biases`
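
To give a sense of how quickly such a search grows, the sketch below enumerates every combination of three of the listed hyperparameters with `itertools.product`. The dictionary keys are illustrative names, and only the candidate values are taken from the list above.

```python
from itertools import product

# Candidate values taken from the list above; the dictionary keys are illustrative names.
search_space = {
    "epochs": [10, 20, 40, 80],
    "nodes": [10, 20, 40],
    "hidden_layers": [1, 2, 4],
}

# Build one dictionary per combination of candidate values.
combinations = [
    dict(zip(search_space, values))
    for values in product(*search_space.values())
]

print(len(combinations))
print(combinations[0])
```

Each additional hyperparameter multiplies the number of models to train, which is why an exhaustive grid is usually restricted to a few parameters at a time.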

---
## MLPClassifier Model:
### Compile, Train and Evaluate

In [20]:
# Create the neural network classifier
nn = MLPClassifier(max_iter=80)

# Define the parameter grid for hyperparameter search
param_grid = {
    'hidden_layer_sizes': [(80,), (100, 50), (50, 30, 10)],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate_init': [0.001, 0.01, 0.1],
}

# Create the GridSearchCV object
grid_search = GridSearchCV(nn, param_grid, cv=2)

# Perform the hyperparameter search
grid_search.fit(X_train_scaled, y_train)

# Get the best hyperparameters and the corresponding model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Get the best accuracy
best_accuracy = grid_search.best_score_

# Evaluate the best model on the test set
X_test_scaled = scaler.transform(X_test)
accuracy = best_model.score(X_test_scaled, y_test)

# Get the log loss on the test set
y_predicted_probabilities = best_model.predict_proba(X_test_scaled)
logloss = log_loss(y_test, y_predicted_probabilities)

print("Best hyperparameters:", best_params)
print("Best accuracy:", best_accuracy)
print("Test set accuracy:", accuracy)
print("Test set log loss:", logloss)



Best hyperparameters: {'alpha': 0.01, 'hidden_layer_sizes': (80,), 'learning_rate_init': 0.001}
Best accuracy: 0.7282304462758513
Test set accuracy: 0.7300291545189505
Test set log loss: 0.5538679984964919




---
## Results (MLPClassifier Neural Network)
- Correctly predicts whether a venture funded by Alphabet Soup will be successful `73.00%` of the time on the test set.
- With the best hyperparameters found by the grid search (a single hidden layer of 80 nodes, `alpha=0.01`, `learning_rate_init=0.001`) and at most 80 training iterations (`max_iter=80`), the model reaches a binary cross-entropy (log) loss of approximately `0.5539` on the test set, which indicates a moderate level of performance on the binary classification task.
    - The binary cross-entropy loss measures the dissimilarity between the true labels and the predicted probabilities generated by the model. For reference, always predicting a probability of 0.5 gives a loss of ln 2 ≈ 0.693, so a loss of `0.5539` means the model has learned useful signal, but its predicted probabilities are still, on average, fairly far from the true labels.

---

# Reference Notes on Optimizing Hyperparameters:
- Feature engineering
    - To use feature engineering to improve the accuracy of the model, you can consider the following techniques and transformations for the given dataset:

1. **Handling Missing Values**: Check for missing values in the dataset and decide on the appropriate strategy to handle them. You can either impute missing values (e.g., using mean, median, or mode) or drop rows/columns with missing values based on the percentage of missing data and the impact on the overall dataset size.

2. **Feature Scaling**: Apply feature scaling to numerical features, especially if they have different scales. Common scaling methods include Standardization (scaling to mean=0, std=1) or Min-Max scaling (scaling to a specific range, e.g., [0, 1]).

3. **Encoding Categorical Variables**: If you have categorical features, encode them properly for the model to understand. You can use techniques like one-hot encoding or label encoding, depending on the nature of the categorical variable.

4. **Creating Interaction Terms**: If certain features have strong interactions, consider creating new features by combining or multiplying them. For example, if you have 'Length' and 'Width,' you can create a new feature 'Area' by multiplying them.

5. **Binning/Rounding**: For continuous numerical features, you can group them into bins or round them to reduce the impact of outliers and noise.

6. **Log Transform**: If some features have a skewed distribution, applying a log transformation may help to make the distribution more Gaussian-like.

7. **Feature Extraction from Text or Images**: If you have text or image data, consider extracting relevant features from them using techniques like TF-IDF for text or pre-trained CNN models for images.

As for which columns to remove, it depends on the data and your specific problem. Here are some guidelines:

1. **Remove Constant or Near-Constant Columns**: If a feature has very low variance or is nearly constant across all samples, it may not add any value to the model.

2. **Highly Correlated Features**: If you have highly correlated features, it might be redundant to keep all of them. You can remove one of the highly correlated features.

3. **Irrelevant Features**: If certain features have no logical connection to the target variable or are known to have no impact on the outcome, they can be removed.

4. **Domain Knowledge**: Use your domain knowledge to identify features that are unlikely to contribute to the model's predictive power.

5. **Based on Feature Importance**: If you have already trained a model, you can use the feature importance scores to identify less relevant features and consider removing them.

Remember, feature engineering is an iterative process, and it's essential to evaluate the impact of each transformation on the model's performance using validation techniques like cross-validation. Removing features should be done carefully, as you don't want to remove potentially useful information that might help the model.
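
A few of these ideas (log transform, dropping constant columns, flagging highly correlated features) can be sketched as follows. The toy columns, the derived `ask_amt_thousands` column, and the 0.95 correlation threshold are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative.
df = pd.DataFrame({
    "ask_amt": [5000, 5000, 6692, 108590, 142590, 5000],
    "ask_amt_thousands": [5.0, 5.0, 6.692, 108.59, 142.59, 5.0],
    "status": [1, 1, 1, 1, 1, 1],
})

# Log transform: log1p compresses the long right tail of a skewed feature.
df["ask_amt_log"] = np.log1p(df["ask_amt"])

# Drop constant columns (a single unique value carries no information).
constant_cols = [col for col in df.columns if df[col].nunique() <= 1]
df = df.drop(columns=constant_cols)

# Flag features whose absolute correlation with an earlier column exceeds a threshold.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]

print(constant_cols, redundant)
```
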
- Reducing number of columns to reduce noise (how to determine which columns to remove?):
    - When trying to reduce the number of columns in your dataset, especially in the context of supervised learning, there are several approaches you can consider. Some common techniques include:

1. Feature Importance from Tree-based models: Decision tree-based models, like Random Forest and Gradient Boosting Machines (GBM), provide a feature importance score. These scores can give you an idea of which features are contributing the most to the model's predictive performance. You can use this information to identify and keep the most important features while discarding the less important ones.

2. Univariate Feature Selection: This method involves selecting features based on univariate statistical tests like chi-square for categorical features or ANOVA for numerical features. The idea is to select features that have a significant impact on the target variable.

3. Recursive Feature Elimination (RFE): RFE is an iterative method where you train the model, rank the features by importance, and then remove the least important feature. The process is repeated until you reach the desired number of features or performance no longer improves.

4. L1 Regularization (Lasso): L1 regularization can be used with linear models (e.g., Logistic Regression) to penalize the absolute magnitude of the coefficients. This often leads to sparse solutions, effectively setting some coefficients to zero, which corresponds to feature selection.

5. PCA (Principal Component Analysis): PCA is a dimensionality reduction technique that aims to transform the original features into a new set of uncorrelated features (principal components). However, PCA is unsupervised, and it may not always preserve the predictive power of the features for the target variable.

6. SelectFromModel: This is a method in scikit-learn that allows you to use the feature importance from a model to automatically select the most important features.

When dealing with supervised learning, it's essential to use techniques that take into account the relationship between features and the target variable. Using PCA alone may not be the best choice as it is an unsupervised method and doesn't consider the target variable's predictive power.

It's worth noting that the choice of feature selection method depends on your dataset, the models you are using, and the nature of the problem you are trying to solve. It's often a good idea to try different techniques and compare their impact on the model's performance using cross-validation or a validation set. This way, you can determine the best approach to feature selection for your specific problem.
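
As one concrete instance of the techniques above, the sketch below combines tree-based feature importance with scikit-learn's `SelectFromModel`. It runs on synthetic data from `make_classification` rather than the charity dataset, and the `"mean"` importance threshold is an assumed setting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 20 features, only 4 of which are informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=4,
                           n_redundant=0, random_state=1)

# Fit a random forest and keep features whose importance is at least the mean importance.
forest = RandomForestClassifier(n_estimators=100, random_state=1)
selector = SelectFromModel(forest, threshold="mean")
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)
```

In a real workflow the selector would be fitted on the training split only, and the reduced feature set compared against the full set with cross-validation.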


---

Determining the theoretical limit on accuracy and loss for a machine learning model, including a neural network, is not straightforward and can be influenced by several factors. It is important to keep in mind that the theoretical limit is not necessarily achievable in practice due to various real-world constraints. However, we can discuss some factors that can influence the upper and lower bounds on accuracy and loss:

1. **Data Quality and Quantity**: The quality and quantity of your training data play a significant role. The more high-quality, diverse, and representative data you have, the better your model's performance can potentially be. Having more data generally allows the model to learn more patterns and generalize better.

2. **Model Complexity**: The complexity of your neural network architecture affects the theoretical limit. More complex models can learn intricate relationships in the data but may be prone to overfitting, especially with limited data.

3. **Features and Feature Engineering**: The choice and quality of features can influence model performance. Proper feature engineering can enhance the model's ability to capture relevant patterns in the data.

4. **Hyperparameter Tuning**: Properly tuning hyperparameters can significantly impact model performance. The theoretical limit may vary depending on the specific hyperparameter values.

5. **Noise and Irreducible Error**: In real-world datasets, there can be noise and unpredictability that are inherent in the data, leading to an irreducible error. This means that no model, no matter how complex, can achieve perfect accuracy or zero loss on such data.

6. **Class Imbalance**: If your dataset has severe class imbalance, the theoretical limit for accuracy can be influenced. A high proportion of one class might make it more challenging for the model to learn the minority class effectively.

7. **Data Representation and Preprocessing**: How you preprocess and represent your data (e.g., normalization, scaling, one-hot encoding) can impact model performance.

8. **Overfitting and Underfitting**: Overfitting occurs when the model becomes too complex and memorizes noise in the training data, leading to poor generalization. Underfitting happens when the model is too simple to capture the underlying patterns in the data.

9. **Choice of Loss Function**: The choice of loss function (e.g., cross-entropy, mean squared error) depends on the problem and can influence the theoretical limit on loss.

10. **Limited Model Capacity**: The capacity of your neural network (i.e., the number of parameters) can limit its ability to represent complex relationships in the data.

11. **Architecture Constraints**: Certain architectural constraints, such as the depth and width of the network, may affect the theoretical limit.

It's essential to understand that the theoretical limit is often hard to quantify precisely. The best approach is to experiment with different models, hyperparameters, and data preprocessing techniques to reach the best possible performance. Remember that in practice, the performance of a model is evaluated based on its generalization to unseen data and its ability to solve real-world problems rather than achieving a specific theoretical limit.
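
The idea of irreducible error can be illustrated with a small simulation. In the sketch below, labels follow a hidden binary signal but are flipped at random with an assumed probability of 0.2, so even a classifier that knows the signal perfectly cannot exceed roughly 80% accuracy. All values are made up for illustration and say nothing about the actual noise level in the charity data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000
noise_rate = 0.2  # assumed fraction of labels flipped by pure chance

# Hidden binary signal, and observed labels that are randomly flipped.
x = rng.integers(0, 2, size=n_samples)
flipped = rng.random(n_samples) < noise_rate
y = np.where(flipped, 1 - x, x)

# The best possible classifier predicts the underlying signal,
# yet it is still wrong wherever the label was flipped.
best_accuracy = float(np.mean(x == y))
print(round(best_accuracy, 3))
```

When several different model families plateau at a similar accuracy, as they do in this notebook, label noise of this kind is one possible explanation alongside missing features.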

## Automating Optimization Algorithms

In [21]:
# Define the classifiers you want to test
classifiers = [
    ('MLP', MLPClassifier()),
    ('SVC', SVC()),
    ('RandomForest', RandomForestClassifier())
]

# Create separate parameter grids for each classifier
param_grids = {
    'MLP': {
        'hidden_layer_sizes': [(80,), (100, 50), (50, 30, 10)],
        'alpha': [0.0001, 0.001, 0.01],
        'learning_rate_init': [0.001, 0.01, 0.1],
        'max_iter': [40, 80]
    },
    'SVC': {
        'kernel': ['linear', 'rbf', 'poly'],
        'C': [0.1, 1, 10],
        'gamma': ['scale', 'auto']
    },
    'RandomForest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5, 10]
    }
}

# Create an empty dictionary to store the best models for each classifier
best_models = {}

# Loop through each classifier and perform the hyperparameter search
for name, clf in classifiers:
    # Create the pipeline with the current classifier
    pipeline = Pipeline([
        ('scaler', scaler),
        ('classifier', clf)
    ])

    # Get the corresponding parameter grid for the current classifier
    param_grid = {'classifier__' + key: value for key, value in param_grids[name].items()}

    # Create the GridSearchCV object
    grid_search = GridSearchCV(pipeline, param_grid, cv=2)

    # Perform the hyperparameter search
    grid_search.fit(X_train, y_train)

    # Get the best model and store it in the dictionary
    best_models[name] = grid_search.best_estimator_

# Evaluate the best models on the test set
for name, model in best_models.items():
    accuracy = model.score(X_test, y_test)
    print(f"Best accuracy for {name}:", accuracy)


