<a href="https://colab.research.google.com/github/theidari/alphabet_soup/blob/main/src/AlphabetSoupCharity_Optimization_Name.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color="#880808"><h1><b>Alphabet Soup Charity Optimization</b></h1></font>
<p align="justify"><font color="#0A0888">
We explored a range of preprocessing and architecture choices: dropping unnecessary columns, binning rare values in categorical columns, adjusting the bin cutoffs, widening the hidden layers with more neurons, adding more hidden layers, and tuning the number of epochs and the hidden-layer activation function with Keras Tuner. Despite these changes, the model's accuracy increased only slightly, to 72.9%. See the <a href="https://github.com/theidari/alphabet_soup/blob/main/src/AlphabetSoupCharity_Optimization.ipynb"><font color="#FF5733">parameter selection</font></a> notebook for details.
</font></p>
<p align="justify"><font color="#0A0888">To improve the accuracy, this section reintroduces the "NAME" column under a specific condition. "NAME" is an identification column, so including it as-is may introduce bias into the model. To limit that bias, the names are binned with a cutoff of 100: only names that occur roughly 100 times or more stay as separate categories, and all remaining names are grouped into a single "Other" bin.</font></p>
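
<p align="justify"><font color="#0A0888">The <code>binning</code> helper called in the cells below lives in the repository's <code>helpers</code> module and is not shown in this notebook. As a rough sketch only, a function with similar behavior could look like the following. The name <code>bin_rare_values</code>, the strict "fewer than the cutoff" rule, and the toy data are assumptions for illustration, not the author's implementation.</font></p>

```python
import pandas as pd


def bin_rare_values(df, column, cutoff, other_label="Other"):
    """Replace values of `column` seen fewer than `cutoff` times with `other_label`.

    Hypothetical stand-in for the repository's `binning` helper.
    """
    counts = df[column].value_counts()
    rare = counts[counts < cutoff].index
    binned = df.copy()
    # Keep frequent values, overwrite rare ones with the shared label
    binned[column] = binned[column].where(~binned[column].isin(rare), other_label)
    return binned


# Toy data: "A" appears three times, "B" twice, "C" once
toy = pd.DataFrame({"NAME": ["A", "A", "A", "B", "B", "C"]})
toy_binned = bin_rare_values(toy, "NAME", cutoff=3)
print(toy_binned["NAME"].value_counts())
```

<p align="justify"><font color="#0A0888">With a cutoff of 3, only "A" survives as its own category, while "B" and "C" are merged into "Other".</font></p>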

In [1]:
# Delete the existing directory
!rm -rf alphabet_soup

# Clone the repository to a new directory
!git clone https://github.com/theidari/alphabet_soup.git

# Dependencies and setup
from alphabet_soup.src.package.constants import * # constants
from alphabet_soup.src.package.helpers import * # libraries and functions

Cloning into 'alphabet_soup'...
remote: Enumerating objects: 340, done.
remote: Counting objects: 100% (113/113), done.
remote: Compressing objects: 100% (111/111), done.
remote: Total 340 (delta 74), reused 8 (delta 1), pack-reused 227
Receiving objects: 100% (340/340), 551.25 KiB | 9.84 MiB/s, done.
Resolving deltas: 100% (227/227), done.
☑ constants is imporetd
☑ helpers is imporetd


In [2]:
# Loading the data into a Pandas DataFrame
application_df = pd.read_csv(DATA_URL)

In [3]:
# Drop the 'EIN', 'SPECIAL_CONSIDERATIONS', 'ASK_AMT', 'STATUS' columns and keep "NAME".
application_df = application_df.drop(["EIN", "SPECIAL_CONSIDERATIONS", "ASK_AMT", "STATUS"], axis=1)

In [4]:
# Bin rare APPLICATION_TYPE values (cutoff of 100) into "Other"
binning(application_df, "APPLICATION_TYPE", 100)


--------------------------------------------------------------------------------
 Value Count before binning:
--------------------------------------------------------------------------------
T3     27037
T4      1542
T6      1216
T5      1173
T19     1065
T8       737
T7       725
T10      528
T9       156
T13       66
T12       27
T2        16
T25        3
T14        3
T29        2
T15        2
T17        1
Name: APPLICATION_TYPE, dtype: int64
--------------------------------------------------------------------------------
Value Count after binning:
--------------------------------------------------------------------------------
T3       27037
T4        1542
T6        1216
T5        1173
T19       1065
T8         737
T7         725
T10        528
T9         156
Other      120
Name: APPLICATION_TYPE, dtype: int64


In [5]:
# Bin rare CLASSIFICATION values (cutoff of 100) into "Other"
binning(application_df, "CLASSIFICATION", 100)


--------------------------------------------------------------------------------
 Value Count before binning:
--------------------------------------------------------------------------------
C1000    17326
C2000     6074
C1200     4837
C3000     1918
C2100     1883
         ...  
C4120        1
C8210        1
C2561        1
C4500        1
C2150        1
Name: CLASSIFICATION, Length: 71, dtype: int64
--------------------------------------------------------------------------------
Value Count after binning:
--------------------------------------------------------------------------------
C1000    17326
C2000     6074
C1200     4837
C3000     1918
C2100     1883
C7000      777
Other      669
C1700      287
C4000      194
C5000      116
C1270      114
C2700      104
Name: CLASSIFICATION, dtype: int64


In [6]:
# Bin rare NAME values (cutoff of 100) into "Other"
binning(application_df, "NAME", 100)


--------------------------------------------------------------------------------
 Value Count before binning:
--------------------------------------------------------------------------------
PARENT BOOSTER USA INC                                                  1260
TOPS CLUB INC                                                            765
UNITED STATES BOWLING CONGRESS INC                                       700
WASHINGTON STATE UNIVERSITY                                              492
AMATEUR ATHLETIC UNION OF THE UNITED STATES INC                          408
                                                                        ... 
ST LOUIS SLAM WOMENS FOOTBALL                                              1
AIESEC ALUMNI IBEROAMERICA CORP                                            1
WEALLBLEEDRED ORG INC                                                      1
AMERICAN SOCIETY FOR STANDARDS IN MEDIUMSHIP & PSYCHICAL INVESTIGATI       1
WATERHOUSE CHARITABLE TR              

In [7]:
# Convert categorical data to numeric with `pd.get_dummies`
application_numeric = pd.get_dummies(application_df)
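
<p align="justify"><font color="#0A0888">One-hot encoding turns each category of a text column into its own indicator column, which is why the binning above matters: without it, every distinct name would become a separate input feature. A minimal illustration with made-up rows (the toy frame is hypothetical and not taken from the dataset):</font></p>

```python
import pandas as pd

# Two distinct NAME categories after binning, plus the numeric target
toy = pd.DataFrame(
    {
        "NAME": ["TOPS CLUB INC", "Other", "TOPS CLUB INC"],
        "IS_SUCCESSFUL": [1, 0, 1],
    }
)

# Text columns are expanded into indicator columns; numeric columns pass through
toy_numeric = pd.get_dummies(toy)
print(toy_numeric.columns.tolist())
print(toy_numeric.shape)
```

<p align="justify"><font color="#0A0888">Each remaining category adds one column, so the number of bins kept per column directly sets the width of the network's input layer.</font></p>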

In [8]:
# Split our preprocessed data into our features and target arrays
X = application_numeric.drop(["IS_SUCCESSFUL"], axis=1)
y = application_numeric["IS_SUCCESSFUL"]

# Split the preprocessed data into a training and testing dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the StandardScaler
X_scaler = scaler.fit(X_train)

# Scale the data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

# number of input features, used as input_dim in the Keras Tuner model builder
input_features = X_train_scaled.shape[1]
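
<p align="justify"><font color="#0A0888">The scaler is fitted on the training split only, and the same statistics are then applied to the test split, so no information from the test rows enters the preprocessing. The sketch below imitates that step with plain NumPy on synthetic data. It is not the notebook's code, and the random arrays are placeholders. It also reads the feature count from the array shape.</font></p>

```python
import numpy as np

# Synthetic stand-ins for the training and testing feature matrices
rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

# Statistics come from the training split only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# The same mean and std are applied to both splits
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

# Number of columns = number of input features
n_features = train_scaled.shape[1]
print(n_features)
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

<p align="justify"><font color="#0A0888">The scaled training columns have mean 0 and standard deviation 1 by construction. The scaled test columns will generally be close to, but not exactly at, those values.</font></p>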

In [9]:
# create a method that creates a new Sequential model with hyperparameter options
def create_model(hp):
    nn_model = tf.keras.models.Sequential()

    # allow kerastuner to decide which activation function to use in hidden layers
    activation = hp.Choice("activation",["relu","tanh","sigmoid"])
    
    # allow kerastuner to decide number of neurons in first layer
    nn_model.add(tf.keras.layers.Dense(units=hp.Int("first_units",
        min_value=1,
        max_value=320,
        step=5), activation=activation, input_dim=input_features))

    # allow kerastuner to decide number of hidden layers and neurons in hidden layers
    for i in range(hp.Int("num_layers", 1, 8)):
        nn_model.add(tf.keras.layers.Dense(units=hp.Int("units_" + str(i),
            min_value=1,
            max_value=120,
            step=5),
            activation=activation))
    
    nn_model.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

    # compile the model
    nn_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    
    return nn_model
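
<p align="justify"><font color="#0A0888">Because both <code>hp.Int</code> ranges start at <code>min_value=1</code> with <code>step=5</code>, the candidate layer widths are 1, 6, 11, and so on rather than multiples of 5. The short check below lists that grid, assuming each candidate has the form <code>min_value + k * step</code>, which is consistent with the values the tuner reports later in this notebook (for example 286, 251, and 181).</font></p>

```python
# Candidate widths, assuming each hp.Int samples min_value + k * step
first_units_grid = list(range(1, 320 + 1, 5))
hidden_units_grid = list(range(1, 120 + 1, 5))

print(first_units_grid[:4], "...", first_units_grid[-1])  # [1, 6, 11, 16] ... 316
print(len(first_units_grid), len(hidden_units_grid))      # 64 24
```

<p align="justify"><font color="#0A0888">Under that assumption the stated maxima of 320 and 120 are never reached; the largest candidates are 316 and 116.</font></p>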

In [10]:
# install and import the keras-tuner library
!pip install -q -U keras-tuner
import keras_tuner as kt

tuner = kt.Hyperband(
    create_model,
    objective="val_accuracy",
    max_epochs=35,
    hyperband_iterations=2)


In [11]:
# run the kerastuner search for best hyperparameters
tuner.search(X_train_scaled,y_train,epochs=35,validation_data=(X_test_scaled,y_test)) 

Trial 180 Complete [00h 02m 26s]
val_accuracy: 0.7533527612686157

Best val_accuracy So Far: 0.7555685043334961
Total elapsed time: 01h 31m 35s


In [12]:
# get top 3 model hyperparameters and print the values
top_hyper = tuner.get_best_hyperparameters(3)
for param in top_hyper:
    print(param.values)

{'activation': 'relu', 'first_units': 286, 'num_layers': 2, 'units_0': 71, 'units_1': 21, 'units_2': 11, 'units_3': 81, 'units_4': 71, 'units_5': 86, 'units_6': 26, 'units_7': 86, 'tuner/epochs': 35, 'tuner/initial_epoch': 12, 'tuner/bracket': 2, 'tuner/round': 2, 'tuner/trial_id': '0070'}
{'activation': 'relu', 'first_units': 251, 'num_layers': 5, 'units_0': 26, 'units_1': 101, 'units_2': 16, 'units_3': 66, 'units_4': 111, 'units_5': 91, 'units_6': 36, 'units_7': 51, 'tuner/epochs': 12, 'tuner/initial_epoch': 4, 'tuner/bracket': 3, 'tuner/round': 2, 'tuner/trial_id': '0038'}
{'activation': 'relu', 'first_units': 181, 'num_layers': 3, 'units_0': 51, 'units_1': 31, 'units_2': 116, 'units_3': 86, 'units_4': 111, 'units_5': 46, 'units_6': 41, 'units_7': 16, 'tuner/epochs': 12, 'tuner/initial_epoch': 4, 'tuner/bracket': 2, 'tuner/round': 1, 'tuner/trial_id': '0058'}
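
<p align="justify"><font color="#0A0888">The <code>tuner/bracket</code>, <code>tuner/round</code>, and <code>tuner/epochs</code> entries come from Hyperband's successive-halving schedule: many configurations are trained for a few epochs, and only the best ones continue with a larger budget. The sketch below reconstructs such a schedule for <code>max_epochs=35</code>. The formulas and the reduction factor of 3 are assumptions about how the tuner schedules its trials, chosen because they reproduce the numbers in the log above; they are not taken from this notebook.</font></p>

```python
import math


def hyperband_schedule(max_epochs, factor=3, min_epochs=1):
    """Return {bracket: [(num_models, epochs), ...]} under the assumed formulas."""
    # Number of brackets: how often max_epochs can be divided by factor
    num_brackets = 0
    epochs = max_epochs
    while epochs >= min_epochs:
        epochs /= factor
        num_brackets += 1

    end_size_0 = math.ceil(1 + math.log(max_epochs, factor))
    schedule = {}
    for bracket in range(num_brackets):
        end_size = end_size_0 / (bracket + 1)
        rounds = []
        for rnd in range(bracket + 1):
            # Later rounds keep fewer models and train them for more epochs
            size = math.ceil(end_size * factor ** (bracket - rnd))
            round_epochs = math.ceil(max_epochs / factor ** (bracket - rnd))
            rounds.append((size, round_epochs))
        schedule[bracket] = rounds
    return schedule


schedule = hyperband_schedule(max_epochs=35)
trials_per_iteration = sum(size for rounds in schedule.values() for size, _ in rounds)
total_trials = 2 * trials_per_iteration  # hyperband_iterations=2

for bracket, rounds in schedule.items():
    print(bracket, rounds)
print(trials_per_iteration, total_trials)
```

<p align="justify"><font color="#0A0888">Under these assumptions one Hyperband iteration runs 90 trials, so two iterations give the 180 trials reported by the search. The epoch budgets also line up with the printed hyperparameters: bracket 2, round 2 trains for 35 epochs, while bracket 3, round 2 and bracket 2, round 1 both train for 12.</font></p>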


In [13]:
# evaluate the top 3 models against the test dataset
top_model = tuner.get_best_models(3)
for model in top_model:
    model_loss, model_accuracy = model.evaluate(X_test_scaled,y_test,verbose=2)
    print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

268/268 - 2s - loss: 0.4957 - accuracy: 0.7556 - 2s/epoch - 8ms/step
Loss: 0.49565911293029785, Accuracy: 0.7555685043334961
268/268 - 2s - loss: 0.4977 - accuracy: 0.7550 - 2s/epoch - 6ms/step
Loss: 0.4976896047592163, Accuracy: 0.7549854516983032
268/268 - 1s - loss: 0.4963 - accuracy: 0.7549 - 1s/epoch - 5ms/step
Loss: 0.49632805585861206, Accuracy: 0.7548688054084778


In [14]:
# first best model 
best_model_1 = top_model[0]
best_model_1.build()
best_model_1.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 286)               22308     
                                                                 
 dense_1 (Dense)             (None, 71)                20377     
                                                                 
 dense_2 (Dense)             (None, 21)                1512      
                                                                 
 dense_3 (Dense)             (None, 1)                 22        
                                                                 
Total params: 44,219
Trainable params: 44,219
Non-trainable params: 0
_________________________________________________________________
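
<p align="justify"><font color="#0A0888">The parameter counts in this summary can be checked by hand: a dense layer has one weight per input for each unit, plus one bias per unit. The input width is not printed in the summary, so the sketch below infers it from the first layer's count, which is an inference rather than a value stated in the notebook. Only <code>units_0</code> and <code>units_1</code> are active here because <code>num_layers</code> is 2 for this model.</font></p>

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return (n_inputs + 1) * n_units


# Input width inferred from the first layer: 22308 / 286 - 1
inferred_inputs = 22308 // 286 - 1

# Widths from the summary: first layer, two hidden layers, output layer
layer_widths = [286, 71, 21, 1]
counts = []
n_inputs = inferred_inputs
for width in layer_widths:
    counts.append(dense_params(n_inputs, width))
    n_inputs = width

print(inferred_inputs)  # 77
print(counts)           # [22308, 20377, 1512, 22]
print(sum(counts))      # 44219
```

<p align="justify"><font color="#0A0888">The total matches the 44,219 trainable parameters reported above, and the inferred input width of 77 corresponds to the number of columns left after one-hot encoding.</font></p>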


In [15]:
# second best model 
best_model_2 = top_model[1]
best_model_2.build()
best_model_2.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 251)               19578     
                                                                 
 dense_1 (Dense)             (None, 26)                6552      
                                                                 
 dense_2 (Dense)             (None, 101)               2727      
                                                                 
 dense_3 (Dense)             (None, 16)                1632      
                                                                 
 dense_4 (Dense)             (None, 66)                1122      
                                                                 
 dense_5 (Dense)             (None, 111)               7437      
                                                                 
 dense_6 (Dense)             (None, 1)                 1

In [16]:
# third best model 
best_model_3 = top_model[2]
best_model_3.build()
best_model_3.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 181)               14118     
                                                                 
 dense_1 (Dense)             (None, 51)                9282      
                                                                 
 dense_2 (Dense)             (None, 31)                1612      
                                                                 
 dense_3 (Dense)             (None, 116)               3712      
                                                                 
 dense_4 (Dense)             (None, 1)                 117       
                                                                 
Total params: 28,841
Trainable params: 28,841
Non-trainable params: 0
_________________________________________________________________


In [17]:
# Export the best model to an HDF5 file
best_model_1.save(OUTPUT_URL+"AlphabetSoupCharity_Optimization.h5")