## Preprocessing

In [1]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
import tensorflow as tf

# Import and read the charity_data.csv.
application_df = pd.read_csv("https://static.bc-edx.com/data/dl-1-2/m21/lms/starter/charity_data.csv")
application_df.head()

Unnamed: 0,EIN,NAME,APPLICATION_TYPE,AFFILIATION,CLASSIFICATION,USE_CASE,ORGANIZATION,STATUS,INCOME_AMT,SPECIAL_CONSIDERATIONS,ASK_AMT,IS_SUCCESSFUL
0,10520599,BLUE KNIGHTS MOTORCYCLE CLUB,T10,Independent,C1000,ProductDev,Association,1,0,N,5000,1
1,10531628,AMERICAN CHESAPEAKE CLUB CHARITABLE TR,T3,Independent,C2000,Preservation,Co-operative,1,1-9999,N,108590,1
2,10547893,ST CLOUD PROFESSIONAL FIREFIGHTERS,T5,CompanySponsored,C3000,ProductDev,Association,1,0,N,5000,0
3,10553066,SOUTHSIDE ATHLETIC ASSOCIATION,T3,CompanySponsored,C2000,Preservation,Trust,1,10000-24999,N,6692,1
4,10556103,GENETIC RESEARCH INSTITUTE OF THE DESERT,T3,Independent,C1000,Heathcare,Trust,1,100000-499999,N,142590,1


In [2]:
# Drop the non-beneficial ID columns
application_df = application_df.drop(columns=['EIN', 'NAME'])

# Display the updated DataFrame to verify the changes
application_df.head()


Unnamed: 0,APPLICATION_TYPE,AFFILIATION,CLASSIFICATION,USE_CASE,ORGANIZATION,STATUS,INCOME_AMT,SPECIAL_CONSIDERATIONS,ASK_AMT,IS_SUCCESSFUL
0,T10,Independent,C1000,ProductDev,Association,1,0,N,5000,1
1,T3,Independent,C2000,Preservation,Co-operative,1,1-9999,N,108590,1
2,T5,CompanySponsored,C3000,ProductDev,Association,1,0,N,5000,0
3,T3,CompanySponsored,C2000,Preservation,Trust,1,10000-24999,N,6692,1
4,T3,Independent,C1000,Heathcare,Trust,1,100000-499999,N,142590,1


In [3]:
# Determine the number of unique values in each column
unique_values = application_df.nunique()

# Display the number of unique values for each column
print(unique_values)


APPLICATION_TYPE            17
AFFILIATION                  6
CLASSIFICATION              71
USE_CASE                     5
ORGANIZATION                 4
STATUS                       2
INCOME_AMT                   9
SPECIAL_CONSIDERATIONS       2
ASK_AMT                   8747
IS_SUCCESSFUL                2
dtype: int64


In [4]:
# Look at value counts for APPLICATION_TYPE
application_type_counts = application_df['APPLICATION_TYPE'].value_counts()

# Display the value counts
print(application_type_counts)

# Define a cutoff point for replacing infrequent values
cutoff = 100  # Example cutoff, adjust based on your needs

# Identify values that are below the cutoff
rare_application_types = application_type_counts[application_type_counts < cutoff].index

# Replace rare values with "Other"
application_df['APPLICATION_TYPE'] = application_df['APPLICATION_TYPE'].replace(rare_application_types, 'Other')

# Verify the changes
print(application_df['APPLICATION_TYPE'].value_counts())


APPLICATION_TYPE
T3     27037
T4      1542
T6      1216
T5      1173
T19     1065
T8       737
T7       725
T10      528
T9       156
T13       66
T12       27
T2        16
T25        3
T14        3
T29        2
T15        2
T17        1
Name: count, dtype: int64
APPLICATION_TYPE
T3       27037
T4        1542
T6        1216
T5        1173
T19       1065
T8         737
T7         725
T10        528
T9         156
Other      120
Name: count, dtype: int64


In [5]:
# Define the cutoff value
cutoff = 100  # Example cutoff value; adjust as needed

# Get the value counts for APPLICATION_TYPE
application_type_counts = application_df['APPLICATION_TYPE'].value_counts()

# Create a list of application types to be replaced
application_types_to_replace = application_type_counts[application_type_counts < cutoff].index.tolist()

# Replace the identified application types with "Other"
for app in application_types_to_replace:
    application_df['APPLICATION_TYPE'] = application_df['APPLICATION_TYPE'].replace(app, "Other")

# Check to make sure the replacement was successful
print(application_df['APPLICATION_TYPE'].value_counts())


APPLICATION_TYPE
T3       27037
T4        1542
T6        1216
T5        1173
T19       1065
T8         737
T7         725
T10        528
T9         156
Other      120
Name: count, dtype: int64
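The binning step above is repeated for APPLICATION_TYPE and CLASSIFICATION, so it could be wrapped in a small helper. The sketch below is one possible way to do that; the function name `bin_rare_categories` and the toy Series are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def bin_rare_categories(series: pd.Series, cutoff: int, other_label: str = "Other") -> pd.Series:
    """Return a copy of `series` where values seen fewer than `cutoff` times become `other_label`."""
    counts = series.value_counts()
    rare_values = counts[counts < cutoff].index
    # Series.where keeps entries where the condition is True and replaces the rest.
    return series.where(~series.isin(rare_values), other_label)


# Made-up column for illustration only (not the charity data).
demo = pd.Series(["T3"] * 5 + ["T4"] * 3 + ["T9"] + ["T2"])
binned = bin_rare_categories(demo, cutoff=2)
print(binned.value_counts())
```

Running the helper a second time on its own output leaves the column unchanged as long as the "Other" bucket itself reaches the cutoff, which is why re-running the replacement cell above prints the same counts.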


In [6]:
# Look at value counts for CLASSIFICATION
classification_counts = application_df['CLASSIFICATION'].value_counts()

# Display the value counts
print(classification_counts)

# Define a cutoff point for replacing infrequent values
cutoff = 100  # Example cutoff; adjust as needed

# Identify values that are below the cutoff
rare_classifications = classification_counts[classification_counts < cutoff].index

# Replace rare values with "Other"
application_df['CLASSIFICATION'] = application_df['CLASSIFICATION'].replace(rare_classifications, 'Other')

# Verify the changes
print(application_df['CLASSIFICATION'].value_counts())


CLASSIFICATION
C1000    17326
C2000     6074
C1200     4837
C3000     1918
C2100     1883
         ...  
C4120        1
C8210        1
C2561        1
C4500        1
C2150        1
Name: count, Length: 71, dtype: int64
CLASSIFICATION
C1000    17326
C2000     6074
C1200     4837
C3000     1918
C2100     1883
C7000      777
Other      669
C1700      287
C4000      194
C5000      116
C1270      114
C2700      104
Name: count, dtype: int64


In [7]:
# Look at value counts for CLASSIFICATION
classification_counts = application_df['CLASSIFICATION'].value_counts()

# Display only the value counts greater than 1
classification_counts_greater_than_1 = classification_counts[classification_counts > 1]

# Print the result
print(classification_counts_greater_than_1)


CLASSIFICATION
C1000    17326
C2000     6074
C1200     4837
C3000     1918
C2100     1883
C7000      777
Other      669
C1700      287
C4000      194
C5000      116
C1270      114
C2700      104
Name: count, dtype: int64


In [8]:
# Define the cutoff value
cutoff = 100  # Example cutoff value; adjust as needed

# Get the value counts for CLASSIFICATION
classification_counts = application_df['CLASSIFICATION'].value_counts()

# Create a list of classifications to be replaced
classifications_to_replace = classification_counts[classification_counts < cutoff].index.tolist()

# Replace the identified classifications with "Other"
for cls in classifications_to_replace:
    application_df['CLASSIFICATION'] = application_df['CLASSIFICATION'].replace(cls, "Other")

# Check to make sure the replacement was successful
print(application_df['CLASSIFICATION'].value_counts())


CLASSIFICATION
C1000    17326
C2000     6074
C1200     4837
C3000     1918
C2100     1883
C7000      777
Other      669
C1700      287
C4000      194
C5000      116
C1270      114
C2700      104
Name: count, dtype: int64


In [9]:
# Convert categorical data to numeric using pd.get_dummies
application_df_encoded = pd.get_dummies(application_df)

# Display the first few rows of the encoded DataFrame
application_df_encoded.head()


Unnamed: 0,STATUS,ASK_AMT,IS_SUCCESSFUL,APPLICATION_TYPE_Other,APPLICATION_TYPE_T10,APPLICATION_TYPE_T19,APPLICATION_TYPE_T3,APPLICATION_TYPE_T4,APPLICATION_TYPE_T5,APPLICATION_TYPE_T6,...,INCOME_AMT_1-9999,INCOME_AMT_10000-24999,INCOME_AMT_100000-499999,INCOME_AMT_10M-50M,INCOME_AMT_1M-5M,INCOME_AMT_25000-99999,INCOME_AMT_50M+,INCOME_AMT_5M-10M,SPECIAL_CONSIDERATIONS_N,SPECIAL_CONSIDERATIONS_Y
0,1,5000,1,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
1,1,108590,1,False,False,False,True,False,False,False,...,True,False,False,False,False,False,False,False,True,False
2,1,5000,0,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,True,False
3,1,6692,1,False,False,False,True,False,False,False,...,False,True,False,False,False,False,False,False,True,False
4,1,142590,1,False,False,False,True,False,False,False,...,False,False,True,False,False,False,False,False,True,False
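As a rough illustration of what `pd.get_dummies` does here, the sketch below encodes a tiny made-up frame and then recomputes the expected feature count from the unique-value counts printed earlier. The toy rows are invented, and `dtype=int` is an optional choice (not used in the notebook) that yields 0/1 columns instead of True/False.

```python
import pandas as pd

# Made-up rows that reuse a few column names from the notebook.
demo_df = pd.DataFrame({
    "ASK_AMT": [5000, 108590, 6692],
    "ORGANIZATION": ["Association", "Co-operative", "Trust"],
    "SPECIAL_CONSIDERATIONS": ["N", "N", "Y"],
})

# Numeric columns pass through; each text column becomes one column per category.
encoded = pd.get_dummies(demo_df, dtype=int)
print(encoded.columns.tolist())

# Expected width of the real feature matrix (IS_SUCCESSFUL excluded as the target).
numeric_columns = 2  # STATUS and ASK_AMT
categories_per_column = {
    "APPLICATION_TYPE": 10,   # after binning rare values into "Other"
    "AFFILIATION": 6,
    "CLASSIFICATION": 12,     # after binning rare values into "Other"
    "USE_CASE": 5,
    "ORGANIZATION": 4,
    "INCOME_AMT": 9,
    "SPECIAL_CONSIDERATIONS": 2,
}
expected_features = numeric_columns + sum(categories_per_column.values())
print(expected_features)
```

The computed width can be compared with the second dimension of the training feature set printed after the train/test split.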


In [10]:
from sklearn.model_selection import train_test_split

# Define the target variable (y) and features (X)
y = application_df_encoded['IS_SUCCESSFUL']
X = application_df_encoded.drop(columns=['IS_SUCCESSFUL'])

# Split the data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting datasets
print(f"Training feature set shape: {X_train.shape}")
print(f"Testing feature set shape: {X_test.shape}")
print(f"Training target set shape: {y_train.shape}")
print(f"Testing target set shape: {y_test.shape}")


Training feature set shape: (27439, 50)
Testing feature set shape: (6860, 50)
Training target set shape: (27439,)
Testing target set shape: (6860,)
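The shapes above, and the batch counts that appear in the training logs further down (686 and 215), can be reproduced with a little arithmetic. This is a sketch under stated assumptions: it assumes scikit-learn rounds the test size up, and that Keras holds out the last 20% of the training rows for validation by flooring the split index. The variable names are illustrative only.

```python
import math

n_rows = 27439 + 6860          # total rows, from the printed shapes
test_size = 0.2
batch_size = 32
validation_split = 0.2

# train_test_split: the test set gets ceil(n_rows * test_size) rows (assumed rounding).
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

# Keras validation_split: the last 20% of the training rows are held out (assumed flooring).
n_fit = math.floor(n_train * (1 - validation_split))
n_val = n_train - n_fit

# Batches per epoch during training and during evaluation on the test set.
steps_per_epoch = math.ceil(n_fit / batch_size)
test_steps = math.ceil(n_test / batch_size)

print(n_train, n_test, n_fit, n_val, steps_per_epoch, test_steps)
```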


In [11]:
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the StandardScaler on the training data
X_scaler = scaler.fit(X_train)

# Scale the training and testing data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

# Optionally, print the shapes of the scaled data to verify
print(f"Scaled training feature set shape: {X_train_scaled.shape}")
print(f"Scaled testing feature set shape: {X_test_scaled.shape}")


Scaled training feature set shape: (27439, 50)
Scaled testing feature set shape: (6860, 50)
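Fitting the scaler on the training data only, as done above, means the test set is transformed with the training mean and standard deviation. The following sketch checks that behaviour on random, made-up data (not the charity features): the scaled training columns should have mean close to 0 and standard deviation close to 1, while the scaled test columns will only be approximately standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# Learn mean and standard deviation from the training rows only.
demo_scaler = StandardScaler().fit(train)
train_scaled = demo_scaler.transform(train)
test_scaled = demo_scaler.transform(test)

print("train mean:", train_scaled.mean(axis=0).round(6))
print("train std: ", train_scaled.std(axis=0).round(6))
print("test mean: ", test_scaled.mean(axis=0).round(3))
```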


## Compile, Train and Evaluate the Model

In [13]:
import tensorflow as tf

# Define the model - deep neural net
nn = tf.keras.models.Sequential()

# First hidden layer
nn.add(tf.keras.layers.Dense(units=128, activation='relu', input_shape=(X_train_scaled.shape[1],)))

# Second hidden layer
nn.add(tf.keras.layers.Dense(units=64, activation='relu'))

# Output layer
nn.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))

# Check the structure of the model
nn.summary()

# Compile the model
nn.compile(optimizer='adam',
           loss='binary_crossentropy',
           metrics=['accuracy'])

# Train the model
history = nn.fit(X_train_scaled, y_train, epochs=100, batch_size=32, validation_split=0.2, verbose=2,
                  callbacks=[tf.keras.callbacks.ModelCheckpoint('AlphabetSoupCharity.weights.h5',
                                                                  save_weights_only=True,
                                                                  save_best_only=True,
                                                                  verbose=1)])

# Evaluate the model
loss, accuracy = nn.evaluate(X_test_scaled, y_test, verbose=2)
print(f"Test loss: {loss:.4f}")
print(f"Test accuracy: {accuracy:.4f}")


Epoch 1/100

Epoch 1: val_loss improved from inf to 0.54891, saving model to AlphabetSoupCharity.weights.h5
686/686 - 5s - 7ms/step - accuracy: 0.7188 - loss: 0.5700 - val_accuracy: 0.7409 - val_loss: 0.5489
Epoch 2/100

Epoch 2: val_loss improved from 0.54891 to 0.54748, saving model to AlphabetSoupCharity.weights.h5
686/686 - 2s - 3ms/step - accuracy: 0.7273 - loss: 0.5552 - val_accuracy: 0.7358 - val_loss: 0.5475
Epoch 3/100

Epoch 3: val_loss improved from 0.54748 to 0.54602, saving model to AlphabetSoupCharity.weights.h5
686/686 - 2s - 3ms/step - accuracy: 0.7289 - loss: 0.5517 - val_accuracy: 0.7363 - val_loss: 0.5460
Epoch 4/100

Epoch 4: val_loss improved from 0.54602 to 0.54080, saving model to AlphabetSoupCharity.weights.h5
686/686 - 2s - 4ms/step - accuracy: 0.7284 - loss: 0.5503 - val_accuracy: 0.7405 - val_loss: 0.5408
Epoch 5/100

Epoch 5: val_loss did not improve from 0.54080
686/686 - 1s - 2ms/step - accuracy: 0.7287 - loss: 0.5493 - val_accuracy: 0.7374 - val_loss: 0.5

In [14]:
# Compile the model
nn.compile(optimizer='adam',
           loss='binary_crossentropy',
           metrics=['accuracy'])


In [15]:
# Train the model
history = nn.fit(X_train_scaled, y_train,
                  epochs=100,
                  batch_size=32,
                  validation_split=0.2,
                  verbose=2,
                  callbacks=[tf.keras.callbacks.ModelCheckpoint(
                      'AlphabetSoupCharity.weights.h5',
                      save_weights_only=True,
                      save_best_only=True,
                      verbose=1
                  )])


Epoch 1/100

Epoch 1: val_loss improved from inf to 0.55984, saving model to AlphabetSoupCharity.weights.h5
686/686 - 3s - 5ms/step - accuracy: 0.7395 - loss: 0.5325 - val_accuracy: 0.7389 - val_loss: 0.5598
Epoch 2/100

Epoch 2: val_loss did not improve from 0.55984
686/686 - 3s - 4ms/step - accuracy: 0.7400 - loss: 0.5312 - val_accuracy: 0.7389 - val_loss: 0.5613
Epoch 3/100

Epoch 3: val_loss did not improve from 0.55984
686/686 - 2s - 4ms/step - accuracy: 0.7396 - loss: 0.5305 - val_accuracy: 0.7400 - val_loss: 0.5622
Epoch 4/100

Epoch 4: val_loss did not improve from 0.55984
686/686 - 1s - 2ms/step - accuracy: 0.7401 - loss: 0.5305 - val_accuracy: 0.7392 - val_loss: 0.5612
Epoch 5/100

Epoch 5: val_loss did not improve from 0.55984
686/686 - 2s - 4ms/step - accuracy: 0.7401 - loss: 0.5305 - val_accuracy: 0.7392 - val_loss: 0.5618
Epoch 6/100

Epoch 6: val_loss did not improve from 0.55984
686/686 - 3s - 4ms/step - accuracy: 0.7395 - loss: 0.5304 - val_accuracy: 0.7391 - val_loss:

In [16]:
# Evaluate the model using the test data
model_loss, model_accuracy = nn.evaluate(X_test_scaled, y_test, verbose=2)

# Print the results
print(f"Loss: {model_loss:.4f}, Accuracy: {model_accuracy:.4f}")


215/215 - 1s - 3ms/step - accuracy: 0.7262 - loss: 0.6082
Loss: 0.6082, Accuracy: 0.7262
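To make the two reported numbers concrete, here is a sketch of how binary accuracy and binary cross-entropy can be computed by hand with NumPy. The labels and probabilities are invented, and the 0.5 threshold and the clipping constant are assumptions meant to mirror common Keras defaults rather than values taken from the notebook.

```python
import numpy as np

# Made-up labels and sigmoid outputs for five applications.
y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.7, 0.6])

# Accuracy: share of predictions on the correct side of the 0.5 threshold.
y_pred = (y_prob > 0.5).astype(int)
accuracy = float((y_pred == y_true).mean())

# Binary cross-entropy: average negative log-likelihood of the true labels.
eps = 1e-7  # clipping avoids log(0)
p = np.clip(y_prob, eps, 1 - eps)
bce = float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

print(f"accuracy={accuracy:.4f}, loss={bce:.4f}")
```

Accuracy only counts which side of the threshold each prediction falls on, while the loss also penalizes confident wrong predictions, so the two metrics can move in different directions between epochs.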


In [22]:
# Export the model to an HDF5 file
nn.save("AlphabetSoupCharity.h5")




In [21]:
from google.colab import files
files.download('/content/AlphabetSoupCharity.h5')



# Report on Neural Network Model for Alphabet Soup

# Overview of the Analysis

The purpose of this analysis is to develop a deep learning model that can predict the success of funding applications for Alphabet Soup, a charity organization. The organization has provided a dataset containing various features related to the applications, and the goal is to create a binary classifier that can determine whether an application will be successful or not. By analyzing the patterns in the data, the model can help Alphabet Soup allocate resources more effectively and identify the key factors that contribute to successful funding applications.

# Results

# Data Preprocessing

# What variable(s) are the target(s) for your model?

The target variable for the model is IS_SUCCESSFUL, which indicates whether a funding application was successful (1) or not (0).

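# What variable(s) are the features for your model?

The features used for the model are APPLICATION_TYPE, AFFILIATION, CLASSIFICATION, USE_CASE, ORGANIZATION, STATUS, INCOME_AMT, SPECIAL_CONSIDERATIONS, and ASK_AMT. After rare values are binned into "Other" and the categorical columns are one-hot encoded, these nine columns expand to 50 input features.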

# What variable(s) should be removed from the input data because they are neither targets nor features?

The variables EIN and NAME were removed from the input data as they are identifiers that do not contribute to the prediction of the target variable.

# Compiling, Training, and Evaluating the Model

# How many neurons, layers, and activation functions did you select for your neural network model, and why?

Neurons: The model takes 50 input features and has a first hidden layer of 128 neurons, a second hidden layer of 64 neurons, and an output layer with 1 neuron.

Layers: The neural network consists of two hidden layers followed by the output layer.

Activation Functions: The hidden layers use the ReLU activation function to introduce non-linearity, while the output layer uses the sigmoid activation function to output a probability between 0 and 1, which suits the binary classification task.

The number of neurons and layers was chosen to balance model capacity against the risk of overfitting. ReLU is a common choice for hidden layers, and sigmoid is a standard choice for a binary classification output.
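As a rough check on the size of this architecture, the number of trainable parameters can be worked out by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch below uses the 50 input features from the printed training shape; the helper name `dense_params` is hypothetical.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Input width followed by the units of each Dense layer in the notebook.
layer_sizes = [50, 128, 64, 1]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer, total_params)
```

These figures can be compared against the parameter counts that `nn.summary()` reports.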

# Were you able to achieve the target model performance?

The model achieved a test accuracy of approximately 72.6% (loss 0.6082). This is a reasonable starting point, but it falls short of the target accuracy of 75% or higher.

# What steps did you take in your attempts to increase model performance?

Various optimization methods were considered, such as adjusting the number of neurons and layers, experimenting with different activation functions, and tuning the learning rate. Further optimization and hyperparameter tuning are still required to reach the target accuracy.

# Summary

The deep learning model developed for Alphabet Soup predicts the success of funding applications with a test accuracy of about 72.6%. Although it did not reach the 75% target, it provides a working baseline for further tuning.

To further improve the model, additional optimization techniques, such as hyperparameter tuning with KerasTuner, could be employed. A different model family, such as a Random Forest or a Gradient Boosting Machine (GBM), is also worth trying. Tree-based ensembles handle non-linear relationships and one-hot-encoded categorical features well, often perform strongly on tabular data, and do not require feature scaling, which makes them good candidates for this classification problem.
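A tree-based baseline of this kind could be set up along the following lines. This is a minimal sketch on synthetic data generated with `make_classification`, not a result for the charity dataset, and the hyperparameters shown are arbitrary assumptions. With the real data, `X_train`, `X_test`, `y_train`, and `y_test` from the notebook would take the place of the synthetic arrays.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded feature matrix (illustration only).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Tree ensembles split on thresholds, so no feature scaling is applied here.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_tr, y_tr)

rf_accuracy = rf.score(X_te, y_te)
print(f"Random forest accuracy on synthetic data: {rf_accuracy:.4f}")
```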
