# Alphabet Soup Charity

UC Berkeley Extension Data Analytics Boot Camp Module 19 Challenge

---

### Import dependencies

In [15]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
import tensorflow as tf


### Import and characterize the input data


In [16]:
file_name = "../Resources/charity_data.csv"
charity_df = pd.read_csv(file_name)
charity_df.head()

Unnamed: 0,EIN,NAME,APPLICATION_TYPE,AFFILIATION,CLASSIFICATION,USE_CASE,ORGANIZATION,STATUS,INCOME_AMT,SPECIAL_CONSIDERATIONS,ASK_AMT,IS_SUCCESSFUL
0,10520599,BLUE KNIGHTS MOTORCYCLE CLUB,T10,Independent,C1000,ProductDev,Association,1,0,N,5000,1
1,10531628,AMERICAN CHESAPEAKE CLUB CHARITABLE TR,T3,Independent,C2000,Preservation,Co-operative,1,1-9999,N,108590,1
2,10547893,ST CLOUD PROFESSIONAL FIREFIGHTERS,T5,CompanySponsored,C3000,ProductDev,Association,1,0,N,5000,0
3,10553066,SOUTHSIDE ATHLETIC ASSOCIATION,T3,CompanySponsored,C2000,Preservation,Trust,1,10000-24999,N,6692,1
4,10556103,GENETIC RESEARCH INSTITUTE OF THE DESERT,T3,Independent,C1000,Heathcare,Trust,1,100000-499999,N,142590,1


In [17]:
charity_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34299 entries, 0 to 34298
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   EIN                     34299 non-null  int64 
 1   NAME                    34299 non-null  object
 2   APPLICATION_TYPE        34299 non-null  object
 3   AFFILIATION             34299 non-null  object
 4   CLASSIFICATION          34299 non-null  object
 5   USE_CASE                34299 non-null  object
 6   ORGANIZATION            34299 non-null  object
 7   STATUS                  34299 non-null  int64 
 8   INCOME_AMT              34299 non-null  object
 9   SPECIAL_CONSIDERATIONS  34299 non-null  object
 10  ASK_AMT                 34299 non-null  int64 
 11  IS_SUCCESSFUL           34299 non-null  int64 
dtypes: int64(4), object(8)
memory usage: 3.1+ MB


In [18]:
# EIN is an identifier with no predictive value, so drop it; NAME is kept and binned below

charity_df = charity_df.drop(["EIN"], axis=1)
charity_df.head()

Unnamed: 0,NAME,APPLICATION_TYPE,AFFILIATION,CLASSIFICATION,USE_CASE,ORGANIZATION,STATUS,INCOME_AMT,SPECIAL_CONSIDERATIONS,ASK_AMT,IS_SUCCESSFUL
0,BLUE KNIGHTS MOTORCYCLE CLUB,T10,Independent,C1000,ProductDev,Association,1,0,N,5000,1
1,AMERICAN CHESAPEAKE CLUB CHARITABLE TR,T3,Independent,C2000,Preservation,Co-operative,1,1-9999,N,108590,1
2,ST CLOUD PROFESSIONAL FIREFIGHTERS,T5,CompanySponsored,C3000,ProductDev,Association,1,0,N,5000,0
3,SOUTHSIDE ATHLETIC ASSOCIATION,T3,CompanySponsored,C2000,Preservation,Trust,1,10000-24999,N,6692,1
4,GENETIC RESEARCH INSTITUTE OF THE DESERT,T3,Independent,C1000,Heathcare,Trust,1,100000-499999,N,142590,1


### Binning Process

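Reduce the number of distinct values in NAME, CLASSIFICATION, and APPLICATION_TYPE by replacing any value that occurs fewer times than a cutoff with "OTHER". The cutoffs (250, 500, and 400, respectively) were chosen by trial and error.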

In [29]:
# Count how often each name appears, then group names below a chosen threshold as "OTHER"
names = charity_df.NAME.value_counts()

replace_names = list(names[names < 250].index)

# Replace rare names in the DataFrame with "OTHER"
for name in replace_names:
    charity_df['NAME'] = charity_df['NAME'].replace(name,'OTHER')

charity_df['NAME'].value_counts()

OTHER                                                28539
PARENT BOOSTER USA INC                                1260
TOPS CLUB INC                                          765
UNITED STATES BOWLING CONGRESS INC                     700
WASHINGTON STATE UNIVERSITY                            492
AMATEUR ATHLETIC UNION OF THE UNITED STATES INC        408
PTA TEXAS CONGRESS                                     368
SOROPTIMIST INTERNATIONAL OF THE AMERICAS INC          331
ALPHA PHI SIGMA                                        313
TOASTMASTERS INTERNATIONAL                             293
MOST WORSHIPFUL STRINGER FREE AND ACCEPTED MASONS      287
LITTLE LEAGUE BASEBALL INC                             277
INTERNATIONAL ASSOCIATION OF LIONS CLUBS               266
Name: NAME, dtype: int64

In [38]:
# Group CLASSIFICATION values with fewer than 500 occurrences as "OTHER"
classes = charity_df.CLASSIFICATION.value_counts()

replace_classes = list(classes[classes < 500].index)

for classification in replace_classes:
    charity_df['CLASSIFICATION'] = charity_df['CLASSIFICATION'].replace(classification,'OTHER')

charity_df['CLASSIFICATION'].value_counts()

C1000    17326
C2000     6074
C1200     4837
C3000     1918
C2100     1883
OTHER     1484
C7000      777
Name: CLASSIFICATION, dtype: int64

In [41]:
# Group APPLICATION_TYPE values with fewer than 400 occurrences as "OTHER"

app_count = charity_df.APPLICATION_TYPE.value_counts()

replace_apps = list(app_count[app_count < 400].index)

for apps in replace_apps:
    charity_df['APPLICATION_TYPE'] = charity_df['APPLICATION_TYPE'].replace(apps,'OTHER')

charity_df['APPLICATION_TYPE'].value_counts()

T3       27037
T4        1542
T6        1216
T5        1173
T19       1065
T8         737
T7         725
T10        528
OTHER      276
Name: APPLICATION_TYPE, dtype: int64
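
The three binning cells above apply the same rule with different cutoffs. As a minimal, dependency-free sketch of that rule, values seen fewer than `cutoff` times are replaced by a placeholder. The function name `bin_rare_values` and the sample data are illustrative and do not come from the notebook.

```python
from collections import Counter


def bin_rare_values(values, cutoff, other_label="OTHER"):
    """Replace values that occur fewer than `cutoff` times with `other_label`."""
    counts = Counter(values)
    return [v if counts[v] >= cutoff else other_label for v in values]


# Hypothetical toy data, not taken from charity_data.csv
app_types = ["T3", "T3", "T3", "T4", "T4", "T9"]
binned = bin_rare_values(app_types, cutoff=2)
print(Counter(binned))
```

Counting once and mapping every value in a single pass avoids calling `replace` separately for each rare value, which matters when thousands of names fall below the cutoff.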

### Encoding

In [44]:
# Generate list of columns with categorical variables

cat_vars = charity_df.dtypes[charity_df.dtypes == "object"].index.tolist()

In [45]:
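# Create the OneHotEncoder instance (copied from module)
# Note: scikit-learn 1.2+ renames the `sparse` argument to `sparse_output`

enc = OneHotEncoder(sparse=False)

# Fit and transform the OneHotEncoder using the categorical variable list
encode_df = pd.DataFrame(enc.fit_transform(charity_df[cat_vars]))

# Add the encoded variable names to the DataFrame
# Note: scikit-learn 1.0+ provides `get_feature_names_out` as the replacement for `get_feature_names`
encode_df.columns = enc.get_feature_names(cat_vars)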
encode_df.head()

Unnamed: 0,NAME_ALPHA PHI SIGMA,NAME_AMATEUR ATHLETIC UNION OF THE UNITED STATES INC,NAME_INTERNATIONAL ASSOCIATION OF LIONS CLUBS,NAME_LITTLE LEAGUE BASEBALL INC,NAME_MOST WORSHIPFUL STRINGER FREE AND ACCEPTED MASONS,NAME_OTHER,NAME_PARENT BOOSTER USA INC,NAME_PTA TEXAS CONGRESS,NAME_SOROPTIMIST INTERNATIONAL OF THE AMERICAS INC,NAME_TOASTMASTERS INTERNATIONAL,...,INCOME_AMT_1-9999,INCOME_AMT_10000-24999,INCOME_AMT_100000-499999,INCOME_AMT_10M-50M,INCOME_AMT_1M-5M,INCOME_AMT_25000-99999,INCOME_AMT_50M+,INCOME_AMT_5M-10M,SPECIAL_CONSIDERATIONS_N,SPECIAL_CONSIDERATIONS_Y
0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [46]:
# Merge the datasets and drop the original categorical columns

charity_df = charity_df.merge(encode_df, left_index=True, right_index=True)
charity_df = charity_df.drop(cat_vars, axis=1)
charity_df.head()

Unnamed: 0,STATUS,ASK_AMT,IS_SUCCESSFUL,NAME_ALPHA PHI SIGMA,NAME_AMATEUR ATHLETIC UNION OF THE UNITED STATES INC,NAME_INTERNATIONAL ASSOCIATION OF LIONS CLUBS,NAME_LITTLE LEAGUE BASEBALL INC,NAME_MOST WORSHIPFUL STRINGER FREE AND ACCEPTED MASONS,NAME_OTHER,NAME_PARENT BOOSTER USA INC,...,INCOME_AMT_1-9999,INCOME_AMT_10000-24999,INCOME_AMT_100000-499999,INCOME_AMT_10M-50M,INCOME_AMT_1M-5M,INCOME_AMT_25000-99999,INCOME_AMT_50M+,INCOME_AMT_5M-10M,SPECIAL_CONSIDERATIONS_N,SPECIAL_CONSIDERATIONS_Y
0,1,5000,1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,1,108590,1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,1,5000,0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1,6692,1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1,142590,1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


### Split into Testing and Training

In [47]:
# Split on IS_SUCCESSFUL into features and targets

y = charity_df['IS_SUCCESSFUL'].values # target
X = charity_df.drop(['IS_SUCCESSFUL'],axis=1).values # everything but the target is a feature

# Split into training and testing

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
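
`train_test_split` is called here without `test_size`, so scikit-learn's default of 25% applies. The quick arithmetic check below assumes the test count is rounded up and the remainder goes to training, which matches the 25724 training samples and 8575 test samples reported in the outputs further down.

```python
import math

n_samples = 34299      # rows reported by charity_df.info()
test_fraction = 0.25   # scikit-learn default when test_size is not given

n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 25724 8575
```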

In [48]:
# Create a StandardScaler instance
scaler = StandardScaler()

# Fit the StandardScaler
X_scaler = scaler.fit(X_train)

# Scale the data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
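
The scaler is fit on the training split only and then applied to both splits, so no test-set statistics leak into training. The plain-Python sketch below shows the same idea for a single feature. It assumes the population standard deviation, which is what `StandardScaler` uses, and the numbers are made up for illustration.

```python
from statistics import fmean, pstdev

# Hypothetical ASK_AMT-like values, not taken from the dataset
train = [5000.0, 6692.0, 108590.0, 142590.0]
test = [5000.0, 50000.0]

# "Fit": learn the mean and standard deviation from the training data only
mu = fmean(train)
sigma = pstdev(train)

# "Transform": apply the same training statistics to both splits
train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]

print(train_scaled)
print(test_scaled)
```

After scaling, the training values have mean 0 and standard deviation 1, while the test values generally do not, because they are expressed in the training set's units.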

### NN Model 

##### Read the first line of each cell to identify the trial it belongs to. To run TRIAL 1, run only the cells labeled TRIAL 1 plus the cells marked for all trials. Results are summarized in the last cell of the notebook.

In [49]:
# Create the Keras Sequential model - TRIAL 1 - 1 hidden layer, relu
nn_model = tf.keras.models.Sequential()

# define parameters
# Add the input and hidden layer 
number_inputs = len(X_train[0])
number_hidden_nodes_1 = 25

nn_model.add(tf.keras.layers.Dense(units=number_hidden_nodes_1, activation="relu", input_dim=number_inputs))

# Add the output layer that uses a probability activation function
nn_model.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

nn_model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 25)                1450      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 26        
Total params: 1,476
Trainable params: 1,476
Non-trainable params: 0
_________________________________________________________________


In [53]:
# TRIAL 2 - 2 layers, relu
number_inputs = len(X_train[0])
number_hidden_nodes_1 =  25
number_hidden_nodes_2 = 10

nn_model = tf.keras.models.Sequential()

# Add first layer to the Sequential model using Keras’ Dense class
nn_model.add(tf.keras.layers.Dense(units = number_hidden_nodes_1, input_dim=number_inputs, activation="relu"))

# Second hidden layer
nn_model.add(tf.keras.layers.Dense(units=number_hidden_nodes_2, activation="relu"))


# Add the output layer that uses a probability activation function
nn_model.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

# Check the structure of the Sequential model
nn_model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_2 (Dense)              (None, 25)                1450      
_________________________________________________________________
dense_3 (Dense)              (None, 10)                260       
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 11        
Total params: 1,721
Trainable params: 1,721
Non-trainable params: 0
_________________________________________________________________


In [57]:
# TRIAL 3 - 2 layers, Relu
number_inputs = len(X_train[0])
number_hidden_nodes_1 =  35
number_hidden_nodes_2 = 10

nn_model = tf.keras.models.Sequential()

# Add first layer to the Sequential model using Keras’ Dense class
nn_model.add(tf.keras.layers.Dense(units=number_hidden_nodes_1, input_dim=number_inputs, activation="relu"))

# Second hidden layer
nn_model.add(tf.keras.layers.Dense(units=number_hidden_nodes_2, activation="relu"))


# Add the output layer that uses a probability activation function
nn_model.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

# Check the structure of the Sequential model
nn_model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_5 (Dense)              (None, 35)                2030      
_________________________________________________________________
dense_6 (Dense)              (None, 10)                360       
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 11        
Total params: 2,401
Trainable params: 2,401
Non-trainable params: 0
_________________________________________________________________


In [61]:
# TRIAL 4 - 2 layers, Relu
number_inputs = len(X_train[0])
number_hidden_nodes_1 =  25
number_hidden_nodes_2 = 15

nn_model = tf.keras.models.Sequential()

# Add first layer to the Sequential model using Keras’ Dense class
nn_model.add(tf.keras.layers.Dense(units=number_hidden_nodes_1, input_dim=number_inputs, activation="relu"))

# Second hidden layer
nn_model.add(tf.keras.layers.Dense(units=number_hidden_nodes_2, activation="relu"))


# Add the output layer that uses a probability activation function
nn_model.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

# Check the structure of the Sequential model
nn_model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_8 (Dense)              (None, 25)                1450      
_________________________________________________________________
dense_9 (Dense)              (None, 15)                390       
_________________________________________________________________
dense_10 (Dense)             (None, 1)                 16        
Total params: 1,856
Trainable params: 1,856
Non-trainable params: 0
_________________________________________________________________
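
The `Param #` column in the summaries above can be checked by hand: a `Dense` layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The sketch below assumes 57 input features, a value inferred from the first layer's 1,450 parameters (58 × 25) rather than printed in the notebook. `dense_params` and `total_params` are illustrative helper names, and the 25/15 model is labeled Trial 4 to match the notes at the end.

```python
def dense_params(n_in, n_out):
    """Weights plus biases for one fully connected layer."""
    return n_in * n_out + n_out


def total_params(n_inputs, layer_units):
    """Sum parameters over consecutive Dense layers."""
    total, n_in = 0, n_inputs
    for n_out in layer_units:
        total += dense_params(n_in, n_out)
        n_in = n_out
    return total


n_inputs = 57  # assumption: inferred from 1450 = (57 + 1) * 25

trials = {
    "Trial 1": [25, 1],
    "Trial 2": [25, 10, 1],
    "Trial 3": [35, 10, 1],
    "Trial 4": [25, 15, 1],
}
for name, units in trials.items():
    print(name, total_params(n_inputs, units))
```

The totals reproduce the 1,476, 1,721, 2,401, and 1,856 parameters shown in the four summaries.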


In [62]:
# Compile the Sequential model together and customize metrics - RUN FOR ALL TRIALS

nn_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

In [55]:
# Train - TRIALS 1-2 both use 100 epochs

fit_model = nn_model.fit(X_train_scaled, y_train, epochs=100)

Train on 25724 samples
Epoch 1/100
...
Epoch 100/100


In [63]:
# Train - TRIALS 3 & 4 use 75 epochs to reduce computation time

fit_model = nn_model.fit(X_train_scaled, y_train, epochs=75)

Train on 25724 samples
Epoch 1/75
...
Epoch 75/75




In [52]:
# Evaluate model performance using the test data - TRIAL 1
model_loss, model_accuracy = nn_model.evaluate(X_test_scaled,y_test,verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

8575/1 - 1s - loss: 0.4747 - accuracy: 0.7450
Loss: 0.5097151760839512, Accuracy: 0.7449562549591064


In [56]:
# Evaluate model performance using the test data - TRIAL 2
model_loss, model_accuracy = nn_model.evaluate(X_test_scaled,y_test,verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

8575/1 - 1s - loss: 0.4712 - accuracy: 0.7493
Loss: 0.5084301748185394, Accuracy: 0.7492711544036865


In [60]:
# Evaluate model performance using the test data - TRIAL 3
model_loss, model_accuracy = nn_model.evaluate(X_test_scaled,y_test,verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

8575/1 - 1s - loss: 0.4695 - accuracy: 0.7487
Loss: 0.5076871745231896, Accuracy: 0.7486880421638489


In [64]:
# Evaluate model performance using the test data - TRIAL 4
model_loss, model_accuracy = nn_model.evaluate(X_test_scaled,y_test,verbose=2)
print(f"Loss: {model_loss}, Accuracy: {model_accuracy}")

8575/1 - 1s - loss: 0.4738 - accuracy: 0.7468
Loss: 0.5111516482454694, Accuracy: 0.7468221783638


### Notes on combinations tried:
    
Trial 1: 1 hidden layer, 25 nodes, ReLU, 100 epochs -- ACCURACY SCORE: 74.50%

- Our initial model came within about half a percentage point of our 75% target.

Trial 2: 2 hidden layers, 25/10 nodes, ReLU, 100 epochs -- ACCURACY SCORE: 74.93% - Even closer to 75%!

- Adding a second hidden layer increased accuracy slightly.

Trial 3: 2 hidden layers, 35/10 nodes, ReLU, 75 epochs -- ACCURACY SCORE: 74.87% - Slightly below Trial 2.

- We reduced the number of epochs because loss and accuracy improved little beyond 60-75 epochs. Increasing the number of nodes in layer 1 did not improve accuracy.

Trial 4: 2 hidden layers, 25/15 nodes, ReLU, 75 epochs -- ACCURACY SCORE: 74.68% - Slightly below Trial 3.

- In this trial, we reduced layer 1 back to 25 nodes and increased layer 2 to 15 nodes. This model performed worse than Trial 2. However, all four trials were within 1 percentage point of our target accuracy of 75%.
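
To make the comparison concrete, the short sketch below recomputes each trial's gap to the 75% target from the test accuracies printed by the evaluation cells. The dictionary layout and variable names are illustrative.

```python
TARGET = 0.75

# Test accuracies copied from the evaluation cell outputs
accuracy = {
    "Trial 1": 0.7449562549591064,
    "Trial 2": 0.7492711544036865,
    "Trial 3": 0.7486880421638489,
    "Trial 4": 0.7468221783638,
}

for trial, acc in accuracy.items():
    gap_pp = (TARGET - acc) * 100  # shortfall in percentage points
    print(f"{trial}: {acc:.2%} ({gap_pp:.2f} pp below target)")

best = max(accuracy, key=accuracy.get)
print("Best:", best)
```

The spread between the best and worst trial is under half a percentage point, so differences this small may reflect random initialization as much as architecture.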
