
Split and Normalize Data #2

Closed
github-learning-lab bot opened this issue Apr 21, 2021 · 3 comments

@github-learning-lab

Now that we have our data in a usable form, we need to split it. We want one set of data to train our model, and another set to test the model after training. In general, the data is randomly split, with about 70% used for training and 30% used for testing. For easier visualization, we'll instead split the data by Pokémon generation: the first generation of Pokémon (from Pokémon Red, Blue, and Yellow) will be our testing data, while the rest will be our training data:

def train_test_splitter(dataframe, column):
    # Rows where the column is not 1 become training data;
    # rows where it equals 1 become test data.
    df_train = dataframe.loc[dataframe[column] != 1]
    df_test = dataframe.loc[dataframe[column] == 1]

    # Drop the splitting column so it isn't used as a feature.
    df_train = df_train.drop(column, axis=1)
    df_test = df_test.drop(column, axis=1)

    return df_train, df_test

df_train, df_test = train_test_splitter(df, 'Generation')

This function puts any Pokémon whose "Generation" label is equal to 1 into the test dataset and everyone else into the training dataset. It then drops the Generation column from both datasets so it can't be used as a feature.
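
If we had instead wanted the conventional random 70/30 split mentioned earlier, a minimal sketch using scikit-learn's train_test_split (an illustrative alternative, not part of this lesson's code) could look like this:

# Hypothetical alternative: a random 70/30 split of the same DataFrame.
from sklearn.model_selection import train_test_split

df_train_rand, df_test_rand = train_test_split(df, test_size=0.3, random_state=42)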

Now that we have our two sets of data, we'll need to separate the labels (the 'isLegendary' column) from the rest of the data. Remember, the labels are the answer key to the test the algorithms are trying to solve, and it does no good to have them learn with the answer key in (metaphorical) hand:

def label_delineator(df_train, df_test, label):
    # Separate the features from the label column and convert
    # each to a NumPy array with .values.
    train_data = df_train.drop(label, axis=1).values
    train_labels = df_train[label].values
    test_data = df_test.drop(label, axis=1).values
    test_labels = df_test[label].values
    return train_data, train_labels, test_data, test_labels

This function extracts the data from the DataFrame and, using .values, puts it into NumPy arrays that TensorFlow can understand. We then have four groups of data:

train_data, train_labels, test_data, test_labels = label_delineator(df_train, df_test, 'isLegendary')
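
As a quick sanity check (a hypothetical addition, assuming the arrays created above), each data array should have one row per entry in its labels array:

# Hypothetical check: row counts of data and labels should match.
print(train_data.shape, train_labels.shape)
print(test_data.shape, test_labels.shape)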

Comment with the generation number we used in the test dataset.

@suryamshashwat
Owner

1

@github-learning-lab
Author

That's right! And now that we have our labels extracted from the data, let's normalize the data so everything is on the same scale:

from sklearn import preprocessing

def data_normalizer(train_data, test_data):
    # Fit the scaler on the training data only, then apply the same
    # scaling to the test data so both sets are on one scale.
    scaler = preprocessing.MinMaxScaler()
    train_data = scaler.fit_transform(train_data)
    test_data = scaler.transform(test_data)
    return train_data, test_data

train_data, test_data = data_normalizer(train_data, test_data)
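
Under the hood, min-max scaling maps each feature to the range [0, 1] via x' = (x - min) / (max - min). A tiny standalone example with hypothetical values, just to illustrate:

import numpy as np

# Hypothetical single-feature column: HP values of three Pokémon.
x = np.array([10.0, 55.0, 100.0])
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)  # [0.  0.5 1. ]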

Now we can get to the machine learning! Let's create the model using Keras. Keras is a high-level API for TensorFlow. We have a few options for doing this, but we'll keep it simple for now. A model is built up from layers; we'll add two fully connected neural layers.

The number passed to each layer is the number of neurons in it. The first layer uses a ReLU (Rectified Linear Unit) activation function. Since this is also the first layer, we need to specify input_shape, which is the shape of an entry in our dataset.

After that, we'll finish with a softmax layer. Softmax is a generalization of logistic regression to situations with multiple classes, like our two possible groups: 'Legendary' and 'Not Legendary'. With it, the model outputs two probabilities, one for each possible label:

from tensorflow import keras

# Number of features per Pokémon.
length = train_data.shape[1]

model = keras.Sequential()
# Fully connected hidden layer with 500 neurons and ReLU activation.
model.add(keras.layers.Dense(500, activation='relu', input_shape=[length,]))
# Output layer: one probability per class (Legendary / Not Legendary).
model.add(keras.layers.Dense(2, activation='softmax'))
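
To see what the softmax output layer does with its two values, here is a minimal standalone sketch using hypothetical logits and plain NumPy (not part of the lesson's code):

import numpy as np

# Hypothetical raw outputs (logits) from the final Dense layer.
logits = np.array([2.0, 0.5])

# Softmax: exponentiate, then normalize so the probabilities sum to 1.
probs = np.exp(logits) / np.exp(logits).sum()
print(probs)        # approx [0.82 0.18]
print(probs.sum())  # 1.0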

Close this issue when you are finished normalizing the data.

@github-learning-lab
Author

Awesome! We are moving right along.

In the next issue we will compile our model and evaluate it.
