
lstm value error of different shape #61

Closed

vinayakumarr opened this issue May 4, 2016 · 15 comments

Comments
@vinayakumarr

I tried to modify the imdb example for my dataset, a sample of which is given below:
3 3 373 27 9 615 9 16 10 34 0 8 0 199 65917 1319 122 402 319 183
3 3 77 12 4 66 4 3 0 5 0 14 3 50 106 139 38 164 53 109
3 3 86 6 2 6 2 0 0 1 0 25 0 4 284 77888 19 66 11 25
3 3 469 21 7 291 7 43 15 82 0 207 0 181 115646 59073 294 928 112 675
3 3 2090 21 7 4035 7 17 8 40 0 317 10 717 1033 25661 142 2054 1795 1023
3 3 691 18 6 597 6 30 16 61 0 245 18 273 719 2352305 213 1106 324 719
6 6 229 0 8 526 0 11 1 13 0 6 5 101 7246 2082 120 141 288 1570
3 3 1158 9 3 649 3 16 6 17 1 247 38 477 592 987626 82 1305 653 707
4 4 211 0 10 429 0 16 9 20 0 3 0 106 42725 27302 4280 133 477 1567

The first column is the target, which has 9 classes; there are around 1803 features.

from __future__ import print_function
import numpy as np
from sklearn.cross_validation import train_test_split
import tflearn
import pandas as pd
from tflearn.data_utils import to_categorical, pad_sequences

print("Loading")
data = pd.read_csv('Train.csv')

X = data.iloc[:,1:1805]
y = data.iloc[:,0]

X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2,random_state=42)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

print("Preprocessing")
X_train1 = X_train.values.T.tolist()
X_test1 = X_test.values.tolist()
y_train1 = y_train.values.T.tolist()
y_test1 = y_test.values.tolist()

# Data preprocessing

# Sequence padding

trainX = pad_sequences(X_train1, maxlen=200, value=0.)
testX = pad_sequences(X_test1, maxlen=200, value=0.)

# Converting labels to binary vectors

trainY = to_categorical(y_train, nb_classes=0)
testY = to_categorical(y_test, nb_classes=0)

# Network building

net = tflearn.input_data([None, 200])
net = tflearn.embedding(net, input_dim=20000, output_dim=128)
net = tflearn.lstm(net, 128)
net = tflearn.dropout(net, 0.5)
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net, optimizer='adam',loss='categorical_crossentropy')

# Training

model = tflearn.DNN(net, clip_gradients=0., tensorboard_verbose=0)
model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True, batch_size=128)
[screenshot: error traceback]

@aymericdamien
Member

You have to change last fully connected layer to:

net = tflearn.fully_connected(net, 10, activation='softmax')

In your example you seem to have 10 classes, not 2, so your softmax layer needs an output dimension of 10.

@vinayakumarr
Author

vinayakumarr commented May 5, 2016

I modified it, but it is generating the following error:
[screenshot: error traceback]

@aymericdamien
Member

In that line you have to change input_dim to your dictionary size (the total number of different ids):
net = tflearn.embedding(net, input_dim=20000, output_dim=128)

@vinayakumarr
Author

I understood that from the error itself. But what exactly is the dictionary size (total number of different ids) with respect to my dataset?

@aymericdamien
Member

aymericdamien commented May 5, 2016

I guess it should be np.max(trainX)+1
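[Editor's note, not from the thread: np.max(trainX)+1 only equals the dictionary size when the ids are contiguous and start at 0. With raw values like the ones in this dataset it can massively overestimate, which is what leads to the huge embedding below. A quick sketch comparing the two quantities, using a toy array built from values in the sample data:]

```python
import numpy as np

# Toy "trainX" with sparse, non-contiguous values, like the raw data above
trainX = np.array([[3, 373, 65917],
                   [3, 77888, 19]])

vocab_upper_bound = int(np.max(trainX)) + 1  # what the embedding layer would be sized to
distinct_ids = len(np.unique(trainX))        # how many distinct values actually occur

print(vocab_upper_bound)  # 77889
print(distinct_ids)       # 5
```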

@vinayakumarr
Author

It is generating the following error:
[screenshot: error traceback]
Segmentation fault (core dumped)

@aymericdamien
Member

What is np.max(trainX)+1 returning? Maybe your dictionary size is too large.

@vinayakumarr
Author

np.max(trainX)+1 is returning 1930563585. How do I solve this?

@aymericdamien
Member

Can you tell me what these numbers are? I thought they were ids. I think your main issue here is parsing your data.

@vinayakumarr
Author

vinayakumarr commented May 5, 2016

When I use print(np.max(trainX)+1), it returns 1930563585.

The reading and parsing code is given below:

# Reading the csv file
data = pd.read_csv('Train.csv')

X = data.iloc[:,1:1805]  # all feature columns
y = data.iloc[:,0]  # only the first column - the class label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # using scikit-learn train_test_split

# Converting into lists of values
print("Preprocessing")
X_train1 = X_train.values.T.tolist()
X_test1 = X_test.values.tolist()
y_train1 = y_train.values.T.tolist()
y_test1 = y_test.values.tolist()

# Sequence padding

trainX = pad_sequences(X_train1, maxlen=200, value=0.)
testX = pad_sequences(X_test1, maxlen=200, value=0.)

# Converting labels to binary vectors

trainY = to_categorical(y_train, nb_classes=0)
testY = to_categorical(y_test, nb_classes=0)

Then, finally, the model creation.

@vinayakumarr
Author

The above code reads and parses my dataset. Is there any problem with it?

@aymericdamien
Member

Can you please tell me what these data mean? It would make it easier to understand what you are actually trying to do. Are these integers ids (representing words or whatever), or are they real values?

@aymericdamien
Member

I see. First you can normalize your data by assigning an id (from 0 to your total number of events) to every event, then apply the embedding. But note that if your total number of events is too large, it will be very slow, so you can try to find ways to reduce your data dimension (keep only events occurring more than X times, apply a PCA transformation, etc.).
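[Editor's note: a minimal sketch of that normalization step, not from the thread. It assumes the features are categorical event values and uses np.unique with return_inverse=True, which is one common way to map arbitrary integers to contiguous ids in [0, n_distinct):]

```python
import numpy as np

# Raw feature values (sparse, non-contiguous integers, as in the dataset above)
raw = np.array([[3, 373, 65917],
                [3, 77888, 19]])

# np.unique returns the sorted distinct values and, with return_inverse=True,
# each element's index into that sorted array - i.e. a contiguous id per event.
values, ids = np.unique(raw, return_inverse=True)
ids = ids.reshape(raw.shape)  # inverse may come back flattened on older numpy

n_events = len(values)  # this is the dictionary size to use as input_dim
print(ids)              # e.g. [[0 2 3], [0 4 1]]
print(n_events)         # 5
```

With this remapping, the embedding layer's input_dim is bounded by the number of distinct events actually observed, rather than by the largest raw value.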

@vinayakumarr
Author

I have reduced my feature set now: 2000 rows and 20 24 columns. It is still showing the same error.

@aymericdamien
Member

Oh, your problem is not about the number of rows, but about your embedding layer dimensions. You have to first normalize your data and give each event an id (0 to the total number of events). What is your total number of distinct events?
