shuffling True on test data decreasing score #58986

khurramsiddiqui · 2021-05-26T13:13:04Z

After training the model, I evaluate it on the test set in batches. With shuffle=False I get a good score, but with the same model, the same test records and shuffle=True I get a bad score. I am confused about why this happens.

import pandas as pd
data = pd.read_csv('Churn_Modelling.csv')

I shuffle the data before splitting it into train/test sets:

from sklearn.utils import shuffle

data = shuffle(data)
data.reset_index(inplace=True, drop=True)

X = data[['Age','Tenure','Geography','Balance','EstimatedSalary','Gender','NumOfProducts','CreditScore','HasCrCard','IsActiveMember']]
Y = data['Exited']

I am embedding the following categorical variables:

categorical_columns = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']
for col in categorical_columns:
    X.loc[:,col] = X.loc[:,col].astype('category')

from sklearn.preprocessing import LabelEncoder

# Replace each categorical value with an integer code
X['Geography'] = LabelEncoder().fit_transform(X['Geography'])
X['Gender']    = LabelEncoder().fit_transform(X['Gender'])
X['HasCrCard'] = LabelEncoder().fit_transform(X['HasCrCard'])
X['IsActiveMember'] = LabelEncoder().fit_transform(X['IsActiveMember'])
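
For readers unfamiliar with label encoding, here is a minimal pure-Python sketch of what the step above does to a column such as Geography. It is a stand-in, not scikit-learn's implementation: the helper name `label_encode` and the sample values are hypothetical, and the sketch only mirrors the behaviour that distinct values are sorted and mapped to the codes 0..n-1.

```python
def label_encode(values):
    """Map each distinct value to an integer code, assigned in sorted order."""
    classes = sorted(set(values))
    code_of = {value: code for code, value in enumerate(classes)}
    return [code_of[v] for v in values], classes


# Hypothetical sample values, for illustration only
sample = ["France", "Spain", "Germany", "France"]
codes, classes = label_encode(sample)

print(classes)  # ['France', 'Germany', 'Spain']
print(codes)    # [0, 2, 1, 0]
```

Every code falls in range(len(classes)), which is the index range an nn.Embedding with that many rows accepts.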

After label encoding, these columns are integers, so I convert them back to category:

for col in categorical_columns:
    X.loc[:,col] = X.loc[:,col].astype('category')
X.dtypes

Get the number of categories in each categorical column to be embedded:

embedded_cols = {n: len(col.cat.categories) for n,col in X[categorical_columns].items()}
embedded_cols
{'Geography': 3, 'Gender': 2, 'HasCrCard': 2, 'IsActiveMember': 2}

Splitting train/test data

from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.20, random_state=0)

The following Dataset class returns the categorical and numerical columns separately, because I want to embed the categorical columns on their own and then combine them with the numerical features during training.

import numpy as np
from torch.utils.data import Dataset, DataLoader

class ShelterOutcomeDataset(Dataset):
    def __init__(self, X, Y, embedded_col_names):
        Xdata = X.copy()
        self.X1 = Xdata.loc[:,embedded_col_names].copy().values.astype(np.int64) #categorical columns
        self.X2 = Xdata.drop(columns=embedded_col_names).copy().values.astype(np.float32) #numerical columns
        self.y  = Y.copy().values.astype(np.int64)
        
    def __len__(self):
        return len(self.y)
    
    def __getitem__(self, idx):
        return self.X1[idx], self.X2[idx], self.y[idx]

Size of embedding columns

embedding_sizes = [(n_categories, min(50, (n_categories+1)//2)) for _,n_categories in embedded_cols.items()]
embedding_sizes
[(3, 2), (2, 1), (2, 1), (2, 1)]
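
As a worked check of the sizing rule above, the following sketch recomputes the embedding sizes from the category counts shown earlier and adds the six numerical columns the post counts. The helper name `embedding_size` is hypothetical; the rule itself, min(50, (n+1)//2), is the one used in the post.

```python
# Category counts as printed in the post
embedded_cols = {"Geography": 3, "Gender": 2, "HasCrCard": 2, "IsActiveMember": 2}
n_cont = 6  # number of numerical columns, as counted in the post


def embedding_size(n_categories, cap=50):
    """Half the number of categories, rounded up, capped at `cap`."""
    return min(cap, (n_categories + 1) // 2)


embedding_sizes = [(n, embedding_size(n)) for n in embedded_cols.values()]
total_emb = sum(size for _, size in embedding_sizes)

print(embedding_sizes)     # [(3, 2), (2, 1), (2, 1), (2, 1)]
print(total_emb)           # 5
print(total_emb + n_cont)  # 11
```

The total of 11 matches the in_features=11 of lin1 in the model printout that accompanies the training log.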

train_ds = ShelterOutcomeDataset(X_train,y_train ,categorical_columns)

embedded_col_names = embedded_cols.keys()
len(X.columns) - len(embedded_cols) #number of numerical columns
6

Model

import torch
import torch.nn as nn
import torch.nn.functional as F

class testNet(nn.Module):
    def __init__(self, emb_dims, n_cont):
        super().__init__()

        self.embeddings = nn.ModuleList([nn.Embedding(categories, size) for categories,size in emb_dims])
        no_of_embs = sum(e.embedding_dim for e in self.embeddings) #length of all embeddings combined
   
        self.n_emb, self.n_cont = no_of_embs, n_cont
        self.lin1 = nn.Linear(self.n_emb + self.n_cont,200)
        self.lin2 = nn.Linear(200, 100)
        self.lin3 = nn.Linear(100, 50)
        self.lin4 = nn.Linear(50, 2)

        self.bn1 = nn.BatchNorm1d(self.n_cont)
        self.bn2 = nn.BatchNorm1d(200)
        self.bn3 = nn.BatchNorm1d(100)
        self.bn4 = nn.BatchNorm1d(50)

        self.emb_drop = nn.Dropout(0.4)
        self.drops    = nn.Dropout()
        

    def forward(self, x_cat, x_cont):
        x = [e(x_cat[:,i]) for i,e in enumerate(self.embeddings)]
        x = torch.cat(x, 1)
        
        x = self.emb_drop(x)
        x2 = self.bn1(x_cont)
        x = torch.cat([x, x2], 1)
        x = F.relu(self.lin1(x))
        x = self.drops(x)
        x = self.bn2(x)
        x = F.relu(self.lin2(x))
        x = self.drops(x)
        x = self.bn3(x)
        x = F.relu(self.lin3(x))
        x = self.drops(x)
        x = self.bn4(x)
        x = F.relu(self.lin4(x))

        return x

model = testNet(embedding_sizes,6)
print(model)

testNet(
  (embeddings): ModuleList(
    (0): Embedding(3, 2)
    (1): Embedding(2, 1)
    (2): Embedding(2, 1)
    (3): Embedding(2, 1)
  )
  (lin1): Linear(in_features=9, out_features=200, bias=True)
  (lin2): Linear(in_features=200, out_features=100, bias=True)
  (lin3): Linear(in_features=100, out_features=50, bias=True)
  (lin4): Linear(in_features=50, out_features=2, bias=True)
  (bn1): BatchNorm1d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (bn2): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (bn3): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (bn4): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (emb_drop): Dropout(p=0.4, inplace=False)
  (drops): Dropout(p=0.5, inplace=False)
)

Training

import torch.optim as torch_optim

def get_optimizer(model, lr = 0.001, wd = 0.0):
    parameters = filter(lambda p: p.requires_grad, model.parameters())
    optim = torch_optim.Adam(parameters, lr=lr, weight_decay=wd)
    return optim

def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.xavier_normal_(m.weight)

criterion = nn.CrossEntropyLoss()

def train_model(model, optim, train_dl):
    model.train()
    total    = 0
    sum_loss = 0
    output   = 0
    
    for cat, cont, y in train_dl:
        batch = y.shape[0]
        output = model(cat, cont)
        loss = criterion(output, y)
        optim.zero_grad()
        loss.backward()
        optim.step()
        total += batch
        sum_loss += batch*(loss.item())
    # mean loss per sample and the last batch's output
    return sum_loss/total, output

def train_loop(model, epochs, lr=0.01, wd=0.0):
    optim = get_optimizer(model, lr = lr, wd = wd)
    for epoch in range(epochs): 
       
        loss,pred = train_model(model, optim, train_dl)
        if (epoch+1) % 50 ==0:
            print(f'epoch : {epoch+1},training loss : {loss}')
            
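# class_imbalance_sampler is not defined in this post, and sampler is not passed to the DataLoader below
sampler = class_imbalance_sampler(y_train)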

batch_size = 1000
train_dl = DataLoader(train_ds, batch_size=batch_size,shuffle=True)

model = testNet(embedding_sizes,6)
model.apply(init_weights)

opt = torch.optim.Adam(model.parameters(), lr=1e-2)
train_loop(model, epochs=200, lr=0.001, wd=0.00001)

testNet(
  (embeddings): ModuleList(
    (0): Embedding(3, 2)
    (1): Embedding(2, 1)
    (2): Embedding(2, 1)
    (3): Embedding(2, 1)
  )
  (lin1): Linear(in_features=11, out_features=200, bias=True)
  (lin2): Linear(in_features=200, out_features=100, bias=True)
  (lin3): Linear(in_features=100, out_features=50, bias=True)
  (lin4): Linear(in_features=50, out_features=2, bias=True)
  (bn1): BatchNorm1d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (bn2): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (bn3): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (bn4): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (emb_drop): Dropout(p=0.4, inplace=False)
  (drops): Dropout(p=0.5, inplace=False)
)
epoch : 50,training loss : 0.3888436555862427
epoch : 100,training loss : 0.3804803378880024
epoch : 150,training loss : 0.3702864944934845
epoch : 200,training loss : 0.35881927236914635

Validation with shuffle=False (scores below):

valid_ds = ShelterOutcomeDataset(X_val,y_val , categorical_columns)
batch_size = 100
valid_dl = DataLoader(valid_ds, batch_size=batch_size, shuffle=False)
valid_dl = DeviceDataLoader(valid_dl, device)

preds = []
with torch.no_grad():
    for cat, cont,y in valid_dl:
        output = model(cat, cont)
        _,pred = torch.max(output,1)
        preds.append(pred.cpu().detach().numpy())
final_preds = [item for sublist in preds for item in sublist]        

print(classification_report(y_val, np.array(final_preds)))

              precision    recall  f1-score   support

           0       0.86      0.95      0.90      1610
           1       0.63      0.37      0.47       390

    accuracy                           0.83      2000
   macro avg       0.74      0.66      0.69      2000
weighted avg       0.82      0.83      0.82      2000

Validation with shuffle=True (scores below):

valid_ds = ShelterOutcomeDataset(X_val,y_val , categorical_columns)
batch_size = 100
valid_dl = DataLoader(valid_ds, batch_size=batch_size, shuffle=True)
valid_dl = DeviceDataLoader(valid_dl, device)

preds = []
with torch.no_grad():
    for cat, cont,y in valid_dl:
        output = model(cat, cont)
        _,pred = torch.max(output,1)
        preds.append(pred.cpu().detach().numpy())
final_preds = [item for sublist in preds for item in sublist] 


print(classification_report(y_val, np.array(final_preds)))

              precision    recall  f1-score   support

           0       0.80      0.88      0.84      1610
           1       0.15      0.09      0.12       390

    accuracy                           0.72      2000
   macro avg       0.48      0.48      0.48      2000
weighted avg       0.67      0.72      0.70      2000

As you can see, with shuffle=False the precision/recall for class "1" is far better than with shuffle=True. I don't understand why, since real-world data could arrive in any order. Please help.
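
One possible explanation, offered as an assumption since the thread does not confirm it: with shuffle=True the predictions come out in the loader's shuffled order, while classification_report still compares them with y_val in its original order. The pure-Python sketch below imitates that situation. Everything in it is hypothetical (the `batches` helper stands in for a DataLoader, the labels are synthetic, and `predict` is a "perfect" model), so it only illustrates the alignment effect, not the author's actual numbers.

```python
import random


def batches(n_items, batch_size, shuffle, rng):
    """Yield lists of indices, roughly as a DataLoader would visit them."""
    order = list(range(n_items))
    if shuffle:
        rng.shuffle(order)
    for start in range(0, n_items, batch_size):
        yield order[start:start + batch_size]


def accuracy(preds, targets):
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


rng = random.Random(0)
# Synthetic labels with roughly 20% positives (an assumption for illustration)
labels = [1 if rng.random() < 0.2 else 0 for _ in range(2000)]


def predict(i):
    """Hypothetical perfect model: always returns the true label."""
    return labels[i]


preds, seen_labels = [], []
for idx in batches(len(labels), 100, shuffle=True, rng=rng):
    preds.extend(predict(i) for i in idx)
    seen_labels.extend(labels[i] for i in idx)  # labels in loader order

aligned = accuracy(preds, seen_labels)  # targets collected from the loader
misaligned = accuracy(preds, labels)    # targets in the original order

print("aligned:", aligned)
print("misaligned:", misaligned)
```

Even a perfect model scores near chance in the misaligned comparison. In the validation loop above, the equivalent change would be to collect y from each batch and pass those collected labels to classification_report instead of y_val. Separately, the validation snippets never call model.eval(), so dropout and batch normalization stay in training mode, which can also make predictions depend on how the batches are composed.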

cc @albanD @mruberry @jbschlosser


jbschlosser · 2021-05-26T15:47:19Z

Hey @khurramsiddiqui! FYI a question like this is a better fit for the PyTorch forums.

khurramsiddiqui · 2021-05-26T23:10:50Z

Hi @jbschlosser, apologies for that. I asked there as well, but there is no response yet. Could you please share some thoughts? https://discuss.pytorch.org/t/when-shuffling-true-score-decreased-why/122426

ptrblck · 2021-05-27T04:09:15Z

Response added and issue cannot be reproduced so far.

anjali411 added the "module: nn" (Related to torch.nn) and "triaged" (This issue has been looked at a team member, and triaged and prioritized into an appropriate module) labels May 26, 2021

jbschlosser closed this as completed May 26, 2021
