Lines 53-63 of Preprocess.py look problematic #9

Open
MaNing1924382115 opened this issue Jul 27, 2021 · 0 comments

     1  win_size = args.win_size
     2  step = args.step
     3  start_index = 0
     4  end_index = win_size
     5  data = token_ids[start_index:end_index]
     6  train_list.append(data)
     7  start_index += step
     8  end_index += step
     9  while end_index + 50 < len(token_ids):  # add the remaining data to the training set only if its length is >= 50
    10      data = token_ids[start_index:end_index]
    11      train_list.append(data)
    12      start_index += step
    13      end_index += step

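The walk-throughs below use win_size = step = 200, the values implied by the indices.

Suppose tokens has length 621:
After line 8 executes, start_index = 200, end_index = 400, and train_list covers tokens up to 200.
Entering the loop, after the first pass reaches line 13, start_index = 400, end_index = 600, and train_list covers tokens up to 400.
The check 600 + 50 < 621 fails, so the loop exits. train_list covers tokens up to 400, and tokens 400-621 are discarded.

Suppose tokens has length 651:
After line 8 executes, start_index = 200, end_index = 400, and train_list covers tokens up to 200.
Entering the loop, after the first pass reaches line 13, start_index = 400, end_index = 600, and train_list covers tokens up to 400.
After the second pass reaches line 13, start_index = 600, end_index = 800, and train_list covers tokens up to 600.
The check 800 + 50 < 651 fails, so the loop exits. train_list covers tokens up to 600, and tokens 600-651 are discarded.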
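
The two cases above can be checked with a small standalone script. This is only a sketch: `original_windows` copies the quoted loop into a function, `dropped_tokens` is a hypothetical helper added for illustration, and win_size = step = 200 is assumed as in the walk-throughs.

```python
def original_windows(token_ids, win_size=200, step=200):
    """Copy of the quoted loop, wrapped in a function for testing."""
    train_list = []
    start_index = 0
    end_index = win_size
    train_list.append(token_ids[start_index:end_index])
    start_index += step
    end_index += step
    while end_index + 50 < len(token_ids):
        train_list.append(token_ids[start_index:end_index])
        start_index += step
        end_index += step
    return train_list


def dropped_tokens(n_tokens, win_size=200, step=200):
    """Hypothetical helper: count tail tokens that end up in no window.

    Assumes n_tokens > 0 so that the first window is non-empty.
    """
    windows = original_windows(list(range(n_tokens)), win_size, step)
    last_covered = windows[-1][-1] + 1  # index just past the last kept token
    return n_tokens - last_covered


for n in (621, 650, 651):
    print(n, dropped_tokens(n))
# 621 221
# 650 250
# 651 51
```

With these settings the discarded tail ranges from 51 tokens (length 651) up to 250 tokens (length 650).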
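For any input long enough to enter the loop, this code discards between 51 and step + 50 tokens from the end of tokens. That does not seem to match what the comment says: "add the remaining data to the training set only if its length is >= 50".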
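
One possible way to get the behavior the comment describes is to slice a window at every step and keep a short final window only when it has at least 50 tokens. This is a minimal sketch, not the repository's code: the function name `sliding_windows` and the `min_tail` parameter are hypothetical, and it keeps the first window unconditionally as the original does for non-empty input.

```python
def sliding_windows(token_ids, win_size=200, step=200, min_tail=50):
    """Split token_ids into windows of up to win_size tokens.

    A window shorter than min_tail is skipped, except for the first
    window, which is always kept (matching the original code).
    """
    train_list = []
    for start_index in range(0, len(token_ids), step):
        data = token_ids[start_index:start_index + win_size]
        if start_index == 0 or len(data) >= min_tail:
            train_list.append(data)
    return train_list


# Length 621: the 21-token tail (600-621) is below min_tail and is skipped.
print([len(w) for w in sliding_windows(list(range(621)))])  # [200, 200, 200]
# Length 651: the 51-token tail (600-651) is kept.
print([len(w) for w in sliding_windows(list(range(651)))])  # [200, 200, 200, 51]
```

With this version at most `min_tail - 1` tokens are lost at the end when step equals win_size, instead of up to step + 50.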
