Training on Windows returns error at DFfeatureselect.py step #54

Closed
jd-coderepos opened this issue May 16, 2016 · 4 comments
@jd-coderepos

I'm trying to train a new language identifier model on my own language dataset. Unfortunately, it crashes in the DFfeatureselect.py script with the error "TypeError: marshal.load() arg must be file". Below is the log up to the crash point.

C:\langid.py-master\langid\train>C:\Python27\python.exe train.py corpus
corpus path: corpus
model path: ..model
langs(22): el(26) eo(42) en(1674) af(285) ca(287) am(2426) an(226) cy(79) ar(82) cs(432) et(449) az(534) es(457) be(292) bg(818) bn(65) de(2795) da(90) dz(220) br(532) bs(493) as(101)
domains(1): domain(12405)
identified 12405 documents
will tokenize 12405 documents
using byte NGram tokenizer, max_order: 4
chunk size: 50 (249 chunks)
job count: 8
whole-document tokenization
tokenized chunk (1/249) [11880 keys]
tokenized chunk (2/249) [12305 keys]
tokenized chunk (3/249) [10517 keys]
tokenized chunk (4/249) [18799 keys]
tokenized chunk (5/249) [17955 keys]
tokenized chunk (6/249) [6092 keys]
tokenized chunk (7/249) [21901 keys]
tokenized chunk (8/249) [11344 keys]
tokenized chunk (9/249) [6342 keys]
tokenized chunk (10/249) [6499 keys]
tokenized chunk (11/249) [5452 keys]
tokenized chunk (12/249) [5734 keys]
tokenized chunk (13/249) [6204 keys]
tokenized chunk (14/249) [5252 keys]
tokenized chunk (15/249) [6565 keys]
tokenized chunk (16/249) [3035 keys]
tokenized chunk (17/249) [2157 keys]
tokenized chunk (18/249) [9931 keys]
tokenized chunk (19/249) [8004 keys]
tokenized chunk (20/249) [5949 keys]
tokenized chunk (21/249) [8345 keys]
tokenized chunk (22/249) [13381 keys]
tokenized chunk (23/249) [18026 keys]
tokenized chunk (24/249) [15978 keys]
tokenized chunk (25/249) [12526 keys]
tokenized chunk (26/249) [17599 keys]
tokenized chunk (27/249) [11572 keys]
tokenized chunk (28/249) [18360 keys]
tokenized chunk (29/249) [8206 keys]
tokenized chunk (30/249) [11074 keys]
tokenized chunk (31/249) [14938 keys]
tokenized chunk (32/249) [12470 keys]
tokenized chunk (33/249) [10483 keys]
tokenized chunk (34/249) [14454 keys]
tokenized chunk (35/249) [9515 keys]
tokenized chunk (36/249) [10757 keys]
tokenized chunk (37/249) [8575 keys]
tokenized chunk (38/249) [13322 keys]
tokenized chunk (39/249) [8586 keys]
tokenized chunk (40/249) [8388 keys]
tokenized chunk (41/249) [16794 keys]
tokenized chunk (42/249) [6053 keys]
tokenized chunk (43/249) [8165 keys]
tokenized chunk (44/249) [4032 keys]
tokenized chunk (45/249) [3898 keys]
tokenized chunk (46/249) [3113 keys]
tokenized chunk (47/249) [2738 keys]
tokenized chunk (48/249) [12874 keys]
tokenized chunk (49/249) [7597 keys]
tokenized chunk (50/249) [4921 keys]
tokenized chunk (51/249) [3117 keys]
tokenized chunk (52/249) [8515 keys]
tokenized chunk (53/249) [9234 keys]
tokenized chunk (54/249) [13384 keys]
tokenized chunk (55/249) [13649 keys]
tokenized chunk (56/249) [13531 keys]
tokenized chunk (57/249) [12832 keys]
tokenized chunk (58/249) [12293 keys]
tokenized chunk (59/249) [25620 keys]
tokenized chunk (60/249) [6443 keys]
tokenized chunk (61/249) [15453 keys]
tokenized chunk (62/249) [10807 keys]
tokenized chunk (63/249) [19978 keys]
tokenized chunk (64/249) [44970 keys]
tokenized chunk (65/249) [14168 keys]
tokenized chunk (66/249) [12106 keys]
tokenized chunk (67/249) [27309 keys]
tokenized chunk (68/249) [12115 keys]
tokenized chunk (69/249) [20707 keys]
tokenized chunk (70/249) [19919 keys]
tokenized chunk (71/249) [11967 keys]
tokenized chunk (72/249) [16046 keys]
tokenized chunk (73/249) [8409 keys]
tokenized chunk (74/249) [20964 keys]
tokenized chunk (75/249) [12275 keys]
tokenized chunk (76/249) [16301 keys]
tokenized chunk (77/249) [12272 keys]
tokenized chunk (78/249) [21592 keys]
tokenized chunk (79/249) [19530 keys]
tokenized chunk (80/249) [17342 keys]
tokenized chunk (81/249) [19946 keys]
tokenized chunk (82/249) [15298 keys]
tokenized chunk (83/249) [17531 keys]
tokenized chunk (84/249) [17299 keys]
tokenized chunk (85/249) [24131 keys]
tokenized chunk (86/249) [16513 keys]
tokenized chunk (87/249) [19510 keys]
tokenized chunk (88/249) [14266 keys]
tokenized chunk (89/249) [22952 keys]
tokenized chunk (90/249) [15482 keys]
tokenized chunk (91/249) [15573 keys]
tokenized chunk (92/249) [20496 keys]
tokenized chunk (93/249) [18156 keys]
tokenized chunk (94/249) [22490 keys]
tokenized chunk (95/249) [29002 keys]
tokenized chunk (96/249) [20352 keys]
tokenized chunk (97/249) [44165 keys]
tokenized chunk (98/249) [34627 keys]
tokenized chunk (99/249) [49905 keys]
tokenized chunk (100/249) [53103 keys]
tokenized chunk (101/249) [51983 keys]
tokenized chunk (102/249) [31038 keys]
tokenized chunk (103/249) [31409 keys]
tokenized chunk (104/249) [33165 keys]
tokenized chunk (105/249) [37822 keys]
tokenized chunk (106/249) [10940 keys]
tokenized chunk (107/249) [71118 keys]
tokenized chunk (108/249) [38858 keys]
tokenized chunk (109/249) [37634 keys]
tokenized chunk (110/249) [51967 keys]
tokenized chunk (111/249) [56836 keys]
tokenized chunk (112/249) [27115 keys]
tokenized chunk (113/249) [15849 keys]
tokenized chunk (114/249) [14734 keys]
tokenized chunk (115/249) [26009 keys]
tokenized chunk (116/249) [19294 keys]
tokenized chunk (117/249) [32044 keys]
tokenized chunk (118/249) [29201 keys]
tokenized chunk (119/249) [39628 keys]
tokenized chunk (120/249) [6244 keys]
tokenized chunk (121/249) [7435 keys]
tokenized chunk (122/249) [21227 keys]
tokenized chunk (123/249) [29732 keys]
tokenized chunk (124/249) [35250 keys]
tokenized chunk (125/249) [10271 keys]
tokenized chunk (126/249) [32891 keys]
tokenized chunk (127/249) [7873 keys]
tokenized chunk (128/249) [10418 keys]
tokenized chunk (129/249) [7311 keys]
tokenized chunk (130/249) [9516 keys]
tokenized chunk (131/249) [11074 keys]
tokenized chunk (132/249) [15263 keys]
tokenized chunk (133/249) [11205 keys]
tokenized chunk (134/249) [8567 keys]
tokenized chunk (135/249) [7678 keys]
tokenized chunk (136/249) [44950 keys]
tokenized chunk (137/249) [21967 keys]
tokenized chunk (138/249) [35438 keys]
tokenized chunk (139/249) [49606 keys]
tokenized chunk (140/249) [55683 keys]
tokenized chunk (141/249) [49369 keys]
tokenized chunk (142/249) [48286 keys]
tokenized chunk (143/249) [44039 keys]
tokenized chunk (144/249) [11811 keys]
tokenized chunk (145/249) [41120 keys]
tokenized chunk (146/249) [69629 keys]
tokenized chunk (147/249) [70067 keys]
tokenized chunk (148/249) [46883 keys]
tokenized chunk (149/249) [52358 keys]
tokenized chunk (150/249) [127523 keys]
tokenized chunk (151/249) [37044 keys]
tokenized chunk (152/249) [74712 keys]
tokenized chunk (153/249) [63824 keys]
tokenized chunk (154/249) [55408 keys]
tokenized chunk (155/249) [61234 keys]
tokenized chunk (156/249) [54418 keys]
tokenized chunk (157/249) [39921 keys]
tokenized chunk (158/249) [62581 keys]
tokenized chunk (159/249) [71439 keys]
tokenized chunk (160/249) [53094 keys]
tokenized chunk (161/249) [76232 keys]
tokenized chunk (162/249) [36778 keys]
tokenized chunk (163/249) [71083 keys]
tokenized chunk (164/249) [71121 keys]
tokenized chunk (165/249) [54315 keys]
tokenized chunk (166/249) [62550 keys]
tokenized chunk (167/249) [67024 keys]
tokenized chunk (168/249) [69247 keys]
tokenized chunk (169/249) [66758 keys]
tokenized chunk (170/249) [54992 keys]
tokenized chunk (171/249) [62659 keys]
tokenized chunk (172/249) [60409 keys]
tokenized chunk (173/249) [44923 keys]
tokenized chunk (174/249) [43095 keys]
tokenized chunk (175/249) [50332 keys]
tokenized chunk (176/249) [62506 keys]
tokenized chunk (177/249) [51782 keys]
tokenized chunk (178/249) [71541 keys]
tokenized chunk (179/249) [63289 keys]
tokenized chunk (180/249) [85046 keys]
tokenized chunk (181/249) [63942 keys]
tokenized chunk (182/249) [58598 keys]
tokenized chunk (183/249) [63150 keys]
tokenized chunk (184/249) [47424 keys]
tokenized chunk (185/249) [65839 keys]
tokenized chunk (186/249) [93418 keys]
tokenized chunk (187/249) [12910 keys]
tokenized chunk (188/249) [53958 keys]
tokenized chunk (189/249) [37259 keys]
tokenized chunk (190/249) [11532 keys]
tokenized chunk (191/249) [52861 keys]
tokenized chunk (192/249) [14390 keys]
tokenized chunk (193/249) [11546 keys]
tokenized chunk (194/249) [43913 keys]
tokenized chunk (195/249) [66130 keys]
tokenized chunk (196/249) [10962 keys]
tokenized chunk (197/249) [9993 keys]
tokenized chunk (198/249) [11903 keys]
tokenized chunk (199/249) [28550 keys]
tokenized chunk (200/249) [10199 keys]
tokenized chunk (201/249) [11053 keys]
tokenized chunk (202/249) [11845 keys]
tokenized chunk (203/249) [10557 keys]
tokenized chunk (204/249) [10736 keys]
tokenized chunk (205/249) [19925 keys]
tokenized chunk (206/249) [18973 keys]
tokenized chunk (207/249) [22198 keys]
tokenized chunk (208/249) [13544 keys]
tokenized chunk (209/249) [12096 keys]
tokenized chunk (210/249) [10717 keys]
tokenized chunk (211/249) [23275 keys]
tokenized chunk (212/249) [11339 keys]
tokenized chunk (213/249) [11669 keys]
tokenized chunk (214/249) [12482 keys]
tokenized chunk (215/249) [15175 keys]
tokenized chunk (216/249) [53832 keys]
tokenized chunk (217/249) [52319 keys]
tokenized chunk (218/249) [51782 keys]
tokenized chunk (219/249) [48032 keys]
tokenized chunk (220/249) [44353 keys]
tokenized chunk (221/249) [47209 keys]
tokenized chunk (222/249) [43914 keys]
tokenized chunk (223/249) [48074 keys]
tokenized chunk (224/249) [27881 keys]
tokenized chunk (225/249) [39001 keys]
tokenized chunk (226/249) [41330 keys]
tokenized chunk (227/249) [45242 keys]
tokenized chunk (228/249) [51633 keys]
tokenized chunk (229/249) [38759 keys]
tokenized chunk (230/249) [33628 keys]
tokenized chunk (231/249) [37245 keys]
tokenized chunk (232/249) [28676 keys]
tokenized chunk (233/249) [40631 keys]
tokenized chunk (234/249) [37609 keys]
tokenized chunk (235/249) [41072 keys]
tokenized chunk (236/249) [39166 keys]
tokenized chunk (237/249) [42001 keys]
tokenized chunk (238/249) [14521 keys]
tokenized chunk (239/249) [43873 keys]
tokenized chunk (240/249) [5256 keys]
tokenized chunk (241/249) [5307 keys]
tokenized chunk (242/249) [15233 keys]
tokenized chunk (243/249) [34008 keys]
tokenized chunk (244/249) [16667 keys]
tokenized chunk (245/249) [7618 keys]
tokenized chunk (246/249) [18999 keys]
tokenized chunk (247/249) [17754 keys]
tokenized chunk (248/249) [22048 keys]
tokenized chunk (249/249) [21140 keys]
Traceback (most recent call last):
  File "train.py", line 196, in <module>
    doc_count = tally(b_dirs, args.jobs)
  File "C:\langid.py-master\langid\train\DFfeatureselect.py", line 92, in tally
    for i, keycount in enumerate(pass_sum_df_out):
  File "C:\Python27\lib\multiprocessing\pool.py", line 620, in next
    raise value
TypeError: marshal.load() arg must be file

@zq2017

zq2017 commented Dec 4, 2017

Hello, I've run into the same problem. Have you solved it? Could you share a solution? Thank you so much!

@zq2017

zq2017 commented Dec 5, 2017

@saffsd Dear Marco, can you suggest a solution? Thank you so much!

@jd-coderepos (Author)

Hello,

Sharing my workaround for this problem here. It is related to this StackOverflow question: https://stackoverflow.com/questions/3249822/python-2-6-4-marshal-load-doesnt-accept-open-file-objects-made-with-subprocess
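
For context, a minimal Python 2 sketch of the underlying limitation (the file names are just placeholders): marshal.load() only accepts a real built-in file object, so handing it a file-like object such as gzip.GzipFile fails with exactly this TypeError.

# Python 2 repro sketch (hypothetical file names): marshal.load() requires a
# true file object, not just any object with a read() method.
import gzip
import marshal

with open('records.marshal', 'wb') as f:
    marshal.dump([1, 2, 3], f)       # dumping to a real file works

with open('records.marshal', 'rb') as f:
    print marshal.load(f)            # loading from a real file works

with gzip.open('records.marshal.gz', 'wb') as g:
    g.write(open('records.marshal', 'rb').read())

marshal.load(gzip.open('records.marshal.gz', 'rb'))
# TypeError: marshal.load() arg must be file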

In the file common.py, I changed the 'unmarshal_iter' function as follows:

def unmarshal_iter(path):
    # Python 2's marshal.load() only accepts a true file object, so first
    # decompress the gzipped data into a real temporary file on disk.
    tmpfolder = tempfile.mkdtemp()
    try:
        tmpfile = os.path.join(tmpfolder, "temp")
        with open(tmpfile, "wb") as binfile:
            binfile.write(gzip.open(path, 'rb').read())

        # Yield the marshalled records one by one until end of file.
        with open(tmpfile, "rb") as binfile:
            while True:
                try:
                    yield marshal.load(binfile)
                except EOFError:
                    break
    finally:
        # Remove the temp folder even if the consumer stops iterating early.
        if tmpfolder and os.path.isdir(tmpfolder):
            shutil.rmtree(tmpfolder)
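
As far as I can tell, this limitation is specific to Python 2: Python 3's marshal.load() reads through the object's read() method, so it accepts file-like objects such as gzip.GzipFile and the temp-file copy shouldn't be needed there.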

Also, I'm using Python 2.7.3 to run the program.

Best.

@antoniogois

antoniogois commented May 16, 2018

@jenlindadsouza thanks for your tip! Don't forget to also add "import shutil" at the beginning of common.py.

And one more thing, in case anyone else gets an error in "NBtrain.py" at line 256: what I did was change line 257
except NameError:
to
except:

I think the exception that was supposed to be caught has a different name now, so with my change all exceptions are ignored here. I'm not sure about this fix, but it seems to be working. (At first this looked like a Windows path issue [using / instead of \], but I had the problem on both Windows and Linux.)
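
A less risky way to investigate (a sketch, not code from NBtrain.py itself; risky_step is a hypothetical placeholder) is to log the exception before swallowing it, so the real failure becomes visible:

import traceback

try:
    risky_step()  # hypothetical placeholder for the code guarded in NBtrain.py
except Exception:
    traceback.print_exc()  # report the real error and where it came from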

EDIT: forget about this exception fix; all predictions are going crazy... There's probably a real problem behind the exception being thrown, but I can't find it right now.
