
_pickle.UnpicklingError: the STRING opcode argument must be quoted #181

Open
nit-esh opened this issue Jul 13, 2018 · 18 comments

Comments

@nit-esh

nit-esh commented Jul 13, 2018

```
File "../tools\email_preprocess.py", line 36, in preprocess
    word_data = pickle.load(words_file_handler)
_pickle.UnpicklingError: the STRING opcode argument must be quoted
```

How do I resolve this error? Please help!
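For context, this error typically means a protocol-0 (text-format) pickle had its LF line endings converted to CRLF. A minimal sketch reproducing it with a hand-built pickle byte string (the value `'abc'` is just an example, not the course data):

```python
import pickle

# A tiny protocol-0 pickle (text format, Unix newlines): the string 'abc'.
good = b"S'abc'\np0\n."
# The same bytes after a CRLF line-ending conversion, as Git might produce.
bad = good.replace(b"\n", b"\r\n")

print(pickle.loads(good))  # 'abc'
try:
    pickle.loads(bad)
except pickle.UnpicklingError as exc:
    # The stray '\r' keeps the STRING opcode's argument from ending in a quote.
    print(exc)
```

The `S` (STRING) opcode reads one line and expects it to be a quoted string; the leftover `'\r'` byte breaks that check, which is exactly the error above.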

@hat20

hat20 commented Aug 8, 2018

Hi!
I had the same problem.
The pickle file has to use Unix newlines; otherwise, at least Python 3.4's C pickle parser fails with the exception `pickle.UnpicklingError: the STRING opcode argument must be quoted`.
I think that some Git versions/configurations may be changing the Unix newlines ('\n') to DOS line endings ('\r\n').

You may use this code to convert "word_data.pkl" to "word_data_unix.pkl", and then use the new .pkl file in the script "nb_author_id.py":

```python
#!/usr/bin/env python
"""
convert dos linefeeds (crlf) to unix (lf)
usage: dos2unix.py
"""
original = "word_data.pkl"
destination = "word_data_unix.pkl"

content = ''
outsize = 0
with open(original, 'rb') as infile:
    content = infile.read()
with open(destination, 'wb') as output:
    for line in content.splitlines():
        outsize += len(line) + 1
        output.write(line + str.encode('\n'))

print("Done. Saved %s bytes." % (len(content) - outsize))
```

dos2unix.py adapted from:
http://stackoverflow.com/a/19702943

Copied this answer as this was originally answered by @monkshow92 . Kudos to him
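As an aside (not from the original answer): `splitlines()` also splits on control bytes such as `\x0b`, `\x0c`, and `\x1c`-`\x1e`, which could in principle occur inside pickled data, so a plain byte-level replace is a slightly safer sketch of the same conversion. File names here are stand-ins; substitute `word_data.pkl` / `word_data_unix.pkl` for the real data:

```python
import pickle

# Sketch: byte-level CRLF -> LF conversion. Unlike splitlines(), a plain
# replace leaves lone '\r' bytes and other control characters untouched.
src, dst = "sample_crlf.pkl", "sample_unix.pkl"  # stand-in file names

# Build a tiny CRLF-damaged input so the sketch is self-contained; with
# the real course data you would skip this step.
with open(src, "wb") as f:
    f.write(b"S'abc'\r\np0\r\n.")

with open(src, "rb") as f:
    data = f.read()
with open(dst, "wb") as f:
    f.write(data.replace(b"\r\n", b"\n"))

with open(dst, "rb") as f:
    print(pickle.load(f))  # the repaired pickle loads again
```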

@buildhigh20

Can you please explain the code above, and how to implement it?

@EddyvK

EddyvK commented Jun 10, 2020

@buildhigh20

The issue is about which kind of "newline" is used by different character encodings/operating systems. Computers store text as binary data using a particular encoding, and the end of a line is written as a different byte sequence on different encodings/OSs. That seems to be the issue at stake here.
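A quick illustration of the difference (nothing project-specific here):

```python
# The same two-line text under the Unix (LF) and DOS (CRLF) conventions.
unix = "line one\nline two\n".encode("ascii")
dos = "line one\r\nline two\r\n".encode("ascii")

print(unix)                  # b'line one\nline two\n'
print(dos)                   # b'line one\r\nline two\r\n'
print(len(dos) - len(unix))  # 2 -> one extra carriage-return byte per line
```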

Monkshow92 and hat20 wrote a small Python script (see above) which converts the one format into the other. I've updated it a bit so that you can pass the name of the original file and the name of a new file (in which the newline characters have been changed) on the command line:

```python
#!/usr/bin/env python
"""
convert dos linefeeds (crlf) to unix (lf)
usage: python dos2unix.py sourcefile destinationfile
"""

import sys

original = sys.argv[1]
destination = sys.argv[2]

content = ''
outsize = 0
with open(original, 'rb') as infile:
    print("reading from: ", original)
    content = infile.read()
with open(destination, 'wb') as output:
    print("writing to: ", destination)
    for line in content.splitlines():
        outsize += len(line) + 1
        output.write(line + str.encode('\n'))

print("Done. Saved %s bytes." % (len(content) - outsize))
```

Please save the code above in a file (e.g. dos2unix.py) and execute it with:

python dos2unix.py sourcefile destinationfile

where you replace "sourcefile" with the name and location of the file you're trying to read, and "destinationfile" with the name of a new file which you will read instead.

@JayanthReddySunkara

I want to create a project. Can you help me? I'm starting with zero knowledge 🤒😭

@jaimep01

jaimep01 commented Jul 4, 2020

Hello everyone, on Win10 this fixed it
(add your whole path to avoid issues):

PowerShell:

```powershell
$path = "\word_data.pkl"
(Get-Content $path -Raw).Replace("`r`n","`n") | Set-Content $path -Force
```

@vkaushik189

I'm using Python 3.8. Follow these steps to fix the issue.

1. Create a file named "doc2unix.py" in the directory /tools/. This is the content of the file (using @hat20's answer):
```python
#!/usr/bin/env python
"""
convert dos linefeeds (crlf) to unix (lf)
usage: python dos2unix.py
"""

import sys

original = 'word_data.pkl'
destination = "word_data_unix.pkl"

content = ''
outsize = 0
with open(original, 'rb') as infile:
    content = infile.read()
with open(destination, 'wb') as output:
    for line in content.splitlines():
        outsize += len(line) + 1
        output.write(line + str.encode('\n'))

print("Done. Saved %s bytes." % (len(content)-outsize))
```
2. Run this file. Make sure there are no errors.

3. The email_preprocess file has certain errors related to Python libraries. I changed a couple of lines in email_preprocess.py to resolve them:

```python
#!/usr/bin/python

import pickle
#import cPickle
import numpy

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif


def preprocess(words_file = "../tools/word_data_unix.pkl", authors_file="../tools/email_authors.pkl"):
    """
        this function takes a pre-made list of email texts (by default word_data.pkl)
        and the corresponding authors (by default email_authors.pkl) and performs
        a number of preprocessing steps:
            -- splits into training/testing sets (10% testing)
            -- vectorizes into tfidf matrix
            -- selects/keeps most helpful features

        after this, the features and labels are put into numpy arrays, which play nice with sklearn functions

        4 objects are returned:
            -- training/testing features
            -- training/testing labels
    """

    ### the words (features) and authors (labels), already largely preprocessed
    ### this preprocessing will be repeated in the text learning mini-project
    authors_file_handler = open(authors_file, "rb")
    authors = pickle.load(authors_file_handler)
    authors_file_handler.close()

    words_file_handler = open(words_file, "rb")
    word_data = pickle.load(words_file_handler)
    words_file_handler.close()

    ### test_size is the percentage of events assigned to the test set
    ### (remainder go into training)
    features_train, features_test, labels_train, labels_test = train_test_split(word_data, authors, test_size=0.1, random_state=42)

    ### text vectorization--go from strings to lists of numbers
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    features_train_transformed = vectorizer.fit_transform(features_train)
    features_test_transformed  = vectorizer.transform(features_test)

    ### feature selection, because text is super high dimensional and
    ### can be really computationally chewy as a result
    selector = SelectPercentile(f_classif, percentile=10)
    selector.fit(features_train_transformed, labels_train)
    features_train_transformed = selector.transform(features_train_transformed).toarray()
    features_test_transformed  = selector.transform(features_test_transformed).toarray()

    ### info on the data
    print("no. of Chris training emails:", sum(labels_train))
    print("no. of Sara training emails:", len(labels_train)-sum(labels_train))

    return features_train_transformed, features_test_transformed, labels_train, labels_test
```
4. Write the answer code in nb_author_id.py and you should get good accuracy.

@vishal19217
Copy link

Sir, you explained it really well, but after trying this as well I am facing this error:

```
D:\kagglecontest\ud120-projects\naive_bayes>py nb_author_id.py
Traceback (most recent call last):
  File "nb_author_id.py", line 24, in <module>
    features_train, features_test, labels_train, labels_test = preprocess()
  File "../tools\email_preprocess.py", line 36, in preprocess
    words_file_handler = open(words_file, "rb")
FileNotFoundError: [Errno 2] No such file or directory: '../tools/word_data_unix.pkl'
```

Kindly help me resolve this issue.

@webpagearshi

Thank you so much @vkaushik189

@kaushikacharya

If you are using Notepad++, then you can manually change the newline format as shown in the figure below.

[image]

@MaguiSella

> If you are using Notepad++, then you can manually change the newline format as shown in the figure above.

This worked for me.

@austinnoah92

> (quoting @vkaushik189's steps above)

Thank you so much. I was stuck for over four hours until I stumbled on this.

@DataPavel

@vkaushik189 Thank you! That worked for me:)

@Sean-Miningah

Worked Perfectly. Thanks ;) @austinnoah92

@Hichemhino

I fixed the issue using #181 (comment) and changing the name of the path in the 'email_preprocess.py' file.

[image]

@engyahmedtarek1

Thank you, it finally worked.

@CBanafo

CBanafo commented Sep 23, 2022

Thanks man @vkaushik189 worked for me as well. Python 3.7

> (quoting @vkaushik189's steps above)

@sumitkr2000

Thanks a lot @vkaushik189, it helped. I was stuck on it for a day :)

@jamumamu

This can be fixed permanently using a .gitattributes file in the root of the repo containing:

```
*.pkl text eol=lf
```
