
_pickle.UnpicklingError: the STRING opcode argument must be quoted #181

Open
nit-esh opened this issue Jul 13, 2018 · 18 comments

Comments

@nit-esh

nit-esh commented Jul 13, 2018

```
File "../tools\email_preprocess.py", line 36, in preprocess
    word_data = pickle.load(words_file_handler)
_pickle.UnpicklingError: the STRING opcode argument must be quoted
```

How do I resolve this error? Please help!
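For context, this error typically means a protocol-0 (text-format) pickle had its LF line endings converted to CRLF. A minimal sketch reproducing it with a hand-built pickle byte string (the value `'abc'` is just an example, not the course data):

```python
import pickle

# A tiny protocol-0 pickle (text format, Unix newlines): the string 'abc'.
good = b"S'abc'\np0\n."
# The same bytes after a CRLF line-ending conversion, as Git might produce.
bad = good.replace(b"\n", b"\r\n")

print(pickle.loads(good))  # 'abc'
try:
    pickle.loads(bad)
except pickle.UnpicklingError as exc:
    # The stray '\r' keeps the STRING opcode's argument from ending in a quote.
    print(exc)
```

The `S` (STRING) opcode reads one line and expects it to be a quoted string; the leftover `'\r'` byte breaks that check, which is exactly the error above.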

@hat20

hat20 commented Aug 8, 2018

Hi!
I had the same problem.
The pickle file has to use Unix newlines; otherwise, at least Python 3.4's C pickle parser fails with the exception `pickle.UnpicklingError: the STRING opcode argument must be quoted`.
I think that some Git versions/configurations may be changing the Unix newlines ('\n') to DOS line endings ('\r\n').

You may use this code to convert "word_data.pkl" to "word_data_unix.pkl", and then use the new .pkl file in the script "nb_author_id.py":

```python
#!/usr/bin/env python
"""
convert dos linefeeds (crlf) to unix (lf)
usage: dos2unix.py
"""
original = "word_data.pkl"
destination = "word_data_unix.pkl"

content = ''
outsize = 0
with open(original, 'rb') as infile:
    content = infile.read()
with open(destination, 'wb') as output:
    for line in content.splitlines():
        outsize += len(line) + 1
        output.write(line + str.encode('\n'))

print("Done. Saved %s bytes." % (len(content) - outsize))
```

dos2unix.py adapted from:
http://stackoverflow.com/a/19702943

Copied this answer as this was originally answered by @monkshow92 . Kudos to him
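As an aside (not from the original answer): `splitlines()` also splits on control bytes such as `\x0b`, `\x0c`, and `\x1c`-`\x1e`, which could in principle occur inside pickled data, so a plain byte-level replace is a slightly safer sketch of the same conversion. File names here are stand-ins; substitute `word_data.pkl` / `word_data_unix.pkl` for the real data:

```python
import pickle

# Sketch: byte-level CRLF -> LF conversion. Unlike splitlines(), a plain
# replace leaves lone '\r' bytes and other control characters untouched.
src, dst = "sample_crlf.pkl", "sample_unix.pkl"  # stand-in file names

# Build a tiny CRLF-damaged input so the sketch is self-contained; with
# the real course data you would skip this step.
with open(src, "wb") as f:
    f.write(b"S'abc'\r\np0\r\n.")

with open(src, "rb") as f:
    data = f.read()
with open(dst, "wb") as f:
    f.write(data.replace(b"\r\n", b"\n"))

with open(dst, "rb") as f:
    print(pickle.load(f))  # the repaired pickle loads again
```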

@buildhigh20

Can you please explain the code above, and how to implement it?

@EddyvK

EddyvK commented Jun 10, 2020

@buildhigh20

The issue is about which kind of "newline" is used by different character encodings/operating systems. Computers store text as binary data using a particular encoding, and the end of a line is written as a different byte sequence on different encodings/OSs. That seems to be the issue at stake here.
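A quick illustration of the difference (nothing project-specific here):

```python
# The same two-line text under the Unix (LF) and DOS (CRLF) conventions.
unix = "line one\nline two\n".encode("ascii")
dos = "line one\r\nline two\r\n".encode("ascii")

print(unix)                  # b'line one\nline two\n'
print(dos)                   # b'line one\r\nline two\r\n'
print(len(dos) - len(unix))  # 2 -> one extra carriage-return byte per line
```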

Monkshow92 and hat20 wrote a small Python script (see above) which converts the one format into the other. I've updated it a bit so that you can pass the name of the original file and the name of a new file (in which the newline characters have been changed) on the command line:

```python
#!/usr/bin/env python
"""
convert dos linefeeds (crlf) to unix (lf)
usage: python dos2unix.py sourcefile destinationfile
"""

import sys

original = sys.argv[1]
destination = sys.argv[2]

content = ''
outsize = 0
with open(original, 'rb') as infile:
    print("reading from: ", original)
    content = infile.read()
with open(destination, 'wb') as output:
    print("writing to: ", destination)
    for line in content.splitlines():
        outsize += len(line) + 1
        output.write(line + str.encode('\n'))

print("Done. Saved %s bytes." % (len(content) - outsize))
```

Please save the code above in a file (e.g. dos2unix.py) and execute it with:

python dos2unix.py sourcefile destinationfile

where you replace "sourcefile" with the name and location of the file you're trying to read, and "destinationfile" with the name of a new file which you will read instead.

@JayanthReddySunkara

I want to create a project. Can you help me? I'm starting with zero knowledge 🤒😭

@jaimep01

jaimep01 commented Jul 4, 2020

Hello everyone, on Win10 this fixed it
(add your whole path to avoid issues):

PowerShell:

```powershell
$path = "\word_data.pkl"
(Get-Content $path -Raw).Replace("`r`n","`n") | Set-Content $path -Force
```

@vkaushik189

I'm using Python 3.8. Follow these steps to fix the issue.

1. Create a file named "doc2unix.py" in the directory /tools/. This is the content of the file (using @hat20's answer):
```python
#!/usr/bin/env python
"""
convert dos linefeeds (crlf) to unix (lf)
usage: python dos2unix.py
"""

import sys

original = 'word_data.pkl'
destination = "word_data_unix.pkl"

content = ''
outsize = 0
with open(original, 'rb') as infile:
    content = infile.read()
with open(destination, 'wb') as output:
    for line in content.splitlines():
        outsize += len(line) + 1
        output.write(line + str.encode('\n'))

print("Done. Saved %s bytes." % (len(content)-outsize))
```
2. Run this file. Make sure there are no errors.

3. The email_preprocess file has certain errors related to Python libraries. I changed a couple of lines in email_preprocess.py to resolve them:

```python
#!/usr/bin/python

import pickle
#import cPickle
import numpy

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif


def preprocess(words_file = "../tools/word_data_unix.pkl", authors_file="../tools/email_authors.pkl"):
    """
        this function takes a pre-made list of email texts (by default word_data.pkl)
        and the corresponding authors (by default email_authors.pkl) and performs
        a number of preprocessing steps:
            -- splits into training/testing sets (10% testing)
            -- vectorizes into tfidf matrix
            -- selects/keeps most helpful features

        after this, the features and labels are put into numpy arrays, which play nice with sklearn functions

        4 objects are returned:
            -- training/testing features
            -- training/testing labels
    """

    ### the words (features) and authors (labels), already largely preprocessed
    ### this preprocessing will be repeated in the text learning mini-project
    authors_file_handler = open(authors_file, "rb")
    authors = pickle.load(authors_file_handler)
    authors_file_handler.close()

    words_file_handler = open(words_file, "rb")
    word_data = pickle.load(words_file_handler)
    words_file_handler.close()

    ### test_size is the percentage of events assigned to the test set
    ### (remainder go into training)
    features_train, features_test, labels_train, labels_test = train_test_split(word_data, authors, test_size=0.1, random_state=42)

    ### text vectorization--go from strings to lists of numbers
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    features_train_transformed = vectorizer.fit_transform(features_train)
    features_test_transformed  = vectorizer.transform(features_test)

    ### feature selection, because text is super high dimensional and
    ### can be really computationally chewy as a result
    selector = SelectPercentile(f_classif, percentile=10)
    selector.fit(features_train_transformed, labels_train)
    features_train_transformed = selector.transform(features_train_transformed).toarray()
    features_test_transformed  = selector.transform(features_test_transformed).toarray()

    ### info on the data
    print("no. of Chris training emails:", sum(labels_train))
    print("no. of Sara training emails:", len(labels_train)-sum(labels_train))

    return features_train_transformed, features_test_transformed, labels_train, labels_test
```
4. Write the answer code in nb_author_id.py and you should get good accuracy.

@vishal19217
Copy link

Sir, you explained it really well, but after trying this as well I am facing this error:

```
D:\kagglecontest\ud120-projects\naive_bayes>py nb_author_id.py
Traceback (most recent call last):
  File "nb_author_id.py", line 24, in <module>
    features_train, features_test, labels_train, labels_test = preprocess()
  File "../tools\email_preprocess.py", line 36, in preprocess
    words_file_handler = open(words_file, "rb")
FileNotFoundError: [Errno 2] No such file or directory: '../tools/word_data_unix.pkl'
```

Kindly help me resolve this issue.

@webpagearshi

Thank you so much @vkaushik189

@kaushikacharya

If you are using Notepad++, then you can manually change the newline format as shown in the figure below.

[image]

@MaguiSella

> If you are using Notepad++, then you can manually change the newline format as shown in the figure above.

This worked for me.

@austinnoah92

> (quoting @vkaushik189's steps above)

Thank you so much. I was stuck for over four hours until I stumbled on this.

@DataPavel

@vkaushik189 Thank you! That worked for me:)

@Sean-Miningah

Worked Perfectly. Thanks ;) @austinnoah92

@Hichemhino

I fixed the issue using #181 (comment) and changing the name of the path in the 'email_preprocess.py' file.

[image]

@engyahmedtarek1

Thank you, it finally worked.

@CBanafo

CBanafo commented Sep 23, 2022

Thanks man @vkaushik189 worked for me as well. Python 3.7

> (quoting @vkaushik189's steps above)

@sumitkr2000

Thanks a lot @vkaushik189, it helped. I was stuck on it for a day :)

@jamumamu

This can be fixed permanently using a .gitattributes file in the root of the repo containing:

```
*.pkl text eol=lf
```
