Data Preprocessing #6

Closed
giriallada opened this issue Apr 20, 2019 · 3 comments

@giriallada

Hey, I have a query about the data preprocessing part for models 4 and 5. Whenever I try to preprocess the data, this is what I end up with:

  Traceback (most recent call last):
    File "process_English.py", line 290, in <module>
      reviews = pd.read_csv(reviews_csv,header = 1) #skip first row (of header)
    File "/home/giri/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
      return _read(filepath_or_buffer, kwds)
    File "/home/giri/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 440, in _read
      parser = TextFileReader(filepath_or_buffer, **kwds)
    File "/home/giri/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 787, in __init__
      self._make_engine(self.engine)
    File "/home/giri/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
      self._engine = CParserWrapper(self.f, **self.options)
    File "/home/giri/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1708, in __init__
      self._reader = parsers.TextReader(src, **kwds)
    File "pandas/_libs/parsers.pyx", line 539, in pandas._libs.parsers.TextReader.__cinit__
    File "pandas/_libs/parsers.pyx", line 751, in pandas._libs.parsers.TextReader._get_header
  pandas.errors.ParserError: Passed header=1 but only 1 lines in file

I have preprocessed the data using the steps that abisee gave, but I don't understand the CSV part of your method.

@theamrzaki
Owner

The script expects a spreadsheet saved as a CSV file. What is the format of the file that you used?

@theamrzaki theamrzaki added the Data Processing text dataset issues label Apr 20, 2019
@theamrzaki
Owner

On line 290 of the preprocessing script (process_English.py), use the following:

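  import os
  import pandas as pd

  # Join the path portably; the hard-coded "\ArabicBook00.csv" only works on Windows
  reviews_csv = os.path.join(cnn_stories_dir, "ArabicBook00.csv")
  reviews = pd.read_csv(reviews_csv)  # header = 1 has been removed; the first row is the header
  reviews = reviews.filter(['content', 'title'])  # keep only the two columns that are used
  reviews = reviews.dropna()  # drop rows with a missing content or title
  reviews = reviews.reset_index(drop=True)
  reviews.head()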

This reads a spreadsheet saved in CSV format, laid out as follows:

| content | title |
| --- | --- |
| content of first article | title of first article |
| content of second article | title of second article |
| content of third article | title of third article |
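
Before handing a file to the script, it may help to confirm that it really has this layout. The following is a minimal sketch using only the standard library `csv` module; the function name `check_reviews_csv` and the sample rows are hypothetical and not part of the repo.

```python
import csv
import io

# Column names taken from the layout described above.
REQUIRED_COLUMNS = ("content", "title")


def check_reviews_csv(handle):
    """Return the number of data rows, or raise ValueError describing the problem.

    Hypothetical helper: `handle` is any open text file or file-like object.
    """
    reader = csv.reader(handle)
    header = next(reader, None)
    if header is None:
        raise ValueError("file is empty")
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing column(s): {missing}")
    # Count non-empty rows after the header.
    n_rows = sum(1 for row in reader if row)
    if n_rows == 0:
        raise ValueError("header found but no data rows")
    return n_rows


# Illustrative in-memory file with the expected header and two articles.
sample = io.StringIO(
    "content,title\n"
    "content of first article,title of first article\n"
    "content of second article,title of second article\n"
)
print(check_reviews_csv(sample))
```

A file that contains only a single line fails this check, which is the same situation the `ParserError` in the traceback reports when `header = 1` is passed.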

Here is a link to Reviews.csv, which you can use for this task.


If you need the ready-made output of this preprocessing to help you get started, you can use this link. You can also follow this repo for more information about preprocessing the data.

This preprocessed data contains:

  1. a folder for training
  2. a folder for testing
  3. a folder for validation
  4. a vocab file

These folders contain the data in a binary chunked format.
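
The exact byte layout of the chunks is not described in this thread. As a rough illustration of what "binary chunked" can mean, here is a sketch that assumes each record is stored as an 8-byte length prefix followed by its payload, with records grouped into fixed-size chunks. The function names, the `"<q"` prefix format, and the chunk size are all assumptions made for illustration, not the repo's actual format.

```python
import io
import struct

LENGTH_PREFIX = "<q"  # assumed: little-endian signed 64-bit record length
PREFIX_SIZE = struct.calcsize(LENGTH_PREFIX)  # 8 bytes


def write_records(handle, records):
    """Write each bytes record as a length prefix followed by the payload."""
    for record in records:
        handle.write(struct.pack(LENGTH_PREFIX, len(record)))
        handle.write(record)


def read_records(handle):
    """Read back every length-prefixed record from a binary handle."""
    records = []
    while True:
        prefix = handle.read(PREFIX_SIZE)
        if not prefix:
            break  # clean end of file
        if len(prefix) < PREFIX_SIZE:
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack(LENGTH_PREFIX, prefix)
        payload = handle.read(length)
        if len(payload) != length:
            raise ValueError("truncated record")
        records.append(payload)
    return records


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Illustrative data: three records split into chunks of two.
articles = [b"first article", b"second article", b"third article"]
for chunk in chunked(articles, 2):
    buffer = io.BytesIO()  # stands in for one chunk file on disk
    write_records(buffer, chunk)
    buffer.seek(0)
    print(read_records(buffer))
```

In a layout like this, a reader can stream one chunk file at a time instead of loading the whole dataset into memory.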

@theamrzaki
Owner

We are closing this issue due to inactivity. Feel free to open a new issue if another problem appears; I would truly like to help.
