## First part: Clean and transform data

WARNING: Run the cells below only once. They overwrite books.csv in place, so running them again may break the CSV file. The repository includes a books_raw.csv file that you can use as the original. Always make a copy of the original file before running the code.
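
Because the repair cells rewrite the file in place, one way to follow this advice is to restore the working file from the untouched original before each run. This is a minimal sketch, not part of the original notebook: the helper name `restore_from_raw` is hypothetical, and the paths in the example call are assumptions about the repository layout.

```python
import shutil
from pathlib import Path


def restore_from_raw(raw_path, working_path):
    """Overwrite the working CSV with a fresh copy of the untouched original."""
    raw_path, working_path = Path(raw_path), Path(working_path)
    if not raw_path.exists():
        raise FileNotFoundError(f"original file not found: {raw_path}")
    # copy2 copies the contents and also preserves metadata such as timestamps
    shutil.copy2(raw_path, working_path)
    return working_path


# Example call (assumed paths, adjust to your checkout):
# restore_from_raw('../Dataset/books_raw.csv', '../Dataset/books.csv')
```

Running this before the cells below resets books.csv, so the repairs always start from the same known state.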

In [61]:
import pandas as pd

try:
    df = pd.read_csv('../Dataset/books.csv')
except Exception as e:
    print(e)


Error tokenizing data. C error: Expected 12 fields in line 3350, saw 13



A first attempt to load the file with pandas shows that books.csv has a malformed row at line 3350: the parser expected 12 fields but found 13.

In [62]:
# print line 3350 of books.csv without pandas (index 3349, because enumerate counts from 0)
with open('../Dataset/books.csv', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        if i == 3349:
            print(line)
            print("The line "+ str(i+1) + " has: " + str(len(line.split(','))) + " elements separated by commas")
            break


12224,Streetcar Suburbs: The Process of Growth in Boston  1870-1900,Sam Bass Warner, Jr./Sam B. Warner,3.58,0674842111,9780674842113,en-US,236,61,6,4/20/2004,Harvard University Press

The line 3350 has: 13 elements separated by commas


Looking at the line, we can see that the error comes from the authors column: the name "Sam Bass Warner, Jr." contains a comma, so the parser reads the text after it as a new column.
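
pandas stops at the first malformed row, so each error only shows up after the previous one is fixed. The sketch below lists every suspicious line in one pass by counting comma-separated fields, the same way the cells in this notebook do. It assumes the file has no quoted fields, and the function name `find_bad_lines` is made up for illustration.

```python
def find_bad_lines(path, expected_fields=12):
    """Return (line_number, field_count) for each line with an unexpected field count."""
    bad = []
    with open(path, 'r', encoding='utf-8') as f:
        # start=1 so the numbers match the line numbers pandas reports
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines, e.g. at the end of the file
            n_fields = len(line.rstrip('\n').split(','))
            if n_fields != expected_fields:
                bad.append((line_number, n_fields))
    return bad


# Example call (same path as the cells in this notebook):
# print(find_bad_lines('../Dataset/books.csv'))
```

An empty list after the repairs means every row has the expected number of fields.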

In [63]:
with open('../Dataset/books.csv', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        if i == 3349:
            # print the authors
            print(line.split(',')[2:4])

['Sam Bass Warner', ' Jr./Sam B. Warner']


A simple fix for this error is to delete the comma in front of 'Jr.'.

In [64]:
with open('../Dataset/books.csv', 'r', encoding='utf-8') as f:
    lines = f.readlines()

with open('../Dataset/books.csv', 'w', encoding='utf-8') as f:
    for i, line in enumerate(lines):
        if i == 3349:
            # merge the two halves of the authors field (indices 2 and 3), dropping the stray comma, and save the line back to the file
            line = line.split(',')
            line[2] = line[2] + line[3]
            line.pop(3)
            f.write(','.join(line))
        else:
            f.write(line)
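
The cell above is the reason for the warning at the top: on a second run, line 3350 already has 12 fields, so merging indices 2 and 3 would glue the authors onto the rating. The sketch below adds a guard that merges only when a line has too many fields. The function name and its default arguments are assumptions for illustration, not part of the notebook.

```python
def merge_extra_author_fields(line, expected_fields=12, author_index=2):
    """Merge surplus fields into the authors field; leave well-formed lines alone."""
    fields = line.split(',')
    extra = len(fields) - expected_fields
    if extra <= 0:
        return line  # already well formed, so a second run changes nothing
    # Join the authors field with the surplus fields that follow it,
    # dropping the stray commas just like the cell above.
    end = author_index + extra + 1
    fields[author_index:end] = [''.join(fields[author_index:end])]
    return ','.join(fields)
```

Applying this function to every line before writing it back would make the repair safe to repeat, at the cost of assuming that surplus commas always come from the authors column.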

Now, by running the first cell again, we can see that this error is fixed, but the file still contains other errors. The code below fixes the next one, at line 4704.
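
Since several rows share this problem, an alternative that leaves the file on disk untouched is to repair over-long rows while parsing. The sketch below assumes pandas 1.4 or later, where `read_csv` accepts a callable for `on_bad_lines` when `engine='python'`. The in-memory sample data and the function name `merge_author_fields` are invented for illustration.

```python
import io

import pandas as pd

EXPECTED_FIELDS = 4  # the sample has 4 columns; books.csv has 12
AUTHOR_INDEX = 2

# Tiny made-up sample whose last row has a comma inside the authors field
sample = io.StringIO(
    "bookID,title,authors,average_rating\n"
    "1,Book A,Author One,4.1\n"
    "2,Book B,Warner, Jr./Other,3.58\n"
)


def merge_author_fields(bad_line):
    """Receive the fields of an over-long row and merge the surplus into authors."""
    extra = len(bad_line) - EXPECTED_FIELDS
    end = AUTHOR_INDEX + extra + 1
    merged = ''.join(bad_line[AUTHOR_INDEX:end])
    return bad_line[:AUTHOR_INDEX] + [merged] + bad_line[end:]


# The callable form of on_bad_lines is only supported by the python engine
df = pd.read_csv(sample, engine='python', on_bad_lines=merge_author_fields)
print(df)
```

The python engine is slower than the default C parser, so this trades speed for not having to edit the file row by row.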

In [65]:
import pandas as pd

try:
    df = pd.read_csv('../Dataset/books.csv')
except Exception as e:
    print(e)

Error tokenizing data. C error: Expected 12 fields in line 4704, saw 13



In [66]:
with open('../Dataset/books.csv', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        if i == 4703:
            print(line)
            print("The line "+ str(i+1) + " has: " + str(len(line.split(','))) + " elements separated by commas")
            break


16914,The Tolkien Fan's Medieval Reader,David E. Smith (Turgon of TheOneRing.net, one of the founding members of this Tolkien website)/Verlyn Flieger/Turgon (=David E. Smith),3.58,1593600119,9781593600112,eng,400,26,4,4/6/2004,Cold Spring Press

The line 4704 has: 13 elements separated by commas


In [67]:
with open('../Dataset/books.csv', 'r', encoding='utf-8') as f:
    lines = f.readlines()

with open('../Dataset/books.csv', 'w', encoding='utf-8') as f:
    for i, line in enumerate(lines):
        if i == 4703:
            # merge the two halves of the authors field (indices 2 and 3), dropping the stray comma, and save the line back to the file
            line = line.split(',')
            line[2] = line[2] + line[3]
            line.pop(3)
            f.write(','.join(line))
        else:
            f.write(line)

The next malformed row is at line 5879, again caused by a comma inside the authors field.

In [68]:
with open('../Dataset/books.csv', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        if i == 5878:
            print(line)
            print("The line "+ str(i+1) + " has: " + str(len(line.split(','))) + " elements separated by commas")
            break

22128,Patriots (The Coming Collapse),James Wesley, Rawles,3.63,156384155X,9781563841552,eng,342,38,4,1/15/1999,Huntington House Publishers

The line 5879 has: 13 elements separated by commas


In [69]:
with open('../Dataset/books.csv', 'r', encoding='utf-8') as f:
    lines = f.readlines()

with open('../Dataset/books.csv', 'w', encoding='utf-8') as f:
    for i, line in enumerate(lines):
        if i == 5878:
            # merge the two halves of the authors field (indices 2 and 3), dropping the stray comma, and save the line back to the file
            line = line.split(',')
            line[2] = line[2] + line[3]
            line.pop(3)
            f.write(','.join(line))
        else:
            f.write(line)

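After the merge, the authors field on this line reads "James Wesley Rawles". The next cell inserts a slash after "Wesley" and makes sure every line ends with a newline.

In [70]: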
with open('../Dataset/books.csv', 'r', encoding='utf-8') as f:
    lines = f.readlines()

with open('../Dataset/books.csv', 'w', encoding='utf-8') as f:
    for i, line in enumerate(lines):
        if i == 5878:
            # modify the authors field (index 2) by adding a slash after the string "Wesley"
            line = line.split(',')
            line[2] = line[2].replace('Wesley', 'Wesley /')
            line = ','.join(line)
        if not line.endswith('\n'):
            line += '\n'
        f.write(line)

Loading the file again reveals the next error, at line 8981.

In [71]:
import pandas as pd

try:
    df = pd.read_csv('../Dataset/books.csv')
except Exception as e:
    print(e)

Error tokenizing data. C error: Expected 12 fields in line 8981, saw 13



In [72]:
with open('../Dataset/books.csv', 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        if i == 8980:
            print(line)
            print("The line "+ str(i+1) + " has: " + str(len(line.split(','))) + " elements separated by commas")
            break

34889,Brown's Star Atlas: Showing All The Bright Stars With Full Instructions How To Find And Use Them For Navigational Purposes And Department Of Trade Examinations.,Brown, Son & Ferguson,0.00,0851742718,9780851742717,eng,49,0,0,5/1/1977,Brown Son & Ferguson Ltd.

The line 8981 has: 13 elements separated by commas


In [73]:
with open('../Dataset/books.csv', 'r', encoding='utf-8') as f:
    lines = f.readlines()

with open('../Dataset/books.csv', 'w', encoding='utf-8') as f:
    for i, line in enumerate(lines):
        if i == 8980:
            # merge the two halves of the authors field (indices 2 and 3), dropping the stray comma, and save the line back to the file
            line = line.split(',')
            line[2] = line[2] + line[3]
            line.pop(3)
            f.write(','.join(line))
        else:
            f.write(line)

In [74]:
with open('../Dataset/books.csv', 'r', encoding='utf-8') as f:
    lines = f.readlines()

with open('../Dataset/books.csv', 'w', encoding='utf-8') as f:
    for i, line in enumerate(lines):
        if i == 8980:
            # modify the authors field (index 2) by adding a slash after the string "Brown"
            line = line.split(',')
            line[2] = line[2].replace('Brown', 'Brown /')
            line = ','.join(line)
        if not line.endswith('\n'):
            line += '\n'
        f.write(line)