### Install the module to get started.

In [1]:
%pip install comma-fixer --upgrade

Note: you may need to restart the kernel to use updated packages.


In [2]:
from comma_fixer.column import Column
from comma_fixer.fixer import Fixer, create_chunks
from comma_fixer.schema import Schema
import pandas as pd

# Creating a Schema

To create a Schema, we define every column with its name and type, along with arguments stating whether the column is nullable and whether it may contain commas or spaces.

Each column will have a function to determine whether a given token can be placed within that column. 

For `datetime` columns, it is critical that the input is in the ISO 8601 datetime format, i.e. `%Y-%m-%d` or `YYYY-MM-DD`, as this is the format accepted 
by the `pandas` library used for storing the dataset before exporting to CSV.
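
As a rough illustration of this format requirement, the sketch below checks whether a token parses with `%Y-%m-%d` using only the standard library. The helper name `is_iso_date` is hypothetical and is not part of `comma-fixer`; how the library itself validates dates is not shown here.

```python
from datetime import datetime


def is_iso_date(token: str) -> bool:
    """Hypothetical helper: True if the token parses with the %Y-%m-%d format."""
    try:
        datetime.strptime(token, "%Y-%m-%d")
    except ValueError:
        return False
    return True


print(is_iso_date("2025-09-18"))  # True
print(is_iso_date("18/09/2025"))  # False
print(is_iso_date("2025-02-30"))  # False, 30 February does not exist
```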

For example, assume we have the entry string "1,Bob,Johnson,twenty three,False,", and are checking whether "twenty three" is suitable for the "age" column.
Since the "age" column only accepts numeric values, it will return False. However, if the entry string were "1,Bob,Johnson,23,False,", the column's verifier 
would return True for "23".
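
The idea of a per-column check can be sketched without the library. The function `accepts_numeric` below is a hypothetical stand-in for a numeric column's verifier, written only to illustrate the example above; it is not the actual `comma-fixer` implementation.

```python
def accepts_numeric(token: str) -> bool:
    """Hypothetical verifier: accept tokens that parse as a number."""
    try:
        float(token)
    except ValueError:
        return False
    return True


row = "1,Bob,Johnson,twenty three,False,"
tokens = row.split(",")

print(accepts_numeric(tokens[3]))  # False, "twenty three" is not numeric
print(accepts_numeric("23"))       # True
```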

Columns of other types can be created with `Column.new`, but a `pd.Series` object must be supplied so that a `pd.DataFrame` can be built when exporting to CSV. This requires
importing `pandas`. All arguments must be given.



In [3]:
schema_1 = Schema.new(columns=[
    Column.numeric(name="id"),
    Column.string(name="firstname", is_nullable=False, has_commas=False, has_spaces=False),
    Column.string(name="lastname", is_nullable=False, has_commas=False, has_spaces=False),
    Column.numeric(name="age"),
    Column.new(name="cat_owner", data_type=bool, series_type=pd.Series(dtype=bool), is_nullable=False, has_commas=False, has_spaces=False, format=None),
    Column.string(name="cat_names", is_nullable=True, has_commas=True, has_spaces=True)
    ]
)

After creating a Schema, its contents can be displayed in a table format. However, new columns cannot be added to an existing Schema.

In [4]:
schema_1.info()

Unnamed: 0,name,type,nullable,has commas,has spaces,format
0,id,int,False,False,False,
1,firstname,str,False,False,False,
2,lastname,str,False,False,False,
3,age,int,False,False,False,
4,cat_owner,bool,False,False,False,
5,cat_names,str,True,True,True,


# Fixer

After creating a Schema, it can be used to create a `Fixer`.

In [5]:
fixer_1 = Fixer.new(schema_1)

Files are processed one at a time by passing the filepath to the fixer. This creates a `Parsed` object from which the processed, valid rows can be exported to a CSV file and the invalid rows can be viewed.

Primarily, invalid rows may occur if there are multiple ways of parsing the row to fit the schema, or there is no valid parsing. This may be a result of a weak, non-restrictive schema. To fix this, the schema should contain further restrictive elements such as RegEx formatting.

If `show_possible_parses` is enabled in `fix_file`, the possible tokenisations of invalid rows are also printed to help with fixing them.

## Example 1

The `example_1.csv` file only has one column with commas, so there should not be any invalid rows aside from rows which are missing values.

In [6]:
! cat ./examples/example_1.csv

id,firstname,lastname,age,cat_owner,cat_names
1,John,Appleseed,43,True,Apple
2,John,Wick,35,False,
3,Bob,Smiles,25,True,Fluffy,Fluffy Sr.
4,Jennifer,Law,26,True,Snowy
5,Taylor,Fast,35,True,Grey,Benson,Button
6,Tom,Jack,18,True,,Mimi
7,Jake,Howler,30,True,,,Mob,Psycho
8,Pujan,,Sir,32,True,,,


In [7]:
parsed_example_1 = fixer_1.fix_file(file="./examples/example_1.csv", skip_first_line=True, show_possible_parses=True)



File has been processed!
Number of total entries: 9            
 Number of invalid entries: 1


When passing in a filepath, the default encoding is "utf-8". If the file to be processed has a different encoding, it can be passed in as an argument.

For example, for files in Thai encoding:
- `.fix_file(file="/path/to/csv/file.csv", encoding="Cp874")` 
- `.fix_file(file="../examples/example_1.csv", encoding="TIS-620")`
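
To see why the encoding matters, the standard-library sketch below encodes a short Thai word with both codecs named above and then tries to decode the bytes as UTF-8. It does not use `comma-fixer`; the sample word is our own.

```python
thai_text = "แมว"  # "cat" in Thai

# Both codecs ship with Python and give the same bytes for Thai letters.
as_cp874 = thai_text.encode("cp874")
as_tis620 = thai_text.encode("tis-620")
print(as_cp874 == as_tis620)  # True

# The same bytes are not valid UTF-8, so the default encoding would fail.
try:
    as_cp874.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
print(utf8_failed)  # True
```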

`TextIOWrapper` types can also be passed into the `fix_file` function if the user would like to open the file themselves, as shown in the example below.

In [8]:
with open("./examples/example_1.csv", "r", encoding="Cp874") as file_buffer:
    parsed_example_1 = fixer_1.fix_file(file=file_buffer, skip_first_line=True, show_possible_parses=True)



File has been processed!
Number of total entries: 9            
 Number of invalid entries: 1


In [9]:
parsed_example_1.export_to_csv_best_effort(filepath="./examples/example_1_parsed.csv")

INFO:Parsed Logs:None


<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, 0 to 6
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         7 non-null      object
 1   firstname  7 non-null      object
 2   lastname   7 non-null      object
 3   age        7 non-null      object
 4   cat_owner  7 non-null      object
 5   cat_names  7 non-null      object
dtypes: object(6)
memory usage: 392.0+ bytes


In [10]:
! cat ./examples/example_1_parsed.csv

id,firstname,lastname,age,cat_owner,cat_names
1,John,Appleseed,43,True,Apple
2,John,Wick,35,False,
3,Bob,Smiles,25,True,"Fluffy,Fluffy Sr."
4,Jennifer,Law,26,True,Snowy
5,Taylor,Fast,35,True,"Grey,Benson,Button"
6,Tom,Jack,18,True,Mimi
7,Jake,Howler,30,True,"Mob,Psycho"


The parsed, valid entries can also be returned as a `pandas.DataFrame` object for further manipulation. 

In [11]:
parsed_example_1.convert_to_dataframe_best_effort()

INFO:Parsed Logs:None


<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, 0 to 6
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         7 non-null      object
 1   firstname  7 non-null      object
 2   lastname   7 non-null      object
 3   age        7 non-null      object
 4   cat_owner  7 non-null      object
 5   cat_names  7 non-null      object
dtypes: object(6)
memory usage: 392.0+ bytes


Unnamed: 0,id,firstname,lastname,age,cat_owner,cat_names
0,1,John,Appleseed,43,True,Apple
1,2,John,Wick,35,False,
2,3,Bob,Smiles,25,True,"Fluffy,Fluffy Sr."
3,4,Jennifer,Law,26,True,Snowy
4,5,Taylor,Fast,35,True,"Grey,Benson,Button"
5,6,Tom,Jack,18,True,Mimi
6,7,Jake,Howler,30,True,"Mob,Psycho"


We see that only 7 out of 8 entries are in the parsed CSV, meaning that only 7 rows were valid according to the schema created.

We can inspect the invalid entries by printing them out, by viewing them as a dataframe (if there are many invalid entries), or by exporting them to a CSV file, similarly to how we would export the valid entries.

For invalid entries, the exported columns will be `line number, invalid entry`, where the line number refers to the original CSV file.
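
Because each invalid entry is exported as a single quoted field, a standard CSV reader returns it intact, commas included. The sketch below reads text that mirrors the exported invalid-entries file from this example, using only the `csv` module; it is one possible way to post-process the export, not something the library requires.

```python
import csv
import io

# Same shape as the exported file: a header plus one quoted entry.
exported = io.StringIO('line number,invalid entry\n8,"8,Pujan,,Sir,32,True,,,"\n')

records = list(csv.DictReader(exported))
for record in records:
    line_number = int(record["line number"])
    raw_tokens = record["invalid entry"].split(",")
    print(line_number, len(raw_tokens))  # 8 9 (nine tokens for a six-column schema)
```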

In [12]:
parsed_example_1.print_all_invalid_entries()
parsed_example_1.export_invalid_entries_to_csv(filepath="./examples/example_1_invalid.csv")

INFO:Parsed Logs:None


Index	Line entry
8	8,Pujan,,Sir,32,True,,,
<class 'pandas.core.frame.DataFrame'>
Index: 1 entries, 0 to 0
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   line number    1 non-null      int64 
 1   invalid entry  1 non-null      object
dtypes: int64(1), object(1)
memory usage: 24.0+ bytes


In [13]:
! cat ./examples/example_1_invalid.csv

line number,invalid entry
8,"8,Pujan,,Sir,32,True,,,"


We see that the only invalid entry is on line 8, and it is invalid because there is a null value in a non-nullable column (lastname).

## Example 2

However, if several consecutive columns allow commas, the fixer cannot parse rows as effectively as it can with other schemas, i.e. schemas where the columns with commas are separated by a column of a different type, such as a numeric type.

An example is shown below.

In [14]:
schema_2 = Schema.new(columns=[
    Column.numeric("id"),
    Column.string("cat_names", is_nullable=False,has_commas=True,has_spaces=True),
    Column.string("cat_colours", is_nullable=False,has_commas=True,has_spaces=False)
])

fixer_2 = Fixer.new(schema_2)

parsed_example_2 = fixer_2.fix_file("./examples/example_2.csv", skip_first_line=False, show_possible_parses=True)

INFO:Fixer Logs:['1', 'chanom', 'chayen,orange,orange']
INFO:Fixer Logs:Correct CSV format: 1,chanom,"chayen,orange,orange"
INFO:Fixer Logs:['1', 'chanom,chayen', 'orange,orange']
INFO:Fixer Logs:Correct CSV format: 1,"chanom,chayen","orange,orange"
INFO:Fixer Logs:['1', 'chanom,chayen,orange', 'orange']
INFO:Fixer Logs:Correct CSV format: 1,"chanom,chayen,orange",orange
INFO:Fixer Logs:['2', 'chayen', 'olieang,orange,black']
INFO:Fixer Logs:Correct CSV format: 2,chayen,"olieang,orange,black"
INFO:Fixer Logs:['2', 'chayen,olieang', 'orange,black']
INFO:Fixer Logs:Correct CSV format: 2,"chayen,olieang","orange,black"
INFO:Fixer Logs:['2', 'chayen,olieang,orange', 'black']
INFO:Fixer Logs:Correct CSV format: 2,"chayen,olieang,orange",black


File has been processed!
Number of total entries: 4            
 Number of invalid entries: 2


In [15]:
parsed_example_2.print_all_invalid_entries()

Index	Line entry
0	1,chanom,chayen,orange,orange
1	2,chayen,olieang,orange,black


We can see that for lines with multiple commas, the processing fails because the fixer cannot tell which tokens belong in which column. However, with `show_possible_parses` enabled, we can see the exact line and its possible parses.
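
The ambiguity can be reproduced with a small brute-force sketch that tries every way of grouping consecutive tokens into three columns. The function `possible_parses` and the three checks are hypothetical simplifications written for this illustration; they are not how `comma-fixer` is implemented.

```python
from itertools import combinations


def possible_parses(tokens, validators):
    """Yield each grouping of consecutive tokens that satisfies every column check."""
    n_cols = len(validators)
    # Pick n_cols - 1 cut points between tokens, then join each group with commas.
    for cuts in combinations(range(1, len(tokens)), n_cols - 1):
        bounds = (0, *cuts, len(tokens))
        groups = [",".join(tokens[a:b]) for a, b in zip(bounds, bounds[1:])]
        if all(check(group) for check, group in zip(validators, groups)):
            yield groups


def names_ok(field):    # commas and spaces allowed, not nullable
    return field != ""


def colours_ok(field):  # commas allowed, spaces not allowed, not nullable
    return field != "" and " " not in field


tokens = "1,chanom,chayen,orange,orange".split(",")
parses = list(possible_parses(tokens, [str.isdigit, names_ok, colours_ok]))
for parse in parses:
    print(parse)
# ['1', 'chanom', 'chayen,orange,orange']
# ['1', 'chanom,chayen', 'orange,orange']
# ['1', 'chanom,chayen,orange', 'orange']
```

With checks this permissive, three groupings pass, so no single parse can be chosen.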

In [16]:
parsed_example_2.export_to_csv_best_effort("./examples/example_2_parsed.csv")

INFO:Parsed Logs:None


<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           2 non-null      object
 1   cat_names    2 non-null      object
 2   cat_colours  2 non-null      object
dtypes: object(3)
memory usage: 64.0+ bytes


In [17]:
! cat ./examples/example_2_parsed.csv

id,cat_names,cat_colours
3,muffin,orange
4,chanom,orange


Exporting the parsed dataset will only result in valid rows being exported.

However, if we restrict the Schema further, given that the possible contents are known, we are more likely to achieve better results.

Note that this works less well for columns with highly variable data, such as columns containing long free text. Unless the contents come from a strict set of items, an overly restrictive schema may also perform poorly.

In [18]:
schema_2_revised = Schema.new(columns=[
    Column.numeric("id"),
    Column.string("cat_names", is_nullable=False,has_commas=True,has_spaces=True, format=r"^(?!orange|black|white|calico|tabby)"),
    Column.string("cat_colours", is_nullable=False,has_commas=True,has_spaces=False, format=r"^(orange|black|white|calico|tabby)")
])

fixer_2_revised = Fixer.new(schema_2_revised)

parsed_example_2_revised = fixer_2_revised.fix_file("./examples/example_2.csv", skip_first_line=False, show_possible_parses=True)

File has been processed!
Number of total entries: 4            
 Number of invalid entries: 0


In [19]:
parsed_example_2_revised.export_to_csv_best_effort("./examples/example_2_parsed_regex.csv")

INFO:Parsed Logs:None


<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           4 non-null      object
 1   cat_names    4 non-null      object
 2   cat_colours  4 non-null      object
dtypes: object(3)
memory usage: 128.0+ bytes


In [20]:
! cat ./examples/example_2_parsed_regex.csv

id,cat_names,cat_colours
1,"chanom,chayen","orange,orange"
2,"chayen,olieang","orange,black"
3,muffin,orange
4,chanom,orange


For the sake of this example, let us assume that the third column `cat_colours` can only contain the following values: [orange, black, white, calico, tabby].

With this, we can create a restrictive schema as seen above. However, we also have to add the complementary restriction to the preceding column, so that it rejects tokens that belong to `cat_colours`. Otherwise, there may still be multiple possible parses.

For example, `1, "chanom, chayen, orange", orange` could be a possible parse if we did not specify `cat_names` to exclude values from `cat_colours`. 

Hence, it is important to strictly define each column such that there are clear distinctions between consecutive columns.
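
The effect of the two patterns can be checked with the `re` module alone. The sketch below assumes each pattern is tested against every comma-separated token of a field; this is our assumption for illustration, and the helper `every_token_matches` is hypothetical rather than part of `comma-fixer`.

```python
import re

COLOURS = r"^(orange|black|white|calico|tabby)"
NOT_COLOURS = r"^(?!orange|black|white|calico|tabby)"


def every_token_matches(pattern, field):
    """True if each comma-separated token in the field matches the pattern."""
    return all(re.match(pattern, token) for token in field.split(","))


# The three (cat_names, cat_colours) candidates for the first row of example_2.csv.
candidates = [
    ("chanom", "chayen,orange,orange"),
    ("chanom,chayen", "orange,orange"),
    ("chanom,chayen,orange", "orange"),
]

surviving = [
    (names, colours)
    for names, colours in candidates
    if every_token_matches(NOT_COLOURS, names) and every_token_matches(COLOURS, colours)
]
print(surviving)  # [('chanom,chayen', 'orange,orange')]
```

Note that neither pattern ends with `$`, so only the start of each token is checked.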

## Example 3

In [21]:
schema_3 = Schema.new(columns=[
    Column.numeric("id"),
    Column.string("username",is_nullable=False,has_commas=False,has_spaces=False),
    Column.numeric("number_of_platforms"),
    Column.string("platforms",is_nullable=True,has_commas=True,has_spaces=False),
    Column.numeric("number_of_cats"),
    Column.string("cat_names",is_nullable=True,has_commas=True,has_spaces=True)
])

fixer_3 = Fixer.new(schema_3)

parsed_example_3 = fixer_3.fix_file("./examples/example_3.csv", skip_first_line=True, show_possible_parses=True)

File has been processed!
Number of total entries: 9            
 Number of invalid entries: 0


In [22]:
parsed_example_3.export_to_csv_best_effort("./examples/example_3_parsed.csv")

INFO:Parsed Logs:None


<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, 0 to 7
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   id                   8 non-null      object
 1   username             8 non-null      object
 2   number_of_platforms  8 non-null      object
 3   platforms            8 non-null      object
 4   number_of_cats       8 non-null      object
 5   cat_names            8 non-null      object
dtypes: object(6)
memory usage: 448.0+ bytes


In [23]:
! cat ./examples/example_3_parsed.csv

id,username,number_of_platforms,platforms,number_of_cats,cat_names
1,john_appleseed,2,"facebook,instagram",1,Apple
2,john_wick,0,,0,
3,bob,1,instagram,2,"fluffy,fluffy sr."
4,jlaw,1,instagram,1,snowy
5,tay_fast,2,"instagram,youtube",3,"grey,benson,button"
6,tommyj,0,,1,mimi
7,jakeyh,2,"twitter,instagram",2,"mob,Psycho"
8,pujanf,0,,2,


Since there is a clear divider between the two comma columns, valid parsings can be produced and exported. 

## Example 4

If the values of a column with commas are known, e.g. the values came from a multiple choice question on a form, they can be specified to help identify whether a value can be placed within a column.

In [24]:
schema_4 = Schema.new(columns=[
    Column.string("favourite_cat_colours", is_nullable=False,has_commas=True,has_spaces=False, format=r"^(orange|black|tabby|white|calico)"),
    Column.string("favourite_colour_reason", is_nullable=False,has_commas=True,has_spaces=True, format=r"^(?!orange|black|tabby|white|calico)")
])

fixer_4 = Fixer.new(schema_4)

parsed_example_4 = fixer_4.fix_file("./examples/example_4.csv", skip_first_line=True, show_possible_parses=True)
parsed_example_4.print_all_invalid_entries()

File has been processed!
Number of total entries: 6            
 Number of invalid entries: 0
Index	Line entry


In [25]:
parsed_example_4.export_to_csv_best_effort(filepath="./examples/example_4_parsed.csv")

INFO:Parsed Logs:None


<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   favourite_cat_colours    5 non-null      object
 1   favourite_colour_reason  5 non-null      object
dtypes: object(2)
memory usage: 120.0+ bytes


In [26]:
! cat ./examples/example_4_parsed.csv

favourite_cat_colours,favourite_colour_reason
"orange,calico","because orange cats are very silly,and calicos are very pretty"
black,because black cats are very sweet despite superstition
white,my cat is white so i like white cats (my cat)
"orange,tabby",I like tabby cats because they look like striped fish.
"orange,calico,black,white,tabby","I like all cat colours,why discriminate?"


Specifying the RegEx format of the tokens expected in each column helps the fixer place tokens into their respective columns.

However, this can only be done for columns whose expected values are known. For text columns, this may not be as effective.

As seen in the example above, the last column is a text column and its contents can be arbitrary. In this case, since we know the previous column's values, we can differentiate the two by excluding from the text column all tokens which begin with an item from the previous column.

## Example 5 - Processing a file in chunks

When the file being processed is extremely large, e.g. millions of rows, it may help to break the processing down into smaller chunks rather than doing it all in one go.

To do this, there is a supplied utility function for creating chunks for processing with `fix_file`. This utility function can be used by first importing from the fixer submodule like so:
```python
from comma_fixer.fixer import create_chunks
```

Then, supply the filepath or a `TextIOWrapper` or `StringIO` object, the number of lines/rows per chunk, and whether to skip the first line.

For example:
```python
create_chunks(filepath="/path/to/csv/file.csv", lines_per_chunk=100_000, skip_first_line=True)
```

In [27]:
all_chunks = create_chunks(filepath="examples/example_5.csv", lines_per_chunk=100_000, skip_first_line=True)
print(f"Number of chunks created: {len(all_chunks)}")

Number of chunks created: 13


If the number of lines per chunk is not specified, or `None` is passed, then the number of chunks created is based on the number of cores the device has.

In [28]:
import os 

all_chunks = create_chunks(filepath="examples/example_5.csv", lines_per_chunk=None, skip_first_line=True)
print(f"Number of chunks created: {len(all_chunks)}")
print(f"Number of cores on device: {os.cpu_count()}")

Number of chunks created: 11
Number of cores on device: 10


Let us create lots of chunks via a low number of lines per chunk.

In [29]:
all_chunks = create_chunks(filepath="examples/example_5.csv", lines_per_chunk=1_000, skip_first_line=True)
print(f"Number of chunks created: {len(all_chunks)}")

Number of chunks created: 1210


The returned object is a list of `StringIO` objects, which will allow for processing with `fix_file` directly. 
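
For intuition, splitting a stream into `StringIO` chunks can be sketched with the standard library alone. The function `chunk_lines` below is a hypothetical, simplified stand-in and makes no claim about how `create_chunks` works internally.

```python
import io
from itertools import islice


def chunk_lines(stream, lines_per_chunk, skip_first_line=False):
    """Split a text stream into StringIO chunks of at most lines_per_chunk lines."""
    if skip_first_line:
        next(stream, None)  # discard the header row
    chunks = []
    while True:
        lines = list(islice(stream, lines_per_chunk))
        if not lines:
            break
        chunks.append(io.StringIO("".join(lines)))
    return chunks


source = io.StringIO("id,name\n1,a\n2,b\n3,c\n4,d\n5,e\n")
chunks = chunk_lines(source, lines_per_chunk=2, skip_first_line=True)

print(len(chunks))                 # 3
print(repr(chunks[0].getvalue()))  # '1,a\n2,b\n'
```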

In [30]:
schema_5 = Schema.new(
    columns=[
        Column.string(f"col_{i}", False, False, False) for i in range(1,51)
    ]
)

In [31]:
schema_5.info()

Unnamed: 0,name,type,nullable,has commas,has spaces,format
0,col_1,str,False,False,False,
1,col_2,str,False,False,False,
2,col_3,str,False,False,False,
3,col_4,str,False,False,False,
4,col_5,str,False,False,False,
5,col_6,str,False,False,False,
6,col_7,str,False,False,False,
7,col_8,str,False,False,False,
8,col_9,str,False,False,False,
9,col_10,str,False,False,False,


In [32]:
fixer_5 = Fixer.new(schema=schema_5)

In [33]:
parsed_example_5_chunk_0 = fixer_5.fix_file(all_chunks[0])



File has been processed!
Number of total entries: 1000            
 Number of invalid entries: 544


For this particular file, the majority of the rows are malformed. To alert the user, the logger prints a warning whenever it finds no possible parsing for a row under the schema. With large chunks, for example 100,000 lines per chunk, this can produce a large volume of printed logs, which could negatively impact the runtime.

To prevent this, an additional parameter has been added to the `Fixer.fix_file` function to write the logs to a file for later inspection rather than printing. 

For this example, we will adjust the chunk size to be smaller for easier inspection.

In [34]:
small_chunks = create_chunks(filepath="examples/example_5.csv", lines_per_chunk=500, skip_first_line=True)
print(f"Number of chunks created: {len(small_chunks)}")

Number of chunks created: 2420


In [35]:
parsed_example_5_chunk_0 = fixer_5.fix_file(small_chunks[0])



File has been processed!
Number of total entries: 500            
 Number of invalid entries: 273


If `log_file` is set to `True`, a subfolder called `logs` will be created in the current directory, and log files will be created with the naming format `comma_fixer_%Y%m%d_%H%M%S.log`. Currently, there is no way to modify where the logs are written to.
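
Since the file name depends on when the run happened, it can be handy to look up the newest log rather than typing the name by hand. The sketch below uses only the standard library; `latest_log` is a hypothetical helper, and it assumes the `logs` folder and naming format described above.

```python
from datetime import datetime
from pathlib import Path

# The naming format quoted above, applied to a fixed timestamp.
stamp = datetime(2025, 9, 18, 16, 24, 26)
log_name = stamp.strftime("comma_fixer_%Y%m%d_%H%M%S.log")
print(log_name)  # comma_fixer_20250918_162426.log


def latest_log(log_dir="logs"):
    """Return the newest comma_fixer log in log_dir, or None if there is none."""
    # The timestamp in the name sorts chronologically as plain text.
    candidates = sorted(Path(log_dir).glob("comma_fixer_*.log"))
    return candidates[-1] if candidates else None
```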

In [36]:
parsed_example_5_chunk_1 = fixer_5.fix_file(small_chunks[1], show_possible_parses=True, log_file=True)

File has been processed!
Number of total entries: 500            
 Number of invalid entries: 271


Let us investigate the log file.

In [42]:
! cat ./logs/comma_fixer_20250918_162426.log



We see that there are mostly warnings indicating that no valid path was found for some of the rows, with no possible parses.

Note that large matrices may become truncated. 

Let us try again with an example file from earlier where there are multiple possible parses.

In [38]:
parsed_example_2_logs = fixer_2.fix_file("./examples/example_2.csv", skip_first_line=False, show_possible_parses=True, log_file=True)

File has been processed!
Number of total entries: 4            
 Number of invalid entries: 2


In [43]:
! cat logs/comma_fixer_20250918_162429.log

INFO:Fixer Logs:['1', 'chanom', 'chayen,orange,orange']
INFO:Fixer Logs:Correct CSV format: 1,chanom,"chayen,orange,orange"
INFO:Fixer Logs:['1', 'chanom,chayen', 'orange,orange']
INFO:Fixer Logs:Correct CSV format: 1,"chanom,chayen","orange,orange"
INFO:Fixer Logs:['1', 'chanom,chayen,orange', 'orange']
INFO:Fixer Logs:Correct CSV format: 1,"chanom,chayen,orange",orange
INFO:Fixer Logs:['2', 'chayen', 'olieang,orange,black']
INFO:Fixer Logs:Correct CSV format: 2,chayen,"olieang,orange,black"
INFO:Fixer Logs:['2', 'chayen,olieang', 'orange,black']
INFO:Fixer Logs:Correct CSV format: 2,"chayen,olieang","orange,black"
INFO:Fixer Logs:['2', 'chayen,olieang,orange', 'black']
INFO:Fixer Logs:Correct CSV format: 2,"chayen,olieang,orange",black


Similarly to what is printed out when `log_file` is set to False, the log file will display any errors that occur for each row, as well as the possible parses if the option is enabled.

### Using itertools

Instead of using the `create_chunks` function, users can also use the `itertools` library to slice the input. The input MUST be a stream or buffer, such as a `TextIOWrapper` or `StringIO` object, which has an internal `readline` function.

The following example uses `example_3.csv`, which, as seen in the previous examples, has 8 rows in total.

Here, we slice only the first 5 lines of the file (the header plus 4 rows) and attempt to parse them.

In [None]:
from itertools import islice

file_3_slice = open("./examples/example_3.csv")
slice_5 = islice(file_3_slice, 5)

# Show what the sliced lines look like
for line in slice_5:
    print(line)

# Reopen the file, since the loop above consumed the first slice
file_3_slice = open("./examples/example_3.csv")
parsed_3_first_5 = fixer_3.fix_file(islice(file_3_slice, 5), skip_first_line=True)

id,username,number_of_platforms,platforms,number_of_cats,cat_names

1,john_appleseed,2,facebook,instagram,1,Apple

2,john_wick,0,,0,

3,bob,1,instagram,2,fluffy,fluffy sr.

4,jlaw,1,instagram,1,,snowy

File has been processed!
Number of total entries: 4            
 Number of invalid entries: 0


With `itertools.islice`, we can specify exactly which lines to process, which allows more customisation than the `create_chunks` function.

For example, with `itertools.islice`, a start and stop index can be specified so that only lines from the `start` to `stop` are sliced.
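
One detail worth keeping in mind is that `islice` consumes the underlying stream, which is why the notebook cells reopen the file before slicing again. A minimal standard-library sketch, using a made-up in-memory stream:

```python
import io
from itertools import islice

stream = io.StringIO("header\nrow1\nrow2\nrow3\nrow4\n")

first_two = list(islice(stream, 2))
print(first_two)  # ['header\n', 'row1\n']

# The stream has advanced, so a second slice continues where the first stopped.
next_two = list(islice(stream, 2))
print(next_two)   # ['row2\n', 'row3\n']

# After rewinding, start=1 and stop=3 select indices 1 and 2 (stop is exclusive).
stream.seek(0)
middle = list(islice(stream, 1, 3))
print(middle)     # ['row1\n', 'row2\n']
```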

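For example, using `example_3.csv` again, `islice(file, 3, 6)` yields the lines at indices 3 to 5 (the stop index is exclusive), which are the rows with ids 3, 4 and 5.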

In [49]:
file_3_slice = open("./examples/example_3.csv")
slice_3to6 = islice(file_3_slice, 3, 6)

# Show what the sliced lines look like
for line in slice_3to6:
    print(line)
    
file_3_slice = open("./examples/example_3.csv")
parsed_3_first_3to6 = fixer_3.fix_file(islice(file_3_slice, 3, 6), skip_first_line=False)

3,bob,1,instagram,2,fluffy,fluffy sr.

4,jlaw,1,instagram,1,,snowy

5,tay_fast,2,,instagram,,youtube,3,grey,,benson,button

File has been processed!
Number of total entries: 3            
 Number of invalid entries: 0
