##  <center>Load PDF Files</center>

### About this project

This is the first step of the **fermac_analysis** project. This part of the project loads monthly PDF files containing sales invoices for a given year, cleans them up, tokenizes the data with regular expressions, stores the result in a **pd.DataFrame**, and exports each report as an individual **CSV file** per year.

In [1]:
#https://www.youtube.com/watch?v=QJbbxqgwQaM
import pandas as pd
from collections import Counter
import pdfplumber
import re
print(pdfplumber.__version__)

0.11.1


In [2]:
pdf = pdfplumber.open('./data/pdf_files/nfs_2013.pdf')

## Inspecting the document

In [3]:
for page in pdf.pages:
    for line in page.extract_text().splitlines():
        print(line)

Empresa Panatlântica S.A.
R. Rudolfo Vontobel, 600 Distrito Industrial - Gravatai/RS CEP: 94045405
CNPJ: 92.693.019/0001-89 IE: 0570024129 FONE: 3489 77 77
CONSULTA NOTA FISCAL Data : 23.07.2023 - Hora: 22:19:19 - Folha : 001 / 69
Data: 01//0/2/20
N o t a F i s c a l E s t a b P e d i d o - S e q C l i e n t e I t e m E s p e s s u r a L a r g u r a C o m p r i m e n t o V l P e ç a V l U n i t Q t I t e m Vl Total
0 0 2 5 3 0 5 1 4 1 4 7 8 7 6 - 2 0 G P A N I Z FLANGE OXICORTADO LCG 12,50 NBR 8300 12,5000 90,0000 420,0000 55,45 4,27 780,00 3.542,68
0 0 2 5 3 0 5 1 4 1 4 7 8 7 6 - 3 0 G P A N I Z FLANGE OXICORTADO LCG 9,50 NBR 8300 9,5000 90,0000 360,0000 31,10 4,04 385,00 1.657,10
0 1 0 7 8 5 6 1 1 1 3 7 0 8 1 - 1 0 K E K O CHAPA GR LTQ 9,50 NBR 6656 LNE 38 9,5000 1.200,0000 3.000,0000 0,00 2,84 1.355,00 4.040,61
0 1 0 7 8 5 6 1 1 1 4 7 0 0 2 - 2 0 K E K O CHAPA FF 1,50 NBR5915 EM OL 1,5000 1.200,0000 1.595,0000 0,00 2,64 5.831,00 16.163,53
0 1 0 7 8 5 7 1 1 1 4 5 5 9 2 - 1 0 K E K O 

## Visualize the first Page

### Cleaning up the data: removing headers, footers, and other undesirable lines

In [4]:
# Defining a function for removing a certain pattern.
def remove_pattern(text, pattern):
    return re.sub(pattern,'', text)

In [5]:
# Defining a function for printing the data. 
def print_data(data_list):
    for index, value in enumerate(data_list):
        print(f'The value at {index} is {value}\n\n')

In [6]:
# Regex pattern for "Date", "Time" and "Folha : [page_number] / [total_pages]"

extracted_pdf_data = []
header_date_pattern = r'\d{2}\.\d{2}\.\d{4} - ' #Data : 23.07.2023
time_pattern = r'Hora: \d{2}:\d{2}:\d{2} - '
page_pattern = r'Folha : \d+ / \d+'  # Folha : 001 / 69
search_string_1 = 'Empresa Panatlântica S.A.'
search_string_2 = 'R. Rudolfo Vontobel, 600 Distrito Industrial - Gravatai/RS CEP: 94045405' 
search_string_3 = 'CNPJ: 92.693.019/0001-89 IE: 0570024129 FONE: 3489 77 77' 
search_string_4 = 'CONSULTA NOTA FISCAL Data : ' + header_date_pattern + time_pattern + page_pattern
search_string_5 = 'N o t a F i s c a l E s t a b P e d i d o - S e q C l i e n t e I t e m E s p e s s u r a L a r g u r a C o m p r i m e n t o V l P e ç a V l U n i t Q t I t e m Vl Total'

for page in pdf.pages:
    for line in page.extract_text().splitlines():
        extracted_pdf_data.append(line)

# Using list comprehensions to remove each header string or pattern from every line
extracted_pdf_data = [remove_pattern(line, search_string_1) for line in extracted_pdf_data] # 'Empresa Panatlântica S.A.'
extracted_pdf_data = [remove_pattern(line, search_string_2) for line in extracted_pdf_data] # 'R. Rudolfo Vontobel, 600 Distrito Industrial - Gravatai/RS CEP: 94045405'
extracted_pdf_data = [remove_pattern(line, search_string_3) for line in extracted_pdf_data] # 'CNPJ: 92.693.019/0001-89 IE: 0570024129 FONE: 3489 77 77' 
extracted_pdf_data = [remove_pattern(line, search_string_4) for line in extracted_pdf_data] # 'CONSULTA NOTA FISCAL Data : ' + header_date_pattern + time_pattern + page_pattern
extracted_pdf_data = [remove_pattern(line, search_string_5) for line in extracted_pdf_data] # 'N o t a F i s c a l E s t a b P e d i d o - S e q...'
extracted_pdf_data = [item for item in extracted_pdf_data if item.strip()]
# Drop short subtotal lines such as '6.666,00 17.841,26' by filtering, rather than popping while iterating (which can skip items)
extracted_pdf_data = [value for value in extracted_pdf_data if 'Data' in value or len(value) > 30]
extracted_pdf_data.pop(-1) #Total Geral1.691.589,00 4.546.046,08

print_data(extracted_pdf_data)

The value at 0 is Data: 01//0/2/20


The value at 1 is 0 0 2 5 3 0 5 1 4 1 4 7 8 7 6 - 2 0 G P A N I Z FLANGE OXICORTADO LCG 12,50 NBR 8300 12,5000 90,0000 420,0000 55,45 4,27 780,00 3.542,68


The value at 2 is 0 0 2 5 3 0 5 1 4 1 4 7 8 7 6 - 3 0 G P A N I Z FLANGE OXICORTADO LCG 9,50 NBR 8300 9,5000 90,0000 360,0000 31,10 4,04 385,00 1.657,10


The value at 3 is 0 1 0 7 8 5 6 1 1 1 3 7 0 8 1 - 1 0 K E K O CHAPA GR LTQ 9,50 NBR 6656 LNE 38 9,5000 1.200,0000 3.000,0000 0,00 2,84 1.355,00 4.040,61


The value at 4 is 0 1 0 7 8 5 6 1 1 1 4 7 0 0 2 - 2 0 K E K O CHAPA FF 1,50 NBR5915 EM OL 1,5000 1.200,0000 1.595,0000 0,00 2,64 5.831,00 16.163,53


The value at 5 is 0 1 0 7 8 5 7 1 1 1 4 5 5 9 2 - 1 0 K E K O CHAPA FQ DEC 2,00 SAE1006 OL 2,0000 1.200,0000 3.000,0000 0,00 2,70 5.192,00 14.719,32


The value at 6 is 0 1 0 7 8 8 6 1 1 1 4 6 3 2 8 - 2 0 K E K O CHAPA FQ 4,25 SAE1010 4,2500 1.200,0000 3.000,0000 0,00 2,25 6.239,00 14.739,64


The value at 7 is 0 1 0 7 8 8 7 1 1 1 4 6 1 4 3 - 2
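
The cleanup rules in this cell can also be written as one predicate applied in a single pass. The sketch below is illustrative only: `HEADER_LITERALS`, `PAGE_LINE`, `is_noise`, and `clean_lines` are hypothetical names, and only two of the header strings are listed. Building a new list avoids the pitfall of removing items from a list while iterating over it, which can skip the element that follows each removal.

```python
import re

# Hypothetical subset of the literal header lines seen in the report.
HEADER_LITERALS = {
    'Empresa Panatlântica S.A.',
    'CNPJ: 92.693.019/0001-89 IE: 0570024129 FONE: 3489 77 77',
}

# Page header line: date, time, and "Folha : page / total".
PAGE_LINE = re.compile(
    r'CONSULTA NOTA FISCAL Data : \d{2}\.\d{2}\.\d{4} - '
    r'Hora: \d{2}:\d{2}:\d{2} - Folha : \d+ / \d+'
)


def is_noise(line):
    """Return True for blank lines, headers, subtotals, and the grand total."""
    stripped = line.strip()
    if not stripped:
        return True
    if stripped in HEADER_LITERALS or PAGE_LINE.fullmatch(stripped):
        return True
    if stripped.startswith('Total Geral'):
        return True
    # Short lines without 'Data' are treated as subtotals, e.g. '6.666,00 17.841,26'.
    return 'Data' not in stripped and len(stripped) <= 30


def clean_lines(lines):
    """Keep only date lines and invoice item lines."""
    return [line for line in lines if not is_noise(line)]
```

Comparing whole lines against literals (instead of passing them to `re.sub`) also means characters such as `.` are not interpreted as regex wildcards.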

In [7]:
extracted_pdf_data

['Data: 01//0/2/20',
 '0 0 2 5 3 0 5 1 4 1 4 7 8 7 6 - 2 0 G P A N I Z FLANGE OXICORTADO LCG 12,50 NBR 8300 12,5000 90,0000 420,0000 55,45 4,27 780,00 3.542,68',
 '0 0 2 5 3 0 5 1 4 1 4 7 8 7 6 - 3 0 G P A N I Z FLANGE OXICORTADO LCG 9,50 NBR 8300 9,5000 90,0000 360,0000 31,10 4,04 385,00 1.657,10',
 '0 1 0 7 8 5 6 1 1 1 3 7 0 8 1 - 1 0 K E K O CHAPA GR LTQ 9,50 NBR 6656 LNE 38 9,5000 1.200,0000 3.000,0000 0,00 2,84 1.355,00 4.040,61',
 '0 1 0 7 8 5 6 1 1 1 4 7 0 0 2 - 2 0 K E K O CHAPA FF 1,50 NBR5915 EM OL 1,5000 1.200,0000 1.595,0000 0,00 2,64 5.831,00 16.163,53',
 '0 1 0 7 8 5 7 1 1 1 4 5 5 9 2 - 1 0 K E K O CHAPA FQ DEC 2,00 SAE1006 OL 2,0000 1.200,0000 3.000,0000 0,00 2,70 5.192,00 14.719,32',
 '0 1 0 7 8 8 6 1 1 1 4 6 3 2 8 - 2 0 K E K O CHAPA FQ 4,25 SAE1010 4,2500 1.200,0000 3.000,0000 0,00 2,25 6.239,00 14.739,64',
 '0 1 0 7 8 8 7 1 1 1 4 6 1 4 3 - 2 0 M E T M E T A L C I N TR REL 0,30 BOB NBR5007 G2L32 BbOL 0,3000 260,0000 0,0000 0,00 4,50 2.857,00 13.499,33',
 '0 1 0 7 8 9 

In [8]:
# Getting the date in the correct format (DD-MM-YYYY) and repeating it for every invoice line that follows.
Date = []
for line in extracted_pdf_data:
    if 'Data' in line:
        # 'Data: 01//0/2/20' -> '01-02-2013'; the extracted year is cut off at '20', so 2013 is hard-coded for this file
        date = re.sub(r'Data:', "", line).strip().replace("/20", "//2013").replace("//", "-").replace("/", "")
    else:
        Date.append(date)
Date

['01-02-2013',
 '01-02-2013',
 '01-02-2013',
 '01-02-2013',
 '01-02-2013',
 '01-02-2013',
 '01-02-2013',
 '01-02-2013',
 '01-02-2013',
 '01-02-2013',
 '01-02-2013',
 '01-03-2013',
 '01-03-2013',
 '01-03-2013',
 '01-03-2013',
 '01-03-2013',
 '01-03-2013',
 '01-03-2013',
 '01-03-2013',
 '01-03-2013',
 '01-03-2013',
 '01-04-2013',
 '01-04-2013',
 '01-04-2013',
 '01-04-2013',
 '01-04-2013',
 '01-04-2013',
 '01-07-2013',
 '01-07-2013',
 '01-07-2013',
 '01-07-2013',
 '01-07-2013',
 '01-07-2013',
 '01-07-2013',
 '01-07-2013',
 '01-07-2013',
 '01-07-2013',
 '01-07-2013',
 '01-07-2013',
 '01-08-2013',
 '01-10-2013',
 '01-10-2013',
 '01-10-2013',
 '01-11-2013',
 '01-11-2013',
 '01-11-2013',
 '01-11-2013',
 '02-04-2013',
 '02-04-2013',
 '02-04-2013',
 '02-04-2013',
 '02-04-2013',
 '02-04-2013',
 '02-04-2013',
 '02-04-2013',
 '02-04-2013',
 '02-04-2013',
 '02-04-2013',
 '02-04-2013',
 '02-05-2013',
 '02-05-2013',
 '02-05-2013',
 '02-05-2013',
 '02-05-2013',
 '02-05-2013',
 '02-05-2013',
 '02-07-20

In [9]:
len(Date)

1658
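
The date normalization can also be done by keeping only the digits of the garbled line. This is a minimal sketch under two assumptions: the first four digits are always day and month, and the year (cut off at `20` in the extracted text) is supplied by the caller, for example from the file name. `parse_garbled_date` is a hypothetical helper, not part of the notebook.

```python
import re


def parse_garbled_date(line, year):
    """Turn a line such as 'Data: 01//0/2/20' into 'DD-MM-YYYY'."""
    digits = re.sub(r'\D', '', line)  # 'Data: 01//0/2/20' -> '010220'
    if len(digits) < 4:
        raise ValueError(f'cannot read day and month from {line!r}')
    day, month = digits[:2], digits[2:4]
    return f'{day}-{month}-{year}'


print(parse_garbled_date('Data: 01//0/2/20', 2013))
```

Passing the year as a parameter would let the same cell be reused for the other yearly files without editing the string replacements.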

In [10]:
extracted_pdf_data

['Data: 01//0/2/20',
 '0 0 2 5 3 0 5 1 4 1 4 7 8 7 6 - 2 0 G P A N I Z FLANGE OXICORTADO LCG 12,50 NBR 8300 12,5000 90,0000 420,0000 55,45 4,27 780,00 3.542,68',
 '0 0 2 5 3 0 5 1 4 1 4 7 8 7 6 - 3 0 G P A N I Z FLANGE OXICORTADO LCG 9,50 NBR 8300 9,5000 90,0000 360,0000 31,10 4,04 385,00 1.657,10',
 '0 1 0 7 8 5 6 1 1 1 3 7 0 8 1 - 1 0 K E K O CHAPA GR LTQ 9,50 NBR 6656 LNE 38 9,5000 1.200,0000 3.000,0000 0,00 2,84 1.355,00 4.040,61',
 '0 1 0 7 8 5 6 1 1 1 4 7 0 0 2 - 2 0 K E K O CHAPA FF 1,50 NBR5915 EM OL 1,5000 1.200,0000 1.595,0000 0,00 2,64 5.831,00 16.163,53',
 '0 1 0 7 8 5 7 1 1 1 4 5 5 9 2 - 1 0 K E K O CHAPA FQ DEC 2,00 SAE1006 OL 2,0000 1.200,0000 3.000,0000 0,00 2,70 5.192,00 14.719,32',
 '0 1 0 7 8 8 6 1 1 1 4 6 3 2 8 - 2 0 K E K O CHAPA FQ 4,25 SAE1010 4,2500 1.200,0000 3.000,0000 0,00 2,25 6.239,00 14.739,64',
 '0 1 0 7 8 8 7 1 1 1 4 6 1 4 3 - 2 0 M E T M E T A L C I N TR REL 0,30 BOB NBR5007 G2L32 BbOL 0,3000 260,0000 0,0000 0,00 4,50 2.857,00 13.499,33',
 '0 1 0 7 8 9 

In [11]:
# Removing the date lines (the removed lines are displayed below).
removed_dates = [value for value in extracted_pdf_data if 'Data' in value]
extracted_pdf_data = [value for value in extracted_pdf_data if 'Data' not in value]
removed_dates

['Data: 01//0/2/20',
 'Data: 01//0/3/20',
 'Data: 01//0/4/20',
 'Data: 01//0/7/20',
 'Data: 01//0/8/20',
 'Data: 01//1/0/20',
 'Data: 01//1/1/20',
 'Data: 02//0/4/20',
 'Data: 02//0/5/20',
 'Data: 02//0/7/20',
 'Data: 02//0/8/20',
 'Data: 02//0/9/20',
 'Data: 02//1/0/20',
 'Data: 02//1/2/20',
 'Data: 03//0/1/20',
 'Data: 03//0/4/20',
 'Data: 03//0/5/20',
 'Data: 03//0/6/20',
 'Data: 03//0/7/20',
 'Data: 03//0/9/20',
 'Data: 03//1/0/20',
 'Data: 03//1/2/20',
 'Data: 04//0/1/20',
 'Data: 04//0/2/20',
 'Data: 04//0/3/20',
 'Data: 04//0/4/20',
 'Data: 04//0/6/20',
 'Data: 04//0/7/20',
 'Data: 04//0/9/20',
 'Data: 04//1/0/20',
 'Data: 04//1/1/20',
 'Data: 04//1/2/20',
 'Data: 05//0/2/20',
 'Data: 05//0/3/20',
 'Data: 05//0/4/20',
 'Data: 05//0/6/20',
 'Data: 05//0/7/20',
 'Data: 05//0/8/20',
 'Data: 05//0/9/20',
 'Data: 05//1/1/20',
 'Data: 05//1/2/20',
 'Data: 06//0/2/20',
 'Data: 06//0/3/20',
 'Data: 06//0/5/20',
 'Data: 06//0/6/20',
 'Data: 06//0/8/20',
 'Data: 06//0/9/20',
 'Data: 06//1

In [12]:
print_data(extracted_pdf_data)

The value at 0 is 0 0 2 5 3 0 5 1 4 1 4 7 8 7 6 - 2 0 G P A N I Z FLANGE OXICORTADO LCG 12,50 NBR 8300 12,5000 90,0000 420,0000 55,45 4,27 780,00 3.542,68


The value at 1 is 0 0 2 5 3 0 5 1 4 1 4 7 8 7 6 - 3 0 G P A N I Z FLANGE OXICORTADO LCG 9,50 NBR 8300 9,5000 90,0000 360,0000 31,10 4,04 385,00 1.657,10


The value at 2 is 0 1 0 7 8 5 6 1 1 1 3 7 0 8 1 - 1 0 K E K O CHAPA GR LTQ 9,50 NBR 6656 LNE 38 9,5000 1.200,0000 3.000,0000 0,00 2,84 1.355,00 4.040,61


The value at 3 is 0 1 0 7 8 5 6 1 1 1 4 7 0 0 2 - 2 0 K E K O CHAPA FF 1,50 NBR5915 EM OL 1,5000 1.200,0000 1.595,0000 0,00 2,64 5.831,00 16.163,53


The value at 4 is 0 1 0 7 8 5 7 1 1 1 4 5 5 9 2 - 1 0 K E K O CHAPA FQ DEC 2,00 SAE1006 OL 2,0000 1.200,0000 3.000,0000 0,00 2,70 5.192,00 14.719,32


The value at 5 is 0 1 0 7 8 8 6 1 1 1 4 6 3 2 8 - 2 0 K E K O CHAPA FQ 4,25 SAE1010 4,2500 1.200,0000 3.000,0000 0,00 2,25 6.239,00 14.739,64


The value at 6 is 0 1 0 7 8 8 7 1 1 1 4 6 1 4 3 - 2 0 M E T M E T A L C I N TR REL 0,30 

In [13]:
len(extracted_pdf_data)

1658

In [14]:
# Getting the NF number.
nf_pattern = r'(\d{1}\s\d{1}\s\d{1}\s\d{1}\s\d{1}\s\d{1}\s\d{1})'  # Pattern like 0 3 1 9 2 0 3
NF = []

# Compile the regex pattern
pattern = re.compile(nf_pattern)

# Iterate through extracted_pdf_data and replace the matched pattern
for index, line in enumerate(extracted_pdf_data):
    match = pattern.search(line)  # Search for the pattern in the line
    if match:  # If there's a match
        NF.append(match.group().replace(" ", ""))  # Append the matched string to NF (without spaces)
        extracted_pdf_data[index] = pattern.sub("", line, count=1)  # Remove the matched part from the line
        #print_data(extracted_pdf_data)
    else:  # If there's no match
        NF.append("9999-99")  # Add the placeholder value
# Print the result
NF  # List of matched patterns

['0025305',
 '0025305',
 '0107856',
 '0107856',
 '0107857',
 '0107886',
 '0107887',
 '0107890',
 '0107905',
 '0107914',
 '0107950',
 '0109978',
 '0110002',
 '0109977',
 '0109977',
 '0109977',
 '0109977',
 '0109977',
 '0109976',
 '0109959',
 '0109958',
 '0112178',
 '0112177',
 '0112142',
 '0112140',
 '0112137',
 '0112196',
 '0118816',
 '0118818',
 '0118818',
 '0118866',
 '0118866',
 '0118866',
 '0118866',
 '0118888',
 '0118888',
 '0118899',
 '0118902',
 '0118853',
 '0121326',
 '0125813',
 '0125814',
 '0125855',
 '0128225',
 '0128226',
 '0128221',
 '0128263',
 '0112306',
 '0112303',
 '0112302',
 '0112225',
 '0112225',
 '0112225',
 '0112225',
 '0112243',
 '0112244',
 '0112246',
 '0112247',
 '0112268',
 '0114452',
 '0114487',
 '0114453',
 '0114453',
 '0114491',
 '0114490',
 '0114489',
 '0118957',
 '0118979',
 '0118979',
 '0118995',
 '0118938',
 '0118938',
 '0118938',
 '0121372',
 '0121372',
 '0121372',
 '0121378',
 '0121379',
 '0123584',
 '0123585',
 '0123585',
 '0123602',
 '0125959',
 '01

In [15]:
len(NF)

1658
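
The same search, append, and remove loop is repeated for each column in the cells that follow. As a hedged sketch, it could be factored into one helper; `extract_field` and its `strip_only` parameter are hypothetical names, and the helper returns a new list instead of modifying `extracted_pdf_data` in place.

```python
import re


def extract_field(lines, regex, placeholder, strip_only=False):
    """Pull the first match of `regex` out of each line.

    Returns (values, remaining_lines). Lines without a match contribute
    `placeholder` and are passed through unchanged.
    """
    compiled = re.compile(regex)
    values, remaining = [], []
    for line in lines:
        match = compiled.search(line)
        if match is None:
            values.append(placeholder)
            remaining.append(line)
            continue
        text = match.group()
        # Either trim the ends only, or drop every space (for spaced-out digits).
        values.append(text.strip() if strip_only else text.replace(' ', ''))
        remaining.append(line[:match.start()] + line[match.end():])
    return values, remaining


nf_values, rest = extract_field(
    ['0 0 2 5 3 0 5 1 4 1 4 7 8 7 6 - 2 0 G P A N I Z FLANGE'],
    r'\d\s\d\s\d\s\d\s\d\s\d\s\d',
    '9999-99',
)
print(nf_values, rest)
```

Because every call returns one value per input line, the column lists stay the same length as the data, which is what the `len(...)` checks in the notebook verify.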

In [16]:
extracted_pdf_data  # Modified data with matches removed

[' 1 4 1 4 7 8 7 6 - 2 0 G P A N I Z FLANGE OXICORTADO LCG 12,50 NBR 8300 12,5000 90,0000 420,0000 55,45 4,27 780,00 3.542,68',
 ' 1 4 1 4 7 8 7 6 - 3 0 G P A N I Z FLANGE OXICORTADO LCG 9,50 NBR 8300 9,5000 90,0000 360,0000 31,10 4,04 385,00 1.657,10',
 ' 1 1 1 3 7 0 8 1 - 1 0 K E K O CHAPA GR LTQ 9,50 NBR 6656 LNE 38 9,5000 1.200,0000 3.000,0000 0,00 2,84 1.355,00 4.040,61',
 ' 1 1 1 4 7 0 0 2 - 2 0 K E K O CHAPA FF 1,50 NBR5915 EM OL 1,5000 1.200,0000 1.595,0000 0,00 2,64 5.831,00 16.163,53',
 ' 1 1 1 4 5 5 9 2 - 1 0 K E K O CHAPA FQ DEC 2,00 SAE1006 OL 2,0000 1.200,0000 3.000,0000 0,00 2,70 5.192,00 14.719,32',
 ' 1 1 1 4 6 3 2 8 - 2 0 K E K O CHAPA FQ 4,25 SAE1010 4,2500 1.200,0000 3.000,0000 0,00 2,25 6.239,00 14.739,64',
 ' 1 1 1 4 6 1 4 3 - 2 0 M E T M E T A L C I N TR REL 0,30 BOB NBR5007 G2L32 BbOL 0,3000 260,0000 0,0000 0,00 4,50 2.857,00 13.499,33',
 ' 1 1 1 4 7 7 3 7 - 4 0 L U P E M E TIRA FF BOB 2,65 NBR6658 OL 2,6500 66,0000 0,0000 0,00 2,97 1.069,00 3.333,68',
 ' 1 1 1 

In [17]:
# Getting the Est pattern
est_pattern = r'(\s\d{1}\s\d{1}\s)' # 1 1
Est = []

# Compile the regex pattern
pattern = re.compile(est_pattern)

# Iterate through extracted_pdf_data and replace the matched pattern
for index, line in enumerate(extracted_pdf_data):
    match = pattern.search(line)  # Search for the pattern in the line
    if match:  # If there's a match
        Est.append(match.group().replace(" ", ""))  # Append the matched string to Est (without spaces)
        extracted_pdf_data[index] = pattern.sub("", line, count=1)  # Remove the matched part from the line
    else:  # If there's no match
        Est.append("99")  # Add the placeholder value

# Print the result
Est  # List of matched patterns

['14',
 '14',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '14',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '14',
 '14',
 '11',
 '11',
 '11',
 '11',
 '14',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',
 '11',

In [18]:
len(Est)

1658

In [19]:
print(extracted_pdf_data)  # Modified data with matches removed

['1 4 7 8 7 6 - 2 0 G P A N I Z FLANGE OXICORTADO LCG 12,50 NBR 8300 12,5000 90,0000 420,0000 55,45 4,27 780,00 3.542,68', '1 4 7 8 7 6 - 3 0 G P A N I Z FLANGE OXICORTADO LCG 9,50 NBR 8300 9,5000 90,0000 360,0000 31,10 4,04 385,00 1.657,10', '1 3 7 0 8 1 - 1 0 K E K O CHAPA GR LTQ 9,50 NBR 6656 LNE 38 9,5000 1.200,0000 3.000,0000 0,00 2,84 1.355,00 4.040,61', '1 4 7 0 0 2 - 2 0 K E K O CHAPA FF 1,50 NBR5915 EM OL 1,5000 1.200,0000 1.595,0000 0,00 2,64 5.831,00 16.163,53', '1 4 5 5 9 2 - 1 0 K E K O CHAPA FQ DEC 2,00 SAE1006 OL 2,0000 1.200,0000 3.000,0000 0,00 2,70 5.192,00 14.719,32', '1 4 6 3 2 8 - 2 0 K E K O CHAPA FQ 4,25 SAE1010 4,2500 1.200,0000 3.000,0000 0,00 2,25 6.239,00 14.739,64', '1 4 6 1 4 3 - 2 0 M E T M E T A L C I N TR REL 0,30 BOB NBR5007 G2L32 BbOL 0,3000 260,0000 0,0000 0,00 4,50 2.857,00 13.499,33', '1 4 7 7 3 7 - 4 0 L U P E M E TIRA FF BOB 2,65 NBR6658 OL 2,6500 66,0000 0,0000 0,00 2,97 1.069,00 3.333,68', '1 4 9 6 8 6 - 1 0 L A I N D CHAPA FF 1,20 NBR6658 OL 1,

In [20]:
# Getting the Order number
order_pattern = r'(\d\s\d\s\d?\s?\d?\s?\d?\s?\d?\s?\-\s\d\s\d\s?\d?)'
Order = []

# Compile the regex pattern
pattern = re.compile(order_pattern)

# Iterate through extracted_pdf_data and replace the matched pattern
for index, line in enumerate(extracted_pdf_data):
    match = pattern.search(line)  # Search for the pattern in the line
   
    if match:  # If there's a match
        Order.append(match.group().replace(" ", ""))  # Append the matched string to Order (without spaces)
        extracted_pdf_data[index] = pattern.sub("", line, count=1)  # Remove the matched part from the line
    else:  # If there's no match
        Order.append("9999-99")  # Add the placeholder value
    
# Print the result
Order  # List of matched patterns

['147876-20',
 '147876-30',
 '137081-10',
 '147002-20',
 '145592-10',
 '146328-20',
 '146143-20',
 '147737-40',
 '149686-10',
 '149250-10',
 '144699-20',
 '149309-10',
 '151326-10',
 '152066-10',
 '152066-20',
 '152066-40',
 '152066-180',
 '152066-270',
 '151003-30',
 '150924-10',
 '152164-10',
 '150674-40',
 '149938-20',
 '151246-30',
 '151326-30',
 '149937-20',
 '153672-10',
 '160135-30',
 '154650-20',
 '154650-30',
 '153892-10',
 '153891-20',
 '153893-30',
 '153888-40',
 '156241-10',
 '156244-20',
 '156241-10',
 '149136-10',
 '149946-10',
 '162097-10',
 '165382-10',
 '165998-10',
 '167894-10',
 '168953-10',
 '169373-20',
 '169374-10',
 '165339-40',
 '152066-240',
 '151454-70',
 '151003-50',
 '152986-10',
 '153145-70',
 '153156-70',
 '153158-70',
 '152595-10',
 '152595-10',
 '153902-10',
 '152071-20',
 '152595-10',
 '156216-10',
 '155292-50',
 '156419-20',
 '156419-50',
 '154626-20',
 '153947-40',
 '153956-10',
 '156230-20',
 '160518-20',
 '160518-40',
 '159312-10',
 '160135-10',
 '1

In [21]:
len(Order)

1658

In [22]:
print(extracted_pdf_data)  # Modified data with matches removed

['G P A N I Z FLANGE OXICORTADO LCG 12,50 NBR 8300 12,5000 90,0000 420,0000 55,45 4,27 780,00 3.542,68', 'G P A N I Z FLANGE OXICORTADO LCG 9,50 NBR 8300 9,5000 90,0000 360,0000 31,10 4,04 385,00 1.657,10', 'K E K O CHAPA GR LTQ 9,50 NBR 6656 LNE 38 9,5000 1.200,0000 3.000,0000 0,00 2,84 1.355,00 4.040,61', 'K E K O CHAPA FF 1,50 NBR5915 EM OL 1,5000 1.200,0000 1.595,0000 0,00 2,64 5.831,00 16.163,53', 'K E K O CHAPA FQ DEC 2,00 SAE1006 OL 2,0000 1.200,0000 3.000,0000 0,00 2,70 5.192,00 14.719,32', 'K E K O CHAPA FQ 4,25 SAE1010 4,2500 1.200,0000 3.000,0000 0,00 2,25 6.239,00 14.739,64', 'M E T M E T A L C I N TR REL 0,30 BOB NBR5007 G2L32 BbOL 0,3000 260,0000 0,0000 0,00 4,50 2.857,00 13.499,33', 'L U P E M E TIRA FF BOB 2,65 NBR6658 OL 2,6500 66,0000 0,0000 0,00 2,97 1.069,00 3.333,68', 'L A I N D CHAPA FF 1,20 NBR6658 OL 1,2000 1.000,0000 2.700,0000 0,00 2,55 1.614,00 4.321,49', 'G P A N I Z CHAPA FQ 4,75 NBR6658 4,7500 1.200,0000 3.000,0000 0,00 2,14 2.961,00 6.846,61', 'M E T A L 

In [23]:
# Getting Client
keywords = ['TIRA', 'CHAPA', 'PECA', 'FLANGE', 'TELHA', 'TR', 'PERFIL', 'BOB', 'BLANK', 'FITA', 'CH', 'CUMEEIRA', 'TUBO', 'PONTA']
client = r'([A-Z0-9\.\'\"\/\s\-]+?)(?=\s(?:' + '|'.join(map(re.escape, keywords)) + r'))'

Client = []

# Compile the regex pattern
pattern = re.compile(client)

# Iterate through extracted_pdf_data and replace the matched pattern
for index, line in enumerate(extracted_pdf_data):
    match = pattern.search(line)  # Search for the pattern in the line
    if match:  # If there's a match
        Client.append(match.group().replace(" ", ""))  # Append the matched string to Client (without spaces)
        extracted_pdf_data[index] = pattern.sub("", line, count=1)  # Remove the matched part from the line
    else:  # If there's no match
        Client.append("NO CLIENT")  # Add the placeholder value

# Print the result
Client  # List of matched patterns

['GPANIZ',
 'GPANIZ',
 'KEKO',
 'KEKO',
 'KEKO',
 'KEKO',
 'METMETALCIN',
 'LUPEME',
 'LAIND',
 'GPANIZ',
 'METALMATRIX',
 'GS',
 'METSISTEM',
 'GPANIZIND',
 'GPANIZIND',
 'GPANIZIND',
 'GPANIZIND',
 'GPANIZIND',
 'GPANIZ',
 'MAQENGE',
 'PLASLINK',
 'METMETALCIN',
 'VECTOR',
 'MAQENGE',
 'METSISTEM',
 'VECTOR',
 'KAE',
 'GPANIZIND',
 'INCORPOL',
 'INCORPOL',
 'METMETALCIN',
 'METMETALCIN',
 'METMETALCIN',
 'METMETALCIN',
 'FLANTECH',
 'FLANTECH',
 'FLANTECH',
 'SASPLASTI',
 'VECTOR',
 'COPRIMA',
 'VECTOR',
 'POMMIER',
 'CANELLO',
 'METALMATRIX',
 'ACOGRATO',
 'JMARCONIND',
 'VECTOR',
 'GPANIZIND',
 'GPANIZ',
 'GPANIZ',
 'GPANIZ',
 'GPANIZ',
 'GPANIZ',
 'GPANIZ',
 'FLANTECH',
 'FLANTECH',
 'RODAROSIND',
 'RODAROSIND2',
 'FLANTECH',
 'GPANIZ',
 'KEKO',
 'GPANIZ',
 'GPANIZ',
 'CIRNA',
 'MAQENGE',
 'LUZI',
 'KEKO',
 'GPANIZ',
 'GPANIZ',
 'METCOSBI',
 'GPANIZIND',
 'GPANIZIND',
 'GPANIZIND',
 'GPANIZ',
 'GPANIZ',
 'GPANIZ',
 'GPANIZ',
 'GPANIZIND',
 'MAQENGE',
 'METMETALCIN',
 'METMETALCIN'

In [24]:
len(Client)

1658
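
To see how the client pattern works on a single line, the sketch below applies the same lazy match plus lookahead to two sample lines from the output above. It assumes a shortened keyword list, and `split_client` is a hypothetical wrapper: the lazy group grows one character at a time until the text that follows is whitespace plus one of the product keywords.

```python
import re

# Shortened, illustrative subset of the notebook's keyword list.
KEYWORDS = ['TIRA', 'CHAPA', 'FLANGE', 'TR']

CLIENT_REGEX = re.compile(
    r'([A-Z0-9\.\'\"\/\s\-]+?)(?=\s(?:' + '|'.join(map(re.escape, KEYWORDS)) + r'))'
)


def split_client(line):
    """Return (client_name_without_spaces, rest_of_line)."""
    match = CLIENT_REGEX.search(line)
    if match is None:
        return 'NO CLIENT', line
    return match.group().replace(' ', ''), line[match.end():]


print(split_client('G P A N I Z FLANGE OXICORTADO LCG 12,50 NBR 8300'))
print(split_client('M E T M E T A L C I N TR REL 0,30 BOB NBR5007'))
```

Note that the spaced-out letters of a client name (for example `T R`) do not trigger the `TR` keyword, because the lookahead requires the keyword's letters to be contiguous.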

In [25]:
print(extracted_pdf_data)  # Modified data with matches removed

[' FLANGE OXICORTADO LCG 12,50 NBR 8300 12,5000 90,0000 420,0000 55,45 4,27 780,00 3.542,68', ' FLANGE OXICORTADO LCG 9,50 NBR 8300 9,5000 90,0000 360,0000 31,10 4,04 385,00 1.657,10', ' CHAPA GR LTQ 9,50 NBR 6656 LNE 38 9,5000 1.200,0000 3.000,0000 0,00 2,84 1.355,00 4.040,61', ' CHAPA FF 1,50 NBR5915 EM OL 1,5000 1.200,0000 1.595,0000 0,00 2,64 5.831,00 16.163,53', ' CHAPA FQ DEC 2,00 SAE1006 OL 2,0000 1.200,0000 3.000,0000 0,00 2,70 5.192,00 14.719,32', ' CHAPA FQ 4,25 SAE1010 4,2500 1.200,0000 3.000,0000 0,00 2,25 6.239,00 14.739,64', ' TR REL 0,30 BOB NBR5007 G2L32 BbOL 0,3000 260,0000 0,0000 0,00 4,50 2.857,00 13.499,33', ' TIRA FF BOB 2,65 NBR6658 OL 2,6500 66,0000 0,0000 0,00 2,97 1.069,00 3.333,68', ' CHAPA FF 1,20 NBR6658 OL 1,2000 1.000,0000 2.700,0000 0,00 2,55 1.614,00 4.321,49', ' CHAPA FQ 4,75 NBR6658 4,7500 1.200,0000 3.000,0000 0,00 2,14 2.961,00 6.846,61', ' TIRA ZC BOB 1,55 NBR7008 ZC CR MI REV 1,5500 20,0000 0,0000 0,00 3,19 2.284,00 7.650,26', ' TIRA ZC BOB 0,95 NB

In [26]:
# Getting Esp
esp = r'(\s\d+\,\d{4})'
Esp = []

# Compile the regex pattern
pattern = re.compile(esp)

# Iterate through extracted_pdf_data and replace the matched pattern
for index, line in enumerate(extracted_pdf_data):
    match = pattern.search(line)  # Search for the pattern in the line
    if match:  # If there's a match
        Esp.append(match.group().strip())  # Append the matched string to Esp (without spaces)
        extracted_pdf_data[index] = pattern.sub("", line, count=1)  # Remove the matched part from the line
    else:  # If there's no match
        Esp.append("9,9999")  # Add the placeholder value

# Print the result
Esp  # List of matched patterns

['12,5000',
 '9,5000',
 '9,5000',
 '1,5000',
 '2,0000',
 '4,2500',
 '0,3000',
 '2,6500',
 '1,2000',
 '4,7500',
 '1,5500',
 '0,9500',
 '0,9000',
 '0,5000',
 '0,5000',
 '1,5000',
 '1,5000',
 '1,5000',
 '2,6500',
 '0,6000',
 '1,2500',
 '0,4000',
 '0,9000',
 '0,6000',
 '4,2500',
 '0,9000',
 '0,8000',
 '2,0000',
 '3,0000',
 '4,2500',
 '0,3500',
 '0,4000',
 '0,3000',
 '0,3000',
 '9,5000',
 '9,5000',
 '9,5000',
 '1,5000',
 '0,9000',
 '1,5500',
 '0,9000',
 '2,0000',
 '3,0000',
 '1,5000',
 '1,5000',
 '1,9000',
 '0,9000',
 '4,7500',
 '1,5000',
 '2,6500',
 '1,0600',
 '1,0600',
 '1,0600',
 '1,0600',
 '9,5000',
 '9,5000',
 '3,7500',
 '12,5000',
 '9,5000',
 '12,5000',
 '9,5000',
 '1,5000',
 '2,6500',
 '3,0000',
 '0,6000',
 '1,2000',
 '4,2500',
 '2,0000',
 '2,6500',
 '4,7500',
 '1,2000',
 '2,0000',
 '2,0000',
 '4,7500',
 '6,3000',
 '2,0000',
 '0,5000',
 '2,0000',
 '0,6000',
 '0,3000',
 '0,3500',
 '0,9000',
 '4,0000',
 '1,9000',
 '12,5000',
 '6,3000',
 '1,5000',
 '1,5000',
 '0,6000',
 '0,9000',
 '1,50

In [27]:
len(Esp)

1658

In [28]:
print(extracted_pdf_data)  # Modified data with matches removed

[' FLANGE OXICORTADO LCG 12,50 NBR 8300 90,0000 420,0000 55,45 4,27 780,00 3.542,68', ' FLANGE OXICORTADO LCG 9,50 NBR 8300 90,0000 360,0000 31,10 4,04 385,00 1.657,10', ' CHAPA GR LTQ 9,50 NBR 6656 LNE 38 1.200,0000 3.000,0000 0,00 2,84 1.355,00 4.040,61', ' CHAPA FF 1,50 NBR5915 EM OL 1.200,0000 1.595,0000 0,00 2,64 5.831,00 16.163,53', ' CHAPA FQ DEC 2,00 SAE1006 OL 1.200,0000 3.000,0000 0,00 2,70 5.192,00 14.719,32', ' CHAPA FQ 4,25 SAE1010 1.200,0000 3.000,0000 0,00 2,25 6.239,00 14.739,64', ' TR REL 0,30 BOB NBR5007 G2L32 BbOL 260,0000 0,0000 0,00 4,50 2.857,00 13.499,33', ' TIRA FF BOB 2,65 NBR6658 OL 66,0000 0,0000 0,00 2,97 1.069,00 3.333,68', ' CHAPA FF 1,20 NBR6658 OL 1.000,0000 2.700,0000 0,00 2,55 1.614,00 4.321,49', ' CHAPA FQ 4,75 NBR6658 1.200,0000 3.000,0000 0,00 2,14 2.961,00 6.846,61', ' TIRA ZC BOB 1,55 NBR7008 ZC CR MI REV 20,0000 0,0000 0,00 3,19 2.284,00 7.650,26', ' TIRA ZC BOB 0,95 NBR7008 ZC CR MI REV 32,0000 0,0000 0,00 3,18 524,00 1.782,77', ' CHAPA FF 0,90 

In [29]:
# Getting Larg
larg = r'(\b\d{1,3}(?:\.\d{3})*(?:,\d{4})\b)'
Larg = []

# Compile the regex pattern
pattern = re.compile(larg)

# Iterate through extracted_pdf_data and replace the matched pattern
for index, line in enumerate(extracted_pdf_data):
    match = pattern.search(line)  # Search for the pattern in the line
    if match:  # If there's a match
        Larg.append(match.group().strip())  # Append the matched string to Larg (without surrounding spaces)
        extracted_pdf_data[index] = pattern.sub("", line, count=1)  # Remove the matched part from the line
    else:  # If there's no match
        Larg.append("9.999,9999")  # Add the placeholder value

# Print the result
Larg  # List of matched patterns

['90,0000',
 '90,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '260,0000',
 '66,0000',
 '1.000,0000',
 '1.200,0000',
 '20,0000',
 '32,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '354,0000',
 '240,0000',
 '260,0000',
 '411,0000',
 '354,0000',
 '1.500,0000',
 '411,0000',
 '37,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '260,0000',
 '260,0000',
 '260,0000',
 '260,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '56,7000',
 '630,0000',
 '155,0000',
 '630,0000',
 '115,0000',
 '1.000,0000',
 '45,0000',
 '55,0000',
 '85,0000',
 '411,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '1.000,0000',
 '287,0000',
 '213,0000',
 '55,0000',
 '1.200,0000',
 '1.200,0000',
 '1.200,0000',
 '55,0000',
 '1.000,0000',
 '1.000

In [30]:
len(Larg)

1658

In [31]:
print(extracted_pdf_data)  # Modified data with matches removed

[' FLANGE OXICORTADO LCG 12,50 NBR 8300  420,0000 55,45 4,27 780,00 3.542,68', ' FLANGE OXICORTADO LCG 9,50 NBR 8300  360,0000 31,10 4,04 385,00 1.657,10', ' CHAPA GR LTQ 9,50 NBR 6656 LNE 38  3.000,0000 0,00 2,84 1.355,00 4.040,61', ' CHAPA FF 1,50 NBR5915 EM OL  1.595,0000 0,00 2,64 5.831,00 16.163,53', ' CHAPA FQ DEC 2,00 SAE1006 OL  3.000,0000 0,00 2,70 5.192,00 14.719,32', ' CHAPA FQ 4,25 SAE1010  3.000,0000 0,00 2,25 6.239,00 14.739,64', ' TR REL 0,30 BOB NBR5007 G2L32 BbOL  0,0000 0,00 4,50 2.857,00 13.499,33', ' TIRA FF BOB 2,65 NBR6658 OL  0,0000 0,00 2,97 1.069,00 3.333,68', ' CHAPA FF 1,20 NBR6658 OL  2.700,0000 0,00 2,55 1.614,00 4.321,49', ' CHAPA FQ 4,75 NBR6658  3.000,0000 0,00 2,14 2.961,00 6.846,61', ' TIRA ZC BOB 1,55 NBR7008 ZC CR MI REV  0,0000 0,00 3,19 2.284,00 7.650,26', ' TIRA ZC BOB 0,95 NBR7008 ZC CR MI REV  0,0000 0,00 3,18 524,00 1.782,77', ' CHAPA FF 0,90 NBR6658 OL  3.000,0000 0,00 2,70 2.101,00 5.956,34', ' CHAPA ZC 0,50 NBR7008 ZC CR NO RV Z275  3.000,00

In [32]:
# Getting Comp
comp = r'(\b\d{1,3}(?:\.\d{3})*(?:,\d{4})\b)'
#larg = r'(\b\d{1,3}(?:\.\d{3})*(?:,\d{4})\b)'
Comp = []

# Compile the regex pattern
pattern = re.compile(comp)

# Iterate through extracted_pdf_data and replace the matched pattern
for index, line in enumerate(extracted_pdf_data):
    match = pattern.search(line)  # Search for the pattern in the line
    if match:  # If there's a match
        Comp.append(match.group().strip())  # Append the matched string to Comp (without surrounding spaces)
        extracted_pdf_data[index] = pattern.sub("", line, count=1)  # Remove the matched part from the line
    else:  # If there's no match
        Comp.append("9.999,9999")  # Add the placeholder value

# Print the result
Comp  # List of matched patterns

['420,0000',
 '360,0000',
 '3.000,0000',
 '1.595,0000',
 '3.000,0000',
 '3.000,0000',
 '0,0000',
 '0,0000',
 '2.700,0000',
 '3.000,0000',
 '0,0000',
 '0,0000',
 '3.000,0000',
 '3.000,0000',
 '3.000,0000',
 '3.000,0000',
 '3.000,0000',
 '3.000,0000',
 '3.000,0000',
 '0,0000',
 '0,0000',
 '0,0000',
 '914,0000',
 '0,0000',
 '3.000,0000',
 '914,0000',
 '0,0000',
 '3.000,0000',
 '3.000,0000',
 '3.000,0000',
 '0,0000',
 '0,0000',
 '0,0000',
 '0,0000',
 '2.000,0000',
 '2.000,0000',
 '2.000,0000',
 '0,0000',
 '1.121,0000',
 '0,0000',
 '1.121,0000',
 '0,0000',
 '1.820,0000',
 '0,0000',
 '0,0000',
 '0,0000',
 '914,0000',
 '3.000,0000',
 '3.000,0000',
 '3.000,0000',
 '3.000,0000',
 '3.000,0000',
 '3.000,0000',
 '3.000,0000',
 '2.000,0000',
 '2.000,0000',
 '3.000,0000',
 '3.100,0000',
 '2.000,0000',
 '3.000,0000',
 '3.000,0000',
 '3.000,0000',
 '3.000,0000',
 '0,0000',
 '0,0000',
 '0,0000',
 '3.000,0000',
 '3.000,0000',
 '3.000,0000',
 '0,0000',
 '3.000,0000',
 '3.000,0000',
 '3.000,0000',
 '3.000

In [33]:
len(Comp)

1658
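
The thickness, width, and length values are captured as strings in Brazilian number format, with `.` as the thousands separator and `,` as the decimal separator. Before any numeric analysis they would need converting; a minimal sketch, assuming that format holds for every value (`parse_br_number` is a hypothetical helper):

```python
def parse_br_number(text):
    """Convert a string such as '1.200,0000' to the float 1200.0."""
    # Drop thousands separators first, then switch the decimal comma to a point.
    return float(text.strip().replace('.', '').replace(',', '.'))


widths = ['90,0000', '1.200,0000', '56,7000']
print([parse_br_number(value) for value in widths])
```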

In [34]:
print(extracted_pdf_data)  # Modified data with matches removed

[' FLANGE OXICORTADO LCG 12,50 NBR 8300   55,45 4,27 780,00 3.542,68', ' FLANGE OXICORTADO LCG 9,50 NBR 8300   31,10 4,04 385,00 1.657,10', ' CHAPA GR LTQ 9,50 NBR 6656 LNE 38   0,00 2,84 1.355,00 4.040,61', ' CHAPA FF 1,50 NBR5915 EM OL   0,00 2,64 5.831,00 16.163,53', ' CHAPA FQ DEC 2,00 SAE1006 OL   0,00 2,70 5.192,00 14.719,32', ' CHAPA FQ 4,25 SAE1010   0,00 2,25 6.239,00 14.739,64', ' TR REL 0,30 BOB NBR5007 G2L32 BbOL   0,00 4,50 2.857,00 13.499,33', ' TIRA FF BOB 2,65 NBR6658 OL   0,00 2,97 1.069,00 3.333,68', ' CHAPA FF 1,20 NBR6658 OL   0,00 2,55 1.614,00 4.321,49', ' CHAPA FQ 4,75 NBR6658   0,00 2,14 2.961,00 6.846,61', ' TIRA ZC BOB 1,55 NBR7008 ZC CR MI REV   0,00 3,19 2.284,00 7.650,26', ' TIRA ZC BOB 0,95 NBR7008 ZC CR MI REV   0,00 3,18 524,00 1.782,77', ' CHAPA FF 0,90 NBR6658 OL   0,00 2,70 2.101,00 5.956,34', ' CHAPA ZC 0,50 NBR7008 ZC CR NO RV Z275   0,00 3,29 1.455,00 5.121,80', ' CHAPA ZC 0,50 NBR7008 ZC CR NO RV Z275   0,00 3,29 1.450,00 5.104,20', ' CHAPA FF 1,5

In [35]:
# Getting the price per piece
#piece_price = r'\s(\d{1,3},\d{2})'
piece_price = r'(?<!<=\s)(\d{1,3},\d{2})'  # The lookbehind only rejects text preceded by '<=' plus whitespace, so it is unlikely to exclude anything here
Piece_price = []

# Compile the regex pattern
pattern = re.compile(piece_price)

# Iterate through extracted_pdf_data and take the second price-like match of each line
for index, line in enumerate(extracted_pdf_data):
    matches = list(pattern.finditer(line))  # Find all matches in the line
    match = matches[1] if len(matches) > 1 else None  # Avoid an IndexError when there are fewer than two matches
    if match is None:  # Fewer than two matches
        Piece_price.append("0,00")  # Add the placeholder value
    elif "PONTA" in line:
        Piece_price.append(match.group().strip())  # Append the match to Piece_price; the line is left unchanged
    else:
        Piece_price.append(match.group(1).strip())  # Append the match to Piece_price
        extracted_pdf_data[index] = line[:match.start()] + line[match.end():] # Remove the matched part from the line

# Print the result


Piece_price  # List of matched patterns

['55,45',
 '31,10',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '67,92',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00',
 '0,00'

In [36]:
len(Piece_price)

1658
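
The extraction cells in this notebook share one shape: find every match on a line, take the n-th one, store it, and cut it out of the line. A minimal sketch of that shape as a single helper follows. `pop_nth_match` and its parameters are hypothetical names, not part of the notebook, and the sample lines are copied from the output above. The special handling of lines containing "PONTA" is left out.

```python
import re

def pop_nth_match(lines, pattern, n=1, placeholder="0,00"):
    """Collect the n-th (0-based) match of `pattern` from each line and
    remove it from that line in place. Hypothetical helper for illustration."""
    regex = re.compile(pattern)
    values = []
    for index, line in enumerate(lines):
        matches = list(regex.finditer(line))
        if len(matches) > n:
            match = matches[n]
            values.append(match.group())
            # Cut the matched text out of the line
            lines[index] = line[:match.start()] + line[match.end():]
        else:
            # Not enough matches: keep the output aligned with the input
            values.append(placeholder)
    return values

sample = [
    ' FLANGE OXICORTADO LCG 12,50 NBR 8300   55,45 4,27 780,00 3.542,68',
    ' CHAPA FQ 4,25 SAE1010   0,00 2,25 6.239,00 14.739,64',
]
piece_prices = pop_nth_match(sample, r'\d{1,3},\d{2}')
print(piece_prices)
print(sample[0])
```

Because the helper always returns one value per input line, the resulting list keeps the same length as the other columns.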

In [37]:
print(extracted_pdf_data)  # Modified data with matches removed

[' FLANGE OXICORTADO LCG 12,50 NBR 8300    4,27 780,00 3.542,68', ' FLANGE OXICORTADO LCG 9,50 NBR 8300    4,04 385,00 1.657,10', ' CHAPA GR LTQ 9,50 NBR 6656 LNE 38    2,84 1.355,00 4.040,61', ' CHAPA FF 1,50 NBR5915 EM OL    2,64 5.831,00 16.163,53', ' CHAPA FQ DEC 2,00 SAE1006 OL    2,70 5.192,00 14.719,32', ' CHAPA FQ 4,25 SAE1010    2,25 6.239,00 14.739,64', ' TR REL 0,30 BOB NBR5007 G2L32 BbOL    4,50 2.857,00 13.499,33', ' TIRA FF BOB 2,65 NBR6658 OL    2,97 1.069,00 3.333,68', ' CHAPA FF 1,20 NBR6658 OL    2,55 1.614,00 4.321,49', ' CHAPA FQ 4,75 NBR6658    2,14 2.961,00 6.846,61', ' TIRA ZC BOB 1,55 NBR7008 ZC CR MI REV    3,19 2.284,00 7.650,26', ' TIRA ZC BOB 0,95 NBR7008 ZC CR MI REV    3,18 524,00 1.782,77', ' CHAPA FF 0,90 NBR6658 OL    2,70 2.101,00 5.956,34', ' CHAPA ZC 0,50 NBR7008 ZC CR NO RV Z275    3,29 1.455,00 5.121,80', ' CHAPA ZC 0,50 NBR7008 ZC CR NO RV Z275    3,29 1.450,00 5.104,20', ' CHAPA FF 1,50 NBR6658 OL    2,53 1.496,00 4.072,32', ' CHAPA FF 1,50 NBR66

In [38]:
# Getting the price per kg
kg_price = r'(\d{1,3},\d{2})'  # decimal number such as '4,27'
KG_price = []

# Compile the regex pattern
pattern = re.compile(kg_price)

# On lines whose piece price was removed in the previous step,
# the second match is now the price per kg.
for index, line in enumerate(extracted_pdf_data):
    matches = list(pattern.finditer(line))  # Find all matches in the line

    if len(matches) > 1:  # Check if there are at least two matches
        match = matches[1]  # Get the second match
        KG_price.append(match.group(1))  # Append the matched string to KG_price
        # Remove the matched part from the line
        extracted_pdf_data[index] = line[:match.start()] + line[match.end():]
    else:  # If there are fewer than two matches
        KG_price.append("0,00")  # Add the placeholder value

# Display the result
KG_price  # List of matched values

['4,27',
 '4,04',
 '2,84',
 '2,64',
 '2,70',
 '2,25',
 '4,50',
 '2,97',
 '2,55',
 '2,14',
 '3,19',
 '3,18',
 '2,70',
 '3,29',
 '3,29',
 '2,53',
 '2,53',
 '2,53',
 '2,59',
 '2,85',
 '3,15',
 '4,82',
 '2,85',
 '2,85',
 '2,25',
 '2,85',
 '5,35',
 '2,54',
 '2,45',
 '2,37',
 '5,16',
 '5,16',
 '5,16',
 '5,16',
 '2,39',
 '2,39',
 '2,39',
 '3,08',
 '3,02',
 '3,15',
 '3,02',
 '3,05',
 '2,60',
 '3,17',
 '3,56',
 '3,31',
 '3,20',
 '2,27',
 '2,53',
 '2,59',
 '2,53',
 '2,53',
 '2,53',
 '2,53',
 '2,23',
 '2,23',
 '2,70',
 '2,23',
 '2,23',
 '2,43',
 '3,01',
 '2,68',
 '2,81',
 '3,16',
 '3,05',
 '3,00',
 '2,40',
 '2,82',
 '2,77',
 '3,06',
 '2,68',
 '2,54',
 '2,54',
 '2,43',
 '2,43',
 '2,82',
 '5,36',
 '2,54',
 '2,98',
 '5,16',
 '5,16',
 '3,02',
 '3,70',
 '3,10',
 '5,22',
 '2,46',
 '3,56',
 '3,56',
 '2,98',
 '3,20',
 '3,05',
 '2,50',
 '3,16',
 '3,16',
 '2,15',
 '2,50',
 '2,85',
 '2,29',
 '2,45',
 '4,82',
 '4,82',
 '2,29',
 '2,53',
 '2,39',
 '2,80',
 '3,02',
 '3,02',
 '3,02',
 '2,43',
 '2,43',
 '2,43',
 

In [39]:
len(KG_price)

1658
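
The two columns still left on each line, quantity and total, can carry a `.` thousands separator, which the two-decimal pattern used so far does not cover. The sketch below compares both patterns on one line taken from the notebook output (as it looks once both prices are removed). The names `narrow` and `wide` are only illustrative.

```python
import re

line = ' CHAPA GR LTQ 9,50 NBR 6656 LNE 38     1.355,00 4.040,61'

narrow = re.compile(r'\d{1,3},\d{2}')             # no thousands separator
wide = re.compile(r'\d{1,3}(?:\.\d{3})*,\d{2}')   # optional '.ddd' groups

narrow_hits = narrow.findall(line)
wide_hits = wide.findall(line)
print(narrow_hits)  # ['9,50', '355,00', '040,61']
print(wide_hits)    # ['9,50', '1.355,00', '4.040,61']
```

The narrow pattern clips `1.355,00` to `355,00`, so values of one thousand or more need the wider pattern.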

In [40]:
print(extracted_pdf_data)  # Modified data with matches removed

[' FLANGE OXICORTADO LCG 12,50 NBR 8300     780,00 3.542,68', ' FLANGE OXICORTADO LCG 9,50 NBR 8300     385,00 1.657,10', ' CHAPA GR LTQ 9,50 NBR 6656 LNE 38     1.355,00 4.040,61', ' CHAPA FF 1,50 NBR5915 EM OL     5.831,00 16.163,53', ' CHAPA FQ DEC 2,00 SAE1006 OL     5.192,00 14.719,32', ' CHAPA FQ 4,25 SAE1010     6.239,00 14.739,64', ' TR REL 0,30 BOB NBR5007 G2L32 BbOL     2.857,00 13.499,33', ' TIRA FF BOB 2,65 NBR6658 OL     1.069,00 3.333,68', ' CHAPA FF 1,20 NBR6658 OL     1.614,00 4.321,49', ' CHAPA FQ 4,75 NBR6658     2.961,00 6.846,61', ' TIRA ZC BOB 1,55 NBR7008 ZC CR MI REV     2.284,00 7.650,26', ' TIRA ZC BOB 0,95 NBR7008 ZC CR MI REV     524,00 1.782,77', ' CHAPA FF 0,90 NBR6658 OL     2.101,00 5.956,34', ' CHAPA ZC 0,50 NBR7008 ZC CR NO RV Z275     1.455,00 5.121,80', ' CHAPA ZC 0,50 NBR7008 ZC CR NO RV Z275     1.450,00 5.104,20', ' CHAPA FF 1,50 NBR6658 OL     1.496,00 4.072,32', ' CHAPA FF 1,50 NBR6658 OL     1.548,00 4.213,87', ' CHAPA FF 1,50 NBR6658 OL     1.5

In [44]:
# # Regex to match prices with groups
# quantity = r'(\d{1,3}(?:\.\d{3})*,\d{2})'

# # Compile the regex pattern
# pattern = re.compile(quantity)

# # Iterate through extracted_pdf_data
# for index, line in enumerate(extracted_pdf_data):
#     print(f"Line {index + 1}: {line}")  # Print current line

#     # Find all matches in the line
#     matches = list(pattern.finditer(line))
#     if matches:  # If matches are found
#         for match_index, match in enumerate(matches, start=1):
#             print(f"  Match {match_index}: {match.group(1)}")  # Print the matched group
#     else:  # If no matches
#         print("  No matches found.")


In [45]:
# Getting the quantity
# The pattern allows '.' thousands separators, e.g. '1.355,00'
quantity = r'(\d{1,3}(?:\.\d{3})*,\d{2})'
Quantity = []

# Compile the regex pattern
pattern = re.compile(quantity)

# With both prices removed, the second match on each line is the quantity
for index, line in enumerate(extracted_pdf_data):
    matches = list(pattern.finditer(line))  # Get all matches

    if len(matches) >= 2:  # Check if at least two matches are found
        second_match = matches[1]  # Get the second match
        start, end = second_match.start(), second_match.end()  # Start and end positions of the second match
        # Remove the second match using slicing
        updated_line = line[:start] + line[end:]
        Quantity.append(second_match.group())  # Append the matched string to Quantity
        extracted_pdf_data[index] = updated_line.strip()  # Update the line in the list
    else:  # If there are fewer than two matches
        Quantity.append("0.000,00")  # Add the placeholder value

# Display the result
Quantity  # List of matched values

['780,00',
 '385,00',
 '1.355,00',
 '5.831,00',
 '5.192,00',
 '6.239,00',
 '2.857,00',
 '1.069,00',
 '1.614,00',
 '2.961,00',
 '2.284,00',
 '524,00',
 '2.101,00',
 '1.455,00',
 '1.450,00',
 '1.496,00',
 '1.548,00',
 '1.542,00',
 '2.802,00',
 '1.565,00',
 '2.615,00',
 '2.085,00',
 '2.082,00',
 '2.150,00',
 '899,00',
 '2.071,00',
 '964,00',
 '1.004,00',
 '4.577,00',
 '2.467,00',
 '1.919,00',
 '2.761,00',
 '1.863,00',
 '2.894,00',
 '2.017,00',
 '23.820,00',
 '8.056,00',
 '880,00',
 '5.711,00',
 '3.133,00',
 '5.505,00',
 '1.106,00',
 '5.299,00',
 '1.152,00',
 '665,00',
 '1.767,00',
 '3.187,00',
 '1.485,00',
 '2.116,00',
 '2.770,00',
 '3.918,00',
 '1.608,00',
 '1.146,00',
 '1.628,00',
 '14.087,00',
 '13.559,00',
 '2.022,00',
 '9.331,00',
 '8.058,00',
 '1.080,00',
 '3.044,00',
 '4.000,00',
 '3.160,00',
 '2.691,00',
 '3.810,00',
 '4.152,00',
 '5.391,00',
 '3.836,00',
 '3.746,00',
 '1.011,00',
 '1.540,00',
 '1.607,00',
 '601,00',
 '2.570,00',
 '2.923,00',
 '3.736,00',
 '1.022,00',
 '2.053,00',

In [46]:
len(Quantity)

1658
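
All extracted values are still strings in Brazilian number format (`.` for thousands, `,` for decimals). A minimal sketch of a conversion that a later analysis step might use is shown below. `br_to_decimal` is a hypothetical helper, and the sample values are taken from the `Quantity` output above.

```python
from decimal import Decimal

def br_to_decimal(text):
    """Convert a Brazilian-formatted number such as '1.355,00' to Decimal.
    Hypothetical helper for illustration."""
    # Drop thousands separators first, then switch the decimal comma to a point
    return Decimal(text.replace('.', '').replace(',', '.'))

quantities = ['780,00', '1.355,00', '23.820,00']
values = [br_to_decimal(q) for q in quantities]
print(values)
print(sum(values))  # 25955.00
```

`Decimal` is used here instead of `float` so that the two decimal places survive exactly; `float` would also work for rough analysis.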

In [47]:
print(extracted_pdf_data)  # Modified data with matches removed

['FLANGE OXICORTADO LCG 12,50 NBR 8300      3.542,68', 'FLANGE OXICORTADO LCG 9,50 NBR 8300      1.657,10', 'CHAPA GR LTQ 9,50 NBR 6656 LNE 38      4.040,61', 'CHAPA FF 1,50 NBR5915 EM OL      16.163,53', 'CHAPA FQ DEC 2,00 SAE1006 OL      14.719,32', 'CHAPA FQ 4,25 SAE1010      14.739,64', 'TR REL 0,30 BOB NBR5007 G2L32 BbOL      13.499,33', 'TIRA FF BOB 2,65 NBR6658 OL      3.333,68', 'CHAPA FF 1,20 NBR6658 OL      4.321,49', 'CHAPA FQ 4,75 NBR6658      6.846,61', 'TIRA ZC BOB 1,55 NBR7008 ZC CR MI REV      7.650,26', 'TIRA ZC BOB 0,95 NBR7008 ZC CR MI REV      1.782,77', 'CHAPA FF 0,90 NBR6658 OL      5.956,34', 'CHAPA ZC 0,50 NBR7008 ZC CR NO RV Z275      5.121,80', 'CHAPA ZC 0,50 NBR7008 ZC CR NO RV Z275      5.104,20', 'CHAPA FF 1,50 NBR6658 OL      4.072,32', 'CHAPA FF 1,50 NBR6658 OL      4.213,87', 'CHAPA FF 1,50 NBR6658 OL      4.197,56', 'CHAPA FQ DEC 2,65 NBR6658 OL      7.805,11', 'TIRA FF BOB 0,60 NBR6658 OL      4.683,26', 'TIRA ZC BOB 1,25 NBR7008 ZC EXTRAG GI      8.64

In [48]:
# Getting the total
# The pattern allows '.' thousands separators, e.g. '3.542,68'
total = r'(\d{1,3}(?:\.\d{3})*,\d{2})'
Total = []

# Compile the regex pattern
pattern = re.compile(total)

# With the quantity removed, the second match on each line is the total
for index, line in enumerate(extracted_pdf_data):
    matches = list(pattern.finditer(line))  # Get all matches

    if len(matches) >= 2:  # Check if at least two matches are found
        second_match = matches[1]  # Get the second match
        start, end = second_match.start(), second_match.end()  # Start and end positions of the second match
        # Remove the second match using slicing
        updated_line = line[:start] + line[end:]
        Total.append(second_match.group())  # Append the matched string to Total
        extracted_pdf_data[index] = updated_line.strip()  # Update the line in the list
    else:  # If there are fewer than two matches
        Total.append("0.000,00")  # Add the placeholder value

# Display the result
Total  # List of matched values

['3.542,68',
 '1.657,10',
 '4.040,61',
 '16.163,53',
 '14.719,32',
 '14.739,64',
 '13.499,33',
 '3.333,68',
 '4.321,49',
 '6.846,61',
 '7.650,26',
 '1.782,77',
 '5.956,34',
 '5.121,80',
 '5.104,20',
 '4.072,32',
 '4.213,87',
 '4.197,56',
 '7.805,11',
 '4.683,26',
 '8.649,11',
 '10.552,19',
 '6.230,39',
 '6.433,88',
 '2.123,89',
 '6.197,47',
 '5.415,27',
 '2.718,12',
 '11.384,60',
 '5.938,94',
 '10.397,14',
 '14.959,10',
 '10.093,73',
 '15.679,69',
 '4.820,63',
 '56.929,80',
 '19.253,84',
 '2.845,92',
 '18.109,58',
 '10.362,40',
 '17.456,36',
 '3.541,97',
 '14.466,27',
 '3.834,43',
 '2.485,77',
 '6.141,21',
 '10.708,32',
 '3.635,95',
 '5.758,28',
 '7.716,06',
 '10.664,10',
 '4.376,68',
 '3.119,20',
 '4.431,15',
 '32.984,71',
 '31.748,40',
 '5.732,37',
 '21.848,54',
 '18.867,81',
 '2.827,85',
 '9.620,56',
 '11.517,48',
 '9.530,16',
 '8.928,74',
 '12.201,53',
 '13.078,80',
 '13.585,32',
 '11.608,74',
 '11.139,72',
 '3.248,34',
 '4.435,20',
 '4.391,93',
 '1.642,55',
 '6.885,69',
 '7.831,47

In [49]:
len(Total)

1658

In [50]:
print(extracted_pdf_data)

['FLANGE OXICORTADO LCG 12,50 NBR 8300', 'FLANGE OXICORTADO LCG 9,50 NBR 8300', 'CHAPA GR LTQ 9,50 NBR 6656 LNE 38', 'CHAPA FF 1,50 NBR5915 EM OL', 'CHAPA FQ DEC 2,00 SAE1006 OL', 'CHAPA FQ 4,25 SAE1010', 'TR REL 0,30 BOB NBR5007 G2L32 BbOL', 'TIRA FF BOB 2,65 NBR6658 OL', 'CHAPA FF 1,20 NBR6658 OL', 'CHAPA FQ 4,75 NBR6658', 'TIRA ZC BOB 1,55 NBR7008 ZC CR MI REV', 'TIRA ZC BOB 0,95 NBR7008 ZC CR MI REV', 'CHAPA FF 0,90 NBR6658 OL', 'CHAPA ZC 0,50 NBR7008 ZC CR NO RV Z275', 'CHAPA ZC 0,50 NBR7008 ZC CR NO RV Z275', 'CHAPA FF 1,50 NBR6658 OL', 'CHAPA FF 1,50 NBR6658 OL', 'CHAPA FF 1,50 NBR6658 OL', 'CHAPA FQ DEC 2,65 NBR6658 OL', 'TIRA FF BOB 0,60 NBR6658 OL', 'TIRA ZC BOB 1,25 NBR7008 ZC EXTRAG GI', 'TR REL 0,40 BOB NBR5007 G4RL BbOL', 'BLANK GA FF 0,90 NBR6658 OL', 'TIRA FF BOB 0,60 NBR6658 OL', 'CHAPA FQ 4,25 SAE1010', 'BLANK GA FF 0,90 NBR6658 OL', 'TR REL 0,80 BOB NBR5007 G4RL BbOL', 'CHAPA FQ 2,00 NBR6658', 'CHAPA FQ 3,00 SAE1010', 'CHAPA FQ 4,25 SAE1010', 'TR REL 0,35 BOB NBR5007

In [51]:
# Getting the description
# What remains of each line after the numeric fields were removed is the description
Descrip = []

for index, line in enumerate(extracted_pdf_data):
    # Add the remaining text to Descrip
    Descrip.append(line.strip())
    # Blank the consumed line; assigning by index does not change the list length,
    # so it is safe to do while iterating
    extracted_pdf_data[index] = ""

# Print the result
print(Descrip)  # List of descriptions

['FLANGE OXICORTADO LCG 12,50 NBR 8300', 'FLANGE OXICORTADO LCG 9,50 NBR 8300', 'CHAPA GR LTQ 9,50 NBR 6656 LNE 38', 'CHAPA FF 1,50 NBR5915 EM OL', 'CHAPA FQ DEC 2,00 SAE1006 OL', 'CHAPA FQ 4,25 SAE1010', 'TR REL 0,30 BOB NBR5007 G2L32 BbOL', 'TIRA FF BOB 2,65 NBR6658 OL', 'CHAPA FF 1,20 NBR6658 OL', 'CHAPA FQ 4,75 NBR6658', 'TIRA ZC BOB 1,55 NBR7008 ZC CR MI REV', 'TIRA ZC BOB 0,95 NBR7008 ZC CR MI REV', 'CHAPA FF 0,90 NBR6658 OL', 'CHAPA ZC 0,50 NBR7008 ZC CR NO RV Z275', 'CHAPA ZC 0,50 NBR7008 ZC CR NO RV Z275', 'CHAPA FF 1,50 NBR6658 OL', 'CHAPA FF 1,50 NBR6658 OL', 'CHAPA FF 1,50 NBR6658 OL', 'CHAPA FQ DEC 2,65 NBR6658 OL', 'TIRA FF BOB 0,60 NBR6658 OL', 'TIRA ZC BOB 1,25 NBR7008 ZC EXTRAG GI', 'TR REL 0,40 BOB NBR5007 G4RL BbOL', 'BLANK GA FF 0,90 NBR6658 OL', 'TIRA FF BOB 0,60 NBR6658 OL', 'CHAPA FQ 4,25 SAE1010', 'BLANK GA FF 0,90 NBR6658 OL', 'TR REL 0,80 BOB NBR5007 G4RL BbOL', 'CHAPA FQ 2,00 NBR6658', 'CHAPA FQ 3,00 SAE1010', 'CHAPA FQ 4,25 SAE1010', 'TR REL 0,35 BOB NBR5007

In [52]:
len(Descrip)

1658

In [53]:
print(extracted_pdf_data)  # Modified data with matches removed

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',

In [54]:
del extracted_pdf_data
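
Each `len(...)` check above returned 1658, which is what allows the lists to be combined into one table. A minimal sketch of the same check done in one place is shown below; `check_equal_lengths` is a hypothetical helper and the short lists are sample values from the notebook output.

```python
def check_equal_lengths(columns):
    """Return the shared length of all lists in `columns`, or raise ValueError
    naming every column length. Hypothetical helper for illustration."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)

columns = {
    'Piece_price': ['55,45', '31,10'],
    'KG_price': ['4,27', '4,04'],
    'Total': ['3.542,68', '1.657,10'],
}
n_rows = check_equal_lengths(columns)
print(n_rows)  # 2
```

Building a `pd.DataFrame` from lists of unequal length raises an error as well; the point of the explicit check is only that the message names each column and its length.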

In [55]:
data = {'Date': Date,
        'Inv': NF,
        'Est': Est,
        'Order': Order,
        'Client': Client,
        'Descrip': Descrip,
        'Esp': Esp,
        'Larg': Larg,
        'Comp': Comp,
        'Piece_price': Piece_price,
        'KG_price': KG_price, 
        'Quantity': Quantity,
        'Total': Total
        }
all_data = pd.DataFrame(data)
print(data)

{'Date': ['01-02-2013', '01-02-2013', '01-02-2013', '01-02-2013', '01-02-2013', '01-02-2013', '01-02-2013', '01-02-2013', '01-02-2013', '01-02-2013', '01-02-2013', '01-03-2013', '01-03-2013', '01-03-2013', '01-03-2013', '01-03-2013', '01-03-2013', '01-03-2013', '01-03-2013', '01-03-2013', '01-03-2013', '01-04-2013', '01-04-2013', '01-04-2013', '01-04-2013', '01-04-2013', '01-04-2013', '01-07-2013', '01-07-2013', '01-07-2013', '01-07-2013', '01-07-2013', '01-07-2013', '01-07-2013', '01-07-2013', '01-07-2013', '01-07-2013', '01-07-2013', '01-07-2013', '01-08-2013', '01-10-2013', '01-10-2013', '01-10-2013', '01-11-2013', '01-11-2013', '01-11-2013', '01-11-2013', '02-04-2013', '02-04-2013', '02-04-2013', '02-04-2013', '02-04-2013', '02-04-2013', '02-04-2013', '02-04-2013', '02-04-2013', '02-04-2013', '02-04-2013', '02-04-2013', '02-05-2013', '02-05-2013', '02-05-2013', '02-05-2013', '02-05-2013', '02-05-2013', '02-05-2013', '02-07-2013', '02-07-2013', '02-07-2013', '02-07-2013', '02-07-201

In [56]:
all_data

,Date,Inv,Est,Order,Client,Descrip,Esp,Larg,Comp,Piece_price,KG_price,Quantity,Total
0,01-02-2013,0025305,14,147876-20,GPANIZ,"FLANGE OXICORTADO LCG 12,50 NBR 8300","12,5000","90,0000","420,0000","55,45","4,27","780,00","3.542,68"
1,01-02-2013,0025305,14,147876-30,GPANIZ,"FLANGE OXICORTADO LCG 9,50 NBR 8300","9,5000","90,0000","360,0000","31,10","4,04","385,00","1.657,10"
2,01-02-2013,0107856,11,137081-10,KEKO,"CHAPA GR LTQ 9,50 NBR 6656 LNE 38","9,5000","1.200,0000","3.000,0000","0,00","2,84","1.355,00","4.040,61"
3,01-02-2013,0107856,11,147002-20,KEKO,"CHAPA FF 1,50 NBR5915 EM OL","1,5000","1.200,0000","1.595,0000","0,00","2,64","5.831,00","16.163,53"
4,01-02-2013,0107857,11,145592-10,KEKO,"CHAPA FQ DEC 2,00 SAE1006 OL","2,0000","1.200,0000","3.000,0000","0,00","2,70","5.192,00","14.719,32"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1653,31-10-2013,0128144,11,169176-10,BORTOLOTTO,"TIRA ZC BOB 0,80 NBR7008 ZC CR NO REV","0,8000","33,0000","0,0000","0,00","3,94","425,00","1.758,23"
1654,31-10-2013,0128145,11,169341-10,VECTOR,"BLANK GA FF 0,90 NBR6658 OL","0,9000","630,0000","1.121,0000","0,00","3,20","3.051,00","10.251,36"
1655,31-10-2013,0128147,11,169341-10,VECTOR,"BLANK GA FF 0,90 NBR6658 OL","0,9000","630,0000","1.121,0000","0,00","3,20","3.091,00","10.385,76"
1656,31-10-2013,0128191,11,170115-10,LUPEME,"TIRA FF BOB 1,20 NBR5915 EP OL","1,2000","36,0000","0,0000","0,00","3,04","1.023,00","3.265,42"


In [57]:
# Saving the data as CSV
all_data.to_csv('data/csv_files/2013.csv', index=False)
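
Since the project exports one CSV file per year, the output path could be derived from the PDF file name rather than typed by hand. The sketch below assumes that every yearly file follows the `nfs_<year>.pdf` naming seen for 2013; `csv_path_for` and its `out_dir` parameter are hypothetical.

```python
import re
from pathlib import Path

def csv_path_for(pdf_path, out_dir='data/csv_files'):
    """Map e.g. './data/pdf_files/nfs_2013.pdf' to 'data/csv_files/2013.csv'.
    Hypothetical helper; assumes a four-digit year in the file name."""
    match = re.search(r'\d{4}', Path(pdf_path).stem)
    if match is None:
        raise ValueError(f"No four-digit year in file name: {pdf_path}")
    return Path(out_dir) / f"{match.group()}.csv"

csv_path = csv_path_for('./data/pdf_files/nfs_2013.pdf')
print(csv_path.as_posix())  # data/csv_files/2013.csv
```

The returned `Path` can be passed directly to `all_data.to_csv(...)`.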

In [58]:
counter = Counter(data['Client'])
counter

Counter({'GPANIZ': 243,
         'KEKO': 107,
         'METMETALCIN': 133,
         'LUPEME': 114,
         'LAIND': 21,
         'METALMATRIX': 40,
         'GS': 26,
         'METSISTEM': 41,
         'GPANIZIND': 92,
         'MAQENGE': 76,
         'PLASLINK': 4,
         'VECTOR': 72,
         'KAE': 4,
         'INCORPOL': 69,
         'FLANTECH': 151,
         'SASPLASTI': 16,
         'COPRIMA': 5,
         'POMMIER': 2,
         'CANELLO': 1,
         'ACOGRATO': 16,
         'JMARCONIND': 3,
         'RODAROSIND': 7,
         'RODAROSIND2': 11,
         'CIRNA': 10,
         'LUZI': 4,
         'METCOSBI': 44,
         'BASECOMPONE': 17,
         'PARTS': 8,
         'STEFFENSIND': 14,
         'MICHELEHAAB': 1,
         'MOVIMENTO': 6,
         'UTIMIL': 15,
         'ACOPLANO': 20,
         'MULTISERVSE': 12,
         'FAMERTEC': 5,
         'GUARANY': 2,
         'IBRAL': 2,
         'ACOBRASIL': 4,
         'DOSULINDE': 1,
         'PERFILLINE': 7,
         'ANODILAR': 1,