## Data Governance
A logger was implemented in `logger_config.py`, located in the base directory, using Python's `logging` module. The existing `processor.py` and `dataloading.py` files were then updated to use it.

### Logger setup
The logger is set up to identify the source of each log call: `setup_logger` takes a `name` parameter, so each module creates its own named logger. All of these loggers write to the same file, `data_pipeline.log`. The stream (console) handler only emits logs at INFO level and above, while the file handler records everything from DEBUG level up, which keeps the console output shorter. Each log line is formatted to show the time the log was created, the source of the log (`name`), the severity of the log (`levelname`), and the log message itself. The logger is then used in `processor.py` and `dataloading.py`.

In [None]:
import logging

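def setup_logger(name):
    """Return a named logger that writes to the console and to data_pipeline.log."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)

    # getLogger returns the same object for a given name, so skip the setup
    # if handlers are already attached (avoids duplicated log lines)
    if logger.handlers:
        return logger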
    
    console_handler = logging.StreamHandler()
    file_handler = logging.FileHandler('data_pipeline.log')
    
    console_handler.setLevel(logging.INFO)
    file_handler.setLevel(logging.DEBUG)
    
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    console_handler.setFormatter(formatter)
    file_handler.setFormatter(formatter)
    
    logger.addHandler(console_handler)
    logger.addHandler(file_handler)
    
    return logger
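
The split between console and file levels can be checked with a small sketch. `setup_demo_logger` is a hypothetical variant of `setup_logger` that accepts the stream and the log path as parameters (an assumption made only so the output can be inspected); the logger name and file name are likewise made up for the demonstration.

```python
import io
import logging
import os
import tempfile


def setup_demo_logger(name, stream, log_path):
    """Hypothetical variant of setup_logger with an injectable stream and path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep demo output away from the root logger

    console_handler = logging.StreamHandler(stream)
    file_handler = logging.FileHandler(log_path)
    console_handler.setLevel(logging.INFO)   # console: INFO and above
    file_handler.setLevel(logging.DEBUG)     # file: DEBUG and above

    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    for handler in (console_handler, file_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


def run_level_demo():
    stream = io.StringIO()
    with tempfile.TemporaryDirectory() as tmp_dir:
        log_path = os.path.join(tmp_dir, 'demo_pipeline.log')
        logger = setup_demo_logger('level_demo', stream, log_path)
        logger.debug('debug detail')
        logger.info('info summary')
        # Close the handlers so the file is flushed before it is read and removed
        for handler in list(logger.handlers):
            handler.close()
            logger.removeHandler(handler)
        with open(log_path) as log_file:
            file_text = log_file.read()
    return stream.getvalue(), file_text


console_text, file_text = run_level_demo()
```

Here `console_text` contains only the INFO line, while `file_text` contains both the DEBUG and the INFO line.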


### Processor logs
The logs in `processor.py` mostly cover processing-level information, such as the number of records read from each file, and record errors such as column inconsistencies.

In [None]:
from logger_config import setup_logger

logger = setup_logger('processing')

The logger is first set up by importing `setup_logger` from `logger_config` and calling it with the name `processing`.
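
One detail worth keeping in mind is that `logging.getLogger` returns the same logger object every time it is called with the same name. A setup function that attaches handlers on every call therefore accumulates handlers, and each message is emitted once per handler. The sketch below illustrates this with two hypothetical helper functions and made-up logger names; `NullHandler` is used only so that nothing is printed.

```python
import logging


def add_handler_unguarded(name):
    """Attach a handler on every call (handlers pile up)."""
    logger = logging.getLogger(name)
    logger.addHandler(logging.NullHandler())
    return logger


def add_handler_guarded(name):
    """Attach a handler only if the logger has none yet."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        logger.addHandler(logging.NullHandler())
    return logger


unguarded = add_handler_unguarded('demo_unguarded')
add_handler_unguarded('demo_unguarded')  # second call adds a second handler

guarded = add_handler_guarded('demo_guarded')
add_handler_guarded('demo_guarded')      # second call changes nothing

same_object = logging.getLogger('demo_guarded') is guarded
unguarded_count = len(unguarded.handlers)
guarded_count = len(guarded.handlers)
```

With two calls each, the unguarded logger ends up with two handlers and the guarded one with a single handler.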

In [None]:
def add_csv(self, filepath):
    new_entries = pd.read_csv(filepath)
    try:
        new_entries = self.clean_data(new_entries)
        # Selecting self.columns raises KeyError if an expected column is missing
        new_entries = new_entries[self.columns]
        logger.info('Adding %d entries from: %s', new_entries.shape[0], filepath)
        # Keep the current dataframe if there is nothing to add; concatenate
        # only when both frames contain rows
        if not new_entries.empty:
            if self.dataframe.empty:
                self.dataframe = new_entries.copy()
            else:
                self.dataframe = pd.concat([self.dataframe, new_entries], ignore_index=True)
    except KeyError as e:
        logger.error('Column mismatch in %s: %s', filepath, e)

When `add_csv` reads a file, the logger records the number of entries and the file path. If one of the expected columns does not exist in the dataframe, the resulting `KeyError` is caught and logged as an error.
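
The same log-and-catch pattern can be tried without pandas. The sketch below is a standard-library stand-in, not the code from `processor.py`: `load_rows`, `ListHandler`, the logger name, and the sample CSV text are all hypothetical, and `csv.DictReader` takes the place of `pd.read_csv`.

```python
import csv
import io
import logging


class ListHandler(logging.Handler):
    """Collects log messages in a list so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f'{record.levelname}: {record.getMessage()}')


demo_logger = logging.getLogger('add_csv_demo')
demo_logger.setLevel(logging.DEBUG)
demo_logger.propagate = False
capture = ListHandler()
demo_logger.addHandler(capture)


def load_rows(csv_text, columns, source):
    """Return rows restricted to `columns`, or [] if a column is missing."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    try:
        # A missing key raises KeyError, like selecting a missing column
        selected = [{col: row[col] for col in columns} for row in rows]
    except KeyError as err:
        demo_logger.error('Missing column %s in %s', err, source)
        return []
    demo_logger.info('Adding %d entries from: %s', len(selected), source)
    return selected


good = load_rows('id,name\n1,Ana\n2,Ben\n', ['id', 'name'], 'good.csv')
bad = load_rows('id,name\n1,Ana\n', ['id', 'email'], 'bad.csv')
```

After running this, `capture.messages` holds one INFO entry for `good.csv` and one ERROR entry naming the missing `email` column for `bad.csv`.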

In [None]:
def check_entries(self, row):
    errors = []
    for column, expected in zip(self.columns, self.types):
        value = row[column]
        message = None
        if expected == 'datetime':
            # Missing values (NaN/NaT) are reported as invalid datetimes
            if pd.isna(value):
                message = f'{column} is not datetime'
        elif expected == 'int':
            if not isinstance(value, int):
                if column == -1:
                    message = f'{column} has an incorrect data type'
                else:
                    message = f'{column} is not {expected}'
        elif not isinstance(value, str):
            message = f'{column} is not {expected}'
        if message is not None:
            # Record each problem once and send the same text to the log
            errors.append(message)
            logger.error(message)

The `check_entries` function, previously the main error-checking tool, now also sends each error message to the logger, in addition to appending it to the `errors` list.
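
A minimal standalone sketch of this check-and-log idea is shown below. It is an illustration under stated assumptions rather than the method from `processor.py`: `check_row`, the logger name, and the sample row are hypothetical, the row is a plain dictionary instead of a pandas row, and an `isinstance` check against `datetime` stands in for `pd.isna`.

```python
import logging
from datetime import datetime

demo_logger = logging.getLogger('check_entries_demo')
demo_logger.addHandler(logging.NullHandler())  # stay silent in this sketch
demo_logger.propagate = False


def check_row(row, columns, types):
    """Return the list of type errors for one row, logging each of them."""
    errors = []
    for column, expected in zip(columns, types):
        value = row.get(column)
        if expected == 'datetime':
            ok = isinstance(value, datetime)
        elif expected == 'int':
            # bool is a subclass of int, so exclude it explicitly
            ok = isinstance(value, int) and not isinstance(value, bool)
        else:
            ok = isinstance(value, str)
        if not ok:
            message = f'{column} is not {expected}'
            errors.append(message)       # returned to the caller
            demo_logger.error(message)   # same text goes to the log
    return errors


sample_row = {'id': 1, 'joined': None, 'name': 5}
row_errors = check_row(sample_row, ['id', 'joined', 'name'], ['int', 'datetime', 'str'])
```

Building the message once and passing it to both the list and the logger keeps the two outputs identical.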

### Logging the records loaded into PostgreSQL
We modify `upload_data`, the only function we implemented for uploading data to PostgreSQL, so that it writes to the logger instead of printing to the output stream.

In [None]:
def upload_data():
    users = transformation.users
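    # if_exists='replace' drops and recreates the table, so the logged count
    # is the full size of the table after the upload
    users.to_sql('users', engine, if_exists='replace', index=False)
    logger.info('%d entries were added to PostgreSQL.', users.shape[0])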