## Database Analysis
This notebook works through a scenario that anyone working for an organization that collects customer or client data is likely to encounter: a new batch of incoming electronic data that contains errors.


The analysis proceeds in three steps:
1. Identify possible problems with the data
2. Attempt to mitigate the issues as much as possible
3. Report all discrepancies, with detailed logs explaining why each row was labeled as erroneous

The Python script lets the user set these parameters through arguments, rather than opening this notebook and changing everything by hand.
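As a rough illustration of what such a script's interface could look like, here is a minimal sketch using the standard library's `argparse`. The flag names (`--database`, `--target`, `--id-column`) and their defaults are assumptions made for this example, not the actual script's options.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options for a hypothetical validation script."""
    parser = argparse.ArgumentParser(
        description="Validate a new batch of records against an existing database."
    )
    # Flag names and defaults are illustrative assumptions, not the real CLI.
    parser.add_argument("--database", default="data/database.csv",
                        help="CSV file holding the existing records")
    parser.add_argument("--target", default="data/target.csv",
                        help="CSV file holding the incoming batch")
    parser.add_argument("--id-column", default="guid",
                        help="column that should uniquely identify a record")
    return parser.parse_args(argv)


# Passing an explicit list keeps the example runnable inside a notebook.
args = parse_args(["--target", "data/new_batch.csv"])
print(args.database, args.target, args.id_column)
```

Keeping the file paths and the ID column as arguments means the same validation logic can be pointed at a different batch without touching the code.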

In [None]:
import pandas as pd

In [None]:
database_df = pd.read_csv('data/database.csv') # replace with the path to your database file
print(database_df.head())

In [None]:
target_df = pd.read_csv('data/target.csv') # replace with the path to the incoming (target) file
print(target_df.head())

In [None]:
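# With a DataFrame argument, isin compares cell by cell and only matches where both the index and column labels line up
print(target_df.isin(database_df))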

### __Data Validation__

#### Missing Data
Here, "missing" means rows that appear in only one of the two dataframes; the rows of most interest are those in the target file that are not present in the database. We find them with an __outer join__. For more information on this SQL-style operation, see: https://pandas.pydata.org/docs/reference/api/pandas.merge.html#pandas.merge
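To make the outer join concrete, here is a small sketch on toy data. The dataframes `db` and `batch` and their values are invented for illustration; only `pd.merge` with `how="outer"` and `indicator=True` mirrors what this notebook does.

```python
import pandas as pd

# Toy stand-ins for the real files; names and values are made up.
db = pd.DataFrame({"guid": ["a1", "b2", "c3"], "name": ["Ann", "Bob", "Cy"]})
batch = pd.DataFrame({"guid": ["b2", "c3", "d4"], "name": ["Bob", "Cyrus", "Dee"]})

# With no `on=` argument, merge joins on every shared column (guid and name here).
# indicator=True adds a `_merge` column: 'left_only', 'right_only', or 'both'.
merged = pd.merge(db, batch, how="outer", indicator=True)

# Rows found only in the database, and rows found only in the incoming batch
only_in_db = merged[merged["_merge"] == "left_only"]
only_in_batch = merged[merged["_merge"] == "right_only"]

print(only_in_batch[["guid", "name"]])
```

Note that `c3` shows up on both sides of the split: because the join uses every shared column, a row whose `guid` matches but whose `name` differs is treated as a different row. That is one reason the identifier needs its own check.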

In [None]:
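# Outer join on all shared columns; indicator=True adds a `_merge` column
missing_df = pd.merge(database_df, target_df, how='outer', indicator=True)
# Keep only the rows that appear in one dataframe but not the other
missing_df = missing_df[missing_df['_merge'] != 'both']
missing_df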

Before we add these rows to our database, we should make sure they aren't duplicates. We can validate their identities using some form of identification, such as this dataset's GUID. Depending on the scenario, a business might require every client to have a distinct ```guid```, or it might allow overlapping IDs. For our use case we will assume that IDs should __not__ be duplicated.

Luckily, ```pandas``` gives us an easy way to check whether the values in one column are present in another dataframe: ```isin```.
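The sketch below shows one way such a check might feed the discrepancy report, assuming toy data. The names `db`, `candidates`, `status`, and `reason` are hypothetical and exist only for this example.

```python
import pandas as pd

# Made-up data: existing records and candidate rows flagged as missing.
db = pd.DataFrame({"guid": ["a1", "b2", "c3"]})
candidates = pd.DataFrame({"guid": ["c3", "d4"], "name": ["Cyrus", "Dee"]})

# True where the candidate's guid is already in use in the database
guid_taken = candidates["guid"].isin(db["guid"])

# Attach a human-readable reason so the report explains each decision
report = candidates.assign(
    status=guid_taken.map({True: "rejected", False: "accepted"}),
    reason=guid_taken.map({True: "guid already exists in database",
                           False: "new guid"}),
)
print(report)
```

Storing the reason alongside each row keeps the log self-explanatory when it is reviewed later.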

In [None]:
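# 1 = guid already exists in the database, 0 = new guid (rows that came from database_df itself are always 1)
missing_df.guid.isin(database_df.guid).astype(int)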