# Week 12 Exercises
## Joining Tables with `pandas`

[merge() documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html#pandas.DataFrame.merge)

Reference 1: [Joining Tables in pandas](https://www.analyticsvidhya.com/blog/2020/02/joins-in-pandas-master-the-different-types-of-joins-in-python/)

Reference 2: [Visual Representation of Joining Tables](https://www.postgresqltutorial.com/postgresql-joins/)

### Learning Objectives
1. Read and understand an entity relationship diagram (ERD)
>- Identify the tables and fields in a database
>- Identify the keys (primary/foreign) that join tables together
2. Perform exploratory data analysis (EDA), <a id='Section 1'></a>[Section 1: EDA](#Section-1)
3. Join tables to extract data from multiple sources: <a id='Section 2'></a>[Section 2: Joins](#Section-2)

### A Database Model to Practice Joins 

We will use the data model of a fictional company, `SaleCo`, to practice joining tables. The entity relationship diagram shown below illustrates the following business rules:
1. A customer may generate many invoices. Each invoice is generated by one customer.
2. An invoice contains one or more invoice lines. Each invoice line belongs to one invoice.
3. Each invoice line references one product. A product may be found in many invoice lines (i.e., you can sell more than one hammer to more than one customer).
>- Within the line table, `line_units` gives us the quantity sold for each product and `line_price` gives us the unit price.
4. A vendor may supply many products. Some vendors do not yet supply products. For example, a vendor list may include potential vendors.
5. If a product is vendor-supplied, it is supplied by only a single vendor.
6. Some products are not supplied by a vendor. For example, some products may be produced in-house or bought on the open market. 

#### Data needed for this tutorial
>- Download the following files from Canvas (or other source) and save them in your working directory for this notebook
>>- customer.csv
>>- invoice.csv
>>- line.csv
>>- product.csv
>>- vendor.csv

## Entity Relationship Diagram (ERD) of the SaleCo Database

### Examine the ERD to see how the tables are related
#### Note: You may need the `SaleCoERD.png` file saved in the same working directory as this notebook in order to see the image
>- customer and invoice can be joined on `CUS_CODE`
>- invoice and line can be joined on `INV_NUMBER`
>- line and product can be joined on `P_CODE`
>- product and vendor can be joined on `V_CODE`

![SaleCoERD.PNG](attachment:SaleCoERD.PNG)
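
The join keys listed above can be chained to walk from one end of the ERD to the other. The sketch below is only an illustration on tiny made-up tables: the key columns (`CUS_CODE`, `INV_NUMBER`, `P_CODE`) come from the ERD, while every other column name and every value is an assumption and may not match the real CSV files.

```python
import pandas as pd

# Hypothetical miniature tables; only the key column names come from the ERD.
cust = pd.DataFrame({"CUS_CODE": [10010, 10011],
                     "CUS_LNAME": ["Ramas", "Dunne"]})
inv = pd.DataFrame({"INV_NUMBER": [1001, 1002],
                    "CUS_CODE": [10011, 10010]})
line = pd.DataFrame({"INV_NUMBER": [1001, 1001, 1002],
                     "P_CODE": ["A1", "B2", "A1"],
                     "LINE_UNITS": [2, 1, 3]})
prod = pd.DataFrame({"P_CODE": ["A1", "B2"],
                     "P_DESCRIPT": ["Hammer", "Saw"]})

# Chain the joins one key at a time: customer -> invoice -> line -> product.
full = (
    cust.merge(inv, on="CUS_CODE")
        .merge(line, on="INV_NUMBER")
        .merge(prod, on="P_CODE")
)
print(full)
```

Each `merge` call defaults to an inner join, so only rows with a match on both sides survive each step.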

### Create the `cust` dataframe from the customer.csv file
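
A minimal sketch of loading a CSV into a dataframe with `pd.read_csv`. So that it runs without the course files, it reads from an in-memory stand-in; the column names other than `CUS_CODE` and all values are assumptions. In the notebook you would pass the file name (for example `"customer.csv"`) instead, and the same pattern should apply to the other four tables.

```python
import io
import pandas as pd

# Hypothetical stand-in for customer.csv; the real file's columns may differ.
customer_csv = io.StringIO(
    "CUS_CODE,CUS_LNAME,CUS_AREACODE\n"
    "10010,Ramas,615\n"
    "10011,Dunne,713\n"
)

# In the notebook this would be: cust = pd.read_csv("customer.csv")
cust = pd.read_csv(customer_csv)

# Quick checks that the load worked as expected.
print(cust.shape)
print(cust.dtypes)
print(cust.head())
```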

### Create the `inv` dataframe from the invoice.csv file

### Create the `line` dataframe from the line.csv file

### Create the `prod` dataframe from the product.csv file

### Create the `vend` dataframe from the vendor.csv file

# Section 1
## Getting Familiar with the SaleCo database

>- The questions in this section are mostly to help you get familiar with the data in the `SaleCo` database. You shouldn't need to join tables in this section; instead, practice selecting and aggregating data.

# Q1: What are the unique values for all of the columns in the `line` table? 
>- From this you should be able to answer questions such as:
>>- How many total invoices were there?
>>- How many total products were sold? (remember products could show up more than once)
>>- How many distinct prices of products are there?
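
One possible way to count unique values per column is `DataFrame.nunique()`. The sketch below uses a small made-up `line` table; the uppercase column names `LINE_NUMBER`, `LINE_UNITS`, and `LINE_PRICE` and all values are assumptions, so adjust them to match the headers in your `line.csv`.

```python
import pandas as pd

# Hypothetical rows standing in for line.csv.
line = pd.DataFrame({
    "INV_NUMBER": [1001, 1001, 1002, 1003, 1003],
    "LINE_NUMBER": [1, 2, 1, 1, 2],
    "P_CODE": ["A1", "B2", "A1", "C3", "B2"],
    "LINE_UNITS": [1, 2, 3, 1, 5],
    "LINE_PRICE": [9.95, 4.99, 9.95, 14.99, 4.99],
})

# Number of distinct values in every column.
unique_counts = line.nunique()
print(unique_counts)

# The distinct values themselves for a single column.
print(line["P_CODE"].unique())
```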

# Q2: What is the overall number of units sold?
>- Also show the total number of distinct products sold in the output
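
A sketch of the two numbers asked for here, computed on made-up rows. The column name `LINE_UNITS` and the values are assumptions; only `P_CODE` is taken from the ERD.

```python
import pandas as pd

# Hypothetical rows standing in for line.csv.
line = pd.DataFrame({
    "P_CODE": ["A1", "B2", "A1", "C3", "B2"],
    "LINE_UNITS": [1, 2, 3, 1, 5],
})

# Total units across every invoice line.
total_units = line["LINE_UNITS"].sum()

# Distinct products that appear on at least one invoice line.
distinct_products = line["P_CODE"].nunique()

print(f"Total units sold: {total_units}")
print(f"Distinct products sold: {distinct_products}")
```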

# Q3: What is the average units sold per invoice line?

# Q4: What is the average units sold per product?
>- Remember that products can occur on multiple lines in the line table so think through how to calculate this average correctly. 
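
The difference between a per-line average and a per-product average can be sketched as follows. The data and the `LINE_UNITS` column name are assumptions. The idea is to total the units for each product first and only then take the mean of those totals.

```python
import pandas as pd

# Hypothetical rows standing in for line.csv.
line = pd.DataFrame({
    "P_CODE": ["A1", "B2", "A1", "C3", "B2"],
    "LINE_UNITS": [1, 2, 3, 1, 5],
})

# Average per invoice line: every row counts once.
avg_per_line = line["LINE_UNITS"].mean()

# Average per product: sum each product's units, then average the sums.
units_per_product = line.groupby("P_CODE")["LINE_UNITS"].sum()
avg_per_product = units_per_product.mean()

print(f"Average units per line: {avg_per_line}")
print(f"Average units per product: {avg_per_product}")
```

On this toy data the two averages differ because two of the three products appear on more than one line.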

# Q5: What is the total inventory on hand for all our products?
>- Inventory can be found in the `product` table in the `P_QOH` field, which stands for Product Quantity on Hand
>- Also show the total number of unique products in this table

# Q6: What is the average product inventory?

# Q7: What is the value of our inventory?
>- Examine the product table to help you answer this
>- Create a new feature in the `prod` dataframe named `TOTVAL` that stores each product's inventory value
>- Then calculate and show the total value across all products
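
A sketch of one way to build `TOTVAL`, assuming the product table has a unit price column. The name `P_PRICE` is an assumption (check your `product.csv` for the actual header), and the values are made up; `P_QOH` comes from the exercise text.

```python
import pandas as pd

# Hypothetical rows standing in for product.csv; P_PRICE is an assumed column.
prod = pd.DataFrame({
    "P_CODE": ["A1", "B2", "C3"],
    "P_QOH": [10, 20, 5],
    "P_PRICE": [9.95, 4.99, 14.99],
})

# Inventory value per product = quantity on hand * unit price.
prod["TOTVAL"] = prod["P_QOH"] * prod["P_PRICE"]

# Total value across all products.
total_value = prod["TOTVAL"].sum()

print(prod)
print(f"Total inventory value: {total_value:.2f}")
```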

# Q8: Write your own questions and continue examining the SaleCo database

>- Continue to explore the tables in this section as well as the others using exploratory data analysis methods we have discussed in class
>- Make sure you have a good understanding of the data that is stored in each field

# Section 2
## Joining tables using `pandas`

# Q9: 
## What is all the customer information for any of our customers that have an invoice? 
>- Show all the customer info (name, areacode, etc)
>- Remember, just because we have customer information doesn't mean they have created an invoice (i.e., placed an order) yet. 
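
A question like this maps onto an inner join: only customers with a matching invoice row are kept. The sketch below uses made-up tables; `CUS_CODE` and `INV_NUMBER` come from the ERD, and the remaining column names and all values are assumptions.

```python
import pandas as pd

# Hypothetical rows standing in for customer.csv and invoice.csv.
cust = pd.DataFrame({
    "CUS_CODE": [10010, 10011, 10012],
    "CUS_LNAME": ["Ramas", "Dunne", "Smith"],
    "CUS_AREACODE": [615, 713, 615],
})
inv = pd.DataFrame({
    "INV_NUMBER": [1001, 1002, 1003],
    "CUS_CODE": [10011, 10010, 10011],
})

# Inner join: keeps only customers that appear in the invoice table.
cust_inv = cust.merge(inv, on="CUS_CODE", how="inner")

# A customer with several invoices appears several times, so keep the
# customer columns only and drop the repeated rows.
cust_with_invoice = cust_inv[cust.columns].drop_duplicates()

print(cust_with_invoice)
```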

# Q10: 
## Who are all our customers with or without an order (i.e., invoice)?
>- Note the difference in how this question is asked versus Q9
>- A common follow-up question is: Which customers do we have information on that haven't placed an order?
>>- This could indicate a lead list for our sales team to contact

##### Follow-up: Which customers do we have info on but haven't placed an order?
>- Only show these customers in your output

##### Show how many customers we have info on that don't have orders
>- Just show the count in your output
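
Q10 and its follow-ups can be approached with a left join, which keeps every customer whether or not an invoice matches. The sketch below is one option, not the only one: it uses the `indicator=True` argument of `merge` to flag unmatched rows. The tables are made up, and all columns other than `CUS_CODE` and `INV_NUMBER` are assumptions.

```python
import pandas as pd

# Hypothetical rows standing in for customer.csv and invoice.csv.
cust = pd.DataFrame({
    "CUS_CODE": [10010, 10011, 10012],
    "CUS_LNAME": ["Ramas", "Dunne", "Smith"],
})
inv = pd.DataFrame({
    "INV_NUMBER": [1001, 1002, 1003],
    "CUS_CODE": [10011, 10010, 10011],
})

# Left join: every customer is kept; invoice columns are NaN when no match.
# indicator=True adds a `_merge` column describing where each row came from.
all_cust = cust.merge(inv, on="CUS_CODE", how="left", indicator=True)

# Customers with no invoice are the rows found only in the left table.
no_orders = all_cust[all_cust["_merge"] == "left_only"]
n_no_orders = len(no_orders)

print(all_cust)
print(no_orders[["CUS_CODE", "CUS_LNAME"]])
print(f"Customers without orders: {n_no_orders}")
```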

# Q11: Which of our products have vendor information?
>- Remember, not all of our products come from vendors. For example, we could produce some products in-house. 

##### Q11 Follow-up. Show the number of products we have vendor info on.
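
A sketch of Q11 and its follow-up using an inner join on `V_CODE`, so that only products with a matching vendor row remain. The data is made up, and `V_NAME` is an assumed column name; products without a vendor are represented here by a missing `V_CODE`, which is also an assumption about how the real file encodes them.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for product.csv and vendor.csv.
prod = pd.DataFrame({
    "P_CODE": ["A1", "B2", "C3", "D4"],
    "V_CODE": [21344, 21344, np.nan, 23119],
})
vend = pd.DataFrame({
    "V_CODE": [21344, 23119, 24288],
    "V_NAME": ["Gomez Bros.", "Randsets Ltd.", "ORDVA, Inc."],
})

# Inner join: products whose V_CODE matches a vendor row.
prod_vend = prod.merge(vend, on="V_CODE", how="inner")

# Follow-up: how many products have vendor information.
n_with_vendor = prod_vend["P_CODE"].nunique()

print(prod_vend)
print(f"Products with vendor info: {n_with_vendor}")
```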

# Q12: Which products do we produce in-house?
>- In other words, which products do we not have vendor info on?

##### Q12 follow up: What is the total inventory of the products we produce in-house?
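
For Q12 and its follow-up, one approach is a left join from product to vendor, keeping the rows that find no vendor match. The sketch below uses made-up data; `V_NAME` and the way missing vendors are encoded (a missing `V_CODE`) are assumptions, while `P_QOH` and `V_CODE` come from the exercise text.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for product.csv and vendor.csv.
prod = pd.DataFrame({
    "P_CODE": ["A1", "B2", "C3", "D4", "E5"],
    "P_QOH": [10, 20, 5, 8, 12],
    "V_CODE": [21344, 21344, np.nan, 23119, np.nan],
})
vend = pd.DataFrame({
    "V_CODE": [21344, 23119],
    "V_NAME": ["Gomez Bros.", "Randsets Ltd."],
})

# Left join keeps every product; `_merge` flags the ones with no vendor match.
prod_all = prod.merge(vend, on="V_CODE", how="left", indicator=True)
in_house = prod_all[prod_all["_merge"] == "left_only"]

# Follow-up: total quantity on hand for the in-house products.
in_house_inventory = in_house["P_QOH"].sum()

print(in_house[["P_CODE", "P_QOH"]])
print(f"In-house inventory: {in_house_inventory}")
```

If missing vendors really are stored as empty `V_CODE` values, filtering with `prod["V_CODE"].isna()` should give the same products without a join.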