# Step 5 - Let the Data Analysis Begin!
Now that we've explored all 3 of our tables - let's first visualize how the tables join onto each other using an Entity Relationship Diagram, or ERD for short!

## What is an ERD?
ERDs are very useful for visualizing the relationships between columns in tables - especially when it comes to combining tables using joins (something we'll cover in this tutorial).

Below you will see the ERD for our current case study - the most important thing is to notice how all of the columns relate to one another.

<p align="center">
<img src="../Images/ERD.png" width="300">
</p>
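
The relationships an ERD depicts can also be written down as a schema. The sketch below is a minimal, hypothetical version that uses Python's built-in `sqlite3` module rather than the MySQL database used in this tutorial. It only includes the columns that appear in this tutorial's queries, the column types and keys are assumptions, and the `trading.` prefix is dropped because the toy database has no such schema.

```python
import sqlite3

# In-memory toy database; NOT the MySQL instance used in this tutorial.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Column types and keys below are assumptions for illustration.
    CREATE TABLE members (
        member_id TEXT PRIMARY KEY,
        region    TEXT
    );
    CREATE TABLE prices (
        ticker      TEXT,
        market_date TEXT,
        price       REAL,
        PRIMARY KEY (ticker, market_date)
    );
    CREATE TABLE transactions (
        member_id TEXT REFERENCES members (member_id),
        ticker    TEXT,
        txn_date  TEXT,
        txn_type  TEXT,
        quantity  REAL
    );
    """
)

# Collect each table's column names to mirror what an ERD shows.
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()

schema = {}
for (table,) in tables:
    info = conn.execute(f"PRAGMA table_info({table})").fetchall()
    schema[table] = [row[1] for row in info]  # row[1] is the column name
    print(table, schema[table])
```

In this sketch `member_id` links `members` to `transactions`, and `ticker` links `transactions` to `prices`, which are the same join keys used by the queries in this tutorial.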

## Realistic Analytics
Even though we have been exploring our datasets and a few of the basic SQL concepts required for data analysis - we have yet to combine our SQL queries into a single focused analytical process to solve a larger problem. This is our opportunity to do so!

Let's say that we wish to analyse our overall portfolio performance and also each member's performance based on all the data we have in our 3 tables.

### Analyse the Ranges
Firstly - let's see what range of data we have to play with!

In [1]:
import pandas as pd
import mysql.connector as sql
import os

In [2]:
connection = sql.connect(
    host = os.environ.get('mysql_host'),
    user = os.environ.get('mysql_user'),
    password = os.environ.get('mysql_password')
)

### Question 1
What is the earliest and latest date of transactions for all members?

In [3]:
pd.read_sql_query(
    """
    SELECT
        MIN(txn_date) AS earliest_date,
        MAX(txn_date) AS latest_date
    FROM trading.transactions;
    """,
    connection
)

|   | earliest_date | latest_date |
|---|---|---|
| 0 | 2017-01-01 | 2021-08-27 |


### Question 2
What is the range of market_date values available in the prices data?

In [4]:
pd.read_sql_query(
    """
    SELECT
        MIN(market_date) AS earliest_date,
        MAX(market_date) AS latest_date
    FROM trading.prices;
    """,
    connection
)

|   | earliest_date | latest_date |
|---|---|---|
| 0 | 2017-01-01 | 2021-08-29 |


## Joining our Datasets
Now that we know our date ranges run from January 2017 through to almost the end of August 2021 for both our prices and transactions datasets - we can get started on joining our tables together!
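
As a quick sanity check on those ranges, the sketch below compares the dates returned by Questions 1 and 2 in plain Python. The variable names are hypothetical, and it assumes the prices table has no missing days inside its range, which the queries above do not verify.

```python
from datetime import date

# Ranges copied from the outputs of Questions 1 and 2.
txn_range = (date(2017, 1, 1), date(2021, 8, 27))
price_range = (date(2017, 1, 1), date(2021, 8, 29))

# The transaction range sits inside the price range, so (assuming no gaps
# in the prices data) every transaction date has a matching price.
covered = price_range[0] <= txn_range[0] and txn_range[1] <= price_range[1]

# Number of days of price data after the last transaction.
trailing_days = (price_range[1] - txn_range[1]).days

print(covered, trailing_days)
```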

Let's make use of our ERD shown above to combine the `trading.transactions` table and the `trading.members` table to answer a few simple questions about our mentors!

### Question 3
Which top 3 mentors have the most Bitcoin quantity as of the 29th of August?

In [5]:
pd.read_sql_query(
    """
    SELECT
        member_id,
        SUM(
            CASE 
              WHEN txn_type='BUY' THEN quantity 
              WHEN txn_type='SELL' THEN -quantity
              ELSE 0
            END
           ) AS btc_quantity
    FROM trading.transactions
    WHERE ticker = 'BTC'
    GROUP BY member_id
    ORDER BY btc_quantity DESC
    LIMIT 3;
    """,
    connection
)

|   | member_id | btc_quantity |
|---|---|---|
| 0 | a87ff6 | 4160.219868 |
| 1 | c20ad4 | 4046.090895 |
| 2 | 167909 | 3945.198079 |
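
To make the `CASE` logic in the query above concrete, here is a minimal pure-Python sketch of the same idea: buys add to a member's holding, sells subtract from it, and the results are ranked. The transactions and member ids are made up for illustration and do not come from the course data.

```python
# Hypothetical transactions: (member_id, ticker, txn_type, quantity)
transactions = [
    ("m1", "BTC", "BUY", 5.0),
    ("m1", "BTC", "SELL", 2.0),
    ("m2", "BTC", "BUY", 1.5),
    ("m2", "ETH", "BUY", 10.0),  # ignored below: not BTC
    ("m3", "BTC", "BUY", 4.0),
    ("m3", "BTC", "SELL", 4.0),
]


def signed_quantity(txn_type, quantity):
    """Mirror the CASE expression: BUY adds, SELL subtracts, otherwise 0."""
    if txn_type == "BUY":
        return quantity
    if txn_type == "SELL":
        return -quantity
    return 0.0


# Equivalent of WHERE ticker = 'BTC' ... GROUP BY member_id with SUM(...).
btc_quantity = {}
for member_id, ticker, txn_type, quantity in transactions:
    if ticker != "BTC":
        continue
    btc_quantity[member_id] = (
        btc_quantity.get(member_id, 0.0) + signed_quantity(txn_type, quantity)
    )

# Equivalent of ORDER BY btc_quantity DESC LIMIT 3.
top_3 = sorted(btc_quantity.items(), key=lambda item: item[1], reverse=True)[:3]
print(top_3)
```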


## Calculating Portfolio Value
Now let's combine all 3 tables using only INNER JOIN so we can utilise all of our datasets together.
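
One thing worth keeping in mind with INNER JOIN is that rows without a match on both sides are dropped. The sketch below shows this on a tiny, made-up dataset using Python's built-in `sqlite3` module instead of the tutorial's MySQL database. The table contents, the simplified columns, and the `DOGE` ticker are all hypothetical.

```python
import sqlite3

# In-memory toy database with simplified, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE members (member_id TEXT, region TEXT);
    CREATE TABLE transactions (member_id TEXT, ticker TEXT, quantity REAL);
    CREATE TABLE prices (ticker TEXT, price REAL);

    INSERT INTO members VALUES
        ('m1', 'Australia'), ('m2', 'India'), ('m3', 'Asia');
    INSERT INTO transactions VALUES
        ('m1', 'ETH', 2.0), ('m2', 'ETH', 1.0), ('m2', 'DOGE', 50.0);
    INSERT INTO prices VALUES ('ETH', 100.0);
    """
)

rows = conn.execute(
    """
    SELECT m.member_id, m.region, t.ticker, t.quantity * p.price AS value
    FROM members AS m
    INNER JOIN transactions AS t
      ON m.member_id = t.member_id
    INNER JOIN prices AS p
      ON t.ticker = p.ticker
    ORDER BY m.member_id
    """
).fetchall()

for row in rows:
    print(row)
```

Member `m3` has no transactions and the `DOGE` transaction has no price, so neither appears in the result. Only the two `ETH` rows survive both joins.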

### Question 4
What is the total value of all Ethereum portfolios for each region at the end date of our analysis? Order the output by descending portfolio value.

In [6]:
pd.read_sql_query(
    """
    WITH eth_latest_price AS (
    SELECT
      ticker,
      price
    FROM trading.prices
    WHERE ticker = 'ETH'
      AND market_date = '2021-08-29'
    )

    SELECT
      m.region AS region,
      SUM(
          CASE 
            WHEN t.txn_type='BUY' THEN quantity
            WHEN t.txn_type='SELL' THEN -quantity
            ELSE 0
          END
          ) * eth_latest_price.price AS ethereum_value
    FROM trading.members AS m
    INNER JOIN trading.transactions AS t
      ON m.member_id=t.member_id
    INNER JOIN eth_latest_price
      ON t.ticker=eth_latest_price.ticker
    WHERE t.ticker = 'ETH'
    GROUP BY region, eth_latest_price.price
    ORDER BY ethereum_value DESC;
    """,
    connection
)

|   | region | ethereum_value |
|---|---|---|
| 0 | United States | 50688410.0 |
| 1 | Australia | 40076020.0 |
| 2 | India | 6276427.0 |
| 3 | Asia | 5011671.0 |
| 4 | Africa | 2183933.0 |


### Question 5
What is the average value of each Ethereum portfolio in each region? Sort this output in descending order.

In [7]:
pd.read_sql_query(
    """
    WITH eth_latest_price AS (
    SELECT
      ticker,
      price
    FROM trading.prices
    WHERE ticker = 'ETH'
      AND market_date = '2021-08-29'
    )

    SELECT
      m.region AS region,
      AVG(
          CASE 
            WHEN t.txn_type='BUY' THEN quantity
            WHEN t.txn_type='SELL' THEN -quantity
            ELSE 0
          END
          ) * eth_latest_price.price AS avg_ethereum_value
    FROM trading.members AS m
    INNER JOIN trading.transactions AS t
      ON m.member_id=t.member_id
    INNER JOIN eth_latest_price
      ON t.ticker=eth_latest_price.ticker
    WHERE t.ticker = 'ETH'
    GROUP BY region, eth_latest_price.price
    ORDER BY avg_ethereum_value DESC;
    """,
    connection
)

|   | region | avg_ethereum_value |
|---|---|---|
| 0 | Australia | 10752.890319 |
| 1 | United States | 10549.097535 |
| 2 | Asia | 8933.460081 |
| 3 | India | 8036.397768 |
| 4 | Africa | 3899.881039 |


Mmm hang on a second...does the output for the above query look correct to you?

Let's try again - this time we will calculate the total sum of portfolio value and then manually divide it by the total number of mentors in each region!

In [8]:
pd.read_sql_query(
    """
    WITH eth_latest_price AS (
    SELECT
      ticker,
      price
    FROM trading.prices
    WHERE ticker = 'ETH'
      AND market_date = '2021-08-29'
    ),

    calculations AS ( 
    SELECT
      m.region AS region,
      SUM(
          CASE 
            WHEN t.txn_type='BUY' THEN quantity
            WHEN t.txn_type='SELL' THEN -quantity
            ELSE 0
          END
          ) * eth_latest_price.price AS ethereum_value,
      COUNT(DISTINCT m.member_id) AS mentor_count
    FROM trading.members AS m
    INNER JOIN trading.transactions AS t
      ON m.member_id=t.member_id
    INNER JOIN eth_latest_price
      ON t.ticker=eth_latest_price.ticker
    WHERE t.ticker = 'ETH'
    GROUP BY region, eth_latest_price.price
    )
  
    SELECT
      *,
      ethereum_value / mentor_count AS avg_ethereum_value
    FROM calculations
    ORDER BY avg_ethereum_value DESC;
    """,
    connection
)

|   | region | ethereum_value | mentor_count | avg_ethereum_value |
|---|---|---|---|---|
| 0 | Australia | 40076020.0 | 4 | 10019010.0 |
| 1 | United States | 50688410.0 | 7 | 7241202.0 |
| 2 | India | 6276427.0 | 1 | 6276427.0 |
| 3 | Asia | 5011671.0 | 1 | 5011671.0 |
| 4 | Africa | 2183933.0 | 1 | 2183933.0 |


In [9]:
pd.read_sql_query(
    """
     WITH eth_latest_price AS (
    SELECT
      ticker,
      price
    FROM trading.prices
    WHERE ticker = 'ETH'
      AND market_date = '2021-08-29'
    )
 
    SELECT
      m.region AS region,
      SUM(
          CASE 
            WHEN t.txn_type='BUY' THEN quantity
            WHEN t.txn_type='SELL' THEN -quantity
            ELSE 0
          END
          ) * eth_latest_price.price AS ethereum_value,
      COUNT(DISTINCT m.member_id) AS mentor_count
    FROM trading.members AS m
    INNER JOIN trading.transactions AS t
      ON m.member_id=t.member_id
    INNER JOIN eth_latest_price
      ON t.ticker=eth_latest_price.ticker
    WHERE t.ticker = 'ETH'
    GROUP BY region, eth_latest_price.price
    ORDER BY ethereum_value DESC
    """,
    connection
)

|   | region | ethereum_value | mentor_count |
|---|---|---|---|
| 0 | United States | 50688410.0 | 7 |
| 1 | Australia | 40076020.0 | 4 |
| 2 | India | 6276427.0 | 1 |
| 3 | Asia | 5011671.0 | 1 |
| 4 | Africa | 2183933.0 | 1 |


# References
- [Data With Danny Course - Step 5](https://github.com/DataWithDanny/sql-masterclass/blob/main/course-content/step5.md)

### Bonus
Why is the first calculation of the average Ethereum portfolio value wrong?

In [16]:
pd.read_sql_query(
    """
    WITH eth_latest_price AS (
    SELECT
      ticker,
      price
    FROM trading.prices
    WHERE ticker = 'ETH'
      AND market_date = '2021-08-29'
    ),

    calculations AS (
    SELECT
      m.region AS region,
      SUM(
          CASE 
            WHEN t.txn_type='BUY' THEN quantity
            WHEN t.txn_type='SELL' THEN -quantity
            ELSE 0
          END
          ) * eth_latest_price.price AS ethereum_value,
      COUNT(t.txn_date) AS counting,
      AVG(
          CASE 
            WHEN t.txn_type='BUY' THEN quantity
            WHEN t.txn_type='SELL' THEN -quantity
            ELSE 0
          END
          ) * eth_latest_price.price AS avg_ethereum_value
    FROM trading.members AS m
    INNER JOIN trading.transactions AS t
      ON m.member_id=t.member_id
    INNER JOIN eth_latest_price
      ON t.ticker=eth_latest_price.ticker
    WHERE t.ticker = 'ETH'
    GROUP BY region, eth_latest_price.price
    ORDER BY ethereum_value DESC)

    SELECT 
      *,
      avg_ethereum_value * counting
    FROM calculations;
    """,
    connection
)

|   | region | ethereum_value | counting | avg_ethereum_value | avg_ethereum_value * counting |
|---|---|---|---|---|---|
| 0 | United States | 50688410.0 | 4805 | 10549.097535 | 50688410.0 |
| 1 | Australia | 40076020.0 | 3727 | 10752.890319 | 40076020.0 |
| 2 | India | 6276427.0 | 781 | 8036.397768 | 6276427.0 |
| 3 | Asia | 5011671.0 | 561 | 8933.460081 | 5011671.0 |
| 4 | Africa | 2183933.0 | 560 | 3899.881039 | 2183933.0 |


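As you can see above, it is wrong because the average is calculated over the total number of transactions, not over the number of members: multiplying `avg_ethereum_value` by `counting` (the number of transactions) gives back `ethereum_value` exactly.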
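
The same effect can be reproduced on a tiny, made-up dataset. The sketch below uses Python's built-in `sqlite3` module rather than the tutorial's MySQL database. The two members, their transactions, and the price of 100 are all hypothetical, and the `CASE` expression is simplified to BUY and SELL only.

```python
import sqlite3

# In-memory toy database with hypothetical members and transactions.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE members (member_id TEXT, region TEXT);
    CREATE TABLE transactions (member_id TEXT, txn_type TEXT, quantity REAL);

    INSERT INTO members VALUES ('m1', 'Australia'), ('m2', 'Australia');
    INSERT INTO transactions VALUES
        ('m1', 'BUY', 4.0), ('m1', 'BUY', 2.0), ('m1', 'SELL', 1.0),
        ('m2', 'BUY', 3.0);
    """
)

price = 100.0  # hypothetical latest ETH price

net_quantity, avg_per_txn, txn_count, member_count = conn.execute(
    """
    SELECT
      SUM(CASE WHEN t.txn_type = 'BUY' THEN t.quantity ELSE -t.quantity END),
      AVG(CASE WHEN t.txn_type = 'BUY' THEN t.quantity ELSE -t.quantity END),
      COUNT(*),
      COUNT(DISTINCT m.member_id)
    FROM members AS m
    INNER JOIN transactions AS t
      ON m.member_id = t.member_id
    GROUP BY m.region
    """
).fetchone()

# AVG divides by the number of joined rows, i.e. transactions.
avg_value_per_txn = avg_per_txn * price

# Dividing the total by the number of distinct members gives the
# average portfolio value per member.
avg_value_per_member = net_quantity * price / member_count

print(txn_count, member_count, avg_value_per_txn, avg_value_per_member)
```

With 4 transactions but only 2 members, `AVG` divides the net quantity of 8 by 4 rather than by 2, so the per-transaction figure is half the per-member figure in this toy case.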