# Step 6 - Planning Ahead for Data Analysis

## Planning Ahead
When writing SQL queries, it is tempting to jump straight to the problem at hand. But what happens when we stop and plan our approach to a multi-part problem?

### Further Portfolio Questions
Let's take this next series of questions and methodically break down our approach before we reveal the answers.

**Questions 1-4**

1. What is the total portfolio value for each mentor at the end of 2020?

2. What is the total portfolio value for each region at the end of 2019?

3. What percentage of regional portfolio values does each mentor contribute at the end of 2018?

4. Does this region contribution percentage change when we look across both Bitcoin and Ethereum portfolios independently at the end of 2017?

We can see that most questions are based on total portfolio value, apart from the final question, which requires the two tickers to be treated separately.

Additionally, the region for each mentor is going to be important for both questions 3 and 4.

We also need to factor in the timing for these questions. Each one asks about a different year end, so it's not going to be as straightforward as our previous questions, which required only the final portfolio value.

For these questions - let's first create a base table which we can refer to later to solve our problems!

## Create a Base Table
In SQL we can make use of a `TEMP` table, which is stored in a temporary schema and disappears once the SQL session is closed. This is very useful in practice because you don't always have write access to production databases! In this notebook we build the base table with pandas instead and write it to the database as `base_table_step6`.
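
To see the session-scoped behaviour of a `TEMP` table in isolation, here is a minimal sketch. It assumes SQLite (from the Python standard library) as a stand-in for the course database, and the table name `temp_demo` is made up for illustration. Other engines follow the same idea, but their syntax and catalog tables differ.

```python
import os
import sqlite3
import tempfile

# Assumption: SQLite stands in for the course database so the sketch runs anywhere.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")


def table_names(conn):
    """Return the names of all permanent and temporary tables visible to conn."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' "
        "UNION ALL "
        "SELECT name FROM sqlite_temp_master WHERE type = 'table'"
    ).fetchall()
    return sorted(name for (name,) in rows)


# First session: create one permanent table and one TEMP table.
first = sqlite3.connect(db_path)
first.execute("CREATE TABLE members (first_name TEXT, region TEXT)")
first.execute("CREATE TEMP TABLE temp_demo (first_name TEXT, quantity REAL)")
first.commit()
tables_in_first_session = table_names(first)
first.close()

# Second session: only the permanent table is still there.
second = sqlite3.connect(db_path)
tables_in_second_session = table_names(second)
second.close()

print(tables_in_first_session)
print(tables_in_second_session)
```

The first list contains both tables, while the second contains only `members`, because the temporary table was dropped when the first connection closed.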

First, let's create a portfolio quantity base table which summarizes the data we need.

In [16]:
import pandas as pd
import os
import sqlalchemy

In [22]:
host = os.environ.get('mysql_host')
user = os.environ.get('mysql_user')
password = os.environ.get('mysql_password')
engine = sqlalchemy.create_engine(f'mysql+pymysql://{user}:{password}@{host}/trading')
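
One thing to watch with a connection string like the one above: characters such as `@` or `/` in a password would be misread as URL delimiters. A minimal sketch of one way to guard against this, assuming percent-encoding with the standard library is acceptable for your setup (the helper name `build_mysql_url` and the credentials are made up for illustration):

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, database):
    """Build a mysql+pymysql URL with the user and password percent-encoded."""
    return f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"


# Hypothetical credentials, for illustration only.
example_url = build_mysql_url("analyst", "p@ss/word", "localhost", "trading")
print(example_url)
```

The resulting string can be passed to `sqlalchemy.create_engine` in the same way as the f-string above.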

## Step 1
Create a base table that has each mentor's name, region and end of year total quantity for each ticker.

In [18]:
pd.read_sql_query('SELECT * FROM members LIMIT 5', engine)

|     | member_id | first_name | region |
| --- | --- | --- | --- |
| 0 | c4ca42 | Danny | Australia |
| 1 | c81e72 | Vipul | United States |
| 2 | eccbc8 | Charlie | United States |
| 3 | a87ff6 | Nandita | United States |
| 4 | e4da3b | Rowan | United States |


In [41]:
query = """SELECT
      m.first_name,
      m.region,
      t.ticker,
      DATE_ADD(DATE_ADD(MAKEDATE(EXTRACT(YEAR FROM t.txn_date),1), INTERVAL 12 MONTH), INTERVAL -1 DAY) AS year_end,
      ROUND(
          SUM(
              CASE
                WHEN t.txn_type='BUY' THEN quantity
                WHEN t.txn_type='SELL' THEN -quantity
                ELSE 0
              END
          ), 2
      ) AS yearly_quantity
    FROM trading.members m
    INNER JOIN trading.transactions t
      ON m.member_id=t.member_id
    GROUP BY m.first_name, m.region, t.ticker, year_end;"""

pd.read_sql_query(
    query,
    engine
)

|     | first_name | region | ticker | year_end | yearly_quantity |
| --- | --- | --- | --- | --- | --- |
| 0 | Vipul | United States | BTC | 2017-12-31 | 433.56 |
| 1 | Charlie | United States | BTC | 2017-12-31 | 590.32 |
| 2 | Nandita | United States | BTC | 2017-12-31 | 1021.56 |
| 3 | Rowan | United States | BTC | 2017-12-31 | 713.25 |
| 4 | Ayush | United States | BTC | 2017-12-31 | 794.53 |
| ... | ... | ... | ... | ... | ... |
| 135 | Ayush | United States | ETH | 2021-12-31 | 66.31 |
| 136 | Abe | United States | ETH | 2021-12-31 | 223.20 |
| 137 | Rowan | United States | BTC | 2021-12-31 | 280.15 |
| 138 | Abe | United States | BTC | 2021-12-31 | 479.33 |
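
The query does two things: it maps every transaction date to the last day of its year, and it nets `BUY` against `SELL` quantities within each mentor, ticker and year. The same logic can be sketched in plain Python to make each step visible. The transactions below are made-up illustration values, not rows from the course dataset.

```python
from collections import defaultdict
from datetime import date, timedelta


def year_end(txn_date):
    """Mirror the SQL expression: 1 January, plus 12 months, minus 1 day."""
    first_of_next_year = date(txn_date.year + 1, 1, 1)
    return first_of_next_year - timedelta(days=1)


# Hypothetical transactions: (first_name, ticker, txn_date, txn_type, quantity).
transactions = [
    ("Abe", "BTC", date(2017, 3, 1), "BUY", 10.0),
    ("Abe", "BTC", date(2017, 9, 15), "SELL", 4.0),
    ("Abe", "BTC", date(2018, 1, 20), "BUY", 2.5),
    ("Abe", "ETH", date(2017, 6, 30), "BUY", 7.0),
]

# BUY adds to the position, SELL subtracts, anything else counts as zero.
sign = {"BUY": 1, "SELL": -1}
yearly_quantity = defaultdict(float)
for first_name, ticker, txn_date, txn_type, quantity in transactions:
    key = (first_name, ticker, year_end(txn_date))
    yearly_quantity[key] += sign.get(txn_type, 0) * quantity

for key in sorted(yearly_quantity):
    print(key, round(yearly_quantity[key], 2))
```

Each dictionary key plays the role of one `GROUP BY` group, and the signed sum corresponds to the `CASE` expression inside `SUM`.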


In [42]:
base_table = pd.read_sql_query(query, engine)

base_table.to_sql(
    name='base_table_step6',
    con=engine,
    if_exists='replace'
)

## Step 2
Let's take a look at our base table now to see what data we have to play with. To keep things simple, let's look at Abe's data only.

Inspect the `year_end`, `ticker` and `yearly_quantity` values from our new table `base_table_step6` for mentor Abe only. Sort the output so that the BTC values come first, followed by the ETH values.

In [43]:
pd.read_sql_query(
    """
    SELECT
      year_end,
      ticker,
      yearly_quantity
    FROM base_table_step6
    WHERE first_name='Abe'
    ORDER BY ticker, year_end
    """,
    engine
)

|     | year_end | ticker | yearly_quantity |
| --- | --- | --- | --- |
| 0 | 2017-12-31 | BTC | 861.01 |
| 1 | 2018-12-31 | BTC | 755.15 |
| 2 | 2019-12-31 | BTC | 765.66 |
| 3 | 2020-12-31 | BTC | 859.37 |
| 4 | 2021-12-31 | BTC | 479.33 |
| 5 | 2017-12-31 | ETH | 543.21 |
| 6 | 2018-12-31 | ETH | 350.0 |
| 7 | 2019-12-31 | ETH | 464.32 |
| 8 | 2020-12-31 | ETH | 508.47 |
| 9 | 2021-12-31 | ETH | 223.2 |


We can see from the output above that `yearly_quantity` is the net quantity bought or sold within each year, not yet the total portfolio quantity that we need. To get the totals, we need a cumulative sum of the `yearly_quantity` column that is separate for each mentor (`first_name`) and `ticker`, using `year_end` as the ordering column.

We can do exactly this using a SQL window function!

## Step 3
To create the cumulative sum, we'll need to apply a window function, although we will only touch on this briefly in this course.

Create a cumulative sum for Abe which has an independent value for each ticker.

In [46]:
pd.read_sql_query(
    """
    SELECT
      year_end,
      ticker,
      yearly_quantity,
      SUM(yearly_quantity) OVER(PARTITION BY first_name, ticker ORDER BY year_end) AS cumulative_quantity
    FROM base_table_step6
    WHERE first_name='Abe'
    ORDER BY ticker, year_end;
    """,
    engine
)

|     | year_end | ticker | yearly_quantity | cumulative_quantity |
| --- | --- | --- | --- | --- |
| 0 | 2017-12-31 | BTC | 861.01 | 861.01 |
| 1 | 2018-12-31 | BTC | 755.15 | 1616.16 |
| 2 | 2019-12-31 | BTC | 765.66 | 2381.82 |
| 3 | 2020-12-31 | BTC | 859.37 | 3241.19 |
| 4 | 2021-12-31 | BTC | 479.33 | 3720.52 |
| 5 | 2017-12-31 | ETH | 543.21 | 543.21 |
| 6 | 2018-12-31 | ETH | 350.0 | 893.21 |
| 7 | 2019-12-31 | ETH | 464.32 | 1357.53 |
| 8 | 2020-12-31 | ETH | 508.47 | 1866.0 |
| 9 | 2021-12-31 | ETH | 223.2 | 2089.2 |
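
The running totals above can be reproduced by hand, which is a useful way to see what `PARTITION BY` and `ORDER BY` do inside the window. The sketch below is a plain-Python equivalent rather than how the database evaluates the query. It uses Abe's yearly quantities from the output above, with `Decimal` to keep the two-decimal arithmetic exact.

```python
from decimal import Decimal
from itertools import accumulate, groupby

# Abe's rows from the output above: (ticker, year_end, yearly_quantity).
rows = [
    ("BTC", "2017-12-31", Decimal("861.01")),
    ("BTC", "2018-12-31", Decimal("755.15")),
    ("BTC", "2019-12-31", Decimal("765.66")),
    ("BTC", "2020-12-31", Decimal("859.37")),
    ("BTC", "2021-12-31", Decimal("479.33")),
    ("ETH", "2017-12-31", Decimal("543.21")),
    ("ETH", "2018-12-31", Decimal("350.00")),
    ("ETH", "2019-12-31", Decimal("464.32")),
    ("ETH", "2020-12-31", Decimal("508.47")),
    ("ETH", "2021-12-31", Decimal("223.20")),
]

# PARTITION BY ticker: one group per ticker.
# ORDER BY year_end: rows are sorted by date inside each group.
rows.sort(key=lambda row: (row[0], row[1]))
cumulative = {}
for ticker, group in groupby(rows, key=lambda row: row[0]):
    group = list(group)
    running_totals = accumulate(quantity for _, _, quantity in group)
    for (_, year_end, _), total in zip(group, running_totals):
        cumulative[(ticker, year_end)] = total

for key, total in cumulative.items():
    print(key, total)
```

The running total restarts at the first ETH row because each ticker is its own partition.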


## Step 4
Now let's apply the same cumulative sum to the entire base table and start answering our questions.

In SQL we could `ALTER` and `UPDATE` the table to add an extra column with our new calculation. In this notebook we compute the column in pandas and write the table back to the database.

Generate an additional `cumulative_quantity` column for the `base_table_step6` table, then check that the update worked by inspecting Abe's records again!

In [115]:
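# Running total per mentor and ticker, accumulated in year_end order; cumsum keeps the original index, so values realign with base_table.
base_table['cumulative_quantity'] = base_table.sort_values('year_end').groupby(['first_name', 'ticker'])['yearly_quantity'].cumsum().round(2)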

In [116]:
base_table.to_sql(
    name='base_table_step6',
    con=engine,
    if_exists='replace'
)
pd.read_sql_table('base_table_step6', engine)

|     | index | first_name | region | ticker | year_end | yearly_quantity | cumulative_quantity |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | Vipul | United States | BTC | 2017-12-31 | 433.56 | 433.56 |
| 1 | 1 | Charlie | United States | BTC | 2017-12-31 | 590.32 | 590.32 |
| 2 | 2 | Nandita | United States | BTC | 2017-12-31 | 1021.56 | 1021.56 |
| 3 | 3 | Rowan | United States | BTC | 2017-12-31 | 713.25 | 713.25 |
| 4 | 4 | Ayush | United States | BTC | 2017-12-31 | 794.53 | 794.53 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 135 | 135 | Ayush | United States | ETH | 2021-12-31 | 66.31 | 412.73 |
| 136 | 136 | Abe | United States | ETH | 2021-12-31 | 223.20 | 2089.20 |
| 137 | 137 | Rowan | United States | BTC | 2021-12-31 | 280.15 | 2569.00 |
| 138 | 138 | Abe | United States | BTC | 2021-12-31 | 479.33 | 3720.52 |


In [117]:
pd.read_sql_query(
    """
    SELECT
      *
    FROM base_table_step6
    WHERE first_name='Abe'
    ORDER BY first_name, ticker, year_end;
    """,
    engine
)

|     | index | first_name | region | ticker | year_end | yearly_quantity | cumulative_quantity |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7 | Abe | United States | BTC | 2017-12-31 | 861.01 | 861.01 |
| 1 | 47 | Abe | United States | BTC | 2018-12-31 | 755.15 | 1616.16 |
| 2 | 58 | Abe | United States | BTC | 2019-12-31 | 765.66 | 2381.82 |
| 3 | 98 | Abe | United States | BTC | 2020-12-31 | 859.37 | 3241.19 |
| 4 | 138 | Abe | United States | BTC | 2021-12-31 | 479.33 | 3720.52 |
| 5 | 21 | Abe | United States | ETH | 2017-12-31 | 543.21 | 543.21 |
| 6 | 45 | Abe | United States | ETH | 2018-12-31 | 350.0 | 893.21 |
| 7 | 75 | Abe | United States | ETH | 2019-12-31 | 464.32 | 1357.53 |
| 8 | 110 | Abe | United States | ETH | 2020-12-31 | 508.47 | 1866.0 |
| 9 | 136 | Abe | United States | ETH | 2021-12-31 | 223.2 | 2089.2 |
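
For comparison, the `ALTER` and `UPDATE` route mentioned in this step could look roughly like the sketch below. It is an illustration under assumptions, not the course's own solution: it runs against an in-memory SQLite database, the table name `base_table_demo` is made up, and only a few of Abe's rows are loaded. MySQL rejects a subquery that reads from the table being updated (error 1093), so there the `UPDATE` would need to join against a derived table instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE base_table_demo ("
    "first_name TEXT, ticker TEXT, year_end TEXT, yearly_quantity REAL)"
)
conn.executemany(
    "INSERT INTO base_table_demo VALUES (?, ?, ?, ?)",
    [
        ("Abe", "BTC", "2017-12-31", 861.01),
        ("Abe", "BTC", "2018-12-31", 755.15),
        ("Abe", "BTC", "2019-12-31", 765.66),
        ("Abe", "ETH", "2017-12-31", 543.21),
        ("Abe", "ETH", "2018-12-31", 350.00),
    ],
)

# Step 1: add an empty column to hold the running total.
conn.execute("ALTER TABLE base_table_demo ADD COLUMN cumulative_quantity REAL")

# Step 2: fill it with the sum of every row for the same mentor and ticker,
# up to and including the current year_end.
conn.execute(
    """
    UPDATE base_table_demo
    SET cumulative_quantity = (
        SELECT ROUND(SUM(earlier.yearly_quantity), 2)
        FROM base_table_demo AS earlier
        WHERE earlier.first_name = base_table_demo.first_name
          AND earlier.ticker = base_table_demo.ticker
          AND earlier.year_end <= base_table_demo.year_end
    )
    """
)

updated_rows = conn.execute(
    "SELECT ticker, year_end, cumulative_quantity "
    "FROM base_table_demo ORDER BY ticker, year_end"
).fetchall()
for row in updated_rows:
    print(row)
```

The correlated subquery does the same job as the window function, but it rescans the table for every row, so the pandas or window-function approach is likely the better fit for larger tables.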


Now that we've obtained our base table properly - let's start answering some of these questions!

# References
- [Data With Danny Course - Step 6](https://github.com/DataWithDanny/sql-masterclass/blob/main/course-content/step6.md)