# Step 2 - Exploring The Members Data

In [2]:
import pandas as pd
import mysql.connector as sql
import os

In [4]:
connection = sql.connect(
    host = os.environ.get('mysql_host'),
    user = os.environ.get('mysql_user'),
    password = os.environ.get('mysql_password')
)
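
`os.environ.get` returns `None` for a variable that is not set, so a missing setting is passed silently to `sql.connect`. A minimal sketch of an up-front check follows; the helper name `read_mysql_settings` is hypothetical and not part of the original notebook.

```python
import os

# Names of the environment variables read by the connection cell above.
REQUIRED_VARS = ("mysql_host", "mysql_user", "mysql_password")


def read_mysql_settings(env=os.environ):
    """Hypothetical helper: collect the connection settings or fail loudly."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Drop the 'mysql_' prefix so the keys match the host/user/password keywords.
    return {name[len("mysql_"):]: env[name] for name in REQUIRED_VARS}
```

With this in place, the connection could be opened as `sql.connect(**read_mysql_settings())`.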

Let's now inspect our `trading.members` table in a bit more depth.

## Table Records
The dataset has 3 columns and 14 rows, although only 10 of the rows appear in the output below:

In [5]:
pd.read_sql_query(
    """
    SELECT *
    FROM trading.members;
    """,
    connection
)

|   | member_id | first_name | region |
|---|---|---|---|
| 0 | c4ca42 | Danny | Australia |
| 1 | c81e72 | Vipul | United States |
| 2 | eccbc8 | Charlie | United States |
| 3 | a87ff6 | Nandita | United States |
| 4 | e4da3b | Rowan | United States |
| 5 | 167909 | Ayush | United States |
| 6 | 8f14e4 | Alex | United States |
| 7 | c9f0f8 | Abe | United States |
| 8 | 45c48c | Ben | Australia |
| 9 | d3d944 | Enoch | Africa |


# Basic SQL Introduction
Let's try to answer a few questions using this dataset to better understand the DWD mentor team from the `trading.members` table.

Each question has its own SQL query solution which you can run to generate the required data outputs.

In the [original course](https://github.com/DataWithDanny/sql-masterclass/blob/main/course-content/step2.md) all the solutions are initially hidden.

## Question 1
Show only the top 5 rows from the `trading.members` table

In [7]:
pd.read_sql_query(
    """
    SELECT *
    FROM trading.members
    LIMIT 5;
    """,
    connection
)

|   | member_id | first_name | region |
|---|---|---|---|
| 0 | c4ca42 | Danny | Australia |
| 1 | c81e72 | Vipul | United States |
| 2 | eccbc8 | Charlie | United States |
| 3 | a87ff6 | Nandita | United States |
| 4 | e4da3b | Rowan | United States |


## Question 2
Sort all the rows in the table by `first_name` in alphabetical order and show the top 3 rows

In [8]:
pd.read_sql_query(
    """
    SELECT *
    FROM trading.members
    ORDER BY first_name
    LIMIT 3;
    """,
    connection
)

|   | member_id | first_name | region |
|---|---|---|---|
| 0 | c9f0f8 | Abe | United States |
| 1 | 8f14e4 | Alex | United States |
| 2 | 167909 | Ayush | United States |


## Question 3
Which records from `trading.members` are from the United States region?

In [9]:
pd.read_sql_query(
    """
    SELECT *
    FROM trading.members
    WHERE region='United States';
    """,
    connection
)

|   | member_id | first_name | region |
|---|---|---|---|
| 0 | c81e72 | Vipul | United States |
| 1 | eccbc8 | Charlie | United States |
| 2 | a87ff6 | Nandita | United States |
| 3 | e4da3b | Rowan | United States |
| 4 | 167909 | Ayush | United States |
| 5 | 8f14e4 | Alex | United States |
| 6 | c9f0f8 | Abe | United States |


## Question 4
Select only the `member_id` and `first_name` columns for members who are not from Australia

In [10]:
pd.read_sql_query(
    """
    SELECT 
        member_id,
        first_name
    FROM trading.members
    WHERE region != 'Australia';
    """,
    connection
)

|   | member_id | first_name |
|---|---|---|
| 0 | c81e72 | Vipul |
| 1 | eccbc8 | Charlie |
| 2 | a87ff6 | Nandita |
| 3 | e4da3b | Rowan |
| 4 | 167909 | Ayush |
| 5 | 8f14e4 | Alex |
| 6 | c9f0f8 | Abe |
| 7 | d3d944 | Enoch |
| 8 | 6512bd | Vikram |
| 9 | c20ad4 | Leah |


## Question 5
Return the unique region values from the `trading.members` table and sort the output by reverse alphabetical order

In [11]:
pd.read_sql_query(
    """
    SELECT DISTINCT 
        region
    FROM trading.members
    ORDER BY region DESC;
    """,
    connection
)

|   | region |
|---|---|
| 0 | United States |
| 1 | India |
| 2 | Australia |
| 3 | Asia |
| 4 | Africa |


## Question 6
How many mentors are there from Australia or the United States?

In [13]:
pd.read_sql_query(
    """
    SELECT 
        COUNT(*) AS mentor_count
    FROM trading.members
    WHERE region IN ('United States', 'Australia');
    """,
    connection
)

|   | mentor_count |
|---|---|
| 0 | 11 |


## Question 7
How many mentors are not from Australia or the United States?

In [14]:
pd.read_sql_query(
    """
    SELECT 
        COUNT(*) AS mentor_count
    FROM trading.members
    WHERE region NOT IN ('United States', 'Australia');
    """,
    connection
)

|   | mentor_count |
|---|---|
| 0 | 3 |
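
The counts in Questions 6 and 7 add up to the 14 rows in the table (11 + 3) because `IN` and `NOT IN` split the rows into two complementary groups, provided `region` is never `NULL`. The sketch below checks this on an in-memory SQLite table used as a stand-in for `trading.members`. It is an assumption-laden illustration: it loads only the ten rows displayed in the Table Records output, so its counts are smaller than the MySQL results, and the extra `NULL` row is hypothetical.

```python
import sqlite3

# Stand-in data: the ten rows shown in the Table Records output, not the full table.
rows = [
    ("c4ca42", "Danny", "Australia"),
    ("c81e72", "Vipul", "United States"),
    ("eccbc8", "Charlie", "United States"),
    ("a87ff6", "Nandita", "United States"),
    ("e4da3b", "Rowan", "United States"),
    ("167909", "Ayush", "United States"),
    ("8f14e4", "Alex", "United States"),
    ("c9f0f8", "Abe", "United States"),
    ("45c48c", "Ben", "Australia"),
    ("d3d944", "Enoch", "Africa"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (member_id TEXT, first_name TEXT, region TEXT)")
conn.executemany("INSERT INTO members VALUES (?, ?, ?)", rows)


def counts():
    """Return (IN count, NOT IN count, total row count) for the stand-in table."""
    query = """
        SELECT
            (SELECT COUNT(*) FROM members
             WHERE region IN ('United States', 'Australia')),
            (SELECT COUNT(*) FROM members
             WHERE region NOT IN ('United States', 'Australia')),
            (SELECT COUNT(*) FROM members)
    """
    return conn.execute(query).fetchone()


before = counts()  # the two groups add up to the total

# Hypothetical extra row with no region: NULL matches neither IN nor NOT IN.
conn.execute("INSERT INTO members VALUES ('000000', 'Nobody', NULL)")
after = counts()  # the total grows by one, the two groups do not

print(before, after)
```

If a column can contain `NULL`, the two counts will not sum to the table size, which is worth remembering when cross-checking results this way.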


## Question 8
How many mentors are there per region? Sort the output by regions with the most mentors to the least

In [16]:
pd.read_sql_query(
    """
    SELECT 
        region,
        COUNT(*) AS mentor_count
    FROM trading.members
    GROUP BY region
    ORDER BY mentor_count DESC;
    """,
    connection
)

|   | region | mentor_count |
|---|---|---|
| 0 | United States | 7 |
| 1 | Australia | 4 |
| 2 | Africa | 1 |
| 3 | India | 1 |
| 4 | Asia | 1 |
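
For readers more familiar with Python than SQL, the grouping above can be mirrored with `collections.Counter`. This is only a sketch: it uses the `region` values of the ten rows displayed in the Table Records output as stand-in data, so the counts are smaller than the MySQL results.

```python
from collections import Counter

# Stand-in data: the region column of the ten rows shown in the Table Records output.
regions = ["Australia"] + ["United States"] * 7 + ["Australia", "Africa"]

# Counter groups equal values and counts them, like GROUP BY region with COUNT(*).
mentor_counts = Counter(regions)

# most_common() sorts by count, highest first, like ORDER BY mentor_count DESC.
for region, mentor_count in mentor_counts.most_common():
    print(region, mentor_count)
```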


## Question 9
How many US mentors and non US mentors are there?

In [20]:
pd.read_sql_query(
    """
    SELECT
        CASE
            WHEN region = 'United States' THEN 'US'
            ELSE 'Non US'
        END AS mentor_region,
        COUNT(*) AS mentor_count
    FROM trading.members
    GROUP BY mentor_region
    ORDER BY mentor_count DESC;
    """,
    connection
)

|   | mentor_region | mentor_count |
|---|---|---|
| 0 | Non US | 7 |
| 1 | US | 7 |


## Question 10
How many mentors have a first name starting with a letter before `'E'`?

In [22]:
pd.read_sql_query(
    """
    SELECT
        COUNT(*) AS mentor_count
    FROM trading.members
    WHERE LEFT(first_name, 1) < 'E';
    """,
    connection
)

|   | mentor_count |
|---|---|
| 0 | 6 |


# Appendix
`SELECT *`

In practice, always try to return only the specific columns you need, and use `SELECT *` sparingly!

`LIMIT`

Note that `LIMIT` is implemented as `TOP` in some database flavours.

Be careful when using `LIMIT` with cloud data warehouses such as BigQuery: although the query returns only the number of rows you ask for, BigQuery's on-demand pricing is based on the amount of data scanned, and a `LIMIT` generally does not reduce it!

Best practice is to always apply `WHERE` filters on specific partitions where possible to narrow down the amount of data that must be scanned - reducing your query costs and speeding up your query execution!

`!=` or `<>` for "not equals"

Question 4 uses `!=` to express "not equals"; `<>` is an equivalent operator.

You can use either `!=` or `<>` in `WHERE` filters to exclude records.
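
The equivalence can be checked quickly. The sketch below assumes an in-memory SQLite table as a stand-in for `trading.members` (SQLite, like MySQL, accepts both operators) and loads just three of the rows shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (member_id TEXT, first_name TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO members VALUES (?, ?, ?)",
    [
        # Stand-in data: three rows taken from the Table Records output.
        ("c4ca42", "Danny", "Australia"),
        ("c81e72", "Vipul", "United States"),
        ("d3d944", "Enoch", "Africa"),
    ],
)

# The same filter written with each "not equals" operator.
query = "SELECT first_name FROM members WHERE region {op} 'Australia' ORDER BY first_name"
with_bang = conn.execute(query.format(op="!=")).fetchall()
with_angle = conn.execute(query.format(op="<>")).fetchall()

print(with_bang == with_angle)
```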

# References
- [Data With Danny Course - Step 2](https://github.com/DataWithDanny/sql-masterclass/blob/main/course-content/step2.md)