# Hi there,

In this intro lab we'll be exploring what can be achieved with SQL.

As Nicholas mentioned in Monday's meeting, learning SQL is fundamental to kicking off a career in all things data and analytics.

What you should take away from today's lab is an initial feel for how to manipulate and wrangle data. Languages and syntax will always vary, but SQL is a good starting point for understanding the mechanics of pulling, processing, and analyzing data.


______________

# SET UP 
* We can ignore this for the moment.
* This is solely how we set up our Python session: we import the tools and libraries that allow us to work with the data.

In [None]:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import pandasql as ps #this is what we'll be using to query our data using sql
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# IMPORTING THE DATA
*     This is how we make sure we can read the data stored in the CSV file.
*     Pandas helps here because it has built-in functionality for reading CSVs and other data files.
*     In today's session we'll be looking at COVID vaccination data.


In [None]:

covid_vacs = pd.read_csv(r'/kaggle/input/covid-world-vaccination-progress/country_vaccinations.csv')



# OUR FIRST QUERY
*     Now that we have the data imported into this session, we can start querying.
*     Let's start with the most basic query. If we want to query the whole table, we need two different statements.

    *         SELECT will indicate which columns we want to show.
    *         FROM will indicate which table we want to pull from.
    *         These two statements are always required to query a table in SQL.
    
    
*     When we add * in the SELECT statement, it indicates that we want to query ALL columns.

In [None]:
ps.sqldf(
    """
    SELECT 
        * 
    FROM covid_vacs
    """)

# SELECTING SPECIFIC COLUMNS
*     Let's imagine we only wanted to see specific columns.
*     In this case, we indicate which columns we'd like to query right under the SELECT statement.
*     Column names HAVE to be passed as a comma-separated list, where the last item in the list is not followed by a comma.


In [None]:

query = """
    select 
        country,
        date
    from covid_vacs
"""

ps.sqldf(query)
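
To build intuition for what `ps.sqldf` does with a query string, here is a minimal sketch that runs the same two kinds of query with Python's built-in `sqlite3` module (pandasql uses SQLite by default). The three-row table is made-up illustration data, not the real `covid_vacs` dataset.

```python
import sqlite3

# Made-up sample rows for illustration only (not real figures).
sample_rows = [
    ("United States", "2021-03-01", 100.0),
    ("United Kingdom", "2021-03-01", 80.0),
    ("Canada", "2021-03-01", 20.0),
]

conn = sqlite3.connect(":memory:")  # temporary database held in memory
conn.execute(
    "CREATE TABLE covid_vacs (country TEXT, date TEXT, total_vaccinations REAL)"
)
conn.executemany("INSERT INTO covid_vacs VALUES (?, ?, ?)", sample_rows)

# SELECT * returns every column of the table named in FROM.
all_columns = conn.execute("SELECT * FROM covid_vacs")
all_column_names = [col[0] for col in all_columns.description]

# A comma-separated list returns only the named columns.
two_columns = conn.execute("SELECT country, date FROM covid_vacs")
two_column_names = [col[0] for col in two_columns.description]

print(all_column_names)  # ['country', 'date', 'total_vaccinations']
print(two_column_names)  # ['country', 'date']
```

The SQL inside the strings is the same as in the cells above; only the Python wrapper around it differs.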

# JUMPING INTO OTHER CLAUSES

Now that we've covered SELECT and FROM, we'll jump into other clauses or statements that will help you navigate the data.



**WHERE**

* If we want to look at only data from the US, we use the WHERE clause to filter the data.
* In this case we filter using only one condition: the country is (or equals) "United States".
* By using the WHERE clause, we are filtering which **ROWS** we see. Remember that to select which **COLUMNS** we want to see, we use the SELECT statement.

In [None]:
query = """
    select * 
    from covid_vacs
    WHERE country = "United States"
"""
ps.sqldf(query)

**WHERE WITH TWO CONDITIONS [AND]**

* We might want to filter the data to show only rows for the US with dates from March 1st, 2021 onward. To do this, we add a second condition to the WHERE clause using "AND".
* This will yield a result in which BOTH conditions are met.

In [None]:
query = """
    select * 
    from covid_vacs
    WHERE country = "United States"
    AND date >= "2021-03-01"
"""
ps.sqldf(query)

**WHERE WITH TWO CONDITIONS [OR]**

* We might want to filter the data to show rows for either the US or the UK. To do this, we combine two conditions using "OR".
* This will yield a result in which AT LEAST ONE of the conditions is met.

In [None]:
query = """
    select * 
    from covid_vacs
    WHERE country = "United States"
    OR country = "United Kingdom"
"""
ps.sqldf(query)

This same query can be achieved with less code by rewriting the WHERE clause with IN. In this case, we are telling the WHERE clause to keep any row where the country column is either the United Kingdom or the United States.

In [None]:
query = """
    select * 
    from covid_vacs
    where country in ("United States","United Kingdom")
"""
ps.sqldf(query)
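
As a quick check that the OR form and the IN form really select the same rows, here is a small sketch using Python's built-in `sqlite3` module on a made-up three-row table (illustration data only, not the real dataset).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE covid_vacs (country TEXT, date TEXT)")
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO covid_vacs VALUES (?, ?)",
    [
        ("United States", "2021-03-01"),
        ("United Kingdom", "2021-03-01"),
        ("Canada", "2021-03-01"),
    ],
)

# Two conditions joined with OR.
with_or = sorted(conn.execute(
    """
    SELECT country FROM covid_vacs
    WHERE country = 'United States' OR country = 'United Kingdom'
    """
).fetchall())

# The same filter written as a list of accepted values.
with_in = sorted(conn.execute(
    """
    SELECT country FROM covid_vacs
    WHERE country IN ('United States', 'United Kingdom')
    """
).fetchall())

print(with_or == with_in)  # True
```

The sketch uses single quotes around text values, which is the standard SQL way to write a string.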

We can combine the conditions we used above into one query to show only rows for the UK and the US with dates from March 1st onward.

In [None]:
query = """
    select * 
    from covid_vacs
    where country in ("United States","United Kingdom")
    AND date >= "2021-03-01"
"""
ps.sqldf(query)
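
One thing to watch when the country filter is written with OR instead of IN: SQL evaluates AND before OR, so the OR conditions need parentheses to behave like the IN version. The sketch below shows the difference on a made-up four-row table, using Python's built-in `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE covid_vacs (country TEXT, date TEXT)")
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO covid_vacs VALUES (?, ?)",
    [
        ("United States", "2021-02-15"),
        ("United States", "2021-03-02"),
        ("United Kingdom", "2021-02-15"),
        ("United Kingdom", "2021-03-02"),
    ],
)

# Without parentheses this reads as:
#   country = US  OR  (country = UK AND date >= March 1st)
# so the February row for the US slips through.
no_parens = conn.execute(
    """
    SELECT * FROM covid_vacs
    WHERE country = 'United States'
       OR country = 'United Kingdom'
      AND date >= '2021-03-01'
    """
).fetchall()

# With parentheses the date condition applies to both countries.
with_parens = conn.execute(
    """
    SELECT * FROM covid_vacs
    WHERE (country = 'United States' OR country = 'United Kingdom')
      AND date >= '2021-03-01'
    """
).fetchall()

print(len(no_parens))    # 3
print(len(with_parens))  # 2
```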

We can also have cases in which we want to exclude some rows. If we wanted to look at all countries BUT the United States, we could use either of two methods.

In [None]:
query = """
    select * 
    from covid_vacs
    where country <> "United States"
"""

ps.sqldf(query)

In [None]:
query = """
    select * 
    from covid_vacs
    where country NOT IN ("United States")
"""

ps.sqldf(query)

Note that the values in your conditions must match the format of the data in the column.

If a date is stored as YYYY-MM-DD, you have to write the date in your condition the same way. Strings are even stricter: the comparison is case-sensitive, so if I write United States in lower case, the query will yield no results.

In [None]:
query = """
    select * 
    from covid_vacs
    where country in ("united states")
"""
ps.sqldf(query)
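
If the exact capitalization in the data is not known, one common workaround is to lower-case the column before comparing. This is a minimal sketch with Python's built-in `sqlite3` module and a made-up one-row table; `LOWER` is a standard SQLite function, but whether this approach suits the real dataset is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE covid_vacs (country TEXT, date TEXT)")
# Made-up row for illustration only.
conn.execute("INSERT INTO covid_vacs VALUES ('United States', '2021-03-01')")

# Case-sensitive comparison: the lower-case text does not match.
exact_match = conn.execute(
    "SELECT * FROM covid_vacs WHERE country = 'united states'"
).fetchall()

# Lower-casing the column first makes the comparison case-insensitive.
lowered_match = conn.execute(
    "SELECT * FROM covid_vacs WHERE LOWER(country) = 'united states'"
).fetchall()

print(len(exact_match))    # 0
print(len(lowered_match))  # 1
```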

**ORDER BY**

ORDER BY helps to sort the data that you query. It needs two different components to do its job:

* the **columns** you want to order by.
* whether you want to sort in **ascending or descending** order.

In this first example we'll order the data by date. We want to see the latest dates first, so we use "ORDER BY date desc". We use descending because we want the largest values to be higher up. When sorting by dates, whatever is "later" in time is always larger.

In [None]:
query = """
    select 
        *
    from covid_vacs
    ORDER BY date desc
    
"""
ps.sqldf(query)

With this query, we have learned that the latest date available in the dataset is around the 15th of March. Now let's say we wanted to see which countries had the highest total number of vaccinations as of March 1st.

In [None]:
query = """
    select 
        country,
        date,
        total_vaccinations
    from covid_vacs
    WHERE date = "2021-03-01"
    ORDER BY total_vaccinations desc
    
"""
ps.sqldf(query)
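
ORDER BY can also take more than one column, each with its own direction. The sketch below sorts a made-up four-row table by country (A to Z) and, within each country, by date with the latest first. It uses Python's built-in `sqlite3` module, and the rows are illustration data only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE covid_vacs (country TEXT, date TEXT)")
# Made-up rows for illustration only, inserted in no particular order.
conn.executemany(
    "INSERT INTO covid_vacs VALUES (?, ?)",
    [
        ("United States", "2021-03-01"),
        ("Canada", "2021-03-02"),
        ("United States", "2021-03-02"),
        ("Canada", "2021-03-01"),
    ],
)

# First sort key: country ascending. Second sort key: date descending.
ordered = conn.execute(
    """
    SELECT country, date
    FROM covid_vacs
    ORDER BY country ASC, date DESC
    """
).fetchall()

for row in ordered:
    print(row)
# ('Canada', '2021-03-02')
# ('Canada', '2021-03-01')
# ('United States', '2021-03-02')
# ('United States', '2021-03-01')
```

Dates stored as YYYY-MM-DD text sort correctly because alphabetical order and chronological order coincide in that format.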

**LIMIT**

The LIMIT clause allows us to limit the number of rows we want to see.
Let's take the last query we made as an example. If we want to see only the top 5 countries in terms of number of vaccinations on March 1st, we do that by setting the limit to 5.

In [None]:
query = """
    select 
        country,
        date,
        total_vaccinations
    from covid_vacs
    WHERE date = "2021-03-01"
    ORDER BY total_vaccinations desc
    LIMIT 5
    
"""
ps.sqldf(query)


If we wanted to see the 5 countries that have vaccinated the least, we would have to change the ORDER BY to ascending.

In [None]:
query = """
    select 
        country,
        date,
        total_vaccinations
    from covid_vacs
    WHERE date = "2021-03-01"
    ORDER BY total_vaccinations asc
    LIMIT 5
    
"""
ps.sqldf(query)


By March 1st, it's clear some countries had not kicked off their vaccination campaigns. Let's look at countries which had at least one vaccination.

In [None]:
query = """
    select 
        country,
        date,
        total_vaccinations
    from covid_vacs
    WHERE date = "2021-03-01"
    AND total_vaccinations >= 1
    ORDER BY total_vaccinations asc
    LIMIT 5
    
"""
ps.sqldf(query)
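
Missing values are one possible reason for blank rows at the top of an ascending sort. Assuming missing values in the DataFrame arrive in SQLite as NULL, SQLite places NULL before any number when sorting in ascending order, and a comparison such as `total_vaccinations >= 1` is never true for NULL. The sketch below shows both effects with Python's built-in `sqlite3` module; the country names and figures are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE covid_vacs (country TEXT, total_vaccinations REAL)")
# Hypothetical countries and figures, for illustration only.
conn.executemany(
    "INSERT INTO covid_vacs VALUES (?, ?)",
    [
        ("Country A", None),  # no figure reported, stored as NULL
        ("Country B", 0.0),
        ("Country C", 5.0),
        ("Country D", 50.0),
    ],
)

# Ascending sort: the NULL row comes first, then the zero.
lowest_two = conn.execute(
    """
    SELECT country, total_vaccinations
    FROM covid_vacs
    ORDER BY total_vaccinations ASC
    LIMIT 2
    """
).fetchall()

# The >= 1 filter removes both the zero row and the NULL row.
lowest_two_started = conn.execute(
    """
    SELECT country, total_vaccinations
    FROM covid_vacs
    WHERE total_vaccinations >= 1
    ORDER BY total_vaccinations ASC
    LIMIT 2
    """
).fetchall()

print(lowest_two)          # [('Country A', None), ('Country B', 0.0)]
print(lowest_two_started)  # [('Country C', 5.0), ('Country D', 50.0)]
```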

# AGGREGATIONS

Now that we've covered SELECT, FROM and some other clauses that allow us to decide what data we want to see, we can move on to other functionality that can help us summarize our data and answer questions that involve mathematical functions.


**COUNT**

Count refers simply to counting the number of rows in a dataset. To use count or any other aggregation function, we add it as if it were a column name within our SELECT statement.

TIP: You can then add a name to your column by stating "as <variable_name>" at the end of the statement.

Let's start off by counting the number of rows in our dataset.

In [None]:
query = """
    select 
        COUNT(*) as n_rows
    from covid_vacs
    
"""
ps.sqldf(query)
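
A detail worth knowing: `COUNT(*)` counts every row, while `COUNT(column_name)` counts only the rows where that column is not NULL. The sketch below illustrates this with Python's built-in `sqlite3` module; the country names and figures are made up, and `n_reported` is just an illustrative alias.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE covid_vacs (country TEXT, total_vaccinations REAL)")
# Hypothetical countries and figures, for illustration only.
conn.executemany(
    "INSERT INTO covid_vacs VALUES (?, ?)",
    [
        ("Country A", None),  # missing value, stored as NULL
        ("Country B", 0.0),
        ("Country C", 5.0),
        ("Country D", 50.0),
    ],
)

n_rows, n_reported = conn.execute(
    """
    SELECT
        COUNT(*) AS n_rows,
        COUNT(total_vaccinations) AS n_reported
    FROM covid_vacs
    """
).fetchone()

print(n_rows)      # 4
print(n_reported)  # 3
```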

**COUNT DISTINCT**

Now let's say we wanted to count the unique values within a certain column. To do this, we put DISTINCT followed by the name of the column inside COUNT.

In [None]:
# COUNTING DISTINCT COUNTRIES
query = """
    select 
        COUNT(DISTINCT country) as n_distinct_countries
    from covid_vacs
    
"""
ps.sqldf(query)

In [None]:
# COUNTING DISTINCT DATES
query = """
    select 
        COUNT(DISTINCT date) as n_distinct_dates
    from covid_vacs
    
"""
ps.sqldf(query)

We can combine both statements into one query to see the number of distinct countries and dates in the same result. Notice how we ALWAYS have a comma separating each column, whether it already existed in the dataset or not.

In [None]:
query = """
    select 
        COUNT(DISTINCT date) as n_distinct_dates,
        COUNT(DISTINCT country) as n_distinct_countries

    from covid_vacs
    
"""
ps.sqldf(query)

If we applied a country filter to this same query, our results would automatically change. Let's see what would happen if we indicate we want to count the number of distinct dates for the US and the UK.

In [None]:
query = """
    select 
        COUNT(DISTINCT date) as n_distinct_dates,
        COUNT(DISTINCT country) as n_distinct_countries

    from covid_vacs
    where country IN ("United States","United Kingdom")
    
"""
ps.sqldf(query)
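
To see how a filter changes distinct counts on data small enough to check by eye, here is a sketch with Python's built-in `sqlite3` module. The six rows are made up for illustration and are not taken from the real dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE covid_vacs (country TEXT, date TEXT)")
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO covid_vacs VALUES (?, ?)",
    [
        ("United States", "2021-03-01"),
        ("United States", "2021-03-02"),
        ("United States", "2021-03-03"),
        ("United Kingdom", "2021-03-01"),
        ("United Kingdom", "2021-03-02"),
        ("Canada", "2021-03-04"),
    ],
)

# Whole table: 4 different dates across 3 different countries.
overall = conn.execute(
    """
    SELECT COUNT(DISTINCT date), COUNT(DISTINCT country)
    FROM covid_vacs
    """
).fetchone()

# Only the US and UK rows: Canada and its date drop out of both counts.
filtered = conn.execute(
    """
    SELECT COUNT(DISTINCT date), COUNT(DISTINCT country)
    FROM covid_vacs
    WHERE country IN ('United States', 'United Kingdom')
    """
).fetchone()

print(overall)   # (4, 3)
print(filtered)  # (3, 2)
```

The WHERE clause is applied before the counting happens, which is why both numbers shrink.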

**SUM**

Now let's review other, more familiar mathematical functions. The easiest one to start with is addition. In this case we use SUM as an aggregation function to sum up the values in a specific column.

Let's revisit our query that looked at the number of vaccinations administered in different countries as of March 1st. Let's assume a follow-up question to this was: "How many vaccinations had been administered in the world on March 1st?"

To do this, we have to add up the "total_vaccinations" column, keeping March 1st as the only date we want to see.

In [None]:
query = """
    select 
        SUM(total_vaccinations)

    from covid_vacs
    where date = "2021-03-01"
    
"""
ps.sqldf(query)

According to the data, around 224M vaccinations had been administered globally by March 1st. We can continue to filter using different attributes, for example to get the number of vaccinations administered in the US.

In [None]:
query = """
    select 
        SUM(total_vaccinations)

    from covid_vacs
    where date = "2021-03-01"
    AND country = "United States"
    
"""
ps.sqldf(query)

Now, we know that the total vaccinations column gives us the cumulative number of vaccinations that have been administered to date for any country. If we wanted to calculate the number of vaccinations administered over a specific period, we could also use the SUM function. However, in this case we'll use the daily vaccinations column.

In this case, we'll add up the number of vaccinations administered in the first 10 days of March.

In [None]:
query = """
    select 
        SUM(daily_vaccinations)

    from covid_vacs
    where date >= "2021-03-01" AND date <= "2021-03-10"
    AND country = "United States"
    
"""
ps.sqldf(query)
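
To see why the daily column is the right one to SUM across several dates, here is a sketch with Python's built-in `sqlite3` module. The three rows are made up so that the cumulative column grows by exactly the daily amount each day; they are not real figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE covid_vacs (
        country TEXT, date TEXT,
        total_vaccinations REAL, daily_vaccinations REAL
    )
    """
)
# Made-up figures: the running total grows by the daily amount each day.
conn.executemany(
    "INSERT INTO covid_vacs VALUES (?, ?, ?, ?)",
    [
        ("United States", "2021-03-01", 100.0, 20.0),
        ("United States", "2021-03-02", 130.0, 30.0),
        ("United States", "2021-03-03", 170.0, 40.0),
    ],
)

daily_total, cumulative_sum = conn.execute(
    """
    SELECT SUM(daily_vaccinations), SUM(total_vaccinations)
    FROM covid_vacs
    WHERE country = 'United States'
    """
).fetchone()

print(daily_total)     # 90.0, the doses given over the three days
print(cumulative_sum)  # 400.0, running totals added together, not meaningful
```

Adding up a cumulative column across dates counts the earlier doses again on every later day, which is why it only makes sense to SUM it for a single date.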

**AVERAGE**

Let's assume now that we are not so much interested in the total number of vaccinations administered over a specific period of time, but rather want to understand the average daily vaccinations for the US over the same period of time.

In this case, we would now use the AVG aggregation function in order to make this calculation. In this example, I have also applied the ROUND function in order to end up with a whole number instead of something with decimals.

In [None]:
query = """
    select 
        ROUND(AVG(daily_vaccinations)) as avg_daily_vacs

    from covid_vacs
    where date >= "2021-03-01" AND date <= "2021-03-10"
    AND country = "United States"
    
"""
ps.sqldf(query)
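
In SQLite, `ROUND` with a single argument removes the decimals but still returns a floating-point value, so the result may display with a trailing `.0`. If a true integer is wanted, one option is to wrap it in `CAST(... AS INTEGER)`. The sketch below uses Python's built-in `sqlite3` module with made-up figures; `avg_as_integer` is just an illustrative alias.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE covid_vacs (date TEXT, daily_vaccinations REAL)")
# Made-up figures for illustration only; their average is 11.33...
conn.executemany(
    "INSERT INTO covid_vacs VALUES (?, ?)",
    [
        ("2021-03-01", 10.0),
        ("2021-03-02", 11.0),
        ("2021-03-03", 13.0),
    ],
)

avg_rounded, avg_as_integer = conn.execute(
    """
    SELECT
        ROUND(AVG(daily_vaccinations)) AS avg_daily_vacs,
        CAST(ROUND(AVG(daily_vaccinations)) AS INTEGER) AS avg_as_integer
    FROM covid_vacs
    """
).fetchone()

print(avg_rounded)     # 11.0
print(avg_as_integer)  # 11
```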

Again, we can combine several aggregations to calculate several metrics in a single query.

In [None]:
query = """
    select 
        SUM(daily_vaccinations) as n_vaccines,
        ROUND(AVG(daily_vaccinations)) as avg_daily_vacs

    from covid_vacs
    where date >= "2021-03-01" AND date <= "2021-03-10"
    AND country = "United States"
    
"""
ps.sqldf(query)

Now let's see the same results for the UK.

In [None]:
query = """
    select 
        SUM(daily_vaccinations) as n_vaccines,
        ROUND(AVG(daily_vaccinations)) as avg_daily_vacs

    from covid_vacs
    where date >= "2021-03-01" AND date <= "2021-03-10"
    AND country = "United Kingdom"
    
"""
ps.sqldf(query)

**GROUPING**

Let's take a look at the last query we ran. Let's assume that we wanted to transform it into a query with 3 columns:
1. country name
2. number of vaccines administered during the first 10 days of March
3. average number of vaccines administered daily during the first 10 days of March

Now, we may be inclined to just add the UK into the WHERE clause, but this would give us the summarized values for both countries together.

What we want to do instead is to SUMMARIZE by country.

To do this, we use grouping. In this case, we only need to do two things:
1. add the country name as a column we want to see under the SELECT statement, and
2. add a GROUP BY clause at the end of the query, indicating we would like to view the aggregations by country.

In [None]:
query = """
    select
        country,
        SUM(daily_vaccinations) as n_vaccines,
        ROUND(AVG(daily_vaccinations)) as avg_daily_vacs

    from covid_vacs
    where date >= "2021-03-01" AND date <= "2021-03-10"
    AND country IN ("United States","United Kingdom")
    
    GROUP BY country
    
"""
ps.sqldf(query)
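
To make the effect of GROUP BY easy to verify by hand, here is a sketch with Python's built-in `sqlite3` module on a made-up five-row table (illustration figures only). It runs the same aggregations once without grouping and once grouped by country.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE covid_vacs (country TEXT, date TEXT, daily_vaccinations REAL)"
)
# Made-up figures for illustration only.
conn.executemany(
    "INSERT INTO covid_vacs VALUES (?, ?, ?)",
    [
        ("United States", "2021-03-01", 20.0),
        ("United States", "2021-03-02", 30.0),
        ("United States", "2021-03-03", 40.0),
        ("United Kingdom", "2021-03-01", 10.0),
        ("United Kingdom", "2021-03-02", 20.0),
    ],
)

# Without GROUP BY: one row that mixes both countries together.
combined = conn.execute(
    """
    SELECT
        SUM(daily_vaccinations) AS n_vaccines,
        ROUND(AVG(daily_vaccinations)) AS avg_daily_vacs
    FROM covid_vacs
    """
).fetchone()

# With GROUP BY: one row per country.
grouped = conn.execute(
    """
    SELECT
        country,
        SUM(daily_vaccinations) AS n_vaccines,
        ROUND(AVG(daily_vaccinations)) AS avg_daily_vacs
    FROM covid_vacs
    GROUP BY country
    """
).fetchall()

# Index the grouped rows by country so the order of rows does not matter.
by_country = {country: (n, avg) for country, n, avg in grouped}

print(combined)                      # (120.0, 24.0)
print(by_country["United States"])   # (90.0, 30.0)
print(by_country["United Kingdom"])  # (30.0, 15.0)
```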