<a href="https://colab.research.google.com/github/ratfarts/datasciencecoursera/blob/master/Copy_of_Getting_Started_in_SQL_live_Student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<p align="center">
<img src="https://github.com/KelseyMcNeillie/datacampgettingstartedinsql/blob/master/images/regular.png?raw=true" width="50%">
</p>
<br><br> 

# **Getting Started in SQL Live Training**

Welcome to the DataCamp "Getting Started in SQL Live Training", where we will cover the basics of SQL queries through a hands-on coding session. You will learn how to query, filter, and aggregate data to answer real-world business questions.

In today's notebook, you will learn:

* How to translate business questions into powerful SQL queries
* How to use `SELECT` to return the exact data you need
* How to use `WHERE` to filter your returned data
* How to create powerful aggregations to summarize and analyze your data
* How to organize your data using `ORDER BY` and `LIMIT`


In today's session, you will be taking on the role of a data analyst for a prestigious country club called **Pinebrook**. 

The club has seen a significant increase in membership cancellations over the past few years. Management has asked us to create a report summarizing the membership profile to try and understand what's driving these cancellations.


# **The Dataset**

This data is taken from a .csv file called "membership_clean". As today's session will focus on creating queries, the data has already been cleaned for you. It contains the following columns. 

`id`: unique identifier for a member

`last_name`: the member’s last name

`marital_status`: member’s marital status

`gender`: member’s reported gender

`annual_income`: how much the member makes a year

`industry`: the industry sector a member works in 

`zip_code`: where the member lives

`age_at_issue`: how old the member was when they joined

`member_type`: which membership tier the member belongs to

`add_members`: number of additional members on the account

`annual_fee`: the cost of membership

`payment`: membership payment plan 

`status`: active membership vs. cancellation

`start_month/start_day/start_year`: date joined

`end_month/end_day/end_year`: date ended (if applicable)




# **Setting Up PostgreSQL**

In [0]:
%%capture
# This block of code will install PostgreSQL
!wget -qO- https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add -
!echo "deb http://apt.postgresql.org/pub/repos/apt/ bionic-pgdg main" >/etc/apt/sources.list.d/pgdg.list
!apt -qq update
!apt -yq install postgresql-12 postgresql-client-12
!service postgresql start
# make calling psql shorter
!sudo -u postgres psql -c "CREATE USER root WITH SUPERUSER"  
!psql postgres -c "CREATE DATABASE root"  # now just !psql -c "..."
# load SQL extensions
%load_ext sql
%config SqlMagic.feedback=False 
%config SqlMagic.autopandas=True
%sql postgresql+psycopg2://@/postgres

In [0]:
# This will download your data to local environment
!wget -q https://raw.githubusercontent.com/datacamp/getting-started-in-sql-live-session/master/data/membership.csv

In [0]:
%%sql
-- This will create your table
DROP TABLE IF EXISTS membership;
CREATE TABLE membership (
 id varchar(50) primary key,
 last_name varchar(50),
 marital_status varchar(50),
 gender varchar(50),
 annual_income INT,
 industry varchar(50),
 zip_code INT,
 age_at_issue INT,
 member_type varchar(50),
 add_members INT,
 annual_fee INT,
 payment VARCHAR(50),
 status VARCHAR(50),
 start_month INT,
 start_day INT,
 start_year INT,
 end_month INT,
 end_day INT,
 end_year INT
);
COPY membership
FROM '/content/membership.csv' DELIMITER ',' CSV HEADER;

 * postgresql+psycopg2://@/postgres


# **Understanding the Membership - Part I**

The first step in exploring a new dataset is to view it in its entirety. This lets you explore the columns and data types within.

Here are some important statements to remember when querying your dataset. 


*   `SELECT`: returns either all columns using `*` or specific columns as specified, separated by commas. Example: `SELECT this_column, that_column`
*   `FROM`: specifies the table that the data should be returned from. Example: `FROM table`
*   `ORDER BY`: returns the data sorted by the column specified. Can be sorted in `ASC` (ascending) or `DESC` (descending) order. Example: `ORDER BY this_column ASC`
*   `LIMIT`: limits the number of rows returned. Example: `LIMIT 10` will only return ten rows
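
As a rough illustration of how these clauses fit together, here is a minimal sketch that runs a query from Python with the built-in `sqlite3` module rather than this notebook's PostgreSQL setup. The `toy_members` table and its four rows are made up for the example; they are not the Pinebrook data.

```python
import sqlite3

# Hypothetical toy table, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_members (id INTEGER, last_name TEXT, annual_income INTEGER)"
)
conn.executemany(
    "INSERT INTO toy_members VALUES (?, ?, ?)",
    [(1, "Adams", 52000), (2, "Baker", 91000), (3, "Clark", 47000), (4, "Diaz", 78000)],
)

# Return two columns only, sorted from highest to lowest income, top two rows.
top_two = conn.execute(
    """
    SELECT last_name, annual_income
    FROM toy_members
    ORDER BY annual_income DESC
    LIMIT 2
    """
).fetchall()
print(top_two)  # [('Baker', 91000), ('Diaz', 78000)]
```

Swapping `DESC` for `ASC` would return the two lowest incomes instead. In a `%%sql` cell of this notebook, only the SQL inside the triple quotes is needed.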

### **Code Commenting**
There are two types of code comments in Postgres:

`--` Inline comment: used for quick, short notes

`/* ... */` Multi-line comment: used for longer comments, such as metadata or code headers that include the author, date, purpose, etc.


# **Exploring the Dataset**

In [0]:
%%sql

-- View the entire dataset


After you have viewed the entire dataset, you may want to only view certain columns. 

Use the `SELECT` statement to isolate only the columns needed.

In [0]:
%%sql

-- Show the membership demographic information 


In addition to reviewing specific columns, you may want to order and arrange the data in a simple way. 

Use `ORDER BY` and `LIMIT` to organize and arrange data. 

In [0]:
%%sql

-- Show the highest and lowest earners in the dataset



In [0]:
%%sql

-- Show only the oldest members by the age at which they joined



# **What Have We Learned About the Membership?**

So far, we have discovered:


*   There are 7,275 members in the dataset
*   The top earners have an annual income of 119,996 dollars
*   The lowest earners have an annual income of 35,002 dollars
*   The oldest members are either 78 or 77 years old





# **Q&A**

# **Understanding the Membership - Part II**


## **Filtering on Rows**
After selecting the specific columns needed, you may also need to filter on the rows of your table.

Here are some of the operators you can use to filter your data.

* `=`: Matches rows where the column's value equals the criteria exactly
* `>` and `<`: Match rows where the value is higher or lower than the specified criteria (`>=` and `<=` also include the criteria itself)
* `BETWEEN`: Filters on a range of values. Includes both of the values used.
* `AND`: Requires that both filtering conditions are met
* `OR`: Requires that the row meets at least one of the specified conditions
* `IN`: Specifies a list of values to match against
* `LIKE`: Matches a pattern when the exact term is unknown, using `%` for any run of characters and `_` for a single character
* `ILIKE`: Similar to `LIKE`, but case-insensitive

### Using `!=` and `NOT`
You can also filter rows based on what they do NOT include using the following:

*   `!=`: Indicates that the returned rows should NOT equal 'X' (`<>` means the same thing)
*   `NOT`: Used with `LIKE` and `ILIKE` to indicate the returned rows should include everything BUT those values
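
To make the operators above concrete, the sketch below applies several of them to a small made-up table. It uses Python's built-in `sqlite3` module instead of PostgreSQL, and the `toy_members` table, its rows, and the `ids` helper are all assumptions introduced for illustration.

```python
import sqlite3

# Hypothetical toy table, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE toy_members (
        id INTEGER, annual_income INTEGER, member_type TEXT,
        status TEXT, start_year INTEGER, zip_code INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO toy_members VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 52000, "Gold", "Active", 2008, 80202),
        (2, 91000, "Silver", "Cancelled", 2010, 80210),
        (3, 47000, "Bronze", "Cancelled", 2011, 80301),
        (4, 78000, "Gold", "Active", 2013, 80206),
    ],
)


def ids(condition):
    """Return the ids of the rows that satisfy a WHERE condition."""
    sql = f"SELECT id FROM toy_members WHERE {condition} ORDER BY id"
    return [row[0] for row in conn.execute(sql)]


equals = ids("status = 'Cancelled'")                                # [2, 3]
greater = ids("annual_income > 50000")                              # [1, 2, 4]
between = ids("start_year BETWEEN 2008 AND 2010")                   # [1, 2]
both = ids("annual_income > 50000 AND member_type = 'Gold'")        # [1, 4]
either = ids("annual_income > 80000 OR member_type = 'Bronze'")     # [2, 3]
in_list = ids("zip_code IN (80206, 80301)")                         # [3, 4]
not_equal = ids("member_type != 'Gold'")                            # [2, 3]
print(equals, greater, between, both, either, in_list, not_equal)
```

Note that `BETWEEN 2008 AND 2010` includes both end years, which is why the member who started in 2008 is returned.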





## **Filtering Using** `=`, `>`, `<`, `AND`, `OR`, **and** `BETWEEN` 





Filter using `=` 

In [0]:
%%sql

-- Find only members who have cancelled their membership



Filter using `>` 

In [0]:
%%sql

-- Find members who earn more than $50,000 a year



Filter using `<=`

In [0]:
%%sql

-- Find members age 40 or younger 


Filter using `BETWEEN`

In [0]:
%%sql

-- Find members who joined between 2009 and 2011


Filter using `>`, `AND`

In [0]:
%%sql

-- Find members who both earn above $75,000 a year, and have a Gold Membership 



Filter Using `>`, `OR`

In [0]:
%%sql

-- Find members who either earn over $80,000 a year, or who have a Silver Membership 

Filter using `IN()`

In [0]:
%%sql

-- Find members who live in the zip codes 80202, 80210, and 80206


Filter using `IN`, `AND`, and `OR`, with parentheses to control the order of evaluation

In [0]:
%%sql

-- Identify the Gold members who are either single or divorced, or who work in Health Care
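
When `AND` and `OR` appear in the same `WHERE` clause, SQL evaluates `AND` first unless parentheses say otherwise. The sketch below shows how much that can change a result. It runs on Python's built-in `sqlite3` module with a made-up `toy_members` table, so the rows and the `ids` helper are assumptions for illustration only.

```python
import sqlite3

# Hypothetical toy table, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_members (id INTEGER, member_type TEXT, marital_status TEXT, industry TEXT)"
)
conn.executemany(
    "INSERT INTO toy_members VALUES (?, ?, ?, ?)",
    [
        (1, "Gold", "Single", "Finance"),
        (2, "Gold", "Married", "Health Care"),
        (3, "Silver", "Divorced", "Retail"),
        (4, "Silver", "Married", "Health Care"),
        (5, "Gold", "Married", "Retail"),
    ],
)


def ids(condition):
    """Return the ids of the rows that satisfy a WHERE condition."""
    sql = f"SELECT id FROM toy_members WHERE {condition} ORDER BY id"
    return [row[0] for row in conn.execute(sql)]


# No parentheses: read as (Gold AND single/divorced) OR Health Care,
# so the Silver member in Health Care (id 4) is returned as well.
without_parens = ids(
    "member_type = 'Gold' AND marital_status IN ('Single', 'Divorced') "
    "OR industry = 'Health Care'"
)

# Parentheses: Gold AND (single/divorced OR Health Care), Gold members only.
with_parens = ids(
    "member_type = 'Gold' AND (marital_status IN ('Single', 'Divorced') "
    "OR industry = 'Health Care')"
)
print(without_parens)  # [1, 2, 4]
print(with_parens)     # [1, 2]
```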


## **Filtering with** `LIKE` **and** `ILIKE` 

There are several situations where you may need to use `LIKE` or `ILIKE`. 



*   When the criteria to be filtered on is only partially known
*   When the criteria to be filtered on needs to capture a range of values, but too many to list in an `IN` clause

*Remember:* `LIKE` *is case-sensitive,* `ILIKE` *is not. For this reason, it is common in Postgres to simply use* `ILIKE` *when filtering.*
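
The wildcard patterns can be tried out on a tiny made-up table, as in the sketch below. One caveat: it uses Python's built-in `sqlite3` module, and SQLite has no `ILIKE`. Its `LIKE` ignores case for ASCII letters by default, so here it behaves like PostgreSQL's `ILIKE`. In PostgreSQL, `LIKE 'A%'` would match only `Adams`, while `ILIKE 'A%'` would also match `adler`. The `toy_members` table and the `names` helper are assumptions for illustration.

```python
import sqlite3

# Hypothetical toy table, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_members (id INTEGER, last_name TEXT)")
conn.executemany(
    "INSERT INTO toy_members VALUES (?, ?)",
    [(1, "Adams"), (2, "adler"), (3, "Baker"), (4, "Banks")],
)


def names(condition):
    """Return the last names that satisfy a WHERE condition."""
    sql = f"SELECT last_name FROM toy_members WHERE {condition} ORDER BY id"
    return [row[0] for row in conn.execute(sql)]


# % matches any run of characters (including none); _ matches exactly one.
starts_with_a = names("last_name LIKE 'A%'")        # ['Adams', 'adler']
not_starting_b = names("last_name NOT LIKE 'B%'")   # ['Adams', 'adler']
ends_with_er = names("last_name LIKE '%er'")        # ['adler', 'Baker']
second_letter_a = names("last_name LIKE '_a%'")     # ['Baker', 'Banks']
print(starts_with_a, not_starting_b, ends_with_er, second_letter_a)
```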



Filter using `LIKE` and `NOT LIKE`

In [0]:
%%sql

-- Find members who have last names starting with the letter 'A'


In [0]:
%%sql

-- Find members who have last names that do NOT start with the letter 'B'



Filter using `ILIKE` and `NOT ILIKE`

In [0]:
%%sql

-- Find members who make annual or semi-annual membership payments



In [0]:
%%sql

-- Find members who do NOT work in real estate



# **What Have We Learned About the Membership?** 


*   2,810 people have cancelled their memberships (39%)
*   5,961 earn over 50,000 a year (82%)
*   3,108 are age 40 or below (43%)
*   3,890 joined between 2009 and 2011 (53%)








# **Q&A** 

# **Understanding the Membership - Part III**

In addition to returning columns and filtering on rows, SQL can be used to create aggregations of data, similar to certain functions in Excel.

## **Aggregating Data**

Here are some of the functions used to aggregate data in SQL across a single column.

* `SUM`: Adds up all the numerical values in a column
* `COUNT`: Counts the non-NULL values in a column, string or numerical (`COUNT(*)` counts every row)
* `AVG`: Adds up all the numerical values in a column and then divides by the number of non-NULL values
* `MIN`: Finds the smallest numerical value in a column
* `MAX`: Finds the largest numerical value in a column
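
The five functions can be compared side by side on a small made-up column, as sketched below with Python's built-in `sqlite3` module. The `toy_members` table is an assumption for illustration, and one fee is deliberately left missing (NULL) to show how `COUNT(*)` differs from `COUNT(annual_fee)`.

```python
import sqlite3

# Hypothetical toy table, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_members (id INTEGER, annual_fee INTEGER)")
conn.executemany(
    "INSERT INTO toy_members VALUES (?, ?)",
    [(1, 1000), (2, 2500), (3, 4000), (4, 2500), (5, None)],  # None becomes NULL
)

summary = conn.execute(
    """
    SELECT COUNT(*)          AS all_rows,      -- counts every row
           COUNT(annual_fee) AS rows_with_fee, -- skips the NULL fee
           SUM(annual_fee)   AS total_fees,
           AVG(annual_fee)   AS avg_fee,       -- NULL is ignored here too
           MIN(annual_fee)   AS lowest_fee,
           MAX(annual_fee)   AS highest_fee
    FROM toy_members
    """
).fetchone()
print(summary)  # (5, 4, 10000, 2500.0, 1000, 4000)
```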



### **Aliases**

Sometimes it is useful in SQL to rename a column; this is called giving it an 'alias'. The syntax is simple, but remember that unquoted column names cannot contain spaces, so use an underscore to separate words.


Example:
```
SELECT avg(annual_income) AS Avg_income

FROM membership 
```
Aliases are used to create cleaner outputs, and are useful in more complex SQL queries, such as when joining tables. 


Aggregate using `SUM()`

In [0]:
%%sql

-- Find the sum of all annual fees for members who have cancelled



Aggregate using `COUNT()`

In [0]:
%%sql

-- Find how many male members have cancelled their memberships 

Aggregate using `AVG()`

In [0]:
%%sql

-- Find the average annual income of cancelled members 


Aggregate using `MIN()`

In [0]:
%%sql

-- Find the youngest member 


Aggregate using `MAX()`

In [0]:
%%sql

-- Find the most additional members a member can have  


## **Aggregation in SQL with** `GROUP BY`

Similar to creating a pivot table in Excel, you can aggregate numerical values by a categorical variable. You can do this in SQL by using the `GROUP BY` statement. 

The `GROUP BY` statement must include the categorical column that the data will be aggregated by. If the query has a `WHERE` statement, `GROUP BY` must be written after it.


Example:
```
SELECT this_column, sum(that_column)
FROM table
WHERE other_column = 'value'
GROUP BY this_column
```

Remember: Every column in the `SELECT` statement that is not inside an aggregate function must also appear in the `GROUP BY` statement.
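
Here is a runnable version of that pattern, sketched with Python's built-in `sqlite3` module on a made-up `toy_members` table (the rows are assumptions, not the Pinebrook data). `WHERE` keeps only the cancelled rows, and `GROUP BY` then produces one summary row per membership type.

```python
import sqlite3

# Hypothetical toy table, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_members (member_type TEXT, status TEXT, annual_fee INTEGER)"
)
conn.executemany(
    "INSERT INTO toy_members VALUES (?, ?, ?)",
    [
        ("Gold", "Cancelled", 4000),
        ("Gold", "Active", 4000),
        ("Silver", "Cancelled", 2500),
        ("Silver", "Cancelled", 2500),
        ("Bronze", "Active", 1000),
    ],
)

by_type = conn.execute(
    """
    SELECT member_type,
           COUNT(*)        AS cancelled_members,
           SUM(annual_fee) AS lost_fees
    FROM toy_members
    WHERE status = 'Cancelled'        -- filter rows before grouping
    GROUP BY member_type              -- one result row per member_type
    ORDER BY cancelled_members DESC
    """
).fetchall()
print(by_type)  # [('Silver', 2, 5000), ('Gold', 1, 4000)]
```

Bronze does not appear in the result because its only row was removed by the `WHERE` filter before grouping.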

### **Aggregate Data with** `GROUP BY`

In [0]:
%%sql

-- Find the marital status with the most total members



In [0]:
%%sql

-- Find the membership type with the fewest cancellations, and what their average annual income is



In [0]:
%%sql

-- Find the average income, the sum of annual fees, and the count of members across industries for cancelled members



## **Filtering on aggregated data with** `HAVING`

In order to filter on aggregated data, you will need to use the `HAVING` statement. This goes after the `GROUP BY` statement.

Example: 

```
SELECT this_column, sum(that_column)
FROM table
GROUP BY this_column
HAVING sum(that_column) > 100
```
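
The difference between `WHERE` and `HAVING` is easiest to see when both appear in one query. The sketch below uses Python's built-in `sqlite3` module and a made-up `toy_members` table (an assumption for illustration). `WHERE` filters individual rows before grouping, and `HAVING` filters the grouped results afterwards.

```python
import sqlite3

# Hypothetical toy table, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_members (start_year INTEGER, member_type TEXT)")
conn.executemany(
    "INSERT INTO toy_members VALUES (?, ?)",
    [
        (2009, "Bronze"), (2009, "Bronze"), (2009, "Gold"),
        (2010, "Bronze"),
        (2011, "Bronze"), (2011, "Bronze"),
    ],
)

busy_years = conn.execute(
    """
    SELECT start_year, COUNT(*) AS bronze_joined
    FROM toy_members
    WHERE member_type = 'Bronze'   -- row-level filter, applied first
    GROUP BY start_year
    HAVING COUNT(*) > 1            -- group-level filter, applied after grouping
    ORDER BY start_year
    """
).fetchall()
print(busy_years)  # [(2009, 2), (2011, 2)]
```

The year 2010 is dropped by `HAVING` because only one Bronze member joined that year.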


Filter on Aggregated Data with `HAVING`

In [0]:
%%sql

-- How many Bronze members started by year, in years where more than 150 members joined? 


In [0]:
%%sql

/*What is the total of annual fees for people who have held membership for longer than a year, by start month,
for months where more than $100,000 in fees was collected? Which month is the most profitable?*/


In [0]:
%%sql

-- What is the average annual income of members who earn more than $50,000 who left after 2012 by marital status?


# **What Have We Learned So Far?**


*   Cancelled membership fees amount to a total loss of 25,162,000 dollars 
*   2,195 men have cancelled their memberships (78% of all cancelled memberships - but men also make up 76% of the total membership) 

*   82% of members are married, 15% are single, 2% are widowed, and 0.6% are divorced
*   Platinum members cancel the least and Gold members cancel the most. However, the difference between the cancellation numbers is small, and all membership types have approximately the same average income.

*   The average income, fees, and member counts are fairly evenly distributed across all industries













# **Q&A**

# **What Is Driving Cancellations?**

It's time to use aggregations to explore the data and find out why cancellation rates are increasing. 

We will introduce two more functions: `ROUND` and `DISTINCT` 


*   `ROUND`: Used to round a value to the specified number of decimals. `ROUND` is used in the `SELECT` statement and is wrapped around calculations. 

  Example: 
```
  SELECT ROUND(AVG(annual_income), 2)
  FROM membership
```
  Will return the value: 77366.51


*   `DISTINCT`: Used to indicate that only unique values should be returned, and duplicates eliminated. Used in the `SELECT` statement.

Example:
```
SELECT DISTINCT member_type
FROM membership 
```
Will return: Bronze, Silver, Gold, Platinum 
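
`ROUND` and `DISTINCT` can also be combined with the aggregate functions from earlier. The sketch below, which uses Python's built-in `sqlite3` module and a made-up `toy_members` table (an assumption for illustration), rounds an average and uses `COUNT(DISTINCT ...)` to count unique values rather than rows.

```python
import sqlite3

# Hypothetical toy table, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_members (member_type TEXT, zip_code INTEGER, annual_income INTEGER)"
)
conn.executemany(
    "INSERT INTO toy_members VALUES (?, ?, ?)",
    [("Gold", 80202, 50000), ("Silver", 80202, 60001), ("Gold", 80210, 70001)],
)

# The average is 60000.666..., rounded here to two decimal places.
avg_income = conn.execute(
    "SELECT ROUND(AVG(annual_income), 2) FROM toy_members"
).fetchone()[0]

# DISTINCT removes duplicate values from the returned rows.
member_types = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT member_type FROM toy_members ORDER BY member_type"
    )
]

# COUNT(DISTINCT ...) counts each unique value once.
unique_zips = conn.execute(
    "SELECT COUNT(DISTINCT zip_code) FROM toy_members"
).fetchone()[0]

print(avg_income, member_types, unique_zips)
```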



ROUND and DISTINCT Practice

In [0]:
%%sql

-- Practice using the ROUND function to find the average annual income to two decimal places 


In [0]:
%%sql

-- Practice using DISTINCT to find the count of unique ids by zip_code


## **Finding Answers**

So far, we have determined that cancellation rates do not seem to be affected by annual income, gender, industry, or membership types.

What other factors could be driving the rise in cancellations?

*   Marital Status and children
*   Age at issue
*   Geographic location

Let's look at each of these through the lens of a potential business question. 



Does being unmarried with children affect cancellation rates? 

In [0]:
%%sql



Does age at joining affect cancellations? 

In [0]:
%%sql



Does industry affect cancellations among members who earn over $50,000?

In [0]:
%%sql


Is there any correlation between zip code and cancellations? 

In [0]:
%%sql



Create a report

In [0]:
%%sql

-- Compare the cancellation counts by zip code against other possible factors



# **Additional Research**

Sometimes, data analysis requires additional research!

Here we can see that the zip codes with the highest cancellation numbers are clustered close together. 


<p align="left">
<img src="https://github.com/KelseyMcNeillie/datacampgettingstartedinsql/blob/master/images/zipmap.png?raw=true" width="50%">
</p>
<br><br> 







# **Q&A**

# **Recap and Closing**