<a href="https://colab.research.google.com/github/JaimeAdele/APEX/blob/main/Module11_statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src='https://cdn.pixabay.com/photo/2015/04/15/14/55/calculator-723917_1280.jpg' width=700>  
Photo by Edar production from Pixabay

# APEX Faculty Training, Module 11: Statistics

Created by Valerie Carr and Jaime Zuspann  
Licensed under a Creative Commons license: CC BY-NC-SA  
Last updated: Mar 25, 2022  

**Learning outcomes**  
In this module you will learn how to:  
1. Use `Pandas` methods to calculate summary and descriptive statistics
2. Use the `stats` module within the `SciPy` library to perform a correlational analysis

## 1. A couple of notes before you start
* This file is view only, meaning that you can't edit it.
    * To create an editable copy, look towards the top of the notebook and click on `Copy to Drive`. This will cause a new tab to open with your own personal copy.
    * If you want to refer back to your copy in the future, you can find it in Google Drive in a folder called `Colab Notebooks`.
* To run a cell, use `shift` + `enter`.   
* Keep the following Python style preferences in mind:
    * Variable names should use `snake_case`
    * Include spaces before and after operators, e.g., `x + 1`
    * Don't put unnecessary spaces after a function name, before the parentheses
        * Correct: `print(my_variable)`
        * Incorrect: `print (my_variable)`
    * Don't put unnecessary spaces at the beginning or end of parentheses
        * Correct: `print(my_variable)`
        * Incorrect: `print( my_variable )`
        


<font color='red'>Exercise 1</font>  
Below, we have included code that imports the Pandas library and provides a path to a CSV file that you'll be using for the rest of the module. Your job is to add a couple new lines of code that:
* Read in the CSV file to create a dataframe named `rental_df`
* Check the header of the dataframe

If you need a reminder of how to perform these steps, look back at Module 8 (Pandas I).

In [None]:
import pandas as pd
filepath = "https://raw.githubusercontent.com/valeriecarr/engr120/main/S21/rental_prices.csv"

# read in CSV file

# display header


## 2. Summarizing and Describing Data
Before diving deeper into statistical analysis, it's a great idea to summarize a data set and calculate descriptive statistics. The `rental_df` dataframe that you created above includes median rental prices in 100 cities around the US.  

There are many ways to summarize or describe a data set, including:  
* Total number of observations
  - Ex: How many rows are in `rental_df`?
* Number of unique values in a given column
  - Ex: `rental_df` includes data from several cities within the same state; how many unique states are represented in the `state` column?
* Number of observations per unique value in a given column
  - Ex: How many times does each state appear in the `state` column?
* Measures of central tendency and variability for columns with numerical values
  - Ex: mean, median, mode, variance, standard deviation
* Min, max, and quantiles for columns with numerical values
  - Quantiles ex: 25th, 50th, 75th percentile 


### 2a. Pandas Methods

See below for a table that includes common Pandas methods for summarizing and describing data:


<table>
  <tr>
    <th>Method</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>df_name['col_name'].count()</td>
    <td>Number of observations for a given column</td>
  </tr>
  <tr>
    <td>df_name['col_name'].nunique()</td>
    <td>Number of unique values for a given column</td>
  </tr>
  <tr>
    <td>df_name['col_name'].value_counts()</td>
    <td>Number of observations per unique value in a given column</td>
  </tr>
  <tr>
    <td>df_name['col_name'].mean()</td>
    <td rowspan="3">Measures of central tendency for a given column</td>
  </tr>
  <tr>
    <td>df_name['col_name'].median()</td>
  </tr>
  <tr>
    <td>df_name['col_name'].mode()</td>
  </tr>
  <tr>
    <td>df_name['col_name'].var()</td>
    <td rowspan="2">Variance and standard deviation for a given column</td>
  </tr>
  <tr>
    <td>df_name['col_name'].std()</td>
  </tr>
  <tr>
    <td>df_name['col_name'].quantile([x, y, z])</td>
    <td>Quantiles for a given column, where x/y/z are values between 0 and 1</td>
  </tr>
  <tr>
    <td>df_name['col_name'].min()</td>
    <td rowspan="2">Minimum and Maximum values for a column</td>
  </tr>
  <tr>
    <td>df_name['col_name'].max()</td>
  </tr>
</table>
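
As a quick illustration of a few methods from the table, here is a minimal sketch on a small made-up dataframe. The name `toy_df` and its values are hypothetical and are not taken from the rental data set; only the column names mirror the ones used in this module.

```python
import pandas as pd

# Hypothetical toy data; NOT the rental data set used in the exercises
toy_df = pd.DataFrame({
    'state': ['CA', 'CA', 'TX', 'NY', 'TX', 'CA'],
    'median_rent': [2400, 2100, 1300, 2900, 1500, 2600],
})

n_obs = toy_df['state'].count()             # number of non-missing observations
n_states = toy_df['state'].nunique()        # number of distinct states
per_state = toy_df['state'].value_counts()  # observations per state
quartiles = toy_df['median_rent'].quantile([0.25, 0.5, 0.75])

print('Observations:', n_obs)
print('Unique states:', n_states)
print(per_state)
print(quartiles)
```

Note that `count()` skips missing values, so it can be smaller than the number of rows when a column has gaps.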


Note of interest for Stats instructors: By default, Pandas normalizes standard deviation and variance using N-1 (sample), rather than N (population). If you prefer N, you can pass the additional argument `ddof=0`, as described [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.var.html).
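
To make the N-1 versus N distinction concrete, the sketch below compares the two on a small made-up series. The values are hypothetical and were chosen only so that the population variance comes out to a round number.

```python
import pandas as pd

# Hypothetical toy values (mean is 5, sum of squared deviations is 32)
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_var = values.var()            # default ddof=1: divides by N - 1
population_var = values.var(ddof=0)  # ddof=0: divides by N
population_std = values.std(ddof=0)  # square root of the population variance

print('Sample variance:', sample_var)
print('Population variance:', population_var)
print('Population standard deviation:', population_std)
```

With 8 observations, the sample variance is 32 / 7 and the population variance is 32 / 8, so the gap between the two shrinks as the number of observations grows.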

<font color='red'>Exercise 2</font>  
Run each of the cells below to see examples of these methods applied to the `state` column of `rental_df`:

In [None]:
# number of observations in the state column
rental_df['state'].count()

In [None]:
# number of unique values in the state column
rental_df['state'].nunique()

In [None]:
# number of observations per unique value in the state column
rental_df['state'].value_counts()

<font color='red'>Exercise 3</font>  
Refer to the `quantile()` method in the table above to calculate the 25th, 50th, and 75th percentile values for the `median_rent` column. As a reminder, these percentiles should be listed as decimals, e.g., 0.25 rather than 25.

<font color='red'>Exercise 4</font>  
Below, we demonstrate how you might print descriptive statistics alongside a short text label, making the outputs easier to read. Run the cell to see the mean and median values for the `median_rent` column.

In [None]:
print('Mean:', rental_df['median_rent'].mean())
print('Median:', rental_df['median_rent'].median())

<font color='red'>Exercise 5</font>  
Now you try! Using the example code in Exercise 4 as a guide, print two strings, the first of which states the standard deviation, and the second of which states the maximum value of the `median_rent` column.

<font color='red'>Exercise 6</font>  
Computing these descriptive statistics one by one for each numerical column in a dataframe could be rather tedious. Instead, we can use the method `describe()` to compute several of these measures for *all* numerical columns using the following syntax:

`df_name.describe()` 

Below, write code that applies this method to the `rental_df` dataframe:

## 3. Grouping Data
Rather than view descriptive statistics for all observations in a dataframe, we can instead use the `groupby()` method to group the data by one or more variables of interest. For example, we could examine median rent prices according to number of bedrooms.

### 3a. Grouping by a Single Variable
The syntax for grouping the dataframe by a single variable is as follows:

`df_name.groupby(by = 'col_name').method_name()`

Breaking down this syntax:
* `by = 'col_name'` is used to indicate which variable (i.e., column) you'll be using to group the dataframe
* `method_name` simply indicates the desired descriptive statistic to display, such as mean, median, etc.
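
The syntax above can be sketched on a small made-up dataframe. The name `toy_df`, the column names `bedrooms` and `rent`, and all values are assumptions for illustration and may not match the columns in `rental_df`.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration
toy_df = pd.DataFrame({
    'state': ['CA', 'TX', 'CA', 'TX', 'NY'],
    'bedrooms': [1, 1, 2, 2, 2],
    'rent': [1000, 1200, 1500, 1700, 1900],
})

# Group rows by number of bedrooms, then average the numeric columns.
# numeric_only=True skips the text column ('state'), which has no mean.
mean_by_bedrooms = toy_df.groupby(by = 'bedrooms').mean(numeric_only=True)
print(mean_by_bedrooms)
```

In recent versions of Pandas, calling `.mean()` on a grouped dataframe that still contains text columns may raise an error. Passing `numeric_only=True`, as in this sketch, is one way to avoid that.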

<font color='red'>Exercise 7</font>  
Using the example syntax above, write code that groups the rental data according to number of bedrooms and displays the mean.

### 3b. Grouping by Multiple Variables
Now, what about grouping by more than one variable -- e.g., grouping by the number of bedrooms *and* state? The only change we'd need to make to the line of code above is to provide a list of columns rather than a single column name. The syntax looks like this:

`df_name.groupby(by = ['col1', 'col2', 'etc.']).method_name()`
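
As a minimal sketch of grouping by two columns, the example below again uses a made-up dataframe. The name `toy_df`, its column names, and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration
toy_df = pd.DataFrame({
    'state': ['CA', 'CA', 'CA', 'TX', 'TX', 'TX'],
    'bedrooms': [1, 1, 2, 1, 2, 2],
    'rent': [2000, 2200, 3000, 1000, 1400, 1600],
})

# One row of output per (state, bedrooms) combination
mean_by_group = toy_df.groupby(by = ['state', 'bedrooms']).mean()
print(mean_by_group)
```

The order of the list matters for how the output is organized: here the rows are grouped first by state and then, within each state, by number of bedrooms.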

<font color='red'>Exercise 8</font>  
Try this challenge for yourself -- group the `rental_df` dataframe by state and number of bedrooms.

## 4. Inferential statistics with `stats`
We've been using the `Pandas` library to subset, modify, and summarize dataframes. To perform inferential statistics, we'll introduce you to a new library, `SciPy`. This library gets its name from the conjunction of 'Science' and 'Python' and is pronounced like "sigh-pie." The library is enormous and serves many science-related purposes, so we'll only import a small portion of the library pertaining to statistics, as demonstrated below.

<font color='red'>Exercise 9</font>  
The cell below imports the `stats` module within the `scipy` library. Run the cell, keeping in mind that no output will be produced.

In [None]:
from scipy import stats

The `stats` module contains a wide array of functions pertaining to statistics, such as:
* Correlation
* Linear regression
* T-test
* Chi square
* Binomial test
* Tests of skew and kurtosis
* And many more! You can find a complete list of functions [here](https://docs.scipy.org/doc/scipy/reference/stats.html).  

To use a given function within the `stats` module, use the following syntax: `stats.func_name()` 

Breaking down this syntax, `stats` is simply the name of the module, and `func_name()` is a placeholder for the specific function you'd like to use. We'll keep things short and sweet for today's purposes and simply demonstrate how to use the `pearsonr()` function.

### 4a. Pearson's Correlation
The `pearsonr()` function measures the strength of the linear relationship between two data sets. The p-value it reports assumes that both data sets are normally distributed. This function uses the following syntax:

`stats.pearsonr(df_name['col1'], df_name['col2'])`

Breaking down the syntax:
* We're using the `stats` module and the `pearsonr()` function
* We need to specify two variables of interest (i.e., two columns in a dataframe)

The `pearsonr()` function produces two outputs: `r` and `p`, with `r` ranging from -1 to 1 and `p` ranging from 0 to 1. 

Two notes:
* The output doesn't specify which value is which, so you'll have to remember that `r` comes first, followed by `p`.
* If the resulting p-value is quite small, it's printed in scientific notation (e.g., 2.63e-11, which means 2.63 × 10^-11, or 0.0000000000263). Many students aren't familiar with this type of notation, so we recommend walking them through how to read it.
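
To illustrate both notes, here is a minimal sketch on made-up data. The dataframe `toy_df` and the columns `hours_studied` and `exam_score` are hypothetical and unrelated to the data sets in this module. Unpacking the result into two named variables is one way to avoid having to remember which value is which.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data; NOT one of the data sets used in this module
toy_df = pd.DataFrame({
    'hours_studied': [1, 2, 3, 4, 5, 6, 7, 8],
    'exam_score': [52, 55, 61, 64, 70, 74, 79, 85],
})

# Unpack the two outputs: r comes first, followed by p
r, p = stats.pearsonr(toy_df['hours_studied'], toy_df['exam_score'])

print(f'r = {r:.3f}')   # correlation coefficient, rounded to 3 decimals
print(f'p = {p:.2e}')   # p-value in scientific notation
print(f'p = {p:.10f}')  # the same p-value written out as a decimal
```

Printing the p-value both ways can help students connect the scientific notation to the decimal they're used to seeing.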

<font color='red'>Exercise 10</font>  
The rental data set we've used thus far isn't well-suited for a correlational analysis. Instead, we'll go back to the state population data set used in prior modules. We'll ask the question: Is there a significant correlation between a state's total population and its Hispanic population? 

Below, we've included the relevant filepath for you. Your job is to:
1. Read in the CSV file to create a dataframe called `states_df`
2. Use the `pearsonr()` function to examine the correlation between values in the `totalPop` and `hispPop` columns within this dataframe
3. Ask yourself whether the correlation is significant and, if so, whether it is positive or negative

In [None]:
filepath = 'https://raw.githubusercontent.com/valeriecarr/engr120/main/S21/state_pop.csv'

# read in CSV file

# run correlational analysis


## All done!
You've finished the Stats module! In the final module of this training course, we'll learn how to visualize the data we're working with.