<a href="https://colab.research.google.com/github/JaimeAdele/APEX/blob/main/Module11_statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src='https://cdn.pixabay.com/photo/2015/04/15/14/55/calculator-723917_1280.jpg' width=700>  
Photo by Edar production from Pixabay

# APEX Faculty Training, Module 11: Statistics

Created by Valerie Carr and Jaime Zuspann  
Licensed under a Creative Commons license: CC BY-NC-SA  
Last updated: Mar 21, 2022  

**Learning outcomes**  


## 1. A couple of notes before you start
* This file is view only, meaning that you can't edit it.
    * To create an editable copy, look towards the top of the notebook and click on `Copy to Drive`. This will cause a new tab to open with your own personal copy.
    * If you want to refer back to your copy in the future, you can find it in Google Drive in a folder called `Colab Notebooks`.
* To run a cell, use `shift` + `enter`.   
* Keep the following Python style preferences in mind:
    * Variable names should use `snake_case`
    * Include spaces before and after operators, e.g., `x + 1`
    * Don't put unnecessary spaces after a function name, before the parentheses
        * Correct: `print(my_variable)`
        * Incorrect: `print (my_variable)`
    * Don't put unnecessary spaces at the beginning or end of parentheses
        * Correct: `print(my_variable)`
        * Incorrect: `print( my_variable )`
        


<font color='red'>Exercise 1</font>  
Run the following cell to import the Pandas library and read in a CSV file to use throughout this module. Feel free to add a call to `rental_df.head()` to view the first few rows of this new dataframe.  

In [1]:
import pandas as pd
filepath = "https://raw.githubusercontent.com/valeriecarr/engr120/main/S21/rental_prices.csv"
rental_df = pd.read_csv(filepath)

## 2. Summarizing and Describing Data
Before performing statistical tests, it is often helpful to summarize a data set. We'll look at a data set relating to rental prices in 100 cities around the country.  

First, what does it mean to "summarize" a data set? There are many ways to summarize a data set, including:  
* Total number of observations
  - Essentially, how many rows are in the data set?
* Number of unique values
  - Ex: The dataset contains many different cities within the same state; how many unique states are represented?
* Number of observations per unique value
  - Ex: How many times does each state appear?
* Measures of central tendency and variability
  - mean, median, mode, variance, standard deviation
* Quantiles, min, and max
  - Quantiles: 25th, 50th, 75th percentile for example  


### 2a. Summary methods:

Here is a list of a few common methods for summarizing data:  

<table>
  <tr>
    <th>Method</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>df_name['col_name'].count()</td>
    <td>Number of observations for a given column</td>
  </tr>
  <tr>
    <td>df_name['col_name'].nunique()</td>
    <td>Number of unique values for a given column</td>
  </tr>
  <tr>
    <td>df_name['col_name'].value_counts()</td>
    <td>Number of observations per unique value in a given column</td>
  </tr>
  <tr>
    <td>df_name['col_name'].mean()</td>
    <td rowspan="3">Measures of central tendency for a given column</td>
  </tr>
  <tr>
    <td>df_name['col_name'].median()</td>
  </tr>
  <tr>
    <td>df_name['col_name'].mode()</td>
  </tr>
  <tr>
    <td>df_name['col_name'].var()</td>
    <td rowspan="2">Variance and standard deviation for a given column</td>
  </tr>
  <tr>
    <td>df_name['col_name'].std()</td>
  </tr>
  <tr>
    <td>df_name['col_name'].quantile([x, y, z])</td>
    <td>Quantiles for a given column, where x, y, and z are values between 0 and 1</td>
  </tr>
  <tr>
    <td>df_name['col_name'].min()</td>
    <td rowspan="2">Minimum and Maximum values for a column</td>
  </tr>
  <tr>
    <td>df_name['col_name'].max()</td>
  </tr>
</table>  
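
As a minimal sketch of the methods in the table, the cell below applies several of them to a small made-up dataframe. The dataframe `toy_df` and its columns `state` and `rent` are hypothetical and are not part of the rental data set. Note that `count()` counts only non-missing values.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy_df = pd.DataFrame({
    'state': ['CA', 'CA', 'TX', 'NY', 'TX'],
    'rent': [1000, 2000, 3000, 4000, 5000],
})

n_obs = toy_df['state'].count()              # number of non-missing observations
n_states = toy_df['state'].nunique()         # number of distinct states
per_state = toy_df['state'].value_counts()   # number of rows for each state
mean_rent = toy_df['rent'].mean()
median_rent = toy_df['rent'].median()
quartiles = toy_df['rent'].quantile([0.25, 0.5, 0.75])

print('Observations:', n_obs)
print('Unique states:', n_states)
print(per_state)
print('Mean rent:', mean_rent)
print('Median rent:', median_rent)
print(quartiles)
```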

<font color='red'>Exercise 2</font>  
Let's put some of these into practice. Run the following cells to see the output.

In [None]:
rental_df['state'].count()

In [None]:
rental_df['state'].nunique()

In [None]:
rental_df['state'].value_counts()

These summary values can also be passed to print statements to display multiple outputs from the same cell.  

<font color='red'>Exercise 3</font>  
Run the cell to see the output.

In [None]:
print('Mean: ', rental_df['median_rent'].mean())
print('Median: ', rental_df['median_rent'].median())
print('Standard deviation: ', rental_df['median_rent'].std())
print('Max: ', rental_df['median_rent'].max())


<font color='red'>Exercise 4</font>  
Refer to the `quantile()` method in the table of methods above to display the 25th, 50th, and 75th percentiles of the `median_rent` column.

Computing these measures one by one for each column would be rather tedious. Instead, we can use the method `describe()` to compute several of these measures for all numeric columns in the dataframe at once. The syntax goes like this:  
`df_name.describe()`  
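
For instance, here is a minimal sketch on a small made-up dataframe. The name `toy_df` and its columns `city`, `bedrooms`, and `rent` are hypothetical. By default, `describe()` summarizes only the numeric columns, so the text column `city` is left out of the result.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy_df = pd.DataFrame({
    'city': ['A', 'B', 'C', 'D'],
    'bedrooms': [1, 2, 2, 3],
    'rent': [1000, 1500, 1700, 2200],
})

# Rows of the result: count, mean, std, min, 25%, 50%, 75%, max
summary = toy_df.describe()
print(summary)
```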

<font color='red'>Exercise 5</font>  
Now run the cell to see the data summarized all at once.

In [None]:
rental_df.describe()

## 3. Grouping Data
### 3a. Grouping by a Single Column
Is there an easy way to group data together according to a particular variable? For example, what if we wanted to group rent prices according to the number of bedrooms? The method that does exactly this is called `groupby()`, and its syntax goes like this:  
`df_name.groupby(by = 'col_name').method_name()`  
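
A minimal sketch of this pattern on made-up data is shown here. The dataframe `toy_df` and its values are hypothetical. The result has one row per unique value of the grouping column, and each cell holds the mean for that group.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy_df = pd.DataFrame({
    'bedrooms': [1, 1, 2, 2, 3],
    'rent': [900, 1100, 1400, 1600, 2000],
})

# Group rows by number of bedrooms, then average the remaining column
mean_by_bedrooms = toy_df.groupby(by = 'bedrooms').mean()
print(mean_by_bedrooms)
```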

<font color='red'>Exercise 6</font>  
Now let's apply this to our rental dataframe. If we wanted to get the mean rent for places according to how many bedrooms they have, the code would look like the cell below. Run the cell to see the output.

In [None]:
# numeric_only = True skips text columns, which cannot be averaged
rental_df.groupby(by = 'bedrooms').mean(numeric_only = True)

<font color='red'>Exercise 7</font>  
Use `groupby()` along with the appropriate method to show how many places are available for each number of bedrooms.

### 3b. Grouping by Multiple Columns
Now, what about grouping by more than one variable -- e.g., by the number of bedrooms *and* state? The only change we'd need to make to the syntax above is to provide a list of column names rather than a single column name:  
`df_name.groupby(by = ['col1', 'col2', 'etc.']).method_name()`
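
Here is a minimal sketch of grouping by two columns, again on made-up data. The dataframe `toy_df` and its columns `region`, `bedrooms`, and `rent` are hypothetical. The result has one row for each combination of the two grouping columns that appears in the data.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy_df = pd.DataFrame({
    'region': ['West', 'West', 'West', 'East', 'East'],
    'bedrooms': [1, 1, 2, 1, 2],
    'rent': [1000, 1200, 1800, 900, 1500],
})

# Group by region first, then by number of bedrooms within each region
mean_by_group = toy_df.groupby(by = ['region', 'bedrooms']).mean()
print(mean_by_group)
```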

<font color='red'>Exercise 8</font>  
Try this challenge for yourself -- group the `rental_df` dataframe by number of bedrooms and state.

## 4. Statistics with `stats`
We've been using the Pandas library to subset, manipulate, and summarize dataframes. Now we'll introduce you to a new library -- SciPy. SciPy gets its name from the combination of 'science' and 'Python'. We'll use the SciPy library for statistical tests. The library is huge, however, so we'll import only the small portion of it that pertains to statistics. We'll do so in the following exercise.

<font color='red'>Exercise 9</font>  
The cell below imports the `stats` component of the `scipy` library. Run the cell and remember that no output will be produced.

In [None]:
from scipy import stats

To use a function from the `stats` module of the SciPy library, you do not need to include the prefix `scipy`. Instead, you'll just use the following syntax:  
`stats.func_name()`  
where 'func_name()' is the name of a specific function you're using. You can find the full list of functions included in the `stats` module here: https://docs.scipy.org/doc/scipy/reference/stats.html  

For now, we'll use a single, simple example: Correlation (Pearson).

### 4a. Pearson's Correlation
Pearson's correlation is used to evaluate the linear relationship between two sets of continuous values. For example, when the values of one variable increase, do the values of the other increase as well? To perform this test, we use the function `pearsonr()` with the following syntax:  
`stats.pearsonr(df_name['col1'], df_name['col2'])`  
The `pearsonr()` function returns two values: `r` and `p`.  
The `r` value:
  * ranges from -1 to 1
  * a positive value means positive correlation (e.g., in kids, when height goes up, so does weight)
  * a negative value means negative correlation (e.g., when temperature drops in winter, heating bill goes up)

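The `p` value:
  * ranges from 0 to 1
  * is roughly the probability of seeing a correlation at least this strong if the two variables were actually uncorrelated
  * a small value (commonly below 0.05) suggests that the correlation is statistically significant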

Let's do an example with our states dataframe from previous modules. Is there a significant correlation between a state's total population and its Hispanic population? I.e., is it the case that states with a large total population have a larger Hispanic population?  

<font color='red'>Exercise 10</font>  
Run the cell below to see the correlation described above.

In [None]:
filepath = 'https://raw.githubusercontent.com/valeriecarr/engr120/main/S21/state_pop.csv'
states_df = pd.read_csv(filepath)

stats.pearsonr(states_df['totalPop'], states_df['hispPop'])

The first value returned is the `r` value, and the second is the `p` value. Note that the `p` value is given in scientific notation. Since the value returned for `r` is positive and close to 1, there is a strong positive correlation between states' total population and Hispanic population.
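
If you want to work with the two returned values separately, one option is to unpack them into two variables. The cell below is a minimal sketch using made-up height and weight values. The dataframe `toy_df` and its numbers are hypothetical and chosen only so that the two columns rise together.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data, made up for illustration only
toy_df = pd.DataFrame({
    'height': [100, 110, 120, 130, 140],
    'weight': [16, 19, 21, 26, 30],
})

# Unpack the result: the first value is r, the second is p
r, p = stats.pearsonr(toy_df['height'], toy_df['weight'])

print('r:', round(r, 3))
print('p:', round(p, 4))
```

Because weight increases steadily with height in this toy data, `r` comes out positive and close to 1.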

## All done!
You've finished the Stats module! Next we'll learn how to visualize the data we're working with.