## Chapter 3 Section I: Basic Mathematical Tools for Data: Descriptive Statistics
#### MA 189 Data Dive Into Birmingham (with R)
##### _Blazer Core: City as Classroom_

Course Website: [Github.com/kerenli/statbirmingham/](https://github.com/kerenli/statbirmingham/) (to be published)


#### Levels:
<div class="alert-success"> Concepts and general information</div>
<div class="alert-warning"> Important methods and technique details </div>
<div class="alert-info"> Extended reading </div>
<div class="alert-danger"> (Local) Examples, assignments, and <b>Practice in Birmingham</b> </div>

##### <div class="alert alert-block alert-success"> Introduction </div>

As discussed in Chapter 1, there is an abundance of data available for our local region and the world at large. This chapter focuses on how to analyze that data and represent it in a way that helps others understand it and draw conclusions. Consider the video below, which shows how the population sizes of various cities in the state of Alabama have changed over time.

[Video: Alabama Populations](https://www.youtube.com/watch?v=L2xPb__ZoWo)





By the end of this chapter, you will have the skills to create graphs that represent different datasets in our local region and to draw conclusions of your own.


<div class="alert alert-block alert-danger">
    <b>Application in Birmingham from the Area of History </b>
</div>

As a further motivation for this chapter, consider the video below from Dr. Jonathan Wiesen from the Department of History here at UAB. He dives into a variety of historical contexts in which he utilizes data visualization to communicate trends in the values of different variables related to armament (military weapons and equipment), slavery, wealth share, country independence, birth rates, and more! 


[Video: UAB Department of History Data Visualization](https://youtu.be/FhoPvrbRw5g)


### Variables and Descriptive Statistics

##### <div class="alert alert-block alert-success"> Types of Variables for Statistical Studies </div>

* A __variable__ is a characteristic of a study subject that can be observed or measured. 
* A __quantitative variable__ is a variable that has observations with numerical values on which meaningful mathematical operations can be performed.
* A __categorical variable__ is a variable that describes a study subject as being in a particular category or group.


A quantitative variable can be further classified as one of two types: 
* If the possible values of the quantitative variable <u>form an interval</u>, then it is called a __continuous quantitative variable__. 
* If the possible values of a quantitative variable can only be a <u>set of specific numeric values</u>, then it is called a __discrete quantitative variable__. 

<div class="alert alert-block alert-danger"><b>Local Example:</b> Variables in Birmingham</div>

Continuous Quantitative: Temperature of iron ore at Sloss Furnaces <br>
Discrete Quantitative: Number of daily visitors to Vulcan Park and Museum <br>
Categorical: Types of businesses in Birmingham (e.g. nonprofit, governmental, financial, restaurants) 
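
In R, these three kinds of variables are usually stored as different data types. The sketch below is only an illustration: the variable names and all values are made up for this example and are not real measurements from Sloss Furnaces, Vulcan Park, or Birmingham businesses.

```r
# Hypothetical values for the three local examples above (not real data).
ore_temp_f     <- c(2795.4, 2801.9, 2788.2)   # continuous quantitative
daily_visitors <- c(412L, 380L, 455L)         # discrete quantitative (counts)
business_type  <- factor(c("nonprofit", "restaurant", "financial"))  # categorical

class(ore_temp_f)      # "numeric"
class(daily_visitors)  # "integer"
class(business_type)   # "factor"
levels(business_type)  # the distinct categories
```

Arithmetic such as `mean(ore_temp_f)` is meaningful for the first two variables, while a factor is summarized with counts, for example `table(business_type)`.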


<div class="alert alert-block alert-danger"><b>Local Student Practice: </b> Variables in Birmingham</div>

Classify the variables below as one of the following: continuous quantitative, discrete quantitative, or categorical.

1. Count of daily morning walkers at Railroad Park  
2. Average July temperatures in Birmingham 
3. Hair color of UAB students 
4. Country of origin for Birmingham residents  
5. Number of courses UAB students enroll in each fall 
6. Molecule weights in chemistry lab 
7. Major for UAB students 
8. Birth year of UAB professors 
9. Age of UAB professors 
10. True/False on an English exam 
11. Time of sunset (viewing from Ruffner Mountain)
12. Visitor names at Sloss Furnaces 






##### Your answer:
Continuous Quantitative:  



Discrete Quantitative:

Categorical:  

##### <div class="alert alert-block alert-success">Statistics for Individual <ins>Quantitative</ins> Variables (Univariate Statistics) </div>

For the following definitions, let $n$ represent the sample size, $N$ the population size, and $x$ the individual data points (observations).

##### Measures of center: 
* The __mean__ is the sum of the observations divided by the number of observations. (Another word for this is the average). 
    \begin{equation}
    \begin{aligned}
    \text{Sample mean: } & \bar{x}=\frac{1}{n}\sum_{i=1}^n x_i, \\
    \text{Population mean: } & \mu=\frac{1}{N}\sum_{i=1}^N x_i.
    \end{aligned}
    \end{equation}
    
A weighted mean accounts for variations in the relative importance of data values. Each data value $x_i$ is assigned a weight $w_i$ and the weighted mean is 
    \begin{equation}
    \frac{\sum_{i=1}^n w_i\cdot x_i}{\sum_{i=1}^n w_i}
    \end{equation}
    

* The __median__ is the middle value of the observations when the observations are ordered from the smallest to the largest. When the number of observations is even, the median is the average of the two middle values.
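
The measures of center above can be computed directly in base R. The following is a minimal sketch on a small made-up sample; the data `x` and the weights `w` are assumptions chosen only so the arithmetic is easy to follow by hand.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical sample of n = 8 observations
w <- c(1, 1, 1, 1, 1, 1, 1, 3)   # hypothetical weights (last value counts triple)

mean(x)               # 5:   (2 + 4 + 4 + 4 + 5 + 5 + 7 + 9) / 8
weighted.mean(x, w)   # 5.8: sum(w * x) / sum(w) = 58 / 10
median(x)             # 4.5: average of the two middle values, 4 and 5
```

Giving the largest observation more weight pulls the weighted mean above the ordinary mean, which is the effect the weights are meant to capture.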

##### Measures of variability: 
* The __standard deviation__ is a measure of the dispersion of the data from the mean and is the square root of the __variance__. 
\begin{equation}
\begin{aligned}
\text{Sample standard deviation: } & s=\sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2}, \\
\text{Population standard deviation: } & \sigma=\sqrt{\frac{1}{N}\sum_{i=1}^N (x_i-\mu)^2}.
\end{aligned}
\end{equation}

Note: As a rough rule of thumb for bell-shaped data, the standard deviation is approximately one quarter of the range of the distribution. 


* The __range__ is the difference between the largest and smallest observations. 
* The __interquartile range__ (IQR) is the distance between the first and third quartiles, $\text{IQR} = Q_3 - Q_1$. To find the __quartiles__: arrange the data in order; the median is the __second quartile, Q2__; the median of the lower half of the data is called the __first quartile, Q1__; and the median of the upper half of the data is called the __third quartile, Q3__.  
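
The measures of variability above can be checked numerically in R. This is a sketch on made-up data, not a definitive implementation: `quartiles_by_halves` is a hypothetical helper (not a built-in R function) that follows the median-of-halves description, under the assumption that the middle observation is left out of both halves when $n$ is odd.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical sample
n <- length(x)

s     <- sd(x)                            # sample sd, divides by n - 1 (about 2.14)
sigma <- sqrt(sum((x - mean(x))^2) / n)   # population formula, divides by n (gives 2)
data_range <- max(x) - min(x)             # 9 - 2 = 7

# Hypothetical helper: quartiles as medians of the lower and upper halves.
# Assumes at least two observations; drops the middle value when n is odd.
quartiles_by_halves <- function(x) {
  x <- sort(x)
  n <- length(x)
  lower <- x[seq_len(floor(n / 2))]
  upper <- x[(ceiling(n / 2) + 1):n]
  c(Q1 = median(lower), Q2 = median(x), Q3 = median(upper))
}

q   <- quartiles_by_halves(x)   # Q1 = 4, Q2 = 4.5, Q3 = 6
iqr <- q[["Q3"]] - q[["Q1"]]    # 6 - 4 = 2
```

Two base R details are worth knowing: `range(x)` returns the minimum and maximum rather than their difference, and `quantile(x)` and `IQR(x)` interpolate between observations by default, so their quartiles can differ slightly from the halves method used here.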



    
##### Measure of Frequency:
* The __mode__ is the value that occurs most frequently in a set of observations. 
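
Base R has no built-in function for the statistical mode; its `mode()` function reports how an object is stored instead. One common workaround, sketched here on made-up data, is to count each value with `table()` and keep the value or values with the highest count.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical sample

counts <- table(x)                                          # frequency of each value
modes  <- as.numeric(names(counts)[counts == max(counts)])  # value(s) with top count
modes     # 4

mode(x)   # "numeric": the storage mode, not the most frequent value
```

If several values tie for the highest count, `modes` contains all of them.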

    
    



<div class="alert alert-block alert-danger">
<b>Student Practice</b>: Novels Most Widely Held in Libraries Worldwide
</div>

Last class we discussed the importance of having context and humanizing data points to the best of one's ability. The following dataset from the OCLC, a library organization in over 100 countries, gives detailed information on the 500 novels most widely held in libraries. The data collected includes author biographies, novel specifics, online holdings info, and online popularity of the novels: https://www.responsible-datasets-in-context.com/posts/top-500-novels/top-500-novels.html?tab=data-essay. 

1. Calculate the mean Goodreads rating for the top 10 most widely held novels by libraries.

2. Calculate the standard deviation of the Goodreads average ratings for the top 10 most widely held novels by libraries. 

3. What is the median publication year for the top 10 most widely held novels by libraries?

4. What is the range of total physical library holdings for the top 10 most widely held novels by libraries? 

5. What is the mode for the author birth year for the top 10 most widely held novels by libraries? 

6. For the top 10 most widely held novels by libraries, determine the quartiles for the year of death of the author. 

7.  For the top 10 most widely held novels by libraries, calculate the interquartile range of the year of death of the author, using the quartiles from question 6. 

We will reflect on your process after a few moments. 


##### Your answer:

##### Further Definitions:

* The __margin of error__ is a measure of the expected variability from one random sample to the next. For a sample proportion, it is approximately $\frac{1}{\sqrt{n}}\times 100\%$ (at roughly the 95% confidence level).

* A __five number summary__ is a way to summarize a dataset using the following five values:

    The minimum<br>
    The first quartile<br>
    The median<br>
    The third quartile<br>
    The maximum<br> 
    <br>
    
    
* An __outlier__ is an observation that falls well above or well below the overall bulk of the data.

Note:<br>
-Later we will see that an observation in a bell-shaped distribution is regarded as a potential outlier if it falls more than three standard deviations away from the mean (outside $\bar{x}\pm 3s$). This is called the __Standard Deviation Criterion__. <br>
    
-The __Interquartile Range (IQR) Criterion__ is helpful for identifying potential outliers: an observation is a potential outlier if it falls more than $1.5 \times \text{IQR}$ below the first quartile or more than $1.5 \times \text{IQR}$ above the third quartile. 
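
These definitions can be tried out in R on a small made-up dataset. The sketch below makes two assumptions: the sample size `n = 400` and the data `x` are invented for illustration, and the quartiles come from `fivenum()`, which uses Tukey's hinges. For an even number of observations, as here, the hinges match the median-of-halves quartiles.

```r
# Approximate margin of error for a hypothetical sample of n = 400
n <- 400
margin_of_error_pct <- 1 / sqrt(n) * 100   # 5 (percent)

# Five-number summary of hypothetical data with one unusually large value
x <- c(2, 4, 4, 4, 5, 5, 7, 9, 10, 30)
five <- fivenum(x)                         # 2, 4, 5, 9, 30
names(five) <- c("min", "Q1", "median", "Q3", "max")

# IQR criterion: flag values more than 1.5 * IQR beyond the quartiles
iqr         <- five[["Q3"]] - five[["Q1"]]   # 9 - 4 = 5
lower_fence <- five[["Q1"]] - 1.5 * iqr      # 4 - 7.5 = -3.5
upper_fence <- five[["Q3"]] + 1.5 * iqr      # 9 + 7.5 = 16.5
potential_outliers <- x[x < lower_fence | x > upper_fence]   # 30
```

The value 30 is flagged because it lies above the upper fence of 16.5; nothing falls below the lower fence.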

    

<div class="alert alert-block alert-warning">
    <b>Distributions and Comparing the Mean and Median for a Quantitative Variable</b>
</div>

* A __distribution__ of a variable describes how the observations (data points) fall across the range of possible values. 

* A __density curve__ is a continuous curve drawn to visualize the underlying probability distribution of the data. (It gives an idealized overall picture of the data while allowing minor irregularities to be "ignored.") 

Notes: <br>
-The mean of a density curve is the balance point, at which the curve would balance if made of solid material. <br>
-The median of a density curve is the equal-areas point, the point that divides the area under the curve in half. <br>
-If a distribution is highly skewed to the left or right (i.e. the bulk of the data is on one end, with a long tail toward the other), the median is preferred over the mean because the mean is pulled toward the tail. <br>
-If a distribution is close to symmetric or only mildly skewed, the mean is preferred over the median. 
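
A quick numerical check shows why the median is preferred for skewed data. Both datasets in this sketch are made up for illustration: one is symmetric, and the other has a single large value that creates a long right tail.

```r
symmetric    <- c(1, 2, 3, 4, 5, 6, 7)       # hypothetical symmetric data
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 50)   # hypothetical data with a long right tail

mean(symmetric)       # 4
median(symmetric)     # 4

mean(right_skewed)    # 8.5: pulled toward the large value 50
median(right_skewed)  # 3:   stays with the bulk of the data
```

For the symmetric data the two measures agree, while for the skewed data the mean is larger than seven of the eight observations.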

![image.png](attachment:3818d037-9d55-48c5-8adc-2b8b523f54cd.png)


<div class="alert alert-block alert-danger">
<b>Example</b>: Novels Most Widely Held in Libraries Worldwide
</div>

Consider the distribution of the timeframes in which the top 500 most widely held novels in libraries were published, shown in the "Examining Bias" section of https://www.responsible-datasets-in-context.com/posts/top-500-novels/top-500-novels.html?tab=data-essay. 


What skew does this distribution have? 

<p align="center">
    <img src="pics/skewed_left.png" alt="Figure 1" style="width:45%;"/>
</p>

##### <div class="alert alert-block alert-success">Statistics for Individual <ins>Categorical</ins> Variables (Univariate Statistics)</div>

* The __proportion__ of a category is the fraction of observations that fall in that category (the count for the category divided by the total number of observations). 

##### Measure of Frequency:
* The __mode__ is the category that occurs most frequently in a set of observations. 
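
For a categorical variable, both statistics come from a table of counts. The sketch below uses a made-up vector of business types; the categories echo the earlier Birmingham example, but the values themselves are hypothetical.

```r
# Hypothetical categorical observations (not real Birmingham data)
business_type <- c("restaurant", "nonprofit", "restaurant", "financial",
                   "restaurant", "governmental", "nonprofit", "restaurant")

counts      <- table(business_type)   # count per category
proportions <- prop.table(counts)     # count / total number of observations

proportions["restaurant"]             # 4 / 8 = 0.5
mode_category <- names(counts)[counts == max(counts)]   # "restaurant"
```

Because `prop.table()` divides every count by the total, the proportions always add up to 1.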


<div class="alert alert-block alert-danger"><b>Example</b>: Novels Most Widely Held in Libraries Worldwide </div>

Refer back to the top 500 novels most widely held by libraries dataset: https://www.responsible-datasets-in-context.com/posts/top-500-novels/top-500-novels.html?tab=data-essay.

1. What proportion of the top 10 novels most widely held in libraries were originally written in Spanish?
2. What is the mode for author nationality in the top 10 novels most widely held in libraries? 


##### Your answer:

### Data Visualization Techniques 

Let's take a look at a variety of data visualizations, some quantitative and some categorical.

##### <div class="alert alert-block alert-success"><ins>Quantitative</ins> Data Visualization</div>

* Histogram
* Scatter Plot
* Box Plot (or Box-and-Whiskers Plot)
* Time Series
* Stem-and-Leaf Plot
* Dot Plot
* Density Functions

<p align="center">
    <img src="pics/histogram.png" alt="Figure 1" style="width:70%;"/>
</p>

<p align="center">
    <img src="pics/scatterplot.png" alt="Figure 2" style="width:70%;"/>
</p>

<p align="center">
    <img src="pics/scatterplot2.png" alt="Figure 3" style="width:70%;"/>
</p>

<p align="center">
    <img src="pics/boxandwhiskers.png" alt="Figure 4" style="width:70%;"/>
</p>

<p align="center">
    <img src="pics/timeseries.png" alt="Figure 5" style="width:100%;"/>
</p>

<p align="center">
    <img src="pics/stemandleaf.png" alt="Figure 6" style="width:70%;"/>
</p>

<p align="center">
    <img src="pics/dotplot.png" alt="Figure 7" style="width:70%;"/>
</p>

<p align="center">
    <img src="pics/densityfunctions.png" alt="Figure 8" style="width:70%;"/>
</p>
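
Each plot type listed above has a counterpart in base R graphics. The sketch below is one possible way to draw them, using simulated, made-up data rather than Birmingham data; the variables `x` and `y` and the plot titles are assumptions for illustration only.

```r
set.seed(189)                               # make the made-up data reproducible
x <- round(rnorm(50, mean = 70, sd = 10))   # 50 hypothetical measurements
y <- 2 * x + rnorm(50, sd = 8)              # a second hypothetical variable related to x

h <- hist(x, main = "Histogram")                    # histogram
plot(x, y, main = "Scatter plot")                   # scatter plot of two variables
boxplot(x, horizontal = TRUE, main = "Box plot")    # box-and-whiskers plot
plot(ts(x), main = "Time series")                   # values treated as observed in sequence
stem(x)                                             # stem-and-leaf plot (printed as text)
stripchart(x, method = "stack", pch = 16,
           main = "Dot plot")                       # dot plot with stacked points
plot(density(x), main = "Density estimate")         # smoothed density curve
```

The object returned by `hist()` also stores the bin counts in `h$counts`, which is handy for checking what the picture shows.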

<div class="alert alert-block alert-danger">
<b>Local Student Practice: Birmingham Data Visualizations </b>
</div>

Identify the following data visualizations as one of the above categories: 

Birmingham Airport and Hotel Occupancy (page 6 on https://downtownbhm.com/wp-content/uploads/2024/04/Q3-Q4-2023-Data-Report-FINAL-1.pdf)

Age Distribution of Downtown Residents (...if we were to put the bars together and re-label the axes) (page 17 on https://downtownbhm.com/wp-content/uploads/2024/04/Q3-Q4-2023-Data-Report-FINAL-1.pdf)





##### <div class="alert alert-block alert-success"><ins>Categorical</ins> Data Visualization</div>
 
* Bar Graph
* Pareto Chart
* Pie Chart






<p align="center">
    <img src="pics/bargraph.png" alt="Figure 1" style="width:70%;"/>
</p>

<p align="center">
    <img src="pics/bargraph2.png" alt="Figure 2" style="width:70%;"/>
</p>

<p align="center">
    <img src="pics/sidebysidebargraph.png" alt="Figure 3" style="width:70%;"/>
</p>

<p align="center">
    <img src="pics/splitstackbargraph.png" alt="Figure 4" style="width:70%;"/>
</p>

<p align="center">
    <img src="pics/paretochart.png" alt="Figure 5" style="width:70%;"/>
</p>

<p align="center">
    <img src="pics/piechart.png" alt="Figure 6" style="width:70%;"/>
</p>

<p align="center">
    <img src="pics/piechart2.png" alt="Figure 7" style="width:70%;"/>
</p>
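
The three categorical plot types can also be sketched in base R. The data here are made up for illustration, and the Pareto chart is assembled by hand (bars sorted from most to least frequent, plus a line of cumulative counts) because base R has no dedicated Pareto function.

```r
# Hypothetical categorical observations (not real Birmingham data)
business_type <- c("restaurant", "nonprofit", "restaurant", "financial",
                   "restaurant", "governmental", "nonprofit", "restaurant")
counts <- table(business_type)

barplot(counts, main = "Bar graph", ylab = "Count")   # bar graph

sorted <- sort(counts, decreasing = TRUE)             # most frequent category first
mids <- barplot(sorted, main = "Pareto chart", ylab = "Count",
                ylim = c(0, sum(sorted)))             # bars in descending order
lines(mids, cumsum(sorted), type = "b")               # running total across the bars

pie(counts, main = "Pie chart")                       # pie chart
```

`barplot()` returns the horizontal midpoints of the bars, which is what lets the cumulative line sit directly above each bar.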

<div class="alert alert-block alert-danger"><b>Local Student Practice</b>:  Birmingham Data Visualizations  </div>


Identify the following data visualizations as one of the above categories: 

Monthly Employee Visits (page 4 on https://downtownbhm.com/wp-content/uploads/2024/04/Q3-Q4-2023-Data-Report-FINAL-1.pdf)

Racial Makeup of Downtown Residents (page 17 on https://downtownbhm.com/wp-content/uploads/2024/04/Q3-Q4-2023-Data-Report-FINAL-1.pdf)

U.S. vs Birmingham, AL Foreign-Born \% (https://datausa.io/profile/geo/birmingham-al/)

Awarded Degrees Over Time - graph on left (https://datausa.io/profile/geo/birmingham-al/)

Average Net Price by Sector (https://datausa.io/profile/geo/birmingham-al/)

Median Earnings by Industry / Gender (https://datausa.io/profile/geo/birmingham-al/)



<div class="alert alert-block alert-danger">
    <b>Local Student Practice: </b> City of Birmingham Fiscal Year Visualizations
</div>

Take a look around the following website for information on the City of Birmingham's annual fiscal (financial) reports. 

https://birminghamal.opengov.com/transparency/#/22078/accountType=expenses&embed=n&breakdown=67161fe1-9b6b-4735-b6b7-e848f1d44237&currentYearAmount=cumulative&currentYearPeriod=years&graph=percentage&legendSort=coa&proration=true&saved_view=null&selection=C53CDC200C9444356C8CEAB4F78F24F8&projections=null&projectionType=null&highlighting=null&highlightingVariance=null&year=2024&selectedDataSetIndex=null&fiscal_start=earliest&fiscal_end=latest

1. What different types of data visualizations do they offer to represent the various reports?
2. Write two conclusions that you made about the City of Birmingham from analyzing these figures. 

We will revisit this website in a future chapter to dive deeper into the communication of results using data visualization. 


##### Your answer:

<div class="alert alert-block alert-danger">
    <b>Local Student Practice: </b> Alabama City Populations
</div>

Consider the video from the introduction of this chapter (given again below). What type of data visualization method (type of graph) is used in this video? 

[![Video: Alabama Populations](https://i.ytimg.com/vi/L2xPb__ZoWo/hq720.jpg?sqp=-oaymwEjCOgCEMoBSFryq4qpAxUIARUAAAAAGAElAADIQj0AgKJDeAE=&rs=AOn4CLAKi29tEJRWAho6rZybYwOcXSXVTQ)](https://www.youtube.com/watch?v=L2xPb__ZoWo)

##### Your answer:

### Sampling Methods

##### <div class="alert alert-block alert-warning">Proper and Improper Sampling</div>


It is important to select samples in ways that are likely to reflect the actual population. 

Proper ways of sampling: 
* A __simple random sample__ (or __random sample__) of $n$ subjects from a population is one in which each possible sample of that size has the same chance of being selected. 

Improper ways of sampling: 
* __Convenience sampling__: choosing a sample because it is convenient (e.g. in terms of time, location, or the people the surveyors want to approach for the survey); such a sample is most likely not a random sample.
* __Volunteer sampling__: individuals volunteer to participate, so some parts of the population may be more likely to participate than others. 
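
The difference between a simple random sample and a convenience sample can be sketched in R. Everything here is hypothetical: the population is just a list of 1,000 made-up ID numbers, and "taking the first 25 names on the list" stands in for a convenient but nonrandom choice.

```r
set.seed(189)                       # make the random draw reproducible
population <- 1:1000                # hypothetical ID numbers for 1,000 residents
n <- 25

srs         <- sample(population, size = n)   # simple random sample, without replacement
convenience <- head(population, n)            # first 25 on the list: convenient, not random
```

With `sample()`, every set of 25 IDs is equally likely to be chosen, which is the defining property of a simple random sample; the convenience sample always contains the same 25 IDs.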


However, _even if_ a sample is taken properly, there can still be __bias__, in which some outcomes are more or less likely to show up than others even though that may not be representative of what is actually occurring in the population. Examples:

* __Sampling bias__: occurs when a sample is not taken randomly (__nonrandom sample__) or when a sample that is taken is not representative of different parts of a population (__undercoverage__).
* __Nonresponse bias__: occurs when subjects cannot be reached or refuse to participate (or may not respond to some questions which results in __missing data__) 
* __Response bias__: occurs when the subject gives untruthful responses (e.g. due to what they think the interviewer wants to hear or because it's more socially acceptable), or when the questions are asked in a confusing or misleading way to encourage subjects to respond in a certain way.


Note: In general, larger samples are better than smaller ones. However, a large sample that exhibits one or more of the above biases does not give a more accurate result; a smaller sample that was taken properly would have been better.
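
A small simulation illustrates this note. It is only a sketch under stated assumptions: the population is a made-up set of 10,000 numeric values, and the biased design can only reach the upper half of the population (an extreme case of undercoverage).

```r
set.seed(189)                       # make the random draws reproducible
population <- 1:10000               # hypothetical measurements, one per person
true_mean  <- mean(population)      # 5000.5

small_srs <- sample(population, 50)             # small simple random sample

reachable    <- population[population > 5000]   # biased design misses the lower half
large_biased <- sample(reachable, 2000)         # large sample, but with undercoverage

error_small <- abs(mean(small_srs) - true_mean)     # error of the small random sample
error_large <- abs(mean(large_biased) - true_mean)  # error of the large biased sample
```

Comparing `error_small` with `error_large` shows that taking forty times as many observations does not repair the bias: the large sample estimates the mean of the reachable group, not of the whole population.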


<div class="alert alert-block alert-danger">
    <b>Local Student Practice: </b> Survey Design Bias
</div>

1) I want to take a survey of the population of Birmingham and do a random digit dialing of cell phones. What type of bias does this sampling design exhibit?  

2) Due to the strong emotions that a survey regarding political issues evokes for a portion of the sample, those individuals decide not to participate in the study. What type of bias does this sampling design exhibit? 

3) UAB would like to do a study on drug use for students on campus. Some students are fearful of their identity being attached to their answers and so they do not answer honestly.  What type of bias does this sampling design exhibit? 

4) Researchers in Birmingham want to look at the traffic across the city. Each day at 5pm, they go to one spot on Highway 280 and count the number of cars passing by to obtain their data.  What type of bias does this sampling design exhibit? 

##### Your answer:
