![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)
$$\large \textbf{Measure of Data Spread}$$
$$\large \textbf{Elements of Visualizations 4 - Time Series or Line Graphs}$$

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
## **Standard Deviations and Variance**
<details>
  <summary><b>Definition of Standard Deviation and Variance</b></summary>
  
<b>Standard deviation</b> is a measure of the dispersion or variability of a set of data points. It quantifies how spread out the values are from the average (mean) of the data set. The standard deviation is the square root of the variance.

**Variance** is the square of the standard deviation. It is the average squared deviation of the individual data points from the mean, so it also measures how far the data points are scattered from the average value. Variance is reported in squared units, which makes it less directly interpretable than the standard deviation.
</details>

<details>
  <summary><b>Characteristics of Standard Deviation</b></summary>

* provides a numerical measure of the overall amount of variation (spread) in a data set
* can be used to determine whether a particular data value is close to or far from the mean
* is always positive or zero
* is measured in the same units as the original data
* is sensitive to outliers

  </details>
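
The second characteristic above can be made concrete by expressing a value's distance from the mean in units of standard deviation (often called a z-score). The sketch below uses a small made-up data set, chosen only because its mean and population standard deviation are whole numbers; the helper name `deviations_from_mean` is hypothetical.

```python
import statistics

# Made-up illustrative data (not from the lab exercises)
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)        # 5
pop_std = statistics.pstdev(data)   # 2.0


def deviations_from_mean(value, mean, std):
    """Return how many standard deviations `value` lies from `mean`."""
    return (value - mean) / std


for value in (5, 7, 9):
    z = deviations_from_mean(value, mean, pop_std)
    print(f"{value} is {z:+.1f} standard deviations from the mean")
```

A value with a result near 0 is close to the mean, while a result of 2 or more in absolute value is comparatively far from it.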


<details>
  <summary><b>Characteristics of Variance</b></summary>

* is reported in squared units, which may not be directly interpretable or intuitive
* is also sensitive to outliers
* is often easier to work with mathematically than the standard deviation.
* is a key parameter in various statistical models, such as linear regression or analysis of variance (ANOVA). These models require the estimation of variance to understand the variability in the data and make appropriate statistical inferences.

  </details>


<details>
  <summary><b>Mathematical Formula for Computing Standard Deviation</b></summary>
$$\begin{align}\text{Let } \sigma & = \text{  the standard deviation of a population} \\
s & = \text{  the standard deviation of a sample set} \\
\mu & = \text{  the mean of a population} \\
\overline{x} & = \text{  the mean of a sample set} \end{align}$$
$$\text{sum of squared errors} = \displaystyle\sum_{i=1}^{n} (x_i - x_{\text{average}})^2, \text{ where } x_{\text{average}} = \mu \text{ for a population and } x_{\text{average}} = \overline{x} \text{ for a sample set.}$$
$$\begin{align} \text{variance of a population } \hspace{15mm} &\sigma^2 = \frac{\displaystyle\sum_{i=1}^{n} (x_i - \mu)^2}{n}\\
\text{variance of  a sample set} \hspace{15mm} &s^2 = \frac{\displaystyle\sum_{i=1}^{n} (x_i - \overline{x})^2}{n-1} \end{align}$$
    
$$\begin{align} \text{Standard Deviation of a population } \hspace{15mm} &\sigma = \sqrt{\frac{\displaystyle\sum_{i=1}^{n} (x_i - \mu)^2}{n}}\\
\text{Standard Deviation of  a sample set} \hspace{15mm} &s = \sqrt{\frac{\displaystyle\sum_{i=1}^{n} (x_i - \overline{x})^2}{n-1}} \end{align}$$

</details>
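
The formulas above can be checked by hand in a few lines of Python. This is a minimal sketch on a made-up data set (the values are illustrative, not part of the lab data), using only the standard library:

```python
import math

# Made-up illustrative data (not from the lab exercises)
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

# Mean of the data (mu for a population, x-bar for a sample)
mean = sum(data) / n

# Sum of squared deviations from the mean
sum_sq = sum((x - mean) ** 2 for x in data)

pop_var = sum_sq / n            # sigma^2: divide by n
sample_var = sum_sq / (n - 1)   # s^2: divide by n - 1

pop_std = math.sqrt(pop_var)        # sigma
sample_std = math.sqrt(sample_var)  # s

print(f"Population variance = {pop_var:.4f}, standard deviation = {pop_std:.4f}")
print(f"Sample variance     = {sample_var:.4f}, standard deviation = {sample_std:.4f}")
```

The only difference between the two versions is the divisor, which is why the sample values come out slightly larger.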

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
## **Python for numerical calculation of standard deviations**


---
### Use the **`Numpy`** library to calculate **population** standard deviation.

#### `numpy.std(arr)`

<details>
  <summary><b>Show syntax in numpy</b></summary>
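`numpy.std(arr)` returns the population standard deviation by default (`ddof=0`, which divides by $n$). Pass `ddof=1` to get the sample standard deviation (divide by $n-1$). For a large sample the two values are nearly equal.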

</details>
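
As a quick illustration of the `ddof` argument, the sketch below computes both versions on a made-up array (the data values are illustrative only):

```python
import numpy as np

# Made-up illustrative data (not from the lab exercises)
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

pop_std = np.std(data)             # default ddof=0: divide by n
sample_std = np.std(data, ddof=1)  # ddof=1: divide by n - 1

print(f"Population standard deviation (ddof=0) = {pop_std:.4f}")
print(f"Sample standard deviation (ddof=1)     = {sample_std:.4f}")
```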

---

### Use the **`statistics`** module to calculate **population** standard deviation

### `statistics.pstdev(data)`

### Use **`statistics`** module for computing **sample** standard deviation

### `statistics.stdev(data) `
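
The `statistics` module also provides the variance directly through `statistics.pvariance` and `statistics.variance`. A minimal sketch, using made-up data, that checks each standard deviation is the square root of the matching variance:

```python
import math
import statistics

# Made-up illustrative data (not from the lab exercises)
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = statistics.pvariance(data)    # population variance (divide by n)
sample_var = statistics.variance(data)  # sample variance (divide by n - 1)

pop_std = statistics.pstdev(data)       # population standard deviation
sample_std = statistics.stdev(data)     # sample standard deviation

print(f"Population: variance = {pop_var:.4f}, standard deviation = {pop_std:.4f}")
print(f"Sample:     variance = {sample_var:.4f}, standard deviation = {sample_std:.4f}")
print(math.isclose(pop_std, math.sqrt(pop_var)))
print(math.isclose(sample_std, math.sqrt(sample_var)))
```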

---

### Use the **`pandas`** `std` method to calculate **sample** standard deviation (the default, `ddof=1`)


### `DataFrame.std(axis=0, skipna=True, ddof=1, numeric_only=False)`

<details>
  <summary><b>Show explanation of parameters</b></summary>

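Using Pandas for computing the standard deviation

**axis:** `0` or `"index"` computes the standard deviation of each column (default); `1` or `"columns"` computes it for each row. \
**skipna:** Optional, `True` or `False`, default `True`. Specifies whether to exclude missing (NaN) values. \
**ddof:** Optional, default `1`. The divisor is $n - \text{ddof}$, so the default gives the sample standard deviation and `ddof=0` gives the population standard deviation. \
**numeric_only:** Optional, `True` or `False`, default `False`. Specifies whether to include only numeric columns. \
**return:** A Series with the standard deviation of each column (or row).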

</details>
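
As a small illustration of the `ddof` and `numeric_only` arguments, the sketch below builds a made-up data frame (`df_demo` and its column names are hypothetical) and computes the standard deviation of each numeric column both ways:

```python
import pandas as pd

# Made-up illustrative data: two numeric columns and one text column
df_demo = pd.DataFrame({
    "a": [2, 4, 4, 4, 5, 5, 7, 9],
    "b": [1, 1, 1, 1, 3, 3, 3, 3],
    "label": list("wxyzwxyz"),
})

# Default ddof=1 gives the sample standard deviation of each numeric column
sample_std = df_demo.std(numeric_only=True)

# ddof=0 gives the population standard deviation instead
pop_std = df_demo.std(ddof=0, numeric_only=True)

print(sample_std)
print(pop_std)
```

Passing `numeric_only=True` keeps the text column out of the calculation, so the result contains only the columns `a` and `b`.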

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
## **Line Graphs or Time Series**

<details>
  <summary><b>Show description</b></summary>

<b>Line graphs</b> are scatter plots with lines connecting each consecutive pair of points.

A  <a href = "https://whatis.techtarget.com/definition/time-series-chart#:~:text=A%20time%20series%20chart%2C%20also,quantity%20that%20is%20being%20measured."> <b>time series graph</b> </a>  (also known as a timeplot) displays values against time. It is similar to an x-y graph, but while an x-y graph can plot a variety of “x” variables (for example, height, weight, age), a timeplot can only display time on the x-axis. Unlike pie charts and bar charts, these plots do not have categories. Timeplots are good for showing how data changes over time. For example, this type of chart would work well if you were sampling data at random times.  The same plotting syntax used for line graphs works for time series plots.

</details>


![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
## 🔖 **$\color{blue}{\textbf{Lab Work 1: }}$Computing standard deviations of an array-like data structure using `numpy` and `statistics` Python Library**

<details>
  <summary><b>Show Problem</b></summary>
Use the following data (first exam scores) from an instructor's spring pre-calculus class to compute the population and sample standard deviations:

33, 42, 49, 90, 49, 49, 53, 55, 55, 61, 63, 82, 67, 68, 68,\
69, 69, 72, 73, 66, 74, 78, 80, 56, 83, 88, 88, 71, 88, 90, \
92, 94, 70, 94, 94, 33, 94, 96, 100, 100

</details>

In [1]:
import statistics
import numpy as np

final_scores = [
    33, 42, 49, 90, 49, 49, 53, 55, 55, 61, 63, 82, 67, 68, 68,
    69, 69, 72, 73, 66, 74, 78, 80, 56, 83, 88, 88, 71, 88, 90,
    92, 94, 70, 94, 94, 33, 94, 96, 100, 100
]

pstd_np = np.std(final_scores)
print(f"The population standard deviation with numpy = {pstd_np:.2f} from the mean score {np.mean(final_scores)}")
pstd_para = statistics.pstdev(final_scores)
print(f"The population standard deviation with statistics = {pstd_para:.2f} from the mean score {np.mean(final_scores)}")
sstd_statistic = statistics.stdev(final_scores)
print(f"The sample standard deviation with statistics = {sstd_statistic:.2f} from the mean score {np.mean(final_scores)}")

The population standard deviation with numpy = 18.21 from the mean score 72.4
The population standard deviation with statistics = 18.21 from the mean score 72.4
The sample standard deviation with statistics = 18.44 from the mean score 72.4


### 🧚 **$\color{orange}{\textbf{Reminder:}}$** For a **small** sample size, use the formula for **Sample** Standard Deviation

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
## **$\color{green}{\textbf{TO DO 1:}}$**

<details>
  <summary><b>Show data set</b></summary>

 On a baseball team, the ages of each of the players are as follows:

21, 21, 22, 23, 24, 24, 25, 25, 28, 29, 29, 31,
32, 33, 33, 34, 25, 25, 26, 26, 26, 38, 38, 38, 40

</details>

> **1.** Find the mean $\overline{x}$ of the dataset to ***two*** decimal places.\
> **2.** Apply the formulas for computing both the **population** and **sample standard deviations** of the dataset.  Print the output using an `f-string` with a meaningful description, rounding the results to ***six*** decimal places.\
> **3.** Write a short interpretation of the implication of the difference between the sample and population standard deviations.

In [2]:
baseball_ages = [21, 21, 22, 23, 24, 24, 25, 25, 28, 29, 29, 31, 32, 33, 33, 34, 25, 25, 26, 26, 26, 38, 38, 38, 40]
sample_mean = np.mean(baseball_ages)
print(f"The sample mean of the baseball players ages = {sample_mean:.2f}")
population_std = statistics.pstdev(baseball_ages)
print(f"The population standard deviation using statistics = {population_std:.6f}")
sample_std = statistics.stdev(baseball_ages)
print(f"The sample standard deviation using statistics = {sample_std:.6f}")

The sample mean of the baseball players ages = 28.64
The population standard deviation using statistics = 5.606282
The sample standard deviation using statistics = 5.721888


Because the sample formula divides by `n - 1` while the population formula divides by `n`, the sample standard deviation is slightly larger than the population standard deviation for the same data. Here the two differ by about 0.12 years (5.72 vs 5.61), and the gap shrinks as `n` grows.
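
Since the two formulas share the same sum of squared deviations, the ratio of the sample to the population standard deviation should equal $\sqrt{n/(n-1)}$. A minimal sketch that checks this on the baseball ages (the variable names `ratio` and `expected_ratio` are just for illustration):

```python
import math
import statistics

baseball_ages = [21, 21, 22, 23, 24, 24, 25, 25, 28, 29, 29, 31, 32,
                 33, 33, 34, 25, 25, 26, 26, 26, 38, 38, 38, 40]
n = len(baseball_ages)

population_std = statistics.pstdev(baseball_ages)
sample_std = statistics.stdev(baseball_ages)

# Ratio observed in the data vs. the ratio predicted by the two divisors
ratio = sample_std / population_std
expected_ratio = math.sqrt(n / (n - 1))

print(f"Observed ratio = {ratio:.6f}")
print(f"sqrt(n / (n - 1)) = {expected_ratio:.6f}")
```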

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
## 🔖 **$\color{blue}{\textbf{Lab Work 2 }}$ Calculate Standard Deviation of A Dataset in the Pandas DataFrame with the [URL](https://raw.githubusercontent.com/plotly/datasets/master/violin_data.csv)**

### **Create a dataframe**

In [3]:
import pandas as pd
tip_data = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/violin_data.csv")
tip_data.std(numeric_only=True)
tip_data.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


### **Calculate the Standard Deviation for each <font color = "red" > Numerical Variable </font> using population standard deviation in `numpy`**

In [4]:
# Using four decimal places in the output here to show the difference between the population and sample standard deviation
tip_pstd = np.std(tip_data.tip)
print(f"The population standard deviation of tips = {tip_pstd:.4f} from the mean = {tip_data.tip.mean():.2f}")
total_bill_pstd = np.std(tip_data.total_bill)
print(f"The population standard deviation of total bills = {total_bill_pstd:.4f} from the mean = {tip_data.total_bill.mean():.2f}")
total_size_pstd = np.std(tip_data["size"])
print(f"The population standard deviation of size = {total_size_pstd:.4f} from the mean = {tip_data['size'].mean():.2f}")

The population standard deviation of tips = 1.3808 from the mean = 3.00
The population standard deviation of total bills = 8.8842 from the mean = 19.79
The population standard deviation of size = 0.9491 from the mean = 2.57


### **Calculate Standard Deviation for Each Variable using sample standard deviation in `statistics`**

In [5]:
# Using four decimal places in the output here to show the difference between the population and sample standard deviation
tip_sstd = statistics.stdev(tip_data.tip)
print(f"The sample standard deviation of tips = {tip_sstd:.4f} from the mean = {tip_data.tip.mean():.2f}")
total_bill_sstd = statistics.stdev(tip_data.total_bill)
print(f"The sample standard deviation of total bills = {total_bill_sstd:.4f} from the mean = {tip_data.total_bill.mean():.2f}")
total_size_sstd = statistics.stdev(tip_data["size"])
print(f"The sample standard deviation of size = {total_size_sstd:.4f} from the mean = {tip_data['size'].mean():.2f}")

The sample standard deviation of tips = 1.3836 from the mean = 3.00
The sample standard deviation of total bills = 8.9024 from the mean = 19.79
The sample standard deviation of size = 0.9511 from the mean = 2.57


### 🧚 **$\color{orange}{\textbf{Reminder:}}$** For a **large** sample size, use the formula for either **sample** or **population** Standard Deviation.
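
The reason either formula works for a large sample is that the factor separating them, $\sqrt{n/(n-1)}$, approaches 1 as $n$ grows. A short sketch of that factor for a few arbitrary sample sizes (the helper name `std_ratio` is hypothetical):

```python
import math


def std_ratio(n):
    """Ratio of sample to population standard deviation for n data points."""
    return math.sqrt(n / (n - 1))


# Arbitrary sample sizes chosen for illustration
for n in (5, 25, 100, 1000, 10000):
    print(f"n = {n:>5}: sample std / population std = {std_ratio(n):.5f}")
```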

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
## **$\color{green}{\textbf{TO DO 2 }}$** Calculate Standard Deviation of A Dataset in the Pandas DataFrame with the [URL](https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv)

### **1. Create a dataframe**

In [6]:
url = "https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv"
iris_df = pd.read_csv(url)
iris_df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


### **2. Calculate the Standard Deviation for each <font color = "red" > Numerical Variable </font> using population standard deviation in `numpy`**

In [7]:
# Using four decimal places in the output here to show the difference between the population and sample standard deviation
sepal_length_pstd = np.std(iris_df.sepal_length)
sepal_width_pstd = np.std(iris_df.sepal_width)
petal_length_pstd = np.std(iris_df.petal_length)
petal_width_pstd = np.std(iris_df.petal_width)
print(f"The population standard deviation of sepal lengths = {sepal_length_pstd:.4f} from the mean = {iris_df.sepal_length.mean():.2f}")
print(f"The population standard deviation of sepal width = {sepal_width_pstd:.4f} from the mean = {iris_df.sepal_width.mean():.2f}")
print(f"The population standard deviation of petal lengths = {petal_length_pstd:.4f} from the mean = {iris_df.petal_length.mean():.2f}")
print(f"The population standard deviation of petal width = {petal_width_pstd:.4f} from the mean = {iris_df.petal_width.mean():.2f}")

The population standard deviation of sepal lengths = 0.8253 from the mean = 5.84
The population standard deviation of sepal width = 0.4321 from the mean = 3.05
The population standard deviation of petal lengths = 1.7585 from the mean = 3.76
The population standard deviation of petal width = 0.7606 from the mean = 1.20


### **3. Calculate Standard Deviation for the variables "sepal_length", "sepal_width", "petal_length", and "petal_width" using sample standard deviation in `statistics` and then compare the results with the population std from `numpy` above**

In [8]:
# Using four decimal places in the output here to show the difference between the population and sample standard deviation
sepal_length_sstd = statistics.stdev(iris_df.sepal_length)
sepal_width_sstd = statistics.stdev(iris_df.sepal_width)
petal_length_sstd = statistics.stdev(iris_df.petal_length)
petal_width_sstd = statistics.stdev(iris_df.petal_width)
print(f"The sample standard deviation of sepal lengths = {sepal_length_sstd:.4f} from the mean = {iris_df.sepal_length.mean():.2f}")
print(f"The sample standard deviation of sepal width = {sepal_width_sstd:.4f} from the mean = {iris_df.sepal_width.mean():.2f}")
print(f"The sample standard deviation of petal lengths = {petal_length_sstd:.4f} from the mean = {iris_df.petal_length.mean():.2f}")
print(f"The sample standard deviation of petal width = {petal_width_sstd:.4f} from the mean = {iris_df.petal_width.mean():.2f}")

The sample standard deviation of sepal lengths = 0.8281 from the mean = 5.84
The sample standard deviation of sepal width = 0.4336 from the mean = 3.05
The sample standard deviation of petal lengths = 1.7644 from the mean = 3.76
The sample standard deviation of petal width = 0.7632 from the mean = 1.20


### 🧚 **$\color{orange}{\textbf{Reminder:}}$** For a **large** sample size, use the formula for either **sample** or **population** Standard Deviation.

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
## 🔖 **$\color{blue}{\textbf{Lab Work 3:}}$ Compute and visualize the Center (mean) and spread (Standard Deviation) of Annual CPI data with a single Line Graph or Time Series**

<details>
  <summary><b>Show description</b></summary>
The following data shows the Annual Consumer Price Index, each month, for ten years. Construct a time series graph for the Annual Consumer Price Index data only.


|Year|Jan|Feb|Mar|Apr|May|Jun|Jul|	Aug |	Sep |Oct |	Nov |	Dec |	Annual|
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
|2003 |	181.7|	183.1|	184.2|	183.8|	183.5|	183.7|	183.9|	184.6|	185.2|	185.0|	184.5|	184.3|	184.0|
|2004 |	185.2|	186.2|	187.4|	188.0|	189.1 |	189.7 |	189.4|	189.5|	189.9|	190.9|	191.0|	190.3|	188.9|
|2005 |	190.7|	191.8|	193.3|	194.6|	194.4|	194.5|	195.4 |	196.4|	198.8|	199.2|	197.6|	196.8|	195.3|
|2006 |	198.3|	198.7|	199.8|	201.5|	202.5|	202.9|	203.5 |	203.9|	202.9|	201.8|	201.5|	201.8|	201.6|
|2007 |	202.416|	203.499|	205.352|	206.686|	207.949|	208.352 |	208.299|	207.917|	208.490|	208.936|	210.177|	210.036|	207.342|
|2008 |	211.080|	211.693|	213.528|	214.823|	216.632|	218.815|	219.964	| 219.086|	218.783|	216.573|	212.425|	210.228|	215.303|
|2009 |	211.080|	212.193|	212.709|	213.240|	213.856|	215.693|	215.351|	215.834|	215.969|	216.177|	216.330|	215.949|	214.537|
|2010 |	216.687|	216.741|	217.637|	218.009|	218.178|	217.965|	218.011|	218.312|	218.439|	218.711|	218.803|	219.179|	218.056|
|2011 |	220.223|	221.30|	223.467|	224.906|	225.964|	225.722	|225.922|	226.545|	226.889|	226.421|	226.230|	225.672|	224.939|
|2012 |	226.665|	227.663|	229.392|	230.085|	229.815|	229.478|	229.104|	230.379|	231.407|	231.317|	230.221|	229.601|	229.594|


</details>

In [9]:
# @title <font size = 5 color = "magenta"> <b>Reminder: Run this code cell first </b></font>  **to create a data frame**
import pandas as pd
google_sheet_url = "https://docs.google.com/spreadsheets/d/11ZwX_Z4baEWTD0S523UDqJP2NK_b3j7P_aoDOHIksa0/edit#gid=0"

# convert the file format to csv
url = google_sheet_url.replace("edit#", "export?format=csv&")
df_cpi = pd.read_csv(url)
df_cpi


Unnamed: 0,date,CPI
0,2003-01-01,181.700
1,2003-02-01,183.100
2,2003-03-01,184.200
3,2003-04-01,183.800
4,2003-05-01,183.500
...,...,...
115,2012-08-01,230.379
116,2012-09-01,231.407
117,2012-10-01,231.317
118,2012-11-01,230.221


### **A.** Compute descriptive statistics of the data: mean, median, and standard deviation


In [10]:
df_cpi.dtypes
cpi_mean = np.mean(df_cpi.CPI)
cpi_median = np.median(df_cpi.CPI)
cpi_std = np.std(df_cpi.CPI)
print(f"""Annual CPI Statistics:
Mean CPI = {cpi_mean:.3f}
Median CPI = {cpi_median:.3f}
Standard deviation = {cpi_std:.3f}""")

Annual CPI Statistics:
Mean CPI = 207.949
Median CPI = 210.202
Standard deviation = 14.614


### **B.** Create a Time Series Graph (CPI vs Date)


<details>
<summary>Options for visualizing points on the graph</summary>

In Plotly's `go.Scatter`, which is a graph object class for scatter plots in Plotly, the `mode` attribute specifies how the points in the scatter plot are visualized. It determines if the scatter plot is displayed with markers (points), lines, or both.

The `mode` attribute can take the following values:

1. **`'markers'`**: Displays the data points as individual markers (like dots or other shapes). This mode is typically used when you want to emphasize each individual data point.

2. **`'lines'`**: Connects the data points with lines. This mode is useful when you want to show a trend or relationship between the points, or when plotting a continuous dataset.

3. **`'lines+markers'`**: Combines both markers and lines. This mode is used when you want to highlight individual data points along with the trend or relationship between them.

4. **`'text'`**: Displays only text labels at each data point (or at specified points). You can combine it with other modes to have text along with markers or lines (e.g., `'markers+text'` or `'lines+text'`).

Here's an example of using `mode` in `go.Scatter`:

```python
import plotly.graph_objects as go

# Sample data
x_data = [1, 2, 3, 4, 5]
y_data = [2, 1, 3, 5, 4]

# Create scatter plot
fig = go.Figure(data=go.Scatter(x=x_data, y=y_data, mode='markers'))

# Show the plot
fig.show()
```

In this example, a scatter plot is created with data points represented as markers. By changing the `mode` parameter, you can alter the visual representation of the scatter plot in the Plotly graph.
</details>

In [19]:
import plotly.graph_objects as go
import datetime

# Create plot area
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        # horizontal axis, or "independent variable"
        x=df_cpi.date,
        # vertical axis, or "dependent variable"
        y=df_cpi.CPI,
        mode="markers+lines",
        marker=dict(
            color="blue",
            colorscale="jet",
            size=4
        ),
        line=dict(
            color="firebrick",
            width=2
        )
    )
)
# Configure the layout, then add the mean, median, and one std above/below as horizontal lines

fig.update_layout(
    width=1200,
    height=600,
    title="Monthly CPI",
    title_x=0.5,
    yaxis_title="Consumer Price Index (CPI)",
    xaxis_title="Years",
    font=dict(
        family="courier New, Monospace",
        size=18,
        color="green"
    )
)
fig.add_hline(
    y=cpi_mean,
    line=dict(
        color="red",
        width=2,
        dash="dash",
    ),
    annotation_text="Mean CPI",
    annotation_position="bottom left"
)
fig.add_hline(
    y=cpi_median,
    line=dict(
        color="blue",
        width=2,
        dash="dash"
    ),
    annotation_text="Median CPI",
    annotation_position="top left"
)
fig.add_hline(
    y=cpi_mean + cpi_std,
    line=dict(
        color="gray",
        width=2,
        dash="dash",
    ),
    annotation_text="Mean CPI + 1 STD",
    annotation_position="top left"
)
fig.add_hline(
    y=cpi_mean - cpi_std,
    line=dict(
        color="gray",
        width=2,
        dash="dash",
    ),
    annotation_text="Mean CPI - 1 STD",
    annotation_position="bottom left"
)
fig.show()

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
## 🔖 **$\color{blue}{\textbf{Lab Work 4:}}$ Compute std in a `pandas` Data Frame and Visualize the Center (mean) and spread (Standard Deviation) as a Group of Line Graphs or Time Series**

<details>
  <summary><b>Show description</b></summary>

Data for carbon dioxide emissions include gases from the burning of fossil fuels and cement manufacture, but exclude emissions from land use such as deforestation. The unit of measurement is kt (kiloton). Carbon dioxide emissions are often calculated and reported as elemental carbon.

The following table is a portion of a data set from [databank.worldbank.org](https://databank.worldbank.org/home).

|Years |	Ukraine| United Kingdom|United States|
|--|--|--|--|
|2003 |	352,259|	540,640|	5,681,664|
|2004 |	343,121|	540,409|	5,790,761|
|2005 |	339,029|	541,990|	5,826,394|
|2006 |	327,797|	542,045|	5,737,615|
|2007 |	328,357|	528,631 |	5,828,697 |
|2008 |	323,657 |	522,247|	5,656,839|
|2009 |	272,176|	474,579 |	5,299,563|


  </details>

---
#### **Syntax for computing mean, median, and standard deviation of a `pandas data frame`**

``` python
dataframe["column title"].mean()
dataframe["column title"].median()
dataframe["column title"].std()
```
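
Note that the `pandas` `std()` method divides by $n-1$ by default, so it returns the sample standard deviation. As a sketch of an alternative to calling each method separately, `agg` can compute all three statistics for several columns at once; the name `df_emissions` is hypothetical and the values are the emission figures from the table above.

```python
import pandas as pd

# Emission values (kt) copied from the table in the description above
df_emissions = pd.DataFrame({
    "years": [2003, 2004, 2005, 2006, 2007, 2008, 2009],
    "Ukraine": [352259, 343121, 339029, 327797, 328357, 323657, 272176],
    "United Kingdom": [540640, 540409, 541990, 542045, 528631, 522247, 474579],
    "United States": [5681664, 5790761, 5826394, 5737615, 5828697, 5656839, 5299563],
})

countries = ["Ukraine", "United Kingdom", "United States"]

# One row per statistic, one column per country; "std" uses ddof=1 (sample)
summary = df_emissions[countries].agg(["mean", "median", "std"])
print(summary.round(1))
```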
---

In [20]:
# @title **NOTE:** $\color{magenta}{\textbf{Run this code cell first}}$ to activate the Emission data set.
# create the data frame first
import pandas as pd

data = {"years": [2003, 2004, 2005, 2006, 2007, 2008, 2009],
        "Ukraine": [352259, 343121, 339029, 327797, 328357, 323657, 272176],
        "United Kingdom": [ 540640,540409, 541990,542045, 528631, 522247, 474579],
        "United States": [5681664,5790761, 5826394, 5737615, 5828697, 5656839, 5299563]
       }
df = pd.DataFrame(data)
df

Unnamed: 0,years,Ukraine,United Kingdom,United States
0,2003,352259,540640,5681664
1,2004,343121,540409,5790761
2,2005,339029,541990,5826394
3,2006,327797,542045,5737615
4,2007,328357,528631,5828697
5,2008,323657,522247,5656839
6,2009,272176,474579,5299563


### **A.** Compute descriptive statistics of the data: mean, median, and standard deviation of emissions in kiloton


In [57]:
uk_mean = df["United Kingdom"].mean()
uk_median = df["United Kingdom"].median()
uk_std = statistics.stdev(df["United Kingdom"])
print(f"""Carbon Emission Statistics (UK):
Mean Carbon Emissions= {uk_mean:.1f}
Median Carbon Emissions = {uk_median:.1f}
Standard Deviation Carbon Emissions = {uk_std:.1f}
""")

ukraine_mean = df["Ukraine"].mean()
ukraine_median = df["Ukraine"].median()
ukraine_std = statistics.stdev(df["Ukraine"])
print(f"""Carbon Emission Statistics (Ukraine):
Mean Carbon Emissions = {ukraine_mean:.3f}
Median Carbon Emissions = {ukraine_median:.3f}
Standard Deviation Carbon Emissions = {ukraine_std:.3f}
""")

us_mean = df["United States"].mean()
us_median = df["United States"].median()
us_std = statistics.stdev(df["United States"])
print(f"""Carbon Emission Statistics (United States):
Mean Carbon Emissions = {us_mean:.3f}
Median Carbon Emissions = {us_median:.3f}
Standard Deviation Carbon Emissions = {us_std:.3f}
""")

Carbon Emission Statistics (UK):
Mean Carbon Emissions= 527220.1
Median Carbon Emissions = 540409.0
Standard Deviation Carbon Emissions = 24460.1

Carbon Emission Statistics (Ukraine):
Mean Carbon Emissions = 326628.000
Median Carbon Emissions = 328357.000
Standard Deviation Carbon Emissions = 26015.877

Carbon Emission Statistics (United States):
Mean Carbon Emissions = 5688790.429
Median Carbon Emissions = 5737615.000
Standard Deviation Carbon Emissions = 184327.652



#### **B. Graph the raw data, a single time series**

Other options for mode: ```mode = "markers" ```, ``` "lines" ```, or ```"markers+lines" ```

In [71]:
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        # horizontal axis, or "independent variable"
        x=df.years,
        # vertical axis, or "dependent variable"
        y=df["United Kingdom"],
        name="United Kingdom",
        mode="markers+lines",
        marker=dict(
            color="blue",
            colorscale="jet",
            size=4
        ),
        line=dict(
            color="firebrick",
            width=2
        )
    )
)
fig.add_trace(
    go.Scatter(
        # horizontal axis, or "independent variable"
        x=df.years,
        # vertical axis, or "dependent variable"
        y=df["Ukraine"],
        name="Ukraine",
        mode="markers+lines",
        marker=dict(
            color="black",
            colorscale="jet",
            size=4
        ),
        line=dict(
            color="black",
            width=2
        )
    )
)

fig.add_trace(
    go.Scatter(
        # horizontal axis, or "independent variable"
        x=df.years,
        # vertical axis, or "dependent variable"
        y=df["United States"],
        name="United States",
        mode="markers+lines",
        marker=dict(
            color="firebrick",
            colorscale="jet",
            size=4
        ),
        line=dict(
            color="blue",
            width=2
        )
    )
)

fig.add_hline(
    y=uk_mean,
    line=dict(
        color="gray",
        width=2,
        dash="dash",
    ),
    annotation_text="UK Mean",
    annotation_position="bottom left"
)

fig.add_hline(
    y=uk_median,
    line=dict(
        color="gray",
        width=2,
        dash="dash",
    ),
    annotation_text="UK Median",
    annotation_position="top left"
)

fig.add_hline(
    y=ukraine_mean,
    line=dict(
        color="gray",
        width=2,
        dash="dash",
    ),
    annotation_text="Ukraine Mean",
    annotation_position="bottom right"
)

fig.add_hline(
    y=ukraine_median,
    line=dict(
        color="gray",
        width=2,
        dash="dash",
    ),
    annotation_text="Ukraine Median",
    annotation_position="top right"
)

fig.add_hline(
    y=us_mean,
    line=dict(
        color="gray",
        width=2,
        dash="dash",
    ),
    annotation_text="U.S. Mean",
    annotation_position="bottom right"
)

fig.add_hline(
    y=us_median,
    line=dict(
        color="gray",
        width=2,
        dash="dash",
    ),
    annotation_text="U.S. Median",
    annotation_position="top right"
)

fig.update_layout(
    width=1200,
    height=600,
    title=r"$CO_2 \text{ Emissions between 2003 and 2009}$",
    title_x=0.5,
    yaxis_title=r"$CO_2 \text{ Emissions (kt)}$",
    xaxis_title="Years",
    font=dict(
        family="courier New, Monospace",
        size=18,
        color="black"
    )
)
fig.show()

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
## **$\color{green}{\textbf{ TODO 3: }}$** Work with live Federal Reserve Economic Data (**FRED**), which is updated each month.


---
### **1. Data Preparation: <font size = 5 color = "magenta"> import and clean the live data.**

>#### **Step 1: Understand the Data:** Each data set is retrieved with a code made up of a few letters.  We will use the following data sets.

>> **`UNRATE`:** [Unemployment rate](https://fred.stlouisfed.org/series/UNRATE)

>> **`FEDFUNDS`:** [Federal funds rate (monthly)](https://fred.stlouisfed.org/searchresults/?st=fedfunds)

>> **`INDPRO`:** [Industrial Production: Total Index](https://fred.stlouisfed.org/series/INDPRO)

>#### **Step 2: Install the `pandas-datareader` package for reading data from `FRED`**

```python
# code to install pandas-datareader
!pip install pandas-datareader
```


In [59]:
!pip install pandas-datareader



>#### **Step 3: Read the data into a `pandas` data frame and save it in a variable with a name of your choice**

```python
# code to retrieve data into pandas data frame with pandas.datareader
import pandas_datareader as pdr
pdr.get_data_fred(["UNRATE", "FEDFUNDS", "INDPRO"])  # FRED = Federal Reserve Economic Data
```

In [65]:
import pandas_datareader as pdr
df_fed = pdr.get_data_fred(["UNRATE", "FEDFUNDS", "INDPRO"])
df_fed.head()

Unnamed: 0_level_0,UNRATE,FEDFUNDS,INDPRO
DATE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2020-04-01,14.8,0.05,84.6812
2020-05-01,13.2,0.05,86.0108
2020-06-01,11.0,0.08,91.6745
2020-07-01,10.2,0.09,95.0037
2020-08-01,8.4,0.1,95.9294


>#### **Step 4: Data Cleaning:**  Write code to remove any "NaN" values from df_fed (or whatever name you chose) with inplace = True

>>#### **Reference to the code for removing all Nan in the video for LAB 2.3 Operation of the pandas data frame Part IV**

>>This is live data, so it should be updated at the beginning of each period (each month, in this case), but that does not always happen on time. If the most recent month's value is missing, the last row will contain "NaN" (Not a Number) instead. We need to do "data cleaning" by removing any row containing "NaN" before doing numerical computations.


In [68]:
# Remove any rows that have "NaN" in them
df_fed.dropna(how="any", inplace=True)
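
To see what `dropna` does without relying on the live data, here is a minimal sketch on a tiny made-up frame whose last row is incomplete. The frame `df_demo`, its column names, and all of its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for a live data set whose latest month is incomplete
df_demo = pd.DataFrame(
    {"series_a": [1.0, 2.0, 3.0], "series_b": [10.0, 20.0, np.nan]},
    index=pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
)

# Count the missing values in each column before cleaning
print(df_demo.isna().sum())

# how="any" drops every row that has at least one NaN.
# Without inplace=True, a new frame is returned and df_demo is unchanged.
df_clean = df_demo.dropna(how="any")
print(df_clean)
```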

### **2. Measurement of center and spread:** Compute the descriptive statistics (mean, median, and standard deviation) of the variable you chose and, as usual, print them with an f-string and a good description.

In [70]:
unrate_mean = np.mean(df_fed.UNRATE)
unrate_median = np.median(df_fed.UNRATE)
unrate_sstd = statistics.stdev(df_fed.UNRATE)
print(f"""Unemployment Rate Statistics:
Mean Unemployment Rate = {unrate_mean:.3f}
Median Unemployment Rate = {unrate_median:.3f}
Standard Deviation Unemployment Rate = {unrate_sstd:.3f}
""")

fedfunds_mean = np.mean(df_fed.FEDFUNDS)
fedfunds_median = np.median(df_fed.FEDFUNDS)
fedfunds_sstd = statistics.stdev(df_fed.FEDFUNDS)
print(f"""Federal Funds Rate Statistics:
Mean Federal Funds Rate = {fedfunds_mean:.3f}
Median Federal Funds Rate = {fedfunds_median:.3f}
Standard Deviation Federal Funds Rate = {fedfunds_sstd:.3f}
""")

indpro_mean = np.mean(df_fed.INDPRO)
indpro_median = np.median(df_fed.INDPRO)
indpro_sstd = statistics.stdev(df_fed.INDPRO)
print(f"""Industrial Production Statistics:
Mean Industrial Production = {indpro_mean:.3f}
Median Industrial Production = {indpro_median:.3f}
Standard Deviation Industrial Production = {indpro_sstd:.3f}
""")

Unemployment Rate Statistics:
Mean Unemployment Rate = 4.980
Median Unemployment Rate = 4.000
Standard Deviation Unemployment Rate = 2.360

Federal Funds Rate Statistics:
Mean Federal Funds Rate = 2.586
Median Federal Funds Rate = 2.560
Standard Deviation Federal Funds Rate = 2.362

Industrial Production Statistics:
Mean Industrial Production = 100.646
Median Industrial Production = 102.519
Standard Deviation Industrial Production = 3.932



---
### **3. Visualization:**
* Create an interactive time series graph of a variable of your choice
* Create horizontal lines to indicate the mean, the median, and two std above and below the mean, with proper annotation texts.
* Create labels for the axes and a graph title.

### <font color = "red"><b> Note: for x variable in go.Scatter </b></font>  We want the x axis to be labeled by the date.  Use `x = your_data_frame.index` because the date is the row ID (index) in this data frame. If you use your own data frame name (not df_fed), make the appropriate adjustment.

In [79]:
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        # horizontal axis, or "independent variable"
        x=df_fed.index,
        # vertical axis, or "dependent variable"
        y=df_fed.UNRATE,
        name="Unemployment Rate",
        mode="markers+lines",
        marker=dict(
            color="blue",
            colorscale="jet",
            size=4
        ),
        line=dict(
            color="firebrick",
            width=2
        )
    )
)

fig.add_hline(
    y=unrate_mean,
    line=dict(
        color="gray",
        width=2,
        dash="dash",
    ),
    annotation_text="Unemployment Rate Mean",
    annotation_position="top left"
)

fig.add_hline(
    y=unrate_median,
    line=dict(
        color="gray",
        width=2,
        dash="dash",
    ),
    annotation_text="Unemployment Rate Median",
    annotation_position="bottom left"
)

fig.add_hline(
    y=unrate_mean + (unrate_sstd * 2),
    line=dict(
        color="gray",
        width=2,
        dash="dash",
    ),
    annotation_text="Unemployment Rate +2 STD",
    annotation_position="top right"
)

fig.add_hline(
    y=unrate_mean - (unrate_sstd * 2),
    line=dict(
        color="gray",
        width=2,
        dash="dash",
    ),
    annotation_text="Unemployment Rate -2 STD",
    annotation_position="top right"
)

fig.update_layout(
    width=1200,
    height=600,
    title="United States Unemployment Rate",
    title_x=0.5,
    yaxis_title="Unemployment Rate",
    xaxis_title="Year",
    font=dict(
        family="courier New, Monospace",
        size=18,
        color="black"
    )
)
fig.show()

### **4. Discussion:** Use the graph to discuss the implications of the data.  Do you notice something unusual around 2020? Explain the peaks or valleys using the events that happened around those times.

The major spike in early 2020 was due to the Covid-19 pandemic, where the response was to lock down for multiple weeks or months to attempt to contain the spread of the disease. Because a large number of people were not working and many others were not spending money in the same way that they had before, the unemployment rate spiked very high, very fast. Over the next few years, the unemployment rate came back down as things stabilized and people largely returned to their previous habits.