<p style="text-align:center">
    <a href="https://aivietnam.edu.vn/" target="_blank">
    <img src="https://drive.google.com/uc?id=1DT9Y0yqoSuvwXmUZ6FhtWIxJ0WKUbC-m"  width='26%' height='26%' alt="Skills Network Logo"  />
    </a>
</p>

# **1. Line Plots**

### **1.1 - Load your time series data**

The most common way to import time series data in Python is by using the `pandas` library. You can use the `read_csv()` function from `pandas` to read the contents of a file into a DataFrame. This can be achieved using the following command:

`df = pd.read_csv("name_of_your_file.csv")`

Once your data is loaded into Python, you can display the first rows of your DataFrame by calling the `.head(n=5)` method, where `n=5` indicates that you want to print the first five rows of your DataFrame.
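As a minimal sketch of the two calls above, the snippet below reads a small CSV from an in-memory string instead of a file. The column names and values are made up for illustration and are not the course dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """date,value
2020-01-01,3
2020-02-01,5
2020-03-01,4
"""

# read_csv accepts a path, a URL, or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows (n defaults to 5)
first_rows = df.head(n=2)
print(first_rows)
```

Passing a file-like object keeps the example self-contained; with a real dataset you would pass the file path or URL instead.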

In this exercise, you will read in a time series dataset that contains the number of "great" inventions and scientific discoveries from 1860 to 1959, and display its first five rows.

<details>
  <summary><font size="3" color="blue"><b>Instructions</b></font></summary>   

  * Import the `pandas` library using the `pd` alias.
  * Read in the time series data from the csv file located at `url_discoveries` into a DataFrame called `discoveries`.
  * Print the first 5 lines of the DataFrame using the `.head()` method.

</details>

<details>
  <summary><font size="3" color="green"><b>Solution</b></font></summary>   

```
# Import pandas
import pandas as pd

# Read in the file content in a DataFrame called discoveries
discoveries = pd.read_csv(url_discoveries)

# Display the first five lines of the DataFrame
print(discoveries.head())
```

</details>

In [None]:
# Import pandas
____

# Read in the file content in a DataFrame called discoveries
discoveries = ____(url_discoveries)

# Display the first five lines of the DataFrame
print(discoveries.____)

---

### **1.2 - Test whether your data is of the correct type**

When working with time series data in `pandas`, any date information should be formatted as a `datetime64` type. Therefore, it is important to check that the columns containing the date information are of the correct type. You can check the type of each column in a DataFrame by using the `.dtypes` attribute. Fortunately, if your date columns come as strings, epochs, etc… you can use the `to_datetime()` function to convert them to the appropriate `datetime64` type:

`df['date_column'] = pd.to_datetime(df['date_column'])`

In this exercise, you will learn how to check the data type of the columns in your time series data and convert a date column to the appropriate `datetime` type.
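The conversion can be sketched on a small made-up DataFrame, one column holding date strings and one holding Unix epochs in seconds. The column names are hypothetical, and the `unit="s"` argument reflects the assumption that the epochs count seconds.

```python
import pandas as pd

# Hypothetical frame whose dates arrive as strings and as Unix epochs (seconds)
df = pd.DataFrame({
    "date_str": ["2020-01-01", "2020-01-02"],
    "date_epoch": [1577836800, 1577923200],
})
print(df.dtypes)  # a string/object dtype and an integer dtype before conversion

# Strings are parsed directly
df["date_str"] = pd.to_datetime(df["date_str"])

# unit="s" tells pandas the integers count seconds since 1970-01-01
df["date_epoch"] = pd.to_datetime(df["date_epoch"], unit="s")
print(df.dtypes)  # both columns now have a datetime64 dtype
```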

<details>
  <summary><font size="3" color="blue"><b>Instructions</b></font></summary>   

  * Print out the data type of each column in the `discoveries` object.
  * Convert the `date` column in the `discoveries` DataFrame to the `datetime` type.
  * Print out the data type of each column in the `discoveries` object again to check that your conversion worked.

</details>

<details>
  <summary><font size="3" color="green"><b>Solution</b></font></summary>   

```
# Print the data type of each column in discoveries
print(discoveries.dtypes)

# Convert the date column to a datestamp type
discoveries['date'] = pd.to_datetime(discoveries['date'])

# Print the data type of each column in discoveries, again
print(discoveries.dtypes)
```

</details>

In [None]:
# Print the data type of each column in discoveries
print(discoveries____)

# Convert the date column to a datestamp type
discoveries['date'] = ____

# Print the data type of each column in discoveries, again
 ____

---

### **1.3 - Your first plot**

Let's take everything you have learned so far and plot your first time series plot. You will set the groundwork by producing a time series plot of your data and labeling the axes of your plot, as this makes the plot more readable and interpretable for the intended audience.

`matplotlib` is the most widely used plotting library in Python, and would be the most appropriate tool for this job. Fortunately for us, the `pandas` library has implemented a `.plot()` method on Series and DataFrame objects that is a wrapper around `matplotlib.pyplot.plot()`, which makes it easier to produce plots.

<details>
  <summary><font size="3" color="blue"><b>Instructions</b></font></summary>   

  * Set the `date` column as the index of your DataFrame.
  * Using the `discoveries` DataFrame, plot the time series in your DataFrame using a "blue" line plot and assign it to `ax`.
  * Specify the x-axis label on your plot: `Date`.
  * Specify the y-axis label on your plot: `Number of great discoveries`.

</details>

<details>
  <summary><font size="3" color="green"><b>Solution</b></font></summary>   

```
# Import matplotlib.pyplot so that plt.show() is available
import matplotlib.pyplot as plt

# Set the date column as the index of your DataFrame discoveries
discoveries = discoveries.set_index('date')

# Plot the time series in your DataFrame
ax = discoveries.plot(color='blue')

# Specify the x-axis label in your plot
ax.set_xlabel('Date')

# Specify the y-axis label in your plot
ax.set_ylabel('Number of great discoveries')

# Show plot
plt.show()
```

</details>

In [None]:
# Import matplotlib.pyplot so that plt.show() is available
import matplotlib.pyplot as plt

# Set the date column as the index of your DataFrame discoveries
discoveries = ____

# Plot the time series in your DataFrame
ax = discoveries.____(____=____)

# Specify the x-axis label in your plot
____('Date')

# Specify the y-axis label in your plot
____('Number of great discoveries')

# Show plot
plt.show()

---

### **1.4 - Specify plot styles**

The `matplotlib` library also comes with a number of built-in stylesheets that allow you to customize the appearance of your plots. To use a particular style sheet for your plots, you can use the command `plt.style.use(your_stylesheet)` where `your_stylesheet` is the name of the style sheet.

In order to see the list of available style sheets that can be used, you can use the command `print(plt.style.available)`. For the rest of this course, we will use the awesome `fivethirtyeight` style sheet.
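The sketch below lists the available style sheets and applies one of them temporarily. It uses `plt.style.context()`, which limits the style to a `with` block, as an alternative to the global `plt.style.use()` call; the plotted numbers are placeholders.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Names of all style sheets known to this matplotlib installation
print(plt.style.available)

# style.context applies a style sheet only inside the with-block,
# whereas plt.style.use changes the style for every later plot
with plt.style.context("ggplot"):
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [2, 1, 3])  # placeholder data
    ax.set_title("ggplot Style")
```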

<details>
  <summary><font size="3" color="blue"><b>Instructions</b></font></summary>   

  * Import `matplotlib.pyplot` using its usual alias `plt`.
  * Use the `fivethirtyeight` style sheet to plot a line plot of the `discoveries` data.
  * Use the `ggplot` style sheet to plot a line plot of the `discoveries` data.
  * Set the title of your second plot as `ggplot Style`.

</details>

<details>
  <summary><font size="3" color="green"><b>Solution - 1</b></font></summary>   

```
# Import the matplotlib.pyplot sub-module
import matplotlib.pyplot as plt

# Use the fivethirtyeight style
plt.style.use('fivethirtyeight')

# Plot the time series
ax1 = discoveries.plot()
ax1.set_title('FiveThirtyEight Style')
plt.show()
```

</details>

<details>
  <summary><font size="3" color="green"><b>Solution - 2</b></font></summary>   

```
# Import the matplotlib.pyplot sub-module
import matplotlib.pyplot as plt

# Use the ggplot style
plt.style.use('ggplot')
ax2 = discoveries.plot()

# Set the title
ax2.set_title('ggplot Style')
plt.show()
```

</details>

In [None]:
# Import the matplotlib.pyplot sub-module
____

In [None]:
# Method 1: Use the fivethirtyeight style

plt.____('fivethirtyeight')

# Plot the time series
ax1 = discoveries.plot()
ax1.set_title('FiveThirtyEight Style')
plt.show()

In [None]:
# Method 2: Use the ggplot style
____
ax2 = discoveries.plot()

# Set the title
____
plt.show()

---

### **1.5 - Display and label plots**

As you saw earlier, if the index of a `pandas` DataFrame consists of dates, then `pandas` will automatically format the x-axis in a human-readable way. In addition, the `.plot()` method allows you to specify various other parameters to tailor your time series plot (color of the lines, width of the lines and figure size).

You may have noticed the notation `ax = df.plot(...)` and wondered about the purpose of the `ax` object. The `.plot()` method returns a `matplotlib` `Axes` object (displayed as `AxesSubplot` in older `matplotlib` versions), and it is common practice to assign this returned object to a variable called `ax`. Keeping a reference to it lets you add further annotations and specifications to your plot, such as axis labels and titles.
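To make the role of `ax` concrete, the following sketch plots a tiny made-up DataFrame and inspects the returned object. The data and the title text are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
from matplotlib.axes import Axes
import pandas as pd

# Hypothetical three-day series
df = pd.DataFrame(
    {"value": [1, 2, 3]},
    index=pd.date_range("2021-01-01", periods=3, freq="D"),
)

# .plot() forwards styling arguments and returns the Axes it drew on
ax = df.plot(color="blue", linewidth=2, figsize=(8, 3), fontsize=6)
print(isinstance(ax, Axes))

# The same object is then used for further annotations
ax.set_title("Hypothetical series", fontsize=8)

# The parent figure is reachable from the Axes as well
fig = ax.get_figure()
print(fig.get_size_inches())
```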

<details>
  <summary><font size="3" color="blue"><b>Instructions</b></font></summary>   

  Display a line chart of the `discoveries` DataFrame with the following settings:
  * Set the color of the line to `blue`.
  * Set the width of the line to 2.
  * Set the dimensions of your plot to a length of 8 and a width of 3.
  * Set the `fontsize` to 6.

</details>

<details>
  <summary><font size="3" color="green"><b>Solution</b></font></summary>   

```
# Plot a line chart of the discoveries DataFrame using the specified arguments
ax = discoveries.plot(color='blue', figsize=(8, 3), linewidth=2, fontsize=6)

# Specify the title in your plot
ax.set_title('Number of great inventions and scientific discoveries from 1860 to 1959', fontsize=8)

# Show plot
plt.show()
```

</details>

In [None]:
# Plot a line chart of the discoveries DataFrame using the specified arguments
ax = ____.____(____='blue', ____=(8, ____), ____=2, fontsize=____)

# Specify the title in your plot
ax.set_title('Number of great inventions and scientific discoveries from 1860 to 1959', fontsize=8)

# Show plot
plt.show()

---

### **1.6 - Subset time series data**

When plotting time series data, you may occasionally want to visualize only a subset of the data. The `pandas` library provides powerful indexing and subsetting methods that allow you to extract specific portions of a DataFrame. For example, you can subset all the data between 1950 and 1960 in the `discoveries` DataFrame by specifying the following date range:

`subset_data = discoveries['1950-01-01':'1960-01-01']`

Note: Subsetting your data this way is only possible if the index of your DataFrame contains dates of the `datetime` type. Failing that, the `pandas` library will return an error message.
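A small sketch of date-based slicing on made-up yearly data follows. It uses `.loc` with partial date strings, which behaves like the bracket notation above when the index is a `DatetimeIndex`; note that both endpoints of the slice are included.

```python
import pandas as pd

# Hypothetical yearly observations, one per January 1 from 1940 to 1959
idx = pd.to_datetime([f"{year}-01-01" for year in range(1940, 1960)])
df = pd.DataFrame({"value": list(range(len(idx)))}, index=idx)

# Partial-string slice: every row from the start of 1945 to the end of 1950
subset = df.loc["1945":"1950"]
print(subset.index.min(), subset.index.max())
print(len(subset))
```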

<details>
  <summary><font size="3" color="blue"><b>Instructions</b></font></summary>   

  * Use `discoveries` to create a new DataFrame `discoveries_subset_1` that contains all the data between January 1, 1945 and January 1, 1950.
  * Plot the time series of `discoveries_subset_1` using a "blue" line plot.
  * Use `discoveries` to create a new DataFrame `discoveries_subset_2` that contains all the data between January 1, 1939 and January 1, 1958.
  * Plot the time series of `discoveries_subset_2` using a "blue" line plot.

</details>

<details>
  <summary><font size="3" color="green"><b>Solution - 1</b></font></summary>   

```
# Select the subset of data between 1945 and 1950
discoveries_subset_1 = discoveries['1945':'1950']

# Plot the time series in your DataFrame as a blue line plot
ax = discoveries_subset_1.plot(color='blue', fontsize=15)

# Show plot
plt.show()
```

</details>

<details>
  <summary><font size="3" color="green"><b>Solution - 2</b></font></summary>   

```
# Select the subset of data between 1939 and 1958
discoveries_subset_2 = discoveries['1939':'1958']

# Plot the time series in your DataFrame as a blue line plot
ax = discoveries_subset_2.plot(color='blue', fontsize=15)

# Show plot
plt.show()
```

</details>

In [None]:
# Select the subset of data between 1945 and 1950
discoveries_subset_1 = discoveries['____':'____']

# Plot the time series in your DataFrame as a blue line plot
ax = discoveries_subset_1.____(color='blue', fontsize=15)

# Show plot
plt.show()

In [None]:
# Select the subset of data between 1939 and 1958
discoveries_subset_2 = ____

# Plot the time series in your DataFrame as a blue line plot
ax = discoveries_subset_2.____(color='blue', fontsize=15)

# Show plot
____

---

### **1.7 - Add vertical and horizontal markers**

Additional annotations can help further emphasize specific observations or events. Here, you will learn how to highlight significant events by adding markers at specific timestamps of your time series plot. The `matplotlib` library makes it possible to draw vertical and horizontal lines to identify particular dates.

Recall that the index of the `discoveries` DataFrame is of the `datetime` type, so the x-axis values of a plot will also contain dates, and it is possible to directly input a date when annotating your plots with vertical lines. For example, a vertical line at January 1, 1945 can be added to your plot by using the command:

`ax.axvline('1945-01-01', linestyle='--')`
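The sketch below adds one vertical and one horizontal marker to a made-up yearly series. As an assumption made for portability, it draws with `matplotlib` directly and passes the date as a `pd.Timestamp` rather than a string.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical yearly values from 1930 to 1949
idx = pd.to_datetime([f"{year}-01-01" for year in range(1930, 1950)])
values = [i % 5 for i in range(len(idx))]

fig, ax = plt.subplots()
ax.plot(idx, values, color="blue")

# Vertical marker at a date, horizontal marker at a y-value
ax.axvline(pd.Timestamp("1939-01-01"), color="red", linestyle="--")
ax.axhline(2, color="green", linestyle="--")
```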

<details>
  <summary><font size="3" color="blue"><b>Instructions</b></font></summary>   

  * Add a red vertical line at the date January 1, 1939 using the `.axvline()` method.
  * Add a green horizontal line at the y-axis value `4` using the `.axhline()` method.

</details>

<details>
  <summary><font size="3" color="green"><b>Solution</b></font></summary>   

```
# Plot the discoveries time series
ax = discoveries.plot(color='blue', fontsize=6)

# Add a red vertical line
ax.axvline('1939-01-01', color='red', linestyle='--')

# Add a green horizontal line
ax.axhline(4, color='green', linestyle='--')

plt.show()
```

</details>

In [None]:
# Plot the discoveries time series
ax = discoveries.plot(color='blue', fontsize=6)

# Add a red vertical line
ax.____(____, color=____, linestyle='--')

# Add a green horizontal line
ax.____(____, color=____, linestyle='--')

plt.show()


---


### **1.8 - Add shaded regions to your plot**

When plotting time series data in Python, it is also possible to highlight complete regions of your time series plot. In order to add a shaded region between January 1, 1936 and January 1, 1950, you can use the command:

`ax.axvspan('1936-01-01', '1950-01-01', color='red' , alpha=0.5)`

Here we specified the overall transparency of the region by using the `alpha` argument (where `0` is completely transparent and `1` is fully opaque).
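Shaded regions can be sketched the same way on made-up data. The dates, values and colors below are placeholders, and the series is drawn with `matplotlib` directly so that the span boundaries can be given as `pd.Timestamp` objects.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical yearly values from 1930 to 1949
idx = pd.to_datetime([f"{year}-01-01" for year in range(1930, 1950)])
values = [i % 5 for i in range(len(idx))]

fig, ax = plt.subplots()
ax.plot(idx, values, color="blue")

# Vertical band between two dates, horizontal band between two y-values
span_v = ax.axvspan(pd.Timestamp("1936-01-01"), pd.Timestamp("1940-01-01"),
                    color="red", alpha=0.3)
span_h = ax.axhspan(1, 3, color="green", alpha=0.3)
```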

<details>
  <summary><font size="3" color="blue"><b>Instructions</b></font></summary>   

  * Use the `.axvspan()` method to add a vertical red shaded region between the dates of January 1, 1900 and January 1, 1915 with a transparency of `0.3`.
  * Use the `.axhspan()` method to add a horizontal green shaded region between the values of 6 and 8 with a transparency of `0.3`.

</details>

<details>
  <summary><font size="3" color="green"><b>Solution</b></font></summary>   

```
# Plot the discoveries time series
ax = discoveries.plot(color='blue', fontsize=6)

# Add a vertical red shaded region
ax.axvspan('1900-01-01', '1915-01-01', color='red', alpha=.3)

# Add a horizontal green shaded region
ax.axhspan(6, 8, color='green', alpha=.3)

plt.show()
```

</details>

In [None]:
# Plot the discoveries time series
ax = discoveries.plot(color='blue', fontsize=6)

# Add a vertical red shaded region
ax.____('1900-01-01', ____, color=____, alpha=____)

# Add a horizontal green shaded region
ax.____(6, ____, color=____, alpha=____)

plt.show()


---

---


# **2. Summary Statistics and Diagnostics**

### **2.1 - Find missing values**

In the field of Data Science, it is common to encounter datasets with missing values. This is especially true in the case of time series data, where missing values can occur if a measurement fails to record the value at a specific timestamp. To count the number of missing values in a DataFrame called `df` that contains time series data, you can use the command:

`missing_values = df.isnull().sum()`
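For illustration, here is the same count on a five-row made-up series with two gaps. The values and dates are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly readings with two missing measurements
df = pd.DataFrame(
    {"co2": [316.1, np.nan, 317.3, np.nan, 318.0]},
    index=pd.date_range("1958-03-29", periods=5, freq="7D"),
)

# isnull() marks each missing cell; sum() counts them per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Summing again gives a single total across all columns
total_missing = int(missing_per_column.sum())
print(total_missing)
```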

In this exercise, you will learn how to find whether your data contains any missing values.

<details>
  <summary><font size="3" color="blue"><b>Instructions</b></font></summary>   

  * The `co2_levels` time series DataFrame contains time series data on global CO2 levels. Start by printing the first seven rows of `co2_levels`.
  * Set the `datestamp` column as the index of the `co2_levels` DataFrame.
  * Print the total number of missing values in `co2_levels`.

</details>

<details>
  <summary><font size="3" color="green"><b>Solution - 1</b></font></summary>   

```
# Display first seven rows of co2_levels
print(co2_levels.head(7))
```

</details>

<details>
  <summary><font size="3" color="green"><b>Solution - 2</b></font></summary>   

```
# Set datestamp column as index
co2_levels = co2_levels.set_index('datestamp')

# Print out the number of missing values
print(co2_levels.isnull().sum())
```

</details>

In [None]:
# Display first seven rows of co2_levels
print(___.____(____))

In [None]:
# Set datestamp column as index
co2_levels = co2_levels.____(____)

# Print out the number of missing values
print(co2_levels.____.____)

---

### **2.2 - Handle missing values**

In order to replace missing values in your time series data, you can use the command:

`df = df.fillna(method="ffill")`

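where the `method` argument specifies the type of imputation you want to use. For example, specifying `bfill` (i.e., backfilling) will ensure that missing values are replaced using the next valid observation, while `ffill` (i.e., forward-filling) ensures that missing values are replaced using the last valid observation. Note that recent `pandas` releases deprecate the `method` argument of `fillna()`; `df.ffill()` and `df.bfill()` are the equivalent shorthand methods.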
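The difference between the two strategies is easiest to see on a short made-up series. This sketch uses the `.ffill()` and `.bfill()` shorthand methods, which recent `pandas` versions recommend over `fillna(method=...)`.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps in the middle and at the end
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: each gap takes the last valid observation before it
forward = s.ffill()

# Backward fill: each gap takes the next valid observation after it;
# the trailing gap stays missing because nothing follows it
backward = s.bfill()

print(forward.tolist())
print(backward.tolist())
```

A gap at the very start (for forward fill) or the very end (for backward fill) has no neighbor to copy from, so it is worth re-checking the missing-value count after imputing.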

Recall from the previous exercise that `co2_levels` has 59 missing values.

<details>
  <summary><font size="3" color="blue"><b>Instructions</b></font></summary>   

  * Impute these missing values in `co2_levels` by using backfilling.
  * Print the total number of missing values.

</details>

<details>
  <summary><font size="3" color="green"><b>Solution</b></font></summary>   

```
# Impute missing values with the next valid observation
co2_levels = co2_levels.fillna(method='bfill')

# Print out the number of missing values
print(co2_levels.isnull().sum())
```

</details>

In [None]:
# Impute missing values with the next valid observation
co2_levels = co2_levels.____(method=____)

# Print out the number of missing values
____(____.____())

---

### **2.3 - Display rolling averages**

It is also possible to visualize rolling averages of the values in your time series. This is equivalent to "smoothing" your data, and can be particularly useful when your time series contains a lot of noise or outliers. For a given DataFrame `df`, you can obtain the rolling average of the time series by using the command:

`df_mean = df.rolling(window=12).mean()`

The `window` parameter should be set according to the granularity of your time series. For example, if your time series contains daily data and you are looking for rolling values over a whole year, you should set `window=365`. In addition, it is easy to get rolling values for other metrics, such as the standard deviation (`.std()`) or variance (`.var()`).
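A worked example on five made-up values shows what a rolling window produces. The window size of 3 is chosen only to keep the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical series; a window of 3 keeps the numbers easy to check by hand
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# The first window - 1 entries are NaN because the window is not yet full
rolling_mean = s.rolling(window=3).mean()
rolling_std = s.rolling(window=3).std()

# Bounds two standard deviations above and below the rolling mean
upper = rolling_mean + 2 * rolling_std
lower = rolling_mean - 2 * rolling_std

print(pd.DataFrame({"mean": rolling_mean, "upper": upper, "lower": lower}))
```

If the leading `NaN` values are unwanted, `rolling()` also accepts a `min_periods` argument that allows partially filled windows.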

<details>
  <summary><font size="3" color="blue"><b>Instructions</b></font></summary>   

  * Compute the 52 weeks rolling mean of `co2_levels` and assign it to `ma`.
  * Compute the 52 weeks rolling standard deviation of `co2_levels` and assign it to `mstd`.
  * Calculate the upper bound of the time series, which can be defined as the rolling mean + (2 * rolling standard deviation), and assign it to `ma['upper']`. Similarly, calculate the lower bound as the rolling mean - (2 * rolling standard deviation) and assign it to `ma['lower']`.
  * Plot the line chart of `ma`.

</details>

<details>
  <summary><font size="3" color="green"><b>Solution</b></font></summary>   

```
# Compute the 52 weeks rolling mean of the co2_levels DataFrame
ma = co2_levels.rolling(window=52).mean()

# Compute the 52 weeks rolling standard deviation of the co2_levels DataFrame
mstd = co2_levels.rolling(window=52).std()

# Add the upper bound column to the ma DataFrame
ma['upper'] = ma['co2'] + (mstd['co2'] * 2)

# Add the lower bound column to the ma DataFrame
ma['lower'] = ma['co2'] - (mstd['co2'] * 2)

# Plot the content of the ma DataFrame
ax = ma.plot(linewidth=0.8, fontsize=6)

# Specify labels, legend, and show the plot
ax.set_xlabel('Date', fontsize=10)
ax.set_ylabel('CO2 levels in Maui Hawaii', fontsize=10)
ax.set_title('Rolling mean and variance of CO2 levels\nin Maui Hawaii from 1958 to 2001', fontsize=10)
plt.show()
```

</details>

In [None]:
# Compute the 52 weeks rolling mean of the co2_levels DataFrame
ma = ____.rolling(window=____).____()

# Compute the 52 weeks rolling standard deviation of the co2_levels DataFrame
mstd = ____

# Add the upper bound column to the ma DataFrame
ma['upper'] = ma['co2'] + (____ * ____)

# Add the lower bound column to the ma DataFrame
ma['lower'] = ma['co2'] - (____ * ____)

# Plot the content of the ma DataFrame
ax = ____(linewidth=0.8, fontsize=6)

# Specify labels, legend, and show the plot
ax.set_xlabel('Date', fontsize=10)
ax.set_ylabel('CO2 levels in Maui Hawaii', fontsize=10)
ax.set_title('Rolling mean and variance of CO2 levels\nin Maui Hawaii from 1958 to 2001', fontsize=10)
plt.show()

---

### **2.4 - Display aggregated values**

You may sometimes be required to display your data in a more aggregated form. For example, the `co2_levels` data contains weekly data, but you may need to display its values aggregated by month of the year. In datasets such as the `co2_levels` DataFrame, where the index is of a datetime type, you can extract the year of each date in the index:

```
# Extract the year of each date in the index of the df DataFrame
index_year = df.index.year
```

To extract the month or day of the dates in the index of the `df` DataFrame, you would use `df.index.month` and `df.index.day`, respectively. You can then use the extracted year of each date in the index and the `groupby()` method to compute the mean CO2 levels by year:

`df_by_year = df.groupby(index_year).mean()`
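The grouping can be sketched on four made-up observations spread over two years. The dates and values are hypothetical.

```python
import pandas as pd

# Hypothetical readings: two per year, in January and July
idx = pd.to_datetime(["2000-01-15", "2000-07-15", "2001-01-15", "2001-07-15"])
df = pd.DataFrame({"co2": [10.0, 20.0, 30.0, 50.0]}, index=idx)

# Group rows by the year of their timestamp and average each group
by_year = df.groupby(df.index.year).mean()

# Grouping by month pools the same calendar month across all years
by_month = df.groupby(df.index.month).mean()

print(by_year)
print(by_month)
```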

<details>
  <summary><font size="3" color="blue"><b>Instructions</b></font></summary>   

  * Extract the month for each of the dates in the index of the `co2_levels` DataFrame and assign the values to a variable called `index_month`.
  * Using the `groupby()` and `mean()` methods from the `pandas` library, compute the monthly mean CO2 levels in the `co2_levels` DataFrame and assign that to a new DataFrame called `mean_co2_levels_by_month`.
  * Plot the values of the `mean_co2_levels_by_month` DataFrame using a fontsize of 6 for the axis ticks.

</details>

<details>
  <summary><font size="3" color="green"><b>Solution</b></font></summary>   

```
# Get the month for each date in the index of co2_levels
index_month = co2_levels.index.month

# Compute the mean CO2 levels for each month of the year
mean_co2_levels_by_month = co2_levels.groupby(index_month).mean()

# Plot the mean CO2 levels for each month of the year
mean_co2_levels_by_month.plot(fontsize=6)

# Specify the fontsize on the legend
plt.legend(fontsize=10)

# Show plot
plt.show()
```

</details>

In [None]:
# Get the month for each date in the index of co2_levels
index_month = ____.index.____

# Compute the mean CO2 levels for each month of the year
mean_co2_levels_by_month = co2_levels.____(____).____()

# Plot the mean CO2 levels for each month of the year
mean_co2_levels_by_month.____

# Specify the fontsize on the legend
plt.legend(fontsize=10)

# Show plot
plt.show()

---

### **2.5 - Compute numerical summaries**

You have learnt how to display and annotate time series data in multiple ways, but it is also informative to collect summary statistics of your data. Being able to achieve this task will allow you to share and discuss statistical properties of your data that can further support the plots you generate. In `pandas`, it is possible to quickly obtain summaries of columns in your DataFrame by using the command:

`print(df.describe())`

This will print statistics including the mean, the standard deviation, the minima and maxima and the number of observations for all numeric columns in your `pandas` DataFrame.
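As a quick illustration on made-up numbers, the summary is itself a DataFrame, so individual statistics can be looked up by row label.

```python
import pandas as pd

# Hypothetical single-column data
df = pd.DataFrame({"co2": [1.0, 2.0, 3.0, 4.0]})

# describe() returns a DataFrame indexed by statistic name
summary = df.describe()
print(summary)

# Individual statistics can be read from the summary or computed directly
print(summary.loc["mean", "co2"])
print(df["co2"].min(), df["co2"].max())
```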

<details>
  <summary><font size="3" color="blue"><b>Instructions</b></font></summary>   

  * Print the statistical summaries of the `co2_levels` DataFrame.
  * Print the reported minimum value in the `co2_levels` DataFrame.
  * Print the reported maximum value in the `co2_levels` DataFrame.

</details>

<details>
  <summary><font size="3" color="green"><b>Solution</b></font></summary>   

```
# Print out summary statistics of the co2_levels DataFrame
print(co2_levels.describe())

# Print out the minima of the co2 column in the co2_levels DataFrame
print(co2_levels.co2.min())

# Print out the maxima of the co2 column in the co2_levels DataFrame
print(co2_levels.co2.max())
```

</details>

In [None]:
# Print out summary statistics of the co2_levels DataFrame
print(____.____)

# Print out the minima of the co2 column in the co2_levels DataFrame
print(____)

# Print out the maxima of the co2 column in the co2_levels DataFrame
print(____)

---

### **2.6 - Boxplots and Histograms**

Boxplots represent a graphical rendition of the minimum, median, quartiles, and maximum of your data. You can generate a boxplot by calling the `.boxplot()` method on a DataFrame.

Another method to produce visual summaries is by leveraging histograms, which allow you to inspect the data and uncover its underlying distribution, as well as the presence of outliers and overall spread. An example of how to generate a histogram is shown below:

`ax = co2_levels.plot(kind='hist', bins=100)`

Here, we used the standard `.plot()` method but specified the `kind` argument to be `hist`. In addition, we also added the `bins=100` parameter, which specifies how many intervals (i.e `bins`) we should cut our data into.

<details>
  <summary><font size="3" color="blue"><b>Instructions</b></font></summary>   

  * Using the `co2_levels` DataFrame, produce a boxplot of the CO2 level data.
  * Using the `co2_levels` DataFrame, produce a histogram plot of the CO2 level data with 50 bins.

</details>

<details>
  <summary><font size="3" color="green"><b>Solution - 1</b></font></summary>   

```
# Generate a boxplot
ax = co2_levels.boxplot()

# Set the labels and display the plot
ax.set_xlabel('CO2', fontsize=10)
ax.set_ylabel('Boxplot CO2 levels in Maui Hawaii', fontsize=10)
plt.legend(fontsize=10)
plt.show()
```

</details>

<details>
  <summary><font size="3" color="green"><b>Solution - 2</b></font></summary>   

```
# Generate a histogram
ax = co2_levels.plot(kind='hist', bins=50, fontsize=6)

# Set the labels and display the plot
ax.set_xlabel('CO2', fontsize=10)
ax.set_ylabel('Histogram of CO2 levels in Maui Hawaii', fontsize=10)
plt.legend(fontsize=10)
plt.show()
```

</details>

In [None]:
# Generate a boxplot
ax = ____.____

# Set the labels and display the plot
ax.set_xlabel('CO2', fontsize=10)
ax.set_ylabel('Boxplot CO2 levels in Maui Hawaii', fontsize=10)
plt.legend(fontsize=10)
plt.show()

In [None]:
# Generate a histogram
ax = ____.____(____, ____, fontsize=6)

# Set the labels and display the plot
ax.set_xlabel('CO2', fontsize=10)
ax.set_ylabel('Histogram of CO2 levels in Maui Hawaii', fontsize=10)
plt.legend(fontsize=10)
plt.show()

---

### **2.7 - Density plots**

In practice, histograms can be a substandard method for assessing the distribution of your data because they can be strongly affected by the number of bins that have been specified. Instead, kernel density plots represent a more effective way to view the distribution of your data. An example of how to generate a density plot is shown below:

`ax = df.plot(kind='density', linewidth=2)`

The standard `.plot()` method is specified with the kind argument set to `density`. We also specified an additional parameter `linewidth`, which controls the width of the line to be plotted.

<details>
  <summary><font size="3" color="blue"><b>Instructions</b></font></summary>   

  * Using the `co2_levels` DataFrame, produce a density plot of the CO2 level data with line width parameter of 4.
  * Annotate the x-axis labels of your density plot with the string `CO2`.
  * Annotate the y-axis labels of your density plot with the string `Density plot of CO2 levels in Maui Hawaii`.

</details>

<details>
  <summary><font size="3" color="green"><b>Solution</b></font></summary>   

```
# Display density plot of CO2 levels values
ax = co2_levels.plot(kind='density', linewidth=4, fontsize=6)

# Annotate x-axis labels
ax.set_xlabel('CO2', fontsize=10)

# Annotate y-axis labels
ax.set_ylabel('Density plot of CO2 levels in Maui Hawaii', fontsize=10)

plt.show()
```

</details>

In [None]:
# Display density plot of CO2 levels values
ax = ____.____(____=____, ____=____, fontsize=6)

# Annotate x-axis labels
____.____('CO2', fontsize=10)

# Annotate y-axis labels
____.____('Density plot of CO2 levels in Maui Hawaii', fontsize=10)

plt.show()

---
---

# **3. Seasonality, Trend and Noise**

### **3.1 - Autocorrelation in time series data**

In the field of time series analysis, autocorrelation refers to the correlation of a time series with a lagged version of itself. For example, an autocorrelation of `order 3` returns the correlation between a time series and its own values lagged by 3 time points.

It is common to use the autocorrelation function (ACF) plot, also known as a correlogram, to visualize the autocorrelation of a time series. The `plot_acf()` function in the `statsmodels` library can be used to measure and plot the autocorrelation of a time series.
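To build intuition for what a single autocorrelation value means, the sketch below uses the `pandas` method `Series.autocorr()` on a made-up sine wave with a period of 12 steps. This method computes the Pearson correlation of the overlapping values, which may differ slightly from the estimator that `plot_acf()` uses.

```python
import numpy as np
import pandas as pd

# Hypothetical perfectly periodic signal: a sine wave repeating every 12 steps
t = np.arange(120)
s = pd.Series(np.sin(2 * np.pi * t / 12))

# Shifting by a full period lines the series up with itself
r12 = s.autocorr(lag=12)

# Shifting by half a period puts peaks against troughs
r6 = s.autocorr(lag=6)

print(round(r12, 3), round(r6, 3))
```

A full-period lag gives a correlation close to 1 and a half-period lag gives a correlation close to -1, which is the kind of pattern an ACF plot reveals for seasonal data.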

<details>
  <summary><font size="3" color="blue"><b>Instructions</b></font></summary>   

  * Import `tsaplots` from `statsmodels.graphics`.
  * Use the `plot_acf()` function from `tsaplots` to plot the autocorrelation of the `co2` column in `co2_levels`.
  * Specify a maximum lag of 24.

</details>

<details>
  <summary><font size="3" color="green"><b>Solution</b></font></summary>   

```
# Import required libraries
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from statsmodels.graphics import tsaplots

# Display the autocorrelation plot of your time series
fig = tsaplots.plot_acf(co2_levels['co2'], lags=24)

# Show plot
plt.show()
```

</details>

In [None]:
# Import required libraries
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from ____ import ____

# Display the autocorrelation plot of your time series
fig = ____(co2_levels[____], lags=____)

# Show plot
plt.show()

---

### **3.2 - Partial autocorrelation in time series data**

Like autocorrelation, the partial autocorrelation function (PACF) measures the correlation coefficient between a time series and lagged versions of itself. However, it extends this idea by also removing the effect of the intermediate time points. For example, a partial autocorrelation function of `order 3` returns the correlation between our time series (`t_1`, `t_2`, `t_3`, …) and its own values lagged by 3 time points (`t_4`, `t_5`, `t_6`, …), but only after removing all effects attributable to lags 1 and 2.

The `plot_pacf()` function in the `statsmodels` library can be used to measure and plot the partial autocorrelation of a time series.

<details>
  <summary><font size="3" color="blue"><b>Instructions</b></font></summary>   

  * Import `tsaplots` from `statsmodels.graphics`.
  * Use the `plot_pacf()` function from `tsaplots` to plot the partial autocorrelation of the `co2` column in `co2_levels`.
  * Specify a maximum lag of 24.

</details>

<details>
  <summary><font size="3" color="green"><b>Solution</b></font></summary>   

```
# Import required libraries
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from statsmodels.graphics import tsaplots

# Display the partial autocorrelation plot of your time series
fig = tsaplots.plot_pacf(co2_levels['co2'], lags=24)

# Show plot
plt.show()
```

</details>

In [None]:
# Import required libraries
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
____

# Display the partial autocorrelation plot of your time series
fig = ____(co2_levels[____], lags=____)

# Show plot
plt.show()

---

### **3.3 - Time series decomposition**

When visualizing time series data, you should look out for some distinguishable patterns:
  * seasonality: *does the data display a clear periodic pattern?*
  * trend: *does the data follow a consistent upwards or downward slope?*
  * noise: *are there any outlier points or missing values that are not consistent with the rest of the data?*

You can rely on a method known as time-series decomposition to automatically extract and quantify the structure of time-series data. The `statsmodels` library provides the `seasonal_decompose()` function to perform time series decomposition out of the box.

`decomposition = sm.tsa.seasonal_decompose(time_series)`

You can extract a specific component, for example seasonality, by accessing the `seasonal` attribute of the decomposition object.

<details>
  <summary><font size="3" color="blue"><b>Instructions</b></font></summary>   

  * Import `statsmodels.api` using the alias `sm`.
  * Perform time series decomposition on the `co2_levels` DataFrame into a variable called `decomposition`.
  * Print the seasonality component of your time series decomposition.

</details>

<details>
  <summary><font size="3" color="green"><b>Solution</b></font></summary>   

```
# Import statsmodels.api as sm
import statsmodels.api as sm

# Perform time series decomposition
decomposition = sm.tsa.seasonal_decompose(co2_levels)

# Print the seasonality component
print(decomposition.seasonal)
```

</details>

In [None]:
# Import statsmodels.api as sm
import ____ as ____

# Perform time series decomposition
decomposition = sm.tsa.____(____)

# Print the seasonality component
print(____)

---

### **3.4 - Plot individual components**

It is also possible to extract other inferred quantities from your time-series decomposition object. The following code shows you how to extract the observed, trend and noise (or residual, `resid`) components.

```
observed = decomposition.observed
trend = decomposition.trend
residuals = decomposition.resid

```
You can then use the extracted components and plot them individually.
The decomposition object you created in the last exercise is available in your workspace.

<details>
  <summary><font size="3" color="blue"><b>Instructions</b></font></summary>   

  * Extract the `trend` component from the `decomposition` object.
  * Plot this trend component.

</details>

<details>
  <summary><font size="3" color="green"><b>Solution</b></font></summary>   

```
# Extract the trend component
trend = decomposition.trend

# Plot the values of the trend
ax = trend.plot(figsize=(12, 6), fontsize=6)

# Specify axis labels
ax.set_xlabel('Date', fontsize=10)
ax.set_title('Trend component of the CO2 time-series', fontsize=10)
plt.show()
```

</details>

In [None]:
# Extract the trend component
trend = ____.____

# Plot the values of the trend
ax = ____.____(figsize=(12, 6), fontsize=6)

# Specify axis labels
ax.set_xlabel('Date', fontsize=10)
ax.set_title('Trend component of the CO2 time-series', fontsize=10)
plt.show()

---

### **3.5 - Visualize the airline dataset**

You will now review the contents of chapter 1. You will have the opportunity to work with a new dataset that contains the monthly number of passengers who took a commercial flight between January 1949 and December 1960.

We have printed the first 5 and the last 5 rows of the `airline` DataFrame for you to review.

<details>
  <summary><font size="3" color="blue"><b>Instructions</b></font></summary>   

  * Plot the time series of `airline` using a "blue" line plot.
  * Add a vertical line on this plot at December 1, 1955.
  * Specify the x-axis label on your plot: `Date`.
  * Specify the title of your plot: `Number of Monthly Airline Passengers`.

</details>

<details>
  <summary><font size="3" color="green"><b>Solution</b></font></summary>   

```
# Plot the time series in your dataframe
ax = airline.plot(color='blue', fontsize=12)

# Add a red vertical line at the date 1955-12-01
ax.axvline('1955-12-01', color='red', linestyle='--')

# Specify the labels in your plot
ax.set_xlabel('Date', fontsize=12)
ax.set_title('Number of Monthly Airline Passengers', fontsize=12)
plt.show()
```

</details>

In [None]:
# Plot the time series in your DataFrame
ax = airline.____(____, fontsize=12)

# Add a red vertical line at the date 1955-12-01
____('1955-12-01', color='red', linestyle='--')

# Specify the labels in your plot
ax.____('Date', fontsize=12)
ax.____('Number of Monthly Airline Passengers', fontsize=12)
plt.show()

---

### **3.6 - Analyze the airline dataset**

You learned:

* How to check for the presence of missing values, and how to collect summary statistics of time series data contained in a `pandas` DataFrame.
* How to generate boxplots to quickly gain insight into your data.
* How to display *aggregate* statistics of your data using `groupby()`.

In this exercise, you will apply all these concepts on the `airline` DataFrame.
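
As a quick refresher before the exercise, here is a minimal sketch of the numeric steps on a small synthetic DataFrame. The `toy` DataFrame and its `value` column are hypothetical stand-ins created only for this example; they are not the `airline` data.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data with two missing values
# (hypothetical stand-in, not the airline data)
dates = pd.date_range("2000-01-01", periods=36, freq="MS")
toy = pd.DataFrame({"value": np.arange(36, dtype=float)}, index=dates)
toy.iloc[[5, 20], 0] = np.nan

# Count the missing values in each column
print(toy.isnull().sum())

# Summary statistics of the numeric columns (missing values are ignored)
print(toy.describe())

# Group by the calendar month (1-12) taken from the DatetimeIndex, then average
toy_mean_by_month = toy.groupby(toy.index.month).mean()
print(toy_mean_by_month)
```

Grouping by `toy.index.month` pools every January together, every February together, and so on, so the result has one row per calendar month regardless of how many years the data spans.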

<details>
  <summary><font size="3" color="blue"><b>Instructions</b></font></summary>   

  * Print the numbers of missing values in the `airline` DataFrame.
  * Print the summary statistics of all the numeric columns in `airline`.
  * Generate a boxplot of the monthly volume of airline passengers data.
  * Extract the month from the index of `airline`.
  * Compute the mean number of passengers per month in `airline` and assign it to `mean_airline_by_month`.
  * Plot the mean number of passengers per month in `airline`.

</details>

<details>
  <summary><font size="3" color="green"><b>Solution - 1</b></font></summary>   

```
# Print out the number of missing values
print(airline.isnull().sum())

# Print out summary statistics of the airline DataFrame
print(airline.describe())
```

</details>

<details>
  <summary><font size="3" color="green"><b>Solution - 2</b></font></summary>   

```
# Display boxplot of airline values
ax = airline.boxplot()

# Specify the title of your plot
ax.set_title('Boxplot of Monthly Airline\nPassengers Count', fontsize=20)
plt.show()
```

</details>

<details>
  <summary><font size="3" color="green"><b>Solution - 3</b></font></summary>   

```
# Get the month for each date from the index of airline
index_month = airline.index.month

# Compute the mean number of passengers for each month of the year
mean_airline_by_month = airline.groupby(index_month).mean()

# Plot the mean number of passengers for each month of the year
mean_airline_by_month.plot()
plt.legend(fontsize=20)
plt.show()
```

</details>

In [None]:
# Print out the number of missing values
print(airline.____)

# Print out summary statistics of the airline DataFrame
print(airline.____)

In [None]:
# Display boxplot of airline values
ax = ____

# Specify the title of your plot
ax.set_title('Boxplot of Monthly Airline\nPassengers Count', fontsize=20)
plt.show()

In [None]:
# Get the month for each date from the index of airline
index_month = airline.____

# Compute the mean number of passengers for each month of the year
mean_airline_by_month = airline.____(____).____

# Plot the mean number of passengers for each month of the year
mean_airline_by_month.____
plt.legend(fontsize=20)
plt.show()

---

### **3.7 - Time series decomposition of the airline dataset**

In this exercise, you will apply time series decomposition to the `airline` dataset, and visualize the `trend` and `seasonal` components.

<details>
  <summary><font size="3" color="blue"><b>Instructions</b></font></summary>   

  * Import `statsmodels.api` using the alias `sm`.
  * Perform time series decomposition on the `airline` DataFrame into a variable called `decomposition`.
  * Extract the `trend` and `seasonal` components.

  We placed the `trend` and `seasonal` components in the `airline_decomposed` DataFrame.
  * Print the first 5 rows of `airline_decomposed`.
  * Plot these two components on the same graph.

</details>

<details>
  <summary><font size="3" color="green"><b>Solution - 1</b></font></summary>   

```
# Import statsmodels.api as sm
import statsmodels.api as sm

# Perform time series decomposition
decomposition = sm.tsa.seasonal_decompose(airline)

# Extract the trend and seasonal components
trend = decomposition.trend
seasonal = decomposition.seasonal
```

</details>

<details>
  <summary><font size="3" color="green"><b>Solution - 2</b></font></summary>   

```
# Print the first 5 rows of airline_decomposed
print(airline_decomposed.head())

# Plot the values of the airline_decomposed DataFrame
ax = airline_decomposed.plot(figsize=(12, 6), fontsize=15)

# Specify axis labels
ax.set_xlabel('Date', fontsize=15)
plt.legend(fontsize=15)
plt.show()
```

</details>

In [None]:
# Import statsmodels.api as sm
____

# Perform time series decomposition
decomposition = sm.tsa.____(____)

# Extract the trend and seasonal components
trend = ____
seasonal = ____

In [None]:
# Print the first 5 rows of airline_decomposed
print(____)

# Plot the values of the airline_decomposed DataFrame
ax = ____.____(figsize=(12, 6), fontsize=15)

# Specify axis labels
ax.set_xlabel('Date', fontsize=15)
plt.legend(fontsize=15)
plt.show()

---
---