# Python Block Course
# Session 3: Importing, formatting and plotting data

Prof. Dr. Karsten Donnay, Stefan Scholz

Winter Term 2019 / 2020

In this third session we will build an example data pipeline with Python. For this, we will work with external packages to import and clean data, then perform simple analyses and finally create visualizations. 

## 3.1 External Packages

One of the major advantages of Python is its variety of **external packages**. In addition to the already numerous built-in modules, there is an enormous number of external packages. These external packages have to be **installed** before they can be imported and used. Once installed, they can be used like built-in modules. The following list shows which **packages** we will use today and in the rest of the week. 

| Package | Description |
| -------- | ------- |
| `numpy` | scientific computing with arrays |
| `sklearn` | machine learning |
| `pandas` | data structures and data analysis |
| `matplotlib` | figures |
| `beautifulsoup4` | parsing of web pages |


<div class="alert alert-block alert-info">
    <b>Exercise</b>: Install all packages listed above in your Python environment, e.g. via the Anaconda Navigator. 
</div>

We will be using some of these packages in our data pipeline today. 

## 3.2 Data Sources

In practice, there are many different **data sources** from which you can get data. Each source has its own **advantages** and **disadvantages**. So far we have worked a lot with variables in our code, but they are not suitable for storing large amounts of data. That is why we also used files, which can store more data but are difficult to exchange. This is where **application programming interfaces** (APIs) come in: they provide an interface for exchanging data between clients and servers. We will look at them in detail on our last day. On the same day, we will look at how to get data from **websites**. Behind APIs and websites, there is usually a **database system** in which all data is systematically stored and made available. These databases can also be used directly from Python. 

Below is a short (and probably incomplete) **list** of **data sources**. 

| Source | Description |
| -------- | ------- |
| Variable | reserved memory location to store data |
| File | physical storage to store data |
| Website | online available web resources |
| API | interface for exchanging data between clients and servers |
| Database | organized storage and access with software |

Today we will present the data pipeline using **simulated data** in **variables**. You will then complete the individual steps using a **file** by yourself. For this purpose we have a **dataset** on **flights** departing New York City in 2013. 

## 3.3 Data Handling

In our last exercise yesterday, you have already seen how you can load the data of a file as a nested list with the use of the module `csv`. Maybe you have realized that this exercise is very **time-consuming** because many **little steps** must be implemented: How do I open the file? How do I iterate over the lines? How do I save it?

That is why developers have written **external packages** for **data handling**. For example, with these packages a file can be loaded in one line of code. Two of these packages are `pandas` and `numpy`, and both support many different data sources. Let us take a look at both packages one after the other. 

### Pandas

The **package** `pandas` helps you to arrange your **data** like **tables**. Through `pandas`, you can import, clean, transform, analyse and export data.

Let us first **import** the **package** or install it again if necessary. 

In [None]:
import pandas as pd

The primary component of `pandas` is its `DataFrame`. A `DataFrame` is organized like a **table** and has **rows** and **columns**. There are several ways to create a new `DataFrame`, but the easiest way to start is to take a **dictionary** and pass it into a new `DataFrame`.

Let us start with a **simple example** where we create a **table**.

In [None]:
# create dictionary
fruits = {"apple": [0.2, 1, 0.5, 1.2],
          "coconut": [3, 3.4, None, 4],
          "melon": [3, 3.5, 4, 4.5]}

# create dataframe
fruits = pd.DataFrame(fruits)

# print dataframe
print(fruits)

As you have just seen, `pandas` by default takes the **keys** of the dictionary as **columns** and its **values** as **cells**. You can also pass the **row names** as a **separate argument** `index` to `DataFrame`. If you want to change the **column** or **row names** afterwards, you can access and overwrite them with the **attributes** `columns` and `index`. 

Let us change the **column** and **row names** of our simple table.

In [None]:
# create dictionary
fruits = {"apple": [0.2, 1, 0.5, 1.2],
          "coconut": [3, 3.4, None, 4],
          "melon": [3, 3.5, 4, 4.5]}

# create dataframe
fruits = pd.DataFrame(fruits)

# change column names
fruits.columns = ["Green Apple", "Indonesian Coconut", "Watermelon"]

# change row names
fruits.index = ["Aldi", "Alnatura", "Penny", "Edeka"]

# print dataframe
print(fruits)
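
As a minimal sketch of the alternative mentioned above, the row names can also be set when the table is created, through the argument `index` of `pd.DataFrame()`. The column names are written directly as dictionary keys here, so no renaming is needed afterwards.

```python
import pandas as pd

# create dataframe and pass the row names through the argument `index`
fruits = pd.DataFrame(
    {"Green Apple": [0.2, 1, 0.5, 1.2],
     "Indonesian Coconut": [3, 3.4, None, 4],
     "Watermelon": [3, 3.5, 4, 4.5]},
    index=["Aldi", "Alnatura", "Penny", "Edeka"],
)

# print dataframe
print(fruits)
```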

As your **datasets** become **larger** and non-trivial, you will not be able to print the entire data set anymore. `pandas` offers the two methods `head()` and `tail()` to display either the **first** or the **last rows** of a `DataFrame`. By default, 5 rows are shown, but you can select any other number too. 

Let us print the **first** and **last row** once. 

In [None]:
# create dictionary
fruits = {"apple": [0.2, 1, 0.5, 1.2],
          "coconut": [3, 3.4, None, 4],
          "melon": [3, 3.5, 4, 4.5]}

# create dataframe
fruits = pd.DataFrame(fruits)

# change column names
fruits.columns = ["Green Apple", "Indonesian Coconut", "Watermelon"]

# change row names
fruits.index = ["Aldi", "Alnatura", "Penny", "Edeka"]

# print first row
print(fruits.head(1))

In [None]:
# create dictionary
fruits = {"apple": [0.2, 1, 0.5, 1.2],
          "coconut": [3, 3.4, None, 4],
          "melon": [3, 3.5, 4, 4.5]}

# create dataframe
fruits = pd.DataFrame(fruits)

# change column names
fruits.columns = ["Green Apple", "Indonesian Coconut", "Watermelon"]

# change row names
fruits.index = ["Aldi", "Alnatura", "Penny", "Edeka"]

# print last row
print(fruits.tail(1))
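
Our fruit table has only four rows, so the default behaviour of `head()` and `tail()` is hard to see there. The following sketch uses a made-up table with 20 rows (the name `numbers` and its content are only for illustration) to show how many rows are returned with and without an argument.

```python
import pandas as pd

# made-up table with 20 rows
numbers = pd.DataFrame({"value": range(20)})

# without an argument, head() and tail() return five rows
first = numbers.head()
last = numbers.tail()

# with an argument, they return that many rows
first_three = numbers.head(3)

# print number of rows of each selection
print(len(first), len(last), len(first_three))
```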

There are also several ways to change a `DataFrame` in `pandas`. For example, you can **append** one `DataFrame` to another with `pd.concat()` (the older method `append()` was removed in pandas 2.0), find and delete **duplicate rows** with `drop_duplicates()`, and delete rows with **missing values** with `dropna()`. Note that most operations return a **new** `DataFrame` and leave the original unchanged. If you want to modify the same `DataFrame` instead, you can in many cases pass the argument `inplace=True` to the method. 

Let us try some **operations** on our data frame. 

In [None]:
# create dictionary
fruits = {"apple": [0.2, 1, 0.5, 1.2],
          "coconut": [3, 3.4, None, 4],
          "melon": [3, 3.5, 4, 4.5]}

# create dataframe
fruits = pd.DataFrame(fruits)

# change column names
fruits.columns = ["Green Apple", "Indonesian Coconut", "Watermelon"]

# change row names
fruits.index = ["Aldi", "Alnatura", "Penny", "Edeka"]

# append dataframe to dataframe (DataFrame.append() was removed in pandas 2.0)
fruits = pd.concat([fruits, fruits])

# print dataframe
print(fruits)

In [None]:
# create dictionary
fruits = {"apple": [0.2, 1, 0.5, 1.2],
          "coconut": [3, 3.4, None, 4],
          "melon": [3, 3.5, 4, 4.5]}

# create dataframe
fruits = pd.DataFrame(fruits)

# change column names
fruits.columns = ["Green Apple", "Indonesian Coconut", "Watermelon"]

# change row names
fruits.index = ["Aldi", "Alnatura", "Penny", "Edeka"]

# append dataframe to dataframe (DataFrame.append() was removed in pandas 2.0)
fruits = pd.concat([fruits, fruits])

# drop duplicates in dataframe
fruits.drop_duplicates(inplace=True)

# print dataframe
print(fruits)

In [None]:
# create dictionary
fruits = {"apple": [0.2, 1, 0.5, 1.2],
          "coconut": [3, 3.4, None, 4],
          "melon": [3, 3.5, 4, 4.5]}

# create dataframe
fruits = pd.DataFrame(fruits)

# change column names
fruits.columns = ["Green Apple", "Indonesian Coconut", "Watermelon"]

# change row names
fruits.index = ["Aldi", "Alnatura", "Penny", "Edeka"]

# drop rows with missing values
fruits.dropna(inplace=True)

# print dataframe
print(fruits)
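
To illustrate the difference between changing a `DataFrame` in place and getting a new one back, here is a short sketch with a reduced version of our table. Without `inplace=True`, `dropna()` returns a new `DataFrame` and the original keeps the row with the missing value. The variable name `cleaned` is only chosen for this example.

```python
import pandas as pd

# create dataframe with one missing value
fruits = pd.DataFrame({"apple": [0.2, 1, 0.5, 1.2],
                       "coconut": [3, 3.4, None, 4]})

# without inplace, dropna() returns a new dataframe
cleaned = fruits.dropna()

# the original dataframe still has all rows, the new one has one row less
print(len(fruits), len(cleaned))
```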

Instead of **accessing** the entire `DataFrame`, you can also restrict the **selection** to certain columns, rows and cells. **Columns** can be selected by indexing the `DataFrame` with a **list** of their **names**. **Rows** can be selected either by their **name** using the **indexer** `loc[]` or by their **position** using the **indexer** `iloc[]`. Of course you can also restrict both columns and rows at the same time. 

Let us select **different parts** of our table.

In [None]:
# create dictionary
fruits = {"apple": [0.2, 1, 0.5, 1.2],
          "coconut": [3, 3.4, None, 4],
          "melon": [3, 3.5, 4, 4.5]}

# create dataframe
fruits = pd.DataFrame(fruits)

# change column names
fruits.columns = ["Green Apple", "Indonesian Coconut", "Watermelon"]

# change row names
fruits.index = ["Aldi", "Alnatura", "Penny", "Edeka"]

# select column of melons
select = fruits[["Watermelon"]]

# print selected dataframe
print(select)

In [None]:
# create dictionary
fruits = {"apple": [0.2, 1, 0.5, 1.2],
          "coconut": [3, 3.4, None, 4],
          "melon": [3, 3.5, 4, 4.5]}

# create dataframe
fruits = pd.DataFrame(fruits)

# change column names
fruits.columns = ["Green Apple", "Indonesian Coconut", "Watermelon"]

# change row names
fruits.index = ["Aldi", "Alnatura", "Penny", "Edeka"]

# select row of Edeka
select = fruits.loc[["Edeka"]]

# print selected dataframe
print(select)

In [None]:
# create dictionary
fruits = {"apple": [0.2, 1, 0.5, 1.2],
          "coconut": [3, 3.4, None, 4],
          "melon": [3, 3.5, 4, 4.5]}

# create dataframe
fruits = pd.DataFrame(fruits)

# change column names
fruits.columns = ["Green Apple", "Indonesian Coconut", "Watermelon"]

# change row names
fruits.index = ["Aldi", "Alnatura", "Penny", "Edeka"]

# select column of melons and row of Edeka
select = fruits.loc[["Edeka"], ["Watermelon"]]

# print selected dataframe
print(select)
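
The examples above select by name with `loc[]`. As a complement, the following sketch selects by position with `iloc[]`, where counting starts at 0 as with lists. The variable names `last_row` and `corner` are only chosen for this example.

```python
import pandas as pd

# create dataframe with column and row names
fruits = pd.DataFrame(
    {"Green Apple": [0.2, 1, 0.5, 1.2],
     "Indonesian Coconut": [3, 3.4, None, 4],
     "Watermelon": [3, 3.5, 4, 4.5]},
    index=["Aldi", "Alnatura", "Penny", "Edeka"],
)

# select the fourth row (Edeka) by its position
last_row = fruits.iloc[[3]]

# select the first two rows and the first column by their positions
corner = fruits.iloc[0:2, [0]]

# print selected dataframes
print(last_row)
print(corner)
```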

When you want to **select** parts of a `DataFrame` which fulfill a certain **condition**, you can write a **conditional statement** instead of a name or index. You write the condition in square brackets behind the `DataFrame`, just like an index, and get all rows of the `DataFrame` for which the condition is **true**. You can also **combine** several conditions with logical and `&` and logical or `|`; each condition must then be wrapped in round brackets `(` `)`. 

Let us select **supermarkets** according to their **prices**. 

In [None]:
# create dictionary
fruits = {"apple": [0.2, 1, 0.5, 1.2],
          "coconut": [3, 3.4, None, 4],
          "melon": [3, 3.5, 4, 4.5]}

# create dataframe
fruits = pd.DataFrame(fruits)

# change column names
fruits.columns = ["Green Apple", "Indonesian Coconut", "Watermelon"]

# change row names
fruits.index = ["Aldi", "Alnatura", "Penny", "Edeka"]

# select rows with melons cheaper than 4
select = fruits[fruits["Watermelon"] < 4]

# print selected dataframe
print(select)

In [None]:
# create dictionary
fruits = {"apple": [0.2, 1, 0.5, 1.2],
          "coconut": [3, 3.4, None, 4],
          "melon": [3, 3.5, 4, 4.5]}

# create dataframe
fruits = pd.DataFrame(fruits)

# change column names
fruits.columns = ["Green Apple", "Indonesian Coconut", "Watermelon"]

# change row names
fruits.index = ["Aldi", "Alnatura", "Penny", "Edeka"]

# select rows with melons cheaper than 4 and apple cheaper than 1
select = fruits[(fruits["Watermelon"] < 4) & (fruits["Green Apple"] < 1)]

# print selected dataframe
print(select)
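
The previous cell combines two conditions with `&`. For completeness, here is a sketch of the same pattern with a logical or `|`; the thresholds are chosen arbitrarily for this example.

```python
import pandas as pd

# create dataframe with column and row names
fruits = pd.DataFrame(
    {"Green Apple": [0.2, 1, 0.5, 1.2],
     "Indonesian Coconut": [3, 3.4, None, 4],
     "Watermelon": [3, 3.5, 4, 4.5]},
    index=["Aldi", "Alnatura", "Penny", "Edeka"],
)

# select rows with apples cheaper than 0.5 or melons costing at least 4.5
select = fruits[(fruits["Green Apple"] < 0.5) | (fruits["Watermelon"] >= 4.5)]

# print selected dataframe
print(select)
```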

When you have finished your data wrangling, you can **save** the resulting `DataFrame`. In `pandas` you can save it as **CSV** or **JSON**, or write it to an **SQL** database, with the **methods** `to_csv()`, `to_json()` and `to_sql()`, each in one line of code.

Let us **save** our **table** as **CSV** once. 

In [None]:
# create dictionary
fruits = {"apple": [0.2, 1, 0.5, 1.2],
          "coconut": [3, 3.4, None, 4],
          "melon": [3, 3.5, 4, 4.5]}

# create dataframe
fruits = pd.DataFrame(fruits)

# change column names
fruits.columns = ["Green Apple", "Indonesian Coconut", "Watermelon"]

# change row names
fruits.index = ["Aldi", "Alnatura", "Penny", "Edeka"]

# save dataframe
fruits.to_csv("test.csv")

When you want to **load** your previous data or any other data, you can load it as a `DataFrame` with `pandas` from **CSV**, **JSON** or **SQL** with the **functions** `pd.read_csv()`, `pd.read_json()` or `pd.read_sql_query()`.

Let us **reload** the **data** from the **CSV**. 

In [None]:
# read dataframe
fruits = pd.read_csv("test.csv", index_col=0)

# print dataframe
print(fruits)

### Numpy

The **package** `numpy` provides Python with **multidimensional array objects**, which are easy and fast to work with. There are countless functions that make **scientific computing** convenient. We will only go through a few **simple use cases** with made up data. 

Let us first **import** the **package** or install it again if necessary. 

In [None]:
import numpy as np

There are several ways to create an array with `numpy`. The easiest way is to **create** an **array object** with the **function** `np.array()` and give it as input a **list** or **nested list**. But `numpy` offers many other helpful functions, like `np.zeros()` to create an **array** filled with **zeros** with a certain shape. Or you can create an array with **random values** with the help of the function `np.random.random()` and **reshape** it with the method `reshape()` to the desired shape. 

Note that **all elements** inside the array must have the **same data type**!

Let us create some **first arrays**.

In [None]:
# create array with values
fruits = np.array([[0.2, 1, 0.5, 1.2], 
                   [3, 3.4, None, 4], 
                   [3, 3.5, 4, 4.5]])

# print array
print(fruits)

In [None]:
# create array with zeros
fruits = np.zeros((3,4))

# print array
print(fruits)

In [None]:
# create array with random numbers and reshape
fruits = np.random.random(12).reshape((3,4))

# print array
print(fruits)
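
The note about data types can be made concrete with the attribute `dtype`, which every array has. The sketch below shows how `numpy` picks one common type for all elements: integers and floats are stored together as floats, a `None` among numbers leads to the generic type `object`, and an explicit `dtype=float` turns `None` into the missing value `nan`. The variable names are only chosen for this example.

```python
import numpy as np

# integers and floats are stored together as floats
numbers = np.array([1, 2.5, 3])
print(numbers.dtype)

# a None among numbers leads to the generic data type object
mixed = np.array([1.0, None])
print(mixed.dtype)

# with an explicit float data type, None is stored as nan
converted = np.array([1.0, None], dtype=float)
print(converted.dtype)
```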

Besides the actual elements, each array has further information saved about itself in attributes. Thus, the **number** of **dimensions** of the array can be retrieved with the **attribute** `ndim`, the **total number** of **elements** with the **attribute** `size`, and the **size** of the **dimensions** with the **attribute** `shape`. 

Let us find out these **attributes** for our **array**. 

In [None]:
# create array with values
fruits = np.array([[0.2, 1, 0.5, 1.2], 
                   [3, 3.4, None, 4], 
                   [3, 3.5, 4, 4.5]])

# print dimensions
print(fruits.ndim)

In [None]:
# create array with values
fruits = np.array([[0.2, 1, 0.5, 1.2], 
                   [3, 3.4, None, 4], 
                   [3, 3.5, 4, 4.5]])

# print number elements
print(fruits.size)

In [None]:
# create array with values
fruits = np.array([[0.2, 1, 0.5, 1.2], 
                   [3, 3.4, None, 4], 
                   [3, 3.5, 4, 4.5]])

# print shape
print(fruits.shape)

To access a **certain part** of an **array**, the array can be **indexed** and **sliced** by writing **square brackets** `[` `]` behind the array. This procedure is similar to the one with lists. However, **multiple dimensions** can be **indexed** at once by writing their indices in order, separated by commas `,`. With a two-dimensional array you first select the index of a row and then the index of a column. If you omit the index of a trailing dimension, that dimension is returned in full. For a two-dimensional array, a single index therefore returns the entire row. This sounds more complicated than it actually is, so let us try it out with a few examples. 

Let us access **certain parts** of our **array**. 

In [None]:
# create array with values
fruits = np.array([[0.2, 1, 0.5, 1.2], 
                   [3, 3.4, None, 4], 
                   [3, 3.5, 4, 4.5]])

# select first row
select = fruits[0]

# print selected array
print(select)

In [None]:
# create array with values
fruits = np.array([[0.2, 1, 0.5, 1.2], 
                   [3, 3.4, None, 4], 
                   [3, 3.5, 4, 4.5]])

# select first column
select = fruits[:, 0]

# print selected array
print(select)

In [None]:
# create array with values
fruits = np.array([[0.2, 1, 0.5, 1.2], 
                   [3, 3.4, None, 4], 
                   [3, 3.5, 4, 4.5]])

# select in last column first two elements
select = fruits[0:2, -1]

# print selected array
print(select)

As with `pandas`, we can **filter** the data in our **array** using **conditional statements**. You can also filter on **multiple dimensions** at the same time. 

In [None]:
# create array with values
fruits = np.array([[0.2, 1, 0.5, 1.2], 
                   [3, 3.4, None, 4], 
                   [3, 3.5, 4, 4.5]])

# select columns where value in last row is smaller than 4
select = fruits[:, fruits[2] < 4]

# print selected array
print(select)

In [None]:
# create array with values
fruits = np.array([[0.2, 1, 0.5, 1.2], 
                   [3, 3.4, None, 4], 
                   [3, 3.5, 4, 4.5]])

# select columns where value in first row is smaller than 1 and last row is smaller than 4
select = fruits[:, (fruits[2] < 4) & (fruits[0] < 1)]

# print selected array
print(select)

Finally, we show how to exchange data between `numpy` and `pandas`. To **export** an array as a `DataFrame`, you can simply pass the array to the **function** `pd.DataFrame()`. This will create a `DataFrame` without any row or column names, but you can add these as previously shown. To **import** the data from a `DataFrame` back into a `numpy` array, you can apply the **method** `to_numpy()` to your `DataFrame`. In this conversion the column and row names are lost. 

Let us once **convert** the data between a **table** and an **array**. 

In [None]:
# create array with values
fruits = np.array([[0.2, 1, 0.5, 1.2], 
                   [3, 3.4, None, 4], 
                   [3, 3.5, 4, 4.5]])

# import in dataframe
fruits = pd.DataFrame(fruits) 

# change column names
fruits.columns = ["Aldi", "Alnatura", "Penny", "Edeka"]

# change row names
fruits.index = ["Green Apple", "Indonesian Coconut", "Watermelon"]

# print dataframe
print(fruits)

In [None]:
# create dictionary
fruits = {"apple": [0.2, 1, 0.5, 1.2],
          "coconut": [3, 3.4, None, 4],
          "melon": [3, 3.5, 4, 4.5]}

# create dataframe
fruits = pd.DataFrame(fruits)

# change column names
fruits.columns = ["Green Apple", "Indonesian Coconut", "Watermelon"]

# change row names
fruits.index = ["Aldi", "Alnatura", "Penny", "Edeka"]

# export as array
fruits = fruits.to_numpy()

# print array
print(fruits)

<div class="alert alert-block alert-info">
    <b>Exercise</b>: Import the <a href="https://raw.githubusercontent.com/snehavcs/NYC-Flight-Data-Analysis/master/flights.csv">flight dataset</a> as dataframe. Inspect the dataframe. Use the package pandas. 
</div>

<div class="alert alert-block alert-info">
    <b>Exercise</b>: Add a column date in the dataframe with datetime objects using the columns year, month and day. 
</div>

## 3.4 Data Analysis

With the data handling completed, the next step in our **data pipeline** is the **data analysis**. **Descriptive statistics** such as means, quantiles, deviations and counts first help us to better understand the data. These are implemented in `pandas` and `numpy` as well. To model the data and recognize mechanisms we move on to **inferential statistics**, e.g. correlations and regressions. For these statistics we will introduce `sklearn`. Let us take a look at all packages one after the other. 

### Pandas

Within a `DataFrame` of `pandas`, you can use the **method** `describe()` to view the **basic statistical characteristics** of each feature. These characteristics can also be calculated individually, e.g. with the methods `max()`, `min()` and `mean()`. In addition, there are methods such as `sum()` and `corr()`, whose names should be self-explanatory. 

Let us have a look at the **basic statistical characteristics** of our features. 

In [None]:
# create dictionary
fruits = {"apple": [0.2, 1, 0.5, 1.2],
          "coconut": [3, 3.4, None, 4],
          "melon": [3, 3.5, 4, 4.5]}

# create dataframe
fruits = pd.DataFrame(fruits)

# change column names
fruits.columns = ["Green Apple", "Indonesian Coconut", "Watermelon"]

# change row names
fruits.index = ["Aldi", "Alnatura", "Penny", "Edeka"]

# print characteristics
print(fruits.describe())

In [None]:
# create dictionary
fruits = {"apple": [0.2, 1, 0.5, 1.2],
          "coconut": [3, 3.4, None, 4],
          "melon": [3, 3.5, 4, 4.5]}

# create dataframe
fruits = pd.DataFrame(fruits)

# change column names
fruits.columns = ["Green Apple", "Indonesian Coconut", "Watermelon"]

# change row names
fruits.index = ["Aldi", "Alnatura", "Penny", "Edeka"]

# print correlation
print(fruits.corr())

In [None]:
# create dictionary
fruits = {"apple": [0.2, 1, 0.5, 1.2],
          "coconut": [3, 3.4, None, 4],
          "melon": [3, 3.5, 4, 4.5]}

# create dataframe
fruits = pd.DataFrame(fruits)

# change column names
fruits.columns = ["Green Apple", "Indonesian Coconut", "Watermelon"]

# change row names
fruits.index = ["Aldi", "Alnatura", "Penny", "Edeka"]

# print sum prices in supermarkets
print(fruits.sum(axis=1))

Another useful feature of `DataFrame` is **grouping** by certain **variables**. The **method** `groupby()` first collects all observations that share the same values in the grouped variables, for every **combination** of the **grouped variables**, and then applies a **function** of your choice to each of these groups to aggregate its observations into a **single value**. For example, you could apply `agg("mean")` to get the mean of all observations in each group. The aggregation can also be limited to certain shown variables. Grouping is especially helpful for **categorical variables**, to find out how other variables differ between the categories. In our dummy data set it does not make sense though, therefore we only present the **general schema** at this point. 

```python
dataframe.groupby(by=[grouped_variable_1, grouped_variable_2, ...])[[shown_variable_1, shown_variable_2, ...]].function()
```
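
To make the schema more tangible, here is a minimal sketch with a small made-up table. The `DataFrame` `prices`, its columns `city`, `fruit` and `price`, and all values are hypothetical and only serve as an illustration. The prices are grouped by fruit and then averaged within each group.

```python
import pandas as pd

# hypothetical data: prices of two fruits in two cities
prices = pd.DataFrame({
    "city": ["Konstanz", "Konstanz", "Berlin", "Berlin"],
    "fruit": ["apple", "melon", "apple", "melon"],
    "price": [0.5, 4.0, 0.7, 3.0],
})

# group by fruit and compute the mean price within each group
means = prices.groupby(by=["fruit"])[["price"]].mean()

# print aggregated dataframe
print(means)
```

The result has one row per group, with the grouped variable `fruit` as row names.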

### Numpy

Since `numpy` focuses on **scientific computing**, many numerical **calculations** are convenient to write with it. The syntax of calculations with arrays is very similar to the MATLAB syntax. In addition, there are a lot of **functions** implemented to calculate the **basic statistical characteristics**. Note that you have to specify an **axis** in **multidimensional arrays** such that `numpy` knows which dimension you are interested in. `Numpy` computes **means** with `np.mean()`, **standard deviations** with `np.std()`, **sums** with `np.sum()` ... However, the following example will show that these functions are **sensitive** to **missing values**. A `None` must first be converted to the same data type as the other values inside the array, which turns it into the missing value `nan`. To deal with missing values, `numpy` offers variants of these functions that ignore them: `np.nanmean()` for means, `np.nanstd()` for standard deviations, `np.nansum()` for sums ...

Let us show how an **array** has to be **prepared** in order to **analyse** it. 

In [None]:
# create array with values
fruits = np.array([[0.2, 1, 0.5, 1.2], 
                   [3, 3.4, None, 4], 
                   [3, 3.5, 4, 4.5]])

# compute sum with None (raises a TypeError because None cannot be added to numbers)
computed = np.sum(fruits, axis=1)

# print sum
print(computed)

In [None]:
# create array with values and same data type
fruits = np.array([[0.2, 1, 0.5, 1.2], 
                   [3, 3.4, None, 4], 
                   [3, 3.5, 4, 4.5]],
                  dtype=float)

# compute sum
computed = np.sum(fruits, axis=1)

# print sum
print(computed)

In [None]:
# create array with values and same data type
fruits = np.array([[0.2, 1, 0.5, 1.2], 
                   [3, 3.4, None, 4], 
                   [3, 3.5, 4, 4.5]],
                  dtype=float)

# compute sum and consider nans as zero
computed = np.nansum(fruits, axis=1)

# print sum
print(computed)
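
The same comparison works for means. The following sketch computes the mean price of each fruit, once with `np.mean()` and once with `np.nanmean()`. For the row with the missing value, `np.mean()` returns `nan`, while `np.nanmean()` averages only the three available values. The variable names `with_nan` and `without_nan` are only chosen for this example.

```python
import numpy as np

# create array with values and same data type
fruits = np.array([[0.2, 1, 0.5, 1.2],
                   [3, 3.4, None, 4],
                   [3, 3.5, 4, 4.5]],
                  dtype=float)

# compute mean of each row, missing values propagate
with_nan = np.mean(fruits, axis=1)

# compute mean of each row, missing values are ignored
without_nan = np.nanmean(fruits, axis=1)

# print means
print(with_nan)
print(without_nan)
```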

### Sklearn

A new **package** that we have not seen yet is `sklearn`, also known as `scikit-learn`. `sklearn` offers different **machine learning methods** like linear models, support vector machines, tree-based methods, nearest neighbors, neural networks, clustering, matrix decomposition ... Each of these methods can be **conveniently prepared**, **performed** and **evaluated**. To demonstrate how easy an analysis in `sklearn` is, we will run a **linear regression**. First, we will demonstrate it on our trivial data set, then you will do it by yourself on your data set. 

Let us first **import** from the **package** its **linear model** or install it again if necessary. 

In [None]:
from sklearn import linear_model

The first step is to prepare the underlying data as `numpy` arrays, where the **dependent variable** is usually stored in `Y` and the **independent variables** are stored in `X`. The next step is to create a **linear regression model** using `sklearn`. This model is computed by passing the independent and dependent variables to the method `fit()`. To get some **results** of the model, you can retrieve the intercept from `intercept_` and the coefficients from `coef_`, and evaluate the fit, e.g. with the method `score()`. 

Let us run our first **linear regression**. 

In [None]:
# create array with values and same data type
fruits = np.array([[0.2, 1, 0.5, 1.2], 
                   [3, 3.4, None, 4], 
                   [3, 3.5, 4, 4.5]],
                  dtype=float)

# prepare dependent variable melon
Y = fruits[2]

# prepare independent variable apple
X = fruits[0].reshape(-1, 1)

# create linear regression model
regression = linear_model.LinearRegression()

# compute linear regression model
regression.fit(X, Y)

# print intercept
print("Intercept is", regression.intercept_)

# print coefficients
print("Coefficients are", regression.coef_)
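
Once a model is fitted, it can also be used to predict and to evaluate. The following sketch repeats the regression of melon prices on apple prices and then calls the methods `predict()` and `score()` of the fitted model. The apple price of 0.8 is an arbitrary value chosen for this example, and `score()` returns the coefficient of determination R² on the data passed to it.

```python
import numpy as np
from sklearn import linear_model

# prepare independent variable apple and dependent variable melon
X = np.array([0.2, 1, 0.5, 1.2]).reshape(-1, 1)
Y = np.array([3, 3.5, 4, 4.5])

# create and compute linear regression model
regression = linear_model.LinearRegression()
regression.fit(X, Y)

# predict the melon price for a hypothetical apple price of 0.8
predicted = regression.predict(np.array([[0.8]]))

# compute the coefficient of determination on the training data
r_squared = regression.score(X, Y)

# print results
print("Predicted price is", predicted[0])
print("R squared is", r_squared)
```

With only four observations these numbers are not meaningful; the point is the sequence of calls.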

<div class="alert alert-block alert-info">
    <b>Exercise</b>: Find out how many flights flew from one of New York's airports (EWR, LGA, JFK) to Los Angeles (LAX) according to the dataset. 
</div>

<div class="alert alert-block alert-info">
    <b>Exercise</b>: Find out the average arrival delay for each carrier in the dataset. 
</div>

<div class="alert alert-block alert-info">
    <b>Exercise</b>: Run a linear regression for the effect of delay at departure on flight speed. Compute the speed based on the distance and airtime. Ignore flights with a missing value in any necessary value. 
</div>

## 3.5 Data Visualization

### Matplotlib

The package `matplotlib` is the de facto standard **visualization** tool in Python. It allows us to create many different kinds of visualizations and get further insights into our data. You can also use `matplotlib` directly from `pandas`, but we will use it as a standalone package for the sake of clarity. 

A figure in `matplotlib` consists in principle of the following components:

| Component | Description |
| -------- | ------- |
| Figure | canvas which contains one or more axes |
| Axes | plot with one axis per dimension |
| Axis | number line like object |



Let us first **import pyplot** from the **package** or install it again if necessary. 

In [None]:
import matplotlib.pyplot as plt

For our first **line plot**, we have to **prepare** our **data** in such a way that all x coordinates are in one array and all y coordinates in another. Then we pass these two arrays to the **function** `plot()`, which **creates** the actual plot. Note that the points are connected in the order in which they appear in the arrays. With the **function** `show()` we **display** all previously created plots. 

Let us make a **minimalistic line plot**. 

In [None]:
# create array with values and same data type
fruits = np.array([[0.2, 1, 0.5, 1.2], 
                   [3, 3.4, None, 4], 
                   [3, 3.5, 4, 4.5]],
                  dtype=float)

# prepare variable apple
X = fruits[0]

# prepare variable melon
Y = fruits[2]

# create plot
plt.plot(X, Y)

# show plot
plt.show()

To make our plot more presentable, we want to add a **title**, **label** the **axes** and add a **legend**. For this you can use the **functions** `title()`, `xlabel()`, `ylabel()` and `legend()`. So that `matplotlib` knows what to call your **line**, you have to pass a **label** to the function `plot()`, which will then appear in the legend. 

Let us add some **labels** to our line plot. 

In [None]:
# create array with values and same data type
fruits = np.array([[0.2, 1, 0.5, 1.2], 
                   [3, 3.4, None, 4], 
                   [3, 3.5, 4, 4.5]],
                  dtype=float)

# prepare variable apple
X = fruits[0]

# prepare variable melon
Y = fruits[2]

# create plot
plt.plot(X, Y, label="Price of melon")

# add title
plt.title("Supermarket Prices")

# add axis labels
plt.xlabel("Price of apple")
plt.ylabel("Price of melon")

# add legend
plt.legend()

# show plot
plt.show()

`Matplotlib` draws a solid line by default and selects different colors based on a predefined color list. We can change this by setting some **arguments** like `color`, `marker`, `linestyle` and `linewidth`. Most **colors** are available by their name, e.g. `red`, `green`, `blue`, `black` and `white`. Some common **markers** are point `.`, square `s` and diamond `D`. Some common **line styles** are `solid`, `dashed`, `dashdot` and `dotted`.

Let us **personalize** the standard style. 

In [None]:
# create array with values and same data type
fruits = np.array([[0.2, 1, 0.5, 1.2], 
                   [3, 3.4, None, 4], 
                   [3, 3.5, 4, 4.5]],
                  dtype=float)

# prepare variable apple
X = fruits[0]

# prepare variable melon
Y = fruits[2]

# create plot
plt.plot(X, Y, color="green", marker="D", linestyle="solid", linewidth=3, label="Price of melon")

# add title
plt.title("Supermarket Prices")

# add axis labels
plt.xlabel("Price of apple")
plt.ylabel("Price of melon")

# add legend
plt.legend()

# show plot
plt.show()

You can also draw other **types** of **plots** besides line plots, e.g. bar plots, scatter plots, histograms and pie charts. To draw a different type of plot, you simply replace the **function** `plot()` with the desired one. In the following we will draw a simple **bar plot** with the **method** `bar()`. There are many more types, and plots can become arbitrarily complex. 

Additionally, we will **save** this **plot** with the **function** `savefig()`, which takes the **file path** where the plot should be stored as argument. 

Let us make a **simple bar plot**. 

In [None]:
# create array with values and same data type
fruits = np.array([[0.2, 1, 0.5, 1.2], 
                   [3, 3.4, None, 4], 
                   [3, 3.5, 4, 4.5]],
                  dtype=float)

# create array with supermarkets
markets = ["Aldi", "Alnatura", "Penny", "Edeka"]

# prepare variable apple
X = fruits[0]

# prepare variable melon
Y = fruits[2]

# create subplots
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8,4))

# create plot
ax[0].plot(X, Y, color="green", marker="D", linestyle="solid", linewidth=3, label="Price of melon")
ax[0].set(xlabel="Price of apple", ylabel="Price of melon")
ax[0].legend()

# create bar plot
ax[1].bar(markets, Y, color="green", label="Price of melon")
ax[1].set(xlabel="Supermarket", ylabel="Price of melon")
ax[1].legend()

# add title
fig.suptitle("Supermarket Prices")

# save plot
plt.savefig("test.png")

# show plot
plt.show()
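
As one more plot type, here is a sketch of a **scatter plot** of the same data, drawn with the method `scatter()` of an axes object. A scatter plot does not connect the points, so the order of the values does not matter. The file name `scatter.png` is only a placeholder chosen for this example.

```python
import matplotlib.pyplot as plt
import numpy as np

# prepare variables apple and melon
X = np.array([0.2, 1, 0.5, 1.2])
Y = np.array([3, 3.5, 4, 4.5])

# create figure with a single axes
fig, ax = plt.subplots(figsize=(4, 4))

# create scatter plot with one point per supermarket
ax.scatter(X, Y, color="green", marker="D", label="Supermarket")

# add title, axis labels and legend
ax.set(title="Supermarket Prices", xlabel="Price of apple", ylabel="Price of melon")
ax.legend()

# save plot under a placeholder file name
fig.savefig("scatter.png")
```

In a notebook you would add `plt.show()` at the end to display the figure, as in the cells above.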

<div class="alert alert-block alert-info">
    <b>Exercise</b>: Find the average arrival delay per month in your dataset. Plot the average delay for each month as a bar plot. Label your plot. Use the packages pandas, numpy and matplotlib.  
</div>

<div class="alert alert-block alert-info">
    <b>Exercise</b>: Find the average arrival delay for each carrier. Select the carrier with the lowest and highest delay and plot their average delays for each month next to each other in two bar plots. Label your plot. Use the packages pandas, numpy and matplotlib.
</div>