## The Python Numerical Stack

The stack consists of:

- numpy/scipy (vectors and computational mathematics)
- pandas (dataframes)
- matplotlib (plotting)
- seaborn (statistical plotting)
- scikit-learn (machine learning)

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline

**Note:** We typically import only what we need from scikit-learn, e.g.

In [None]:
from sklearn.linear_model import LinearRegression

## The Iris Dataset

> This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day.
          — [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/iris)



In [None]:
from sklearn.datasets import load_iris

> The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper *The use of multiple measurements in taxonomic problems* as an example of linear discriminant analysis.
          — [Wikipedia](https://en.wikipedia.org/wiki/Iris_flower_data_set)



In [None]:
IRIS = load_iris()

In [None]:
type(IRIS.data)

In [None]:
IRIS.data.shape

#### What does `.shape` do?

In [None]:
# your code here
raise NotImplementedError

## Dataframes

We will load the Iris data into a dataframe for ease of manipulation.

In [None]:
iris_df = pd.DataFrame(IRIS.data, columns=IRIS.feature_names)

In [None]:
iris_df.head()


## Pair Plot

We will use Seaborn to prepare a **Pair Plot** of the Iris dataset. A Pair Plot is a grid of scatter plots, one for each pair of features in the data. Rather than plotting a feature against itself, each diagonal cell shows the **distribution** of that single feature (drawn as a histogram or a density estimate).


In [None]:
sns.pairplot(iris_df)
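To make the layout of the grid concrete, here is a minimal sketch that enumerates the cells of a pair plot in plain Python, without drawing anything. The feature names are typed in by hand as an assumption about what `IRIS.feature_names` contains, and the variable names `grid`, `diagonal` and `off_diagonal` are chosen here for illustration only.

```python
from itertools import product

# Feature names written out by hand (assumed to match IRIS.feature_names).
features = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]

# Every (row, column) cell of the grid: the row feature goes on the y-axis,
# the column feature on the x-axis.
grid = list(product(features, repeat=2))

# A feature paired with itself: shown as a single-feature distribution.
diagonal = [(y, x) for y, x in grid if y == x]

# Two different features: shown as a scatter plot.
off_diagonal = [(y, x) for y, x in grid if y != x]

print(len(grid), len(diagonal), len(off_diagonal))
```

With four features this gives a 4 x 4 grid. Each pair of distinct features appears twice, once on either side of the diagonal with the axes swapped.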

## List Comprehension

We will use a **list comprehension** to remove the units and white space from the feature names to make them more "computer-friendly".


In general, list comprehensions have this form:

```python
lc = [do_something_to(var) for var in some_other_list]
```


In [None]:
def square_number(x):
    return x**2

In [None]:
[square_number(i) for i in (1,2,3,4,5)]
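It may help to see the same computation written as an explicit loop. The sketch below repeats the `square_number` helper so that it runs on its own; the names `squares_lc` and `squares_loop` are introduced here for illustration.

```python
def square_number(x):
    return x ** 2

# List comprehension form.
squares_lc = [square_number(i) for i in (1, 2, 3, 4, 5)]

# Equivalent explicit loop: start with an empty list and append one
# result per element.
squares_loop = []
for i in (1, 2, 3, 4, 5):
    squares_loop.append(square_number(i))

print(squares_lc)
print(squares_lc == squares_loop)
```

Both forms build the same list; the comprehension simply folds the loop and the `append` into a single expression.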

### Write your own list comprehension

Write a function named `incr_list_by_1` that uses a list comprehension to change this list

    [1,2,3,4,5]
    
into this list

    [2,3,4,5,6]

In [None]:
# your code here
raise NotImplementedError

In [None]:
assert incr_list_by_1([1,2,3,4,5]) == [2,3,4,5,6]


### Remove Unit and White Space from Feature Name

Here we use a list comprehension to change the feature names:

In [None]:
IRIS.feature_names

In [None]:
iris_features_names = IRIS.feature_names
iris_features_names

In [None]:
def remove_unit_and_white_space(feature_name):
    feature_name = feature_name.replace(' (cm)','')
    feature_name = feature_name.replace(' ', '_')
    return feature_name

In [None]:
iris_features_names = [remove_unit_and_white_space(name) for name in iris_features_names]

In [None]:
iris_features_names

In [None]:
iris_df.columns = iris_features_names
iris_df.head()
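As a quick check of what "computer-friendly" means here, the following sketch applies the same helper to the feature names written out by hand (an assumption about the contents of `IRIS.feature_names`) and tests whether each cleaned name is a valid Python identifier. The names `raw_names` and `cleaned` are illustrative.

```python
def remove_unit_and_white_space(feature_name):
    feature_name = feature_name.replace(' (cm)', '')
    feature_name = feature_name.replace(' ', '_')
    return feature_name

# Feature names typed in by hand (assumed to match IRIS.feature_names).
raw_names = [
    'sepal length (cm)',
    'sepal width (cm)',
    'petal length (cm)',
    'petal width (cm)',
]

cleaned = [remove_unit_and_white_space(name) for name in raw_names]
print(cleaned)

# Names without spaces or parentheses are valid Python identifiers.
print(all(name.isidentifier() for name in cleaned))
```

Because the cleaned names are valid identifiers, pandas also allows attribute-style column access such as `iris_df.sepal_length`, in addition to `iris_df['sepal_length']`.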

## Export to CSV

Finally, we will export the dataframe to disk as a CSV file. This will make it easy to access the same data from both Python and R.


In [None]:
%ls

In [None]:
%mkdir -p data

In [None]:
%ls

In [None]:
iris_df.to_csv('data/iris.csv')
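One detail worth knowing: by default `to_csv` also writes the row index as an extra, unnamed first column. The sketch below illustrates this on a small hand-made frame standing in for `iris_df` (the values and the names `df`, `with_index`, `without_index` and `round_trip` are illustrative assumptions). It writes to strings rather than to disk.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for iris_df.
df = pd.DataFrame({"sepal_length": [5.1, 4.9], "sepal_width": [3.5, 3.0]})

# Default export: the row index becomes an unnamed first column.
with_index = df.to_csv()

# index=False leaves the row index out of the file.
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading the default export back produces an extra "Unnamed: 0" column
# unless index_col=0 is passed to read_csv.
round_trip = pd.read_csv(io.StringIO(with_index))
print(list(round_trip.columns))
```

Whether to keep the index is a judgment call. Dropping it with `index=False` gives a file containing only the feature columns, which may be simpler to read from R. Keeping it preserves the row numbers.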