# Feature engineering in Pandas

## Loading/Exploring the data

Load the `iris.csv` file from this repo into a pandas DataFrame. Take a minute to familiarize yourself with the data.

## Import Pandas

Import the `pandas` library as `pd`

In [92]:
import pandas as pd
import numpy as np

Read the `../data/iris.csv` dataset into a DataFrame named `data`

In [33]:
data = pd.read_csv('../data/iris.csv')
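
A first look at a freshly loaded frame might go as follows. This is a minimal sketch, not part of the original solution: it reads a small inline stand-in (the name `sample` and the handful of rows are illustrative assumptions) so that it runs without the CSV file, and it assumes the column names shown in the outputs of this notebook.

```python
import io

import pandas as pd

# Stand-in for ../data/iris.csv: a few rows in the same column layout.
# The real file has 150 rows; these rows are for illustration only.
csv_text = """sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)       # (number of rows, number of columns)
print(sample.dtypes)      # data type of each column
print(sample.head())      # first rows of the frame
print(sample.describe())  # summary statistics for the numeric columns
```

The same four calls work unchanged on the full `data` frame.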


How many different species are in this dataset?

In [34]:
species_types = data['species'].unique()
len(species_types)


3

What are their names?

In [35]:
species_types

array(['setosa', 'versicolor', 'virginica'], dtype=object)

How many samples are there per species?

<details><summary>Hint</summary>Use the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html"><code>.value_counts()</code></a> method</details>

In [62]:
data['species'].value_counts()



versicolor    50
virginica     50
setosa        50
Name: species, dtype: int64
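
The counting steps above can also be sketched on a toy Series. This is an illustrative example with made-up values (the `species` Series below is an assumption, not the real data); it shows `.nunique()` as a shortcut for `len(.unique())` and `.sort_index()` for a stable label order when counts are tied.

```python
import pandas as pd

# Toy species column; the real column has 50 samples per species.
species = pd.Series(
    ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"],
    name="species",
)

n_species = species.nunique()     # number of distinct values
counts = species.value_counts()   # samples per species, largest count first
alphabetical = counts.sort_index()  # same counts, ordered by species name

print(n_species)
print(alphabetical)
```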

## Feature Engineering

Create a new column called `'sepal_ratio'` which is equal to sepal width / sepal length

In [69]:
# Element-wise division of the two sepal columns
data['sepal_ratio'] = data['sepal width (cm)'] / data['sepal length (cm)']
data['sepal_ratio'].head()

0    0.686275
1    0.612245
2    0.680851
3    0.673913
4    0.720000
Name: sepal_ratio, dtype: float64
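
Dividing one column by another is applied row by row, so no loop or `.apply()` is needed. The sketch below reproduces the first two ratios on a two-row frame; the name `df` is a stand-in for `data`, and the two rows are taken from the sample output shown later in this notebook.

```python
import pandas as pd

# Two rows copied from the notebook's sample output.
df = pd.DataFrame({
    "sepal width (cm)": [3.5, 3.0],
    "sepal length (cm)": [5.1, 4.9],
})

# Series / Series divides matching rows: 3.5 / 5.1, then 3.0 / 4.9.
df["sepal_ratio"] = df["sepal width (cm)"] / df["sepal length (cm)"]

print(df["sepal_ratio"].round(6).tolist())
```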

Create a similar column called `'petal_ratio'`: petal width / petal length

In [72]:
data['petal_ratio'] = data['petal width (cm)'] / data['petal length (cm)']
data['petal_ratio'].head(5)

0    0.142857
1    0.142857
2    0.153846
3    0.133333
4    0.142857
Name: petal_ratio, dtype: float64

Create four columns that correspond to `sepal length (cm)`, `sepal width (cm)`, `petal length (cm)`, and `petal width (cm)`, but measured in inches.

In [78]:
# 1 cm is approximately 0.393701 in
data['sepal length (in)'] = data['sepal length (cm)'] * 0.393701
data['sepal width (in)'] = data['sepal width (cm)'] * 0.393701
data['petal length (in)'] = data['petal length (cm)'] * 0.393701
data['petal width (in)'] = data['petal width (cm)'] * 0.393701
data.head(5)

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | sepal_ratio | petal_ratio | sepal length (in) | sepal width (in) | petal length (in) | petal width (in) |
|---|-------------------|------------------|-------------------|------------------|---------|-------------|-------------|-------------------|------------------|-------------------|------------------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.686275 | 0.142857 | 2.007875 | 1.377954 | 0.551181 | 0.07874 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0.612245 | 0.142857 | 1.929135 | 1.181103 | 0.551181 | 0.07874 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.680851 | 0.153846 | 1.850395 | 1.259843 | 0.511811 | 0.07874 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0.673913 | 0.133333 | 1.811025 | 1.220473 | 0.590552 | 0.07874 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0.72 | 0.142857 | 1.968505 | 1.417324 | 0.551181 | 0.07874 |
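
The four near-identical assignments can be collapsed into a loop over the centimetre columns. This is one possible sketch rather than the notebook's own approach: `df` and `CM_PER_INCH` are names introduced here, and the code divides by the exact definition 2.54 cm per inch instead of multiplying by the rounded factor 0.393701, so results can differ from the table above in the last displayed digit.

```python
import pandas as pd

CM_PER_INCH = 2.54  # exact by definition; 1 / 2.54 is roughly 0.393701

# One-row stand-in for the iris measurements.
df = pd.DataFrame({
    "sepal length (cm)": [5.1],
    "sepal width (cm)": [3.5],
    "petal length (cm)": [1.4],
    "petal width (cm)": [0.2],
})

# Collect the names first so the loop does not see the columns it adds.
cm_columns = [col for col in df.columns if col.endswith("(cm)")]
for col in cm_columns:
    inch_col = col.replace("(cm)", "(in)")
    df[inch_col] = df[col] / CM_PER_INCH

print(df.columns.tolist())
```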


## Apply

Create a column called `'encoded_species'`:
- 0 for setosa
- 1 for versicolor
- 2 for virginica


<details><summary>Hint 1</summary>
Create a dictionary using the species as keys and the numbers 0-2 for values
</details>

<details><summary>Hint 2</summary>
    Use the dictionary in hint 1 with the <code>.apply()</code> method to create the new column
</details>

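In [94]:
def encode_species(name):
    # Map each species name to its integer code
    if name == 'setosa':
        return 0
    elif name == 'versicolor':
        return 1
    elif name == 'virginica':
        return 2

data['encoded_species'] = data['species'].apply(encode_species)
data[['species', 'encoded_species']].head(10)


|   | species | encoded_species |
|---|---------|-----------------|
| 0 | setosa  | 0               |
| 1 | setosa  | 0               |
| 2 | setosa  | 0               |
| 3 | setosa  | 0               |
| 4 | setosa  | 0               |
| 5 | setosa  | 0               |
| 6 | setosa  | 0               |
| 7 | setosa  | 0               |
| 8 | setosa  | 0               |
| 9 | setosa  | 0               |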
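
The two hints describe a dictionary-based route, which might look like the sketch below. The names `species_codes`, `encoded_apply`, and `encoded_map` are introduced here for illustration, and the four-element Series stands in for `data['species']`. The `.map()` variant is shown as an alternative: it accepts the dictionary directly.

```python
import pandas as pd

# Toy stand-in for data['species'].
species = pd.Series(["setosa", "versicolor", "virginica", "setosa"], name="species")

# Hint 1: species names as keys, integer codes as values.
species_codes = {"setosa": 0, "versicolor": 1, "virginica": 2}

# Hint 2: look each name up in the dictionary via .apply().
encoded_apply = species.apply(lambda name: species_codes[name])

# Alternative: .map() takes the dictionary itself; names missing
# from the dictionary would become NaN instead of raising KeyError.
encoded_map = species.map(species_codes)

print(encoded_apply.tolist())
print(encoded_map.tolist())
```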


## March Madness

Let's change up the dataset to something different than flowers: March Madness!

Read in the dataset `../data/ncaa-seeds.csv` to an object named `seeds`.

This dataframe simulates the games that will occur in the first round of the [NCAA basketball tournament](http://www.sportingnews.com/au/ncaa-basketball/news/ncaa-tournament-2017-march-madness-bracket-schedule-matchups-print-a-bracket/1r6cau9sb1xj4131zzhay2dj5g). In the first row, you should see the following:

| team_seed | opponent_seed |
|-----------|---------------|
| 01N       | 16N           |

In [93]:
seeds = pd.read_csv('../data/ncaa-seeds.csv')

In `team_seed`, `01` is the team's seed and `N` is its division (North). This row says that the 1st seed in the North division will play the 16th seed in the same division.

Using the `.apply()` method, create the following new columns:
- `team_division`
- `opponent_division`

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division |
|-----------|---------------|---------------|-------------------|
| 01N       | 16N           | N             | N                 |
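
One way to approach this step is sketched below, assuming every seed string ends in a single division letter, as in the `01N` example. The frame is built inline so the snippet runs without the CSV; its second row and the helper name `get_division` are hypothetical.

```python
import pandas as pd

# First row matches the example above; the second row is made up.
seeds = pd.DataFrame({
    "team_seed": ["01N", "02N"],
    "opponent_seed": ["16N", "15N"],
})

def get_division(seed):
    """Return the trailing division letter, e.g. '01N' -> 'N'."""
    return seed[-1]

seeds["team_division"] = seeds["team_seed"].apply(get_division)
seeds["opponent_division"] = seeds["opponent_seed"].apply(get_division)

print(seeds.head(1))
```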


Now that you have the divisions, change the `team_seed` and `opponent_seed` columns to just be the numbers.

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division |
|-----------|---------------|---------------|-------------------|
| 1         | 16            | N             | N                 |
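
A possible sketch for stripping the division letter follows, again on an inline frame with a hypothetical second row and a hypothetical helper name, `get_seed_number`. It assumes the division has already been copied into its own column, because this step overwrites the original strings. Converting with `int()` also drops the leading zero and leaves numeric columns.

```python
import pandas as pd

# State after the previous step; the second row is made up.
seeds = pd.DataFrame({
    "team_seed": ["01N", "02N"],
    "opponent_seed": ["16N", "15N"],
    "team_division": ["N", "N"],
    "opponent_division": ["N", "N"],
})

def get_seed_number(seed):
    """Drop the trailing division letter and convert, e.g. '01N' -> 1."""
    return int(seed[:-1])

seeds["team_seed"] = seeds["team_seed"].apply(get_seed_number)
seeds["opponent_seed"] = seeds["opponent_seed"].apply(get_seed_number)

print(seeds.head(1))
```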

Create a new column called `seed_delta`, which is the team's seed minus the opponent's seed.

The first row of your result should look as follows:

| team_seed | opponent_seed | team_division | opponent_division | seed_delta |
|-----------|---------------|---------------|-------------------|------------|
| 1         | 16            | N             | N                 | -15        |

<br>
<details><summary>Did you get an error?</summary>
<code>team_seed</code> and <code>opponent_seed</code> need to be numerical columns in order for you to perform mathematical operations on them.
</details>
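
The situation the hint describes can be sketched as follows. The frame is an illustrative stand-in with a made-up second row in which the seeds are still stored as strings; `.astype(int)` converts them before the subtraction.

```python
import pandas as pd

# Seeds stored as strings, as they would be if the division letter
# had been removed without converting to a number.
seeds = pd.DataFrame({
    "team_seed": ["1", "2"],
    "opponent_seed": ["16", "15"],
})

# Convert both columns to integers so arithmetic is possible.
seeds["team_seed"] = seeds["team_seed"].astype(int)
seeds["opponent_seed"] = seeds["opponent_seed"].astype(int)

# Team's seed minus opponent's seed: 1 - 16 = -15 for the first row.
seeds["seed_delta"] = seeds["team_seed"] - seeds["opponent_seed"]

print(seeds)
```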