# Feature engineering in pandas

## Loading/Exploring the data

Load the iris.csv file from this repo into a pandas dataframe. Take a minute to familiarize yourself with the data.

In [2]:
import pandas as pd
iris=pd.read_csv('iris.csv')

How many different species are in this dataset?

In [3]:
iris['species'].nunique()

3

What are their names?

In [4]:
iris['species'].unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

How many samples are there per species?

<details><summary>Hint</summary>Use the [value_counts](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) method</details>

In [5]:
len(iris)

150

In [6]:
iris['species'].value_counts()

setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64

## Broadcasting

Create a new column called `'sepal_ratio'` which is equal to sepal width / sepal length

In [7]:
iris.head()

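|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|-------------------|------------------|-------------------|------------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |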


In [8]:
sepal_ratio=iris['sepal width (cm)']/iris['sepal length (cm)']

Create a similar column called `'petal_ratio'`: petal width / petal length

In [9]:
petal_ratio=iris['petal width (cm)']/iris['petal length (cm)']
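The exercise asks for new columns, so the result of the division can be assigned straight back onto the dataframe. Here is a minimal sketch of that idea; `demo` is a hypothetical three-row stand-in for the iris data (made-up numbers, same column names), not the real file.

```python
import pandas as pd

# Hypothetical stand-in for the iris dataframe; the values are made up.
demo = pd.DataFrame({
    'sepal length (cm)': [5.0, 4.0, 6.0],
    'sepal width (cm)': [2.5, 3.0, 3.0],
    'petal length (cm)': [2.0, 1.0, 4.0],
    'petal width (cm)': [0.5, 0.25, 1.0],
})

# Dividing one Series by another works element by element, aligned on the index.
# Assigning to a new key stores the result as a column.
demo['sepal_ratio'] = demo['sepal width (cm)'] / demo['sepal length (cm)']
demo['petal_ratio'] = demo['petal width (cm)'] / demo['petal length (cm)']

print(demo[['sepal_ratio', 'petal_ratio']])
```

The same two assignments should work on `iris` itself, assuming its column names match the ones shown in `iris.head()`.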

Since we're in 'Murica, create 4 columns that correspond to **sepal length (cm)**, **sepal width (cm)**, **petal length (cm)**, and **petal width (cm)**, only in inches.

In [10]:
inch=0.393701
sepal_length=iris['sepal length (cm)']*inch
sepal_width=iris['sepal width (cm)']*inch
petal_length=iris['petal length (cm)']*inch
petal_width=iris['petal width (cm)']*inch
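Multiplying a column by a single number is the broadcasting part: the scalar is applied to every row. As a rough sketch, the four conversions can also be written as a loop that adds one inch column per centimeter column. The `demo` dataframe and the `(in)` column names are assumptions for illustration.

```python
import pandas as pd

CM_TO_INCH = 0.393701  # centimeters to inches

# Hypothetical two-row stand-in for the iris data.
demo = pd.DataFrame({
    'sepal length (cm)': [5.1, 4.9],
    'sepal width (cm)': [3.5, 3.0],
    'petal length (cm)': [1.4, 1.4],
    'petal width (cm)': [0.2, 0.2],
})

# Collect the centimeter columns first, so the loop does not see the new ones.
cm_cols = [col for col in demo.columns if col.endswith('(cm)')]

for col in cm_cols:
    # The scalar is broadcast across the whole column.
    demo[col.replace('(cm)', '(in)')] = demo[col] * CM_TO_INCH

print(demo.columns.tolist())
```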

## Mapping

Create a column called `'encoded_species'`:
- 0 for setosa
- 1 for versicolor
- 2 for virginica


<details><summary>Hint 1</summary>
Create a dictionary using the species as keys and the numbers 0-2 for values
</details>

<details><summary>Hint 2</summary>
Use the dictionary in hint 1 with the map method to create the new column
</details>

In [11]:
species_dict={'setosa':0,'versicolor':1,'virginica':2}
iris['encoded_species']=iris['species'].map(species_dict)
iris.head()

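|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species | encoded_species |
|---|-------------------|------------------|-------------------|------------------|---------|-----------------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 0 |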
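One thing worth knowing about `map` with a dictionary: any value that is not a key in the dictionary comes back as `NaN`. The sketch below shows this with a made-up `'unknown'` label, and also shows `pd.Categorical` as one possible alternative that marks unknown labels with `-1` instead.

```python
import pandas as pd

# 'unknown' is a made-up label, added only to show what happens to unmapped values.
species = pd.Series(['setosa', 'virginica', 'versicolor', 'unknown'])
species_dict = {'setosa': 0, 'versicolor': 1, 'virginica': 2}

encoded = species.map(species_dict)

# Labels missing from the dictionary become NaN, so the result holds floats.
unmapped = species[encoded.isna()]
print(unmapped.tolist())

# Alternative: fix the category order explicitly and read off integer codes.
# Labels outside the listed categories get the code -1.
codes = pd.Categorical(species, categories=list(species_dict)).codes
print(list(codes))
```

Checking `encoded.isna().sum()` after a `map` is a cheap way to catch typos in the dictionary keys.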


## Apply

Let's change up the dataset to something way cooler than flowers: March Madness!

Load `ncaa-seeds.csv` into pandas. This dataframe simulates the games that will occur in the first round of the [NCAA basketball tournament](http://www.sportingnews.com/au/ncaa-basketball/news/ncaa-tournament-2017-march-madness-bracket-schedule-matchups-print-a-bracket/1r6cau9sb1xj4131zzhay2dj5g). In the first row, you should see the following:

| team_seed | opponent_seed |
|-----------|---------------|
| 01N       | 16N           |

In `team_seed`, the `01` is the team's seed and the `N` is its division (North). This row says that the 1st seed in the North division will play the 16th seed from the same division.

Using the `apply` method, create the following new columns:
- team_division
- opponent_division

In [None]:
ncaa=pd.read_csv("ncaa-seeds.csv")

In [20]:

def get_location(seed):
    # apply passes one seed string at a time, e.g. '01N'.
    # The division letter (N, S, E or W) is its last character.
    return seed[-1]
ncaa['team_location']=ncaa['team_seed'].apply(get_location)

In [21]:

# The opponent seeds have the same format, so get_location can be reused as is.
ncaa['opponent_location']=ncaa['opponent_seed'].apply(get_location)

In [22]:
ncaa.head()

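|   | team_seed | opponent_seed | location | team_location | opponent_location |
|---|-----------|---------------|----------|---------------|-------------------|
| 0 | 01N | 16N | N | N | N |
| 1 | 02N | 15N | N | N | N |
| 2 | 03N | 14N | N | N | N |
| 3 | 04N | 13N | N | N | N |
| 4 | 05N | 12N | N | N | N |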
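`apply` calls the function once per value in the column, so the function receives a single seed string each time. For simple string slicing like this, the `.str` accessor can do the same job without a custom function. A small sketch comparing the two, using a hypothetical three-row dataframe and a hypothetical helper `last_char`:

```python
import pandas as pd

# Hypothetical seeds in the same format as ncaa-seeds.csv.
demo = pd.DataFrame({
    'team_seed': ['01N', '08S', '16W'],
    'opponent_seed': ['16N', '09S', '01W'],
})

def last_char(seed):
    """Return the division letter, assumed to be the final character."""
    return seed[-1]

# Option 1: apply runs last_char on each value in turn.
via_apply = demo['team_seed'].apply(last_char)

# Option 2: the .str accessor slices every string in the column at once.
via_str = demo['team_seed'].str[-1]

print(via_apply.tolist())
print(via_str.tolist())
```

Both approaches give the same letters here; `apply` becomes the better fit once the per-row logic is too involved for the `.str` methods.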


Now that you have the divisions, change the team_seed and opponent_seed columns to just be the numbers.

In [23]:
ncaa['team_seed']=ncaa['team_seed'].str[:-1].astype(int)

In [25]:
ncaa['opponent_seed']=ncaa['opponent_seed'].str[:-1].astype(int)
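Slicing with `.str[:-1]` relies on the division always being exactly one trailing character. If you would rather spell out the expected format, `str.extract` with a regular expression can pull the number and the letter apart in one step. This is only a sketch on made-up seeds, and the pattern assumes digits followed by one of N, S, E, W.

```python
import pandas as pd

# Hypothetical seeds in the 'digits + division letter' format.
seeds = pd.Series(['01N', '16N', '08S'])

# Named groups become the column names of the resulting dataframe.
parts = seeds.str.extract(r'^(?P<seed>\d+)(?P<division>[NSEW])$')

# The extracted digits are still strings, so convert them before doing math.
parts['seed'] = parts['seed'].astype(int)

print(parts)
```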

Create a new column called `seed_delta`, which is the difference between the team's seed and their opponent's.

For example, the `seed_delta` in the first row will be the result of 1 - 16, which is -15.

<details><summary>Did you get an error?</summary>
team_seed and opponent_seed need to be numerical columns in order for you to perform mathematical operations on them.
</details>

In [27]:
ncaa['seed_delta']=ncaa['team_seed']-ncaa['opponent_seed']
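If the seed columns might still contain something that is not a number, `pd.to_numeric` with `errors='coerce'` is one way to convert them before subtracting: values that cannot be parsed become `NaN` instead of raising. A sketch with made-up data, where `'bye'` is an invented bad value:

```python
import pandas as pd

# Hypothetical seeds stored as strings; 'bye' is a made-up unparseable value.
raw = pd.DataFrame({
    'team_seed': ['01', '02', 'bye'],
    'opponent_seed': ['16', '15', '14'],
})

# errors='coerce' turns anything that is not a number into NaN.
team = pd.to_numeric(raw['team_seed'], errors='coerce')
opponent = pd.to_numeric(raw['opponent_seed'], errors='coerce')

# Subtraction now works row by row; rows with a NaN stay NaN.
seed_delta = team - opponent
print(seed_delta)
```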

In [28]:
ncaa

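|   | team_seed | opponent_seed | location | team_location | opponent_location | seed_delta |
|---|-----------|---------------|----------|---------------|-------------------|------------|
| 0 | 1 | 16 | N | N | N | -15 |
| 1 | 2 | 15 | N | N | N | -13 |
| 2 | 3 | 14 | N | N | N | -11 |
| 3 | 4 | 13 | N | N | N | -9 |
| 4 | 5 | 12 | N | N | N | -7 |
| 5 | 6 | 11 | N | N | N | -5 |
| 6 | 7 | 10 | N | N | N | -3 |
| 7 | 8 | 9 | N | N | N | -1 |
| 8 | 1 | 16 | S | S | S | -15 |
| 9 | 2 | 15 | S | S | S | -13 |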


## Dummies

Using the pandas `get_dummies` function, create a new dataframe with 4 columns from team_division.

NOTE: Be sure to use 'team_division' as your prefix.

In [51]:
ncaa.dtypes

team_seed             int64
opponent_seed         int64
location             object
team_location        object
opponent_location    object
seed_delta            int64
dtype: object

In [53]:
ncaa_dumm=pd.get_dummies(ncaa['team_location'],prefix='team_division')
ncaa_dumm

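|   | team_division_E | team_division_N | team_division_S | team_division_W |
|---|-----------------|-----------------|-----------------|-----------------|
| 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 |
| 6 | 0 | 1 | 0 | 0 |
| 7 | 0 | 1 | 0 | 0 |
| 8 | 0 | 0 | 1 | 0 |
| 9 | 0 | 0 | 1 | 0 |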


In machine learning, it's common to drop one of the columns and have that be the baseline. Drop 'team_division_E', and append the remaining three columns to your original ncaa dataframe.

In [61]:
ncaa_dumm.drop('team_division_E',axis=1,inplace=True)

In [62]:
ncaa_dumm.columns

Index(['team_division_N', 'team_division_S', 'team_division_W'], dtype='object')
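To append the dummy columns to the original dataframe, `pd.concat` with `axis=1` places the two dataframes side by side, matching rows on the index. The sketch below uses a hypothetical four-row `demo` dataframe, and also uses `drop_first=True`, which drops the first category (here `E`, the first alphabetically) in the same call.

```python
import pandas as pd

# Hypothetical stand-in with one team per division.
demo = pd.DataFrame({'team_location': ['N', 'S', 'E', 'W']})

# drop_first=True leaves out the first category, so E becomes the baseline.
dummies = pd.get_dummies(demo['team_location'], prefix='team_division', drop_first=True)

# axis=1 appends the dummy columns next to the existing ones.
combined = pd.concat([demo, dummies], axis=1)

print(combined.columns.tolist())
```

Note that `pd.concat` returns a new dataframe, so the result has to be assigned to a name to keep it.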

Repeat the previous two steps for opponent_division.

In [63]:
oppo_dumm=pd.get_dummies(ncaa['opponent_location'],prefix='opponent_division')
oppo_dumm.drop('opponent_division_E',axis=1,inplace=True)

In [64]:
oppo_dumm.columns

Index(['opponent_division_N', 'opponent_division_S', 'opponent_division_W'], dtype='object')
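As a possible shortcut, `get_dummies` can also take the whole dataframe plus a `columns` list, which encodes several columns at once and returns them attached to the remaining columns. A sketch under the assumption that the division columns are named as in the cells above; `demo` is a made-up four-row dataframe.

```python
import pandas as pd

# Hypothetical stand-in for the ncaa dataframe.
demo = pd.DataFrame({
    'team_seed': [1, 2, 3, 4],
    'team_location': ['N', 'S', 'E', 'W'],
    'opponent_location': ['N', 'S', 'E', 'W'],
})

encoded = pd.get_dummies(
    demo,
    columns=['team_location', 'opponent_location'],
    # A dict maps each encoded column to the prefix for its dummy columns.
    prefix={'team_location': 'team_division', 'opponent_location': 'opponent_division'},
    drop_first=True,  # drops the E column for both, making E the baseline
)

print(encoded.columns.tolist())
```

The encoded source columns are replaced by their dummies in the result, while untouched columns such as `team_seed` are kept.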