<a href="https://colab.research.google.com/github/zmy2338/Machine-Learning-AWS/blob/main/TRAIN_AWS_P1_Lecture_3_%5BSTUDENT%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lecture #3: Data Wrangling for Machine Learning**
---
**Description:** This lecture notebook provides practice with Pandas data wrangling commands while revisiting the Iris dataset from Day 1.

**About the dataset:** The Iris dataset is one of the best-known datasets in pattern recognition. It contains petal and sepal measurements for three species of Iris flowers, which we will use to classify the species of each flower.

**Goals:** By the end of this notebook, you will:
* Be able to create and encode features to enhance the features available in a given dataset.
* Be able to select the top *k* features using sklearn's `SelectKBest(...)`.

<br>

### **Lab Structure**
**Part 1**: [Feature Engineering](#p1)

> **Part 1.1**: [Feature Creation](#p11)

> **Part 1.2**: [Feature Encoding](#p12)

**Part 2**: [Feature Selection](#p2)

**Part 3**: [[OPTIONAL] Additional Practice](#p3)


<br>


### **Cheat Sheets**
* [Feature Engineering and Selection with Pandas](https://docs.google.com/document/d/191CH-X6zf4lESuThrdIGH6ovzpHK6nb9NRlqSIl30Ig/edit?usp=sharing)

* [Pandas Commands](https://docs.google.com/document/d/1v-MZCgoZJGRcK-69OOu5fYhm58x2G0JUWyi2H53j8Ls/edit)

<br>

**Before beginning the practice problems, run the code below to import and install the necessary libraries, as well as to read the data.**

In [None]:
!pip install scikit-learn

# Import pandas
import pandas as pd

# Import datasets submodule
from sklearn import datasets
from sklearn.feature_selection import SelectKBest

# Load dataset (actual data with associated documentation)
iris = datasets.load_iris()

# Create dataframe
iris_df = pd.DataFrame(data=iris.data,columns=iris.feature_names)

# Add target to dataset
iris_df['target'] = iris.target_names[iris.target]

<a name="p1"></a>

---
## **Part 1: Feature Engineering**
---

<a name="p11"></a>

---
### **Part 1.1: Feature Creation**
---

#### **Exercise #1: Create a new feature `sepal length (s, m, l)`.**
---

Create a new feature called `sepal length (s, m, l)` that designates any flowers with a sepal length:
* Under 5 cm as `'s'` (for small).
* Between 5 and 6 cm (inclusive) as `'m'` (for medium).
* Over 6 cm as `'l'` (for large).
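
The threshold rules above follow a general pattern: select rows with a boolean condition in `.loc` and assign a label to a new column. The sketch below illustrates that pattern on a small made-up DataFrame; the `height (cm)` column, its values, and the cutoffs are assumptions chosen for illustration, not part of the Iris data.

```python
import pandas as pd

# Toy data (hypothetical, unrelated to the Iris dataset)
toy_df = pd.DataFrame({'height (cm)': [150, 165, 172, 181, 190]})

# Each .loc call picks the rows matching a condition and writes a label
# into the new column for those rows only.
toy_df.loc[toy_df['height (cm)'] < 160, 'height (s, m, l)'] = 's'
toy_df.loc[(toy_df['height (cm)'] >= 160) & (toy_df['height (cm)'] <= 180),
           'height (s, m, l)'] = 'm'
toy_df.loc[toy_df['height (cm)'] > 180, 'height (s, m, l)'] = 'l'

print(toy_df)
```

Because the three conditions cover every possible value, no row is left without a label. If a condition were missing, the unmatched rows would hold `NaN` in the new column.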

In [None]:
# sepal length (s, m, l)
iris_df.loc[iris_df['sepal length (cm)'] < 5, 'sepal length (s, m, l)'] = # COMPLETE THIS LINE
iris_df.loc[(iris_df['sepal length (cm)'] >= 5) & # COMPLETE THIS LINE), 'sepal length (s, m, l)'] = 'm'
iris_df.# COMPLETE THIS LINE

iris_df.head()

#### **Exercise #2: Create a new feature `sepal area (cm^2)`.**
---

Create a new feature called `sepal area (cm^2)` that calculates a rough estimate of the sepal area as follows: `sepal area (cm^2)` = `sepal length (cm)` * `sepal width (cm)`.
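
Arithmetic between two DataFrame columns is applied row by row, so a product column can be built in one assignment. Here is a minimal sketch on made-up data; the `width (m)` and `depth (m)` columns are hypothetical stand-ins, not Iris columns.

```python
import pandas as pd

# Toy data (hypothetical): dimensions of three rectangular rooms
toy_df = pd.DataFrame({'width (m)': [2.0, 3.0, 4.5],
                       'depth (m)': [1.5, 2.0, 2.0]})

# Multiplying two columns pairs up values from the same row
toy_df['floor area (m^2)'] = toy_df['width (m)'] * toy_df['depth (m)']

print(toy_df)
```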

In [None]:
# sepal area (cm^2)
iris_df['sepal area (cm^2)'] = # COMPLETE THIS LINE

iris_df.head()

#### **Exercise #3: Create a new feature `sepal width (s, m, l)`.**
---

Create a new feature called `sepal width (s, m, l)` that divides this column into 3 buckets, `'s'`, `'m'`, `'l'`, using `qcut(...)`.
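
For reference, pandas' `pd.qcut` splits a column into quantile-based buckets, so each bucket receives roughly the same number of rows. The sketch below shows the call on a made-up Series; the `score` values and the label names are assumptions for illustration only.

```python
import pandas as pd

# Toy data (hypothetical): nine evenly spread scores
scores = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9], name='score')

# q is the number of quantile buckets; labels names them from lowest to highest
buckets = pd.qcut(scores, q=3, labels=['low', 'mid', 'high'])

print(buckets.value_counts())
```

Unlike fixed thresholds, the bucket edges here come from the data itself, so the same call on a different column will generally produce different cutoffs.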

In [None]:
iris_df['sepal width (s, m, l)'] = pd.qcut(iris_df['sepal width (cm)'], # COMPLETE THIS LINE
                                           
iris_df.head()

#### **Exercise #4: Create a new feature `petal length (s, m, l)`.**
---

Create a new feature called `petal length (s, m, l)` that designates any flowers with a petal length:
* Under 3 cm as `'s'` (for small).
* Between 3 and 5 cm (inclusive) as `'m'` (for medium).
* Over 5 cm as `'l'` (for large).

In [None]:
# petal length (s, m, l)
# COMPLETE THIS CODE

iris_df.head()

#### **Exercise #5: Create a new feature `petal area (cm^2)`.**
---

Create a new feature called `petal area (cm^2)` that calculates a rough estimate of the petal area as follows: `petal area (cm^2)` = `petal length (cm)` * `petal width (cm)`.

In [None]:
# petal area (cm^2)
# COMPLETE THIS CODE

iris_df.head()

#### **Exercise #6: Create a new feature `petal width (s, m, l, xl)`.**
---

Create a new feature called `petal width (s, m, l, xl)` that divides this column into 4 buckets, `'s'`, `'m'`, `'l'`, `'xl'`, using `qcut(...)`.

In [None]:
# COMPLETE THIS CODE

iris_df.head()

<a name="p12"></a>

---
### **Part 1.2: Feature Encoding**
---

#### **Exercise #1: Create an encoded version of the categorical feature `sepal length (s, m, l)`.**
---

Call the feature `sepal length encoded`.
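
Encoding an ordered categorical column usually means replacing each category with a number that preserves the order. One common way to do this in pandas is `Series.map` with a dictionary. The sketch below uses a made-up `medal` column; the column name, categories, and mapping are assumptions for illustration.

```python
import pandas as pd

# Toy data (hypothetical); 'none' is deliberately left out of the mapping
toy_df = pd.DataFrame({'medal': ['bronze', 'gold', 'silver', 'none']})

# Keys are the categories, values are the numbers that replace them
medal_map = {'bronze': 0, 'silver': 1, 'gold': 2}
toy_df['medal encoded'] = toy_df['medal'].map(medal_map)

print(toy_df)
```

Any value that does not appear as a key in the dictionary becomes `NaN`, so it is worth checking that the mapping covers every category in the column.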

In [None]:
size_map = {'s': 0, 'm': 1, 'l': 2}
iris_df['sepal length encoded'] = iris_df['sepal length (s, m, l)'].# COMPLETE THIS LINE

iris_df.head()

#### **Exercise #2: Create an encoded version of the categorical feature `sepal width (s, m, l)`.**
---


Call the feature `sepal width encoded`.

In [None]:
size_map = # COMPLETE THIS LINE
iris_df['sepal width encoded'] = iris_df['sepal width (s, m, l)'].# COMPLETE THIS LINE

iris_df.head()

#### **Exercise #3: Create an encoded version of the categorical feature `petal width (s, m, l, xl)`.**
---


Call the feature `petal width encoded`. **NOTE**: This requires having completed Exercise #6 of the previous part.

#### **Exercise #4: Create an encoded version of the *label* `target`.**
---

Sometimes it is also useful to encode the label using the same methods. Create a new label called `target encoded` such that:
* `'setosa'` goes to 0
* `'versicolor'` goes to 1
* `'virginica'` goes to 2

In [None]:
target_map = {# COMPLETE THIS LINE
iris_df['target encoded'] = iris_df['target'].map(target_map)

iris_df.head()

<a name="p2"></a>

---
## **Part 2: Feature Selection**
---

**Run the code below to organize our data into numerical features and the label.**

In [None]:
features = iris_df.select_dtypes('number')
features = features.drop('target encoded', axis = 1)

label = iris_df['target']

#### **Exercise #1: Select the 3 best features using `SelectKBest(...)`.**
---

In [None]:
feature_selector = SelectKBest(k = # COMPLETE THIS LINE
feature_selector.fit_transform(features, label)

best_features = iris_df[feature_selector.get_feature_names_out()]

best_features.head()

#### **Exercise #2: Select the 5 best features using `SelectKBest(...)`.**
---

In [None]:
# COMPLETE THIS LINE
feature_selector.fit_transform(features, label)

best_features = iris_df[feature_selector.get_feature_names_out()]

best_features.head()

#### **Exercise #3: Select the single best feature using `SelectKBest(...)`.**
---

In [None]:
# COMPLETE THIS LINE
feature_selector.fit_transform(features, label)

best_features = iris_df[feature_selector.get_feature_names_out()]

best_features.head()

<a name="p3"></a>

---
## **Part 3: [OPTIONAL] Additional Practice**
---

You can continue practicing these skills using a dataset containing the top hit for each year from 1999 to 2019 according to Spotify. This dataset contains many features, so it will likely need a lot of feature engineering and selection for any task you are trying to perform.

<br>

**NOTE**: We will be practicing all of this more tomorrow, so do not worry if you don't have time for these problems.

<br>

**Run the code below to load in our data.**

In [None]:
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQ_8wL6mLc01IFXWqv8i4fkVcnaeB0ipMCkKrCKjbKVwM4xCbsSesX7J5aF4k_4lWa6lTEqGxHR9-9A/pub?gid=1132556054&single=true&output=csv'

spotify_df = pd.read_csv(url)

spotify_df.head()

#### **Exercise #1: Create a new feature `duration_s`.**
---

Create a new feature called `duration_s` that converts the `duration_ms` column from milliseconds to seconds. **NOTE**: 1 millisecond is a thousandth of a second.

In [None]:
spotify_df['duration_s'] = # COMPLETE THIS LINE

spotify_df.head()

#### **Exercise #2: Create a new feature `dance energy`.**
---

Create a new feature called `dance energy` that adds the `danceability` and `energy` columns together.

In [None]:
# COMPLETE THIS CODE

#### **Exercise #3: Create a new feature `early or late 2000s`.**
---

Create a new feature called `early or late 2000s` such that:
* Any row where `year` is less than 2005 has the value `'early'`.
* Any row where `year` is greater than or equal to 2005 has the value `'late'`.

In [None]:
# COMPLETE THIS CODE

#### **Exercise #4: Create a new feature `popularity (low, medium, high)`.**
---

Create a new feature called `popularity (low, medium, high)` such that:
* Any row where `popularity` is less than 25 has the value `'low'`.
* Any row where `popularity` is between 25 and 75 (inclusive) has the value `'medium'`.
* Any row where `popularity` is greater than 75 has the value `'high'`.

<br>


**NOTE**: When specifying multiple conditions, it's important to include parentheses to make them very clear. Refer to Part 1.1 for an example of how to do this.
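
The parentheses matter because of Python's operator precedence: `&` binds more tightly than comparisons such as `>=` and `<=`. The sketch below shows both the working form and the failure on a made-up `score` column; the data and cutoffs are assumptions for illustration.

```python
import pandas as pd

# Toy data (hypothetical)
toy_df = pd.DataFrame({'score': [10, 40, 60, 90]})

# With parentheses, each comparison is evaluated first, and the two boolean
# Series are then combined element by element with &.
in_range = (toy_df['score'] >= 25) & (toy_df['score'] <= 75)
print(in_range.tolist())

# Without parentheses, Python evaluates `25 & toy_df['score']` first, which
# turns the line into a chained comparison that pandas cannot reduce to a
# single True/False, so it raises a ValueError.
try:
    toy_df['score'] >= 25 & toy_df['score'] <= 75
    raised = False
except ValueError as err:
    raised = True
    print('ValueError:', err)
```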

In [None]:
# COMPLETE THIS CODE

#### **Exercise #5: Create an encoded version of the categorical feature `explicit`.**
---


Call the feature `explicit encoded` such that `False` becomes 0 and `True` becomes 1.

In [None]:
explicit_map = {False: 0 # COMPLETE THIS CODE

#### **Exercise #6: Create an encoded version of the categorical feature `popularity (low, medium, high)`.**
---


Call the feature `popularity encoded`.

In [None]:
# COMPLETE THIS CODE

#### **Run the code below to separate our data into numerical features and a label.**

**NOTE**: We are using the `popularity encoded` column as our label, meaning we are selecting the best features to predict this variable.

In [None]:
features = spotify_df.select_dtypes('number')
features = features.drop('popularity encoded', axis = 1)

label = spotify_df['popularity encoded']

#### **Exercise #7: Select the single best feature using `SelectKBest(...)`.**
---

#### **Exercise #8: Drop the `popularity` column.**


You should have seen that `popularity` is the single best feature for predicting `popularity encoded`. This makes sense, since we ultimately *made* the `popularity encoded` column from `popularity`. So it's really not meaningful for us to say that `popularity` is a useful feature at all!

So, drop the `popularity` column from the features DataFrame before continuing.

---

In [None]:
features = features.drop(# COMPLETE THIS LINE)

#### **Exercise #9: Select the 3 best features using `SelectKBest(...)`.**
---

# End of notebook
---
© 2023 The Coding School, All rights reserved