![](./feature_engineering.jpg)

# Feature Engineering for Machine Learning Datasets

The book [Feature Engineering for Machine Learning](https://www.oreilly.com/library/view/feature-engineering-for/9781491953235/) by Alice Zheng & Amanda Casari comes with a set of Jupyter notebooks so that you can run the code examples in the book.
These notebooks use publicly available data, but they do not come with instructions for downloading the data onto your system.

This notebook provides detailed instructions for loading the datasets needed to run the notebooks.


## Getting Started

The notebooks for the book are located in the GitHub repository <https://github.com/alicezheng/feature-engineering-book>.  Choose the directory where you would like to store the files on your system and clone or download the repo there.

This notebook will reference your copy of the book repo. You can place this notebook in the same directory.  If you are saving this notebook in a different directory, modify the following cell with the path to the book repo.

In [1]:
REPO_DIR = "./"
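
The cells in this notebook build paths by string concatenation, so `REPO_DIR` has to end with a slash. As a minimal sketch, assuming you would rather use `pathlib`, the helper below joins the path pieces whether or not the trailing slash is present. The name `dataset_path` is hypothetical and is not part of the book repo.

```python
from pathlib import Path

REPO_DIR = "./"  # same value as in the cell above


def dataset_path(*parts, repo_dir=REPO_DIR):
    """Join repo_dir and the given path parts (hypothetical helper)."""
    return Path(repo_dir).joinpath(*parts)


# Folder used for the Yelp data later in this notebook
yelp_dir = dataset_path("data", "yelp", "v6", "yelp_dataset_challenge_academic_dataset")
print(yelp_dir)
```

The remaining cells keep the string-concatenation style, so this helper is optional.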

## Chapter Two

We start with the data used in chapter two.  As we read the rest of the book, this notebook will be updated with instructions for the other datasets needed.

### Yelp Dataset

Since the Feature Engineering book was written, Yelp has updated its public dataset.  I wasn't able to find an archive of the exact same version that is used in the book.  The latest Yelp dataset is in the same format, so we can use it, and the code examples in the book will still work.  The full Yelp dataset is 10 GB.

I didn't find an easy way to automate the Yelp download (feel free to contribute fully automated download code).
You can manually download the Yelp data from either the Kaggle website or the Yelp website.  I recommend using the Kaggle website.

#### Kaggle website instructions

To download the Yelp dataset from the Kaggle website:

* You must have a Kaggle account.  If you don't have one, go to <https://kaggle.com> and sign up for free.
* Go to <https://www.kaggle.com/yelp-dataset/yelp-dataset>
* In the menu area, click the Download link (it's next to the blue New Notebook button)
* Create the following subfolders underneath your notebook directory:  **data/yelp/v6/yelp_dataset_challenge_academic_dataset**
* Save the ZIP file into the above folder
* Unzip its contents into this same folder or run the cell below

In [2]:
# Unzip the files into this subdirectory
from zipfile import ZipFile

folder_path = REPO_DIR + "data/yelp/v6/yelp_dataset_challenge_academic_dataset/"
fn = folder_path + "10100_1035793_bundle_archive.zip"    # Note: update the filename if yours doesn't match
with ZipFile(fn, 'r') as archive:    # named "archive" to avoid shadowing the built-in zip()
    archive.extractall(path=folder_path)
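
The archive name in the cell above may differ from the one you downloaded. As a rough sketch, assuming the dataset folder contains exactly one ZIP file, the hypothetical helper `extract_single_zip` finds that file by its extension instead of relying on a hard-coded name.

```python
from pathlib import Path
from zipfile import ZipFile


def extract_single_zip(folder):
    """Extract the only ZIP archive in `folder` into that same folder.

    Hypothetical helper: raises FileNotFoundError unless exactly one
    *.zip file is present, so it never guesses between several archives.
    """
    folder = Path(folder)
    archives = sorted(folder.glob("*.zip"))
    if len(archives) != 1:
        raise FileNotFoundError(
            f"Expected exactly one ZIP file in {folder}, found {len(archives)}"
        )
    with ZipFile(archives[0], "r") as archive:
        archive.extractall(path=folder)
    return archives[0]
```

Calling `extract_single_zip(folder_path)` with the `folder_path` defined above would then replace the last three lines of that cell.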

#### Yelp website instructions

Use these instructions only if you prefer not to download the Yelp dataset from the Kaggle website.  To download the Yelp dataset from the Yelp website:

* Go to <https://www.yelp.com/dataset>
* Click the Download Dataset Button
* Fill out the form with your personal information and click the Download button
* Click the Download JSON button
* Create the following subfolders underneath your notebook directory:  **data/yelp/v6/yelp_dataset_challenge_academic_dataset**
* Save the **yelp_dataset.tar** file into the above folder
* Extract the contents into this same folder or run the cell below

In [3]:
import tarfile

# Extract the Yelp tar archive into the dataset folder
folder_path = REPO_DIR + "data/yelp/v6/yelp_dataset_challenge_academic_dataset/"
with tarfile.open(folder_path + 'yelp_dataset.tar') as my_tar:
    my_tar.extractall(folder_path)

#### Verifying the download
If you have downloaded all of the files into the proper subdirectory, running the following cell should print the success message.

In [4]:
import pandas as pd
import json
fn = REPO_DIR + "data/yelp/v6/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_business.json"
try:
    # Each line of the file holds one JSON record
    with open(fn, encoding="utf-8") as biz_f:
        biz_df = pd.DataFrame([json.loads(x) for x in biz_f])
    print("Success!!! The business JSON file was successfully loaded.")
    del fn, biz_f, biz_df
except Exception as err:
    # Chain the original error so the real cause stays visible
    raise Exception("Error, the business JSON file could not be loaded.") from err

Success!!! The business JSON file was successfully loaded.
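
The verification cell parses every line of the business file. If you only want a quick look at the records, a lighter check is to parse just the first few lines. The sketch below assumes, as the cell above does, that the file stores one JSON object per line. The helper name `peek_json_lines` is hypothetical.

```python
import json
from itertools import islice


def peek_json_lines(path, n=3):
    """Parse only the first n lines of a JSON-lines file (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        # islice stops reading after n lines, so the rest of the file is never parsed
        return [json.loads(line) for line in islice(f, n)]
```

For example, `peek_json_lines(fn)` with the business file path from the cell above would return a list of up to three dictionaries, one per business record.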


### News Popularity Dataset

The News Popularity dataset can be downloaded from the UCI Machine Learning Repository.  This dataset is only 24 MB.

In [5]:
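# Download the dataset ZIP file
import requests
import os

data_path = REPO_DIR + "data/"
os.makedirs(data_path, exist_ok=True)    # make sure the target folder exists

# Forward slashes work on every platform; the original backslash only worked on Windows
if not os.path.exists(data_path + "OnlineNewsPopularity/OnlineNewsPopularity.csv"):
    myfile = requests.get("http://archive.ics.uci.edu/ml/machine-learning-databases/00332/OnlineNewsPopularity.zip")
    myfile.raise_for_status()    # stop here if the server returned an error
    with open(data_path + 'OnlineNewsPopularity.zip', 'wb') as file:
        file.write(myfile.content)
    del myfile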

In [6]:
# Unzip the CSV file into the subdirectory
from zipfile import ZipFile

data_path = REPO_DIR + "data/"
with ZipFile(data_path + 'OnlineNewsPopularity.zip', 'r') as archive:    # avoid shadowing the built-in zip()
    archive.extractall(path=data_path)
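
If you would rather not depend on `requests`, the download step can also be sketched with the standard library. This variant swaps in `urllib.request`, streams the response to disk, and does nothing when the target file already exists. The helper name `download_if_missing` and the `.part` temporary suffix are assumptions made for illustration.

```python
import shutil
import urllib.request
from pathlib import Path


def download_if_missing(url, dest):
    """Download url to dest unless dest already exists; return dest as a Path."""
    dest = Path(dest)
    if dest.exists():
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file on the next run.
    tmp = dest.with_name(dest.name + ".part")
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses,
    # so an error page is not saved in place of the archive.
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(dest)
    return dest
```

Passing the UCI URL from the download cell and `data_path + 'OnlineNewsPopularity.zip'` as `dest` would mirror what that cell does.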

#### Verifying the download
If you have downloaded the Online News Popularity CSV file into the proper subdirectory, running the following cell should print the success message.

In [7]:
import pandas as pd

try:
    # Columns in this file are separated by a comma followed by a space
    news_df = pd.read_csv(REPO_DIR + "data/OnlineNewsPopularity/OnlineNewsPopularity.csv", delimiter=", ", engine="python")
    print("Success!!! The Online News Popularity CSV file was successfully loaded.")
    del news_df
except Exception as err:
    # Chain the original error so the real cause stays visible
    raise Exception("Error, the Online News Popularity CSV file could not be loaded.") from err

Success!!! The Online News Popularity CSV file was successfully loaded.
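
The `delimiter=", "` argument in the cell above is there because each comma in the file is followed by a space. To inspect the column names without loading the whole file, a small sketch with the standard `csv` module can read just the header row. The helper name `read_header` is hypothetical.

```python
import csv


def read_header(path):
    """Return the column names from the first row of a CSV file (hypothetical helper)."""
    with open(path, newline="", encoding="utf-8") as f:
        # skipinitialspace drops the blank that follows each comma
        return next(csv.reader(f, skipinitialspace=True))
```

Calling `read_header` on the Online News Popularity CSV path used above should return the column names as a list of strings without leading spaces.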


## To be continued...

As we read through the book, instructions for the other datasets used will be added here.