<a href="https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/main/t81_558_class_02_5_pandas_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-558: Applications of Deep Neural Networks
**Module 2: Python for Machine Learning**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 2 Material

Main video lecture:

* Part 2.1: Introduction to Pandas [[Video]](https://www.youtube.com/watch?v=wixHCvnvnsU&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_02_1_python_pandas.ipynb)
* Part 2.2: Categorical Values [[Video]](https://www.youtube.com/watch?v=Fm7Ax23hDP0&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_02_2_pandas_cat.ipynb)
* Part 2.3: Grouping, Sorting, and Shuffling in Python Pandas [[Video]](https://www.youtube.com/watch?v=tUhaD8xWd7k&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_02_3_pandas_grouping.ipynb)
* Part 2.4: Using Apply and Map in Pandas [[Video]](https://www.youtube.com/watch?v=YNo_mg1RrkM&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_02_4_pandas_functional.ipynb)
* **Part 2.5: Feature Engineering in Pandas for Deep Learning in PyTorch** [[Video]](https://www.youtube.com/watch?v=ezaVtM405Qs&list=PLjy4p-07OYzuy_lHcRW8lPTLPTTOmUpmi) [[Notebook]](t81_558_class_02_5_pandas_features.ipynb)

# Part 2.5: Feature Engineering

Feature engineering is an essential part of machine learning.  For now, we will manually engineer features.  However, later in this course, we will see some techniques for automatic feature engineering.  

## Calculated Fields

It is possible to add new fields to the data frame that your program calculates from the other fields.  We can create a new column that gives the weight in kilograms.  The equation to calculate a metric weight, given weight in pounds, is:

$$ m_{(kg)} = m_{(lb)} \times 0.45359237 $$

The following Python code performs this transformation:

In [2]:
import numpy as np
import pandas as pd

print(f"Numpy Version: {np.__version__}")
print(f"Pandas Version: {pd.__version__}")

Numpy Version: 1.24.2
Pandas Version: 2.2.0


In [3]:
url = "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv"

In [4]:
df = pd.read_csv(url, 
    na_values=['NA', '?'])

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   year          398 non-null    int64  
 7   origin        398 non-null    int64  
 8   name          398 non-null    object 
dtypes: float64(4), int64(4), object(1)
memory usage: 28.1+ KB


In [5]:
df.insert(1, 'weight_kg', (df['weight'] * 0.45359237).astype(int))
pd.set_option('display.max_columns', 6)
pd.set_option('display.max_rows', 5)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   weight_kg     398 non-null    int32  
 2   cylinders     398 non-null    int64  
 3   displacement  398 non-null    float64
 4   horsepower    392 non-null    float64
 5   weight        398 non-null    int64  
 6   acceleration  398 non-null    float64
 7   year          398 non-null    int64  
 8   origin        398 non-null    int64  
 9   name          398 non-null    object 
dtypes: float64(4), int32(1), int64(4), object(1)
memory usage: 29.7+ KB


The call `df.insert(1, 'weight_kg', (df['weight'] * 0.45359237).astype(int))` differs from the `assign` method in a few key ways:

1. **`df.insert()`**:
   - The `df.insert()` method is an in-place operation that modifies the original DataFrame by inserting a new column at a specified position.
   - It requires specifying the column index where the new column should be inserted.
   - It returns `None`, so the operation is not chainable.

2. **`assign()`**:
   - The `assign()` method returns a new DataFrame with the specified columns added, leaving the original DataFrame unchanged.
   - It appends the new columns after the existing ones rather than at a chosen position.
   - Because it returns a DataFrame, it can be used in a chain of method calls.

In summary, use `df.insert()` when you want to modify the existing DataFrame in place and control where the new column appears, and use `assign()` when you want a new DataFrame and prefer to leave the original untouched.
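
As a minimal sketch of the `assign()` alternative, the following example builds a tiny DataFrame and adds the same kilogram column without touching the original. The frame `cars` and its two rows are illustrative stand-ins, not the full auto-mpg data loaded above.

```python
import pandas as pd

# Small illustrative frame; the values are stand-ins for the real dataset.
cars = pd.DataFrame({"mpg": [18.0, 15.0], "weight": [3504, 3693]})

# assign() returns a new DataFrame; `cars` itself is left unchanged.
cars_kg = cars.assign(
    weight_kg=lambda d: (d["weight"] * 0.45359237).astype(int)
)

print(list(cars.columns))     # ['mpg', 'weight']
print(list(cars_kg.columns))  # ['mpg', 'weight', 'weight_kg']
```

Note that the new column lands at the end of `cars_kg`, whereas `insert()` placed it at position 1.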

### **Google API Keys**

When working with external APIs to gather data, you might find the Google Maps API particularly useful for tasks such as encoding addresses. This can be valuable for various applications, including neural network training. To utilize the Google Maps API, you'll need to obtain a personal Google API key. Remember, the example key provided here is fictitious and solely for illustrative purposes. You must use your own key, which can be acquired from Google. They might request your credit card details for verification, but typically, there's no charge unless your usage exceeds a substantial number of lookups.

If you decide to obtain a Google API key, you can do so by visiting the following link: [Google API Keys](https://developers.google.com/maps/documentation/embed/get-api-key). This link will direct you to the official site where you can securely get your key for geocoding purposes.


## Dealing with Addresses

Addresses can be difficult to encode into a neural network.  There are many different approaches, and you must consider how you can transform the address into something more meaningful.  Map coordinates, expressed as [latitude and longitude](https://en.wikipedia.org/wiki/Geographic_coordinate_system), can be a useful encoding.  Geocoding services make it relatively easy to transform an address into its latitude and longitude values.  The following code determines the coordinates of [Washington University](https://wustl.edu/):

In [6]:
import os

from config import set_environment

set_environment()

if 'GOOGLE_API_KEY' in os.environ:
    # If the API key is defined in an environment variable,
    # then use the environment variable.
    GOOGLE_KEY = os.environ['GOOGLE_API_KEY']    
else:
    # If you have a Google API key of your own, you can also just
    # put it here:
    GOOGLE_KEY = 'REPLACE WITH YOUR GOOGLE API KEY'

In [7]:
import requests

address = "1 Brookings Dr, St. Louis, MO 63130"

# Pass the query as params so that requests URL-encodes the address
response = requests.get(
    'https://maps.googleapis.com/maps/api/geocode/json',
    params={'key': GOOGLE_KEY, 'address': address})

resp_json_payload = response.json()

if 'error_message' in resp_json_payload:
    print(resp_json_payload['error_message'])
else:
    print(resp_json_payload['results'][0]['geometry']['location'])

{'lat': 38.6482446, 'lng': -90.30494159999999}
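
The response handling above can be wrapped in a small helper that fails loudly when no coordinates come back. The sketch below is hypothetical: the function name `extract_lat_lng` is not part of the original notebook, and it assumes the JSON payload carries a top-level `status` field alongside `results` and, on failure, `error_message`. It runs against a hand-built dictionary rather than a live API call.

```python
def extract_lat_lng(payload):
    """Return (lat, lng) from a geocoding JSON payload or raise ValueError."""
    # Assumption: the payload has a top-level "status" string.
    status = payload.get("status")
    results = payload.get("results", [])
    if status != "OK" or not results:
        message = payload.get("error_message", "no results returned")
        raise ValueError(f"Geocoding failed ({status}): {message}")
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Hand-built payload shaped like the response printed above (not a live call).
sample = {
    "status": "OK",
    "results": [
        {"geometry": {"location": {"lat": 38.6482446, "lng": -90.3049416}}}
    ],
}

print(extract_lat_lng(sample))
```

Raising an exception instead of printing keeps a failed lookup from silently producing missing features later in a pipeline.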


Feeding latitude and longitude into the neural network as two raw features might not be overly helpful on its own.  These two values would allow your neural network to cluster locations on a map, and sometimes such clustering is useful.  Figure 2.SMK shows the percentage of the population that smokes in the USA by state.

**Figure 2.SMK: Smokers by State**
![Smokers by State](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/images/class_6_smokers.png "Smokers by State")

The above map shows that certain behaviors, like smoking, can cluster by geographic region.

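However, often you will want to transform the coordinates into distances.  It is reasonably easy to estimate the distance between any two points on Earth by using the [great circle distance](https://en.wikipedia.org/wiki/Great-circle_distance) between two points on a sphere:

$$ \Delta\sigma=\arccos\bigl(\sin\phi_1\cdot\sin\phi_2+\cos\phi_1\cdot\cos\phi_2\cdot\cos(\Delta\lambda)\bigr) $$

$$ d = r \, \Delta\sigma $$

Here $\phi_1$ and $\phi_2$ are the latitudes of the two points, $\Delta\lambda$ is the difference between their longitudes, and $r$ is the radius of the sphere.  The following code computes this distance using the equivalent haversine form, which is more numerically stable when the two points are close together: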

In [13]:
from math import sin, cos, sqrt, atan2, radians

URL='https://maps.googleapis.com' + \
    '/maps/api/geocode/json?key={}&address={}'

# Distance function
def distance_lat_lng(lat1,lng1,lat2,lng2):
    # approximate radius of earth in km
    R = 6373.0

    # degrees to radians (lat/lon are in degrees)
    lat1 = radians(lat1)
    lng1 = radians(lng1)
    lat2 = radians(lat2)
    lng2 = radians(lng2)

    dlng = lng2 - lng1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlng / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    return R * c

# Find lat/lng for an address
def lookup_lat_lng(address):
    response = requests.get(URL.format(GOOGLE_KEY, address))
    # Avoid shadowing the json module name and the built-in map()
    payload = response.json()
    if len(payload['results']) == 0:
        raise ValueError("Google API error on: {}".format(address))
    location = payload['results'][0]['geometry']['location']
    return location['lat'], location['lng']


# Distance between two locations

import requests

address1 = "1 Brookings Dr, St. Louis, MO 63130" 
address2 = "3301 College Ave, Fort Lauderdale, FL 33314"

lat1, lng1 = lookup_lat_lng(address1)
lat2, lng2 = lookup_lat_lng(address2)

print("Distance, St. Louis, MO to Ft. Lauderdale, FL: {} km".format(
        distance_lat_lng(lat1,lng1,lat2,lng2)))

Distance, St. Louis, MO to Ft. Lauderdale, FL: 1685.3085640558543 km
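
For comparison, the arccos formula given earlier can also be evaluated directly. The sketch below is a hypothetical companion to `distance_lat_lng`, not part of the original notebook. It reuses the same Earth radius of 6373 km and needs no API key. The second coordinate pair is a rough, assumed location in the Fort Lauderdale area, used only to show that the two forms agree.

```python
from math import acos, asin, cos, radians, sin, sqrt


def distance_law_of_cosines(lat1, lng1, lat2, lng2, r=6373.0):
    """Great circle distance in km via the spherical law of cosines."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dlam = radians(lng2 - lng1)
    cos_sigma = sin(phi1) * sin(phi2) + cos(phi1) * cos(phi2) * cos(dlam)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_sigma = max(-1.0, min(1.0, cos_sigma))
    return r * acos(cos_sigma)


def distance_haversine(lat1, lng1, lat2, lng2, r=6373.0):
    """Great circle distance in km via the haversine form."""
    phi1, phi2 = radians(lat1), radians(lat2)
    a = (sin((phi2 - phi1) / 2) ** 2
         + cos(phi1) * cos(phi2) * sin(radians(lng2 - lng1) / 2) ** 2)
    return 2 * r * asin(sqrt(a))


# First point: the coordinates printed earlier for Washington University.
# Second point: an assumed, approximate location near Fort Lauderdale, FL.
d_cos = distance_law_of_cosines(38.6482446, -90.3049416, 26.08, -80.24)
d_hav = distance_haversine(38.6482446, -90.3049416, 26.08, -80.24)
print(d_cos, d_hav)
```

At this scale both forms agree closely. The haversine form is preferable for very short distances, where the cosine of a tiny angle rounds to 1 and the arccos form loses precision.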


Distances can be a useful means to encode addresses.  Think about which distances are likely to be meaningful for your dataset.  Consider:

* Distance to a major metropolitan area
* Distance to a competitor
* Distance to a distribution center
* Distance to a retail outlet

The following code calculates the distance between 10 universities and Washington University in St. Louis:

In [14]:
# Encoding other universities by their distance to Washington University

schools = [
    ["Princeton University, Princeton, NJ 08544", 'Princeton'],
    ["Massachusetts Hall, Cambridge, MA 02138", 'Harvard'],
    ["5801 S Ellis Ave, Chicago, IL 60637", 'University of Chicago'],
    ["Yale, New Haven, CT 06520", 'Yale'],
    ["116th St & Broadway, New York, NY 10027", 'Columbia University'],
    ["450 Serra Mall, Stanford, CA 94305", 'Stanford'],
    ["77 Massachusetts Ave, Cambridge, MA 02139", 'MIT'],
    ["Duke University, Durham, NC 27708", 'Duke University'],
    ["University of Pennsylvania, Philadelphia, PA 19104", 
         'University of Pennsylvania'],
    ["Johns Hopkins University, Baltimore, MD 21218", 'Johns Hopkins']
]

lat1, lng1 = lookup_lat_lng("1 Brookings Dr, St. Louis, MO 63130")

for address, name in schools:
    lat2,lng2 = lookup_lat_lng(address)
    dist = distance_lat_lng(lat1,lng1,lat2,lng2)
    print("School '{}', distance to wustl is: {}".format(name,dist))

School 'Princeton', distance to wustl is: 1354.4803504417146
School 'Harvard', distance to wustl is: 1670.6258859436082
School 'University of Chicago', distance to wustl is: 418.0737344670344
School 'Yale', distance to wustl is: 1508.2145262222327
School 'Columbia University', distance to wustl is: 1418.223388237836
School 'Stanford', distance to wustl is: 2780.6825998255795
School 'MIT', distance to wustl is: 1672.404945267829
School 'Duke University', distance to wustl is: 1046.7982924956727
School 'University of Pennsylvania', distance to wustl is: 1307.1929423813397
School 'Johns Hopkins', distance to wustl is: 1184.3811108108175
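
Once coordinates are available, the distance can become an ordinary DataFrame column, in the same spirit as the calculated `weight_kg` field. The sketch below is one possible approach, assuming the coordinates are already stored in `lat` and `lng` columns. The helper `haversine_km` and the frame `places` are hypothetical, and the rows are synthetic points chosen so that the results are easy to check by hand.

```python
import numpy as np
import pandas as pd


def haversine_km(lat1, lng1, lat2, lng2, r=6373.0):
    """Vectorized haversine distance in km; accepts scalars or Series."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    # Clip guards against tiny rounding excursions outside [0, 1].
    return 2 * r * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# Reference point: the Washington University coordinates printed earlier.
ref_lat, ref_lng = 38.6482446, -90.3049416

# Synthetic rows (not real addresses) with easily verified distances.
places = pd.DataFrame({
    "name": ["reference", "north_pole", "equator_same_meridian"],
    "lat": [ref_lat, 90.0, 0.0],
    "lng": [ref_lng, 0.0, ref_lng],
})

# One vectorized call computes the feature for every row.
places["dist_km"] = haversine_km(places["lat"], places["lng"], ref_lat, ref_lng)
print(places)
```

Computing the column with NumPy operations avoids a Python-level loop, which matters once the table holds many thousands of addresses.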
