# Adding additional features

### Introduction

Now that we have inserted the data into our database and placed it in the correct format, the next step is to extract some additional features from the data.

### Loading our data

Let's again load up and explore our data.

> First, we use SQLAlchemy to create a connection to our Postgres instance and the `ecommerce` database.  Change the connection string so that it matches your Postgres username.

In [3]:
from sqlalchemy import create_engine
conn_string = 'postgresql://jeffreykatz@localhost/ecommerce'

conn = create_engine(conn_string)

In [7]:
import pandas as pd
df = pd.read_sql("select * from raw_transactions limit 2", conn)

And we can look at our columns and their data types like so.  We slice the first four columns below.

In [10]:
df.dtypes[:4]

index                      int64
transaction_id             int64
customer_id                int64
date              datetime64[ns]
dtype: object

Ok, so the important column is the `date` column, which has a datetime type.  That one column actually holds a lot of information that we can extract, and we'll do that below.

### Further coercing the data

Our `date` column contains information about the day, month, year, and day of week of each purchase.  Each of these components could tell us something about how our customers make purchases.  We can extract each of them into its own field like so.

In [15]:
query = """select EXTRACT(month FROM date) as month,
       EXTRACT(week FROM date) as week,
       EXTRACT(DOW FROM date) as dow
from raw_transactions"""

month_cols = pd.read_sql(query, conn)
month_cols[:3]

Unnamed: 0,month,week,dow
0,11.0,46.0,4.0
1,5.0,19.0,6.0
2,1.0,2.0,4.0
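To make the three extracted fields concrete, here is a minimal sketch in plain Python that computes the same values for a single date. It assumes PostgreSQL's documented conventions: `week` is the ISO 8601 week number, and `DOW` runs from 0 (Sunday) to 6 (Saturday). The date used and the helper name `date_parts` are made up for illustration and are not taken from the dataset.

```python
from datetime import date


def date_parts(d):
    """Return (month, week, dow) following the EXTRACT conventions."""
    month = d.month
    # ISO 8601 week number, which is what EXTRACT(week ...) returns.
    week = d.isocalendar()[1]
    # Python's isoweekday() is Monday=1..Sunday=7; PostgreSQL's DOW is
    # Sunday=0..Saturday=6, so Sunday needs to wrap around to 0.
    dow = d.isoweekday() % 7
    return month, week, dow


# Hypothetical purchase date (a Thursday), not a row from the table.
example = date(2021, 11, 18)
print(date_parts(example))  # (11, 46, 4)
```

Read this way, a `dow` of 4 is a Thursday and a `dow` of 6 is a Saturday.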


And we can include the original columns by listing those in our select statement as well.

In [17]:
print(df.columns)

Index(['index', 'transaction_id', 'customer_id', 'date', 'product', 'gender',
       'device_type', 'country', 'state', 'city', 'category',
       'customer_login_type', 'delivery_type', 'quantity', 'transaction_start',
       'transaction_result', 'amount', 'individual_price', 'time', 'year',
       'month'],
      dtype='object')


We want to copy and paste the columns above into our select statement, but first we need to remove all of the quotation marks.  In VSCode, we can highlight one quotation mark, press `cmd + shift + l` to select every occurrence of it, and then delete them all at once.
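As an alternative to editing the pasted text by hand, the select list can be built from the column names in Python. This is only a sketch: the column list is copied from the `df.columns` output above, and the names `excluded` and `select_list` are introduced here for illustration.

```python
# Column names copied from the df.columns output above.
columns = ['index', 'transaction_id', 'customer_id', 'date', 'product', 'gender',
           'device_type', 'country', 'state', 'city', 'category',
           'customer_login_type', 'delivery_type', 'quantity', 'transaction_start',
           'transaction_result', 'amount', 'individual_price', 'time', 'year',
           'month']

# Columns we do not want to select as-is: the row index, plus the date
# fields that the EXTRACT expressions will replace.
excluded = {'index', 'date', 'year', 'month'}

# Joining the names with ", " gives an unquoted, comma-separated list.
select_list = ", ".join(col for col in columns if col not in excluded)

query = f"""select {select_list},
       EXTRACT(month FROM date) as month,
       EXTRACT(week FROM date) as week,
       EXTRACT(DOW FROM date) as dow
from raw_transactions"""
print(query)
```

Building the list this way also makes it easy to drop or add a column later by editing `excluded` rather than the query text.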

In [19]:
query = """select transaction_id, customer_id, product, gender,
       device_type, country, state, city, category,
       customer_login_type, delivery_type, quantity, transaction_start,
       transaction_result, amount, individual_price, time,
       EXTRACT(month FROM date) as month,
       EXTRACT(week FROM date) as week,
       EXTRACT(DOW FROM date) as dow
from raw_transactions"""

df_with_date_cols = pd.read_sql(query, conn)
df_with_date_cols[:2]

Unnamed: 0,transaction_id,customer_id,product,gender,device_type,country,state,city,category,customer_login_type,delivery_type,quantity,transaction_start,transaction_result,amount,individual_price,time,month,week,dow
0,40170,1348959766,Hair Band,Female,Web,United States,New York,New York City,Accessories,Member,one-day deliver,12,1,0,6910.0,576.0,1900-01-01 22:35:51,11.0,46.0,4.0
1,33374,2213674919,Hair Band,Female,Web,United States,California,Los Angles,Accessories,Member,one-day deliver,17,1,1,1699.0,100.0,1900-01-01 06:44:41,5.0,19.0,6.0


And now you can see that we have the relevant original columns included.

We also removed some columns.  We did not select the original date column, as that info is now spread out across our new columns.  And we also did not select the original year and month columns in our dataset as that would be repetitive.   

### Extract from time

The time column is pretty similar.  We can extract the hour from the time column, and this way we can analyze which hours are particularly popular, and which are not.

In [5]:
import pandas as pd
hour_col = pd.read_sql("""select EXTRACT(hour FROM time) as hour from raw_transactions""", conn)

hour_col[:3]

Unnamed: 0,hour
0,22.0
1,6.0
2,0.0
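To illustrate what the extracted hour makes possible, the sketch below pulls the hour out of two `time` values and counts purchases per hour using only the standard library. The two timestamps are the `time` values of the first two rows shown earlier; the `1900-01-01` date part is just a placeholder and is ignored. The variable names are illustrative, and this is not how the lesson itself performs the extraction.

```python
from collections import Counter
from datetime import datetime

# `time` values as they appear in the first two rows of the table.
raw_times = ["1900-01-01 22:35:51", "1900-01-01 06:44:41"]

# Parse each value and keep only the hour, mirroring EXTRACT(hour FROM time).
hours = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S").hour for t in raw_times]
print(hours)  # [22, 6]

# Count purchases per hour to see which hours are most popular.
purchases_per_hour = Counter(hours)
print(purchases_per_hour.most_common())
```

With the full table, the same count could be done in SQL or pandas by grouping on the `hour` column.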


And now we can remove the time column, and just use the hour column.

In [6]:
query = """select transaction_id, customer_id, product, gender,
       device_type, country, state, city, category,
       customer_login_type, delivery_type, quantity, transaction_start,
       transaction_result, amount, individual_price,
       EXTRACT(month FROM date) as month,
       EXTRACT(week FROM date) as week,
       EXTRACT(DOW FROM date) as dow,
       EXTRACT(hour FROM time) as hour
from raw_transactions"""

df_with_date_time_cols = pd.read_sql(query, conn)
df_with_date_time_cols[:2]

Unnamed: 0,transaction_id,customer_id,product,gender,device_type,country,state,city,category,customer_login_type,delivery_type,quantity,transaction_start,transaction_result,amount,individual_price,month,week,dow,hour
0,40170,1348959766,Hair Band,Female,Web,United States,New York,New York City,Accessories,Member,one-day deliver,12,1,0,6910.0,576.0,11.0,46.0,4.0,22.0
1,33374,2213674919,Hair Band,Female,Web,United States,California,Los Angles,Accessories,Member,one-day deliver,17,1,1,1699.0,100.0,5.0,19.0,6.0,6.0


### Loading the data into a new table

Ok, now let's load the data into our database.  We'll do so by writing it to a new table called `transactions`, separate from `raw_transactions`.

In [7]:
df_with_date_time_cols.to_sql('transactions', conn, if_exists = 'replace')

535

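And we can confirm that our new data was loaded into the `transactions` table.  Note the extra `index` column in the result: by default, `to_sql` also writes the DataFrame's index as a column.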

In [8]:
transactions_df = pd.read_sql('select * from transactions', conn)
transactions_df[:2]

Unnamed: 0,index,transaction_id,customer_id,product,gender,device_type,country,state,city,category,...,delivery_type,quantity,transaction_start,transaction_result,amount,individual_price,month,week,dow,hour
0,0,40170,1348959766,Hair Band,Female,Web,United States,New York,New York City,Accessories,...,one-day deliver,12,1,0,6910.0,576.0,11.0,46.0,4.0,22.0
1,1,33374,2213674919,Hair Band,Female,Web,United States,California,Los Angles,Accessories,...,one-day deliver,17,1,1,1699.0,100.0,5.0,19.0,6.0,6.0

