# Exploring a dataset

### Introduction

An important part of working with data is not only cleaning and loading it, but also analyzing and exploring it.  Frequently, the task and even the goal may be fairly vague.  Still, it's our job to explore the data and try to extract valuable insights from it.

In this lesson, we'll learn how to quickly load up and explore a dataset.



### Loading our data

Given a CSV file, the first step is to load our data with pandas.

In [1]:
import warnings
warnings.simplefilter(action='ignore')
import pandas as pd
df = pd.read_csv('./ecommerce-dataset.csv')

We first suppress warnings to keep the notebook output tidy.  Placing the filter before the `import pandas as pd` line means it also covers any warnings raised during the import.
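
After loading, a few quick checks give a first sense of the data.  The sketch below uses a small in-memory CSV (an assumption made so the example runs without `ecommerce-dataset.csv`); the same calls would apply to the `df` loaded above.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real CSV file.
csv_text = (
    'Transaction_id,Product,Amount US$\n'
    '40170,Hair Band,"6,910"\n'
    '33374,Hair Band,"1,699"\n'
)
sample_df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = sample_df.shape             # number of rows and columns
missing_per_column = sample_df.isna().sum()  # count of missing values per column

print(n_rows, n_cols)
print(sample_df.dtypes)       # the quoted amounts are read as strings, not numbers
print(missing_per_column)
```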

### Formatting our columns

Now the next step would be to load this into postgres.  But before we do, this task will be a lot easier if we first ensure that our data is in the correct format.  

Let's begin by viewing the first row of our data.

In [2]:
df[:1]

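|  | Transaction_id | customer_id | Date | Product | Gender | Device_Type | Country | State | City | Category | Customer_Login_type | Delivery_Type | Quantity | Transaction Start | Transaction_Result | Amount US$ | Individual_Price_US$ | Year_Month | Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 40170 | 1348959766 | 14/11/2013 | Hair Band | Female | Web | United States | New York | New York City | Accessories | Member | one-day deliver | 12 | 1 | 0 | 6910 | 576 | 13-Nov | 22:35:51 |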


The first thing we should do is change our column names to lowercase.  Postgres folds unquoted identifiers to lowercase, so a mixed-case column name written by pandas would have to be wrapped in double quotes in every query.  We can lowercase the names with the following.

In [3]:
lower_cols = [col.lower() for col in df.columns]
print(lower_cols)

['transaction_id', 'customer_id', 'date', 'product', 'gender', 'device_type', 'country', 'state', 'city', 'category', 'customer_login_type', 'delivery_type', ' quantity ', 'transaction start', 'transaction_result', 'amount us$', 'individual_price_us$', 'year_month', 'time']


Next, some of the column names are pretty poor.  We need to remove any spaces from the names (including the leading and trailing spaces in `' quantity '`), and it's also a good idea to remove the `$`.  Let's copy our list of columns from above, and then edit the problematic ones individually.

In [4]:
cols = ['transaction_id', 'customer_id', 'date', 'product', 'gender', 'device_type',
        'country', 'state', 'city', 'category', 'customer_login_type',
        'delivery_type', 'quantity', 'transaction_start', 'transaction_result',
        'amount', 'individual_price', 'year_month', 'time']

Now that we've set the column names, it's time to update them.

In [5]:
df.columns = cols

In [6]:
df[:2]

|  | transaction_id | customer_id | date | product | gender | device_type | country | state | city | category | customer_login_type | delivery_type | quantity | transaction_start | transaction_result | amount | individual_price | year_month | time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 40170 | 1348959766 | 14/11/2013 | Hair Band | Female | Web | United States | New York | New York City | Accessories | Member | one-day deliver | 12 | 1 | 0 | 6910 | 576 | 13-Nov | 22:35:51 |
| 1 | 33374 | 2213674919 | 05/11/2013 | Hair Band | Female | Web | United States | California | Los Angles | Accessories | Member | one-day deliver | 17 | 1 | 1 | 1699 | 100 | 13-Nov | 06:44:41 |


Ok, so we just lowercased all of our columns, and then removed any spaces and special characters.  We also made some column names shorter. 
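
The same cleanup could also be done programmatically rather than by editing the list by hand.  This is a minimal sketch, assuming the only problems are stray whitespace, uppercase letters, spaces inside names, and a trailing `US$` suffix.  The `raw_cols` list is a subset of the original header, used for illustration.

```python
import pandas as pd

# A few of the original column names, including the awkward ones.
raw_cols = ['Transaction_id', ' Quantity ', 'Transaction Start',
            'Amount US$', 'Individual_Price_US$']
sample_df = pd.DataFrame(columns=raw_cols)

cleaned = (
    sample_df.columns
    .str.strip()                                # drop leading/trailing spaces
    .str.lower()                                # lowercase everything
    .str.replace(r'[\s_]us\$$', '', regex=True) # remove a trailing " us$" or "_us$"
    .str.replace(r'\s+', '_', regex=True)       # remaining spaces become underscores
)
sample_df.columns = cleaned
print(list(sample_df.columns))
```

Editing the list by hand, as we did above, is still reasonable when we also want to shorten or rename columns in ways a pattern cannot capture.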

### Formatting our data

Now that we have changed our column names, the next step is to coerce our data into the correct format.

By the correct format, we mean that we should convert as much of our data as possible from strings into datetime or numeric datatypes.

In pandas, a column of strings is reported with the dtype `object`.  We can view the columns that are of type object like so.

In [7]:
df.dtypes[df.dtypes == 'object']

date                   object
product                object
gender                 object
device_type            object
country                object
state                  object
city                   object
category               object
customer_login_type    object
delivery_type          object
amount                 object
individual_price       object
year_month             object
time                   object
dtype: object
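
`select_dtypes` is another way to list the columns that still need attention.  Here is a small sketch on a made-up three-column frame (the values are copied from the first row of our data): excluding numeric dtypes leaves the columns that are stored as strings.

```python
import pandas as pd

sample_df = pd.DataFrame({
    'date': ['14/11/2013'],
    'quantity': [12],
    'amount': ['6,910'],
})

# Keep only the columns whose dtype is not numeric.
non_numeric_cols = sample_df.select_dtypes(exclude='number').columns.tolist()
print(non_numeric_cols)
```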

In [8]:
df[:1]

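|  | transaction_id | customer_id | date | product | gender | device_type | country | state | city | category | customer_login_type | delivery_type | quantity | transaction_start | transaction_result | amount | individual_price | year_month | time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 40170 | 1348959766 | 14/11/2013 | Hair Band | Female | Web | United States | New York | New York City | Accessories | Member | one-day deliver | 12 | 1 | 0 | 6910 | 576 | 13-Nov | 22:35:51 |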


Ok, so some of these columns should not be of type object.  Let's copy the columns we should reformat below.

In [9]:
# date, time, individual_price, amount, year_month
                    

### Coercing our data

Next we can coerce our data by selecting each individual column, converting it to the correct type, and then replacing that column in our dataframe.  Let's get started.

In [10]:
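# The dates are written day/month/year, so we state the format explicitly.
# Without it, 05/11/2013 may be read as 11 May instead of 5 November.
updated_date = pd.to_datetime(df['date'], format='%d/%m/%Y')
updated_date[:2]

0   2013-11-14
1   2013-11-05
Name: date, dtype: datetime64[ns]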
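
Because these dates are written day first, it helps to state the format explicitly and to check for values that fail to parse.  The following is a sketch, assuming the `%d/%m/%Y` layout holds for the whole column; the third value is invented to show what `errors='coerce'` does.

```python
import pandas as pd

dates = pd.Series(['14/11/2013', '05/11/2013', 'not a date'])

# An explicit format avoids guessing whether 05/11 means 5 November or 11 May.
# errors='coerce' turns unparseable values into NaT instead of raising an error.
parsed = pd.to_datetime(dates, format='%d/%m/%Y', errors='coerce')

n_failed = parsed.isna().sum()  # how many values could not be parsed
print(parsed)
print(n_failed)
```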

And now we can see that the series is of type datetime.  Now let's work on the timestamp in the `df['time']` column.  If we again use the `to_datetime` function, we'll see that the date is off: as none is provided, pandas fills in the date on which the code is run.

In [11]:
updated_time = pd.to_datetime(df['time'])
updated_time[:2]

0   2022-11-16 22:35:51
1   2022-11-16 06:44:41
Name: time, dtype: datetime64[ns]

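To get a stable result, we can pass an explicit `format`.  The missing date then defaults to `1900-01-01` rather than the day the code is run.  If we only need the time of day, we can then call `.dt.time` on the resulting datetime series.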

In [12]:
updated_time = pd.to_datetime(df['time'], format = '%H:%M:%S')
updated_time[:1]

0   1900-01-01 22:35:51
Name: time, dtype: datetime64[ns]
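
If only the time of day is wanted, `.dt.time` drops the placeholder date.  Another option, sketched below under the assumption that the date and time columns describe the same event, is to join the two strings and parse a single timestamp.

```python
import pandas as pd

sample_df = pd.DataFrame({
    'date': ['14/11/2013', '05/11/2013'],
    'time': ['22:35:51', '06:44:41'],
})

parsed_time = pd.to_datetime(sample_df['time'], format='%H:%M:%S')
time_only = parsed_time.dt.time   # Series of datetime.time objects

# Combine both columns into one full timestamp.
timestamp = pd.to_datetime(
    sample_df['date'] + ' ' + sample_df['time'],
    format='%d/%m/%Y %H:%M:%S',
)
print(time_only[0])
print(timestamp[0])
```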

And we can see that the result is a datetime series with a placeholder date of `1900-01-01`.  Ok, so now our remaining columns are `individual_price`, `amount`, and `year_month`.

Ok, so starting with the `individual_price` column, to convert it to an integer or float we may try the following.

```python
pd.to_numeric(df['individual_price'])
```

But doing so will give us this error.

```python
ValueError: Unable to parse string "1,075" at position 10
```

The issue is that pandas does not know how to handle those commas.  So we can remove them with something like the following.

In [13]:
removed_comma = df['individual_price'].str.replace(',', '')
removed_comma[:2]

0    576
1    100
Name: individual_price, dtype: object

So we just called `str` to access our string methods and then removed the comma from any row that had it present.  And now we can again try to convert this string to be numeric.

```python
numeric_individual_price = pd.to_numeric(removed_comma)
numeric_individual_price[:2]

ValueError: Unable to parse string "#VALUE!" at position 192
```

Unfortunately, we have other values that pandas does not know how to handle.  So we can replace these `#VALUE!` strings with `nan`, which stands for not a number.

In [14]:
import numpy as np
removed_value_strings = removed_comma.replace('#VALUE!', np.nan)
numeric_individual_price = pd.to_numeric(removed_value_strings)
numeric_individual_price[:2]

0    576.0
1    100.0
Name: individual_price, dtype: float64

And now we are in good shape.  Any `#VALUE!` entry is changed to `nan`, which pandas stores as a float, so the series as a whole is numeric (`float64`).
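
The two cleaning steps can be folded into one small helper.  This is only a sketch: `to_number` is a hypothetical name, and `errors='coerce'` turns every unparseable string (not only `#VALUE!`) into `nan`, so it is worth counting how many values were lost.

```python
import pandas as pd

def to_number(series):
    """Strip thousands separators, then convert to numeric; bad values become NaN."""
    without_commas = series.str.replace(',', '', regex=False)
    return pd.to_numeric(without_commas, errors='coerce')

prices = pd.Series(['576', '1,075', '#VALUE!'])
numeric_prices = to_number(prices)

n_missing = numeric_prices.isna().sum()  # values that could not be converted
print(numeric_prices)
print(n_missing)
```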

### Your turn

Next work on the `amount` series.  Convert it to be of type numeric and assign it to the variable `updated_amount`.

In [15]:
updated_amount = pd.to_numeric(df['amount'].str.replace(',', '')) # make this numeric
updated_amount[:2]

# 0    6910.0
# 1    1699.0
# Name: Amount US$, dtype: float64

0    6910.0
1    1699.0
Name: amount, dtype: float64

### Splitting our Data

Finally, let's work with our `year_month` data.  There are two pieces of information in this column, and it will be easier to work with if we extract the year into its own column and do the same for the month.

We can do so with the following:

In [16]:
df['year_month'].str.split('-')[:2]

0    [13, Nov]
1    [13, Nov]
Name: year_month, dtype: object

So again we start with `.str` to access the string methods, and then `split` splits each value into a list.  From there, we can pull out the first element of each list with `.str[0]` and assign it to a series like so.

In [17]:
year = df['year_month'].str.split('-').str[0]

From there, we can prepend the `20`.

In [18]:
updated_year = '20' + year
updated_year[:2]

0    2013
1    2013
Name: year_month, dtype: object

And then convert this to be numeric.

In [19]:
numeric_year = pd.to_numeric(updated_year)
numeric_year[:2]

0    2013
1    2013
Name: year_month, dtype: int64

Now it's your turn.  Extract the month data from `year_month`.  You **do not** need to convert it to be numeric.

In [20]:
month = df['year_month'].str.split('-').str[1]
month[:2]

# 0    Nov
# 1    Nov
# Name: Year_Month, dtype: object

0    Nov
1    Nov
Name: year_month, dtype: object
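
Both pieces can also be pulled out in one step by passing `expand=True` to `split`, which returns a dataframe with one column per piece.  A minimal sketch, assuming every value has the `YY-Mon` shape and every year falls in the 2000s:

```python
import pandas as pd

year_month = pd.Series(['13-Nov', '13-Nov'])

# expand=True returns a dataframe with one column per split piece.
parts = year_month.str.split('-', expand=True)
parts.columns = ['year', 'month']

# Turn the two-digit year into a four-digit integer.
full_year = 2000 + parts['year'].astype(int)
print(full_year.tolist())
print(parts['month'].tolist())
```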

### Finishing up

Ok, so now we have converted a number of different series to datetime or numeric types, but we still have not updated our dataframe.  Let's change that.

> Below are our current columns, followed by the new series we created.

We can then update the dataframe with `assign`.

In [21]:
df.columns

Index(['transaction_id', 'customer_id', 'date', 'product', 'gender',
       'device_type', 'country', 'state', 'city', 'category',
       'customer_login_type', 'delivery_type', 'quantity', 'transaction_start',
       'transaction_result', 'amount', 'individual_price', 'year_month',
       'time'],
      dtype='object')

In [22]:
# updated_date, updated_time, numeric_individual_price, updated_amount, numeric_year, month

In [23]:
updated_df = df.assign(date=updated_date, time=updated_time,
                       individual_price=numeric_individual_price, amount=updated_amount,
                       year=numeric_year, month=month)
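
Note that `assign` returns a new dataframe and leaves the original untouched, which is why the result is stored in a new variable.  A small illustration with a made-up one-column frame:

```python
import pandas as pd

original = pd.DataFrame({'amount': ['6,910', '1,699']})
cleaned_amount = pd.to_numeric(original['amount'].str.replace(',', '', regex=False))

# assign builds a new dataframe; `original` keeps its string column.
updated = original.assign(amount=cleaned_amount)

print(original['amount'].tolist())  # still the original strings
print(updated['amount'].tolist())   # numeric values
```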

In [24]:
updated_df[:1]

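|  | transaction_id | customer_id | date | product | gender | device_type | country | state | city | category | ... | delivery_type | quantity | transaction_start | transaction_result | amount | individual_price | year_month | time | year | month |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 40170 | 1348959766 | 2013-11-14 | Hair Band | Female | Web | United States | New York | New York City | Accessories | ... | one-day deliver | 12 | 1 | 0 | 6910.0 | 576.0 | 13-Nov | 1900-01-01 22:35:51 | 2013 | Nov |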


Now our updated dataframe has a `year_month` column as well as separate `year` and `month` columns.  So let's drop the `year_month` column.

In [25]:
reduced_df = updated_df.drop(columns = ['year_month'])
reduced_df[:1]

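|  | transaction_id | customer_id | date | product | gender | device_type | country | state | city | category | customer_login_type | delivery_type | quantity | transaction_start | transaction_result | amount | individual_price | time | year | month |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 40170 | 1348959766 | 2013-11-14 | Hair Band | Female | Web | United States | New York | New York City | Accessories | Member | one-day deliver | 12 | 1 | 0 | 6910.0 | 576.0 | 1900-01-01 22:35:51 | 2013 | Nov |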


And if we look at our datatypes now, we'll see that fewer columns are of type object.

In [26]:
reduced_df.dtypes[reduced_df.dtypes == 'object']

product                object
gender                 object
device_type            object
country                object
state                  object
city                   object
category               object
customer_login_type    object
delivery_type          object
month                  object
dtype: object

This looks good.

From here, we can load our data into our postgres database with the following.

In [27]:
from sqlalchemy import create_engine
conn_string = 'postgresql://jeffreykatz@localhost/ecommerce'

conn = create_engine(conn_string)

In [28]:
reduced_df.to_sql('raw_transactions', conn, if_exists='replace')

535

In [30]:
txns_df = pd.read_sql('select * from raw_transactions limit 1', conn)
txns_df

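|  | index | transaction_id | customer_id | date | product | gender | device_type | country | state | city | ... | customer_login_type | delivery_type | quantity | transaction_start | transaction_result | amount | individual_price | time | year | month |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 40170 | 1348959766 | 2013-11-14 | Hair Band | Female | Web | United States | New York | New York City | ... | Member | one-day deliver | 12 | 1 | 0 | 6910.0 | 576.0 | 1900-01-01 22:35:51 | 2013 | Nov |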
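
The extra `index` column in the output above is the dataframe's index, which `to_sql` writes by default.  Passing `index=False` leaves it out.  The sketch below uses an in-memory SQLite database from the standard library instead of postgres, purely so that it runs without a database server; the same `index=False` argument applies when writing through the postgres engine.

```python
import sqlite3
import pandas as pd

sample_df = pd.DataFrame({
    'transaction_id': [40170, 33374],
    'product': ['Hair Band', 'Hair Band'],
    'amount': [6910.0, 1699.0],
})

conn = sqlite3.connect(':memory:')   # stand-in for the postgres connection
# index=False keeps the dataframe index out of the table.
sample_df.to_sql('raw_transactions', conn, if_exists='replace', index=False)

round_trip = pd.read_sql('select * from raw_transactions', conn)
print(round_trip.columns.tolist())
conn.close()
```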


And if you look at the table columns in postgres, you can see that they have been stored in the appropriate type.

<img src="./displayed-txn.png" width="60%">

### Summary

In this lesson, we loaded a CSV file with pandas and prepared it for postgres.  We lowercased and cleaned the column names, converted the date, time, price, and amount columns from strings to datetime and numeric types, split `year_month` into separate `year` and `month` columns, and then wrote the result to postgres with `to_sql`.
