# Quantium Data Wrangling and Data Exploration

We are given two datasets, one for customer transactions and one for purchase behaviour. We start with the customer transactions dataset and will move on to the next one afterwards. For the transactions we perform the following tasks:
1. Check that the data types are consistent and make sense.
2. Check whether there are null values and whether we can replace them with something.

In [44]:
#Required imports
import pandas as pd
import numpy as np

In [4]:
#Loading transactions dataset
trans_dataset = pd.read_csv('QVI_transaction_data1.csv')
trans_dataset.head()
#PROD_NAME can be split to fetch the weight of the product, like "175g". Regular expressions can also be written to extract
#the flavor, like "SeaSalt", and the brand name, like "Natural Chip Company".

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES
0,43390,1,1000,1,5,Natural Chip Compny SeaSalt175g,2,6.0
1,43599,1,1307,348,66,CCs Nacho Cheese 175g,3,6.3
2,43605,1,1343,383,61,Smiths Crinkle Cut Chips Chicken 170g,2,2.9
3,43329,2,2373,974,69,Smiths Chip Thinly S/Cream&Onion 175g,5,15.0
4,43330,2,2426,1038,108,Kettle Tortilla ChpsHny&Jlpno Chili 150g,3,13.8


In [9]:
#Checking datatypes of all columns
trans_dataset.dtypes
#All columns are fine except DATE, which is stored as int64 instead of a date type. We need to find a way to retrieve dates from this column.

DATE                int64
STORE_NBR           int64
LYLTY_CARD_NBR      int64
TXN_ID              int64
PROD_NBR            int64
PROD_NAME          object
PROD_QTY            int64
TOT_SALES         float64
dtype: object
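
One possible way to handle the integer DATE column is sketched below. It assumes the values are Excel-style serial day numbers, which the source does not confirm, so the result should be checked against the known date range of the data. The `sample` frame and the `DATE_PARSED` column are names made up for this illustration; the DATE values are the ones shown in the first rows above.

```python
import pandas as pd

# DATE values copied from the first five rows of the transactions output.
sample = pd.DataFrame({"DATE": [43390, 43599, 43605, 43329, 43330]})

# Assumption: DATE counts days in Excel's serial-date system, whose day 0
# is 1899-12-30. If the dates come from another system, change `origin`.
sample["DATE_PARSED"] = pd.to_datetime(sample["DATE"], unit="D", origin="1899-12-30")

print(sample)
```

If the parsed dates fall outside the period the transactions are expected to cover, the serial-date assumption is probably wrong.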

In [12]:
# Checking if any value in any column of the dataframe is missing or null.
for cols in trans_dataset.columns:
    if trans_dataset[cols].isnull().any():
        print(f"The column {cols} has some null values")
    else:
        print(f"The column {cols} has no null values")
# No column has any missing values, so there is nothing to drop or replace.

The column DATE has no null values
The column STORE_NBR has no null values
The column LYLTY_CARD_NBR has no null values
The column TXN_ID has no null values
The column PROD_NBR has no null values
The column PROD_NAME has no null values
The column PROD_QTY has no null values
The column TOT_SALES has no null values
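
The same check can also be written without a loop. The sketch below counts nulls per column with `isna().sum()`; the `sample` frame is a small stand-in built from rows shown in this notebook, and in the notebook itself the call would be made on `trans_dataset`.

```python
import pandas as pd

# Stand-in for trans_dataset, using a few values shown in this notebook.
sample = pd.DataFrame({
    "PROD_NAME": ["CCs Nacho Cheese 175g", "Kettle 135g Swt Pot Sea Salt"],
    "PROD_QTY": [3, 2],
    "TOT_SALES": [6.3, 8.4],
})

# Number of null values in each column.
null_counts = sample.isna().sum()
print(null_counts)

# True if at least one column contains a null value.
has_nulls = bool(null_counts.any())
print(has_nulls)
```

A per-column count is slightly more informative than a yes/no message, because it shows how many rows would be affected by dropping or filling.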


In [17]:
# The transactions dataset has no missing or null values, so no rows had to be dropped.
# All the data types are fine apart from DATE, which is stored as an integer. We are ready to move on with our analysis.
# Up next we will try to extract the weight, flavor and brand name from PROD_NAME.
# After that we will explore outliers in PROD_QTY and TOT_SALES.
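
For the outlier step, one common starting point is the interquartile-range (IQR) rule. The sketch below is only an illustration of that rule, not the method used later in this analysis. The helper `iqr_outliers`, the multiplier `k=1.5` and the quantity values are all hypothetical.

```python
import pandas as pd

# Hypothetical quantities for illustration only; not taken from the dataset.
sample = pd.DataFrame({"PROD_QTY": [2, 3, 2, 5, 3, 1, 2, 200]})

def iqr_outliers(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

mask = iqr_outliers(sample["PROD_QTY"])

# Rows flagged as potential outliers; these still need a manual look.
print(sample[mask])
```

A flagged row is only a candidate: a very large PROD_QTY could be a genuine bulk purchase, so it should be inspected before it is removed.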

In [45]:
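# Extracting product weight from PROD_NAME, assuming the name ends with the weight followed by "g"
prod_wht = []
for name in trans_dataset['PROD_NAME']:
    try:
        # Take the three characters before the trailing "g", e.g. "175" from "175g"
        wht = int(name[-4:-1])
    except (ValueError, TypeError):
        # The name does not end with a weight, so mark it as missing for now
        wht = np.nan
    prod_wht.append(wht)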
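
As an alternative to slicing fixed positions, the weight could be pulled out with a regular expression, which does not depend on where the weight sits in the name. This is a sketch under the assumption that every name contains digits followed by "g" or "G" and that the first such match is the weight. The `names` series holds product names shown in this notebook.

```python
import pandas as pd

# Product names copied from the outputs in this notebook.
names = pd.Series([
    "Natural Chip Compny SeaSalt175g",
    "CCs Nacho Cheese 175g",
    "Smiths Crinkle Cut Chips Chicken 170g",
    "Kettle 135g Swt Pot Sea Salt",
])

# Capture the first run of digits followed by optional spaces and "g" or "G".
# Names without a match become NaN, hence the float dtype.
weights = names.str.extract(r"(\d+)\s*[gG]", expand=False).astype(float)

print(weights)
```

With this pattern the Kettle name, where the weight is in the middle, is parsed as well, so no separate fill step is needed for it.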

In [47]:
# Adding the extracted weights to the transactions dataset as a new PROD_WHT column
trans_dataset = pd.concat([trans_dataset, pd.DataFrame(prod_wht, columns=['PROD_WHT'])], axis=1)

In [55]:
trans_dataset[trans_dataset['PROD_WHT'].isnull()]
# All the rows with NaN values are "Kettle 135g Swt Pot Sea Salt", where the weight is not at the end of PROD_NAME, so we can
# safely use fillna and set the weight to 135 for these rows.

Unnamed: 0,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES,PROD_WHT
65,43605,83,83008,82099,63,Kettle 135g Swt Pot Sea Salt,2,8.4,
153,43602,208,208139,206906,63,Kettle 135g Swt Pot Sea Salt,1,4.2,
174,43332,237,237227,241132,63,Kettle 135g Swt Pot Sea Salt,2,8.4,
177,43602,243,243070,246706,63,Kettle 135g Swt Pot Sea Salt,1,4.2,
348,43399,7,7077,6604,63,Kettle 135g Swt Pot Sea Salt,2,8.4,
...,...,...,...,...,...,...,...,...,...
264564,43381,260,260240,259480,63,Kettle 135g Swt Pot Sea Salt,2,8.4,
264574,43628,261,261035,259860,63,Kettle 135g Swt Pot Sea Salt,2,8.4,
264725,43301,266,266413,264246,63,Kettle 135g Swt Pot Sea Salt,1,4.2,
264767,43624,269,269133,265839,63,Kettle 135g Swt Pot Sea Salt,2,8.4,


In [57]:
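#Setting all NA values to 135. Assigning the result back avoids calling fillna with inplace=True on a single column.
trans_dataset['PROD_WHT'] = trans_dataset['PROD_WHT'].fillna(135)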

In [63]:
# Sanity check that no null values remain in PROD_WHT.
trans_dataset[trans_dataset['PROD_WHT'].isnull()].empty
# The filtered dataframe is empty, so every row now has a weight.

True