### Customer Invoice Data Cleaning

This notebook uses the pandas string `split` method and the `explode` function to clean the given data.

##### Dataset: data/Customer_invoice.xlsx

In [1]:
import pandas as pd

#### Loading the Dataset

In [2]:
customer_data = pd.read_excel('data/Customer_invoice.xlsx')

In [3]:
customer_data

Unnamed: 0,Order ID,Category,Amount
0,CA-2011-167199,Binders | Art | Phones | Fasteners | Paper,609.98 | 5.48 | 391.98 | 755.96 | 31.12
1,CA-2011-149020,Office Supplies | Furniture,2.98 | 51.94
2,CA-2011-131905,Office Supplies | Technology | Technology,7.2 | 42.0186 | 42.035
3,CA-2011-127614,Accessories | Tables | Binders,234.45 | 1256.22 | 17.46
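
For readers without the Excel file, the frame shown above can be rebuilt in memory. This is a minimal sketch that copies the four displayed rows by hand; the variable `items_per_order` is an illustrative name, not part of the original notebook.

```python
import pandas as pd

# Rows copied from the output above; this stands in for pd.read_excel(...).
customer_data = pd.DataFrame({
    "Order ID": ["CA-2011-167199", "CA-2011-149020",
                 "CA-2011-131905", "CA-2011-127614"],
    "Category": [
        "Binders | Art | Phones | Fasteners | Paper",
        "Office Supplies | Furniture",
        "Office Supplies | Technology | Technology",
        "Accessories | Tables | Binders",
    ],
    "Amount": [
        "609.98 | 5.48 | 391.98 | 755.96 | 31.12",
        "2.98 | 51.94",
        "7.2 | 42.0186 | 42.035",
        "234.45 | 1256.22 | 17.46",
    ],
})

# Count how many items are packed into each 'Category' cell:
# number of pipe separators plus one.
items_per_order = customer_data["Category"].str.count(r"\|") + 1
print(items_per_order.tolist())
```

The counts show how many rows each order should contribute to the cleaned data.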


#### Aim: Segregate each category and amount under its Order ID
Cleaned data should look like:  
CA-2011-167199 Binders 609.98  
CA-2011-167199 Art     5.48

In [4]:
# Step - 1: Let's separate out the category column

data1 = customer_data[['Order ID','Category']] 

In [5]:
data1

Unnamed: 0,Order ID,Category
0,CA-2011-167199,Binders | Art | Phones | Fasteners | Paper
1,CA-2011-149020,Office Supplies | Furniture
2,CA-2011-131905,Office Supplies | Technology | Technology
3,CA-2011-127614,Accessories | Tables | Binders


In [6]:
# Using the string split function, it is possible to convert the elements in the 'Category' column into lists.

data1['Category'] = data1['Category'].str.split('|')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
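
The warning above appears because `data1` was selected from `customer_data` and then modified, so pandas cannot tell whether the change was meant for the original frame. One common way to avoid it, sketched here on a one-row stand-in frame, is to take an explicit copy before assigning.

```python
import pandas as pd

# One-row stand-in for the notebook's customer_data.
customer_data = pd.DataFrame({
    "Order ID": ["CA-2011-149020"],
    "Category": ["Office Supplies | Furniture"],
    "Amount": ["2.98 | 51.94"],
})

# .copy() makes data1 an independent DataFrame, so the assignment below
# is unambiguous and leaves customer_data untouched.
data1 = customer_data[["Order ID", "Category"]].copy()
data1["Category"] = data1["Category"].str.split("|")
print(data1)
```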


In [7]:
data1

Unnamed: 0,Order ID,Category
0,CA-2011-167199,"[Binders , Art , Phones , Fasteners , Paper]"
1,CA-2011-149020,"[Office Supplies , Furniture]"
2,CA-2011-131905,"[Office Supplies , Technology , Technology]"
3,CA-2011-127614,"[Accessories , Tables , Binders]"
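
Splitting on `'|'` alone keeps the spaces that surrounded each separator, as the lists above show (`Binders ` has a trailing space). A minimal sketch of one way to trim them, using a one-row stand-in frame:

```python
import pandas as pd

data1 = pd.DataFrame({
    "Order ID": ["CA-2011-149020"],
    "Category": ["Office Supplies | Furniture"],
})

# Split on the pipe, then strip the whitespace left around each piece.
data1["Category"] = data1["Category"].str.split("|").apply(
    lambda parts: [part.strip() for part in parts]
)
print(data1.loc[0, "Category"])
```

Stripping matters later if the categories are grouped or compared, since `'Binders '` and `'Binders'` would otherwise count as different values.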


In [8]:
# Use the pandas explode function to give each 'Category' its own row, repeating the 'Order ID'
data1 = data1.explode('Category').reset_index(drop=True)
data1

Unnamed: 0,Order ID,Category
0,CA-2011-167199,Binders
1,CA-2011-167199,Art
2,CA-2011-167199,Phones
3,CA-2011-167199,Fasteners
4,CA-2011-167199,Paper
5,CA-2011-149020,Office Supplies
6,CA-2011-149020,Furniture
7,CA-2011-131905,Office Supplies
8,CA-2011-131905,Technology
9,CA-2011-131905,Technology
10,CA-2011-127614,Accessories
11,CA-2011-127614,Tables
12,CA-2011-127614,Binders
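
To see why the index needs resetting, the sketch below prints the row labels before and after. It uses a two-row stand-in frame, and the names `exploded`, `repeated_labels` and `fresh_labels` are illustrative.

```python
import pandas as pd

data1 = pd.DataFrame({
    "Order ID": ["CA-2011-149020", "CA-2011-131905"],
    "Category": [["Office Supplies", "Furniture"],
                 ["Office Supplies", "Technology", "Technology"]],
})

# explode() emits one row per list element and repeats the original label.
exploded = data1.explode("Category")
repeated_labels = exploded.index.tolist()
print(repeated_labels)

# drop=True discards the old labels instead of keeping them
# as an extra 'index' column.
exploded = exploded.reset_index(drop=True)
fresh_labels = exploded.index.tolist()
print(fresh_labels)
```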


In [9]:
# Step - 2: Follow the same steps for the 'Amount' column

data2 = customer_data[['Order ID', 'Amount']]

In [10]:
data2

Unnamed: 0,Order ID,Amount
0,CA-2011-167199,609.98 | 5.48 | 391.98 | 755.96 | 31.12
1,CA-2011-149020,2.98 | 51.94
2,CA-2011-131905,7.2 | 42.0186 | 42.035
3,CA-2011-127614,234.45 | 1256.22 | 17.46


In [11]:
data2['Amount'] = data2['Amount'].str.split('|')
data2

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,Order ID,Amount
0,CA-2011-167199,"[609.98 , 5.48 , 391.98 , 755.96 , 31.12]"
1,CA-2011-149020,"[2.98 , 51.94]"
2,CA-2011-131905,"[7.2 , 42.0186 , 42.035]"
3,CA-2011-127614,"[234.45 , 1256.22 , 17.46]"


In [12]:
data2 = data2.explode('Amount').reset_index(drop=True)
data2

Unnamed: 0,Order ID,Amount
0,CA-2011-167199,609.98
1,CA-2011-167199,5.48
2,CA-2011-167199,391.98
3,CA-2011-167199,755.96
4,CA-2011-167199,31.12
5,CA-2011-149020,2.98
6,CA-2011-149020,51.94
7,CA-2011-131905,7.2
8,CA-2011-131905,42.0186
9,CA-2011-131905,42.035
10,CA-2011-127614,234.45
11,CA-2011-127614,1256.22
12,CA-2011-127614,17.46
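
After the split, the amounts are still strings, so arithmetic on the column would not behave as expected. A minimal sketch of converting them to numbers, using a stand-in frame with two of the values above (the name `total` is illustrative):

```python
import pandas as pd

# Exploded amounts are strings and may carry leftover spaces.
data2 = pd.DataFrame({
    "Order ID": ["CA-2011-149020", "CA-2011-149020"],
    "Amount": ["2.98 ", " 51.94"],
})

# Strip whitespace, then convert the strings to floats.
data2["Amount"] = pd.to_numeric(data2["Amount"].str.strip())
total = data2["Amount"].sum()
print(data2.dtypes)
```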


In [13]:
# Combine the two data frames to get the cleaned dataset.
# Both frames have the same row order and a fresh 0..n-1 index, so the column can be assigned directly.
data1['Amount'] = data2['Amount']
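
This positional assignment only pairs the right values if every order lists as many amounts as categories. A minimal sketch of a check that could run on the raw data first; the second row is deliberately made one amount short and is not from the dataset.

```python
import pandas as pd

customer_data = pd.DataFrame({
    "Order ID": ["CA-2011-149020", "CA-2011-131905"],
    "Category": ["Office Supplies | Furniture",
                 "Office Supplies | Technology | Technology"],
    # Hypothetical bad row: three categories but only two amounts.
    "Amount": ["2.98 | 51.94", "7.2 | 42.0186"],
})

# Number of items per cell in each column.
n_categories = customer_data["Category"].str.split("|").str.len()
n_amounts = customer_data["Amount"].str.split("|").str.len()

# Orders whose two columns disagree in length.
mismatched = customer_data.loc[n_categories != n_amounts, "Order ID"].tolist()
print(mismatched)
```

An empty list means the two exploded frames line up row for row.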

In [14]:
data1

Unnamed: 0,Order ID,Category,Amount
0,CA-2011-167199,Binders,609.98
1,CA-2011-167199,Art,5.48
2,CA-2011-167199,Phones,391.98
3,CA-2011-167199,Fasteners,755.96
4,CA-2011-167199,Paper,31.12
5,CA-2011-149020,Office Supplies,2.98
6,CA-2011-149020,Furniture,51.94
7,CA-2011-131905,Office Supplies,7.2
8,CA-2011-131905,Technology,42.0186
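9,CA-2011-131905,Technology,42.035
10,CA-2011-127614,Accessories,234.45
11,CA-2011-127614,Tables,1256.22
12,CA-2011-127614,Binders,17.46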
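
The same result can probably be reached in one pass, because `explode` accepts a list of columns in pandas 1.3 and later. This is an alternative sketch under that version assumption, not the notebook's method; `cleaned` is an illustrative name and the frame holds two of the orders above.

```python
import pandas as pd

customer_data = pd.DataFrame({
    "Order ID": ["CA-2011-149020", "CA-2011-131905"],
    "Category": ["Office Supplies | Furniture",
                 "Office Supplies | Technology | Technology"],
    "Amount": ["2.98 | 51.94", "7.2 | 42.0186 | 42.035"],
})

cleaned = customer_data.copy()
for column in ["Category", "Amount"]:
    cleaned[column] = cleaned[column].str.split("|")

# Explode both list columns together (requires pandas >= 1.3);
# the lists in each row must have equal lengths.
cleaned = cleaned.explode(["Category", "Amount"]).reset_index(drop=True)

# Tidy up: trim spaces and convert the amounts to numbers.
cleaned["Category"] = cleaned["Category"].str.strip()
cleaned["Amount"] = pd.to_numeric(cleaned["Amount"].str.strip())
print(cleaned)
```

Exploding both columns together keeps each category paired with its amount without relying on two separately built frames having the same row order.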


In [15]:
# Save the cleaned dataset
data1.to_csv('data/cleaned_customer_invoice.csv', index = False)
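
To illustrate what `index = False` does, the sketch below writes a small stand-in frame to an in-memory buffer instead of a file and looks at the header line; `buffer`, `csv_text` and `header` are illustrative names.

```python
import io
import pandas as pd

data1 = pd.DataFrame({
    "Order ID": ["CA-2011-149020", "CA-2011-149020"],
    "Category": ["Office Supplies", "Furniture"],
    "Amount": [2.98, 51.94],
})

# Write to an in-memory text buffer rather than to disk.
buffer = io.StringIO()
data1.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# With index=False the file starts directly with the column names,
# with no leading unnamed index column.
header = csv_text.splitlines()[0]
print(header)
```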