## OBJECTIVE:
- Identify duplicate records by assigning row numbers
- In the example below, records are duplicates when they share the same CLAIM_NUM; the PART_NUM column is used to order them and tell them apart
- For records with a row number other than 1, set the cost amounts to zero

This example mimics SQL's [row_number()](https://docs.microsoft.com/en-us/sql/t-sql/functions/row-number-transact-sql) function, partitioned by CLAIM_NUM and ordered by PART_NUM.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_clipboard()

In [3]:
df

Unnamed: 0,CLAIM_NUM,PART_NUM,PART_COST_USD,LABOR_COST_USD,HANDLING_COST_USD,TOTAL_COST_USD
0,1,062315LH,645.33,60.34,46.3,751.97
1,1,062345LH,323.55,67.25,20.56,751.97
2,1,062015LH,303.13,80.45,35.34,751.97
3,2,062315LH,613.45,60.34,46.3,720.09
4,2,062015LH,300.25,80.45,35.34,720.09
5,3,062345LH,333.1,67.25,20.56,420.91
6,4,062345LH,300.25,80.45,46.3,427.0
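Since `pd.read_clipboard()` depends on whatever is currently on the clipboard, the cell above cannot be re-run as is. A minimal sketch of one reproducible alternative is to build the same frame from inline text. The `csv_text` variable is an assumption introduced here for illustration; the rows are copied from the output above.

```python
import io

import pandas as pd

# Rows copied from the table shown above (the index column is omitted;
# pandas recreates the default 0..6 index on read).
csv_text = """CLAIM_NUM,PART_NUM,PART_COST_USD,LABOR_COST_USD,HANDLING_COST_USD,TOTAL_COST_USD
1,062315LH,645.33,60.34,46.3,751.97
1,062345LH,323.55,67.25,20.56,751.97
1,062015LH,303.13,80.45,35.34,751.97
2,062315LH,613.45,60.34,46.3,720.09
2,062015LH,300.25,80.45,35.34,720.09
3,062345LH,333.1,67.25,20.56,420.91
4,062345LH,300.25,80.45,46.3,427.0
"""

# Read PART_NUM explicitly as text so part numbers keep their leading zeros.
df = pd.read_csv(io.StringIO(csv_text), dtype={"PART_NUM": str})
```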


#### Let's create a ROW_NUM column that numbers the records within each CLAIM_NUM, using the PART_NUM column to order them:

In [4]:
df['ROW_NUM'] = df.sort_values(by=['PART_NUM']).groupby(['CLAIM_NUM']).cumcount() + 1

In [5]:
df.sort_values(by=['CLAIM_NUM', 'ROW_NUM'], inplace=True)
df

Unnamed: 0,CLAIM_NUM,PART_NUM,PART_COST_USD,LABOR_COST_USD,HANDLING_COST_USD,TOTAL_COST_USD,ROW_NUM
2,1,062015LH,303.13,80.45,35.34,751.97,1
0,1,062315LH,645.33,60.34,46.3,751.97,2
1,1,062345LH,323.55,67.25,20.56,751.97,3
4,2,062015LH,300.25,80.45,35.34,720.09,1
3,2,062315LH,613.45,60.34,46.3,720.09,2
5,3,062345LH,333.1,67.25,20.56,420.91,1
6,4,062345LH,300.25,80.45,46.3,427.0,1
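It may not be obvious why assigning the result of a sorted, grouped `cumcount()` back to the unsorted frame works. The reason is that pandas aligns on the index: `cumcount()` returns a Series keyed by the original row labels, so sorting only decides which number each row receives, not where that number lands. A minimal sketch on a toy frame (the names `toy` and `row_num` are hypothetical, not part of the example above):

```python
import pandas as pd

toy = pd.DataFrame({
    "CLAIM_NUM": [1, 1, 2],
    "PART_NUM": ["B", "A", "A"],
})

# Number the rows within each claim, in PART_NUM order.
# cumcount() starts at 0, so add 1 to match SQL's row_number().
row_num = toy.sort_values(by=["PART_NUM"]).groupby(["CLAIM_NUM"]).cumcount() + 1

# Assignment aligns on the index, so the original row order of `toy` is kept.
toy["ROW_NUM"] = row_num
```

Here the first row (part "B" of claim 1) receives 2 because part "A" of the same claim sorts ahead of it.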


#### Now we can use the np.where() function to handle the duplicate records separately from the original record, which has ROW_NUM = 1

np.where() gives us vectorized IF-ELSE logic: if ROW_NUM = 1, keep the original value; otherwise, set the value to zero

In [6]:
df['PART_COST_USD'] = np.where(df['ROW_NUM'] == 1, df['PART_COST_USD'], 0)
df['LABOR_COST_USD'] = np.where(df['ROW_NUM'] == 1, df['LABOR_COST_USD'], 0)
df['HANDLING_COST_USD'] = np.where(df['ROW_NUM'] == 1, df['HANDLING_COST_USD'], 0)
df['TOTAL_COST_USD'] = np.where(df['ROW_NUM'] == 1, df['TOTAL_COST_USD'], 0)

In [7]:
df

Unnamed: 0,CLAIM_NUM,PART_NUM,PART_COST_USD,LABOR_COST_USD,HANDLING_COST_USD,TOTAL_COST_USD,ROW_NUM
2,1,062015LH,303.13,80.45,35.34,751.97,1
0,1,062315LH,0.0,0.0,0.0,0.0,2
1,1,062345LH,0.0,0.0,0.0,0.0,3
4,2,062015LH,300.25,80.45,35.34,720.09,1
3,2,062315LH,0.0,0.0,0.0,0.0,2
5,3,062345LH,333.1,67.25,20.56,420.91,1
6,4,062345LH,300.25,80.45,46.3,427.0,1
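The four `np.where()` calls above differ only in the column name. As a possible shorthand, the same zeroing can be sketched with a single boolean-mask assignment through `.loc`. The `demo` frame and the `cost_cols` list below are assumptions made for illustration, using a subset of the columns:

```python
import pandas as pd

demo = pd.DataFrame({
    "ROW_NUM": [1, 2, 1],
    "PART_COST_USD": [303.13, 645.33, 333.10],
    "TOTAL_COST_USD": [751.97, 751.97, 420.91],
})

# Columns to zero out on every record that is not the first in its group.
cost_cols = ["PART_COST_USD", "TOTAL_COST_USD"]

# Select the duplicate rows with a boolean mask and overwrite all cost columns at once.
demo.loc[demo["ROW_NUM"] != 1, cost_cols] = 0
```

Whether this reads better than explicit `np.where()` calls is a matter of taste; the `.loc` form mainly helps when the list of cost columns is long or may change.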


Some may ask, why not just do all of this in SQL? Sometimes the data does not originate from a database but from a web API, or we need to combine data from different database servers. In those cases, an analyst has to weigh the effort of loading the dataset into a database against simply processing the data in Python.