# Task 2 - Detecting Fraudulent Sales
You were just hired by BA Toys as an Internal Auditor. The Chief Audit Executive (CAE) is also new and has been reviewing the purchasing and sales processes. The internal controls appear to be well-designed, but the external auditor has raised some concerns about revenues. Specifically, the auditor claims that their analytics-based risk assessment suggests sales appear inconsistent with inventory levels. The CAE has reviewed the auditor’s evidence and agrees the issue warrants investigation.

The CAE has tasked you with reviewing inventory records, purchasing activity, and sales activity. She has obtained the data for you from the relevant departments. Below, she provides a description of each data source as well as what she can ascertain from the data. However, her information may be incomplete, and she conveys that the departments (Sales, Purchasing, and Inventory Management) are unlikely to be of much help. Like most retailers, BA’s fiscal year-end is January 31, and you will be focusing on the year ended January 31, 2018.


### Data files 📂
You will find the following datasets in your task materials, which you can unzip (if on ICE, I have provided the reference to these files):

1.	“BAToysEndInvJan2017.csv” and “BAToysEndInvJan2018.csv” – These files contain beginning and ending inventory, respectively, for BA Toys’ 21 stores.

2.	“BAToysPurchasesJan2018.csv” – All inventory purchases made during the year ending 1/31/2018.

3.	“BAToysTAXSalesJan2018.csv” – All sales occurring during the year ending 1/31/2018.

4.	“BAToysReturns.csv” – All returns processed during the year ending 1/31/2018. Note that returns refer to toys returned *to the vendor* (i.e., they reduce inventory available for sale). These *are not* customer returns.

5.	“PriceListing.csv” – A listing of purchase and sales prices for inventory.


**Additional Notes** 📝

As noted, the CAE did not receive much information about the data, but made a few notes before passing the task to you:

- Each store is referenced in several datasets and appears to be referred to by a state code (e.g., “NY”) followed by a four-digit number.
- The CAE thinks BrandId somehow corresponds to InventoryId, which may in turn reference stores.


### Requirements (summarized here, given in more detail below)
1. Review & describe data.
2. Inspect toy vendor products and comment on volume, margins, etc.
3. Determine whether any fraudulent sales appear to exist; quantify the dollar impact of fraudulent sales.
4. (4046 only) Conduct qualitative review of suspected fraudulent sales.
5. (6046 only) Conduct systematic review of suspected fraudulent sales.


#### General Hint

The Sales, Purchases, and Returns files all have timestamps that would allow you to evaluate inventory at any given date. However, I suggest simplifying your analysis based on the specific question of whether there are fraudulent sales in the year (regardless of when they occur). In other words, begin by considering total purchases (by store-product), total sales (by store-product), etc. Collapse the datasets using the identifier(s) corresponding to unique Store-Brand combinations. This may be sufficient to uncover any issues. If it is not, then you can circle back and perform a more detailed analysis at the daily level (but, hint, you shouldn’t have to).
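
The collapsing step described above can be sketched on a tiny made-up table. This is only an illustration: the column name `SalesQuantity` and the toy values are assumptions, and the real files may use different names or a single combined identifier instead of separate `Store` and `Brand` columns.

```python
import pandas as pd

# Toy transaction-level sales; values and the SalesQuantity column name are assumed
sales_toy = pd.DataFrame({
    "Store": ["NY-3349", "NY-3349", "NY-3349", "CT-6253"],
    "Brand": [101, 101, 202, 101],
    "SalesQuantity": [2, 3, 1, 4],
})

# Collapse to one row per Store-Brand combination with total units sold
sales_by_item = (
    sales_toy.groupby(["Store", "Brand"], as_index=False)["SalesQuantity"].sum()
)
print(sales_by_item)
```

The same pattern would apply to purchases and returns, so that every dataset ends up with one row per store-product before any merging.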


## Requirement 1
Load each dataset into a pandas dataframe. Inspect the data and provide the CAE with definitions of what the various identifiers appear to mean.

Specifically, briefly describe each of the following fields:
1. Store
2. VendorNo (or VendorId)
3. Brand (or BrandId)
4. InventoryId

In [32]:
import pandas as pd

folder = "Task2Data" # update to the correct folder path

# Load datasets (I provided first, fill in the rest)
beginv = pd.read_csv(f"{folder}/BAToysEndInvJan2017.csv")
endinv = pd.read_csv(f"{folder}/BAToysEndInvJan2018.csv")
purch  = pd.read_csv(f"{folder}/BAToysPurchasesJan2018.csv")
sale   = pd.read_csv(f"{folder}/BAToysTAXSalesJan2018.csv")
returns= pd.read_csv(f"{folder}/BAToysRetJan2018.csv")

listings = pd.read_csv(f"{folder}/PriceListing.csv") #Adding one more for references
# Add code here to inspect the datasets:

def printColumns(datasets: list, column: str):
    # Print the unique values of `column` for each dataset, numbered from 1
    for i, data in enumerate(datasets, start=1):
        print(f'Unique values for Dataset {i}: \n', data[column].unique())
print('For Store Columns:')
printColumns([beginv, endinv, purch, returns], 'Store')
print('For Vendor ID:')
print(sale.VendorNo.unique()) 
print(listings['VendorId'].unique())

For Store Columns:
Unique values for Dataset 1: 
 ['CT-6253' 'DC-3827' 'DE-2832' 'MA-1384' 'MA-1738' 'MA-2647' 'MA-5262'
 'MD-2830' 'MD-3274' 'ME-0293' 'NH-7575' 'NJ-4291' 'NJ-5264' 'NY-3349'
 'NY-3458' 'NY-7283' 'PA-1445' 'PA-7263' 'PA-9272' 'RI-4893' 'VT-8234']
Unique values for Dataset 2: 
 ['CT-6253' 'DC-3827' 'DE-2832' 'MA-1384' 'MA-1738' 'MA-2647' 'MA-5262'
 'MD-2830' 'MD-3274' 'ME-0293' 'NH-7575' 'NJ-4291' 'NJ-5264' 'NY-3349'
 'NY-3458' 'NY-7283' 'PA-1445' 'PA-7263' 'PA-9272' 'RI-4893' 'VT-8234']
Unique values for Dataset 3: 
 ['CT-6253' 'NJ-4291' 'MA-5262' 'VT-8234' 'MD-3274' 'MD-2830' 'ME-0293'
 'NH-7575' 'PA-7263' 'MA-2647' 'PA-9272' 'DE-2832' 'NJ-5264' 'DC-3827'
 'PA-1445' 'MA-1738' 'MA-1384' 'NY-3458' 'RI-4893' 'NY-7283' 'NY-3349']
Unique values for Dataset 4: 
 ['CT-6253' 'NJ-4291' 'MA-5262' 'MD-3274' 'NH-7575' 'MA-2647' 'MD-2830'
 'DE-2832' 'VT-8234' 'MA-1738' 'PA-9272' 'MA-1384' 'NY-7283' 'PA-7263'
 'ME-0293' 'NJ-5264' 'RI-4893' 'NY-3458' 'NY-3349' 'DC-3827' 'PA-1445']

✅  *Insert Requirement 1 answer:*

1. <p style = 'color: #1ab9b9ff'>The Store identifier denotes a specific store: a two-letter state code followed by a four-digit number (e.g., 'NY-3349'). The first two letters appear to be the state abbreviation, and the four digits appear to identify the individual store. However, the data do not provide enough information to determine how the numbers are assigned. </p>

2. <p style = 'color: #1ab9b9ff'>VendorNo (VendorId in the Price Listing) appears to act as a foreign key in the Sales dataset that points to vendor information. The Sales dataset includes both the vendor name and the vendor number. The vendor number can likely be used to link the Sales dataset to the Price Listing. However, a vendor can supply multiple products, so this identifier alone is not sufficient to merge the datasets. </p>

3.

4.

## Requirement 2

Inspect the data and answer these specific questions about vendors (HINT: I suggest relying on the PriceListing file):

1. How many unique vendors does BA Toys transact with?
2. How many unique products are there for each vendor?
3. Which vendor’s products produce the largest gross margins? Define the gross margin as (sales price - purchase price) / sales price, where purchase price is the cost.
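
As a rough sketch of how these three questions might be approached, the example below uses a small made-up price list. `VendorId` appears in the notebook above, but `PurchasePrice`, `SalesPrice`, and the toy values are assumptions; check the actual column names in PriceListing before adapting it.

```python
import pandas as pd

# Toy price list; PurchasePrice and SalesPrice are assumed column names
prices_toy = pd.DataFrame({
    "VendorId": [1, 1, 2],
    "BrandId": [10, 11, 20],
    "PurchasePrice": [6.0, 8.0, 5.0],
    "SalesPrice": [10.0, 10.0, 20.0],
})

# Gross margin per product: (sales price - purchase price) / sales price
prices_toy["GrossMargin"] = (
    (prices_toy["SalesPrice"] - prices_toy["PurchasePrice"]) / prices_toy["SalesPrice"]
)

n_vendors = prices_toy["VendorId"].nunique()                           # question 1
products_per_vendor = prices_toy.groupby("VendorId")["BrandId"].nunique()  # question 2
avg_margin = (                                                          # question 3
    prices_toy.groupby("VendorId")["GrossMargin"].mean().sort_values(ascending=False)
)
print(n_vendors, products_per_vendor, avg_margin, sep="\n")
```

Averaging per-product margins treats every product equally. Weighting by sales volume is another defensible choice and could rank vendors differently.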


In [5]:
#  Insert code needed to answer requirement 2 here



✅ *Insert Requirement 2 Answers:*

1.

2.

3.


## Requirement 3

Evaluate the external auditors’ claims. Specifically, evaluate sales **by store, by product**.

Focus on these fundamental questions as you answer this question:
1.  Do sales for a given store-product combination ever exceed inventory available for sale? How do you define "Inventory Available for Sale"? (Hint: Beginning Inventory + Purchases + Returns - Ending Inventory. Note that I *add* returns because in the data this is a negative quantity.)
2.  If so, does this occur for all products and stores? Or a small number? Describe the stores and products for which you observe sales exceeding available inventory.
3. What is the total dollar amount of the “suspect sales”?
4. What is the average gross margin for the suspect sales?

**HINT**: You'll want to use `pd.merge` to combine datasets, but be careful about merging too early or the dataset will be huge. I suggest using `groupby` to collapse larger datasets (like purchases or sales) down to one row per store-product (`InventoryId`) before merging.
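
A minimal sketch of the collapse-then-merge idea on made-up data follows. All quantity column names (`BegQty`, `EndQty`, `Quantity`) and values are assumptions for illustration, and separate `Store` and `Brand` keys stand in for whatever identifier the real files share.

```python
import pandas as pd

keys = ["Store", "Brand"]  # assumed merge keys; the real data may use InventoryId

beg = pd.DataFrame({"Store": ["NY-3349", "CT-6253"], "Brand": [101, 101], "BegQty": [10, 5]})
end = pd.DataFrame({"Store": ["NY-3349", "CT-6253"], "Brand": [101, 101], "EndQty": [4, 2]})
purch_toy = pd.DataFrame({"Store": ["NY-3349", "NY-3349", "CT-6253"],
                          "Brand": [101, 101, 101], "Quantity": [6, 4, 3]})
ret_toy = pd.DataFrame({"Store": ["NY-3349"], "Brand": [101], "Quantity": [-2]})  # negative
sales_toy = pd.DataFrame({"Store": ["NY-3349", "CT-6253", "CT-6253"],
                          "Brand": [101, 101, 101], "Quantity": [14, 5, 4]})

def collapse(df, name):
    """Sum transaction quantities to one row per store-product."""
    return df.groupby(keys, as_index=False)["Quantity"].sum().rename(columns={"Quantity": name})

# Merge only after collapsing; outer joins keep pairs missing from some files
combined = (
    beg.merge(end, on=keys, how="outer")
       .merge(collapse(purch_toy, "PurchQty"), on=keys, how="outer")
       .merge(collapse(ret_toy, "RetQty"), on=keys, how="outer")
       .merge(collapse(sales_toy, "SalesQty"), on=keys, how="outer")
       .fillna(0)
)

# Available = Beginning + Purchases + Returns (negative) - Ending
combined["Available"] = (
    combined["BegQty"] + combined["PurchQty"] + combined["RetQty"] - combined["EndQty"]
)
combined["Excess"] = combined["SalesQty"] - combined["Available"]
suspect = combined[combined["Excess"] > 0]
print(suspect)
```

Filling missing values with zero assumes that a store-product absent from a file had no activity there, which is worth confirming against the real data.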

In [None]:
# Insert code for requirement 3.1 here


In [None]:
# Insert code for requirement 3.2 here


In [None]:
# Insert code for requirement 3.3 here


In [None]:
# Insert code for requirement 3.4 here


✅ *Insert Requirement 3 Answers:*

1.

2.

3.

4.


## Requirement 4 (4046 Only)
Conduct a qualitative analysis of fraudulent sales. Specifically, generate and output a CSV file with the likely fraudulent sales and review it. Identify any red flags you observe in the data.

Note that you should submit your CSV file of fraudulent sales along with your completed jupyter notebook.
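
One possible way to pull the suspect rows out of the transaction-level sales and write them to a CSV is sketched below. The toy data, the `SalesQuantity` column, and the output file name are all assumptions. The sketch writes to the system temp folder only so that it runs anywhere; in your notebook you would write to your working folder instead.

```python
import os
import tempfile
import pandas as pd

# Toy transaction-level sales (assumed columns and values)
sales_toy = pd.DataFrame({
    "Store": ["NY-3349", "CT-6253", "CT-6253"],
    "Brand": [101, 101, 101],
    "SalesQuantity": [14, 5, 4],
})

# Store-product pairs flagged earlier as having sales above available inventory
suspect_keys = pd.DataFrame({"Store": ["CT-6253"], "Brand": [101]})

# Keep only the sales rows that belong to a flagged store-product pair
suspect_sales = sales_toy.merge(suspect_keys, on=["Store", "Brand"], how="inner")

# Hypothetical output file name; index=False avoids writing the row index
out_path = os.path.join(tempfile.gettempdir(), "suspect_sales_example.csv")
suspect_sales.to_csv(out_path, index=False)
```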


> 📌 Please do not zip the files. When files are unzipped after being downloaded from Canvas for grading purposes, your identity information is lost, which may lead to confusion with other students’ submissions.

In [None]:
# Insert code to answer requirement 4


✅ *Insert answer for requirement 4*

*


## Requirement 5 (6046 Only)
Conduct a systematic review of the fraudulent sales (i.e., from the original sales records). Specifically, choose **at least 4** of the following questions to analyze, commenting on the difference between fraudulent and legitimate sales. You can use visualizations, tables, or other methods to answer these questions.

Add a code cell for each, as well as a markdown cell that provides an analysis:

1. Sales quantities
2. Sales prices
3. Sales dates
4. Toy brands
5. Purchase prices
6. Toy descriptions
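
As a hedged starting point for any of these comparisons, the sketch below labels each sale as suspect or legitimate and summarizes two attributes by group. The toy data and the `SalesQuantity` and `SalesPrice` column names are assumptions.

```python
import numpy as np
import pandas as pd

# Toy transaction-level sales (assumed columns and values)
sales_toy = pd.DataFrame({
    "Store": ["NY-3349", "NY-3349", "CT-6253", "CT-6253"],
    "Brand": [101, 101, 101, 101],
    "SalesQuantity": [2, 4, 20, 30],
    "SalesPrice": [10.0, 10.0, 9.99, 9.99],
})
suspect_keys = pd.DataFrame({"Store": ["CT-6253"], "Brand": [101]})

# indicator=True adds a _merge column showing whether each sale matched a flagged pair
flagged = sales_toy.merge(suspect_keys, on=["Store", "Brand"], how="left", indicator=True)
flagged["Group"] = np.where(flagged["_merge"] == "both", "suspect", "legitimate")

# Side-by-side summary statistics for the two groups
summary = flagged.groupby("Group")[["SalesQuantity", "SalesPrice"]].agg(["count", "mean", "median"])
print(summary)
```

Labeling at the store-product level marks every sale for a flagged pair as suspect, so some legitimate sales may be included. The comparisons should be read with that in mind.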

When done, save your sample of fraudulent sales to a CSV file and submit it along with your completed jupyter notebook.


>  📌 Please do not zip the files. When files are unzipped after being downloaded from Canvas for grading purposes, your identity information is lost, which may lead to confusion with other students’ submissions.

In [None]:
# Analysis 1


✅ Discussion of Analysis 1:


In [None]:
# Analysis 2


✅ Discussion of Analysis 2:

In [None]:
# Analysis 3


✅ Discussion of Analysis 3:

In [None]:
# Analysis 4


✅ Discussion of Analysis 4: