In [1]:
import pandas as pd
from datetime import datetime
from numpy import random, ceil
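# Note: pandas-profiling has been renamed to ydata-profiling; on newer installs
# use `from ydata_profiling import ProfileReport` instead.
from pandas_profiling import ProfileReport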

- To create a dataset with a different number of rows, change the number of iterations in the notebook 'generate_sales_dataset.ipynb'.
- The CSV file included in the repo has 1000 rows.

In [14]:
# Load the 'Sales_Dataset.csv' file included in the repo into a DataFrame.
df = pd.read_csv('Sales_Dataset.csv', index_col=[0])
df.head(10)

Unnamed: 0,Order ID,Order Date,In Store or Online,Customer Name,Product ID,Product Name,Category,Sales,Quantity,Profit
0,1,2021-09-28,Online,Sandra Smith,339,Product 339,Category 1,324,9,49.0
1,2,2021-09-15,Online,Geoff Carter,261,Product 261,Category 1,189,3,18.0
2,3,2019-01-31,In Store,Barry Wilson,11,Product 11,Category 1,336,8,21.0
3,4,2019-08-31,In Store,Janet Green,56,Product 56,Category 1,295,9,42.0
4,5,2021-05-31,In Store,Geoff Green,283,Product 283,Category 1,380,1,73.0
5,6,2019-09-24,In Store,Steve Gill,126,Product 126,Category 1,263,6,-7.0
6,7,2021-12-31,Online,Geoff Cox,372,Product 372,Category 2,149,7,68.0
7,8,2020-02-19,In Store,Geoff Wilson,346,Product 346,Category 1,116,6,10.0
8,9,2019-01-31,Online,Josephine Davis,420,Product 420,Category 2,359,2,166.0
9,10,2021-02-25,Online,Terry Sedgewick,255,Product 255,Category 1,392,1,8.0


- Advanced profiling can be done with the ProfileReport class.
- The report flags missing values and columns whose values are all unique, classifies the cardinality of each column, shows the correlation between columns, and lists the mean, max, min and memory size of each column.
- The detailed correlation charts are a good starting point for further analysis.
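
For orientation, several of the figures listed above can be approximated by hand with plain pandas. The sketch below is not how ProfileReport computes its report; it only illustrates what the numbers mean. It assumes the ten rows printed by `df.head(10)` above as input (so it runs without the CSV file), and the names `csv_text`, `sample`, `summary` and `numeric` are placeholders introduced here.

```python
import io

import pandas as pd

# The ten rows shown by df.head(10) above, inlined so the sketch is self-contained.
csv_text = """Order ID,Order Date,In Store or Online,Customer Name,Product ID,Product Name,Category,Sales,Quantity,Profit
1,2021-09-28,Online,Sandra Smith,339,Product 339,Category 1,324,9,49.0
2,2021-09-15,Online,Geoff Carter,261,Product 261,Category 1,189,3,18.0
3,2019-01-31,In Store,Barry Wilson,11,Product 11,Category 1,336,8,21.0
4,2019-08-31,In Store,Janet Green,56,Product 56,Category 1,295,9,42.0
5,2021-05-31,In Store,Geoff Green,283,Product 283,Category 1,380,1,73.0
6,2019-09-24,In Store,Steve Gill,126,Product 126,Category 1,263,6,-7.0
7,2021-12-31,Online,Geoff Cox,372,Product 372,Category 2,149,7,68.0
8,2020-02-19,In Store,Geoff Wilson,346,Product 346,Category 1,116,6,10.0
9,2019-01-31,Online,Josephine Davis,420,Product 420,Category 2,359,2,166.0
10,2021-02-25,Online,Terry Sedgewick,255,Product 255,Category 1,392,1,8.0
"""
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Order Date"])

# Per-column overview: missing values, distinct values and memory footprint.
summary = pd.DataFrame({
    "missing": sample.isna().sum(),
    "unique": sample.nunique(),
    "memory_bytes": sample.memory_usage(index=False, deep=True),
})
print(summary)

# Mean, min and max plus pairwise correlation for the numeric columns only.
numeric = sample.select_dtypes("number")
print(numeric.describe().loc[["mean", "min", "max"]])
print(numeric.corr())
```

A column whose `unique` count equals the number of rows (Order ID here) is what the report labels as unique, and a text column with many distinct values (Customer Name) is what it labels as high cardinality.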

# 1. Advanced Profiling

In [3]:
# Profile Report class setup
profile_report = ProfileReport(df, title="Sales Data Profiling")

- The **to_widgets** method renders an interactive profiling report inline in the notebook.

In [4]:
# View profile report in this notebook
profile_report.to_widgets()

- You can also export the report to an HTML file, which opens as an interactive web page.

In [5]:
# The report can also be published as a web page (with menu options)
profile_report.to_file("sales_data_profiling.html")

**The dataset above has been created so that:**
- Higher sales amounts have a relatively higher profit amount (within a range).
- Sales that are both Online and in Category 2 produce the best profit.
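
The actual generator lives in 'generate_sales_dataset.ipynb' and is not reproduced here. As a rough illustration of the two rules above, the sketch below shows one way such a dataset could be built. Everything in it is an assumption: the function name `make_sales_rows`, the sales range and the margin bands are hypothetical and are not taken from the repo.

```python
import numpy as np
import pandas as pd


def make_sales_rows(n_rows, seed=0):
    """Hypothetical generator: profit is a bounded share of sales."""
    rng = np.random.default_rng(seed)
    channel = rng.choice(["In Store", "Online"], size=n_rows)
    category = rng.choice(["Category 1", "Category 2"], size=n_rows)
    sales = rng.integers(100, 401, size=n_rows)  # assumed range of sales amounts

    # Rule 1: profit is a random share of sales, so it grows with sales
    # but stays within a range. The 0-25% band is an assumption.
    margin = rng.uniform(0.00, 0.25, size=n_rows)

    # Rule 2: rows that are both Online and Category 2 draw from a
    # higher margin band (25-50%, also an assumption).
    best = (channel == "Online") & (category == "Category 2")
    margin[best] = rng.uniform(0.25, 0.50, size=best.sum())

    return pd.DataFrame({
        "In Store or Online": channel,
        "Category": category,
        "Sales": sales,
        "Profit": np.round(sales * margin),
    })


# n_rows plays the role of the "number of iterations" mentioned earlier.
generated = make_sales_rows(1000)
print(generated.head())
```

Because the relationship is built in at generation time, a profiling report on data like this would be expected to surface the Sales/Profit, Category/Profit and channel/Profit correlations.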

**The advanced analysis report on the above dataset would tell us:**
- Customer Name and Product Name columns have High Cardinality
- Category is 'Categorical' (you can perform further grouping for insights using this column)
- In Store or Online is also 'Categorical'
- Order ID has unique values
- Profit has a small percentage of zero values
- There is a correlation between Sales and Profit (as programmed in the sales_data dataset above)
- There is a correlation between Product ID and Profit (the products in Category 2 have higher profit margins)
- There is a correlation between Category and Profit
- There is a correlation between In Store or Online and Profit (Online along with Category 2 products have higher profit margins)
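
The categorical findings above can be cross-checked with a simple group-by. The sketch below is only illustrative: it reuses the ten rows printed by `df.head(10)` earlier (four columns only), which is far too small a sample to prove anything, and the names `sample` and `mean_profit` are placeholders introduced here. On the full DataFrame the same call would be `df.groupby([...])["Profit"].mean()`.

```python
import pandas as pd

# Channel, category, sales and profit from the ten rows shown by df.head(10).
sample = pd.DataFrame({
    "In Store or Online": ["Online", "Online", "In Store", "In Store", "In Store",
                           "In Store", "Online", "In Store", "Online", "Online"],
    "Category": ["Category 1", "Category 1", "Category 1", "Category 1", "Category 1",
                 "Category 1", "Category 2", "Category 1", "Category 2", "Category 1"],
    "Sales": [324, 189, 336, 295, 380, 263, 149, 116, 359, 392],
    "Profit": [49.0, 18.0, 21.0, 42.0, 73.0, -7.0, 68.0, 10.0, 166.0, 8.0],
})

# Average profit for each channel/category combination present in the sample.
mean_profit = sample.groupby(["In Store or Online", "Category"])["Profit"].mean()
print(mean_profit.sort_values(ascending=False))
```

In these ten rows the Online and Category 2 combination has the highest average profit, which is consistent with how the dataset was designed.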