# PEI Group Interview Project

## Data Analyst Task

### The sales team has the following data from various sources:
* Customers.xls - [https://easyupload.io/t9m9my]
* Orders.csv - [https://easyupload.io/pngfna]
* Shippings.json - [https://easyupload.io/fm8t5t]


### Objectives: The team is trying to generate the reports for the below requirements:

* the total amount spent for orders with Pending delivery status, broken down by country.
* the total number of transactions, total quantity sold, and total amount spent for each customer, along with the product details.
* the most purchased product for each country.
* the most purchased product for each age category (below 30 and above 30).
* the country with the fewest transactions and the lowest sales amount.


### Quality Checks: As a Data Analyst, you are required to

* Verify the accuracy, completeness, and reliability of source data. 
* Based on your findings, define and outline the requirements for anticipated datasets, detailing the necessary data components.
* Develop the data models to effectively organise and structure the information and provide a detailed mapping of existing data flows, focussing on the areas of concern.
* Prepare a story with technical specifications for one part of the data model for a data engineer.
* Communicate the findings and insights to stakeholders in a visually comprehensive manner.


In [1]:
# Additional packages required that are not part of the default docker image.

# Required to import xls files in pandas
!pip install xlrd



In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/pei-group/Order.csv
/kaggle/input/pei-group/Customer.xls
/kaggle/input/pei-group/Shipping.json


## 1. Exploratory Data Analysis

### Importing Data
We load the three source files (Excel, CSV, and JSON) into pandas dataframes.

In [3]:
customers = pd.read_excel('/kaggle/input/pei-group/Customer.xls', sheet_name=0)
orders = pd.read_csv('/kaggle/input/pei-group/Order.csv')
shipping = pd.read_json('/kaggle/input/pei-group/Shipping.json')

In [9]:
# Let's check if all dataframes have been created successfully
print(customers.head(), orders.head(), shipping.head(), sep='\n\n')

   Customer_ID    First     Last  Age Country
0            1   Joseph     Rice   43     USA
1            2     Gary    Moore   71     USA
2            3     John   Walker   44      UK
3            4     Eric   Carter   38      UK
4            5  William  Jackson   58     UAE

   Order_ID      Item  Amount  Customer_ID
0         1  Keyboard     400          139
1         2     Mouse     300          250
2         3   Monitor   12000          239
3         4  Keyboard     400          153
4         5  Mousepad     250          153

   Shipping_ID     Status  Customer_ID
0            1    Pending          173
1            2    Pending          155
2            3  Delivered          242
3            4    Pending          223
4            5  Delivered           72


**Result**: All three dataframes were created, and their schemas match the data in the source files.
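The visual check above could also be made explicit. The following is a minimal sketch, assuming the expected column names are the ones printed by `head()`. The small dataframes are stand-ins built from the first rows of that output so the cell runs on its own, and `expected_columns`, `missing_columns`, and `schema_report` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Stand-in dataframes built from the first rows printed above.
# In the notebook, the real `customers`, `orders`, and `shipping` would be used instead.
customers = pd.DataFrame({
    "Customer_ID": [1, 2],
    "First": ["Joseph", "Gary"],
    "Last": ["Rice", "Moore"],
    "Age": [43, 71],
    "Country": ["USA", "USA"],
})
orders = pd.DataFrame({
    "Order_ID": [1, 2],
    "Item": ["Keyboard", "Mouse"],
    "Amount": [400, 300],
    "Customer_ID": [139, 250],
})
shipping = pd.DataFrame({
    "Shipping_ID": [1, 2],
    "Status": ["Pending", "Pending"],
    "Customer_ID": [173, 155],
})

# Assumed schema: the column names seen in the head() output.
expected_columns = {
    "customers": ["Customer_ID", "First", "Last", "Age", "Country"],
    "orders": ["Order_ID", "Item", "Amount", "Customer_ID"],
    "shipping": ["Shipping_ID", "Status", "Customer_ID"],
}


def missing_columns(df, expected):
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]


frames = {"customers": customers, "orders": orders, "shipping": shipping}
schema_report = {
    name: missing_columns(df, expected_columns[name])
    for name, df in frames.items()
}
print(schema_report)
```

An empty list for a dataframe means that none of its expected columns are missing.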

### Data Validation

* Check for missing values.
* Check for duplicates.
* Check that the data types have been assigned correctly.
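The three checks listed above can be sketched as one helper. This is an illustrative outline rather than the notebook's actual validation code: `validate` and `sample` are hypothetical names, and the sample rows are invented to trigger each check, not taken from the source files.

```python
import pandas as pd


def validate(df: pd.DataFrame, key: str) -> dict:
    """Summarise missing values, duplicates, and data types for one dataframe."""
    return {
        # Total number of missing cells across all columns
        "missing_values": int(df.isna().sum().sum()),
        # Rows that are exact copies of an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # Repeated values in the column that is expected to be unique
        "duplicate_keys": int(df[key].duplicated().sum()),
        # Data type pandas inferred for each column
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


# Invented sample shaped like the orders data: one repeated row and one missing Item.
sample = pd.DataFrame({
    "Order_ID": [1, 2, 2, 4],
    "Item": ["Keyboard", "Mouse", "Mouse", None],
    "Amount": [400, 300, 300, 250],
    "Customer_ID": [139, 250, 250, 153],
})

report = validate(sample, key="Order_ID")
for check, result in report.items():
    print(f"{check}: {result}")
```

In the notebook, the same helper could be applied to `customers`, `orders`, and `shipping`, with `Customer_ID`, `Order_ID`, and `Shipping_ID` as the respective keys, assuming each of those columns is meant to be unique within its own table.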