# Data Engineer Assessment
## UC01: TTD_DE_UC01_EDA: Perform `Exploratory Data Analysis (EDA)` on provided CSV data

## Summary
This assessment evaluates the candidate's ability to solve data engineering use cases. The candidate is required to solve the assessment questions below using a Jupyter notebook and to record the solutions in the assessment section of the notebook.

Each assessment is structured as a collection of one or more scenarios that need to be addressed by the data engineer.


* __Problem Statement__ - Business users have asked the data engineers to assist with exploratory data analysis to enable the business to make informed decisions.
* __Description__ - The business would like to perform `Exploratory Data Analysis` on the dataset as part of reporting and also to prepare the data for machine learning purposes.  
The business user has recently joined the organization and is unfamiliar with the data, and has asked the data engineer to assist with reviewing the data so that they can generate reports together.

The business user would first like to explore the data and see if there are any patterns in the data that can be used for reporting.


## Code Complexity
- Low / Medium


## `Diagram - Also refer to the PDF in the folder`

![Exploratory Data Analysis](./TTD_UC01_EDA.png "Exploratory Data")



## Datasets:

`File Location`: Refer to the attached `data` folder for information

* Vehicles (vehicles.csv) are built at the plants (plants.csv) to the orders (orders.csv) placed (order_number).
* Vehicles are manufactured at different company plants (plants.csv) (plant_code_id).
* Customers (customers.csv) provide reviews (welcome_call.csv) 60 to 80 days after the vehicles are delivered (vin).
* Orders (orders.csv) are logged by sales_rep_number at various BMW dealerships.
* Sales representatives (sales_rep.csv) are linked to dealerships (dealers.csv) and have dealership names.




## Perform the following joins:

* Link all the data based on the statements made above to create a larger dataset that answers the below questions.
* Identify any duplicates in the data and clean them up. Simply drop the duplicate columns.
* The dataset must contain vehicles linked to the orders, sentiments, salespeople, and plants.
* Provide the name of the salesperson (first_name, last_name, sales_number, and the dealership).
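One possible way to chain these joins in pandas is sketched below. It is only an illustration under stated assumptions: the keys `order_number`, `plant_code_id`, `vin`, and `sales_rep_number` come from the dataset description, while `dealer_id`, `dealership_name`, `iso_country`, `review_score`, and all sample values are hypothetical stand-ins for whatever the real CSV files contain.

```python
import pandas as pd

# Tiny hypothetical stand-ins for the CSV files; real columns may differ.
vehicles = pd.DataFrame({
    "vin": ["V1", "V2", "V3"],
    "order_number": [100, 101, 102],
    "plant_code_id": ["P1", "P1", "P2"],
})
orders = pd.DataFrame({
    "order_number": [100, 101, 102],
    "sales_rep_number": [7, 7, 8],
})
plants = pd.DataFrame({"plant_code_id": ["P1", "P2"], "iso_country": ["DE", "US"]})
sales_rep = pd.DataFrame({
    "sales_rep_number": [7, 8],
    "first_name": ["Ada", "Bo"],
    "last_name": ["Lee", "Kim"],
    "dealer_id": ["D1", "D2"],  # hypothetical link column to dealers
})
dealers = pd.DataFrame({"dealer_id": ["D1", "D2"], "dealership_name": ["North", "South"]})
welcome_call = pd.DataFrame({"vin": ["V1", "V3"], "review_score": [4.5, 3.0]})

# Left joins keep every vehicle even when a lookup row is missing.
# validate="many_to_one" raises if a lookup table has duplicate keys,
# which would otherwise silently multiply rows.
combined = (
    vehicles
    .merge(orders, on="order_number", how="left", validate="many_to_one")
    .merge(plants, on="plant_code_id", how="left", validate="many_to_one")
    .merge(sales_rep, on="sales_rep_number", how="left", validate="many_to_one")
    .merge(dealers, on="dealer_id", how="left", validate="many_to_one")
    .merge(welcome_call, on="vin", how="left", validate="many_to_one")
)

print(combined[["vin", "first_name", "last_name", "dealership_name", "review_score"]])
```

Starting from `vehicles` and joining outward keeps one row per vehicle, which makes later counts per plant, dealership, or salesperson easier to reason about.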



## Questions: `Exploratory Data Analysis - Provide graphs for the options below and document your observations in markdown.`

1. Perform `Exploratory Data Analysis` and provide insights into the data.
2. Provide the distribution by brand, model, and iso_country.
3. Provide the percentage of customers that have purchased more than one car.
4. Provide the distribution of the vehicles manufactured by the plants, including information on brand, model, etc.
5. Provide the top 10 salespeople per dealership.
6. Indicate the total sales per dealership.
7. Get the models of the cars that had the most positive reviews (reviews greater than 3.5).
8. Provide a distribution of the vehicles by status.
9. List all the dealerships that have sold the Rolls-Royce brand.
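As a rough illustration of how a few of these questions (3, 5, 6, and 7) might be approached with pandas `groupby`, consider the sketch below. The frame `sales` and the columns `customer_id`, `dealership_name`, and `review_score` are assumptions made for the example, as is the reading of question 7 as "count the reviews above 3.5 per model".

```python
import pandas as pd

# Hypothetical joined data: one row per sold vehicle.
sales = pd.DataFrame({
    "vin": ["V1", "V2", "V3", "V4", "V5"],
    "customer_id": ["C1", "C1", "C2", "C3", "C3"],
    "dealership_name": ["North", "North", "North", "South", "South"],
    "sales_rep_number": [7, 7, 9, 8, 8],
    "review_score": [4.5, 3.0, 4.0, 3.8, 2.0],
    "model": ["X5", "X5", "i4", "i4", "X1"],
})

# Q3: share of customers who bought more than one car.
cars_per_customer = sales.groupby("customer_id")["vin"].nunique()
pct_repeat = (cars_per_customer > 1).mean() * 100

# Q5: vehicles sold per salesperson, keeping at most 10 per dealership.
rep_sales = (
    sales.groupby(["dealership_name", "sales_rep_number"])
    .size()
    .reset_index(name="vehicles_sold")
    .sort_values(["dealership_name", "vehicles_sold"], ascending=[True, False])
)
top_reps = rep_sales.groupby("dealership_name").head(10)

# Q6: total vehicles sold per dealership.
dealer_totals = sales.groupby("dealership_name")["vin"].nunique()

# Q7: number of reviews above 3.5 per model (one possible interpretation).
positive_by_model = sales.loc[sales["review_score"] > 3.5, "model"].value_counts()

print(f"repeat customers: {pct_repeat:.1f}%")
print(top_reps)
print(dealer_totals)
print(positive_by_model)
```

Each result is a small Series or DataFrame, so it can be passed straight to a plotting call when a graph is required.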




## Libraries or Options used
* Jupyter Notebook - Install and run locally on your laptop or device.
* PySpark, pandas, and Matplotlib or similar plotting libraries.
* Other Python libraries required for Exploratory Data Analysis



## `Acceptance Criteria`
The following acceptance criteria must be met:

1. Perform exploratory data analysis and present your results as observations.
2. Python graph libraries must be used to plot graphs that support your findings.
3. Annotate your notebook with markdown cells that state your observations.
4. Perform the analysis of the data using Spark or pandas.

# Implementation

Provide all the implementation steps in the sections that follow. Ensure that you provide detailed explanations of the approach.


### Import the libraries that you need for EDA

In [None]:
# Import any relevant libraries
import os
import re

import pandas as pd
# Import other EDA libraries that you need below



#### List of expected dataframes to be loaded


  * Vehicles (vehicles.csv) are built at the plants (plants.csv) to the orders (orders.csv) placed (order_number).
  * Vehicles are manufactured at different company plants (plants.csv) (plant_code_id).
  * Customers (customers.csv) provide reviews (welcome_call.csv) 60 to 80 days after the vehicles are delivered (vin).
  * Orders (orders.csv) are logged by sales_rep_number at various BMW dealerships.
  * Sales representatives (sales_rep.csv) are linked to dealerships (dealers.csv) and have dealership names.



### Load the data from the data folder into the data frame.

In [None]:
# Write your code below to load the relevant data into data frames.
# Perform any cleanup operations if required, e.g. remove duplicates.
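A minimal sketch of a loading step is shown here for orientation only. The file names are taken from the dataset list; the `data` folder path, the helper name `load_frames`, and the choice to drop only exact duplicate rows are assumptions, not requirements of the assessment.

```python
from pathlib import Path

import pandas as pd

# File names come from the dataset list; the folder location is an assumption.
CSV_FILES = [
    "vehicles", "plants", "orders", "customers",
    "welcome_call", "sales_rep", "dealers",
]


def load_frames(data_dir="data"):
    """Read each CSV found in data_dir and drop exact duplicate rows."""
    frames = {}
    for name in CSV_FILES:
        path = Path(data_dir) / f"{name}.csv"
        if not path.exists():
            print(f"skipping missing file: {path}")
            continue
        df = pd.read_csv(path)
        rows_read = len(df)
        # drop_duplicates removes rows that match an earlier row in every column.
        df = df.drop_duplicates().reset_index(drop=True)
        print(f"{name}: {rows_read} rows read, {rows_read - len(df)} duplicates dropped")
        frames[name] = df
    return frames
```

Calling `frames = load_frames("data")` would then give a dictionary keyed by file name, for example `frames["vehicles"]`. Printing the number of dropped rows per file also provides material for answering the cleanup question that follows.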

#### Question: Did you need to perform any cleanup on the dataframes? If yes, what cleanup operations did you perform?

#### *Answer*: Replace with your response



### Provide some statistical information about the data you just loaded


In [None]:
# Write code to provide statistical information about each dataframe that you just loaded.

# Write your code below
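One way to summarise a dataframe is to combine a per-column profile with pandas' built-in `describe`. The helper name `profile` and the sample frame below are hypothetical and only illustrate the idea.

```python
import pandas as pd


def profile(df):
    """Return one row per column: dtype, non-null count, missing count, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "missing": df.isna().sum(),
        "unique": df.nunique(),  # NaN is not counted as a distinct value
    })


# Hypothetical sample standing in for one of the loaded dataframes.
sample = pd.DataFrame({
    "plant_code_id": ["P1", "P2", "P3"],
    "iso_country": ["DE", "US", None],
    "capacity": [1000, 2500, 1800],
})

print(sample.shape)                    # (rows, columns)
print(profile(sample))                 # structural overview per column
print(sample.describe(include="all"))  # count, mean, std, quartiles, top values
```

The profile highlights missing values and low-cardinality columns, which are usually good candidates for distribution plots later on.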

### Perform all the relevant join operations between the datasets.

Hint: the relationships between the datasets are described above.


In [None]:
# Write the queries that perform the relevant dataframe join operations.
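When joining, it can help to check how many keys actually match. The sketch below uses the `indicator` option of `DataFrame.merge` for that purpose; the sample rows are made up, and only `order_number` is a key named in this assessment.

```python
import pandas as pd

# Hypothetical sample rows; vehicle V3 refers to an order that does not exist.
vehicles = pd.DataFrame({"vin": ["V1", "V2", "V3"], "order_number": [100, 101, 103]})
orders = pd.DataFrame({"order_number": [100, 101, 102], "sales_rep_number": [7, 7, 8]})

# indicator=True adds a "_merge" column with the values
# "both", "left_only", or "right_only" for each row.
checked = vehicles.merge(orders, on="order_number", how="left", indicator=True)

match_counts = checked["_merge"].value_counts()
unmatched = checked.loc[checked["_merge"] == "left_only", "vin"].tolist()

print(match_counts)
print("vehicles without a matching order:", unmatched)
```

Reporting the matched and unmatched counts per join is one way to document what information became available after the joins were performed.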

#### Question: Did you perform any joins on the datasets? If yes, which joins? What information was available after the joins were performed?

#### *Answer*: Replace with your response

### Perform all the standard exploratory data analysis in the sections that follow to give the business users information about the data. Report your findings as graphs or written statements

In [None]:
# Example: Write down the distribution of Vehicles by plant and iso_country and plot a bar graph
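The example described in the comment above could look roughly like the following. The frame `joined` and its values are invented for illustration; `plant_code_id`, `iso_country`, and `vin` are the column names mentioned in this assessment.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical joined data: one row per vehicle.
joined = pd.DataFrame({
    "vin": ["V1", "V2", "V3", "V4"],
    "plant_code_id": ["P1", "P1", "P1", "P2"],
    "iso_country": ["DE", "DE", "DE", "US"],
})

# Count distinct vehicles per plant and country, largest first.
counts = (
    joined.groupby(["plant_code_id", "iso_country"])["vin"]
    .nunique()
    .sort_values(ascending=False)
)

# Build one label per bar from the (plant, country) index pairs.
labels = [f"{plant} ({country})" for plant, country in counts.index]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(labels, counts.to_numpy())
ax.set_xlabel("Plant (ISO country)")
ax.set_ylabel("Vehicles manufactured")
ax.set_title("Vehicles by plant and country")
fig.tight_layout()
```

In a Jupyter notebook the figure is normally rendered at the end of the cell; a short markdown observation underneath then ties the chart to a finding.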



In [None]:
# Write your own exploratory data analysis on the ingested dataframes and report on the different findings.
# Also provide visual aids for each finding.
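Two further analyses, sketched under assumptions: the share of vehicles per status and the dealerships that sold a given brand. The column names `status` and `dealership_name` and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical joined data: one row per vehicle.
joined = pd.DataFrame({
    "vin": ["V1", "V2", "V3", "V4"],
    "brand": ["BMW", "Rolls-Royce", "MINI", "Rolls-Royce"],
    "status": ["delivered", "delivered", "in_production", "ordered"],
    "dealership_name": ["North", "South", "North", "East"],
})

# Percentage of vehicles in each status.
status_share = joined["status"].value_counts(normalize=True).mul(100).round(1)

# Dealerships with at least one Rolls-Royce sale (case-insensitive brand match).
is_rolls_royce = joined["brand"].str.lower() == "rolls-royce"
rr_dealers = sorted(joined.loc[is_rolls_royce, "dealership_name"].dropna().unique())

print(status_share)
print(rr_dealers)
```

`status_share` can be plotted directly with `status_share.plot(kind="bar")` when a visual aid is needed.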

In [None]:
# Use Matplotlib or other graphing libraries to create charts that support your findings
import matplotlib.pyplot as plt

### Report all your Findings:

Report your findings in bullet points.
Example (for illustration purposes only; replace it with your own findings and support them with evidence):
1. The US plant manufactured the most vehicles in 2023: 30,000 vehicles were manufactured at the plant.
