# Data Analyst Professional Practical Exam Submission


## 📝 Task List
Your written report should include written text summaries and graphics of the following:
- Data validation:   
  - Describe validation and cleaning steps for every column in the data 
- Exploratory Analysis:  
  - Include two different graphics showing single variables only to demonstrate the characteristics of data  
  - Include at least one graphic showing two or more variables to represent the relationship between features
  - Describe your findings
- Definition of a metric for the business to monitor  
  - How should the business use the metric to monitor the business problem
  - Can you estimate initial value(s) for the metric based on the current data
- Final summary including recommendations that the business should undertake


# Data validation and cleaning
- The dataset contains 8 columns: week, sales_method, customer_id, nb_sold, revenue, years_as_customer, nb_site_visits, and state
- Only the revenue column contains missing values.
- The sales_method column has 5 distinct values, but only three are valid (Email, Call, and Email + Call); the other two ('em + call' and 'email') are mistyped variants. I fixed this by casting the column to string and replacing the mistyped labels with the correct ones.
- There are no duplicates in customer_id and no significant outliers in nb_site_visits or nb_sold. However, two rows have a years_as_customer value higher than 40. Since the company was founded in 1984, these values are impossible, so I treated them as errors and removed the rows. All other columns have the correct data type and are fully populated with values in the expected ranges.
- I imputed the missing revenue values instead of dropping the rows, because the share of missing values (about 7%) exceeds the 5% threshold below which dropping is usually acceptable. Each imputed value is the average price per product for the row's sales method, multiplied by the number of products sold (nb_sold).
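
The checks described above can be sketched roughly as follows. This is a minimal illustration on a toy DataFrame, not the real `product_sales.csv`; the rows and the names `toy`, `label_fixes`, and `cleaned` are made up for the example.

```python
import pandas as pd

# Toy stand-in for product_sales.csv (illustrative values, not real data)
toy = pd.DataFrame({
    "customer_id": ["a1", "b2", "c3", "d4", "e5"],
    "sales_method": ["Email", "email", "em + call", "Call", "Email + Call"],
    "revenue": [95.0, None, 180.0, 48.0, None],
    "years_as_customer": [3, 63, 10, 1, 0],
})

# Missing values per column: only revenue should be non-zero
missing_per_column = toy.isna().sum()

# Map the mistyped sales_method labels onto the three valid ones
label_fixes = {"em + call": "Email + Call", "email": "Email"}
toy["sales_method"] = toy["sales_method"].replace(label_fixes)
methods = sorted(toy["sales_method"].unique())

# Number of repeated customer ids
duplicate_ids = int(toy["customer_id"].duplicated().sum())

# The company was founded in 1984, so more than 40 years is impossible
cleaned = toy[toy["years_as_customer"] <= 40].copy()

print(missing_per_column)
print(methods, duplicate_ids, len(cleaned))
```

Running the same checks on the full dataset before and after cleaning is a quick way to confirm that each step did what was intended.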

# Exploratory Analysis
In order to start analyzing the data, the first step is to group the data by the sales method/approach used to promote the new products. Below is a graph showing the total number of customers contacted in each of the three ways.
![Breakdown by approach](figures/customer_count_by_sales_method.png)
The graph shows that the **Email method** was used the most, which makes sense, since it is the easiest and least time-consuming of the three.
The **Call method** was the second most frequent approach, meaning more customers were called without a prior email than were contacted through the combined Email + Call method. This matters for the sales team, since a call without a prior email takes more time than the combined approach.
Lastly, the **Email + Call method** was used the least.

## Overall Revenue:
| Summary Statistic | Value |
| :--- | :--- |
| Number of entries | 14998 |
| Average | 94.84 |
| Standard deviation | 47.35 |
| Minimum | 32.54 |
| 25th percentile | 52.76 |
| 50th percentile | 90.00 |
| 75th percentile | 107.86 |
| Maximum | 238.32 |


## Call Approach:
| Summary Statistic | Value |
| :--- | :--- |
| Number of entries | 4961 |
| Average | 47.57 |
| Standard deviation | 8.60 |
| Minimum | 32.54 |
| 25th percentile | 41.46 |
| 50th percentile | 48.23 |
| 75th percentile | 52.70 |
| Maximum | 71.36 |


## Email Approach:
| Summary Statistic | Value |
| :--- | :--- |
| Number of entries | 7465 |
| Average | 96.67 |
| Standard deviation | 11.30 |
| Minimum | 74.04 |
| 25th percentile | 87.52 |
| 50th percentile | 95.04 |
| 75th percentile | 104.60 |
| Maximum | 148.97 |


## Email And Call Approach:
| Summary Statistic | Value |
| :--- | :--- |
| Number of entries | 2572 |
| Average | 180.70 |
| Standard deviation | 29.49 |
| Minimum | 103.87 |
| 25th percentile | 155.80 |
| 50th percentile | 183.70 |
| 75th percentile | 191.28 |
| Maximum | 238.32 |
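
Summary tables of this shape can be produced with pandas' `describe()`. The sketch below assumes a cleaned DataFrame with `sales_method` and `revenue` columns; the six rows are toy values for illustration, not the real data.

```python
import pandas as pd

# Toy revenue values per sales method (illustrative only)
toy = pd.DataFrame({
    "sales_method": ["Call", "Call", "Email", "Email",
                     "Email + Call", "Email + Call"],
    "revenue": [40.0, 50.0, 90.0, 100.0, 170.0, 190.0],
})

# Count, mean, std, min, quartiles, and max for all rows together
overall_stats = toy["revenue"].describe()

# The same statistics, one row per sales method
per_method_stats = toy.groupby("sales_method")["revenue"].describe()

print(overall_stats.round(2))
print(per_method_stats.round(2))
```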


![Top 10 States by number of contacted customers](figures/top10_states.png)
This bar chart shows the top 10 states by number of customers. There is a clear "big 4": California is the most dominant, followed by Texas, New York, and Florida. The remaining states are much closer to each other in number of customers.
Comparing the chart with the most populous states in the USA shows that the distribution of customers in the dataset closely resembles the distribution of the country's population. In other words, the number of customers is roughly proportional to state population.
I also checked total revenue by state and found that it closely follows the number of customers by state, so there are no anomalies in that regard.

Next, we look more closely at revenue by sales method. The following chart makes the differences easier to see than the summary tables above.
![Total revenue over time by subgroup](figures/total_revenue_difference.png)
In terms of total revenue accumulated over the 6 weeks, the **Email approach** clearly brought in the most money, around $720,000. Email had its strongest week in week 1. The other methods narrowed the gap in later weeks, but they still trail Email by a large margin in total revenue.

![Revenue over time by week and subgroup](figures/revenue_per_week_difference.png)
By week 3, the **Email + Call method** starts to build momentum and brings in more money than the **Call method**, which is the weakest of the three. The Email method loses momentum after week 3, whereas Email + Call continues to rise.

![Revenue divided by bins and years as customer](figures/bins.png)
This chart shows revenue split by sales method and by bins of years as a customer. The longer people have been our customers, the less they tend to spend on new products.
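
A binning like the one in the chart can be built with `pd.cut`. The sketch below is an assumption about how it might be done: the bin edges and labels are hypothetical (the report does not state which bins were used), and the rows are toy values.

```python
import pandas as pd

# Toy rows (illustrative values, not real data)
toy = pd.DataFrame({
    "sales_method": ["Email", "Email", "Call", "Call", "Email + Call"],
    "years_as_customer": [0, 12, 3, 25, 7],
    "revenue": [100.0, 90.0, 50.0, 45.0, 180.0],
})

# Hypothetical bin edges; right=False makes each bin [lower, upper),
# so the last edge is 41 to keep customers with exactly 40 years
bin_edges = [0, 5, 10, 20, 41]
bin_labels = ["0-4", "5-9", "10-19", "20-40"]
toy["tenure_bin"] = pd.cut(
    toy["years_as_customer"], bins=bin_edges, labels=bin_labels, right=False
)

# Average revenue per tenure bin and sales method
revenue_by_bin = (
    toy.groupby(["tenure_bin", "sales_method"], observed=True)["revenue"]
       .mean()
)
print(revenue_by_bin)
```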

# Definition of a metric for the business to follow
| Sales method | Revenue | Number of customers | Hours spent | Revenue per hour |
| :--- | :--- | :--- | :--- | :--- |
| Call | 236015.33 | 4961 | 2480 | 95 |
| Email | 721608.50 | 7465 | 19 | 37979 |
| Email + Call | 464773.97 | 2572 | 428 | 1085 |

The metric 'Revenue per hour' measures how much revenue each sales method generates per hour of employee time. Hours spent are estimated from assumed contact times per customer: roughly 10 seconds for an email, 30 minutes for a call, and 10 minutes for a call that follows an email. By this measure, Email is by far the most efficient method.
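
As a worked example of the arithmetic, the sketch below recomputes the Email + Call row. The helper name `revenue_per_hour` is hypothetical, and the 10 minutes per contact is the assumed time taken from the notebook's `multiplier_map`. Whole hours and whole currency units are used, as in the notebook code.

```python
def revenue_per_hour(total_revenue, customer_count, minutes_per_contact):
    """Return (hours_spent, revenue_per_hour), both floored to whole units."""
    hours_spent = (customer_count * minutes_per_contact) // 60
    return hours_spent, total_revenue // hours_spent

# Email + Call: 2572 customers at an assumed 10 minutes each
hours, rph = revenue_per_hour(464773.97, 2572, 10)
print(hours, rph)  # 428 1085.0
```

Because the hours are floored before dividing, the metric is slightly overstated for methods with few total hours; using true division instead would give a more precise figure.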

# Final summary including recommendations
- Email contact is the dominant revenue generator; use it aggressively, with launch emails and a follow-up reminder in week 3.
- Email + Call gains momentum from week 3 onward; allocate resources to make the most of this method.
- The Call method is inefficient; only call customers after an initial email.
- Focus on newer customers (<5 years) for maximum revenue potential.
- Business should monitor the 'Revenue per hour' metric weekly, as effectiveness changes over time.
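
Weekly monitoring of the metric could look roughly like the sketch below. It assumes the cleaned data has `week`, `sales_method`, and `revenue` columns, reuses the assumed minutes per contact from the notebook, and runs on toy rows. It uses true division rather than floor division, since weekly hours for a method can be below one hour.

```python
import pandas as pd

# Assumed minutes of employee time per customer contacted
MINUTES_PER_CONTACT = {"Email": 0.16, "Call": 30, "Email + Call": 10}

# Toy rows (illustrative values, not real data)
toy = pd.DataFrame({
    "week": [1, 1, 1, 2, 2, 2],
    "sales_method": ["Email", "Call", "Email + Call",
                     "Email", "Call", "Email + Call"],
    "revenue": [96.0, 48.0, 150.0, 90.0, 45.0, 180.0],
})

# Revenue and customer count per week and sales method
weekly = (
    toy.groupby(["week", "sales_method"])
       .agg(total_revenue=("revenue", "sum"),
            customer_count=("revenue", "size"))
       .reset_index()
)

# Estimated hours and revenue per hour, without flooring
weekly["hours_spent"] = (
    weekly["customer_count"] * weekly["sales_method"].map(MINUTES_PER_CONTACT) / 60
)
weekly["revenue_per_hour"] = weekly["total_revenue"] / weekly["hours_spent"]
print(weekly)
```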

```python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as colors

raw_df = pd.read_csv('product_sales.csv')

# Share of missing revenue values in percent (about 7%, above the 5% threshold)
revenue_nan_threshold = raw_df['revenue'].isna().mean() * 100

# Turn 'em + call' and lowercase 'email' into correct sales methods
raw_df['sales_method'] = raw_df['sales_method'].astype(str)
raw_df['sales_method'] = raw_df['sales_method'].replace(
    {'em + call': 'Email + Call', 'email': 'Email'}
)

# Remove the rows where years_as_customer is higher than 40
# (.copy() avoids modifying a view of the original frame later on)
raw_df = raw_df[raw_df['years_as_customer'] <= 40].copy()

# Impute missing revenue with the average price per product for the row's
# sales method, multiplied by the number of products sold
grouped_df = raw_df.groupby('sales_method')[['revenue', 'nb_sold']].sum()
grouped_df['expected_price'] = grouped_df['revenue'] / grouped_df['nb_sold']
revenue_dict = grouped_df['expected_price'].to_dict()
raw_df['standard_price'] = raw_df['sales_method'].map(revenue_dict)
raw_df['revenue'] = raw_df['revenue'].fillna(raw_df['standard_price'] * raw_df['nb_sold']).round(2)
raw_df = raw_df.drop(columns='standard_price')

# Task 4: Revenue per hour metric
revenue_customers_df = raw_df.groupby('sales_method').agg(
    total_revenue=('revenue', 'sum'),
    customer_count=('customer_id', 'count')
).reset_index()

# Assumed minutes of employee time per customer contacted
multiplier_map = {
    'Email': 0.16,      # about 10 seconds
    'Call': 30,         # minutes
    'Email + Call': 10  # minutes
}
# Floor division keeps whole hours and whole currency units
revenue_customers_df['hours_spent'] = (revenue_customers_df['customer_count'] * revenue_customers_df['sales_method'].map(multiplier_map)) // 60
revenue_customers_df['revenue_per_hour'] = revenue_customers_df['total_revenue'] // revenue_customers_df['hours_spent']
print(revenue_customers_df)
```

```
   sales_method  total_revenue  customer_count  hours_spent  revenue_per_hour
0          Call      236015.33            4961       2480.0              95.0
1         Email      721608.50            7465         19.0           37979.0
2  Email + Call      464773.97            2572        428.0            1085.0
```
