# Data Report – Predicting Catalog Demand

### by Travis Gillespie

## Table of Contents
- [Introduction](#intro)
- [Business and Data Understanding](#businessUnderstanding)
    - [Key Decisions](#keyDecisions)
- [Analysis, Modeling, and Validation](#analysisModelingValidation)
    - [Question 1](#analysis_Question1)
    - [Question 2](#analysis_Question2)
    - [Question 3](#analysis_Question3)
- [Presentation/Visualization](#presentationVisualization)
    - [Question 1](#visual_Question1)
    - [Question 2](#visual_Question2)
    - [Question 3](#visual_Question3)

<a id='intro'></a>
## Introduction

You recently started working for a company that manufactures and sells high-end home goods. Last year the company sent out its first print catalog, and is preparing to send out this year's catalog in the coming months. The company has 250 new customers from their mailing list that they want to send the catalog to.

Your manager has been asked to determine how much profit the company can expect from sending a catalog to these customers. You, the business analyst, are assigned to help your manager run the numbers. While fairly knowledgeable about data analysis, your manager is not very familiar with predictive models.

You’ve been asked to predict the expected profit from these 250 new customers. Management does not want to send the catalog out to these new customers unless the expected profit contribution exceeds $10,000.

<a id='businessUnderstanding'></a>
## Business and Data Understanding

<a id='keyDecisions'></a>
### Key Decisions

<a id='keyDecisions_Question1'></a>
### Question 1

_What decisions need to be made?_

Whether to send this year's catalog to the 250 new customers. The catalog should be sent only if the expected profit contribution exceeds $10,000.




<a id='keyDecisions_Question2'></a>
### Question 2

_What data is needed to inform those decisions?_

* The expected revenue from the 250 new customers.
* The probability that a customer will respond to the catalog and make a purchase.
* The average number of products each customer purchases (*Avg_Num_Products_Purchased*).
* Categorical variables converted to dummy variables.




<a id='analysisModelingValidation'></a>
## Analysis, Modeling, and Validation

<a id='analysis_Question1'></a>
### Question 1

_How and why did you select the predictor variables in your model?_

Using linear regression models, I assessed which predictor variables have the strongest correlation with the target variable, *Avg_Sale_Amount*.

<img src="assets/images/regplot_df_Customers.png" alt="IMAGE" title="TITLE" width="75%" align="left" >

<img src="assets/images/pearsonCorrelation.png" alt="IMAGE" title="TITLE" width="75%" align="left" >

The pair plot and Pearson correlation matrix (above) suggest that *Avg_Sale_Amount* and *Avg_Num_Products_Purchased* have a strong positive correlation of approximately 0.8558.
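
For reference, the Pearson coefficient behind such a matrix can be computed by hand. The sketch below is a minimal, standard-library illustration; the function name `pearson_r` and the five sample values are made up for the example and are not taken from the customer dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations (numerator) and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values, for illustration only (not rows from the real dataset).
avg_num_products_purchased = [1, 2, 3, 5, 8]
avg_sale_amount = [120.0, 210.0, 260.0, 480.0, 700.0]

r = pearson_r(avg_num_products_purchased, avg_sale_amount)
print(f"Pearson r = {r:.4f}")
```

A value close to 1 indicates a strong positive linear relationship, which is how the 0.8558 reported above should be read.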

<img src="assets/images/scatterPlot.png" alt="IMAGE" title="TITLE" width="75%" align="left" >

The scatter plot shows the positive relationship between *Avg_Sale_Amount* and *Avg_Num_Products_Purchased* more clearly.

<a id='analysis_Question2'></a>
### Question 2

*Explain why you believe your linear model is a good model.*

<img src="assets/images/linearRegression_Summary.png" alt="IMAGE" title="TITLE" width="75%" align="left" >

First, categorical variables were converted to dummy variables rather than being assigned arbitrary numbers for each category. This prevents the model from inferring an erroneous relationship between the target variable and the categorical variable(s) from arbitrarily assigned values.
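
As a rough illustration of that conversion, the sketch below one-hot encodes a customer segment column in plain Python. The helper `to_dummies` is hypothetical, and the baseline label "Credit Card Only" is an assumption made for the example; only the three resulting column names come from this report. In a pandas workflow, `pandas.get_dummies` performs the same kind of encoding.

```python
def to_dummies(values, prefix, baseline):
    """One-hot encode `values`, omitting `baseline` as the reference category."""
    categories = sorted(set(values) - {baseline})
    rows = []
    for value in values:
        # Each remaining category gets its own 0/1 column.
        rows.append({
            f"{prefix}_{category.replace(' ', '_')}": int(value == category)
            for category in categories
        })
    return rows


# Hypothetical segment labels; "Credit Card Only" is an assumed baseline name.
segments = [
    "Loyalty Club Only",
    "Store Mailing List",
    "Credit Card Only",
    "Loyalty Club and Credit Card",
]

dummy_rows = to_dummies(segments, "Customer_Segment", "Credit Card Only")
for segment, row in zip(segments, dummy_rows):
    print(segment, row)
```

The baseline category is represented by all dummy columns being 0, so no ordering or magnitude is implied between segments.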


From there, a multiple linear regression analysis was conducted using the dummy variables. The Adjusted R-squared value is 0.837, indicating that the predictor variables (listed below) explain roughly 84% of the variance in *Avg_Sale_Amount* after adjusting for the number of predictors:

* Avg_Num_Products_Purchased
* Customer_Segment_Loyalty_Club_Only
* Customer_Segment_Loyalty_Club_and_Credit_Card
* Customer_Segment_Store_Mailing_List 
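
For context, Adjusted R-squared is the ordinary R-squared corrected for the number of predictors. In the usual definition below, $n$ is the number of observations and $p$ is the number of predictors; these symbols are introduced here for illustration and do not appear in the regression summary.

$$
R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
$$

With the four predictors listed above, $p = 4$.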

Note: The reported p-value is truncated after the third decimal place. Although I cannot determine whether the p-value is exactly zero, the output indicates that it is very small (below 0.001, and therefore also below 0.05), which is statistically significant.

<a id='analysis_Question3'></a>
### Question 3

*What is the best linear regression equation based on the available data? Each coefficient should have no more than 2 digits after the decimal (ex: 1.28).*

$$
\begin{aligned}
\text{Predicted Average Sale Amount} = {} & 303.46 \\
& + (66.98 \times \text{AvgNumProductsPurchased}) \\
& + (-149.36 \times \text{CustomerSegmentLoyaltyClubOnly}) \\
& + (281.84 \times \text{CustomerSegmentLoyaltyClubAndCreditCard}) \\
& + (-245.42 \times \text{CustomerSegmentStoreMailingList})
\end{aligned}
$$

<a id='presentationVisualization'></a>
## Presentation/Visualization  

<a id='visual_Question1'></a>
### Question 1

*What is your recommendation? Should the company send the catalog to these 250 customers?*

Yes. The company should send the catalog to the 250 new customers on its mailing list, because the expected profit exceeds the $10,000 threshold set by management.

<a id='visual_Question2'></a>
### Question 2

*How did you come up with your recommendation?*

First, I used my multiple linear regression model to calculate *Predicted_Average_Sale_Amount*, then *Predicted_Revenue*, and finally *Predicted_Profit*. I stored these values in corresponding columns of the dataset [df_mailingList_dummies](./assets/data/df_mailingList_dummies.csv). To find the overall value of each variable, I calculated the sum of its column. The overall values were then written to a CSV file titled [df_overallValues](./assets/data/df_overallValues.csv).

Further information on how each variable was calculated is listed below; the actual calculations reside in the [Data Wrangling](./Data%20Wrangling.ipynb#three) file.

* *Predicted_Average_Sale_Amount* is calculated by applying the *PredictedAverageSaleAmount* formula above: substitute the formula's variables with the corresponding column values for each of the 250 customers in the mailing list dataset, then sum the resulting *Predicted_Average_Sale_Amount* values. Example formula below:

    ```Python
    # Apply the regression equation to one customer's column values
    PredictedAverageSaleAmount = (
        303.46
        + (66.98 * AvgNumProductsPurchased)
        + (-149.36 * CustomerSegmentLoyaltyClubOnly)
        + (281.84 * CustomerSegmentLoyaltyClubAndCreditCard)
        + (-245.42 * CustomerSegmentStoreMailingList)
    )
    ```


* *Predicted_Revenue* is calculated by multiplying *Predicted_Average_Sale_Amount* by *Score_Yes* (the probability that a customer will respond and make a purchase) for each customer, then summing those values. Example formula below:

    ```Python
    Predicted_Revenue = Predicted_Average_Sale_Amount * Score_Yes
    ```
    

* *Predicted_Profit* is calculated by subtracting the catalog cost (given as $6.50) from the product of *Predicted_Revenue* and the average gross margin (given as 50%). Example formula below:

    ```Python
    Predicted_Profit = (0.5 * Predicted_Revenue) - 6.5
    ```
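
The three steps above can be sketched end to end as follows. This is a minimal illustration rather than the code from the Data Wrangling notebook: the coefficients, gross margin, and catalog cost come from this report, while the constant and function names and the two sample customers are hypothetical.

```python
# Coefficients from the regression equation in this report.
INTERCEPT = 303.46
COEFFICIENTS = {
    "Avg_Num_Products_Purchased": 66.98,
    "Customer_Segment_Loyalty_Club_Only": -149.36,
    "Customer_Segment_Loyalty_Club_and_Credit_Card": 281.84,
    "Customer_Segment_Store_Mailing_List": -245.42,
}
GROSS_MARGIN = 0.5  # average gross margin (given as 50%)
CATALOG_COST = 6.5  # cost per catalog (given as 6.50 dollars)


def predicted_average_sale_amount(customer):
    """Apply the regression equation to one customer's column values."""
    return INTERCEPT + sum(
        coefficient * customer[column]
        for column, coefficient in COEFFICIENTS.items()
    )


def predicted_profit(customer):
    """Margin on the expected revenue, minus the catalog cost, for one customer."""
    revenue = predicted_average_sale_amount(customer) * customer["Score_Yes"]
    return GROSS_MARGIN * revenue - CATALOG_COST


# Two hypothetical customers, for illustration only (not rows from the mailing list).
customers = [
    {"Avg_Num_Products_Purchased": 3, "Customer_Segment_Loyalty_Club_Only": 1,
     "Customer_Segment_Loyalty_Club_and_Credit_Card": 0,
     "Customer_Segment_Store_Mailing_List": 0, "Score_Yes": 0.30},
    {"Avg_Num_Products_Purchased": 5, "Customer_Segment_Loyalty_Club_Only": 0,
     "Customer_Segment_Loyalty_Club_and_Credit_Card": 0,
     "Customer_Segment_Store_Mailing_List": 1, "Score_Yes": 0.10},
]

total_profit = sum(predicted_profit(c) for c in customers)
print(f"Overall predicted profit: ${total_profit:,.2f}")
```

Summing `predicted_profit` over all 250 mailing-list rows, instead of the two sample rows, would give the overall figure that is compared against the threshold.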



<a id='visual_Question3'></a>
### Question 3

*What is the expected profit from the new catalog (assuming the catalog is sent to these 250 customers)?*

As shown in the image below, the *Overall Predicted Profit* is approximately `$21,987.96`. This is more than double the $10,000 threshold that management set for sending the catalog.

<img src="./assets/images/overall_values.png" alt="IMAGE" title="TITLE" width="50%" align="left" >