
**Background information in business context**

We are about to **release a new product**, Frozen Jamón y Queso Paella, in all Carrefour stores in Spain. The marketing team is currently devising a strategy to **market the new product to our 250,000 Loyalty Rewards Card holders**.

**Cost and profit information:**

Price: 20 Euros

COGS: 9 Euros

Unit Profit Margin: 11 Euros

Mail Advertisement Cost: 3 Euros

E-mail Advertisement Cost: 0 Euros
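
These figures already say a lot about when a blanket postal campaign can pay off. The sketch below works through the arithmetic, assuming each responding customer buys exactly one unit and that every contacted customer incurs the contact cost (both are simplifying assumptions, not stated above). The function name `campaign_profit` and the response rates are illustrative only.

```python
# Figures stated above (euros and customer count).
PRICE = 20.0
COGS = 9.0
MAIL_COST = 3.0
EMAIL_COST = 0.0
N_CARD_HOLDERS = 250_000

UNIT_MARGIN = PRICE - COGS  # 11 euros, matching the stated unit profit margin


def campaign_profit(response_rate, contact_cost, n_contacted=N_CARD_HOLDERS):
    """Profit of contacting n_contacted customers when a share
    response_rate of them buys one unit each (an assumption)."""
    return n_contacted * (response_rate * UNIT_MARGIN - contact_cost)


# Hypothetical response rates, chosen only to illustrate the arithmetic.
for rate in (0.10, 0.30):
    print(f"rate={rate:.0%}  mail: {campaign_profit(rate, MAIL_COST):>12,.0f}"
          f"  e-mail: {campaign_profit(rate, EMAIL_COST):>12,.0f}")

# Blanket mailing breaks even when response_rate * margin equals the cost.
mail_break_even_rate = MAIL_COST / UNIT_MARGIN
print(f"Mail break-even response rate: {mail_break_even_rate:.1%}")
```

Under these assumptions, postal mail is only worthwhile for customers whose chance of buying is above roughly 27%, while e-mail never loses money. This is why targeting matters for the mail channel.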

**First Step: Define our goal**

Choose the most efficient marketing strategy, that is, the one that maximizes profit.

**Second Step: Data Preparation**

Import our .csv sources and clean the data.

In [None]:
import datetime as dt
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
marketingpg = pd.read_csv('marketing_campaign.csv', sep = ';', encoding = 'utf8')

marketingpg.head(50)

Several columns in the original table are irrelevant to our question. After dropping rows with missing values, we remove these columns with `.drop`.

We delete `ID`, `Dt_Customer` (the date a customer became a loyal customer), `Complain`, `Z_CostContact` and `Z_Revenue`.

We also drop `Education`. We assume it is less important in this case: whatever degree our loyal customers hold, they still have to eat.


In [None]:
marketingpg.dropna(axis=0, how = 'any', inplace = True)

In [None]:
marketingpgmodi = marketingpg.drop(columns=['Education','ID','Dt_Customer','Complain','Z_CostContact','Z_Revenue'])

In [None]:
# Collapse marital status into two household types.
household_map = {
    'Together': 'Household',
    'Married': 'Household',
    'Single': 'One Person',
    'Divorced': 'One Person',
    'Widow': 'One Person',
}
marketingpgmodi['Marital_Status'] = marketingpgmodi['Marital_Status'].replace(household_map)

In [None]:
marketingpgmodi['Year_Birth'] = 2022 - marketingpgmodi['Year_Birth']
marketingpgmodi = marketingpgmodi.rename(columns = {'Year_Birth':'Age'})

In [None]:
# Sum the product-category spend columns into one total.
purchase_cols = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts']
marketingpgmodi['Total Purchase'] = marketingpgmodi[purchase_cols].sum(axis=1)

In [None]:
marketingpgmodi = marketingpgmodi.drop(columns=purchase_cols)

In [None]:
marketingpgmodi.head(50)

In [None]:
pd.get_dummies(marketingpgmodi.Marital_Status)[['One Person','Household']]

In [None]:
is_OnePerson = pd.get_dummies(marketingpgmodi.Marital_Status)['One Person']
is_OnePerson.head()

In [None]:
marketingpgmodi['Marital_Status'] = is_OnePerson

In [None]:
marketingpgmodi.head(50)

**Third Step: Create Model**

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='lbfgs')

In [None]:
X = marketingpgmodi.drop(['Response'], axis=1)
Y = marketingpgmodi['Response']

In [None]:
from sklearn.model_selection import train_test_split  

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.25, random_state=92)

In [None]:
model.fit(X_train, Y_train)

print(model.intercept_)
model.coef_

In [None]:
model.score(X_test,Y_test)
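
The features here sit on very different scales (income in the tens of thousands, age in tens), which can slow down the `lbfgs` solver. Whether that affected the fit above is not shown, so the following is only a sketch of one common remedy: standardizing the features inside a pipeline. It runs on synthetic data, and every variable name in it (`X_demo`, `y_demo`, `scaled_model`) is illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's features (illustrative only):
# one large-scale column (like Income) and one small-scale column (like Age).
rng = np.random.default_rng(0)
income = rng.normal(50_000, 20_000, size=1_000)
age = rng.integers(20, 80, size=1_000)
X_demo = np.column_stack([income, age])

# Hypothetical response: higher income makes a positive response more likely.
logit = (income - 50_000) / 20_000
y_demo = (rng.random(1_000) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=92, stratify=y_demo
)

# The scaler is fitted on the training split only, then reused on the test split.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(solver="lbfgs"))
scaled_model.fit(X_tr, y_tr)

test_accuracy = scaled_model.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Keeping the scaler inside the pipeline means the test data never influences the scaling parameters. With scaled inputs, the fitted coefficients are also easier to compare across features.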

**Fourth Step: Interpret our Model**

In [None]:
sns.boxplot(y="Age",x="Response",hue="Marital_Status",data= marketingpgmodi )
plt.ylim(0, 90)

From this plot, we can tell that the age distribution and marital status are not very informative for predicting a customer's purchase decision.

If we turn to the income and marital status of our loyal customers, income shows a clearer influence.

In [None]:
sns.boxplot(y="Income",x="Response",hue="Marital_Status",data= marketingpgmodi)
plt.ylim(0, 120000)

In [None]:
features = ['Age','Marital_Status','Income']
sns.pairplot(marketingpgmodi, vars = features, hue ='Response')

For our business problem, it is important to check the precision and recall of this model. We want the mail targeting to be as precise as possible, so that we don't waste marketing budget on customers who will not buy our new product.

Let's check the confusion matrix.

First, recall that the logistic regression model expresses the probability that a customer purchases our new product as a function of the input features.

In [None]:
probabilities = model.predict_proba(X_test)
probabilities[0]

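`predict_proba` returns two columns per customer: the probability of class 0 (no purchase) and the probability of class 1 (purchase). So the first customer in the test set has a probability of 0.99 of purchasing.
To make actual predictions, we need to fix a threshold on the probability of purchase, above which we predict that a customer purchases.
Let's fix this cut-off value to 1/2.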

In [None]:
prediction = probabilities[:,1]> 0.5
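
The 0.5 cut-off is a convention, not a requirement. With a mail cost of 3 euros and a unit margin of 11 euros, one alternative is to mail every customer whose predicted purchase probability exceeds 3/11. The sketch below compares the two cut-offs on made-up probabilities. It assumes one unit per buyer and well-calibrated probabilities, and the helper `expected_campaign_profit` is hypothetical.

```python
import numpy as np

# Made-up purchase probabilities for five customers (illustrative only);
# in the notebook these would come from probabilities[:, 1].
p_buy = np.array([0.05, 0.20, 0.30, 0.55, 0.90])

MAIL_COST = 3.0     # euros, from the cost table
UNIT_MARGIN = 11.0  # euros, from the cost table

default_cutoff = 0.5
break_even_cutoff = MAIL_COST / UNIT_MARGIN  # about 0.273

mail_default = p_buy > default_cutoff
mail_break_even = p_buy > break_even_cutoff


def expected_campaign_profit(p, selected):
    """Sum of expected profit over the selected customers."""
    return float(np.sum(p[selected] * UNIT_MARGIN - MAIL_COST))


profit_default = expected_campaign_profit(p_buy, mail_default)
profit_break_even = expected_campaign_profit(p_buy, mail_break_even)

print("Mailed at 0.5 cut-off:       ", int(mail_default.sum()))
print("Mailed at break-even cut-off:", int(mail_break_even.sum()))
print("Expected profit at 0.5:       ", round(profit_default, 2))
print("Expected profit at break-even:", round(profit_break_even, 2))
```

The lower cut-off mails one more customer, whose expected contribution is small but still positive. Whether this holds for the real model depends on how well its probabilities are calibrated.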

In [None]:
from sklearn import metrics

metrics.confusion_matrix(Y_test, prediction)

The confusion matrix above tells us where the 14% of classification errors occur. Note that `metrics.confusion_matrix` puts the true classes on the rows and the predicted classes on the columns, so the counts for "predicted to purchase" are read down the second column.

How many individuals are predicted to purchase?  **461+13 = 474**

How many individuals are correctly predicted to purchase?  **461**

How many individuals are wrongly predicted to purchase (false positives)? **13**

Dividing 461 by 474 gives a **precision** of **97%**.

Dividing 476 by 554 gives an **accuracy** of **86%**.
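
Reading the four cells by name avoids mixing up rows and columns. The sketch below uses made-up labels, not the notebook's results, to show how precision, recall and accuracy follow from the confusion matrix and how to cross-check them against scikit-learn's helpers. The `_demo` variable names are illustrative.

```python
import numpy as np
from sklearn import metrics

# Made-up labels and predictions (illustrative only, not the notebook's results).
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# scikit-learn puts true classes on rows and predicted classes on columns,
# so for binary labels ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = metrics.confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)                  # share of predicted buyers who actually buy
recall = tp / (tp + fn)                     # share of actual buyers the model finds
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are right

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")

# The library helpers give the same numbers.
assert np.isclose(precision, metrics.precision_score(y_true_demo, y_pred_demo))
assert np.isclose(recall, metrics.recall_score(y_true_demo, y_pred_demo))
assert np.isclose(accuracy, metrics.accuracy_score(y_true_demo, y_pred_demo))
```

In the notebook, the same calls would take `Y_test` and `prediction` as arguments.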

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(Y_test, probabilities[:,1])
plt.plot(fpr, tpr, label="Logistic regression")

plt.xlabel("False positive rate (fpr)")
plt.ylabel("True positive rate (tpr)")
plt.plot([0,1], [0,1], 'k--', label="Random")
plt.legend()
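
The ROC curve can be summarized in a single number, the area under the curve (AUC): 0.5 corresponds to the random diagonal and 1.0 to a perfect ranking. A minimal sketch on made-up scores (illustrative only, with `_demo` names that are not part of the notebook):

```python
import numpy as np
from sklearn import metrics

# Made-up labels and predicted scores (illustrative only).
y_true_demo = np.array([0, 0, 1, 1])
y_score_demo = np.array([0.1, 0.4, 0.35, 0.8])

# AUC computed from the curve points and directly from the scores.
fpr_demo, tpr_demo, _ = metrics.roc_curve(y_true_demo, y_score_demo)
auc_from_curve = metrics.auc(fpr_demo, tpr_demo)
auc_direct = metrics.roc_auc_score(y_true_demo, y_score_demo)

print(f"AUC from curve: {auc_from_curve:.2f}")
print(f"AUC direct:     {auc_direct:.2f}")
```

In the notebook, the equivalent call would be `metrics.roc_auc_score(Y_test, probabilities[:,1])`.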

**Fifth Step: Implement our model**