In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/marketing-data/marketing_data.csv')

df.head()

In [None]:
df.describe()

In [None]:
df.info()

### From the above, we know that our data has 2240 records in total and 28 columns.

# Section 1: Exploratory Data Analysis

### Are there any Null Values?

df.isnull() returns a dataframe of the same shape, with each element replaced by False if it is not empty and True if it is empty.

df.isnull().values returns that dataframe as a NumPy ndarray.

df.isnull().values.any() returns a single boolean value (True/False) indicating whether any null elements exist in the dataframe.

In [None]:
df.isnull().values.any()

From the result, we can conclude that our data has some null values. Now let's check the location of the null entries and the count.

This can be done using the any and sum functions.

df.isnull() returns a dataframe full of True/False values, on which we call the any function. Each column is checked separately, and the output is a series indexed by column name, with True/False indicating whether that column has a null value.


False boolean is equal to 0. True is equal to 1. Hence when we try to sum them, all the true(blank elements) will be counted.
The sum function counts the total for every column in the source data and returns the result. Since the source data(df.isnull()) is ones and zeroes, the sum returned is the total number of blank elements in each column.
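
As a small illustration of the three calls described above, here is a sketch on a made-up frame. The `toy` data is an assumption for demonstration only and is not taken from the marketing dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column 'b' has exactly one missing value.
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10.0, np.nan, 30.0]})

has_any_null = bool(toy.isnull().values.any())  # one True/False for the whole frame
null_by_column = toy.isnull().any()             # one True/False per column
null_counts = toy.isnull().sum()                # True counts as 1, so this counts nulls

print(has_any_null)
print(null_by_column)
print(null_counts)
```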

In [None]:
df.isnull().any()

### We can see that the null values are solely located in the Income Column. Now let's check the count

In [None]:
df.isnull().sum()

### There are 24 null values present in total.

### Now the next question is what do we do with missing data? We have two options:

1: Fill the missing values by replacing the blanks with the average of the column.

2: Simply drop the 24 records, as they make up only about 1.1% of the total data (24 of 2240 rows), which is not a significant share.
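
The two options can be sketched as follows. The `incomes` frame is a made-up example, not the real Income column, and is only meant to show what each option does to the data.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with one missing entry.
incomes = pd.DataFrame({'Income': [40000.0, 60000.0, np.nan, 80000.0]})

# Option 1: fill missing values with the column mean (the mean ignores NaN).
filled = incomes.fillna({'Income': incomes['Income'].mean()})

# Option 2: drop the rows where Income is missing.
dropped = incomes.dropna(subset=['Income'])

# Fraction of rows that would be lost by dropping.
missing_share = incomes['Income'].isnull().mean()

print(filled)
print(dropped)
print(missing_share)
```

Filling keeps every row but pulls the distribution toward the mean, while dropping keeps only observed values at the cost of a few rows.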

### Question: Are there any variables that warrant transformations?

The Income column has the dtype object, since its values include other symbols such as the dollar sign and commas. We need to process it and convert the column to an integer or float type.

### First let's convert Income to numeric using string functions. But before we do that, let's check whether all values in the Income column are actually strings. If there are any integer or float values, these values will turn to NaNs if we directly use string functions on them.

The Income column name is padded with spaces (' Income '). Let's rename the column to remove the spaces and avoid confusion later on.

In [None]:
df.columns

df.rename(columns={' Income ':'Income'}, inplace=True)

In [None]:
df['Income'].apply(type).value_counts()

We can see that there are 24 float values. So let's develop a custom function to replace the $ and comma signs only if the current value is a string.

In [None]:
def clean_currency(x):
    """If the current value is a string with dollar and comma signs, return the cleaned value.
    This can be later converted to a float or int as necessary."""

    if isinstance(x, str):
        # Strip the currency symbol and the thousands separators
        return x.replace('$', '').replace(',', '')
    return x

In the above cell we have defined our custom function for cleaning values, and we will be applying it to all the cells using the inbuilt apply function, which applies our function to all entries.

In [None]:
df['Income'] = df['Income'].apply(clean_currency).astype(float)

df['Income'].apply(type).value_counts()

In [None]:
df.dtypes

We have successfully converted the Income column to entirely numerical values. The column is no longer considered an 'object' column, but a float column instead. We can take a peek with the head function.

In [None]:
df['Income'].head()
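
As an alternative to the element-wise apply, the same cleaning can be sketched with pandas' vectorized string methods. The helper name `clean_currency_series` and the sample values are assumptions for illustration. Note that with the `.str` accessor any non-string entry comes out as NaN, which is harmless here only because the non-string entries are already missing values.

```python
import numpy as np
import pandas as pd

def clean_currency_series(values):
    """Strip '$', ',' and surrounding spaces, then convert to float.
    Non-string entries (such as NaN) come out as NaN."""
    stripped = values.str.replace(r'[$,]', '', regex=True).str.strip()
    return pd.to_numeric(stripped, errors='coerce')

# Hypothetical values written in the same style as the Income column.
raw = pd.Series(['$84,835.00 ', ' $57,091.00', np.nan], dtype=object)
cleaned = clean_currency_series(raw)
print(cleaned)
```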

### Now we'll create a separate dataframe copy for dropna

In [None]:
df_dropped = df.copy()

In [None]:
df_dropped.dropna(inplace = True)

In [None]:
df_dropped.isna().any()

As seen above, we have dropped all null values. Now let's boxplot all the numeric columns to see any possible outliers.

First let's retrieve the list of all numeric columns

In [None]:
int_cols = df_dropped.select_dtypes([np.int64, np.float64]).columns

print(int_cols)

In [None]:
list_cols = int_cols.tolist()

In [None]:
print(list_cols)
print(type(list_cols))

Since ID is not needed for Plotting, we can remove that from the list.

In [None]:
list_cols.remove('ID')

In [None]:
list_cols

### Question: Are there any outliers? How will you handle them?

First we'll plot the data to see possible outliers. Then we will drop the unnecessary rows.

In [None]:
df_dropped.boxplot(column=list_cols, rot = 90)

The Income column has outliers and it is causing the rest of our boxplot to zoom out. Let's study that column a bit

In [None]:
df_dropped.boxplot(column='Income', rot=90)

Let's find out how many entries have income above 120,000 dollars.

In [None]:
df_dropped[df_dropped['Income']>120000].count()

We will drop those 8 entries as it will not cause a significant data loss. We can do so, simply by only fetching the rows where our income value is under 120,000 dollars.

In [None]:
df2 = df_dropped[df_dropped['Income']<120000]

df2.head()

In [None]:
df2.boxplot(column=list_cols, rot =90)

The rest of the boxplot is still not visible clearly, so let's exclude the income column for now from our list

In [None]:
list_cols.remove('Income')
print(list_cols)

In [None]:
df2.boxplot(column=list_cols, rot =90)

As the data is still not visible clearly, let us study each column individually.

In [None]:
sns.boxplot(x='Year_Birth', data = df2)

We need to remove outlier entries with Birth Year before 1920.

In [None]:
df2 = df2[df2['Year_Birth']>1920]
sns.boxplot(x='Year_Birth', data = df2)

In [None]:
sns.boxplot(x = 'MntWines', data = df2)

Let's check the count of values in the MntWines column above 1400.

In [None]:
df2[df2['MntWines']>1400].count()

In [None]:
df2 = df2[df2['MntWines']<1400]
sns.boxplot(x='MntWines', data =df2)

In [None]:
sns.boxplot(x= 'MntFruits', data=df2)

A few isolated outliers would suggest that those records are exceptions compared with the rest of the data; if there were only up to 10 such entries, we could exclude them. However, here there are multiple outliers located close to each other, which may mean that a group of customers genuinely spent a larger amount on fruits.

### Not every outlier record needs to be deleted.

The same can be applied to other columns.
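
To inspect outliers without deleting them, one option is to flag them with the same 1.5 x IQR rule that boxplots use by default for their whiskers. The helper `iqr_outlier_mask` and the sample values below are hypothetical; this is a sketch, not the thresholds used elsewhere in this notebook, which were picked by eye.

```python
import pandas as pd

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical spending amounts with one extreme value.
spend = pd.Series([1, 2, 3, 4, 5, 100])
mask = iqr_outlier_mask(spend)
print(spend[mask])   # rows flagged for review, nothing is dropped
```

Counting `mask.sum()` per column gives a quick way to decide whether the flagged rows are a handful of exceptions or a real cluster of customers.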

In [None]:
sns.boxplot(x = 'MntMeatProducts', data = df2)

In [None]:
df2 = df2[df2['MntMeatProducts']<1250]
sns.boxplot(x = 'MntMeatProducts', data = df2)

In [None]:
sns.boxplot(x = 'MntFishProducts', data = df2)

In [None]:
sns.boxplot(x = 'MntSweetProducts', data = df2)

In [None]:
df2 = df2[df2['MntSweetProducts']<250]
sns.boxplot(x = 'MntSweetProducts', data = df2)

In [None]:
sns.boxplot(x='MntGoldProds', data =df2)

In [None]:
df2 = df2[df2['MntGoldProds']<250]
sns.boxplot(x='MntGoldProds', data =df2)

Since we have cleaned these columns, we can successfully remove all of them from our list_cols

In [None]:
list_cols

In [None]:
del list_cols[:10]

In [None]:
list_cols

In [None]:
df2.boxplot(column=list_cols, rot = 90)

The rest of the data as we can see is within the acceptable range. A person can have 15 deals or they can visit a website 20 times a month even if it is statistically an outlier value. Hence the rest of the columns will stay unchanged as well

### Question: Are there any useful variables that you can engineer with the given data?


Here we can engineer an Age column from the Year_Birth Column. Further on in the notebook, we also have to engineer a column for total purchases in the last 2 years per customer.

In [None]:
df2 = df2.copy()  # work on an explicit copy to avoid SettingWithCopyWarning
df2['Age'] = 2020 - df2['Year_Birth']  # age as of 2020

### Question: Do you notice any patterns or anomalies in the data? Can you plot them?

From the above plots, we have seen multiple anomalous records.

* Some customers have a birth year before 1920, which would make them roughly 100 years old or more, a very rare scenario.
* Some people had an income of more than 200,000 dollars.
* A couple of people might have been enthusiasts of a certain category and spent an extravagant amount on their favourite category of products.

### We can also look for correlations within the data

In [None]:
sns.heatmap(df2.select_dtypes('number').corr(), cmap='magma')  # correlate numeric columns only

From the above heatmap, we can conclude that many variables are correlated with each other, as seen by the multiple orange squares.

Most noticeably, Income is positively correlated with the amounts spent on Wines, Fruit, Meat, Fish, Sweets and Gold products, which makes sense.

# Section 2: Statistical Analysis


### Question: What factors are significantly related to the number of store purchases?

From the above heatmap, we can roughly estimate that the Number of Store Purchases values are strongly related to:

'Income','MntWines', 'MntFruits','MntMeatProducts', 'MntFishProducts', 'MntSweetProducts','MntGoldProds','NumWebPurchases','NumCatalogPurchases'
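
Reading strengths off a heatmap is approximate. A sketch of how the same information could be ranked numerically is shown below; the helper `rank_correlations` and the `demo` values are hypothetical and only reuse column names from this notebook.

```python
import pandas as pd

def rank_correlations(frame, target):
    """Pearson correlation of every numeric column with `target`,
    ordered by absolute strength (strongest first)."""
    numeric = frame.select_dtypes('number')
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Made-up values, only to show the shape of the output.
demo = pd.DataFrame({
    'NumStorePurchases': [1, 2, 3, 4],
    'MntWines': [2, 4, 6, 9],
    'NumWebVisitsMonth': [8, 7, 7, 5],
    'Recency': [1, 3, 2, 4],
})
ranked = rank_correlations(demo, 'NumStorePurchases')
print(ranked)
```

On the real data this would be called as `rank_correlations(df2, 'NumStorePurchases')`. A strong correlation still does not establish statistical significance or causation.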


Apart from that, we can also see that the number of store purchases is inversely related to the number of web visits. Possible explanations for this could be:

Scenario 1: An old fashioned customer visits a store to find an item that they want. Even if they don't find what they want, they may discover something that they like along the way. People tend to not leave a store empty handed out of courtesy or out of personal guilt. 


Scenario 2: A customer visits the webpage and does a quick search for what they like to purchase. If they don't find anything they like, they can simply close the webpage and carry on without feeling guilty.

### Question: Does US fare significantly better than the Rest of the World in terms of total purchases?

To answer this we must create a new column for Total Purchases which will be the total of purchases made through web, catalogue and store per customer. We will then group customers as per their country and sum their total purchases together.

In [None]:
df2['Total_Purchases'] = df2['NumWebPurchases']+df2['NumCatalogPurchases']+df2['NumStorePurchases']

In [None]:
df2.head()

In [None]:
df2['Country'].value_counts()

In [None]:
df_sum = df2.groupby('Country').sum(numeric_only=True)  # sum numeric columns only

In [None]:
df_sum.reset_index(inplace = True)

In [None]:
df_sum

In [None]:
sns.barplot(x='Country', y = 'Total_Purchases', data = df_sum)

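From the above graph we can see that the highest number of purchases came from Spain, followed by South Africa and Canada. Therefore, the answer to our question is No: the US does not lead in total purchases. Keep in mind that these totals depend heavily on how many customers each country has.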
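
Because country totals scale with the number of customers, a per-customer average is a fairer basis for comparing the US with the rest of the world. The sketch below assumes the US is coded as 'US' in the Country column (an assumption, check `df2['Country'].value_counts()`); the helper name and the `demo` values are hypothetical.

```python
import pandas as pd

def mean_purchases_us_vs_rest(frame, us_code='US'):
    """Mean and count of Total_Purchases for the US versus all other countries."""
    region = frame['Country'].eq(us_code).map({True: 'US', False: 'Rest of world'})
    return frame['Total_Purchases'].groupby(region).agg(['mean', 'count'])

# Made-up data: the rest of the world has a larger total but the same average.
demo = pd.DataFrame({
    'Country': ['US', 'US', 'SP', 'SP', 'SP', 'CA'],
    'Total_Purchases': [10, 20, 12, 14, 16, 18],
})
result = mean_purchases_us_vs_rest(demo)
print(result)
```

In this made-up example the summed purchases differ by a factor of two (30 versus 60), yet the average customer buys the same amount in both groups.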

### Question: Your supervisor insists that people who buy gold are more conservative. Therefore, people who spent an above average amount on gold in the last 2 years would have more in store purchases. Justify or refute this statement using an appropriate statistical test

In [None]:
df_gvs = df2[['MntGoldProds','NumStorePurchases']]

df_gvs.head()

In [None]:
sns.scatterplot(x = 'MntGoldProds', y='NumStorePurchases', data = df_gvs)

In [None]:
df_gvs.corr()

From the above two representations of the data, we can see that people who have spent a lot on gold products have a varied number of store purchases, ranging from 3 to 12. Therefore, gold purchases and store purchases do not appear to be strongly correlated. Note that a scatter plot and a correlation coefficient are descriptive; they are not a formal statistical test.
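
Since the question asks for a statistical test, one possible approach is a one-sided permutation test on the difference in mean store purchases between above-average and other gold spenders. This is a sketch of a technique that was not run in this notebook: the function name, the number of permutations, and the `demo` values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def permutation_test_mean_diff(high, low, n_permutations=10_000, seed=0):
    """One-sided permutation test for mean(high) > mean(low).
    Returns the observed difference and an estimated p-value."""
    rng = np.random.default_rng(seed)
    high = np.asarray(high, dtype=float)
    low = np.asarray(low, dtype=float)
    observed = high.mean() - low.mean()
    pooled = np.concatenate([high, low])
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random (in place)
        diff = pooled[:high.size].mean() - pooled[high.size:].mean()
        hits += diff >= observed
    # Add-one correction keeps the estimated p-value above zero.
    return observed, (hits + 1) / (n_permutations + 1)

# Synthetic stand-in for df2[['MntGoldProds', 'NumStorePurchases']].
demo = pd.DataFrame({
    'MntGoldProds':      [5, 10, 15, 20, 25, 30, 35, 40, 100, 120, 140, 160],
    'NumStorePurchases': [2, 3, 2, 4, 3, 2, 4, 3, 8, 9, 7, 10],
})
above_avg = demo['MntGoldProds'] > demo['MntGoldProds'].mean()
observed, p_value = permutation_test_mean_diff(
    demo.loc[above_avg, 'NumStorePurchases'],
    demo.loc[~above_avg, 'NumStorePurchases'],
)
print(observed, p_value)
```

A small p-value on the real data would support the supervisor's claim, while a large one would give no evidence for it. The synthetic data above was built to show a clear difference and says nothing about the actual customers.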

### Question: Fish has Omega 3 fatty acids which are good for the brain. Accordingly, do "Married PhD candidates" have a significant relation with amount spent on fish? What other factors are significantly related to amount spent on fish? (Hint: use your knowledge of interaction variables/effects)

In [None]:
df2['Education'].value_counts()

In [None]:
df2['Marital_Status'].value_counts()

In [None]:
df_mar_phd = df2[(df2['Marital_Status'] == 'Married') & (df2['Education']=='PhD')]

df_mar_phd.head()

In [None]:
df_unm_nophd = df2[(df2['Marital_Status'] != 'Married') & (df2['Education'] != 'PhD')]

df_unm_nophd.head()

In [None]:
f, axes = plt.subplots(1,2, figsize = (10,5))

sns.boxplot(y='MntFishProducts', data = df_mar_phd, ax = axes[0])
axes[0].set_title('Married with PhD')

sns.boxplot(y='MntFishProducts', data = df_unm_nophd, ax = axes[1])
axes[1].set_title('Unmarried without PhD')

As seen from the above subplots, married customers with PhDs spend less on fish products on average than unmarried customers without a PhD.
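
The comparison above leaves out married customers without a PhD and unmarried customers with one. To look at the interaction of the two factors, all four combinations can be summarised side by side. The helper below is a hypothetical sketch; the `demo` rows are made up and only the labels 'Married' and 'PhD' come from this notebook.

```python
import pandas as pd

def fish_spend_by_group(frame):
    """Mean fish spending and group size for each combination of
    married / not married and PhD / no PhD."""
    married = frame['Marital_Status'].eq('Married').map(
        {True: 'Married', False: 'Not married'})
    phd = frame['Education'].eq('PhD').map({True: 'PhD', False: 'No PhD'})
    return frame['MntFishProducts'].groupby([married, phd]).agg(['mean', 'count'])

# Made-up rows for illustration only.
demo = pd.DataFrame({
    'Marital_Status': ['Married', 'Married', 'Single', 'Single', 'Married', 'Divorced'],
    'Education': ['PhD', 'Master', 'PhD', 'Master', 'PhD', 'Master'],
    'MntFishProducts': [10, 40, 30, 50, 20, 60],
})
summary = fish_spend_by_group(demo)
print(summary)
```

If the 'Married' and 'PhD' cell differs from what the two separate effects would predict, that points to an interaction effect, which could then be checked with a regression that includes the product of the two flags.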

In [None]:
df_fish_data = df2[['MntFishProducts','Income','MntWines','MntFruits','MntMeatProducts','MntSweetProducts','MntGoldProds']]

sns.heatmap(df_fish_data.corr(), cmap='magma')

The Income of the person in general affects the spending of the person on all kinds of products. But the effect is stronger in case of meat and wines as both of these products can be expensive. 

Another thing to note is that, people can be vegetarians or non-vegetarians. Hence people who spend on meat products are the ones who will more likely spend on fish products as well.

### Question: Is there a significant relationship between geographical region and success of a campaign?

For this we are going to use a previously created dataset, df_sum

In [None]:
df_sum

In [None]:
df_bar = df2.groupby('Country').sum(numeric_only=True)  # sum numeric columns only
df_bar

In [None]:
bar_col_list = ['AcceptedCmp1','AcceptedCmp2','AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5','Response']
df_bar = df_bar[bar_col_list]

df_bar.plot(kind='bar')

As seen in the above graph, by far the largest number of positive responses to the latest campaign came from Spain. The customers who responded positively to previous campaigns were also mostly from Spain.

Hence, customers living in Spain have shown a better response trend than customers from any other country. Note that these are raw counts, so they also reflect the fact that most customers are from Spain.

The second and third best responses historically came from South Africa and Canada respectively.
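
Raw acceptance counts favour countries with many customers. A sketch of how acceptance rates per country could be computed instead is shown below; since the campaign columns are 0/1 flags, the group mean is the share of customers who accepted. The helper name and the `demo` rows are hypothetical.

```python
import pandas as pd

def acceptance_rate_by_country(frame, campaign_cols):
    """Share of customers per country who accepted each campaign,
    plus the number of customers in that country."""
    grouped = frame.groupby('Country')[campaign_cols]
    rates = grouped.mean()
    rates['n_customers'] = grouped.size()
    return rates

# Made-up rows: both countries have one acceptance, but very different rates.
demo = pd.DataFrame({
    'Country': ['SP', 'SP', 'SP', 'SP', 'CA', 'CA'],
    'AcceptedCmp1': [0, 1, 0, 0, 0, 0],
    'Response': [1, 0, 0, 0, 1, 0],
})
rates = acceptance_rate_by_country(demo, ['AcceptedCmp1', 'Response'])
print(rates)
```

On the real data this would be called with `bar_col_list` as `campaign_cols`. Countries with only a handful of customers will show noisy rates, so `n_customers` should be read alongside them.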

# Section 3: Data Visualization

### Question: Which marketing campaign is most successful?

In [None]:
cmp1_total = df2['AcceptedCmp1'].sum()
cmp2_total = df2['AcceptedCmp2'].sum()
cmp3_total = df2['AcceptedCmp3'].sum()
cmp4_total = df2['AcceptedCmp4'].sum()
cmp5_total = df2['AcceptedCmp5'].sum()
resp_total = df2['Response'].sum()

Above, we have calculated the total number of positive responses to the 5 earlier campaigns and the most recent one.

In [None]:
list_campaign_res = []
list_campaign_res.extend([cmp1_total,cmp2_total,cmp3_total,cmp4_total,cmp5_total,resp_total])

list_campaign_title = ['cmp1_total','cmp2_total','cmp3_total','cmp4_total','cmp5_total','resp_total']

list_campaign_res

In [None]:
fig, ax = plt.subplots()
ax.barh(list_campaign_title, list_campaign_res)  # draw on the axes created above

The latest campaign was the most successful one of the bunch. Among the earlier campaigns, Campaign 3 was the most successful, followed by Campaign 4, then Campaign 5 and finally Campaign 1.

Campaign 2 can be seen as a failure.

### Question: What does the average customer look like for this company?

To find the answer for this question, let's start with looking at our columns once again.

In [None]:
df2.head()

We have to review many aspects of this customer like their Age, Education, Income, Marital Status, whether they have children at home, and their Country. 

Let's find out the average age for our customers. Since age is a numeric value that can fit within ranges, we will be using the Histogram to find out the range within which our customers fit.

In [None]:
sns.histplot(df2['Age'], bins=10)  # histplot replaces the deprecated distplot
plt.grid(True)

From the above distribution graph, we can see that a lot of our customers are 40-50 years old.

Since Education is a categorical value, we will use a countplot to count the number of entries for each category.

In [None]:
sns.countplot(x='Education', data = df2)

From the above chart, we can conclude that most of the customers are Graduates.

Next up is Income, which is a numeric value again, hence a histogram will show us the range distribution of Income

In [None]:
sns.histplot(df2['Income'], bins=10)  # histplot replaces the deprecated distplot
plt.grid(True)

From the above representation, it is clear that most customers earn somewhere between \$20,000 and \$80,000.

Marital Status is again a categorical value, so we'll be using a count plot.

In [None]:
sns.countplot(x='Marital_Status', data = df2)

Most of our customers are married.

Let's check how many kids/teens our customers have on average.

In [None]:
fig, axes = plt.subplots(1,2)
fig.suptitle('Analysis of Children Count')

# histplot replaces the deprecated distplot; discrete=True gives one bar per count
sns.histplot(df2['Kidhome'], discrete=True, ax=axes[0])
sns.histplot(df2['Teenhome'], discrete=True, ax=axes[1])

In [None]:
df2['Children_Count'] = df2['Kidhome']+df2['Teenhome']

sns.histplot(df2['Children_Count'], discrete=True)  # histplot replaces the deprecated distplot

We can see that most customers either have no kids/teens or 1 kid/teen in their family.

Finally, let's analyse the customer distribution as per their Country

In [None]:
sns.countplot(x='Country', data=df2)

Most of our customers are from Spain.

### To summarize our above findings, our average customer's profile looks like the one below:

* Is 40-50 years old
* Married
* Has 1 kid/teen in their family
* Graduate
* Earns somewhere between \$20,000 and \$80,000
* From Spain
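
The profile above was read off the plots. As a cross-check, medians of numeric columns and modes of categorical columns can be computed directly. The helper `typical_customer` and the `demo` rows are hypothetical and only reuse column names from this notebook.

```python
import pandas as pd

def typical_customer(frame, numeric_cols, categorical_cols):
    """Median of each numeric column and most frequent value of each
    categorical column, returned as a plain dict."""
    profile = {col: frame[col].median() for col in numeric_cols}
    profile.update({col: frame[col].mode().iloc[0] for col in categorical_cols})
    return profile

# Made-up rows for illustration only.
demo = pd.DataFrame({
    'Age': [40, 45, 50, 62],
    'Income': [30000.0, 50000.0, 52000.0, 90000.0],
    'Marital_Status': ['Married', 'Married', 'Single', 'Together'],
    'Country': ['SP', 'SP', 'CA', 'SP'],
})
profile = typical_customer(demo, ['Age', 'Income'], ['Marital_Status', 'Country'])
print(profile)
```

The median is used rather than the mean so that a few very high incomes or ages do not shift the "typical" value.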

### Question: Which products are performing best?

In [None]:
df2.columns

In [None]:
df_spending = df2[['MntWines', 'MntFruits','MntMeatProducts', 'MntFishProducts', 'MntSweetProducts','MntGoldProds']]

In [None]:
list_spending = []

for i in df_spending.columns:
    sum_spending = df2[i].sum()
    list_spending.append(sum_spending)
    
print(list_spending)

In [None]:
plot = sns.barplot(x = df_spending.columns, y= list_spending)

plt.xticks(rotation = 50)

If we sum up the spending on each type of product, we can see that the most amount of money is spent on Wine and Meat products, which are popular dinner combo items.

### Question: Which channels are underperforming?

To answer this question we will have to look at the number of purchases made through each channel.

In [None]:
df.columns

In [None]:
df_channels = df2[['NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases']]
list_channel = []

for i in df_channels.columns:
    sum_channel = df_channels[i].sum()  # use the cleaned data (df2), not the raw df
    list_channel.append(sum_channel)
    
list_channel

In [None]:
sns.barplot(x = df_channels.columns, y=list_channel)

The most underperforming channel is catalogue purchasing. This could be because catalogue purchasing is quite old-fashioned today: older generations prefer going to the store to purchase items, while newer generations prefer to browse and shop online. Hence the most popular mode would be either in store or via the web.

Previously, we have seen that our average consumer is 40-50 years old. Hence it is normal to expect that this customer segment is not reliant on the internet for shopping. This can be seen through the above graph, as the maximum number of purchases are made in store.

In [None]:
df2.columns

In [None]:
df_children = df2[['Kidhome','Teenhome','NumWebPurchases','NumWebVisitsMonth','NumStorePurchases','NumCatalogPurchases']]

In [None]:
sns.heatmap(df_children.corr(), cmap='magma')

In [None]:
df_kidhome = df_children.groupby('Kidhome').sum()

df_kidhome

In [None]:
df_teenhome = df_children.groupby('Teenhome').sum()

df_teenhome

In [None]:
df_kidhome.plot(kind='bar')

Here we can see that, when people have children, all kinds of purchases are reduced. This can be mainly due to the fact that raising children can cost a lot of money early on. Hence people compensate by reducing other extra expenses or by reducing the amount they splurge on luxury items.

In [None]:
df_teenhome.plot(kind='bar')

In contrast, once the children are teenagers, the number of teens at home does not affect purchases by a big margin.


We can also observe that when there is a teenager in the house, the number of web purchases increases slightly, whereas the number of store and catalogue purchases decreases slightly.

# Summary


## Section 1: Exploratory Data Analysis

We had null values as well as outliers in our data. We deemed it fit to drop the rows containing null values and outliers, as they only formed a small fraction of the total data.

The income field was stored as a text value. We removed the unnecessary characters, such as the \$ sign and commas, and converted the result to a float.

We engineered the total purchases column as well as the Age column from the data provided.

We noticed anomalies as well as certain patterns in data. Anomalies were mostly handled by dropping the anomalous records as they comprised only a small fraction of the complete dataset.

## Section 2: Statistical Analysis

The number of store purchases is weakly related to the number of children in the house. If a customer has kids, their overall spending is reduced, which also reduces the number of store purchases. If there are teens in the house, the number of store purchases is reduced slightly. Another factor that is strongly related is Income: higher-income customers have more spending power, which shows up as larger amounts spent across product categories.

We saw that almost half of our customer base was from Spain, while the remainder was from South Africa, Canada and other countries.

We observed that people purchase products through stores regardless of their purchases on Gold Products.

We also observed that married customers with PhDs do not spend more on fish products; in our boxplots they spent less on average than unmarried customers without a PhD. The factors that appear to be related to spending on fish are the customer's income and their spending on meat, which may reflect whether they are vegetarian: people who spend on meat are likely to also spend on fish, whereas people who do not consume meat of any form are unlikely to consume fish either.

Finally, we learnt that geographical region does matter when it comes to the success of a campaign. Since half of our customer base is from Spain, all of the campaigns have been exceedingly popular in Spain while not having as much of an impact in the other countries.

## Section 3: Data Visualization

The most recent campaign was the most successful campaign so far.

Our average consumer is 40-50 years old and married with 1 kid/teen. They're a graduate and earn between 20 thousand and 80 thousand dollars. They live in Spain.

If success is measured by the amount spent on a product category, then Wine has been the most successful product, followed by Meat products.

People do not use the catalogue for purchasing products as much, and hence the catalogue channel is unsuccessful compared to store or web purchases.

# Advice to the CMO

Based on the profile of our average customer and their purchasing habits, most purchases are still made in store, but it is advisable to pay more attention to the web, which already outperforms the catalogue channel. This is also better going forward, as newer generations will be more reliant on technology. It was observed that customers with a teen in their home had a higher number of purchases through the web in total.

Since most of our customers are from Spain, it might be advisable to develop new marketing strategies that will appeal to the English-speaking customers from Canada or South Africa, as they are the next largest groups.

The recent campaign was the most successful campaign, so its formula can be repeated with minor variations to hopefully obtain similar results in the future.