# Challenges Week 4

Now that you have some experience with pandas and data exploration in Python, it's time for you to combine and apply this week's knowledge. You will start working on these challenges in the tutorial and will be asked to complete them by the end of the week.

In each challenge, you are asked to provide the programming solution to it as well as a technical interpretation explaining the steps taken and the result.

In this week's challenges, we will use the dataset `marketing_campaign.csv`. A full description of the dataset is available below. 

Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviors and concerns of different types of customers.

Customer personality analysis helps a business to modify its product based on its target customers from different types of customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market the product only on that particular segment.

**People**
<br>
customer_id: Customer's unique identifier
<br>
Year_Birth: Customer's birth year
<br>
Education: Customer's education level
<br>
Marital_Status: Customer's marital status
<br>
Income: Customer's yearly household income
<br>

**Products**
<br>
MntWines: Amount spent on wine in last 2 years
<br>
MntFruits: Amount spent on fruits in last 2 years
<br>
MntMeatProducts: Amount spent on meat in last 2 years
<br>
MntFishProducts: Amount spent on fish in last 2 years
<br>

**Place**
<br>
NumWebPurchases: Number of purchases made through the company’s web site
<br>
NumCatalogPurchases: Number of purchases made using a catalogue
<br>
NumStorePurchases: Number of purchases made directly in stores
<br>
NumWebVisitsMonth: Number of visits to company’s web site in the last month


Original dataset source: https://www.kaggle.com/imakash3011/customer-personality-analysis

# Challenge 1

In the first challenge, we will focus on exploring and describing the data.

#### 1.1 Read and explore the dataset

* Load the dataset into pandas. *Hint: you will need to specify the column separator to properly read the dataset.* 

* Explore the dataset: print the first few rows, check all column types and check for missing values. Decide what to do with missing values and explain your decision.

#### 1.2 Filter the dataset

* Create a smaller dataset consisting of households that have above-average spending on meat *and* fish.

* Who are the meat and fish eaters? Describe them using demographic information available to you.

In [1]:
import pandas as pd
#I make sure to specify the separator, otherwise the dataset cannot be worked with
marketing = pd.read_csv("marketing_campaign.csv", sep=';')


In [2]:
#I check for column types
marketing.dtypes

customer_id              int64
Year_Birth               int64
Education               object
Marital_Status          object
Income                 float64
MntWines                 int64
MntFruits                int64
MntMeatProducts          int64
MntFishProducts          int64
NumWebPurchases          int64
NumCatalogPurchases      int64
NumStorePurchases        int64
NumWebVisitsMonth        int64
dtype: object

In [3]:
#I check for NAs
marketing.isna().sum()

customer_id             0
Year_Birth              0
Education               2
Marital_Status          0
Income                 24
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
dtype: int64

I have read the CSV file into pandas and explored it. Most of the columns are numeric (ints or floats) and two columns (`Education` and `Income`) contain missing values. For now, I will leave the missing values in the dataset. They do not appear to carry a meaning of their own (it might be that some customers did not provide the information), and pandas aggregations such as `.mean()` skip them by default.
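The exploration steps described above can be sketched on a small in-memory sample. The three rows below are hypothetical and only mimic the structure of `marketing_campaign.csv` (semicolon-separated, with gaps in `Education` and `Income`). They are not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical miniature of the file: semicolon-separated, with one empty
# Education field and one empty Income field.
raw = io.StringIO(
    "customer_id;Year_Birth;Education;Income\n"
    "1;1970;Graduation;52000\n"
    "2;1985;;61000\n"
    "3;1962;PhD;\n"
)

# Without sep=';' every row would end up in a single column.
sample = pd.read_csv(raw, sep=";")

print(sample.head())            # first few rows
print(sample.dtypes)            # column types
missing = sample.isna().sum()   # number of missing values per column
print(missing)
```

Empty fields are read as `NaN`, which is why `Income` becomes a float column even though the values look like integers.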

In [5]:
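#To create a smaller dataset, I build one condition per column and combine them with &
above_avg_meat = marketing['MntMeatProducts'] > marketing['MntMeatProducts'].mean()
above_avg_fish = marketing['MntFishProducts'] > marketing['MntFishProducts'].mean()
marketing_smaller = marketing[above_avg_meat & above_avg_fish]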


In [7]:
len(marketing), len(marketing_smaller)

(2240, 509)

As both columns (`MntMeatProducts` and `MntFishProducts`) are numeric (they contain integers), I can do mathematical operations on them. This means that I can calculate the mean of each column with the method `.mean()` and compare every value in the column to that mean. I only keep rows that meet both conditions (note the use of `&`). This reduced the dataset from 2240 to 509 rows.
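To describe the meat and fish eaters, one option is to compare the filtered rows against the full dataset on the demographic columns. The sketch below uses a hypothetical four-row frame with the same column names, so its numbers say nothing about the real customers.

```python
import pandas as pd

# Hypothetical toy data using the column names from the dataset.
toy = pd.DataFrame({
    "Education": ["PhD", "Graduation", "Graduation", "Master"],
    "Income": [80000.0, 30000.0, 75000.0, 40000.0],
    "MntMeatProducts": [400, 20, 350, 30],
    "MntFishProducts": [120, 5, 90, 100],
})

# One boolean mask per condition; & keeps rows where both are True.
above_meat = toy["MntMeatProducts"] > toy["MntMeatProducts"].mean()
above_fish = toy["MntFishProducts"] > toy["MntFishProducts"].mean()
eaters = toy[above_meat & above_fish]

# Compare the subgroup with everyone on the demographic columns.
print(eaters["Education"].value_counts())
print("mean income, eaters:", eaters["Income"].mean())
print("mean income, all:   ", toy["Income"].mean())
```

The same comparison could be run on `marketing_smaller` and `marketing` for `Year_Birth`, `Education`, `Marital_Status` and `Income`.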

# Challenge 2

The dataset `marketing_campaign_response.csv` contains an additional observation - whether a customer did or did not respond to the marketing campaign (a binary variable with values 0 and 1). The variable is coded 1 if a customer reacted to the campaign positively, 0 otherwise. Additionally, it contains the ID of each customer exposed to the campaign (this ID corresponds to customer_id in the other dataset). Not all customers of the company were exposed to the campaign.

#### 2.1 How to choose the best merge strategy?

Merge the `marketing_campaign.csv` dataset with `marketing_campaign_response.csv`. Explore all four merge types and their consequences (dataset length and missing values). Why do they result in different dataset lengths? Which merge is most useful? <br>  **Note: the identifier column names differ across the two datasets.**  <br>

#### 2.2 Merge datasets

Use the appropriate join (left, right, inner, outer) to merge `marketing_campaign.csv` with `marketing_campaign_response.csv`, such that only customers exposed to the marketing campaign are included.

In [8]:
#As before, having tried to load the dataset without a separator argument, I see that the columns are not formatted well
#I input the appropriate sep argument
responses = pd.read_csv("marketing_campaign_response.csv", sep =';')

#I check the column names before I merge
responses.columns

Index(['ID', 'Response'], dtype='object')

In [14]:
len(marketing), len(responses)

(2240, 2174)

It looks like the dataset with responses is smaller than the dataset with all customers. It makes sense - not all customers have been exposed to the marketing campaign. 
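Before merging, it can help to check how the two identifier columns relate to each other. This is a minimal sketch on hypothetical toy frames (the names `customers` and `exposed` are made up for illustration); the same checks could be run on `marketing` and `responses`.

```python
import pandas as pd

# Hypothetical toy frames: four customers, three of them exposed to the campaign.
customers = pd.DataFrame({"customer_id": [1, 2, 3, 4]})
exposed = pd.DataFrame({"ID": [2, 3, 4], "Response": [0, 1, 0]})

# Is every ID listed only once? Duplicates would multiply rows in a merge.
ids_unique = exposed["ID"].is_unique

# Does every exposed ID exist in the customer table?
all_known = exposed["ID"].isin(customers["customer_id"]).all()

# How many customers were never exposed to the campaign?
not_exposed = (~customers["customer_id"].isin(exposed["ID"])).sum()

print(ids_unique, all_known, not_exposed)
```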

In [15]:
#In the responses, the id column is called ID, so I merge left_on ID and right_on customer_id (for marketing)
outer_data = responses.merge(marketing, how='outer', left_on="ID", right_on="customer_id")

In [16]:
inner_data = responses.merge(marketing, how='inner', left_on="ID", right_on="customer_id")

In [17]:
left_data = responses.merge(marketing, how='left', left_on="ID", right_on="customer_id")

In [18]:
right_data = responses.merge(marketing, how='right', left_on="ID", right_on="customer_id")

In [19]:
len(outer_data), len(inner_data), len(left_data), len(right_data)

(2240, 2174, 2174, 2240)

I have now tried the four different merges and they result in different lengths. The outer merge takes all the data in both dataframes and whenever it does not find a match between them, it generates missing values. You can see below that we are therefore missing the marketing response for customers who were not exposed to the message.

In [20]:
outer_data.isna().sum()

ID                     66
Response               67
customer_id             0
Year_Birth              0
Education               2
Marital_Status          0
Income                 24
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
dtype: int64

The inner merge only includes rows that are in both dataframes. Its length corresponds to the length of the responses dataframe - we have "lost" users who were not exposed to the message.

The right merge "takes the right dataframe (marketing)" as a starting point and adds matching information from the responses dataframe. That means that all rows from the marketing dataframe are kept and whenever a row does not have matching information in the responses dataframe, a missing value is created (see below).

In [21]:
right_data.isna().sum()

ID                     66
Response               67
customer_id             0
Year_Birth              0
Education               2
Marital_Status          0
Income                 24
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
dtype: int64

The left merge "takes the left dataframe (responses)" as a starting point and adds matching information from the marketing dataframe. That means that all rows from the responses dataframe are kept and whenever a row does not have matching information in the marketing dataframe, a missing value is created. This also means that whenever there is a row in marketing that did not match a row in responses, this row is omitted. Hence, the dataframe is smaller than the marketing dataframe, but corresponds in size to the responses dataframe.

To keep only customers exposed to the campaign, I will choose the inner merge - this way, I make sure that only rows that are both in marketing dataframe and in responses dataframe are included.
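The behaviour of the four merge types can be reproduced on a small example. The frames below are hypothetical (four customers, three of them exposed) and the names `customers`, `exposed` and `flagged` are made up for illustration. The `indicator=True` argument adds a `_merge` column that shows where each row came from.

```python
import pandas as pd

# Hypothetical toy frames mirroring the structure of the two datasets.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "Year_Birth": [1970, 1985, 1962, 1990],
})
exposed = pd.DataFrame({"ID": [2, 3, 4], "Response": [0, 1, 0]})

# Length of the result for each merge type.
lengths = {
    how: len(exposed.merge(customers, how=how, left_on="ID", right_on="customer_id"))
    for how in ["outer", "inner", "left", "right"]
}
print(lengths)

# indicator=True labels each row as 'both', 'left_only' or 'right_only'.
flagged = exposed.merge(
    customers, how="outer", left_on="ID", right_on="customer_id", indicator=True
)
print(flagged["_merge"].value_counts())
```

On this toy data the outer and right merges keep all four customers, while the inner and left merges keep only the three exposed ones, which is the same pattern as in the lengths above.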

# Challenge 3

#### 3.1 Create new columns

Create a new column called Age which denotes age in years for customers from the merged dataset. What is the average age of customers who reacted to the campaign positively?

#### 3.2 Writing pandas 
Write a csv file with the dataset you ended up with. 

In [24]:
marketing['Age'] = 2023 - marketing['Year_Birth']

To create an age variable, I can do a mathematical operation by subtracting one's year of birth (value in the column `Year_Birth`) from the current year. This operation creates a new column `Age`.
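The challenge asks for the age in the merged dataset and for the average age of customers who responded positively. A minimal sketch of that step, assuming a merged frame with `Response` and `Year_Birth` columns, is shown here on hypothetical toy data. The same lines could be applied to `inner_data`, and the number printed here is not the answer for the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the merged dataset (e.g. inner_data).
merged = pd.DataFrame({
    "ID": [2, 3, 4, 5],
    "Response": [0, 1, 0, 1],
    "Year_Birth": [1985, 1962, 1990, 1973],
})

REFERENCE_YEAR = 2023  # same reference year as in the cell above

# New column: age in years.
merged["Age"] = REFERENCE_YEAR - merged["Year_Birth"]

# Select rows with a positive response, then average their age.
mean_age_positive = merged.loc[merged["Response"] == 1, "Age"].mean()
print(mean_age_positive)
```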

In [25]:
marketing['Age'].describe()

count    2240.000000
mean       54.194196
std        11.984069
min        27.000000
25%        46.000000
50%        53.000000
75%        64.000000
max       130.000000
Name: Age, dtype: float64
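The maximum age of 130 suggests that some birth years are implausible. One way to inspect such rows is sketched below on hypothetical toy data. The cutoff of 100 years is an assumption chosen for illustration, not a rule from the dataset description.

```python
import pandas as pd

# Hypothetical toy data with one implausibly early birth year.
ages = pd.DataFrame({"Year_Birth": [1893, 1970, 1996]})
ages["Age"] = 2023 - ages["Year_Birth"]

MAX_PLAUSIBLE_AGE = 100  # assumed cutoff, adjust as needed

# Rows above the cutoff are candidates for a closer look or for exclusion.
implausible = ages[ages["Age"] > MAX_PLAUSIBLE_AGE]
print(implausible)
```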

In [26]:
marketing.to_csv('marketing_final.csv')
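By default, `to_csv` also writes the row index as an unnamed first column. If that is not wanted, `index=False` leaves it out. The sketch below shows this on a hypothetical two-row frame written to a temporary folder. To save the merged dataset instead of `marketing`, the same call could be made on the merged dataframe.

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for the final dataset.
final = pd.DataFrame({"customer_id": [1, 2], "Age": [53, 38]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "marketing_final.csv")
    # index=False: do not write the row index as an extra column.
    final.to_csv(path, index=False)
    # Read the file back to check what was written.
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
```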