# Challenge 4YFN - MWC Barcelona 2022


## Background

Nuwefruit is a startup that wants to revolutionize people's everyday fruit-intake habits. To that end, the company focuses on home delivery and, thanks to its last-mile optimization algorithm, has very low logistics costs. This lets Nuwefruit sell fruit at lower prices than its competitors. Its catalog consists of more than 20 different types of fruit, the ones with the best nutritional properties.

## Overview: the dataset and challenge

We will use two datasets:
 - Nuwefruit customer data
 - Order data from the customers
 
Customer data is in 'CLIENT TABLE', with these variables:

**CLIENT ID**: Unique customer id  
**CLIENT_SEGMENT**: Client segment  
**AVG CONSO**: Mean monthly consumption, calculated at the end of 2020 (pieces of fruit)  
**AVG BASKET SIZE**: Mean basket size, calculated at the end of 2020 (pieces of fruit)  
**RECEIVED_COMMUNICATION**: 1 = received a promotion / 0 = didn't receive a promotion

The 'CLIENT TABLE' dataset is [here](https://challenges-asset-files.s3.us-east-2.amazonaws.com/data_sets/Data-Science/4+-+events/mwc22/mwc22-client_table.csv)


Client orders are in 'ORDERS TABLE', with these variables:

**CLIENT ID**: Unique customer id  
**NB PRODS**: Number of 'prods' of the type of fruit (1 prod = 10 fruit pieces)  
**ORDER ID**: Unique order id  
**FRUIT_PRODUCT**: Type of fruit  

The 'ORDERS TABLE' dataset is [here](https://challenges-asset-files.s3.us-east-2.amazonaws.com/data_sets/Data-Science/4+-+events/mwc22/mwc22-orders_table.csv)

## Goals

- Perform an EDA that lets you:
    - Analyze sales and customer activity
    - Evaluate the impact of the promotion
- Build a predictive model that lets us determine the customer segment from the predictor variables in [test_x](https://challenges-asset-files.s3.us-east-2.amazonaws.com/data_sets/Data-Science/4+-+events/mwc22/mwc22-client_table+-+test_x.csv). (We must predict the CLIENT_SEGMENT variable.)
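
As a rough illustration of the "evaluate promotion impact" goal, the sketch below joins per-client order totals onto the client table and compares the two `RECEIVED_COMMUNICATION` groups. The `clients_toy` and `orders_toy` frames and their values are made up for illustration; only the column names come from the challenge description. A raw difference in group means is descriptive only and does not by itself show that the promotion caused it.

```python
import pandas as pd

# Hypothetical toy stand-ins for the two tables (values are invented).
clients_toy = pd.DataFrame({
    "CLIENT ID": [1, 2, 3, 4],
    "RECEIVED_COMMUNICATION": [1, 1, 0, 0],
})
orders_toy = pd.DataFrame({
    "CLIENT ID": [1, 1, 2, 3, 4, 4],
    "NB PRODS": [5, 10, 3, 2, 1, 1],
})

# Total prods ordered per client.
prods_per_client = (
    orders_toy.groupby("CLIENT ID", as_index=False)["NB PRODS"].sum()
)

# A left join keeps clients without orders; their total is set to 0.
merged = clients_toy.merge(prods_per_client, on="CLIENT ID", how="left")
merged["NB PRODS"] = merged["NB PRODS"].fillna(0)

# Mean prods per client, split by whether a promotion was received.
promo_summary = merged.groupby("RECEIVED_COMMUNICATION")["NB PRODS"].mean()
print(promo_summary)
```

On the real data the same pattern would apply to `clients` and `orders` once they are loaded, assuming `CLIENT ID` identifies the same customers in both tables.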


In [1]:
import random
from warnings import filterwarnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display, Markdown, Latex

# Silence library warnings to keep the notebook output readable
filterwarnings('ignore')

In [3]:
PATH = './'

SEED = 42
random.seed(SEED)
np.random.seed(SEED)  # NumPy's global RNG is separate from Python's random module

## Loading the data

In [4]:
clients = pd.read_csv(PATH+'mwc22-client_table.csv', decimal=',')
orders = pd.read_csv(PATH+'mwc22-orders_table.csv', decimal=',')

dftest = pd.read_csv(PATH+'mwc22-client_table+-+test_x.csv', decimal=',')

## EDA

### First look at the data

### Clients

In [8]:
clients.head()


| | CLIENT ID | CLIENT_SEGMENT | AVG CONSO | AVG BASKET SIZE | RECEIVED_COMMUNICATION |
|---|---|---|---|---|---|
| 0 | 24321771 | 6 | 67.25 | 201.75 | 0 |
| 1 | 24321859 | 2 | 58.33 | 350.0 | 0 |
| 2 | 24321880 | 3 | 46.67 | 112.0 | 0 |
| 3 | 24321957 | 2 | 50.0 | 600.0 | 0 |
| 4 | 24321962 | 4 | 10.0 | 120.0 | 0 |


In [9]:
clients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35884 entries, 0 to 35883
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   CLIENT ID               35884 non-null  int64  
 1   CLIENT_SEGMENT          35884 non-null  int64  
 2   AVG CONSO               35884 non-null  float64
 3   AVG BASKET SIZE         35884 non-null  float64
 4   RECEIVED_COMMUNICATION  35884 non-null  int64  
dtypes: float64(2), int64(3)
memory usage: 1.4 MB


In [11]:
clients.describe().round(3)

| | CLIENT ID | CLIENT_SEGMENT | AVG CONSO | AVG BASKET SIZE | RECEIVED_COMMUNICATION |
|---|---|---|---|---|---|
| count | 35884.0 | 35884.0 | 35884.0 | 35884.0 | 35884.0 |
| mean | 27060580.0 | 3.124 | 64.534 | 181.219 | 0.508 |
| std | 8835076.0 | 1.513 | 64.382 | 129.605 | 0.5 |
| min | 18073110.0 | 1.0 | 0.83 | 10.0 | 0.0 |
| 25% | 20533110.0 | 2.0 | 20.83 | 100.0 | 0.0 |
| 50% | 24621900.0 | 3.0 | 50.0 | 160.0 | 1.0 |
| 75% | 32985380.0 | 4.0 | 88.17 | 225.0 | 1.0 |
| max | 48365940.0 | 6.0 | 2433.33 | 3400.2 | 1.0 |
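
The summary shows a maximum `AVG CONSO` of 2433.33 against a 75th percentile of 88.17, which hints at a long right tail. One common way to flag such values is the 1.5 × IQR rule; the sketch below is a minimal version of it. The helper name `iqr_outlier_mask` and the toy series are hypothetical, and whether flagged clients should be dropped, capped, or kept is a separate modelling decision.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical toy values; only the last one mirrors the observed maximum.
toy = pd.Series([10.0, 20.0, 50.0, 60.0, 90.0, 2433.33], name="AVG CONSO")
mask = iqr_outlier_mask(toy)
print(toy[mask])
```

Applied to the real table, the call would look like `iqr_outlier_mask(clients['AVG CONSO'])`.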


### Orders

In [6]:
orders.head()

| | CLIENT ID | NB PRODS | ORDER ID | FRUIT_PRODUCT |
|---|---|---|---|---|
| 0 | 18070505 | 5 | 671907264 | Apple |
| 1 | 18070505 | 10 | 671907264 | Orange |
| 2 | 18070505 | 5 | 671907264 | Kiwi |
| 3 | 18070505 | 10 | 671907264 | Pear |
| 4 | 18070505 | 5 | 671907264 | Cheery |


In [12]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66912 entries, 0 to 66911
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   CLIENT ID      66912 non-null  int64 
 1   NB PRODS       66912 non-null  int64 
 2   ORDER ID       66912 non-null  int64 
 3   FRUIT_PRODUCT  66912 non-null  object
dtypes: int64(3), object(1)
memory usage: 2.0+ MB


In [13]:
# Convert prods to pieces of fruit (1 prod = 10 pieces)
orders['NB_UNITS'] = orders['NB PRODS'] * 10

In [14]:
orders.describe()

| | CLIENT ID | NB PRODS | ORDER ID | NB_UNITS |
|---|---|---|---|---|
| count | 66912.0 | 66912.0 | 66912.0 | 66912.0 |
| mean | 26134070.0 | 4.528112 | 672253300.0 | 45.281115 |
| std | 8473596.0 | 5.788227 | 3205826.0 | 57.882265 |
| min | 18070500.0 | -80.0 | 663833500.0 | -800.0 |
| 25% | 20174270.0 | 1.0 | 669480300.0 | 10.0 |
| 50% | 24380550.0 | 3.0 | 671997100.0 | 30.0 |
| 75% | 25387080.0 | 5.0 | 675089300.0 | 50.0 |
| max | 48365860.0 | 198.0 | 683213200.0 | 1980.0 |
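
The minimum `NB PRODS` of -80 means some order lines carry negative quantities. The challenge description does not say what these represent (returns or corrections are one possible reading, but that is an assumption), so it seems worth measuring how many rows are affected before deciding how to treat them. The sketch below uses a hypothetical `orders_toy` frame with invented values and the same columns as the orders table.

```python
import pandas as pd

# Hypothetical toy rows with the same columns as the orders table.
orders_toy = pd.DataFrame({
    "CLIENT ID": [1, 1, 2, 3],
    "NB PRODS": [5, -80, 3, 0],
    "ORDER ID": [100, 101, 102, 103],
    "FRUIT_PRODUCT": ["Apple", "Orange", "Kiwi", "Pear"],
})
orders_toy["NB_UNITS"] = orders_toy["NB PRODS"] * 10

# Rows whose quantity is zero or negative.
non_positive = orders_toy[orders_toy["NB PRODS"] <= 0]
share = len(non_positive) / len(orders_toy)
print(f"{len(non_positive)} of {len(orders_toy)} rows ({share:.1%}) have NB PRODS <= 0")

# One option (an assumption, not the only valid choice): keep only
# strictly positive lines for the sales analysis.
orders_positive = orders_toy[orders_toy["NB PRODS"] > 0].copy()
```

If the affected share is small, filtering is unlikely to change the overall picture; if it is large, the negative lines probably deserve their own analysis.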
