# Challenge Summary - KDD Cup 2009: Customer relationship prediction
Customer Relationship Management (CRM) is a key element of modern marketing strategies. The KDD Cup 2009 offers the opportunity to work on large marketing databases from the French Telecom company Orange to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).

The most practical way to build knowledge about customers in a CRM system is to produce scores. A score (the output of a model) is an evaluation, for every instance, of a target variable to explain (i.e. churn, appetency or up-selling). Tools that produce scores make it possible to project quantifiable information onto a given population. The score is computed from input variables that describe the instances. Scores are then used by the information system (IS), for example, to personalize the customer relationship. Orange Labs has developed an industrial customer analysis platform able to build prediction models with a very large number of input variables. This platform implements several processing methods for instance and variable selection, prediction and indexation, based on an efficient model combined with variable selection regularization and a model averaging method. Its main characteristic is its ability to scale to very large datasets with hundreds of thousands of instances and thousands of variables. Rapid and robust detection of the variables that contribute most to the output prediction can be a key factor in a marketing application.

The challenge is to beat the in-house system developed by Orange Labs. It is an opportunity to prove that you can deal with a very large database, including heterogeneous noisy data (numerical and categorical variables), and unbalanced class distributions. Time efficiency is often a crucial point. Therefore part of the competition will be time-constrained to test the ability of the participants to deliver solutions quickly.

Link: https://www.kdd.org/kdd-cup/view/kdd-cup-2009/Intro

# Table of contents
- [Loading the libraries and files](#Loading-the-libraries-and-files)
- [EDA](#EDA)
    - [Bringing the labels into the main dataset](#Bringing-the-labels-into-the-main-dataset)
    - [Plots of the numeric features](#Plots-of-the-numeric-features)
    - [Filling in the missing values](#Filling-in-the-missing-values)
    - [Converting the categorical features into dummies](#Converting-the-categorical-features-into-dummies)
- [Building the datasets for model creation](#Building-the-datasets-for-model-creation)
    - [Churn model](#Churn-model)
- [Conclusion](#Conclusion)

# Loading the libraries and files

In [None]:
#-- loading the libraries
import zipfile
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# ConfusionMatrixDisplay stands in for plot_confusion_matrix, which scikit-learn 1.2 removed
from sklearn.metrics import (accuracy_score, roc_auc_score, roc_curve, confusion_matrix,
                             classification_report, ConfusionMatrixDisplay)
from imblearn.over_sampling import SMOTE

In [None]:
#-- listing the files in the Office Track folder
arr = os.listdir('../data')
arr

In [None]:
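#-- extracting the files from the zip archives
extension = ".zip"
os.chdir('../data')  # from here on, the working directory is the data folder

for item in os.listdir('.'):
    if item.endswith(extension):
        file_name = os.path.abspath(item)
        # the context manager closes the archive even if extraction fails
        with zipfile.ZipFile(file_name) as zip_ref:
            zip_ref.extractall('.')
        os.remove(file_name)  # delete the archive once it has been extracted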

In [None]:
#-- listing the files in the Office Track folder again to confirm the extraction
arr = os.listdir('../data')
arr

In [None]:
#-- loading the training data
db = pd.read_csv('orange_small_train.data', sep='\t')

In [None]:
#-- printing the head of the training dataset
db.head()

In [None]:
#-- printing the dimensions of the dataset to check its integrity
db.shape

# EDA

In [None]:
#-- checking which features have the most missing values


> Some features have more than 90% missing values, so I will print the number of columns with missing values at varying percentage thresholds.

In [None]:
#-- printing the number of columns with missing values at varying percentage thresholds


> I will treat up to 15% missing values as acceptable. The columns that still contain missing values will be handled later.

In [None]:
#-- copying the names of the columns to keep; these columns have at most 15% missing values


In [None]:
#-- filtering the original dataset down to the columns with at most 15% missing values


In [None]:
#-- checking how many columns were removed; no rows should be removed
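
The missing-value steps above (per-column share, counts per threshold, the 15% cut-off and the shape check) can be sketched on a toy frame. This is only an illustration under assumptions: `toy`, `missing_frac`, `columns_to_keep` and `toy_filtered` are hypothetical names, and the thresholds other than 15% are arbitrary examples, not the notebook's actual values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `db` (hypothetical values)
toy = pd.DataFrame({
    "Var1": [1.0] + [np.nan] * 9,   # 90% missing
    "Var2": [0.5] * 9 + [np.nan],   # 10% missing
    "Var3": list("abcdefghij"),     # 0% missing
})

# isna() marks missing cells; the column mean of that mask is the missing share
missing_frac = toy.isna().mean().sort_values(ascending=False)
print(missing_frac)

# Number of columns whose missing share is at or below each threshold
for threshold in (0.15, 0.50, 0.95):
    n_cols = int((missing_frac <= threshold).sum())
    print(f"<= {threshold:.0%} missing: {n_cols} columns")

# Keep only the columns with at most 15% missing values
columns_to_keep = missing_frac[missing_frac <= 0.15].index
toy_filtered = toy.loc[:, columns_to_keep]

# Only columns are dropped, so the row count must not change
print(toy.shape, "->", toy_filtered.shape)
```

On the real data the same pattern would apply to `db` in place of `toy`.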


In [None]:
#-- printing the head of the new dataset


In [None]:
#-- printing the column types


In [None]:
#-- printing some summary statistics of the numeric columns


## Bringing the labels into the main dataset

## Plots of the numeric features

## Filling in the missing values

In [None]:
#-- filling the missing values of the numeric features with the column median
imputer = SimpleImputer(strategy='median')
# fit_transform learns the medians and fills the gaps in one step
db_dcol_drow.loc[:, columns_numeric] = imputer.fit_transform(db_dcol_drow.loc[:, columns_numeric])
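
To make the effect of the median strategy concrete, here is a minimal sketch on a toy frame. The frame and `toy_numeric_cols` are hypothetical stand-ins for `db_dcol_drow` and `columns_numeric`, whose definitions are not shown in this notebook excerpt.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    "Var1": [1.0, 2.0, np.nan, 100.0],   # median of the observed values is 2.0
    "Var2": [10.0, np.nan, 30.0, 50.0],  # median of the observed values is 30.0
})
toy_numeric_cols = ["Var1", "Var2"]  # hypothetical stand-in for `columns_numeric`

imputer = SimpleImputer(strategy="median")
toy[toy_numeric_cols] = imputer.fit_transform(toy[toy_numeric_cols])

# statistics_ holds the fill value learned for each column
print(dict(zip(toy_numeric_cols, imputer.statistics_)))
```

The outlier `100.0` in `Var1` does not pull the fill value upward, which is the usual reason to prefer the median over the mean for skewed features.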

## Converting the categorical features into dummies
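
A minimal sketch of this step, assuming `pd.get_dummies` is used for the encoding (the original cell is not shown). The toy frame and the choice to label missing categories as `"missing"` are assumptions for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Var1": [1.0, 2.0, 3.0],
    "Var2": ["a", "b", None],   # categorical feature with one missing value
})

# Hypothetical choice: give missing categories an explicit label before encoding
toy["Var2"] = toy["Var2"].fillna("missing")

# Treat every non-numeric column as categorical
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# One 0/1 column is created per distinct category
toy_dummies = pd.get_dummies(toy, columns=categorical_cols, dtype=int)
print(toy_dummies.columns.tolist())
```

One-hot encoding adds one column per distinct category, so high-cardinality features widen the frame considerably.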

# Building the datasets for model creation
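
One common way to build these datasets is a stratified train/test split, sketched below with the `train_test_split` function imported earlier. The toy data, the `churn` target name, the 30% test size and the seeds are all assumptions, not the notebook's actual settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy feature matrix and an unbalanced binary target (10% positives), both hypothetical
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["Var1", "Var2", "Var3"])
y = pd.Series([1] * 10 + [0] * 90, name="churn")

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())
```

Stratifying matters here because the challenge description stresses unbalanced class distributions; without it, a small test split could end up with very few positive cases.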

In [None]:
#-- creating the datasets for analysis


## Churn model

In [None]:
#-- balancing the classes
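
The notebook imports `SMOTE` from `imblearn`, but the balancing code itself is not shown. As a rough illustration of the idea, the sketch below swaps in plain random oversampling with `sklearn.utils.resample` so that it depends only on scikit-learn; it is not SMOTE, which synthesizes new minority samples instead of duplicating existing ones. All data and names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

rng = np.random.default_rng(0)
# Hypothetical training split: 90 negatives and 10 positives
X_train = pd.DataFrame(rng.normal(size=(100, 2)), columns=["Var1", "Var2"])
y_train = pd.Series([0] * 90 + [1] * 10, name="churn")

train = X_train.assign(churn=y_train)
majority = train[train["churn"] == 0]
minority = train[train["churn"] == 1]

# Sample the minority class with replacement until it matches the majority size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

# Recombine and shuffle so the classes are interleaved
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)

X_bal = balanced.drop(columns="churn")
y_bal = balanced["churn"]
print(y_bal.value_counts())
```

Whichever technique is used, balancing is normally applied to the training split only, so that the test split keeps the original class distribution.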
