# Machine Learning Project

Dataset glossary (by column)

- Customer ID - unique identifier for each customer
- Age - age of the customer
- Gender - gender of the customer (Male/Female)
- Item Purchased - the item the customer bought
- Category - category of the purchased item
- Purchase Amount (USD) - the purchase amount in USD
- Location - where the purchase was made
- Size - size of the purchased item
- Color - color of the purchased item
- Season - season in which the purchase was made
- Review Rating - rating the customer gave the purchased item
- Subscription Status - whether the customer has a subscription (Yes/No)
- Shipping Type - shipping type chosen by the customer
- Discount Applied - whether a discount was applied to the purchase (Yes/No)
- Promo Code Used - whether a promo code was used for the purchase (Yes/No)
- Previous Purchases - number of previous purchases made by the customer
- Payment Method - the customer's preferred payment method
- Frequency of Purchases - how often the customer makes purchases (for example, Weekly, Fortnightly, Monthly)

In [None]:
import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Show every column when displaying a DataFrame
pd.set_option('display.max_columns', None)
# Silence library warnings in the notebook output
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Projeto - Machine Learning II/shopping_trends.csv')

In [None]:
df.head()

Unnamed: 0,Customer ID,Age,Gender,Item Purchased,Category,Purchase Amount (USD),Location,Size,Color,Season,Review Rating,Subscription Status,Payment Method,Shipping Type,Discount Applied,Promo Code Used,Previous Purchases,Preferred Payment Method,Frequency of Purchases
0,1,55,Male,Blouse,Clothing,53,Kentucky,L,Gray,Winter,3.1,Yes,Credit Card,Express,Yes,Yes,14,Venmo,Fortnightly
1,2,19,Male,Sweater,Clothing,64,Maine,L,Maroon,Winter,3.1,Yes,Bank Transfer,Express,Yes,Yes,2,Cash,Fortnightly
2,3,50,Male,Jeans,Clothing,73,Massachusetts,S,Maroon,Spring,3.1,Yes,Cash,Free Shipping,Yes,Yes,23,Credit Card,Weekly
3,4,21,Male,Sandals,Footwear,90,Rhode Island,M,Maroon,Spring,3.5,Yes,PayPal,Next Day Air,Yes,Yes,49,PayPal,Weekly
4,5,45,Male,Blouse,Clothing,49,Oregon,M,Turquoise,Spring,2.7,Yes,Cash,Free Shipping,Yes,Yes,31,PayPal,Annually


## Getting to Know the Data

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3900 entries, 0 to 3899
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Customer ID               3900 non-null   int64  
 1   Age                       3900 non-null   int64  
 2   Gender                    3900 non-null   object 
 3   Item Purchased            3900 non-null   object 
 4   Category                  3900 non-null   object 
 5   Purchase Amount (USD)     3900 non-null   int64  
 6   Location                  3900 non-null   object 
 7   Size                      3900 non-null   object 
 8   Color                     3900 non-null   object 
 9   Season                    3900 non-null   object 
 10  Review Rating             3900 non-null   float64
 11  Subscription Status       3900 non-null   object 
 12  Payment Method            3900 non-null   object 
 13  Shipping Type             3900 non-null   object 
 14  Discount

In [None]:
df.describe()

Unnamed: 0,Customer ID,Age,Purchase Amount (USD),Review Rating,Previous Purchases
count,3900.0,3900.0,3900.0,3900.0,3900.0
mean,1950.5,44.068462,59.764359,3.749949,25.351538
std,1125.977353,15.207589,23.685392,0.716223,14.447125
min,1.0,18.0,20.0,2.5,1.0
25%,975.75,31.0,39.0,3.1,13.0
50%,1950.5,44.0,60.0,3.7,25.0
75%,2925.25,57.0,81.0,4.4,38.0
max,3900.0,70.0,100.0,5.0,50.0


In [None]:
# Loop over the categorical columns and count the unique values in each

unique_value_counts = {}

for column in df.select_dtypes(include=['object']):
    unique_values = df[column].nunique()
    unique_value_counts[column] = unique_values

print(unique_value_counts)

{'Gender': 2, 'Item Purchased': 25, 'Category': 4, 'Location': 50, 'Size': 4, 'Color': 25, 'Season': 4, 'Subscription Status': 2, 'Payment Method': 6, 'Shipping Type': 6, 'Discount Applied': 2, 'Promo Code Used': 2, 'Preferred Payment Method': 6, 'Frequency of Purchases': 7}


In [None]:
# Loop over the numeric columns and count the unique values in each

unique_value_counts = {}

for column in df.select_dtypes(include=['int64', 'float']):
    unique_values = df[column].nunique()
    unique_value_counts[column] = unique_values

print(unique_value_counts)

{'Customer ID': 3900, 'Age': 53, 'Purchase Amount (USD)': 81, 'Review Rating': 26, 'Previous Purchases': 50}
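
The two loops above can also be written with `DataFrame.nunique`, which counts distinct values per column in a single call. The following is a minimal sketch on a small made-up frame; the `toy` data and the variable names are hypothetical and only stand in for the shopping dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the shopping dataset
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Season": ["Winter", "Winter", "Spring"],
    "Age": [55, 19, 55],
    "Review Rating": [3.1, 3.1, 2.7],
})

# Distinct values per non-numeric (categorical) column
categorical_counts = toy.select_dtypes(exclude="number").nunique().to_dict()

# Distinct values per numeric column
numeric_counts = toy.select_dtypes(include="number").nunique().to_dict()

print(categorical_counts)
print(numeric_counts)
```

Selecting with `exclude="number"` rather than naming a specific string dtype is a choice made here so the sketch does not depend on how a given pandas version stores text columns.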


## Data Cleaning

In [None]:
df.isnull().sum()

Customer ID                 0
Age                         0
Gender                      0
Item Purchased              0
Category                    0
Purchase Amount (USD)       0
Location                    0
Size                        0
Color                       0
Season                      0
Review Rating               0
Subscription Status         0
Payment Method              0
Shipping Type               0
Discount Applied            0
Promo Code Used             0
Previous Purchases          0
Preferred Payment Method    0
Frequency of Purchases      0
dtype: int64

In [None]:
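# Sum over the duplicated rows only; all zeros in the output means no row is duplicated.
# df.duplicated().sum() would count the duplicated rows directly.
df[df.duplicated()].sum()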

Customer ID                 0.0
Age                         0.0
Gender                      0.0
Item Purchased              0.0
Category                    0.0
Purchase Amount (USD)       0.0
Location                    0.0
Size                        0.0
Color                       0.0
Season                      0.0
Review Rating               0.0
Subscription Status         0.0
Payment Method              0.0
Shipping Type               0.0
Discount Applied            0.0
Promo Code Used             0.0
Previous Purchases          0.0
Preferred Payment Method    0.0
Frequency of Purchases      0.0
dtype: float64
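
Indexing with `df.duplicated()` and then summing, as in the cell above, adds up the values of the duplicated rows. Counting those rows is a separate operation. The following is a minimal sketch of the counting variant on made-up data; the `toy` frame is hypothetical and deliberately contains one repeated row.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first
toy = pd.DataFrame({
    "Customer ID": [1, 2, 1],
    "Item Purchased": ["Blouse", "Sweater", "Blouse"],
})

# duplicated() flags each row that repeats an earlier one
duplicate_mask = toy.duplicated()

# Summing the boolean mask counts the duplicated rows
n_duplicates = int(duplicate_mask.sum())

# drop_duplicates() keeps the first occurrence of each row
deduplicated = toy.drop_duplicates()

print(n_duplicates)
print(deduplicated.shape)
```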

## Exploratory Analysis

# Thiago - XGBoost
---

In [1]:
import seaborn as sns
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics



In [4]:
dataframe_thiago = pd.read_csv("./datasets/dataset_cluster.csv")

In [6]:
dataframe_thiago.shape

(2193, 35)

In [7]:
dataframe_thiago

Unnamed: 0,Income,Kidhome,Teenhome,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,...,MntTotal,Education_Graduated,Education_Not Graduated,Education_Post Graduated,Partner_Not Partner,Partner_Partner,Age_Category_Young,Age_Category_Adult,Age_Category_Elderly,Cluster
0,58138.0,0,0,58,635,88,546,172,88,88,...,1617,1,0,0,1,0,0,0,1,2
1,46344.0,1,1,38,11,1,6,2,1,6,...,27,1,0,0,1,0,0,0,1,3
2,71613.0,0,0,26,426,49,127,111,21,42,...,776,1,0,0,0,1,0,1,0,1
3,26646.0,1,0,26,11,4,20,10,3,5,...,53,1,0,0,0,1,0,1,0,0
4,58293.0,1,0,94,173,43,118,46,27,15,...,422,0,0,1,0,1,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2188,88325.0,0,0,42,519,71,860,93,27,53,...,1623,1,0,0,0,1,0,1,0,1
2189,38054.0,1,1,24,23,2,17,2,1,0,...,45,0,0,1,1,0,0,0,1,3
2190,80617.0,0,0,42,594,51,631,72,55,32,...,1435,1,0,0,1,0,1,0,0,2
2191,5305.0,0,1,12,12,4,7,1,3,5,...,32,1,0,0,0,1,0,1,0,1


In [8]:
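# Features: every column except the cluster label
df_X = dataframe_thiago.drop(["Cluster"], axis=1)
# Target: the cluster assigned to each row
df_y = dataframe_thiago["Cluster"]

# No test_size is given, so train_test_split holds out its default 25% of the rows
X_train, X_test, Y_train, Y_test = train_test_split(df_X, df_y, random_state=42)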

In [14]:
X_test.shape[0]/X_train.shape[0]

0.33394160583941607
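
The ratio of about 0.33 matches the default behavior of `train_test_split`: when neither `test_size` nor `train_size` is given, 25% of the rows are held out, so test/train is roughly 0.25 / 0.75. The following is a minimal sketch on synthetic data that sets the split size explicitly. The arrays and the `stratify` argument are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 rows, 3 features, 4 balanced classes
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = np.repeat([0, 1, 2, 3], 50)

# Explicit 25% hold-out; stratify keeps class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

ratio = X_test.shape[0] / X_train.shape[0]
print(X_train.shape[0], X_test.shape[0], ratio)
```

Stratifying on the label is one option when cluster sizes are uneven. Whether it is needed here depends on the class balance of `Cluster`, which the notebook does not show.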

In [10]:
classificador_xgb = xgb.XGBClassifier()
type(classificador_xgb)

xgboost.sklearn.XGBClassifier

In [None]:
from sklearn.model_selection import cross_val_score
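
The following is a minimal sketch of how `cross_val_score` might be used to evaluate a classifier. It uses synthetic data, and scikit-learn's `GradientBoostingClassifier` is swapped in as a stand-in so the example runs without xgboost installed. An `xgb.XGBClassifier()` instance could be passed in the same position. The dataset, the hyperparameters, and the choice of 5 folds are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 4-class data standing in for the clustered dataset
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=5,
    n_classes=4, random_state=42,
)

# Stand-in estimator (assumption); xgb.XGBClassifier() could be passed the same way
model = GradientBoostingClassifier(n_estimators=20, random_state=42)

# 5-fold cross-validated accuracy, one score per fold
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print(scores.mean(), scores.std())
```

Reporting the mean together with the standard deviation across folds gives a rough sense of how much the accuracy depends on the particular split.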