# Exercício

# Classificação

**Exercício: Classificação se o funcionário vai sair ou não**

**Arquivo Employee.csv**

**Passos:**

1. Carregue o conjunto de dados em um DataFrame usando a biblioteca pandas.
2. Explore e visualize os dados para entender suas características.
3. Divida os dados em recursos (X) e rótulos (y).
4. Divida o conjunto de dados em conjuntos de treinamento e teste.
5. Utilização da árvore de decisão `DecisionTreeClassifier`.
   1. Inicialize o modelo de árvore de decisão.
   2. Treine o modelo usando o conjunto de treinamento.
   3. Faça previsões usando o conjunto de teste.
   4. Avalie o desempenho do modelo usando métricas como acurácia, matriz de confusão, etc.
   5.  Visualize a árvore de decisão resultante (opcional).
   
6. Utilização do KNN:
   1. Importe a classe `KNeighborsClassifier` da biblioteca scikit-learn.
   2. Inicialize o modelo k-NN com um valor de k desejado.
   3. Treine o modelo usando o conjunto de treinamento.
   4. Faça previsões usando o conjunto de teste.
   5. Avalie o desempenho do modelo usando métricas como acurácia, matriz de confusão, etc.
   6. Experimente diferentes valores de k e avalie como eles afetam o desempenho.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn import preprocessing
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
import math
from sklearn.neighbors import DistanceMetric, KNeighborsClassifier
from sklearn import neighbors
from sklearn import tree
import graphviz
from scipy import stats

# Classificação

In [34]:
# Carregar o conjunto de dados
df = pd.read_csv("../../Datasets/Employee.csv")

# 0 não saiu, 1 para saiu
df

Unnamed: 0,Education,JoiningYear,City,PaymentTier,Age,Gender,EverBenched,ExperienceInCurrentDomain,LeaveOrNot
0,Bachelors,2017,Bangalore,3,34,Male,No,0,0
1,Bachelors,2013,Pune,1,28,Female,No,3,1
2,Bachelors,2014,New Delhi,3,38,Female,No,2,0
3,Masters,2016,Bangalore,3,27,Male,No,5,1
4,Masters,2017,Pune,3,24,Male,Yes,2,1
...,...,...,...,...,...,...,...,...,...
4648,Bachelors,2013,Bangalore,3,26,Female,No,4,0
4649,Masters,2013,Pune,2,37,Male,No,2,1
4650,Masters,2018,New Delhi,3,27,Male,No,5,1
4651,Bachelors,2012,Bangalore,3,30,Male,Yes,2,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4653 entries, 0 to 4652
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Education                  4653 non-null   object
 1   JoiningYear                4653 non-null   int64 
 2   City                       4653 non-null   object
 3   PaymentTier                4653 non-null   int64 
 4   Age                        4653 non-null   int64 
 5   Gender                     4653 non-null   object
 6   EverBenched                4653 non-null   object
 7   ExperienceInCurrentDomain  4653 non-null   int64 
 8   LeaveOrNot                 4653 non-null   int64 
dtypes: int64(5), object(4)
memory usage: 327.3+ KB


Checando duplicadas

In [5]:
df.duplicated().sum()

1889

In [6]:
df[df.duplicated()]

Unnamed: 0,Education,JoiningYear,City,PaymentTier,Age,Gender,EverBenched,ExperienceInCurrentDomain,LeaveOrNot
111,Bachelors,2017,Pune,2,27,Female,No,5,1
130,Bachelors,2017,Bangalore,3,26,Female,No,4,0
138,Bachelors,2017,New Delhi,3,28,Male,No,2,0
160,Bachelors,2014,Bangalore,3,28,Female,No,3,0
167,Bachelors,2014,Bangalore,3,25,Male,No,3,0
...,...,...,...,...,...,...,...,...,...
4640,Bachelors,2015,Bangalore,3,35,Male,No,0,0
4642,Bachelors,2012,Bangalore,3,36,Female,No,4,0
4646,Bachelors,2013,Bangalore,3,25,Female,No,3,0
4648,Bachelors,2013,Bangalore,3,26,Female,No,4,0


In [7]:
df.loc[(df["Education"] == 'Bachelors') & 
       (df["JoiningYear"] == 2017) & 
       (df["City"] == "Pune") &
       (df["PaymentTier"] == 2) &
       (df["Age"] == 27)]

Unnamed: 0,Education,JoiningYear,City,PaymentTier,Age,Gender,EverBenched,ExperienceInCurrentDomain,LeaveOrNot
86,Bachelors,2017,Pune,2,27,Female,No,5,1
111,Bachelors,2017,Pune,2,27,Female,No,5,1
423,Bachelors,2017,Pune,2,27,Male,No,5,0
1910,Bachelors,2017,Pune,2,27,Female,No,5,1
2239,Bachelors,2017,Pune,2,27,Male,No,5,0
2582,Bachelors,2017,Pune,2,27,Female,No,5,1
2743,Bachelors,2017,Pune,2,27,Male,No,5,0
4125,Bachelors,2017,Pune,2,27,Female,No,5,1


Deletando duplicadas

In [35]:
df.drop_duplicates(keep='first', inplace=True)
df.duplicated().sum()

0

In [9]:
df.isna().sum().sum()

0

In [10]:
df['LeaveOrNot'].value_counts()

0    1676
1    1088
Name: LeaveOrNot, dtype: int64

In [36]:
# Correlação

corr = df.corr()
corr.style.background_gradient(cmap = 'coolwarm')

Unnamed: 0,JoiningYear,PaymentTier,Age,ExperienceInCurrentDomain,LeaveOrNot
JoiningYear,1.0,-0.053823,0.024445,-0.031228,0.15065
PaymentTier,-0.053823,1.0,0.067514,-0.004602,-0.119891
Age,0.024445,0.067514,1.0,-0.053276,-0.114943
ExperienceInCurrentDomain,-0.031228,-0.004602,-0.053276,1.0,-0.021181
LeaveOrNot,0.15065,-0.119891,-0.114943,-0.021181,1.0


In [38]:
# label encoder

le = preprocessing.LabelEncoder()
for column in df.columns:
    if(df[column].dtypes=='object'):
        df[column] = le.fit_transform(df[column])
        
df.head()

Unnamed: 0,Education,JoiningYear,City,PaymentTier,Age,Gender,EverBenched,ExperienceInCurrentDomain,LeaveOrNot
0,0,2017,0,3,34,1,0,0,0
1,0,2013,2,1,28,0,0,3,1
2,0,2014,1,3,38,0,0,2,0
3,1,2016,0,3,27,1,0,5,1
4,1,2017,2,3,24,1,1,2,1


In [39]:
# Dividir em recursos (X) e rótulos (y)
X = df.drop('LeaveOrNot', axis = 1)
y = df['LeaveOrNot']

In [14]:
X

Unnamed: 0,Education,JoiningYear,City,PaymentTier,Age,Gender,EverBenched,ExperienceInCurrentDomain
0,0,2017,0,3,34,1,0,0
1,0,2013,2,1,28,0,0,3
2,0,2014,1,3,38,0,0,2
3,1,2016,0,3,27,1,0,5
4,1,2017,2,3,24,1,1,2
...,...,...,...,...,...,...,...,...
4645,1,2017,2,2,31,0,0,2
4647,0,2016,2,3,30,1,0,2
4649,1,2013,2,2,37,1,0,2
4650,1,2018,1,3,27,1,0,5


Dividir em conjuntos de treinamento e teste

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Avaliar o desempenho

In [16]:
def results(y_test,y_pred):
    results = confusion_matrix(y_test, y_pred)
    print ('Confusion Matrix :')
    print(results)
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy: %.2f%%" % (accuracy * 100.0))
    print ('Report : ')
    print (classification_report(y_test, y_pred))

## Removendo outliers utilizando iqr

In [41]:
def remove_outliers_iqr(df, threshold=1.5):
    
    for column_name in df.columns:
    
        Q1 = df[column_name].quantile(0.25)
        Q3 = df[column_name].quantile(0.75)
        
        IQR = Q3 - Q1

        lower_bound = Q1 - threshold * IQR
        upper_bound = Q3 + threshold * IQR

        df_cleaned = df[(df[column_name] >= lower_bound) & (df[column_name] <= upper_bound)]
    print(df.shape)
    print(df_cleaned.shape)
    return df_cleaned

In [42]:
df_without_outliers_iqr = remove_outliers_iqr(df)

(2764, 9)
(2764, 9)


### KNN

In [43]:
def computeKNN(X_train,y_train,X_test,y_test,n_neighbors=3, algorithm='auto'):
    # Inicializar o modelo k-NN
    knn = neighbors.KNeighborsClassifier(n_neighbors = n_neighbors, algorithm=algorithm)
    
    # Treinar o modelo
    knn.fit(X_train, y_train)
    
    # Fazer previsões
    y_pred = knn.predict(X_test)
    
    results(y_test,y_pred)
    return y_pred,knn

In [44]:
# KNN com os dados originais

y_pred, knn = computeKNN(X_train, y_train, X_test, y_test, n_neighbors= 3)

Confusion Matrix :
[[251  82]
 [102 118]]
Accuracy: 66.73%
Report : 
              precision    recall  f1-score   support

           0       0.71      0.75      0.73       333
           1       0.59      0.54      0.56       220

    accuracy                           0.67       553
   macro avg       0.65      0.65      0.65       553
weighted avg       0.66      0.67      0.66       553



Aplicando robust scaler

In [45]:
rs = RobustScaler()
X_train_robust = rs.fit_transform(X_train)
X_test_robust = rs.transform(X_test)

In [60]:
y_pred,knn = computeKNN(X_train_robust,y_train,X_test_robust,y_test,n_neighbors=15)

Confusion Matrix :
[[306  27]
 [ 95 125]]
Accuracy: 77.94%
Report : 
              precision    recall  f1-score   support

           0       0.76      0.92      0.83       333
           1       0.82      0.57      0.67       220

    accuracy                           0.78       553
   macro avg       0.79      0.74      0.75       553
weighted avg       0.79      0.78      0.77       553



### Árvore de Decisão

In [61]:
def computeClassificationDecisionTree(X_train, y_train, X_test, y_test):
    
    arvore_classificacao = tree.DecisionTreeClassifier(random_state=0)
    arvore_classificacao.fit(X_train,y_train)
    y_pred = arvore_classificacao.predict(X_test)
    
    results(y_test, y_pred)
    return y_pred,arvore_classificacao

In [62]:
y_pred,arvore_classificacao = computeClassificationDecisionTree(X_train, y_train, X_test, y_test)

Confusion Matrix :
[[252  81]
 [ 87 133]]
Accuracy: 69.62%
Report : 
              precision    recall  f1-score   support

           0       0.74      0.76      0.75       333
           1       0.62      0.60      0.61       220

    accuracy                           0.70       553
   macro avg       0.68      0.68      0.68       553
weighted avg       0.69      0.70      0.70       553



In [68]:
y_pred,arvore_classificacao = computeClassificationDecisionTree(X_train_robust,y_train,X_test_robust,y_test)

Confusion Matrix :
[[253  80]
 [ 87 133]]
Accuracy: 69.80%
Report : 
              precision    recall  f1-score   support

           0       0.74      0.76      0.75       333
           1       0.62      0.60      0.61       220

    accuracy                           0.70       553
   macro avg       0.68      0.68      0.68       553
weighted avg       0.70      0.70      0.70       553



Aplicando StandardScaler

In [65]:
sc = StandardScaler()
X_train_standard = sc.fit_transform(X_train)
X_test_standard = sc.transform(X_test)

In [66]:
y_pred,arvore_classificacao = computeClassificationDecisionTree(X_train_standard,y_train,X_test_standard,y_test)

Confusion Matrix :
[[255  78]
 [ 86 134]]
Accuracy: 70.34%
Report : 
              precision    recall  f1-score   support

           0       0.75      0.77      0.76       333
           1       0.63      0.61      0.62       220

    accuracy                           0.70       553
   macro avg       0.69      0.69      0.69       553
weighted avg       0.70      0.70      0.70       553

