# Exercise

# Regression

**Exercise: Predict diamond prices**

**File: diamonds.csv**

KNN:
1. Test other values of k and other distance metrics
2. Run KNN using only the attributes that are highly correlated with price

Linear Regression:
1. Test the prediction using the linear regression model (`LinearRegression`)

# Classification

**Exercise: Cancer classification**

**File: wisconsin_breast_cancer.csv**

**Steps:**

1. Load the dataset into a DataFrame using the pandas library.
2. Explore and visualize the data to understand its characteristics.
3. Split the data into features (X) and labels (y).
4. Split the dataset into training and test sets.

5. If you use a decision tree:
   1. Import the `DecisionTreeClassifier` class from scikit-learn.
   2. Initialize the decision tree model.
   3. Train the model on the training set.
   4. Make predictions on the test set.
   5. Evaluate the model with metrics such as accuracy and the confusion matrix.
   6. Visualize the resulting decision tree (optional).

6. If you use KNN:
   1. Import the `KNeighborsClassifier` class from scikit-learn.
   2. Initialize the k-NN model with the desired value of k.
   3. Train the model on the training set.
   4. Make predictions on the test set.
   5. Evaluate the model with metrics such as accuracy and the confusion matrix.
   6. Try different values of k and assess how they affect performance.

In [2]:
import math

import graphviz
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import neighbors, preprocessing, tree
from sklearn.cluster import KMeans
from sklearn.metrics import (
    DistanceMetric,
    accuracy_score,
    classification_report,
    confusion_matrix,
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Regression

In [3]:
df = pd.read_csv('../../Datasets/diamonds.csv', sep = ",", low_memory=False)
# Remove the id column (the first column of the CSV)
df = df.drop(df.columns[0], axis=1)
df

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...,...
53935,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50
53936,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,0.70,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74


In [4]:
# Label-encode every categorical (object-dtype) column as integer codes
le = preprocessing.LabelEncoder()
for column in df.columns:
    if df[column].dtype == "object":
        df[column] = le.fit_transform(df[column])

df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,2,1,3,61.5,55.0,326,3.95,3.98,2.43
1,0.21,3,1,2,59.8,61.0,326,3.89,3.84,2.31
2,0.23,1,1,4,56.9,65.0,327,4.05,4.07,2.31
3,0.29,3,5,5,62.4,58.0,334,4.2,4.23,2.63
4,0.31,1,6,3,63.3,58.0,335,4.34,4.35,2.75
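
`LabelEncoder` assigns codes in alphabetical order of the labels, which is why `Good` becomes 1, `Ideal` 2 and `Premium` 3 above. For distance-based models such as KNN it may be worth encoding the quality order instead. The sketch below is one possible way to do that with `pd.Categorical`; the toy frame and the `cut_order` list are assumptions (the order follows the usual diamond cut grading scale), not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the `cut` column of diamonds.csv.
toy = pd.DataFrame({"cut": ["Ideal", "Premium", "Good", "Very Good", "Fair"]})

# Assumed quality order, from worst to best.
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

# Categorical codes follow the order given in `categories`, not the alphabet.
toy["cut_code"] = pd.Categorical(toy["cut"], categories=cut_order, ordered=True).codes
print(toy)
```

The same pattern could be applied to `color` and `clarity` once their grading order is written down explicitly.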


In [5]:
# Correlation matrix (after dropping duplicate rows)
df.drop_duplicates(keep='first', inplace=True)
corr = df.corr()
corr.style.background_gradient(cmap = 'coolwarm')

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
carat,1.0,0.017785,0.291019,-0.214068,0.027861,0.181091,0.921548,0.97538,0.951908,0.953542
cut,0.017785,1.0,0.000393,0.028141,-0.193184,0.150366,0.040196,0.022592,0.027805,0.002442
color,0.291019,0.000393,1.0,-0.028002,0.047572,0.026102,0.171825,0.269876,0.263153,0.267825
clarity,-0.214068,0.028141,-0.028002,1.0,-0.053165,-0.088074,-0.071218,-0.225575,-0.217459,-0.224117
depth,0.027861,-0.193184,0.047572,-0.053165,1.0,-0.297669,-0.011048,-0.025348,-0.029389,0.094757
table,0.181091,0.150366,0.026102,-0.088074,-0.297669,1.0,0.126566,0.194855,0.183231,0.15027
price,0.921548,0.040196,0.171825,-0.071218,-0.011048,0.126566,1.0,0.884504,0.865395,0.861208
x,0.97538,0.022592,0.269876,-0.225575,-0.025348,0.194855,0.884504,1.0,0.974592,0.970686
y,0.951908,0.027805,0.263153,-0.217459,-0.029389,0.183231,0.865395,0.974592,1.0,0.951844
z,0.953542,0.002442,0.267825,-0.224117,0.094757,0.15027,0.861208,0.970686,0.951844,1.0
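
The `price` row of the matrix can also be used to pick the highly correlated attributes programmatically rather than by eye. The following is a minimal sketch on synthetic data; the `toy` frame and the `threshold` value of 0.8 are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
carat = rng.uniform(0.2, 2.0, n)

# Synthetic stand-in for the diamonds data: price depends on carat only.
toy = pd.DataFrame({
    "carat": carat,
    "depth": rng.normal(61.0, 1.5, n),            # unrelated to price here
    "price": 4000 * carat + rng.normal(0, 300, n),
})

threshold = 0.8  # hypothetical cutoff for "highly correlated"

# Correlation of every attribute with the target, excluding the target itself.
corr_with_price = toy.corr()["price"].drop("price")
selected = corr_with_price[corr_with_price.abs() > threshold].index.tolist()
print(selected)
```

Applied to the matrix above, a cutoff of 0.8 would keep `carat`, `x`, `y` and `z`, which are the columns used later in the notebook.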


### Separating the target (price) from the features

In [6]:
target = pd.DataFrame(df, columns=["price"])

In [7]:
X = df.drop(columns="price")  # every column except the target
y = target

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

### Linear Regression

In [9]:
from sklearn.linear_model import LinearRegression

# Create the regression object
lr = LinearRegression()

# Fit the regression on the training set
lr.fit(X_train, y_train)

In [10]:
value_pred = lr.predict(X_test) 

In [11]:
def results_regression(y_test, y_pred):
    """Print MSE, RMSE, MAE, MAPE and R2 for a set of predictions."""
    mse = mean_squared_error(y_test, y_pred)
    print(f"mse: {mse}")
    rmse = math.sqrt(mse)
    print(f"rmse: {rmse}")
    mae = mean_absolute_error(y_test, y_pred)
    print(f"mae: {mae}")
    # scikit-learn returns MAPE as a fraction (0.38 means 38%)
    mape = mean_absolute_percentage_error(y_test, y_pred)
    print(f"mape: {mape}")
    r2 = r2_score(y_test, y_pred)
    print(f"r2_score {r2}")


In [12]:
results_regression(y_test, value_pred)

mse: 1873968.8605330877
rmse: 1368.9298230855693
mae: 879.5494551718102
mape: 0.38773662971707784
r2_score 0.8827630924488556
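
To make the printed metrics easier to read, the sketch below recomputes each of them by hand with NumPy and compares the result with scikit-learn. The four values in `y_true` and `y_hat` are made-up numbers used only for illustration.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

# Made-up true prices and predictions, for illustration only.
y_true = np.array([326.0, 334.0, 2757.0, 500.0])
y_hat = np.array([300.0, 400.0, 2500.0, 650.0])

err = y_true - y_hat
mse = np.mean(err ** 2)                        # mean squared error
rmse = np.sqrt(mse)                            # same unit as the target
mae = np.mean(np.abs(err))                     # mean absolute error
mape = np.mean(np.abs(err) / np.abs(y_true))   # a fraction, not a percentage
r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Each hand-computed value should agree with scikit-learn.
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(mape, mean_absolute_percentage_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```

Read this way, the `mape` of about 0.39 above corresponds to an average relative error of about 39%.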


### KNN

In [13]:
knn = neighbors.KNeighborsRegressor(n_neighbors = 5, algorithm='auto')
model_KNN = knn.fit(X_train, y_train)
y_pred = model_KNN.predict(X_test)

In [14]:
results_regression(y_test, y_pred)

mse: 838375.0304749513
rmse: 915.6282162946658
mae: 481.39801096756196
mape: 0.134245981529335
r2_score 0.9475506247670414
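
KNN relies on distances, so attributes measured on larger scales can dominate the neighbor search. The notebook fits KNN on unscaled attributes; the sketch below shows, on synthetic data only, how a `StandardScaler` could be combined with the regressor in a pipeline. The data and all variable names are assumptions, and whether scaling helps on the diamonds data would need to be checked.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
small = rng.uniform(0.0, 1.0, n)      # informative attribute on a small scale
large = rng.uniform(0.0, 1000.0, n)   # uninformative attribute on a large scale
X_toy = np.column_stack([small, large])
y_toy = 10.0 * small + rng.normal(0.0, 0.1, n)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Same model with and without standardization of the attributes.
raw_knn = KNeighborsRegressor(n_neighbors=5).fit(X_tr, y_tr)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)).fit(X_tr, y_tr)

r2_raw = r2_score(y_te, raw_knn.predict(X_te))
r2_scaled = r2_score(y_te, scaled_knn.predict(X_te))
print(f"raw r2: {r2_raw:.3f}  scaled r2: {r2_scaled:.3f}")
```

Putting the scaler inside the pipeline means it is fitted on the training split only, so no information from the test split leaks into the model.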


In [15]:
# Repeat KNN using only the attributes most correlated with price
X_train, X_test, y_train, y_test = train_test_split(X[['carat', 'x', 'y', 'z']], y, test_size = 0.2, random_state = 0)
knn = neighbors.KNeighborsRegressor(n_neighbors = 5, algorithm='auto')
model_KNN = knn.fit(X_train, y_train)
y_pred = model_KNN.predict(X_test)
results_regression(y_test, y_pred)

mse: 2131210.3489952595
rmse: 1459.866551776312
mae: 831.1038758248908
mape: 0.2116022102715784
r2_score 0.8666698706049349
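
The first KNN task in the exercise, testing other values of k and other distance metrics, could be approached with a simple loop like the one below. It is a sketch on synthetic data from `make_regression`; the list of k values, the metric names and the variable names are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the diamonds attributes and prices.
X_toy, y_toy = make_regression(n_samples=400, n_features=4, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

scores = {}
for metric in ("euclidean", "manhattan", "chebyshev"):
    for k in (1, 3, 5, 10, 20):
        model = KNeighborsRegressor(n_neighbors=k, metric=metric)
        model.fit(X_tr, y_tr)
        scores[(metric, k)] = r2_score(y_te, model.predict(X_te))

# Best (metric, k) combination according to R2 on the held-out split.
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Choosing k on the test split gives an optimistic estimate; a separate validation split or cross-validation would be a safer way to pick the final setting.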


# Classification

In [16]:
# Load the dataset
df = pd.read_csv("../../Datasets/wisconsin_breast_cancer.csv")
# 0 for benign, 1 for malignant
df.drop('id', axis = 1, inplace=True)
df

Unnamed: 0,thickness,size,shape,adhesion,single,nuclei,chromatin,nucleoli,mitosis,class
0,5,1,1,1,2,1.0,3,1,1,0
1,5,4,4,5,7,10.0,3,2,1,0
2,3,1,1,1,2,2.0,3,1,1,0
3,6,8,8,1,3,4.0,3,7,1,0
4,4,1,1,3,2,1.0,3,1,1,0
...,...,...,...,...,...,...,...,...,...,...
694,3,1,1,1,3,2.0,1,1,1,0
695,2,1,1,1,2,1.0,1,1,1,0
696,5,10,10,3,7,3.0,8,10,2,1
697,4,8,6,4,3,4.0,10,6,1,1


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   thickness  699 non-null    int64  
 1   size       699 non-null    int64  
 2   shape      699 non-null    int64  
 3   adhesion   699 non-null    int64  
 4   single     699 non-null    int64  
 5   nuclei     683 non-null    float64
 6   chromatin  699 non-null    int64  
 7   nucleoli   699 non-null    int64  
 8   mitosis    699 non-null    int64  
 9   class      699 non-null    int64  
dtypes: float64(1), int64(9)
memory usage: 54.7 KB


In [18]:
df.duplicated().sum()

236

In [19]:
df.drop_duplicates(keep='first', inplace=True)
df.duplicated().sum()

0

In [20]:
df.isna().sum().sum()

14

In [21]:
df.dropna(inplace=True)
df.isna().sum().sum()

0

In [22]:
df['class'].value_counts()

1    236
0    213
Name: class, dtype: int64

In [23]:
# Split into features (X) and labels (y)
X = df.drop('class', axis = 1)
y = df['class']

In [24]:
X

Unnamed: 0,thickness,size,shape,adhesion,single,nuclei,chromatin,nucleoli,mitosis
0,5,1,1,1,2,1.0,3,1,1
1,5,4,4,5,7,10.0,3,2,1
2,3,1,1,1,2,2.0,3,1,1
3,6,8,8,1,3,4.0,3,7,1
4,4,1,1,3,2,1.0,3,1,1
...,...,...,...,...,...,...,...,...,...
693,3,1,1,1,2,1.0,2,1,2
694,3,1,1,1,3,2.0,1,1,1
696,5,10,10,3,7,3.0,8,10,2
697,4,8,6,4,3,4.0,10,6,1


In [25]:
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [26]:
# Initialize the k-NN model
k = 3
knn = KNeighborsClassifier(n_neighbors=k)

# Train the model
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

In [27]:
# Evaluate performance
def results(y_test, y_pred):
    # Local name differs from the function name to avoid shadowing it
    cm = confusion_matrix(y_test, y_pred)
    print('Confusion Matrix :')
    print(cm)
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy: %.2f%%" % (accuracy * 100.0))
    print('Report : ')
    print(classification_report(y_test, y_pred))

In [28]:
results(y_test,y_pred)

Confusion Matrix :
[[47  2]
 [ 2 39]]
Accuracy: 95.56%
Report : 
              precision    recall  f1-score   support

           0       0.96      0.96      0.96        49
           1       0.95      0.95      0.95        41

    accuracy                           0.96        90
   macro avg       0.96      0.96      0.96        90
weighted avg       0.96      0.96      0.96        90
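
The last KNN step of the exercise, trying different values of k, is not run above. One way to do it is to compare cross-validated accuracy across several k values, as in the sketch below. The data comes from `make_classification` as a stand-in for the cancer dataset, and the list of k values is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 450 samples and 9 attributes, similar in shape to the cleaned data.
X_toy, y_toy = make_classification(
    n_samples=450, n_features=9, n_informative=5, random_state=42
)

cv_accuracy = {}
for k in (1, 3, 5, 7, 9, 11, 15):
    clf = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds.
    cv_accuracy[k] = cross_val_score(clf, X_toy, y_toy, cv=5, scoring="accuracy").mean()

for k, acc in cv_accuracy.items():
    print(f"k={k:2d}  accuracy={acc:.3f}")
```

Odd values of k avoid tied votes in a two-class problem, which is why the list above skips even numbers.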



In [29]:
def computeClassificationDecisionTree(X_train, X_test, y_train, y_test,printResults):
    
    arvore_classificacao = tree.DecisionTreeClassifier()
    arvore_classificacao.fit(X_train,y_train)
    y_pred = arvore_classificacao.predict(X_test)
    if printResults:
        results(y_test, y_pred)
    return y_pred,arvore_classificacao

In [30]:
y_pred,arvore_classificacao = computeClassificationDecisionTree(X_train, X_test, y_train, y_test, True)

Confusion Matrix :
[[46  3]
 [ 4 37]]
Accuracy: 92.22%
Report : 
              precision    recall  f1-score   support

           0       0.92      0.94      0.93        49
           1       0.93      0.90      0.91        41

    accuracy                           0.92        90
   macro avg       0.92      0.92      0.92        90
weighted avg       0.92      0.92      0.92        90



In [31]:
# Export the fitted tree to DOT format; feature names are passed as a plain list
dot_data = tree.export_graphviz(arvore_classificacao, out_file=None,
                                feature_names=list(X_train.columns),
                                class_names=['benigno', 'maligno'],
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph.render("arvore")

'arvore.pdf'
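
When Graphviz is not installed, the tree can still be inspected as plain text with `tree.export_text`. The sketch below fits a small tree on synthetic data; the data, the `feature_names` list and `max_depth=2` are assumptions chosen to keep the printout short.

```python
from sklearn import tree
from sklearn.datasets import make_classification

# Synthetic stand-in for the cancer attributes and labels.
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical attribute names

# A shallow tree keeps the printed rules readable.
clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# One line per split or leaf, indented by depth.
rules = tree.export_text(clf, feature_names=feature_names)
print(rules)
```

With the notebook's own model, the call would take `arvore_classificacao` and the column names of `X_train` instead.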