Regression 
1. The goal is to test which algorithm is best to apply to the ‘Auto_CO2.csv’ dataset in order to determine CO2 production.
 
a. Compare linear regression, a degree-2 polynomial, and a degree-3 polynomial. 
Select the best algorithm. 
b. Use the algorithm selected in the previous part to predict new data. What conclusion do you draw from that 
prediction? 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import mean_squared_error, accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.utils import shuffle
from sklearn.datasets import load_iris
from math import sqrt
from statistics import mean


In [2]:
df = pd.read_csv("Auto_CO2.csv")

In [3]:
df

Unnamed: 0,Car,Model,Volume,Weight,CO2
0,Toyoty,Aygo,1000,790,99
1,Mitsubishi,Space Star,1200,1160,95
2,Skoda,Citigo,1000,929,95
3,Fiat,500,900,865,90
4,Mini,Cooper,1500,1140,105
5,VW,Up!,1000,929,105
6,Skoda,Fabia,1400,1109,90
7,Mercedes,A-Class,1500,1365,92
8,Ford,Fiesta,1500,1112,98
9,Audi,A1,1600,1150,99


In [4]:
# Correlation
df[["Volume", "Weight", "CO2"]].corr()

Unnamed: 0,Volume,Weight,CO2
Volume,1.0,0.753537,0.592082
Weight,0.753537,1.0,0.55215
CO2,0.592082,0.55215,1.0


In [5]:
# Standardize the data
scaler = StandardScaler()

X = scaler.fit_transform(df[["Volume", "Weight"]])
y = df[["CO2"]].to_numpy()

X, y = shuffle(X, y)

In [6]:
# Fit the regression
regr = LinearRegression()
regr.fit(X, y)

# Use K-fold cross-validation to evaluate the models

In [7]:
scores = cross_val_score(regr, X, y, cv=6, scoring="neg_mean_squared_error")

print(f"MSE: {sum(scores)/len(scores)}")

MSE: -45.09365914379734


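The reported value is negative because the `neg_mean_squared_error` scorer returns the negated MSE, so that a higher score is always better. The MSE itself cannot be negative; here it is about 45.09.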
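The sign convention can be checked with a minimal sketch. The data below is synthetic (an assumption made so the example runs without `Auto_CO2.csv`), and the names `X_demo`, `y_demo`, `mse_per_fold` and `rmse_per_fold` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 60 samples, 2 features, noisy linear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=60)

# The scorer returns -MSE for each fold, so every entry is <= 0.
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=6, scoring="neg_mean_squared_error"
)

mse_per_fold = -neg_mse                 # flip the sign to recover the usual MSE
rmse_per_fold = np.sqrt(mse_per_fold)   # same units as the target

print(f"MSE:  {mse_per_fold.mean():.3f}")
print(f"RMSE: {rmse_per_fold.mean():.3f}")
```

Reporting `-scores.mean()` keeps the printed number comparable with `mean_squared_error`, which is always non-negative.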

In [8]:
scores = cross_val_score(regr, X, y, cv=6, scoring="r2")

print(f"R^2: {sum(scores)/len(scores)}")

R^2: -2.1285010542202283


A negative $R^2$ means the model performs worse than a baseline that always predicts the mean of the target, which scores 0.
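A small hand-made example (the values are assumptions chosen for easy arithmetic, not data from the notebook) shows both cases with `r2_score`: the mean predictor scores exactly 0, and a predictor that is worse than the mean scores below 0.

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0, 4.0]

# Baseline: always predict the mean of y_true (2.5), which gives R^2 = 0.
r2_mean = r2_score(y_true, [2.5, 2.5, 2.5, 2.5])

# Reversed predictions: residual sum of squares = 20, total sum of squares = 5,
# so R^2 = 1 - 20/5 = -3.
r2_bad = r2_score(y_true, [4.0, 3.0, 2.0, 1.0])

print(r2_mean, r2_bad)
```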

In [9]:
# Expand the features into polynomial terms
X_2 = PolynomialFeatures(2).fit_transform(X)

regr = LinearRegression()
regr.fit(X_2, y)

scores = cross_val_score(regr, X_2, y, cv=6, scoring="neg_mean_squared_error")

print(f"MSE: {sum(scores)/len(scores)}")

MSE: -51.04501402434672


In [10]:
scores = cross_val_score(regr, X_2, y, cv=6, scoring="r2")

print(f"R^2: {sum(scores)/len(scores)}")

R^2: -1.7732646262791558


In [11]:
X_3 = PolynomialFeatures(3).fit_transform(X)

regr = LinearRegression()
regr.fit(X_3, y)

scores = cross_val_score(regr, X_3, y, cv=6, scoring="neg_mean_squared_error")

print(f"MSE: {sum(scores)/len(scores)}")

MSE: -55.23051850260086


In [12]:
scores = cross_val_score(regr, X_3, y, cv=6, scoring="r2")

print(f"R^2: {sum(scores)/len(scores)}")

R^2: -2.3175377636676697


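Of the three options, the degree-2 polynomial has the highest coefficient of determination ($R^2 = -1.77$), but the linear model has the lowest MSE (45.09, versus 51.05 for degree 2 and 55.23 for degree 3). Every $R^2$ is negative, so on average over these folds none of the models beats a constant prediction of the mean.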
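One possible variation on this comparison, sketched under stated assumptions, wraps scaling, polynomial expansion and regression in a single pipeline so that the scaler is refit on each training fold instead of on the full dataset. This is an alternative setup, not the procedure used in the cells above; the data is synthetic and the names `X_demo`, `y_demo`, `results` and `best_degree` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption), not the Auto_CO2 values.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(60, 2))
y_demo = 1.5 * X_demo[:, 0] + 0.5 * X_demo[:, 1] ** 2 + rng.normal(scale=0.3, size=60)

# Fixed, shuffled folds so every degree is scored on the same splits.
cv = KFold(n_splits=6, shuffle=True, random_state=0)

results = {}
for degree in (1, 2, 3):
    # Scaling and feature expansion are refit inside each training fold.
    model = make_pipeline(
        StandardScaler(), PolynomialFeatures(degree), LinearRegression()
    )
    neg_mse = cross_val_score(
        model, X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error"
    )
    r2 = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    results[degree] = {"mse": -neg_mse.mean(), "r2": r2.mean()}

for degree, scores in results.items():
    print(f"degree {degree}: MSE={scores['mse']:.3f}  R^2={scores['r2']:.3f}")

# Pick the degree with the lowest cross-validated MSE.
best_degree = min(results, key=lambda d: results[d]["mse"])
print("lowest MSE at degree", best_degree)
```

Selecting by a single metric on shared folds avoids the situation where MSE and $R^2$ point to different models without that being noticed.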

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3) 

regr = LinearRegression()
regr.fit(X_train, y_train)
pred = regr.predict(X_test)

print(f"MSE: {mean_squared_error(y_test, pred)}")

MSE: 29.355834089483746
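Part b asks for a prediction on new data. A minimal sketch of that step follows, assuming the ten rows shown in the `df` output are representative (the CSV itself may contain more rows) and using a hypothetical new car whose `Volume` and `Weight` values are invented for illustration. Keeping the scaler inside the pipeline means the new row is standardized with the same parameters as the training data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The ten rows shown in the df output (the CSV itself may hold more rows).
cars = pd.DataFrame({
    "Volume": [1000, 1200, 1000, 900, 1500, 1000, 1400, 1500, 1500, 1600],
    "Weight": [790, 1160, 929, 865, 1140, 929, 1109, 1365, 1112, 1150],
    "CO2":    [99, 95, 95, 90, 105, 105, 90, 92, 98, 99],
})

# Scaler and regression are fitted together on the training rows.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(cars[["Volume", "Weight"]], cars["CO2"])

# Hypothetical new car (assumption): values in the same units as the dataset.
new_car = pd.DataFrame({"Volume": [1300], "Weight": [1000]})
predicted_co2 = model.predict(new_car)

print(f"Predicted CO2: {predicted_co2[0]:.1f}")
```

Given the negative cross-validated $R^2$ values reported earlier, a prediction like this should be read with caution.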


Classification 
2. Using the Iris dataset, determine which algorithm is best for predicting the species. 
Test decision trees, kNN (k=3), and Naive Bayes.

In [14]:
X, y = load_iris(return_X_y=True)

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2) 

In [20]:
# Recall of each model
total_recall = list()
total_fpr = list()

tree = DecisionTreeClassifier()

tree.fit(X_train, y_train)
pred = tree.predict(X_test)
accuracy_score(y_test, pred)


0.9333333333333333

In [17]:
# Macro-averaged recall (recall measures how many actual positives are found)
macro_avg_recall = recall_score(y_test, pred, average="macro")
total_recall.append(macro_avg_recall)

# False positive rate per class, computed one-vs-rest from the confusion matrix
conf_mat = confusion_matrix(y_test, pred)

fpr_per_class = dict()

for i in range(len(np.unique(y))):
    false_positives = sum(conf_mat[:, i]) - conf_mat[i, i]
    true_negatives = conf_mat.sum() - sum(conf_mat[i, :]) - sum(conf_mat[:, i]) + conf_mat[i, i]
    fpr_per_class[i] = false_positives / (false_positives + true_negatives)

fpr = mean(fpr_per_class.values())
total_fpr.append(fpr)
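The same one-vs-rest false positive rate can be written without an explicit loop. The helper `macro_false_positive_rate` below is a hypothetical name introduced for illustration, and the confusion matrix is hand-made rather than a result from this notebook.

```python
import numpy as np

def macro_false_positive_rate(conf_mat):
    """Average one-vs-rest false positive rate over all classes.

    Rows of conf_mat are true labels and columns are predicted labels,
    the layout returned by sklearn.metrics.confusion_matrix.
    """
    conf_mat = np.asarray(conf_mat)
    true_pos = np.diag(conf_mat)
    false_pos = conf_mat.sum(axis=0) - true_pos   # predicted as class i, truly another class
    false_neg = conf_mat.sum(axis=1) - true_pos   # truly class i, predicted as another class
    true_neg = conf_mat.sum() - true_pos - false_pos - false_neg
    return float(np.mean(false_pos / (false_pos + true_neg)))

# Hand-made 3-class matrix (assumption, not a result from the notebook).
example = [[5, 0, 0],
           [0, 4, 1],
           [0, 1, 4]]

# Per-class FPR is 0/10, 1/10 and 1/10, so the macro average is 0.2/3.
example_fpr = macro_false_positive_rate(example)
print(example_fpr)
```

Wrapping the calculation in a function makes it straightforward to apply to the kNN and Naive Bayes predictions as well as to the decision tree.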

In [18]:
knn = KNeighborsClassifier(3)

knn.fit(X_train, y_train)
pred = knn.predict(X_test)
accuracy_score(y_test, pred)

0.9333333333333333

In [19]:
naive = GaussianNB()

naive.fit(X_train, y_train)
pred = naive.predict(X_test)
accuracy_score(y_test, pred)

0.9333333333333333
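All three classifiers reach the same accuracy on this single 80/20 split, so the split alone does not separate them. One way to compare them more robustly, sketched here as a suggestion rather than as part of the original exercise, is stratified cross-validation with both accuracy and macro recall. The fold count, the `random_state` values and the names `models` and `summary` are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "kNN (k=3)": KNeighborsClassifier(n_neighbors=3),
    "naive Bayes": GaussianNB(),
}

# Stratified folds keep the three species balanced in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

summary = {}
for name, model in models.items():
    scores = cross_validate(
        model, X_iris, y_iris, cv=cv, scoring=["accuracy", "recall_macro"]
    )
    summary[name] = {
        "accuracy": scores["test_accuracy"].mean(),
        "recall_macro": scores["test_recall_macro"].mean(),
    }

for name, s in summary.items():
    print(f"{name:14s} accuracy={s['accuracy']:.3f}  macro recall={s['recall_macro']:.3f}")
```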