# 1. Decision tree for classification

**Objective:** train and test a decision tree model that classifies land-use type from satellite images.


This dataset is used to classify land use in geospatial images.
https://www.kaggle.com/apollo2506/eurosat-dataset

**Feature information**
This dataset contains images from the EuroSAT dataset. There are 10 folders, one per class:
* 0 AnnualCrop
* 1 Forest
* 2 HerbaceousVegetation
* 3 Highway
* 4 Industrial
* 5 Pasture
* 6 PermanentCrop
* 7 Residential
* 8 River
* 9 SeaLake


**Number of instances:** 27000

# 2. Authenticating to Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# 3. Importing libraries

In [None]:
import ____ as pd
import ___ # Operating system
import numpy as ___
import itertools
from sklearn.____ import confusion_matrix
import _____ as plt
import random
import ____ as sns

In [None]:
from ______ import LabelEncoder
from ______ import MinMaxScaler
from ______ import PCA
from sklearn.model_selection import train_test_split
from sklearn.____ import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn import metrics

# 4. Reading the file

In [None]:
file_path = _________
images_path = _______ # Path to the EuroSAT images
train_path = os.path.join(_____,'EUROSAT_TRAIN_FEAT.csv')

In [None]:
train_df = pd.____(____)
train_df.head()

In [None]:
train_df.shape

In [None]:
clases = train_df['label'].unique()
clases

# 5. Exploratory data analysis (EDA)

In [None]:
plt.figure(figsize=(20,20))
for i,folder in enumerate(clases):
    path_folder = os.path.join(images_path, folder)
    imgs_list =os.listdir(path_folder)
    random.shuffle(imgs_list)
    for j in range(3):
      img_path = os.path.join(path_folder,imgs_list[j])
      plt.subplot(10,10,j*10+i+1)
      img = plt.imread(img_path)
      plt.imshow(img)
      plt.tick_params(axis='both',which='both', bottom=False, top=False, left=False, right=False,
                        labelbottom=False, labelleft=False)
      if j==2:
        plt.xlabel(folder,
        horizontalalignment='center',
        verticalalignment='top', fontsize=13)
plt.show()

# 6. Data cleaning

#### a) Scaling

In [None]:
scaler = _______(feature_range=(___, ___))
train_df.loc[:, train_df.columns != 'label'] = scaler.______(train_df.loc[:, train_df.columns != 'label'])
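The scaling step rescales every feature column linearly into a fixed range. As a rough illustration of that formula, here is a minimal pure-Python sketch. The helper name `min_max_scale`, the `(0.0, 1.0)` range and the sample values are assumptions made for this example; the notebook leaves the actual range blank.

```python
def min_max_scale(column, feature_range=(0.0, 1.0)):
    """Rescale a list of numbers linearly into feature_range."""
    lo, hi = feature_range
    col_min, col_max = min(column), max(column)
    span = col_max - col_min
    if span == 0:
        # Assumption of this sketch: a constant column maps to the lower bound.
        return [lo for _ in column]
    return [lo + (x - col_min) / span * (hi - lo) for x in column]


# Hypothetical feature values, not taken from EUROSAT_TRAIN_FEAT.csv.
raw = [10.0, 20.0, 40.0]
scaled = min_max_scale(raw)
print(scaled)
```

The smallest value lands on the lower bound and the largest on the upper bound, so features measured on very different scales become comparable before PCA.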

#### b) Label encoding

In [None]:
le = ____()
train_df['label'] = le._______(train_df.label.values)
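Label encoding replaces each class name with an integer code, where the codes follow the sorted order of the class names. The sketch below mimics that behaviour in plain Python so the mapping is easy to inspect. `fit_label_encoder` is a hypothetical helper written for this illustration, and the sample labels are a small made-up subset of the EuroSAT classes.

```python
def fit_label_encoder(labels):
    """Return the sorted class list and a class-name -> integer-code mapping."""
    classes = sorted(set(labels))
    mapping = {name: code for code, name in enumerate(classes)}
    return classes, mapping


# Made-up sample of labels, in order of appearance.
labels = ["River", "Forest", "SeaLake", "Forest"]
classes, mapping = fit_label_encoder(labels)

encoded = [mapping[name] for name in labels]   # class names -> integer codes
decoded = [classes[code] for code in encoded]  # integer codes -> class names
print(classes, encoded)
```

Because the codes follow sorted order rather than order of appearance, any list of class names used later for plot labels should be in that same sorted order.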

#### c) Principal component analysis with cumulative explained variance of at least 80%

In [None]:
pca = ____(0.8)
pc = pca._____(train_df.iloc[:,:-1])
df_pca_train = pd.DataFrame(data = pc,
                           columns=range(pc.shape[1]))
df_pca_train = pd.concat([df_pca_train, train_df[['label']]], axis = 1)
df_pca_train.head()

Printing the explained variance and the number of principal components

In [None]:
print('Number of principal components: %s'%len(___.explained_variance_ratio_))
print('Cumulative variance with %s components: %s'%(len(pca._____),np.sum(____.explained_variance_ratio_)))
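Passing a fraction such as 0.8 to PCA asks for just enough leading components to reach that share of the total variance. The selection rule can be sketched as below, assuming the explained-variance ratios are already sorted in decreasing order. The function name and the ratios are hypothetical, and scikit-learn's exact handling of a sum that lands exactly on the threshold may differ from this sketch.

```python
def components_for_variance(ratios, threshold=0.8):
    """Smallest k whose first k ratios sum to at least threshold."""
    cumulative = 0.0
    for k, ratio in enumerate(ratios, start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return k, cumulative
    # Threshold never reached: keep every component.
    return len(ratios), cumulative


# Hypothetical explained-variance ratios, sorted from largest to smallest.
ratios = [0.5, 0.2, 0.15, 0.1, 0.05]
k, cumulative = components_for_variance(ratios)
print(k, cumulative)
```

With these made-up ratios two components explain only 70% of the variance, so a third is needed to cross the 80% mark.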

Renaming the columns of the df_pca_train dataframe

In [None]:
feat_names = ['PC_'+str(i+1) for i in range(len(pca.explained_variance_ratio_))]
df_pca_train.columns=feat_names+['label']
print(feat_names)

In [None]:
g = sns.PairGrid(data=df_pca_train, vars=feat_names, hue='label', height=2)
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter)
g.add_legend()

# 7. CART decision tree model using holdout validation


In [None]:
seed = 6

In [None]:
Xtrain, Xtest, Ytrain, Ytest = train_test_split(_____, train_df['label'], test_size = ____, random_state = seed)
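Holdout validation sets aside a fixed fraction of the rows for testing and trains on the rest, with a seed making the split reproducible. The sketch below shows the idea in plain Python. It is not the algorithm `train_test_split` uses internally, and `holdout_split`, the 0.25 test fraction and the toy rows are assumptions for illustration only (the notebook leaves `test_size` blank).

```python
import random


def holdout_split(rows, labels, test_size=0.25, seed=6):
    """Shuffle row indices with a fixed seed and cut off a test fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [rows[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Toy data: 8 rows with one feature each and alternating labels.
rows = [[float(i)] for i in range(8)]
labels = [i % 2 for i in range(8)]
x_train, x_test, y_train, y_test = holdout_split(rows, labels)
print(len(x_train), len(x_test))
```

Calling the function again with the same seed returns the same partition, which is what makes scores comparable between runs.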

In [None]:
n_classes = len(clases)

# 8. Creating the model

a) Instantiating a decision tree

In [None]:
dectree = ______(random_state=seed, max_depth = ___)

b) Training

In [None]:
dectree = dectree._____(____,____)
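During training, a CART tree picks each split by comparing how impure the resulting child nodes are; scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default. The sketch below computes that quantity for a node and for a candidate split. The helper names and the toy labels are assumptions for illustration and are not part of the notebook.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


parent = [0, 0, 1, 1]
print(gini(parent))                    # impurity of a perfectly mixed two-class node
print(split_impurity([0, 0], [1, 1]))  # split that separates the classes
print(split_impurity([0, 1], [0, 1]))  # split that separates nothing
```

A split is worth making when its weighted impurity is lower than the parent's; limiting `max_depth` caps how many such splits are stacked and so limits overfitting.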

**Training score**

In [None]:
dectree.score(_____,_____)

Plotting the tree

In [None]:
plt.figure(figsize = (25,12))
# le.classes_ lists the class names in the same order as the encoded labels
plot_tree(dectree, feature_names = feat_names, class_names = list(le.classes_), filled = True, fontsize=8)
plt.savefig('dectree_eurosat.png',format='png',bbox_inches = "tight")

# 9. Predicting on the test data

In [None]:
y_pred = dectree._____(Xtest)

a) Computing the overall performance of the model

In [None]:
score = metrics.accuracy_score(____, ____)
print("Test Acc: %s"%____)

b) Predictions vs. true labels

In [None]:
predictions = np.float32(_____)
true_labels = np.float32(_____)

c) Confusion matrix to evaluate the errors

In [None]:
def plot_confusion_matrix(cm, classes, tit, normalize=False):
    """Plot a confusion matrix (rows: true class, columns: predicted class)."""
    if normalize:
        # Divide each row by its own total so every true class sums to 1
        cm = cm.astype('float')/cm.sum(axis=1, keepdims=True)
        title, fmt = 'Normalized confusion matrix', '.2f'
    else:
        title, fmt = tit, 'd'
    plt.figure(figsize=(10,8))
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title(title)
    plt.colorbar(pad=0.05)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=40)
    plt.yticks(tick_marks, classes)
    # Use white text on dark cells and black text on light cells
    thresh = cm.max()/2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True class', fontsize=10)
    plt.xlabel('Predicted class', fontsize=10)
    plt.tight_layout()
    plt.savefig(title+'.png')
    plt.show()

In [None]:
cnf_matrix = confusion_matrix(______, _______, labels=range(n_classes))
tit = 'Decision tree (CART) confusion matrix'
# le.classes_ follows the encoded label order 0..n_classes-1 used for the matrix rows
plot_confusion_matrix(cnf_matrix, le.classes_, tit, normalize=False)
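A confusion matrix is a table of counts in which row `i`, column `j` holds the number of samples of true class `i` that were predicted as class `j`. The following pure-Python sketch builds such a table and normalizes it by row. The helper names and the tiny three-class example are made up for illustration.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Count (true, predicted) pairs; rows are true classes, columns predictions."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def normalize_rows(cm):
    """Divide each row by its total so every true class sums to 1."""
    normalized = []
    for row in cm:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


# Made-up labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_counts(y_true, y_pred, n_classes=3)
cm_norm = normalize_rows(cm)
print(cm)
print(cm_norm)
```

The diagonal holds the correct predictions, and after row normalization each diagonal entry is the fraction of that class recovered by the model.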

d) Other metrics to evaluate performance

In [None]:
sensitivity = []
specificity = []
precision = []
# One-vs-rest counts for each encoded class 0..n_classes-1
for name in range(n_classes):
  TP = np.sum((true_labels==name) & (predictions==name))
  TN = np.sum((true_labels!=name) & (predictions!=name))
  FP = np.sum((true_labels!=name) & (predictions==name))
  FN = np.sum((true_labels==name) & (predictions!=name))
  sensitivity.append(TP/(TP+FN))  # recall: true positive rate
  specificity.append(TN/(TN+FP))  # true negative rate
  precision.append(TP/(TP+FP))    # share of predicted positives that are correct
# Last row: unweighted (macro) average over the classes
sensitivity.append(np.mean(sensitivity))
specificity.append(np.mean(specificity))
precision.append(np.mean(precision))
d = {'Sensitivity':sensitivity, 'Specificity':specificity, 'Precision':precision}
ind = list(le.classes_)+['Average']
df = pd.DataFrame(d, index=ind)
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.heatmap(df, annot=True)
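The per-class metrics can also be read directly off a confusion matrix by treating each class as "positive" in turn: sensitivity is TP/(TP+FN), specificity is TN/(TN+FP) and precision is TP/(TP+FP). The sketch below does this in plain Python. `per_class_metrics`, `safe_div` and the small three-class matrix are assumptions for illustration, and returning 0.0 on an empty denominator is a choice made only for this sketch.

```python
def safe_div(num, den):
    """Return num/den, or 0.0 when the denominator is zero (sketch-only choice)."""
    return num / den if den else 0.0


def per_class_metrics(cm):
    """One-vs-rest sensitivity, specificity and precision for each class."""
    total = sum(sum(row) for row in cm)
    results = []
    for k in range(len(cm)):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                 # rest of row k
        fp = sum(row[k] for row in cm) - tp  # rest of column k
        tn = total - tp - fn - fp
        results.append({
            "sensitivity": safe_div(tp, tp + fn),
            "specificity": safe_div(tn, tn + fp),
            "precision": safe_div(tp, tp + fp),
        })
    return results


# Made-up confusion matrix: rows are true classes, columns are predictions.
cm = [[1, 1, 0],
      [0, 2, 0],
      [1, 0, 1]]
class_metrics = per_class_metrics(cm)
for k, m in enumerate(class_metrics):
    print(k, m)
```

Specificity is usually high in a ten-class problem because most samples are true negatives for any single class, so sensitivity and precision tend to be the more informative columns.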