# Spam in text messages

## Loading the dataset

In [8]:
import pandas as pd
import io

# Load the data into a DataFrame
# Alternative when the file is available in memory as uploaded['spam.csv']:
# df_spam = pd.read_csv(io.StringIO(uploaded['spam.csv'].decode('iso8859')))
df_spam = pd.read_csv('datasets/spam.csv', encoding='iso8859')
df_spam.head()

| | v1 | v2 | x1 | x2 | x3 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
| 1 | ham | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |
| 3 | ham | U dun say so early hor... U c already then say... | NaN | NaN | NaN |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | NaN | NaN | NaN |


In [9]:
# Categories in the dataset
df_spam['v1'].unique()

array(['ham', 'spam'], dtype=object)

In [4]:
len(df_spam)

5572

In [5]:
df_spam['v1'].value_counts()

ham     4825
spam     747
Name: v1, dtype: int64

As shown, this is a binary classification dataset, since the target column has two unique categories. It has 5572 rows (examples). Note that it is fairly imbalanced: the *ham* category has many more examples (4825) than *spam* (747).
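The imbalance can be quantified directly from the counts reported above. The following is a minimal sketch in plain Python; the variable names are illustrative and not part of the notebook.

```python
# Class counts as reported by df_spam['v1'].value_counts() above
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}
imbalance_ratio = counts["ham"] / counts["spam"]

for label, share in proportions.items():
    print(f"{label}: {share:.1%}")
print(f"ham:spam ratio = {imbalance_ratio:.2f}")
```

Roughly 87% of the messages are *ham*, so a classifier that always predicted *ham* would already reach about that accuracy. This is one reason accuracy alone is not a sufficient metric for this dataset.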

## Preprocessing

### Handling `NaN` values

As seen above, in the first rows the last three columns are filled with `NaN`. To decide whether to keep these columns, let us look at how many values they hold and what those values are.

In [6]:
df_spam.count()

v1    5572
v2    5572
x1      50
x2      12
x3       6
dtype: int64

In [7]:
df_spam[pd.notna(df_spam.loc[:, 'x2'])].head()

Unnamed: 0,v1,v2,x1,x2,x3
95,spam,Your free ringtone is waiting to be collected....,PO Box 5249,"MK17 92H. 450Ppw 16""",
281,ham,\Wen u miss someone,the person is definitely special for u..... B...,why to miss them,"just Keep-in-touch\"" gdeve.."""
899,spam,Your free ringtone is waiting to be collected....,PO Box 5249,"MK17 92H. 450Ppw 16""",
1038,ham,"Edison has rightly said, \A fool can ask more ...",GN,GE,"GNT:-)"""
2170,ham,\CAN I PLEASE COME UP NOW IMIN TOWN.DONTMATTER...,JUST REALLYNEED 2DOCD.PLEASE DONTPLEASE DONTIG...,"U NO THECD ISV.IMPORTANT TOME 4 2MORO\""""",


The remaining columns appear to be the continuation of the text in column `v2`, so their content is merged into that column.

In [8]:
df_spam.loc[:, 'v2'] = df_spam.loc[:, 'v2' : 'x3'].fillna('').sum(axis=1)
df_spam.drop(['x1', 'x2', 'x3'], axis=1, inplace=True)

df_spam.loc[281, 'v2']

'\\Wen u miss someone the person is definitely special for u..... But if the person is so special why to miss them just Keep-in-touch\\" gdeve.."'

This removes all `NaN` values without losing any of the input data.
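The idea behind `fillna('').sum(axis=1)` is simply to treat a missing field as an empty string and concatenate what is left. A minimal plain-Python sketch of the same idea follows; `merge_fields` and the sample row are hypothetical and only illustrate the behaviour.

```python
def merge_fields(*fields):
    """Concatenate text fields, treating missing values (None) as empty strings."""
    return "".join("" if f is None else f for f in fields)

# Hypothetical row whose message was split across columns by stray commas
row = {"v2": "Your free ringtone is waiting", "x1": " PO Box 5249", "x2": None, "x3": None}
merged = merge_fields(row["v2"], row["x1"], row["x2"], row["x3"])
print(merged)
```

As in the notebook cell, no separator is inserted between fields, so the commas that originally split the message are not restored.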

### Text preprocessing

Next, the text itself is preprocessed by applying:

*   Stop word removal.
*   Stemming.
*   Inclusion of n-grams (up to n=2).
*   A minimum document frequency (two documents).
*   Retention of alphabetic tokens only.
*   Generation of the TF-IDF matrix.
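The token-level steps in this list can be sketched in plain Python before looking at the full vectorizer. This is an illustration under stated assumptions: stemming is left out (the notebook relies on NLTK's `PorterStemmer`), `STOP_WORDS` is a tiny hypothetical stand-in for scikit-learn's list, and the regular expression mirrors scikit-learn's default token pattern.

```python
import re

# Hypothetical tiny stop-word list; the notebook uses scikit-learn's ENGLISH_STOP_WORDS
STOP_WORDS = {"the", "is", "to", "a", "in"}

def tokenize(doc):
    """Lowercase, keep alphabetic tokens of 2+ characters, drop stop words."""
    words = re.findall(r"\b\w\w+\b", doc.lower())
    return [w for w in words if w.isalpha() and w not in STOP_WORDS]

def with_bigrams(tokens):
    """Return unigrams followed by space-joined bigrams (n-grams up to n=2)."""
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def min_df_filter(docs_terms, min_df=2):
    """Keep only terms that appear in at least min_df documents."""
    df = {}
    for terms in docs_terms:
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
    return sorted(t for t, n in df.items() if n >= min_df)

docs = ["Free entry to win a prize", "Call now to win a prize", "See you in the evening"]
terms = [with_bigrams(tokenize(d)) for d in docs]
vocab = min_df_filter(terms)
print(vocab)
```

Only `prize`, `win` and the bigram `win prize` occur in two documents, so they are the only terms that survive the frequency filter. Note that bigrams are built after stop words are removed, which is also what happens when a custom tokenizer is passed to the vectorizer.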

In [0]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.stem import PorterStemmer

from sklearn.feature_extraction import text

# Create the stemmer
stemmer = PorterStemmer()

# Build a tokenizer
tokenizer = CountVectorizer().build_tokenizer()

# Stop words provided by scikit-learn
stop_words = text.ENGLISH_STOP_WORDS

# Tokenize, drop stop words and non-alphabetic tokens, then stem what remains
def stem_tokenizer(doc):
    # Apply the tokenization
    tokens = tokenizer(doc)
    # Return the stemmed tokens that are alphabetic and not stop words
    return [stemmer.stem(w) for w in tokens if w not in stop_words and w.isalpha()]

# Apply the processing
vectorizer = TfidfVectorizer(tokenizer=stem_tokenizer, ngram_range=(1,2), min_df=2)
tokens = vectorizer.fit_transform(df_spam['v2'].tolist())
# get_feature_names() was removed in scikit-learn 1.2; use it only on older versions
features = vectorizer.get_feature_names_out()

df_tokens = pd.DataFrame(tokens.toarray(), columns=features)

In [10]:
df_tokens.head()

Unnamed: 0,aah,aathi,aathi dear,aathi love,abi,abil,abiola,abj,abl,abl come,abl deliv,abl pay,absolutli,absolutli fine,abt,abt tht,abt ur,abta,abta complimentari,aburo,aburo enjoy,abus,ac,ac stop,acc,accept,accept brother,accept day,access,access number,accid,accid claim,accident,accident delet,accomod,accordingli,account,account balanc,account bank,account detail,...,yiju meet,ym,yo,yo come,yo valentin,yo yo,yoga,yogasana,yogasana oso,yor,your,yr,yr prize,yummi,yun,yun ah,yunni,yuo,yuo exmpel,yuo ra,yup,yup have,yup lor,yup ok,yup thk,zed,zed logo,zed profit,zoe,åð,ìï,ìï come,ìï dun,ìï got,ìï home,ìï ma,ìï sch,ìï wait,ìï wan,ûò
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


With a method for obtaining the resulting matrix in place, we move on to building the classification models.

## Building the models
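To make the values in the matrix less opaque, the weighting can be reproduced by hand. The sketch below follows scikit-learn's documented defaults for `TfidfVectorizer` (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization per document); the function name and the toy documents are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf(docs_tokens):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(docs_tokens)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in docs_tokens for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in docs_tokens:
        tf = Counter(tokens)
        weights = {t: c * idf[t] for t, c in tf.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs_tokens = [["free", "prize", "win"], ["win", "now"], ["see", "you"]]
rows = tfidf(docs_tokens)
print(rows[0])
```

In the first document, `win` receives a lower weight than `free` because it also appears in the second document. This also explains why `df_tokens.head()` is almost entirely zeros: each message only contains a handful of the vocabulary terms.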



Two models are chosen for this case, SVM and Naive Bayes. The first is a binary classifier that works well in high-dimensional spaces (such as text), while the second tends to perform well on text, is robust to missing values and outliers, and has the advantage of supporting incremental training.

To generalize the processing applied in the previous cell, a pipeline is built.

In [0]:
# Get the attributes and the classes separately
X, y = df_spam['v2'].values, df_spam['v1'].values

In [0]:
from sklearn.pipeline import make_pipeline

from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB

# Build two pipelines, one for each chosen model
pipe_a = make_pipeline(
     TfidfVectorizer(tokenizer=stem_tokenizer, ngram_range=(1,2), min_df=2),
     SVC())

pipe_b = make_pipeline(
     TfidfVectorizer(tokenizer=stem_tokenizer, ngram_range=(1,2), min_df=2),
     MultinomialNB())

## Evaluating the models

As for the metrics, the main one is the F-1 Score, which combines precision (the fraction of positive predictions that are correct) and recall (the fraction of actual positive examples that are correctly classified) into a single metric, making models easier to compare. Macro-averaging is used, which gives equal weight to both classes. Accuracy is also computed, along with training and prediction times.
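A small worked example may help show why macro-averaged F-1 is more informative than accuracy on imbalanced data. The labels below are hypothetical (8 *ham*, 2 *spam*, one spam message missed), and `f1_per_class` is an illustrative helper, not a scikit-learn function.

```python
def f1_per_class(y_true, y_pred, label):
    """Return (precision, recall, F1) treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = ["ham"] * 8 + ["spam"] * 2
y_pred = ["ham"] * 8 + ["spam", "ham"]  # one spam message is missed

scores = {label: f1_per_class(y_true, y_pred, label) for label in ("ham", "spam")}
# Macro-averaging: unweighted mean of the per-class F1 scores
f1_macro = sum(f1 for _, _, f1 in scores.values()) / len(scores)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f}, f1_macro={f1_macro:.2f}")
```

Accuracy is 0.90, while the macro F-1 is about 0.80, because the single missed spam message halves the recall of the minority class and that class counts as much as the majority one in the average.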

A variant of k-fold cross-validation is used, which has the advantage that every example is used, at some point, for both training and testing. The variant is `StratifiedKFold`, which tries to preserve the proportion of samples per class in each fold. It is chosen because of the disparity in the number of examples per class (most belong to the `ham` class).

`k = 10` is chosen, since a good number of examples is available.
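The stratification idea can be illustrated with a simplified sketch: indices are dealt to folds class by class, so each fold ends up with approximately the same class proportions. This is not scikit-learn's exact algorithm, only an approximation of its effect; `stratified_folds` is a hypothetical helper.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, class by class, round-robin."""
    folds = [[] for _ in range(k)]
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    for indices in by_class.values():
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds

labels = ["ham"] * 40 + ["spam"] * 10
folds = stratified_folds(labels, k=5)
spam_per_fold = [sum(labels[i] == "spam" for i in fold) for fold in folds]
print(spam_per_fold)
```

Every fold receives 8 *ham* and 2 *spam* examples, matching the 80/20 split of the full list. Without stratification, a fold could by chance contain very few minority examples, which would make its per-fold F-1 unstable.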

In [0]:
from sklearn.model_selection import cross_validate, StratifiedKFold, cross_val_predict

scoring = ['f1_macro', 'accuracy']

cv = StratifiedKFold(n_splits=10)
scores_a = cross_validate(pipe_a, X, y, cv=cv, scoring=scoring)
scores_b = cross_validate(pipe_b, X, y, cv=cv, scoring=scoring)

In [82]:
pd.DataFrame([[scores_a['test_f1_macro'].mean(), scores_a['test_accuracy'].mean(), 
               scores_a['fit_time'].mean(), scores_a['score_time'].mean()],
              [scores_b['test_f1_macro'].mean(), scores_b['test_accuracy'].mean(),
               scores_b['fit_time'].mean(), scores_b['score_time'].mean()]]
             , index=['SVM', 'Naive Bayes'], 
             columns=['F1 macro','Accuracy','Fit time','Score time']).round(2)

| | F1 macro | Accuracy | Fit time | Score time |
|---|---|---|---|---|
| SVM | 0.96 | 0.98 | 2.45 | 0.23 |
| Naive Bayes | 0.92 | 0.97 | 0.87 | 0.09 |


The SVM classifier obtains better values for both F-1 Score and Accuracy. On the other hand, the Naive Bayes model is faster in both training and prediction.

Finally, predictions are made with `cross_val_predict()`, and the confusion matrix is shown for each model.

In [83]:
from sklearn.metrics import classification_report, confusion_matrix

cv_pred_a = cross_val_predict(pipe_a, X, y, cv=10)
cv_pred_b = cross_val_predict(pipe_b, X, y, cv=10)

print('SVM :')
print(confusion_matrix(y, cv_pred_a))
print('\nNaive Bayes :')
print(confusion_matrix(y, cv_pred_b))

SVM :
[[4815   10]
 [  91  656]]

Naive Bayes :
[[4823    2]
 [ 172  575]]


Overall, SVM obtains better results, but Naive Bayes produces fewer type I errors (false positives, taking *spam* as the positive class), since it assigns only 2 *ham* examples to the *spam* category, compared with 10 for SVM. In exchange, it misses more spam: 172 messages versus 91.

SVM is therefore the more accurate model, although it is relatively slower and produces more type I errors.
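The two kinds of error can be read off the confusion matrices printed above. The sketch below assumes *spam* is the positive class and uses scikit-learn's layout (rows are true labels, columns are predicted labels, both in the order ham, spam); `error_rates` is an illustrative helper.

```python
# Confusion matrices printed above: rows are true labels (ham, spam),
# columns are predicted labels (ham, spam).
matrices = {
    "SVM": [[4815, 10], [91, 656]],
    "Naive Bayes": [[4823, 2], [172, 575]],
}

def error_rates(cm):
    """Treat spam as the positive class and return (FP rate, FN rate)."""
    (tn, fp), (fn, tp) = cm
    return fp / (fp + tn), fn / (fn + tp)

rates = {name: error_rates(cm) for name, cm in matrices.items()}
for name, (fpr, fnr) in rates.items():
    print(f"{name}: ham flagged as spam = {fpr:.2%}, spam missed = {fnr:.2%}")
```

Both models flag well under 1% of legitimate messages as spam, but Naive Bayes misses roughly 23% of the spam against about 12% for SVM. Which trade-off is preferable depends on whether losing a legitimate message or letting spam through is considered more costly.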

# Human Activity Recognition

## Loading the dataset

In [0]:
#!pip install openml

In [10]:
import openml

dataset_har = openml.datasets.get_dataset(1478)

X, y, categorical_indicator, attribute_names = dataset_har.get_data(
    dataset_format='dataframe',
    target=dataset_har.default_target_attribute
)

In [11]:
# Show information about the target class
print(dataset_har.default_target_attribute)
y.unique()

Class


[5, 4, 6, 1, 3, 2]
Categories (6, object): [1 < 2 < 3 < 4 < 5 < 6]

In [117]:
# Number of examples in each category
y.value_counts()

6    1944
5    1906
4    1777
1    1722
2    1544
3    1406
Name: Class, dtype: int64

In [118]:
X.shape

(10299, 561)

The examples in the dataset belong to one of six categories, identified by the integers 1 to 6. The dataset has 10299 rows and 561 attributes.

According to the information provided on `openml`, the categories correspond to the activities users perform while information about their movement is recorded. These activities are:

WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING

It is assumed that they are numbered in the order in which they are listed.

In [119]:
dict_class = dict(enumerate(('WALKING', 'WALKING_UPSTAIRS', 'WALKING_DOWNSTAIRS', 
                             'SITTING', 'STANDING', 'LAYING'), start=1))
dict_class

{1: 'WALKING',
 2: 'WALKING_UPSTAIRS',
 3: 'WALKING_DOWNSTAIRS',
 4: 'SITTING',
 5: 'STANDING',
 6: 'LAYING'}

## Preprocessing

Basic information about the dataset is inspected to assess whether any preprocessing is needed.

In [120]:
X.describe().round(2)

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,V29,V30,V31,V32,V33,V34,V35,V36,V37,V38,V39,V40,...,V522,V523,V524,V525,V526,V527,V528,V529,V530,V531,V532,V533,V534,V535,V536,V537,V538,V539,V540,V541,V542,V543,V544,V545,V546,V547,V548,V549,V550,V551,V552,V553,V554,V555,V556,V557,V558,V559,V560,V561
count,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,...,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0,10299.0
mean,0.27,-0.02,-0.11,-0.61,-0.51,-0.61,-0.63,-0.53,-0.61,-0.47,-0.31,-0.56,0.53,0.39,0.6,-0.55,-0.83,-0.9,-0.85,-0.69,-0.64,-0.64,-0.1,-0.13,-0.16,-0.12,0.11,-0.04,0.12,-0.03,0.03,0.16,-0.02,0.01,0.04,0.03,-0.08,-0.12,-0.2,0.1,...,-0.84,-0.68,-0.34,-0.88,0.17,-0.3,-0.6,-0.7,-0.7,-0.68,-0.73,-0.89,-0.7,-0.88,-0.72,-0.08,-0.89,-0.04,-0.26,-0.58,-0.78,-0.79,-0.77,-0.81,-0.87,-0.78,-0.94,-0.77,-0.27,-0.9,0.13,-0.3,-0.62,0.01,0.0,0.02,-0.01,-0.5,0.06,-0.05
std,0.07,0.04,0.05,0.44,0.5,0.4,0.41,0.48,0.4,0.54,0.28,0.28,0.36,0.34,0.29,0.46,0.25,0.13,0.21,0.36,0.37,0.37,0.46,0.43,0.37,0.31,0.25,0.25,0.23,0.25,0.21,0.21,0.22,0.28,0.22,0.24,0.23,0.36,0.33,0.38,...,0.23,0.37,0.67,0.19,0.25,0.36,0.35,0.32,0.31,0.33,0.28,0.16,0.32,0.18,0.31,0.6,0.16,0.28,0.32,0.32,0.27,0.26,0.28,0.24,0.19,0.27,0.13,0.28,0.62,0.14,0.25,0.32,0.31,0.34,0.45,0.62,0.48,0.51,0.31,0.27
min,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
25%,0.26,-0.02,-0.12,-0.99,-0.98,-0.98,-0.99,-0.98,-0.98,-0.94,-0.56,-0.81,0.21,0.11,0.39,-0.98,-1.0,-1.0,-1.0,-0.99,-0.98,-0.98,-0.56,-0.55,-0.5,-0.37,-0.08,-0.19,-0.03,-0.22,-0.13,0.03,-0.17,-0.21,-0.12,-0.11,-0.24,-0.36,-0.41,-0.14,...,-1.0,-0.99,-1.0,-0.97,-0.0,-0.6,-0.88,-0.98,-0.98,-0.98,-0.98,-0.99,-0.98,-1.0,-0.99,-0.67,-1.0,-0.23,-0.5,-0.81,-0.99,-0.99,-0.99,-0.99,-0.99,-0.99,-1.0,-0.99,-0.92,-0.97,-0.02,-0.54,-0.84,-0.12,-0.29,-0.49,-0.39,-0.82,0.0,-0.13
50%,0.28,-0.02,-0.11,-0.94,-0.84,-0.85,-0.95,-0.84,-0.85,-0.87,-0.47,-0.72,0.78,0.62,0.77,-0.88,-1.0,-0.99,-0.98,-0.96,-0.88,-0.85,-0.06,-0.1,-0.14,-0.14,0.08,-0.02,0.13,-0.05,0.02,0.16,-0.02,0.02,0.01,0.05,-0.08,-0.16,-0.19,0.14,...,-1.0,-0.94,-0.68,-0.9,0.16,-0.35,-0.71,-0.88,-0.83,-0.85,-0.83,-0.96,-0.88,-0.98,-0.91,-0.16,-0.95,-0.05,-0.32,-0.66,-0.95,-0.94,-0.94,-0.94,-0.97,-0.95,-1.0,-0.94,-0.41,-0.9,0.14,-0.34,-0.7,0.01,0.01,0.02,-0.01,-0.72,0.18,-0.0
75%,0.29,-0.01,-0.1,-0.25,-0.06,-0.28,-0.3,-0.09,-0.29,-0.01,-0.07,-0.35,0.84,0.69,0.84,-0.12,-0.72,-0.83,-0.76,-0.41,-0.32,-0.34,0.33,0.28,0.17,0.13,0.29,0.13,0.28,0.16,0.18,0.29,0.13,0.22,0.18,0.19,0.07,0.08,0.0,0.37,...,-0.73,-0.37,0.35,-0.87,0.36,-0.06,-0.43,-0.45,-0.47,-0.42,-0.56,-0.84,-0.45,-0.81,-0.5,0.51,-0.85,0.15,-0.08,-0.44,-0.61,-0.64,-0.61,-0.68,-0.81,-0.61,-0.92,-0.6,0.34,-0.87,0.29,-0.11,-0.49,0.15,0.29,0.54,0.37,-0.52,0.25,0.1
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [121]:
X.isna().any().any()

False

The dataset has no `NaN` cells, and the attributes are normalized to the range [-1, 1], so no preprocessing is considered necessary.

## Building the models

Naive Bayes is avoided because the attributes, by their nature, are expected to be strongly correlated, which conflicts with the independence assumption of that model. k-NN is not used either, since with this number of examples and attributes it would be very costly, particularly at prediction time, when each query must be compared against the stored training examples.

A decision tree is used, which allows the decisions made to be visualized if required. In addition, the nearest centroid model is applied, which offers a faster alternative that requires less storage than k-NN.
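The storage advantage of nearest centroid comes from keeping a single mean vector per class rather than every training example. The following is a minimal plain-Python sketch of the technique (Euclidean distance, Python 3.8+ for `math.dist`); the function names and the two-feature toy data are assumptions for illustration and do not reproduce scikit-learn's `NearestCentroid` implementation.

```python
import math

def fit_centroids(X, y):
    """Compute the mean feature vector (centroid) of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, row):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], row))

# Hypothetical two-feature samples for classes 6 (LAYING) and 1 (WALKING)
X = [[-0.9, -0.8], [-1.0, -0.7], [0.8, 0.9], [0.7, 1.0]]
y = [6, 6, 1, 1]
centroids = fit_centroids(X, y)
print(predict(centroids, [0.6, 0.8]))
```

For the HAR data this would mean storing 6 vectors of 561 values, whereas k-NN would need to keep all 10299 rows available at prediction time.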



In [0]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import NearestCentroid

clf_a = DecisionTreeClassifier(random_state=0)
clf_b = NearestCentroid()

## Evaluating the models

As with the previous dataset, the models are evaluated using cross-validation with `k=10`, since a good number of examples is available.

In this case the stratified variant is not requested explicitly, since the number of examples per class does not vary much. Note, however, that `cross_validate` with an integer `cv` still uses stratified folds by default when the estimator is a classifier. Accuracy is computed along with the F-1 Score, here with micro-averaging for the same reason.
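One consequence of micro-averaging is worth making explicit: in a single-label multiclass problem, every misclassification counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so micro F-1 coincides with accuracy. A small sketch with hypothetical labels shows this; `micro_f1` is an illustrative helper.

```python
def micro_f1(y_true, y_pred, labels):
    """Pool TP, FP and FN over all classes before computing F1."""
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            tp += t == label and p == label
            fp += t != label and p == label
            fn += t == label and p != label
    return 2 * tp / (2 * tp + fp + fn)

y_true = [1, 1, 2, 2, 3, 3, 3, 4]
y_pred = [1, 2, 2, 2, 3, 1, 3, 4]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = micro_f1(y_true, y_pred, labels=[1, 2, 3, 4])
print(accuracy, f1)
```

Both values are 0.75 here, so the two metric columns are expected to match whenever micro-averaging is used on this kind of problem.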

In [0]:
scoring = ['f1_micro', 'accuracy']

scores_a = cross_validate(clf_a, X, y, cv=10, scoring=scoring)
scores_b = cross_validate(clf_b, X, y, cv=10, scoring=scoring)

In [137]:
pd.DataFrame([[scores_a['test_f1_micro'].mean(), scores_a['test_accuracy'].mean(),
               scores_a['fit_time'].mean(), scores_a['score_time'].mean()],
              [scores_b['test_f1_micro'].mean(), scores_b['test_accuracy'].mean(),
               scores_b['fit_time'].mean(), scores_b['score_time'].mean()]],
             index=['Decision tree', 'Nearest centroid'],
             columns=['F1 micro','Accuracy','Fit time','Score time']).round(2)

| | F1 micro | Accuracy | Fit time | Score time |
|---|---|---|---|---|
| Decision tree | 0.87 | 0.87 | 8.01 | 0.01 |
| Nearest centroid | 0.82 | 0.82 | 0.07 | 0.02 |


The decision tree shows better results in both metrics. Its prediction time is lower than that of the nearest centroid, but it requires considerably more training time.

Next, predictions are made with `cross_val_predict` and the confusion matrix is computed for each model. Here the matrix is normalized by row (per true class) so that it is easier to read.
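Normalizing by true class (`normalize='true'` in scikit-learn) amounts to dividing each row of the count matrix by its total, so the diagonal becomes the per-class recall. A minimal sketch with hypothetical counts for three classes:

```python
def normalize_rows(cm):
    """Divide each row by its total so every row sums to 1 (rounded to 2 decimals)."""
    return [[round(v / sum(row), 2) if sum(row) else 0.0 for v in row] for row in cm]

# Hypothetical raw counts for three classes (rows: true, columns: predicted)
cm = [[170, 20, 10],
      [8, 80, 12],
      [0, 1, 49]]
normalized = normalize_rows(cm)
for row in normalized:
    print(row)
```

Although the three classes have different numbers of examples (200, 100 and 50), the normalized rows are directly comparable, which is what makes the matrices for the six activities easy to compare.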

In [130]:
cv_pred_a = cross_val_predict(clf_a, X, y, cv=10)
cv_pred_b = cross_val_predict(clf_b, X, y, cv=10)

print('Decision tree :')
print(confusion_matrix(y, cv_pred_a, normalize='true').round(2))
print('\nNearest centroid :')
print(confusion_matrix(y, cv_pred_b, normalize='true').round(2))

Decision tree :
[[0.85 0.08 0.06 0.   0.   0.  ]
 [0.1  0.8  0.1  0.   0.   0.  ]
 [0.04 0.1  0.86 0.   0.   0.  ]
 [0.   0.   0.   0.84 0.15 0.02]
 [0.   0.   0.   0.16 0.83 0.  ]
 [0.   0.   0.   0.   0.   1.  ]]

Nearest centroid :
[[0.72 0.14 0.14 0.   0.   0.  ]
 [0.03 0.87 0.1  0.   0.   0.  ]
 [0.1  0.13 0.77 0.   0.   0.  ]
 [0.   0.   0.   0.75 0.22 0.03]
 [0.   0.   0.   0.22 0.78 0.  ]
 [0.   0.01 0.   0.   0.   0.99]]


In [141]:
dict_class[2]

'WALKING_UPSTAIRS'

The confusion matrix also shows higher accuracy for the decision tree. This holds for every class except the second (*'WALKING_UPSTAIRS'*), where nearest centroid obtains a better value. For the first model, there are more cases in which the first class is wrongly predicted.

Finally, a report of per-class metrics is produced.

In [142]:
from sklearn.metrics import classification_report

print('Decision tree :')
print(classification_report(y, cv_pred_a, target_names=dict_class.values()))
print('\nNearest centroid :')
print(classification_report(y, cv_pred_b, target_names=dict_class.values()))

Decision tree :
                    precision    recall  f1-score   support

           WALKING       0.87      0.85      0.86      1722
  WALKING_UPSTAIRS       0.81      0.80      0.80      1544
WALKING_DOWNSTAIRS       0.82      0.86      0.84      1406
           SITTING       0.82      0.84      0.83      1777
          STANDING       0.86      0.83      0.84      1906
            LAYING       0.98      1.00      0.99      1944

          accuracy                           0.87     10299
         macro avg       0.86      0.86      0.86     10299
      weighted avg       0.87      0.87      0.87     10299


Nearest centroid :
                    precision    recall  f1-score   support

           WALKING       0.87      0.72      0.78      1722
  WALKING_UPSTAIRS       0.75      0.87      0.80      1544
WALKING_DOWNSTAIRS       0.74      0.77      0.75      1406
           SITTING       0.76      0.75      0.76      1777
          STANDING       0.79      0.78      0.79  

The same observation applies to the report: the decision tree shows equal or better results in every case shown, except for Recall on the *'WALKING_UPSTAIRS'* class, that is, the fraction of examples of that class that were correctly classified.



# Classification with scikit-learn

This work loads two different datasets: one containing text messages, and the other with information about the movements of mobile phone users, obtained from accelerometer measurements. Both are classification problems. In the first case the target attribute is binary (*spam and not spam*), and in the second it is multiclass (*the activities the users perform*).

For each case, two different classification models from the `scikit-learn` library are applied and compared. These are:

* `SVC`
* `MultinomialNB`
* `DecisionTreeClassifier`
* `NearestCentroid`