# Logistic Regression: SPAM Detection

This exercise presents the fundamentals of Logistic Regression through one of the first problems solved with Machine Learning techniques: SPAM detection.

## Problem statement

The goal is to build a machine learning system that predicts whether a given email is SPAM or not. The following dataset is used:

##### [2007 TREC Public Spam Corpus](https://plg.uwaterloo.ca/cgi-bin/cgiwrap/gvcormac/foo07)
The corpus trec07p contains 75,419 messages:

    25220 ham
    50199 spam

These messages constitute all the messages delivered to a particular
server between these dates:

    Sun, 8 Apr 2007 13:07:21 -0400
    Fri, 6 Jul 2007 07:04:53 -0400

### 1. Helper functions

In [9]:
# This class helps preprocess emails that contain HTML code
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Let HTMLParser initialize its own state; convert_charrefs=True unescapes entities
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        # Collect every text fragment found between tags
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

In [10]:
# This function removes the HTML tags found in the text of the email
def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

In [11]:
# Example of removing the HTML tags from a text
t = '<tr><td align="left"><a href="../../issues/51/16.html#article">Phrack World News</a></td>'
strip_tags(t)

'Phrack World News'
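
HTML emails often carry `<style>` or `<script>` blocks, and a stripper that only overrides `handle_data` keeps their contents as if they were visible text. The following is a minimal sketch of one way to skip them. The names `VisibleTextStripper` and `strip_visible_text` are hypothetical and are not part of the original notebook.

```python
from html.parser import HTMLParser


class VisibleTextStripper(HTMLParser):
    """Hypothetical variant of MLStripper that ignores <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fed = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when it is not inside <script> or <style>
        if self._skip_depth == 0:
            self.fed.append(data)

    def get_data(self):
        return "".join(self.fed)


def strip_visible_text(html):
    stripper = VisibleTextStripper()
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.get_data()


sample = "<style>p {color: red;}</style><p>Buy &amp; save</p><script>track()</script>"
visible = strip_visible_text(sample)
print(visible)
```

Whether dropping these blocks helps or hurts a spam classifier is an open design choice, since script or style contents could themselves be a useful signal.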

In [12]:
import email
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

class Parser:

    def __init__(self):
        self.stemmer = nltk.PorterStemmer()
        self.stopwords = set(stopwords.words('english'))
        self.punctuation = list(string.punctuation)

    def parse(self, email_path):
        """Parse an email."""
        with open(email_path, errors='ignore') as e:
            msg = email.message_from_file(e)
        return None if not msg else self.get_email_content(msg)

    def get_email_content(self, msg):
        """Extract the email content."""
        subject = self.tokenize(msg['Subject']) if msg['Subject'] else []
        body = self.get_email_body(msg.get_payload(), msg.get_content_type())
        content_type = msg.get_content_type()
        return {
            "subject": subject,
            "body": body,
            "content_type": content_type
        }
    
    def get_email_body(self, payload, content_type):
        """Extract the body of the email."""
        body = []
        if isinstance(payload, str) and content_type == 'text/plain':
            return self.tokenize(payload)
        elif isinstance(payload, str) and content_type == 'text/html':
            # Strip the HTML tags before tokenizing
            return self.tokenize(self.strip_tags(payload))
        elif isinstance(payload, list):
            for p in payload:
                body += self.get_email_body(p.get_payload(), p.get_content_type())
        return body

    def tokenize(self, text):
        """Transform a text string into tokens. Clean punctuation and perform stemming."""
        for c in self.punctuation:
            text = text.replace(c, "")
        text = text.replace("\t", " ")
        text = text.replace("\n", " ")
        tokens = list(filter(None, text.split(" ")))
        return [self.stemmer.stem(w) for w in tokens if w not in self.stopwords]

    def strip_tags(self, text):
        """Remove HTML tags."""
        from bs4 import BeautifulSoup
        return BeautifulSoup(text, "html.parser").get_text()


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\26wen\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
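
The recursion in `get_email_body` can also be expressed with the standard library's `Message.walk()`, which visits every nested part of a message. The sketch below builds a small multipart message in memory and collects its text parts with `get_payload(decode=True)`. The helper name `collect_text_parts` and the sample message are assumptions made for illustration. Tokens such as `atencif3n` or `utf8b8jsua` in the parsed output shown later in this notebook look like leftovers of quoted-printable and RFC 2047 encoding, which suggests that decoding the payload before tokenizing may be worth trying.

```python
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def collect_text_parts(msg):
    """Return (content_type, text) pairs for every text/* leaf of a message."""
    parts = []
    for part in msg.walk():  # walk() visits the message and all nested subparts
        if part.is_multipart():
            continue  # containers carry no text of their own
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True)  # bytes, transfer encoding undone
            charset = part.get_content_charset() or "utf-8"
            parts.append((part.get_content_type(), payload.decode(charset, errors="replace")))
    return parts


# Build a small multipart message in memory (made-up content, not from the corpus).
demo = MIMEMultipart("alternative")
demo["Subject"] = "Demo message"
demo.attach(MIMEText("Hello in plain text", "plain", "utf-8"))
demo.attach(MIMEText("<p>Hello in <b>HTML</b></p>", "html", "utf-8"))

# Round-trip through a string, similar to what message_from_file does with a file.
parsed = message_from_string(demo.as_string())
text_parts = collect_text_parts(parsed)
for content_type, text in text_parts:
    print(content_type, "->", text.strip())
```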


##### Reading an email in raw format

In [13]:
with open(r"C:\Users\26wen\Desktop\trec07p\MySet\inmail.1") as f:
    inmail = f.read()
print(inmail)



Delivered-To: 26wendyrr.wr@gmail.com
Received: by 2002:a05:7022:b88:b0:87:f8a5:1541 with SMTP id cy8csp146939dlb;
        Wed, 9 Oct 2024 17:57:47 -0700 (PDT)
X-Google-Smtp-Source: AGHT+IEBNnjzjST74udsGOA713HQ8xBIsrN52kUiH+78t+uvRyPhru3zwWBc+AZHUBTUK0ZDGOeD
X-Received: by 2002:a05:622a:8f:b0:44f:e076:b533 with SMTP id d75a77b69052e-4603f56ec9amr32187521cf.5.1728521866779;
        Wed, 09 Oct 2024 17:57:46 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1728521866; cv=none;
        d=google.com; s=arc-20240605;
        b=HaadPAEv8URewy9Bj22uNdeziuCXuXfw2PzTM3Yq8QcAucjKGw5RuZu4ywmZYLjD+4
         tEQC5qAFGs7oV3AhSA2ugInA5HSGjT4rOv+tj6DkUI7E7ceyB5CHbXCKgGTlqooQew8c
         AZuNT1NTNvM6txY965ycq+BE0xK0bB5auqe456/Eubyd1PtTB7PQDi/tl0QQR5OumP96
         aDlzAuYU5nLqv+NpOYH4Vnc2CjPQaq5PYs3oprA3BWIQSJ+4p6rFPI9mMMDaSvlloddP
         33ngQQRh58KckbsL3OAcwaDL+eSe0c6kmjjPum77E3x++io40VcWZRE4STMR800niPUn
         O0fA==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-202

##### Parsing the email

In [14]:
p = Parser()
p.parse("\\Users\\26wen\\Desktop\\trec07p\\MySet\\inmail.1")


{'subject': ['enviast', 'dinero', 'desd', 'spin', 'oxxo', 'utf8b8jsua'],
 'body': ['entra',
  'para',
  'ver',
  'me1',
  'detal',
  'del',
  'movimiento',
  'hola',
  'wendi',
  'enviast',
  '1500',
  'wendi',
  'rey',
  'al',
  'nfamero',
  'de',
  'clabe',
  '646180295704140915',
  'tu',
  'nfamero',
  'de',
  'movimiento',
  'es',
  '6707268548a2f2152170adc3',
  'consulta',
  'me1',
  'detal',
  'le',
  'en',
  'la',
  'app',
  'si',
  'reconoc',
  'est',
  'movimiento',
  'repf3rtalo',
  'desd',
  'tu',
  'app',
  'llama',
  'al',
  'centro',
  'de',
  'atencif3n',
  'client',
  '81',
  '3282',
  '8282',
  'de',
  '800',
  '1000',
  'pm',
  'gracia',
  'equipo',
  'spin',
  'oxxo',
  'est',
  'correo',
  'acepta',
  'respuesta',
  'recib',
  'est',
  'correo',
  'de',
  'notificacif3n',
  'por',
  'ser',
  'client',
  'de',
  'spin',
  'oxxo',
  'consulta',
  'nuestro',
  'aviso',
  'de',
  'privacidad',
  'httpurl3476spinbyoxxocommxl',
  'clickupn3du001697435njrrsoyh2bpqdnxckhkno

##### Reading the index

In [15]:
with open("\\Users\\26wen\\Desktop\\trec07p\\delay\\index1") as f:
    index = f.readlines()
index

['ham ../MySet/inmail.1\n',
 'ham ../MySet/inmail.2\n',
 'ham ../MySet/inmail.3\n',
 'ham ../MySet/inmail.4\n',
 'ham ../MySet/inmail.5\n',
 'ham ../MySet/inmail.6\n',
 'ham ../MySet/inmail.7\n',
 'ham ../MySet/inmail.8\n',
 'ham ../MySet/inmail.9\n',
 'ham ../MySet/inmail.10\n',
 'ham ../MySet/inmail.11\n',
 'ham ../MySet/inmail.12\n',
 'ham ../MySet/inmail.13\n',
 'ham ../MySet/inmail.14\n',
 'ham ../MySet/inmail.15\n',
 'spam ../MySet/inmail.16\n',
 'spam ../MySet/inmail.17\n',
 'spam ../MySet/inmail.18\n',
 'spam ../MySet/inmail.19\n',
 'spam ../MySet/inmail.20\n']

In [16]:
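import os

DATASET_PATH = "\\Users\\26wen\\Desktop\\trec07p"

def parse_index(path_to_index, n_elements):
    """Return the label and email path of the first n_elements index entries."""
    ret_indexes = []
    with open(path_to_index) as f:
        index = f.readlines()
    for i in range(n_elements):
        # Each line looks like "ham ../MySet/inmail.1"
        mail = index[i].split(" ../")
        label = mail[0]
        path = mail[1].rstrip("\n")  # safe even if the last line has no newline
        ret_indexes.append({"label": label, "email_path": os.path.join(DATASET_PATH, path)})
    return ret_indexes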

In [17]:
def parse_email(index):
    p = Parser()
    pmail = p.parse(index["email_path"])
    return pmail, index["label"]

In [18]:
indexes = parse_index("\\Users\\26wen\\Desktop\\trec07p\\delay\\index1", 10)
indexes

[{'label': 'ham',
  'email_path': '\\Users\\26wen\\Desktop\\trec07p\\MySet/inmail.1'},
 {'label': 'ham',
  'email_path': '\\Users\\26wen\\Desktop\\trec07p\\MySet/inmail.2'},
 {'label': 'ham',
  'email_path': '\\Users\\26wen\\Desktop\\trec07p\\MySet/inmail.3'},
 {'label': 'ham',
  'email_path': '\\Users\\26wen\\Desktop\\trec07p\\MySet/inmail.4'},
 {'label': 'ham',
  'email_path': '\\Users\\26wen\\Desktop\\trec07p\\MySet/inmail.5'},
 {'label': 'ham',
  'email_path': '\\Users\\26wen\\Desktop\\trec07p\\MySet/inmail.6'},
 {'label': 'ham',
  'email_path': '\\Users\\26wen\\Desktop\\trec07p\\MySet/inmail.7'},
 {'label': 'ham',
  'email_path': '\\Users\\26wen\\Desktop\\trec07p\\MySet/inmail.8'},
 {'label': 'ham',
  'email_path': '\\Users\\26wen\\Desktop\\trec07p\\MySet/inmail.9'},
 {'label': 'ham',
  'email_path': '\\Users\\26wen\\Desktop\\trec07p\\MySet/inmail.10'}]
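
The paths in the output above mix `\\` and `/` separators, because the relative part of each index line is appended as-is to the dataset root. A possible alternative, sketched below, resolves the relative path against the directory that holds the index file and lets `normpath` collapse the `..` component. The function name `resolve_index_entry` and the `/data/trec07p` location are hypothetical. `posixpath` is used so the example behaves the same on every operating system; in the notebook `os.path` would be the usual choice.

```python
import posixpath


def resolve_index_entry(line, index_path):
    """Split one index line into (label, normalized email path).

    The relative path is resolved against the directory holding the index
    file, which is an assumption about how the corpus is laid out.
    """
    label, relative_path = line.strip().split(maxsplit=1)
    index_dir = posixpath.dirname(index_path)
    # normpath collapses the ".." component instead of cutting it by hand
    email_path = posixpath.normpath(posixpath.join(index_dir, relative_path))
    return label, email_path


# Hypothetical POSIX-style location, used only for illustration.
label, email_path = resolve_index_entry(
    "spam ../MySet/inmail.16\n", "/data/trec07p/delay/index1"
)
print(label, email_path)
```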

### 2. Preprocessing the dataset

The functions presented above make it possible to read the emails programmatically and to process them, removing the components that are not useful for SPAM detection. However, each email is still represented by a Python dictionary that holds lists of words.

In [19]:
# Load the index and the labels into memory
index = parse_index("/Users/26wen/Desktop/trec07p/delay/index1", 1)


In [20]:
# Read the first email
with open(index[0]["email_path"]) as f:
    raw_email = f.read()
raw_email

'Delivered-To: 26wendyrr.wr@gmail.com\nReceived: by 2002:a05:7022:b88:b0:87:f8a5:1541 with SMTP id cy8csp146939dlb;\n        Wed, 9 Oct 2024 17:57:47 -0700 (PDT)\nX-Google-Smtp-Source: AGHT+IEBNnjzjST74udsGOA713HQ8xBIsrN52kUiH+78t+uvRyPhru3zwWBc+AZHUBTUK0ZDGOeD\nX-Received: by 2002:a05:622a:8f:b0:44f:e076:b533 with SMTP id d75a77b69052e-4603f56ec9amr32187521cf.5.1728521866779;\n        Wed, 09 Oct 2024 17:57:46 -0700 (PDT)\nARC-Seal: i=1; a=rsa-sha256; t=1728521866; cv=none;\n        d=google.com; s=arc-20240605;\n        b=HaadPAEv8URewy9Bj22uNdeziuCXuXfw2PzTM3Yq8QcAucjKGw5RuZu4ywmZYLjD+4\n         tEQC5qAFGs7oV3AhSA2ugInA5HSGjT4rOv+tj6DkUI7E7ceyB5CHbXCKgGTlqooQew8c\n         AZuNT1NTNvM6txY965ycq+BE0xK0bB5auqe456/Eubyd1PtTB7PQDi/tl0QQR5OumP96\n         aDlzAuYU5nLqv+NpOYH4Vnc2CjPQaq5PYs3oprA3BWIQSJ+4p6rFPI9mMMDaSvlloddP\n         33ngQQRh58KckbsL3OAcwaDL+eSe0c6kmjjPum77E3x++io40VcWZRE4STMR800niPUn\n         O0fA==\nARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google

In [21]:
# Parse the first email
mail, label = parse_email(index[0])
print("The email is:", label)
print(mail)

The email is: ham
{'subject': ['enviast', 'dinero', 'desd', 'spin', 'oxxo', 'utf8b8jsua'], 'body': ['entra', 'para', 'ver', 'me1', 'detal', 'del', 'movimiento', 'hola', 'wendi', 'enviast', '1500', 'wendi', 'rey', 'al', 'nfamero', 'de', 'clabe', '646180295704140915', 'tu', 'nfamero', 'de', 'movimiento', 'es', '6707268548a2f2152170adc3', 'consulta', 'me1', 'detal', 'le', 'en', 'la', 'app', 'si', 'reconoc', 'est', 'movimiento', 'repf3rtalo', 'desd', 'tu', 'app', 'llama', 'al', 'centro', 'de', 'atencif3n', 'client', '81', '3282', '8282', 'de', '800', '1000', 'pm', 'gracia', 'equipo', 'spin', 'oxxo', 'est', 'correo', 'acepta', 'respuesta', 'recib', 'est', 'correo', 'de', 'notificacif3n', 'por', 'ser', 'client', 'de', 'spin', 'oxxo', 'consulta', 'nuestro', 'aviso', 'de', 'privacidad', 'httpurl3476spinbyoxxocommxl', 'clickupn3du001697435njrrsoyh2bpqdnxckhknohcjsoxgtn6c4l8tetys3o2b2h37uu', 'iilte2f0agomusil9crjwikkmcoc5d2fvij9iq372ff9byrcneez4jjw3dp25zt2f9lcx6', 'x5s4iyi2fagfsqoird2bboqynrre3x

The Logistic Regression algorithm cannot ingest raw text as part of the dataset. Additional transformations are therefore needed to turn the text of the parsed emails into a numerical representation.
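
One common numerical representation is a bag of words: every distinct token becomes a column and every document becomes a row of token counts. The sketch below does this by hand with the standard library. The function name `bag_of_words` and the two sample documents are made up for illustration, and scikit-learn's `CountVectorizer` applies its own tokenization rules, so its output can differ from this simplified version.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, count_matrix) for a list of whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing tokens count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


docs = ["free money free offer", "meeting agenda offer"]  # made-up examples
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```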

##### Applying CountVectorizer

In [23]:
from sklearn.feature_extraction.text import CountVectorizer

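# Prepare the email as a single text string.
# Note: subject and body are joined without a separator, so the last subject token
# fuses with the first body token ("utf8b8jsuaentra" in the output below).
prep_email = [" ".join(mail['subject']) + " ".join(mail['body'])]

vectorizer = CountVectorizer()
vectorizer.fit(prep_email)  # fit() only learns the vocabulary and returns the vectorizer itself

print("Email:", prep_email, "\n")
print("Input features:", vectorizer.get_feature_names_out())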

Email: ['enviast dinero desd spin oxxo utf8b8jsuaentra para ver me1 detal del movimiento hola wendi enviast 1500 wendi rey al nfamero de clabe 646180295704140915 tu nfamero de movimiento es 6707268548a2f2152170adc3 consulta me1 detal le en la app si reconoc est movimiento repf3rtalo desd tu app llama al centro de atencif3n client 81 3282 8282 de 800 1000 pm gracia equipo spin oxxo est correo acepta respuesta recib est correo de notificacif3n por ser client de spin oxxo consulta nuestro aviso de privacidad httpurl3476spinbyoxxocommxl clickupn3du001697435njrrsoyh2bpqdnxckhknohcjsoxgtn6c4l8tetys3o2b2h37uu iilte2f0agomusil9crjwikkmcoc5d2fvij9iq372ff9byrcneez4jjw3dp25zt2f9lcx6 x5s4iyi2fagfsqoird2bboqynrre3x2y4tjxz66yo2fahzehfysfj0jsm8suuo6po11kuk1p oxhyszkkoptjhxxeq2b4k2bsuegfusj4d2bav07uhjoteqz2bgtihob1d2bb1rz12bczu sb1ub2bjop9oeo0h0tjdymceh1imnnj9j95jouqaapiwmriheif3ugpcp06x1rajzsywrpysvk jcpvc9zvpz4qv3onsyicawlhnuruukuyrpw9tql2tdof6rmuyoo8bwzhcm2f2fehp2bk2vz7 blqhzb20cvcjtorhxpjnomppogwf

In [24]:
X = vectorizer.transform(prep_email)
print("\nValues:\n", X.toarray())


Values:
 [[ 2  1  1  1  2  1  1  2  2  2  2  2  2  1  4  4  2  2  2  2  1  1  2  2
   1  1  3  2  2  2  6  4  2  1 20  2  3  4  1  1  2  1  2  2  2  1  3  2
   2  6  1  4  2  2  1  2  1  1  1  1  1  2  1  1  1  5  1  2  2  5  2  1
   4  1  4  2  1  1  1  1  1  1  2  1  1  7  2  4  2  4  2  1  2  2  1  2
   2  2  1  2  1  2  2  7  2  1  1  4  1  1  1  1  2  4  1  1  1  1]]


##### Applying OneHotEncoding

In [25]:
from sklearn.preprocessing import OneHotEncoder

prep_email = [[w] for w in mail['subject'] + mail['body']]

enc = OneHotEncoder(handle_unknown='ignore')
X = enc.fit_transform(prep_email)

print("Features:\n", enc.get_feature_names_out())
print("\nValues:\n", X.toarray())

Features:
 ['x0_0' 'x0_1000' 'x0_150' 'x0_1500'
 'x0_2bav07uhjoteqz2bgtihob1d2bb1rz12bczusb1ub2bjop9oeo0h0tjdymceh1imnnj9j9'
 'x0_3282'
 'x0_3du001697435njrrsoyh2bpqdnxcpkkxmhfngdvnhflgbz1pg3euq4wlzsiedwtpneni5321'
 'x0_5jouqaapiwmriheif3ugpcp06x1rajzsywrpysvkjcpvc9zvpz4qv3onsyicawjaku49ufxxkjf'
 'x0_646180295704140915' 'x0_6707268548a2f2152170adc3' 'x0_800' 'x0_81'
 'x0_8282' 'x0_acepta'
 'x0_akhldvbjgeqz98pcyctdddezsiljsqv1lexkftf7fceex2bjhieujnfh1hor2bkfwdu3upkph'
 'x0_al' 'x0_app' 'x0_atencif3n' 'x0_autoridad' 'x0_autorizada' 'x0_aviso'
 'x0_bjpi2ffoari9gdo2phdoegf3jptq4brke0nitjalzqmf2b5udlp2byiwiapnbz92lcswpfbz'
 'x0_blqhzb20cvcjtorhxpjnomppogwfiwrv2f9vtgy8wh3itde2bl52bhvgwr2bvmnt9nocn1k'
 'x0_centro' 'x0_clabe'
 'x0_clickupn3du001697435njrrsoyh2bpqdnxckhknohcjsoxgtn6c4l8tetys3o2b2h37uu'
 'x0_clien' 'x0_client' 'x0_comerci' 'x0_compropago' 'x0_con'
 'x0_consulta' 'x0_correo' 'x0_cv'
 'x0_cvcfxcjmef7x5kiwwhugdgudqn2y1ewodr2ruhbfn5emaqdtsewxslm7iisk332ehfm'
 'x0_de' 'x0_del' 'x0_de
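
Because each token is passed to `OneHotEncoder` as its own sample, the result has one row per token occurrence rather than one row per email. The sketch below reproduces that idea by hand and shows that summing the rows column by column recovers per-token counts. The function name `one_hot_tokens` is hypothetical, and the three-token list is a shortened example.

```python
def one_hot_tokens(tokens):
    """Return (categories, rows) with one indicator row per token occurrence."""
    categories = sorted(set(tokens))
    position = {token: i for i, token in enumerate(categories)}
    rows = []
    for token in tokens:
        row = [0] * len(categories)
        row[position[token]] = 1  # exactly one active column per token
        rows.append(row)
    return categories, rows


tokens = ["spin", "oxxo", "spin"]  # shortened list using tokens from the parsed email
categories, rows = one_hot_tokens(tokens)

# Summing the rows column by column gives the count of each token in the email
column_sums = [sum(column) for column in zip(*rows)]
print(categories)
print(rows)
print(column_sums)
```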

##### Helper functions for preprocessing the dataset

In [27]:

def create_prep_dataset(index_path, n_elements):
    """Parse n_elements emails and return their texts (X) and labels (y)."""
    X = []
    y = []
    indexes = parse_index(index_path, n_elements)
    for i in range(n_elements):
        print("\rParsing email: {0}".format(i+1), end='')
        try:
            mail, label = parse_email(indexes[i])
            X.append(" ".join(mail['subject']) + " ".join(mail['body']))
            y.append(label)
        except Exception:
            # Skip emails that cannot be read or parsed
            pass
    return X, y

### 3. Training the algorithm

In [28]:
# Read only a subset of 20 emails
X_train, y_train = create_prep_dataset("/Users/26wen/Desktop/trec07p/delay/index1", 20)
X_train

Parsing email: 20

['enviast dinero desd spin oxxo utf8b8jsuaentra para ver me1 detal del movimiento hola wendi enviast 1500 wendi rey al nfamero de clabe 646180295704140915 tu nfamero de movimiento es 6707268548a2f2152170adc3 consulta me1 detal le en la app si reconoc est movimiento repf3rtalo desd tu app llama al centro de atencif3n client 81 3282 8282 de 800 1000 pm gracia equipo spin oxxo est correo acepta respuesta recib est correo de notificacif3n por ser client de spin oxxo consulta nuestro aviso de privacidad httpurl3476spinbyoxxocommxl clickupn3du001697435njrrsoyh2bpqdnxckhknohcjsoxgtn6c4l8tetys3o2b2h37uu iilte2f0agomusil9crjwikkmcoc5d2fvij9iq372ff9byrcneez4jjw3dp25zt2f9lcx6 x5s4iyi2fagfsqoird2bboqynrre3x2y4tjxz66yo2fahzehfysfj0jsm8suuo6po11kuk1p oxhyszkkoptjhxxeq2b4k2bsuegfusj4d2bav07uhjoteqz2bgtihob1d2bb1rz12bczu sb1ub2bjop9oeo0h0tjdymceh1imnnj9j95jouqaapiwmriheif3ugpcp06x1rajzsywrpysvk jcpvc9zvpz4qv3onsyicawlhnuruukuyrpw9tql2tdof6rmuyoo8bwzhcm2f2fehp2bk2vz7 blqhzb20cvcjtorhxpjnomppogwfiwrv2f9

##### Applying vectorization to the data

In [29]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [31]:
print(X_train.toarray())
print("\nFeatures:", len(vectorizer.get_feature_names_out()))

[[0 0 0 ... 0 0 0]
 [0 0 1 ... 1 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [1 1 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]]

Features: 1966


In [32]:
import pandas as pd

pd.DataFrame(X_train.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,00,000000,00color8b8b90lineheight15pxtextdecorationunderlin,00hqyivh3ly222b6l4dyms26xajva2bu7tgmqrbniyd2mvk289mlb27kfdshglvnsnv2b9lx,048583divdivdivtdtrtbodyt,0691971382,07657096334498304626ismemberrequest3d026messagepreviewenabled3d126,09,0909,090909,...,zhj1r2f20tdk2bvlaq3d3d,ziqomwqv3q0e67mbhchu7pf6ce6mq9ar0qrlsugj1qhcgnsu4vranxbgjr8eb6htl9asxoum4fd,zjed8muutemu3ldkcf7rlljecjszmo1khf2k6yvarxtcjbkx80iu2br5c2f4xzo132bmr2b,zqpc0k5kf8xh38sxwhyf7espg0rzlhkphvuqgdwvovyikcq9nszm4gllu7nabhsvy9pjcvojygc,zugseknkat1hotvvky4cz12bne1dp0c6ktudt2f9lcx6x5s4iyi2fagfsqoird2bboqynr,zw,zwn,zwnj,zzylba72bcfvdxp2bhnq7ds1shyfbfuoobji1t6wrriqp2bzveu1iux5l50qunxziqomwqv3,éxito
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,1,0,1,0,0,0,...,0,0,0,0,0,2,1,1,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [33]:

y_train

['ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'ham',
 'spam',
 'spam',
 'spam',
 'spam',
 'spam']

##### Training the logistic regression algorithm with the preprocessed dataset

In [34]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)
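
To make explicit what is being fitted, logistic regression models the probability of the positive class as `sigmoid(w · x + b)` and adjusts `w` and `b` to reduce the log-loss. The sketch below trains such a model with plain batch gradient descent on toy count vectors. It is only an illustration under stated assumptions: the toy data, the learning rate, and the epoch count are made up, and scikit-learn's `LogisticRegression` uses a different solver and applies L2 regularization by default, so its coefficients will not match this code.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def train_logistic_regression(X, y, learning_rate=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for features, target in zip(X, y):
            z = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(z) - target  # derivative of the log-loss with respect to z
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict(X, weights, bias, threshold=0.5):
    """Return 1 where the predicted probability reaches the threshold, else 0."""
    return [
        int(sigmoid(sum(w * x for w, x in zip(weights, row)) + bias) >= threshold)
        for row in X
    ]


# Toy count vectors over a made-up vocabulary ["free", "meeting", "offer"]; 1 = spam, 0 = ham
X_toy = [[2, 0, 1], [3, 0, 2], [0, 2, 0], [0, 1, 1]]
y_toy = [1, 1, 0, 0]
weights, bias = train_logistic_regression(X_toy, y_toy)
predictions = predict(X_toy, weights, bias)
print(predictions)
```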

### 4. Prediction

##### Reading the emails to classify

In [35]:

# Read the same 20 emails again
X, y = create_prep_dataset("\\Users\\26wen\\Desktop\\trec07p\\delay\\index1", 20)

# Use the last 5 emails as the test set. The first 15 are kept as X_train/y_train,
# but clf is not retrained here: it was fitted on all 20 emails in section 3,
# so the 5 test emails were already seen during training.
X_train, y_train = X[:15], y[:15]
X_test, y_test = X[15:], y[15:]


Parsing email: 20

##### Preprocessing the emails with the previously created vectorizer

In [36]:

X_test = vectorizer.transform(X_test)

##### Predicting the email type

In [38]:

y_pred = clf.predict(X_test)

print("Prediction:\n", y_pred)
print("\nTrue labels:\n", y_test)

Prediction:
 ['spam' 'spam' 'spam' 'spam' 'spam']

True labels:
 ['spam', 'spam', 'spam', 'spam', 'spam']


##### Evaluating the results

In [39]:
from sklearn.metrics import accuracy_score

print('Accuracy: {:.3f}'.format(accuracy_score(y_test, y_pred)))

Accuracy: 1.000


### 5. Increasing the dataset

In [40]:

# Read 20 emails
X, y = create_prep_dataset("\\Users\\26wen\\Desktop\\trec07p\\delay\\index1", 20)

Parsing email: 20

In [44]:
# Use 12 'ham' and 3 'spam' emails to train the model
X_train, y_train = X[:12] + X[15:18], y[:12] + y[15:18]

# Use 3 'ham' and 2 'spam' emails for testing
X_test, y_test = X[12:15] + X[18:], y[12:15] + y[18:]
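
The slices above keep both classes in the training set and in the test set. The same idea can be written as a reusable helper that holds out a fraction of each class. The sketch below is one possible version using only the standard library. The name `stratified_split`, the 25% test fraction, and the placeholder documents are assumptions. scikit-learn offers comparable behaviour through `train_test_split` with its `stratify` parameter.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_fraction=0.25, seed=0):
    """Hold out roughly `test_fraction` of each class for testing."""
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label][:]
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    def pick(idx):
        return [samples[i] for i in idx], [labels[i] for i in idx]

    return pick(sorted(train_idx)), pick(sorted(test_idx))


# Same class balance as the 20-email subset: 15 ham followed by 5 spam
labels = ["ham"] * 15 + ["spam"] * 5
samples = ["email_{}".format(i) for i in range(20)]  # placeholder documents
(train_X, train_y), (test_X, test_y) = stratified_split(samples, labels)
print(len(train_y), len(test_y))
```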



In [45]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [46]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

In [47]:
X_test = vectorizer.transform(X_test)

In [48]:
y_pred = clf.predict(X_test)

In [49]:

print('Accuracy: {:.3f}'.format(accuracy_score(y_test, y_pred)))

Accuracy: 0.800
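
With only five test emails, an accuracy of 0.800 corresponds to a single misclassified email, and the score alone does not say which class the error falls in. The sketch below computes accuracy by hand and counts each (true label, predicted label) pair. The predictions in it are hypothetical, since the notebook does not show which email was misclassified.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))


# Hypothetical predictions: the notebook reports 0.800 but does not show which
# of the five test emails was misclassified.
y_true_demo = ["ham", "ham", "ham", "spam", "spam"]
y_pred_demo = ["ham", "ham", "ham", "ham", "spam"]

demo_accuracy = accuracy(y_true_demo, y_pred_demo)
demo_confusion = confusion_counts(y_true_demo, y_pred_demo)
print("Accuracy: {:.3f}".format(demo_accuracy))
print(dict(demo_confusion))
```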
