# Logistic Regression: SPAM Detection

This exercise presents the fundamentals of logistic regression by tackling one of the first problems that was solved with machine learning techniques.
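
As a rough illustration of the idea behind logistic regression, the sketch below passes a weighted sum of word counts through the sigmoid function to obtain a spam probability. The weights, the bias and the 0.5 threshold are made-up values chosen for illustration; they are not parameters learned from the corpus.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(word_counts, weights, bias):
    """Probability of the positive class (spam) for one email."""
    z = bias + sum(weights.get(word, 0.0) * count
                   for word, count in word_counts.items())
    return sigmoid(z)


# Hypothetical weights: positive values push towards spam, negative towards ham.
weights = {"viagra": 2.0, "discount": 1.5, "meeting": -2.0}
bias = -0.5

p_spam = predict_proba({"viagra": 1, "discount": 2}, weights, bias)
label = "spam" if p_spam >= 0.5 else "ham"
print(label, round(p_spam, 3))
```

Training a logistic regression model amounts to finding the weights and bias that make these probabilities agree with the labelled examples as well as possible.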

## Problem statement
The goal is to build a machine learning system capable of predicting whether a given email is SPAM or not. The following dataset is used:

###### [2007 TREC Public Spam Corpus](https://plg.uwaterloo.ca/~gvcormac/treccorpus07/)
The corpus trec07p contains 75,419 messages:

25,220 ham
50,199 spam

These messages constitute all the messages delivered to a particular server between these dates:

Sun, 8 Apr 2007 13:07:21 -0400

Fri, 6 Jul 2007 07:04:53 -0400


# 1.- Helper functions

In this practical case on detecting SPAM emails, the available dataset consists of emails with their headers and additional fields. They therefore need to be preprocessed before being fed to the machine learning algorithm.

In [1]:
# This class makes it easier to preprocess emails that contain HTML code
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    def __init__(self):
        # Let HTMLParser initialise its own state; convert_charrefs=True
        # decodes character references such as &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

In [2]:
# This function removes the HTML tags found in the email text
def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

In [3]:
# Example of removing the HTML tags from a text
t = '<tr><td align="left"><a href="../..issues/51/16.html#article">Phrack World News</a></td>'
strip_tags(t)

'Phrack World News'

Besides removing any HTML tags found in the email, further preprocessing is needed to keep unnecessary noise out of the messages. This includes removing punctuation, removing email fields that are not relevant, and stripping the affixes of each word so that only its stem remains (stemming). The class shown below performs these transformations.

In [4]:
import email
import string
import nltk


class Parser:
    
    def __init__(self):
        self.stemmer = nltk.PorterStemmer()
        self.stopwords = set(nltk.corpus.stopwords.words('english'))
        self.punctuation = list(string.punctuation)

    def parse(self, email_path):
        """Parse an email."""
        with open(email_path, errors='ignore') as e:
            msg = email.message_from_file(e)
        return None if not msg else self.get_email_content(msg)
        
    def get_email_content(self, msg):
        """Extract the email content."""
        subject = self.tokenize(msg['Subject']) if msg['Subject'] else []
        body = self.get_email_body(msg.get_payload(),
                                   msg.get_content_type())
        content_type = msg.get_content_type()
        # Return the content of the email
        return {"subject": subject,
               "body": body,
               "content_type": content_type}

    def get_email_body(self, payload, content_type):
        """Extract the body of the email."""
        body = []
        if isinstance(payload, str) and content_type == 'text/plain':
            return self.tokenize(payload)
        elif isinstance(payload, str) and content_type == 'text/html':
            return self.tokenize(strip_tags(payload))
        elif isinstance(payload, list):
            # Multipart message: recurse into every part and concatenate the tokens
            for p in payload:
                body += self.get_email_body(p.get_payload(),
                                            p.get_content_type())
        return body

    def tokenize(self, text):
        """Transform a text string into tokens.

        Performs two main actions: removes the punctuation symbols and
        stems the remaining words.
        """
        for c in self.punctuation:
            text = text.replace(c, "")
        text = text.replace("\t", " ")
        # Note: newlines are removed without inserting a space, so the last word
        # of a line is joined to the first word of the next one
        text = text.replace("\n", "")
        tokens = list(filter(None, text.split(" ")))
        # Stem the tokens, skipping stop words
        return [self.stemmer.stem(w) for w in tokens if w not in self.stopwords]

Reading an email in raw format

In [5]:
inmail = open("datasets/trec07p/data/inmail.1").read()
print(inmail)

From RickyAmes@aol.com  Sun Apr  8 13:07:32 2007
Return-Path: <RickyAmes@aol.com>
Received: from 129.97.78.23 ([211.202.101.74])
	by speedy.uwaterloo.ca (8.12.8/8.12.5) with SMTP id l38H7G0I003017;
	Sun, 8 Apr 2007 13:07:21 -0400
Received: from 0.144.152.6 by 211.202.101.74; Sun, 08 Apr 2007 19:04:48 +0100
Message-ID: <WYADCKPDFWWTWTXNFVUE@yahoo.com>
From: "Tomas Jacobs" <RickyAmes@aol.com>
Reply-To: "Tomas Jacobs" <RickyAmes@aol.com>
To: the00@speedy.uwaterloo.ca
Subject: Generic Cialis, branded quality@ 
Date: Sun, 08 Apr 2007 21:00:48 +0300
X-Mailer: Microsoft Outlook Express 6.00.2600.0000
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="--8896484051606557286"
X-Priority: 3
X-MSMail-Priority: Normal
Status: RO
Content-Length: 988
Lines: 24

----8896484051606557286
Content-Type: text/html;
Content-Transfer-Encoding: 7Bit

<html>
<body bgcolor="#ffffff">
<div style="border-color: #00FFFF; border-right-width: 0px; border-bottom-width: 0px; margin-bottom: 0px;" align="

In [6]:
p = Parser()
p.parse("datasets/trec07p/data/inmail.1")

{'subject': ['gener', 'ciali', 'brand', 'qualiti'],
 'body': ['do',
  'feel',
  'pressur',
  'perform',
  'rise',
  'occasiontri',
  'viagrayour',
  'anxieti',
  'thing',
  'past',
  'willb',
  'back',
  'old',
  'self'],
 'content_type': 'multipart/alternative'}
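
Tokens such as 'occasiontri' and 'viagrayour' in the output above come from words on consecutive lines being joined when the newline is removed. A minimal sketch of an alternative tokenizer that keeps such words apart is shown below. It uses only the standard library, skips stemming, and relies on a small made-up stop-word set instead of the nltk list; `simple_tokenize` and `STOPWORDS` are hypothetical names, not part of the notebook.

```python
import string

# Small hypothetical stop-word set; the notebook uses nltk's English list instead.
STOPWORDS = {"the", "of", "a", "to", "and", "be", "will"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def simple_tokenize(text):
    """Strip punctuation, split on any whitespace and drop stop words."""
    cleaned = text.translate(_PUNCT_TABLE)
    # str.split() with no argument splits on spaces, tabs and newlines alike,
    # so words on consecutive lines stay separate.
    return [w.lower() for w in cleaned.split() if w.lower() not in STOPWORDS]


print(simple_tokenize("Rise to the occasion.\nTry Viagra!"))
```

Switching to a tokenizer like this would change the vocabulary, so the outputs and accuracies recorded in this notebook would no longer be reproduced exactly.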

##### Reading the index

These helper functions load into memory the path of each email and its corresponding label {spam, ham}.

In [7]:
index = open("datasets/trec07p/full/index").readlines()
index

['spam ../data/inmail.1\n',
 'ham ../data/inmail.2\n',
 'spam ../data/inmail.3\n',
 'spam ../data/inmail.4\n',
 'spam ../data/inmail.5\n',
 'spam ../data/inmail.6\n',
 'spam ../data/inmail.7\n',
 'spam ../data/inmail.8\n',
 'spam ../data/inmail.9\n',
 'ham ../data/inmail.10\n',
 'spam ../data/inmail.11\n',
 'spam ../data/inmail.12\n',
 'spam ../data/inmail.13\n',
 'spam ../data/inmail.14\n',
 'spam ../data/inmail.15\n',
 'spam ../data/inmail.16\n',
 'spam ../data/inmail.17\n',
 'spam ../data/inmail.18\n',
 'spam ../data/inmail.19\n',
 'ham ../data/inmail.20\n',
 'ham ../data/inmail.21\n',
 'spam ../data/inmail.22\n',
 'spam ../data/inmail.23\n',
 'spam ../data/inmail.24\n',
 'spam ../data/inmail.25\n',
 'spam ../data/inmail.26\n',
 'spam ../data/inmail.27\n',
 'spam ../data/inmail.28\n',
 'ham ../data/inmail.29\n',
 'spam ../data/inmail.30\n',
 'ham ../data/inmail.31\n',
 'spam ../data/inmail.32\n',
 'spam ../data/inmail.33\n',
 'ham ../data/inmail.34\n',
 'spam ../data/inmail.35\n',
 

In [8]:
import os
DATASET_PATH = "datasets/trec07p"

def parse_index(path_to_index, n_elements):
    ret_indexes = []
    index = open(path_to_index).readlines()
    for i in range(n_elements):
        mail = index[i].split(" ../")
        label = mail[0]
        path = mail[1][:-1]
        ret_indexes.append({"label": label, "email_path": os.path.join(DATASET_PATH, path)})
    return ret_indexes

In [9]:
def parse_email(index):
    p = Parser()
    pmail = p.parse(index["email_path"])
    return pmail, index["label"]

In [10]:
indexes = parse_index("datasets/trec07p/full/index", 10)
indexes

[{'label': 'spam', 'email_path': 'datasets/trec07p/data/inmail.1'},
 {'label': 'ham', 'email_path': 'datasets/trec07p/data/inmail.2'},
 {'label': 'spam', 'email_path': 'datasets/trec07p/data/inmail.3'},
 {'label': 'spam', 'email_path': 'datasets/trec07p/data/inmail.4'},
 {'label': 'spam', 'email_path': 'datasets/trec07p/data/inmail.5'},
 {'label': 'spam', 'email_path': 'datasets/trec07p/data/inmail.6'},
 {'label': 'spam', 'email_path': 'datasets/trec07p/data/inmail.7'},
 {'label': 'spam', 'email_path': 'datasets/trec07p/data/inmail.8'},
 {'label': 'spam', 'email_path': 'datasets/trec07p/data/inmail.9'},
 {'label': 'ham', 'email_path': 'datasets/trec07p/data/inmail.10'}]
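
The index parsing above can also be written so that it tolerates blank lines and does not depend on the exact `" ../"` separator. The following is only a sketch: `parse_index_lines` is a hypothetical helper that takes the lines directly, and it assumes the index file lives in the `full` subdirectory of the dataset, as in the paths used in this notebook.

```python
import posixpath


def parse_index_lines(lines, dataset_path="datasets/trec07p"):
    """Turn 'label ../data/inmail.N' lines into label/path dictionaries."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        label, relative_path = line.split(maxsplit=1)
        # Index paths are relative to the 'full' directory of the dataset.
        path = posixpath.normpath(
            posixpath.join(dataset_path, "full", relative_path))
        entries.append({"label": label, "email_path": path})
    return entries


sample = ["spam ../data/inmail.1\n", "ham ../data/inmail.2\n", "\n"]
for entry in parse_index_lines(sample):
    print(entry)
```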

# 2.- Preprocessing the dataset

The functions presented above make it possible to read the emails programmatically and to preprocess them, removing the components that are not useful for detecting SPAM. However, each email is still represented by a Python dictionary containing a series of words.

In [11]:
# Load the index with the labels into memory
index = parse_index("datasets/trec07p/full/index", 1)

In [12]:
# Read the first email
import os
open(index[0]["email_path"]).read()

'From RickyAmes@aol.com  Sun Apr  8 13:07:32 2007\nReturn-Path: <RickyAmes@aol.com>\nReceived: from 129.97.78.23 ([211.202.101.74])\n\tby speedy.uwaterloo.ca (8.12.8/8.12.5) with SMTP id l38H7G0I003017;\n\tSun, 8 Apr 2007 13:07:21 -0400\nReceived: from 0.144.152.6 by 211.202.101.74; Sun, 08 Apr 2007 19:04:48 +0100\nMessage-ID: <WYADCKPDFWWTWTXNFVUE@yahoo.com>\nFrom: "Tomas Jacobs" <RickyAmes@aol.com>\nReply-To: "Tomas Jacobs" <RickyAmes@aol.com>\nTo: the00@speedy.uwaterloo.ca\nSubject: Generic Cialis, branded quality@ \nDate: Sun, 08 Apr 2007 21:00:48 +0300\nX-Mailer: Microsoft Outlook Express 6.00.2600.0000\nMIME-Version: 1.0\nContent-Type: multipart/alternative;\n\tboundary="--8896484051606557286"\nX-Priority: 3\nX-MSMail-Priority: Normal\nStatus: RO\nContent-Length: 988\nLines: 24\n\n----8896484051606557286\nContent-Type: text/html;\nContent-Transfer-Encoding: 7Bit\n\n<html>\n<body bgcolor="#ffffff">\n<div style="border-color: #00FFFF; border-right-width: 0px; border-bottom-width: 0

In [13]:
# Parse the first email
mail, label = parse_email(index[0])
print("The email is:", label, "\n")
print(mail)

The email is: spam 

{'subject': ['gener', 'ciali', 'brand', 'qualiti'], 'body': ['do', 'feel', 'pressur', 'perform', 'rise', 'occasiontri', 'viagrayour', 'anxieti', 'thing', 'past', 'willb', 'back', 'old', 'self'], 'content_type': 'multipart/alternative'}


The logistic regression algorithm cannot take text as input. A series of additional functions must therefore be applied to transform the text of the parsed emails into a numeric representation.

##### Applying CountVectorizer

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
# Prepare the email as a single text string
# (the subject and body are concatenated without a separating space,
# so 'qualiti' and 'do' end up joined as 'qualitido')
prep_email = [" ".join(mail['subject']) + " ".join(mail['body'])]

vectorizer = CountVectorizer()
vectorizer.fit(prep_email)
print("Email:", prep_email, "\n")
print("Input features:", vectorizer.get_feature_names_out())

Email: ['gener ciali brand qualitido feel pressur perform rise occasiontri viagrayour anxieti thing past willb back old self'] 

Input features: ['anxieti' 'back' 'brand' 'ciali' 'feel' 'gener' 'occasiontri' 'old'
 'past' 'perform' 'pressur' 'qualitido' 'rise' 'self' 'thing' 'viagrayour'
 'willb']


In [15]:
X = vectorizer.transform(prep_email)
print("\nValues:\n", X.toarray())


Values:
 [[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
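
To make the numeric representation more concrete, the sketch below approximates by hand what `CountVectorizer` does with its default settings: lowercase the text, extract words of two or more characters, build a sorted vocabulary and count occurrences per document. `bag_of_words` is a hypothetical helper written for illustration, and the two documents are made up.

```python
import re
from collections import Counter

# Same default token rule as CountVectorizer: words of two or more characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def bag_of_words(documents):
    """Return the sorted vocabulary and one count vector per document."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in documents]
    vocabulary = sorted({tok for tokens in tokenized for tok in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(["discount price discount", "price click"])
print(vocabulary)
print(vectors)
```

Each column corresponds to one word of the vocabulary and each row to one email, which is why a single email whose words all appear once yields a row of ones.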


##### Applying OneHotEncoder

In [16]:
from sklearn.preprocessing import OneHotEncoder
prep_email = [[w] for w in mail['subject'] + mail['body']]

enc = OneHotEncoder(handle_unknown = 'ignore')
X = enc.fit_transform(prep_email)
print("features:\n", enc.get_feature_names_out())
print("Values:\n", X.toarray())

features:
 ['x0_anxieti' 'x0_back' 'x0_brand' 'x0_ciali' 'x0_do' 'x0_feel' 'x0_gener'
 'x0_occasiontri' 'x0_old' 'x0_past' 'x0_perform' 'x0_pressur'
 'x0_qualiti' 'x0_rise' 'x0_self' 'x0_thing' 'x0_viagrayour' 'x0_willb']
Values:
 [[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0.
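
Unlike the count vector, the one-hot encoding above produces one row per token, with a single 1 in the column of that token. A minimal hand-written sketch of the same idea follows; `one_hot` is a hypothetical helper and the three tokens are taken from the subject of the example email.

```python
def one_hot(tokens):
    """Return the sorted vocabulary and one indicator row per token."""
    vocabulary = sorted(set(tokens))
    position = {tok: i for i, tok in enumerate(vocabulary)}
    rows = []
    for tok in tokens:
        row = [0] * len(vocabulary)
        row[position[tok]] = 1  # mark the column that belongs to this token
        rows.append(row)
    return vocabulary, rows


vocab, rows = one_hot(["gener", "ciali", "brand"])
print(vocab)
for row in rows:
    print(row)
```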

##### Helper functions for preprocessing the dataset

In [17]:
def create_prep_dataset(index_path, n_elements):
    X = []
    y = []
    indexes = parse_index(index_path, n_elements)
    for i in range(n_elements):
        print("\rParsing email: {0}".format(i+1), end = "")
        mail, label = parse_email(indexes[i])
        X.append(" ".join(mail['subject']) + " ".join(mail['body']))
        y.append(label)
    return X,y
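
Note that `" ".join(subject) + " ".join(body)` glues the last subject token to the first body token, which is where 'qualitido' in the outputs comes from. One possible variant, shown only as a sketch with the hypothetical helper name `join_email_tokens`, joins both token lists in a single call.

```python
def join_email_tokens(subject_tokens, body_tokens):
    """Join subject and body tokens into one space-separated string."""
    return " ".join(subject_tokens + body_tokens)


print(join_email_tokens(["gener", "qualiti"], ["do", "feel"]))
```

Using it would change the vocabulary, so the feature counts and accuracies recorded in this notebook would not be reproduced exactly.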

# 3.- Training the algorithm

In [18]:
# Read only a subset of 1,500 emails
X_train, y_train = create_prep_dataset("datasets/trec07p/full/index", 1500)
X_train

Parsing email: 1500

['gener ciali brand qualitido feel pressur perform rise occasiontri viagrayour anxieti thing past willb back old self',
 'typo debianreadmhi ive updat gulu i check mirrorsit seem littl typo debianreadm fileexamplehttpgulususherbrookecadebianreadmeftpftpfrdebianorgdebianreadmetest lenni access releas diststest thecurr test develop snapshot name etch packag whichhav test unstabl pass autom test propog tothi releaseetch replac lenni like readmehtml yan morinconsult en logiciel libreyanmorinsavoirfairelinuxcom5149941556 to unsubscrib email debianmirrorsrequestlistsdebianorgwith subject unsubscrib troubl contact listmasterlistsdebianorg',
 'authent viagramega authenticv i a g r a discount pricec i a l i s discount pricedo miss it click herehttpwwwmoujsjkhchumcom authent viagramega authenticv i a g r a discount pricec i a l i s discount pricedo miss it click',
 'nice talk yahey billi realli fun go night talk said feltinsecur manhood i notic toiletsy quit small area worri websit i tell secret

##### Vectorizing the data

In [19]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [20]:
print(X_train.toarray())
print("\nFeatures:", len(vectorizer.get_feature_names_out()))

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Features: 33784


In [21]:
import pandas as pd

In [22]:
pd.DataFrame(X_train.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,00,000,0000,000000,000000001,0000000181,00000018482,000000categori,000000ctoptxt1,000000smallitalictext,...,淶ҵƿϊһ淶ĵƶȶҳ,绰tel,绰۹ϵͳctsƽe,肾ǝvă,鏗ėvłq,饻jwkݤ,鵵χ2ʶ3ҵ൵νռࡢ鵵õȹҫդãˡҵĵԭ뷽ҵĵĸصҵĵԭҵĵļšҵ1ҵߵ2ҵߵ,뵭袵,뼰ʱϵ,쵼ã
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1495,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1496,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1497,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1498,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
X_train

<1500x33784 sparse matrix of type '<class 'numpy.int64'>'
	with 149988 stored elements in Compressed Sparse Row format>

##### Training the logistic regression algorithm on the preprocessed dataset

In [24]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train,y_train)

# 4.- Prediction

In [25]:
# Read 1,500 emails from our dataset and keep only the last 500.
# Note: the training cell above was also fitted on all 1,500 of these emails,
# so these 500 emails are not unseen data and the accuracy below is optimistic
X, y = create_prep_dataset("datasets/trec07p/full/index", 1500)
X_test = X[1000:]
y_test = y[1000:]

Parsing email: 1500
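
For the evaluation to be meaningful, the test emails must be held out before the vectorizer and the classifier are fitted. The sketch below shows one simple way to do that with a positional split. `split_train_test` is a hypothetical helper, and the six texts and labels are toy data standing in for the preprocessed emails.

```python
def split_train_test(X, y, n_train):
    """Use the first n_train samples for training and the rest for testing."""
    if not 0 < n_train < len(X):
        raise ValueError("n_train must leave at least one sample on each side")
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]


# Toy data standing in for the preprocessed emails and their labels.
texts = ["mail a", "mail b", "mail c", "mail d", "mail e", "mail f"]
labels = ["spam", "ham", "spam", "spam", "ham", "spam"]

X_tr, X_te, y_tr, y_te = split_train_test(texts, labels, n_train=4)
# No email may appear on both sides of the split.
assert not set(X_tr) & set(X_te)
print(len(X_tr), len(X_te))
```

With the 1,500 emails read here, calling it with `n_train=1000` and fitting only on the first part would leave the last 500 emails unseen by the model.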

##### Preprocessing the emails with the vectorizer created earlier

In [26]:
X_test = vectorizer.transform(X_test)

In [27]:
y_pred = clf.predict(X_test)
y_pred

array(['spam', 'ham', 'ham', 'spam', 'spam', 'ham', 'spam', 'spam',
       'spam', 'spam', 'ham', 'spam', 'ham', 'spam', 'spam', 'ham', 'ham',
       'ham', 'ham', 'spam', 'spam', 'ham', 'ham', 'ham', 'spam', 'ham',
       'spam', 'spam', 'ham', 'spam', 'ham', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'ham', 'spam', 'spam', 'ham',
       'ham', 'ham', 'spam', 'spam', 'spam', 'spam', 'ham', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'spam',
       'ham', 'spam', 'ham', 'spam', 'spam', 'spam', 'ham', 'spam', 'ham',
       'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'ham', 'ham', 'spam', 'spam', 'spam',
       'spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'spam', 'spam', 'spam',
       'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'ham',
       'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spa

In [28]:
print("Prediction:\n", y_pred)
print("\nTrue labels:\n", y_test)

Prediction:
 ['spam' 'ham' 'ham' 'spam' 'spam' 'ham' 'spam' 'spam' 'spam' 'spam' 'ham'
 'spam' 'ham' 'spam' 'spam' 'ham' 'ham' 'ham' 'ham' 'spam' 'spam' 'ham'
 'ham' 'ham' 'spam' 'ham' 'spam' 'spam' 'ham' 'spam' 'ham' 'spam' 'spam'
 'spam' 'spam' 'spam' 'spam' 'spam' 'ham' 'spam' 'spam' 'ham' 'ham' 'ham'
 'spam' 'spam' 'spam' 'spam' 'ham' 'spam' 'spam' 'spam' 'spam' 'spam'
 'spam' 'ham' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam'
 'ham' 'spam' 'ham' 'spam' 'ham' 'spam' 'spam' 'spam' 'ham' 'spam' 'ham'
 'spam' 'spam' 'ham' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam'
 'spam' 'ham' 'ham' 'spam' 'spam' 'spam' 'spam' 'spam' 'ham' 'ham' 'ham'
 'ham' 'spam' 'spam' 'spam' 'ham' 'spam' 'spam' 'spam' 'spam' 'spam' 'ham'
 'ham' 'spam' 'ham' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam'
 'spam' 'spam' 'ham' 'ham' 'ham' 'ham' 'spam' 'spam' 'spam' 'ham' 'spam'
 'ham' 'ham' 'spam' 'spam' 'spam' 'ham' 'spam' 'spam' 'spam' 'ham' 'spam'
 'ham' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam

##### Evaluating the results

In [29]:
from sklearn.metrics import accuracy_score

print("Accuracy: {:.3f}".format(accuracy_score(y_test, y_pred)))

Accuracy: 1.000
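
Accuracy is the fraction of correct predictions. Since roughly two thirds of the corpus is spam, it can be worth looking at precision and recall for the spam class as well. The sketch below computes the three values by hand. `binary_metrics` is a hypothetical helper and the labels are made up for illustration; they are not results from this notebook.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Accuracy, precision and recall for one positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels for illustration only.
y_true = ["spam", "spam", "ham", "spam", "ham"]
y_hat = ["spam", "ham", "ham", "spam", "spam"]
print(binary_metrics(y_true, y_hat))
```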


# 5.- Increasing the size of the dataset

In [30]:
# Read 12,000 emails
X, y =  create_prep_dataset("datasets/trec07p/full/index", 12000)

Parsing email: 12000

In [31]:
# Use 10,000 emails to train the algorithm and 2,000 for testing
X_train, y_train = X[:10000], y[:10000]
X_test, y_test = X[10000:], y[10000:]


In [32]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [33]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

In [34]:
X_test = vectorizer.transform(X_test)

In [35]:
y_pred = clf.predict(X_test)

In [36]:
print("Accuracy: {:.3f}".format(accuracy_score(y_test, y_pred)))

Accuracy: 0.984


In [72]:
# Read 10,000 emails
X, y =  create_prep_dataset("datasets/trec07p/full/index",10000)


Parsing email: 10000

In [79]:
# Use 5,000 emails to train the algorithm and 5,000 for testing
X_train, y_train = X[:5000], y[:5000]
X_test, y_test = X[5000:], y[5000:]

In [80]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [81]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

In [82]:
X_test = vectorizer.transform(X_test)

In [83]:
y_pred = clf.predict(X_test)

In [84]:
print("Accuracy: {:.2f}".format(accuracy_score(y_test, y_pred)))

Accuracy: 0.99
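
The vectorize, fit, transform and predict steps repeated in the cells above can be bundled into a single scikit-learn pipeline, so that the vectorizer is always fitted on the training texts only. The following is a sketch on a tiny made-up corpus, not on trec07p; the texts, the labels and the `max_iter=1000` setting are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, made-up corpus standing in for the preprocessed emails.
train_texts = [
    "cheap viagra discount price click",
    "discount ciali price click here",
    "meeting agenda project report attach",
    "debian mirror packag releas test",
]
train_labels = ["spam", "spam", "ham", "ham"]

# The pipeline fits the vectorizer and the classifier in one call.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

# predict() vectorizes the new texts with the already fitted vocabulary.
predictions = model.predict(["discount price click", "project report releas"])
print(predictions)
```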
