# IIC-3800 Topics in Computer Science - NLP UC

- Library versions, Python 3.8.10

- numpy 1.20.3
- nltk 3.7
- lime 0.2.0.1

____________________________________________________________________________________________________________

## In-class activity

Build **MultinomialNB** document classifiers for the 20Newsgroups dataset. To do so:

- Build a first representation by vectorizing the corpus with **Idf=1** for every word in the vocabulary.
- Build a second representation using **sublinear_tf** and the default Idf.
- Train one MultinomialNB classifier for each representation.
- Evaluate the classifiers with **classification_report** on the test split.
- When you finish, let me know so I can award you an **L (logrado, i.e. achieved)**.
- Remember that each L earns a bonus on the final course grade.

Check the documentation to complete the activity:

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer

***You have until the end of class.***

_________________________________________________________________________________________________________________

# Solution

In [1]:
from sklearn.datasets import fetch_20newsgroups

# Strip headers, footers, and quoted replies so only the message body remains
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

In [2]:
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer, sent_tokenize
from nltk.stem import WordNetLemmatizer

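# Load stop words
# Assumes the NLTK resources 'stopwords', 'punkt', and 'wordnet' are already available (e.g. via nltk.download)
stop_words = set(stopwords.words('english'))

# Initialize tokenizer: keep runs of letters and apostrophes
tokenizer = RegexpTokenizer(r"['a-zA-Z]+")

# Initialize lemmatizer
# It's also possible to try a stemmer, or to combine a stemmer with a lemmatizer
lemmatizer = WordNetLemmatizer()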

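def tokenize(document):
    """Return the document as a space-joined string of lowercased, lemmatized tokens."""
    words = []

    for sentence in sent_tokenize(document):
        # Keep tokens longer than two characters that are not stop words
        tokens = [
            lemmatizer.lemmatize(t.lower())
            for t in tokenizer.tokenize(sentence)
            if t.lower() not in stop_words and len(t) > 2
        ]
        words += tokens

    return ' '.join(words)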


In [3]:
# Preprocess every document in both splits
train_docs = [tokenize(raw_text) for raw_text in newsgroups_train.data]
test_docs = [tokenize(raw_text) for raw_text in newsgroups_test.data]

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

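# Representation 1: use_idf=False disables Idf reweighting, i.e. Idf = 1 for every term
vectorizer_1 = TfidfVectorizer(use_idf=False)
vectors_1 = vectorizer_1.fit_transform(train_docs)
# Reuse the vocabulary fitted on the training split
vectors_test_1 = vectorizer_1.transform(test_docs)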

In [6]:
# Representation 2: sublinear tf scaling (1 + log(tf)) with the default Idf
vectorizer_2 = TfidfVectorizer(sublinear_tf=True)
vectors_2 = vectorizer_2.fit_transform(train_docs)
vectors_test_2 = vectorizer_2.transform(test_docs)
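
To make the difference between the two representations concrete, here is a minimal sketch on a hypothetical two-document toy corpus (`toy_docs` is not part of the activity). It sets `norm=None`, which the solution does not do, purely so that the unnormalized weights are visible. With `use_idf=False` each weight is the plain term count, while `sublinear_tf=True` replaces the count `tf` with `1 + ln(tf)` before multiplying by the default smoothed Idf.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
toy_docs = ["spam spam spam eggs", "eggs bacon"]

# Representation 1: Idf fixed to 1, so each weight is the raw term count.
# norm=None is an assumption made here to keep the weights unnormalized.
flat = TfidfVectorizer(use_idf=False, norm=None)
X_flat = flat.fit_transform(toy_docs).toarray()

# Representation 2: sublinear tf (1 + ln(tf)) times the default smoothed Idf
sub = TfidfVectorizer(sublinear_tf=True, norm=None)
X_sub = sub.fit_transform(toy_docs).toarray()

spam = flat.vocabulary_["spam"]  # column index of the term "spam"

# Default smoothed Idf: ln((1 + n_docs) / (1 + doc_freq)) + 1
n_docs, doc_freq = 2, 1
expected_sub = (1 + np.log(3)) * (np.log((1 + n_docs) / (1 + doc_freq)) + 1)

print(X_flat[0, spam])               # weight of "spam" in the first document, Idf = 1
print(X_sub[0, spam], expected_sub)  # the two values should agree
```

In the solution itself the default `norm='l2'` stays active, so every document vector is additionally scaled to unit length.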

In [7]:
from sklearn.naive_bayes import MultinomialNB

# Classifier trained on representation 1 (Idf = 1)
clf1 = MultinomialNB()
clf1.fit(vectors_1, newsgroups_train.target)
# Mean accuracy on the test split
clf1.score(vectors_test_1, newsgroups_test.target)

0.6271906532129581

In [8]:
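# Classifier trained on representation 2 (sublinear_tf, default Idf)
clf2 = MultinomialNB()
clf2.fit(vectors_2, newsgroups_train.target)
# Mean accuracy on the test split
clf2.score(vectors_test_2, newsgroups_test.target)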

0.6595857673924589
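
The vectorize, fit, and score steps in the cells above can also be bundled into a single object. The following is a minimal sketch using scikit-learn's `make_pipeline`. The toy documents and labels are hypothetical stand-ins for `train_docs` and `newsgroups_train.target`, chosen so the example runs without downloading the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the 20Newsgroups documents
toy_train = ["rocket launch orbit", "orbit satellite rocket",
             "goal match team", "team player goal"]
toy_labels = [0, 0, 1, 1]  # 0 = space, 1 = sport
toy_test = ["rocket orbit", "team goal"]
toy_test_labels = [0, 1]

# One pipeline per representation; each vectorizer is fitted on training data only
pipe_flat = make_pipeline(TfidfVectorizer(use_idf=False), MultinomialNB())
pipe_sub = make_pipeline(TfidfVectorizer(sublinear_tf=True), MultinomialNB())

for name, pipe in [("Idf=1", pipe_flat), ("sublinear_tf", pipe_sub)]:
    pipe.fit(toy_train, toy_labels)
    # predict() and score() vectorize the raw test strings internally
    print(name, pipe.predict(toy_test), pipe.score(toy_test, toy_test_labels))
```

A pipeline keeps each vectorizer paired with its classifier, which makes it harder to score a model on vectors produced by the wrong representation.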

In [9]:
from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 for the Idf = 1 representation
Y_predict_1 = clf1.predict(vectors_test_1)
print(classification_report(newsgroups_test.target, Y_predict_1))

              precision    recall  f1-score   support

           0       0.73      0.07      0.13       319
           1       0.66      0.62      0.64       389
           2       0.66      0.50      0.57       394
           3       0.53      0.71      0.61       392
           4       0.76      0.55      0.64       385
           5       0.73      0.77      0.75       395
           6       0.80      0.75      0.78       390
           7       0.81      0.65      0.72       396
           8       0.83      0.69      0.75       398
           9       0.91      0.73      0.81       397
          10       0.57      0.90      0.70       399
          11       0.50      0.79      0.61       396
          12       0.64      0.49      0.55       393
          13       0.81      0.72      0.76       396
          14       0.76      0.69      0.73       394
          15       0.31      0.92      0.46       398
          16       0.56      0.66      0.61       364
          17       0.81    

In [10]:
# Per-class precision, recall, and F1 for the sublinear_tf representation
Y_predict_2 = clf2.predict(vectors_test_2)
print(classification_report(newsgroups_test.target, Y_predict_2))

              precision    recall  f1-score   support

           0       0.76      0.12      0.21       319
           1       0.72      0.67      0.69       389
           2       0.71      0.50      0.59       394
           3       0.54      0.74      0.63       392
           4       0.77      0.62      0.68       385
           5       0.76      0.81      0.78       395
           6       0.83      0.76      0.79       390
           7       0.84      0.72      0.78       396
           8       0.86      0.71      0.78       398
           9       0.93      0.78      0.85       397
          10       0.59      0.93      0.72       399
          11       0.51      0.82      0.63       396
          12       0.72      0.51      0.60       393
          13       0.85      0.74      0.80       396
          14       0.80      0.74      0.77       394
          15       0.33      0.93      0.49       398
          16       0.57      0.69      0.62       364
          17       0.80    
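
The reports above identify classes by numeric index. `classification_report` accepts a `target_names` argument, so passing `newsgroups_test.target_names` in the cells above should label each row with its newsgroup name. The sketch below shows the idea on hypothetical labels and predictions (`y_true`, `y_pred`, and `names` are illustrative, not results from the activity).

```python
from sklearn.metrics import classification_report

# Hypothetical labels and predictions, for illustration only
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2]
names = ["alt.atheism", "comp.graphics", "comp.os.ms-windows.misc"]

# target_names replaces the numeric class indices with readable names
print(classification_report(y_true, y_pred, target_names=names))

# output_dict=True returns the same metrics as a nested dictionary
report = classification_report(y_true, y_pred, target_names=names, output_dict=True)
print(report["alt.atheism"]["recall"])
```

The dictionary form is convenient for comparing a single metric, such as per-class recall, across the two classifiers.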