# Template for the homework


## Task
- Classify each document in one of 20 categories.
- The objective is obtain the better accuracy in the test set. You can use any library and model explained in the course.
- The delivery are a unique jupiter notebook with all the code. Must run in the course Anaconda environment. Not use additional libraries.
- Send the notebook named homework\_[name]\_[surename].ipynb to sueiras@gmail.com before November 20th.

## Template structure

- A Jupiter notebook template is provided to do the task. Structure:
  - Read the train and validation data.
  - Transform to generate numerical features.: Build your transformations here
  - Model: Build your model or models here. Check the accuracy over the validation set.
  - Evaluate results: Build your scoring function here and apply it over the test set.
- You need to complete the transform and model steps to achieve the best result in the evaluation metric, the accuracy, in test set.
- Is completely forbidden load and use the test set except once in the final evaluate results step.

## Evaluation

- Exercise evaluated in 0-10 range points.
- To obtain 5 points you must deliver a notebook without errors that provide a solution whit a minimum accuracy of 67%.
- If you obtain an accuracy over 87% you have 10 points.
- Intermediated accuracies between 67% and 87% obtain intermediated points proportionally, but depending of the quality of the work is possible to reduce or increase a maximum of 2 the points assigned automatically by accuracy. 
 

In [210]:
# Header
from __future__ import print_function

import pandas as pd
import numpy as np

## 01 Load Data


In [211]:
from sklearn.datasets import fetch_20newsgroups

twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)

print(twenty_train.target_names) #categorías a clasificar

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [212]:
print (twenty_train.filenames.shape)
print (twenty_train.target.shape)

(11314,)
(11314,)


## 02 Text Encoding an preprocessing

A continuación nos vamos a crear una función que nos va a hacer un preprocesado y limpieza de nuestro texto: quitar puntuación, contracciones, dígitos, stopwords...

In [216]:
from string import punctuation
from nltk.corpus import stopwords

import re
from nltk.stem.snowball import SnowballStemmer


def clean_text(text):
     
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r"\d+.*", "digits", text)
    
    #remove punctuation
    text = ''.join(c for c in text if c not in punctuation)    
    

    ## Convert words to lower case
    text = text.lower().replace("\n", " ").split()
    
    #Remove stop words
    stops = set(stopwords.words("english"))
    text = ([w for w in text if w not in stops])
    text = " ".join(text)

    #Stemming
    text = text.split()
    stemmer = SnowballStemmer('english')
    stemmed_words = [stemmer.stem(word) for word in text]
    text = " ".join(stemmed_words)
    return text


In [217]:
texto =[]
for text in twenty_train.data:
    texto.append(clean_text(text))

In [218]:
# Separate train and validation
from sklearn.model_selection import train_test_split

# Recommended 20% to validation. 
text_trn, text_val, y_trn, y_val = train_test_split(texto, twenty_train.target, test_size=0.2)
print(len(text_trn), len(text_val))

9051 2263


# PRUEBA 1: SKLEARN models

Comenzamos el análisis con modelos de sklearn. Vamos a utilizar diferentes modelos para ver qué tal funcionan en validación.

In [265]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer

# Extract word ocurrences
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=50000)
X_train_counts = tf_vectorizer.fit_transform(text_trn)

#From occurrences to frequencies
tfidf_transformer = TfidfTransformer().fit(X_train_counts)
X_train_tf = tfidf_transformer.transform(X_train_counts)


def encoding_text(text):
    text_counts = tf_vectorizer.transform(text)
    text_tf = tfidf_transformer.transform(text_counts)
    return text_tf

X_trn = encoding_text(text_trn)
X_val=encoding_text(text_val)

In [266]:
from sklearn.metrics import accuracy_score, confusion_matrix

#multinomial
from sklearn.naive_bayes import MultinomialNB
clf_multi =  MultinomialNB(alpha=0.1).fit(X_trn, y_trn)
pred_multi = clf_multi.predict(X_val)
print('Multinomial accuracy val: ', accuracy_score(y_val, pred_multi))

#reg. logistica
from sklearn.linear_model import LogisticRegression
clf_log = LogisticRegression(random_state=0).fit(X_trn, y_trn)
pred_log = clf_log.predict(X_val)
print('Reg.logistica accuracy val: ', accuracy_score(y_val, pred_log))

#Random Forest
from sklearn.ensemble import RandomForestClassifier
clf_rf = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0).fit(X_trn, y_trn)
pred_rf = clf_rf.predict(X_val)
print('RF accuracy val: ', accuracy_score(y_val, pred_rf))


#SVM
from sklearn import svm
clf_svc = svm.LinearSVC(C=1).fit(X_trn, y_trn)
pred_svc = clf_svc.predict(X_val)
print('SVC accuracy val: ', accuracy_score(y_val, pred_svc))


#SGD
from sklearn.linear_model import SGDClassifier
clf_sgd=SGDClassifier( alpha=0.0001).fit(X_trn, y_trn)
pred_sgd = clf_sgd.predict(X_val)
print('SGD accuracy val: ', accuracy_score(y_val, pred_sgd))


#Percep
from sklearn.linear_model import Perceptron
clf_perc = Perceptron(max_iter=50, n_jobs=-1).fit(X_trn, y_trn)
pred_perc = clf_perc.predict(X_val)
print('Perc accuracy val: ', accuracy_score(y_val, pred_perc))




Multinomial accuracy val:  0.8965974370304906
Reg.logistica accuracy val:  0.8780380026513478
RF accuracy val:  0.5846221829429961


KeyboardInterrupt: 

Observamos que tanto SVC como SGD funcionan bastante bien. Vamos a crear un gridsearch para optimizar los parámetros del modelo SGD y ver si mejora:

In [263]:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# pipeline = Pipeline([
#     ('vect', CountVectorizer(max_df=0.95, min_df=2,max_features=50000)),
#     ('tfidf', TfidfTransformer()),
#     ('clf', SGDClassifier()),
# ])

parameters={'alpha': (0.0001,0.001,0.01,0.1),
 'n_iter': (80, 100, 120),
 'penalty': ('l2', 'elasticnet')}

model_sgd =SGDClassifier()

grid_search = GridSearchCV(model_sgd, parameters, n_jobs=-1, verbose=1)

grid_search.fit(X_trn, y_trn) 
best_parameters = grid_search.best_estimator_.get_params()
print("Best score: %0.3f" % grid_search.best_score_)

predicted = grid_search.predict(X_val)

print('accuracy test: ', accuracy_score(y_val, predicted))


Fitting 3 folds for each of 24 candidates, totalling 72 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  4.0min
[Parallel(n_jobs=-1)]: Done  72 out of  72 | elapsed:  6.5min finished


Best score: 0.891
accuracy test:  0.9045514803358374


Tras crear el modelo, veamos que parámetros son los más adecuados y qué tal se comporta en validación:

In [264]:
grid_search.best_params_

{'alpha': 0.0001, 'n_iter': 80, 'penalty': 'l2'}

In [246]:
from sklearn.metrics import accuracy_score, confusion_matrix
predicted = grid_search.predict(text_val)
print('accuracy val: ', accuracy_score(y_val, predicted))

accuracy val:  0.9041095890410958


A continuación vamos a probar a lanzar una red neural para ver si podemos conseguir un mejor accuracy:

# RNN

Partimos de nuestra partición de train y validación. Vamos a crearnos una función que tokenice y a partir de ella construirnos el diccionario con las palabras.

In [195]:
# Keras
from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.preprocessing import sequence
from tensorflow.python.keras.layers import Dense, Flatten, LSTM, Conv1D, MaxPooling1D, Dropout, Activation
from tensorflow.python.keras.layers import Embedding
from tensorflow.python.keras import optimizers



In [196]:
from nltk import word_tokenize
#tokenizamos sobre train
def tokenize(prueba):
    tokens = []
    for sentence in prueba:
        tokens += [word_tokenize(sentence)]

    return tokens
tokens= tokenize(text_trn)
# print(tokens)

In [197]:
#create the dictionary to conver words to numbers. Order it with most frequent words first
def build_dict(sentences):
#    from collections import OrderedDict

    '''
    Build dictionary of train words
    Outputs: 
     - Dictionary of word --> word index
     - Dictionary of word --> word count freq
    '''
    print( 'Building dictionary..',)
    wordcount = dict()
    #For each word in each sentence, cummulate frequency
    for ss in sentences:
        for w in ss:
            if w not in wordcount:
                wordcount[w] = 1
            else:
                wordcount[w] += 1

    counts = list(wordcount.values()) # List of frequencies
    keys = list(wordcount) #List of words
    
    sorted_idx = reversed(np.argsort(counts))
    
    worddict = dict()
    for idx, ss in enumerate(sorted_idx):
        worddict[keys[ss]] = idx+2  # leave 0 and 1 (UNK)
    print( np.sum(counts), ' total words ', len(keys), ' unique words')

    return worddict, wordcount


worddict, wordcount = build_dict(tokens)

# print(worddict['the'], wordcount['the'])

Building dictionary..
1476511  total words  117790  unique words


In [198]:
def generate_sequence(sentences, dictionary):
    '''
    Convert tokenized text in sequences of integers
    '''
    seqs = [None] * len(sentences)
    for idx, ss in enumerate(sentences):
        seqs[idx] = [dictionary[w] if w in dictionary else 1 for w in ss]

    return seqs

In [199]:
X_train_full = generate_sequence(tokens, worddict)
print(X_train_full[0], y_trn[0])

[38861, 2044, 48061, 2, 224, 7485, 26, 5, 1800, 17292, 180, 9758, 7201, 3, 272, 372, 15825, 5151, 2234, 4059, 366, 334, 34, 52, 34, 270, 154, 575, 134, 562, 1341, 5151, 2234, 18, 2142, 285, 548, 364, 226, 3932, 5151, 2234, 20, 285, 548, 5124, 339, 888, 55399, 111, 1527, 1012, 30, 500, 1053, 188, 460, 4, 192, 752, 232, 1127, 548, 2574, 2044, 48061, 38861, 1800, 17292, 180, 168, 55414, 418, 4276, 562, 3033, 808, 12506, 55434, 53, 9758, 1800, 55432, 12506, 55428, 752, 256, 895, 246, 55352] 11


In [200]:
X_val_full = generate_sequence(tokenize(text_val), worddict)


In [201]:
max_features = 5000 # Number of most frequent words selected. the less frequent recode to 0
maxlen = 200 

In [202]:
def remove_features(x):
    return [[0 if w >= max_features else w for w in sen] for sen in x]

X_train = remove_features(X_train_full)
X_val  = remove_features(X_val_full)


In [203]:
from tensorflow.contrib.keras import preprocessing


X_train = preprocessing.sequence.pad_sequences(X_train, maxlen=maxlen)
X_val = preprocessing.sequence.pad_sequences(X_val, maxlen=maxlen)

print('X_train shape:', X_train.shape)
print('X_val shape:', X_val.shape)


X_train shape: (9051, 200)
X_val shape: (2263, 200)


In [204]:
from tensorflow.python.keras import optimizers
from tensorflow.python.keras.layers import Conv1D
from tensorflow.python.keras.layers import MaxPooling1D

In [205]:
## Network architecture
model = Sequential()
model.add(Embedding(5000, 32, input_length=200))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(100))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(20, activation='softmax'))
rms_optimizer = optimizers.RMSprop(lr=0.001)

model.compile(loss='sparse_categorical_crossentropy', optimizer=rms_optimizer, metrics=['accuracy'])
#sparse_categorical_crossentropy si no categorizo y_trn

## Fit the model
model.fit(X_train, y_trn, validation_data=(X_val, y_val), epochs=20, batch_size=128)

Train on 9051 samples, validate on 2263 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras._impl.keras.callbacks.History at 0x2504165b7f0>

No observamos que sea un buen modelo.

## 05 Evaluate test data
- Don't edit after this!!!
- Execute only ONCE whit the optimal model selected based on the validation accuracy metric calculated over multiple experiments.

Finalmente nos quedamos con el modelo SGD con el que obtuvimos 0.89 en validación:

In [247]:
twenty_test = fetch_20newsgroups(subset='test')

In [248]:
textotest =[]
for text in twenty_test.data:
    textotest.append(clean_text(text))

In [258]:
X_test.shape

(7532, 29727)

In [240]:
X_test=encoding_text(textotest)
predicted = clf_sgd.predict(X_test)
print('Accuracy test: ', accuracy_score(twenty_test.target, predicted))



AttributeError: lower not found

Obtenemos una precisión de 0.7867 en test.