# Practical Assignment 2 (TP2): Problem Statement

The second assignment (TP2) is a Machine Learning competition in which each group must try to determine, for each tweet provided, whether or not it refers to a real event.

The competition is hosted on Kaggle: https://www.kaggle.com/c/nlp-getting-started.

The dataset consists of a set of tweets, for each of which the following fields are provided:

<br/>

* id - unique identifier for each tweet
* text - the text of the tweet
* location - location the tweet was sent from (may be missing)
* keyword - a keyword from the tweet (may be missing)
* target - in train.csv, indicates whether the tweet is about a real disaster (1) or not (0)
 
<br/><br/>


Submissions with the results must have the following format:

* Id: a numeric id identifying the tweet
* target: 1 or 0, depending on whether the tweet is believed to be about a real disaster or not
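
The format above can be sketched as a small CSV writer. This is only an illustration: the lowercase header `id,target`, the helper name `write_submission` and the sample predictions are assumptions made for the example, not something stated in the assignment.

```python
import csv
import io


def write_submission(rows, out):
    """Write (id, target) pairs as a two-column CSV with a header row."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "target"])  # assumed header spelling
    for tweet_id, target in rows:
        if target not in (0, 1):
            raise ValueError(f"target must be 0 or 1, got {target!r}")
        writer.writerow([int(tweet_id), int(target)])


# Hypothetical predictions: (tweet id, predicted label)
predictions = [(0, 1), (2, 0), (3, 1)]
buffer = io.StringIO()
write_submission(predictions, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

Writing to a file instead of a buffer only requires passing an open file object as `out`.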

Groups must try different Machine Learning algorithms to predict whether each tweet is based on real events or not. As they run experiments, groups must submit the corresponding results to Kaggle so that they can be evaluated.

At the end of the competition, the group with the best result will earn 10 points for each of its members, which can be used in the promotion exam or the second make-up exam.

Requirements for the TP2 submission:

- The TP must be programmed in Python or R.
- A PDF report must be submitted covering the algorithms tried, the final algorithm used, the transformations applied to the data, feature engineering, etc.
- The report must also include a link to a GitHub repository containing the PDF report and all the code.
- The group must present the TP on a computer on the date set by the course staff. The TP must run in a reasonable amount of time (under 1 hour) and generate a valid submission that matches the group's best result on Kaggle (further details to be defined).

TP2 will be graded according to the following criteria:

- Amount of work (effort) by the group: Did they try many algorithms? Did they do a good job of data pre-processing and feature engineering?
- Result obtained on Kaggle (the better the result, the better the grade).
- Final presentation of the report: quality of the writing, use of information obtained in TP1, conclusions presented.
- Performance of the final solution.


# 1. Preprocessing
#### Introduction
The data is loaded as in TP1. No EDA is performed here, only the text preprocessing.

### 1.1 Installing libraries

In [None]:
!pip install nltk
!pip install stopwords
!pip install gensim

!pip install scikit-learn
!pip install xgboost==0.7.post4

!pip3 install tensorflow
!pip3 install tensorflow_hub
!pip3 install "tqdm>=4.46.0"

!pip3 install keras
!pip3 install seaborn

### 1.2 Importing libraries

In [None]:
import pandas as pd
import numpy as np

import warnings
import re
import string
import nltk

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

from sklearn import model_selection

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier, Perceptron
from sklearn.metrics import f1_score,roc_auc_score, accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB,GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.decomposition import PCA, TruncatedSVD

import matplotlib
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt

from xgboost import XGBClassifier

from time import process_time

import tensorflow_hub as hub 
import tensorflow as tf 

from tqdm.notebook import tqdm 
from sklearn.ensemble import RandomForestRegressor


warnings.filterwarnings('ignore')

In [None]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')
nltk.download('wordnet')

In [None]:
tqdm.pandas()

### 1.3 Loading the data
Reading the training and test data.

In [None]:
tweets_train = pd.read_csv('../data/train.csv', encoding='utf-8')
tweets_test = pd.read_csv('../data/test.csv', encoding='utf-8')

### 1.4 Data cleaning
#### Introduction
Before starting, the text must be normalized, since after tokenization it is converted into vectors within a matrix. The techniques used are:
* **Uppercase/lowercase**: Convert everything to lower (or upper) case, since the same word gets a different representation if its capitalization changes.
* **Text cleaning**: Remove punctuation, numeric values, links, special characters, etc.
* **Tokenization**: The process of converting the text into a list of tokens.
* **Stopwords**: Remove common words that carry no information.
* **Stemming**: Strip suffixes so that words with the same meaning (or role in the text) are reduced to a common stem.
* **Lemmatization**: Unify words that mean the same by mapping them to their dictionary form (lemma).
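
As a rough illustration of the first four steps in the list above, the following sketch normalizes a single made-up tweet using only the standard library. The tiny stopword set, the sample tweet and the function name `normalize` are assumptions for the example; stemming and lemmatization are left out because they need NLTK resources.

```python
import re
import string

# Tiny illustrative stopword list (an assumption; NLTK's English list is much longer)
STOPWORDS = {"the", "a", "is", "in", "on", "at", "of", "and", "to"}


def normalize(text):
    """Lowercase, clean, tokenize and drop stopwords from a tweet."""
    text = text.lower()                                # case folding
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # drop links
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\w*\d\w*", "", text)               # drop tokens containing digits
    tokens = text.split()                              # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]


sample = "Forest FIRE near La Ronge in Canada!! http://t.co/abc123 13,000 evacuated"
tokens = normalize(sample)
print(tokens)
```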

Initialize copies of the datasets to try out the functions.

In [None]:
# Copies of the datasets for working on the text pre-processing
train_df1 = tweets_train.copy()
test_df1  = tweets_test.copy()

#### 1.4.1 Lowercase + text cleaning

In [None]:
# Pattern for removing emojis, taken from TP1
emoji_pattern = re.compile("["
         u"\U0001F600-\U0001F64F"  # emoticons
         u"\U0001F300-\U0001F5FF"  # symbols & pictographs
         u"\U0001F680-\U0001F6FF"  # transport & map symbols
         u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
         u"\U00002702-\U000027B0"
         u"\U000024C2-\U0001F251"
         "]+", flags=re.UNICODE)

def remove_emojis_non_ascii(text):
    # replace runs of consecutive non-ASCII characters with a space
    result = re.sub(r'[^\x00-\x7F]+', ' ', text)
    # remove emojis from the tweet
    result = emoji_pattern.sub(r'', result)
    return result


# Text-cleaning function (everything to LOWERCASE)
def text_clean(text):
    text = text.lower()
    text = remove_emojis_non_ascii(text)
    # raw strings keep the regex escapes (\[, \S, \w, \d) intact
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

In [None]:
# Apply the function to the copies of the training and test datasets
train_df1['text'] = train_df1['text'].progress_apply(lambda x: text_clean(x))
test_df1['text'] = test_df1['text'].progress_apply(lambda x: text_clean(x))

#### 1.4.2 Tokenization
_Try the different tokenizers offered by the nltk library_

In [None]:
# Option: tokenize with nltk's RegexpTokenizer
#tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

# Option: tokenize with WhitespaceTokenizer
#tokenizer = nltk.tokenize.WhitespaceTokenizer()

# Option: tokenize with WordPunctTokenizer
#tokenizer = nltk.tokenize.WordPunctTokenizer()

# Tokenizer in use: TreebankWordTokenizer
self_tokenizer = nltk.tokenize.TreebankWordTokenizer()


In [None]:
train_df1['text'] = train_df1['text'].progress_apply(lambda x: self_tokenizer.tokenize(x))
test_df1['text'] = test_df1['text'].progress_apply(lambda x: self_tokenizer.tokenize(x))

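#### 1.4.3 Stopwords

In [None]:
# Build the stopword set once instead of reloading the list for every token
STOP_WORDS = set(stopwords.words('english'))

# Function to remove stopwords from a list of tokens
def text_stopwords(text):
    words = [w for w in text if w not in STOP_WORDS]
    return words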

In [None]:
train_df1['text'] = train_df1['text'].progress_apply(lambda x : text_stopwords(x))
test_df1['text'] = test_df1['text'].progress_apply(lambda x : text_stopwords(x))

#### 1.4.4 Stemming + Lemmatizing
Try them to see whether they add anything.

In [None]:
# Functions for stemming and lemmatizing
def text_stemming(text):
    tokenizer = nltk.tokenize.TreebankWordTokenizer()
    tokens = tokenizer.tokenize(text)
    stemmer = nltk.stem.PorterStemmer()
    text_stemmed = " ".join(stemmer.stem(token) for token in tokens)
    return text_stemmed

def text_lemmatizing(text):
    tokenizer = nltk.tokenize.TreebankWordTokenizer()
    tokens = tokenizer.tokenize(text)
    lemmatizer=nltk.stem.WordNetLemmatizer()
    text_lemmatized = " ".join(lemmatizer.lemmatize(token) for token in tokens)
    return text_lemmatized

In [None]:
# Join the tokens back into a single string after processing
def text_combine(text):
    comb_text = ' '.join(text)
    return comb_text

In [None]:
train_df1['text'] = train_df1['text'].apply(lambda x : text_combine(x))
test_df1['text'] = test_df1['text'].apply(lambda x : text_combine(x))

In [None]:
test_df1.head() ## Data before lemmatizing

In [None]:
test_df1['text'] = test_df1['text'].progress_apply(lambda x : text_lemmatizing(x))

In [None]:
test_df1.head() ## Data after lemmatizing

#### 1.4.5 Text pre-processing
Returns text. A variant that returns only the TOKENS should be added, since tokens are what will be used to train the model.

In [None]:
def pre_process_text(text): 
    cleaned_txt = text_clean(text)
    lemma_text = text_lemmatizing(cleaned_txt)
    tokenized_text = self_tokenizer.tokenize(lemma_text)    
    remove_stopwords = text_stopwords(tokenized_text)
    combined_text = text_combine(remove_stopwords)
    return combined_text

In [None]:
test_df1['text'] = test_df1['text'].progress_apply(lambda x : pre_process_text(x))
test_df1.head()

# 2. Text vectorization
To train the model, the text must be converted into a matrix of vectors that the model can interpret.
Several techniques exist for this.


* Bag of Words
* TF-IDF
* N-grams
* Feature Hashing
* 1-D convolutional network


Each of these alternatives is directly tied to how the text is transformed (tokenization, cleaning, lemmatization, stemming).

### 2.0 Dataset preparation - functions
Prepare the train and test datasets by applying the preprocessing.

In [None]:
train_df2=tweets_train.copy()
train_df2['text'] = train_df2['text'].progress_apply(lambda x : pre_process_text(x))

test_df2=tweets_test.copy()
test_df2['text'] = test_df2['text'].progress_apply(lambda x : pre_process_text(x))

In [None]:
# Casting to str is needed because some keywords are read as numeric values (missing ones are NaN, a float)
train_df2['keyword'] = train_df2['keyword'].progress_apply(lambda x : pre_process_text(str(x)))

# test_df2 is not copied again here, otherwise the preprocessed 'text' column would be discarded
test_df2['keyword'] = test_df2['keyword'].progress_apply(lambda x : pre_process_text(str(x)))

Define a function to visualize the different vectorizations. The idea is to visualize the distance between the tokens of each tweet (real/not real).

In [None]:
## Internal helper
def plot_LSA(test_data, test_labels):
    # Project the vectors onto 2 components with truncated SVD (LSA)
    lsa = TruncatedSVD(n_components=2)
    lsa.fit(test_data)
    lsa_scores = lsa.transform(test_data)
    colors = ['orange','blue']  # label 0 -> orange, label 1 -> blue
    plt.scatter(lsa_scores[:,0], lsa_scores[:,1], s=8, alpha=.8, c=test_labels, cmap=matplotlib.colors.ListedColormap(colors))
    orange_patch = mpatches.Patch(color='orange', label='False')
    blue_patch = mpatches.Patch(color='blue', label='True')
    plt.legend(handles=[orange_patch, blue_patch], prop={'size': 30})


list_text = train_df2["text"].tolist()
list_keyword = train_df2["keyword"].tolist()
list_labels = train_df2["target"].tolist()

X_train, X_test, y_train, y_test = train_test_split(list_text, list_labels, test_size=0.2, 
                                                                            random_state=32)            

X_train_k, X_test_k, y_train_k, y_test_k = train_test_split(list_keyword, list_labels, test_size=0.2, 
                                                                            random_state=32)    
## Final function
## train_x is the vectorized array of the text column
def Graph_vectorization(train_x,train_y,model_name):
    fig = plt.figure(figsize=(16, 16))          
    plot_LSA(train_x, train_y)
    fig.suptitle(model_name)    
    plt.show()


# Embedding

## w2vec


In [None]:
import gensim
import gensim.downloader as api
#glove-twitter-100 1193514 387 MB  Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)
#glove-twitter-200 1193514 758 MB  Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)
#glove-twitter-25  1193514 104 MB  Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)
#glove-twitter-50  1193514 199 MB  Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)
model = api.load("glove-twitter-200")  # download the model and return as object ready for use

#word2vec = gensim.models.KeyedVectors.load_word2vec_format(path_for_word2vec, binary = True)

In [None]:
# http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec, dimentions):
        self.word2vec = word2vec
        # dimension of the embedding vectors
        self.dim = dimentions

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Documents arrive as whitespace-joined strings, so split them into tokens;
        # iterating over a str directly would look up single characters.
        docs = [doc.split() if isinstance(doc, str) else doc for doc in X]
        # Look each word up in the w2vec model and average the vectors found;
        # a document with no known words becomes a vector of zeros.
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in docs
        ])
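
To make the averaging step concrete, here is a dependency-free sketch of the same idea on toy data. The 3-dimensional vectors, the words and the helper name `mean_embedding` are invented for the example; the real notebook uses the 200-dimensional GloVe vectors loaded above.

```python
# Hypothetical 3-dimensional embeddings, invented for this example
toy_vectors = {
    "fire":   [1.0, 0.0, 2.0],
    "forest": [3.0, 2.0, 0.0],
    "flood":  [0.0, 4.0, 4.0],
}
DIM = 3


def mean_embedding(tokens, vectors, dim):
    """Average the vectors of the known tokens; all zeros if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]


# "spreading" is not in the toy vocabulary, so it is simply skipped
doc_vector = mean_embedding("forest fire spreading".split(), toy_vectors, DIM)
unknown_vector = mean_embedding(["zzz"], toy_vectors, DIM)
print(doc_vector)      # [2.0, 1.0, 1.0]
print(unknown_vector)  # [0.0, 0.0, 0.0]
```

Note that the document is split into words before the lookup; passing the raw string would iterate over its characters instead.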

In [None]:
from collections import defaultdict

class TfidfEmbeddingVectorizer(object):
    def __init__(self, word2vec, dimentions):
        self.word2vec = word2vec
        self.word2weight = None
        self.dim = dimentions

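    def fit(self, X, y=None):
        # TODO check whether min_df / max_df are appropriate
        # Documents are whitespace-joined strings, so the analyzer splits them into words.
        # ngram_range is ignored when analyzer is a callable, so it is not passed.
        tfidf = TfidfVectorizer(analyzer=lambda doc: doc.split() if isinstance(doc, str) else doc,
                                min_df=2, max_df=0.5)
        tfidf.fit(X)
        # The "weight" of each word is its idf;
        # words not seen during fit get the largest idf value
        max_idf = max(tfidf.idf_)
        self.word2weight = defaultdict(
            lambda: max_idf,
            [(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])

        return self

    def transform(self, X):
        # Split string documents into tokens before looking words up
        docs = [doc.split() if isinstance(doc, str) else doc for doc in X]
        return np.array([
                np.mean([self.word2vec[w] * self.word2weight[w] for w in words if w in self.word2vec]
                        or [np.zeros(self.dim)], axis=0)
                for words in docs
            ])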

In [None]:
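# build the w2vec dictionary (word -> vector)
# gensim >= 4 exposes index_to_key; older versions expose index2word
vocab = model.index_to_key if hasattr(model, "index_to_key") else model.index2word
w2v = dict(zip(vocab, model.vectors))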

In [None]:
meanEmbedding = MeanEmbeddingVectorizer(w2v, model.vector_size)
train_w2vec = meanEmbedding.transform(train_df2.text)
test_w2vec = meanEmbedding.transform(test_df2.text)

graph_w2vec_x = meanEmbedding.transform(X_train)

In [None]:
train_w2vec_k =  meanEmbedding.transform(train_df2.keyword)
test_w2vec_k = meanEmbedding.transform(test_df2.keyword)

graph_w2vec_x_k = meanEmbedding.transform(X_train_k)

In [None]:
tfidfEmbedding = TfidfEmbeddingVectorizer(w2v, model.vector_size)
train_w2vecTfid = tfidfEmbedding.fit(train_df2.text).transform(train_df2.text)
test_w2vecTfid = tfidfEmbedding.transform(test_df2.text)

In [None]:
# Plot the vectorization
Graph_vectorization(graph_w2vec_x,y_train,"MeanEmbedding")

In [None]:
# Plot the vectorization
Graph_vectorization(graph_w2vec_x_k,y_train_k,"MeanEmbedding (KEYWORD)")

### Embedding: Universal Sentence Encoder (tensorflow)

In [None]:
large_use = 'https://tfhub.dev/google/universal-sentence-encoder-large/5'
embed = hub.load(large_use)

def transfrom(text_train, text_test):
    # Embed each line on its own and flatten the result into a 1-D vector
    vector_train = [tf.reshape(embed([line]), [-1]).numpy() for line in tqdm(text_train)]
    vector_test = [tf.reshape(embed([line]), [-1]).numpy() for line in tqdm(text_test)]

    return vector_train, vector_test


Convert the text to vectors

In [None]:
train_use_tf, test_use_tf = transfrom(train_df2.text, test_df2.text)
train_use_tf_k, test_use_tf_k = transfrom(train_df2.keyword, test_df2.keyword)

In [None]:
# Plot the vectorization
graph_useftx, _dummy = transfrom(X_train, X_train)
Graph_vectorization(graph_useftx,y_train,"USE tensorflow")

In [None]:
# Plot the vectorization (KEYWORDS)
graph_useftx_k, _dummy = transfrom(X_train_k, X_train_k)
Graph_vectorization(graph_useftx_k,y_train_k,"USE tensorflow + (KEYWORDS)")

### 2.1 Bag of Words
A dictionary of known words is built, and each text is then represented as a vector in which each position indicates whether (or how many times) the corresponding word occurs.

#### CountVectorizer

CountVectorizer converts a collection of documents into a matrix of token counts. It includes several options for preprocessing, tokenization and stopwords, which could be configured in the line below. However, since the text has already been pre-processed, the function is used here with its default settings.

In [None]:
# Vectorization with CountVectorizer
count_vectorizer = CountVectorizer()
train_cv = count_vectorizer.fit_transform(train_df2['text'])
test_cv = count_vectorizer.transform(test_df2["text"])
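
The bag-of-words representation described in this section can be reproduced by hand on a toy corpus. The following is only a sketch: the three documents and the helper name `count_vector` are made up, and CountVectorizer applies its own tokenization rules and returns a sparse matrix instead of lists.

```python
from collections import Counter

docs = ["forest fire near town", "fire fire everywhere", "nice sunny day"]

# Vocabulary: every distinct word, in sorted order, mapped to a column index
vocabulary = {word: idx
              for idx, word in enumerate(sorted({w for d in docs for w in d.split()}))}


def count_vector(doc, vocabulary):
    """Return the word-count vector of one document."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocabulary]


matrix = [count_vector(d, vocabulary) for d in docs]
print(list(vocabulary))
for row in matrix:
    print(row)
```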

In [None]:
# Plot the vectorization
graph_cv =  count_vectorizer.fit_transform(X_train)
Graph_vectorization(graph_cv,y_train,"CountVectorize")

In [None]:
# Plot the vectorization
graph_cv_k =  count_vectorizer.fit_transform(X_train_k)
Graph_vectorization(graph_cv_k,y_train_k,"CountVectorize (KEYWORD)")

### 2.2 TF-IDF
TF-IDF (term frequency - inverse document frequency) is a numerical measure of how relevant a word is to a document within a collection. The inverse document frequency reflects how often the term occurs across the collection of documents.
It improves on Bag of Words because it counts and weights words based on their frequency of occurrence. For example, the word "the" may appear many times in the text, so it can be given less importance.

#### **TF-IDF calculation**

<br/>

**Term Frequency (TF)**: The weight of the term within the document

$ tf(t,d) = \frac{f_{t,d}}{n_d} $
<br/>
Where:
* $ f_{t,d} $: number of occurrences of term $t$ in document $d$
* $ n_d $: number of terms in document $d$

<br/>

**Inverse Document Frequency (IDF)**: A measure of how "rare" the term is across all documents

$ idf(t,D) = 1+\log\left(\frac{N}{n_t}\right) $
<br/>
Where:
* $ N $: number of documents in the collection $D$
* $ n_t $: number of documents in which term $t$ appears

<br/>

**TF-IDF**: The tf-idf weight of the term is given by

$ tfidf(t,d,D) = tf(t,d) \times idf(t,D) $
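
The three formulas above can be checked on a toy corpus. This is a minimal sketch that assumes the natural logarithm and made-up documents; scikit-learn's TfidfVectorizer applies idf smoothing and vector normalization by default, so its values will differ from these.

```python
import math

docs = [
    "fire in the forest".split(),
    "the fire spread".split(),
    "the sunny day".split(),
]


def tf(term, doc):
    """Occurrences of the term divided by the number of terms in the document."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """1 + log(N / n), with n the number of documents containing the term."""
    n = sum(1 for doc in docs if term in doc)  # the term must occur at least once
    return 1 + math.log(len(docs) / n)


def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


# "the" occurs in every document, so idf = 1 + log(3/3) = 1
# "forest" occurs in a single document, so idf = 1 + log(3/1)
weight_the = tfidf("the", docs[0], docs)
weight_forest = tfidf("forest", docs[0], docs)
print(weight_the, weight_forest)
```

Both words appear once in the first document, yet "forest" receives the larger weight because it is rarer across the collection.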

In [None]:
# Vectorization using TF-IDF (unigrams and bigrams)
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
train_tf = tfidf.fit_transform(train_df2['text'])
test_tf = tfidf.transform(test_df2["text"])

**TfidfVectorizer parameters**

* _min_df_ = minimum number (or proportion) of documents a token must appear in. Here, tokens appearing in fewer than 2 documents are discarded.

* _max_df_ = maximum number (or proportion) of documents a token may appear in. Here, tokens appearing in more than 50% of the documents are discarded.

* _ngram_range_ = range of n-gram sizes used to generate the tokens. Here, 1-grams and 2-grams (unigrams and bigrams) are used.



In [None]:
# Plot the vectorization
graph_tf =  tfidf.fit_transform(X_train)
Graph_vectorization(graph_tf,y_train,"TF-IDF")

In [None]:
# Plot the vectorization
graph_tf =  tfidf.fit_transform(X_train_k)
Graph_vectorization(graph_tf,y_train_k,"TF-IDF (KEYWORD)")

### 2.3 N-grams
Words are grouped into sequences of 1, 2, 3, ..., n consecutive words, which gives each token some context.

This can be done with CountVectorizer, to count how often each n-gram appears, or combined with TF-IDF, to weight each term based on its occurrences.
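
As a small illustration of the grouping described above, the sketch below builds bigrams and trigrams from a token list. The function name `ngrams` and the sample sentence are assumptions for the example; the vectorizers in the next cells do this internally through their `ngram_range` parameter.

```python
def ngrams(tokens, n_min, n_max):
    """Return all n-grams of consecutive tokens for n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = "forest fire near town".split()
bi_tri = ngrams(tokens, 2, 3)
print(bi_tri)
```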

In [None]:
# Group into bigrams and trigrams with CountVectorizer
ngram_cv = CountVectorizer(ngram_range=(2,3))
train_ng_cv = ngram_cv.fit_transform(train_df2['text'])
test_ng_cv = ngram_cv.transform(test_df2["text"])

In [None]:
# Group into bigrams and trigrams with TF-IDF
ngram_tf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(2, 3))
train_ng_tf = ngram_tf.fit_transform(train_df2['text'])
test_ng_tf = ngram_tf.transform(test_df2["text"])

In [None]:
# Plot the vectorization
graph_cv_ngs =  ngram_cv.fit_transform(X_train)
Graph_vectorization(graph_cv_ngs,y_train,"Count vectorizer + (2-3)Grama")

In [None]:
# Plot the vectorization
graph_tf_ng =  ngram_tf.fit_transform(X_train)
Graph_vectorization(graph_tf_ng,y_train,"TF-IDF + (2-3)Grama")

### 2.4 Feature Hashing
Pending

In [None]:
# Vectorization using feature hashing
hv = HashingVectorizer()
train_fh = hv.fit_transform(train_df2["text"])
test_fh = hv.transform(test_df2["text"])
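
Since the description of this section is still pending, the following is only a hedged sketch of the general hashing trick: each token is mapped to a column by a hash function, so no vocabulary has to be stored. The MD5 hash, the 8-column size and the helper name `hashed_vector` are choices made for the example; scikit-learn's HashingVectorizer uses a different hash function, a much larger number of features and signed updates.

```python
import hashlib


def hashed_vector(tokens, n_features=8):
    """Map tokens to a fixed-size count vector using a stable hash."""
    vector = [0] * n_features
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features  # column chosen by the hash
        vector[index] += 1
    return vector


vec_a = hashed_vector("forest fire near town".split())
vec_b = hashed_vector("fire fire everywhere".split())
print(vec_a)
print(vec_b)
```

Different tokens may collide in the same column, which is the price paid for the fixed vector size.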

In [None]:
# Plot the vectorization
graph_fh =  hv.fit_transform(X_train)
Graph_vectorization(graph_fh,y_train,"Feature Hashing")

In [None]:
train_df2

# 3. Model training
For training, several algorithms are tried _(green: tested; red: discarded as inefficient)_
* <font color='green'>Logistic Regression </font>
* <font color='green'>Decision tree</font>
* <font color='green'>KNN</font>
* <font color='green'>Gradient Boosting Classifier</font>
* <font color='green'>Random Forest</font>
* <font color='green'>RidgeClassifier</font>
* <font color='green'>MNB (MultinomialNB)</font>
* <font color='green'>Perceptron</font>
* <font color='green'>XGBoost</font>

In [None]:
# add a sentiment feature
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
train_df2['sentiment_score'] = train_df2.text.apply(lambda x: sid.polarity_scores(x)['compound'])
test_df2['sentiment_score'] = test_df2.text.apply(lambda x: sid.polarity_scores(x)['compound'])

In [None]:
train_df2['sentiment_score']

### 3.1  Validation curve

We plot a validation curve to see how the training and cross-validation scores of a model change as one of its hyperparameters varies.

In [None]:
import numpy as np
from sklearn.model_selection import validation_curve

If the training score and the validation score are both low, the estimator is underfitting. If the training score is high and the validation score is low, the estimator is overfitting; otherwise it is working well. A low training score together with a high validation score is usually not possible. The cells below vary the ccp_alpha parameter of a gradient boosting classifier on the tweet features.

In [None]:
# train_use_tf2_s is built in the hyperparameter-search section below
train_scores, valid_scores = validation_curve(
    GradientBoostingClassifier(learning_rate=0.0055, max_depth=2, max_features=27,
                               min_samples_leaf=23, n_estimators=4850, random_state=2020),
    train_use_tf2_s, tweets_train.target,
    param_name="ccp_alpha",
    param_range=np.logspace(-4, 1, 30),
    cv=5, n_jobs=-1)

In [None]:
GradientBoostingClassifier().get_params().keys()

In [None]:
valid_scores

In [None]:
np.logspace(-7, 3, 30)

In [None]:
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(valid_scores, axis=1)
test_scores_std = np.std(valid_scores, axis=1)

# Must match the range passed to validation_curve
param_range = np.logspace(-4, 1, 30)

plt.figure(figsize=(16, 4))

plt.title("Validation Curve with Gradient Boosting")
plt.xlabel("ccp_alpha")
plt.ylabel("Score")
plt.grid()
lw = 2

plt.semilogx(param_range, train_scores_mean, label="Training score",
             color="darkorange", lw=lw)
plt.fill_between(param_range, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.2,
                 color="darkorange", lw=lw)
plt.semilogx(param_range, test_scores_mean, label="Cross-validation score",
             color="navy", lw=lw)
plt.fill_between(param_range, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.2,
                 color="navy", lw=lw)
plt.legend(loc="best")


plt.show()

In [None]:
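import seaborn; seaborn.set()  # plot formatting
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Polynomial regression = polynomial features followed by a linear regression
def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))

# Separate names so the X_test / y_test split defined earlier is not overwritten
X_sent = train_use_tf2_s[["sentiment_score"]].to_numpy()
y_sent = tweets_train.target.to_numpy()
x_grid = np.linspace(-1, 1, 200).reshape(-1, 1)  # VADER compound scores lie in [-1, 1]
plt.scatter(X_sent.ravel(), y_sent, color='black')
for degree in [1, 3, 5]:
    y_fit = PolynomialRegression(degree).fit(X_sent, y_sent).predict(x_grid)
    plt.plot(x_grid.ravel(), y_fit, label='degree={0}'.format(degree))
plt.xlim(-1.0, 1.0)
plt.ylim(-0.5, 1.5)
plt.legend(loc='best');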

In [None]:
param_range

### 3.1 Hyperparameter search
Search for the best hyperparameters of the different models, trying different vectorizations whenever possible.


Function to perform cross-validation. It takes two parameters: the models to analyze (classifier) and the parameter grids to try, both keyed by vectorization name. It prints the best parameters found by cross-validation.

In [None]:
## Convert the list to a DataFrame and append the KEYWORD columns
train_use_tf2 = pd.DataFrame(train_use_tf)
train_use_k_tf2 = pd.DataFrame(train_use_tf_k)
train_use_tf2_k = pd.concat([train_use_tf2, train_use_k_tf2], axis=1, sort=False)

# Same, plus the sentiment analysis score
train_use_tf2_s = pd.concat([train_use_tf2_k, train_df2['sentiment_score']], axis=1, sort=False)


#---test----
## Convert the list to a DataFrame and append the KEYWORD columns
test_use_tf2 = pd.DataFrame(test_use_tf)
test_use_k_tf2 = pd.DataFrame(test_use_tf_k)
test_use_tf2_k = pd.concat([test_use_tf2, test_use_k_tf2], axis=1, sort=False)

# Same, plus the sentiment analysis score
test_use_tf2_s = pd.concat([test_use_tf2_k, test_df2['sentiment_score']], axis=1, sort=False)

In [None]:
# Add the extra columns to the count vectors and the others???

In [None]:
# Dictionary with the vectorizations obtained previously
vectorDict = {    
    "Count Vector": train_cv,
    "TF-IDF": train_tf,
    "Count Vector + ng": train_ng_cv,
    "TF-IDF + ng": train_ng_tf,
    "Feature Hashing": train_fh,
    "Word2Vec": train_w2vec,
    "Word2Vec + TF-IDF": train_w2vecTfid,
    "Universal sentence encoder": train_use_tf2,
    "Universal full":train_use_tf2_k,
    "USE Sentiment" : train_use_tf2_s
    }

In [None]:
def Cross_validation(classifier,paramDict):
    for key,param_grid in paramDict.items():
        try:
            grid_search = GridSearchCV(estimator= classifier[key], param_grid = param_grid, cv=3 , n_jobs = -1, verbose = 2)
            grid_search.fit(vectorDict[key],tweets_train.target)
            print(key)
            print(grid_search.best_score_)
            print(grid_search.best_params_)
            print(grid_search.best_estimator_)
        except Exception as e:
            # Some models fail on vectors with negative values;
            # print the error and move on without analyzing that model
            print(e)
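
An exhaustive grid search fits one model per parameter combination and per fold, which is worth estimating before launching it. The sketch below counts those fits for two grids of the kind used later in this notebook; the helper names `grid_combinations` and `total_fits` are made up for the example and are not part of scikit-learn.

```python
from itertools import product


def grid_combinations(param_grid):
    """Expand a {name: values} grid into the list of parameter dicts to try."""
    names = list(param_grid)
    return [dict(zip(names, values))
            for values in product(*(param_grid[n] for n in names))]


def total_fits(param_grid, cv):
    """Number of model fits of an exhaustive search, excluding the final refit."""
    return len(grid_combinations(param_grid)) * cv


small_grid = {'max_depth': range(2, 10, 2), 'min_samples_split': range(100, 500, 100)}
large_grid = {'n_estimators': range(100, 10000, 500),
              'learning_rate': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]}

print(total_fits(small_grid, cv=3))   # 4 * 4 * 3 = 48
print(total_fits(large_grid, cv=3))   # 20 * 6 * 3 = 360
print(grid_combinations(small_grid)[0])
```

With thousands of estimators per fit, the second grid is far more expensive than its 360 fits alone suggest.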

#### 3.1.1 Gradient Boost
Boosting is a sequential ensemble technique. It combines a set of weak learners to deliver improved prediction accuracy. At any instant t, the model outcomes are weighted based on the outcomes of the previous instant t-1: outcomes predicted correctly are given a lower weight and misclassified ones a higher weight. This technique is used for classification problems, while a similar technique is used for regression.

There are two types of parameters to be tuned here: tree-based parameters and boosting parameters. There is no optimum value for the learning rate, as low values always work better provided that we train a sufficient number of trees.

Although GBM is fairly robust against overfitting as the number of trees grows, a high number of trees for a particular learning rate can still lead to overfitting. But as we reduce the learning rate and increase the number of trees, the computation becomes expensive and takes a long time to run on standard personal computers.

Keeping all this in mind, we can take the following approach:

1. Choose a relatively high learning rate. Generally the default value of 0.1 works, but somewhere between 0.05 and 0.2 should work for different problems.
2. Determine the optimum number of trees for this learning rate. This should range around 40-70. Remember to choose a value on which your system can work fairly fast, because it will be used for testing various scenarios and determining the tree parameters.
3. Tune tree-specific parameters for the chosen learning rate and number of trees. Note that we can choose different parameters to define a tree, and I'll take up an example here.
4. Lower the learning rate and increase the estimators proportionally to get more robust models.
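
Step 4 can be read as keeping the product of learning rate and number of trees roughly constant. That reading is an assumption, as is the helper name `rescale_boosting`; the sketch below only shows the arithmetic, starting from the learning rate of 0.1 and the 60 trees used in this notebook.

```python
def rescale_boosting(learning_rate, n_estimators, factor):
    """Divide the learning rate by `factor` and multiply the number of trees by it."""
    return learning_rate / factor, int(round(n_estimators * factor))


# Starting point used in this notebook: learning_rate=0.1 with 60 trees
for factor in (1, 2, 10):
    lr, trees = rescale_boosting(0.1, 60, factor)
    print(f"factor={factor:>2}: learning_rate={lr:g}, n_estimators={trees}")
```

The rescaled pair is only a starting point; the best combination still has to be confirmed by cross-validation.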


Tuning tree-specific parameters

Now let's move on to tuning the tree parameters. I plan to do this in the following stages:

1. Tune max_depth and min_samples_split
2. Tune min_samples_leaf
3. Tune max_features

The order in which the variables are tuned should be decided carefully. You should take the variables with a higher impact on the outcome first. For instance, max_depth and min_samples_split have a significant impact, so we tune those first.

In [None]:
# Initial model
modelGB = {
    "Universal sentence encoder":GradientBoostingClassifier(learning_rate=0.1, n_estimators=60, subsample=0.8, random_state=2020),
    "Universal full":GradientBoostingClassifier(learning_rate=0.1, n_estimators=60, subsample=0.8, random_state=2020)
}

In [None]:
step_1 ={
    "Universal sentence encoder":{'max_depth':range(2,10,2), 'min_samples_split':range(100,500,100)},
    "Universal full":{'max_depth':range(2,10,2), 'min_samples_split':range(100,500,100)},
}
Cross_validation(modelGB,step_1)

---------------------------------

In [None]:
# fine-tuning the parameters (1)
step_1_deep ={
    "Universal sentence encoder":{'max_depth':[4], 'min_samples_split':[200]},
    "Universal full":{'max_depth':[2], 'min_samples_split':[150,170,190]}
}
Cross_validation(modelGB,step_1_deep)

In [None]:
# Model, STEP 2
modelGB = {
    "Universal sentence encoder":GradientBoostingClassifier(learning_rate=0.1, n_estimators=60, subsample=0.8, random_state=2020,max_depth=4,min_samples_split=200),
    "Universal full":GradientBoostingClassifier(learning_rate=0.1, n_estimators=60, subsample=0.8, random_state=2020,max_depth=2,min_samples_split=190,max_features=38)
}

In [None]:
step_2 = {
    "Universal sentence encoder":{'min_samples_split':range(20,170,25), 'min_samples_leaf':range(30,50,5)},
    "Universal full":{'min_samples_split':range(20,180,10), 'min_samples_leaf':range(30,50,5)}
}
Cross_validation(modelGB,step_2)

In [None]:
step_2_deep = {
    "Universal sentence encoder":{'min_samples_split':[100,110,125], 'min_samples_leaf':[39,40,41]},
    "Universal full":{'min_samples_split':[5,10,15], 'min_samples_leaf':[29,30,31]}
}
Cross_validation(modelGB,step_2_deep)

In [None]:
# Model, STEP 3
modelGB = {
    "Universal sentence encoder":GradientBoostingClassifier(max_depth=4, min_samples_leaf=41,
                           min_samples_split=125, n_estimators=60,
                           random_state=2020, subsample=0.8),
    "Universal full":GradientBoostingClassifier(max_depth=2, max_features=38, min_samples_leaf=30,
                           min_samples_split=5, n_estimators=60,
                           random_state=2020, subsample=0.8)
}

In [None]:
step_3 = {
    "Universal sentence encoder":{'max_features':range(19,22,1),'min_samples_split':[105,110,115], 'min_samples_leaf':[39,41,42,44]},
    # min_samples_split must be at least 2 when given as an integer, so 1 is left out
    "Universal full":{"max_features":range(37,39,1),'min_samples_split':[2,3,4,5], 'min_samples_leaf':[30]}
}
Cross_validation(modelGB,step_3)


In [None]:
# Model, STEP 4
modelGB = {
    "Universal sentence encoder":GradientBoostingClassifier(max_depth=4, max_features=19, min_samples_leaf=42,
                           min_samples_split=110, n_estimators=60,
                           random_state=2020, subsample=0.8),
    "Universal full":GradientBoostingClassifier(max_depth=2, max_features=38, min_samples_leaf=30,
                           n_estimators=60, random_state=2020, subsample=0.8)  
}

In [None]:
step_4 = {
    "Universal sentence encoder":{'subsample':[0.75,0.77,0.8,0.83,0.85]},
    "Universal full":{'subsample':[0.75,0.77,0.8,0.83,0.85]}
}
Cross_validation(modelGB,step_4)

In [None]:
# Model for STEP 5
modelGB = {
    "Universal sentence encoder":GradientBoostingClassifier(max_depth=4, max_features=19, min_samples_leaf=42,
                           min_samples_split=110, n_estimators=60,
                           random_state=2020, subsample=0.8),
    "Universal full":GradientBoostingClassifier(max_depth=2, max_features=38, min_samples_leaf=30,
                           n_estimators=60, random_state=2020, subsample=0.8)  
}

In [None]:
step_5 = {
    "Universal sentence encoder":{'n_estimators':range(100,10000,500),'learning_rate':[0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]},
    "Universal full":{'n_estimators':range(100,10000,500),'learning_rate':[0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]}
}
Cross_validation(modelGB,step_5)
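Before launching step 5 it can help to estimate how much work the grid implies. The sketch below is a back-of-the-envelope count for one feature set, assuming an exhaustive grid search with the `cv=3` used by the helper in section 5.2. `grid_cost` is a hypothetical helper, not part of scikit-learn.

```python
from itertools import product

def grid_cost(param_grid, cv=3):
    """Return (fits, trees) implied by an exhaustive grid search.

    fits  = number of parameter combinations * number of CV folds
    trees = total boosting stages built across all those fits
    (the final refit on the full training set is not counted).
    """
    names = list(param_grid)
    combos = list(product(*(param_grid[name] for name in names)))
    fits = len(combos) * cv
    idx = names.index('n_estimators')
    trees = sum(combo[idx] for combo in combos) * cv
    return fits, trees

# Grid copied from step_5 above (one feature set).
step_5_grid = {
    'n_estimators': range(100, 10000, 500),
    'learning_rate': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3],
}

fits, trees = grid_cost(step_5_grid)
print(fits, trees)  # 360 1746000
```

Training time for a boosted model grows roughly linearly with `n_estimators`, so a coarser range or a randomized search may be preferable given that the TP has to run in under an hour.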

### EXTRA: SENTIMENT ANALYSIS
I add steps to test the model with a sentiment-analysis feature.


In [None]:
# DataFrame that includes the sentiment-analysis feature:
train_use_tf2_s

In [None]:
# Initial model
modelGB = {
    "USE Sentiment":GradientBoostingClassifier(learning_rate=0.1, n_estimators=60, subsample=0.8, random_state=2020),
}
step_1 ={
    "USE Sentiment":{'max_depth':range(2,10,2), 'min_samples_split':range(100,500,100)}
}
Cross_validation(modelGB,step_1)

In [None]:
step_1 ={
    "USE Sentiment":{'max_depth':[2], 'min_samples_split':[10,30,50,60,70]}
}
Cross_validation(modelGB,step_1)

In [None]:
#STEP 2
modelGB = {
    "USE Sentiment":GradientBoostingClassifier(max_depth=2, min_samples_split=10, n_estimators=60,
                           random_state=2020, subsample=0.8)
}
step_2 = {
    # min_samples_split must be an integer >= 2, so the range starts at 3 (odd values up to 19)
    "USE Sentiment":{'min_samples_split':range(3,20,2), 'min_samples_leaf':range(30,50,5)}
}
Cross_validation(modelGB,step_2)

In [None]:
#STEP 2 - refinement
modelGB = {
    "USE Sentiment":GradientBoostingClassifier(max_depth=2, min_samples_split=10, n_estimators=60,
                           random_state=2020, subsample=0.8)
}
step_2 = {
    # min_samples_split must be an integer >= 2, so the range starts at 2
    "USE Sentiment":{'min_samples_split':range(2,4,1), 'min_samples_leaf':range(26,34,1)}
}
Cross_validation(modelGB,step_2)

In [None]:
#STEP 3
modelGB = {
    "USE Sentiment":GradientBoostingClassifier(max_depth=2, min_samples_leaf=31, n_estimators=60,
                           random_state=2020, subsample=0.8)
}
step_3 = {
    # min_samples_split must be an integer >= 2, so the invalid value 1 is dropped
    "USE Sentiment":{'max_features':range(23,28,1),'min_samples_split':[2,3,4,5,6], 'min_samples_leaf':range(1,30,1)},
}
Cross_validation(modelGB,step_3)

In [None]:
#STEP 4
modelGB = {
    "USE Sentiment":GradientBoostingClassifier(max_depth=2, max_features=27, min_samples_leaf=23,
                           n_estimators=60, random_state=2020, subsample=0.8)
}
step_4 = {
    "USE Sentiment":{'subsample':[0.785,0.79,0.795,0.8]}
}
Cross_validation(modelGB,step_4)

In [None]:
#STEP 5
step_5 = {
    "USE Sentiment":{'n_estimators':range(5000,10001,1000),'learning_rate':[0.0001,0.0005,0.001,0.005,0.01]}
}
Cross_validation(modelGB,step_5)

In [None]:
#STEP 5 - refinement
step_5 = {
    "USE Sentiment":{'n_estimators':range(4840,4860,5),'learning_rate':[0.0052,0.0053,0.0054,0.0055,0.0057]}
}
Cross_validation(modelGB,step_5)

#### 3.1.1 Logistic Regression
An explanation of the method is still pending.
I use the TensorFlow Universal Sentence Encoder (USE) features here as well.
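As a minimal illustration of what logistic regression computes (a sketch, not scikit-learn's implementation): the model scores a feature vector with a weighted sum, squashes the score through the sigmoid to obtain a probability, and is fit by minimizing a log-loss in which `class_weight` rescales each class's contribution. Regularization (controlled by `C`) is omitted here, and the weights, bias, and samples are made-up values for illustration only.

```python
import math

def sigmoid(z):
    """Map any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """P(target = 1 | x) for a linear model with the given weights."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(score)

def weighted_log_loss(y_true, p_pred, class_weight):
    """Average log-loss where each sample is scaled by its class weight."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        sample_loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += class_weight[y] * sample_loss
    return total / len(y_true)

# Hypothetical 3-feature model and sample (illustrative values only).
weights, bias = [0.8, -0.4, 0.1], -0.2
x = [1.0, 0.5, 2.0]
p = predict_proba(weights, bias, x)
label = int(p >= 0.5)  # default decision threshold

# Same predictions under two weightings: uniform vs. {0: 0.6, 1: 0.4}.
y_true, p_pred = [1, 0], [0.7, 0.7]
uniform = weighted_log_loss(y_true, p_pred, {0: 0.5, 1: 0.5})
skewed = weighted_log_loss(y_true, p_pred, {0: 0.6, 1: 0.4})
```

With the `{0: 0.6, 1: 0.4}` weighting that appears in the tuning steps that follow, a mistake on a class-0 tweet costs more than the same mistake on a class-1 tweet, which is why `skewed` comes out larger than `uniform` in this example.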




In [None]:
# Initial model
modelLR = {
    "USE Sentiment":LogisticRegression(),
    "Count Vector": LogisticRegression(),
    "TF-IDF": LogisticRegression(),
    "Count Vector + ng": LogisticRegression(),
    "TF-IDF + ng":LogisticRegression(),
    "Feature Hashing":LogisticRegression(),
    "Word2Vec": LogisticRegression(),
    "Word2Vec + TF-IDF":LogisticRegression()
}
# Every feature set is searched over the same grid, so it is defined once and reused.
lr_grid = {'penalty':['l1', 'l2'], 'C':[0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000],
           'class_weight':[{1:0.5, 0:0.5}, {1:0.4, 0:0.6}, {1:0.6, 0:0.4}, {1:0.7, 0:0.3}],
           'solver' : ['liblinear', 'saga']}
step_1 = {name: lr_grid for name in modelLR}
Cross_validation(modelLR,step_1)

In [None]:
# Model for step 2 - fine-tuning C
modelLR = {
    "USE Sentiment":LogisticRegression(C=1, class_weight={0: 0.6, 1: 0.4}, solver='liblinear'),
}
step_2 ={
    "USE Sentiment":{'C':[0.985,0.986,0.987,0.988,0.989,0.9899]}
}
Cross_validation(modelLR,step_2)

In [None]:
# Model for step 3
modelLR = {
    "USE Sentiment":LogisticRegression(C=0.985, class_weight={0: 0.6, 1: 0.4}, solver='liblinear'),
}
step_3 ={
    "USE Sentiment":{'tol':[0.0001,0.0002,0.0005,0.001,0.002,0.005]}
}
Cross_validation(modelLR,step_3)

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn import decomposition
sc = StandardScaler()
pca = decomposition.PCA()
logistic = LogisticRegression(solver='liblinear')  # liblinear supports both the 'l1' and 'l2' penalties searched below

# Add a pipeline to standardize and reduce the input of the logistic regression.
pipe = Pipeline(steps=[('sc', sc),
                           ('pca', pca),
                           ('logistic', logistic)])

# Create the parameter space
# Candidate numbers of PCA components: 1, 101, 201, ... up to the number of features in train_use_tf2_s
n_components = list(range(1,train_use_tf2_s.shape[1]+1,100))
# Create a list of values of the regularization parameter
C = [0.985,0.986,0.987,0.988,0.989,0.9899]
# Create a list of options for the regularization penalty
penalty = ['l1', 'l2']
# Create a dictionary of all the parameter options
# Note that the parameters of a pipeline step are addressed as '<step name>__<parameter>'
parameters = dict(pca__n_components=n_components,
                  logistic__C=C,
                  logistic__penalty=penalty)

modelLR = {
    "USE Sentiment":pipe,
}
step_1 ={
    "USE Sentiment":parameters
}

Cross_validation(modelLR,step_1)
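To make the parameter space concrete, the sketch below reproduces how the `n_components` candidates and the `step__parameter` names come out. The feature count of 1025 is an assumption (it matches the `PCA(n_components=1025)` used later in the notebook); in practice it would be `train_use_tf2_s.shape[1]`. `qualify` is a hypothetical helper.

```python
n_features = 1025  # assumed width of train_use_tf2_s

# Same expression as in the cell above: 1, 101, 201, ...
n_components = list(range(1, n_features + 1, 100))

def qualify(step_name, params):
    """Prefix each parameter with its pipeline step name ('<step>__<param>')."""
    return {f"{step_name}__{name}": values for name, values in params.items()}

parameters = {}
parameters.update(qualify('pca', {'n_components': n_components}))
parameters.update(qualify('logistic', {
    'C': [0.985, 0.986, 0.987, 0.988, 0.989, 0.9899],
    'penalty': ['l1', 'l2'],
}))

# Number of distinct configurations an exhaustive search would try.
n_combinations = 1
for values in parameters.values():
    n_combinations *= len(values)

print(n_components[0], n_components[-1], len(n_components), n_combinations)
```

Under that assumption the step of 100 yields 11 candidates, the largest being 1001, so the full-dimensional case is never tested. Adding `n_features` itself to the list would cover it.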

### 3.1 Organizing the algorithms
To keep things tidy, I group the algorithms in a collection so they can be evaluated as a batch later.

In [None]:
# Dictionary with the classification models to try.
modelsDict = {    
    "Gradient Boost": GradientBoostingClassifier(n_estimators=10),
    "Random Forest": RandomForestClassifier(max_depth = 10),  
    "Decision Tree": DecisionTreeClassifier(max_depth = 10),
    "kNN": KNeighborsClassifier(n_neighbors=20),
    'MNB': MultinomialNB(),
    'GNB': GaussianNB(),
    'RidgeClassifier': RidgeClassifier(class_weight='balanced'),
    'Perceptron': Perceptron(class_weight='balanced'),
    'xgboost': XGBClassifier(n_estimators=10),
    "Logistic Regression": LogisticRegression(C=1.0)
    }

no_classifiers = len(modelsDict.keys())


In [None]:

def batch_classify(x_train, y_train, x_test, y_test, positive_values=True):
    # Note: 'Prec. train' and 'Prec. test' hold accuracy (in %), as returned by score()/accuracy_score.
    columns = ['Clasificador', 'Prec. train', 'Prec. test', 'AUC score', 'F1', 'Tiempo transcurrido']
    rows = []
    for key, classifier in modelsDict.items():
        # MultinomialNB rejects negative feature values, so skip it for vectors that may contain them.
        if not positive_values and key == "MNB":
            continue

        row = dict.fromkeys(columns, np.nan)
        row['Clasificador'] = key
        t_start = process_time()
        try:
            classifier.fit(x_train, y_train)
            row['Tiempo transcurrido'] = process_time() - t_start
            y_predicted = classifier.predict(x_test)
            row['AUC score'] = roc_auc_score(y_test, y_predicted)
            row['Prec. train'] = round(classifier.score(x_train, y_train)*100)
            row['Prec. test'] = round(accuracy_score(y_test, y_predicted)*100)
            row['F1'] = f1_score(y_test, y_predicted, zero_division=1)
        except Exception as e:
            # Some models cannot handle negative vectors: report the error and move on,
            # leaving that model's metrics as NaN.
            print(e)
        rows.append(row)

    return pd.DataFrame(rows, columns=columns)
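For reference, the metrics stored by `batch_classify` can be derived from the four confusion-matrix counts. The sketch below is a plain-Python illustration with made-up labels, not a replacement for the scikit-learn functions. It also shows that ROC AUC computed from hard 0/1 predictions, as above, reduces to the mean of the true-positive and true-negative rates, so passing probabilities or decision scores would give a more informative AUC.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def summarize(y_true, y_pred):
    """Accuracy, F1 and hard-label AUC; assumes both classes occur and are predicted."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    # Area under the single-threshold ROC curve (0,0) -> (FPR,TPR) -> (1,1).
    auc_hard = (recall + specificity) / 2
    return {'accuracy': accuracy, 'f1': f1, 'auc_hard': auc_hard}

# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]
metrics = summarize(y_true, y_pred)
```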


In [None]:
# Data for CountVector
x_train_cv, x_test_cv, y_train_cv, y_test_cv =train_test_split(train_cv,tweets_train.target,test_size=0.2,random_state=2020)
cv_results = batch_classify(x_train_cv, y_train_cv,x_test_cv, y_test_cv)


In [None]:
# Data for TF-IDF
x_train_tf, x_test_tf, y_train_tf, y_test_tf = train_test_split(train_tf,tweets_train.target,test_size=0.2,random_state=2020)
tf_results = batch_classify(x_train_tf, y_train_tf,x_test_tf, y_test_tf)


In [None]:
# Data for CountVector + n-grams
x_train_ng_cv, x_test_ng_cv, y_train_ng_cv, y_test_ng_cv =train_test_split(train_ng_cv,tweets_train.target,test_size=0.2,random_state=2020)
cv_ng_results = batch_classify(x_train_ng_cv, y_train_ng_cv,x_test_ng_cv, y_test_ng_cv)


In [None]:
# Data for TF-IDF + n-grams
x_train_ng_tf, x_test_ng_tf, y_train_ng_tf, y_test_ng_tf = train_test_split(train_ng_tf,tweets_train.target,test_size=0.2,random_state=2020)
tf_ng_results = batch_classify(x_train_ng_tf, y_train_ng_tf,x_test_ng_tf, y_test_ng_tf)


In [None]:
# Data for feature hashing (may contain negative values, so MNB is skipped)
x_train_fh, x_test_fh, y_train_fh, y_test_fh = train_test_split(train_fh,tweets_train.target,test_size=0.2,random_state=2020)
fh_results = batch_classify(x_train_fh,y_train_fh,x_test_fh,y_test_fh,False)

In [None]:
# Data for Word2Vec
x_train_w2vec, x_test_w2vec, y_train_w2vec, y_test_w2vec = train_test_split(train_w2vec,tweets_train.target,test_size=0.2,random_state=2020)
w2vec_results = batch_classify(x_train_w2vec, y_train_w2vec,x_test_w2vec, y_test_w2vec)

In [None]:
# Data for Word2Vec + TF-IDF
x_train_w2vecTfid, x_test_w2vecTfid, y_train_w2vecTfid, y_test_w2vecTfid = train_test_split(train_w2vecTfid,tweets_train.target,test_size=0.2,random_state=2020)
w2vecTfid_results = batch_classify(x_train_w2vecTfid, y_train_w2vecTfid,x_test_w2vecTfid, y_test_w2vecTfid)

In [None]:
# Data for the TensorFlow Universal Sentence Encoder (USE)
x_train_usetf, x_test_usetf, y_train_usetf, y_test_usetf = train_test_split(train_use_tf,tweets_train.target,test_size=0.2,random_state=2020)
usetf_results = batch_classify(x_train_usetf, y_train_usetf,x_test_usetf, y_test_usetf)

# 4. Results
I compare the performance of the different models tried.

In [None]:
# Results for CountVector
cv_results.sort_values(by=["Prec. test", "AUC score"], ascending=(False,False))

In [None]:
# Results for TF-IDF
tf_results.sort_values(by=["Prec. test", "AUC score"], ascending=(False,False))

In [None]:
# Results for CountVector + n-grams
cv_ng_results.sort_values(by=["Prec. test", "AUC score"], ascending=(False,False))

In [None]:
# Results for TF-IDF + n-grams
tf_ng_results.sort_values(by=["Prec. test", "AUC score"], ascending=(False,False))

In [None]:
# Results for feature hashing
fh_results.sort_values(by=["Prec. test","AUC score"],ascending=(False,False))

In [None]:
# Results for Word2Vec
w2vec_results.sort_values(by=["Prec. test", "AUC score"], ascending=(False,False))

In [None]:
# Results for Word2Vec + TF-IDF
w2vecTfid_results.sort_values(by=["Prec. test", "AUC score"], ascending=(False,False))

In [None]:
# Results for the TensorFlow Universal Sentence Encoder
usetf_results.sort_values(by=["Prec. test","AUC score"],ascending=(False,False))

### 4.1 General analysis
I compare the metrics of all the models.

In [None]:
def graph_classifier(x_train, y_train, title):
    results = []
    names = []
    # random_state only has an effect with shuffle=True (recent scikit-learn raises an error otherwise)
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=2020)
    for key, classifier in modelsDict.items():
        cv_results = model_selection.cross_val_score(classifier, x_train, y_train, cv=kfold, scoring='accuracy')
        results.append(cv_results)
        names.append(key)

    fig, ax = plt.subplots(figsize=(16,8))
    fig.suptitle('Algorithm comparison')
    ax.set_title(title)
    ax.boxplot(results)
    ax.set_xticklabels(names)
    ax.axhline(y=0.80, color='g', linestyle='-')  # reference line at 80% accuracy
    plt.show()
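The 10-fold scheme behind `graph_classifier` can be pictured with a small index-splitting sketch. This is an illustration in plain Python, not scikit-learn's implementation, and `kfold_indices` is a hypothetical helper. In the sketch, indices are shuffled once with a fixed seed and cut into nearly equal folds, and each fold serves as the validation set exactly once.

```python
import random

def kfold_indices(n_samples, n_splits=10, seed=2020):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first (n_samples % n_splits) folds get one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(n_samples=25, n_splits=10))
val_sizes = [len(val) for _, val in folds]
print(val_sizes)  # [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
```

Each box in the plot therefore summarizes 10 accuracy values, one per validation fold, for a single classifier.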


In [None]:
# Data for CountVector
graph_classifier(x_train_cv, y_train_cv,"CountVector")

In [None]:
# Data for TF-IDF
graph_classifier(x_train_tf, y_train_tf,"TF-IDF (1-2-gram)")

In [None]:
# Data for CountVector + n-grams
graph_classifier(x_train_ng_cv, y_train_ng_cv,"CountVector (1-3-gram)")

In [None]:
# Data for TF-IDF + n-grams
graph_classifier(x_train_ng_tf, y_train_ng_tf,"TF-IDF (1-3-gram)")

In [None]:
# Data for feature hashing
graph_classifier(x_train_fh, y_train_fh,"Feature Hashing")

In [None]:
# Data for Word2Vec
graph_classifier(x_train_w2vec, y_train_w2vec,"w2vec")

In [None]:
# Data for Word2Vec + TF-IDF
graph_classifier(x_train_w2vecTfid, y_train_w2vecTfid,"w2vec + TF-IDF")

In [None]:
# Data for the TensorFlow Universal Sentence Encoder
graph_classifier(x_train_usetf, y_train_usetf,"Tensorflow USE")

# 5. Model improvement and selection
To find the best configuration for each model, I test different hyperparameters within each one and then select among the best candidates.

### 5.0 Pre-selection of the best models
To avoid comparing every model and all of its combinations, I keep only the ones that gave the best results.

In [None]:
#del modelsDict["Key"] 
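One way to make the pre-selection explicit is to rank the candidates by a score and keep the top few. The sketch below uses the Kaggle scores noted in the submission comments of section 6 for the Universal Sentence Encoder features. `keep_best` and the choice of `top_k=2` are illustrative assumptions, and the stand-in dictionary holds placeholder strings rather than estimators.

```python
def keep_best(models, scores, top_k):
    """Return a new dict with the top_k models ranked by score (highest first)."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return {name: models[name] for name in ranked if name in models}

# Kaggle scores taken from the submission comments in section 6.
kaggle_scores = {
    "Random Forest": 0.80324,
    "kNN": 0.78884,
    "RidgeClassifier": 0.79221,
    "Logistic Regression": 0.80416,
}

# Placeholder values standing in for the estimators stored in modelsDict.
models = {name: f"<{name} estimator>" for name in kaggle_scores}

selected = keep_best(models, kaggle_scores, top_k=2)
print(list(selected))  # ['Logistic Regression', 'Random Forest']
```

In the notebook this would amount to `modelsDict = keep_best(modelsDict, scores, top_k)`, with `no_classifiers` recomputed afterwards if it is still used.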


### 5.1 Hyperparameter search
I list the hyperparameters available for each model.

In [None]:
for key, classifier in modelsDict.items():
    print("Key: " + key)
    print(classifier.get_params().keys())
    print("-"*50)


I create a dictionary to store the collection of hyperparameters.

In [None]:
modelsHyper = dict()
modelsDict.keys()

##### 5.1.1 Gradient Boost

In [None]:
modelsHyper["Gradient Boost"] = {
    "":"",
    "":""
}

##### 5.1.2 Random Forest

In [None]:
modelsHyper["Random Forest"] = {
    "":"",
    "":""
}

##### 5.1.3 Decision Tree

In [None]:
modelsHyper["Decision Tree"] = {
    "":"",
    "":""
}

##### 5.1.4 KNN

In [None]:
modelsHyper["kNN"] = {
    "":"",
    "":""
}

##### 5.1.5 MNB

In [None]:
modelsHyper["MNB"] = {
    "":"",
    "":""
}

##### 5.1.6 GNB

In [None]:
modelsHyper["GNB"] = {
    "":"",
    "":""
}

##### 5.1.7 RidgeClassifier

In [None]:
modelsHyper["RidgeClassifier"] = {
    "":"",
    "":""
}

##### 5.1.8 Perceptron

In [None]:
modelsHyper["Perceptron"] = {
    "":"",
    "":""
}

##### 5.1.9 xgBoost

In [None]:
modelsHyper["xgboost"] = {
    "":"",
    "":""
}

##### 5.1.10 Logistic Regression

In [None]:
modelsHyper["Logistic Regression"] = {
    "":"",
    "":""
}
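Because the hyperparameter grids are looked up by model name, the keys of `modelsHyper` need to match those of `modelsDict` exactly, including case and spaces. A small check along these lines would catch a mismatch before a long search is started. `find_key_mismatches` is a hypothetical helper, and the dictionaries are stand-ins.

```python
def find_key_mismatches(models, grids):
    """Return (models without a grid, grids without a model), each sorted."""
    model_keys, grid_keys = set(models), set(grids)
    return sorted(model_keys - grid_keys), sorted(grid_keys - model_keys)

# Stand-ins for modelsDict / modelsHyper (values omitted on purpose).
models = {"Decision Tree": None, "RidgeClassifier": None, "kNN": None}
grids = {"Decision tree": {}, "Ridge Classifier": {}, "kNN": {}}

missing_grid, unknown_grid = find_key_mismatches(models, grids)
print(missing_grid)  # ['Decision Tree', 'RidgeClassifier']
print(unknown_grid)  # ['Decision tree', 'Ridge Classifier']
```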

### 5.2 Hyperparameter selection
I use grid search (`GridSearchCV`) to find the best hyperparameters for each model.


In [None]:
def Cross_validation(classifier,param_grid):
    grid_search = GridSearchCV(estimator= classifier, param_grid = param_grid, cv=3 , n_jobs = -1, verbose = 2)
    grid_search.fit(x_train_usetf,y_train_usetf)
    return grid_search.best_estimator_
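The tuning cells call `Cross_validation` with dictionaries keyed by feature set, while the helper above takes a single estimator and always fits on `x_train_usetf`. A dictionary-aware variant might look like the sketch below. It is an assumption about the intended behavior, not the notebook's actual helper: `cross_validate_all`, the `datasets` mapping, and the injectable `search_factory` are hypothetical names. By default the factory builds a scikit-learn `GridSearchCV` with the same settings as above.

```python
def cross_validate_all(models, param_grids, datasets, search_factory=None):
    """Run one grid search per feature set and return the best estimators.

    models      : {name: estimator}
    param_grids : {name: parameter grid for that estimator}
    datasets    : {name: (X_train, y_train)}
    """
    if search_factory is None:
        # Imported lazily so the helper can be exercised without scikit-learn.
        from sklearn.model_selection import GridSearchCV

        def search_factory(estimator, grid):
            return GridSearchCV(estimator=estimator, param_grid=grid,
                                cv=3, n_jobs=-1, verbose=2)

    best = {}
    for name, estimator in models.items():
        if name not in param_grids or name not in datasets:
            print(f"Skipping {name!r}: no grid or dataset registered")
            continue
        X_train, y_train = datasets[name]
        search = search_factory(estimator, param_grids[name])
        search.fit(X_train, y_train)
        print(name, search.best_params_, search.best_score_)
        best[name] = search.best_estimator_
    return best
```

A call could then pass, for example, `modelGB`, `step_2`, and a mapping such as `{"Universal sentence encoder": (x_train_usetf, y_train_usetf)}`. Which training matrix belongs to which key is itself an assumption that has to be confirmed against the rest of the notebook.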

# 6. Submitting the data
I prepare the submission.

In [None]:
def submission(model,test_vector):
    
    '''Input- model=final fit model to be used for predictions
              test_vector=pre-processed and vectorized test dataset
       Output- submission file in .csv format with predictions       
    
    '''    
    sub_df = pd.read_csv('../data/sample_submission.csv')
    sub_df["target"] = model.predict(test_vector)
    sub_df.to_csv("submission.csv", index=False)
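Before uploading, it can be worth checking that the generated file has the expected shape. The sketch below is a plain-Python check under the assumption that a valid submission has exactly the columns `id` and `target`, with `target` restricted to 0 or 1 as the assignment describes. `validate_submission` is a hypothetical helper, and the in-memory CSV text stands in for `submission.csv`.

```python
import csv
import io

def validate_submission(csv_file):
    """Return a list of problems found in a submission CSV (empty if none)."""
    reader = csv.DictReader(csv_file)
    header = [name.strip().lower() for name in (reader.fieldnames or [])]
    if header != ['id', 'target']:
        return [f"unexpected columns: {reader.fieldnames}"]

    id_col, target_col = reader.fieldnames
    problems = []
    seen_ids = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        tweet_id = (row[id_col] or '').strip()
        target = (row[target_col] or '').strip()
        if not tweet_id.isdigit():
            problems.append(f"line {line_no}: non-numeric id {tweet_id!r}")
        elif tweet_id in seen_ids:
            problems.append(f"line {line_no}: duplicate id {tweet_id}")
        seen_ids.add(tweet_id)
        if target not in ('0', '1'):
            problems.append(f"line {line_no}: target must be 0 or 1, got {target!r}")
    return problems

# Illustrative inputs: one well-formed file and one with two problems.
good = io.StringIO("id,target\n0,1\n2,0\n3,1\n")
bad = io.StringIO("id,target\n0,1\n0,2\n")
good_problems = validate_submission(good)
bad_problems = validate_submission(bad)
```

On the real file the call would be `validate_submission(open("submission.csv", newline=""))`, run right after `submission(...)` has written it.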
        

In [None]:
#MNB + countVector
mnb_model = MultinomialNB()
mnb_model.fit(x_train_cv, y_train_cv)
submission(mnb_model,test_cv)

In [None]:
#Logistic Regression + TF-IDF (1-2-grams)
lr_model = LogisticRegression(C=1.0)
lr_model.fit(x_train_tf, y_train_tf)
submission(lr_model,test_tf)

In [None]:
#MNB + TF-IDF (1-2-grams)
mnb_model = MultinomialNB()
mnb_model.fit(x_train_tf, y_train_tf)
submission(mnb_model,test_tf)

In [None]:
# TensorFlow + RandomForest (0.80324)
trf_model = modelsDict["Random Forest"]
trf_model.fit(x_train_usetf,y_train_usetf)
submission(trf_model,test_use_tf)

In [None]:
# TensorFlow + Knn (0.78884)
tknn_model = modelsDict["kNN"]
tknn_model.fit(x_train_usetf,y_train_usetf)
submission(tknn_model,test_use_tf)

In [None]:
# TensorFlow + Ridge (0.79221)
trid_model = modelsDict["RidgeClassifier"]
trid_model.fit(x_train_usetf,y_train_usetf)
submission(trid_model,test_use_tf)

In [None]:
# TensorFlow + Logistic Reg (0.80416)
tlr_model = modelsDict["Logistic Regression"]
tlr_model.fit(x_train_usetf,y_train_usetf)
submission(tlr_model,test_use_tf)

In [None]:
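# Random Forest + tuned hyperparameters
# NOTE: "RandomForest up" is not among the keys of modelsDict defined in section 3.1;
# the tuned Random Forest has to be registered under that key before running this cell.
hip_model = modelsDict["RandomForest up"]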
hip_model.fit(x_train_usetf,y_train_usetf)
submission(hip_model,test_use_tf)

In [None]:
# TensorFlow --- logistic + Hyper (0.80876)
lrh_model = LogisticRegression(C=0.615848211066026, solver='liblinear')
lrh_model.fit(x_train_usetf,y_train_usetf)
submission(lrh_model,test_use_tf)

In [None]:

train_use_tf2 = pd.DataFrame(train_use_tf)
train_use_k_tf2 = pd.DataFrame(train_use_tf_k)
train_use_tf2_k = pd.concat([train_use_tf2, train_use_k_tf2], axis=1, sort=False)

test_use_tf2 = pd.DataFrame(test_use_tf)
test_use_k_tf2 = pd.DataFrame(test_use_tf_k)
test_use_tf2_full = pd.concat([test_use_tf2, test_use_k_tf2], axis=1, sort=False)

x_train_usetf2, x_test_usetf2, y_train_usetf2, y_test_usetf2 = train_test_split(train_use_tf2_k,tweets_train.target,test_size=0.2,random_state=2020)

#Tensorflow --- Gradient boost (0.81366)
tgb_model = GradientBoostingClassifier(learning_rate=0.001, max_depth=2, max_features=38,
                           min_samples_leaf=30, n_estimators=9600,
                           random_state=2020, subsample=0.8)
tgb_model.fit(x_train_usetf2,y_train_usetf2)
submission(tgb_model,test_use_tf2_full)

In [None]:
x_train_usetf, x_test_usetf, y_train_usetf, y_test_usetf = train_test_split(train_use_tf,tweets_train.target,test_size=0.2,random_state=2020)

#Tensorflow --- gradient boost hyper TEXT ONLY (0.80508)
gb_model = GradientBoostingClassifier(max_depth=4, max_features=19, min_samples_leaf=42,
                           min_samples_split=110, n_estimators=60,
                           random_state=2020, subsample=0.8)
gb_model.fit(x_train_usetf,y_train_usetf)
submission(gb_model,test_use_tf)

In [None]:
#TENSORFLOW --- gradient boost Hyper TEXT + KEYWORD + SENTIMENT (0.82010)

train_use_tf2 = pd.DataFrame(train_use_tf)
train_use_k_tf2 = pd.DataFrame(train_use_tf_k)
train_use_tf2_k = pd.concat([train_use_tf2, train_use_k_tf2], axis=1, sort=False)
train_use_tf2_s = pd.concat([train_use_tf2_k, train_df2['sentiment_score']], axis=1, sort=False)

test_use_tf2 = pd.DataFrame(test_use_tf)
test_use_k_tf2 = pd.DataFrame(test_use_tf_k)
test_use_tf2_full = pd.concat([test_use_tf2, test_use_k_tf2], axis=1, sort=False)
test_use_tf2_s = pd.concat([test_use_tf2_full, test_df2['sentiment_score']], axis=1, sort=False)

x_train_usetf_sen, x_test_usetf_sen, y_train_usetf_sen, y_test_usetf_sen = train_test_split(train_use_tf2_s,tweets_train.target,test_size=0.2,random_state=2020)

sen_model = GradientBoostingClassifier(learning_rate=0.0055, max_depth=2, max_features=27,
                           min_samples_leaf=23, n_estimators=4850,
                           random_state=2020, subsample=0.8)
sen_model.fit(x_train_usetf_sen, y_train_usetf_sen)
submission(sen_model,test_use_tf2_s)

In [None]:
#Tensorflow --- gradient boost hyper + text + key + sent + extra adjustments
sen_model2 = GradientBoostingClassifier(learning_rate=0.0053, max_depth=2, max_features=27,
                           min_samples_leaf=23, n_estimators=4840,
                           random_state=2020, subsample=0.8)
sen_model2.fit(x_train_usetf_sen,y_train_usetf_sen)
# Submit the model fitted in this cell (sen_model2), not the previous sen_model
submission(sen_model2,test_use_tf2_s)

In [None]:
#Tensorflow + Logistic regression (full) (0.81428)
log_model = LogisticRegression(C=0.985, class_weight={0: 0.6, 1: 0.4}, solver='liblinear',tol=0.0001)
log_model.fit(x_train_usetf_sen,y_train_usetf_sen)
submission(log_model,test_use_tf2_s)


In [None]:
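#Tensorflow (full) + PCA + gradient boost PIPELINE (0.81213)
from sklearn.base import clone
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
# clone() gives the pipeline its own unfitted copy, so the fitted sen_model is left untouched
pca_boost_model = Pipeline(steps=[('pca', PCA(n_components=1025)),
                           ('boost', clone(sen_model))])
pca_boost_model.fit(x_train_usetf_sen,y_train_usetf_sen)
submission(pca_boost_model,test_use_tf2_s)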

In [None]:
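from sklearn.base import clone
from sklearn.decomposition import PCA
# Use a new name and a clone so the fitted sen_model is neither overwritten nor nested on re-runs
sen_pca_model = Pipeline(steps=[('sc',StandardScaler()),('pca',PCA(n_components=100)),
                           ('boost', clone(sen_model))])
sen_pca_model.fit(x_train_usetf_sen,y_train_usetf_sen)
submission(sen_pca_model,test_use_tf2_s)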