## Importing the corpus

We use the [Canéphore](https://github.com/ressources-tal/canephore) corpus, which contains French tweets annotated with users' opinions about the Miss France contest. Beforehand, we were able to download 2000 tweets (the corpus has 10000, but the Twitter API limited us), which we grouped into a single file (results.csv) together with their polarity (0 = negative, 1 = positive, Nan = neutral).

In [1]:
import pandas as pd
pd.set_option('max_colwidth',1000)

In [2]:
corpus_frances = pd.read_csv('results_extended.csv', encoding='utf-8')
corpus_frances.sample(20)

Unnamed: 0,content,polarity
3554,'Eh ben notre miss côte d'azur est soit dauphine soit Miss France bravo elle est bien !!!Dommage que Languedoc ne soit pas passée#MissFrance',1
279,'Sylvie Tellier a grossi #ça me console suis pas la seule!!!! #MissFrance',0
3749,'Miss Réunion n'est pas magnifique mais je la trouve mignonne elle a quelque chose en plus. #MissFrance',0
3519,'Bon Miss Réunion !!!! #MissFrance',1
2699,'Mes favorites: Pays de loire et Roussillon #MissFrance',1
4824,'Alsace P1!!! #MissFrance',Nan
1712,'Je paris sur Miss Réunion. #MissFrance',Nan
3152,'HS à l'hôtel après cette grosse journée parisienne mais le cœur est à #Brest ! &lt3 #MissFrance #Bretagne #monpays',Nan
4668,'@Brit_CiciAddict Dommage mec !! =) Miss Alsace a gagné !!! #MissFrance',Nan
2273,'Et à part les Miss France il se passe quoi ce soir dans le monde ? #missfrance @BenThev @zappette @Daphne_Burki01 @FlorencePorcel @Vinvin',Nan


In [3]:
corpus_frances.shape

(5546, 2)

We prepare a second corpus that discards the tweets with neutral polarity (Nan).

In [4]:
corpus_frances_sinNan = corpus_frances.query('polarity != "Nan"')
corpus_frances_sinNan.shape

(2443, 2)

## Tokenizing & Stemming

We take the French stop words from nltk. We also build a list of the characters used as punctuation (we add none, since they are the same as the English ones).

In [5]:
#download french stopwords
import nltk
nltk.download("stopwords")

from nltk.corpus import stopwords
french_stopwords = stopwords.words('french')
french_stopwords

[nltk_data] Downloading package stopwords to /home/ubuntu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['au',
 'aux',
 'avec',
 'ce',
 'ces',
 'dans',
 'de',
 'des',
 'du',
 'elle',
 'en',
 'et',
 'eux',
 'il',
 'je',
 'la',
 'le',
 'leur',
 'lui',
 'ma',
 'mais',
 'me',
 'même',
 'mes',
 'moi',
 'mon',
 'ne',
 'nos',
 'notre',
 'nous',
 'on',
 'ou',
 'par',
 'pas',
 'pour',
 'qu',
 'que',
 'qui',
 'sa',
 'se',
 'ses',
 'son',
 'sur',
 'ta',
 'te',
 'tes',
 'toi',
 'ton',
 'tu',
 'un',
 'une',
 'vos',
 'votre',
 'vous',
 'c',
 'd',
 'j',
 'l',
 'à',
 'm',
 'n',
 's',
 't',
 'y',
 'été',
 'étée',
 'étées',
 'étés',
 'étant',
 'étante',
 'étants',
 'étantes',
 'suis',
 'es',
 'est',
 'sommes',
 'êtes',
 'sont',
 'serai',
 'seras',
 'sera',
 'serons',
 'serez',
 'seront',
 'serais',
 'serait',
 'serions',
 'seriez',
 'seraient',
 'étais',
 'était',
 'étions',
 'étiez',
 'étaient',
 'fus',
 'fut',
 'fûmes',
 'fûtes',
 'furent',
 'sois',
 'soit',
 'soyons',
 'soyez',
 'soient',
 'fusse',
 'fusses',
 'fût',
 'fussions',
 'fussiez',
 'fussent',
 'ayant',
 'ayante',
 'ayantes',
 'ayants',
 'eu'

In [6]:
from string import punctuation
non_words = list(punctuation)
non_words

['!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~']

We use the SnowballStemmer stemming algorithm, which is also available for French.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer       
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

# based on http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
stemmer = SnowballStemmer('french')
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    # remove non letters
    text = ''.join([c for c in text if c not in non_words])
    # tokenize
    tokens =  word_tokenize(text)

    # stem
    try:
        stems = stem_tokens(tokens, stemmer)
    except Exception as e:
        print(e)
        print(text)
        stems = ['']
    return stems

stemmer

<nltk.stem.snowball.SnowballStemmer at 0x7fe576b27cf8>
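
The character filter in `tokenize` deletes punctuation outright, so elided French forms such as `l'hôtel` collapse into a single token (`lhôtel`) instead of being split at the apostrophe. The standalone sketch below compares that behaviour with a variant that replaces punctuation by spaces. `strip_punctuation` and `space_punctuation` are hypothetical helper names used only for illustration; they are not part of the notebook.

```python
from string import punctuation

# Same character set as `non_words` in the notebook.
NON_WORDS = set(punctuation)


def strip_punctuation(text):
    """Delete punctuation characters, as the notebook's tokenize() does."""
    return ''.join(c for c in text if c not in NON_WORDS)


def space_punctuation(text):
    """Hypothetical variant: replace punctuation with spaces instead."""
    return ''.join(' ' if c in NON_WORDS else c for c in text)


sample = "HS à l'hôtel #MissFrance"
print(strip_punctuation(sample).split())  # ['HS', 'à', 'lhôtel', 'MissFrance']
print(space_punctuation(sample).split())  # ['HS', 'à', 'l', 'hôtel', 'MissFrance']
```

With the second variant the isolated `l` is one of the entries of the French stop word list shown earlier, so it could then be filtered out. Whether this changes the scores on this corpus has not been checked here.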

## Model evaluation

We will try three different models: LinearSVC, k-NN and Naive Bayes.

In [8]:
# sklearn.cross_validation was removed; cross_val_score lives in model_selection
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline



### Three polarities (1 = positive, 0 = negative, Nan = neutral)

We convert the polarity values to integers (polarity_num).

In [9]:
corpus_frances['polarity_num'] = 0
# .loc assigns on the frame itself and avoids chained-assignment warnings
corpus_frances.loc[corpus_frances.polarity.isin(['1']), 'polarity_num'] = 1
corpus_frances.loc[corpus_frances.polarity.isin(['Nan']), 'polarity_num'] = 2
corpus_frances.dtypes


content         object
polarity        object
polarity_num     int64
dtype: object
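
The same encoding can also be written as a single lookup table. The sketch below uses a small toy frame rather than the real corpus, and `POLARITY_CODES` is a name introduced here for illustration. It assumes, as the `isin(['1'])` and `isin(['Nan'])` calls suggest, that the `polarity` column holds the strings `'0'`, `'1'` and `'Nan'`.

```python
import pandas as pd

# Toy frame mimicking the `polarity` column (string labels, as in the corpus).
toy = pd.DataFrame({'polarity': ['1', '0', 'Nan', '1']})

# Explicit label-to-integer table; 2 stands for the neutral class.
POLARITY_CODES = {'0': 0, '1': 1, 'Nan': 2}

toy['polarity_num'] = toy['polarity'].map(POLARITY_CODES)
print(toy['polarity_num'].tolist())  # [1, 0, 2, 1]
```

One difference from the cell above: with `map`, a label missing from the table becomes `NaN` instead of silently falling back to 0, which makes unexpected values easier to spot.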

Neutral tweets form the largest class in the corpus.

In [10]:
corpus_frances.polarity_num.value_counts(normalize=True)

2    0.559502
1    0.222322
0    0.218175
Name: polarity_num, dtype: float64
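
Because the neutral class accounts for about 56% of the corpus, a classifier that always answers "neutral" would already reach an accuracy of roughly 0.56. It is useful to keep that floor in mind when reading the accuracy figures. A minimal sketch of the baseline, on toy labels (`majority_baseline` is a hypothetical helper, not a library function):

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Toy labels with the same ordering of class sizes as the corpus (2 > 1 > 0).
toy_labels = [2, 2, 2, 1, 1, 0]
print(majority_baseline(toy_labels))  # 0.5
```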

The NLTK data packages must be downloaded (unless this has already been done once).

In [11]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

We run a GridSearch to find the optimal parameters for each model (this only needs to be done once).

In [27]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

vectorizer = CountVectorizer(
                analyzer = 'word',
                tokenizer = tokenize,
                lowercase = True,
                stop_words = french_stopwords)

pipeline = Pipeline([
    ('vect', vectorizer),
    ('cls', LinearSVC()),
])



parameters = {
    'vect__max_df': (0.5, 1.9),
    'vect__min_df': (10, 20,50),
    'vect__max_features': (500, 1000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'cls__C': (0.2, 0.5, 0.7),
    'cls__loss': ('hinge', 'squared_hinge'),
    'cls__max_iter': (500, 1000)
}


grid_search_lsvc = GridSearchCV(pipeline, parameters, n_jobs=-1 , scoring='accuracy')
grid_search_lsvc.fit(corpus_frances.content, corpus_frances.polarity_num)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=['au', 'aux...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'vect__max_df': (0.5, 1.9), 'vect__min_df': (10, 20, 50), 'vect__max_features': (500, 1000), 'vect__ngram_range': ((1, 1), (1, 2)), 'cls__C': (0.2, 0.5, 0.7), 'cls__loss': ('hinge', 'squared_hinge'), 'cls__max_iter': (500, 1000)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='accuracy', verbose=0)

In [None]:
grid_search_lsvc.best_params_
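
To get a feel for why the search is worth running only once, the sketch below enumerates the LinearSVC grid by hand and counts the candidates. It only reproduces the combinatorics; it does not call scikit-learn.

```python
from itertools import product

# Same grid as in the LinearSVC cell above.
parameters = {
    'vect__max_df': (0.5, 1.9),
    'vect__min_df': (10, 20, 50),
    'vect__max_features': (500, 1000),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'cls__C': (0.2, 0.5, 0.7),
    'cls__loss': ('hinge', 'squared_hinge'),
    'cls__max_iter': (500, 1000),
}

# Every candidate is one choice per parameter.
candidates = [dict(zip(parameters, values))
              for values in product(*parameters.values())]
print(len(candidates))  # 288
```

Each of the 288 candidates is fitted once per cross-validation fold, so the total number of fits is 288 times the number of folds, with tokenizing and stemming repeated in every fit.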

In [38]:
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import GridSearchCV

vectorizer = CountVectorizer(
                analyzer = 'word',
                tokenizer = tokenize,
                lowercase = True,
                stop_words = french_stopwords)

pipeline = Pipeline([
    ('vect', vectorizer),
    ('cls', KNeighborsClassifier()),
])



parameters = {
    'vect__max_df': (0.5, 1.9),
    'vect__min_df': (10, 20,50),
    'vect__max_features': (500, 1000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'cls__n_neighbors': (20,50,100),
    'cls__weights': ('uniform', 'distance')
}


grid_search_knn = GridSearchCV(pipeline, parameters, n_jobs=-1 , scoring='accuracy')
grid_search_knn.fit(corpus_frances.content, corpus_frances.polarity_num)

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=Pipeline(steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=['au', 'aux...owski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'))]),
          fit_params={}, iid=True, n_iter=10, n_jobs=-1,
          param_distributions={'vect__max_df': array([ 0.5,  0.6,  0.7,  0.8,  0.9,  1. ,  1.1,  1.2,  1.3,  1.4,  1.5,
        1.6,  1.7,  1.8,  1.9,  2. ,  2.1,  2.2,  2.3,  2.4,  2.5,  2.6,
        2.7,  2.8,  2.9]), 'vect__min_df': array([10, 20, 30, 40, 50, 60, 70, 80, 90]), 'vect__max_features': array([...6, 41, 46, 51, 56, 61, 66, 71, 76, 81,
       86, 91, 96]), 'cls__weights': ('uniform', 'distance')},
          pre_dispatch='2*n_jobs', random_st

In [None]:
grid_search_knn.best_params_

In [None]:
from sklearn.naive_bayes import MultinomialNB

from sklearn.model_selection import GridSearchCV

vectorizer = CountVectorizer(
                analyzer = 'word',
                tokenizer = tokenize,
                lowercase = True,
                stop_words = french_stopwords)

pipeline = Pipeline([
    ('vect', vectorizer),
    ('cls', MultinomialNB()),
])



parameters = {
    'vect__max_df': (0.5, 1.9),
    'vect__min_df': (10, 20,50),
    'vect__max_features': (500, 1000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'cls__alpha': (0.2, 0.5, 1),
    'cls__fit_prior': (True, False)  # booleans: the string 'False' would be truthy
}


grid_search_mnb = GridSearchCV(pipeline, parameters, n_jobs=-1 , scoring='accuracy')
grid_search_mnb.fit(corpus_frances.content, corpus_frances.polarity_num)

In [19]:
grid_search_mnb.best_params_

{'cls__alpha': 0.28000000000000003,
 'cls__fit_prior': 'True',
 'vect__max_df': 0.5,
 'vect__max_features': 500,
 'vect__min_df': 10,
 'vect__ngram_range': (1, 1)}

**Accuracy**

To measure how well each model performs, we use the optimal parameters found above (they need to be updated here).

In [12]:
from sklearn.svm import LinearSVC
model = LinearSVC(C=.5, loss='hinge',max_iter=500,multi_class='ovr',
              random_state=None,
              penalty='l2',
              tol=0.0001
)

vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = french_stopwords,
    min_df = 10,
    max_df = 0.5,
    ngram_range=(1, 2),
    max_features=1000
)

corpus_data_features = vectorizer.fit_transform(corpus_frances.content)
corpus_data_features_nd = corpus_data_features.toarray()
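
The `min_df` and `max_df` arguments prune the vocabulary by document frequency. The sketch below imitates that filter in plain Python on a toy corpus, following the convention documented for `CountVectorizer`: an integer is an absolute number of documents and a float is a proportion of documents. `filter_by_document_frequency` is a hypothetical helper written for illustration, not the library's implementation.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, min_df, max_df):
    """Keep terms whose document frequency lies within [min_df, max_df].

    An int is treated as an absolute document count and a float as a
    proportion of documents.
    """
    n_docs = len(tokenized_docs)
    # Count each term once per document.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)


toy_docs = [
    ['miss', 'france', 'bravo'],
    ['miss', 'france'],
    ['miss', 'bravo'],
    ['miss'],
]
# 'miss' appears in all 4 documents, above the 50% ceiling, so it is dropped.
print(filter_by_document_frequency(toy_docs, min_df=2, max_df=0.5))
# ['bravo', 'france']
```

Under this reading, a float `max_df` above 1.0, such as the 1.9 used in the grids, can never exclude a term, and recent scikit-learn releases may reject it as out of range.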

In [13]:
scores = cross_val_score(
    model,
    corpus_data_features_nd[0:len(corpus_frances)],
    y=corpus_frances.polarity_num,
    scoring='accuracy',
    cv=5
    )

scores.mean()

0.7035675025205711
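
In the cells above, the vectorizer is fitted on the whole corpus before `cross_val_score` is called, so each validation fold has already contributed to the vocabulary and to the `min_df`/`max_df` filtering. One possible alternative, sketched here on toy data with a default `CountVectorizer` and `MultinomialNB` (both choices are assumptions made to keep the example small), is to pass a `Pipeline` to `cross_val_score` so that the vocabulary is rebuilt from the training folds only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data standing in for corpus_frances.content / polarity_num.
texts = [
    'bravo miss france', 'elle est magnifique',
    'bravo bravo', 'magnifique bravo',
    'dommage miss', 'elle est nulle',
    'dommage dommage', 'nulle dommage',
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ('vect', CountVectorizer()),  # refitted inside every training fold
    ('cls', MultinomialNB()),
])

# Raw texts go in; vectorizing happens within each cross-validation split.
scores = cross_val_score(pipeline, texts, labels, scoring='accuracy', cv=2)
print(scores.mean())
```

On a corpus of this size the difference from the figures above may well be small; the sketch only shows the mechanics.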

In [14]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=20)

vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = french_stopwords,
    min_df = 20,
    max_df = 0.5,
    ngram_range=(1, 1),
    max_features=500
)

corpus_data_features = vectorizer.fit_transform(corpus_frances.content)
corpus_data_features_nd = corpus_data_features.toarray()

In [15]:
scores = cross_val_score(
    model,
    corpus_data_features_nd[0:len(corpus_frances)],
    y=corpus_frances.polarity_num,
    scoring='accuracy',
    cv=5
    )

scores.mean()

0.59557615377109963

In [16]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB(alpha=1, fit_prior=True)

vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = french_stopwords,
    min_df = 10,
    max_df = 0.5,
    ngram_range=(1, 1),
    max_features=500
)

corpus_data_features = vectorizer.fit_transform(corpus_frances.content)
corpus_data_features_nd = corpus_data_features.toarray()

In [17]:
scores = cross_val_score(
    model,
    corpus_data_features_nd[0:len(corpus_frances)],
    y=corpus_frances.polarity_num,
    scoring='accuracy',
    cv=5
    )

scores.mean()

0.6694975119523856

### Two polarities (1 = positive, 0 = negative)

We convert the polarity values to integers (polarity_num).

In [18]:
# work on an explicit copy so the assignments do not target a slice of corpus_frances
corpus_frances_sinNan = corpus_frances_sinNan.copy()
corpus_frances_sinNan['polarity_num'] = 0
corpus_frances_sinNan.loc[corpus_frances_sinNan.polarity.isin(['1']), 'polarity_num'] = 1
corpus_frances_sinNan.dtypes


content         object
polarity        object
polarity_num     int64
dtype: object

In [19]:
corpus_frances_sinNan.polarity_num.value_counts(normalize=True)

1    0.504707
0    0.495293
Name: polarity_num, dtype: float64

We run a GridSearch to find the optimal parameters for each model (this only needs to be done once).

In [27]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

vectorizer = CountVectorizer(
                analyzer = 'word',
                tokenizer = tokenize,
                lowercase = True,
                stop_words = french_stopwords)

pipeline = Pipeline([
    ('vect', vectorizer),
    ('cls', LinearSVC()),
])



parameters = {
    'vect__max_df': (0.5, 1.9),
    'vect__min_df': (10, 20,50),
    'vect__max_features': (500, 1000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'cls__C': (0.2, 0.5, 0.7),
    'cls__loss': ('hinge', 'squared_hinge'),
    'cls__max_iter': (500, 1000)
}


grid_search_lsvc = GridSearchCV(pipeline, parameters, n_jobs=-1 , scoring='roc_auc')
grid_search_lsvc.fit(corpus_frances_sinNan.content, corpus_frances_sinNan.polarity_num)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=['au', 'aux...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'vect__max_df': (0.5, 1.9), 'vect__min_df': (10, 20, 50), 'vect__max_features': (500, 1000), 'vect__ngram_range': ((1, 1), (1, 2)), 'cls__C': (0.2, 0.5, 0.7), 'cls__loss': ('hinge', 'squared_hinge'), 'cls__max_iter': (500, 1000)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='accuracy', verbose=0)

In [None]:
grid_search_lsvc.best_params_

In [38]:
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import GridSearchCV

vectorizer = CountVectorizer(
                analyzer = 'word',
                tokenizer = tokenize,
                lowercase = True,
                stop_words = french_stopwords)

pipeline = Pipeline([
    ('vect', vectorizer),
    ('cls', KNeighborsClassifier()),
])



parameters = {
    'vect__max_df': (0.5, 1.9),
    'vect__min_df': (10, 20,50),
    'vect__max_features': (500, 1000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'cls__n_neighbors': (20,50,100),
    'cls__weights': ('uniform', 'distance')
}


grid_search_knn = GridSearchCV(pipeline, parameters, n_jobs=-1 , scoring='roc_auc')
grid_search_knn.fit(corpus_frances_sinNan.content, corpus_frances_sinNan.polarity_num)

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=Pipeline(steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=['au', 'aux...owski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'))]),
          fit_params={}, iid=True, n_iter=10, n_jobs=-1,
          param_distributions={'vect__max_df': array([ 0.5,  0.6,  0.7,  0.8,  0.9,  1. ,  1.1,  1.2,  1.3,  1.4,  1.5,
        1.6,  1.7,  1.8,  1.9,  2. ,  2.1,  2.2,  2.3,  2.4,  2.5,  2.6,
        2.7,  2.8,  2.9]), 'vect__min_df': array([10, 20, 30, 40, 50, 60, 70, 80, 90]), 'vect__max_features': array([...6, 41, 46, 51, 56, 61, 66, 71, 76, 81,
       86, 91, 96]), 'cls__weights': ('uniform', 'distance')},
          pre_dispatch='2*n_jobs', random_st

In [None]:
grid_search_knn.best_params_

In [None]:
from sklearn.naive_bayes import MultinomialNB

from sklearn.model_selection import GridSearchCV

vectorizer = CountVectorizer(
                analyzer = 'word',
                tokenizer = tokenize,
                lowercase = True,
                stop_words = french_stopwords)

pipeline = Pipeline([
    ('vect', vectorizer),
    ('cls', MultinomialNB()),
])



parameters = {
    'vect__max_df': (0.5, 1.9),
    'vect__min_df': (10, 20,50),
    'vect__max_features': (500, 1000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'cls__alpha': (0.2, 0.5, 1),
    'cls__fit_prior': (True, False)  # booleans: the string 'False' would be truthy
}


grid_search_mnb = GridSearchCV(pipeline, parameters, n_jobs=-1 , scoring='roc_auc')
grid_search_mnb.fit(corpus_frances_sinNan.content, corpus_frances_sinNan.polarity_num)

In [19]:
grid_search_mnb.best_params_

{'cls__alpha': 0.28000000000000003,
 'cls__fit_prior': 'True',
 'vect__max_df': 0.5,
 'vect__max_features': 500,
 'vect__min_df': 10,
 'vect__ngram_range': (1, 1)}

**Accuracy**

To measure how well each model performs, we use the optimal parameters found above (they need to be updated here).

In [20]:
model = LinearSVC(C=.2, loss='squared_hinge',max_iter=500,multi_class='ovr',
              random_state=None,
              penalty='l2',
              tol=0.0001
)

vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = french_stopwords,
    min_df = 10,
    max_df = 0.5,
    ngram_range=(1, 2),
    max_features=500
)

corpus_data_features = vectorizer.fit_transform(corpus_frances_sinNan.content)
corpus_data_features_nd = corpus_data_features.toarray()

In [21]:
scores = cross_val_score(
    model,
    corpus_data_features_nd[0:len(corpus_frances_sinNan)],
    y=corpus_frances_sinNan.polarity_num,
    scoring='roc_auc',
    cv=5
    )

scores.mean()

0.86399954734649564
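
The binary experiments are scored with `scoring='roc_auc'` rather than accuracy. For a binary problem this value can be read as the probability that a randomly chosen positive tweet receives a higher score than a randomly chosen negative one. The sketch below computes it that way on toy scores; `roc_auc` here is a hand-written illustration, not the scikit-learn function.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Since the measure depends only on how the tweets are ranked, the values in this subsection are not directly comparable with the accuracy figures of the three-polarity experiments.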

In [22]:
model = KNeighborsClassifier(n_neighbors=50)

vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = french_stopwords,
    min_df = 10,
    max_df = 0.5,
    ngram_range=(1, 1),
    max_features=500
)

corpus_data_features = vectorizer.fit_transform(corpus_frances_sinNan.content)
corpus_data_features_nd = corpus_data_features.toarray()

In [23]:
scores = cross_val_score(
    model,
    corpus_data_features_nd[0:len(corpus_frances_sinNan)],
    y=corpus_frances_sinNan.polarity_num,
    scoring='roc_auc',
    cv=5
    )

scores.mean()

0.79113436355529954

In [24]:
model = MultinomialNB(alpha=0.2, fit_prior=True)

vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = french_stopwords,
    min_df = 10,
    max_df = 0.5,
    ngram_range=(1, 2),
    max_features=500
)

corpus_data_features = vectorizer.fit_transform(corpus_frances_sinNan.content)
corpus_data_features_nd = corpus_data_features.toarray()

In [25]:
scores = cross_val_score(
    model,
    corpus_data_features_nd[0:len(corpus_frances_sinNan)],
    y=corpus_frances_sinNan.polarity_num,
    scoring='roc_auc',
    cv=5
    )

scores.mean()

0.85011709417124293

## Polarity prediction

**We use the trained model for sentiment analysis on the downloaded tweets**

We load one of the CSV files with the tweets from one of the regions of France (this has to be done for every CSV file).

In [327]:
tweets = pd.read_csv('tweets_debat_dic/dic_Centre-Val de Loire.csv', encoding='utf-8')
tweets.head()

Unnamed: 0,time,text,user,rts,place,lon,lat,dic,dic_rounded
0,2017-05-07 08:42:40,RT @TabNacim: Allez on remballe tout fin du débat\n#2017LeDebat,nome,155,,47751,1675,0.0,Nan
1,2017-05-07 06:35:50,Nan mais ce soir je crois on va jouer nos vies ptn🤞🏼🍣 #2017LeDebat,Paul MONMARTEAU,0,,47751,1675,0.0,Nan
2,2017-05-06 23:29:23,#2017LeDebat il s'agit de choisir le moins con,nαhwel,0,,47751,1675,-1.0,0
3,2017-05-06 20:18:14,RT @ddlarry13: #2017LeDebat j'espère un grand écart entre les deux candidats demain comme ça la haine ne reviendra plus,Good Kisser,1,,47751,1675,0.5,1
4,2017-05-06 19:54:52,#2017LeDebat j'espère un grand écart entre les deux candidats demain comme ça la haine ne reviendra plus,nαhwel,1,,47751,1675,0.5,1


In [328]:
tweets.shape

(954, 9)

**Prediction with the optimal parameters and the trained model**

The optimal parameters found in the previous section must be filled in, both for the three-polarity models and for the binary ones.

In [329]:
pipeline = Pipeline([
    ('vect', CountVectorizer(
            analyzer = 'word',
            tokenizer = tokenize,
            lowercase = True,
            stop_words = french_stopwords,
            min_df = 10,
            max_df = 0.5,
            ngram_range=(1, 2),
            max_features=1000
            )),
    ('cls', LinearSVC(C=.5, loss='hinge',max_iter=500,multi_class='ovr',
             random_state=None,
             penalty='l2',
             tol=0.0001
             )),
])

In [330]:
pipeline.fit(corpus_frances.content, corpus_frances.polarity_num)
tweets['lsvc'] = pipeline.predict(tweets.text)

In [331]:
pipeline = Pipeline([
    ('vect', CountVectorizer(
            analyzer = 'word',
            tokenizer = tokenize,
            lowercase = True,
            stop_words = french_stopwords,
            min_df = 20,
            max_df = 0.5,
            ngram_range=(1, 1),
            max_features=500
            )),
    ('cls', KNeighborsClassifier(n_neighbors=20)),
])

In [332]:
pipeline.fit(corpus_frances.content, corpus_frances.polarity_num)
tweets['knn'] = pipeline.predict(tweets.text)

In [333]:
pipeline = Pipeline([
    ('vect', CountVectorizer(
            analyzer = 'word',
            tokenizer = tokenize,
            lowercase = True,
            stop_words = french_stopwords,
            min_df = 10,
            max_df = 0.5,
            ngram_range=(1, 1),
            max_features=500
            )),
    ('cls', MultinomialNB(alpha=1, fit_prior=True)),
])

In [334]:
pipeline.fit(corpus_frances.content, corpus_frances.polarity_num)
tweets['mnb'] = pipeline.predict(tweets.text)

In [335]:
pipeline = Pipeline([
    ('vect', CountVectorizer(
            analyzer = 'word',
            tokenizer = tokenize,
            lowercase = True,
            stop_words = french_stopwords,
            min_df = 10,
            max_df = 0.5,
            ngram_range=(1, 2),
            max_features=500
            )),
    ('cls', LinearSVC(C=.2, loss='squared_hinge',max_iter=500,multi_class='ovr',
             random_state=None,
             penalty='l2',
             tol=0.0001
             )),
])

In [336]:
pipeline.fit(corpus_frances_sinNan.content, corpus_frances_sinNan.polarity_num)
tweets['lsvc_bin'] = pipeline.predict(tweets.text)

In [337]:
pipeline = Pipeline([
    ('vect', CountVectorizer(
            analyzer = 'word',
            tokenizer = tokenize,
            lowercase = True,
            stop_words = french_stopwords,
            min_df = 10,
            max_df = 0.5,
            ngram_range=(1, 1),
            max_features=500
            )),
    ('cls', KNeighborsClassifier(n_neighbors=50)),
])

In [338]:
pipeline.fit(corpus_frances_sinNan.content, corpus_frances_sinNan.polarity_num)
tweets['knn_bin'] = pipeline.predict(tweets.text)

In [339]:
pipeline = Pipeline([
    ('vect', CountVectorizer(
            analyzer = 'word',
            tokenizer = tokenize,
            lowercase = True,
            stop_words = french_stopwords,
            min_df = 10,
            max_df = 0.5,
            ngram_range=(1, 2),
            max_features=500
            )),
    ('cls', MultinomialNB(alpha=0.2, fit_prior=True)),
])

In [340]:
pipeline.fit(corpus_frances_sinNan.content, corpus_frances_sinNan.polarity_num)
tweets['mnb_bin'] = pipeline.predict(tweets.text)

In [341]:
tweets[['text', 'dic_rounded','lsvc', 'knn', 'mnb', 'lsvc_bin', 'knn_bin', 'mnb_bin']].sample(30)

Unnamed: 0,text,dic_rounded,lsvc,knn,mnb,lsvc_bin,knn_bin,mnb_bin
762,#2017LeDebat il l'a laisse pas parler c'est abusé que il coupe la parole mdrr,Nan,2,2,0,0,0,0
10,"RT @romainribas: #2017LeDebat @EmmanuelMacron ""80% de nos médicaments sont importés"" or avec un TVA FN à 23% les prix vont augmenter (+ inf…",1,2,2,2,0,0,1
450,"RT @romainribas: #2017LeDebat @EmmanuelMacron ""@MLP_officiel profite de l'échec et de la colère Elle utilise toute sa conclusion pr insulte…",0,2,2,2,0,0,0
504,RT @Alex_Quenet: Elle a complètement pété un câble 😂😂 #2017LeDebat,0,2,2,2,0,1,0
665,Mon dieu la cassos 😂😂😂😂😂 #Debat2017 #2017LeDebat,Nan,2,2,2,0,0,0
515,#2017LeDebat Au moins ce débat aura servi à quelque chose : confirmé que #MLP et vraiment inutile à ce pays,0,2,2,2,1,1,1
836,RT @alexandre_spada: Supprimer le voile à l'université : encore une preuve que @MLP_officiel n'a rien compris à la laïcité ! #2017LeDébat #…,0,2,2,2,0,1,0
902,RT @Rob90rtega: jurez il y en a qui voteront pour une futur présidente qui se comporte comme une gamine de lycée #2017LeDebat,1,2,2,2,1,1,0
401,RT @nininet37: #2017LeDebat A l'idée qu'un quart des français se laisse abuser par cette femme me donne des sueurs froides.,Nan,2,2,2,0,0,0
807,#2017LeDebat @MLP_officiel invente des comptes off-shore sur son adversaire mais ne respecte pas l'indépendance de l'autorité judiciaire.,0,2,2,2,0,0,1


We save the tweets with their polarity and coordinates in order to place them on the map.

In [342]:
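# lon/lat are included so that the saved file carries the coordinates mentioned above
tweets[['text', 'lon', 'lat', 'dic_rounded', 'lsvc', 'knn', 'mnb', 'lsvc_bin', 'knn_bin', 'mnb_bin']].to_csv('tweets_debat_completos/Centre-Val de Loire.csv', encoding='utf-8')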
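
Since the load, predict and save steps have to be repeated for every regional CSV, the file handling can be put in a loop. The sketch below covers only the path bookkeeping and assumes every input file follows the `dic_<region>.csv` naming seen in the example above. `output_path_for` is a hypothetical helper, and the prediction cells would go inside the loop.

```python
from pathlib import Path

INPUT_DIR = Path('tweets_debat_dic')         # folder read in the cells above
OUTPUT_DIR = Path('tweets_debat_completos')  # folder written in the last cell


def output_path_for(input_path):
    """Map 'tweets_debat_dic/dic_<region>.csv' to 'tweets_debat_completos/<region>.csv'."""
    name = Path(input_path).name
    if name.startswith('dic_'):
        name = name[len('dic_'):]
    return OUTPUT_DIR / name


# List the input files and where each result would be written.
if INPUT_DIR.is_dir():
    for csv_path in sorted(INPUT_DIR.glob('dic_*.csv')):
        print(csv_path, '->', output_path_for(csv_path))
```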