## Importación del corpus

Utilizamos el corpus de [Canéphore](https://github.com/ressources-tal/canephore) que contiene tweets en francés anotados de opiniones de usuarios sobre el concurso de Miss France. Previamente, hemos podido descargarnos 2000 tweets (el corpus tiene 10000 pero la API de Twitter nos lo limitaba), que hemos agrupado en un mismo archivo (results.csv) junto con su polaridad (0-negativa, 1-positiva, Nan-neutra).

In [1]:
import pandas as pd
pd.set_option('max_colwidth',1000)

In [3]:
corpus_frances = pd.read_csv('results_extended.csv', encoding='utf-8')
corpus_frances.sample(20)

Unnamed: 0,content,polarity
2536,'Roussillon elle a marque 20points #MissFrance',Nan
2055,'Miss Bretagne est trop vilaine. #MissFrance',0
2567,'C'est à quel moment le défilé en bas de survêtement/sweat-shirt qui fait des bouloches/pantoufles/coiffure en freestyle ? #MissFrance',Nan
820,'Moi je vote pour une seconde année de @LauryThilleman #missfrance2012 #TF1',Nan
161,'Tout sa pour laissé Jean-Pierre se préparer en coulisse. A j'te jure ! #TF1',Nan
3395,'Réunion : j'étais sûr. Alsace : aussi. Côte d'Azur : beurk. Pays de Loire : noooon ! Provence : mais beurk ! #MissFrance',0
2745,'Moi je suis pour Miss Languedoc ! #MissFrance',1
821,'Ah ouais Miss Gwada cette année AIE !!! elle fait mal #Beauté #MissFrance',0
4811,'Trop heureuse pour Miss Alsace ! :) #MissFrance',1
1971,'Bon il dise les 12 finalistes #missfrance',Nan


In [4]:
corpus_frances.shape

(5546, 2)

Preparamos otro corpus descartando los tweets con polaridad neutra (Nan).

In [7]:
corpus_frances_sinNan = corpus_frances.query('polarity != "Nan"')
corpus_frances_sinNan.shape

(2443, 2)

## Tokenizing & Stemming

Obtenenemos de nltk las palabras vacías francesas. Obtenemos también una lista de caracteres que se utilizan como puntuación (no añadimos ninguno porque son los mismos que los ingleses).

In [5]:
#download french stopwords
import nltk
nltk.download("stopwords")

from nltk.corpus import stopwords
french_stopwords = stopwords.words('french')
french_stopwords

[nltk_data] Downloading package stopwords to /home/ubuntu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['au',
 'aux',
 'avec',
 'ce',
 'ces',
 'dans',
 'de',
 'des',
 'du',
 'elle',
 'en',
 'et',
 'eux',
 'il',
 'je',
 'la',
 'le',
 'leur',
 'lui',
 'ma',
 'mais',
 'me',
 'même',
 'mes',
 'moi',
 'mon',
 'ne',
 'nos',
 'notre',
 'nous',
 'on',
 'ou',
 'par',
 'pas',
 'pour',
 'qu',
 'que',
 'qui',
 'sa',
 'se',
 'ses',
 'son',
 'sur',
 'ta',
 'te',
 'tes',
 'toi',
 'ton',
 'tu',
 'un',
 'une',
 'vos',
 'votre',
 'vous',
 'c',
 'd',
 'j',
 'l',
 'à',
 'm',
 'n',
 's',
 't',
 'y',
 'été',
 'étée',
 'étées',
 'étés',
 'étant',
 'étante',
 'étants',
 'étantes',
 'suis',
 'es',
 'est',
 'sommes',
 'êtes',
 'sont',
 'serai',
 'seras',
 'sera',
 'serons',
 'serez',
 'seront',
 'serais',
 'serait',
 'serions',
 'seriez',
 'seraient',
 'étais',
 'était',
 'étions',
 'étiez',
 'étaient',
 'fus',
 'fut',
 'fûmes',
 'fûtes',
 'furent',
 'sois',
 'soit',
 'soyons',
 'soyez',
 'soient',
 'fusse',
 'fusses',
 'fût',
 'fussions',
 'fussiez',
 'fussent',
 'ayant',
 'ayante',
 'ayantes',
 'ayants',
 'eu'

In [6]:
from string import punctuation
non_words = list(punctuation)
non_words

['!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~']

Utilizamos el algoritmo de stemming SnowballStemmer, disponible en francés también.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer       
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

# based on http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
stemmer = SnowballStemmer('french')
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    # remove non letters
    text = ''.join([c for c in text if c not in non_words])
    # tokenize
    tokens =  word_tokenize(text)

    # stem
    try:
        stems = stem_tokens(tokens, stemmer)
    except Exception as e:
        print(e)
        print(text)
        stems = ['']
    return stems

stemmer

<nltk.stem.snowball.SnowballStemmer at 0x7f2a188c1ac8>

## Evaluación del modelo

Vamos a probar con tres modelos distintos: LinearSVC, k-NN u Naive Bayes.

In [9]:
from sklearn.cross_validation import cross_val_score
from sklearn.pipeline import Pipeline



### Tres polaridades (positiva-1, negativa-0, neutra-Nan)

Convertimos los valores de polaridad en números enteros (polarity_num).

In [9]:
corpus_frances['polarity_num'] = 0
corpus_frances.polarity_num[corpus_frances.polarity.isin(['1'])] = 1
corpus_frances.polarity_num[corpus_frances.polarity.isin(['Nan'])] = 2
corpus_frances.dtypes

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


content         object
polarity        object
polarity_num     int64
dtype: object

El corpus posee más tweets con polaridad neutra.

In [10]:
corpus_frances.polarity_num.value_counts(normalize=True)

2    0.559502
1    0.222322
0    0.218175
Name: polarity_num, dtype: float64

Es necesario descargarse el paquete nltk (si no lo hemos hecho ya una primera vez).

In [11]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

Hacemos en GridSearch para encontrar los parámetros óptimos de cada modelo (esto solo es necesario hacerlo una vez).

In [27]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

vectorizer = CountVectorizer(
                analyzer = 'word',
                tokenizer = tokenize,
                lowercase = True,
                stop_words = french_stopwords)

pipeline = Pipeline([
    ('vect', vectorizer),
    ('cls', LinearSVC()),
])



parameters = {
    'vect__max_df': (0.5, 1.9),
    'vect__min_df': (10, 20,50),
    'vect__max_features': (500, 1000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'cls__C': (0.2, 0.5, 0.7),
    'cls__loss': ('hinge', 'squared_hinge'),
    'cls__max_iter': (500, 1000)
}


grid_search_lsvc = GridSearchCV(pipeline, parameters, n_jobs=-1 , scoring='accuracy')
grid_search_lsvc.fit(corpus_frances.content, corpus_frances.polarity_num)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=['au', 'aux...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'vect__max_df': (0.5, 1.9), 'vect__min_df': (10, 20, 50), 'vect__max_features': (500, 1000), 'vect__ngram_range': ((1, 1), (1, 2)), 'cls__C': (0.2, 0.5, 0.7), 'cls__loss': ('hinge', 'squared_hinge'), 'cls__max_iter': (500, 1000)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='accuracy', verbose=0)

In [None]:
grid_search_lsvc.best_params_

In [38]:
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import GridSearchCV

vectorizer = CountVectorizer(
                analyzer = 'word',
                tokenizer = tokenize,
                lowercase = True,
                stop_words = french_stopwords)

pipeline = Pipeline([
    ('vect', vectorizer),
    ('cls', KNeighborsClassifier()),
])



parameters = {
    'vect__max_df': (0.5, 1.9),
    'vect__min_df': (10, 20,50),
    'vect__max_features': (500, 1000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'cls__n_neighbors': (20,50,100),
    'cls__weights': ('uniform', 'distance')
}


grid_search_knn = GridSearchCV(pipeline, parameters, n_jobs=-1 , scoring='accuracy')
grid_search_knn.fit(corpus_frances.content, corpus_frances.polarity_num)

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=Pipeline(steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=['au', 'aux...owski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'))]),
          fit_params={}, iid=True, n_iter=10, n_jobs=-1,
          param_distributions={'vect__max_df': array([ 0.5,  0.6,  0.7,  0.8,  0.9,  1. ,  1.1,  1.2,  1.3,  1.4,  1.5,
        1.6,  1.7,  1.8,  1.9,  2. ,  2.1,  2.2,  2.3,  2.4,  2.5,  2.6,
        2.7,  2.8,  2.9]), 'vect__min_df': array([10, 20, 30, 40, 50, 60, 70, 80, 90]), 'vect__max_features': array([...6, 41, 46, 51, 56, 61, 66, 71, 76, 81,
       86, 91, 96]), 'cls__weights': ('uniform', 'distance')},
          pre_dispatch='2*n_jobs', random_st

In [None]:
grid_search_knn.best_params_

In [None]:
from sklearn.naive_bayes import MultinomialNB

from sklearn.model_selection import GridSearchCV

vectorizer = CountVectorizer(
                analyzer = 'word',
                tokenizer = tokenize,
                lowercase = True,
                stop_words = french_stopwords)

pipeline = Pipeline([
    ('vect', vectorizer),
    ('cls', MultinomialNB()),
])



parameters = {
    'vect__max_df': (0.5, 1.9),
    'vect__min_df': (10, 20,50),
    'vect__max_features': (500, 1000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'cls__alpha': (0.2,0,5,1),
    'cls__fit_prior': ('True', 'False')
}


grid_search_mnb = GridSearchCV(pipeline, parameters, n_jobs=-1 , scoring='accuracy')
grid_search_mnb.fit(corpus_frances.content, corpus_frances.polarity_num)

In [19]:
grid_search_mnb.best_params_

{'cls__alpha': 0.28000000000000003,
 'cls__fit_prior': 'True',
 'vect__max_df': 0.5,
 'vect__max_features': 500,
 'vect__min_df': 10,
 'vect__ngram_range': (1, 1)}

**Accuracy**

Para conocer la eficacia de cada modelo, utilizamos los parámetros óptimos que hemos encontrado (es necesario cambiarlos).

In [42]:
model = LinearSVC(C=.2, loss='hinge',max_iter=1000,multi_class='ovr',
              random_state=None,
              penalty='l2',
              tol=0.0001
)

vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = french_stopwords,
    min_df = 10,
    max_df = 1.9,
    ngram_range=(1, 1),
    max_features=1000
)

corpus_data_features = vectorizer.fit_transform(corpus_frances.content)
corpus_data_features_nd = corpus_data_features.toarray()

In [21]:
scores = cross_val_score(
    model,
    corpus_data_features_nd[0:len(corpus_frances)],
    y=corpus_frances.polarity_num,
    scoring='accuracy',
    cv=5
    )

scores.mean()

0.62430430430430428

In [40]:
model = KNeighborsClassifier(n_neighbors=81)

vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = french_stopwords,
    min_df = 20,
    max_df = 1.9999999999999996,
    ngram_range=(1, 1),
    max_features=1100
)

corpus_data_features = vectorizer.fit_transform(corpus_frances.content)
corpus_data_features_nd = corpus_data_features.toarray()

In [21]:
scores = cross_val_score(
    model,
    corpus_data_features_nd[0:len(corpus_frances)],
    y=corpus_frances.polarity_num,
    scoring='accuracy',
    cv=5
    )

scores.mean()

0.62430430430430428

In [20]:
model = MultinomialNB(alpha=0.28, fit_prior="True")

vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = french_stopwords,
    min_df = 10,
    max_df = 0.5,
    ngram_range=(1, 1),
    max_features=500
)

corpus_data_features = vectorizer.fit_transform(corpus_frances.content)
corpus_data_features_nd = corpus_data_features.toarray()

In [21]:
scores = cross_val_score(
    model,
    corpus_data_features_nd[0:len(corpus_frances)],
    y=corpus_frances.polarity_num,
    scoring='accuracy',
    cv=5
    )

scores.mean()

0.62430430430430428

### Dos polaridades (positiva-1, negativa-0)

Convertimos los valores de polaridad en números enteros (polarity_num).

In [12]:
corpus_frances_sinNan['polarity_num'] = 0
corpus_frances_sinNan.polarity_num[corpus_frances_sinNan.polarity.isin(['1'])] = 1
corpus_frances.dtypes

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  exec(code_obj, self.user_global_ns, self.user_ns)


content         object
polarity        object
polarity_num     int64
dtype: object

In [13]:
corpus_frances_sinNan.polarity_num.value_counts(normalize=True)

1    0.504707
0    0.495293
Name: polarity_num, dtype: float64

Hacemos en GridSearch para encontrar los parámetros óptimos de cada modelo (esto solo es necesario hacerlo una vez).

In [27]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

vectorizer = CountVectorizer(
                analyzer = 'word',
                tokenizer = tokenize,
                lowercase = True,
                stop_words = french_stopwords)

pipeline = Pipeline([
    ('vect', vectorizer),
    ('cls', LinearSVC()),
])



parameters = {
    'vect__max_df': (0.5, 1.9),
    'vect__min_df': (10, 20,50),
    'vect__max_features': (500, 1000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'cls__C': (0.2, 0.5, 0.7),
    'cls__loss': ('hinge', 'squared_hinge'),
    'cls__max_iter': (500, 1000)
}


grid_search_lsvc = GridSearchCV(pipeline, parameters, n_jobs=-1 , scoring='roc_auc')
grid_search_lsvc.fit(corpus_frances_sinNan.content, corpus_frances_sinNan.polarity_num)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=['au', 'aux...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))]),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'vect__max_df': (0.5, 1.9), 'vect__min_df': (10, 20, 50), 'vect__max_features': (500, 1000), 'vect__ngram_range': ((1, 1), (1, 2)), 'cls__C': (0.2, 0.5, 0.7), 'cls__loss': ('hinge', 'squared_hinge'), 'cls__max_iter': (500, 1000)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='accuracy', verbose=0)

In [None]:
grid_search_lsvc.best_params_

In [38]:
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import GridSearchCV

vectorizer = CountVectorizer(
                analyzer = 'word',
                tokenizer = tokenize,
                lowercase = True,
                stop_words = french_stopwords)

pipeline = Pipeline([
    ('vect', vectorizer),
    ('cls', KNeighborsClassifier()),
])



parameters = {
    'vect__max_df': (0.5, 1.9),
    'vect__min_df': (10, 20,50),
    'vect__max_features': (500, 1000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'cls__n_neighbors': (20,50,100),
    'cls__weights': ('uniform', 'distance')
}


grid_search_knn = GridSearchCV(pipeline, parameters, n_jobs=-1 , scoring='roc_auc')
grid_search_knn.fit(corpus_frances_sinNan.content, corpus_frances_sinNan.polarity_num)

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=Pipeline(steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None,
        stop_words=['au', 'aux...owski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'))]),
          fit_params={}, iid=True, n_iter=10, n_jobs=-1,
          param_distributions={'vect__max_df': array([ 0.5,  0.6,  0.7,  0.8,  0.9,  1. ,  1.1,  1.2,  1.3,  1.4,  1.5,
        1.6,  1.7,  1.8,  1.9,  2. ,  2.1,  2.2,  2.3,  2.4,  2.5,  2.6,
        2.7,  2.8,  2.9]), 'vect__min_df': array([10, 20, 30, 40, 50, 60, 70, 80, 90]), 'vect__max_features': array([...6, 41, 46, 51, 56, 61, 66, 71, 76, 81,
       86, 91, 96]), 'cls__weights': ('uniform', 'distance')},
          pre_dispatch='2*n_jobs', random_st

In [None]:
grid_search_knn.best_params_

In [None]:
from sklearn.naive_bayes import MultinomialNB

from sklearn.model_selection import GridSearchCV

vectorizer = CountVectorizer(
                analyzer = 'word',
                tokenizer = tokenize,
                lowercase = True,
                stop_words = french_stopwords)

pipeline = Pipeline([
    ('vect', vectorizer),
    ('cls', MultinomialNB()),
])



parameters = {
    'vect__max_df': (0.5, 1.9),
    'vect__min_df': (10, 20,50),
    'vect__max_features': (500, 1000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'cls__alpha': (0.2,0,5,1),
    'cls__fit_prior': ('True', 'False')
}


grid_search_mnb = GridSearchCV(pipeline, parameters, n_jobs=-1 , scoring='roc_auc')
grid_search_mnb.fit(corpus_frances_sinNan.content, corpus_frances_sinNan.polarity_num)

In [19]:
grid_search_mnb.best_params_

{'cls__alpha': 0.28000000000000003,
 'cls__fit_prior': 'True',
 'vect__max_df': 0.5,
 'vect__max_features': 500,
 'vect__min_df': 10,
 'vect__ngram_range': (1, 1)}

**Accuracy**

Para conocer la eficacia de cada modelo, utilizamos los parámetros óptimos que hemos encontrado (es necesario cambiarlos).

In [42]:
model = LinearSVC(C=.2, loss='hinge',max_iter=1000,multi_class='ovr',
              random_state=None,
              penalty='l2',
              tol=0.0001
)

vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = french_stopwords,
    min_df = 10,
    max_df = 1.9,
    ngram_range=(1, 1),
    max_features=1000
)

corpus_data_features = vectorizer.fit_transform(corpus_frances_sinNan.content)
corpus_data_features_nd = corpus_data_features.toarray()

In [21]:
scores = cross_val_score(
    model,
    corpus_data_features_nd[0:len(corpus_frances_sinNan)],
    y=corpus_frances_sinNan.polarity_num,
    scoring='roc_auc',
    cv=5
    )

scores.mean()

0.62430430430430428

In [40]:
model = KNeighborsClassifier(n_neighbors=81)

vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = french_stopwords,
    min_df = 20,
    max_df = 1.9999999999999996,
    ngram_range=(1, 1),
    max_features=1100
)

corpus_data_features = vectorizer.fit_transform(corpus_frances_sinNan.content)
corpus_data_features_nd = corpus_data_features.toarray()

In [21]:
scores = cross_val_score(
    model,
    corpus_data_features_nd[0:len(corpus_frances_sinNan)],
    y=corpus_frances_sinNan.polarity_num,
    scoring='roc_auc',
    cv=5
    )

scores.mean()

0.62430430430430428

In [20]:
model = MultinomialNB(alpha=0.28, fit_prior="True")

vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = french_stopwords,
    min_df = 10,
    max_df = 0.5,
    ngram_range=(1, 1),
    max_features=500
)

corpus_data_features = vectorizer.fit_transform(corpus_frances_sinNan.content)
corpus_data_features_nd = corpus_data_features.toarray()

In [21]:
scores = cross_val_score(
    model,
    corpus_data_features_nd[0:len(corpus_frances_sinNan)],
    y=corpus_frances_sinNan.polarity_num,
    scoring='roc_auc',
    cv=5
    )

scores.mean()

0.62430430430430428

## Predicción de polaridad

** Utilizamos el modelo entrenado para el análisis de sentimientos en los tweets descargados **

Cargamos uno de los archivos csv con los tweets de una de las regiones de Francia (es necesario hacerlo con todos los csv).

In [3]:
tweets = pd.read_csv('Ile-de-France.csv', encoding='utf-8')
tweets.head()

Unnamed: 0,time,text,user,rts,place,lon,lat
0,2017-05-06 16:02:55,"RT @Freezze: ""Excusez moi mais ... Mais .. Pourrait on évoquer le ... S'il vous plait ? Est ce que ... Oh et puis démerdez vous tiens"" #20…",Ju',7435,,,
1,2017-05-06 16:02:38,RT @Freezze: RT si t'as rien compris. #2017LeDebat https://t.co/8mT06MGiZA,Marc barbier,10654,,,
2,2017-05-06 16:02:38,RT @ErenJaeger95: Normalement #2017LeDébat aurait dû ce passé comme ça 😭😭😭😭 https://t.co/XSQEn936G7,Dany,1568,,,
3,2017-05-06 16:01:43,RT @EmmanuelMacron: Je veux présider le pays. #2017LeDébat,APPELEZ MOI ZA2👸🏻,1763,,,
4,2017-05-06 15:59:41,RT @deleteitugly: Meilleur moment du débat #2017LeDebat https://t.co/8N9Dmgl2mH,Clem's,15120,,,


** Detección del lenguaje **

Nos aseguramos que todos los tweets están escritos en francés.

In [11]:
import langid
from langdetect import detect
import textblob

def langid_safe(tweet):
    try:
        return langid.classify(tweet)[0]
    except Exception as e:
        pass
        
def langdetect_safe(tweet):
    try:
        return detect(tweet)
    except Exception as e:
        pass

def textblob_safe(tweet):
    try:
        return textblob.TextBlob(tweet).detect_language()
    except Exception as e:
        pass   

In [12]:
#this will take a loong time.
tweets['lang_langid'] = tweets.text.apply(langid_safe)
tweets['lang_langdetect'] = tweets.text.apply(langdetect_safe)
tweets['lang_textblob'] = tweets.text.apply(textblob_safe)

In [13]:
tweets

Unnamed: 0,time,text,user,rts,place,lon,lat,lang_langid,lang_langdetect,lang_textblob
0,2017-05-06 16:02:55,"RT @Freezze: ""Excusez moi mais ... Mais .. Pourrait on évoquer le ... S'il vous plait ? Est ce que ... Oh et puis démerdez vous tiens"" #20…",Ju',7435,,,,fr,fr,fr
1,2017-05-06 16:02:38,RT @Freezze: RT si t'as rien compris. #2017LeDebat https://t.co/8mT06MGiZA,Marc barbier,10654,,,,it,en,fr
2,2017-05-06 16:02:38,RT @ErenJaeger95: Normalement #2017LeDébat aurait dû ce passé comme ça 😭😭😭😭 https://t.co/XSQEn936G7,Dany,1568,,,,fr,fr,fr
3,2017-05-06 16:01:43,RT @EmmanuelMacron: Je veux présider le pays. #2017LeDébat,APPELEZ MOI ZA2👸🏻,1763,,,,fr,fr,fr
4,2017-05-06 15:59:41,RT @deleteitugly: Meilleur moment du débat #2017LeDebat https://t.co/8N9Dmgl2mH,Clem's,15120,,,,fr,fr,fr
5,2017-05-06 15:57:39,"RT @TeamMacron2017: Louis Aliot, du FN, avec Camel Bechikh représentant de l'UOIF. Marine Le Pen devrait balayer devant sa porte #2017LeDeb…",Dominique Baiguini,2113,,,,fr,fr,fr
6,2017-05-06 15:57:35,RT @EmmanuelMacron: #2017LeDébat en 5 minutes ! https://t.co/xAeJKpjDKu,🐑,3171,,,,fr,fr,fr
7,2017-05-06 15:57:31,"RT @EmmanuelMacron: Madame Le Pen, la France mérite mieux que vous. #2017LeDébat",twenty2,16683,,,,fr,fr,fr
8,2017-05-06 15:57:05,RT @deleteitugly: Meilleur moment du débat #2017LeDebat https://t.co/8N9Dmgl2mH,Emilie Arwidson,15120,,,,fr,fr,fr
9,2017-05-06 15:56:58,RT @Freezze: C'est bon elle a vrillé complet #2017LeDebat https://t.co/ldRd72wX7d,princesse_loulou13,12121,,,,fr,fr,fr


In [24]:
tweets = tweets.query(''' lang_langdetect == 'fr' or lang_langid == 'fr' or lang_textblob == 'fr'  ''')
tweets.shape

(956, 10)

** Predicción con los parámetros óptimos y el modelo entrenado **

Es necesario meter los parámetros óptimos que hemos encontrado en el apartado anterior, tanto para los de tres polaridades como los binarios.

In [25]:
pipeline = Pipeline([
    ('vect', CountVectorizer(
            analyzer = 'word',
            tokenizer = tokenize,
            lowercase = True,
            stop_words = french_stopwords,
            min_df = 10,
            max_df = 0.5,
            ngram_range=(1, 2),
            max_features=500
            )),
    ('cls', LinearSVC(C=.2, loss='hinge',max_iter=1000,multi_class='ovr',
             random_state=None,
             penalty='l2',
             tol=0.0001
             )),
])

In [26]:
pipeline.fit(corpus_frances.content, corpus_frances.polarity_num)
tweets['lsvc'] = pipeline.predict(tweets.text)

In [25]:
pipeline = Pipeline([
    ('vect', CountVectorizer(
            analyzer = 'word',
            tokenizer = tokenize,
            lowercase = True,
            stop_words = french_stopwords,
            min_df = 10,
            max_df = 0.5,
            ngram_range=(1, 2),
            max_features=500
            )),
    ('cls', KNeighborsClassifier(n_neighbors=81)),
])

In [26]:
pipeline.fit(corpus_frances.content, corpus_frances.polarity_num)
tweets['knn'] = pipeline.predict(tweets.text)

In [25]:
pipeline = Pipeline([
    ('vect', CountVectorizer(
            analyzer = 'word',
            tokenizer = tokenize,
            lowercase = True,
            stop_words = french_stopwords,
            min_df = 10,
            max_df = 0.5,
            ngram_range=(1, 2),
            max_features=500
            )),
    ('cls', MultinomialNB(alpha=0.28, fit_prior="True")),
])

In [26]:
pipeline.fit(corpus_frances.content, corpus_frances.polarity_num)
tweets['mnb'] = pipeline.predict(tweets.text)

In [25]:
pipeline = Pipeline([
    ('vect', CountVectorizer(
            analyzer = 'word',
            tokenizer = tokenize,
            lowercase = True,
            stop_words = french_stopwords,
            min_df = 10,
            max_df = 0.5,
            ngram_range=(1, 2),
            max_features=500
            )),
    ('cls', LinearSVC(C=.2, loss='hinge',max_iter=1000,multi_class='ovr',
             random_state=None,
             penalty='l2',
             tol=0.0001
             )),
])

In [26]:
pipeline.fit(corpus_frances_sinNan.content, corpus_frances_sinNan.polarity_num)
tweets['lsvc_bin'] = pipeline.predict(tweets.text)

In [25]:
pipeline = Pipeline([
    ('vect', CountVectorizer(
            analyzer = 'word',
            tokenizer = tokenize,
            lowercase = True,
            stop_words = french_stopwords,
            min_df = 10,
            max_df = 0.5,
            ngram_range=(1, 2),
            max_features=500
            )),
    ('cls', KNeighborsClassifier(n_neighbors=81)),
])

In [26]:
pipeline.fit(corpus_frances_sinNan.content, corpus_frances_sinNan.polarity_num)
tweets['knn_bin'] = pipeline.predict(tweets.text)

In [25]:
pipeline = Pipeline([
    ('vect', CountVectorizer(
            analyzer = 'word',
            tokenizer = tokenize,
            lowercase = True,
            stop_words = french_stopwords,
            min_df = 10,
            max_df = 0.5,
            ngram_range=(1, 2),
            max_features=500
            )),
    ('cls', MultinomialNB(alpha=0.28, fit_prior="True")),
])

In [26]:
pipeline.fit(corpus_frances_sinNan.content, corpus_frances_sinNan.polarity_num)
tweets['mnb_bin'] = pipeline.predict(tweets.text)

In [27]:
tweets[['text', 'lsvc', 'knn', 'mnb', 'lsvc_bin', 'knn_bin', 'mnb_bin']].sample(20)

Unnamed: 0,text,polarity
341,"RT @fligoupier: Tu me manques, petit ange parti trop tôt... 😢 #2017LeDebat https://t.co/UTUFL693MB",0
907,RT @Sylvqin: T'as pas besoin de parler de ton programme si t'insultes l'autre candidat pendant 3h #2017LeDébat https://t.co/ocJ33xJMTd,0
882,RT @deleteitugly: Meilleur moment du débat #2017LeDebat https://t.co/8N9Dmgl2mH,0
595,"RT @EmmanuelMacron: Madame Le Pen, la France mérite mieux que vous. #2017LeDébat",0
860,RT @mkfrison: Ça marche avec toutes les chansons !!! #2017LeDébat https://t.co/wVNpaVwjxu,0
568,RT @deleteitugly: Meilleur moment du débat #2017LeDebat https://t.co/8N9Dmgl2mH,0
872,"RT @gmaujean: Ce soir, les fact-checkers en burn-out avec MLP #2017LeDébat https://t.co/amSUCwwPtl",1
616,"RT @Neacko83: En 2002, Chirac disait : ""on ne débat pas avec l'extreme droite"".\n15 ans après, en voyant #LePen, on comprend mieux pourquoi.…",0
458,"RT @TheClownOfParis: ""non mais a un moment donné ... LA FEMME A BOOBA !"" #2017LeDebat https://t.co/8K8lYgs4SQ",0
491,RT @NasNacera: Le FN promeut le « Made in France » mais fabrique ses tee-shirts en Asie. Patriote tu dis ? #2017LeDebat https://t.co/SBd28…,0


Guardamos los tweets con su polaridad y coordenadas para situarlos en el mapa.

In [28]:
tweets[['text', 'lat', 'lon', 'lsvc', 'knn', 'mnb', 'lsvc_bin', 'knn_bin', 'mnb_bin']].to_csv('Ile-de-France_polarity_latlon.csv', encoding='utf-8')