## Le Grand Débat

## Import libraries

In [None]:
!pip install git+https://github.com/ClaudeCoulombe/FrenchLefffLemmatizer.git

In [None]:
# basics
import pandas as pd
import numpy as np
import datetime
import os

# string
import string
!pip install unidecode
import unidecode
import re
from textwrap import wrap # wrapping long text into lines

# plot
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from wordcloud import WordCloud
from mpl_toolkits.axes_grid1 import make_axes_locatable
# %matplotlib inline

# text mining
import nltk
from nltk.tokenize import RegexpTokenizer


# Because we have some long strings to deal with:
pd.options.display.max_colwidth = 300



replacement_patterns = [
    (r'won\'t', 'will not'),
    (r'can\'t', 'cannot'),
    (r'i\'m', 'i am'),
    (r'ain\'t', 'is not'),
    # Raw strings for the replacements too: '\g' is an invalid escape in a normal string
    (r'(\w+)\'ll', r'\g<1> will'),
    (r'(\w+)n\'t', r'\g<1> not'),
    (r'(\w+)\'ve', r'\g<1> have'),
    (r'(\w+)\'s', r'\g<1> is'),
    (r'(\w+)\'re', r'\g<1> are'),
    (r'(\w+)\'d', r'\g<1> would'),
]

class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns): 
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]
    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            s = re.sub(pattern, repl, s) 
        return s

replacer=RegexpReplacer()

from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()

from french_lefff_lemmatizer.french_lefff_lemmatizer import FrenchLefffLemmatizer
lemmatizer = FrenchLefffLemmatizer()
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stops=set(stopwords.words('french'))


In [None]:
os.listdir('../input/granddebat')

## Import dataset

In [None]:
themes = {
    'LA_FISCALITE_ET_LES_DEPENSES_PUBLIQUES.csv':'La fiscalité et les dépenses publiques',
    'ORGANISATION_DE_LETAT_ET_DES_SERVICES_PUBLICS.csv':"Organisation de l'état et des services publics",
    'DEMOCRATIE_ET_CITOYENNETE.csv':'Démocratie et citoyenneté',
    'LA_TRANSITION_ECOLOGIQUE.csv':'La transition écologique'
}

filenames = list(themes.keys())
themes = list(themes.values())

In [None]:
filepaths = [os.path.join("..", "input", "granddebat", filename) for filename in filenames]
col_date = ['createdAt', 'publishedAt', 'updatedAt']
df_list = [pd.read_csv(filepath, low_memory=False,
                       dtype={'authorZipCode':'str'},
                       parse_dates=col_date) for filepath in filepaths]

We can now import each file, all in one list of dataframes for easier use.

We pay special attention to data types: ZipCode must be read as strings and date columns as timestamps.
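
To see why the `dtype` matters, here is a minimal sketch on a made-up two-row CSV (the sample values are illustrative, not taken from the dataset). Without `dtype`, a zip code such as 01000 is parsed as an integer and loses its leading zero, and the date column stays a plain string.

```python
import io
import pandas as pd

# Hypothetical sample, for illustration only
sample = ("authorZipCode,createdAt\n"
          "01000,2019-01-22 09:38:41\n"
          "75011,2019-01-23 18:02:10\n")

# Default parsing: zip codes become integers
raw = pd.read_csv(io.StringIO(sample))

# Same options as in the cell above: zip codes as strings, dates as timestamps
typed = pd.read_csv(io.StringIO(sample),
                    dtype={'authorZipCode': 'str'},
                    parse_dates=['createdAt'])

raw_zip = raw['authorZipCode'].tolist()
typed_zip = typed['authorZipCode'].tolist()
hours = typed['createdAt'].dt.hour.tolist()
print(raw_zip, typed_zip, hours)
```

Once the column is a timestamp, the `.dt` accessor used later for daily and hourly counts becomes available.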

## Discovering the dataset

The 4 dataframes share some common variables; the other columns are questions that are specific to each theme. The common variables are the following:


In [None]:
col_common = set.intersection(*[set(df.columns) for df in df_list])
col_common

Let's have a look at the percentage of missing values in each common column:


In [None]:
pd.concat([df[df.columns.intersection(col_common)] for df in df_list]).isnull().mean() * 100


Each line of the dataframes corresponds to one contribution: the answers of an author to the questions of the corresponding theme. Let's see how many contributions we have for each dataset, and how many questions:


In [None]:
df_infos = pd.DataFrame({
     'theme': themes,
     'nb_contributions': [df.shape[0] for df in df_list],
     'nb_questions': [sum(~df.columns.isin(col_common)) for df in df_list]
    })
df_infos

##### When were the contributions submitted?

We will have a look at the createdAt variable to spot when the contributions were submitted, and at what time of the day.


## Daily contributions


In [None]:
day_contrib = pd.concat([df.createdAt for df in df_list]).dt.date.value_counts().sort_index()

fig, ax = plt.subplots(figsize = (18,6))
day_contrib.plot()
ax.set_title('Daily contributions')
ax.set_xlabel('Date')
fig.autofmt_xdate()
ax.set_ylim(bottom=0)
plt.show()  # plt.show() does not take a figure argument

We can see a first peak at the very beginning of the Grand Débat.

Let's also look at the time of day the contributions were made:

In [None]:
hour_contrib = pd.concat([df.createdAt for df in df_list]).dt.hour.value_counts().sort_index()

fig, ax = plt.subplots(figsize = (18,6))
hour_contrib.plot()
ax.set_title('Hourly contributions')
ax.set_xlabel('Hour')
ax.set_ylim(bottom=0)
plt.show()  # plt.show() does not take a figure argument

The number of contributions per hour peaks between 18:00 and 19:00.

### Who are the contributors?

In this section we will have a closer look at the authors of the Grand Débat. Each contribution has an `authorId` that is shared among datasets. First, the maximum number of contributions a single author made to each theme:


In [None]:
pd.DataFrame({'theme':themes,
              'max_contrib_per_author':[df.groupby('authorId').size().max() for df in df_list]})

Since we focus on contributors, we aggregate the table by authorId in order to have one line per author. If an author has several authorType or authorZipCode, we keep the most frequent one: the mode.

We also add a count statistic: how many contributions that author made over the whole dataset.
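
On a toy table the effect of this aggregation is easy to check. The sketch below uses made-up author ids and zip codes, and a helper named `most_frequent` that is only an illustration of the idea: count the rows per author and keep the most frequent zip code, falling back to NaN when an author never gave one.

```python
import numpy as np
import pandas as pd

def most_frequent(x):
    # Series.mode ignores missing values; return NaN when nothing is left
    m = x.mode()
    return m.iloc[0] if not m.empty else np.nan

# Hypothetical contributions: author 'a' wrote three, author 'b' one
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'authorId': ['a', 'a', 'a', 'b'],
    'authorZipCode': ['75011', '75011', '69001', None],
})

toy_authors = toy.groupby('authorId').agg({'id': 'count',
                                           'authorZipCode': most_frequent})
print(toy_authors)
```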

In [None]:
def mode_na(x): 
    m = pd.Series.mode(x)
    return m.values[0] if not m.empty else np.nan

authors = pd.concat([df[df.columns.intersection(col_common)] for df in df_list])
# With pandas>=0.24, we would use: pandas.Series.mode
authors = authors.groupby('authorId').agg({'id':'count', # number of contributions
                                           'authorType':mode_na,
                                           'authorZipCode':mode_na})

The first statistic we can get out of this new dataframe is the number of distinct contributors:

In [None]:
authors.shape[0]


There are more than 150,000 distinct contributors.

In [None]:
n_contrib = authors.id.value_counts().reset_index(name='counts')
n_contrib.loc[n_contrib['index'] > 4, 'index'] = '>4'
n_contrib = n_contrib.groupby('index').agg(sum)
fig, ax = plt.subplots(figsize=(18,6))
ax = sns.barplot(x='index',
            y='counts',
            data=n_contrib.reset_index(),
            palette=sns.color_palette('Blues'))
ax.set_xlabel('Number of contributions')
ax.set_title('Authors per number of contributions')
plt.show()

As can be seen, around 50% of the authors submitted a single contribution.

In [None]:
fig, ax = plt.subplots(figsize=(18,6))
ax = sns.countplot(x='authorType',
                   data=authors,
                   palette=sns.color_palette('Blues'))
ax.set_yscale('log')
ax.set_title('Author types')
plt.show()


We notice that the great majority of respondents are citizens, i.e. they are neither politicians, officials nor part of an organisation. Note that the y-axis is on a log scale.

After this very brief dataset analysis, it is time to focus on the variables of interest: the questions. Each dataframe contains several questions, but we will try to treat them all at once.

The column names for the questions are a bit messy, so we rename them for clarity. We build a dataframe containing information about each question: old and new name, title, and the theme and dataframe they are linked to.


In [None]:
questions = pd.concat([pd.DataFrame({'old_name':df_list[i].columns,
                                     'df_id':i,
                                     'theme':themes[i]}) for i in range(len(df_list))])
# Use ~ (not unary minus) to negate a boolean mask
questions = questions[~questions["old_name"].isin(col_common)].reset_index(drop=True)
questions = questions.assign(new_name=(pd.Series(
    ['Q{}'.format(i) for i in range(1, questions.shape[0] + 1)])))
questions = questions.assign(question=pd.Series(
    [name.split(' - ')[1] for name in questions.old_name]))

In [None]:
# Questions rename
dict_rename = {old:new for old, new in zip(questions.old_name,questions.new_name)}
for df in df_list:
    df.rename(columns=dict_rename,inplace=True)
    
questions.head()


In [None]:
questions.shape


We can see that the concatenated dataset contains 94 questions.

We save this table to a CSV file to get a better overview of all the questions.

In [None]:
# to_csv returns None when a path is given, so there is nothing to assign
questions.to_csv("questions.csv", index=False)


For each question, we compute the following statistics:

- nbrow: number of rows (i.e. number of contributions for the corresponding theme)
- nbnnull: number of answers that are not null (answer is null if the contributor skipped that question)
- nbunique: number of distinct answers
- nnull_rate: nbnnull/nbrow * 100
- unique_rate: nbunique/nbnnull * 100
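
As a quick sanity check of these definitions, here is a minimal sketch on a made-up column of answers (the values are hypothetical; `None` marks a skipped question):

```python
import pandas as pd

# Hypothetical answers to one question
answers = pd.Series(['Oui', 'Non', None, 'Oui', None, 'Oui', 'Non', None])

nbrow = len(answers)               # all contributions for the theme
nbnnull = answers.notnull().sum()  # answers that were actually given
nbunique = answers.nunique()       # distinct answers, missing values excluded

nnull_rate = nbnnull / nbrow * 100
unique_rate = nbunique / nbnnull * 100
print(nbrow, nbnnull, nbunique, nnull_rate, unique_rate)
```

A low `unique_rate` combined with a very small `nbunique` is what later flags a question as closed-ended.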

In [None]:
questions['nbrow'] = questions.apply(lambda g: df_list[g.df_id].shape[0], axis=1)
questions['nbnnull'] = questions.apply(lambda g: df_list[g.df_id].loc[:,g.new_name]\
                                       .notnull().sum(), axis=1)
questions['nbunique'] = questions.apply(lambda g: df_list[g.df_id].loc[:,g.new_name]\
                                        .nunique(), axis=1)

questions['nnull_rate'] = questions.nbnnull/questions.nbrow * 100
questions['unique_rate'] = questions.nbunique/questions.nbnnull * 100

We can notice that some questions have very few distinct answers:

In [None]:
questions['closed'] = questions['nbunique'] <= 3
sum(questions.closed)

Those 19 questions are closed-ended questions: the answer is forced into a few choices, mainly Yes or No.

We can now aggregate at the theme scale:

In [None]:
questions.groupby(['theme']).agg({'question':'count', 'closed':'sum',
                                  'nbrow':'mean', 'nnull_rate':'mean'})

We see that there are a lot of null values, and we want to understand why. Let's see which questions have the most null values, i.e. the lowest `nnull_rate`:

In [None]:
questions.sort_values('nnull_rate').head(10)


We can see that all of those questions start with "Si" ("If"). They are conditional: an answer is not necessarily expected, which explains the high share of null values.

Looking closer, the unique_rate of these questions is also very low. This is because a lot of contributors answered "non concerné" ("not applicable"), for instance with question Q40:

In [None]:
df_list[1].Q40.value_counts().head(30)


Some other questions have a low unique_rate because they are guided questions: choices were given, but the respondent could decide to answer something else. This is the case, for instance, for questions Q91, Q79 and Q4:

In [None]:
df_list[3].Q91.value_counts().head(10)


### Closed questions analysis


For each of the 19 closed questions, we plot the count of each answer in order to identify the most popular opinions.

We use the seaborn library for plotting.

In [None]:
def add_frequencies(ax, ncount):
    for p in ax.patches:
        x=p.get_bbox().get_points()[:,0]
        y=p.get_bbox().get_points()[1,1]
        ax.annotate('{:.1f} %'.format(100.*y/ncount), (x.mean(), y), 
                ha='center', va='bottom', size='small', color='black', weight='bold')

In [None]:
# Countplot of questions_df
def countplot_qdf(questions_df, suptitle):
    n = questions_df.shape[0]
    
    # If there is nothing to plot, we stop here
    if n==0:
        return
    
    # Numbers of rows and cols in the subplots
    ncols = 3
    nrows = (n + ncols - 1) // ncols  # ceiling division: no extra empty row
    fig,ax = plt.subplots(nrows, ncols, figsize=(25,6*nrows))
    fig.tight_layout(pad=9, w_pad=10, h_pad=7)
    fig.suptitle(suptitle, size=30, fontweight='bold')
    
    # Hide exceeding subplots
    for i in range(n, ncols*nrows):
        ax.flatten()[i].axis('off')
        
    # Countplot for each question
    for index, row in questions_df.iterrows():
        plt.sca(ax.flatten()[index])
        # We add the sort_values argument to always have the same order: Oui, Non...
        xlabels = df_list[row.df_id].loc[:,row.new_name]
        xlabels = xlabels.value_counts().index.sort_values(ascending=False)
        axi = sns.countplot(x=row.new_name,
                           data=df_list[row.df_id],
                           order = xlabels)
        # Wrap long questions into lines
        axi.set_title("\n".join(wrap(row.new_name + '. ' + row.question, 60)))
        axi.set_xlabel('')
        # We also set a wrap here (for one very long answer...)
        axi.set_xticklabels(["\n".join(wrap(s, 17)) for s in xlabels])
        axi.set_ylabel('Nombre de réponses')
        add_frequencies(axi, row.nbnnull)

In [None]:
# Plotting questions, grouped by theme
for i in range(len(themes)):
    countplot_qdf(questions[(questions.closed) & (questions.df_id == i)].reset_index(), themes[i])


On the themes of State organisation, democracy and citizenship, contributors who are asked for their opinion always side with change.

It's very interesting to see all these opinions.

### Open questions analysis

Most of the information in the dataset lies in the open questions, but they are the most difficult to analyze!

We can start with a basic statistic: the number of words contained in the whole dataset.

In [None]:
# Count words in a string, a word being here any sequence of characters between white spaces
def count_words(s):
    # Missing answers are NaN (a float), not strings: count them as 0 words
    if not isinstance(s, str):
        return 0
    return len(s.split())

In [None]:
# For each dataframe:
# filter on questions and title
# count words for each contribution of each question
# sum it all
n_words = [df.filter(regex=r'title|^Q', axis=1).apply(np.vectorize(count_words)).sum().sum()\
           for df in df_list]
sum(n_words)


The contributions contain 95 million words!

Let's focus on the 75 open questions. 
We first remove all stop words: the most common words, which give no insight and are usually filtered out in natural language processing.

In [None]:
# NLTK French stop words, plus punctuation and some missing stop words
# taken from this website: https://www.ranks.nl/stopwords/french
extra_stop_words = [
    "oui", "non", "plus", "toute", "toutes", "faut", "à", "tous", "tandis", "quels",
    "alors", "au", "aucuns", "aussi", "autre", "avant", "avec", "avoir", "juste", "la",
    "tout", "très", "trop", "www", "http", "html", "peu", "en", "etc", "chaque", "sans",
    "ne", "ils", "il", "que", "quand", "quoi", "qui", "plupart", "doit", "donc", "dos",
    "elle", "elles", "comme", "comment", "ci", "ni", "même", "mais", "mes", "an", "je",
    "ça", "où", "org", "moi",
]
stop_words = (list(stops) + list(string.punctuation) +
              ["’", "...", "'", "", ">>", "<<"] + extra_stop_words)
# Keep both the original and the lowercased ASCII-folded form of each entry,
# so the set matches tokens whether or not their accents were stripped
stop_words = set(stop_words) | {unidecode.unidecode(w.lower()) for w in stop_words}

The next important step is tokenization, i.e. splitting text into words. This can be tricky because punctuation rules differ slightly from one language to another. There are some important features we have to take into consideration: punctuation, case, encoding and stop words.
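
To make these four concerns concrete, here is a minimal standard-library sketch on one made-up sentence. It is a simplified stand-in, not the tokenizer used in this notebook: the pattern, the `fold` helper (which replaces `unidecode` with Unicode decomposition) and the tiny stop word set are all assumptions for illustration.

```python
import re
import unicodedata

# Elided forms first (l', d', j'...), then plain words; punctuation is skipped
PATTERN = re.compile(r"[cdjlmnstCDJLMNST]['’]|\w+")

def fold(word):
    # Lowercase, then strip accents via Unicode decomposition
    decomposed = unicodedata.normalize('NFKD', word.lower())
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

def toy_tokens(text, stop_words):
    # Tokenize, fold case and accents, drop the apostrophe of elided forms
    tokens = [fold(t).strip("'’") for t in PATTERN.findall(text)]
    return [t for t in tokens if t and t not in stop_words]

# Hypothetical, deliberately tiny stop word set
toy_stop_words = {'l', 'd', 'de', 'il', 'faut', 'et', 'les', 'la', 'des'}
result = toy_tokens("Il faut baisser l'impôt et réduire les dépenses de l'État !",
                    toy_stop_words)
print(result)
```

Note that the stop words must be folded the same way as the tokens, otherwise accented entries never match.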

In [None]:
# Get tokens from list of strings (can probably be optimised)
def get_tokens(s):
    # MosesTokenizer has been moved out of NLTK due to licensing issues
    # So we define a simple tokenizer based on regex, designed for French language
    pattern = r"[cdjlmnstCDJLMNST]['´`]|\w+|\$[\d\.]+|\S+"
    tokenizer = RegexpTokenizer(pattern)
    tokens = tokenizer.tokenize(" ".join(s.dropna()))
    # remove punctuation (for words like "j'")
    tokens = [w.translate(str.maketrans('', '', string.punctuation)) for w in tokens]
    # lowercase ASCII
    tokens = [unidecode.unidecode(w.lower()) for w in tokens]
    # remove stop words from tokens
    tokens = [w for w in tokens if w not in stop_words]
    return(tokens)

We will use the tokens to draw a word cloud. This is a visual representation of n-gram counts. The more frequent a term is, the bigger it will appear on the plot.
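
The counting behind a word cloud can be sketched with the standard library alone. The token list below is made up for illustration; a real run would use the output of `get_tokens`.

```python
from collections import Counter

# Hypothetical tokens, for illustration only
toy = ['impot', 'tva', 'impot', 'csg', 'impot', 'tva']

# Unigram counts: the font size of each word follows these frequencies
unigrams = Counter(toy)

# Bigram counts: pairs of consecutive tokens
bigrams = Counter(zip(toy, toy[1:]))

top_words = unigrams.most_common(2)
print(top_words)
print(bigrams[('tva', 'impot')])
```

The `wordcloud` library does this counting itself when given raw text through `generate`; precomputed counts like these can be passed to `WordCloud.generate_from_frequencies` instead.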

Let's plot a word cloud for each of the 4 themes to see which topics are raised most often in each of them.

In [None]:
def plot_wordcloud(s, title, mw = 500):
    wordcloud = WordCloud(width=1200, height=600, max_words=mw,
                          background_color="white").generate(" ".join(s))
    plt.figure(figsize=(20, 10))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(title, fontsize=50, pad=50)
    plt.show()

# Series.append was removed in pandas 2.0; pd.concat works in all versions
col_q = pd.concat([questions.new_name[~questions.closed], pd.Series(['title'])])
for i in range(len(themes)):
    col_q_i = df_list[i].columns.intersection(col_q)
    tokens = pd.concat([df_list[i][col].dropna() for col in col_q_i])
    tokens = get_tokens(tokens)
    plot_wordcloud(tokens, title = themes[i])

## Let's start with La fiscalité et les dépenses publiques

In [None]:
df_list[0] = df_list[0].astype(str) 

#contains all the answers of questions 1 to 8
reponse_question1 = df_list[0].Q1
reponse_question2 = df_list[0].Q2
reponse_question3 = df_list[0].Q3
reponse_question4 = df_list[0].Q4
reponse_question5 = df_list[0].Q5
reponse_question6 = df_list[0].Q6
reponse_question7 = df_list[0].Q7
reponse_question8 = df_list[0].Q8

### Let's have a look at the most frequent answers

In [None]:
reponse_question1.value_counts().head(10)


In [None]:
reponse_question2.value_counts().head(10)


In [None]:
reponse_question3.value_counts().head(10)


In [None]:
reponse_question4.value_counts().head(10)


# Preprocessing

In [None]:
nltk.download('punkt')


In [None]:
def preprocess_text(text):
    """Clean one answer: strip digits, split into sentences, drop stop words, lemmatize."""
    # text = text.lower()
    # Remove numbers and surrounding white space
    text = re.sub(r'\d+', '', text).strip()

    # Apply the regex replacements defined above
    text = replacer.replace(text)

    # Split into sentences with the French Punkt model
    sentence_tokenizer = nltk.data.load('tokenizers/punkt/french.pickle')
    sentences = sentence_tokenizer.tokenize(text)

    word_tokenizer = WordPunctTokenizer()
    cleaned = []
    for sentence in sentences:
        # Split the sentence into words, drop stop words, then lemmatize
        words = word_tokenizer.tokenize(sentence)
        words = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]
        cleaned.append(' '.join(words))

    # Return a one-element list holding the cleaned sentences joined by '. '
    return ['. '.join(cleaned)]

In [None]:
reponse_question1_cleaned = [preprocess_text(doc) for doc in reponse_question1]
reponse_question1 = [' '.join(r) for r in reponse_question1_cleaned]


reponse_question2_cleaned = [preprocess_text(doc) for doc in reponse_question2]
reponse_question2 = [' '.join(r) for r in reponse_question2_cleaned]


reponse_question3_cleaned = [preprocess_text(doc) for doc in reponse_question3]
reponse_question3 = [' '.join(r) for r in reponse_question3_cleaned]


reponse_question4_cleaned = [preprocess_text(doc) for doc in reponse_question4]
reponse_question4 = [' '.join(r) for r in reponse_question4_cleaned]

reponse_question5_cleaned = [preprocess_text(doc) for doc in reponse_question5]
reponse_question5 = [' '.join(r) for r in reponse_question5_cleaned]


reponse_question6_cleaned = [preprocess_text(doc) for doc in reponse_question6]
reponse_question6 = [' '.join(r) for r in reponse_question6_cleaned]


reponse_question7_cleaned = [preprocess_text(doc) for doc in reponse_question7]
reponse_question7 = [' '.join(r) for r in reponse_question7_cleaned]


reponse_question8_cleaned = [preprocess_text(doc) for doc in reponse_question8]
reponse_question8 = [' '.join(r) for r in reponse_question8_cleaned]

In [None]:
!pip install pyLDAvis

In [None]:
import pandas as pd
pd.options.mode.chained_assignment = None 
import numpy as np
import re
import nltk
from pprint import pprint

from gensim.models import word2vec

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

STOP_WORDS = nltk.corpus.stopwords.words()

from gensim.utils import simple_preprocess
from smart_open import smart_open
import pyLDAvis.gensim # To visualise LDA model effectively

import os
from collections import defaultdict # For accumulating values
from nltk.corpus import stopwords # To remove stopwords
from gensim import corpora # To create corpus and dictionary for the LDA model
from gensim.models import LdaModel # To use the LDA model

# Unsupervised analysis

# Let's start with La fiscalité et les dépenses publiques

In [None]:
reponse1 = pd.DataFrame(reponse_question1)
reponse1.columns = ['Question1_Quelles sont toutes les choses qui pourraient être faites pour améliorer information des citoyens sur utilisation des impôts ?']
reponse1 = reponse1[reponse1['Question1_Quelles sont toutes les choses qui pourraient être faites pour améliorer information des citoyens sur utilisation des impôts ?']!= 'nan']
reponse1.head()


In [None]:
# Create a gensim dictionary from the cleaned answers
dictionary= corpora.Dictionary(simple_preprocess(line, deacc=True) for line in reponse1['Question1_Quelles sont toutes les choses qui pourraient être faites pour améliorer information des citoyens sur utilisation des impôts ?'])

# Token to Id map
dictionary.token2id

In [None]:
# Tokenize the docs
tokenized_list = [simple_preprocess(doc) for doc in reponse1['Question1_Quelles sont toutes les choses qui pourraient être faites pour améliorer information des citoyens sur utilisation des impôts ?']]
mydict = corpora.Dictionary()
mycorpus = [mydict.doc2bow(doc, allow_update=True) for doc in tokenized_list]

In [None]:
NUM_TOPICS = 10 # This is an assumption. 
ldamodel = LdaModel(mycorpus, num_topics = NUM_TOPICS, id2word=mydict, passes=15)#This might take some time.

In [None]:
topics = ldamodel.show_topics()
for topic in topics:
    print(topic)


In [None]:
word_dict = {}
for i in range(NUM_TOPICS):
    words = ldamodel.show_topic(i, topn=15)
    # Keep only the word of each (word, weight) pair
    word_dict['Topic # {:02d}'.format(i + 1)] = [word for word, _ in words]
pd.DataFrame(word_dict)


In [None]:
lda_display = pyLDAvis.gensim.prepare(ldamodel, mycorpus, mydict, sort_topics=False)
pyLDAvis.display(lda_display)

## Interpretation

Before commenting on the table, a `value_counts()` already shows that one of the things that could be done to improve citizens' information on how taxes are used is transparency. What stands out is that media coverage or public websites dedicated to the use of taxes, clearly explained with concrete information, would be beneficial. Respondents call for debates and for sharing the information via TV, newspapers and broadcasts. Above all they ask for transparency, in simple formats. We can also see a form of complaint about the salaries and benefits of senior civil servants working in politics.

## Let's analyze question 2: Que faudrait-il faire pour rendre la fiscalité plus juste et plus efficace ? (What should be done to make taxation fairer and more efficient?)

In [None]:
reponse2 = pd.DataFrame(reponse_question2)
reponse2.columns = ['Question_2_Que faudrait-il faire pour rendre la fiscalité plus juste et plus efficace ?']
reponse2 = reponse2[reponse2['Question_2_Que faudrait-il faire pour rendre la fiscalité plus juste et plus efficace ?']!= 'nan']

reponse2.head()


In [None]:
# Create a gensim dictionary from the cleaned answers
dictionary= corpora.Dictionary(simple_preprocess(line, deacc=True) for line in reponse2['Question_2_Que faudrait-il faire pour rendre la fiscalité plus juste et plus efficace ?'])

# Token to Id map
dictionary.token2id

In [None]:
# Tokenize the docs
tokenized_list = [simple_preprocess(doc) for doc in reponse2['Question_2_Que faudrait-il faire pour rendre la fiscalité plus juste et plus efficace ?']]

# Create the Corpus
mydict = corpora.Dictionary()
mycorpus = [mydict.doc2bow(doc, allow_update=True) for doc in tokenized_list]

In [None]:
NUM_TOPICS = 10 # This is an assumption. 
ldamodel = LdaModel(mycorpus, num_topics = NUM_TOPICS, id2word=mydict, passes=15)#This might take some time.

In [None]:
topics = ldamodel.show_topics()
for topic in topics:
    print(topic)

In [None]:
word_dict = {}
for i in range(NUM_TOPICS):
    words = ldamodel.show_topic(i, topn=15)
    # Keep only the word of each (word, weight) pair
    word_dict['Topic # {:02d}'.format(i + 1)] = [word for word, _ in words]
pd.DataFrame(word_dict)

## Interpretation
According to respondents, fuel prices should be lowered and tax evasion tackled. The CSG should be abolished for all retirees, along with the taxe d'habitation (housing tax). They also want the ISF (wealth tax) reinstated. Tax loopholes (niches fiscales) should be removed and the benefits of senior civil servants reduced.

## Let's focus on question 3: Quels sont selon vous les impôts qu'il faut baisser en priorité ? (Which taxes do you think should be lowered first?)

In [None]:
reponse3 = pd.DataFrame(reponse_question3)
reponse3.columns = ['Question_3_Quels sont selon vous les impôts qui faut baisser en priorité ?']
reponse3 = reponse3[reponse3['Question_3_Quels sont selon vous les impôts qui faut baisser en priorité ?']!= 'nan']

In [None]:
# Create a gensim dictionary from the cleaned answers
dictionary= corpora.Dictionary(simple_preprocess(line, deacc=True) for line in reponse3['Question_3_Quels sont selon vous les impôts qui faut baisser en priorité ?'])

# Token to Id map
dictionary.token2id

In [None]:
# Tokenize the docs
tokenized_list = [simple_preprocess(doc) for doc in reponse3['Question_3_Quels sont selon vous les impôts qui faut baisser en priorité ?']]
mydict = corpora.Dictionary()
mycorpus = [mydict.doc2bow(doc, allow_update=True) for doc in tokenized_list]


In [None]:
NUM_TOPICS = 10 # This is an assumption. 
ldamodel = LdaModel(mycorpus, num_topics = NUM_TOPICS, id2word=mydict, passes=15)#This might take some time.

In [None]:
topics = ldamodel.show_topics()
for topic in topics:
    print(topic)


In [None]:
word_dict = {}
for i in range(NUM_TOPICS):
    words = ldamodel.show_topic(i, topn=15)
    # Keep only the word of each (word, weight) pair
    word_dict['Topic # {:02d}'.format(i + 1)] = [word for word, _ in words]
pd.DataFrame(word_dict)

## Interpretation

The taxes to lower as a priority are the CSG on retirees, VAT (TVA), the ISF and also income tax, as well as social charges for small and medium-sized businesses, the taxe d'habitation (housing tax) and the taxe foncière (property tax). It is very interesting to see what French people think through this analysis.

## Let's focus on question 6: Quels sont les domaines prioritaires où notre protection sociale doit être renforcée ? (What are the priority areas where our social protection should be strengthened?)

In [None]:
reponse6 = pd.DataFrame(reponse_question6)
reponse6.columns = ['Question6_Quels sont les domaines prioritaires où notre protection sociale doit être renforcée ?']
reponse6 = reponse6[reponse6['Question6_Quels sont les domaines prioritaires où notre protection sociale doit être renforcée ?']!= 'nan']
reponse6.head()

In [None]:
dictionary= corpora.Dictionary(simple_preprocess(line, deacc=True) for line in reponse6['Question6_Quels sont les domaines prioritaires où notre protection sociale doit être renforcée ?'])

# Token to Id map
dictionary.token2id

In [None]:
# Tokenize the docs
tokenized_list = [simple_preprocess(doc) for doc in reponse6['Question6_Quels sont les domaines prioritaires où notre protection sociale doit être renforcée ?']]
mydict = corpora.Dictionary()
mycorpus = [mydict.doc2bow(doc, allow_update=True) for doc in tokenized_list]

## Interpretation
The priority areas where social protection should be strengthened are without doubt health and education, which stand out strongly in this table. This covers everything medical (hospitals, doctors, etc.), as well as care and coverage, in particular reimbursement (mutual insurance, medication, etc.). Respondents also want support for elderly and disabled people, and allowances for families in difficulty. Training programmes should also be set up. Respondents insist on unemployment insurance.