# Grand Débat National - Exploratory Data Analysis
*** 
**Jérémy Lesuffleur - 06/04/2019**

## Context
Following the [Yellow vests movement](https://en.wikipedia.org/wiki/Yellow_vests_movement), the President of the French Republic launched the **Grand Débat National**, a nationwide public debate. All French citizens were invited to express their views and proposals on four main themes: *ecological transition*, *taxation*, *organisation of the State*, *democracy and citizenship*.

This national debate produced a huge amount of data, mostly textual, published [as Open Data here](https://www.data.gouv.fr/fr/datasets/donnees-ouvertes-du-grand-debat-national). In this kernel, we will load and **discover the dataset**, see what it contains, do an **exploratory analysis** and run some **natural language processing** algorithms. This notebook does not claim to be a deep analysis of the dataset but rather **an overview of its content**.

## Table of contents

* [Importing the data](#importing)

* [Discovering the dataset](#discovering)

* [Who are the contributors?](#who)
   
* [Shaping the questions](#shaping)

* [Closed questions analysis](#closed)

* [Open questions analysis](#open)

* [Conclusion](#conclusion)

## Environment

Loading environment libraries and importing modules.

In [None]:
# basics
import pandas as pd
import numpy as np
import datetime
import os

# string
import string
import unidecode
import re
from textwrap import wrap # wrapping long text into lines

# plot
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from wordcloud import WordCloud
import geopandas as gpd
from mpl_toolkits.axes_grid1 import make_axes_locatable
%matplotlib inline

# text mining
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from stop_words import get_stop_words

# Because we have some long strings to deal with:
pd.options.display.max_colwidth = 300

# Importing the data
<a id="importing"></a>
***

Let's see the files available in the dataset:

In [None]:
os.listdir('../input/granddebat')

There is one file for each of the **4 main themes** of the debate. The file *EVENTS.csv* is different: it contains information about public events held as part of the debate and will not be used here.  

We match each **file** with its corresponding **theme**.

In [None]:
themes = {
    'LA_FISCALITE_ET_LES_DEPENSES_PUBLIQUES.csv':'La fiscalité et les dépenses publiques',
    'ORGANISATION_DE_LETAT_ET_DES_SERVICES_PUBLICS.csv':"Organisation de l'état et des services publics",
    'DEMOCRATIE_ET_CITOYENNETE.csv':'Démocratie et citoyenneté',
    'LA_TRANSITION_ECOLOGIQUE.csv':'La transition écologique'
}

filenames = list(themes.keys())
themes = list(themes.values())

We can now import each file, all in one **list of dataframes** for easier use.  

We pay special attention to data types: *ZipCode* must be read as *strings* and date columns as *timestamps*.

In [None]:
filepaths = [os.path.join("..", "input", "granddebat", filename) for filename in filenames]
col_date = ['createdAt', 'publishedAt', 'updatedAt']
df_list = [pd.read_csv(filepath, low_memory=False,
                       dtype={'authorZipCode':'str'},
                       parse_dates=col_date) for filepath in filepaths]
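
To see why the dtype matters, here is a minimal sketch on a made-up two-row CSV (the values are illustrative and not taken from the dataset). Read with the default settings, a zipcode such as 01300 is parsed as a number and loses its leading zero.

```python
import io
import pandas as pd

# Hypothetical two-row sample, only to illustrate the dtype issue
sample_csv = "authorId,authorZipCode\na1,01300\na2,75011\n"

# Default parsing infers an integer column: the leading zero is lost
zip_as_int = pd.read_csv(io.StringIO(sample_csv))["authorZipCode"]

# Forcing the string dtype keeps the zipcode exactly as written
zip_as_str = pd.read_csv(io.StringIO(sample_csv),
                         dtype={"authorZipCode": "str"})["authorZipCode"]

print(zip_as_int.tolist())  # [1300, 75011]
print(zip_as_str.tolist())  # ['01300', '75011']
```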

# Discovering the dataset
<a id="discovering"></a>
***

## Available variables
The 4 dataframes share some **common variables**; the other columns are **questions** that are specific to each theme. The common variables are the following:

In [None]:
col_common = set.intersection(*[set(df.columns) for df in df_list])
col_common

We see that we have some information about the **author**: a unique *Id*, a *Type* (we will dig into it later) and a *ZipCode*. Not that much: we don't have any information about the age or the gender of the author for instance.

Let's check whether those variables contain **missing values**:

In [None]:
pd.concat([df[df.columns.intersection(col_common)] for df in df_list]).isnull().mean() * 100

The dataset is rather clean. The variables `updatedAt` and `trashedStatus` are mostly empty because most contributions are neither updated nor trashed.

## Dataset size

Each line of the dataframes corresponds to one **contribution**: the answers of an author to the questions of the corresponding theme. Let's see how many contributions we have for each dataset, and how many questions:

In [None]:
df_infos = pd.DataFrame({
     'theme': themes,
     'nb_contributions': [df.shape[0] for df in df_list],
     'nb_questions': [sum(~df.columns.isin(col_common)) for df in df_list]
    })
df_infos

The dataset is huge: **over 500 thousand contributions**. There is a lot of analysis to be done!
The survey about *taxation* was the most answered, but this may be because it is the shortest: only 8 questions.

## When were the contributions submitted?

We will have a look at the `createdAt` variable to spot **when** the contributions were submitted, and at what time of the day.

In [None]:
# Daily contributions
day_contrib = pd.concat([df.createdAt for df in df_list]).dt.date.value_counts().sort_index()

fig, ax = plt.subplots(figsize = (18,6))
day_contrib.plot(ax=ax)  # draw on the axes created above
ax.set_title('Daily contributions')
ax.set_xlabel('Date')
fig.autofmt_xdate()
ax.set_ylim(bottom=0)
plt.show()

We can see a first peak at the very beginning of the *Grand Débat* (the website opened for contributions on Tuesday 2019-01-22), and then another peak on Sunday 2019-03-10. On those particular days, submissions **exceeded 25,000 contributions per day**.

Let's also look at the time of day the contributions were made:

In [None]:
# Hourly contributions
hour_contrib = pd.concat([df.createdAt for df in df_list]).dt.hour.value_counts().sort_index()

fig, ax = plt.subplots(figsize = (18,6))
hour_contrib.plot(ax=ax)  # draw on the axes created above
ax.set_title('Hourly contributions')
ax.set_xlabel('Hour')
ax.set_ylim(bottom=0)
plt.show()

The number of contributions per hour peaks in the late afternoon, between 18h and 19h (we don't know whether the timestamps are **UTC** or **CET**, since the raw data are strings without any timezone indication). 
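
If the timestamps turned out to be UTC, the hourly curve would shift by one hour during winter. Here is a minimal sketch of the conversion, assuming UTC (an assumption that the data does not confirm) and using made-up timestamps:

```python
import pandas as pd

# Hypothetical timestamps without timezone information
created_at = pd.Series(pd.to_datetime(["2019-01-22 18:30:00",
                                       "2019-03-10 17:45:00"]))

# Assumption: the raw values are UTC; convert them to French local time
paris_time = created_at.dt.tz_localize("UTC").dt.tz_convert("Europe/Paris")

print(paris_time.dt.hour.tolist())  # [19, 18]
```

Both dates fall before the switch to daylight saving time, so the offset is one hour in each case.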

# Who are the contributors?
<a id="who"></a>
***

In this section we will have a closer look at the **authors** of the *Grand Débat*. For each contribution we have an `authorId` that is shared across datasets. 

Everyone could submit any number of contributions for each theme. One author wrote **252 contributions** about the *organisation of the State*!

In [None]:
# Maximal number of contributions per author per theme:
pd.DataFrame({'theme':themes,
              'max_contrib_per_author':[df.groupby('authorId').size().max() for df in df_list]})

Since we focus on contributors, we aggregate the table by `authorId` in order to have one line per author. If an author has several `authorType` or `authorZipCode` (that should be rare), we keep the most frequent one: the *mode*. 

We also add a count statistic: how many contributions that author made over the whole dataset.

In [None]:
def mode_na(x): 
    m = pd.Series.mode(x)
    return m.values[0] if not m.empty else np.nan

authors = pd.concat([df[df.columns.intersection(col_common)] for df in df_list])
# With pandas>=0.24, we would use: pandas.Series.mode
authors = authors.groupby('authorId').agg({'id':'count', # number of contributions
                                           'authorType':mode_na,
                                           'authorZipCode':mode_na})

The first statistic we can get out of this new dataframe is the **number of distinct contributors**:

In [None]:
authors.shape[0]

There are more than **250,000 distinct contributors**. As a city population, that would rank in the [top 10 of France's most populated cities](https://fr.wikipedia.org/wiki/Liste_des_communes_de_France_les_plus_peupl%C3%A9es)!

In [None]:
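# rename_axis makes the column name 'index' explicit, whatever the pandas version
n_contrib = authors.id.value_counts().rename_axis('index').reset_index(name='counts')
# Group every count above 4 under the single label '>4'
n_contrib['index'] = n_contrib['index'].astype(str).where(n_contrib['index'] <= 4, '>4')
n_contrib = n_contrib.groupby('index').sum()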
fig, ax = plt.subplots(figsize=(18,6))
ax = sns.barplot(x='index',
            y='counts',
            data=n_contrib.reset_index(),
            palette=sns.color_palette('Blues'))
ax.set_xlabel('Number of contributions')
ax.set_title('Authors per number of contributions')
plt.show()

As can be seen, around 50% of the authors submitted a single contribution. 

Only **50,000** authors have at least 4 contributions and hence may have answered the 4 themes.

In [None]:
fig, ax = plt.subplots(figsize=(18,6))
ax = sns.countplot(x='authorType',
                   data=authors,
                   palette=sns.color_palette('Blues'))
ax.set_yscale('log')
ax.set_title('Author types')
plt.show()

We notice that the great majority of respondents, 84% actually, are **citizens** *(mind the log scale!)*, i.e. they are neither politicians, officials nor part of an organisation.

## Contributors map
<a id="map"></a>

### Retrieving zipcodes

In this section we will plot a map of contributors in France. For this purpose we will use the `authorZipCode` variable. The first thing to check is the quality of this column: does it contain missing values?

In [None]:
authors.authorZipCode.isnull().sum()

Good news, only 3 missing values! But are the others correct?  

In France, a zipcode must contain **exactly 5 digits**. Let's check:

In [None]:
authors.authorZipCode.str.len().value_counts()

As we can see, the zipcode field was not constrained to 5 digits: some values are longer, some are shorter.

Since we cannot use those incorrect zipcodes, we set them to **NaN**. There are around 3,000 of them, which is not much compared with the number of contributors.

In [None]:
authors.loc[authors.authorZipCode.str.len() != 5, 'authorZipCode'] = np.nan

But keep in mind that this doesn't mean the remaining zipcodes are all correct! A 5-character value that is not an actual zipcode will also be lost later, when we link zipcodes to departments. Here is an example:

In [None]:
# Positional lookup: the index holds author ids, not integers
authors.authorZipCode.iloc[49789]
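
A length check alone accepts any 5-character string, including letters or spaces. The sketch below shows a slightly stricter, purely syntactic check; the helper name `is_plausible_zipcode` and the sample values are made up for illustration, and passing the check still does not prove that the zipcode exists.

```python
import re

def is_plausible_zipcode(zipcode):
    """Return True if `zipcode` is a string made of exactly five digits."""
    return isinstance(zipcode, str) and re.fullmatch(r"[0-9]{5}", zipcode) is not None

# Made-up examples: only the first one passes the check
for z in ["75011", "7501", "750111", "7501A", "     ", None]:
    print(repr(z), is_plausible_zipcode(z))
```

In recent pandas versions, the same idea can be applied to a whole column with `Series.str.fullmatch`.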

### Link it with open data

To get an idea of how much each part of France contributed to the *Grand Débat*, we must compute the **author rate**: the number of contributors per inhabitant.  

We can compute the number of contributors per zipcode. The local open data in France (population, geo boundaries) is usually at the **commune** scale, but the relation between the commune and the zipcode is **many to many**! We have to aggregate to the **department** scale to have a **one to many** link with both the commune and the zipcode.
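
To make the many-to-many relation concrete, here is a small sketch on made-up pairs (the commune and zipcode labels are placeholders, not real codes): one commune spans several zipcodes, while several communes share a single zipcode.

```python
from collections import defaultdict

# Made-up (commune, zipcode) pairs; the labels are placeholders, not real codes
pairs = [
    ("commune_A", "zip_1"),  # one large commune spanning two zipcodes
    ("commune_A", "zip_2"),
    ("commune_B", "zip_3"),  # two small communes sharing one zipcode
    ("commune_C", "zip_3"),
]

zips_per_commune = defaultdict(set)
communes_per_zip = defaultdict(set)
for commune, zipcode in pairs:
    zips_per_commune[commune].add(zipcode)
    communes_per_zip[zipcode].add(commune)

# The relation is many-to-many if both directions have a key with several values
many_zips = any(len(z) > 1 for z in zips_per_commune.values())
many_communes = any(len(c) > 1 for c in communes_per_zip.values())
print(many_zips and many_communes)  # True
```

Summing a population over such a table would count some inhabitants several times, which is why a coarser scale is needed.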

We will use two datasets:  

1. **communes-francaises/code-postal-code-insee-2015.csv**: a dataset containing the communes population, and link between commune, zipcode and department [(source)](https://data.opendatasoft.com/explore/dataset/code-postal-code-insee-2015%40public/table/).
2. **contours-des-departements-francais/departements-20180101.shp**: the boundaries of French departments [(source)](https://www.data.gouv.fr/fr/datasets/contours-des-departements-francais-issus-d-openstreetmap).

Let's use the first dataset to compute the author rate per department:

In [None]:
# French variant of csv:
communes_fr = pd.read_table('../input/communes-francaises/code-postal-code-insee-2015.csv',
                            encoding = 'utf-8', delimiter=";", dtype ={'Code_postal':'str'})

# Population per department (population is at INSEE_COM scale, which is the commune)
population_dep = communes_fr[['INSEE_COM', 'CODE_DEPT', 'POPULATION']].drop_duplicates().\
groupby('CODE_DEPT')[['POPULATION']].sum()

# Link zipcode/department
zc_dep = communes_fr[['Code_postal', 'CODE_DEPT']].drop_duplicates()

# Authors per zipcode
authors_zc = authors.assign(code_postal=authors.authorZipCode.str.slice(0, 5)).\
groupby('code_postal').size().reset_index(name='counts')

# Authors per department
authors_dep = zc_dep.set_index('Code_postal').join(authors_zc.set_index('code_postal')).\
groupby('CODE_DEPT').sum()
# Adding population
rate_dep = authors_dep.join(population_dep)
# Computing rate
rate_dep['author_rate'] = rate_dep.counts/rate_dep.POPULATION * 1000 # per 1000

To check the consistency of our data, we can verify that we get, as expected, a French population of about 65 million and about 250 thousand contributors:

In [None]:
print(rate_dep.POPULATION.sum())
print(rate_dep.counts.sum())

Now we use the second dataset to combine those data with the department boundaries.

In [None]:
map_dep = gpd.read_file('../input/contours-des-departements-francais/departements-20180101.shp',
                        encoding = 'utf-8')
map_dep = map_dep[['code_insee', 'geometry']]

# Metropolitan France only: drop the overseas departments
map_dep = map_dep[~map_dep.code_insee.isin(['971', '972','973','974','976'])]

# In this dataset, department 69 (Lyon) is split in two: 69D and 69M. We merge them.
map_dep.loc[map_dep.code_insee.str.contains('69'), 'code_insee'] = '69'
map_dep = map_dep.dissolve(by='code_insee')

# Add variable
map_dep = map_dep.join(rate_dep['author_rate'])

# Set CRS from latitude/longitude to Lambert93 for a better grid projection
map_dep.crs = {'init': 'epsg:4326'}
map_dep = map_dep.to_crs({'init': 'epsg:2154'})

We can use the `plot` function to plot a `GeoDataFrame` object:

In [None]:
fig, ax = plt.subplots(1, figsize=(18, 12))
map_dep.plot(column='author_rate', cmap='Blues', ax=ax, linewidth=0.1, edgecolor='black')
ax.axis('off')
ax.set_title('Contributors of the Grand Débat per inhabitant (‰)',
             fontdict={'fontsize':'25', 'fontweight':'3'})

# Add colorbar
sm = plt.cm.ScalarMappable(cmap='Blues',
                           norm=plt.Normalize(vmin=0, vmax=map_dep.author_rate.max()))
sm.set_array([])  # public equivalent of setting sm._A directly
# Place the cbar next to the plot
cbar = fig.colorbar(sm,make_axes_locatable(ax).append_axes("right", size="5%"))

plt.show()

The departments with the highest contributor rates are **Paris** and **Haute-Garonne** (*7.1* and *6.3* contributors per 1000 inhabitants, respectively).  
Conversely, the departments with the lowest rates are **Seine-Saint-Denis** and **Haute-Corse** (<*1.8* contributors per 1000 inhabitants).


# Shaping the questions
<a id="shaping"></a>
***

After this very brief dataset analysis, it is time to focus on the variables of interest: **the questions**. Each dataframe contains several questions, but we will try to treat them all at once.  

The column names for the questions are a bit messy, so we will rename them for clarity. We build a dataframe containing information about each question: old and new name, title, and the theme and dataframe it belongs to.

In [None]:
questions = pd.concat([pd.DataFrame({'old_name':df_list[i].columns,
                                     'df_id':i,
                                     'theme':themes[i]}) for i in range(len(df_list))])
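# Boolean masks are negated with ~ (unary minus is not supported on booleans)
questions = questions[~questions["old_name"].isin(col_common)].reset_index(drop=True)
questions = questions.assign(new_name=(pd.Series(
    ['Q{}'.format(i) for i in range(1, questions.shape[0] + 1)])))
# Split on the first ' - ' only, so that a question containing ' - ' is not truncated
questions = questions.assign(question=pd.Series(
    [name.split(' - ', 1)[1] for name in questions.old_name]))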

# Questions rename
dict_rename = {old:new for old, new in zip(questions.old_name,questions.new_name)}
for df in df_list:
    df.rename(columns=dict_rename,inplace=True)
    
questions.head(3)

For each question, we compute the following statistics:
* **nbrow**: number of rows (i.e. number of contributions for the corresponding theme)
* **nbnnull**: number of answers that are not *null* (answer is *null* if the contributor skipped that question)
* **nbunique**: number of distinct answers
* **nnull_rate**: nbnnull/nbrow * 100
* **unique_rate**: nbunique/nbnnull * 100


In [None]:
questions['nbrow'] = questions.apply(lambda g: df_list[g.df_id].shape[0], axis=1)
questions['nbnnull'] = questions.apply(lambda g: df_list[g.df_id].loc[:,g.new_name]\
                                       .notnull().sum(), axis=1)
questions['nbunique'] = questions.apply(lambda g: df_list[g.df_id].loc[:,g.new_name]\
                                        .nunique(), axis=1)

questions['nnull_rate'] = questions.nbnnull/questions.nbrow * 100
questions['unique_rate'] = questions.nbunique/questions.nbnnull * 100

We can notice that some questions have **very few distinct answers**:

In [None]:
questions['closed'] = questions['nbunique'] <= 3
sum(questions.closed)

Those 19 questions are **closed-ended questions**: the answer is restricted to a few choices, mainly **Yes** or **No**.

We can now aggregate at the **theme** scale:

In [None]:
questions.groupby(['theme']).agg({'question':'count', 'closed':'sum',
                                  'nbrow':'mean', 'nnull_rate':'mean'})

We can note that the two themes **ecological transition** and **taxation** attracted more interest from respondents: they have both more contributions and fewer null values.

We also see that there are a lot of null values, and we want to understand why. Let's see which questions have the most null values:

In [None]:
questions.sort_values('nnull_rate').head(3)

We can see that all of those questions start with **"Si..."** *("If...")*. They are conditional: an answer is not necessarily expected.  

If we pay attention, we can notice that the `unique_rate` is also very low. This is because a lot of contributors answered **"non concerné"** *("not applicable")*, for instance to question **Q40**:

In [None]:
df_list[1].Q40.value_counts().head(20)

Some other questions have a low `unique_rate` because they are **guided questions**: choices were given but the respondent could decide to answer something else. This is the case, for instance, for questions **Q91**, **Q79** and **Q4**:

In [None]:
df_list[3].Q91.value_counts().head(10)

# Closed questions analysis
<a id="closed"></a>
***

For each of the 19 **closed questions**, we plot the count of each answer in order to identify the most popular opinions.

We use the **seaborn** library for plotting.

In [None]:
# Add frequencies to a countplot
# Source: https://stackoverflow.com/questions/33179122/seaborn-countplot-with-frequencies
def add_frequencies(ax, ncount):
    for p in ax.patches:
        x=p.get_bbox().get_points()[:,0]
        y=p.get_bbox().get_points()[1,1]
        ax.annotate('{:.1f} %'.format(100.*y/ncount), (x.mean(), y), 
                ha='center', va='bottom', size='small', color='black', weight='bold')
        
# Countplot of questions_df
def countplot_qdf(questions_df, suptitle):
    n = questions_df.shape[0]
    
    # If there is nothing to plot, we stop here
    if n==0:
        return
    
    # Numbers of rows and cols in the subplots
    ncols = 3
    nrows = (n + ncols - 1) // ncols  # ceiling division: just enough rows for n plots
    fig,ax = plt.subplots(nrows, ncols, figsize=(25,6*nrows))
    fig.tight_layout(pad=9, w_pad=10, h_pad=7)
    fig.suptitle(suptitle, size=30, fontweight='bold')
    
    # Hide exceeding subplots
    for i in range(n, ncols*nrows):
        ax.flatten()[i].axis('off')
        
    # Countplot for each question
    for index, row in questions_df.iterrows():
        plt.sca(ax.flatten()[index])
        # We add the sort_values argument to always have the same order: Oui, Non...
        xlabels = df_list[row.df_id].loc[:,row.new_name]
        xlabels = xlabels.value_counts().index.sort_values(ascending=False)
        axi = sns.countplot(x=row.new_name,
                           data=df_list[row.df_id],
                           order = xlabels)
        # Wrap long questions into lines
        axi.set_title("\n".join(wrap(row.new_name + '. ' + row.question, 60)))
        axi.set_xlabel('')
        # We also set a wrap here (for one very long answer...)
        axi.set_xticklabels(["\n".join(wrap(s, 17)) for s in xlabels])
        axi.set_ylabel('Number of answers')
        add_frequencies(axi, row.nbnnull)
        
# Plotting questions, grouped by theme
for i in range(len(themes)):
    countplot_qdf(questions[(questions.closed) & (questions.df_id == i)].reset_index(), themes[i])

On the themes of **State organisation, democracy and citizenship**: when asked their opinion, contributors consistently side with **change**. In particular, the most popular demands include:
* revising the **functioning of the administration** (**Q26** - **91%**) 
* taking into account the **blank vote** (**Q52** - **82%**) 
* **transforming the Assemblies** (**Q59** - **86%**)

On the theme of **ecological transition**, we can note that **69%** of respondents consider that their daily life is impacted by climate change (**Q81**). However, **95%** of them think they can personally contribute to environmental protection (**Q83**). Solutions could come from the **heating method** (**61%** - **Q87**) or **mobility** (**42%** - **Q89**).

# Open questions analysis
<a id="open"></a>
***

Most of the information in the dataset lies in the **open questions**, but they are the most **difficult** to analyse!

We can start with a basic statistic: the **number of words** contained in the whole dataset.

In [None]:
# Count words in a string, a word being here any sequence of characters between white spaces
def count_words(s):
    # Missing answers are NaN (a float), not strings: they count as zero words
    if not isinstance(s, str):
        return 0
    return len(s.split())

# For each dataframe:
# filter on questions and title
# count words for each contribution of each question
# sum it all
n_words = [df.filter(regex=r'title|^Q', axis=1).apply(np.vectorize(count_words)).sum().sum()\
           for df in df_list]
sum(n_words)

The contributions contain **167 million words**! That is equivalent to **325 times** ***Les Misérables***, or **154 times the whole** ***Harry Potter*** **series**.  

It is impossible to analyse the dataset exhaustively by human reading... **Artificial Intelligence seems necessary** to interpret this dataset. Over to you, Kagglers!

### Basic text mining

To finish, let's try some simple text mining. We will investigate the information hidden in the contributions. 

We have **75 open questions**. A question-by-question analysis would make this notebook sprawl, so we aggregate all questions at the **theme** scale.

To begin, one must define **stop words**: those are the most common words that don't give any insight, and must be filtered out when doing **natural language processing**.

In [None]:
# Get French stop words (from both nltk and stop_words libraries)
stop_words = list(set(get_stop_words('fr')).union(stopwords.words('french')))
# Put them in lowercase ASCII
stop_words = [unidecode.unidecode(w.lower()) for w in stop_words]
# Add punctuation and some missing words
stop_words = set(stop_words +
                 list(string.punctuation) +
                 ["’", "...", "'", "", ">>", "<<"] +
                 ["oui", "non", "plus", "toute", "toutes", "faut"])

The next important step is **tokenization**, i.e. splitting text into words. This can be tricky because of punctuation, which differs slightly from one language to another. There are some important features we have to take into consideration: **punctuation**, **case**, **encoding** and **stop words**.

In [None]:
# Get tokens from list of strings (can probably be optimised)
def get_tokens(s):
    # MosesTokenizer has been moved out of NLTK due to licensing issues
    # So we define a simple tokenizer based on regex, designed for French language
    pattern = r"[cdjlmnstCDJLMNST]['´`]|\w+|\$[\d\.]+|\S+"
    tokenizer = RegexpTokenizer(pattern)
    tokens = tokenizer.tokenize(" ".join(s.dropna()))
    # remove punctuation (for words like "j'")
    tokens = [w.translate(str.maketrans('', '', string.punctuation)) for w in tokens]
    # lowercase ASCII
    tokens = [unidecode.unidecode(w.lower()) for w in tokens]
    # remove stop words from tokens
    tokens = [w for w in tokens if w not in stop_words]
    return(tokens)
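
To see what the regular expression does, here is a minimal sketch on a made-up sentence. It uses `re.findall` from the standard library as a stand-in for `RegexpTokenizer`, on the assumption that both return the same matches for this pattern, and it leaves out the accent removal and stop word filtering.

```python
import re
import string

# Same regular expression as in get_tokens
pattern = r"[cdjlmnstCDJLMNST]['´`]|\w+|\$[\d\.]+|\S+"

# Made-up sentence, for illustration only
sample = "J'aime l'écologie, vraiment !"
raw_tokens = re.findall(pattern, sample)
print(raw_tokens)
# ["J'", 'aime', "l'", 'écologie', ',', 'vraiment', '!']

# Strip punctuation, lowercase, and drop the tokens that end up empty
table = str.maketrans("", "", string.punctuation)
cleaned = [t.translate(table).lower() for t in raw_tokens]
cleaned = [t for t in cleaned if t]
print(cleaned)
# ['j', 'aime', 'l', 'écologie', 'vraiment']
```

The first alternative of the pattern is what separates elided articles and pronouns such as "J'" and "l'" from the word that follows.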

We will use the tokens to draw a **word cloud**. This is a visual representation of **n-gram** counts. The more frequent a term is, the bigger it will appear on the plot.

In [None]:
def plot_wordcloud(s, title, mw = 500):
    wordcloud = WordCloud(width=1200, height=600, max_words=mw,
                          background_color="white").generate(" ".join(s))
    plt.figure(figsize=(20, 10))
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(title, fontsize=50, pad=50)
    plt.show()
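
The counts behind such a plot can be inspected directly. Here is a minimal sketch with `collections.Counter` on made-up tokens; it only counts single words and does not reproduce how the wordcloud library sizes or lays out the terms.

```python
from collections import Counter

# Made-up tokens standing in for the output of get_tokens
tokens = ["impot", "service", "public", "impot", "transport", "impot", "service"]

# Term frequencies: the more frequent a term, the bigger it would be drawn
frequencies = Counter(tokens)
print(frequencies.most_common(2))  # [('impot', 3), ('service', 2)]
```

A table of the most common terms is a useful complement to the cloud when exact numbers matter.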

Let's plot a word cloud for each of the 4 **themes**. We will see which **topics are raised most often** in each of them. 

*It may take a few minutes and some GBs of RAM.*

In [None]:
# pd.concat replaces Series.append, which was removed in pandas 2.0
col_q = pd.concat([questions.new_name[~questions.closed], pd.Series(['title'])])
for i in range(len(themes)):
    col_q_i = df_list[i].columns.intersection(col_q)
    tokens = pd.concat([df_list[i][col].dropna() for col in col_q_i])
    tokens = get_tokens(tokens)
    plot_wordcloud(tokens, title = themes[i])

There is room for improvement: we count **singular** and **plural** forms separately, for instance "*services publics*" and "*service public*" on the second graph. **Stemming** would help here.
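
One possible way to merge singular and plural forms is the French Snowball stemmer shipped with NLTK. This is only a sketch on made-up tokens, not something this notebook runs on the real data; stems are truncated word forms, so they would be less readable on a word cloud.

```python
from nltk.stem.snowball import SnowballStemmer

# French Snowball stemmer from NLTK
stemmer = SnowballStemmer("french")

# Made-up tokens in the lowercase ASCII form produced by get_tokens
tokens = ["service", "services", "public", "publics"]
stems = [stemmer.stem(t) for t in tokens]

# Singular and plural forms are expected to collapse onto the same stem
print(stems[0] == stems[1], stems[2] == stems[3])
```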

We will conclude with these clouds: any interpretation of the results is up to you!

# Conclusion
<a id="conclusion"></a>
***

This is the end of this notebook for now. There is a lot more to say about this dataset, but the notebook is getting long. **Thank you for reading it**! 

Its aim was to help you discover the dataset, and maybe inspire you to investigate it further.  

We have only scratched the surface of the data; I may write a second part with **deeper textual analysis**.  

Feel free to **upvote** if you enjoyed this notebook, to **fork** it for further analysis, or to **comment** with any suggestion. All ideas are welcome!