# The 2020 Kaggle ML & DS survey discovery
## or How to discover a new dataset with Jupyter

**Acknowledgements** :  
This notebook is based on the dataset: https://www.kaggle.com/c/kaggle-survey-2020/

For the fourth year running, Kaggle is organising its annual Machine Learning and Data Science Survey Challenge [[Kaggle2020]](#Kaggle2020), where the goal is to create a notebook that tells a rich story about the data science and machine learning community, based on the survey dataset. These data provide an overview of how the Kaggle community works.

In this notebook, we try to show a way to deal with an unknown dataset. Our story is about how to discover a dataset for the first time. It is always difficult to know where to start and what we are looking for. What meaning do we want to give to these data? What information are we looking for? Is there hidden information or knowledge that could be the basis of new ideas or new concepts?

This notebook proposes a data preparation step and some data analysis approaches for this dataset.


## Contents
1. [Import libraries](#1)
2. [Download data](#2)
3. [Data preparation](#3)
4. [Data analysis](#4)

In [None]:
# Load useful package
import sys    # library for accessing system-specific parameters and functions
import gc     # library for garbage collector
import warnings    # library to deal with warning messages
warnings.filterwarnings("ignore")

# 1 - Import libraries
<a id="1"></a>
First, we import the libraries used:

* **[[matplotlib.pyplot]](#matplotlib)**: This library provides basic charts to visualize data.
* **[[missingno]](#missingno)**: This library provides visualizations and utility functions to assess the completeness of a dataset.
* **[[numpy]](#numpy)**: This library is for scientific computing. It provides a high-performance multidimensional array object, and tools for working with these arrays.
* **[[pandas]](#pandas)**: This library provides fast, powerful, flexible and easy-to-use data structures, as well as the means to quickly perform operations on these structures.
* **[[plotly]](#plotly)**: This library provides interactive charts to visualize data.
* **[[seaborn]](#seaborn)**: This library is a data visualization library based on matplotlib. It provides a high-level interface for informative statistical graphics.

In [None]:
# Load used package for data science
import numpy as np   # library for scientific computing
import pandas as pd  # library for data processing, CSV file I/O
pd.set_option("display.max_columns", None)  # Display all columns
#pd.set_option("display.precision", 3)   # Use 3 decimal places in output display

# Libraries used for visualization
import matplotlib.pyplot as plt  # library for graphics
%matplotlib inline
plt.style.use('ggplot')
import plotly.express as px
import plotly.graph_objs as go
import seaborn as sns

# Library specific for missing data
import missingno as msno

# 2 - Download data
<a id="2"></a>

The 2020 Kaggle Machine Learning & Data Science survey [[Kaggle2020]](#Kaggle2020) was live for 3.5 weeks in October. The data have been cleaned by the organisers. They include raw numbers about who is working with data, what is happening with machine learning in different industries, and the best ways for new data scientists to break into the field. The data have been published in as raw a format as possible without compromising anonymization, which makes it an unusual example of a survey dataset.

To begin, we have to load the data file and give every column a meaningful name. In the data file, the first two rows contain information about the column contents: the first one holds the number of the question and the second one the text of the question. We decided to use only the first one to name the columns of the dataset.
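
The loading strategy described above can be tried on a tiny in-memory file first. This is only a sketch: the two columns and their contents are hypothetical, and `skiprows=2` is one possible way to skip both header rows, not necessarily the one used in the cell that follows.

```python
import io
import pandas as pd

# Hypothetical file with two header rows, mimicking the layout described above
raw = io.StringIO(
    "Q1,Q2\n"
    "What is your age?,What is your gender?\n"
    "25-29,Man\n"
    "30-34,Woman\n"
)

# Read the two header rows only
header = pd.read_csv(raw, header=None, nrows=2)
raw.seek(0)  # rewind the buffer before the second read

# Skip both header rows and name the columns after the first one
toy = pd.read_csv(raw, skiprows=2, header=None, names=header.loc[0].tolist())
print(toy.columns.tolist())
print(toy.shape)
```

Keeping the second header row in a separate dataframe, as done with `header` here, makes it possible to look up the full text of a question later.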

In [None]:
# -- Load survey data from 2020
# Load column names (which are in the 2 first rows)
colName = pd.read_csv("../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv", header=None, nrows=2)
colName.loc[2,:] = colName.loc[0,:]+" "+colName.loc[1,:]
colName.rename(columns=colName.loc[0,:], 
               index={0:'number',1:'question',2:'numberAndQuestion'}, 
               inplace=True)

# Load data in a dataframe
df_data = pd.read_csv("../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv", 
                      header=1, 
                      names=colName.loc['number'])
df_data.head(5)

The 2020 Kaggle DS & ML Survey dataset is made up of 20,036 entries and 355 columns. All columns (except the first one, which holds the time each participant took to fill in the survey) contain categorical information. The dataset takes up 255.1 MB of memory on the server.
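
As a side note on how such a memory figure can be obtained, here is a minimal sketch on a hypothetical column. It assumes text stored with the `object` dtype, for which `memory_usage(deep=True)` also counts the strings themselves, whereas the shallow measure only counts the pointers.

```python
import pandas as pd

# Hypothetical text column stored as generic Python objects
toy = pd.DataFrame({"answer": ["Python", "R", "Python", None] * 1000}, dtype=object)

# Shallow measure: size of the array of pointers only
shallow_bytes = toy.memory_usage(deep=False).sum()
# Deep measure: also includes the size of each string
deep_bytes = toy.memory_usage(deep=True).sum()

print(f"shallow: {shallow_bytes / 1024**2:.3f} MB")
print(f"deep:    {deep_bytes / 1024**2:.3f} MB")
```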

In [None]:
# Size of the loaded dataset
print("Size of the dataframe :", df_data.shape)

In [None]:
# Type of data stored in the columns
print("Types of data stored in the columns : \n", df_data.dtypes.value_counts())

In [None]:
# Size of used memory to store the dataset
# <=> print(sys.getsizeof(df_data)/1024/1024)      # size in MB
# <=> print(df_data.memory_usage(deep=True).sum()/1024/1024)    # size in MB
df_data.info(memory_usage='deep')

# 3 - Data preparation
<a id="3"></a>
Data preparation [[Kuhn2019]](#Kuhn2019) is the most important and time-consuming part of data analysis. It involves transforming raw data into a representation that machine learning algorithms can understand and that data science tools can process quickly. This step is highly specific to the data used, to the goals of the project and to the algorithms that will model the data or extract information from them.

Some standard tasks are commonly carried out during the data preparation step of a data analysis or machine learning project. These tasks include:
* **Data cleaning**: Identifying and correcting mistakes or errors in the data.
* **Data encoding**: Reducing the memory used to store the data in order to optimize the algorithms applied later.
* **Missing data**: Identifying whether a participant provided no data at all (non-response) or whether some variables are unknown for a participant (item non-response), because of a refusal to answer or a failure to collect the response.
* **Feature engineering**: Deriving new variables from available data.
* **Data transforms**: Changing the scale or distribution of variables.
* **Feature selection**: Identifying which variables are most relevant to our analysis.

Each of these tasks is a whole field of study with specialized algorithms.

In some cases, variables must be encoded or transformed before we can apply a machine learning algorithm, for example by converting strings to numbers. In other cases the need is less clear: scaling a variable, for instance, may or may not be useful to an algorithm. We decided to apply the first 4 steps in this part and to deal with the last 2 in the next part, depending on the processing we apply (if it is useful).

## a) Data cleaning
Data cleaning is the most important step before analyzing or modeling data. It can be boring and may look like a waste of time, but it is fundamental to better data analysis.

The 2020 Kaggle DS & ML Survey dataset has been prepared for the challenge by the organisers, so we assume that the data are clean and contain no errors.

## b) Data encoding 
It can be worthwhile to optimize the size of our dataframe, so that it uses less memory in the environment. One way to do that is to specify the data types of the dataframe more efficiently than the automatic detection done by Pandas.

The 2020 Kaggle DS & ML Survey dataset is composed of the completion time and the answers to 39 questions:
* 1 numerical column with the time taken to fill in the form,
* 18 single-answer questions,
* 21 multiple-choice questions: participants can select several options, and one column has been created for each option. Among these, 8 questions come in two variants (suffixes `_A` and `_B`), where the variant shown to a participant depends on an earlier answer. The answers to these 21 questions are recorded in 336 columns of the dataframe.
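
To make the one-column-per-option layout concrete, here is a small sketch on a hypothetical extract. The column names follow the `Q7_Part_n` pattern of the dataset, but the three rows are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical extract: one column per option, holding the option label
# when it was ticked and NaN otherwise
toy = pd.DataFrame({
    "Q7_Part_1": ["Python", "Python", np.nan],
    "Q7_Part_2": ["R", np.nan, np.nan],
    "Q7_Part_3": [np.nan, "SQL", np.nan],
})

# Columns belonging to question Q7
q7_cols = [col for col in toy.columns if col.startswith("Q7_")]

# Number of options ticked by each participant
nb_ticked = toy[q7_cols].notna().sum(axis=1)
print(nb_ticked.tolist())
```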

In [None]:
# Rename with short names the longest country names
df_data.loc[df_data['Q3']=="Iran, Islamic Republic of...", "Q3"] = "Iran"
df_data.loc[df_data['Q3']=="United Kingdom of Great Britain and Northern Ireland", "Q3"] = "United Kingdom"
df_data.loc[df_data['Q3']=="United States of America", "Q3"] = "USA"

In [None]:
# List of questions with multiple choice answers
listQuestMCA = ['Q7', 'Q9', 'Q10', 'Q12', 'Q14', 'Q16', 'Q17', 'Q18', 
                'Q19', 'Q23', 'Q36', 'Q37', 'Q39']
listQuestMCAB = ['Q26', 'Q27', 'Q28', 'Q29', 'Q31', 'Q33', 'Q34', 'Q35']
listQuestMC = listQuestMCA + [i+"_A" for i in listQuestMCAB] + [i+"_B" for i in listQuestMCAB]
#print(listQuestMC)

# List of questions with single answers:
# start from all columns and remove those belonging to multiple choice questions
tmpNb = 0
listQuestSA = list(colName.loc['number',:])
for quest in listQuestMC:
    # Select columns whose names begin with the question number
    tmpQuest = [col for col in df_data.columns if col.startswith(quest)]
    # Remove the found columns from the list of single answer questions
    for col in tmpQuest:
        listQuestSA.remove(col)
        tmpNb += 1
#print(listQuestSA)

print("Number of questions with single answer :", len(listQuestSA))
print("Number of questions with multiple choice answers :", 
     (len(listQuestMCA)+len(listQuestMCAB)), 
     "(",tmpNb,")")

In [None]:
# Add new rows in the colName 
# with the number of values in the column
tmpData = pd.DataFrame(df_data.apply(lambda x: x.count()), columns=['nbValue'])
tmpData = tmpData.transpose()
colName = pd.concat([colName, tmpData])
# with the number of different values in the column (NaN included)
tmpData = pd.DataFrame(df_data.apply(lambda x: len(x.unique())), columns=['nbDiffValue'])
tmpData = tmpData.transpose()
colName = pd.concat([colName, tmpData])
# with the different values in the columns
tmpData = pd.DataFrame(df_data.apply(lambda x: x.unique()), columns=['value'])
tmpData = tmpData.transpose()
colName = pd.concat([colName, tmpData])

Single-answer questions contain a finite number of text values. In general, we consider them as categorical variables but, by default, Pandas stores them as objects. This type of storage is not optimal, because it creates a list of pointers to the memory address of each value of the column. For columns with low cardinality (the number of unique values is lower than 50% of the number of values), this can be optimized by forcing Pandas to use a mapping table in which every unique value is referenced by an integer instead of a pointer. This is done with the category data type. It is a hybrid data type: it looks and behaves like a string in many instances, but internally it is represented by an array of integers. This allows the data to be sorted in a custom order and stored more efficiently. We therefore encode our single-answer questions as categories.
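
The effect of the category data type can be checked on a small example. This is a sketch on a hypothetical column with three distinct job titles; the exact byte counts depend on the Pandas version, so only their comparison matters.

```python
import pandas as pd

# Hypothetical low-cardinality column stored as generic Python objects
answers = pd.Series(["Student", "Data Scientist", "Student", "Other"] * 5000,
                    dtype=object)
as_category = answers.astype("category")

# Memory used by both representations, strings included
object_bytes = answers.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)

# Internally, each value is an integer code pointing into the category table
print(as_category.cat.categories.tolist())
print(as_category.cat.codes[:4].tolist())
```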

In [None]:
# List of single answer questions
tmpQuest = [col for col in listQuestSA if col.startswith('Q')]
# Transform object in category
df_data[tmpQuest] = df_data[tmpQuest].astype('category')

In [None]:
# Size of used memory to store the dataset
df_data.info(memory_usage='deep')

The previous outputs show that many columns hold only two values (an option label or a missing value), that is to say a "Yes or No" answer represented as an object. A boolean encoding can store these answers much more compactly, and the boolean is a standard type in Python. Converting these objects to boolean values reduces the memory usage of the dataframe without losing information.
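
As an illustration of this encoding, here is a minimal sketch on a hypothetical option column, using `notna()` to obtain the boolean values. It is one possible way to express the conversion, shown on invented data.

```python
import numpy as np
import pandas as pd

# Hypothetical option column: the label when ticked, NaN otherwise
toy = pd.DataFrame({"Q7_Part_1": ["Python", np.nan, "Python", np.nan] * 1000},
                   dtype=object)
before = toy.memory_usage(deep=True).sum()

# True when the option was ticked, False when the cell was empty
toy["Q7_Part_1"] = toy["Q7_Part_1"].notna()
after = toy.memory_usage(deep=True).sum()

print(toy["Q7_Part_1"].dtype, before, after)
```

The label itself ("Python" here) is lost from the column, which is why it is worth keeping it aside, as done with the `refValue` row of `colName` in this notebook.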

In [None]:
# A list which contains the option label carried by each 2-valued column
tmpList = []
# Transform 2-valued columns (option label or NaN) into boolean values
for col in colName.columns:
    tmpCol = colName[col]
    if tmpCol['nbDiffValue'] == 2:
        #print(tmpCol['value'])
        # Collect the label mentioned in the studied column 
        tmpData = pd.Series(tmpCol['value'])
        tmpData = list(tmpData.dropna())
        # Add this label to the list of reference values
        tmpList.append(tmpData[0])
        # Encode the column as boolean: True if the option was selected.
        # Assigning the whole column (rather than through .loc) lets pandas change its dtype.
        df_data[tmpCol['number']] = df_data[tmpCol['number']].notna()
    else:
        tmpList.append(np.nan)

In [None]:
# Add a new row in the colName with the value concerned by the column
tmpData = pd.DataFrame(tmpList, columns=['refValue'], index=list(colName.loc['number']))
tmpData = tmpData.transpose()
colName = pd.concat([colName, tmpData])

In [None]:
# Size of used memory to store the dataset
df_data.info(memory_usage='deep')

Together, the two conversions save 97.3% of the memory used on the server (21 MB with the categorical type and 226.7 MB with the boolean encoding), which will benefit the performance of further processing. The new dataframe takes up only 6.9 MB.
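
The percentage can be checked from the two sizes reported by `info()` in this notebook (255.1 MB after loading and 6.9 MB after both conversions):

```python
initial_mb = 255.1   # size reported after loading
final_mb = 6.9       # size reported after both conversions

# Share of the initial memory that is no longer needed
saved_pct = 100 * (initial_mb - final_mb) / initial_mb
print(f"{saved_pct:.1f}%")
```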

## c) Missing data
This step consists of exploring the data to find out whether there are missing and/or duplicate data elements, and then deciding how to deal with them.

The 2020 Kaggle DS & ML Survey dataset has been prepared for the challenge by the organisers, so we assume that there are no duplicate data. However, the following outputs show the number of answers per column for the single-answer questions, and some of them have a large number of missing values.
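
Before drawing anything, the same checks can be done numerically. The sketch below uses a hypothetical four-row extract (the `Q30` value is invented) and only standard Pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical extract with one complete and one sparse question
toy = pd.DataFrame({
    "Q1": ["25-29", "30-34", "18-21", "22-24"],
    "Q30": [np.nan, "MySQL", np.nan, np.nan],
})

# Number and percentage of missing values per column
missing_count = toy.isna().sum()
missing_share = toy.isna().mean() * 100

# Number of fully duplicated rows
nb_duplicates = toy.duplicated().sum()

print(missing_count.to_dict(), missing_share.to_dict(), nb_duplicates)
```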

In [None]:
# Selection of columns of single-answer questions (the time column is skipped)
tmpCol = listQuestSA[1:]
# Drawing bar chart of the number of answers
msno.bar(df_data[tmpCol], color=(0, 0, 1))
plt.title("Number of answers per single-answer question", fontsize=26)
plt.show()

In [None]:
# Drawing matrix of missing data
msno.matrix(df_data[tmpCol], labels=True,
            figsize=(20,20), color=(0, 0, 1))
plt.title("Completeness of the answers to single-answer questions", fontsize=26)
plt.show()

If we look at these questions more closely, the missing data make sense: the questions are so specific to a domain that participants might not know it, and they have no way to say so. For example, question Q30 concerns big data products and Q32 business intelligence.

## d) Feature engineering

Sometimes it is useful to condense a set of binary values into a single representation before applying a machine learning approach. Comparing two records then boils down to a single AND operation between their representations, which can improve the performance of later processing.
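
The idea can be sketched on a hypothetical three-option question. The packing into an integer is an addition made here for illustration, so that Python's bitwise `&` can be applied directly; the notebook itself keeps the string form.

```python
import pandas as pd

# Hypothetical boolean answers to a three-option question
toy = pd.DataFrame({
    "opt_1": [True, True, False],
    "opt_2": [False, True, False],
    "opt_3": [True, False, True],
})

# Condensed representation: one character per option, e.g. "101"
as_string = toy.astype(int).astype(str).apply("".join, axis=1)

# The same pattern packed into an integer, so that a bitwise AND can be applied
as_int = as_string.apply(lambda bits: int(bits, 2))

# Options shared by participants 0 and 1
common = int(as_int[0] & as_int[1])
print(as_string.tolist(), format(common, "03b"))
```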

In [None]:
# Function to check the "None" answer of a question
def verifyNoneAnswer(question, colName):
    # Select the columns whose names begin with "question"
    colQuestion = [col for col in colName if col.startswith(question)]
    # Same columns without the "None" answer
    # (assumed to be the second-to-last column of the question)
    colWithoutNone = colQuestion.copy()
    colNone = colWithoutNone.pop(-2)
    # Number of options for which the user answered "Yes" (without "None" answer)
    tabNbYes = df_data[colWithoutNone].sum(axis=1)

    # The columns were encoded as booleans above, so "None" is selected when the value is True
    # (testing isnull() on a boolean column would always return False)
    noneTicked = df_data[colNone].astype(bool)
    # Did everyone who answered "None" leave the other options empty?
    # True when "None" is selected alone, or when "None" is not selected
    # and at least one other option is
    noneWD = (noneTicked & (tabNbYes == 0)) | (~noneTicked & (tabNbYes != 0))
    noneWD.name = question + "_NoneWD"

    # Return the columns without "None", the number of answers and the consistency flag
    return colWithoutNone, tabNbYes, noneWD

In [None]:
# Function used to translate the multiple choice questions into a binary representation
def translateBinAnswer(colQuestion):
    # Turn the answer (for example the language name, stored as a boolean) into 0 or 1
    tabBin = df_data[colQuestion]*1
    # Concatenate the 0s and 1s of the columns corresponding to the question
    colBin = tabBin.astype(str).apply(''.join, axis=1)
    return colBin

In [None]:
# For all questions, verify the "None" answer and calculate the binary representation of answers
for quest in listQuestMC:
    # Verify the "None" answer 
    # Easy way: DataFrame.count() counts the number of non-NaN values per column, 
    # but does not take into account the raw representation of multiple choice questions
    tmpQuestion, tmpNbAnswer, tmpNoneWD = verifyNoneAnswer(quest, colName)
    # Add the number of answers to the question in the dataset
    tmpCol = quest+"_Nb"
    df_data.insert(len(df_data.columns), tmpCol, tmpNbAnswer, True)
    # Add the validation of the "None" answer in the dataset
    tmpCol = quest+"_WD"
    df_data.insert(len(df_data.columns), tmpCol, tmpNoneWD, True)
    # Define a binary representation of the answers to the question
    tmpAnswer = translateBinAnswer(tmpQuestion)
    # Add the binary representation of the answers in the dataset
    df_data.insert(len(df_data.columns), quest, tmpAnswer, True)

In [None]:
df_data.head(5)

Unfortunately, we did not have time to explore these new features in the next part.

## e) Adding knowledge
The questions can be grouped by domain. We identified 10 specific domains in the proposed dataset:
* **time**: the time taken to fill in the survey (Time from Start to Finish (seconds)),
* **person**: a short description of the user (Q1, Q2, Q3, Q4, Q5),
* **programming**: their programming knowledge (Q6, Q7, Q8, Q9),
* **notebook**: their usage of programming notebooks (Q10, Q11, Q12, Q13, Q14),
* **machine learning**: their machine learning environment (Q15, Q16, Q17, Q18, Q19, Q25, Q28, Q33, Q34, Q35, Q36),
* **work**: their professional environment (Q20, Q21, Q22, Q23, Q24), 
* **cloud**: their cloud computing usage (Q26, Q27),
* **databases**: their database practice (Q29, Q30),
* **business intelligence**: their business intelligence usage (Q31, Q32),
* **course**: their educational background and future learning (Q37, Q38, Q39).
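
Such a grouping becomes handy once it can be queried in both directions. The sketch below uses a hypothetical subset of the domains; `columns_of_domain` is an illustrative helper and not part of this notebook.

```python
# Hypothetical subset of the domain grouping listed above
domain_to_questions = {
    "cloud": ["Q26_A", "Q26_B", "Q27_A", "Q27_B"],
    "databases": ["Q29_A", "Q29_B", "Q30"],
}

# Reverse lookup: from a question to its domain
question_to_domain = {q: d for d, qs in domain_to_questions.items() for q in qs}

def columns_of_domain(columns, domain):
    """Return the columns that belong to the questions of a domain."""
    prefixes = domain_to_questions[domain]
    # Match the question itself or one of its option columns ("Q29_A_Part_2"),
    # without confusing "Q3" with "Q30"
    return [col for col in columns
            if any(col == p or col.startswith(p + "_") for p in prefixes)]

columns = ["Q26_A_Part_1", "Q29_A_Part_2", "Q30", "Q3"]
print(question_to_domain["Q30"])
print(columns_of_domain(columns, "databases"))
```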


In [None]:
# Definition of question category 
myCategory = {'time': ['Time from Start to Finish (seconds)'],
              'individual': ['Q1', 'Q2', 'Q3', 'Q4', 'Q5'],
              'programming': ['Q6','Q7', 'Q8', 'Q9'],
              'notebook': ['Q10', 'Q11', 'Q12', 'Q13', 'Q14'],
              'ml': ['Q15', 'Q16', 'Q17', 'Q18', 'Q19', 'Q25', 
                     'Q28_A', 'Q28_B', 'Q33_A', 'Q33_B', 'Q34_A', 'Q34_B', 'Q35_A', 'Q35_B', 'Q36'],
              'professional': ['Q20', 'Q21', 'Q22', 'Q23', 'Q24'], 
              'cloud': ['Q26_A', 'Q26_B', 'Q27_A', 'Q27_B'],
              'db': ['Q29_A', 'Q29_B', 'Q30'],
              'businessintelligence': ['Q31_A', 'Q31_B', 'Q32'],
              'course': ['Q37', 'Q38', 'Q39']
             }

In [None]:
# Explicitly delete the unused variables
del tmpNb, tmpQuest, tmpData, tmpList, tmpCol
# Explicit call to the garbage collector
gc.collect()

# 4 - Data analysis
<a id="4"></a>
After cleaning up our data, it is now time to analyze them with some data visualization [[Jacques2019]](#Jacques2019) to better understand their meaning.



Data visualization is quite simply the process of putting into perspective information that is apparently complex, or buried in a large number of parameters, by representing it in graphical form.
There are different ways of obtaining such representations. Some rely on websites that offer quick "drag and drop" viewing services. We favor code-based approaches using Python and its libraries.

## a) Visualization of univariate analysis
Descriptive analysis (or univariate analysis) provides an understanding of the characteristics of each attribute of the dataset. 
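
For a categorical attribute, the simplest descriptive summary is the count and the share of each value. Here is a minimal sketch on hypothetical answers; the figures quoted in this notebook come from the real dataset, not from this example.

```python
import pandas as pd

# Hypothetical answers to a single-answer question
answers = pd.Series(["Man", "Man", "Woman", "Man", "Prefer not to say"])

# Number of participants per value
counts = answers.value_counts()
# Share of each value, in percent
shares = (answers.value_counts(normalize=True) * 100).round(1)

print(counts.to_dict())
print(shares.to_dict())
```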

### i. What is the gender of participants?

In [None]:
# -- Draw the gender of the participants
# Selection of the used data
tmpData = df_data['Q2'].value_counts() # count the number of participants by gender
# Draw the figure
# (index and values are used directly, which does not depend on the column
#  names produced by reset_index() in the installed pandas version)
fig = go.Figure(data=[go.Pie(
    labels=tmpData.index.tolist(), 
    values=tmpData.tolist(),
    marker_colors=px.colors.qualitative.Prism
)])
fig.update_layout(title_text='The gender of the participants', 
                 title_x=0.5, title_y=0.85)
fig.show()

As in the computer science world in general, gender parity has not yet been reached in data science. Women remain a minority, with 19.4% of our cohort.

### ii. What is the age of participants?

In [None]:
# -- Draw the age of the participants
# Selection of the used data
tmpData = df_data['Q1'].value_counts() # count the number of participants by age
tmpData = tmpData.sort_index() # order by the "label"
# Drawing of the figure
fig = go.Figure(data=[go.Bar(
    x=tmpData.index.tolist(),
    y=tmpData.tolist()
)])
fig.update_layout(title='The age of the participants',
                  title_x=0.5,
                  title_y=0.9,
                  xaxis=dict(title='Age group'),
                  yaxis=dict(title='Count'))
fig.show()

Data science is an emerging technology, so it is not surprising to find many young people in our cohort.

### iii. Which countries do the participants come from?

In [None]:
# -- Draw a world map with the number of participants by country
# Selection of the used data
tmpData = pd.DataFrame({'country':df_data.Q3.value_counts().index, 
                        'count':df_data.Q3.value_counts().values})
# Drawing of the world map
data = dict (
    type='choropleth',
    locations=tmpData['country'], 
    locationmode='country names',
    colorscale='portland',
    colorbar={'len':.6},
    z=tmpData['count'])
layout = dict(
    title = "Distribution of the participants by country",
    title_x = 0.425,
    title_y = 0.92,
    margin={"l":0, "r":0, "t":0, "b":0}
)
fig = go.Figure(data=[data], layout=layout)
fig.update_geos(projection_type="kavrayskiy7") 
fig.show()

The participants come from all over the world, with large contingents from India and the USA. Very few participants live on the African continent.

## b) Visualization of bivariate analysis


Bivariate analysis examines the relationship between two attributes and determines whether the two are correlated. This analysis could be done from two perspectives: qualitative or quantitative analysis.
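
For two categorical attributes, the qualitative view usually starts from a two-way table. The sketch below builds one on hypothetical data; the row-normalized variant is an addition that makes groups of different sizes easier to compare.

```python
import pandas as pd

# Hypothetical answers for two categorical attributes
toy = pd.DataFrame({
    "age": ["18-21", "18-21", "25-29", "25-29", "25-29"],
    "job": ["Student", "Student", "Student", "Data Scientist", "Data Scientist"],
})

# Two-way table of counts
table = pd.crosstab(toy["age"], toy["job"])
# Same table, where each row sums to 1
row_shares = pd.crosstab(toy["age"], toy["job"], normalize="index")

print(table)
print(row_shares.round(2))
```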

### i. Relationship between age and job 

In [None]:
# -- Draw the relationship between age and job
# Selection of the used data (rows with a NaN value are dropped)
tmpData = df_data[['Q1','Q5']].dropna() # age and job
#print(tmpData)

# Create the two-way table between two variables.
tmpData2 = pd.crosstab(tmpData['Q1'], tmpData['Q5'])
tmpData2 = tmpData2.transpose()
# Draw the heatmap
sns.heatmap(tmpData2, 
            annot=True, fmt="d", annot_kws={"size": 7},
            linewidths=.5, cmap='Blues') 
#plt.title("The job by age")
plt.xticks(rotation=40)
plt.show()

This figure shows that many students spend time on Kaggle. Because a notebook encourages a step-by-step approach to a problem and makes most of these steps visible, it is a good way to learn how to deal with data. The second largest group of Kaggle users is made up of data scientists, for whom a notebook is a good format to analyse and present their results.

### ii. Relationship between age and programming experience

In [None]:
# -- Draw the relationship between age and programming experience
# Selection of the used data (rows with a NaN value are dropped)
tmpData = df_data[['Q1','Q6']].dropna() # age and programming experience

# Create the two-way table between two variables.
tmpData2 = pd.crosstab(tmpData['Q1'], tmpData['Q6'])
tmpData2 = tmpData2.transpose()
# Order the rows from the longest to the shortest experience
tmpDico = {'I have never written code':6,'< 1 years':5,'1-2 years':4,
          '3-5 years':3, '5-10 years':2, '10-20 years':1, '20+ years':0}
tmpData2 = tmpData2.reindex(sorted(tmpDico, key=tmpDico.get))

# Draw the heatmap
sns.heatmap(tmpData2, 
            annot=True, fmt="d", annot_kws={"size": 7},
            linewidths=.5, cmap='Blues') 
plt.title("Programming experience by age")
plt.xticks(rotation=40)
plt.show()

As most of our population is made up of young students, it is not surprising to find mostly 1-5 years of coding experience.

### iii. Relationship between age and programming language used

In [None]:
# -- Draw the relationship between age and programming languages
# Selection of the question 7
tmpCol = [col for col in df_data.columns if col.startswith('Q7_Part')]
# Count, for each age group, the participants who selected each language
# (the Q7 columns are boolean after the encoding step, so a sum counts the True values;
#  this replaces DataFrame.append, which was removed in pandas 2.0)
tmpData = df_data[tmpCol].astype(bool).groupby(df_data['Q1'], observed=True).sum()
# Name the columns after the language they refer to
tmpData.columns = colName.loc['refValue', tmpCol]
tmpData = tmpData.sort_index(axis=1)

# Create the two-way table between two variables.
tmpData = tmpData.fillna(0)
tmpData = tmpData.astype(int)
tmpData = tmpData.sort_index(axis=0) # order by the "label"
tmpData = tmpData.transpose()

# Draw the heatmap
sns.heatmap(tmpData, 
            annot=True, fmt="d", annot_kws={"size": 7},
            linewidths=.5, cmap='Blues') 
plt.title("Programming languages used by age")
plt.xticks(rotation=40)
plt.show()

The figure shows that most participants use Python and that the least used languages are Julia and Swift. This is not surprising, because the survey was filled in by the Kaggle community, which mostly programs in Python or R.

# Conclusion
An essential skill for any data scientist is knowing how to present results in the right format. Creating a notebook is a good way to address this problem. It forces the analyst to move forward step by step and to keep looking for the best and most understandable representation.

Thanks to this challenge, I discovered a new way to explore a new dataset. I am used to analysing data with statistics and machine learning approaches implemented in C++, Java and R, and to visualising the results only in the last step of my work. Jupyter Notebook offers a new approach, in which I discovered how important it is to open up the usual black box and to explain it. I still have a lot of features to explore in Jupyter and can't wait to do so with my own datasets, in order to compare the performance with my traditional algorithms in R or Java.

#### Webography

<a id="matplotlib">[matplotlib]</a> Matplotlib library : https://matplotlib.org/<br>
<a id="missingno">[missingno]</a> MissingNo library : https://pypi.org/project/missingno/0.4.2/ <br>
<a id="numpy">[numpy]</a> NumPy library : https://numpy.org/ <br>
<a id="pandas">[pandas]</a> Pandas library : https://pandas.pydata.org/ <br>
<a id="plotly">[plotly]</a> Plotly library : https://plotly.com/python/ <br>
<a id="seaborn">[seaborn]</a> Seaborn library : https://seaborn.pydata.org/ <br>


<a id="Jacques2019">[Jacques2019]</a> JACQUES W. (2019) Data visualization en Python avec des librairies telles que Matplotlib et Seaborn. https://medium.com/france-school-of-ai/data-visualization-en-python-avec-des-librairies-telles-que-matplotlib-et-seaborn-6811385df020<br>
<a id="Kaggle2020">[Kaggle2020]</a> Kaggle (2020) 2020 Kaggle Machine Learning and Data Science Survey : https://www.kaggle.com/c/kaggle-survey-2020<br>
<a id="Keita2017">[Keita2017]</a> KEITA Moussa (2017) Data Science with Python: Algorithm, Statistics, DataViz, DataMining and Machine-Learning. MPRA report : https://mpra.ub.uni-muenchen.de/76653/1/MPRA_paper_76653.pdf<br>
<a id="Kuhn2019">[Kuhn2019]</a> KUHN Max, JOHNSON Kjell (2019) Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press.<br>