# Complete Data Analysis

## **I. Introduction**
### **1.1. Analysis Context**

### **1.2. Objectives**

### **1.3. Research Questions**

## **II. Imports**
### **2.1. Importing Libraries**

In [4]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import the secondary libraries


### **2.2. Importing Datasets**

In [None]:
# Load the dataset into a Pandas DataFrame
df = pd.read_csv('your_dataset.csv') # To read CSV file

# Read data from a URL
#df_url = pd.read_csv('https://example.com/your_data.csv') # To read data from a URL

# Read Excel file
#df_excel = pd.read_excel('your_file.xlsx', sheet_name='Sheet1') # To read Excel file

# Read JSON file
#df_json = pd.read_json('your_file.json') # To read JSON file

# Read from a SQL database
#from sqlalchemy import create_engine # To insert in the secondary libraries
#engine = create_engine('sqlite:///your_database.db')      ### Create a SQLite database engine
#df_sql = pd.read_sql('SELECT * FROM your_table', engine)  ### Read data from a SQL table

# Read Parquet file
#df_parquet = pd.read_parquet('your_file.parquet') # To read Parquet file

# The file types above are the main ones you will encounter in Data Science. Some other file types are listed below:

# Read HDF5 file
#df_hdf5 = pd.read_hdf('your_file.h5', key='your_key') # To read HDF5 file

# Read CERN ROOT file
#import ROOT # PyROOT, to insert in the secondary libraries
#root_file = ROOT.TFile.Open('your_file.root') # To open a ROOT file (returns a TFile handle, not a DataFrame)

# Read Feather file
#df_feather = pd.read_feather('your_file.feather') # To read Feather file

# Read fixed-width file
#column_widths = [10, 15, 20]    ## To define column widths
#df_fixed_width = pd.read_fwf('your_file.txt', widths=column_widths) # To read fixed-width file

# Read data from the clipboard
#df_clipboard = pd.read_clipboard() # To read data from the clipboard

## **III. Initial Data Exploration**
### **3.1. Data Overview (Dimensions, Info, Head of the Dataset)**

In [None]:
# Print the dataframe dimensions


In [None]:
# Print the information of the dataframe


In [None]:
# Print the first 5 rows of the dataframe
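A minimal sketch of the three overview steps above. The toy `df` is a hypothetical stand-in for the dataset loaded in section 2.2.

```python
import pandas as pd

# Hypothetical toy data standing in for the loaded dataset
df = pd.DataFrame({"age": [25, 32, 47], "city": ["Paris", "Lyon", "Nice"]})

print(df.shape)   # (number of rows, number of columns)
df.info()         # column names, non-null counts, dtypes, memory usage
print(df.head())  # first 5 rows by default
```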


### **3.2. Data Cleaning (missing values, duplicated rows, data reduction and outliers)**
**Always save the result of each step in a new variable**
#### **3.2.1. Deal with missing values**

In [None]:
# Print the missing values per column and check which kind of column they belong to (numerical or categorical)


In [None]:
# For both numerical and categorical missing values, the solutions are listed from best to worst
# Deal with categorical missing values by:
# 1. Creating a new category for missing values such as "noclue/missing/unknown", which adds a new dimension that fits reality better

# Note: for boolean columns, consider three-valued logic. See https://github.com/travisjungroth/trinary#examples and this discussion: https://www.reddit.com/r/Python/comments/zudwr6/trinary_a_python_project_for_threevalued_logic_it/

# 2. Dropping rows with missing values

# Deal with numerical missing values by:
# 1. Replacing missing values with 0 and adding a new indicator column (0 = value present, 1 = value missing), again a new dimension

# 2. Replacing missing values with the mode/median (if the data are not normally distributed)

# 3. Replacing missing values with the mean (if the data are normally distributed)

# 4. Dropping rows with missing values
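The options listed above can be sketched as follows. The column names `income` and `segment` are hypothetical, and the median is just one of the listed imputation choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
df = pd.DataFrame({
    "income": [1200.0, np.nan, 3100.0, 2500.0],
    "segment": ["a", None, "b", "a"],
})

# Count missing values per column, next to each column's dtype
missing_report = pd.DataFrame({"n_missing": df.isna().sum(), "dtype": df.dtypes})

# Work on a copy so the original stays available (new variable per step)
df_filled = df.copy()

# Categorical: make "missing" its own category
df_filled["segment"] = df_filled["segment"].fillna("unknown")

# Numerical: flag the missing rows first, then impute (median here)
df_filled["income_missing"] = df_filled["income"].isna().astype(int)
df_filled["income"] = df_filled["income"].fillna(df_filled["income"].median())
```

Creating the indicator column before imputing matters: once the values are filled, the information about which rows were missing is gone.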


In [None]:
# Drop columns with a high percentage of missing values (e.g. more than 60% threshold)
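One possible way to apply the 60% threshold, shown on hypothetical columns `a`, `b` and `c`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% missing
    "c": [1.0, np.nan, 3.0, 4.0, 5.0],           # 20% missing
})

threshold = 0.6
missing_share = df.isna().mean()  # fraction of missing values per column
df_reduced = df.loc[:, missing_share <= threshold]  # keep columns at or below the threshold
```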


#### **3.2.2. Deal with duplicated rows**

In [None]:
# Print the potentially duplicated rows


In [None]:
# Drop the potentially duplicated rows
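A short sketch of both duplicate steps on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3], "value": ["x", "y", "y", "z"]})

# keep=False marks every copy of a duplicated row, not only the repeats
duplicates = df[df.duplicated(keep=False)]
print(duplicates)

# Keep the first occurrence of each row and rebuild a clean index
df_dedup = df.drop_duplicates().reset_index(drop=True)
```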


#### **3.2.3. Deal with constant and low-variance columns**

In [None]:
# Check the number of unique values in each column


In [None]:
# Drop columns with constant values


In [None]:
# Only for predictions: set aside low-variance columns, which can contribute to overfitting (a variance threshold of 0.1 is low enough)
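A sketch of the constant and low-variance checks, using hypothetical column names. Variance depends on the scale of each column, so the 0.1 threshold is only meaningful on comparably scaled features.

```python
import pandas as pd

df = pd.DataFrame({
    "constant": [1, 1, 1, 1],
    "almost_flat": [0.0, 0.0, 0.0, 0.1],
    "useful": [1.0, 5.0, 9.0, 13.0],
})

# A column with a single unique value carries no information
n_unique = df.nunique()
df_no_constant = df.loc[:, n_unique > 1]

# For predictive work, set aside columns whose variance is below the threshold
variances = df_no_constant.var()
low_variance_cols = variances[variances < 0.1].index.tolist()
df_model = df_no_constant.drop(columns=low_variance_cols)
```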

#### **3.2.4. Deal with outliers**

In [None]:
# 1. Handling outliers using the Interquartile Range (IQR):

# Calculate the Interquartile Range (IQR)

# Deal with outliers lying more than 1.5 times the IQR outside the quartiles by:
# 1. Down-weighting (to reduce the weight of outliers)

# 2. Changing the values (winsorization, imputation, trimming)

# 3. Using M-estimation (a robust estimation technique)

# 4. Removing outliers
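The IQR rule above can be sketched as follows. Capping at the fences is shown as a simple winsorization-style option and removal as the last resort; the data are hypothetical.

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0], name="value")

# Interquartile range and the 1.5 * IQR fences
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (s < lower) | (s > upper)

# Option: cap values at the fences instead of dropping them
s_capped = s.clip(lower=lower, upper=upper)

# Option: remove the outliers
s_removed = s[~is_outlier]
```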


In [None]:
# Box Plots for Outlier Detection


#### **3.2.5. Transform data types**

In [None]:
# Print columns data types


In [None]:
# Create a dictionary or a list of the columns to be converted to numeric types (uint8, int8, uint16, int16, uint32, int32, uint64, int64), categorical types (category, bool), descriptive types (object, str) or datetime types (date, datetime, or even a conversion to ordinal)


In [None]:
# Apply the conversion to the columns
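A minimal sketch of a conversion dictionary and its application. The column names and target types are hypothetical choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "count": ["1", "2", "3"],
    "grade": ["low", "high", "low"],
    "signup": ["2023-01-05", "2023-02-10", "2023-03-15"],
})
print(df.dtypes)  # data types before conversion

# Map each column to its target dtype, then convert in one call
conversions = {"count": "int16", "grade": "category"}
df_typed = df.astype(conversions)

# Dates are parsed separately, then optionally turned into ordinal day numbers
df_typed["signup"] = pd.to_datetime(df_typed["signup"])
df_typed["signup_ordinal"] = df_typed["signup"].map(pd.Timestamp.toordinal)
```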


### **3.3. Feature Engineering**
***These are extremely useful examples, but not all of them are necessary***

In [None]:
# 1. Creating New Features:

# Create a new feature by adding two existing features

# Create a new feature by multiplying two existing features

# Create a new feature by calculating the mean of a group

# Binning/Discretization

# 2. Transforming Existing Features:

# Log-transform a numeric feature

# Min-Max scaling

# One-hot encoding for a categorical feature to create dummy variables
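The feature-engineering steps above could look like this on a hypothetical sales table. `log1p` and a hand-written min-max formula are used here as simple choices; other transforms fit the same pattern.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 20.0, 30.0, 40.0],
    "quantity": [1, 2, 3, 4],
    "shop": ["a", "a", "b", "b"],
})
df_feat = df.copy()

# 1. Creating new features
df_feat["price_plus_quantity"] = df_feat["price"] + df_feat["quantity"]
df_feat["revenue"] = df_feat["price"] * df_feat["quantity"]
df_feat["shop_mean_price"] = df_feat.groupby("shop")["price"].transform("mean")
df_feat["price_bin"] = pd.cut(df_feat["price"], bins=[0, 20, 40], labels=["low", "high"])

# 2. Transforming existing features
df_feat["log_price"] = np.log1p(df_feat["price"])  # log(1 + x), safe at 0
price_min, price_max = df_feat["price"].min(), df_feat["price"].max()
df_feat["price_scaled"] = (df_feat["price"] - price_min) / (price_max - price_min)
df_feat = pd.get_dummies(df_feat, columns=["shop"])  # one-hot encoding
```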


In [None]:
# Create 3 subsets 
# 1. numerical one

# 2. categorical one

# 3. descriptive one
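One way to build the three subsets, assuming categorical columns already carry the `category` or `bool` dtype and that everything left over is descriptive text:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32],
    "segment": pd.Categorical(["a", "b"]),
    "comment": ["great", "too slow"],
})

df_num = df.select_dtypes(include="number")              # 1. numerical subset
df_cat = df.select_dtypes(include=["category", "bool"])  # 2. categorical subset
# 3. descriptive subset: whatever is neither numerical nor categorical
df_desc = df.drop(columns=df_num.columns.union(df_cat.columns))
```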


**Concluding sentence on the number of rows dropped, the values replaced, and the reduction in the memory footprint and size of the dataset**

## **IV. Exploratory Data Analysis**
### **4.1. Univariate Analysis**
#### **4.1.1. Quantitative Descriptive Statistical Analysis**

In [None]:
# Print the statistical summary of the variables: descriptive statistics of dispersion indicators (Range, Variance, Standard Deviation) and position indicators (Sum, Median & Quartiles, Mean, Minimum, Maximum, Mode)
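`describe()` covers most of these indicators; the remaining ones can be added by hand, as in this sketch on a hypothetical `height` column.

```python
import pandas as pd

df = pd.DataFrame({"height": [150.0, 160.0, 160.0, 170.0, 180.0]})

# count, mean, std, min, quartiles (including the median) and max
summary = df.describe()
print(summary)

# Indicators that describe() does not report
extra = pd.DataFrame({
    "range": df.max() - df.min(),
    "variance": df.var(),
    "median": df.median(),
    "mode": df.mode().iloc[0],  # first mode if there are several
    "sum": df.sum(),
})
print(extra)
```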


In [None]:
# Visualisation of the Numerical Data Distribution with a histogram


#### **4.1.2. Qualitative Descriptive Statistical Analysis**

In [None]:
# Visualisation of the Categorical Data Distribution


In [None]:
# Print the sums/counts with categorical variables


### **4.2. Bivariate/Multivariate Analysis**

In [None]:
# Print a nice chart with one or two numerical and one categorical variables


In [None]:
# Print a pair plot with multiple numeric variables


In [None]:
# Print a chart with multiple categorical variables


In [None]:
# Create a bar plot for a categorical variable vs. a numeric variable (distribution of the numeric variable across categories)


In [None]:
# Create a violin plot for a categorical variable vs. a numeric variable (distribution of the numeric variable across categories)


In [None]:
# Create a scatter plot for numeric-numeric relationships


In [None]:
# Create and print a cross tabulation for categorical-categorical relationships (optionally aggregating a numeric variable in each cell)
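A sketch of a cross tabulation with `pd.crosstab`, first as plain counts and then with a numeric variable aggregated in each cell. The columns are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["f", "m", "f", "m"],
    "plan": ["basic", "basic", "pro", "pro"],
    "spend": [10.0, 20.0, 30.0, 40.0],
})

# Frequency of each (gender, plan) combination
counts = pd.crosstab(df["gender"], df["plan"])
print(counts)

# Mean of a numeric variable within each combination
mean_spend = pd.crosstab(df["gender"], df["plan"], values=df["spend"], aggfunc="mean")
print(mean_spend)
```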


**You have technically finished the series of statistical analyses**  
**Now state the conclusions of the previous analyses, because the next step is to formulate hypotheses based on them**  



## **V. Inferential Analysis**
### **5.1. Hypotheses**

### **5.2. Hypothesis tests**
#### **5.2.1. Simple procedure tests**

In [None]:
# T-Value

# U-Value

# Z-Value

# Confidence Interval

# F-Value

# P-Value
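A minimal sketch of a t-value, p-value and confidence interval with `scipy.stats`, assuming a one-sample test of a hypothetical sample against a reference mean of 5.0.

```python
import numpy as np
from scipy import stats

sample = np.array([5.1, 4.9, 5.6, 5.8, 5.2, 5.5, 5.0, 5.3])

# One-sample t-test: is the sample mean different from 5.0?
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

# 95% confidence interval for the mean, based on the t distribution
mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)

print(t_stat, p_value, (ci_low, ci_high))
```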

In [None]:
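# Binomial test, to be used for a single categorical variable with 2 values (0 and 1, True and False, Man and Woman, Male and Female...)

# Mann-Whitney U Test

# Wilcoxon Rank Sum Test (equivalent to the Mann-Whitney U test for independent samples; use the Wilcoxon signed-rank test for paired samples)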

# Chi-Square Test
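Some of the tests above, sketched with `scipy.stats` on simulated data. The group means, sample sizes and contingency table are arbitrary assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=50)
group_b = rng.normal(loc=1.0, scale=1.0, size=50)

# Welch's t-test for two independent samples (no equal-variance assumption)
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# Mann-Whitney U test: non-parametric comparison of two independent samples
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Chi-square test of independence on a 2x2 contingency table
observed = np.array([[30, 10], [20, 40]])
chi2, chi2_p, dof, expected = stats.chi2_contingency(observed)

print(t_p, u_p, chi2_p)
```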

#### **5.2.2. Multicollinearity test**

In [None]:
# Print the tolerance value (tolerance = 1 / VIF; it has to be above 0.1 to be non-critical, and above 0.3 is already a good value)

# Print VIF (Variance Inflation Factor), the lower the better (Over 10 is a critical value)
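A sketch that computes VIF and tolerance with NumPy only, using the fact that the VIFs are the diagonal of the inverse correlation matrix of the features. The features `x1`, `x2`, `x3` are simulated, with `x3` built as a near copy of `x1`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)  # nearly collinear with x1
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIF_j = 1 / (1 - R_j^2), read off the diagonal of the inverse correlation matrix
corr = X.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns, name="VIF")
tolerance = (1.0 / vif).rename("tolerance")

print(pd.concat([vif, tolerance], axis=1))
```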


#### **5.2.3. Correlations**

In [None]:
# Print correlation matrix


In [None]:
# You can also use different correlation metrics such as:
# Pearson's correlation coefficient: linear correlation between continuous variables (POORLY SUITED TO DISCRETE VARIABLES; ITS SIGNIFICANCE TEST ASSUMES A NORMAL DISTRIBUTION)

# Spearman's rho: non-parametric rank correlation between ordinal or continuous variables (DOES NOT REQUIRE A NORMAL DISTRIBUTION)

# Kendall's tau: non-parametric rank correlation between ordinal variables (DOES NOT REQUIRE A NORMAL DISTRIBUTION, AND SUITS SMALL DATASETS)
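The three metrics are available through the `method` argument of `DataFrame.corr`. In this sketch the hypothetical `y` is a monotonic but non-linear function of `x`, which is where the rank-based metrics differ from Pearson.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
df["y"] = df["x"] ** 3  # monotonic, but not linear

pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")
kendall = df.corr(method="kendall")

print(pearson.loc["x", "y"], spearman.loc["x", "y"], kendall.loc["x", "y"])
```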


#### **5.2.4. Variance Analysis**

In [None]:
# ANOVA (one-way Analysis of Variance), used to test whether there is a significant difference between the means of two or more groups. (ASSUMES NORMAL DISTRIBUTION)

# MANOVA (Multivariate Analysis of Variance), used to test whether group means differ across several dependent variables at once. (ASSUMES MULTIVARIATE NORMAL DISTRIBUTION)

# Kruskal-Wallis H-test, used to test whether two or more independent groups come from the same distribution, often read as a difference in medians. (DOES NOT REQUIRE NORMAL DISTRIBUTION)

# Friedman Test, the non-parametric alternative to repeated-measures ANOVA when the normality and homogeneity of variance assumptions are violated; it tests for differences between three or more related groups. (DOES NOT REQUIRE NORMAL DISTRIBUTION)
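Three of these tests sketched with `scipy.stats` on hypothetical groups. For the Friedman test the three lists are treated as repeated measurements on the same five subjects. MANOVA is not covered here.

```python
from scipy import stats

g1 = [1.0, 2.0, 3.0, 4.0, 5.0]
g2 = [2.0, 3.0, 4.0, 5.0, 6.0]
g3 = [10.0, 11.0, 12.0, 13.0, 14.0]

# One-way ANOVA: do the group means differ?
f_stat, anova_p = stats.f_oneway(g1, g2, g3)

# Kruskal-Wallis H-test: rank-based counterpart for independent groups
h_stat, kruskal_p = stats.kruskal(g1, g2, g3)

# Friedman test: rank-based test for three or more related samples
friedman_stat, friedman_p = stats.friedmanchisquare(g1, g2, g3)

print(anova_p, kruskal_p, friedman_p)
```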

**Now you can "assume" a causality IN THIS CELL**

## **VI. Predictive Analysis**
### **6.1. Data Preprocessing**

In [None]:
# Separate the target variable (y) from the features (X)


In [None]:
# You can try both separately, but have at least one feature set in which date or datetime data are converted to ordinal values


In [None]:
# Convert categorical features to dummy variables with one-hot encoding (e.g. pd.get_dummies)

#### **6.1.1. Feature/Target Creation & Scaling, Normalization, Transformation, Train-Test Split**

In [None]:
# Scale/transform the numeric features


In [None]:
# Train Test Split
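A sketch of the target separation, split and scaling with scikit-learn. The simulated columns `f1`, `f2`, `target` and the 80/20 split are assumptions. The scaler is fitted on the training set only, a common way to keep test information out of the preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({"f1": rng.normal(size=100), "f2": rng.normal(size=100)})
df["target"] = 2 * df["f1"] - df["f2"]

# Separate the features (X) from the target (y)
X = df.drop(columns="target")
y = df["target"]

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training data, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```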


#### **6.1.2. Preparing Models and Validators**

![Estimators Map](https://scikit-learn.org/stable/_downloads/b82bf6cd7438a351f19fac60fbc0d927/ml_map.svg)

In [None]:
# Check the scikit-learn estimators map for a better understanding of what fits best
# The best approach, though, is to create a list of several models and compare their performance through validation tests
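One possible shape for such a comparison, assuming a regression task on synthetic data. The three estimators and the 5-fold R² scoring are illustrative choices, not a recommendation.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the prepared features and target
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "tree": DecisionTreeRegressor(random_state=0),
}

# Mean cross-validated R² for each candidate model
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
best_name = max(scores, key=scores.get)
print(scores, best_name)
```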



### **6.2. Data Prediction**
#### **6.2.1. Running the Models on the Training Data**

In [None]:
# Fit the model (or each model in the list) and return the predictions y_pred


#### **6.2.2. Checking Model Errors and/or Accuracy**

In [None]:
# Regression validations such as MAE, MAPE, MSE, RMSE, R², Adjusted R²... These are the essential metrics (MAPE is the mean absolute error expressed as a percentage of the true values)

# Classification validations such as Confusion Matrix, Accuracy, Precision, Recall, F1-Score... These are also the 5 essential metrics

# Clustering: algorithms such as KMeans, MeanShift, DBSCAN and BIRCH are validated with metrics such as the Silhouette Coefficient and indexes (Rand, Dunn, VI). MeanShift is a centroid-based alternative to KMeans that does not need the number of clusters in advance
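The regression and classification metrics above are available in `sklearn.metrics`. The true and predicted values below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score,
    mean_absolute_error, mean_absolute_percentage_error,
    mean_squared_error, precision_score, r2_score, recall_score,
)

# Regression metrics
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]
mae = mean_absolute_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)  # returned as a fraction, not in %
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

# Classification metrics
c_true = [1, 0, 1, 1, 0]
c_pred = [1, 0, 0, 1, 1]
cm = confusion_matrix(c_true, c_pred)  # rows = true class, columns = predicted class
accuracy = accuracy_score(c_true, c_pred)
precision = precision_score(c_true, c_pred)
recall = recall_score(c_true, c_pred)
f1 = f1_score(c_true, c_pred)
```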


#### **6.2.3. Visualizing the Models with Charts**

In [None]:
# Use the same charts as before to compare the data visually (you can also add the target variable before removing it to check the differences)


#### **6.2.4. Running the Models on Datasets with a Missing Target**

In [None]:
# Replicate what you did in the training model but to fit the "real-life" data


## **VII. Prescriptive Analysis**

**What CAN be done following the results of the model?**

In [None]:
# Statistical analysis of the dataset with target included


In [None]:
# Chart visualisation of the dataset with target included


**Following the visualisations, what SHOULD be done?**

**Give advice, insights, etc.**