# 1. Exploratory Data Analysis

## Dataset description

### Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
- 1 school - student's school (binary: "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira)
- 2 sex - student's sex (binary: "F" - female or "M" - male)
- 3 age - student's age (numeric: from 15 to 22)
- 4 address - student's home address type (binary: "U" - urban or "R" - rural)
- 5 famsize - family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
- 6 Pstatus - parent's cohabitation status (binary: "T" - living together or "A" - apart)
- 7 Medu - mother's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
- 8 Fedu - father's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
- 9 Mjob - mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
- 10 Fjob - father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
- 11 reason - reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")
- 12 guardian - student's guardian (nominal: "mother", "father" or "other")
- 13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
- 14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
- 15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
- 16 schoolsup - extra educational support (binary: yes or no)
- 17 famsup - family educational support (binary: yes or no)
- 18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
- 19 activities - extra-curricular activities (binary: yes or no)
- 20 nursery - attended nursery school (binary: yes or no)
- 21 higher - wants to take higher education (binary: yes or no)
- 22 internet - Internet access at home (binary: yes or no)
- 23 romantic - with a romantic relationship (binary: yes or no)
- 24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
- 25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
- 26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
- 27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
- 28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
- 29 health - current health status (numeric: from 1 - very bad to 5 - very good)
- 30 absences - number of school absences (numeric: from 0 to 93)

### These grades are related to the course subject, Math or Portuguese:
- 31 G1 - first period grade (numeric: from 0 to 20)
- 32 G2 - second period grade (numeric: from 0 to 20)
- 33 G3 - final grade (numeric: from 0 to 20, output target)

Additional note: there are several (382) students that belong to both datasets.
These students can be identified by searching for identical attributes
that characterize each student, as shown in the annexed R file.
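
The matching described in the note can also be sketched in pandas as an inner merge on the identifying attributes. The example below is only an illustration: the two frames are made-up stand-ins for the real CSV files, and the `key_columns` list is an assumption chosen for this toy data (the annexed R file is not reproduced here, and the real files would need a longer list of attributes).

```python
import pandas as pd

# Toy stand-ins for student-mat.csv and student-por.csv (made-up rows).
mat = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "sex": ["F", "M", "F"],
    "age": [15, 16, 17],
    "G3": [10, 12, 8],
})
por = pd.DataFrame({
    "school": ["GP", "MS", "MS"],
    "sex": ["F", "F", "M"],
    "age": [15, 17, 18],
    "G3": [13, 11, 9],
})

# Assumption: these columns are enough to identify a student in this toy example.
key_columns = ["school", "sex", "age"]

# The inner merge keeps only the rows whose key attributes appear in both tables.
# Non-key columns that exist in both tables (here G3) get a suffix per course.
both = mat.merge(por, on=key_columns, suffixes=("_mat", "_por"))
print(both)
print(len(both), "students found in both toy tables")
```

Matching on attributes rather than on an ID can merge two different students who happen to share the same values, so the count obtained this way is an estimate.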


## Objectives:
- Understand our data as well as possible (a small step forward is better than a big step backward)
- Develop a first modelling strategy
- Predict a student's grades from the available features

## Basic checklist
#### Shape analysis:
- **Target variable**: G1, G2, G3
- **Rows and columns**: 395 rows, 33 columns
- **Variable types**: qualitative: 17, quantitative: 16
- **Missing values**: 0 missing values

#### Content analysis:
- **Target visualisation**:
- **Meaning of the variables**:
- **Variables / target relationship**:

## More detailed analysis
- **Variables / variables relationship**:
- **NaN analysis**:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [None]:
data = pd.read_csv('/kaggle/input/student-alcohol-consumption/student-mat.csv')
df = data.copy()

In [None]:
df.shape

In [None]:
pd.set_option('display.max_rows', 33)  # Show at most 33 rows in pandas output
pd.set_option('display.max_columns', 33)  # Show at most 33 columns in pandas output
df.head()

In [None]:
df.dtypes.value_counts()  # Count how many columns there are of each dtype

In [None]:
plt.figure(figsize=(20,10))
sns.heatmap(df.isna(),cbar=False)
plt.show()
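
The heatmap is hard to read when few or no values are missing, so a numeric summary can complement it. Below is a minimal sketch on a made-up frame (`toy` is hypothetical and contains two deliberately missing cells). Applied to `df`, both results should be zero if the checklist's "0 missing values" holds.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing cells (made-up values, not the real dataset).
toy = pd.DataFrame({
    "age": [15, 16, np.nan, 18],
    "Medu": [4, 1, 2, 3],
    "Mjob": ["teacher", None, "other", "health"],
})

# Fraction of missing values per column, largest first.
missing_ratio = toy.isna().mean().sort_values(ascending=False)

# Total number of missing cells in the whole frame.
total_missing = int(toy.isna().sum().sum())

print(missing_ratio)
print(total_missing)
```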

# Content analysis
## 1. Initial visualisation: examining the "output" column

In [None]:
df['G2'].value_counts(normalize=True)  # Unbalanced classes

## 1.1 Visualising the quantitative variables

In [None]:
for col in df.select_dtypes("int64"):
    # displot is figure-level and opens its own figure, so no plt.figure() call is needed
    sns.displot(df[col], kind='kde')
    plt.show()

## 1.2 Qualitative variables

In [None]:
for col in df.select_dtypes("object"):
    print(f'{col :-<50} {df[col].unique()}')

In [None]:
for col in df.select_dtypes("object"):
    plt.figure()
    df[col].value_counts().plot.pie()
    plt.show()

# Target / variables relationship

## Creating the Excellent and Mauvais (poor) subsets

In [None]:
df['Notes'] = ((df['G1']+df['G2']+df['G3'])/3)//7
# 0 = mean grade in [0, 7)
# 1 = mean grade in [7, 14)
# 2 = mean grade in [14, 21), i.e. up to 20, since 20 is the maximum grade
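
To make the binning rule concrete, here is a small worked example on made-up grades (the `toy` frame is hypothetical, not taken from the dataset). It applies the same floor division and compares it with an explicit `pd.cut` binning, which is shown only as an equivalent alternative, not as the method used in this notebook.

```python
import pandas as pd

# Made-up grades for four students (not taken from the dataset).
toy = pd.DataFrame({
    "G1": [5, 10, 15, 20],
    "G2": [6, 11, 16, 20],
    "G3": [4, 12, 14, 20],
})

# Mean of the three period grades for each student.
mean_grade = toy[["G1", "G2", "G3"]].mean(axis=1)

# Same rule as in the cell above: floor division of the mean grade by 7.
notes_floor = mean_grade // 7

# Equivalent explicit binning; right=False makes each bin [low, high).
notes_cut = pd.cut(mean_grade, bins=[0, 7, 14, 21], right=False, labels=[0, 1, 2])

print(pd.DataFrame({"mean": mean_grade, "floor": notes_floor, "cut": notes_cut}))
```

The floor division returns floats, so `Notes` is stored as a `float64` column. As a result, the loops over `df.select_dtypes("int64")` do not pick it up.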

In [None]:
Mauvais = df[df['Notes'] == 0]
Excellent = df[df['Notes'] == 2]

In [None]:
print(Mauvais.shape)
print(Excellent.shape)
Excellent.head()

In [None]:
Mauvais.head()

In [None]:
for col in df.select_dtypes("int64"):
    plt.figure()
    # kdeplot replaces the deprecated distplot (density curve only, no histogram)
    sns.kdeplot(Excellent[col], label='Excellent')
    sns.kdeplot(Mauvais[col], label='Mauvais')
    plt.legend()
    plt.show()

In [None]:
pd.crosstab(df['Notes'],df['Medu'])
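
Raw counts are hard to compare when the `Medu` levels have very different sizes. One possible complement, sketched here on made-up data (`toy` is hypothetical), is to normalise the crosstab by column so that each cell shows the share of a grade class within one education level.

```python
import pandas as pd

# Made-up pairs of (grade class, mother's education), not the real dataset.
toy = pd.DataFrame({
    "Notes": [0, 0, 1, 1, 2, 2, 2, 1],
    "Medu":  [1, 1, 1, 4, 4, 4, 4, 1],
})

# Raw counts, as in the cell above.
counts = pd.crosstab(toy["Notes"], toy["Medu"])

# normalize="columns" divides each column by its total, so each column sums to 1.
shares = pd.crosstab(toy["Notes"], toy["Medu"], normalize="columns")

print(counts)
print(shares)
```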

In [None]:
for col in df.select_dtypes('object'):
    plt.figure()
    sns.heatmap(pd.crosstab(df['Notes'],df[col]),annot=True,fmt='d')
    plt.show()

# Further analysis

## Variable / variable relationship

In [None]:
sns.pairplot(df,hue='Notes', palette="pastel")

In [None]:
sns.heatmap(df.corr(numeric_only=True))  # numeric_only skips the text columns (pandas >= 1.5)

In [None]:
sns.clustermap(df.corr(numeric_only=True))  # numeric_only skips the text columns (pandas >= 1.5)

## Relationship with study time

In [None]:
for col in df.select_dtypes('int64'):
    # lmplot is figure-level and opens its own figure, so no plt.figure() call is needed
    sns.lmplot(x='studytime', y=col, hue='Notes', data=df)

In [None]:
df.corr(numeric_only=True)['Notes'].sort_values()  # numeric_only skips the text columns (pandas >= 1.5)
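
Since `Notes` is computed from G1, G2 and G3, those three columns correlate strongly with it by construction and tend to dominate the ranking. A minimal sketch, on made-up data (`toy` is hypothetical) and assuming pandas 1.5 or later for `numeric_only`, of ranking only the remaining features:

```python
import pandas as pd

# Made-up rows (not the real dataset), with one text column.
toy = pd.DataFrame({
    "studytime": [1, 2, 3, 4, 2, 3],
    "failures":  [3, 2, 0, 0, 1, 0],
    "Mjob": ["other", "teacher", "health", "teacher", "other", "services"],
    "G1": [5, 8, 14, 17, 10, 15],
    "G2": [6, 9, 15, 18, 11, 15],
    "G3": [4, 9, 15, 19, 10, 16],
})
toy["Notes"] = ((toy["G1"] + toy["G2"] + toy["G3"]) / 3) // 7

# numeric_only=True leaves text columns such as Mjob out of the correlation matrix.
corr = toy.corr(numeric_only=True)["Notes"]

# Drop the grades and Notes itself: they are related to Notes by construction.
corr_features = corr.drop(["G1", "G2", "G3", "Notes"]).sort_values()

print(corr_features)
```

Pearson correlation only captures linear relationships, and `Notes` has just three levels, so this ranking is a rough first indication rather than a measure of feature importance.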