# Training Dataset 

The training dataset is stored in the `dataset_train.csv` file.

## Preview
Read the CSV file and look at the first rows of the dataset.

In [None]:
%pip install seaborn

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

dataset_train = '../datasets/dataset_train.csv'
df = pd.read_csv(dataset_train)
df.head()

The `Hogwarts House` column is the real outcome. We need to train our model on selected features so that the predicted `house` can later be compared to the real one.
We need to inspect the features:
- non-numeric or biographic features (first and last names, birthday and best hand) might improve our model's accuracy but are harder to exploit. We will inspect them in a second round of training
- numeric values for the columns Arithmancy to Flying, which represent 13 features, are a good start

First, we focus on the features with numeric values.
We select the numeric columns, i.e. those whose dtype is a subtype of `np.number`.
There is a total of 13 features with a numeric type; the `Index` column is not counted.
Standardization and handling of missing data (NaN) will be required.

In [None]:
df.select_dtypes(include=np.number).head(10)
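
The standardization and NaN handling mentioned above could look like the minimal sketch below. It is written in plain Python on a toy column, not on the real dataset. Replacing missing values with the column mean is an assumption made for illustration (the notebook does not say which strategy is used), and the helper name `standardize` is hypothetical.

```python
import math


def standardize(values):
    """Return z-scores for a list of floats, imputing NaN with the column mean."""
    # Statistics are computed on the values that are actually present.
    present = [v for v in values if not math.isnan(v)]
    mean = sum(present) / len(present)
    # Sample standard deviation (divides by n - 1).
    variance = sum((v - mean) ** 2 for v in present) / (len(present) - 1)
    std = math.sqrt(variance)
    # Assumption: mean imputation, so a missing value becomes exactly 0 once scaled.
    filled = [mean if math.isnan(v) else v for v in values]
    return [(v - mean) / std for v in filled]


# Toy column with one missing grade.
scores = [2.0, 4.0, float("nan"), 6.0]
z = standardize(scores)
print(z)
```

After this transformation every feature has mean 0 and unit standard deviation, so features with large raw ranges do not dominate the others during training.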

To compute descriptive statistics for the dataset, we could have used the pandas `describe()` method:
```python
df[df.columns[1:]].describe()
```
However, `DataFrame.describe()` is a forbidden function and using it would be considered cheating.

We will use our own `describe.py` Python program, launched from the Jupyter notebook.
For that purpose, we can import the `os` module, to interact with the operating system,
and execute the script with the Python interpreter of our virtual environment.

In [None]:
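import os

# Paths are relative to the current working directory of the notebook kernel.
script_path = '../dslr/describe.py'
dataset_train = '../datasets/dataset_train.csv'
# Run the script with the virtual environment's interpreter.
os.system(f'../venv/bin/python {script_path} {dataset_train}')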
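
For reference, the kind of statistics that `describe()` reports can be computed by hand. The sketch below is not the content of `describe.py`, which is not shown in this notebook. It is a hypothetical, plain-Python illustration on a toy column, and it assumes linear interpolation for the percentiles. The names `percentile` and `describe_column` are made up for the example.

```python
import math


def percentile(sorted_values, p):
    """Percentile p (0-100) of a sorted list, using linear interpolation."""
    k = (len(sorted_values) - 1) * p / 100
    lower = math.floor(k)
    upper = math.ceil(k)
    if lower == upper:
        return sorted_values[lower]
    span = sorted_values[upper] - sorted_values[lower]
    return sorted_values[lower] + span * (k - lower)


def describe_column(values):
    """Count, mean, std, min, quartiles and max of a column, ignoring NaN."""
    clean = sorted(v for v in values if not math.isnan(v))
    count = len(clean)  # assumed to be at least 2
    mean = sum(clean) / count
    # Sample standard deviation (divides by count - 1).
    std = math.sqrt(sum((v - mean) ** 2 for v in clean) / (count - 1))
    return {
        "count": count,
        "mean": mean,
        "std": std,
        "min": clean[0],
        "25%": percentile(clean, 25),
        "50%": percentile(clean, 50),
        "75%": percentile(clean, 75),
        "max": clean[-1],
    }


# Toy column with one missing value.
stats = describe_column([1.0, 2.0, float("nan"), 3.0, 4.0])
for name, value in stats.items():
    print(f"{name:>5} {value:.6f}")
```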


Count of missing and non-missing values for the numeric feature columns:

In [None]:
df[df.select_dtypes(include=np.number).columns[1:]].isna().sum()

In [None]:
df[df.select_dtypes(include=np.number).columns[1:]].notna().sum()

### Plots

Exploring these 13 features.

    Only the meaningful variables should be included.
    The independent variables should be independent of each other. 
    That is, the model should have little or no multicollinearity.
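
One way to check for multicollinearity is to compute the Pearson correlation coefficient between pairs of features: a value close to 1 or -1 means the two features carry almost the same information. The sketch below is a hand-written illustration on synthetic data, not on the real dataset, and the function name `pearson` is hypothetical. Rows where either value is missing are skipped, which is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long lists, skipping rows with NaN."""
    pairs = [(x, y) for x, y in zip(xs, ys)
             if not (math.isnan(x) or math.isnan(y))]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


# Synthetic example: b is exactly -2 * a, so the two are perfectly anti-correlated.
a = [1.0, 2.0, 3.0, float("nan"), 5.0]
b = [-2.0, -4.0, -6.0, -8.0, -10.0]
r = pearson(a, b)
print(r)
```

With the notebook's `df`, the same function could be called on two columns converted with `.tolist()`, for example `Astronomy` and `Defense Against the Dark Arts`.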

In [None]:
df[df.select_dtypes(include=np.number).columns[1:]].columns.to_list()

Plots can be of 3 kinds:
- `distribution`, such as histograms
- `categorical`, such as boxplots
- `relational`, such as scatter plots

We use the inline matplotlib backend, so figures are rendered directly in the notebook.



#### Histogram
Distribution of a given feature among the Hogwarts houses.
It looks like 2 houses (Gryffindor and Slytherin) are not good at `Herbology`.
`Herbology` might be a good feature for our model because its distribution differs between houses, which should help classification.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

df_herb = df.groupby('Hogwarts House')['Herbology']
df_herb.plot(kind='hist', alpha=0.4, legend=True)
plt.show()

#### Boxplot

A boxplot is a graphical and standardised way to display the distribution of data based on five key numbers:

    “minimum”
    1st Quartile (25th percentile)
    median (2nd Quartile/ 50th Percentile)
    3rd Quartile (75th percentile)
    “maximum”
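
The quotes around "minimum" and "maximum" matter: in a boxplot the whiskers usually stop at the most extreme points that are not considered outliers. The sketch below computes these five numbers for a toy list, assuming the common 1.5 × IQR rule for the whiskers. It uses `statistics.quantiles` from the standard library; the function name `box_stats` is hypothetical.

```python
import statistics


def box_stats(values):
    """Five key numbers of a boxplot, plus the points flagged as outliers."""
    # method="inclusive" interpolates linearly between data points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumption: whiskers extend at most 1.5 * IQR beyond the quartiles.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "whisker_low": min(inside),
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Toy data with one obvious outlier.
summary = box_stats([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0])
print(summary)
```

Here the value 50.0 is drawn as an isolated point and the upper whisker stops at 9.0, the largest value inside the fence.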

In [None]:
import seaborn as sns

sns.boxplot(x='Hogwarts House', y='Astronomy', data=df)

#### Scatter plot

In [None]:
plt.scatter('Astronomy', 'Herbology', marker='.', alpha=0.3, data=df)
plt.show()

In [None]:
sns.scatterplot(
 data=df, 
 x="Astronomy",
 y="Herbology",
 hue="Hogwarts House",
)

In [None]:
sns.scatterplot(
 data=df, 
 x="Astronomy",
 y="Defense Against the Dark Arts",
 hue="Hogwarts House",
 legend='auto'
)

#### Pair Plot Matrix

In [None]:
target="Hogwarts House"
remove_list = ['Index', 'First Name', 'Last Name', 'Birthday', 'Best Hand']

df_num = df.drop(remove_list, axis=1)
features = df_num.keys()[1:].to_list()
sns.pairplot(df_num,
                x_vars=features,
                y_vars=features,
                hue=target,
                corner=True
            )

#### Boxplots for all features

In [None]:
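features = df.select_dtypes(include=np.number).columns[1:].to_list()
fig, axs = plt.subplots(4, 4, figsize=(20, 20))
axes = axs.ravel()
# One boxplot per feature, filling the 4x4 grid row by row.
for ax, feature in zip(axes, features):
    sns.boxplot(data=df, x="Hogwarts House", y=feature, ax=ax)
# Hide the unused axes (13 features for 16 slots).
for ax in axes[len(features):]:
    ax.set_visible(False)
plt.show()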

In [None]:
sns.boxplot(data=df, x="Hogwarts House", y="Care of Magical Creatures")

'Arithmancy' and 'Care of Magical Creatures' are really bad features: they cannot classify the houses well.

"Defense Against the Dark Arts" is anti-correlated with "Astronomy", so the two carry redundant information and we keep only "Astronomy".

Thus we can start training our model with `10 features`:

- [ ] 'Arithmancy',
- [X] 'Astronomy',
- [X] 'Herbology',
- [ ] 'Defense Against the Dark Arts',
- [X] 'Divination',
- [X] 'Muggle Studies',
- [X] 'Ancient Runes',
- [X] 'History of Magic',
- [X] 'Transfiguration',
- [X] 'Potions',
- [ ] 'Care of Magical Creatures',
- [X] 'Charms',
- [X] 'Flying'
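
The checklist above can be turned into a list of column names. The sketch below is only an illustration: the constant and function names are hypothetical, and `split_features_target` assumes it receives a pandas DataFrame with the same column names as `df`.

```python
ALL_FEATURES = [
    "Arithmancy", "Astronomy", "Herbology", "Defense Against the Dark Arts",
    "Divination", "Muggle Studies", "Ancient Runes", "History of Magic",
    "Transfiguration", "Potions", "Care of Magical Creatures", "Charms",
    "Flying",
]

# Features left unchecked in the list above.
DROPPED = {
    "Arithmancy",
    "Defense Against the Dark Arts",
    "Care of Magical Creatures",
}

SELECTED_FEATURES = [name for name in ALL_FEATURES if name not in DROPPED]
TARGET = "Hogwarts House"


def split_features_target(frame):
    """Return (X, y) from a DataFrame that contains the columns above."""
    return frame[SELECTED_FEATURES], frame[TARGET]


print(len(SELECTED_FEATURES), SELECTED_FEATURES)
```

In the notebook this could be used as `X, y = split_features_target(df)` before standardizing `X`.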
