![classroom](https://images.unsplash.com/photo-1510531704581-5b2870972060?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=500)

# Getting Started

We're going to analyze student achievement in secondary education at two Portuguese schools. The data attributes include student grades as well as demographic, social, and school-related features. The subjects are Math and Portuguese. There seem to be many interesting insights to dig out of this data. Let's take a look.

# Importing Data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use("ggplot")
import seaborn as sns
sns.set_palette("bwr")

from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score

# Set pandas to display all columns
pd.set_option("display.max_columns", None)

%matplotlib inline

### Importing math performance data

In [None]:
# Student performance data in Mathematics subject

dfmat = pd.read_csv("/kaggle/input/student-performance-data-set/student/student-mat.csv", sep=";")
print(dfmat.shape)
dfmat.head()

In [None]:
dfmat.describe()

### Importing Portuguese performance data

In [None]:
# Student performance data in Portuguese subject

dfpor = pd.read_csv("/kaggle/input/student-performance-data-set/student/student-por.csv", sep=";")
print(dfpor.shape)
dfpor.head()

In [None]:
dfpor.describe()

### Checking duplicate and null values in both dataframes

In [None]:
print(dfmat.duplicated().value_counts(), "\n")
print(dfpor.duplicated().value_counts())

In [None]:
ax = sns.heatmap(dfmat.isnull(), cbar=False)

In [None]:
ax = sns.heatmap(dfpor.isnull(), cbar=False)

Nice! The dataset looks clean: we found no null or duplicate values.
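
A heatmap makes missing values easy to spot, but a numeric check is less ambiguous. The sketch below uses a small made-up frame (`toy` is hypothetical, not part of the dataset) to show the counts one could also print for `dfmat` and `dfpor`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one missing value and one fully duplicated row.
toy = pd.DataFrame({
    "age": [15, 16, 16, 17],
    "G3": [10.0, 12.0, 12.0, np.nan],
})

n_missing = int(toy.isnull().sum().sum())   # total number of missing cells
n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row

print("Missing cells:", n_missing)
print("Duplicate rows:", n_duplicates)
```

Both counts being zero on the real dataframes would confirm what the heatmaps suggest.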

# Preparing The Data

We need to know what each attribute means and what type of data it contains.

In [None]:
# Opening notes from the dataset provider

# Use a context manager so the file is closed automatically
with open("/kaggle/input/student-performance-data-set/student/student.txt", "r") as file:
    print(file.read())

### Map binary values to ones and zeros

In [None]:
# Male, urban, family member > 3, parents living together, yes answers, and Gabriel Pereira schooler
one_values = ["M", "U", "GT3", "T", "yes", "GP"]

# Female, rural, family member <= 3, parents living apart, no answers, and Mousinho da Silveira schooler
zero_values = ["F", "R", "LE3", "A", "no", "MS"]

# Pass the flat lists directly: wrapping them in another list ([one_values])
# would ask pandas to match the whole list instead of its individual values.
for column in dfmat.columns:
    dfmat[column] = dfmat[column].replace(to_replace=one_values, value=1)
    dfmat[column] = dfmat[column].replace(to_replace=zero_values, value=0)

for column in dfpor.columns:
    dfpor[column] = dfpor[column].replace(to_replace=one_values, value=1)
    dfpor[column] = dfpor[column].replace(to_replace=zero_values, value=0)
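
Replacing values across every column works as long as no other column uses the same strings for something else. A more explicit variant maps each binary column with its own dictionary. The sketch below is only an assumption about how that could look, shown on a hypothetical `toy` frame; the column names follow the dataset notes, and `binary_maps` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy frame with three of the binary columns.
toy = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "sex": ["F", "M", "F"],
    "internet": ["yes", "no", "yes"],
})

# One explicit mapping per column, so a code can never leak into another column.
binary_maps = {
    "school": {"GP": 1, "MS": 0},
    "sex": {"M": 1, "F": 0},
    "internet": {"yes": 1, "no": 0},
}

encoded = toy.copy()
for column, mapping in binary_maps.items():
    # Series.map returns NaN for any value missing from the mapping,
    # which makes unexpected codes easy to detect afterwards.
    encoded[column] = toy[column].map(mapping)

print(encoded)
```

The trade-off is verbosity: every binary column has to be listed, but an unexpected value shows up as `NaN` instead of passing through silently.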

### Map nominal values with one-hot encoding

In [None]:
nominal_columns = ["Fjob", "Mjob"]
dfmat = pd.get_dummies(dfmat, columns=nominal_columns, prefix=nominal_columns)
dfpor = pd.get_dummies(dfpor, columns=nominal_columns, prefix=nominal_columns)
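
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a hypothetical `toy` frame (the values are made up). Each distinct job becomes its own indicator column, and every row has exactly one indicator set.

```python
import pandas as pd

# Hypothetical toy frame with one nominal column.
toy = pd.DataFrame({
    "Mjob": ["teacher", "health", "teacher", "other"],
    "G3": [14, 11, 15, 9],
})

encoded = pd.get_dummies(toy, columns=["Mjob"], prefix=["Mjob"])

# The original "Mjob" column is replaced by one indicator column per category.
print(encoded.columns.tolist())

# Each row belongs to exactly one category.
indicators_per_row = encoded.filter(like="Mjob_").sum(axis=1)
print(indicators_per_row.tolist())
```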

"reason" and "guardian" are also nominal attributes. But we'll just drop them because they are not very interesting data and to avoid making too many columns.

In [None]:
dfmat = dfmat.drop(columns=["reason", "guardian"])
dfpor = dfpor.drop(columns=["reason", "guardian"])

dfmat.head()

In [None]:
dfpor.head()

In [None]:
print("Now each dataframe has", dfpor.shape[1], "columns!")

### Joining the Two Datasets

The dataset is split into two subjects (Math and Portuguese). However, several students appear in both datasets. To identify them, we must work out which features describe the student and which describe the study subject (and therefore differ between the two datasets).

From the data above we can see that:

* G1, G2, and G3 are the grades for each subject.
* The attribute "paid" represents extra paid classes within the course subject. Therefore it will likely be different for each study subject.
* absences and failures likely depend on the subject the student is learning.

These attributes are therefore poor candidates for join keys. We will join on all the other attributes.

In [None]:
# Key attributes to join dataset

not_join_val = ['G1', 'G2', 'G3', 'paid', 'absences', 'failures']
join_val = list(set(dfmat.columns) - set(not_join_val))

print("Attributes to be joined:\n\n", join_val)

In [None]:
dfall = dfpor.merge(dfmat, on=join_val, suffixes=["_por", "_mat"])
print(dfall.shape)
dfall.head()
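
The idea behind the merge can be illustrated on made-up data. In the sketch below (all values and the reduced column set are hypothetical), two small frames share student-level attributes and differ only in the subject-specific grade. An inner merge on the shared attributes keeps only the students found in both frames. Note that this treats matching attributes as identity, so two different students with identical attributes would be paired up; that is a limitation of joining without a student ID column.

```python
import pandas as pd

# Hypothetical Portuguese and Math frames with a reduced set of columns.
por = pd.DataFrame({
    "school": [1, 1, 0],
    "sex": [0, 1, 0],
    "age": [15, 16, 17],
    "G3": [12, 10, 14],
})
mat = pd.DataFrame({
    "school": [1, 0, 0],
    "sex": [0, 0, 1],
    "age": [15, 17, 18],
    "G3": [9, 13, 11],
})

subject_specific = ["G3"]
# A list comprehension keeps the column order stable (a set difference does not).
join_keys = [c for c in mat.columns if c not in subject_specific]

# Inner merge: only rows whose join keys match in both frames survive.
both = por.merge(mat, on=join_keys, suffixes=("_por", "_mat"))
print(both)
```

Two of the three toy students match, and their grades end up side by side as `G3_por` and `G3_mat`.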

# Exploring Data

Below, we'll explore the dataset. To be clear about how each dataframe is used:

* The merged dataset (**dfall**) is used to explore student characteristics data (sex, family, etc)
* The dataset for each subject (**dfmat** and **dfpor**) is used to explore data relating to the students' grades.

For the sake of convenience and the DRY (don't repeat yourself) principle, I've also created a function to plot pie charts.

In [None]:
# Defining a simple function to plot pie charts

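def plot_pie(value, title, label=None, count=True, sort=False, legend=False):
    # NOTE: `sort` is currently unused. When `label` is given, the counts keep
    # the order returned by value_counts(sort=False), so `label` must be
    # listed in that same order.
    counts = value.value_counts(sort=(label is None))
    total = counts.sum()

    # Wedge labels: hidden when a legend is used, otherwise index values or `label`
    if legend is True:
        wedge_labels = None
    elif label is None:
        wedge_labels = counts.index.to_numpy()
    else:
        wedge_labels = label

    def autopct(p):
        # Show the percentage, plus the absolute count when requested
        if count is True:
            return f"{p:.2f}%\n{p * total / 100:.0f} items"
        return f"{p:.2f}%"

    plt.figure(figsize=(4, 4))
    plt.pie(counts, startangle=90, labels=wedge_labels, autopct=autopct, pctdistance=0.6)
    if legend:
        plt.legend(labels=label, loc="best", bbox_to_anchor=(1.1, 0., 0.5, 0.5))
    plt.title(title)
    plt.show()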
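
`plot_pie` pairs explicit labels with counts in whatever order `value_counts(sort=False)` returns, so a label can end up on the wrong wedge if that order is not the one expected. One possible safeguard, sketched here on a hypothetical `sex` series rather than the real data, is to sort the counts by their index before attaching labels.

```python
import pandas as pd

# Hypothetical encoded column: 0 = Female, 1 = Male.
sex = pd.Series([1, 0, 1, 1, 0])

# sort_index() orders the counts by the encoded value (0 first, then 1),
# independent of frequency or order of appearance.
counts = sex.value_counts().sort_index()

labels = ["Female", "Male"]  # listed in the same 0, 1 order
labeled = dict(zip(labels, counts.tolist()))
print(labeled)
```

This only helps when the labels are written in index order, which is an assumption the caller still has to respect.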

## The Basic Stuff

### Sex ratio

In [None]:
plot_pie(dfall["sex"], title="Sex ratio", label=["Female", "Male"])

### Age distribution

In [None]:
# Pass the column by keyword: recent seaborn versions expect x=... here
ax = sns.countplot(x=dfall["age"], color="r")

### Urban vs Rural students

In [None]:
plot_pie(dfall["address"], title="Urban vs Rural ratio", label=["Rural", "Urban"])

### Family size of students

In [None]:
plot_pie(dfall["famsize"], title="Family size of students", label=["Three or less", "More than 3"])

### Home to school travel time

In [None]:
plot_pie(
    dfall["traveltime"],
    title="Travel times of students",
    label=["<15 min", "15 to 30 min", "30 min. to 1 hour", ">1 hour"],
    count=False,
    legend=True
)

### Student's father's job

In [None]:
# '^Fjob_' matches only the one-hot columns created from Fjob
dfall.filter(regex='^Fjob_').idxmax(axis=1).value_counts()

In [None]:
plot_pie(
    dfall.filter(regex='^Fjob_').idxmax(axis=1),
    title="Student's father's job",
    label=["Services", "Other", "Stay-at-home", "Teacher", "Healthcare"],
    count=False,
    legend=True
)

### Student's mother's job

In [None]:
# '^Mjob_' matches only the one-hot columns created from Mjob
dfall.filter(regex='^Mjob_').idxmax(axis=1).value_counts()

In [None]:
plot_pie(
    dfall.filter(regex='^Mjob_').idxmax(axis=1),
    title="Student's mother's job",
    label=["Other", "Teacher", "Stay-at-home", "Healthcare", "Services"],
    count=False,
    legend=True
)

### Insights about students' scores in each subject

In [None]:
dfmat[["G1", "G2", "G3"]].describe()

In [None]:
dfpor[["G1", "G2", "G3"]].describe()

### Average G3 score in female vs male students

On average, male students score higher in Math, while female students score higher in Portuguese.

0: Females, 1: Males

In [None]:
# Mathematics
dfmat[["sex","G3"]].groupby("sex").mean()

In [None]:
# Portuguese
dfpor[["sex","G3"]].groupby("sex").mean()

### Average G3 score in rural vs urban students

On average, urban students score higher than rural students in both subjects.

0: Rural, 1: Urban

In [None]:
# Mathematics
dfmat[["address","G3"]].groupby("address").mean()

In [None]:
# Portuguese
dfpor[["address","G3"]].groupby("address").mean()
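
A difference in group means says little without the group sizes and the spread. As a sketch, assuming made-up scores in a hypothetical `toy` frame, `groupby(...).agg(...)` can report the count and standard deviation next to each mean; the same call could be applied to `dfmat` and `dfpor`.

```python
import pandas as pd

# Hypothetical scores: 0 = Rural, 1 = Urban.
toy = pd.DataFrame({
    "address": [0, 0, 1, 1, 1],
    "G3": [8, 12, 10, 14, 12],
})

# Count, mean, and sample standard deviation of G3 per group.
summary = toy.groupby("address")["G3"].agg(["count", "mean", "std"])
print(summary)
```

If the standard deviations are large compared with the gap between the means, the difference should be read with caution.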

In [None]:
# Count students with a grade of 0 in at least one of G1, G2, or G3
print("Students with at least one 0 grade in Math:")
print(dfmat[(dfmat["G1"] == 0) | (dfmat["G2"] == 0) | (dfmat["G3"] == 0)].shape[0], "\n")

print("Students with at least one 0 grade in Portuguese:")
print(dfpor[(dfpor["G1"] == 0) | (dfpor["G2"] == 0) | (dfpor["G3"] == 0)].shape[0])

## The Interesting Stuff

### Correlation heatmap

In [None]:
plt.figure(figsize=(15,15))

corr = dfall.corr()
ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 160, n=256),
    square=True,
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=50,
    horizontalalignment="right"
);

The heatmap above reveals several interesting patterns:

- Math grades are correlated with Portuguese grades
- G1, G2, and G3 in both subjects are strongly correlated with each other
- Female students tend to study more and achieve higher grades
- Male students tend to have higher workday and weekend alcohol consumption (Dalc and Walc)
- Students who failed in Math tend to fail in Portuguese too
- Students who want to pursue higher education tend to fail less and achieve higher grades
- Students who frequently skipped Math class tend to also skip their Portuguese class
- Rural students travel for a longer time than urban students
- Students whose father has a higher education tend to have a mother with a higher education too (and vice versa)
- Mothers who have a higher education are less likely to stay at home and more likely to be teachers
- Students whose mother has a higher education are less likely to fail at Math and Portuguese
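
Reading individual relationships off a large heatmap is error-prone. One way to back up observations like these is to list the strongest pairwise correlations as a table. The sketch below runs on synthetic data: the column names are borrowed from the dataset, but the values are randomly generated, so the numbers say nothing about the real students.

```python
import numpy as np
import pandas as pd

# Synthetic data: G2 is built from G1, absences is independent noise.
rng = np.random.default_rng(0)
g1 = rng.normal(10, 3, size=200)
toy = pd.DataFrame({
    "G1": g1,
    "G2": g1 + rng.normal(0, 1, size=200),
    "absences": rng.normal(5, 2, size=200),
})

corr = toy.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Order the pairs by absolute correlation, strongest first.
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs)
```

Applied to `dfall.corr()`, the top of such a list could be compared directly against the bullet points above.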