# **Final Academic Grade Study Notebook**

## Objectives

* Answer business requirement 1: understand which factors most influence students' academic performance.
  - The client expects to explore correlations and patterns between the final academic grade (`G3`) and variables such as parental education, study time, relationship status, internet access, family support, and other lifestyle indicators.

## Inputs

* Dataset: outputs/datasets/collection/student-por.csv

## Outputs

* Generate code that answers business requirement 1 and can be used to build the Streamlit App


---

# Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
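
As a quick check of what `os.path.dirname()` returns, the minimal sketch below applies it to a made-up relative path. The folder names `project` and `jupyter_notebooks` are assumptions for illustration only and are not taken from this repository.

```python
import os

# Hypothetical folder layout, used only to illustrate the call
notebook_dir = os.path.join("project", "jupyter_notebooks")

# dirname() strips the last path component, leaving the parent folder
parent_dir = os.path.dirname(notebook_dir)
print(parent_dir)  # project
```

Passing that parent path to `os.chdir()` is what moves the working directory up one level.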

# Load Data


In [None]:
import pandas as pd
df = pd.read_csv('outputs/datasets/collection/student-por.csv')
df.head()

---

# Data Exploration

* Now that the data is loaded, we want to get more familiar with the dataset: check variable types and distributions, missing levels, and what these variables mean in a business context.

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

---

## Correlation Study

Pearson and Spearman correlation methods require numerical values. The dataset overview shows that many of the variables are categorical, and these would be ignored in the study. We therefore apply one-hot encoding to turn the categorical values into numeric inputs.

In [None]:
from feature_engine.encoding import OneHotEncoder
encoder = OneHotEncoder(variables=df.columns[df.dtypes=='object'].to_list(), drop_last=True)
df_ohe = encoder.fit_transform(df)
print(df_ohe.shape)
df_ohe.head()

* Without `drop_last`, this method splits each binary variable into two new variables, e.g. `paid_no` and `paid_yes`.
* After a preliminary correlation run, we saw that `higher_yes` and `higher_no` are complementary variables (each is one minus the other), so they inflate our correlation list without adding any new information.
* We set `drop_last=True` so that one of these redundant dummy variables is dropped for each categorical variable.
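
The redundancy described above can be reproduced on a tiny example. The sketch below uses `pandas.get_dummies` as a stand-in for the `feature_engine` encoder, which is an assumption made to keep the example small. Note that `drop_first=True` in pandas drops the first level, whereas `drop_last=True` in `feature_engine` drops the last one. The data values are invented for illustration.

```python
import pandas as pd

# Toy data invented for illustration; not taken from student-por.csv
toy = pd.DataFrame({
    "higher": ["yes", "yes", "no", "yes", "no", "yes"],
    "G3": [14, 12, 8, 15, 9, 11],
})

# Keep both dummy columns: higher_no and higher_yes
full = pd.get_dummies(toy, columns=["higher"], dtype=int)
corr_full = full.corr(method="pearson")["G3"]

# Drop one dummy per variable (pandas drops the first level, here "no")
reduced = pd.get_dummies(toy, columns=["higher"], drop_first=True, dtype=int)

# The two dummies correlate with G3 at the same magnitude, opposite sign
print(corr_full[["higher_no", "higher_yes"]])
print(list(reduced.columns))
```

Because the two dummy columns always sum to 1, keeping both only duplicates a correlation with its sign flipped.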

We use `.corr()` with the `spearman` and `pearson` methods and investigate the correlations with the target variable `G3`.
- This command returns a pandas Series. After sorting, its first item is the correlation between `G3` and `G3`, which is 1, so we exclude it with `[1:]`
- We sort by absolute value by setting `key=abs`

In [None]:
corr_spearman = df_ohe.corr(method='spearman')['G3'].sort_values(key=abs, ascending=False)[1:]
corr_spearman.head(10)

We do the same for Pearson.

In [None]:
corr_pearson = df_ohe.corr(method='pearson')['G3'].sort_values(key=abs, ascending=False)[1:]
corr_pearson.head(10)
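
The ranking step used in the two cells above can be tried on a small frame. This is a minimal sketch on invented data. It drops the self-correlation by label with `.drop("G3")` rather than by position with `[1:]`, which is an alternative shown here for illustration and not the notebook's own approach.

```python
import pandas as pd

# Toy data invented for illustration; not taken from student-por.csv
toy = pd.DataFrame({
    "G3": [10, 12, 14, 9, 16, 11],
    "G2": [9, 12, 13, 10, 16, 11],
    "failures": [1, 0, 0, 2, 0, 1],
    "age": [16, 17, 15, 18, 16, 17],
})

corr_to_target = (
    toy.corr(method="spearman")["G3"]
    .drop("G3")                             # remove the self-correlation by label
    .sort_values(key=abs, ascending=False)  # strongest first, sign preserved
)
print(corr_to_target)
```

Sorting with `key=abs` keeps strong negative correlations near the top instead of pushing them to the end of the list.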

From the above methods:
* `G2` (second period grade) and `G1` (first period grade) are highly correlated (strong positive) with our target variable `G3` (final grade).

* `failures` (number of past class failures) and `higher_yes` (plans for higher education) show low to moderate correlation with `G3`, with `failures` being the negative one.

* `school_GP` shows a positive correlation, which might suggest that Gabriel Pereira is a better academic environment for students, although correlation alone does not establish this.

* We also see that `Dalc` (workday alcohol consumption) and `Walc` (weekend alcohol consumption) are negatively correlated with the final grade.

* Overall, for both methods, the correlation between `G3` and a given variable ranges from weak to strong and can be positive or negative.
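
Both methods are reported because they measure different things: Pearson captures linear association, while Spearman works on ranks and captures any monotonic association. The sketch below illustrates the difference on invented numbers (the columns `x` and `y` are hypothetical and unrelated to the student dataset).

```python
import pandas as pd

# Invented data: y grows monotonically with x, but not linearly
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
toy["y"] = toy["x"] ** 3

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

# Spearman sees a perfect monotonic relationship; Pearson is lower
print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

When the two coefficients for a variable differ noticeably, the relationship with `G3` may be monotonic but not linear.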

Ideally, we pursue strong correlation levels. However, this is not always possible.
* We will consider the top 6 correlation levels (`top_n = 6`) in `df_ohe` and study the associated variables.

In [None]:
top_n = 6
set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

* From this set we select six variables: 'G1', 'G2', 'failures', 'higher_yes', 'school_GP', and 'studytime'.
* These six variables will be examined for their strength in predicting the final grade `G3`.
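
The union of the two top-`n` lists can be illustrated with placeholder names. In this sketch the rankings and the names `var_a` to `var_d` are hypothetical; they only show how the `set(...)` expression merges the two lists.

```python
# Hypothetical rankings, invented for illustration
top_n = 3
pearson_ranked = ["var_a", "var_b", "var_c", "var_d"]
spearman_ranked = ["var_a", "var_c", "var_d", "var_b"]

# Union of the top-n names from each method, duplicates removed
candidates = set(pearson_ranked[:top_n] + spearman_ranked[:top_n])
print(sorted(candidates))  # ['var_a', 'var_b', 'var_c', 'var_d']
```

The resulting set holds between `top_n` and `2 * top_n` names, depending on how much the two rankings overlap.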

In [None]:
vars_to_study = ['G1', 'G2', 'failures', 'higher_yes', 'school_GP', 'studytime']
vars_to_study

## EDA on selected variables

In [None]:
df_eda = df_ohe.filter(vars_to_study + ['G3'])
df_eda.head()

We plot the distribution:

In [None]:
%matplotlib inline

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')


def plot_categorical(df, col, target_var):

    plt.figure(figsize=(12, 5))
    sns.countplot(data=df, x=col, hue=target_var, order=df[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()


def plot_numerical(df, col, target_var):
    plt.figure(figsize=(8, 5))
    sns.histplot(data=df, x=col, hue=target_var, kde=True, element="step")
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()


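target_var = 'G3'
for col in vars_to_study:
    # df_eda is built from the one-hot-encoded frame, so its columns are
    # normally numeric; the categorical branch covers any object column
    if df_eda[col].dtype == 'object':
        plot_categorical(df_eda, col, target_var)
    else:
        plot_numerical(df_eda, col, target_var)
    print("\n\n")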
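
A numeric summary can complement the plots. The sketch below, on invented data, computes the mean of `G3` for each level of one variable. The same pattern could be applied to `df_eda`, which is a suggestion and not part of the original analysis.

```python
import pandas as pd

# Toy data invented for illustration; not taken from student-por.csv
toy = pd.DataFrame({
    "failures": [0, 0, 1, 1, 2],
    "G3": [14, 12, 10, 8, 6],
})

# Mean final grade for each number of past failures
mean_g3 = toy.groupby("failures")["G3"].mean()
print(mean_g3)
```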

---