In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')

# Case Study

It is no secret that students drink alcohol before reaching the legal age. This dataset contains alcohol consumption data for students of two secondary schools, along with data about their grades, their families and how they spend their free time. I hope that by analyzing such an interesting dataset I'll be able to find patterns and correlations between alcohol consumption and academic performance.

In [None]:
MATH = '../input/student-alcohol-consumption/student-mat.csv'

dataset = pd.read_csv(MATH)
dataset.head()

In [None]:
dataset.isna().sum()

# Alcohol consumption

First let's check out the distributions of the two most interesting columns: **Dalc** (workday alcohol consumption) and **Walc** (weekend alcohol consumption). The code below calls them `daily` and `weekly` for short.

In [None]:
def create_plot(n: int, m: int, size: tuple = (12, 5)): 
    fig, ax = plt.subplots(n, m, figsize=size)
    return fig, ax

In [None]:
fig, ax = create_plot(1, 2)

sns.histplot(dataset['Dalc'], ax=ax[0])
sns.histplot(dataset['Walc'], ax=ax[1])
plt.show()

In [None]:
daily = dataset[dataset['Dalc'] >= 3.0]
weekly = dataset[dataset['Walc'] >= 3.0]

daily_count = len(daily)
weekly_count = len(weekly)

p_daily = np.round((daily_count / len(dataset)) * 100)
p_weekly = np.round((weekly_count / len(dataset)) * 100)

print(f'{p_daily}%, {p_weekly}%')

**"Dalc"** and **"Walc"** (workday and weekend alcohol consumption, `daily` and `weekly` in the code) take discrete values from 1 (very low) to 5 (very high). Even though the distributions might look fairly ordinary, one curious observation is that **40% of students have a "Walc" value of 3 or higher**, which can be considered quite high for secondary school students.

Even more disturbing, **11% of students have a "Dalc" value of 3 or higher.** In my opinion this is very worrying.

**Let's take a closer look at those students.**

**It might not be that bad if they aren't too young.**

In [None]:
fig, ax = create_plot(1, 2)

sns.histplot(daily['age'], ax=ax[0], kde=True)
sns.histplot(weekly['age'], ax=ax[1], kde=True)
plt.show()

In [None]:
daily_mean = np.round(daily['age'].mean())
weekly_mean = np.round(weekly['age'].mean())

print(f"Daily mean {daily_mean}, weekly mean {weekly_mean}")

At least the means aren't too low. However, looking at both distributions, it seems odd to me that more younger students than older ones consume alcohol frequently. This might be due to an imbalance in the dataset, so let's check it out!

In [None]:
sns.displot(dataset['age'])
plt.show()

Yes, clearly there are more younger students in this dataset.

Knowing that the data is imbalanced, with ~90% of students aged 15-18 compared to ~10% aged 19-22, let's take a look at some **pivot tables.**
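
The age split quoted above can be computed rather than read off the plot. This is a minimal sketch on made-up ages (the `ages` series is a hypothetical stand-in for `dataset['age']`), so the printed shares are not the notebook's figures.

```python
import pandas as pd

# Hypothetical ages standing in for dataset['age']
ages = pd.Series([15, 16, 16, 17, 17, 18, 18, 18, 19, 22], name='age')

# Bin the ages into the two groups discussed above;
# the intervals are (14, 18] and (18, 22]
groups = pd.cut(ages, bins=[14, 18, 22], labels=['15-18', '19-22'])

# Share of each group, in percent
shares = groups.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

Swapping `ages` for the real column should give the actual split for this dataset.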

In [None]:
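# Helper column of ones so that summing it counts students; assign() works
# on a copy, which keeps the constant column out of `dataset` itself
counted = dataset.assign(count=1)

pivot_1 = pd.pivot_table(counted, 
                         values='count',
                         index='age', 
                         columns='Dalc', 
                         aggfunc='sum', 
                         fill_value=0)

pivot_2 = pd.pivot_table(counted, 
                         values='count', 
                         index='age', 
                         columns='Walc', 
                         aggfunc='sum', 
                         fill_value=0)

fig, ax = create_plot(1, 2)

# fmt='g' prints the counts as plain numbers instead of scientific notation
sns.heatmap(pivot_1, ax=ax[0], annot=True, fmt='g')
sns.heatmap(pivot_2, ax=ax[1], annot=True, fmt='g')
plt.show()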

Pivot tables are great, and even though these are not very representative we can still find some valuable insights in them. For example, the value **1 in both Walc and Dalc is by far the most common for 15-year-olds**: yay, underage students don't drink that much! **Values are more evenly distributed for ages 17 and 18**, which is to be expected. **Anything past 19 is just not worth considering, since we have very little data for those ages.**
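
Because the age groups differ so much in size, raw counts are hard to compare across rows. One option is to turn each row into proportions. This is a sketch on a made-up `toy` frame (hypothetical data, not the real dataset), using `pd.crosstab` with `normalize='index'`.

```python
import pandas as pd

# Hypothetical mini-dataset with the same column names as the notebook
toy = pd.DataFrame({
    'age':  [15, 15, 15, 15, 17, 17, 19],
    'Dalc': [1, 1, 1, 2, 1, 3, 5],
})

# normalize='index' divides every row by its total, so each age sums to 1
shares = pd.crosstab(toy['age'], toy['Dalc'], normalize='index')
print(shares.round(2))
```

Each row now reads as "what fraction of students of this age report each consumption level", which holds up better against the age imbalance than raw counts.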

Now let's take a look at perhaps the most interesting topic: **how alcohol consumption affects academic performance.**

In [None]:
fig, ax = create_plot(3, 2, (20,20))

for idx, col in enumerate(['Dalc', 'Walc']):
    for i in range(3):
        # One panel per (grade, consumption column); rows are G1..G3
        sns.boxplot(data=dataset, x=col, y=f'G{i+1}', ax=ax[i][idx])
plt.show()

Quite interestingly, there doesn't seem to be a noticeable trend in academic performance with respect to alcohol consumption. Most boxplots have a median of around 10, and the quartiles (Q1 and Q3) are very similar in all cases. Higher values of alcohol consumption usually show a narrower range of grades, but this can be attributed to the much smaller number of students who drink that much.
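
The numbers behind a boxplot (group size, quartiles, median) can also be tabulated directly, which makes the small-group caveat visible. This is a sketch on a made-up `toy` frame; `q1` and `q3` are helper names introduced here for illustration.

```python
import pandas as pd

# Hypothetical grades per consumption level
toy = pd.DataFrame({
    'Dalc': [1, 1, 1, 1, 2, 2, 3, 5],
    'G3':   [10, 12, 8, 14, 11, 9, 10, 10],
})

def q1(s):
    """First quartile of a series."""
    return s.quantile(0.25)

def q3(s):
    """Third quartile of a series."""
    return s.quantile(0.75)

# One row per consumption level: group size, lower quartile, median, upper quartile
summary = toy.groupby('Dalc')['G3'].agg(['count', q1, 'median', q3])
print(summary)
```

A level with a `count` of only a handful of students will show a narrow box mostly because there is so little data in it.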

# Academic performance

Let's forget about alcohol for a moment and take a closer look at **grades**. The dataset contains three grades:
* G1 - First period grade
* G2 - Second period grade
* G3 - Final grade

One thing I have always wanted to know is **how strongly grades and study time are really correlated**. But first let's take a look at the distribution of grades.

In [None]:
fig, ax = create_plot(1, 3, (15, 5))

for i in range(3):
    sns.kdeplot(dataset[f'G{i+1}'], ax=ax[i], fill=True, linewidth=2, alpha=0.1)
plt.show()

The distributions are quite close to normal distributions.

In [None]:
fig, ax = create_plot(3, 1, (5, 15))

for i in range(3):
    sns.boxplot(data=dataset, x='studytime', y=f'G{i+1}', ax=ax[i])
plt.show()

Maybe surprisingly, studytime is only slightly correlated with grades. **If not studytime, then what is?**
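
Since studytime is an ordinal 1-4 scale, a rank correlation is a reasonable way to put a number on "slightly correlated". This sketch uses made-up values and computes Spearman's rho as the Pearson correlation of the ranks, so it needs nothing beyond pandas.

```python
import pandas as pd

# Hypothetical study-time levels and final grades
toy = pd.DataFrame({
    'studytime': [1, 1, 2, 2, 3, 3, 4, 4],
    'G3':        [9, 12, 8, 13, 10, 14, 11, 12],
})

# Spearman's rho is the Pearson correlation of the rank-transformed columns
rho = toy.rank().corr().loc['studytime', 'G3']
print(f"Spearman correlation: {rho:.2f}")
```

Values near 0 mean there is hardly any monotone relationship, and values near 1 or -1 mean a strong one.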

In [None]:
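# Correlations are only defined for numeric columns
corr = dataset.select_dtypes(include='number').corr()

# Boolean mask hiding the upper triangle (and the diagonal),
# since the correlation matrix is symmetric
mask = np.triu(np.ones_like(corr, dtype=bool))

fig, ax = create_plot(1, 1, (12,12))

sns.heatmap(corr, mask=mask, square=True, ax=ax)
plt.show()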

It seems that **Medu** and **Fedu** (**mother's education** and **father's education**, respectively) are among the variables most strongly correlated with children's academic performance, although correlation alone does not establish impact. Let's take a look at those variables.
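
Reading the strongest correlations off a heatmap by eye is error-prone. Sorting the target's column of the correlation matrix is a more direct check. The sketch below uses synthetic data: the column names mirror the dataset, but the values and the relationship between them are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic features; G3 is constructed to depend on Medu and failures only
toy = pd.DataFrame({
    'Medu': rng.integers(0, 5, n),
    'failures': rng.integers(0, 4, n),
    'noise': rng.normal(size=n),
})
toy['G3'] = 10 + 1.0 * toy['Medu'] - 2.0 * toy['failures'] + rng.normal(size=n)

# Correlation of every feature with the target, strongest (by magnitude) first
target_corr = (
    toy.corr()['G3']
       .drop('G3')
       .sort_values(key=np.abs, ascending=False)
)
print(target_corr)
```

Sorting by absolute value keeps strong negative correlations (here `failures`) from being buried at the bottom of the list.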

In [None]:
fig, ax = create_plot(1, 2)

sns.kdeplot(dataset['Medu'], ax=ax[0], fill=True, linewidth=2, alpha=0.1)
sns.kdeplot(dataset['Fedu'], ax=ax[1], fill=True, linewidth=2, alpha=0.1)
plt.show()

It looks like mothers are on average better educated than fathers.

# Modeling

**Now it's time for the sexiest part of data science: modeling data with machine learning.**

* Handling categorical variables
* Normalizing the data
* Fitting the models and evaluation

In [None]:
# School is irrelevant for my purpose
dataset = dataset.drop(['school'], axis=1)

In [None]:
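# Toggle: optionally exclude the period grades G1 and G2 from the features
DROP_PERIOD_GRADES = True

if DROP_PERIOD_GRADES:
    # drop() returns a new DataFrame, so the result has to be assigned back
    dataset = dataset.drop(['G1', 'G2'], axis=1)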

# Handling categorical variables

Some categorical variables have only two possible values; these I am simply going to map to either 0 or 1. Others have several classes, and those we can encode with a sparse (one-hot) representation.
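
Both encodings can be sketched on a tiny made-up frame (the values below are hypothetical; the column names follow the dataset). Passing `columns=` to `pd.get_dummies` prefixes each indicator with its source column, which matters when two columns share category labels such as `other`.

```python
import pandas as pd

# Hypothetical rows with one binary and two multi-class columns
toy = pd.DataFrame({
    'sex':  ['F', 'M', 'F'],
    'Mjob': ['teacher', 'other', 'health'],
    'Fjob': ['other', 'other', 'teacher'],
})

# Two-valued column: map each label to 0 or 1 explicitly
toy['sex'] = toy['sex'].map({'F': 0, 'M': 1})

# Multi-valued columns: one indicator column per category, prefixed with the
# source column name so Mjob_other and Fjob_other stay distinct
encoded = pd.get_dummies(toy, columns=['Mjob', 'Fjob'], dtype=int)
print(encoded.columns.tolist())
```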

In [None]:
# Selecting categorical variables
categorical = dataset.select_dtypes(exclude=['int64', 'float64'])
categorical.head()

In [None]:
binary = ['sex', 
          'address', 
          'famsize', 
          'Pstatus', 
          'schoolsup', 
          'famsup', 
          'paid', 
          'activities', 
          'nursery', 
          'higher', 
          'internet', 
          'romantic'
         ]

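for col in list(categorical.columns):
    if col in binary:
        # Map the two observed labels to 0/1 (in order of first appearance)
        first, second = categorical[col].unique()
        categorical[col] = categorical[col].map({first: 0, second: 1})
    else:
        # One-hot encode; the prefix keeps identical labels from different
        # columns (e.g. 'other' in Mjob and Fjob) from overwriting each other
        dummies = pd.get_dummies(categorical[col], prefix=col, dtype=int)
        categorical = pd.concat([categorical.drop(columns=col), dummies], axis=1)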

In [None]:
categorical.head()

Now, with clean categorical variables, let's move on to the numerical columns.

In [None]:
numerical = dataset.select_dtypes(include=['int64', 'float64'])
numerical.head()

In [None]:
dataset_clean = pd.concat([numerical, categorical], axis=1)
dataset_clean.head()

# Numerical data

Preprocessing numerical variables usually means scaling and normalizing them. I am going to use StandardScaler with Pipeline from sklearn during model training, but for now I would like to know if the data forms some kind of pattern.
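
As a quick illustration of what `StandardScaler` does, here is a sketch on a small made-up array: each column is shifted to mean 0 and rescaled to unit standard deviation, so columns on very different scales become comparable.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical columns on very different scales (think age and absences)
X = np.array([[15.0, 0.0], [16.0, 4.0], [17.0, 10.0], [18.0, 30.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, every column has mean 0 and standard deviation 1
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

Putting the scaler inside a `Pipeline` means its mean and variance are learned from the training data only and then reused on the test data.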

# PCA

**PCA** is an algorithm that reduces the dimensionality of a dataset (e.g. reducing the number of columns from 39 to 2) while retaining the most valuable information about the underlying patterns in the data. It is often used to visualize a dataset and can help with model selection. I am going to use sklearn's implementation of PCA.
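
To make the idea concrete, here is a minimal sketch on synthetic data (not the student dataset): six observed columns are generated from two hidden factors, and PCA compresses them back to two components. Standardizing before PCA is a common choice assumed here, so that no single high-variance column dominates the components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# 6 observed columns = linear mixes of 2 latent factors plus a little noise
latent = rng.normal(size=(300, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(300, 6))

# Standardize each column, then keep the first two principal components
reducer = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = reducer.fit_transform(X)

# Share of the total variance captured by each retained component
ratio = reducer.named_steps['pca'].explained_variance_ratio_
print(X_2d.shape, ratio.sum().round(3))
```

The explained variance ratio depends heavily on whether the inputs were scaled first, so the two setups are not directly comparable.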

In [None]:
from sklearn.linear_model import Lasso, LinearRegression, SGDRegressor, ElasticNet
from sklearn.svm import SVR, LinearSVR
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error as mse

In [None]:
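# Every column except the target; selecting by name keeps G3 out of X
# (iloc[:, :-1] would only drop the last dummy column)
X = dataset_clean.drop(columns=['G3'])
y = dataset_clean['G3']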

In [None]:
# Dimensionality after PCA
n_components = 2

pca = PCA(n_components=n_components)

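# Project X onto the first two principal components, then standardize the
# components so that both plot axes share the same scale
principal_components = StandardScaler().fit_transform(pca.fit_transform(X))
# The explained variance ratio is the share of total variance retained after PCA
print(f"Explained variance ratio = {np.sum(pca.explained_variance_ratio_)}")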

pca_df = pd.DataFrame(principal_components)
pca_df = pd.concat([pca_df, y], axis=1)
pca_df.head()

**An explained variance ratio of 0.84 is quite good. Now let's visualize our reduced dataset.**

In [None]:
fig, ax = create_plot(1, 2)

for i in range(n_components):
    ax[i].scatter(x=pca_df[i], y=pca_df['G3']) 

plt.show()

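**The first component of pca_df doesn't tell us much about the patterns in the data, but in the second one there is a clearly visible linear trend. Such a trend is only meaningful if the grade columns themselves are not among the PCA inputs, since a component built from the grades will trivially line up with G3. Let's visualize the whole dataset with a 3D plotly scatterplot.**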

In [None]:
import plotly.express as px

fig = px.scatter_3d(pca_df, x=0, y=1, z='G3',
              color='G3')
fig.show()

**This 3D scatterplot is really helpful for understanding the patterns and correlations in the dataset. Besides that, don't you think it looks super cool?**

# Model selection

Thanks to PCA we already know that the dataset has a linear trend, so the obvious model choice is the beloved **least squares**. However, just for fun I am going to fit a few other models and see how they perform.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
MODELS = [Lasso, LinearRegression, SGDRegressor, ElasticNet, SVR, LinearSVR]

def train_model(model):
    pipe = Pipeline([
        ('scaler', StandardScaler()), 
        (f'{str(model.__name__).lower()}', model())
    ])

    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_test)
    score = mse(y_test, preds)
    print(str(model.__name__) + " score: " + str(score))
    return score

In [None]:
scores = {}

for model in MODELS:
    scores[str(model.__name__)] = train_model(model)

print()
print(min(scores, key=scores.get) + " is the winner!")

No big surprise here: Linear Regression won the competition.
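
A single 80/20 split gives only one noisy estimate per model, so a ranking based on it can change with the random seed. As a possible next step, here is a sketch of a k-fold comparison against a mean-predicting baseline. It runs on synthetic data, and `cv_mse` is a helper name made up for this example.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic regression problem: y depends linearly on the first three columns
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=42)

def cv_mse(estimator):
    """Mean cross-validated MSE (scikit-learn reports it negated)."""
    scores = cross_val_score(estimator, X, y, cv=cv,
                             scoring='neg_mean_squared_error')
    return -scores.mean()

# Baseline that always predicts the training mean vs. scaled linear regression
baseline_mse = cv_mse(DummyRegressor(strategy='mean'))
linear_mse = cv_mse(make_pipeline(StandardScaler(), LinearRegression()))
print(f"baseline: {baseline_mse:.2f}, linear: {linear_mse:.2f}")
```

A model is only worth keeping if it clearly beats the baseline. The same `cv_mse` helper could be applied to each entry of `MODELS` wrapped in the scaling pipeline.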

**To be continued...**