# **Does Alcohol Affect Final Grades?**
By: [Hussain Mansoor, CISA, ICAP-Affiliate](https://www.linkedin.com/in/hussainalyCISA), Dated: July 16, 2020

**Objective:** Utilize data to bring meaningful insights.

**Acknowledgement:** This kernel is inspired by the analysis prepared by [DATAI](https://www.kaggle.com/kanncaa1/does-alcohol-affect-success/)

***

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

# from subprocess import check_output
# print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
plt.rc('figure', figsize=(15, 5))

**A glance at the dataset:**

In [None]:
data = pd.read_csv('../input/student-mat.csv')
rows, columns = data.shape
print(f"The data set has {rows} rows and {columns} columns.")
data.head()

# Scope Definition
Based on the problem statement, our variables of interest are alcohol (X=input) and grades (y=output):
* Alcohol:
    1. `Dalc` - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
    2. `Walc` - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)


* Grades:
    3. `G1` - first period grade (numeric: from 0 to 20)
    4. `G2` - second period grade (numeric: from 0 to 20)
    5. `G3` - final grade (numeric: from 0 to 20, output target)

All of the variables are categorical and discrete, and are measured on an ordinal scale.

In [None]:
alco = data.loc[:, ['Dalc', 'Walc', 'G1', 'G2', 'G3']]

# Describe Data

In [None]:
summary = alco.describe().T
summary

## Measure Central Tendency

In [None]:
summary[['mean', '50%']].plot.bar();

**Conclusion:**  
* Alcohol consumption is higher on weekends than on workdays.
* Average grades at each period (`G1`, `G2`, `G3`) are fairly consistent.

## Measure Spread

In [None]:
sns.boxplot(data=alco);

**Conclusion:**  

* Alcohol:
    - As already noted, consumption is higher during weekends. The measure of spread also shows that some students (outliers) are consistent in their alcohol consumption irrespective of workday or weekend.


* Grades:
    - Grades in `G2` improved compared with `G1`. However, `G2` has a few outliers where the grade is zero, which inflate its standard deviation (that is, the std of `G2` is higher than that of `G1`). Analysing these observations may be of interest, but it is outside the scope of this analysis.
    - The range of grades increased over the periods, that is, from `G1` to `G3`.

For further analysis, we will drop `G1` & `G2` since they are neither our target variable nor explanatory variables.

In [None]:
alco = alco.drop(['G1', 'G2'], axis=1)

## Measure Frequencies

### Categories of each variable

In [None]:
for column in alco.columns:
    print("----- ", column, " -----")
    print(sorted(alco[column].unique()), '\n\n')

**Conclusion:**  
- `Dalc` and `Walc` have the same categories.
- `G3` has some missing categories (1, 2 & 3). The reason for this is not available, hence an assumption needs to be made.
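
The gap in `G3` can also be checked programmatically rather than by reading the printed list. Below is a minimal sketch that uses a small made-up series in place of `alco['G3']`; the values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical grades standing in for alco['G3']; not real data.
grades = pd.Series([0, 0, 4, 5, 10, 10, 11, 15, 20])

# Every grade the 0-20 scale allows.
possible = set(range(0, 21))

# Grades on the scale that never occur in the sample.
missing = sorted(possible - set(grades.unique()))
print(missing)
```

Running the same set difference on the real column would list exactly the grades that no student received.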

### Alcohol

In [None]:
fig, axes = plt.subplots(ncols=2, sharey=True)
for i, column in enumerate(alco.columns[:2]):
    # Pass the series as x=; recent seaborn versions no longer accept it positionally
    sns.countplot(x=alco[column], ax=axes[i]);

**Conclusion:**  
During weekends, students' alcohol consumption shifts from lower categories to the higher categories.

### Grades

**Absolute Frequencies**

In [None]:
sns.countplot(x=alco['G3']);

**Conclusion:**  
Categories 1, 2 & 3 do not exist. It is possible that grades below 4 are recorded as 0.

**Relative Frequencies**

In [None]:
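# distplot is deprecated in recent seaborn releases; histplot with stat='density' is the closest equivalent
sns.histplot(alco['G3'], bins=range(0, 21), kde=True, stat='density');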

**Conclusion:**  
The shape of the distribution is roughly normal, with outliers on the left side.

# Measure Relationship

Let's take the mean of the two alcohol variables, `Dalc` and `Walc`, and store it in a new `Alc` variable for the purpose of measuring the relationship.

In [None]:
alco['Alc'] = alco[['Dalc', 'Walc']].mean(axis=1)
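
Note that `mean(axis=1)` averages across the two columns within each row, so `Alc` moves in steps of 0.5 between 1 and 5. A small sketch with made-up rows (the `toy` frame is hypothetical, not the dataset):

```python
import pandas as pd

# Hypothetical rows standing in for the Dalc/Walc columns; not real data.
toy = pd.DataFrame({'Dalc': [1, 2, 5], 'Walc': [1, 3, 4]})

# Row-wise mean of the two alcohol columns.
toy['Alc'] = toy[['Dalc', 'Walc']].mean(axis=1)
print(toy)
```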

### Correlation

In [None]:
alco_corr = alco.corr()
plt.figure(figsize=(5,5))
sns.heatmap(alco_corr, annot=True, fmt='.2f',
            vmax=1, vmin=-1, center=0,
            mask=np.triu(alco_corr), cmap='coolwarm')
plt.xticks(rotation=90)
plt.yticks(rotation=0);

**Conclusion:**  
The correlation matrix shows no apparent linear relationship between alcohol consumption and grades.
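
`DataFrame.corr()` defaults to the Pearson coefficient, which measures linear association. Since the alcohol variables are ordinal, a rank-based coefficient such as Spearman is one possible cross-check. This is a suggestion rather than part of the original analysis, and the sketch below uses made-up values in a hypothetical `toy` frame.

```python
import pandas as pd

# Hypothetical values standing in for the alco frame; not real data.
toy = pd.DataFrame({
    'Alc': [1.0, 1.5, 2.0, 3.0, 4.5, 5.0],
    'G3':  [12, 15, 9, 14, 10, 13],
})

# Pearson works on the raw values, Spearman on their ranks.
pearson = toy.corr(method='pearson').loc['Alc', 'G3']
spearman = toy.corr(method='spearman').loc['Alc', 'G3']
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

If both coefficients stay close to zero on the real data, the conclusion does not depend on the choice of coefficient.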

### Joint Distribution

In [None]:
joint_dist = pd.crosstab(alco.Alc, alco.G3)
sns.heatmap(joint_dist, annot=True, cbar=False);

**Conclusion:**  
There is no clear relationship between alcohol consumption and grades.


**Limitation:**  
There is not enough data to perform a statistical test such as chi-square.
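
A common rule of thumb for the chi-square test of independence is that the expected count in each cell should be at least 5. When a table has many categories and few observations, many cells fall below that threshold. The sketch below computes expected counts from the margins of a made-up contingency table; the numbers and labels are illustrative assumptions, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical contingency table (rows: alcohol level, columns: grade band).
observed = pd.DataFrame(
    [[30, 10, 2],
     [12,  6, 1],
     [ 3,  2, 0]],
    index=['low', 'medium', 'high'],
    columns=['pass', 'good', 'excellent'],
)

row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
n = observed.to_numpy().sum()

# Expected count under independence: row total * column total / grand total.
expected = pd.DataFrame(
    np.outer(row_totals, col_totals) / n,
    index=observed.index, columns=observed.columns,
)

# Share of cells whose expected count is below the rule-of-thumb minimum of 5.
share_small = (expected < 5).to_numpy().mean()
print(expected.round(2))
print(f"{share_small:.0%} of cells have an expected count below 5")
```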



**Possible Strategy**  
Notwithstanding this limitation, we can group the categories and analyse the relationship once again.

### Grouped Grades

In [None]:
alco['grp_G3'] = pd.cut(alco.G3, bins=[-1, 5, 10, 15, 21], 
                labels=['Poor', 'Fair', 'Good', 'Excellent'])
sns.countplot(x=alco.grp_G3);
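
`pd.cut` builds right-closed intervals by default, so the bins used here are (-1, 5], (5, 10], (10, 15] and (15, 21]. A grade of exactly 10 therefore lands in 'Fair', not 'Good'. A quick check on a few boundary values (the `boundary_grades` series is made up for illustration):

```python
import pandas as pd

# Hypothetical grades chosen to sit on either side of each bin edge.
boundary_grades = pd.Series([0, 5, 6, 10, 11, 15, 16, 20])

# Same bins and labels as the grouping of G3.
groups = pd.cut(boundary_grades, bins=[-1, 5, 10, 15, 21],
                labels=['Poor', 'Fair', 'Good', 'Excellent'])
print(pd.DataFrame({'G3': boundary_grades, 'grp_G3': groups}))
```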

### Grouped/Discretized Consumption

In [None]:
grp_Alc = pd.cut(alco.Alc, bins=np.arange(0, 6), labels=np.arange(1, 6))
sns.countplot(x=grp_Alc);

In [None]:
grp_G3_Alc = (pd.crosstab(alco.grp_G3, grp_Alc)
              .reindex(['Poor', 'Fair', 'Good', 'Excellent']))
sns.heatmap(grp_G3_Alc, annot=True)
plt.yticks(rotation=0);

# Conclusion

Even after grouping the data, the frequency counts do not indicate any clear relationship between alcohol consumption and final grades.
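
Raw counts are hard to compare when the alcohol groups differ in size. One possible follow-up, which is not part of the original analysis, is to normalise the cross-tabulation so that each alcohol group sums to 1 and then compare the grade mix across groups. A sketch on made-up data (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical grouped grades and alcohol levels; not real data.
toy = pd.DataFrame({
    'grp_G3':  ['Poor', 'Fair', 'Good', 'Good', 'Fair', 'Good', 'Poor', 'Fair'],
    'grp_Alc': [1, 1, 1, 1, 2, 2, 3, 3],
})

# normalize='columns' divides each column by its total,
# giving the share of each grade band within an alcohol group.
shares = pd.crosstab(toy.grp_G3, toy.grp_Alc, normalize='columns')
print(shares.round(2))
```

If the columns of such a table look alike on the real data, that would support the conclusion that grades do not vary with consumption.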