# PH1978L Final Group Project
Authors: Chiu-Feng Yap, Allison Shay, Joy Yoo

Given a dataset of demographic, lifestyle, social, and school-related attributes of students, **predict student performance (G3), the final grade (numeric, from 0 to 20)**.

**Three scenarios of predictions will be considered:**
* 1- Classification with two levels (pass/fail)
* 2- Classification with five levels (from I - excellent to V - insufficient)
* 3- Regression, with a numeric output that ranges between 0 and 20

The analysis should compare several machine-learning models: at least one linear model and at least two non-linear models.

Each scenario should also be considered with the G1 and G2 variables excluded from the models.
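
One way to organize the runs with and without G1 and G2 is to build both feature sets up front and loop over them later. The sketch below is only an illustration: the toy frame, its values, and the names `features_by_variant` and `target` are hypothetical, not part of the project data or code.

```python
import pandas as pd

# Hypothetical toy data that mimics a few columns of the grades dataset
toy = pd.DataFrame({
    "studytime": [2, 3, 1],
    "absences": [4, 0, 6],
    "G1": [11, 14, 8],
    "G2": [11, 14, 9],
    "G3": [11, 14, 9],
})

target = toy["G3"]

# Two feature sets: one keeps the earlier period grades, one drops them
features_by_variant = {
    "with_G1_G2": toy.drop(columns=["G3"]),
    "without_G1_G2": toy.drop(columns=["G3", "G1", "G2"]),
}

for name, features in features_by_variant.items():
    print(name, list(features.columns))
```

Keeping the two variants in one dictionary makes it straightforward to fit every model on both and compare the scores side by side.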

In [1]:
# import libraries we will be using:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns

# Adjust notebook settings to widen the notebook
from IPython.display import display, HTML
display(HTML("<style>.container {width:85% !important;}</style>"))
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

In [2]:
# read in the data
df = pd.read_csv("./Data/school_grades_dataset.csv")

# Scenario 1 - Classification with two levels (pass/fail)

In [None]:
df1 = df.copy()  # work on a copy so the derived target column is not added to df itself
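
Whether a per-scenario frame is an alias of `df` or an independent copy matters once new columns are added to it, because a column added through an alias also appears in the original frame and can leak into the features of a later scenario. A minimal sketch of the difference, using a hypothetical toy frame (the names `original`, `alias`, and `independent` are illustrative only):

```python
import pandas as pd

original = pd.DataFrame({"G3": [8, 12, 15]})

# Plain assignment binds a second name to the same DataFrame object
alias = original
alias["flag"] = True
print("flag" in original.columns)   # the new column is visible through both names

# .copy() creates a separate DataFrame, so later changes stay local to it
independent = original.copy()
independent["extra"] = 1
print("extra" in original.columns)  # the original frame is unchanged
```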

In [None]:
# create two level categorical variable for classification with two levels (pass/fail)
# create a list of our conditions
conditions = [
    (df['G3'] <= 10),
    (df['G3'] > 10)
    ]

# create a list of the values we want to assign for each condition
values = ['fail', 'pass' ]

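# create a new column and use np.select to assign values to it using our lists as arguments
# (a string default keeps the result a string array; the integer default of 0 clashes with string values in recent NumPy versions)
df1['G3_pass_fail'] = np.select(conditions, values, default='unknown')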
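
With the conditions above, a final grade of exactly 10 falls in the `fail` class. A quick way to confirm how boundary grades are labeled is to run the same logic on a few hypothetical values; the `toy` frame and the local `conditions` and `values` lists below are for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical grades chosen to straddle the cutoff
toy = pd.DataFrame({"G3": [0, 9, 10, 11, 20]})

conditions = [toy["G3"] <= 10, toy["G3"] > 10]
values = ["fail", "pass"]

# default must be a string so that it shares a dtype with the values
toy["G3_pass_fail"] = np.select(conditions, values, default="unknown")
print(toy)
```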

In [None]:
df1.head(10)

In [None]:
df1.describe()

In [None]:
df1.dtypes

In [None]:
X = df1.drop(['G3', 'G3_pass_fail'], axis=1)  # features: every column except the target and the grade it is derived from
y = df1['G3_pass_fail']  # target: the pass/fail label

# Scenario 2 - Classification with five levels (from I - excellent to V - insufficient)

In [None]:
df2 = df.copy()  # work on a copy so the derived target column is not added to df itself

In [None]:
# create five level categorical variable for classification with five levels (from I - excellent to V - insufficient)
conditions = [
    (df['G3'] <= 4),
    (df['G3'] > 4) & (df['G3'] <=8),
    (df['G3'] > 8) & (df['G3'] <=12),
    (df['G3'] > 12) & (df['G3'] <=16),
    (df['G3'] > 16)
    ]

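# create a list of the values we want to assign for each condition;
# the conditions run from the lowest to the highest grades, so the labels run from V (insufficient) to I (excellent)
values = ['V', 'IV', 'III', 'II', 'I']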

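# create a new column and use np.select to assign values to it using our lists as arguments
# (a string default keeps the result a string array; the integer default of 0 clashes with string values in recent NumPy versions)
df2['G3_five_level'] = np.select(conditions, values, default='unknown')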

In [None]:
X = df2.drop(['G3', 'G3_five_level'], axis=1)  # features: every column except the target and the grade it is derived from
y = df2['G3_five_level']  # target: the five-level label
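
The same five bands can also be expressed with `pd.cut`, which takes the bin edges and labels in one call and returns an ordered categorical. This is an alternative sketch, not the approach used in the cell above; it assumes that I denotes the top band, as the scenario description states, and the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical grades covering every band boundary
toy = pd.DataFrame({"G3": [0, 4, 5, 8, 9, 12, 13, 16, 17, 20]})

# Right-closed bins: (-1, 4], (4, 8], (8, 12], (12, 16], (16, 20]
bin_edges = [-1, 4, 8, 12, 16, 20]
labels_low_to_high = ["V", "IV", "III", "II", "I"]

toy["G3_five_level"] = pd.cut(toy["G3"], bins=bin_edges, labels=labels_low_to_high)
print(toy)
```

Because the bins are defined by their edges, there is no risk of leaving a gap or overlap between neighbouring conditions.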

# Scenario 3 - Regression, with a numeric output that ranges between 0 and 20

In [3]:
df3 = df.copy()  # work on a copy so later changes do not modify df itself

In [4]:
X = df3.drop(['G3'], axis=1)  # features: every column except the target G3
y = df3['G3']  # target: the numeric final grade

In [5]:
df3.head(5)

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,4,0,11,11
1,GP,F,17,U,GT3,T,1,1,at_home,other,course,father,1,2,0,no,yes,no,no,no,yes,yes,no,5,3,3,1,1,3,2,9,11,11
2,GP,F,15,U,LE3,T,1,1,at_home,other,other,mother,1,2,0,yes,no,no,no,yes,yes,yes,no,4,3,2,2,3,3,6,12,13,12
3,GP,F,15,U,GT3,T,4,2,health,services,home,mother,1,3,0,no,yes,no,yes,yes,yes,yes,yes,3,2,2,1,1,5,0,14,14,14
4,GP,F,16,U,GT3,T,3,3,other,other,home,father,1,2,0,no,yes,no,no,yes,yes,no,no,4,3,2,1,2,5,0,11,13,13


In [6]:
df3.dtypes

school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
G1             int64
G2             int64
G3             int64
dtype: object

### Create Dummy Variables for Categorical Variables
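
As a rough illustration of the idea, the sketch below one-hot encodes the non-numeric columns of a hypothetical toy frame with `pd.get_dummies`. The column names match the dataset, but the rows, the `categorical_cols` and `encoded` names, and the choice of `drop_first=True` (which drops one level per variable to avoid redundant columns in linear models) are assumptions made for this example.

```python
import pandas as pd

# Hypothetical rows using a few of the dataset's column names
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "sex": ["F", "M", "F"],
    "age": [18, 17, 15],
})

# Treat every non-numeric column as categorical
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# One indicator column per category level, minus the first level of each variable
encoded = pd.get_dummies(toy, columns=categorical_cols, drop_first=True)
print(encoded.columns.tolist())
```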