# Importing Data

The dataset used in this project is included as `student-data.csv`. This dataset has the following attributes:

- `school` : student's school (binary: "GP" or "MS")
- `sex` : student's sex (binary: "F" - female or "M" - male)
- `age` : student's age (numeric: from 15 to 22)
- `address` : student's home address type (binary: "U" - urban or "R" - rural)
- `famsize` : family size (binary: "LE3" - less than or equal to 3 or "GT3" - greater than 3)
- `Pstatus` : parent's cohabitation status (binary: "T" - living together or "A" - apart)
- `Medu` : mother's education (numeric: 0 - none,  1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
- `Fedu` : father's education (numeric: 0 - none,  1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
- `Mjob` : mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
- `Fjob` : father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
- `reason` : reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")
- `guardian` : student's guardian (nominal: "mother", "father" or "other")
- `traveltime` : home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
- `studytime` : weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
- `failures` : number of past class failures (numeric: n if 1<=n<3, else 4)
- `schoolsup` : extra educational support (binary: yes or no)
- `famsup` : family educational support (binary: yes or no)
- `paid` : extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
- `activities` : extra-curricular activities (binary: yes or no)
- `nursery` : attended nursery school (binary: yes or no)
- `higher` : wants to take higher education (binary: yes or no)
- `internet` : Internet access at home (binary: yes or no)
- `romantic` : in a romantic relationship (binary: yes or no)
- `famrel` : quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
- `freetime` : free time after school (numeric: from 1 - very low to 5 - very high)
- `goout` : going out with friends (numeric: from 1 - very low to 5 - very high)
- `Dalc` : workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
- `Walc` : weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
- `health` : current health status (numeric: from 1 - very bad to 5 - very good)
- `absences` : number of school absences (numeric: from 0 to 93)
- `passed` : did the student pass the final exam (binary: yes or no)



In [13]:
# Import libraries
import numpy as np
import pandas as pd
from time import time
from IPython.display import display

# Read student data
student_data = pd.read_csv("student-data.csv")
print "Student data read successfully!"

Student data read successfully!


# Data Exploration

In [2]:
# Calculate number of students (rows)
n_students = student_data.shape[0]

# Calculate number of columns (includes the target column 'passed')
n_features = student_data.shape[1]

# Calculate passing students
n_passed = sum(student_data['passed'] == 'yes')

# Calculate failing students
n_failed = sum(student_data['passed'] == 'no')

# Calculate graduation rate as a percentage
grad_rate = n_passed / float(n_students) * 100

# Print the results
print "Total number of students: {}".format(n_students)
print "Number of features: {}".format(n_features)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)

Total number of students: 395
Number of features: 31
Number of students who passed: 265
Number of students who failed: 130
Graduation rate of the class: 67.09%


## Preprocess features, Extract Labels

In [3]:
# Extract feature columns
feature_cols = list(student_data.columns[:-1])

# Extract target column 'passed'
target_col = student_data.columns[-1] 

# Show the list of columns
print "Feature columns:\n{}".format(feature_cols)
print "\nTarget column: {}".format(target_col)

# Separate the data into feature data and target data (features and labels, respectively)
features = student_data[feature_cols]
labels = student_data[target_col]

# Show the feature information by printing the first five rows
print "\nFeature values:"
display(features.head())

Feature columns:
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']

Target column: passed

Feature values:


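|   | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences |
|---|--------|-----|-----|---------|---------|---------|------|------|------|------|-----|--------|----------|----------|--------|----------|-------|------|------|--------|----------|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | yes | no | no | 4 | 3 | 4 | 1 | 1 | 3 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | yes | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 4 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | yes | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | yes | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 2 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | yes | no | no | 4 | 3 | 2 | 1 | 2 | 5 | 4 |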


In [4]:

def numerical_transform_features(features):
    # Work on a copy so the caller's DataFrame is left untouched
    # (assigning into a slice triggers pandas' SettingWithCopyWarning)
    features = features.copy()

    # Convert 'yes' and 'no' to the numerical values 1 and 0
    nominal_cols = features.columns[features.dtypes == 'object']
    for col in nominal_cols:
        features[col] = features[col].replace(to_replace={'yes':1, 'no':0})

    # Create dummy variables for the remaining categorical features such as
    # school, famsize, Pstatus, Mjob, Fjob, guardian and reason.
    # Assumes all categorical features are nominal.
    nominal_cols = features.columns[features.dtypes == 'object']
    features = pd.get_dummies(features, prefix=nominal_cols)
    return features
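
To make the two-stage encoding concrete, here is a minimal sketch on a made-up three-row frame. The rows and the names `toy` and `encoded` are illustrative and are not taken from `student-data.csv`. The sketch uses `Series.map` and `select_dtypes` instead of the `replace` call and dtype comparison in the function above, so it shows the idea rather than reproducing the exact implementation.

```python
import pandas as pd

# Hypothetical toy data; not rows from student-data.csv
toy = pd.DataFrame({
    "age": [18, 17, 15],
    "internet": ["yes", "no", "yes"],
    "Mjob": ["teacher", "health", "other"],
})

encoded = toy.copy()

# Stage 1: columns that only contain 'yes'/'no' become 1/0
for col in encoded.select_dtypes(exclude="number").columns:
    if set(encoded[col].unique()) <= {"yes", "no"}:
        encoded[col] = encoded[col].map({"yes": 1, "no": 0})

# Stage 2: the remaining non-numeric columns are one-hot encoded,
# producing one indicator column per category (e.g. Mjob_teacher)
nominal_cols = list(encoded.select_dtypes(exclude="number").columns)
encoded = pd.get_dummies(encoded, columns=nominal_cols)

print(sorted(encoded.columns))
```

Each nominal column with k categories expands into k indicator columns, which is why the number of features grows after this step.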


# Split Data

Steps to take:

1. Convert the features to numerical features.
2. Create training and testing sets, so that the model can be evaluated on data it has not seen.
3. Create folds of the training set for cross-validation and hyperparameter optimization.


In [10]:
from sklearn.model_selection import train_test_split, StratifiedKFold

# Set the number of training points
num_train = 300

# Convert features to numerical data
features_all = numerical_transform_features(features)
labels_all = labels.replace({'yes':1, 'no':0})

# Stratify so that both sets keep the same pass/fail ratio and the training
# data does not end up with very few examples of the failure class
features_train, features_test, labels_train, labels_test = train_test_split(
    features_all, labels_all,
    train_size=num_train,
    stratify=labels_all, random_state=42)

# Folds for cross-validation; materialize the generator returned by split()
# so the folds can be reused more than once
skf = StratifiedKFold(n_splits=10)
cv_folds = list(skf.split(features_train, labels_train))

# Show the results of the split
print "Training set has {} samples.".format(features_train.shape[0])
print "Testing set has {} samples.".format(features_test.shape[0])
print "Training set has {} % of passed students".format(100 * float(sum(labels_train))/labels_train.shape[0])
print "Testing set has {} % of passed students".format(100 * float(sum(labels_test))/labels_test.shape[0])

Training set has 300 samples.
Testing set has 95 samples.
Training set has 67.0 % of passed students
Testing set has 67.3684210526 % of passed students
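
The percentages printed above follow directly from the class counts. As a quick arithmetic check, assume the stratified split placed 201 passing students in the training set, which is the count implied by the reported 67.0 % of 300. The variable names below are illustrative only.

```python
# Totals reported in the data exploration step
n_students, n_passed = 395, 265

n_train = 300
n_test = n_students - n_train           # 95 samples left for testing

# Assumption: 201 passing students in training (implied by 67.0 % of 300)
train_passed = 201
test_passed = n_passed - train_passed   # the remaining passing students

overall_rate = 100.0 * n_passed / n_students
train_rate = 100.0 * train_passed / n_train
test_rate = 100.0 * test_passed / n_test

print("Overall: {:.2f} %".format(overall_rate))
print("Train:   {:.2f} %".format(train_rate))
print("Test:    {:.2f} %".format(test_rate))
```

Both subsets stay within a fraction of a percentage point of the overall pass rate, which is the purpose of passing `stratify=labels_all`.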




# Model Evaluation Metrics
Framework for evaluating model performance
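
The code below scores models with the F1 score, the harmonic mean of precision and recall. As a worked example using the class counts reported earlier, consider a hypothetical baseline that predicts "passed" for all 395 students. The helper name `f1_from_counts` is illustrative and is not part of scikit-learn.

```python
def f1_from_counts(tp, fp, fn):
    """Return F1 = 2 * precision * recall / (precision + recall)."""
    precision = tp / float(tp + fp)   # share of predicted positives that are correct
    recall = tp / float(tp + fn)      # share of actual positives that are found
    return 2 * precision * recall / (precision + recall)

# Always predicting "yes" on the full data set:
# 265 true positives, 130 false positives, 0 false negatives
baseline_f1 = f1_from_counts(tp=265, fp=130, fn=0)
print("Baseline F1: {:.4f}".format(baseline_f1))
```

Because about two thirds of the students passed, this trivial baseline already reaches an F1 of roughly 0.80, so a useful model should score noticeably higher.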


In [None]:
from sklearn.metrics import f1_score

class classifier(object):
    def __init__(self, clf):
        self.clf = clf
        self.train_time = np.inf      # inf indicates the model is not trained yet
        self.predtrain_time = np.inf  # prediction time on training data
        self.predtest_time = np.inf   # prediction time on testing data
        self.test_score = 0
        self.train_score = 0

    def train(self, features_train, labels_train):
        # Fit the wrapped estimator and record how long fitting takes
        start = time()
        self.clf.fit(features_train, labels_train)
        self.train_time = time() - start

    def predict(self, features_train=None, features_test=None):
        # Return a list with one prediction array per data set supplied
        predictions = []
        if features_train is not None:
            start = time()
            labels_train_pred = self.clf.predict(features_train)
            self.predtrain_time = time() - start
            predictions.append(labels_train_pred)

        if features_test is not None:
            start = time()
            labels_test_pred = self.clf.predict(features_test)
            self.predtest_time = time() - start
            predictions.append(labels_test_pred)
        return predictions

    def error_metric(self, features, labels, test=True):
        # Compute the F1 score and store it as the test or the training score
        if test:
            labels_pred = self.predict(features_test=features)[0]
            self.test_score = f1_score(labels, labels_pred)
        else:
            labels_pred = self.predict(features_train=features)[0]
            self.train_score = f1_score(labels, labels_pred)
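
The timing pattern used by the wrapper can be tried in isolation, without scikit-learn. The sketch below assumes a hypothetical `MajorityClassifier` stand-in, which is not part of this notebook or of scikit-learn, and made-up data. It only mimics the `fit`/`predict` interface that the wrapper relies on.

```python
from time import time


class MajorityClassifier(object):
    """Hypothetical stand-in with scikit-learn-style fit/predict methods."""

    def fit(self, X, y):
        labels = list(y)
        # Remember the most frequent label seen during training
        self.majority_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        # Predict the remembered majority label for every sample
        return [self.majority_ for _ in X]


# Made-up data: three positive labels and one negative label
X_train, y_train = [[0], [1], [2], [3]], [1, 1, 1, 0]
X_test = [[4], [5]]

model = MajorityClassifier()

# Time the fit call, as the wrapper's train() does
start = time()
model.fit(X_train, y_train)
train_time = time() - start

# Time the predict call, as the wrapper's predict() does
start = time()
predictions = model.predict(X_test)
pred_time = time() - start

print(predictions)
```

Any estimator exposing `fit` and `predict` can be measured the same way, which is what lets a single wrapper compare several models.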
    


In [None]:
from sklearn.neighbors import DistanceMetric, KNeighborsClassifier
