# **An Evaluation on Student's Performance in Exams**

I downloaded a CSV dataset from the internet and evaluated several classifiers on it as a submission requirement for the class Advance Database Management System MIT504.

**Overview**

The classification task relates student performance to gender, ethnicity, the parents' highest educational attainment and so on. Since several factors are involved, I will focus on the gender class.

This dataset is composed of basic student information such as gender, ethnicity and the parents' highest educational attainment. It serves as a guide in determining whether these factors really affect the performance of the student. With these factors, or columns, we can identify which factors must be considered when enhancing the performance of our students in the future.

Acknowledgement to my classmates, who offered help in using this type of "project repository", "kernel" or "notebook", since I do not yet have firm knowledge of creating my own "kernel" from scratch.

# About this Dataset
Attribute Information: 

* **gender**: male, female

* **race/ethnicity**: group A, group B, group C, group D, group E 

* **parental level of education**: some college, associate degree, high school, some high school, bachelor's degree

* **lunch**: standard, free/reduced

* **test preparation course**: none, completed

* **math score**: (student's score range from 1-100)

* **reading score**: (student's score range from 1-100)

* **writing score**: (student's score range from 1-100)

# I. Initialization
Let's load the dataset CSV file.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

stud_perf = pd.read_csv('../input/studentsperformance/datasets_74977_169835_StudentsPerformance.csv')

# II. Visualization or Data Analysis Exploration

1) Select the columns that describe the student (gender, race/ethnicity, parental level of education, test preparation course)

2) Run some statistical analysis

3) Assign numerical values to the categorical values in order to feed the Logistic Regression algorithm

# A. To identify and visualize the count of students in each race or ethnicity group.
    
    a. Using the column data: "race/ethnicity"

In [None]:
#Obtain total number of students for each 'race/ethnicity' group (Entire DataFrame)
p_race = stud_perf['race/ethnicity'].value_counts()
p_race_height = p_race.values.tolist() #Provides numerical values
p_race_labels = p_race.index.tolist() #Converts the index (row labels) to a list

#=====PLOT Preparations and Plotting====#
ind = np.arange(5)  # the x locations for the groups
width = 0.7        # the width of the bars
colors = ['#FD1414','#FFF012','#11F237','#1155F2','#B611F2']
fig, ax = plt.subplots(figsize=(5,7))
stud_perf_bars = ax.bar(ind, p_race_height , width, color=colors)

#Add some text for labels, title and axes ticks
ax.set_xlabel("Ethnicity Group",fontsize=20)
ax.set_ylabel('Count',fontsize=20)
ax.set_title('Race/Ethnicity',fontsize=22)
ax.set_xticks(ind) #Positioning on the x axis
ax.set_xticklabels(p_race_labels, fontsize = 12) #Labels follow the value_counts order

#Auto-labels the number of students for each bar.
def autolabel(rects,fontsize=14):
    """
    Attach a text label above each bar displaying its height
    """
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1*height,'%d' % int(height),
                ha='center', va='bottom',fontsize=fontsize)
autolabel(stud_perf_bars)        
plt.show() #Display bars. 

# B. To compare the number of female and male students within each ethnicity group.
     
     a. Using the column data: "gender"

In [None]:
female_count = [] #female
male_count = []    #male
for genCount in p_race_labels:
    size = len(stud_perf[stud_perf['race/ethnicity'] == genCount].index)
    f_c = len(stud_perf[(stud_perf['race/ethnicity'] == genCount) & (stud_perf['gender'] == 'female')].index)
    female_count.append(f_c)
    male_count.append(size-f_c)
                        
#=====PLOT Preparations and Plotting====#
width = 0.40
fig, ax = plt.subplots(figsize=(12,7))
female_bar_value = ax.bar(ind, female_count , width, color='#FF3A75')
male_bar_value = ax.bar(ind+width, male_count , width, color='#0A0AFF')

#Add some text for labels, title and axes ticks
ax.set_xlabel("Ethnicity Group",fontsize=20)
ax.set_ylabel('Count',fontsize=20)
ax.set_title('Gender Comparison By Group',fontsize=22)
ax.set_xticks(ind + width / 2) #Positioning on the x axis
ax.set_xticklabels(p_race_labels, fontsize = 12) #Labels follow the value_counts order
ax.legend((female_bar_value,male_bar_value),('Female Count','Male Count'),fontsize=17)
autolabel(female_bar_value, 10)
autolabel(male_bar_value, 10)
plt.show()
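
The per-group gender counts built in the loop above could also be obtained in one step with `pd.crosstab`. The following is a minimal sketch on a hypothetical six-row frame (`toy` and its values are made up for illustration; the notebook itself would pass `stud_perf`).

```python
import pandas as pd

# Hypothetical miniature of the dataset; only the two columns used here.
toy = pd.DataFrame({
    "race/ethnicity": ["group A", "group B", "group B", "group C", "group C", "group C"],
    "gender": ["female", "male", "female", "male", "male", "female"],
})

# Rows are ethnicity groups, columns are genders, cells are student counts.
counts = pd.crosstab(toy["race/ethnicity"], toy["gender"])
print(counts)

# Lists in the same shape as the female/male count lists used for the bars.
toy_female_count = counts["female"].tolist()
toy_male_count = counts["male"].tolist()
```

Note that `crosstab` sorts the groups alphabetically, so the rows would need reordering (for example with `counts.loc[p_race_labels]`) to match the bar order used in the plot.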

# **C. To visualize the differences in the students' math, reading and writing scores**
* Using the columns 'gender', 'test preparation course', 'math score', 'reading score' and 'writing score'

In [None]:
plt.figure(figsize=(20,8))
plt.subplot(1, 3, 1)
sns.barplot(x='test preparation course',y='math score',data=stud_perf,hue='gender',palette='summer')
plt.title('MATH SCORES')
plt.subplot(1, 3, 2)
sns.barplot(x='test preparation course',y='reading score',data=stud_perf,hue='gender',palette='summer')
plt.title('READING SCORES')
plt.subplot(1, 3, 3)
sns.barplot(x='test preparation course',y='writing score',data=stud_perf,hue='gender',palette='summer')
plt.title('WRITING SCORES')
plt.show()

# D. To visualize the relationship between the parental educational level and the math, reading and writing scores of the students.
    
    a. Using the column data: "parental level of education", "math score", "reading score", "writing score"

In [None]:
stud_perf['Total Score']=stud_perf['math score']+stud_perf['reading score']+stud_perf['writing score']

In [None]:
fig,ax=plt.subplots()
sns.barplot(x='parental level of education',y='Total Score',data=stud_perf,palette='Wistia')
fig.autofmt_xdate()

Interpreting the graph above, students whose parents have a higher level of education tend to get higher total scores across math, reading and writing.
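
The bar heights in the plot are group means, so the same comparison can be read off numerically with a `groupby`. This is a minimal sketch on hypothetical rows (the `toy` frame and its scores are invented for illustration, not taken from the dataset).

```python
import pandas as pd

# Hypothetical rows; the notebook would use stud_perf instead.
toy = pd.DataFrame({
    "parental level of education": ["high school", "high school",
                                    "bachelor's degree", "bachelor's degree"],
    "math score": [60, 70, 80, 90],
    "reading score": [65, 75, 85, 95],
    "writing score": [62, 72, 82, 92],
})
toy["Total Score"] = toy["math score"] + toy["reading score"] + toy["writing score"]

# Mean total score per education level, highest first.
mean_total = (
    toy.groupby("parental level of education")["Total Score"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_total)
```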

# A comparison of scores: male vs female based on test preparation course

In [None]:
plt.figure(figsize=(20,8))
plt.subplot(1, 3, 1)
sns.barplot(x='test preparation course',y='math score',data=stud_perf,hue='gender',palette='summer')
plt.title('MATH SCORES')
plt.subplot(1, 3, 2)
sns.barplot(x='test preparation course',y='reading score',data=stud_perf,hue='gender',palette='summer')
plt.title('READING SCORES')
plt.subplot(1, 3, 3)
sns.barplot(x='test preparation course',y='writing score',data=stud_perf,hue='gender',palette='summer')
plt.title('WRITING SCORES')
plt.show()

After generating the graph, we can say that:
* The first plot shows that the math scores of boys are better, irrespective of whether they completed the course or not.
* The next two plots show that girls perform better in reading and writing.
* Lastly, all three graphs make it clear that students who completed the course tend to achieve higher scores.
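
The size of the test-preparation effect can also be estimated directly as a difference of group means. Below is a minimal sketch using `pivot_table` on hypothetical values (the `toy` frame is invented; `gain` is a name introduced here for the completed-minus-none difference).

```python
import pandas as pd

# Hypothetical rows standing in for stud_perf.
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "test preparation course": ["completed", "none", "completed", "none"],
    "math score": [70, 60, 75, 68],
    "reading score": [80, 72, 74, 66],
})

# Mean score per (course status, gender) combination.
means = toy.pivot_table(
    index="test preparation course",
    columns="gender",
    values=["math score", "reading score"],
    aggfunc="mean",
)

# Positive values mean the "completed" group scored higher on average.
gain = means.loc["completed"] - means.loc["none"]
print(gain)
```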

# Testing a random sample of the dataset of 500 students

In [None]:
#Draw 500 distinct rows at random (replace=False means no row is picked twice)
stud_perf_sample = stud_perf.loc[np.random.choice(stud_perf.index, 500, replace=False)]
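
A sample of the 500 students named in the heading could also be drawn with pandas' own `DataFrame.sample`. This is a sketch on a hypothetical 1000-row frame; the `random_state` value is an assumption added here so that the draw is repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 1000 rows standing in for stud_perf.
toy = pd.DataFrame({"math score": np.arange(1000)})

# Draw 500 distinct rows; random_state makes the draw repeatable.
sample = toy.sample(n=500, replace=False, random_state=42)
print(sample.shape)
```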

In [None]:
#Get all unique race/ethnicity
stud_perf_sample['race/ethnicity'].unique()

In [None]:
#Get 'race/ethnicity' Series
genCount = stud_perf_sample['race/ethnicity']

#Get the total number of students for each unique race/ethnicity group.
genCount.value_counts()


# Dataset Balancing / Cleanup

To test whether my dataset is balanced, the two classes must be compared.

In [None]:
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

stud_perf = pd.read_csv('../input/studentsperformance/datasets_74977_169835_StudentsPerformance.csv')

sns.countplot(data=stud_perf, x="gender").set_title("Class Outcome - Female-F/Male-M")

From the graph above, we could say that the female and male populations are not exactly balanced. This class is not the main basis of the dataset, though; it is a supplementary factor that must be considered.
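
Class balance can also be quantified instead of judged by eye. The sketch below uses `value_counts(normalize=True)` on a hypothetical series; the 52/48 split is invented for illustration and is not a figure read from the dataset.

```python
import pandas as pd

# Hypothetical class labels; the notebook would use stud_perf["gender"].
gender = pd.Series(["female"] * 52 + ["male"] * 48, name="gender")

# Share of each class, as fractions that sum to 1.
share = gender.value_counts(normalize=True)

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = share.max() / share.min()
print(share)
print(round(imbalance_ratio, 3))
```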

# III. Classifier Setups and Build Model

The next step is to set up the classifiers and build the models, as indicated in the instructions:

* Logistic Regression
* Support Vector Machines (SVC)
* K-Nearest Neighbours (K-NN)
* Naive Bayes classifier
* Decision Tree Classifier


# A. Importing the Libraries

In [None]:
import warnings
warnings.simplefilter("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

# B. Checking for nulls

In [None]:
stud_perf.isnull().sum()

# Description of Dataset

In [None]:
stud_perf.describe()

# The "gender" column is the response and the remaining columns are predictors.

> **Separating Predictors and Response**

In [None]:
X=stud_perf.drop('gender',axis=1) #Predictors
y=stud_perf['gender'] #Response
X.head()

# C. Encoding categorical data > Label encoding

In [None]:
from sklearn.preprocessing import LabelEncoder
Encoder_X = LabelEncoder() 
for col in X.columns:
    X[col] = Encoder_X.fit_transform(X[col])
Encoder_y=LabelEncoder()
y = Encoder_y.fit_transform(y)

In [None]:
X.head()

In [None]:
y

# male = 1

# female = 0
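
`LabelEncoder` assigns integer codes in sorted label order, which is what determines the code each `gender` value receives. A minimal sketch on hypothetical values (the `mapping` dictionary is a helper introduced here for illustration):

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["male", "female", "female", "male"])

# classes_ holds the labels in sorted order; the position is the code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the codes back into the original labels.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```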

# D. Getting dummy variables

In [None]:
X=pd.get_dummies(X,columns=X.columns,drop_first=True)
X.head()
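
With `drop_first=True`, `get_dummies` drops the first category of every column, so a column with k categories becomes k-1 indicator columns. A small sketch on a hypothetical two-column frame (values invented for illustration):

```python
import pandas as pd

# Hypothetical categorical columns standing in for the predictors in X.
toy = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard"],
    "race/ethnicity": ["group A", "group B", "group C"],
})

# One indicator column per category, minus the first category of each column.
dummies = pd.get_dummies(toy, columns=toy.columns, drop_first=True)
print(dummies.columns.tolist())
```

The dropped category ("free/reduced" and "group A" here) is the baseline, represented by all of that column's indicators being zero.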

# E. Splitting the dataset into the Training set and Test set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
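
With `test_size=0.3`, a 1000-row dataset yields 700 training and 300 test rows. The sketch below illustrates this on hypothetical arrays; the `stratify` argument is an assumption added here (the notebook does not use it) to keep the class proportions equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features and labels with 1000 rows.
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.array([0] * 520 + [1] * 480)

# 70/30 split; stratify keeps the 52/48 class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
```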

# F. Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
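
The scaler is fitted on the training set only, and the test set is transformed with the training mean and standard deviation, so no information from the test set leaks into preprocessing. A minimal sketch on hypothetical one-column data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

print(train_scaled.ravel())
print(test_scaled.ravel())
```

After scaling, the training column has mean 0 and standard deviation 1, while the test value is expressed in training standard deviations from the training mean.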

# G. Applying PCA with n_components = 2

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)

X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
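
Reducing to two components discards part of the variance, and `explained_variance_ratio_` reports how much each retained component keeps. The sketch below uses hypothetical random data (`X_demo` is invented); in the notebook the same attribute could be read from the fitted `pca` object.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical scaled feature matrix: 200 rows, 5 columns.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))

pca_demo = PCA(n_components=2).fit(X_demo)

# Fraction of the total variance kept by the two components together.
retained = pca_demo.explained_variance_ratio_.sum()
print(pca_demo.explained_variance_ratio_, retained)
```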

# H. Functions to visualize Training & Test Set Results.

In [None]:
def visualization_train(model):
    sns.set_context(context='notebook',font_scale=2)
    plt.figure(figsize=(16,9))
    from matplotlib.colors import ListedColormap
    X_set, y_set = X_train, y_train
    X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
    plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.6, cmap = ListedColormap(('red', 'green')))
    plt.xlim(X1.min(), X1.max())
    plt.ylim(X2.min(), X2.max())
    for i, j in enumerate(np.unique(y_set)):
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                    c = ListedColormap(('red', 'green'))(i), label = j)
    plt.title("%s Training Set" %(model))
    plt.xlabel('PC 1')
    plt.ylabel('PC 2')
    plt.legend()
def visualization_test(model):
    sns.set_context(context='notebook',font_scale=2)
    plt.figure(figsize=(16,9))
    from matplotlib.colors import ListedColormap
    X_set, y_set = X_test, y_test
    X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                         np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
    plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
                 alpha = 0.6, cmap = ListedColormap(('red', 'green')))
    plt.xlim(X1.min(), X1.max())
    plt.ylim(X2.min(), X2.max())
    for i, j in enumerate(np.unique(y_set)):
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                    c = ListedColormap(('red', 'green'))(i), label = j)
    plt.title("%s Test Set" %(model))
    plt.xlabel('PC 1')
    plt.ylabel('PC 2')
    plt.legend()

# IV. Integrating Artificial Neural Networks (ANN)

In [None]:
import keras
from keras.models import Sequential
from keras.layers import Dense

# A. Initializing ANN

# ARTIFICIAL NEURAL NETWORK

To predict whether a student is male or female from the remaining attributes, I use ANN classification.

In [None]:
classifier = Sequential()

# B. Adding Layers

In [None]:
classifier.add(Dense(8, kernel_initializer='uniform', activation= 'relu', input_dim = 2))
classifier.add(Dense(6, kernel_initializer='uniform', activation= 'relu'))
classifier.add(Dense(5, kernel_initializer='uniform', activation= 'relu'))
classifier.add(Dense(4, kernel_initializer='uniform', activation= 'relu'))
classifier.add(Dense(1, kernel_initializer= 'uniform', activation= 'sigmoid'))
classifier.compile(optimizer= 'adam',loss='binary_crossentropy', metrics=['accuracy'])

# C. Fitting ANN to Training Set

In [None]:
classifier.fit(X_train,y_train,batch_size=10,epochs=100)

# D. Predicting the Test Set Results

In [None]:
y_pred=classifier.predict(X_test)
y_pred=(y_pred>0.5)
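
The sigmoid output layer returns one probability per row, and comparing it with 0.5 turns it into a class decision. A minimal NumPy sketch with hypothetical probabilities (no network is trained here):

```python
import numpy as np

# Hypothetical sigmoid outputs, shaped (n_samples, 1) like a Keras prediction.
probs = np.array([[0.91], [0.12], [0.50], [0.73]])

# Strictly greater than 0.5 counts as class 1; flatten to a 1-D label vector.
labels = (probs > 0.5).astype(int).ravel()
print(labels)
```

A probability of exactly 0.5 falls into class 0 because the comparison is strict.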

# E. Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(y_test, y_pred))

# F. Classification Report

In [None]:
print(classification_report(y_test, y_pred))
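
The figures in the classification report follow from the four cells of the confusion matrix. The sketch below works this through on hypothetical labels (`y_true` and `y_hat` are invented for illustration):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted 1s, how many were right
recall = tp / (tp + fn)     # of the actual 1s, how many were found
print(accuracy, precision, recall)
```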

# G. Visualizing ANN Training Set results

In [None]:
visualization_train(model='ANN')

# H. Creating a function to evaluate model's performance.

In [None]:
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score

In [None]:
def print_score(classifier,X_train,y_train,X_test,y_test,train=True):
    if train == True:
        print("Training results:\n")
        print('Accuracy Score: {0:.4f}\n'.format(accuracy_score(y_train,classifier.predict(X_train))))
        print('Classification Report:\n{}\n'.format(classification_report(y_train,classifier.predict(X_train))))
        print('Confusion Matrix:\n{}\n'.format(confusion_matrix(y_train,classifier.predict(X_train))))
        res = cross_val_score(classifier, X_train, y_train, cv=10, n_jobs=-1, scoring='accuracy')
        print('Average Accuracy:\t{0:.4f}\n'.format(res.mean()))
        print('Standard Deviation:\t{0:.4f}'.format(res.std()))
    elif train == False:
        print("Test results:\n")
        print('Accuracy Score: {0:.4f}\n'.format(accuracy_score(y_test,classifier.predict(X_test))))
        print('Classification Report:\n{}\n'.format(classification_report(y_test,classifier.predict(X_test))))
        print('Confusion Matrix:\n{}\n'.format(confusion_matrix(y_test,classifier.predict(X_test))))
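
The "Average Accuracy" and "Standard Deviation" printed by the function come from 10-fold cross-validation: the training set is split into ten parts, and each part is scored once by a model fitted on the other nine. A minimal sketch on synthetic data (`make_classification` and its parameters are assumptions used only to have something to fit):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic two-feature binary problem, standing in for the PCA output.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=42
)

# One accuracy value per fold.
scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=10, scoring="accuracy")
print(scores.mean(), scores.std())
```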

# V. Classifiers

# A. Logistic Regression Model

# Fitting Logistic Regression model to the Training set

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()

classifier.fit(X_train,y_train)

# Logistic Regression Training Results

In [None]:
print_score(classifier,X_train,y_train,X_test,y_test,train=True)

# Logistic Regression Test Results

In [None]:
print_score(classifier,X_train,y_train,X_test,y_test,train=False)

# Visualising the Logistic Regression Training set results

In [None]:
visualization_train('Logistic Reg')

# Visualising the Logistic Regression Test set results

In [None]:
visualization_test('Logistic Reg')

# B. Support Vector (SVC) Classification Model

# Fitting SVC to the Training set

In [None]:
from sklearn.svm import SVC
classifier = SVC(kernel='rbf',random_state=42)

classifier.fit(X_train,y_train)

# SVC Training Results

In [None]:
print_score(classifier,X_train,y_train,X_test,y_test,train=True)

# SVC Test Results

In [None]:
print_score(classifier,X_train,y_train,X_test,y_test,train=False)

# Visualising the SVC Training set results

In [None]:
visualization_train('SVC')

# Visualising the SVC Test set results

In [None]:
visualization_test('SVC')

# C. K Nearest Neighbors (K-NN) Classification Model

# Fitting K-NN to the Training set

In [None]:
from sklearn.neighbors import KNeighborsClassifier as KNN

classifier = KNN()
classifier.fit(X_train,y_train)

# K-NN Training Results

In [None]:
print_score(classifier,X_train,y_train,X_test,y_test,train=True)

# K-NN Test Results

In [None]:
print_score(classifier,X_train,y_train,X_test,y_test,train=False)

# Visualising the K-NN Training set results

In [None]:
visualization_train('K-NN')

# Visualising the K-NN Test set results

In [None]:
visualization_test('K-NN')

# D. Naive Bayes Classification Model

# Fitting Naive Bayes classifier to the Training set

In [None]:
from sklearn.naive_bayes import GaussianNB as NB

classifier = NB()
classifier.fit(X_train,y_train)

# Naive Bayes Training Results

In [None]:
print_score(classifier,X_train,y_train,X_test,y_test,train=True)

# Naive Bayes Test Results

In [None]:
print_score(classifier,X_train,y_train,X_test,y_test,train=False)

# Visualising the Naive Bayes Training set results

In [None]:
visualization_train('Naive Bayes')

# Visualising the Naive Bayes Test set results

In [None]:
visualization_test('Naive Bayes')

# E. Decision Tree Classification Model

# Fitting Decision Tree classifier to the Training set

In [None]:
from sklearn.tree import DecisionTreeClassifier as DT

classifier = DT(criterion='entropy',random_state=42)
classifier.fit(X_train,y_train)

# Decision Tree Training Results

In [None]:
print_score(classifier,X_train,y_train,X_test,y_test,train=True)

# Decision Tree Test Results

In [None]:
print_score(classifier,X_train,y_train,X_test,y_test,train=False)

# Visualising the Decision Tree Training set results

In [None]:
visualization_train('Decision Tree')

# Visualising the Decision Tree Test set results

In [None]:
visualization_test('Decision Tree')


# Results :
| Classifier | Logistic Reg| SVC | K-NN | Naive Bayes | Decision Tree |
| --- | --- | --- | --- | --- | --- |
| Train accuracy score | 0.9057 | 0.9289 | 0.9430 | 0.8980 | 1.0000 |
| Average accuracy score (10-fold CV) | 0.9057 | 0.9281 | 0.9314 | 0.8982 | 0.8920 |
| SD (10-fold CV) | 0.0097 | 0.0112 | 0.0097 | 0.0114 | 0.0128 |
| Test accuracy score | 0.9028 | 0.9258 | 0.9307 | 0.8966 | 0.9016 |
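
The table can be compared programmatically, for instance to find the best test score and the largest gap between training and test accuracy. The sketch below copies the figures from the table; the column names `train`, `cv_mean`, `test` and `train_minus_test` are shorthand introduced here.

```python
import pandas as pd

# Figures copied from the results table above.
results = pd.DataFrame(
    {
        "train": [0.9057, 0.9289, 0.9430, 0.8980, 1.0000],
        "cv_mean": [0.9057, 0.9281, 0.9314, 0.8982, 0.8920],
        "test": [0.9028, 0.9258, 0.9307, 0.8966, 0.9016],
    },
    index=["Logistic Reg", "SVC", "K-NN", "Naive Bayes", "Decision Tree"],
)

# A large train-minus-test gap hints at overfitting.
results["train_minus_test"] = results["train"] - results["test"]

best_on_test = results["test"].idxmax()
largest_gap = results["train_minus_test"].idxmax()
print(best_on_test, largest_gap)
```

On these figures K-NN has the highest test accuracy, while the Decision Tree's perfect training score combined with a lower test score gives it the largest gap.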