<h2>Data Analysis and Machine Learning on Test Scores Dataset</h2>

<h3>Exploratory Data Analysis</h3>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

Splitting data 80/20 for final evaluation

In [1]:
test_scores = pd.read_csv('../input/predict-test-scores-of-students/test_scores.csv')
test_scores = test_scores.sample(frac=1, random_state=0).reset_index(drop=True) # shuffling ordered data
data_train, data_test = train_test_split(test_scores,test_size=.2,random_state=0)
data = data_train # the set we'll explore
data.head()
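
As a quick sanity check on the 80/20 split, here is a minimal sketch on a made-up frame (`toy` and its columns are placeholders, not the real dataset). It assumes scikit-learn's default `shuffle=True`, which means `train_test_split` already shuffles ordered rows, and it shows that a fixed `random_state` makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for test_scores: 100 rows ordered by school.
toy = pd.DataFrame({
    "school": ["A"] * 50 + ["B"] * 50,
    "posttest": range(100),
})

# train_test_split shuffles by default, so ordered input is handled.
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=0)
toy_train_again, _ = train_test_split(toy, test_size=0.2, random_state=0)

print(len(toy_train), len(toy_test))                   # sizes of the two splits
print(toy_train.index.equals(toy_train_again.index))   # same seed, same split
```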

Some features have ambiguous definitions, so here is what they mean: <br>
n_student: number of students per <u>classroom</u> <br>
teaching_method: teaching method of the <u>classroom</u>

In [1]:
data.info()

2133 samples, 11 features (1 target) and no missing data. 

In [1]:
data.describe()

Students scored higher on the posttest: the 75th percentile is 64 on the pretest and 77 on the posttest. No one scored 100 on the pretest, and...

In [1]:
data[data.posttest == 100].shape[0]

7 students scored 100 at posttest.

In [1]:
sns.pairplot(data)

pretest and posttest are highly correlated, and n_student seems to be negatively correlated with both pretest and posttest.
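
The pair plot only shows these relationships visually. A minimal sketch of how they could be quantified with `DataFrame.corr` follows; the frame `toy` is synthetic and merely reuses the column names, so the number it prints says nothing about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic scores: posttest is pretest plus a gain and some noise.
rng = np.random.default_rng(0)
pretest = rng.normal(55, 13, size=500)
toy = pd.DataFrame({
    "pretest": pretest,
    "posttest": pretest + 12 + rng.normal(0, 3, size=500),
})

# Pearson correlation matrix of the numeric columns.
corr = toy.corr()
print(corr.loc["pretest", "posttest"])
```

On the notebook's data the equivalent call would be `data[['pretest','posttest','n_student']].corr()`.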

In [1]:
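gender_counts = data.gender.value_counts()
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%') # label each slice with gender and share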

Gender distribution is not skewed.

<h4>School</h4>

In [1]:
data.school.value_counts().shape[0]

There are 23 schools in total:

In [1]:
data.school[data.school_type == 'Public'].value_counts().shape[0]

In [1]:
data.school[data.school_type == 'Non-public'].value_counts().shape[0]

15 public and 8 non-public, and...

In [1]:
data.school[data.school_setting == 'Suburban'].value_counts().shape[0]

In [1]:
data.school[data.school_setting == 'Urban'].value_counts().shape[0]

In [1]:
data.school[data.school_setting == 'Rural'].value_counts().shape[0]

7 in suburban, 9 in urban, and 7 in rural.
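
The per-category counts above can also be obtained in one call each with `groupby(...).nunique()`. This is a sketch on a small made-up frame (`toy` is a placeholder with the same column names), not the notebook's output.

```python
import pandas as pd

# Hypothetical rows: one row per student, several students per school.
toy = pd.DataFrame({
    "school": ["A", "A", "B", "C", "C", "D"],
    "school_type": ["Public", "Public", "Public",
                    "Non-public", "Non-public", "Public"],
    "school_setting": ["Urban", "Urban", "Rural",
                       "Suburban", "Suburban", "Urban"],
})

# Number of distinct schools in each category.
schools_by_type = toy.groupby("school_type")["school"].nunique()
schools_by_setting = toy.groupby("school_setting")["school"].nunique()
print(schools_by_type)
print(schools_by_setting)
```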

Student population and gender distribution:

In [1]:
plt.subplots(figsize=(14, 6))
sns.histplot(data, x='school', hue="gender", multiple="dodge", shrink=.8).tick_params(labelsize=8.1)

Number of classrooms and average number of students per classroom in a school:

In [1]:
data[['school','classroom']].groupby(['school']).nunique() # distinct classrooms per school (count() would count student rows)

In [1]:
data[['school','n_student']].groupby(['school'], as_index=False).mean() # classroom is a string column, so it is left out of the mean

In [1]:
plt.subplots(figsize=(14, 6))
sns.barplot(data=data[['school','n_student']].groupby(['school'], as_index=False).mean(),
            x='school', y='n_student').tick_params(labelsize=8.1)
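
Note that averaging `n_student` over student rows weights large classrooms more heavily, because each classroom appears once per student. A sketch of the difference, on a made-up frame (`toy` is a placeholder), keeps one row per classroom before averaging:

```python
import pandas as pd

# Hypothetical school with one classroom of 4 students and one of 1.
toy = pd.DataFrame({
    "school": ["A"] * 5,
    "classroom": ["c1"] * 4 + ["c2"],
    "n_student": [4, 4, 4, 4, 1],
})

# Mean over student rows: the larger classroom is counted four times.
student_weighted = toy.groupby("school")["n_student"].mean()

# Mean over classrooms: one row per classroom first, then average.
per_classroom = (
    toy[["school", "classroom", "n_student"]]
    .drop_duplicates()
    .groupby("school")["n_student"]
    .mean()
)
print(student_weighted["A"], per_classroom["A"])
```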

Pretest scores:

In [1]:
plt.subplots(figsize=(14, 6))
sns.boxplot(data=data, x='school', y='pretest').tick_params(labelsize=8.1)

UKPGS has the highest pretest average and KZKKE the lowest.

<h4>School setting</h4>

Student population and gender distribution:

In [1]:
sns.histplot(data, x='school_setting', hue='gender', multiple='dodge', shrink=.8)

Pretest scores:

In [1]:
sns.boxplot(data=data, x='school_setting', y='pretest')

<h4>School type</h4>

Student population and gender distribution:

In [1]:
sns.histplot(data, x='school_type', hue='gender', multiple='dodge', shrink=.8)

Pretest scores:

In [1]:
sns.boxplot(data=data, x='school_type', y='pretest')

School setting - school type:

In [1]:
sns.histplot(data, x='school_type', hue='school_setting', multiple='dodge', shrink=.8)

In [1]:
sns.catplot(data=data, x='school_type', y='pretest', hue='school_setting', alpha=0.7)

<h4>Teaching method</h4>

Student population and school setting: 

In [1]:
sns.histplot(data, x='teaching_method', hue='school_setting', multiple='dodge', shrink=.8)

Pretest scores:

In [1]:
sns.boxplot(data=data, x='teaching_method', y='pretest')

<h4>Lunch</h4>

In [1]:
data.lunch.value_counts()

Pretest scores by lunch status (the correlation between lunch status and pretest is -0.6):

In [1]:
sns.catplot(data=data,x='lunch',y='pretest')

Lunch - school type:

In [1]:
data_public = data[data.school_type == 'Public']
data_public[data_public.lunch == 'Qualifies for reduced/free lunch'].shape[0] / data_public.shape[0]

In [1]:
data_non_public = data[data.school_type == 'Non-public']
data_non_public[data_non_public.lunch == 'Qualifies for reduced/free lunch'].shape[0] / data_non_public.shape[0]

In public schools, a larger share of students qualifies for reduced/free lunch.
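
The two shares can also be read from a single normalized cross-tabulation. The sketch below uses a made-up frame (`toy`, and the label `'Does not qualify'`, are placeholders) and `pd.crosstab` with `normalize='index'`, so each row sums to 1.

```python
import pandas as pd

qualifies = "Qualifies for reduced/free lunch"
no = "Does not qualify"  # placeholder label for the other category

# Hypothetical students: 4 in public schools, 2 in non-public schools.
toy = pd.DataFrame({
    "school_type": ["Public"] * 4 + ["Non-public"] * 2,
    "lunch": [qualifies, qualifies, qualifies, no, no, qualifies],
})

# Row-normalized table: share of each lunch status within a school type.
shares = pd.crosstab(toy.school_type, toy.lunch, normalize="index")
print(shares)
```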

Lunch - school:

In [1]:
qualifies = data.lunch == 'Qualifies for reduced/free lunch'
lunch_percentage = qualifies.groupby(data.school).mean() # share of qualifying students per school (0 when none qualify)
lunch_percentage.sort_values(ascending=True)

Every student in KZKKE qualifies for reduced/free lunch. In IDGFP, LAYPA and UKPGS, no one qualifies.

<h3>Feature Engineering & Preprocessing</h3>

In [1]:
data_nf = data.copy()
# Replace each school/classroom label with the mean pretest score of its group.
# transform('mean') computes every group mean from the full, unmodified group,
# so all students of a school (or classroom) receive the same value.
data_nf['school'] = data_nf.groupby('school')['pretest'].transform('mean')
data_nf['classroom'] = data_nf.groupby('classroom')['pretest'].transform('mean')

data_nf.drop('student_id',axis=1,inplace=True)
data_nf.rename(columns={'school':'school_pretest_mean','classroom':'classroom_pretest_mean'}, inplace=True)
data_nf

In [1]:
data_nf.corr(numeric_only=True)['posttest'] # skip the string-valued categorical columns

In [1]:
test_nf = data_test.copy() # the held-out split, not the training set
# Same encoding as for the training set: group-wise pretest means.
test_nf['school'] = test_nf.groupby('school')['pretest'].transform('mean')
test_nf['classroom'] = test_nf.groupby('classroom')['pretest'].transform('mean')

test_nf.drop('student_id',axis=1,inplace=True)
# Column names must match the training features.
test_nf.rename(columns={'school':'school_pretest_mean','classroom':'classroom_pretest_mean'}, inplace=True)
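
An alternative to recomputing group means on the held-out rows is to learn the means on the training split only and map them onto the test split. The sketch below illustrates that idea on made-up frames; `toy_train`, `toy_test` and the fallback to the overall training mean for unseen schools are assumptions for illustration, not what the notebook does.

```python
import pandas as pd

# Hypothetical splits; school "C" appears only in the test split.
toy_train = pd.DataFrame({"school": ["A", "A", "B", "B"],
                          "pretest": [40, 60, 70, 90]})
toy_test = pd.DataFrame({"school": ["A", "B", "C"],
                         "pretest": [55, 85, 50]})

# Group means learned from the training split only.
school_means = toy_train.groupby("school")["pretest"].mean()
overall_mean = toy_train["pretest"].mean()

# Map them onto the test split; unseen schools get the overall mean.
toy_test["school_pretest_mean"] = (
    toy_test["school"].map(school_means).fillna(overall_mean)
)
print(toy_test)
```

This keeps test-set information out of the features, at the cost of needing a fallback for groups that never occur in training.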

In [1]:
data_nf_dummy = pd.get_dummies(data_nf, columns=['school_setting','school_type','teaching_method','gender','lunch'], drop_first=True)
data_nf_dummy

In [1]:
test_nf_dummy = pd.get_dummies(test_nf, columns=['school_setting','school_type','teaching_method','gender','lunch'], drop_first=True)
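
Encoding the two splits separately can produce different dummy columns when a category is missing from one split, and `drop_first=True` may then drop a different level in each. One way to guard against this, sketched here on made-up frames (`toy_train` and `toy_test` are placeholders), is to fix the category list from the training split before calling `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical splits: the test split contains only "Urban" schools.
toy_train = pd.DataFrame({"school_setting": ["Urban", "Rural", "Suburban"],
                          "pretest": [50, 60, 70]})
toy_test = pd.DataFrame({"school_setting": ["Urban", "Urban"],
                         "pretest": [55, 65]})

# Fix the category levels from the training split for both frames.
categories = sorted(toy_train["school_setting"].unique())
for frame in (toy_train, toy_test):
    frame["school_setting"] = pd.Categorical(frame["school_setting"],
                                             categories=categories)

# With a shared categorical dtype, both frames get the same dummy columns
# and drop_first removes the same level ("Rural") in each.
train_dummy = pd.get_dummies(toy_train, columns=["school_setting"], drop_first=True)
test_dummy = pd.get_dummies(toy_test, columns=["school_setting"], drop_first=True)
print(list(train_dummy.columns))
print(list(test_dummy.columns))
```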

In [1]:
sns.heatmap(data_nf_dummy.corr())

<h3>Building models</h3>

In [1]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import mean_absolute_error

In [1]:
y_train = data_nf_dummy.posttest
X_train = data_nf_dummy.drop(['posttest'],axis=1)

In [1]:
y_test = test_nf_dummy.posttest
X_test = test_nf_dummy.drop(['posttest'],axis=1)

<h4>Linear models</h4>

In [1]:
pipeline_linear = Pipeline([('scaler',MinMaxScaler()),('linear_model',LinearRegression())])
param_grid_linear = [{'linear_model':[LinearRegression()],'scaler':[MinMaxScaler(),StandardScaler(),None]},
 {'linear_model':[Ridge()],'linear_model__alpha':[0.001,0.003,0.01,0.03,0.1,0.3,1,3,10,30,100,300],
  'scaler':[MinMaxScaler(),StandardScaler(),None]},
 {'linear_model':[Lasso()],'linear_model__alpha':[0.001,0.003,0.01,0.03,0.1,0.3,1,3,10,30,100,300],
  'scaler':[MinMaxScaler(),StandardScaler(),None]}]
grid_linear = GridSearchCV(pipeline_linear, param_grid_linear)
grid_linear.fit(X_train,y_train)

In [1]:
grid_linear.best_params_

In [1]:
grid_linear.best_score_
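
`GridSearchCV` scores regressors with R² by default, so `best_score_` is not on the same scale as the mean absolute error used in the final evaluation. A minimal sketch of a search scored by MAE follows; it runs on synthetic data with a `Ridge` model, and `X`, `y` and the small `alpha` grid are placeholders.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem with small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

param_grid = {"alpha": [0.01, 1, 100]}

# Default scoring for a regressor: R^2 (higher is better).
grid_r2 = GridSearchCV(Ridge(), param_grid).fit(X, y)

# MAE-based scoring: scikit-learn maximizes scores, so the MAE is negated.
grid_mae = GridSearchCV(Ridge(), param_grid,
                        scoring="neg_mean_absolute_error").fit(X, y)
cv_mae = -grid_mae.best_score_  # flip the sign to get a positive MAE

print(grid_r2.best_score_, cv_mae)
```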

Final evaluation:

In [1]:
pred_linear = grid_linear.best_estimator_.predict(X_test)
mean_absolute_error(y_test, pred_linear)

When the selected model is Lasso, L1 regularization performs feature selection by shrinking some weights to exactly zero:

In [1]:
weights = grid_linear.best_estimator_.named_steps['linear_model'].coef_
lasso_weights = pd.DataFrame({'feature':X_train.columns.to_list(),'weight':weights})
lasso_weights
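
To illustrate how L1 regularization drops features, here is a small sketch on synthetic data. The feature names, the true coefficients and `alpha=0.5` are made up for the example; only the two informative features should keep non-zero weights.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data: the target depends on two of the four features.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 4)),
                 columns=["informative_1", "informative_2", "noise_1", "noise_2"])
y = 3.0 * X["informative_1"] - 2.0 * X["informative_2"] \
    + rng.normal(scale=0.1, size=300)

# A fairly strong L1 penalty shrinks weak coefficients to exactly zero.
model = Lasso(alpha=0.5).fit(X, y)
weights = pd.DataFrame({"feature": X.columns, "weight": model.coef_})

# Features that survive the penalty.
selected = weights.loc[weights["weight"] != 0, "feature"].tolist()
print(weights)
print(selected)
```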

<h4>SVM</h4>

In [1]:
pipeline_svm = Pipeline([('scaler',MinMaxScaler()),('svm',SVR())])
param_grid_svm = {'scaler':[MinMaxScaler(),StandardScaler(),None],'svm__C':[0.001,0.003,0.01,0.03,0.1,0.3,1,3,10,30,100,300],
                 'svm__gamma':[0.001,0.003,0.01,0.03,0.1,0.3,1,3,10,30,100,300]}
grid_svm = GridSearchCV(pipeline_svm, param_grid_svm)
grid_svm.fit(X_train, y_train)

In [1]:
grid_svm.best_params_

In [1]:
grid_svm.best_score_

Final evaluation:

In [1]:
pred_svm = grid_svm.best_estimator_.predict(X_test)
mean_absolute_error(y_test, pred_svm)

<h4>k-Nearest Neighbors</h4>

In [1]:
pipeline_knn = Pipeline([('scaler',MinMaxScaler()),('knn',KNeighborsRegressor())])
param_grid_knn = {'scaler':[MinMaxScaler(),StandardScaler(),None],
                  'knn__n_neighbors':[5,10,15,20,25,50,100]}
grid_knn = GridSearchCV(pipeline_knn, param_grid_knn)
grid_knn.fit(X_train, y_train)

In [1]:
grid_knn.best_params_

In [1]:
grid_knn.best_score_

Final evaluation:

In [1]:
pred_knn = grid_knn.best_estimator_.predict(X_test)
mean_absolute_error(y_test, pred_knn)