## <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>Introduction</div>

### The dataset
The dataset at hand contains information about a test taken by a group of students.
It includes features such as: School-Setting, School-Type, Gender etc.<br><br>

While working with this dataset I will put a special **emphasis on the 'Exploratory Data Analysis'**
and try to craft visuals that are not only visually appealing but also meaningful.

After the analysis I will try to **predict** the test score and **compare and evaluate** different baseline models.<br><br>

**Thank you already for taking some time and checking out my notebook.**<br>
**Feel free to leave an upvote if you like my work :)**

Let's get started...

**Take a look at some of my other work here:**
* [Water-Quality EDA & Model-Comparison](https://www.kaggle.com/mlanhenke/waterquality-eda-baseline-model-comparison)
* [Netflix-Awesome EDA & Prediction (CB,LGBM,XGB)](https://www.kaggle.com/mlanhenke/netflix-awesome-eda-prediction-cb-lgbm-xgb)

## <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>Import Data</div>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.lines as lines
import matplotlib.gridspec as gridspec
import seaborn as sns

from warnings import filterwarnings
filterwarnings('ignore')

plt.rcParams['font.family'] = 'monospace'

In [None]:
df = pd.read_csv('../input/predict-test-scores-of-students/test_scores.csv')

## <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>Color Palettes</div>

In [None]:
cmap0 = ['#68595b','#7098af','#6f636c','#907c7b']
cmap1 = ['#484146','#8da0b3','#796d72','#9fa9ba']
cmap2 = ['#545457','#a79698','#5284a2','#bbbcc4']

bg_color = '#fbfbfb'
txt_color = '#5c5c5c'

sns.palplot(cmap0)
sns.palplot(cmap1)
sns.palplot(cmap2)

## <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>Basic Overview</div>

In [None]:
print(f"Shape: {df.shape}")
print('--'*20)
df.head(3)

In [None]:
df.info()

In [None]:
# check for missing values
fig, ax = plt.subplots(tight_layout=True, figsize=(12,6))

fig.patch.set_facecolor(bg_color)
ax.set_facecolor(bg_color)

mv = df.isna()
ax = sns.heatmap(data=mv, cmap=sns.color_palette(cmap0), cbar=False, ax=ax, )

ax.set_ylabel('')
ax.set_yticks([])
ax.set_xticklabels(labels=mv.columns,rotation=45)
ax.tick_params(length=0)

fig.text(
    s=':Missing Values',
    x=0, y=1.1,
    fontsize=17, fontweight='bold',
    color=txt_color,
    va='top', ha='left'
)

fig.text(
    s='''
    we can't see any ...
    ''',
    x=0, y=1.075,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

plt.show()
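
A heatmap makes large gaps obvious, but a handful of missing cells in a frame with a few thousand rows can be invisible at this figure size. A numeric count per column is a cheap cross-check. The sketch below uses a small made-up frame (`toy`, with invented values) so that it runs on its own; in the notebook the same two calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: exactly one gap, in 'pretest'.
toy = pd.DataFrame({
    'gender': ['Female', 'Male', 'Male', 'Female'],
    'pretest': [62.0, np.nan, 48.0, 71.0],
    'posttest': [72.0, 58.0, 60.0, 80.0],
})

missing_per_column = toy.isna().sum()        # absolute number of missing cells per column
missing_share = toy.isna().mean().round(3)   # fraction of rows that are missing per column

print(missing_per_column)
print(missing_share)
```

If every entry of `missing_per_column` is zero, the "we can't see any" reading of the heatmap is confirmed.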

## <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>Target Analysis: Posttest</div>

In [None]:
# helper functions
def despine_ax(ax, spines=('top','left','right','bottom')):
    # hide the given spines of an axis (all four by default)
    for spine in spines:
        ax.spines[spine].set_visible(False)

def get_line(x=(0,0), y=(0,0), alpha=0.5, lw=1):
    # build a line in figure coordinates; relies on the global `fig` of the calling cell
    return lines.Line2D(xdata=x, ydata=y, lw=lw, alpha=alpha, color='#aeaeae', transform=fig.transFigure, figure=fig)

In [None]:
fig, (ax0, ax1) = plt.subplots(2, 1, tight_layout=True, sharex=True, figsize=(12,6))
fig.patch.set_facecolor(bg_color)

mean = df['posttest'].mean()
median = df['posttest'].median()

ax0.boxplot(
    data=df, x='posttest',
    vert=False, patch_artist=True,
    boxprops=dict(facecolor=cmap0[1], lw=0, alpha=0.75),
    whiskerprops=dict(color='gray', lw=1, ls='--'),
    capprops=dict(color='gray', lw=1, ls='--'),
    medianprops=dict(color='#fff', lw=0),
    flierprops=dict(markerfacecolor=cmap0[0],alpha=0.75),
    zorder=0
)

ax1 = sns.kdeplot(
    data=df, x='posttest', fill=True,
    color=cmap0[0], edgecolor='#000', lw=1, 
    zorder=0, alpha=0.8, ax=ax1
)

ax0.axvline(x=mean, ymin=0.4, ymax=0.6, color=bg_color, ls=':', zorder=1, label='mean')
ax1.axvline(x=mean, ymin=0, ymax=0.9, color=bg_color, ls=':', zorder=1)

ax0.axvline(x=median, ymin=0.4, ymax=0.6, color=bg_color, ls='--', zorder=1)
ax1.axvline(x=median, ymin=0, ymax=0.9, color=bg_color, ls='--', zorder=1)

ax0.axis('off')
ax0.set_facecolor(bg_color)

ax1.set_ylabel('')
ax1.set_xlabel('')
ax1.set_yticks([])
ax1.tick_params(length=0)
ax1.set_facecolor(bg_color)

despine_ax(ax1, ['top','left','right'])

fig.text(
    s=':Posttest - Distribution',
    x=0, y=1.05,
    fontsize=17, fontweight='bold',
    color=txt_color, 
    va='top', ha='left'
)

fig.text(
    s='''
    in the plot below we can see signs
    of a bimodal distribution, with
    one peak at around 57-62 and the other 
    at approx. 72-79 points.
    ''',
    x=0, y=1.02,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

fig.text(
    s=f"Mean: {np.round(mean,1)}\nMedian: {np.round(median,1)}",
    x=0.56, y=0.925,
    fontsize=9, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

l1 = get_line(x=[0.55,0.55], y=[0.85,0.95])
fig.lines.extend([l1])

plt.show()
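
Reading two peaks off a KDE plot is a judgment call, so it can help to count the local maxima of the density directly. The sketch below is one simple way to do that with plain NumPy. It runs on synthetic scores (two invented groups centred near 60 and 76 points) rather than on `df`, and the helper `kde_peaks`, its grid size and its height threshold are my own illustrative choices, not part of the analysis above.

```python
import numpy as np

# Synthetic scores (an assumption for illustration): two groups
# centred near 60 and 76 points, loosely mirroring the plot above.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(60, 4, 1000), rng.normal(76, 4, 1000)])

def kde_peaks(values, grid_size=200, min_rel_height=0.2):
    """Return the x-positions of the local maxima of a Gaussian KDE."""
    values = np.asarray(values, dtype=float)
    # Scott's rule of thumb for the bandwidth
    bandwidth = values.std(ddof=1) * len(values) ** (-1 / 5)
    grid = np.linspace(values.min(), values.max(), grid_size)
    # Sum of Gaussian kernels; the missing normalisation does not move the peaks
    z = (grid[:, None] - values[None, :]) / bandwidth
    density = np.exp(-0.5 * z ** 2).sum(axis=1)
    # Interior grid points that are higher than both neighbours
    is_peak = (density[1:-1] > density[:-2]) & (density[1:-1] > density[2:])
    # Ignore small bumps far out in the tails
    is_peak &= density[1:-1] >= min_rel_height * density.max()
    return grid[1:-1][is_peak]

peaks = kde_peaks(scores)
print(np.round(peaks, 1))
```

Calling `kde_peaks(df['posttest'])` would apply the same check to the real scores. The number of peaks depends on the bandwidth, so this is a sanity check and not a formal test of bimodality.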

In [None]:
fig, ax = plt.subplots(tight_layout=True, figsize=(12,2.5))
fig.patch.set_facecolor(bg_color)

uniq_scores = df['posttest'].nunique()

ax.barh(
    y=1, width=uniq_scores, 
    color=cmap0[1], alpha=0.75,lw=1, edgecolor='white'
)
ax.barh(
    y=1, width=100-uniq_scores, left=uniq_scores,
    color=cmap1[1], alpha=0.25, lw=1, edgecolor='white'
)

ax.axis('off')

ax.annotate(
    text=f"{uniq_scores}",
    xy=(35,1.05),
    va='center', ha='center',
    fontsize=36, fontweight='bold', fontfamily='serif',
    color='#fff'
)

ax.annotate(
    text='unique scores',
    xy=(35,0.85),
    va='center', ha='center',
    fontsize=16, fontstyle='italic', fontfamily='serif',
    color='#fff'
)

fig.text(
    s=':Unique Number of Scores',
    x=0, y=1.25,
    fontsize=17, fontweight='bold',
    color=txt_color, 
    va='top', ha='left'
)

fig.text(
    s='''
    68 unique scores were achieved
    out of a total of 100 possible outcomes.
    ''',
    x=0, y=1.2,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

l1 = get_line(x=[0.645,0.645], y=[0,1], lw=3, alpha=1)
fig.lines.extend([l1])

plt.show()

## <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>Feature Analysis</div>

#### <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>:: Schools</div>

In [None]:
!pip install circlify

In [None]:
from circlify import circlify, Circle

# prepare top 10 df
schools_by_num_students = df.groupby('school').count()[['posttest']].reset_index().sort_values(by='posttest',ascending=False).rename(columns={'posttest':'count'})
# share of all students per school, computed row-wise so it always matches 'count'
schools_by_num_students['ratio'] = schools_by_num_students['count'] / len(df)
schools_by_num_students = schools_by_num_students[:10]

# plot
fig, ax = plt.subplots(tight_layout=True, figsize=(8,8))

fig.patch.set_facecolor(bg_color)
ax.patch.set_facecolor(bg_color)

# circle plot
circles = circlify(
    data=schools_by_num_students['count'].tolist(), 
    show_enclosure=False, 
    target_enclosure=Circle(x=0, y=0, r=1)
)

lim = max(
    max(
        abs(circle.x) + circle.r,
        abs(circle.y) + circle.r,
    ) for circle in circles)

ax.set_xlim(-lim, lim)
ax.set_ylim(-lim, lim)

labels = schools_by_num_students['school'][::-1]
counts = schools_by_num_students['count'][::-1]
ratios = schools_by_num_students['ratio'][::-1]

for circle, label, count, ratio in zip(circles, labels, counts, ratios):
    x, y, r = circle
    ax.add_patch(
        plt.Circle(
            (x,y), r, 
            lw=1, fill=True,
            alpha=1*(ratio*10), 
            facecolor=cmap0[1]
        )
    )
    ax.annotate(
        text=f"{label}",
        xy=(x,y),
        fontweight='bold',
        va='center',ha='center', color='#fff'
    )
    ax.annotate(
        text=f"#{count} ({int(ratio*100)}%)",
        xy=(x,y-0.04),
        fontstyle='italic',fontsize=9,
        va='center',ha='center', color='#fff'
    )

ax.axis('off')

fig.text(
    s=':TOP 10 - Schools',
    x=0, y=1,
    fontsize=17, fontweight='bold',
    color=txt_color, 
    va='top', ha='left'
)

fig.text(
    s='''
    by number of students
    ''',
    x=0, y=0.985,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

fig.text(
    s=f"{df['school'].nunique()}",
    x=1.04,y=0.8,
    fontsize=52, fontfamily='serif',
    color=txt_color,
    va='top', ha='left'
)

fig.text(
    s='''
    unique 
    schools''',
    x=1.13,y=0.82,
    fontsize=11, fontfamily='serif',
    color=txt_color,
    va='top', ha='left'
)
   

fig.text(
    s='''
    The students are nearly 
    equally distributed
    among the total of 23 
    different schools
    ''',
    x=1,y=0.65,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

l1 = get_line(x=[1,1], y=[0.45,0.8])
fig.lines.extend([l1])

plt.show()
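
"Nearly equally distributed" can be made concrete by comparing each school's share with the share it would have under a perfectly even split (with 23 schools that would be 1/23, roughly 4.3 %). A minimal sketch with invented school labels, so that it runs without the dataset:

```python
import pandas as pd

# Hypothetical school labels, not the real data.
school = pd.Series(['A'] * 30 + ['B'] * 25 + ['C'] * 25 + ['D'] * 20, name='school')

counts = school.value_counts()
shares = counts / counts.sum()

# Under a perfectly even split every school holds 1 / n_schools of the students.
even_share = 1 / counts.size
max_deviation = (shares - even_share).abs().max()

print(shares.round(3))
print(f"even share: {even_share:.3f}, largest deviation: {max_deviation:.3f}")
```

Replacing `school` with `df['school']` gives the same numbers for the notebook's data; the smaller `max_deviation` is, the better the claim holds.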

In [None]:
# create alternating y-coords for each school
np.random.seed(9)
low_coords = [np.round(np.random.uniform(0.35,0.85),2) for _ in range(0,23)] 
high_coords = [np.round(np.random.uniform(1.35,1.75),2) for _ in range(0,23)] 

y_coords = [low_coords[idx] if idx % 2 == 0 else high_coords[idx] for idx in range(0,23)]
y_coords = pd.DataFrame(data=y_coords, columns=['y_coords'])

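# select the numeric column before aggregating; newer pandas versions raise on non-numeric columns in .mean()
schools_by_avg_score = df.groupby('school')[['posttest']].mean().rename(columns={'posttest':'mean_score'}).sort_values(by='mean_score').reset_index()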
schools_by_avg_score = pd.concat([schools_by_avg_score, y_coords], axis=1)

# plot schools in timeline style
fig, ax = plt.subplots(tight_layout=True, figsize=(12,6))

fig.patch.set_facecolor(bg_color)
ax.set_facecolor(bg_color)

ax.set_ylim(0,2)

ax.axhline(
    y=1, zorder=0, color=txt_color
)

sns.scatterplot(
    data=schools_by_avg_score, 
    x='mean_score', y='y_coords',
    s=650, linewidth=1, alpha=0.75, zorder=1,
    edgecolor='#fff', facecolor=cmap0[0], ax=ax
)

sns.scatterplot(
    data=schools_by_avg_score, 
    x='mean_score', y=1,
    s=50, linewidth=1, zorder=1,
    edgecolor=txt_color, facecolor=bg_color, ax=ax
)

for idx in range(0, len(schools_by_avg_score['mean_score'])):
    name = schools_by_avg_score['school'][idx]
    x = schools_by_avg_score['mean_score'][idx]
    y = schools_by_avg_score['y_coords'][idx]

    ax.annotate(
        text=f"{name}",
        xy=(x,y),
        va='center', ha='center',
        fontsize=7, fontweight='bold',
        color='white'
    )

    ax.vlines(
        x=x,
        ymin=min(1,y+0.06),
        ymax=max(1,y-0.06),
        color=txt_color,
        lw=1, zorder=0, alpha=0.25, ls='--'
    )

ax.axis('off')

fig.text(
    s=':Schools vs. Scores',
    x=0, y=1.25,
    fontsize=17, fontweight='bold',
    color=txt_color, 
    va='top', ha='left'
)

fig.text(
    s='''
    ...and the winner is UKPGS. This plot clearly 
    shows the distribution around the mean value 
    of 67-68 points. Furthermore, we can tell that 
    we have 2 'elite schools' in our dataset. 'UKPGS' 
    is also one of the TOP 10 schools by 
    total number of students.
    ''',
    x=0, y=1.23,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

fig.text(
    s='''
    'elite-schools'
    ''',
    x=0.79, y=0.93,
    fontsize=7, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

l1 = get_line(x=[0.9,0.9],y=[0.25,0.75], lw=125, alpha=0.075)
fig.lines.extend([l1])

plt.show()

#### <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>:: School Setting</div>

In [None]:
import squarify

sizes = df['school_setting'].value_counts().values
labels = [label+'\n#'+str(size) for label, size in zip(df['school_setting'].value_counts().index, sizes)]

fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, tight_layout=False, figsize=(12,6))

ax0 = squarify.plot(
    sizes=sizes,
    label=labels,
    color=cmap2,
    alpha=0.8,
    pad=True,
    text_kwargs=dict(color='white', fontsize=9, fontstyle='italic'),
    ax = ax0
)

ax1 = sns.boxplot(
    data=df,
    x='posttest',
    y='school_setting',
    palette=sns.color_palette("ch:start=.2,rot=-.3"),
    linewidth=1,
    showmeans=True,
    meanprops=dict(markerfacecolor='white', markeredgecolor='white', marker='x'),
    boxprops=dict(edgecolor='white'),
    medianprops=dict(color='white'),
    whiskerprops=dict(color=txt_color, ls=':'),
    capprops=dict(ls=':'),
    flierprops=dict(markersize=2, marker='D'),
    ax=ax1,
)


fig.patch.set_facecolor(bg_color)
fig.subplots_adjust(wspace=0.25)

ax0.set_facecolor(bg_color)
ax1.set_facecolor(bg_color)

ax0.axis('off')
despine_ax(ax1, ['top','left','right'])

ax1.set_xlabel('')
ax1.set_ylabel('')
# keep seaborn's own tick labels (they follow the box order) and only restyle them
plt.setp(ax1.get_yticklabels(), rotation=90, va='center', ha='center')
ax1.tick_params(axis='both',length=0, labelcolor=txt_color)

fig.text(
    s=':School Setting',
    x=0.1, y=1.1,
    fontsize=17, fontweight='bold',
    color=txt_color, 
    va='top', ha='left'
)

fig.text(
    s='''
    most of the students attend an urban 
    or suburban school. The suburban schools 
    achieve the highest scores on average.
    ''',
    x=0.1, y=1.08,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

fig.text(
    s='''
    setting vs. scores
    ''',
    x=0.505, y=0.75,
    rotation=90,
    fontsize=7, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

l1 = get_line(x=[0.52,0.52], y=[0.45,0.8])
fig.lines.extend([l1])

plt.show()
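
The boxplot shows medians (white line) and means (white x), but "highest on average" is easier to verify from a small table. A sketch of that table on invented rows; the column names match the dataset, the values do not:

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'school_setting': ['Urban', 'Urban', 'Suburban', 'Suburban', 'Rural', 'Rural'],
    'posttest': [60.0, 64.0, 78.0, 74.0, 66.0, 62.0],
})

summary = (
    toy.groupby('school_setting')['posttest']
       .agg(['count', 'mean', 'median'])       # group size and two measures of centre
       .sort_values('mean', ascending=False)   # best-scoring setting first
)
print(summary)
```

Run on `df` instead of `toy`, the first row of `summary` is the setting with the highest mean score.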

#### <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>:: School Type</div>

In [None]:
df_school_type = df.groupby('school_type').count()[['posttest']].rename(columns={'posttest':'count'}).reset_index()
public_count = df_school_type[df_school_type['school_type'] == 'Public']['count'].squeeze() 
non_public_count = df_school_type[df_school_type['school_type'] == 'Non-public']['count'].squeeze() 

fig, ax = plt.subplots(tight_layout=True, figsize=(12,2.5))

fig.patch.set_facecolor(bg_color)
ax.set_facecolor(bg_color)

ax.barh(
    y=1, width=public_count,
    color=cmap0[1], alpha=0.75,lw=1, edgecolor='white'
)
ax.barh(
    y=1, width=non_public_count, left=public_count,
    color=cmap1[1], alpha=0.25, lw=1, edgecolor='white'
)

ax.axis('off')

ax.annotate(
    text=f"#{public_count}",
    xy=((public_count/2),1.05),
    va='center', ha='center',
    fontsize=36, fontweight='bold', fontfamily='serif',
    color='#fff'
)

ax.annotate(
    text='Public Students',
    xy=((public_count/2),0.85),
    va='center', ha='center',
    fontsize=16, fontstyle='italic', fontfamily='serif',
    color='#fff'
)

ax.annotate(
    text=f"#{non_public_count}",
    xy=((public_count+non_public_count-non_public_count/2),1.05),
    va='center', ha='center',
    fontsize=36, fontweight='bold', fontfamily='serif',
    color=txt_color 
)

ax.annotate(
    text='Non-Public Students',
    xy=((public_count+non_public_count-non_public_count/2),0.85),
    va='center', ha='center',
    fontsize=16, fontstyle='italic', fontfamily='serif',
    color=txt_color
)

fig.text(
    s=':School Types',
    x=0, y=1.25,
    fontsize=17, fontweight='bold',
    color=txt_color, 
    va='top', ha='left'
)

fig.text(
    s='''
    by number of students
    ''',
    x=0, y=1.2,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

l1 = get_line(x=[0.7,0.7],y=[0,1], lw=3, alpha=1)
fig.lines.extend([l1])

plt.show()

In [None]:
fig, ax = plt.subplots(2, 1, sharex=True, figsize=(12,6))

fig.patch.set_facecolor(bg_color)
fig.subplots_adjust(hspace=-0.75)

sns.kdeplot(
    data=df[df['school_type']=='Public'],
    x='posttest',
    fill=True,
    color=cmap0[0],
    edgecolor='w',
    lw=2,
    alpha=1,
    ax=ax[0]
)

sns.kdeplot(
    data=df[df['school_type']!='Public'],
    x='posttest',
    fill=True,
    color=cmap0[1],
    edgecolor='w',
    lw=2,
    alpha=1,
    ax=ax[1]
)

for idx, axis in enumerate(ax):
    axis.axhline(lw=5, color=cmap0[idx])
    axis.set_xlabel('')
    axis.set_ylabel('')
    axis.set_yticks([])
    axis.tick_params(length=0, labelcolor=txt_color)
    axis.patch.set_alpha(0)
    despine_ax(axis)

fig.text(
    s=':School Types vs. Scores',
    x=0.1, y=1.2,
    fontsize=17, fontweight='bold',
    color=txt_color, 
    va='top', ha='left'
)

fig.text(
    s='''
    the average score of a public school student is
    lower (~12 points) than the average score of a student 
    attending a non-public school. However, we can see a 
    small group of public students who score above average.
    ''',
    x=0.1, y=1.18,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

fig.text(
    s='Public',
    x=0.125, y=0.305,
    fontsize=11, fontweight='bold',
    color=cmap0[0],
    va='top', ha='left'
)

fig.text(
    s='Non-Public',
    x=0.125, y=0.155,
    fontsize=11, fontweight='bold',
    color=cmap0[1],
    va='top', ha='left'
)

# pick the numeric column first so .mean() never touches the text columns
mean_public = df.loc[df['school_type'] == 'Public', 'posttest'].mean()
mean_non_public = df.loc[df['school_type'] != 'Public', 'posttest'].mean()

l1 = get_line(x=[0.75,0.75], y=[0.70,0.85])
fig.lines.extend([l1])

fig.text(
    s=f"Public Mean: {np.round(mean_public,1)}\nNon-Public Mean: {np.round(mean_non_public,1)}",
    x=0.76, y=0.81,
    fontsize=9, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

plt.show()
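
A gap of ~12 points is easier to judge when it is set against the spread of the scores. One common way to do that is Cohen's d, the difference in means divided by the pooled standard deviation. The sketch below is my own addition with made-up scores; `cohens_d` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import numpy as np

def cohens_d(a, b):
    """Standardised mean difference using the pooled sample standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Made-up scores for illustration only (means 78 and 66, i.e. a 12 point gap)
non_public = [70, 74, 78, 82, 86]
public = [58, 62, 66, 70, 74]

d = cohens_d(non_public, public)
print(round(d, 2))
```

On the real data the two inputs would be `df.loc[df['school_type'] != 'Public', 'posttest']` and `df.loc[df['school_type'] == 'Public', 'posttest']`.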

#### <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>:: Teaching Method</div>

In [None]:
df_teaching_method = df.groupby('teaching_method').count()['student_id']

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12,6))
fig.patch.set_facecolor(bg_color)

ax[0].barh(
    y=df_teaching_method.index,
    width=df_teaching_method.values, 
    height=0.75,
    color=cmap0[1]
)

ax[1] = sns.boxplot(
    data=df,
    x='posttest',
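    y='teaching_method',
    order=df_teaching_method.index[::-1].tolist(),  # top-to-bottom, so the boxes line up with the bars on the left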
    palette=sns.color_palette("ch:start=.2,rot=-.3"),
    linewidth=1,
    showmeans=True,
    meanprops=dict(markerfacecolor='white', markeredgecolor='white', marker='x'),
    boxprops=dict(edgecolor='white'),
    medianprops=dict(color='white'),
    whiskerprops=dict(color=txt_color, ls=':'),
    capprops=dict(ls=':'),
    flierprops=dict(markersize=2, marker='D'),
    ax=ax[1],
)

for idx in range(0, len(df_teaching_method.index)):
    value = df_teaching_method.values[idx]
    ratio = np.round(value/sum(df_teaching_method.values)*100,1)

    ax[0].annotate(
        text=f"{value} #\n{ratio} %",
        xy=(value-100,idx),
        va='center', ha='right',
        fontsize=11, fontstyle='italic',
        color='white'
    )

for axis in ax:
    axis.tick_params(length=0, labelcolor=txt_color)
    axis.set_facecolor(bg_color)
    axis.set_xlabel('')
    axis.set_ylabel('')

despine_ax(ax[0])
despine_ax(ax[1], ['top','left','right'])

ax[0].set_xticks([])
ax[0].set_yticklabels(df_teaching_method.index, rotation=90, va='center')
ax[1].set_yticks([])

l1 = get_line(x=[0.52,0.52], y=[0.2,0.8])
fig.lines.extend([l1])

fig.text(
    s=':Teaching Method',
    x=0.1, y=1.1,
    fontsize=17, fontweight='bold',
    color=txt_color, 
    va='top', ha='left'
)

fig.text(
    s='''
    64.4 % of the students are taught by conventional methods.
    The experimental teaching methods however perform better
    (~10 points on average) in terms of test-scores.
    ''',
    x=0.1, y=1.08,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

fig.text(
    s='''
    teaching method vs. scores
    ''',
    x=0.505, y=0.75,
    rotation=90,
    fontsize=7, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

plt.show()

#### <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>:: Number of Students</div>

In [None]:
fig, ax = plt.subplots(tight_layout=True, figsize=(12,6))

fig.patch.set_facecolor(bg_color)
ax.set_facecolor(bg_color)

ax = sns.kdeplot(
    data=df,
    x='n_student',
    fill=True,
    color=cmap0[0],
    edgecolor='black',
    lw=1,
    alpha=0.8,
    ax=ax
)

despine_ax(ax, ['top','left','right'])
ax.tick_params(length=0, labelcolor=txt_color)
ax.set_xlabel('')
ax.set_ylabel('')
ax.set_yticks([])

mean = df['n_student'].mean()
n_min = df['n_student'].min()
n_max = df['n_student'].max()

fig.text(
    s=':Number of Students - Distribution',
    x=0, y=1.2,
    fontsize=17, fontweight='bold',
    color=txt_color, 
    va='top', ha='left'
)

fig.text(
    s='''
    the average number of students per class is ~23,
    while ranging from a minimum of 14 to a maximum of 31
    students. Furthermore we can spot two 'size-clusters'.
    One around 23, the other at about 27 students per class.
    ''',
    x=0, y=1.18,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

l1 = get_line(x=[0.75,0.75], y=[0.71,0.86])
fig.lines.extend([l1])

fig.text(
    s=f"{int(np.round(mean,0))}",
    x=0.76, y=0.85,
    fontsize=52, fontstyle='italic', fontfamily='serif',
    color=txt_color,
    va='top', ha='left'
)

fig.text(
    s='avg. students\nper class',
    x=0.84, y=0.835,
    fontsize=9, fontstyle='italic', fontfamily='serif',
    color=txt_color,
    va='top', ha='left'
)

fig.text(
    s=f"Range: {int(n_min)}-{int(n_max)}",
    x=0.84, y=0.78,
    fontsize=7, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

plt.show()

In [None]:
# prepare df
by_type = df.groupby('school_type')['n_student'].mean()
by_setting = df.groupby('school_setting')['n_student'].mean()
df_students_means = pd.concat([by_type, by_setting]).sort_values(ascending=True).reset_index()
df_students_means['y_coords'] = [1.5 if i%2==0 else 0.5 for i in range(0, len(df_students_means))]

# plot
fig, ax = plt.subplots(tight_layout=True, figsize=(12,6))

fig.patch.set_facecolor(bg_color)
ax.set_facecolor(bg_color)

ax.set_ylim(0,2)

ax.axhline(
    y=1, zorder=0, color=txt_color
)

sns.scatterplot(
    data=df_students_means, 
    x='n_student', y='y_coords',
    s=3e3, linewidth=1, alpha=0.75, zorder=1,
    edgecolor='w', facecolor=cmap0[0], ax=ax
)

sns.scatterplot(
    data=df_students_means, 
    x='n_student', y=1,
    s=50, linewidth=1, zorder=1,
    edgecolor=txt_color, facecolor=bg_color, ax=ax
)

for idx in range(0, len(df_students_means['n_student'])):
    name = df_students_means['index'][idx]
    x = df_students_means['n_student'][idx]
    y = df_students_means['y_coords'][idx]

    ax.annotate(
        text=f"{name}\n{int(np.round(x))}",
        xy=(x,y),
        va='center', ha='center',
        fontsize=7, fontweight='bold',
        color='white'
    )

    ax.vlines(
        x=x,
        ymin=min(1,y+0.06),
        ymax=max(1,y-0.06),
        color=txt_color,
        lw=1, zorder=0, alpha=0.25, ls='--'
    )

ax.axis('off')

fig.text(
    s=':Number of Students vs. School Type & Setting',
    x=0, y=1.2,
    fontsize=17, fontweight='bold',
    color=txt_color, 
    va='top', ha='left'
)

fig.text(
    s='''
    From the graph below we can tell that Urban and Public
    schools have the most students per class on average,
    whereas Non-Public schools have the fewest students
    per class on average.
    ''',
    x=0, y=1.18,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

plt.show()

In [None]:
df_student_scores = df.groupby('n_student')['posttest'].mean()

# plot
fig, ax = plt.subplots(tight_layout=True, figsize=(12,6))
fig.patch.set_facecolor(bg_color)
ax.set_facecolor(bg_color)

ax.fill_between(x=df_student_scores.index, y1=0, y2=df_student_scores.values, color=cmap0[1], alpha=0.9)

ax.set_xlim(df_student_scores.index.min()-0.5, df_student_scores.index.max()+0.5)
ax.set_ylim(0, df_student_scores.values.max()+10)

sns.scatterplot(
    x=df_student_scores.index, 
    y=df_student_scores.values, 
    s=200, linewidth=1, zorder=1,
    edgecolor=cmap0[0], facecolor=bg_color, ax=ax
)

ax.set_xlabel('')
ax.set_ylabel('')
ax.tick_params(length=0, labelcolor=txt_color)
despine_ax(ax,['top','left','right'])

fig.text(
    s=':Number of Students vs. Score (avg)',
    x=0, y=1.2,
    fontsize=17, fontweight='bold',
    color=txt_color, 
    va='top', ha='left'
)

fig.text(
    s='''
    the average score declines when the number
    of students per class rises. The optimal size
    seems to be between 16-18 students per class.
    ''',
    x=0, y=1.18,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

plt.show()
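
The downward trend can be summarised in a single number with a correlation coefficient. The sketch below uses invented class-level averages and computes both Pearson's r and Spearman's rank correlation, the latter as the Pearson correlation of the ranks. Keep in mind that class size is tied to school type and setting (see the previous plot), so a negative value does not by itself show that smaller classes cause better scores.

```python
import pandas as pd

# Invented class-level aggregates, for illustration only.
classes = pd.DataFrame({
    'n_student': [16, 18, 20, 22, 24, 26, 28, 30],
    'mean_posttest': [79.0, 81.0, 72.0, 70.0, 66.0, 67.0, 60.0, 58.0],
})

# Linear association
pearson_r = classes['n_student'].corr(classes['mean_posttest'])

# Monotonic association: Spearman's rho is Pearson's r computed on the ranks
spearman_r = classes['n_student'].rank().corr(classes['mean_posttest'].rank())

print(round(pearson_r, 3), round(spearman_r, 3))
```

With the notebook's data, `df_student_scores.index` and `df_student_scores.values` would take the place of the two toy columns.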

#### <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>:: Gender</div>

In [None]:
fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, tight_layout=False, figsize=(12,6))

ax0.bar(
    x=df['gender'].value_counts().index,
    height=df['gender'].value_counts().values,
    color=cmap0[0],
    alpha=0.8
)

for idx in range(0, len(df['gender'].unique())):
    x = df['gender'].value_counts().index[idx]
    y = df['gender'].value_counts().values[idx]
    ax0.annotate(
        text=f"#{y}",
        xy=(x,y-10),
        va='top', ha='center',
        fontsize=11, fontstyle='italic',
        color='white'
    )

ax1 = sns.boxplot(
    data=df,
    x='posttest',
    y='gender',
    palette=sns.color_palette("ch:start=.2,rot=-.3"),
    linewidth=1,
    showmeans=True,
    meanprops=dict(markerfacecolor='white', markeredgecolor='white', marker='x'),
    boxprops=dict(edgecolor='white'),
    medianprops=dict(color='white'),
    whiskerprops=dict(color=txt_color, ls=':'),
    capprops=dict(ls=':'),
    flierprops=dict(markersize=2, marker='D'),
    ax=ax1,
)

fig.patch.set_facecolor(bg_color)
fig.subplots_adjust(wspace=0.25)

ax0.set_facecolor(bg_color)
ax1.set_facecolor(bg_color)

despine_ax(ax0, ['top','left','right'])
despine_ax(ax1, ['top','left','right'])

ax0.tick_params(axis='both',length=0, labelcolor=txt_color)
ax0.set_yticks([])

ax1.set_xlabel('')
ax1.set_ylabel('')
ax1.set_yticklabels(df['gender'].unique(), rotation=90, va='center', ha='center')
ax1.tick_params(axis='both',length=0, labelcolor=txt_color)

fig.text(
    s=':Gender',
    x=0.1, y=1.1,
    fontsize=17, fontweight='bold',
    color=txt_color, 
    va='top', ha='left'
)

fig.text(
    s='''
    the numbers of male & female students are nearly the same.
    Moreover, gender doesn't play an essential role when
    it comes to the final test scores.
    ''',
    x=0.1, y=1.08,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

fig.text(
    s='''
    gender vs. scores
    ''',
    x=0.505, y=0.75,
    rotation=90,
    fontsize=7, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

l1 = get_line(x=[0.52,0.52], y=[0.45,0.8])
fig.lines.extend([l1])

plt.show()

#### <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>:: Lunch</div>

In [None]:
fig, ax = plt.subplots(tight_layout=True, figsize=(12,2.5))
fig.patch.set_facecolor(bg_color)

labels = df['lunch'].value_counts().index
values = df['lunch'].value_counts().values

ax.barh(
    y=1, width=values[0], 
    color=cmap0[1], alpha=0.75,lw=1, edgecolor='white'
)
ax.barh(
    y=1, width= sum(values) - values[0], left=values[0],
    color=cmap1[1], alpha=0.25, lw=1, edgecolor='white'
)

ax.axis('off')

for idx in range(0,len(labels)):
    if idx == 0:
        x = values[idx] / 2
    else:
        x = (values[0] + values[idx]) - values[idx] / 2
        
    ax.annotate(
        text=f"#{values[idx]}",
        xy=(x,1.05),
        va='center', ha='center',
        fontsize=36, fontweight='bold', fontfamily='serif',
        color='#fff'
    )
    ax.annotate(
        text=f"{labels[idx]}",
        xy=(x,0.85),
        va='center', ha='center',
        fontsize=16, fontstyle='italic', fontfamily='serif',
        color='#fff'
    )

fig.text(
    s=':Lunch',
    x=0, y=1.25,
    fontsize=17, fontweight='bold',
    color=txt_color, 
    va='top', ha='left'
)

fig.text(
    s='''
    ~57% of the students do not qualify
    for a reduced or free lunch.
    ''',
    x=0, y=1.2,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

l1 = get_line(x=[0.54,0.54], y=[0,1], lw=3, alpha=1)
fig.lines.extend([l1])

plt.show()

In [None]:
fig, ax = plt.subplots(2, 1, sharex=True, figsize=(12,6))

fig.patch.set_facecolor(bg_color)
fig.subplots_adjust(hspace=-0.75)

sns.kdeplot(
    data=df[df['lunch']=='Does not qualify'],
    x='posttest',
    fill=True,
    color=cmap0[0],
    edgecolor='w',
    lw=2,
    alpha=1,
    ax=ax[0]
)

sns.kdeplot(
    data=df[df['lunch']!='Does not qualify'],
    x='posttest',
    fill=True,
    color=cmap0[1],
    edgecolor='w',
    lw=2,
    alpha=1,
    ax=ax[1]
)

for idx, axis in enumerate(ax):
    axis.axhline(lw=5, color=cmap0[idx])
    axis.set_xlabel('')
    axis.set_ylabel('')
    axis.set_yticks([])
    axis.tick_params(length=0, labelcolor=txt_color)
    axis.patch.set_alpha(0)
    despine_ax(axis)

fig.text(
    s=':Lunch vs. Scores',
    x=0.1, y=1.2,
    fontsize=17, fontweight='bold',
    color=txt_color, 
    va='top', ha='left'
)

fig.text(
    s='''
    students who do not qualify for a free lunch score
    ~17 points higher on average than students who 
    qualify for a free lunch.
    ''',
    x=0.1, y=1.18,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

fig.text(
    s='Does not qualify',
    x=0.125, y=0.305,
    fontsize=11, fontweight='bold',
    color=cmap0[0],
    va='top', ha='left'
)

fig.text(
    s='Does qualify',
    x=0.125, y=0.155,
    fontsize=11, fontweight='bold',
    color=cmap0[1],
    va='top', ha='left'
)

# pick the numeric column first so .mean() never touches the text columns
mean_not_qualified = df.loc[df['lunch'] == 'Does not qualify', 'posttest'].mean()
mean_qualified = df.loc[df['lunch'] != 'Does not qualify', 'posttest'].mean()

l1 = get_line(x=[0.72,0.72], y=[0.70,0.85])
fig.lines.extend([l1])

fig.text(
    s=f"Not-Qualified Mean: {np.round(mean_not_qualified,1)}\nQualified Mean: {np.round(mean_qualified,1)}",
    x=0.73, y=0.81,
    fontsize=9, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

plt.show()

#### <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>:: Pre-Test</div>

In [None]:
fig, (ax0, ax1) = plt.subplots(2, 1, tight_layout=True, sharex=True, figsize=(12,6))
fig.patch.set_facecolor(bg_color)

mean = df['pretest'].mean()
median = df['pretest'].median()

ax0.boxplot(
    data=df, x='pretest',
    vert=False, patch_artist=True,
    boxprops=dict(facecolor=cmap0[1], lw=0, alpha=0.75),
    whiskerprops=dict(color='gray', lw=1, ls='--'),
    capprops=dict(color='gray', lw=1, ls='--'),
    medianprops=dict(color='#fff', lw=0),
    flierprops=dict(markerfacecolor=cmap0[0],alpha=0.75),
    zorder=0
)

ax1 = sns.kdeplot(
    data=df, x='pretest', fill=True,
    color=cmap0[0], edgecolor='#000', lw=1, 
    zorder=0, alpha=0.8, ax=ax1
)

ax0.axvline(x=mean, ymin=0.4, ymax=0.6, color=bg_color, ls=':', zorder=1, label='mean')
ax1.axvline(x=mean, ymin=0, ymax=0.9, color=bg_color, ls=':', zorder=1)

ax0.axvline(x=median, ymin=0.4, ymax=0.6, color=bg_color, ls='--', zorder=1)
ax1.axvline(x=median, ymin=0, ymax=0.9, color=bg_color, ls='--', zorder=1)

ax0.axis('off')
ax0.set_facecolor(bg_color)

ax1.set_ylabel('')
ax1.set_xlabel('')
ax1.set_yticks([])
ax1.tick_params(length=0)
ax1.set_facecolor(bg_color)

despine_ax(ax1, ['top','left','right'])

fig.text(
    s=':Pretest - Distribution',
    x=0, y=1.05,
    fontsize=17, fontweight='bold',
    color=txt_color, 
    va='top', ha='left'
)

fig.text(
    s='''
    from the plot below we can tell
    that the students scored ~12 points less on average
    in the pretest than they did in the posttest.
    ''',
    x=0, y=1.02,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

fig.text(
    s=f"Mean: {np.round(mean,1)}\nMedian: {np.round(median,1)}",
    x=0.56, y=0.925,
    fontsize=9, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

l1 = get_line(x=[0.55,0.55], y=[0.85,0.95])
fig.lines.extend([l1])

plt.show()

In [None]:
fig, ax = plt.subplots(tight_layout=True, figsize=(12,6))

fig.patch.set_facecolor(bg_color)
ax.set_facecolor(bg_color)

sns.scatterplot(
    data=df, 
    x='pretest',
    y='posttest',
    color=cmap0[1],
    alpha=0.65,
    s=5*df['posttest'],
    ax=ax
)

despine_ax(ax,['top','right'])
ax.tick_params(length=0, labelcolor=txt_color)
ax.set_xlabel('')
ax.set_ylabel('')

fig.text(
    s=':Pretest vs. Posttest',
    x=0, y=1.15,
    fontsize=17, fontweight='bold',
    color=txt_color, 
    va='top', ha='left'
)

fig.text(
    s='''
    there is a strong linear relationship
    between the two scores, with a correlation
    coefficient of 0.95.
    ''',
    x=0, y=1.13,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

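# Pearson correlation between the two score columns, selected by name
# rather than by position in the correlation matrix
corr = df['pretest'].corr(df['posttest'])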

fig.text(
    s=f"{np.round(corr,2)}",
    x=0.7, y=0.4,
    fontsize=52, fontfamily='serif',
    color=txt_color,
    va='top', ha='left'
)

fig.text(
    s='Correlation Coefficient',
    x=0.69, y=0.29,
    fontsize=11, fontfamily='serif',
    color=txt_color,
    va='top', ha='left'
)

l1 = get_line(x=[0.675,0.675], y=[0.225,0.425])
fig.lines.extend([l1])

plt.show()
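
The coefficient shown in the plot above is the Pearson correlation between the two score columns. As a minimal sketch of what that number measures, the cell below computes it by hand on a few made-up score pairs. The values are illustrative and not taken from the dataset, and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations from the two means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up pretest / posttest pairs, for illustration only
pre = [40, 50, 60, 70, 80]
post = [52, 63, 70, 84, 91]

r = pearson_r(pre, post)
print(f"Pearson r: {r:.3f}")
```

A value close to 1 means the points lie almost on a rising straight line, which is what the scatterplot shows for pretest and posttest.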

## <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>Modeling</div>

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import make_column_transformer

# prepare dataset and drop id columns
X = df.drop(columns=['student_id','posttest']).copy()
y = df['posttest'].copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1, shuffle=True)

# get all numerical & categorical features
# (select_dtypes also picks up integer columns, which `dtype == 'float'` would skip)
num_cols = X_train.select_dtypes(include='number').columns.tolist()
cat_cols = X_train.select_dtypes(exclude='number').columns.tolist()

# build transformer
preprocessing = make_column_transformer(
    (StandardScaler(),num_cols),
    (OrdinalEncoder(),cat_cols)
)

# preprocess data
X_train = preprocessing.fit_transform(X_train)
X_test = preprocessing.transform(X_test)

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor

results = dict()

models = [
    ('Linreg',LinearRegression()),
    ('Lasso',Lasso()),
    ('Ridge',Ridge()),
    ('ElasticNet',ElasticNet()),
    ('LGBM',LGBMRegressor()),
    ('CATB',CatBoostRegressor(verbose=0)),
    ('XGB',XGBRegressor(verbosity=0)),
]

for name, model in models:
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    # take the root of the MSE explicitly; `squared=False` is deprecated in newer scikit-learn
    rmse = np.sqrt(mean_squared_error(y_test, y_hat))
    results[name] = rmse
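
The loop above scores every model with the root mean squared error (RMSE): the square root of the average squared difference between prediction and truth, expressed in the same unit as the test score. Below is a minimal hand-written sketch on made-up numbers. `rmse_by_hand` is a hypothetical helper, not part of scikit-learn.

```python
import math

def rmse_by_hand(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up scores and predictions, for illustration only
demo_true = [60, 70, 80, 90]
demo_pred = [62, 67, 80, 94]

demo_rmse = rmse_by_hand(demo_true, demo_pred)
print(f"RMSE: {demo_rmse:.2f}")
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones.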

In [None]:
df_results = pd.DataFrame([results], index=['RMSE']).transpose()

fig, ax = plt.subplots(tight_layout=True, figsize=(12,6))
fig.patch.set_facecolor(bg_color)
ax.set_facecolor(bg_color)

ax.barh(
    y=df_results.index,
    width=df_results['RMSE'], 
    height=0.75,
    color=cmap0[1]
)

for idx in range(len(df_results)):
    # positional lookup, since the index holds the model names
    x = df_results['RMSE'].iloc[idx]
    ax.annotate(
        f"RMSE: {np.round(x,2)}",
        xy=(x-0.2,idx),
        va='center', ha='right',
        fontsize=9, fontstyle='italic',
        color='white'
    )

despine_ax(ax)
ax.set_ylabel('')
ax.set_xlabel('')
ax.set_xticks([])
ax.tick_params(length=0, labelcolor=txt_color)

fig.text(
    s=':Model Evaluation',
    x=0, y=1.15,
    fontsize=17, fontweight='bold',
    color=txt_color, 
    va='top', ha='left'
)

fig.text(
    s='''
    with default parameters,
    CatBoost & LGBM perform best.
    ''',
    x=0, y=1.13,
    fontsize=11, fontstyle='italic',
    color=txt_color,
    va='top', ha='left'
)

plt.show()

## <div style='background:#2b6684;color:white;padding:0.5em;border-radius:0.2em'>Conclusion</div>

I had a lot of fun working with this dataset and tried to put a special emphasis on the EDA.

The most important features when it comes to predicting the score seem to be:
* the pretest score
* the school_type & setting
* the teaching_method
* the size or number of students per class
* whether the student qualifies for 'free lunch'

Boiled down, one can conclude the following:
if you're a student in a suburban, non-public school,
taught with experimental methods in a small classroom,
and come from a financially well-off background, you should do well!

Gender does not appear to contribute to a higher or lower score.

Things to improve ...
I only compared some baseline models; from here on, one should tune the hyperparameters to achieve a better RMSE.
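
As a rough sketch of that next step, the cell below tunes a single hyperparameter with scikit-learn's `GridSearchCV`. It runs on synthetic data from `make_regression` so that it is self-contained. The `Ridge` model, the `alpha` grid, and the 5-fold setting are illustrative assumptions, not recommendations for this dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: NOT the test-score dataset)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=1
)

# Candidate values for the regularisation strength (illustrative grid)
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(
    estimator=Ridge(),
    param_grid=param_grid,
    scoring='neg_root_mean_squared_error',  # scikit-learn maximises, hence the sign
    cv=5,
)
search.fit(X_tr, y_tr)

best_alpha = search.best_params_['alpha']
cv_rmse = -search.best_score_          # mean RMSE across the CV folds
test_rmse = -search.score(X_te, y_te)  # RMSE of the refitted best model on held-out data

print(f"best alpha: {best_alpha}")
print(f"CV RMSE: {cv_rmse:.2f} | test RMSE: {test_rmse:.2f}")
```

The same pattern should carry over to the boosted models by swapping the estimator and the grid, although larger grids get expensive quickly and a randomized search may be the more practical choice there.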

<div style='border-radius:3px;background:#b1d3e3;padding:2em;text-align:left;font-family:monospace;font-weight:light;font-size:1.1em;color:black'>
    <b>Thanks for checking out my notebook!</b><br>
    Feel free to leave a comment, a suggestion, an upvote or just a simple message to say hello :)
</div>