![](https://www.ft.com/__origami/service/image/v2/images/raw/http%3A%2F%2Fcom.ft.imagepublish.upp-prod-eu.s3.amazonaws.com%2Fcaa74138-69d2-11ea-a6ac-9122541af204?fit=scale-down&source=next&width=500)

<a id="top"></a>
    
<h3 class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Table of Contents</h3>

    
* [1. Introduction](#1)


[Part-1: EDA]
* [2. Setup](#2)
* [3. Is data balanced?](#3)
    - [3.1 How is gender distributed in other categories?](#3.1)
* [4. Test scores distribution](#4)
* [5. Initial observations](#5)
    - [5.1 Who benefited from test preparation course the most?](#5.1)
    - [5.2 What is the effect of lunch type on test results?](#5.2)
* [6. Test score correlation](#6)
* [7. Correlation matrix](#7)
* [8. Conclusions: EDA](#8)

[Part-2: Model]

* [9. Modeling and prediction](#9)
* [10. Conclusions: Model](#10)
* [11. References](#11)
    


<a id="1"></a>
<font color="lightseagreen" size=+2.5><b>1. Introduction</b></font>

This notebook explores the exam scores of students at a public school. Three test subjects are considered: math, reading, and writing. The dataset also includes a variety of personal, social, and economic factors that may interact with the scores. We can divide the features of the dataset into two groups.

**Inherent attributes** (those we do not have control over): 
- Gender, race or ethnicity

**Acquired attributes** (those we can have control over):
- Test score, test preparation course, lunch type, and parental education level

## The questions we would like to answer using this dataset include:
- How effective is the test preparation course?
- Which major factors contribute to test outcomes?
- What would be the best way to improve student scores on each test?

**Remark**: The author of the Kaggle dataset published only one download (1,000 rows) from the source data generator function (see the references). I downloaded more samples, so this notebook has 10,000 additional rows. The extra data can be helpful, especially if modeling is part of the project.

# Part-1: EDA
<a id="2"></a>
<font color="lightseagreen" size=+2.5><b>2. Setup</b></font>

In [None]:
import numpy as np 
from scipy import stats
import pandas as pd
import seaborn as sns
import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
import plotly.express as pex
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

import warnings
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
data = pd.read_csv(r'/kaggle/input/students-performance-in-exams/StudentsPerformance.csv')
data0 = pd.read_csv(r'/kaggle/input/more-exam-data/exams.csv')
data1 = pd.read_csv(r'/kaggle/input/more-exam-data/exams (1).csv')
data2 = pd.read_csv(r'/kaggle/input/more-exam-data/exams (2).csv')
data3 = pd.read_csv(r'/kaggle/input/more-exam-data/exams (3).csv')
data4 = pd.read_csv(r'/kaggle/input/more-exam-data/exams (4).csv')
data5 = pd.read_csv(r'/kaggle/input/more-exam-data/exams (5).csv')
data6 = pd.read_csv(r'/kaggle/input/more-exam-data/exams (6).csv')
data7 = pd.read_csv(r'/kaggle/input/more-exam-data/exams (7).csv')
data8 = pd.read_csv(r'/kaggle/input/more-exam-data/exams (8).csv')
data9 = pd.read_csv(r'/kaggle/input/more-exam-data/exams (9).csv')

In [None]:
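# ignore_index=True gives the combined frame a fresh 0..n-1 index instead of repeating each file's row labels
data_total = pd.concat([data,data0,data1,data2,data3,data4,data5,data6,data7,data8,data9], axis=0, ignore_index=True)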

In [None]:
data = data_total
data.info()

In [None]:
data.head()

In [None]:
data['average_score'] = (data['math score'] + data['reading score'] + data['writing score'])/3

In [None]:
df_mod = data.copy()
data.head()

In [None]:
data.describe()

<a id="3"></a>
<font color="lightseagreen" size=+2.5><b>3. Is data balanced?</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

Is the data balanced with respect to the categorical features?

- Gender: fairly balanced
- Race/ethnicity: group C is the most represented with 31.9%; group A is the least with 8.1%
- Parental education level: parents with a master's degree are the least represented, followed by parents with a bachelor's degree. The rest are in the same ballpark
- Lunch type: 65.2% of the students have the standard lunch
- Test preparation course: 65.5% of the students have not taken the test preparation course
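
The shares quoted above can also be read off numerically rather than from the pie charts. The following is a minimal sketch on a made-up toy frame; `toy` and `category_shares` are hypothetical names, and the category labels are assumed to match those in the real `data` frame.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame (values are made up).
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "lunch": ["standard", "standard", "free/reduced", "standard"],
})

def category_shares(frame, column):
    """Return the percentage of rows in each category of `column`."""
    return (frame[column].value_counts(normalize=True) * 100).round(1)

lunch_shares = category_shares(toy, "lunch")
print(lunch_shares)
```

Calling `category_shares(data, col)` for each categorical column should give the same percentages that the pie charts display.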

In [None]:
#colors = ['lightgray', 'Rebeccapurple','gold','royalblue','lightseagreen','lightsalmon']
colors = ['gold', 'mediumturquoise', 'darkorange', 'lightgreen', 'black', 'Gray']

fig = make_subplots(rows=3, cols=2,
                    specs=[[{'type':'domain'}, {'type':'domain'}],
                           [{'type':'domain'}, {'type':'domain'}], 
                           [{'type':'domain'}, {'type':'domain'}]])


fig.add_trace(
    go.Pie(
        labels=data['gender'],
        values=None,#scalegroup='one',
        hole=.4,
        title='Gender',
        titlefont={'color':'black', 'size': 24},
        ),
    row=1,col=1
    )
fig.update_traces(
    hoverinfo='label+value',
    textinfo='label+percent',
    textfont_size=12,
    marker=dict(
        colors=colors, #['lightseagreen', 'lightsalmon'], 
        line=dict(color='#000000',
                  width=2)
        )
    )

fig.add_trace(
    go.Pie(
        labels=data['race/ethnicity'],
        values=None,#scalegroup='one',
        hole=.4,
        title='Race',
        titlefont={'color':'black', 'size': 24},
        ),
    row=1,col=2
    )
fig.update_traces(
    hoverinfo='label+value',
    textinfo='label+percent',
    textfont_size=12,
    marker=dict(
        colors=colors,#[0:6],
        line=dict(color='#000000',
                  width=2)
        )
    )

fig.add_trace(
    go.Pie(
        labels=data['parental level of education'],
        values=None,#scalegroup='one',
        hole=.4,
        title='ParentEduc.',
        titlefont={'color':'black', 'size': 24},
        ),
    row=2,col=1
    )
fig.update_traces(
    hoverinfo='label+value',
    textinfo='label+percent',
    textfont_size=12,
    marker=dict(
        colors=colors,
        line=dict(color='#000000',
                  width=2)
        )
    )

fig.add_trace(
    go.Pie(
        labels=data['lunch'],
        values=None,#scalegroup='one',
        hole=.4,
        title='Lunch',
        titlefont={'color':'black', 'size': 24},
        ),
    row=2,col=2
    )
fig.update_traces(
    hoverinfo='label+value',
    textinfo='label+percent',
    textfont_size=12,
    marker=dict(
        colors=colors, #['lightseagreen', 'lightsalmon'],
        line=dict(color='#000000',
                  width=2)
        )
    )

fig.add_trace(
    go.Pie(
        labels=data['test preparation course'],
        values=None,#scalegroup='one',
        hole=.4,
        title='TestPrep.',
        titlefont={'color':'black', 'size': 24},
       ),
    row=3,col=1
    )
fig.update_traces(
    hoverinfo='label+value',
    textinfo='label+percent',
    textfont_size=12,
    marker=dict(
        colors=colors,#['lightseagreen', 'lightsalmon'],
        line=dict(color='#000000',
                  width=2)
        )
    )
fig.layout.update(title="Independent Features Distribution", showlegend=False, height=850, width=750, 
                  template=None, titlefont={'color':'black', 'size': 24}
                 )
fig.show()


<a id="3.1"></a>
<font color="lightseagreen" size=+1.5><b>3.1. How is gender distributed in other categories?</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>
- The gender distribution within the individual categories of the other features is fairly even. The difference is within 1% for all categorical features except race groups B and D, which show up to 2% variation.
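
One way to check the gender split inside each category numerically is a row-normalised cross-tabulation. This is only a sketch on made-up data (the `toy` frame is hypothetical), and it shows a slightly different view from the histograms: the share of each gender within a category.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame (values are made up).
toy = pd.DataFrame({
    "race/ethnicity": ["group A", "group A", "group B", "group B", "group B", "group B"],
    "gender": ["female", "male", "female", "female", "female", "male"],
})

# Row-normalised crosstab: each row sums to 100%, giving the gender split inside each group.
gender_split = pd.crosstab(toy["race/ethnicity"], toy["gender"], normalize="index") * 100
print(gender_split.round(1))
```

The same call with `data["lunch"]` or `data["test preparation course"]` as the first argument would cover the other features.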

In [None]:
##### Parental education ####
df = data
fig = px.histogram(df, x="parental level of education", y=None, color="gender",
                width=600,
                histnorm='percent',
                category_orders={ 
                "parental level of education": ["some high school", "high school", "associate's degree",
                                            "some college", "bachelor's degree", "master's degree"],             
                "gender": ["male", "female"]
                },
                
                color_discrete_map={ 
                    "male": "RebeccaPurple", "female": "lightsalmon"
                },
                template="simple_white"
                )

fig.update_layout(title="Parental Level of Education", 
                  font_family="San Serif",
                  titlefont={'size': 20},
                  legend=dict(
                  orientation="v", y=1, yanchor="top", x=1.0, xanchor="right" )                 
                 )
fig.show()

##### race #####

fig = px.histogram(df, x="race/ethnicity", y=None, color="gender",
                width=600,
                histnorm='percent',
                category_orders={ 
                "race/ethnicity": ["group A", "group B", "group C", "group D", "group E"], 
                "gender": ["male", "female"]
                },
                color_discrete_map={ 
                    "male": "RebeccaPurple", "female": "lightsalmon"
                },
                template="simple_white"
                )

fig.update_layout(title="Race/Ethnic Group", 
                  font_family="San Serif",
                  titlefont={'size': 20},
                  legend=dict(
                  orientation="v", y=1, yanchor="top", x=1.0, xanchor="right" )                 
                 )
fig.show()

##### lunch #####

fig = px.histogram(df, x="lunch", y=None, color="gender",
                width=600,
                histnorm='percent',
                color_discrete_map={ 
                    "male": "RebeccaPurple", "female": "lightsalmon"
                },
                template="simple_white"
                )

fig.update_layout(title="Lunch Type",
                  font_family="San Serif",
                  titlefont={'size': 20},
                  legend=dict(
                  orientation="v", y=1, yanchor="top", x=1.0, xanchor="right" )                 
                 )
fig.show()


#### test prep ####
fig = px.histogram(df, x="test preparation course", y=None, color="gender",
                width=600,
                histnorm='percent',
                color_discrete_map={ 
                    "male": "RebeccaPurple", "female": "lightsalmon"
                },
                template="simple_white"
                )

fig.update_layout(title="Test Preparation Course",
                  font_family="San Serif",
                  titlefont={'size': 20},
                  legend=dict(
                  orientation="v", y=1, yanchor="top", x=1.0, xanchor="right" )                 
                 )
fig.show()



<a id="4"></a>
<font color="lightseagreen" size=+2.5><b>4. Test scores distribution</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>


In [None]:
data_m = data['math score']
data_r = data['reading score']
data_w = data['writing score']

fig = go.Figure()

fig.add_trace(go.Violin(x=data_m, line_color='salmon', name='Math'))
fig.add_trace(go.Violin(x=data_r, line_color='gold', name= 'Reading'))
fig.add_trace(go.Violin(x=data_w, line_color='lightseagreen', name='Writing'))

fig.update_traces(orientation='h', side='positive', width=3, points=False, meanline_visible=True)
fig.update_layout(xaxis_showgrid=True, xaxis_zeroline=False)

fig.update_layout(title='<b> Test score distribution <b>',
                  titlefont={'size': 24},
                  xaxis_title='Test scores', 
                  width=750,
                  #template="plotly_dark",
                  showlegend=False,
                  paper_bgcolor="lightgray",
                  plot_bgcolor='lightgray',             
                  font=dict(
                       color ='black', 
                    )
 )
fig.show()

In [None]:
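# Select several columns with a list (double brackets); tuple-style selection fails in recent pandas
df_genderAvg = df.groupby(["gender"])[['math score', 'reading score', 'writing score']].mean()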
df_genderAvg

<a id="5"></a>
<font color="lightseagreen" size=+2.5><b>5. Initial observations</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

- Female students performed better than male students.
- Free vs standard lunch: students who have the standard lunch packet performed better.
- Test preparation course helped increase test score.
- Students of race group E have the best overall performance, while those from race group A were at the lower end of the score spectrum.
- Students whose parents have a master's degree performed best, while students whose parents have 'some high school' education scored the lowest on average. Interestingly, though, for group A students, having parents with a master's degree didn't help that much.

In [None]:
df = data
total_average = data['average_score'].mean()
fig = px.box(df, y="average_score", x="race/ethnicity",color="gender",
             title='Average score vs Race',
             template='plotly_dark', width=650,height=450,
             category_orders={ 
            "race/ethnicity": ["group A", "group B", "group C", "group D", "group E"]},
#              color_discrete_map={ 
#             "male": "RebeccaPurple", "female": "lightsalmon"}             
            )

fig.add_shape( 
    type="line", line_color="yellow", line_width=3, opacity=1, line_dash="dot",
    x0=0, x1=1, xref="paper", y0=total_average, y1=total_average, yref="y"
)
fig.update_layout(
    paper_bgcolor="#232624",
    font=dict(
        color='white'))

fig.show()


##########

fig = px.box(df, y="average_score", x="lunch",color="gender",
             title='Average score vs Lunch',
             template='plotly_dark', width=650,height=450,
             category_orders={ 
            "race/ethnicity": ["group A", "group B", "group C", "group D", "group E"]
             }           
            )

fig.add_shape( 
    type="line", line_color="yellow", line_width=3, opacity=1, line_dash="dot",
    x0=0, x1=1, xref="paper", y0=total_average, y1=total_average, yref="y"
)
fig.update_layout(
    paper_bgcolor="#232624",
    font=dict(
        color='white'))

fig.show()

###########

fig = px.box(df, y="average_score", x="test preparation course",color="gender",
             title='Average score vs TestPrep.',
             template='plotly_dark', width=650,height=450,
             category_orders={ 
            "race/ethnicity": ["group A", "group B", "group C", "group D", "group E"]
             }           
            )

fig.add_shape( 
    type="line", line_color="yellow", line_width=3, opacity=1, line_dash="dot",
    x0=0, x1=1, xref="paper", y0=total_average, y1=total_average, yref="y"
)
fig.update_layout(
    paper_bgcolor="#232624",
    font=dict(
        color='white'))

fig.show()
#######################

fig = px.box(df, y="average_score", x="parental level of education", color="gender",
             title='Average score vs ParentEduc.',
             template='plotly_dark', width=650,height=450,
             category_orders={ 
            "race/ethnicity": ["group A", "group B", "group C", "group D", "group E"]
             }           
            )

fig.add_shape( 
    type="line", line_color="yellow", line_width=3, opacity=1, line_dash="dot",
    x0=0, x1=1, xref="paper", y0=total_average, y1=total_average, yref="y"
)
fig.update_layout(
    paper_bgcolor="#232624",
    font=dict(
        color='white'))

fig.show()

########################

fig = px.box(df, y="average_score", x="parental level of education", color="race/ethnicity",
             
             title='Average score vs ParentEduc. vs Race',
             template='plotly_dark',
             
             width=850,height=450,
             
             category_orders={ 
             "parental level of education": ["some high school", "high school", "associate's degree",
                                            "some college", "bachelor's degree", "master's degree"],
             
             "race/ethnicity": ["group A", "group B", "group C", "group D", "group E"]             
             }
             )

fig.add_shape( 
    type="line", line_color="yellow", line_width=3, opacity=1, line_dash="dot",
    x0=0, x1=1, xref="paper", y0=total_average, y1=total_average, yref="y"
)

fig.update_layout(
    paper_bgcolor="#202624",
    font=dict(
        color ='white', 
    )
 )


fig.show()

<a id="5.1"></a>
<font color="lightseagreen" size=+1.0><b>5.1 Who benefited from test preparation course the most?</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

- **Male students** benefited from the test preparation course more than female students did
- The **writing test score** is where the highest gain was achieved (10 points for male students)
- Male students from **groups B and C** showed the highest gain (writing score)
- Among race group B and C students, those whose parents had '**some college**' education got the biggest advantage from the test preparation course: male students had a 13-point gain, whereas female students gained 11 points.
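
The gains quoted above can be computed directly instead of being read off the box plots. The sketch below assumes the gains are differences between group medians and that the course column takes the values `none` and `completed`; the `toy` frame and the `prep_gain` helper are hypothetical.

```python
import pandas as pd

# Toy stand-in for the notebook's `df` frame (values are made up).
toy = pd.DataFrame({
    "gender": ["male", "male", "male", "male", "female", "female", "female", "female"],
    "test preparation course": ["none", "none", "completed", "completed"] * 2,
    "writing score": [60, 62, 70, 72, 70, 72, 76, 78],
})

def prep_gain(frame, score_col, by="gender"):
    """Median score of 'completed' minus median score of 'none', per group in `by`."""
    medians = (
        frame.groupby([by, "test preparation course"])[score_col]
        .median()
        .unstack("test preparation course")
    )
    return medians["completed"] - medians["none"]

gain = prep_gain(toy, "writing score")
print(gain)
```

Passing a list such as `by=["race/ethnicity", "gender"]` would need a small change to the helper, but the same median-difference idea applies to the finer breakdowns.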


In [None]:
# maths 
fig = px.box(df, 
                 x="test preparation course", y="math score", 
                 facet_col="gender",
                 color='gender',
                 template='simple_white',
                 width=750, height=350,
                 category_orders={
                     "race/ethnicity": ["group A", "group B", "group C", "group D", "group E"]
                     }                
                )

fig.update_layout(
    title="Test preparation course on math score",
    margin=dict(l=20, r=20, t=70, b=20, pad=1),
    paper_bgcolor="lightgray",
)
fig.show()

# reading 
fig = px.box(df, 
                 x="test preparation course", y="reading score", 
                 facet_col="gender",
                 color='gender',
                 template='simple_white',
                 width=750, height=350,
                 category_orders={
                     "race/ethnicity": ["group A", "group B", "group C", "group D", "group E"]
                     }                
                )

fig.update_layout(
    title="Test preparation course on reading score",
    margin=dict(l=20, r=20, t=70, b=20, pad=1),
    paper_bgcolor="lightgray",
)
fig.show()

# writing
fig = px.box(df, 
                 x="test preparation course", y="writing score", 
                 facet_col="gender",
                 color='gender',
                 template='simple_white',
                 width=750, height=350,
                 category_orders={
                     "race/ethnicity": ["group A", "group B", "group C", "group D", "group E"]
                     }                
                )

fig.update_layout(
    title="Test preparation course on writing score",
    margin=dict(l=20, r=20, t=70, b=20, pad=1),
    paper_bgcolor="lightgray",
)
fig.show()

# writing, race
fig = px.box(df, 
                 x="test preparation course", y="writing score", 
                 facet_col="race/ethnicity",
                 color='gender',
                 template='simple_white',
                 width=1500, height=350,
                 category_orders={
                     "race/ethnicity": ["group A", "group B", "group C", "group D", "group E"]
                     }                
                )

fig.update_layout(
    title="Test preparation course on writing score: variation across race group",
    margin=dict(l=20, r=20, t=70, b=20, pad=1),
    paper_bgcolor="lightgray",
)
fig.show()


# writing, race, parent education
DF = df[(df['race/ethnicity'] == 'group B') | (df['race/ethnicity'] == 'group C')]

fig = px.box(DF, 
                 x="test preparation course", y="writing score", 
                 facet_col="parental level of education",
                 color='gender',
                 template='simple_white',
                 width=2000, height=350,
                 category_orders={
                     "race/ethnicity": ["group A", "group B", "group C", "group D", "group E"]
                     }                
                )

fig.update_layout(
    title="Test preparation course on writing score: variation across ParentEduc (group B&C)",
    margin=dict(l=20, r=20, t=70, b=20, pad=1),
    paper_bgcolor="lightgray",
)
fig.show()

<a id="5.2"></a>
<font color="lightseagreen" size=+1.0><b>5.2 What is the effect of lunch type on test results?</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

- The **math score** was the test score most affected by lunch type for both male and female students. Both gender groups lost 12 points by having to resort to the free or reduced lunch.
- **Female** students of race **group A** were the most affected. Their math score suffered a whopping 15-point drop by not having the standard lunch.
- **Female** students of race **group C** were the second most affected, with a drop of 13 points.


In [None]:
# maths 
fig = px.box(df, 
                 x="lunch", y="math score", 
                 facet_col="gender",
                 color='gender',
                 template='simple_white',
                 width=900, height=400,
                 category_orders={
                     "race/ethnicity": ["group A", "group B", "group C", "group D", "group E"]
                     }                
                )

fig.update_layout(
    title="Lunch on math score",
    margin=dict(l=20, r=20, t=70, b=20, pad=1),
    paper_bgcolor="lightgray",
)
fig.show()

# reading 
fig = px.box(df, 
                 x="lunch", y="reading score", 
                 facet_col="gender",
                 color='gender',
                 template='simple_white',
                 width=900, height=400,
                 category_orders={
                     "race/ethnicity": ["group A", "group B", "group C", "group D", "group E"]
                     }                
                )

fig.update_layout(
    title="Lunch on reading score",
    margin=dict(l=20, r=20, t=70, b=20, pad=1),
    paper_bgcolor="lightgray",
)
fig.show()

# writing
fig = px.box(df, 
                 x="lunch", y="writing score", 
                 facet_col="gender",
                 color='gender',
                 template='simple_white',
                 width=900, height=400,
                 category_orders={
                     "race/ethnicity": ["group A", "group B", "group C", "group D", "group E"]
                     }                
                )

fig.update_layout(
    title="Lunch on writing score",
    margin=dict(l=20, r=20, t=70, b=20, pad=1),
    paper_bgcolor="lightgray",
)
fig.show()

# math, race
fig = px.box(df, 
                 x="lunch", y="math score", 
                 facet_col="race/ethnicity",
                 color='gender',
                 template='simple_white',
                 width=900, height=400,
                 category_orders={
                     "race/ethnicity": ["group A", "group B", "group C", "group D", "group E"]
                     }                
                )

fig.update_layout(
    title="Lunch on math score: variation across race group",
    margin=dict(l=20, r=20, t=70, b=20, pad=1),
    paper_bgcolor="lightgray",
)
fig.show()


# writing, parent education (female students of group A)
DF = df[(df['race/ethnicity'] == 'group A') & (df['gender'] == 'female')]

fig = px.box(DF, 
                 x="lunch", y="writing score", 
                 color="parental level of education",
#                  facet_col=
#                  color='gender',
                 template='simple_white',
                 width=900, height=400,
                 category_orders={
                     "race/ethnicity": ["group A", "group B", "group C", "group D", "group E"]
                     }                
                )

fig.update_layout(
    title="Lunch on writing score: variation across ParentEduc. (female students of group A) ",
    margin=dict(l=20, r=20, t=70, b=20, pad=1),
    paper_bgcolor="lightgray",
)
fig.show()

<a id="6"></a>
<font color="lightseagreen" size=+2.5><b>6. Test score correlation?</b></font>

<font color="lightseagreen" size=+1.5><b>Are the test scores correlated with one another?</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a> 

- Reading and writing test scores appear to be correlated with each other for both gender groups
- However, the relationship of the math score to the reading and writing scores differs between the two gender groups. Female students scored higher in reading and writing than in math, while the opposite is true for male students.
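
Both observations can be quantified with a per-gender correlation matrix and a per-gender score gap. This is a rough sketch on made-up numbers; the `toy` frame is hypothetical and only mimics the column names used in this notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's `df` frame (values are made up).
toy = pd.DataFrame({
    "gender": ["female"] * 4 + ["male"] * 4,
    "math score": [50, 60, 70, 80, 55, 65, 75, 85],
    "reading score": [60, 68, 81, 90, 50, 61, 69, 80],
    "writing score": [61, 70, 80, 91, 48, 60, 68, 79],
})

score_cols = ["math score", "reading score", "writing score"]

# Pearson correlation matrix of the three scores, computed separately per gender.
corr_by_gender = toy.groupby("gender")[score_cols].corr()
print(corr_by_gender.round(2))

# Mean gap between reading and math per gender (positive: reading is higher).
gap = (toy["reading score"] - toy["math score"]).groupby(toy["gender"]).mean()
print(gap)
```

A positive gap for one gender and a negative gap for the other would correspond to the shifted trend lines in the contour plots.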

In [None]:
templates  = ["plotly", "plotly_white", "plotly_dark", "ggplot2", "seaborn", "simple_white", "none"]

fig = px.density_contour(df, x="math score", y="reading score", color="gender",trendline="ols",
                 marginal_x="histogram", marginal_y="histogram",
                 #hover_data=['race/ethnicity'],
                         width=600,height=600,
                 title= '<b> Math vs Reading <b>',
                 template="simple_white",
                        color_discrete_map={ 
                    "male": "RebeccaPurple", "female": "lightsalmon"
                },
                        )
fig.layout.update(titlefont={'color': 'black', 'size': 24},
                  paper_bgcolor='#ececec',
                  plot_bgcolor='#ececec',
                 )
fig.show()

#######


fig =px.density_contour(df, x="math score", y="writing score", color="gender",trendline="ols",
                 marginal_x="histogram", marginal_y="histogram",
                 #hover_data=['race/ethnicity'],
                        width=600,height=600,
                 title= '<b> Math vs Writing <b>',
                 template="simple_white",
                       color_discrete_map={ 
                    "male": "RebeccaPurple", "female": "lightsalmon"
                },
                       )
fig.layout.update(titlefont={'color': 'black', 'size': 24},
                  paper_bgcolor='#ececec',
                  plot_bgcolor='#ececec',
                 )
fig.show()

######
fig = px.density_contour(df, x="reading score", y="writing score", color="gender",trendline="ols",
                 marginal_x="histogram", marginal_y="histogram",
                 #hover_data=['race/ethnicity'],
                         width=600,height=600,
                 title= '<b> Reading vs Writing <b>',
                 template="simple_white", 
                               color_discrete_map={ 
                    "male": "RebeccaPurple", "female": "lightsalmon"
                },
                        )
fig.layout.update(titlefont={'color': 'black', 'size': 24},
                  paper_bgcolor='#ececec',
                  plot_bgcolor='#ececec',
                 )
fig.show()


<a id="7"></a>
<font color="lightseagreen" size=+2.5><b>7. Correlation matrix</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a> 

* Here we use the point-biserial correlation coefficient to check the correlation between the categorical independent variables and the continuous targets. However, we can only do so for the binary categories (gender, lunch, and test preparation course).
* We notice that the correlations are at best weak. The highest correlation is *lunch* vs *math test score* at 0.38.
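
For intuition, the point-biserial coefficient is the Pearson correlation between a 0/1-coded variable and a continuous one. The sketch below computes it from the textbook formula on made-up data and compares it with `np.corrcoef`; `point_biserial_manual` is a hypothetical helper, not the function used in this notebook.

```python
import numpy as np

def point_biserial_manual(binary, values):
    """Point-biserial r from its textbook formula; `binary` must hold only 0s and 1s."""
    binary = np.asarray(binary)
    values = np.asarray(values, dtype=float)
    group1, group0 = values[binary == 1], values[binary == 0]
    n1, n0, n = len(group1), len(group0), len(values)
    s_n = values.std(ddof=0)  # population standard deviation of all values
    return (group1.mean() - group0.mean()) / s_n * np.sqrt(n1 * n0 / n**2)

# Made-up data: 1 = standard lunch, 0 = free/reduced lunch.
lunch = [1, 1, 1, 0, 0, 1, 0, 1]
math_scores = [70, 65, 80, 55, 60, 75, 50, 68]

r_manual = point_biserial_manual(lunch, math_scores)
r_pearson = np.corrcoef(lunch, math_scores)[0, 1]
print(round(r_manual, 3), round(r_pearson, 3))
```

The two numbers agree, which is why label-encoding a binary feature and calling `stats.pointbiserialr` is equivalent to an ordinary Pearson correlation on the encoded column.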


In [None]:
data= data.copy()

num_cols =['math score', 'reading score', 'writing score', 'average_score']
cat_cols = ['gender', 'race/ethnicity', 'lunch', 'parental level of education', 'test preparation course']
feats = ['gender', 'lunch', 'test preparation course', 'math score', 'writing score', 'reading score', 'race/ethnicity', 'parental level of education']

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

le_data = data.copy()

for col in feats:
    le_data[col] = le.fit_transform(data[col])
    
train = le_data

In [None]:
def point_biserial(x, y):
    pb = stats.pointbiserialr(x, y)
    return pb[0]

rows= []
for x in feats[:-2]:
    col = []
    for y in feats[:-2] :
        pbs =point_biserial(train[x], train[y]) 
        col.append(round(pbs,2))  
    rows.append(col)  

    
pbs_results = np.array(rows)
DF = pd.DataFrame(pbs_results, columns = train[feats[:-2]].columns, index =train[feats[:-2]].columns)

mask = np.triu(np.ones_like(DF, dtype=bool))
DF=DF.mask(mask)

fig = go.Figure(data= go.Heatmap(z=DF,
                  x=DF.index.values,
                  y=DF.columns.values,       
                  xgap=3, ygap=3,
                  colorscale='greens',
                  colorbar_thickness=10,
                  colorbar_ticklen=3,
                   )
                )
fig.update_layout(title_text="Correlation heatmap", 
                title_x=0.5,
                font_family="San Serif",
                titlefont={'size': 24},
                width=600, height=500,
                xaxis_showgrid=False,
                yaxis_showgrid=False,
                yaxis_autorange='reversed', 
                paper_bgcolor='#ececec',
                margin=dict(l=70, r=70, t=70, b=70, pad=1),
                template="simple_white"    )

fig.add_vrect(
    x0=-0.5, x1=2.5, y0=0, y1=0.5,
    fillcolor='red', opacity=0.95,
    layer="below", line_width=0,
)



fig.show()

<a id="8"></a>
<font color="lightseagreen" size=+2.5><b>8. Conclusions: EDA</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

- Both groups of students have their strengths and weaknesses in test performance. Male students were better in math and female students in the other two test subjects.
- Tutorials/test preparation courses helped improve the students' scores. They helped male students in their writing test, especially those from race groups B and C whose parents have 'some college' education.
- Lunch type also had an effect on students' test performance. A reduced or free lunch had a negative influence on the math score of female students in race groups A and C.
- Race and parental education level also seem to have an effect on the students' test scores. Generally, race group E students were the better performers, while race group A students were the low achievers. Students whose parents have a high level of education generally scored better.

# Part-2: Model
<a id="9"></a>
<font color="lightseagreen" size=+2.5><b>9. Modeling and prediction</b></font>

In this section we will try to see if we can predict the test scores (for the three test subjects separately) from the categorical input features. Judging by the kind of features we have, and from experience (we have all been students at least once in our lifetime), we suspect that the model accuracy may not be high. Let's see.
- Input features: gender, race/ethnicity, lunch, parental level of education, and test preparation course
- Target variables: math, writing, and reading test scores
- We use a simple random forest regression model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import shap 

### Target variables

In [None]:
y_m = train['math score']
y_w = train['writing score']
y_r = train['reading score']

### Outlier detection, IQR
* We use the inter-quartile range (IQR) to screen for outliers.
* IQR = Q3 - Q1
* The cut-off (scale) is taken as 1.5*IQR, so values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are treated as outliers (see the references for an explanation)

In [None]:
def iqr_outliers(data):
    """Return the (lower, upper) outlier boundaries of `data` using the 1.5*IQR rule."""
    q25, q75 = np.percentile(data, 25), np.percentile(data, 75)
    iqr = q75 - q25

    cut_off = iqr * 1.5
    lower, upper = q25 - cut_off, q75 + cut_off
    return lower, upper
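
To make the rule concrete, here is a small worked example on made-up scores. The helper `iqr_bounds` and the array `toy_scores` are hypothetical names used only for this illustration. The helper follows the same 1.5*IQR rule as the function above.

```python
import numpy as np

def iqr_bounds(values, scale=1.5):
    """Lower and upper outlier boundaries from the scale*IQR rule."""
    q25, q75 = np.percentile(values, [25, 75])
    cut_off = scale * (q75 - q25)
    return q25 - cut_off, q75 + cut_off

# Made-up scores with one very low and one very high value
toy_scores = np.array([10, 55, 60, 62, 65, 68, 70, 72, 75, 80, 98])

lower, upper = iqr_bounds(toy_scores)
# Q1 = 61.0, Q3 = 73.5, IQR = 12.5, cut-off = 18.75
print(lower, upper)  # 42.25 92.25

# Values outside [lower, upper] are flagged as outliers
outliers = toy_scores[(toy_scores < lower) | (toy_scores > upper)]
print(outliers.tolist())  # [10, 98]
```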

In [None]:
scores = [y_m, y_w, y_r]
subjects = ['maths', 'writing', 'reading']

# use separate loop names so the two lists are not overwritten
for sub, item in zip(subjects, scores):
    print(sub + ' outlier ranges are', iqr_outliers(item))

### Plotting the lower-boundaries for the outliers
* dashed lines in the following plots are the lower boundaries (IQR)

In [None]:
## plotting outlier boundaries
fig1 = px.scatter(data,
                 x='math score',
                 y='writing score',
                 template="ggplot2",
                 width=700,
                 height=550)
fig1.add_shape( 
    type="line", line_color="blue", line_width=3, opacity=1, line_dash="dot",
    x0=0, x1=1, xref="paper", y0=iqr_outliers(y_w)[0], y1=iqr_outliers(y_w)[0], yref="y"
)

fig1.add_shape( 
    type="line", line_color="black", line_width=3, opacity=1, line_dash="dot",
    y0=0, y1=1, xref="x", x0=iqr_outliers(y_m)[0], x1=iqr_outliers(y_m)[0], yref="paper"
)
fig1.update_layout(title_text="Math vs writing: showing the outlier boundaries")
fig1.show()

fig2 = px.scatter(data,
                 x='math score',
                 y='reading score',
                 template="ggplot2",
                 width=700,
                 height=550)
fig2.add_shape( 
    type="line", line_color="blue", line_width=3, opacity=1, line_dash="dot",
    x0=0, x1=1, xref="paper", y0=iqr_outliers(y_r)[0], y1=iqr_outliers(y_r)[0], yref="y"
)

fig2.add_shape( 
    type="line", line_color="black", line_width=3, opacity=1, line_dash="dot",
    y0=0, y1=1, xref="x", x0=iqr_outliers(y_m)[0], x1=iqr_outliers(y_m)[0], yref="paper"
)
fig2.update_layout(title_text="Math vs reading: showing the outlier boundaries")
fig2.show()

### Drop outliers
> The upper boundary for the outliers is higher than the maximum test score (100), so we use only the lower boundary to drop outliers.

In [None]:
train = train.drop(train[train['math score'] < iqr_outliers(y_m)[0]].index)
train = train.drop(train[train['writing score'] < iqr_outliers(y_w)[0]].index)
train = train.drop(train[train['reading score'] < iqr_outliers(y_r)[0]].index)
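
The three sequential drops can also be written as a single boolean mask. The sketch below runs on made-up data: the `toy` frame, `score_cols`, and `cleaned` are assumed names for illustration. As in the cell above, the lower boundaries are computed once from the unfiltered data and then applied to all three score columns.

```python
import pandas as pd

score_cols = ["math score", "writing score", "reading score"]

# Made-up scores: row 0 has a very low math score, row 2 a very low writing score
toy = pd.DataFrame({
    "math score":    [5, 60, 65, 70, 75, 80, 66, 71],
    "writing score": [62, 64, 8, 72, 70, 78, 68, 74],
    "reading score": [66, 63, 61, 74, 73, 79, 69, 75],
})

# Per-column lower boundary: Q1 - 1.5 * IQR
q1 = toy[score_cols].quantile(0.25)
q3 = toy[score_cols].quantile(0.75)
lower = q1 - 1.5 * (q3 - q1)

# Keep only rows whose three scores are all at or above their lower boundary
keep = (toy[score_cols] >= lower).all(axis=1)
cleaned = toy[keep]

print(lower.to_dict())
print(cleaned.index.tolist())  # [1, 3, 4, 5, 6, 7]
```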

In [None]:
# the features
X = train[cat_cols]

# the three targets (test scores)
y_m = train['math score']
y_w = train['writing score']
y_r = train['reading score']

### Helper functions for model and SHAP

In [None]:
def make_model(X, y):
    # train/validation split
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=100)
    # model selection: here we use a random forest
    rf = RandomForestRegressor(n_estimators=500, max_depth=6, random_state=100)
    # fit model
    rf.fit(X_train, y_train)
    # predict
    predict = rf.predict(X_val)
    # model evaluation metrics
    rmse = np.sqrt(mean_squared_error(y_val, predict))
    # reuse the already fitted model; fitting it a second time is unnecessary
    r_squared = rf.score(X_val, y_val)
    return ('rmse score: {0:.3}'.format(rmse), 'r-squared value is: {0:.3}'.format(r_squared))


In [None]:
def shape_and_feature_importance(X, y, features):
    # split
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=100)
    # model
    rf = RandomForestRegressor(n_estimators=500, max_depth=6, random_state=100)
    # fit model
    rf.fit(X_train, y_train)    
    # create object that can calculate shap values
    explainer = shap.TreeExplainer(rf)
    # calculate Shap values
    shap_values = explainer.shap_values(X_train)
    # feature importance plot
    shap.summary_plot(shap_values, X_train, feature_names=features, plot_type="bar")
    # shap summary plot
    shap.summary_plot(shap_values, X_train, feature_names=features)    
    return 

### Model accuracy

In [None]:
scores = [y_m, y_w, y_r]
subjects =['maths', 'writing', 'reading']

for sub, item in zip(subjects, scores):
    print(sub + ' score =>', make_model(X, item))
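
To read these two numbers, it helps to compare them with a trivial baseline that always predicts the mean score. The sketch below uses made-up values: `y_val`, `model_pred`, and the helpers `rmse` and `r_squared` are assumptions for illustration. For simplicity the baseline uses the mean of the validation targets themselves, which gives an R-squared of exactly 0 and an RMSE equal to the spread of the scores.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up validation scores and hypothetical model predictions
y_val = np.array([50.0, 60.0, 70.0, 80.0, 90.0])
model_pred = np.array([55.0, 62.0, 68.0, 77.0, 85.0])

# Baseline: predict the mean score for every student
baseline_pred = np.full_like(y_val, y_val.mean())

print("baseline:", rmse(y_val, baseline_pred), r_squared(y_val, baseline_pred))
print("model:   ", rmse(y_val, model_pred), r_squared(y_val, model_pred))
```

An R-squared close to 0, or an RMSE close to the standard deviation of the scores, means the model does little better than guessing the average.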

### Feature importance and SHAP plots
* We see that `lunch`, `test prep` and `gender` are important features for the model
* However, the model does not see parental education as important
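
The bar version of the SHAP summary plot ranks features by their mean absolute SHAP value. The sketch below reproduces that ranking with NumPy on a small made-up SHAP matrix. The numbers and the reduced `features` list are assumptions for illustration, not output of the model in this notebook.

```python
import numpy as np

features = ["gender", "lunch", "test preparation course"]

# Hypothetical SHAP values: one row per student, one column per feature
shap_values = np.array([
    [ 2.0, -4.0,  1.0],
    [-2.0,  4.0, -1.0],
    [ 2.0,  5.0,  4.0],
    [-2.0, -3.0, -4.0],
])

# Mean absolute contribution per feature (the length of each bar)
mean_abs = np.abs(shap_values).mean(axis=0)

# Sort features from most to least important
order = np.argsort(mean_abs)[::-1]
ranked = [(features[i], float(mean_abs[i])) for i in order]

for name, value in ranked:
    print(f"{name}: {value:.2f}")
```

Positive and negative contributions do not cancel in this ranking, because the absolute value is taken before averaging.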

In [None]:
scores = [y_m, y_w, y_r]
subjects =['maths', 'writing', 'reading']

for sub, item in zip(subjects, scores):
    print(' ')
    print('Feature importance and SHAP summary for ' + sub + ' test score')
    # the helper only draws the plots and returns None, so there is nothing to print
    shape_and_feature_importance(X, item, features=cat_cols)

<a id="10"></a>
<font color="lightseagreen" size=+2.5><b>10. Conclusions: Model</b></font>

The model accuracy (judged by the R-squared and RMSE values) is not very high. This was more or less expected, as the features we have are not highly predictive of test performance. Test performance may depend on more features than those given in this dataset, so more features are required to accurately identify what affects students' test performance. Among others, the following could be considered as additional features that might have an impact on students' test scores:

- study hours per week (hrs)
- interest in the subject (low, medium, high)
- gets help from parents (yes or no)
- is first child? (yes or no)
- number of siblings
- marital status of parents (married, divorced, widowed)
- practises a sport? (yes or no)
- etc.

So the model accuracy and predictions should be read with caution! After all, the dataset is fictitious.

# Closing remark:
### We would like to close the loop by answering the questions we asked at the start
* How effective is the test preparation course?
> Ans: Test preparation does help. Male students' writing scores improved the most from test prep.
* Which major factors contribute to test outcomes?
> Ans: Lunch, test preparation and gender. Male students outperformed their female colleagues on the math test, while female students did so on the writing and reading tests. Lunch had the highest effect on the math test.
* What would be the best way to improve student scores on each test?
> Ans: See above

<font color="lightseagreen" size=+1.5><b>Thank you for reading this notebook!!</b></font>

<font color="lightseagreen" size=+1.5><b>Remarks and/or questions are most welcome! If you liked the notebook, please do not forget to upvote!</b></font>


<a id="11"></a>
<font color="lightseagreen" size=+2.5><b>11. References</b></font>

<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a> 

[Plotly official webpage](https://plotly.com/)

[Royce Kimmons' (the dataset creator) webpage](http://roycekimmons.com/tools/generated_data/exams)

[Scikit-learn official webpage](https://scikit-learn.org/stable/index.html)

[Why “1.5” in IQR Method of Outlier Detection?](https://towardsdatascience.com/why-1-5-in-iqr-method-of-outlier-detection-5d07fdc82097)

