# Problem Statement: Predict students' performance in exams
# Type: Regression (supervised machine learning)

This dataset includes scores from three exams and a variety of personal, social, and economic factors that have interaction effects on students' performance. The task is to answer the following questions:
1. How effective is the test preparation course?
2. Which major factors contribute to test outcomes?
3. What would be the best way to improve student scores on each test?

### Importing required libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

### Importing dataset

In [None]:
df = pd.read_csv("../input/students-performance-in-exams/StudentsPerformance.csv")
print(df.shape)
print()
df.head()

### Checking duplicate values

In [None]:
print(df[df.duplicated()])

The dataset does not have any duplicate rows.
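
A count is often easier to read than an empty frame. The following is a minimal sketch on a toy frame; the `toy` data is hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: the second row repeats the first.
toy = pd.DataFrame({
    "gender": ["female", "female", "male"],
    "math score": [72, 72, 47],
})

# duplicated() flags every repeat of an earlier row; sum() counts the flags.
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)
```

Applied to the real frame, the same expression would be `df.duplicated().sum()`.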

### Checking data info

In [None]:
df.info()

In [None]:
df.describe()

### Observations:
1. The dataset does not have any null values.
2. There are no outliers in the scores, as all values are in the range 0-100.
3. All features are categorical.
4. Grades are required to measure students' performance in the exam.
5. A percentage needs to be calculated in order to allocate grades.
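
The first two observations can also be checked programmatically rather than read off `df.info()` and `df.describe()`. The sketch below uses a small hypothetical frame (`toy`) with the same score column names; note that a range check only confirms the values are valid scores, which is what observation 2 relies on.

```python
import pandas as pd

# Hypothetical toy scores using the dataset's column names.
toy = pd.DataFrame({
    "math score": [72, 69, 90],
    "reading score": [72, 90, 95],
    "writing score": [74, 88, 93],
})

# Total number of missing cells across the whole frame.
total_nulls = int(toy.isnull().sum().sum())

# True only if every score in every column lies within [0, 100].
in_range = bool(toy.apply(lambda s: s.between(0, 100)).all().all())

print(total_nulls, in_range)
```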

In [None]:
df['percentage'] = round(((df['math score'] + df['reading score'] + df['writing score'])/300)*100, 2)
df.head(10)

In [None]:
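# Plot the distribution of every numeric column.
for c in df.columns:
    if (df[c].dtype != 'object'):
        # histplot replaces distplot, which is deprecated in recent seaborn releases
        ax = sns.histplot(x=df[c], kde=True)
        plt.show()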

The graphs above suggest that a score of 40 can serve as the cutoff for deciding whether a student passes or fails.
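
The pass/fail rule (fail if any subject is below 40) can also be written without a row-wise `apply`. This is a minimal sketch on a hypothetical `toy` frame; `score_cols` and `failed` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy scores using the dataset's column names.
toy = pd.DataFrame({
    "math score": [39, 40, 80],
    "reading score": [70, 40, 90],
    "writing score": [65, 41, 35],
})
score_cols = ["math score", "reading score", "writing score"]

# A student fails if any of the three scores is below the cutoff of 40.
failed = (toy[score_cols] < 40).any(axis=1)
toy["result"] = np.where(failed, "F", "P")
print(toy)
```

The vectorized form gives the same labels as a row-wise lambda and tends to be faster on larger frames.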

In [None]:
df['result'] = df.apply(lambda x : 'F' if x['math score']<40 or x['reading score']<40 or x['writing score']<40 else 'P', axis = 1)
df.head(10)

Grades can now be allocated as follows:

A : 90 and above

B : 80 - 89

C : 70 - 79

D : 55 - 69

E : 40 - 54

F : a score below 40 in any subject
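
The grade bands can also be expressed with `pd.cut`, which keeps the thresholds in one place. This is a sketch under the assumption that each band includes its lower bound (so 70.0 is a C); the `toy` frame, `bins`, and `labels` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical percentages and pass/fail results.
toy = pd.DataFrame({
    "percentage": [95.0, 84.33, 70.0, 60.0, 45.0, 62.0],
    "result": ["P", "P", "P", "P", "P", "F"],
})

# Left-closed bins: [0, 40) -> F, [40, 55) -> E, ..., [90, inf) -> A.
bins = [0, 40, 55, 70, 80, 90, np.inf]
labels = ["F", "E", "D", "C", "B", "A"]
band = pd.cut(toy["percentage"], bins=bins, labels=labels, right=False).astype(str)

# A failed subject overrides the percentage band.
toy["grade"] = band.where(toy["result"] == "P", "F")
print(toy)
```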

In [None]:
def grading(percentage, result):
    if (result == 'F'):
        return 'F'
    if (percentage >= 90):
        return 'A'
    if (percentage >= 80):
        return 'B'
    if (percentage >= 70):
        return 'C'
    if (percentage >= 55):
        return 'D'
    if (percentage >= 40):
        return 'E'
    
df['grade'] = df.apply(lambda x : grading(x['percentage'], x['result']), axis=1)
df.head()

## Data analysis

In [None]:
for c in df.columns:
    if (df[c].dtype == 'object'):
        plt.figure(figsize=(12,4))
        ax = sns.countplot(x=c, data=df)  # keyword arguments are required by recent seaborn releases
        ax.spines['top'].set_visible(False)
        for p in ax.patches:
            ax.text(p.get_x()+p.get_width()/2, p.get_height(), str(p.get_height())+'\n', ha='center', weight='bold')
        plt.show()
        print('-'*100)

In [None]:
f = df[df['result'] == 'F'].drop(['result', 'grade'], axis=1)
print('The graphs show, for each category, the percentage of failed students that fall into it.')
print('*'*100)
print()
for c in f.columns:
    if (f[c].dtype == 'object'):
        plt.figure(figsize=(12,4))
        ax = sns.countplot(x=c, data=f)  # keyword arguments are required by recent seaborn releases
        ax.spines['top'].set_visible(False)
        for p in ax.patches:
            ax.text(p.get_x()+p.get_width()/2, p.get_height(), str(round(((p.get_height()/len(f))*100), 2))+' %\n', ha='center', weight='bold')
        plt.show()
        print('-'*100)

From the graphs above we can conclude the following:
1. The test preparation course helps students considerably in passing the examination.
2. Parents' education level also plays a major role in students' results.
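
The bar charts show how the failed students are distributed across categories. A complementary view is the pass rate within each category, which does not depend on how many students are in each group. This is a minimal sketch with hypothetical data.

```python
import pandas as pd

# Hypothetical toy data using the dataset's column names.
toy = pd.DataFrame({
    "test preparation course": ["none", "none", "none", "none",
                                "completed", "completed"],
    "result": ["F", "P", "P", "F", "P", "P"],
})

# Mean of a boolean series is the share of True values, i.e. the pass rate.
pass_rate = (toy["result"] == "P").groupby(toy["test preparation course"]).mean()
print(pass_rate)
```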

In [None]:
pp = df[df['result'] == 'P'].drop(['result'], axis=1)
print('The graphs show the percentage of passed students in each category, split by grade.')
print('*'*100)
print()
for c in pp.columns:
    if (pp[c].dtype == 'object'):
        plt.figure(figsize=(18,4))
        ax = sns.countplot(x=c, hue='grade', data=pp)
        ax.spines['top'].set_visible(False)
        for p in ax.patches:
            ax.text(p.get_x()+p.get_width()/2, p.get_height()+10, str(round(((p.get_height()/len(pp))*100), 2))+' %', ha='center', weight='bold', rotation=90)
        plt.show()
        print('-'*100)

### The graphs above lead to the following conclusions:
1. Students who have a standard lunch and completed the test preparation course get good grades in the examination.
2. Students whose parents have a bachelor's or master's degree also perform well in the examination.
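
The same comparison can be read from a table by normalizing grade counts within each category. This is a sketch using `pd.crosstab` on hypothetical data, with `share` as an illustrative name.

```python
import pandas as pd

# Hypothetical toy data using the dataset's column names.
toy = pd.DataFrame({
    "lunch": ["standard", "standard", "free/reduced", "free/reduced"],
    "grade": ["A", "B", "B", "E"],
})

# normalize="index" turns each row into shares that sum to 1.
share = pd.crosstab(toy["lunch"], toy["grade"], normalize="index")
print(share)
```

Each row then gives the grade distribution for one lunch type, so rows can be compared directly even when the groups differ in size.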

### Analysis results:
How effective is the test preparation course?

-> It plays a major role in the result and helps students score better grades.

Which major factors contribute to test outcomes?

-> Lunch, the test preparation course, and parents' education are the major factors.

What would be the best way to improve student scores on each test?

-> Getting a standard lunch and completing the test preparation course help students score better in the exam. Where possible, parents should also further their own education so that they can help students with their studies whenever needed.

## Handling categorical variable


1. Gender, lunch, and test preparation course are binary columns, so they are simply converted to 0 and 1.
2. Parental education is an ordinal category, so it is replaced with integers in the proper order.
3. Group is a nominal variable, so one-hot encoding is applied.
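
The ordinal and one-hot steps can also be written with `Series.map` and `pd.get_dummies`. This is a sketch on hypothetical rows; the `edu_order` mapping assumes the sixth education level is `some high school`, and `group_letter` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy rows using the dataset's column names.
toy = pd.DataFrame({
    "parental level of education": ["master's degree", "some high school", "some college"],
    "race/ethnicity": ["group A", "group E", "group C"],
})

# Ordinal encoding: an explicit mapping keeps the order visible.
# Assumption: "some high school" is the lowest level in the data.
edu_order = {
    "master's degree": 0, "bachelor's degree": 1, "associate's degree": 2,
    "some college": 3, "high school": 4, "some high school": 5,
}
toy["parent education_num"] = toy["parental level of education"].map(edu_order)

# One-hot encoding: strip the "group " prefix so columns read group_A, group_C, ...
group_letter = toy["race/ethnicity"].str.replace("group ", "", regex=False)
dummies = pd.get_dummies(group_letter, prefix="group", dtype=int)
print(pd.concat([toy, dummies], axis=1))
```

Unlike an `else` branch, `map` leaves unexpected values as `NaN`, which makes typos in the category names easier to spot.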

In [None]:
df['gender_num'] = df['gender'].apply(lambda x : 0 if x == 'female' else 1)
df['lunch_num'] = df['lunch'].apply(lambda x : 0 if x == 'free/reduced' else 1)
df['course_num'] = df['test preparation course'].apply(lambda x : 0 if x == 'none' else 1)
df.head()

In [None]:
def edu(x) :
    if x == "master's degree" :
        return 0
    if x == "bachelor's degree" :
        return 1
    if x == "associate's degree" :
        return 2
    if x == 'some college' :
        return 3
    if x == 'high school' :
        return 4
    else :
        return 5

df['parent education_num'] = df['parental level of education'].apply(edu)

df.head(10)

In [None]:
df['group_A'] = df['race/ethnicity'].apply(lambda x : 1 if 'A' in x else 0)
df['group_B'] = df['race/ethnicity'].apply(lambda x : 1 if 'B' in x else 0)
df['group_C'] = df['race/ethnicity'].apply(lambda x : 1 if 'C' in x else 0)
df['group_D'] = df['race/ethnicity'].apply(lambda x : 1 if 'D' in x else 0)

df.head(10)

## Train - Test split

In [None]:
col = ['math score', 'reading score', 'writing score', 'percentage']
for c in df.columns:
    if df[c].dtype == 'object':
        col.append(c)
print(col)

In [None]:
from sklearn.model_selection import train_test_split

xtn, xts, ytn, yts = train_test_split(df.drop(col, axis=1), df[['math score', 'reading score', 'writing score', 'percentage']], test_size=0.2, random_state=8)

In [None]:
xtn.head()

In [None]:
ytn.head()

## Model Selection

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error  # accuracy_score is a classification metric and is not used here

In [None]:
ln = LinearRegression()
ln.fit(xtn, ytn['percentage'])
pred = ln.predict(xts)
print(np.sqrt(mean_squared_error(yts['percentage'], pred)))

plt.figure(figsize=(15,5))

plt.scatter(range(0,len(pred)), yts['percentage'], color = 'blue') 
plt.scatter(range(0,len(pred)), pred, color='red')

for i, tx in enumerate(yts['percentage']):
    plt.annotate('  ' + str(round(tx, 0)), (i, tx))
    plt.annotate('  ' + str(round(pred[i], 2)), (i, pred[i]))

### The plot above shows that the model predicts reasonably well for values in the range 50-80, where the majority of the data lies, but performs poorly for values outside that range.
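
One way to put an RMSE value in context is to compare it with a constant baseline that always predicts the training mean. A model whose predictions stay in a narrow band around the mean will score close to this baseline. The sketch below uses hypothetical numbers; `rmse` is an illustrative helper, not part of the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical training and test percentages.
y_train = np.array([50.0, 60.0, 70.0, 80.0])
y_test = np.array([55.0, 75.0])

# Constant baseline: predict the training mean for every test row.
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_rmse = rmse(y_test, baseline_pred)
print(baseline_rmse)
```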

In [None]:
ln = LinearRegression()
pl = PolynomialFeatures(degree=4)
xpl = pl.fit_transform(xtn)
ln.fit(xpl, ytn['percentage'])
xts_pl = pl.transform(xts)  # reuse the transformer fitted on the training data
pred = ln.predict(xts_pl)
print(np.sqrt(mean_squared_error(yts['percentage'], pred)))

plt.figure(figsize=(15,5))

plt.scatter(range(0,len(pred)), yts['percentage'], color = 'blue') 
plt.scatter(range(0,len(pred)), pred, color='red')

for i, tx in enumerate(yts['percentage']):
    plt.annotate('  ' + str(round(tx, 0)), (i, tx))
    plt.annotate('  ' + str(round(pred[i], 2)), (i, pred[i]))

In [None]:
dec = DecisionTreeRegressor(random_state=8)
dec.fit(xtn, ytn['percentage'])
pred = dec.predict(xts)
print(np.sqrt(mean_squared_error(yts['percentage'], pred)))

plt.figure(figsize=(15,5))

plt.scatter(range(0,len(pred)), yts['percentage'], color = 'blue') 
plt.scatter(range(0,len(pred)), pred, color='red')

for i, tx in enumerate(yts['percentage']):
    plt.annotate('  ' + str(round(tx, 0)), (i, tx))
    plt.annotate('  ' + str(round(pred[i], 2)), (i, pred[i]))

In [None]:
ran = RandomForestRegressor(random_state=8)
ran.fit(xtn, ytn['percentage'])
pred = ran.predict(xts)
print(np.sqrt(mean_squared_error(yts['percentage'], pred)))

plt.figure(figsize=(15,5))

plt.scatter(range(0,len(pred)), yts['percentage'], color = 'blue') 
plt.scatter(range(0,len(pred)), pred, color='red')

for i, tx in enumerate(yts['percentage']):
    plt.annotate('  ' + str(round(tx, 0)), (i, tx))
    plt.annotate('  ' + str(round(pred[i], 2)), (i, pred[i]))

In [None]:
et = ExtraTreesRegressor(random_state=8)
et.fit(xtn, ytn['percentage'])
pred = et.predict(xts)
print(np.sqrt(mean_squared_error(yts['percentage'], pred)))

plt.figure(figsize=(15,5))

plt.scatter(range(0,len(pred)), yts['percentage'], color = 'blue') 
plt.scatter(range(0,len(pred)), pred, color='red')

for i, tx in enumerate(yts['percentage']):
    plt.annotate('  ' + str(round(tx, 0)), (i, tx))
    plt.annotate('  ' + str(round(pred[i], 2)), (i, pred[i]))

In [None]:
sr = SVR()
sr.fit(xtn, ytn['percentage'])
pred = sr.predict(xts)
print(np.sqrt(mean_squared_error(yts['percentage'], pred)))

plt.figure(figsize=(15,5))

plt.scatter(range(0,len(pred)), yts['percentage'], color = 'blue') 
plt.scatter(range(0,len(pred)), pred, color='red')

for i, tx in enumerate(yts['percentage']):
    plt.annotate('  ' + str(round(tx, 0)), (i, tx))
    plt.annotate('  ' + str(round(pred[i], 2)), (i, pred[i]))

In [None]:
kn = KNeighborsRegressor()
kn.fit(xtn, ytn['percentage'])
pred = kn.predict(xts)
print(np.sqrt(mean_squared_error(yts['percentage'], pred)))

plt.figure(figsize=(15,5))

plt.scatter(range(0,len(pred)), yts['percentage'], color = 'blue') 
plt.scatter(range(0,len(pred)), pred, color='red')

for i, tx in enumerate(yts['percentage']):
    plt.annotate('  ' + str(round(tx, 0)), (i, tx))
    plt.annotate('  ' + str(round(pred[i], 2)), (i, pred[i]))

## From the above we can conclude that linear regression and support vector regression have the lowest RMSE, but they are not good models because they always predict values in the range 50-80; therefore, random forest can be considered the best model for this dataset.
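
Because the conclusion weighs RMSE against the spread of the predictions, it can help to collect both in one table. The sketch below runs on synthetic data (the features, coefficients, and noise level are all made up for illustration) and compares only two of the models; `summary` is an illustrative name.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features: four binary columns.
rng = np.random.default_rng(8)
X = rng.integers(0, 2, size=(200, 4)).astype(float)
y = 60 + 5 * X[:, 0] + 8 * X[:, 1] + rng.normal(0, 10, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=8)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=8),
}

# For each model, record the test RMSE and the range of its predictions.
summary = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    summary[name] = {
        "rmse": float(np.sqrt(mean_squared_error(y_te, pred))),
        "pred_min": float(pred.min()),
        "pred_max": float(pred.max()),
    }

for name, stats in summary.items():
    print(name, stats)
```

On the real data, the same loop could be run over all the fitted regressors with `xtn`, `xts`, and the `percentage` target.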