# <center>Student Grade Prediction</center>
#### <center>by Sushant Deshpande</center>

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [2]:
grades = pd.read_csv("data/grades.csv")
grades.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


<div class="alert alert-warning">
    <b>Attribute Information:</b>

Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
<ul><b>1 school</b> - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)</ul>
<ul><b>2 sex</b> - student's sex (binary: 'F' - female or 'M' - male)</ul>
<ul><b>3 age</b> - student's age (numeric: from 15 to 22)</ul>
<ul><b>4 address</b> - student's home address type (binary: 'U' - urban or 'R' - rural)</ul>
<ul><b>5 famsize</b> - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)</ul>
<ul><b>6 Pstatus</b> - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)</ul>
<ul><b>7 Medu</b> - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education). Source: UCI Machine Learning Repository, Student Performance Data Set, https://archive.ics.uci.edu/ml/datasets/student+performance</ul>
<ul><b>8 Fedu</b> - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)</ul>
<ul><b>9 Mjob</b> - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')</ul>
<ul><b>10 Fjob</b> - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')</ul>
<ul><b>11 reason</b> - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')</ul>
<ul><b>12 guardian</b> - student's guardian (nominal: 'mother', 'father' or 'other')</ul>
<ul><b>13 traveltime</b> - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)</ul>
<ul><b>14 studytime</b> - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)</ul>
<ul><b>15 failures</b> - number of past class failures (numeric: n if 1<=n<3, else 4)</ul>
<ul><b>16 schoolsup</b> - extra educational support (binary: yes or no)</ul>
<ul><b>17 famsup</b> - family educational support (binary: yes or no)</ul>
<ul><b>18 paid</b> - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)</ul>
<ul><b>19 activities</b> - extra-curricular activities (binary: yes or no)</ul>
<ul><b>20 nursery</b> - attended nursery school (binary: yes or no)</ul>
<ul><b>21 higher</b> - wants to take higher education (binary: yes or no)</ul>
<ul><b>22 internet</b> - Internet access at home (binary: yes or no)</ul>
<ul><b>23 romantic</b> - with a romantic relationship (binary: yes or no)</ul>
<ul><b>24 famrel</b> - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)</ul>
<ul><b>25 freetime</b> - free time after school (numeric: from 1 - very low to 5 - very high)</ul>
<ul><b>26 goout</b> - going out with friends (numeric: from 1 - very low to 5 - very high)</ul>
<ul><b>27 Dalc</b> - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)</ul>
<ul><b>28 Walc</b> - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)</ul>
<ul><b>29 health</b> - current health status (numeric: from 1 - very bad to 5 - very good)</ul>
<ul><b>30 absences</b> - number of school absences (numeric: from 0 to 93)</ul>

<ul>These grades relate to the course subject, Math or Portuguese:</ul>
<ul><b>31 G1</b> - first period grade (numeric: from 0 to 20)</ul>
<ul><b>32 G2</b> - second period grade (numeric: from 0 to 20)</ul>
<ul><b>33 G3</b> - final grade (numeric: from 0 to 20, output target)</ul>
</div>

Let's find out how many null values are present in our dataset.

In [3]:
grades.isnull().sum()

school        0
sex           0
age           0
address       0
famsize       0
Pstatus       0
Medu          0
Fedu          0
Mjob          0
Fjob          0
reason        0
guardian      0
traveltime    0
studytime     0
failures      0
schoolsup     0
famsup        0
paid          0
activities    0
nursery       0
higher        0
internet      0
romantic      0
famrel        0
freetime      0
goout         0
Dalc          0
Walc          0
health        0
absences      0
G1            0
G2            0
G3            0
dtype: int64

Let's find out the datatype of each column.

In [4]:
grades.dtypes

school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
G1             int64
G2             int64
G3             int64
dtype: object

First, let's make a copy of the original dataset and call it <code>grades_working</code>, so that the raw data stays untouched.

In [5]:
grades_working = grades.copy()

In [6]:
grades_working["schoolsup"].head()

0    yes
1     no
2    yes
3     no
4     no
Name: schoolsup, dtype: object

In [7]:
grades_working["schoolsup"] = grades_working["schoolsup"].map({'yes': 1, 'no': 0})
grades_working["famsup"] = grades_working["famsup"].map({'yes': 1, 'no': 0})
grades_working["paid"] = grades_working["paid"].map({'yes': 1, 'no': 0})
grades_working["activities"] = grades_working["activities"].map({'yes': 1, 'no': 0})
grades_working["nursery"] = grades_working["nursery"].map({'yes': 1, 'no': 0})
grades_working["higher"] = grades_working["higher"].map({'yes': 1, 'no': 0})
grades_working["internet"] = grades_working["internet"].map({'yes': 1, 'no': 0})
grades_working["romantic"] = grades_working["romantic"].map({'yes': 1, 'no': 0})
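The eight near-identical `map` calls above can also be written as a loop over the column names. The following is a minimal sketch on a tiny made-up frame (the `toy` data is invented for illustration and is not taken from `grades.csv`); only the column names come from the attribute list.

```python
import pandas as pd

# Toy stand-in for grades_working; the values are made up for illustration.
toy = pd.DataFrame({
    "schoolsup": ["yes", "no", "yes"],
    "famsup": ["no", "yes", "no"],
    "age": [18, 17, 15],
})

# Only the yes/no columns that exist in this toy frame.
binary_cols = ["schoolsup", "famsup"]
yes_no = {"yes": 1, "no": 0}

for col in binary_cols:
    # Values missing from the mapping would become NaN, so check afterwards.
    toy[col] = toy[col].map(yes_no)

assert toy[binary_cols].notna().all().all()
print(toy)
```

On the real data, `binary_cols` would hold all eight yes/no attributes; the `notna` check guards against unexpected values such as a stray `'Yes'`.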

In [8]:
grades_working.count()

school        395
sex           395
age           395
address       395
famsize       395
Pstatus       395
Medu          395
Fedu          395
Mjob          395
Fjob          395
reason        395
guardian      395
traveltime    395
studytime     395
failures      395
schoolsup     395
famsup        395
paid          395
activities    395
nursery       395
higher        395
internet      395
romantic      395
famrel        395
freetime      395
goout         395
Dalc          395
Walc          395
health        395
absences      395
G1            395
G2            395
G3            395
dtype: int64

<hr>

#### BEGIN TESTING

In [9]:
#Testing
grades_dummies = pd.get_dummies(grades_working["sex"])
grades_dummies.head()

Unnamed: 0,F,M
0,1,0
1,1,0
2,1,0
3,1,0
4,1,0


In [10]:
grades_dummies.count()

F    395
M    395
dtype: int64

In [11]:
empty_df = pd.DataFrame(columns=[])
empty_df

In [12]:
empty_df_test = pd.DataFrame(columns=[])
empty_df_test

In [13]:
test_df = pd.concat([empty_df_test, grades_working["sex"], grades_working["age"], grades_working["school"]], axis=1)
test_df.head()

Unnamed: 0,sex,age,school
0,F,18,GP
1,F,17,GP
2,F,15,GP
3,F,15,GP
4,F,16,GP


In [14]:
for i in test_df:
    if (test_df.dtypes[i] == "object"):
        temp_df = pd.get_dummies(grades_working[i])
        #grades_dummies = grades_working.drop(columns=[i])
        empty_df = pd.concat([empty_df, temp_df], axis=1)

empty_df.count()

F     395
M     395
GP    395
MS    395
dtype: int64

In [15]:
empty_df.head()

Unnamed: 0,F,M,GP,MS
0,1,0,1,0
1,1,0,1,0
2,1,0,1,0
3,1,0,1,0
4,1,0,1,0


#### END TESTING
<hr>

In [16]:
empty_df_blank = pd.DataFrame(columns=[])

In [17]:
for i in grades_working:
    if (grades_working.dtypes[i] == "object"):
        temp_df = pd.get_dummies(grades_working[i])
        #grades_non_dummies = grades_working.drop(i, axis=1)
        empty_df_blank = pd.concat([empty_df_blank, temp_df], axis=1)

empty_df_blank.head()

Unnamed: 0,GP,MS,F,M,R,U,GT3,LE3,A,T,...,other,services,teacher,course,home,other.1,reputation,father,mother,other.2
0,1,0,1,0,0,1,1,0,1,0,...,0,0,1,1,0,0,0,0,1,0
1,1,0,1,0,0,1,1,0,0,1,...,1,0,0,1,0,0,0,1,0,0
2,1,0,1,0,0,1,0,1,0,1,...,1,0,0,0,0,1,0,0,1,0
3,1,0,1,0,0,1,1,0,0,1,...,0,1,0,0,1,0,0,0,1,0
4,1,0,1,0,0,1,1,0,0,1,...,1,0,0,0,1,0,0,1,0,0


In [18]:
empty_df_blank.dtypes

GP            uint8
MS            uint8
F             uint8
M             uint8
R             uint8
U             uint8
GT3           uint8
LE3           uint8
A             uint8
T             uint8
at_home       uint8
health        uint8
other         uint8
services      uint8
teacher       uint8
at_home       uint8
health        uint8
other         uint8
services      uint8
teacher       uint8
course        uint8
home          uint8
other         uint8
reputation    uint8
father        uint8
mother        uint8
other         uint8
dtype: object
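The listing above shows that several dummy columns share a name (`other`, `health`, `at_home`, `services`, `teacher`), and a dummy named `health` will also share its name with the numeric `health` attribute if the two frames are later joined side by side. One way to avoid this, sketched here on invented toy data, is to let `pd.get_dummies` prefix each dummy with the name of its source column.

```python
import pandas as pd

# Invented rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Mjob": ["at_home", "health", "other"],
    "Fjob": ["teacher", "other", "health"],
    "health": [3, 5, 1],
})

# With `columns=...`, each dummy is named "<source column>_<category>".
encoded = pd.get_dummies(toy, columns=["Mjob", "Fjob"])

print(encoded.columns.tolist())
```

With unique names, a later selection such as `encoded["health"]` returns exactly one column.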

In [19]:
for i in grades_working:
    if (grades_working.dtypes[i] == "object"):
        #temp_df = pd.get_dummies(grades_working[i])
        grades_working = grades_working.drop(i, axis=1)
        #empty_df_blank = pd.concat([empty_df_blank, temp_df], axis=1)

grades_working.head()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,schoolsup,famsup,paid,activities,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,18,4,4,2,2,0,1,0,0,0,...,4,3,4,1,1,3,6,5,6,6
1,17,1,1,1,2,0,0,1,0,0,...,5,3,3,1,1,3,4,5,5,6
2,15,1,1,1,2,3,1,0,1,0,...,4,3,2,2,3,3,10,7,8,10
3,15,4,2,1,3,0,0,1,1,1,...,3,2,2,1,1,5,2,15,14,15
4,16,3,3,1,2,0,0,1,1,0,...,4,3,2,1,2,5,4,6,10,10


In [20]:
grades_final = pd.concat([grades_working, empty_df_blank], axis=1)
grades_final.head()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,schoolsup,famsup,paid,activities,...,other,services,teacher,course,home,other.1,reputation,father,mother,other.2
0,18,4,4,2,2,0,1,0,0,0,...,0,0,1,1,0,0,0,0,1,0
1,17,1,1,1,2,0,0,1,0,0,...,1,0,0,1,0,0,0,1,0,0
2,15,1,1,1,2,3,1,0,1,0,...,1,0,0,0,0,1,0,0,1,0
3,15,4,2,1,3,0,0,1,1,1,...,0,1,0,0,1,0,0,0,1,0
4,16,3,3,1,2,0,0,1,1,0,...,1,0,0,0,1,0,0,1,0,0


In [21]:
grades_final.dtypes

age           int64
Medu          int64
Fedu          int64
traveltime    int64
studytime     int64
failures      int64
schoolsup     int64
famsup        int64
paid          int64
activities    int64
nursery       int64
higher        int64
internet      int64
romantic      int64
famrel        int64
freetime      int64
goout         int64
Dalc          int64
Walc          int64
health        int64
absences      int64
G1            int64
G2            int64
G3            int64
GP            uint8
MS            uint8
F             uint8
M             uint8
R             uint8
U             uint8
GT3           uint8
LE3           uint8
A             uint8
T             uint8
at_home       uint8
health        uint8
other         uint8
services      uint8
teacher       uint8
at_home       uint8
health        uint8
other         uint8
services      uint8
teacher       uint8
course        uint8
home          uint8
other         uint8
reputation    uint8
father        uint8
mother        uint8
other         uint8
dtype: object


In [22]:
X = grades_final.drop('G3', axis=1)
y = grades_final['G3']

In [23]:
# Use train_test_split to create training and testing data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [24]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

X_minmax = MinMaxScaler().fit(X_train)

X_train_minmax = X_minmax.transform(X_train)
X_test_minmax = X_minmax.transform(X_test)



# Score all features with chi2 (k='all'); the 9 highest-scoring ones are listed below
bestfeatures = SelectKBest(score_func=chi2, k='all')
fit = bestfeatures.fit(X_train_minmax,y_train)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
top_features = featureScores.nlargest(9,'Score')
print(top_features)  #print best features

        Specs      Score
5    failures  38.740947
21         G1  28.906988
33    at_home  25.496292
6   schoolsup  25.232126
13   romantic  19.134260
34     health  17.915130
8        paid  17.620926
45      other  17.563107
22         G2  17.325296
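Note that `chi2` scores each non-negative feature against a class label, so G3 is treated here as a set of categories rather than a number. For a numeric target, scikit-learn also provides `f_regression`. The sketch below runs it on synthetic data (all values are invented, and the result says nothing about the real dataset); it also builds the score table in a single `DataFrame` call.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins: "G2" drives the target, the other columns do not.
X_toy = pd.DataFrame({
    "G2": rng.integers(0, 21, size=n),
    "absences": rng.integers(0, 30, size=n),
    "noise": rng.normal(size=n),
})
y_toy = X_toy["G2"] + rng.normal(scale=1.0, size=n)

# k="all" keeps every feature; we only want the scores here.
selector = SelectKBest(score_func=f_regression, k="all").fit(X_toy, y_toy)

scores = (
    pd.DataFrame({"Specs": X_toy.columns, "Score": selector.scores_})
    .nlargest(2, "Score")
)
print(scores)
```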


In [25]:
# Create the model using LinearRegression
from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [26]:
# Fit the model to the training data and calculate the score for the training data

model.fit(X_train, y_train)
training_score = model.score(X_train, y_train)

print(f"Training Score: {training_score}")

Training Score: 0.8622445843035902


In [27]:
# Calculate the score for the testing data

testing_score = model.score(X_test, y_test)
print(f"Testing Score: {testing_score}")

Testing Score: 0.7812016133435433
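For a `LinearRegression`, `model.score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot. As a small illustration, the function below computes it by hand; the true values are the five G3 grades shown by `grades.head()`, while the predictions are made up.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [6, 6, 10, 15, 10]   # G3 of the first five students
y_pred = [5, 6, 9, 14, 11]    # hypothetical predictions

score = r_squared(y_true, y_pred)
print(round(score, 4))
```

A score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean grade.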


In [28]:
# Fit the model to the MinMax-scaled training data and calculate the training score

model.fit(X_train_minmax, y_train)
training_score_normalized = model.score(X_train_minmax, y_train)

print(f"Training Score: {training_score_normalized}")

Training Score: 0.8622314115268453


In [29]:
# Calculate the score for the MinMax-scaled testing data

testing_score_normalized = model.score(X_test_minmax, y_test)
print(f"Testing Score: {testing_score_normalized}")

Testing Score: 0.7812293279464217
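The scaled and unscaled scores agree to about four decimal places. This is expected: ordinary least squares with an intercept gives the same fitted values when each feature is shifted and rescaled, because the coefficients absorb the change. The sketch below checks this with NumPy on random data (invented for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 20, size=(50, 3))
y = X @ np.array([0.2, 0.7, -0.1]) + rng.normal(scale=0.5, size=50)


def fit_predict(features, target):
    """Ordinary least squares with an intercept; returns the fitted values."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef


# Min-max scaling: each column is mapped onto [0, 1].
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

raw_fit = fit_predict(X, y)
scaled_fit = fit_predict(X_scaled, y)
print(np.allclose(raw_fit, scaled_fit))
```

Scaling matters more for models that are sensitive to feature magnitude, such as regularized or distance-based ones.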


### Testing model with best features

In [30]:
feature_arrey = []
for i in top_features["Specs"]:
    feature_arrey.append(i)

In [31]:
feature_selected_df = grades_final.loc[:, feature_arrey]
feature_selected_df.head()

Unnamed: 0,failures,G1,at_home,at_home.1,schoolsup,romantic,health,health.1,health.2,paid,other,other.1,other.2,other.3,G2
0,0,5,1,0,1,0,3,0,0,0,0,0,0,0,6
1,0,5,1,0,0,0,3,0,0,0,0,1,0,0,5
2,3,7,1,0,1,0,3,0,0,1,0,1,1,0,8
3,0,15,0,0,0,1,5,1,0,1,0,0,0,0,14
4,0,6,0,0,0,0,5,0,0,1,1,1,0,0,10


In [32]:
X = feature_selected_df
y = grades_final['G3']

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [34]:
model = LinearRegression()

In [35]:
# Fit the model to the training data and calculate the score for the training data

model.fit(X_train, y_train)
training_score = model.score(X_train, y_train)

print(f"Training Score: {training_score}")

Training Score: 0.8379508826668948


In [36]:
# Calculate the score for the testing data

testing_score = model.score(X_test, y_test)
print(f"Testing Score: {testing_score}")

Testing Score: 0.7973880269874376


In [37]:
test_df = grades_final[['G1', 'G2', 'failures']]
test_df.head()

Unnamed: 0,G1,G2,failures
0,5,6,0
1,5,5,0
2,7,8,3
3,15,14,0
4,6,10,0


In [38]:
selected_features = grades_final[['G1', 'G2', 'studytime', 'failures', 'health', 'absences']]
selected_features.head()

Unnamed: 0,G1,G2,studytime,failures,health,health.1,health.2,absences
0,5,6,2,0,3,0,0,6
1,5,5,2,0,3,0,0,4
2,7,8,2,3,3,0,0,10
3,15,14,3,0,5,1,0,2
4,6,10,2,0,5,0,0,4


In [39]:
X = selected_features
y = grades_final['G3']

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [41]:
model = LinearRegression()

In [42]:
# Fit the model to the training data and calculate the score for the training data

model.fit(X_train, y_train)
training_score = model.score(X_train, y_train)

print(f"Training Score: {training_score}")

Training Score: 0.8335780966652488


In [43]:
# Calculate the score for the testing data

testing_score = model.score(X_test, y_test)
print(f"Testing Score: {testing_score}")

Testing Score: 0.8192924049135445


In [44]:
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [45]:
import joblib
# Note: despite the file name, the saved estimator is a LinearRegression
filename = 'LogisticRegression.sav'
joblib.dump(regr, filename)

['LogisticRegression.sav']

In [46]:
joblib_LR_model = joblib.load(filename)
joblib_LR_model

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

# Note: regr was fit on the 8 columns of selected_features, so a 5-value row does not match
test_data = [[19, 19, 5, 0, 0]]
Ypredict = joblib_LR_model.predict(test_data)  
Ypredict
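A prediction row has to supply one value per column the model was fitted on, in the same order. One way to make this harder to get wrong is to save the feature names next to the model. The sketch below assumes a small dictionary "bundle" and a hypothetical file name; both are illustrative choices, and the training rows are invented.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented toy data standing in for the selected features and G3.
features = ["G1", "G2", "failures"]
X_toy = pd.DataFrame(
    [[5, 6, 0], [7, 8, 3], [15, 14, 0], [6, 10, 0], [12, 12, 1]],
    columns=features,
)
y_toy = pd.Series([6, 10, 15, 10, 12])

# Keep the fitted model and its expected column order together.
bundle = {"model": LinearRegression().fit(X_toy, y_toy), "features": features}

# Hypothetical file name; a temporary directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "linear_regression_bundle.joblib")
joblib.dump(bundle, path)

loaded = joblib.load(path)
new_row = pd.DataFrame([{"G1": 19, "G2": 19, "failures": 0}])

# Selecting by the saved names raises a KeyError if a feature is missing.
prediction = loaded["model"].predict(new_row[loaded["features"]])
print(prediction)
```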

In [49]:
empty_df_int = pd.DataFrame(columns=[])
empty_df_int

In [50]:
for i in grades:
    if (grades.dtypes[i] == "int64"):
        # DataFrame.append was removed in pandas 2.0; on newer versions use pd.concat instead
        empty_df_int = empty_df_int.append(grades[i])
        #grades_dummies = grades_working.drop(columns=[i])
        #empty_df = pd.concat([empty_df, temp_df], axis=1)

empty_df_int.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,385,386,387,388,389,390,391,392,393,394
age,18.0,17.0,15.0,15.0,16.0,16.0,16.0,17.0,15.0,15.0,...,18.0,18.0,19.0,18.0,18.0,20.0,17.0,21.0,18.0,19.0
Medu,4.0,1.0,1.0,4.0,3.0,4.0,2.0,4.0,3.0,3.0,...,2.0,4.0,2.0,3.0,1.0,2.0,3.0,1.0,3.0,1.0
Fedu,4.0,1.0,1.0,2.0,3.0,3.0,2.0,4.0,2.0,4.0,...,2.0,4.0,3.0,1.0,1.0,2.0,1.0,1.0,2.0,1.0
traveltime,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,...,2.0,3.0,1.0,1.0,2.0,1.0,2.0,1.0,3.0,1.0
studytime,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,2.0,2.0,...,3.0,1.0,3.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0


In [51]:
grades_df_obj = grades.select_dtypes(exclude=['int64'])
grades_df_obj.head()

Unnamed: 0,school,sex,address,famsize,Pstatus,Mjob,Fjob,reason,guardian,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic
0,GP,F,U,GT3,A,at_home,teacher,course,mother,yes,no,no,no,yes,yes,no,no
1,GP,F,U,GT3,T,at_home,other,course,father,no,yes,no,no,no,yes,yes,no
2,GP,F,U,LE3,T,at_home,other,other,mother,yes,no,yes,no,yes,yes,yes,no
3,GP,F,U,GT3,T,health,services,home,mother,no,yes,yes,yes,yes,yes,yes,yes
4,GP,F,U,GT3,T,other,other,home,father,no,yes,yes,no,yes,yes,no,no


In [52]:
grades_df_int = grades.select_dtypes(include=['int64'])
grades_df_int.head()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,18,4,4,2,2,0,4,3,4,1,1,3,6,5,6,6
1,17,1,1,1,2,0,5,3,3,1,1,3,4,5,5,6
2,15,1,1,1,2,3,4,3,2,2,3,3,10,7,8,10
3,15,4,2,1,3,0,3,2,2,1,1,5,2,15,14,15
4,16,3,3,1,2,0,4,3,2,1,2,5,4,6,10,10


In [53]:
X = grades_df_int.drop(['age', 'G3'], axis=1)
y = grades_df_int['G3']

In [54]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [55]:
model = LinearRegression()

In [56]:
# Fit the model to the training data and calculate the score for the training data

model.fit(X_train, y_train)
training_score = model.score(X_train, y_train)

print(f"Training Score: {training_score}")

Training Score: 0.8416657094972266


In [57]:
# Calculate the score for the testing data

testing_score = model.score(X_test, y_test)
print(f"Testing Score: {testing_score}")

Testing Score: 0.8130656276244678


In [58]:
# Score all features with chi2 (k='all'); the 7 highest-scoring ones are listed below
bestfeatures = SelectKBest(score_func=chi2, k='all')
fit = bestfeatures.fit(X_train,y_train)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
top_features = featureScores.nlargest(7,'Score')
print(top_features)  #print best features

       Specs       Score
11  absences  535.313800
13        G2  329.180626
12        G1  219.550138
4   failures  116.222840
9       Walc   20.982562
10    health   14.605100
8       Dalc   12.191248


In [59]:
feature_arrey_int = []
for i in top_features["Specs"]:
    feature_arrey_int.append(i)

In [60]:
feature_selected_int = grades_df_int.loc[:, feature_arrey_int]
feature_selected_int.head()

Unnamed: 0,absences,G2,G1,failures,Walc,health,Dalc
0,6,6,5,0,1,3,1
1,4,5,5,0,1,3,1
2,10,8,7,3,3,3,2
3,2,14,15,0,1,5,1
4,4,10,6,0,2,5,1


In [61]:
X = feature_selected_int
y = grades_df_int['G3']

In [62]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LinearRegression()

In [63]:
# Fit the model to the training data and calculate the score for the training data

model.fit(X_train, y_train)
training_score = model.score(X_train, y_train)

print(f"Training Score: {training_score}")

Training Score: 0.8329819545431996


In [64]:
# Calculate the score for the testing data

testing_score = model.score(X_test, y_test)
print(f"Testing Score: {testing_score}")

Testing Score: 0.8183938263548582


In [72]:
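# Note: this dumps regr (fit earlier), not the model trained in the cells above;
# despite the file name, it is a LinearRegression
filename = 'LogisticRegression_1.sav'
joblib.dump(regr, filename)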

['LogisticRegression_1.sav']

In [73]:
joblib_LR_model = joblib.load(filename)
joblib_LR_model

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [77]:
test_data_1 = [[2, 19, 16, 1, 5, 1, 1, 1]]
Ypredict = joblib_LR_model.predict(test_data_1)
Ypredict

array([14.7366912])

#### Testing with user selected features

In [98]:
X = grades_df_int[['studytime', 'failures', 'freetime', 'absences', 'health', 'G1', 'G2']]
y = grades_df_int['G3']

In [99]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LinearRegression()

In [100]:
# Fit the model to the training data and calculate the score for the training data

model.fit(X_train, y_train)
training_score = model.score(X_train, y_train)

print(f"Training Score: {training_score}")

Training Score: 0.8327928933085645


In [101]:
# Calculate the score for the testing data

testing_score = model.score(X_test, y_test)
print(f"Testing Score: {testing_score}")

Testing Score: 0.8211221777599599


#### Testing with MinMax Scaler

In [65]:
X_minmax = MinMaxScaler().fit(X_train)

X_train_minmax = X_minmax.transform(X_train)
X_test_minmax = X_minmax.transform(X_test)

In [66]:
# Fit the model to the training data and calculate the score for the training data

model.fit(X_train_minmax, y_train)
training_score = model.score(X_train_minmax, y_train)

print(f"Training Score: {training_score}")

Training Score: 0.8329819545431996


In [67]:
# Calculate the score for the testing data

testing_score = model.score(X_test_minmax, y_test)
print(f"Testing Score: {testing_score}")

Testing Score: 0.8183938263548582


#### Selecting Best Features

In [28]:
feature_arrey = []
for i in top_features["Specs"]:
    feature_arrey.append(i)

In [29]:
feature_selected_df = grades_final.loc[:, feature_arrey]
feature_selected_df.head()

Unnamed: 0,failures,G1,at_home,at_home.1,schoolsup,romantic,health,health.1,health.2,paid,...,at_home.2,A,teacher,teacher.1,R,teacher.2,teacher.3,MS,M,course
0,0,5,1,0,1,0,3,0,0,0,...,0,1,0,1,0,0,1,0,0,1
1,0,5,1,0,0,0,3,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,3,7,1,0,1,0,3,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,0,15,0,0,0,1,5,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,0,6,0,0,0,0,5,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [30]:
X_feature = feature_selected_df.copy()
y_feature = grades_final['G3']

In [31]:
X_train_feature, X_test_feature, y_train_feature, y_test_feature = train_test_split(X_feature, y_feature, random_state=42)
X_train_feature.head()

Unnamed: 0,failures,G1,at_home,at_home.1,schoolsup,romantic,health,health.1,health.2,paid,...,at_home.2,A,teacher,teacher.1,R,teacher.2,teacher.3,MS,M,course
16,0,13,0,0,0,0,2,0,0,1,...,0,0,0,0,0,0,0,0,0,0
66,0,13,0,0,0,1,3,0,0,0,...,0,1,0,0,0,0,0,0,1,0
211,0,12,0,0,0,1,3,0,0,1,...,0,0,0,0,0,0,0,0,1,0
7,0,6,0,0,1,0,1,0,0,0,...,0,1,0,1,0,0,1,0,0,0
19,0,8,0,0,0,0,5,1,0,1,...,0,0,0,0,0,0,0,0,1,0


In [32]:
X_minmax_feature = MinMaxScaler().fit(X_train_feature)

X_train_minmax_feature = X_minmax_feature.transform(X_train_feature)
X_test_minmax_feature = X_minmax_feature.transform(X_test_feature)

In [33]:
model.fit(X_train_minmax_feature, y_train_feature)
training_score_feature = model.score(X_train_minmax_feature, y_train_feature)

print(f"Training Score: {training_score_feature}")

Training Score: 0.8398808444211366


In [34]:
# Calculate the score for the testing data

testing_score_feature = model.score(X_test_minmax_feature, y_test_feature)
print(f"Testing Score: {testing_score_feature}")

Testing Score: 0.7938934567962836


In [35]:
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [36]:
regr.coef_

array([-1.73135322e-01,  1.29724861e-01, -1.33787854e-01,  9.70972520e-02,
       -1.04779954e-01, -1.60608558e-01,  4.56467477e-01,  1.76552134e-01,
        7.61899362e-02, -3.45785346e-01, -2.22143379e-01,  2.25860726e-01,
       -1.44194011e-01, -2.71883159e-01,  3.56965015e-01,  4.70991258e-02,
        1.18191733e-02, -1.85123343e-01,  1.76763124e-01,  6.29547387e-02,
        4.58748209e-02,  1.88756816e-01,  9.57354671e-01, -6.76051723e+11,
       -6.76051723e+11,  8.80102251e+11,  8.80102251e+11,  1.13605457e+11,
        1.13605457e+11, -1.40447826e+10, -1.40447826e+10, -9.45366593e+10,
       -9.45366593e+10, -2.33111178e+11, -2.33111178e+11, -2.33111178e+11,
       -2.33111178e+11, -2.33111178e+11,  1.13272768e+11,  1.13272768e+11,
        1.13272768e+11,  1.13272768e+11,  1.13272768e+11, -8.35928104e+10,
       -8.35928104e+10, -8.35928104e+10, -8.35928104e+10,  5.54390160e+11,
        5.54390160e+11,  5.54390160e+11])

In [37]:
regr.intercept_

-560033481943.9865
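Coefficients and an intercept on the order of 1e11 are a typical symptom of perfectly collinear inputs. Here the large values repeat in groups of 2, 5, 4 and 3, which matches the sizes of the dummy groups: each categorical attribute was expanded into a full set of dummies, so every group sums to 1 and duplicates the intercept. The sketch below uses invented toy rows to compare the rank of the design matrix for full one-hot encoding and for `drop_first=True`.

```python
import numpy as np
import pandas as pd

# Invented rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "sex": ["F", "M", "F", "M", "F", "M"],
    "guardian": ["mother", "father", "other", "mother", "father", "mother"],
})


def design_matrix(dummies):
    """Prepend an intercept column of ones to the dummy columns."""
    values = dummies.to_numpy(dtype=float)
    return np.column_stack([np.ones(len(values)), values])


full = design_matrix(pd.get_dummies(toy))
reduced = design_matrix(pd.get_dummies(toy, drop_first=True))

full_rank = np.linalg.matrix_rank(full)
reduced_rank = np.linalg.matrix_rank(reduced)

# Rank below the column count means the coefficients are not uniquely determined.
print(full.shape[1], full_rank)
print(reduced.shape[1], reduced_rank)
```

Dropping one level per attribute removes the redundancy, so the least-squares solution becomes unique and the coefficients stay interpretable.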

In [38]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.scatter(model.predict(X_train_minmax_feature),
            model.predict(X_train_minmax_feature) - y_train_feature, c="blue", label="Training Data")
plt.scatter(model.predict(X_test_minmax_feature),
            model.predict(X_test_minmax_feature) - y_test_feature, c="orange", label="Testing Data")
plt.legend()
plt.hlines(y=0, xmin=y.min(), xmax=y.max())
plt.title("Residual Plot")
plt.show()

<Figure size 1000x600 with 1 Axes>

### Random Forest

In [39]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
import joblib
import warnings
warnings.filterwarnings('ignore')

In [40]:
# Scale X values
X_scaler = MinMaxScaler().fit(X_train)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)

In [41]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [42]:
X_minmax = MinMaxScaler().fit(X_train)

X_train_minmax = X_minmax.transform(X_train)
X_test_minmax = X_minmax.transform(X_test)

In [43]:
# Score features with chi2; k=10 only affects transform(), while the 20 highest scores are listed below
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X_train_minmax,y_train)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
top_features = featureScores.nlargest(20,'Score')
print(top_features)  #print best features

         Specs      Score
5     failures  38.740947
21          G1  28.906988
33     at_home  25.496292
6    schoolsup  25.232126
13    romantic  19.134260
34      health  17.915130
8         paid  17.620926
45       other  17.563107
22          G2  17.325296
46  reputation  17.176953
36    services  16.768418
38     at_home  14.633767
31           A  13.079180
42     teacher  12.719343
27           R  11.166497
37     teacher  10.581299
24          MS  10.435549
26           M  10.100207
43      course  10.011794
41    services   9.670417


In [44]:
feature_arrey = []
for i in top_features["Specs"]:
    feature_arrey.append(i)

In [45]:
feature_selected_df = grades_final.loc[:, feature_arrey]
feature_selected_df.head()

Unnamed: 0,failures,G1,at_home,at_home.1,schoolsup,romantic,health,health.1,health.2,paid,...,teacher,teacher.1,R,teacher.2,teacher.3,MS,M,course,services,services.1
0,0,5,1,0,1,0,3,0,0,0,...,0,1,0,0,1,0,0,1,0,0
1,0,5,1,0,0,0,3,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,7,1,0,1,0,3,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,0,15,0,0,0,1,5,1,0,1,...,0,0,0,0,0,0,0,0,0,1
4,0,6,0,0,0,0,5,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [46]:
X_feature = feature_selected_df.copy()
y_feature = grades_final["G3"]

In [47]:
X_train_feature, X_test_feature, y_train_feature, y_test_feature = train_test_split(X_feature, y_feature, random_state=42)
X_train_feature.head()

Unnamed: 0,failures,G1,at_home,at_home.1,schoolsup,romantic,health,health.1,health.2,paid,...,teacher,teacher.1,R,teacher.2,teacher.3,MS,M,course,services,services.1
16,0,13,0,0,0,0,2,0,0,1,...,0,0,0,0,0,0,0,0,1,1
66,0,13,0,0,0,1,3,0,0,0,...,0,0,0,0,0,0,1,0,0,1
211,0,12,0,0,0,1,3,0,0,1,...,0,0,0,0,0,0,1,0,1,0
7,0,6,0,0,1,0,1,0,0,0,...,0,1,0,0,1,0,0,0,0,0
19,0,8,0,0,0,0,5,1,0,1,...,0,0,0,0,0,0,1,0,0,0


In [48]:
X_minmax_feature = MinMaxScaler().fit(X_train_feature)

X_train_minmax_feature = X_minmax_feature.transform(X_train_feature)
X_test_minmax_feature = X_minmax_feature.transform(X_test_feature)

In [49]:
classifier = RandomForestClassifier(n_estimators=200)
classifier

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [50]:
classifier.fit(X_train_minmax_feature, y_train_feature)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [51]:
classifier_score_feature = round(classifier.score(X_train_minmax_feature, y_train_feature)*100, 2)
base_model_score_feature = round(classifier.score(X_test_minmax_feature, y_test_feature)*100, 2)

print(f'Train Data Score: {classifier_score_feature}%')
print(f'Test Data Score: {base_model_score_feature}%')

Train Data Score: 99.66%
Test Data Score: 33.33%
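`RandomForestClassifier` treats every distinct G3 value as a separate class, and its `score` is exact-match accuracy, so these percentages are not directly comparable with the R² scores of the linear models; the gap between 99.66% and 33.33% also points to heavy overfitting. Since G3 is numeric, a `RandomForestRegressor` is one alternative worth trying. The sketch below runs on synthetic data (all values are invented), and every setting other than `n_estimators=200` is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 300

# Synthetic grades: G1 and the target both stay close to G2.
g2 = rng.integers(0, 21, size=n)
X_toy = pd.DataFrame({
    "G1": np.clip(g2 + rng.integers(-2, 3, size=n), 0, 20),
    "G2": g2,
})
y_toy = np.clip(g2 + rng.integers(-1, 2, size=n), 0, 20)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

regressor = RandomForestRegressor(n_estimators=200, random_state=42)
regressor.fit(X_tr, y_tr)

# For regressors, score() is R^2, the same metric as LinearRegression.score.
train_r2 = regressor.score(X_tr, y_tr)
test_r2 = regressor.score(X_te, y_te)
print(f"Train R^2: {train_r2:.3f}")
print(f"Test R^2: {test_r2:.3f}")
```

Tree-based models do not need the MinMax scaling step, so the sketch fits on the raw columns.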
