# Objective: Student Grant Recommendation

You have historical student performance data and the corresponding grant recommendation outcomes in a comma-separated values (CSV) file named student_records.csv. Each data sample consists of the following attributes.

>• Name (the student name)

>• OverallGrade (overall grade obtained)

>• Obedient (whether they were diligent during the course of their stay)

>• ResearchScore (marks obtained in their research work)

>• ProjectScore (marks obtained in the project)

>• Recommend (whether they got the grant recommendation)

Your main objective is to build a predictive model based on this data such that you can predict for any future student whether they will be recommended for the grant based on their performance attributes.

In [1]:
!pip install pandas

Collecting pandas
  Using cached https://files.pythonhosted.org/packages/52/ca/f986280226b62da6ae5474589a369b0d240f9a61a99144a501b45f108883/pandas-0.25.3-cp38-cp38-macosx_10_9_x86_64.whl
Collecting numpy>=1.13.3
  Downloading https://files.pythonhosted.org/packages/9e/cf/7cea38d32df6087d7c15bca8edef0be82e0d957119e9dafd7052dc6192f0/numpy-1.17.4-cp38-cp38-macosx_10_9_x86_64.whl (15.1MB)
Collecting pytz>=2017.2
  Downloading https://files.pythonhosted.org/packages/e7/f9/f0b53f88060247251bf481fa6ea62cd0d25bf1b11a87888e53ce5b7c8ad2/pytz-2019.3-py2.py3-none-any.whl (509kB)
Installing collected packages: numpy, pytz, pandas
Successfully installed numpy-1.17.4 pandas-0.25.3 pytz-2019.3


In [62]:
import pandas as pd
#--turn off warning messages
pd.options.mode.chained_assignment = None  # default='warn'

#--get data
df = pd.read_csv('./datasets_n_images/datasets_module_1/student_records.csv')
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85,Yes
1,John,C,N,85,51,Yes
2,David,F,N,10,17,No
3,Holmes,B,Y,75,71,No
4,Marvin,E,N,20,30,No
5,Simon,A,Y,92,79,Yes
6,Robert,B,Y,60,59,No
7,Trent,C,Y,75,33,No
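
The `Unnamed: 0` column in the output above appears because the CSV stores the row index as an unnamed first column. Here is a minimal sketch of reading such a file with `index_col=0`, assuming the file has that layout. The inline `csv_text` is a stand-in for the first rows of student_records.csv, not the real file.

```python
import io
import pandas as pd

# Stand-in for student_records.csv: the first (unnamed) column holds the row index.
csv_text = """,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85,Yes
1,John,C,N,85,51,Yes
2,David,F,N,10,17,No
"""

# Default read: the unnamed index column shows up as 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use that first column as the row index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```

The rest of the notebook selects columns by name, so it works with either form.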


In [63]:
#--get features and corresponding outcomes
feature_names = ['OverallGrade', 'Obedient', 'ResearchScore', 'ProjectScore']
training_features = df[feature_names].copy()  # copy so later in-place scaling does not touch df
# OR the above two lines could also be written as below (note the double brackets):
#training_features = df[['OverallGrade', 'Obedient', 'ResearchScore', 'ProjectScore']].copy()


print("type(training_features): ",type(training_features))
print("--------------\n")
outcome_name = ['Recommend']
outcome_labels = df[outcome_name]
print("\ntraining_features:")
print(training_features)
print("----------------")
print("\noutcome_labels:")
print(outcome_labels)


type(training_features):  <class 'pandas.core.frame.DataFrame'>
--------------


training_features:
  OverallGrade Obedient  ResearchScore  ProjectScore
0            A        Y             90            85
1            C        N             85            51
2            F        N             10            17
3            B        Y             75            71
4            E        N             20            30
5            A        Y             92            79
6            B        Y             60            59
7            C        Y             75            33
----------------

outcome_labels:
  Recommend
0       Yes
1       Yes
2        No
3        No
4        No
5       Yes
6        No
7        No
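
Selecting with a list of column names returns a DataFrame, while selecting with a single name returns a Series. That is why `outcome_labels` above prints with a column header. The sketch below uses a small made-up frame (not the real dataset) to show both forms, and shows that passing several names without the inner list fails.

```python
import pandas as pd

# Small illustrative frame; values mirror the first three rows of the data above.
df = pd.DataFrame({
    'ResearchScore': [90, 85, 10],
    'ProjectScore': [85, 51, 17],
    'Recommend': ['Yes', 'Yes', 'No'],
})

as_frame = df[['Recommend']]   # list of names -> DataFrame with one column
as_series = df['Recommend']    # single name   -> Series

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)

# Several names without the inner list are treated as ONE column key,
# which does not exist here, so pandas raises KeyError.
tuple_key_failed = False
try:
    df['ResearchScore', 'ProjectScore']
except KeyError:
    tuple_key_failed = True
print('tuple key failed:', tuple_key_failed)
```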


In [64]:
#--list down features based on type
numeric_feature_names = ['ResearchScore', 'ProjectScore']
categoricial_feature_names = ['OverallGrade', 'Obedient']

In [24]:
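!pip install scikit-learn  # the PyPI package is named scikit-learn; it is imported as sklearn
!pip install scikit-datasets  # not imported anywhere in this notebook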

Collecting scikit-datasets
  Downloading https://files.pythonhosted.org/packages/89/f1/858de12daffb183c368412425905d4b6e0fdda4d9d119dd2158123d2fdaa/scikit_datasets-0.1.36-py3-none-any.whl
Installing collected packages: scikit-datasets
Successfully installed scikit-datasets-0.1.36


In [65]:
#--scale or normalize our two numeric score-based attributes
from sklearn.preprocessing import StandardScaler    # sklearn.preprocessing provides the scalers
ss = StandardScaler() # create a StandardScaler object

# fit scaler on numeric features
ss.fit(training_features[numeric_feature_names])  # fit learns each column's mean and standard deviation

# scale numeric features now
training_features[numeric_feature_names] = ss.transform(training_features[numeric_feature_names]) # transform subtracts the mean and divides by the standard deviation

# view updated feature-set
print(training_features)

  OverallGrade Obedient  ResearchScore  ProjectScore
0            A        Y       0.899583      1.376650
1            C        N       0.730648     -0.091777
2            F        N      -1.803390     -1.560203
3            B        Y       0.392776      0.772004
4            E        N      -1.465519     -0.998746
5            A        Y       0.967158      1.117516
6            B        Y      -0.114032      0.253735
7            C        Y       0.392776     -0.869179
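
To see where the scaled numbers above come from, the sketch below recomputes the ResearchScore column by hand using only the standard library. It assumes the scaler uses the population standard deviation (dividing by n), which is consistent with the values printed above.

```python
from statistics import fmean, pstdev

# ResearchScore column from the training data above
research_scores = [90, 85, 10, 75, 20, 92, 60, 75]

mean = fmean(research_scores)
std = pstdev(research_scores)   # population standard deviation (divides by n)

# z-score: how many standard deviations each value lies from the mean
scaled = [(x - mean) / std for x in research_scores]

print([round(z, 6) for z in scaled])
```

After scaling, the column has mean 0 and standard deviation 1, so ResearchScore and ProjectScore contribute on a comparable scale.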


In [66]:
#--Engineering Categorical Features
training_features = pd.get_dummies(training_features, columns=categoricial_feature_names) # get_dummies one-hot encodes the listed columns

# view newly engineered features

print(training_features)

# We have converted our categorical data into numeric columns,
# or in other words, we have done feature engineering on the categorical data.

   ResearchScore  ProjectScore  OverallGrade_A  OverallGrade_B  \
0       0.899583      1.376650               1               0   
1       0.730648     -0.091777               0               0   
2      -1.803390     -1.560203               0               0   
3       0.392776      0.772004               0               1   
4      -1.465519     -0.998746               0               0   
5       0.967158      1.117516               1               0   
6      -0.114032      0.253735               0               1   
7       0.392776     -0.869179               0               0   

   OverallGrade_C  OverallGrade_E  OverallGrade_F  Obedient_N  Obedient_Y  
0               0               0               0           0           1  
1               1               0               0           1           0  
2               0               0               1           1           0  
3               0               0               0           0           1  
4               0        
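
`pd.get_dummies` only creates a column for each category it actually sees. The sketch below uses two small made-up frames to show that encoding new data separately can yield fewer columns than the training data did. `dtype=int` is passed to keep the 0/1 display used in this notebook, since newer pandas versions default to booleans.

```python
import pandas as pd

# Illustrative frames: training data covers five grades, new data only two.
train = pd.DataFrame({'OverallGrade': ['A', 'C', 'F', 'B', 'E']})
new = pd.DataFrame({'OverallGrade': ['F', 'A']})

train_encoded = pd.get_dummies(train, columns=['OverallGrade'], dtype=int)
new_encoded = pd.get_dummies(new, columns=['OverallGrade'], dtype=int)

print(list(train_encoded.columns))
print(list(new_encoded.columns))

# Each row has exactly one indicator column set to 1.
print(train_encoded.sum(axis=1).tolist())
```

This is why the prediction step later in the notebook has to add the missing indicator columns back before calling the model.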

In [67]:
#--get list of new categorical features
categorical_engineered_features = list(set(training_features.columns) - set(numeric_feature_names))

print(categorical_engineered_features)

['OverallGrade_C', 'OverallGrade_F', 'OverallGrade_A', 'OverallGrade_B', 'Obedient_N', 'Obedient_Y', 'OverallGrade_E']
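
A `set` has no guaranteed order, which is why the list printed above is shuffled relative to the DataFrame columns. If a stable order is wanted, one option is a list comprehension that filters the columns in place. This is a sketch: `all_columns` is a stand-in for `training_features.columns`.

```python
numeric_feature_names = ['ResearchScore', 'ProjectScore']

# Stand-in for training_features.columns after one-hot encoding
all_columns = ['ResearchScore', 'ProjectScore',
               'OverallGrade_A', 'OverallGrade_B', 'OverallGrade_C',
               'OverallGrade_E', 'OverallGrade_F',
               'Obedient_N', 'Obedient_Y']

# Keep every non-numeric column, preserving the original column order
categorical_engineered_features = [
    col for col in all_columns if col not in numeric_feature_names
]

print(categorical_engineered_features)
```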


In [68]:
from sklearn.linear_model import LogisticRegression
import numpy as np
import warnings; warnings.simplefilter('ignore')  

#--fit the model
lr = LogisticRegression() # create the estimator object
model = lr.fit(training_features, np.array(outcome_labels['Recommend'])) 
# np.array() converts the 'Recommend' column (a pandas Series) into a NumPy array of labels

#--view model parameters
print(model)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
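
As a rough illustration of what a fitted logistic regression does at prediction time, the sketch below applies the sigmoid function to a weighted sum of features. The weights and intercept are hypothetical values chosen for illustration. They are not the coefficients learned by the model above, and only the two numeric features are used to keep it short.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and intercept (NOT the fitted model's coefficients)
weights = {'ResearchScore': 1.2, 'ProjectScore': 0.8}
intercept = -0.5

def predict_proba_yes(features):
    """Probability of 'Yes' for one row of scaled features."""
    z = intercept + sum(weights[name] * value for name, value in features.items())
    return sigmoid(z)

strong = {'ResearchScore': 0.9, 'ProjectScore': 1.4}    # scaled scores similar to row 0
weak = {'ResearchScore': -1.8, 'ProjectScore': -1.6}    # scaled scores similar to row 2

for label, row in [('strong', strong), ('weak', weak)]:
    p = predict_proba_yes(row)
    print(label, round(p, 3), 'Yes' if p >= 0.5 else 'No')
```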


In [69]:
#--simple evaluation on training data
pred_labels = model.predict(training_features)
actual_labels = np.array(outcome_labels['Recommend'])

#--evaluate model performance
from sklearn.metrics import accuracy_score    
from sklearn.metrics import classification_report

print('Accuracy:', float(accuracy_score(actual_labels, pred_labels))*100, '%')
# SYNTAX: accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)

print('Classification Stats:')
print(classification_report(actual_labels, pred_labels))

Accuracy: 100.0 %
Classification Stats:
              precision    recall  f1-score   support

          No       1.00      1.00      1.00         5
         Yes       1.00      1.00      1.00         3

    accuracy                           1.00         8
   macro avg       1.00      1.00      1.00         8
weighted avg       1.00      1.00      1.00         8
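
Every number in the report above is 1.00 because the model is scored on the same eight rows it was trained on, which tends to overstate how well it will do on new students. To show how precision, recall, F1, and support are computed, the sketch below uses the actual labels from this notebook together with a hypothetical set of predictions in which row 1 is misclassified.

```python
actual = ['Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'No', 'No']
# Hypothetical predictions: row 1 is wrongly predicted as 'No'
predicted = ['Yes', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No']

def class_metrics(actual, predicted, positive):
    """Return (precision, recall, f1, support) treating `positive` as the class of interest."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # correctly predicted positive
    fp = sum(a != positive and p == positive for a, p in pairs)  # predicted positive, actually not
    fn = sum(a == positive and p != positive for a, p in pairs)  # missed positives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print('accuracy:', accuracy)
for label in ['No', 'Yes']:
    precision, recall, f1, support = class_metrics(actual, predicted, label)
    print(label, round(precision, 2), round(recall, 2), round(f1, 2), support)
```

Precision asks how many of the rows predicted as a class really belong to it. Recall asks how many rows of that class were found.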



MUST WATCH VIDEO TO UNDERSTAND "CLASSIFICATION REPORT"
--
>https://www.youtube.com/watch?v=HBi-P5j0Kec

sklearn Metrics

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html

In [70]:
#--Model Deployment  -- optional in our case
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package
import os
#--save models to be deployed on your server
if not os.path.exists('Model'):
    os.mkdir('Model') # create the directory
if not os.path.exists('Scaler'):
    os.mkdir('Scaler') 
    
joblib.dump(model, r'Model/model.pickle') # serialize the model to a binary pickle file (not encrypted)
joblib.dump(ss, r'Scaler/scaler.pickle')

# Both folders are created relative to the current working directory

['Scaler/scaler.pickle']
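
Saving and loading a fitted object is a serialization round trip: the object is written out as bytes and later rebuilt from them. The sketch below shows the same idea with the standard-library `pickle` module and a temporary directory. The `scaler_state` dictionary is a made-up stand-in for a fitted object, not the real scaler.

```python
import os
import pickle
import tempfile

# Made-up stand-in for a fitted object (column names and their means)
scaler_state = {
    'feature_names': ['ResearchScore', 'ProjectScore'],
    'mean': [63.375, 53.125],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'scaler.pickle')

    # Serialize the object to a binary file
    with open(path, 'wb') as f:
        pickle.dump(scaler_state, f)

    # Rebuild an equal object from the file; no key or password is involved
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == scaler_state)
```

Pickle files can execute code when loaded, so only load files that come from a trusted source.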

In [71]:
#--Prediction in Action
#--load model and scaler objects
model = joblib.load(r'Model/model.pickle')
scaler = joblib.load(r'Scaler/scaler.pickle')

# We have some sample new student records (for two students) 
# for which we want our model to predict if they will get the 
# grant recommendation. 
# Let’s retrieve and view this data using the following code.

#--data retrieval
new_data = pd.DataFrame([{'Name': 'Ninad', 'OverallGrade': 'F', 'Obedient': 'N', 'ResearchScore': 30, 'ProjectScore': 20},
                  {'Name': 'Thomas', 'OverallGrade': 'A', 'Obedient': 'Y', 'ResearchScore': 78, 'ProjectScore': 80}])

print(new_data)

     Name OverallGrade Obedient  ResearchScore  ProjectScore
0   Ninad            F        N             30            20
1  Thomas            A        Y             78            80


In [72]:
# w.r.t new data
# We will now carry out the tasks relevant to 
# data preparation (feature extraction, engineering, and scaling) 
# in the following code snippet.

#--data preparation
prediction_features = new_data[feature_names].copy()  # copy to avoid modifying a view of new_data
print("prediction_features\n",prediction_features)
#--scaling
prediction_features[numeric_feature_names] = scaler.transform(prediction_features[numeric_feature_names])

#--engineering categorical variables
prediction_features = pd.get_dummies(prediction_features, columns=categoricial_feature_names)

#--view feature set
print(prediction_features)

prediction_features
   OverallGrade Obedient  ResearchScore  ProjectScore
0            F        N             30            20
1            A        Y             78            80
   ResearchScore  ProjectScore  OverallGrade_A  OverallGrade_F  Obedient_N  \
0      -1.127647     -1.430636               0               1           1   
1       0.494137      1.160705               1               0           0   

   Obedient_Y  
0           0  
1           1  


In [73]:
# add missing categorical feature columns
current_categorical_engineered_features = set(prediction_features.columns) - set(numeric_feature_names)

missing_features = set(categorical_engineered_features) - current_categorical_engineered_features

for feature in missing_features:
    # add zeros since feature is absent in these data samples
    prediction_features[feature] = [0] * len(prediction_features) # set the missing columns' values to 0
    


# view final feature set
print(prediction_features)

   ResearchScore  ProjectScore  OverallGrade_A  OverallGrade_F  Obedient_N  \
0      -1.127647     -1.430636               0               1           1   
1       0.494137      1.160705               1               0           0   

   Obedient_Y  OverallGrade_B  OverallGrade_E  OverallGrade_C  
0           0               0               0               0  
1           1               0               0               0  
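
Note that the loop above appends the missing columns at the end, so the column order no longer matches the training data. scikit-learn estimators match features by position (recent versions also check column names), so the order matters. One way to add missing columns and restore the order in a single step is `DataFrame.reindex`. In this sketch, `training_columns` is copied by hand from the training output earlier in this notebook.

```python
import pandas as pd

# Column order seen during training (copied from the training output above)
training_columns = ['ResearchScore', 'ProjectScore',
                    'OverallGrade_A', 'OverallGrade_B', 'OverallGrade_C',
                    'OverallGrade_E', 'OverallGrade_F',
                    'Obedient_N', 'Obedient_Y']

# Encoded new data: some indicator columns are missing
prediction_features = pd.DataFrame({
    'ResearchScore': [-1.127647, 0.494137],
    'ProjectScore': [-1.430636, 1.160705],
    'OverallGrade_A': [0, 1],
    'OverallGrade_F': [1, 0],
    'Obedient_N': [1, 0],
    'Obedient_Y': [0, 1],
})

# reindex adds absent columns filled with 0 and puts all columns in training order
aligned = prediction_features.reindex(columns=training_columns, fill_value=0)

print(aligned)
```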


In [74]:
# We have our complete feature set ready for both the new students. 
# Let’s put our model to the test and get the predictions 
# with regard to grant recommendations!

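# put the columns in the same order as during training before predicting
predictions = model.predict(prediction_features[training_features.columns])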

##--display results
new_data['Recommend'] = predictions
print(new_data)

     Name OverallGrade Obedient  ResearchScore  ProjectScore Recommend
0   Ninad            F        N             30            20        No
1  Thomas            A        Y             78            80       Yes


# A Few Q&A

![Underfitting overfitting image](datasets_n_images/images/underfitting_overfitting_image_1.png 'underfitting overfitting image')

![Another Underfitting Overfitting Image](datasets_n_images/images/underfitting_overfitting_image_2.png 'Another Underfitting Overfitting image')