#### TASK:
The task is to build a model that relates the number of hours a student devotes to studies to the score he or she obtains in the corresponding examination.

#### APPROACH:
We first import the data using pandas. We skip a detailed EDA because the dataset is small and simple: it has only two variables, `Hours` and `Scores`. Instead, we compare several models to find out which one makes the most accurate predictions. Finally, we predict the score that a student should expect if he or she devotes 9.25 hours daily to studies.
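
Even without a full EDA, a quick sanity check can confirm that the two variables move together. The sketch below is an illustration only: it rebuilds a small frame from the ten rows displayed in this notebook (the name `sample` is hypothetical, and the actual CSV may contain more rows) and computes the Pearson correlation.

```python
import pandas as pd

# Ten (Hours, Scores) rows as displayed in this notebook; the real CSV may hold more.
sample = pd.DataFrame({
    "Hours":  [2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 2.7],
    "Scores": [21, 47, 27, 75, 30, 20, 88, 60, 81, 25],
})

# Basic summary statistics and a total count of missing values.
summary = sample.describe()
missing = sample.isna().sum().sum()

# Pearson correlation between hours studied and score obtained.
corr = sample["Hours"].corr(sample["Scores"])

print(summary)
print("missing values:", missing)
print("correlation:", round(corr, 3))
```

A correlation close to 1 would suggest that a linear model is a reasonable first candidate.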

In [5]:
import pandas as pd

from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor as xgbr
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
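# GaussianNB and MultinomialNB are classifiers, not regressors; they are included only for comparison.
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB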

from sklearn.model_selection import RandomizedSearchCV

In [6]:
df = pd.read_csv('data_set_student.csv')
df

Unnamed: 0,Hours,Scores
0,2.5,21
1,5.1,47
2,3.2,27
3,8.5,75
4,3.5,30
5,1.5,20
6,9.2,88
7,5.5,60
8,8.3,81
9,2.7,25


In [7]:
inputs = df[['Hours']]
inputs

Unnamed: 0,Hours
0,2.5
1,5.1
2,3.2
3,8.5
4,3.5
5,1.5
6,9.2
7,5.5
8,8.3
9,2.7


In [8]:
target = df[['Scores']]
target

Unnamed: 0,Scores
0,21
1,47
2,27
3,75
4,30
5,20
6,88
7,60
8,81
9,25


#### We now define a dictionary of models and their candidate hyperparameters, which lets us search for the best combination of model and parameters for the prediction task.

In [9]:
model_params = {
    'LinearRegression':{
        'model': LinearRegression(),
        'params':{
            'fit_intercept': [False, True],
            'normalize': [False, True],
            'copy_X': [False, True]
        }
    },
    'XGBoostRegressor':{
        'model': xgbr(),
        'params':{
             "learning_rate"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30,0.35,0.4],
             "max_depth"        : [ 3, 4, 5, 6, 8, 10, 12, 15],
             "min_child_weight" : [ 1, 3, 5, 7 ],
             "gamma"            : [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
             "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ]
        }
    },
    'DecisionTreeRegressor': {
        'model': DecisionTreeRegressor(),
        'params': {
            'criterion': ['mse', 'friedman_mse', 'mae'],
            'splitter': ['best', 'random'],
            'max_features': ['auto', 'sqrt', 'log2']
        }
    },
    'RandomForestRegressor': {
        'model': RandomForestRegressor(),
        'params': {
            'n_estimators': [10,20,50,100,150,200],
            'criterion': ['mse', 'mae'],
            'max_features': ['auto', 'sqrt', 'log2']
        }
    },
    'GausssianNB': {
        'model': GaussianNB(),
        'params': {
            'var_smoothing': [1e-09, 1e-10, 1e-11, 1e-12]
        }
    },
    'MultinomialNB': {
        'model': MultinomialNB(),
        'params': {
            'alpha': [1,2,3,4,5,10],
            'fit_prior': ['false', 'true']  # kept as in the recorded run; the API expects booleans (False, True)
        }
    }
}
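
The grid above matches the library versions used for this run. As a hedged sketch, assuming a more recent scikit-learn release (roughly 1.3 or newer), the scikit-learn regressors could be configured as follows: `normalize` has been removed from `LinearRegression`, the criteria `'mse'` and `'mae'` are now called `'squared_error'` and `'absolute_error'`, and `max_features='auto'` is no longer accepted. The name `model_params_new` is hypothetical, and XGBoost and the two Naive Bayes classifiers are left out to keep the example short.

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Hypothetical updated grid; values mirror the original where they are still valid.
model_params_new = {
    'LinearRegression': {
        'model': LinearRegression(),
        'params': {
            'fit_intercept': [False, True],
            'copy_X': [False, True],
        },
    },
    'DecisionTreeRegressor': {
        'model': DecisionTreeRegressor(),
        'params': {
            'criterion': ['squared_error', 'friedman_mse', 'absolute_error'],
            'splitter': ['best', 'random'],
            # None means "use all features", which is what 'auto' meant for this regressor.
            'max_features': [None, 'sqrt', 'log2'],
        },
    },
    'RandomForestRegressor': {
        'model': RandomForestRegressor(),
        'params': {
            'n_estimators': [10, 20, 50, 100, 150, 200],
            'criterion': ['squared_error', 'absolute_error'],
            # 1.0 means "use all features", which is what 'auto' meant for this regressor.
            'max_features': [1.0, 'sqrt', 'log2'],
        },
    },
}

# Every key in a grid must be a real constructor parameter of its estimator.
for name, mp in model_params_new.items():
    valid = set(mp['model'].get_params())
    unknown = set(mp['params']) - valid
    print(name, 'unknown parameters:', sorted(unknown))
```

The final loop is a cheap guard: it reports any grid key that the installed estimator does not recognize before a search is started.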

In [13]:
from datetime import datetime


def timer(start_time=None):
    """Return the current time when called without an argument;
    otherwise print the time elapsed since start_time."""
    if start_time is None:
        return datetime.now()
    thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
    tmin, tsec = divmod(temp_sec, 60)
    print('\n Time taken: %i hours %i minutes and %s seconds.' % (thour, tmin, round(tsec, 2)))


scores_rscv = []
for model_name, mp in model_params.items():
    # Sample 3 parameter combinations per model and score each with 3-fold cross-validation.
    rscv_clf = RandomizedSearchCV(mp['model'], mp['params'],
                                  cv=3, n_iter=3, n_jobs=-1,
                                  verbose=3, return_train_score=False)

    start_time = timer(None)
    rscv_clf.fit(inputs, target.values.ravel())
    timer(start_time)

    # Keep the best cross-validated score and the parameters that produced it.
    scores_rscv.append({
        'Model Name': model_name,
        'Best Score': rscv_clf.best_score_,
        'Best Parameter': rscv_clf.best_params_,
    })
# pd.set_option('display.max_colwidth', None)  # uncomment to display the full parameter dictionaries
result_rscv = pd.DataFrame(scores_rscv, columns=['Model Name', 'Best Score', 'Best Parameter'])
result_rscv

Fitting 3 folds for each of 3 candidates, totalling 9 fits

 Time taken: 0 hours 0 minutes and 0.05 seconds.
Fitting 3 folds for each of 3 candidates, totalling 9 fits

 Time taken: 0 hours 0 minutes and 0.07 seconds.
Fitting 3 folds for each of 3 candidates, totalling 9 fits

 Time taken: 0 hours 0 minutes and 0.06 seconds.
Fitting 3 folds for each of 3 candidates, totalling 9 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   9 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   6 out of   9 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    0.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   9 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   6 out of   9 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    0.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   9 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   6 out of   9 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    0.0s finished




 Time taken: 0 hours 0 minutes and 0.2 seconds.


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   9 | elapsed:    0.0s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done   6 out of   9 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    0.0s finished


Fitting 3 folds for each of 3 candidates, totalling 9 fits

 Time taken: 0 hours 0 minutes and 0.05 seconds.
Fitting 3 folds for each of 3 candidates, totalling 9 fits

 Time taken: 0 hours 0 minutes and 0.03 seconds.


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   9 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   6 out of   9 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    0.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   9 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   6 out of   9 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:    0.0s finished


Unnamed: 0,Model Name,Best Score,Best Parameter
0,LinearRegression,0.924573,"{'normalize': True, 'fit_intercept': False, 'copy_X': True}"
1,XGBoostRegressor,0.872366,"{'min_child_weight': 7, 'max_depth': 3, 'learning_rate': 0.05, 'gamma': 0.3, 'colsample_bytree': 0.5}"
2,DecisionTreeRegressor,0.864808,"{'splitter': 'best', 'max_features': 'log2', 'criterion': 'mae'}"
3,RandomForestRegressor,0.909704,"{'n_estimators': 100, 'max_features': 'log2', 'criterion': 'mse'}"
4,GausssianNB,0.04,{'var_smoothing': 1e-11}
5,MultinomialNB,0.12,"{'fit_prior': 'true', 'alpha': 1}"


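#### The Linear Regression model has the highest cross-validated score (about 0.925). For the four regressors this score is R², whereas GaussianNB and MultinomialNB are classifiers, so their scores are classification accuracies and are not directly comparable. We therefore train a Linear Regression model with the best parameters found and use it to make the prediction.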
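
Because `RandomizedSearchCV` was called without a `scoring` argument, each estimator's own `score` method decides what "Best Score" means. A minimal sketch, using only the ten rows displayed in this notebook (the variable names are illustrative, and the shuffled `KFold` is an assumption rather than the setting used above), makes the metric explicit with `cross_val_score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Ten (Hours, Scores) rows as displayed in this notebook; the real CSV may hold more.
X = np.array([2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 2.7]).reshape(-1, 1)
y = np.array([21, 47, 27, 75, 30, 20, 88, 60, 81, 25])

# Three folds, shuffled with a fixed seed so the split is reproducible.
cv = KFold(n_splits=3, shuffle=True, random_state=0)

# R² per fold (higher is better).
r2 = cross_val_score(LinearRegression(), X, y, cv=cv, scoring='r2')

# scikit-learn reports errors as negative scores, so flip the sign to get MAE.
mae = -cross_val_score(LinearRegression(), X, y, cv=cv, scoring='neg_mean_absolute_error')

print('R² per fold: ', np.round(r2, 3))
print('MAE per fold:', np.round(mae, 2))
```

Reporting an error metric such as MAE next to R² gives the comparison a unit (score points) that is easier to interpret.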

In [14]:
# Best parameters reported by the search; normalize is ignored when fit_intercept=False.
model = LinearRegression(normalize = True, fit_intercept = False, copy_X = True)

In [16]:
model.fit(inputs, target)

LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None, normalize=True)

In [19]:
model.score(inputs, target)  # R² on the training data itself, not on held-out data

0.9509792879125228

In [22]:
model.predict([[9.25]])

array([[94.1118779]])
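
With `fit_intercept=False` the fitted line is forced through the origin, so the model reduces to score = slope × hours, and the least-squares slope is Σxy / Σx². The sketch below works this out by hand on the ten rows displayed in this notebook. The function name is hypothetical, and because the actual CSV may contain more rows, the number it prints need not match the 94.11 above.

```python
def slope_through_origin(xs, ys):
    """Least-squares slope for the model y = slope * x (no intercept)."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / sum_xx


# Ten (Hours, Scores) rows as displayed in this notebook; the real CSV may hold more.
hours = [2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 2.7]
scores = [21, 47, 27, 75, 30, 20, 88, 60, 81, 25]

slope = slope_through_origin(hours, scores)
prediction = slope * 9.25

print('slope:', round(slope, 3))
print('predicted score for 9.25 hours:', round(prediction, 2))
```

Seen this way, the prediction is simply the slope multiplied by 9.25, which makes it easy to check the model's output by hand.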

## RESULT: 
Thus, if a student devotes 9.25 hours daily to studies, our Linear Regression model predicts a score of about 94.11%.