## Exams are one of the most important parts of every student's life.
### Each country has its own type of exam. This dataset provides information about the school, school setting, school type, classroom, teaching method, number of students, gender and lunch-benefit eligibility of each student.

#### I will analyse the data and build regression models to predict the test result

### Import necessary libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go
from plotly.offline import iplot
import missingno as msno

import tensorflow as tf
from catboost import CatBoostRegressor

### Read and explore the data 

In [None]:
data = pd.read_csv('../input/predict-test-scores-of-students/test_scores.csv')
data.head()

Missing values in data

In [None]:
msno.bar(data, figsize=(12,7))
plt.show()

We can see that the data has no missing values
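The bar chart can be cross-checked with numbers. A minimal sketch, assuming a small made-up frame (`toy` and its values are hypothetical stand-ins for `data`, with one value deliberately missing): `isna().sum()` counts the missing cells per column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; one pretest value is deliberately missing
toy = pd.DataFrame({
    'pretest': [62.0, 66.0, np.nan, 61.0],
    'posttest': [72.0, 79.0, 76.0, 77.0],
})

# Missing cells per column, then the overall total
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print('total missing:', total_missing)
```

Running the same two lines on the real `data` should give zero in every column if the chart was read correctly.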

In [None]:
data.describe()

In [None]:
data.info()

In [None]:
data.classroom.unique()

If only I knew what all these classroom codes mean...

In [None]:
data.school_type.unique()

In [None]:
data.school_setting.unique()

In [None]:
data.teaching_method.unique()

In [None]:
data.lunch.unique()

In [None]:
data.school.unique()

#### EDA

#### Dependence of the test result on gender

In [None]:
gender_to_result = data.groupby(['gender']).agg({'posttest':'mean'}).reset_index()


fig = px.bar(gender_to_result, x='gender', y='posttest',
            title='Mean test result by gender')
iplot(fig)

The mean test result is practically the same for both genders, so it does not appear to depend on gender
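Equal means can still hide different spreads, so it may help to look at the standard deviation and group size as well. A small sketch on made-up scores (`toy` is hypothetical; in the notebook the same `groupby` would run on `data`):

```python
import pandas as pd

# Hypothetical scores; the real notebook would use `data` instead
toy = pd.DataFrame({
    'gender': ['Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'posttest': [70.0, 68.0, 74.0, 72.0, 66.0, 70.0],
})

# Mean, spread and group size for each gender
summary = toy.groupby('gender')['posttest'].agg(['mean', 'std', 'count'])

# Absolute gap between the two group means
gap = abs(summary.loc['Female', 'mean'] - summary.loc['Male', 'mean'])
print(summary)
print('gap between group means:', gap)
```

In this toy frame both groups share the same mean while the spreads differ, which a bar chart of means alone would not show.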

#### Dependence of the test result on the type of school

In [None]:
school_type_to_result = data.groupby(['school_type']).agg({'posttest':'mean'}).reset_index()


fig = px.bar(school_type_to_result, x='school_type', y='posttest',
            title='Mean test result by type of school')
iplot(fig)

The mean result of students in non-public schools is almost 12 points higher than in public schools

#### Dependence of the result on the location of the school

In [None]:
school_setting = data.school_setting.value_counts()

# The pie shows how many students fall into each setting, not their results
fig = px.pie(school_setting, values=school_setting.values, names=school_setting.index,
            title='Share of students by school setting')
iplot(fig)

school_setting_to_result = data.groupby(['school_setting']).agg({'posttest':'mean'}).reset_index()

fig2 = px.bar(school_setting_to_result, x='school_setting', y='posttest')

iplot(fig2)

As we can see, many students attend urban schools, yet their mean result is the worst. Students from suburban schools have the best result, followed by students from rural schools

#### Dependence of the exam result on the number of students in the class

In [None]:
number_of_students_to_results = data.groupby(['n_student']).agg({'posttest':'mean'}).reset_index()

fig = px.scatter(number_of_students_to_results, x='n_student', y='posttest',
                 size='posttest', color='n_student', size_max=60,
                 title='Mean exam result by number of students in the class')

iplot(fig)

As we can see, the mean exam result tends to fall as the number of students in the class grows
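A scatter plot shows the direction of the trend, and a correlation coefficient puts a number on it. A minimal sketch with made-up class sizes and scores (both arrays are hypothetical, not taken from the dataset):

```python
import numpy as np

# Hypothetical class sizes and scores, chosen so that scores fall as classes grow
n_student = np.array([15.0, 18.0, 21.0, 24.0, 27.0, 30.0])
posttest = np.array([80.0, 77.0, 75.0, 70.0, 66.0, 61.0])

# Pearson correlation: -1 is a perfect decreasing line, 0 is no linear relation
r = np.corrcoef(n_student, posttest)[0, 1]
print(round(r, 3))
```

On the real data, `data['n_student'].corr(data['posttest'])` would give the same kind of number at the student level rather than on grouped means.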

In [None]:
data.head()

### Correlation heatmap

In [None]:
for_corr = data.drop(['school','student_id','gender','classroom'], axis=1)

for_replace = {'school_setting':{'Urban':0, 'Suburban':1, 'Rural':2},
               'school_type':{'Public':0, 'Non-public':1},
               'teaching_method':{'Standard':0, 'Experimental':1},
               'lunch':{'Does not qualify':0, 'Qualifies for reduced/free lunch':1}}

for_corr = for_corr.replace(for_replace)
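Mapping `school_setting` to 0, 1, 2 implies an order (Urban < Suburban < Rural) that the data may not have. One alternative is one-hot encoding; the sketch below uses a made-up frame (`toy` is hypothetical) and `pd.get_dummies`:

```python
import pandas as pd

# Hypothetical rows; the real notebook would apply this to `data`
toy = pd.DataFrame({
    'school_setting': ['Urban', 'Suburban', 'Rural', 'Urban'],
    'pretest': [55.0, 62.0, 58.0, 49.0],
})

# One 0/1 column per category, so no order between settings is implied
encoded = pd.get_dummies(toy, columns=['school_setting'], dtype=int)
print(encoded.columns.tolist())
```

Tree-based models usually cope with either encoding, so this matters most for the correlation heatmap and for linear models.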

In [None]:
for_corr.head()

In [None]:
corr = for_corr.corr()

plt.figure(figsize=(14,8))
sns.heatmap(corr, 
        xticklabels=corr.columns,
        yticklabels=corr.columns)
plt.show()

### Prepare for training

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [None]:
X = for_corr.drop('posttest', axis=1)
y = for_corr['posttest'].values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=142)
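The training cells pass `(X_test, y_test)` as `eval_set`, so early stopping sees the same rows that are later used for scoring. A common way around this is a separate validation set. The sketch below shows the idea on random placeholder arrays (`X_toy`, `y_toy` and the split sizes are my assumptions, not part of the original notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and target with 100 rows
rng = np.random.default_rng(142)
X_toy = rng.normal(size=(100, 3))
y_toy = rng.normal(size=100)

# First hold out the final test set, then carve a validation set out of the rest
X_rest, X_te, y_rest, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=142)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=142)

print(len(X_tr), len(X_val), len(X_te))
```

With such a split, `eval_set` would receive `(X_val, y_val)` and the test rows would only be used for the final RMSE.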

### Training

#### CatBoost

In [None]:
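cat_model = CatBoostRegressor(loss_function='RMSE', random_state=142, verbose=50)

# Note: the test set also drives early stopping here, so the RMSE below is optimistic
cat_model.fit(X_train, y_train, early_stopping_rounds=100, eval_set=[(X_test, y_test)])
pred = cat_model.predict(X_test)
# mean_squared_error expects (y_true, y_pred)
print(np.sqrt(mean_squared_error(y_test, pred)))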
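An RMSE is easier to judge next to a trivial baseline. A minimal sketch with made-up numbers (both arrays are hypothetical): predict the training mean for every test row and compute the RMSE of that guess.

```python
import numpy as np

# Hypothetical targets; in the notebook these would be y_train and y_test
y_train_toy = np.array([60.0, 70.0, 80.0])
y_test_toy = np.array([66.0, 74.0])

# Baseline: always predict the mean of the training targets
baseline_pred = np.full_like(y_test_toy, y_train_toy.mean())
baseline_rmse = float(np.sqrt(np.mean((y_test_toy - baseline_pred) ** 2)))
print(baseline_rmse)
```

A model is only useful if its RMSE is clearly below this baseline.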

#### LGBM

In [None]:
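import lightgbm as lgbm

# LightGBM names its loss via `objective`; `loss_function` is a CatBoost parameter
lgb_model = lgbm.LGBMRegressor(objective='regression', random_state=142)

# LightGBM 4.x replaced the early_stopping_rounds/verbose fit arguments with callbacks
lgb_model.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric='rmse',
        callbacks=[lgbm.early_stopping(stopping_rounds=100),
                   lgbm.log_evaluation(period=20)])
pred = lgb_model.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, pred)))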

##### The CatBoost model does better than LGBM, but the dataset is so small that the models are probably overfitting.
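With a small dataset, a single train/test split can be lucky or unlucky. K-fold cross-validation gives a steadier estimate. The sketch below runs on synthetic data and uses `LinearRegression` only as a lightweight stand-in for the boosted models (the data, the `+ 12` offset and the noise level are all my assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical data: posttest is roughly pretest plus an offset and noise
rng = np.random.default_rng(142)
pretest = rng.uniform(30, 90, size=200)
posttest = pretest + 12 + rng.normal(scale=3, size=200)
X_toy = pretest.reshape(-1, 1)

# LinearRegression is only a stand-in; any scikit-learn style regressor fits here
cv = KFold(n_splits=5, shuffle=True, random_state=142)
scores = cross_val_score(LinearRegression(), X_toy, posttest,
                         scoring='neg_root_mean_squared_error', cv=cv)

# The scorer returns negative RMSE, so flip the sign
rmse_per_fold = -scores
print(rmse_per_fold.mean(), rmse_per_fold.std())
```

A large spread between folds would support the suspicion that the result depends heavily on which rows land in the test set.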

In [None]:
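# Copy the row so that the edits below do not touch X_test
my_test = X_test.iloc[0].copy()
my_test['school_setting'] = 0.    # Urban
my_test['school_type'] = 0.       # Public
my_test['teaching_method'] = 1.   # Experimental
my_test['n_student'] = 22.
my_test['lunch'] = 1.             # Qualifies for reduced/free lunch
my_test['pretest'] = 84.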

In [None]:
cat_model.predict(my_test)

In [None]:
lgb_model.predict(my_test.values.reshape(1, -1))

### Thanks for reading!