## **1. Importing the necessary libraries and the dataset:**

In [None]:
#Data Structure libraries
import numpy as np 
import pandas as pd

#Plotting Libraries
import matplotlib.pyplot as plt
import seaborn as sns 

#Machine Learning Libraries
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LinearRegression

%matplotlib inline

In [None]:
#Importing the dataset from Kaggle's input directory
dataset = pd.read_csv('../input/predict-test-scores-of-students/test_scores.csv')
dataset.head()

In [None]:
dataset.describe()

In [None]:
#Separating the parameters and the result column names
columns = dataset.columns.values.tolist()
columns.remove('posttest')
columns

In [None]:
#Separating the parameters and the result values
X = dataset[columns]
y = dataset['posttest']

In [None]:
#Encoding the categorical data in the dataset
x = pd.get_dummies(X)
x.head()
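
To make the encoding step concrete, here is a minimal sketch of what `pd.get_dummies` does on a small, made-up frame. The toy column names and values are assumptions chosen for illustration and are not taken from the dataset itself.

```python
import pandas as pd

# Hypothetical toy data (illustrative only, not rows from the real dataset)
toy = pd.DataFrame({
    "teaching_method": ["Standard", "Experimental", "Standard"],
    "pretest": [62, 71, 55],
})

# Each categorical column becomes one indicator column per category;
# numeric columns are passed through unchanged.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())

# drop_first=True drops one indicator per category, which removes the
# redundant column that is fully determined by the others.
encoded_reduced = pd.get_dummies(toy, drop_first=True)
print(encoded_reduced.columns.tolist())
```

Whether to use `drop_first=True` is a modelling choice; the notebook keeps every indicator column.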

## **2. Train-Test Split**

#### After evaluating the metrics of models trained with various *test_size* values ranging from 0.1 to 0.4, I found that 0.15 gave one of the lowest errors. I therefore split the data into training and testing sets in an 85:15 ratio.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.15)
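
A comparison of *test_size* values like the one described above could be sketched as follows. This is not the exact procedure used for this notebook: it runs on synthetic data (`X_demo`, `y_demo` are made up), and the helper `mean_rmse` is a hypothetical name. It averages the error over several seeded splits, because a single random split can make one *test_size* look better purely by chance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): a linear signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([4.0, -2.0, 1.0]) + 67 + rng.normal(scale=3.0, size=300)

def mean_rmse(test_size, n_repeats=20):
    """Average test RMSE over several seeded train/test splits."""
    scores = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_demo, y_demo, test_size=test_size, random_state=seed
        )
        model = LinearRegression().fit(X_tr, y_tr)
        scores.append(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))
    return float(np.mean(scores))

rmse_by_size = {size: mean_rmse(size) for size in (0.1, 0.15, 0.25, 0.4)}
for size, rmse in rmse_by_size.items():
    print(f"test_size={size:.2f}  mean RMSE={rmse:.2f}")
```

Passing `random_state` to `train_test_split` also makes a single split reproducible between runs.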

## **3. Training the model:**

#### We are using Multiple Linear Regression on this dataset to predict the *Post Test Scores* of the students. 

In [None]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)
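
As a rough illustration of what fitting a multiple linear regression means, the sketch below checks that `LinearRegression` recovers the same intercept and coefficients as an ordinary least-squares solve on a design matrix with an added intercept column. The data is synthetic and the variable names are assumptions for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): y = 3*x0 - 1.5*x1 + 10 + small noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + 10.0 + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X_demo, y_demo)

# Ordinary least squares on [1, x0, x1]: the first coefficient is the intercept
design = np.column_stack([np.ones(len(X_demo)), X_demo])
coef, *_ = np.linalg.lstsq(design, y_demo, rcond=None)

print("sklearn:", model.intercept_, model.coef_)
print("lstsq:  ", coef[0], coef[1:])
```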

In [None]:
#Making predictions using our model
y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred)

## **4. Evaluating the results:**

#### We compare the actual and predicted values using various plots and metrics to evaluate the predictions. 

In [None]:
result = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
result.head()

In [None]:
#Absolute percentage difference between the predicted and actual scores
result.insert(2, "Percent Difference", ((result['Predicted'] - result['Actual']).abs() * 100 / result['Actual']).round(), True)
result.head()
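
The percent difference column can be illustrated on a few made-up values. The numbers below are assumptions for demonstration and are not results from the model; the last line shows one way to measure what share of predictions fall within 5% of the actual score.

```python
import pandas as pd

# Hypothetical actual and predicted scores (illustrative only)
toy = pd.DataFrame({
    "Actual": [60, 70, 80, 50, 40],
    "Predicted": [62, 69, 77, 50, 44],
})

# Absolute difference as a percentage of the actual score, rounded
toy["Percent Difference"] = (
    (toy["Predicted"] - toy["Actual"]).abs() * 100 / toy["Actual"]
).round()

# Fraction of rows whose percent difference is at most 5
share_within_5 = (toy["Percent Difference"] <= 5).mean()
print(toy)
print(share_within_5)
```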

In [None]:
#Histogram
result['Percent Difference'].hist()
plt.xlabel('% Difference')
plt.show()

#Boxplot (the label is set after plotting so seaborn does not overwrite it)
plt.figure()
sns.boxplot(x=result['Percent Difference'])
plt.xlabel('% Difference')
plt.show()

#Density Plot (fill=True replaces the deprecated shade=True)
plt.figure()
sns.kdeplot(result['Percent Difference'], fill=True)
plt.xlabel('% Difference')
plt.show()

In [None]:
#For test_size = 0.15
print('Max Error:', metrics.max_error(y_test, y_pred))
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
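
For readers unfamiliar with these metrics, the sketch below computes each of them by hand with NumPy on a handful of made-up values and checks that they agree with `sklearn.metrics`. It also expresses the RMSE as a percentage of the mean actual value. The numbers are assumptions for illustration, not output from this notebook.

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual and predicted scores (illustrative only)
actual = np.array([60.0, 70.0, 80.0, 50.0, 40.0])
predicted = np.array([62.0, 69.0, 77.0, 50.0, 44.0])
errors = predicted - actual

max_err = np.max(np.abs(errors))   # largest single miss
mae = np.mean(np.abs(errors))      # average size of a miss
mse = np.mean(errors ** 2)         # average squared miss
rmse = np.sqrt(mse)                # back in the units of the score

# RMSE relative to the mean actual score, as a percentage
rmse_pct_of_mean = 100 * rmse / actual.mean()

print(max_err, mae, mse, rmse, rmse_pct_of_mean)
```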

## **5. Conclusion**

#### As we can see, the Root Mean Square Error is just below 3, which is less than 5% of the mean of y (approximately 67). The plots above also show that most of the percentage differences between the Actual and Predicted values are below 5%.

#### **Thus, we have successfully implemented linear regression to predict the test scores of students based on 10 unique parameters.**