## Importing Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split 

%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Loading the Dataset

In [None]:
path = "/kaggle/input/tsf-datasets/student_scores.csv"
data = pd.read_csv(path)

In [None]:
data.head()

## Understanding the Data

In [None]:
data.shape

In [None]:
data.columns

In [None]:
data.info()

In [None]:
data.describe()

The key takeaways are:
- Max score: 95, min score: 17, average score: 51.48
- On average, students study around 5 hours

In [None]:
data.isna().sum()

There are no null values in the dataset.

In [None]:
data.corr()

`data.corr()` returns the pairwise correlation between Hours and Scores (Pearson's coefficient by default). The closer the value is to 1, the stronger the positive linear relationship, so it is clearly visible that the two columns are highly correlated.
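
As a rough illustration of what the correlation coefficient measures, the sketch below computes Pearson's r by hand and checks it against `np.corrcoef`. The arrays `hours_demo` and `scores_demo` are made-up values, not the notebook's dataset, and `pearson_r` is a hypothetical helper written only for this example.

```python
import numpy as np

# Hypothetical toy values (NOT the notebook's dataset)
hours_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores_demo = np.array([12.0, 25.0, 31.0, 44.0, 50.0])


def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return float(numerator / denominator)


r_manual = pearson_r(hours_demo, scores_demo)
r_numpy = float(np.corrcoef(hours_demo, scores_demo)[0, 1])

# Both values should agree up to floating-point error
print(r_manual, r_numpy)
```

A value near 1 means the points lie close to an upward-sloping straight line; a value near -1 would mean a downward-sloping one.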

## Plotting the Data

In [None]:
px.scatter(data,x="Hours",y="Scores",template="plotly_dark")

Plotting the data helps us decide which model is best suited.

The scatter plot shows a roughly linear relationship between Hours and Scores, so linear regression is a suitable model for this data.

In [None]:
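x = data.iloc[:, :-1].values  # feature matrix (every column except the last), kept 2D for scikit-learn
y = data.iloc[:, 1].values    # target vector (the Scores column)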

## Splitting Data

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2, random_state=0)
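
To make the effect of `test_size` and `random_state` concrete, here is a minimal sketch on made-up data. The arrays `x_demo` and `y_demo` and the ten-row size are assumptions for illustration, not the notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, one feature column (NOT the notebook's dataset)
x_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2.0 * x_demo.ravel() + 1.0

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
xa_train, xa_test, ya_train, ya_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)

# Repeating the call with the same random_state reproduces the same split
xb_train, xb_test, yb_train, yb_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)

print(xa_train.shape, xa_test.shape)
print(np.array_equal(xa_test, xb_test))
```

Fixing `random_state` keeps the evaluation numbers reproducible between runs; changing it gives a different held-out set and, on a small dataset, noticeably different metrics.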

In [None]:
reg = LinearRegression()    # Creating a linear regression object
reg.fit(x_train,y_train)
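
After fitting, the learned line can be read from the `coef_` and `intercept_` attributes. The sketch below fits a separate model on made-up, noiseless data (generated from score = 2 * hours + 1, an assumption for illustration) and compares the result with the closed-form least-squares solution for a single feature. The names `demo_reg`, `hours_demo`, and `scores_demo` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noiseless data generated from score = 2 * hours + 1
hours_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
scores_demo = 2.0 * hours_demo.ravel() + 1.0

demo_reg = LinearRegression()
demo_reg.fit(hours_demo, scores_demo)

slope = float(demo_reg.coef_[0])        # one coefficient per feature
intercept = float(demo_reg.intercept_)  # value predicted at hours = 0

# Closed-form ordinary least squares for a single feature:
# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
h = hours_demo.ravel()
h_centered = h - h.mean()
s_centered = scores_demo - scores_demo.mean()
slope_manual = float((h_centered * s_centered).sum() / (h_centered ** 2).sum())
intercept_manual = float(scores_demo.mean() - slope_manual * h.mean())

print(slope, intercept)
print(slope_manual, intercept_manual)
```

On the notebook's own model, the same attributes (`reg.coef_` and `reg.intercept_`) give the estimated change in score per extra hour of study and the baseline score.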

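Plotting the data with a regression line. Note that Plotly's `trendline='ols'` fits its own ordinary least squares line to the full dataset (it requires `statsmodels`), so it is close to, but not identical to, the line learned by `reg` on the training split.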

In [None]:
px.scatter(data,x="Hours",y="Scores",trendline='ols',template="plotly_dark")

In [None]:
reg.score(x_train, y_train)  # R² of the model on the training data

## Prediction

In [None]:
y_preds = reg.predict(x_test)

In [None]:
compare = pd.DataFrame({"Actual": y_test, "Predicted": y_preds})
compare  # actual test-set values (y_test) next to the model's predictions (y_preds)

## Evaluating Our Model

In [None]:
from sklearn.metrics import mean_absolute_error,r2_score
mean_absolute_error(y_test,y_preds)

In [None]:
r2_score(y_test,y_preds)     
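
For reference, both metrics are simple to compute by hand. The sketch below uses made-up values (`y_true_demo` and `y_pred_demo` are hypothetical, not the notebook's results) and checks the manual formulas against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical actual and predicted scores (NOT the notebook's results)
y_true_demo = np.array([20.0, 30.0, 60.0, 70.0])
y_pred_demo = np.array([22.0, 28.0, 57.0, 75.0])

# Mean absolute error: average size of the prediction errors
mae_manual = float(np.mean(np.abs(y_true_demo - y_pred_demo)))

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = float(np.sum((y_true_demo - y_pred_demo) ** 2))
ss_tot = float(np.sum((y_true_demo - y_true_demo.mean()) ** 2))
r2_manual = 1.0 - ss_res / ss_tot

mae_sklearn = float(mean_absolute_error(y_true_demo, y_pred_demo))
r2_sklearn = float(r2_score(y_true_demo, y_pred_demo))

print(mae_manual, mae_sklearn)
print(r2_manual, r2_sklearn)
```

The mean absolute error is in the same units as the target (score points), while R² is unitless: 1 means perfect predictions and 0 means no better than always predicting the mean.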

In [None]:
x_axis = range(len(y_test))

In [None]:
plt.plot(x_axis,y_test, label='original')
plt.plot(x_axis,y_preds, label='predicted')
plt.legend()
plt.show()

Plotting the actual and predicted values together shows how closely the predictions track the test data.

## Conclusion
We built a simple linear regression model to predict a student's score from the number of hours spent studying for the examination. The mean absolute error on the test set was low and the R² score was high, which indicates that the model fits this data well.

Please check out my other notebooks:
- [Beginner : Simple Linear Regression](https://www.kaggle.com/amartyanambiar/beginner-simple-linear-regression)
- [Beginner : Multiple Linear Regression](https://www.kaggle.com/amartyanambiar/beginner-multiple-linear-regression)