# Linear Regression to predict the average score 

My first kernel. I'm trying to learn by practicing with some simple models, in this case Linear Regression.


In [None]:
# Import stuff

import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.metrics import mean_squared_error
%matplotlib inline
# Input data files are available in the "../input/" directory.
print(os.listdir("../input"))

Let's first check the shape of our data and how it is distributed

In [None]:
df = pd.read_csv('../input/StudentsPerformance.csv')
df.head()

In [None]:
df.info()

In [None]:
df.describe()

We can see that only the scores are numerical values; the rest of the columns are categorical. Since our model only accepts numerical values, we will have to deal with this later.
For now, let's create a new feature that is the average of all the scores, plus a grouping of those averages into 10-point bins, and see how they are distributed.

In [None]:
df["average score"] = (df["math score"] + df["reading score"] + df["writing score"]) /3
df['average score group'] = pd.cut(df["average score"], bins=[g for g in range(0, 101, 10)], include_lowest=True)


In [None]:
df.hist(bins=20, figsize=(12,8))

Let's check our other attributes and their distribution

In [None]:
plt.figure(figsize=(12, 8))
p = sns.countplot(x='parental level of education', data = df, palette='deep')

In [None]:
plt.figure(figsize=(12, 8))


p = sns.countplot(x='parental level of education', data = df, hue='average score group', palette="deep")

We can see that we don't have much data about children whose parents have a master's or bachelor's degree compared to the other groups.
We can also see that there doesn't seem to be a clear correlation between the students' grades and their parents' education.

In [None]:
plt.figure(figsize=(12, 8))
p = sns.countplot(x='race/ethnicity', data = df, palette='deep')

In [None]:
plt.figure(figsize=(12, 8))


p = sns.countplot(x='race/ethnicity', data = df, hue='average score group', palette="deep")

Again, we cannot see a clear difference in grades based on the race/ethnicity group that the student belongs to.

In [None]:
plt.figure(figsize=(12, 8))
p = sns.countplot(x='lunch', data = df, palette='deep')

In [None]:
plt.figure(figsize=(12, 8))


p = sns.countplot(x='lunch', data = df, hue='average score group', palette="deep")

With the lunch feature we can start to see a small distinction: students who have a free/reduced lunch tend to have slightly lower grades. Let's analyse this observation further.

In [None]:
fr_lunch = df[df['lunch']=='free/reduced']
std_lunch = df[df['lunch']=='standard']

print("Free/Reduced lunch mean",fr_lunch['average score'].mean())
print("Standard lunch mean",std_lunch['average score'].mean())

We can indeed see a small difference between the means of the two groups, which our algorithm can use to improve its predictions.
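
The same comparison can also be written as a single `groupby`, which scales to any number of categories. This is a minimal sketch on a made-up frame (`toy` and its values are hypothetical); in the notebook one would call it on `df` instead.

```python
import pandas as pd

# Hypothetical miniature of the dataset: only the two columns we need
toy = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard", "free/reduced"],
    "average score": [70.0, 60.0, 80.0, 64.0],
})

# Mean and group size of the average score per lunch category
summary = toy.groupby("lunch")["average score"].agg(["mean", "count"])
print(summary)

# Difference between the two group means
gap = summary.loc["standard", "mean"] - summary.loc["free/reduced", "mean"]
print("Gap:", gap)
```

Reporting the group size next to the mean makes it easier to tell whether a difference rests on enough samples.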

As we said before, *scikit-learn*'s **Linear Regression** will not accept non-numeric features, so we will now convert them into a **One Hot Encoding** version that is better suited to the model.

In [None]:
new_df = df.copy()

one_hot = pd.get_dummies(df['gender'], prefix='gender')
new_df = new_df.join(one_hot)
one_hot = pd.get_dummies(df['race/ethnicity'], prefix='race/ethnicity')
new_df = new_df.join(one_hot)
one_hot = pd.get_dummies(df['parental level of education'], prefix='parental level of education')
new_df = new_df.join(one_hot)
one_hot = pd.get_dummies(df['lunch'], prefix='lunch')
new_df = new_df.join(one_hot)
one_hot = pd.get_dummies(df['test preparation course'], prefix='test preparation course')
new_df = new_df.join(one_hot)

new_df.drop(["reading score", "writing score", "math score", "gender", "race/ethnicity", "parental level of education", "test preparation course","lunch", "average score group"], axis=1, inplace=True)

new_df.head()
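
`pd.get_dummies` can also encode several columns in one call through its `columns` argument, producing the same kind of `<column>_<value>` names as the `prefix` calls above. A minimal sketch on a hypothetical frame (`toy` is made up for illustration):

```python
import pandas as pd

# Hypothetical frame with two categorical columns and one numeric column
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "lunch": ["standard", "free/reduced", "standard"],
    "average score": [72.0, 65.0, 88.0],
})

# Encode every listed column; the original categorical columns are dropped automatically
encoded = pd.get_dummies(toy, columns=["gender", "lunch"])
print(encoded.columns.tolist())
```

Passing `drop_first=True` would additionally drop one level per feature, which removes the redundant column that each one-hot group otherwise carries.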

Now we separate the data into two groups: the train data, which we will use to train our model, and the test data, on which we will test its performance.
We will also split both groups into the actual features and the label we are trying to predict, in this case the average score.

In [None]:
train_set, test_set = train_test_split(new_df, test_size=0.20, random_state=21)

train_X = train_set.drop('average score', axis=1)
train_Y = train_set['average score'].copy()

test_X = test_set.drop('average score', axis=1)
test_Y = test_set['average score'].copy()

Finally we can start training the model!
We will be using the Linear Regression algorithm inside a Cross Validation function with 10 folds.

In [None]:

lin_reg = LinearRegression()

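# No `scoring` argument is given, so 'test_score' holds LinearRegression's default score, R^2 (higher is better)
results = cross_validate(lin_reg, train_X, train_Y, cv=10, return_estimator=True)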

scores = results['test_score']
print("Scores:",scores)
print("Mean:", scores.mean())
print("Standard deviation:", scores.std())

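By analysing the results we can see that our model has a mean score of about 0.2, which isn't very good. This score is R², the default metric for `LinearRegression`, so the model explains only about 20% of the variance in the average score.
We can also see that the scores vary considerably between the models that we obtained with the Cross Validation.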
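
When no `scoring` argument is given, `cross_validate` uses the estimator's own `score` method, which for `LinearRegression` is R². To also report an error in the units of the target, scorers can be requested explicitly. The sketch below uses synthetic data (`X`, `y` and the noise level are assumptions for illustration), not the student dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic regression problem: y depends linearly on 3 features plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=200)

results = cross_validate(
    LinearRegression(), X, y, cv=10,
    scoring=("r2", "neg_root_mean_squared_error"),
)

r2_scores = results["test_r2"]
# scikit-learn negates errors so that "higher is better" holds for every scorer
rmse_scores = -results["test_neg_root_mean_squared_error"]

print("Mean R^2:", r2_scores.mean())
print("Mean RMSE:", rmse_scores.mean())
```

With several scorers, the result keys are named `test_<scorer>` instead of the single `test_score`.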

Of course, all of this is based on the validation folds taken from the training data. Let's see how our best model performs against our test set.

In [None]:
# Find the best model: the scores are R^2 values, so the highest one is the best
best = np.argmax(scores)
best_estimator = results['estimator'][best]

# Evaluate that estimator on the held-out test set
final_predictions = best_estimator.predict(test_X)
final_mse = mean_squared_error(test_Y, final_predictions)
final_rmse = np.sqrt(final_mse)
print(final_rmse)



So, as we can see, our estimator gives us a final error (RMSE) of around 12.6 points on the test set.
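
To judge whether an error like this is good, it can help to compare it with a trivial baseline that always predicts the mean of the training labels. A minimal sketch with made-up numbers (`y_train`, `y_test` and `model_preds` are hypothetical, not the notebook's values):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical labels and predictions
y_train = [60.0, 70.0, 80.0]
y_test = [65.0, 75.0]
model_preds = [68.0, 71.0]

# Baseline: predict the training mean for every test sample
baseline_preds = np.full(len(y_test), np.mean(y_train))

model_rmse = rmse(y_test, model_preds)
baseline_rmse = rmse(y_test, baseline_preds)
print("Model RMSE:", model_rmse)
print("Baseline RMSE:", baseline_rmse)
```

In the notebook, the same comparison could be made by passing `test_Y` and `final_predictions` to such a function, with the baseline built from `train_Y.mean()`.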