# **Machine Learning: Linear Regression**

## Objectives

* Build a linear regression model
* Build a pipeline model which includes scaling
* Consider **hypothesis 4** by comparing the models
* Use the model to evaluate **hypothesis 2**

## Inputs

* Cleaned CSV file "academic_performance_cleaned.csv" 

## Outputs

* A pipeline which performs linear regression on the dataset. I hope to use this pipeline in a Streamlit app.

---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\tb975\\OneDrive\\Documents\\vs_code_projects\\Student-Academic-Performance\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\tb975\\OneDrive\\Documents\\vs_code_projects\\Student-Academic-Performance'
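Re-running the cell above moves up one more level each time. A small helper can guard against that by only changing directory when the current folder is the notebooks folder. This is a sketch rather than part of the original notebook: `move_to_project_root` and its `notebooks_folder` parameter are hypothetical names.

```python
import os
from pathlib import Path


def move_to_project_root(notebooks_folder: str = "jupyter_notebooks") -> Path:
    """Move to the parent directory only if the cwd is the notebooks folder."""
    cwd = Path.cwd()
    if cwd.name == notebooks_folder:
        os.chdir(cwd.parent)
    # Return the working directory after the (possible) change
    return Path.cwd()
```

Calling `move_to_project_root()` a second time leaves the working directory unchanged, because the project root is not named `jupyter_notebooks`.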

# Load Data and Prepare for Linear Regression

The first step is to import packages. This time I also import the tools I need from sklearn so that I can perform linear regression.

In [6]:
#import data manipulation and visualisation packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split 
#set seaborn style so plots look nice
sns.set_style("whitegrid")



Load dataset and display first 10 rows

In [10]:
df = pd.read_csv("data/academic_performance_cleaned.csv")
#display first 10 rows of the dataset
df.head(10)

Unnamed: 0,Attendance (%),Internal Test 1 (out of 40),Internal Test 2 (out of 40),Assignment Score (out of 10),Daily Study Hours,Final Exam Marks (out of 100),Average Test Score,Study Group
0,84,30,36,7,3,72,33.0,low
1,91,24,38,6,3,56,31.0,low
2,73,29,26,7,3,56,27.5,low
3,80,36,35,7,3,74,35.5,low
4,84,31,37,8,3,66,34.0,low
5,100,34,34,7,3,79,34.0,low
6,96,40,36,8,3,83,38.0,low
7,83,39,37,7,3,77,38.0,low
8,91,30,37,8,2,71,33.5,low
9,87,27,37,8,3,61,32.0,low


When the CSV is saved and reloaded, the integer columns revert to int64, so I convert them from int64 to int8 to save memory.

In [None]:

#change datatypes to save memory
df = df.astype({col: 'int8' for col in df.select_dtypes('int64').columns})
#display datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype   
---  ------                         --------------  -----   
 0   Attendance (%)                 2000 non-null   int8    
 1   Internal Test 1 (out of 40)    2000 non-null   int8    
 2   Internal Test 2 (out of 40)    2000 non-null   int8    
 3   Assignment Score (out of 10)   2000 non-null   int8    
 4   Daily Study Hours              2000 non-null   int8    
 5   Final Exam Marks (out of 100)  2000 non-null   int8    
 6   Average Test Score             2000 non-null   float64 
 7   Study Group                    2000 non-null   category
dtypes: category(1), float64(1), int8(6)
memory usage: 29.5 KB
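`int8` holds values from -128 to 127, so a blanket conversion is only safe when every integer column stays inside that range, as percentages and marks out of 100 do. A hedged alternative is to let pandas choose the smallest safe integer type per column with `pd.to_numeric(..., downcast="integer")`. The frame below is made-up illustration data and `downcast_ints` is a hypothetical helper name.

```python
import pandas as pd


def downcast_ints(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where each int64 column uses the smallest integer dtype that fits."""
    result = frame.copy()
    for col in result.select_dtypes("int64").columns:
        result[col] = pd.to_numeric(result[col], downcast="integer")
    return result


# Illustrative values only, not taken from the dataset
demo = pd.DataFrame({"marks": [72, 56, 100], "big": [10, 200, 3000]})
small = downcast_ints(demo)
print(small.dtypes)
```

A column whose values exceed the `int8` range is given a wider type instead of silently overflowing.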


### Linear Regression

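Linear regression is used to model the relationship between a dependent variable and one or more independent variables. It suits this dataset because the target, final exam marks, is a continuous numerical value and the features used to predict it are numerical too.

Linear regression requires numerical data only. Once fitted, it is used for prediction: given the feature values for a student, it estimates that student's final exam mark.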

Before I perform linear regression I need to clean the data. The Study Group column, which was generated for statistical analysis, is categorical, so the regression algorithm cannot use it without one-hot encoding. The information it holds is already represented numerically by the Daily Study Hours column, so the Study Group column is redundant. Leaving it in would make the results misleading because the same metric would be counted twice.

The same applies to the two internal test score columns, which also need to be removed. They are summarised in a single column, Average Test Score, and leaving them in would again make the results of the linear regression model misleading.
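The redundancy of the test score columns can be checked directly. If Average Test Score is the mean of the two internal tests, keeping all three columns gives the model perfectly collinear features. A minimal sketch using the first five rows shown by `df.head(10)` earlier; the shortened column names are an assumption made for readability.

```python
import numpy as np
import pandas as pd

# First five rows of the table displayed earlier (short column names are an assumption)
rows = pd.DataFrame(
    {
        "test_1": [30, 24, 29, 36, 31],
        "test_2": [36, 38, 26, 35, 37],
        "average": [33.0, 31.0, 27.5, 35.5, 34.0],
    }
)

# The average column should equal the row-wise mean of the two tests
recomputed = rows[["test_1", "test_2"]].mean(axis=1)
is_redundant = bool(np.allclose(recomputed, rows["average"]))
print(is_redundant)
```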

### Small data cleaning step

In [15]:
df_reg = df.drop(['Internal Test 1 (out of 40)',
       'Internal Test 2 (out of 40)', 'Study Group'], axis=1)
df_reg.head(10)

Unnamed: 0,Attendance (%),Assignment Score (out of 10),Daily Study Hours,Final Exam Marks (out of 100),Average Test Score
0,84,7,3,72,33.0
1,91,6,3,56,31.0
2,73,7,3,56,27.5
3,80,7,3,74,35.5
4,84,8,3,66,34.0
5,100,7,3,79,34.0
6,96,8,3,83,38.0
7,83,7,3,77,38.0
8,91,8,2,71,33.5
9,87,8,3,61,32.0


# Hypothesis 4: A fully processed regression pipeline achieves better accuracy than a model without preprocessing.

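To test this hypothesis I will fit two models on the same train and test split: a baseline linear regression with no preprocessing, and a pipeline that scales the features before fitting the regression. Comparing their performance on the test set shows whether preprocessing improves accuracy.

Preprocessing means transforming the features before the model is fitted. The features here are measured on different scales (attendance as a percentage, assignment score out of 10, daily study hours as a small integer), and scaling puts them on a comparable range.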
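A comparison between a model without preprocessing and a scaled pipeline can be sketched as follows. The sketch makes several assumptions: the data is synthetic, `StandardScaler` is used as the scaling step, and all `*_demo` names are hypothetical so that they do not overwrite notebook variables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales (assumption)
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.uniform(60, 100, 200), rng.uniform(1, 5, 200)])
y_demo = 0.5 * X_demo[:, 0] + 4.0 * X_demo[:, 1] + rng.normal(0, 2, 200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=101
)

# Baseline: no preprocessing
baseline_demo = LinearRegression().fit(X_tr, y_tr)

# Pipeline: standardise the features, then fit the same estimator
pipeline_demo = make_pipeline(StandardScaler(), LinearRegression()).fit(X_tr, y_tr)

r2_baseline = r2_score(y_te, baseline_demo.predict(X_te))
r2_pipeline = r2_score(y_te, pipeline_demo.predict(X_te))
print(f"baseline R2: {r2_baseline:.4f}, pipeline R2: {r2_pipeline:.4f}")
```

Plain least squares gives the same predictions with or without standardisation, so the two scores in this sketch should agree to within floating-point error. Scaling tends to matter more for regularised models such as Ridge or Lasso, and for comparing coefficient sizes across features.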

# Split Dataset into Train and Test

This is a supervised learning task which means that the dataset can be thought of as having features and a target variable.

The target variable is a column which the model is trying to predict. In this case the target is the final exam marks.

The features are the rest of the data in the dataset. 

The model is trying to answer the question: given a set of unseen features, how accurately can the target variable be predicted?

The dataset is split into two sections, the train set and the test set. The model is trained on the train set, where it learns which features are important and contribute most to the variance in the target data. The model is then given the features of the test set, which it has not seen, and predicts the target values for them. Comparing these predictions with the actual target values of the test set shows how well the model predicts the target variable.


In [17]:
df_reg.columns.unique()

Index(['Attendance (%)', 'Assignment Score (out of 10)', 'Daily Study Hours',
       'Final Exam Marks (out of 100)', 'Average Test Score'],
      dtype='object')

In [None]:
#split the dataset: X holds the features (target column dropped) and y holds the target variable
X = df_reg.drop('Final Exam Marks (out of 100)', axis=1)
y = df_reg['Final Exam Marks (out of 100)']
#X_train and X_test are the features, y_train and y_test are the targets
#test_size=0.2 splits the dataset into 80% train and 20% test; random_state provides reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)




Inspecting X_train shows that it no longer contains the final marks column, so it holds only the features. Inspecting y_train shows that it holds only the final marks, i.e. the target variable.

In [None]:
#features only
X_train

Unnamed: 0,Attendance (%),Assignment Score (out of 10),Daily Study Hours,Average Test Score
668,73,8,2,35.0
1345,81,6,3,32.5
373,92,7,2,31.0
1388,98,7,4,30.0
132,75,8,2,31.5
...,...,...,...,...
1599,97,9,3,36.5
1862,91,7,3,30.5
1361,81,8,2,36.0
1547,94,9,3,36.0


In [None]:
#targets only
y_train

668     65
1345    59
373     70
1388    68
132     58
        ..
1599    78
1862    71
1361    63
1547    81
863     79
Name: Final Exam Marks (out of 100), Length: 1600, dtype: int64

Printing the shape of the dataframes shows that the train set now has 1600 rows (80% of 2000) and the test set has 400 rows (20%).

In [27]:
#print the shape of the train and test sets
print(
    "Train set:",
    X_train.shape,
    y_train.shape,
    "\nTest set:",
    X_test.shape,
    y_test.shape,
)

Train set: (1600, 4) (1600,) 
Test set: (400, 4) (400,)


# Baseline linear regression model (no pre-processing)

In [28]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
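Once fitted, a `LinearRegression` model stores one coefficient per feature in `coef_` and the intercept in `intercept_`. A minimal sketch of reading them back, using made-up noise-free data so the recovered values are easy to check; the `demo_*` names and the two short column names are hypothetical. In the notebook the equivalent would pair `model.coef_` with `X_train.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up, noise-free data built from known coefficients (0.5 and 3.0, intercept 10)
demo_X = pd.DataFrame(
    {"attendance": [70, 80, 90, 100, 85], "study_hours": [2, 3, 1, 4, 2]}
)
demo_y = 10 + 0.5 * demo_X["attendance"] + 3.0 * demo_X["study_hours"]

demo_model = LinearRegression().fit(demo_X, demo_y)

# Pair each coefficient with its feature name
coefficients = pd.Series(demo_model.coef_, index=demo_X.columns)
print(coefficients)
print("intercept:", demo_model.intercept_)
```

Each coefficient is the change in the predicted target for a one-unit increase in that feature, with the other features held fixed.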

In [29]:
model.predict(X_test)

array([70.72942766, 77.69489339, 59.84996204, 77.70651976, 65.65996455,
       71.10032554, 67.41854474, 80.39306091, 65.64959454, 84.47293757,
       55.99633129, 75.79763441, 58.1673226 , 73.57224715, 60.36551188,
       78.27915445, 77.50747183, 62.74717749, 69.66987363, 65.93620959,
       57.72771351, 72.02954285, 86.60241564, 65.31626465, 57.94107669,
       69.21863817, 84.6292159 , 62.66683991, 65.83181449, 56.18375285,
       82.61844818, 59.41035295, 64.00577945, 52.85729832, 67.6059663 ,
       63.20958773, 67.61759267, 53.74500241, 64.29113814, 69.81578196,
       80.69399122, 55.02954599, 71.27217549, 79.76400232, 73.47036477,
       63.564914  , 54.0303611 , 74.38401048, 60.22085991, 79.54938278,
       75.54984371, 82.98934606, 74.18621891, 76.06664992, 71.48679503,
       72.24021715, 75.9979387 , 59.35595699, 67.10078646, 86.03246984,
       84.85940706, 77.15214557, 74.58305841, 62.3075684 , 83.80708257,
       60.16772032, 69.04678822, 72.06711084, 66.33304909, 82.07
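Raw predictions are easier to judge once they are compared with the true targets. A minimal sketch of that comparison using `mean_absolute_error` and `r2_score` from `sklearn.metrics`. The arrays below are made-up values, not results from this notebook; in the notebook the inputs would be `y_test` and `model.predict(X_test)`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up targets and predictions for illustration only
actual = np.array([70, 80, 60, 75])
predicted = np.array([72.0, 78.0, 61.0, 74.0])

# Average absolute error, in the same units as the target (marks)
mae = mean_absolute_error(actual, predicted)

# Proportion of the variance in the target explained by the predictions
r2 = r2_score(actual, predicted)

print(f"MAE: {mae:.2f}, R2: {r2:.3f}")
```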

---


# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid Python until a folder name is chosen
except Exception as e:
  print(e)
