# **Machine Learning: Linear Regression**

## Objectives

* Build a linear regression model
* Build a pipeline model which includes scaling
* Consider **hypothesis 4** by comparing the models
* Use the model to evaluate **hypothesis 2**

## Inputs

* Cleaned CSV file "academic_performance_cleaned.csv" 

## Outputs

* A pipeline that performs linear regression on the dataset, intended for use in a Streamlit app

---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\tb975\\OneDrive\\Documents\\vs_code_projects\\Student-Academic-Performance\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\tb975\\OneDrive\\Documents\\vs_code_projects\\Student-Academic-Performance'
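The same parent-folder lookup can also be written with pathlib. This is only a sketch: the `project_root` helper and the example path are hypothetical names used for illustration, not part of the notebook.

```python
from pathlib import Path


def project_root(notebook_dir: Path) -> Path:
    """Return the parent of the folder that holds the notebooks."""
    return notebook_dir.parent


# Hypothetical example path, not the real project directory
example_dir = Path("Student-Academic-Performance") / "jupyter_notebooks"
root = project_root(example_dir)
print(root)
```

To change directory for real, the result could be passed to os.chdir(), for example os.chdir(project_root(Path.cwd())).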

# Load Data and Prepare for Linear Regression

The first step is to import packages. This time I also include the sklearn utilities I need to perform linear regression.

In [6]:
#import data manipulation and visualisation packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#import the sklearn helper for splitting data into train and test sets
from sklearn.model_selection import train_test_split
#set seaborn style so plots look nice
sns.set_style("whitegrid")



Load dataset and display first 10 rows

In [10]:
df = pd.read_csv("data/academic_performance_cleaned.csv")
#display first 10 rows of the dataset
df.head(10)

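| | Attendance (%) | Internal Test 1 (out of 40) | Internal Test 2 (out of 40) | Assignment Score (out of 10) | Daily Study Hours | Final Exam Marks (out of 100) | Average Test Score | Study Group |
|---|---|---|---|---|---|---|---|---|
| 0 | 84 | 30 | 36 | 7 | 3 | 72 | 33.0 | low |
| 1 | 91 | 24 | 38 | 6 | 3 | 56 | 31.0 | low |
| 2 | 73 | 29 | 26 | 7 | 3 | 56 | 27.5 | low |
| 3 | 80 | 36 | 35 | 7 | 3 | 74 | 35.5 | low |
| 4 | 84 | 31 | 37 | 8 | 3 | 66 | 34.0 | low |
| 5 | 100 | 34 | 34 | 7 | 3 | 79 | 34.0 | low |
| 6 | 96 | 40 | 36 | 8 | 3 | 83 | 38.0 | low |
| 7 | 83 | 39 | 37 | 7 | 3 | 77 | 38.0 | low |
| 8 | 91 | 30 | 37 | 8 | 2 | 71 | 33.5 | low |
| 9 | 87 | 27 | 37 | 8 | 3 | 61 | 32.0 | low |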


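Saving and reloading the CSV resets the integer columns to int64, so I will convert them from int64 to int8 to save memory. The int8 type holds values from -128 to 127, which covers every integer column here because none of their scales go above 100.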

In [None]:

#change datatypes to save memory
df = df.astype({col: 'int8' for col in df.select_dtypes('int64').columns})
#display datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype   
---  ------                         --------------  -----   
 0   Attendance (%)                 2000 non-null   int8    
 1   Internal Test 1 (out of 40)    2000 non-null   int8    
 2   Internal Test 2 (out of 40)    2000 non-null   int8    
 3   Assignment Score (out of 10)   2000 non-null   int8    
 4   Daily Study Hours              2000 non-null   int8    
 5   Final Exam Marks (out of 100)  2000 non-null   int8    
 6   Average Test Score             2000 non-null   float64 
 7   Study Group                    2000 non-null   category
dtypes: category(1), float64(1), int8(6)
memory usage: 29.5 KB
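Casting straight to int8 is only safe while every value stays between -128 and 127. As a hedged alternative, pd.to_numeric with downcast="integer" picks the smallest integer type that still fits the data. The `downcast_ints` helper and the demo frame below are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def downcast_ints(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with each int64 column shrunk to the smallest fitting integer type."""
    out = frame.copy()
    for col in out.select_dtypes("int64").columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out


# Demo data: the second column is deliberately too large for int8
demo = pd.DataFrame(
    {"Attendance (%)": [84, 91, 100], "Large Values": [1, 200, 40000]},
    dtype="int64",
)
demo_small = downcast_ints(demo)
print(demo_small.dtypes)
```

In this sketch the attendance column becomes int8, while the column holding 40000 is kept in a wider integer type.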


### Linear Regression

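Linear regression is used to model the relationship between a dependent variable and one or more independent variables. It is a natural fit for this dataset because the features are numerical and the outcome of interest, the final exam mark, is a continuous score.

Linear regression requires numerical data only. It fits a coefficient for each feature, plus an intercept, so that the sum of squared differences between the predicted and actual values is as small as possible.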
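To make the idea of fitting a line concrete, here is a minimal sketch of ordinary least squares using NumPy. The toy values are made up for illustration and do not come from the dataset. The notebook itself is expected to use sklearn rather than this hand-rolled version.

```python
import numpy as np

# Toy data that follows y = 2x + 1 exactly (illustrative values only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix: one column for the feature, one column of ones for the intercept
X = np.column_stack([x, np.ones_like(x)])

# The least-squares solution minimises the sum of squared residuals
(slope, intercept), *_ = np.linalg.lstsq(X, y, rcond=None)
print(slope, intercept)
```

Because the toy data lie exactly on a line, the fitted slope and intercept recover 2 and 1 up to floating-point error.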

Before I perform linear regression I need to prepare the data. The Study Group column, which was generated for statistical analysis, is categorical, so the regression algorithm cannot use it without one-hot encoding. The information it holds is already represented numerically by the Daily Study Hours column, so the Study Group column is redundant. Leaving it in would count the same information twice and make the results misleading.

The same applies to the two internal test score columns, which are summarised in a single column: Average Test Score. They also need to be removed, because leaving them in would again distort the results of the linear regression model.
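The redundancy can be checked directly. The sketch below uses the first three rows shown in the table above and confirms that Average Test Score is the mean of the two internal test columns. The variable names are illustrative.

```python
import pandas as pd

# First three rows as displayed earlier in this notebook
sample = pd.DataFrame({
    "Internal Test 1 (out of 40)": [30, 24, 29],
    "Internal Test 2 (out of 40)": [36, 38, 26],
    "Average Test Score": [33.0, 31.0, 27.5],
})

test_cols = ["Internal Test 1 (out of 40)", "Internal Test 2 (out of 40)"]

# Recompute the average from the two test columns
recomputed = sample[test_cols].mean(axis=1)

# True when the summary column carries no information beyond the two tests
is_redundant = bool((recomputed == sample["Average Test Score"]).all())
print(is_redundant)
```

When one column is an exact linear combination of others like this, keeping all of them gives the regression no unique way to assign the coefficients.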

### Small data cleaning step

In [15]:
df_reg = df.drop(['Internal Test 1 (out of 40)',
       'Internal Test 2 (out of 40)', 'Study Group'], axis=1)
df_reg.head(10)

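| | Attendance (%) | Assignment Score (out of 10) | Daily Study Hours | Final Exam Marks (out of 100) | Average Test Score |
|---|---|---|---|---|---|
| 0 | 84 | 7 | 3 | 72 | 33.0 |
| 1 | 91 | 6 | 3 | 56 | 31.0 |
| 2 | 73 | 7 | 3 | 56 | 27.5 |
| 3 | 80 | 7 | 3 | 74 | 35.5 |
| 4 | 84 | 8 | 3 | 66 | 34.0 |
| 5 | 100 | 7 | 3 | 79 | 34.0 |
| 6 | 96 | 8 | 3 | 83 | 38.0 |
| 7 | 83 | 7 | 3 | 77 | 38.0 |
| 8 | 91 | 8 | 2 | 71 | 33.5 |
| 9 | 87 | 8 | 3 | 61 | 32.0 |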


# Hypothesis 4

# Split Dataset into Train and Test

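In machine learning a dataset can be split into a training set and a test set.

The model needs data to be trained on. It uses the training data to learn the relationships between the features and how they relate to a target variable. The test set is held back so the model can be evaluated on data it has not seen.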
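A minimal sketch of such a split, using the train_test_split helper imported earlier and the ten rows displayed above. It makes three assumptions that the notebook has not yet fixed: Final Exam Marks is the target, 20% of rows are held out, and random_state is 42.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# First ten rows of the reduced dataset, as displayed earlier in this notebook
sample = pd.DataFrame({
    "Attendance (%)": [84, 91, 73, 80, 84, 100, 96, 83, 91, 87],
    "Assignment Score (out of 10)": [7, 6, 7, 7, 8, 7, 8, 7, 8, 8],
    "Daily Study Hours": [3, 3, 3, 3, 3, 3, 3, 3, 2, 3],
    "Final Exam Marks (out of 100)": [72, 56, 56, 74, 66, 79, 83, 77, 71, 61],
    "Average Test Score": [33.0, 31.0, 27.5, 35.5, 34.0, 34.0, 38.0, 38.0, 33.5, 32.0],
})

# Assumption: the final exam mark is the target variable
target = "Final Exam Marks (out of 100)"
X = sample.drop(columns=[target])
y = sample[target]

# Assumption: hold out 20% of rows; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

With ten rows and a 20% test size, eight rows go to training and two are held back for testing.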

---


# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid Python until a folder is created
except Exception as e:
  print(e)
