### **1. Problem Statement**

**Objective**
The goal is to develop a predictive model that accurately forecasts the performance index from predictors such as hours studied, previous scores, participation in extracurricular activities, sleep hours, and the number of sample question papers practiced.

**Dataset**
The dataset consists of `10,000` student records, each containing the following variables:
1. **Hours Studied:** The total number of hours spent studying by each student.
2. **Previous Scores:** The scores obtained by students in previous tests.
3. **Extracurricular Activities:** Whether the student participates in extracurricular activities.
4. **Sleep Hours:** The average number of hours of sleep the student had per day.
5. **Sample Question Papers Practiced:** The number of sample question papers the student practiced.

The target variable is:
* **Performance Index:** A measure of the overall performance of each student, ranging from `10` to `100`, with higher values indicating better performance.

**Methodology**
To address this problem, we will build a simple neural network in PyTorch that performs multiple linear regression.
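
As a rough sketch of what this means in PyTorch: a network with a single `nn.Linear` layer and no activation function computes $\hat{y} = Xw + b$, which is the multiple linear regression model. The class name `LinearRegressionModel` and the random input batch below are illustrative assumptions, not necessarily the model built later in this notebook.

```python
import torch
from torch import nn

class LinearRegressionModel(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, n_features: int):
        super().__init__()
        # One linear layer, no activation: y_hat = X @ w + b
        self.linear = nn.Linear(n_features, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)

torch.manual_seed(0)
model = LinearRegressionModel(n_features=5)  # five predictors, as listed above
X_demo = torch.randn(4, 5)                   # made-up batch of 4 students
y_hat = model(X_demo)                        # one prediction per student
n_params = sum(p.numel() for p in model.parameters())  # 5 weights + 1 bias
```

The layer holds exactly one weight per predictor plus a bias, the same parameters a classical multiple linear regression would estimate.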

**Tools**
- **NumPy:** A library for scientific computing, mainly involving linear algebra operations.
- **Pandas:** A library for data analysis and manipulation.
- **Matplotlib:** A library for plotting data.
- **PyTorch:** A deep learning library for building and training models with tensors and automatic differentiation.

---

In [2]:
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

print(f"PyTorch version: {torch.__version__}")

PyTorch version: 2.9.0+cpu


In [3]:
# Check if CUDA is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

if device == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("GPU not available, using CPU")

Using device: cpu
GPU not available, using CPU


### **2. Data Preprocessing**

In [None]:
# Read the data from the CSV file

df = pd.read_csv('Student_Performance.csv')
df

In [None]:
# Verifying the Datatypes and Checking for Null Values

df.info()

Before we can use the dataset for building our predictive model, it's essential to preprocess the data. This ensures that our model interprets the data correctly and efficiently. Here are the key preprocessing steps we'll undertake:

<br>

**1. Converting Categorical Data to Numerical Data**

Our dataset includes categorical data (like `Yes` or `No` responses). Machine learning models, however, work with numerical data. Thus, we will convert these responses into binary numerical values (`1` or `0`):
- `Yes` will be converted to `1`.
- `No` will be converted to `0`.
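
The mapping above can be sketched on a toy column as follows. The column name `Extracurricular Activities` is assumed to match the variable listed in the problem statement, and the values are made up. Restricting the mapping to one column with `Series.map` leaves every other column untouched.

```python
import pandas as pd

# Toy data standing in for the real column (values are made up)
toy = pd.DataFrame({"Extracurricular Activities": ["Yes", "No", "No", "Yes"]})

# Map only the categorical column: Yes -> 1, No -> 0
toy["Extracurricular Activities"] = toy["Extracurricular Activities"].map({"Yes": 1, "No": 0})

encoded = toy["Extracurricular Activities"].tolist()
```

Note that `map` turns any value missing from the dictionary into `NaN`, so a quick check for nulls afterwards can catch unexpected labels.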

<br>

**2. Normalizing Feature Values**

Feature normalization is crucial in data preprocessing for machine learning models, especially when features vary widely in scale and units.

For instance, in our dataset, `Previous Scores` typically ranges from `0` to `100`, representing the scores students obtained in previous tests. In contrast, `Sample Question Papers Practiced` typically ranges from `0` to `10`, reflecting how many practice papers a student has completed. If these features are used as-is, the model might inappropriately prioritize `Previous Scores` over `Sample Question Papers Practiced` simply because of its larger numerical range.

Normalization puts features on a comparable scale so that no feature dominates training merely because of its units, which typically also helps gradient-based training converge.

For this purpose we'll use **Z-score Normalization**, a common normalization method. It transforms each feature to have a mean (average) of 0 and a standard deviation of 1. This standardization process involves the following formula for each feature:

$$ X_{\text{norm}} = \frac{(X - \mu)}{\sigma} $$

Where:
- $ X $ is the original value of a feature.
- $ \mu $ is the mean of the feature.
- $ \sigma $ is the standard deviation of the feature.
- $ X_{\text{norm}} $ is the normalized value.

This method ensures that features with larger scales do not dominate those with smaller scales, leading to a more balanced and fair input to the model.
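
As a small worked example of the formula, the sketch below normalizes two made-up features with very different scales and checks that each column ends up with mean 0 and standard deviation 1. The numbers are illustrative only and are not taken from the dataset.

```python
import numpy as np

# Made-up values: column 0 resembles previous scores, column 1 papers practiced
X = np.array([[40.0, 1.0],
              [60.0, 3.0],
              [80.0, 5.0],
              [99.0, 9.0]])

mu = X.mean(axis=0)        # per-feature mean, shape (2,)
sigma = X.std(axis=0)      # per-feature population standard deviation, shape (2,)
X_norm = (X - mu) / sigma  # broadcasting applies the formula column by column

col_means = X_norm.mean(axis=0)  # approximately 0 for both columns
col_stds = X_norm.std(axis=0)    # approximately 1 for both columns
```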


In [None]:
# Replacing Yes/No features with 0 or 1

df = df.replace({'Yes':1, 'No':0})
df

In [None]:
# Normalizing Feature Values

def zscore_normalize_features(X):
    """
    Computes the z-score normalization of X, column by column.

    Args:
      X (ndarray (m,n) or pandas Series (m,)): input data, m examples, n features

    Returns:
      X_norm (same shape as X): input normalized by column

    Note: mu and sigma are computed internally and are not returned.
    """

    # Find the mean of each column/feature
    mu = np.mean(X, axis=0)                 # mu will have shape (n,)
    
    # Find the standard deviation of each column/feature
    sigma = np.std(X, axis=0)                  # sigma will have shape (n,)

    # Element-wise, subtract mu for that column from each example, divide by std for that column
    X_norm = (X - mu) / sigma

    return X_norm

In [None]:
# Normalize every feature column; the target 'Performance Index' is left unscaled
for column in df.columns:
    if column != 'Performance Index':
        df[column] = zscore_normalize_features(df[column])
df
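
The cell above normalizes each column in place and discards the mean and standard deviation. If the trained model is later applied to new students, their inputs need to be scaled with the same statistics. A minimal sketch of one way to keep them, assuming hypothetical helper names `zscore_fit` and `zscore_apply` and made-up values:

```python
import pandas as pd

def zscore_fit(X):
    """Return the per-column mean and population standard deviation of X."""
    return X.mean(axis=0), X.std(axis=0, ddof=0)

def zscore_apply(X, mu, sigma):
    """Normalize X using previously computed statistics."""
    return (X - mu) / sigma

# Made-up training features (not the real dataset)
train = pd.DataFrame({"Hours Studied": [2.0, 4.0, 6.0, 8.0],
                      "Sleep Hours": [5.0, 6.0, 7.0, 8.0]})
mu, sigma = zscore_fit(train)
train_norm = zscore_apply(train, mu, sigma)

# A new student is scaled with the training statistics, not with its own
new_student = pd.DataFrame({"Hours Studied": [5.0], "Sleep Hours": [6.5]})
new_norm = zscore_apply(new_student, mu, sigma)
```

Here the new student sits exactly at the training means, so both normalized values come out as 0.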

### **3. Multiple Linear Regression: A Machine Learning Model**