# Tutorial: Multiple Linear Regression for Predicting Test Scores

In this tutorial, we will use a dataset to predict students' test scores from their study hours, sleep hours, and attendance percentage using **multiple linear regression** (linear regression with more than one input feature). We will also enforce a rule: students with attendance below 60% are not eligible to take the exam.

## Objectives
- Understand the dataset and its features.
- Preprocess the data (handle missing values, check skewness).
- Build and train a linear regression model.
- Make predictions for new students while enforcing the attendance rule.
- Test the model with various input cases.

## Dataset Description
The dataset (`tutorial_dataset.csv`) contains the following columns:
- **StudyHours**: Number of hours a student studies per week.
- **SleepHours**: Number of hours a student sleeps per night.
- **Attendance**: Attendance percentage of the student.
- **TestScore**: The student's test score (target variable).

Let's start by loading and exploring the dataset.

In [None]:
import pandas as pd

# Load the dataset
data = pd.read_csv('/content/tutorial_dataset.csv')

# Display the first few rows
data.head()

## Step 1: Data Exploration

Before building a model, we need to understand the data. Let's check for missing values and the distribution of the features.

### Check for Missing Values
Missing values can affect model performance. Let's identify them.

In [None]:
# Check for missing values
data.isnull().sum()

The output shows:
- `StudyHours`: 50 missing values
- `SleepHours`: 50 missing values
- `Attendance`: 50 missing values
- `TestScore`: 158 missing values

Since our target variable (`TestScore`) has missing values, we will remove rows where `TestScore` is missing to ensure we have complete data for training. For the input features, we can impute missing values (e.g., with the mean).

### Check Skewness
Since the features (`StudyHours`, `SleepHours`, `Attendance`) are continuous, we check their skewness to understand their distribution. Skewness helps determine if we need to transform the data (e.g., log transformation for highly skewed data).

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Select input columns (exclude TestScore)
input_cols = data.drop(columns=['TestScore']).select_dtypes(include='number').columns

# Plot histograms and calculate skewness
for col in input_cols:
    plt.figure(figsize=(6, 4))
    sns.histplot(data[col].dropna(), kde=True)
    skew_val = data[col].skew()

    # Interpret skewness using common rule-of-thumb thresholds
    if abs(skew_val) < 0.5:
        skew_type = 'Approximately Symmetric'
    elif abs(skew_val) < 1:
        skew_type = 'Moderately Skewed'
    else:
        skew_type = 'Highly Skewed'

    plt.title(f'{col} – Skewness: {skew_val:.2f} ({skew_type})')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.grid(True)
    plt.tight_layout()
    plt.show()

Based on the skewness plots (not shown here but generated by the code), we can judge whether each feature is roughly symmetric or skewed. If any feature is highly skewed (skewness > 1 or < -1), we might consider a transformation. For this tutorial, we assume the skewness is acceptable: ordinary least squares does not require the input features to be normally distributed (its normality assumption concerns the residuals), although strongly skewed features can produce influential outliers.
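As a rough illustration of what such a transformation does, the sketch below applies `np.log1p` to a synthetic right-skewed sample. The data and the name `SkewedFeature` are made up for this example and do not come from `tutorial_dataset.csv`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (illustrative only, not the tutorial data)
rng = np.random.default_rng(0)
skewed = pd.Series(rng.exponential(scale=5.0, size=1000), name='SkewedFeature')

# log1p computes log(1 + x), so it is defined for zero values as well
transformed = np.log1p(skewed)

skew_before = skewed.skew()
skew_after = transformed.skew()
print(f'Skewness before: {skew_before:.2f}')
print(f'Skewness after log1p: {skew_after:.2f}')
```

A log transform only applies to non-negative values, and it changes how the fitted coefficient is interpreted, so it is worth using only when the skew is causing a problem.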

## Step 2: Data Preprocessing

To prepare the data for modeling:
1. Remove rows where `TestScore` is missing.
2. Impute missing values in `StudyHours`, `SleepHours`, and `Attendance` with their respective means.
3. Split the data into features (X) and target (y).

In [None]:
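# Remove rows where TestScore is missing
data = data.dropna(subset=['TestScore'])

# Impute missing values in input features with the column mean.
# Assign the result back instead of calling fillna(inplace=True) on a
# column selection, which may not modify `data` in recent pandas versions.
for col in ['StudyHours', 'SleepHours', 'Attendance']:
    data[col] = data[col].fillna(data[col].mean())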

# Verify no missing values remain
print(data.isnull().sum())

# Split features (X) and target (y)
X = data[['StudyHours', 'SleepHours', 'Attendance']]
y = data['TestScore']
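The mean imputation above uses every remaining row. If the model is later evaluated on a held-out split, as in Step 6, one alternative is to split first and let a scikit-learn pipeline learn the imputation means from the training rows only. The sketch below shows that pattern on synthetic stand-in data; the `demo` frame, its generating formula, and all `*_demo` names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the tutorial data (values are invented)
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    'StudyHours': rng.uniform(0, 10, n),
    'SleepHours': rng.uniform(4, 9, n),
    'Attendance': rng.uniform(40, 100, n),
})
demo['TestScore'] = (30 + 4 * demo['StudyHours'] + 1.5 * demo['SleepHours']
                     + 0.2 * demo['Attendance'] + rng.normal(0, 3, n))

# Blank out some feature values to mimic missing data
missing_rows = rng.choice(n, size=20, replace=False)
demo.loc[missing_rows, 'StudyHours'] = np.nan

X_demo = demo[['StudyHours', 'SleepHours', 'Attendance']]
y_demo = demo['TestScore']
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# The imputer learns its means from X_tr only, then reuses them on X_te
pipeline = make_pipeline(SimpleImputer(strategy='mean'), LinearRegression())
pipeline.fit(X_tr, y_tr)
demo_predictions = pipeline.predict(X_te)
print(demo_predictions[:3])
```

With this arrangement the test rows never influence the imputed values, which keeps the evaluation honest.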

## Step 3: Building the Linear Regression Model

Linear regression predicts a continuous target variable (TestScore) based on multiple input features (StudyHours, SleepHours, Attendance). The model assumes a linear relationship:

$$ y = \beta_0 + \beta_1 \cdot \text{StudyHours} + \beta_2 \cdot \text{SleepHours} + \beta_3 \cdot \text{Attendance} + \epsilon $$

Where:
- $y$: TestScore (target)
- $\beta_0$: Intercept
- $\beta_1, \beta_2, \beta_3$: Coefficients for each feature
- $\epsilon$: Error term
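To make the equation concrete, the sketch below evaluates it by hand for one student. The intercept and coefficient values are hypothetical numbers picked for easy arithmetic; they are not fitted from the dataset.

```python
import numpy as np

# Hypothetical parameter values chosen for illustration (not fitted)
intercept = 20.0                            # beta_0
coefficients = np.array([5.0, 2.0, 0.3])    # beta_1, beta_2, beta_3

# One student: StudyHours, SleepHours, Attendance
student = np.array([5.0, 7.0, 85.0])

# y_hat = beta_0 + beta_1*StudyHours + beta_2*SleepHours + beta_3*Attendance
# With these numbers: 20 + 5*5 + 2*7 + 0.3*85 = 84.5
predicted = intercept + coefficients @ student
print(f'Predicted score: {predicted:.2f}')
```

The error term $\epsilon$ does not appear in the calculation because a prediction uses only the systematic part of the model.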

We will use `scikit-learn` to train the model.

In [None]:
from sklearn.linear_model import LinearRegression

# Initialize and train the model
model = LinearRegression()
model.fit(X, y)

# Print model coefficients
print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)

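Each coefficient estimates how much the predicted test score changes when that feature increases by one unit while the other features are held constant. For example, a positive coefficient for `StudyHours` means that one additional study hour per week is associated with a higher predicted test score.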
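`model.coef_` is a plain array, so it helps to pair each value with its feature name. The sketch below does this on synthetic data; the generating coefficients (4.0, 1.0, 0.3) and the `toy_*` names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship (illustrative only)
rng = np.random.default_rng(1)
n = 300
toy_X = pd.DataFrame({
    'StudyHours': rng.uniform(0, 10, n),
    'SleepHours': rng.uniform(4, 9, n),
    'Attendance': rng.uniform(40, 100, n),
})
toy_y = (25 + 4.0 * toy_X['StudyHours'] + 1.0 * toy_X['SleepHours']
         + 0.3 * toy_X['Attendance'] + rng.normal(0, 2, n))

toy_model = LinearRegression().fit(toy_X, toy_y)

# Label each coefficient with the feature it belongs to
coef_table = pd.Series(toy_model.coef_, index=toy_X.columns, name='coefficient')
print(coef_table)
print(f'Intercept: {toy_model.intercept_:.2f}')
```

Because the features use different units (hours versus percentage points), the raw magnitudes of the coefficients are not directly comparable with one another.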

## Step 4: Making Predictions

We will create a function to predict test scores for new students. The function will:
- Validate inputs (ensure they are numeric).
- Check if attendance is ≥ 60% (eligibility rule).
- Use the trained model to predict the test score.

In [None]:
def predict_test_score(study_hours, sleep_hours, attendance):
    try:
        # Convert inputs to float
        study_hours = float(study_hours)
        sleep_hours = float(sleep_hours)
        attendance = float(attendance)

        # Check eligibility
        if attendance < 60:
            return 'Student is not eligible for the exam (attendance below 60%).'
        
        # Create DataFrame for prediction
        student_df = pd.DataFrame([[study_hours, sleep_hours, attendance]],
                                  columns=['StudyHours', 'SleepHours', 'Attendance'])
        
        # Predict score
        predicted_score = model.predict(student_df)[0]
        return f'Predicted Test Score: {predicted_score:.2f}'
    
    except (TypeError, ValueError):
        # float() raises ValueError for text like 'five' and TypeError for None
        return 'Invalid input. Please enter numeric values.'
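Converting with `float()` alone still accepts values such as `'nan'`, negative hours, or attendance above 100. A stricter check could look like the sketch below. The helper `validate_student_inputs` and its range limits are hypothetical additions rather than part of the tutorial's function.

```python
import math

def validate_student_inputs(study_hours, sleep_hours, attendance):
    """Hypothetical helper: return three floats or raise ValueError."""
    try:
        values = [float(study_hours), float(sleep_hours), float(attendance)]
    except (TypeError, ValueError):
        raise ValueError('All inputs must be numeric.') from None

    # float('nan') and float('inf') convert without error, so reject them here
    if not all(math.isfinite(v) for v in values):
        raise ValueError('Inputs must be finite numbers.')

    study, sleep, attend = values
    # Assumed plausibility limits; adjust them to the real data
    if study < 0 or sleep < 0:
        raise ValueError('Hours cannot be negative.')
    if not 0 <= attend <= 100:
        raise ValueError('Attendance must be between 0 and 100.')
    return study, sleep, attend

print(validate_student_inputs('5', 7, 85))
```

Rejecting NaN matters here because `nan < 60` evaluates to `False`, so a NaN attendance would otherwise pass the eligibility check.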

## Step 5: Testing the Model

Let's test the model with the provided test cases to verify its behavior.

### Test Case 1: Valid Input (Eligible)
- Study Hours: 5
- Sleep Hours: 7
- Attendance: 85

In [None]:
print(predict_test_score(5, 7, 85))

### Test Case 2: Ineligible Student (Attendance < 60%)
- Study Hours: 4
- Sleep Hours: 6.5
- Attendance: 55

In [None]:
print(predict_test_score(4, 6.5, 55))

### Test Case 3: Invalid Input (Non-Numeric)
- Study Hours: 'five'
- Sleep Hours: 'seven'
- Attendance: 'eighty'

In [None]:
print(predict_test_score('five', 'seven', 'eighty'))

### Test Case 4: Edge Case (Just Eligible)
- Study Hours: 3
- Sleep Hours: 6
- Attendance: 60

In [None]:
print(predict_test_score(3, 6, 60))

## Step 6: Model Evaluation

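To evaluate the model's performance, we split the data into training and testing sets and calculate the Mean Squared Error (MSE) and R² score on the test set. Note that the cell below refits `model` on the training split only, so any later call to `predict_test_score` uses this refit model.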

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model on training data
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on test data
y_pred = model.predict(X_test)

# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse:.2f}')
print(f'R² Score: {r2:.2f}')

- **Mean Squared Error (MSE)**: Measures the average squared difference between predicted and actual test scores. Lower values indicate better performance.
- **R² Score**: Indicates how much of the variance in the target variable is explained by the model. Values closer to 1 indicate a better fit.
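To show what these two metrics compute, the sketch below works them out by hand on four made-up score pairs and compares the results with scikit-learn. The numbers are invented for the arithmetic and are not model output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted scores (illustrative only)
actual_scores = np.array([70.0, 80.0, 90.0, 60.0])
predicted_scores = np.array([72.0, 78.0, 85.0, 65.0])

residuals = actual_scores - predicted_scores           # [-2, 2, 5, -5]

# MSE: mean of squared residuals = (4 + 4 + 25 + 25) / 4 = 14.5
mse_manual = np.mean(residuals ** 2)

# R²: 1 - SS_res / SS_tot, where SS_tot uses the mean of the actual scores (75)
ss_res = np.sum(residuals ** 2)                                   # 58
ss_tot = np.sum((actual_scores - actual_scores.mean()) ** 2)      # 500
r2_manual = 1 - ss_res / ss_tot                                   # 0.884

# RMSE is in the same units as the test score
rmse_manual = np.sqrt(mse_manual)

print(f'MSE:  {mse_manual:.2f}')
print(f'RMSE: {rmse_manual:.2f}')
print(f'R²:   {r2_manual:.3f}')
print(np.isclose(mse_manual, mean_squared_error(actual_scores, predicted_scores)))
print(np.isclose(r2_manual, r2_score(actual_scores, predicted_scores)))
```

Since MSE is expressed in squared score units, its square root (RMSE) is often easier to read as a typical prediction error.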

## Conclusion

In this tutorial, we:
- Explored the dataset and handled missing values.
- Checked the skewness of features to understand their distribution.
- Built a linear regression model to predict test scores based on study hours, sleep hours, and attendance.
- Implemented a prediction function with eligibility checks (attendance ≥ 60%).
- Tested the prediction function with valid, ineligible, invalid, and edge-case inputs to verify its behavior.
- Evaluated the model using MSE and R² metrics.

This approach demonstrates how to apply multiple linear regression to a real-world dataset while incorporating business rules (e.g., attendance eligibility).