<a href="https://colab.research.google.com/github/ryakkalauncc/3162project3/blob/main/ITCS3162Project3Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

***Introduction:***
The dataset I'm using in my analysis is the Exploring Student Achievement Trends dataset from Kaggle, which contains the records of 1,000 students. Each record includes socioeconomic, demographic, and educational factors, along with the student's math, reading, and writing test scores. The goal is to predict student academic performance, specifically math scores, using demographic and educational factors.


# What is regression and how does it work?
Regression is a machine learning technique used to predict a continuous target variable. In this context, the goal is to predict students' math scores based on their reading and writing scores, demographics, and educational factors.
Linear regression models a linear relationship between the input variables (features) and the output (target): the prediction is a weighted sum of the features plus an intercept, with the weights chosen to minimize the squared error on the training data.
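
To make the idea of a weighted sum plus an intercept concrete, here is a minimal sketch that fits a linear model by least squares with NumPy. The data is synthetic and hypothetical (it is not the student dataset), and the names `X_design`, `weights`, and `intercept` are chosen only for illustration.

```python
import numpy as np

# Synthetic, noise-free data (hypothetical): y = 2*x1 + 3*x2 + 5
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0], [5.0, 5.0]])
y = 2.0 * X[:, 0] + 3.0 * X[:, 1] + 5.0

# Prepend a column of ones so the first coefficient acts as the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Least squares: find coefficients minimizing the squared prediction error
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, weights = coef[0], coef[1:]

# Predict for a new point: intercept + weighted sum of its features
new_point = np.array([6.0, 4.0])
prediction = intercept + weights @ new_point
print("intercept:", intercept, "weights:", weights, "prediction:", prediction)
```

Because the synthetic data has no noise, the fit should recover the weights and intercept used to generate it. On real data the fit is only approximate, which is why the experiments below report an error metric.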

**Experiments**
1. Baseline Linear Regression
2. Linear Regression
3. Ridge Regression


**Experiment 1**

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

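# Load the data and use only the reading and writing scores as features
df = pd.read_csv("StudentsPerformance.csv")
X = df[["reading score","writing score"]]
y = df["math score"]
# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

# Fit ordinary least squares and evaluate on the held-out test set
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Experiment 1: Baseline Linear Regression RMSE", rmse)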

Experiment 1: Baseline Linear Regression RMSE 8.788798451027851


Findings: The RMSE was 8.79, indicating that the model's predictions were typically about 8.8 points away from the actual math scores. This model demonstrates that while reading and writing scores are correlated with math achievement, additional context from other features is needed to improve accuracy.
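
For readers unfamiliar with the metric, the sketch below computes RMSE by hand on a few made-up scores (hypothetical values, not taken from the dataset). The helper name `rmse_by_hand` is an illustrative assumption. It is meant to mirror the `np.sqrt(mean_squared_error(...))` call used in the experiments.

```python
import numpy as np

def rmse_by_hand(y_true, y_pred):
    """Root mean squared error: square the errors, average them, take the square root."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical actual and predicted math scores
actual = [70, 85, 60, 90]
predicted = [75, 80, 70, 90]

# Errors are -5, 5, -10, 0; squared errors are 25, 25, 100, 0; their mean is 37.5
print(rmse_by_hand(actual, predicted))  # sqrt(37.5), roughly 6.12
```

Because the errors are squared before averaging, one large miss (the 10-point error here) pulls RMSE up more than several small ones would.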

**Experiment 2**

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

df = pd.read_csv("StudentsPerformance.csv")
# Engineer a single feature that averages the reading and writing scores
df["average_score"] = (df["reading score"] + df["writing score"])/2
# Use every column except the target as a feature
X = df.drop(columns = ["math score"])
y = df["math score"]
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.20,random_state=42)

categorical_columns = [
    "gender",
    "race/ethnicity",
    "parental level of education",
    "lunch",
    "test preparation course"
]
numeric_columns = ["reading score", "writing score", "average_score"]
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(drop = "first"), categorical_columns),
        ("num", StandardScaler(), numeric_columns)
    ],
    remainder = "drop"
)
model = Pipeline(steps = [("preprocessor", preprocessor), ("regressor", LinearRegression())])

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Experiment 2: Linear Regression RMSE", rmse)

Experiment 2: Linear Regression RMSE 5.393993869732843


Findings: A new variable, average_score, was created to combine the reading and writing scores into one metric. Categorical variables were introduced through one-hot encoding, and all numeric features were standardized. The RMSE dropped from 8.79 to 5.39, which demonstrates that incorporating categorical and contextual features provides more information for the model to learn from. Standardizing numeric inputs helps stabilize regression coefficients. Overall, the experiment shows how feature inclusion and preprocessing improve model performance.
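
To show what the preprocessing step does to the data, here is a small sketch that applies the same two transformers to a toy table. The rows are hypothetical values; only the column names `lunch` and `reading score` come from the dataset. The name `toy_preprocessor` is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows (hypothetical values) using two of the dataset's column names
toy = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard", "standard"],
    "reading score": [72.0, 90.0, 95.0, 57.0],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        # drop="first" keeps k-1 indicator columns for a k-category feature
        ("cat", OneHotEncoder(drop="first"), ["lunch"]),
        # StandardScaler rescales to mean 0 and standard deviation 1
        ("num", StandardScaler(), ["reading score"]),
    ]
)

transformed = toy_preprocessor.fit_transform(toy)
# The result may be sparse depending on density; convert to a dense array if so
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()

print(transformed)
```

With two lunch categories, `drop="first"` leaves a single indicator column (1 for "standard", 0 for "free/reduced"), which avoids a redundant column that would be perfectly predictable from the other. The second column holds the reading scores after centering and scaling.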

**Experiment 3**

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
import numpy as np

df = pd.read_csv("StudentsPerformance.csv")

In [4]:
df["average_score"] = (df["reading score"] + df["writing score"])/2
X = df.drop(columns = ["math score"])
y = df["math score"]

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.20,random_state=42)

categorical_columns = [
    "gender",
    "race/ethnicity",
    "parental level of education",
    "lunch",
    "test preparation course"
]
numeric_columns = ["reading score", "writing score", "average_score"]

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(drop = "first"), categorical_columns),
        ("num", StandardScaler(), numeric_columns)
    ],
    remainder = "drop"
)

# Same preprocessing as Experiment 2, but with an L2-regularized regressor (alpha sets the penalty strength)
ridge_pipeline = Pipeline(steps = [("preprocessor", preprocessor), ("model", Ridge(alpha=1.0))])
ridge_pipeline.fit(X_train, y_train)
y_pred = ridge_pipeline.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Experiment 3: Ridge Regression RMSE" , rmse)

Experiment 3: Ridge Regression RMSE 5.393451465682586


Findings: A ridge regression model was used instead of ordinary linear regression. Ridge applies L2 regularization, which penalizes large coefficient values to prevent overfitting. The ridge model achieved an RMSE of 5.39, essentially the same as Experiment 2 (5.3935 versus 5.3940). This suggests that the Experiment 2 model was not overfitting, so regularization offered only a marginal improvement. Even so, ridge regression adds robustness and stability to the model, making it less sensitive to variations in the data.
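
The effect of the L2 penalty can be illustrated with a short NumPy sketch of the ridge closed-form solution. It uses synthetic, nearly collinear features (hypothetical data, not the student dataset), and `ridge_coefficients` is an illustrative helper rather than part of scikit-learn. It assumes the intercept is left unpenalized, which is handled here by centering the data.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y on centered data (intercept not penalized)."""
    X_centered = X - X.mean(axis=0)
    y_centered = y - y.mean()
    n_features = X.shape[1]
    gram = X_centered.T @ X_centered + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X_centered.T @ y_centered)

# Synthetic data: the second feature is almost a copy of the first
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 0.01 * rng.normal(size=50)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=50)

# alpha = 0 is ordinary least squares; larger alpha shrinks the coefficients
norms = []
for alpha in (0.0, 1.0, 100.0):
    w = ridge_coefficients(X, y, alpha)
    norms.append(float(np.linalg.norm(w)))
    print(f"alpha={alpha:>5}: coefficients={w}, norm={norms[-1]:.3f}")
```

The coefficient norm shrinks as `alpha` grows, which is the stabilizing behavior described above. When the unregularized coefficients are already small and well determined, a modest `alpha` changes the predictions very little.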

**Impact:**
The impact of this project can extend into both educational and ethical policy debates. By building models that predict student performance, policymakers, school staff, parents, and even students can more effectively identify who may need additional academic support. This can also lead to better resource allocation and promote a more inclusive education system. However, ethical concerns can arise from using factors such as race/ethnicity or parental education, because models can reinforce existing biases. To address such concerns, models should be used as informational resources and not to make high-stakes decisions. I believe this project can shed light on where educational resources can be improved.

**Conclusion:**
Through this project, I explored how different levels of model complexity and preprocessing affect the accuracy of predicting student math scores based on demographic, socioeconomic, and educational factors. The first experiment was a baseline linear regression model that used only the numeric reading and writing scores. This provided a simple benchmark for performance (RMSE 8.79), showing how these scores correlate with math performance. In the second experiment, I expanded the model to include all categorical and numeric features, such as gender, race/ethnicity, parental education level, lunch type, and test preparation course. After applying one-hot encoding and feature scaling, the model's performance improved (RMSE 5.39), demonstrating that background and demographic variables provide additional explanatory power. The third experiment leveraged ridge regression, which adds an L2 regularization term to prevent overfitting. While this model may not always reduce RMSE, it generally stabilizes the model's weights and can improve generalization to unseen data. Here the improvement was very slight (RMSE 5.3935 versus 5.3940), which is consistent with the second model not overfitting, while regularization still adds robustness.

**References**
I leveraged Chat-GPT 5 and Grammarly to improve grammar and sentence structure throughout the written sections of this report.