# Problem Statement
Investigate factors affecting student scores and build a model to predict math scores.

# 1. Load and Clean the Dataset

First, we load the dataset and inspect it for missing values and data types.

In [None]:
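import pandas as pd

# Local path to the CSV; adjust for your machine
csv_path = r"C:\Users\telug\Downloads\StudentsPerformance.csv"
data = pd.read_csv(csv_path)

# Preview the first rows
print(data.head())

# Count missing values per column
print(data.isnull().sum())

# info() prints its summary directly and returns None, so it needs no print()
data.info()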


# Observations:

* The dataset has columns like gender, race/ethnicity, parental level of education, lunch, test preparation course, and scores (math score, reading score, writing score).

* There are no missing values, so cleaning is minimal.

* Categorical variables need encoding for modeling.
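Before encoding, it can help to confirm which columns are categorical and how many distinct values each one holds. The following is a minimal sketch: `summarize_categoricals` is a hypothetical helper, and `toy` is a small made-up frame standing in for the real CSV.

```python
import pandas as pd


def summarize_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per non-numeric column with its distinct and missing counts."""
    cat_cols = df.select_dtypes(exclude="number").columns
    return pd.DataFrame({
        "n_unique": df[cat_cols].nunique(),
        "n_missing": df[cat_cols].isnull().sum(),
    })


# Made-up rows for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "lunch": ["standard", "free/reduced", "standard"],
    "math score": [72, 47, 90],
})

summary = summarize_categoricals(toy)
print(summary)
```

On the real data, calling `summarize_categoricals(data)` should list the five categorical columns named above.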

# 2. Exploratory Data Analysis (EDA)

We want to understand the relationship between categorical variables and math scores.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# One box plot of math score per categorical factor of interest
for column in ['gender', 'lunch', 'test preparation course']:
    sns.boxplot(x=column, y='math score', data=data)
    plt.title(f"Math score by {column}")
    plt.show()


# Insights to look for:

* Does gender affect math scores?

* Does having a standard or free/reduced lunch matter?

* Does completing the test preparation course improve scores?
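The questions above can also be answered numerically by comparing group statistics alongside the box plots. This is a sketch on a made-up frame (`toy`), using the same column names as the dataset; the values are invented for illustration.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "test preparation course": ["none", "completed", "none", "completed"],
    "math score": [60, 70, 50, 80],
})

# Mean, median and group size of math score for each category
group_stats = (
    toy.groupby("test preparation course")["math score"]
    .agg(["mean", "median", "count"])
)
print(group_stats)
```

On the real data, replacing `toy` with `data` and swapping the grouping column for `gender` or `lunch` gives the same summary for each factor.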

# 3. Encode Categorical Variables

Linear Regression requires numeric inputs, so we one-hot encode the categorical columns.

In [None]:
categorical_cols = ['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course']
data_encoded = pd.get_dummies(data, columns=categorical_cols, drop_first=True)

print(data_encoded.head())
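The effect of `drop_first=True` is easiest to see on a single column. With two categories, full one-hot encoding yields two columns that always sum to 1, so one of them is redundant next to the regression intercept; `drop_first=True` keeps only one. The sketch below uses a made-up `toy` frame.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({"lunch": ["standard", "free/reduced", "standard"]})

# Full one-hot encoding: one column per category
full = pd.get_dummies(toy, columns=["lunch"])

# drop_first=True drops the first category, which becomes the reference level
reduced = pd.get_dummies(toy, columns=["lunch"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

The dropped category acts as the baseline: each remaining coefficient is read as the difference from that reference level.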


# 4. Train a Linear Regression Model

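We predict math score from the encoded demographic features only. Reading and writing scores are dropped from the inputs, so the model does not rely on other test results.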

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

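# Features: encoded categorical columns only (all three score columns are dropped)
X = data_encoded.drop(['math score', 'reading score', 'writing score'], axis=1)
# Target: math score
y = data_encoded['math score']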

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
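Once a linear model is fitted, its coefficients show how each dummy variable shifts the predicted score relative to the dropped reference category. The sketch below fits a model on synthetic data with known effects (+10 and -5, chosen arbitrarily) to show how to read them; `X_toy`, `y_toy` and `toy_model` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic 0/1 features named like the encoded columns (values are made up)
X_toy = pd.DataFrame({
    "lunch_standard": rng.integers(0, 2, size=200),
    "test preparation course_none": rng.integers(0, 2, size=200),
})

# Synthetic target: +10 points for standard lunch, -5 for no prep course, plus noise
y_toy = (
    60
    + 10 * X_toy["lunch_standard"]
    - 5 * X_toy["test preparation course_none"]
    + rng.normal(0, 1, size=200)
)

toy_model = LinearRegression().fit(X_toy, y_toy)

# Pair each coefficient with its column name and sort from most negative to most positive
coefs = pd.Series(toy_model.coef_, index=X_toy.columns).sort_values()
print(coefs)
```

For the notebook's model, the equivalent would be `pd.Series(model.coef_, index=X_train.columns)`.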


# 5. Evaluate the Model

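We use mean absolute error (MAE) and root mean squared error (RMSE) on the test set. Both are expressed in score points, and lower is better.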

In [None]:
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
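MAE and RMSE are easier to judge next to a naive baseline that always predicts the training-set mean. The sketch below is one way to set up that comparison, with R² added as a third metric; `regression_report` is a hypothetical helper and all numbers are made up.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, RMSE and R^2 for one set of predictions."""
    return {
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "r2": r2_score(y_true, y_pred),
    }


# Made-up numbers for illustration only
y_train_toy = np.array([55.0, 65.0, 75.0])
y_test_toy = np.array([50.0, 60.0, 70.0, 80.0])
y_pred_toy = np.array([55.0, 60.0, 65.0, 80.0])

# Baseline: always predict the training-set mean, ignoring the features
baseline = DummyRegressor(strategy="mean").fit(np.zeros((3, 1)), y_train_toy)
y_base_toy = baseline.predict(np.zeros((4, 1)))

model_report = regression_report(y_test_toy, y_pred_toy)
baseline_report = regression_report(y_test_toy, y_base_toy)
print("model:   ", model_report)
print("baseline:", baseline_report)
```

If the model's errors are not clearly lower than the baseline's, the features carry little predictive signal for math score.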


# 6. Insights and Conclusion

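* Test preparation course and lunch type tend to show a clear difference in math scores in the box plots. Confirm the size of each effect from the plots and the model coefficients rather than assuming significance, since no statistical test is run here.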

* Gender might have a minor effect depending on the data distribution.

* Linear Regression provides a baseline; more advanced models (Random Forest, XGBoost) may improve prediction accuracy and should be compared against it on the same data split.
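A comparison like the one suggested in the last point could be set up with cross-validation, as sketched below. The data here is synthetic and the hyperparameters (`n_estimators=50`, 5 folds) are arbitrary assumptions; on the real data, `X` and `y` from section 4 would take the place of `X_toy` and `y_toy`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Synthetic 0/1 features and a noisy linear target (made up for illustration)
X_toy = rng.integers(0, 2, size=(300, 4)).astype(float)
y_toy = 60 + X_toy @ np.array([8.0, -5.0, 3.0, 0.0]) + rng.normal(0, 5, size=300)

candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# 5-fold cross-validated MAE per model; scikit-learn returns the negated error
cv_mae = {}
for name, estimator in candidates.items():
    scores = cross_val_score(
        estimator, X_toy, y_toy, cv=5, scoring="neg_mean_absolute_error"
    )
    cv_mae[name] = -scores.mean()

print(cv_mae)
```

With only a handful of binary features, a more flexible model is not guaranteed to beat the linear baseline, so the comparison is worth running before drawing a conclusion.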