# Students Performance in Exams: Predicting Total and Average Scores

## About Dataset

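The dataset records the math, reading, and writing scores of 1,000 students, together with demographic and background attributes. The data is generated rather than collected from real students, but it still provides an opportunity to explore factors that influence student performance.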

- **Source**: [Generated Data - Exams](http://roycekimmons.com/tools/generated_data/exams)
- **Columns**:
  - Gender
  - Race/Ethnicity
  - Parental Level of Education
  - Lunch
  - Test Preparation Course
  - Math Score
  - Reading Score
  - Writing Score

## Objectives

The primary goals of this analysis are to:

1. **Explore Factors Influencing Performance**: Understand the influence of variables such as parental background, test preparation, and other factors on students' scores.

2. **Predict Total and Average Scores**: Build predictive models to estimate total and average scores based on available features.

## Analysis Steps

### 1. Data Exploration

- **Overview**: Examine the distribution of scores and explore the relationships between variables.
- **Descriptive Statistics**: Calculate summary statistics for key variables.
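
As a rough illustration of the descriptive-statistics step, the sketch below summarizes the score columns and compares group means. The five-row `sample` frame is hypothetical and only stands in for the real data; the column names follow the renaming applied later in this notebook.

```python
import pandas as pd

# Hypothetical five-row sample; the real dataset has 1,000 rows.
sample = pd.DataFrame({
    "Test_Preparation": ["none", "completed", "none", "none", "completed"],
    "Math_Score": [72, 69, 90, 47, 80],
    "Reading_Score": [72, 90, 95, 57, 85],
    "Writing_Score": [74, 88, 93, 44, 90],
})

score_columns = ["Math_Score", "Reading_Score", "Writing_Score"]

# Overall summary statistics (count, mean, std, quartiles) for the scores.
overall_summary = sample[score_columns].describe()
print(overall_summary)

# Mean score per subject, split by test-preparation status.
group_means = sample.groupby("Test_Preparation")[score_columns].mean()
print(group_means)
```

Comparing group means in this way gives a first hint of which categorical variables may be related to the scores, before any model is fitted.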

### 2. Data Preprocessing

- **Handling Missing Values**: Check for missing data and apply appropriate strategies.
- **Categorical Encoding**: Convert categorical variables into a format suitable for modeling.

### 3. Feature Engineering

- **Create New Features**: Explore the possibility of creating new features that might enhance predictive power.
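
Since the stated goal is to predict total and average scores, one plausible feature-engineering step is to derive those two targets from the three subject scores. The sketch below assumes the renamed score columns used in this notebook; the names `Total_Score` and `Average_Score` and the three-row `sample` frame are hypothetical.

```python
import pandas as pd

# Hypothetical three-row sample with the renamed score columns.
sample = pd.DataFrame({
    "Math_Score": [72, 69, 90],
    "Reading_Score": [72, 90, 95],
    "Writing_Score": [74, 88, 93],
})

score_columns = ["Math_Score", "Reading_Score", "Writing_Score"]

# Row-wise sum and mean of the three subject scores.
sample["Total_Score"] = sample[score_columns].sum(axis=1)
sample["Average_Score"] = sample[score_columns].mean(axis=1)

print(sample)
```

Because both targets are computed directly from the subject scores, using those scores as model inputs would make the prediction trivial; a model built on the categorical attributes alone is likely the more informative choice.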

### 4. Exploratory Data Analysis (EDA)

- **Visualizations**: Utilize visualizations to gain insights into patterns and correlations.

### 5. Model Building

- **Linear Regression Models**: Develop models to predict total and average scores.
- **Model Evaluation**: Assess model performance using relevant metrics.
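
The model-building and evaluation steps can be sketched as follows. This is a minimal example on synthetic data, not the notebook's actual model: the feature names `Test_Preparation_completed` and `Lunch_standard`, the effect sizes, and the noise level are all assumptions made for illustration, and MAE and R² are just two commonly used regression metrics.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for encoded features (assumed names and effects).
rng = np.random.default_rng(0)
n_rows = 200
features = pd.DataFrame({
    "Test_Preparation_completed": rng.integers(0, 2, size=n_rows),
    "Lunch_standard": rng.integers(0, 2, size=n_rows),
})

# Synthetic total score: a baseline plus an effect per feature plus noise.
total_score = (
    180
    + 20 * features["Test_Preparation_completed"]
    + 25 * features["Lunch_standard"]
    + rng.normal(0, 5, size=n_rows)
)

# Hold out 20% of the rows to evaluate on data the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    features, total_score, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Mean absolute error (in score points) and coefficient of determination.
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"MAE: {mae:.2f}, R^2: {r2:.3f}")
```

Evaluating on a held-out split rather than on the training rows gives a more honest estimate of how the model would perform on new students.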

### 6. Interpretation and Conclusions

- **Feature Importance**: Identify key factors influencing scores.
- **Insights**: Draw conclusions and insights from the analysis.
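
For a linear regression, one simple way to gauge feature importance is to inspect the fitted coefficients. The sketch below uses a tiny noise-free toy dataset with assumed feature names and effects, so the coefficients can be recovered exactly; on real data they would only be estimates, and their sizes are comparable only when the features share a scale.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy binary features (assumed names) and a noise-free target.
features = pd.DataFrame({
    "Test_Preparation_completed": [0, 1, 0, 1, 0, 1],
    "Lunch_standard": [0, 0, 1, 1, 1, 0],
})
total_score = (
    180
    + 20 * features["Test_Preparation_completed"]
    + 25 * features["Lunch_standard"]
)

model = LinearRegression().fit(features, total_score)

# Pair each coefficient with its feature name and rank by absolute size.
coefficients = pd.Series(model.coef_, index=features.columns)
coefficients = coefficients.sort_values(key=abs, ascending=False)
print(coefficients)
```

Each coefficient is the expected change in the target when that feature goes from 0 to 1 with the other features held fixed.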

## Conclusion

This project aims to provide a comprehensive analysis of students' performance in exams, considering various factors that may contribute to their scores. By predicting total and average scores, we can gain valuable insights into the dynamics of academic achievement.



#### Importing Necessary Dependencies

In [2]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

## 1. Data Exploration

### 1.1 Load the dataset into a Pandas DataFrame


In [3]:
df = pd.read_csv("Dataset/StudentsPerformance.csv")

In [5]:
df.head(5)  # Display the first 5 rows for inspection

|   | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |


In [14]:
df.describe()  # Summary statistics for the numerical columns

|   | Math_Score | Reading_Score | Writing_Score |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 66.089 | 69.169 | 68.054 |
| std | 15.16308 | 14.600192 | 15.195657 |
| min | 0.0 | 17.0 | 10.0 |
| 25% | 57.0 | 59.0 | 57.75 |
| 50% | 66.0 | 70.0 | 69.0 |
| 75% | 77.0 | 79.0 | 79.0 |
| max | 100.0 | 100.0 | 100.0 |


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Gender            1000 non-null   object
 1   Ethnicity         1000 non-null   object
 2   Parent_Education  1000 non-null   object
 3   Lunch             1000 non-null   object
 4   Test_Preparation  1000 non-null   object
 5   Math_Score        1000 non-null   int64 
 6   Reading_Score     1000 non-null   int64 
 7   Writing_Score     1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


## 2. Data Preprocessing

### 2.1 Rename the headers

In [7]:
new_columns = {
    'gender': 'Gender',
    'race/ethnicity': 'Ethnicity',
    'parental level of education': 'Parent_Education',
    'lunch': 'Lunch',
    'test preparation course': 'Test_Preparation',
    'math score': 'Math_Score',
    'reading score': 'Reading_Score',
    'writing score': 'Writing_Score'
}

# Use the rename method to update the column names
df.rename(columns=new_columns, inplace=True)

# Display the updated DataFrame
df.head()

|   | Gender | Ethnicity | Parent_Education | Lunch | Test_Preparation | Math_Score | Reading_Score | Writing_Score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |


### 2.2 Identify and handle missing values

In [9]:
# Check for missing values in the DataFrame
missing_values = df.isnull().sum()

# Display the count of missing values for each column
print("Missing Values:")
print(missing_values)

Missing Values:
Gender              0
Ethnicity           0
Parent_Education    0
Lunch               0
Test_Preparation    0
Math_Score          0
Reading_Score       0
Writing_Score       0
dtype: int64


The dataset contains no missing values: every column has 1,000 non-null entries. No imputation or row removal is needed, which simplifies preprocessing and lets the analysis proceed directly to encoding the categorical variables.

### 2.3 Categorical Encoding

#### 2.3.1 Extracting categorical and numerical attribute headers from the DataFrame

In [25]:

categorical_attributes_headers = df.select_dtypes(include='object').columns.tolist()

# Displaying the headers of categorical attributes
print("Categorical Attributes Headers:")
print(categorical_attributes_headers)
print("---"*30)

# Extracting numerical attributes headers from the DataFrame
neumarical_attributes_headers = df.select_dtypes(include='int64').columns.tolist()

# Displaying the headers of numerical attributes
print("Numerical Attributes Headers:")
print(neumarical_attributes_headers)


Categorical Attributes Headers:
['Gender', 'Ethnicity', 'Parent_Education', 'Lunch', 'Test_Preparation']
------------------------------------------------------------------------------------------
Numerical Attributes Headers:
['Math_Score', 'Reading_Score', 'Writing_Score']


In [37]:
# Finding unique values in each categorical column 
for column in categorical_attributes_headers:
    unique_values = df[column].unique()
    print(f"Unique values in {column}:", unique_values)
    print()

Unique values in Gender: ['female' 'male']

Unique values in Ethnicity: ['group B' 'group C' 'group A' 'group D' 'group E']

Unique values in Parent_Education: ["bachelor's degree" 'some college' "master's degree" "associate's degree"
 'high school' 'some high school']

Unique values in Lunch: ['standard' 'free/reduced']

Unique values in Test_Preparation: ['none' 'completed']



##### Variable Types

###### Nominal Variables (a subtype of Categorical):
- **Ethnicity:** This variable represents the race or ethnicity of the students. It takes five unordered values, 'group A' through 'group E'.

###### Ordinal Variables (a subtype of Categorical):
- **Parent_Education:** This variable represents the parental level of education. It can be treated as ordinal because its levels have a natural order, from lowest to highest: "some high school", "high school", "some college", "associate's degree", "bachelor's degree", "master's degree".

###### Binary Variables (a subtype of Categorical):
- **Lunch:** This variable is binary, indicating whether a student receives a 'standard' or a 'free/reduced' lunch.
- **Test_Preparation:** Another binary variable, indicating whether a student 'completed' a test preparation course or took 'none'.
- **Gender:** This variable represents the gender of the students and takes the values 'female' or 'male'.

###### Numerical Variables:
- **Math_Score, Reading_Score, Writing_Score:** These variables are integers representing the scores obtained by students in the corresponding subjects; the observed values lie between 0 and 100.
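
One way to turn these variable types into model-ready columns is sketched below: an integer rank for the ordinal variable, 0/1 flags for the binary variables, and one-hot columns for the nominal variable. The three-row `sample` frame is hypothetical, and the education ordering in `education_order` is an assumption about how the levels rank, not something the dataset itself specifies.

```python
import pandas as pd

# Hypothetical three-row sample of the categorical columns.
sample = pd.DataFrame({
    "Gender": ["female", "male", "female"],
    "Ethnicity": ["group B", "group A", "group C"],
    "Parent_Education": ["bachelor's degree", "some high school", "some college"],
    "Lunch": ["standard", "free/reduced", "standard"],
    "Test_Preparation": ["none", "completed", "none"],
})

# Assumed ordering of education levels, lowest to highest.
education_order = [
    "some high school", "high school", "some college",
    "associate's degree", "bachelor's degree", "master's degree",
]
education_rank = {level: rank for rank, level in enumerate(education_order)}

encoded = sample.copy()

# Ordinal variable: replace each level with its integer rank.
encoded["Parent_Education"] = encoded["Parent_Education"].map(education_rank)

# Binary variables: encode one of the two values as 1, the other as 0.
encoded["Gender"] = (encoded["Gender"] == "male").astype(int)
encoded["Lunch"] = (encoded["Lunch"] == "standard").astype(int)
encoded["Test_Preparation"] = (encoded["Test_Preparation"] == "completed").astype(int)

# Nominal variable: one-hot encode, since the groups have no order.
encoded = pd.get_dummies(encoded, columns=["Ethnicity"], prefix="Ethnicity", dtype=int)

print(encoded)
```

For a linear model on the full dataset, passing `drop_first=True` to `pd.get_dummies` would avoid perfectly collinear one-hot columns; it is omitted here to keep every group visible in the output.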